loop-engineering-skill

Why I built it

Loop Engineering blew up this year, but most people treat a “loop” as a universal accelerator. I build agent products by day and do agent-evaluation research at Tongji by night, and the more I look, the surer I am of one thing: a loop is only as good as its verifier — not as good as the model is strong.

A loop without a reliable judge is just a machine for producing garbage faster and more confidently — and the stronger the model, the faster it piles up errors. Most loops fail at “this task shouldn’t have been looped” and “the verifier isn’t faithful,” not in the loop body itself. So I crystallized this judgment into a Claude Code skill, so the AI is forced to answer “should this even be a loop?” before it helps you build one.

First, what it isn’t: the built-in /loop

Claude Code ships a bundled /loop that executes a loop — it re-runs a prompt at a fixed or adaptive interval until you stop it. This skill is the other half: it runs nothing. It helps you decide the two things /loop doesn’t — whether a task should be looped at all, and how to set the verifier, gates, and stop conditions. Use it to design the loop right, then point /loop, a cron job, or a queue worker at it.

What it does

Once it’s active, the AI stops treating “wrap it in a loop and let it run” as the default. It first does the thing most people skip — triage: a GREEN / YELLOW / RED verdict + decisive reasons across four dimensions (can a machine judge right from wrong, is judging it cheap, can a mistake be undone, can it be split small), with two hard vetoes — no faithful verifier, or an irreversible action that can’t be gated — going straight to RED.

GREEN: a faithful, cheap verifier, reversible, decomposable — a deterministic, largely unattended loop.
YELLOW: the verifier needs a model + rubric, or steps are only semi-reversible — keep a human checkpoint, cap iterations tightly.
RED: no faithful verifier can be built, or an irreversible action can’t be gated — don’t fully automate; build a verifier first, or keep a human in the loop.

Once a task clears triage, it designs the loop in three layers (outer scheduler / inner refinement loop / commit gate) and holds a few red lines that production loops fail on. It serves five modes: assess, design, diagnose, harden, implement.

The red lines it holds

maker ≠ acceptance authority. The model doing the work can’t grade its own output — it grades too leniently; and independence of evidence matters more than swapping in another model.
Protect the control plane. A stuck agent’s favorite “fix” is to weaken the test, not fix the artifact — the skill forbids the maker from touching the acceptance contract, locked tests, budgets, or stop conditions.
Irreversible actions live in a commit gate. Mass-sends, releases, money transfers — anything you can’t claw back — live outside the inner loop, behind approve → preview → commit.
Three hard stops, none optional: threshold met, budget hit, no-progress — and “no-progress” must watch a real signal, not something an agent can dodge with meaningless cosmetic edits.

Its temperament: an honest advisor that says “no”

Its design philosophy is one line: make the AI an honest advisor that will say “you can’t build a verifier for this, so don’t fully automate it” — not an “everything automates” salesperson.

Even if you explicitly ask for “full autonomy,” it won’t strip the approval / sandbox / commit gates — it gives you the closest design that stays inside a safe boundary. The things it should stop you on are exactly the ones where the judge can’t be built but you really want full autonomy — pure creativity, setting direction, calling strategy.

How it was evaluated, and where it stands

I ran a paired control/treatment battery: the same tasks answered by fresh agents without the skill (control) and others that read it (treatment), to isolate what the skill adds — five scenarios across opus / sonnet / haiku, plus a fresh held-out battery in domains the skill never mentions. The finding is deliberately honest: a strong base model already does the right thing on the blatant cases (moving money, deleting tests, a pressured no-approval deploy) on its own; the skill’s real marginal value is consistency, structure, and the verification metrics it forces — and it shows up most on smaller / faster models and across many repeated runs.

The evaluation method itself is still something I’m iterating on — rigorously proving “the skill actually makes loops more reliable” is a hard problem, and exactly what my research is after. It’s v1; the repo open-sources the evals (cases, prompts, rubric, raw outputs, a limitations writeup) and schemas. Issues welcome — let’s iterate.

How to install & use

Install: clone the repo, copy the loop-engineering/ directory into ~/.claude/skills/, start a new session.
Use: it activates when you design / evaluate / diagnose / harden an agent loop; or just ask “is this a good fit for a loop?”
Read: SKILL.md is the core judgment; three references cover production hardening, deployment, and the full “why it’s reinforcement learning moved to inference time” reasoning.