Loop Engineering Is Inference-Time RL

A warm-toned, semi-transparent person standing before a cool-blue glowing loop machine; a glowing balance scale hovers at its center, and he's placing a freshly produced result onto it to weigh — like a judge for the spinning loop

This is the sequel to the last Loop Engineering piece — its “hard mode.” It digs deeper, and it’s meant for readers with a bit of algorithm, agent, or large-model background. If you just want the “what and how,” the previous piece is enough; here we take apart what it really is, why it behaves the way it does, and how to write a loop that actually works. At the end I’ve open-sourced a skill that crystallizes this judgment.

A Perspective That Kept Clicking Into Place

I build AI agent products by day, and on the side I’m chipping away at a related research direction: a framework that lets an agent self-evolve from real business-level evaluation. In plain terms, using the verdicts from real business as the signal to keep nudging an agent, reliably, toward getting better on its own.

So when Loop Engineering suddenly caught fire this year, I stared at it for a long time, with a feeling that kept getting stronger: it isn’t a new thing at all. It’s an old friend in a new set of clothes.

This old friend is called reinforcement learning (RL).

What I want to argue in this piece is one thing: Loop Engineering is, at bottom, “reinforcement learning moved to inference time, with the weights frozen.” Once you accept that mapping, every design point and every way a loop crashes can be derived straight from RL. It isn’t a pile of scattered engineering tips; it has a spine.

Let’s stand that spine up first.

The Spine: A Loop’s Ceiling = Its Verifier’s Ceiling

A while loop wrapped around a model is trivial to write. What actually decides whether the loop is usable is the thing that judges whether each round’s output is right — we call it the verifier (the checker / the judge).

So my core claim is:

Loop engineering is, at its core, verifier engineering. The loop is the easy part. A loop without a faithful verifier is just a machine for accumulating errors faster, and the stronger the model, the faster it accumulates them.

This isn’t just my own bias. The New Stack ran a piece whose headline says it outright: “loops are replacing prompts, and verification is about to be your biggest problem.” And there’s a comment I love that nails it, roughly: designing the loop is only half of it; the other half is putting something into the loop that can say “no” — a test, a type check, a real error message. That thing that can say “no” is the verifier.

I should add one caveat: verifier fidelity is usually the tightest constraint, but not the only one. Whether a loop converges also depends on the action space, state completeness, observability, whether the task can be decomposed, and the recovery strategy after a failure. But of all of these, the verifier is the heart.

Okay, the spine is up. Now let’s see why I say it “just is RL.”

What It Resembles: An RL That Doesn’t Update Weights

Lay the mapping out:

The model = policy
The engineering scaffold wrapped around the model (tools, environment, state transitions) = the MDP / environment
The success standard you set = the reward function
“Run until it passes” = the rollout

I have to stress one line: this is an analogy, not a literal equivalence. No weights are updated anywhere in the process, so it isn’t actually RL. Where this analogy really earns its keep is this: it precisely predicts one class of failure — the class about goals and feedback.

And the part of RL that’s hardest and least solved happens to be reward. You’ve taken the hardest problem from training time and moved it, untouched, to inference time. Concretely, you inherit three mountains.

A warm-toned, semi-transparent person running lap after lap along a cool-blue glowing circular track, a cool-blue lock hovering dead center — it just is an RL, one that never updated a single weight from start to finish

Mountain One: Reward Design Is Hard in Itself

Defining a success signal that’s “cheap, faithful, and machine-checkable” all at once — that’s a famously hard bone to chew in RL.

We PMs often have an illusion that “pinning down the success metric” is a product problem, something a doc can settle. It isn’t. The hard part isn’t defining “a success metric”; it’s defining one that can’t be gamed. That’s a reward-design problem, and it has no standard answer.

Mountain Two: Reward Hacking / Goodhart — It Games You Precisely

You’ve probably heard Goodhart’s law: “when a measure becomes a target, it stops being a good measure.”

Dropped onto a loop, it looks like this: you set an agent looping against the goal of “tests pass,” and it’ll very likely write you code that games the tests, not code that solves the problem. The moment there’s a gap between the verifier and what you actually wanted, the loop will tirelessly, precisely, drill into that gap.

And here’s the counterintuitive part: the stronger the loop, the harder it drills. Capability and the capacity to game grow together. Langfuse coined a wonderfully vivid name for this output: “agent slop” — low-quality junk mass-produced by a swarm of agents optimizing against an incomplete eval. This is the precise mechanism behind what many call “loops only produce expensive garbage”: it’s not that loops don’t work; it’s that gap between the verifier and the real goal doing its damage.

A cool-blue wall of standards with a thin crack down the middle; a warm-toned drill bit boring precisely, tirelessly, into that crack — the moment there's a gap between verifier and real goal, the loop will drill into it

Mountain Three: Dense vs. Sparse — and a Pit I Fell Into Myself

This is the area where my thinking shifted the most, so let me write a bit more.

Dense reward: feedback at every step. Writing code, every step compiles and runs tests — it’s like handing the model a gradient it can climb.
Sparse reward: you only find out at the very end whether it worked. A strategic decision pays off half a year later, with no signal in between.

My initial judgment was: “dense means you can loop; sparse means you’re flying blind — death sentence.” That judgment was wrong, and I later overturned it myself.

Wrong how? Sparse feedback doesn’t mean no reliable verifier. A compiler ultimately gives you a pass/fail; a theorem prover ultimately accepts or rejects; a final test suite ultimately passes or doesn’t; a data migration runs one full consistency check when it’s done. These signals are all very “sparse” (they only show up at the end), but they’re cheap and trustworthy, and they can absolutely sustain a search loop with a budget cap.

What actually leaves you flying blind is a different combination: sparse + failure gives you no actionable evidence + a huge search space. “Loop until the copy is moving” is the real dead end, because “moving” is both sparse and gives no diagnostic information. So the right question for whether a task can be looped isn’t “how dense is the signal,” but: will a failed attempt return evidence that’s faithful and actionable enough?

This correction matters, because it directly determines whether you’ll misjudge a whole pile of tasks that are actually doable (like “loop until the proof checker accepts”) as impossible.

How to Write a Good Loop: A Few Things Grown from the Spine

Once you get that “it’s RL,” the design points below stop being rules and become corollaries. Let me pull out a few I’ve settled on myself.

1. The Verifier Is the Heart, and maker ≠ Acceptance Authority

Let the model that produces the result grade its own result, and it’ll grade too leniently — this is all but inevitable. So you separate maker / checker: one does, one independently verifies. The maker can self-check and cheaply catch obvious errors, but it can never be the final acceptance authority.

There’s a point deeper than “use two different models”: independence of evidence > independence of model identity. Two different models may share the same wrong knowledge, the same prompt bias, the same incomplete understanding of the requirement, the same training distribution. So what really matters isn’t “swap in another model to judge,” but “swap in an independent source of evidence”: external facts / deterministic results, a locked holdout test, sentinel checks not exposed to the maker, a model judge in an independent context, sampled human review. There’s no one-size-fits-all ordering here; choose and combine by fidelity to the acceptance contract, independence, coverage, stability, and cost.

2. Guard the “Control Plane” — Don’t Let the maker Edit the Tests

This is the one I think gets overlooked most, yet is most lethal in production.

What’s the easiest “fix” a stuck agent reaches for? Edit the tests, or lower the bar, not fix the artifact. That’s Goodhart’s most vivid footnote.

So you must explicitly protect the control plane: the maker may not modify the acceptance contract, may not touch locked/holdout tests, may not change checker instructions, may not change the rubric, may not adjust the budget, may not change the stopping conditions or approval policy. Any change to these counts as “the goal became a new version” and must go through independent approval. Wherever you suspect it’ll game you, keep at least one acceptance set completely hidden from the maker.

3. Stopping Conditions: Three Hard Stops, and no-progress Needs a “Real Signal”

The root of the money-burning black hole is badly set stopping conditions. Three hard stops, not one can be missing: stop-on-pass, stop-on-budget (iterations / tokens / dollars), and stop-on-no-progress.

But “no progress” has a trap: if you only judge “the same error appears N times in a row,” the agent can dodge it by producing a pile of meaningless small changes so every round looks slightly different. So no-progress itself needs a real progress signal: is the set of failing tests shrinking, are the unmet rubric items decreasing, is the failure fingerprint repeating, has it been several rounds of nothing but cosmetic diffs. Even “no progress” itself needs an executable verifier.

4. Escalate on “Observable Triggers,” Not on the Model “Feeling Unsure”

A model’s self-reported confidence is notoriously uncalibrated. So “ask for help when you should” can’t be written as a personality requirement; it has to become a rule, triggered by observable conditions: verifiers disagree with each other, input falls outside the known range, an unknown tool error shows up, the budget nears its cap, the same failure fingerprint repeats, a high-risk action is involved, key evidence is missing, state and external data are inconsistent.

This one also hides the real bottleneck for scaling: when you run many loops, throughput doesn’t depend on how strong the model is — it depends on how well these loops “raise their hand at the right moment.” A loop that fails silently is far worse than one that fails loudly. How many loops one person can manage ≈ how well those loops shout “I’m not sure.”

5. Don’t Mash the Three Logical Layers Together

Last, architecture. I split a loop into three logical layers (present as needed; for a single task the outermost can degenerate into a one-shot shell):

Outer scheduler: pick the next work item, lock, dispatch, queue-level termination.
Inner refinement loop: converge on one work item — do → verify → revise on failure evidence → stop on pass / stall / over-budget.
Commit gate: before any irreversible, high-privilege, or externally visible side effect, do pre-approval, a final independent verification, the commit decision, and postcondition checks. Every irreversible action lives in this layer, outside the inner loop.

Locking actions like “mass-send,” “publish,” “wire money” into the commit gate, instead of letting them run naked inside the inner loop — this one can save your life.

A cool-blue gate held firmly shut, with a few warm-toned blocks locked behind it — irreversible actions like wiring money and mass-sending; outside, a loop spins on, unable to reach them

The Part I Worry About Most: Three Kinds of “Debt”

All of the above is the machine side. But there are three failures that never show up on any dashboard — they’re on the human side. Addy Osmani makes the point sharply in his blog post; let me paraphrase, because this is the part I’m personally most wary of:

Comprehension debt: the more code a loop delivers that you never read, and the faster it does, the wider the gulf between “what exists” and “what you actually understand.” A verifier can’t replace your reading what it actually wrote.
Cognitive surrender: once the loop is running on its own, it’s terribly easy for a person to stop thinking and take it all at face value. I love Osmani’s line, roughly — designing loops with judgment is the cure; doing it to escape thinking is the accelerant; the same action, opposite outcomes.
Slop: you optimize the goal you wrote down, not the nuance still left in your head. So you have to sample the real output and read it yourself, and feed the corrections back.

I pulled this section out on its own because, having built agent products long enough, I’m more and more convinced: the most dangerous posture is exactly the most comfortable one — hitting “start,” and then never looking at what it produced again.

The Frontier, and a Bit of Self-Interest

So where’s the real frontier? In manufacturing a verifier, by hand, for domains that don’t have one — decomposing the final “good or not” into a string of step-by-step scorable rubrics, or having the model generate a test suite first and then looping against it.

But this is dangerous: in turning sparse into dense, are you approaching the true verifier, or manufacturing a more polished, more presentable, but equally gameable proxy? So any reward shaping has to first prove that your shaping signal really correlates with the true goal, before you trust it.

This is exactly what my research direction wants to chew on: how to use real business-level evaluation to keep manufacturing a faithful verifier, and then let the agent evolve along it on its own. In a sense, Loop Engineering took the hardest part of training time (reward / evaluation), moved it to inference time, and then smeared it directly across every product engineer’s face. Those of us building agents in this generation can’t escape it.

Landing It: I Crystallized This Judgment into a skill

Talk is cheap. I took this whole judgment — from the four-way triage of “should this even be looped” to verifier design, guarding the control plane, the three-layer architecture, escalation triggers, and production monitoring — and crystallized it into a skill you load into Claude / Codex, and open-sourced it.

Its design philosophy is one line: make the AI an honest advisor that’s willing to say “you can’t build a verifier for this task, so don’t fully automate it,” instead of a salesman who claims “anything can be automated.”

GitHub: qingqingpi/loop-engineering-skill (the README is bilingual; clone it and drop it into ~/.claude/skills/ to use).

I didn’t just toss the files up and call it done. The repo has a section specifically on “How it was evaluated”: I actually tested it with paired experiments, five scenarios, each run twice, once with an agent without the skill (control) and once with an agent that had read it (treatment), to isolate what the skill actually adds. The conclusion is honest, too: in a scenario like “build an auto-publisher with no human review,” the agent with the skill held the commit gate while the control eventually caved; but I also flagged a caveat — a strong base model will refuse bad autonomy on its own, so the skill’s marginal value is mainly in consistency, structure, completeness, and the verification metrics it forces, and it only really amplifies on smaller/faster models and across many repeated runs.

It’s v1 right now. Text can write down judgment; it can’t write down all of runtime behavior. When you actually use it, you’ll probably hit a pit or two I didn’t foresee — issues welcome, let’s iterate together.

To close, back to the opening line. Loop Engineering isn’t magic; it’s a mirror that exposes demons: what it exposes is whether your task has a faithful acceptance signal at all. If it does, the loop is your leverage; if it doesn’t, the prettiest loop is just producing expensive garbage faster.

Engineering can’t patch a missing judge. That’s probably the plainest, most repeatedly confirmed thing I’ve learned doing this work.

— Sun Xin, an AI product manager working on agents. More scattered thoughts at sunxin.xin.