How Far Can You Trust an AI Agent?

A warm-toned humanoid agent standing quietly inside a boundary drawn by a ring of cool blue light, with unlit blankness beyond the edge

I hand whole jobs off to an AI every day, and slowly something dawned on me. What decides how far I can let it run is whether I can check its work quickly and reliably. How smart it is matters a lot less than you’d think.

I’ve been building applications on large language models for two or three years now. I got into this in 2023, and these days I spend all day inside tools like Claude Code, writing code by telling it what I want. I give it a sentence, and it goes off on its own, opens files, edits code, runs tests, and finishes the whole job. The longer I use it, the more one question keeps nagging at me. With this kind of AI, the kind where you say one thing and it does a pile of work, how far can you actually let it run? Which jobs can I hand off and then go to sleep, and which ones do I have to watch like a hawk, never daring to look away?

At first I assumed the line was drawn by how smart the model is. The stronger it got, the more I could let go. But the more I thought about it, the less that held up. This essay is me working that question out, slowly, until it finally came clear. I’ll try to keep it plain enough that someone who’s never touched this field can follow along, because I think the answer is a lot more concrete than the big abstract question of whether AI is going to replace us.

First I should say what I mean by “agent.” The way people use AI has changed over the last couple of years. It used to be that you asked a question and got an answer, like a super-powered search box. Now this new crop of tools can act on its own. You give it a goal, and it breaks the goal into steps, picks its tools, and works through the whole thing. People in the field call this kind of “AI that does its own work” an agent. Claude Code is a classic example, and it’s the one I live in.

Bill Gates has said he’s witnessed two genuine technology revolutions in his life. One was the graphical interface, which freed ordinary people from memorizing command lines and let them run a computer by pointing and clicking. The other is today’s large language models. I believe that call, and it’s why I’ve bet the next decade or two of my career on AI. Riding that wave, a lot of people describe agents as a brand-new layer growing on top of the software world. Underneath sits the old bedrock of operating systems and databases, and on top you add an assistant that understands plain speech. You tell it what you want in ordinary words, and it goes and works the big stack below.

Half of that is right, and the other half is a trap. It was by chasing down exactly where the trap is that I worked my way to the boundary.

An Agent Is Like a Capable Contractor Who Sometimes Gets It Wrong

So where’s the trap? Here’s how I came to understand it.

When we build software, we stand it on a pile of utterly dependable parts: databases, file systems, things like that. They all share one trait. Use them a certain way, and they respond the same way every single time, steady as a faucet. Turn it and you get water, always water, never a cat shooting out at you. It’s precisely because they’re dependable to the point of boredom that you can build a skyscraper on top of them without losing sleep.

At first I instinctively treated an agent as one of those parts. Turn the tap, get water. It took me a while to catch on that it’s nothing of the sort. It takes the complicated work off your hands, sure, but its performance has luck baked into it, and every so often it goes sideways. And this isn’t a bug from sloppy engineering that you can tune away. It’s there by nature, and no amount of model intelligence will get rid of it.

Then I switched metaphors, and everything fell into place. Don’t think of an agent as a part. Think of it as a highly capable contractor who occasionally screws up. Dealing with dependable parts is like stacking blocks: you snap them together and you’re done. Dealing with a contractor takes a different skill set entirely. How clearly do you spec the job, how do you inspect it when it’s finished, how do you cover yourself when it goes wrong. The more I use agents, the more convinced I am that the whole secret to using them well is hidden in that second set of skills.

Autonomy and Intelligence Are Two Different Things

If using an agent well comes down to inspection, then the next question follows on its own. What kind of work, exactly, can you safely hand off?

I fell into an intuitive trap first. The smarter the model, the more it can decide for itself, so the more I can let go. Sounds self-evident, right? But two of the most ordinary objects imaginable turned me completely around: a robot vacuum and a calculator.

A robot vacuum is adorably dumb, but think about it. It’s a full-blown helper that does the entire job on its own. You hand it the whole goal, “clean the room,” and it plans, it works, and when it’s done it trundles back to its dock to charge, all without you glancing at it once. A calculator is the opposite. Its raw computing power is absurd, leagues beyond the vacuum, yet it’s forever just a tool, because which buttons to press and what to compute, all of that lives in your hands.

Set those two side by side and it finally clicked. How much something can do on its own and how smart it is are two completely unrelated axes. The robot vacuum is more autonomous than Claude Code, and dumber by an order of magnitude. The intuition I’d believed without question, that smarter means more like an independent helper, got demolished in one second by the vacuum on my own floor.

So what actually separates a “tool” from a “helper”? I stared at those two examples for a long time, and it all came down to one thing. Whether the job can be checked, fast and reliably, as done or not done.

The thing that nailed this down for me was a line from Yao Shunyu on Zhang Xiaojun’s podcast. (Zhang Xiaojun runs one of the most thoughtful long-form interview shows in the Chinese tech world; Yao is an AI researcher.) He said he couldn’t figure out how to train an AI to be a good product manager, because product work “has no scale.” Whether it’s any good, you only find out after you’ve actually built the thing and shipped it to users, and the feedback comes back slow and blurry. He didn’t know what standard he’d even use to teach the AI.

That stopped me cold, because flipped around, his problem was exactly the answer I’d been looking for. A product manager’s judgment can’t be given a quick, accurate score on the spot. The results are diffuse, the cycle runs in months, and you can’t “run it again for comparison.” You can’t re-run the parallel world where that product manager wasn’t there, to see the difference. And the credit is tangled up with the team and with luck. With no ruler to grade against in the moment, an AI can’t grind its way to being a product manager the way it grinds its way to coding. A programmer has a ready-made answer key: whether the code they wrote is right, whether it runs, a machine can tell in an instant. A product manager has no such key.

And the reverse came clear too. As long as the result can be checked at a glance, the AI can train itself to superhuman levels even when there’s no ceiling on what counts as “best.” Chess is like this. There’s always a better move to be found, yet who won and who lost is obvious the moment the game ends, so the engines play each other day and night and you never have to watch. So what actually pins an agent down is never how hard the task is or how high the goal sits. It’s whether the result can be checked cheaply.

A quick aside, because this conclusion also corrected one of my own early ideas. I used to think the thing deciding whether you could fully let an AI go was whether human ambition in that field ever maxed out. In chess, say, someone always wants to play better, so doesn’t someone always have to keep watch? But held up against chess, that idea shattered on the spot. Chess ambition clearly has no end, and yet you can still let the AI go completely. The variable that actually does the work is “can it be checked.” Ambition only ever mattered through that. I’ve come to believe this: push an idea to its extreme and falsify it with your own hands, and you often learn more than by following it forward.

On the left, a goofy little robot vacuum that nonetheless finishes the job on its own; on the right, a powerful but complicated machine dangling from a tangle of wires, held in a human hand. Which one you can let go of has nothing to do with which one is smarter

The Second Thing That Matters: Can You Undo the Damage

At this point I figured I’d solved it. Can it be checked, that’s the line. But it wasn’t long before I noticed the line was missing half of itself.

What exposed the gap was a slightly juvenile thought: weld a flamethrower onto that robot vacuum. Sit with that for a second. Checking “did it clean the floor” hasn’t gotten one bit harder, and yet now you’d absolutely watch it like a hawk, never daring to step away.

Why? What I worked out is that there was a second line I’d missed. Once it acts, can the damage be undone, and is there a cap on how bad it gets. The first line is about whether you can teach it and whether you can grade it. This second one is about something else entirely. Even if you can grade it, do you dare actually let it run unwatched?

Looking back at why Claude Code can fool around so boldly inside its own sandbox, it suddenly made sense. Half of it is that whether the code is right can be tested on the spot. The other half is that even if it breaks something, one command undoes it, the tests re-run, and any damage is recoverable. But the moment it has to touch a real production system, touch money, touch an email that’s already gone out, “can it be checked” hasn’t changed a word, yet the human gets yanked right back in front of the screen. The reason is simple. Get these wrong and you can’t take them back.

Snap the two lines together and I had a reasonably clean way to put it:

How far an agent can be let off the leash is exactly how far its “inspector” can reach: something that watches the real goal and is fast enough to call a halt before any irreversible damage is done.

The moment you go past that range (the inspection is too slow, it has drifted off the real goal, or it simply doesn’t exist), the human has to stay in the loop. Because in the end, the human is the inspector of last resort.
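To make the rule concrete, here is a minimal sketch in Python. Everything in it is my own illustration: the `Task` fields, the `needs_human` function, and the example numbers are hypothetical, not taken from any real tool. It only encodes the two lines above: is there a reliable check on the real goal, and does that check return before a mistake becomes irreversible.

```python
from dataclasses import dataclass


@dataclass
class Task:
    """A job you might hand to an agent (all fields are illustrative)."""
    name: str
    check_is_reliable: bool  # does a check exist that tracks the real goal?
    check_seconds: float     # how long the check takes to return a verdict
    damage_seconds: float    # how soon a mistake becomes irreversible
                             # (float("inf") means it can always be undone)


def needs_human(task: Task) -> bool:
    """Return True when no automated inspector covers the task."""
    if not task.check_is_reliable:
        # No trustworthy check at all: the human is the inspector.
        return True
    # The check must finish before the damage can no longer be undone.
    return task.check_seconds >= task.damage_seconds


# Hypothetical examples with made-up numbers.
tasks = [
    Task("refactor code in a version-controlled sandbox", True, 30.0, float("inf")),
    Task("send an email to a customer", True, 30.0, 0.0),
    Task("judge a product decision", False, 90 * 24 * 3600.0, float("inf")),
]

for task in tasks:
    verdict = "keep a human in the loop" if needs_human(task) else "can run unattended"
    print(f"{task.name}: {verdict}")
```

Notice that model intelligence appears nowhere in the function. In this sketch, the only way to move a task from the second verdict to the first is to change the check or the reversibility, which is the point of the section.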

Once I got here, my whole understanding of the phrase “an agent’s capability boundary” changed. It isn’t a line drawn on capability at all. It’s this: how far your cheap, fast, reliable ability to check the work can reach.

There’s one more layer, the one I chewed on longest and found most counterintuitive. A smarter model only raises the ceiling on what an agent dares to attempt. It doesn’t conveniently build you an inspector along the way. The inspector is a separate craft, one that has nothing to do with how smart the model is. A programmer’s answer key got assembled by humans, one problem at a time. The rigorous proofs in mathematics were invented by humans. The future tools that can simulate how real users will react will have to be built by humans too. So as the model rockets ahead, this boundary barely budges, until we get better at building inspectors. The more I think about it, the more I believe that the real speed at which agents advance is the speed at which we build inspectors, not the speed at which models get smarter.

Why AI’s “Getting It Wrong” Can Never Be Removed: It’s the Same Thing as Being Smart

For a while I really went down a rabbit hole. If this getting-it-wrong business is so annoying, can’t we just “fix” it? Make the agent into something that never errs, fixed input, fixed output, an honest calculator where one is one and two is two.

I thought about it for a long time and found the path is a dead end. And the reason it’s a dead end is exactly what explains why AI is smart in the first place. The moment I saw it, I got a little chill.

First I cleared up a misunderstanding. What kills “smart” is not “stability.” You can absolutely turn a model’s randomness all the way down so it gives the identical answer to the same sentence every time, and it doesn’t get one bit dumber. “Stable input, stable output” doesn’t fight with being smart.

What actually makes it stop being smart, as far as I can work out, comes down to whether its behavior can be written into a table in advance. The fatal flaw of old-school software is that “which input maps to which output” is something you can list out in full ahead of time and review line by line. A large model, even tuned to answer the same way every time, can’t be reduced to that lookup table. It always finds a way to surprise you. So the real Achilles’ heel of this smart medium is that it can’t be fully described in advance. And it’s precisely because it can’t be described in advance that it can handle the new situations you never taught it.

Writing this, I stopped short, because that earlier wish to “fix the errors” was now nailed shut for good. The ability that makes it look smart (handling situations it’s never seen) and the flaw that makes it always get things wrong (faceplanting in situations you didn’t anticipate) are the very same thing. You can’t build a gate to block a kind of crash you have no way to describe in advance.

So this isn’t a trade-off between “stability or smarts” at all. Generalizing and erring are two sides of the same coin. Handling the unexpected means it can’t be pinned down in advance, which means the possibility of error can’t be removed. If you want it to generalize from one case to the next, you have to accept that it’ll occasionally fall on its face. Those were always one and the same thing.

Put it in terms of “following instructions” and it gets even plainer. A large model isn’t really “understanding and executing” your instruction. It’s more like writing onward from your words, producing a stretch of text that looks like it’s obeying. So its “obedience” is always just a high probability, never an ironclad guarantee, and you can’t draw a circle in advance around where it’ll crash. Old-school software is different. The standard for whether it’s right is built into itself. A large model has thrown that “check for correctness” job outside of its own body.

Once I got this layer, I finally understood what all those “bolt-on frameworks” are for. What they add isn’t smarts. They strap on, from the outside, a guarantee of correctness that this kind of smartness can never grow on its own.

A humanoid figure made of warm light; the same beam that illuminates it and lets it handle new situations also leaves an unfillable crack down its body. What makes it smart and what makes it err are one and the same beam

The “Bolt-Ons” We Build for AI Actually Come in Two Kinds

Once I’d worked all that out, I started looking at this business of “building bolt-ons for AI” with new eyes.

First, what’s a bolt-on? It’s the ring of extra scaffolding we build around the AI model: helping it remember things, helping it break a big task into small steps, helping it double-check. I used to think these were all one kind of thing, and that as the model got stronger they’d all bow out together. Then I realized there are two kinds in there, with completely opposite fates.

The first kind, bolt-ons that add capability. They help the model do things it can’t yet do smoothly: breaking a big task into small steps for it, helping it remember what came before. This kind gets more redundant the stronger the model gets, trending toward zero. The industry is already watching this happen: once a model’s memory and ability to think holistically improve, the bolt-ons that used to break steps down for it start getting in the way.

The second kind, bolt-ons that add guarantees. They add what the model can’t give by nature: hard safety guarantees, the permission gate before an irreversible operation, an external check for correctness. This kind doesn’t retire no matter how strong the model gets. On the contrary, the higher the stakes and the more untouchable the situation, the thicker it has to be built. Because no matter how smart the model is, it won’t make a transfer that’s already gone out reversible, and it won’t grow itself an inspector out of thin air where none existed.
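A guarantee bolt-on of this kind can be very small. The sketch below is a hypothetical permission gate in Python: the action names, the `IRREVERSIBLE` set, and the `gated_execute` function are all assumptions I made up for illustration, not the API of any real agent framework. The gate does not try to judge whether the agent is right. It only refuses to run an action from the irreversible list unless an outside approver says yes.

```python
from typing import Callable

# Hypothetical list of actions that cannot be taken back once done.
IRREVERSIBLE = {"transfer_money", "send_email", "deploy_to_production"}


def gated_execute(
    action: str,
    run: Callable[[], str],
    approve: Callable[[str], bool],
) -> str:
    """Run `run()` directly unless `action` is irreversible.

    Irreversible actions go through `approve` first. In practice that
    could be a human click or a rule-based program; here it is just a
    callback supplied by the caller.
    """
    if action in IRREVERSIBLE and not approve(action):
        return f"blocked: {action}"
    return run()


# Reversible work passes straight through, even with a strict approver.
print(gated_execute("run_tests", lambda: "tests passed", lambda a: False))

# An irreversible action is stopped until someone approves it.
print(gated_execute("transfer_money", lambda: "money sent", lambda a: False))
print(gated_execute("transfer_money", lambda: "money sent", lambda a: True))
```

The design choice worth noticing is that the gate sits outside the model and keys on the kind of action, not on how confident or capable the model is. A stronger model changes nothing inside `gated_execute`.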

The difference between these two strikes me as deadly important. Pull out a capability bolt-on and you hurt average performance: a bit slower, a bit dumber. Pull out a guarantee bolt-on and average performance might not be scratched at all, but what you hurt is safety in the worst case.

Which leads to a conclusion that sends a small chill down my back. An evaluation that only watches “average performance” is essentially blind to the “guarantee” kind of bolt-on. It will, one second before an agent wires money to the wrong account, calmly report that this safety module is useless now and can be removed.
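A bit of made-up arithmetic shows how that blindness happens. The numbers below are purely hypothetical: 10,000 runs, one of which would wire money to the wrong account, at an assumed loss of 50,000. The function `evaluate` is my own toy, not a real benchmark. With the gate installed, the bad run is blocked and the task fails harmlessly. Without it, the bad run goes through and the task fails expensively. Either way the task counts as one failure.

```python
# All figures are invented for illustration.
RUNS = 10_000
BAD_RUNS = 1                      # runs where the agent picks the wrong account
LOSS_PER_BAD_TRANSFER = 50_000    # assumed cost of one wrong transfer


def evaluate(gate_installed: bool) -> tuple[float, int]:
    """Return (average success rate, worst-case loss) for the toy scenario."""
    # The bad run fails the task whether it is blocked or executed,
    # so the success count is the same in both configurations.
    successes = RUNS - BAD_RUNS
    success_rate = successes / RUNS
    # Only the worst case differs: a blocked transfer loses nothing.
    worst_case_loss = 0 if gate_installed else LOSS_PER_BAD_TRANSFER
    return success_rate, worst_case_loss


with_gate = evaluate(gate_installed=True)
without_gate = evaluate(gate_installed=False)

print("with gate:   ", with_gate)
print("without gate:", without_gate)
print("same average score:", with_gate[0] == without_gate[0])
```

In this toy setup the average score is identical with and without the gate, so an evaluation that reads only that number would conclude the gate contributes nothing. The whole difference lives in the second value, which such an evaluation never looks at.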

So when I look at a bolt-on now, the question I ask is no longer “is this thing still needed.” It’s “is it filling a gap in capability (which evaporates on its own the moment the model gets strong) or one of those holes that can never be filled by nature (which no amount of strength can fix).” The former gets erased outright by a stronger model; the latter doesn’t. The latter is the kind that “the stronger the model gets, the more its importance shows.”

The opposite fates of the two kinds of bolt-on: one slowly turns transparent and evaporates as the model grows stronger, the other gets built up thicker and thicker in higher and higher stakes situations

Conclusion: This Is the Final Form of the AI Era

After this whole long detour, I can finally answer the bigger question that was sitting behind all of this. Will the future be one super AI model that does absolutely everything?

I’m betting no. And the more I think about it, the more I believe my bet runs deeper than “today’s models aren’t strong enough yet.” This isn’t a problem peculiar to large models at all.

Here’s how I talked myself into it. Any system that works by “finding patterns in massive piles of examples,” rather than by “rigorous proof” guaranteeing correctness, throws the “check for correctness” job outside its own body. Hard guarantees require a black-and-white foundation you can verify step by step. Pattern-finding requires flexibility, fuzziness, matching on the gist. Those two pull in opposite directions. So the error you can’t remove isn’t a defect specific to large models. It comes built into “generalizing by learning” itself. Swap in some other technology and it won’t save you, not as long as you still want that ability to generalize.

Once I got here, that line that’s been repeated to death, that the future is models plus programs plus humans working together, finally carried real weight for me.

That combination isn’t a stopgap waiting for a stronger model to absorb it. It is the final form of “letting AI do its own work while bearing the consequences.” The bolt-ons will slowly die off on the “add capability” side, and stay forever on the “add guarantee” side.

What the large model decides is how high the agent’s ceiling can go. But in any specific situation, how far an agent can be let off the leash depends on whether you can assemble, right there, the existing rules, the unchangeable facts, the humans, and the already-written programs into an inspector that can call a halt before irreversible damage is done. Money transfers are the perfect example. Once money’s gone out it can’t be clawed back, so that checkpoint is held mostly by layers of permission from programs and humans, not by the AI. Because here the “can’t be undone” is right out in the open, and you can block it.

So in the end, the two sentences I keep for myself are these. The model is responsible for raising the ceiling. The inspector is responsible for letting you dare to let go. These are two different things, and the second one is the agent’s real boundary.

As for whether we can build the kind of inspector that “predicts success or failure without actually shipping,” and shove the whole boundary way out in one big step, that question is far too tempting. Unfortunately it’ll probably stay out of reach for the next few years, so I’ll save it for later.

This essay was distilled from the thinking in a single conversation that ran several hours, and a lot of the ideas took shape slowly through back-and-forth argument and friendly sparring. Wherever something isn’t clear, odds are I just haven’t thought it through yet.