- Most agent rollouts fail because teams don’t define what “done” means up front or don’t verify the output against that definition afterward.
- The model is rarely the root cause; agents usually do what they were asked, and vague prompts are effectively asking for nothing.
- With agents, quality standards must be made explicit at the start and enforced at the end, because you’re no longer applying judgment continuously during the work.
- A real “bar” looks like an acceptance test: specific, verifiable criteria that someone else could use to decide pass/fail without asking questions.
- Machine-checkable goal conditions (tests passing, linters clean, correct status codes) dramatically improve outcomes by letting the agent self-verify and retry until it truly meets the bar.

When an agent deployment goes wrong and a team calls me in, I ask two questions before I look at a single log. What did you tell the agent done meant, and who checked the result against that definition. In every failed rollout I have sat with this past year, the answer to at least one of those questions is silence.
The model is almost never the problem. Claude, GPT, whatever they are running, the agent did roughly what it was asked, and what it was asked was close to nothing. Teams fail with agents because nobody set a bar going in, or nobody held it coming out. That is the whole diagnosis, and it shows up so consistently that I gave it a name I now use in every engagement: the bar.
The framework has exactly two parts. You set the bar before the agent runs, by defining what done and correct mean in a form that can be checked. You hold the bar after it finishes, by actually checking. Everything between those two moments, the writing, the wiring, the grinding, has crossed over to the machine. The two ends are what is left of your job, and they are not the small part.
The work used to carry the bar inside it
For most of my thirty years in IT, the standard and the doing were the same activity. You held a notion of quality in your head and enforced it continuously, line by line, as you built. When something drifted, you felt it and corrected, often without ever naming the standard you were applying. This is why good engineers are famously bad at explaining their own taste. They never had to externalize it, they just applied it as they went.
Agents broke that arrangement. When the doing moved to the machine, the bar got ripped out of the middle and exposed at the two ends, where it now has to be explicit. You cannot enforce a standard by applying it as you go, because you are no longer the one going. You have to state it before the run and verify it after, and both of those are skills most of us never deliberately practiced, because the old job never required them as separate acts.
That is the real reason the transition feels harder than the demos suggest. The skill did not get more complex. It got pulled out of your hands and turned into something you have to articulate, twice, at moments when you would rather just be building.
Setting the bar
Setting the bar means defining done before the agent starts, in a form that someone else, human or machine, could check without you in the room.
"Make it better" is no bar at all. Neither is "clean this up" or "add subscriptions." Those are moods. A bar reads like an acceptance test: build a subscription flow where a user can sign up, create a subscription in Stripe test mode, cancel it from the account page, and the existing test suite still passes. Every clause in that sentence is verifiable. A colleague could take the result and your sentence and decide pass or fail without asking you a single question. That is the test of whether you set a bar or set a mood.
The strongest version of this is machine-checkable. In my agent workflows I attach a goal condition the loop can evaluate on its own: pytest green, ruff clean, the new endpoint returns 201 on valid input and 422 on garbage. A goal condition is just a bar that runs without you. The agent can self-verify against it mid-run, retry on failure, and stop when it actually clears, instead of stopping when the output looks plausible. The difference in result quality between that and a one-line prompt is larger than the difference between model generations.
The economics here changed quietly. When you did the work yourself, vagueness was nearly free, because you noticed the drift halfway through and corrected. The bar assembled itself as you worked. With an agent there is no halfway, because you are not present for it. Whatever you specified at the start is the only standard the entire run answers to. Vagueness used to cost one course-correction. Now it costs the whole result.
Holding the bar
Holding the bar means verifying the finished work against the standard you set, and the hard part is that finished work actively discourages you from doing it.
Anthropic measured this in its AI Fluency Index, and the finding matches what I see in the field. When output arrives as a polished artifact, a formatted document, clean code, a working-looking tool, people invest more in directing the work and less in checking it. Fact-checking and gap-spotting drop at precisely the moment the output looks most done. Polish reads as correctness, and the surface signal quietly substitutes for the verification.
So holding the bar is a discipline that runs against reflex, and the defense is structural. A code review is holding the bar, which is why I tell teams that review matters more now, not less. One team I worked with had agents opening around thirty pull requests a week, all green in CI, and review had collapsed into rubber-stamping the checkmark. The trouble is that the agent had already optimized against CI. The tests it wrote passed the tests it wrote. The bug that eventually bit them, a retry loop that double-charged on a specific webhook ordering, was invisible to the suite and obvious to the first human who read the diff with intent.
The fix was not a better model but a reviewer in fresh context whose only instruction was to find where the result misses the bar, plus a rule that the verifying check cannot be authored by the thing being verified. The agent that produced an answer is the worst available judge of that answer. A separate check with no stake in the outcome, a test suite it did not write, a build gate, a second agent told to break it, a human reading the diff cold, is the right judge.
The practical rule I give teams: treat polish as the cue. The moment you feel ready to wave something through because it looks clean is the moment to check hardest.
The middle is a trap
There is a third place your attention can go, and it is where it wants to go: watching the agent work. Hovering over the session, reading the tokens as they stream, nudging the prompt mid-run. It feels productive. It is the residue of the job you used to have.
The middle is where the agent has already caught up to you. Every minute spent supervising the doing is a minute not spent sharpening the specification or checking the output, the two places your judgment is still the scarce input. When I catch myself babysitting a run, I ask one question: was the bar set well enough that I could walk away and trust the check to catch the misses. If the answer is no, the fix is the bar, not another nudge.
Better models never learn your bar
Here is why I think this framework outlasts the current tooling. Every model release improves the middle. Cleaner code, fewer hallucinations, less drift. None of that improvement touches the question of what you wanted or whether the result fits your situation, because that information was never inside the model. It lives in you, in your codebase's history, in your customers' tolerance for breakage, and the model can only act on the fraction of it you managed to express.
So as models climb, value does not spread evenly across the workflow. It concentrates at the two ends, where a human still has to say what good means and confirm that this is it. Prompt engineering depreciated in two years because better models needed less of it. Setting and holding the bar appreciates, because better models produce more output per hour that needs a bar to clear.
This also reshapes who is valuable on a team. A senior engineer's worth never came from the typing, it came from knowing what good looks like and recognizing whether the thing in front of them met it. That judgment translates directly: set and hold the bar across ten times the volume of work they could ever have typed. The exposure falls on whoever's contribution was the middle itself, and the answer for them is not to defend the middle but to move to the ends.
The ends are the job
The doing is gone and it is not coming back. You will hand the agent more of the middle every quarter, and it will keep getting better at it than you are. What stays yours is the bar: stated before the run, in a form that can be checked, and enforced after it, especially when the output looks finished enough to skip the check.
A vague intent going in, or an unreviewed result coming out, and the whole pipeline fails regardless of how good the model is. Both ends held, and even a mediocre model produces work you can ship. That asymmetry is the clearest signal I know about where to spend your effort. The bar is the work now, and the teams that internalize that will quietly outperform the ones still arguing about which model to buy.
Before you run an agent, write down a concrete, checkable definition of success—ideally as acceptance tests or automated goal conditions the agent can evaluate. Then, after it finishes, actually run those checks and treat the result as pass/fail rather than “looks plausible.” If you catch yourself asking for “clean it up” or “make it better,” pause and turn that mood into specific behaviors, constraints, and measurable outputs.