The Judge Does Not Run Your Tests

Gloss Key Takeaways

In agent loops like Claude Code’s /goal, the “judge” model only sees the transcript, not your repo, filesystem, or test runner output.
Many so-called self-verifying setups are actually “verification theater” because the judge ends up grading confident narration instead of evidence.
Completion conditions work best when they require receipts in the transcript (e.g., test output, build logs, exit codes) rather than subjective claims like “it’s robust.”
A reliable condition has two parts: a measurable end state and an explicit way the agent must prove it via its own surfaced output.

A stack of printed receipts on a clipboard beside a closed laptop on a wooden desk, lit by soft window light

I have had the same conversation with three different teams in the past month. They set up an agent loop, usually Claude Code's /goal, gave it a completion condition, and told me the setup was self-verifying because "a second model checks the work." Then I ask one question: what exactly does that second model see? Nobody has answered it correctly yet.

The answer changes how you write every condition you will ever give an agent. The judge model that decides whether your loop is done does not run your tests. It does not read your repo. It does not execute anything. It reads the transcript of what the working agent did, and only that. If the proof is not sitting in that transcript, the judge is grading an essay about your codebase, not your codebase.

My position after running these loops for months: most setups people call self-verifying are verification theater, and the fix is to write conditions a machine can check from the paper trail the agent leaves behind.

What actually happens after each turn

The mechanism behind /goal is simple enough to describe in one paragraph, and the simplicity is exactly where the misunderstanding hides.

You give /goal a condition. The main agent works in turns, editing files, running commands, producing output. After each turn, Claude Code sends your condition plus the conversation so far to a separate small model, Haiku by default. That model returns a yes or no and a reason. A no feeds the reason back as guidance for the next turn. A yes stops the loop.

Notice what is not in that description. The judge has no shell. It has no filesystem access. It never calls npm test, never opens src/auth/token.ts, never checks whether the container actually starts. Its entire universe is the text of the session. The model that does the work is not the model that decides it is done, which is good, but the model that decides it is done can only read what the working model chose to surface.

None of this is a flaw in Claude Code, and once you understand the reasoning, the design is right. A judge with its own tools would be slow, expensive, and would just reintroduce the same trust problem one level up. The judge is a reader. Plan for a reader.

Receipts, not opinions

Once you accept that the judge only reads the transcript, the difference between a working condition and a useless one becomes mechanical.

"All tests in test/auth pass and npm test exits 0" works. Not because the judge verifies it, but because the agent, in the normal course of pursuing that goal, runs npm test, and the full output lands in the conversation. Forty-one passing tests and an exit code are sitting right there in plain text. The judge reads the receipt and says done.

"The auth is solid" does not work, and it fails in a worse way than just erroring out. The agent will write code, describe its own changes in confident prose, and the judge will read that confident prose and, often enough, say yes. Nothing in the transcript settles the question, so the judge settles it on vibes, which are the working agent's vibes about its own work. You have built a machine where the author grades itself with one extra hop in the middle.

I watched a run like this fail in slow motion. The condition was "the API error handling is robust." Four turns in, the agent had added try/catch blocks, written a tidy summary of how robust everything now was, and the judge agreed. No test had run. One of the new catch blocks swallowed a connection error that the old code correctly propagated. The transcript looked great. The code was worse than before the run started.

The two-part shape of a real condition

Every condition that has held up for me has the same two parts: a measurable end state, and a stated way the agent proves that state in its own output.

The end state is the easy half. Tests pass, build succeeds, the linter is clean, the diff touches only these files. The proof clause is the half people skip, and it is the half that makes the judge useful. You are telling the agent which receipts to produce, because you know the judge can only read receipts.

A condition I actually use: "All tests in test/auth pass, npm test exits 0, and git status shows only auth files changed, or stop after 20 turns." Every clause in that sentence corresponds to a command whose output appears in the transcript. The test run, the exit code, the file list. The judge never has to trust a claim, it only has to read a result.

Then the cap. Always the cap. "Or stop after 20 turns" inside the condition, and --max-turns 20 on the command line as a hard backstop. An unattended loop with a vibes condition and no cap is not an autonomy setup, it is a billing event with extra steps.

This is every LLM-as-judge pipeline, not just /goal

The reason I keep pushing on this detail is that it generalizes. Any pipeline where one model grades another model's work has the same property: the judge grades the paper trail, not the world.

Your eval harness that asks GPT or Claude to score outputs for "helpfulness" reads the output text, not the user's actual outcome. Your CI bot that asks a model whether a PR "follows the architecture guidelines" reads the diff and whatever context you stuffed in, not the running system. Your multi-agent setup where a reviewer agent approves a builder agent's work reads what the builder reported, unless you explicitly gave the reviewer its own tools and made it rerun the checks.

In every one of these, the verification is only as real as the evidence that reaches the judge. A judge reading rich, machine-generated evidence, test output, exit codes, diffs, logs, benchmark numbers, is doing something close to verification. A judge reading the working model's self-description is doing literary criticism.

The practical consequence for anyone building these pipelines: spend your effort on evidence generation, not on judge prompting. Teams tune the judge prompt for days, adding rubrics and chain-of-thought instructions, when the actual problem is that nothing checkable ever enters the context. A mediocre judge reading a real test run beats a brilliant judge reading a summary, every time, because the first one has something to be right about.

The check is the verification, the judge is the clerk

The mental model I give clients now: the real verification in an agent loop is still a boring deterministic command. npm test. docker build. curl against a health endpoint. The agent runs it, the world answers, and the answer is text in a transcript. The judge is a clerk who reads that answer and decides whether it satisfies the condition you wrote.

Clerks are genuinely useful. They handle the fuzzy matching, "two tests fail in login.test.ts" maps to "not done, the expired-token case still throws," and that reason steers the next turn better than a raw exit code would. A good clerk turns a failed check into a useful instruction. That is worth having.

But nobody confuses a clerk with an inspector, and that is the confusion I keep finding in the wild. When someone tells me their agent setup is self-verifying, I now ask them to point at the deterministic command in the loop and show me where its output lands in the transcript. If they can, the setup is probably fine. If they answer with the judge's prompt, the rubric, the second model's reasoning capabilities, they have theater.

Write conditions like you are leaving instructions for an auditor who will only ever see the file, never the building. Because that is precisely who is deciding when your agent stops.

Gloss What This Means For You

When you write a /goal completion condition, assume the judge is just a reader of the chat log. Phrase goals so the agent must produce verifiable artifacts in the transcript—run the tests, show the full output, include lint/build results, and surface any relevant diffs or file lists. Avoid subjective success criteria, and add an explicit “proof” clause so the loop can’t end on vibes alone.