Cost per Solved Task, Not Cost per Token

Key Takeaways

Per-token pricing was a useful proxy for chat, but it breaks down for agentic workflows where the number of turns and retries can vary wildly.
The metric that matters for agents is cost per solved task, and the biggest driver is whether the agent’s loops converge and halt—not the model’s sticker price.
The same high-priced model can be either the cheapest or the most expensive option depending on task type and convergence behavior, so rate cards alone can’t guide routing decisions.
Budget blowups in agentic deployments usually come from runaway loops (timeouts, repeated rewrites, endless retries) that go unnoticed, not from choosing an inherently “too expensive” model.
Spending figures (e.g., a $110 day of usage) are meaningless without outcomes; value depends on what actually shipped or got fixed.

A household circuit breaker panel with one switch flipped off, lit by soft window light

Uber burned through its annual AI budget in four months. The fix was a cap: 1,500 dollars per engineer, per tool, per month, for Claude Code and Cursor. When that number made the rounds, most of the commentary treated it as evidence that these tools are too expensive. I read it the opposite way. What Uber had was a halting problem dressed up as a pricing problem, and the cap is a blunt instrument for a discipline nobody had built yet.

The number printed on the model's price page used to be the number that mattered. For agentic work, it no longer is. The only number that matters now is cost per solved task, and the biggest lever on it has almost nothing to do with which model you pick. It comes down to whether your loops stop.

Per-token pricing made sense for chat

For the first few years of this market, per-token pricing was an honest proxy for cost. You sent a prompt, you got an answer, the transaction ended. Tokens in, tokens out, one unit of work per exchange. Comparing models on dollars per million tokens was like comparing cars on price per liter of fuel when every trip is the same length.

Agents broke that proxy. An agent does not produce one answer; it produces an open-ended sequence of attempts: read the codebase, run the tests, fail, read the error, try again. The trip length is no longer fixed. Two runs of the same task on the same model can differ in cost by a factor of fifty depending on whether the agent converges in three turns or grinds for two hours.

Once trip length varies that much, fuel price tells you very little about what the journey costs. The rate card became the least informative number on the invoice, and most procurement conversations I sit in are still negotiating it as if it were the only one.

The model that proves the point

Anthropic's Fable 5 is the cleanest case study I have seen, because it is simultaneously the most expensive model on the market and, for certain work, the cheapest.

The rate card looks brutal. Ten dollars per million input tokens, fifty per million output, double Opus 4.8 on both. And the model's defining trait makes the sticker worse: it keeps going. CodeRabbit ran it through 33 coding tasks and 19 of them hit the timeout rather than converging. The model does not know when it is finished. Left unattended, it will happily convert that uncertainty into output tokens at fifty dollars a million.

Then you look at the other end of the distribution. Stripe pointed Fable 5 at a migration across a 50-million-line Ruby codebase, work a team had scoped at more than two months. It finished in a day. I do not know what that run cost in tokens, but it does not matter much, because at any plausible token count the cost per solved task is a rounding error against two months of engineering salaries.

Same model, same rate card, opposite verdicts. On a quick edit or a code review pass, Fable 5 loses badly on cost per solved task: you are paying double the rate for a job Opus finishes faster, and CodeRabbit measured its review precision below Opus anyway, 32.8 percent against 35.5. On a long migration it wins by such a margin that the per-token price is irrelevant. The rate card cannot distinguish these two situations. Cost per solved task can, and it is the only lens that gets the routing decision right.
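To make the gap concrete, here is a minimal sketch of the arithmetic in Python. The rates are the ones quoted above, with Opus 4.8 taken as half of Fable 5 on both sides. The token counts and the dictionary keys are hypothetical, picked only to illustrate the spread; they are not measurements or API model names.

```python
# Dollars per million tokens as (input, output), from the rates quoted above.
# The keys are illustrative labels, not real API model identifiers.
RATES = {
    "fable-5": (10.0, 50.0),
    "opus-4.8": (5.0, 25.0),
}


def run_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one run at the model's per-token rates."""
    rate_in, rate_out = RATES[model]
    return (input_tokens * rate_in + output_tokens * rate_out) / 1_000_000


# Hypothetical token counts: a run that converges in a few turns...
quick_fable = run_cost("fable-5", 200_000, 20_000)      # 3.0 dollars
quick_opus = run_cost("opus-4.8", 200_000, 20_000)      # 1.5 dollars
# ...and a run on the same model that grinds until something stops it.
long_fable = run_cost("fable-5", 10_000_000, 1_000_000)  # 150.0 dollars

print(quick_fable, quick_opus, long_fable)
```

With these made-up counts the long run costs fifty times the quick one at an identical rate. The rate card accounts for the factor of two between models and says nothing about the factor of fifty.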

Simon Willison spent 110 dollars in a single day of ordinary use putting Fable through its paces. Whether that was expensive depends entirely on what landed. If it shipped a feature and fixed four library bugs, which in his case it did, that is a very good day at consultant rates. The dollar figure alone tells you nothing.

The bill is written by the loop, not the model

Here is the part I keep having to walk clients through. When an agentic deployment blows its budget, the postmortem almost never finds an expensive model. It finds a loop that did not halt.

The runs that hurt are the ones where the agent got stuck and nobody noticed. It rewrote the same file fourteen times. It re-ran a failing test suite for three hours, reading the same error and trying the same fix. Every one of those turns billed full price and produced nothing. A converging run and a stuck run look identical on the invoice; the difference only shows up when you divide spend by tasks that actually finished.

This means your effective cost per solved task is mostly a function of engineering you control, not pricing you negotiate. Two teams using the identical model at the identical rate can land an order of magnitude apart, because one of them built stop conditions and the other left the meter running.

It also means the new generation of autonomous models raises the stakes in both directions. Fable 5's persistence is the product; it is why the Stripe migration finished. The same persistence is why 19 of 33 tasks ran to timeout. The capability and the cost hazard are the same trait, and the only thing standing between them is whether you told it when to stop.

The three caps every production loop needs

After enough of these postmortems, the fix converges on the same three hard stops. I now treat them as a checklist before any loop runs unattended, the way you would not commission an electrical circuit without a breaker.

A max-turns cap. The runaway stop. Every loop gets a hard ceiling on iterations (twenty turns, fifteen turns, whatever fits the task), enforced by the harness and not by the model's judgment. In Claude Code that is --max-turns 20 on the command line, plus "or stop after 20 turns" written into the goal condition itself. This is the cap that catches the run that would otherwise go all night.

A no-progress stop. The stuck detector. A run can stay under its turn cap while accomplishing nothing, burning full-price tokens on the same failed approach. The simplest version is a wrapper that compares git diff --stat between turns and halts when nothing has changed for three or four rounds. No frameworks required, a few lines of shell. This is the cap that catches the agent rewriting the same file.

A budget ceiling. The wallet stop. A hard dollar limit at the workspace level, set once in the provider's console, that does not care how clever the run thinks it is. Turn caps stop a single runaway; the dollar ceiling stops a slow bleed across fifty quiet loops you forgot about. Uber's 1,500 dollar cap is exactly this control, applied at the level of people because nobody had applied it at the level of loops.

Three caps, none of them sophisticated. Most of the engineering in production agents turns out to be making sure things halt rather than prompting them well, and these three caps are the difference between a loop that is an engine and a loop that is a billing event.
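For a sense of how little code this takes, here is a rough sketch of all three stops in one harness. It is written in Python rather than shell, and everything in it is an assumption for illustration: run_turn is a hypothetical callable standing in for one agent turn, the default limits are arbitrary, and the in-loop dollar check is a per-loop stand-in for the workspace ceiling you would still set in the provider's console.

```python
import subprocess
from dataclasses import dataclass
from typing import Callable, Tuple


def git_diff_stat() -> str:
    """Coarse snapshot of the working tree: the output of `git diff --stat`."""
    result = subprocess.run(
        ["git", "diff", "--stat"], capture_output=True, text=True, check=True
    )
    return result.stdout


@dataclass
class LoopResult:
    reason: str   # "done", "max_turns", "no_progress", or "budget"
    turns: int
    spent: float  # dollars, summed from what run_turn reports


def run_capped(
    run_turn: Callable[[], Tuple[bool, float]],
    max_turns: int = 20,
    stall_limit: int = 3,
    budget: float = 25.0,
    snapshot: Callable[[], str] = git_diff_stat,
) -> LoopResult:
    """Run one agent loop under three hard stops.

    run_turn is a hypothetical callable that performs a single agent turn
    and returns (task_finished, dollars_spent_on_that_turn).
    """
    spent = 0.0
    stalled = 0
    last = snapshot()
    for turn in range(1, max_turns + 1):
        finished, cost = run_turn()
        spent += cost
        if finished:
            return LoopResult("done", turn, spent)
        # Wallet stop: halt once cumulative spend reaches the ceiling.
        if spent >= budget:
            return LoopResult("budget", turn, spent)
        # Stuck detector: count consecutive turns with an unchanged snapshot.
        current = snapshot()
        stalled = stalled + 1 if current == last else 0
        last = current
        if stalled >= stall_limit:
            return LoopResult("no_progress", turn, spent)
    # Runaway stop: the turn ceiling was reached without finishing.
    return LoopResult("max_turns", max_turns, spent)
```

The snapshot function is injectable so the stuck detector can be tried without a repository, for example run_capped(lambda: (False, 1.0), snapshot=lambda: "same") halts with reason "no_progress" after three turns. Recording the halt reason alongside the spend is what lets the failed runs show up later when you compute cost per solved task.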

Measuring it honestly

Cost per solved task only works as a metric if you compute it without flattering yourself. The denominator is tasks that actually landed: merged, deployed, accepted. The numerator is everything you spent getting there, including the runs that timed out, the runs the no-progress detector killed, and the retries. The failed runs are not noise to exclude; they are the metric. A model that converges 9 times out of 10 at double the rate can beat one that converges half the time at the cheaper rate, because a failed run typically burns far more tokens than a successful one before anything stops it, and you can only see that if the failures stay in the numerator.
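A rough worked version of that comparison, assuming every failed run is paid for in full and the task is retried until it lands. The dollar figures are hypothetical; the point is only how much the cost of a failure moves the answer.

```python
def cost_per_solved_task(
    success_rate: float, cost_success: float, cost_failure: float
) -> float:
    """Expected spend per landed task when failures are retried until success.

    On average each landed task drags along (1 - p) / p failed runs,
    and every one of them stays in the numerator.
    """
    if not 0.0 < success_rate <= 1.0:
        raise ValueError("success_rate must be in (0, 1]")
    failures_per_success = (1.0 - success_rate) / success_rate
    return cost_success + cost_failure * failures_per_success


# Hypothetical figures: failed runs grind to a cap and cost 10x a clean run.
pricey = cost_per_solved_task(0.9, cost_success=2.0, cost_failure=20.0)
cheap = cost_per_solved_task(0.5, cost_success=1.0, cost_failure=10.0)
print(f"pricey model: {pricey:.2f}, cheap model: {cheap:.2f}")

# If failures cost no more than successes, the cheaper model edges ahead.
pricey_flat = cost_per_solved_task(0.9, cost_success=2.0, cost_failure=2.0)
cheap_flat = cost_per_solved_task(0.5, cost_success=1.0, cost_failure=1.0)
print(f"pricey model: {pricey_flat:.2f}, cheap model: {cheap_flat:.2f}")
```

Under the first set of assumptions the double-rate model comes out well under half the cost per landed task. Under the second it comes out slightly more expensive, which is why the cost of failed runs has to be measured and not assumed.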

You do not need elaborate tooling for this. A spreadsheet with task, model, spend, and outcome, kept for a month, will tell you more about your real economics than any benchmark. In my experience it also reshuffles model choices fast. Teams discover that their default model is wrong in both directions, too big for the small work, too small for the big work, and the rate card was hiding both errors.
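If the spreadsheet is exported as CSV, the roll-up is a few lines. This is a sketch under assumptions: the column names follow the four fields above, "landed" as the success marker is my own convention, and the rows are made-up sample data.

```python
import csv
import io
from collections import defaultdict


def cost_per_solved_by_model(rows):
    """Total spend per model (failed runs included) divided by landed tasks."""
    spend = defaultdict(float)
    solved = defaultdict(int)
    for row in rows:
        spend[row["model"]] += float(row["spend"])
        if row["outcome"] == "landed":
            solved[row["model"]] += 1
    # A model with spend but nothing landed gets an infinite cost per task.
    return {
        model: (spend[model] / solved[model] if solved[model] else float("inf"))
        for model in spend
    }


# Hypothetical log: one row per run, not per task, so retries are visible.
LOG = """
task,model,spend,outcome
fix-flaky-test,model-a,4.00,landed
rename-module,model-a,6.00,timeout
rename-module,model-a,5.00,landed
bump-deps,model-b,1.00,killed
bump-deps,model-b,2.00,landed
"""

report = cost_per_solved_by_model(csv.DictReader(io.StringIO(LOG.strip())))
for model, cost in sorted(report.items()):
    print(f"{model}: {cost:.2f} dollars per landed task")
```

Logging one row per run rather than one per task is the design choice that matters here, because it keeps the timed-out and killed runs in the total.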

The vendors will keep competing on dollars per million tokens because it fits on a pricing page. Your accountant will keep asking about it because it is the number on the contract. Neither of them is wrong, exactly. But the organizations getting real leverage out of agents have quietly stopped arguing about it, because they learned what Uber learned the expensive way: the model's price was never where the money went. The money goes wherever the loop is allowed to take it.

What This Means For You

Track your AI spend by completed outcomes, not tokens, and instrument your agent runs so you can see when they’re stalling. Put hard stop conditions in place—timeouts, max iterations, and escalation to a human—so a stuck loop can’t quietly burn budget. Then route work by task type: use cheaper, faster models for short edits and reviews, and reserve pricier models for high-leverage jobs where finishing quickly dominates token costs.