The Benchmarks Started Measuring Endurance

Key Takeaways

Classic benchmarks like SWE-bench Verified are saturated, so small score differences are mostly marketing and don’t predict real-world performance well.
New long-horizon benchmarks (e.g., SWE-bench Pro, FrontierCode Diamond, GDPval-AA) measure endurance on messy, multi-hour tasks and show much larger, meaningful gaps between frontier models.
Long tasks amplify small per-step reliability differences because success compounds across many dependent steps (roughly p^n), turning near-ties into blowouts.
Top long-horizon performance also depends on recovery behavior (detecting mistakes, revising approach, and maintaining the goal over hours) rather than just answering a single prompt correctly.
In practice, buyers pay for finished work, and endurance benchmarks better reflect that than isolated bug-fix or single-question tests.

A hydraulic fatigue-testing rig cycling a metal specimen in an engineering lab, measuring how long the material holds rather than how much it lifts

A while back I wrote that GPT-5.5's 88.7 percent on SWE-bench was a marketing number. The argument was simple: SWE-bench measures isolated bug fixes on well-documented repositories, the most structured slice of an engineer's week, and labs were selling single percentage points of progress on it as if they meant something about real work. I stood by that piece then and I stand by it now.

Something changed in 2026, though, and it deserves an honest follow-up. The benchmarks that separate frontier models today are not the ones I was complaining about. The new generation measures whether a model can stay on a messy task for hours, with tools, without losing the plot. And on those benchmarks, the gaps between models stop being theater and start being enormous.

The crowded top of the old leaderboard

Look at classic SWE-bench Verified after the Fable 5 release this week. Anthropic's new model posts 95.0, Opus 4.8 sits at 88.6, GPT-5.5 at 82.6. Real differences, but compressed. Every frontier model now clears the bar of "given a clear bug report and a known codebase, produce a working fix." The benchmark is doing what saturated benchmarks always do: it confirms everyone at the table is competent and tells you almost nothing about who to hire.

This is exactly the regime where my benchmark theater critique applied. When the field is bunched within about a dozen points on a test everyone has optimized for, a press release celebrating a 1.5-point gain is marketing, not measurement.

The gaps that explode

Move to the long-horizon sets and the picture changes completely. On SWE-bench Pro, which uses harder, multi-file, more realistic engineering tasks, the same three models score 80.3, 69.2, and 58.6. Fable's lead over Opus goes from 6.4 points to 11.1, and the gap to GPT-5.5 widens from 12.4 points to nearly 22.

On FrontierCode Diamond, the hardest of the new coding sets, the spread becomes a different category of thing entirely: 29.3 for Fable 5, 13.4 for Opus 4.8, 5.7 for GPT-5.5. The leader is more than double the second-place model and five times the third. Nobody is within a rounding error of anybody.
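To make the widening concrete, here is a minimal Python sketch that recomputes the point gaps and ratios from the scores quoted above. The `scores` table and the `spread` helper are illustrative names, not part of any benchmark's tooling, and the inputs are only the vendor-reported figures already cited.

```python
# Scores as quoted in the text, in the order (Fable 5, Opus 4.8, GPT-5.5).
scores = {
    "SWE-bench Verified": (95.0, 88.6, 82.6),
    "SWE-bench Pro": (80.3, 69.2, 58.6),
    "FrontierCode Diamond": (29.3, 13.4, 5.7),
}


def spread(leader, second, third):
    """Return point gaps and ratios between the leader and the other two models."""
    return {
        "gap_to_second": round(leader - second, 1),
        "gap_to_third": round(leader - third, 1),
        "ratio_to_second": round(leader / second, 2),
        "ratio_to_third": round(leader / third, 2),
    }


spreads = {name: spread(*triple) for name, triple in scores.items()}

for name, s in spreads.items():
    print(name, s)
```

The ratios are the column to watch. On SWE-bench Verified the leader is only a few percent ahead of second place in relative terms; on FrontierCode Diamond it is more than twice as high.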

The same pattern shows up outside coding. GDPval-AA evaluates real economic knowledge-work tasks, the kind of multi-hour analysis and document work that white-collar jobs are made of, and Fable 5 scores 1932 against Gemini 3.1 Pro's 1314. These are not single-prompt quizzes, they are jobs, and on jobs the leaderboard reshuffles hard.

Why long horizons separate models

The mechanism is compounding, and it is worth doing the napkin math once because it explains the whole 2026 leaderboard.

A long task is a chain of dependent steps. Read the codebase, form a plan, edit a file, run the tests, interpret the failure, adjust, repeat for hours. If a model succeeds at each step with probability p, its odds of finishing an n-step task are roughly p to the power n. Two models that look nearly identical per step, say 99 percent against 97 percent, land in different universes over a hundred steps: about 37 percent completion against about 5 percent.
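The napkin math is easy to check. The sketch below assumes every step succeeds independently with the same probability p, which is a deliberate simplification: real agents can notice and repair mistakes, so treat the output as an illustration of compounding rather than a model of any particular system. The function names are illustrative.

```python
import math


def completion_probability(p: float, n: int) -> float:
    """Chance of finishing n dependent steps if each succeeds independently with probability p."""
    return p ** n


def steps_until(p: float, target: float = 0.5) -> float:
    """Number of steps at which the completion probability has fallen to `target`."""
    return math.log(target) / math.log(p)


for p in (0.99, 0.97):
    print(
        f"p={p}: "
        f"100-step completion = {completion_probability(p, 100):.3f}, "
        f"steps to 50% = {steps_until(p):.1f}"
    )
```

Under the same assumption, the 99 percent model keeps better-than-even odds out to roughly 69 steps, while the 97 percent model drops below even odds after about 23. A two-point difference per step becomes a threefold difference in how long a task each model can be trusted with.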

Single-question benchmarks measure p. Long-horizon benchmarks measure p to the power n. That is why SWE-bench Verified shows a crowded field while FrontierCode Diamond shows a blowout. The models are genuinely close on individual steps. They are nowhere near close on not falling over across a hundred of them.

There is a second ingredient beyond raw reliability: knowing what to do after a mistake. The endurance benchmarks reward models that notice a failed test, back out of a bad approach, and keep the original goal in view three hours in. The single most quoted line about Fable 5 came from Zapier, and it describes exactly this trait: "Where Opus stops to ask, Fable 5 keeps looking."

This is the thing buyers actually pay for

I have spent the past two years helping organizations deploy these tools, and no client has ever paid for a correct answer to a well-specified question. They pay for finished work. The migration completed, the report delivered, the integration tested and merged. The unit of value is the outcome at the end of a long, messy chain, not any individual link.

The old benchmarks measured the link. The new ones measure the chain, and the chain is the product. Stripe reports that Fable 5 completed a migration across a 50-million-line Ruby codebase in a single day, work a team had scoped at more than two months. Whatever discount you apply to a customer quote in a launch post, that is a claim about endurance, not about answer quality. No score on SWE-bench Verified predicts it. The long-horizon scores at least point at it.

This is the part I got right in the benchmark theater piece without following it to the conclusion. I argued the gap between benchmark performance and production performance was the only number that mattered, and that nobody published it. The long-horizon benchmarks are the first public attempt to close that gap from the benchmark side. They are still proxies. They are much better proxies.

Still ceilings, not promises

The caveats from the original piece have not gone anywhere, so let me apply them to the new numbers with the same skepticism.

These are vendor-reported figures. Some have been cross-checked by independent aggregators like Artificial Analysis, which is better than nothing, but the lab that publishes the chart chose the chart. Treat every number above as a ceiling under favorable conditions, not a promise about your codebase.

Endurance also has a failure mode that the headline scores hide. CodeRabbit ran Fable 5 on 33 coding tasks, and 19 of them, more than half, ran to timeout rather than converging. The same persistence that wins FrontierCode Diamond will happily burn tokens past the point of usefulness. Simon Willison spent 110 dollars in one day of ordinary use. A model that does not stop is impressive on a benchmark with a fixed horizon and expensive in a harness without one.

And the per-point theater is already migrating to the new benchmarks. The moment SWE-bench Pro becomes the number in the keynote, labs will tune for it, the field will compress, and a 0.8 point gain will get its own slide. The benchmark treadmill did not break in 2026, it just moved to a better gym.

What this changes about your own evaluation

The practical advice from the first piece survives intact: run the model on your actual workload before believing anything. What changes is the shape of the test you should run.

An internal eval built from 50 single-prompt questions is now measuring the dimension where every frontier model is fine. If you want to know which model to standardize on, give each one the same genuinely long task from your backlog. A real migration, a multi-service refactor, a report that requires pulling from six systems. Set an explicit stop condition and a budget cap, because the endurance models will not set one for themselves. Then measure cost per finished task, not tokens, not latency, not score.
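One way to wire that up, as a rough sketch rather than a recommended harness: wrap each run in a step limit and a dollar budget, record whether the task's own stop condition was met, and divide total spend (including spend on runs that never finished) by the number of finished tasks. Everything here is hypothetical scaffolding. `StepFn` stands in for whatever drives one agent step in your setup, and `scripted` is a toy replacement for a real model so the example runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Tuple


@dataclass
class TaskResult:
    finished: bool   # True only if the task's own stop condition was met
    cost_usd: float  # total spend for this run, finished or not
    steps: int


# Hypothetical interface: one call performs one agent step and returns
# (cost of that step in dollars, whether the task is now done).
StepFn = Callable[[], Tuple[float, bool]]


def run_with_caps(step: StepFn, max_steps: int, budget_usd: float) -> TaskResult:
    """Run one task until it finishes, reaches the budget, or hits the step limit."""
    spent = 0.0
    for i in range(1, max_steps + 1):
        cost, done = step()
        spent += cost
        if done:
            return TaskResult(True, spent, i)
        if spent >= budget_usd:
            return TaskResult(False, spent, i)
    return TaskResult(False, spent, max_steps)


def cost_per_finished_task(results: Iterable[TaskResult]) -> float:
    """Total spend across all runs divided by the number of runs that finished."""
    results = list(results)
    finished = sum(r.finished for r in results)
    total = sum(r.cost_usd for r in results)
    return total / finished if finished else float("inf")


def scripted(steps_needed: int, cost_per_step: float) -> StepFn:
    """Toy stand-in for a real agent: finishes after a fixed number of steps."""
    state = {"n": 0}

    def step() -> Tuple[float, bool]:
        state["n"] += 1
        return cost_per_step, state["n"] >= steps_needed

    return step


results = [
    run_with_caps(scripted(10, 0.50), max_steps=50, budget_usd=20.0),   # converges
    run_with_caps(scripted(500, 0.50), max_steps=50, budget_usd=20.0),  # stopped by the budget cap
]
print(cost_per_finished_task(results))
```

The design choice that matters is charging unfinished runs to the finished ones. In the toy data, the one completed task cost 5 dollars on its own, but the capped run adds another 20, so the figure you would actually pay per finished task is 25.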

On that metric the expensive model often wins and sometimes loses badly, and which one happens depends entirely on whether the task actually needed the endurance. That is information no public leaderboard will ever give you.

The deeper shift is worth sitting with. For three years we ranked these systems the way schools rank students, by their answers to questions. In 2026 we started ranking them the way employers rank people, by whether they finish what they start. The second ranking disagrees with the first, and the second one is the one the market was always going to settle on, because it is the one the money cares about. The benchmarks did not get more honest, they got closer to the job.

What This Means For You

When you evaluate or choose a model, stop over-weighting saturated leaderboards and ask for evidence on long-horizon, tool-using tasks that resemble your actual workflows. Run a small bake-off using multi-step projects (debugging across files, iterating on tests, drafting and revising documents) and measure completion rate and recovery after failures, not just first-try accuracy. Watch for models that can keep context and momentum for hours, because that’s where small reliability differences turn into big productivity gains.