Gloss

Tagged: evals

Watching Your Agent Work Is Not the Same as Knowing It Works

Teams instrument their agents before they grade them: 89 percent run observability and only 52 percent run evals. Watching what an agent did is not the same as knowing whether it was any good.

A 30-Minute Eval Harness You Will Actually Run Every Week

As open coding models hit similar capability ceilings, the differentiator is internal evals tied to your product. Here is one you will actually run.
