Gloss Key Takeaways
  1. Public benchmarks are no longer a meaningful differentiator; frequent internal evals that match your product’s real tasks are.
  2. Keep the eval harness minimal so it actually gets run: a small golden dataset, a simple scorer, a runner, and a report driven by one command.
  3. Measure one clearly defined feature with explicit success criteria (e.g., policy compliance), not vague notions of “good answers.”
  4. Start with a small, trusted golden dataset (20–200 real examples) created with domain experts, and treat it as a quality bar, not training data.
  5. Begin with fast, simple keyword/forbidden-term scoring to catch most regressions, and upgrade to LLM-as-judge later without changing the harness interface.

A 30-Minute Eval Harness You Will Actually Run Every Week

Open coding models keep stacking up against the same capability ceiling. Last quarter's benchmark gap is this quarter's rounding error. The differentiator is not which public benchmark your team likes. The differentiator is whether you have an internal evaluation that measures the thing your product actually does, run often enough that you catch regressions before your users do.

Most teams skip evaluations because every framework feels heavy. You assess the framework for a week, install three packages, configure five YAML files, build a custom runner, and never look at the results. The harness becomes the project. The actual measurement never happens.

This post builds the smallest evaluation harness that is still useful. Golden dataset, scorer, runner, report, all driven by one Makefile target. You can have it running in 30 minutes. You will actually run it every week, because there is nothing to maintain.

What you are measuring

Pick one feature. Not your whole product. One feature that has a clear input and output. For this example, a customer support agent that answers refund policy questions. Input: a customer question. Output: a response.

The check is not "is the answer good?" The check is "does the answer match the policy we wrote down?" If you cannot write down what good looks like, you cannot measure it. Go write down what good looks like first.

The golden dataset

A golden dataset is a small set of input-output pairs you trust. Twenty examples is enough to start. Two hundred is plenty. The mistake is treating it as training data. It is not. It is the bar your system has to clear.

{"id": "001", "input": "Can I get a refund after 30 days?", "expected_keywords": ["30 day", "no refund", "exchange"], "must_not_contain": ["yes", "always"]}
{"id": "002", "input": "My package arrived damaged.", "expected_keywords": ["damaged", "refund", "photo"], "must_not_contain": []}
{"id": "003", "input": "I want a refund for a digital product.", "expected_keywords": ["digital", "non-refundable"], "must_not_contain": ["refund issued"]}

JSONL because it is the simplest format that survives editing. One line per example. Add IDs so you can reference specific failures. The expected_keywords and must_not_contain fields are what the scorer checks each output against, which we'll get to.

Build this dataset by sitting with your support team for an hour. Have them write down 20 real questions and the answer they would give. That's the dataset. It is more valuable than any synthetic generation pipeline.
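Because the file is edited by hand, a quick sanity check before each run can save a confusing failure later. The sketch below is one possible way to do that, not part of the harness itself: `validate_dataset` and `REQUIRED_FIELDS` are hypothetical names, and the rules (required fields, unique IDs, no blank lines) are assumptions based on the example rows above.

```python
import json

# Field names taken from the example rows; adjust if your schema differs.
REQUIRED_FIELDS = ("id", "input", "expected_keywords", "must_not_contain")

def validate_dataset(lines):
    """Return a list of human-readable problems found in JSONL lines."""
    problems = []
    seen_ids = set()
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            problems.append(f"line {lineno}: blank line")
            continue
        try:
            example = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append(f"line {lineno}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(example, dict):
            problems.append(f"line {lineno}: expected a JSON object")
            continue
        missing = [name for name in REQUIRED_FIELDS if name not in example]
        if missing:
            problems.append(f"line {lineno}: missing fields {missing}")
            continue
        if example["id"] in seen_ids:
            problems.append(f"line {lineno}: duplicate id {example['id']!r}")
        seen_ids.add(example["id"])
    return problems
```

Call it with the lines of the dataset file, for example `validate_dataset(Path("evals/golden.jsonl").read_text().splitlines())`, and treat a non-empty result as a reason to fix the file before trusting the pass rate.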

The scorer

The scorer takes an example and the output your feature produced, and returns a small result with a score and a pass/fail flag. Start simple. You can always make it smarter.

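import json

def score_example(example: dict, output: str) -> dict:
    """Score one output against the example's expected and forbidden terms."""
    output_lower = output.lower()
    expected = example["expected_keywords"]

    # Count expected keywords found (case-insensitive substring match).
    keyword_hits = sum(1 for kw in expected if kw.lower() in output_lower)
    # An example with no expected keywords is judged on forbidden terms alone.
    keyword_score = keyword_hits / len(expected) if expected else 1.0

    forbidden_hits = sum(
        1 for kw in example["must_not_contain"]
        if kw.lower() in output_lower
    )
    # Despite the name, 1.0 means clean and 0.0 means a forbidden term appeared.
    forbidden_penalty = 1.0 if forbidden_hits == 0 else 0.0

    return {
        "id": example["id"],
        "keyword_score": keyword_score,
        "forbidden_penalty": forbidden_penalty,
        "passed": keyword_score >= 0.7 and forbidden_penalty == 1.0,
    }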
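One caveat with the substring check: a short forbidden term such as "yes" also matches inside "yesterday". If that produces false failures, a whole-word match is one option for the must_not_contain list. This is a sketch under that assumption; `contains_term` is a hypothetical helper, not something the scorer above already uses.

```python
import re

def contains_term(text: str, term: str) -> bool:
    """True if `term` appears in `text` as a whole word or phrase, ignoring case."""
    # Lookarounds instead of \b so terms that start or end with punctuation still work.
    pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None
```

The trade-off is strictness: with whole-word matching, the expected keyword "30 day" no longer matches "30 days". That is why this probably belongs on the forbidden list only, with the looser substring check kept for expected keywords.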

Keyword matching is unfashionable and underrated. It catches most regressions and runs in milliseconds. When you outgrow it, swap in an LLM-as-judge scorer behind the same interface. The harness does not change.

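def score_with_llm(example: dict, output: str) -> dict:
    # call_llm is a placeholder for your own model client. It takes a prompt
    # string and returns the model's raw text reply.
    prompt = f"""
Question: {example['input']}
Expected concepts: {example['expected_keywords']}
Forbidden concepts: {example['must_not_contain']}
Response: {output}

Did the response cover the expected concepts and avoid the forbidden ones?
Return JSON: {{"passed": true|false, "reason": "..."}}
"""
    judgment = call_llm(prompt)
    try:
        verdict = json.loads(judgment)
    except json.JSONDecodeError:
        # Treat an unparseable verdict as a failure instead of crashing the run.
        verdict = {"passed": False, "reason": "judge returned invalid JSON"}
    return {"id": example["id"], **verdict}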

Same shape, smarter inside. Your harness does not care.
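To make "same shape" concrete: the only contract is a function that takes an example and an output and returns a dict with an "id" and a "passed" flag. The sketch below shows that contract with two toy scorers. The names `Scorer`, `score_all`, `any_keyword_scorer`, and `length_scorer` are illustrative assumptions, not part of the harness.

```python
from typing import Callable

# Any function with this signature can be plugged into the harness.
Scorer = Callable[[dict, str], dict]

def score_all(examples: list, outputs: list, scorer: Scorer) -> list:
    """Apply one scorer to paired examples and outputs."""
    return [scorer(ex, out) for ex, out in zip(examples, outputs)]

def any_keyword_scorer(example: dict, output: str) -> dict:
    # Toy scorer: passes if at least one expected keyword appears.
    hit = any(kw.lower() in output.lower() for kw in example["expected_keywords"])
    return {"id": example["id"], "passed": hit}

def length_scorer(example: dict, output: str) -> dict:
    # Toy scorer: passes if the output is non-empty and at most 500 characters.
    return {"id": example["id"], "passed": 0 < len(output) <= 500}
```

Swapping scorers is then a one-argument change, for example `score_all(examples, outputs, length_scorer)`, and everything downstream that reads "passed" keeps working.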

The runner

The runner reads the dataset, calls your feature, scores the output, writes a report.

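import json
import sys
from datetime import datetime
from pathlib import Path

# score_example is the scorer defined above. call_my_feature is a placeholder
# for your own entry point: it takes the input string and returns the response.

def run_check(dataset_path: str, output_dir: str) -> Path:
    # Skip blank lines so a stray newline in the JSONL file does not crash the run.
    lines = Path(dataset_path).read_text().splitlines()
    examples = [json.loads(line) for line in lines if line.strip()]
    results = []

    for ex in examples:
        try:
            output = call_my_feature(ex["input"])
            score = score_example(ex, output)
            results.append({**score, "input": ex["input"], "output": output})
        except Exception as e:
            # A crash in the feature or the scorer counts as a failed example.
            results.append({"id": ex["id"], "passed": False, "error": str(e)})

    timestamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    report_path = out_dir / f"report-{timestamp}.json"
    report_path.write_text(json.dumps({
        "timestamp": timestamp,
        "total": len(results),
        "passed": sum(1 for r in results if r.get("passed")),
        "results": results,
    }, indent=2))

    return report_path

if __name__ == "__main__":
    # Print the report path so the next command in the pipeline can pick it up.
    print(run_check(sys.argv[1], sys.argv[2]))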

That is the whole runner. About 30 lines. It does the one thing the harness exists to do.

The report

The default report is the JSON file. Useful for diffing. To see what failed at a glance, add a second script that prints a summary.

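import json
import sys
from pathlib import Path

def summarize_report(report_path: str) -> bool:
    """Print the pass rate and each failure; return True if everything passed."""
    report = json.loads(Path(report_path).read_text())
    print(f"Pass rate: {report['passed']}/{report['total']}")
    print()
    for r in report["results"]:
        if not r.get("passed"):
            print(f"FAIL {r['id']}: {r.get('error') or r.get('reason') or 'low score'}")
            if "input" in r:
                print(f"  Input: {r['input'][:100]}")
                print(f"  Output: {r.get('output', '')[:200]}")
            print()
    return report["passed"] == report["total"]

if __name__ == "__main__":
    # Exit non-zero on any failure so make and CI can react to it.
    # Relax this to a threshold if your baseline is below 100%.
    sys.exit(0 if summarize_report(sys.argv[1]) else 1)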

That output is the report. Pass rate at the top, failures listed below with their input and output. If you need a graph, point a notebook at the JSON files in the output directory. You probably do not need a graph.

The Makefile target

This is the part that determines whether you actually run the suite.

.PHONY: check

DATASET := evals/golden.jsonl
REPORT_DIR := evals/reports

check:
	@mkdir -p $(REPORT_DIR)
	@python evals/run.py $(DATASET) $(REPORT_DIR) | tail -1 | xargs python evals/summarize.py

Type make check. Get a pass rate and a list of failures. Done.

Add it to a CI job that runs on a schedule. Once a week is plenty for early teams. Daily if you ship daily.

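on:
  schedule:
    - cron: "0 9 * * 1"  # Mondays at 09:00 UTC
  workflow_dispatch:

jobs:
  check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: make check
      # Keep the JSON report even when the check fails.
      - if: always()
        uses: actions/upload-artifact@v4
        with:
          name: eval-reports
          path: evals/reports
      - if: failure()
        env:
          # Assumes a repository secret named SLACK_WEBHOOK.
          SLACK_WEBHOOK: ${{ secrets.SLACK_WEBHOOK }}
        run: |
          echo "Score regressed. Posting to Slack."
          curl -X POST -H 'Content-type: application/json' \
            --data '{"text":"Score regressed. Check the report."}' \
            "$SLACK_WEBHOOK"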

Monday morning, 9 AM UTC, the suite runs. If it regresses, Slack tells you. The report is in the artifacts. You read it before standup.

Why this works when other harnesses don't

Three properties.

It runs from a single command. The friction of "how do I run this thing" is the reason most teams don't measure their systems. make check removes the friction.

It has zero dependencies you don't already have. Python, a JSONL file, a Makefile. You can read every line of code in 10 minutes. Nothing breaks because nothing was magical.

It produces a number. Pass rate today, pass rate last week. If the number went down, something broke. You don't need a dashboard. You don't need a measurement framework. You need to know whether the number went down.
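Comparing this week's number with last week's can also be a few lines of code. The sketch below assumes the report files written by the runner are kept in one directory between runs (committed, or downloaded from a previous job), and that their names sort by time, as the report-YYYYmmdd-HHMMSS.json pattern does. `pass_rate` and `compare_latest` are hypothetical helper names.

```python
import json
from pathlib import Path

def pass_rate(report: dict) -> float:
    """Fraction of examples that passed in one report."""
    return report["passed"] / max(1, report["total"])

def compare_latest(report_dir: str):
    """Return (previous, current) pass rates for the two newest reports, or None."""
    # Timestamped filenames sort chronologically, so the last two are the newest.
    paths = sorted(Path(report_dir).glob("report-*.json"))
    if len(paths) < 2:
        return None
    previous, current = (json.loads(p.read_text()) for p in paths[-2:])
    return pass_rate(previous), pass_rate(current)
```

If the second value is lower than the first, the number went down, and that is the signal to go read the failures.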

The temptation will be to make this fancier. Resist. Make the dataset bigger before you make the framework smarter. Add a second feature's check before you add a UI. The harness is not the product. The measurement is the product. Everything else is overhead.

Run it Monday. Read the failures. Fix the worst one. Run it again next Monday. That's the whole loop. It is small enough to actually do, which is the only quality that matters in this kind of harness.

Gloss What This Means For You

Pick a single high-impact feature in your product and write down what “correct” means in concrete terms. Then build a tiny JSONL golden set from real user questions with your frontline team, add a quick keyword/forbidden-term scorer, and wire it to a one-command runner you can execute weekly. Once that’s in place, you’ll catch regressions early and can iterate on scoring sophistication (including LLM judging) without turning evaluation into a maintenance project.