Benchmark before you ship
I have shipped two benchmark harnesses in the last year: kb-arena for retrieval strategies and ai-workflow-benchmark for AI coding workflows. People have asked, reasonably, why I keep building benchmarks instead of just shipping the feature. This is the answer.
The benchmark forces a definition
A benchmark is a definition of “better” that you commit to before you ship. Without one, “better” is whatever you remember about the last demo. That sounds harmless until the second demo, when you cannot reconstruct why one configuration was preferred over another, and you reach for the configuration that produced the nicer-looking output for the example you happened to test.
The benchmark removes the ambiguity. kb-arena’s three axes are recall@5, exact-answer presence, and p95 latency. Those are not the only things that matter about a retrieval pipeline, but they are the things kb-arena commits to. A strategy that scores higher on those three axes wins. A strategy that loses on those three axes but feels nicer in a demo does not win. The benchmark is the arbiter, not the demo.
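As a rough illustration of what committing to those three axes means in code, here is a minimal sketch. The function names, the case-insensitive substring check for exact-answer presence, and the nearest-rank percentile are all assumptions made for this example, not kb-arena's actual implementation.

```python
import math
from typing import Sequence


def recall_at_k(retrieved: Sequence[str], relevant: set[str], k: int = 5) -> float:
    """Fraction of the relevant document ids that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def exact_answer_present(passages: Sequence[str], answer: str) -> bool:
    """True if the gold answer appears (case-insensitively) in any retrieved passage."""
    needle = answer.strip().lower()
    return any(needle in passage.lower() for passage in passages)


def p95_latency(latencies_ms: Sequence[float]) -> float:
    """95th-percentile latency using the nearest-rank method."""
    if not latencies_ms:
        raise ValueError("no latency samples")
    ordered = sorted(latencies_ms)
    rank = math.ceil(95 * len(ordered) / 100)  # 1-based nearest rank
    return ordered[rank - 1]
```

The point of writing these down first is that each one encodes a decision (what counts as relevant, what counts as an exact answer, which percentile method) that would otherwise be made implicitly after seeing the results.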
ai-workflow-benchmark’s twelve capability dimensions plus sigmoid scoring do the same job for AI coding workflows. The dimensions are deliberately uneven in difficulty. The sigmoid scoring rewards solutions in the band between “minimum acceptable” and “near-optimal” and gives diminishing returns above that, which discourages overengineering. The benchmark commits to a definition of “good code” before any workflow is scored.
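To make the shape of that scoring concrete, here is a minimal sketch of a sigmoid score with two anchor points. The parameter names and the choice to pin the score at 0.05 for "minimum acceptable" and 0.95 for "near-optimal" are assumptions for illustration; the actual constants in ai-workflow-benchmark may differ.

```python
import math


def sigmoid_score(raw: float, min_acceptable: float, near_optimal: float) -> float:
    """Map a raw measurement to (0, 1) with a logistic curve.

    Assumed calibration: 0.05 at min_acceptable, 0.95 at near_optimal.
    """
    if near_optimal <= min_acceptable:
        raise ValueError("near_optimal must exceed min_acceptable")
    midpoint = (min_acceptable + near_optimal) / 2
    # ln(19) is the logit of 0.95, so this steepness hits both anchor points.
    steepness = 2 * math.log(19) / (near_optimal - min_acceptable)
    z = steepness * (raw - midpoint)
    # Numerically stable logistic: avoid exp() overflow for very negative z.
    if z >= 0:
        return 1 / (1 + math.exp(-z))
    e = math.exp(z)
    return e / (1 + e)
```

With anchors at 40 and 60, moving from 40 to 60 raises the score by 0.90, while moving from 60 to 80 adds less than 0.05. That flattening is the diminishing-returns behavior described above.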
The benchmark catches surprising winners
The interesting result from kb-arena is not which strategy won on the included corpus. It is that the answer changes per corpus. Naive vector loses on technical documentation. QnA pair extraction wins on FAQ-shaped data. Knowledge graph wins on structured records. Hybrid is the strongest default but not the strongest specialist. None of this is in the AI-blogosphere consensus, which mostly recommends naive vector and moves on.
The interesting result from ai-workflow-benchmark is that the gap between the top-performing workflow configuration and the median configuration is wider than the gap between the top model and the second-best model. If you are choosing where to spend engineering effort, the marginal hour on workflow tooling returns more than the marginal hour waiting for the next frontier model. That conclusion is uncomfortable for several adjacent industries, which is part of how I know it is worth saying.
Neither result was the headline I expected when I started. Both surfaced because the benchmark put numbers on configurations I would have eyeballed and called “about the same.”
The benchmark is cheaper than the cleanup
The cost of shipping the wrong configuration is not the time it took to ship. It is the time it takes to undo. By the time the wrong retrieval strategy is wired into your product, three other things depend on it: prompt templates that assume the chunk shape, evaluation scripts that assume the latency profile, and customer expectations that anchor to the current behavior. Replacing it later is a multi-week project. Picking the right one at the start, given a benchmark to compare against, is a one-hour decision.
ai-workflow-benchmark exists because I was about to commit a team to a coding workflow on the strength of a single demo. The benchmark exists so I could check, before that commitment, whether the demo was representative or a lucky run.
What a good benchmark looks like
Three properties matter more than the rest.
First, the scoring function has to be defended in advance, not retrofitted to the result. Sigmoid scoring with explicit saturation points is the kb-arena and ai-workflow-benchmark answer. Binary pass/fail loses partial-credit signal. Naive percent loses the difference between a near-miss and a wild miss.
Second, the test set has to be sized and split honestly. A benchmark with 10 questions is a demo. A benchmark with 1,000 questions on a single corpus is a benchmark for one corpus, not a benchmark for the strategy. Source-disjoint or event-disjoint splits matter for the same reason they matter in any ML evaluation: random splits over near-duplicate data inflate every metric.
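One way to get a source-disjoint split is to assign whole sources, rather than individual questions, to train or test. The sketch below assumes each example is a dict with a "source" key (a hypothetical field name) and uses a hash of the source id so the assignment is deterministic across runs.

```python
import hashlib
from typing import Iterable, Mapping


def is_test_source(source_id: str, test_fraction: float = 0.2) -> bool:
    """Deterministically map a source id to the test split via its hash."""
    digest = hashlib.sha256(source_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < test_fraction


def source_disjoint_split(examples: Iterable[Mapping], test_fraction: float = 0.2):
    """Split examples so that no source contributes to both train and test."""
    train, test = [], []
    for example in examples:
        target = test if is_test_source(example["source"], test_fraction) else train
        target.append(example)
    return train, test
```

Because the split is decided per source, near-duplicate questions drawn from the same document always land on the same side. The realized test share is only approximately test_fraction, which is the price of keeping sources intact.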
Third, the benchmark publishes the configurations it tested, not just the winner. The matrix of where a configuration shines and where it lags is more useful than a single bragging number. ai-workflow-benchmark outputs a heat map per capability dimension. kb-arena outputs the full table across all seven strategies and three axes. A team picks the harness or strategy that matches their actual work mix, not an abstract average.
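Publishing the full matrix can be as simple as rendering every configuration against every axis and reporting the per-axis winner separately. This is a minimal sketch; the axis names and the results layout (configuration name mapped to axis values) are assumptions for illustration, not the output format of either repo.

```python
from typing import Mapping, Sequence

# Direction of each axis: True means higher is better. Axis names are illustrative.
HIGHER_IS_BETTER = {"recall@5": True, "exact_answer": True, "p95_latency_ms": False}


def best_per_axis(
    results: Mapping[str, Mapping[str, float]],
    higher_is_better: Mapping[str, bool] = HIGHER_IS_BETTER,
) -> dict[str, str]:
    """Return the winning configuration name for each axis."""
    winners = {}
    for axis, higher in higher_is_better.items():
        pick = max if higher else min
        winners[axis] = pick(results, key=lambda name: results[name][axis])
    return winners


def format_matrix(results: Mapping[str, Mapping[str, float]], axes: Sequence[str]) -> str:
    """Render every configuration against every axis as a plain-text table."""
    width = max(len(name) for name in results)
    header = " " * width + "".join(f"  {axis:>14}" for axis in axes)
    rows = [
        f"{name:<{width}}" + "".join(f"  {results[name][axis]:>14.3f}" for axis in axes)
        for name in results
    ]
    return "\n".join([header, *rows])
```

Keeping the winners per axis, instead of collapsing to one aggregate score, is what lets a reader with a latency-bound workload and a reader with a recall-bound workload reach different conclusions from the same table.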
The practical takeaway
Build the benchmark before you build the feature. The benchmark will tell you which feature is worth building. It will also tell you, six months later, whether something that used to work has stopped working, which is a question every AI product eventually has to answer.
Repos: kb-arena, ai-workflow-benchmark.