Benchmarking AI coding workflows, not just models
- ai
- benchmark
- case-study
Most AI coding leaderboards measure raw model capability on toy tasks: complete this function, fix this synthetic bug, write a one-shot script. Real coding work doesn’t look like that. It happens in workflows: plan, search the codebase, edit several files, run tests, debug the failures, ship.
ai-workflow-benchmark scores the workflow plus tool stack, not just the model.
Methodology
The benchmark is a hundred representative coding tasks spread across twelve capability dimensions: debugging, refactoring, multi-file change, dependency upgrades, test authoring, performance investigation, security review, error handling, API design, data migration, build configuration, and documentation. Each task ships with an unambiguous pass/fail oracle (a test suite, a linter rule, a property check) and a soft sigmoid score for partial credit.
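As a rough sketch, one way a task record and its evaluation could look in code (the field names and helpers here are illustrative, not the benchmark's actual schema):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    # Illustrative fields only; the real benchmark's schema may differ.
    task_id: str
    dimension: str                     # e.g. "debugging", "refactoring"
    oracle: Callable[[str], bool]      # unambiguous pass/fail check (tests, linter rule, property)
    raw_score: Callable[[str], float]  # raw measurement fed into the sigmoid for partial credit

def evaluate(task: Task, workspace: str) -> dict:
    """Run one task's checks against a candidate solution directory."""
    return {
        "task": task.task_id,
        "dimension": task.dimension,
        "passed": task.oracle(workspace),
        "raw": task.raw_score(workspace),
    }
```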
The tasks are deliberately uneven in difficulty: some are an hour of focused work, others would take a senior engineer a day. The point is to surface where a workflow excels and where it stalls, not to produce a single bragging number.
Sigmoid scoring
Binary pass/fail punishes near-misses the same as wild misses, which is wrong: a fix that compiles, passes the linter, and gets two of three test cases right is qualitatively closer to working than one that doesn’t compile at all. Naive linear percentage scoring keeps some of that partial-progress signal but weights every increment equally, so it says nothing about whether a solution has actually crossed the threshold of being acceptable.
The benchmark uses a sigmoid scoring function where the midpoint is “minimum acceptable solution” and the saturation point is “near-optimal.” This means a barely-passing solution scores around 0.5, a clean canonical solution scores close to 1.0, and the gap between them is the gap that actually matters. Scores above the saturation point earn diminishing returns, which discourages overengineering.
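As a concrete sketch of that shape, here is one way such a scoring function could be written, assuming the raw per-task measurement and the midpoint and saturation points are calibrated per task (the names and the calibration constant are illustrative, not the benchmark's exact formula):

```python
import math

def sigmoid_score(raw: float, midpoint: float, saturation: float) -> float:
    """Map a raw task measurement onto a 0-1 score.

    `midpoint` is the raw value of a minimum acceptable solution (scores ~0.5);
    `saturation` is the raw value of a near-optimal solution (scores ~0.95).
    Anything beyond `saturation` earns diminishing returns.
    """
    # Pick the steepness so the saturation point lands at ~0.95 on the curve:
    # logistic(k * (saturation - midpoint)) = 0.95 when k * (sat - mid) = ln(19).
    k = math.log(19) / (saturation - midpoint)
    return 1.0 / (1.0 + math.exp(-k * (raw - midpoint)))

# A barely-passing solution sits at the midpoint, a canonical one near saturation.
print(sigmoid_score(raw=0.6, midpoint=0.6, saturation=0.9))  # ~0.50
print(sigmoid_score(raw=0.9, midpoint=0.6, saturation=0.9))  # ~0.95
print(sigmoid_score(raw=1.0, midpoint=0.6, saturation=0.9))  # ~0.98, diminishing returns
```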
What the data shows
This part is illustrative; the headline finding will move as the benchmark and the tools evolve. From the runs so far, the gap between the top-performing workflow configuration and the median configuration is wider than the gap between the top model and the second-best model. In other words, once a model clears a capability threshold, picking the right tool harness matters more than picking the right model.
That has practical consequences. If you’re choosing where to spend engineering effort, the marginal hour spent on workflow tooling (better search, better test feedback loops, better context management) returns more than the marginal hour spent waiting for the next frontier model.
Capability gap matrix
The benchmark outputs a heat map of where a configuration shines versus where it lags. A workflow that nails refactoring and multi-file changes might drop badly on performance investigation if its test feedback loop is slow. The matrix lets a team pick the harness that matches their actual work mix instead of optimizing for an abstract average.
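As an illustrative sketch, the matrix can be assembled by averaging per-task scores per configuration and dimension (the record layout below is assumed, not the benchmark's actual output format):

```python
from collections import defaultdict

def capability_matrix(results: list[dict]) -> dict[str, dict[str, float]]:
    """Average per-task sigmoid scores into a configuration x dimension grid.

    `results` is assumed to be a list of records like
    {"config": "model+harness", "dimension": "debugging", "score": 0.72}.
    """
    buckets: dict[tuple[str, str], list[float]] = defaultdict(list)
    for r in results:
        buckets[(r["config"], r["dimension"])].append(r["score"])

    matrix: dict[str, dict[str, float]] = defaultdict(dict)
    for (config, dimension), scores in buckets.items():
        matrix[config][dimension] = sum(scores) / len(scores)
    return matrix

# Reading one row shows where a given harness excels or stalls across dimensions;
# that grid is what the heat map renders.
```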