kb-arena: which retrieval strategy wins on what data
- ai
- rag
- benchmark
- case-study
Every other blog post says RAG is “easy” and then ships naive cosine similarity over OpenAI embeddings as if that were the end of the story. Real corpora behave differently. Technical documentation behaves nothing like a customer-support FAQ, which behaves nothing like a structured regulatory codex. The retrieval strategy that wins on one corpus can lose on another.
kb-arena is a benchmark you point at your own documents to find which retrieval strategy actually scores best on the data you have.
Strategies tested
Seven retrieval strategies sit in the arena:
- Naive vector: chunk the text, embed each chunk with a dense model, retrieve by cosine similarity. The straw-man baseline.
- Contextual retrieval: the Anthropic-style approach where each chunk gets a one-line context prepended before embedding, so the chunk knows where in the document it sits.
- QnA pair extraction: an LLM walks the corpus, extracts likely question-answer pairs, and the index is over the questions instead of raw text.
- Knowledge graph: entities and relations are extracted into triples; retrieval is graph traversal plus a vector index over node descriptions.
- RAPTOR: recursive abstractive clustering. Chunks roll up into summary nodes, summary nodes roll up into higher-level summaries, and retrieval can hit any layer.
- PageIndex: a tree of headings and section summaries indexed alongside the leaf chunks.
- Hybrid: BM25 plus vector recall, reranked by a cross-encoder (sketched right after this list).
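To make the hybrid entry concrete, here is a minimal sketch of BM25-plus-vector recall followed by cross-encoder reranking, built on rank_bm25 and sentence-transformers. The chunk list, model names, and candidate-pool size are placeholder assumptions; kb-arena's own implementation may differ on all of them.

```python
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import CrossEncoder, SentenceTransformer

chunks = ["..."]  # your chunked corpus goes here (placeholder)

# Lexical index: BM25 over whitespace-tokenized chunks.
bm25 = BM25Okapi([c.lower().split() for c in chunks])

# Dense index: one normalized embedding per chunk (model choice is illustrative).
embedder = SentenceTransformer("all-MiniLM-L6-v2")
chunk_vecs = embedder.encode(chunks, normalize_embeddings=True)

# Cross-encoder reranker scores (query, chunk) pairs jointly.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def hybrid_retrieve(query: str, k: int = 5, pool: int = 25) -> list[str]:
    # Recall: top `pool` candidates from each index, then union them.
    bm25_scores = bm25.get_scores(query.lower().split())
    q_vec = embedder.encode([query], normalize_embeddings=True)[0]
    dense_scores = chunk_vecs @ q_vec
    candidates = sorted(
        set(np.argsort(bm25_scores)[-pool:]) | set(np.argsort(dense_scores)[-pool:])
    )
    # Rerank the union with the cross-encoder and keep the top k.
    ce_scores = reranker.predict([(query, chunks[i]) for i in candidates])
    best = np.argsort(ce_scores)[::-1][:k]
    return [chunks[candidates[j]] for j in best]
```

The two recall stages are cheap; the cross-encoder pass dominates the cost, which is consistent with hybrid landing near the top of the latency column below.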
Scoring
Each strategy is scored against a held-out QA set on three axes: recall@5 (does the right chunk appear in the top five?), exact-answer presence (does the model produce the verbatim answer when fed those five chunks?), and p95 latency (how long does the full retrieval and generation cycle take at the 95th percentile?).
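As a rough sketch of how those three axes reduce to code (the field names here are illustrative, not kb-arena's actual schema):

```python
import numpy as np

def recall_at_5(retrieved_ids: list[str], gold_id: str) -> int:
    """1 if the gold chunk appears among the top five retrieved chunks."""
    return int(gold_id in retrieved_ids[:5])

def exact_answer(generated: str, gold_answer: str) -> int:
    """Crude verbatim check: the gold answer string appears in the model output."""
    return int(gold_answer.strip().lower() in generated.lower())

def p95_latency_ms(latencies_ms: list[float]) -> float:
    """95th-percentile end-to-end retrieval-plus-generation latency."""
    return float(np.percentile(latencies_ms, 95))

def score(results: list[dict]) -> dict:
    # `results` holds one record per held-out QA item.
    return {
        "recall@5": float(np.mean([recall_at_5(r["retrieved"], r["gold_chunk"]) for r in results])),
        "exact_answer": float(np.mean([exact_answer(r["generated"], r["gold_answer"]) for r in results])),
        "p95_latency_ms": p95_latency_ms([r["latency_ms"] for r in results]),
    }
```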
Sample results
The numbers below are illustrative, not authoritative. They come from a single corpus and a single QA set, and the right number for your data is whatever you measure against your data. Don’t quote them.
| Strategy | Recall@5 | Exact answer | p95 latency |
|---|---|---|---|
| Naive vector | 0.62 | 0.41 | 120 ms |
| Contextual | 0.78 | 0.58 | 180 ms |
| QnA pairs | 0.71 | 0.62 | 95 ms |
| Knowledge graph | 0.55 | 0.48 | 220 ms |
| RAPTOR | 0.74 | 0.51 | 190 ms |
| PageIndex | 0.69 | 0.46 | 110 ms |
| Hybrid | 0.81 | 0.63 | 210 ms |
Practical takeaway
Pure vector search is rarely the right default. For technical and regulatory corpora, contextual retrieval or hybrid usually wins because the additional structure cuts ambiguous chunks out of the candidate set. For FAQ-shaped corpora, QnA extraction wins because the index is now over the actual question phrasing instead of the answer prose. For factual lookup over structured records, the knowledge graph wins because the schema is the index.
The point of the arena is not to crown a universal winner. It’s to make the comparison cheap enough that you stop guessing.
Repo: github.com/xmpuspus/kb-arena.