

kb-arena: which retrieval strategy wins on what data

  • ai
  • rag
  • benchmark
  • case-study

Every other blog post says RAG is “easy” and then ships naive cosine similarity over OpenAI embeddings as if that were the end of the story. Real corpora behave differently. Technical documentation behaves nothing like a customer-support FAQ, which behaves nothing like a structured regulatory codex. The retrieval strategy that wins on one set will lose on another.

kb-arena is a benchmark you point at your own documents to find which retrieval strategy actually scores best on the data you have.

Strategies tested

Seven retrieval strategies sit in the arena:

  • Naive vector: cosine similarity over chunked text and a dense embedding model. The straw-man baseline.
  • Contextual retrieval: the Anthropic-style approach where each chunk gets a one-line context prepended before embedding, so the chunk knows where in the document it sits.
  • QnA pair extraction: an LLM walks the corpus, extracts likely question-answer pairs, and the index is over the questions instead of raw text.
  • Knowledge graph: entities and relations are extracted into triples, retrieval is graph traversal plus a vector index over node descriptions.
  • RAPTOR: recursive abstractive clustering. Chunks roll up into summary nodes, summary nodes roll up into higher-level summaries, and retrieval can hit any layer.
  • PageIndex: a tree of headings and section summaries indexed alongside the leaf chunks.
  • Hybrid: BM25 plus vector recall, reranked by a cross-encoder.
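The merging step in the hybrid strategy can be sketched with reciprocal rank fusion, a common lightweight stand-in for a cross-encoder reranker. This is an illustrative sketch, not kb-arena's actual API; the function name and inputs are assumptions.

```python
def rrf_merge(bm25_ranked, vector_ranked, k=60):
    """Merge two best-first ranked lists of doc ids via reciprocal rank fusion.

    A doc ranked highly in either list gets a large combined score;
    k=60 is the conventional damping constant from the RRF paper.
    """
    scores = {}
    for ranking in (bm25_ranked, vector_ranked):
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# Lexical and dense recall disagree; fusion favors docs both liked.
merged = rrf_merge(["a", "b", "c"], ["c", "a", "d"])
# "a" wins: it appears near the top of both lists.
```

A cross-encoder would then rescore only this merged short list, which is what keeps hybrid's latency tolerable despite running two recall passes.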

Scoring

Each strategy is scored against a held-out QA set on three axes: recall@5 (does the right chunk appear in the top five?), exact-answer presence (does the model produce the verbatim answer when fed those five chunks?), and p95 latency (how long does the full retrieval and generation cycle take at the 95th percentile?).
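Two of the three axes are trivial to compute from per-query logs. A minimal sketch, assuming you have per-query retrieved-id lists, gold ids, and latencies (these data shapes are assumptions, not kb-arena's actual harness):

```python
import math

def recall_at_k(retrieved, gold, k=5):
    """Fraction of queries whose gold chunk id appears in the top-k retrieved."""
    hits = sum(1 for r, g in zip(retrieved, gold) if g in r[:k])
    return hits / len(gold)

def p95(latencies_ms):
    """95th-percentile latency via the nearest-rank method."""
    ordered = sorted(latencies_ms)
    idx = max(0, math.ceil(0.95 * len(ordered)) - 1)
    return ordered[idx]

# One gold chunk per query; each inner list is that query's top hits.
r = recall_at_k([["c1", "c2", "c9"], ["c4", "c7", "c8"]], ["c2", "c5"])
# → 0.5: the first query's gold chunk is in its top 5, the second's is not.
```

Exact-answer presence is the one axis that needs a generation call per query, which is why it dominates benchmark runtime rather than retrieval itself.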

Sample results

The numbers below are illustrative, not authoritative. They come from a single corpus and a single QA set, and the right number for your data is whatever you measure against your data. Don’t quote them.

| Strategy        | Recall@5 | Exact answer | p95 latency |
|-----------------|----------|--------------|-------------|
| Naive vector    | 0.62     | 0.41         | 120 ms      |
| Contextual      | 0.78     | 0.58         | 180 ms      |
| QnA pairs       | 0.71     | 0.62         | 95 ms       |
| Knowledge graph | 0.55     | 0.48         | 220 ms      |
| RAPTOR          | 0.74     | 0.51         | 190 ms      |
| PageIndex       | 0.69     | 0.46         | 110 ms      |
| Hybrid          | 0.81     | 0.63         | 210 ms      |

Practical takeaway

Pure vector search is rarely the right default. For technical and regulatory corpora, contextual retrieval or hybrid usually wins because the additional structure cuts ambiguous chunks out of the candidate set. For FAQ-shaped corpora, QnA extraction wins because the index is now over the actual question phrasing instead of the answer prose. For factual lookup over structured records, the knowledge graph wins because the schema is the index.

The point of the arena is not to crown a universal winner. It’s to make the comparison cheap enough that you stop guessing.

Repo: github.com/xmpuspus/kb-arena.