memory-arena: 20 agent-memory strategies, one eval

  • ai
  • memory
  • benchmark
  • case-study

Agent memory is sold as a solved problem. Half a dozen funded SDKs will tell you to plug them in and forget about it. Run all of them on the same eval, against the same corpus, judged by the same model, and a 30-line ChromaDB script outranks every one. Memory is an open problem, not a shipped feature.

memory-arena is the harness that runs the comparison so you do not have to take a vendor’s word for it.

What is in the arena

Twenty strategies, three groups:

  • Six vendor SDKs at their documented defaults: Mem0, Graphiti, Graphiti-on-FalkorDB, Cognee, LangMem, Memori.
  • Twelve pure-Python baselines and retrievers: vector, BM25, RRF, HyDE, RAPTOR, Reflection, Persona Profile, an LLM wiki, A-MEM, HippoRAG 2, full-context, and a recency window.
  • Two quantum rerankers over the same vector store: QISS, a NumPy fidelity reranker, and SQR, a Qiskit SWAP-test reranker.
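To make one of the baselines concrete, here is a minimal sketch of Reciprocal Rank Fusion (RRF), which merges several ranked lists, for example vector hits and BM25 hits, into one. It is not the arena's implementation: the function name `rrf_fuse`, the session IDs, and the constant `k=60` (the value commonly used in the RRF literature) are assumptions for illustration.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60, top_k=5):
    """Fuse ranked lists of document IDs with Reciprocal Rank Fusion.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. `k` damps the influence of top ranks.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; tie-break on doc ID so output is deterministic.
    fused = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [doc_id for doc_id, _ in fused[:top_k]]


# Hypothetical session IDs returned by two retrievers for the same question.
vector_hits = ["s3", "s1", "s7"]
bm25_hits = ["s1", "s9", "s3"]
fused = rrf_fuse([vector_hits, bm25_hits], top_k=3)
print(fused)  # ['s1', 's3', 's9']
```

A session that both retrievers rank highly (`s1`, `s3`) beats one that only a single retriever found, which is the whole point of fusing.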

How it is judged

Every strategy runs the same setup -> ingest -> recall -> teardown lifecycle over the same chat-session corpus (a LongMemEval-S smoke set), judged by the same model (Claude Opus 4.7), at the same top_k. The vendor SDKs and the baselines share one model, so a baseline win cannot be put down to a vendor being handed a weaker model. Each result JSON is stamped with the commit SHA, package versions, model IDs, and seed, so anyone can re-run it and bisect a regression.
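The lifecycle above can be sketched as a small interface plus a runner. This is an illustration under assumptions, not the harness's actual code: `MemoryStrategy`, `RecencyWindow`, `run_strategy`, and the metadata field names are all hypothetical, and the stamped values are placeholders.

```python
from typing import Protocol


class MemoryStrategy(Protocol):
    """Hypothetical interface mirroring setup -> ingest -> recall -> teardown."""

    def setup(self) -> None: ...
    def ingest(self, sessions: list[str]) -> None: ...
    def recall(self, question: str, top_k: int) -> list[str]: ...
    def teardown(self) -> None: ...


class RecencyWindow:
    """Toy strategy: ignores the question and returns the newest sessions."""

    def setup(self) -> None:
        self._sessions: list[str] = []

    def ingest(self, sessions: list[str]) -> None:
        self._sessions.extend(sessions)

    def recall(self, question: str, top_k: int) -> list[str]:
        return self._sessions[-top_k:]

    def teardown(self) -> None:
        self._sessions = []


def run_strategy(strategy, sessions, questions, top_k, meta):
    """Run one strategy through the full lifecycle and stamp the result."""
    strategy.setup()
    try:
        strategy.ingest(sessions)
        recalls = {q: strategy.recall(q, top_k) for q in questions}
    finally:
        # Always tear down, even if ingest or recall raises.
        strategy.teardown()
    return {**meta, "top_k": top_k, "recalls": recalls}


# Placeholder provenance fields; a real run would fill these from git and the environment.
meta = {"commit_sha": "<sha>", "judge_model": "<model-id>", "seed": 0}
result = run_strategy(
    RecencyWindow(),
    sessions=["session 1", "session 2", "session 3"],
    questions=["What did the user say last?"],
    top_k=2,
    meta=meta,
)
print(result["recalls"])
```

Keeping every strategy behind the same four calls is what lets a vendor SDK and a short script be scored by identical code.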

The finding

The funded vendor SDKs all land below naive_vector, a 30-line ChromaDB script. mem0 and langmem run on the same model as the baselines, so it is not that they were handed a weaker model. And the absolute numbers are humbling: even the leader gets only about half the questions right. Whatever “agent memory” is, none of the shipped products is close to solving it.

What it does not claim

It is not vendor-tuned, and that cuts both ways: a tuned config might move a vendor up, so open a PR with one and the harness will re-run it. It is not a single-judge truth either. Opus and GPT-4o cross-judge with a Spearman rank correlation of +0.967; GPT-4o grades about 15 points more leniently in absolute terms but agrees on the ranking. The smoke subset is 16 questions; the full LongMemEval-S run of 500 questions is the next milestone. Read the numbers as a ranking you can reproduce, not a leaderboard to quote.
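Why a lenient judge can still agree on the ranking is easy to see in code. The sketch below computes a Spearman rank correlation in plain Python. The function names are hypothetical and the scores are made-up placeholders, not results from the arena: the second judge is simply the first plus a constant offset.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5


# Placeholder per-strategy accuracies from two hypothetical judges.
strict_judge = [0.50, 0.40, 0.30, 0.20]
lenient_judge = [s + 0.15 for s in strict_judge]  # uniformly more generous
rho_shifted = spearman(strict_judge, lenient_judge)

# Swapping two adjacent strategies lowers the correlation.
rho_swapped = spearman([1, 2, 3, 4], [1, 3, 2, 4])
print(rho_shifted, rho_swapped)
```

A uniform offset leaves every rank where it was, so the correlation stays at 1; only reordering strategies pulls it down, which is why the ranking is the part worth reproducing.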

Repo: github.com/xmpuspus/memory-arena.