LongMemEval Explained: The Benchmark That Tests Agent Memory
LongMemEval is the ICLR 2025 benchmark for evaluating long-term memory in conversational AI. Learn what it tests, why it's hard, and how to read benchmark claims critically.

LongMemEval is the ICLR 2025 benchmark for evaluating long-term memory in conversational AI. Developed by UCLA and Tencent AI Lab, it measures how well agents retain and use information across extended conversations.
If you're evaluating memory solutions for your agent, understanding this benchmark helps you read claims critically.
The benchmark presents a chat assistant with a long conversation history, then asks questions that require recalling specific information from that history.
It tests five memory abilities: information extraction, multi-session reasoning, knowledge updates, temporal reasoning, and abstention.
Can the system retrieve specific details from extensive conversations?
Example: After 40 sessions of conversation, ask "What restaurant did the user mention wanting to try?"
This is basic retrieval. The information exists somewhere in the history; can the system find it?
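As a toy illustration of this retrieval step, the sketch below scores each session by keyword overlap with the question and returns the best match. The session data and the scoring heuristic are illustrative, not LongMemEval's actual format or method:

```python
def score_session(question: str, session: list[str]) -> int:
    """Count question keywords (length > 3) that appear anywhere in the session."""
    keywords = {w.strip("?.,").lower() for w in question.split()}
    keywords = {w for w in keywords if len(w) > 3}
    text = " ".join(session).lower()
    return sum(1 for w in keywords if w in text)

def retrieve_best_session(question: str, history: list[list[str]]) -> list[str]:
    """Return the session most likely to contain the answer."""
    return max(history, key=lambda s: score_session(question, s))

history = [
    ["I adopted a cat last month.", "Her name is Miso."],
    ["I really want to try that new restaurant, Nopalito."],
    ["Work has been so busy lately."],
]
best = retrieve_best_session("What restaurant did the user mention wanting to try?", history)
```

Real systems replace the keyword heuristic with embeddings, but the shape of the problem is the same: one relevant session hidden among many irrelevant ones.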
Can the system synthesize information across multiple sessions?
Example: "How many times has the user mentioned being stressed about work?" or "Compare the user's opinions on remote work from January vs March."
This requires aggregating information scattered across sessions, not just finding a single fact.
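The difference from single-fact retrieval can be sketched as an aggregation over all sessions rather than a search for one. The data and keyword list below are illustrative:

```python
def count_mentions(history: list[list[str]], keywords: set[str]) -> int:
    """Count sessions in which any keyword appears (session-level aggregation)."""
    count = 0
    for session in history:
        text = " ".join(session).lower()
        if any(k in text for k in keywords):
            count += 1
    return count

history = [
    ["I'm so stressed about the product launch."],
    ["Weekend was relaxing, went hiking."],
    ["Deadlines again... feeling stressed at work."],
]
n = count_mentions(history, {"stressed", "stress"})  # -> 2
```

A top-1 retriever would surface one stressed session and miss the count entirely; aggregation questions require touching every session.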
Can the system handle information that changes over time?
Example: In January, the user says "I work at Acme Corp." In March, they say "I just started at Globex." When asked "Where does the user work?", the system should answer Globex, not Acme.
This is surprisingly hard. Both facts are semantically relevant to "where does the user work?" Vector similarity doesn't distinguish old from new.
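The sketch below shows the failure mode and one common mitigation: when both facts score identically on similarity, a timestamp tiebreak (recency) picks the current one. The facts, similarity stand-in, and dates are all illustrative:

```python
from datetime import date

facts = [
    {"text": "I work at Acme Corp.", "when": date(2025, 1, 10)},
    {"text": "I just started at Globex.", "when": date(2025, 3, 2)},
]

def similarity(query: str, fact: str) -> float:
    """Toy stand-in for vector similarity: both employer facts score the same."""
    return 1.0 if ("work" in fact or "started" in fact) else 0.0

def answer_with_recency(query: str, facts: list[dict]) -> str:
    """Among equally similar facts, prefer the most recent one."""
    best = max(facts, key=lambda f: (similarity(query, f["text"]), f["when"]))
    return best["text"]

ans = answer_with_recency("Where does the user work?", facts)
```

With similarity alone, `max` would return whichever fact happens to sort first; adding the timestamp to the sort key is what makes the update win.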
Can the system understand time-related context?
Example: "What did the user do last Tuesday?" or "What appointments does the user have next week?"
This requires understanding explicit timestamps, relative time references ("last week", "tomorrow"), and temporal ordering of events.
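Before retrieval can even start, a relative reference like "last Tuesday" must be resolved to a concrete date. A minimal sketch of that resolution step (the reference date is illustrative):

```python
from datetime import date, timedelta

def last_weekday(today: date, weekday: int) -> date:
    """Most recent past occurrence of `weekday` (Mon=0 .. Sun=6) before today."""
    delta = (today.weekday() - weekday) % 7
    if delta == 0:
        delta = 7  # "last Tuesday" asked on a Tuesday means a week ago
    return today - timedelta(days=delta)

# If "today" is Friday 2025-06-13, last Tuesday resolves to 2025-06-10.
d = last_weekday(date(2025, 6, 13), weekday=1)
```

Production systems also have to decide *which* "today" to use: the timestamp of the question, not the timestamp of the session being searched.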
Can the system recognize when information wasn't provided?
Example: If the user never mentioned their birthday, the system should say "I don't have that information" rather than hallucinating an answer.
This tests whether the system knows the boundaries of its knowledge.
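Grading abstention questions inverts the usual scoring: any concrete answer counts as a hallucination. A simplified sketch of such a check (the marker phrases are illustrative, not LongMemEval's official grader):

```python
ABSTAIN_MARKERS = ("don't have that information", "not mentioned", "i don't know")

def is_correct_abstention(model_answer: str) -> bool:
    """For an unanswerable question, the model is 'correct' only if it declines."""
    a = model_answer.lower()
    return any(m in a for m in ABSTAIN_MARKERS)

good = is_correct_abstention("I don't have that information about the user's birthday.")
bad = is_correct_abstention("The user's birthday is March 3rd.")
```

In practice an LLM judge is more robust than phrase matching, but the scoring logic is the same: declining is the right answer.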
LongMemEval uses an "attribute-controlled pipeline" to create realistic conversation histories with known ground truth.
| Variant | Sessions | Tokens | Purpose |
|---|---|---|---|
| Oracle | 1-2 | Minimal | Baseline (answer in context) |
| S | ~48 | ~115K | Standard evaluation |
| M | ~500 | ~1.5M | Stress test (exceeds context) |
Oracle contains only the sessions directly relevant to the question. If a system can't answer correctly with Oracle, it has a reading comprehension problem, not a memory problem.
S (Standard) is the main benchmark. ~115K tokens across ~48 sessions. This fits in modern context windows but requires finding needles in haystacks.
M (Medium) is ~1.5M tokens across ~500 sessions. This exceeds even the largest context windows, requiring genuine memory systems.
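Working with any variant typically starts by loading a JSON dump and tallying questions per ability. The sketch below uses mocked records; the exact field names (e.g. `question_type`) are assumptions that should be checked against the official release:

```python
import json
from collections import Counter

# Mocked benchmark-style records; real LongMemEval files are far larger
# and include the full haystack of conversation sessions per question.
raw = json.dumps([
    {"question_type": "knowledge-update", "question": "Where does the user work?"},
    {"question_type": "temporal-reasoning", "question": "What did the user do last Tuesday?"},
    {"question_type": "knowledge-update", "question": "What phone does the user own now?"},
])

records = json.loads(raw)
by_type = Counter(r["question_type"] for r in records)  # per-ability breakdown
```

A per-type breakdown like this is what lets you see, for example, strong extraction scores masking weak knowledge-update scores.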
Each question is labeled by the memory ability it tests, so results can be broken down per category rather than reported as a single aggregate.
LongMemEval reveals several challenges that simpler benchmarks miss.
Long-context LLMs show a 30-60% performance drop from the Oracle variant to S. The information is present in context, but models struggle to find and use it.
This isn't a context length problem. 115K tokens fits in GPT-4's context. The problem is attention—models lose track of relevant information among irrelevant context.
Even top systems score ~83% on multi-session reasoning while achieving 90%+ on simpler categories. Synthesizing across sessions is fundamentally harder than single-fact retrieval.
When testing knowledge updates, systems frequently return outdated information. The old fact and the new fact are both semantically similar to the query. Without temporal awareness, the system can't distinguish them.
Even with perfect retrieval (returning exactly the right context), systems still make errors. The reading comprehension step—extracting the answer from retrieved context—has its own failure modes.
Higher is better, but context matters. A headline score alone doesn't tell you which variant was tested, how retrieval was configured, what the latency looked like, or whether the test set leaked into development.
LongMemEval scores are frequently gamed or misrepresented. Here's what to watch for:
Some companies report Oracle scores (where retrieval is trivial) without clarifying they didn't test on S or M variants. Always ask: which variant?
Setting top_k=50 on a dataset with 30 items returns everything. The "memory system" contributes nothing—you've just dumped the whole dataset into context.
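The toy example below makes this concrete: with `top_k` at least as large as the corpus, "retrieval" returns everything and recall is perfect by construction. The corpus and retriever are illustrative:

```python
def retrieve(corpus: list[str], query: str, top_k: int) -> list[str]:
    """Toy retriever: rank documents by words shared with the query, keep top_k."""
    q = set(query.lower().split())
    ranked = sorted(corpus, key=lambda d: len(q & set(d.lower().split())), reverse=True)
    return ranked[:top_k]

corpus = [f"session {i}" for i in range(30)] + ["the user wants to try Nopalito"]
hits = retrieve(corpus, "restaurant the user wants to try", top_k=50)

# top_k=50 on a 31-item corpus returns all 31 items: recall@50 is trivially 100%
everything_returned = len(hits) == len(corpus)
```

Any "memory system" evaluated this way is indistinguishable from dumping the dataset into context, which is why `top_k` must be reported alongside the score.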
Most benchmark reports omit latency entirely. But a high-accuracy system with multi-second retrieval latency is production-useless for real-time agents.
Some systems identify failing questions, engineer fixes for those specific cases, then re-test. This inflates scores without improving general capability.
When evaluating a memory system's benchmark claims, ask four questions: which variant was tested, what retrieval settings (such as top_k) were used, whether latency was measured, and whether the test set influenced development.
No benchmark is complete. LongMemEval has blind spots:
Even the M variant (1.5M tokens) represents maybe 6 months of daily conversations. Some applications need years of history. The BEAM benchmark tests at 10M tokens.
LongMemEval is read-only. It doesn't test how well systems handle rapid memory updates, conflicts, or deletions.
The dataset is text-only. Agents that handle images, audio, or structured data need additional evaluation.
The conversations are general-purpose. Domain-specific applications (medical, legal, software) may have different retrieval patterns.
LongMemEval doesn't test whether systems appropriately forget outdated information. PersistBench addresses this.
If you're evaluating memory solutions, three steps will protect you from misleading claims.
Run the benchmark yourself. It's public, so you can run the same tests on the systems you're considering instead of trusting reported numbers.
Test with your own data. LongMemEval is general-purpose; conversations matching your domain, session lengths, and question styles may behave differently.
Measure operational metrics. Beyond accuracy, measure retrieval latency and behavior at your expected history size.
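A minimal harness for the latency side of that measurement, reporting median and p95 over repeated calls. The retriever stub and workload are illustrative stand-ins for a real memory lookup:

```python
import time
import statistics

def timed(fn, *args, runs: int = 50) -> dict:
    """Run fn repeatedly and report median and p95 latency in milliseconds."""
    samples = []
    for _ in range(runs):
        t0 = time.perf_counter()
        fn(*args)
        samples.append((time.perf_counter() - t0) * 1000)
    samples.sort()
    return {
        "p50_ms": statistics.median(samples),
        "p95_ms": samples[int(0.95 * len(samples)) - 1],
    }

def dummy_retrieve(query):  # stand-in for a real memory-system lookup
    return [query]

stats = timed(dummy_retrieve, "where does the user work?")
```

Run this against each candidate system with realistic history sizes; tail latency (p95), not the median, is usually what determines whether a system is usable in a real-time agent.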
A system scoring 90%+ vs 85% may not matter if both handle your use cases. The last 5% often comes from edge cases that don't appear in production.