Milla Jovovich and a systems engineer built an AI memory system using Claude Code. Within 48 hours of launch, it had 7,000 GitHub stars. Within a week, 40,000.
The numbers alone would make MemPalace notable. But the more interesting story is what it reveals about a fundamental tension in AI memory design: should AI decide what's worth remembering, or should you keep everything and make it searchable?
MemPalace takes the second position to its logical extreme. And the results are forcing a conversation the industry has been avoiding.
The Origin Story
Milla Jovovich (yes, the actress from Resident Evil and The Fifth Element) spent months using AI assistants for business and creative work. She accumulated thousands of conversations. Each one started from scratch.
She tried existing memory solutions. Mem0, Zep, the usual suspects. They all had the same design philosophy: use AI to extract what seems important, discard the rest.
This frustrated her. The AI kept deciding that details were irrelevant. But relevance changes. Something that seems like noise today might be exactly what you need six months from now. A passing mention of a restaurant becomes critical when you're planning a meeting in that city. A casual preference becomes the key to a gift idea.
Her insight was simple: "Why should AI decide what I need to remember? Nobody knows what's going to be relevant tomorrow."
So she and engineer Ben Sigman built MemPalace with a different philosophy: store everything verbatim, then make it findable.
The Architecture: Memory Palace Metaphor
MemPalace organizes memory hierarchically using the ancient memory palace technique:
| Level | Name | Purpose |
|---|---|---|
| Top | Wings | Major containers (people, projects) |
| Mid | Rooms | Topic-specific categories |
| Mid | Halls | Memory types (facts, events, preferences) |
| Bottom | Closets | Compressed summaries |
| Bottom | Drawers | Verbatim originals (never deleted) |
| Cross-link | Tunnels | References between rooms |
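The hierarchy in the table can be sketched as a simple data model. This is illustrative Python, not MemPalace's actual schema: the class names come from the table, but every field name is an assumption.

```python
from dataclasses import dataclass, field

@dataclass
class Drawer:
    """Verbatim original: written once, never modified or deleted."""
    conversation_id: str
    text: str

@dataclass
class Closet:
    """Compressed summary derived from one or more drawers."""
    summary: str
    drawer_ids: list[str] = field(default_factory=list)  # originals stay reachable

@dataclass
class Hall:
    """Memory type within a room: facts, events, or preferences."""
    kind: str
    closets: list[Closet] = field(default_factory=list)
    drawers: list[Drawer] = field(default_factory=list)

@dataclass
class Room:
    """Topic-specific category inside a wing."""
    topic: str
    halls: dict[str, Hall] = field(default_factory=dict)
    tunnels: list[str] = field(default_factory=list)  # cross-links to other rooms

@dataclass
class Wing:
    """Major container, e.g. a person or a project."""
    name: str
    rooms: dict[str, Room] = field(default_factory=dict)
```

Note that a Closet keeps references back to its source Drawers rather than replacing them, which is the structural expression of "compress for efficiency, never discard the original."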
The key insight is the Drawers layer. Original conversations are stored verbatim and permanently. They're never modified, never summarized, never deleted. The higher layers (Closets, Rooms) provide compressed views for efficiency, but the originals remain accessible.
Technically, it's ChromaDB for vector search, SQLite for metadata, entirely local. No cloud dependency. Zero API cost for the base system.
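A minimal stand-in for that stack, using only the standard library (SQLite for storage, a toy keyword match in place of ChromaDB's vector search; none of this is MemPalace's actual code):

```python
import sqlite3

class VerbatimStore:
    """Append-only verbatim store: rows are inserted, never updated or deleted."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS drawers ("
            "id INTEGER PRIMARY KEY, room TEXT, text TEXT NOT NULL)"
        )

    def add(self, room, text):
        # Write path: store the conversation chunk exactly as received.
        cur = self.db.execute(
            "INSERT INTO drawers (room, text) VALUES (?, ?)", (room, text)
        )
        self.db.commit()
        return cur.lastrowid

    def search(self, keyword, limit=5):
        # Toy retrieval: substring match stands in for embedding similarity.
        rows = self.db.execute(
            "SELECT text FROM drawers WHERE text LIKE ? LIMIT ?",
            (f"%{keyword}%", limit),
        )
        return [r[0] for r in rows]
```

The design point the sketch preserves: all the intelligence lives on the read path. The write path does nothing clever, so there is nothing it can get wrong.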
The 170-Token Startup
Most memory systems front-load context. Load everything relevant at session start, burn tokens on information that might not matter.
MemPalace takes a different approach with four retrieval layers:
- L0 (~50 tokens): Core identity, always loaded
- L1 (~120 tokens): Critical facts, always loaded
- L2: Room-specific recall, loaded when topics surface
- L3: Full semantic search, on-demand
This means session startup costs ~170 tokens. Compare that to systems that load hundreds of thousands of tokens of "potentially relevant" context.
The trade-off is latency on first mention of a topic — the system needs to fetch from L2/L3. But for most conversations, you don't need everything upfront.
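The loading policy above can be expressed as a small sketch. The layer names and the ~50/~120 token figures come from the article; the L2/L3 fetch costs and layer contents are invented for illustration.

```python
# Always-loaded layers: paid at every session start.
ALWAYS_LOADED = {
    "L0": ("core identity", 50),
    "L1": ("critical facts", 120),
}

# On-demand layers: paid once, when a topic first surfaces.
ON_DEMAND = {
    "L2": ("room-specific recall", 800),   # assumed cost
    "L3": ("full semantic search", 2000),  # assumed cost
}

def startup_tokens():
    """Session-start context is only L0 + L1."""
    return sum(cost for _, cost in ALWAYS_LOADED.values())

def after_topic_mention(tokens_so_far, layer="L2"):
    """First mention of a topic triggers a one-time fetch from L2 or L3."""
    _, cost = ON_DEMAND[layer]
    return tokens_so_far + cost
```

The shape of the trade-off is visible in the numbers: a cheap fixed cost at startup, with the larger costs deferred until (and unless) a topic actually comes up.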
The Benchmark Results
Here's where it gets interesting.
| Mode | Score | Notes |
|---|---|---|
| Raw (no API) | 96.6% | Verbatim retrieval, local only |
| Hybrid | 100% | With Claude Haiku reranking |
The 96.6% raw score is the highest published local-only result on LongMemEval. No API calls, no cloud services, just ChromaDB and SQLite retrieving verbatim conversation chunks.
For comparison:
- Zep: 71.2%
- Supermemory: 85.2%
- Hindsight: 94.6%
On the ConvoMem benchmark (conversational memory specifically), the gap is even wider: MemPalace hits 92.9% versus Mem0's 30-45%.
On these retrieval benchmarks, the verbatim approach outperforms extraction-based systems. Whether this translates to better real-world performance depends on factors benchmarks don't measure — latency at scale, storage costs, and the types of queries your application needs to support.
The 100% Controversy
The 100% hybrid score drew scrutiny. A GitHub issue documented several concerns:
Hand-coded patches. The 100% came after targeted fixes for specific failing questions, then retesting on the same dataset, not a held-out set. This is teaching to the test.
Retrieval parameter choices. Setting top_k=50 on conversations with 19-32 sessions essentially retrieves everything. The "memory system" becomes "dump everything into context and let the LLM figure it out."
Metric mismatch. They measured retrieval recall, not the end-to-end QA accuracy that LongMemEval's leaderboard actually reports.
The maintainers' defense: they were demonstrating a ceiling, not claiming production performance. The 96.6% raw score remains the honest metric.
This controversy matters beyond MemPalace. It highlights how benchmark scores in the memory space are often incomparable — different variants, different metrics, different retrieval parameters. A "95%" from one system may not mean the same thing as "95%" from another.
The Case For Verbatim Storage
MemPalace's approach rests on a specific hypothesis: AI-based extraction loses information that may matter later.
The argument for verbatim:
- Future relevance can differ from current relevance — what seems irrelevant today may be critical tomorrow
- Exact wording is preserved (important for quotes, preferences, specifics)
- Nothing is irreversibly lost
- Errors are retrieval errors, not storage errors — fixable with better search
The argument against extraction:
- AI applies current relevance to future needs
- Summarization loses nuance and specificity
- Discarding is irreversible
- Extraction errors compound over time
The trade-off is storage cost and retrieval complexity. Verbatim requires more storage but preserves everything. Extraction requires less storage but makes irreversible decisions about what to keep.
The Case For Extraction
Extraction-based systems like Mem0 and Zep exist for good reasons. Verbatim storage has real limitations:
Storage scales linearly. Every conversation, forever. For heavy users, this becomes gigabytes. ChromaDB handles it, but it's not free.
No semantic consolidation. If you have 50 conversations mentioning your preference for dark mode, you have 50 chunks about dark mode. Extraction would consolidate these into one fact, enabling queries like "how many times did the user mention X?"
Retrieval depends on phrasing. If the user asks about "color scheme preferences" but the stored conversations say "dark mode," vector similarity has to bridge that gap. Extracted facts can normalize terminology.
Structured queries are harder. "List all the user's food preferences" is straightforward with extracted structured data. With verbatim chunks, you're searching and aggregating across many documents.
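The consolidation and structured-query points can be seen in a toy comparison. This is illustrative Python with invented data; neither side reflects any real system's schema.

```python
# Extraction-first: facts are normalized into structure at write time.
extracted = {
    "preferences": {
        "ui": ["dark mode"],
        "food": ["ramen", "no cilantro"],
    }
}

# "List all the user's food preferences" is a direct lookup.
food_prefs = extracted["preferences"]["food"]

# Verbatim-first: the same information sits inside raw conversation chunks,
# so the query becomes search-plus-aggregation across documents.
chunks = [
    "I always get ramen when I'm in Osaka.",
    "Please switch the dashboard to dark mode.",
    "No cilantro on anything, ever.",
]
food_keywords = {"ramen", "cilantro"}  # keyword match stands in for vector search
food_mentions = [c for c in chunks if any(k in c.lower() for k in food_keywords)]
```

The extracted side answers in one lookup; the verbatim side first has to find the right chunks, and its answer quality depends on how well retrieval bridges the user's phrasing and the stored wording.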
The AAAK experiment. MemPalace includes an experimental compression format (AAAK) claiming 30x "lossless" compression. But their own benchmarks show it drops retrieval accuracy from 96.6% to 84.2% — a 12-point regression. Compression that loses information isn't lossless.
Two Philosophies of Memory
MemPalace represents one end of a design spectrum. At the other end are extraction-based systems.
Extraction-first (Mem0, Zep, Letta):
- Use AI to identify important facts
- Store structured, normalized data
- Discard raw conversations
- Optimize for storage efficiency and structured queries
Verbatim-first (MemPalace):
- Store everything as-is
- Never discard originals
- Invest in retrieval over extraction
- Optimize for recall completeness
The right choice depends on your application:
- Need structured queries ("list all preferences")? Extraction helps.
- Need exact quotes and full context? Verbatim preserves them.
- Storage-constrained? Extraction is smaller.
- Worried about losing information? Verbatim loses nothing.
MemPalace's benchmark results show verbatim can achieve strong retrieval accuracy. But benchmarks measure retrieval, not the full range of memory use cases. Real applications may need capabilities that favor one approach over the other.
The Celebrity Factor
MemPalace's viral growth is partly celebrity-driven. Milla Jovovich has millions of followers. The origin story (actress frustrated with AI builds her own solution) is inherently shareable.
This doesn't invalidate the technical merits, but it does mean the 40,000 GitHub stars reflect distribution as much as capability. For context, Mem0 has 51,000 stars. Stars measure attention, not performance.
The question of whether outsider perspectives bring value to technical problems is worth considering. Jovovich's core insight — "why should AI decide what I need to remember?" — isn't new. But building a system around it and achieving competitive benchmark results is a real contribution, regardless of who built it.
Conclusion
MemPalace's 96.6% raw score is real and significant. It's the highest local-only retrieval result on LongMemEval, achieved with a philosophically different approach: store everything, discard nothing, invest in retrieval.
The 100% hybrid claim deserves skepticism. Benchmark engineering is common in this space, and MemPalace's methodology has documented issues.
What MemPalace demonstrates is that verbatim storage with good vector search is a viable approach to agent memory — one that trades storage efficiency for retrieval completeness. Whether that trade-off makes sense depends on your application.
The memory space is still early. Extraction-based systems offer structured queries and storage efficiency. Verbatim systems offer completeness and simpler storage semantics.