The Embedding Problem: Extracting Structure from Text

Part 3 of the Hypabase Memory Series

"Alice gave Bob a book" and "Bob gave Alice a book" typically land very close together in embedding space: same words, same length, heavily overlapping content. But the facts are opposite. Who gave and who received is structure that vector similarity largely cannot see.

This is the embedding problem. And it's why most RAG-based memory systems fail at precise recall.

The user says: "I bought a hybrid bike from REI last month and I've been using it for my commute."

The memory system needs to store this as structured hyperedges — not as a text chunk, not as disconnected triples, but as coherent facts with labeled participants. We have hypergraphs to hold n-ary facts and Kāraka roles to label participants. Now we need to extract these structures from natural language.

Why Extraction Matters

Most AI memory systems skip this step. They store conversation chunks as text, embed them, and retrieve by similarity. Simple, but limited.

The problem: text embeddings conflate structure. In the Alice and Bob pair from the introduction, swapping giver and recipient barely moves the vector, yet the facts are different. Who gave, who received, what was given: none of this can be reliably read back out of an embedding.

Structured extraction recovers what embeddings lose. Instead of storing text, we extract the underlying fact:

(gave :subject Alice :object book :recipient Bob)

Now "who gave" has an answer: Alice. "Who received" has an answer: Bob. The structure is explicit, queryable, and preserved.
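A minimal Python sketch makes the contrast concrete. It uses a bag-of-words count as a crude stand-in for an embedding (real embedding models are order-aware to varying degrees, so this illustrates the failure mode rather than measuring it). The Hyperedge class and the helper names are hypothetical and not part of any Hypabase API.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Hyperedge:
    """Hypothetical container: a verb plus role-labeled participants."""
    verb: str
    roles: tuple  # sorted (role, participant) pairs, kept hashable


def bag_of_words(sentence: str) -> Counter:
    # Crude stand-in for an embedding: word counts with order discarded.
    return Counter(sentence.lower().split())


def edge(verb: str, **roles: str) -> Hyperedge:
    return Hyperedge(verb, tuple(sorted(roles.items())))


def participant(e: Hyperedge, role: str):
    # Return the participant filling `role`, or None if the role is absent.
    return dict(e.roles).get(role)


s1 = "Alice gave Bob a book"
s2 = "Bob gave Alice a book"

# Order-blind representation: the two sentences are indistinguishable.
same_bag = bag_of_words(s1) == bag_of_words(s2)

# Role-labeled representation: the two facts differ.
e1 = edge("gave", subject="Alice", object="book", recipient="Bob")
e2 = edge("gave", subject="Bob", object="book", recipient="Alice")

print(same_bag)                      # True
print(e1 == e2)                      # False
print(participant(e1, "recipient"))  # Bob
```

The order-blind view cannot tell the two sentences apart, while the role-labeled view answers "who received" with a direct lookup.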

The Extraction Target

Our target representation is a hyperedge with Kāraka-labeled participants. For the bike example:

Input: "I bought a hybrid bike from REI last month"

Output:

(bought 
  :subject user 
  :object "hybrid bike" 
  :source REI 
  :locus "last month"
  :memory_type episodic)

This is one hyperedge connecting four entities: user, hybrid bike, REI, and a time reference. Each entity has a role label. The edge has a type (bought) and metadata (episodic memory).

The notation we use is PENMAN — an S-expression format from computational linguistics. But the notation is incidental. What matters is the structure: verb + participants with roles + metadata.

Kāraka Roles as Extraction Schema

The 8 Kāraka roles provide a fixed vocabulary for extraction:

When you see...	Extract as...
Who did the action	:subject
What was affected	:object
Who/what benefited	:recipient
What tool/method was used	:instrument
Where something came from	:source
Where/when it happened	:locus
What property is described	:attribute
What value it has	:value

This vocabulary is closed. Every participant in every fact gets one of these eight labels. No schema drift, no ad-hoc invention.

Compare to free-form extraction where each fact might use different labels: "buyer", "purchaser", "customer" for the same semantic role. Closed vocabulary means consistent storage and predictable retrieval.
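One way to enforce a closed vocabulary is a small normalization step before storage. The sketch below is an assumption about how that might look, not Hypabase's implementation: the alias table, the function name, and the rule that a role may appear only once per fact are all hypothetical. Metadata keys such as :memory_type are not roles and would be handled separately.

```python
KARAKA_ROLES = frozenset({
    "subject", "object", "recipient", "instrument",
    "source", "locus", "attribute", "value",
})

# Hypothetical alias table: ad-hoc labels an extractor might emit,
# mapped onto the closed vocabulary before storage.
ROLE_ALIASES = {
    "buyer": "subject",
    "purchaser": "subject",
    "customer": "subject",
    "seller": "source",
}


def normalize_roles(roles: dict) -> dict:
    """Map aliases to canonical roles; reject labels outside the vocabulary."""
    normalized = {}
    for label, participant in roles.items():
        key = label.lstrip(":").lower()
        key = ROLE_ALIASES.get(key, key)
        if key not in KARAKA_ROLES:
            raise ValueError(f"unknown role: {label!r}")
        if key in normalized:
            # Assumption for this sketch: one participant per role per fact.
            raise ValueError(f"role {key!r} assigned twice")
        normalized[key] = participant
    return normalized


print(normalize_roles({"buyer": "user", ":object": "hybrid bike", "seller": "REI"}))
# {'subject': 'user', 'object': 'hybrid bike', 'source': 'REI'}
```

Whatever label the extractor chose, the stored fact ends up with the same role names, so a later query for :subject finds it.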

Extraction Examples

Multi-participant event:

"Bob sent the quarterly report to Alice using the shared drive yesterday."

(sent 
  :subject Bob 
  :object "quarterly report" 
  :recipient Alice 
  :instrument "shared drive"
  :locus yesterday
  :memory_type episodic)

One hyperedge, five participants, each with a distinct role. The complete event is captured as a single retrievable unit.

Nested belief:

"Alice thinks Bob prefers tea."

(thinks 
  :subject Alice 
  :object (prefers :subject Bob :object tea))

The :object of "thinks" is itself a structured fact. Nesting preserves attribution — we know it's Alice's belief about Bob's preference, not a direct observation.
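The notation in these examples is simple enough to parse with a short recursive routine. The following is a sketch that handles only the simplified form used in this article (a verb followed by :role value pairs, where a value may be a nested atom); it is not a full PENMAN parser, and the function names and output shape are assumptions.

```python
import re

# Tokens: parentheses, double-quoted strings, or runs of other characters.
TOKEN = re.compile(r'\(|\)|"[^"]*"|[^\s()]+')


def parse(text: str) -> dict:
    """Parse one atom into {"verb": ..., "roles": {...}}, nesting as needed."""
    tokens = TOKEN.findall(text)
    node, pos = _parse_node(tokens, 0)
    if pos != len(tokens):
        raise ValueError("trailing tokens after the atom")
    return node


def _parse_node(tokens: list, pos: int):
    if pos >= len(tokens) or tokens[pos] != "(":
        raise ValueError("expected '('")
    if pos + 1 >= len(tokens) or tokens[pos + 1] in ("(", ")"):
        raise ValueError("expected a verb after '('")
    node = {"verb": tokens[pos + 1], "roles": {}}
    pos += 2
    while pos < len(tokens) and tokens[pos] != ")":
        role = tokens[pos]
        if not role.startswith(":"):
            raise ValueError(f"expected a :role, got {role!r}")
        pos += 1
        if pos >= len(tokens) or tokens[pos] == ")":
            raise ValueError(f"role {role} has no value")
        if tokens[pos] == "(":
            # A nested atom: recurse and continue after its closing paren.
            value, pos = _parse_node(tokens, pos)
        else:
            value = tokens[pos].strip('"')
            pos += 1
        node["roles"][role[1:]] = value
    if pos >= len(tokens):
        raise ValueError("missing ')'")
    return node, pos + 1


belief = parse("(thinks :subject Alice :object (prefers :subject Bob :object tea))")
print(belief["roles"]["subject"])                     # Alice
print(belief["roles"]["object"]["verb"])              # prefers
print(belief["roles"]["object"]["roles"]["subject"])  # Bob
```

Because the inner fact stays nested under the :object of "thinks", attribution survives parsing: Bob's preference is reachable only through Alice's belief.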

Why This Produces Better Memory

1. Retrieval becomes precise

With structured extraction, queries map to role filters:

"What does the user prefer?" → find edges where :subject = user, verb = prefers, return :object
"What tools does Alice use?" → find edges where :subject = Alice, return :instrument
"What happened at REI?" → find edges where :source = REI or :locus = REI

Without structure, these queries rely on embedding similarity — hoping "prefer" and "like" are close enough, hoping "REI" appears in relevant text. Structure makes retrieval deterministic.
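As a sketch of what role filters can look like in code, the example below runs three such queries over a small in-memory list. The Edge class and the find and mentioning helpers are hypothetical, and the data is adapted from this article's examples for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class Edge:
    """Hypothetical in-memory hyperedge: a verb plus role -> participant."""
    verb: str
    roles: dict = field(default_factory=dict)


def find(edges, verb=None, **role_filters):
    """Return edges matching the verb (if given) and every role filter."""
    return [
        e for e in edges
        if (verb is None or e.verb == verb)
        and all(e.roles.get(role) == value for role, value in role_filters.items())
    ]


def mentioning(edges, entity):
    """Return edges where the entity fills any role."""
    return [e for e in edges if entity in e.roles.values()]


memory = [
    Edge("bought", {"subject": "user", "object": "hybrid bike",
                    "source": "REI", "locus": "last month"}),
    Edge("sent", {"subject": "Bob", "object": "quarterly report",
                  "recipient": "Alice", "instrument": "shared drive"}),
    Edge("prefers", {"subject": "Bob", "object": "tea"}),
]

bob_prefers = [e.roles["object"] for e in find(memory, verb="prefers", subject="Bob")]
bob_tools = [e.roles["instrument"] for e in find(memory, subject="Bob")
             if "instrument" in e.roles]
rei_events = [e.verb for e in mentioning(memory, "REI")]

print(bob_prefers)  # ['tea']
print(bob_tools)    # ['shared drive']
print(rei_events)   # ['bought']
```

Each query is an exact match on roles, so the same data always returns the same answer. The mentioning helper covers questions like "what happened at REI" where the entity could sit in more than one role.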

2. Graph connections emerge naturally

When the same entity appears in multiple hyperedges, it becomes a connection point:

(bought :subject user :object "hybrid bike" :source REI ...)
(uses :subject user :object "hybrid bike" :locus commute ...)
(repaired :subject user :object "hybrid bike" :instrument "new brakes" ...)

Three hyperedges, connected through "hybrid bike" and "user". Query for the bike and you get the complete history: purchase, usage, maintenance. The graph structure emerges from extraction — no manual linking required.
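The linking described above can be sketched as an inverted index from entity to the edges it participates in. This is an illustrative structure, not Hypabase's storage layout, and it assumes entities are matched by exact string; a real system would also need entity resolution so that "hybrid bike" and "my bike" land on the same node.

```python
from collections import defaultdict

edges = [
    ("bought", {"subject": "user", "object": "hybrid bike", "source": "REI"}),
    ("uses", {"subject": "user", "object": "hybrid bike", "locus": "commute"}),
    ("repaired", {"subject": "user", "object": "hybrid bike",
                  "instrument": "new brakes"}),
]


def build_index(edges):
    """Map each entity to the (edge id, role) pairs where it appears."""
    index = defaultdict(list)
    for edge_id, (_verb, roles) in enumerate(edges):
        for role, entity in roles.items():
            index[entity].append((edge_id, role))
    return index


def history(edges, index, entity):
    """Verbs of every edge the entity participates in, in insertion order."""
    return [edges[edge_id][0] for edge_id, _role in index.get(entity, [])]


def neighbors(edges, index, entity):
    """Other entities that share at least one edge with `entity`."""
    found = set()
    for edge_id, _role in index.get(entity, []):
        found.update(edges[edge_id][1].values())
    found.discard(entity)
    return found


index = build_index(edges)
print(history(edges, index, "hybrid bike"))  # ['bought', 'uses', 'repaired']
print(sorted(neighbors(edges, index, "hybrid bike")))
# ['REI', 'commute', 'new brakes', 'user']
```

No edge was linked to another by hand; the shared entity strings are what connect them.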

3. N-ary facts stay coherent

The bike purchase is one fact with four participants. In a triple-based system:

(user, bought, bike)
(bike, from, REI)  
(purchase, when, last_month)

Three fragments. The connection between "bought" and "REI" requires inference. In hypergraph extraction, it's one edge — the structure matches the original fact.

The Extraction Process

Extraction converts natural language into structured hyperedges:

Input: Conversation text (windowed to preserve context)
Extract: Identify facts and map participants to Kāraka roles
Validate: Parse output, validate role names, coerce types
Store: Each extracted atom becomes a hyperedge with labeled incidences

The extractor itself can be an LLM or a dedicated NLP pipeline. LLMs offer flexibility with informal language, implicit references, and domain-specific terminology — they handle the linguistic complexity (passive voice, coreference, implied arguments) that rule-based systems struggle with. Trained semantic parsers offer higher throughput and lower cost for standard language patterns.

The key architectural decision: extraction happens outside the memory system. Hypabase Memory does graph work — store, traverse, score, return. It accepts structured atoms and persists them. The calling agent or pipeline handles the natural language parsing, choosing whatever extraction method fits the use case.
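The four steps and the separation of concerns described above can be sketched as follows. Everything here is an assumption made for illustration: the atom shape, the MemoryStore class, the window size, and the stub extractor that stands in for an LLM call or a semantic parser. None of it is the Hypabase API.

```python
from typing import Callable

ROLES = {"subject", "object", "recipient", "instrument",
         "source", "locus", "attribute", "value"}


def windows(turns: list, size: int = 3):
    """Yield (context, turn): each turn with up to size - 1 preceding turns."""
    for i, turn in enumerate(turns):
        context = "\n".join(turns[max(0, i - size + 1):i])
        yield context, turn


def validate(atom: dict) -> dict:
    """Check role names against the vocabulary and coerce values to strings."""
    roles = {}
    for role, value in atom.get("roles", {}).items():
        name = role.lstrip(":")
        if name not in ROLES:
            raise ValueError(f"unknown role: {role!r}")
        roles[name] = str(value).strip()
    return {"verb": str(atom["verb"]).strip().lower(), "roles": roles}


class MemoryStore:
    """Hypothetical memory layer: it stores atoms and never parses text."""

    def __init__(self):
        self.edges = []

    def add(self, atom: dict) -> None:
        self.edges.append(atom)


def ingest(turns: list, extractor: Callable[[str, str], list],
           store: MemoryStore) -> int:
    """Run extract -> validate -> store for every turn; return atoms stored."""
    stored = 0
    for context, turn in windows(turns):
        for atom in extractor(context, turn):
            store.add(validate(atom))
            stored += 1
    return stored


def stub_extractor(context: str, turn: str) -> list:
    # Placeholder for an LLM or parser; recognizes one known sentence only.
    if "bought a hybrid bike from REI" in turn:
        return [{"verb": "Bought",
                 "roles": {":subject": "user", ":object": "hybrid bike",
                           ":source": "REI", ":locus": "last month"}}]
    return []


store = MemoryStore()
count = ingest(["Hi there.", "I bought a hybrid bike from REI last month."],
               stub_extractor, store)
print(count)                   # 1
print(store.edges[0]["verb"])  # bought
```

Because the extractor is just a callable passed to ingest, it can be swapped for an LLM or a trained parser without touching validation or storage.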

Why Not Triples or Text Chunks?

Consider: "Alice gave Bob a book at the library."

As triples:

(Alice, gave, book)
(giving_event, recipient, Bob)
(giving_event, location, library)

Three fragments plus a synthetic node. The coherence is lost. "Who did Alice give books to?" requires joining across triples. Different extractors might produce (book, given_by, Alice) — no enforced role semantics.

As text chunks:

The sentence embeds as a vector, and swapping Alice and Bob barely changes that vector: the role reversal is invisible. You can't query "who was the recipient"; you can only ask "what chunks are similar to 'gave book'".

As a hyperedge:

(gave :subject Alice :object book :recipient Bob :locus library)

One edge, four labeled participants. Query by any role. No joins, no ambiguity, no lost structure.
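To make the join cost concrete, here is a sketch that answers "who did Alice give something to?" against both representations. The triple side assumes a fully reified event node (here called e1) so that a join is possible at all; the predicate names and helper functions are hypothetical.

```python
# Reified triples: every participant hangs off a synthetic event node.
triples = [
    ("e1", "type", "gave"),
    ("e1", "agent", "Alice"),
    ("e1", "object", "book"),
    ("e1", "recipient", "Bob"),
    ("e1", "location", "library"),
]

# The same fact as a single role-labeled hyperedge.
hyperedges = [
    {"verb": "gave",
     "roles": {"subject": "Alice", "object": "book",
               "recipient": "Bob", "locus": "library"}},
]


def recipients_from_triples(triples, giver):
    # Pass 1: events of type "gave". Pass 2: events whose agent is the giver.
    gave_events = {s for s, p, o in triples if p == "type" and o == "gave"}
    by_giver = {s for s, p, o in triples if p == "agent" and o == giver}
    # Join the two sets, then make a third pass to read the recipient.
    events = gave_events & by_giver
    return [o for s, p, o in triples if s in events and p == "recipient"]


def recipients_from_hyperedges(edges, giver):
    # One pass: every condition is checked on the same edge.
    return [
        e["roles"]["recipient"] for e in edges
        if e["verb"] == "gave"
        and e["roles"].get("subject") == giver
        and "recipient" in e["roles"]
    ]


print(recipients_from_triples(triples, "Alice"))        # ['Bob']
print(recipients_from_hyperedges(hyperedges, "Alice"))  # ['Bob']
```

Both return the same answer, but the triple version needs a join across three scans and depends on every extractor agreeing on the event-node convention.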

The Tradeoff

Structured extraction has a cost: extraction quality bounds system quality.

If the LLM extracts "bought" as "acquired" or drops the time reference, that information is lost. No retrieval algorithm can recover what extraction missed.

This is the representation thesis from Part 1 of this series: the structure you store determines the ceiling of what you can retrieve. Structured extraction raises the ceiling, but it also makes extraction errors consequential.

In practice, modern LLMs tend to extract reliably when prompted well, and for queries that depend on who did what to whom, the precision gained from structure generally outweighs the occasional extraction error. Extraction errors are also debuggable: you can inspect the hypergraph and see exactly what was stored.

Hypabase Memory: Kāraka-Based Extraction

Hypabase Memory stores structured hyperedges labeled with Kāraka roles; the extraction itself stays with the calling agent or pipeline. The remember() API accepts PENMAN-formatted atoms, validates role names against the 8-role vocabulary, and stores each as a hyperedge with labeled incidences. The same structure that enables precise storage enables precise retrieval.

Previous: What Ancient Sanskrit Solves in AI Memory

Next in series: The Forgetting Problem: How Neuroscience Solves AI Memory