Every Tool Solving the AI Memory Problem Is Solving a Different Problem

Mem0, Zep, GBrain, Karpathy's wiki, Hindsight: what each one actually does, and the gap they all leave open.

Blog 2 of 3 in the XTrace Memory Manager series

If you read Blog 1, you already know the problem: agents forget. Every session starts cold. Context degrades. The decisions you made three weeks ago, the rationale behind the architecture you chose, the version history of the thing you're actually building — none of it survives between calls. Your AI assistant is perpetually new on the job.

So now you're asking the reasonable follow-up: what should I actually use?

The honest answer is: it depends on what problem you're trying to solve. The memory landscape is crowded right now, and most tools are genuinely good at something. The mistake is assuming they're all solving the same problem. They're not.

The "Compile It Once" Insight

Two releases from April 2026 put memory back in the conversation in a big way.

Andrej Karpathy's LLM Wiki proposed something conceptually clean: instead of re-deriving answers from raw documents at every query, have an LLM compile a persistent wiki. Three layers — raw sources (immutable), wiki pages (LLM-maintained markdown), schema (how to maintain it). His personal version grew to 100 articles and 400,000 words, none written by him. The key insight: "knowledge is compiled once and then kept current, not re-derived on every query."

A week later, Garry Tan open-sourced GBrain — a personal memory system built on markdown files and Postgres/pgvector. 5,400 GitHub stars in 24 hours. It implements the same core idea with a "compiled truth + timeline" pattern: a summary at the top of every entity page, chronological entries below. A "Dream Cycle" runs overnight enriching pages. Integrates with Claude Code, Cursor, and others via 37 MCP operations.
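The compiled-truth-plus-timeline pattern is simple enough to sketch in a few lines. The snippet below renders a hypothetical entity page in that shape — current summary on top, chronological entries below. The function name and layout are illustrative only, not GBrain's actual page format:

```python
from datetime import date

def render_entity_page(name: str, summary: str, entries: list[tuple[date, str]]) -> str:
    """Render the 'compiled truth + timeline' pattern: a maintained
    summary first, then raw chronological entries. Illustrative only."""
    lines = [f"# {name}", "", "## Summary", summary, "", "## Timeline"]
    for d, text in sorted(entries):  # tuples sort by date first
        lines.append(f"- {d.isoformat()}: {text}")
    return "\n".join(lines)
```

An overnight "Dream Cycle" in this model would simply rewrite the summary from the accumulated timeline entries, leaving the entries themselves untouched.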

These got something genuinely right. The compiled/persistent knowledge pattern is an improvement on the standard RAG approach. Human-readable, auditable markdown memory is a real advantage. If you want a personal knowledge base that your coding agent can consult, GBrain is a reasonable starting point.

What they leave open: both are document-compilation systems. They capture what you know, not what you decided and why. There's no belief revision — when a fact changes, GBrain rewrites the page, but there's no lineage, no provenance, no record of what was believed before. Both are self-hosted and developer-operated: Karpathy's wiki requires Claude Code and a terminal; GBrain requires managing a Postgres instance. And everything is in plaintext on your filesystem.

The Managed Memory Platforms

Mem0 is the most widely adopted memory layer right now. Flat key-value facts with vector search, open source self-hosted plus managed cloud. Benchmarks at 64.20% on LoCoMo (versus a 91.21% full-context ceiling). It's genuinely good for user profile memory — remembering that a user prefers TypeScript, dislikes verbose explanations, works in the healthcare space. Where it struggles: facts that change over time, conflict resolution between contradictory beliefs, anything requiring reasoning about why a decision was made. Graph memory is behind a $249/month paywall.

Zep is the most technically interesting managed option. Its standout differentiator is temporal knowledge graphs — every node and edge carries valid_at and invalid_at timestamps. Benchmarks at 85.22% on LoCoMo, substantially better than Mem0. It actually knows when a fact was true, which matters more than most people realize. Context assembly (not just raw retrieval) is a real advantage over simpler systems. Where Zep stops short: it tracks temporal change without formally modeling belief status. Knowing when something changed is not the same as knowing why, or recording that a belief was retracted versus superseded. Cloud-only.

LangMem makes sense if you're already deep in the LangChain/LangGraph ecosystem. Flat key-value, no graph, no temporal reasoning. Free and open source. Limited retrieval quality compared to Zep or Supermemory. The integration story is the value proposition, not the memory model.

Letta/MemGPT takes a different angle: give the agent explicit control over its own memory. OS-inspired paging — agents decide what to write and read. Conceptually powerful, especially for autonomous agents that need to manage their own context window. Complex to operate in production. No semantic conflict resolution between beliefs.

Supermemory benchmarks well — 85.4% on LongMemEval. All-in-one cloud: memory plus RAG plus user profiles plus connectors. Fast. The trade-off: closed-source, no self-hosting outside enterprise agreements, no encryption. If you want a fast managed service and your data sensitivity requirements are low, it's worth evaluating.

Hindsight is the most interesting recent entrant in the retrieval-quality race. Open source (MIT), 91.4% on LongMemEval with a scaled backbone. Four parallel retrieval strategies plus cross-encoder reranking. First open-source system to break 90% on that benchmark. If retrieval fidelity is your primary constraint and you can self-host, Hindsight deserves serious consideration. Still fundamentally retrieval-first — it returns highly relevant results, but doesn't model beliefs.

MnemeBrain is the conceptually closest neighbor to what XMem is building — a belief-state memory system grounded in AGM revision theory and Truth Maintenance Systems. Small and early, but the architecture is interesting. Worth watching.

The DIY Approaches

Custom CLAUDE.md files, .brain folders, manual markdown in prompts — these are everywhere in the vibe-coding community. They work for session-level context. They fall apart over weeks. Karpathy himself described them as temporary scaffolding. The pattern of "developer maintains a markdown file by hand" doesn't survive contact with real project complexity.

Where XMem Is Actually Different

That's the fair accounting. Now the case for XMem, across four specific dimensions.

Work memory, not user memory

Every existing tool is optimized for one of two things: user profile memory (facts about the person using the system) or document retrieval (searching a corpus). Neither captures work memory — the decisions, artifacts, rationale, and version history of what's being built.

Felix, XTrace's founder, frames it this way: "Our hypothesis is that 80-90% of our work will involve AI. But there's currently no good way to capture work memory — the decision graph that explains why you moved from V1 to V2 to V3."

GBrain and Karpathy's wiki come closest to this idea. But they're document compilation systems. They capture what you know. Work memory captures what you've built and why — the reasoning behind the architectural choice, the version history of the artifact, the context that explains a decision made three weeks ago. That's a different substrate.

Belief revision, not overwrite

All existing tools resolve conflicting beliefs the same way: overwrite or append. Zep is the best at temporal tracking — it knows when a fact changed. But no existing system formally models belief status.

XMem is grounded in AGM belief revision theory, the formal framework for rational belief change developed in epistemology and applied by AI researchers. Every belief in XMem carries a status: ACTIVE, SUPERSEDED, or RETRACTED. There's an entrenchment hierarchy: a user-stated preference outranks an LLM inference. When a belief changes, the lineage is preserved: the record captures not just that the belief changed, but why. A retraction (a belief abandoned without contradiction) is recorded differently from a supersession (a belief replaced by something stronger).

This matters in practice when you're working across weeks. "We decided against microservices" is a different memory artifact than "we haven't discussed microservices." The distinction between retraction and supersession is load-bearing in long-running work.
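As a rough sketch of what a belief-state record could look like: the field names, the entrenchment rule, and the tie-breaking behavior below are assumptions for illustration, not XMem's actual schema.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class Status(Enum):
    ACTIVE = "active"
    SUPERSEDED = "superseded"   # replaced by a stronger belief
    RETRACTED = "retracted"     # abandoned without contradiction

class Source(Enum):
    # Entrenchment hierarchy: lower value = more entrenched.
    USER_STATED = 0
    LLM_INFERRED = 1

@dataclass
class Belief:
    content: str
    source: Source
    status: Status = Status.ACTIVE
    supersedes: Optional["Belief"] = None   # lineage pointer to the old belief

def revise(current: Belief, incoming: Belief) -> Belief:
    """AGM-style revision sketch: the incoming belief wins only if it is
    at least as entrenched as the current one; lineage is preserved."""
    if incoming.source.value <= current.source.value:
        current.status = Status.SUPERSEDED
        incoming.supersedes = current
        return incoming
    return current  # weaker evidence cannot displace an entrenched belief

def retract(belief: Belief) -> None:
    """Abandon a belief outright -- recorded differently from supersession."""
    belief.status = Status.RETRACTED
```

The point of the lineage pointer is that nothing is overwritten: asking "what did we believe before, and why did it change?" is a walk down the `supersedes` chain.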

The context agent: a minimum viable context for any task

Every existing tool solves the same sub-problem: given a query, retrieve relevant memories. None of them help an agent understand which memories it needs before a task starts.

XMem's context agent does something different. Before a task begins, it runs a scoping interview, packages a relevant context space, and produces a structured "contract": a document that tells the agent what's in the context, when to use it, and where its limits are. It's the difference between dumping a search result set into a prompt and creating a documented interface between memory and the agent consuming it.

Felix's analogy: "It's like when Claude and Cursor are training people to use plan mode. You first go through a thorough plan, then summarize all your tasks. The context agent is one step after plan mode — before you start working, gather all the correct context."

Because the contract is structured and explicit, XMem is agent-agnostic. The same context package works for Claude, ChatGPT, or a custom agent — because the rules are spelled out, not embedded in retrieval heuristics.

Encrypted by default

This is structural, not a feature toggle.

Every other tool stores memories in plaintext, on their servers or yours. Mem0, Zep, GBrain, Supermemory, LangMem — all plaintext. XMem stores everything on XTrace's encrypted vector infrastructure: AES-256 client-side encryption, Paillier homomorphic encryption for vector search. The server cannot read your data.

This matters specifically for the kind of memory XMem is designed to hold: decisions, architectural rationale, project history, preferences revealed through work. That's sensitive in a way that a user's preferred programming language is not. As Felix put it directly: "If you're using other vector databases, I can see all your data. That's the difference."

What XMem Is Not

Fair is fair.

XMem is not a document retrieval system. If your use case is "search a corpus of PDFs," Supermemory or a standard RAG stack is the right tool. XMem is not released yet: the API is in private development, and documentation is coming. The context agent is in active development; the belief revision engine is the mature piece. On benchmarks, the predecessor architecture (EverMemOS) scored 92.32% on LoCoMo, above both Zep and Mem0; XMem-specific numbers will publish at launch.

Who This Is For

If you're building agents that operate across days and weeks — not single sessions — and you've felt the friction of context degradation, XMem is worth following.

If you work with data that shouldn't live in plaintext on someone else's server — financial decisions, client project history, architectural choices — the encryption model is structural, not cosmetic.

If you're frustrated that existing memory tools help you remember facts but not reasoning, that's the gap XMem is designed to fill.

And if you're just interested in where agent memory is actually going — past retrieval, into formal belief revision, toward systems that can hold the full context of complex evolving work — we're publishing everything as we build it.

Follow XTrace and request early API access. Blog 3 covers the architecture: how the belief graph, the context agent, and the encrypted vector layer actually fit together.

XTrace is building the Memory Manager API for AI agents. Blog 1: The problem with agent memory. Blog 3: The architecture (coming soon).

Frequently Asked Questions

What are the main AI agent memory tools in 2026?

The landscape splits into four categories. Managed platforms: Mem0 (flat key-value, 64.20% on LoCoMo), Zep (temporal knowledge graphs, 85.22% on LoCoMo), Supermemory (all-in-one cloud, 85.4% on LongMemEval), and LangMem (LangChain-native). Document compilation: Karpathy's LLM Wiki and Garry Tan's GBrain. Agent-controlled: Letta/MemGPT. Retrieval leaders: Hindsight (open source, 91.4% on LongMemEval). Belief-state systems: MnemeBrain and XTrace's XMem. Each is good at something different; the mistake is treating them as interchangeable.

What's the difference between Mem0 and Zep?

Mem0 stores flat key-value facts with vector retrieval, optimized for user profiles and stable preferences (64.20% on LoCoMo). Zep builds a temporal knowledge graph where every node and edge carries valid_at and invalid_at timestamps, so it knows when a fact was true (85.22% on LoCoMo). The shared limitation: both retrieve facts, but neither models belief status, i.e. whether a fact was overwritten, superseded, or explicitly retracted. Mem0 is simpler for profiles; Zep is stronger for time-sensitive data; neither handles decision lineage.

How is XMem different from Mem0, Zep, and other AI memory tools?

XMem differs on four dimensions. (1) Work memory, not user memory: it captures decisions, artifacts, and rationale rather than user profile facts. (2) Belief revision: every belief carries a status (ACTIVE, SUPERSEDED, RETRACTED), lineage, and an entrenchment hierarchy where user-stated facts outrank LLM inferences. (3) Context agent: before a task starts, XMem packages a structured contract that works across Claude, ChatGPT, or custom agents. (4) Encrypted by default: AES-256 client-side plus Paillier homomorphic encryption, so the server cannot read your data, unlike Mem0, Zep, GBrain, Supermemory, and LangMem, which all store memory in plaintext.