BLOG

May 1, 2026

By Alyx

How to give your AI agents persistent memory (and why most solutions only solve part of it)

A frank look at vector stores, memory files, and hosted chat memory. They each fix a slice of the amnesia problem. None of them scope memory to the work itself.

All posts

Last Tuesday I asked Claude to add a new route to the web server. It cheerfully wrote a file that duplicated three utilities we already had, used a naming convention we moved away from in January, and imported from a package we deleted when we split the monorepo. I corrected it, watched it apologise, and got working code ten minutes later.

Wednesday I opened a new session, asked a similar question, and watched the exact same thing happen.

Then I switched to Codex to do some Tailwind work, and Codex had never heard of any of it: the naming convention, the deleted package, the utilities. It was a blank stranger on day one.

That's the problem this post is about, and I mean the operational version of it, not the philosophical one about what memory even means for a language model. You have real work in flight, agents helping with it, and the agents keep forgetting; every time they forget, you pay the cost of re-explaining.

If you search for how to give an AI agent persistent memory, you land in a pile of framework articles aimed at people building memory systems from scratch. Most of us don't want to build one; we just want our agents to remember. This post is an honest survey of what works, where the gaps are, and what I ended up doing about it.

Why agents forget in the first place

Three mechanics produce the amnesia.

First, the context window is finite. Modern frontier models have very large windows (Claude Opus goes up to a million tokens at the time of writing), but that is still not a hard drive. Anything you want the model to know in a given turn has to be inside the window at inference time, and between large codebases, long conversations, and lots of retrieved documents, the window fills up fast.

Second, sessions are bounded. When you close a chat or start a new one, the model does not carry anything forward on its own. The API call is stateless. Each message from your client sends the full conversation history back, and when you start a new conversation, that history is empty.

Third, there are usually multiple agents in play. I use Claude for code and copy, Codex for UI iteration, Gemini for long-document analysis, and each one has its own vendor, its own chat history, its own notion of what we talked about. There is no shared substrate between them, so giving Claude a perfect memory does nothing for Codex.

The question "how do I add persistent memory" is really the question "how do I make some state survive between calls, survive between sessions, and be accessible to more than one agent." Those three constraints do not always get addressed together.

The three approaches in the wild

People solve parts of this in roughly three ways.

Memory frameworks and vector stores

Mem0, Zep, Letta, LangMem, and a long tail of smaller projects. The shape is similar across most of them. The framework intercepts conversations, extracts facts or summaries with an LLM pass, and stores them in some retrieval backend. On the next turn it runs a query against that store and injects relevant memories back into the prompt.

The differences are real but narrower than the marketing suggests. Mem0 uses a hybrid of vector, graph, and key-value stores, scopes memory by user and session and agent, and relies on LLM extraction to decide what gets written down. Zep's newer version is built on a temporal knowledge graph called Graphiti, which tracks when facts were true, not just what facts existed, useful if "when did we switch off Stripe" is a question you need answered. LangMem is a lighter library that plugs into LangGraph.

What you get: your agent remembers things across sessions without you typing them every time. Retrieval is fuzzy, so paraphrases still work.

What you pay for: another service in your stack, extra LLM calls on every turn for extraction and retrieval, a vector database to run or subscribe to, and a nontrivial dance around what counts as a memory worth storing. The "LLM decides what to remember" step is both the magic and the weak point, and it will sometimes forget the thing you wanted and remember that you once said "hello." It is also still scoped to one agent or one app, so Mem0 does nothing for Codex.

Memory files and skills

The other common pattern is older and simpler. You write a markdown file and the agent reads it every session.

Claude Code popularised this with CLAUDE.md. You put project instructions, conventions, and architectural notes into a file at the repo root. Claude Code walks up the directory tree, concatenates what it finds, and stuffs the result into context at session start. Recent versions also ship an auto-memory feature where Claude writes its own notes into a per-project directory and reloads an index file on startup.

Anthropic's API has a related primitive called the memory tool. It exposes a file-system-like interface on a /memories directory that your application is responsible for implementing. The model writes and reads files; you decide where they live. This is genuinely flexible, and because it is client-side you can back it with whatever you want.

What you get: memory you can read, edit, and put in version control, with no extraction magic deciding what was worth remembering on your behalf.

What you pay for: you write and maintain the files yourself, or you trust the model to write them well. Files grow, adherence drops past a couple of hundred lines, and the whole thing is still scoped to one agent. CLAUDE.md does nothing for Codex.

Hosted chat memory

The third approach is the one most end users meet first. ChatGPT's memory, Claude's built-in memory, Gemini's equivalent. The vendor scans your chat history, generates a rolling summary of facts about you, and surfaces it in new conversations.

What you get: zero configuration, and a decent answer to "why doesn't this tool remember my name."

What you pay for: the memory is scoped to you as a user, inside one vendor's product. It is not scoped to a project, and the chance that your personal preferences leak into a work session (or vice versa) is the reason many teams turn it off. It also cannot accept input from other agents; Claude's memory knows what Claude learned, and has never heard of Codex.

The blind spot

Notice what the three approaches share. They all scope memory to a user, to an app, or to an agent, and none of them scope memory to the actual work.

When I think about my actual projects, the unit that should own the memory is not "me" and not "Claude" and not anthropic.com. It is the project itself: the workunit, the feature, the investigation, the thing with a problem statement and a goal attached.

The reason a decision from three weeks ago matters today is because it bears on this specific problem. The reason a failed approach matters is because the next agent attacking the same problem should not rerun it. Memory that is right for a project is memory that any agent working on that project can read, and any agent making progress can write to.

User-scoped memory can't do that, agent-scoped memory can't do that, and a file tucked into one tool's config directory can't either.

What shared agent memory actually needs

If you sketch out the feature honestly, you end up with a short list of requirements.

Workspace scope

A memory lives attached to a project, feature, or workunit, not to a person or a tool. Two people and three agents can read the same memory because they are working on the same thing.

Typed records

"Everything we learned" is too broad to be useful. It helps to distinguish a decision (we chose HTMX over React) from an attempt (we tried batching the queries, it regressed latency by 40 percent) from an insight (Postgres LISTEN blocks during a long transaction). Retrieval works better on typed records because your query is usually "what decisions did we make" or "what have we tried," not an undifferentiated semantic search.

Importance and supersession

A decision from last week can be reversed by a decision from this week. The old one should not disappear (the reasoning is often still useful), but it should be clearly marked as superseded. Without this, memory rots. You get two contradictory notes and no way to tell which one is current.

Reach

Every agent you care about should be able to read and write it. In practice today, that means an MCP server. MCP is the one protocol that Claude, Codex, Cursor, Gemini CLI, and most of the serious clients actually implement. If your memory lives only behind a REST API with a bespoke SDK, you will be writing adapters forever.

Humans in the loop

The memory cannot be write-only by agents. Humans should be able to read what the agents are saying to each other, edit it, correct it, and write their own notes in the same stream.

Context atoms, briefly

We ended up building this for ourselves. Workunit organises work into workunits (the name is doing some work there), and every workunit has a stream of what we call context atoms. An atom is one typed record: decision, insight, question, attempt, or progress. Each has an importance, optional tags, and an optional supersedes link that points back at the atom it replaces.

The key move is that atoms belong to the workunit, not the agent. Claude can save a decision, Codex can read it on the next session, Gemini can write an insight, and I can edit any of them in the web UI. All of them travel through the same MCP server, so the same call signature works from any MCP-capable client.

Retrieval is filter-first (by type, by tag, by importance) with an optional text search, because in practice the question you ask your memory is usually "what decisions are still live" or "what have we already tried on the cache layer," not a vague similarity query. The supersede link preserves history: you can see that we used to think X and then changed our minds, with the reasoning intact on both atoms.

This is not the only shape you could build. A temporal knowledge graph over the same data would give you better "what was true on this date" queries; a vector index on top would give you better fuzzy recall, and we may add both. But the core commitment stays the same, and it is the part I keep coming back to: the memory is attached to the work, not to any one of the agents doing the work, and every agent reaches it through the same protocol. The interesting question was never how a language model remembers, because inside one window it remembers fine. The question is where memory lives once the window closes, and who else is allowed to touch it.

Get started free

Bring your AI agents and your team into one workspace

Workunit gives your agents structured context and your team a shared place to plan, track, and ship the work. Free to start, no credit card.

Keep going

Dive deeper

Want to go from the idea to the actual workflow? These guides walk through it step by step.

AI Features

Context atoms, the AI digest, and per-project model defaults.

Open guide

MCP Integration

How agents save and retrieve memory through the MCP server.

Open guide

Questions?

The full guide library covers every part of Workunit, and the community discussions are open if you want to ask.

All Guides Community Discussions