Frequently Asked Questions

Is agent memory just RAG?

No, though RAG is part of it. Retrieval-augmented generation is one mechanism for loading semantic memory into the context window at query time. Agent memory is the larger system that also includes working memory, episodic event records, and procedural routines, plus the decisions about what to store where and when to retrieve it. RAG is how one layer gets read. It is not the whole stack.

Will a bigger context window fix forgetting?

It buys you room, not memory. A larger window lets you hold more at once during a single run, but it is still cleared when the run ends, still attended to unevenly when it is full, and still billed by the token. Anything you want the agent to know across sessions has to live in a durable store outside the prompt. The window is a desk, not a filing cabinet, no matter how large the desk gets.

Do I need a knowledge graph for agent memory?

Only for the semantic layer, and only when your questions are genuinely about relationships across multiple hops. If your recall is mostly "find me things similar to this," a vector index is simpler and has far less to maintain. If it is "trace how this connects to that through three other entities," that is what a graph is for. We wrote a separate post on how to tell which one you actually need.

What is the difference between episodic and semantic memory for an agent?

Episodic memory is events: what happened, when, in what order. Semantic memory is settled facts: what is true regardless of any single event. "The user asked us to prioritize speed last Tuesday" is episodic. "This contract is governed by New York law" is semantic. Agents need both, and they belong in different stores: an event log for the first and a graph or fact store for the second.

How do I start fixing my agent's memory?

Begin by naming which layer is actually failing. Sort your agent's mistakes into the four kinds: is it losing the current task, forgetting past events, missing settled facts, or relearning the same routine every run? Each points at a different store. Most teams discover they have built only working memory and need an episodic log and a semantic layer next. If you want a second set of eyes on which layer is the bottleneck, that is the kind of review we do.
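One way to run that sort is a simple tally. The sketch below assumes you have already hand-labelled each observed mistake with one of the four layers; the labels, the function name, and the sample data are illustrative, not part of any tool.

```python
from collections import Counter

# Where each layer lives, as described in this post.
STORE_FOR_LAYER = {
    "working": "context window",
    "episodic": "event log",
    "semantic": "knowledge graph or fact store",
    "procedural": "tools, prompts, and routing logic",
}


def weakest_layers(labelled_mistakes):
    """Count mistakes per layer, most frequent first."""
    counts = Counter(labelled_mistakes)
    unknown = set(counts) - set(STORE_FOR_LAYER)
    if unknown:
        raise ValueError(f"unknown layer labels: {sorted(unknown)}")
    return counts.most_common()


# Hypothetical labels from reviewing a week of agent failures.
mistakes = ["episodic", "semantic", "episodic", "procedural", "episodic"]
for layer, count in weakest_layers(mistakes):
    print(f"{layer}: {count} mistakes -> look at the {STORE_FOR_LAYER[layer]}")
```

The layer with the highest count is the first store to build or repair.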

Why Your AI Agent Keeps Forgetting

TL;DR: Your agent forgets because it does not have a memory. It has a context window, and a context window is working memory only: ephemeral, expensive, and wiped at the end of every run. Real agent memory is a layered system: short-term working memory in the context window, plus long-term episodic, semantic, and procedural stores that live outside the prompt. The teams whose agents seem to "remember" did not buy a smarter model. They built the memory stack. This post defines that stack, shows where a knowledge graph fits inside it, and gives you the vocabulary for the rest of this series.

An agent nails a task on Monday and botches the same task on Thursday. The model did not get worse between Monday and Thursday. What happened is simpler and more frustrating: on Thursday it never actually remembered Monday. It re-read a transcript someone pasted back in, or it started cold. That is not memory. That is a goldfish with a very good vocabulary.

This is the first post in a new series on agent memory and context graphs. The last series was about graph rot, the slow, silent decay of a knowledge graph. This one is about the layer above it: how an agent holds on to what it knows, across a turn, across a session, and across months. Almost every "the agent is dumb" complaint we get called in to fix turns out to be a memory problem wearing a model costume. So before any of the deeper posts (the comparisons, the retrieval mechanics, the evaluation), we need a shared map of what agent memory actually is.

An agent without a memory architecture is not reasoning over your business. It is improvising from whatever happened to fit in the prompt this time.

What is agent memory, really?

Agent memory is the set of systems that let an agent carry information across the boundaries where it would otherwise be lost: across turns in a conversation, across separate sessions, and across the gap between one task and the next. It is not one thing. Borrowing from how cognitive scientists describe human memory, a working agent has several distinct kinds, and they live in different places.

There are four that matter in practice. Working memory is what the agent is actively holding right now, the current question and the few facts it just pulled. Episodic memory is the record of what happened, the events: this user asked for X last week, that run failed at step three. Semantic memory is the settled facts, the entities and relationships that are true regardless of any single conversation: this company owns that subsidiary, this clause supersedes that one. Procedural memory is the learned how-to, the routines and tool sequences the agent has found work for a given job.

Most agents people ship in 2026 have exactly one of these four. They have working memory, because the context window provides it automatically, and nothing else. Everything past the edge of the prompt is gone. That single missing distinction is the root of more "forgetting" bugs than any model limitation.

Figure: The agent memory stack. Four layers (working, episodic, semantic, procedural), showing what each holds, where it lives, and how long it lasts. Only working memory sits inside the context window.

Why isn't a context window the same as memory?

Because a context window is rented, not owned. It is working memory and only working memory: it holds what is in front of the agent for the duration of one run, and then it is gone. Treating it as the whole memory system fails in three specific ways, and they get worse as the system gets more useful.

First, it is ephemeral. The moment the run ends, the window is cleared. Anything the agent "learned" mid-task that you did not deliberately write somewhere durable is lost. The next session starts from zero, which is why so many agents feel competent in a demo and amnesiac in production.

Second, it is lossy under pressure. As the window fills, models attend to it unevenly: the well-documented tendency to lean on the beginning and the end and skim the middle. So even the things that are technically "in memory" are not reliably used. More context is not more memory. Past a point it is just more noise the model has to fight through.

Third, it is expensive, and the cost compounds in a way that is easy to miss until the invoice arrives. Picture a support agent that should know a customer's full history, say a hundred and fifty thousand tokens of past tickets and notes. If you keep that in the window, you resend all hundred and fifty thousand tokens on every turn. A forty-turn conversation is six million input tokens, for a single conversation, spent entirely on re-reading what the agent should already remember. At current frontier-model input prices that lands somewhere around ten to twenty dollars per conversation in resend cost alone, before the agent has produced one new sentence. Multiply by every customer and every day, and the window-as-storage approach collapses on cost long before it ever reaches the context limit. You cannot put a year of history, a thousand-page contract set, or a company knowledge base in the window, and even where a model technically allows it, you are paying full price to re-read the same tokens forever, and still fighting the loss-under-pressure problem. The window is the wrong place to keep anything you want the agent to know next week.
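The arithmetic above is easy to check. This is a back-of-the-envelope sketch: the history size and turn count come from the example, while the two prices are hypothetical input rates chosen for illustration, not quotes from any vendor. It also ignores the conversation's own growth and any caching discount a provider may offer.

```python
def resend_tokens(history_tokens: int, turns: int) -> int:
    """Input tokens spent re-sending the same history on every turn."""
    return history_tokens * turns


def resend_cost_usd(history_tokens: int, turns: int, usd_per_million: float) -> float:
    """Dollar cost of that re-sending at a given input price."""
    return resend_tokens(history_tokens, turns) / 1_000_000 * usd_per_million


HISTORY_TOKENS = 150_000  # customer history from the example above
TURNS = 40                # length of the conversation

# Hypothetical input prices in USD per million tokens (assumptions).
ASSUMED_PRICES = (2.00, 3.00)

total = resend_tokens(HISTORY_TOKENS, TURNS)
print(f"{total:,} input tokens re-sent")
for price in ASSUMED_PRICES:
    cost = resend_cost_usd(HISTORY_TOKENS, TURNS, price)
    print(f"at ${price:.2f} per million tokens: ${cost:.2f}")
```

Under those assumed prices the resend cost comes to 12 and 18 dollars, inside the ten-to-twenty-dollar range quoted above. Swap in your own model's rate to see where your agent lands.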

The fix is not a bigger window. It is to stop using the window as a filing cabinet and start using it as a desk: a small workspace that you load, deliberately, from durable stores that live outside the prompt. That deliberate loading is what people have started calling context engineering, and the stores it loads from are the rest of the memory stack.

What are the types of agent memory, in a real system?

Map the four kinds onto where they actually live, and the architecture stops being abstract. Here is how the stack looks in the systems we build:

Working memory lives in the context window. Keep it small and curated. Its only job is to hold the current task plus the handful of retrieved facts the agent needs for this step, not the agent's entire past.

Episodic memory lives in a log or an event store. Every meaningful event the agent should be able to recall later (conversations, decisions, failures) gets written as a record with a timestamp. This is what lets an agent say "last time we tried that, it broke here" instead of cheerfully repeating the mistake.
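A minimal sketch of such a log, assuming an in-memory list stands in for a real event store. The class name, field names, and event kinds are illustrative choices, not a prescribed schema.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Event:
    kind: str      # e.g. "conversation", "decision", "failure"
    summary: str
    at: datetime   # when it happened


class EpisodicLog:
    """Append-only record of events, recalled newest first."""

    def __init__(self) -> None:
        self._events: list[Event] = []

    def record(self, kind: str, summary: str, at: datetime | None = None) -> Event:
        event = Event(kind, summary, at or datetime.now(timezone.utc))
        self._events.append(event)
        return event

    def recall(self, kind: str | None = None, since: datetime | None = None,
               limit: int = 5) -> list[Event]:
        matches = [
            e for e in self._events
            if (kind is None or e.kind == kind) and (since is None or e.at >= since)
        ]
        return sorted(matches, key=lambda e: e.at, reverse=True)[:limit]


log = EpisodicLog()
log.record("failure", "run failed at step three", datetime(2026, 1, 5, tzinfo=timezone.utc))
log.record("conversation", "user asked to prioritize speed", datetime(2026, 1, 6, tzinfo=timezone.utc))
log.record("failure", "retry failed at step three again", datetime(2026, 1, 8, tzinfo=timezone.utc))

last_failure = log.recall(kind="failure", limit=1)[0]
print(last_failure.summary)  # retry failed at step three again
```

In production the list would be a database table or an event stream, but the contract is the same: write with a timestamp, read back by kind and recency.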

Semantic memory lives in a knowledge graph, and often a vector store alongside it. This is the settled, deduplicated, cross-checked layer of facts and relationships, the part that should be true no matter which conversation is asking. It is the most valuable layer and the hardest to keep honest, which is the entire reason the previous series existed.

Procedural memory lives in your tools, prompts, and routing logic. The sequences that work get encoded as reusable routines rather than rediscovered every run.
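As a rough illustration, a routine can be as small as a named, reusable tool sequence looked up by task type. Everything in this sketch is hypothetical: the registry, the decorator, and the tool names are invented for the example.

```python
from typing import Callable, Optional

# Registry of known-good routines, keyed by task type.
ROUTINES: dict[str, Callable[[str], list[str]]] = {}


def routine(task_type: str):
    """Decorator that stores a routine under the task type it handles."""
    def register(fn: Callable[[str], list[str]]):
        ROUTINES[task_type] = fn
        return fn
    return register


@routine("contract_review")
def contract_review(doc_id: str) -> list[str]:
    # Hypothetical tool sequence the agent has found works for this job.
    return [
        f"extract_clauses({doc_id})",
        f"check_governing_law({doc_id})",
        f"summarize_risks({doc_id})",
    ]


def plan(task_type: str, doc_id: str) -> Optional[list[str]]:
    """Return the stored routine's steps, or None if the agent must improvise."""
    fn = ROUTINES.get(task_type)
    return fn(doc_id) if fn else None


print(plan("contract_review", "doc-1"))
print(plan("never_seen_before", "doc-2"))  # None
```

A `None` result is the signal that the agent is about to rediscover a procedure, and that a working sequence is worth saving afterwards.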

The skill is not building any one of these. It is deciding what belongs in which, and wiring retrieval so the right slice of the durable layers lands in working memory at the right moment. Get that wiring wrong and you have four stores and an agent that still forgets, because nothing reaches the desk when it is needed.

Where does the knowledge graph fit?

The knowledge graph is your agent's long-term semantic memory. It is the layer that holds what is true about your domain, the entities and the relationships between them, in a form an agent can traverse rather than just match against. When an agent needs to know that a parent company owns a subsidiary that holds a position that is governed by a policy, that is a multi-hop semantic-memory query, and a graph is the structure built to answer it.
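The ownership example can be sketched in a few lines. This assumes the graph is held as plain (subject, relation, object) triples in memory; the entity and relation names are made up, and a real system would run the equivalent query against a graph database.

```python
from collections import defaultdict

# Hypothetical triples mirroring the example above.
TRIPLES = [
    ("ParentCo", "OWNS", "SubCo"),
    ("ParentCo", "OWNS", "OtherCo"),
    ("SubCo", "HOLDS", "Position-42"),
    ("Position-42", "GOVERNED_BY", "Policy-7"),
]


def build_index(triples):
    """Index objects by (subject, relation) so each hop is a dictionary lookup."""
    index = defaultdict(set)
    for subject, relation, obj in triples:
        index[(subject, relation)].add(obj)
    return index


def follow(index, start, relations):
    """Walk one relation per hop and return every node reached at the end."""
    frontier = {start}
    for relation in relations:
        frontier = {
            obj
            for node in frontier
            for obj in index.get((node, relation), ())
        }
    return frontier


index = build_index(TRIPLES)
policies = follow(index, "ParentCo", ["OWNS", "HOLDS", "GOVERNED_BY"])
print(policies)  # {'Policy-7'}
```

A similarity search over document chunks has no equivalent of `follow`: it can find text that mentions ParentCo or Policy-7, but not the chain that connects them.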

This is the bridge from the last series to this one. Everything we wrote about graph rot was, it turns out, about the failure modes of an agent's long-term memory. A stale edge is a false memory. A duplicated entity is a memory split in two so the agent only ever recalls half of it. A mislink is a confident memory of something that never happened. We covered each of those in depth: how graphs rot in the first place, how to keep one fresh without rebuilding it, and how to run a twenty-minute health check on one. If the semantic layer of your memory stack is a graph, those posts are how you keep that layer from lying.

There is a second-order cost here that teams almost always underestimate when they first reach for a graph. Semantic memory is the layer that decays, and it decays unevenly. A company's founding year never changes, but who owns what, who works where, and what something costs can turn over in months, and the graph keeps answering with the old version in the same confident voice. A memory you wrote once and never re-checked is not an asset; it is a liability that looks like an asset. Long-term memory is not a one-time extraction job. It is a maintained system, and the teams whose agents stay trustworthy are the ones who budgeted for the maintenance from the start, rather than discovering it the first time an agent confidently cited a fact that stopped being true a quarter ago.
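One way to budget for that maintenance is to give each kind of fact its own review window. The sketch below is an assumption-heavy illustration: the relation names and the 30- and 90-day windows are invented, and the right values depend on how fast your domain actually changes.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Assumed review windows per relation; None means the fact does not expire.
MAX_AGE = {
    "FOUNDED_IN": None,
    "OWNS": timedelta(days=90),
    "WORKS_AT": timedelta(days=90),
    "PRICED_AT": timedelta(days=30),
}
DEFAULT_MAX_AGE = timedelta(days=90)  # fallback for relations not listed


@dataclass(frozen=True)
class Fact:
    subject: str
    predicate: str
    obj: str
    last_verified: date


def is_stale(fact: Fact, today: date) -> bool:
    """True when the fact has gone unverified for longer than its window."""
    max_age = MAX_AGE.get(fact.predicate, DEFAULT_MAX_AGE)
    if max_age is None:
        return False
    return today - fact.last_verified > max_age


today = date(2026, 6, 1)
facts = [
    Fact("Acme", "FOUNDED_IN", "1999", date(2020, 1, 1)),
    Fact("Acme", "OWNS", "SubCo", date(2026, 1, 1)),
    Fact("Widget", "PRICED_AT", "10 USD", date(2026, 5, 20)),
]
to_recheck = [f for f in facts if is_stale(f, today)]
print([f"{f.subject} {f.predicate} {f.obj}" for f in to_recheck])  # ['Acme OWNS SubCo']
```

The founding year is never flagged, the recent price passes, and the five-month-old ownership edge is queued for re-verification before an agent cites it.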

A graph is not always the right semantic store. Sometimes a vector index is enough, and we wrote a whole buyer's post on how to tell the difference between a real multi-hop need and a similarity lookup dressed up as one. The point of the stack is not "use a graph everywhere." It is to know which layer of memory you are actually building, and to pick the right structure for that layer.

What does agent memory look like in production?

Two systems we have built show the two ends of the stack clearly.

On a wealth-management platform, the semantic layer is a knowledge graph: a Neo4j store with six entity types and five relationship types, built by an eight-stage extraction pipeline that pulls entities and relationships out of unstructured documents, resolves duplicates so one company is not stored under five names, and checks each candidate edge against the sentence that justifies it before that edge is ever written. A vector store of a few hundred chunks sits alongside the graph for similarity recall, so the system can do both kinds of retrieval: traverse relationships when the question is genuinely multi-hop, and fall back to plain similarity when it is not. Every fact in the graph carries a confidence score from zero to one. That one number is what lets the agent set a floor and keep its shakiest memories out of its most important answers, the way a careful analyst trusts a signed filing more than a half-remembered phone call. That is semantic long-term memory done deliberately: extracted, deduplicated, scored, and traversable, not a pile of documents stuffed into a prompt and hoped over.
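The confidence floor is the easiest part of that design to illustrate. This is a sketch of the idea only, not the platform's actual schema: the class, the field names, the sample facts, and the 0.75 floor are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoredFact:
    statement: str
    confidence: float  # 0.0 (shaky) to 1.0 (certain)


def above_floor(facts, floor):
    """Keep facts at or above the floor, most confident first."""
    if not 0.0 <= floor <= 1.0:
        raise ValueError("floor must be between 0 and 1")
    kept = [f for f in facts if f.confidence >= floor]
    return sorted(kept, key=lambda f: f.confidence, reverse=True)


facts = [
    ScoredFact("Acme owns SubCo (signed filing)", 0.95),
    ScoredFact("SubCo may be selling a unit (call notes)", 0.40),
    ScoredFact("Policy-7 governs Position-42", 0.80),
]

# A higher floor for high-stakes answers, a lower one for exploratory questions.
trusted = above_floor(facts, floor=0.75)
print([f.statement for f in trusted])
```

The half-remembered phone call stays in the store, where it can still be verified later, but it never reaches an answer that demands the higher floor.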

On Paralegent, our multi-agent legal-analysis system, the interesting layer is working memory at scale. It runs twenty-three agents, twelve that score and eleven that analyze. The naive design would hand every agent the full case in its own context window and let them talk, which means twenty-three copies of the same large context, re-sent on every exchange: slow, costly, and impossible to keep consistent when one agent updates a fact the other twenty-two are still holding a stale copy of. Instead they coordinate through a shared scores table and a queue, a small structured external working memory that every agent reads from and writes to, so no single agent ever has to carry the whole picture in its prompt. A router then decides which agents even need to run for a given document and prunes the rest, rather than waking all twenty-three every time. Moving the shared state out of the context window and into a table, together with that routing, is a large part of why model calls dropped by around seventy-five percent. The memory lived in the right place, so the agents stopped re-carrying it on every call.
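The pattern, though not Paralegent's actual code, can be sketched with a table, a queue, and a routing map. The schema, agent names, document types, and the placeholder score below are all hypothetical; an in-memory SQLite database stands in for whatever shared store a real system would use.

```python
import sqlite3
from collections import deque

# Shared scores table: the external working memory every agent reads and writes.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE scores ("
    " doc_id TEXT, agent TEXT, score REAL,"
    " PRIMARY KEY (doc_id, agent))"
)

# Hypothetical routing table: which agents need to run for which document type.
ROUTES = {
    "contract": ["liability_scorer", "jurisdiction_scorer"],
    "invoice": ["amount_scorer"],
}


def enqueue(queue: deque, doc_id: str, doc_type: str) -> int:
    """Queue work only for the agents the router selects; return how many."""
    agents = ROUTES.get(doc_type, [])
    for agent in agents:
        queue.append((doc_id, agent))
    return len(agents)


def write_score(doc_id: str, agent: str, score: float) -> None:
    """Insert or overwrite, so there is one current value per agent and document."""
    db.execute("INSERT OR REPLACE INTO scores VALUES (?, ?, ?)", (doc_id, agent, score))
    db.commit()


def read_scores(doc_id: str) -> dict:
    """Any agent reads the shared state instead of carrying it in its prompt."""
    rows = db.execute(
        "SELECT agent, score FROM scores WHERE doc_id = ?", (doc_id,)
    ).fetchall()
    return dict(rows)


queue = deque()
enqueue(queue, "doc-1", "contract")
while queue:
    doc_id, agent = queue.popleft()
    write_score(doc_id, agent, 0.5)  # placeholder for a real model call
print(read_scores("doc-1"))
```

Because every agent reads the same row, an update from one agent is visible to all the others on their next read, which is exactly what twenty-three private copies of the context cannot guarantee.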

Those are different layers of the same stack: one a semantic graph, the other a shared working memory. Neither was solved by a bigger model or a longer window. Both were solved by deciding where the memory should live.

The teams whose agents seem to remember did not buy a smarter model. They decided, on purpose, where each kind of memory was going to live.

Why building the stores is only half the job

Standing up the four stores is the easy half. The hard half is retrieval: getting the right slice of durable memory onto the desk at the right moment, and only that slice. A store full of correct facts is useless if the agent pulls the wrong ten of them into a limited window. Naive similarity search does exactly this, returning what is textually similar rather than what actually bears on the decision in front of the agent. Ask "what are the risks in this deal" and a similarity search will hand back every paragraph containing the word risk, while the one clause that creates the real exposure, written in language that never uses the word, never surfaces.

Retrieval is a ranking problem, not a lookup, and it is where most production memory systems are quietly won or lost. A single turn often needs different memory from several layers at once: the current task from working memory, the relevant prior events from the episodic log, and the settled facts from the graph, all ranked and trimmed to fit the window together. Wire that well and a small model with good memory will beat a large model with none. Wire it badly and you have four well-built stores feeding an agent that still answers from the wrong facts. This is the difference between a memory system that exists and one that works, and it earns its own post later in this series.
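A minimal sketch of "ranked and trimmed to fit the window together", assuming each candidate already carries a relevance score from whatever ranker you use and a token count. The greedy packing here is one simple strategy, not the only one, and all names and numbers are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    layer: str        # "working", "episodic", or "semantic"
    text: str
    relevance: float  # assumed to come from a separate ranker; higher is better
    tokens: int


def pack_window(candidates, budget):
    """Take candidates in relevance order, skipping any that would overflow the budget."""
    chosen, used = [], 0
    for c in sorted(candidates, key=lambda c: c.relevance, reverse=True):
        if used + c.tokens <= budget:
            chosen.append(c)
            used += c.tokens
    return chosen


candidates = [
    Candidate("working", "current task: assess deal risk", 0.90, 50),
    Candidate("semantic", "long background memo on the counterparty", 0.80, 80),
    Candidate("episodic", "last review missed the indemnity clause", 0.50, 30),
]

window = pack_window(candidates, budget=100)
print([c.layer for c in window])  # ['working', 'episodic']
```

The second-ranked item is dropped because it does not fit, and a lower-ranked but smaller one takes the remaining space. The quality of the result depends almost entirely on the relevance scores, which is why the ranker, not the stores, decides whether the memory system works.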

How do you know memory is your problem?

A few symptoms point almost always at the memory stack rather than the model. The agent repeats questions it already has the answer to. It contradicts a decision from earlier in the same project. It performs well on the first turn and degrades as the session goes on, the working memory filling with noise. It cannot tell you why it did something last week, because nothing recorded that it did. And the standard reflex, paste more history into the prompt, helps for one run and then stops, because you are still using the window as storage.

If that list sounds like your agent, the fix is architectural, and it is the subject of the rest of this series. Next we will compare the tools people reach for first (Mem0, Graphiti, and a plain knowledge graph) and show what each is and is not good for. After that: working memory at scale, episodic versus semantic stores, why retrieval is really a ranking problem, how memory rots the same way a graph does, and how to evaluate whether your agent's memory is any good. The map first. The mechanics next.

Not sure which layer of the stack your agent is missing? That diagnosis is exactly the work we do. Book a 15-minute call and we will map your agent's memory and where it is leaking. We work US business hours.


Muhammad Mudassir



Related Articles

Graph Rot: Why Your Knowledge Graph Is Lying to Your AI

Keeping a Knowledge Graph Fresh Without Rebuilding It

The 20-Minute Knowledge Graph Health Check
