Does a bigger context window mean my agent has a better memory?

No. A larger window gives the agent more room to hold tokens in a single turn, but it is a cache, not a memory: finite, and used less reliably as it fills. Any fixed window overflows once a task runs long enough, and research benchmarks show that models reliably use far less of their window than the advertised size implies and tend to miss information buried in the middle. Memory is the store the agent pages from, outside the window. A bigger window delays the problem instead of solving it, and often costs more for worse attention.

What is the difference between a context window and agent memory?

The context window is the working set the model sees on a given turn, the equivalent of RAM: small, fast, expensive, and wiped between sessions unless you resend it. Agent memory is the durable store that lives outside the window, the equivalent of disk: a vector index, a knowledge graph, a log, large and persistent. The window holds a small projection of memory for the current turn; the store holds everything. Confusing the two, treating the window as the memory, is the root cause of agents that forget once a conversation gets long.

What is context compaction and when should I use it?

Compaction is recursive summarization applied to a conversation when the window nears its limit: you summarize the older turns into a short digest and continue from that digest instead of the raw history. Use it when a session is long enough to overflow the window but the older material still matters in gist. The tradeoff is that summarizing is lossy: it smooths away exact quantities, identifiers, and wordings. Pin anything that must stay exact, and compact only the material you can afford to hold in summary form. Compaction is one move in a larger policy, not a complete memory strategy on its own.

Why does my agent get slower and more expensive on long conversations?

Because a chat model's API is stateless, you resend the entire conversation every turn. The tokens you pay for on each turn grow with the length of the conversation, and the total over a session grows with the square of its length, which is why both cost and latency climb steadily as a thread gets longer. Prompt caching discounts the price of the resent prefix but does not reduce the token count, so it softens the slope without changing the shape. The structural fix is to cap the live context with a working-memory policy, which turns the quadratic growth into linear growth.

How do I stop my agent from forgetting earlier instructions in a long session?

Stop relying on the default of dropping the oldest turns, which evicts by age and is blind to importance, so it discards an early constraint while keeping recent chatter. Pin the instructions and hard constraints in a protected part of the window that is never evicted. Rank what stays by relevance to the current goal rather than by recency. Compact the older history into a summary instead of dropping it outright, and keep the full detail in an external store you re-retrieve from when a turn needs it. The instruction survives because you decided it was load-bearing, not because it happened to be recent.

Agent Working Memory: Bigger Window Won't Fix It

TL;DR: A million-token context window does not give your agent a memory. It gives it a bigger desk, and a long enough task will bury any desk. The context window is working memory, a cache: finite, expensive, and used less reliably the more you cram into it, with peer-reviewed work showing that models miss the middle of a long input and use far less of their advertised length than the label claims. Real memory lives outside the window, in a store you page from. So the engineering problem at scale is not the size of the window, it is the policy that decides what stays in it: pin the few invariants, keep the recent turns, compact the warm middle into a running summary, and offload the cold to a store you re-retrieve from precisely. Get that policy right and a small model holds a coherent thread across a thousand turns. Get it wrong and the cost of the conversation grows with the square of its length while the agent quietly forgets the one constraint that mattered.

An agent runs beautifully for twenty turns and then loses the plot. It forgets a constraint you gave it at the start. It re-asks a question you already answered. It contradicts a decision it made ten messages ago. The reflex is to reach for a bigger context window, and the model vendors are happy to sell you one. But you have almost certainly met an agent with a giant window that still forgets, because window size was never the thing that was broken. What broke is that nobody decided what the agent should keep in front of it as the session grew, so it kept the wrong things.

This is the fourth post in our series on agent memory. The first established that a context window is not a memory. The second compared the tools for building the store that is. The third showed that getting the right facts out of that store is a ranking problem, not a lookup. This post is about the moment those facts arrive and have to share one small, finite window with everything else the agent is already holding. Retrieval decides what is worth bringing in. Working memory decides what is worth keeping once it is in, and what has to leave to make room. That second decision quietly determines whether your agent can hold a long task together.

The context window is not where your agent's memory lives. It is the small, expensive surface that memory is projected onto, one turn at a time. Manage the projection, or it manages you.

A bigger window is not a bigger memory

The pitch is seductive: the window is now a million tokens, so just put everything in it and stop worrying about memory. Two things are wrong with that.

The first is simple arithmetic. A window of any fixed size, however large, is still fixed, and an agent that runs long enough will always generate more history than it holds. A coding agent working through a real task, a support agent on a multi-day thread, a research agent reading source after source, all of them cross any fixed boundary eventually. A bigger window moves the wall further out. It does not remove it. So you will need an eviction policy no matter how large the window is, and the only question is whether you design that policy or let a crude default make it for you.

The second problem is subtler and better evidenced: models do not use the whole window equally well, even below the limit. An NVIDIA research benchmark called RULER measured the effective context length of long-context models, the length at which they still perform reliably, and found it routinely far shorter than the advertised number. A model that claims a large window often degrades well before it. Separately, a 2025 study called NoLiMa showed that when you strip away literal word overlap, so the model has to actually reason over a long input rather than match a keyword, accuracy falls off sharply as the input grows, long before the stated limit. And the well-known Lost in the Middle study, published in the journal TACL in 2024, found a U-shaped pattern: a model attends most reliably to the very start and the very end of its context and is most likely to miss what sits in the middle. That last one is a position effect, not a length effect, and the three are distinct findings, but together they deliver one verdict. The usable part of a context window is smaller than the number on the box, and it gets less reliable the more you fill it. Stuffing the window is not a memory strategy. It is a way to pay more for worse attention.

Your context window is a cache, not a memory

The cleaner mental model, the one that tells you what to actually do, comes from a 2023 paper titled MemGPT: Towards LLMs as Operating Systems. It is an arXiv paper rather than a peer-reviewed one, so weigh it as a design proposal, but its framing is the most useful one in the field: treat the context window the way an operating system treats RAM. RAM is fast, small, and expensive. Disk is slow, vast, and cheap. The operating system's whole job is to keep the few things the program needs right now in RAM and page everything else to and from disk. MemGPT applies this directly: the context window is main context, the agent's RAM, and an external store is the agent's disk, with the model itself deciding what to page in and out.

Once you see the window as a cache, the design questions become the right ones. A cache is not where your data lives. It is a small, fast staging area that holds a working copy of the slice you need this instant. Your data lives in the backing store. The skill in caching has never been making the cache bigger, it has been deciding what earns a slot and what gets evicted. The same is true here. Your agent's memory does not live in the window. It lives in the store we spent the first and second posts building. The window holds the working set: the small projection of that memory the agent needs for the turn in front of it.

In mature agent systems, almost all of the state lives outside the window on purpose. The agent writes its progress to a file, keeps a task list it reads back each turn, records decisions in a structured log, and pulls only the relevant slice into context when it needs it. The window at any moment holds a fraction of what the agent knows. This is not a workaround for small windows. It is the architecture, and it is the same one whether the window is eight thousand tokens or a million, because the cache discipline is what keeps attention focused, not just what keeps the agent under a token limit.
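The pattern of keeping state outside the window can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the `ExternalState` class, its JSON layout, and the `slice_for` method are all hypothetical names invented here. The point it shows is that the full task list and decision log live in a file, and only a topic-specific slice is rendered as text for the window.

```python
import json
import tempfile
from pathlib import Path


class ExternalState:
    """Durable agent state on disk; the window only ever sees a slice of it.

    Hypothetical helper for illustration, not an API from any library.
    """

    def __init__(self, path: Path) -> None:
        self.path = path
        if not path.exists():
            self._save({"tasks": [], "decisions": []})

    def _load(self) -> dict:
        return json.loads(self.path.read_text(encoding="utf-8"))

    def _save(self, state: dict) -> None:
        self.path.write_text(json.dumps(state, indent=2), encoding="utf-8")

    def add_task(self, text: str) -> None:
        state = self._load()
        state["tasks"].append({"text": text, "done": False})
        self._save(state)

    def finish_task(self, text: str) -> None:
        state = self._load()
        for task in state["tasks"]:
            if task["text"] == text:
                task["done"] = True
        self._save(state)

    def record_decision(self, topic: str, decision: str) -> None:
        state = self._load()
        state["decisions"].append({"topic": topic, "decision": decision})
        self._save(state)

    def slice_for(self, topic: str) -> str:
        """Text to place in the window: open tasks plus decisions on one topic."""
        state = self._load()
        lines = [f"TODO: {t['text']}" for t in state["tasks"] if not t["done"]]
        lines += [
            f"DECIDED ({d['topic']}): {d['decision']}"
            for d in state["decisions"]
            if d["topic"] == topic
        ]
        return "\n".join(lines)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        state = ExternalState(Path(tmp) / "agent_state.json")
        state.add_task("draft migration plan")
        state.record_decision("database", "stay on Postgres 15")
        state.record_decision("billing", "invoice monthly")
        print(state.slice_for("database"))  # the billing decision stays on disk
```

The file survives a restart and can grow without bound, while the text returned by `slice_for` stays proportional to what the current turn touches.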

Position-blind eviction is the real failure

So the window fills, and something has to leave. The default, the one you get if you do nothing, is to drop the oldest turns and keep the most recent ones. This is the crudest possible policy, and it is worth being precise about why it is bad, because the obvious criticism is wrong and the real one is worse.

The obvious criticism is that dropping the oldest turns deletes the task goal, which usually sits at the start. In practice any competent system pins the system prompt and the goal so they are never evicted, so that specific fear is easy to handle. The real problem is deeper. Dropping the oldest turns evicts by position, by age, and age is blind to importance. A constraint you were given at turn twelve, do not contact this customer by phone, only by email, is now old, so it gets dropped, while three turns of recent small talk survive because they are new. The policy threw away the load-bearing fact and kept the noise, purely because of when each arrived. Recency is a fine tiebreaker and a terrible primary criterion.

There is an instructive result from the systems layer here, worth borrowing as an analogy rather than as a direct claim. A 2024 paper called StreamingLLM found that when you run a model over a very long stream and naively evict the earliest tokens, performance collapses, because models park a large share of their attention on the first few tokens, what the authors called attention sinks. Keep a few of those initial tokens plus a recent window and the model stays stable over millions of tokens. That work is about token-level attention, not about dropping conversation turns, so do not overclaim it. But the lesson rhymes with ours: blindly evicting the oldest thing is the policy most likely to break, and what you choose to protect from eviction matters more than how much you keep.

The fix follows directly from the previous post. Eviction is just retrieval run in reverse. If retrieval is a ranking problem, deciding what to bring in, then eviction is the same ranking problem deciding what to let go: rank what is in the window by importance and relevance to the current goal, not by age, and evict from the bottom. The constraint from turn twelve outranks the recent small talk, so it stays. That is the whole move.
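A small sketch makes the contrast concrete. It assumes each item in the window already carries an `importance` score for the current goal; how that score is produced (a ranker, a heuristic, a model call) is outside this example, and the `Item` type and both function names are hypothetical. One function evicts by age, the other evicts from the bottom of the ranking and uses age only to break ties.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Item:
    turn: int          # when the item entered the window
    text: str
    tokens: int        # size, from whatever tokenizer you use
    importance: float  # assumed score: relevance to the current goal


def evict_oldest(items: List[Item], budget: int) -> List[Item]:
    """The default: drop from the oldest end until the window fits."""
    kept = sorted(items, key=lambda i: i.turn)
    while kept and sum(i.tokens for i in kept) > budget:
        kept.pop(0)
    return kept


def evict_by_rank(items: List[Item], budget: int) -> List[Item]:
    """Drop the least important item first; among ties, drop the oldest."""
    kept = sorted(items, key=lambda i: (i.importance, i.turn))
    while kept and sum(i.tokens for i in kept) > budget:
        kept.pop(0)
    return sorted(kept, key=lambda i: i.turn)  # restore conversation order


if __name__ == "__main__":
    window = [
        Item(12, "Contact this customer by email only, never by phone.", 40, 0.95),
        Item(30, "small talk", 40, 0.10),
        Item(31, "more small talk", 40, 0.10),
        Item(32, "still small talk", 40, 0.10),
        Item(33, "the current question", 40, 0.80),
    ]
    print([i.turn for i in evict_oldest(window, budget=120)])   # [31, 32, 33]
    print([i.turn for i in evict_by_rank(window, budget=120)])  # [12, 32, 33]
```

With the same budget, age-based truncation loses the turn-twelve constraint and keeps three turns of chatter, while rank-based eviction keeps the constraint and sheds the chatter.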

Compaction trades detail for room, and it has a cost

When ranking alone cannot clear enough space, you compress. The production name for this is compaction, and it is worth saying plainly what it is: compaction is recursive summarization triggered when the window nears its limit. You take the older part of the transcript, summarize it into something far shorter, and reinitialize the window with that summary in place of the raw turns. Several systems ship this now as an automatic feature, summarizing the conversation as it approaches a threshold and continuing from the digest. The idea is old. Stanford's Generative Agents, the UIST 2023 paper, had its agents periodically reflect, synthesizing many low-level memories into a few higher-level ones, which is consolidation by another name. Summarize the summaries and you have hierarchical compaction.
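As a rough sketch of that trigger-and-replace loop, assuming a `summarize` callable that would be a model call in practice and a word count standing in for a real tokenizer (both are placeholders chosen for this example):

```python
from typing import Callable, List


def count_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: counts whitespace-separated words."""
    return len(text.split())


def compact(
    turns: List[str],
    summarize: Callable[[List[str]], str],  # assumed: a model call in practice
    limit: int,
    keep_recent: int = 3,
) -> List[str]:
    """If the transcript exceeds `limit`, replace all but the newest
    `keep_recent` turns with a single summary entry."""
    if keep_recent < 1:
        raise ValueError("keep_recent must be at least 1")
    total = sum(count_tokens(t) for t in turns)
    if total <= limit or len(turns) <= keep_recent:
        return list(turns)
    old, recent = turns[:-keep_recent], turns[-keep_recent:]
    return ["[summary] " + summarize(old)] + list(recent)


if __name__ == "__main__":
    # Stub summarizer so the example runs without a model.
    stub = lambda old: f"{len(old)} earlier turns condensed"
    history = [f"turn {i} " + "word " * 20 for i in range(10)]
    history = compact(history, stub, limit=100)
    print(len(history), history[0])
```

Calling `compact` again after more turns arrive feeds the previous summary back in as part of the old material, so the digest is itself summarized. That is the recursive part, and it is also where detail is lost a second time.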

But compaction has a cost, and this is the part teams underestimate. Summarizing is lossy by definition: you are paying detail for room. A benchmark called LoCoMo found that giving a model summaries of a long conversation did not significantly help despite the summaries having high recall, because the act of compressing dialogue into a summary loses the specific information the task later needed. Quantities, identifiers, the exact wording of a constraint, these are precisely what a summary smooths away, and precisely what a careful task depends on.

Here is the tension that makes working memory genuinely hard, and it is the real thesis of this post. You cannot win by refusing to summarize, because keeping everything verbatim overflows the window, triggers the lost-in-the-middle problem, and costs you quadratically, as the next section shows. And you cannot win by summarizing everything either, because aggressive summarization paraphrases away the details that detail-sensitive work depends on. Both extremes fail. The working set has to stay small, which forces you to choose what gets kept in full and what gets compressed. That choice, not the window size, is the design problem.

Pin, keep, compact, offload: the four moves

Put the pieces together and a working-memory policy has four moves. None of them is exotic. The discipline is in applying all four deliberately instead of letting a default truncate your history.

Pin the invariants. The goal, the hard constraints, the schema or format the output must follow, the small set of facts that must never be wrong: these go in a protected region of the window that is never evicted, held verbatim, ideally near the start or the end where the model attends most reliably. This is a tiny budget and it pays for itself many times over.

Keep the recent turns. The last several exchanges stay raw, because recency genuinely matters for the immediate next step, and because summarizing what just happened is both lossy and pointless. This is the one place the crude default is right.

Compact the warm middle. The older-but-not-ancient history, the part between the pinned invariants and the recent raw turns, gets summarized into a running digest. This is where compaction lives. You trade the detail of those turns for the room to keep going, accepting the loss knowingly and only for the material least likely to be needed word for word.

Offload the cold, and re-retrieve on demand. Everything else goes out to the store, the vector index and the knowledge graph from the earlier posts, and is pulled back only when a specific turn needs it, through the ranking pipeline from the retrieval post. This is the move that makes the whole thing scale: the agent's full history can be enormous, because almost none of it is in the window at any given moment. It is on disk, and the agent pages in the slice it needs. That store has to be kept clean and current to be worth paging from, which is its own discipline, but once it is, the window stops being the bottleneck.
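The four moves can be sketched together in one small class. Everything here is an illustrative assumption: the `WorkingMemory` name, the thresholds, the in-memory list standing in for the external store, and the keyword-overlap `retrieve` method, which is a toy substitute for a real ranking pipeline. The `summarize` callable would be a model call in practice.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class WorkingMemory:
    pinned: List[str]                      # invariants, never evicted
    summarize: Callable[[List[str]], str]  # assumed: a model call in practice
    keep_recent: int = 4                   # raw turns kept verbatim
    warm_limit: int = 6                    # warm turns that trigger compaction
    summary: str = ""                      # running digest of the warm middle
    warm: List[str] = field(default_factory=list)
    recent: List[str] = field(default_factory=list)
    store: List[str] = field(default_factory=list)  # stand-in for the external store

    def add_turn(self, text: str) -> None:
        self.store.append(text)            # offload: full detail always lands in the store
        self.recent.append(text)           # keep: the newest turns stay raw
        while len(self.recent) > self.keep_recent:
            self.warm.append(self.recent.pop(0))
        if len(self.warm) >= self.warm_limit:
            # compact: fold the warm turns (and any prior digest) into one digest
            prior = [self.summary] if self.summary else []
            self.summary = self.summarize(prior + self.warm)
            self.warm.clear()

    def retrieve(self, query: str, k: int = 2) -> List[str]:
        """Toy ranking by shared words; a real system would use its own pipeline."""
        words = set(query.lower().split())
        scored = [(len(words & set(t.lower().split())), i, t)
                  for i, t in enumerate(self.store)]
        hits = sorted((s for s in scored if s[0] > 0), key=lambda s: (-s[0], s[1]))
        return [t for _, _, t in hits[:k]]

    def build_context(self, query: str) -> List[str]:
        in_window = set(self.warm) | set(self.recent)
        paged_in = [t for t in self.retrieve(query) if t not in in_window]
        context = list(self.pinned)        # pin: invariants sit at the start
        if self.summary:
            context.append("[summary] " + self.summary)
        context.extend(self.warm)
        context.extend("[retrieved] " + t for t in paged_in)
        context.extend(self.recent)
        context.append(query)              # the live question sits at the end
        return context


if __name__ == "__main__":
    wm = WorkingMemory(
        pinned=["GOAL: close the ticket", "CONSTRAINT: email only"],
        summarize=lambda texts: f"{len(texts)} items condensed",  # stub
    )
    for i in range(30):
        wm.add_turn("invoice 4417 amount 1200" if i == 3 else f"chatter number {i}")
    for line in wm.build_context("what was the invoice amount"):
        print(line)
```

After thirty turns the context still holds about a dozen entries, the pinned constraint is intact, and the exact invoice line from turn three comes back from the store verbatim instead of surviving only as a paraphrase in the digest.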

Two refinements separate a careful implementation from a crude one. The first is to distinguish clearing from compacting. Some of what clogs a window is not conversation at all, it is the bulky output of old tool calls, stale search results, a file you read six steps ago. That material can often be removed outright rather than summarized, which loses nothing but the bulk, whereas summarizing a constraint risks losing the constraint. Clear what is safe to drop, compact only what you must keep but cannot keep in full. The second refinement is sub-agent isolation: when a subtask needs to read a great deal to produce a little, spin up a sub-agent with its own fresh window, let it do the heavy reading there, and return only its conclusion to the main agent. The detailed context never touches the parent's window. Anthropic's published multi-agent research system works this way, with worker agents each exploring in their own context and handing back a short summary. Its reported gains are an internal evaluation, so treat the headline number as a vendor result, but the architecture is sound and now widely adopted.
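The first refinement, clearing rather than compacting, might look like the sketch below. The `Message` type, the `max_age` threshold, and the placeholder string are assumptions made for this example. One deliberate variation from removing the output outright: the sketch leaves a short stub in place of the bulky result, so a tool call in the transcript still has a visible result attached to it.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class Message:
    role: str     # "user", "assistant", or "tool"
    content: str
    turn: int     # turn on which the message was produced


def clear_stale_tool_output(
    messages: List[Message],
    current_turn: int,
    max_age: int = 3,  # assumed threshold, tune per workload
    stub: str = "[tool output cleared; re-run the tool if needed]",
) -> List[Message]:
    """Replace the content of old tool results with a short stub.

    User and assistant messages are never touched, so instructions and
    constraints cannot be lost by this step.
    """
    cleaned = []
    for message in messages:
        is_stale = message.role == "tool" and current_turn - message.turn > max_age
        cleaned.append(replace(message, content=stub) if is_stale else message)
    return cleaned


if __name__ == "__main__":
    transcript = [
        Message("user", "Only contact the customer by email.", 1),
        Message("tool", "search results ... " * 500, 2),
        Message("tool", "file contents ... " * 500, 9),
        Message("user", "What did the file say?", 10),
    ]
    for m in clear_stale_tool_output(transcript, current_turn=10):
        print(m.turn, m.role, len(m.content))
```

Only the turn-two search results are stubbed out. The recent file read and both user messages pass through unchanged, which is the asymmetry the paragraph describes: bulk goes, constraints stay.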

Figure: A working memory hierarchy for an AI agent. A long session produces a growing transcript that overflows a finite context window. A policy decides what stays: pin the invariants such as the goal and hard constraints so they are never evicted, keep the recent turns raw, compact the warm middle into a running summary, and offload the cold history to an external store of a knowledge graph and vector index. The needed slice is paged back into the window on demand. Position-blind truncation drops a load-bearing constraint, while the managed working set keeps it.

The cost nobody budgets: history is quadratic

There is a cost argument here that most teams do not see until the invoice arrives, and it is the most concrete reason to manage the working set rather than let it grow.

A chat model's API is stateless. It does not remember your previous turns. To continue a conversation, you resend the entire history every single turn. That means the input you pay for grows with every exchange. On turn two you resend turn one. On turn fifty you resend the previous forty-nine. The tokens you are billed for on a single turn grow linearly with the length of the conversation, and the total over the whole session grows with the square of its length. This is the ordinary cost of a stateless conversation, and it is quadratic.
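The claim can be written out as a short derivation. The symbols are assumptions introduced here, not taken from any vendor's documentation: S is the fixed overhead resent on every turn (system prompt and tool definitions), e is the number of tokens each exchange adds, and n is the number of turns.

```latex
% Assumed symbols: S = fixed per-turn overhead, e = tokens added per exchange,
% n = number of turns in the session.
\begin{align*}
\text{input on turn } t &= S + e\,(t-1) \\
\text{total input over } n \text{ turns}
  &= \sum_{t=1}^{n} \bigl(S + e\,(t-1)\bigr)
   = nS + e\,\frac{n(n-1)}{2}
\end{align*}
```

The first term grows linearly with n. The second grows with n squared, and for any long session it dominates.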

Put rough numbers on it. Say your system prompt and tools are a few thousand tokens and each exchange adds about a thousand. A hundred-turn session with no memory management resends an ever-growing transcript and runs through something on the order of five million input tokens by the end, with the final turn alone carrying around a hundred thousand. At a mid-tier model price in the rough range of three dollars per million input tokens in early 2026, that is roughly fifteen dollars of input for one session, and the per-turn cost is still climbing when it ends. Now run it for a thousand turns. The quadratic bites hard: on paper you are into hundreds of millions of input tokens and well over a thousand dollars for the single session. In practice it never gets there, because a thousand turns at a thousand tokens each is a million-token transcript, which overflows the window and simply stops working.

Now cap the live context with the policy above, say at twenty thousand tokens. Per-turn input stops growing, because the window holds a bounded working set no matter how long the session runs. The cost goes from quadratic to linear. The hundred-turn session drops to a few dollars. The thousand-turn session, which could not even run before, costs tens of dollars and runs fine. The savings ratio is not fixed. It grows with length: a couple of times cheaper at a hundred turns, an order of magnitude or more by a thousand. Treat the dollar figures as illustrative, since model prices move month to month, but the shape is the point and the shape does not change. Unmanaged history is quadratic. A bounded working set is linear.
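The arithmetic is easy to reproduce. The sketch below fixes the vague figures at specific values as assumptions: 3,000 tokens of overhead for "a few thousand", 1,000 tokens per exchange, a 20,000-token cap, and three dollars per million input tokens. It counts input tokens only and ignores the cost of the summarization calls a real compaction policy would add.

```python
from typing import Optional

PRICE_PER_MILLION = 3.00  # illustrative input price in dollars, an assumption


def session_input_tokens(
    turns: int,
    overhead: int = 3_000,       # assumed system prompt + tools
    per_exchange: int = 1_000,   # assumed tokens added per exchange
    cap: Optional[int] = None,   # bounded working set, if any
) -> int:
    """Total input tokens sent over a session that resends context every turn."""
    total = 0
    for turn in range(1, turns + 1):
        context = overhead + per_exchange * (turn - 1)
        if cap is not None:
            context = min(context, cap)
        total += context
    return total


def dollars(tokens: int) -> float:
    return tokens / 1_000_000 * PRICE_PER_MILLION


if __name__ == "__main__":
    for turns in (100, 1000):
        unmanaged = session_input_tokens(turns)
        capped = session_input_tokens(turns, cap=20_000)
        print(
            f"{turns:>5} turns: unmanaged {unmanaged:>11,} tokens "
            f"(${dollars(unmanaged):,.2f}), capped {capped:>10,} tokens "
            f"(${dollars(capped):,.2f}), ratio {unmanaged / capped:.1f}x"
        )
```

Under these assumptions the hundred-turn session comes to 5.25 million tokens unmanaged against about 1.85 million capped, a ratio near 3. At a thousand turns it is roughly 502 million against about 20 million, a ratio near 25.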

One distinction is worth nailing down, because it trips people up. Prompt caching does not solve this. Caching lets you resend a stable prefix at a steep discount, often around a tenth of the normal input price for the cached part, which is a real and worthwhile saving. But it is a discount on the price of the tokens, not a reduction in their number. The cached tokens still fill the window, the model still attends over all of them, and the count still grows with the conversation. Caching shrinks the multiplier. It does not turn the quadratic into a line. Only managing the working set does that. And to head off a common confusion: prompt caching is a billing feature about reusing a prefix, which is a different thing from the model's internal attention cache, and a different thing again from the second quadratic that lives inside a single forward pass, where attention compute scales with the square of the context length. That last one is another good reason to keep the window small, but it is a separate mechanism. The billing quadratic, the one that grows across turns, is the one that shows up on your invoice.
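A simplified billing model shows the difference between a smaller multiplier and a different shape. The assumptions are stated loudly because real caching terms vary by vendor: every token sent on the previous turn counts as a cache hit billed at a tenth of the normal price, only the newly added tokens are billed in full, and cache-write surcharges and expiry are ignored. The token sizes and price are the same illustrative figures as before, fixed here at 3,000 overhead, 1,000 per exchange, and three dollars per million.

```python
PRICE = 3.00 / 1_000_000  # illustrative dollars per input token, an assumption
OVERHEAD = 3_000          # assumed system prompt + tools
PER_EXCHANGE = 1_000      # assumed tokens added per exchange


def context_size(turn: int) -> int:
    """Tokens sent on a given turn when the full history is resent."""
    return OVERHEAD + PER_EXCHANGE * (turn - 1)


def plain_cost(turns: int) -> float:
    return sum(context_size(t) for t in range(1, turns + 1)) * PRICE


def cached_cost(turns: int, cached_ratio: float = 0.1) -> float:
    """Assumes the previous turn's whole prompt is a cache hit on this turn."""
    total = 0.0
    for t in range(1, turns + 1):
        reused = context_size(t - 1) if t > 1 else 0
        fresh = context_size(t) - reused
        total += (fresh + reused * cached_ratio) * PRICE
    return total


def capped_cost(turns: int, cap: int = 20_000) -> float:
    return sum(min(context_size(t), cap) for t in range(1, turns + 1)) * PRICE


def growth(cost_fn) -> float:
    """Cost multiple when the session gets ten times longer (100 to 1000 turns)."""
    return cost_fn(1000) / cost_fn(100)


if __name__ == "__main__":
    for name, fn in (("plain", plain_cost), ("cached", cached_cost), ("capped", capped_cost)):
        print(f"{name:>6}: 100 turns ${fn(100):8.2f}  1000 turns ${fn(1000):9.2f}  growth {growth(fn):5.1f}x")
```

A tenfold longer session that costs about ten times more is linear, and one that costs close to a hundred times more is quadratic. In this model caching makes every session much cheaper, yet its growth factor stays above eighty, while the capped working set lands near ten.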

What this looks like when we build it

The abstract policy is pin, keep, compact, offload. Here is the concrete version, from systems we have shipped, anonymized.

On a wealth-management platform we built, the agent's memory is emphatically not the window. The durable memory is a knowledge graph in Neo4j, six entity types and five relationship types, sitting beside a vector store of a few hundred chunks. At any moment the window holds only the working set for the question in front of the agent: the active client thread, the specific holdings and clauses the current question touches, the constraints that must hold. Everything else, the full history of every client and instrument, stays in the store and is paged in by retrieval only when a turn needs it. The window never tries to hold the whole client. It holds the slice, and the graph holds the rest. That is the cache discipline made physical, and it is why the system stays coherent over long sessions instead of degrading as the history piles up.

On Paralegent, our multi-agent legal-analysis system, the working set lives outside every individual window by design. Its twenty-three agents, twelve scorers and eleven analysts, do not each carry the full case in their own context. They coordinate through a shared scores table and a queue, which is external working memory: structured state that no single agent holds in full, but from which any agent can read the row it needs. A router decides which agents and which facts a given document actually needs and prunes the rest, so the system is not waking all twenty-three and re-sending the same large context to each. Deciding what does not need to be in any window is as much of the design as deciding what does, and moving the shared state out of the prompts into that table, together with the pruning, is a large part of why model calls dropped by roughly seventy-five percent. The agents stay small and fast because the memory they share is not crammed into each of their windows. It sits outside, and they read what they need.
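To make the idea of a shared table plus a router concrete, here is a generic sketch using Python's built-in `sqlite3` module. It is not Paralegent's schema or code: the table layout, the agent names, and the tag-matching `route` function are all hypothetical, chosen only to show state living in a table that agents read by row instead of carrying it in their prompts.

```python
import sqlite3
from typing import Dict, List, Set, Tuple


def make_board() -> sqlite3.Connection:
    """Create an in-memory table that acts as shared working memory."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE scores ("
        "doc_id TEXT, agent TEXT, score REAL, note TEXT, "
        "PRIMARY KEY (doc_id, agent))"
    )
    return db


def write_score(db: sqlite3.Connection, doc_id: str, agent: str,
                score: float, note: str) -> None:
    """Each agent writes its own row; a rerun overwrites the old value."""
    db.execute(
        "INSERT OR REPLACE INTO scores VALUES (?, ?, ?, ?)",
        (doc_id, agent, score, note),
    )
    db.commit()


def read_rows(db: sqlite3.Connection, doc_id: str) -> List[Tuple[str, float, str]]:
    """An agent pulls only the rows for the document it is working on."""
    cursor = db.execute(
        "SELECT agent, score, note FROM scores WHERE doc_id = ? ORDER BY agent",
        (doc_id,),
    )
    return cursor.fetchall()


def route(doc_tags: Set[str], agent_skills: Dict[str, Set[str]]) -> List[str]:
    """Wake only the agents whose skills overlap the document's tags."""
    return sorted(name for name, skills in agent_skills.items() if skills & doc_tags)


if __name__ == "__main__":
    board = make_board()
    skills = {
        "scorer_risk": {"indemnity", "liability"},
        "scorer_dates": {"deadline"},
        "analyst_ip": {"patent"},
    }
    for agent in route({"contract", "indemnity"}, skills):
        write_score(board, "doc-1", agent, 0.8, "indemnity clause is unusual")
    print(read_rows(board, "doc-1"))
```

In this toy run only one of the three agents is woken for the document, and what it learned is available to any later agent as a single row, not as a block of text resent in every prompt.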

Neither system was made to scale by a longer window. Both were made to scale by keeping the window small on purpose and putting the memory where memory belongs, outside it, in a store built to hold it.

How to tell working memory is your bottleneck

A handful of symptoms point at the working-memory layer rather than at the model or the store. The agent does fine on a short demo and falls apart on a long real session. It forgets or contradicts a decision it made earlier in the same conversation. It re-asks something you already answered, or redoes work it already did. Its answers get less reliable the longer the session runs, not more, even though it is technically still under the context limit. And the cost and latency climb steadily turn over turn, which is the quadratic from the last section showing up as a bill and a lag.

When that is the pattern, reaching for a bigger window is the expensive non-fix. The cheaper and more durable fix is the policy: cap the live context and pin the invariants so the goal and the hard constraints are never evicted; keep the recent turns raw; compact the warm middle knowingly; and offload the cold to the store you already built, pulling it back through the ranking pipeline only when a turn needs it. Clear stale tool output instead of carrying it. Isolate heavy subtasks in sub-agents so their reading never floods the main window.

The thread through this whole series holds here too. A context window is not a memory. Retrieval decides what is worth bringing into it. Working memory decides what is worth keeping there, and what has to leave. None of those three is solved by a larger model or a longer window. They are solved by deciding, deliberately, what your agent should be holding at each moment, and building the small amount of machinery that keeps it holding that and nothing else. That is the difference between an agent that demos well and one you can trust on a task that runs all day.

Is your agent solid in a short demo and shaky on a long real session? That is almost always a working-memory problem, not a model problem, and capping the window, pinning what matters, compacting the rest, and paging the store is work we do every week. Book a 15-minute call and we will tell you honestly whether your bottleneck is the window, the retrieval, or the store. We work US business hours.

Muhammad Mudassir

Related Articles

Why Your Agent Retrieves the Wrong Memory

Why Your AI Agent Keeps Forgetting

Mem0 vs Graphiti vs Building Your Own Graph
