Agent Memory & Context Graphs, Chapter 3

Why a Bigger Context Window Won't Save Your Agent

Muhammad Mudassir

Founder & CEO, Cognilium AI

A bigger context window does not give your agent a memory. The window is a cache: finite, expensive, and used less reliably as it fills, so a long session always overflows it. The real lever at scale is the working-memory policy that decides what stays: pin the invariants, keep recent turns, compact the warm middle, and offload the cold to a store you re-retrieve from. Manage the window, or the cost of the conversation grows with the square of its length while the agent forgets the one constraint that mattered.
TL;DR: A million-token context window does not give your agent a memory. It gives it a bigger desk, and a long enough task will bury any desk. The context window is working memory, a cache: finite, expensive, and used less reliably the more you cram into it, with peer-reviewed work showing that models miss the middle of a long input and reliably use far less than their advertised length. Real memory lives outside the window, in a store you page from. So the engineering problem at scale is not the size of the window; it is the policy that decides what stays in it: pin the few invariants, keep the recent turns, compact the warm middle into a running summary, and offload the cold to a store you can re-retrieve from precisely. Get that policy right and a small model holds a coherent thread across a thousand turns. Get it wrong and the cost of the conversation grows with the square of its length while the agent quietly forgets the one constraint that mattered.

An agent runs beautifully for twenty turns and then loses the plot. It forgets a constraint you gave it at the start. It re-asks a question you already answered. It contradicts a decision it made ten messages ago. The reflex is to reach for a bigger context window, and the model vendors are happy to sell you one. But you have almost certainly met an agent with a giant window that still forgets, because window size was never the thing that was broken. What broke is that nobody decided what the agent should keep in front of it as the session grew, so it kept the wrong things.

This is the fourth post in our series on agent memory. The first established that a context window is not a memory. The second compared the tools for building the store that is. The third showed that getting the right facts out of that store is a ranking problem, not a lookup. This post is about the moment those facts arrive and have to share one small, finite window with everything else the agent is already holding. Retrieval decides what is worth bringing in. Working memory decides what is worth keeping once it is in, and what has to leave to make room. That second decision quietly determines whether your agent can hold a long task together.

The context window is not where your agent's memory lives. It is the small, expensive surface that memory is projected onto, one turn at a time. Manage the projection, or it manages you.

A bigger window is not a bigger memory

The pitch is seductive: the window is now a million tokens, so just put everything in it and stop worrying about memory. Two things are wrong with that.

The first is simple arithmetic. A window of any fixed size, however large, is still fixed, and an agent that runs long enough will always generate more history than it holds. A coding agent working through a real task, a support agent on a multi-day thread, a research agent reading source after source, all of them cross any fixed boundary eventually. A bigger window moves the wall further out. It does not remove it. So you will need an eviction policy no matter how large the window is, and the only question is whether you design that policy or let a crude default make it for you.

The second problem is subtler and better evidenced: models do not use the whole window equally well, even below the limit. RULER, a benchmark from NVIDIA researchers, measured the effective context length of long-context models, meaning the length at which they still perform reliably, and found it routinely far shorter than the advertised number. A model that claims a large window often degrades well before reaching it. Separately, a 2025 study called NoLiMa showed that when you strip away literal word overlap, so the model has to actually reason over a long input rather than match a keyword, accuracy falls off sharply as the input grows, long before the stated limit. And the well-known Lost in the Middle study, published in the journal TACL in 2024, found a U-shaped pattern: a model uses information most reliably when it sits at the very start or the very end of its context, and is most likely to miss what sits in the middle. That last one is a position effect, not a length effect, and the three are distinct findings, but together they deliver one verdict. The usable part of a context window is smaller than the number on the box, and it gets less reliable the more you fill it. Stuffing the window is not a memory strategy. It is a way to pay more for worse attention.

Your context window is a cache, not a memory

The cleaner mental model, the one that tells you what to actually do, comes from a 2023 paper titled MemGPT: Towards LLMs as Operating Systems. It is an arXiv preprint rather than a peer-reviewed paper, so weigh it as a design proposal, but its framing is the most useful one in the field: treat the context window the way an operating system treats RAM. RAM is fast, small, and expensive. Disk is slow, vast, and cheap. The operating system's whole job is to keep the few things the program needs right now in RAM and page everything else to and from disk. MemGPT applies this directly: the context window is main context, the agent's RAM, and an external store is the agent's disk, with the model itself deciding what to page in and out.

Once you see the window as a cache, the design questions become the right ones. A cache is not where your data lives. It is a small, fast staging area that holds a working copy of the slice you need this instant. Your data lives in the backing store. The skill in caching has never been making the cache bigger; it has been deciding what earns a slot and what gets evicted. The same is true here. Your agent's memory does not live in the window. It lives in the store we spent the first and second posts building. The window holds the working set: the small projection of that memory the agent needs for the turn in front of it.

In mature agent systems, almost all of the state lives outside the window on purpose. The agent writes its progress to a file, keeps a task list it reads back each turn, records decisions in a structured log, and pulls only the relevant slice into context when it needs it. The window at any moment holds a fraction of what the agent knows. This is not a workaround for small windows. It is the architecture, and it is the same one whether the window is eight thousand tokens or a million, because the cache discipline is what keeps attention focused, not just what keeps the agent under a token limit.
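To make "state lives outside the window" concrete, here is a minimal sketch, assuming a plain JSON file as the external state. The class name `AgentState`, the file layout, and the `context_slice` method are all illustrative assumptions, not any particular framework's API. The point it shows is that the agent writes everything down but reads back only a small slice each turn.

```python
import json
from pathlib import Path


class AgentState:
    """Durable agent state kept on disk, outside the context window (illustrative)."""

    def __init__(self, path: Path):
        self.path = path
        if not path.exists():
            self._save({"tasks": [], "decisions": []})

    def _load(self) -> dict:
        return json.loads(self.path.read_text())

    def _save(self, state: dict) -> None:
        self.path.write_text(json.dumps(state, indent=2))

    def add_task(self, title: str) -> None:
        state = self._load()
        state["tasks"].append({"title": title, "done": False})
        self._save(state)

    def complete_task(self, title: str) -> None:
        state = self._load()
        for task in state["tasks"]:
            if task["title"] == title:
                task["done"] = True
        self._save(state)

    def record_decision(self, text: str) -> None:
        state = self._load()
        state["decisions"].append(text)
        self._save(state)

    def context_slice(self, max_decisions: int = 3) -> str:
        """Return only what the next turn needs: open tasks and the latest decisions."""
        state = self._load()
        open_tasks = [t["title"] for t in state["tasks"] if not t["done"]]
        recent = state["decisions"][-max_decisions:]
        lines = ["Open tasks:"] + [f"- {t}" for t in open_tasks]
        lines += ["Recent decisions:"] + [f"- {d}" for d in recent]
        return "\n".join(lines)
```

The file can grow for the whole session while `context_slice` stays roughly the same size, which is the cache discipline in miniature.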

Position-blind eviction is the real failure

So the window fills, and something has to leave. The default, the one you get if you do nothing, is to drop the oldest turns and keep the most recent ones. This is the crudest possible policy, and it is worth being precise about why it is bad, because the obvious criticism is wrong and the real one is worse.

The obvious criticism is that dropping the oldest turns deletes the task goal, which usually sits at the start. In practice any competent system pins the system prompt and the goal so they are never evicted, so that specific fear is easy to handle. The real problem is deeper. Dropping the oldest turns evicts by position, by age, and age is blind to importance. A constraint you were given at turn twelve, "do not contact this customer by phone, only by email," is now old, so it gets dropped, while three turns of recent small talk survive because they are new. The policy threw away the load-bearing fact and kept the noise, purely because of when each arrived. Recency is a fine tiebreaker and a terrible primary criterion.

There is an instructive result from the systems layer here, worth borrowing as an analogy rather than as a direct claim. A 2024 paper introducing a method called StreamingLLM found that when you run a model over a very long stream and naively evict the earliest tokens, performance collapses, because models park a large share of their attention on the first few tokens, which the authors called attention sinks. Keep a few of those initial tokens plus a recent window and the model stays stable over millions of tokens. That work is about token-level attention, not about dropping conversation turns, so do not overclaim it. But the lesson rhymes with ours: blindly evicting the oldest thing is the policy most likely to break, and what you choose to protect from eviction matters more than how much you keep.
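The "few initial tokens plus a recent window" idea is simple enough to write down. This is a minimal sketch of only the selection rule, with a hypothetical function name and illustrative defaults. The published method has further details, such as how positions are handled inside the cache, that this ignores.

```python
def streaming_keep_positions(seq_len: int, n_sink: int = 4, window: int = 1020) -> list[int]:
    """Token positions retained in the cache: the first n_sink tokens plus the latest window."""
    if seq_len <= n_sink + window:
        # Everything still fits, so nothing is evicted yet.
        return list(range(seq_len))
    sinks = list(range(n_sink))
    recent = list(range(seq_len - window, seq_len))
    return sinks + recent


# With 2 sink tokens and a window of 3, a 10-token stream keeps both ends.
print(streaming_keep_positions(10, n_sink=2, window=3))  # [0, 1, 7, 8, 9]
```

Naive oldest-first eviction would return only `[7, 8, 9]` here. The difference between the two policies is a couple of protected positions, not a larger cache.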

The fix follows directly from the previous post. Eviction is just retrieval run in reverse. If retrieval is a ranking problem, deciding what to bring in, then eviction is the same ranking problem deciding what to let go: rank what is in the window by importance and relevance to the current goal, not by age, and evict from the bottom. The constraint from turn twelve outranks the recent small talk, so it stays. That is the whole move.
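Here is a minimal sketch of the two policies side by side. It assumes each item in the window already carries an importance score and a relevance score from some scorer you trust. The `Item` fields, the scoring formula, and the `recency_weight` value are illustrative assumptions, not a recommended formula.

```python
from dataclasses import dataclass


@dataclass
class Item:
    turn: int          # turn at which the item entered the window
    text: str
    tokens: int        # token count, however you measure it
    importance: float  # 0..1, assumed to come from an upstream scorer
    relevance: float   # 0..1, relevance to the current goal


def evict_oldest_first(items: list[Item], budget: int) -> list[Item]:
    """The crude default: drop the oldest items until the rest fits."""
    kept = sorted(items, key=lambda i: i.turn)
    while kept and sum(i.tokens for i in kept) > budget:
        kept.pop(0)
    return kept


def evict_by_rank(items: list[Item], budget: int, recency_weight: float = 0.1) -> list[Item]:
    """Keep the highest-ranked items that fit; recency is only a small tiebreaker."""
    if not items:
        return []
    newest = max(i.turn for i in items)

    def score(i: Item) -> float:
        recency = 1.0 / (1 + newest - i.turn)
        return i.importance + i.relevance + recency_weight * recency

    kept, used = [], 0
    for item in sorted(items, key=score, reverse=True):
        if used + item.tokens <= budget:
            kept.append(item)
            used += item.tokens
    return sorted(kept, key=lambda i: i.turn)  # restore conversation order


window = [
    Item(12, "Contact this customer by email only, never by phone.", 20, 1.0, 0.8),
    Item(30, "Ha, same here.", 20, 0.1, 0.1),
    Item(31, "How was the weekend?", 20, 0.1, 0.1),
    Item(32, "Pretty good, thanks.", 20, 0.1, 0.1),
]
print([i.turn for i in evict_oldest_first(window, budget=60)])  # [30, 31, 32]
print([i.turn for i in evict_by_rank(window, budget=60)])       # [12, 31, 32]
```

Same budget, same items. The only change is the sort key, and the turn-twelve constraint survives.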

Compaction trades detail for room, and it has a cost

When ranking alone cannot clear enough space, you compress. The production name for this is compaction, and it is worth saying plainly what it is: compaction is recursive summarization triggered when the window nears its limit. You take the older part of the transcript, summarize it into something far shorter, and reinitialize the window with that summary in place of the raw turns. Several systems ship this now as an automatic feature, summarizing the conversation as it approaches a threshold and continuing from the digest. The idea is old. Stanford's Generative Agents, the UIST 2023 paper, had its agents periodically reflect, synthesizing many low-level memories into a few higher-level ones, which is consolidation by another name. Summarize the summaries and you have hierarchical compaction.
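A minimal sketch of threshold-triggered compaction follows. It assumes you supply the summarizer as a callable, in practice a model call, and it uses a crude word count in place of a real tokenizer. The function names, the 0.8 threshold, and the summary prefix are illustrative assumptions.

```python
from typing import Callable


def count_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: roughly one token per word."""
    return len(text.split())


def compact(
    turns: list[str],
    summarize: Callable[[str], str],
    limit: int,
    threshold: float = 0.8,
    keep_recent: int = 4,
) -> list[str]:
    """Replace older turns with one summary once the transcript nears the limit."""
    total = sum(count_tokens(t) for t in turns)
    if total < threshold * limit or len(turns) <= keep_recent:
        return turns  # plenty of room, leave the transcript alone
    split = len(turns) - keep_recent
    older, recent = turns[:split], turns[split:]
    digest = summarize("\n".join(older))
    return [f"[summary of earlier turns] {digest}"] + recent
```

Because the summary line is just another entry in `turns`, the next compaction folds it into `older` and summarizes it again along with newer material. That is the recursive part, and it is also where detail erodes a little more on every pass.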

But compaction has a cost, and this is the part teams underestimate. Summarizing is lossy by definition: you are trading detail for room. A benchmark called LoCoMo found that giving a model summaries of a long conversation did not significantly help despite the summaries having high recall, because the act of compressing dialogue into a summary loses the specific information the task later needed. Quantities, identifiers, the exact wording of a constraint: these are precisely what a summary smooths away, and precisely what a careful task depends on.

Here is the tension that makes working memory genuinely hard, and it is the real thesis of this post. You cannot win by refusing to summarize, because keeping everything verbatim overflows the window, triggers the lost-in-the-middle problem, and costs you quadratically, as the next section shows. And you cannot win by summarizing everything either, because aggressive summarization paraphrases away the details that detail-sensitive work depends on. Both extremes fail. The working set has to stay small, which forces you to choose what gets kept in full and what gets compressed. That choice, not the window size, is the design problem.

Pin, keep, compact, offload: the four moves

Put the pieces together and a working-memory policy has four moves. None of them is exotic. The discipline is in applying all four deliberately instead of letting a default truncate your history.

Pin the invariants. The goal, the hard constraints, the schema or format the output must follow, the small set of facts that must never be wrong: these go in a protected region of the window that is never evicted, held verbatim, ideally near the start or the end where the model attends most reliably. This is a tiny budget and it pays for itself many times over.

Keep the recent turns. The last several exchanges stay raw, because recency genuinely matters for the immediate next step, and because summarizing what just happened is both lossy and pointless. This is the one place the crude default is right.

Compact the warm middle. The older-but-not-ancient history, the part between the pinned invariants and the recent raw turns, gets summarized into a running digest. This is where compaction lives. You trade the detail of those turns for the room to keep going, accepting the loss knowingly and only for the material least likely to be needed word for word.

Offload the cold, and re-retrieve on demand. Everything else goes out to the store, the vector index and the knowledge graph from the earlier posts, and is pulled back only when a specific turn needs it, through the ranking pipeline from the retrieval post. This is the move that makes the whole thing scale: the agent's full history can be enormous, because almost none of it is in the window at any given moment. It is on disk, and the agent pages in the slice it needs. That store has to be kept clean and current to be worth paging from, which is its own discipline, but once it is, the window stops being the bottleneck.
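The four moves can be sketched together in one small class. This is a toy under loud assumptions: `WorkingMemory` and its fields are hypothetical names, the external store is a plain list standing in for a vector index or graph, retrieval is keyword overlap standing in for a real ranking pipeline, and every turn that ages out is both summarized and offloaded, which collapses the warm and cold tiers into one step.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class WorkingMemory:
    pinned: list[str]                                # invariants, never evicted
    keep_recent: int = 4                             # raw turns kept verbatim
    summary: str = ""                                # running digest of aged-out turns
    recent: list[str] = field(default_factory=list)
    store: list[str] = field(default_factory=list)   # stand-in for the external store

    def add_turn(self, turn: str, summarize: Callable[[str, str], str]) -> None:
        self.recent.append(turn)
        while len(self.recent) > self.keep_recent:
            oldest = self.recent.pop(0)
            self.store.append(oldest)                       # offload the raw turn
            self.summary = summarize(self.summary, oldest)  # compact it into the digest

    def retrieve(self, query: str, k: int = 2) -> list[str]:
        """Keyword overlap as a placeholder for a real ranking pipeline."""
        words = set(query.lower().split())
        scored = [(len(words & set(t.lower().split())), t) for t in self.store]
        hits = sorted((s for s in scored if s[0] > 0), key=lambda s: s[0], reverse=True)
        return [text for _, text in hits[:k]]

    def build_context(self, query: str) -> list[str]:
        parts = list(self.pinned)                           # pin
        if self.summary:
            parts.append("Summary so far: " + self.summary)  # compact
        parts += ["Retrieved: " + t for t in self.retrieve(query)]  # page back in
        parts += self.recent                                # keep
        return parts
```

The context this builds stays bounded only if the summarizer keeps the digest bounded. Here the pinned block sits at the start and the raw recent turns at the end, the two positions the earlier section described as most reliably attended.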

Two refinements separate a careful implementation from a crude one. The first is to distinguish clearing from compacting. Some of what clogs a window is not conversation at all. It is the bulky output of old tool calls, stale search results, a file you read six steps ago. That material can often be removed outright rather than summarized, which loses nothing but the bulk, whereas summarizing a constraint risks losing the constraint. Clear what is safe to drop, and compact only what you must keep but cannot keep in full. The second refinement is sub-agent isolation: when a subtask needs to read a great deal to produce a little, spin up a sub-agent with its own fresh window, let it do the heavy reading there, and return only its conclusion to the main agent. The detailed context never touches the parent's window. Anthropic's published multi-agent research system works this way, with worker agents each exploring in their own context and handing back a short summary. Its reported gains are an internal evaluation, so treat the headline number as a vendor result, but the architecture is sound and now widely adopted.
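Clearing is the easier of the two refinements to show. This is a minimal sketch, assuming messages carry a role and the turn they arrived on. The `Message` type, the `max_age` cutoff, and the placeholder text are illustrative assumptions.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Message:
    role: str      # "user", "assistant", or "tool"
    content: str
    turn: int


def clear_stale_tool_output(
    messages: list[Message],
    current_turn: int,
    max_age: int = 3,
    placeholder: str = "[tool output cleared]",
) -> list[Message]:
    """Blank out old tool results; leave conversation turns untouched."""
    cleared = []
    for m in messages:
        if m.role == "tool" and current_turn - m.turn > max_age:
            cleared.append(replace(m, content=placeholder))
        else:
            cleared.append(m)
    return cleared
```

The sketch swaps the content for a short placeholder instead of deleting the message, so that any pairing between a tool call and its result that your chat API expects stays intact. Nothing here is summarized, so nothing is paraphrased away.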

Figure: A working memory hierarchy for an AI agent. A long session produces a growing transcript that overflows a finite context window. A policy decides what stays: pin the invariants such as the goal and hard constraints so they are never evicted, keep the recent turns raw, compact the warm middle into a running summary, and offload the cold history to an external store of a knowledge graph and vector index. The needed slice is paged back into the window on demand. Position-blind truncation drops a load-bearing constraint, while the managed working set keeps it.

The cost nobody budgets: history is quadratic

There is a cost argument here that most teams do not see until the invoice arrives, and it is the most concrete reason to manage the working set rather than let it grow.

A chat model's API is stateless. It does not remember your previous turns. To continue a conversation, you resend the entire history every single turn. That means the input you pay for grows with every exchange. On turn two you resend turn one. On turn fifty you resend the previous forty-nine. The tokens you are billed for on a single turn grow linearly with the length of the conversation, and the total over the whole session grows with the square of its length. This is the ordinary cost of a stateless conversation, and it is quadratic.

Put rough numbers on it. Say your system prompt and tools are a few thousand tokens and each exchange adds about a thousand. A hundred-turn session with no memory management resends an ever-growing transcript and runs through something on the order of five million input tokens by the end, with the final turn alone carrying around a hundred thousand. At a mid-tier model price in the rough range of three dollars per million input tokens in early 2026, that is roughly fifteen dollars of input for one session, and the per-turn cost is still climbing when it ends. Now run it for a thousand turns. The quadratic bites hard: you are into hundreds of millions of input tokens and well over a thousand dollars for the single session, except it never gets there, because a thousand turns at a thousand tokens each is a million-token transcript that overflows the window entirely and simply stops working.

Now cap the live context with the policy above, say at twenty thousand tokens. Per-turn input stops growing, because the window holds a bounded working set no matter how long the session runs. The cost goes from quadratic to linear. The hundred-turn session drops to a few dollars. The thousand-turn session, which could not even run before, costs tens of dollars and runs fine. The savings ratio is not fixed. It grows with length: nearly three times cheaper at a hundred turns, an order of magnitude or more by a thousand. Treat the dollar figures as illustrative, because model prices move month to month, but the shape is the point and the shape does not change. Unmanaged history is quadratic. A bounded working set is linear.
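The arithmetic is easy to check for yourself. This sketch assumes a 3,000-token system prompt, 1,000 tokens added per exchange, a 20,000-token cap, and three dollars per million input tokens. All four numbers are the illustrative figures from the text, not measurements. It counts input tokens only and ignores output tokens and the cost of the summarization calls a real policy would make.

```python
from typing import Optional


def session_input_tokens(
    turns: int,
    system_tokens: int = 3_000,
    per_exchange: int = 1_000,
    cap: Optional[int] = None,
) -> int:
    """Total input tokens sent over a session when the history is resent every turn."""
    total = 0
    for n in range(1, turns + 1):
        context = system_tokens + per_exchange * (n - 1)
        if cap is not None:
            context = min(context, cap)  # bounded working set
        total += context
    return total


def dollars(tokens: int, price_per_million: float = 3.0) -> float:
    return tokens / 1_000_000 * price_per_million


for turns in (100, 1_000):
    unmanaged = session_input_tokens(turns)
    capped = session_input_tokens(turns, cap=20_000)
    print(turns, unmanaged, capped, round(unmanaged / capped, 1))
# 100 5250000 1847000 2.8
# 1000 502500000 19847000 25.3
```

At three dollars per million, that is about $15.75 versus $5.54 for a hundred turns, and about $1,507 versus $60 for a thousand. Ten times the turns costs roughly a hundred times as much unmanaged, and roughly ten times as much capped.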

One distinction is worth nailing down, because it trips people up. Prompt caching does not solve this. Caching lets you resend a stable prefix at a steep discount, often around a tenth of the normal input price for the cached part, which is a real and worthwhile saving. But it is a discount on the price of the tokens, not a reduction in their number. The cached tokens still fill the window, the model still attends over all of them, and the count still grows with the conversation. Caching shrinks the multiplier. It does not turn the quadratic into a line. Only managing the working set does that. And to head off a common confusion: prompt caching is a billing feature about reusing a prefix, which is a different thing from the model's internal attention cache, and a different thing again from the second quadratic that lives inside a single forward pass, where attention compute scales with the square of the context length. That last one is another good reason to keep the window small, but it is a separate mechanism. The billing quadratic, the one that grows across turns, is the one that shows up on your invoice.
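A rough model shows why a discount does not change the curve. This sketch makes generous assumptions in caching's favor: every token sent on the previous turn hits the cache on the next one, cached tokens are billed at a tenth of the normal price, and there is no surcharge for writing to the cache and no expiry. Real provider pricing differs in the details, and the function name and defaults are illustrative.

```python
def billable_token_equivalents(
    turns: int,
    system_tokens: int = 3_000,
    per_exchange: int = 1_000,
    cached_price_ratio: float = 1.0,
) -> float:
    """Input cost in full-price token equivalents; a ratio of 1.0 means no caching."""
    total = 0.0
    previous_context = 0
    for n in range(1, turns + 1):
        context = system_tokens + per_exchange * (n - 1)
        new_tokens = context - previous_context
        # The already-seen prefix is discounted; only the new tokens cost full price.
        total += cached_price_ratio * previous_context + new_tokens
        previous_context = context
    return total


for turns in (100, 1_000, 2_000):
    print(turns, round(billable_token_equivalents(turns, cached_price_ratio=0.1)))
```

Under these assumptions the cached bill is roughly 0.6 million token equivalents at a hundred turns, 51 million at a thousand, and 202 million at two thousand. Doubling the session still roughly quadruples the cost. The multiplier is smaller, the curve is the same, and the number of tokens sitting in the window has not changed at all.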

What this looks like when we build it

The abstract policy is pin, keep, compact, offload. Here is the concrete version, from systems we have shipped, anonymized.

On a wealth-management platform we built, the agent's memory is emphatically not the window. The durable memory is a knowledge graph in Neo4j, six entity types and five relationship types, sitting beside a vector store of a few hundred chunks. At any moment the window holds only the working set for the question in front of the agent: the active client thread, the specific holdings and clauses the current question touches, the constraints that must hold. Everything else, the full history of every client and instrument, stays in the store and is paged in by retrieval only when a turn needs it. The window never tries to hold the whole client. It holds the slice, and the graph holds the rest. That is the cache discipline made physical, and it is why the system stays coherent over long sessions instead of degrading as the history piles up.

On Paralegent, our multi-agent legal-analysis system, the working set lives outside every individual window by design. Its twenty-three agents, twelve scorers and eleven analysts, do not each carry the full case in their own context. They coordinate through a shared scores table and a queue, which is external working memory: structured state that no single agent holds in full, and from which any of them can read the row it needs. A router decides which agents and which facts a given document actually needs and prunes the rest, so the system is not waking all twenty-three and re-sending the same large context to each. Deciding what does not need to be in any window is as much of the design as deciding what does, and moving the shared state out of the prompts into that table, together with the pruning, is a large part of why model calls dropped by roughly seventy-five percent. The agents stay small and fast because the memory they share is not crammed into each of their windows. It sits outside, and they read what they need.

Neither system was made to scale by a longer window. Both were made to scale by keeping the window small on purpose and putting the memory where memory belongs, outside it, in a store built to hold it.

How to tell working memory is your bottleneck

A handful of symptoms point at the working-memory layer rather than at the model or the store. The agent does fine on a short demo and falls apart on a long real session. It forgets or contradicts a decision it made earlier in the same conversation. It re-asks something you already answered, or redoes work it already did. Its answers get less reliable the longer the session runs, not more, even though it is technically still under the context limit. And the cost and latency climb steadily turn over turn, which is the quadratic from the last section showing up as a bill and a lag.

When that is the pattern, reaching for a bigger window is the expensive non-fix. The cheaper and more durable fix is the policy: cap the live context and pin the invariants so the goal and the hard constraints are never evicted; keep the recent turns raw; compact the warm middle knowingly; and offload the cold to the store you already built, pulling it back through the ranking pipeline only when a turn needs it. Clear stale tool output instead of carrying it. Isolate heavy subtasks in sub-agents so their reading never floods the main window.

The thread through this whole series holds here too. A context window is not a memory. Retrieval decides what is worth bringing into it. Working memory decides what is worth keeping there, and what has to leave. None of those three is solved by a larger model or a longer window. They are solved by deciding, deliberately, what your agent should be holding at each moment, and building the small amount of machinery that keeps it holding that and nothing else. That is the difference between an agent that demos well and one you can trust on a task that runs all day.

Is your agent solid in a short demo and shaky on a long real session? That is almost always a working-memory problem, not a model problem, and capping the window, pinning what matters, compacting the rest, and paging the store is work we do every week. Book a 15-minute call and we will tell you honestly whether your bottleneck is the window, the retrieval, or the store. We work US business hours.

