What Is a Context Window in LLMs?

Have you ever had a long conversation with an AI assistant and noticed that, somewhere in the middle, it seemed to forget what you told it at the very beginning? You are not imagining it. This is not a flaw of any particular model; it is a fundamental property of how today's language models work. The phenomenon comes down to one concept: the context window.

Understanding the context window is one of the most useful things you can do as someone who uses AI tools. It explains why models sometimes lose track, why pasting a huge document can backfire, and why prompt order matters more than people assume. This article explains what a context window is, how it is measured, and how to get the most out of it.

The core idea: a model reads, it does not remember

It is tempting to think of a chatbot as a person who accumulates memories over a conversation. The reality is closer to a reader who, before answering any single message, re-reads the entire transcript from the top.

Every time you send a new message, the model does not simply look at your latest sentence. It receives the whole conversation up to that point — every user message, every assistant reply, plus any system instructions — and reads it fresh before generating its next response. The context window is simply the maximum amount of text the model can hold in that "working memory" for a single response.
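A minimal sketch of this re-reading behaviour, assuming a hypothetical `Conversation` class that stands in for whatever the chat application does behind the scenes (the names and the plain-text transcript format are illustrative, not any product's actual API):

```python
from dataclasses import dataclass, field


@dataclass
class Conversation:
    """Holds the transcript. The model itself keeps nothing between turns."""
    system: str
    messages: list[tuple[str, str]] = field(default_factory=list)  # (role, text)

    def build_model_input(self, new_user_message: str) -> str:
        """Return the full text the model would read for this one turn."""
        self.messages.append(("user", new_user_message))
        lines = [f"system: {self.system}"]
        lines += [f"{role}: {text}" for role, text in self.messages]
        return "\n".join(lines)

    def record_reply(self, reply: str) -> None:
        """Store the assistant's reply so the next turn can include it."""
        self.messages.append(("assistant", reply))


chat = Conversation(system="Be concise.")
first_input = chat.build_model_input("My name is Ada.")
chat.record_reply("Nice to meet you, Ada.")
second_input = chat.build_model_input("What is my name?")
print(second_input)
```

The second input contains the whole exchange, not just the latest question. That is the only reason the model could answer it.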

The illustration below shows this dynamic. The window has a fixed size. As the conversation grows, newer messages push older ones toward the edge, and once they fall outside the window the model can no longer see them.

Figure: a fixed-size context window fills up as a conversation grows. Older messages slide out and become invisible to the model while newer ones enter.

This is the single most important mental model to hold: the model has no persistent memory between turns. Whatever is not inside the current window, for that one response, effectively does not exist as far as the model is concerned.

Tokens, not words

Context windows are measured in tokens, not words or characters. A token is the basic unit a model chops text into: often a whole word, sometimes only a fragment of one. Common words are often one token; longer or unusual words may split into several. A useful rule of thumb is that one token corresponds to about three-quarters of a typical English word, so 1,000 tokens is roughly 750 words. The exact ratio depends on the tokenizer and the language.
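The rule of thumb can be turned into a quick estimator. This is a rough sketch only: `estimate_tokens` is a hypothetical helper that counts whitespace-separated words, and a real tokenizer will give different numbers.

```python
WORDS_PER_TOKEN = 0.75  # rule of thumb from the text; real tokenizers vary


def estimate_tokens(text: str) -> int:
    """Rough token estimate based on a whitespace word count."""
    words = len(text.split())
    return round(words / WORDS_PER_TOKEN)


sample = "word " * 750  # a 750-word stand-in document
print(estimate_tokens(sample))
```

For the 750-word sample this prints 1000, matching the ratio above.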

When a model is advertised as having a "128K context window," that means it can accept up to roughly 128,000 tokens of combined input and output in a single request. Because both your prompt and the model's reply count against the budget, a very long input leaves less room for a long answer.
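The budget arithmetic is simple subtraction. The sketch below assumes the shared input-plus-output budget described above; `remaining_output_budget` is a hypothetical helper, and real products may apply further limits of their own.

```python
def remaining_output_budget(window_tokens: int, input_tokens: int) -> int:
    """Tokens left for the reply once the input has been counted."""
    if input_tokens > window_tokens:
        raise ValueError("input alone exceeds the context window")
    return window_tokens - input_tokens


# A 120,000-token prompt in a 128,000-token window leaves 8,000 tokens
# for the answer.
print(remaining_output_budget(window_tokens=128_000, input_tokens=120_000))
```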

A few reference points to calibrate your intuition:

A short email is a few hundred tokens.
A typical news article is 800–1,500 tokens.
A 20-page report can run 10,000 tokens or more.
An entire short novel might be 60,000–100,000 tokens.

Once you start thinking in tokens, you can estimate for yourself whether something will fit.

Why the window is finite

It is natural to ask: why not just make the window enormous and be done with it? The answer is cost, and it grows steeply.

At long sequence lengths, the dominant cost in a standard Transformer is the attention mechanism, which lets every token consider every other token to decide what is relevant. This "every-token-to-every-token" relationship means the computation does not scale linearly with length. It grows roughly with the square of the number of tokens, so doubling the context roughly quadruples the attention work. Attention is powerful precisely because it is exhaustive, and that exhaustiveness is what makes large windows expensive.
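To make the scaling concrete, the sketch below counts token-to-token comparisons for full self-attention. It is a simplification: it ignores causal masking, the number of layers and heads, and every optimization real systems use, so only the growth pattern is meaningful.

```python
def attention_pairs(n_tokens: int) -> int:
    """Number of token-to-token comparisons when every token sees every token."""
    return n_tokens * n_tokens


for n in (1_000, 2_000, 4_000):
    print(f"{n:>5} tokens -> {attention_pairs(n):>12,} comparisons")
```

Each doubling of the length multiplies the comparison count by four.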

This is one reason the field is so active. Researchers are exploring alternative backbones — linear-attention variants and state-space model families such as Mamba — that handle long sequences more cheaply. Some production models are hybrids that combine these cheaper backbones with standard attention only where it earns its cost. If you are interested in how sparsity fits into this picture, our article on Mixture of Experts covers a complementary efficiency technique.

But for now, the practical reality is that every model has a window, and the window is the budget you are working within.

What happens when you exceed the window

When a conversation grows past the window, something has to give. Different systems handle this differently, but the common patterns are:

Truncation from the front. The oldest messages are quietly dropped so the newest ones fit. The model keeps responding, but it has lost access to whatever was removed. This is why a model might "forget" an instruction you gave early on.
Summarization. Some products compress older parts of the conversation into a summary before discarding the raw text. This preserves the gist but loses detail and exact wording.
Hard refusal. If a single input — say, a giant pasted document — exceeds the window by itself, the system will often refuse outright rather than silently truncate.
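Two of these patterns, truncation from the front and hard refusal, can be sketched in a few lines. This is an illustration under stated assumptions, not how any particular product works: `fit_to_window`, `InputTooLargeError`, and the word-based `count_tokens` estimate are all hypothetical. Summarization is left out because it needs a model call.

```python
class InputTooLargeError(ValueError):
    """Raised when the newest message cannot fit in the window by itself."""


def count_tokens(text: str) -> int:
    # Hypothetical estimate: about 0.75 words per token, not a real tokenizer.
    return round(len(text.split()) / 0.75)


def fit_to_window(messages: list[str], window_tokens: int) -> list[str]:
    """Keep the newest messages that fit, dropping the oldest first."""
    if not messages:
        return []
    if count_tokens(messages[-1]) > window_tokens:
        raise InputTooLargeError("newest message exceeds the window by itself")

    kept: list[str] = []
    used = 0
    for message in reversed(messages):  # walk from newest to oldest
        cost = count_tokens(message)
        if used + cost > window_tokens:
            break  # this message and everything older is dropped
        kept.append(message)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


history = ["always answer in French", "tell me about Paris", "and about Lyon"]
print(fit_to_window(history, window_tokens=10))
```

With a 10-token window, the opening instruction is the first thing to go, and nothing in the returned list signals that it was ever there.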

The important takeaway is that forgetting is usually silent. The model will not announce that it dropped your earlier message; it will simply respond as though that message never existed. When a model contradicts something it "already knew," the first thing to check is whether that information has fallen outside the window.

Practical tips for working within the window

A little awareness goes a long way. Here are concrete habits that help:

Put the most important material last. Truncation usually removes the oldest text first, and models tend to follow recent material more reliably, so the instruction or data you most need the model to act on should sit near the end of your prompt, not the beginning.

Do not paste everything "just in case." Dumping an entire codebase or a full document into the window sounds safe, but it dilutes the model's attention and burns budget that could go to a better answer. Include only the parts relevant to the task.

Re-state critical constraints. If a rule matters across a long conversation ("answer in under 50 words," "assume Python 3.11"), do not trust that an early instruction will survive. Restate key constraints when the conversation gets long.
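If you build prompts programmatically, restating can be automated. The sketch below appends standing constraints to every fifth message; `with_constraints` and the interval of five are arbitrary illustrative choices, not a recommended setting.

```python
def with_constraints(message: str, constraints: list[str], turn: int,
                     every: int = 5) -> str:
    """Append the standing constraints on every `every`-th turn."""
    if not constraints or turn % every != 0:
        return message
    reminder = "Reminder of standing constraints: " + "; ".join(constraints)
    return f"{message}\n\n{reminder}"


rules = ["answer in under 50 words", "assume Python 3.11"]
print(with_constraints("How do I read a CSV file?", rules, turn=5))
```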

Start fresh for a new task. A thread that has meandered through several unrelated topics carries a lot of irrelevant context. Beginning a new conversation for a new task gives the model a clean window and often better results.

Prefer structured, scannable input. Clear headings and delimited sections help the model find what it needs in a long input, which tends to improve accuracy.
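One way to apply the tips on ordering and structure together is to assemble prompts from labelled sections and place the task at the end. The `=== TITLE ===` delimiter and the `build_prompt` helper are assumptions made for this sketch; any consistent, unambiguous marker would serve.

```python
def build_prompt(sections: dict[str, str], instruction: str) -> str:
    """Join labelled reference sections, then put the task last."""
    parts = [
        f"=== {title.upper()} ===\n{body.strip()}"
        for title, body in sections.items()
    ]
    parts.append(f"=== TASK ===\n{instruction.strip()}")
    return "\n\n".join(parts)


prompt = build_prompt(
    {
        "error log": "TypeError: unsupported operand type(s) for +: 'int' and 'str'",
        "relevant code": "total = count + label",
    },
    instruction="Explain the error and suggest a one-line fix.",
)
print(prompt)
```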

Context window vs. training knowledge

A common confusion is mixing up the context window with what the model "knows." They are different things:

Training knowledge is everything the model learned before you ever talk to it. It is fixed once training ends and lives in the model's weights.
The context window is the working memory for the current conversation. It is short-term, holds only what is put into it, and is rebuilt from the transcript on every turn.

A model can know an enormous amount and still fail at a task if the relevant information is not in the window. Conversely, pasting the right facts into the window can let a modest model outperform a much larger one that is working from memory alone. This is the core idea behind retrieval-augmented generation: getting the right text into the window beats hoping the model remembers it.
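The retrieval step can be illustrated with a toy example. The sketch below ranks text chunks by plain word overlap with the question and keeps the best ones that fit a token budget. Real systems typically use embedding-based search, so `select_context`, the overlap score, and the word-based token estimate are stand-ins chosen only to keep the example self-contained.

```python
import re


def words(text: str) -> set[str]:
    """Lowercased alphanumeric words in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def count_tokens(text: str) -> int:
    # Hypothetical estimate: about 0.75 words per token, not a real tokenizer.
    return round(len(text.split()) / 0.75)


def select_context(query: str, chunks: list[str], budget_tokens: int) -> list[str]:
    """Pick the chunks sharing the most words with the query, within budget."""
    query_words = words(query)
    ranked = sorted(chunks, key=lambda c: len(query_words & words(c)), reverse=True)
    selected: list[str] = []
    used = 0
    for chunk in ranked:
        if not query_words & words(chunk):
            continue  # nothing in common with the question
        cost = count_tokens(chunk)
        if used + cost > budget_tokens:
            continue  # too big for what is left; try the next chunk
        selected.append(chunk)
        used += cost
    return selected


chunks = [
    "The refund window is 30 days.",
    "Our office is in Lisbon.",
    "Refund requests need an order number.",
]
print(select_context("How long is the refund window?", chunks, budget_tokens=8))
```

With a budget of 8 estimated tokens, only the chunk about the refund window is selected, and that chunk is what would be pasted into the prompt ahead of the question.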

Conclusion

The context window is the single concept that best explains the everyday quirks of AI assistants. It is why models forget, why long chats degrade, why order matters, and why a bigger window is not automatically better, since its cost grows steeply. Once you internalize that a model re-reads everything it can see before each reply, and that "everything it can see" is bounded, a lot of confusing behavior starts to make sense.

The good news is that the window is something you can work with deliberately. By managing what goes in, where it sits, and when to start fresh, you get noticeably better results from any model, regardless of its size.