Imagine trying to answer a complex question while only being allowed to look at one page of notes at a time. Whatever isn't on that page may as well not exist. A language model lives under exactly this constraint. As we saw with large language models, a model has no memory between calls — so the text you hand it in a single request is, quite literally, everything it can think about. The container that holds that text has a fixed size, and that container is the context window.
A bounded budget of tokens
The window is measured in tokens — the same small text chunks the model reads and writes. Every model has a maximum: it can take in only so many tokens per call, full stop. Think of it less like a notebook you can keep adding pages to and more like a single whiteboard of a fixed size. You can write whatever you like on it, but once it's full, something has to be erased before anything new fits. That hard ceiling shapes almost everything about working with these models.
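To make the ceiling concrete, here is a minimal sketch of a budget check. It assumes a rough rule of thumb of about four characters per token for English text; `estimate_tokens` and `fits_in_window` are hypothetical helpers written for this illustration, not part of any real tokenizer or API, and a real tokenizer would give different counts.

```python
def estimate_tokens(text: str) -> int:
    """Rough estimate only: assumes ~4 characters per token (not a real tokenizer)."""
    return len(text) // 4


def fits_in_window(text: str, max_tokens: int) -> bool:
    """True if the estimated token count is within the fixed ceiling."""
    return estimate_tokens(text) <= max_tokens


MAX_TOKENS = 100  # hypothetical limit, far smaller than any real model's

short_note = "a" * 400  # estimated at 100 tokens
long_note = "a" * 404   # estimated at 101 tokens

print(fits_in_window(short_note, MAX_TOKENS))
print(fits_in_window(long_note, MAX_TOKENS))
```

The point is the shape of the check rather than the numbers: the limit is a hard comparison, so one token over is treated the same as a thousand over.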
How it works: what shares the space
A lot has to fit on that one whiteboard at the same time. There's the system prompt — the standing instructions that set how the model behaves. There's the conversation history — every previous turn, re-sent so the model appears to remember. There are the tool definitions that describe what actions are available. And there's whatever you've pasted in — files, error logs, documentation. All of it competes for the same fixed budget. The diagram below shows everything packed into one bounded window, and what happens at the edge when there's no more room.
- Context window: The model's only working memory: a fixed maximum number of tokens it can see at once.
- System prompt: Standing instructions that shape how the model behaves, always present in the window.
- Conversation: The running back-and-forth. It grows with every turn and eats into the budget.
- Dropped / summarized: When the window fills, older content must be trimmed or compressed to make room.
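The shared budget can be sketched as simple arithmetic. The snippet below adds up the parts that occupy the window and reports what is left. Every number, the `WindowPart` class, and the `remaining_budget` function are assumptions made up for illustration; real token counts depend on the model and its tokenizer.

```python
from dataclasses import dataclass


@dataclass
class WindowPart:
    name: str    # e.g. "system prompt", "conversation history"
    tokens: int  # token count for this part (hypothetical numbers below)


def remaining_budget(parts: list[WindowPart], max_tokens: int) -> int:
    """Tokens left after every part is placed in the window (negative if over)."""
    return max_tokens - sum(part.tokens for part in parts)


# Hypothetical sizes, chosen only for illustration.
parts = [
    WindowPart("system prompt", 1_500),
    WindowPart("conversation history", 12_000),
    WindowPart("tool definitions", 3_000),
    WindowPart("pasted files", 20_000),
]
MAX_TOKENS = 40_000  # assumed limit, not any real model's

left = remaining_budget(parts, MAX_TOKENS)
print(f"{left} tokens left of {MAX_TOKENS}")
```

A negative result means the request would not fit as assembled, which is the situation the "Dropped / summarized" entry describes.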
In our stack: when Claude Code works on your project, its context window is constantly being assembled from the instructions that guide it, the conversation so far, the definitions of the tools and any MCP connections it can use, and the slices of your codebase relevant to the task. The Claude model behind it has a generous window, but it is still finite, so Claude Code is deliberate about which files it pulls in rather than dumping the whole repository onto the whiteboard.
When the window fills up
Long sessions inevitably bump against the ceiling. When the total (instructions plus history plus tools plus files) would exceed the limit, something has to give. The usual responses are to drop the oldest or least relevant content, or to summarize earlier parts of the conversation into a shorter recap that preserves the gist while freeing up tokens. Either way, detail is lost. This is why a model can seem to "forget" something you mentioned a long time ago in a marathon session: it didn't forget so much as the note got erased from the whiteboard to make room.
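One of these responses, dropping the oldest content first, can be sketched in a few lines. This is an illustrative simplification and not how any particular tool implements it: `trim_history` is a hypothetical function, token counts are supplied by hand, and the system prompt is assumed to always stay in the window.

```python
def trim_history(
    system_tokens: int,
    turns: list[tuple[str, int]],
    max_tokens: int,
) -> list[tuple[str, int]]:
    """Drop the oldest turns until system prompt plus history fits the limit.

    `turns` is a list of (text, token_count) pairs, oldest first.
    """
    kept = list(turns)  # copy so the caller's list is not modified
    total = system_tokens + sum(tokens for _, tokens in kept)
    while kept and total > max_tokens:
        _, dropped_tokens = kept.pop(0)  # erase the oldest note first
        total -= dropped_tokens
    return kept


# Hypothetical conversation with hand-assigned token counts.
turns = [
    ("first question", 300),
    ("long answer", 300),
    ("follow-up", 300),
]

kept = trim_history(system_tokens=100, turns=turns, max_tokens=800)
print([text for text, _ in kept])
```

With these numbers the total starts at 1,000 tokens against a limit of 800, so the first turn is dropped and the model would no longer see it. Summarizing instead of dropping follows the same pattern, except the removed turns are replaced by a shorter recap rather than discarded outright.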
Why this is worth understanding
Once you picture the window as a finite, shared space, a lot of behavior stops being mysterious and becomes manageable. Putting the most important instructions where they won't get crowded out, keeping irrelevant files out of the budget, and starting fresh when a conversation has grown bloated are all ways of respecting the limit. Doing this deliberately — choosing what deserves a spot on the whiteboard and what doesn't — is its own discipline, which the context engineering lesson explores in depth.