Phones have done this for years: you type "I'll be there in five" and the keyboard offers "minutes." A large language model is, at heart, the same idea scaled up enormously — a system that looks at the text so far and predicts what is likely to come next. That sounds almost too simple to power something that writes essays and code, but the surprise of the last few years is that when you make that predictor big enough and train it on enough text, it gets startlingly good at it.
It predicts the next token
Models don't actually work in whole words; they work in tokens — small chunks of text, often a word or a piece of one. "Predicting" means: given all the tokens it has seen so far, the model estimates which token is most likely to come next, picks one, and adds it to the text. There is no grand plan or sentence laid out in advance. The reason a model can answer a question or write a function is that, across a vast amount of training text, the most likely continuation of a well-posed question is its answer. Good predictions, stacked one after another, look a lot like thinking.
How it works
Everything you send — your question, any instructions, any files — becomes a sequence of tokens that the model reads all at once. It then produces one token, appends it to that sequence, and repeats: read everything, predict the next token, add it, read again. That loop is why output appears to stream out word by word and always reads left to right — each new token is generated with full sight of everything before it but nothing after. The diagram below shows the shape of it: tokens go in, the model predicts, and tokens come out the other side.
- Prompt tokensYour input chopped into tokens — small chunks of text the model reads all at once.
- Language ModelA trained network that, given the context, predicts which token is most likely to come next.
- Predicted tokensThe output, generated one token at a time, each fed back in to predict the following one.
In our stack — the models doing this prediction are Anthropic's Claude models. When Claude Code works on your project, it bundles your request and the relevant code into tokens, sends them to a Claude model, and streams back the predicted tokens — which might be an explanation, a patch, or a decision to call a tool. The model itself stays the same between requests; all the project-specific knowledge rides along in what gets sent.
Trained once, then it just runs
It helps to separate two very different phases. Training is the slow, expensive process where the model's internal settings are tuned by exposure to huge amounts of text — this happens once, ahead of time, in a data center. Inference is what happens when you actually use it: the finished model takes your input and predicts tokens. The key thing to internalize is that inference does not change the model. Asking it something today teaches it nothing for tomorrow; the model that answers your next question is byte-for-byte the same one.
No memory between calls
This is the consequence that trips people up most. Because using the model doesn't change it, a model has no memory of past conversations. Each call starts cold. If it seems to "remember" what you said three messages ago, that's only because all of those earlier messages are being re-sent to it every single time. Anything the model needs to know — the conversation so far, the contents of a file, your preferences — has to be packed into the input on each call. That input has a size limit, which is exactly what the context window lesson is about.