How LLMs Work
Large language models (LLMs) are the technology behind ChatGPT, Claude, Llama, and others, all based on the transformer architecture. Understanding how they work helps you make better decisions when running them, whether you’re choosing between tools, tuning inference parameters, or debugging unexpected behaviour. This page covers the essentials, no prior ML knowledge needed.
What an LLM does
An LLM does one thing:
Given a sequence of tokens, it predicts a probability distribution over the next token.
Everything else comes from repeating this operation in a loop: answering questions, writing code, reasoning. The model generates one token, appends it to the sequence, and runs again.
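The loop described above can be sketched in a few lines. This is a minimal illustration, not how a real model is implemented: `NEXT_TOKEN_PROBS` is a hypothetical hand-written lookup table standing in for the neural network, and `<eos>` is an assumed end-of-sequence marker.

```python
import random

# Hypothetical stand-in for a trained model: for each last token, a
# probability distribution over the next token. A real LLM computes this
# from the whole sequence with a neural network.
NEXT_TOKEN_PROBS = {
    "the": {"cat": 0.7, "dog": 0.3},
    "cat": {"sat": 1.0},
    "dog": {"sat": 1.0},
    "sat": {"<eos>": 1.0},
}


def predict_next(tokens):
    """Return a probability distribution over the next token."""
    return NEXT_TOKEN_PROBS[tokens[-1]]


def generate(prompt, max_new_tokens=10, seed=0):
    """Generate tokens one at a time, feeding each back into the model."""
    rng = random.Random(seed)
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        probs = predict_next(tokens)
        candidates = list(probs.keys())
        weights = list(probs.values())
        next_token = rng.choices(candidates, weights=weights, k=1)[0]
        if next_token == "<eos>":
            break  # the model signalled that the sequence is complete
        tokens.append(next_token)  # append, then run the model again
    return tokens


print(generate(["the"]))
```

The only state carried between steps is the growing token sequence itself, which is why longer outputs take proportionally more model calls.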
A token is a chunk of text (often a word, part of a word, or a punctuation mark) produced by a tokenizer. Vocabulary size varies by model, from tens of thousands to a few hundred thousand tokens; Llama 3, for instance, uses roughly 128,000. A common word is often a single token, while a rarer string is split into pieces: “Kubernetes” might be a single token whereas “kubectl” might be split into several.
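To make the splitting concrete, here is a toy tokenizer. It is a sketch under strong assumptions: `TOY_VOCAB` is a made-up vocabulary, and the greedy longest-match rule is a simplification. Real tokenizers (for example byte-pair encoding) learn their vocabulary and merge rules from data, so actual splits will differ.

```python
# Hypothetical toy vocabulary; real vocabularies are learned from data.
TOY_VOCAB = {"Kubernetes", "kube", "ctl", "k", "u", "b", "e", "c", "t", "l"}


def tokenize(text, vocab=TOY_VOCAB):
    """Split text by repeatedly taking the longest vocabulary entry that matches."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until an entry matches.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # Unknown character: emit it alone so the loop always advances.
            tokens.append(text[i])
            i += 1
    return tokens


print(tokenize("Kubernetes"))  # ['Kubernetes']
print(tokenize("kubectl"))     # ['kube', 'ctl']
```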
Training vs inference
Before a model can predict anything useful, it must be trained. Training exposes the model to vast amounts of text and gradually adjusts its internal parameters, called the weights, so that its predictions improve over time. Training happens long before you download and run a model. By the time you run it, the weights are fixed: inference is just the model applying what it has already learned.
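The split between the two phases can be illustrated with a deliberately tiny example: a “model” with a single weight, trained by gradient descent to fit `y = 2x`. The function names, data, and hyperparameters here are illustrative assumptions; real LLM training adjusts billions of weights with the same basic idea of nudging them to reduce prediction error.

```python
def train(examples, steps=200, learning_rate=0.05):
    """Training: repeatedly adjust the weight to reduce prediction error."""
    weight = 0.0
    for _ in range(steps):
        for x, target in examples:
            error = weight * x - target
            # Gradient of the squared error with respect to the weight.
            gradient = 2 * error * x
            weight -= learning_rate * gradient
    return weight


def infer(weight, x):
    """Inference: use the frozen weight; nothing is updated here."""
    return weight * x


examples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
weight = train(examples)   # done once, ahead of time
print(infer(weight, 5.0))  # close to 10.0
```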
The transformer layer
Modern LLMs are built on the transformer architecture, first introduced in the 2017 paper Attention Is All You Need. A transformer model is a stack of structurally identical layers, each with its own weights: for example, 32 layers in Llama 3 8B and 80 in Llama 3 70B. Each layer has two sub-blocks that do different things:
graph LR
A[Input] --> B["Attention\n(context)"]
B --> C["MLP\n(knowledge)"]
C --> D[Output]
Attention lets each token look at the other tokens in the sequence and decide which ones are relevant to it. When you write “the cat sat on the mat because it was tired”, attention is what allows the model to link “it” to “the cat” rather than “the mat”.
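As a rough sketch of the mechanism, the function below implements scaled dot-product attention for a single head in plain Python. It makes several simplifying assumptions: a real model derives queries, keys, and values from learned projections of each token’s vector (here the same toy vectors are reused for all three), runs many heads in parallel, and uses optimised tensor libraries. The causal mask, where a token only looks at itself and earlier tokens, reflects how decoder-style LLMs generate text.

```python
import math


def softmax(scores):
    """Convert raw scores into weights that are positive and sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Single-head scaled dot-product attention with a causal mask."""
    dim = len(keys[0])
    outputs, all_weights = [], []
    for i, query in enumerate(queries):
        # Causal mask: token i only considers positions 0..i.
        visible_keys = keys[: i + 1]
        visible_values = values[: i + 1]
        # Relevance score of each visible token to the current one.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in visible_keys
        ]
        weights = softmax(scores)
        # Output is the relevance-weighted average of the value vectors.
        output = [
            sum(w * value[c] for w, value in zip(weights, visible_values))
            for c in range(len(values[0]))
        ]
        outputs.append(output)
        all_weights.append(weights)
    return outputs, all_weights


# Toy vectors standing in for three tokens (hypothetical values).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(tokens, tokens, tokens)
print(weights[2])  # how much the third token attends to each token so far
```

The attention weights for each token sum to 1, so each output is a blend of the tokens it found relevant.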
That contextually enriched representation is then passed to the feed-forward network (often called the MLP, for multi-layer perceptron). Think of it as the model’s long-term memory: it’s where learned associations are stored, baked in during training and frozen at inference time. For example, when the model reads “the Eiffel Tower”, attention identifies that we’re talking about a specific item, and the MLP supplies what the model learned about it: that it’s in Paris, 330 metres tall, and more. The MLP blocks hold the majority of the model’s parameters, roughly 60% or more depending on the architecture.
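The parameter share can be checked with back-of-the-envelope arithmetic. The sketch below assumes a classic transformer layout (four `d_model × d_model` attention projections, an MLP with a 4× hidden expansion, one shared embedding matrix) and ignores small terms such as biases and normalisation weights. The dimensions are illustrative, not those of a specific model; architectures with gated MLPs or grouped-query attention shift the balance further toward the MLP.

```python
def parameter_breakdown(d_model, n_layers, vocab_size, ffn_multiplier=4):
    """Approximate parameter counts for a classic transformer layout."""
    # Query, key, value, and output projections: four d_model x d_model matrices.
    attention_per_layer = 4 * d_model * d_model
    # Up and down projections between d_model and the wider hidden size.
    mlp_per_layer = 2 * d_model * (ffn_multiplier * d_model)
    # Assumes input and output embeddings share one matrix.
    embedding = vocab_size * d_model
    total = n_layers * (attention_per_layer + mlp_per_layer) + embedding
    return {
        "attention": n_layers * attention_per_layer,
        "mlp": n_layers * mlp_per_layer,
        "embedding": embedding,
        "total": total,
    }


sizes = parameter_breakdown(d_model=4096, n_layers=32, vocab_size=128_000)
mlp_share = sizes["mlp"] / sizes["total"]
print(f"MLP share of parameters: {mlp_share:.0%}")  # 62% under these assumptions
```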
In summary: attention figures out what’s relevant, whereas the MLP knows what things mean.
A layer doesn’t replace its input; it adds to it. A single vector per token (the residual stream) flows through all layers, accumulating updates, until it’s ready to be turned into a prediction. You can think of it as the model’s working memory for each token.
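In code, “adds to it” is literally an addition. The sketch below uses hypothetical placeholder functions for the two sub-blocks (`attention_update` and `mlp_update` do nothing meaningful here) and omits the normalisation steps real layers include; the point is only the shape of the data flow.

```python
def attention_update(x):
    """Placeholder for the attention sub-block (hypothetical toy behaviour)."""
    return [0.1 * value for value in x]


def mlp_update(x):
    """Placeholder for the MLP sub-block (hypothetical toy behaviour)."""
    return [0.5 for _ in x]


def add(a, b):
    """Element-wise vector addition."""
    return [p + q for p, q in zip(a, b)]


def transformer_layer(x):
    # Each sub-block computes an update that is added to the stream,
    # rather than overwriting it.
    x = add(x, attention_update(x))
    x = add(x, mlp_update(x))
    return x


def forward(x, n_layers=4):
    """Pass one token's residual-stream vector through a stack of layers."""
    for _ in range(n_layers):
        x = transformer_layer(x)
    return x


print(forward([0.0, 0.0], n_layers=2))
```

Because every layer only adds to the vector, information written by an early layer remains available to all later ones.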
From output to token: sampling
After all layers have processed the input, the model produces logits: one raw score per token in the vocabulary, where a higher score means the token is more likely to come next. These scores go through a sampling pipeline:
Logits [one score per vocabulary token]
↓ ÷ Temperature
↓ Top-K filter (keep only the top K candidates)
↓ Top-P filter (keep the smallest set covering P% of probability)
↓ Softmax (convert scores to probabilities)
↓ Sample (draw one token)
Note: the exact order can vary slightly between implementations, but the idea is the same.
The sampling parameters directly affect the model’s behaviour:
| Parameter | Effect |
|---|---|
| Temperature = 0 | Always pick the top token (greedy decoding); deterministic in principle |
| Temperature < 1 | More confident and focused |
| Temperature > 1 | More creative and unpredictable |
| Top-K | Hard cap: only consider the K most likely tokens |
| Top-P | Soft cap: keep tokens until cumulative probability reaches P |
This is why setting temperature=0 in an API call gives you largely reproducible outputs (hardware and batching effects can still cause small differences), and why raising it makes the model more creative (and less reliable).
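The pipeline can be sketched in plain Python as follows. This is one possible ordering, not a reference implementation: it applies softmax before the top-P filter, because top-P needs probabilities to accumulate, and it treats temperature 0 as a special case since dividing by zero is undefined. The function name and signature are illustrative assumptions.

```python
import math
import random


def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    """Return the index of the sampled token."""
    if temperature == 0:
        # Greedy decoding: always the highest-scoring token.
        return max(range(len(logits)), key=lambda i: logits[i])

    # Temperature scaling: < 1 sharpens the distribution, > 1 flattens it.
    scaled = [score / temperature for score in logits]

    # Rank candidates from most to least likely.
    order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)

    # Top-K: hard cap on the number of candidates (at least one is kept).
    if top_k is not None:
        order = order[: max(1, top_k)]

    # Softmax over the remaining candidates.
    highest = scaled[order[0]]
    exps = [math.exp(scaled[i] - highest) for i in order]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-P: keep the smallest prefix whose cumulative probability reaches P.
    if top_p is not None:
        cumulative = 0.0
        keep = 0
        for p in probs:
            cumulative += p
            keep += 1
            if cumulative >= top_p:
                break
        order, probs = order[:keep], probs[:keep]

    # Draw one token; choices() renormalises the remaining weights.
    return rng.choices(order, weights=probs, k=1)[0]


logits = [1.0, 3.0, 2.0]
print(sample_next_token(logits, temperature=0))  # 1
print(sample_next_token(logits, temperature=0.8, top_k=2, top_p=0.9))
```

With `top_k=1` or a very small `top_p`, only the top candidate survives the filters, so the result matches greedy decoding.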
Why this matters in practice
Knowing that a model is weights (frozen knowledge) plus attention (dynamic context) makes several common techniques less mysterious:
| Topic | How it connects |
|---|---|
| Quantization | Reducing the numerical precision of the model’s weights (e.g. FP16 → INT8 → 4-bit) to save memory, trading a bit of quality for a lot of VRAM |
| LoRA / fine-tuning | Adding small trainable low-rank matrices alongside the frozen model weights to adapt the model to a specific task |
| RAG | Injecting external knowledge into the prompt so the model can reason over it, since the model can only use what’s in its weights and the current context |
| MoE (Mixture of Experts) | Replacing one large MLP with N smaller ones (experts) and activating only a few per token, often 1-2. This aims for comparable quality with less compute per token, although all experts still have to fit in memory |
| Context window limits | The model can only “see” a fixed number of tokens at once; anything beyond that limit has to be truncated or dropped, so older context is lost |
| vLLM / PagedAttention | Managing inference memory efficiently so more requests can be served in parallel |
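As a worked example of the quantization row, the memory needed just to hold the weights is roughly the parameter count multiplied by the bytes per parameter. The sketch below is an estimate under simplifying assumptions: it counts weights only (no KV cache, activations, or runtime overhead) and ignores the small extra storage quantization formats use for scaling factors.

```python
# Approximate storage per parameter at each precision.
BYTES_PER_PARAM = {"FP16": 2.0, "INT8": 1.0, "4-bit": 0.5}


def weight_memory_gb(n_params, precision):
    """Estimate memory for the weights alone, in gigabytes (10^9 bytes)."""
    return n_params * BYTES_PER_PARAM[precision] / 1e9


for precision in BYTES_PER_PARAM:
    size = weight_memory_gb(8e9, precision)
    print(f"8B model at {precision}: ~{size:.0f} GB")
# 8B model at FP16: ~16 GB
# 8B model at INT8: ~8 GB
# 8B model at 4-bit: ~4 GB
```

This is why a model that does not fit on a given GPU at FP16 may fit comfortably once quantized to 4-bit.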