LLM Internals
This page assumes you’ve read How LLMs Work and want to go deeper: into the actual tensors, the memory mechanics, and the engineering decisions that follow from them.
Tensors: the only data structure
Everything in a neural network is made of tensors: multi-dimensional arrays of floating-point numbers. A vector is a 1D tensor. A matrix is a 2D tensor.
At inference time there are two kinds:
| Kind | Examples | Lifetime |
|---|---|---|
| Weights (trained) | W_Q, W_K, W_V, W_E… | Permanent: loaded once, never change |
| Activations (computed) | Q, K, V | Ephemeral: created per token, discarded |
The model file on disk is a collection of weight tensors. The “8B” in “Llama 3 8B” means roughly 8 billion parameters: the total count of individual float values across all of those tensors.
Token embeddings: W_E
A forward pass is the journey a token takes through the entire neural network: from input to predicted next token. The very first operation of that pass is a lookup in the embedding table W_E: for each token, its ID is used to retrieve a vector from this table. That vector is what flows through all subsequent layers. W_E has one row per token in the vocabulary (128,000 for Llama 3) and one value per embedding dimension (4,096). This dimension size is called d_model throughout the rest of this article. Tokens with similar meanings end up with similar vectors. This is the model’s initial representation of token meaning, which is then refined and transformed across all layers.
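As a minimal sketch of that lookup (with toy dimensions rather than Llama's real ones), the embedding step is plain row indexing:

```python
import torch

# Toy dimensions for illustration only
vocab_size, d_model = 1000, 64
W_E = torch.randn(vocab_size, d_model)   # the trained embedding table

token_ids = torch.tensor([42, 7, 999])   # token IDs from the tokenizer
embeddings = W_E[token_ids]              # one row per token
print(embeddings.shape)                  # torch.Size([3, 64])
```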
Attention: Q, K, V in detail
The embedding vectors produced by W_E are then passed into the attention block. Each token is projected into three vectors using three learned weight matrices, W_Q, W_K, and W_V:
| Vector | Matrix | Role |
|---|---|---|
| Q (Query) | W_Q | “What am I looking for?” |
| K (Key) | W_K | “What do I offer?” |
| V (Value) | W_V | “What information do I carry?” |
W_Q, W_K, and W_V do not directly store factual knowledge. Instead, they learn how to route information between tokens, encoding structural patterns such as syntax, word relationships, and attention behavior.
Once Q, K, and V have been computed, attention compares the Q vector of the current token against all K vectors using a dot product, scaled by the square root of the head dimension to keep scores in a stable range. A softmax then normalizes those scores so they sum to 1, and the result is used to produce a weighted sum of all V vectors, collecting information from the most relevant tokens. A fourth matrix, W_O, then projects that result back to d_model so it can flow into the next layer.
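Here is a minimal single-head sketch of that computation, with a causal mask added so tokens cannot attend to positions after their own (dimensions are arbitrary):

```python
import torch
import torch.nn.functional as F

d_model, seq_len = 64, 5
x = torch.randn(seq_len, d_model)                    # one vector per token
W_Q, W_K, W_V, W_O = (torch.randn(d_model, d_model) for _ in range(4))

Q, K, V = x @ W_Q, x @ W_K, x @ W_V                  # project every token
scores = Q @ K.T / d_model**0.5                      # scaled dot products
mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()
scores = scores.masked_fill(mask, float("-inf"))     # causal: no looking ahead
weights = F.softmax(scores, dim=-1)                  # each row sums to 1
out = (weights @ V) @ W_O                            # weighted sum of V, then W_O
```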
Multi-head attention. In practice, each attention block runs multiple attention operations in parallel, each called a head. Every head has its own slice of the W_Q, W_K, W_V projections and learns to focus on a different type of relationship: one head might track subject-verb dependencies, another might handle pronoun references, another word order. The outputs of all heads are concatenated and projected through W_O. Because d_model is divided equally across heads, each head works in a smaller subspace: SmolLM2-135M splits d_model = 576 across 9 query heads of 576/9 = 64 dimensions each. It also uses grouped-query attention, in which those 9 query heads share just 3 key/value heads, so the K and V projections produce only 3 × 64 = 192 dimensions. That is why k_proj and v_proj show [192, 576] in the tensor output rather than [576, 576].
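A quick arithmetic check of those numbers (the head counts come from SmolLM2-135M's published configuration; the shapes they predict match the tensor output later in this article):

```python
d_model = 576
n_q_heads, n_kv_heads = 9, 3       # SmolLM2-135M configuration values
head_dim = d_model // n_q_heads    # 64 dimensions per head
print(n_q_heads * head_dim)        # 576 → q_proj is [576, 576]
print(n_kv_heads * head_dim)       # 192 → k_proj and v_proj are [192, 576]
```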
MLP block
The output of the attention block is passed to the MLP. Where attention figures out which tokens are relevant to each other, the MLP applies learned transformations to the resulting representation, capturing associations, patterns, and higher-level features learned during training. It expands the vector to a larger internal dimension, filters and reshapes the values, then compresses back down.
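For Llama-style models this is a gated MLP (often called SwiGLU): the gate path decides how much of each expanded value passes through. A minimal sketch using SmolLM2-135M's sizes:

```python
import torch
import torch.nn.functional as F

d_model, d_ffn = 576, 1536
x = torch.randn(d_model)
W_up, W_gate = torch.randn(d_ffn, d_model), torch.randn(d_ffn, d_model)
W_down = torch.randn(d_model, d_ffn)

h = F.silu(W_gate @ x) * (W_up @ x)   # expand to d_ffn; the gate filters values
y = W_down @ h                        # compress back to d_model
```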
How the pieces fit together
Before listing every tensor, it helps to see where each piece runs in the overall process:
- W_E (token embedding) runs once per token, at the very start of the forward pass, before any layer is involved
- Attention + MLP run once per layer, repeated N times for every token
Every token, whether part of the input prompt or newly generated, goes through its own forward pass: W_E lookup first, then through all N layers.
Each token's forward pass
│
├── W_E lookup ← once, before any layer
│ └── token ID → embedding vector
│
├── Layer 0
│ ├── Attention (Q, K, V, W_O)
│ └── MLP (W_up, W_gate, W_down)
│
├── Layer 1
│ ├── Attention
│ └── MLP
│
│ ... repeated for all N layers ...
│
├── Layer N-1
│ ├── Attention
│ └── MLP
│
├── final_norm
└── W_unembed → one score per vocabulary word

W_E and W_unembed are global: they sit outside the layer stack. Everything in between (attention + MLP) repeats once per layer.
Full tensor inventory per layer
One more dimension appears in the tensor tables: d_ffn, the larger intermediate size the MLP expands into before compressing back down to d_model. This wider space gives the model more room to detect and combine patterns.
The tensor names below are conceptual labels. In practice, naming varies by model: Llama uses mlp.up_proj / mlp.gate_proj / mlp.down_proj, GPT-2 uses mlp.c_fc and mlp.c_proj. The structure is the same; only the names differ.
Each layer contains the same set of tensors:
Attention block
| Tensor | Role |
|---|---|
| W_Q | Query projection |
| W_K | Key projection |
| W_V | Value projection |
| W_O | Projects attention output back to d_model |
| norm1.weight | Layer norm before attention |
MLP block
| Tensor | Role |
|---|---|
W_up | Expands from d_model → d_ffn |
W_gate | Controls information flow (gating) |
W_down | Compresses back from d_ffn → d_model |
| norm2.weight | Layer norm before MLP |
Global tensors (outside layers)
| Tensor | Role |
|---|---|
| W_E | Token embedding table |
| RoPE | Encodes each token’s position in the sequence; without it, attention would have no sense of word order |
| final_norm.weight | Norm after the last layer |
| W_unembed (lm_head) | Final projection from d_model to vocab size → logits |
A note on layer norms. The norm1 and norm2 tensors perform layer normalization: they rescale the token vectors before each sub-block to keep values in a stable numerical range. Without them, numbers could grow or shrink uncontrollably as they pass through many layers.
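In Llama-style models (including SmolLM2) the specific variant is RMSNorm, which rescales by the vector's root mean square and multiplies by a learned per-dimension weight; that weight is exactly the [576] norm tensor seen in the inventory below. A minimal sketch:

```python
import torch

def rms_norm(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # Rescale the vector to unit RMS, then apply the learned per-dimension scale
    rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)
    return (x / rms) * weight
```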
For reference, approximate parameter counts in Llama 3 8B:
| Component | Parameters |
|---|---|
| Token embeddings | ~500M |
| Attention (×32 layers) | ~1.3B |
| MLP (×32 layers) | ~5.5B |
| Unembedding (lm_head) | ~500M |
| Norms and other | negligible |
| Total | ~8B |
The full forward pass
Before reading the diagram, one key concept: the residual stream. Rather than each layer replacing the token vector, every sub-block only adds its output to the existing vector. This makes it difficult for later layers to completely erase earlier information, and encourages the model to build progressively on top of existing representations.
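In code, the pattern for one layer looks like this (a sketch: attn and mlp stand for the blocks described above, and nn.LayerNorm stands in for the model's actual norm):

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, d_model: int, attn: nn.Module = None, mlp: nn.Module = None):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.attn = attn or nn.Identity()   # placeholder; real attention goes here
        self.mlp = mlp or nn.Identity()     # placeholder; real MLP goes here

    def forward(self, x):
        x = x + self.attn(self.norm1(x))    # attention output is added, never assigned
        x = x + self.mlp(self.norm2(x))     # same for the MLP
        return x
```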
Input tokens (one per word or subword)
↓ W_E lookup
Embeddings (one vector per token)
↓ + positional encoding (RoPE, applied to Q and K)
↓
━━━[ Layer 0 ]━━━━━━━━━━━━━━━━━━━
↓ norm1 ← stabilize values before attention
↓ × W_Q, W_K, W_V → Q, K, V
↓ compare Q against all K vectors ← compute attention scores
↓ convert scores to weights ← each weight says how much to attend to each token
↓ weighted sum of V ← collect information from relevant tokens
↓ × W_O ← project back to d_model
↓ residual add ← add attention output to the token vector
↓ norm2 ← stabilize before MLP
↓ × W_up, × W_gate → expand to d_ffn
↓ activation function ← filters and shapes the expanded values
↓ × W_down → compress to d_model
↓ residual add ← add MLP output to the token vector
━━━[ Layers 1..N ]━━━━━━━━━━━━━━━
↓ (same structure repeated)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
↓ final_norm
↓ × W_unembed
Output: one score per word in the vocabulary

The final output is a list of logits: one raw score per word in the vocabulary. A high score means the model considers that word a likely next token. These scores are then converted to probabilities, from which the next token is selected via sampling or other decoding strategies.
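A sketch of that final step, with a made-up four-word vocabulary:

```python
import torch

logits = torch.tensor([2.0, 0.5, -1.0, 3.0])           # one raw score per token
probs = torch.softmax(logits / 0.8, dim=-1)            # temperature 0.8; sums to 1
next_token = torch.multinomial(probs, num_samples=1)   # sample the next token ID
# greedy alternative: next_token = logits.argmax()
```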
Inspecting tensors in practice
Everything above can be verified on a real model. The example below uses SmolLM2-135M, a small openly available model that follows the same Llama-style architecture. No HuggingFace account needed.
```
pip install torch safetensors huggingface_hub

# Download the model
hf download HuggingFaceTB/SmolLM2-135M --local-dir ./smollm
```

```python
from safetensors import safe_open

# Read tensor names and shapes from the safetensors file header
with safe_open("./smollm/model.safetensors", framework="pt") as f:
    for key in f.keys():
        shape = f.get_slice(key).get_shape()
        print(f"{key:55s} {shape}")
```

This reads shapes directly from the file header without loading the weights into memory. The output shows every tensor with its real name and shape:
See full output
$ python test.py
model.embed_tokens.weight [49152, 576]
model.layers.0.input_layernorm.weight [576]
model.layers.0.mlp.down_proj.weight [576, 1536]
model.layers.0.mlp.gate_proj.weight [1536, 576]
model.layers.0.mlp.up_proj.weight [1536, 576]
model.layers.0.post_attention_layernorm.weight [576]
model.layers.0.self_attn.k_proj.weight [192, 576]
model.layers.0.self_attn.o_proj.weight [576, 576]
model.layers.0.self_attn.q_proj.weight [576, 576]
model.layers.0.self_attn.v_proj.weight [192, 576]
model.layers.1.input_layernorm.weight [576]
model.layers.1.mlp.down_proj.weight [576, 1536]
model.layers.1.mlp.gate_proj.weight [1536, 576]
model.layers.1.mlp.up_proj.weight [1536, 576]
model.layers.1.post_attention_layernorm.weight [576]
model.layers.1.self_attn.k_proj.weight [192, 576]
model.layers.1.self_attn.o_proj.weight [576, 576]
model.layers.1.self_attn.q_proj.weight [576, 576]
model.layers.1.self_attn.v_proj.weight [192, 576]
model.layers.10.input_layernorm.weight [576]
model.layers.10.mlp.down_proj.weight [576, 1536]
model.layers.10.mlp.gate_proj.weight [1536, 576]
model.layers.10.mlp.up_proj.weight [1536, 576]
model.layers.10.post_attention_layernorm.weight [576]
model.layers.10.self_attn.k_proj.weight [192, 576]
model.layers.10.self_attn.o_proj.weight [576, 576]
model.layers.10.self_attn.q_proj.weight [576, 576]
model.layers.10.self_attn.v_proj.weight [192, 576]
model.layers.11.input_layernorm.weight [576]
model.layers.11.mlp.down_proj.weight [576, 1536]
model.layers.11.mlp.gate_proj.weight [1536, 576]
model.layers.11.mlp.up_proj.weight [1536, 576]
model.layers.11.post_attention_layernorm.weight [576]
model.layers.11.self_attn.k_proj.weight [192, 576]
model.layers.11.self_attn.o_proj.weight [576, 576]
model.layers.11.self_attn.q_proj.weight [576, 576]
model.layers.11.self_attn.v_proj.weight [192, 576]
model.layers.12.input_layernorm.weight [576]
model.layers.12.mlp.down_proj.weight [576, 1536]
model.layers.12.mlp.gate_proj.weight [1536, 576]
model.layers.12.mlp.up_proj.weight [1536, 576]
model.layers.12.post_attention_layernorm.weight [576]
model.layers.12.self_attn.k_proj.weight [192, 576]
model.layers.12.self_attn.o_proj.weight [576, 576]
model.layers.12.self_attn.q_proj.weight [576, 576]
model.layers.12.self_attn.v_proj.weight [192, 576]
model.layers.13.input_layernorm.weight [576]
model.layers.13.mlp.down_proj.weight [576, 1536]
model.layers.13.mlp.gate_proj.weight [1536, 576]
model.layers.13.mlp.up_proj.weight [1536, 576]
model.layers.13.post_attention_layernorm.weight [576]
model.layers.13.self_attn.k_proj.weight [192, 576]
model.layers.13.self_attn.o_proj.weight [576, 576]
model.layers.13.self_attn.q_proj.weight [576, 576]
model.layers.13.self_attn.v_proj.weight [192, 576]
model.layers.14.input_layernorm.weight [576]
model.layers.14.mlp.down_proj.weight [576, 1536]
model.layers.14.mlp.gate_proj.weight [1536, 576]
model.layers.14.mlp.up_proj.weight [1536, 576]
model.layers.14.post_attention_layernorm.weight [576]
model.layers.14.self_attn.k_proj.weight [192, 576]
model.layers.14.self_attn.o_proj.weight [576, 576]
model.layers.14.self_attn.q_proj.weight [576, 576]
model.layers.14.self_attn.v_proj.weight [192, 576]
model.layers.15.input_layernorm.weight [576]
model.layers.15.mlp.down_proj.weight [576, 1536]
model.layers.15.mlp.gate_proj.weight [1536, 576]
model.layers.15.mlp.up_proj.weight [1536, 576]
model.layers.15.post_attention_layernorm.weight [576]
model.layers.15.self_attn.k_proj.weight [192, 576]
model.layers.15.self_attn.o_proj.weight [576, 576]
model.layers.15.self_attn.q_proj.weight [576, 576]
model.layers.15.self_attn.v_proj.weight [192, 576]
model.layers.16.input_layernorm.weight [576]
model.layers.16.mlp.down_proj.weight [576, 1536]
model.layers.16.mlp.gate_proj.weight [1536, 576]
model.layers.16.mlp.up_proj.weight [1536, 576]
model.layers.16.post_attention_layernorm.weight [576]
model.layers.16.self_attn.k_proj.weight [192, 576]
model.layers.16.self_attn.o_proj.weight [576, 576]
model.layers.16.self_attn.q_proj.weight [576, 576]
model.layers.16.self_attn.v_proj.weight [192, 576]
model.layers.17.input_layernorm.weight [576]
model.layers.17.mlp.down_proj.weight [576, 1536]
model.layers.17.mlp.gate_proj.weight [1536, 576]
model.layers.17.mlp.up_proj.weight [1536, 576]
model.layers.17.post_attention_layernorm.weight [576]
model.layers.17.self_attn.k_proj.weight [192, 576]
model.layers.17.self_attn.o_proj.weight [576, 576]
model.layers.17.self_attn.q_proj.weight [576, 576]
model.layers.17.self_attn.v_proj.weight [192, 576]
model.layers.18.input_layernorm.weight [576]
model.layers.18.mlp.down_proj.weight [576, 1536]
model.layers.18.mlp.gate_proj.weight [1536, 576]
model.layers.18.mlp.up_proj.weight [1536, 576]
model.layers.18.post_attention_layernorm.weight [576]
model.layers.18.self_attn.k_proj.weight [192, 576]
model.layers.18.self_attn.o_proj.weight [576, 576]
model.layers.18.self_attn.q_proj.weight [576, 576]
model.layers.18.self_attn.v_proj.weight [192, 576]
model.layers.19.input_layernorm.weight [576]
model.layers.19.mlp.down_proj.weight [576, 1536]
model.layers.19.mlp.gate_proj.weight [1536, 576]
model.layers.19.mlp.up_proj.weight [1536, 576]
model.layers.19.post_attention_layernorm.weight [576]
model.layers.19.self_attn.k_proj.weight [192, 576]
model.layers.19.self_attn.o_proj.weight [576, 576]
model.layers.19.self_attn.q_proj.weight [576, 576]
model.layers.19.self_attn.v_proj.weight [192, 576]
model.layers.2.input_layernorm.weight [576]
model.layers.2.mlp.down_proj.weight [576, 1536]
model.layers.2.mlp.gate_proj.weight [1536, 576]
model.layers.2.mlp.up_proj.weight [1536, 576]
model.layers.2.post_attention_layernorm.weight [576]
model.layers.2.self_attn.k_proj.weight [192, 576]
model.layers.2.self_attn.o_proj.weight [576, 576]
model.layers.2.self_attn.q_proj.weight [576, 576]
model.layers.2.self_attn.v_proj.weight [192, 576]
model.layers.20.input_layernorm.weight [576]
model.layers.20.mlp.down_proj.weight [576, 1536]
model.layers.20.mlp.gate_proj.weight [1536, 576]
model.layers.20.mlp.up_proj.weight [1536, 576]
model.layers.20.post_attention_layernorm.weight [576]
model.layers.20.self_attn.k_proj.weight [192, 576]
model.layers.20.self_attn.o_proj.weight [576, 576]
model.layers.20.self_attn.q_proj.weight [576, 576]
model.layers.20.self_attn.v_proj.weight [192, 576]
model.layers.21.input_layernorm.weight [576]
model.layers.21.mlp.down_proj.weight [576, 1536]
model.layers.21.mlp.gate_proj.weight [1536, 576]
model.layers.21.mlp.up_proj.weight [1536, 576]
model.layers.21.post_attention_layernorm.weight [576]
model.layers.21.self_attn.k_proj.weight [192, 576]
model.layers.21.self_attn.o_proj.weight [576, 576]
model.layers.21.self_attn.q_proj.weight [576, 576]
model.layers.21.self_attn.v_proj.weight [192, 576]
model.layers.22.input_layernorm.weight [576]
model.layers.22.mlp.down_proj.weight [576, 1536]
model.layers.22.mlp.gate_proj.weight [1536, 576]
model.layers.22.mlp.up_proj.weight [1536, 576]
model.layers.22.post_attention_layernorm.weight [576]
model.layers.22.self_attn.k_proj.weight [192, 576]
model.layers.22.self_attn.o_proj.weight [576, 576]
model.layers.22.self_attn.q_proj.weight [576, 576]
model.layers.22.self_attn.v_proj.weight [192, 576]
model.layers.23.input_layernorm.weight [576]
model.layers.23.mlp.down_proj.weight [576, 1536]
model.layers.23.mlp.gate_proj.weight [1536, 576]
model.layers.23.mlp.up_proj.weight [1536, 576]
model.layers.23.post_attention_layernorm.weight [576]
model.layers.23.self_attn.k_proj.weight [192, 576]
model.layers.23.self_attn.o_proj.weight [576, 576]
model.layers.23.self_attn.q_proj.weight [576, 576]
model.layers.23.self_attn.v_proj.weight [192, 576]
model.layers.24.input_layernorm.weight [576]
model.layers.24.mlp.down_proj.weight [576, 1536]
model.layers.24.mlp.gate_proj.weight [1536, 576]
model.layers.24.mlp.up_proj.weight [1536, 576]
model.layers.24.post_attention_layernorm.weight [576]
model.layers.24.self_attn.k_proj.weight [192, 576]
model.layers.24.self_attn.o_proj.weight [576, 576]
model.layers.24.self_attn.q_proj.weight [576, 576]
model.layers.24.self_attn.v_proj.weight [192, 576]
model.layers.25.input_layernorm.weight [576]
model.layers.25.mlp.down_proj.weight [576, 1536]
model.layers.25.mlp.gate_proj.weight [1536, 576]
model.layers.25.mlp.up_proj.weight [1536, 576]
model.layers.25.post_attention_layernorm.weight [576]
model.layers.25.self_attn.k_proj.weight [192, 576]
model.layers.25.self_attn.o_proj.weight [576, 576]
model.layers.25.self_attn.q_proj.weight [576, 576]
model.layers.25.self_attn.v_proj.weight [192, 576]
model.layers.26.input_layernorm.weight [576]
model.layers.26.mlp.down_proj.weight [576, 1536]
model.layers.26.mlp.gate_proj.weight [1536, 576]
model.layers.26.mlp.up_proj.weight [1536, 576]
model.layers.26.post_attention_layernorm.weight [576]
model.layers.26.self_attn.k_proj.weight [192, 576]
model.layers.26.self_attn.o_proj.weight [576, 576]
model.layers.26.self_attn.q_proj.weight [576, 576]
model.layers.26.self_attn.v_proj.weight [192, 576]
model.layers.27.input_layernorm.weight [576]
model.layers.27.mlp.down_proj.weight [576, 1536]
model.layers.27.mlp.gate_proj.weight [1536, 576]
model.layers.27.mlp.up_proj.weight [1536, 576]
model.layers.27.post_attention_layernorm.weight [576]
model.layers.27.self_attn.k_proj.weight [192, 576]
model.layers.27.self_attn.o_proj.weight [576, 576]
model.layers.27.self_attn.q_proj.weight [576, 576]
model.layers.27.self_attn.v_proj.weight [192, 576]
model.layers.28.input_layernorm.weight [576]
model.layers.28.mlp.down_proj.weight [576, 1536]
model.layers.28.mlp.gate_proj.weight [1536, 576]
model.layers.28.mlp.up_proj.weight [1536, 576]
model.layers.28.post_attention_layernorm.weight [576]
model.layers.28.self_attn.k_proj.weight [192, 576]
model.layers.28.self_attn.o_proj.weight [576, 576]
model.layers.28.self_attn.q_proj.weight [576, 576]
model.layers.28.self_attn.v_proj.weight [192, 576]
model.layers.29.input_layernorm.weight [576]
model.layers.29.mlp.down_proj.weight [576, 1536]
model.layers.29.mlp.gate_proj.weight [1536, 576]
model.layers.29.mlp.up_proj.weight [1536, 576]
model.layers.29.post_attention_layernorm.weight [576]
model.layers.29.self_attn.k_proj.weight [192, 576]
model.layers.29.self_attn.o_proj.weight [576, 576]
model.layers.29.self_attn.q_proj.weight [576, 576]
model.layers.29.self_attn.v_proj.weight [192, 576]
model.layers.3.input_layernorm.weight [576]
model.layers.3.mlp.down_proj.weight [576, 1536]
model.layers.3.mlp.gate_proj.weight [1536, 576]
model.layers.3.mlp.up_proj.weight [1536, 576]
model.layers.3.post_attention_layernorm.weight [576]
model.layers.3.self_attn.k_proj.weight [192, 576]
model.layers.3.self_attn.o_proj.weight [576, 576]
model.layers.3.self_attn.q_proj.weight [576, 576]
model.layers.3.self_attn.v_proj.weight [192, 576]
model.layers.4.input_layernorm.weight [576]
model.layers.4.mlp.down_proj.weight [576, 1536]
model.layers.4.mlp.gate_proj.weight [1536, 576]
model.layers.4.mlp.up_proj.weight [1536, 576]
model.layers.4.post_attention_layernorm.weight [576]
model.layers.4.self_attn.k_proj.weight [192, 576]
model.layers.4.self_attn.o_proj.weight [576, 576]
model.layers.4.self_attn.q_proj.weight [576, 576]
model.layers.4.self_attn.v_proj.weight [192, 576]
model.layers.5.input_layernorm.weight [576]
model.layers.5.mlp.down_proj.weight [576, 1536]
model.layers.5.mlp.gate_proj.weight [1536, 576]
model.layers.5.mlp.up_proj.weight [1536, 576]
model.layers.5.post_attention_layernorm.weight [576]
model.layers.5.self_attn.k_proj.weight [192, 576]
model.layers.5.self_attn.o_proj.weight [576, 576]
model.layers.5.self_attn.q_proj.weight [576, 576]
model.layers.5.self_attn.v_proj.weight [192, 576]
model.layers.6.input_layernorm.weight [576]
model.layers.6.mlp.down_proj.weight [576, 1536]
model.layers.6.mlp.gate_proj.weight [1536, 576]
model.layers.6.mlp.up_proj.weight [1536, 576]
model.layers.6.post_attention_layernorm.weight [576]
model.layers.6.self_attn.k_proj.weight [192, 576]
model.layers.6.self_attn.o_proj.weight [576, 576]
model.layers.6.self_attn.q_proj.weight [576, 576]
model.layers.6.self_attn.v_proj.weight [192, 576]
model.layers.7.input_layernorm.weight [576]
model.layers.7.mlp.down_proj.weight [576, 1536]
model.layers.7.mlp.gate_proj.weight [1536, 576]
model.layers.7.mlp.up_proj.weight [1536, 576]
model.layers.7.post_attention_layernorm.weight [576]
model.layers.7.self_attn.k_proj.weight [192, 576]
model.layers.7.self_attn.o_proj.weight [576, 576]
model.layers.7.self_attn.q_proj.weight [576, 576]
model.layers.7.self_attn.v_proj.weight [192, 576]
model.layers.8.input_layernorm.weight [576]
model.layers.8.mlp.down_proj.weight [576, 1536]
model.layers.8.mlp.gate_proj.weight [1536, 576]
model.layers.8.mlp.up_proj.weight [1536, 576]
model.layers.8.post_attention_layernorm.weight [576]
model.layers.8.self_attn.k_proj.weight [192, 576]
model.layers.8.self_attn.o_proj.weight [576, 576]
model.layers.8.self_attn.q_proj.weight [576, 576]
model.layers.8.self_attn.v_proj.weight [192, 576]
model.layers.9.input_layernorm.weight [576]
model.layers.9.mlp.down_proj.weight [576, 1536]
model.layers.9.mlp.gate_proj.weight [1536, 576]
model.layers.9.mlp.up_proj.weight [1536, 576]
model.layers.9.post_attention_layernorm.weight [576]
model.layers.9.self_attn.k_proj.weight [192, 576]
model.layers.9.self_attn.o_proj.weight [576, 576]
model.layers.9.self_attn.q_proj.weight [576, 576]
model.layers.9.self_attn.v_proj.weight [192, 576]
model.norm.weight [576]

You can map every row back to the concepts above: q_proj is W_Q, o_proj is W_O, gate_proj and up_proj are the MLP expansion, input_layernorm is norm1. The shapes make the dimensions concrete: [576, 576] for Q means d_model=576, and [1536, 576] for the MLP expansion means d_ffn=1536, roughly 2.7× d_model for this small model. K and V show [192, 576] rather than [576, 576] because of grouped-query attention: the 9 query heads of 64 dimensions each share 3 key/value heads, so k_proj and v_proj output only 3 × 64 = 192 dimensions.
The KV cache
To understand why the KV cache exists, start with the problem it solves.
Attention works by comparing the current token against every previous token in the sequence. When the model generates a new word, it needs to look back at everything that came before to decide what comes next. That comparison relies on the K and V vectors of every previous token.
Without a cache, the model would recompute K and V for all previous tokens on every single generation step. For a 500-token prompt, generating the first new word means computing K and V for 500 tokens. Generating the second word means doing it again for 501 tokens. By the 100th generated word, that becomes a huge amount of redundant work.
The solution is simple: save K and V the first time you compute them, and reuse them at every subsequent step.
Q is not cached. The query vector asks “what am I looking for right now?” It only makes sense for the current token and is thrown away after use. K and V, on the other hand, represent what each past token offers and carries. Once computed for a given token at a specific position in the sequence, they remain fixed and can be reused for all future generation steps.
Processing the prompt (500 tokens):
Compute K and V for all 500 tokens
Store them in the cache (500 rows per layer)
Generating token 501:
Compute K and V for this new token only
Append to cache (now 501 rows)
Use Q of token 501 to compare against all 501 cached K vectors
Collect the corresponding V vectors to produce the output
Generating token 502:
Compute K and V for this new token only
Append to cache (now 502 rows)
... and so on

The cache only ever grows. Past entries are never rewritten. Each new token adds one K and one V vector per layer.
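A sketch of that append-and-reuse loop for a single layer and a single head (names and shapes are illustrative):

```python
import torch

d_head = 64
K_cache = torch.empty(0, d_head)   # grows by one row per token
V_cache = torch.empty(0, d_head)

def decode_step(q, k_new, v_new):
    """Append this token's K and V, then attend over the whole cache."""
    global K_cache, V_cache
    K_cache = torch.cat([K_cache, k_new[None, :]])   # now t+1 rows
    V_cache = torch.cat([V_cache, v_new[None, :]])
    weights = torch.softmax(q @ K_cache.T / d_head**0.5, dim=-1)
    return weights @ V_cache       # weighted sum over every cached token
```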
This has a real memory cost. The cache holds K and V for every token, across every layer of the model. On large models with long contexts, this easily reaches several GB of GPU memory. That cost is the main bottleneck for high concurrency and long contexts, and the direct reason tools like vLLM’s PagedAttention exist.
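You can estimate that cost from the model dimensions alone. A back-of-the-envelope sketch for a Llama-3-8B-like configuration (32 layers, 8 KV heads of 128 dimensions, fp16; exact numbers vary by model):

```python
layers, kv_heads, head_dim, bytes_per_value = 32, 8, 128, 2     # fp16
per_token = 2 * layers * kv_heads * head_dim * bytes_per_value  # K and V
context, batch = 8192, 4
print(per_token / 1024, "KiB per token")                 # 128.0 KiB
print(per_token * context * batch / 2**30, "GiB total")  # 4.0 GiB
```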
Prefill vs decode
It helps to separate two levels of structure. Attention → MLP is the inner structure: what happens inside each transformer layer on every forward pass. Prefill and decode are the outer structure: the two phases that govern how the forward pass is applied across an entire generation.
Inference
├── Prefill (once, all prompt tokens processed in parallel)
│ └── Forward pass on the full prompt
│ └── Each layer: Attention → MLP
│ └── Produces the KV cache
│
└── Decode (repeated, one new token per step)
└── Forward pass on a single token
└── Each layer: Attention (reads KV cache) → MLP
└── Appends one K, V row to the cache

The KV cache is the link between the two phases: prefill creates it, and decode consumes and grows it one row at a time.
The forward pass itself is identical in both phases. What changes is what it operates on: the full prompt sequence during prefill, a single token during decode.
Every generation has two distinct phases with very different performance profiles:
| | Prefill | Decode |
|---|---|---|
| Input | Entire prompt at once | One token at a time |
| Parallelism | Full: all tokens processed in parallel | Sequential within a sequence |
| Bottleneck | Compute (GPU tensor cores) | Memory bandwidth (reading KV cache) |
| Output | First token + populated KV cache | One new token per step |
Prefill is fast but compute-heavy. Decode is slow because it’s sequential and memory-bound: each step requires reading the entire KV cache from GPU memory, even though only a small fraction of it changes.
This distinction explains several advanced techniques:
- Speculative decoding: a small model drafts several tokens in a prefill-like parallel step, the large model verifies them all at once
- Chunked prefill: split a long prompt into chunks to avoid stalling the GPU on a single massive prefill
- Prefix caching: if many requests share the same system prompt, compute and cache its KV once and reuse it
Where knowledge lives
| Location | What it contributes |
|---|---|
| W_E embeddings | Initial semantic representation of tokens |
| MLP weights | Learned associations, patterns, and transformations |
| Attention weights | How information flows between tokens (structure, relationships) |
In practice these are not cleanly separated: everything is distributed across the network, and each component shapes the final representation. W_Q, W_K, W_V in particular encode structural routing strategies rather than facts, but that routing is still a form of learned knowledge. Fine-tuning techniques such as LoRA exploit this structure: rather than retraining the entire model, they freeze the original weights and learn small low-rank updates to a subset of these matrices (often the attention projections), which makes fine-tuning far cheaper.
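A sketch of the LoRA idea: the pretrained matrix W stays frozen, and only two small low-rank matrices are trained (the rank and scaling below are illustrative):

```python
import torch

d_model, rank, alpha = 576, 8, 16
W = torch.randn(d_model, d_model)   # frozen pretrained weight
A = torch.randn(rank, d_model)      # trained; tiny compared to W
B = torch.zeros(d_model, rank)      # starts at zero: no change at initialization

def lora_forward(x):
    # Effective weight is W + (alpha / rank) * B @ A, without ever modifying W
    return W @ x + (alpha / rank) * (B @ (A @ x))
```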