LLM Internals

This page assumes you’ve read How LLMs Work and want to go deeper: into the actual tensors, the memory mechanics, and the engineering decisions that follow from them.

Tensors: the only data structure

Everything in a neural network is made of tensors: multi-dimensional arrays of floating-point numbers. A vector is a 1D tensor. A matrix is a 2D tensor.

At inference time there are two kinds:

| Kind | Examples | Lifetime |
|---|---|---|
| Weights (trained) | W_Q, W_K, W_V, W_E… | Permanent: loaded once, never change |
| Activations (computed) | Q, K, V | Ephemeral: created per token, discarded |

The model file on disk is a collection of weight tensors. For example, the “8B” in “Llama 3 8B” is the approximate total count of individual float values across all of them: roughly 8 billion parameters.

Token embeddings: W_E

A forward pass is the journey a token takes through the entire neural network: from input to predicted next token. The very first operation of that pass is a lookup in the embedding table W_E: for each token, its ID is used to retrieve a vector from this table. That vector is what flows through all subsequent layers. W_E has one row per token in the vocabulary (128,000 for Llama 3) and one value per embedding dimension (4,096). This dimension size is called d_model throughout the rest of this article. Tokens with similar meanings end up with similar vectors. This is the model’s initial representation of token meaning, which is then refined and transformed across all layers.
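The lookup itself is nothing more than row indexing into a matrix. A minimal numpy sketch with toy sizes (nothing here is Llama’s real data):

```python
import numpy as np

vocab_size, d_model = 1_000, 64         # toy sizes, not Llama's real dimensions
rng = np.random.default_rng(0)
W_E = rng.standard_normal((vocab_size, d_model)).astype(np.float32)

token_ids = [42, 7, 42]                 # token 42 appears twice
embeddings = W_E[token_ids]             # plain row indexing: one row per token

print(embeddings.shape)                 # (3, 64)
```

The same token ID always retrieves the same row, so repeated tokens start every forward pass with identical vectors; the layers afterwards differentiate them by context.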

Attention: Q, K, V in detail

The embedding vectors produced by W_E are then passed into the attention block. Each token is projected into three vectors using three learned weight matrices, W_Q, W_K, and W_V:

| Vector | Matrix | Role |
|---|---|---|
| Q (Query) | W_Q | “What am I looking for?” |
| K (Key) | W_K | “What do I offer?” |
| V (Value) | W_V | “What information do I carry?” |

W_Q, W_K, and W_V do not directly store factual knowledge. Instead, they learn how to route information between tokens, encoding structural patterns such as syntax, word relationships, and attention behavior.

Once Q, K, and V have been computed, attention compares the Q vector of the current token against all K vectors using a dot product, scaled to keep scores in a stable range before normalizing. Those scores are then normalized so they sum to 1 and used to produce a weighted sum of all V vectors, collecting information from the most relevant tokens. A fourth matrix, W_O, then projects that result back to d_model so it can flow into the next layer.
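The whole computation fits in a few lines of numpy. A single-head sketch with toy sizes and random weights; the causal mask is an assumption consistent with decoder-only models like Llama, where a token only attends to itself and earlier positions:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

seq_len, d_model = 5, 16                       # toy sizes
rng = np.random.default_rng(0)
X = rng.standard_normal((seq_len, d_model))    # one embedding row per token
W_Q, W_K, W_V, W_O = (rng.standard_normal((d_model, d_model)) for _ in range(4))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V            # project every token three ways
scores = Q @ K.T / np.sqrt(d_model)            # scaled dot products
mask = np.tril(np.ones((seq_len, seq_len)))    # causal: attend only backwards
scores = np.where(mask == 1, scores, -np.inf)
weights = softmax(scores)                      # each row sums to 1
out = (weights @ V) @ W_O                      # weighted sum of V, back to d_model
```

The `1/sqrt(d_model)` factor is the “scaled to keep scores in a stable range” step: without it, dot products grow with dimension and softmax saturates.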

Multi-head attention. In practice, each attention block runs multiple attention operations in parallel, each called a head. Every head has its own W_Q, W_K, W_V projections and learns to focus on a different type of relationship: one head might track subject-verb dependencies, another might handle pronoun references, another word order. The outputs of all heads are concatenated and projected through W_O. Because d_model is divided equally across heads, each head works in a smaller subspace: SmolLM2-135M has 9 query heads of 576/9 = 64 dimensions each. It also uses grouped-query attention, in which several query heads share a single key/value head; with only 3 K/V heads, the K and V projections cover 3 × 64 = 192 dimensions, which is why K and V show [192, 576] in the tensor output rather than [576, 576].

MLP block

The output of the attention block is passed to the MLP. Where attention figures out which tokens are relevant to each other, the MLP applies learned transformations to the resulting representation, capturing associations, patterns, and higher-level features learned during training. It expands the vector to a larger internal dimension, filters and reshapes the values, then compresses back down.
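Llama-style models implement this as a gated MLP (the SwiGLU family). A toy sketch, assuming the SiLU activation Llama uses; the dimensions and random weights are illustrative:

```python
import numpy as np

d_model, d_ffn = 16, 48                       # toy sizes (SmolLM2 uses 576 → 1536)
rng = np.random.default_rng(0)
W_up = rng.standard_normal((d_ffn, d_model)) * 0.1
W_gate = rng.standard_normal((d_ffn, d_model)) * 0.1
W_down = rng.standard_normal((d_model, d_ffn)) * 0.1

def silu(x):                                  # smooth activation: filters values
    return x / (1.0 + np.exp(-x))

def mlp(x):
    up, gate = W_up @ x, W_gate @ x           # expand to the wider d_ffn space
    return W_down @ (silu(gate) * up)         # gate elementwise, compress back

x = rng.standard_normal(d_model)
y = mlp(x)
```

The elementwise product `silu(gate) * up` is the “filters and reshapes” step: the gate path decides how much of each expanded value passes through.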

How the pieces fit together

Before listing all tensors, it’s important to get an overview of the entire process:

  • W_E (token embedding) runs once per token, at the very start of the forward pass, before any layer is involved
  • Attention + MLP run once per layer, repeated N times for every token

Every token, whether part of the input prompt or newly generated, goes through its own forward pass: W_E lookup first, then through all N layers.

Each token's forward pass
│
├── W_E lookup  ← once, before any layer
│     └── token ID → embedding vector
│
├── Layer 0
│   ├── Attention (Q, K, V, W_O)
│   └── MLP (W_up, W_gate, W_down)
│
├── Layer 1
│   ├── Attention
│   └── MLP
│
│   ... repeated for all N layers ...
│
├── Layer N-1
│   ├── Attention
│   └── MLP
│
├── final_norm
└── W_unembed → one score per vocabulary word

W_E and W_unembed are global: they sit outside the layer stack. Everything in between (attention + MLP) repeats once per layer.

Full tensor inventory per layer

One more dimension appears in the tensor tables: d_ffn, the larger intermediate size the MLP expands into before compressing back down to d_model. This wider space gives the model more room to detect and combine patterns.

The tensor names below are conceptual labels. In practice, naming varies by model: Llama uses mlp.up_proj / mlp.gate_proj / mlp.down_proj, GPT-2 uses mlp.c_fc and mlp.c_proj. The structure is the same; only the names differ.

Each layer contains the same set of tensors:

Attention block

| Tensor | Role |
|---|---|
| W_Q | Query projection |
| W_K | Key projection |
| W_V | Value projection |
| W_O | Projects attention output back to d_model |
| norm1.weight | Layer norm before attention |

MLP block

| Tensor | Role |
|---|---|
| W_up | Expands from d_model to d_ffn |
| W_gate | Controls information flow (gating) |
| W_down | Compresses back from d_ffn to d_model |
| norm2.weight | Layer norm before MLP |

Global tensors (outside layers)

| Tensor | Role |
|---|---|
| W_E | Token embedding table |
| RoPE | Encodes each token’s position in the sequence; without it, attention would have no sense of word order. RoPE is computed on the fly rather than stored as a learned weight, so it never appears in the model file |
| final_norm.weight | Norm after the last layer |
| W_unembed (lm_head) | Final projection from d_model to vocab size → logits |

A note on layer norms. The norm1 and norm2 tensors perform layer normalization: they rescale the token vectors before each sub-block to keep values in a stable numerical range. Without them, numbers could grow or shrink uncontrollably as they pass through many layers.

For reference, approximate parameter counts in Llama 3 8B:

| Component | Parameters |
|---|---|
| Token embeddings (W_E) | ~500M |
| Unembedding (W_unembed) | ~500M |
| Attention (×32 layers) | ~1.3B |
| MLP (×32 layers) | ~5.6B |
| Norms and other | negligible |
| Total | ~8B |
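Those figures can be checked back-of-the-envelope from Llama 3 8B’s published dimensions (vocab 128,256, d_model 4,096, d_ffn 14,336, 32 layers, 32 heads of which 8 are KV heads; the unembedding table counts separately because Llama 3 8B does not tie it to W_E):

```python
vocab, d_model, d_ffn, n_layers = 128_256, 4_096, 14_336, 32
n_heads, n_kv_heads = 32, 8
kv_dim = n_kv_heads * (d_model // n_heads)        # 1024

embed = vocab * d_model                           # W_E
unembed = vocab * d_model                         # W_unembed (not tied to W_E)
attn = n_layers * (2 * d_model * d_model          # W_Q and W_O
                   + 2 * kv_dim * d_model)        # W_K and W_V (grouped-query)
mlp = n_layers * 3 * d_model * d_ffn              # W_up, W_gate, W_down
total = embed + unembed + attn + mlp

print(f"attention {attn/1e9:.2f}B, MLP {mlp/1e9:.2f}B, total {total/1e9:.2f}B")
# attention 1.34B, MLP 5.64B, total 8.03B
```

The arithmetic also shows why the MLP dominates: three d_model × d_ffn matrices per layer dwarf everything else.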

The full forward pass

Before reading the diagram, one key concept: the residual stream. Rather than each layer replacing the token vector, every sub-block only adds its output to the existing vector. This makes it difficult for later layers to completely erase earlier information, and encourages the model to build progressively on top of existing representations.

Input tokens  (one per word or subword)
  ↓  W_E lookup
Embeddings  (one vector per token)
  ↓  + positional encoding (RoPE, applied to Q and K)
  ↓
━━━[ Layer 0 ]━━━━━━━━━━━━━━━━━━━
  ↓  norm1                            ← stabilize values before attention
  ↓  × W_Q, W_K, W_V  →  Q, K, V
  ↓  compare Q against all K vectors  ← compute attention scores
  ↓  convert scores to weights        ← each weight says how much to attend to each token
  ↓  weighted sum of V                ← collect information from relevant tokens
  ↓  × W_O                            ← project back to d_model
  ↓  residual add                     ← add attention output to the token vector
  ↓  norm2                            ← stabilize before MLP
  ↓  × W_up, × W_gate  →  expand to d_ffn
  ↓  activation function              ← filters and shapes the expanded values
  ↓  × W_down  →  compress to d_model
  ↓  residual add                     ← add MLP output to the token vector
━━━[ Layers 1..N ]━━━━━━━━━━━━━━━
  ↓  (same structure repeated)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
  ↓  final_norm
  ↓  × W_unembed
Output: one score per word in the vocabulary
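The two “residual add” steps in the diagram reduce to two additions per layer. A toy sketch (the lambda sub-blocks are stand-ins for the real attention and MLP):

```python
import numpy as np

def transformer_layer(x, attn, mlp, norm1, norm2):
    # x is the residual stream: each sub-block ADDS to it, never replaces it
    x = x + attn(norm1(x))
    x = x + mlp(norm2(x))
    return x

# toy stand-ins just to show the flow; the real sub-blocks are the matrices above
x = np.ones(4)
out = transformer_layer(x,
                        attn=lambda v: 2 * v, mlp=lambda v: -v,
                        norm1=lambda v: v, norm2=lambda v: v)
print(out)  # [0. 0. 0. 0.]: contributions were added together, never overwritten
```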

The final output is a list of logits: one raw score per word in the vocabulary. A high score means the model considers that word a likely next token. These scores are then converted to probabilities, from which the next token is selected via sampling or other decoding strategies.
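A sketch of that final step, from logits to probabilities to a chosen token (the logits here are made up; greedy selection and temperature sampling shown):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

logits = np.array([2.0, 0.5, -1.0, 3.0])        # toy scores, one per vocab entry
probs = softmax(logits)                         # non-negative, sums to 1

greedy = int(np.argmax(logits))                 # always take the top score
rng = np.random.default_rng(0)
sampled = int(rng.choice(len(probs), p=probs))  # pick in proportion to probs

# temperature < 1 sharpens the distribution, > 1 flattens it
probs_sharp = softmax(logits / 0.7)
```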

Inspecting tensors in practice

Everything above can be verified on a real model. The example below uses SmolLM2-135M, a small openly available model that follows the same Llama-style architecture. No HuggingFace account needed.

```shell
pip install torch safetensors huggingface_hub

# Download the model
hf download HuggingFaceTB/SmolLM2-135M --local-dir ./smollm
```

inspect_tensors.py:

```python
from safetensors import safe_open

# Shapes come from the file header; the weights themselves are never loaded
with safe_open("./smollm/model.safetensors", framework="pt") as f:
    for key in f.keys():
        shape = f.get_slice(key).get_shape()
        print(f"{key:55s} {shape}")
```

This reads shapes directly from the file header without loading the weights into memory. The output shows every tensor with its real name and shape:


See full output
$ python inspect_tensors.py
model.embed_tokens.weight                               [49152, 576]
model.layers.0.input_layernorm.weight                   [576]
model.layers.0.mlp.down_proj.weight                     [576, 1536]
model.layers.0.mlp.gate_proj.weight                     [1536, 576]
model.layers.0.mlp.up_proj.weight                       [1536, 576]
model.layers.0.post_attention_layernorm.weight          [576]
model.layers.0.self_attn.k_proj.weight                  [192, 576]
model.layers.0.self_attn.o_proj.weight                  [576, 576]
model.layers.0.self_attn.q_proj.weight                  [576, 576]
model.layers.0.self_attn.v_proj.weight                  [192, 576]
model.layers.1.input_layernorm.weight                   [576]
model.layers.1.mlp.down_proj.weight                     [576, 1536]
model.layers.1.mlp.gate_proj.weight                     [1536, 576]
model.layers.1.mlp.up_proj.weight                       [1536, 576]
model.layers.1.post_attention_layernorm.weight          [576]
model.layers.1.self_attn.k_proj.weight                  [192, 576]
model.layers.1.self_attn.o_proj.weight                  [576, 576]
model.layers.1.self_attn.q_proj.weight                  [576, 576]
model.layers.1.self_attn.v_proj.weight                  [192, 576]
model.layers.10.input_layernorm.weight                  [576]
model.layers.10.mlp.down_proj.weight                    [576, 1536]
model.layers.10.mlp.gate_proj.weight                    [1536, 576]
model.layers.10.mlp.up_proj.weight                      [1536, 576]
model.layers.10.post_attention_layernorm.weight         [576]
model.layers.10.self_attn.k_proj.weight                 [192, 576]
model.layers.10.self_attn.o_proj.weight                 [576, 576]
model.layers.10.self_attn.q_proj.weight                 [576, 576]
model.layers.10.self_attn.v_proj.weight                 [192, 576]
model.layers.11.input_layernorm.weight                  [576]
model.layers.11.mlp.down_proj.weight                    [576, 1536]
model.layers.11.mlp.gate_proj.weight                    [1536, 576]
model.layers.11.mlp.up_proj.weight                      [1536, 576]
model.layers.11.post_attention_layernorm.weight         [576]
model.layers.11.self_attn.k_proj.weight                 [192, 576]
model.layers.11.self_attn.o_proj.weight                 [576, 576]
model.layers.11.self_attn.q_proj.weight                 [576, 576]
model.layers.11.self_attn.v_proj.weight                 [192, 576]
model.layers.12.input_layernorm.weight                  [576]
model.layers.12.mlp.down_proj.weight                    [576, 1536]
model.layers.12.mlp.gate_proj.weight                    [1536, 576]
model.layers.12.mlp.up_proj.weight                      [1536, 576]
model.layers.12.post_attention_layernorm.weight         [576]
model.layers.12.self_attn.k_proj.weight                 [192, 576]
model.layers.12.self_attn.o_proj.weight                 [576, 576]
model.layers.12.self_attn.q_proj.weight                 [576, 576]
model.layers.12.self_attn.v_proj.weight                 [192, 576]
model.layers.13.input_layernorm.weight                  [576]
model.layers.13.mlp.down_proj.weight                    [576, 1536]
model.layers.13.mlp.gate_proj.weight                    [1536, 576]
model.layers.13.mlp.up_proj.weight                      [1536, 576]
model.layers.13.post_attention_layernorm.weight         [576]
model.layers.13.self_attn.k_proj.weight                 [192, 576]
model.layers.13.self_attn.o_proj.weight                 [576, 576]
model.layers.13.self_attn.q_proj.weight                 [576, 576]
model.layers.13.self_attn.v_proj.weight                 [192, 576]
model.layers.14.input_layernorm.weight                  [576]
model.layers.14.mlp.down_proj.weight                    [576, 1536]
model.layers.14.mlp.gate_proj.weight                    [1536, 576]
model.layers.14.mlp.up_proj.weight                      [1536, 576]
model.layers.14.post_attention_layernorm.weight         [576]
model.layers.14.self_attn.k_proj.weight                 [192, 576]
model.layers.14.self_attn.o_proj.weight                 [576, 576]
model.layers.14.self_attn.q_proj.weight                 [576, 576]
model.layers.14.self_attn.v_proj.weight                 [192, 576]
model.layers.15.input_layernorm.weight                  [576]
model.layers.15.mlp.down_proj.weight                    [576, 1536]
model.layers.15.mlp.gate_proj.weight                    [1536, 576]
model.layers.15.mlp.up_proj.weight                      [1536, 576]
model.layers.15.post_attention_layernorm.weight         [576]
model.layers.15.self_attn.k_proj.weight                 [192, 576]
model.layers.15.self_attn.o_proj.weight                 [576, 576]
model.layers.15.self_attn.q_proj.weight                 [576, 576]
model.layers.15.self_attn.v_proj.weight                 [192, 576]
model.layers.16.input_layernorm.weight                  [576]
model.layers.16.mlp.down_proj.weight                    [576, 1536]
model.layers.16.mlp.gate_proj.weight                    [1536, 576]
model.layers.16.mlp.up_proj.weight                      [1536, 576]
model.layers.16.post_attention_layernorm.weight         [576]
model.layers.16.self_attn.k_proj.weight                 [192, 576]
model.layers.16.self_attn.o_proj.weight                 [576, 576]
model.layers.16.self_attn.q_proj.weight                 [576, 576]
model.layers.16.self_attn.v_proj.weight                 [192, 576]
model.layers.17.input_layernorm.weight                  [576]
model.layers.17.mlp.down_proj.weight                    [576, 1536]
model.layers.17.mlp.gate_proj.weight                    [1536, 576]
model.layers.17.mlp.up_proj.weight                      [1536, 576]
model.layers.17.post_attention_layernorm.weight         [576]
model.layers.17.self_attn.k_proj.weight                 [192, 576]
model.layers.17.self_attn.o_proj.weight                 [576, 576]
model.layers.17.self_attn.q_proj.weight                 [576, 576]
model.layers.17.self_attn.v_proj.weight                 [192, 576]
model.layers.18.input_layernorm.weight                  [576]
model.layers.18.mlp.down_proj.weight                    [576, 1536]
model.layers.18.mlp.gate_proj.weight                    [1536, 576]
model.layers.18.mlp.up_proj.weight                      [1536, 576]
model.layers.18.post_attention_layernorm.weight         [576]
model.layers.18.self_attn.k_proj.weight                 [192, 576]
model.layers.18.self_attn.o_proj.weight                 [576, 576]
model.layers.18.self_attn.q_proj.weight                 [576, 576]
model.layers.18.self_attn.v_proj.weight                 [192, 576]
model.layers.19.input_layernorm.weight                  [576]
model.layers.19.mlp.down_proj.weight                    [576, 1536]
model.layers.19.mlp.gate_proj.weight                    [1536, 576]
model.layers.19.mlp.up_proj.weight                      [1536, 576]
model.layers.19.post_attention_layernorm.weight         [576]
model.layers.19.self_attn.k_proj.weight                 [192, 576]
model.layers.19.self_attn.o_proj.weight                 [576, 576]
model.layers.19.self_attn.q_proj.weight                 [576, 576]
model.layers.19.self_attn.v_proj.weight                 [192, 576]
model.layers.2.input_layernorm.weight                   [576]
model.layers.2.mlp.down_proj.weight                     [576, 1536]
model.layers.2.mlp.gate_proj.weight                     [1536, 576]
model.layers.2.mlp.up_proj.weight                       [1536, 576]
model.layers.2.post_attention_layernorm.weight          [576]
model.layers.2.self_attn.k_proj.weight                  [192, 576]
model.layers.2.self_attn.o_proj.weight                  [576, 576]
model.layers.2.self_attn.q_proj.weight                  [576, 576]
model.layers.2.self_attn.v_proj.weight                  [192, 576]
model.layers.20.input_layernorm.weight                  [576]
model.layers.20.mlp.down_proj.weight                    [576, 1536]
model.layers.20.mlp.gate_proj.weight                    [1536, 576]
model.layers.20.mlp.up_proj.weight                      [1536, 576]
model.layers.20.post_attention_layernorm.weight         [576]
model.layers.20.self_attn.k_proj.weight                 [192, 576]
model.layers.20.self_attn.o_proj.weight                 [576, 576]
model.layers.20.self_attn.q_proj.weight                 [576, 576]
model.layers.20.self_attn.v_proj.weight                 [192, 576]
model.layers.21.input_layernorm.weight                  [576]
model.layers.21.mlp.down_proj.weight                    [576, 1536]
model.layers.21.mlp.gate_proj.weight                    [1536, 576]
model.layers.21.mlp.up_proj.weight                      [1536, 576]
model.layers.21.post_attention_layernorm.weight         [576]
model.layers.21.self_attn.k_proj.weight                 [192, 576]
model.layers.21.self_attn.o_proj.weight                 [576, 576]
model.layers.21.self_attn.q_proj.weight                 [576, 576]
model.layers.21.self_attn.v_proj.weight                 [192, 576]
model.layers.22.input_layernorm.weight                  [576]
model.layers.22.mlp.down_proj.weight                    [576, 1536]
model.layers.22.mlp.gate_proj.weight                    [1536, 576]
model.layers.22.mlp.up_proj.weight                      [1536, 576]
model.layers.22.post_attention_layernorm.weight         [576]
model.layers.22.self_attn.k_proj.weight                 [192, 576]
model.layers.22.self_attn.o_proj.weight                 [576, 576]
model.layers.22.self_attn.q_proj.weight                 [576, 576]
model.layers.22.self_attn.v_proj.weight                 [192, 576]
model.layers.23.input_layernorm.weight                  [576]
model.layers.23.mlp.down_proj.weight                    [576, 1536]
model.layers.23.mlp.gate_proj.weight                    [1536, 576]
model.layers.23.mlp.up_proj.weight                      [1536, 576]
model.layers.23.post_attention_layernorm.weight         [576]
model.layers.23.self_attn.k_proj.weight                 [192, 576]
model.layers.23.self_attn.o_proj.weight                 [576, 576]
model.layers.23.self_attn.q_proj.weight                 [576, 576]
model.layers.23.self_attn.v_proj.weight                 [192, 576]
model.layers.24.input_layernorm.weight                  [576]
model.layers.24.mlp.down_proj.weight                    [576, 1536]
model.layers.24.mlp.gate_proj.weight                    [1536, 576]
model.layers.24.mlp.up_proj.weight                      [1536, 576]
model.layers.24.post_attention_layernorm.weight         [576]
model.layers.24.self_attn.k_proj.weight                 [192, 576]
model.layers.24.self_attn.o_proj.weight                 [576, 576]
model.layers.24.self_attn.q_proj.weight                 [576, 576]
model.layers.24.self_attn.v_proj.weight                 [192, 576]
model.layers.25.input_layernorm.weight                  [576]
model.layers.25.mlp.down_proj.weight                    [576, 1536]
model.layers.25.mlp.gate_proj.weight                    [1536, 576]
model.layers.25.mlp.up_proj.weight                      [1536, 576]
model.layers.25.post_attention_layernorm.weight         [576]
model.layers.25.self_attn.k_proj.weight                 [192, 576]
model.layers.25.self_attn.o_proj.weight                 [576, 576]
model.layers.25.self_attn.q_proj.weight                 [576, 576]
model.layers.25.self_attn.v_proj.weight                 [192, 576]
model.layers.26.input_layernorm.weight                  [576]
model.layers.26.mlp.down_proj.weight                    [576, 1536]
model.layers.26.mlp.gate_proj.weight                    [1536, 576]
model.layers.26.mlp.up_proj.weight                      [1536, 576]
model.layers.26.post_attention_layernorm.weight         [576]
model.layers.26.self_attn.k_proj.weight                 [192, 576]
model.layers.26.self_attn.o_proj.weight                 [576, 576]
model.layers.26.self_attn.q_proj.weight                 [576, 576]
model.layers.26.self_attn.v_proj.weight                 [192, 576]
model.layers.27.input_layernorm.weight                  [576]
model.layers.27.mlp.down_proj.weight                    [576, 1536]
model.layers.27.mlp.gate_proj.weight                    [1536, 576]
model.layers.27.mlp.up_proj.weight                      [1536, 576]
model.layers.27.post_attention_layernorm.weight         [576]
model.layers.27.self_attn.k_proj.weight                 [192, 576]
model.layers.27.self_attn.o_proj.weight                 [576, 576]
model.layers.27.self_attn.q_proj.weight                 [576, 576]
model.layers.27.self_attn.v_proj.weight                 [192, 576]
model.layers.28.input_layernorm.weight                  [576]
model.layers.28.mlp.down_proj.weight                    [576, 1536]
model.layers.28.mlp.gate_proj.weight                    [1536, 576]
model.layers.28.mlp.up_proj.weight                      [1536, 576]
model.layers.28.post_attention_layernorm.weight         [576]
model.layers.28.self_attn.k_proj.weight                 [192, 576]
model.layers.28.self_attn.o_proj.weight                 [576, 576]
model.layers.28.self_attn.q_proj.weight                 [576, 576]
model.layers.28.self_attn.v_proj.weight                 [192, 576]
model.layers.29.input_layernorm.weight                  [576]
model.layers.29.mlp.down_proj.weight                    [576, 1536]
model.layers.29.mlp.gate_proj.weight                    [1536, 576]
model.layers.29.mlp.up_proj.weight                      [1536, 576]
model.layers.29.post_attention_layernorm.weight         [576]
model.layers.29.self_attn.k_proj.weight                 [192, 576]
model.layers.29.self_attn.o_proj.weight                 [576, 576]
model.layers.29.self_attn.q_proj.weight                 [576, 576]
model.layers.29.self_attn.v_proj.weight                 [192, 576]
model.layers.3.input_layernorm.weight                   [576]
model.layers.3.mlp.down_proj.weight                     [576, 1536]
model.layers.3.mlp.gate_proj.weight                     [1536, 576]
model.layers.3.mlp.up_proj.weight                       [1536, 576]
model.layers.3.post_attention_layernorm.weight          [576]
model.layers.3.self_attn.k_proj.weight                  [192, 576]
model.layers.3.self_attn.o_proj.weight                  [576, 576]
model.layers.3.self_attn.q_proj.weight                  [576, 576]
model.layers.3.self_attn.v_proj.weight                  [192, 576]
model.layers.4.input_layernorm.weight                   [576]
model.layers.4.mlp.down_proj.weight                     [576, 1536]
model.layers.4.mlp.gate_proj.weight                     [1536, 576]
model.layers.4.mlp.up_proj.weight                       [1536, 576]
model.layers.4.post_attention_layernorm.weight          [576]
model.layers.4.self_attn.k_proj.weight                  [192, 576]
model.layers.4.self_attn.o_proj.weight                  [576, 576]
model.layers.4.self_attn.q_proj.weight                  [576, 576]
model.layers.4.self_attn.v_proj.weight                  [192, 576]
model.layers.5.input_layernorm.weight                   [576]
model.layers.5.mlp.down_proj.weight                     [576, 1536]
model.layers.5.mlp.gate_proj.weight                     [1536, 576]
model.layers.5.mlp.up_proj.weight                       [1536, 576]
model.layers.5.post_attention_layernorm.weight          [576]
model.layers.5.self_attn.k_proj.weight                  [192, 576]
model.layers.5.self_attn.o_proj.weight                  [576, 576]
model.layers.5.self_attn.q_proj.weight                  [576, 576]
model.layers.5.self_attn.v_proj.weight                  [192, 576]
model.layers.6.input_layernorm.weight                   [576]
model.layers.6.mlp.down_proj.weight                     [576, 1536]
model.layers.6.mlp.gate_proj.weight                     [1536, 576]
model.layers.6.mlp.up_proj.weight                       [1536, 576]
model.layers.6.post_attention_layernorm.weight          [576]
model.layers.6.self_attn.k_proj.weight                  [192, 576]
model.layers.6.self_attn.o_proj.weight                  [576, 576]
model.layers.6.self_attn.q_proj.weight                  [576, 576]
model.layers.6.self_attn.v_proj.weight                  [192, 576]
model.layers.7.input_layernorm.weight                   [576]
model.layers.7.mlp.down_proj.weight                     [576, 1536]
model.layers.7.mlp.gate_proj.weight                     [1536, 576]
model.layers.7.mlp.up_proj.weight                       [1536, 576]
model.layers.7.post_attention_layernorm.weight          [576]
model.layers.7.self_attn.k_proj.weight                  [192, 576]
model.layers.7.self_attn.o_proj.weight                  [576, 576]
model.layers.7.self_attn.q_proj.weight                  [576, 576]
model.layers.7.self_attn.v_proj.weight                  [192, 576]
model.layers.8.input_layernorm.weight                   [576]
model.layers.8.mlp.down_proj.weight                     [576, 1536]
model.layers.8.mlp.gate_proj.weight                     [1536, 576]
model.layers.8.mlp.up_proj.weight                       [1536, 576]
model.layers.8.post_attention_layernorm.weight          [576]
model.layers.8.self_attn.k_proj.weight                  [192, 576]
model.layers.8.self_attn.o_proj.weight                  [576, 576]
model.layers.8.self_attn.q_proj.weight                  [576, 576]
model.layers.8.self_attn.v_proj.weight                  [192, 576]
model.layers.9.input_layernorm.weight                   [576]
model.layers.9.mlp.down_proj.weight                     [576, 1536]
model.layers.9.mlp.gate_proj.weight                     [1536, 576]
model.layers.9.mlp.up_proj.weight                       [1536, 576]
model.layers.9.post_attention_layernorm.weight          [576]
model.layers.9.self_attn.k_proj.weight                  [192, 576]
model.layers.9.self_attn.o_proj.weight                  [576, 576]
model.layers.9.self_attn.q_proj.weight                  [576, 576]
model.layers.9.self_attn.v_proj.weight                  [192, 576]
model.norm.weight                                       [576]

You can map every row back to the concepts above: q_proj is W_Q, o_proj is W_O, gate_proj and up_proj are the MLP expansion, input_layernorm is norm1. The shapes make the dimensions concrete: [576, 576] for Q means d_model=576, and [1536, 576] for the MLP expansion means d_ffn=1536, roughly 2.7× d_model for this small model. K and V show [192, 576] rather than [576, 576] because of grouped-query attention: the 576 dimensions are split into 9 query heads of 64 each, but only 3 key/value heads are kept (3 × 64 = 192), each shared by a group of query heads.

The KV cache

To understand why the KV cache exists, start with the problem it solves.

Attention works by comparing the current token against every previous token in the sequence. When the model generates a new word, it needs to look back at everything that came before to decide what comes next. That comparison relies on the K and V vectors of every previous token.

Without a cache, the model would recompute K and V for all previous tokens on every single generation step. For a 500-token prompt, generating the first new word means computing K and V for 500 tokens. Generating the second word means doing it again for 501 tokens. By the 100th generated word, that becomes a huge amount of redundant work.

The solution is simple: save K and V the first time you compute them, and reuse them at every subsequent step.

Q is not cached. The query vector asks “what am I looking for right now?” It only makes sense for the current token and is thrown away after use. K and V, on the other hand, represent what each past token offers and carries. Once computed for a given token at a specific position in the sequence, they remain fixed and can be reused for all future generation steps.

Processing the prompt (500 tokens):
  Compute K and V for all 500 tokens
  Store them in the cache (500 rows per layer)

Generating token 501:
  Compute K and V for this new token only
  Append to cache (now 501 rows)
  Use Q of token 501 to compare against all 501 cached K vectors
  Collect the corresponding V vectors to produce the output

Generating token 502:
  Compute K and V for this new token only
  Append to cache (now 502 rows)
  ... and so on

The cache only ever grows. Past entries are never rewritten. Each new token adds one K and one V vector per layer.
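The trace above fits in a few lines of code (single layer, single head, toy dimensions; `k_new` and `v_new` stand for the current token’s freshly computed projections):

```python
import numpy as np

d_head = 8                                   # toy size
K_cache = np.empty((0, d_head))              # starts empty, grows one row per token
V_cache = np.empty((0, d_head))

def decode_step(q, k_new, v_new):
    """Append this token's K and V, then attend over the whole cache."""
    global K_cache, V_cache
    K_cache = np.vstack([K_cache, k_new])    # append-only: past rows never change
    V_cache = np.vstack([V_cache, v_new])
    scores = K_cache @ q / np.sqrt(d_head)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V_cache                       # weighted sum over all cached V rows

rng = np.random.default_rng(0)
for _ in range(3):                           # three generation steps
    out = decode_step(*(rng.standard_normal(d_head) for _ in range(3)))

print(K_cache.shape)  # (3, 8): one cached K row per processed token
```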

This has a real memory cost. The cache holds K and V for every token, across every layer of the model. On large models with long contexts, this easily reaches several GB of GPU memory. That cost is the main bottleneck for high concurrency and long contexts, and the direct reason tools like vLLM’s PagedAttention exist.
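That cost is easy to estimate. A sketch using Llama 3 8B’s dimensions (32 layers, 1,024-dim K and V per layer under grouped-query attention, fp16 storage); the context length and batch size are illustrative:

```python
n_layers, kv_dim, bytes_fp16 = 32, 1_024, 2
seq_len, batch = 8_192, 16                       # 8K context, 16 concurrent requests

per_token = 2 * n_layers * kv_dim * bytes_fp16   # one K row + one V row per layer
per_seq = per_token * seq_len
total = per_seq * batch

print(f"{per_token // 1024} KiB/token, "
      f"{per_seq / 2**30:.1f} GiB/sequence, "
      f"{total / 2**30:.0f} GiB for the batch")
# 128 KiB/token, 1.0 GiB/sequence, 16 GiB for the batch
```

At a gigabyte per 8K-context sequence, a modest batch already rivals the weights themselves in memory, which is exactly the pressure PagedAttention addresses.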

Prefill vs decode

It helps to separate two levels of structure. Attention → MLP is the inner structure: what happens inside each transformer layer on every forward pass. Prefill and decode are the outer structure: the two phases that govern how the forward pass is applied across an entire generation.

Inference
├── Prefill  (once, all prompt tokens processed in parallel)
│   └── Forward pass on the full prompt
│       └── Each layer: Attention → MLP
│           └── Produces the KV cache
│
└── Decode  (repeated, one new token per step)
    └── Forward pass on a single token
        └── Each layer: Attention (reads KV cache) → MLP
            └── Appends one K, V row to the cache

The KV cache is the link between the two phases: prefill creates it, and decode consumes and grows it one row at a time.

The forward pass itself is identical in both phases. What changes is what it operates on: the full prompt sequence during prefill, a single token during decode.

Every generation has two distinct phases with very different performance profiles:

| | Prefill | Decode |
|---|---|---|
| Input | Entire prompt at once | One token at a time |
| Parallelism | Full: all tokens processed in parallel | Sequential within a sequence |
| Bottleneck | Compute (GPU tensor cores) | Memory bandwidth (reading KV cache) |
| Output | First token + populated KV cache | One new token per step |

Prefill is fast but compute-heavy. Decode is slow because it’s sequential and memory-bound: each step requires reading the entire KV cache from GPU memory, even though only a small fraction of it changes.
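A rough sketch of why decode is memory-bound: every decode step must stream all the weights (plus the KV cache) through the GPU at least once, so memory bandwidth sets a hard ceiling on single-sequence speed. The bandwidth figure below is an illustrative assumption, not a specific GPU:

```python
params = 8e9                      # model size
bytes_weights = params * 2        # fp16: 2 bytes per parameter
bandwidth = 1e12                  # ~1 TB/s, an illustrative GPU figure

# lower bound: every decode step must read all weights from memory once
min_step_time = bytes_weights / bandwidth     # seconds per token
max_tokens_per_s = 1 / min_step_time

print(f">= {min_step_time * 1e3:.0f} ms/token, so <= {max_tokens_per_s:.0f} tokens/s")
```

Batching many sequences amortizes that weight read, which is why decode throughput scales with concurrency far better than single-stream latency does.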

This distinction explains several advanced techniques:

  • Speculative decoding: a small model drafts several tokens in a prefill-like parallel step, the large model verifies them all at once
  • Chunked prefill: split a long prompt into chunks to avoid stalling the GPU on a single massive prefill
  • Prefix caching: if many requests share the same system prompt, compute and cache its KV once and reuse it

Where knowledge lives

| Location | What it contributes |
|---|---|
| W_E embeddings | Initial semantic representation of tokens |
| MLP weights | Learned associations, patterns, and transformations |
| Attention weights | How information flows between tokens (structure, relationships) |

In practice these are not cleanly separated: everything is distributed across the network, and each component shapes the final representation. W_Q, W_K, W_V in particular encode structural routing strategies rather than facts, but that routing is still a form of learned knowledge. Fine-tuning techniques such as LoRA work by adjusting small additions to a subset of these weight matrices rather than retraining the entire model from scratch, which makes fine-tuning far cheaper.