LLM Internals

This page assumes you’ve read How LLMs Work and want to go deeper: into the actual tensors, the memory mechanics, and the engineering decisions that follow from them.

Tensors: the only data structure

A transformer model is a type of neural network: a large mathematical function built from layers of matrix multiplications. Everything in that network is made of tensors: multi-dimensional arrays of floating-point numbers. A vector is a 1D tensor. A matrix is a 2D tensor.

At inference time there are two kinds:

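| Kind | Examples | Lifetime |
| --- | --- | --- |
| Weights (trained) | W_Q, W_K, W_V, W_E… | Permanent: loaded once, never change |
| Activations (computed) | Q, K, V | Ephemeral: computed per token and discarded after use (K and V are the exception: they are kept in the KV cache, covered later) |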

The model file on disk is a collection of weight tensors. For example, the “8B” in “Llama 8B” is the total count of learned float values (parameters) across all weight tensors.

Token embeddings: W_E

A forward pass is the journey a token takes through the entire neural network: from input to predicted next token. The first operation is a lookup in the embedding table W_E: each token’s ID indexes a row, and that row’s vector is what flows through subsequent layers. W_E has one row per token in the vocabulary (128,256 for Llama 3) and one value per embedding dimension (4,096 for Llama 3 8B). This dimension size is called d_model throughout the rest of this article. Tokens with related meanings tend to start out with similar vectors, though the model refines and contextualizes those representations across all layers.
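As a minimal sketch of that lookup, assuming NumPy and a toy random table standing in for trained weights (the sizes and token IDs below are made up for illustration):

```python
import numpy as np

# Toy sizes for illustration; a real table is far larger.
vocab_size, d_model = 10, 4

rng = np.random.default_rng(0)
# Random values stand in for the trained embedding table W_E.
W_E = rng.standard_normal((vocab_size, d_model)).astype(np.float32)

# Hypothetical token IDs, as a tokenizer would produce them.
token_ids = np.array([3, 7, 3])

# The "lookup" is plain row indexing: one row of W_E per token.
embeddings = W_E[token_ids]

print(embeddings.shape)  # (3, 4)
```

The same ID always yields the same starting vector (rows 0 and 2 here are identical). Everything that makes a token's representation depend on context happens in the layers that follow.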

Attention: Q, K, V in detail

At layer 0, the embedding vectors produced by W_E enter the attention block. At every subsequent layer, what enters is the residual stream: the token’s accumulated representation built up by all previous layers. In both cases, each token’s current vector is projected into three new vectors using three learned weight matrices, W_Q, W_K, and W_V:

| Vector | Matrix | Role |
| --- | --- | --- |
| Q (Query) | W_Q | “What am I looking for?” |
| K (Key) | W_K | “What do I offer?” |
| V (Value) | W_V | “What information do I carry?” |

W_Q, W_K, and W_V do not directly store factual knowledge. Instead, they learn how to route information between tokens, encoding structural patterns such as syntax, word relationships, and attention behavior.

Once Q, K, and V have been computed, attention compares the Q vector of the current token against the K vectors of every token it can see (itself and all earlier tokens) using a dot product, scaled by 1/√d_head to keep the magnitudes stable before normalizing. Those scores are then normalized via softmax so they sum to 1 and used to produce a weighted sum of the corresponding V vectors, collecting information from the most relevant tokens. A fourth matrix, W_O, then projects that result back to d_model so it can be added to the residual stream.
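The computation described above can be sketched for a single head in NumPy. This is an illustrative sketch, not any particular library's implementation: the weights are random, the sizes are toy values, and the matrices are stored as [in, out] for readability (model files usually store them as [out, in]).

```python
import numpy as np

def softmax(x):
    # Subtract the row max for numerical stability; each row sums to 1.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def single_head_attention(x, W_Q, W_K, W_V, W_O):
    """x: [seq_len, d_model]. Returns (output [seq_len, d_model], weights)."""
    Q, K, V = x @ W_Q, x @ W_K, x @ W_V        # each [seq_len, d_head]
    d_head = Q.shape[-1]
    scores = (Q @ K.T) / np.sqrt(d_head)       # [seq_len, seq_len]
    # Causal mask: a token may only attend to itself and earlier tokens.
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    weights = softmax(scores)                  # attention weights per token
    return (weights @ V) @ W_O, weights        # weighted sum of V, then W_O

rng = np.random.default_rng(0)
seq_len, d_model, d_head = 5, 8, 4             # toy sizes
x = rng.standard_normal((seq_len, d_model))
W_Q, W_K, W_V = (rng.standard_normal((d_model, d_head)) for _ in range(3))
W_O = rng.standard_normal((d_head, d_model))

out, weights = single_head_attention(x, W_Q, W_K, W_V, W_O)
print(out.shape)              # (5, 8)
print(weights.sum(axis=-1))   # every row sums to 1
```

The first token can only attend to itself, so its attention weight on position 0 is exactly 1.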

Multi-head attention. In practice, each attention block runs multiple attention operations in parallel, each called a head. Every head learns to focus on a different type of relationship: one might track subject-verb dependencies, another pronoun references, another word order. The outputs of all heads are concatenated and projected through W_O.

Modern Llama-style models, including SmolLM2-135M, use Grouped-Query Attention (GQA): there are more query heads than key/value heads, and several query heads share the same K and V. SmolLM2-135M has 9 query heads and 3 KV heads, with a head dimension of 64. That gives Q a total width of 9 × 64 = 576 (matching d_model), while K and V are only 3 × 64 = 192 wide, which is why their projections appear as [192, 576] rather than [576, 576] in the tensor dump later on this page. GQA exists primarily to shrink the KV cache; the KV cache section returns to it.
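The arithmetic above, plus the head-sharing pattern, can be checked in a few lines. The contiguous grouping shown here (query heads 0 to 2 share KV head 0, and so on) is an assumption about the usual Llama-style layout, not something read from the model file.

```python
# SmolLM2-135M attention configuration, as described above.
n_q_heads, n_kv_heads, head_dim = 9, 3, 64

d_model = n_q_heads * head_dim         # total width of Q
kv_width = n_kv_heads * head_dim       # total width of K (and of V)
group_size = n_q_heads // n_kv_heads   # query heads sharing one KV head

# Assumed contiguous grouping: query head i reads KV head i // group_size.
q_to_kv = [q // group_size for q in range(n_q_heads)]

print(d_model, kv_width, group_size)   # 576 192 3
print(q_to_kv)                         # [0, 0, 0, 1, 1, 1, 2, 2, 2]
```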

MLP block

The output of the attention block is passed to the MLP. Where attention figures out which tokens are relevant to each other, the MLP applies learned transformations to the resulting representation, capturing associations, patterns, and higher-level features learned during training. It expands the vector to a larger internal dimension, filters and reshapes the values, then compresses back down.
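In Llama-style models the "expand, filter, compress" pattern is a gated MLP using the W_gate, W_up, and W_down tensors listed later in this article, with SiLU as the activation. A minimal NumPy sketch, with random weights, toy sizes, and [in, out] weight layout chosen for readability:

```python
import numpy as np

def silu(x):
    # SiLU (also called swish): x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def gated_mlp(x, W_gate, W_up, W_down):
    """x: [seq_len, d_model] -> [seq_len, d_model]."""
    gate = silu(x @ W_gate)        # [seq_len, d_ffn], acts as the filter
    up = x @ W_up                  # [seq_len, d_ffn], the expanded values
    return (gate * up) @ W_down    # elementwise gating, then compress

rng = np.random.default_rng(0)
d_model, d_ffn = 8, 24             # toy sizes
x = rng.standard_normal((3, d_model))
W_gate = rng.standard_normal((d_model, d_ffn))
W_up = rng.standard_normal((d_model, d_ffn))
W_down = rng.standard_normal((d_ffn, d_model))

y = gated_mlp(x, W_gate, W_up, W_down)
print(y.shape)  # (3, 8)
```

Unlike attention, the MLP treats each token independently: running it on one token alone gives the same result as running it on the whole sequence and picking that token's row.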

How the pieces fit together

Before listing all tensors, it’s important to get an overview of the entire process:

  • W_E (token embedding) runs once per token, at the very start of the forward pass, before any layer is involved
  • Attention + MLP run once per layer, repeated N times for every token

Every token, whether part of the input prompt or newly generated, goes through its own forward pass: W_E lookup first, then through all N layers.

Each token's forward pass
│
├── W_E lookup  ← once, before any layer
│     └── token ID → embedding vector
│
├── Layer 0
│   ├── Attention (Q, K, V, W_O)
│   └── MLP (W_up, W_gate, W_down)
│
├── Layer 1
│   ├── Attention
│   └── MLP
│
│   ... repeated for all N layers ...
│
├── Layer N-1
│   ├── Attention
│   └── MLP
│
├── final_norm
└── W_unembed → one score per vocabulary token

W_E and W_unembed are global: they sit outside the layer stack. Everything in between (attention + MLP) repeats once per layer. (In some models these two are tied: the same matrix is reused at input and output.)

Full tensor inventory per layer

One more dimension appears in the tensor tables: d_ffn, the larger intermediate size the MLP expands into before compressing back down to d_model. This wider space gives the model more room to detect and combine patterns.

The tensor names below are conceptual labels. In practice, naming varies by model: Llama uses mlp.up_proj / mlp.gate_proj / mlp.down_proj, GPT-2 uses mlp.c_fc and mlp.c_proj. The structure is the same; only the names differ.

Each layer contains the same set of tensors:

Attention block

| Tensor | Role |
| --- | --- |
| W_Q | Query projection |
| W_K | Key projection |
| W_V | Value projection |
| W_O | Projects attention output back to d_model |
| norm1.weight | Norm before attention |

MLP block

| Tensor | Role |
| --- | --- |
| W_up | Expands from d_model → d_ffn |
| W_gate | Controls information flow (gating) |
| W_down | Compresses back from d_ffn → d_model |
| norm2.weight | Norm before MLP |

Global tensors (outside layers)

| Tensor | Role |
| --- | --- |
| W_E | Token embedding table |
| final_norm.weight | Norm after the last layer |
| W_unembed (lm_head) | Final projection from d_model to vocab size → logits |

A note on positional information. Position is injected into attention via RoPE (Rotary Position Embedding), which rotates the Q and K vectors based on each token’s position in the sequence. RoPE is a computation applied at every layer using fixed (non-learned) sinusoidal frequencies. It has no entry in the tensor dump because nothing is stored on disk for it. Without it, attention would have no sense of word order.
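A rough sketch of the rotation, assuming NumPy, the adjacent-pair layout of the original RoPE formulation, and a base of 10,000 (real models choose their own base, and some implementations pair dimensions differently):

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Rotate consecutive pairs of x by position-dependent angles.
    x: [seq_len, d_head] with even d_head; positions: [seq_len]."""
    d_head = x.shape[-1]
    # One fixed frequency per pair of dimensions; nothing here is learned.
    freqs = base ** (-np.arange(0, d_head, 2) / d_head)   # [d_head / 2]
    angles = positions[:, None] * freqs[None, :]          # [seq_len, d_head / 2]
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin             # 2D rotation per pair
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out

rng = np.random.default_rng(0)
q = rng.standard_normal((1, 8))
k = rng.standard_normal((1, 8))

def score(q_pos, k_pos):
    # Attention score between q placed at q_pos and k placed at k_pos.
    return (rope(q, np.array([q_pos])) @ rope(k, np.array([k_pos])).T).item()

# Same distance (3 tokens apart), different absolute positions.
print(np.isclose(score(5, 2), score(13, 10)))  # True
```

The check at the end shows the property that makes RoPE useful: the Q·K score depends on how far apart two tokens are, not on their absolute positions.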

A note on layer norms. The norm1 and norm2 tensors perform RMS normalization (a simplified variant of layer norm used in Llama-style models): they rescale the token vectors before each sub-block to keep values in a stable numerical range. Without them, values can drift toward zero or explode as they pass through many layers.
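RMS normalization is short enough to write out in full. The sketch below assumes NumPy; the `eps` value is an illustrative choice, and the all-ones weight stands in for a trained norm1.weight or norm2.weight.

```python
import numpy as np

def rms_norm(x, weight, eps=1e-5):
    """x: [seq_len, d_model]; weight: [d_model]."""
    # Divide each token vector by its root-mean-square,
    # then apply the learned per-dimension scale.
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return (x / rms) * weight

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 8)) * 50.0    # deliberately large values
weight = np.ones(8)                       # placeholder for trained weights

y = rms_norm(x, weight)
print(np.sqrt(np.mean(y * y, axis=-1)))   # each value is close to 1.0
```

Whatever the scale of the input, each token vector leaves the norm with a root-mean-square of about 1, which is what keeps values from drifting as they pass through many layers.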

For reference, approximate parameter counts in Llama 3 8B:

| Component | Parameters |
| --- | --- |
| Token embeddings (embed_tokens) | ~525M |
| lm_head (output projection) | ~525M |
| Attention (×32 layers, with GQA) | ~1.34B |
| MLP (×32 layers, SwiGLU) | ~5.64B |
| Norms | negligible |
| Total | ~8B |
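These counts can be reproduced from the model's dimensions. The configuration values below are assumptions taken from the published Llama 3 8B config (d_ffn of 14,336, 32 query heads, 8 KV heads, head dimension 128); only the vocabulary size and d_model appear earlier in this article.

```python
# Assumed Llama 3 8B configuration.
vocab, d_model, d_ffn = 128_256, 4_096, 14_336
n_layers, n_q_heads, n_kv_heads, head_dim = 32, 32, 8, 128

embed = vocab * d_model
lm_head = vocab * d_model                    # untied: a separate matrix
attn_per_layer = (
    d_model * n_q_heads * head_dim           # W_Q
    + 2 * d_model * n_kv_heads * head_dim    # W_K and W_V (smaller with GQA)
    + n_q_heads * head_dim * d_model         # W_O
)
mlp_per_layer = 3 * d_model * d_ffn          # W_gate, W_up, W_down
norms = n_layers * 2 * d_model + d_model     # two per layer plus final_norm

attn = n_layers * attn_per_layer
mlp = n_layers * mlp_per_layer
total = embed + lm_head + attn + mlp + norms

for name, n in [("embed", embed), ("lm_head", lm_head), ("attention", attn),
                ("mlp", mlp), ("norms", norms), ("total", total)]:
    print(f"{name:10s} {n:>14,}")
```

Under these assumptions the total comes to 8,030,261,248 parameters, with the MLP blocks accounting for roughly 70% of it.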

The full forward pass

Before reading the diagram, one key concept: the residual stream. Rather than each layer replacing the token vector, every sub-block only adds its output to the existing vector. This makes it difficult for later layers to completely erase earlier information, and encourages the model to build progressively on top of existing representations.

Input tokens  (one per word or subword)
  ↓  W_E lookup
Embeddings  (one vector per token)
  ↓
━━━[ Layer 0 ]━━━━━━━━━━━━━━━━━━━
  ↓  norm1                            ← stabilize values before attention
  ↓  × W_Q, W_K, W_V  →  Q, K, V
  ↓  RoPE applied to Q and K          ← inject positional information
  ↓  compare Q against all K vectors  ← compute attention scores
  ↓  softmax over scores              ← each weight says how much to attend to each token
  ↓  weighted sum of V                ← collect information from relevant tokens
  ↓  × W_O                            ← project back to d_model
  ↓  residual add                     ← add attention output to the token vector
  ↓  norm2                            ← stabilize before MLP
  ↓  × W_gate  →  expand to d_ffn
  ↓  activation function (SiLU) on gate branch
  ↓  × W_up    →  expand to d_ffn (parallel branch)
  ↓  elementwise multiply gate × up   ← the "gating" mechanism
  ↓  × W_down  →  compress to d_model
  ↓  residual add                     ← add MLP output to the token vector
━━━[ Layers 1..N-1 ]━━━━━━━━━━━━━
  ↓  (same structure repeated)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
  ↓  final_norm
  ↓  × W_unembed
Output: one score per token in the vocabulary

The final output is a list of logits: one raw score per token in the vocabulary. A high score means the model considers that token a likely continuation. These scores are then converted to probabilities with softmax, from which the next token is selected via sampling or other decoding strategies.
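The last step, from logits to a chosen token, can be sketched with the standard library alone. The five-token vocabulary and the logit values are made up for illustration.

```python
import math
import random

def softmax(logits):
    m = max(logits)                          # subtract max for stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical tiny vocabulary and made-up logits.
vocab = ["the", "cat", "sat", "mat", "."]
logits = [2.0, 0.5, 1.0, -1.0, 0.1]

probs = softmax(logits)

# Greedy decoding: always pick the most probable token.
greedy = vocab[probs.index(max(probs))]

# Sampling: draw a token in proportion to its probability.
rng = random.Random(0)
sampled = rng.choices(vocab, weights=probs, k=1)[0]

print(greedy)  # the
```

Greedy decoding is deterministic, while sampling can return any token with non-zero probability, which is where variety between runs comes from.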

Inspecting tensors in practice

Everything above can be verified on a real model. The example below uses SmolLM2-135M, a small openly available model that follows the same Llama-style architecture. No Hugging Face account is needed.

pip install torch safetensors huggingface_hub

# Download the model. With recent versions of huggingface_hub:
hf download HuggingFaceTB/SmolLM2-135M --local-dir ./smollm
# Older versions: huggingface-cli download HuggingFaceTB/SmolLM2-135M --local-dir ./smollm

inspect_tensors.py:
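import re
from safetensors import safe_open

def natural_key(s):
    # Split out digit runs so that "layers.10" sorts after "layers.2".
    return [int(t) if t.isdigit() else t for t in re.split(r'(\d+)', s)]

# Open the file lazily; tensor data is not loaded at this point.
with safe_open("./smollm/model.safetensors", framework="pt") as f:
    for key in sorted(f.keys(), key=natural_key):
        # get_slice gives a lazy handle, so only the shape is read.
        shape = f.get_slice(key).get_shape()
        print(f"{key:55s} {shape}")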

This reads shapes directly from the file header without loading the weights into memory. The output shows every tensor with its real name and shape:


$ python inspect_tensors.py
model.embed_tokens.weight                               [49152, 576]
model.layers.0.input_layernorm.weight                   [576]
model.layers.0.mlp.down_proj.weight                     [576, 1536]
model.layers.0.mlp.gate_proj.weight                     [1536, 576]
model.layers.0.mlp.up_proj.weight                       [1536, 576]
model.layers.0.post_attention_layernorm.weight          [576]
model.layers.0.self_attn.k_proj.weight                  [192, 576]
model.layers.0.self_attn.o_proj.weight                  [576, 576]
model.layers.0.self_attn.q_proj.weight                  [576, 576]
model.layers.0.self_attn.v_proj.weight                  [192, 576]
model.layers.1.input_layernorm.weight                   [576]
model.layers.1.mlp.down_proj.weight                     [576, 1536]
model.layers.1.mlp.gate_proj.weight                     [1536, 576]
model.layers.1.mlp.up_proj.weight                       [1536, 576]
model.layers.1.post_attention_layernorm.weight          [576]
model.layers.1.self_attn.k_proj.weight                  [192, 576]
model.layers.1.self_attn.o_proj.weight                  [576, 576]
model.layers.1.self_attn.q_proj.weight                  [576, 576]
model.layers.1.self_attn.v_proj.weight                  [192, 576]

... (layers 2–28 have identical structure) ...

model.layers.29.input_layernorm.weight                  [576]
model.layers.29.mlp.down_proj.weight                    [576, 1536]
model.layers.29.mlp.gate_proj.weight                    [1536, 576]
model.layers.29.mlp.up_proj.weight                      [1536, 576]
model.layers.29.post_attention_layernorm.weight         [576]
model.layers.29.self_attn.k_proj.weight                 [192, 576]
model.layers.29.self_attn.o_proj.weight                 [576, 576]
model.layers.29.self_attn.q_proj.weight                 [576, 576]
model.layers.29.self_attn.v_proj.weight                 [192, 576]
model.norm.weight                                       [576]

You can map every row back to the concepts above: q_proj is W_Q, o_proj is W_O, gate_proj and up_proj are the MLP expansion, input_layernorm is norm1, and model.norm.weight is the final norm before lm_head. The shapes make the dimensions concrete: [576, 576] for Q means d_model=576, and [1536, 576] for the MLP expansion means d_ffn=1536, roughly 2.7× d_model for this small model. K and V show [192, 576] rather than [576, 576] because of Grouped-Query Attention: the model has 9 query heads and 3 KV heads, each with a head dimension of 64, so K and V total 3 × 64 = 192.

One thing missing from the dump: there’s no lm_head.weight. SmolLM2-135M uses tied embeddings: the same matrix used to look up tokens at the input is reused (transposed) at the output to produce logits. Larger Llama models don’t tie, which is why their parameter count includes a separate lm_head of the same size as the embedding table.
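The shapes in the dump are also enough to recover the "135M" in the model's name. The sketch below copies the shapes by hand rather than reading the file, so it runs without the download; with tied embeddings there is no separate lm_head to count.

```python
from math import prod

# Shapes copied from the dump above; every layer has the same nine tensors.
per_layer = {
    "input_layernorm": [576],
    "post_attention_layernorm": [576],
    "self_attn.q_proj": [576, 576],
    "self_attn.k_proj": [192, 576],
    "self_attn.v_proj": [192, 576],
    "self_attn.o_proj": [576, 576],
    "mlp.gate_proj": [1536, 576],
    "mlp.up_proj": [1536, 576],
    "mlp.down_proj": [576, 1536],
}
global_tensors = {"embed_tokens": [49152, 576], "norm": [576]}
n_layers = 30  # layers 0..29

layer_params = sum(prod(shape) for shape in per_layer.values())
global_params = sum(prod(shape) for shape in global_tensors.values())
total = n_layers * layer_params + global_params

print(f"{total:,}")  # 134,515,008
```

About 28 million of those parameters sit in the embedding table alone, a much larger share than in an 8B model.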

The KV cache

To understand why the KV cache exists, start with the problem it solves.

Attention works by comparing the current token against every previous token in the sequence. When the model generates a new word, it needs to look back at everything that came before to decide what comes next. That comparison relies on the K and V vectors of every previous token.

Without a cache, the model would recompute K and V for all previous tokens on every single generation step. For a 500-token prompt, generating the first new word means computing K and V for 500 tokens. Generating the second word means doing it again for 501 tokens. By the 100th generated word, that becomes a huge amount of redundant work.

The solution is simple: save K and V the first time you compute them, and reuse them at every subsequent step.

Q is not cached. The query vector asks “what am I looking for right now?” It only makes sense for the current token and is thrown away after use. K and V, on the other hand, represent what each past token offers and carries. Once computed for a given token at a specific position in the sequence, they remain fixed and can be reused for all future generation steps.

Processing the prompt (500 tokens):
  Compute K and V for all 500 tokens
  Store them in the cache (500 rows per layer)

Generating token 501:
  Compute K and V for this new token only
  Append to cache (now 501 rows)
  Use Q of token 501 to compare against all 501 cached K vectors
  Collect the corresponding V vectors to produce the output

Generating token 502:
  Compute K and V for this new token only
  Append to cache (now 502 rows)
  ... and so on

The cache only ever grows. Past entries are never rewritten. Each new token adds one K and one V vector per layer.
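The steps above can be sketched as a toy cache for one layer and one head, assuming NumPy and random weights. It omits RoPE and multi-head details, and it grows the arrays with `vstack` for clarity where a real implementation would typically preallocate memory.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attend(q, K, V):
    """One query [d_head] against K, V of shape [n, d_head]."""
    scores = K @ q / np.sqrt(q.shape[-1])
    return softmax(scores) @ V

rng = np.random.default_rng(0)
d_model, d_head = 8, 4                        # toy sizes
W_Q, W_K, W_V = (rng.standard_normal((d_model, d_head)) for _ in range(3))

class KVCache:
    """Toy cache for a single layer and a single head."""
    def __init__(self):
        self.K = np.empty((0, d_head))
        self.V = np.empty((0, d_head))

    def append(self, x):
        # x: [n_new, d_model]. Only the new tokens are projected.
        self.K = np.vstack([self.K, x @ W_K])
        self.V = np.vstack([self.V, x @ W_V])

prompt = rng.standard_normal((5, d_model))    # stand-in for 5 prompt tokens
cache = KVCache()
cache.append(prompt)                          # prompt processing: 5 rows

new_token = rng.standard_normal((1, d_model))
cache.append(new_token)                       # generation step: 6 rows
out_cached = attend(new_token[0] @ W_Q, cache.K, cache.V)

# Reference: recompute K and V for all 6 tokens from scratch.
full = np.vstack([prompt, new_token])
out_recomputed = attend(new_token[0] @ W_Q, full @ W_K, full @ W_V)

print(cache.K.shape)                             # (6, 4)
print(np.allclose(out_cached, out_recomputed))   # True
```

Both paths give the same output; the cached one simply skips five of the six K and V projections.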

This has a real memory cost. The cache holds K and V for every token, across every layer of the model. On large models with long contexts, this easily reaches several GB of GPU memory. That cost is the main bottleneck for high concurrency and long contexts, and the direct reason tools like vLLM’s PagedAttention exist. It’s also why GQA (introduced earlier) matters so much in production: by sharing K and V across multiple query heads, GQA shrinks the cache by the ratio of query heads to KV heads. That is a 3× reduction for SmolLM2-135M (9 query heads, 3 KV heads) and a 4× reduction for Llama 3 8B (32 query heads, 8 KV heads).
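A back-of-the-envelope calculation makes the cost concrete. The sketch assumes 16-bit values (2 bytes each) and an 8,192-token context. The SmolLM2-135M shape comes from the tensor dump above, while the Llama 3 8B shape (32 layers, 8 KV heads, head dimension 128) is an assumption taken from its published config.

```python
def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    # Factor 2: one K vector and one V vector per token, layer, and KV head.
    return 2 * n_tokens * n_layers * n_kv_heads * head_dim * bytes_per_value

MiB, GiB = 1024 ** 2, 1024 ** 3
n_tokens = 8192

# SmolLM2-135M: 30 layers, head_dim 64, 3 KV heads (9 if K/V were not shared).
smol_gqa = kv_cache_bytes(n_tokens, n_layers=30, n_kv_heads=3, head_dim=64)
smol_no_gqa = kv_cache_bytes(n_tokens, n_layers=30, n_kv_heads=9, head_dim=64)

# Llama 3 8B (assumed shape): 32 layers, 8 KV heads, head_dim 128.
llama_gqa = kv_cache_bytes(n_tokens, n_layers=32, n_kv_heads=8, head_dim=128)

print(smol_gqa / MiB, smol_no_gqa / MiB)  # 180.0 540.0
print(llama_gqa / GiB)                    # 1.0
```

These figures are per sequence, so serving many requests at once multiplies them accordingly.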

Prefill vs decode

It helps to separate two levels of structure. Attention → MLP is the inner structure: what happens inside each transformer layer on every forward pass. Prefill and decode are the outer structure: the two phases that govern how the forward pass is applied across an entire generation.

Inference
├── Prefill  (once, all prompt tokens processed in parallel)
│   └── Forward pass on the full prompt
│       └── Each layer: Attention → MLP
│           └── Produces the KV cache
│
└── Decode  (repeated, one new token per step)
    └── Forward pass on a single token
        └── Each layer: Attention (reads KV cache) → MLP
            └── Appends one K, V row to the cache

The KV cache is the link between the two phases: prefill creates it, and decode consumes and grows it one row at a time.

The forward pass itself is identical in both phases. What changes is what it operates on: the full prompt sequence during prefill, a single token during decode.

Every generation has two distinct phases with very different performance profiles:

| | Prefill | Decode |
| --- | --- | --- |
| Input | Entire prompt at once | One token at a time |
| Parallelism | Full: all tokens processed in parallel | Sequential within a sequence |
| Bottleneck | Compute (GPU tensor cores) | Memory bandwidth (reading weights and KV cache) |
| Output | First token + populated KV cache | One new token per step |

Prefill is fast but compute-heavy. Decode is slow because it’s sequential and memory-bound: each step requires reading the model weights and the entire KV cache from GPU memory, even though only a small fraction of the cache changes.
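A rough estimate shows why memory bandwidth sets the pace of decode. Every number below is hypothetical (8 billion 16-bit parameters, a 1 GiB KV cache, 1 TB/s of GPU memory bandwidth, one sequence at a time), and the result is a lower bound that ignores compute and other overheads.

```python
def decode_step_ms(n_params, bytes_per_param, kv_cache_bytes, bandwidth_bytes_per_s):
    # Lower bound: every weight and the whole KV cache are read once per step.
    bytes_read = n_params * bytes_per_param + kv_cache_bytes
    return 1000 * bytes_read / bandwidth_bytes_per_s

# Hypothetical setup, for illustration only.
ms = decode_step_ms(
    n_params=8e9,
    bytes_per_param=2,
    kv_cache_bytes=1024 ** 3,
    bandwidth_bytes_per_s=1e12,
)
print(f"{ms:.1f} ms per token, at most {1000 / ms:.0f} tokens/s")
```

Under these assumptions each step takes about 17 ms just to move data, however fast the arithmetic units are. Prefill reads the same weights once but spreads that cost over every prompt token.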

This distinction explains several advanced techniques:

  • Speculative decoding: a small draft model proposes several tokens one at a time, then the large model verifies them all at once in a single prefill-like parallel pass
  • Chunked prefill: split a long prompt into chunks to avoid stalling the GPU on a single massive prefill
  • Prefix caching: if many requests share the same system prompt, compute and cache its KV once and reuse it

Roles: where knowledge lives

| Location | What it contributes |
| --- | --- |
| W_E embeddings | Initial semantic representation of tokens |
| MLP weights | Learned associations, patterns, and transformations |
| Attention weights | How information flows between tokens (structure, relationships) |

In practice these are not cleanly separated: everything is distributed across the network, and each component shapes the final representation. W_Q, W_K, W_V in particular encode structural routing strategies rather than facts, but that routing is still a form of learned knowledge, and interpretability research has shown that specific attention heads can encode surprisingly concrete circuits. Fine-tuning techniques such as LoRA build on this layout: they freeze the original weight matrices and train small low-rank matrices added alongside a chosen subset of them, which is much cheaper than retraining the entire model.
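A minimal sketch of the LoRA idea, assuming NumPy: the pretrained matrix W stays frozen, and two small matrices A and B (names from the LoRA formulation) supply a low-rank update. The rank `r`, the scaling `alpha`, and the choice of a 576 × 576 matrix are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 576, 576, 8, 16     # r and alpha are illustrative

W = rng.standard_normal((d_out, d_in))      # frozen pretrained weight (e.g. W_Q)
A = rng.standard_normal((r, d_in)) * 0.01   # trainable, small random init
B = np.zeros((d_out, r))                    # trainable, zero init

def lora_forward(x):
    # Original path plus a low-rank correction; W itself is never updated.
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

x = rng.standard_normal((2, d_in))
# With B = 0 the adapter is a no-op, so training starts from the base model.
print(np.allclose(lora_forward(x), x @ W.T))   # True

full, lora = W.size, A.size + B.size
print(full, lora, f"{lora / full:.1%}")        # 331776 9216 2.8%
```

For this one matrix the adapter trains under 3% as many values as full fine-tuning would, and the saving grows with the size of the matrix.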