How LLMs Work

Large language models (LLMs) are the technology behind ChatGPT, Claude, Llama, and others, all based on the transformer architecture. Understanding how they work helps you make better decisions when running them, whether you’re choosing between tools, tuning inference parameters, or debugging unexpected behaviour. This page covers the essentials, no prior ML knowledge needed.

What an LLM does

An LLM does one thing:

Given a sequence of tokens, it predicts a probability distribution over the next token.

Everything else comes from repeating this operation in a loop: answering questions, writing code, reasoning. The model generates one token, appends it to the sequence, and runs again.
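As a rough illustration, the loop looks like the sketch below. `model` and `sample` are placeholders standing in for the neural network and the sampling step covered later on this page, not a real API:

```python
# Minimal sketch of autoregressive generation (model/sample are hypothetical placeholders).
def generate(model, tokens, max_new_tokens, eos_token):
    for _ in range(max_new_tokens):
        probs = model(tokens)          # probability distribution over the next token
        next_token = sample(probs)     # pick one token from that distribution
        tokens.append(next_token)      # append it and run the model again
        if next_token == eos_token:    # stop when the model emits end-of-sequence
            break
    return tokens
```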

A token is a chunk of text (often a word, a part of a word, or even a punctuation mark) produced by a tokenizer. The vocabulary is typically around 128,000 tokens for modern models. For example, the word “Kubernetes” might be a single token whereas “kubectl” might be split into several.
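If you want to see tokenization in action, the snippet below uses the tiktoken library as one example among many; the exact splits and vocabulary size depend on the tokenizer, so treat the output as illustrative rather than representative of any particular model:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")    # one widely used tokenizer
print(enc.n_vocab)                            # vocabulary size (~100k for this encoding)

for word in ["Kubernetes", "kubectl"]:
    ids = enc.encode(word)
    # Show how many tokens each word becomes and what the pieces look like
    print(word, "->", ids, [enc.decode([i]) for i in ids])
```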

Training vs inference

Before a model can predict anything useful, it must be trained. Training exposes the model to vast amounts of text and gradually adjusts its internal parameters (the weights) so that its predictions improve over time. Training is usually finished long before you download and run a model: the weights are fixed, and inference is just the model applying what it has already learned.
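To make that concrete, here is a deliberately tiny, runnable PyTorch toy (a single embedding plus a linear layer, nowhere near a real transformer) that shows the core idea of training: predict the next token, measure the error, nudge the weights:

```python
import torch
import torch.nn as nn

# Toy next-token model trained on a repeating sequence (stand-in for "vast amounts of text").
vocab_size, dim = 10, 32
data = torch.tensor([1, 2, 3, 4, 5, 6, 7, 8, 9] * 20)

model = nn.Sequential(nn.Embedding(vocab_size, dim), nn.Linear(dim, vocab_size))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

for step in range(200):
    logits = model(data[:-1])                              # predict each next token
    loss = nn.functional.cross_entropy(logits, data[1:])   # compare with the actual next tokens
    optimizer.zero_grad()
    loss.backward()                                        # compute gradients
    optimizer.step()                                       # adjust the weights slightly
```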

The transformer layer

Modern LLMs are built on the transformer architecture, first introduced in the 2017 paper Attention Is All You Need. A transformer model is a stack of identically structured layers (for example, 32 for an 8B model, 80 for a 70B model). Each layer has two sub-blocks that do different things:

```mermaid
graph LR
  A[Input] --> B["Attention\n(context)"]
  B --> C["MLP\n(knowledge)"]
  C --> D[Output]
```

Attention lets each token look at the other tokens in the sequence and decide which ones are relevant to it. When you write “the cat sat on the mat because it was tired”, attention is what allows the model to link “it” to “the cat” rather than “the mat”.
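A minimal numpy sketch of the core computation (scaled dot-product attention) is below; real models add learned projections, multiple heads, and masking on top of this:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Q, K, V: (sequence_length, d) matrices of queries, keys, values.
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # how relevant each token is to each other token
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ V                        # weighted mix of the other tokens' values
```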

That contextually enriched representation is then passed to the feed-forward network (often called the MLP). Think of it as the model’s long-term memory: it’s where learned associations are stored, baked in during training and frozen at inference time. For example, when the model reads “the Eiffel Tower”, attention identifies that we’re talking about a specific entity, and the MLP supplies everything the model learned about it: that it’s in Paris, 330 metres tall, and more. The MLP makes up roughly 60% of the model’s parameters.

In summary: attention figures out what’s relevant, whereas the MLP knows what things mean.

Each layer doesn’t replace its input; it adds to it. A single vector per token (the residual stream) flows through all layers, accumulating updates, until it’s ready to be turned into a prediction. You can think of it as the model’s working memory for each token.
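In code, one layer looks roughly like the sketch below (a pre-norm variant). `attention`, `mlp`, and `layer_norm` stand in for the real sub-blocks; the key point is the `x = x + ...` pattern that keeps adding to the residual stream instead of replacing it:

```python
# Sketch of how one transformer layer updates the residual stream (pre-norm variant).
def transformer_layer(x, attention, mlp, layer_norm):
    x = x + attention(layer_norm(x))   # mix in context from other tokens
    x = x + mlp(layer_norm(x))         # mix in stored knowledge
    return x                           # same shape as the input: the enriched residual stream
```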

From output to token: sampling

After all layers have processed the input, the model produces logits: one score per token in the vocabulary, representing how likely each token is to come next. These scores go through a sampling pipeline:

```
Logits  [one score per vocabulary token]
  ↓  ÷ Temperature
  ↓  Top-K filter   (keep only the top K candidates)
  ↓  Top-P filter   (keep the smallest set covering P% of probability)
  ↓  Softmax        (convert scores to probabilities)
  ↓  Sample         (draw one token)
```

Note: the exact order can vary slightly between implementations, but the idea is the same.
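For reference, here is one way to implement that pipeline with numpy. The default parameter values and the exact filtering order are illustrative (they follow the diagram above), not those of any particular runtime:

```python
import numpy as np

def sample_next_token(logits, temperature=0.8, top_k=40, top_p=0.95):
    logits = np.array(logits, dtype=np.float64)

    # Temperature = 0: greedy decoding, always take the single best token.
    if temperature == 0:
        return int(np.argmax(logits))
    logits = logits / temperature

    # Top-K: keep only the K highest-scoring candidates.
    if top_k is not None and top_k < len(logits):
        cutoff = np.sort(logits)[-top_k]
        logits[logits < cutoff] = -np.inf

    # Softmax: turn scores into probabilities.
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    # Top-P: keep the smallest set of tokens whose cumulative probability reaches P.
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    keep = order[: np.searchsorted(cumulative, top_p) + 1]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    filtered /= filtered.sum()

    # Sample: draw one token from the remaining distribution.
    return int(np.random.choice(len(filtered), p=filtered))
```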

The sampling parameters directly affect the model’s behaviour:

| Parameter | Effect |
|---|---|
| Temperature = 0 | Always pick the top token (fully deterministic) |
| Temperature < 1 | More confident and focused |
| Temperature > 1 | More creative and unpredictable |
| Top-K | Hard cap: only consider the K most likely tokens |
| Top-P | Soft cap: keep tokens until cumulative probability reaches P |

This is why setting temperature=0 in an API call gives you reproducible outputs, and why raising it makes the model more creative (and less reliable).
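For example, with the OpenAI Python SDK (any OpenAI-compatible endpoint behaves the same way; the model name below is just an example):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",   # example model name
    messages=[{"role": "user", "content": "Explain attention in one sentence."}],
    temperature=0,         # greedy decoding: the same prompt gives (near-)identical output
)
print(response.choices[0].message.content)
```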

Why this matters in practice

Knowing that a model is weights (frozen knowledge) plus attention (dynamic context) makes several common techniques less mysterious:

| Topic | How it connects |
|---|---|
| Quantization | Reducing the numerical precision of the model’s weights (e.g. FP16 → INT8 → 4-bit) to save memory, trading a bit of quality for a lot of VRAM (see the sketch after this table) |
| LoRA / fine-tuning | Adding small trainable layers on top of the frozen model weights to adapt the model to a specific task |
| RAG | Injecting external knowledge into the prompt so the model can reason over it, since the model can only use what’s in its weights and the current context |
| MoE (Mixture of Experts) | Replacing one large MLP with N smaller ones and activating only 1–2 per token, giving comparable quality for less compute |
| Context window limits | The model can only “see” a fixed number of tokens at once; beyond that, older context is lost |
| vLLM / PagedAttention | Managing inference memory efficiently so more requests can be served in parallel |
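To make the quantization row concrete, here is a toy absmax INT8 quantization of a weight matrix in numpy. Real schemes use per-channel or per-group scales and 4-bit formats, but the memory-versus-accuracy trade-off is the same idea:

```python
import numpy as np

# Toy example: quantize FP32 weights to INT8 and measure memory and error (absmax scheme).
weights = np.random.randn(4096, 4096).astype(np.float32)

scale = np.abs(weights).max() / 127.0                   # one scale for the whole tensor
quantized = np.round(weights / scale).astype(np.int8)   # 1 byte per weight instead of 4
dequantized = quantized.astype(np.float32) * scale      # approximate reconstruction

print("memory:", weights.nbytes // 2**20, "MiB ->", quantized.nbytes // 2**20, "MiB")
print("mean absolute error:", np.abs(weights - dequantized).mean())
```
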
In summary: tokens flow through transformer layers → attention figures out what’s relevant → MLP applies stored knowledge → logits score every possible next token → sampling picks one.