Why vLLM?

Ollama works well for local use or a handful of users, but it struggles when many requests arrive at the same time. vLLM solves that.

Running a model to generate a response is called inference. vLLM is an inference server built to handle many requests at once, and it has become a de facto standard for self-hosted model serving.

The problem with basic inference

When a model generates text, it works one token at a time; a token is roughly a word or word fragment. For each new token, the model needs the full context of everything generated so far. Rather than recompute that context every time, it caches the intermediate results (the KV cache) in GPU memory, and this cache grows as the response gets longer.
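
As a rough sketch, here is a toy decoding loop in Python. The next_token function stands in for one forward pass of a real model, and a plain list plays the role of that growing per-request cache; none of this is vLLM code.

def next_token(cache):
    # Toy "model": derive the next token from the cached context.
    return (sum(cache) + len(cache)) % 50_000

def generate(prompt_tokens, max_new_tokens):
    cache = list(prompt_tokens)        # context kept in memory
    for _ in range(max_new_tokens):
        token = next_token(cache)      # uses the full context, no recompute
        cache.append(token)            # the cache grows by one entry per token
    return cache[len(prompt_tokens):]

print(generate([101, 202, 303], max_new_tokens=5))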

With a basic approach:

  • Memory is reserved upfront for the longest possible response, even if the actual response is short
  • Requests are grouped into fixed batches: the server waits until a batch is full, runs the whole batch to completion, then starts the next one

This leads to wasted GPU memory and a GPU sitting idle between batches.
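
A back-of-the-envelope example of the first problem, with made-up numbers:

max_len = 2048                 # slots reserved per request, in tokens
actual = [60, 350, 90, 120]    # tokens each request really produced
reserved = max_len * len(actual)
used = sum(actual)
print(f"memory utilization: {used / reserved:.1%}")   # ~7.6%; the rest sits idle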

PagedAttention

vLLM’s main innovation is PagedAttention. Instead of reserving memory upfront, it allocates memory in small fixed-size pages as tokens are generated.

    graph LR
    R1[Request 1] -->|tokens 1-8| P1[Page 1]
    R1 -->|tokens 9-12| P2[Page 2]
    R2[Request 2] -->|tokens 1-8| P3[Page 3]
    R2 -->|tokens 9-16| P4[Page 4]
    P1 & P2 & P3 & P4 --> GPU[GPU Memory Pool]

Memory from different requests sits side by side with no gaps. When a request finishes, its memory is freed immediately.
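
Here is a minimal sketch of that bookkeeping, assuming a fixed page size of 8 tokens. The PagedAllocator class and its methods are illustrative names; vLLM's real block manager is far more involved.

PAGE_SIZE = 8

class PagedAllocator:
    def __init__(self, num_pages):
        self.free = list(range(num_pages))    # shared pool of GPU pages
        self.pages = {}                       # request id -> its page ids
        self.tokens = {}                      # request id -> token count

    def append_token(self, req):
        n = self.tokens.get(req, 0)
        if n % PAGE_SIZE == 0:                # current page full: grab another
            self.pages.setdefault(req, []).append(self.free.pop())
        self.tokens[req] = n + 1

    def release(self, req):
        # Freed pages go straight back to the pool for other requests.
        self.free.extend(self.pages.pop(req, []))
        self.tokens.pop(req, None)

alloc = PagedAllocator(num_pages=4)
for _ in range(12):
    alloc.append_token("request-1")           # 12 tokens -> two pages
print(alloc.pages["request-1"])
alloc.release("request-1")                    # both pages reusable at once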

Continuous batching

With continuous batching, as soon as one request finishes, a new one takes its place. The GPU stays busy instead of waiting for a full batch to complete.
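
A toy scheduler loop to show the idea (request ids paired with how many tokens each still needs; a real scheduler also weighs memory pressure and priorities):

from collections import deque

MAX_BATCH = 3
waiting = deque([("A", 2), ("B", 5), ("C", 3), ("D", 4)])
running = []                                  # requests currently in the batch

step = 0
while waiting or running:
    # Refill open slots immediately instead of draining the whole batch.
    while waiting and len(running) < MAX_BATCH:
        running.append(waiting.popleft())
    # One decoding step: every running request produces one token.
    running = [(rid, left - 1) for rid, left in running]
    done = [rid for rid, left in running if left == 0]
    running = [(rid, left) for rid, left in running if left > 0]
    step += 1
    if done:
        print(f"step {step}: finished {done}")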

Why a GPU matters

LLMs contain billions of parameters. Generating each token means running calculations across all of them at once, and GPUs are built for exactly that kind of work: where a CPU processes operations largely one after another, a GPU runs thousands of them in parallel.

On a GPU, vLLM can serve thousands of tokens per second across concurrent requests. On a CPU, the same model can take seconds for each token.
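
A GPU is out of reach for a snippet, but the same principle shows up on a CPU when NumPy's vectorized matrix product (all rows at once, through an optimized parallel kernel) is compared against an explicit one-row-at-a-time loop; treat it only as a stand-in for the idea:

import time
import numpy as np

weights = np.random.rand(4096, 4096).astype(np.float32)
x = np.random.rand(4096).astype(np.float32)

t0 = time.perf_counter()
y = weights @ x                                   # all rows "at once"
t1 = time.perf_counter()
y_loop = np.array([row @ x for row in weights])   # one row at a time
t2 = time.perf_counter()
print(f"vectorized: {t1 - t0:.5f}s  loop: {t2 - t1:.5f}s")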

OpenAI-compatible API

vLLM exposes an OpenAI-compatible REST API. Any app using the OpenAI SDK can point to a vLLM instance by changing the base URL:

from openai import OpenAI

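# Only the base URL changes; vLLM ignores the API key unless the
# server was started with one.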
client = OpenAI(
    base_url="http://vllm-service:8000/v1",
    api_key="not-needed",
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-3B-Instruct",
    messages=[{"role": "user", "content": "What is Kubernetes?"}],
)
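
# The response object has the same shape as the OpenAI API's.
print(response.choices[0].message.content)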

Ollama vs vLLM

                        Ollama              vLLM
Target                  Local / dev         Production serving
GPU required            No                  Yes
Multi-user throughput   Limited             High
OpenAI-compatible API   Yes                 Yes
Model formats           GGUF (quantized)    HuggingFace (full precision)

GGUF is a quantized format: smaller files that can run on a CPU, at a slight cost in quality. HuggingFace models are the full-precision originals: higher quality, but they need more memory.

For local use or a single user, Ollama is fine. For production with multiple users, vLLM is the right tool.

Coming soon: the article vLLM on Kubernetes will cover how to deploy vLLM on a Kubernetes cluster with GPU support.