Why vLLM?

Ollama works well for local use or a handful of users, but it struggles when many requests arrive at the same time. vLLM solves that.

Running a model to generate a response is called inference. vLLM is an inference server built to handle many requests at once, and it has become a de facto standard for self-hosted model serving.

The problem with basic inference

When a model generates text, it works one token at a time; a token is roughly a word or word fragment. For each new token, the model needs the full context of everything generated so far. Rather than recompute that context every time, it caches the intermediate results (the KV cache) in GPU memory, and this cache grows as the response gets longer.
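
As a rough sketch, here is a toy decoding loop in Python. The next_token function stands in for one forward pass of a real model, and a plain list plays the role of that growing per-request cache; none of this is vLLM code.

def next_token(cache):
    # Toy "model": derive the next token from the cached context.
    return (sum(cache) + len(cache)) % 50_000

def generate(prompt_tokens, max_new_tokens):
    cache = list(prompt_tokens)        # context kept in memory
    for _ in range(max_new_tokens):
        token = next_token(cache)      # uses the full context, no recompute
        cache.append(token)            # the cache grows by one entry per token
    return cache[len(prompt_tokens):]

print(generate([101, 202, 303], max_new_tokens=5))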

With a basic approach:

  • Memory is reserved upfront for the longest possible response, even if the actual response is short
  • Requests are grouped into fixed batches: the server waits until a batch is full, runs the whole batch to completion, then starts the next one

This leads to wasted GPU memory and a GPU sitting idle between batches.
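
A back-of-the-envelope example of the first problem, with made-up numbers:

max_len = 2048                 # slots reserved per request, in tokens
actual = [60, 350, 90, 120]    # tokens each request really produced
reserved = max_len * len(actual)
used = sum(actual)
print(f"memory utilization: {used / reserved:.1%}")   # ~7.6%; the rest sits idle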

PagedAttention

vLLM’s main innovation is PagedAttention. Instead of reserving memory upfront, it allocates memory in small fixed-size pages as tokens are generated.

    graph LR
    R1[Request 1] -->|tokens 1-8| P1[Page 1]
    R1 -->|tokens 9-12| P2[Page 2]
    R2[Request 2] -->|tokens 1-8| P3[Page 3]
    R2 -->|tokens 9-16| P4[Page 4]
    P1 & P2 & P3 & P4 --> GPU[GPU Memory Pool]

Memory from different requests sits side by side with no gaps. When a request finishes, its memory is freed immediately.
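
Here is a minimal sketch of that bookkeeping, assuming a fixed page size of 8 tokens. The PagedAllocator class and its methods are illustrative names; vLLM's real block manager is far more involved.

PAGE_SIZE = 8

class PagedAllocator:
    def __init__(self, num_pages):
        self.free = list(range(num_pages))    # shared pool of GPU pages
        self.pages = {}                       # request id -> its page ids
        self.tokens = {}                      # request id -> token count

    def append_token(self, req):
        n = self.tokens.get(req, 0)
        if n % PAGE_SIZE == 0:                # current page full: grab another
            self.pages.setdefault(req, []).append(self.free.pop())
        self.tokens[req] = n + 1

    def release(self, req):
        # Freed pages go straight back to the pool for other requests.
        self.free.extend(self.pages.pop(req, []))
        self.tokens.pop(req, None)

alloc = PagedAllocator(num_pages=4)
for _ in range(12):
    alloc.append_token("request-1")           # 12 tokens -> two pages
print(alloc.pages["request-1"])
alloc.release("request-1")                    # both pages reusable at once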

Continuous batching

With continuous batching, as soon as one request finishes, a new one takes its place. The GPU stays busy instead of waiting for a full batch to complete.
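
A toy scheduler loop to show the idea (request ids paired with how many tokens each still needs; a real scheduler also weighs memory pressure and priorities):

from collections import deque

MAX_BATCH = 3
waiting = deque([("A", 2), ("B", 5), ("C", 3), ("D", 4)])
running = []                                  # requests currently in the batch

step = 0
while waiting or running:
    # Refill open slots immediately instead of draining the whole batch.
    while waiting and len(running) < MAX_BATCH:
        running.append(waiting.popleft())
    # One decoding step: every running request produces one token.
    running = [(rid, left - 1) for rid, left in running]
    done = [rid for rid, left in running if left == 0]
    running = [(rid, left) for rid, left in running if left > 0]
    step += 1
    if done:
        print(f"step {step}: finished {done}")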

Why a GPU matters

LLMs contain billions of parameters. Generating each token means running calculations across all of them at once, and GPUs are built for exactly that kind of work: where a CPU processes operations largely one after another, a GPU runs thousands of them in parallel.

On a GPU, vLLM can serve thousands of tokens per second across concurrent requests. On a CPU, the same model can take seconds for each token.
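
A GPU is out of reach for a snippet, but the same principle shows up on a CPU when NumPy's vectorized matrix product (all rows at once, through an optimized parallel kernel) is compared against an explicit one-row-at-a-time loop; treat it only as a stand-in for the idea:

import time
import numpy as np

weights = np.random.rand(4096, 4096).astype(np.float32)
x = np.random.rand(4096).astype(np.float32)

t0 = time.perf_counter()
y = weights @ x                                   # all rows "at once"
t1 = time.perf_counter()
y_loop = np.array([row @ x for row in weights])   # one row at a time
t2 = time.perf_counter()
print(f"vectorized: {t1 - t0:.5f}s  loop: {t2 - t1:.5f}s")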

OpenAI-compatible API

vLLM exposes an OpenAI-compatible REST API. Any app using the OpenAI SDK can point to a vLLM instance by changing the base URL:

from openai import OpenAI

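# Only the base URL changes; vLLM ignores the API key unless the
# server was started with one.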
client = OpenAI(
    base_url="http://vllm-service:8000/v1",
    api_key="not-needed",
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-3B-Instruct",
    messages=[{"role": "user", "content": "What is Kubernetes?"}],
)
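
# The response object has the same shape as the OpenAI API's.
print(response.choices[0].message.content)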

Ollama vs vLLM

                        Ollama              vLLM
Target                  Local / dev         Production serving
GPU required            No                  Yes
Multi-user throughput   Limited             High
OpenAI-compatible API   Yes                 Yes
Model formats           GGUF (quantized)    HuggingFace (full precision)

GGUF is a quantized format: smaller files that can run on a CPU, at a slight cost in quality. HuggingFace models are the full-precision originals: higher quality, but they need more memory.

For local use or a single user, Ollama is fine. For production with multiple users, vLLM is the right tool.

Coming soon: the article vLLM on Kubernetes will cover how to deploy vLLM on a Kubernetes cluster with GPU support.