vLLM on Kubernetes

Prerequisites

We need a Kubernetes cluster with at least one NVIDIA GPU node. We will use an Exoscale SKS cluster made up of a single node with an NVIDIA A40 GPU.

$ kubectl get node
NAME               STATUS   ROLES    AGE     VERSION
pool-5da1e-vnifb   Ready    <none>   2m13s   v1.36.0

GPU setup

Before we run vLLM in a Pod, Kubernetes needs to be aware of the GPU: the NVIDIA drivers must be installed on the node, the CUDA toolkit must be available, and the NVIDIA device plugin must be running so that Pods can request GPU resources.

On a self-managed cluster, the installation of those components is usually handled by the NVIDIA GPU Operator. On Exoscale SKS, this stack is provisioned automatically: GPU nodes come with the NVIDIA drivers, the CUDA toolkit, and the device plugin pre-installed.

The only thing that you may need on top of those components is the DCGM exporter, which exposes the GPU metrics.

The Node’s description shows its GPU capacity.

$ kubectl describe node pool-5da1e-vnifb
...
Capacity:
  cpu:                12
  ephemeral-storage:  102083312Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             57583556Ki
  nvidia.com/gpu:     1
  pods:               110
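
To read only the GPU count instead of scanning the whole description, a JSONPath query can be used. This is a minimal sketch: gpu_capacity is a hypothetical helper name, and the dot in the nvidia.com/gpu resource name has to be escaped in the JSONPath expression.

```shell
# Hypothetical helper: print the number of GPUs a node advertises.
# $1 is the node name, for example pool-5da1e-vnifb.
gpu_capacity() {
  kubectl get node "$1" -o jsonpath='{.status.capacity.nvidia\.com/gpu}'
}
```

Calling gpu_capacity pool-5da1e-vnifb on the cluster above should print the same value as the nvidia.com/gpu line of the description.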

Running a Pod

To verify the GPU can be used, we define the following Pod requesting a GPU.

gpu-test.yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:12.5.0-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1

Then, we run it.

kubectl apply -f gpu-test.yaml
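
Rather than polling kubectl get pod by hand, we can block until the Pod has finished. The sketch below wraps kubectl wait in a hypothetical helper named wait_for_pod_success; the 120s default timeout is an arbitrary choice. The Completed status shown by kubectl get pod corresponds to the Succeeded phase.

```shell
# Hypothetical helper: wait until a Pod reaches the Succeeded phase.
# $1 is the Pod name, $2 an optional timeout (default: 120s, an assumption).
wait_for_pod_success() {
  kubectl wait "pod/$1" \
    --for=jsonpath='{.status.phase}'=Succeeded \
    --timeout="${2:-120s}"
}
```

For the Pod above, this would be called as wait_for_pod_success gpu-test.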

Once this Pod reaches the Completed status, we get the GPU information from its logs. The output confirms that the A40 GPU is visible from inside a container and that no process is using it yet.

$ kubectl logs gpu-test
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 595.58.03              Driver Version: 595.58.03      CUDA Version: 13.2     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A40                     Off |   00000000:01:09.0 Off |                    0 |
|  0%   33C    P8             15W /  300W |       0MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

Deploying vLLM

First, we define the resources needed to deploy vLLM using the official vllm/vllm-openai image. vLLM downloads the model and loads it onto the GPU itself at startup. The PVC is mounted at /data/models, the directory HF_HOME points to, so the downloaded model is cached there and later restarts skip the download.

vllm.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: vllm-models
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: exoscale-sbs
  resources:
    requests:
      storage: 100Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:v0.20.2-cu129-ubuntu2404
        args:
        - Qwen/Qwen2.5-3B-Instruct
        ports:
        - containerPort: 8000
        env:
        - name: HF_HOME
          value: /data/models
        resources:
          limits:
            nvidia.com/gpu: "1"
        volumeMounts:
        - name: models
          mountPath: /data/models
      volumes:
      - name: models
        persistentVolumeClaim:
          claimName: vllm-models
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-server
spec:
  selector:
    app: vllm
  ports:
  - name: http
    port: 8000
    targetPort: 8000

  • storageClassName: exoscale-sbs selects the StorageClass used to provision the volume. This value is specific to Exoscale SKS; on another cluster, use a StorageClass available there.

  • Qwen/Qwen2.5-3B-Instruct is not a gated model on Hugging Face and can be downloaded without authentication. For a gated model, vLLM needs a Hugging Face access token, which we would store in a Secret and expose to the container as an environment variable.

Then, we create those resources.

kubectl apply -f vllm.yaml
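
For a gated model, the access token could be added once the Deployment exists. The following is a sketch under several assumptions: add_hf_token and the Secret name hf-token are hypothetical, and the token is exposed as HF_TOKEN, the environment variable read by the huggingface_hub library. It is not needed for Qwen/Qwen2.5-3B-Instruct.

```shell
# Hypothetical helper: store a Hugging Face token in a Secret and expose it
# to the vllm Deployment as the HF_TOKEN environment variable.
# $1 is the access token.
add_hf_token() {
  kubectl create secret generic hf-token \
    --from-literal=HF_TOKEN="$1" || return 1
  # Scale down first: with a single GPU, a rolling update could not schedule
  # the replacement Pod while the old one still holds the GPU.
  kubectl scale deployment/vllm --replicas=0 || return 1
  # Import every key of the Secret as an environment variable.
  kubectl set env deployment/vllm --from=secret/hf-token || return 1
  kubectl scale deployment/vllm --replicas=1
}
```

Declaring the environment variable directly in vllm.yaml, with a reference to the Secret, would achieve the same result without the extra scaling steps.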

The first startup takes a few minutes as the model is downloaded. We can follow the progress through the Pod’s logs:

$ kubectl logs -f deploy/vllm
(APIServer pid=1) INFO 05-13 15:49:18 [utils.py:299]
(APIServer pid=1) INFO 05-13 15:49:18 [utils.py:299]        █     █     █▄   ▄█
(APIServer pid=1) INFO 05-13 15:49:18 [utils.py:299]  ▄▄ ▄█ █     █     █ ▀▄▀ █  version 0.20.2
(APIServer pid=1) INFO 05-13 15:49:18 [utils.py:299]   █▄█▀ █     █     █     █  model   Qwen/Qwen2.5-3B-Instruct
(APIServer pid=1) INFO 05-13 15:49:18 [utils.py:299]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀
(APIServer pid=1) INFO 05-13 15:49:18 [utils.py:299]
(APIServer pid=1) INFO 05-13 15:49:18 [utils.py:233] non-default args: {'model_tag': 'Qwen/Qwen2.5-3B-Instruct', 'model': 'Qwen/Qwen2.5-3B-Instruct'}
...
(APIServer pid=1) INFO 05-13 15:50:54 [api_server.py:602] Starting vLLM server on http://0.0.0.0:8000
...
(APIServer pid=1) INFO:     Application startup complete.

Testing the API

The Service is of type ClusterIP (the default), so it is only reachable from inside the cluster. To reach it from our machine, we forward its port:

kubectl port-forward svc/vllm-server 8000:8000
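
The port-forward keeps running in its terminal, and the server only answers once the model is loaded. From a second terminal, we can poll the /health endpoint exposed by the vLLM server until it responds. This is a minimal sketch: wait_for_vllm is a hypothetical helper name, and the 60 attempts and 5-second pause are arbitrary defaults.

```shell
# Hypothetical helper: poll vLLM's /health endpoint until it returns success.
# $1: base URL (default http://localhost:8000), $2: max attempts (default 60).
wait_for_vllm() {
  base_url="${1:-http://localhost:8000}"
  attempts="${2:-60}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # -f makes curl exit non-zero on HTTP errors, -s hides the progress meter.
    if curl -sf -o /dev/null "$base_url/health"; then
      echo "vLLM is ready"
      return 0
    fi
    i=$((i + 1))
    sleep 5
  done
  echo "vLLM did not become ready" >&2
  return 1
}
```

Running wait_for_vllm with no arguments checks the forwarded port used in this article.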

Next, in another terminal (the port-forward keeps running in the first one), we send a request using the OpenAI-compatible API:

$ curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-3B-Instruct",
    "messages": [{"role": "user", "content": "What is Kubernetes in one sentence?"}]
  }' | jq
{
  "id": "chatcmpl-8103dfe80be436cb",
  "object": "chat.completion",
  "created": 1778687690,
  "model": "Qwen/Qwen2.5-3B-Instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Kubernetes is an open-source system for automating deployment, scaling, and management of containerized applications.",
        "refusal": null,
        "annotations": null,
        "audio": null,
        "function_call": null,
        "tool_calls": [],
        "reasoning": null
      },
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": null,
      "token_ids": null
    }
  ],
  "service_tier": null,
  "system_fingerprint": null,
  "usage": {
    "prompt_tokens": 36,
    "total_tokens": 58,
    "completion_tokens": 22,
    "prompt_tokens_details": null
  },
  "prompt_logprobs": null,
  "prompt_token_ids": null,
  "kv_transfer_params": null
}
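
Most of the time only the generated text matters. The sketch below wraps the same request in a hypothetical helper named ask: it builds the JSON body with jq so that quotes in the prompt are escaped properly, then prints only the content of the first choice. The default model and URL are the ones used in this article.

```shell
# Hypothetical helper: send one user message and print only the reply text.
# $1: prompt, $2: model name (optional), $3: base URL (optional).
ask() {
  prompt="$1"
  model="${2:-Qwen/Qwen2.5-3B-Instruct}"
  base_url="${3:-http://localhost:8000}"
  # Build the request body with jq, send it on curl's stdin (-d @-),
  # then extract the assistant message from the response.
  jq -n --arg model "$model" --arg prompt "$prompt" \
    '{model: $model, messages: [{role: "user", content: $prompt}]}' |
    curl -s "$base_url/v1/chat/completions" \
      -H "Content-Type: application/json" -d @- |
    jq -r '.choices[0].message.content'
}
```

For example, ask "What is Kubernetes in one sentence?" sends the same question as the curl command above.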

Checking GPU utilization

Running nvidia-smi inside the Pod (or looking at the DCGM metrics) shows that the GPU is now in use: the VLLM::EngineCore process holds about 42,300 MiB of GPU memory.

$ kubectl exec deploy/vllm -- nvidia-smi
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 595.58.03              Driver Version: 595.58.03      CUDA Version: 13.2     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A40                     Off |   00000000:01:09.0 Off |                    0 |
|  0%   49C    P0             80W /  300W |   42309MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A             110      C   VLLM::EngineCore                      42300MiB |
+-----------------------------------------------------------------------------------------+
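
The 42,309 MiB in use is far more than the weights of a 3B-parameter model require (roughly 6 GB at 16-bit precision). This is expected: vLLM reserves a configurable fraction of the GPU memory at startup, mostly for the KV cache, and that fraction is controlled by its --gpu-memory-utilization option. The small sketch below (gpu_mem_fraction is a hypothetical name) only does the arithmetic on the two figures from the output above.

```shell
# Hypothetical helper: fraction of GPU memory in use.
# $1: used memory in MiB, $2: total memory in MiB.
gpu_mem_fraction() {
  awk -v used="$1" -v total="$2" 'BEGIN { printf "%.2f\n", used / total }'
}

gpu_mem_fraction 42309 46068   # prints 0.92
```

The two inputs can be read directly with kubectl exec deploy/vllm -- nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits. To leave room for other workloads on the same GPU, a lower value such as --gpu-memory-utilization 0.5 could be appended to the container args; the right value depends on the model and the expected load.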

If you want a web UI, you can add Open WebUI as described in the Ollama on Kubernetes article. Set the following environment variables on the Open WebUI container:

  • OPENAI_API_BASE_URL: http://vllm-server:8000/v1
  • OPENAI_API_KEY: vllm
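
If Open WebUI already runs as a Deployment, these two variables could be set with kubectl set env. This is only a sketch: point_webui_at_vllm is a hypothetical helper, and the Deployment name is passed as an argument because it depends on how Open WebUI was installed.

```shell
# Hypothetical helper: point an existing Open WebUI Deployment at vLLM.
# $1 is the name of the Open WebUI Deployment (depends on your installation).
point_webui_at_vllm() {
  kubectl set env "deployment/$1" \
    OPENAI_API_BASE_URL=http://vllm-server:8000/v1 \
    OPENAI_API_KEY=vllm
}
```

The URL uses the vllm-server Service name, so it resolves only when Open WebUI runs in the same namespace as vLLM.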

This setup has the same structure as the Ollama on Kubernetes example: a Deployment, a PVC, and a Service. The difference is that vLLM runs on a GPU and is built for production serving. We’ll cover advanced topics and production stacks in future articles.

Going Further