vLLM on Kubernetes
Prerequisites
We need a Kubernetes cluster with at least one NVIDIA GPU node. We will use an Exoscale SKS cluster with a single node that has an NVIDIA A40 GPU.
$ kubectl get node
NAME               STATUS   ROLES    AGE     VERSION
pool-5da1e-vnifb   Ready    <none>   2m13s   v1.36.0
GPU setup
Before running vLLM in a Pod, Kubernetes needs to be aware of the GPU: the NVIDIA drivers must be installed on the node, the CUDA toolkit must be available, and the NVIDIA device plugin must be running so that Pods can request GPU resources.
On a self-managed cluster, the installation of those components is usually handled by the NVIDIA GPU Operator. On Exoscale SKS, this stack is provisioned automatically: GPU nodes come with the NVIDIA drivers, the CUDA toolkit, and the device plugin pre-installed.
The only thing that you may need on top of those components is the DCGM exporter, which exposes the GPU metrics.
The Node’s description shows its GPU capacity.
$ kubectl describe node pool-5da1e-vnifb
...
Capacity:
  cpu:                12
  ephemeral-storage:  102083312Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             57583556Ki
  nvidia.com/gpu:     1
  pods:               110
Running a Pod
To verify that the GPU can be used, we define the following Pod, which requests one GPU, in gpu-test.yaml.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:12.5.0-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
Then, we run it.
kubectl apply -f gpu-test.yaml
Once the Pod reaches the Completed status, its logs show the GPU information: the A40 GPU is visible from inside the container, and no processes are currently running on it.
$ kubectl logs gpu-test
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 595.58.03              Driver Version: 595.58.03      CUDA Version: 13.2     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A40                     Off |   00000000:01:09.0 Off |                    0 |
|  0%   33C    P8             15W /  300W |       0MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
Deploying vLLM
First, we define in vllm.yaml the resources needed to deploy vLLM with the official vllm/vllm-openai image: a PersistentVolumeClaim, a Deployment, and a Service. At startup, vLLM downloads the model and loads it onto the GPU itself. The PVC, mounted at the path set in HF_HOME, caches the downloaded model, so subsequent restarts are faster.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: vllm-models
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: exoscale-sbs
  resources:
    requests:
      storage: 100Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:v0.20.2-cu129-ubuntu2404
        args:
        - Qwen/Qwen2.5-3B-Instruct
        ports:
        - containerPort: 8000
        env:
        - name: HF_HOME
          value: /data/models
        resources:
          limits:
            nvidia.com/gpu: "1"
        volumeMounts:
        - name: models
          mountPath: /data/models
      volumes:
      - name: models
        persistentVolumeClaim:
          claimName: vllm-models
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-server
spec:
  selector:
    app: vllm
  ports:
  - name: http
    port: 8000
    targetPort: 8000
The storageClassName: exoscale-sbs setting is specific to Exoscale SKS. It selects the StorageClass used to provision the volume.
Qwen/Qwen2.5-3B-Instruct is not a gated model on Hugging Face, so it can be downloaded without authentication. For vLLM to download a gated model, we would need to create a Secret containing a Hugging Face access token and pass it to the container.
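For a gated model, the token could be supplied roughly as follows. This is a minimal sketch, not part of the setup used in this article: the Secret name hf-token, its key token, and the placeholder value are hypothetical, and it relies on the Hugging Face libraries reading the access token from the HF_TOKEN environment variable.

```yaml
# Hypothetical Secret holding a Hugging Face access token.
apiVersion: v1
kind: Secret
metadata:
  name: hf-token                        # assumed name
type: Opaque
stringData:
  token: "<your-hugging-face-token>"    # placeholder, replace with a real token
```

The env section of the vllm container would then reference that Secret:

```yaml
# Excerpt of the vllm container spec (not a complete manifest).
env:
- name: HF_HOME
  value: /data/models
- name: HF_TOKEN                        # read by the Hugging Face libraries during the download
  valueFrom:
    secretKeyRef:
      name: hf-token                    # assumed Secret name from the sketch above
      key: token
```

Keeping the token in a Secret rather than in the Deployment manifest avoids committing it alongside the rest of the configuration.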
Then, we create those resources.
kubectl apply -f vllm.yaml
The first startup takes a few minutes because the model has to be downloaded. We can follow the progress through the Pod’s logs:
$ kubectl logs -f deploy/vllm
(APIServer pid=1) INFO 05-13 15:49:18 [utils.py:299]
(APIServer pid=1) INFO 05-13 15:49:18 [utils.py:299] █ █ █▄ ▄█
(APIServer pid=1) INFO 05-13 15:49:18 [utils.py:299] ▄▄ ▄█ █ █ █ ▀▄▀ █ version 0.20.2
(APIServer pid=1) INFO 05-13 15:49:18 [utils.py:299] █▄█▀ █ █ █ █ model Qwen/Qwen2.5-3B-Instruct
(APIServer pid=1) INFO 05-13 15:49:18 [utils.py:299] ▀▀ ▀▀▀▀▀ ▀▀▀▀▀ ▀ ▀
(APIServer pid=1) INFO 05-13 15:49:18 [utils.py:299]
(APIServer pid=1) INFO 05-13 15:49:18 [utils.py:233] non-default args: {'model_tag': 'Qwen/Qwen2.5-3B-Instruct', 'model': 'Qwen/Qwen2.5-3B-Instruct'}
...
(APIServer pid=1) INFO 05-13 15:50:54 [api_server.py:602] Starting vLLM server on http://0.0.0.0:8000
...
(APIServer pid=1) INFO: Application startup complete.
Testing the API
Since the vLLM Pod is only exposed through a ClusterIP Service, which is reachable only from inside the cluster, we forward its port to the local machine:
kubectl port-forward svc/vllm-server 8000:8000
Next, we send a request using the OpenAI-compatible API:
$ curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-3B-Instruct",
    "messages": [{"role": "user", "content": "What is Kubernetes in one sentence?"}]
  }' | jq
{
  "id": "chatcmpl-8103dfe80be436cb",
  "object": "chat.completion",
  "created": 1778687690,
  "model": "Qwen/Qwen2.5-3B-Instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Kubernetes is an open-source system for automating deployment, scaling, and management of containerized applications.",
        "refusal": null,
        "annotations": null,
        "audio": null,
        "function_call": null,
        "tool_calls": [],
        "reasoning": null
      },
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": null,
      "token_ids": null
    }
  ],
  "service_tier": null,
  "system_fingerprint": null,
  "usage": {
    "prompt_tokens": 36,
    "total_tokens": 58,
    "completion_tokens": 22,
    "prompt_tokens_details": null
  },
  "prompt_logprobs": null,
  "prompt_token_ids": null,
  "kv_transfer_params": null
}
Checking GPU utilization
By running nvidia-smi inside the Pod, or by looking at DCGM metrics, we can verify that the GPU is being used. Here, the VLLM::EngineCore process holds 42300MiB of GPU memory.
$ kubectl exec deploy/vllm -- nvidia-smi
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 595.58.03              Driver Version: 595.58.03      CUDA Version: 13.2     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A40                     Off |   00000000:01:09.0 Off |                    0 |
|  0%   49C    P0             80W /  300W |   42309MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A             110      C   VLLM::EngineCore                      42300MiB |
+-----------------------------------------------------------------------------------------+
If you want a web UI, you can add OpenWebUI as described in the Ollama on Kubernetes article. You need to set:
OPENAI_API_BASE_URL: http://vllm-server:8000/v1
OPENAI_API_KEY: vllm
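In a Deployment, those two settings would typically be passed as container environment variables. The excerpt below is only a sketch: the container name open-webui is an assumption, since the OpenWebUI manifest lives in the other article, and the API key is presumably just a placeholder because the vLLM server above is started without an API key.

```yaml
# Excerpt of a hypothetical OpenWebUI container spec (not a complete manifest).
containers:
- name: open-webui                      # assumed container name
  env:
  - name: OPENAI_API_BASE_URL
    value: http://vllm-server:8000/v1   # the vllm-server Service, assuming the same namespace
  - name: OPENAI_API_KEY
    value: vllm                         # placeholder value
```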
This setup is similar in structure to the Ollama on Kubernetes example: there is a Deployment, a PVC, and a Service. The difference is that vLLM runs on a GPU and is built for production. We’ll cover advanced topics and production stacks in future articles.