AI & LLMs
Practical guides for running AI workloads, from local LLMs on your machine to models deployed on Kubernetes with GPU scheduling, autoscaling, and observability. This section was built while learning the domain, with AI as a collaborator.
Understand the foundations
A plain-language overview: tokens, attention, MLP blocks, sampling, and what it all means for the tools you use
The deep dive: tensor inventory, Q/K/V mechanics, the full forward pass, KV cache, and prefill vs decode
A quick map of the AI tooling: Ollama, vLLM, LiteLLM, llm-d, and where each one fits
Where open-source models live: how to find a model, read a model card, and download what you need
Run locally first
The fastest way to get started with LLMs is to run them on your own machine. No API keys, no subscription, full control over the model.
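To give a feel for what "no API keys" looks like in practice, here is a minimal sketch of talking to a locally running model over HTTP. It assumes the local runtime is Ollama listening on its default port (11434) and that a model has already been pulled; the model name `llama3.2` and the helper names `build_request` and `generate` are placeholders chosen for illustration, not something the guides prescribe.

```python
import json
import urllib.request

# Assumption: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt, model="llama3.2"):
    """Build a POST request for Ollama's generate endpoint.

    The model name is a placeholder; use whichever model you have pulled.
    """
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,  # ask for one JSON object instead of a token stream
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, model="llama3.2"):
    """Send the prompt to the local server and return the generated text."""
    request = build_request(prompt, model)
    with urllib.request.urlopen(request, timeout=120) as response:
        body = json.loads(response.read())
    return body["response"]


if __name__ == "__main__":
    # Requires a running local server; nothing leaves your machine.
    print(generate("Explain what a token is in one sentence."))
```

Note that the request carries no authentication header: the server is your own process, so the only prerequisites are a running runtime and a downloaded model.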
Bring it to Kubernetes
Deploy Ollama with persistent model storage and a web UI, using an initContainer to pull models automatically
PagedAttention, continuous batching, and why a GPU is essential for production inference
Coming soon
The next step: taking these models to Kubernetes. GPU node pools, model serving, and more.