# Tools Overview

The AI tooling landscape moves fast, and many tools seem to serve the same purpose. This page gives a quick summary of what each tool does and where it fits.
## Model hub
| Tool | What it does |
|---|---|
| HuggingFace | The main hub for open-source models. Most tools in this list can pull models directly from it. |
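As a quick illustration, here is a minimal sketch of pulling a model snapshot from the Hub with the `huggingface_hub` library; the repo id is an arbitrary example, not a recommendation:

```python
# Sketch: download all files of a model repo from the HuggingFace Hub.
# Assumes `pip install huggingface_hub`; the repo id is an illustrative example.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
print(f"Model files downloaded to: {local_dir}")
```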
## Running models locally

The simplest starting point: your local machine is all you need.
| Tool | What it does |
|---|---|
| Ollama | Runs open-source models locally with a simple CLI and API. |
| llama.cpp | Runs quantized models efficiently on CPU. This is the engine behind many local tools, including Ollama. |
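For instance, here is a minimal sketch of calling Ollama's local HTTP API, assuming the server is running on its default port and a model has already been pulled (`llama3` is just an example name):

```python
# Sketch: query a locally running Ollama instance over its HTTP API.
# Assumes Ollama is running on the default port 11434 and the model has
# been pulled beforehand (e.g. `ollama pull llama3`).
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3", "prompt": "Why is the sky blue?", "stream": False},
)
print(resp.json()["response"])
```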
## Serving models in production
When you need to handle multiple users at once, local tools are not enough.
| Tool | What it does |
|---|---|
| vLLM | High-throughput inference server built for GPU. The most widely used option for self-hosted production serving. |
| HuggingFace TGI | HuggingFace’s own inference server. Similar positioning to vLLM. |
| llm-d | Distributed inference on Kubernetes, designed for large models that need multiple GPUs or nodes. |
| LiteLLM | A proxy that exposes a unified API in front of many backends (OpenAI, Anthropic, vLLM, Ollama…). Useful when you want to switch providers without changing your application code. |
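One practical consequence: vLLM and a LiteLLM proxy both expose OpenAI-compatible endpoints, so the same client code can target either by changing the base URL. A minimal sketch, assuming a server is already running at `localhost:8000` (the URL and model name are illustrative):

```python
# Sketch: talk to an OpenAI-compatible server (vLLM, LiteLLM proxy, ...)
# with the standard OpenAI client. Assumes a server is already running at
# the base_url below; URL and model name are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
completion = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(completion.choices[0].message.content)
```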
## Building on top of models
Libraries and frameworks that help you build applications on top of models.
| Tool | What it does |
|---|---|
| LangChain | Orchestration framework for chaining model calls, managing prompts, and integrating tools. |
| LlamaIndex | Focused on retrieval-augmented generation (RAG): connecting models to your own data. |
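As a rough sketch of the RAG pattern LlamaIndex targets, the snippet below indexes a local folder and queries it. It assumes the library's default OpenAI-backed LLM and embeddings (an API key in the environment) and a hypothetical `./data` directory of documents:

```python
# Sketch: minimal RAG with LlamaIndex -- index local files, then query them.
# Assumes `pip install llama-index`, OPENAI_API_KEY set in the environment
# (the default backend), and a "./data" folder containing documents.
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)

query_engine = index.as_query_engine()
print(query_engine.query("What do these documents say about pricing?"))
```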