Reading a Model Name

Model names like qwen2.5:7b-instruct-q4_K_M or llama3.2:3b-instruct-q4_K_M look cryptic, but each part tells something specific about the model. Once you know the pattern, you can immediately what you are getting before you download anything.

The anatomy of a model name

Taking qwen2.5:7b-instruct-q4_K_M as an example:

qwen2.5 : 7b - instruct - q4_K_M
  │        │      │           │
  │        │      │           └── quantization level
  │        │      └────────────── variant (how it was fine-tuned)
  │        └───────────────────── parameter count
  └────────────────────────────── model family and version

Model family and version

The first part identifies the model and its generation. qwen2.5 is version 2.5 of the Qwen family (from Alibaba). llama3.2 is Meta’s Llama 3, revision 2. Newer versions of the same family are usually more capable and more efficient.

Some well-known families:

  • Llama (Meta)
  • Qwen (Alibaba)
  • Mistral (Mistral AI)
  • Gemma (Google)
  • Phi (Microsoft)
  • DeepSeek (DeepSeek AI)

Parameter count

7b, 3b, or 70b is the number of parameters, in billions, in the model. Parameters are the values learned during training that encode the model’s knowledge and capabilities. They are numerical numbers living in big tensors (multi-dimensional arrays).

SizeTypical useWhat to expect
1B – 3BEdge devices, fast experimentationFast, lightweight, limited reasoning
7B – 9BEveryday local useGood balance of speed and capability
13B – 14BBetter quality on a single GPUNoticeably more capable, slower
70B+High-end workloadsNear-frontier quality, needs serious hardware

More parameters generally means better capability but slower inference and higher memory requirements. A 7B model typically fits in 4–8 GB of RAM (depending on quantization explained below), while a 70B model needs 40 GB or more even at 4-bit quantization.

Variant

The variant tells you what kind of fine-tuning was applied after the initial training.

VariantMeaning
(none) or baseRaw pretrained model, not fine-tuned for conversation
instructFine-tuned to follow instructions
chatFine-tuned for conversational use (similar to instruct)
codeFine-tuned for code generation and understanding
visionMultimodal which can process images as well as text

If you want a model you can talk to or give tasks to, you almost always want instruct or chat. A base model will complete your text rather than answer your question, which is useful for research and fine-tuning, but maybe not for direct use.

Quantization

Quantization reduces the precision of the model’s weights to save memory and speed up inference, at the cost of some quality. This is why the same model can come in several sizes on disk.

LevelBitsMemory vs quality
f16 / fp1616-bitFull quality, highest memory usage
q8_08-bitNear-identical quality, ~50% less memory than f16
q4_K_M4-bit (medium)Small quality loss, very practical for local use
q4_K_S4-bit (small)Slightly lower quality than _M, smaller file
q3_K_M3-bitNoticeable quality degradation, very small

For most local use cases, q4_K_M is the standard starting point. It uses about 4x less memory than f16 with minimal quality loss. If you have enough VRAM or RAM, q8_0 is worth considering for tasks where quality is important.

Other parts you may see

  • Context window: 128k in a name like phi3:mini-128k-instruct means the model can process up to 128,000 tokens in one prompt.
  • Version tags: v0.2, or v2 is the minor model revisions within the same family.
  • turbo: usually a faster, more efficient variant of a larger model.
  • mini / small / large: informal size labels used by some families instead of exact parameter counts.

Examples

NameFamilyParamsVariantQuantization
qwen2.5:7b-instruct-q4_K_MQwen 2.57BInstruct4-bit medium
llama3.2:3b-instruct-q4_K_MLlama 3.23BInstruct4-bit medium
mistral:7b-instruct-v0.2Mistral7BInstructnone (f16)
gemma2:9b-instruct-q8_0Gemma 29BInstruct8-bit
phi3:mini-128k-instructPhi-3~3.8BInstructnone (f16)
deepseek-r1:7bDeepSeek R17B(reasoning)none (f16)

7b-instruct-q4_K_M can be a good starting point on most machines. This model is a good fit for a wide range of tasks, it is fast enough and quite light enough to fit on consumer hardware.