Technical information was last verified in April 2026. The AI/LLM field moves fast - re-check official docs if more than 6 months have passed.
Summary: As of April 2026, the number of open LLMs that run on consumer GPUs (8-24GB VRAM) has exploded, but each model excels in a different area. This article compares Llama 4, Gemma 4 (released April 2), Phi-4 mini, Qwen 3, and Mistral Small 3.1 across four axes (VRAM requirements, benchmark scores, inference speed, and multilingual performance) and maps out which model to pick for each use case.
Who should read this
This article is written for developers and ML engineers who plan to deploy local LLMs. The premise throughout is inference on your own hardware, not cloud APIs.
Model overview
| Model | Parameters | Architecture | Context window | License |
|---|---|---|---|---|
| Llama 4 Scout | 109B total / 17B active | MoE (16 experts) | 10M tokens | Llama 4 Community |
| Gemma 4 31B | 30.7B | Dense | 128K tokens | Apache 2.0 |
| Gemma 4 26B | 26B total / ~4B active | MoE (128 experts, 8 active) | 128K tokens | Apache 2.0 |
| Gemma 4 E4B | ~4B effective | PLE (Per-Layer Embeddings) | 128K tokens | Apache 2.0 |
| Gemma 4 E2B | ~2B effective | PLE | 128K tokens | Apache 2.0 |
| Phi-4 mini | 3.8B | Dense | 128K tokens | MIT |
| Qwen 3 14B | 14B | Dense | 128K tokens | Apache 2.0 |
| Qwen 3 30B-A3B | 30B total / 3B active | MoE | 128K tokens | Apache 2.0 |
| Mistral Small 3.1 | 24B | Dense | 128K tokens | Apache 2.0 |
VRAM requirements and quantization
VRAM is the first gate for local deployment. Q4_K_M quantization compresses weights to 4-bit, saving roughly 75% of memory versus FP16 while keeping output quality at a practical level. As of 2026, GGUF is the de facto standard format, and llama.cpp plus Ollama support every major model.
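As a rough sanity check on the table below, weight memory scales linearly with parameter count and bits per weight. A minimal sketch (the bits-per-weight values are approximations; KV cache and runtime overhead are ignored, which is why real footprints run somewhat higher):

```python
def weight_vram_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate VRAM needed for model weights alone.

    params_billion: parameter count in billions (e.g. 14 for a 14B model).
    bits_per_weight: 16 for FP16, ~4.5 for Q4_K_M (mixed 4/6-bit blocks),
    ~8.5 for Q8_0. Ignores KV cache, activations, and runtime overhead.
    """
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9  # decimal GB

# FP16 vs Q4_K_M for a 14B model: quantization cuts weight memory ~70-75%.
fp16 = weight_vram_gb(14, 16)   # 28.0 GB, matching the ~28GB in the table
q4 = weight_vram_gb(14, 4.5)    # ~7.9 GB of weights; runtime overhead pushes
                                # the practical figure toward the table's ~10.7GB
```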
| Model | FP16 VRAM | Q4_K_M VRAM | Q8 VRAM | Fits 8GB GPU? |
|---|---|---|---|---|
| Llama 4 Scout | ~220GB | ~65GB | ~115GB | No |
| Gemma 4 31B | ~62GB | ~20GB | ~33GB | No |
| Gemma 4 26B (MoE) | ~52GB | ~17GB | ~28GB | No (~4B active) |
| Gemma 4 E4B | ~8GB | ~3.5GB | ~5GB | Yes |
| Gemma 4 E2B | ~4GB | ~1.5GB | ~2.5GB | Yes |
| Phi-4 mini | ~7.6GB | ~2.1GB | ~4GB | Yes |
| Qwen 3 14B | ~28GB | ~10.7GB | ~15GB | No |
| Qwen 3 30B-A3B | ~60GB | ~20GB | ~32GB | No (3B active) |
| Mistral Small 3.1 | ~48GB | ~14GB | ~25GB | No |
Recommendations by VRAM tier
8GB VRAM (RTX 4060, RTX 3070, etc.): Phi-4 mini (Q4_K_M, 2.1GB), Gemma 4 E4B (Q4_K_M, ~3.5GB), and Gemma 4 E2B (Q4_K_M, ~1.5GB) all fit comfortably. Gemma 4 E4B succeeds Gemma 3 4B and adds multimodal support (text + image + video + audio).
16GB VRAM (RTX 4080, RTX 5060 Ti, etc.): Qwen 3 14B (Q4_K_M, 10.7GB) leads in reasoning and math, and Mistral Small 3.1 (Q4_K_M, 14GB) fits with little headroom. Gemma 4 26B MoE is the new powerhouse near this tier: only ~4B active parameters for fast inference while scoring 1441 on LMArena. Note, however, that its ~17GB Q4_K_M footprint slightly exceeds a 16GB card, so it needs a smaller quant or partial CPU offload at this tier.
24GB VRAM (RTX 4090, RTX 5080, etc.): Gemma 4 31B (Q4_K_M, ~20GB) is the best choice. MMLU-Pro 85.2% and LiveCodeBench 80% put it well ahead of its predecessor (Gemma 3 27B). Qwen 3 30B-A3B remains competitive.
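The tier guidance above can be sketched as a simple budget filter. The footprints are the Q4_K_M estimates from the table; the default headroom value is an assumption to leave room for KV cache and runtime overhead, not an official figure:

```python
Q4_FOOTPRINT_GB = {  # Q4_K_M estimates from the VRAM table above
    "Gemma 4 E2B": 1.5,
    "Phi-4 mini": 2.1,
    "Gemma 4 E4B": 3.5,
    "Qwen 3 14B": 10.7,
    "Mistral Small 3.1": 14.0,
    "Gemma 4 26B MoE": 17.0,
    "Gemma 4 31B": 20.0,
    "Qwen 3 30B-A3B": 20.0,
    "Llama 4 Scout": 65.0,
}

def models_that_fit(vram_gb: float, headroom_gb: float = 1.5) -> list[str]:
    """Models whose Q4_K_M weights fit the VRAM budget, after reserving
    headroom_gb (a rough assumption) for KV cache and runtime overhead."""
    budget = vram_gb - headroom_gb
    return [m for m, gb in Q4_FOOTPRINT_GB.items() if gb <= budget]

models_that_fit(8)
# ['Gemma 4 E2B', 'Phi-4 mini', 'Gemma 4 E4B']
```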
Benchmark performance
| Model | MMLU-Pro | MATH (AIME) | LiveCodeBench | GPQA Diamond |
|---|---|---|---|---|
| Gemma 4 31B (dense) | 85.2% | 89.2% | 80% | -- |
| Gemma 4 26B (MoE, ~4B active) | -- | -- | -- | -- |
| Llama 4 Scout (17B active) | 74.3% | 75.8% | -- | 57.2% |
| Mistral Small 3.1 (24B) | 79% | -- | 74% (HumanEval) | -- |
| Qwen 3 14B | -- | 79.2% | -- | -- |
| Phi-4 mini (3.8B) | -- | -- | 74% (HumanEval) | -- |
Inference speed
Perceived performance in local deployment comes down to generation throughput, measured in tokens per second (tok/s). Approximate figures on an RTX 4090 at Q4_K_M quantization, fastest first:
- Qwen 3 30B-A3B: ~196 tok/s - fastest overall; MoE structure means only 3B active parameters compute
- Gemma 4 E2B: 150+ tok/s - ~2B effective parameters keep it quick
- Phi-4 mini: 120+ tok/s - the advantage of 3.8B parameters
- Gemma 4 E4B: 90-120 tok/s - PLE architecture improves efficiency over Gemma 3 4B
- Gemma 4 26B (MoE): 80-100 tok/s - ~4B active, 8 of 128 experts activated
- Qwen 3 14B: 45-55 tok/s - solid for a 14B-class model
- Mistral Small 3.1: 25-35 tok/s - 24B dense model
- Gemma 4 31B: 18-25 tok/s - 31B dense, most capable but slowest
Qwen 3 30B-A3B occupies an unusual position. Its MoE structure means actual computation is 3B-class while quality approaches 14B-class. The catch: you still need to load all parameters into memory, so generous VRAM is required.
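The memory-versus-compute split behind that tradeoff can be made concrete with a small sketch (the bits-per-weight value is an approximation for Q4_K_M, as above):

```python
def moe_profile(total_b: float, active_b: float,
                bits_per_weight: float = 4.5) -> tuple[float, float]:
    """Sketch of the MoE memory/compute split described above.

    Memory must hold ALL experts; per-token compute only touches the
    active subset. Returns (weight memory in GB, fraction of parameters
    computed per token).
    """
    weight_gb = total_b * 1e9 * bits_per_weight / 8 / 1e9
    compute_fraction = active_b / total_b
    return weight_gb, compute_fraction

gb, frac = moe_profile(30, 3)  # Qwen 3 30B-A3B
# ~16.9 GB of weights must sit in memory, but only 10% of the
# parameters are exercised per generated token.
```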
Multilingual performance
If you are using a local LLM for non-English languages, model selection shifts significantly.
Qwen 3 dominates. With a 250K vocabulary and training data spanning 201 languages, it shows a clear gap over competitors in CJK languages. Instruction following, summarization, and translation in Korean, Chinese, and Japanese all outperform same-size rivals.
Gemma 4 inherits Google's multilingual training pipeline and delivers solid performance across Asian languages; its predecessor Gemma 3 27B already recorded an Elo of 1339 on LMSys Chatbot Arena, ranking in the top 10.
Llama 4 Scout shows strong results on multilingual benchmarks (TydiQA, etc.), but the impracticality of local deployment is a hard constraint.
Phi-4 mini and Mistral Small 3.1 are English-centric designs and fall noticeably behind the above models on non-English tasks.
Coding ability
When using a local model for code generation and debugging, the tradeoffs differ.
- Qwen 3 14B: Top of its class on LiveCodeBench and coding benchmarks. Broad language support.
- Phi-4 mini: Remarkably capable at coding for its 3.8B size. HumanEval 74% is competitive with much larger models.
- Mistral Small 3.1: Also at HumanEval 74%, matching Phi-4 mini — but six times the size.
- Gemma 4 31B: LiveCodeBench 80%, up sharply from its predecessor Gemma 3 27B's 29.7; the strongest coding score in this comparison, though also the heaviest model.
Recommendations by use case
| Use case | Recommended model | VRAM required | Key reason |
|---|---|---|---|
| CJK chatbot / summarization | Qwen 3 14B | ~10.7GB (Q4) | Best multilingual performance, 250K vocabulary |
| Lightweight coding assistant | Phi-4 mini | ~2.1GB (Q4) | Solid coding ability at minimal resources |
| General purpose (8GB GPU) | Gemma 4 E4B | ~3.5GB (Q4) | Multimodal + audio, successor to Gemma 3 4B |
| General purpose (16GB GPU) | Gemma 4 26B MoE | ~17GB (Q4) | LMArena 1441, ~4B active for fast inference |
| General purpose (24GB GPU) | Gemma 4 31B | ~20GB (Q4) | MMLU-Pro 85.2%, LiveCodeBench 80% -- strongest open model |
| Speed-first inference | Qwen 3 30B-A3B | ~20GB (Q4) | MoE delivers 196 tok/s |
| On-device / mobile | Gemma 4 E2B | ~1.5GB (Q4) | ~2B effective, audio input support |
| Long document processing | Llama 4 Scout | ~65GB+ (Q4) | 10M-token context (server-grade hardware required) |
Deployment checklist
- Quantization format: GGUF + Q4_K_M is the standard for quality-to-memory efficiency. Q8 yields slightly better quality but nearly doubles VRAM usage. Worth considering if your primary use case is code generation.
- Inference engine: Ollama (built on llama.cpp) has the lowest barrier to entry. For production-grade serving, evaluate vLLM or TGI.
- Context length and VRAM: Beyond model weights, the KV cache consumes additional VRAM. Longer contexts mean higher memory usage, so do not judge by model size alone.
- Multimodal needs: All Gemma 4 variants support text + image + video, and the E2B/E4B variants also accept audio input. Llama 4 Scout natively supports multimodal as well. If multimodal is a requirement, Gemma 4 offers the broadest range of options.
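The KV-cache point in the checklist can be quantified with a rough formula. The layer and head counts below are illustrative placeholders, not taken from any model card in this article:

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_value: int = 2) -> float:
    """Rough KV-cache size: 2 (K and V) * layers * KV heads * head dim
    * tokens * bytes per value. bytes_per_value=2 assumes an FP16 cache;
    many runtimes can quantize the cache further to shrink this.
    """
    total_bytes = (2 * n_layers * n_kv_heads * head_dim
                   * context_tokens * bytes_per_value)
    return total_bytes / 1e9

# A hypothetical 14B-class model (40 layers, 8 KV heads via GQA,
# head_dim 128) at a 32K context:
kv_cache_gb(40, 8, 128, 32_768)  # ~5.4 GB on top of the weights
```

This is why a model whose weights "fit" your card can still run out of memory at long contexts: the cache grows linearly with context length.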
Further reading
- RAG Pipeline Design: From Chunking to Retrieval Quality Monitoring — Designing the five layers you need when building RAG on top of a local LLM
- LLM Structured Output: JSON Mode vs Function Calling vs Constrained Decoding — How to get reliable JSON from local models