In late 2024, DeepSeek released V3 with a reported training cost of just ~$5.5 million — a small fraction of what experts estimate GPT-4 cost to train. Weeks later, R1 arrived and outscored OpenAI's o1 on several math benchmarks. The AI world took notice.
Meanwhile, Meta kept pushing with the open-weight Llama 3.3 70B. Alibaba released an entire family of Qwen 2.5 models with specialized variants for code, math, and reasoning. And Moonshot AI quietly built Kimi — with the largest context window of any model in this landscape.
The picture is clear: 2025 is the year of self-hosted LLMs. The question is no longer "Should I self-host?" — it's "Which model do I deploy, and how?"
📚 Series: Self-hosted LLM 2026 — From Zero to Production
- Overview: Choosing the Right Model (DeepSeek, Llama, Qwen, Kimi)
- DeepSeek V3 & R1: The New Era of Reasoning
- Meta Llama 3.3 70B: Enterprise Workhorse
- Alibaba Qwen 2.5: Master of Coding & AI Agents
Why Self-Hosting LLMs Matters
Before comparing models, it's worth defining when self-hosting actually makes sense — because it's not always the right answer.
When to self-host:
| Situation | Core Reason |
|---|---|
| Sensitive data (healthcare, finance, legal) | Data never leaves your infrastructure |
| High volume (>50M tokens/day) | API costs exceed infrastructure costs at scale |
| Latency requirement <100ms first token | External API network overhead is unavoidable |
| Proprietary fine-tuning needed | Closed models offer limited or no fine-tuning, and you never own the weights |
| Offline / air-gapped environment | API requires persistent internet connection |
When NOT to self-host:
- Team under 5 people with no DevOps/MLOps capacity
- Low volume (<1M tokens/day) — API is cheaper when you factor in total infra costs (see the break-even sketch below)
- You need the absolute best model (GPT-4o / Claude 3.5 still lead on complex tasks)
- Time-to-market is critical — self-host setup takes days to weeks
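To make the volume thresholds above concrete, here is a back-of-the-envelope break-even sketch. The API price and server cost are placeholder assumptions (roughly GPT-4o-class input pricing and a mid-range GPU server, respectively); plug in your own numbers.

```python
# Break-even sketch: paid API vs self-hosted inference.
# ASSUMPTIONS: ~$2.50 per 1M input tokens (roughly GPT-4o-class pricing) and
# ~$1,500/month for a self-hosted GPU server including ops overhead.
API_PRICE_PER_M_TOKENS = 2.50     # USD per 1M tokens (assumption)
SERVER_COST_PER_MONTH = 1_500     # USD per month, all-in (assumption)

def monthly_api_cost(tokens_per_day: float) -> float:
    """API spend for a month at a steady daily token volume."""
    return tokens_per_day * 30 / 1_000_000 * API_PRICE_PER_M_TOKENS

for tokens_per_day in (1e6, 10e6, 50e6, 200e6):
    api = monthly_api_cost(tokens_per_day)
    winner = "self-host" if api > SERVER_COST_PER_MONTH else "API"
    print(f"{tokens_per_day / 1e6:>5.0f}M tokens/day -> API ~${api:,.0f}/mo -> cheaper option: {winner}")
```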
The 2025 Self-Hosted LLM Landscape
Four names every builder should know when evaluating LLMs for self-hosting or private deployment — each with a distinct design philosophy and killer use case.
1. DeepSeek (DeepSeek AI — China)
DeepSeek is the most disruptive name in 2025. Their core thesis: architectural efficiency beats brute-force compute. They proved it by building a model that rivals GPT-4o at a fraction of the training cost.
The MoE Architecture — Why it's different:
DeepSeek V3 has 671 billion total parameters, but only ~37 billion are "active" per forward pass. Think of it like a 671-person company where only the 37 right specialists get called in for each project — far more efficient than mobilizing everyone for everything.
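To make the "only ~37B active" idea concrete, here is a minimal, self-contained sketch of MoE top-k routing. The expert count, hidden size, and top-k below are illustrative toy numbers, not DeepSeek's actual configuration.

```python
# Toy Mixture-of-Experts routing: a router scores all experts per token,
# but only the top_k experts actually run. Dimensions are illustrative.
import numpy as np

rng = np.random.default_rng(0)
hidden_dim, num_experts, top_k = 64, 16, 2

# Each "expert" is a small feed-forward weight matrix; the router picks top_k of them.
experts = [rng.standard_normal((hidden_dim, hidden_dim)) * 0.02 for _ in range(num_experts)]
router = rng.standard_normal((hidden_dim, num_experts)) * 0.02

def moe_forward(x: np.ndarray) -> np.ndarray:
    """Route one token vector through its top_k experts only."""
    scores = x @ router                    # one routing score per expert
    chosen = np.argsort(scores)[-top_k:]   # indices of the top_k experts
    weights = np.exp(scores[chosen])
    weights /= weights.sum()               # softmax over the chosen experts only
    # The remaining experts are never touched for this token, which is why
    # "active" parameters are a small fraction of the total.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, chosen))

token = rng.standard_normal(hidden_dim)
print(moe_forward(token).shape)  # (64,)
```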
| Model | Parameters | Type | Context | Key Strength |
|---|---|---|---|---|
| DeepSeek-V3 | 671B (37B active) | General | 128K | Coding, instruction following |
| DeepSeek-R1 | 671B (37B active) | Reasoning | 128K | Math, logic, chain-of-thought |
| R1-Distill-Llama-70B | 70B dense | Reasoning | 128K | R1 reasoning in a Llama body |
| R1-Distill-Qwen-32B | 32B dense | Reasoning | 128K | Self-hostable, 1-2 GPUs |
| R1-Distill-Qwen-14B | 14B dense | Reasoning | 128K | Runs on 1× RTX 4090 |
| R1-Distill-Qwen-7B | 7B dense | Reasoning | 128K | Laptop/edge deployment |
Pros:
- Cheapest API in the game: V3 at $0.27/1M input tokens (vs GPT-4o at $5/1M) — 18× cheaper
- Exceptional coding: V3 scores ~89% on HumanEval, matching GPT-4o
- R1 is the first open-source reasoning model matching OpenAI's o1
- Distilled variants (7B–70B) bring R1's reasoning ability to consumer hardware
Cons:
- Full V3/R1 requires a GPU cluster to self-host — not viable for small teams
- Distilled versions lose some accuracy vs the full model (~10-20% depending on benchmark)
- Training data skews toward Chinese and English; weaker on other languages
2. Meta Llama 3.3 (Meta — USA)
Llama is the "foundation" model of the open-source world. Meta's goal isn't to build the biggest model — it's to deliver the best balance of performance, fine-tunability, and deployment simplicity.
| Model | Parameters | Context | Languages | Note |
|---|---|---|---|---|
| Llama 3.3 70B Instruct | 70B | 128K | 8 languages | Best instruction following at 70B |
| Llama 3.2 11B Vision | 11B | 128K | Multilingual | Multimodal (text + image) |
| Llama 3.2 90B Vision | 90B | 128K | Multilingual | Large multimodal |
| Llama 3.1 8B Instruct | 8B | 128K | 8 languages | Lightweight, edge/mobile |
Official Llama 3.3 languages: English, German, French, Italian, Portuguese, Hindi, Spanish, Thai.
Pros:
- Best ecosystem: Every major framework supports it natively — Ollama, vLLM, llama.cpp, LM Studio, Transformers, Jan.ai
- Easiest fine-tuning: LoRA/QLoRA on Llama is the most documented and community-supported path, with thousands of fine-tuned variants on Hugging Face (see the sketch after this list)
- Clear commercial license: Allows commercial use, with a separate license required only above 700M monthly active users
- Stable and predictable: More consistent behavior and fewer hallucinations than newer, more experimental models
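As a taste of that fine-tuning path, here is a minimal LoRA setup sketch using Hugging Face Transformers and PEFT. It targets the smaller Llama 3.1 8B sibling so it fits on one GPU; the model id is the gated official repo, and the hyperparameters are common defaults rather than tuned values.

```python
# Minimal LoRA sketch for a Llama model.
# ASSUMPTION: meta-llama/Llama-3.1-8B-Instruct, gated on Hugging Face
# (requires an approved access token).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_id = "meta-llama/Llama-3.1-8B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Classic LoRA recipe: low-rank adapters on the attention projections only.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of the base weights

# From here, plug `model` into your usual Trainer / SFT loop on proprietary data.
```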
Cons:
- 70B is the only size that reaches truly competitive performance — requires ~40GB VRAM at Q4
- No reasoning model (like R1 or QwQ) in the Llama 3.x family
- Vietnamese and Chinese quality noticeably weaker than Qwen
- Performance gap with SOTA models is widening
3. Alibaba Qwen 2.5 (Alibaba Cloud — China)
Qwen was the big surprise of 2024–2025. Alibaba didn't release a model — they released an entire ecosystem of specialized variants, with the strongest multilingual performance of the group.
| Model | Parameters | Context | Specialty |
|---|---|---|---|
| Qwen2.5-72B-Instruct | 72B | 128K | General purpose, best multilingual |
| Qwen2.5-32B-Instruct | 32B | 32K | Best size-to-performance balance |
| Qwen2.5-14B-Instruct | 14B | 32K | Single RTX 4090, production-ready |
| Qwen2.5-7B-Instruct | 7B | 32K | Lightweight, low-cost production |
| Qwen2.5-Coder-32B | 32B | 32K | Best coding model at 32B, beats CodeLlama |
| Qwen2.5-Math-72B | 72B | 4K | Deep mathematical reasoning |
| QwQ-32B-Preview | 32B | 32K | Reasoning model, matches o1-mini |
Pros:
- Best multilingual performance: Particularly strong in Chinese and Asian languages including Vietnamese
- Most diverse size range: From 0.5B to 72B — easy to pick the right size for hardware and budget
- QwQ-32B — actually self-hostable reasoning: Benchmarks near o1-mini, runs on 1× A100 or 2× RTX 4090
- Qwen2.5-Coder: Outperforms CodeLlama and DeepSeek-Coder at equivalent sizes
Cons:
- Smaller context window out of the box (32K for most variants; extending toward 128K requires the 72B model or YaRN rope-scaling configuration)
- QwQ-32B is still "preview" — not fully production-stable, can be verbose
- Less fine-tuning documentation than Llama in the English-language community
4. Kimi / Moonshot AI (月之暗面 — China)
Kimi is built by Moonshot AI, a Chinese AI startup founded in 2023. What sets Kimi apart isn't parameter count or benchmark scores — it's the massive context window and the ability to process extremely long documents in a single inference call.
The 2M+ token context window — The core differentiator:
While DeepSeek, Llama, and Qwen all cap around 128K tokens, Kimi Chat has pushed to 200K–2 million tokens across its products. This lets you feed in an entire codebase, hundreds of pages of legal documents, or a full technical manual in one shot — a use case the other models simply can't match.
Kimi k1.5 — Reasoning Model with Multimodal:
Released in early 2025, Kimi k1.5 is Moonshot AI's major leap forward:
- Reasoning model using Long Chain-of-Thought (Long-CoT), competing directly with o1 and DeepSeek R1
- Multimodal: handles both text and images in a single model
- Two operating modes: Long-CoT (deep thinking, slower) and Short-CoT (fast responses)
- Context window: 128K tokens for k1.5
| Model | Context | Type | Key Feature |
|---|---|---|---|
| Kimi Chat | 200K–2M | General | Long-document processing, multimodal |
| Kimi k1.5 | 128K | Reasoning + Vision | Long/Short-CoT, o1-competitive |
| Moonlight-16B-A3B | 128K | General (MoE) | Open-weight, 16B total / 3B active |
| Kimi-VL | 128K | Vision-Language | Open-weight, multimodal |
Can you actually self-host Kimi?
This is the critical question: Kimi's flagship models (k1.5, Chat) are closed-source, API-only — similar to OpenAI or Anthropic. You cannot download the weights and run them locally.
However, Moonshot AI has opened part of their ecosystem:
- Moonlight-16B-A3B-Instruct: An open-weight MoE model (16B total, 3B active params) — self-hostable on a standard GPU (~10GB VRAM)
- Kimi-VL series: Open-weight vision-language models for multimodal use cases
This means Kimi fits into the self-host landscape in two ways:
- API with massive context: Use Kimi's API when you need to process documents that OpenAI/Anthropic can't handle in one shot
- Lightweight self-host: Moonlight-16B is a compact MoE option for resource-constrained environments (a loading sketch follows below)
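If you want to try the lightweight self-host route, a loading sketch with Transformers might look like the following. The repo id, trust_remote_code requirement, and chat-template usage are assumptions based on my reading of the model card; the bf16 weights need roughly 32GB of memory, so the ~10GB figure above assumes 4-bit quantization on top of this.

```python
# Hedged sketch: loading Moonlight-16B-A3B-Instruct with Transformers.
# ASSUMPTIONS: repo id "moonshotai/Moonlight-16B-A3B-Instruct" and that the
# custom MoE architecture requires trust_remote_code=True. Verify on the model card.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "moonshotai/Moonlight-16B-A3B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # ~32GB for the full 16B weights; quantize to fit smaller GPUs
    device_map="auto",
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "Explain MoE models in two sentences."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```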
Pros:
- Largest context window in this landscape (2M+) — ideal for RAG on large corpora, legal docs, technical manuals
- Kimi k1.5 reasoning + vision in one API — versatile for complex tasks
- Moonlight-16B: compact MoE, compute-efficient for self-hosting on a single GPU
- Kimi API pricing is competitive with OpenAI for long-context use cases
Cons:
- Flagship models (k1.5, Chat) are not open-weight — no full self-hosting
- Moonlight-16B and Kimi-VL lack the production maturity of Llama or Qwen
- Much smaller community and ecosystem than the other three families
- Limited English-language documentation and community support
Verdict: Kimi isn't a "pure self-host" play. But if your use case centers on processing extremely long documents or you need reasoning + vision through a cost-effective API, Kimi deserves a spot on your shortlist.
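For the API route, Moonshot exposes an OpenAI-compatible endpoint, so a long-document call can reuse the standard openai client. The base URL and model name below are assumptions; check the Moonshot platform docs for current model ids and pricing.

```python
# Hedged sketch of a long-document call to Kimi via Moonshot's OpenAI-compatible API.
# ASSUMPTIONS: base_url https://api.moonshot.cn/v1 and model id "moonshot-v1-128k".
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_MOONSHOT_API_KEY",
    base_url="https://api.moonshot.cn/v1",
)

with open("contract.txt", encoding="utf-8") as f:
    document = f.read()  # a document long enough to overflow a typical 128K-context model

response = client.chat.completions.create(
    model="moonshot-v1-128k",
    messages=[
        {"role": "system", "content": "You are a careful legal analyst."},
        {"role": "user", "content": f"Summarize the key obligations in this contract:\n\n{document}"},
    ],
)
print(response.choices[0].message.content)
```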
Full Comparison Matrix
| Model | Q4 VRAM | Coding | Reasoning | Multilingual | Easy Self-host | API Price |
|---|---|---|---|---|---|---|
| DeepSeek-V3 (full) | ~350GB+ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐ | ❌ Cluster needed | $0.27/1M |
| DeepSeek-R1 (full) | ~350GB+ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐ | ❌ Cluster needed | $0.55/1M |
| R1-Distill-Qwen-32B | ~20GB | ⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐ | ✅ Easy | Free |
| R1-Distill-Qwen-14B | ~9GB | ⭐⭐ | ⭐⭐⭐ | ⭐⭐ | ✅ Very easy | Free |
| Llama 3.3 70B | ~40GB | ⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐ | ✅ Easy | ~$0.88/1M |
| Qwen2.5-72B | ~40GB | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ✅ Easy | ~$0.4/1M |
| Qwen2.5-32B | ~20GB | ⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐ | ✅ Easy | Free |
| Qwen2.5-Coder-32B | ~20GB | ⭐⭐⭐⭐⭐ | ⭐⭐ | ⭐⭐ | ✅ Easy | Free |
| QwQ-32B | ~20GB | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | ✅ Easy | Free |
| Kimi k1.5 (API) | N/A | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | ❌ API-only | See kimi.ai |
| Moonlight-16B | ~10GB | ⭐⭐ | ⭐⭐⭐ | ⭐⭐ | ✅ Easy | Free |
Note on Kimi: k1.5's high Reasoning rating comes from Long-CoT; the 2M+ context window advantage belongs to Kimi Chat and is not captured in this table.
Hardware Sizing Guide
Consumer GPUs
| GPU | VRAM | Maximum practical model |
|---|---|---|
| RTX 3080 / 4070 | 10–12GB | 7B Q8, 14B Q4, Moonlight-16B Q4 |
| RTX 4080 | 16GB | 7B FP16, 14B Q8, 32B Q4 (very tight) |
| RTX 3090 / 4090 | 24GB | 14B FP16, 32B Q4_K_M |
| 2× RTX 4090 | 48GB | 70B/72B Q4, 32B Q8 |
| 4× RTX 4090 | 96GB | 70B/72B Q8, 32B FP16 |
Data Center / Cloud GPUs
| GPU | VRAM | Recommended models |
|---|---|---|
| A100 40GB | 40GB | 70B/72B Q4 (tight), 32B Q8 |
| A100 80GB | 80GB | 70B/72B Q8, 32B FP16 |
| H100 80GB | 80GB | Same as A100 80GB but ~3× faster (HBM3) |
| 2× A100 80GB | 160GB | 70B/72B FP16; DeepSeek V3/R1 only at aggressive ~2-bit quantization |
Quantization Quick Reference
| Precision | VRAM rule of thumb | Accuracy impact |
|---|---|---|
| FP16 | ≈ 2 × (billions of params) GB | Highest accuracy |
| Q8 | ≈ 1 × (billions of params) GB | ~1-2% accuracy loss |
| Q4_K_M | ≈ 0.5 × (billions of params) GB | ~3-5% loss; the sweet spot |
| Q2 | ≈ 0.25 × (billions of params) GB | Significant quality drop; last resort |
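A tiny helper makes these rules of thumb easy to apply. Note it estimates weights only; budget roughly 10-20% extra headroom for KV cache and activations.

```python
# Weights-only VRAM estimate from the rules of thumb above (headroom not included).
BYTES_PER_PARAM = {"fp16": 2.0, "q8": 1.0, "q4_k_m": 0.5, "q2": 0.25}

def estimate_vram_gb(params_billion: float, precision: str) -> float:
    """Approximate GB of VRAM needed just to hold the weights."""
    return params_billion * BYTES_PER_PARAM[precision.lower()]

for name, size_b in [("Qwen2.5-32B", 32), ("Llama 3.3 70B", 70), ("Qwen2.5-72B", 72)]:
    q4 = estimate_vram_gb(size_b, "q4_k_m")
    fp16 = estimate_vram_gb(size_b, "fp16")
    print(f"{name}: Q4_K_M ~{q4:.0f} GB, FP16 ~{fp16:.0f} GB")
```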
Decision Framework: Which Model for Your Use Case?
Choose DeepSeek V3 / R1 if:
- Your project is coding-heavy — code generation, review, automated testing
- You need multi-step reasoning — math, logic chains, complex analysis
- You want the cheapest API available while your infra scales up (V3 at $0.27/1M)
- Limited hardware but need reasoning? → Use R1-Distill-14B (runs on 1 GPU)
Choose Llama 3.3 70B if:
- You need to fine-tune on proprietary company data
- Your team is already comfortable with Hugging Face / Transformers
- You need a model with a clear, stable commercial license
- You want to build on the most widely supported open-source foundation long-term
Choose Qwen 2.5 if:
- Your app needs strong multilingual support, especially Vietnamese or Chinese
- You need multiple size options to optimize cost (7B for edge, 72B for cloud)
- You want a self-hostable reasoning model without a cluster → QwQ-32B
- You need a specialized code model → Qwen2.5-Coder-32B
Choose Kimi if:
- Your use case requires processing extremely long documents — hundreds of PDF pages, full codebases, legal corpora
- You need reasoning + vision in a single API (Kimi k1.5)
- You want a compact MoE model for self-hosting on limited compute → Moonlight-16B (~10GB VRAM)
- You don't need to be fully offline and are comfortable with an API-based approach
Deployment Frameworks
Ollama — Best for getting started fast
```bash
ollama run qwen2.5:32b
ollama run deepseek-r1:32b
ollama run llama3.3:70b
```
vLLM — Best for production workloads
```bash
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-72B-Instruct-AWQ \
  --tensor-parallel-size 2 \
  --quantization awq \
  --max-model-len 32768
```
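Because vLLM serves an OpenAI-compatible API (and Ollama exposes one as well, at http://localhost:11434/v1), any OpenAI client can talk to your self-hosted model by changing only the base URL. A minimal sketch, assuming the vLLM server above is running on its default port 8000:

```python
# Minimal client for a self-hosted OpenAI-compatible endpoint (vLLM default port 8000).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")  # vLLM ignores the key by default

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-72B-Instruct-AWQ",  # must match the --model name the server was started with
    messages=[{"role": "user", "content": "Explain quantization in one paragraph."}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```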
llama.cpp — Best for edge and CPU inference
```bash
./llama-cli \
  -m qwen2.5-14b-instruct-q4_k_m.gguf \
  --prompt "Your prompt here" \
  -n 512
```
What's Coming in This Series
| Part | Coverage | Focus |
|---|---|---|
| Part 2 | DeepSeek V3 & R1 | MoE internals, distilled model setup, R1 prompting patterns, real benchmarks |
| Part 3 | Meta Llama 3.3 | LoRA fine-tuning walkthrough, vLLM production setup, multilingual optimization |
| Part 4 | Alibaba Qwen 2.5 | 128K long context strategies, QwQ reasoning setup, Coder-32B in CI/CD |
Conclusion
There's no single "best" model — only the most appropriate model for your specific context.
Quick decision summary:
- 🏎️ Need fast + cheap + strong coding → DeepSeek V3 API or R1-Distill self-hosted
- 🛠️ Need fine-tuning + community support → Meta Llama 3.3 70B
- 🌏 Need multilingual + flexible sizing → Alibaba Qwen 2.5
- 📄 Need to process 2M-token documents → Kimi API or Moonlight-16B self-hosted
In the next article, we'll fully dissect DeepSeek V3 and R1 — from how MoE architecture works internally, to setting up an efficient inference server, to the specific prompt engineering patterns that unlock R1's chain-of-thought reasoning.
Which LLM are you currently running in production? Reach out to the Autonow team — we love hearing real use cases from builders.