Series: The Complete Self-Hosted LLM Guide — 4 Parts
In January 2025, DeepSeek R1 landed and shook the AI world: an open-source model from a Chinese startup had just beaten OpenAI's o1 on AIME 2024 (79.8% vs 79.2%) and MATH-500 (97.3% vs 96.4%). Total training cost for V3: ~$5.5M — roughly 1/50th of what frontier labs spend.
The immediate question: How?
The answer lies in two core architectural innovations — MLA and DeepSeekMoE — and an R1 training process that looks nothing like any reasoning model before it.
Part 1: Inside DeepSeek V3's Architecture
MLA — Multi-head Latent Attention: Killing the KV Cache Bottleneck
In standard transformers, Multi-head Attention (MHA) creates the biggest scaling bottleneck: every token requires storing K (Key) and V (Value) matrices for every attention head. As context grows, the KV cache balloons.
Real-world example: Llama 3.3 70B at 128K context needs ~35GB just for the KV cache.
DeepSeek solves this with MLA (Multi-head Latent Attention):
Standard MHA:
Token → Q, K, V (separate per head) → Attention → Output
KV cache = num_heads × head_dim × 2 (K+V) per token
DeepSeek MLA:
Token → Q, [K,V] joint low-rank compression → latent vector c_KV
→ decompress on demand
KV cache = latent_dim per token (~5.75× smaller)
Result: DeepSeek V3 achieves ~5.75× smaller KV cache than equivalent MHA, making 128K context genuinely practical — not just theoretically supported.
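To make the mechanism concrete, here is a minimal PyTorch sketch of the compression step. The dimensions are illustrative rather than DeepSeek's actual configuration, and the small decoupled RoPE key that real MLA also caches is left out:

```python
import torch
import torch.nn as nn

# Minimal sketch of the MLA compression idea. Dimensions are illustrative,
# not DeepSeek's exact config, and the decoupled RoPE key is omitted.
d_model, d_latent, n_heads, d_head = 4096, 512, 32, 128

W_down = nn.Linear(d_model, d_latent, bias=False)           # joint K/V down-projection
W_up_k = nn.Linear(d_latent, n_heads * d_head, bias=False)  # decompress latent -> K
W_up_v = nn.Linear(d_latent, n_heads * d_head, bias=False)  # decompress latent -> V

hidden = torch.randn(1, 128, d_model)        # (batch, seq_len, d_model)

# Only the small latent is stored in the KV cache...
c_kv = W_down(hidden)                        # (1, 128, 512)

# ...and full per-head K/V are reconstructed on demand at attention time.
k = W_up_k(c_kv).view(1, 128, n_heads, d_head)
v = W_up_v(c_kv).view(1, 128, n_heads, d_head)

print("cached elements per token:  ", c_kv.shape[-1])        # 512
print("full K+V elements per token:", 2 * n_heads * d_head)  # 8192
```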
DeepSeekMoE: When Expertise Actually Specializes
MoE isn't new — but DeepSeek does it differently in two critical ways:
1. Fine-grained Expert Segmentation:
| | Standard MoE | DeepSeekMoE |
|---|---|---|
| Expert count | ~16 large experts | 256 small experts per layer |
| Routing | Top-2 of ~16 | Top-8 of 256 |
| Specialization depth | Low | Much higher |
By shrinking experts, each one learns an extremely narrow, deep domain. The router can combine 8 highly-specific micro-experts per token instead of 2 generalist large ones.
2. Shared Experts — Preventing Knowledge Collapse:
DeepSeek separates two expert types:
- Shared expert (1 per MoE layer): always activated, learns universal foundational knowledge
- Routed experts (Top-8 of 256): context-activated, hyper-specialized
This prevents "Knowledge Collapse" — where routed experts redundantly learn similar things instead of specializing.
The combined MLA + DeepSeekMoE result:
- 685B total parameters, only ~37B active per forward pass
- Inference cost equivalent to a 37B dense model
- Quality of a 685B model
FP8 Training — How $5.5M Became Possible
DeepSeek V3 is among the first large models trained entirely with FP8 mixed precision:
- FP16/BF16: the usual training precision, 2 bytes/weight
- FP8: 1 byte/weight → 50% reduction in memory and bandwidth
- Custom FP8 gradient communication framework
MLA + MoE + FP8 on H800 cluster = training cost that's achievable at mid-size company scale.
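The memory side of that claim is simple arithmetic (weights only; optimizer state, activations, and gradients are ignored here):

```python
# Weight-storage arithmetic for a ~685B-parameter model.
params = 685e9
print(f"FP16/BF16 weights: {params * 2 / 1e12:.2f} TB")  # 2 bytes per weight
print(f"FP8 weights:       {params * 1 / 1e12:.2f} TB")  # 1 byte per weight
```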
Part 2: DeepSeek R1 — The Reasoning Revolution
Why R1 Is Different
Before R1, the standard recipe for reasoning models relied on supervised fine-tuning (SFT) over curated chain-of-thought data, which is expensive, slow to annotate, and hard to scale.
DeepSeek showed with R1-Zero that pure reinforcement learning, with no SFT at all, is enough for reasoning to emerge; R1 builds on that result with a small cold start and a multi-stage pipeline:
R1 Training Pipeline:
1. Cold Start: a light SFT pass on the V3 base model with a small set of long-CoT demonstrations
2. RL Stage 1: GRPO with rule-based rewards for answer correctness and format compliance
3. Rejection Sampling: generate from the RL checkpoint, keep only high-quality reasoning traces, and fine-tune on them plus general-purpose data
4. RL Stage 2: a final RL pass across all scenarios to align for helpfulness and harmlessness without degrading reasoning
GRPO (Group Relative Policy Optimization):
Instead of training a separate critic model, GRPO samples a group of responses to the same prompt and scores each one relative to the rest of the group: a response's advantage is its reward minus the group mean, scaled by the group's standard deviation. Responses that beat the group average push the policy toward their reasoning; the rest push away from theirs. No value network is needed, which keeps RL at this scale affordable.
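In code, the group-relative advantage is just a normalization over the rewards of responses sampled for the same prompt (a minimal sketch, not DeepSeek's training code):

```python
import numpy as np

def grpo_advantages(rewards: list[float]) -> np.ndarray:
    """Group-relative advantages: score each sampled response against the
    group's mean and standard deviation instead of using a learned critic."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-8)

# 8 sampled answers to the same math prompt: 1.0 = correct and well-formatted, 0.0 = not.
rewards = [1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0]
print(grpo_advantages(rewards).round(2))
# Correct responses get a positive advantage, incorrect ones a negative advantage;
# the policy gradient then shifts probability mass toward the reasoning that worked.
```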
Emergent Reasoning Behaviors
Through RL training, R1 developed behaviors never explicitly programmed:
- Self-reflection: "Wait, I made an error in step 3..."
- Backtracking: returning to try different approaches when stuck
- Verification: checking its own final answers
- Exploration: testing multiple approaches before committing
This is why R1's reasoning chains can run to thousands of tokens and still land on correct answers.
R1 vs o1 — Real Benchmarks
| Benchmark | DeepSeek R1 | OpenAI o1 |
|---|---|---|
| AIME 2024 | 79.8% | 79.2% |
| MATH-500 | 97.3% | 96.4% |
| Codeforces Rating | 2029 | 1891 |
| GPQA Diamond | 71.5% | 75.7% |
| SWE-bench Verified | 49.2% | 48.9% |
R1 wins on math and competitive coding. o1 edges ahead on scientific reasoning. More importantly: R1 is open-weight and free to self-host, while o1 is only available through a paid API ($15 per million input tokens and $60 per million output tokens at launch).
Part 3: Self-Hosting R1 — The Practical Guide
Pick the Right Variant for Your Hardware
| Model | FP16 VRAM | Q8 VRAM | Q4_K_M VRAM | Minimum Hardware |
|---|---|---|---|---|
| R1-Distill-Qwen-7B | 14GB | 7GB | 4GB | 1× RTX 3080 |
| R1-Distill-Qwen-14B | 28GB | 14GB | 8GB | 1× RTX 4090 |
| R1-Distill-Qwen-32B | 64GB | 32GB | 18GB | 1× RTX 4090 (Q4) |
| R1-Distill-Llama-70B | 140GB | 70GB | 40GB | 2× A100 80GB |
| Full DeepSeek R1 | ~1.3TB | ~650GB | ~350GB | 8× A100 (cluster) |
Best bang for buck for most builders: R1-Distill-Qwen-14B on 1× RTX 4090, or R1-Distill-Qwen-32B at Q4 for better accuracy.
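The VRAM columns above follow from simple arithmetic. A rough sizing helper (weights only, so leave extra headroom for the KV cache and runtime overhead; Q4_K_M averages roughly 4.5 bits per weight):

```python
def weight_vram_gb(params_b: float, bits_per_weight: float) -> float:
    """Weight memory only; leave headroom for the KV cache and runtime overhead."""
    return params_b * bits_per_weight / 8   # billions of params -> GB

for name, params in [("R1-Distill-Qwen-14B", 14), ("R1-Distill-Qwen-32B", 32),
                     ("R1-Distill-Llama-70B", 70)]:
    print(f"{name}: FP16 ~{weight_vram_gb(params, 16):.0f} GB, "
          f"Q8 ~{weight_vram_gb(params, 8):.0f} GB, "
          f"Q4_K_M ~{weight_vram_gb(params, 4.5):.0f} GB")
```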
Deploy with Ollama (Development)
```bash
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull and run a distilled R1 variant
ollama run deepseek-r1:14b
ollama run deepseek-r1:32b

# Serve the API on all interfaces (for access from other machines)
OLLAMA_HOST=0.0.0.0 ollama serve
```
Custom Modelfile for reasoning optimization:
```
FROM deepseek-r1:14b
PARAMETER num_ctx 32768
PARAMETER temperature 0.6
SYSTEM "You are an expert problem solver. Think through problems step by step."
```

```bash
ollama create deepseek-r1-reasoning -f Modelfile
```
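Ollama also exposes an OpenAI-compatible endpoint under /v1, so the custom model can be queried with the same client code used for vLLM below (the prompt here is just an example):

```python
from openai import OpenAI

# Ollama's OpenAI-compatible API lives at /v1; the api_key is required by the
# client library but ignored by Ollama.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

response = client.chat.completions.create(
    model="deepseek-r1-reasoning",  # the model built from the Modelfile above
    messages=[{"role": "user", "content": "How many primes are there below 100?"}],
    temperature=0.6,
)
print(response.choices[0].message.content)
```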
Deploy with vLLM (Production)
```bash
pip install vllm

# Note: --quantization awq assumes an AWQ-quantized checkpoint; drop the flag
# (or point --model at an AWQ repo) when serving the official BF16 weights.
python -m vllm.entrypoints.openai.api_server \
    --model deepseek-ai/DeepSeek-R1-Distill-Qwen-32B \
    --tensor-parallel-size 2 \
    --quantization awq \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.92 \
    --served-model-name deepseek-r1
```
```bash
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "deepseek-r1",
        "messages": [{"role": "user", "content": "Prove that √2 is irrational."}],
        "temperature": 0.6,
        "max_tokens": 8192
    }'
```
Part 4: Prompt Engineering for R1
Triggering Deep Reasoning
R1 uses <think> tags to wrap its reasoning chain. To get the best results:
```python
from openai import OpenAI

# Point the client at the local vLLM (or Ollama) endpoint; the key is unused.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

response = client.chat.completions.create(
    model="deepseek-r1",
    messages=[
        {
            "role": "system",
            "content": "You are an expert problem solver. Always think carefully before answering."
        },
        {
            "role": "user",
            "content": """
A SaaS startup has 100 users. Monthly:
- 15% of users churn
- New users acquired = current_users × 0.25
After 6 months, how many users? Show work.
"""
        }
    ],
    temperature=0.6,   # DeepSeek recommends ~0.6 for R1
    max_tokens=8192    # leave room for the long <think> chain
)
```
Parse Thinking vs Final Answer
```python
def parse_r1_response(content: str) -> dict:
    """Split an R1 completion into its <think> reasoning chain and the final answer."""
    if "<think>" in content and "</think>" in content:
        think_start = content.index("<think>") + len("<think>")
        think_end = content.index("</think>")
        return {
            "thinking": content[think_start:think_end].strip(),
            "answer": content[think_end + len("</think>"):].strip(),
        }
    # No thinking block found: treat the whole completion as the answer.
    return {"thinking": "", "answer": content}


result = parse_r1_response(response.choices[0].message.content)
print("Answer:", result["answer"])
```
When to Use R1 vs V3
| Use Case | Use | Reason |
|---|---|---|
| Code generation | V3 | Faster, less verbose output |
| Multi-step bug analysis | R1 | Reasoning chain finds root cause |
| Math / formal proofs | R1 | Significantly higher accuracy |
| General Q&A / chat | V3 | Faster, no CoT overhead |
| Security code review | R1 | Deep analysis of logic flaws |
| Text generation | V3 | No reasoning overhead needed |
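If both models sit behind the same OpenAI-compatible gateway, the table collapses into a small routing helper. The task labels and served model names below are assumptions of this sketch, not part of any DeepSeek API:

```python
# Route reasoning-heavy work to R1 and everything else to V3.
# Task labels and model names ("deepseek-r1", "deepseek-v3") are assumptions of this sketch.
REASONING_TASKS = {"bug_analysis", "math_proof", "security_review"}

def pick_model(task: str) -> str:
    return "deepseek-r1" if task in REASONING_TASKS else "deepseek-v3"

print(pick_model("math_proof"))  # deepseek-r1
print(pick_model("chat"))        # deepseek-v3
```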
Production Tips
Stream responses. R1 often spends thousands of tokens inside <think> before the final answer appears, so without streaming users stare at a blank screen:

```python
# Stream tokens as they are generated instead of waiting for the full completion.
response = client.chat.completions.create(
    model="deepseek-r1", messages=messages, stream=True
)
```
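A sketch of consuming that stream while hiding the <think> block, so end users only see the final answer (this assumes the server passes the tags through literally in the token stream):

```python
# Buffer the stream until </think> appears, then print only what follows it.
buffer, done_thinking = "", False
for chunk in response:
    delta = chunk.choices[0].delta.content or ""
    if done_thinking:
        print(delta, end="", flush=True)
        continue
    buffer += delta
    if "</think>" in buffer:
        done_thinking = True
        # Print whatever arrived after the closing tag within the same chunk.
        print(buffer.split("</think>", 1)[1], end="", flush=True)
```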
Conclusion
DeepSeek isn't just "cheap GPT-4" — it's proof that architectural innovation can beat raw scale. MLA solves the KV cache bottleneck. DeepSeekMoE maximizes expert utilization. FP8 training cuts compute cost. Combined, they built a frontier-class model at startup-accessible cost.
R1 is even more significant: the R1-Zero experiment showed for the first time that pure RL, with no human-labeled chain-of-thought, can produce a world-class reasoning model. This isn't a fluke; it's a blueprint for the next generation of reasoning models.
Your next steps:
- Run R1-Distill-14B via Ollama today; it works on a single RTX 4090
- Benchmark it on a real problem from your project
- Compare reasoning quality against GPT-4o on the same task
Next up: Meta Llama 3.3 70B — why it's the real enterprise workhorse, and how to go production-ready in 30 minutes.