Series: The Complete Self-Hosted LLM Guide — 4 Parts
In January 2025, DeepSeek R1 landed and shook the AI world: an open-source model from a Chinese startup had just beaten OpenAI's o1 on AIME 2024 (79.8% vs 79.2%) and MATH-500 (97.3% vs 96.4%). Total training cost for V3: ~$5.5M — roughly 1/50th of what frontier labs spend.
The immediate question: How?
The answer lies in two core architectural innovations — MLA and DeepSeekMoE — and an R1 training process that looks nothing like any reasoning model before it.
Part 1: Inside DeepSeek V3's Architecture
MLA — Multi-head Latent Attention: Killing the KV Cache Bottleneck
In standard transformers, Multi-head Attention (MHA) creates the biggest scaling bottleneck: every token requires storing K (Key) and V (Value) matrices for every attention head. As context grows, the KV cache balloons.
Real-world example: Llama 3.3 70B at 128K context needs ~35GB just for the KV cache.
DeepSeek solves this with MLA (Multi-head Latent Attention):
Standard MHA:
Token → Q, K, V (separate per head) → Attention → Output
KV cache = num_heads × head_dim × 2 (K+V) per token
DeepSeek MLA:
Token → Q, [K,V] joint low-rank compression → latent vector c_KV
→ decompress on demand
KV cache = latent_dim per token (~5.75× smaller)
Result: DeepSeek V3 achieves ~5.75× smaller KV cache than equivalent MHA, making 128K context genuinely practical — not just theoretically supported.
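To make the mechanism concrete, here is a minimal PyTorch sketch of the compression step. The dimensions are illustrative rather than DeepSeek's actual configuration, and the small decoupled RoPE key that real MLA also caches is left out:

```python
import torch
import torch.nn as nn

# Minimal sketch of the MLA compression idea. Dimensions are illustrative,
# not DeepSeek's exact config, and the decoupled RoPE key is omitted.
d_model, d_latent, n_heads, d_head = 4096, 512, 32, 128

W_down = nn.Linear(d_model, d_latent, bias=False)           # joint K/V down-projection
W_up_k = nn.Linear(d_latent, n_heads * d_head, bias=False)  # decompress latent -> K
W_up_v = nn.Linear(d_latent, n_heads * d_head, bias=False)  # decompress latent -> V

hidden = torch.randn(1, 128, d_model)        # (batch, seq_len, d_model)

# Only the small latent is stored in the KV cache...
c_kv = W_down(hidden)                        # (1, 128, 512)

# ...and full per-head K/V are reconstructed on demand at attention time.
k = W_up_k(c_kv).view(1, 128, n_heads, d_head)
v = W_up_v(c_kv).view(1, 128, n_heads, d_head)

print("cached elements per token:  ", c_kv.shape[-1])        # 512
print("full K+V elements per token:", 2 * n_heads * d_head)  # 8192
```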
DeepSeekMoE: When Expertise Actually Specializes
MoE isn't new — but DeepSeek does it differently in two critical ways:
1. Fine-grained Expert Segmentation:
| | Standard MoE | DeepSeekMoE |
|---|---|---|
| Expert count | ~16 large experts | 256 small experts per layer |
| Routing | Top-2 of ~16 | Top-8 of 256 |
| Specialization depth | Low | Much higher |
By shrinking experts, each one learns an extremely narrow, deep domain. The router can combine 8 highly-specific micro-experts per token instead of 2 generalist large ones.
2. Shared Experts — Preventing Knowledge Collapse:
DeepSeek separates two expert types:
- Shared expert (1 per MoE layer): always activated, learns universal foundational knowledge
- Routed experts (Top-8 of 256): context-activated, hyper-specialized
This prevents "Knowledge Collapse" — where routed experts redundantly learn similar things instead of specializing.
The combined MLA + DeepSeekMoE result:
- 685B total parameters, only ~37B active per forward pass
- Inference cost equivalent to a 37B dense model
- Quality of a 685B model
FP8 Training — How $5.5M Became Possible
DeepSeek V3 is among the first large models trained entirely with FP8 mixed precision:
- FP16/BF16: the usual training precision, 2 bytes/weight
- FP8: 1 byte/weight → 50% reduction in memory and bandwidth
- Custom FP8 gradient communication framework
MLA + MoE + FP8 on H800 cluster = training cost that's achievable at mid-size company scale.
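The memory side of that claim is simple arithmetic (weights only; optimizer state, activations, and gradients are ignored here):

```python
# Weight-storage arithmetic for a ~685B-parameter model.
params = 685e9
print(f"FP16/BF16 weights: {params * 2 / 1e12:.2f} TB")  # 2 bytes per weight
print(f"FP8 weights:       {params * 1 / 1e12:.2f} TB")  # 1 byte per weight
```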
Part 2: DeepSeek R1 — The Reasoning Revolution
Why R1 Is Different
Before R1, the standard recipe for reasoning models relied on supervised fine-tuning (SFT) over curated chain-of-thought data, which is expensive, slow to annotate, and hard to scale.
DeepSeek showed with R1-Zero that pure reinforcement learning, with no SFT at all, is enough for reasoning to emerge; R1 builds on that result with a small cold start and a multi-stage pipeline:
R1 Training Pipeline:
1. Cold Start: a light SFT pass on the V3 base model with a small set of long-CoT demonstrations
2. RL Stage 1: GRPO with rule-based rewards for answer correctness and format compliance
3. Rejection Sampling: generate from the RL checkpoint, keep only high-quality reasoning traces, and fine-tune on them plus general-purpose data
4. RL Stage 2: a final RL pass across all scenarios to align for helpfulness and harmlessness without degrading reasoning
GRPO (Group Relative Policy Optimization):
Instead of training a separate critic model, GRPO samples a group of responses to the same prompt and scores each one relative to the rest of the group: a response's advantage is its reward minus the group mean, scaled by the group's standard deviation. Responses that beat the group average push the policy toward their reasoning; the rest push away from theirs. No value network is needed, which keeps RL at this scale affordable.
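In code, the group-relative advantage is just a normalization over the rewards of responses sampled for the same prompt (a minimal sketch, not DeepSeek's training code):

```python
import numpy as np

def grpo_advantages(rewards: list[float]) -> np.ndarray:
    """Group-relative advantages: score each sampled response against the
    group's mean and standard deviation instead of using a learned critic."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-8)

# 8 sampled answers to the same math prompt: 1.0 = correct and well-formatted, 0.0 = not.
rewards = [1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0]
print(grpo_advantages(rewards).round(2))
# Correct responses get a positive advantage, incorrect ones a negative advantage;
# the policy gradient then shifts probability mass toward the reasoning that worked.
```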
Emergent Reasoning Behaviors
Through RL training, R1 developed behaviors never explicitly programmed:
- Self-reflection: "Wait, I made an error in step 3..."
- Backtracking: returning to try different approaches when stuck
- Verification: checking its own final answers
- Exploration: testing multiple approaches before committing
This is why R1's reasoning chains can run to thousands of tokens and still land on correct answers.
R1 vs o1 — Real Benchmarks
| Benchmark | DeepSeek R1 | OpenAI o1 |
|---|---|---|
| AIME 2024 | 79.8% | 79.2% |
| MATH-500 | 97.3% | 96.4% |
| Codeforces Rating | 2029 | 1891 |
| GPQA Diamond | 71.5% | 75.7% |
| SWE-bench Verified | 49.2% | 48.9% |
R1 wins on math and competitive coding. o1 edges ahead on scientific reasoning. More importantly: R1 is open-weight and free to self-host, while o1 is only available through a paid API ($15 per million input tokens and $60 per million output tokens at launch).
Part 3: Self-Hosting R1 — The Practical Guide
Pick the Right Variant for Your Hardware
| Model | FP16 VRAM | Q8 VRAM | Q4_K_M VRAM | Minimum Hardware |
|---|---|---|---|---|
| R1-Distill-Qwen-7B | 14GB | 7GB | 4GB | 1× RTX 3080 |
| R1-Distill-Qwen-14B | 28GB | 14GB | 8GB | 1× RTX 4090 |
| R1-Distill-Qwen-32B | 64GB | 32GB | 18GB | 1× RTX 4090 (Q4) |
| R1-Distill-Llama-70B | 140GB | 70GB | 40GB | 2× A100 80GB |
| Full DeepSeek R1 | ~1.3TB | ~650GB | ~350GB | 8× A100 (cluster) |
Best bang for buck for most builders: R1-Distill-Qwen-14B on 1× RTX 4090, or R1-Distill-Qwen-32B at Q4 for better accuracy.
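The VRAM columns above follow from simple arithmetic. A rough sizing helper (weights only, so leave extra headroom for the KV cache and runtime overhead; Q4_K_M averages roughly 4.5 bits per weight):

```python
def weight_vram_gb(params_b: float, bits_per_weight: float) -> float:
    """Weight memory only; leave headroom for the KV cache and runtime overhead."""
    return params_b * bits_per_weight / 8   # billions of params -> GB

for name, params in [("R1-Distill-Qwen-14B", 14), ("R1-Distill-Qwen-32B", 32),
                     ("R1-Distill-Llama-70B", 70)]:
    print(f"{name}: FP16 ~{weight_vram_gb(params, 16):.0f} GB, "
          f"Q8 ~{weight_vram_gb(params, 8):.0f} GB, "
          f"Q4_K_M ~{weight_vram_gb(params, 4.5):.0f} GB")
```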
Deploy with Ollama (Development)
```bash
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull and run a distilled R1 variant
ollama run deepseek-r1:14b
ollama run deepseek-r1:32b

# Serve the API on all interfaces (for access from other machines)
OLLAMA_HOST=0.0.0.0 ollama serve
```
Custom Modelfile for reasoning optimization:
```
FROM deepseek-r1:14b
PARAMETER num_ctx 32768
PARAMETER temperature 0.6
SYSTEM "You are an expert problem solver. Think through problems step by step."
```

```bash
ollama create deepseek-r1-reasoning -f Modelfile
```
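Ollama also exposes an OpenAI-compatible endpoint under /v1, so the custom model can be queried with the same client code used for vLLM below (the prompt here is just an example):

```python
from openai import OpenAI

# Ollama's OpenAI-compatible API lives at /v1; the api_key is required by the
# client library but ignored by Ollama.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

response = client.chat.completions.create(
    model="deepseek-r1-reasoning",  # the model built from the Modelfile above
    messages=[{"role": "user", "content": "How many primes are there below 100?"}],
    temperature=0.6,
)
print(response.choices[0].message.content)
```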
Deploy with vLLM (Production)
```bash
pip install vllm

# Note: --quantization awq assumes an AWQ-quantized checkpoint; drop the flag
# (or point --model at an AWQ repo) when serving the official BF16 weights.
python -m vllm.entrypoints.openai.api_server \
    --model deepseek-ai/DeepSeek-R1-Distill-Qwen-32B \
    --tensor-parallel-size 2 \
    --quantization awq \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.92 \
    --served-model-name deepseek-r1
```
```bash
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "deepseek-r1",
        "messages": [{"role": "user", "content": "Prove that √2 is irrational."}],
        "temperature": 0.6,
        "max_tokens": 8192
    }'
```
Part 4: Prompt Engineering for R1
Triggering Deep Reasoning
R1 uses <think> tags to wrap its reasoning chain. To get the best results:
```python
from openai import OpenAI

# Point the client at the local vLLM (or Ollama) endpoint; the key is unused.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

response = client.chat.completions.create(
    model="deepseek-r1",
    messages=[
        {
            "role": "system",
            "content": "You are an expert problem solver. Always think carefully before answering."
        },
        {
            "role": "user",
            "content": """
A SaaS startup has 100 users. Monthly:
- 15% of users churn
- New users acquired = current_users × 0.25
After 6 months, how many users? Show work.
"""
        }
    ],
    temperature=0.6,   # DeepSeek recommends ~0.6 for R1
    max_tokens=8192    # leave room for the long <think> chain
)
```
Parse Thinking vs Final Answer
```python
def parse_r1_response(content: str) -> dict:
    """Split an R1 completion into its <think> reasoning chain and the final answer."""
    if "<think>" in content and "</think>" in content:
        think_start = content.index("<think>") + len("<think>")
        think_end = content.index("</think>")
        return {
            "thinking": content[think_start:think_end].strip(),
            "answer": content[think_end + len("</think>"):].strip(),
        }
    # No thinking block found: treat the whole completion as the answer.
    return {"thinking": "", "answer": content}


result = parse_r1_response(response.choices[0].message.content)
print("Answer:", result["answer"])
```
When to Use R1 vs V3
| Use Case | Use | Reason |
|---|---|---|
| Code generation | V3 | Faster, less verbose output |
| Multi-step bug analysis | R1 | Reasoning chain finds root cause |
| Math / formal proofs | R1 | Significantly higher accuracy |
| General Q&A / chat | V3 | Faster, no CoT overhead |
| Security code review | R1 | Deep analysis of logic flaws |
| Text generation | V3 | No reasoning overhead needed |
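If both models sit behind the same OpenAI-compatible gateway, the table collapses into a small routing helper. The task labels and served model names below are assumptions of this sketch, not part of any DeepSeek API:

```python
# Route reasoning-heavy work to R1 and everything else to V3.
# Task labels and model names ("deepseek-r1", "deepseek-v3") are assumptions of this sketch.
REASONING_TASKS = {"bug_analysis", "math_proof", "security_review"}

def pick_model(task: str) -> str:
    return "deepseek-r1" if task in REASONING_TASKS else "deepseek-v3"

print(pick_model("math_proof"))  # deepseek-r1
print(pick_model("chat"))        # deepseek-v3
```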
Production Tips
Stream responses. R1 often spends thousands of tokens inside <think> before the final answer appears, so without streaming users stare at a blank screen:

```python
# Stream tokens as they are generated instead of waiting for the full completion.
response = client.chat.completions.create(
    model="deepseek-r1", messages=messages, stream=True
)
```
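A sketch of consuming that stream while hiding the <think> block, so end users only see the final answer (this assumes the server passes the tags through literally in the token stream):

```python
# Buffer the stream until </think> appears, then print only what follows it.
buffer, done_thinking = "", False
for chunk in response:
    delta = chunk.choices[0].delta.content or ""
    if done_thinking:
        print(delta, end="", flush=True)
        continue
    buffer += delta
    if "</think>" in buffer:
        done_thinking = True
        # Print whatever arrived after the closing tag within the same chunk.
        print(buffer.split("</think>", 1)[1], end="", flush=True)
```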
Conclusion
DeepSeek isn't just "cheap GPT-4" — it's proof that architectural innovation can beat raw scale. MLA solves the KV cache bottleneck. DeepSeekMoE maximizes expert utilization. FP8 training cuts compute cost. Combined, they built a frontier-class model at startup-accessible cost.
R1 is even more significant: the R1-Zero experiment showed for the first time that pure RL, with no human-labeled chain-of-thought, can produce a world-class reasoning model. This isn't a fluke; it's a blueprint for the next generation of reasoning models.
Your next steps:
- Run R1-Distill-14B via Ollama today; it works on a single RTX 4090
- Benchmark it on a real problem from your project
- Compare reasoning quality against GPT-4o on the same task
Next up: Meta Llama 3.3 70B — why it's the real enterprise workhorse, and how to go production-ready in 30 minutes.