Qwen 2.5: Mastering Code, Multilingual Tasks, and Building AI Agent Workflows
Alibaba's Qwen 2.5 is the open-source leader in coding and multilingual tasks — and the ideal foundation for complex AI Agent workflows. With models ranging from 0.5B to 72B and specialized variants (Coder, Math), Qwen 2.5 covers everything from edge devices to enterprise servers. This guide dives into coding benchmarks, Vietnamese capabilities, and production-grade AI Agent patterns.
1. Qwen 2.5 Model Family Overview
Qwen 2.5 ships with multiple specialized variants:
| Model | Parameters | Characteristics | VRAM (Q4) |
|---|---|---|---|
| Qwen2.5-0.5B | 0.5B | Edge, mobile | ~0.5GB |
| Qwen2.5-1.5B | 1.5B | Edge, speculative draft | ~1.5GB |
| Qwen2.5-3B | 3B | Light tasks | ~2GB |
| Qwen2.5-7B | 7B | General purpose | ~5GB |
| Qwen2.5-14B | 14B | Strong reasoning | ~9GB |
| Qwen2.5-32B | 32B | Near-SOTA, balanced | ~20GB |
| Qwen2.5-72B | 72B | Top-tier open source | ~45GB |
| Qwen2.5-Coder-32B | 32B | Code specialist | ~20GB |
| Qwen2.5-Math-72B | 72B | Math specialist | ~45GB |
Key Architecture Points
- GQA (Grouped Query Attention): Optimized KV cache for large batches and long context
- Context window: 128K tokens (needs config in Ollama to use fully)
- Vocab: 151,936 tokens — one of the largest among mainstream models, tokenizing Chinese and Vietnamese efficiently
- YaRN RoPE scaling: Effective long context extrapolation
- SwiGLU activation: Standard in modern LLMs
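The GQA saving is easy to quantify with back-of-the-envelope arithmetic. The sketch below assumes Qwen2.5-72B's published configuration (80 layers, 64 query heads, 8 KV heads, head dim 128) and an FP16 KV cache:

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_tokens: int, bytes_per_param: int = 2) -> int:
    # K and V each store n_kv_heads * head_dim values per layer per token
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_param

# Qwen2.5-72B config: 80 layers, 64 query heads, 8 KV heads, head_dim 128
gqa = kv_cache_bytes(80, 8, 128, 32_768)    # GQA: 8 shared KV heads
mha = kv_cache_bytes(80, 64, 128, 32_768)   # hypothetical MHA: one KV head per query head
print(f"GQA KV cache @32K context: {gqa / 2**30:.0f} GiB")  # 10 GiB
print(f"MHA would need:            {mha / 2**30:.0f} GiB")  # 80 GiB
```

An 8x smaller KV cache is what makes large batches and long contexts practical on a single node.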
2. Coding Benchmark: Qwen2.5-Coder Leads the World
HumanEval, MBPP, and LiveCodeBench
| Model | HumanEval | MBPP | LiveCodeBench | MultiPL-E |
|---|---|---|---|---|
| Qwen2.5-Coder-32B | 92.7% | 90.9% | 65.9% | 82.4% |
| Claude 3.5 Sonnet | 92.0% | 91.0% | 60.9% | 81.3% |
| GPT-4o (2024-11) | 90.2% | 87.8% | 62.4% | 79.1% |
| DeepSeek-Coder-V2 | 90.2% | 84.6% | 58.7% | 78.0% |
| Llama 3.3 70B | 88.4% | 81.0% | 53.2% | 74.5% |
Achievement: Qwen2.5-Coder-32B beats GPT-4o on all coding benchmarks with only 32B parameters — the first open-source model to reach this level.
Supported Programming Languages
- Tier 1 (92%+ pass rate): Python, JavaScript, TypeScript, Java, C++, Go
- Tier 2 (80–92%): Rust, Swift, Kotlin, C#, PHP, SQL
- Tier 3 (65–80%): Shell, Lua, R, Julia
3. Multilingual Capabilities: Vietnamese and Chinese
Multilingual Benchmark
| Benchmark | Qwen2.5-72B | Llama 3.3 70B | Mistral Large |
|---|---|---|---|
| C-Eval (Chinese) | 91.1% | 75.2% | 78.4% |
| CMMLU | 90.7% | 73.1% | 74.9% |
| Vietnamese VLSP | 78.3% | 68.1% | 65.2% |
| M-MMLU (avg 14 langs) | 82.5% | 74.3% | 73.8% |
Why Qwen Leads for Vietnamese
- 151K token vocabulary: Tokenizes Vietnamese efficiently with fewer splits → fewer tokens → cheaper and faster
- Diverse training data: High-quality Vietnamese text from Wikipedia, news, forums
- Multilingual instruction tuning: Fine-tuned with Vietnamese prompts and responses
- Natural language switching: Automatically responds in the user's language without special system prompts
Demo: Vietnamese with Qwen
from openai import OpenAI
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
response = client.chat.completions.create(
    model="qwen2.5:72b",
    messages=[{
        "role": "user",
        "content": "Explain the Dijkstra algorithm in Vietnamese and write a Python illustration."
    }],
    temperature=0.3
)
print(response.choices[0].message.content)
4. Ollama Deployment
ollama pull qwen2.5:7b
ollama pull qwen2.5:14b
ollama pull qwen2.5:32b
ollama pull qwen2.5:72b
ollama pull qwen2.5-coder:32b
cat > Qwen-Modelfile << 'EOF'
FROM qwen2.5:72b
PARAMETER num_ctx 32768
PARAMETER temperature 0.7
PARAMETER top_p 0.9
SYSTEM "You are a helpful AI assistant. Respond in the same language as the user."
EOF
ollama create qwen25-72b-32k -f Qwen-Modelfile
5. vLLM Production Deployment
# 72B instruct model, single-GPU tensor parallel, 32K context
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-72B-Instruct \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.92 \
  --max-model-len 32768 \
  --dtype bfloat16 \
  --port 8000

# 72B split across 2 GPUs for 64K context
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-72B-Instruct \
  --tensor-parallel-size 2 \
  --max-model-len 65536 \
  --port 8000

# Coder-32B on a separate port for code workloads
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-Coder-32B-Instruct \
  --tensor-parallel-size 1 \
  --max-model-len 32768 \
  --port 8001
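With both servers up, a thin client can route each request to the right model. The sketch below is illustrative, not part of the official stack: the `pick_endpoint` keyword heuristic and the `ask` helper are hypothetical names, and the port assignments follow the commands above.

```python
# Port assignments follow the vLLM commands above; adjust hosts as needed.
ENDPOINTS = {
    "general": {"base_url": "http://localhost:8000/v1",
                "model": "Qwen/Qwen2.5-72B-Instruct"},
    "code":    {"base_url": "http://localhost:8001/v1",
                "model": "Qwen/Qwen2.5-Coder-32B-Instruct"},
}

def pick_endpoint(prompt: str) -> str:
    """Illustrative heuristic: send code-looking prompts to the Coder server."""
    code_markers = ("def ", "class ", "```", "function", "bug", "refactor", "compile")
    text = prompt.lower()
    return "code" if any(m in text for m in code_markers) else "general"

def ask(prompt: str) -> str:
    # Lazy import so the routing logic has no hard dependency on the SDK
    from openai import OpenAI
    ep = ENDPOINTS[pick_endpoint(prompt)]
    client = OpenAI(base_url=ep["base_url"], api_key="qwen")
    resp = client.chat.completions.create(
        model=ep["model"],
        messages=[{"role": "user", "content": prompt}],
        temperature=0.3,
    )
    return resp.choices[0].message.content
```

In production you would route on explicit request metadata rather than keywords, but the two-endpoint split itself is a common pattern for mixing a generalist and a specialist model.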
6. Building AI Agent Workflows: ReAct Pattern
Qwen 2.5 excels at agentic tasks — following complex instructions, calling tools, and maintaining context across many steps.
ReAct Agent
import json
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="qwen")
tools = [
    {
        "type": "function",
        "function": {
            "name": "search_web",
            "description": "Search for information on the web",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string"}
                },
                "required": ["query"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "execute_python",
            "description": "Execute Python code and return the result",
            "parameters": {
                "type": "object",
                "properties": {
                    "code": {"type": "string"}
                },
                "required": ["code"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "read_file",
            "description": "Read file contents",
            "parameters": {
                "type": "object",
                "properties": {
                    "path": {"type": "string"}
                },
                "required": ["path"]
            }
        }
    }
]
def run_agent(user_task: str, max_steps: int = 10) -> str:
    messages = [
        {
            "role": "system",
            "content": "You are a helpful agent. Use the available tools to solve the task step by step."
        },
        {"role": "user", "content": user_task}
    ]
    for step in range(max_steps):
        response = client.chat.completions.create(
            model="Qwen/Qwen2.5-72B-Instruct",
            messages=messages,
            tools=tools,
            tool_choice="auto",
            temperature=0.2
        )
        msg = response.choices[0].message
        messages.append(msg)
        if not msg.tool_calls:
            return msg.content
        for tool_call in msg.tool_calls:
            result = execute_tool(
                tool_call.function.name,
                json.loads(tool_call.function.arguments)
            )
            messages.append({
                "role": "tool",
                "tool_call_id": tool_call.id,
                "content": str(result)
            })
    return "Agent stopped: max_steps reached"

def execute_tool(name: str, args: dict) -> str:
    if name == "search_web":
        return "Search is not wired up in this demo."  # plug a real search API in here
    if name == "execute_python":
        import subprocess
        result = subprocess.run(
            ["python", "-c", args["code"]],
            capture_output=True, text=True, timeout=30
        )
        return result.stdout or result.stderr
    if name == "read_file":
        with open(args["path"]) as f:
            return f.read()
    return f"Unknown tool: {name}"

result = run_agent(
    "Read data.csv and compute the average of the 'price' column."  # example task
)
print(result)
7. Multi-Agent Pipeline
from openai import OpenAI
class QwenAgent:
    def __init__(self, name: str, system_prompt: str, model: str = "qwen2.5:72b"):
        self.name = name
        self.system_prompt = system_prompt
        self.client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
        self.model = model

    def run(self, task: str, context: str = "") -> str:
        messages = [
            {"role": "system", "content": self.system_prompt},
            {"role": "user", "content": f"Context:\n{context}\n\nTask: {task}"}
        ]
        response = self.client.chat.completions.create(
            model=self.model,
            messages=messages,
            temperature=0.3
        )
        return response.choices[0].message.content
researcher = QwenAgent(
"Researcher",
"You are a research specialist. Collect and summarize information about the requested topic."
)
coder = QwenAgent(
"Coder",
"You are a senior software engineer. Write clean Python code with tests and documentation.",
model="qwen2.5-coder:32b"
)
reviewer = QwenAgent(
"Reviewer",
"You are a technical reviewer. Evaluate code for performance, security, and best practices."
)
def run_pipeline(topic: str) -> dict:
    research = researcher.run(f"Research this topic: {topic}")  # illustrative prompts
    code = coder.run(
        "Implement a solution based on the research findings.",
        context=research
    )
    review = reviewer.run(
        "Review the following code.",
        context=code
    )
    return {"research": research, "code": code, "review": review}

result = run_pipeline("Build a REST API rate limiter")  # example topic
8. Structured Output with Pydantic
from openai import OpenAI
from pydantic import BaseModel
from typing import List
import json

# Any OpenAI-compatible endpoint works; the Ollama endpoint is shown here.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

class TaskBreakdown(BaseModel):
    objective: str
    subtasks: List[str]
    estimated_complexity: str
    dependencies: List[str]
    suggested_tech_stack: List[str]

def analyze_task(description: str) -> TaskBreakdown:
    response = client.chat.completions.create(
        model="qwen2.5:72b",
        messages=[
            {
                "role": "system",
                "content": "You are a technical architect. Analyze tasks and return JSON."
            },
            {
                "role": "user",
                "content": f"""Analyze this task following the TaskBreakdown schema:

Task: {description}

Schema:
{{
  "objective": "string",
  "subtasks": ["string"],
  "estimated_complexity": "low|medium|high",
  "dependencies": ["string"],
  "suggested_tech_stack": ["string"]
}}

Return only pure JSON, no other text."""
            }
        ],
        temperature=0.1,
        response_format={"type": "json_object"}
    )
    data = json.loads(response.choices[0].message.content)
    return TaskBreakdown(**data)
result = analyze_task("Build a RAG-powered customer service chatbot system")
print(f"Complexity: {result.estimated_complexity}")
print(f"Subtasks: {', '.join(result.subtasks)}")
9. Production Code Assistant
class QwenCodeAssistant:
    def __init__(self):
        self.client = OpenAI(
            base_url="http://localhost:8001/v1",
            api_key="qwen"
        )

    def generate_code(self, spec: str, language: str = "python") -> str:
        response = self.client.chat.completions.create(
            model="qwen2.5-coder-32b-instruct",
            messages=[
                {
                    "role": "system",
                    "content": f"You are an expert {language} developer. Write clean code with docstrings and unit tests."
                },
                {"role": "user", "content": spec}
            ],
            temperature=0.1,
            max_tokens=4096
        )
        return response.choices[0].message.content

    def review_code(self, code: str) -> str:
        response = self.client.chat.completions.create(
            model="qwen2.5-coder-32b-instruct",
            messages=[
                {
                    "role": "system",
                    "content": "Review code and identify: bugs, security issues, performance bottlenecks, style problems."
                },
                {"role": "user", "content": f"```\n{code}\n```"}
            ],
            temperature=0.2
        )
        return response.choices[0].message.content

    def explain_code(self, code: str) -> str:  # illustrative third helper
        response = self.client.chat.completions.create(
            model="qwen2.5-coder-32b-instruct",
            messages=[{
                "role": "user",
                "content": f"Explain what this code does:\n```\n{code}\n```"
            }],
            temperature=0.3
        )
        return response.choices[0].message.content

assistant = QwenCodeAssistant()
code = assistant.generate_code(
    "Write a thread-safe LRU cache class"  # example spec
)
print(code)
10. Qwen 2.5 vs Other Models
| Criterion | Qwen 2.5-72B | Llama 3.3 70B | DeepSeek V3 | GPT-4o |
|---|---|---|---|---|
| Coding (HumanEval) | 88.2% | 88.4% | 91.6% | 90.2% |
| Specialized coding (Coder-32B) | 92.7% | N/A | N/A | N/A |
| Vietnamese | Best open-source | Good | Good | Best (API) |
| Chinese | Best | Weak | Good | Good |
| Agentic tasks | ✅ Excellent | ✅ Good | ✅ Good | ✅ Best |
| Context window | 128K | 128K | 64K | 128K |
| Self-host cost | Medium | Medium | High (671B) | N/A |
11. Model Selection by Use Case
| Use Case | Recommended Model | Reason |
|---|---|---|
| Code generation & review | Qwen2.5-Coder-32B | SOTA coding, beats GPT-4o |
| Vietnamese/Chinese tasks | Qwen2.5-72B | 151K vocab, superior training data |
| Complex AI agents | Qwen2.5-72B | Strong tool calling, long context |
| Edge/mobile | Qwen2.5-1.5B or 3B | Compact, runs offline |
| Math reasoning | Qwen2.5-Math-72B | Specialized for mathematics |
| General API server | Qwen2.5-32B | Good performance/cost balance |
| Speculative draft | Qwen2.5-1.5B | High speed when paired with 72B |
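The speculative-draft pairing in the last row can be reasoned about with the standard expected-speedup formula from the speculative decoding literature (Leviathan et al., 2023). The numbers below are illustrative assumptions for a 1.5B draft paired with a 72B target, not measured results:

```python
def speculative_speedup(alpha: float, gamma: int, c: float) -> float:
    """Expected wall-clock speedup of speculative decoding:
    alpha = draft-token acceptance rate,
    gamma = draft tokens proposed per step,
    c     = draft-model cost relative to the target model."""
    return (1 - alpha ** (gamma + 1)) / ((1 - alpha) * (gamma * c + 1))

# Illustrative, unmeasured numbers for a 1.5B draft + 72B target:
print(round(speculative_speedup(alpha=0.8, gamma=4, c=0.05), 2))  # → 2.8
```

A small, well-aligned draft model (high alpha, tiny c) is what makes the 1.5B + 72B pairing attractive; if acceptance drops, the speedup degrades quickly.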
Conclusion
Qwen 2.5 stands out through three core strengths:
- Coding SOTA: Qwen2.5-Coder-32B is the best open-source model for coding, surpassing GPT-4o on HumanEval (92.7%) and LiveCodeBench
- Multilingual leader: 151K token vocabulary and diverse training data make Qwen the top choice for Vietnamese and Chinese among self-hosted models
- Powerful AI agents: Accurate tool calling, stable instruction following, and structured JSON output support — all essential for production agent workflows
For enterprises wanting to self-host the best multilingual model or build production-grade code assistants and AI agents, Qwen 2.5 is the unrivaled choice in the open-source world.