Qwen 2.5: Mastering Code, Multilingual Tasks, and Building AI Agent Workflows
Alibaba's Qwen 2.5 is the open-source leader in coding and multilingual tasks — and the ideal foundation for complex AI Agent workflows. With models ranging from 0.5B to 72B and specialized variants (Coder, Math), Qwen 2.5 covers everything from edge devices to enterprise servers. This guide dives into coding benchmarks, Vietnamese capabilities, and production-grade AI Agent patterns.
1. Qwen 2.5 Model Family Overview
Qwen 2.5 ships with multiple specialized variants:
| Model | Parameters | Characteristics | VRAM (Q4) |
|---|---|---|---|
| Qwen2.5-0.5B | 0.5B | Edge, mobile | ~0.5GB |
| Qwen2.5-1.5B | 1.5B | Edge, speculative draft | ~1.5GB |
| Qwen2.5-3B | 3B | Light tasks | ~2GB |
| Qwen2.5-7B | 7B | General purpose | ~5GB |
| Qwen2.5-14B | 14B | Strong reasoning | ~9GB |
| Qwen2.5-32B | 32B | Near-SOTA, balanced | ~20GB |
| Qwen2.5-72B | 72B | Top-tier open source | ~45GB |
| Qwen2.5-Coder-32B | 32B | Code specialist | ~20GB |
| Qwen2.5-Math-72B | 72B | Math specialist | ~45GB |
Key Architecture Points
- GQA (Grouped Query Attention): Optimized KV cache for large batches and long context
- Context window: 128K tokens (needs config in Ollama to use fully)
- Vocab: 151,936 tokens — one of the largest among mainstream models, tokenizing Chinese and Vietnamese efficiently
- YaRN RoPE scaling: Effective long context extrapolation
- SwiGLU activation: Standard in modern LLMs
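The GQA saving is easy to quantify with back-of-the-envelope arithmetic. The sketch below assumes Qwen2.5-72B's published configuration (80 layers, 64 query heads, 8 KV heads, head dim 128) and an FP16 KV cache:

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_tokens: int, bytes_per_param: int = 2) -> int:
    # K and V each store n_kv_heads * head_dim values per layer per token
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_param

# Qwen2.5-72B config: 80 layers, 64 query heads, 8 KV heads, head_dim 128
gqa = kv_cache_bytes(80, 8, 128, 32_768)    # GQA: 8 shared KV heads
mha = kv_cache_bytes(80, 64, 128, 32_768)   # hypothetical MHA: one KV head per query head
print(f"GQA KV cache @32K context: {gqa / 2**30:.0f} GiB")  # 10 GiB
print(f"MHA would need:            {mha / 2**30:.0f} GiB")  # 80 GiB
```

An 8x smaller KV cache is what makes large batches and long contexts practical on a single node.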
2. Coding Benchmark: Qwen2.5-Coder Leads the World
HumanEval, MBPP, and LiveCodeBench
| Model | HumanEval | MBPP | LiveCodeBench | MultiPL-E |
|---|---|---|---|---|
| Qwen2.5-Coder-32B | 92.7% | 90.9% | 65.9% | 82.4% |
| Claude 3.5 Sonnet | 92.0% | 91.0% | 60.9% | 81.3% |
| GPT-4o (2024-11) | 90.2% | 87.8% | 62.4% | 79.1% |
| DeepSeek-Coder-V2 | 90.2% | 84.6% | 58.7% | 78.0% |
| Llama 3.3 70B | 88.4% | 81.0% | 53.2% | 74.5% |
Achievement: Qwen2.5-Coder-32B beats GPT-4o on all coding benchmarks with only 32B parameters — the first open-source model to reach this level.
Supported Programming Languages
- Tier 1 (92%+ pass rate): Python, JavaScript, TypeScript, Java, C++, Go
- Tier 2 (80–92%): Rust, Swift, Kotlin, C#, PHP, SQL
- Tier 3 (65–80%): Shell, Lua, R, Julia
3. Multilingual Capabilities: Vietnamese and Chinese
Multilingual Benchmark
| Benchmark | Qwen2.5-72B | Llama 3.3 70B | Mistral Large |
|---|---|---|---|
| C-Eval (Chinese) | 91.1% | 75.2% | 78.4% |
| CMMLU | 90.7% | 73.1% | 74.9% |
| Vietnamese VLSP | 78.3% | 68.1% | 65.2% |
| M-MMLU (avg 14 langs) | 82.5% | 74.3% | 73.8% |
Why Qwen Leads for Vietnamese
- 151K token vocabulary: Tokenizes Vietnamese efficiently with fewer splits → fewer tokens → cheaper and faster
- Diverse training data: High-quality Vietnamese text from Wikipedia, news, forums
- Multilingual instruction tuning: Fine-tuned with Vietnamese prompts and responses
- Natural language switching: Automatically responds in the user's language without special system prompts
Demo: Vietnamese with Qwen
from openai import OpenAI
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
response = client.chat.completions.create(
    model="qwen2.5:72b",
    messages=[{
        "role": "user",
        "content": "Explain the Dijkstra algorithm in Vietnamese and write a Python illustration."
    }],
    temperature=0.3
)
print(response.choices[0].message.content)
4. Ollama Deployment
ollama pull qwen2.5:7b
ollama pull qwen2.5:14b
ollama pull qwen2.5:32b
ollama pull qwen2.5:72b
ollama pull qwen2.5-coder:32b
cat > Qwen-Modelfile << 'EOF'
FROM qwen2.5:72b
PARAMETER num_ctx 32768
PARAMETER temperature 0.7
PARAMETER top_p 0.9
SYSTEM "You are a helpful AI assistant. Respond in the same language as the user."
EOF
ollama create qwen25-72b-32k -f Qwen-Modelfile
5. vLLM Production Deployment
# 72B instruct model, single-GPU tensor parallel, 32K context
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-72B-Instruct \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.92 \
  --max-model-len 32768 \
  --dtype bfloat16 \
  --port 8000

# 72B split across 2 GPUs for 64K context
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-72B-Instruct \
  --tensor-parallel-size 2 \
  --max-model-len 65536 \
  --port 8000

# Coder-32B on a separate port for code workloads
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-Coder-32B-Instruct \
  --tensor-parallel-size 1 \
  --max-model-len 32768 \
  --port 8001
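With both servers up, a thin client can route each request to the right model. The sketch below is illustrative, not part of the official stack: the `pick_endpoint` keyword heuristic and the `ask` helper are hypothetical names, and the port assignments follow the commands above.

```python
# Port assignments follow the vLLM commands above; adjust hosts as needed.
ENDPOINTS = {
    "general": {"base_url": "http://localhost:8000/v1",
                "model": "Qwen/Qwen2.5-72B-Instruct"},
    "code":    {"base_url": "http://localhost:8001/v1",
                "model": "Qwen/Qwen2.5-Coder-32B-Instruct"},
}

def pick_endpoint(prompt: str) -> str:
    """Illustrative heuristic: send code-looking prompts to the Coder server."""
    code_markers = ("def ", "class ", "```", "function", "bug", "refactor", "compile")
    text = prompt.lower()
    return "code" if any(m in text for m in code_markers) else "general"

def ask(prompt: str) -> str:
    # Lazy import so the routing logic has no hard dependency on the SDK
    from openai import OpenAI
    ep = ENDPOINTS[pick_endpoint(prompt)]
    client = OpenAI(base_url=ep["base_url"], api_key="qwen")
    resp = client.chat.completions.create(
        model=ep["model"],
        messages=[{"role": "user", "content": prompt}],
        temperature=0.3,
    )
    return resp.choices[0].message.content
```

In production you would route on explicit request metadata rather than keywords, but the two-endpoint split itself is a common pattern for mixing a generalist and a specialist model.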
6. Building AI Agent Workflows: ReAct Pattern
Qwen 2.5 excels at agentic tasks — following complex instructions, calling tools, and maintaining context across many steps.
ReAct Agent
import json
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="qwen")
tools = [
    {
        "type": "function",
        "function": {
            "name": "search_web",
            "description": "Search for information on the web",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string"}
                },
                "required": ["query"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "execute_python",
            "description": "Execute Python code and return the result",
            "parameters": {
                "type": "object",
                "properties": {
                    "code": {"type": "string"}
                },
                "required": ["code"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "read_file",
            "description": "Read file contents",
            "parameters": {
                "type": "object",
                "properties": {
                    "path": {"type": "string"}
                },
                "required": ["path"]
            }
        }
    }
]
def run_agent(user_task: str, max_steps: int = 10) -> str:
    messages = [
        {
            "role": "system",
            "content": "You are a helpful agent. Use the available tools to solve the task step by step."
        },
        {"role": "user", "content": user_task}
    ]
    for step in range(max_steps):
        response = client.chat.completions.create(
            model="Qwen/Qwen2.5-72B-Instruct",
            messages=messages,
            tools=tools,
            tool_choice="auto",
            temperature=0.2
        )
        msg = response.choices[0].message
        messages.append(msg)
        if not msg.tool_calls:
            return msg.content
        for tool_call in msg.tool_calls:
            result = execute_tool(
                tool_call.function.name,
                json.loads(tool_call.function.arguments)
            )
            messages.append({
                "role": "tool",
                "tool_call_id": tool_call.id,
                "content": str(result)
            })
    return "Agent stopped: max_steps reached"

def execute_tool(name: str, args: dict) -> str:
    if name == "search_web":
        return "Search is not wired up in this demo."  # plug a real search API in here
    if name == "execute_python":
        import subprocess
        result = subprocess.run(
            ["python", "-c", args["code"]],
            capture_output=True, text=True, timeout=30
        )
        return result.stdout or result.stderr
    if name == "read_file":
        with open(args["path"]) as f:
            return f.read()
    return f"Unknown tool: {name}"

result = run_agent(
    "Read data.csv and compute the average of the 'price' column."  # example task
)
print(result)
7. Multi-Agent Pipeline
from openai import OpenAI
class QwenAgent:
    def __init__(self, name: str, system_prompt: str, model: str = "qwen2.5:72b"):
        self.name = name
        self.system_prompt = system_prompt
        self.client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
        self.model = model

    def run(self, task: str, context: str = "") -> str:
        messages = [
            {"role": "system", "content": self.system_prompt},
            {"role": "user", "content": f"Context:\n{context}\n\nTask: {task}"}
        ]
        response = self.client.chat.completions.create(
            model=self.model,
            messages=messages,
            temperature=0.3
        )
        return response.choices[0].message.content
researcher = QwenAgent(
"Researcher",
"You are a research specialist. Collect and summarize information about the requested topic."
)
coder = QwenAgent(
"Coder",
"You are a senior software engineer. Write clean Python code with tests and documentation.",
model="qwen2.5-coder:32b"
)
reviewer = QwenAgent(
"Reviewer",
"You are a technical reviewer. Evaluate code for performance, security, and best practices."
)
def run_pipeline(topic: str) -> dict:
    research = researcher.run(f"Research this topic: {topic}")  # illustrative prompts
    code = coder.run(
        "Implement a solution based on the research findings.",
        context=research
    )
    review = reviewer.run(
        "Review the following code.",
        context=code
    )
    return {"research": research, "code": code, "review": review}

result = run_pipeline("Build a REST API rate limiter")  # example topic
8. Structured Output with Pydantic
from openai import OpenAI
from pydantic import BaseModel
from typing import List
import json

# Any OpenAI-compatible endpoint works; the Ollama endpoint is shown here.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

class TaskBreakdown(BaseModel):
    objective: str
    subtasks: List[str]
    estimated_complexity: str
    dependencies: List[str]
    suggested_tech_stack: List[str]

def analyze_task(description: str) -> TaskBreakdown:
    response = client.chat.completions.create(
        model="qwen2.5:72b",
        messages=[
            {
                "role": "system",
                "content": "You are a technical architect. Analyze tasks and return JSON."
            },
            {
                "role": "user",
                "content": f"""Analyze this task following the TaskBreakdown schema:

Task: {description}

Schema:
{{
  "objective": "string",
  "subtasks": ["string"],
  "estimated_complexity": "low|medium|high",
  "dependencies": ["string"],
  "suggested_tech_stack": ["string"]
}}

Return only pure JSON, no other text."""
            }
        ],
        temperature=0.1,
        response_format={"type": "json_object"}
    )
    data = json.loads(response.choices[0].message.content)
    return TaskBreakdown(**data)
result = analyze_task("Build a RAG-powered customer service chatbot system")
print(f"Complexity: {result.estimated_complexity}")
print(f"Subtasks: {', '.join(result.subtasks)}")
9. Production Code Assistant
class QwenCodeAssistant:
    def __init__(self):
        self.client = OpenAI(
            base_url="http://localhost:8001/v1",
            api_key="qwen"
        )

    def generate_code(self, spec: str, language: str = "python") -> str:
        response = self.client.chat.completions.create(
            model="qwen2.5-coder-32b-instruct",
            messages=[
                {
                    "role": "system",
                    "content": f"You are an expert {language} developer. Write clean code with docstrings and unit tests."
                },
                {"role": "user", "content": spec}
            ],
            temperature=0.1,
            max_tokens=4096
        )
        return response.choices[0].message.content

    def review_code(self, code: str) -> str:
        response = self.client.chat.completions.create(
            model="qwen2.5-coder-32b-instruct",
            messages=[
                {
                    "role": "system",
                    "content": "Review code and identify: bugs, security issues, performance bottlenecks, style problems."
                },
                {"role": "user", "content": f"```\n{code}\n```"}
            ],
            temperature=0.2
        )
        return response.choices[0].message.content

    def explain_code(self, code: str) -> str:  # illustrative third helper
        response = self.client.chat.completions.create(
            model="qwen2.5-coder-32b-instruct",
            messages=[{
                "role": "user",
                "content": f"Explain what this code does:\n```\n{code}\n```"
            }],
            temperature=0.3
        )
        return response.choices[0].message.content

assistant = QwenCodeAssistant()
code = assistant.generate_code(
    "Write a thread-safe LRU cache class"  # example spec
)
print(code)
10. Qwen 2.5 vs Other Models
| Criterion | Qwen 2.5-72B | Llama 3.3 70B | DeepSeek V3 | GPT-4o |
|---|---|---|---|---|
| Coding (HumanEval) | 88.2% | 88.4% | 91.6% | 90.2% |
| Specialized coding (Coder-32B) | 92.7% | N/A | N/A | N/A |
| Vietnamese | Best open-source | Good | Good | Best (API) |
| Chinese | Best | Weak | Good | Good |
| Agentic tasks | ✅ Excellent | ✅ Good | ✅ Good | ✅ Best |
| Context window | 128K | 128K | 64K | 128K |
| Self-host cost | Medium | Medium | High (671B) | N/A |
11. Model Selection by Use Case
| Use Case | Recommended Model | Reason |
|---|---|---|
| Code generation & review | Qwen2.5-Coder-32B | SOTA coding, beats GPT-4o |
| Vietnamese/Chinese tasks | Qwen2.5-72B | 151K vocab, superior training data |
| Complex AI agents | Qwen2.5-72B | Strong tool calling, long context |
| Edge/mobile | Qwen2.5-1.5B or 3B | Compact, runs offline |
| Math reasoning | Qwen2.5-Math-72B | Specialized for mathematics |
| General API server | Qwen2.5-32B | Good performance/cost balance |
| Speculative draft | Qwen2.5-1.5B | High speed when paired with 72B |
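The speculative-draft pairing in the last row can be reasoned about with the standard expected-speedup formula from the speculative decoding literature (Leviathan et al., 2023). The numbers below are illustrative assumptions for a 1.5B draft paired with a 72B target, not measured results:

```python
def speculative_speedup(alpha: float, gamma: int, c: float) -> float:
    """Expected wall-clock speedup of speculative decoding:
    alpha = draft-token acceptance rate,
    gamma = draft tokens proposed per step,
    c     = draft-model cost relative to the target model."""
    return (1 - alpha ** (gamma + 1)) / ((1 - alpha) * (gamma * c + 1))

# Illustrative, unmeasured numbers for a 1.5B draft + 72B target:
print(round(speculative_speedup(alpha=0.8, gamma=4, c=0.05), 2))  # → 2.8
```

A small, well-aligned draft model (high alpha, tiny c) is what makes the 1.5B + 72B pairing attractive; if acceptance drops, the speedup degrades quickly.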
Conclusion
Qwen 2.5 stands out through three core strengths:
- Coding SOTA: Qwen2.5-Coder-32B is the best open-source model for coding, surpassing GPT-4o on HumanEval (92.7%) and LiveCodeBench
- Multilingual leader: 151K token vocabulary and diverse training data make Qwen the top choice for Vietnamese and Chinese among self-hosted models
- Powerful AI agents: Accurate tool calling, stable instruction following, and structured JSON output support — all essential for production agent workflows
For enterprises wanting to self-host the best multilingual model or build production-grade code assistants and AI agents, Qwen 2.5 is the unrivaled choice in the open-source world.