Why Qwen 3.5 Matters for Production LLM Systems
The gap between research-grade LLM demos and production-grade inference systems is massive. Latency spikes. GPU memory bottlenecks. Throughput collapse under concurrency.
This is where Qwen 3.5 becomes interesting.
Qwen 3.5 offers:
- Strong multilingual reasoning
- Competitive coding benchmarks
- Efficient parameter scaling
- Excellent compatibility with open inference engines like vLLM
The real power isn’t just in the model — it’s in how you deploy it.
In this guide, we’ll cover:
- Running Qwen 3.5 with vLLM for high-throughput serving
- Optimizing tensor parallelism and memory
- Deploying local inference on an Azure GPU VM
- Cost-performance tradeoffs
Let’s break it down like engineers.
System Architecture Overview
Here’s the high-level deployment architecture for serving Qwen 3.5 via vLLM on Azure:
The key pieces:
- vLLM Engine — Handles batching, PagedAttention, memory optimization
- Azure GPU VM — Provides CUDA-enabled GPU hardware
- OpenAI-Compatible API — Makes integration seamless
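Because the API surface follows the OpenAI chat-completions schema, any client that already speaks that schema can target vLLM just by swapping the base URL. A minimal sketch of how such a request is shaped (`build_chat_request` is a hypothetical helper, not part of vLLM; the endpoint path follows the OpenAI convention):

```python
# Sketch: build the URL + JSON payload for the OpenAI-compatible endpoint.
# build_chat_request is a hypothetical helper for illustration only.
def build_chat_request(base_url: str, model: str, prompt: str):
    """Return the (url, json_payload) pair for a chat-completions call."""
    url = f"{base_url}/v1/chat/completions"
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return url, payload

url, payload = build_chat_request(
    "http://localhost:8000", "Qwen/Qwen2.5-7B-Instruct", "Hello"
)
print(url)  # http://localhost:8000/v1/chat/completions
```

In real use you would pass `url` and `payload` to any HTTP client, or point an existing OpenAI SDK client at the same base URL.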
What Makes vLLM Special?
Traditional inference servers struggle with:
- Fragmented KV cache
- Poor batching under dynamic loads
- GPU memory waste
- Low throughput under concurrency
vLLM solves this with PagedAttention, which:
- Dynamically allocates KV cache blocks
- Supports continuous batching
- Maximizes GPU utilization
- Reduces memory fragmentation
In practice, vLLM improves throughput by 2–4x compared to naive Hugging Face Transformers serving.
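The core idea behind PagedAttention can be sketched in a few lines: instead of reserving one contiguous KV-cache region per sequence up front, the cache is split into fixed-size blocks that are handed out only as tokens are actually generated. This is a toy illustration, not vLLM's actual implementation (the block size of 16 matches vLLM's default):

```python
# Toy sketch of paged KV-cache allocation (illustrative, not vLLM internals).
BLOCK_SIZE = 16  # tokens per KV-cache block

class PagedKVCache:
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # sequence id -> list of physical block ids

    def append_token(self, seq_id: int, pos: int) -> int:
        """Allocate a new block only when a sequence crosses a block boundary."""
        table = self.block_tables.setdefault(seq_id, [])
        if pos % BLOCK_SIZE == 0:  # current block is full (or this is the first token)
            table.append(self.free_blocks.pop())
        return table[-1]           # physical block holding this token

cache = PagedKVCache(num_blocks=8)
for pos in range(40):              # generate 40 tokens for one sequence
    cache.append_token(seq_id=0, pos=pos)
print(len(cache.block_tables[0]))  # 3 blocks for 40 tokens, not a fixed reservation
```

Memory grows with actual generation length, which is what lets vLLM pack many concurrent sequences onto one GPU.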
Step 1: Running Qwen 3.5 with vLLM (Local or Server)
Install Dependencies
```bash
pip install vllm
pip install torch --index-url https://download.pytorch.org/whl/cu121
```
Make sure:
- CUDA version matches your GPU
- NVIDIA driver is properly installed
Start vLLM Server with Qwen 3.5
```bash
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-7B-Instruct \
  --dtype float16 \
  --tensor-parallel-size 1 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90
```
Key parameters explained:
| Parameter | What It Does | Recommended Setting |
|---|---|---|
| `--dtype` | Precision mode | `float16` or `bfloat16` |
| `--tensor-parallel-size` | Number of GPUs to shard the model across | 1 (single-GPU VM) |
| `--max-model-len` | Maximum context length in tokens | 8192, or 32k depending on variant |
| `--gpu-memory-utilization` | Fraction of VRAM vLLM may claim | 0.85–0.95 |
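The interaction between `--max-model-len` and `--gpu-memory-utilization` comes down to KV-cache size. A back-of-envelope calculation, using Qwen2.5-7B's published config values (28 layers, 4 KV heads via GQA, head dim 128; double-check these against the model card):

```python
# Rough KV-cache sizing behind --max-model-len / --gpu-memory-utilization.
# Model shape values are taken from Qwen2.5-7B's config and are assumptions here.
def kv_cache_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                             dtype_bytes: int = 2) -> int:
    # 2x for the separate key and value tensors at every layer
    return 2 * layers * kv_heads * head_dim * dtype_bytes

per_token = kv_cache_bytes_per_token(layers=28, kv_heads=4, head_dim=128)
per_seq_gb = per_token * 8192 / 1024**3  # one full 8192-token sequence
print(f"{per_token} B/token, {per_seq_gb:.2f} GiB per full-length sequence")
```

So every concurrent 8k-context sequence costs well under half a GiB of cache on this model, which is why a single GPU can serve dozens of users once weights are loaded.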
Once started, the server exposes an OpenAI-compatible API at:
```
http://localhost:8000/v1/chat/completions
```
Example Request
```python
import requests

response = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "Qwen/Qwen2.5-7B-Instruct",
        "messages": [
            {"role": "user", "content": "Explain PagedAttention in simple terms."}
        ]
    }
)

print(response.json())
```
That's it. You now have a production-grade inference server.
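For interactive workloads you will usually want streaming instead: add `"stream": true` to the request body and the server answers with Server-Sent Events, one `data: {json chunk}` line per delta, terminated by `data: [DONE]`. A sketch of the client-side parsing (`parse_sse_chunks` is a hypothetical helper, shown here against a canned sample; in real use you would feed it `response.iter_lines()`):

```python
# Sketch: extract text deltas from an OpenAI-style SSE stream.
import json

def parse_sse_chunks(lines):
    """Yield the content deltas from 'data: ...' lines, stopping at [DONE]."""
    for line in lines:
        if not line.startswith("data: "):
            continue
        body = line[len("data: "):]
        if body == "[DONE]":
            return
        chunk = json.loads(body)
        delta = chunk["choices"][0]["delta"].get("content", "")
        if delta:
            yield delta

# Canned example of the wire format:
sample = [
    'data: {"choices": [{"delta": {"content": "Paged"}}]}',
    'data: {"choices": [{"delta": {"content": "Attention"}}]}',
    'data: [DONE]',
]
print("".join(parse_sse_chunks(sample)))  # PagedAttention
```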
Performance Benchmarks (Real-World Expectations)
Let’s compare deployment options.
| Deployment Type | GPU | Tokens/sec | Concurrency | Cost Efficiency |
|---|---|---|---|---|
| HF Transformers | A10 | ~35 tok/s | Low | Medium |
| vLLM | A10 | ~90 tok/s | High | High |
| vLLM + A100 | A100 80GB | 180–220 tok/s | Very High | Very High |
| CPU Only | 32-core | <5 tok/s | Very Low | Poor |
vLLM nearly triples throughput under load.
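To turn the tokens/sec column into capacity planning numbers, divide sustained throughput by your average response length. A quick sketch using the table's figures (300 tokens per response is an assumed workload average):

```python
# Capacity math from the benchmark table: sustained tokens/sec divided by
# average response length gives responses per minute the box can serve.
def responses_per_minute(tokens_per_sec: float, avg_response_tokens: int) -> float:
    return tokens_per_sec / avg_response_tokens * 60

print(round(responses_per_minute(90, avg_response_tokens=300), 1))  # vLLM on A10
print(round(responses_per_minute(35, avg_response_tokens=300), 1))  # HF Transformers
```

The same hardware goes from roughly 7 to roughly 18 full responses per minute just by switching serving stacks.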
Step 2: Deploying Qwen 3.5 on Azure VM (Local Inference)
Now let’s deploy this properly in the cloud.
1️⃣ Choose the Right Azure VM
Recommended GPU VM types:
| VM Type | GPU | VRAM | Good For |
|---|---|---|---|
| Standard_NC4as_T4_v3 | T4 | 16GB | 7B models |
| Standard_NC24ads_A100_v4 | A100 | 80GB | 14B+ models |
| Standard_ND96amsr_A100_v4 | 8x A100 | 640GB | Large-scale serving |
For Qwen 7B, a T4 or A10 is sufficient.
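A quick sanity check behind that sizing claim: fp16 weights take about 2 bytes per parameter, and Qwen2.5-7B is roughly 7.6B parameters (an assumption worth verifying against the model card):

```python
# VRAM fit check: fp16 weight footprint, before KV cache and activations.
def weight_gb(params_billion: float, bytes_per_param: int = 2) -> float:
    return params_billion * 1e9 * bytes_per_param / 1024**3

w = weight_gb(7.6)  # ~7.6B parameters assumed for Qwen2.5-7B
print(f"~{w:.1f} GiB of fp16 weights")
```

That lands around 14 GiB, which fits a 16 GiB T4 only with careful settings; this is exactly why the tuning section below recommends a lower `--gpu-memory-utilization` and a shorter `--max-model-len` on 16 GiB cards.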
2️⃣ Create Azure VM
```bash
az vm create \
  --resource-group myRG \
  --name qwen-vm \
  --image Ubuntu2204 \
  --size Standard_NC4as_T4_v3 \
  --admin-username azureuser \
  --generate-ssh-keys
```
3️⃣ Install NVIDIA Drivers
```bash
sudo apt update
sudo apt install nvidia-driver-535
sudo reboot
```
Verify:
```bash
nvidia-smi
```
4️⃣ Install CUDA + PyTorch
```bash
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121
pip install vllm
```
Optimizing for Local Inference
Enable Swap for Stability
```bash
sudo fallocate -l 16G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
```
Prevents host-side OOM crashes when large context windows push system memory usage up.
Tune GPU Memory Usage
For 16GB GPUs:
```
--gpu-memory-utilization 0.85
--max-model-len 4096
```
For 80GB GPUs:
```
--gpu-memory-utilization 0.95
--max-model-len 32768
```
Production-Ready Deployment Architecture on Azure
Add:
- NGINX for rate limiting
- Redis for response caching
- Prometheus + Grafana for monitoring
Cost Considerations
Running Qwen locally on Azure is dramatically cheaper than API usage at scale.
Example:
| Scenario | Monthly Cost | Notes |
|---|---|---|
| OpenAI API (10M tokens/day) | $2,000+ | Usage-based |
| Azure T4 VM (24/7) | ~$450 | Fixed cost |
| Azure A100 VM (24/7) | ~$2,200 | Enterprise scale |
If your workload is steady, local inference wins.
If your workload is bursty, API usage might be cheaper.
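The steady-vs-bursty tradeoff reduces to a break-even token volume. A sketch using the table's T4 price and an assumed blended API rate of $5 per million tokens (purely illustrative; plug in your provider's actual pricing):

```python
# Break-even sketch: a fixed-cost VM wins once monthly token volume pushes
# usage-based billing past the VM's flat rate. The $5/M API rate is assumed.
def breakeven_tokens_per_month(vm_monthly_usd: float,
                               api_usd_per_million_tokens: float) -> float:
    return vm_monthly_usd / api_usd_per_million_tokens * 1e6

tokens = breakeven_tokens_per_month(vm_monthly_usd=450,
                                    api_usd_per_million_tokens=5.0)
print(f"T4 VM breaks even at ~{tokens / 1e6:.0f}M tokens/month")
```

Below that volume the API is cheaper; above it, and especially with round-the-clock steady load, the VM wins.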
Advanced: Quantization for Smaller GPUs
For lower VRAM GPUs, use:
- AWQ quantization
- GPTQ
- 4-bit quantized weights
Example:
```
--quantization awq
```
This cuts VRAM usage by roughly 50%, with only a slight quality drop (typically 1–3%).
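On weights alone the saving is even larger than 50%: 4-bit storage is a ~4x reduction versus fp16, before the overhead of quantization scales and zero-points; the overall VRAM saving is smaller because KV cache and activations stay in 16-bit. A quick sketch (the 7.6B parameter count is an assumption):

```python
# Weights-only memory at different bit widths (ignores scale/zero-point overhead).
def quantized_weight_gb(params_billion: float, bits: int) -> float:
    return params_billion * 1e9 * bits / 8 / 1024**3

fp16 = quantized_weight_gb(7.6, bits=16)
awq4 = quantized_weight_gb(7.6, bits=4)
print(f"fp16 ~{fp16:.1f} GiB vs 4-bit ~{awq4:.1f} GiB of weights")
```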
Observability & Monitoring
Track:
- GPU utilization
- Memory fragmentation
- Tokens/sec
- p95 latency
Use:
```bash
watch -n 1 nvidia-smi
```
And integrate:
- Prometheus exporters
- Grafana dashboards
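The vLLM API server also exposes Prometheus metrics at `/metrics` on the same port, in the standard text exposition format. A minimal scraping sketch (`parse_prom` is a deliberately simplified parser, shown against a canned sample; the metric names shown are illustrative and should be checked against your vLLM version's actual output):

```python
# Sketch: parse Prometheus text-format metrics as served at /metrics.
def parse_prom(text: str) -> dict:
    """Very small parser: 'name value' lines, comments skipped."""
    metrics = {}
    for line in text.splitlines():
        if line.startswith("#") or not line.strip():
            continue
        name, _, value = line.rpartition(" ")
        metrics[name] = float(value)
    return metrics

# Canned sample of what a scrape can look like (names illustrative):
sample = """# HELP vllm:num_requests_running Running requests
vllm:num_requests_running 3.0
vllm:gpu_cache_usage_perc 0.42"""
print(parse_prom(sample)["vllm:gpu_cache_usage_perc"])
```

In production you would point a Prometheus exporter at the endpoint and graph these in Grafana rather than hand-parsing.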
Production LLM systems fail not because of models — but because of missing observability.
Common Pitfalls
- ❌ CUDA mismatch
- ❌ Using float32 instead of float16
- ❌ Not limiting max_model_len
- ❌ No swap memory
- ❌ Ignoring KV cache memory growth
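These pitfalls are cheap to catch before launch. A preflight sketch (`preflight` is a hypothetical helper; it only warns, it does not fix anything):

```python
# Preflight sketch covering the pitfalls above (hypothetical helper).
import shutil

def preflight(dtype, max_model_len, vram_gb):
    """Return a list of warnings for common misconfigurations."""
    warnings = []
    if shutil.which("nvidia-smi") is None:
        warnings.append("nvidia-smi not found: GPU driver likely missing")
    if dtype == "float32":
        warnings.append("float32 doubles weight memory; prefer float16/bfloat16")
    if max_model_len is None:
        warnings.append("max_model_len unset: KV cache can grow until OOM")
    elif vram_gb <= 16 and max_model_len > 4096:
        warnings.append("long context on a 16 GiB GPU risks OOM")
    return warnings

for msg in preflight(dtype="float32", max_model_len=None, vram_gb=16):
    print("WARN:", msg)
```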
Real-World Latency Expectations
For Qwen 7B on T4:
- First token latency: 400–800ms
- Generation speed: 80–100 tok/sec
- Stable concurrency: 15–30 users
On A100:
- First token latency: 200–400ms
- Generation speed: 200+ tok/sec
- 100+ concurrent users possible
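To verify your own deployment against these numbers, measure time-to-first-token (TTFT) by streaming a response and timestamping the first delta. A self-contained sketch (`measure_ttft` is a hypothetical helper; the fake generator stands in for a real SSE chunk iterator so the example runs without a server):

```python
# Sketch: measure time-to-first-token against a streamed response.
import time

def measure_ttft(chunk_iter) -> float:
    """Seconds from call start until the first streamed chunk arrives."""
    start = time.perf_counter()
    for _ in chunk_iter:
        return time.perf_counter() - start
    return float("inf")

# Stand-in generator so the sketch is self-contained; in real use, pass the
# streaming-response chunk iterator instead.
def fake_stream():
    time.sleep(0.05)  # pretend the model took ~50 ms to emit its first token
    yield "Paged"

ttft = measure_ttft(fake_stream())
print(f"TTFT: {ttft * 1000:.0f} ms")
```

Run this against the live endpoint under increasing concurrency to find where your p95 TTFT leaves the ranges quoted above.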