Why Qwen 3.5 Matters for Production LLM Systems
The gap between research-grade LLM demos and production-grade inference systems is massive. Latency spikes. GPU memory bottlenecks. Throughput collapse under concurrency.
This is where Qwen 3.5 becomes interesting.
Qwen 3.5 offers:
- Strong multilingual reasoning
- Competitive coding benchmarks
- Efficient parameter scaling
- Excellent compatibility with open inference engines like vLLM
The real power isn’t just in the model — it’s in how you deploy it.
In this guide, we’ll cover:
- Running Qwen 3.5 with vLLM for high-throughput serving
- Optimizing tensor parallelism and memory
- Deploying local inference on an Azure GPU VM
- Cost-performance tradeoffs
Let’s break it down like engineers.
System Architecture Overview
Here’s the high-level deployment architecture for serving Qwen 3.5 via vLLM on Azure:
The key pieces:
- vLLM Engine — Handles batching, PagedAttention, memory optimization
- Azure GPU VM — Provides CUDA-enabled GPU hardware
- OpenAI-Compatible API — Makes integration seamless
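Because the API surface follows the OpenAI chat-completions schema, any client that already speaks that schema can target vLLM just by swapping the base URL. A minimal sketch of how such a request is shaped (`build_chat_request` is a hypothetical helper, not part of vLLM; the endpoint path follows the OpenAI convention):

```python
# Sketch: build the URL + JSON payload for the OpenAI-compatible endpoint.
# build_chat_request is a hypothetical helper for illustration only.
def build_chat_request(base_url: str, model: str, prompt: str):
    """Return the (url, json_payload) pair for a chat-completions call."""
    url = f"{base_url}/v1/chat/completions"
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return url, payload

url, payload = build_chat_request(
    "http://localhost:8000", "Qwen/Qwen2.5-7B-Instruct", "Hello"
)
print(url)  # http://localhost:8000/v1/chat/completions
```

In real use you would pass `url` and `payload` to any HTTP client, or point an existing OpenAI SDK client at the same base URL.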
What Makes vLLM Special?
Traditional inference servers struggle with:
- Fragmented KV cache
- Poor batching under dynamic loads
- GPU memory waste
- Low throughput under concurrency
vLLM solves this with PagedAttention, which:
- Dynamically allocates KV cache blocks
- Supports continuous batching
- Maximizes GPU utilization
- Reduces memory fragmentation
In practice, vLLM improves throughput by 2–4x compared to naive Hugging Face Transformers serving.
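The core idea behind PagedAttention can be sketched in a few lines: instead of reserving one contiguous KV-cache region per sequence up front, the cache is split into fixed-size blocks that are handed out only as tokens are actually generated. This is a toy illustration, not vLLM's actual implementation (the block size of 16 matches vLLM's default):

```python
# Toy sketch of paged KV-cache allocation (illustrative, not vLLM internals).
BLOCK_SIZE = 16  # tokens per KV-cache block

class PagedKVCache:
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # sequence id -> list of physical block ids

    def append_token(self, seq_id: int, pos: int) -> int:
        """Allocate a new block only when a sequence crosses a block boundary."""
        table = self.block_tables.setdefault(seq_id, [])
        if pos % BLOCK_SIZE == 0:  # current block is full (or this is the first token)
            table.append(self.free_blocks.pop())
        return table[-1]           # physical block holding this token

cache = PagedKVCache(num_blocks=8)
for pos in range(40):              # generate 40 tokens for one sequence
    cache.append_token(seq_id=0, pos=pos)
print(len(cache.block_tables[0]))  # 3 blocks for 40 tokens, not a fixed reservation
```

Memory grows with actual generation length, which is what lets vLLM pack many concurrent sequences onto one GPU.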
Step 1: Running Qwen 3.5 with vLLM (Local or Server)
Install Dependencies
```bash
pip install vllm
pip install torch --index-url https://download.pytorch.org/whl/cu121
```
Make sure:
- CUDA version matches your GPU
- NVIDIA driver is properly installed
Start vLLM Server with Qwen 3.5
```bash
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-7B-Instruct \
  --dtype float16 \
  --tensor-parallel-size 1 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90
```
Key parameters explained:
| Parameter | What It Does | Recommended Setting |
|---|---|---|
| `--dtype` | Precision mode | `float16` or `bfloat16` |
| `--tensor-parallel-size` | Number of GPUs to shard the model across | 1 (single-GPU VM) |
| `--max-model-len` | Maximum context length in tokens | 8192, or 32k depending on variant |
| `--gpu-memory-utilization` | Fraction of VRAM vLLM may claim | 0.85–0.95 |
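The interaction between `--max-model-len` and `--gpu-memory-utilization` comes down to KV-cache size. A back-of-envelope calculation, using Qwen2.5-7B's published config values (28 layers, 4 KV heads via GQA, head dim 128; double-check these against the model card):

```python
# Rough KV-cache sizing behind --max-model-len / --gpu-memory-utilization.
# Model shape values are taken from Qwen2.5-7B's config and are assumptions here.
def kv_cache_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                             dtype_bytes: int = 2) -> int:
    # 2x for the separate key and value tensors at every layer
    return 2 * layers * kv_heads * head_dim * dtype_bytes

per_token = kv_cache_bytes_per_token(layers=28, kv_heads=4, head_dim=128)
per_seq_gb = per_token * 8192 / 1024**3  # one full 8192-token sequence
print(f"{per_token} B/token, {per_seq_gb:.2f} GiB per full-length sequence")
```

So every concurrent 8k-context sequence costs well under half a GiB of cache on this model, which is why a single GPU can serve dozens of users once weights are loaded.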
Once started, the server exposes an OpenAI-compatible API at:
```
http://localhost:8000/v1/chat/completions
```
Example Request
```python
import requests

response = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "Qwen/Qwen2.5-7B-Instruct",
        "messages": [
            {"role": "user", "content": "Explain PagedAttention in simple terms."}
        ]
    }
)

print(response.json())
```
That's it. You now have a production-grade inference server.
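For interactive workloads you will usually want streaming instead: add `"stream": true` to the request body and the server answers with Server-Sent Events, one `data: {json chunk}` line per delta, terminated by `data: [DONE]`. A sketch of the client-side parsing (`parse_sse_chunks` is a hypothetical helper, shown here against a canned sample; in real use you would feed it `response.iter_lines()`):

```python
# Sketch: extract text deltas from an OpenAI-style SSE stream.
import json

def parse_sse_chunks(lines):
    """Yield the content deltas from 'data: ...' lines, stopping at [DONE]."""
    for line in lines:
        if not line.startswith("data: "):
            continue
        body = line[len("data: "):]
        if body == "[DONE]":
            return
        chunk = json.loads(body)
        delta = chunk["choices"][0]["delta"].get("content", "")
        if delta:
            yield delta

# Canned example of the wire format:
sample = [
    'data: {"choices": [{"delta": {"content": "Paged"}}]}',
    'data: {"choices": [{"delta": {"content": "Attention"}}]}',
    'data: [DONE]',
]
print("".join(parse_sse_chunks(sample)))  # PagedAttention
```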
Performance Benchmarks (Real-World Expectations)
Let’s compare deployment options.
| Deployment Type | GPU | Tokens/sec | Concurrency | Cost Efficiency |
|---|---|---|---|---|
| HF Transformers | A10 | ~35 tok/s | Low | Medium |
| vLLM | A10 | ~90 tok/s | High | High |
| vLLM + A100 | A100 80GB | 180–220 tok/s | Very High | Very High |
| CPU Only | 32-core | <5 tok/s | Very Low | Poor |
vLLM nearly triples throughput under load.
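To turn the tokens/sec column into capacity planning numbers, divide sustained throughput by your average response length. A quick sketch using the table's figures (300 tokens per response is an assumed workload average):

```python
# Capacity math from the benchmark table: sustained tokens/sec divided by
# average response length gives responses per minute the box can serve.
def responses_per_minute(tokens_per_sec: float, avg_response_tokens: int) -> float:
    return tokens_per_sec / avg_response_tokens * 60

print(round(responses_per_minute(90, avg_response_tokens=300), 1))  # vLLM on A10
print(round(responses_per_minute(35, avg_response_tokens=300), 1))  # HF Transformers
```

The same hardware goes from roughly 7 to roughly 18 full responses per minute just by switching serving stacks.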
Step 2: Deploying Qwen 3.5 on Azure VM (Local Inference)
Now let’s deploy this properly in the cloud.
1️⃣ Choose the Right Azure VM
Recommended GPU VM types:
| VM Type | GPU | VRAM | Good For |
|---|---|---|---|
| Standard_NC4as_T4_v3 | T4 | 16GB | 7B models |
| Standard_NC24ads_A100_v4 | A100 | 80GB | 14B+ models |
| Standard_ND96amsr_A100_v4 | 8x A100 | 640GB | Large-scale serving |
For Qwen 7B, a T4 or A10 is sufficient.
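A quick sanity check behind that sizing claim: fp16 weights take about 2 bytes per parameter, and Qwen2.5-7B is roughly 7.6B parameters (an assumption worth verifying against the model card):

```python
# VRAM fit check: fp16 weight footprint, before KV cache and activations.
def weight_gb(params_billion: float, bytes_per_param: int = 2) -> float:
    return params_billion * 1e9 * bytes_per_param / 1024**3

w = weight_gb(7.6)  # ~7.6B parameters assumed for Qwen2.5-7B
print(f"~{w:.1f} GiB of fp16 weights")
```

That lands around 14 GiB, which fits a 16 GiB T4 only with careful settings; this is exactly why the tuning section below recommends a lower `--gpu-memory-utilization` and a shorter `--max-model-len` on 16 GiB cards.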
2️⃣ Create Azure VM
```bash
az vm create \
  --resource-group myRG \
  --name qwen-vm \
  --image Ubuntu2204 \
  --size Standard_NC4as_T4_v3 \
  --admin-username azureuser \
  --generate-ssh-keys
```
3️⃣ Install NVIDIA Drivers
```bash
sudo apt update
sudo apt install nvidia-driver-535
sudo reboot
```
Verify:
```bash
nvidia-smi
```
4️⃣ Install CUDA + PyTorch
```bash
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121
pip install vllm
```
Optimizing for Local Inference
Enable Swap for Stability
```bash
sudo fallocate -l 16G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
```
Prevents host-side OOM crashes when large context windows push system memory usage up.
Tune GPU Memory Usage
For 16GB GPUs:
```
--gpu-memory-utilization 0.85
--max-model-len 4096
```
For 80GB GPUs:
```
--gpu-memory-utilization 0.95
--max-model-len 32768
```
Production-Ready Deployment Architecture on Azure
Add:
- NGINX for rate limiting
- Redis for response caching
- Prometheus + Grafana for monitoring
Cost Considerations
Running Qwen locally on Azure is dramatically cheaper than API usage at scale.
Example:
| Scenario | Monthly Cost | Notes |
|---|---|---|
| OpenAI API (10M tokens/day) | $2,000+ | Usage-based |
| Azure T4 VM (24/7) | ~$450 | Fixed cost |
| Azure A100 VM (24/7) | ~$2,200 | Enterprise scale |
If your workload is steady, local inference wins.
If your workload is bursty, API usage might be cheaper.
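The steady-vs-bursty tradeoff reduces to a break-even token volume. A sketch using the table's T4 price and an assumed blended API rate of $5 per million tokens (purely illustrative; plug in your provider's actual pricing):

```python
# Break-even sketch: a fixed-cost VM wins once monthly token volume pushes
# usage-based billing past the VM's flat rate. The $5/M API rate is assumed.
def breakeven_tokens_per_month(vm_monthly_usd: float,
                               api_usd_per_million_tokens: float) -> float:
    return vm_monthly_usd / api_usd_per_million_tokens * 1e6

tokens = breakeven_tokens_per_month(vm_monthly_usd=450,
                                    api_usd_per_million_tokens=5.0)
print(f"T4 VM breaks even at ~{tokens / 1e6:.0f}M tokens/month")
```

Below that volume the API is cheaper; above it, and especially with round-the-clock steady load, the VM wins.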
Advanced: Quantization for Smaller GPUs
For lower VRAM GPUs, use:
- AWQ quantization
- GPTQ
- 4-bit quantized weights
Example:
```
--quantization awq
```
This cuts VRAM usage by roughly 50%, with only a slight quality drop (typically 1–3%).
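On weights alone the saving is even larger than 50%: 4-bit storage is a ~4x reduction versus fp16, before the overhead of quantization scales and zero-points; the overall VRAM saving is smaller because KV cache and activations stay in 16-bit. A quick sketch (the 7.6B parameter count is an assumption):

```python
# Weights-only memory at different bit widths (ignores scale/zero-point overhead).
def quantized_weight_gb(params_billion: float, bits: int) -> float:
    return params_billion * 1e9 * bits / 8 / 1024**3

fp16 = quantized_weight_gb(7.6, bits=16)
awq4 = quantized_weight_gb(7.6, bits=4)
print(f"fp16 ~{fp16:.1f} GiB vs 4-bit ~{awq4:.1f} GiB of weights")
```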
Observability & Monitoring
Track:
- GPU utilization
- Memory fragmentation
- Tokens/sec
- p95 latency
Use:
```bash
watch -n 1 nvidia-smi
```
And integrate:
- Prometheus exporters
- Grafana dashboards
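The vLLM API server also exposes Prometheus metrics at `/metrics` on the same port, in the standard text exposition format. A minimal scraping sketch (`parse_prom` is a deliberately simplified parser, shown against a canned sample; the metric names shown are illustrative and should be checked against your vLLM version's actual output):

```python
# Sketch: parse Prometheus text-format metrics as served at /metrics.
def parse_prom(text: str) -> dict:
    """Very small parser: 'name value' lines, comments skipped."""
    metrics = {}
    for line in text.splitlines():
        if line.startswith("#") or not line.strip():
            continue
        name, _, value = line.rpartition(" ")
        metrics[name] = float(value)
    return metrics

# Canned sample of what a scrape can look like (names illustrative):
sample = """# HELP vllm:num_requests_running Running requests
vllm:num_requests_running 3.0
vllm:gpu_cache_usage_perc 0.42"""
print(parse_prom(sample)["vllm:gpu_cache_usage_perc"])
```

In production you would point a Prometheus exporter at the endpoint and graph these in Grafana rather than hand-parsing.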
Production LLM systems fail not because of models — but because of missing observability.
Common Pitfalls
- ❌ CUDA mismatch
- ❌ Using float32 instead of float16
- ❌ Not limiting max_model_len
- ❌ No swap memory
- ❌ Ignoring KV cache memory growth
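These pitfalls are cheap to catch before launch. A preflight sketch (`preflight` is a hypothetical helper; it only warns, it does not fix anything):

```python
# Preflight sketch covering the pitfalls above (hypothetical helper).
import shutil

def preflight(dtype, max_model_len, vram_gb):
    """Return a list of warnings for common misconfigurations."""
    warnings = []
    if shutil.which("nvidia-smi") is None:
        warnings.append("nvidia-smi not found: GPU driver likely missing")
    if dtype == "float32":
        warnings.append("float32 doubles weight memory; prefer float16/bfloat16")
    if max_model_len is None:
        warnings.append("max_model_len unset: KV cache can grow until OOM")
    elif vram_gb <= 16 and max_model_len > 4096:
        warnings.append("long context on a 16 GiB GPU risks OOM")
    return warnings

for msg in preflight(dtype="float32", max_model_len=None, vram_gb=16):
    print("WARN:", msg)
```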
Real-World Latency Expectations
For Qwen 7B on T4:
- First token latency: 400–800ms
- Generation speed: 80–100 tok/sec
- Stable concurrency: 15–30 users
On A100:
- First token latency: 200–400ms
- Generation speed: 200+ tok/sec
- 100+ concurrent users possible
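To verify your own deployment against these numbers, measure time-to-first-token (TTFT) by streaming a response and timestamping the first delta. A self-contained sketch (`measure_ttft` is a hypothetical helper; the fake generator stands in for a real SSE chunk iterator so the example runs without a server):

```python
# Sketch: measure time-to-first-token against a streamed response.
import time

def measure_ttft(chunk_iter) -> float:
    """Seconds from call start until the first streamed chunk arrives."""
    start = time.perf_counter()
    for _ in chunk_iter:
        return time.perf_counter() - start
    return float("inf")

# Stand-in generator so the sketch is self-contained; in real use, pass the
# streaming-response chunk iterator instead.
def fake_stream():
    time.sleep(0.05)  # pretend the model took ~50 ms to emit its first token
    yield "Paged"

ttft = measure_ttft(fake_stream())
print(f"TTFT: {ttft * 1000:.0f} ms")
```

Run this against the live endpoint under increasing concurrency to find where your p95 TTFT leaves the ranges quoted above.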