Qwen 3

Qwen 3.5 is out. The Qwen 3.5 generation (February 2026) surpasses Qwen 3 on nearly every benchmark — with native multimodal support, faster inference, and a 35B-A3B model that outperforms the Qwen3-235B flagship. Read the full Qwen 3.5 guide

Qwen 3 was the generation that put Alibaba's open-weight models on the map. When it launched in April 2025, AI researcher Nathan Lambert called the lineup "the new open standard" — and he wasn't wrong. Six dense models, two MoE variants, 119 languages, and performance that matched or beat GPT-4-class systems, all under Apache 2.0.

That was ten months ago. Qwen 3.5 has since taken the crown, and Qwen 3 is now the previous generation. But "previous" doesn't mean irrelevant. The dense models remain the go-to choice for fine-tuning. The Qwen3-Next hybrid architecture introduced efficiency breakthroughs that directly shaped Qwen 3.5's design. And millions of production deployments still run Qwen 3 today.

This guide covers the full Qwen 3 family — including Qwen3-Next, which most sites completely overlook — and helps you decide whether Qwen 3 still makes sense for your use case or if it's time to move to 3.5.

Qwen 3 model family lineup showing dense and MoE variants from 0.6B to 235B parameters
The original Qwen 3 lineup: eight open-weight models spanning edge devices to multi-GPU servers.

Qwen3-Next: The Hybrid Architecture That Changed Everything

Bottom line: Qwen3-Next-80B-A3B is the single most efficient model in the entire Qwen 3 family. It matches or beats the 235B flagship on most benchmarks while activating only 3B parameters per token — a staggering 3.75% of its total 80B weights.

Released in September 2025, Qwen3-Next was Alibaba's proof-of-concept for a radically different architecture. Instead of the standard Transformer attention used in every other Qwen 3 model, it combines three mechanisms in a repeating block: GatedDeltaNet (linear attention for speed), GatedAttention (standard attention for precision), and MoE routing (512 experts, 10 active + 1 shared per token).

Qwen3-Next hybrid architecture diagram showing the repeating block of GatedDeltaNet, GatedAttention, and MoE layers
Qwen3-Next's repeating block: 3 GatedDeltaNet-MoE layers followed by 1 GatedAttention-MoE layer, stacked 12 times across 48 total layers.

The practical upshot? At sequences longer than 32K tokens, Qwen3-Next delivers 10x the throughput of Qwen3-32B. At around 1 million tokens, it hits 3x the speed of standard attention models. Training costs drop by 90% compared to Qwen3-32B-Base. This isn't incremental — it's a generational leap in efficiency.

Two variants exist: Qwen3-Next-80B-A3B-Instruct (non-thinking, optimized for chat and tool use) and Qwen3-Next-80B-A3B-Thinking (reasoning mode with chain-of-thought). Both support 262K native context, extendable to 1 million tokens via YaRN — and the RULER benchmark confirms this isn't just a marketing number. Qwen3-Next scores 91.8 average accuracy at 1M tokens, nearly matching the 235B flagship's 92.5 and far exceeding Qwen3-30B's 86.8.
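
Context extension beyond the 262K native window is a serving-time configuration. As a sketch, with vLLM the usual pattern from the Qwen model cards looks like this; treat the YaRN factor and the exact length as assumptions to verify against the model card (262144 × 4 ≈ 1M tokens):

```shell
# Sketch: enabling YaRN rope scaling in vLLM to extend Qwen3-Next's context.
# Factor and max length here are assumptions; check the model card.
vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct \
  --rope-scaling '{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":262144}' \
  --max-model-len 1000000
```

Note that YaRN scaling applies at all lengths once enabled, so it's best reserved for deployments that actually need the extended window.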

Qwen3-Next Instruct Benchmarks vs the Rest of the Family

| Benchmark | Qwen3-Next-80B | Qwen3-235B | Qwen3-32B |
|---|---|---|---|
| LiveBench | 75.8 | 75.4 | 59.8 |
| LiveCodeBench v6 | 56.6 | 51.8 | 29.1 |
| Arena-Hard v2 | 82.7 | 79.2 | 34.1 |
| WritingBench | 87.3 | 85.2 | 75.4 |
| AIME25 | 69.5 | 70.3 | 20.2 |
| MMLU-Pro | 80.6 | 83.0 | 71.9 |

Qwen3-Next wins on live benchmarks (LiveBench, LiveCodeBench, Arena-Hard, WritingBench) while the 235B retains an edge on static knowledge tests (MMLU-Pro, AIME25).

The pattern is clear: Qwen3-Next excels at tasks that reflect real-world usage (live coding, open-ended conversation, creative writing), while the 235B retains an edge on heavily studied academic benchmarks, which is worth keeping in mind when weighing these numbers.

The Thinking variant tells a slightly different story. On deep reasoning tasks, the 235B still wins:

Qwen3-Next Thinking Mode vs 235B and Gemini

| Benchmark | Qwen3-Next-Think | Qwen3-235B-Think | Gemini 2.5 Flash-Think |
|---|---|---|---|
| AIME25 | 87.8 | 92.3 | 72.0 |
| LiveCodeBench v6 | 68.7 | 74.1 | 61.2 |
| TAU2-Airline | 60.5 | 58.0 | 52.0 |
| TAU1-Retail | 69.6 | 67.8 | 65.2 |

The 235B leads on pure math and coding reasoning, but Qwen3-Next wins on real-world agentic tasks (TAU benchmarks). Both crush Gemini 2.5 Flash.

This split matters for choosing between the two. If you're building agents that interact with real-world systems — booking flights, handling customer service workflows, managing retail operations — Qwen3-Next's TAU benchmark leads suggest it handles structured, multi-step tasks better than even the 235B. For math competitions and complex coding challenges, the 235B remains king within the Qwen 3 family.

Community adoption has been strong. HuggingFace hosts 79 quantized versions and 34 fine-tunes of Qwen3-Next. You can run it today on Ollama, LMStudio, or serve it via AWS Bedrock, NVIDIA NIM, and Together AI. For a model that was positioned as a research preview, that's a remarkably mature ecosystem.

The Original Lineup: Dense and MoE Models

The April 2025 launch gave us six dense models (0.6B through 32B) and two MoE variants (30B-A3B and 235B-A22B). All share the same tokenizer, chat template, and hybrid thinking/non-thinking capability. Swap sizes without touching your code.

The dense models are where Qwen 3 still holds a genuine advantage over Qwen 3.5. If you're fine-tuning — which most production teams are — a dense 8B or 32B model is simpler to work with, easier to quantize predictably, and has a more mature LoRA ecosystem than any MoE variant. The 32B dense model remains one of the best creative writing models in the open-weight space, a fact the community consistently confirms on r/LocalLLaMA.

Qwen 3 base model comparison chart showing performance across different sizes from 0.6B to 235B
Performance scaling across the Qwen 3 base model lineup. The jump from 8B to 14B is where most benchmarks show a meaningful quality increase.

Quick Reference: All Qwen 3 Base Models

| Model | Type | Total / Active | Context | Min VRAM (Q4) | Best For |
|---|---|---|---|---|---|
| Qwen3-0.6B | Dense | 0.6B | 32K | ~2 GB | Edge, IoT, Raspberry Pi |
| Qwen3-1.7B | Dense | 1.7B | 32K | ~3 GB | Mobile, quick prototypes |
| Qwen3-4B | Dense | 4B | 32K / 128K | ~4 GB | Local dev, lightweight agents |
| Qwen3-8B | Dense | 8.2B | 32K / 128K | ~6 GB | Best balance of speed + quality |
| Qwen3-14B | Dense | 14B | 32K / 128K | ~10 GB | Strong all-rounder |
| Qwen3-32B | Dense | 32.8B | 32K / 128K | ~20 GB | Creative writing, fine-tuning |
| Qwen3-30B-A3B | MoE | 30.5B / ~3.3B | 32K / 128K | ~20 GB | Fast inference, 90%+ of flagship |
| Qwen3-235B-A22B | MoE | 235B / ~22B | 32K / 128K | ~80 GB | Open-source flagship |

Our pick for most users: Qwen3-8B strikes the best balance between quality and accessibility. It runs in under 10 GB of VRAM with Q4 quantization, handles multilingual tasks well, and has the deepest community support. If you have a 24 GB GPU (RTX 4090, A5000), jump to the 32B or 30B-A3B: the quality gap is significant, and you can check exact compatibility here.

The 2507 Update: Dedicated Thinking and Instruct Variants

In July 2025, the Qwen team made a philosophical shift. Instead of one hybrid model that switches between thinking and non-thinking modes via a parameter, they released dedicated variants — each independently optimized for its mode.

The results were dramatic. The 235B-Instruct-2507 jumped from 24.7 to 70.3 on AIME25 in non-thinking mode — that's not a typo. ZebraLogic went from 37.7 to 95.0. Context windows expanded from 32K native to 256K native with up to 1 million tokens via DCA + MInference sparse attention.
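
On the original hybrid models, the mode switch is exposed through the chat template. As a sketch against vLLM's OpenAI-compatible server (the `chat_template_kwargs` passthrough and the `enable_thinking` flag follow the Qwen3 model cards; verify against your serving stack):

```shell
# Sketch: toggling thinking mode on an original (hybrid) Qwen3 model served
# by vLLM. The 2507 variants are each fixed to one mode and skip this flag.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-8B",
    "messages": [{"role": "user", "content": "Solve: 12 * 17"}],
    "chat_template_kwargs": {"enable_thinking": false}
  }'
```

With the dedicated 2507 variants, you pick the mode by picking the checkpoint instead.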

Qwen 3 thinking budget control showing how max_thought_tokens affects latency and accuracy tradeoff
The thinking budget system lets you hard-cap reasoning tokens, trading accuracy for speed. The sweet spot depends on your task complexity.

Three model sizes received the 2507 treatment: 4B, 30B-A3B, and 235B-A22B. Each comes in both Instruct and Thinking flavors. If a 2507 variant exists for the size you want, always use it over the original April release — the performance gap is too large to ignore. For the 0.6B, 1.7B, 8B, 14B, and 32B dense models, the original April versions remain the latest available.

One standout: the 4B-Thinking-2507 hits 81.3 on AIME25, rivaling the much larger Qwen2.5-72B-Instruct. That kind of reasoning density in a 4B model is genuinely remarkable — and useful if you're deploying on constrained hardware but need strong math capabilities.

Qwen 3 vs Qwen 3.5: Which Generation Should You Use?

This is the question everyone's asking, so here's the honest answer: for new projects, Qwen 3.5 is almost always the better choice. The numbers aren't even close on most fronts. But Qwen 3 has specific advantages that matter in real production scenarios.

Head-to-Head Comparison

| Factor | Qwen 3 | Qwen 3.5 |
|---|---|---|
| Flagship Performance | Qwen3-235B-A22B | Qwen3.5-35B-A3B outperforms 235B with ~78x less active compute |
| Modality | Text only | Native multimodal (text + image + video) |
| Instruction Following | IFEval: 87.8 | IFBench: 76.5 (beats GPT-5.2) |
| Long-Context Speed | Standard attention | 8.6-19x faster decode throughput |
| Dense Models | 0.6B to 32B (mature) | MoE only (no dense variants yet) |
| Fine-Tuning Ecosystem | Thousands of community fine-tunes | Still growing |
| Architecture | Standard Transformer | Hybrid linear + standard attention (from Qwen3-Next) |

Choose Qwen 3 When:

- Fine-tuning is central to your workflow: the dense 8B and 32B models have the most mature LoRA tooling and training recipes
- You already run Qwen 3 in production and it works; switching costs rarely justify the marginal gains
- You need a small dense model (0.6B to 14B) for edge or mobile hardware

Choose Qwen 3.5 When:

- You're starting a new project with no legacy constraints
- You need multimodal input (images, video)
- Long-context throughput matters: the hybrid attention decodes 8.6-19x faster
- You want flagship-class quality from a small MoE (the 35B-A3B)

Benchmarks: How Qwen 3 Models Compare

Three models define the Qwen 3 performance tier: the 235B flagship, the 32B dense workhorse, and Qwen3-Next's hybrid efficiency play. Here's how they stack up across reasoning, coding, and general tasks — including where they fall short.

Qwen3-235B-A22B-Thinking-2507 (Open-Source Flagship)

| Benchmark | Score | Context |
|---|---|---|
| AIME25 | 92.3 | Rivals O4-mini (92.7) — essentially tied |
| LiveCodeBench v6 | 74.1 | Beats O4-mini on live coding tasks |
| Arena-Hard v2 | 79.7 | Strong but below Qwen3-Next's 82.7 in Instruct mode |
| MMLU-Pro | 84.4 | Solid knowledge, but proprietary models score higher |
| GPQA Diamond | 81.1 | PhD-level science — behind GPT-5.2's ~92 |
| IFEval | 87.8 | Good instruction following |

The 235B's strength is deep reasoning — math competitions and complex coding. Its weakness? It trails proprietary models on science knowledge (GPQA) and software engineering (SWE-Bench Verified: 75.3 vs Claude Opus 4.5's 80.9). That gap matters if you're building coding agents, but for math and general reasoning, the 235B holds its own against models costing far more to run.

Long-Context Performance (RULER 1M)

| Model | RULER 1M Avg |
|---|---|
| Qwen3-235B-A22B | 92.5 |
| Qwen3-Next-80B-A3B | 91.8 |
| Qwen3-30B-A3B | 86.8 |

Qwen3-Next nearly matches the 235B at 1M tokens while using a fraction of the compute. The 30B-A3B drops off noticeably.

RULER benchmark results comparing Qwen3-Next, Qwen3-235B, and Qwen3-30B across different context lengths up to 1M tokens
RULER benchmark scores at 1M tokens. Qwen3-Next's hybrid attention architecture maintains accuracy where standard MoE models degrade.

Running Qwen 3 Locally

Every open-weight Qwen 3 model is on Hugging Face in multiple formats. The fastest way to get started:

Ollama (30 Seconds to Running)

```shell
ollama run qwen3:8b
```

That's it. Ollama handles the download, quantization, and chat interface. Other popular tags: qwen3:4b (8 GB VRAM), qwen3:32b (24 GB), qwen3:30b-a3b (MoE, 24 GB). For Qwen3-Next, use the community GGUF builds available on HuggingFace.
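
Once the server is running, Ollama also exposes an HTTP API on port 11434. A minimal non-streaming request looks like this:

```shell
# Sketch: calling the local Ollama server directly (default port 11434).
curl http://localhost:11434/api/generate -d '{
  "model": "qwen3:8b",
  "prompt": "Summarize the Qwen 3 lineup in one sentence.",
  "stream": false
}'
```

This is handy for scripting against a local model without any SDK dependency.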

vLLM (Production Serving)

```shell
vllm serve Qwen/Qwen3-235B-A22B-Instruct-2507 --tensor-parallel-size 4
```

OpenAI-compatible API with PagedAttention. Qwen3-Next also gets 1.3-2.1x generation speedup with vLLM and SGLang thanks to its Multi-Token Prediction architecture.

llama.cpp (CPU + GPU Hybrid)

```shell
./llama-server -m qwen3-8b-q4_K_M.gguf -c 32768 -ngl 35
```

GGUF quantized models work with partial GPU offloading. Good for machines with limited VRAM. Check our hardware compatibility tool to find which models fit your setup.

Hardware Requirements at a Glance

| GPU / VRAM | Best Qwen 3 Model | Expected Speed |
|---|---|---|
| 4-6 GB | Qwen3-4B (Q4) | ~40 tok/s |
| 8-10 GB | Qwen3-8B (Q4) | ~30 tok/s |
| 16 GB | Qwen3-14B (Q4) | ~20 tok/s |
| 24 GB | Qwen3-32B or 30B-A3B (Q4) | ~15 tok/s (dense) / ~25 tok/s (MoE) |
| 48+ GB | Qwen3-Next-80B-A3B | Variable (hybrid arch) |
| 80+ GB (multi-GPU) | Qwen3-235B-A22B | ~10-15 tok/s with TP=4 |

Tip: GGUF Q4_K_M quantization cuts VRAM by 70-80% with minimal quality loss. For the 30B-A3B, the MoE architecture means only 3B parameters fire per token, so you get 30B-class quality at near-3B speed. Not sure what fits? Use our Can I Run Qwen tool.
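
The 70-80% figure is weight arithmetic: FP16 spends 16 bits per weight, while Q4_K_M averages roughly 4.5. A back-of-envelope sketch (the ~4.5 bits/weight and the +20% overhead for KV cache are rough assumptions, and real usage grows with context length):

```shell
# Rough Q4_K_M VRAM estimate: params * bits-per-weight / 8, plus ~20% for
# KV cache and activations. Assumptions, not measurements.
awk 'BEGIN {
  n = split("Qwen3-8B:8.2 Qwen3-32B:32.8", models, " ")
  for (i = 1; i <= n; i++) {
    split(models[i], kv, ":")
    printf "%s: ~%.1f GB at Q4\n", kv[1], kv[2] * (4.5 / 8) * 1.2
  }
}'
```

The same formula with 16 bits per weight explains why FP16 inference needs roughly 3.5x the memory of Q4.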

API Access and Pricing

If you don't want to self-host, Alibaba Cloud serves Qwen 3 models through DashScope / Model Studio with OpenAI-compatible endpoints. Third-party providers often undercut these prices significantly.

| Model | Provider | Input / 1M tokens | Output / 1M tokens |
|---|---|---|---|
| Qwen3-Max-Thinking | Alibaba Cloud | $1.20 | $6.00 |
| Qwen3-Max-Thinking | Alibaba (long ctx) | $3.00 | $15.00 |
| Qwen3-235B / 30B | OpenRouter | ~$0.10-0.50 | ~$0.50-2.00 |
| Qwen3-Next-80B-A3B | Together AI | $0.50 | $1.20 |

The open-source models run on OpenRouter, Novita AI, Fireworks AI, and Together AI. For cost-sensitive workloads, self-hosting a quantized 30B-A3B on a single 24 GB GPU often beats API pricing after a few million tokens.

Base URL (International): https://dashscope-intl.aliyuncs.com/compatible-mode/v1
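
Because the endpoint is OpenAI-compatible, a plain HTTP request works. A sketch (the `DASHSCOPE_API_KEY` variable and the `qwen3-30b-a3b` model name are placeholders; check Model Studio for the current model list):

```shell
# Sketch: OpenAI-compatible request to Alibaba Cloud's international endpoint.
# API key and model name are placeholders to fill in.
curl https://dashscope-intl.aliyuncs.com/compatible-mode/v1/chat/completions \
  -H "Authorization: Bearer $DASHSCOPE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-30b-a3b",
    "messages": [{"role": "user", "content": "Hello"}]
  }'
```

Any OpenAI SDK should also work by pointing its base URL at the endpoint above.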

Fine-Tuning: Where Qwen 3 Still Leads

This is Qwen 3's strongest argument against moving to 3.5. The fine-tuning ecosystem is ten months more mature, with thousands of community LoRAs, proven training recipes, and first-class support in every major framework.

If fine-tuning is core to your workflow and you don't need vision, Qwen 3's dense models remain the pragmatic choice. The 3.5 MoE models can be fine-tuned too, but the tooling is younger and the community knowledge base is thinner.

Qwen 3 Timeline

| Date | Release | Significance |
|---|---|---|
| Apr 2025 | Qwen3 Base Models | 8 open-weight models (0.6B-235B), Apache 2.0, hybrid thinking |
| Jun 2025 | Qwen3-Embedding | Text embeddings and reranking (0.6B / 8B) |
| Jul 2025 | Qwen3-2507 Update | Dedicated Instruct/Thinking splits, 256K context, massive performance gains |
| Jul 2025 | Qwen3-Coder | 480B-A35B agentic coding model (67% SWE-Bench Verified) |
| Sep 2025 | Qwen3-Next-80B-A3B | Hybrid GatedDeltaNet architecture, 10x throughput, 262K-1M context |
| Sep 2025 | Qwen3-Max | 1T+ closed-source flagship |
| Oct 2025 | Qwen3-VL models | Vision-Language (2B-32B) |
| Jan 2026 | Qwen3-Max-Thinking | Test-time scaling, #1 on HLE with search |
| Feb 2026 | Qwen3-Coder-Next | 80B-A3B coding model, 70.6% SWE-Bench Verified |
| Feb 2026 | Qwen 3.5 launch | Next generation — surpasses Qwen 3 across the board |

Honest Limitations

No model is perfect, and pretending otherwise doesn't help anyone. Here's where Qwen 3 falls short:

- Text only: no image or video input without a separate Qwen3-VL model
- Software engineering: trails Claude Opus 4.5 on SWE-Bench Verified (75.3 vs 80.9)
- Science knowledge: GPQA Diamond sits well behind GPT-5.2 (~81 vs ~92)
- Context: the original April models top out at 32K native (128K extended); only the 2507 variants reach 256K
- Hardware: the 235B flagship needs ~80 GB of VRAM across multiple GPUs

These are real trade-offs, not dealbreakers. For most users running an 8B or 32B locally, the limitations above don't apply. But if you're evaluating Qwen 3 for a new enterprise deployment, factor them in — and seriously consider whether Qwen 3.5 resolves them for your use case.

Frequently Asked Questions

Is Qwen 3 still worth using in 2026?

Yes, for specific use cases. The dense models (especially 8B and 32B) remain the best choice for fine-tuning workflows. Existing production deployments shouldn't migrate just because 3.5 exists — if your Qwen 3 setup works, the switching cost rarely justifies the marginal gains. But for new projects, start with Qwen 3.5 unless you specifically need a dense architecture.

Qwen3-Next or Qwen3.5-35B-A3B?

For new deployments, Qwen3.5-35B-A3B. It inherits Qwen3-Next's hybrid architecture but with better training, native multimodal support, and stronger benchmarks. Qwen3-Next makes sense if you're already invested in the Qwen 3 ecosystem and don't want to migrate, or if you need the specific 80B-scale behavior for your fine-tuned pipeline.

What's the best Qwen 3 model for a 24 GB GPU?

Two strong options. Qwen3-32B (dense, Q4 quantized) is better for creative writing and fine-tuning — it's predictable and well-understood. Qwen3-30B-A3B (MoE) is faster at inference since only 3B parameters activate per token, and it scores higher on reasoning benchmarks. Pick based on whether you prioritize speed or fine-tuning flexibility.

Are all Qwen 3 models free for commercial use?

Every open-weight model (0.6B through 235B, including Qwen3-Next) ships under Apache 2.0 — unrestricted commercial use, modification, and distribution. The only proprietary model is Qwen3-Max, which is API-only.

How does Qwen 3 handle tool calling?

Natively. Define your tools as JSON schemas and the model generates structured function calls — compatible with the Model Context Protocol (MCP). No custom parsing needed. The 2507 variants and Qwen3-Next are noticeably better at tool use than the original April models.
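
As a sketch in the OpenAI-compatible format (shown against a local vLLM server; `get_weather` is a made-up example tool, not part of any Qwen API):

```shell
# Sketch: declaring a tool as a JSON schema. When the model decides to call
# it, the response carries a structured tool_calls entry instead of free text.
# `get_weather` is a hypothetical example function.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-30B-A3B-Instruct-2507",
    "messages": [{"role": "user", "content": "Weather in Hangzhou?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }]
  }'
```

Your application executes the returned call and feeds the result back as a `tool` role message.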

Can I use Qwen3-Next in Ollama?

Yes. Community-built GGUF quantizations are available on HuggingFace, and you can load them into Ollama with a custom Modelfile. The official Ollama library also carries Qwen3-Next tags. Given its 80B total but only 3B active parameters, inference speed is surprisingly fast — though you'll need enough RAM or VRAM to load the full 80B weights even if only 3B fire per token.
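
If you're importing a community GGUF manually, the Modelfile route is minimal. A sketch (the `.gguf` filename is a placeholder for whichever quantization you downloaded):

```shell
# Sketch: importing a downloaded Qwen3-Next GGUF into Ollama.
# The .gguf filename below is a placeholder.
cat > Modelfile <<'EOF'
FROM ./qwen3-next-80b-a3b-instruct-q4_k_m.gguf
EOF
ollama create qwen3-next -f Modelfile
ollama run qwen3-next
```
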

What happened to Qwen 2.5? Should I skip straight to Qwen 3?

Qwen 2.5 was the previous generation before Qwen 3. At this point, there's no reason to start a new project on Qwen 2.5 — Qwen 3 outperforms it across the board at every size, and the 2507 update widened the gap further. Some specialized Qwen 2.5 fine-tunes still circulate (the 72B was extremely popular for creative writing), but for general use, Qwen 3 is strictly better.