Run Qwen AI Models Locally
Running Qwen 3.5 on your own machine means zero API costs, full privacy, and no rate limits. But the tool you pick matters — a lot. Ollama, llama.cpp, LM Studio, and MLX each handle Qwen differently, and the wrong choice can cost you half your inference speed or break features like vision and tool calling entirely.
This page helps you choose. Not install — choose. Each tool has its own dedicated setup guide linked below. First, though, make sure your hardware is up to the task: check what you can run with our Can I Run Qwen? tool.
Ollama vs. llama.cpp vs. LM Studio vs. MLX
Here's what actually matters when picking a local inference tool for Qwen. Not every feature is equal — focus on the rows relevant to your setup.
| Feature | Ollama | llama.cpp | LM Studio | MLX |
|---|---|---|---|---|
| Setup Difficulty | Easiest (3 commands) | Medium (build from source) | Easy (GUI installer) | Easy (pip install) |
| Interface | CLI + API | CLI + Web UI | Full GUI | CLI + Python |
| Mac Performance | Baseline | ~Same as Ollama | Matches its backend (llama.cpp or MLX) | ~2x faster |
| NVIDIA Performance | Good | Best | Good | N/A |
| GPU Support | NVIDIA, AMD, Apple | NVIDIA, AMD, Apple | NVIDIA, AMD, Apple | Apple Silicon only |
| Qwen3.5 Vision | Broken (Mar 2026) | Yes | Yes | Yes |
| Tool Calling | Broken (Mar 2026) | Works | Works | Partially works |
| Thinking Toggle | Forced ON | Configurable | Configurable | Configurable |
| API Compatible | OpenAI :11434 | OpenAI :8080 | OpenAI :1234 | No server |
| Model Format | Auto (GGUF) | GGUF | GGUF + MLX | MLX |
| Best For | Beginners, API apps | Power users, max speed | GUI users | Mac speed |
A few things stand out. Ollama is the most popular option — and for good reason: it's dead simple. But as of March 2026, it has real issues with Qwen 3.5 specifically: vision models don't load, tool calling is broken, and thinking mode is forced on with no toggle. These aren't minor inconveniences. If you're building anything that relies on function calling or multimodal input, skip Ollama for now.
On Mac, MLX is the clear performance winner — roughly double the token throughput of Ollama on the same Apple Silicon chip. That gap matters when you're running a 27B model interactively. On NVIDIA GPUs, llama.cpp edges ahead of everything else, particularly for prompt processing where it can be 5-10x faster than Ollama on the same hardware.
One caveat worth mentioning: Ollama, llama.cpp, and LM Studio all use GGUF-format models, while MLX has its own format (LM Studio can load both). That means the same quantized model weights from HuggingFace work across Ollama, llama.cpp, and LM Studio. Switching between those three later doesn't mean re-downloading everything.
Which Tool Should You Use?
Start with your hardware, then narrow by what you need.
Mac with Apple Silicon
Want maximum speed? MLX. It's built specifically for Apple's unified memory architecture and it shows. Prefer a visual interface? LM Studio with the MLX backend gives you the GUI without sacrificing much performance. Just want the simplest CLI experience? Ollama works — but expect slower inference and no vision support with Qwen 3.5.
Windows or Linux with NVIDIA GPU
Need the absolute fastest inference? llama.cpp. It takes more effort to set up — you'll compile from source and manage models manually — but the performance payoff is significant, especially with CUDA acceleration on RTX cards. Prefer a GUI instead of the terminal? LM Studio gives you a polished chat interface with model management built in. Building an app that needs an OpenAI-compatible endpoint? Ollama is the quickest path to a local API server, spinning up on port 11434 with a single command.
When in doubt, go with llama.cpp. It's the most flexible and the best-performing backend on NVIDIA hardware.
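Whichever server you pick, the endpoint speaks the OpenAI chat-completions format, so the same client code works against all of them by changing the base URL. A minimal stdlib sketch using Ollama's default port (the `qwen3.5:9b` model tag is an assumption — use whatever your model list shows — and you'd swap in :8080 for llama.cpp or :1234 for LM Studio):

```python
import json
import urllib.request

def chat_request(prompt: str,
                 base_url: str = "http://localhost:11434/v1",
                 model: str = "qwen3.5:9b"):  # tag is an assumption; check your local model list
    """Build an OpenAI-style chat-completions request for a local server."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

# Sending it (requires the server to actually be running):
# with urllib.request.urlopen(chat_request("Why is the sky blue?")) as resp:
#     reply = json.loads(resp.read())
#     print(reply["choices"][0]["message"]["content"])
```

Because the request shape is identical everywhere, switching backends later is a one-line change to `base_url`.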
Quick Recommendations by Use Case
| Use Case | Recommended Tool |
|---|---|
| Just want to try Qwen quickly | Ollama |
| Fastest inference (NVIDIA) | llama.cpp |
| Fastest inference (Mac) | MLX |
| Visual chat interface | LM Studio |
| Building an app with a local LLM API | Ollama |
| Qwen 3.5 vision models | llama.cpp or LM Studio (not Ollama) |
| Working tool calling | llama.cpp or LM Studio (not Ollama) |
| Limited VRAM, need partial offload | llama.cpp with --fit flag |
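The table above collapses into a small decision helper — a sketch of the recommendations as of March 2026 (the Ollama caveats for Qwen 3.5 may be fixed in later releases):

```python
def recommend(platform: str, *, gui: bool = False,
              vision_or_tools: bool = False, max_speed: bool = False) -> str:
    """Pick a local inference tool per the recommendation table (March 2026)."""
    if gui:
        return "LM Studio"   # also a safe pick for vision and tool calling
    if vision_or_tools:
        return "llama.cpp"   # Ollama's Qwen 3.5 vision/tool calling is broken
    if max_speed:
        return "MLX" if platform == "mac" else "llama.cpp"
    return "Ollama"          # simplest start, plus a local API on :11434

print(recommend("mac", max_speed=True))          # MLX
print(recommend("linux", vision_or_tools=True))  # llama.cpp
```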
What Can Your Hardware Actually Run?
The model you can run depends almost entirely on how much memory you have — VRAM for discrete GPUs, unified memory for Macs, or plain RAM for CPU-only setups. Here's a quick reference for Qwen 3.5 models at Q4_K_M quantization:
| Your Hardware | What You Can Run |
|---|---|
| 8GB RAM (no GPU) | Qwen3.5-0.8B, Qwen3.5-2B |
| 16GB RAM / 8GB VRAM | Qwen3.5-4B, Qwen3.5-9B |
| 24GB VRAM (RTX 4090) | Qwen3.5-35B-A3B (sweet spot) |
| 16GB Mac (Apple Silicon) | Qwen3.5-9B via MLX |
| 32GB+ Mac | Qwen3.5-35B-A3B, Qwen3.5-27B (3-bit) |
| 64GB+ Mac | Qwen3.5-27B, Qwen3.5-122B-A10B |
The 35B-A3B is Qwen's MoE (Mixture of Experts) model: 35 billion total parameters, but only 3 billion activate per token. All 35B still have to sit in memory — quantization brings that down to roughly 21GB of weights at Q4_K_M, which is why 24GB of VRAM works. The small active set is what makes inference fast while delivering quality that punches well above its active parameter count.
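The back-of-the-envelope math is simple: weights-only size is parameters times bits per weight, divided by 8. A rough sketch (the ~4.85 bits-per-weight average for Q4_K_M is an approximation — some tensors stay at higher precision — and exact file sizes vary by architecture):

```python
def gguf_size_gb(total_params_b: float, bits_per_weight: float = 4.85) -> float:
    """Rough weights-only size of a quantized model in GB.

    MoE memory is driven by TOTAL parameters, not active ones; leave
    headroom on top of this for KV cache and runtime buffers.
    """
    return total_params_b * bits_per_weight / 8

print(f"35B-A3B at Q4_K_M: ~{gguf_size_gb(35):.0f} GB of weights")  # ~21 GB
print(f"9B at Q4_K_M:      ~{gguf_size_gb(9):.1f} GB of weights")   # ~5.5 GB
```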
Not sure where you fall? Our hardware compatibility tool will tell you exactly which models your specific GPU or Mac can handle.
Picking a Model and Quantization
Which model depends on what you're doing. Which quantization depends on how much memory you can spare.
Model Recommendations
General use: Qwen3.5-9B. It's the best balance of quality and resource usage for most people. Fits comfortably in 8GB VRAM at Q4_K_M.
Coding: Qwen3-Coder-Next if you can run it, or Qwen3.5-35B-A3B for a lighter alternative that still handles code well.
Budget hardware: Qwen3.5-4B fits in 6GB VRAM. It won't match the 9B, but for basic tasks and quick experiments, it's surprisingly capable.
Maximum quality: Qwen3.5-35B-A3B at Q5_K_M or higher. If you have the VRAM for it, this is the local Qwen experience at its best.
Quantization Quick Guide
Quantization compresses model weights to use less memory. Lower numbers mean smaller files and less VRAM, but some quality loss. Here's where to start:
- Q4_K_M — The default starting point. Good balance of size and quality for most models.
- Q5_K_M — Better for coding tasks where precision matters. About 15-20% larger than Q4.
- Q8_0 — Near-lossless quality. Only use if you have VRAM to spare.
If a model barely fits your VRAM at Q4_K_M, don't try to squeeze in a bigger model at Q2 — you'll get worse results than running a smaller model at higher quantization.
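To see how the levels compare, here is a quick sketch using assumed ballpark bits-per-weight averages for each quant (real averages vary slightly per model architecture):

```python
# Approximate average bits per weight for common GGUF quant levels.
QUANT_BPW = {"Q4_K_M": 4.85, "Q5_K_M": 5.69, "Q8_0": 8.50}

def size_gb(params_b: float, quant: str) -> float:
    """Weights-only size in GB: parameters x bits per weight / 8."""
    return params_b * QUANT_BPW[quant] / 8

for quant in QUANT_BPW:
    print(f"Qwen3.5-9B at {quant}: ~{size_gb(9, quant):.1f} GB")

# Q5_K_M comes out ~17% larger than Q4_K_M, consistent with the
# 15-20% rule of thumb above.
print(f"Q5/Q4 size ratio: {QUANT_BPW['Q5_K_M'] / QUANT_BPW['Q4_K_M']:.2f}")
```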
Setup Guides for Each Tool
Each guide walks you through installation, configuration, running your first model, and troubleshooting. Pick the one that matches your setup and click through.
Ollama
The easiest way to get started. Three commands and you're chatting. Best for beginners and local API endpoints.
llama.cpp
Maximum performance on NVIDIA GPUs. Full control over inference settings. Requires building from source.
LM Studio
Full GUI with model management, chat interface, and local server. Download, configure, and run without touching a terminal.
MLX
Apple Silicon only. Roughly 2x faster than Ollama on Mac. Python-native with pip install. The speed king on M-series chips.
From the Community
Real-world numbers from people running Qwen on their own hardware:
Qwen3 27B Dense: 35 tok/s on RTX 3090. MoE variant: 112 tok/s. The MoE models are absurdly fast for their quality level.
— (@sudoingX) March 2026
Those MoE numbers explain why the 35B-A3B is the sweet spot for local inference — you get 35B-class reasoning at speeds most dense models can't touch. On the Mac side, Apple Silicon users are pushing boundaries too:
Running Qwen 397B at 5.7 tok/s with 5.5GB RAM. Local LLMs have come a very long way.
— Simon Willison (@simonw) March 2026
Frequently Asked Questions
What's the fastest way to run Qwen locally?
On NVIDIA GPUs, llama.cpp consistently delivers the highest token throughput. On Apple Silicon Macs, MLX is roughly twice as fast as Ollama for the same model.
Can I run Qwen on CPU only?
Yes, but keep expectations in check. Qwen3.5-0.8B and 2B run acceptably on CPU with 8GB of RAM — you'll get usable output for simple tasks. The 4B model technically works on 16GB RAM, but at single-digit tokens per second it's more an exercise in patience than a practical setup. For anything beyond quick experiments, you really want a GPU.
What's the best model for 8GB VRAM?
Qwen3.5-9B at Q4_K_M. It fits with room to spare and delivers strong general-purpose performance across reasoning, coding, and conversation.
Ollama or llama.cpp — which one?
Ollama for simplicity. It's three commands and you're running. llama.cpp for performance and control — expect 5-10x faster prompt processing and full access to inference parameters. If you need Qwen 3.5 vision or tool calling, llama.cpp is your only CLI option right now.
Can I run Qwen on my phone?
The Qwen3.5-2B model runs on iPhone via MLX. It's not going to replace your desktop setup, but it works for lightweight tasks and quick experiments on the go.
Do I need to download models separately for each tool?
Not entirely. Ollama, llama.cpp, and LM Studio all use GGUF-format model files. If you've already downloaded a GGUF model for one tool, you can point another tool at the same file. MLX uses its own format, so those models aren't interchangeable with the other three.