Run Qwen AI Models Locally
Running Qwen 3.5 on your own machine means zero API costs, full privacy, and no rate limits. But the tool you pick matters — a lot. Ollama, llama.cpp, LM Studio, and MLX each handle Qwen differently, and the wrong choice can cost you half your inference speed or break features like vision and tool calling entirely.
This page helps you choose. Not install — choose. Each tool has its own dedicated setup guide linked below. First, though, make sure your hardware is up to the task: check what you can run with our Can I Run Qwen? tool.
Ollama vs. llama.cpp vs. LM Studio vs. MLX
Here's what actually matters when picking a local inference tool for Qwen. Not every feature is equal — focus on the rows relevant to your setup.
| Feature | Ollama | llama.cpp | LM Studio | MLX |
|---|---|---|---|---|
| Setup Difficulty | Easiest (3 commands) | Medium (build from source) | Easy (GUI installer) | Easy (pip install) |
| Interface | CLI + API | CLI + Web UI | Full GUI | CLI + Python |
| Mac Performance | Baseline | ~Same as Ollama | Matches its backend (llama.cpp or MLX) | ~2x faster |
| NVIDIA Performance | Good | Best | Good | N/A |
| GPU Support | NVIDIA, AMD, Apple | NVIDIA, AMD, Apple | NVIDIA, AMD, Apple | Apple Silicon only |
| Qwen3.5 Vision | Broken (Mar 2026) | Yes | Yes | Yes |
| Tool Calling | Broken (Mar 2026) | Works | Works | Partially works |
| Thinking Toggle | Forced ON | Configurable | Configurable | Configurable |
| API Compatible | OpenAI :11434 | OpenAI :8080 | OpenAI :1234 | No server |
| Model Format | Auto (GGUF) | GGUF | GGUF + MLX | MLX |
| Best For | Beginners, API apps | Power users, max speed | GUI users | Mac speed |
A few things stand out. Ollama is the most popular option — and for good reason: it's dead simple. But as of March 2026, it has real issues with Qwen 3.5 specifically: vision models don't load, tool calling is broken, and thinking mode is forced on with no toggle. These aren't minor inconveniences. If you're building anything that relies on function calling or multimodal input, skip Ollama for now.
On Mac, MLX is the clear performance winner — roughly double the token throughput of Ollama on the same Apple Silicon chip. That gap matters when you're running a 27B model interactively. On NVIDIA GPUs, llama.cpp edges ahead of everything else, particularly for prompt processing where it can be 5-10x faster than Ollama on the same hardware.
One caveat worth mentioning: Ollama, llama.cpp, and LM Studio all use GGUF-format models, while MLX has its own format (LM Studio can load both). That means the same quantized model weights from HuggingFace work across Ollama, llama.cpp, and LM Studio. Switching between those three later doesn't mean re-downloading everything.
Which Tool Should You Use?
Start with your hardware, then narrow by what you need.
Mac with Apple Silicon
Want maximum speed? MLX. It's built specifically for Apple's unified memory architecture and it shows. Prefer a visual interface? LM Studio with the MLX backend gives you the GUI without sacrificing much performance. Just want the simplest CLI experience? Ollama works — but expect slower inference and no vision support with Qwen 3.5.
Windows or Linux with NVIDIA GPU
Need the absolute fastest inference? llama.cpp. It takes more effort to set up — you'll compile from source and manage models manually — but the performance payoff is significant, especially with CUDA acceleration on RTX cards. Prefer a GUI instead of the terminal? LM Studio gives you a polished chat interface with model management built in. Building an app that needs an OpenAI-compatible endpoint? Ollama is the quickest path to a local API server, spinning up on port 11434 with a single command.
When in doubt, go with llama.cpp. It's the most flexible and the best-performing backend on NVIDIA hardware.
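Whichever server you pick, the endpoint speaks the OpenAI chat-completions format, so the same client code works against all of them by changing the base URL. A minimal stdlib sketch using Ollama's default port (the `qwen3.5:9b` model tag is an assumption — use whatever your model list shows — and you'd swap in :8080 for llama.cpp or :1234 for LM Studio):

```python
import json
import urllib.request

def chat_request(prompt: str,
                 base_url: str = "http://localhost:11434/v1",
                 model: str = "qwen3.5:9b"):  # tag is an assumption; check your local model list
    """Build an OpenAI-style chat-completions request for a local server."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

# Sending it (requires the server to actually be running):
# with urllib.request.urlopen(chat_request("Why is the sky blue?")) as resp:
#     reply = json.loads(resp.read())
#     print(reply["choices"][0]["message"]["content"])
```

Because the request shape is identical everywhere, switching backends later is a one-line change to `base_url`.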
Quick Recommendations by Use Case
| Use Case | Recommended Tool |
|---|---|
| Just want to try Qwen quickly | Ollama |
| Fastest inference (NVIDIA) | llama.cpp |
| Fastest inference (Mac) | MLX |
| Visual chat interface | LM Studio |
| Building an app with a local LLM API | Ollama |
| Qwen 3.5 vision models | llama.cpp or LM Studio (not Ollama) |
| Working tool calling | llama.cpp or LM Studio (not Ollama) |
| Limited VRAM, need partial offload | llama.cpp with --fit flag |
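The table above collapses into a small decision helper — a sketch of the recommendations as of March 2026 (the Ollama caveats for Qwen 3.5 may be fixed in later releases):

```python
def recommend(platform: str, *, gui: bool = False,
              vision_or_tools: bool = False, max_speed: bool = False) -> str:
    """Pick a local inference tool per the recommendation table (March 2026)."""
    if gui:
        return "LM Studio"   # also a safe pick for vision and tool calling
    if vision_or_tools:
        return "llama.cpp"   # Ollama's Qwen 3.5 vision/tool calling is broken
    if max_speed:
        return "MLX" if platform == "mac" else "llama.cpp"
    return "Ollama"          # simplest start, plus a local API on :11434

print(recommend("mac", max_speed=True))          # MLX
print(recommend("linux", vision_or_tools=True))  # llama.cpp
```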
What Can Your Hardware Actually Run?
The model you can run depends almost entirely on how much memory you have — VRAM for discrete GPUs, unified memory for Macs, or plain RAM for CPU-only setups. Here's a quick reference for Qwen 3.5 models at Q4_K_M quantization:
| Your Hardware | What You Can Run |
|---|---|
| 8GB RAM (no GPU) | Qwen3.5-0.8B, Qwen3.5-2B |
| 16GB RAM / 8GB VRAM | Qwen3.5-4B, Qwen3.5-9B |
| 24GB VRAM (RTX 4090) | Qwen3.5-35B-A3B (sweet spot) |
| 16GB Mac (Apple Silicon) | Qwen3.5-9B via MLX |
| 32GB+ Mac | Qwen3.5-35B-A3B, Qwen3.5-27B (3-bit) |
| 64GB+ Mac | Qwen3.5-27B, Qwen3.5-122B-A10B |
The 35B-A3B is Qwen's MoE (Mixture of Experts) model: 35 billion total parameters, but only 3 billion activate per token. All 35B still have to sit in memory — quantization brings that down to roughly 21GB of weights at Q4_K_M, which is why 24GB of VRAM works. The small active set is what makes inference fast while delivering quality that punches well above its active parameter count.
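The back-of-the-envelope math is simple: weights-only size is parameters times bits per weight, divided by 8. A rough sketch (the ~4.85 bits-per-weight average for Q4_K_M is an approximation — some tensors stay at higher precision — and exact file sizes vary by architecture):

```python
def gguf_size_gb(total_params_b: float, bits_per_weight: float = 4.85) -> float:
    """Rough weights-only size of a quantized model in GB.

    MoE memory is driven by TOTAL parameters, not active ones; leave
    headroom on top of this for KV cache and runtime buffers.
    """
    return total_params_b * bits_per_weight / 8

print(f"35B-A3B at Q4_K_M: ~{gguf_size_gb(35):.0f} GB of weights")  # ~21 GB
print(f"9B at Q4_K_M:      ~{gguf_size_gb(9):.1f} GB of weights")   # ~5.5 GB
```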
Not sure where you fall? Our hardware compatibility tool will tell you exactly which models your specific GPU or Mac can handle.
Picking a Model and Quantization
Which model depends on what you're doing. Which quantization depends on how much memory you can spare.
Model Recommendations
General use: Qwen3.5-9B. It's the best balance of quality and resource usage for most people. Fits comfortably in 8GB VRAM at Q4_K_M.
Coding: Qwen3-Coder-Next if you can run it, or Qwen3.5-35B-A3B for a lighter alternative that still handles code well.
Budget hardware: Qwen3.5-4B fits in 6GB VRAM. It won't match the 9B, but for basic tasks and quick experiments, it's surprisingly capable.
Maximum quality: Qwen3.5-35B-A3B at Q5_K_M or higher. If you have the VRAM for it, this is the local Qwen experience at its best.
Quantization Quick Guide
Quantization compresses model weights to use less memory. Lower numbers mean smaller files and less VRAM, but some quality loss. Here's where to start:
- Q4_K_M — The default starting point. Good balance of size and quality for most models.
- Q5_K_M — Better for coding tasks where precision matters. About 15-20% larger than Q4.
- Q8_0 — Near-lossless quality. Only use if you have VRAM to spare.
If a model barely fits your VRAM at Q4_K_M, don't try to squeeze in a bigger model at Q2 — you'll get worse results than running a smaller model at higher quantization.
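To see how the levels compare, here is a quick sketch using assumed ballpark bits-per-weight averages for each quant (real averages vary slightly per model architecture):

```python
# Approximate average bits per weight for common GGUF quant levels.
QUANT_BPW = {"Q4_K_M": 4.85, "Q5_K_M": 5.69, "Q8_0": 8.50}

def size_gb(params_b: float, quant: str) -> float:
    """Weights-only size in GB: parameters x bits per weight / 8."""
    return params_b * QUANT_BPW[quant] / 8

for quant in QUANT_BPW:
    print(f"Qwen3.5-9B at {quant}: ~{size_gb(9, quant):.1f} GB")

# Q5_K_M comes out ~17% larger than Q4_K_M, consistent with the
# 15-20% rule of thumb above.
print(f"Q5/Q4 size ratio: {QUANT_BPW['Q5_K_M'] / QUANT_BPW['Q4_K_M']:.2f}")
```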
Setup Guides for Each Tool
Each guide walks you through installation, configuration, running your first model, and troubleshooting. Pick the one that matches your setup and click through.
Ollama
The easiest way to get started. Three commands and you're chatting. Best for beginners and local API endpoints.
llama.cpp
Maximum performance on NVIDIA GPUs. Full control over inference settings. Requires building from source.
LM Studio
Full GUI with model management, chat interface, and local server. Download, configure, and run without touching a terminal.
MLX
Apple Silicon only. Roughly 2x faster than Ollama on Mac. Python-native with pip install. The speed king on M-series chips.
From the Community
Real-world numbers from people running Qwen on their own hardware:
Qwen3 27B Dense: 35 tok/s on RTX 3090. MoE variant: 112 tok/s. The MoE models are absurdly fast for their quality level.
— (@sudoingX) March 2026
Those MoE numbers explain why the 35B-A3B is the sweet spot for local inference — you get 35B-class reasoning at speeds most dense models can't touch. On the Mac side, Apple Silicon users are pushing boundaries too:
Running Qwen 397B at 5.7 tok/s with 5.5GB RAM. Local LLMs have come a very long way.
— Simon Willison (@simonw) March 2026
Frequently Asked Questions
What's the fastest way to run Qwen locally?
On NVIDIA GPUs, llama.cpp consistently delivers the highest token throughput. On Apple Silicon Macs, MLX is roughly twice as fast as Ollama for the same model.
Can I run Qwen on CPU only?
Yes, but keep expectations in check. Qwen3.5-0.8B and 2B run acceptably on CPU with 8GB of RAM — you'll get usable output for simple tasks. The 4B model technically works on 16GB RAM, but at single-digit tokens per second it's more an exercise in patience than a practical setup. For anything beyond quick experiments, you really want a GPU.
What's the best model for 8GB VRAM?
Qwen3.5-9B at Q4_K_M. It fits with room to spare and delivers strong general-purpose performance across reasoning, coding, and conversation.
Ollama or llama.cpp — which one?
Ollama for simplicity. It's three commands and you're running. llama.cpp for performance and control — expect 5-10x faster prompt processing and full access to inference parameters. If you need Qwen 3.5 vision or tool calling, llama.cpp is your only CLI option right now.
Can I run Qwen on my phone?
The Qwen3.5-2B model runs on iPhone via MLX. It's not going to replace your desktop setup, but it works for lightweight tasks and quick experiments on the go.
Do I need to download models separately for each tool?
Not entirely. Ollama, llama.cpp, and LM Studio all use GGUF-format model files. If you've already downloaded a GGUF model for one tool, you can point another tool at the same file. MLX uses its own format, so those models aren't interchangeable with the other three.