Run Qwen AI Models Locally

Running Qwen 3.5 on your own machine means zero API costs, full privacy, and no rate limits. But the tool you pick matters — a lot. Ollama, llama.cpp, LM Studio, and MLX each handle Qwen differently, and the wrong choice can cost you half your inference speed or break features like vision and tool calling entirely.

This page helps you choose. Not install — choose. Each tool has its own dedicated setup guide linked below. First, though, make sure your hardware is up to the task: check what you can run with our Can I Run Qwen? tool.

Ollama vs. llama.cpp vs. LM Studio vs. MLX

Here's what actually matters when picking a local inference tool for Qwen. Not every feature is equal — focus on the rows relevant to your setup.

| Feature | Ollama | llama.cpp | LM Studio | MLX |
| --- | --- | --- | --- | --- |
| Setup difficulty | Easiest (3 commands) | Medium (build from source) | Easy (GUI installer) | Easy (pip install) |
| Interface | CLI + API | CLI + Web UI | Full GUI | CLI + Python |
| Mac performance | Baseline | ~Same as Ollama | Same as its backend | ~2x faster |
| NVIDIA performance | Good | Best | Good | N/A |
| GPU support | NVIDIA, AMD, Apple | NVIDIA, AMD, Apple | NVIDIA, AMD, Apple | Apple Silicon only |
| Qwen 3.5 vision | Broken (Mar 2026) | Yes | Yes | Yes |
| Tool calling | Broken (Mar 2026) | Works | Works | Partially works |
| Thinking toggle | Forced ON | Configurable | Configurable | Configurable |
| OpenAI-compatible API | Yes (port 11434) | Yes (port 8080) | Yes (port 1234) | No server |
| Model format | GGUF (auto-managed) | GGUF | GGUF + MLX | MLX |
| Best for | Beginners, API apps | Power users, max speed | GUI users | Mac speed |

A few things stand out. Ollama is the most popular option — and for good reason, it's dead simple. But as of March 2026, it has real issues with Qwen 3.5 specifically: vision models don't load, tool calling is broken, and thinking mode is forced on with no toggle. These aren't minor inconveniences. If you're building anything that relies on function calling or multimodal input, skip Ollama for now.

On Mac, MLX is the clear performance winner — roughly double the token throughput of Ollama on the same Apple Silicon chip. That gap matters when you're running a 27B model interactively. On NVIDIA GPUs, llama.cpp edges ahead of everything else, particularly for prompt processing where it can be 5-10x faster than Ollama on the same hardware.

One caveat worth mentioning: Ollama, llama.cpp, and LM Studio all run GGUF-format models, while MLX uses its own format. That means the same quantized weights you find on HuggingFace work across those three tools, and switching between them later doesn't mean re-downloading everything.
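If you want to verify that an already-downloaded file really is GGUF before pointing a second tool at it, the format's header makes that cheap to check. A minimal stdlib-only sketch (the magic bytes and little-endian version field are part of the GGUF spec; the function name is our own):

```python
import struct

def is_gguf(path: str) -> bool:
    """Check a file's 8-byte header for the GGUF magic and version.

    GGUF files begin with the magic bytes b"GGUF" followed by a
    little-endian uint32 format version.
    """
    with open(path, "rb") as f:
        header = f.read(8)
    if len(header) < 8 or header[:4] != b"GGUF":
        return False
    (version,) = struct.unpack("<I", header[4:8])
    return version >= 1
```

Handy as a sanity check before symlinking an existing download into another tool's model directory instead of fetching it again.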

Which Tool Should You Use?

Start with your hardware, then narrow by what you need.

Mac with Apple Silicon

Want maximum speed? MLX. It's built specifically for Apple's unified memory architecture and it shows. Prefer a visual interface? LM Studio with the MLX backend gives you the GUI without sacrificing much performance. Just want the simplest CLI experience? Ollama works — but expect slower inference and no vision support with Qwen 3.5.

Windows or Linux with NVIDIA GPU

Need the absolute fastest inference? llama.cpp. It takes more effort to set up — you'll compile from source and manage models manually — but the performance payoff is significant, especially with CUDA acceleration on RTX cards. Prefer a GUI instead of the terminal? LM Studio gives you a polished chat interface with model management built in. Building an app that needs an OpenAI-compatible endpoint? Ollama is the quickest path to a local API server, spinning up on port 11434 with a single command.

When in doubt, go with llama.cpp. It's the most flexible and the best-performing backend on NVIDIA hardware.
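Because Ollama, llama.cpp's server, and LM Studio all speak the same OpenAI-style chat completions API (just on different ports), client code can stay tool-agnostic. A minimal stdlib-only sketch; the base URL and model name in the commented example are placeholders you'd swap for whatever your server actually exposes:

```python
import json
import urllib.request

def build_chat_request(model: str, prompt: str, max_tokens: int = 256) -> dict:
    """Build an OpenAI-style chat completion payload.

    The same payload shape is accepted by Ollama (port 11434),
    llama.cpp's server (port 8080), and LM Studio (port 1234).
    """
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def chat(base_url: str, payload: dict) -> str:
    """POST to a local OpenAI-compatible /v1/chat/completions endpoint."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# With a server already running (Ollama here; swap the port for
# llama.cpp or LM Studio), usage looks something like:
# print(chat("http://localhost:11434", build_chat_request("qwen3.5", "Hello")))
```

Keeping the base URL in config is what lets you switch backends later without touching application code.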

Quick Recommendations by Use Case

| Use case | Recommended tool |
| --- | --- |
| Just want to try Qwen quickly | Ollama |
| Fastest inference (NVIDIA) | llama.cpp |
| Fastest inference (Mac) | MLX |
| Visual chat interface | LM Studio |
| Building an app with a local LLM API | Ollama |
| Qwen 3.5 vision models | llama.cpp or LM Studio (not Ollama) |
| Working tool calling | llama.cpp or LM Studio (not Ollama) |
| Limited VRAM, need partial offload | llama.cpp with --fit flag |

What Can Your Hardware Actually Run?

The model you can run depends almost entirely on how much memory you have — VRAM for discrete GPUs, unified memory for Macs, or plain RAM for CPU-only setups. Here's a quick reference for Qwen 3.5 models at Q4_K_M quantization:

| Your hardware | What you can run |
| --- | --- |
| 8GB RAM (no GPU) | Qwen3.5-0.8B, Qwen3.5-2B |
| 16GB RAM / 8GB VRAM | Qwen3.5-4B, Qwen3.5-9B |
| 24GB VRAM (RTX 4090) | Qwen3.5-35B-A3B (sweet spot) |
| 16GB Mac (Apple Silicon) | Qwen3.5-9B via MLX |
| 32GB+ Mac | Qwen3.5-35B-A3B, Qwen3.5-27B (3-bit) |
| 64GB+ Mac | Qwen3.5-27B, Qwen3.5-122B-A10B |

The 35B-A3B is Qwen's MoE (Mixture of Experts) model — 35 billion total parameters, but only 3 billion activate per token. That's why it fits in 24GB VRAM while delivering quality that punches well above its active parameter count.
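You can sanity-check that claim with back-of-envelope arithmetic. The sketch below assumes Q4_K_M averages roughly 4.85 bits per weight (an approximate community figure, not an official number) and ignores the KV cache and runtime overhead, which add a few more GiB depending on context length:

```python
def quantized_weight_gib(total_params: float, bits_per_weight: float) -> float:
    """Weight-only memory footprint of a quantized model, in GiB."""
    return total_params * bits_per_weight / 8 / 2**30

# All 35B parameters must sit in memory even though only 3B activate
# per token: MoE saves compute per token, not weight storage.
print(f"{quantized_weight_gib(35e9, 4.85):.1f} GiB")  # ~19.8 GiB
```

At roughly 19.8 GiB of weights, the model fits a 24GB card with a few GiB left over for the KV cache, which is exactly why it lands in the RTX 4090 row above.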

Not sure where you fall? Our hardware compatibility tool will tell you exactly which models your specific GPU or Mac can handle.

Picking a Model and Quantization

Which model depends on what you're doing. Which quantization depends on how much memory you can spare.

Model Recommendations

General use: Qwen3.5-9B. It's the best balance of quality and resource usage for most people. Fits comfortably in 8GB VRAM at Q4_K_M.

Coding: Qwen3-Coder-Next if you can run it, or Qwen3.5-35B-A3B for a lighter alternative that still handles code well.

Budget hardware: Qwen3.5-4B fits in 6GB VRAM. It won't match the 9B, but for basic tasks and quick experiments, it's surprisingly capable.

Maximum quality: Qwen3.5-35B-A3B at Q5_K_M or higher. If you have the VRAM for it, this is the local Qwen experience at its best.

Quantization Quick Guide

Quantization compresses model weights to use less memory. Lower bit counts mean smaller files and less VRAM, but some quality loss. Here's where to start:

Q4_K_M: the usual default. Roughly 4.5-5 bits per weight, with quality loss most people won't notice.

Q5_K_M or Q6_K: closer to full quality. Worth it if you have memory to spare.

Q8_0: near-lossless, but close to double the size of Q4_K_M.

Q3 and below: quality drops off quickly. Use these only when nothing else fits.

If a model barely fits your VRAM at Q4_K_M, don't try to squeeze in a bigger model at Q2 — you'll get worse results than running a smaller model at higher quantization.
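That rule of thumb is easy to encode. The sketch below uses illustrative bits-per-weight figures and a hypothetical candidate list; the point is the `min_bpw` floor, which refuses sub-4-bit quants outright:

```python
# Illustrative (total_params, bits_per_weight, label) candidates -- the
# bpw figures are approximate community averages, not official numbers.
CANDIDATES = [
    (9e9, 4.85, "Qwen3.5-9B @ Q4_K_M"),
    (9e9, 5.69, "Qwen3.5-9B @ Q5_K_M"),
    (35e9, 4.85, "Qwen3.5-35B-A3B @ Q4_K_M"),
    (35e9, 2.60, "Qwen3.5-35B-A3B @ Q2_K"),
]

def weight_gib(params: float, bpw: float) -> float:
    """Weight-only footprint in GiB (ignores KV cache and overhead)."""
    return params * bpw / 8 / 2**30

def pick(vram_gib: float, min_bpw: float = 4.0, headroom_gib: float = 2.0):
    """Pick the biggest model that fits, refusing quants below ~4 bits.

    The min_bpw floor encodes the rule above: a smaller model at Q4+
    beats a bigger model squeezed down to Q2.
    """
    fitting = [
        c for c in CANDIDATES
        if c[1] >= min_bpw and weight_gib(c[0], c[1]) + headroom_gib <= vram_gib
    ]
    if not fitting:
        return None
    # Prefer more total parameters, then higher bits per weight.
    return max(fitting, key=lambda c: (c[0], c[1]))[2]

print(pick(8))   # 9B fits at Q5_K_M; the 35B does not
print(pick(24))  # 35B-A3B fits at Q4_K_M, and the Q2_K option is never chosen
```

Note that the selector never returns the 35B at Q2_K even when it would fit, which is the behavior the paragraph above argues for.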

Setup Guides for Each Tool

Each guide walks you through installation, configuration, running your first model, and troubleshooting. Pick the one that matches your setup and click through.

From the Community

Real-world numbers from people running Qwen on their own hardware tell the same story: the MoE design is why the 35B-A3B is the sweet spot for local inference — you get 35B-class reasoning at speeds most dense models can't touch. And on the Mac side, Apple Silicon users running MLX are pushing boundaries too.

Frequently Asked Questions

What's the fastest way to run Qwen locally?

On NVIDIA GPUs, llama.cpp consistently delivers the highest token throughput. On Apple Silicon Macs, MLX is roughly twice as fast as Ollama for the same model.

Can I run Qwen on CPU only?

Yes, but keep expectations in check. Qwen3.5-0.8B and 2B run acceptably on CPU with 8GB of RAM — you'll get usable output for simple tasks. The 4B model technically works on 16GB RAM, but at single-digit tokens per second it's more of an exercise in patience than practical usage. For anything beyond quick experiments, you really want a GPU.

What's the best model for 8GB VRAM?

Qwen3.5-9B at Q4_K_M. It fits with room to spare and delivers strong general-purpose performance across reasoning, coding, and conversation.

Ollama or llama.cpp — which one?

Ollama for simplicity. It's three commands and you're running. llama.cpp for performance and control — expect 5-10x faster prompt processing and full access to inference parameters. If you need Qwen 3.5 vision or tool calling, llama.cpp is your only CLI option right now.

Can I run Qwen on my phone?

The Qwen3.5-2B model runs on iPhone via MLX. It's not going to replace your desktop setup, but it works for lightweight tasks and quick experiments on the go.

Do I need to download models separately for each tool?

Not entirely. Ollama, llama.cpp, and LM Studio all use GGUF-format model files. If you've already downloaded a GGUF model for one tool, you can point another tool at the same file. MLX uses its own format, so those models only work with MLX itself (or LM Studio's MLX backend on a Mac).