Alibaba just dropped four models and said "here, they're free." The Qwen 3.5 Small family — 0.8B, 2B, 4B, and 9B parameter models — is fully open source under Apache 2.0. No gated access, no usage restrictions, no phone-home telemetry. Download the weights and run them on whatever hardware you have.
I spent the weekend benchmarking all four on code generation tasks. Here's whether any of them are actually useful.
## The Lineup
The Qwen 3.5 Small family is designed for edge and local deployment. These aren't frontier models competing with GPT-4 — they're small models optimized to run fast on limited hardware. Each model targets a different hardware tier.
| Model | Parameters | Quantized Size (Q4) | Target Hardware |
|---|---|---|---|
| Qwen 3.5 Small 0.8B | 800M | ~500MB | Phones, Raspberry Pi |
| Qwen 3.5 Small 2B | 2B | ~1.2GB | Laptops (CPU only) |
| Qwen 3.5 Small 4B | 4B | ~2.5GB | Laptops with basic GPU |
| Qwen 3.5 Small 9B | 9B | ~5.5GB | Desktop with 8GB+ VRAM |
The 0.8B model runs on a phone. The 9B model fits in a single consumer GPU. Every model in between fills a gap. Alibaba clearly thought about the deployment matrix.
## Benchmark Setup
I ran each model through three code generation benchmarks: HumanEval (164 Python problems), MBPP sanitized (427 Python problems), and a custom set of 50 practical tasks I've been maintaining — things like "parse this CSV," "implement retry logic with exponential backoff," and "write a SQLAlchemy model for this schema."
All tests ran on my MacBook Pro M3 Max with 36GB RAM using llama.cpp with Q4_K_M quantization. I used temperature 0.2 and a maximum of 512 output tokens per problem.
```bash
# Setup for each model
./llama-server \
  --model qwen3.5-small-{size}-Q4_K_M.gguf \
  --ctx-size 8192 \
  --n-gpu-layers 99 \
  --port 8080

# Run benchmark suite
python benchmark.py \
  --endpoint http://localhost:8080/completion \
  --suite humaneval,mbpp,practical \
  --temperature 0.2 \
  --max-tokens 512 \
  --runs 3  # average over 3 runs for stability
```
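For context on what the pass rates below measure, here's a minimal sketch of how a pass@1 scorer works: each problem gets exactly one model completion, which is executed against the problem's assertions, and the score is the fraction that run clean. The names (`Problem`, `score_pass_at_1`) are illustrative, not the actual internals of my `benchmark.py`.

```python
from dataclasses import dataclass


@dataclass
class Problem:
    prompt: str      # e.g. a function signature plus docstring
    completion: str  # the model's single (pass@1) completion
    test: str        # assertions that exercise the generated function


def passes(problem: Problem) -> bool:
    """Return True if the completion defines code that satisfies the tests."""
    namespace: dict = {}
    try:
        exec(problem.prompt + problem.completion, namespace)  # define the function
        exec(problem.test, namespace)                         # run the assertions
        return True
    except Exception:
        return False


def score_pass_at_1(problems: list[Problem]) -> float:
    """Fraction of problems solved on the first (and only) attempt."""
    return sum(passes(p) for p in problems) / len(problems)
```

The real HumanEval/MBPP harnesses also sandbox execution and enforce timeouts; this sketch skips that for brevity.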
## The Results
Here's the raw data. Pass rates are pass@1 (first attempt, no retries).
| Model | HumanEval | MBPP | Practical | Tokens/sec |
|---|---|---|---|---|
| Qwen 3.5 Small 0.8B | 31.7% | 28.4% | 18.0% | 95 t/s |
| Qwen 3.5 Small 2B | 48.2% | 44.1% | 36.0% | 72 t/s |
| Qwen 3.5 Small 4B | 61.0% | 57.3% | 52.0% | 48 t/s |
| Qwen 3.5 Small 9B | 72.0% | 68.5% | 64.0% | 28 t/s |

For comparison:

| Model | HumanEval | MBPP | Practical | Tokens/sec |
|---|---|---|---|---|
| GPT-4o-mini (API) | 82.3% | 78.1% | 76.0% | N/A |
| Claude Haiku 3.5 | 80.5% | 76.8% | 74.0% | N/A |
| Llama 3.2 8B (local) | 67.1% | 63.2% | 58.0% | 32 t/s |
The 9B model is legitimately competitive. It beats Llama 3.2 8B across the board — roughly 5 percentage points better on HumanEval and MBPP, and 6 points better on my practical tests. For a model you can run locally with zero API cost, those numbers are solid.
The 4B model is the sweet spot for laptop use. It's fast enough for real-time autocomplete (48 tokens/sec is comfortable), and it passes more than half of practical coding tasks. It won't write your architecture for you, but it can handle function-level generation reliably.
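To put "fast enough for real-time autocomplete" in concrete terms, here's the back-of-envelope arithmetic. The ~25-token suggestion length is my assumption about a typical inline completion, not something measured in the benchmark:

```python
# Rough per-suggestion latency at each model's measured throughput.
SUGGESTION_TOKENS = 25  # assumed typical inline-completion length

for model, tokens_per_sec in [("0.8B", 95), ("2B", 72), ("4B", 48), ("9B", 28)]:
    latency = SUGGESTION_TOKENS / tokens_per_sec
    print(f"{model}: {latency:.2f}s per suggestion")
```

At 48 t/s the 4B model returns a suggestion in about half a second, which feels instant in an editor; the 9B model's ~0.9s is noticeable but tolerable.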
The 2B model is where quality drops noticeably. It handles straightforward tasks — simple algorithms, basic CRUD operations — but struggles with anything involving multiple steps of reasoning. Error handling, edge cases, and complex control flow are weak points.
The 0.8B model is a toy for code generation. At 31.7% on HumanEval, it's wrong more often than it's right. But at 95 tokens/sec on a laptop, it's interesting for other uses — code completion of short snippets, documentation generation, commit message writing. Tasks where being wrong some of the time is acceptable.
## Where They Shine
The Qwen 3.5 Small models are surprisingly good at Python and JavaScript. Alibaba clearly weighted the training data toward these languages. The 9B model's Python output is clean, idiomatic, and handles standard library usage correctly.
```python
# Prompt: "Write a function that finds all duplicate files in a directory
# by comparing SHA-256 hashes"
# Qwen 3.5 Small 9B output (actual, unedited):
import hashlib
from pathlib import Path
from collections import defaultdict

def find_duplicates(directory: str) -> list[list[str]]:
    hash_map = defaultdict(list)
    for path in Path(directory).rglob("*"):
        if path.is_file():
            file_hash = _hash_file(path)
            hash_map[file_hash].append(str(path))
    return [paths for paths in hash_map.values() if len(paths) > 1]

def _hash_file(path: Path, chunk_size: int = 8192) -> str:
    sha256 = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            sha256.update(chunk)
    return sha256.hexdigest()
```
That's clean code. It uses pathlib, defaultdict, the walrus operator, and proper chunked reading for large files. A junior developer would be proud of this output. From a 9B-parameter model running locally.
## Where They Fall Apart
Complex, multi-step reasoning is the weak spot. When a task requires understanding the relationship between multiple functions, maintaining state across a long generation, or handling subtle edge cases, even the 9B model stumbles.
I asked each model to implement a connection pool with health checking, retry logic, and graceful shutdown. The 9B model got the basic structure right but had a subtle deadlock in the shutdown path. The 4B model produced something that looked right but had a race condition in the health check thread. The 2B and 0.8B models produced code that wouldn't run at all.
For comparison, GPT-4o-mini nailed it on the first try. There's still a significant gap between small local models and cloud API models for complex tasks.
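To illustrate the class of bug the 9B model hit: shutdown deadlocks in pools typically come from closing connections while holding the same lock or queue that checkouts block on. One minimal way to structure a shutdown path that avoids this is to refuse new acquires first, then drain and close outside any lock. This is my own sketch of the pattern, not the code any of the models produced; `FakeConn` stands in for a real driver connection:

```python
import queue
import threading


class FakeConn:
    """Stand-in for a real database connection."""
    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


class Pool:
    def __init__(self, size: int = 4):
        self._q: queue.Queue = queue.Queue()
        self._shutdown = threading.Event()
        self._all = [FakeConn() for _ in range(size)]
        for conn in self._all:
            self._q.put(conn)

    def acquire(self, timeout: float = 1.0) -> FakeConn:
        if self._shutdown.is_set():
            raise RuntimeError("pool is shut down")
        return self._q.get(timeout=timeout)  # blocks until a connection is free

    def release(self, conn: FakeConn) -> None:
        if self._shutdown.is_set():
            conn.close()        # late returns are closed, never re-pooled
        else:
            self._q.put(conn)

    def shutdown(self) -> None:
        self._shutdown.set()    # refuse new acquires *before* draining
        while True:             # drain without holding any lock
            try:
                self._q.get_nowait().close()
            except queue.Empty:
                break
```

The key ordering is in `shutdown`: the event flips before the drain, so no acquire can race in and block forever on an emptying queue. The 9B model's version did the drain first, which is exactly the window where the deadlock lived.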
## Practical Recommendations
Use the 9B model if: You want a capable local coding assistant with zero ongoing cost. Pair it with an IDE that supports local model backends (Continue, Cody, or Tabby). It's good enough for autocomplete, simple generation, and code explanation. Keep a cloud API model available for complex tasks.
Use the 4B model if: You're on a laptop without dedicated GPU or you need fast completions for autocomplete. The quality-to-speed ratio is the best in the family.
Use the 2B model if: You're building an embedded coding assistant for a resource-constrained environment. CI/CD bots, code review automation, or documentation generation where you can tolerate some error rate.
Skip the 0.8B for code. Use it for non-code tasks like summarization or text generation where it's more competitive.
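If your tooling doesn't have a local-backend integration, wiring one up against the server from the setup section is straightforward: llama.cpp's server exposes an OpenAI-compatible chat endpoint alongside `/completion`. A minimal stdlib-only client sketch, assuming the server is running on port 8080 as configured earlier:

```python
import json
import urllib.request

ENDPOINT = "http://localhost:8080/v1/chat/completions"


def build_request(prompt: str, max_tokens: int = 256) -> urllib.request.Request:
    """Build the OpenAI-style chat request (same sampling params as the benchmark)."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.2,
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        ENDPOINT,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )


def complete(prompt: str) -> str:
    """Send the prompt to the local server and return the model's reply."""
    with urllib.request.urlopen(build_request(prompt)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# complete("Write a Python function that reverses a string.")  # needs the server running
```

Because the endpoint speaks the OpenAI wire format, most editor plugins that accept a custom base URL can point at it directly with no client code at all.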
## The Bottom Line
The Qwen 3.5 Small family is the best collection of small open-source models for code generation available today. The 9B model is genuinely useful for daily development work if you're willing to accept that it won't match cloud models on complex tasks. The Apache 2.0 license means you can deploy it anywhere without legal worry.
Zero API cost is a powerful thing. For developers who can't or won't send code to external APIs — security requirements, air-gapped environments, or just principle — these models make local AI-assisted coding practical rather than aspirational.
Download the 9B model, set it up with your editor, and see if it sticks. The worst that happens is you wasted an afternoon. The best that happens is you never pay for API tokens for routine coding tasks again.