Juan Torchia

Posted on • Originally published at juanchi.dev

Qwen3.6-35B-A3B Runs on My Laptop and Draws Better Than Claude Opus 4.7

I was testing local models to replace some Anthropic API calls — cost, latency, privacy, the usual reasons — when I asked Qwen3.6-35B-A3B to draw me a pelican in ASCII art. What showed up in the terminal made me do a double-take. I copied the exact prompt, sent it to Claude Opus 4.7 via API, and the result was... worse. Considerably worse. I sat there for five minutes staring at both outputs in a split terminal, wondering what the hell everyone is actually measuring with those benchmarks.

Qwen3.6 Local vs Claude Opus 4.7: Context Before the Pelican

Qwen3.6-35B-A3B is a Mixture of Experts model from Alibaba. 35 billion parameters total, but only 3.7B are active per inference — hence the A3B suffix. That makes it surprisingly efficient on consumer hardware. I run it with llama.cpp on a laptop with 32GB of unified RAM, quantized to Q4_K_M, and inference is smooth. Not cloud-API fast, but smooth.

Claude Opus 4.7 is Anthropic's most expensive model at the time of writing. It costs $15 per million input tokens and $75 per million output tokens. It's what you reach for when you want the best Anthropic has to offer.

These two models shouldn't even be competing in the same weight class. And yet.

# Setup: llama.cpp with Metal support on macOS
# The model weighs ~22GB at Q4_K_M

# Download the model from Hugging Face
huggingface-cli download \
  Qwen/Qwen3.6-35B-A3B-Q4_K_M-GGUF \
  --local-dir ./models/qwen3.6-35b

# Local server with wide context
./llama-server \
  -m ./models/qwen3.6-35b/qwen3.6-35b-a3b-q4_k_m.gguf \
  --ctx-size 32768 \
  --n-predict 2048 \
  --host 0.0.0.0 \
  --port 8080 \
  -ngl 99  # All layers on GPU (Metal)

The prompt I used was deliberately simple: "Draw me a pelican in ASCII art. Make it look good, with detail on the characteristic beak." No size instructions, no examples, no forced chain-of-thought.
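If you want to reproduce the comparison programmatically, here's a minimal sketch that builds the request for llama-server's OpenAI-compatible chat endpoint. The port matches the setup above; the model name is just a label (llama-server serves whatever model it loaded), and the helper function is mine, not part of any API:

```python
import json

PROMPT = ("Draw me a pelican in ASCII art. "
          "Make it look good, with detail on the characteristic beak.")

def build_request(prompt: str, host: str = "localhost", port: int = 8080):
    """Build the URL and JSON body for llama-server's OpenAI-compatible endpoint."""
    url = f"http://{host}:{port}/v1/chat/completions"
    payload = {
        "model": "qwen3.6-35b-a3b",  # label only; ignored by llama-server
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.6,          # lower temp helps spatial tasks
        "max_tokens": 1024,
    }
    return url, payload

url, payload = build_request(PROMPT)
print(url)
print(json.dumps(payload["messages"]))
```

Send it with `requests.post(url, json=payload)` or pipe the JSON into curl against the running server.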

What Came Out: The Comparison I Didn't Expect

Qwen3.6 gave me this (reconstructed approximately for the post, but faithful to the real output):

        .---.
       /     \
      |  o   |
       \  --/ \
        '--'   |
       /|      |
      / |  __--'
,----'  | /
|       |/
|    ___|
|   /   \___
'--'         '--.
   ~~~~~~~~~~~~~~~

It has the long beak with a pouched bottom — the pelican's gular sac. It has a neck. It has a body. It has legs. The water at the base makes contextual sense. It's not a photographic pelican, but it is a pelican. Someone who didn't know which model generated it would say "yeah, that's a pelican."

Claude Opus 4.7 gave me something closer to this:

   ___
  /   \
 |  o  |
  \___/
   |||
   |||  ___________
   |||_/

That's a generic bird with a stick underneath. No distinctive beak. No gular sac. Nothing that identifies it specifically as a pelican. It could be any bird.

I repeated the experiment three times with prompt variations. The pattern held.
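My repetitions were manual, but a tiny harness makes the runs-per-variation loop reproducible. The prompt variations and the lambda below are illustrative stand-ins, not my exact prompts or a real client:

```python
# Tiny repetition harness: run every prompt variation N times and keep the
# raw outputs for side-by-side eyeballing in the terminal.
from typing import Callable, List

VARIATIONS = [
    "Draw me a pelican in ASCII art. Detail the beak.",
    "ASCII art of a pelican, emphasizing the pouched beak.",
    "Draw a pelican in ASCII; make the gular sac recognizable.",
]

def run_trials(generate: Callable[[str], str],
               prompts: List[str], runs: int = 3) -> List[str]:
    """Collect runs x len(prompts) outputs from a generate callable."""
    return [generate(p) for p in prompts for _ in range(runs)]

# Stub generator; swap in a real call to the local server or the API.
results = run_trials(lambda p: f"<output for: {p[:24]}>", VARIATIONS)
print(len(results))  # 3 variations x 3 runs = 9 outputs to compare by hand
```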

Why This Matters Beyond ASCII Art

The easy first reaction is: "well, it's a niche task, benchmarks measure more important things." And that's exactly the problem.

Benchmarks measure what's easy to measure: MMLU, HumanEval, GSM8K, mathematical reasoning, standardized reading comprehension. They're useful. But they don't measure spatial representation in text, which is exactly what ASCII art tests. And that capability has real-world correlates: understanding architecture diagrams described in text, reasoning about layouts, generating technical documentation with ASCII diagrams that are actually readable.

I've written before about the gap between 'works' and 'is useful' with Gemma 4 on iPhone. The pelican case is the inverse story: something that technically should be inferior, on a metric that matters to me, isn't.

This has direct implications for what you're paying. If you're using Claude Opus 4.7 for tasks where local Qwen3.6 is equal or better, you're paying for the brand name and the convenience of the API. That can be a valid choice, but API billing consumes your credits in ways that aren't always transparent, and you need to understand the trade-off. At least make it a conscious decision.

# Cost comparison for the same usage volume
# 1 million input tokens per month (moderate use)

costs = {
    # API models
    "claude_opus_4_7": {
        "input_per_mtoken": 15.00,    # USD
        "output_per_mtoken": 75.00,
        "estimated_monthly_cost": 90.00,  # ~1M input + 1M output
        "hardware_required": 0
    },
    "claude_sonnet": {
        "input_per_mtoken": 3.00,
        "output_per_mtoken": 15.00,
        "estimated_monthly_cost": 18.00,
        "hardware_required": 0
    },
    # Local model
    "qwen3_6_35b_local": {
        "input_per_mtoken": 0,         # Electricity, basically
        "output_per_mtoken": 0,
        "estimated_monthly_cost": 3.50, # Estimated electricity cost
        "hardware_required": 32_000    # MB RAM (24GB minimum, 32GB comfortable)
    }
}

# The point isn't that local always wins
# The point is that the quality difference
# doesn't always justify the price difference
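A quick break-even sketch using the Opus numbers above. It assumes you already own the hardware, so only marginal monthly costs are compared, and it assumes a symmetric input/output volume:

```python
# Break-even sketch: how much does local inference save per month?
OPUS_COST_PER_MTOK_PAIR = 15.00 + 75.00  # $ for 1M input + 1M output tokens
LOCAL_COST_PER_MONTH = 3.50              # rough electricity estimate (assumption)

def monthly_savings(mtoken_pairs: float) -> float:
    """Savings for a monthly volume in millions of input+output token pairs."""
    return mtoken_pairs * OPUS_COST_PER_MTOK_PAIR - LOCAL_COST_PER_MONTH

for volume in (0.1, 0.5, 1.0):
    print(f"{volume:.1f}M token pairs/month -> ${monthly_savings(volume):.2f} saved")
```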

The Mistakes You Make When Evaluating Local Models

Mistake 1: Comparing the quantized model to full-precision as if they're the same. Q4_K_M loses some quality compared to the original FP16 model. In complex reasoning tasks, that loss matters. In ASCII art and many text generation tasks, you won't notice it. You need to know when it matters.
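To get a feel for what quantization buys you in memory, here's a back-of-the-envelope size estimate from average bits per weight. The bpw figures are approximations, not exact spec numbers, and real GGUF files vary a little:

```python
# Rough GGUF file size from average bits per weight (bpw values approximate).
def gguf_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate model file size in decimal gigabytes."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

for name, bpw in [("FP16", 16.0), ("Q8_0", 8.5), ("Q4_K_M", 4.85)]:
    print(f"{name:7s} ~{gguf_size_gb(35, bpw):5.1f} GB")
```

The ~21 GB estimate for Q4_K_M lines up with the ~22GB file mentioned in the setup; FP16 at ~70 GB wouldn't fit on this laptop at all.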

Mistake 2: Ignoring temperature and sampling parameters. Local models running in llama.cpp have defaults that can differ from what Anthropic uses in their API. Temperature 0.7 locally can behave differently from temperature 0.7 in the API.

# Parameters I tuned to get more consistent results
# on creative/spatial tasks
./llama-cli \
  -m ./models/qwen3.6-35b/qwen3.6-35b-a3b-q4_k_m.gguf \
  --temp 0.6 \
  --top-p 0.9 \
  --top-k 40 \
  --repeat-penalty 1.1 \
  -p "Draw me a pelican in ASCII art with detail on the beak"
# --temp lower = more deterministic, better for spatial tasks
# --repeat-penalty prevents the model from looping characters

Mistake 3: Using thinking mode when you don't need it. Qwen3.6 has extended reasoning capability (like an internal chain-of-thought). For ASCII art, that mode is counterproductive — the model starts reasoning about the pelican instead of drawing it. I turned it off with /no_think in the prompt and results improved immediately.
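The soft switch is literally part of the prompt text. A sketch of how I append it; the helper function name is mine, not part of any API:

```python
# Toggle Qwen's thinking mode per prompt with the "/no_think" soft switch
# described above: it's appended to the user message, nothing more.
def make_message(prompt: str, thinking: bool = False) -> dict:
    content = prompt if thinking else f"{prompt} /no_think"
    return {"role": "user", "content": content}

msg = make_message("Draw me a pelican in ASCII art.")
print(msg["content"])
```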

Mistake 4: Only evaluating on what you already know the cloud model wins at. If your evaluation is biased toward the expensive model's strengths, you'll conclude it's worth the price. Test it on your actual use cases, not on the marketing benchmarks.

Mistake 5: Not counting privacy as a variable in the equation. Everything you send to the Anthropic API leaves your machine. If that matters in your context — and it should matter more than you think — then the local model has value that doesn't show up in any quality benchmark.

FAQ: Qwen3.6 Local vs Claude Opus 4.7

Does Qwen3.6-35B-A3B actually run on consumer hardware?
Yes, with conditions. You need at least 24GB of RAM for the quantized model at Q4_K_M (~22GB). With 32GB of unified RAM (like Apple's M2/M3 Pro chips) you run it comfortably. On systems with separate RAM and VRAM, you need it to fit in VRAM or accept partial CPU inference, which is much slower.

What exactly does the A3B suffix in Qwen3.6-35B-A3B mean?
It's a Mixture of Experts (MoE) model. It has 35 billion parameters total, but the MoE architecture only activates a subset per processed token — in this case, approximately 3.7B active parameters. That makes it significantly more efficient in memory and speed than an equivalent dense 35B model, while retaining most of the capability.
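The arithmetic is worth seeing once. Note the nuance: all 35B parameters still have to fit in RAM/VRAM; only the per-token compute scales with the active subset:

```python
# MoE inference economics: only the routed experts run per token,
# but the full weight set must still be resident in memory.
TOTAL_PARAMS_B = 35.0
ACTIVE_PARAMS_B = 3.7

active_fraction = ACTIVE_PARAMS_B / TOTAL_PARAMS_B
print(f"~{active_fraction:.0%} of parameters active per token")
```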

Where does Claude Opus 4.7 still win comfortably?
Complex mathematical reasoning, instruction-following with many simultaneous constraints, long-document analysis with extended context, and tasks where consistency under pressure matters a lot. For very specific code with complex edge cases, I still prefer Claude Sonnet or Opus. The pelican was a surprise; it doesn't mean they're equivalent at everything.

Is the technical setup worth it to run Qwen3.6 locally?
Depends on your usage volume. If you're doing more than 500K tokens per month, the savings are significant. If you're using local models on projects where privacy matters — proprietary code, sensitive data — the setup pays for itself regardless of cost. If you're technically curious and already have the hardware, absolutely yes. It's not a process for casual users yet, but it's also not rocket science.

How do I know which model to use for each task in practice?
Honest answer: experiment on your specific use cases, not on generic benchmarks. I built a system similar to what I described in the post about self-regulated curation of technical resources: I evaluate models on real tasks I need to do anyway, log the results, and adjust routing accordingly. There's no shortcut more reliable than that.
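The logging system can start embarrassingly simple. The scores below are invented for illustration; what matters is the shape of the routine, not these numbers:

```python
# Minimal routing log: score models on your own tasks, pick the winner
# per task. Scores here are made-up placeholders.
scores = {
    ("ascii_art", "qwen3.6-local"): 0.9,
    ("ascii_art", "opus-4.7"): 0.4,
    ("code_review", "qwen3.6-local"): 0.6,
    ("code_review", "opus-4.7"): 0.9,
}

def best_model(task: str) -> str:
    """Return the highest-scoring model logged for a task."""
    candidates = {m: s for (t, m), s in scores.items() if t == task}
    return max(candidates, key=candidates.get)

print(best_model("ascii_art"))
print(best_model("code_review"))
```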

Is Qwen3.6 safe to run locally from a security standpoint?
Safer than the API in the sense that your data doesn't leave your machine. But "safe" is a broader dimension. The model itself is open-weight and auditable. The main risk vector in local setups is the dependencies (llama.cpp, serving frameworks) and the endpoints you expose on the network. If you expose the local server on your network, apply the same rules you'd apply to any service: authentication, logging, and no exposure to the internet without a good reason.

What the Pelican Taught Me About How We Evaluate Intelligence

The pelican isn't the point. The point is that a concrete, specific task that I needed for a real project was solved better by an open-weight model running on my laptop than by the most expensive model from one of the most respected AI labs in the world.

That doesn't mean Qwen3.6 is "better" than Claude Opus 4.7 in any global sense. It means the notion of "better" is completely context-dependent, and that the benchmarks we use to measure "intelligence" in language models are a noisy approximation of what actually matters in practice.

What this experiment changed for me was the evaluation process. Now before choosing which model to use for a new task, I run a small, fast evaluation with my actual use cases. Not marketing benchmarks, not lab papers. My tasks, my hardware, my criteria.

Sometimes the result surprises me. The pelican surprised me. And that's worth more than any number on a leaderboard.

If you want to replicate the experiment, the full setup is in the code block above. Give yourself an hour to get it running. Test it on your tasks, not mine. Maybe your use case still needs Opus. Maybe it doesn't. The only way to find out is to ask a pelican.
