Qwen 3.5: The AI Model That Runs on Your iPhone Without an Internet Connection
Written by Arshdeep Singh
The default assumption in AI today is connectivity. Ask a question → request goes to a data center → model processes it → response comes back. Fast, convenient, and entirely dependent on a working internet connection and a third party you're trusting with your data.
Qwen 3.5 is part of a different trend: capable AI models small enough to run on your phone, your laptop, your edge device — entirely offline. Alibaba's open-weight model family has been moving steadily in this direction, and with Qwen 3.5, released in February 2026, on-device AI crossed a meaningful capability threshold.
The Qwen Family: Context
To understand where Qwen 3.5 sits, it helps to see the progression:
Qwen 2.5 (2024) — Alibaba's strong open-weight series, competitive with Llama 3 at various sizes. Solid general-purpose models from 0.5B to 72B parameters.
Qwen 3 (April 2025) — A major leap. Introduced hybrid thinking/non-thinking modes (switchable chain-of-thought, in the style DeepSeek-R1 popularized), scaled up to 235B parameters via Mixture of Experts (MoE), and achieved near-frontier performance on reasoning benchmarks. The 235B MoE model became a serious open-weight competitor to closed models.
Qwen 3.5 (February 2026) — The on-device focus. Rather than scaling up, Alibaba optimized down. The key innovation: taking Qwen 3's capabilities and compressing them into sizes that run on consumer hardware — phones, laptops, embedded devices.
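The MoE idea mentioned above is what lets a huge model stay cheap per token: a small router picks a few experts for each token, so only a fraction of the weights actually run. Here's a toy sketch of top-k routing — an illustration of the general technique, not Qwen's actual implementation:

```python
# Toy Mixture-of-Experts routing: a router scores every expert per token,
# only the top-k experts execute, and their outputs are mixed by the
# renormalized gate weights. (Illustrative only, not Qwen's real code.)
import math

def softmax(scores):
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route(router_scores, k=2):
    """Return {expert_index: gate_weight} for the top-k experts."""
    probs = softmax(router_scores)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return {i: probs[i] / norm for i in top}

# One token, eight experts: only two run, the rest stay idle.
gates = route([0.1, 2.3, -0.5, 1.8, 0.0, -1.2, 0.7, 0.4], k=2)
```

This is why "235B parameters" doesn't mean "235B parameters per token" — the active parameter count per forward pass is much smaller.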
The Model Sizes
Qwen 3.5 ships in four sizes:
| Model | Parameters | Target Hardware |
|---|---|---|
| Qwen 3.5-0.8B | 0.8 billion | iPhone 12+, mid-range Android |
| Qwen 3.5-2B | 2 billion | iPhone 14+, any modern laptop |
| Qwen 3.5-4B | 4 billion | iPhone 15 Pro, M1 MacBook Air |
| Qwen 3.5-9B | 9 billion | M2/M3 MacBook, high-end phones |
The 0.8B model runs on an iPhone 12. The 9B model runs on a MacBook Air with an M-series chip. All of them run without an internet connection.
This isn't a theoretical capability. These models run at usable speeds on the hardware most people already own.
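A quick back-of-envelope calculation shows why these sizes map to that hardware. Quantized weights take roughly parameters × (bits per weight ÷ 8) bytes; the 4-bit figures below are lower bounds, since KV cache and runtime overhead add more on top:

```python
# Rough memory estimate for quantized model weights.
# bytes ≈ parameters × (bits per weight / 8); real usage is higher
# because of the KV cache, activations, and runtime overhead.
def weight_gb(params_billions, bits=4):
    return params_billions * 1e9 * bits / 8 / 1e9

for size in (0.8, 2, 4, 9):
    print(f"{size}B @ 4-bit ≈ {weight_gb(size):.1f} GB")
```

A 4B model at 4-bit is about 2 GB of weights — comfortable on a phone with 8 GB of RAM — while a 9B model at 4-bit is about 4.5 GB, which is why it lands on M-series Macs rather than mid-range phones.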
Hybrid Thinking Mode
One of Qwen 3.5's key inherited features from Qwen 3 is the hybrid thinking/non-thinking mode.
In thinking mode, the model uses chain-of-thought reasoning — working through problems step by step before producing an answer. This is slower but significantly more accurate for complex reasoning tasks: math, coding, multi-step logic.
In non-thinking mode, the model responds immediately without the intermediate reasoning steps. Faster, suitable for conversational use, simple lookups, and tasks where speed matters more than depth.
The ability to toggle between these modes on-device is meaningful. You get a model that can be fast and lightweight for casual use, and slow-and-thorough when you need it to actually think.
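Qwen 3 exposed this toggle two ways: an `enable_thinking` flag in the chat template, and a soft switch appended to the user turn. Assuming Qwen 3.5 keeps the same convention (worth verifying against the model card), a minimal message helper might look like:

```python
# Soft-switch sketch, assuming Qwen 3.5 keeps Qwen 3's convention:
# appending "/think" requests chain-of-thought reasoning, while
# "/no_think" requests an immediate direct answer.
def build_message(text, thinking=True):
    switch = "/think" if thinking else "/no_think"
    return {"role": "user", "content": f"{text} {switch}"}

fast = build_message("What's the capital of France?", thinking=False)
slow = build_message("Prove that sqrt(2) is irrational.", thinking=True)
```

On-device, this means one downloaded model serves both roles: snappy chat by default, deliberate reasoning on demand.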
Open-Weight: What That Actually Means
"Open-weight" is meaningfully different from "open-source," and it's worth being precise.
Open-weight means the model weights are publicly available for download. You can:
- Download the model and run it locally
- Fine-tune it on your own data
- Deploy it on your own infrastructure
- Integrate it into your application
You cannot necessarily see the training code, the data curation process, or the full training recipe — that's where "open-weight" differs from fully open-source.
But for practical purposes, open-weight is what matters for most developers and most use cases:
- No per-token fees — run as many tokens as you want, pay only for your own hardware/compute
- No rate limits — inference speed is limited only by your hardware
- No privacy concerns — data never leaves your device
- No downtime — if your internet is down, the model still runs
- Fine-tunable — adapt the model to your domain, your style, your use case
For consumer applications, edge deployments, and privacy-sensitive use cases, these properties are transformative.
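The "no per-token fees" point can be made concrete with a break-even sketch. The numbers below are placeholders chosen for illustration, not any provider's actual rates:

```python
# Hypothetical break-even: cloud per-token billing vs. one-time hardware.
# Both figures are placeholders for illustration, not real prices.
CLOUD_USD_PER_M_TOKENS = 0.50   # placeholder cloud rate per million tokens
HARDWARE_USD = 1200.0           # placeholder laptop cost (often already owned)

def breakeven_tokens(hardware=HARDWARE_USD, rate=CLOUD_USD_PER_M_TOKENS):
    """Tokens after which local inference beats per-token billing."""
    return hardware / rate * 1e6
```

Under these placeholder numbers the crossover is in the billions of tokens — and if the hardware is a laptop you already own, the marginal cost of local inference is effectively electricity.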
Why On-Device AI Matters Now
The argument for on-device AI has always been there: privacy, latency, offline capability, cost. But for years, the models small enough to run on phones were too limited to be genuinely useful — good enough for autocomplete, not good enough for reasoning.
Qwen 3.5 is evidence that this gap is closing.
The 4B model, running locally on a modern phone, can:
- Answer complex questions with reasonable accuracy
- Write and explain code
- Summarize documents
- Reason through multi-step problems
- Translate between languages
Not perfectly. Not at the level of GPT-4o or Claude Sonnet. But well enough for a significant fraction of real tasks — and entirely offline.
The 9B model on a MacBook is more capable still. For many everyday AI tasks, it's competitive with early-generation frontier models.
Running Qwen 3.5 Locally
Via Ollama (easiest):

```bash
ollama pull qwen3.5:4b
ollama run qwen3.5:4b
```
Via llama.cpp:

```bash
# Download a GGUF quantization from Hugging Face first
./llama-cli -m qwen3.5-4b-q4_k_m.gguf -p "Your prompt here"
```
Via Transformers:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3.5-4B")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3.5-4B")
inputs = tokenizer("Explain MoE in one sentence.", return_tensors="pt")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=64)[0], skip_special_tokens=True))
```
On iPhone: Via apps like LLM Farm, PocketPal, or MLX-based iOS apps that support Qwen 3.5 weights.
The Broader Significance
Qwen 3.5 isn't just a technical achievement — it's a signal about where the AI industry is heading.
The frontier models (GPT-5, Claude 4, Gemini Ultra) will keep getting larger and more capable. But in parallel, a different optimization is happening: making capable models smaller and faster.
This second trajectory matters for:
- Developing markets — where connectivity is unreliable but smartphones are ubiquitous
- Privacy-first applications — medical, legal, personal data that can't leave the device
- Edge computing — AI in IoT devices, industrial equipment, vehicles
- Cost reduction — enterprise deployments where inference costs matter at scale
- Resilience — applications that need to function even when cloud services are down
Alibaba's Qwen series has been consistently impressive and consistently underappreciated in Western tech media. Qwen 3.5 continues that pattern: a serious technical achievement that quietly expands what's possible for developers and users who care about running AI on their own terms.
Final Thoughts
The question "what AI can I run without internet?" has historically had a depressing answer. Qwen 3.5 changes that.
A 4B model that reasons well, runs on a modern iPhone, and supports hybrid thinking mode is a qualitatively different kind of tool than the cramped, limited on-device models of two years ago.
Download it. Run it. See for yourself what offline AI looks like in 2026.
The model weights are free. The inference is yours. The data stays on your device.