Why We Run LLMs On-Device in 2026

#ai #machinelearning #privacy #indiedev

Originally published on the AstroLexis blog. Cross-posted here for the community.

For most of the last three years, "AI" has meant calling someone else's API. Your prompt leaves your machine, hits a datacenter, and a response comes back. In 2026 that's no longer the only sensible architecture. Here's the case for running LLMs on your own hardware — and what we ship at AstroLexis to make it actually work.

The cloud isn't the only place for AI anymore

When OpenAI shipped GPT-3.5 in late 2022, running an LLM locally was an exotic hobby. The smallest useful models needed a workstation, the tooling barely worked outside a research lab, and inference was slow enough that real-time use was out of reach. The cloud was the only practical option.

That's not the world we live in anymore. As of mid-2026:

An Apple M4 Pro Mac mini ($1,400) runs a quantized 30B parameter model at 25-40 tokens/second using MLX.
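A consumer RTX 5090 (32GB VRAM) runs 30B-class models in 4-bit quantization with comfortable headroom for context windows. A 70B model at 4-bit needs roughly 35GB for the weights alone, so it calls for lower-bit quantization or partial CPU offload.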
Apple's own Foundation Models (built into iOS 26 and macOS) ship a 3B-parameter on-device LLM that's available to every app through a system framework.
Llama 4, Qwen 3.6, Mistral Small 3.1 and Gemma 4 all ship 4-bit weights designed to run on commodity hardware.
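The sizing behind those bullets is simple arithmetic: parameter count times bits per weight, divided by eight. Here is a minimal sketch of that estimate. The function names and the 2GB overhead default are assumptions for illustration, and real quantization formats carry scale metadata that pushes the effective figure slightly above a flat 4 bits per weight.

```python
def weight_footprint_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in decimal gigabytes."""
    bytes_total = params_billions * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9


def fits_in_memory(params_billions: float, bits_per_weight: float,
                   memory_gb: float, overhead_gb: float = 2.0) -> bool:
    """True if weights plus an assumed overhead budget fit in memory_gb.

    overhead_gb is a placeholder for KV cache and runtime buffers. The
    real figure depends on context length and the inference runtime.
    """
    needed = weight_footprint_gb(params_billions, bits_per_weight) + overhead_gb
    return needed <= memory_gb


if __name__ == "__main__":
    for size in (3, 30, 70):
        gb = weight_footprint_gb(size, 4)
        print(f"{size}B at 4-bit: ~{gb:.1f} GB of weights")
```

Treat the result as a lower bound. Long context windows can add several gigabytes of KV cache on top of the weights.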

The cost-performance curve has crossed a line where, for a large class of real applications, running locally is now better — not just feasible. The question stopped being "can we run this without the cloud?" and became "why are we still sending this to someone else's datacenter?"

Cost: the math has flipped

In 2024, calling a hosted LLM was an order of magnitude cheaper than running your own inference. By 2026, for any sustained workload, the math has reversed.

Take a concrete example: a static code analysis pipeline that scans 500 commits per day against a 1M-line codebase. With KCode we measured:

OpenAI o4-mini, hosted API: ~$340/month, plus the latency overhead of going to the cloud per file.
Local Qwen3.6-Heretic 30B on a single RTX 5090: roughly $0 marginal cost after the GPU is purchased, with a sub-second turnaround per file because the model is warm in VRAM and there's no network hop.

The capex is real — a workstation isn't free. But for any team doing real volume, the breakeven against API pricing arrives in 4-8 months. After that, every additional run is essentially free. The same calculus applies to support agents, document classification pipelines, voice transcription, image captioning, anything that runs at scale.
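The breakeven claim above is easy to check against your own numbers. The sketch below divides the hardware cost by the monthly savings. The $340 API figure is the one quoted above; the $2,500 hardware cost and $20 per month for electricity are hypothetical placeholders, not measurements.

```python
import math


def breakeven_months(hardware_cost: float, monthly_api_cost: float,
                     monthly_power_cost: float = 0.0) -> float:
    """Months until hardware spend is recovered by avoided API spend."""
    monthly_savings = monthly_api_cost - monthly_power_cost
    if monthly_savings <= 0:
        # Running locally costs at least as much per month as the API,
        # so the purchase never pays for itself at this volume.
        return math.inf
    return hardware_cost / monthly_savings


if __name__ == "__main__":
    # Hypothetical inputs: $2,500 workstation, $20/month electricity.
    months = breakeven_months(2500, 340, 20)
    print(f"Breakeven after ~{months:.1f} months")  # ~7.8 months
```

With these placeholder inputs the result lands inside the 4-8 month range. A lower API bill or a pricier workstation pushes it out quickly, so it is worth rerunning with real quotes.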

Privacy: your data is your data

The privacy story is simple enough to explain to a non-technical user: if your data never leaves your machine, no one else can lose it, sell it, or train on it.

This matters more in some contexts than others. We ship products on both ends of the privacy spectrum:

ClearCaps generates live captions and diarized transcripts for users with hearing loss. The audio is profoundly personal — medical conversations, family calls, work meetings. Running speech recognition (WhisperKit) and speaker diarization on-device means there's nothing for an attacker to intercept or a vendor to monetize.
PhoenixSteps is a clinical speech-therapy companion for pediatric patients. The users are children. Their speech recordings are protected health information under HIPAA in the US and under comparable frameworks in most other jurisdictions. There is no cloud version we would be willing to ship.
Kulvex AI is a self-hosted assistant. It runs on hardware the user owns, in their home, on their network. We never see the conversations.

This isn't ideology. It's a product constraint. There are categories of software (health, legal, family, identity) where sending user data to a cloud LLM is a non-starter. For those, on-device is the only viable architecture.

Latency: 50ms vs 800ms

A cloud LLM round-trip is at minimum the network latency (50-200ms) plus the time-to-first-token (200-1000ms depending on load) plus the streaming of the response. For a short reply that's a 1-2 second user-facing delay.

An on-device model on Apple Silicon, with the weights already memory-mapped into RAM, can start producing tokens in under 50ms and stream at 30+ tokens/second for a 7B model. For interactive UX — autocomplete, voice assistants, real-time captions — this is the difference between "feels native" and "feels like a web form."
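The two figures in this comparison, time to first token and streaming rate, can be measured with a few lines around any token stream. This is a rough sketch: `measure_stream` and `fake_stream` are hypothetical names, and `fake_stream` only simulates delays, so swap in the streaming iterator from whichever runtime you actually use.

```python
import time
from typing import Callable, Iterable, Iterator, NamedTuple


class StreamStats(NamedTuple):
    time_to_first_token_s: float
    tokens_per_second: float
    token_count: int


def measure_stream(tokens: Iterable[str],
                   clock: Callable[[], float] = time.perf_counter) -> StreamStats:
    """Consume a token stream and report TTFT and decode throughput."""
    start = clock()
    first = None
    last = start
    count = 0
    for _ in tokens:
        last = clock()
        if first is None:
            first = last  # timestamp of the first token's arrival
        count += 1
    if first is None:
        return StreamStats(float("nan"), 0.0, 0)
    decode_time = last - first
    # Throughput covers the tokens after the first one, so the
    # initial wait is not counted twice.
    tps = (count - 1) / decode_time if decode_time > 0 else 0.0
    return StreamStats(first - start, tps, count)


def fake_stream(n: int, first_delay_s: float, per_token_s: float) -> Iterator[str]:
    """Stand-in for a model's streaming output (simulated delays only)."""
    time.sleep(first_delay_s)
    for i in range(n):
        if i:
            time.sleep(per_token_s)
        yield f"tok{i}"


if __name__ == "__main__":
    stats = measure_stream(fake_stream(20, first_delay_s=0.05, per_token_s=0.03))
    print(stats)
```

Separating the two numbers matters because short interactive replies are dominated by time to first token, while long generations are dominated by tokens per second.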

We're working with this constraint right now on our iOS apps. The Apple Foundation Models framework gives us a 3B-parameter LLM that responds in 100-200ms total on an iPhone 16. That's fast enough that the user never sees a spinner. The same query against the OpenAI API would feel slower even if it produced a higher-quality answer, because perceived UI speed dominates short interactions.

Freedom: no vendor lock-in

This is the underappreciated one. Every cloud LLM you build on top of is a dependency on someone else's roadmap, pricing, and content policy. They can deprecate the model you're using, double the price overnight, refuse to serve your jurisdiction, or decide that your use case violates their terms.

We've watched this play out repeatedly:

The original GPT-4 API was deprecated and replaced with new versions that broke established prompt patterns for thousands of products.
Anthropic, OpenAI, and Google have all rejected or rate-limited use cases at various points (security tooling, certain medical applications, anything touching content moderation).
Hosted prices have moved up and down without warning, making it impossible to model unit economics.

On-device, you can pin the model version forever. Llama 4 will run on your 5090 in 2030 the same way it runs today. No one can take it away. Your customers' workflows don't break because a vendor changed their mind.

The on-device weights become a real asset. It's the opposite of "renting" intelligence.
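One way to make "pin the model version" concrete is to record a checksum of the weights file when you adopt it and verify it at startup. The sketch below assumes a single weights file on disk. The function names are illustrative, and the expected hash would come from your own release notes or lockfile.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in 1 MiB chunks so large weights are never read into RAM at once."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_pinned_weights(path: Path, expected_sha256: str) -> None:
    """Raise ValueError if the weights on disk differ from the pinned release."""
    actual = sha256_of_file(path)
    if actual != expected_sha256.lower():
        raise ValueError(
            f"{path.name}: expected {expected_sha256.lower()}, got {actual}"
        )
```

Calling `verify_pinned_weights` before loading the model turns a silent weight swap into a loud startup failure, which is usually what you want for a pinned dependency.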

What we ship at AstroLexis

Everything we build runs locally by default. The full lineup:

Kulvex AI — self-hosted AI platform with 17 domain agents (home automation, messaging across 8 platforms, voice control). Runs on your own GPU.
KCode — deterministic security audit tool with 414 hand-curated patterns across 20+ languages. Pre-filters with regex/AST, verifies with a local LLM. Your source code never leaves your machine. SARIF output, GitHub Action.
ClearCaps — live captions and speaker diarization on iPhone. WhisperKit + Apple SpeakerKit, all on-device.
SiliconMon — Apple Silicon system monitor for macOS. Shows you exactly what your GPU, ANE, and unified memory are doing while you run MLX, Ollama, llama.cpp, or LM Studio locally.
PhoenixSteps — clinical speech-therapy companion for pediatric SLPs. iOS-only, MLX-based.
Vela — memory companion for adults with memory impairment. iOS-only, on-device.
Tutto — conversational practice for English and Spanish learners. In development.

The common thread isn't a particular AI framework or model. It's the architectural commitment: the user owns the inference. We don't sit in the middle.

How to start

If you're building software in 2026 and considering whether to make an on-device version, our take:

Start with the right hardware target. Apple Silicon is the most underrated AI dev box on the market. An M2 Pro or newer Mac with 32GB+ unified memory handles 7-13B parameter models comfortably. For server work, a single consumer GPU with 24GB or more of VRAM (RTX 4090 at 24GB, RTX 5090 at 32GB) handles 30B models in 4-bit quantization.
Pick a model family and stay on it. Llama 4, Qwen 3.6, Mistral Small, Gemma 4: all ship 4-bit quantizations, and all run through mature runtimes such as MLX, llama.cpp, or vLLM. Don't chase weekly model releases. Pick one, learn its quirks, ship.
Treat the local LLM as a tool, not a magic box. Wrap it in deterministic pre-processing and post-processing. KCode does this: regex/AST patterns find candidates, the LLM verifies. The local model doesn't have to be GPT-5-level to be useful — it has to be reliable for a narrow task.
Measure honestly. Track tokens-per-second, time-to-first-token, memory footprint, and battery impact on real devices. The numbers you see on a research blog don't match what you'll see on a customer's M1 Air.
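The "tool, not a magic box" advice above can be sketched as a two-stage pipeline: cheap deterministic patterns propose candidates, and a verifier decides which ones to keep. Everything here is illustrative. The two rules are made up for the example and are not KCode's actual patterns, and `stub_verifier` is a placeholder where a call to a local model would go.

```python
import re
from dataclasses import dataclass
from typing import Callable, Iterator


@dataclass(frozen=True)
class Candidate:
    rule: str
    line_no: int
    line: str


# Hypothetical rules for illustration only.
RULES = {
    "hardcoded-secret": re.compile(
        r"\b(password|api_key|secret)\s*=\s*['\"][^'\"]+['\"]", re.IGNORECASE
    ),
    "shell-injection": re.compile(r"\bos\.system\s*\("),
}

Verifier = Callable[[Candidate], bool]


def prefilter(source: str) -> Iterator[Candidate]:
    """Deterministic stage: yield every line that matches a rule."""
    for line_no, line in enumerate(source.splitlines(), start=1):
        for rule, pattern in RULES.items():
            if pattern.search(line):
                yield Candidate(rule, line_no, line.strip())


def audit(source: str, verify: Verifier) -> list[Candidate]:
    """Keep only the candidates the verifier confirms."""
    return [c for c in prefilter(source) if verify(c)]


def stub_verifier(candidate: Candidate) -> bool:
    """Placeholder for a local LLM call: rejects lines that look like examples."""
    return "example" not in candidate.line.lower()


if __name__ == "__main__":
    sample = 'password = "hunter2"\napi_key = "example-key"\nos.system(cmd)\n'
    for finding in audit(sample, stub_verifier):
        print(finding.rule, finding.line_no, finding.line)
```

Because the verifier only ever sees pre-filtered candidates, the model's job shrinks to a narrow yes-or-no question, which is where a small local model tends to be most reliable.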

— Bruno, founder, AstroLexis LLC. If you build in this space, drop a line: contact@astrolexis.space.