1. Why we left the cloud
For years, developers turned to the cloud because they had to, not because it was the best option. The cloud made intelligent software possible, but it came with trade-offs: privacy concerns, latency, cost unpredictability, and a total reliance on someone else’s infrastructure.
At Pieces, we decided to try something different. We rebuilt our AI stack from the ground up to run entirely on the user’s device.
No cloud calls. No API gateways. No hidden compute bills.
The goal wasn’t just “offline mode.” It was about taking back ownership of performance, of privacy, and of control.
2. A faster, leaner, smarter AI stack
Instead of using a single giant language model to do everything, we broke the problem into smaller parts.
We built a collection of small, focused models, each trained for a specific task:
- Is this request about the past or the future?
- Should it summarize, remind, or plan?
- What data span does it reference?
Each model is compact, with between 20M and 80M parameters, and is built to run fast, locally, and predictably. The models hand results off to one another like services in a microservice chain, passing structured outputs down the pipeline.
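To make that handoff concrete, here's a minimal sketch of what a chain like this can look like. The names, labels, and heuristics below are illustrative placeholders only, not our actual models or schemas:

```python
# Minimal sketch of a small-model pipeline (hypothetical names and stub logic).
# Each stage stands in for a tiny, task-specific model that emits a structured
# result for the next stage.
from dataclasses import dataclass
from typing import Literal


@dataclass
class RoutedRequest:
    text: str
    tense: Literal["past", "future"]                 # past or future?
    intent: Literal["summarize", "remind", "plan"]   # what should happen?
    data_span: tuple[str, str]                       # which data range is referenced?


def classify_tense(text: str) -> Literal["past", "future"]:
    # Placeholder for a ~20M-parameter classifier running locally.
    return "future" if "remind" in text or "tomorrow" in text else "past"


def classify_intent(text: str) -> Literal["summarize", "remind", "plan"]:
    # Placeholder for a second small model; each head makes exactly one
    # decision, which keeps it fast and predictable.
    if "remind" in text:
        return "remind"
    return "plan" if "plan" in text else "summarize"


def extract_span(text: str) -> tuple[str, str]:
    # Placeholder for a span-extraction model (example timestamps).
    return ("2024-06-01T00:00", "2024-06-07T23:59")


def route(text: str) -> RoutedRequest:
    # The "microservice chain": each model does one job and hands a
    # structured output to the next.
    return RoutedRequest(
        text=text,
        tense=classify_tense(text),
        intent=classify_intent(text),
        data_span=extract_span(text),
    )


print(route("Remind me to review last week's notes tomorrow."))
```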
The result?
- Inference times under 150ms
- No token-based costs
- Up to 16% better accuracy than GPT-4, Gemini Flash, and LLaMA-3 3B on our benchmarks
- 55× higher throughput
3. It’s not just us: this is an industry shift
We’re not alone in this thinking. The industry is pivoting.
From Apple’s A18 chip to Microsoft’s Copilot+ PCs and Snapdragon’s latest NPUs, hardware is being optimized for local AI execution. Even Chromebooks now come with tensor accelerators built-in.
Hugging Face CEO Clément Delangue summed it up best:
“Everyone’s asking for more data centers, but why aren’t we talking more about running AI on your own device?”
When developers realize that local AI isn’t a limitation but a better foundation, the shift becomes obvious.
4. How on-device AI actually works
On-device AI means the model runs where you are—on your laptop, phone, or edge device.
There’s no cloud request. No latency spike. No invisible pipeline.
Inference happens entirely in memory, with your data staying private and under your control.
At Pieces, we built our entire system around this. Using Ollama, you can run full LLMs offline on an Apple Silicon Mac or any supported Windows GPU. You can even switch between cloud and local models mid-conversation, without losing chat history or breaking context.
It’s built for anyone working without reliable internet—or working with data that simply can’t leave the machine.
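For example, with Ollama serving a model on its default local port, a full chat round trip never leaves localhost. This sketch uses Ollama's local HTTP API; the model name is just an example and needs to be pulled first with `ollama pull`:

```python
# Minimal sketch: chat with a model served locally by Ollama.
# Assumes Ollama is running on its default port and the example model
# has already been pulled (e.g. `ollama pull llama3.2`).
import requests


def local_chat(prompt: str, model: str = "llama3.2") -> str:
    resp = requests.post(
        "http://localhost:11434/api/chat",  # Ollama's local API; no cloud call
        json={
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            "stream": False,
        },
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["message"]["content"]


print(local_chat("Summarize what a vector database is in two sentences."))
```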
5. Privacy isn’t a feature, it’s the architecture
Most AI products treat privacy like a setting you can toggle. But true data ownership means your data never leaves the device in the first place.
No outbound calls. No token logs. No inference history sitting on someone else’s server.
On-device AI flips the entire risk model. It shrinks the attack surface to your own machine, where you can audit what runs and how it behaves.
That’s why it’s resonating with teams in finance, healthcare, legal, and other regulated industries. This isn’t just about passing GDPR or CCPA—it’s about building infrastructure that never leaks in the first place.
6. Cloud costs vs. fixed compute
Running large models in the cloud isn’t just expensive—it’s unpredictable. You’re charged per token, per user, per request, often without a clear way to budget or forecast usage.
With on-device AI, you pay once for the compute, and inference is free after that. The cost model becomes architectural, not transactional. And your costs stay flat as usage grows, instead of climbing with every request.
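A back-of-envelope sketch makes the difference obvious. Every number below is an illustrative assumption, not a real quote:

```python
# Back-of-envelope comparison of the two cost models.
# All figures are illustrative assumptions.
requests_per_user_per_day = 50
tokens_per_request = 1_500
users = 10_000
days = 365

# Cloud: you pay per token, forever.
cloud_price_per_1k_tokens = 0.002  # assumed blended $/1K tokens
cloud_cost = (
    users * requests_per_user_per_day * days
    * tokens_per_request / 1_000
    * cloud_price_per_1k_tokens
)

# On-device: inference runs on hardware the user already owns,
# so the marginal cost per request is ~0.
on_device_marginal_cost = 0.0

print(f"Cloud inference bill for one year: ${cloud_cost:,.0f}")
print(f"On-device marginal inference cost: ${on_device_marginal_cost:,.0f}")
```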
The energy savings are huge, too. Consider this:
| Model Size | Setup | CO₂ per 1M Tokens |
|---|---|---|
| 77M (local CPU) | Your laptop | 0.14 g |
| 70B (cloud GPU) | 20× A100s | 400–530 g |
| GPT-4-class (MoE) | H100 cluster | 900–1,300 g |
That's well over 1,000× more energy to get the same answer in the cloud (400 g ÷ 0.14 g ≈ 2,900×, even at the low end). When multiplied across millions of users, the difference becomes systemic.
7. Where on-device AI shines
Not every AI task needs to be on-device. But many of the ones users care about the most do:
- Voice transcription
- Summarization
- Local memory and search
- Language translation
- Keyboard suggestions
- Code snippet enhancement
These don’t need 175B parameters. They need speed, precision, and trust—the kind that comes from running close to the user.
And with energy, latency, and privacy now being design constraints—not afterthoughts—on-device AI becomes not just viable, but inevitable.
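As one concrete example, here is voice transcription running fully on-device with the open-source Whisper models; an illustrative stack, not necessarily the one shipping inside Pieces:

```python
# Voice transcription, fully on-device, using the open-source Whisper models.
# Nothing here touches the network after the weights are downloaded once.
import whisper

model = whisper.load_model("base")        # ~74M parameters, runs on CPU
result = model.transcribe("meeting.wav")  # local file, local inference
print(result["text"])
```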
8. So what now?
This isn’t about keeping up with the cloud. It’s about choosing a different path, one where AI is embedded, not streamed.
- Where memory is local.
- Where costs don’t scale with usage.
- Where you know what your model is doing, and where.
The moment we started asking real questions—
“Do we need GPT-4 to detect a URL?”
“Why send calendar data to the cloud to get a reminder?”
—the answer became obvious.
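The first question really is that simple; a few lines of local code handle it (simplified pattern, for illustration only):

```python
# You don't need a frontier model to detect a URL; local code does it
# in microseconds. Simplified pattern for illustration.
import re

URL_RE = re.compile(r"https?://[^\s)>\]]+", re.IGNORECASE)


def find_urls(text: str) -> list[str]:
    return URL_RE.findall(text)


print(find_urls("Docs live at https://example.com/docs and the repo is on GitHub."))
```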
We didn’t just optimize our pipeline. We changed the foundation.
And we’re building from there.
If you’re curious how Pieces works locally, let’s talk.
Article written by Tsavo, CEO of Pieces, and fully accessible here.