DEV Community

Ako Wu

How I Shipped a 3-Model On-Device ASR Pipeline on a Phone in 2 Months with Claude Code

Most translation apps send your voice to a server and wait. I built one that runs entirely on your phone — speech recognition, translation, OCR, even real-time Bluetooth conversation — all offline, no cloud. Here's what I learned along the way.

Why I Built This

My mother speaks Cantonese. Our domestic helper speaks Tagalog. Every day they need to communicate, and neither speaks the other's language.

Google Translate needs WiFi and feels clunky for natural conversation. I wanted two phones to translate a live conversation over Bluetooth — no internet, no cloud, just talk.

That constraint — everything must run locally on consumer hardware — shaped every decision I made.

The Challenges Nobody Warns You About

1. One model doesn't fit all languages

I assumed I could ship one speech recognition model and call it done. Wrong.

English-first models are great for European languages but struggle with tonal languages like Cantonese and Vietnamese. Models tuned for Asian languages have worse English. Small models give you fast startup but lower accuracy.

I ended up shipping multiple models with automatic routing per language. The user never sees this — they just speak and it works. But getting the routing logic right, handling model loading and unloading, and making the transitions seamless took far longer than I expected.
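The routing idea can be sketched in a few lines. This is only an illustration, written in Python for brevity rather than the app's Dart/Kotlin/Swift. The model names, the language groupings, and the `low_memory` switch are all assumptions; the post doesn't say which models ship or how languages map to them.

```python
# Hypothetical model identifiers; the real models are not named in the post.
GENERAL_MODEL = "asr-general"
TONAL_MODEL = "asr-tonal"
SMALL_MODEL = "asr-small"

# Assumed routing table keyed by the primary language subtag.
ROUTES = {
    "yue": TONAL_MODEL,  # Cantonese
    "vi": TONAL_MODEL,   # Vietnamese
    "en": GENERAL_MODEL,
    "es": GENERAL_MODEL,
}


def route_model(language_tag: str, low_memory: bool = False) -> str:
    """Pick a model id for a tag such as 'yue-HK' or 'en'."""
    if low_memory:
        # Assumption: constrained devices always get the small model.
        return SMALL_MODEL
    primary = language_tag.split("-")[0].lower()
    # Languages without an explicit route fall back to the general model.
    return ROUTES.get(primary, GENERAL_MODEL)
```

Keeping the table as plain data means adding a language is a one-line change, and the caller never needs to know which model it got.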

2. Memory management is the real boss fight

Running ML models on a phone isn't like running them on a server with 64GB of RAM. Speech recognition and translation models together need gigabytes of memory. On a mid-range phone, that's most of what's available.

Lessons learned the hard way:

  • Lazy-load everything. Don't pre-load models the user hasn't needed yet.
  • Evict aggressively. If the user switches from voice to camera, swap models.
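  • iOS has a hard per-app memory ceiling. If you exceed it, the OS kills your app silently: no exception, no error dialog, just gone. A memory warning may arrive first, but a single large model allocation can blow straight past it.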
  • Test on budget phones early. My development phone is high-end. The first time I ran on a 3GB RAM device, everything crashed. That was a painful week.
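Here is a rough sketch of the first two lessons, lazy loading plus aggressive eviction, as a small cache with a memory budget. It is Python for illustration only. The `ModelCache` class, the sizes in megabytes, and the least-recently-used policy are my assumptions for the example, not a description of the app's actual code.

```python
from collections import OrderedDict
from typing import Callable, List


class ModelCache:
    """Loads models on first use and evicts the least recently used
    ones so the declared sizes never exceed the budget."""

    def __init__(self, budget_mb: int) -> None:
        self.budget_mb = budget_mb
        self._loaded = OrderedDict()  # name -> (model, size_mb)

    def used_mb(self) -> int:
        return sum(size for _, size in self._loaded.values())

    def loaded(self) -> List[str]:
        return list(self._loaded)

    def get(self, name: str, size_mb: int, loader: Callable[[], object]) -> object:
        if name in self._loaded:
            self._loaded.move_to_end(name)  # mark as most recently used
            return self._loaded[name][0]
        if size_mb > self.budget_mb:
            raise MemoryError(f"{name} needs {size_mb} MB, budget is {self.budget_mb} MB")
        # Evict BEFORE loading so peak usage stays under the budget.
        # On a device, eviction would also release the native model handle.
        while self._loaded and self.used_mb() + size_mb > self.budget_mb:
            self._loaded.popitem(last=False)
        model = loader()  # lazy: only runs on a cache miss
        self._loaded[name] = (model, size_mb)
        return model
```

The important detail is the ordering: free memory first, then load. Loading first and evicting afterwards briefly holds both models, which is exactly the spike that gets an app killed.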

3. Bluetooth is not what you think

I wanted phones to translate conversations in real time over Bluetooth. The first thing I learned: Bluetooth Low Energy has tiny bandwidth. Streaming raw audio over a standard BLE data connection is not practical.

The breakthrough was realizing I didn't need to. Each phone handles all the heavy processing locally. Only the final translated text — a few hundred bytes — crosses the Bluetooth link. Latency ends up being better than cloud-based translation because there's no network round-trip.
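To show how little needs to cross the link, here is a minimal framing sketch in Python. The two-byte chunk header and the `lang` + newline + text layout are made up for this example; the post doesn't describe the real protocol. The default of 20 bytes per chunk assumes an un-negotiated BLE connection, where the default ATT MTU of 23 bytes leaves 20 bytes of payload per write; a negotiated MTU would allow larger chunks.

```python
from typing import Iterable, List, Tuple


def encode_message(lang: str, text: str, chunk_size: int = 20) -> List[bytes]:
    """Split a short text message into chunks of at most chunk_size bytes.

    Hypothetical layout: byte 0 = chunk index, byte 1 = total chunks,
    remaining bytes = a slice of the UTF-8 body "lang\\ntext".
    """
    body = f"{lang}\n{text}".encode("utf-8")
    room = chunk_size - 2  # space left after the two header bytes
    if room <= 0:
        raise ValueError("chunk_size must be at least 3")
    parts = [body[i:i + room] for i in range(0, len(body), room)]
    if len(parts) > 255:
        raise ValueError("message too long for a one-byte chunk count")
    return [bytes([i, len(parts)]) + part for i, part in enumerate(parts)]


def decode_message(chunks: Iterable[bytes]) -> Tuple[str, str]:
    """Reassemble chunks (in any order) into (lang, text)."""
    ordered = sorted(chunks, key=lambda c: c[0])
    if not ordered or len(ordered) != ordered[0][1]:
        raise ValueError("missing chunks")
    # Join the raw bytes first; a chunk boundary may split a UTF-8 character.
    body = b"".join(c[2:] for c in ordered)
    lang, text = body.decode("utf-8").split("\n", 1)
    return lang, text
```

A sentence of Cantonese fits in a handful of chunks, which is why the link never becomes the bottleneck.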

This architecture also means it scales to groups naturally. Multiple phones, each doing their own processing, each showing the translation in their own language.

4. Battery will humble you

Running neural networks continuously drains the battery fast. The key insight: use a tiny, cheap voice activity detection (VAD) model to tell speech from silence. The expensive speech recognition model runs only while someone is actually talking. During pauses, which make up much of a conversation, the phone is barely working.
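The gating pattern looks roughly like this in Python. To keep the example self-contained, a simple energy threshold stands in for the voice activity detection model; that swap is mine, not what the app uses. The `recognize` callback, the threshold, and the `hangover` value are placeholders too.

```python
import math
from typing import Callable, Iterable, List, Sequence

Frame = Sequence[float]


def rms(frame: Frame) -> float:
    """Root-mean-square energy of one audio frame."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def gated_transcribe(
    frames: Iterable[Frame],
    recognize: Callable[[List[Frame]], str],
    threshold: float = 0.02,
    hangover: int = 2,
) -> List[str]:
    """Call the expensive recognizer only on stretches of speech.

    Up to `hangover` consecutive quiet frames are tolerated inside an
    utterance so short pauses do not split it. Quiet frames are dropped.
    """
    results: List[str] = []
    segment: List[Frame] = []
    silent_run = 0
    for frame in frames:
        if rms(frame) >= threshold:  # cheap check, runs on every frame
            segment.append(frame)
            silent_run = 0
        elif segment:
            silent_run += 1
            if silent_run > hangover:
                results.append(recognize(segment))  # expensive call
                segment = []
                silent_run = 0
    if segment:
        results.append(recognize(segment))
    return results
```

The cheap check runs on every frame, while the expensive call happens once per utterance. That asymmetry is where the battery savings come from.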

5. Camera OCR has hidden traps

Two bugs that cost me days:

On iOS, the camera sensor gives you raw data in one orientation, but the display framework bakes in EXIF rotation. Result: my translation overlay boxes appeared rotated 90 degrees. Looked fine in the simulator, broken on real devices.
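The fix for this kind of bug boils down to mapping each box from the raw buffer's coordinate space into the upright image's space. Below is a small Python sketch of that mapping for the four right-angle rotations. The function name and the (left, top, right, bottom) box layout are my own choices for the example, and which rotation a given device needs depends on the platform and orientation, so treat the angle as an input rather than a constant.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # left, top, right, bottom


def rotate_box(box: Box, width: float, height: float, degrees_cw: int) -> Box:
    """Map a box from a width x height image into the coordinate space
    of the same image rotated clockwise by 0, 90, 180, or 270 degrees."""
    left, top, right, bottom = box
    turn = degrees_cw % 360
    if turn == 0:
        return (left, top, right, bottom)
    if turn == 90:
        # A point (x, y) lands at (height - y, x); the image becomes height x width.
        return (height - bottom, left, height - top, right)
    if turn == 180:
        # A point (x, y) lands at (width - x, height - y).
        return (width - right, height - bottom, width - left, height - top)
    if turn == 270:
        # A point (x, y) lands at (y, width - x); the image becomes height x width.
        return (top, width - right, bottom, width - left)
    raise ValueError("rotation must be a multiple of 90 degrees")
```

For example, a unit box in the top-left corner of a 4 x 2 image ends up in the top-right corner of the 2 x 4 image after a 90-degree clockwise turn, which matches what you see when you physically rotate a photo.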

Also on iOS: only the Latin text recognition model is included by default. Chinese, Japanese, Korean, and Devanagari scripts silently fail — no error, just no results. Android handles this automatically. Finding this took a full day of "why does it work on Android but not iPhone?"

Building with Claude Code

I built the app, Traverba, in 2 months as a solo developer, using Claude Code as my coding agent.

What Claude Code was great at:

  • Bridging Dart to native platform code (Kotlin/Swift) — the boilerplate is repetitive and error-prone
  • Generating localization files across 117 locales
  • Debugging platform-specific quirks (like the EXIF orientation bug)
  • Refactoring and keeping code consistent across two native platforms

What I had to do myself:

  • Architecture decisions — which models, how to route between them, memory management strategy
  • ML integration and optimization — getting models to run efficiently on mobile hardware
  • The Bluetooth protocol design
  • Testing on actual devices in real-world conditions

Claude Code didn't replace my engineering judgment, but it multiplied my output dramatically. Things that would have taken a week took a day. As a solo dev, that's the difference between shipping and not shipping.

What I'd Do Differently

Start with one model. Validate demand first, then add complexity. Three models at launch tripled my testing surface.

Test on budget phones from day one. Not week six.

Ship two-phone Bluetooth first. I built the full group mesh before validating that anyone wanted group translation. Two-phone mode would have been enough for launch.

The Result

Traverba supports 100+ languages with Cantonese as a first-class language — not buried under "Chinese (Traditional)" like every other translator. Voice, camera, screen, and text translation all run offline. The Bluetooth conversation mode lets multiple people have a real-time translated conversation with no internet at all.

The first time my mum and our helper used it, my mum laughed and said "finally I can tell her exactly what I mean."

Free to download. No signup required.

https://www.traverba.com

If you've tackled on-device ML or mobile performance challenges, I'd love to hear your war stories in the comments.
