I'm one of the founders of Nataris. We built a P2P inference marketplace where Android phones run open-weight AI models and serve developer API requests. Phone owners earn per token. Developers get an OpenAI-compatible API.
This is a writeup of the genuinely hard technical problems we ran into — not a product pitch.
The idea
There are currently two ways to run AI inference: on your own machine locally, or through a big company's datacenter. We wanted to build the third option — inference that runs on real people's Android phones, accessible via a standard API, where the phone owners get paid.
The privacy properties fall out naturally: inference never touches a server we own. No prompt logging. No content filtering. No model training on queries.
Problem 1: llama.cpp OOM crashes on mobile
llama.cpp reads context_length from GGUF metadata and allocates the full KV cache upfront.
Llama 3.2 1B ships with a 131K context window — that's ~4GB of KV cache on a phone with maybe 2GB free RAM. Instant OOM, app killed by Android.
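The ~4GB figure checks out from the model's published shape. A quick back-of-envelope in Python (assuming Llama 3.2 1B's config: 16 layers, 8 KV heads with GQA, head_dim 64, and an fp16 cache, which is llama.cpp's default):

```python
# KV cache size = 2 (K and V) * layers * context * kv_heads * head_dim * bytes/elem
layers, kv_heads, head_dim = 16, 8, 64
ctx, bytes_per_elem = 131_072, 2  # full 131K context, fp16

kv_bytes = 2 * layers * ctx * kv_heads * head_dim * bytes_per_elem
print(kv_bytes / 2**30)  # 4.0 GiB
```

Capping context to 4096 shrinks this by 32x, to 128 MiB, which is why the patch below makes the model loadable at all.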
Our fix: binary-patch the GGUF metadata after download to cap context_length before the model loads. Caps: 7B→1024, 3B→2048, ≤1B→4096. The patch is idempotent — runs on download and on every app startup as a safety net.
Later, the RunAnywhere SDK we use added native adaptive context sizing in v0.19.6, so the C++ layer now handles this automatically. We kept the GGUF patcher as defense-in-depth.
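A simplified sketch of the patch idea, in Python for brevity. Real GGUF keys are length-prefixed inside a typed metadata table, so a production patcher should walk that table properly; this version byte-searches for the key and verifies the following type tag is GGUF's UINT32 (type id 4) before overwriting the value in place:

```python
import struct

GGUF_UINT32 = 4  # GGUF metadata value-type id for uint32

def patch_context_length(data: bytearray, arch: str, cap: int) -> bool:
    """Cap `<arch>.context_length` in raw GGUF bytes.

    Simplified: finds the key by byte-search instead of walking the
    metadata KV table. Returns True if a patch was applied."""
    key = f"{arch}.context_length".encode()
    idx = data.find(key)
    if idx < 0:
        return False
    off = idx + len(key)
    vtype, value = struct.unpack_from("<II", data, off)
    if vtype != GGUF_UINT32 or value <= cap:
        return False  # wrong type, or already capped -> idempotent
    struct.pack_into("<I", data, off + 4, cap)
    return True
```

The `value <= cap` guard is what makes re-running it on every app startup safe: a second pass is a no-op.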
Problem 2: Routing to mobile is nothing like routing to GPUs
Standard inference routing assumes homogeneous hardware — you pick the least-loaded instance.
Mobile is completely different. Every device has different thermal state, available RAM, battery level, SoC performance, and model warm state.
Our scoring function weights:
- Thermal state — `nominal=1.0, fair=0.3, serious=0.1`. A hot phone gets deprioritized hard.
- Available RAM vs model requirement — a gate, not a score. If a device can't fit the model, it's excluded entirely.
- Model warm state — if the model is already loaded in RAM from a recent job, +0.20 bonus. Cold start for Llama 1B on a mid-range phone is 15-30s. Warm inference is 2-5s.
- LRU rotation — a Redis key `last_assigned:{deviceId}` with a 24h TTL ensures true round-robin across devices rather than one device eating all traffic.
- Load spread band — we pick randomly among devices scoring within 60% of the top score, not just the top device. This prevents a single high-reputation device from monopolizing jobs.
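The gate, the thermal weighting, the warm bonus, and the spread band compose like this (a Python sketch with hypothetical field names; the Redis LRU component is omitted):

```python
import random

THERMAL_WEIGHT = {"nominal": 1.0, "fair": 0.3, "serious": 0.1}
WARM_BONUS = 0.20
SPREAD_BAND = 0.60  # consider devices within 60% of the top score

def score(device, model_ram_mb):
    # Hard gate: the model must fit in the device's free RAM.
    if device["free_ram_mb"] < model_ram_mb:
        return None
    s = THERMAL_WEIGHT.get(device["thermal"], 0.0)
    if device["model_warm"]:
        s += WARM_BONUS  # skip a 15-30s cold start if possible
    return s

def pick_device(devices, model_ram_mb, rng=random):
    scored = [(s, d) for d in devices
              if (s := score(d, model_ram_mb)) is not None]
    if not scored:
        return None
    top = max(s for s, _ in scored)
    # Random pick inside the band, not argmax, so one high-scoring
    # device can't monopolize traffic.
    band = [d for s, d in scored if s >= top * SPREAD_BAND]
    return rng.choice(band)
```

Note the asymmetry: RAM excludes, everything else only re-ranks. A device that would OOM is never a candidate, no matter how cool or warm it is.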
Problem 3: OEM battery savers silently kill WebSocket connections
This one cost us weeks. Android OEM battery optimizers (MIUI, ColorOS, OnePlus, Huawei) kill background processes aggressively. The WebSocket connection to our backend drops. The device still appears ONLINE in our system because the last heartbeat was recent. Jobs get assigned to a dead connection and time out 3 minutes later.
Fixes we layered:
- Inbound-message liveness watchdog — if no message received from backend in 180s, force reconnect. Catches ghost sessions.
- WorkManager 15-min safety net — even if the foreground service is killed, WorkManager reschedules a reconnect attempt.
- OEM-specific autostart deep links — on first launch, we detect the manufacturer and open the exact battery settings screen for MIUI / ColorOS / OnePlus / Vivo / Huawei with instructions to whitelist the app.
- `NET_CAPABILITY_VALIDATED` check — verify the network connection is actually internet-capable before attempting a reconnect, not just connected.
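The first layer is the key one: liveness is judged by messages *received*, not by heartbeats we send, because a half-dead socket happily accepts writes. A minimal sketch of that watchdog (Python with an injectable clock for testability; the real thing lives in the Android service):

```python
import time

WATCHDOG_TIMEOUT_S = 180  # matches the 180s inbound-message window

class LivenessWatchdog:
    """Force a reconnect if nothing has been *received* for the timeout.

    Outbound heartbeats don't count -- only inbound messages prove the
    backend can still reach us, which is what catches ghost sessions."""

    def __init__(self, reconnect, clock=time.monotonic):
        self._reconnect = reconnect
        self._clock = clock
        self._last_rx = clock()

    def on_message(self):
        # Any inbound frame (job, ack, ping) resets the timer.
        self._last_rx = self._clock()

    def tick(self):
        # Called periodically, e.g. from the foreground service loop.
        if self._clock() - self._last_rx > WATCHDOG_TIMEOUT_S:
            self._reconnect()
            self._last_rx = self._clock()
```

Using a monotonic clock matters on mobile: wall-clock time can jump (NTP sync, timezone change) and would spuriously fire or starve the watchdog.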
Problem 4: SDK init must run on the Android main thread
We were initializing the RunAnywhere SDK on a background coroutine. Got a SIGSEGV in racModelRegistrySave on Play Store installs — not on debug builds, not reliably reproducible, just occasional crashes in production.
Root cause: the native JNI library requires initialization on the main thread. Standard Android JNI constraint, but the SDK docs didn't spell it out. Fix: wrap all SDK init calls in withContext(Dispatchers.Main).
Problem 5: JNI parameter count mismatch → SIGABRT
When we added a new parameter (supportsLora: Boolean) to a Kotlin external fun declaration, we didn't realize the pre-built native .so we were using hadn't been updated to match. The Kotlin compiler doesn't catch this — it happily generates the JNI call with the extra parameter.
At runtime on Android 10: SIGABRT. On Android 13+: silent stack corruption. No compile error, no lint warning. We spent days across three physical devices before realizing the .so parameter count didn't match the Kotlin declaration.
Fix: removed the parameter from the Kotlin declaration until we had a matching .so. Now we verify parameter counts by disassembling the .so before any JNI signature change.
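One cheap guardrail we could have had in CI from day one: diff the arity of every Kotlin `external fun` against a manifest of arities recorded when the .so was built. A naive Python sketch (the regex ignores default values and generic types with commas, and the native manifest format is a hypothetical checked-in file, not something the toolchain produces):

```python
import re

EXTERNAL_FUN = re.compile(r"external\s+fun\s+(\w+)\s*\(([^)]*)\)")

def kotlin_jni_arity(source: str) -> dict:
    """Map each `external fun` name in Kotlin source to its parameter count."""
    out = {}
    for name, params in EXTERNAL_FUN.findall(source):
        params = params.strip()
        out[name] = 0 if not params else params.count(",") + 1
    return out

def check_against_native(kotlin_arity: dict, native_arity: dict) -> dict:
    """Return {name: (kotlin_count, native_count)} for every mismatch,
    including externals the native manifest doesn't know about."""
    return {n: (k, native_arity.get(n))
            for n, k in kotlin_arity.items()
            if native_arity.get(n) != k}
```

It would have flagged the `supportsLora` addition the moment it was committed, instead of surfacing as a SIGABRT on one Android version and silent stack corruption on another.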
Where we are
- Qwen 2.5 0.5B (~5s latency) and Llama 3.2 1B (~15-20s latency)
- 21 provider devices on the network
- 2,775 inference jobs completed, 350K+ tokens processed
- OpenAI-compatible API — works anywhere you can set a custom base URL
The latency is real — these are mobile phones, not GPUs. We're not trying to compete on speed. The value prop is privacy and the P2P model (85% of inference fees go to phone owners).
Try it
API docs: https://api.nataris.ai/docs
$5 free credits, no card needed: https://nataris.ai
Provider app (Android, earn by running models): https://play.google.com/store/apps/details?id=ai.nataris.app
Happy to go deeper on any of this in the comments — the routing algorithm, the GGUF patching, the JNI debugging process, or the economics of the P2P model.