<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Vishal Sharma</title>
    <description>The latest articles on DEV Community by Vishal Sharma (@vishal_sharma_nataris).</description>
    <link>https://dev.to/vishal_sharma_nataris</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3895690%2F1696eeaf-9be4-4c5d-b941-c9e9fbc285eb.png</url>
      <title>DEV Community: Vishal Sharma</title>
      <link>https://dev.to/vishal_sharma_nataris</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/vishal_sharma_nataris"/>
    <language>en</language>
    <item>
      <title>We built a P2P AI inference network that runs on Android phones — here's what we learned</title>
      <dc:creator>Vishal Sharma</dc:creator>
      <pubDate>Fri, 24 Apr 2026 09:15:40 +0000</pubDate>
      <link>https://dev.to/vishal_sharma_nataris/we-built-a-p2p-ai-inference-network-that-runs-on-android-phones-heres-what-we-learned-3lhb</link>
      <guid>https://dev.to/vishal_sharma_nataris/we-built-a-p2p-ai-inference-network-that-runs-on-android-phones-heres-what-we-learned-3lhb</guid>
      <description>&lt;p&gt;I'm one of the founders of Nataris. We built a P2P inference marketplace where Android phones run open-weight AI models and serve developer API requests. Phone owners earn per token. Developers get an OpenAI-compatible API.&lt;/p&gt;

&lt;p&gt;This is a writeup of the genuinely hard technical problems we ran into — not a product pitch.&lt;/p&gt;




&lt;h2&gt;The idea&lt;/h2&gt;

&lt;p&gt;There are currently two mainstream ways to run AI inference: on your own machine locally, or through a big company's datacenter. We wanted to build a third option — inference that runs on real people's Android phones, accessible via a standard API, where the phone owners get paid.&lt;/p&gt;

&lt;p&gt;The privacy properties fall out naturally: inference never touches a server we own. No prompt logging. No content filtering. No model training on queries.&lt;/p&gt;




&lt;h2&gt;Problem 1: llama.cpp OOM crashes on mobile&lt;/h2&gt;

&lt;p&gt;llama.cpp reads &lt;code&gt;context_length&lt;/code&gt; from GGUF metadata and allocates the full KV cache upfront. &lt;br&gt;
Llama 3.2 1B ships with a 131K context window — that's ~4GB of KV cache on a phone with maybe 2GB free RAM. Instant OOM, app killed by Android.&lt;/p&gt;
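&lt;p&gt;The ~4GB figure checks out if you work through the arithmetic. A sketch, assuming Llama 3.2 1B's published shape (16 layers, 8 KV heads via GQA, head dim 64) and fp16 cache entries:&lt;/p&gt;

```python
# KV cache bytes = 2 (K and V) * layers * kv_heads * head_dim * ctx_len * bytes/elem
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elem=2):
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem

full = kv_cache_bytes(16, 8, 64, 131072)  # 4,294,967,296 bytes, i.e. 4 GiB
capped = kv_cache_bytes(16, 8, 64, 4096)  # 134,217,728 bytes, i.e. 128 MiB
```

&lt;p&gt;Capping the context at 4096 shrinks the cache 32x, which is the difference between instant OOM and fitting in a phone's spare RAM.&lt;/p&gt;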

&lt;p&gt;Our fix: binary-patch the GGUF metadata after download to cap context_length before the model loads. Caps: 7B→1024, 3B→2048, ≤1B→4096. The patch is idempotent — runs on download and on every app startup as a safety net.&lt;/p&gt;
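&lt;p&gt;A minimal sketch of the patcher, with a caveat: this is simplified and hypothetical. It scans for the metadata key instead of walking the full GGUF key/value table, and assumes the value is a little-endian uint32 (GGUF value-type tag 4) stored right after the key bytes and a 4-byte type tag:&lt;/p&gt;

```python
# Simplified sketch of a GGUF context-length cap. Not production code:
# a real patcher should walk the GGUF metadata table rather than scan.
CTX_KEY = b"llama.context_length"
GGUF_TYPE_UINT32 = 4  # GGUF metadata value-type enum for uint32

def cap_context_length(data: bytearray, cap: int) -> bool:
    """Rewrite context_length in-place if it exceeds `cap`.

    Idempotent: returns False (no change) when the stored value is
    already at or below the cap, so it is safe on every app startup.
    """
    key_at = data.find(CTX_KEY)
    if key_at == -1:
        return False
    type_at = key_at + len(CTX_KEY)
    vtype = int.from_bytes(data[type_at:type_at + 4], "little")
    if vtype != GGUF_TYPE_UINT32:
        return False
    val_at = type_at + 4
    current = int.from_bytes(data[val_at:val_at + 4], "little")
    if cap >= current:
        return False
    data[val_at:val_at + 4] = cap.to_bytes(4, "little")
    return True
```

&lt;p&gt;The early return when the stored value is already at or below the cap is what makes the patch idempotent across repeated startups.&lt;/p&gt;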

&lt;p&gt;Later, the RunAnywhere SDK we use added native adaptive context sizing in v0.19.6, so the C++ layer now handles this automatically. We kept the GGUF patcher as defense-in-depth.&lt;/p&gt;




&lt;h2&gt;Problem 2: Routing to mobile is nothing like routing to GPUs&lt;/h2&gt;

&lt;p&gt;Standard inference routing assumes homogeneous hardware — you pick the least-loaded instance. &lt;br&gt;
Mobile is completely different. Every device has different thermal state, available RAM, battery level, SoC performance, and model warm state.&lt;/p&gt;

&lt;p&gt;Our scoring function weights:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Thermal state&lt;/strong&gt; — &lt;code&gt;nominal=1.0&lt;/code&gt;, &lt;code&gt;fair=0.3&lt;/code&gt;, &lt;code&gt;serious=0.1&lt;/code&gt;. A hot phone gets deprioritized hard.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Available RAM vs model requirement&lt;/strong&gt; — gate, not score. If a device can't fit the model, it's excluded entirely.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model warm state&lt;/strong&gt; — if the model is already loaded in RAM from a recent job, +0.20 bonus. Cold start for Llama 1B on a mid-range phone is 15-30s. Warm inference is 2-5s.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LRU rotation&lt;/strong&gt; — Redis key &lt;code&gt;last_assigned:{deviceId}&lt;/code&gt; with 24h TTL deprioritizes recently assigned devices, approximating round-robin so one device doesn't eat all the traffic.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Load spread band&lt;/strong&gt; — we pick randomly among devices scoring at least 60% of the top score, not just the top device. This prevents a single high-reputation device from monopolizing jobs.&lt;/li&gt;
&lt;/ul&gt;
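&lt;p&gt;Stripped of the Redis LRU piece, the selection logic can be sketched like this (the thermal weights, warm bonus, RAM gate, and 60% band come from the list above; the device dict shape is hypothetical):&lt;/p&gt;

```python
import random

# Weights from the post; the rest is a hypothetical simplification.
THERMAL_WEIGHT = {"nominal": 1.0, "fair": 0.3, "serious": 0.1}
WARM_BONUS = 0.20
SPREAD_BAND = 0.6

def score(device, model_ram_mb):
    # RAM is a gate, not a score: exclude devices that cannot fit the model.
    if model_ram_mb > device["free_ram_mb"]:
        return None
    s = THERMAL_WEIGHT.get(device["thermal"], 0.1)
    if device["warm_model"]:
        s += WARM_BONUS  # model already resident in RAM from a recent job
    return s

def pick_device(devices, model_ram_mb):
    scored = [(score(d, model_ram_mb), d) for d in devices]
    eligible = [(s, d) for s, d in scored if s is not None]
    if not eligible:
        return None
    top = max(s for s, _ in eligible)
    # Load-spread band: choose randomly among devices at or above 60% of top.
    band = [d for s, d in eligible if s >= SPREAD_BAND * top]
    return random.choice(band)
```

&lt;p&gt;Randomizing within the band, rather than always taking the top scorer, is what prevents a single high-reputation device from monopolizing jobs.&lt;/p&gt;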




&lt;h2&gt;Problem 3: OEM battery savers silently kill WebSocket connections&lt;/h2&gt;

&lt;p&gt;This one cost us weeks. Android OEM battery optimizers (MIUI, ColorOS, OnePlus, Huawei) kill background processes aggressively. The WebSocket connection to our backend drops. The device still appears ONLINE in our system because the last heartbeat was recent. Jobs get assigned to a dead connection and time out 3 minutes later.&lt;/p&gt;

&lt;p&gt;Fixes we layered:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Inbound-message liveness watchdog&lt;/strong&gt; — if no message received from backend in 180s, force reconnect. Catches ghost sessions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;WorkManager 15-min safety net&lt;/strong&gt; — even if the foreground service is killed, WorkManager reschedules a reconnect attempt.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OEM-specific autostart deep links&lt;/strong&gt; — on first launch, we detect the manufacturer and open the exact battery settings screen for MIUI / ColorOS / OnePlus / Vivo / Huawei with instructions to whitelist the app.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;NET_CAPABILITY_VALIDATED&lt;/code&gt; check&lt;/strong&gt; — verify the network connection is actually internet-capable before attempting a reconnect, not just connected.&lt;/li&gt;
&lt;/ul&gt;
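&lt;p&gt;The liveness watchdog is the piece that generalizes beyond Android, so here's a small sketch of it (the 180s window is the one above; the class shape and injectable clock are hypothetical, for illustration):&lt;/p&gt;

```python
import time

LIVENESS_TIMEOUT_S = 180

class LivenessWatchdog:
    """Force a reconnect when no inbound message arrives within the window."""

    def __init__(self, timeout_s=LIVENESS_TIMEOUT_S, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock            # injectable for testing
        self.last_inbound = clock()

    def on_message(self):
        # Any inbound frame counts as liveness, not just pong replies.
        self.last_inbound = self.clock()

    def should_reconnect(self):
        return self.clock() - self.last_inbound > self.timeout_s
```

&lt;p&gt;Keying on &lt;em&gt;inbound&lt;/em&gt; messages matters: a half-open socket will happily accept outbound pings long after the other side is gone, which is exactly the ghost-session failure mode above.&lt;/p&gt;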




&lt;h2&gt;Problem 4: SDK init must run on the Android main thread&lt;/h2&gt;

&lt;p&gt;We were initializing the RunAnywhere SDK on a background coroutine. Got a SIGSEGV in &lt;code&gt;racModelRegistrySave&lt;/code&gt; on Play Store installs — not on debug builds, not reliably reproducible, just occasional crashes in production.&lt;/p&gt;

&lt;p&gt;Root cause: the SDK's native JNI library requires initialization on the main thread, a common constraint for Android native libraries that the SDK docs didn't spell out. Fix: wrap all SDK init calls in &lt;code&gt;withContext(Dispatchers.Main)&lt;/code&gt;.&lt;/p&gt;




&lt;h2&gt;Problem 5: JNI parameter count mismatch → SIGABRT&lt;/h2&gt;

&lt;p&gt;When we added a new parameter (&lt;code&gt;supportsLora: Boolean&lt;/code&gt;) to a Kotlin &lt;code&gt;external fun&lt;/code&gt; declaration, we didn't realize the pre-built native &lt;code&gt;.so&lt;/code&gt; we were using hadn't been updated to match. The Kotlin compiler doesn't catch this — it happily generates the JNI call with the extra parameter.&lt;/p&gt;

&lt;p&gt;At runtime on Android 10: SIGABRT. On Android 13+: silent stack corruption. No compile error, no lint warning. We spent days debugging across three physical devices before realizing the &lt;code&gt;.so&lt;/code&gt; parameter count didn't match the Kotlin declaration.&lt;/p&gt;

&lt;p&gt;Fix: removed the parameter from the Kotlin declaration until we had a matching &lt;code&gt;.so&lt;/code&gt;. Now we verify parameter counts by disassembling the &lt;code&gt;.so&lt;/code&gt; before any JNI signature change.&lt;/p&gt;




&lt;h2&gt;Where we are&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Qwen 2.5 0.5B (~5s latency) and Llama 3.2 1B (~15-20s latency)&lt;/li&gt;
&lt;li&gt;21 provider devices on the network&lt;/li&gt;
&lt;li&gt;2,775 inference jobs completed, 350K+ tokens processed&lt;/li&gt;
&lt;li&gt;OpenAI-compatible API — works anywhere you can set a custom base URL&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The latency is real — these are mobile phones, not GPUs. We're not trying to compete on speed. The value prop is privacy and the P2P model (85% of inference fees go to phone owners).&lt;/p&gt;




&lt;h2&gt;Try it&lt;/h2&gt;

&lt;p&gt;API docs: &lt;a href="https://api.nataris.ai/docs" rel="noopener noreferrer"&gt;https://api.nataris.ai/docs&lt;/a&gt;&lt;br&gt;&lt;br&gt;
$5 free credits, no card needed: &lt;a href="https://nataris.ai" rel="noopener noreferrer"&gt;https://nataris.ai&lt;/a&gt;&lt;br&gt;&lt;br&gt;
Provider app (Android, earn by running models): &lt;a href="https://play.google.com/store/apps/details?id=ai.nataris.app" rel="noopener noreferrer"&gt;https://play.google.com/store/apps/details?id=ai.nataris.app&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Happy to go deeper on any of this in the comments — the routing algorithm, the GGUF patching, the JNI debugging process, or the economics of the P2P model.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>android</category>
      <category>llm</category>
    </item>
  </channel>
</rss>
