<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Srikar Phani Kumar Marti</title>
    <description>The latest articles on DEV Community by Srikar Phani Kumar Marti (@mspk97).</description>
    <link>https://dev.to/mspk97</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2519231%2F04ef8b8c-91fd-4ba1-a9b7-1e98f84baf6a.png</url>
      <title>DEV Community: Srikar Phani Kumar Marti</title>
      <link>https://dev.to/mspk97</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/mspk97"/>
    <language>en</language>
    <item>
      <title>I Ran AI Models Directly in the Browser and Measured What It Did to Core Web Vitals</title>
      <dc:creator>Srikar Phani Kumar Marti</dc:creator>
      <pubDate>Sun, 17 May 2026 07:37:49 +0000</pubDate>
      <link>https://dev.to/mspk97/i-ran-ai-models-directly-in-the-browser-and-measured-what-it-did-to-core-web-vitals-4adj</link>
      <guid>https://dev.to/mspk97/i-ran-ai-models-directly-in-the-browser-and-measured-what-it-did-to-core-web-vitals-4adj</guid>
      <description>&lt;p&gt;Everyone is shipping AI features. Sentiment analysis on user input, speech recognition without sending audio to a server, image classification that never leaves the device. The privacy pitch is real, the latency pitch is real. But nobody's asking the obvious question:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What does running a neural network in the browser actually cost the user?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I decided to find out. I built a benchmark harness, ran four quantized models in Chrome stable, and measured the impact on Core Web Vitals — specifically INP, the metric Google now uses to rank your site.&lt;/p&gt;

&lt;p&gt;Here's what I found.&lt;/p&gt;




&lt;h2&gt;The Setup&lt;/h2&gt;

&lt;p&gt;The test uses &lt;a href="https://huggingface.co/docs/transformers.js" rel="noopener noreferrer"&gt;Transformers.js&lt;/a&gt; — the library that lets you run Hugging Face models directly in the browser via WebAssembly. All models were loaded in INT8 quantized format (q8) to reflect real production conditions.&lt;/p&gt;
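
&lt;p&gt;For reference, this is roughly what loading one of the q8 models looks like with Transformers.js. The model ID shown is the standard Xenova conversion of the DistilBERT sentiment checkpoint and may differ from the exact ID the harness uses; &lt;code&gt;quantized: true&lt;/code&gt; is the library default and selects the INT8 ONNX weights:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;import { pipeline } from '@xenova/transformers';

// Downloads the INT8 (q8) ONNX weights and runs them through the WASM backend
const classifier = await pipeline(
  'sentiment-analysis',
  'Xenova/distilbert-base-uncased-finetuned-sst-2-english',
  { quantized: true }
);

const result = await classifier('Running models in the browser is surprisingly practical');
console.log(result);   // e.g. [{ label: 'POSITIVE', score: ... }]
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;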

&lt;p&gt;Four models, chosen to cover different architectures and modalities:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Params&lt;/th&gt;
&lt;th&gt;Task&lt;/th&gt;
&lt;th&gt;Architecture&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;DistilBERT&lt;/td&gt;
&lt;td&gt;66M&lt;/td&gt;
&lt;td&gt;Sentiment analysis&lt;/td&gt;
&lt;td&gt;Encoder (6 layers)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;BERT-base&lt;/td&gt;
&lt;td&gt;110M&lt;/td&gt;
&lt;td&gt;Feature extraction&lt;/td&gt;
&lt;td&gt;Encoder (12 layers)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Whisper Tiny&lt;/td&gt;
&lt;td&gt;39M&lt;/td&gt;
&lt;td&gt;Speech recognition&lt;/td&gt;
&lt;td&gt;Encoder-Decoder&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MobileViT-S&lt;/td&gt;
&lt;td&gt;5.7M&lt;/td&gt;
&lt;td&gt;Image classification&lt;/td&gt;
&lt;td&gt;Vision Transformer&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The benchmark harness is live at &lt;strong&gt;&lt;a href="https://benchmark.mspk.me" rel="noopener noreferrer"&gt;benchmark.mspk.me&lt;/a&gt;&lt;/strong&gt; and open source at &lt;strong&gt;&lt;a href="https://github.com/srikarphanikumar/cwv-ai-benchmark" rel="noopener noreferrer"&gt;github.com/srikarphanikumar/cwv-ai-benchmark&lt;/a&gt;&lt;/strong&gt;. Run it yourself.&lt;/p&gt;




&lt;h2&gt;What Is INP and Why Does It Matter?&lt;/h2&gt;

&lt;p&gt;INP (Interaction to Next Paint) replaced First Input Delay as Google's interactivity metric in March 2024. It measures how long it takes for the browser to respond to a user interaction — a click, a tap, a keypress — and paint the result.&lt;/p&gt;

&lt;p&gt;Google's thresholds:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ &lt;strong&gt;Good&lt;/strong&gt;: under 200ms&lt;/li&gt;
&lt;li&gt;⚠️ &lt;strong&gt;Needs Improvement&lt;/strong&gt;: 200–500ms&lt;/li&gt;
&lt;li&gt;❌ &lt;strong&gt;Poor&lt;/strong&gt;: over 500ms&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;INP affects your search ranking. More importantly, it affects whether users feel your app is responsive or broken.&lt;/p&gt;

&lt;p&gt;When you run neural network inference on the browser's main thread, you're blocking it. That means if a user clicks something while inference is running, their click won't be processed until the model finishes. That delay IS your INP.&lt;/p&gt;
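
&lt;p&gt;If you want to watch this happen in your own app, the &lt;a href="https://github.com/GoogleChrome/web-vitals" rel="noopener noreferrer"&gt;web-vitals&lt;/a&gt; library reports INP from real user interactions. A minimal sketch (the beacon endpoint is a placeholder):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;import { onINP } from 'web-vitals';

// Logs the page's INP as it updates; Google's thresholds: under 200ms good, over 500ms poor
onINP((metric) =&gt; {
  console.log(`INP so far: ${Math.round(metric.value)}ms (${metric.rating})`);
  // navigator.sendBeacon('/vitals', JSON.stringify(metric));   // placeholder endpoint
}, { reportAllChanges: true });
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;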




&lt;h2&gt;The Results&lt;/h2&gt;

&lt;p&gt;Here's the full table from Chrome stable on an Apple M-series MacBook Pro, 16GB RAM:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Load Time&lt;/th&gt;
&lt;th&gt;Avg Inference&lt;/th&gt;
&lt;th&gt;INP&lt;/th&gt;
&lt;th&gt;INP Class&lt;/th&gt;
&lt;th&gt;Mem Δ&lt;/th&gt;
&lt;th&gt;Mem Pressure&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;DistilBERT&lt;/td&gt;
&lt;td&gt;7.85s&lt;/td&gt;
&lt;td&gt;25.1ms ±0.5&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;27.8ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✅ Good&lt;/td&gt;
&lt;td&gt;+59.6MB&lt;/td&gt;
&lt;td&gt;2.5%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;BERT-base&lt;/td&gt;
&lt;td&gt;6.07s&lt;/td&gt;
&lt;td&gt;83.3ms ±1.5&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;85.0ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;⚠️ Needs Improvement&lt;/td&gt;
&lt;td&gt;+65.3MB&lt;/td&gt;
&lt;td&gt;4.1%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Whisper Tiny&lt;/td&gt;
&lt;td&gt;6.71s&lt;/td&gt;
&lt;td&gt;496.9ms ±6.2&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;540.3ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;❌ Poor&lt;/td&gt;
&lt;td&gt;+123.9MB&lt;/td&gt;
&lt;td&gt;7.1%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MobileViT-S&lt;/td&gt;
&lt;td&gt;1.15s&lt;/td&gt;
&lt;td&gt;66.7ms ±1.0&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;75.6ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;⚠️ Needs Improvement&lt;/td&gt;
&lt;td&gt;+37.0MB&lt;/td&gt;
&lt;td&gt;8.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;The Surprising Findings&lt;/h2&gt;

&lt;h3&gt;1. Parameter count doesn't predict INP&lt;/h3&gt;

&lt;p&gt;Whisper Tiny has only 39M parameters, far fewer than DistilBERT (66M) or BERT-base (110M). Yet it produces the worst INP at 540.3ms, more than 19x worse than DistilBERT.&lt;/p&gt;

&lt;p&gt;The culprit is architecture, not size. Whisper is an encoder-decoder model. It doesn't process the full input in a single forward pass — it runs an &lt;strong&gt;autoregressive decode loop&lt;/strong&gt;, generating output tokens one at a time. Each iteration blocks the main thread. The total blocking time accumulates regardless of how aggressively you quantize the weights.&lt;/p&gt;

&lt;p&gt;This means &lt;strong&gt;no amount of quantization will fix Whisper's INP on the main thread&lt;/strong&gt;. It's an architectural constraint, not a tuning problem.&lt;/p&gt;
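
&lt;p&gt;To make the mechanism concrete, here's a self-contained simulation of the pattern. The per-token cost and token count are illustrative numbers, not measurements from the harness:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Simulates an autoregressive decode loop: N synchronous steps with no yield in between.
// Quantization shrinks the per-step cost; it does not remove the loop.
const STEP_MS = 15;      // illustrative cost of one decode step in WASM
const NEW_TOKENS = 30;   // illustrative output length for a short utterance

function decodeStep() {
  const start = performance.now();
  while (performance.now() - start &lt; STEP_MS) { /* stand-in for synchronous WASM work */ }
}

const t0 = performance.now();
for (let i = 0; i &lt; NEW_TOKENS; i++) decodeStep();
console.log(`blocked the main thread for ${Math.round(performance.now() - t0)}ms`);
// 30 steps x 15ms is about 450ms in one uninterrupted chunk, the same order as Whisper Tiny's measured result
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;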

&lt;h3&gt;2. MobileViT-S loads 6x faster but still misses "Good"&lt;/h3&gt;

&lt;p&gt;MobileViT-S loads in 1.15s compared to 6–8 seconds for the text models. That's a huge UX win for initial load. But its INP of 75.6ms puts it in "Needs Improvement" territory despite having only 5.7M parameters.&lt;/p&gt;

&lt;p&gt;Vision transformer inference carries disproportionate cost relative to parameter count in WASM environments. Something to watch if you're building image classification features.&lt;/p&gt;

&lt;h3&gt;3. Memory pressure ≠ memory delta&lt;/h3&gt;

&lt;p&gt;MobileViT-S has the lowest absolute memory consumption (+37MB) but the &lt;strong&gt;highest memory pressure at 8.0%&lt;/strong&gt;. That 37MB represents a larger fraction of the available JS heap than you'd expect — with implications for mid-range Android devices where heap limits are much tighter.&lt;/p&gt;
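
&lt;p&gt;The harness's exact formula isn't reproduced here, but a reasonable approximation of memory pressure in Chrome is the used JS heap as a fraction of the heap limit. Note that &lt;code&gt;performance.memory&lt;/code&gt; is non-standard and Chrome-only:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Chrome-only, non-standard API; values are in bytes
const MB = 1024 * 1024;
function heapSnapshot() {
  const { usedJSHeapSize, jsHeapSizeLimit } = performance.memory;
  return { usedMB: usedJSHeapSize / MB, pressure: usedJSHeapSize / jsHeapSizeLimit };
}

const before = heapSnapshot();
// ... load the model and run inference here ...
const after = heapSnapshot();
console.log(`delta ${(after.usedMB - before.usedMB).toFixed(1)}MB, pressure ${(after.pressure * 100).toFixed(1)}%`);
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;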




&lt;h2&gt;What This Means for Your Architecture&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;If you're building with encoder-only text models (DistilBERT class):&lt;/strong&gt;&lt;br&gt;
You're fine on the main thread. 27.8ms INP is negligible. Trigger inference directly on user interactions without worrying about CWV degradation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If you're using larger encoder models (BERT-base class):&lt;/strong&gt;&lt;br&gt;
Don't trigger inference synchronously on interactions. At 85ms, stacking this with other main thread work risks crossing 200ms. Move it to a post-interaction background step — run inference after you've already painted the response.&lt;/p&gt;
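
&lt;p&gt;One way to do that is to paint an immediate response, then schedule inference for after the next frame. A sketch with placeholder helpers (&lt;code&gt;showPendingState&lt;/code&gt;, &lt;code&gt;embed&lt;/code&gt;, and &lt;code&gt;renderResult&lt;/code&gt; stand in for your UI code and pipeline call):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;button.addEventListener('click', () =&gt; {
  showPendingState();                     // cheap DOM update: this paint is what INP measures
  requestAnimationFrame(() =&gt; {
    setTimeout(async () =&gt; {              // fires after the frame has been painted
      const vector = await embed(input.value);   // embed: placeholder for the BERT-base pipeline call
      renderResult(vector);
    }, 0);
  });
});
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;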

&lt;p&gt;&lt;strong&gt;If you're using any encoder-decoder model (Whisper, T5, BART, etc.):&lt;/strong&gt;&lt;br&gt;
You &lt;strong&gt;must&lt;/strong&gt; offload to a Web Worker. This isn't an optimization — it's a requirement. The main thread will be blocked for hundreds of milliseconds no matter what you do. Transformers.js runs inside a Web Worker without any special configuration: create the pipeline in a worker script and exchange inputs and results with &lt;code&gt;postMessage&lt;/code&gt;. A minimal sketch of the worker side:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// worker.js (module worker): the autoregressive decode loop blocks this thread, not the UI
import { pipeline } from '@xenova/transformers';

const transcriber = await pipeline('automatic-speech-recognition', 'Xenova/whisper-tiny');

self.onmessage = async (event) =&gt; {
  const output = await transcriber(event.data);   // event.data: audio as a Float32Array or a URL
  self.postMessage(output);
};
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
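
&lt;p&gt;On the main thread, the click handler only posts audio to the worker and renders the transcript when it comes back. A minimal sketch of that side (the file name, &lt;code&gt;captureAudio&lt;/code&gt;, and &lt;code&gt;renderTranscript&lt;/code&gt; are placeholders, not part of the benchmark):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// main.js: the interaction handler stays cheap, so the click paints immediately
const worker = new Worker(new URL('./worker.js', import.meta.url), { type: 'module' });

worker.onmessage = (event) =&gt; renderTranscript(event.data.text);   // renderTranscript: your UI update

recordButton.addEventListener('click', async () =&gt; {
  const audio = await captureAudio();   // captureAudio: placeholder returning 16kHz Float32Array samples
  worker.postMessage(audio);
});
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;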



&lt;p&gt;&lt;strong&gt;If you're using vision transformers:&lt;/strong&gt;&lt;br&gt;
Test on actual mobile hardware before shipping. The memory pressure numbers on an M-series Mac will look very different on a mid-range Android.&lt;/p&gt;




&lt;h2&gt;Limitations to Know&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;TBT couldn't be captured in the deployed environment.&lt;/strong&gt; The Long Tasks API isn't available in cross-origin deployed contexts — only in locally-served or Chrome DevTools Protocol environments. The INP measurements are real, but the full main thread blocking profile requires a different setup to measure properly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;All numbers are from high-end hardware.&lt;/strong&gt; An Apple M-series Mac is not the median global web user's device. INP values on mid-range Android will be significantly higher — potentially 3–5x. The relative ordering of models should hold, but don't use these absolute numbers as production thresholds for mobile.&lt;/p&gt;




&lt;h2&gt;Try It Yourself&lt;/h2&gt;

&lt;p&gt;The benchmark is live and open source. Run it on your device, your network conditions, your hardware profile. Export the results as JSON or CSV.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Live benchmark&lt;/strong&gt;: &lt;a href="https://benchmark.mspk.me" rel="noopener noreferrer"&gt;benchmark.mspk.me&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Source code&lt;/strong&gt;: &lt;a href="https://github.com/srikarphanikumar/cwv-ai-benchmark" rel="noopener noreferrer"&gt;github.com/srikarphanikumar/cwv-ai-benchmark&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Full paper&lt;/strong&gt;: arXiv link coming soon&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you run it on a mid-range Android or a low-end device and want to share the numbers, I'd love to see them — that's exactly the follow-on data this research needs.&lt;/p&gt;




&lt;h2&gt;TL;DR&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;DistilBERT is the only model that stays in Google's "Good" INP range on the main thread&lt;/li&gt;
&lt;li&gt;Whisper Tiny is "Poor" despite having only 39M parameters — architecture beats quantization&lt;/li&gt;
&lt;li&gt;Encoder-decoder models require Web Worker offloading — no exceptions&lt;/li&gt;
&lt;li&gt;Parameter count is a bad proxy for browser inference cost&lt;/li&gt;
&lt;li&gt;Memory pressure on mobile is a separate concern from memory consumption&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The era of client-side AI is here. Now we need to measure what it actually costs.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>ai</category>
      <category>webvitals</category>
      <category>corewebvitals</category>
    </item>
  </channel>
</rss>
