<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Bruno Juca</title>
    <description>The latest articles on DEV Community by Bruno Juca (@bruno_juca_7038c22bcca1db).</description>
    <link>https://dev.to/bruno_juca_7038c22bcca1db</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3922788%2F01143770-3c3d-45e6-a4f5-87953bb10055.png</url>
      <title>DEV Community: Bruno Juca</title>
      <link>https://dev.to/bruno_juca_7038c22bcca1db</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/bruno_juca_7038c22bcca1db"/>
    <language>en</language>
    <item>
      <title>Why Most Browser AI Demos Fail on Real Hardware</title>
      <dc:creator>Bruno Juca</dc:creator>
      <pubDate>Sun, 10 May 2026 04:44:10 +0000</pubDate>
      <link>https://dev.to/bruno_juca_7038c22bcca1db/why-most-browser-ai-demos-fail-on-real-hardware-220f</link>
      <guid>https://dev.to/bruno_juca_7038c22bcca1db/why-most-browser-ai-demos-fail-on-real-hardware-220f</guid>
      <description>&lt;p&gt;&lt;em&gt;Building adaptive local AI inference for real-world hardware instead of benchmark machines.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Running AI models directly in the browser has improved dramatically over the last few years.&lt;/p&gt;

&lt;p&gt;With technologies like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;WebGPU&lt;/li&gt;
&lt;li&gt;ONNX Runtime Web&lt;/li&gt;
&lt;li&gt;WebAssembly&lt;/li&gt;
&lt;li&gt;quantized transformer models&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;…it’s now possible to run surprisingly capable AI systems locally without uploading data to the cloud.&lt;/p&gt;

&lt;p&gt;But there’s a problem that becomes obvious the moment real users start testing your application:&lt;/p&gt;

&lt;p&gt;Real hardware is chaotic.&lt;/p&gt;

&lt;p&gt;Some users have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;gaming GPUs&lt;/li&gt;
&lt;li&gt;integrated graphics&lt;/li&gt;
&lt;li&gt;old laptops with 4 GB RAM&lt;/li&gt;
&lt;li&gt;workstations with 32 threads&lt;/li&gt;
&lt;li&gt;browsers with partially implemented WebGPU support&lt;/li&gt;
&lt;li&gt;thermally constrained mobile CPUs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Most browser AI demos are tested on a single developer machine and assume:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;stable GPU acceleration&lt;/li&gt;
&lt;li&gt;enough memory&lt;/li&gt;
&lt;li&gt;predictable threading behavior&lt;/li&gt;
&lt;li&gt;fast inference backends&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once exposed to real users, many of these applications become unstable, extremely slow, or simply crash.&lt;/p&gt;

&lt;p&gt;While building Cowslator, a local-first AI transcription platform, I found this to be one of the biggest engineering challenges.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The illusion of “it works on my machine”&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A surprising amount of browser AI software is effectively optimized for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;one browser&lt;/li&gt;
&lt;li&gt;one GPU&lt;/li&gt;
&lt;li&gt;one RAM configuration&lt;/li&gt;
&lt;li&gt;one backend&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This works fine in demos.&lt;/p&gt;

&lt;p&gt;It fails in production.&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a model that runs perfectly on a desktop GPU may completely freeze a low-end laptop&lt;/li&gt;
&lt;li&gt;a WebGPU backend may behave differently across browsers&lt;/li&gt;
&lt;li&gt;memory fragmentation can destroy performance on integrated GPUs&lt;/li&gt;
&lt;li&gt;thread counts that help one CPU may hurt another&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The result is a poor user experience:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;browser freezes&lt;/li&gt;
&lt;li&gt;out-of-memory crashes&lt;/li&gt;
&lt;li&gt;fans spinning at maximum speed&lt;/li&gt;
&lt;li&gt;unusable transcription times&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is especially problematic for local AI applications, where the user’s machine is responsible for inference.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why transcription workloads are difficult&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Speech transcription is computationally expensive.&lt;/p&gt;

&lt;p&gt;Even quantized Whisper models can consume significant amounts of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;RAM&lt;/li&gt;
&lt;li&gt;VRAM&lt;/li&gt;
&lt;li&gt;CPU bandwidth&lt;/li&gt;
&lt;li&gt;GPU compute time&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And unlike small text demos, transcription often involves:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;long audio files&lt;/li&gt;
&lt;li&gt;sustained inference&lt;/li&gt;
&lt;li&gt;large token generation&lt;/li&gt;
&lt;li&gt;continuous decoding&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now combine that with browser constraints:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;sandboxing&lt;/li&gt;
&lt;li&gt;memory limits&lt;/li&gt;
&lt;li&gt;varying WebGPU implementations&lt;/li&gt;
&lt;li&gt;WebAssembly overhead&lt;/li&gt;
&lt;li&gt;inconsistent multithreading support&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The complexity grows quickly.&lt;/p&gt;
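
&lt;p&gt;One concrete example of that inconsistency: WebAssembly multithreading only works when the page is cross-origin isolated (served with COOP/COEP headers), so thread support has to be probed rather than assumed. A minimal check using standard browser globals:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;// SharedArrayBuffer (required for WASM threads) only exists when
// the page is cross-origin isolated via COOP/COEP response headers.
const canUseWasmThreads =
    typeof SharedArrayBuffer !== "undefined" &amp;amp;&amp;amp;
    globalThis.crossOriginIsolated === true;
&lt;/code&gt;&lt;/pre&gt;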

&lt;p&gt;&lt;strong&gt;The naive solution: fixed inference strategy&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The simplest architecture is:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;loadOneModel();
runInference();
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;But this creates major problems.&lt;/p&gt;

&lt;p&gt;If the model is too large:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;weaker devices crash&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If the model is too small:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;transcription quality suffers unnecessarily on powerful machines&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If GPU acceleration fails:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the entire application may become unusable&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is one reason many browser AI demos seem impressive at first but prove unreliable in practice.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Building an adaptive inference engine&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To solve this problem, I started building an adaptive inference engine for local transcription.&lt;/p&gt;

&lt;p&gt;Instead of assuming all devices are similar, the application attempts to understand the user’s hardware environment and dynamically choose:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;inference backend&lt;/li&gt;
&lt;li&gt;model size&lt;/li&gt;
&lt;li&gt;quantization level&lt;/li&gt;
&lt;li&gt;threading configuration&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At startup, the engine evaluates (see the probe sketch after this list):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;available RAM&lt;/li&gt;
&lt;li&gt;CPU thread count&lt;/li&gt;
&lt;li&gt;WebGPU availability&lt;/li&gt;
&lt;li&gt;browser capabilities&lt;/li&gt;
&lt;li&gt;estimated memory limits&lt;/li&gt;
&lt;/ul&gt;
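
&lt;p&gt;A minimal probe sketch using standard browser APIs (not Cowslator’s exact code). Note that &lt;code&gt;navigator.deviceMemory&lt;/code&gt; is Chromium-only and reports a clamped value, so the fallback defaults here are assumptions; the returned names feed the selection example that follows.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;async function probeEnvironment() {
    // deviceMemory is Chromium-only and clamped (8 GB max reported);
    // treat it as a coarse hint. The fallback defaults are assumptions.
    const ramGB = navigator.deviceMemory ?? 4;
    const cpuThreads = navigator.hardwareConcurrency ?? 2;

    // WebGPU counts as available only if an adapter can actually be
    // acquired, not merely because navigator.gpu exists.
    let gpuAvailable = false;
    if ("gpu" in navigator) {
        try {
            gpuAvailable = (await navigator.gpu.requestAdapter()) !== null;
        } catch {
            gpuAvailable = false;
        }
    }

    return { ramGB, cpuThreads, gpuAvailable };
}
&lt;/code&gt;&lt;/pre&gt;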

&lt;p&gt;Then it selects the most appropriate strategy.&lt;/p&gt;

&lt;p&gt;Simplified example:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;if (gpuAvailable &amp;amp;&amp;amp; ramGB &amp;gt;= 8) {
    backend = "onnx-webgpu";
    model = "medium-q5";
} else if (cpuThreads &amp;gt;= 8) {
    backend = "whisper-wasm";
    model = "base-q5";
} else {
    backend = "whisper-wasm";
    model = "tiny-q5";
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This dramatically improves reliability across heterogeneous hardware.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why fallback systems matter&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;One of the biggest lessons from browser AI development is:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;GPU acceleration cannot be assumed.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;WebGPU support still varies significantly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;browser implementations differ&lt;/li&gt;
&lt;li&gt;drivers behave inconsistently&lt;/li&gt;
&lt;li&gt;integrated GPUs may have unstable memory behavior&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Because of this, fallback systems are essential.&lt;/p&gt;

&lt;p&gt;Cowslator currently uses:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ONNX Runtime Web with WebGPU acceleration when available&lt;/li&gt;
&lt;li&gt;a Whisper.cpp WebAssembly fallback when GPU acceleration is not viable&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This allows the application to continue functioning even on weaker systems.&lt;/p&gt;
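
&lt;p&gt;The general pattern looks roughly like this (a sketch, not Cowslator’s actual code; for self-containment it falls back to ONNX Runtime Web’s own WASM execution provider, whereas Cowslator’s real fallback is a separate Whisper.cpp WebAssembly build):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import * as ort from "onnxruntime-web";

// Try the GPU-accelerated backend first; if session creation fails
// (no WebGPU, flaky driver, out of memory, ...), degrade to WASM.
async function createSession(modelUrl) {
    if ("gpu" in navigator) {
        try {
            return await ort.InferenceSession.create(modelUrl, {
                executionProviders: ["webgpu"],
            });
        } catch (err) {
            console.warn("WebGPU backend failed, using WASM instead:", err);
        }
    }
    return ort.InferenceSession.create(modelUrl, {
        executionProviders: ["wasm"],
    });
}
&lt;/code&gt;&lt;/pre&gt;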

&lt;p&gt;Without fallback systems, many users would simply encounter crashes or unusable performance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Batch transcription changes the workload entirely&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Once adaptive inference became reliable, another feature became practical:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Batch transcription&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Instead of uploading a single file, users can upload an entire folder of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;interviews&lt;/li&gt;
&lt;li&gt;lectures&lt;/li&gt;
&lt;li&gt;podcasts&lt;/li&gt;
&lt;li&gt;voice notes&lt;/li&gt;
&lt;li&gt;documentaries&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The application then:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;creates a transcription queue&lt;/li&gt;
&lt;li&gt;processes files sequentially&lt;/li&gt;
&lt;li&gt;adapts inference strategy dynamically&lt;/li&gt;
&lt;li&gt;generates outputs locally&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This creates a very different workload profile compared to simple browser demos.&lt;/p&gt;

&lt;p&gt;Now the system must handle:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;long-running inference sessions&lt;/li&gt;
&lt;li&gt;memory cleanup between files (see the sketch after this list)&lt;/li&gt;
&lt;li&gt;scheduling stability&lt;/li&gt;
&lt;li&gt;sustained thermal pressure&lt;/li&gt;
&lt;li&gt;browser responsiveness over time&lt;/li&gt;
&lt;/ul&gt;
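
&lt;p&gt;A hypothetical sketch of that queue loop, reusing the &lt;code&gt;createSession&lt;/code&gt; helper from the fallback example above (&lt;code&gt;pickModelFor&lt;/code&gt; and &lt;code&gt;transcribeFile&lt;/code&gt; are illustrative names, not Cowslator’s API):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;// Sequential batch loop with explicit cleanup between files, so
// long batches don't slowly accumulate GPU or WASM heap memory.
async function processQueue(files, onProgress) {
    const results = [];
    for (const file of files) {
        const session = await createSession(pickModelFor(file));
        try {
            results.push(await transcribeFile(session, file));
        } finally {
            // Free the session's backend resources before the next file.
            await session.release();
            onProgress?.(results.length, files.length);
        }
    }
    return results;
}
&lt;/code&gt;&lt;/pre&gt;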

&lt;p&gt;In practice, batch transcription became an excellent stress test for adaptive local AI systems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why local-first AI matters&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Most transcription platforms rely on cloud processing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;upload audio&lt;/li&gt;
&lt;li&gt;wait for server inference&lt;/li&gt;
&lt;li&gt;download subtitles&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This approach is convenient, but it also introduces:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;privacy concerns&lt;/li&gt;
&lt;li&gt;upload bottlenecks&lt;/li&gt;
&lt;li&gt;subscription costs&lt;/li&gt;
&lt;li&gt;API dependence&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Local inference changes the model entirely:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;no upload required&lt;/li&gt;
&lt;li&gt;works offline&lt;/li&gt;
&lt;li&gt;uses local hardware&lt;/li&gt;
&lt;li&gt;predictable scaling&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As consumer hardware improves, local AI becomes increasingly practical for workloads that previously required cloud infrastructure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Final thoughts&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Browser AI is reaching an interesting stage.&lt;/p&gt;

&lt;p&gt;The technology is now powerful enough to run serious workloads locally, but real-world deployment exposes problems that benchmarks rarely reveal:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;inconsistent hardware&lt;/li&gt;
&lt;li&gt;unstable GPU support&lt;/li&gt;
&lt;li&gt;memory constraints&lt;/li&gt;
&lt;li&gt;heterogeneous performance characteristics&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The future of local AI applications may depend less on raw model capability and more on adaptive orchestration:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;selecting the right backend&lt;/li&gt;
&lt;li&gt;choosing the right quantization&lt;/li&gt;
&lt;li&gt;scaling to the available hardware dynamically&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In other words:&lt;/p&gt;

&lt;p&gt;Local AI applications cannot assume homogeneous hardware anymore.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Adaptive inference is becoming essential.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Cowslator is an ongoing experiment in local-first AI transcription:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;browser-based&lt;/li&gt;
&lt;li&gt;privacy-focused&lt;/li&gt;
&lt;li&gt;adaptive to hardware&lt;/li&gt;
&lt;li&gt;capable of batch transcription entirely offline&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As local AI tooling matures, I think we’ll see more applications move away from centralized inference and toward adaptive edge computation running directly on consumer hardware.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>inference</category>
      <category>hardware</category>
      <category>benchmark</category>
    </item>
  </channel>
</rss>
