Running AI inference natively in the browser is the holy grail for reducing API costs and keeping enterprise data private. But if you’ve actually tried to build it, you know the reality is a massive headache.
You have to manually configure WebLLM or Transformers.js, set up dedicated Web Workers so your main React thread doesn't freeze, handle browser caching for massive model files, and write custom state management just to track the loading progress. It is hours of complex, low-level boilerplate before you can even generate a single token.
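For context, even a stripped-down version of that plumbing looks roughly like this. This is only a sketch of the main-thread wiring; inference-worker.js, updateLoadingUI, and appendToken are hypothetical placeholders for the worker script and state management you would still have to write yourself:

// Spawn a dedicated worker so model loading and inference never block the React thread
const worker = new Worker(new URL('./inference-worker.js', import.meta.url), { type: 'module' });

// Hand-rolled progress and streaming plumbing
worker.onmessage = (event) => {
  const { type, progress, text } = event.data;
  if (type === 'progress') updateLoadingUI(progress); // your own loading-state management
  if (type === 'token') appendToken(text);            // push streamed tokens into React state
};

// Ask the worker to download and compile the model
worker.postMessage({ type: 'load', modelId: 'Llama-3.2-1B-Instruct-q4f16_1-MLC' });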
I got tired of configuring the same WebGPU architecture over and over, so I wrapped the entire engine into a single, drop-in React hook: react-brai.
Initialize the engine. The hook automatically handles Leader/Follower negotiation when multiple tabs are active.
import { useEffect } from 'react';
import { useLocalAI } from 'react-brai';

export default function Chat() {
  const { loadModel, chat, isReady, tps } = useLocalAI();

  useEffect(() => {
    // Kicks off the download; the browser caches the weights after the first load
    loadModel('Llama-3.2-1B-Instruct-q4f16_1-MLC');
  }, []);

  return <div>Speed: {tps} T/s</div>;
}
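One note on that snippet: the first load takes a while, so in practice I'd gate the UI on isReady before exposing any chat input. This is my own pattern, dropped into the component right before the return:

// Inside the Chat component above: render a placeholder until the
// download and WebGPU compilation have finished
if (!isReady) {
  return <div>Downloading model… this only happens once per browser.</div>;
}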
Once the model is ready, you can use it like this:
const response = await chat([
  { role: "system", content: "Output JSON: { sentiment: 'pos' | 'neg' }" },
  { role: "user", content: "I love this library!" }
]);

const data = JSON.parse(response);
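A caveat worth noting: small quantized models don't always return perfectly valid JSON, so I'd harden that last line a bit. This is my own defensive pattern, not something the hook requires:

// Don't assume the model's output parses cleanly; guard it instead of crashing the UI
let data = null;
try {
  data = JSON.parse(response);
} catch {
  console.warn('Model returned non-JSON output:', response);
}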
It abstracts away the web worker delegation, the model caching, and the memory constraints. You just call the hook, pick a quantized SLM (like Llama-3B), and start generating text or extracting JSON.
The trade-off: the browser cache
Let me be brutally honest: this is not for lightweight, general-purpose landing pages. react-brai requires the user to download a ~1.5GB to 3GB model into their local browser cache on the first load.
But for high-value, niche use cases, that one-time download is a cheap price to pay.
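Because of that download, I'd feature-detect WebGPU and get an explicit opt-in before calling loadModel. This is just my own guard, not part of react-brai:

// Feature-detect WebGPU before committing the user to a multi-GB download
async function canRunLocally() {
  if (!('gpu' in navigator)) return false;            // browser has no WebGPU API at all
  const adapter = await navigator.gpu.requestAdapter();
  return adapter !== null;                             // null when no usable GPU is available
}

async function maybeLoadModel(loadModel) {
  // Only start the download on capable hardware, and only after an explicit opt-in
  if (await canRunLocally() && window.confirm('Download the ~1.5GB model for offline AI?')) {
    loadModel('Llama-3.2-1B-Instruct-q4f16_1-MLC');
  }
}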
Where this actually makes sense
- Heavy B2B Dashboards: The user logs in daily. They eat the download cost once, and forever after, their inference is instant and offline.
- Enterprise Data Privacy: When strict rules prevent you from sending customer data to OpenAI, local WebGPU inference is your only secure option.
- Automated JSON Extraction: Repeatedly extracting structured JSON from large datasets without burning through API tokens (see the sketch right after this list).
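For that last use case, the loop ends up looking something like this. It's my own sketch on top of the chat call shown earlier; the rows and field names are just examples:

// Run structured extraction over a whole dataset locally, with zero API calls.
// Assumes `chat` comes from useLocalAI() and the model has already loaded.
async function extractAll(chat, rows) {
  const results = [];
  for (const row of rows) {
    const response = await chat([
      { role: 'system', content: 'Reply with JSON only: { "sentiment": "pos" | "neg" }' },
      { role: 'user', content: row.text }
    ]);
    try {
      results.push(JSON.parse(response));
    } catch {
      results.push(null); // skip rows the model mangles
    }
  }
  return results;
}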
Try it out
I’ve published the package on NPM and set up a live playground. I’d love for fellow React devs to test the implementation and let me know how the memory management holds up on your hardware.
NPM: https://www.npmjs.com/package/react-brai
Live WebGPU Playground: https://react-brai.vercel.app
Top comments (9)
The constraint you named honestly: "This is not for lightweight, general-purpose landing pages." Most libraries hide the 3GB download. You put it upfront. That's not a bug, it's a filter. The use cases that survive that constraint are the ones that actually need local inference: B2B dashboards, enterprise data privacy, structured extraction. Everything else falls away. That's good product design. You built for the problem, not the demo.
Exactly this. I figured hiding the constraint just to get more demo clicks wasn't worth the headache later. Better to filter out the noise early so the people who actually need local inference can find it. Thanks for the kind words!
You are always welcome sir.
Really solid abstraction of a genuinely painful setup, wrapping WebGPU, workers, and caching into a simple hook is a big DX win. I especially like the honest positioning around when it actually makes sense, the B2B and privacy use cases are spot on.
WebGPU for local inference is the right direction. We hit a similar problem building data infrastructure for edge AI — when your robot needs to make decisions in under 50ms, round-tripping to an API is not even an option.
The 1.5-3GB model download concern you mentioned is real though. In our case we solved it by shipping a smaller quantized model (Q4) as part of the binary itself. The tradeoff is accuracy vs. startup time, but for many edge use cases that tradeoff makes sense.
One question: how does react-brai handle tab-level coordination? If a user has 3 tabs open, do they each download their own model copy? That was a nasty issue we had with Web Workers — each worker loading its own model into memory and OOM-killing the browser.
Thanks for the information.