Running AI inference natively in the browser is the holy grail for reducing API costs and keeping enterprise data private. But if you’ve actually tried to build it, you know the reality is a massive headache.
You have to manually configure WebLLM or Transformers.js, set up dedicated Web Workers so your main React thread doesn't freeze, handle browser caching for massive model files, and write custom state management just to track the loading progress. It is hours of complex, low-level boilerplate before you can even generate a single token.
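For context, even a stripped-down version of that plumbing looks roughly like this. This is only a sketch of the main-thread wiring; inference-worker.js, updateLoadingUI, and appendToken are hypothetical placeholders for the worker script and state management you would still have to write yourself:

// Spawn a dedicated worker so model loading and inference never block the React thread
const worker = new Worker(new URL('./inference-worker.js', import.meta.url), { type: 'module' });

// Hand-rolled progress and streaming plumbing
worker.onmessage = (event) => {
  const { type, progress, text } = event.data;
  if (type === 'progress') updateLoadingUI(progress); // your own loading-state management
  if (type === 'token') appendToken(text);            // push streamed tokens into React state
};

// Ask the worker to download and compile the model
worker.postMessage({ type: 'load', modelId: 'Llama-3.2-1B-Instruct-q4f16_1-MLC' });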
I got tired of configuring the same WebGPU architecture over and over, so I wrapped the entire engine into a single, drop-in React hook: react-brai.
Initialize the engine. The hook automatically handles Leader/Follower negotiation when multiple tabs are active.
import { useEffect } from 'react';
import { useLocalAI } from 'react-brai';

export default function Chat() {
  const { loadModel, chat, isReady, tps } = useLocalAI();

  useEffect(() => {
    // Kicks off the download; the browser caches the weights after the first load
    loadModel('Llama-3.2-1B-Instruct-q4f16_1-MLC');
  }, []);

  return <div>Speed: {tps} T/s</div>;
}
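One note on that snippet: the first load takes a while, so in practice I'd gate the UI on isReady before exposing any chat input. This is my own pattern, dropped into the component right before the return:

// Inside the Chat component above: render a placeholder until the
// download and WebGPU compilation have finished
if (!isReady) {
  return <div>Downloading model… this only happens once per browser.</div>;
}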
Once the model is ready, you can use it like this:
const response = await chat([
  { role: "system", content: "Output JSON: { sentiment: 'pos' | 'neg' }" },
  { role: "user", content: "I love this library!" }
]);

const data = JSON.parse(response);
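A caveat worth noting: small quantized models don't always return perfectly valid JSON, so I'd harden that last line a bit. This is my own defensive pattern, not something the hook requires:

// Don't assume the model's output parses cleanly; guard it instead of crashing the UI
let data = null;
try {
  data = JSON.parse(response);
} catch {
  console.warn('Model returned non-JSON output:', response);
}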
It abstracts away the web worker delegation, the model caching, and the memory constraints. You just call the hook, pick a quantized SLM (like Llama-3B), and start generating text or extracting JSON.
The trade-off: the browser cache
Let me be brutally honest: this is not for lightweight, general-purpose landing pages. react-brai requires the user to download a ~1.5GB to 3GB model into their local browser cache on the first load.
But for high-value, niche use cases, that one-time download is a cheap price to pay.
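Because of that download, I'd feature-detect WebGPU and get an explicit opt-in before calling loadModel. This is just my own guard, not part of react-brai:

// Feature-detect WebGPU before committing the user to a multi-GB download
async function canRunLocally() {
  if (!('gpu' in navigator)) return false;            // browser has no WebGPU API at all
  const adapter = await navigator.gpu.requestAdapter();
  return adapter !== null;                             // null when no usable GPU is available
}

async function maybeLoadModel(loadModel) {
  // Only start the download on capable hardware, and only after an explicit opt-in
  if (await canRunLocally() && window.confirm('Download the ~1.5GB model for offline AI?')) {
    loadModel('Llama-3.2-1B-Instruct-q4f16_1-MLC');
  }
}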
Where this actually makes sense
- Heavy B2B Dashboards: The user logs in daily. They eat the download cost once, and forever after, their inference is instant and offline.
- Enterprise Data Privacy: When strict rules prevent you from sending customer data to OpenAI, local WebGPU inference is your only secure option.
- Automated JSON Extraction: Repeatedly extracting structured JSON from large datasets without burning through API tokens (see the sketch right after this list).
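For that last use case, the loop ends up looking something like this. It's my own sketch on top of the chat call shown earlier; the rows and field names are just examples:

// Run structured extraction over a whole dataset locally, with zero API calls.
// Assumes `chat` comes from useLocalAI() and the model has already loaded.
async function extractAll(chat, rows) {
  const results = [];
  for (const row of rows) {
    const response = await chat([
      { role: 'system', content: 'Reply with JSON only: { "sentiment": "pos" | "neg" }' },
      { role: 'user', content: row.text }
    ]);
    try {
      results.push(JSON.parse(response));
    } catch {
      results.push(null); // skip rows the model mangles
    }
  }
  return results;
}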
Try it out
I’ve published the package on NPM and set up a live playground. I’d love for fellow React devs to test the implementation and let me know how the memory management holds up on your hardware.
NPM: https://www.npmjs.com/package/react-brai
Live WebGPU Playground: https://react-brai.vercel.app
Top comments (9)
The constraint you named honestly: "This is not for lightweight, general-purpose landing pages." Most libraries hide the 3GB download. You put it upfront. That's not a bug, it's a filter. The use cases that survive that constraint are the ones that actually need local inference: B2B dashboards, enterprise data privacy, structured extraction. Everything else falls away. That's good product design. You built for the problem, not the demo.
Exactly this. I figured hiding the constraint just to get more demo clicks wasn't worth the headache later. Better to filter out the noise early so the people who actually need local inference can find it. Thanks for the kind words!
You are always welcome sir.
Really solid abstraction of a genuinely painful setup, wrapping WebGPU, workers, and caching into a simple hook is a big DX win. I especially like the honest positioning around when it actually makes sense, the B2B and privacy use cases are spot on.
WebGPU for local inference is the right direction. We hit a similar problem building data infrastructure for edge AI — when your robot needs to make decisions in under 50ms, round-tripping to an API is not even an option.
The 1.5-3GB model download concern you mentioned is real though. In our case we solved it by shipping a smaller quantized model (Q4) as part of the binary itself. The tradeoff is accuracy vs. startup time, but for many edge use cases that tradeoff makes sense.
One question: how does react-brai handle tab-level coordination? If a user has 3 tabs open, do they each download their own model copy? That was a nasty issue we had with Web Workers — each worker loading its own model into memory and OOM-killing the browser.
Thanks for the information.