Running AI inference natively in the browser is the holy grail for reducing API costs and keeping enterprise data private. But if you’ve actually tried to build it, you know the reality is a massive headache.
You have to manually configure WebLLM or Transformers.js, set up dedicated Web Workers so your main React thread doesn't freeze, handle browser caching for massive model files, and write custom state management just to track the loading progress. It is hours of complex, low-level boilerplate before you can even generate a single token.
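To give a sense of the plumbing involved, here is a rough sketch of the kind of hand-rolled loading state machine you end up writing just to surface download progress in the UI. All the names here are mine, purely illustrative; they are not the API of react-brai or any engine library:

```typescript
// Illustrative only: the loading-state boilerplate a hook like this replaces.
type LoadState =
  | { phase: "idle" }
  | { phase: "downloading"; progress: number } // 0..1 across model shard fetches
  | { phase: "compiling" }                     // e.g. WebGPU shader compilation
  | { phase: "ready" }
  | { phase: "error"; message: string };

type LoadEvent =
  | { type: "progress"; progress: number }
  | { type: "compile" }
  | { type: "ready" }
  | { type: "error"; message: string };

// A reducer you'd feed from worker messages via useReducer or a store.
function loadReducer(state: LoadState, event: LoadEvent): LoadState {
  switch (event.type) {
    case "progress":
      return { phase: "downloading", progress: event.progress };
    case "compile":
      return { phase: "compiling" };
    case "ready":
      return { phase: "ready" };
    case "error":
      return { phase: "error", message: event.message };
  }
}
```

And that is before you have written the worker itself, the cache checks, or the message protocol between them.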
I got tired of configuring the same WebGPU architecture over and over, so I wrapped the entire engine into a single, drop-in React hook: react-brai.
Initialize the engine. When multiple tabs are open, the hook automatically handles Leader/Follower negotiation so they don't each spin up their own copy of the model.
import { useEffect } from 'react';
import { useLocalAI } from 'react-brai';

export default function Chat() {
  const { loadModel, chat, isReady, tps } = useLocalAI();

  useEffect(() => {
    loadModel('Llama-3.2-1B-Instruct-q4f16_1-MLC');
  }, []);

  return <div>Speed: {tps} T/s</div>;
}
Once isReady is true, call the loaded model like this:
const response = await chat([
  { role: "system", content: "Output JSON: { sentiment: 'pos' | 'neg' }" },
  { role: "user", content: "I love this library!" }
]);

const data = JSON.parse(response);
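One practical caveat: small quantized models occasionally wrap their JSON in markdown fences or add trailing prose, and a bare JSON.parse will throw on that. A defensive wrapper like the sketch below (my own helper, not part of react-brai) keeps the extraction loop resilient:

```typescript
// Hypothetical helper, not part of react-brai: defensively parse model output.
// Strips ```json fences and surrounding prose before parsing, and returns
// null instead of throwing so the caller can retry or fall back.
export function safeParseJSON<T = unknown>(raw: string): T | null {
  // Remove markdown code fences if the model added them.
  const unfenced = raw.replace(/```(?:json)?/g, "").trim();
  // Grab the outermost object literal, ignoring any surrounding chatter.
  const start = unfenced.indexOf("{");
  const end = unfenced.lastIndexOf("}");
  if (start === -1 || end === -1 || end < start) return null;
  try {
    return JSON.parse(unfenced.slice(start, end + 1)) as T;
  } catch {
    return null; // Malformed output: caller decides whether to retry.
  }
}
```

Usage: `const data = safeParseJSON<{ sentiment: string }>(response);` and retry the chat call when it comes back null.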
It abstracts away the Web Worker delegation, the model caching, and the memory constraints. You just call the hook, pick a quantized SLM (like the Llama 3.2 1B above, or a 3B variant), and start generating text or extracting JSON.
The Browser Cache
Let me be brutally honest: this is not for lightweight, general-purpose landing pages. react-brai requires the user to download a ~1.5GB to 3GB model into their local browser cache on first load.
But for high-value, niche use cases, that one-time download is a cheap price to pay.
Where this actually makes sense
- Heavy B2B Dashboards: The user logs in daily. They eat the download cost once, and forever after, their inference is instant and offline.
- Enterprise Data Privacy: When strict rules prevent you from sending customer data to third-party APIs like OpenAI, local WebGPU inference keeps everything on the user's machine.
- Automated JSON Extraction: Pulling structured JSON out of large datasets continuously, without burning through API tokens.
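That last case is where local inference really shines: the marginal cost of each extraction is zero. Below is a sketch of such a loop. It uses only the `chat` signature shown earlier (an array of role/content messages resolving to a string); the batching function itself and its prompt are my own illustration, not part of the library:

```typescript
type Msg = { role: "system" | "user"; content: string };
type ChatFn = (messages: Msg[]) => Promise<string>;

// Illustrative batch-extraction loop. `chat` would be the function returned
// by useLocalAI(); here it is injected so the loop is easy to test.
async function extractSentiments(
  chat: ChatFn,
  reviews: string[]
): Promise<string[]> {
  const results: string[] = [];
  for (const review of reviews) {
    // Sequential on purpose: one local model serves one generation at a
    // time, so queueing requests beats firing them in parallel.
    const raw = await chat([
      { role: "system", content: "Output JSON: { sentiment: 'pos' | 'neg' }" },
      { role: "user", content: review },
    ]);
    try {
      results.push(JSON.parse(raw).sentiment);
    } catch {
      results.push("unknown"); // tolerate the occasional malformed output
    }
  }
  return results;
}
```

With an API provider, running this over ten thousand reviews is a billing line item; locally it's just time on the user's GPU.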
Try it out
I’ve published the package on NPM and set up a live playground. I’d love for fellow React devs to test the implementation and let me know how the memory management holds up on your hardware.
NPM: https://www.npmjs.com/package/react-brai
Live WebGPU Playground: https://react-brai.vercel.app