DEV Community

Discussion on: A simple React hook for running local LLMs via WebGPU

mote

WebGPU for local inference is the right direction. We hit a similar problem building data infrastructure for edge AI: when a robot needs to make a decision in under 50 ms, round-tripping to a cloud API isn't an option.

The 1.5–3 GB model download concern you mentioned is real, though. In our case we solved it by shipping a smaller 4-bit quantized model (Q4) as part of the binary itself. The tradeoff is accuracy vs. startup time, but for many edge use cases that tradeoff makes sense.

One question: how does react-brai handle tab-level coordination? If a user has three tabs open, does each one download its own copy of the model? We hit a nasty version of this with Web Workers: each worker loaded its own model into memory and OOM-killed the browser.
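For what it's worth, the pattern that fixed it for us was a single-flight loader: all callers asking for the same model URL share one in-flight download instead of each starting their own. Hosted in a SharedWorker, that also covers the multi-tab case, since every tab talks to the same worker instance. A minimal sketch (names are placeholders, not react-brai's actual API):

```typescript
// Single-flight model loader: concurrent requests for the same URL share
// one download. `fetchBytes` is injected so this works in a SharedWorker,
// a regular worker, or a test.
type Fetcher = (url: string) => Promise<ArrayBuffer>;

function makeModelLoader(fetchBytes: Fetcher) {
  const inFlight = new Map<string, Promise<ArrayBuffer>>();
  return (url: string): Promise<ArrayBuffer> => {
    // Reuse the pending (or completed) download instead of starting a
    // second copy, so N callers cost one download and one in-memory copy.
    let pending = inFlight.get(url);
    if (!pending) {
      pending = fetchBytes(url);
      inFlight.set(url, pending);
    }
    return pending;
  };
}
```

Pairing this with the Cache API (write the bytes through `caches.open(...)` on first fetch) also makes the download survive page reloads, which matters a lot at multi-gigabyte sizes.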