3 Ways to Run AI in the Browser with Next.js (No API Keys Required)

The "API Fatigue" is Real

We are all building AI wrappers. But the standard pattern (frontend sends data to the backend → backend hits OpenAI → backend pays $0.03 → frontend displays the result) is getting expensive. And frankly, it's boring.

What if your users could run Whisper, YOLO, or even Llama directly in their Chrome tab?

I recently built CodeCoffeeTools.com entirely around this "Client-Side AI" concept. Here is the technical breakdown of how you can do it too using Transformers.js.

How is this even possible?

You might think running a neural network in JavaScript would be agonizingly slow. It used to be. But three technologies have converged to make this viable:

  • ONNX Runtime: A cross-platform accelerator for machine learning models.
  • WebAssembly (WASM): Allows C++ code (the ONNX Runtime) to run at near-native speed in the browser.
  • Transformers.js: A library by Hugging Face that bridges the gap, letting you use models just like you would in Python, but in JS. The one-line install comes right after this list.
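
If you want to follow along, everything below assumes Transformers.js v2, installed straight from npm (no API keys, no backend):

npm install @xenova/transformers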

The "Free" Audio Transcriber (Whisper)
Instead of paying OpenAI $0.006/minute, we use the Whisper-Tiny model.

import { pipeline } from '@xenova/transformers';
// 1. Create the pipeline (Downloads the 40MB model once)
const transcriber = await pipeline('automatic-speech-recognition', 'Xenova/whisper-tiny.en');
// 2. Transcribe
const result = await transcriber('https://example.com/my-audio.mp3');
// 3. Output
console.log(result.text); 
// Output: "Hello world, this is a test." 

Note: Always run this in a Web Worker so you don't freeze the UI while the transcription crunches numbers.
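
Here is a minimal sketch of that worker pattern. The file name and message shape are just examples (this isn't the CodeCoffeeTools source), but the pipeline calls are the same as above:

// transcribe-worker.js
import { pipeline } from '@xenova/transformers';

let transcriber = null;

self.onmessage = async (event) => {
  // Load the model on the first message; later calls reuse the cached pipeline
  if (!transcriber) {
    transcriber = await pipeline('automatic-speech-recognition', 'Xenova/whisper-tiny.en');
  }
  const result = await transcriber(event.data.audioUrl);
  self.postMessage({ text: result.text });
};

// main thread (e.g. inside a 'use client' component)
const worker = new Worker(new URL('./transcribe-worker.js', import.meta.url), { type: 'module' });
worker.onmessage = (event) => console.log(event.data.text);
worker.postMessage({ audioUrl: 'https://example.com/my-audio.mp3' });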

Real-Time Object Detection (DETR)

We can run object detection (identifying "person", "car", "dog") right in the browser using the detr-resnet-50 model. It's fast enough for interactive use, though the exact frame rate depends on the user's hardware.

import { pipeline } from '@xenova/transformers';

// 1. Load the detection pipeline
const detector = await pipeline('object-detection', 'Xenova/detr-resnet-50');
// 2. Detect objects in an image
const output = await detector('my-image.jpg');
// 3. Result
console.log(output);
/* [
  { label: 'person', score: 0.99, box: { xmin: 50, ymin: 20... } },
  { label: 'bicycle', score: 0.95, box: { xmin: 120, ymin: 80... } }
] 
*/
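
Turning that output into something visible only takes the Canvas API. A rough sketch, assuming `output` is the array above, `img` is the loaded <img> element, and there's a <canvas id="overlay"> positioned over it (all hypothetical names); the boxes come back in pixel coordinates by default:

const canvas = document.getElementById('overlay'); // hypothetical overlay canvas
const ctx = canvas.getContext('2d');
canvas.width = img.naturalWidth;
canvas.height = img.naturalHeight;
ctx.drawImage(img, 0, 0);

for (const { label, score, box } of output) {
  // box is { xmin, ymin, xmax, ymax } in pixels
  ctx.strokeStyle = 'lime';
  ctx.lineWidth = 2;
  ctx.strokeRect(box.xmin, box.ymin, box.xmax - box.xmin, box.ymax - box.ymin);
  ctx.fillStyle = 'lime';
  ctx.fillText(`${label} ${(score * 100).toFixed(0)}%`, box.xmin + 4, box.ymin + 12);
}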

Running a Chatbot Locally (Text Generation)
Yes, you can run LLMs in the browser. You won't run GPT-4, but you can run highly optimized models like Phi-2 or TinyLlama.

import { pipeline } from '@xenova/transformers';
const generator = await pipeline('text-generation', 'Xenova/TinyLlama-1.1B-Chat-v1.0');
const output = await generator('Write a tagline for a coffee shop:', {
  max_new_tokens: 20,
  temperature: 0.7,
  do_sample: true // temperature only takes effect when sampling is enabled
});
console.log(output[0].generated_text); 

Note: This requires WebGPU support for decent speed, which Transformers.js v3 (currently in alpha/beta) supports brilliantly.
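
For reference, this is roughly what the v3 API looks like: the package name changes and the pipeline accepts a device option. Treat it as a sketch, since the pre-release API may still shift:

// Transformers.js v3 (pre-release): note the new package name
import { pipeline } from '@huggingface/transformers';

const generator = await pipeline('text-generation', 'Xenova/TinyLlama-1.1B-Chat-v1.0', {
  device: 'webgpu', // needs a browser with WebGPU enabled
  dtype: 'q4'       // quantization level; not every model repo ships q4 weights
});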

Performance

Yes, it matters: "But won't this crash my user's browser?"

It can, if you aren't careful. Here is how I handled performance on CodeCoffeeTools:

  1. Quantization is Key: I never use the full "float32" models. I use "quantized" (q8 or q4) versions. They are 4x smaller and run 2x faster with almost zero loss in accuracy.
  2. Lazy Loading: Do NOT load the AI model when the homepage loads. Only load it when the user clicks "Start Tool" (see the sketch after this list).
  3. Cache It: Transformers.js caches model files in the browser. The first visit takes 10 seconds to download; the second visit takes 100ms.
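
Here is a sketch of the lazy-loading pattern as a Next.js App Router client component. The component name and audio URL are made up for the example; the important parts are the dynamic import and the pipeline call:

'use client';

import { useState } from 'react';

export default function WhisperTool() {
  const [text, setText] = useState('');

  async function start() {
    // Nothing AI-related is downloaded until the user clicks the button
    const { pipeline } = await import('@xenova/transformers');
    // v2 loads the quantized (q8) weights by default; pass { quantized: false } for float32
    const transcriber = await pipeline('automatic-speech-recognition', 'Xenova/whisper-tiny.en');
    const result = await transcriber('https://example.com/my-audio.mp3');
    setText(result.text);
  }

  return (
    <div>
      <button onClick={start}>Start Tool</button>
      <p>{text}</p>
    </div>
  );
}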

We are moving towards a web where the server just serves the app, and the client does the thinking.
If you want to see these examples running live (and critique my code), check out CodeCoffeeTools.com. I've open-sourced the "Privacy-First" philosophy, even if the repo itself is private for now.
What should I build next? A local Background Remover or a PDF Summarizer? Let me know below!
