Chrome quietly shipped something huge: built‑in AI running fully on-device (Gemini Nano). It lives right in the browser, with no extra install and no network calls.
I wanted to see how far I could push that in a real product, so I built DocuMentor AI, a Chrome extension that helps developers learn from technical content: smart summaries, “should I read this?” signals, and curated learning resources.
In production, this extension runs on both:
- Chrome’s local Gemini Nano (privacy‑first, device‑bound), and
- A cloud AI backend (faster, more capable)
And it picks the right path automatically. This post is the short version of how that hybrid architecture works.
The Problem: Two Very Different AIs
Most cloud LLMs give you one generic chat endpoint and expect you to solve everything with prompt engineering and tools.
Chrome’s built‑in AI is different. It exposes specialized APIs:
- `Summarizer` – for TL;DRs and key points
- `Writer` – for generating content
- `LanguageModel` – for general prompts and chat
If you try to hide all of that behind a single generate() function, one side (local or cloud) will always feel wrong.
So instead of abstracting by “model,” I abstracted by feature.
Step 1: Design the Interface Around Features
Inside the extension, there’s a small interface that describes what the system can do, not how any provider works:
- `summarize(text, options)`
- `executePrompt(messages, options)`
- `write(content, options)`
- `isAvailable()` / `initialize()`
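Here's a rough TypeScript sketch of that interface. The names and option shapes are my own placeholders, not an official API:

```ts
// Sketch of the feature-oriented interface both providers implement.
// Names and option shapes are illustrative, not from any official API.
interface SummarizeOptions {
  type?: "tldr" | "key-points";
  maxLength?: number;
}

interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

interface AIProvider {
  // Quick capability check before the UI offers any AI features.
  isAvailable(): Promise<boolean>;
  // One-time setup: warm the local model or verify the backend session.
  initialize(): Promise<void>;

  summarize(text: string, options?: SummarizeOptions): Promise<string>;
  executePrompt(messages: ChatMessage[], options?: { json?: boolean }): Promise<string>;
  write(content: string, options?: { tone?: string }): Promise<string>;
}
```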
Then I implemented two providers behind that interface:
- `GeminiNanoProvider` wraps Chrome's `Summarizer`, `LanguageModel`, and `Writer` APIs.
- `DocumentorAIProvider` talks to a backend that uses a single cloud LLM and prompt engineering to mimic those same capabilities.
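A trimmed-down sketch of the Nano side, hedged because Chrome's built-in AI surface has shifted between releases; treat the exact call shapes as illustrative rather than a stable contract:

```ts
// Hedged sketch: Summarizer.availability/create, LanguageModel.create and Writer.create
// reflect the current Chrome built-in AI shape, but the API has changed across versions.
declare const Summarizer: any;
declare const LanguageModel: any;
declare const Writer: any;

class GeminiNanoProvider implements AIProvider {
  async isAvailable(): Promise<boolean> {
    // "available" means the on-device model is downloaded and ready to use.
    return (await Summarizer.availability()) === "available";
  }

  async initialize(): Promise<void> {
    // Creating a session can kick off the model download on first use.
    await Summarizer.create({ type: "key-points", format: "markdown", length: "short" });
  }

  async summarize(text: string, options?: SummarizeOptions): Promise<string> {
    const summarizer = await Summarizer.create({
      type: options?.type === "tldr" ? "tldr" : "key-points",
    });
    return summarizer.summarize(text);
  }

  async executePrompt(messages: ChatMessage[]): Promise<string> {
    const session = await LanguageModel.create();
    // Nano does best with small, focused prompts, so collapse the history into one string.
    return session.prompt(messages.map((m) => `${m.role}: ${m.content}`).join("\n"));
  }

  async write(content: string): Promise<string> {
    const writer = await Writer.create({ tone: "neutral" });
    return writer.write(content);
  }
}
```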
Provider selection is simple:
- Logged‑in users → cloud provider (speed + stronger reasoning)
- Logged‑out users → local provider (privacy + offline‑ish)
No mid‑session switching. Auth state decides the provider, which keeps the extension logic predictable.
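As a sketch, that decision is a one-time check at startup (class names match the providers above; the auth flag comes from wherever you track login state):

```ts
// Chosen once per session from auth state; there is no mid-session switching.
async function selectProvider(isLoggedIn: boolean): Promise<AIProvider> {
  const provider: AIProvider = isLoggedIn
    ? new DocumentorAIProvider() // cloud: speed + stronger reasoning
    : new GeminiNanoProvider(); // local: privacy, no account needed
  await provider.initialize();
  return provider;
}
```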
Step 2: Different Strategies for Local vs Cloud AI
Once both providers matched the same interface, a new issue appeared: capability and reliability.
Cloud models can handle multi‑step workflows with tools in a single call.
Chrome’s local Gemini Nano can’t reliably do that, especially on CPU‑only machines:
- It struggles to maintain context over many steps.
- Tool calling is flaky.
- Structured JSON output isn’t consistent.
So I ended up with two orchestration strategies:
Local AI: Sequential, App‑Driven Orchestration
For features like YouTube video recommendations, the Chrome‑only path looks like:
- Ask Gemini Nano to generate a search query.
- Call the YouTube API directly in code.
- Ask Gemini Nano again to rank and rewrite the results.
- Parse and repair JSON defensively, with retries if needed.
The model only does small, focused reasoning tasks. The extension orchestrates everything else.
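Condensed, that local flow looks roughly like this. `searchYouTube` and `parseJsonLoosely` are placeholder helpers, not real library calls:

```ts
// Placeholder helpers: a plain YouTube Data API call and a forgiving JSON parser.
declare function searchYouTube(query: string): Promise<unknown[]>;
declare function parseJsonLoosely(raw: string): unknown;

// Local path: the extension drives the workflow; Nano only does small reasoning steps.
async function recommendVideosLocally(pageText: string, provider: AIProvider) {
  // Step 1: small, focused prompt – just produce a search query.
  const query = await provider.executePrompt([
    { role: "user", content: `Write one YouTube search query for learning about:\n${pageText.slice(0, 2000)}` },
  ]);

  // Step 2: plain code, no AI – call the YouTube API ourselves.
  const results = await searchYouTube(query.trim());

  // Step 3: another small prompt – rank and rewrite the results as JSON.
  const raw = await provider.executePrompt([
    { role: "user", content: `Rank these videos by relevance and return the top 3 as a JSON array:\n${JSON.stringify(results)}` },
  ]);

  // Step 4: parse defensively; local models often wrap JSON in prose or code fences.
  return parseJsonLoosely(raw);
}
```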
Cloud AI: One Tool‑Augmented Call
For the cloud provider, the same feature becomes:
- One prompt + tool‑calling: "Analyze this page, call `search_youtube`, score relevance, and return the top 3 videos as JSON."
The cloud model acts as planner + executor. The extension just calls it once and renders the result.
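Sketched out, the cloud version collapses into a single request. The endpoint and payload shape here are made up for illustration:

```ts
interface Video {
  title: string;
  url: string;
}

// Cloud path: one request; the backend model plans, calls search_youtube itself, and returns JSON.
async function recommendVideosViaCloud(pageText: string): Promise<Video[]> {
  const res = await fetch("https://api.example.com/recommend", { // placeholder endpoint
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      prompt:
        "Analyze this page, call search_youtube, score relevance, and return the top 3 videos as JSON.",
      page: pageText,
    }),
  });
  return res.json(); // the backend enforces valid JSON, so no defensive parsing here
}
```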
Same UX, two internal strategies:
- Local = sequential decomposition
- Cloud = single tool‑augmented call
Step 3: Respecting On‑Device Limits
Running AI inside the browser forces you to care about constraints the cloud usually hides:
- Token quotas: Use `measureInputUsage()` and `inputQuota` to check whether your input fits before prompting. Truncate instead of hoping (see the sketch after this list).
- Content extraction: Use something like Mozilla Readability to strip noise (nav, ads, footers) and cap content (~15K characters worked well) before sending it to AI.
- Sequential over parallel: Multiple Chrome AI calls in parallel were often slower and more fragile.
- UX win: Send a fast TL;DR first, and let heavier results stream in afterward.
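Here's roughly what the quota check looks like. `measureInputUsage()` and `inputQuota` are real properties on Chrome AI sessions, but the helper around them is mine:

```ts
// Minimal shape of what we use from a Chrome AI session (Summarizer or LanguageModel).
interface QuotaAwareSession {
  inputQuota: number;
  measureInputUsage(text: string): Promise<number>;
}

// Trim input to whatever fits the session's quota instead of hoping it does.
async function fitToQuota(session: QuotaAwareSession, text: string): Promise<string> {
  const usage = await session.measureInputUsage(text);
  if (usage <= session.inputQuota) return text;

  // Rough proportional truncation is good enough for article text.
  const ratio = session.inputQuota / usage;
  return text.slice(0, Math.floor(text.length * ratio * 0.9)); // keep a 10% safety margin
}
```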
Step 4: Fallbacks Users Don’t Have to Think About
The extension uses a simple, silent fallback chain:
- Try cloud AI (fast path).
- On any failure or quota hit, fall back to local AI.
- Only then show an error.
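In code, that chain is just a nested try/catch around the two providers (a sketch; the real error classification is a bit richer):

```ts
// Silent fallback: cloud first, local second, error only if both fail.
async function summarizeWithFallback(
  text: string,
  cloud: AIProvider,
  local: AIProvider
): Promise<string> {
  try {
    return await cloud.summarize(text);
  } catch (cloudError) {
    // Any cloud failure (network, quota, 5xx) quietly drops to the on-device model.
    try {
      return await local.summarize(text);
    } catch (localError) {
      // Only now does the UI surface an error to the user.
      throw new Error("AI is unavailable right now");
    }
  }
}
```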
The UI never says “switching providers now.” It just feels like:
- Most of the time → fast responses (cloud).
- Sometimes → slower, but still works (local).
Takeaways for Your Own Chrome AI Projects
If you’re thinking about mixing Chrome’s built‑in AI with cloud models in a Chrome extension:
- Design around features, not models.
- Let cloud models handle complex, tool‑heavy workflows in one call.
- Let the extension orchestrate and keep prompts small on local AI.
- Treat on‑device constraints (tokens, content size, parallelism) as first‑class design inputs.
- Build fallbacks so users get degraded performance, not hard errors.
That’s the gist of how I built a Chrome extension that quietly juggles local and cloud AI without exposing that complexity to users.