You want to add AI to your Chrome extension.
The obvious path: spin up a Node.js server, hold a master API key, charge users monthly, eat the AI cost. That's what everyone does.
I didn't do that. I built three Chrome extensions with AI features — PR summarization, risk scoring, draft review generation — and my monthly infrastructure bill is $0. No server. No backend. No API key to protect.
Here's the exact architecture, the real trade-offs, and the specific places where this approach breaks down so you don't find out the hard way.
The problem with the "standard" approach
Most AI-powered extensions work like this:
User → Extension → Your server → AI provider → Your server → Extension → User
Your server holds a master API key. Users pay you. You pay the AI provider out of that margin.
The problems:
You're a proxy business now. You're paying OpenAI $X, charging users $Y, and the difference is your margin. But you're also responsible for rate limiting, uptime, abuse prevention, and GDPR compliance for every request that touches your server.
Private code goes through your infra. For a developer tool that reads GitHub diffs, this is the question users ask first: "is my code going to your server?" With a hosted backend, the honest answer is yes.
You're competing on price against companies with VC money. CodeRabbit, GitHub Copilot, Linear, and a dozen others are running hosted AI with economies of scale you can't match as a solo developer.
There's a different architecture. It's not new — it's called BYOK (Bring Your Own Key), and it shifts the AI provider relationship from you to the user.
User → Extension → AI provider (user's own key)
No server in the middle. No margin math. No "is my code safe" question.
How BYOK works in a Chrome extension
The core mechanic is simple: instead of your extension calling your server, it calls the AI provider directly from the browser using the user's own API key.
// The user pastes their API key during onboarding.
// Store it locally and never send it anywhere except the provider.
await chrome.storage.local.set({
  aiApiKey: userProvidedKey,
  aiProvider: 'groq' // or 'openai', 'mistral', 'ollama'
});

// Every AI call uses their key, from their browser.
// getEndpoint() and getModel() are lookups into the per-provider config.
async function callAI(prompt) {
  const { aiApiKey, aiProvider } = await chrome.storage.local.get(['aiApiKey', 'aiProvider']);
  const endpoint = getEndpoint(aiProvider);
  const response = await fetch(endpoint, {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${aiApiKey}`,
      'Content-Type': 'application/json'
    },
    body: JSON.stringify({
      model: getModel(aiProvider),
      messages: [{ role: 'user', content: prompt }],
      max_tokens: 500
    })
  });
  // fetch() only rejects on network failure, so surface HTTP errors explicitly.
  if (!response.ok) {
    throw new Error(`AI provider returned ${response.status}`);
  }
  return response.json();
}
The API key lives in chrome.storage.local. It never leaves the browser except to go directly to the AI provider. The extension reads it back from local storage for each call, but it is never transmitted to you or to any server you run.
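callAI returns the raw JSON, so the caller still has to dig out the generated text. A minimal sketch of that step, assuming the OpenAI-compatible response shape (choices[0].message.content) and OpenAI-style error bodies. The helper name extractCompletionText is hypothetical and not taken from the extension's code:

```javascript
// Hypothetical helper: pulls the generated text out of an
// OpenAI-compatible chat completion payload.
function extractCompletionText(payload) {
  // Assumption: API-level failures arrive as { error: { message } }.
  if (payload && payload.error) {
    throw new Error(payload.error.message || 'Provider returned an error');
  }
  const content = payload?.choices?.[0]?.message?.content;
  if (typeof content !== 'string') {
    throw new Error('Unexpected response shape: no choices[0].message.content');
  }
  return content.trim();
}

// Usage inside the extension (not executed here):
//   const text = extractCompletionText(await callAI('Summarize this diff: ...'));
```

Throwing on an unexpected shape keeps a provider-side change from silently rendering "undefined" in the UI.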
The manifest.json permissions you actually need
For direct API calls from a Chrome extension, declare host permissions for each provider you support:
{
  "manifest_version": 3,
  "permissions": [
    "storage"
  ],
  "host_permissions": [
    "https://api.openai.com/*",
    "https://api.groq.com/*",
    "https://api.mistral.ai/*",
    "http://localhost:*/*"
  ]
}
The localhost entry covers Ollama — for users who want a fully local model with zero API costs.
Important: In MV3, host permissions are scrutinized during review. Be specific. Don't use <all_urls> when you can name the exact domains. I've been through CWS review twice with this manifest, and being explicit helps.
Supporting multiple providers without a mess
All four providers here (Groq, OpenAI, Mistral, and Ollama) accept the OpenAI-compatible /v1/chat/completions request format. One implementation, four providers:
const AI_PROVIDERS = {
  groq: {
    endpoint: 'https://api.groq.com/openai/v1/chat/completions',
    model: 'llama-3.3-70b-versatile',
    maxTokens: 1024,
    supportsStreaming: true,
  },
  openai: {
    endpoint: 'https://api.openai.com/v1/chat/completions',
    model: 'gpt-4o-mini',
    maxTokens: 1024,
    supportsStreaming: true,
  },
  mistral: {
    endpoint: 'https://api.mistral.ai/v1/chat/completions',
    model: 'mistral-small-latest',
    maxTokens: 1024,
    supportsStreaming: false,
  },
  ollama: {
    endpoint: 'http://localhost:11434/v1/chat/completions',
    model: 'llama3.2',
    maxTokens: 1024,
    supportsStreaming: true,
  }
};

async function getProviderConfig() {
  const { aiProvider } = await chrome.storage.local.get('aiProvider');
  return AI_PROVIDERS[aiProvider] || AI_PROVIDERS.groq;
}
Store the model name here, not hardcoded in your fetch calls. When Groq deprecated an older Llama version, I pushed one config update and every user was on the new model automatically — no user action required.
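With the config in one place, the fetch arguments can be derived from it rather than from scattered lookup helpers. A sketch under assumptions: buildChatRequest is a hypothetical name, the two-entry PROVIDERS map is a trimmed copy of the config so the snippet stands alone, and leaving out the Authorization header when no key is set (as with a local Ollama) is an assumption rather than something the extension is known to do.

```javascript
// Trimmed copy of the provider config so this snippet is self-contained.
const PROVIDERS = {
  groq: {
    endpoint: 'https://api.groq.com/openai/v1/chat/completions',
    model: 'llama-3.3-70b-versatile',
    maxTokens: 1024,
  },
  ollama: {
    endpoint: 'http://localhost:11434/v1/chat/completions',
    model: 'llama3.2',
    maxTokens: 1024,
  },
};

// Hypothetical helper: returns [url, init], ready to spread into fetch().
function buildChatRequest(providerName, apiKey, prompt, stream = false) {
  // Same fallback as getProviderConfig: unknown providers use Groq.
  const config = PROVIDERS[providerName] || PROVIDERS.groq;
  const headers = { 'Content-Type': 'application/json' };
  // Assumption: send the bearer token only when a key is actually set.
  if (apiKey) headers['Authorization'] = `Bearer ${apiKey}`;
  const body = {
    model: config.model,
    messages: [{ role: 'user', content: prompt }],
    max_tokens: config.maxTokens,
    stream,
  };
  return [config.endpoint, { method: 'POST', headers, body: JSON.stringify(body) }];
}

// Usage inside the extension (not executed here):
//   const response = await fetch(...buildChatRequest('groq', aiApiKey, prompt));
```

Because the model name only appears in the config, swapping a deprecated model stays a one-line change.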
The onboarding friction problem — and how to reduce it
Here's the real cost of BYOK: users have to get an API key before they can use your AI features. Some users bounce at this step.
What actually reduces friction:
1. Lead with Groq. Groq's free tier covers ~14,400 requests per day for smaller models. For most individual developers, it's genuinely free. This changes the conversation from "go pay for an API key" to "go get a free API key in 2 minutes."
2. Give the exact steps, not a vague instruction:
Step 1: Go to console.groq.com/keys
Step 2: Click "Create API key"
Step 3: Paste the key here → [input]
Three lines. No ambiguity. I track where users drop off in onboarding — the step with the most abandonment is always the one where I said "get your API key" without saying exactly where.
3. Make core features work without AI. If every feature is gated behind BYOK setup, the first session is a setup session — and many users don't return for a second. In PR Focus, multi-account GitHub, PR sorting, CSV export, and stale notifications all work without any API key. The AI features are additive.
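The "AI is additive" rule from point 3 can be enforced in one place. A small sketch, assuming the feature names below (they are illustrative labels, not the extension's real identifiers) and assuming that a local Ollama setup needs no key while hosted providers do:

```javascript
// Illustrative feature names; the split mirrors the list in point 3.
const CORE_FEATURES = ['multiAccount', 'prSorting', 'csvExport', 'staleNotifications'];
const AI_FEATURES = ['prSummary', 'riskScore', 'draftReview'];

// Hypothetical helper: decides what to enable from the stored settings.
function getEnabledFeatures(settings = {}) {
  const { aiApiKey, aiProvider } = settings;
  // Assumption: Ollama runs locally without a key; other providers need one.
  const hasKey = typeof aiApiKey === 'string' && aiApiKey.trim() !== '';
  const aiReady = aiProvider === 'ollama' || hasKey;
  return aiReady ? [...CORE_FEATURES, ...AI_FEATURES] : [...CORE_FEATURES];
}

// Usage inside the extension (not executed here):
//   const settings = await chrome.storage.local.get(['aiApiKey', 'aiProvider']);
//   const features = getEnabledFeatures(settings);
```

A fresh install with empty storage gets every core feature, so the first session is never a setup session.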
The MV3 service worker problem with streaming
If you want to stream AI responses token by token, you hit an MV3 constraint: service workers handle the API calls, but streaming requires a long-lived connection, and service workers can be terminated mid-stream.
The pattern that works — service worker handles the fetch, sends tokens to the popup via messages:
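// Service worker: handles the streaming fetch
chrome.runtime.onMessage.addListener((message, sender) => {
  if (message.type === 'STREAM_AI') {
    // sender.tab is only set when a content script sent the message.
    // A popup has no tab, so tabId is undefined and replies go via chrome.runtime.
    streamAIResponse(message.prompt, sender.tab?.id);
  }
});

// Deliver to the sender's tab if there is one, otherwise to extension pages (the popup).
function sendToUI(tabId, payload) {
  const pending = tabId !== undefined
    ? chrome.tabs.sendMessage(tabId, payload)
    : chrome.runtime.sendMessage(payload);
  // The popup or tab may have closed mid-stream; ignore "no receiver" rejections.
  pending.catch(() => {});
}

async function streamAIResponse(prompt, tabId) {
  const config = await getProviderConfig();
  const { aiApiKey } = await chrome.storage.local.get('aiApiKey');
  const response = await fetch(config.endpoint, {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${aiApiKey}`,
      'Content-Type': 'application/json'
    },
    body: JSON.stringify({
      model: config.model,
      messages: [{ role: 'user', content: prompt }],
      stream: true
    })
  });
  if (!response.ok || !response.body) {
    // AI_ERROR is an added message type; handle it in the UI.
    sendToUI(tabId, { type: 'AI_ERROR', status: response.status });
    return;
  }
  const reader = response.body.getReader();
  const decoder = new TextDecoder();
  let buffer = ''; // holds a partial line that was split across network chunks
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    // stream: true keeps multi-byte characters intact across chunk boundaries
    buffer += decoder.decode(value, { stream: true });
    const lines = buffer.split('\n');
    buffer = lines.pop(); // the last piece is incomplete until its newline arrives
    for (const line of lines) {
      if (!line.startsWith('data: ')) continue;
      const data = line.slice(6).trim();
      if (data === '[DONE]') continue;
      try {
        const parsed = JSON.parse(data);
        const token = parsed.choices?.[0]?.delta?.content || '';
        if (token) sendToUI(tabId, { type: 'AI_TOKEN', token });
      } catch (e) {
        // Skip malformed events
      }
    }
  }
  sendToUI(tabId, { type: 'AI_DONE' });
}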
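In practice the service worker stays alive while tokens are flowing, because each extension API call (here, the per-token sendMessage) resets its idle timer. It can still be terminated if the provider stalls for roughly 30 seconds, so treat a stream that ends without AI_DONE as a failure. Tokens go to the popup via messages. The popup accumulates them and renders progressively.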
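The receiving side can be sketched as follows. This is a minimal illustration, not the extension's actual popup code: createTokenAccumulator and the ai-output element id are hypothetical, and the only parts taken from the article are the AI_TOKEN and AI_DONE message types.

```javascript
// Hypothetical popup-side helper: collects AI_TOKEN messages and calls
// render(textSoFar, isDone) after every message it handles.
function createTokenAccumulator(render) {
  let text = '';
  return function handleMessage(message) {
    if (message.type === 'AI_TOKEN') {
      text += message.token;
      render(text, false);
    } else if (message.type === 'AI_DONE') {
      render(text, true);
      text = ''; // reset so the next request starts clean
    }
  };
}

// Wiring in the popup. Guarded so the snippet also loads outside an extension.
if (typeof chrome !== 'undefined' && chrome.runtime?.onMessage && typeof document !== 'undefined') {
  const output = document.getElementById('ai-output'); // assumed element id
  if (output) {
    chrome.runtime.onMessage.addListener(
      createTokenAccumulator((textSoFar) => {
        // textContent avoids interpreting model output as HTML
        output.textContent = textSoFar;
      })
    );
  }
}
```

Keeping the accumulator separate from the chrome wiring makes it easy to unit test without a browser.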
Specific error handling — this saves you support tickets
The most common support category with BYOK: users with wrong or misconfigured keys. Generic "AI error" messages generate follow-up tickets. Status-code-specific messages don't:
async function validateApiKey(apiKey, provider) {
  try {
    const config = AI_PROVIDERS[provider];
    const response = await fetch(config.endpoint, {
      method: 'POST',
      headers: {
        'Authorization': `Bearer ${apiKey}`,
        'Content-Type': 'application/json'
      },
      body: JSON.stringify({
        model: config.model,
        messages: [{ role: 'user', content: 'test' }],
        max_tokens: 1
      })
    });
    if (response.status === 401)
      return { valid: false, error: 'Invalid API key: check you copied it completely, no trailing spaces.' };
    if (response.status === 429)
      return { valid: false, error: 'Rate limit hit: your key is valid but you\'ve hit the free tier ceiling.' };
    if (response.status === 403)
      return { valid: false, error: 'Permission denied: this key may not have access to this model tier.' };
    if (!response.ok)
      return { valid: false, error: `Provider returned ${response.status}. Try again in a moment.` };
    return { valid: true };
  } catch (e) {
    return { valid: false, error: 'Network error: check your internet connection or try a different provider.' };
  }
}
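Some of those 401s can be avoided before the request is ever sent, by cleaning up what the user pasted. A small sketch: normalizeApiKey is a hypothetical helper, and the specific cleanups (surrounding whitespace, wrapping quotes, a pasted "Bearer " prefix, stray line breaks) are assumptions about common copy-paste mistakes, not rules published by any provider.

```javascript
// Hypothetical helper: tidies a pasted API key before it is stored or validated.
function normalizeApiKey(raw) {
  if (typeof raw !== 'string') return '';
  let key = raw.trim();
  // Strip quotes copied along with the key from a config file or shell command.
  key = key.replace(/^["'`]+|["'`]+$/g, '');
  // Strip a "Bearer " prefix copied from a curl example.
  key = key.replace(/^Bearer\s+/i, '');
  // Assumption: keys contain no whitespace, so remove any that remains
  // (for example a line break from a wrapped terminal copy).
  key = key.replace(/\s+/g, '');
  return key;
}

// Usage inside the extension (not executed here):
//   const result = await validateApiKey(normalizeApiKey(input.value), provider);
```

Running this on the onboarding input means the "trailing spaces" hint in the 401 message is a last resort rather than the first line of defense.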
Token costs in practice — real numbers
A typical PR summary in PR Focus: ~800 tokens input (diff context + system prompt), ~150 tokens output. ~950 tokens per PR.
| Provider | Tier | Cost per PR | Cost at 100 PRs/day |
|---|---|---|---|
| Groq (Llama 3.3 70B) | Free | $0 | $0 |
| OpenAI GPT-4o-mini | Paid | ~$0.0001 | ~$0.01 |
| Mistral Small | Paid | ~$0.00008 | ~$0.008 |
| Ollama (local) | Free | $0 | $0 |
The cost argument for BYOK isn't just privacy — it's math. A hosted model charging $10/month makes pennies after AI costs and infrastructure. Users with their own Groq key pay nothing for individual use. That's a value proposition you can't match with a hosted backend.
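The per-PR figures are simple arithmetic on token counts and per-million-token prices. A sketch with assumed prices ($0.15 per million input tokens and $0.60 per million output tokens). These are placeholders for illustration, so check your provider's current pricing page before relying on them; the function names are hypothetical too.

```javascript
// Prices are in USD per 1 million tokens.
function costPerRequest(inputTokens, outputTokens, inputPricePerM, outputPricePerM) {
  return (inputTokens * inputPricePerM + outputTokens * outputPricePerM) / 1_000_000;
}

function dailyCost(perRequest, requestsPerDay) {
  return perRequest * requestsPerDay;
}

// Token counts from the typical PR summary: ~800 in, ~150 out.
// The two prices are assumptions, not quotes from any provider.
const perPR = costPerRequest(800, 150, 0.15, 0.60);
console.log(perPR.toFixed(5));                 // 0.00021
console.log(dailyCost(perPR, 100).toFixed(3)); // 0.021
```

Under those assumed prices a 950-token PR summary costs about two hundredths of a cent, the same order of magnitude as the paid rows in the table.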
What breaks — be honest about it
Corporate users behind strict proxies. Some enterprise environments block direct browser-to-external-API calls. You can't fix this. Be upfront about it, and point to Ollama as the local workaround.
Ollama requires a separate install. It's not "just paste a key" — it's "install Ollama, pull a model, run it locally, then configure the extension." Worth supporting for privacy-first users, but don't pitch it as the simple path.
You can't cache responses across users. Each user's key means each user pays for their own calls, so there is no shared cache. For most use cases this doesn't matter, but if you're building something where 1000 users asking the same question is likely, a hosted backend with caching will be cheaper for them.
Is BYOK right for your extension?
Yes, if:
- Your users are developers or technical enough that "API key" isn't a foreign concept
- Privacy is a genuine selling point (code review, writing assistance, anything involving private data)
- You're solo and don't want to operate infrastructure
- You want a free tier without eating AI costs
No, if:
- Your audience is non-technical and "API key" will lose them before they get to your value
- You need to control which model is used for consistency or quality reasons
- You want platform-level caching, rate limiting, or abuse prevention
- You're fine with a subscription model and want the simplicity of a managed service
The architecture in one diagram
chrome.storage.local
├── aiApiKey ← user's own, never leaves browser except to provider
└── aiProvider ← 'groq' | 'openai' | 'mistral' | 'ollama'
Popup / content script
└── message → service worker: { type: 'STREAM_AI', prompt }
Service worker
├── reads key + provider from storage
├── calls provider API directly (fetch)
└── streams tokens → popup via chrome.runtime.sendMessage
Infrastructure cost: $0
Monthly AI bill: $0
Trust question ("does my code go to your server?"): No.
See it running in production
Everything in this article is running in PR Focus Pro — a Chrome extension that triages GitHub pull requests with AI summaries, hybrid risk scoring (0–100), and one-click draft reviews. Free to install; AI features activate with your own API key.
The full engineering decision log behind this architecture — including the options I rejected, what it cost in user friction, and whether I'd choose it again — is Build Log #007 in my public Build Logs repo.
If you're building something similar and want a second pair of eyes on your implementation, the Summer Review Swap is open — there's a PR waiting for a reviewer right now if you want to jump straight in.
What's your approach to AI in browser extensions? Running your own backend, BYOK, or something else entirely? Particularly curious whether anyone has found a cleaner solution to the streaming + service worker termination problem — drop it in the comments.