Projekta2

Posted on Jun 28

I built an AI Chrome extension with zero backend cost — here's the exact architecture

#chrome #ai #javascript #webdev


You want to add AI to your Chrome extension.

The obvious path: spin up a Node.js server, hold a master API key, charge users monthly, eat the AI cost. That's what everyone does.

I didn't do that. I built three Chrome extensions with AI features — PR summarization, risk scoring, draft review generation — and my monthly infrastructure bill is $0. No server. No backend. No API key to protect.

Here's the exact architecture, the real trade-offs, and the specific places where this approach breaks down so you don't find out the hard way.

The problem with the "standard" approach

Most AI-powered extensions work like this:

User → Extension → Your server → AI provider → Your server → Extension → User

Your server holds a master API key. Users pay you. You pay the AI provider out of that margin.

The problems:

You're a proxy business now. You're paying OpenAI $X, charging users $Y, and the difference is your margin. But you're also responsible for rate limiting, uptime, abuse prevention, and GDPR compliance for every request that touches your server.
Private code goes through your infra. For a developer tool that reads GitHub diffs, this is the question users ask first: "is my code going to your server?" With a hosted backend, the honest answer is yes.
You're competing on price against companies with VC money. CodeRabbit, GitHub Copilot, Linear, and a dozen others are running hosted AI with economies of scale you can't match as a solo developer.

There's a different architecture. It's not new — it's called BYOK (Bring Your Own Key), and it shifts the AI provider relationship from you to the user.

User → Extension → AI provider (user's own key)

No server in the middle. No margin math. No "is my code safe" question.

How BYOK works in a Chrome extension

The core mechanic is simple: instead of your extension calling your server, it calls the AI provider directly from the browser using the user's own API key.

// The user pastes their API key during onboarding
// You store it locally — never send it anywhere else
await chrome.storage.local.set({ 
  aiApiKey: userProvidedKey,
  aiProvider: 'groq' // or 'openai', 'mistral', 'ollama'
});

// Every AI call uses their key, from their browser
async function callAI(prompt) {
  const { aiApiKey, aiProvider } = await chrome.storage.local.get(['aiApiKey', 'aiProvider']);

  // getEndpoint() and getModel() are small lookups into a per-provider config map
  const endpoint = getEndpoint(aiProvider);

  const response = await fetch(endpoint, {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${aiApiKey}`,
      'Content-Type': 'application/json'
    },
    body: JSON.stringify({
      model: getModel(aiProvider),
      messages: [{ role: 'user', content: prompt }],
      max_tokens: 500
    })
  });

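  // Surface HTTP errors instead of parsing an error body as a normal result
  if (!response.ok) {
    throw new Error(`AI provider returned ${response.status}`);
  }

  return response.json();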
}

The API key lives in chrome.storage.local. It never leaves the browser except to go directly to the AI provider. The extension reads it back for each call, but you as the developer never see it, because there is no server of yours in the path.
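
Stray whitespace or quotes picked up during copy-paste are a plausible cause of rejected keys, so it can help to clean the input before storing it. The following is a minimal sketch under that assumption; `normalizeApiKey` and `maskApiKey` are hypothetical helper names, not part of the extension described here.

```javascript
// Hypothetical helpers (not from the original extension).

// Cleans up a pasted key before it is stored. Returns null if unusable.
function normalizeApiKey(raw) {
  if (typeof raw !== 'string') return null;
  // Strip surrounding whitespace, then a leading and/or trailing quote
  const key = raw.trim().replace(/^["']|["']$/g, '').trim();
  // Reject empty input or anything that still contains whitespace
  if (key === '' || /\s/.test(key)) return null;
  return key;
}

// Shows only the last 4 characters, for display in a settings screen.
function maskApiKey(key) {
  if (key.length <= 4) return '*'.repeat(key.length);
  return '*'.repeat(key.length - 4) + key.slice(-4);
}

// Possible usage inside the extension (`input` is an assumed variable
// holding the onboarding field's value):
//   const key = normalizeApiKey(input);
//   if (key) await chrome.storage.local.set({ aiApiKey: key });
```

Returning `null` rather than throwing keeps the onboarding form logic simple: a falsy result means "show the paste-again hint".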

The manifest.json permissions you actually need

For direct API calls from a Chrome extension, declare host permissions for each provider you support:

{
  "manifest_version": 3,
  "permissions": [
    "storage"
  ],
  "host_permissions": [
    "https://api.openai.com/*",
    "https://api.groq.com/*",
    "https://api.mistral.ai/*",
    "http://localhost:*/*"
  ]
}

The localhost entry covers Ollama — for users who want a fully local model with zero API costs.

Important: In MV3, host permissions are scrutinized during review. Be specific. Don't use <all_urls> when you can name the exact domains. I've been through CWS review twice with this manifest — being explicit helps.

Supporting multiple providers without a mess

All four providers below expose an OpenAI-compatible chat completions endpoint. One implementation, four providers:

const AI_PROVIDERS = {
  groq: {
    endpoint: 'https://api.groq.com/openai/v1/chat/completions',
    model: 'llama-3.3-70b-versatile',
    maxTokens: 1024,
    supportsStreaming: true,
  },
  openai: {
    endpoint: 'https://api.openai.com/v1/chat/completions',
    model: 'gpt-4o-mini',
    maxTokens: 1024,
    supportsStreaming: true,
  },
  mistral: {
    endpoint: 'https://api.mistral.ai/v1/chat/completions',
    model: 'mistral-small-latest',
    maxTokens: 1024,
    supportsStreaming: false,
  },
  ollama: {
    endpoint: 'http://localhost:11434/v1/chat/completions',
    model: 'llama3.2',
    maxTokens: 1024,
    supportsStreaming: true,
  }
};

async function getProviderConfig() {
  const { aiProvider } = await chrome.storage.local.get('aiProvider');
  return AI_PROVIDERS[aiProvider] || AI_PROVIDERS.groq;
}

Store the model name here, not hardcoded in your fetch calls. When Groq deprecated an older Llama version, I pushed one config update and every user was on the new model automatically — no user action required.
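
To illustrate how a config entry can feed the fetch call, here is a minimal sketch. `buildChatRequest` is a hypothetical helper, and `PROVIDERS` is a trimmed copy of the map above so the snippet runs on its own.

```javascript
// Trimmed copy of the provider map, so this sketch is self-contained.
const PROVIDERS = {
  groq: {
    endpoint: 'https://api.groq.com/openai/v1/chat/completions',
    model: 'llama-3.3-70b-versatile',
    maxTokens: 1024,
    supportsStreaming: true,
  },
  mistral: {
    endpoint: 'https://api.mistral.ai/v1/chat/completions',
    model: 'mistral-small-latest',
    maxTokens: 1024,
    supportsStreaming: false,
  },
};

// Hypothetical helper: turns (config, key, prompt) into fetch() arguments.
function buildChatRequest(config, apiKey, prompt, { stream = false } = {}) {
  const body = {
    model: config.model,
    messages: [{ role: 'user', content: prompt }],
    max_tokens: config.maxTokens,
  };
  // Only request streaming when the config entry says it is supported
  if (stream && config.supportsStreaming) body.stream = true;

  return {
    url: config.endpoint,
    init: {
      method: 'POST',
      headers: {
        'Authorization': `Bearer ${apiKey}`,
        'Content-Type': 'application/json',
      },
      body: JSON.stringify(body),
    },
  };
}

// Possible usage:
//   const { url, init } = buildChatRequest(PROVIDERS.groq, key, 'Summarize this diff');
//   const response = await fetch(url, init);
```

Keeping request construction separate from the network call means the provider-specific parts can be checked without making a real request.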

The onboarding friction problem — and how to reduce it

Here's the real cost of BYOK: users have to get an API key before they can use your AI features. Some users bounce at this step.

What actually reduces friction:

1. Lead with Groq. Groq's free tier covers ~14,400 requests per day for smaller models. For most individual developers, it's genuinely free. This changes the conversation from "go pay for an API key" to "go get a free API key in 2 minutes."

2. Give the exact steps, not a vague instruction:

Step 1: Go to console.groq.com/keys
Step 2: Click "Create API key"
Step 3: Paste the key here → [input]

Three lines. No ambiguity. I track where users drop off in onboarding — the step with the most abandonment is always the one where I said "get your API key" without saying exactly where.

3. Make core features work without AI. If every feature is gated behind BYOK setup, the first session is a setup session — and many users don't return for a second. In PR Focus, multi-account GitHub, PR sorting, CSV export, and stale notifications all work without any API key. The AI features are additive.

The MV3 service worker problem with streaming

If you want to stream AI responses token by token, you hit an MV3 constraint: service workers handle the API calls, but streaming requires a long-lived connection, and service workers can be terminated mid-stream.

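The pattern that works: the service worker handles the fetch and sends tokens back to the UI that asked for them via messages:

// Service worker: handles the streaming fetch
chrome.runtime.onMessage.addListener((message, sender) => {
  if (message.type === 'STREAM_AI') {
    // sender.tab is only present when the message comes from a tab
    // (for example a content script); a popup sender has no tab.
    if (!sender.tab) return;
    // Tokens are pushed back as separate messages, so sendResponse is not used
    streamAIResponse(message.prompt, sender.tab.id);
  }
});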

async function streamAIResponse(prompt, tabId) {
  const config = await getProviderConfig();
  const { aiApiKey } = await chrome.storage.local.get('aiApiKey');

  const response = await fetch(config.endpoint, {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${aiApiKey}`,
      'Content-Type': 'application/json'
    },
    body: JSON.stringify({
      model: config.model,
      messages: [{ role: 'user', content: prompt }],
      stream: true
    })
  });

  const reader = response.body.getReader();
  const decoder = new TextDecoder();

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;

    const chunk = decoder.decode(value);
    const lines = chunk.split('\n').filter(line => line.startsWith('data: '));

    for (const line of lines) {
      const data = line.slice(6);
      if (data === '[DONE]') continue;

      try {
        const parsed = JSON.parse(data);
        const token = parsed.choices[0]?.delta?.content || '';

        chrome.tabs.sendMessage(tabId, { type: 'AI_TOKEN', token });
      } catch (e) {
        // Skip malformed chunks — they happen
      }
    }
  }

  chrome.tabs.sendMessage(tabId, { type: 'AI_DONE' });
}

An active stream generally keeps the service worker alive, but that is not guaranteed: Chrome can still terminate the worker mid-stream. Tokens go back to the UI via messages. The receiving side accumulates them and renders progressively.
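
The receiving side is not shown above, so here is a minimal sketch of how it might accumulate tokens. `createTokenAccumulator` and the `outputEl` element in the usage comment are hypothetical names; the message types match the ones used in the service worker code.

```javascript
// Hypothetical receiver-side helper: accumulates streamed tokens and calls
// `render` with the full text so far after each token arrives.
function createTokenAccumulator(render, onDone = () => {}) {
  let text = '';
  return function handleMessage(message) {
    if (message.type === 'AI_TOKEN') {
      text += message.token;
      render(text);
    } else if (message.type === 'AI_DONE') {
      onDone(text);
      text = ''; // reset so the next stream starts clean
    }
  };
}

// Possible wiring inside the extension UI (outputEl is an assumed DOM element):
//   const handle = createTokenAccumulator(t => { outputEl.textContent = t; });
//   chrome.runtime.onMessage.addListener(handle);
```

Rendering the full accumulated text each time, rather than appending the token to the DOM, keeps the UI correct even if a render is skipped.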

Specific error handling — this saves you support tickets

The most common support category with BYOK: users with wrong or misconfigured keys. Generic "AI error" messages generate follow-up tickets. Status-code-specific messages don't:

async function validateApiKey(apiKey, provider) {
  try {
    const config = AI_PROVIDERS[provider];
    const response = await fetch(config.endpoint, {
      method: 'POST',
      headers: { 
        'Authorization': `Bearer ${apiKey}`, 
        'Content-Type': 'application/json' 
      },
      body: JSON.stringify({
        model: config.model,
        messages: [{ role: 'user', content: 'test' }],
        max_tokens: 1
      })
    });

    if (response.status === 401) 
      return { valid: false, error: 'Invalid API key — check you copied it completely, no trailing spaces.' };
    if (response.status === 429) 
      return { valid: false, error: 'Rate limit hit — your key is valid but you\'ve hit the free tier ceiling.' };
    if (response.status === 403) 
      return { valid: false, error: 'Permission denied — this key may not have access to this model tier.' };
    if (!response.ok) 
      return { valid: false, error: `Provider returned ${response.status} — try again in a moment.` };

    return { valid: true };
  } catch (e) {
    return { valid: false, error: 'Network error — check your internet connection or try a different provider.' };
  }
}

Token costs in practice — real numbers

A typical PR summary in PR Focus: ~800 tokens input (diff context + system prompt), ~150 tokens output. ~950 tokens per PR.

Provider	Tier	Cost per PR	100 PRs/day
Groq (Llama 3.3 70B)	Free	$0	$0
OpenAI GPT-4o-mini	Paid	~$0.0001	~$0.01
Mistral Small	Paid	~$0.00008	~$0.008
Ollama (local)	Free	$0	$0

The cost argument for BYOK isn't just privacy — it's math. A hosted model charging $10/month makes pennies after AI costs and infrastructure. Users with their own Groq key pay nothing for individual use. That's a value proposition you can't match with a hosted backend.
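
The per-PR arithmetic can be sketched as a small function. The token counts come from the figures above; the per-million-token prices are illustrative placeholders, not any provider's actual rates, and `estimateCostUSD` is a hypothetical helper name.

```javascript
// Hypothetical helper. Prices are in USD per million tokens.
function estimateCostUSD({ inputTokens, outputTokens, inputPricePerM, outputPricePerM }) {
  return (inputTokens * inputPricePerM + outputTokens * outputPricePerM) / 1e6;
}

const perPR = estimateCostUSD({
  inputTokens: 800,      // diff context + system prompt (figure from the article)
  outputTokens: 150,     // summary length (figure from the article)
  inputPricePerM: 0.10,  // assumed placeholder price
  outputPricePerM: 0.40, // assumed placeholder price
});

console.log(perPR.toFixed(5));         // estimated cost per PR, USD
console.log((perPR * 100).toFixed(3)); // estimated cost for 100 PRs/day, USD
```

Plugging in a provider's current published prices gives the numbers for your own table; the point is that output tokens are usually priced higher than input tokens, so both terms matter.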

What breaks — be honest about it

Corporate users behind strict proxies. Some enterprise environments block direct browser-to-external-API calls. You can't fix this. Be upfront about it, and point to Ollama as the local workaround.

Ollama requires a separate install. It's not "just paste a key" — it's "install Ollama, pull a model, run it locally, then configure the extension." Worth supporting for privacy-first users, but don't pitch it as the simple path.

You can't cache responses across users. Each user's key means each user pays for their own calls, and there is no shared layer where one user's answer could be reused for another. For most use cases this doesn't matter, but if you're building something where 1000 users asking the same question is likely, hosted with caching will be cheaper for them.

Is BYOK right for your extension?

Yes, if:

Your users are developers or technical enough that "API key" isn't a foreign concept
Privacy is a genuine selling point (code review, writing assistance, anything involving private data)
You're solo and don't want to operate infrastructure
You want a free tier without eating AI costs

No, if:

Your audience is non-technical and "API key" will lose them before they get to your value
You need to control which model is used for consistency or quality reasons
You want platform-level caching, rate limiting, or abuse prevention
You're fine with a subscription model and want the simplicity of a managed service

The architecture in one diagram

chrome.storage.local
  ├── aiApiKey      ← user's own, never leaves browser except to provider
  └── aiProvider    ← 'groq' | 'openai' | 'mistral' | 'ollama'

Popup / content script
  └── message → service worker: { type: 'RUN_AI', prompt }

Service worker
  ├── reads key + provider from storage
  ├── calls provider API directly (fetch)
  └── streams tokens → popup via chrome.runtime.sendMessage

Infrastructure cost: $0
Monthly AI bill: $0
Trust question ("does my code go to your server?"): No.

See it running in production

Everything in this article is running in PR Focus Pro — a Chrome extension that triages GitHub pull requests with AI summaries, hybrid risk scoring (0–100), and one-click draft reviews. Free to install; AI features activate with your own API key.

The full engineering decision log behind this architecture — including the options I rejected, what it cost in user friction, and whether I'd choose it again — is Build Log #007 in my public Build Logs repo.

If you're building something similar and want a second pair of eyes on your implementation, the Summer Review Swap is open — there's a PR waiting for a reviewer right now if you want to jump straight in.

What's your approach to AI in browser extensions? Running your own backend, BYOK, or something else entirely? Particularly curious whether anyone has found a cleaner solution to the streaming + service worker termination problem — drop it in the comments.


Top comments (18)

UnitBuilds • Jun 28

These days, just BYOM, everyone uses AI, they have their favorites, so let them use their favorites, but allow them to experiment with the rest. A single Google OAuth is good enough to sign them in to just about all of them. Users are also scared of 'companies stealing their data', they don't much care that Google, OpenAI, Anthropic, SpaceXAI all do the same... But if you're going to leak your own data, at least let it hide in a crowd of billions?

But it's their choice. The best you can do as a provider, is provide a choice. It lowers your running costs, it lowers your liability and it creates customer trust that the app is there as a tool for the user, not for data harvesting.

Projekta2 • Jun 29

Really good point on BYOM — and honestly the direction I'm leaning for the next iteration. Supporting OpenRouter already gets you most of the way there (30+ models, one key, user picks their favorite). The OAuth angle is interesting though: the friction of "paste a key" vs "sign in with Google" is real, and for less technical users OAuth wins every time.

The privacy paradox you're describing is accurate and a bit uncomfortable to sit with. Users are worried about my server seeing their code, while their chosen AI provider is training on it anyway (unless they're on an API tier with opt-out). The honest answer is that BYOK gives them control and transparency, not actual privacy — they can audit exactly what leaves their browser and to whom. That's different from trusting a black box, even if the end result is similar.
The liability point is the one I don't see discussed enough. As a solo developer, not holding a master API key means I'm not a target. If someone's key gets misused, it's between them and their provider. That asymmetry matters a lot at small scale.

What's your stack? Curious if you've shipped BYOM in production and what the OAuth + multi-provider flow looks like in practice.

UnitBuilds • Jun 29

For the API end, essentially the Model Garden models are the third parties. Those require different API calls, so you create a template for each, then tie it to the provider's URL in the model declaration, so Anthropic means Claude, etc. That way you have a set API structure for each model type, and it means that as time goes on, if they release new models, users can upgrade the second a model releases, not wait a week for a patch.

Projekta2 • Jun 30

Really appreciate you walking through the Doccit setup in that level of detail — that's a genuinely different shape of BYOM than what I expected.

The GCP service account handoff is clever — it sidesteps the "paste a raw key" friction entirely while still keeping you out of the liability chain, since the user is provisioning their own service account against their own billing. That's BYOK and OAuth-style delegation at the same time, which I hadn't considered as a combination.

The per-document-profile model selection is the part that stands out most. Storing "this model works best for this document type" as a persistent decision tied to the profile, rather than a single global setting, is more practical than what most BYOK implementations do. Did that come from observing real usage — users manually overriding the default often enough that you made it sticky per profile — or was it designed in from the start?

The admin-assigns-allowed-models / user-picks-within-that-set pattern solves the enterprise procurement problem cleanly. "So someone doesn't select Opus" made me laugh, but it's a real cost-control need that most BYOK writeups (mine included) don't address — I've only been thinking about individual users, not teams with a budget owner.
On the per-provider templates: that's effectively the same abstraction I'm using (one config object per provider, swap endpoint/model), just one layer deeper since you're proxying through Vertex's Model Garden instead of calling providers directly. Makes sense that new model releases become a config change rather than a code change either way.

Curious — with Gemini API being "pretty bad" as you mentioned, did that push usage toward the Model Garden third-party models (Claude, etc.) more than you expected? Interesting signal if Google's own routing layer ends up making the case for competitors' models within their own product.

UnitBuilds • Jun 30

Designed from the start. 6 years ago, when I started with the original Doccit, I had tried out DocTR, which had different datasets you could preload. The result was that Master worked for most docs but failed on others, whereas FUNSD worked well on some outliers and was much faster, and the list goes on. That was the original switch: which dataset to use in general. Fast forward, I had to rebuild Doccit from scratch because the source code died with my SSD (yes, I know, use Git, I do now... I had it locally hosted and a power surge took out my NAS and the SSD on my dev PC).

The move from Gemini to Vertex AI was more a matter of use case. The Gemini API is great for basic chatbot support, but it's not viable for any heavy workloads (concurrency and quota), whereas Vertex AI opened that door up, while providing better solutions for regional regulations (that happened since the original), namely, data privacy laws that state financial docs need to stay in the same geographical location, unless you completely sanitize all potential identifiers (which is impossible to guarantee). So for liability's sake, it had to move to Vertex AI; for rate's sake, it had to move to Vertex AI; configurability was just a bonus really. Especially when you compare Gemini 3.1 Flash Lite vs Claude Haiku... Lite is cheaper and faster, with a higher rate limit, availability and success rate than Haiku, so for 99% of cases you'd just use the default Lite, but the option is what people care about. If a doc scans poorly and they have a million of them, they don't want to struggle with corrections, they'd sooner pay the premium by choice to upgrade that specific doc type to a higher model. When dealing with financial docs, mistaking a 0 for an o costs more in liability for the customer than it does to just pay an extra 50% to ensure it passes. For the company using it, it's the difference between spending 30 seconds reviewing and 30 minutes reviewing each doc for correctness. In either case, it's worth the premium, but it's never a forced premium, it's a choice.

That being said, if you want to cost-optimize, Cloudflare Workers AI is generally cheaper. I mean, you can run Kimi K2.7 for the price of 3.1 Flash... a 1T-param model...

Projekta2 • Jun 30

The SSD/NAS story is the kind of thing that should be required reading before anyone skips backups — glad Doccit survived the rebuild, even if the rebuild itself sounds brutal.
The regulatory angle is the part I hadn't considered at all. I've been thinking about BYOK purely as a privacy/trust lever for the user, but you're describing something stricter: data residency as a hard compliance requirement, where the choice of provider isn't about quality or cost, it's about which jurisdiction the data is legally allowed to touch. That's a different category of constraint than anything I deal with — PR diffs don't have geographic residency laws attached to them. Financial documents clearly do.

The "0 vs o" example is a great way to make the stakes concrete. It reframes the whole model-selection problem: you're not optimizing for average accuracy, you're optimizing for the cost of being wrong on the tail end. A 30-second review vs a 30-minute review is the real unit of value there, not tokens or latency.

The Cloudflare Workers AI / Kimi K2.7 pointer is new to me — I hadn't clocked that a trillion-parameter model was reachable at that price point. Have you actually moved any production traffic onto it, or is it still in the "watching the price/performance curve" phase? I ask because for my use case, the calculus is different (Groq's free tier means $0 is already on the table for most users), but I'm curious whether Workers AI is reliable enough yet for anyone to trust it as a primary path rather than a cost-optimization experiment.

UnitBuilds • Jun 30

In terms of free tier, I've tried it on @pascal_cescato_692b7a8a20's benchmark suite from his second-to-last post to test it, which is where my 12-part series over 5 days to a standalone OS started. It's quite solid for 19 files per day free, but the real problem with Kimi is that there are no mandatory safeguards, you need to be explicit. If you don't want credentials leaking, you need to make sure you tell it so. Workers AI, though, is really solid in my experience, so worth a try?

UnitBuilds • Jun 29

Autonomous Accounting Suite, called Doccit. I handled BYOM a little differently. Vertex AI is the go-to: they just link their GCP and run a script to create the service account for it, which then registers it to their hosted Doccit. It queries available models via the API, so they can select what model they want to use (sometimes one model works better than another for a specific use case, e.g. for dot-matrix, Gemini 3.1 Flash Lite outperforms 3.5 Flash and Sonnet). That selection is saved to their document profiles, which is the SIFT function for pre-LLM detection, layout detection and text detection, then the model is used to spot-check flaws and understand the purpose of the doc.

I did test the OAuth system though, but as the Gemini API is pretty bad, it ended up being just a method for companies to manage costs, by using the Admin profile to assign the models they are willing to use (so someone doesn't select Opus...). The individual users then select the models they want to use for their tasks, and the admin can view the dashboard to see who prioritized what model, so they can query costs and justify them vs cheaper alternatives.

Nazar Boyko • Jun 29

On the streaming question you floated at the end, the cleaner path most people land on now is an offscreen document. The service worker still gets killed around 30 seconds in mid-stream, but an offscreen doc gives you a context that actually stays alive to hold the fetch open, and it relays tokens back over messages the same way you're already doing. The service worker's only job becomes spawning that doc and tearing it down when the stream ends. One caveat before you commit, though. You can only have one offscreen document at a time, so concurrent streams need a small queue. Solid writeup on the BYOK trade-offs otherwise, the cost table makes the case better than any pitch could.

Projekta2 • Jun 30

Nazar — this is exactly the kind of detail that makes the difference between "works in dev" and "works in production." Thanks for sharing it.

The offscreen document approach is something I've been meaning to test, but the 30-second service worker termination window is real. The pattern you're describing (SW spawns the doc → doc holds the fetch → relays tokens back) is cleaner than what I have now, which is a hack that keeps the SW alive longer than it should.

One caveat I'd add to your caveat: the "one offscreen document at a time" limit means queueing concurrent streams, but in practice for PR Focus, I've rarely seen more than one stream running simultaneously. Most users are reviewing one PR at a time.

A question back to you: have you tested how the offscreen document behaves with Chrome's memory pressure policies? My concern is that if the browser is under memory pressure, the offscreen document might get terminated just like the SW, albeit with a longer grace period. Do you know if it's treated differently by the browser's lifecycle policies?

Also worth noting: the offscreen document API is still relatively new (Chrome 109+). For developers supporting older Chrome versions, the SW-only approach might still be the safer compatibility bet — even with the 30-second termination risk.

I'll be testing the offscreen document approach this week. If it holds up, Build Log #008 is coming.

Thanks again for the sharp observation — this is exactly the kind of feedback that makes writing these articles worthwhile.

Hossein Yazdi • Jun 29

Nice approach, BYOK makes sense for cutting backend cost and improving privacy, but onboarding is still the main friction point, especially for non-dev users.

Overall solid dev-first architecture, especially with MV3 streaming workaround.

Projekta2 • Jun 29

Exactly — onboarding friction is the number I watch most closely. Install-to-first-AI-call is the metric that matters, not install count.

The mitigation that's moved the needle most: making every non-AI feature work without a key, so the first session has value before setup. Users who get value in session one come back for session two and actually configure the key. Users who hit a setup wall in session one don't.

For non-dev users specifically, I think BYOK is the wrong architecture — you'd want OAuth or a managed key. But for a tool aimed at developers who review code on GitHub, "get a free Groq key in 2 minutes" is a friction level they'll tolerate.
The question is always: does your user understand what an API key is and do they care enough about privacy to go get one?
What's your use case? Building something or evaluating architectures?

Raju Dandigam • Jun 30

This is a nice breakdown of the BYOK tradeoff. Removing the backend simplifies cost and privacy, especially for developer tools that may touch private code or PR diffs, but it also shifts responsibility to onboarding, key storage, quota errors, and provider-specific UX. I like that you included where the approach breaks down instead of presenting it as universally better. For production extensions, I’d also think about traceability: when an AI review looks wrong, users need enough local context to understand what diff, prompt, and provider response produced it.

Projekta2 • Jun 30

Raju — you've hit on something I deliberately left out of the main article because it deserves its own treatment, but you're absolutely right: traceability is the missing piece in most BYOK writeups.

When an AI review produces a bad summary or a risk score that doesn't match the reviewer's intuition, the user needs to answer three questions:

What diff was sent? (truncation? full file? only changed lines?)
What prompt was used? (system prompt + diff context + instructions)
What did the provider return? (the raw response, not just the parsed output)

Currently, PR Focus logs all three locally in IndexedDB with the PR ID as the key. If a user flags a bad summary, I can ask them to export the log and see exactly what happened — but that's manual.

What I'd like to build next: a "debug view" in the extension that shows the prompt, the diff sent, and the raw provider response for any PR. That turns "this AI review is wrong" from a support ticket into a self-serve investigation.

The constraint: logging the full prompt + diff + response can get large quickly. I'm thinking of a rolling log — keep the last 20 AI calls, purge older ones. For most users, that's enough to debug a bad review.

One more thought: the traceability requirement changes depending on who's using the tool. For individual developers, "I trust it or I don't" is enough. For teams, you need to be able to audit why a review was flagged as high-risk. That's a different scale of requirement.

I documented the initial tracing approach in Build Logs, but I'd like to improve it. Do you have a specific tracing setup in your projects that you've found works well? Always curious to see what others are doing in this space.

Mudassir Khan • Jul 4

the streaming implementation has a subtle chunk boundary issue worth knowing before it bites you. decoder.decode(value) gives you whatever bytes arrived in this read, which does not always align with SSE event boundaries. if a chunk ends in the middle of a line, your chunk.split call silently drops the tail and the next chunk starts without that continuation. in practice this shows up as occasional token drops that look like the model cutting itself short, because the logic still runs without throwing. the fix is a line buffer across reads: accumulate the incomplete fragment in a variable and prepend it before splitting the next chunk. the malformed chunk catch block handles parse failures but not partial line carryover, which is a quieter failure mode. have you seen unexplained short responses in production streams that might trace back to this?

Projekta2 • Jul 5

Mudassir — you're right, and I should have caught this earlier. The chunk boundary issue is real and exactly as quiet as you describe. I had seen occasional short responses that I attributed to model behavior, but your diagnosis makes more sense: partial line carryover that the catch block doesn't handle because it never throws.

The line buffer fix is the correct approach. Here's what I'm moving to:

const reader = response.body.getReader();
const decoder = new TextDecoder();
let buffer = '';

while (true) {
  const { done, value } = await reader.read();
  if (done) break;

  buffer += decoder.decode(value, { stream: true });
  const lines = buffer.split('\n');
  buffer = lines.pop(); // keep the incomplete tail for the next read

  for (const line of lines) {
    if (!line.startsWith('data: ')) continue;
    const data = line.slice(6);
    if (data === '[DONE]') continue;

    try {
      const parsed = JSON.parse(data);
      const token = parsed.choices[0]?.delta?.content || '';
      chrome.tabs.sendMessage(tabId, { type: 'AI_TOKEN', token });
    } catch (e) {
      // malformed chunk — skip
    }
  }
}

The { stream: true } flag on decode also matters — it tells the decoder not to flush the internal state between calls, which handles multi-byte UTF-8 sequences that might split across chunk boundaries as well.

I've included this fix with credit to you in the next article I'm publishing — it's specifically about decisions I'd make differently building PR Focus, and the streaming bug fits exactly there. Thanks for taking the time to write it up properly instead of just flagging that something was off.

Frank • Jun 28

Super interesting approach! I'm curious how you

Projekta2 • Jun 28

Thanks! Appreciate the interest — and I'm genuinely curious about what you were going to ask.

The comment cut off, but I'd love to hear what part you're curious about. Some possible angles:

The streaming + service worker lifecycle (this one's the trickiest in practice)
The provider abstraction (supporting 4 providers with one implementation)
Error handling for invalid/misconfigured keys (the most common support request)
Token cost estimates or Groq's free tier limits
How the Manifest V3 permissions model interacts with direct API calls

If there's a specific piece you'd like me to go deeper on, just ask. I've been running this in production for ~6 weeks and have hit most of the edge cases.

Or if you're working on something similar and want a second pair of eyes on the implementation, happy to take a look — the Summer Review Swap is open right now if you want to exchange a PR review.

Either way, thanks for reading and for the engagement 🙌
