I built a "Bring Your Own Key" Chrome extension to kill my $22/month dictation subscription

#showdev

Last year I crossed a line with subscription fatigue.

I was mid-sprint, voice-drafting a long client email, and my dictation tool flashed a notification: "You've used 80% of your monthly minutes." I had a week left in the billing cycle. I could upgrade to the next plan tier — $38/month — or just… start typing again.

I closed the notification and started building Aurai instead.

What Aurai does

Aurai is a Chrome extension that turns your voice into clean, polished text — and pastes it directly into whatever you were typing in. Gmail, Slack, Notion, any textarea or contenteditable on the web.

But it's not just speech-to-text. Raw STT output is messy: run-on sentences, missing punctuation, and words the model confuses or mishears ("their/there", "tortoise invoice" for "total invoice"). Aurai passes your transcript through a Gemini refinement layer that fixes these in context and applies proper formatting before anything hits your text box.

There's also a tone-shifting feature: you can flag your dictation as needing a Professional, Excited, or Persuasive rewrite. Useful when you're thinking out loud in casual language but need polished output.

The BYOK architecture

The core design decision — and the one I'd want other builders to think about — is Bring Your Own Key.

Most AI-powered tools work like this:

User → Your App (backend) → AI API → Your App → User

You pay the API bill, mark it up, and charge a subscription. It's a valid business model, but it creates a few problems:

You're a cost center before you have users
Your users' data flows through your servers
You need rate limiting, abuse prevention, key rotation...

Aurai flips it:

User (with their own API key) → AI API directly → User

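The extension runs entirely in the browser. The user's API key is stored in chrome.storage.sync and never transmitted to me (Chrome syncs that storage area across the user's own signed-in browsers; chrome.storage.local would keep the key on a single device). The extension uses the key to call the Gemini API directly. My server involvement: zero.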
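To make the direct-call flow concrete, here is a minimal sketch of how an extension could read the key and call Gemini's generateContent REST endpoint itself. It is not Aurai's actual code: the storage key name geminiApiKey, the model name, and all three function names are assumptions for illustration. The storage and fetch objects are passed in as parameters so the logic can be exercised outside a browser; in an MV3 extension they would be chrome.storage.sync and the global fetch.

```javascript
// Assumptions: the model name and the 'geminiApiKey' storage key are placeholders.
const GEMINI_MODEL = 'gemini-1.5-flash';
const GEMINI_BASE = 'https://generativelanguage.googleapis.com/v1beta/models';

// Build the request for a single-turn generateContent call.
function buildGeminiRequest(apiKey, prompt) {
  return {
    url: `${GEMINI_BASE}/${GEMINI_MODEL}:generateContent`,
    init: {
      method: 'POST',
      headers: { 'Content-Type': 'application/json', 'x-goog-api-key': apiKey },
      body: JSON.stringify({ contents: [{ parts: [{ text: prompt }] }] }),
    },
  };
}

// Pull the generated text out of the first candidate, or '' if there is none.
function extractText(data) {
  const parts = data?.candidates?.[0]?.content?.parts ?? [];
  return parts.map((part) => part.text ?? '').join('');
}

// User's key -> Gemini -> text, with no server in between.
// `storage` is expected to behave like chrome.storage.sync (get() returns a Promise).
async function refineWithGemini(storage, prompt, fetchImpl = fetch) {
  const { geminiApiKey } = await storage.get('geminiApiKey');
  if (!geminiApiKey) throw new Error('No API key saved yet');
  const { url, init } = buildGeminiRequest(geminiApiKey, prompt);
  const response = await fetchImpl(url, init);
  if (!response.ok) throw new Error(`Gemini request failed: HTTP ${response.status}`);
  return extractText(await response.json());
}
```

In an extension this would be called as refineWithGemini(chrome.storage.sync, prompt). The request goes straight from the user's browser to Google, so the only parties that see the transcript are the user and the API provider.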

What this means in practice:

No backend to maintain
No API costs to absorb
No data retention liability
No usage caps from me: the only limit is the user's own API quota
Free Google AI Studio keys cover most individual usage comfortably

The technical challenge: injecting text into modern web apps

This was the part that took the longest.

Clicking into a Gmail compose box and programmatically inserting text sounds trivial. In practice, modern web frameworks maintain their own virtual DOM and state, so if you just set element.value = "..." or element.textContent = "...", the framework doesn't know the value changed — and the text may vanish on the next re-render or fail to trigger form validation.
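The failure mode is easier to see in a toy model. The sketch below is a simplified imitation of the value tracking that React-style frameworks do, not real framework code: FakeInput and track are hypothetical stand-ins. The framework shadows value on the element instance, so a plain assignment is recorded as already seen and the following input event looks like a no-op. Going through the prototype's setter skips that bookkeeping, so the change is noticed.

```javascript
// Stand-in for HTMLInputElement: `value` is an accessor on the prototype.
class FakeInput {
  constructor() { this._raw = ''; }
  get value() { return this._raw; }
  set value(v) { this._raw = String(v); }
}

// Simplified model of framework value tracking: shadow `value` on the
// instance so every normal assignment updates the "last seen" value.
function track(node) {
  const native = Object.getOwnPropertyDescriptor(FakeInput.prototype, 'value');
  let lastSeen = node.value;
  Object.defineProperty(node, 'value', {
    configurable: true,
    get() { return native.get.call(this); },
    set(v) { lastSeen = String(v); native.set.call(this, v); },
  });
  // What the framework asks when an input event arrives:
  // has the real value moved since the last one I saw?
  return function hasUnseenChange() {
    const current = native.get.call(node);
    const changed = current !== lastSeen;
    lastSeen = current;
    return changed;
  };
}

const input = new FakeInput();
const hasUnseenChange = track(input);

// Plain assignment goes through the instance setter, so the tracker already knows.
input.value = 'hello';
const plainAssignmentNoticed = hasUnseenChange(); // false

// The prototype setter bypasses the tracker, so the next check sees a change.
const nativeSetter = Object.getOwnPropertyDescriptor(FakeInput.prototype, 'value').set;
nativeSetter.call(input, 'hello world');
const nativeSetterNoticed = hasUnseenChange(); // true
```

This is why injection code for inputs and textareas typically pairs the native prototype setter with a dispatched input event.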

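Here's the approach that ended up working across the most surfaces. For inputs and textareas it sets the value through the native prototype setter and then dispatches an input event, which is what lets frameworks such as React notice the change. For contenteditable elements it falls back to execCommand.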

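function injectText(element, text) {
  element.focus();

  if (element.tagName === 'INPUT' || element.tagName === 'TEXTAREA') {
    // Take the native setter from the element's own prototype. Calling the
    // HTMLInputElement setter on a <textarea> throws "Illegal invocation".
    const proto = element.tagName === 'INPUT'
      ? window.HTMLInputElement.prototype
      : window.HTMLTextAreaElement.prototype;
    const nativeValueSetter = Object.getOwnPropertyDescriptor(proto, 'value')?.set;

    if (nativeValueSetter) {
      // Insert at the caret, replacing any selection. selectionStart is null
      // for input types without text selection (email, number), so append there.
      const value = element.value;
      const start = element.selectionStart ?? value.length;
      const end = element.selectionEnd ?? value.length;
      nativeValueSetter.call(element, value.slice(0, start) + text + value.slice(end));
      if (element.selectionStart !== null) {
        element.setSelectionRange(start + text.length, start + text.length);
      }

      // Tell the framework that the value changed.
      element.dispatchEvent(new InputEvent('input', {
        inputType: 'insertText',
        data: text,
        bubbles: true,
        cancelable: true,
      }));
      return;
    }
  }

  // For contenteditable divs (Gmail, Notion, Slack). execCommand is deprecated
  // but still widely supported; it returns false if the command is unsupported or disabled.
  document.execCommand('insertText', false, text);
}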

Edge cases that still bite:

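CodeMirror / Monaco editors: they keep their own document model and handle input events themselves, so execCommand doesn't reliably work.
iframes: a content script in the top frame can't reach into a cross-origin iframe, so the script has to be injected into that frame as well (for example with "all_frames": true in the manifest's content_scripts entry).
Shadow DOM: some apps render their inputs inside shadow roots, so document.activeElement only returns the shadow host and you need to walk down through each shadowRoot to find the real target.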
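For the Shadow DOM case, one common way to walk the tree is to follow activeElement down through each nested shadow root. This is a small sketch rather than Aurai's implementation, and getDeepActiveElement is a hypothetical helper name. It only works for open shadow roots: for a closed root, element.shadowRoot is null and the walk stops at the host.

```javascript
// Follow focus through nested open shadow roots to the element that holds it.
// `root` is a Document or ShadowRoot; both expose `activeElement`.
function getDeepActiveElement(root) {
  let current = root.activeElement ?? null;
  // Keep descending while the focused element hosts a shadow root
  // that itself has a focused element.
  while (current?.shadowRoot?.activeElement) {
    current = current.shadowRoot.activeElement;
  }
  return current;
}
```

In a content script this would be called as getDeepActiveElement(document), and the result would be the element to hand to an injection function.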

The Gemini prompting strategy

The refinement prompt matters a lot. Here's the rough structure:

You are a text editor assistant. You will receive raw voice dictation text.
Your job is to:
1. Fix transcription errors (homophones, phonetic mistakes, misheard words)
2. Add appropriate punctuation and sentence structure
3. Preserve the speaker's intent and vocabulary
4. Output ONLY the corrected text — no commentary, no explanation

[If tone shift requested]:
Additionally, rewrite the text in a [Professional/Excited/Persuasive] tone
while preserving the core content.

Raw dictation:
"""
{transcript}
"""

Key lessons from iterating on this:

"Output ONLY the corrected text" is critical — without it, Gemini sometimes prefaces with "Here is the corrected version:" which gets pasted into the user's text box
Short prompts perform better than elaborate ones for this task
Tone shifting works much better as a separate instruction appended to the base prompt than as a separate API call
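The prompt structure and the lessons above could be assembled in code roughly like this. The tone names come from the post; the function and constant names are hypothetical, and the prompt wording is lightly paraphrased. stripPreamble is an extra defensive step that I am assuming here, not something the post describes: it drops a leading line such as "Here is the corrected version:" if the model adds one anyway.

```javascript
const TONES = ['Professional', 'Excited', 'Persuasive'];

const BASE_PROMPT = [
  'You are a text editor assistant. You will receive raw voice dictation text.',
  'Your job is to:',
  '1. Fix transcription errors (homophones, phonetic mistakes, misheard words)',
  '2. Add appropriate punctuation and sentence structure',
  "3. Preserve the speaker's intent and vocabulary",
  '4. Output ONLY the corrected text: no commentary, no explanation',
].join('\n');

// One prompt, one API call: the tone instruction is appended to the base
// prompt instead of being sent as a second request.
function buildRefinementPrompt(transcript, tone = null) {
  if (tone !== null && !TONES.includes(tone)) {
    throw new Error(`Unknown tone: ${tone}`);
  }
  const sections = [BASE_PROMPT];
  if (tone !== null) {
    sections.push(
      `Additionally, rewrite the text in a ${tone} tone\nwhile preserving the core content.`
    );
  }
  sections.push(`Raw dictation:\n"""\n${transcript.trim()}\n"""`);
  return sections.join('\n\n');
}

// Assumed safeguard: remove a first line like "Here is the corrected version:"
// when it is followed by a line break. Text that continues on the same line
// after the colon is left alone.
function stripPreamble(output) {
  const preamble = /^(?:here is|here's|sure\b)[^\n]*:[ \t]*\n+/i;
  return output.trim().replace(preamble, '').trim();
}
```

The cleanup has a trade-off: a dictation that genuinely opens with "Here is the list:" on its own line would lose that line, so the pattern is worth keeping narrow.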

What I'd do differently

User onboarding for BYOK is harder than you think. "Get a free API key from Google AI Studio" sounds easy, but for non-technical users, it's a friction point. I added a direct link inside the extension that opens the exact page, with a tooltip explaining what to copy. Still, it's the number-one support question.

I should have built the tone-shifting UI before launch. It was a late addition and the UX is a bit rough. Users want to set their default tone and forget it. Flagging this for v2.

Try it

Aurai is free and available on the Chrome Web Store.

→ Install Aurai

If you've built BYOK products before, I'd love to hear how you handled onboarding — it's the biggest UX challenge in this model and I'm still iterating.

Built by Sid. If Aurai saves you time, there's a support option via Gumroad inside the app.