Diven Rastdus
I built voice transcription on Supabase Edge Functions for ~$0.001/min using Gemini 2.5 Flash

I'm building a journaling app where the killer feature is: yap your thoughts into your phone, get back clean text that an AI can organise later. Speech-to-text is the most-trafficked path in the product, so the cost math matters from day one.

The default everyone reaches for is OpenAI Whisper at $0.006/min. Deepgram Nova-3 is $0.0059/min real-time. Groq's Whisper Large v3 Turbo is the discount option at ~$0.0008/min.

I shipped on Gemini 2.5 Flash multimodal instead. Roughly $0.001/min, completely free under my existing GCP credits, and one less third-party account to manage because the same key already powers other features in my app.

Here's exactly what I did, the full edge function, and why I rejected the more obvious choices.

The "use what you already have" check

Before picking a transcription provider, I looked at what was already wired in:

  • Supabase project: yes, with a GEMINI_API_KEY already set as a secret (used by my AI features)
  • GCP free credits: $522 sitting on the same billing account that funds the Gemini API
  • OpenAI account: I'd have to create one and bill it
  • Groq account: same, plus a separate API to learn

Every new third-party SaaS account is a future maintenance cost: a new dashboard, a new key to rotate, a new bill to forget about until the credit card statement arrives. If the Gemini key I already had could do voice, that was a strong default.

It can. Gemini 2.5 Flash accepts audio inputs natively as multimodal parts. Same generateContent endpoint that my existing AI features use, just with an inline_data part that carries the audio bytes.

Why not Web Speech API (the "free" option)?

The browser actually ships a SpeechRecognition API. It's free. So why not use that?

Three reasons that killed it for my use case:

  1. Browser support: Chrome and Edge only. Firefox doesn't implement it. Safari has partial support. For a journaling app where users are likely on whatever browser they prefer, that's a non-starter.
  2. Privacy: Chrome's implementation routes audio to Google's servers. My app's whole pitch is "your honest journal, not a feed for ad targeting." I'm not handing the most intimate audio my users produce to an ad company by default.
  3. Control: I can't tune the model, can't set retention, can't see what was sent. With my own backend in front of Gemini, I control all three.

Native mobile is different. expo-speech-recognition runs on-device using iOS Speech and Android SpeechRecognizer. That's free and local and good. So my architecture is hybrid: native STT on mobile, Gemini on web.
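The routing decision itself is one line. A sketch of the split (names here are illustrative, not my actual module):

```typescript
// Hybrid STT routing: on-device recognition on mobile, Gemini on web.
// `Platform` and the route names are placeholders for illustration.
type Platform = 'ios' | 'android' | 'web';

function transcriptionRoute(platform: Platform): 'native' | 'gemini_flash' {
  // iOS Speech / Android SpeechRecognizer run on-device, free and local;
  // the web client falls back to the Gemini edge function.
  return platform === 'web' ? 'gemini_flash' : 'native';
}
```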

The edge function

Here's the meaningful core. Full file is ~200 lines; I've stripped boilerplate.

// supabase/functions/transcribe-audio/index.ts
import { createClient } from 'https://esm.sh/@supabase/supabase-js@2.101.1';

const GEMINI_API_KEY = Deno.env.get('GEMINI_API_KEY') || '';
const SUPABASE_URL = Deno.env.get('SUPABASE_URL') || '';
const SUPABASE_SERVICE_ROLE_KEY = Deno.env.get('SUPABASE_SERVICE_ROLE_KEY') || '';

const GEMINI_MODEL = 'gemini-2.5-flash';
const GEMINI_URL = `https://generativelanguage.googleapis.com/v1beta/models/${GEMINI_MODEL}:generateContent`;

// Inline audio parts up to 20 MB; ~25+ min of opus speech.
const MAX_AUDIO_BYTES = 20 * 1024 * 1024;

Deno.serve(async (req: Request) => {
  if (req.method !== 'POST') return jsonResponse({ error: 'Method not allowed' }, 405);

  // 1. Verify the user JWT ourselves (config.toml has verify_jwt = false because
  //    we need to parse multipart audio and read the Authorization header in
  //    one place; the platform's pre-check would conflict).
  const authHeader = req.headers.get('Authorization');
  if (!authHeader?.startsWith('Bearer ')) return jsonResponse({ error: 'Unauthorized' }, 401);

  const supabase = createClient(SUPABASE_URL, SUPABASE_SERVICE_ROLE_KEY, {
    auth: { persistSession: false, autoRefreshToken: false },
  });
  const { data: { user }, error } = await supabase.auth.getUser(
    authHeader.slice('Bearer '.length),
  );
  if (error || !user) return jsonResponse({ error: 'Unauthorized' }, 401);

  // 2. Parse the upload
  const formData = await req.formData();
  const audio = formData.get('audio');
  if (!(audio instanceof Blob) || audio.size === 0) {
    return jsonResponse({ error: 'Missing audio' }, 400);
  }
  if (audio.size > MAX_AUDIO_BYTES) {
    return jsonResponse({ error: 'Audio too large' }, 413);
  }

  // 3. ... rate-limit check goes here, see below ...

  // 4. Encode for Gemini inline
  const base64Audio = bufferToBase64(new Uint8Array(await audio.arrayBuffer()));
  const mimeType = (audio.type || 'audio/webm').split(';')[0].trim().toLowerCase();

  // 5. Call Gemini
  const body = {
    contents: [{
      role: 'user',
      parts: [
        { text: 'Transcribe this audio. Return ONLY the words spoken, with normal punctuation.' },
        { inline_data: { mime_type: mimeType, data: base64Audio } },
      ],
    }],
    generationConfig: { temperature: 0.2, maxOutputTokens: 8192 },
  };

  const res = await fetch(`${GEMINI_URL}?key=${encodeURIComponent(GEMINI_API_KEY)}`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(body),
  });

  if (!res.ok) return jsonResponse({ error: `Gemini ${res.status}` }, 502);

  const parsed = await res.json();
  const transcript = (parsed.candidates?.[0]?.content?.parts?.[0]?.text ?? '').trim();
  if (!transcript) return jsonResponse({ error: 'Empty transcript' }, 502);

  // 6. ... usage logging goes here ...

  return jsonResponse({ transcript, source: 'gemini_flash' });
});

// The spread in String.fromCharCode(...largeArray) blows the call stack
// on big buffers, so build the binary string in chunks before btoa()
function bufferToBase64(buf: Uint8Array): string {
  const CHUNK = 0x8000;
  const parts: string[] = [];
  for (let i = 0; i < buf.length; i += CHUNK) {
    parts.push(String.fromCharCode(...buf.subarray(i, i + CHUNK)));
  }
  return btoa(parts.join(''));
}

function jsonResponse(body: unknown, status = 200): Response {
  return new Response(JSON.stringify(body), {
    status,
    headers: { 'Content-Type': 'application/json' },
  });
}

Three small things worth flagging:

The Authorization parse. If you let Supabase's edge runtime verify the JWT for you, it adds a step before your function runs. With multipart/form-data and a custom auth check we want side-by-side, it's cleaner to set verify_jwt = false in config.toml and do it ourselves. One less moving piece.
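For reference, the opt-out lives in supabase/config.toml (the section name follows the function's folder name used above):

```toml
[functions.transcribe-audio]
verify_jwt = false
```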

bufferToBase64 chunking. I learned the hard way that btoa(String.fromCharCode(...largeArray)) throws a "Maximum call stack size exceeded" RangeError on a 5 MB buffer — the argument spread is the culprit, since every byte becomes a function argument. The chunked variant above handles inputs up to the API's 20 MB cap fine.

The prompt. You'd be surprised how often the model wants to add commentary like "Here's the transcript:". Insisting on "ONLY the words spoken" with normal punctuation cuts this down to near-zero. Setting temperature: 0.2 further pins it.

Per-user rate limit (so one bad-faith account can't burn your credits)

The version of the function above will happily transcribe any audio you throw at it, indefinitely, until your Gemini bill goes vertical. For a public app you need a per-user budget.

I added a tiny transcription_usage table:

CREATE TABLE transcription_usage (
  id BIGSERIAL PRIMARY KEY,
  user_id UUID NOT NULL REFERENCES auth.users(id) ON DELETE CASCADE,
  bytes INT NOT NULL DEFAULT 0,
  source TEXT NOT NULL DEFAULT 'gemini_flash',
  created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE INDEX idx_transcription_usage_user_time
  ON transcription_usage (user_id, created_at DESC);

ALTER TABLE transcription_usage ENABLE ROW LEVEL SECURITY;

CREATE POLICY "users read own transcription usage"
  ON transcription_usage FOR SELECT USING (auth.uid() = user_id);
-- No INSERT policy: only the service role (the edge function) writes.

Then in the function, between the auth/parse step and the Gemini call:

const DAILY_REQUEST_LIMIT = 50;
const DAILY_BYTES_LIMIT = 100 * 1024 * 1024; // 100 MB
const ROLLING_WINDOW_HOURS = 24;

const sinceIso = new Date(
  Date.now() - ROLLING_WINDOW_HOURS * 60 * 60 * 1000,
).toISOString();

const { data: usageRows, error: usageErr } = await supabase
  .from('transcription_usage')
  .select('bytes')
  .eq('user_id', user.id)
  .gte('created_at', sinceIso);

if (usageErr) {
  // Fail-closed: better to block one request than risk uncapped spend.
  return jsonResponse({ error: 'Usage check failed' }, 503);
}

const requestCount = usageRows?.length ?? 0;
const bytesUsed = (usageRows ?? []).reduce((sum, r) => sum + (r.bytes ?? 0), 0);

if (requestCount >= DAILY_REQUEST_LIMIT) {
  return jsonResponse({
    error: 'Daily transcription limit reached',
    limit: DAILY_REQUEST_LIMIT,
    used: requestCount,
    window_hours: ROLLING_WINDOW_HOURS,
  }, 429);
}
if (bytesUsed + audio.size > DAILY_BYTES_LIMIT) {
  return jsonResponse({ error: 'Daily transcription size budget reached' }, 429);
}

And after a successful Gemini call:

await supabase.from('transcription_usage').insert({
  user_id: user.id,
  bytes: audio.size,
  source: 'gemini_flash',
});

A few things I learned about this pattern:

  • Fail closed on the usage check. If the table query fails, return 503 and refuse to call Gemini. The opposite (call Gemini anyway, log nothing) opens an uncapped-spend hole if your DB has a bad day.
  • Bytes is a better budget unit than seconds. I don't trust my client's audio metadata, but I always know how many bytes I uploaded.
  • Don't put the limit number in the user-facing error. I do it above for code clarity, but production should say "Daily transcription limit reached. Try again later." A bad-faith user gets less information for crafting a bypass.
  • Belt and braces with a GCP budget alert. Set a $5/month cap on the Gemini project at console.cloud.google.com/billing, alerts at 50/90/100%. If my edge-function rate limit ever has a bug, the budget catches it.

For my own usage (~5 voice notes/day at 30 seconds each) the actual Gemini bill is about $0.06/month. The 50-requests/100 MB daily cap is generous for honest use and tight enough not to bleed money.

Browser side

The web client is a thin wrapper around MediaRecorder:

// pickAudioMimeType() picks the first one MediaRecorder.isTypeSupported() returns true for:
//   audio/webm;codecs=opus -> Chrome/Edge/Firefox
//   audio/mp4;codecs=mp4a.40.2 -> Safari
//   ...
const mimeType = pickAudioMimeType();
const chunks: Blob[] = [];
const recorder = new MediaRecorder(stream, { mimeType });

recorder.ondataavailable = (e) => {
  if (e.data && e.data.size > 0) chunks.push(e.data);
};

recorder.onstop = async () => {
  const blob = new Blob(chunks, { type: mimeType });
  const form = new FormData();
  form.set('audio', blob, 'audio.webm'); // filename is cosmetic; the server keys off blob.type

  const res = await fetch(`${SUPABASE_URL}/functions/v1/transcribe-audio`, {
    method: 'POST',
    headers: { Authorization: `Bearer ${session.access_token}` },
    body: form,
  });
  const { transcript } = await res.json();
  // ... do something with transcript ...
};

recorder.start();

Two real-world tweaks I had to make:

  • Mime sniffing per browser. Chrome emits audio/webm;codecs=opus, Safari emits audio/mp4;codecs=mp4a.40.2. Probe with MediaRecorder.isTypeSupported() and pick the first one that passes. Pass the same mime to Gemini's inline_data.mime_type.
  • Permission fail messages. getUserMedia rejects with NotAllowedError when the user denied mic access, SecurityError on insecure context. Handle them with friendlier copy than "DOMException: blah blah".
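pickAudioMimeType is elided above; a minimal version might look like this, with the support probe passed in so the priority order is testable outside a browser (in the client you'd pass a wrapper around MediaRecorder.isTypeSupported):

```typescript
// Candidate order matters: prefer webm/opus (Chrome, Edge, Firefox),
// fall back to mp4 for Safari, then the codec-less variants.
const MIME_CANDIDATES = [
  'audio/webm;codecs=opus',     // Chrome / Edge / Firefox
  'audio/mp4;codecs=mp4a.40.2', // Safari
  'audio/webm',
  'audio/mp4',
];

function pickAudioMimeType(isSupported: (mime: string) => boolean): string {
  for (const mime of MIME_CANDIDATES) {
    if (isSupported(mime)) return mime;
  }
  // Empty string: let MediaRecorder choose its own default.
  return '';
}
```

In the browser: `pickAudioMimeType((m) => MediaRecorder.isTypeSupported(m))`.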

What I keep on disk vs throw away

This is a journaling app, so privacy matters more than typical voice apps. My defaults:

  • Audio: never persisted server-side. The blob is uploaded for one-shot transcription, sent to Gemini, then dropped. Nothing in Supabase Storage by default.
  • Raw transcript: persisted in journal_entries.metadata.raw_transcript JSONB column, alongside the (possibly AI-organised) content. This way the user can always toggle back to "what they actually said" if the AI cleaned the wrong thing.
  • Provenance flag: metadata.transcription_source: 'gemini_flash' | 'native_ios' | 'native_android' | 'manual'. Useful for later analysis ("are users on the native STT path getting better quality?") and for an honest UI badge.
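Concretely, the metadata on a voice entry looks roughly like this (shape illustrative, not my exact schema):

```typescript
// Illustrative shape of journal_entries.metadata for a voice entry.
type TranscriptionSource =
  | 'gemini_flash'
  | 'native_ios'
  | 'native_android'
  | 'manual';

interface EntryMetadata {
  raw_transcript: string; // exactly what the user said
  transcription_source: TranscriptionSource;
}

const example: EntryMetadata = {
  raw_transcript: 'met Sam for coffee, felt good about the pitch',
  transcription_source: 'gemini_flash',
};
```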

TL;DR cost comparison at the scale of one happy user

5 voice entries per day, 30 seconds each = 2.5 minutes per day per user.

Provider                        Cost / month / user
OpenAI Whisper API              ~$0.45
Deepgram Nova-3 batch           ~$0.44
Groq Whisper Large v3 Turbo     ~$0.06
Gemini 2.5 Flash multimodal     ~$0.06 (and free under my existing GCP credits)

For a journaling app at single-user scale, none of these are scary. But the difference between $0.06 and $0.45 starts to matter once you cross 1,000 active users a month. And the Gemini path draws on GCP credits I already had, so my marginal cost today is zero.
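For anyone checking the table: the per-provider numbers are plain arithmetic over the per-minute rates quoted earlier. (Gemini prices audio input by tokens rather than minutes, so its row is an effective rate, not a list price.)

```typescript
// Monthly transcription cost per user at 2.5 minutes of audio per day.
const MINUTES_PER_DAY = 2.5;
const DAYS_PER_MONTH = 30;

function monthlyCost(ratePerMinute: number): string {
  return (MINUTES_PER_DAY * DAYS_PER_MONTH * ratePerMinute).toFixed(2);
}

// monthlyCost(0.006)  — OpenAI Whisper
// monthlyCost(0.0059) — Deepgram Nova-3
// monthlyCost(0.0008) — Groq Whisper Large v3 Turbo
```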

I build this kind of glue for indie products and small teams. If you're untangling a similar choice or want a similar edge function for your own stack, I'm at astraedus.dev.
