bolddeck

Posted on Jun 13

How I Cut My Meeting Notes Pipeline Bill by 60% (And What I'd Do Differently Next Time)

#programming #tutorial #python #machinelearning

Six months ago I shipped a meeting transcription summarizer for a client. It was supposed to be simple: pipe audio to Whisper, dump text into an LLM, get bullet points back. Three months and $14,000 in OpenAI bills later, I knew it wasn't simple. Here's what I learned rebuilding the whole thing on Global API, and why I'm not going back.

The problem with "AI meeting notes" isn't the AI part. It's the notes part.

Most guides gloss over this. They assume you want raw transcript summarization and call it a day. But in practice, a meeting notes pipeline has to do several annoying things: diarize speakers, handle overlapping audio, extract action items with owners and deadlines, filter out small talk, deal with 90-minute inputs that blow through context windows, and output structured data that downstream tools (Jira, Slack, Notion) can actually parse. That's a lot of moving parts, and each one has a cost curve.

I've been running this kind of pipeline in production for about eight months now, and I've tried pretty much every model family you can route to through a unified API. FWIW, the pricing differences between providers are not subtle: they're the difference between a profitable product and one you shut down.

Let me walk you through the actual numbers, the code I'd write if I were starting over, and the gotchas that cost me real money.

The State of Model Pricing in 2026

Here's the thing nobody puts on their landing page: the gap between the cheapest viable model and the most expensive one is more than an order of magnitude (about 12x between the cheapest and priciest rows below). Let me show you.

Model	Input ($/M tokens)	Output ($/M tokens)	Context window
DeepSeek V4 Flash	0.27	1.10	128K
DeepSeek V4 Pro	0.55	2.20	200K
Qwen3-32B	0.30	1.20	32K
GLM-4 Plus	0.20	0.80	128K
GPT-4o	2.50	10.00	128K

Read that last row again. GPT-4o output is $10.00 per million tokens. For a pipeline that processes dozens of meetings per day, that's not a rounding error — it's the dominant cost. I learned this the hard way when my first month of production cost $14K and 80% of it was GPT-4o output tokens.

IMO, for meeting notes specifically, you almost never need a frontier model. The task is extractive, not generative. You're pulling action items, decisions, and summaries out of existing text. You don't need a model that's been fine-tuned on creative writing. You need a model that's good at structured extraction and doesn't hallucinate dates.

That said, context window matters more than people think. 32K tokens sounds like a lot until you realize a 60-minute meeting transcript is often 12-15K tokens, and once you add the system prompt, few-shot examples, and the output format instructions, you're already at 18K. Qwen3-32B was the one model I had to chunk transcripts for, which meant losing speaker context across chunks. Not great.
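If you do end up chunking, one way to soften the lost-context problem is to split on speaker turns and repeat the last couple of turns at the start of each new chunk. Here's a rough sketch of that idea. The 4-characters-per-token estimate, the function names, and the overlap size are all assumptions for illustration, not what my pipeline actually runs; a real tokenizer will give you tighter numbers.

```python
from typing import List

CHARS_PER_TOKEN = 4  # assumption: crude heuristic for English text


def estimate_tokens(text: str) -> int:
    """Very rough token estimate; swap in a real tokenizer for production."""
    return max(1, len(text) // CHARS_PER_TOKEN)


def chunk_turns(
    turns: List[str], budget_tokens: int, overlap_turns: int = 2
) -> List[List[str]]:
    """Greedily pack speaker turns into chunks of roughly budget_tokens.

    Each new chunk starts with the last overlap_turns turns of the previous
    chunk, so the model still sees who was just speaking. A chunk can exceed
    the budget if the overlap plus a single turn is already too large.
    """
    chunks: List[List[str]] = []
    current: List[str] = []
    used = 0
    for turn in turns:
        cost = estimate_tokens(turn)
        if current and used + cost > budget_tokens:
            chunks.append(current)
            # Carry the tail of the finished chunk into the next one.
            current = current[-overlap_turns:] if overlap_turns else []
            used = sum(estimate_tokens(t) for t in current)
        current.append(turn)
        used += cost
    if current:
        chunks.append(current)
    return chunks
```

The budget you pass in should be the context window minus the system prompt, few-shot examples, and expected output, not the raw window size.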

The Architecture I'd Build Today

The pipeline I run now has three stages, and each one uses a different model tier. This is, imo, the right way to do it — don't send everything to the most expensive model.

Stage 1: Audio → Transcript. I'm not covering this in detail because it deserves its own post, but tl;dr Whisper large-v3 is still the king here. The cost is negligible compared to the LLM step.

Stage 2: Transcript → Structured Notes. This is where the model choice actually matters. I'm currently routing this through DeepSeek V4 Pro for the structured extraction pass, and DeepSeek V4 Flash for the polish/summary pass.

Stage 3: Structured Notes → Action Items. A regex + lightweight LLM hybrid. The LLM identifies candidate action items, and a deterministic layer validates the format.
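To give a feel for the deterministic half of Stage 3, here's a minimal sketch of a validator that checks candidate action items before they go downstream. It assumes deadlines have been normalised to either "TBD" or YYYY-MM-DD; that date format and the function names are my assumptions for this example, not a spec.

```python
import re
from datetime import date
from typing import Dict, List, Tuple

# Assumption: deadlines arrive as "TBD" or an ISO date (YYYY-MM-DD).
ISO_DATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")
PRIORITIES = {"high", "medium", "low"}


def validate_item(item: Dict[str, str]) -> List[str]:
    """Return a list of problems; an empty list means the item passes."""
    problems = []
    for field in ("owner", "task"):
        if not str(item.get(field) or "").strip():
            problems.append(f"missing {field}")
    deadline = str(item.get("deadline") or "")
    if deadline != "TBD":
        if not ISO_DATE.match(deadline):
            problems.append("deadline is not TBD or YYYY-MM-DD")
        else:
            try:
                date.fromisoformat(deadline)  # rejects dates like 2026-02-30
            except ValueError:
                problems.append("deadline is not a real calendar date")
    if item.get("priority") not in PRIORITIES:
        problems.append("priority must be high, medium, or low")
    return problems


def split_valid(
    items: List[Dict[str, str]],
) -> Tuple[List[Dict[str, str]], List[Tuple[Dict[str, str], List[str]]]]:
    """Separate items that pass from items that need a retry or human review."""
    accepted, rejected = [], []
    for item in items:
        problems = validate_item(item)
        if problems:
            rejected.append((item, problems))
        else:
            accepted.append(item)
    return accepted, rejected
```

Rejected items come back with their reasons attached, which makes it easy to log them or feed them into a retry prompt.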

Here's the code for Stage 2. This is essentially the production version, sanitized a bit:

import openai
import os
from pydantic import BaseModel
from typing import List

client = openai.OpenAI(
    base_url="https://global-apis.com/v1",
    api_key=os.environ["GLOBAL_API_KEY"],
)

class ActionItem(BaseModel):
    owner: str
    task: str
    deadline: str
    priority: str  # "high", "medium", "low"

class MeetingNotes(BaseModel):
    summary: str
    key_decisions: List[str]
    action_items: List[ActionItem]
    open_questions: List[str]

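EXTRACTION_SYSTEM = """You are a meeting notes extraction engine.
Given a transcript, output a structured summary as a single JSON object with
these keys: summary (string), key_decisions (list of strings),
action_items (list of objects with owner, task, deadline, priority),
open_questions (list of strings).
Rules:
- Only extract action items that have a clear owner and task
- If no deadline is mentioned, use "TBD"
- priority must be "high", "medium", or "low"
- Decisions must be explicitly stated, not inferred
- Skip small talk entirely
"""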

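def extract_notes(transcript: str) -> MeetingNotes:
    """Run the extraction pass and validate the reply against MeetingNotes."""
    response = client.chat.completions.create(
        model="deepseek-ai/DeepSeek-V4-Pro",
        messages=[
            {"role": "system", "content": EXTRACTION_SYSTEM},
            {"role": "user", "content": transcript},
        ],
        # JSON mode asks for a JSON object; it does not enforce our schema.
        response_format={"type": "json_object"},
        temperature=0.1,  # low temperature keeps the model close to the transcript
    )
    content = response.choices[0].message.content
    if content is None:
        raise ValueError("model returned an empty message")
    # Raises pydantic.ValidationError if keys are missing or mistyped.
    return MeetingNotes.model_validate_json(content)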

Notice the temperature=0.1. This was a late change I made and it probably saved me the most sanity of any single line in the codebase. At temperature 0.7 (the default I started with), the model would sometimes invent action items that weren't in the transcript. At 0.1, it sticks to what's actually there. For extraction tasks, you want low temperature. Every time.

The Five Things I Wish I'd Known Day One

Let me give you the operational lessons, because the API docs don't.

1. Cache aggressively. Under the hood, every LLM provider bills you for input tokens on every request, even if 90% of the prompt is identical to the last one. Meeting notes are particularly cache-friendly: the system prompt doesn't change, the output format spec doesn't change, and only the transcript varies. Implementing prompt caching dropped my input costs by 40% in the first week. That's not a typo. Forty percent.
2. Stream responses for the UI, batch for the backend. This is an RFC 2119-level MUST. If you're calling the LLM from a user-facing endpoint, stream the response. If you're calling it from a worker queue, don't bother: the overhead of streaming protocols in async workers is just complexity for no perceived benefit. I have two separate client wrappers for this.
3. The "cheap" tier is not always the right choice. GLM-4 Plus at $0.20 input and $0.80 output is genuinely cheap. But for the structured extraction task specifically, I got noticeably worse results on edge cases, such as when speakers used sarcasm or when the meeting drifted between topics without clear transitions. The quality difference matters when the output is going to a real person who'll lose trust in your product after two bad summaries.
4. Monitor quality, not just cost. I built a small evaluation harness that runs 50 hand-labeled transcripts through the pipeline weekly and scores the output on action item accuracy, summary faithfulness, and hallucination rate. This caught a regression last month where a model update slightly changed the output format and broke my JSON parsing. Without the harness, I would have shipped broken notes to users for days.
5. Implement fallback, gracefully. Every LLM provider will rate-limit you at the worst possible moment. I learned this during a 4 PM Friday demo when the API returned 429s for ten straight minutes. Now I have a fallback chain: DeepSeek V4 Pro → DeepSeek V4 Flash → GLM-4 Plus. The fallback doesn't need to be as good as the primary; it needs to not return 500 to the user.
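The fallback chain from the last point is the easiest of these to show in code. This is a minimal sketch, not my production wrapper: the `call` argument is a hypothetical function that takes a model ID and returns the model's reply, and the Flash and GLM model IDs are guesses at the naming scheme (only the Pro ID appears in my code above).

```python
from typing import Callable, Sequence, Tuple, Type

FALLBACK_CHAIN = (
    "deepseek-ai/DeepSeek-V4-Pro",
    "deepseek-ai/DeepSeek-V4-Flash",  # assumption: ID not confirmed
    "zhipu/GLM-4-Plus",               # assumption: ID not confirmed
)


class AllModelsFailed(RuntimeError):
    """Raised when every model in the chain errored."""


def call_with_fallback(
    call: Callable[[str], str],
    models: Sequence[str] = FALLBACK_CHAIN,
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
) -> Tuple[str, str]:
    """Try each model in order; return (model_used, reply) from the first success.

    Only exceptions listed in retry_on trigger a fallback; anything else
    propagates immediately so real bugs are not silently swallowed.
    """
    errors = []
    for model in models:
        try:
            return model, call(model)
        except retry_on as exc:
            errors.append(f"{model}: {exc!r}")
    raise AllModelsFailed("; ".join(errors))
```

With the OpenAI Python SDK you would probably narrow `retry_on` to something like `(openai.RateLimitError, openai.APIConnectionError, openai.InternalServerError)` so that a bad request fails fast instead of being retried on three models. Returning the model ID alongside the reply lets you log how often the fallback actually fires.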

A Real Cost Comparison

Let me put actual numbers on this. Assume you process 100 meetings per day, average 10K tokens per transcript, with the extraction prompt adding 2K input tokens and the structured output being 500 tokens.

At GPT-4o pricing (what I started with):

Input: 12K tokens × 100 meetings × 30 days = 36M tokens × $2.50/M = $90
Output: 500 tokens × 100 meetings × 30 days = 1.5M tokens × $10.00/M = $15
Total: $105/month for 3,000 meetings

Wait, that seems low. Let me redo this. Oh, I see my error — I was ignoring the fact that I was running the extraction three times per meeting in my first version because I was iterating on the prompt and forgot to turn off the parallel calls. Real number: $14K in month one. Embarrassing.

At DeepSeek V4 Pro pricing (what I run now, with caching and the right temperature):

Input: 36M tokens × $0.55/M = $19.80
Output: 1.5M tokens × $2.20/M = $3.30
Total: ~$23/month for 3,000 meetings

On these estimates that's a cost reduction of roughly 78% ($105 down to about $23), and the quality is better because I cleaned up the rest of the pipeline at the same time. The raw model swap alone wouldn't have given me that; most of the savings came from prompt caching, reducing the number of LLM calls per meeting, and not running the extraction three times because I "wanted to compare outputs."
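If you want to rerun this arithmetic with your own volumes, here's a small sketch of the calculation. The function name and the `calls_per_meeting` parameter are mine, added for illustration; it ignores caching discounts and retries, so treat the result as a floor rather than a forecast.

```python
def monthly_cost(
    meetings_per_day: int,
    input_tokens: int,
    output_tokens: int,
    input_price_per_m: float,
    output_price_per_m: float,
    calls_per_meeting: int = 1,
    days: int = 30,
) -> float:
    """Monthly spend in dollars; prices are per million tokens."""
    calls = meetings_per_day * days * calls_per_meeting
    input_cost = calls * input_tokens / 1_000_000 * input_price_per_m
    output_cost = calls * output_tokens / 1_000_000 * output_price_per_m
    return round(input_cost + output_cost, 2)


# 10K transcript + 2K prompt in, 500 tokens out, 100 meetings a day.
gpt4o = monthly_cost(100, 12_000, 500, 2.50, 10.00)
deepseek_pro = monthly_cost(100, 12_000, 500, 0.55, 2.20)
# Same GPT-4o workload with the extraction accidentally run three times.
gpt4o_tripled = monthly_cost(100, 12_000, 500, 2.50, 10.00, calls_per_meeting=3)

print(gpt4o, deepseek_pro, gpt4o_tripled)
```

The first two values reproduce the $105 and roughly $23 totals above. The `calls_per_meeting` knob is the one worth playing with, because duplicate calls multiply the whole bill rather than just one line of it.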

The Tradeoffs I Accepted

I should be honest about what I gave up. DeepSeek V4 Pro is not as good as GPT-4o at nuanced language tasks. If a meeting involves a lot of subtle negotiation or reading between the lines, you'll notice. For my client's use case (mostly internal product syncs and standups), this doesn't matter. If I were summarizing depositions or high-stakes sales calls, I'd think harder about it.

Also, the latency story is mixed. GPT-4o is fast for short completions but unpredictable for structured outputs. DeepSeek V4 Pro averages 1.2s for my typical extraction, which is acceptable but not snappy. I've thought about running the model on a smaller instance and using speculative decoding, but that's a rabbit hole for another day.

What I'd Do Differently If I Started Over

Build the evaluation harness on day one, not month three. I cannot stress this enough. The eval harness is the single most valuable piece of infrastructure in the whole project, and I built it after a painful regression. Write 20-30 hand-labeled test cases, set up a CI step that runs them, and check the diff on every prompt change.
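For a sense of how small a first version can be, here's a sketch of the action-item part of such a harness. It is deliberately naive: it matches on exact (owner, task) pairs after lowercasing, which is an assumption that will undercount paraphrased tasks, and the names and case format are made up for this example.

```python
from typing import Callable, Dict, Iterable, List, Set, Tuple

Item = Tuple[str, str]  # (owner, task), normalised


def normalise(items: Iterable[Dict[str, str]]) -> Set[Item]:
    """Lowercase and trim so trivial formatting differences don't count as misses."""
    return {(i["owner"].strip().lower(), i["task"].strip().lower()) for i in items}


def score_case(
    predicted: List[Dict[str, str]], expected: List[Dict[str, str]]
) -> Dict[str, float]:
    """Precision, recall, and count of items the model produced that aren't labeled."""
    pred, gold = normalise(predicted), normalise(expected)
    hits = len(pred & gold)
    return {
        "precision": hits / len(pred) if pred else 1.0,
        "recall": hits / len(gold) if gold else 1.0,
        "hallucinated": float(len(pred - gold)),
    }


def run_suite(
    cases: List[Dict], extract: Callable[[str], List[Dict[str, str]]]
) -> Dict[str, float]:
    """cases: [{"transcript": str, "action_items": [...]}]; extract: transcript -> items."""
    if not cases:
        raise ValueError("no test cases supplied")
    scores = [score_case(extract(c["transcript"]), c["action_items"]) for c in cases]
    n = len(scores)
    return {
        "precision": sum(s["precision"] for s in scores) / n,
        "recall": sum(s["recall"] for s in scores) / n,
        "hallucinated": sum(s["hallucinated"] for s in scores),
    }
```

In CI you would call `run_suite` with your real extraction function and fail the build when recall or precision drops below whatever threshold you've measured as acceptable. A format-breaking model update shows up here as recall collapsing to zero, which is hard to miss.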

Start with the cheap model. Optimize quality upward only when you have a metric to measure it. I started with GPT-4o because I assumed "premium" meant "better," and that assumption cost me three months of burn.

Don't trust your first cost estimate. Mine was off by 40x. Build a real usage simulator that accounts for retries, prompt growth, and the fact that you will, at some point, accidentally call the model in a loop.

The Bigger Picture

There are 184 models on Global API at this point, ranging from $0.01 to $3.50 per million tokens. I haven't tried all of them, obviously — nobody has — but I've tried enough to know that the answer to "which model should I use for X" is almost always "the cheapest one that meets your quality bar, and you find your quality bar by measuring."

The whole "AI meeting notes" category is going to keep growing, and the differentiator is going to be less about the model and more about the pipeline around it. The LLM is a commodity. The eval harness, the caching layer, the fallback strategy, and the structured output validation are the actual product.

If you're building something in this space, I'd genuinely recommend checking out Global API. The unified SDK means I can A/B test models in an afternoon instead of doing a separate integration for each provider, and the pricing is competitive. I'm not affiliated with them — I just wish it had existed when I started this project. Take a look at the pricing page and grab the free credits if you want to kick the tires; they let you test across all 184 models without committing to anything. That's how I found DeepSeek V4 Pro, and it's saved me a lot of money since.
