TL;DR
Asking an LLM to do everything in one shot degrades quality. Task decomposition + caching made it both cheaper and better.
I split a 57-minute meeting transcript into 16 separate LLM calls using Gemini 2.5 Flash-Lite. Total cost per run: $0.005. The result was higher quality meeting notes than a single call to a more expensive model. Here's the design and the real numbers.
1. Why Meeting Notes Are a Hard Task for LLMs
Meeting transcripts are messy by nature:
- Spoken language has no structure — it's hard to read as-is
- Topics jump around, and multiple threads interleave
- A one-hour meeting easily runs into thousands or tens of thousands of tokens
The expected outputs are diverse: summary, decisions, action items, discussion context and background — each requiring a different reading strategy and extraction logic. If speaker diarization is available, you can also extract per-speaker summaries to quickly identify who said what and each person's role in the meeting.
In short, the task is: "Extract information from a long, unstructured text along multiple different axes." Cramming all of that into a single prompt causes problems.
2. What Happens When You Ask an LLM to Do Everything at Once
My first implementation was exactly this. One prompt: "Generate all sections of the meeting notes from this transcript." Summary, decisions, speaker summaries — all in one shot.
The results were poor. Specifically:
- Speaker summaries were copy-paste duplicates. Four speakers who played different roles all got the same bullet list of agenda topics. Their individual contributions were invisible.
- Status reports leaked into action items. "We're not making progress, so we'll continue as-is" got rewritten as "Confirm the current status" — an action item that nobody committed to.
- Discussion details collapsed into restated decisions. The reasoning, trade-offs, and disagreements behind decisions were lost. Only the conclusions remained, repeated.
These aren't model capability issues — they're task design issues. Stuffing every section into a single prompt means the model can't allocate sufficient attention to each one.
3. Let Code Own the Structure, Give the LLM One Job at a Time
The mindset shift:
| Old Approach | New Approach |
|---|---|
| Let the LLM do everything | Use the LLM as a specialist for each subtask |
| One call to rule them all | Build a pipeline in code |
| Upgrade the model to fix quality | Decompose the task to fix quality |
For meeting notes, the program manages what gets generated in what order. Each LLM call receives one clear instruction.
Here's how it breaks down:
Step 1 (parallel): Generate summary / Extract decisions / Per-speaker summaries (4 speakers × 1 each) = 6 calls
Step 2 (sequential): Extract action item candidates → Review and filter = 2 calls
Step 3a: Extract discussion titles (using Step 1 results) = 1 call
Step 3b (parallel): Generate background for each title (7 topics × 1 each) = 7 calls
Each step gives the LLM a single role. No competing instructions, and each section's quality stabilizes.
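The steps above can be sketched as a small asyncio pipeline. Here `call_llm` is a hypothetical stub standing in for one Gemini call (the real version would attach the cached transcript); the point is that the code, not the model, owns the ordering and the dependencies:

```python
import asyncio

# Hypothetical stand-in for one LLM call; the real version would hit the
# Gemini API with the cached transcript referenced by id.
async def call_llm(prompt: str) -> str:
    await asyncio.sleep(0)  # placeholder for network latency
    return f"<output for: {prompt}>"

async def generate_notes(speakers: list[str]) -> dict:
    # Step 1 (parallel): summary, decisions, and one call per speaker.
    step1 = await asyncio.gather(
        call_llm("Summarize the meeting."),
        call_llm("Extract decisions."),
        *[call_llm(f"Summarize speaker {s} only.") for s in speakers],
    )
    summary, decisions, *speaker_summaries = step1

    # Step 2 (sequential): candidates first, then a review pass that filters.
    candidates = await call_llm("Extract up to 10 action item candidates.")
    actions = await call_llm(f"Review and filter these candidates:\n{candidates}")

    # Step 3a: discussion titles, using the Step 1 summary as context.
    titles_raw = await call_llm(f"List discussion titles based on:\n{summary}")
    titles = [t for t in titles_raw.splitlines() if t]  # naive split for the sketch

    # Step 3b (parallel): one background call per title.
    backgrounds = await asyncio.gather(
        *[call_llm(f"Explain the background of: {t}") for t in titles]
    )

    return {
        "summary": summary,
        "decisions": decisions,
        "speakers": dict(zip(speakers, speaker_summaries)),
        "actions": actions,
        "discussions": dict(zip(titles, backgrounds)),
    }
```

`asyncio.gather` fans out within Step 1 and Step 3b, while Step 2 stays strictly sequential because the review call consumes the candidate call's output.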
Action items needed special attention. Initially I generated them per-speaker, but tasks that belonged to one speaker kept leaking into another's list. The current approach uses a two-step candidate extraction + review pipeline: first, extract up to 10 candidates referencing the decisions; then, filter out duplicates, items with unclear owners, and overlaps with decisions.
For a 57-minute meeting with 4 speakers and 7 discussion topics, this adds up to 16 LLM calls. "Won't 16 calls blow up the cost?" — that's where caching comes in.
4. Caching: Read the Transcript Once, Reference It 16 Times
Splitting into multiple calls means sending the same transcript repeatedly. That's where caching saves you.
Register the transcript with the API once, and reference it by ID in subsequent calls. I'm using Gemini's Context Caching. Cached token reads cost 1/10th of the normal input price (Gemini 2.5 Flash-Lite: $0.10/1M normal → $0.01/1M cached reads).
Step 0: Register transcript as cache → get cacheId
Steps 1–3: Each LLM call references cacheId
→ No transcript re-upload
→ Only cached read pricing applies
finally: Delete cache (TTL=120s, auto-expires)
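That lifecycle reads naturally as a context manager. `CacheClient` below is a hypothetical in-memory stand-in for the real caching API, not Gemini's SDK; only the register, reference, delete shape is the point:

```python
import uuid
from contextlib import contextmanager

class CacheClient:
    """Hypothetical in-memory stand-in for a context-caching API."""
    def __init__(self):
        self._store = {}

    def create(self, content: str, ttl_seconds: int) -> str:
        cache_id = str(uuid.uuid4())
        self._store[cache_id] = content
        return cache_id

    def exists(self, cache_id: str) -> bool:
        return cache_id in self._store

    def delete(self, cache_id: str) -> None:
        self._store.pop(cache_id, None)

@contextmanager
def cached_transcript(client: CacheClient, transcript: str):
    # Step 0: register the transcript once and get an id back.
    cache_id = client.create(transcript, ttl_seconds=120)
    try:
        # Steps 1-3: callers reference cache_id instead of re-sending the text.
        yield cache_id
    finally:
        # Explicit cleanup; the short TTL is the safety net if this never runs.
        client.delete(cache_id)
```

The `finally` branch mirrors the design above: delete eagerly, and let the 120-second TTL catch anything a crash leaves behind.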
Other APIs offer similar mechanisms. Anthropic's Prompt Caching also prices cached token reads at 1/10th of normal input.
5. The Real Cost: Input Dominates, and That's the Point
With 16 calls reading the same transcript, 76% of the total cost comes from input (including cached reads).
Measured Cost (57-min meeting, 4 speakers, 7 discussion topics, 16 calls)
| Item | Tokens | Unit Price | Cost | Share |
|---|---|---|---|---|
| Cache creation | 11,978 | $0.10/1M | $0.001198 | 23% |
| Cached reads (16 calls) | 191,648 | $0.01/1M | $0.001916 | 37% |
| Additional input (per-call deltas) | 8,729 | $0.10/1M | $0.000873 | 17% |
| Output (all 16 calls combined) | 3,153 | $0.40/1M | $0.001261 | 24% |
| Total | 215,508 | | $0.0052 | 100% |
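The table's arithmetic can be checked in a few lines (token counts and unit prices copied from the rows above):

```python
# Reproduce the cost table: tokens * unit price per million tokens.
PER_MILLION = 1_000_000
rows = {
    "cache_creation":   (11_978,  0.10),
    "cached_reads":     (191_648, 0.01),  # 16 calls * 11,978 tokens
    "additional_input": (8_729,   0.10),
    "output":           (3_153,   0.40),
}
costs = {k: tokens * price / PER_MILLION for k, (tokens, price) in rows.items()}
total = sum(costs.values())

# Everything except output is input-side cost.
input_share = (total - costs["output"]) / total
print(f"total = ${total:.4f}, input share = {input_share:.0%}")
# → total = $0.0052, input share = 76%
```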
This design deliberately spends more on input to improve quality on each subtask. The trade-off works because cached reads on a budget model are dirt cheap ($0.01/1M).
This is where budget and premium models diverge. Upgrading the model raises input prices, making "read the same transcript 16 times" prohibitively expensive. With a budget model, 16 reads still cost $0.005. A budget model reading generously beats a premium model taking a single shot — on both cost and quality.
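To make the divergence concrete, here is a rough input-cost comparison. The $1.25/1M premium price is an assumption for illustration, not a quote from any provider's price sheet:

```python
# Illustrative only: compare 16 cached reads on a budget model against a
# single full read on an assumed premium model at $1.25/1M input.
TRANSCRIPT_TOKENS = 11_978
PER_MILLION = 1_000_000

budget = (TRANSCRIPT_TOKENS * 0.10            # one cache write at the normal rate
          + 16 * TRANSCRIPT_TOKENS * 0.01) / PER_MILLION  # 16 cached reads
premium = TRANSCRIPT_TOKENS * 1.25 / PER_MILLION          # one uncached read

print(f"16 budget reads: ${budget:.4f}, 1 premium read: ${premium:.4f}")
```

Under these assumptions, sixteen cached passes over the transcript cost a fraction of what a premium model would charge for reading it once.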
At 1,000 runs per month, that's roughly $5/month.
6. Two More Benefits
Parallel Execution Makes It Faster
With decomposed tasks, independent calls run simultaneously. Summary, decisions, and speaker summaries fire in parallel, so the step's latency equals its slowest single call — often shorter than one monolithic generation.
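A toy simulation shows why: with `asyncio.gather`, the wall-clock time of a parallel step tracks its slowest call, not the sum. The sleep durations below are stand-ins for per-call latency:

```python
import asyncio
import time

async def fake_call(seconds: float) -> None:
    await asyncio.sleep(seconds)  # stand-in for one LLM call's latency

async def run_step() -> float:
    durations = [0.3, 0.2, 0.25, 0.2, 0.3, 0.25]  # e.g. Step 1's six calls
    start = time.perf_counter()
    await asyncio.gather(*(fake_call(d) for d in durations))
    return time.perf_counter() - start

elapsed = asyncio.run(run_step())
# elapsed is close to max(durations) = 0.3s, far below sum(durations) = 1.5s
```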
Quality Variance Drops
A single prompt handling multiple tasks produces uneven output: one section is great, another is thin. Isolating each task into its own call keeps the scope small and the quality consistent.
For per-speaker summaries specifically, the instruction "Generate a summary for Speaker A only" eliminates cross-contamination from other speakers' content. Filtering by minimum utterance count also prevents the model from fabricating summaries for participants who barely spoke.
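The utterance-count filter is a plain pre-processing step in code rather than an LLM instruction. A minimal sketch, with `MIN_UTTERANCES = 3` as an assumed threshold:

```python
from collections import Counter

MIN_UTTERANCES = 3  # assumed threshold for illustration

def speakers_to_summarize(utterances: list[tuple[str, str]]) -> list[str]:
    """Keep only speakers with enough utterances to summarize honestly.

    Each utterance is a (speaker, text) pair; speakers below the threshold
    are skipped so the model never fabricates a summary for someone who
    barely spoke.
    """
    counts = Counter(speaker for speaker, _ in utterances)
    return [s for s, n in counts.items() if n >= MIN_UTTERANCES]
```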
7. What This Design Can't Fully Solve
In fairness, some challenges persist even with task decomposition.
Action item accuracy remains the hardest problem. The model still sometimes converts "we're at capacity, so no changes" (a status report) into "confirm capacity" (an action item). It also tends to pad the list by rephrasing the same item three ways.
The root cause: "don't write what wasn't said" is an inherently difficult constraint for LLMs. They're optimized to generate plausible interpretations, so even when a meeting has no clear commitments, the model will invent action items to fill the section.
The two-step candidate + review pipeline mitigates this — extracting broadly first, then filtering out items with unclear owners or overlapping decisions — but it doesn't fully solve it.
In practice, having reasonably accurate meeting notes available immediately after the meeting is already a major win. While memories are fresh, participants can spot misattributions or missing items quickly.
If higher precision is needed for specific sections, you can copy the transcript and re-run it through a more capable model. Use the cheapest model for instant full coverage, then selectively upgrade where precision matters — that's the practical sweet spot.
8. Why I Obsess Over $0.005
Why care so much about half a cent?
This design powers Lightning Notes, an iPhone transcription and meeting notes app I built. The app is completely free, ad-supported — revenue comes from interstitial ads at roughly $0.007–0.015 per impression. If the LLM cost per run exceeds the ad revenue, every meeting note generated loses money. Keeping LLM costs under one cent per run isn't a nice-to-have — it's a business survival requirement.
Lightning Notes provides all of these features for free:
- On-device transcription, speaker diarization, and automatic speaker assignment — audio never leaves the iPhone. Register a speaker once, and their voiceprint is used to auto-identify them in future meetings or suggest AI candidates.
- High-quality meeting notes powered by Gemini — using the pipeline design from this article: summary, decisions, action items, discussion details, and per-speaker summaries, all auto-generated.
- Cross-meeting person summaries powered by Gemini — view any person's speaking patterns and areas of responsibility across multiple meetings.
Curious if $0.005 can really produce useful meeting notes? Try it yourself.
9. Takeaways
Before upgrading your model, check:
- Are you cramming multiple roles into a single prompt?
- Can the structure (ordering, dependencies) be managed in code?
- Can caching reduce your input costs?
"Model capability" and "task design" are separate problems. Fix what design can fix first. Model upgrades can wait.
In this case, 16 calls to Gemini 2.5 Flash-Lite cost $0.005 per run and produced higher quality output than a single call to a more expensive model.
Before upgrading the model, try redesigning the pipeline. That alone might improve both cost and quality.
This article is based on lessons learned while building Lightning Notes, an iPhone transcription and meeting notes app.