Whisper + LLM Task Extraction: My Meeting Intelligence Architecture
Last quarter, our team was drowning in meeting notes. We had 40+ meetings per week across 12 people, and action items were scattered across Slack, email, and Google Docs. Someone would inevitably miss a deadline because a task got buried in a 2000-word transcript. So I built a system that listens to meetings, extracts structured tasks, and routes them to the right people. It's been running in production for 6 months, processing ~200 meetings monthly. Here's exactly how it works.
The Problem With Naive Transcription
You might think: "Just use Whisper to transcribe, then ask an LLM to extract tasks." That's a starting point, but it fails in practice.
The issues:
- Whisper produces 3000-5000 word transcripts. LLMs struggle to extract precise tasks from walls of text.
- A 45-minute meeting transcript costs $0.15-0.30 to process with GPT-4. At scale, this adds up.
- You lose context about who said what and when they committed to something.
- Generic prompts produce 10 tasks when there are actually 3 real ones. You get noise, not signal.
I needed a multi-stage pipeline: transcribe → segment → classify → extract → validate.
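The later stages assume a timestamped, speaker-labeled transcript. A minimal formatter sketch for that step, assuming Whisper-style `segments` dicts with `start` offsets in seconds (`format_transcript` and the `speaker` key are my additions — vanilla Whisper doesn't do speaker diarization, so that field would come from a separate diarization pass):

```python
def format_transcript(segments: list[dict]) -> str:
    """Turn Whisper-style segments into '[m:ss] Speaker: text' lines.

    Assumes each segment dict has 'start' (seconds) and 'text', plus an
    optional 'speaker' key added by a diarization step -- Whisper alone
    does not label speakers.
    """
    lines = []
    for seg in segments:
        minutes, seconds = divmod(int(seg["start"]), 60)
        speaker = seg.get("speaker", "Unknown")
        lines.append(f"[{minutes}:{seconds:02d}] {speaker}: {seg['text'].strip()}")
    return "\n".join(lines)


# Example with Whisper-shaped segments (speaker labels added by hand here)
segments = [
    {"start": 0.0, "text": " We need the Q3 report.", "speaker": "Alice"},
    {"start": 75.4, "text": " I'll draft it by Friday.", "speaker": "Bob"},
]
print(format_transcript(segments))
```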
Architecture Overview
```
┌─────────────────────────────────────┐
│          Meeting audio file         │
└──────────────────┬──────────────────┘
                   │
                   ▼
┌─────────────────────────────────────┐
│      Whisper (speech-to-text)       │
│      - Local or API                 │
│      - Timestamps + speaker labels  │
└──────────────────┬──────────────────┘
                   │
                   ▼
┌─────────────────────────────────────┐
│      Segment by speaker turns       │
│      - Group into logical chunks    │
│      - Max 300 tokens per segment   │
└──────────────────┬──────────────────┘
                   │
                   ▼
┌─────────────────────────────────────┐
│         Classify segments           │
│    - Decision/Action/Discussion     │
│    - Cheap LLM (Claude Haiku)       │
└──────────────────┬──────────────────┘
                   │
        ┌──────────┼──────────┐
        │          │          │
        ▼          ▼          ▼
  ┌─────────┐ ┌─────────┐ ┌──────────┐
  │Decision │ │ Action  │ │Discussion│
  │ (skip)  │ │(extract)│ │  (skip)  │
  └─────────┘ └────┬────┘ └──────────┘
                   │
                   ▼
┌─────────────────────────────────────┐
│        Extract task details         │
│   - Owner, deadline, dependencies   │
│   - More capable LLM (Claude 3.5)   │
└──────────────────┬──────────────────┘
                   │
                   ▼
┌─────────────────────────────────────┐
│      Structured output (JSON)       │
│      - Deduplicate                  │
│      - Route to project management  │
└─────────────────────────────────────┘
```
The key insight: use cheap models for classification, expensive ones only for extraction. This cuts costs by 70%.
Implementation: The Real Code
Here's the production pipeline I use, simplified for clarity:
```python
import anthropic
import json
from typing import TypedDict


class TaskExtraction(TypedDict):
    owner: str
    task: str
    deadline: str
    priority: str
    dependencies: list[str]


class MeetingProcessor:
    def __init__(self, api_key: str):
        self.client = anthropic.Anthropic(api_key=api_key)

    def segment_transcript(self, transcript: str, max_tokens: int = 300) -> list[str]:
        """Split transcript into chunks of speaker lines.

        Uses word count as a cheap proxy for token count.
        """
        segments = []
        current_segment = ""
        for line in transcript.split("\n"):
            if len(current_segment.split()) > max_tokens:
                segments.append(current_segment)
                current_segment = line
            else:
                # Avoid a stray leading newline on the first line
                current_segment = f"{current_segment}\n{line}" if current_segment else line
        if current_segment:
            segments.append(current_segment)
        return segments

    def classify_segment(self, segment: str) -> str:
        """Cheap classification: is this an action, decision, or discussion?"""
        response = self.client.messages.create(
            model="claude-3-5-haiku-20241022",
            max_tokens=50,
            messages=[
                {
                    "role": "user",
                    "content": f"""Classify this meeting segment as one of: ACTION, DECISION, DISCUSSION.

Segment:
{segment}

Return only the classification word.""",
                }
            ],
        )
        # Normalize casing so downstream comparisons are reliable
        return response.content[0].text.strip().upper()

    def extract_tasks(self, segments_with_context: list[dict]) -> list[TaskExtraction]:
        """Extract structured tasks from ACTION segments only."""
        action_segments = [
            s for s in segments_with_context
            if s["classification"] == "ACTION"
        ]
        if not action_segments:
            return []

        combined_context = "\n\n".join(
            f"[{s['timestamp']}] {s['text']}"
            for s in action_segments
        )
        response = self.client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=1500,
            messages=[
                {
                    "role": "user",
                    "content": f"""Extract all action items from this meeting excerpt.

Return a JSON array of tasks with this structure:
{{
  "owner": "person name or 'TBD'",
  "task": "specific action",
  "deadline": "date or 'ASAP' or 'TBD'",
  "priority": "high/medium/low",
  "dependencies": ["other task ids if mentioned"]
}}

Meeting excerpt:
{combined_context}

Return ONLY a valid JSON array, no markdown formatting.""",
                }
            ],
        )
        try:
            return json.loads(response.content[0].text)
        except json.JSONDecodeError:
            print(f"Failed to parse: {response.content[0].text}")
            return []

    def process_meeting(self, transcript: str) -> list[TaskExtraction]:
        """Full pipeline: segment → classify → extract."""
        segments = self.segment_transcript(transcript)

        # Classify all segments (cheap pass)
        segments_with_class = [
            {
                "text": seg,
                "classification": self.classify_segment(seg),
                "timestamp": "0:00",  # You'd extract real timestamps from Whisper
            }
            for seg in segments
        ]

        # Extract tasks from ACTION segments only
        tasks = self.extract_tasks(segments_with_class)

        # Deduplicate by (owner, task) pair
        seen = set()
        unique_tasks = []
        for task in tasks:
            task_key = (task["owner"], task["task"])
            if task_key not in seen:
                seen.add(task_key)
                unique_tasks.append(task)
        return unique_tasks


# Usage
processor = MeetingProcessor(api_key="your-key")
with open("meeting_transcript.txt") as f:
    transcript = f.read()

tasks = processor.process_meeting(transcript)
print(json.dumps(tasks, indent=2))
```
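One failure mode worth hardening against: even with "Return ONLY valid JSON" in the prompt, models occasionally wrap their output in markdown fences. A small cleanup helper (my addition, not part of the pipeline above) you could call before `json.loads`:

```python
import json


def parse_json_response(raw: str) -> list:
    """Strip markdown code fences the model may add, then parse JSON.

    Returns [] if parsing still fails, matching the pipeline's
    existing fallback behavior.
    """
    text = raw.strip()
    if text.startswith("```"):
        # Drop the opening fence line (possibly "```json")
        text = text.split("\n", 1)[1] if "\n" in text else ""
        # Drop the closing fence if present
        if text.rstrip().endswith("```"):
            text = text.rstrip()[:-3]
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return []


print(parse_json_response('```json\n[{"owner": "Bob", "task": "Draft Q3 report"}]\n```'))
```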
This approach costs ~$0.04-0.08 per meeting. At 200 meetings/month, that's $8-16 in LLM costs alone.
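To sanity-check those numbers yourself, here's the back-of-the-envelope math. The per-million-token prices below are illustrative assumptions, not current rate cards; the point is the shape of the saving from classifying with a cheap model before extracting with an expensive one:

```python
# Rough cost model for the two-stage pipeline. Prices per million
# tokens are illustrative assumptions -- check current pricing.
CHEAP_PER_MTOK = 1.00        # e.g. a Haiku-class model
EXPENSIVE_PER_MTOK = 15.00   # e.g. a Sonnet/GPT-4-class model


def meeting_cost(total_tokens: int, action_fraction: float) -> float:
    """Classify everything cheaply; extract only from ACTION segments."""
    classify = total_tokens * CHEAP_PER_MTOK / 1_000_000
    extract = total_tokens * action_fraction * EXPENSIVE_PER_MTOK / 1_000_000
    return classify + extract


tokens = 5_000  # roughly a 45-minute meeting transcript
two_stage = meeting_cost(tokens, action_fraction=0.2)
single_stage = tokens * EXPENSIVE_PER_MTOK / 1_000_000
print(f"two-stage: ${two_stage:.3f}, single-stage: ${single_stage:.3f}")
```

With these assumed prices and ~20% of segments classified as ACTION, the two-stage pipeline comes out roughly 70% cheaper than sending everything to the big model, consistent with the savings figure above.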
Cost Optimization: Using OpenRouter
If you're processing lots of meetings, you'll want to try different models and providers. I use OpenRouter for this—it abstracts away provider switching and gives you a single API key to test Claude, GPT-4, Llama, and others.
Here's a variant that lets you swap models easily:
```python
import requests
import json


class OpenRouterTaskExtractor:
    def __init__(self, api_key: str):
        self.api_key = api_key
        self.base_url = "https://openrouter.ai/api/v1"

    def classify_segment(self, segment: str, model: str = "anthropic/claude-3-5-haiku") -> str:
        """Classify using OpenRouter—easy model swapping."""
        response = requests.post(
            f"{self.base_url}/chat/completions",
            headers={
                "Authorization": f"Bearer {self.api_key}",
                "HTTP-Referer": "https://your-app.com",
            },
            json={
                "model": model,
                "messages": [
                    {
                        "role": "user",
                        "content": f"Classify as ACTION, DECISION, or DISCUSSION:\n{segment}",
                    }
                ],
                "max_tokens": 50,
            },
        )
        return response.json()["choices"][0]["message"]["content"].strip()

    def extract_tasks(self, segments: str, model: str = "anthropic/claude-3-5-sonnet") -> list:
        """Extract tasks using OpenRouter."""
        response = requests.post(
            f"{self.base_url}/chat/completions",
            headers={
                "Authorization": f"Bearer {self.api_key}",
            },
            json={
                "model": model,
                "messages": [
                    {
                        "role": "user",
                        "content": "Extract all action items as a JSON array of objects "
                        "with keys owner, task, deadline, priority, dependencies. "
                        f"Return ONLY valid JSON.\n\nMeeting excerpt:\n{segments}",
                    }
                ],
                "max_tokens": 1500,
            },
        )
        try:
            return json.loads(response.json()["choices"][0]["message"]["content"])
        except json.JSONDecodeError:
            return []
```
---
**Need an AI system for your business?**
I'm Alessandro Trimarco, AI engineer behind a 6-module AI stack for a 14-location restaurant chain (236 users, ~€88k/month processed).
📬 **[alevibecoding@gmail.com](mailto:alevibecoding@gmail.com)** · [Portfolio](https://alessandrotrimarco.github.io) · [Case study](https://github.com/AlessandroTrimarco/aires-burger-case-study)