DEV Community

Ale Santini

Whisper + LLM Task Extraction: My Meeting Intelligence Architecture

Last quarter, our team was drowning in meeting notes. We had 40+ meetings per week across 12 people, and action items were scattered across Slack, email, and Google Docs. Someone would inevitably miss a deadline because a task got buried in a 2000-word transcript. So I built a system that listens to meetings, extracts structured tasks, and routes them to the right people. It's been running in production for 6 months, processing ~200 meetings monthly. Here's exactly how it works.

The Problem With Naive Transcription

You might think: "Just use Whisper to transcribe, then ask an LLM to extract tasks." That's a starting point, but it fails in practice.
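For concreteness, the naive single-pass version looks roughly like this. It's a sketch against the OpenAI REST endpoints; the model names, prompt, and placeholder key are mine for illustration, not from any production system:

```python
import requests

API_KEY = "your-openai-key"  # placeholder

def naive_pipeline(audio_path: str) -> str:
    # Step 1: one-shot transcription -- no timestamps or speaker labels kept
    with open(audio_path, "rb") as f:
        tr = requests.post(
            "https://api.openai.com/v1/audio/transcriptions",
            headers={"Authorization": f"Bearer {API_KEY}"},
            files={"file": f},
            data={"model": "whisper-1"},
        ).json()
    # Step 2: throw the whole 3000-5000 word wall of text at a single prompt
    chat = requests.post(
        "https://api.openai.com/v1/chat/completions",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={
            "model": "gpt-4",
            "messages": [{
                "role": "user",
                "content": "List the action items in this transcript:\n\n" + tr["text"],
            }],
        },
    ).json()
    return chat["choices"][0]["message"]["content"]
```

Two API calls, no structure, no routing. Every problem listed below falls out of this shape.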

The issues:

  • Whisper produces 3000-5000 word transcripts. LLMs struggle to extract precise tasks from walls of text.
  • A 45-minute meeting transcript costs $0.15-0.30 to process with GPT-4. At scale, this adds up.
  • You lose context about who said what and when they committed to something.
  • Generic prompts produce 10 tasks when there are actually 3 real ones. You get noise, not signal.

I needed a multi-stage pipeline: transcribe → segment → classify → extract → validate.

Architecture Overview

┌─────────────────────────────────────────────────────────────┐
│                      Meeting Audio File                      │
└────────────────────────┬────────────────────────────────────┘
                         │
                         ▼
        ┌────────────────────────────────────┐
        │  Whisper (speech-to-text)          │
        │  - Local or API                    │
        │  - Timestamps + speaker labels     │
        └────────────────┬───────────────────┘
                         │
                         ▼
        ┌────────────────────────────────────┐
        │  Segment by speaker turns          │
        │  - Group into logical chunks       │
        │  - Max 300 tokens per segment      │
        └────────────────┬───────────────────┘
                         │
                         ▼
        ┌────────────────────────────────────┐
        │  Classify segments                 │
        │  - Decision/Action/Discussion      │
        │  - Cheap LLM (Claude Haiku)        │
        └────────────────┬───────────────────┘
                         │
         ┌───────────────┼───────────────────┐
         │               │                   │
         ▼               ▼                   ▼
     ┌────────┐     ┌─────────┐        ┌──────────┐
     │Decision│     │ Action  │        │Discussion│
     │ (skip) │     │(extract)│        │ (skip)   │
     └────────┘     └────┬────┘        └──────────┘
                        │
                        ▼
        ┌────────────────────────────────────┐
        │  Extract task details              │
        │  - Owner, deadline, dependencies   │
        │  - More capable LLM (Claude 3.5)   │
        └────────────────┬───────────────────┘
                         │
                         ▼
        ┌────────────────────────────────────┐
        │  Structured output (JSON)          │
        │  - Deduplicate                     │
        │  - Route to project management     │
        └────────────────────────────────────┘

The key insight: use cheap models for classification, expensive ones only for extraction. This cuts costs by 70%.
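To make the two-tier economics concrete, here's a back-of-envelope sketch. The per-million-token prices and the 20% ACTION share are illustrative assumptions, not the production numbers:

```python
# Back-of-envelope model of the cheap-classify / expensive-extract split.
# All prices are assumed ballpark $/million tokens, not quoted rates.
HAIKU_IN = 0.80     # cheap classifier, input (assumed)
SONNET_IN = 3.00    # capable extractor, input (assumed)
SONNET_OUT = 15.00  # capable extractor, output (assumed)

def single_pass_cost(transcript_tokens: int, output_tokens: int = 500) -> float:
    """Baseline: feed the whole transcript to the capable model."""
    return (transcript_tokens * SONNET_IN + output_tokens * SONNET_OUT) / 1e6

def two_tier_cost(transcript_tokens: int, action_fraction: float = 0.2,
                  output_tokens: int = 500) -> float:
    """Classify everything cheaply; extract only the ACTION share expensively."""
    classify = transcript_tokens * HAIKU_IN / 1e6  # one-word outputs, negligible
    extract = (transcript_tokens * action_fraction * SONNET_IN
               + output_tokens * SONNET_OUT) / 1e6
    return classify + extract
```

The exact savings depend on how much of your meeting is actually actionable; the smaller the ACTION fraction, the more the tiering pays off.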

Implementation: The Real Code

Here's the production pipeline I use, simplified for clarity:

import anthropic
import json
from typing import TypedDict

class TaskExtraction(TypedDict):
    owner: str
    task: str
    deadline: str
    priority: str
    dependencies: list[str]

class MeetingProcessor:
    def __init__(self, api_key: str):
        self.client = anthropic.Anthropic(api_key=api_key)

    def segment_transcript(self, transcript: str, max_tokens: int = 300) -> list[str]:
        """Split transcript into speaker segments; word count approximates tokens."""
        segments = []
        current_segment = ""

        for line in transcript.split("\n"):
            if len(current_segment.split()) > max_tokens:
                segments.append(current_segment)
                current_segment = line
            elif current_segment:
                current_segment += "\n" + line
            else:
                current_segment = line

        if current_segment:
            segments.append(current_segment)
        return segments

    def classify_segment(self, segment: str) -> str:
        """Cheap classification: is this an action, decision, or discussion?"""
        response = self.client.messages.create(
            model="claude-3-5-haiku-20241022",
            max_tokens=50,
            messages=[
                {
                    "role": "user",
                    "content": f"""Classify this meeting segment as one of: ACTION, DECISION, DISCUSSION.

Segment:
{segment}

Return only the classification word."""
                }
            ]
        )
        return response.content[0].text.strip()

    def extract_tasks(self, segments_with_context: list[dict]) -> list[TaskExtraction]:
        """Extract structured tasks from ACTION segments only."""
        action_segments = [
            s for s in segments_with_context 
            if s["classification"] == "ACTION"
        ]

        if not action_segments:
            return []

        combined_context = "\n\n".join([
            f"[{s['timestamp']}] {s['text']}" 
            for s in action_segments
        ])

        response = self.client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=1500,
            messages=[
                {
                    "role": "user",
                    "content": f"""Extract all action items from this meeting excerpt. 
Return a JSON array of tasks with this structure:
{{
  "owner": "person name or 'TBD'",
  "task": "specific action",
  "deadline": "date or 'ASAP' or 'TBD'",
  "priority": "high/medium/low",
  "dependencies": ["other task ids if mentioned"]
}}

Meeting excerpt:
{combined_context}

Return ONLY valid JSON array, no markdown formatting."""
                }
            ]
        )

        try:
            tasks = json.loads(response.content[0].text)
            return tasks
        except json.JSONDecodeError:
            print(f"Failed to parse: {response.content[0].text}")
            return []

    def process_meeting(self, transcript: str) -> list[TaskExtraction]:
        """Full pipeline: segment → classify → extract."""
        segments = self.segment_transcript(transcript)

        # Classify all segments (cheap pass)
        segments_with_class = [
            {
                "text": seg,
                "classification": self.classify_segment(seg),
                "timestamp": "0:00"  # You'd extract real timestamps
            }
            for seg in segments
        ]

        # Extract tasks from ACTION segments only
        tasks = self.extract_tasks(segments_with_class)

        # Deduplicate by (owner, task) pair
        seen = set()
        unique_tasks = []
        for task in tasks:
            task_key = (task["owner"], task["task"])
            if task_key not in seen:
                seen.add(task_key)
                unique_tasks.append(task)

        return unique_tasks

# Usage
processor = MeetingProcessor(api_key="your-key")
with open("meeting_transcript.txt") as f:
    transcript = f.read()

tasks = processor.process_meeting(transcript)
print(json.dumps(tasks, indent=2))
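One fragile spot in the listing above is the bare json.loads at the end: even with "no markdown formatting" in the prompt, models occasionally wrap output in code fences or prose anyway. A small lenient parser (my own addition, not part of the production listing) recovers most of those cases:

```python
import json
import re

def parse_json_loosely(text: str) -> list:
    """Parse a model's JSON reply, tolerating ```json fences and surrounding prose."""
    text = text.strip()
    # Strip a fenced code block if the model added one despite instructions
    fence = re.match(r"^```(?:json)?\s*(.*?)\s*```$", text, re.DOTALL)
    if fence:
        text = fence.group(1)
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        # Last resort: grab the outermost [...] span
        start, end = text.find("["), text.rfind("]")
        if start == -1 or end <= start:
            return []
        try:
            parsed = json.loads(text[start:end + 1])
        except json.JSONDecodeError:
            return []
    return parsed if isinstance(parsed, list) else []
```

Swapping this in for the try/except around json.loads turns silent parse failures into recovered tasks.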

This approach costs ~$0.04-0.08 per meeting. At 200 meetings/month, that's $8-16 in LLM costs alone.
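One more gap worth closing before talking costs: the pipeline above stubs every timestamp as 0:00. With Whisper's verbose_json output you can carry real ones through to the extraction prompt. A sketch, assuming segments shaped like Whisper's verbose_json format ({"start": seconds, "text": ...}); speaker labels would come from a separate diarization step:

```python
def fmt_ts(seconds: float) -> str:
    """Format seconds as M:SS for transcript lines."""
    m, s = divmod(int(seconds), 60)
    return f"{m}:{s:02d}"

def timestamped_lines(segments: list[dict]) -> list[str]:
    """Turn Whisper verbose_json-style segments into [M:SS] prefixed lines."""
    return [f"[{fmt_ts(seg['start'])}] {seg['text'].strip()}" for seg in segments]
```

Those prefixed lines feed straight into segment_transcript, so the "[timestamp] text" context the extractor sees is real instead of stubbed.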

Cost Optimization: Using OpenRouter

If you're processing lots of meetings, you'll want to try different models and providers. I use OpenRouter for this—it abstracts away provider switching and gives you a single API key to test Claude, GPT-4, Llama, and others.

Here's a variant that lets you swap models easily:


import requests
import json

class OpenRouterTaskExtractor:
    def __init__(self, api_key: str):
        self.api_key = api_key
        self.base_url = "https://openrouter.ai/api/v1"

    def classify_segment(self, segment: str, model: str = "anthropic/claude-3-5-haiku") -> str:
        """Classify using OpenRouter—easy model swapping."""
        response = requests.post(
            f"{self.base_url}/chat/completions",
            headers={
                "Authorization": f"Bearer {self.api_key}",
                "HTTP-Referer": "https://your-app.com",
            },
            json={
                "model": model,
                "messages": [
                    {
                        "role": "user",
                        "content": f"Classify as ACTION, DECISION, or DISCUSSION:\n{segment}"
                    }
                ],
                "max_tokens": 50,
            }
        )
        return response.json()["choices"][0]["message"]["content"].strip()

    def extract_tasks(self, segments: str, model: str = "anthropic/claude-3-5-sonnet") -> list:
        """Extract tasks using OpenRouter."""
        response = requests.post(
            f"{self.base_url}/chat/completions",
            headers={
                "Authorization": f"Bearer {self.api_key}",
                "HTTP-Referer": "https://your-app.com",
            },
            json={
                "model": model,
                "messages": [
                    {
                        "role": "user",
                        "content": (
                            "Extract all action items as a JSON array of objects "
                            "with keys: owner, task, deadline, priority, dependencies. "
                            "Return ONLY a valid JSON array, no markdown formatting.\n\n"
                            f"{segments}"
                        ),
                    }
                ],
                "max_tokens": 1500,
            },
        )
        try:
            return json.loads(response.json()["choices"][0]["message"]["content"])
        except json.JSONDecodeError:
            return []

---
**Need an AI system for your business?**  
I'm Alessandro Trimarco, AI engineer behind a 6-module AI stack for a 14-location restaurant chain (236 users, ~€88k/month processed).  
📬 **[alevibecoding@gmail.com](mailto:alevibecoding@gmail.com)** · [Portfolio](https://alessandrotrimarco.github.io) · [Case study](https://github.com/AlessandroTrimarco/aires-burger-case-study)
