KevinTen

Posted on Apr 6

I Built a Voice Notes Assistant and Learned Why AI Still Needs Training Wheels

#python #ai #opensource #productivity

I Built a Voice Notes Assistant and Learned Why AI Still Needs Training Wheels

Honestly, I started this project because I was tired of my own voice notes. You know the drill — you're walking down the street, have a brilliant idea, whip out your phone, and ramble for three minutes about... something. Fast forward two weeks, and you have 47 untitled voice recordings that you're definitely going to organize someday. Spoiler alert: you won't.

So here's the thing. I decided to build an AI voice notes assistant. Not because the world needed another one, but because I needed to understand why this problem was so hard to solve. And oh boy, did I learn some lessons the hard way.

The Pitch vs. The Reality

The idea was simple enough: build something that takes your messy, unstructured voice ramblings and turns them into organized, actionable notes. Think of it as a personal assistant that actually listens to you (unlike some people I know).

My pitch to myself: "It'll be easy! Just whip up some Python, throw in a speech-to-text API, add a sprinkle of LLM magic, and boom — product."

The reality: Three weeks of questioning my life choices and discovering edge cases I didn't know existed.

What I Actually Built

Let me show you the thing first, then I'll tell you why it's both awesome and deeply flawed.

The project is called Voice Notes Assistant (creative naming, I know). It's a Python-based prototype that:

Takes audio input from various sources
Transcribes it using speech-to-text
Processes the transcript with an LLM to extract structure
Outputs organized notes, summaries, and action items

Here's a taste of the core logic:

import os
from pathlib import Path

class VoiceNotesProcessor:
    def __init__(self, audio_path: str):
        self.audio_path = Path(audio_path)
        self.transcript = None
        self.structured_output = None

    def transcribe(self, whisper_model="base"):
        """
        Because accuracy is overrated anyway, right? 
        (Spoiler: it's not, and I learned this the hard way)
        """
        try:
            import whisper
            model = whisper.load_model(whisper_model)
            result = model.transcribe(str(self.audio_path))
            self.transcript = result["text"]
            return self.transcript
        except Exception as e:
            print(f"Transcription failed: {e}")
            return None

    def structure_with_llm(self, transcript: str, provider="openai"):
        """
        This is where the magic happens. Or doesn't. Depends on your prompt.
        """
        prompt = f"""
        Take this messy voice transcript and organize it:

        TRANSCRIPT:
        {transcript}

        Please provide:
        1. A brief summary (2-3 sentences)
        2. Key points as bullet items
        3. Any action items or tasks mentioned
        4. Topics/categories this relates to

        Be concise but thorough.
        """

        # API call implementation here...
        # You get the idea
        return structured_output

Seems straightforward, right? Well...

The Things Nobody Tells You

1. Audio Quality is a Cruel Master

I tested this with pristine studio recordings first. Worked like a charm! Then I tried it with actual voice notes — you know, the ones recorded on a busy street with wind noise, traffic, and me mumbling because I'm trying not to look like I'm talking to myself.

The accuracy drop-off was brutal. What worked at 95% accuracy in a quiet room dropped to maybe 60% in real-world conditions. And that's being generous.

Lesson learned: Test with real data, not ideal data. Your users won't be in a sound studio.

2. Transcription is Only Half the Battle

Getting the text is the easy part. Making sense of it? That's where things get spicy.

Voice notes are messy. People:

Start sentences and don't finish them
Change topics mid-thought
Use filler words (um, uh, like, you know)
Reference things they thought about earlier but never said out loud

My first few prompts for the LLM were... optimistic. I basically asked it to perform miracles. Now my prompts are novellas with examples, edge cases, and enough constraints to make a lawyer proud.

3. The Format Wars

Here's a fun one: what format should the output be in?

Some people want bullet points. Others want paragraphs. Some want action items extracted. Others want it categorized by topic. I spent way too long trying to make everyone happy before realizing that configurable templates were the only sane approach.

What Actually Works

Despite my complaints, there are some things I'm genuinely proud of:

The Modular Design

I structured the project so you can swap components:

# Use OpenAI's Whisper
processor = VoiceNotesProcessor("meeting.mp3")
processor.transcribe(whisper_model="base")

# Or use a different STT service
processor.use_alternative_transcription("assembly-ai")

# Different LLM providers for structuring
processor.structure_with_llm(provider="anthropic")  # or "openai", "local"

This saved me when I realized OpenAI's API costs were going to bankrupt me at scale.

The CLI Interface

Sometimes you just want to process a file without writing code:

python -m voice_notes process "my-rambling.mp3" \
  --output-format markdown \
  --extract-actions \
  --save-to notion

Simple, but effective.

The Honest Pros and Cons

Because I'm not trying to sell you anything, here's the real breakdown:

What works:

Decent transcription quality with Whisper (in good conditions)
LLM structuring actually helps organize messy thoughts
CLI is convenient for quick processing
Modular design makes it hackable

What doesn't:

Background noise kills accuracy
Processing time is slower than I'd like
Costs can add up with API usage
Still requires manual review (don't trust AI summaries blindly!)
Only supports limited audio formats

The brutal truth: This is a prototype, not a product. It works for my use case, but your mileage will vary.

What I'd Do Differently

If I were starting over:

Start with real user data — I spent too long optimizing for my own perfectly-enunciated test recordings
Local-first approach — Relying on APIs adds latency and cost. I'd explore more local models
Better error handling — Right now it fails gracelessly when things go wrong
Mobile app — Let's be honest, most voice notes happen on phones, not desktops

The Bigger Picture

Building this taught me something about AI tools in general. We're in this weird middle ground where the technology is almost good enough to be magical, but still requires enough babysitting that you can't fully trust it.

The voice notes problem isn't solved by better transcription or smarter LLMs. It's solved by understanding that humans are messy, inconsistent, and unpredictable. Any tool that doesn't account for that is going to disappoint.

Your Turn

I'd love to hear from you:

Do you actually use voice notes, or are they just digital hoarding for you?
What would your ideal voice notes assistant do?
Have you tried building something similar? What broke first?

Drop your thoughts below. And if you're curious about the code, check out the GitHub repo. It's rough around the edges, but it works. Mostly.

P.S. — Yes, I used this tool to help organize my thoughts for writing this article. Yes, it required cleanup. No, the irony is not lost on me.

DEV Community

I Built a Voice Notes Assistant and Learned Why AI Still Needs Training Wheels

I Built a Voice Notes Assistant and Learned Why AI Still Needs Training Wheels

The Pitch vs. The Reality

What I Actually Built

The Things Nobody Tells You

1. Audio Quality is a Cruel Master

2. Transcription is Only Half the Battle

3. The Format Wars

What Actually Works

The Modular Design

The CLI Interface

The Honest Pros and Cons

What I'd Do Differently

The Bigger Picture

Your Turn

Top comments (0)