Ryan Banze
🧠 Real-Time Smart Speech Assistant with Python, Whisper & LLMs

The future of human-computer interaction isn’t just about recognizing words; it’s about understanding meaning.
That’s the philosophy behind this project: a real-time speech companion that doesn’t just transcribe your voice but actively listens, interprets, and supports you in the flow of conversation.

Imagine this: You’re presenting, and mid-sentence you forget a technical term. Instead of awkward silence, a live assistant quietly displays the word, a crisp definition, and even suggests a better phrase. That’s what this system does — an AI-powered coach in your corner, live.


🎯 Why Build This?

Most speech-to-text tools are glorified stenographers. They capture your words, period. But real conversations are messy, uncertain, and nuanced.

  • What if you stumble on a word?
  • What if your phrasing is too jargon-heavy for your audience?
  • What if you sound unsure and need a guiding hand?

Traditional transcription doesn’t solve these. This app does.


✅ The Solution: Speech-to-Insight

This isn’t just about transcription. It’s about augmenting speech with intelligence.

Here’s what the assistant provides in real-time:

  • 🗣️ Raw Speech Capture – your words, transcribed instantly
  • 🔑 Concept Extraction – what ideas you’re really talking about
  • 📖 Definitions – crisp meanings for rare or academic terms
  • 💡 LLM Suggestions – alternative phrasing, smarter wording
  • 🧠 Hesitation Detection – nudges when you sound uncertain

Think of it as the Google Docs grammar checker — but for live speech.


🧱 The Modular Architecture

The code is structured in a clean, extendable way (src/ directory):

• main.py → Tkinter GUI + app launch logic
• audio_utils.py → Real-time mic capture & chunking
• transcription.py → Whisper & AssemblyAI pipelines for speech recognition
• text_utils.py → NLP-based concept extraction & ambiguity detection
• llm_utils.py → Hooks to OpenRouter, Groq, Gemini
• rowlogic.py → Builds UI rows dynamically
• controls.py → Start/Stop mic logic
• app_state.py → Shared memory for utterances + mic queue
• config.py → Secure .env key loading

This isn’t spaghetti code. It’s a scalable blueprint for real-time NLP systems.


🎨 What It Looks Like

  • Dark-themed Tkinter GUI (easy on the eyes)
  • Microphone selector & engine dropdown
  • Dynamic table with 5 columns:
    1. Your speech (live transcription)
    2. Key concepts (distilled ideas)
    3. Definitions (for tough words)
    4. LLM suggestions (smarter phrasing)
    5. Ambiguity/Hesitation flags

It feels less like a CLI tool and more like a personal dashboard for your voice.


⚙️ How It Works (Step-by-Step)

Here’s the intellectual heart of the system:

1. Audio Capture

Streams your mic input, chunks audio, and writes temporary .wav files.

💡 Why: Whisper and AssemblyAI need .wav — this bridges live audio to ML models.

import wave
import numpy as np

path = "temp_chunk.wav"
with wave.open(path, "wb") as wf:
    wf.setnchannels(1)       # mono
    wf.setsampwidth(2)       # 16-bit samples
    wf.setframerate(16000)   # 16 kHz, what the speech models expect
    wf.writeframes((chunk * 32767).astype(np.int16).tobytes())
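
For context, here’s a minimal sketch of how a chunk like this might be captured with sounddevice; the chunk length and the blocking sd.rec call are illustrative assumptions (the real audio_utils.py presumably feeds the mic queue in app_state.py instead):

import numpy as np
import sounddevice as sd

SAMPLE_RATE = 16000   # 16 kHz mono, matching the .wav writer above
CHUNK_SECONDS = 3     # assumed chunk length, not from the article

def record_chunk():
    # Blocking capture: returns CHUNK_SECONDS of float32 audio in [-1, 1]
    frames = int(SAMPLE_RATE * CHUNK_SECONDS)
    chunk = sd.rec(frames, samplerate=SAMPLE_RATE, channels=1, dtype="float32")
    sd.wait()               # wait for the recording to finish
    return chunk.flatten()  # 1-D array, ready for the writer above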


2. Transcription Engines

Switch between:
• ⚡ Whisper (local, GPU-accelerated, private)
• ☁️ AssemblyAI (cloud, highly accurate, versatile)


# engine_var is the Tkinter StringVar bound to the engine dropdown
engine = engine_var.get()
if engine == "AssemblyAI":
    text = transcribe_with_assemblyai(path)
elif engine == "Whisper":
    text = transcribe_with_whisper(path)
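
As a hedged sketch of what those two helpers might look like (model size, device settings, and error handling are assumptions, not the project’s exact code):

import assemblyai as aai
from faster_whisper import WhisperModel

# Loaded once at startup; "base" on GPU is an illustrative choice.
whisper_model = WhisperModel("base", device="cuda", compute_type="float16")

def transcribe_with_whisper(path):
    segments, _info = whisper_model.transcribe(path)
    return " ".join(segment.text.strip() for segment in segments)

def transcribe_with_assemblyai(path):
    # Assumes aai.settings.api_key was set from .env (see config.py)
    transcript = aai.Transcriber().transcribe(path)
    return transcript.text or ""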


3. Concept & Entity Extraction

NLP via spaCy distills raw text into meaningful ideas.

doc = nlp(text)
concepts = extract_clean_concepts(doc)
entities = extract_named_entities(doc)


This makes the assistant semantically aware: it knows you’re talking about “machine learning,” not just “machines” and “learning.”
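
A plausible sketch of those helpers, assuming spaCy’s small English model and simple noun-chunk filtering (the actual heuristics in text_utils.py may differ):

import spacy

nlp = spacy.load("en_core_web_sm")  # model choice is an assumption

def extract_clean_concepts(doc):
    # Noun chunks like "machine learning" approximate the ideas in play;
    # dropping pronouns and one-character chunks keeps the list useful.
    return [chunk.text.lower() for chunk in doc.noun_chunks
            if chunk.root.pos_ != "PRON" and len(chunk.text) > 1]

def extract_named_entities(doc):
    # Named entities with their labels, e.g. ("Python", "PRODUCT")
    return [(ent.text, ent.label_) for ent in doc.ents]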

4. Ambiguity & Hesitation Detection

Regex + context memory detect when you stumble.


# recent_utterances is the rolling history kept in app_state.py
context = " ".join(recent_utterances)
ambiguous = detect_ambiguity(context)
hesitant = detect_hesitation(context)


This is where it becomes a coach, not a scribe.
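
The real patterns live in text_utils.py; here’s a minimal sketch of the regex idea, with illustrative filler and hedge word lists:

import re

# Illustrative word lists; the app's actual patterns may differ.
HESITATION = re.compile(r"\b(um+|uh+|er+|hmm+|you know|i mean)\b", re.IGNORECASE)
AMBIGUITY = re.compile(r"\b(thing|stuff|sort of|kind of|whatever it's called)\b",
                       re.IGNORECASE)

def detect_hesitation(context: str) -> bool:
    return bool(HESITATION.search(context))

def detect_ambiguity(context: str) -> bool:
    return bool(AMBIGUITY.search(context))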

5. LLM Support Mode

When you hesitate, the app calls an LLM (Mistral, LLaMA 3, or Gemini) to help.


if ambiguous or hesitant:
    prompt = get_ambiguous_or_hesitant_prompt(context, ambiguous, hesitant)
    llm_response = get_llm_support_response(prompt)
else:
    llm_response = "—"


This turns uncertainty into real-time, context-aware assistance.
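
One plausible shape for the OpenRouter hook in llm_utils.py, since OpenRouter exposes an OpenAI-compatible endpoint (the model slug and environment variable name here are assumptions):

import os
from openai import OpenAI

client = OpenAI(base_url="https://openrouter.ai/api/v1",
                api_key=os.environ["OPENROUTER_API_KEY"])

def get_llm_support_response(prompt: str) -> str:
    # Model slug is illustrative; the Groq and Gemini hooks would differ.
    response = client.chat.completions.create(
        model="mistralai/mistral-7b-instruct",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()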

6. Rare Word Definitions

Detected via wordfreq + free dictionary API.


definitions = extract_difficult_definitions(text)


This ensures you never lose your audience.
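
A sketch of how wordfreq and the Free Dictionary API could combine here; the rarity cutoff and the single-definition pick are assumptions:

import re
import requests
from wordfreq import zipf_frequency

def extract_difficult_definitions(text, threshold=3.0):
    # Zipf frequency below ~3 roughly means "rare"; the cutoff is a guess.
    definitions = {}
    for word in set(re.findall(r"[a-zA-Z]+", text.lower())):
        if len(word) > 3 and zipf_frequency(word, "en") < threshold:
            resp = requests.get(
                f"https://api.dictionaryapi.dev/api/v2/entries/en/{word}",
                timeout=5)
            if resp.ok:
                entry = resp.json()[0]
                definitions[word] = entry["meanings"][0]["definitions"][0]["definition"]
    return definitions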

7. Dynamic UI Update

Everything inserts as a row in the live table.


insert_row(text, concepts, entities, engine, scrollable_frame, header, row_widgets, canvas)
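
The real insert_row in rowlogic.py also manages the header and canvas scrolling; stripped to its core, the idea is a grid of labels (widget styling here is illustrative):

import tkinter as tk

COLUMNS = 5  # speech, concepts, definitions, LLM suggestion, flags

def insert_row_minimal(frame, row_index, values):
    # One label per column, gridded into the scrollable frame
    for col, value in enumerate(values[:COLUMNS]):
        tk.Label(frame, text=value, anchor="w", wraplength=220,
                 bg="#1e1e1e", fg="#dddddd").grid(
            row=row_index, column=col, sticky="nsew", padx=2, pady=2)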


🛠️ Tech Stack
• 🎧 sounddevice → Mic streaming
• 🧠 faster-whisper + AssemblyAI → Speech recognition
• 📖 spaCy + wordfreq → NLP & word rarity detection
• 🤖 OpenRouter (Mistral), Groq (LLaMA 3), Gemini → LLM suggestions
• 🎨 tkinter → GUI
• 📚 Free Dictionary API → Definitions

🚀 Why It Matters
This project hints at the next wave of human-AI interfaces:
• Beyond transcription
• Beyond chatbots
• Towards empathetic, real-time, context-aware AI assistants
It’s not production-hardened yet, but as a proof of concept it shows:
• ✅ Real-time multimodal pipelines are feasible
• ✅ Open-source + cloud models can play together
• ✅ AI can move from “tools” to companions

⭐ Try It, Fork It, Extend It
Want to make it your own?
• Add emoji sentiment analysis
• Build meeting summarizers
• Enable multilingual coaching
• Add agent roles (therapist, teacher, coach)
The architecture is modular enough to adapt.

💡 Final Thoughts
This isn’t about replacing speech. It’s about enhancing it.
Your words stay yours, but smarter, sharper, and better supported.
In many ways, this is a blueprint for empathetic AI interfaces: AI that doesn’t just hear you, but actually has your back.

💬 Want to Support My Work?

If you enjoyed this project, consider buying me a coffee to support more free AI tutorials and tools:

👉 Buy Me a Coffee ☕

