The future of human-computer interaction isn’t just about recognizing words, it’s about understanding meaning.
That’s the philosophy behind this project: a real-time speech companion that doesn’t just transcribe your voice but actively listens, interprets, and supports you in the flow of conversation.
Imagine this: You’re presenting, and mid-sentence you forget a technical term. Instead of awkward silence, a live assistant quietly displays the word, a crisp definition, and even suggests a better phrase. That’s what this system does — an AI-powered coach in your corner, live.
🎯 Why Build This?
Most speech-to-text tools are glorified stenographers. They capture your words, period. But real conversations are messy, uncertain, and nuanced.
- What if you stumble on a word?
- What if your phrasing is too jargon-heavy for your audience?
- What if you sound unsure and need a guiding hand?
Traditional transcription doesn’t solve these. This app does.
✅ The Solution: Speech-to-Insight
This isn’t just about transcription. It’s about augmenting speech with intelligence.
Here’s what the assistant provides in real-time:
- 🗣️ Raw Speech Capture – your words, transcribed instantly
- 🔑 Concept Extraction – what ideas you’re really talking about
- 📖 Definitions – crisp meanings for rare or academic terms
- 💡 LLM Suggestions – alternative phrasing, smarter wording
- 🧠 Hesitation Detection – nudges when you sound uncertain
Think of it as the Google Docs grammar checker — but for live speech.
🧱 The Modular Architecture
The code is structured in a clean, extensible way (in the src/ directory):
| File | Purpose |
|---|---|
| main.py | Tkinter GUI + app launch logic |
| audio_utils.py | Real-time mic capture & chunking |
| transcription.py | Whisper & AssemblyAI pipelines for speech recognition |
| text_utils.py | NLP-based concept extraction & ambiguity detection |
| llm_utils.py | Hooks to OpenRouter, Groq, Gemini |
| rowlogic.py | Builds UI rows dynamically |
| controls.py | Start/Stop mic logic |
| app_state.py | Shared memory for utterances + mic queue |
| config.py | Secure .env key loading |
This isn’t spaghetti-code. It’s a scalable blueprint for real-time NLP systems.
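As a taste of how thin each module can stay, here’s a minimal sketch of what a config.py like this might look like. It assumes python-dotenv, and the key names are illustrative, not the project’s actual identifiers:

```python
# config.py (sketch) - load API keys from a local .env file so secrets never live in source code
# Assumes python-dotenv is installed; the variable names below are illustrative only.
import os
from dotenv import load_dotenv

load_dotenv()  # reads key=value pairs from .env into the environment

ASSEMBLYAI_API_KEY = os.getenv("ASSEMBLYAI_API_KEY")
OPENROUTER_API_KEY = os.getenv("OPENROUTER_API_KEY")
GROQ_API_KEY = os.getenv("GROQ_API_KEY")
GEMINI_API_KEY = os.getenv("GEMINI_API_KEY")
```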
🎨 What It Looks Like
- Dark-themed Tkinter GUI (easy on the eyes)
- Microphone selector & engine dropdown
- Dynamic table with 5 columns:
  - Your speech (live transcription)
  - Key concepts (distilled ideas)
  - Definitions (for tough words)
  - LLM suggestions (smarter phrasing)
  - Ambiguity/Hesitation flags
It feels less like a CLI tool and more like a personal dashboard for your voice.
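To get a feel for how that dashboard comes together, here’s a minimal, self-contained Tkinter sketch: a dark window with an engine dropdown and a Start button. It is a stand-in for the real main.py, not the project’s actual code:

```python
import tkinter as tk
from tkinter import ttk

# Minimal dark-themed window with an engine selector - a stand-in for the real GUI
root = tk.Tk()
root.title("Speech Companion (sketch)")
root.configure(bg="#1e1e1e")

engine_var = tk.StringVar(value="Whisper")
ttk.Combobox(root, textvariable=engine_var,
             values=["Whisper", "AssemblyAI"], state="readonly").pack(padx=10, pady=10)

tk.Button(root, text="Start Mic", bg="#2d2d2d", fg="white").pack(padx=10, pady=(0, 10))
root.mainloop()
```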
⚙️ How It Works (Step-by-Step)
Here’s the intellectual heart of the system:
1. Audio Capture
Streams your mic input, chunks the audio, and writes temporary .wav files.
💡 Why: Whisper and AssemblyAI need .wav input; this bridges live audio to the ML models.
path = "temp_chunk.wav"
with wave.open(path, "wb") as wf:
wf.setnchannels(1)
wf.setsampwidth(2)
wf.setframerate(16000)
wf.writeframes((chunk * 32767).astype(np.int16).tobytes())
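The streaming side isn’t shown above. A typical way to produce that chunk is a sounddevice callback pushing blocks into a queue, roughly like this (a sketch of the idea, not the project’s actual audio_utils.py):

```python
import queue
import numpy as np
import sounddevice as sd

audio_queue = queue.Queue()

def on_audio(indata, frames, time_info, status):
    # Called by sounddevice for every block of mic samples; copy so the buffer survives
    audio_queue.put(indata.copy())

# 16 kHz mono float32 stream in ~1-second blocks, matching the WAV settings above
stream = sd.InputStream(samplerate=16000, channels=1, dtype="float32",
                        blocksize=16000, callback=on_audio)
stream.start()
chunk = audio_queue.get().flatten()  # 1-D float32 array ready to be written out
```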
2. Transcription Engines
Switch between:
• ⚡ Whisper (local, GPU-accelerated, private)
• ☁️ AssemblyAI (cloud, highly accurate, versatile)
```python
# Route the chunk to whichever engine is selected in the GUI dropdown
engine = engine_var.get()
if engine == "AssemblyAI":
    text = transcribe_with_assemblyai(path)
elif engine == "Whisper":
    text = transcribe_with_whisper(path)
```
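A transcribe_with_whisper built on faster-whisper could look roughly like this; the model size and device settings here are assumptions, not the project’s actual configuration:

```python
from faster_whisper import WhisperModel

# Load once at startup; "base" on CPU keeps it light, swap in a larger model + GPU if available
model = WhisperModel("base", device="cpu", compute_type="int8")

def transcribe_with_whisper(path: str) -> str:
    segments, _info = model.transcribe(path)
    return " ".join(seg.text.strip() for seg in segments)
```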
3. Concept & Entity Extraction
NLP via spaCy distills raw text into meaningful ideas.
```python
doc = nlp(text)
concepts = extract_clean_concepts(doc)
entities = extract_named_entities(doc)
```
This makes the assistant semantics-aware: it knows you’re talking about “machine learning,” not just “machines” and “learning.”
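A simple version of extract_clean_concepts can lean on spaCy’s noun chunks and built-in entity recognizer. This is a sketch of the approach, not the project’s exact implementation:

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def extract_clean_concepts(doc):
    # Noun chunks ("machine learning", "neural network") are a cheap proxy for concepts
    return sorted({chunk.text.lower() for chunk in doc.noun_chunks if len(chunk.text) > 3})

def extract_named_entities(doc):
    return [(ent.text, ent.label_) for ent in doc.ents]
```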
4. Ambiguity & Hesitation Detection
Regex + context memory detect when you stumble.
context = " ".join(recent_utterances)
ambiguous = detect_ambiguity(context)
hesitant = detect_hesitation(context)
This is where it becomes a coach, not a scribe.
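Hesitation detection can be as simple as regex checks over filler words and uncertainty phrases in the recent context. Here’s a minimal sketch; the word lists and thresholds are assumptions:

```python
import re

FILLERS = r"\b(um+|uh+|erm|hmm|you know|i mean|sort of|kind of)\b"
UNCERTAIN = r"\b(maybe|not sure|i think|i guess|what's the word)\b"

def detect_hesitation(context: str) -> bool:
    # Two or more fillers in the recent context is a reasonable "stumbling" signal
    return len(re.findall(FILLERS, context.lower())) >= 2

def detect_ambiguity(context: str) -> bool:
    return re.search(UNCERTAIN, context.lower()) is not None
```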
5. LLM Support Mode
When you hesitate, the app calls an LLM (Mistral, LLaMA 3, or Gemini) to help.
```python
if ambiguous or hesitant:
    prompt = get_ambiguous_or_hesitant_prompt(context, ambiguous, hesitant)
    llm_response = get_llm_support_response(prompt)
else:
    llm_response = "—"
```
This turns uncertainty into real-time, context-aware assistance.
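Under the hood, get_llm_support_response is essentially one chat-completion call. Here’s a sketch against OpenRouter’s OpenAI-compatible endpoint; the model name and request shape are assumptions, not the project’s actual llm_utils.py:

```python
import os
import requests

def get_llm_support_response(prompt: str) -> str:
    # OpenRouter exposes an OpenAI-style chat completions endpoint
    resp = requests.post(
        "https://openrouter.ai/api/v1/chat/completions",
        headers={"Authorization": f"Bearer {os.getenv('OPENROUTER_API_KEY')}"},
        json={
            "model": "mistralai/mistral-7b-instruct",
            "messages": [{"role": "user", "content": prompt}],
        },
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]
```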
6. Rare Word Definitions
Detected via wordfreq + free dictionary API.
```python
definitions = extract_difficult_definitions(text)
```
This ensures you never lose your audience.
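The rarity check can use wordfreq’s Zipf scale, with definitions fetched from the Free Dictionary API. A sketch of how the two combine (the threshold and response parsing are assumptions):

```python
import requests
from wordfreq import zipf_frequency

def extract_difficult_definitions(text: str, threshold: float = 3.5) -> dict:
    definitions = {}
    for word in set(w.strip(".,!?").lower() for w in text.split()):
        # A Zipf frequency below ~3.5 roughly means "rare" in everyday English
        if word.isalpha() and zipf_frequency(word, "en") < threshold:
            r = requests.get(f"https://api.dictionaryapi.dev/api/v2/entries/en/{word}", timeout=10)
            if r.ok:
                definitions[word] = r.json()[0]["meanings"][0]["definitions"][0]["definition"]
    return definitions
```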
7. Dynamic UI Update
Everything inserts as a row in the live table.
```python
insert_row(text, concepts, entities, engine, scrollable_frame, header, row_widgets, canvas)
```
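At its core, insert_row just adds one tkinter label per column on a new grid row. A stripped-down sketch of the idea (the real signature carries more state, as shown above):

```python
import tkinter as tk

def insert_row(frame: tk.Frame, row: int, values: list[str]) -> None:
    # One label per column: speech, concepts, definitions, LLM suggestion, flags
    for col, value in enumerate(values):
        tk.Label(frame, text=value, bg="#1e1e1e", fg="white",
                 wraplength=220, justify="left", anchor="nw").grid(
            row=row, column=col, sticky="nsew", padx=2, pady=2)
```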
🛠️ Tech Stack
• 🎧 sounddevice → Mic streaming
• 🧠 faster-whisper + AssemblyAI → Speech recognition
• 📖 spaCy + wordfreq → NLP & word rarity detection
• 🤖 OpenRouter (Mistral), Groq (LLaMA 3), Gemini → LLM suggestions
• 🎨 tkinter → GUI
• 📚 Free Dictionary API → Definitions
🚀 Why It Matters
This project hints at the next wave of human-AI interfaces:
• Beyond transcription
• Beyond chatbots
• Towards empathetic, real-time, context-aware AI assistants
It’s not production-hardened yet, but as a proof of concept it shows:
• ✅ Real-time multimodal pipelines are feasible
• ✅ Open-source + cloud models can play together
• ✅ AI can move from “tools” to companions
⭐ Try It, Fork It, Extend It
Want to make it your own?
• Add emoji sentiment analysis
• Build meeting summarizers
• Enable multilingual coaching
• Add agent roles (therapist, teacher, coach)
The architecture is modular enough to adapt.
💡 Final Thoughts
This isn’t about replacing speech. It’s about enhancing it.
Your words stay yours, but smarter, sharper, and better supported.
In many ways, this is a blueprint for empathetic AI interfaces: AI that doesn’t just hear you, but actually has your back.
💬 Want to Support My Work?
If you enjoyed this project, consider buying me a coffee to support more free AI tutorials and tools:
👉 Buy Me a Coffee ☕
📱 Follow Me
- X (Twitter): @RyanBanze
- Instagram: @aibanze
- LinkedIn: Ryan Banze