The future of human-computer interaction isn’t just about recognizing words, it’s about understanding meaning.
That’s the philosophy behind this project: a real-time speech companion that doesn’t just transcribe your voice but actively listens, interprets, and supports you in the flow of conversation.
Imagine this: You’re presenting, and mid-sentence you forget a technical term. Instead of awkward silence, a live assistant quietly displays the word, a crisp definition, and even suggests a better phrase. That’s what this system does — an AI-powered coach in your corner, live.
🎯 Why Build This?
Most speech-to-text tools are glorified stenographers. They capture your words, period. But real conversations are messy, uncertain, and nuanced.
- What if you stumble on a word?
- What if your phrasing is too jargon-heavy for your audience?
- What if you sound unsure and need a guiding hand?
Traditional transcription doesn’t solve these. This app does.
✅ The Solution: Speech-to-Insight
This isn’t just about transcription. It’s about augmenting speech with intelligence.
Here’s what the assistant provides in real-time:
- 🗣️ Raw Speech Capture – your words, transcribed instantly
- 🔑 Concept Extraction – what ideas you’re really talking about
- 📖 Definitions – crisp meanings for rare or academic terms
- 💡 LLM Suggestions – alternative phrasing, smarter wording
- 🧠 Hesitation Detection – nudges when you sound uncertain
Think of it as the Google Docs grammar checker — but for live speech.
🧱 The Modular Architecture
The code is structured in a clean, extensible way (in the src/ directory):
| File | Purpose |
|---|---|
| main.py | Tkinter GUI + app launch logic |
| audio_utils.py | Real-time mic capture & chunking |
| transcription.py | Whisper & AssemblyAI pipelines for speech recognition |
| text_utils.py | NLP-based concept extraction & ambiguity detection |
| llm_utils.py | Hooks to OpenRouter, Groq, Gemini |
| rowlogic.py | Builds UI rows dynamically |
| controls.py | Start/Stop mic logic |
| app_state.py | Shared memory for utterances + mic queue |
| config.py | Secure .env key loading |
This isn’t spaghetti-code. It’s a scalable blueprint for real-time NLP systems.
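As a taste of how thin each module can stay, here’s a minimal sketch of what a config.py like this might look like. It assumes python-dotenv, and the key names are illustrative, not the project’s actual identifiers:

```python
# config.py (sketch) - load API keys from a local .env file so secrets never live in source code
# Assumes python-dotenv is installed; the variable names below are illustrative only.
import os
from dotenv import load_dotenv

load_dotenv()  # reads key=value pairs from .env into the environment

ASSEMBLYAI_API_KEY = os.getenv("ASSEMBLYAI_API_KEY")
OPENROUTER_API_KEY = os.getenv("OPENROUTER_API_KEY")
GROQ_API_KEY = os.getenv("GROQ_API_KEY")
GEMINI_API_KEY = os.getenv("GEMINI_API_KEY")
```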
🎨 What It Looks Like
- Dark-themed Tkinter GUI (easy on the eyes)
- Microphone selector & engine dropdown
- Dynamic table with 5 columns:
  - Your speech (live transcription)
  - Key concepts (distilled ideas)
  - Definitions (for tough words)
  - LLM suggestions (smarter phrasing)
  - Ambiguity/Hesitation flags
It feels less like a CLI tool and more like a personal dashboard for your voice.
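To get a feel for how that dashboard comes together, here’s a minimal, self-contained Tkinter sketch: a dark window with an engine dropdown and a Start button. It is a stand-in for the real main.py, not the project’s actual code:

```python
import tkinter as tk
from tkinter import ttk

# Minimal dark-themed window with an engine selector - a stand-in for the real GUI
root = tk.Tk()
root.title("Speech Companion (sketch)")
root.configure(bg="#1e1e1e")

engine_var = tk.StringVar(value="Whisper")
ttk.Combobox(root, textvariable=engine_var,
             values=["Whisper", "AssemblyAI"], state="readonly").pack(padx=10, pady=10)

tk.Button(root, text="Start Mic", bg="#2d2d2d", fg="white").pack(padx=10, pady=(0, 10))
root.mainloop()
```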
⚙️ How It Works (Step-by-Step)
Here’s the intellectual heart of the system:
1. Audio Capture
Streams your mic input, chunks the audio, and writes temporary .wav files.
💡 Why: Whisper and AssemblyAI need .wav input; this bridges live audio to the ML models.
path = "temp_chunk.wav"
with wave.open(path, "wb") as wf:
wf.setnchannels(1)
wf.setsampwidth(2)
wf.setframerate(16000)
wf.writeframes((chunk * 32767).astype(np.int16).tobytes())
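The streaming side isn’t shown above. A typical way to produce that chunk is a sounddevice callback pushing blocks into a queue, roughly like this (a sketch of the idea, not the project’s actual audio_utils.py):

```python
import queue
import numpy as np
import sounddevice as sd

audio_queue = queue.Queue()

def on_audio(indata, frames, time_info, status):
    # Called by sounddevice for every block of mic samples; copy so the buffer survives
    audio_queue.put(indata.copy())

# 16 kHz mono float32 stream in ~1-second blocks, matching the WAV settings above
stream = sd.InputStream(samplerate=16000, channels=1, dtype="float32",
                        blocksize=16000, callback=on_audio)
stream.start()
chunk = audio_queue.get().flatten()  # 1-D float32 array ready to be written out
```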
2. Transcription Engines
Switch between:
• ⚡ Whisper (local, GPU-accelerated, private)
• ☁️ AssemblyAI (cloud, highly accurate, versatile)
```python
# Route the chunk to whichever engine is selected in the GUI dropdown
engine = engine_var.get()
if engine == "AssemblyAI":
    text = transcribe_with_assemblyai(path)
elif engine == "Whisper":
    text = transcribe_with_whisper(path)
```
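A transcribe_with_whisper built on faster-whisper could look roughly like this; the model size and device settings here are assumptions, not the project’s actual configuration:

```python
from faster_whisper import WhisperModel

# Load once at startup; "base" on CPU keeps it light, swap in a larger model + GPU if available
model = WhisperModel("base", device="cpu", compute_type="int8")

def transcribe_with_whisper(path: str) -> str:
    segments, _info = model.transcribe(path)
    return " ".join(seg.text.strip() for seg in segments)
```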
3. Concept & Entity Extraction
NLP via spaCy distills raw text into meaningful ideas.
```python
doc = nlp(text)
concepts = extract_clean_concepts(doc)
entities = extract_named_entities(doc)
```
This makes the assistant semantics-aware: it knows you’re talking about “machine learning,” not just “machines” and “learning.”
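A simple version of extract_clean_concepts can lean on spaCy’s noun chunks and built-in entity recognizer. This is a sketch of the approach, not the project’s exact implementation:

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def extract_clean_concepts(doc):
    # Noun chunks ("machine learning", "neural network") are a cheap proxy for concepts
    return sorted({chunk.text.lower() for chunk in doc.noun_chunks if len(chunk.text) > 3})

def extract_named_entities(doc):
    return [(ent.text, ent.label_) for ent in doc.ents]
```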
4. Ambiguity & Hesitation Detection
Regex + context memory detect when you stumble.
context = " ".join(recent_utterances)
ambiguous = detect_ambiguity(context)
hesitant = detect_hesitation(context)
This is where it becomes a coach, not a scribe.
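Hesitation detection can be as simple as regex checks over filler words and uncertainty phrases in the recent context. Here’s a minimal sketch; the word lists and thresholds are assumptions:

```python
import re

FILLERS = r"\b(um+|uh+|erm|hmm|you know|i mean|sort of|kind of)\b"
UNCERTAIN = r"\b(maybe|not sure|i think|i guess|what's the word)\b"

def detect_hesitation(context: str) -> bool:
    # Two or more fillers in the recent context is a reasonable "stumbling" signal
    return len(re.findall(FILLERS, context.lower())) >= 2

def detect_ambiguity(context: str) -> bool:
    return re.search(UNCERTAIN, context.lower()) is not None
```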
5. LLM Support Mode
When you hesitate, the app calls an LLM (Mistral, LLaMA 3, or Gemini) to help.
```python
if ambiguous or hesitant:
    prompt = get_ambiguous_or_hesitant_prompt(context, ambiguous, hesitant)
    llm_response = get_llm_support_response(prompt)
else:
    llm_response = "—"
```
This turns uncertainty into real-time, context-aware assistance.
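Under the hood, get_llm_support_response is essentially one chat-completion call. Here’s a sketch against OpenRouter’s OpenAI-compatible endpoint; the model name and request shape are assumptions, not the project’s actual llm_utils.py:

```python
import os
import requests

def get_llm_support_response(prompt: str) -> str:
    # OpenRouter exposes an OpenAI-style chat completions endpoint
    resp = requests.post(
        "https://openrouter.ai/api/v1/chat/completions",
        headers={"Authorization": f"Bearer {os.getenv('OPENROUTER_API_KEY')}"},
        json={
            "model": "mistralai/mistral-7b-instruct",
            "messages": [{"role": "user", "content": prompt}],
        },
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]
```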
6. Rare Word Definitions
Detected via wordfreq + free dictionary API.
```python
definitions = extract_difficult_definitions(text)
```
This ensures you never lose your audience.
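The rarity check can use wordfreq’s Zipf scale, with definitions fetched from the Free Dictionary API. A sketch of how the two combine (the threshold and response parsing are assumptions):

```python
import requests
from wordfreq import zipf_frequency

def extract_difficult_definitions(text: str, threshold: float = 3.5) -> dict:
    definitions = {}
    for word in set(w.strip(".,!?").lower() for w in text.split()):
        # A Zipf frequency below ~3.5 roughly means "rare" in everyday English
        if word.isalpha() and zipf_frequency(word, "en") < threshold:
            r = requests.get(f"https://api.dictionaryapi.dev/api/v2/entries/en/{word}", timeout=10)
            if r.ok:
                definitions[word] = r.json()[0]["meanings"][0]["definitions"][0]["definition"]
    return definitions
```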
7. Dynamic UI Update
Everything inserts as a row in the live table.
```python
insert_row(text, concepts, entities, engine, scrollable_frame, header, row_widgets, canvas)
```
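At its core, insert_row just adds one tkinter label per column on a new grid row. A stripped-down sketch of the idea (the real signature carries more state, as shown above):

```python
import tkinter as tk

def insert_row(frame: tk.Frame, row: int, values: list[str]) -> None:
    # One label per column: speech, concepts, definitions, LLM suggestion, flags
    for col, value in enumerate(values):
        tk.Label(frame, text=value, bg="#1e1e1e", fg="white",
                 wraplength=220, justify="left", anchor="nw").grid(
            row=row, column=col, sticky="nsew", padx=2, pady=2)
```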
🛠️ Tech Stack
• 🎧 sounddevice → Mic streaming
• 🧠 faster-whisper + AssemblyAI → Speech recognition
• 📖 spaCy + wordfreq → NLP & word rarity detection
• 🤖 OpenRouter (Mistral), Groq (LLaMA 3), Gemini → LLM suggestions
• 🎨 tkinter → GUI
• 📚 Free Dictionary API → Definitions
🚀 Why It Matters
This project hints at the next wave of human-AI interfaces:
• Beyond transcription
• Beyond chatbots
• Towards empathetic, real-time, context-aware AI assistants
It’s not production-hardened yet, but as a proof of concept it shows:
• ✅ Real-time multimodal pipelines are feasible
• ✅ Open-source + cloud models can play together
• ✅ AI can move from “tools” to companions
⭐ Try It, Fork It, Extend It
Want to make it your own?
• Add emoji sentiment analysis
• Build meeting summarizers
• Enable multilingual coaching
• Add agent roles (therapist, teacher, coach)
The architecture is modular enough to adapt.
💡 Final Thoughts
This isn’t about replacing speech. It’s about enhancing it.
Your words stay yours, but smarter, sharper, and better supported.
In many ways, this is a blueprint for empathetic AI interfaces: AI that doesn’t just hear you, but actually has your back.
💬 Want to Support My Work?
If you enjoyed this project, consider buying me a coffee to support more free AI tutorials and tools:
👉 Buy Me a Coffee ☕
📱 Follow Me
- X (Twitter): @RyanBanze
- Instagram: @aibanze
- LinkedIn: Ryan Banze