**TL;DR:** I built a fully offline AI voice assistant for students using Whisper STT, Silero VAD, quantized LLaMA 3.2, and Kokoro TTS. It runs entirely on CPU, fits in ~2GB of RAM, works without internet, and is designed for low-cost laptops.
## Why I Built This
While looking into NavGurukul’s AI Lab initiatives, one question kept bothering me:
> *What about students who can’t read fluently, or don’t have reliable internet access?*
Most AI learning tools assume:

- Stable internet
- Cloud APIs
- Expensive hardware
- Comfortable reading ability
But in many Tier-2 and Tier-3 regions in India, students often:

- Struggle with reading comprehension
- Have unreliable or no internet
- Study in schools with limited infrastructure
- Face real data-privacy concerns
So I decided to build something different:

- 🗣️ Voice-first
- 📴 Offline-first
- 💻 CPU-only
- 🧠 Privacy-first
## What I Built
A local AI voice assistant that:

- Runs fully offline after model download
- Fits in ~2GB RAM using quantization
- Responds in 5–7 seconds on CPU
- Works on ₹20k laptops
- Uses open-source models only
GitHub Repo: 👉 [Code Base](https://github.com/SanthanaBharathiM/Local-AI-Voice-Assistant-for-Student-Learning)
## High-Level Architecture
```text
[Student Voice]
      ↓
Whisper STT (offline)
      ↓
Silero VAD (detect end of speech)
      ↓
LLaMA 3.2 (Q4_K_M, CPU)
      ↓
Kokoro TTS (ONNX)
      ↓
[Audio Response]
```
### Key Guarantees
- ✅ No internet after setup
- ✅ CPU-only
- ✅ ~2GB RAM footprint
- ✅ Student-friendly voice
- ✅ No data leaves the device
## Speech-to-Text: Why Whisper
For offline STT, Whisper was the obvious choice. Why?

- Works fully offline
- Handles noisy environments well
- Lightweight models available
- Easy multilingual expansion later
```python
# The project's STT wrapper: Whisper tiny with int8 weights keeps
# the footprint small enough for CPU-only inference.
stt = WhisperSTTService(
    model_size="tiny",
    device="cpu",
    compute_type="int8"
)
```
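The `compute_type="int8"` option suggests a faster-whisper backend; to try the same configuration outside the service wrapper, here is a minimal standalone sketch (the audio file name is a placeholder):

```python
from faster_whisper import WhisperModel

# Same settings as the wrapper above: tiny model, int8 weights, CPU-only.
model = WhisperModel("tiny", device="cpu", compute_type="int8")

# Transcribe a short 16 kHz mono recording (placeholder file name).
segments, _info = model.transcribe("student_question.wav")
print(" ".join(segment.text.strip() for segment in segments))
```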
### Model Trade-Off
| Model | Accuracy | Speed | RAM |
| ------ | -------- | ------ | ------ |
| tiny | ~95% | ⚡ Fast | ✅ Low |
| base | ~97% | Medium | OK |
| medium | ~99% | Slow | ❌ High |
**Decision: tiny.** Waiting kills engagement faster than small transcription errors do.
## Voice Activity Detection (VAD)
Students don’t press a “stop recording” button.
So the system needs to detect natural pauses.
**Solution: Silero VAD**
```python
# 0.5 speech-probability threshold, 16 kHz input, scored in 100 ms frames.
vad = SileroVADAnalyzer(
    threshold=0.5,
    sample_rate=16000,
    frame_duration_ms=100
)
```
This:

- Detects sentence completion
- Prevents mid-sentence cut-offs
- Keeps the UX natural and frictionless

This matters a lot in classroom environments.
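To see what the analyzer does under the hood, here is a hedged standalone sketch of Silero VAD loaded from torch.hub (the `SileroVADAnalyzer` above wraps the same model; the file name is a placeholder):

```python
import torch

# Load the Silero VAD model plus its helper utilities.
model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps, _, read_audio, *_ = utils

# Find speech regions in a 16 kHz recording (placeholder file name).
wav = read_audio("classroom_clip.wav", sampling_rate=16000)
speech = get_speech_timestamps(wav, model, threshold=0.5, sampling_rate=16000)

# Silence after the last segment is the cue that the student has finished.
print(speech)
```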
## LLM Choice: Why Quantized LLaMA Wins
### The Reality Check
| Option | Memory | Speed | Hardware |
| ------------------ | -------- | ------- | -------- |
| LLaMA 7B (FP16) | ~28GB | ~50s | GPU |
| LLaMA 3.2 (Q4) | ~2GB | ~6s | CPU |
For education:

- Speed > creativity
- Availability > perfection
- Simplicity > fancy prose
Quantization:

- Reduced memory by ~14×
- Improved inference speed by ~8×
- Lost only ~8–12% quality

That trade-off is absolutely worth it.
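The repo has the exact setup; as a rough illustration, loading a Q4_K_M GGUF build of LLaMA 3.2 with llama-cpp-python looks something like this (the model path, context size, and thread count are placeholder assumptions, not the repo's values):

```python
from llama_cpp import Llama

# Q4_K_M weights keep a 3B-class model around the ~2GB RAM budget on CPU.
llm = Llama(
    model_path="models/llama-3.2-3b-instruct-Q4_K_M.gguf",  # placeholder path
    n_ctx=2048,    # modest context window to limit memory
    n_threads=4,   # CPU-only inference
)

reply = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Explain photosynthesis simply."}]
)
print(reply["choices"][0]["message"]["content"])
```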
## Text-to-Speech: Why Voice Quality Matters
My first TTS sounded robotic.
That turned out to be a bigger problem than accuracy.
Students already struggling with learning don’t need a cold, mechanical voice.
### Why Kokoro ONNX?
| Feature  | Kokoro         |
| -------- | -------------- |
| Voice    | Natural & warm |
| Latency  | ~1.2s          |
| Hardware | CPU-only       |
| Size     | ~512MB         |
```python
# Kokoro synthesis is blocking, so it runs in a worker thread
# instead of stalling the event loop.
samples, _ = await asyncio.to_thread(
    self.tts.create,
    text,
    voice="af_heart",
    speed=1.0
)
```
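The snippet above assumes `self.tts` was already constructed. A minimal standalone sketch with the kokoro-onnx package (the model and voices file names are placeholders for whatever release you download):

```python
from kokoro_onnx import Kokoro

# Load the ONNX model and the voice embeddings file (placeholder names).
tts = Kokoro("kokoro-v0_19.onnx", "voices.bin")

samples, sample_rate = tts.create(
    "Good answer! Want to try a harder question?",
    voice="af_heart",
    speed=1.0,
)
```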
💡 **Lesson:** A calm, encouraging voice keeps students engaged more than perfect answers.
## Real-Time Orchestration (Why Async Matters)
A naive blocking pipeline does not work.
❌ **Blocking (Bad)**

```python
# Each stage blocks the next; the app freezes while a stage runs.
text = stt(audio)
response = llm(text)
tts(response)
```
✅ **Async Pipeline (Good)**

```python
# Every stage is an independent processor; frames flow through asynchronously.
pipeline = Pipeline([
    transport.input(),
    stt,
    user_aggregator,
    llm,
    tts,
    transport.output(),
    assistant_aggregator
])

task = PipelineTask(pipeline)  # wrap the pipeline in a runnable task
await PipelineRunner().run(task)
```
Each component runs independently:

- No freezing
- No lag
- Smooth real-time interaction
## Multi-Turn Context
Students ask follow-up questions.
```python
# History starts with the system prompt; the pipeline's aggregators
# append user and assistant turns as the conversation continues.
context = OpenAILLMContext([
    {"role": "system", "content": SYSTEM_PROMPT}
])
```
Example:

> “Since you asked about photosynthesis earlier…”
⚠️ **Important:** Trim the context after ~10 turns to avoid memory issues.
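A minimal sketch of one trimming strategy, assuming `messages` is a plain list of role/content dicts and each turn adds one user plus one assistant message (both assumptions, not the repo's exact code):

```python
MAX_TURNS = 10  # rule of thumb from above

def trim_context(messages: list[dict]) -> list[dict]:
    # Always keep the system prompt, then only the most recent turns.
    system, rest = messages[:1], messages[1:]
    return system + rest[-MAX_TURNS * 2:]  # two messages per turn
```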
## Mistakes I Made (Learn From These)
### 1️⃣ Audio Sample Rate Mismatch

- Whisper → 16kHz
- Kokoro → 24kHz
- Result → distorted audio

Fix:

```python
import librosa

# librosa >= 0.10 takes the sample rates as keyword arguments.
audio_24k = librosa.resample(audio, orig_sr=16000, target_sr=24000)
```
### 2️⃣ Blocking I/O in Async Code

Loading models synchronously froze conversations.

Fix:

```python
# Run the blocking load in a worker thread so the event loop stays responsive.
await asyncio.to_thread(load_model)
```
## Performance Benchmarks (CPU-Only)
| Device | Total Latency |
| ---------- | ------------- |
| MacBook M1 | ~6.7s |
| Intel i7 | ~5.6s |
| Ryzen 5 | ~6.2s |
Observations:

- LLM inference dominates total latency
- TTS performance was surprisingly good
## What I’d Improve Next
- 🔁 Stream LLM tokens directly into TTS (cut latency to ~3s)
- 💾 Optional SQLite logging (privacy-first)
- 🌍 Hindi & Tamil support
- 🎯 Prompt A/B testing for learning styles
## Why This Matters
### For Students
- Removes reading barriers
- Works without internet
- Non-judgmental learning experience
### For Schools
- No vendor lock-in
- Runs on existing hardware
- Full data privacy
### For ML Engineers
- Offline AI is viable
- Quantization is production-ready
- UX matters as much as model accuracy
## Open Source & Next Steps
- ✅ Full source code
- ✅ Docker setup
- ✅ Benchmarks
- ✅ MIT License
👉 GitHub:
https://github.com/SanthanaBharathiM/Local-AI-Voice-Assistant-for-Student-Learning
## Final Thought
The best AI isn’t the smartest — it’s the one people can actually use.
Built with ❤️ for students who learn differently.
Santhana Bharathi
AI/ML Engineer | Offline AI | Jan 2026