**TL;DR:** I built a fully offline AI voice assistant for students using Whisper STT, Silero VAD, quantized LLaMA 3.2, and Kokoro TTS. It runs entirely on CPU, fits in ~2GB of RAM, works without internet, and is designed for low-cost laptops.
## Why I Built This
While looking into NavGurukul’s AI Lab initiatives, one question kept bothering me:
> *What about students who can’t read fluently, or don’t have reliable internet access?*
Most AI learning tools assume:

- Stable internet
- Cloud APIs
- Expensive hardware
- Comfortable reading ability
But in many Tier-2 and Tier-3 regions in India, students often:

- Struggle with reading comprehension
- Have unreliable or no internet
- Study in schools with limited infrastructure
- Face real data-privacy concerns
So I decided to build something different:

- 🗣️ Voice-first
- 📴 Offline-first
- 💻 CPU-only
- 🧠 Privacy-first
## What I Built
A local AI voice assistant that:

- Runs fully offline after model download
- Fits in ~2GB RAM using quantization
- Responds in 5–7 seconds on CPU
- Works on ₹20k laptops
- Uses open-source models only
GitHub Repo: 👉 [Code Base](https://github.com/SanthanaBharathiM/Local-AI-Voice-Assistant-for-Student-Learning)
## High-Level Architecture
```text
[Student Voice]
      ↓
Whisper STT (offline)
      ↓
Silero VAD (detect end of speech)
      ↓
LLaMA 3.2 (Q4_K_M, CPU)
      ↓
Kokoro TTS (ONNX)
      ↓
[Audio Response]
```
### Key Guarantees
- ✅ No internet after setup
- ✅ CPU-only
- ✅ ~2GB RAM footprint
- ✅ Student-friendly voice
- ✅ No data leaves the device
## Speech-to-Text: Why Whisper
For offline STT, Whisper was the obvious choice. Why?

- Works fully offline
- Handles noisy environments well
- Lightweight models available
- Easy multilingual expansion later
```python
# The project's STT wrapper: Whisper tiny with int8 weights keeps
# the footprint small enough for CPU-only inference.
stt = WhisperSTTService(
    model_size="tiny",
    device="cpu",
    compute_type="int8"
)
```
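The `compute_type="int8"` option suggests a faster-whisper backend; to try the same configuration outside the service wrapper, here is a minimal standalone sketch (the audio file name is a placeholder):

```python
from faster_whisper import WhisperModel

# Same settings as the wrapper above: tiny model, int8 weights, CPU-only.
model = WhisperModel("tiny", device="cpu", compute_type="int8")

# Transcribe a short 16 kHz mono recording (placeholder file name).
segments, _info = model.transcribe("student_question.wav")
print(" ".join(segment.text.strip() for segment in segments))
```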
### Model Trade-Off
| Model | Accuracy | Speed | RAM |
| ------ | -------- | ------ | ------ |
| tiny | ~95% | ⚡ Fast | ✅ Low |
| base | ~97% | Medium | OK |
| medium | ~99% | Slow | ❌ High |
**Decision: tiny.** Waiting kills engagement faster than small transcription errors do.
## Voice Activity Detection (VAD)
Students don’t press a “stop recording” button.
So the system needs to detect natural pauses.
**Solution: Silero VAD**
```python
# 0.5 speech-probability threshold, 16 kHz input, scored in 100 ms frames.
vad = SileroVADAnalyzer(
    threshold=0.5,
    sample_rate=16000,
    frame_duration_ms=100
)
```
This:

- Detects sentence completion
- Prevents mid-sentence cut-offs
- Keeps the UX natural and frictionless

This matters a lot in classroom environments.
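To see what the analyzer does under the hood, here is a hedged standalone sketch of Silero VAD loaded from torch.hub (the `SileroVADAnalyzer` above wraps the same model; the file name is a placeholder):

```python
import torch

# Load the Silero VAD model plus its helper utilities.
model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps, _, read_audio, *_ = utils

# Find speech regions in a 16 kHz recording (placeholder file name).
wav = read_audio("classroom_clip.wav", sampling_rate=16000)
speech = get_speech_timestamps(wav, model, threshold=0.5, sampling_rate=16000)

# Silence after the last segment is the cue that the student has finished.
print(speech)
```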
## LLM Choice: Why Quantized LLaMA Wins
### The Reality Check
| Option | Memory | Speed | Hardware |
| ------------------ | -------- | ------- | -------- |
| LLaMA 7B (FP16) | ~28GB | ~50s | GPU |
| LLaMA 3.2 (Q4) | ~2GB | ~6s | CPU |
For education:

- Speed > creativity
- Availability > perfection
- Simplicity > fancy prose
Quantization:

- Reduced memory by ~14×
- Improved inference speed by ~8×
- Lost only ~8–12% quality

That trade-off is absolutely worth it.
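The repo has the exact setup; as a rough illustration, loading a Q4_K_M GGUF build of LLaMA 3.2 with llama-cpp-python looks something like this (the model path, context size, and thread count are placeholder assumptions, not the repo's values):

```python
from llama_cpp import Llama

# Q4_K_M weights keep a 3B-class model around the ~2GB RAM budget on CPU.
llm = Llama(
    model_path="models/llama-3.2-3b-instruct-Q4_K_M.gguf",  # placeholder path
    n_ctx=2048,    # modest context window to limit memory
    n_threads=4,   # CPU-only inference
)

reply = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Explain photosynthesis simply."}]
)
print(reply["choices"][0]["message"]["content"])
```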
## Text-to-Speech: Why Voice Quality Matters
My first TTS sounded robotic.
That turned out to be a bigger problem than accuracy.
Students already struggling with learning don’t need a cold, mechanical voice.
### Why Kokoro ONNX?
| Feature  | Kokoro         |
| -------- | -------------- |
| Voice    | Natural & warm |
| Latency  | ~1.2s          |
| Hardware | CPU-only       |
| Size     | ~512MB         |
```python
# Kokoro synthesis is blocking, so it runs in a worker thread
# instead of stalling the event loop.
samples, _ = await asyncio.to_thread(
    self.tts.create,
    text,
    voice="af_heart",
    speed=1.0
)
```
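The snippet above assumes `self.tts` was already constructed. A minimal standalone sketch with the kokoro-onnx package (the model and voices file names are placeholders for whatever release you download):

```python
from kokoro_onnx import Kokoro

# Load the ONNX model and the voice embeddings file (placeholder names).
tts = Kokoro("kokoro-v0_19.onnx", "voices.bin")

samples, sample_rate = tts.create(
    "Good answer! Want to try a harder question?",
    voice="af_heart",
    speed=1.0,
)
```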
💡 **Lesson:** A calm, encouraging voice keeps students engaged more than perfect answers.
## Real-Time Orchestration (Why Async Matters)
A naive blocking pipeline does not work.
❌ **Blocking (Bad)**

```python
# Each stage blocks the next; the app freezes while a stage runs.
text = stt(audio)
response = llm(text)
tts(response)
```
✅ **Async Pipeline (Good)**

```python
# Every stage is an independent processor; frames flow through asynchronously.
pipeline = Pipeline([
    transport.input(),
    stt,
    user_aggregator,
    llm,
    tts,
    transport.output(),
    assistant_aggregator
])

task = PipelineTask(pipeline)  # wrap the pipeline in a runnable task
await PipelineRunner().run(task)
```
Each component runs independently:

- No freezing
- No lag
- Smooth real-time interaction
## Multi-Turn Context
Students ask follow-up questions.
```python
# History starts with the system prompt; the pipeline's aggregators
# append user and assistant turns as the conversation continues.
context = OpenAILLMContext([
    {"role": "system", "content": SYSTEM_PROMPT}
])
```
Example:

> “Since you asked about photosynthesis earlier…”
⚠️ **Important:** Trim the context after ~10 turns to avoid memory issues.
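A minimal sketch of one trimming strategy, assuming `messages` is a plain list of role/content dicts and each turn adds one user plus one assistant message (both assumptions, not the repo's exact code):

```python
MAX_TURNS = 10  # rule of thumb from above

def trim_context(messages: list[dict]) -> list[dict]:
    # Always keep the system prompt, then only the most recent turns.
    system, rest = messages[:1], messages[1:]
    return system + rest[-MAX_TURNS * 2:]  # two messages per turn
```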
## Mistakes I Made (Learn From These)
### 1️⃣ Audio Sample Rate Mismatch

- Whisper → 16kHz
- Kokoro → 24kHz
- Result → distorted audio

Fix:

```python
import librosa

# librosa >= 0.10 takes the sample rates as keyword arguments.
audio_24k = librosa.resample(audio, orig_sr=16000, target_sr=24000)
```
### 2️⃣ Blocking I/O in Async Code

Loading models synchronously froze conversations.

Fix:

```python
# Run the blocking load in a worker thread so the event loop stays responsive.
await asyncio.to_thread(load_model)
```
## Performance Benchmarks (CPU-Only)
| Device | Total Latency |
| ---------- | ------------- |
| MacBook M1 | ~6.7s |
| Intel i7 | ~5.6s |
| Ryzen 5 | ~6.2s |
Observations:

- LLM inference dominates total latency
- TTS performance was surprisingly good
## What I’d Improve Next
- 🔁 Stream LLM tokens directly into TTS (cut latency to ~3s)
- 💾 Optional SQLite logging (privacy-first)
- 🌍 Hindi & Tamil support
- 🎯 Prompt A/B testing for learning styles
## Why This Matters
### For Students
- Removes reading barriers
- Works without internet
- Non-judgmental learning experience
### For Schools
- No vendor lock-in
- Runs on existing hardware
- Full data privacy
### For ML Engineers
- Offline AI is viable
- Quantization is production-ready
- UX matters as much as model accuracy
## Open Source & Next Steps
- ✅ Full source code
- ✅ Docker setup
- ✅ Benchmarks
- ✅ MIT License
👉 GitHub:
https://github.com/SanthanaBharathiM/Local-AI-Voice-Assistant-for-Student-Learning
## Final Thought
The best AI isn’t the smartest — it’s the one people can actually use.
Built with ❤️ for students who learn differently.
Santhana Bharathi
AI/ML Engineer | Offline AI | Jan 2026