If you’re building a local AI companion, you already know the truth:
Without fast, accurate, real‑time speech recognition, your AI will never feel alive.
This guide shows you how to achieve Neuro‑Sama‑level STT using:
- Whisper (large model)
- CUDA acceleration
- Async LivinGrimoire skill architecture
The result?
✔️ High accuracy
✔️ Real‑time transcription
✔️ Runs fully locally
✔️ No REST APIs, no cloud, no rate limits
✔️ Does NOT block your AI’s main think loop
✔️ Modular skill you can drop into any LivinGrimoire‑based AI
Let’s get into it.
🧨 The Problem: PyTorch Fails to Import When Enabling CUDA
On new GPUs (like the RTX 5090 in the Lenovo Legion 9), PyTorch’s stable builds may fail during import:
```
Traceback (most recent call last):
  File "main.py", line 80, in <module>
    import torch
  ...
  File ".../torch/utils/_debug_mode/_mode.py", line 116, in <module>
    <error occurs here>
```
This usually means:
- Your GPU is too new for the stable PyTorch build
- You need a nightly build with updated CUDA kernels
🛠️ Fix: Install the Correct PyTorch Nightly Build (CUDA 12.8)
```shell
pip uninstall torch torchvision torchaudio -y
pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu128
pip install packaging
```
Verify CUDA:
```python
import torch
print(torch.cuda.is_available())  # Should print True
```
If you see True, you’re ready for Whisper acceleration.
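For a fuller picture of which build and device are actually in use, a guarded check can report the CUDA version and GPU name, and degrade gracefully when torch or CUDA is missing (`cuda_summary` is a hypothetical helper name, not part of any library):

```python
import importlib.util

def cuda_summary() -> str:
    # Report the active CUDA build and GPU, or explain what is missing.
    if importlib.util.find_spec("torch") is None:
        return "torch is not installed"
    import torch
    if not torch.cuda.is_available():
        return f"torch {torch.__version__}: CUDA unavailable"
    return f"CUDA {torch.version.cuda} on {torch.cuda.get_device_name(0)}"

print(cuda_summary())
```

On a correctly configured RTX 5090 this should name your GPU and the cu128 build.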
⚡ Whisper + CUDA = Real‑Time, High‑Accuracy STT
Install dependencies (via the PyCharm terminal):

```shell
winget install ffmpeg
ffmpeg -version
pip install openai-whisper pyaudio numpy
```

(`wave` ships with the Python standard library, so there is no need to install it with pip.)
Choose your Whisper model:
| Model | Speed | Accuracy | GPU Required |
|---|---|---|---|
| base | fast | decent | no |
| small | good | good | recommended |
| medium | slower | high | yes |
| large | slowest | best | yes |
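If you want the skill to fall back automatically on weaker machines, one possible selection policy is a small helper like the sketch below. The VRAM cutoffs are rough assumptions for illustration, not official Whisper requirements:

```python
def pick_model(gpu_available: bool, vram_gb: float) -> str:
    # Map available hardware to a model from the table above.
    # The VRAM cutoffs are assumptions, tune them for your GPU.
    if not gpu_available:
        return "base"
    if vram_gb >= 10:
        return "large"
    if vram_gb >= 5:
        return "medium"
    return "small"

print(pick_model(True, 24.0))  # large
```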
For Neuro‑Sama‑tier accuracy:
```python
model = whisper.load_model("large", device="cuda")
```
Enable fp16 for speed:
```python
result = model.transcribe(audio_np, fp16=True, language='en')
```
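Note that `transcribe` expects a float32 array scaled to [-1.0, 1.0), while PyAudio delivers raw int16 PCM. The conversion is just a rescale; here is the same math the skill below performs with NumPy, sketched with the standard library only:

```python
import struct

def pcm16_to_float(raw: bytes) -> list[float]:
    # Little-endian int16 PCM -> floats in [-1.0, 1.0)
    count = len(raw) // 2
    return [s / 32768.0 for s in struct.unpack(f"<{count}h", raw)]

print(pcm16_to_float(struct.pack("<3h", 0, 16384, -32768)))  # [0.0, 0.5, -1.0]
```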
🧩 The LivinGrimoire STT Skill (Async, Non‑Blocking, Local)
This is where the magic happens.
The LivinGrimoire pattern allows you to add new abilities with one line of code, and each ability runs as an independent skill.
This STT skill:
- Runs in a background thread
- Never blocks the AI’s main loop
- Continuously listens and transcribes
- Uses Whisper large + CUDA
- Cleans and normalizes text
- Works 100% offline
- Integrates into any LivinGrimoire AI instantly
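The "continuously listens" behavior rests on a simple energy gate: each audio frame's mean absolute amplitude is compared against a silence threshold calibrated at startup. The core measurement can be sketched with the standard library alone:

```python
import struct

def mean_abs_amplitude(raw: bytes) -> float:
    # Average |sample| over little-endian int16 PCM frames;
    # frames below a calibrated threshold count as silence.
    count = len(raw) // 2
    if count == 0:
        return 0.0
    samples = struct.unpack(f"<{count}h", raw)
    return sum(abs(s) for s in samples) / count

print(mean_abs_amplitude(struct.pack("<4h", 100, -100, 0, 200)))  # 100.0
```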
Here is the full async STT skill:
```python
import whisper
import pyaudio
import numpy as np
import re
import atexit
import threading
from queue import Queue

from LivinGrimoirePacket.LivinGrimoire import Brain, Skill


class DiSTT(Skill):
    CHUNK = 1024
    FORMAT = pyaudio.paInt16
    CHANNELS = 1
    RATE = 16000
    MIN_ACTIVE_SECONDS = 0.3

    exit_event = threading.Event()
    model = whisper.load_model("large", device="cuda")
    p = pyaudio.PyAudio()
    stream = p.open(
        format=FORMAT,
        channels=CHANNELS,
        rate=RATE,
        input=True,
        frames_per_buffer=CHUNK,
    )
    silence_threshold = None
    audio_queue = Queue()
    latest_transcription = ""

    def __init__(self, brain: Brain):
        super().__init__()
        self.set_skill_lobe(3)
        self.brain = brain
        atexit.register(DiSTT.cleanup)
        DiSTT.initSTT()
        # ASYNC: runs in background, never blocks main loop
        threading.Thread(target=self.run_stt, daemon=True).start()

    def run_stt(self):
        # Background worker: record speech chunks, transcribe them,
        # and publish the cleaned text for the rest of the AI.
        while not DiSTT.exit_event.is_set():
            audio_bytes = DiSTT.record_chunk()
            if audio_bytes:
                DiSTT.latest_transcription = DiSTT.transcribe_chunk(audio_bytes)

    @staticmethod
    def cleanup():
        print("\nCleaning up resources...")
        DiSTT.exit_event.set()
        DiSTT.stream.stop_stream()
        DiSTT.stream.close()
        DiSTT.p.terminate()
        print("Cleanup complete. Exiting.")

    @staticmethod
    def calibrate_mic():
        print("Calibrating mic (stay silent for 2s)...")
        samples = []
        for _ in range(int(DiSTT.RATE / DiSTT.CHUNK * 2)):
            data = DiSTT.stream.read(DiSTT.CHUNK, exception_on_overflow=False)
            samples.append(np.abs(np.frombuffer(data, dtype=np.int16)).mean())
        mean_val = float(np.mean(samples) * 1.5)
        DiSTT.silence_threshold = max(mean_val, 100.0)

    @staticmethod
    def clean_text(text):
        text = text.lower()
        text = re.sub(r'[^a-z0-9\s]', '', text)
        text = re.sub(r'\s+', ' ', text).strip()
        return text

    @staticmethod
    def record_chunk():
        frames = []
        silent_frames = 0
        max_silent_frames = int(DiSTT.RATE / DiSTT.CHUNK * 0.5)
        while not DiSTT.exit_event.is_set():
            data = DiSTT.stream.read(DiSTT.CHUNK, exception_on_overflow=False)
            audio_data = np.frombuffer(data, dtype=np.int16)
            volume = np.abs(audio_data).mean()
            if volume < DiSTT.silence_threshold:
                silent_frames += 1
                if silent_frames > max_silent_frames:
                    break
            else:
                silent_frames = 0
            frames.append(data)
        min_frames = int(DiSTT.RATE / DiSTT.CHUNK * DiSTT.MIN_ACTIVE_SECONDS)
        return b''.join(frames) if len(frames) > min_frames else None

    @staticmethod
    def transcribe_chunk(audio_bytes):
        audio_np = np.frombuffer(audio_bytes, dtype=np.int16).astype(np.float32) / 32768.0
        result = DiSTT.model.transcribe(audio_np, fp16=True, language='en')
        return DiSTT.clean_text(result["text"])

    @staticmethod
    def initSTT():
        print("Initializing...")
        DiSTT.calibrate_mic()
        print(f"Silence threshold set to: {DiSTT.silence_threshold:.2f}")
```
Add the skill with one line:

```python
brain.add_skill(DiSTT(brain))
```
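Before wiring anything up, the normalization step (`clean_text`) can be exercised on its own to see exactly what reaches the brain:

```python
import re

def clean_text(text: str) -> str:
    # Same normalization the skill applies: lowercase,
    # drop punctuation, collapse runs of whitespace.
    text = text.lower()
    text = re.sub(r'[^a-z0-9\s]', '', text)
    return re.sub(r'\s+', ' ', text).strip()

print(clean_text("  Hello, WORLD!!  "))  # hello world
```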
🎤 Why This Is a Game‑Changer
✔️ Async
The STT skill runs in its own thread.
Your AI continues thinking, planning, reacting — uninterrupted.
✔️ Local
No cloud.
No REST calls.
No privacy leaks.
No latency spikes.
No API keys.
No rate limits.
✔️ Fast
CUDA + fp16 + Whisper large = real‑time transcription.
✔️ Accurate
Whisper large is state‑of‑the‑art.
You get near‑human transcription quality.
✔️ Modular
Thanks to the LivinGrimoire pattern, adding this skill is literally one line:

```python
brain.add_skill(DiSTT(brain))
```
🌱 Explore the Full LivinGrimoire Project
If you want to build:
- Local AI companions
- Modular AGI‑style systems
- Skill‑based AI architectures
- Extensible, maintainable AI codebases
…you owe it to yourself to explore the full project:
👉 https://github.com/yotamarker/LivinGrimoire
It’s a powerful pattern that lets you add new AI abilities with one line of code.