DEV Community

owly

Posted on • Edited on
Achieving Neuro‑Sama‑Tier Speech‑to‑Text for Your Local AI Companion (Whisper + CUDA + LivinGrimoire)

If you’re building a local AI companion, you already know the truth:

Without fast, accurate, real‑time speech recognition, your AI will never feel alive.

This guide shows you how to achieve Neuro‑Sama‑level STT using:

  • Whisper (large model)
  • CUDA acceleration
  • Async LivinGrimoire skill architecture

The result?

✔️ High accuracy

✔️ Real‑time transcription

✔️ Runs fully locally

✔️ No REST APIs, no cloud, no rate limits

✔️ Does NOT block your AI’s main think loop

✔️ Modular skill you can drop into any LivinGrimoire‑based AI

Let’s get into it.


🧨 The Problem: PyTorch Fails to Import When Enabling CUDA

On new GPUs (like the RTX 5090 in the Lenovo Legion 9), PyTorch’s stable builds may fail during import:

Traceback (most recent call last):
  File "main.py", line 80, in <module>
    import torch
  ...
  File ".../torch/utils/_debug_mode/_mode.py", line 116, in <module>
    <error occurs here>

This usually means:

  • Your GPU is too new for the stable PyTorch build
  • You need a nightly build with updated CUDA kernels

🛠️ Fix: Install the Correct PyTorch Nightly Build (CUDA 12.8)

pip uninstall torch torchvision torchaudio -y

pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu128

pip install packaging

Verify CUDA:

import torch
print(torch.cuda.is_available())  # Should print True

If you see True, you’re ready for Whisper acceleration.


⚡ Whisper + CUDA = Real‑Time, High‑Accuracy STT

Install dependencies (via the PyCharm terminal):

winget install ffmpeg
ffmpeg -version

pip install openai-whisper pyaudio numpy  # "wave" ships with the Python standard library

Choose your Whisper model:

| Model  | Speed   | Accuracy | GPU Required |
| ------ | ------- | -------- | ------------ |
| base   | fast    | decent   | no           |
| small  | good    | good     | recommended  |
| medium | slower  | high     | yes          |
| large  | slowest | best     | yes          |
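If you want the skill to degrade gracefully on machines without a GPU, the model choice can be derived from CUDA availability. A minimal sketch (the helper name `pick_whisper_model` is mine, not part of Whisper or LivinGrimoire):

```python
def pick_whisper_model(has_cuda: bool) -> str:
    # Largest model that is practical for the hardware:
    # "large" needs a CUDA GPU for real-time use; "base" is a safe CPU fallback.
    return "large" if has_cuda else "base"

# Typical usage (assumes torch and openai-whisper are installed):
# import torch, whisper
# device = "cuda" if torch.cuda.is_available() else "cpu"
# model = whisper.load_model(pick_whisper_model(device == "cuda"), device=device)
```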

For Neuro‑Sama‑tier accuracy:

model = whisper.load_model("large", device="cuda")

Enable fp16 for speed:

result = model.transcribe(audio_np, fp16=True, language='en')
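Since fp16 can fail on GPUs without half-precision support, it is worth wrapping the call with a fallback. A hedged sketch (the wrapper name `transcribe_safe` is my own, not a Whisper API):

```python
def transcribe_safe(model, audio_np):
    # Try half precision first; fall back to fp32 if the GPU rejects it.
    try:
        return model.transcribe(audio_np, fp16=True, language="en")
    except RuntimeError:
        return model.transcribe(audio_np, fp16=False, language="en")
```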

🧩 The LivinGrimoire STT Skill (Async, Non‑Blocking, Local)

This is where the magic happens.

The LivinGrimoire pattern allows you to add new abilities with one line of code, and each ability runs as an independent skill.

This STT skill:

  • Runs in a background thread
  • Never blocks the AI’s main loop
  • Continuously listens and transcribes
  • Uses Whisper large + CUDA
  • Cleans and normalizes text
  • Works 100% offline
  • Integrates into any LivinGrimoire AI instantly

Here is the full async STT skill:

import whisper
import pyaudio
import numpy as np
import re
import atexit
import threading
from queue import Queue

from LivinGrimoirePacket.LivinGrimoire import Brain, Skill

"""
cmd
winget install ffmpeg
check if it installed ok:
ffmpeg -version

in python terminal:
pip install openai-whisper pyaudio numpy
"""

"""
🔧 Whisper + CUDA Upgrade Guide

✅ 1. Check if your system supports GPU acceleration
import torch
print(torch.cuda.is_available())
if it prints True, CUDA is usable.
✅ 2. Install CUDA-enabled PyTorch (if needed)
pip install --upgrade torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
(on very new GPUs, e.g. RTX 50-series, use the nightly cu128 wheel index from the fix above instead)
✅ 3. Load a better Whisper model (with GPU support)
replace:
model = whisper.load_model("base")

with one of these for better accuracy:
model = whisper.load_model("small", device="cuda")     # Good balance
model = whisper.load_model("medium", device="cuda")    # Higher accuracy
model = whisper.load_model("large", device="cuda")     # Best accuracy (but slowest)

This tells Whisper to use your **GPU** (via CUDA), which makes transcription faster and lets you use larger models if needed.
✅ 4. Enable fp16 for faster transcription (on supported GPUs)
replace:
result = DiSTT.model.transcribe(audio_np, fp16=False, language='en')
with:
result = DiSTT.model.transcribe(audio_np, fp16=True, language='en')

This enables **16-bit floating point precision**, which is faster on modern GPUs.
⚠️ fp16=True will crash on GPUs without half-precision support; switch back to fp16=False if transcription errors out.
Generally safe on NVIDIA RTX-class or newer graphics cards.
"""


class DiSTT(Skill):
    CHUNK = 1024
    FORMAT = pyaudio.paInt16
    CHANNELS = 1
    RATE = 16000
    MIN_ACTIVE_SECONDS = 0.3
    exit_event = threading.Event()
    # model = whisper.load_model("base")
    model = whisper.load_model("large", device="cuda")
    p = pyaudio.PyAudio()
    stream = p.open(
        format=FORMAT,
        channels=CHANNELS,
        rate=RATE,
        input=True,
        frames_per_buffer=CHUNK,
    )
    silence_threshold = None
    audio_queue = Queue()
    latest_transcription = ""

    def __init__(self, brain: Brain):
        super().__init__()
        self.set_skill_lobe(3)  # connect to ear input
        self.brain = brain
        atexit.register(DiSTT.cleanup)
        DiSTT.initSTT()

        # Launch background STT thread
        threading.Thread(target=self.run_stt, daemon=True).start()

    @staticmethod
    def cleanup():
        print("\nCleaning up resources...")
        DiSTT.exit_event.set()
        DiSTT.stream.stop_stream()
        DiSTT.stream.close()
        DiSTT.p.terminate()
        print("Cleanup complete. Exiting.")

    @staticmethod
    def calibrate_mic():
        print("Calibrating mic (stay silent for 2s)...")
        samples = []
        for _ in range(int(DiSTT.RATE / DiSTT.CHUNK * 2)):
            data = DiSTT.stream.read(DiSTT.CHUNK, exception_on_overflow=False)
            samples.append(np.abs(np.frombuffer(data, dtype=np.int16)).mean())
        mean_val = float(np.mean(samples) * 1.5)
        DiSTT.silence_threshold = max(mean_val, 100.0)

    @staticmethod
    def clean_text(text):
        # Convert to lowercase
        text = text.lower()
        # Remove special characters except alphanumeric and spaces
        text = re.sub(r'[^a-z0-9\s]', '', text)
        # Remove extra spaces
        text = re.sub(r'\s+', ' ', text).strip()

        return text

    @staticmethod
    def record_chunk():
        frames = []
        silent_frames = 0
        max_silent_frames = int(DiSTT.RATE / DiSTT.CHUNK * 0.5)  # stop recording after 0.5 s of silence

        while not DiSTT.exit_event.is_set():
            data = DiSTT.stream.read(DiSTT.CHUNK, exception_on_overflow=False)
            audio_data = np.frombuffer(data, dtype=np.int16)
            volume = np.abs(audio_data).mean()

            if volume < DiSTT.silence_threshold:
                silent_frames += 1
                if silent_frames > max_silent_frames:
                    break
            else:
                silent_frames = 0
                frames.append(data)

        return b''.join(frames) if len(frames) > int(DiSTT.RATE / DiSTT.CHUNK * DiSTT.MIN_ACTIVE_SECONDS) else None

    @staticmethod
    def transcribe_chunk(audio_bytes):
        audio_np = np.frombuffer(audio_bytes, dtype=np.int16).astype(np.float32) / 32768.0
        # result = DiSTT.model.transcribe(audio_np, fp16=False, language='en')
        result = DiSTT.model.transcribe(audio_np, fp16=True, language='en')
        return DiSTT.clean_text(result["text"])

    @staticmethod
    def initSTT():
        print("Initializing...")
        DiSTT.calibrate_mic()
        print(f"Silence threshold set to: {DiSTT.silence_threshold:.2f}")

    @staticmethod
    def run_stt():
        """ Continuous background STT processing """
        while not DiSTT.exit_event.is_set():
            audio_data = DiSTT.record_chunk()
            if audio_data:
                text = DiSTT.transcribe_chunk(audio_data)
                DiSTT.latest_transcription = text  # Store latest transcription
                if text.strip():
                    print(f"> {text}")  # LIVE CONSOLE OUTPUT

    def input(self, ear: str, skin: str, eye: str):
        """ Read latest transcription from global var """
        if len(self.brain.getLogicChobitOutput()) > 0:
            print("Skipping listen")
            return

        DiSTT.latest_transcription = DiSTT.clean_text(DiSTT.latest_transcription)  # normalize before handing to the brain
        self.setSimpleAlg(DiSTT.latest_transcription)
        DiSTT.latest_transcription = ""  # clear after consuming

    def skillNotes(self, param: str) -> str:
        if param == "notes":
            return "speech to text"
        elif param == "triggers":
            return "automatic and continuous"
        return "note unavailable"
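The heart of `record_chunk` is a simple energy gate: a frame of int16 samples counts as silent when its mean absolute amplitude falls below the calibrated threshold. The same check in isolation (`is_silent` is an illustrative helper, not part of the skill):

```python
import numpy as np

def is_silent(frame: np.ndarray, threshold: float) -> bool:
    # Same volume metric record_chunk uses: mean absolute int16 amplitude.
    return float(np.abs(frame).mean()) < threshold

quiet = np.array([3, -2, 1, 0], dtype=np.int16)
loud = np.array([900, -1200, 800, -1000], dtype=np.int16)
print(is_silent(quiet, 100.0))  # True
print(is_silent(loud, 100.0))   # False
```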

Add the skill with a single line:
brain.add_skill(DiSTT(brain))
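The `clean_text` step matters more than it looks: Whisper emits punctuation and mixed case that can confuse downstream trigger matching. Here is the same normalization in isolation:

```python
import re

def clean_text(text: str) -> str:
    # Lowercase, strip everything except a-z/0-9/spaces, collapse whitespace.
    text = text.lower()
    text = re.sub(r'[^a-z0-9\s]', '', text)
    return re.sub(r'\s+', ' ', text).strip()

print(clean_text("  Hello, Neuro-Sama!   Ready? "))  # hello neurosama ready
```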

🎤 Why This Is a Game‑Changer

✔️ Async

The STT skill runs in its own thread.

Your AI continues thinking, planning, reacting — uninterrupted.

✔️ Local

No cloud.

No REST calls.

No privacy leaks.

No latency spikes.

No API keys.

No rate limits.

✔️ Fast

CUDA + fp16 + Whisper large = real‑time transcription.

✔️ Accurate

Whisper large is state‑of‑the‑art.

You get near‑human transcription quality.

✔️ Modular

Thanks to the LivinGrimoire pattern, adding this skill is literally one line:

brain.add_skill(DiSTT(brain))

🌱 Explore the Full LivinGrimoire Project

If you want to build:

  • Local AI companions
  • Modular AGI‑style systems
  • Skill‑based AI architectures
  • Extensible, maintainable AI codebases

…you owe it to yourself to explore the full project:

👉 https://github.com/yotamarker/LivinGrimoire

It’s a powerful pattern that lets you add new AI abilities with one line of code.
