DEV Community

owly
Achieving Neuro‑Sama‑Tier Speech‑to‑Text for Your Local AI Companion (Whisper + CUDA + LivinGrimoire)

If you’re building a local AI companion, you already know the truth:

Without fast, accurate, real‑time speech recognition, your AI will never feel alive.

This guide shows you how to achieve Neuro‑Sama‑level STT using:

  • Whisper (large model)
  • CUDA acceleration
  • Async LivinGrimoire skill architecture

The result?

✔️ High accuracy

✔️ Real‑time transcription

✔️ Runs fully locally

✔️ No REST APIs, no cloud, no rate limits

✔️ Does NOT block your AI’s main think loop

✔️ Modular skill you can drop into any LivinGrimoire‑based AI

Let’s get into it.


🧨 The Problem: PyTorch Fails to Import When Enabling CUDA

On new GPUs (like the RTX 5090 in the Lenovo Legion 9), PyTorch’s stable builds may fail during import:

Traceback (most recent call last):
  File "main.py", line 80, in <module>
    import torch
  ...
  File ".../torch/utils/_debug_mode/_mode.py", line 116, in <module>
    <error occurs here>

This usually means:

  • Your GPU is too new for the stable PyTorch build
  • You need a nightly build with updated CUDA kernels

🛠️ Fix: Install the Correct PyTorch Nightly Build (CUDA 12.8)

pip uninstall torch torchvision torchaudio -y

pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu128

pip install packaging

Verify CUDA:

import torch
print(torch.cuda.is_available())  # Should print True

If you see True, you’re ready for Whisper acceleration.


⚡ Whisper + CUDA = Real‑Time, High‑Accuracy STT

Install dependencies (via the PyCharm terminal):

winget install ffmpeg
ffmpeg -version

pip install openai-whisper pyaudio numpy

(The `wave` module ships with the Python standard library, so it does not need to be installed separately.)

Choose your Whisper model:

Model    Speed    Accuracy  GPU Required
base     fast     decent    no
small    good     good      recommended
medium   slower   high      yes
large    slowest  best      yes
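If you want your setup to degrade gracefully on weaker hardware, you can map available VRAM to a model tier. A minimal sketch — the function name and the GB thresholds are illustrative rules of thumb, not official requirements:

```python
def pick_whisper_model(vram_gb: float) -> str:
    """Map available VRAM (GB) to a Whisper model tier.

    The thresholds are rough rules of thumb, not official requirements.
    """
    if vram_gb >= 10:
        return "large"
    if vram_gb >= 5:
        return "medium"
    if vram_gb >= 2:
        return "small"
    return "base"

# With PyTorch you could query VRAM like this (assumption, not run here):
# vram_gb = torch.cuda.get_device_properties(0).total_memory / 2**30
print(pick_whisper_model(24.0))  # large
```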

For Neuro‑Sama‑tier accuracy:

model = whisper.load_model("large", device="cuda")

Enable fp16 for speed:

result = model.transcribe(audio_np, fp16=True, language='en')
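Note that `transcribe` accepts a float32 waveform in [-1.0, 1.0], while the microphone hands you int16 PCM bytes, so a normalization step comes first. A self-contained sketch of that conversion (the helper name is mine):

```python
import numpy as np

def pcm16_to_float32(audio_bytes: bytes) -> np.ndarray:
    # int16 PCM -> float32 in [-1.0, 1.0), the format Whisper expects at 16 kHz
    return np.frombuffer(audio_bytes, dtype=np.int16).astype(np.float32) / 32768.0

samples = np.array([0, 16384, -32768], dtype=np.int16).tobytes()
print(pcm16_to_float32(samples))  # [ 0.   0.5 -1. ]
```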

🧩 The LivinGrimoire STT Skill (Async, Non‑Blocking, Local)

This is where the magic happens.

The LivinGrimoire pattern allows you to add new abilities with one line of code, and each ability runs as an independent skill.

This STT skill:

  • Runs in a background thread
  • Never blocks the AI’s main loop
  • Continuously listens and transcribes
  • Uses Whisper large + CUDA
  • Cleans and normalizes text
  • Works 100% offline
  • Integrates into any LivinGrimoire AI instantly

Here is the full async STT skill:

import whisper
import pyaudio
import numpy as np
import re
import atexit
import threading
from queue import Queue

from LivinGrimoirePacket.LivinGrimoire import Brain, Skill

class DiSTT(Skill):
    CHUNK = 1024
    FORMAT = pyaudio.paInt16
    CHANNELS = 1
    RATE = 16000
    MIN_ACTIVE_SECONDS = 0.3
    exit_event = threading.Event()

    model = whisper.load_model("large", device="cuda")
    p = pyaudio.PyAudio()
    stream = p.open(
        format=FORMAT,
        channels=CHANNELS,
        rate=RATE,
        input=True,
        frames_per_buffer=CHUNK,
    )
    silence_threshold = None
    audio_queue = Queue()
    latest_transcription = ""

    def __init__(self, brain: Brain):
        super().__init__()
        self.set_skill_lobe(3)
        self.brain = brain
        atexit.register(DiSTT.cleanup)
        DiSTT.initSTT()

        # ASYNC: runs in background, never blocks main loop
        threading.Thread(target=self.run_stt, daemon=True).start()

    @staticmethod
    def cleanup():
        print("\nCleaning up resources...")
        DiSTT.exit_event.set()
        DiSTT.stream.stop_stream()
        DiSTT.stream.close()
        DiSTT.p.terminate()
        print("Cleanup complete. Exiting.")

    @staticmethod
    def calibrate_mic():
        print("Calibrating mic (stay silent for 2s)...")
        samples = []
        for _ in range(int(DiSTT.RATE / DiSTT.CHUNK * 2)):
            data = DiSTT.stream.read(DiSTT.CHUNK, exception_on_overflow=False)
            samples.append(np.abs(np.frombuffer(data, dtype=np.int16)).mean())
        mean_val = float(np.mean(samples) * 1.5)
        DiSTT.silence_threshold = max(mean_val, 100.0)

    @staticmethod
    def clean_text(text):
        text = text.lower()
        text = re.sub(r'[^a-z0-9\s]', '', text)
        text = re.sub(r'\s+', ' ', text).strip()
        return text

    @staticmethod
    def record_chunk():
        frames = []
        silent_frames = 0
        max_silent_frames = int(DiSTT.RATE / DiSTT.CHUNK * 0.5)

        while not DiSTT.exit_event.is_set():
            data = DiSTT.stream.read(DiSTT.CHUNK, exception_on_overflow=False)
            audio_data = np.frombuffer(data, dtype=np.int16)
            volume = np.abs(audio_data).mean()

            if volume < DiSTT.silence_threshold:
                silent_frames += 1
                if silent_frames > max_silent_frames:
                    break
            else:
                silent_frames = 0
                frames.append(data)

        return b''.join(frames) if len(frames) > int(DiSTT.RATE / DiSTT.CHUNK * DiSTT.MIN_ACTIVE_SECONDS) else None

    @staticmethod
    def transcribe_chunk(audio_bytes):
        audio_np = np.frombuffer(audio_bytes, dtype=np.int16).astype(np.float32) / 32768.0
        result = DiSTT.model.transcribe(audio_np, fp16=True, language='en')
        return DiSTT.clean_text(result["text"])

    def run_stt(self):
        # Background loop referenced from __init__: record until silence,
        # transcribe, then publish the result for the rest of the AI to read.
        while not DiSTT.exit_event.is_set():
            audio_bytes = DiSTT.record_chunk()
            if audio_bytes is None:
                continue
            text = DiSTT.transcribe_chunk(audio_bytes)
            if text:
                DiSTT.latest_transcription = text
                DiSTT.audio_queue.put(text)

    @staticmethod
    def initSTT():
        print("Initializing...")
        DiSTT.calibrate_mic()
        print(f"Silence threshold set to: {DiSTT.silence_threshold:.2f}")
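The silence gate inside `record_chunk` boils down to a single check: is the chunk's mean absolute amplitude above the calibrated threshold? A standalone sketch of that check (the threshold value here is illustrative — in the skill it comes from `calibrate_mic`):

```python
import numpy as np

def is_speech(chunk: np.ndarray, threshold: float) -> bool:
    # Same energy measure the skill uses: mean absolute int16 amplitude
    return float(np.abs(chunk).mean()) >= threshold

silence = np.zeros(1024, dtype=np.int16)
speech = np.full(1024, 5000, dtype=np.int16)

print(is_speech(silence, 100.0))  # False
print(is_speech(speech, 100.0))   # True
```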

Add the skill with one line:

brain.add_skill(DiSTT(brain))
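Once the skill is running, the background thread publishes each utterance into `DiSTT.audio_queue` and `DiSTT.latest_transcription`. A minimal, LivinGrimoire-independent sketch of draining such a queue from your own loop — the producer thread here just stands in for the real STT thread, and the names are illustrative:

```python
import threading
from queue import Queue, Empty

audio_queue: Queue = Queue()  # stands in for DiSTT.audio_queue

def producer() -> None:
    # Stands in for the background STT thread publishing a transcription
    audio_queue.put("hello world")

threading.Thread(target=producer, daemon=True).start()

def drain(q: Queue, timeout: float = 1.0) -> list:
    # Grab whatever transcriptions have arrived without blocking for long
    items = []
    try:
        items.append(q.get(timeout=timeout))
        while True:
            items.append(q.get_nowait())
    except Empty:
        pass
    return items

print(drain(audio_queue))  # ['hello world']
```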

🎤 Why This Is a Game‑Changer

✔️ Async

The STT skill runs in its own thread.

Your AI continues thinking, planning, reacting — uninterrupted.

✔️ Local

No cloud.

No REST calls.

No privacy leaks.

No latency spikes.

No API keys.

No rate limits.

✔️ Fast

CUDA + fp16 + Whisper large = real‑time transcription.

✔️ Accurate

Whisper large is state‑of‑the‑art.

You get near‑human transcription quality.

✔️ Modular

Thanks to the LivinGrimoire pattern, adding this skill is literally one line:

brain.add_skill(DiSTT(brain))

🌱 Explore the Full LivinGrimoire Project

If you want to build:

  • Local AI companions
  • Modular AGI‑style systems
  • Skill‑based AI architectures
  • Extensible, maintainable AI codebases

…you owe it to yourself to explore the full project:

👉 https://github.com/yotamarker/LivinGrimoire

It’s a powerful pattern that lets you add new AI abilities with one line of code.
