Akash

I Built a Voice-Controlled OBS Assistant (Metaltank) — Here’s What Really Happened

This post is about debugging pain, systems thinking, and the moment everything finally worked.


🎯 What I Wanted to Build

I wanted to remove clicks from OBS.

Not automate one button.

Not trigger a hotkey.

I wanted to talk to OBS.

Say things like:

  • “Metaltank mute mic”
  • “Metaltank switch scene”
  • “Metaltank start recording”

…and have OBS respond instantly.

No Stream Deck.

No keyboard shortcuts.

No mouse.

Just voice → intent → OBS WebSocket → action.

That project is called Metaltank.


🧠 What I Actually Built (So Far)

Metaltank is a Node.js-based voice controller for OBS that:

  • Connects to OBS using obs-websocket
  • Captures microphone audio using arecord
  • Streams short audio chunks to whisper.cpp (local, offline)
  • Converts speech → text
  • Parses intent using a custom rule engine
  • Executes OBS actions (mute, unmute, toggle mic, scenes, recording)

All offline.

No cloud APIs.

No paid services.
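Glued together, that pipeline is a short loop: record a chunk, transcribe it, parse it, act on it. Here is a sketch of the glue (the helper names recordChunk, transcribe, parseIntent, and executeAction are illustrative stand-ins for the real modules, not Metaltank's exact code):

```javascript
// Sketch of the capture -> transcribe -> parse -> act loop.
// The four helpers are injected so each stage can be swapped or stubbed;
// their names are hypothetical, not the actual module names.
async function listenLoop({ recordChunk, transcribe, parseIntent, executeAction }) {
  // Run forever: each iteration handles one fixed-length audio chunk.
  for (;;) {
    const wavPath = await recordChunk(3);    // 3-second chunk via arecord
    const text = await transcribe(wavPath);  // whisper.cpp server
    const intent = parseIntent(text);        // rule engine + wake word
    if (intent) await executeAction(intent); // obs-websocket call
  }
}
```

Passing the stages in as functions also makes the loop trivially testable with stubs, which matters when the real inputs are a microphone and a running OBS.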


⚙️ Tech Stack

  • Node.js (ESM)
  • OBS WebSocket
  • whisper.cpp (server mode)
  • arecord (ALSA)
  • Custom rule-based intent parser

Simple stack.

Hard execution.


😤 Why This Was Way Harder Than It Sounds

Let me be honest — nothing worked the first time.

1️⃣ Native Modules Failed (Vosk)

I initially tried vosk.

It failed because:

  • Native compilation
  • Missing build tools
  • Node-gyp issues
  • Environment limitations

Lesson:

“Offline” doesn’t always mean “easy.”


2️⃣ OBS Was “Connected” But Not Ready

I kept hitting errors like:

Error: Socket not identified
Error: Not connected

Root cause:

  • OBS WebSocket connect ≠ identify
  • Actions were being called before OBS completed its handshake

Fix:

  • Explicit OBS-ready state
  • No actions allowed before identification
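Here is roughly what that ready gate looks like. With obs-websocket-js v5 the connect() promise only resolves after the Identify handshake, but making the state explicit still catches anything fired too early or after a disconnect (the wiring below is a sketch, not Metaltank's exact code):

```javascript
// Minimal "connected is not identified" gate. Every OBS action
// goes through call(), which refuses until the Identified event fires.
class ObsGate {
  #ready = false;

  markIdentified() { this.#ready = true; }
  markClosed() { this.#ready = false; }

  async call(obs, request, params) {
    if (!this.#ready) {
      throw new Error(`OBS not identified yet: refusing ${request}`);
    }
    return obs.call(request, params);
  }
}

// Hypothetical wiring (not executed here): requires `npm i obs-websocket-js`.
async function connectOBS(gate, url = 'ws://127.0.0.1:4455', password) {
  const { default: OBSWebSocket } = await import('obs-websocket-js');
  const obs = new OBSWebSocket();
  obs.on('Identified', () => gate.markIdentified());
  obs.on('ConnectionClosed', () => gate.markClosed());
  await obs.connect(url, password); // resolves after Identify in v5
  return obs;
}
```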

3️⃣ Voice Was Triggering Commands Without Me Speaking

At one point, Metaltank muted my mic without me saying anything.

Why?

  • A simulated voice-input path was still wired in
  • The voice module executed before OBS was ready
  • No wake-word guard


Fix:

  • Strict wake word: metaltank
  • OBS readiness gate
  • Clear separation between CLI and voice input
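A minimal version of that wake-word guard plus a rule engine looks like this (the intent names and regex rules are illustrative; the real parser's table may differ):

```javascript
// Strict wake-word guard: nothing runs unless the utterance
// *starts* with the wake word. Everything after it is matched
// against a small ordered rule table.
const WAKE_WORD = 'metaltank';

const RULES = [
  { match: /\bunmute\b.*\bmic\b/,   intent: 'UNMUTE_MIC' },
  { match: /\bmute\b.*\bmic\b/,     intent: 'MUTE_MIC' },
  { match: /\btoggle\b.*\bmic\b/,   intent: 'TOGGLE_MIC' },
  { match: /\bstart\b.*\brecord/,   intent: 'START_RECORDING' },
  { match: /\bstop\b.*\brecord/,    intent: 'STOP_RECORDING' },
  { match: /\bswitch\b.*\bscene\b/, intent: 'SWITCH_SCENE' },
];

function parseIntent(transcript) {
  const text = transcript.toLowerCase().trim();
  if (!text.startsWith(WAKE_WORD)) return null; // no wake word, no action
  const command = text.slice(WAKE_WORD.length).trim();
  for (const rule of RULES) {
    if (rule.match.test(command)) return rule.intent;
  }
  return null; // wake word heard, but no rule matched
}
```

Returning null for anything without the wake word is exactly what stops stray transcription noise (or leftover simulated input) from muting your mic.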

4️⃣ whisper.cpp Flags Betrayed Me

I tried flags like:

--step
--length

They don’t exist in whisper-cli.

Fix:

  • Stop guessing flags
  • Read --help
  • Switch to whisper-server
  • POST WAV files properly

Lesson:

Always check the CLI help. Even when you’re confident.
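Posting chunks to whisper-server is straightforward with Node 18+ globals (fetch, FormData, Blob). This sketch assumes the server's /inference endpoint and its "file" multipart field, which is what recent whisper.cpp server builds expose, and a server started with something like `./whisper-server -m models/ggml-base.en.bin`:

```javascript
// Sketch: POST one WAV chunk to a locally running whisper.cpp server
// and read back the transcript from its JSON response.
async function transcribe(wav, url = 'http://127.0.0.1:8080/inference') {
  const form = new FormData();
  // "file" is the server's multipart field for the audio payload.
  form.append('file', new Blob([wav], { type: 'audio/wav' }), 'chunk.wav');

  const res = await fetch(url, { method: 'POST', body: form });
  if (!res.ok) throw new Error(`whisper-server returned ${res.status}`);

  const { text } = await res.json(); // JSON response carries a "text" field
  return text.trim();
}
```

Taking a Buffer instead of a path keeps the function pure I/O-wise: reading the WAV file stays the caller's job, which also makes the function easy to test with a stubbed fetch.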


5️⃣ Audio Was Recording… But Whisper Heard Nothing

This was the hardest part.

  • WAV files existed
  • Audio played correctly
  • Whisper returned [BLANK_AUDIO]

Root causes:

  • Chunks too short to carry a full phrase
  • Chunks dominated by silence
  • Wrong assumptions about how streaming inference works

Fix:

  • Fixed-length chunks (3 seconds)
  • File-based inference
  • Let whisper finish before deleting audio

🔥 The Moment It Worked

🗣️ Heard: mute mic
[VOICE] MUTE_MIC
🎙 Mic muted

I didn’t celebrate loudly.

I just smiled.

Because this wasn’t luck —

it was layers finally aligning.


🧩 Current Metaltank Capabilities

  • 🎙 Mute / unmute / toggle mic
  • 🎬 Scene control
  • ⏺ Recording control
  • 🧠 Continuous listening
  • 🔒 Fully offline

OBS reacts to my voice.


🧠 What I Learned

  • “Connected” doesn’t mean “ready”
  • Audio pipelines fail silently
  • Logging saves hours
  • If you’re confused, the system is confused too

Biggest lesson:

Complex systems don’t fail loudly — they fail quietly.


🚧 Still Phase 1

This is still Phase 1.

The vision is bigger:

  • Zero-click OBS setup
  • Scene creation via voice
  • Layout & webcam control
  • Full recording workflows

The goal stays simple:

No clicks. Only intent.


🏁 Final Thoughts

This project reminded me why I love engineering.

Not because things work —

but because they don’t, and you make them.

If you’re building something ambitious and it feels impossible right now:

You’re probably doing it right.
