This post is about debugging pain, systems thinking, and the moment everything finally worked.
🎯 What I Wanted to Build
I wanted to remove clicks from OBS.
Not automate one button.
Not trigger a hotkey.
I wanted to talk to OBS.
Say things like:
- “Metaltank mute mic”
- “Metaltank switch scene”
- “Metaltank start recording”
…and have OBS respond instantly.
No Stream Deck.
No keyboard shortcuts.
No mouse.
Just voice → intent → OBS WebSocket → action.
That project is called Metaltank.
🧠 What I Actually Built (So Far)
Metaltank is a Node.js-based voice controller for OBS that:
- Connects to OBS using obs-websocket
- Captures microphone audio using arecord
- Streams short audio chunks to whisper.cpp (local, offline)
- Converts speech → text
- Parses intent using a custom rule engine
- Executes OBS actions (mute, unmute, toggle mic, scenes, recording)
All offline.
No cloud APIs.
No paid services.
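The whole flow fits in one function per pass. A minimal sketch of a single cycle — the four helpers are stand-ins for Metaltank's real modules, and their names are illustrative, not the actual API:

```javascript
// One pass of the pipeline: record → transcribe → parse → act.
// recordChunk / transcribe / parseIntent / runObsAction are injected
// stand-ins (illustrative names); Metaltank runs this in a continuous loop.
async function handleChunk({ recordChunk, transcribe, parseIntent, runObsAction }) {
  const wavPath = await recordChunk(3);     // arecord → 3-second WAV file
  const text = await transcribe(wavPath);   // POST WAV to whisper-server
  const intent = parseIntent(text);         // wake word + rule engine
  if (intent) await runObsAction(intent);   // obs-websocket request
  return intent;                            // null if nothing matched
}
```

Each stage only hands a finished artifact (a file, a string, an intent name) to the next — that separation is what made the later debugging possible.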
⚙️ Tech Stack
- Node.js (ESM)
- OBS WebSocket
- whisper.cpp (server mode)
- arecord (ALSA)
- Custom rule-based intent parser
Simple stack.
Hard execution.
😤 Why This Was Way Harder Than It Sounds
Let me be honest — nothing worked the first time.
1️⃣ Native Modules Failed (Vosk)
I initially tried vosk.
It failed because:
- Native compilation
- Missing build tools
- Node-gyp issues
- Environment limitations
Lesson:
“Offline” doesn’t always mean “easy.”
2️⃣ OBS Was “Connected” But Not Ready
I kept hitting errors like:
`Error: Socket not identified`
`Error: Not connected`
Root cause:
- OBS WebSocket connect ≠ identify
- Actions were being called before OBS completed its handshake
Fix:
- Explicit OBS-ready state
- No actions allowed before identification
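The fix boils down to a small gate that every OBS call must pass through. Here is the pattern in isolation — with obs-websocket-js v5 you would flip the flag from the `Identified` event (or after `connect()` resolves); the class and method names below are mine, not Metaltank's:

```javascript
// Readiness gate: refuse OBS actions until the WebSocket handshake
// (Hello → Identify → Identified) has actually completed.
class ObsGate {
  #ready = false;

  markReady() { this.#ready = true; }          // call on 'Identified'
  markDisconnected() { this.#ready = false; }  // call on 'ConnectionClosed'

  run(name, action) {
    if (!this.#ready) {
      throw new Error(`Refusing "${name}": OBS socket not identified yet`);
    }
    return action();
  }
}
```

Routing every action through one gate means "connected but not identified" becomes a loud, named error instead of a mystery failure deep in a request.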
3️⃣ Voice Was Triggering Commands Without Me Speaking
At one point, Metaltank muted my mic without me saying anything.
Why?
- Simulated voice input was still wired in
- Voice module executing too early
- No wake-word guard
Fix:
- Strict wake word: metaltank
- OBS readiness gate
- Clear separation between CLI and voice input
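The wake-word guard plus rule engine is only a few lines. A simplified sketch — the rules here are illustrative examples, not Metaltank's real rule set:

```javascript
// Rule-based intent parser: transcript text → intent name, or null.
// Anything that doesn't start with the wake word is ignored entirely,
// which is what stops stray audio from triggering commands.
const WAKE_WORD = "metaltank";

const RULES = [
  { pattern: /\bmute\b.*\bmic\b/, intent: "MUTE_MIC" },
  { pattern: /\bunmute\b.*\bmic\b/, intent: "UNMUTE_MIC" },
  { pattern: /\bswitch\b.*\bscene\b/, intent: "SWITCH_SCENE" },
  { pattern: /\bstart\b.*\brecording\b/, intent: "START_RECORDING" },
];

function parseIntent(transcript) {
  const text = transcript.toLowerCase().trim();
  if (!text.startsWith(WAKE_WORD)) return null;   // strict wake-word guard
  const command = text.slice(WAKE_WORD.length).trim();
  for (const rule of RULES) {
    if (rule.pattern.test(command)) return rule.intent;
  }
  return null; // heard speech, but no matching rule
}
```

Note the word boundaries: `\bmute\b` does not match inside "unmute", so "metaltank unmute mic" can't accidentally fire the mute rule.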
4️⃣ whisper.cpp Flags Betrayed Me
I tried flags like:
`--step`
`--length`
They don’t exist in whisper-cli.
Fix:
- Stop guessing flags
- Read `--help`
- Switch to whisper-server
- POST WAV files properly
Lesson:
Always check the CLI help. Even when you’re confident.
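Once you're on whisper-server, transcription is just a multipart POST. A sketch assuming the whisper.cpp server example's `/inference` endpoint and `file` form field — check the flags and field names against your build, as I learned the hard way:

```javascript
import { readFile } from "node:fs/promises";

// POST a finished WAV file to a locally running whisper-server and
// pull the transcript out of the JSON reply. Endpoint and field names
// follow the whisper.cpp server example; verify them on your build.
async function transcribe(wavPath, serverUrl = "http://127.0.0.1:8080/inference") {
  const form = new FormData();
  form.append("file", new Blob([await readFile(wavPath)]), "chunk.wav");
  form.append("response_format", "json");
  const res = await fetch(serverUrl, { method: "POST", body: form });
  if (!res.ok) throw new Error(`whisper-server returned ${res.status}`);
  return extractText(await res.json());
}

// Kept separate so the JSON handling is easy to test without a server.
function extractText(json) {
  return (json.text ?? "").trim();
}
```

Node 18+ ships `fetch`, `FormData`, and `Blob` as globals, so no HTTP client dependency is needed.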
5️⃣ Audio Was Recording… But Whisper Heard Nothing
This was the hardest part.
- WAV files existed
- Audio played correctly
- Whisper returned `[BLANK_AUDIO]`
Root causes:
- Chunk timing too short
- Chunks dominated by silence
- Wrong assumptions about streaming
Fix:
- Fixed-length chunks (3 seconds)
- File-based inference
- Let whisper finish before deleting audio
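The chunking fix in code: let arecord itself cap each recording at a fixed duration, so every file handed to whisper is complete before inference starts. The helper names are mine; the arecord flags are standard ALSA options:

```javascript
import { execFile } from "node:child_process";
import { promisify } from "node:util";
const run = promisify(execFile);

// arecord arguments for one fixed-length chunk: 16 kHz mono 16-bit WAV,
// which is the format whisper.cpp expects. `-d` caps the duration, so
// every chunk is exactly `seconds` long — no guessing when speech ended.
function arecordArgs(seconds, outPath) {
  return ["-f", "S16_LE", "-r", "16000", "-c", "1", "-t", "wav",
          "-d", String(seconds), outPath];
}

async function recordChunk(seconds, outPath) {
  await run("arecord", arecordArgs(seconds, outPath));
  return outPath; // hand the finished file to whisper, delete only after
}
```

Because `execFile` resolves only when arecord exits, the WAV header is finalized before whisper ever sees the file — which is exactly what the streaming approach got wrong.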
🔥 The Moment It Worked
🗣️ Heard: mute mic
[VOICE] MUTE_MIC
🎙 Mic muted
I didn’t celebrate loudly.
I just smiled.
Because this wasn’t luck —
it was layers finally aligning.
🧩 Current Metaltank Capabilities
- 🎙 Mute / unmute / toggle mic
- 🎬 Scene control
- ⏺ Recording control
- 🧠 Continuous listening
- 🔒 Fully offline
OBS reacts to my voice.
🧠 What I Learned
- “Connected” doesn’t mean “ready”
- Audio pipelines fail silently
- Logging saves hours
- If you’re confused, the system is confused too
Biggest lesson:
Complex systems don’t fail loudly — they fail quietly.
🚧 Still Phase 1
This is still Phase 1.
The vision is bigger:
- Zero-click OBS setup
- Scene creation via voice
- Layout & webcam control
- Full recording workflows
The goal stays simple:
No clicks. Only intent.
🏁 Final Thoughts
This project reminded me why I love engineering.
Not because things work,
but because they don't, and you make them work.
If you’re building something ambitious and it feels impossible right now:
You’re probably doing it right.