DEV Community

Art

Speech-To-Action - Turning Voice Into CLI Pipelines with Whisper.cpp

I recently built a small tool called Forge STA (Speech-To-Action).

The idea is simple:

Speech should not just become text.

Speech should become actions.

So instead of:

Speech → Text

we get:

Speech → Text → CLI → Anything


The Problem

Most speech tools stop at transcription.

You dictate something and it becomes text.

But developers often want something different:

  • turn speech into code comments
  • generate prompts
  • trigger scripts
  • pipe into tools
  • format into HTML / Markdown / XML

In short:

Speech should be part of a toolchain.


The Architecture

Forge STA uses a very simple design.

Mic
 ↓
Speech Recognition
 ↓
Post Processing (CLI)
 ↓
Output / Action

The interesting part is that post-processing is external.

No plugins.

No internal scripting.

Just CLI tools.

Example:

STA → python formatter.py
STA → bash script.sh
STA → custom binary

The tool receives text via stdin and returns processed output via stdout.

This keeps the core small and stable.
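On the host side, that contract is nothing more than "spawn a process, write stdin, read stdout". A minimal sketch in Python (the function name and the `tr` example are my own illustration, not STA's actual code):

```python
import subprocess

def run_processor(cmd, transcription):
    """Feed transcribed text to an external CLI processor and
    return its stdout (hypothetical sketch of the host side)."""
    result = subprocess.run(
        cmd,
        input=transcription,
        capture_output=True,
        text=True,
        check=True,  # a crashing processor raises here; the host survives
    )
    return result.stdout

# Example: a trivial "processor" that uppercases the transcription.
print(run_processor(["tr", "a-z", "A-Z"], "create a login page"))
```

Because the processor is just a child process, swapping `["tr", ...]` for `["python", "formatter.py"]` or any custom binary changes nothing in the host.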


Running Whisper on a Separate Machine

The latest improvement is even cooler.

Instead of running Whisper locally on the Mac, STA can now connect to a Whisper.cpp server.

So the setup becomes:

Mac (STA client)
   ↓
Audio
   ↓
Whisper.cpp server (powerful PC)
   ↓
Transcription
   ↓
Back to STA
   ↓
CLI pipeline

Benefits:

  • Mac stays lightweight
  • large Whisper models possible
  • better recognition quality
  • multiple devices can share one STT server
  • everything stays local

No cloud required.
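For reference, whisper.cpp ships a server example that exposes transcription over HTTP, so a setup like this can be tried with stock tools. A sketch, assuming the server runs on the PC at 192.168.1.50 (IP, model file, and audio filename are placeholders):

```shell
# On the powerful PC: start the whisper.cpp server example
./server -m models/ggml-large-v3.bin --host 0.0.0.0 --port 8080

# From the Mac: send recorded audio and get plain-text transcription back
curl http://192.168.1.50:8080/inference \
  -F file=@recording.wav \
  -F response_format=text
```

The client only ever sees an HTTP endpoint, so the Mac needs no Whisper model at all.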


Example Workflow

Speech:

create a comment explaining this function

CLI processor:

python comment_formatter.py  

Output:

// This function loads animation data and converts it into pose format.

Or:

Speech:

generate an HTML snippet for a login page

CLI processor:

python html_template.py  

Output:

<div class="login">
  ...
</div>
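A processor like the comment formatter in the first example only needs to read stdin and print a comment. A minimal sketch of what such a script could look like (my own illustration, not the actual comment_formatter.py):

```python
import sys
import textwrap

def to_comment(text, width=70):
    """Turn transcribed speech into //-style code comment lines."""
    sentence = text.strip()
    if not sentence:
        return ""
    # Capitalize and terminate like a written sentence.
    sentence = sentence[0].upper() + sentence[1:]
    if not sentence.endswith("."):
        sentence += "."
    return "\n".join("// " + line for line in textwrap.wrap(sentence, width))

if __name__ == "__main__":
    # STA contract: transcription on stdin, processed text on stdout.
    sys.stdout.write(to_comment(sys.stdin.read()) + "\n")
```

Any language works here, as long as the script honors the same stdin-to-stdout contract.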

Why CLI Instead of Plugins?

Plugins can break the host application.

CLI tools cannot.

If a processor crashes:

  • STA keeps running
  • the processor can be replaced
  • debugging is simple

This follows the Unix philosophy:

Do one thing well and connect tools together.


Why I Built This

I wanted a speech system that:

  • runs locally
  • integrates with developer workflows
  • is extensible
  • stays simple

STA is part of the Forge ecosystem, which explores alternative ways of building software without heavy web stacks.


What’s Next

Possible processors:

  • speech → prompt
  • speech → markdown
  • speech → code comment
  • speech → SML
  • speech → CLI commands

Fun Fact - I already used STA

I used STA to create the Codex CLI prompts that finished this very tool.
What I did:

Open Terminal → run Codex CLI → press CTRL-S to start speaking → press ENTER → Codex got my idea 1:1 → Codex started creating the software immediately. Even the ENTER keystroke was used to send the prompt.

I'm curious

How would you use a Speech → CLI pipeline in your workflow?
