DEV Community

Art

Speech-To-Action - Turning Voice Into CLI Pipelines with Whisper.cpp

I recently built a small tool called Forge STA (Speech-To-Action).

The idea is simple:

Speech should not just become text.

Speech should become actions.

So instead of:

Speech → Text

we get:

Speech → Text → CLI → Anything


The Problem

Most speech tools stop at transcription.

You dictate something and it becomes text.

But developers often want something different:

  • turn speech into code comments
  • generate prompts
  • trigger scripts
  • pipe into tools
  • format into HTML / Markdown / XML

In short:

Speech should be part of a toolchain.


The Architecture

Forge STA uses a very simple design.

Mic
 ↓
Speech Recognition
 ↓
Post Processing (CLI)
 ↓
Output / Action

The interesting part is that post-processing is external.

No plugins.

No internal scripting.

Just CLI tools.

Example:

STA → python formatter.py
STA → bash script.sh
STA → custom binary

The tool receives text via stdin and returns processed output via stdout.

This keeps the core small and stable.
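On the host side, that contract is nothing more than "spawn a process, write stdin, read stdout". A minimal sketch in Python (the function name and the `tr` example are my own illustration, not STA's actual code):

```python
import subprocess

def run_processor(cmd, transcription):
    """Feed transcribed text to an external CLI processor and
    return its stdout (hypothetical sketch of the host side)."""
    result = subprocess.run(
        cmd,
        input=transcription,
        capture_output=True,
        text=True,
        check=True,  # a crashing processor raises here; the host survives
    )
    return result.stdout

# Example: a trivial "processor" that uppercases the transcription.
print(run_processor(["tr", "a-z", "A-Z"], "create a login page"))
```

Because the processor is just a child process, swapping `["tr", ...]` for `["python", "formatter.py"]` or any custom binary changes nothing in the host.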


Running Whisper on a Separate Machine

The latest improvement is even cooler.

Instead of running Whisper locally on the Mac, STA can now connect to a Whisper.cpp server.

So the setup becomes:

Mac (STA client)
   ↓
Audio
   ↓
Whisper.cpp server (powerful PC)
   ↓
Transcription
   ↓
Back to STA
   ↓
CLI pipeline

Benefits:

  • Mac stays lightweight
  • large Whisper models possible
  • better recognition quality
  • multiple devices can share one STT server
  • everything stays local

No cloud required.
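For reference, whisper.cpp ships a server example that exposes transcription over HTTP, so a setup like this can be tried with stock tools. A sketch, assuming the server runs on the PC at 192.168.1.50 (IP, model file, and audio filename are placeholders):

```shell
# On the powerful PC: start the whisper.cpp server example
./server -m models/ggml-large-v3.bin --host 0.0.0.0 --port 8080

# From the Mac: send recorded audio and get plain-text transcription back
curl http://192.168.1.50:8080/inference \
  -F file=@recording.wav \
  -F response_format=text
```

The client only ever sees an HTTP endpoint, so the Mac needs no Whisper model at all.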


Example Workflow

Speech:

create a comment explaining this function

CLI processor:

python comment_formatter.py  

Output:

// This function loads animation data and converts it into pose format.

Or:

Speech:

generate an HTML snippet for a login page

CLI processor:

python html_template.py  

Output:

<div class="login">
  ...
</div>
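A processor like the comment formatter in the first example only needs to read stdin and print a comment. A minimal sketch of what such a script could look like (my own illustration, not the actual comment_formatter.py):

```python
import sys
import textwrap

def to_comment(text, width=70):
    """Turn transcribed speech into //-style code comment lines."""
    sentence = text.strip()
    if not sentence:
        return ""
    # Capitalize and terminate like a written sentence.
    sentence = sentence[0].upper() + sentence[1:]
    if not sentence.endswith("."):
        sentence += "."
    return "\n".join("// " + line for line in textwrap.wrap(sentence, width))

if __name__ == "__main__":
    # STA contract: transcription on stdin, processed text on stdout.
    sys.stdout.write(to_comment(sys.stdin.read()) + "\n")
```

Any language works here, as long as the script honors the same stdin-to-stdout contract.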

Why CLI Instead of Plugins?

Plugins can break the host application.

CLI tools cannot.

If a processor crashes:

  • STA keeps running
  • the processor can be replaced
  • debugging is simple

This follows the Unix philosophy:

Do one thing well and connect tools together.


Why I Built This

I wanted a speech system that:

  • runs locally
  • integrates with developer workflows
  • is extensible
  • stays simple

STA is part of the Forge ecosystem, which explores alternative ways of building software without heavy web stacks.


What’s Next

Possible processors:

  • speech → prompt
  • speech → markdown
  • speech → code comment
  • speech → SML
  • speech → CLI commands

Fun Fact - I already used STA

I used STA to create the Codex CLI prompts that finished this very tool.
What I did:

Open Terminal → run Codex CLI → press CTRL-S to start speaking → press ENTER → Codex got my idea 1:1 → Codex started creating the software immediately. Even the ENTER keystroke was used to send the prompt.

I'm curious

How would you use a Speech → CLI pipeline in your workflow?
