I work with audio a lot — music collections, audiobooks, voice messages, podcast recordings. Every task needed a different tool, a different installation headache, and half of them required FFmpeg as a dependency.
So I started building my own tools. One turned into two, two turned into six, and now it's a full ecosystem: audiotools.dev — open source audio tools in Go and Python, plus browser-based utilities that process audio without uploading anything.
Here's the technical story behind each one.
The Ecosystem
The project covers four main areas:
- Audio conversion — format conversion without FFmpeg (Go)
- Speech transcription — bulk voice-to-text with Whisper (Python)
- Music identification — batch Shazam recognition with auto-tagging (Python)
- Audiobook cleaning — AI noise/music removal with neural networks (Python)
Plus two browser-based tools on the website: an audio converter (WASM) and real-time voice-to-text (Web Speech API).
go-audio-converter: Why I Wrote a FLAC Encoder in Pure Go
GitHub — this is the one I'm most proud of technically.
The Problem
Every Go audio library out there depends on either FFmpeg or CGO. That means:
- Cross-compilation breaks
- Docker images bloat from 50MB to 500MB+
- WASM builds are impossible
- Alpine Linux needs extra packages
I wanted a single static binary that converts audio. No dependencies. No CGO. Download and run.
The Solution
Built a converter that handles WAV, MP3, FLAC, and OGG using only pure Go libraries. The hardest part? FLAC encoding — no pure Go FLAC encoder existed.
So I wrote one.
Writing a FLAC Encoder From Scratch
FLAC (Free Lossless Audio Codec) compression works in stages:
- Split audio into frames (typically 4096 samples)
- Predict each sample from previous samples using a mathematical model
- Calculate residuals — the difference between predicted and actual values
- Encode residuals using Rice coding (a form of entropy coding)
The key insight is that the prediction residuals are much smaller numbers than the raw samples, so they compress well.
I implemented FIXED prediction (orders 0-4), which uses polynomial prediction:
// Order 0: residual = sample (no prediction)
// Order 1: residual = sample - prev1
// Order 2: residual = sample - 2*prev1 + prev2
// Order 3: residual = sample - 3*prev1 + 3*prev2 - prev3
// Order 4: residual = sample - 4*prev1 + 6*prev2 - 4*prev3 + prev4
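To make the order search concrete, here's a small self-contained sketch. The helper names are mine, not the library's, and I use the sum of absolute residuals as a cheap stand-in for the real encoded size:

```go
package main

import "fmt"

// fixedResidual computes the order-N fixed-prediction residual for sample i,
// using the binomial coefficients from the comment block above.
func fixedResidual(order int, s []int64, i int) int64 {
	switch order {
	case 0:
		return s[i]
	case 1:
		return s[i] - s[i-1]
	case 2:
		return s[i] - 2*s[i-1] + s[i-2]
	case 3:
		return s[i] - 3*s[i-1] + 3*s[i-2] - s[i-3]
	default: // order 4
		return s[i] - 4*s[i-1] + 6*s[i-2] - 4*s[i-3] + s[i-4]
	}
}

// bestOrder returns the order (0-4) whose residuals have the smallest
// total magnitude over the frame, a rough proxy for encoded size.
func bestOrder(s []int64) int {
	best, bestCost := 0, int64(-1)
	for order := 0; order <= 4; order++ {
		var cost int64
		for i := order; i < len(s); i++ {
			r := fixedResidual(order, s, i)
			if r < 0 {
				r = -r
			}
			cost += r
		}
		if bestCost < 0 || cost < bestCost {
			best, bestCost = order, cost
		}
	}
	return best
}

func main() {
	// A linear ramp: order-1 leaves constant residuals of 10,
	// order-2 collapses them to zero.
	ramp := []int64{100, 110, 120, 130, 140, 150, 160, 170}
	fmt.Println("best order for ramp:", bestOrder(ramp)) // prints: best order for ramp: 2
}
```

A real encoder also stores the first `order` warm-up samples verbatim and compares actual bit counts rather than residual magnitudes, but the shape of the search is the same.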
The encoder tries all five orders and picks the one that produces the smallest residuals for each frame. Then it encodes them using Rice coding, where a parameter k splits each value into a unary-coded quotient and a k-bit binary remainder:
// riceEncode writes one residual with Rice parameter k.
// (bw is the encoder's bit-level output writer, shown for illustration.)
func riceEncode(bw *bitWriter, residual int64, k uint) {
	// Map signed to unsigned (zig-zag): 0,-1,1,-2,2,... -> 0,1,2,3,4,...
	unsigned := uint64((residual << 1) ^ (residual >> 63))
	quotient := unsigned >> k
	remainder := unsigned & (1<<k - 1)
	// Write quotient as unary (quotient ones, then a terminating zero)
	for i := uint64(0); i < quotient; i++ {
		bw.writeBit(1)
	}
	bw.writeBit(0)
	// Write remainder as k raw bits
	bw.writeBits(remainder, k)
}
The Rice parameter k is chosen per partition by minimizing the total encoded bit length.
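That search falls straight out of the bit-length formula: each residual costs its unary quotient, one terminating zero, and k remainder bits. A minimal sketch (function names are mine, not the repo's):

```go
package main

import "fmt"

// zigZag maps a signed residual to an unsigned value:
// 0,-1,1,-2,2,... -> 0,1,2,3,4,...
func zigZag(n int64) uint64 {
	return uint64((n << 1) ^ (n >> 63))
}

// riceBits returns the encoded length of one residual for parameter k:
// unary quotient + terminating zero + k remainder bits.
func riceBits(residual int64, k uint) uint64 {
	u := zigZag(residual)
	return (u >> k) + 1 + uint64(k)
}

// bestRiceParam tries every candidate k and returns the one that
// minimizes the total bit count for the partition.
func bestRiceParam(residuals []int64, maxK uint) (best uint, bits uint64) {
	for k := uint(0); k <= maxK; k++ {
		var total uint64
		for _, r := range residuals {
			total += riceBits(r, k)
		}
		if k == 0 || total < bits {
			best, bits = k, total
		}
	}
	return best, bits
}

func main() {
	residuals := []int64{3, -2, 5, 0, -7, 4, 1, -1}
	k, bits := bestRiceParam(residuals, 14)
	fmt.Printf("k=%d, %d bits total\n", k, bits) // prints: k=2, 32 bits total
}
```

Small k keeps remainders cheap but makes the unary quotients of large residuals explode; large k pays k bits for every value regardless. The sweep just finds the balance point for each partition.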
Compression results:
| Content | Ratio |
|---|---|
| Silence | 95%+ |
| Sine wave | 50-70% |
| Music | 30-50% |
| White noise | ~0% |
It doesn't compress as tightly as libFLAC (which adds LPC prediction on top of the fixed predictors), but it's the first pure Go FLAC encoder and it works everywhere — including WASM in the browser.
The WASM Angle
Because everything is pure Go, I compiled the converter to WebAssembly and embedded it in the audiotools.dev/converter page. You drop an audio file in the browser, it converts locally using the Go code compiled to WASM. No upload, no server.
GOOS=js GOARCH=wasm go build -o converter.wasm ./cmd/wasm
The WASM binary is ~4MB — larger than I'd like, but acceptable for a tool that replaces a server-side FFmpeg pipeline.
voice-to-text: Bulk Transcription Pipeline
GitHub — Python tool for transcribing voice messages in bulk.
The Problem I Was Solving
I had hundreds of Telegram voice messages that I needed as text. Manual transcription was out of the question. Existing tools handle one file at a time.
Multiple Backends
The tool supports four transcription backends:
- Whisper (local) — OpenAI's model running locally. Best quality, needs GPU.
- faster-whisper — CTranslate2-optimized Whisper. 4x faster, same quality.
- OpenAI API — cloud Whisper. No GPU needed, costs money.
- Groq API — fastest cloud option. Free tier available.
It also includes a Telegram bot that transcribes voice messages in real-time — send a voice note, get text back.
Browser Version
The website has a voice-to-text page that uses the Web Speech API for real-time transcription in 60+ languages. No AI model downloads, no server-side processing on my end — the browser handles recognition natively.
The Web Speech API is underrated. It's built into Chrome, Edge, and Safari, works in real-time, and supports dozens of languages. The catch? It requires an internet connection (audio goes to Google's/Apple's servers for recognition). But for quick transcription, it's instant and free.
music_recognition: Batch Shazam
GitHub — identify unknown music files using Shazam's algorithm.
The Backstory
I had a folder with 2,000+ MP3 files from the early 2000s. Filenames like Track_001.mp3, Unknown Artist - Unknown Track.mp3, (3).mp3. No metadata, no tags.
This tool scans a directory, identifies each track via Shazam, and automatically:
- Writes ID3 tags (artist, title, album, year)
- Renames files to Artist - Title.mp3
- Organizes files into an Artist/Album/ folder structure
It uses the ShazamIO library, which reverse-engineered Shazam's recognition API. Rate limiting and retries are built in — Shazam will throttle you if you hit it too fast.
Audiobook-Cleaner: AI Noise Removal
GitHub — clean audiobooks from background music, noise, and echo using neural networks.
How It Works
The tool is a wrapper over audio-separator, which implements the same AI models used by Ultimate Vocal Remover (UVR5):
- MDX-Net — best for music/voice separation
- VR Architecture — good for noise removal
- Roformer — state-of-the-art transformer model
The key problem I solved: large file handling. These models expect short audio segments (typically 30-60 seconds). A 10-hour audiobook crashes any of them. My wrapper automatically chunks files, processes each segment, and stitches results back together — handling crossfades at chunk boundaries to avoid clicks.
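The chunker itself is Python, but the stitch step is simple enough to sketch in Go (matching the converter's language). A linear crossfade over the overlap region, purely illustrative:

```go
package main

import "fmt"

// stitch joins two chunks that share `overlap` samples, linearly
// crossfading across the overlap so there's no click at the seam.
func stitch(a, b []float64, overlap int) []float64 {
	out := make([]float64, 0, len(a)+len(b)-overlap)
	out = append(out, a[:len(a)-overlap]...)
	for i := 0; i < overlap; i++ {
		t := float64(i) / float64(overlap) // fade weight: 0 -> 1
		out = append(out, a[len(a)-overlap+i]*(1-t)+b[i]*t)
	}
	out = append(out, b[overlap:]...)
	return out
}

func main() {
	a := []float64{1, 1, 1, 1} // tail of chunk one
	b := []float64{0, 0, 0, 0} // head of chunk two
	fmt.Println(stitch(a, b, 2)) // prints: [1 1 1 0.5 0 0]
}
```

In practice the chunker also cuts at silence where possible and uses an overlap long enough (a second or so) that the model's edge artifacts fall inside the faded region.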
Works on both CPU and GPU (NVIDIA CUDA). CPU processing is viable for a single audiobook; for batch processing, GPU is 10-20x faster.
What I Learned Building This
Go for CLI tools, Python for ML pipelines
This split worked perfectly. Go gives you:
- Single binary distribution (no runtime, no pip install)
- Easy cross-compilation
- WASM compilation for browser deployment
Python gives you:
- Access to ML models and the entire AI ecosystem
- Rapid prototyping with libraries like faster-whisper, audio-separator
- Jupyter notebook experimentation
WASM is viable for real tools
The audio converter running in the browser via WASM handles real-world files. It's not a toy demo. Users drop 50MB FLAC files and get MP3s back without any upload. The 4MB WASM binary is a fair trade for zero server costs.
Privacy is a feature
Every browser-based tool on audiotools.dev processes data locally. This isn't just a philosophical choice — it eliminates server costs, simplifies architecture, and genuinely matters to users processing personal audio (voice messages, private recordings).
Try It
- Website: audiotools.dev — browser-based converter and voice-to-text
- GitHub: github.com/formeo — all projects, MIT licensed
- Install go-audio-converter:
go install github.com/formeo/go-audio-converter/cmd/audioconv@latest
Everything is open source under MIT. Contributions, issues, and stars welcome.
I'd love to hear: what audio processing tasks do you struggle with? What would you add to the ecosystem?
Built with Go and Python. The browser tools use WASM and Web Speech API. No FFmpeg was harmed in the making of this ecosystem.