I work with audio a lot — music collections, audiobooks, voice messages, podcast recordings. Every task needed a different tool, a different installation headache, and half of them required FFmpeg as a dependency.
So I started building my own tools. One turned into two, two turned into six, and now it's a full ecosystem: audiotools.dev — open source audio tools in Go and Python, plus browser-based utilities that process audio without uploading anything.
Here's the technical story behind each one.
The Ecosystem
The project covers four main areas:
- Audio conversion — format conversion without FFmpeg (Go)
- Speech transcription — bulk voice-to-text with Whisper (Python)
- Music identification — batch Shazam recognition with auto-tagging (Python)
- Audiobook cleaning — AI noise/music removal with neural networks (Python)
Plus two browser-based tools on the website: an audio converter (WASM) and real-time voice-to-text (Web Speech API).
go-audio-converter: Why I Wrote a FLAC Encoder in Pure Go
GitHub — this is the one I'm most proud of technically.
The Problem
Every Go audio library out there depends on either FFmpeg or CGO. That means:
- Cross-compilation breaks
- Docker images bloat from 50MB to 500MB+
- WASM builds are impossible
- Alpine Linux needs extra packages
I wanted a single static binary that converts audio. No dependencies. No CGO. Download and run.
The Solution
Built a converter that handles WAV, MP3, FLAC, and OGG using only pure Go libraries. The hardest part? FLAC encoding — no pure Go FLAC encoder existed.
So I wrote one.
Writing a FLAC Encoder From Scratch
FLAC (Free Lossless Audio Codec) compression works in stages:
- Split audio into frames (typically 4096 samples)
- Predict each sample from previous samples using a mathematical model
- Calculate residuals — the difference between predicted and actual values
- Encode residuals using Rice coding (a form of entropy coding)
The key insight is that the prediction residuals are much smaller numbers than the raw samples, so they compress well.
I implemented FIXED prediction (orders 0-4), which uses polynomial prediction:
// Order 0: residual = sample (no prediction)
// Order 1: residual = sample - prev1
// Order 2: residual = sample - 2*prev1 + prev2
// Order 3: residual = sample - 3*prev1 + 3*prev2 - prev3
// Order 4: residual = sample - 4*prev1 + 6*prev2 - 4*prev3 + prev4
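To make the order search concrete, here's a small self-contained sketch. The helper names are mine, not the library's, and I use the sum of absolute residuals as a cheap stand-in for the real encoded size:

```go
package main

import "fmt"

// fixedResidual computes the order-N fixed-prediction residual for sample i,
// using the binomial coefficients from the comment block above.
func fixedResidual(order int, s []int64, i int) int64 {
	switch order {
	case 0:
		return s[i]
	case 1:
		return s[i] - s[i-1]
	case 2:
		return s[i] - 2*s[i-1] + s[i-2]
	case 3:
		return s[i] - 3*s[i-1] + 3*s[i-2] - s[i-3]
	default: // order 4
		return s[i] - 4*s[i-1] + 6*s[i-2] - 4*s[i-3] + s[i-4]
	}
}

// bestOrder returns the order (0-4) whose residuals have the smallest
// total magnitude over the frame, a rough proxy for encoded size.
func bestOrder(s []int64) int {
	best, bestCost := 0, int64(-1)
	for order := 0; order <= 4; order++ {
		var cost int64
		for i := order; i < len(s); i++ {
			r := fixedResidual(order, s, i)
			if r < 0 {
				r = -r
			}
			cost += r
		}
		if bestCost < 0 || cost < bestCost {
			best, bestCost = order, cost
		}
	}
	return best
}

func main() {
	// A linear ramp: order-1 leaves constant residuals of 10,
	// order-2 collapses them to zero.
	ramp := []int64{100, 110, 120, 130, 140, 150, 160, 170}
	fmt.Println("best order for ramp:", bestOrder(ramp)) // prints: best order for ramp: 2
}
```

A real encoder also stores the first `order` warm-up samples verbatim and compares actual bit counts rather than residual magnitudes, but the shape of the search is the same.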
The encoder tries all five orders and picks the one that produces the smallest residuals for each frame. Then it encodes them using Rice coding, where a parameter k splits each value into a unary-coded quotient and a k-bit binary remainder:
// riceEncode writes one residual with Rice parameter k.
// (bw is the encoder's bit-level output writer, shown for illustration.)
func riceEncode(bw *bitWriter, residual int64, k uint) {
	// Map signed to unsigned (zig-zag): 0,-1,1,-2,2,... -> 0,1,2,3,4,...
	unsigned := uint64((residual << 1) ^ (residual >> 63))
	quotient := unsigned >> k
	remainder := unsigned & (1<<k - 1)
	// Write quotient as unary (quotient ones, then a terminating zero)
	for i := uint64(0); i < quotient; i++ {
		bw.writeBit(1)
	}
	bw.writeBit(0)
	// Write remainder as k raw bits
	bw.writeBits(remainder, k)
}
The Rice parameter k is chosen per partition by minimizing the total encoded bit length.
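That search falls straight out of the bit-length formula: each residual costs its unary quotient, one terminating zero, and k remainder bits. A minimal sketch (function names are mine, not the repo's):

```go
package main

import "fmt"

// zigZag maps a signed residual to an unsigned value:
// 0,-1,1,-2,2,... -> 0,1,2,3,4,...
func zigZag(n int64) uint64 {
	return uint64((n << 1) ^ (n >> 63))
}

// riceBits returns the encoded length of one residual for parameter k:
// unary quotient + terminating zero + k remainder bits.
func riceBits(residual int64, k uint) uint64 {
	u := zigZag(residual)
	return (u >> k) + 1 + uint64(k)
}

// bestRiceParam tries every candidate k and returns the one that
// minimizes the total bit count for the partition.
func bestRiceParam(residuals []int64, maxK uint) (best uint, bits uint64) {
	for k := uint(0); k <= maxK; k++ {
		var total uint64
		for _, r := range residuals {
			total += riceBits(r, k)
		}
		if k == 0 || total < bits {
			best, bits = k, total
		}
	}
	return best, bits
}

func main() {
	residuals := []int64{3, -2, 5, 0, -7, 4, 1, -1}
	k, bits := bestRiceParam(residuals, 14)
	fmt.Printf("k=%d, %d bits total\n", k, bits) // prints: k=2, 32 bits total
}
```

Small k keeps remainders cheap but makes the unary quotients of large residuals explode; large k pays k bits for every value regardless. The sweep just finds the balance point for each partition.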
Compression results:
| Content | Ratio |
|---|---|
| Silence | 95%+ |
| Sine wave | 50-70% |
| Music | 30-50% |
| White noise | ~0% |
It doesn't compress as tightly as libFLAC (which adds LPC prediction on top of the fixed predictors), but it's the first pure Go FLAC encoder and it works everywhere — including WASM in the browser.
The WASM Angle
Because everything is pure Go, I compiled the converter to WebAssembly and embedded it in the audiotools.dev/converter page. You drop an audio file in the browser, it converts locally using the Go code compiled to WASM. No upload, no server.
GOOS=js GOARCH=wasm go build -o converter.wasm ./cmd/wasm
The WASM binary is ~4MB — larger than I'd like, but acceptable for a tool that replaces a server-side FFmpeg pipeline.
voice-to-text: Bulk Transcription Pipeline
GitHub — Python tool for transcribing voice messages in bulk.
The Problem I Was Solving
I had hundreds of Telegram voice messages that I needed as text. Manual transcription was out of the question. Existing tools handle one file at a time.
Multiple Backends
The tool supports four transcription backends:
- Whisper (local) — OpenAI's model running locally. Best quality, needs GPU.
- faster-whisper — CTranslate2-optimized Whisper. 4x faster, same quality.
- OpenAI API — cloud Whisper. No GPU needed, costs money.
- Groq API — fastest cloud option. Free tier available.
It also includes a Telegram bot that transcribes voice messages in real-time — send a voice note, get text back.
Browser Version
The website has a voice-to-text page that uses the Web Speech API for real-time transcription in 60+ languages. No AI model downloads, no server-side processing on my end — the browser handles recognition natively.
The Web Speech API is underrated. It's built into Chrome, Edge, and Safari, works in real-time, and supports dozens of languages. The catch? It requires an internet connection (audio goes to Google's/Apple's servers for recognition). But for quick transcription, it's instant and free.
music_recognition: Batch Shazam
GitHub — identify unknown music files using Shazam's algorithm.
The Backstory
I had a folder with 2,000+ MP3 files from the early 2000s. Filenames like Track_001.mp3, Unknown Artist - Unknown Track.mp3, (3).mp3. No metadata, no tags.
This tool scans a directory, identifies each track via Shazam, and automatically:
- Writes ID3 tags (artist, title, album, year)
- Renames files to Artist - Title.mp3
- Organizes files into an Artist/Album/ folder structure
It uses the ShazamIO library, which reverse-engineered Shazam's recognition API. Rate limiting and retries are built in — Shazam will throttle you if you hit it too fast.
Audiobook-Cleaner: AI Noise Removal
GitHub — clean audiobooks from background music, noise, and echo using neural networks.
How It Works
The tool is a wrapper over audio-separator, which implements the same AI models used by Ultimate Vocal Remover (UVR5):
- MDX-Net — best for music/voice separation
- VR Architecture — good for noise removal
- Roformer — state-of-the-art transformer model
The key problem I solved: large file handling. These models expect short audio segments (typically 30-60 seconds). A 10-hour audiobook crashes any of them. My wrapper automatically chunks files, processes each segment, and stitches results back together — handling crossfades at chunk boundaries to avoid clicks.
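The chunker itself is Python, but the stitch step is simple enough to sketch in Go (matching the converter's language). A linear crossfade over the overlap region, purely illustrative:

```go
package main

import "fmt"

// stitch joins two chunks that share `overlap` samples, linearly
// crossfading across the overlap so there's no click at the seam.
func stitch(a, b []float64, overlap int) []float64 {
	out := make([]float64, 0, len(a)+len(b)-overlap)
	out = append(out, a[:len(a)-overlap]...)
	for i := 0; i < overlap; i++ {
		t := float64(i) / float64(overlap) // fade weight: 0 -> 1
		out = append(out, a[len(a)-overlap+i]*(1-t)+b[i]*t)
	}
	out = append(out, b[overlap:]...)
	return out
}

func main() {
	a := []float64{1, 1, 1, 1} // tail of chunk one
	b := []float64{0, 0, 0, 0} // head of chunk two
	fmt.Println(stitch(a, b, 2)) // prints: [1 1 1 0.5 0 0]
}
```

In practice the chunker also cuts at silence where possible and uses an overlap long enough (a second or so) that the model's edge artifacts fall inside the faded region.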
Works on both CPU and GPU (NVIDIA CUDA). CPU processing is viable for a single audiobook; for batch processing, GPU is 10-20x faster.
What I Learned Building This
Go for CLI tools, Python for ML pipelines
This split worked perfectly. Go gives you:
- Single binary distribution (no runtime, no pip install)
- Easy cross-compilation
- WASM compilation for browser deployment
Python gives you:
- Access to ML models and the entire AI ecosystem
- Rapid prototyping with libraries like faster-whisper, audio-separator
- Jupyter notebook experimentation
WASM is viable for real tools
The audio converter running in the browser via WASM handles real-world files. It's not a toy demo. Users drop 50MB FLAC files and get MP3s back without any upload. The 4MB WASM binary is a fair trade for zero server costs.
Privacy is a feature
Every browser-based tool on audiotools.dev processes data locally. This isn't just a philosophical choice — it eliminates server costs, simplifies architecture, and genuinely matters to users processing personal audio (voice messages, private recordings).
Try It
- Website: audiotools.dev — browser-based converter and voice-to-text
- GitHub: github.com/formeo — all projects, MIT licensed
- Install go-audio-converter:
go install github.com/formeo/go-audio-converter/cmd/audioconv@latest
Everything is open source under MIT. Contributions, issues, and stars welcome.
I'd love to hear: what audio processing tasks do you struggle with? What would you add to the ecosystem?
Built with Go and Python. The browser tools use WASM and Web Speech API. No FFmpeg was harmed in the making of this ecosystem.