Louis Beaumont
how i built local-first audio transcription: a privacy-first voice processing pipeline

i built a local-first audio processing pipeline in rust that captures, processes, and transcribes audio while respecting privacy. here's how it works:
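before diving into the sections below, here's a minimal std-only rust sketch of the two most central steps, downmixing to mono and rms normalization. this is not the actual screenpipe code; `mix_to_mono` and `normalize_rms` are illustrative names:

```rust
// mixes interleaved multi-channel f32 samples down to mono by averaging
// each frame, then scales the signal toward a target rms level.

fn mix_to_mono(samples: &[f32], channels: usize) -> Vec<f32> {
    samples
        .chunks(channels)
        .map(|frame| frame.iter().sum::<f32>() / channels as f32)
        .collect()
}

fn normalize_rms(samples: &mut [f32], target_rms: f32) {
    let rms = (samples.iter().map(|s| s * s).sum::<f32>() / samples.len() as f32).sqrt();
    if rms > 0.0 {
        let gain = target_rms / rms;
        for s in samples.iter_mut() {
            // clamp so normalization never pushes samples out of range
            *s = (*s * gain).clamp(-1.0, 1.0);
        }
    }
}

fn main() {
    // one stereo stream: interleaved left/right samples
    let stereo = vec![0.8f32, 0.4, -0.2, -0.6];
    let mut mono = mix_to_mono(&stereo, 2);
    normalize_rms(&mut mono, 0.1);
    println!("{:?}", mono);
}
```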

🎤 audio capture & device management

  • supports both input (microphones) and output devices (system audio)
  • handles multi-channel audio devices through smart channel mixing
  • implements device hot-plugging and graceful error handling
  • uses tokio channels for efficient async communication

🔊 audio processing pipeline

  1. channel conversion
     - converts multi-channel audio to mono using weighted averaging
     - handles various sample formats (f32, i16, i32, i8)
     - implements real-time resampling to 16khz for whisper compatibility
  2. signal processing
     - normalizes audio using RMS and peak normalization
     - implements spectral subtraction for noise reduction
     - uses realfft for efficient fourier transforms
     - maintains audio quality while reducing background noise
  3. voice activity detection (vad)
     - dual vad engine support: webrtc (lightweight) and silero (more accurate)
     - configurable sensitivity levels (low/medium/high)
     - uses sliding window analysis for robust speech detection
     - implements frame history for better context awareness

🤖 transcription engine
  • primary: whisper (tiny/large-v3/large-v3-turbo)
  • fallback: deepgram api integration
  • smart overlap handling:
```rust
// handles cases where audio chunks might cut sentences
if let Some((prev_idx, cur_idx)) = longest_common_word_substring(previous, current) {
    // strip overlapping content and merge transcripts
}
```
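for context, here's a hedged std-only sketch of what an overlap-merge helper could look like. the real `longest_common_word_substring` may differ; this version also returns the run length so the merge is easy, and it assumes the overlap sits at the start of the current chunk:

```rust
// finds the longest run of consecutive words shared by the two transcripts,
// returning (prev_idx, cur_idx, len), then drops the duplicated words from
// the current chunk before concatenating.

fn longest_common_word_substring(previous: &str, current: &str) -> Option<(usize, usize, usize)> {
    let prev: Vec<&str> = previous.split_whitespace().collect();
    let cur: Vec<&str> = current.split_whitespace().collect();
    let mut best: Option<(usize, usize, usize)> = None;
    for i in 0..prev.len() {
        for j in 0..cur.len() {
            let mut len = 0;
            while i + len < prev.len() && j + len < cur.len() && prev[i + len] == cur[j + len] {
                len += 1;
            }
            if len > 0 && best.map_or(true, |(_, _, l)| len > l) {
                best = Some((i, j, len));
            }
        }
    }
    best
}

fn merge_transcripts(previous: &str, current: &str) -> String {
    match longest_common_word_substring(previous, current) {
        Some((_, cur_idx, len)) => {
            // keep only the words of `current` that come after the overlap
            let rest: Vec<&str> = current.split_whitespace().skip(cur_idx + len).collect();
            format!("{} {}", previous, rest.join(" ")).trim().to_string()
        }
        None => format!("{} {}", previous, current).trim().to_string(),
    }
}

fn main() {
    let merged = merge_transcripts("the quick brown fox", "brown fox jumps over");
    println!("{}", merged); // prints "the quick brown fox jumps over"
}
```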

💾 storage & optimization

  • uses h265 encoding for efficient audio storage
  • implements a local sqlite database for metadata
  • stores raw audio chunks with timestamps
  • maintains reference to original audio for verification
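to illustrate the metadata side, here's a hypothetical shape for a timestamped chunk record and its sqlite table. the actual screenpipe schema likely differs; every name here is illustrative:

```rust
// one row of metadata per stored audio chunk, pointing back at the
// original encoded file for verification.

struct AudioChunkMeta {
    id: i64,
    file_path: String, // hypothetical reference to the encoded audio chunk
    start_ms: u64,     // capture timestamp
    duration_ms: u64,
    transcript: String,
}

// a sketch of the corresponding sqlite table
const SCHEMA: &str = "
CREATE TABLE IF NOT EXISTS audio_chunks (
    id INTEGER PRIMARY KEY,
    file_path TEXT NOT NULL,
    start_ms INTEGER NOT NULL,
    duration_ms INTEGER NOT NULL,
    transcript TEXT NOT NULL
);";

fn main() {
    let chunk = AudioChunkMeta {
        id: 1,
        file_path: "chunks/example.mp4".into(), // illustrative path
        start_ms: 0,
        duration_ms: 30_000,
        transcript: String::new(),
    };
    println!("{} -> {}ms", chunk.file_path, chunk.duration_ms);
}
```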

🔒 privacy features

  • completely local processing by default

  • optional pii removal

  • configurable data retention policies

  • no cloud dependencies unless explicitly enabled

🧠 experimental features

  • context-aware post-processing using llama-3.2–1b

  • speaker diarization using voice embeddings

  • local vector db for speaker identification

  • adaptive noise profiling
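the speaker identification idea can be sketched with plain cosine similarity over voice embeddings. the real pipeline uses a local vector db; `identify_speaker` and the threshold are illustrative:

```rust
// compares a new embedding against stored speaker embeddings and returns
// the best match whose cosine similarity clears a threshold.

fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}

fn identify_speaker<'a>(
    embedding: &[f32],
    known: &'a [(String, Vec<f32>)],
    threshold: f32,
) -> Option<&'a str> {
    known
        .iter()
        .map(|(name, emb)| (name.as_str(), cosine_similarity(embedding, emb)))
        .filter(|(_, sim)| *sim >= threshold)
        .max_by(|(_, a), (_, b)| a.partial_cmp(b).unwrap())
        .map(|(name, _)| name)
}

fn main() {
    let known = vec![
        ("alice".to_string(), vec![1.0f32, 0.0]),
        ("bob".to_string(), vec![0.0f32, 1.0]),
    ];
    if let Some(name) = identify_speaker(&[0.9, 0.1], &known, 0.8) {
        println!("matched speaker: {}", name);
    }
}
```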

🔧 technical stack

  • rust + tokio for async processing

  • tauri for cross-platform support

  • onnx runtime for ml inference

  • crossbeam channels for thread communication
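the channel-based handoff between capture and processing looks roughly like this. screenpipe uses tokio and crossbeam channels; std's mpsc shows the same shape:

```rust
use std::sync::mpsc;
use std::thread;

// a capture thread pushes audio chunks into a channel; the consumer drains
// them until the sender is dropped, which closes the channel.

fn run_pipeline(num_chunks: usize) -> usize {
    let (tx, rx) = mpsc::channel::<Vec<f32>>();

    // capture side: sends chunks, then drops tx when done
    let producer = thread::spawn(move || {
        for i in 0..num_chunks {
            tx.send(vec![i as f32; 4]).unwrap();
        }
    });

    // processing side: iterate until the channel closes
    let mut processed = 0;
    for chunk in rx {
        let _energy: f32 = chunk.iter().map(|s| s * s).sum();
        processed += 1;
    }
    producer.join().unwrap();
    processed
}

fn main() {
    println!("processed {} chunks", run_pipeline(3));
}
```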

📊 performance considerations

  • efficient memory usage through streaming processing

  • minimal cpu overhead through smart buffering

  • configurable quality/performance tradeoffs

  • automatic resource management
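streaming processing keeps memory bounded by working on fixed-size frames instead of whole recordings. a rough sketch (the frame size and function name are illustrative):

```rust
// 100ms frames at 16khz; each full frame would be handed to vad /
// transcription, and the buffer is reused so memory stays constant.

const FRAME_SIZE: usize = 1600;

fn process_stream<I: Iterator<Item = f32>>(samples: I) -> usize {
    let mut buf = Vec::with_capacity(FRAME_SIZE);
    let mut frames = 0;
    for s in samples {
        buf.push(s);
        if buf.len() == FRAME_SIZE {
            // process the frame here (vad, normalization, transcription)
            frames += 1;
            buf.clear(); // reuse the same allocation
        }
    }
    frames
}

fn main() {
    let frames = process_stream((0..4800).map(|i| i as f32));
    println!("processed {} frames", frames);
}
```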


it's open source btw!

https://github.com/mediar-ai/screenpipe

drop any questions!
