Building an AI Profanity Filter with Vocal Separation

#ai #python #showdev

I built an online tool that automatically detects and bleeps profanity in video and audio files. Here's the high-level architecture.

The problem

Censoring profanity by hand can take 45+ minutes for a 10-minute video: you have to listen through the whole thing, find each word, cut the audio at that spot, and drop in a beep effect. For songs, it's nearly impossible without damaging the music.

The solution

AI speech recognition + neural vocal separation.

How it works

User uploads a file or pastes a YouTube URL
Audio is extracted with FFmpeg
AI speech-to-text transcribes the audio (AssemblyAI / Deepgram)
Profanity is detected using morphological analysis (lemmatization), so inflected forms match the same dictionary entry
Each flagged word is replaced with a beep, silence, or a custom sound via FFmpeg
For songs: Demucs separates vocals from instruments first, before transcription
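
The detection and replacement steps above can be sketched roughly like this. It's a minimal illustration, not the production code: the word list, the `naive_lemma` suffix stripper (a stand-in for a real lemmatizer), the shape of the `words` input, and all function names are assumptions made for the example. The only real-tool detail it relies on is FFmpeg's `volume` filter with an `enable='between(t,...)'` timeline expression, used here to silence each flagged span.

```python
import re

# Hypothetical mini-lexicon of base forms; a real filter needs a far larger list.
PROFANE_LEMMAS = {"damn", "hell"}


def naive_lemma(word: str) -> str:
    """Toy suffix stripper standing in for a real lemmatizer (assumption)."""
    token = re.sub(r"[^a-z']", "", word.lower())
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def find_profane_spans(words, pad=0.05):
    """Return (start, end) spans in seconds for words whose lemma is flagged.

    `words` is assumed to be a list of dicts with "text", "start", "end"
    (seconds); real speech-to-text APIs use their own field names and units.
    """
    spans = []
    for w in words:
        if naive_lemma(w["text"]) in PROFANE_LEMMAS:
            spans.append((max(0.0, w["start"] - pad), w["end"] + pad))
    return spans


def build_mute_filter(spans):
    """Build an FFmpeg -af value that silences each span via the volume filter."""
    if not spans:
        return "anull"  # pass-through filter when nothing was flagged
    return ",".join(
        f"volume=enable='between(t,{s:.3f},{e:.3f})':volume=0" for s, e in spans
    )


def censor_command(src, dst, spans):
    """ffmpeg argument list; hand it to subprocess.run(..., check=True) to execute."""
    return ["ffmpeg", "-y", "-i", src, "-af", build_mute_filter(spans), dst]


words = [
    {"text": "Well,", "start": 0.0, "end": 0.4},
    {"text": "damned", "start": 0.5, "end": 0.9},
    {"text": "hello", "start": 1.0, "end": 1.4},
]
spans = find_profane_spans(words, pad=0.0)
print(build_mute_filter(spans))
```

Matching on the lemma rather than the raw token is what lets "damned" hit the "damn" entry while "hello" passes through. Swapping silence for a beep would mean mixing in a tone over the same spans, which this sketch leaves out.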

Song mode — the hard part

Demucs by Meta AI does the heavy lifting: it splits the audio into vocal and instrumental tracks. Profanity detection runs only on the vocal track, and the censored vocals are then mixed back with the original instruments. The music stays untouched.
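
As a rough sketch of that separate-then-remix flow, the commands could be assembled like this. The helper names, the `separated` output directory, and the file names are hypothetical. The sketch assumes the Demucs command-line tool with its `--two-stems=vocals` option and the `htdemucs` model, which writes `vocals.wav` and `no_vocals.wav` under `<out_dir>/<model>/<track name>/`, and it assumes an FFmpeg build recent enough for `amix` to accept `normalize=0`.

```python
from pathlib import Path


def demucs_command(track, out_dir="separated", model="htdemucs"):
    """Argument list for splitting a track into vocals and everything else."""
    return ["demucs", "--two-stems=vocals", "-n", model, "-o", str(out_dir), str(track)]


def stem_paths(track, out_dir="separated", model="htdemucs"):
    """Where the two stems are expected to land (layout is an assumption)."""
    base = Path(out_dir) / model / Path(track).stem
    return base / "vocals.wav", base / "no_vocals.wav"


def remix_command(censored_vocals, instrumental, dst):
    """Argument list for mixing censored vocals back over the instrumental stem."""
    return [
        "ffmpeg", "-y",
        "-i", str(censored_vocals),
        "-i", str(instrumental),
        # normalize=0 keeps each input at its original level instead of scaling it down
        "-filter_complex", "amix=inputs=2:normalize=0",
        str(dst),
    ]


vocals, instrumental = stem_paths("song.mp3")
print(demucs_command("song.mp3"))
print(remix_command("vocals_censored.wav", instrumental, "song_clean.wav"))
```

Each list can be passed to `subprocess.run(..., check=True)`. The censoring step in between would run on `vocals` only, so the instrumental stem is never edited.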

Stack

Frontend: Next.js (React)
Backend: NestJS (Node.js), BullMQ queues
Audio processing: Python (FastAPI), Demucs, FFmpeg
Infrastructure: Docker Compose, PostgreSQL, Redis

Results

12,000+ files processed. Three processing modes: standard (clean speech), precise (noisy audio), enhanced (songs with vocal separation).

Free for up to 15 minutes per month at videocensor.net.

Would love to hear your thoughts!
