I built an online tool that automatically detects and bleeps profanity in video and audio files. Here's the high-level architecture.
The problem
Manual profanity censoring takes 45+ minutes for a 10-minute video. You have to listen through, find each word, razor the audio, and drop in a beep effect. For songs, it's nearly impossible without destroying the music.
The solution
AI speech recognition to find the words, plus neural vocal separation so songs can be censored without touching the music.
How it works
- User uploads a file or pastes a YouTube URL
- Audio is extracted with FFmpeg
- AI speech-to-text transcribes the audio (AssemblyAI / Deepgram)
- Profanity is detected using morphological analysis (lemmatization)
- Each detected word is replaced with a beep, silence, or a custom sound via FFmpeg
- For songs: Demucs separates vocals from instruments before transcription, so detection and censoring run on the vocal track only
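
To make the detect-and-replace steps concrete, here is a minimal sketch, not the production code. It assumes the speech-to-text step returns word-level timestamps. The `Word` type, the two-word `BLOCKLIST`, and the suffix-stripping `naive_lemma` are hypothetical stand-ins (a real lemmatizer does much more), and it only covers the silence option, using FFmpeg's `volume` filter with a timed `enable` expression.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float  # seconds
    end: float    # seconds


# Placeholder lemma list for illustration; a real list is far larger.
BLOCKLIST = {"damn", "hell"}
SUFFIXES = ("ing", "ed", "es", "s")


def naive_lemma(token: str) -> str:
    """Very rough stand-in for lemmatization: lowercase, strip punctuation,
    then drop one common English suffix."""
    token = token.lower().strip(".,!?;:\"'")
    for suffix in SUFFIXES:
        stem = token[: -len(suffix)]
        if token.endswith(suffix) and len(stem) >= 3:
            return stem
    return token


def find_profane_spans(words, pad=0.05):
    """Return merged (start, end) spans, padded by `pad` seconds."""
    hits = sorted(
        (max(0.0, w.start - pad), w.end + pad)
        for w in words
        if naive_lemma(w.text) in BLOCKLIST
    )
    merged = []
    for start, end in hits:
        if merged and start <= merged[-1][1]:
            # Overlaps the previous span: extend it instead of adding a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def build_mute_filter(spans):
    """Build an FFmpeg audio filter chain that silences each span."""
    if not spans:
        return "anull"  # pass-through filter when nothing was flagged
    return ",".join(
        f"volume=enable='between(t,{s:.3f},{e:.3f})':volume=0" for s, e in spans
    )


def ffmpeg_command(src, dst, spans):
    """Argument list suitable for subprocess.run; nothing is executed here."""
    return ["ffmpeg", "-y", "-i", src, "-af", build_mute_filter(spans), dst]


if __name__ == "__main__":
    words = [Word("Well", 0.0, 0.3), Word("Damned!", 0.5, 0.9), Word("hello", 1.0, 1.4)]
    print(ffmpeg_command("in.wav", "out.wav", find_profane_spans(words)))
```

Matching on lemmas rather than raw tokens is what lets one blocklist entry catch inflected forms. The small padding guards against timestamps that clip the start or end of a word.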
Song mode — the hard part
Demucs by Meta AI does the heavy lifting — splitting audio into vocal and instrumental tracks. Profanity detection runs only on the vocal track, then the censored vocals are mixed back with the original instruments. The music stays untouched.
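
A rough sketch of how the split-and-remix flow could be wired up, assuming the Demucs command-line tool and FFmpeg are installed. The helper names are hypothetical, and the output layout (`<out_dir>/htdemucs/<track>/vocals.wav` and `no_vocals.wav`) assumes the default `htdemucs` model. `mute_filter` is any FFmpeg audio filter string that silences the flagged spans. The `normalize=0` option on `amix` needs a reasonably recent FFmpeg build.

```python
from pathlib import Path

# Assumption: Demucs runs with its default model, which names the output folder.
MODEL = "htdemucs"


def demucs_command(track: Path, out_dir: Path) -> list[str]:
    """Split a track into two stems: vocals and everything else."""
    return ["demucs", "--two-stems=vocals", "-o", str(out_dir), str(track)]


def stem_paths(track: Path, out_dir: Path) -> tuple[Path, Path]:
    """Where the two stems are expected to land after separation."""
    base = out_dir / MODEL / track.stem
    return base / "vocals.wav", base / "no_vocals.wav"


def remix_command(vocals: Path, instrumental: Path, mute_filter: str, dst: Path) -> list[str]:
    """Censor the vocal stem only, then sum it with the untouched instrumental."""
    graph = (
        f"[0:a]{mute_filter}[clean];"
        "[clean][1:a]amix=inputs=2:normalize=0[out]"
    )
    return [
        "ffmpeg", "-y",
        "-i", str(vocals),
        "-i", str(instrumental),
        "-filter_complex", graph,
        "-map", "[out]",
        str(dst),
    ]


if __name__ == "__main__":
    track, out_dir = Path("song.mp3"), Path("separated")
    vocals, instrumental = stem_paths(track, out_dir)
    print(demucs_command(track, out_dir))
    print(remix_command(vocals, instrumental, "anull", Path("clean.wav")))
```

Both functions only build argument lists; passing them to `subprocess.run` would do the actual work. For video input, the remixed audio would still need to be muxed back onto the original video stream.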
Stack
- Frontend: Next.js (React)
- Backend: NestJS (Node.js), BullMQ queues
- Audio processing: Python (FastAPI), Demucs, FFmpeg
- Infrastructure: Docker Compose, PostgreSQL, Redis
Results
12,000+ files processed. Three processing modes: standard (clean speech), precise (noisy audio), enhanced (songs with vocal separation).
Free for up to 15 minutes per month at videocensor.net.
Would love to hear your thoughts!