PersonymAi

How We Built Voice and Image Spam Detection for Telegram (Technical Deep Dive)

Last month we shipped three features that no other Telegram anti-spam bot has: voice message analysis, image spam detection, and anti-masking intelligence. Here's how we built them.

The Problem
Spammers evolved. Our text-based pipeline was catching 99.7% of text spam. So spammers stopped using text.

Voice messages with gambling/scam ads
Images with overlaid promotional text
Text with emoji inserted between every character

Traditional keyword filters, and even AI text analysis, are blind to all three.

Voice Message Pipeline
Architecture:

Voice message received
→ Download .ogg file from Telegram API
→ Transcribe (speech-to-text)
→ Feed transcript into existing anti-spam pipeline
→ Same AI context analysis as text messages
→ Decision: ban / warn / allow

Key decisions:

We transcribe everything under 5 minutes (covers 99% of spam voice notes)
Transcription runs async — doesn't block the moderation pipeline
The transcript gets the same 8-layer analysis as text: whitelist → global ban → reputation → trust → fingerprint → rules → AI context → decision
Language detection handles Russian, Ukrainian, and English voice messages

The result: a 15-second voice note saying "free betting tips, guaranteed profit" gets transcribed, classified as gambling spam, and the sender gets banned — all within 3-5 seconds.
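The voice path above can be sketched as a small async handler. Here `transcribe` and `analyze` are injected stand-ins for the real speech-to-text call and the existing 8-layer text pipeline — this is a sketch of the flow, not our production code:

```python
import asyncio

MAX_VOICE_SECONDS = 5 * 60  # transcribe only notes under 5 minutes

async def handle_voice(duration: int, file_id: str, transcribe, analyze) -> str:
    """Gate by duration, transcribe off the hot path, then reuse the
    existing text pipeline on the transcript. `transcribe` (async) and
    `analyze` are injected so the sketch stays self-contained."""
    if duration > MAX_VOICE_SECONDS:
        # Long notes skip transcription entirely; a real deployment
        # might flag these for manual review instead of allowing them.
        return "allow"
    transcript = await transcribe(file_id)  # async: doesn't block moderation
    return analyze(transcript)              # same 8-layer analysis as text
```

The dependency injection is what keeps transcription async-friendly: the moderation loop never waits on speech-to-text for other messages.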

Image Spam Detection
Architecture:

Image received (photo or document)
→ Send to Vision AI
→ Analyze: is there promotional/spam text in the image?
→ If spam detected → classify category → ban
→ If clean → allow

What Vision AI catches:

Screenshots of fake profit/portfolio charts with channel links
Photos with overlaid text advertising gambling/crypto
Casino/betting ad graphics
Profile avatars that are literally advertisements

We only trigger Vision analysis on images from untrusted users (trust score below threshold). Trusted members' images pass through without Vision analysis, which saves cost and reduces latency.
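The trust gate can be sketched like this; the threshold value and the `classify_image` callable are hypothetical stand-ins for our internal trust score and the Vision call:

```python
TRUST_THRESHOLD = 50  # hypothetical cutoff on the internal trust score

def moderate_image(trust_score: int, image_bytes: bytes, classify_image) -> str:
    """Gate Vision AI behind the trust check: trusted members' images
    pass through with no Vision call at all, saving cost and latency."""
    if trust_score >= TRUST_THRESHOLD:
        return "allow"                      # trusted: Vision never invoked
    category = classify_image(image_bytes)  # e.g. "gambling", "crypto", or None
    return "ban" if category else "allow"
```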

Anti-Masking
Spammers discovered they could bypass keyword filters with tricks like:

З🎰а🎰р🎰а🎰б🎰о🎰т🎰о🎰к (emoji between letters)
3аработок (number 3 instead of letter З)
Зaрaботок (Latin 'a' instead of Cyrillic 'а')

Our approach:

Raw message text
→ Strip emoji and special characters
→ Normalize Unicode (Cyrillic/Latin homoglyphs)
→ Normalize number→letter substitutions
→ Feed cleaned text into AI analysis
→ AI evaluates meaning, not characters

The AI layer is the key — even after normalization, context matters. "Заработок" in a freelance group is normal. "Заработок" in a cooking group is spam. Same word, different context, different decision.
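A minimal sketch of the normalization step. The homoglyph table here is a tiny illustrative subset; a real deployment would use a full Unicode confusables table (e.g. Unicode TR39) rather than this hand-picked map:

```python
import re
import unicodedata

# Illustrative subset: Latin letters and digits folded to the Cyrillic
# letters they impersonate. A production map would be far larger.
HOMOGLYPHS = str.maketrans({
    "a": "а", "e": "е", "o": "о", "p": "р", "c": "с", "x": "х",
    "3": "з", "0": "о", "6": "б",
})

def normalize(text: str) -> str:
    """Strip emoji and fold homoglyphs so masked spam matches the
    same fingerprints and AI analysis as plain text."""
    # 1. Drop emoji/symbol characters (Unicode category S*) and
    #    zero-width characters often used as invisible separators.
    #    This is deliberately coarse for the sketch.
    cleaned = "".join(
        ch for ch in text
        if not unicodedata.category(ch).startswith("S")
        and ch not in "\u200b\u200c\u200d"
    )
    # 2. Canonical Unicode form, then homoglyph folding.
    cleaned = unicodedata.normalize("NFKC", cleaned)
    cleaned = cleaned.translate(HOMOGLYPHS)
    # 3. Collapse whitespace left behind by stripped separators.
    return re.sub(r"\s+", " ", cleaned).strip()
```

Running the three masked examples from above through `normalize` yields the same plain "Заработок"/"заработок", which is exactly what lets the downstream layers treat them as ordinary text.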

Updated Pipeline (10 layers)

  1. Whitelist check → free, 0ms
  2. Global ban check → free, 0ms
  3. Reputation auto-ban → free, 0ms
  4. Trust system check → free, 0ms
  5. Anti-masking normalization → free, 1ms
  6. Fingerprint matching → free, 1ms
  7. Rule-based detection → free, 0ms
  8. Voice transcription → only if voice message
  9. Vision AI analysis → only if image from untrusted user
  10. AI context analysis → only for edge cases
  → Decision: ban / mute / allow

Cheapest checks first. AI and Vision only for what cheaper layers can't decide.
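The cheapest-first ordering can be sketched as a simple layer loop, where each layer either returns a decision or defers to the next, more expensive one (layer functions here are illustrative stand-ins):

```python
def moderate(message: str, layers) -> str:
    """Run detection layers in cost order; each layer returns a decision
    ("ban", "mute", "allow") or None to defer to the next layer."""
    for layer in layers:
        decision = layer(message)
        if decision is not None:
            return decision  # short-circuit: pricier layers never run
    return "allow"           # no layer objected
```

The design choice is that expensive layers (AI context, Vision) never even execute unless every cheap layer has returned None, which is what keeps per-message cost near zero for ordinary traffic.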

Results
Voice spam: from 0% detection → ~95% detection
Image spam: from 0% detection → ~90% detection
Masked text: from ~60% → ~95% detection
Overall accuracy: maintained at 99.7%
False positive rate: still near zero

What's Next
Video message analysis is on the roadmap. Spammers will try video next — we'll be ready.

→ personym-ai.com/moderator-ai
→ Try free for 7 days
