Built this for a hackathon (Build Small, June 2026) and figured I'd write it up while it's still fresh.
It's a small ensemble — TF-IDF + Logistic Regression baseline, plus a fine-tuned BERTić model (110M params) — that flags SMS phishing in South Slavic languages: Serbian, Bosnian, Croatian, and Montenegrin.
Why
Smishing is apparently up something like 1,300% in Serbia over the last three years. Every phishing dataset and model I could find was English-only, which turns out to be a real gap for a few reasons:
- Grammatical case. These languages decline nouns by case, so "nagrada" / "nagradu" / "nagradi" are all the same word ("prize"), just in different grammatical forms. A keyword filter sees three unrelated strings; a scammer just uses whichever form the sentence needs, no extra effort.
- Script mixing and homoglyphs. Serbian can be written in Cyrillic or Latin, and the two can be mixed in the same message. A Cyrillic "а" (U+0430) looks identical to a Latin "a" (U+0061) but is a different character: a human can't see the difference and a Latin-only keyword filter silently fails to match, but a model that sees the actual codepoints can pick it up.
- No dataset existed. We looked. Couldn't find one. So we built one — 1,529 labeled messages (900 legit / 629 phishing), Cyrillic and Latin, across all four languages.
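To make the grammatical-case point concrete, here is a minimal Python sketch. The one-word blocklist, the sample sentences, and the crude prefix-stem workaround are all made up for illustration; this is not how the models in this project work.

```python
import re

# Hypothetical blocklist entry: the nominative form of "prize".
KEYWORDS = {"nagrada"}

def exact_match(text: str) -> bool:
    """Naive filter: flags only when a token equals a blocklisted keyword."""
    tokens = re.findall(r"\w+", text.lower())
    return any(tok in KEYWORDS for tok in tokens)

def stem_match(text: str, stem_len: int = 6) -> bool:
    """Crude workaround: compare only the first stem_len characters."""
    stems = {kw[:stem_len] for kw in KEYWORDS}
    tokens = re.findall(r"\w+", text.lower())
    return any(tok[:stem_len] in stems for tok in tokens)

# Made-up sentences using the three case forms mentioned above.
messages = [
    "Vaša nagrada čeka",         # nominative
    "Preuzmite nagradu odmah",   # accusative
    "Više o nagradi na linku",   # locative
]

# One (exact, stem) pair per message.
results = [(exact_match(m), stem_match(m)) for m in messages]
print(results)
```

The exact-match filter only catches the nominative form. Prefix stemming catches all three here, but it over-matches on unrelated words that share a prefix and breaks on stems that change between cases, which is part of why a learned subword model looked like the better bet.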
What it does
Both models run side by side and the app shows confidence scores plus which signals fired (fake URL, urgency language, sender impersonation, suspicious/typosquatted domains, etc.) — not just a yes/no.
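As a rough idea of what "signals fired" could look like in code, here is a small standard-library sketch. The function name `fired_signals`, the allowlist, the urgency cue words, and the signal labels are all assumptions for illustration; the app's real heuristics may be structured quite differently.

```python
import re
from typing import List
from urllib.parse import urlparse

# Hypothetical allowlist and cue words, for illustration only.
OFFICIAL_DOMAINS = {"mup.gov.rs"}
OFFICIAL_SENDERS = {"mup"}
URGENCY_CUES = ("odmah", "hitno", "danas")

URL_RE = re.compile(r"https?://\S+")

def fired_signals(text: str) -> List[str]:
    """Return human-readable labels for each heuristic that triggers."""
    signals = []
    lowered = text.lower()

    # Any linked host that is not on the allowlist counts as unofficial.
    hosts = [(urlparse(u).hostname or "") for u in URL_RE.findall(text)]
    unofficial = [h for h in hosts if h and h not in OFFICIAL_DOMAINS]
    signals += [f"unofficial_domain:{h}" for h in unofficial]

    if any(cue in lowered for cue in URGENCY_CUES):
        signals.append("urgency_language")

    # "MUP: ..." claims an official sender; pair that with an unofficial link.
    claimed_sender = lowered.split(":", 1)[0].strip()
    if claimed_sender in OFFICIAL_SENDERS and unofficial:
        signals.append("sender_impersonation")

    return signals

phish = ("MUP: S\u0430obraćajni prekršaj evidentiran. "
         "Platite online na linku: https://mup-gov.online/login")
legit = ("Raiffeisen: Transakcija karticom ****3421 u iznosu od 1.299 RSD "
         "je odobrena. Stanje: 124.567,80 RSD.")

print(fired_signals(phish))
print(fired_signals(legit))
```

Returning labels instead of a single boolean is what lets a UI explain why a message was flagged, alongside the model confidence scores.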
Example — this one's flagged as phishing:
MUP: Sаobraćajni prekršaj evidentiran. Platite online na linku: https://mup-gov.online/login
The "a" in "Sаobraćajni" is Cyrillic, not Latin. Same glyph, different codepoint, classic evasion trick.
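A quick way to see this kind of trick programmatically is to check each word for mixed scripts. The sketch below uses only Python's `unicodedata`; the helper names are hypothetical, and the sample string assumes the swapped letter is the first "a" of the word (written as the escape `\u0430` so it's visible in the source).

```python
import unicodedata

def script_of(ch: str) -> str:
    """Classify a character as LATIN, CYRILLIC, or OTHER via its Unicode name."""
    if not ch.isalpha():
        return "OTHER"
    name = unicodedata.name(ch, "")
    if name.startswith("LATIN"):
        return "LATIN"
    if name.startswith("CYRILLIC"):
        return "CYRILLIC"
    return "OTHER"

def mixed_script_words(text: str) -> list:
    """Return words that contain both Latin and Cyrillic letters."""
    flagged = []
    for word in text.split():
        scripts = {script_of(c) for c in word} - {"OTHER"}
        if len(scripts) > 1:
            flagged.append(word)
    return flagged

# Assumption: the first "a" is the Cyrillic one (U+0430).
sample = "MUP: S\u0430obraćajni prekršaj evidentiran."
print(mixed_script_words(sample))
```

The check is per word on purpose: a legitimate Serbian message can contain both scripts, but a single word that mixes them is much more suspicious.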
And this one's correctly left alone:
Raiffeisen: Transakcija karticom ****3421 u iznosu od 1.299 RSD je odobrena. Stanje: 124.567,80 RSD.
Numbers
96.96% accuracy / 96.3% F1 on the held-out split for the BERTić model. We also built a separate, harder 105-case test set (typosquatting, homographs, morphological case variants, no-link IBAN scams). It's downloadable from the app itself, and if you batch-test it, the app scores it live. Currently at 93.3% (98/105).
Most of the misses are no-link phishing — scams that rely on IBAN numbers or pure social pressure instead of a URL, which our heuristics don't really cover yet since they lean on domain/URL signals. Known gap, working on it.
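One possible direction for a URL-free signal, sketched here as an idea and not something the project currently does, is spotting IBANs in the message body and validating them with the standard mod-97 checksum. The function names, the contiguous-only regex, and the sample message are assumptions; the account number is a sample in the Serbian format, not a real scam account.

```python
import re

# Country code, two check digits, then 11-30 alphanumerics, with no spaces.
IBAN_RE = re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b")

def iban_checksum_ok(iban: str) -> bool:
    """Mod-97 check: move the first 4 chars to the end, map A=10..Z=35."""
    rearranged = iban[4:] + iban[:4]
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(digits) % 97 == 1

def find_ibans(text: str) -> list:
    """Return IBAN-shaped substrings whose checksum validates."""
    return [m for m in IBAN_RE.findall(text) if iban_checksum_ok(m)]

# Made-up message with a sample IBAN in the Serbian format.
message = "Uplatite 2.500 RSD na račun RS35260005601001611379 do kraja dana."
print(find_ibans(message))
```

A valid IBAN in a message is not suspicious on its own, so this would only make sense as one more feature next to urgency or impersonation cues. IBANs written with spaces would also need normalizing first.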
Also, this happened
Mid-build, one of us got a real SMS impersonating the traffic police — fake case number, citation of an actual law article, same-day payment deadline. Not a training example, not synthetic. Just a normal Tuesday in Serbia, apparently. Good validation that the problem is real.
Links
- App: https://huggingface.co/spaces/build-small-hackathon/ne-nasedaj-sms-phishing
- Model: https://huggingface.co/ravi2505/ne-nasedaj-sms-phishing
- Demo video: https://www.loom.com/share/33f87e7836244b28ae054a346ce8ffff
- Writeup/blog: https://metalalchemistspex.github.io
Runs locally on CPU, nothing leaves the app. Bilingual EN/SR UI. Open to contributions — especially more no-link phishing examples and Bosnian/Montenegrin regional variants.
If anyone's worked on similar morphology problems for other inflected languages (Polish, Finnish, etc.), curious how you approached it — feel free to poke holes, the no-link gap is the obvious weak spot.