<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: metalalchemistspex</title>
    <description>The latest articles on DEV Community by metalalchemistspex (@metalalchemistspex).</description>
    <link>https://dev.to/metalalchemistspex</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3986242%2F839b57e9-34c6-430d-bd21-d9627ca938bb.png</url>
      <title>DEV Community: metalalchemistspex</title>
      <link>https://dev.to/metalalchemistspex</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/metalalchemistspex"/>
    <language>en</language>
    <item>
      <title>Smishi — an SMS phishing detector for Serbian/Bosnian/Croatian/Montenegrin</title>
      <dc:creator>metalalchemistspex</dc:creator>
      <pubDate>Mon, 15 Jun 2026 21:26:29 +0000</pubDate>
      <link>https://dev.to/metalalchemistspex/smishi-an-sms-phishing-detector-for-serbianbosniancroatianmontenegrin-4b5g</link>
      <guid>https://dev.to/metalalchemistspex/smishi-an-sms-phishing-detector-for-serbianbosniancroatianmontenegrin-4b5g</guid>
      <description>&lt;p&gt;Built this for a hackathon (Build Small, June 2026) and figured I'd write it up while it's still fresh.&lt;/p&gt;

&lt;p&gt;It's a small ensemble — TF-IDF + Logistic Regression baseline, plus a fine-tuned BERTić model (110M params) — that flags SMS phishing in South Slavic languages: Serbian, Bosnian, Croatian, and Montenegrin.&lt;/p&gt;

&lt;h2&gt;Why&lt;/h2&gt;

&lt;p&gt;Smishing is apparently up something like 1,300% in Serbia over the last three years. Every phishing dataset and model I could find was English-only, which turns out to be a real gap for a few reasons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Grammatical case.&lt;/strong&gt; These languages decline nouns by case, so &lt;em&gt;"nagrada"&lt;/em&gt; / &lt;em&gt;"nagradu"&lt;/em&gt; / &lt;em&gt;"nagradi"&lt;/em&gt; are all the same word ("prize") in different grammatical forms. A keyword filter sees three unrelated strings; a scammer just uses whichever form the sentence needs, at no extra effort.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Script mixing and homoglyphs.&lt;/strong&gt; Serbian can be written in Cyrillic or Latin, and the two can be mixed in the same message. A Cyrillic "а" (U+0430) looks identical to a Latin "a" (U+0061) but is a different character: a human can't see the swap, a Latin-only keyword filter no longer matches the word, but a model looking at the actual bytes can tell the two apart.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No dataset existed.&lt;/strong&gt; We looked. Couldn't find one. So we built one — 1,529 labeled messages (900 legit / 629 phishing), Cyrillic and Latin, across all four languages.&lt;/li&gt;
&lt;/ul&gt;
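
&lt;p&gt;To make the case problem concrete, here is a minimal sketch of why sub-word features help where exact keywords fail. It is not the Smishi feature pipeline; the function name &lt;code&gt;char_ngrams&lt;/code&gt; and the n-gram size of 4 are assumptions picked for illustration.&lt;/p&gt;

```python
def char_ngrams(word, n=4):
    """Return the set of character n-grams of a lowercased word."""
    w = word.lower()
    return {w[i:i + n] for i in range(len(w) - n + 1)}


forms = ["nagrada", "nagradu", "nagradi"]
keyword = "nagrada"

# Exact keyword matching only catches the one form that was listed.
exact_hits = [f for f in forms if f == keyword]

# Character 4-grams: every case form still overlaps with the keyword,
# because the stem "nagrad" is unchanged and only the ending differs.
shared = {f: sorted(char_ngrams(f).intersection(char_ngrams(keyword))) for f in forms}

print(exact_hits)
for form, grams in shared.items():
    print(form, grams)
```

&lt;p&gt;The same idea carries over to a TF-IDF baseline when it is built on character n-grams rather than whole words, and a subword tokenizer gives a transformer a similar kind of overlap.&lt;/p&gt;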

&lt;h2&gt;What it does&lt;/h2&gt;

&lt;p&gt;Both models run side by side and the app shows confidence scores plus &lt;em&gt;which&lt;/em&gt; signals fired (fake URL, urgency language, sender impersonation, suspicious/typosquatted domains, etc.) — not just a yes/no.&lt;/p&gt;
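
&lt;p&gt;As a rough idea of what "which signals fired" can look like in code, here is a small rule-based sketch. Everything in it is hypothetical: the signal names, the urgency word list, and the sender-to-domain table (including the assumed official domain) are made up for illustration and are not the rules the app actually uses.&lt;/p&gt;

```python
import re

# Hypothetical rule data, for illustration only.
URGENCY_WORDS = ("hitno", "odmah", "u roku")
KNOWN_SENDERS = {"mup": "mup.gov.rs"}  # assumed official domain
URL_DOMAIN = re.compile(r"https?://([^/\s]+)")


def fired_signals(message):
    """Return the names of the heuristic signals that fire for a message."""
    text = message.lower()
    signals = []

    domains = URL_DOMAIN.findall(text)
    if domains:
        signals.append("contains_url")

    if any(word in text for word in URGENCY_WORDS):
        signals.append("urgency_language")

    # Text before the first colon is treated as the claimed sender.
    sender = text.split(":", 1)[0].strip()
    official = KNOWN_SENDERS.get(sender)
    if official:
        off_domain = [
            d for d in domains
            if d != official and not d.endswith("." + official)
        ]
        if off_domain:
            signals.append("sender_impersonation")

    return signals


message = ("MUP: Saobraćajni prekršaj evidentiran. "
           "Platite online na linku: https://mup-gov.online/login")
print(fired_signals(message))
```

&lt;p&gt;Returning the list of signal names, rather than a single boolean, is what lets a UI explain a verdict instead of only stating it.&lt;/p&gt;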

&lt;p&gt;Example (this one's flagged as phishing):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;MUP: Sаobraćajni prekršaj evidentiran. Platite online na linku: https://mup-gov.online/login
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The "a" in "Sаobraćajni" is Cyrillic, not Latin. Same glyph, different codepoint, classic evasion trick.&lt;/p&gt;

&lt;p&gt;And this one's correctly left alone:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Raiffeisen: Transakcija karticom ****3421 u iznosu od 1.299 RSD je odobrena. Stanje: 124.567,80 RSD.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;Numbers&lt;/h2&gt;

&lt;p&gt;96.96% accuracy / 96.3% F1 on the held-out split for the BERTić model. We also built a separate, harder 105-case test set (typosquatting, homographs, morphological case variants, no-link IBAN scams). It's downloadable from the app itself, and you can batch-test it there and have it scored live. Currently at 93.3% (97/105).&lt;/p&gt;
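
&lt;p&gt;For anyone who wants to score a batch themselves, accuracy and F1 can be computed from labels and predictions as in the sketch below. The label strings and the toy data are assumptions for illustration, not the real test set or its results.&lt;/p&gt;

```python
def accuracy_and_f1(y_true, y_pred, positive="phishing"):
    """Return (accuracy, F1 for the positive class) for two label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs) if pairs else 0.0
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1


# Toy labels, made up for this example.
y_true = ["phishing", "phishing", "legit", "legit"]
y_pred = ["phishing", "legit", "legit", "legit"]
accuracy, f1 = accuracy_and_f1(y_true, y_pred)
print(round(accuracy, 3), round(f1, 3))
```

&lt;p&gt;F1 is worth reporting next to accuracy here because the classes are not balanced, and a missed phishing message costs more than a false alarm.&lt;/p&gt;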

&lt;p&gt;Most of the misses are no-link phishing — scams that rely on IBAN numbers or pure social pressure instead of a URL, which our heuristics don't really cover yet since they lean on domain/URL signals. Known gap, working on it.&lt;/p&gt;
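
&lt;p&gt;One possible starting point for the no-link gap is to treat a checksum-valid IBAN in the message body as a signal of its own. The sketch below is an assumption about how that could be done, not something the app does today. It uses the standard ISO 13616 mod-97 check; the function names and the sample message are made up, and the account number is a commonly published example IBAN.&lt;/p&gt;

```python
import re

# Loose candidate shape: country code, two check digits, then 11 to 30
# alphanumerics. Only contiguous IBANs are matched; IBANs printed in
# space-separated groups would need extra normalisation.
IBAN_CANDIDATE = re.compile(r"\b[A-Z]{2}[0-9]{2}[A-Z0-9]{11,30}\b")


def iban_checksum_ok(iban):
    """ISO 13616 mod-97 check on an upper-case IBAN without spaces."""
    # Move the country code and check digits to the end.
    rearranged = iban[4:] + iban[:4]
    # Map every character to a number: digits stay, A=10 ... Z=35.
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(digits) % 97 == 1


def find_valid_ibans(message):
    """Return IBAN-shaped tokens in a message that pass the checksum."""
    return [m for m in IBAN_CANDIDATE.findall(message) if iban_checksum_ok(m)]


sample = "Uplatite 4.500 RSD na račun RS35260005601001611379 do kraja dana."
print(find_valid_ibans(sample))
```

&lt;p&gt;A valid IBAN is not suspicious by itself, since banks and utilities send them too, so this would only make sense combined with other signals such as urgency language or an unknown sender.&lt;/p&gt;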

&lt;h2&gt;Also, this happened&lt;/h2&gt;

&lt;p&gt;Mid-build, one of us got a real SMS impersonating the traffic police — fake case number, citation of an actual law article, same-day payment deadline. Not a training example, not synthetic. Just a normal Tuesday in Serbia, apparently. Good validation that the problem is real.&lt;/p&gt;

&lt;h2&gt;Links&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;App: &lt;a href="https://huggingface.co/spaces/build-small-hackathon/ne-nasedaj-sms-phishing" rel="noopener noreferrer"&gt;https://huggingface.co/spaces/build-small-hackathon/ne-nasedaj-sms-phishing&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Model: &lt;a href="https://huggingface.co/ravi2505/ne-nasedaj-sms-phishing" rel="noopener noreferrer"&gt;https://huggingface.co/ravi2505/ne-nasedaj-sms-phishing&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Demo video: &lt;a href="https://www.loom.com/share/33f87e7836244b28ae054a346ce8ffff" rel="noopener noreferrer"&gt;https://www.loom.com/share/33f87e7836244b28ae054a346ce8ffff&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Writeup/blog: &lt;a href="https://metalalchemistspex.github.io" rel="noopener noreferrer"&gt;https://metalalchemistspex.github.io&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Runs locally on CPU, nothing leaves the app. Bilingual EN/SR UI. Open to contributions — especially more no-link phishing examples and Bosnian/Montenegrin regional variants.&lt;/p&gt;

&lt;p&gt;If anyone's worked on similar morphology problems for other inflected languages (Polish, Finnish, etc.), I'm curious how you approached it. Feel free to poke holes; the no-link gap is the obvious weak spot.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>nlp</category>
      <category>opensource</category>
      <category>showdev</category>
    </item>
  </channel>
</rss>
