Truong Bui
How to Monitor Certificate Transparency Logs for Lookalike Domains

The firehose problem, typo/homoglyph matching, and where it breaks in production

A lookalike domain usually sits around for days or weeks before anyone notices. Someone registers yourbrand-support.com, clones your login page, and runs a phishing campaign until a customer complains or a threat-intel feed catches up. By then the early window is gone.

Certificate Transparency logs give you a much earlier signal. Major browsers won't trust a public TLS cert unless it's been logged first: Chrome has required this since April 2018, Apple since October 2018. So almost every new cert shows up in a public log, typically within minutes of being issued and often before the site is even live.

That's the good news. The bad news is the logs are a firehose.

The volume problem

SSLMate, which runs a CT search service, says it ingests over 10 million certs a day from 40-plus logs. Other estimates go much higher depending on how you count precertificates, so treat any number here as rough. There are roughly 40-45 actively-accepting logs right now across about eight operators (Google, Cloudflare, DigiCert, Sectigo, Let's Encrypt, and a few smaller ones). The list changes too: logs get retired or go read-only. So don't hardcode it. Pull it from Chrome's log list instead.
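
As a rough sketch of what "pull it from Chrome's log list" can look like, the snippet below filters a parsed log list down to the logs currently marked usable. The URL and the JSON layout (an `operators` array, each with a `logs` array whose entries carry a `url` and a one-key `state` object) are my reading of the v3 log list format, so treat them as assumptions and check them against the live file. Newer tiled logs are listed separately and aren't handled here.

```python
import json
import urllib.request
from typing import Iterator

# Assumed URL for Chrome's v3 log list; confirm it before depending on it.
LOG_LIST_URL = "https://www.gstatic.com/ct/log_list/v3/log_list.json"


def fetch_log_list(url: str = LOG_LIST_URL, timeout: float = 10.0) -> dict:
    """Download and parse the log list JSON."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def usable_logs(log_list: dict) -> Iterator[tuple[str, str]]:
    """Yield (operator name, log URL) for every log whose state is 'usable'."""
    for operator in log_list.get("operators", []):
        for log in operator.get("logs", []):
            # Assumed layout: "state" holds a single key such as
            # "usable", "readonly", or "retired".
            if "usable" in log.get("state", {}):
                yield operator["name"], log["url"]
```

Running `usable_logs(fetch_log_list())` on a schedule, rather than once at startup, is what keeps the monitor in step with retirements.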

One thing worth knowing: most certs end up logged twice, first as a "precertificate" that the CA submits before issuing, then as the final cert. Precerts show up first, so watching only final certs means missing your earliest signal. Most setups filter on precerts and then dedupe against the final cert so domains don't show up twice.
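
One way to do that dedupe, sketched under assumptions: a precert and its final cert carry the same serial number from the same issuer, so keying on that pair lets whichever arrives second be dropped. `LogEntry` and its fields are hypothetical names for whatever your log parser produces, and the in-memory set would need an expiry policy in a long-running process.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogEntry:
    """Hypothetical parsed log entry; field names are illustrative."""
    issuer_id: str          # e.g. a hash of the issuing CA's public key
    serial: int             # certificate serial number
    is_precert: bool
    names: tuple[str, ...]  # hostnames taken from the SAN list


class Deduper:
    """Tracks (issuer, serial) pairs so each cert is processed once."""

    def __init__(self) -> None:
        self._seen: set[tuple[str, int]] = set()

    def is_new(self, entry: LogEntry) -> bool:
        key = (entry.issuer_id, entry.serial)
        if key in self._seen:
            return False  # already handled the precert or the final cert
        self._seen.add(key)
        return True
```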

None of that volume is about you specifically, though. The real work is turning "someone got a cert" into "someone got a cert that's trying to look like you."

Matching

Edit distance against your brand name catches maybe a third of what matters. You also need:

  • Typo permutations: omissions, insertions, transpositions, keyboard-adjacent swaps. dnstwist already does this well, and does DNS/WHOIS lookups too. If a nightly dnstwist run covers your risk tolerance, you probably don't need much more.
  • Combosquats: brand plus a generic word, like yourbrand-login.com or secure-yourbrand.net. Edit distance won't catch these since your brand name is sitting right there unmodified. You need keyword matching against common phishing terms.
  • TLD swaps: same name, different extension. If you only watch .com, attackers just register .net or .io instead.
  • Homoglyphs: the hard one.
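
Leaving homoglyphs aside for a moment, the first three checks can be sketched in a few lines. This is a minimal illustration, not a replacement for dnstwist: the keyword set is made up, the registrable name is taken naively as the second-to-last label (so multi-part suffixes like .co.uk would need a public suffix list), and plain Levenshtein counts a transposition as two edits, which is why the default threshold here is 2.

```python
from typing import Optional

# Illustrative keyword list; a real one would be longer and tuned.
PHISHING_KEYWORDS = {"login", "secure", "support", "verify", "account"}


def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance using a rolling row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def classify(domain: str, brand: str, brand_tld: str = "com",
             max_distance: int = 2) -> Optional[str]:
    """Return 'tld-swap', 'combosquat', 'typo', or None for no match."""
    labels = domain.lower().rstrip(".").split(".")
    if len(labels) < 2:
        return None
    name, tld = labels[-2], labels[-1]  # naive: ignores multi-part suffixes
    if name == brand:
        return None if tld == brand_tld else "tld-swap"
    if brand in name and any(k in name for k in PHISHING_KEYWORDS):
        return "combosquat"
    if edit_distance(name, brand) <= max_distance:
        return "typo"
    return None
```

For example, `classify("yourbrand-login.com", "yourbrand")` is flagged as a combosquat even though its edit distance from the brand is large, which is the gap the keyword check exists to close.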

Homoglyphs

An IDN homograph attack uses a domain that looks like yours but isn't, because it's built from different Unicode characters. Cyrillic а (U+0430) instead of Latin a (U+0061) is the classic example: same shape, different character.

Two things make this annoying. First, these domains show up as punycode (xn--...), so you have to decode that back to actual Unicode before comparing anything. Second, there's no simple list of "characters that look alike." Unicode's confusables standard (UTS #39) handles this by mapping each character to a canonical "skeleton," and you compare skeletons instead of raw strings. Two domains that are completely different as raw strings can collapse to the same skeleton, which is exactly what you want to catch.

Also worth flagging on its own: a domain mixing scripts within one label (Cyrillic and Latin together) is suspicious regardless of whether it matches your brand.
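
Python's standard library doesn't expose the Unicode script property, so the sketch below approximates it with the first word of each character's Unicode name ("LATIN", "CYRILLIC", "GREEK", and so on). That's a heuristic and an assumption on my part, good enough to show the idea. It expects an already-decoded label, and it will also flag languages that legitimately mix scripts, such as Japanese.

```python
import unicodedata


def scripts_in_label(label: str) -> set[str]:
    """Rough set of scripts in a decoded label, based on character names."""
    scripts = set()
    for ch in label:
        if ch in "0123456789-":
            continue  # digits and hyphens are shared by every script
        name = unicodedata.name(ch, "UNKNOWN")
        scripts.add(name.split(" ", 1)[0])
    return scripts


def is_mixed_script(label: str) -> bool:
    """True when a single label draws letters from more than one script."""
    return len(scripts_in_label(label)) > 1
```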

What actually breaks

A few things bite you once this runs for real instead of as a script. Your own infrastructure trips false positives constantly: CDN certs, wildcard certs, regional domains, marketing spinning up a campaign domain without telling anyone. You need a way to mark known-good stuff rather than guessing.
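
The simplest version of "mark known-good" is an allowlist of domains you own, checked by suffix before any matching runs. A minimal sketch, with a hypothetical allowlist:

```python
def is_known_good(hostname: str, allowlist: set[str]) -> bool:
    """True if hostname is an allowlisted domain or a subdomain of one."""
    host = hostname.lower().rstrip(".")
    if host.startswith("*."):
        host = host[2:]  # treat a wildcard like the name it covers
    # Match on a label boundary so "notyourbrand.com" does not pass.
    return any(host == d or host.endswith("." + d) for d in allowlist)
```

Matching on the leading dot matters here: a plain `endswith("yourbrand.com")` would wave through notyourbrand.com.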

A single cert can list hundreds of hostnames (SANs), so "a cert was issued" is really N separate things to check, not one.
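
In practice that means fanning each cert out into a normalized, deduplicated list of names before matching. A small sketch, assuming the SANs arrive as plain strings:

```python
def names_to_check(sans: list[str]) -> list[str]:
    """Normalize SAN entries and drop duplicates, keeping first-seen order."""
    seen: set[str] = set()
    out: list[str] = []
    for san in sans:
        name = san.lower().rstrip(".")
        if name.startswith("*."):
            name = name[2:]  # *.example.com and example.com are one check
        if name and name not in seen:
            seen.add(name)
            out.append(name)
    return out
```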

And once something matches, you'll want to actually look at it. Is it parked, a coincidence, or a live clone of your login page? That means visiting a domain that might be actively hostile. Treat that as its own security boundary: isolated, no shared credentials, tight timeouts, nothing that talks to the rest of your infrastructure. Don't fetch a suspected phishing kit from a box that has anything else useful on it.

None of the individual pieces here are exotic: CT tailing is documented, dnstwist is free, and Unicode confusables is a published standard. The value is wiring it together and running it continuously.

This is roughly the pipeline behind impersona.io. There's a free check at impersona.io/brand-check if you want to run it against your own domain, no signup needed to see the summary.