We've known about homograph attacks since the 1990s -- so you may be wondering why I'm writing about them in 2018. Don't worry, I'll get to that. In this post, we'll explore the history of homograph attacks and why, like many of the internet’s problems that stem from path dependence, they seem like they just won’t ever go away.
I first got interested in homograph attacks a few months back when I was working through tickets for Kickstarter's HackerOne program. HackerOne is a "bug bounty program": an invitation for hackers and security researchers to find vulnerabilities in our site in exchange for money.
When I was looking through the tickets, one caught my attention. It wasn't a particularly high risk vulnerability, but I didn't understand a lot of words in the ticket, so of course I was interested. The hacker was concerned about Kickstarter's profile pages. (We often get reports about our profile and project pages.)
Profile pages often create vulnerabilities for websites. Whenever you are in the position to “host” someone on your site, you are going to have to think about the ways they’ll abuse that legitimacy you give them. Our hacker was specifically concerned about a field that allows our users to add user-urls or "websites" to their profile.
They thought this section could be used in a homograph attack. To which I was like, what the heck is a homograph attack? And that question led me down a rabbit hole of international internet governance, handfuls of RFCs, and a decades-old debate about the global nature of the internet.
We have to start with ICANN, the main international internet body in charge in this story. ICANN makes all the rules about what can and cannot be a domain name (along with performing the technical maintenance of the DNS root zone registries and maintaining the namespaces of the Internet).
For example, say I go to Namecheap to register "loganisthemostawesome.com". Namecheap uses the “extensible provisioning protocol” to verify my name with Verisign. Verisign is the organization that manages the registry for the “.com” gTLD. Verisign checks the ICANN rules and regulations for my registration attempt, tells Namecheap the result, and Namecheap tells me if I can register "loganisthemostawesome.com". Spoilers: I can!
This is great. But I primarily speak English and I use ASCII for all my awesome businesses on the internet. What happens to all those other languages that can’t be expressed in a script compatible with ASCII?
ICANN attempted to answer this question when they proposed and implemented IDNs as a standard protocol for domain names in the late 90s. They wanted a more global internet, so they opened up domains to a variety of Unicode-represented scripts.
What's a script? A script is a collection of letters/signs for a single writing system. For example, Latin is a script that supports many languages, whereas a script like Kanji is one of the scripts supporting the Japanese language. Scripts can support many languages, and languages can be made of multiple scripts. ICANN keeps tables of all the Unicode characters it associates with any given script.
This is even better now! Through IDNs, ICANN has given us the ability to express internet communities across many scripts. However, there was one important requirement. ICANN’s Domain Name System, which performs a lookup service to translate user-friendly names into network addresses for locating Internet resources, is restricted in practice to the use of ASCII characters.
Thus ICANN turned to Punycode. Punycode is just puny Unicode. Bootstring is the algorithm that translates names written in language-native scripts (Unicode) into an ASCII text representation that is compatible with the Domain Name System (punycode).
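As a rough illustration of that translation, Python's standard library ships a `punycode` codec (the Bootstring step on a single label) and an `idna` codec (which works on whole domain names and adds the `xn--` prefix). The domain `bücher.example` is made up for this sketch and is not one from this post. Python's built-in `idna` codec follows the older IDNA2003 rules, so this shows the idea rather than exactly what a browser does.

```python
# Sketch: turn a Unicode label into the ASCII form the DNS can carry.
label = "bücher"  # hypothetical label, chosen only for illustration

# Bootstring/punycode on the bare label: the ASCII letters come first,
# then a suffix that encodes where the non-ASCII characters belong.
puny = label.encode("punycode")
print(puny)  # b'bcher-kva'

# The idna codec handles the dots and adds the "xn--" prefix per label.
ascii_domain = "bücher.example".encode("idna")
print(ascii_domain)  # b'xn--bcher-kva.example'

# Decoding recovers the Unicode form a browser might display.
print(ascii_domain.decode("idna"))  # bücher.example
```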
For example, take this fictional domain name (because we still can't have emojis in gTLDs 😭):
If you put this in your browser, the real lookup against the Domain Name System would have to use the punycode equivalent:
So, problems solved. We have a way to use domain names in Unicode scripts that represent the full global reach of the internet and can start handing out IDNs. Great! What could go wrong?
Well, things aren’t always as they seem. And this is where homographs and homoglyphs come in.
A homograph is one of several things that look or seem the same but have different meanings. We have many of these in English. For example, “lighter” could refer to the fire starter or the comparative adjective. A homoglyph is the same idea at the level of a single character: two characters that look alike but are not the same character.
The problem when it comes to IDNs is that homoglyphs exist between scripts as well, with many of the Latin letters having lookalikes in other scripts, like Greek or Cyrillic.
Example of lookalikes from homoglyphs.net:
Let's look at an example of a domain name.
Can you tell the difference? Well, let's translate both of these to purely ASCII:
Uh oh, these definitely aren't the same. However, user-agents would make them appear the same in a browser, in order to make the punycode user-friendly. In reality, the first "a" in the fake WaPo is a Cyrillic character. When translated to punycode, we can see the remaining ASCII characters, "wshingtonpost", and then a key signifying the Cyrillic a, "w1k".
This presented ICANN with a big problem. You can clearly see how these may be used in phishing attacks when user-agents render both Washington Posts as homographs, making them look exactly the same. So what was ICANN to do?
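To make the difference concrete, here is a small Python sketch that builds the lookalike described above by swapping the first "a" for the Cyrillic а (U+0430). It uses only the standard library, and the variable names are mine.

```python
import unicodedata

real = "washingtonpost.com"
fake = "w\u0430shingtonpost.com"  # second character is CYRILLIC SMALL LETTER A

# They render alike, but they are different strings.
print(real == fake)  # False

# Name every position where the two strings differ.
for r, f in zip(real, fake):
    if r != f:
        print(unicodedata.name(r), "vs", unicodedata.name(f))
# LATIN SMALL LETTER A vs CYRILLIC SMALL LETTER A

# The ASCII forms that actually go to the DNS are clearly different.
print(real.encode("idna"))  # b'washingtonpost.com'
print(fake.encode("idna"))  # b'xn--wshingtonpost-w1k.com'
```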
By 2005, ICANN had figured out a solution. They told gTLD registrars they had to restrict mixed scripts. Every single registered domain had to have a "label" on it to indicate the single pure script that the domain name would use to support its language. Today, if you went and tried to register our copy-cat Washington Post at xn--wshingtonpost-w1k.com, you would get an error. Note: There were a few exceptions made, however, for languages that need to be mixed script, like Japanese.
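A mixed-script check could look something like the sketch below. It is not how any registry actually implements the rule. Python's standard library does not expose the Unicode script property, so I assume that the first word of a character's Unicode name ("LATIN", "CYRILLIC", "GREEK") is a good enough stand-in for its script. `scripts_in` and `is_mixed_script` are hypothetical helper names.

```python
import unicodedata

def scripts_in(label):
    """Approximate the set of scripts used in a label.

    Assumption: the first word of the Unicode character name stands in
    for the script. Digits and hyphens are treated as script-neutral.
    """
    scripts = set()
    for ch in label:
        if ch.isdigit() or ch == "-":
            continue
        scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts

def is_mixed_script(label):
    """True when the label draws on more than one (approximate) script."""
    return len(scripts_in(label)) > 1

print(sorted(scripts_in("w\u0430shingtonpost")))  # ['CYRILLIC', 'LATIN']
print(is_mixed_script("w\u0430shingtonpost"))     # True
print(is_mixed_script("washingtonpost"))          # False
```

A real policy would also need an allowlist for legitimate combinations, since a heuristic this blunt would flag ordinary Japanese labels that mix Kanji and Hiragana.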
Problem fixed, right? Well, while mixed scripts are not allowed, pure scripts are still perfectly fine according to ICANN's guidelines. Thus, we still have a problem. What about pure scripts in Cyrillic or Greek alphabets that look like the Latin characters? How many of those could there be?
To find out, we used a script that:
- Takes the top 1 million websites
- For each domain, checks whether its letters are confusable with Latin or decimal characters
- Checks through a WHOIS lookup whether the punycode URL for that domain is registered
- Returns all the available domains we could register
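The steps above can be sketched in a few lines of Python. This is a toy under stated assumptions, not the actual tool. `LATIN_TO_CYRILLIC` is a tiny hand-picked subset of lookalikes, `available_lookalikes` is a hypothetical name, and the WHOIS step is passed in as a function (`is_registered`) because a real lookup needs a network client that I am not going to guess at here.

```python
# Hand-picked Latin -> Cyrillic lookalikes (illustrative subset only).
LATIN_TO_CYRILLIC = {
    "a": "\u0430", "c": "\u0441", "e": "\u0435", "o": "\u043e",
    "p": "\u0440", "x": "\u0445", "y": "\u0443", "l": "\u04cf",
}

def cyrillic_lookalike(label):
    """Return an all-Cyrillic lookalike, or None if any letter has no match."""
    try:
        return "".join(LATIN_TO_CYRILLIC[ch] for ch in label)
    except KeyError:
        return None

def available_lookalikes(domains, is_registered):
    """Return punycode lookalike domains that `is_registered` says are free."""
    results = []
    for domain in domains:
        label, _, tld = domain.partition(".")
        lookalike = cyrillic_lookalike(label)
        if lookalike is None:
            continue  # at least one letter has no Cyrillic twin in our table
        ascii_label = "xn--" + lookalike.encode("punycode").decode("ascii")
        candidate = ascii_label + "." + tld
        if not is_registered(candidate):
            results.append(candidate)
    return results

# Stand-in for the WHOIS lookup: pretend nothing is registered yet.
print(available_lookalikes(["apple.com", "kickstarter.com"], lambda d: False))
# ['xn--80ak6aa92e.com']
```

"kickstarter" drops out because letters like "k" and "t" have no entry in this small table. A fuller confusables list would change which domains survive the filter.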
A lot of the URLs are a little off looking with the Cyrillic (also a lot of the top 1 million websites are porn), but we found some interesting ones you could register.
For example, here's my personal favorite. In both Firefox and Chrome, visit:
Here's what they look like in those browsers.
Pretty cool! In Firefox, it totally looks like the official PayPal in the address bar! However, in Chrome, it resolves to punycode. Why is that? 🤔
It is because Chrome and Mozilla use different Internationalized Domain Name display algorithms. Chrome's algorithm is much stricter and more complex than Mozilla's, and includes special logic to protect against homograph attacks. If the domain name is on a gTLD and all of its letters are confusable Cyrillic, Chrome shows punycode in the browser rather than the Unicode characters. Chrome only changed this recently because of Xudong Zheng’s 2017 report using www.xn--80ak6aa92e.com as a POC.
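Here is a heavily simplified sketch of that one rule. It is not Chrome's actual algorithm, which has many more checks. `SAMPLE_GTLDS` and `LATIN_LOOKALIKE_CYRILLIC` are small, hand-picked sets I am assuming for illustration, and `display_form` is a hypothetical function name.

```python
# Assumed, incomplete sets for illustration only.
SAMPLE_GTLDS = {"com", "net", "org"}
LATIN_LOOKALIKE_CYRILLIC = set(
    "\u0430\u0441\u0435\u043e\u0440\u0445\u0443\u04cf\u0456\u0458\u0455"
)

def display_form(domain):
    """Return what a cautious address bar might show for a Unicode domain."""
    labels = domain.split(".")
    tld = labels[-1]
    shown = []
    for label in labels[:-1]:
        all_lookalike = bool(label) and all(
            ch in LATIN_LOOKALIKE_CYRILLIC for ch in label
        )
        if tld in SAMPLE_GTLDS and all_lookalike:
            # Every letter could pass for Latin: fall back to punycode.
            shown.append("xn--" + label.encode("punycode").decode("ascii"))
        else:
            shown.append(label)
    return ".".join(shown + [tld])

# The all-lookalike label from Xudong Zheng's proof of concept.
print(display_form("\u0430\u0440\u0440\u04cf\u0435.com"))  # xn--80ak6aa92e.com

# A Cyrillic label with letters that have no Latin twin stays readable.
print(display_form("\u043f\u0440\u0438\u043c\u0435\u0440.com"))  # пример.com
```

The second case is the point of the tradeoff: a rule this narrow leaves most genuinely Cyrillic names readable in their own script.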
Firefox, on the other hand, still shows the full URL in its intended script, even if it's confusable with Latin characters. I want to point out that Firefox allows you to change your settings to always show punycode in the browser, but if you often use sites that aren't ASCII domains, this can make them pretty inaccessible.
So what, now, is our responsibility as application developers and maintainers if we think someone might use our site to phish people using a homograph? I can see a couple paths forward:
- Advocate for Mozilla and other user-agents to change their algorithms to protect users.
- Advocate that ICANN changes its rules around registering domains with Latin confusable characters.
- Implement our own display algorithms. This is what we ended up doing at Kickstarter. We used Google's open-source algorithm and show a warning if it's possible that the URL shown on the page is a homograph for another URL.
- Finally, we could just register these domains like @frewsxcv and I did with PayPal so that they aren't able to be used maliciously. Possibly, if we are part of an organization with a susceptible domain, we should just register it.
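For the third option, a warning check on a user-supplied URL could start out like the sketch below. To be clear, this is not Kickstarter's implementation and not Google's algorithm. It is a minimal illustration that assumes a tiny hand-picked `CYRILLIC_TO_LATIN` table and hypothetical helper names, and it only catches all-Cyrillic lookalikes of ASCII hostnames. It also does not handle malformed punycode, which would raise an error.

```python
from urllib.parse import urlsplit

# Assumed, incomplete lookalike table for illustration only.
CYRILLIC_TO_LATIN = {
    "\u0430": "a", "\u0441": "c", "\u0435": "e", "\u043e": "o",
    "\u0440": "p", "\u0445": "x", "\u0443": "y", "\u04cf": "l",
}

def to_unicode_host(host):
    """Decode any xn-- labels so the host can be inspected as Unicode."""
    labels = []
    for label in host.split("."):
        if label.isascii() and label.startswith("xn--"):
            label = label[4:].encode("ascii").decode("punycode")
        labels.append(label)
    return ".".join(labels)

def homograph_target(url):
    """Return the ASCII hostname this URL could be mistaken for, else None."""
    host = to_unicode_host(urlsplit(url).hostname or "")
    if host.isascii():
        return None  # plain ASCII host: nothing to impersonate with
    skeleton = "".join(CYRILLIC_TO_LATIN.get(ch, ch) for ch in host)
    return skeleton if skeleton.isascii() else None

print(homograph_target("https://www.xn--80ak6aa92e.com/"))  # www.apple.com
print(homograph_target("https://www.kickstarter.com/"))     # None
```

When `homograph_target` returns a hostname, the page could show a warning next to the link instead of blocking it outright, which keeps legitimate IDNs usable.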
To summarize, this is a hard problem! That's why it's been around for two decades. And fundamentally, that is what I find so interesting about the issues surfaced by this attack. I personally think ICANN did the right thing in allowing IDNs in various scripts. The internet should be more accessible to all.
I like Chrome's statement in support of their display algorithm, however, which nicely summarizes the tradeoffs at play:
We want to prevent confusion, while ensuring that users across languages have a great experience in Chrome. Displaying either punycode or a visible security warning on too wide of a set of URLs would hurt web usability for people around the world.
The internet is full of these tradeoffs around accessibility versus security. As users and maintainers of this wonderful place, I find conversations like these to be one of the best parts of building our world together.
Now, we just gotta get some emoji support.
Thanks for reading! 🌍💖🎉🙌🌏
- Wikipedia on Homograph Attacks
- Wikipedia on IDNs
- Plagiarism Detection in Texts Obfuscated with Homoglyphs
- A Collective Intelligence Approach to Detecting IDN Phishing by Shian-Shyong Tseng, Ai-Chin Lu, Ching-Heng Ku, and Guang-Gang Geng
- Exposing Homograph Obfuscation Intentions by Coloring Unicode Strings by Liu Wenyin, Anthony Y. Fu, and Xiaotie Deng
- Phishing with Unicode Domains by Xudong Zheng
- The Homograph Attack by Evgeniy Gabrilovich and Alex Gontmakher NOTE: The original paper!
- Cutting through the Confusion: A Measurement Study of Homograph Attacks by Tobias Holgers, David E. Watson, and Steven D. Gribble
- Assessment of Internationalised Domain Name Homograph Attack Mitigation by Peter Hannay and Christopher Bolan
- Multilingual web sites: Internationalized Domain Name homograph attacks by Johnny Al Helou and Scott Tilley
- IDN Homograph Attack Potential Impact Analysis by @jsidrach
- Chrome's IDN Display Algorithm
- Mozilla's IDN Display Algorithm
- UTC Mixed Script Detection Security Mechanisms
- Chrome's IDN Spoof Checker
- Bugzilla Open Bug on IDNs
- Homograph Attack Generator for Mixed Scripts NOTE: It is no longer possible to register mixed script domain names.
- Homograph Attack Finder for Pure Cyrillic Scripts
- Homograph Attack Finder + WHOIS lookup for Pure Cyrillic Scripts
- Homoglyph Dictionary
- Punycode converter
- 2005 IDN Version 2.0 Guidelines
- ICANN 2005 RFC Announcement for Version 2.0 of IDN Guidelines
- IDNA2008 Version 2.2 draft
- 2011 IDN Version 3.0 Guidelines
- Wikipedia for TLD. Each TLD has its own Registry that manages it and defines its IDN rules.
- Wikipedia for Domain Name Registry, like Verisign
- Wikipedia for Domain Name Registrar, like Namecheap, Godaddy, or Gandi.net
- ICANN Accreditation and Verisign Certification for distributing .com domains
- Wikipedia for the Extensible Provisioning Protocol, which is used when a user on a registrar requests a .com domain. The registrar uses EPP to communicate with Verisign to register the domain.
- Verisign's IDN Policy. Verisign requires you to specify a three-letter language tag associated with the domain upon registration. This tag determines which character scripts you can use in the domain. Presumably the language tag for https://аррӏе.com/ (Cyrillic) is 'RUS' or 'UKR'.
- IDN rules from PIR, manager of the .org TLD