Logan McDonald

Posted on Feb 16, 2018 • Edited on Mar 23, 2018

Homographs, Attack!

#security

We've known about Homograph attacks since the 1990s -- so you may be wondering why I'm writing about them in 2018. Don't worry, I'll get to that. In this post, we'll explore the history of homograph attacks, and why, like many of the internet’s problems that stem from path dependence, it seems like they just won’t ever go away.

Origins of my Interest

I first got interested in homograph attacks a few months back when I was working through tickets for Kickstarter's Hackerone program. HackerOne is a "bug bounty program", or, an invitation that hackers and security researchers find vulnerabilities in our site in exchange for money.

When I was looking through the tickets, one caught my attention. It wasn't a particularly high risk vulnerability, but I didn't understand a lot of words in the ticket, so of course I was interested. The hacker was concerned about Kickstarter's profile pages. (We often get reports about our profile and project pages.)

Profile pages often create vulnerabilities for websites. Whenever you are in the position to “host” someone on your site, you are going to have to think about the ways they’ll abuse that legitimacy you give them. Our hacker was specifically concerned about a field that allows our users to add user-urls or "websites" to their profile.

They thought this section could be used in a homograph attack. To which I was like, what the heck is a homograph attack? And that question lead me down a rabbit hole of international internet governance, handfuls of RFCs, and a decades-old debate about the global nature of the internet.

Internet Corporation for Names and Numbers (ICANN)

We have to start with ICANN, the main international internet body in charge in this story. ICANN makes all the rules about what can and cannot be a domain name (along with performing the technical maintenance of the DNS root zone registries and maintaining the namespaces of the Internet).

For example, say you go to Namecheap to register "loganisthemostawesome.com". Namecheap uses the “extensible provisioning protocol” to verify your name with Verisign. Verisign is the organization that manages the registry for the “.com” gTLD. Versign checks the ICANN rules and regulations for your registration attempt, tells Namecheap the result, and Namecheap tells me if I can register "loganisthemostawesome.com". Spoilers: I can!

This is great. But I primarily speak English and I use ASCII for all my awesome businesses on the internet. What happens to all those other languages that can’t be expressed in a script compatible with ASCII?

Version 1 of Internationalized Domain Names

ICANN attempted to answer this question when they proposed and implemented IDNs as a standard protocol for domain names in the late 90s. They wanted a more global internet so they opened up domains to a variety of unicode represented scripts.

What's a script? A script is a collection of letters/signs for a single system. For example, Latin is a script that supports many languages, whereas a script like Kanji is one of the scripts supporting the Japanese language. Scripts can support many languages, and languages can be made of multiple scripts. ICANN keeps tables of all unicode character it associates with any given script.

This is even better now! Through IDNs, ICANN has given us the ability to express internet communities across many scripts. However, there was one important requirement. ICANN’s Domain Name System, which performs a lookup service to translate user-friendly names into network addresses for locating Internet resources, is restricted in practice to the use of ASCII characters.

Punycode

Thus ICANN turned to Punycode. Punycode is just puny unicode. Bootstring is the algorithm that translates names written in language-native scripts (unicode) into an ASCII text representation that is compatible with the Domain Name System (punycode).

For example, take this fictional domain name (because we still can't have emojis in gTLDs 😭):

hi👋friends💖🗣.com

If you put this in your browser, the real lookup against the Domain Name System would have to use the punycode equivalent:

xn--hifriends-mq85h1xad5j.com

So, problems solved. We have a way to use domain names in unicode scripts that represent the full global reach of the internet and can start handing out IDNs. Great! What could go wrong?

Homographs

Well, things aren’t always as they seem. And this is where homographs and homoglyphs come in.

A homograph refers to multiple things that look or seem the same, but have different meanings. We have many of these in English, for example “lighter” could refer to the fire starter or the comparative adjective.

The problem when it comes to IDNs is that homoglyphs exist between scripts as well, with many of the Latin letters having copies in other scripts, like Greek or Cyrillic.

Example of lookalikes from homoglyphs.net:

Let's look at an example of a domain name.

washingtonpost.com

wаshingtonpost.com

Can you tell the difference? Well, let's translate both of these to purely ASCII:

washingtonpost.com

xn--wshingtonpost-w1k.com

Uh oh, these definitely aren't the same. However, user-agents would make them appear the same in a browser, in order to make the punycode user-friendly. But in reality, the first "a" in the fake-WaPo is really a Cyrillic character. When translated to punycode we can see the ASCII remaining characters, "wshingtonpost" and then a key signifying the Cyrillic a, "w1k".

This presented ICANN with a big problem. You can clearly see how these may be used in phishing attacks when user-agents interpret both Washington Post's as homographs, making them look exactly same. So what was ICANN to do?

Internationalized Domain Names Version 2 & 3

By 2005, ICANN had figured out a solution. They told gTLD registrars they had to restrict mix scripts. Every single registered domain had to have a "label" on it to indicate the single pure script that the domain name would use to support it's language. Today, if you went and tried to register our copy-cat Washington Post at xn--wshingtonpost-w1k.com, you would get an error. Note: There were a few exceptions made, however, for languages that need to be mixed script, like Japanese.

Problem fixed, right? Well, while mixed scripts are not allowed, pure scripts are still perfectly fine according to ICANN's guidelines. Thus, we still have a problem. What about pure scripts in Cyrillic or Greek alphabets that look like the Latin characters? How many of those could there be?

Proof of Concept

Well, when I was talking to my friend @frewsxcv about homograph attacks, he had the great idea to make a script to find susceptible urls for the attack. So I made a homograph attack detector that:

Takes the top 1 million websites
For each domain, checks if letters in each are confusable with latin or decimal
Checks to see if the punycode url for that domain is registered through a WHOIS lookup
Returns all the available domains we could register

A lot of the URLs are a little off looking with the Cyrillic (also a lot of the top 1 million websites are porn), but we found some interesting ones you could register.

For example, here's my personal favorite. In both Firefox and Chrome, visit:

https://раураӏ.com/

Here's what they look like in those Browsers.

Firefox:

Chrome:

Pretty cool! In Firefox, it totally looks like the official PayPal in the address bar! However, in Chrome, it resolves to punycode. Why is that? 🤔

User-Agents & Their Internationalized Domain Names Display Algorithms

It is because Chrome and Mozilla use different Internationalized Domain Name Display Algorithms. Chrome's algorithm is much stricter and more complex than Mozilla's, and includes special logic to protect against homograph attacks. Chrome checks to see if the domain name is on a gTLD and all the letters are confusable Cyrillic, then it shows punycode in the browser rather than the unicode characters. Chrome only changed this recently because of Xudong Zheng’s 2017 report using www.xn--80ak6aa92e.com as a POC.

Firefox, on the other hand, still shows the full URL in its intended script, even if it's confusable with Latin characters. I want to point out that Firefox allows you to change your settings to always show punycode in the Browser, but if you often use sites that aren't ASCII domains, this can be pretty inaccessible.

So, what's next?

So what, now, is our responsibility as application developers and maintainers if we think someone might use our site to phish people using a homograph? I can see a couple paths forward:

Advocate to Mozilla and other user-agents to make sure to change their algorithms to protect users.
Advocate that ICANN changes its rules around registering domains with Latin confusable characters.
Implement our own display algorithms. This is what we ended up doing at Kickstarter. We used Google's open-source algorithm and show a warning if it's possible that the url shown on the page is a homograph for another url.
Finally, we could just register these domains like @frewsxcv and I did with PayPal so that they aren't able to be used maliciously. Possibly, if we are part of an organization with a susceptible domain, we should just register it.

To summarize, this is a hard problem! That's why it's been around for two decades. And fundamentally what I find so interesting about the issues surfaced by this attack. I personally think ICANN did the right thing in allowing IDNs in various scripts. The internet should be more accessible to all.

I like Chrome's statement in support of their display algorithm, however, which nicely summarizes the tradeoffs as play:

We want to prevent confusion, while ensuring that users across languages have a great experience in Chrome. Displaying either punycode or a visible security warning on too wide of a set of URLs would hurt web usability for people around the world.

The internet is full of these tradeoffs around accessibility versus security. As users and maintainers of this wonderful place, I find conversations like these to be one of the best parts of building our world together.

Now, we just gotta get some emoji support.

Thanks for reading! 🌍💖🎉🙌🌏

Resources

Background

Browser policies

Tools

Homograph Attack Generator for Mixed Scripts NOTE: It is no longer possible to register mixed script domain names.
Homograph Attack Finder for Pure Cyrillic Scripts
Homograph Attack Finder + WHOIS lookup for Pure Cyrillic Scripts
Homoglyph Dictionary
Puncode converter

ICANN CFPs and Guidelines

ICANN, Verisign, and the Domain Registration Process

Wikipedia for TLD. Each TLD has its own Registry that manages it and defines its IDN rules.
Wikipedia for Domain Name Registry, like Verisign
Wikipedia for Domain Name Registrar, like Namecheap, Godaddy, or Gandi.net
ICANN Accreditation and Verisign Certification for distributing .com domains
Wikipedia for the Extensible Provisioning Protocol, which is used when a user on a registry requests a .com domain. The registry uses the EPP protocol to communicate with verisign to register the domain.
Verisign's IDN Policy. Verisign requires you specify a three letter language tag associated with the domain upon registration. this tag determines which character scripts you can use in the domain. presumably the language tag for https://аррӏе.com/ (cyrillic) is 'RUS' or 'UKR'.
PIR, manager of .org TLDs, IDN rules

Misc Security related to Domains

Extended Validation Is Broken by @iangcarroll

Homograph Major Site Copy-cat Examples

http://аоӏ.com/
https://раураӏ.com/
https://аррӏе.com/
http://www.спп.com/

Top comments (30)

Ryan Palo • Feb 16 '18

+1 for the sheer body of research attached to this post :)

Sergey Kislyakov • Feb 16 '18

Firefox users: you can go to about:config and switch network.IDN_show_punycode to true.

Logan McDonald • Feb 16 '18

Yep! Unfortunately this always shows punycode for all IDNs not just malicious ones. Wish they'd come up with a solution as a default for just the potentially malicious ones like Chrome did!

Sergey Kislyakov • Feb 16 '18

Or they could show it like https://pаypal.com/ (punycode there)

Logan McDonald • Feb 16 '18

Yes! This is similar to what IE does with IDNs, by showing an informational alert that you're on one as a pop up. (Not sure which IE version does this). Some have suggested color coating non-ASCII text as well. Lots of potential solutions 😊

Gabriel Manoel • Feb 16 '18 • Edited

Interesting. I've never heard of homographs attack before.

Kasey Speakman • Feb 16 '18

I learned quite a bit. Thank you!

Andy Zhao (he/him) • Feb 16 '18 • Edited

Great post, ~~also an interesting tidbit with Firefox is that it suggests the real PayPal in the link:~~

As I was writing this I realized you put that icon there. Awesome touch! Definitely fooled me 🙈

Jess Lee • Feb 16 '18

Wow, that is all super interesting.

Billie Hilton • Feb 17 '18 • Edited

Implement our own display algorithms. This is what we ended up doing at Kickstarter. We used Google's open-source algorithm and show a warning if it's possible that the url shown on the page is a homograph for another url.

If you published this as an npm package (e.g. sanitizeHomograph(url)) then all of us could use it to sanitize URLs we display on profile pages.

Logan McDonald • Feb 17 '18

Kickstarter is about to publish the ruby code as a gem! Would be down to do in js as well 😊

Logan McDonald • Jun 18 '18

sorry this took a while! github.com/kickstarter/ruby-homogr...

edA‑qa mort‑ora‑y • Feb 18 '18

When punycode first came out Firefox would only display the unicode version on a whitelisted set of TLDs. The rule, if I recall, was that a registrar must have published a policy on how they avoid the registration of homographs. This meant, for example, that .de would be okay since the registrar policy was limited script, but .com would always show punycode since it was a free-for-all.

I kind of think this is registrar problem. The registration of homographs on common script characters should just be rejected.

Logan McDonald • Feb 18 '18

Great proposal! I think, based on my reading of ICANN's meeting minutes and IDN RFCs, that as an international organization they are worried limiting some scripts that support non-ASCII languages would be an overreach in favor of English speakers and Latin. They are taking time to make sure that whatever decision they make doesn't over-exclude non-latin-language speakers. (And in the meantime hoping the Browsers just do this for them 😉.) Turns out internet governance is just as slow-paced as any other kind of governance.