From combining marks and emoji to astral planes, Unicode is underappreciated and poorly understood. Understanding it matters beyond localization and diversity: misunderstanding Unicode can lead to vulnerabilities in your code.
One lesser-known pitfall is the Unicode case mapping collision. Loosely speaking, a collision occurs when two different characters are uppercased or lowercased into the same character. The effect commonly shows up at the boundary between two different protocols, such as email and domain names.
A quick example
'ß'.toUpperCase() // 'SS'
'ß'.toUpperCase() === 'ss'.toUpperCase() // true
// Note the Turkish dotless i
'John@Gıthub.com'.toUpperCase() === 'John@Github.com'.toUpperCase() // true
Transformation Collisions
While there are many case collisions across the Unicode planes, we'll only include the characters that collide with the English alphabet. This Unicode guide includes an exhaustive list of collisions.
Uppercase
Char | Code Point | Output Char |
---|---|---|
ß | 0x00DF | SS |
ı | 0x0131 | I |
ſ | 0x017F | S |
ﬀ | 0xFB00 | FF |
ﬁ | 0xFB01 | FI |
ﬂ | 0xFB02 | FL |
ﬃ | 0xFB03 | FFI |
ﬄ | 0xFB04 | FFL |
ﬅ | 0xFB05 | ST |
ﬆ | 0xFB06 | ST |
Lowercase
Char | Code Point | Output Char |
---|---|---|
K | 0x212A | k |
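The tables above can be reproduced with a brute-force scan. The following is a minimal sketch, not the method used to build the original list: the helper name `findAsciiCollisions` is hypothetical, and the scan is limited to the Basic Multilingual Plane and to JavaScript's default (locale-independent) case mappings.

```javascript
// Hypothetical helper: find non-ASCII BMP characters whose case mapping
// produces a string made only of ASCII letters.
function findAsciiCollisions(transform) {
  const hits = [];
  for (let cp = 0x80; cp <= 0xffff; cp++) {
    // Skip the surrogate range; lone surrogates are not characters.
    if (cp >= 0xd800 && cp <= 0xdfff) continue;
    const ch = String.fromCodePoint(cp);
    const out = transform(ch);
    if (/^[A-Za-z]+$/.test(out)) {
      const hex = '0x' + cp.toString(16).toUpperCase().padStart(4, '0');
      hits.push({ ch, codePoint: cp, hex, out });
    }
  }
  return hits;
}

const upper = findAsciiCollisions((s) => s.toUpperCase());
const lower = findAsciiCollisions((s) => s.toLowerCase());

for (const { ch, hex, out } of upper) console.log('upper', ch, hex, out);
for (const { ch, hex, out } of lower) console.log('lower', ch, hex, out);
```

Results may vary slightly with the Unicode version shipped in your JavaScript engine, so treat the output as a check against the tables rather than a replacement for them.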
Case Study
Company: GitHub
Vulnerability: Password reset emails delıvered to the wrong address.
Cause: Forgot-password requests were validated against the lowercased value on file, but the email was sent to the address provided in the request.
GitHub's forgot-password feature could be compromised because the system lowercased the provided email address and compared it to the email address stored in the user database. If there was a match, GitHub would send the password reset link to the email address provided by the attacker, which was, technically speaking, not the same email address. I'll let the GitHub Security team explain further.
One Quick Note: Though not strictly required, converting the domain part to Punycode, so that John@Gıthub.com becomes John@xn--gthub-n4a.com, would have helped prevent this issue. Punycode applies to domain labels only, not to the whole address. It's doubtful that many web apps do this as part of the user registration process.
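Punycode (IDNA) conversion applies to the domain part of an address only, so a sketch of the idea touches just the text after the `@`. The helper name `toAsciiDomain` is hypothetical, and the example leans on the standard WHATWG `URL` parser (available in browsers and Node.js) to do the IDNA processing; a dedicated IDNA library may be a better fit in production.

```javascript
// Hypothetical helper: rewrite the domain part of an email address
// into its ASCII (Punycode) form, leaving the local part untouched.
function toAsciiDomain(email) {
  const at = email.lastIndexOf('@');
  if (at < 0) throw new Error('not an email address');
  const local = email.slice(0, at);
  const domain = email.slice(at + 1);
  // The URL parser lowercases the hostname and encodes non-ASCII labels.
  const asciiDomain = new URL(`http://${domain}`).hostname;
  return `${local}@${asciiDomain}`;
}

const lookalike = toAsciiDomain('John@G\u0131thub.com'); // \u0131 is the dotless i
const genuine = toAsciiDomain('John@Github.com');

// The dotless i forces an "xn--" label, so the two no longer look alike.
console.log(lookalike);
console.log(genuine);
```

Once the domain is stored in this form, the lookalike address can no longer be mistaken for the genuine one by a case-insensitive comparison.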
John discovered a flaw in the way email addresses were being normalized to standard character sets when used to look up accounts during the password recovery flow. Password reset tokens are associated with email addresses and initiating a password reset with an email address that normalizes to another email address would result in the reset token for one user being delivered to the email address of another account. The attack only works if an email provider allows Unicode in the “local part” of the email address and an attacker can claim an email address containing Unicode that would improperly normalize to the email address of another account (e.g. mike@example.org vs mıke@example.org). Unicode in the “domain part” is not allowed by GitHub's outgoing mail server and therefore cannot be used as part of a broader attack on common domains (e.g. gmail.com vs gmaıl.com).
GitHub addressed the vulnerability by making sure the email address in the database matches the email address that initiated the reset flow. This ensures that the email address used to generate the token matches the email address to which the reset token gets delivered.
This particular fix is simple: send the reset email only to the original address that was used to create the account, never to the address supplied in the reset request.
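The flaw and the fix can be sketched side by side. This is an illustration under stated assumptions, not GitHub's actual code: the `users` store, the `sendResetEmail` stub, and both reset functions are hypothetical, and the attacker's address uses the Kelvin sign (U+212A) because that is the character that collides with `k` when lowercased.

```javascript
// Hypothetical user store, keyed by the lowercased email on file.
const users = new Map([
  ['mike@example.org', { id: 1, email: 'mike@example.org' }],
]);

// Hypothetical mailer stub: records recipients instead of sending mail.
const sent = [];
const sendResetEmail = (to, token) => sent.push({ to, token });

// Lookup that lowercases the provided address before comparing.
const findUser = (providedEmail) => users.get(providedEmail.toLowerCase());

// Vulnerable: the token is mailed to the address from the request.
function resetVulnerable(providedEmail) {
  const user = findUser(providedEmail);
  if (!user) return false;
  sendResetEmail(providedEmail, `token-for-${user.id}`);
  return true;
}

// Safer: the token is mailed to the address stored on the account.
function resetFixed(providedEmail) {
  const user = findUser(providedEmail);
  if (!user) return false;
  sendResetEmail(user.email, `token-for-${user.id}`);
  return true;
}

// "miKe" spelled with the Kelvin sign instead of an ASCII k.
const attackerEmail = 'mi\u212Ae@example.org';
resetVulnerable(attackerEmail); // token goes to the attacker's address
resetFixed(attackerEmail);      // token goes to mike@example.org
console.log(sent.map((m) => m.to));
```

A stricter variant could reject the request outright when the provided address is not byte-for-byte equal to the one on file, which is closer to how the GitHub statement describes the change.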
More Unicode
Have we convinced you that Unicode is awesome? Check out our verbose guide, "Awesome Unicode", which made it to the front page of Hacker News.
Follow on GitHub
You can follow John Gracey on GitHub at github.com/jagracey.