A Unicode Lookup Tool That Shows UTF-8 Bytes, Surrogate Pairs, and All Four Normalization Forms
Type a character, get its codepoint. Type a codepoint, get the character. See the UTF-8 byte sequence, the UTF-16 code units, the surrogate pair (if applicable), the HTML entity, the CSS and JS escapes, the Unicode block, and the normalization forms. It's a calculator for the Unicode questions that come up all the time in practice.
Every developer hits Unicode weirdness eventually. "Why is this string length 2 when I see one character?" (Surrogate pair.) "Why does 'é' compare unequal to 'é'?" (Composed vs decomposed — NFC vs NFD.) "How do I paste this emoji into a CSS file?" (You need \1F600, not \u1F600.)
🔗 Live demo: https://sen.ltd/portfolio/unicode-lookup/
📦 GitHub: https://github.com/sen-ltd/unicode-lookup
Features:
- 4 search modes: by character, by codepoint, by name, by block
- UTF-8 bytes (1-4 byte sequences)
- UTF-16 code units (with surrogate pair detection)
- HTML entity, CSS escape, JS escape formats
- All 4 Unicode normalization forms (NFC, NFD, NFKC, NFKD)
- 59 Unicode blocks browsable
- Japanese / English UI
- Zero dependencies, 75 tests
UTF-8 encoding
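UTF-8 is variable-length: ASCII takes 1 byte, codepoints up to U+07FF (accented Latin letters, Greek, Cyrillic, Hebrew, Arabic) take 2 bytes, the rest of the BMP (including most CJK) takes 3 bytes, and supplementary-plane characters such as emoji take 4 bytes.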
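export function toUTF8Bytes(cp) {
  // 1 byte: 0xxxxxxx (U+0000 to U+007F)
  if (cp < 0x80) return [cp];
  // 2 bytes: 110xxxxx 10xxxxxx (U+0080 to U+07FF)
  if (cp < 0x800) return [
    0xC0 | (cp >> 6),
    0x80 | (cp & 0x3F),
  ];
  // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx (U+0800 to U+FFFF)
  if (cp < 0x10000) return [
    0xE0 | (cp >> 12),
    0x80 | ((cp >> 6) & 0x3F),
    0x80 | (cp & 0x3F),
  ];
  // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx (U+10000 to U+10FFFF)
  return [
    0xF0 | (cp >> 18),
    0x80 | ((cp >> 12) & 0x3F),
    0x80 | ((cp >> 6) & 0x3F),
    0x80 | (cp & 0x3F),
  ];
}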
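One way to sanity-check a hand-written encoder like this is to compare its output with the platform's built-in `TextEncoder`, which always produces UTF-8. The following is a minimal sketch, assuming a runtime that provides `TextEncoder` (modern browsers, Node.js); `utf8Hex` is a hypothetical helper name, not something from the tool itself:

```javascript
// Hypothetical helper: UTF-8 bytes of a string as uppercase hex pairs.
function utf8Hex(str) {
  const bytes = new TextEncoder().encode(str); // Uint8Array of UTF-8 bytes
  return Array.from(bytes, (b) => b.toString(16).toUpperCase().padStart(2, "0"));
}

// One sample per byte-length category.
console.log(utf8Hex("A").join(" "));         // 41
console.log(utf8Hex("\u00E9").join(" "));    // C3 A9        (é, U+00E9)
console.log(utf8Hex("\u65E5").join(" "));    // E6 97 A5     (日, U+65E5)
console.log(utf8Hex("\u{1F600}").join(" ")); // F0 9F 98 80  (😀, U+1F600)
```

The strings are written with escapes rather than literal characters so the result does not depend on how the source file itself was saved.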
The lead byte tells you how many continuation bytes follow:
- 0xxxxxxx → 1 byte (ASCII)
- 110xxxxx → 2 bytes (1 continuation)
- 1110xxxx → 3 bytes (2 continuations)
- 11110xxx → 4 bytes (3 continuations)
Continuation bytes always have the form 10xxxxxx, a pattern no lead byte uses. This self-synchronizing property is why UTF-8 is so robust: from any position in a byte stream, you can find a character boundary by scanning for a non-continuation byte.
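The boundary scan described above can be sketched in a few lines. This is an illustration only, assuming well-formed UTF-8 input; `isContinuation`, `charStart`, and `sequenceLength` are hypothetical names, not functions from the tool:

```javascript
// A continuation byte has the bit pattern 10xxxxxx.
function isContinuation(byte) {
  return (byte & 0xC0) === 0x80;
}

// Walk backwards from index i to the lead byte of the character containing it.
function charStart(bytes, i) {
  while (i > 0 && isContinuation(bytes[i])) i--;
  return i;
}

// Read the sequence length off the lead byte.
// Returns 0 for a continuation byte or an invalid lead byte.
function sequenceLength(lead) {
  if (lead < 0x80) return 1;
  if ((lead & 0xE0) === 0xC0) return 2;
  if ((lead & 0xF0) === 0xE0) return 3;
  if ((lead & 0xF8) === 0xF0) return 4;
  return 0;
}

// "A😀B" as UTF-8: 41 | F0 9F 98 80 | 42
const sample = [0x41, 0xF0, 0x9F, 0x98, 0x80, 0x42];
console.log(charStart(sample, 3));      // 1 (index 3 is inside the emoji)
console.log(sequenceLength(sample[1])); // 4
```

Because the scan never has to look back more than three bytes, a decoder that lands in the middle of a stream can resynchronize almost immediately.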
UTF-16 surrogate pairs
JavaScript strings are UTF-16. Codepoints above 0xFFFF lie outside the BMP (Basic Multilingual Plane) and can't fit in a single 16-bit unit, so they're encoded as surrogate pairs:
export function toUTF16CodeUnits(cp) {
  if (cp <= 0xFFFF) return [cp];
  const offset = cp - 0x10000;
  const high = 0xD800 + (offset >> 10);
  const low = 0xDC00 + (offset & 0x3FF);
  return [high, low];
}
High surrogates are in 0xD800-0xDBFF, low surrogates in 0xDC00-0xDFFF. A 20-bit offset (0 to 0xFFFFF) is split into 10 high bits and 10 low bits.
This is why "😀".length === 2 in JavaScript — the emoji is one codepoint but two code units.
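Going the other way, from a surrogate pair back to a codepoint, is the same arithmetic in reverse. The sketch below uses a hypothetical helper, `fromSurrogatePair`, and compares it with the built-in `codePointAt`; it assumes the two inputs really are a valid high/low pair and does no validation:

```javascript
// Hypothetical helper: combine a high and low surrogate into one codepoint.
function fromSurrogatePair(high, low) {
  return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
}

const grin = "\u{1F600}"; // 😀

console.log(grin.length);                                    // 2 (code units)
console.log(grin.charCodeAt(0).toString(16));                // d83d (high surrogate)
console.log(grin.charCodeAt(1).toString(16));                // de00 (low surrogate)
console.log(fromSurrogatePair(0xD83D, 0xDE00).toString(16)); // 1f600
console.log(grin.codePointAt(0).toString(16));               // 1f600
console.log([...grin].length);                               // 1 (codepoints)
```

String iteration (`for...of`, spread) walks codepoints rather than code units, which is why `[...grin].length` gives 1 where `grin.length` gives 2.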
The four normalization forms
Unicode has multiple ways to encode the same visual character:
- é as a single codepoint U+00E9 (precomposed)
- é as U+0065 (e) + U+0301 (combining acute) (decomposed)
These look identical but don't compare equal. Normalization fixes this:
- NFC: Composition (precompose where possible)
- NFD: Decomposition (break into base + combining marks)
- NFKC: Compatibility + Composition (also folds compatibility characters such as the "ﬁ" ligature, U+FB01, and superscripts into their plain equivalents)
- NFKD: Compatibility + Decomposition
const forms = {
  NFC: str.normalize('NFC'),
  NFD: str.normalize('NFD'),
  NFKC: str.normalize('NFKC'),
  NFKD: str.normalize('NFKD'),
};
String comparison, database keys, filename equality — all of these can bite you if you don't normalize first.
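To make the difference concrete, here is a small sketch using only the standard `String.prototype.normalize`. The helper name `sameText` is an assumption for illustration, and picking NFC as the comparison form is just one reasonable choice:

```javascript
const composed = "\u00E9";    // é as a single codepoint
const decomposed = "e\u0301"; // e + combining acute accent

console.log(composed === decomposed);                   // false
console.log(composed.normalize("NFD") === decomposed);  // true
console.log(decomposed.normalize("NFC") === composed);  // true
console.log(composed.length, decomposed.length);        // 1 2

// Compatibility forms also fold characters like the U+FB01 ligature.
const ligature = "\uFB01";
console.log(ligature.normalize("NFC") === ligature);    // true (NFC leaves it alone)
console.log(ligature.normalize("NFKC"));                // fi (two plain letters)

// Hypothetical helper: compare two strings after normalizing both to NFC.
function sameText(a, b) {
  return a.normalize("NFC") === b.normalize("NFC");
}
console.log(sameText(composed, decomposed));            // true
```

NFKC and NFKD are lossy (the ligature does not come back), so they tend to suit search and matching better than storage.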
Escape formats
Each language's escape syntax is slightly different:
// HTML: &#x1F600;
// CSS: \1F600 (note: no 0x, no braces, can need trailing space)
// JS ES2015+: \u{1F600}
// JS legacy: \uD83D\uDE00 (surrogate pair)
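The CSS escape is the weirdest: it's a backslash followed by one to six hex digits. When fewer than six digits are used and the next character is itself a hex digit, a single whitespace character must follow the escape to terminate it; otherwise that next character would be read as part of the escape.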
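As a rough sketch of how these four formats can be generated from a codepoint (the function names and the `next` parameter are assumptions for illustration, not the tool's actual API):

```javascript
function toHex(cp) {
  return cp.toString(16).toUpperCase();
}

// HTML numeric character reference, e.g. &#x1F600;
function htmlEntity(cp) {
  return "&#x" + toHex(cp) + ";";
}

// CSS escape. `next` is the character that will follow the escape, if any.
// A terminating space is added when the following character would otherwise
// be swallowed: a hex digit after a short escape, or any whitespace
// (CSS consumes one whitespace character after an escape).
function cssEscape(cp, next = "") {
  const hex = toHex(cp);
  const nextIsHex = /^[0-9a-fA-F]/.test(next);
  const nextIsSpace = /^[ \t\n\r\f]/.test(next);
  const needsSpace = (hex.length < 6 && nextIsHex) || nextIsSpace;
  return "\\" + hex + (needsSpace ? " " : "");
}

// ES2015+ codepoint escape, e.g. \u{1F600}
function jsEscape(cp) {
  return "\\u{" + toHex(cp) + "}";
}

// Pre-ES2015 escape: one \uXXXX per UTF-16 code unit.
function jsLegacyEscape(cp) {
  if (cp <= 0xFFFF) return "\\u" + toHex(cp).padStart(4, "0");
  const offset = cp - 0x10000;
  const high = 0xD800 + (offset >> 10);
  const low = 0xDC00 + (offset & 0x3FF);
  return "\\u" + toHex(high) + "\\u" + toHex(low);
}

console.log(htmlEntity(0x1F600));     // &#x1F600;
console.log(cssEscape(0x1F600, "z")); // \1F600
console.log(jsEscape(0x1F600));       // \u{1F600}
console.log(jsLegacyEscape(0x1F600)); // \uD83D\uDE00
```

With `cssEscape(0x1F600, "a")` the result gains a trailing space, because `a` would otherwise be parsed as a sixth hex digit.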
Tests
75 tests covering: codepoint parsing in all formats, UTF-8 encoding across all byte-length categories, UTF-16 surrogate pair correctness, escape format output, block lookup, emoji detection, normalization forms, and edge cases (BOM, max codepoint 0x10FFFF, surrogate validation).
Series
This is entry #46 in my 100+ public portfolio series.
- 📦 Repo: https://github.com/sen-ltd/unicode-lookup
- 🌐 Live: https://sen.ltd/portfolio/unicode-lookup/
- 🏢 Company: https://sen.ltd/
