Snappy Tools
ASCII, Unicode, and UTF-8: What Every Developer Should Know

Character encoding is one of those topics that seems irrelevant until you ship a bug caused by it. Here is a compact guide to the concepts behind ASCII, Unicode, and UTF-8 — enough to reason through encoding issues when they come up.

ASCII: the original 128

ASCII (American Standard Code for Information Interchange) was designed in the early 1960s. It maps 128 characters to integers 0–127:

  • 0–31: control characters (newline, tab, null, etc.)
  • 32–126: printable characters (letters, digits, punctuation)
  • 127: delete

Seven bits is enough for ASCII, so it fits in a single byte. Every character in the English alphabet, plus digits and common punctuation, has an ASCII code. 'A' is 65, 'a' is 97, '0' is 48.
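In JavaScript, for example, you can look these values up with charCodeAt and String.fromCharCode:

```javascript
// ASCII code lookups: character to integer and back
'A'.charCodeAt(0);        // 65
'a'.charCodeAt(0);        // 97
'0'.charCodeAt(0);        // 48
String.fromCharCode(65);  // 'A'
```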

The problem: 128 characters is nowhere near enough for the world's writing systems.

Unicode: one standard to contain them all

Unicode assigns a unique number (a "code point") to every character in every writing system — currently over 149,000 characters covering 161 scripts, plus emoji, mathematical symbols, and historic scripts.

A Unicode code point is written as U+ followed by a hex number. Examples:

  • U+0041 → A (Latin capital letter A)
  • U+00E9 → é (Latin small e with acute)
  • U+4E2D → 中 (Chinese character for "middle")
  • U+1F600 → 😀 (grinning face emoji)
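In JavaScript you can inspect code points with codePointAt and build characters with String.fromCodePoint — a quick sketch:

```javascript
// Convert between characters and Unicode code points
'中'.codePointAt(0).toString(16);  // '4e2d'
String.fromCodePoint(0xE9);        // 'é'
String.fromCodePoint(0x1F600);     // '😀'
```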

Unicode defines the characters — it does not define how to store them. That's what encoding formats are for.

UTF-8: the dominant encoding

UTF-8 encodes Unicode code points as 1 to 4 bytes:

  • U+0000 to U+007F: 1 byte, pattern 0xxxxxxx
  • U+0080 to U+07FF: 2 bytes, pattern 110xxxxx 10xxxxxx
  • U+0800 to U+FFFF: 3 bytes, pattern 1110xxxx 10xxxxxx 10xxxxxx
  • U+10000 to U+10FFFF: 4 bytes, pattern 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
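You can see these byte lengths directly with the TextEncoder API (available in browsers and Node) — a small sketch using one character from each range:

```javascript
// UTF-8 bytes for a 1-, 2-, 3-, and 4-byte character
const bytes = new TextEncoder().encode('Aé中😀');
// A  → 1 byte:  0x41
// é  → 2 bytes: 0xC3 0xA9
// 中 → 3 bytes: 0xE4 0xB8 0xAD
// 😀 → 4 bytes: 0xF0 0x9F 0x98 0x80
console.log([...bytes].map(b => b.toString(16)));
```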

Key properties of UTF-8:

  • Backward compatible with ASCII — code points 0–127 use the same byte values as ASCII. A valid ASCII file is also valid UTF-8.
  • Self-synchronising — you can tell where a multi-byte sequence starts by looking at the first byte's leading bits.
  • No byte-order ambiguity — unlike UTF-16, UTF-8 has no endianness issues.

UTF-8 is the dominant encoding for the web, APIs, databases, and file systems. If you receive text data and aren't told the encoding, assume UTF-8.

UTF-16 and UTF-32

UTF-16: uses 2 or 4 bytes per code point. Code points outside the Basic Multilingual Plane (above U+FFFF, which includes most emoji) require "surrogate pairs" — two 2-byte sequences. UTF-16 is used internally by JavaScript strings, Java, and Windows APIs.
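The surrogate-pair arithmetic is mechanical. A sketch of how a code point above U+FFFF splits into two 16-bit code units (toSurrogatePair is a hypothetical helper name):

```javascript
// Split a code point above U+FFFF into a UTF-16 surrogate pair
function toSurrogatePair(codePoint) {
  const offset = codePoint - 0x10000;    // 20 bits remain
  const high = 0xD800 + (offset >> 10);  // top 10 bits → high surrogate
  const low = 0xDC00 + (offset & 0x3FF); // bottom 10 bits → low surrogate
  return [high, low];
}

toSurrogatePair(0x1F600).map(u => u.toString(16));  // ['d83d', 'de00']
```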

UTF-32: always uses 4 bytes. Simple but wasteful — a purely ASCII document becomes 4× larger. Used in some internal processing.

JavaScript strings are UTF-16

This is a common source of bugs. JavaScript strings are encoded as UTF-16 internally, and .length returns the number of UTF-16 code units, not characters:

'hello'.length     // 5 — one code unit per character
'café'.length      // 4 — 'é' is U+00E9, one code unit
'😀'.length        // 2 — emoji requires a surrogate pair in UTF-16

To get the actual character count:

[...'😀'].length   // 1 — spread operator iterates code points, not code units
'😀'.codePointAt(0).toString(16)  // "1f600" — correct code point

The newer Intl.Segmenter is more robust for grapheme cluster counting (which handles complex emoji sequences like 👨‍👩‍👧‍👦 that consist of multiple code points).
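A sketch of grapheme counting with Intl.Segmenter (available in modern browsers and Node 16+):

```javascript
// Count user-perceived characters (grapheme clusters), not code units
const segmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });
const count = str => [...segmenter.segment(str)].length;

count('😀');    // 1 — despite .length being 2
count('café');  // 4
```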

HTML entities and Unicode

HTML entities like &amp;, &lt;, and &gt; are one way to represent characters in HTML. You can also use numeric character references directly:

  • &#65; → A (decimal)
  • &#x41; → A (hex)
  • &#x1F600; → 😀

Use the HTML Entity Encoder to convert text to safe HTML entities for embedding special characters in HTML documents.
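For the common case — escaping the five characters with special meaning in HTML — a minimal helper looks like this (escapeHtml is a hypothetical name, not a built-in):

```javascript
// Replace the five characters with special meaning in HTML text/attributes
function escapeHtml(str) {
  const map = { '&': '&amp;', '<': '&lt;', '>': '&gt;', '"': '&quot;', "'": '&#39;' };
  return str.replace(/[&<>"']/g, ch => map[ch]);
}

escapeHtml('<b>"x" & \'y\'</b>');
// '&lt;b&gt;&quot;x&quot; &amp; &#39;y&#39;&lt;/b&gt;'
```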

Base64 and binary data

Base64 encodes binary data as ASCII text. Every 3 bytes of binary input become 4 ASCII characters of output (each Base64 character carries 6 bits, and a byte is 8), so encoded data is roughly 33% larger than the original.

Why does this matter for encoding? Base64 operates on raw bytes — it does not understand Unicode. If you Base64-encode a JavaScript string that contains characters above U+007F, you need to encode the UTF-8 byte representation first, not the raw JavaScript UTF-16 string.

// WRONG for non-ASCII strings — btoa fails on characters > U+00FF
btoa('hello')  // works: "aGVsbG8="

// RIGHT for Unicode strings
function toBase64(str) {
  return btoa(encodeURIComponent(str).replace(
    /%([0-9A-F]{2})/g,
    (_, hex) => String.fromCharCode(parseInt(hex, 16))
  ));
}

Or use the Base64 Encoder/Decoder which handles Unicode correctly in the browser.
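A more direct route is to produce the UTF-8 bytes with TextEncoder and feed them to btoa — a sketch (toBase64Utf8 is a hypothetical name):

```javascript
// Encode a Unicode string as Base64 via its UTF-8 bytes
function toBase64Utf8(str) {
  const bytes = new TextEncoder().encode(str);  // UTF-8 byte array
  let binary = '';
  for (const b of bytes) binary += String.fromCharCode(b);  // one char per byte
  return btoa(binary);  // Base64 of those bytes
}

toBase64Utf8('café');  // 'Y2Fmw6k='
```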

Diagnosing encoding issues

If you see garbled characters (Ã©, â€™, ï»¿), you have a mismatch between the encoding used to write the data and the encoding used to read it:

  • Ã© where you expect é → the UTF-8 bytes for é (0xC3 0xA9) being read as Latin-1/ISO-8859-1
  • â€™ where you expect ' → the UTF-8 curly quote (U+2019) being read as Latin-1
  • ï»¿ at the start → the UTF-8 BOM being read as Latin-1

The fix is always the same: ensure both the writer and reader agree on the encoding. For web content, set Content-Type: text/html; charset=utf-8 on HTTP responses and <meta charset="utf-8"> in the <head> of HTML documents.
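You can reproduce the mismatch deliberately with TextEncoder/TextDecoder (note that the WHATWG 'latin1' label actually selects windows-1252):

```javascript
// Write as UTF-8, read back as Latin-1: classic mojibake
const utf8Bytes = new TextEncoder().encode('é');            // [0xC3, 0xA9]
const misread = new TextDecoder('latin1').decode(utf8Bytes);
misread;  // 'Ã©'
```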

Normalisation

The same visible character can have multiple valid Unicode representations. The letter é can be:

  • A single code point: U+00E9 (precomposed)
  • Two code points: U+0065 (e) + U+0301 combining acute accent (decomposed)

Both render identically but are not equal when compared as strings. Use Unicode normalisation (NFC for composed, NFD for decomposed) before comparing user input:

'\u00E9' === 'e\u0301'                   // false: different code point sequences
'e\u0301'.normalize('NFC') === '\u00E9'  // true: both now precomposed

Python: unicodedata.normalize('NFC', text)


Character encoding is one of those invisible layers that mostly works until it doesn't. Knowing the relationship between Unicode, UTF-8, and how your language handles strings makes the difference between confidently fixing an encoding bug and spending hours guessing at the cause.
