usapop

Posted on Jan 22

Beyond Surrogate Pairs: Hidden Pitfalls in JS Character Counting with PHP/MySQL

#laravel #php #mysql #javascript

When counting characters in JavaScript, the surrogate pair issue (where emoji and certain CJK characters count as 2) is fairly well-known. You've probably seen advice like "just use [...str].length" to solve it.

However, when I implemented a character counter for a textarea on the frontend, I discovered there are other causes of mismatches beyond surrogate pairs. It turned out to be deeper than I expected, so I'd like to share what I learned. I might have some things wrong, so please take this as a reference rather than absolute truth.

What Happened

I was implementing a character counter for a <textarea>. I was already using [...str].length as a surrogate pair countermeasure, but Laravel's max validation was still rejecting some inputs.

After investigating, I found the culprit: newline normalization.

Causes Beyond Surrogate Pairs

Cause	JS Side	PHP/MySQL Side	Discrepancy
Emoji (surrogate pairs)	`"🎉".length` → 2	`mb_strlen("🎉")` → 1	JS counts more
Newline normalization	`"a\nb".length` → 3	After HTTP: `"a\r\nb"` → 4	PHP counts more
Lone surrogates	Counted	MySQL error or warning	Possible save failure
NFD form (older macOS)	`"é".length` → 2	`mb_strlen("é")` → 2	Matches, but unintentionally 2 chars

Surrogate pairs can be handled with [...str].length, but the newline issue isn't solved by that alone. The more newlines in the text, the bigger the discrepancy—for example, 10 newlines means a +10 character difference.

About Newline Normalization

This is where I got stuck the most.

Browser → Server

When you submit the contents of a <textarea> via HTTP, newlines are normalized from \n to \r\n.

According to MDN, for wrap="soft" (the default):

the browser ensures that all line breaks in the entered value are a CR+LF pair

The WHATWG Blog also explains this in detail—newlines are normalized to CRLF during form submission per the spec.

// In browser
textarea.value  // "a\nb\nc" → 5 characters

// Received by PHP
$_POST['field'] // "a\r\nb\r\nc" → 7 characters

Laravel's max rule uses mb_strlen(), so it checks against the character count after normalization. This was why text with many newlines kept getting rejected.

The Round-Trip Problem During Editing

This issue doesn't just occur with new input—it also happens on edit screens.

Initial input: Browser (\n) → HTTP submission converts to \r\n → Saved to DB (\r\n)
Edit screen display: DB (\r\n) → Displayed in textarea → textarea.value is normalized to \n
JS count: Counts with \n → "5 characters"
Re-submission: Normalized to \r\n again → PHP/MySQL → "7 characters"

The textarea's .value property is called the "API value," and it's normalized to \n (LF) when retrieved. So even if \r\n is stored in the database, when you get it via JavaScript, it becomes \n.

This round-trip causes a constant mismatch between JS and server-side counting, which is why you need to normalize to \r\n on the JS side before counting.

Database Storage

Major databases like MySQL, PostgreSQL, and SQL Server store \r\n (CRLF) as-is as 2 bytes of data.

Text created in Windows environments is stored as \r\n
Input from Linux/Mac environments (or modern web browsers) uses \n (LF)

In other words, databases don't automatically convert newline codes, so the data varies depending on the input source.

Database-Specific Notes

MySQL / MariaDB

Data is basically stored as-is. However, when importing CSV files via LOAD DATA INFILE from the command line, if the default newline code setting doesn't match, \r can remain as garbage at the end of data.

SQL Server

Due to its Windows affinity, \r\n is treated as standard. When copying and pasting results in SSMS (management tool), newline codes might get changed automatically, so be careful.

Other Causes

Lone Surrogates

Corrupted data can sometimes get mixed in through copy-paste. MySQL treats these as invalid UTF-8 and may throw an error or truncate with a warning.

The behavior varies depending on sql_mode settings—STRICT mode causes an error and stops, while non-STRICT mode issues a warning and performs truncation or replacement.

NFD Form

This is a decomposed form that was used in filenames on older macOS (HFS+). Characters like "é" or "ñ" can have different internal representations while looking the same.

// NFC (composed) vs NFD (decomposed)
"é".length                    // → 1 (NFC: single code point U+00E9)
"é".normalize("NFD").length   // → 2 (NFD: "e" + combining acute accent)

"ñ".length                    // → 1 (NFC: single code point U+00F1)
"ñ".normalize("NFD").length   // → 2 (NFD: "n" + combining tilde)

This also affects other languages with combining marks—Vietnamese, Korean Hangul, Japanese kana, and many others.

According to Apple's official documentation, APFS (since macOS High Sierra in 2017) stores the normalization form as-is, unlike HFS+. However, normalization may still occur at the macOS layer, so NFD issues haven't completely disappeared.

In practice, NFD coming through textarea input or normal web input is quite rare—it mostly becomes an issue when dealing with filenames or copy-paste.

MySQL doesn't normalize by default, so it stores data in whatever form it receives. Depending on the collation, MySQL can search and compare NFD and NFC as "the same thing," but the stored byte count and CHAR_LENGTH results differ, so you still need to be careful from a character counting perspective.

Comparing JS Character Counting Methods

Using [...str].length for surrogate pair handling is common advice, but let me organize the differences between various methods.

Basic Comparison

Method	Unit	`"🎉"`	`"👨‍👩‍👧"`	`"\r\n"`
`str.length`	UTF-16 code units	2	8	2
`[...str].length`	Code points	1	5	2
`Intl.Segmenter`	Grapheme clusters	1	1	1
`TextEncoder`	Byte count	4	18	2
MySQL `CHAR_LENGTH()`	Code points	1	5	2

MySQL's CHAR_LENGTH() counts \r\n as 2 characters, so using Intl.Segmenter (which counts it as 1) will cause a mismatch.

`Intl.Segmenter` vs Spread Syntax

Here's a more detailed comparison:

Character Type	`Intl.Segmenter`	Spread `[...]`
Windows newline (`\r\n`)	1 character	Split into 2
CJK variant selectors	1 character	Split into 2
Private Use Area (U+E000, etc.)	1 character	1 character
ZWJ emoji sequences (👩‍👩‍👧)	1 character	Split apart

The `Intl.Segmenter` Trap

Intl.Segmenter returns "the number of characters as humans visually perceive them."

const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });
[...segmenter.segment("👨‍👩‍👧")].length  // → 1 (as it looks)
[...segmenter.segment("\r\n")].length   // → 1 (CRLF as 1 character)

This seems good at first, but it can't be used for MySQL VARCHAR limits. 👨‍👩‍👧 is 1 character in Intl.Segmenter, but MySQL's CHAR_LENGTH() counts it as 5, so even if you're within limits according to Segmenter, you might overflow in the database.

Interestingly, X (formerly Twitter) uses its own method. According to Emojipedia, in 2018 they changed to count all emoji uniformly as 2 characters. This differs from both Intl.Segmenter and [...str].length.

Private Use Area (PUA)

For the area commonly used for "custom characters" in Windows (Private Use Area: E000-F8FF), both Intl.Segmenter and spread syntax can handle them as 1 character without issues.

However, they only recognize it as "one code point"—Intl.Segmenter doesn't know what the character actually looks like.

Lone Surrogate Handling

The handling of lone surrogates also differs by method:

Method	Lone Surrogates	Combining Characters
`[...str]`	Kept	Split
`str.match(/./gu)`	Excluded	Split
`Intl.Segmenter`	Kept	Kept as 1 character

`TextEncoder`

Returns the UTF-8 byte count. Useful for checking MySQL's LENGTH() or index byte limits (like 3072 bytes).

new TextEncoder().encode("abc").length   // → 3 (1 byte × 3 characters)
new TextEncoder().encode("café").length  // → 5 (c, a, f each 1 byte + é is 2 bytes in UTF-8)
new TextEncoder().encode("你好").length   // → 6 (CJK characters are 3 bytes each in UTF-8)

By Use Case

Use Case	Method
MySQL VARCHAR(N) limit	`[...str].length` (excluding lone surrogates, etc.)
MySQL index byte limit	`TextEncoder().encode(str).length`
Visual character count	`Intl.Segmenter`

Common Mistakes

Being Satisfied with Just Surrogate Pair Handling

// Surrogate pairs are handled, but newlines cause mismatch
const count = [...str].length;

Using `str.length` Directly

// Emoji counted as 2 characters
"🎉".length  // → 2

Looping by Index

// Surrogate pairs get split
for (let i = 0; i < str.length; i++) {
  const char = str[i];  // 🎉 gets broken
}

// for...of works by code point
for (const char of str) {
  console.log(char);  // 🎉 stays intact
}

Using `Intl.Segmenter` for VARCHAR Limits

As mentioned above, it doesn't match MySQL's counting.

Using `strlen()` in PHP

// Returns byte count
strlen("café")  // → 5 (é is 2 bytes in UTF-8)
strlen("你好")   // → 6 (CJK characters are 3 bytes each)

// Use this instead
mb_strlen("café", 'UTF-8')  // → 4
mb_strlen("你好", 'UTF-8')   // → 2

Example Solution

Here's a function I use to sync with MySQL's VARCHAR(N) limit. There might be better approaches, though.

export function countForMySQL(str: string, charLimit: number) {
  // 1. Exclude lone surrogates from count (MySQL treats them as invalid UTF-8)
  const validChars = str.match(/[\uD800-\uDBFF][\uDC00-\uDFFF]|[^\uD800-\uDFFF]/gu) ?? [];

  // 2. Normalize newlines to CRLF for counting (to match post-HTTP-submission state)
  const normalized = validChars.join('').replace(/\r?\n/g, "\r\n");

  // 3. Count code points (= MySQL CHAR_LENGTH())
  const charLength = [...normalized].length;

  // 4. Byte count (for index limits)
  const byteLength = new TextEncoder().encode(normalized).length;

  return {
    isValid: charLength <= charLimit,
    charLength,
    byteLength,
  };
}

The key point is that this returns only "how MySQL would count it" without modifying the original data. Lone surrogates are excluded from the count because MySQL treats them as invalid UTF-8, but the input data itself is not modified.

Whether to reject input containing lone surrogates via validation can be decided by the calling code.

Reference Table

Concept	MySQL	PHP	JavaScript
Character count (code points)	`CHAR_LENGTH(col)`	`mb_strlen($str, 'UTF-8')`	`[...str].length`
Byte count	`LENGTH(col)`	`strlen($str)`	`new TextEncoder().encode(str).length`
Visual character count	-	`grapheme_strlen($str)`	`Intl.Segmenter`

Conclusion

Using [...str].length for surrogate pair handling is well-known, but there are cases where it's not enough.

Intl.Segmenter is correct for visual character count, but it's a bit tricky that it can't be used for database limit checks.

NFD normalization might be needed in some cases, but it seems rare enough that it's probably okay to ignore for now.

DEV Community

Beyond Surrogate Pairs: Hidden Pitfalls in JS Character Counting with PHP/MySQL

What Happened

Causes Beyond Surrogate Pairs

About Newline Normalization

Browser → Server

The Round-Trip Problem During Editing

Database Storage

Database-Specific Notes

Other Causes

Lone Surrogates

NFD Form

Comparing JS Character Counting Methods

Basic Comparison

`Intl.Segmenter` vs Spread Syntax

The `Intl.Segmenter` Trap

Private Use Area (PUA)

Lone Surrogate Handling

`TextEncoder`

By Use Case

Common Mistakes

Being Satisfied with Just Surrogate Pair Handling

Using `str.length` Directly

Looping by Index

Using `Intl.Segmenter` for VARCHAR Limits

Using `strlen()` in PHP

Example Solution

Reference Table

Conclusion

References

Top comments (0)

What Happened

Causes Beyond Surrogate Pairs

About Newline Normalization

Browser → Server

The Round-Trip Problem During Editing

Database Storage

Database-Specific Notes

Other Causes

Lone Surrogates

NFD Form

Comparing JS Character Counting Methods

Basic Comparison

Intl.Segmenter vs Spread Syntax

The Intl.Segmenter Trap

Private Use Area (PUA)

Lone Surrogate Handling

TextEncoder

By Use Case

Common Mistakes

Being Satisfied with Just Surrogate Pair Handling

Using str.length Directly

Looping by Index

Using Intl.Segmenter for VARCHAR Limits

Using strlen() in PHP

Example Solution

Reference Table

Conclusion

References

`Intl.Segmenter` vs Spread Syntax

The `Intl.Segmenter` Trap

`TextEncoder`

Using `str.length` Directly

Using `Intl.Segmenter` for VARCHAR Limits

Using `strlen()` in PHP