DEV Community

usapop
usapop

Posted on

Beyond Surrogate Pairs: Hidden Pitfalls in JS Character Counting with PHP/MySQL

When counting characters in JavaScript, the surrogate pair issue (where emoji and certain CJK characters count as 2) is fairly well-known. You've probably seen advice like "just use [...str].length" to solve it.

However, when I implemented a character counter for a textarea on the frontend, I discovered there are other causes of mismatches beyond surrogate pairs. It turned out to be deeper than I expected, so I'd like to share what I learned. I might have some things wrong, so please take this as a reference rather than absolute truth.

What Happened

I was implementing a character counter for a <textarea>. I was already using [...str].length as a surrogate pair countermeasure, but Laravel's max validation was still rejecting some inputs.

After investigating, I found the culprit: newline normalization.

Causes Beyond Surrogate Pairs

Cause JS Side PHP/MySQL Side Discrepancy
Emoji (surrogate pairs) "🎉".length → 2 mb_strlen("🎉") → 1 JS counts more
Newline normalization "a\nb".length → 3 After HTTP: "a\r\nb" → 4 PHP counts more
Lone surrogates Counted MySQL error or warning Possible save failure
NFD form (older macOS) "é".length → 2 mb_strlen("é") → 2 Matches, but unintentionally 2 chars

Surrogate pairs can be handled with [...str].length, but the newline issue isn't solved by that alone. The more newlines in the text, the bigger the discrepancy—for example, 10 newlines means a +10 character difference.

About Newline Normalization

This is where I got stuck the most.

Browser → Server

When you submit the contents of a <textarea> via HTTP, newlines are normalized from \n to \r\n.

According to MDN, for wrap="soft" (the default):

the browser ensures that all line breaks in the entered value are a CR+LF pair

The WHATWG Blog also explains this in detail—newlines are normalized to CRLF during form submission per the spec.

// In browser
textarea.value  // "a\nb\nc" → 5 characters

// Received by PHP
$_POST['field'] // "a\r\nb\r\nc" → 7 characters
Enter fullscreen mode Exit fullscreen mode

Laravel's max rule uses mb_strlen(), so it checks against the character count after normalization. This was why text with many newlines kept getting rejected.

The Round-Trip Problem During Editing

This issue doesn't just occur with new input—it also happens on edit screens.

  1. Initial input: Browser (\n) → HTTP submission converts to \r\n → Saved to DB (\r\n)
  2. Edit screen display: DB (\r\n) → Displayed in textarea → textarea.value is normalized to \n
  3. JS count: Counts with \n → "5 characters"
  4. Re-submission: Normalized to \r\n again → PHP/MySQL → "7 characters"

The textarea's .value property is called the "API value," and it's normalized to \n (LF) when retrieved. So even if \r\n is stored in the database, when you get it via JavaScript, it becomes \n.

This round-trip causes a constant mismatch between JS and server-side counting, which is why you need to normalize to \r\n on the JS side before counting.

Database Storage

Major databases like MySQL, PostgreSQL, and SQL Server store \r\n (CRLF) as-is as 2 bytes of data.

  • Text created in Windows environments is stored as \r\n
  • Input from Linux/Mac environments (or modern web browsers) uses \n (LF)

In other words, databases don't automatically convert newline codes, so the data varies depending on the input source.

Database-Specific Notes

MySQL / MariaDB

Data is basically stored as-is. However, when importing CSV files via LOAD DATA INFILE from the command line, if the default newline code setting doesn't match, \r can remain as garbage at the end of data.

SQL Server

Due to its Windows affinity, \r\n is treated as standard. When copying and pasting results in SSMS (management tool), newline codes might get changed automatically, so be careful.

Other Causes

Lone Surrogates

Corrupted data can sometimes get mixed in through copy-paste. MySQL treats these as invalid UTF-8 and may throw an error or truncate with a warning.

The behavior varies depending on sql_mode settings—STRICT mode causes an error and stops, while non-STRICT mode issues a warning and performs truncation or replacement.

NFD Form

This is a decomposed form that was used in filenames on older macOS (HFS+). Characters like "é" or "ñ" can have different internal representations while looking the same.

// NFC (composed) vs NFD (decomposed)
"é".length                    // → 1 (NFC: single code point U+00E9)
"é".normalize("NFD").length   // → 2 (NFD: "e" + combining acute accent)

"ñ".length                    // → 1 (NFC: single code point U+00F1)
"ñ".normalize("NFD").length   // → 2 (NFD: "n" + combining tilde)
Enter fullscreen mode Exit fullscreen mode

This also affects other languages with combining marks—Vietnamese, Korean Hangul, Japanese kana, and many others.

According to Apple's official documentation, APFS (since macOS High Sierra in 2017) stores the normalization form as-is, unlike HFS+. However, normalization may still occur at the macOS layer, so NFD issues haven't completely disappeared.

In practice, NFD coming through textarea input or normal web input is quite rare—it mostly becomes an issue when dealing with filenames or copy-paste.

MySQL doesn't normalize by default, so it stores data in whatever form it receives. Depending on the collation, MySQL can search and compare NFD and NFC as "the same thing," but the stored byte count and CHAR_LENGTH results differ, so you still need to be careful from a character counting perspective.

Comparing JS Character Counting Methods

Using [...str].length for surrogate pair handling is common advice, but let me organize the differences between various methods.

Basic Comparison

Method Unit "🎉" "👨‍👩‍👧" "\r\n"
str.length UTF-16 code units 2 8 2
[...str].length Code points 1 5 2
Intl.Segmenter Grapheme clusters 1 1 1
TextEncoder Byte count 4 18 2
MySQL CHAR_LENGTH() Code points 1 5 2

MySQL's CHAR_LENGTH() counts \r\n as 2 characters, so using Intl.Segmenter (which counts it as 1) will cause a mismatch.

Intl.Segmenter vs Spread Syntax

Here's a more detailed comparison:

Character Type Intl.Segmenter Spread [...]
Windows newline (\r\n) 1 character Split into 2
CJK variant selectors 1 character Split into 2
Private Use Area (U+E000, etc.) 1 character 1 character
ZWJ emoji sequences (👩‍👩‍👧) 1 character Split apart

The Intl.Segmenter Trap

Intl.Segmenter returns "the number of characters as humans visually perceive them."

const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });
[...segmenter.segment("👨‍👩‍👧")].length  // → 1 (as it looks)
[...segmenter.segment("\r\n")].length   // → 1 (CRLF as 1 character)
Enter fullscreen mode Exit fullscreen mode

This seems good at first, but it can't be used for MySQL VARCHAR limits. 👨‍👩‍👧 is 1 character in Intl.Segmenter, but MySQL's CHAR_LENGTH() counts it as 5, so even if you're within limits according to Segmenter, you might overflow in the database.

Interestingly, X (formerly Twitter) uses its own method. According to Emojipedia, in 2018 they changed to count all emoji uniformly as 2 characters. This differs from both Intl.Segmenter and [...str].length.

Private Use Area (PUA)

For the area commonly used for "custom characters" in Windows (Private Use Area: E000-F8FF), both Intl.Segmenter and spread syntax can handle them as 1 character without issues.

However, they only recognize it as "one code point"—Intl.Segmenter doesn't know what the character actually looks like.

Lone Surrogate Handling

The handling of lone surrogates also differs by method:

Method Lone Surrogates Combining Characters
[...str] Kept Split
str.match(/./gu) Excluded Split
Intl.Segmenter Kept Kept as 1 character

TextEncoder

Returns the UTF-8 byte count. Useful for checking MySQL's LENGTH() or index byte limits (like 3072 bytes).

new TextEncoder().encode("abc").length   // → 3 (1 byte × 3 characters)
new TextEncoder().encode("café").length  // → 5 (c, a, f each 1 byte + é is 2 bytes in UTF-8)
new TextEncoder().encode("你好").length   // → 6 (CJK characters are 3 bytes each in UTF-8)
Enter fullscreen mode Exit fullscreen mode

By Use Case

Use Case Method
MySQL VARCHAR(N) limit [...str].length (excluding lone surrogates, etc.)
MySQL index byte limit TextEncoder().encode(str).length
Visual character count Intl.Segmenter

Common Mistakes

Being Satisfied with Just Surrogate Pair Handling

// Surrogate pairs are handled, but newlines cause mismatch
const count = [...str].length;
Enter fullscreen mode Exit fullscreen mode

Using str.length Directly

// Emoji counted as 2 characters
"🎉".length  // → 2
Enter fullscreen mode Exit fullscreen mode

Looping by Index

// Surrogate pairs get split
for (let i = 0; i < str.length; i++) {
  const char = str[i];  // 🎉 gets broken
}

// for...of works by code point
for (const char of str) {
  console.log(char);  // 🎉 stays intact
}
Enter fullscreen mode Exit fullscreen mode

Using Intl.Segmenter for VARCHAR Limits

As mentioned above, it doesn't match MySQL's counting.

Using strlen() in PHP

// Returns byte count
strlen("café")  // → 5 (é is 2 bytes in UTF-8)
strlen("你好")   // → 6 (CJK characters are 3 bytes each)

// Use this instead
mb_strlen("café", 'UTF-8')  // → 4
mb_strlen("你好", 'UTF-8')   // → 2
Enter fullscreen mode Exit fullscreen mode

Example Solution

Here's a function I use to sync with MySQL's VARCHAR(N) limit. There might be better approaches, though.

export function countForMySQL(str: string, charLimit: number) {
  // 1. Exclude lone surrogates from count (MySQL treats them as invalid UTF-8)
  const validChars = str.match(/[\uD800-\uDBFF][\uDC00-\uDFFF]|[^\uD800-\uDFFF]/gu) ?? [];

  // 2. Normalize newlines to CRLF for counting (to match post-HTTP-submission state)
  const normalized = validChars.join('').replace(/\r?\n/g, "\r\n");

  // 3. Count code points (= MySQL CHAR_LENGTH())
  const charLength = [...normalized].length;

  // 4. Byte count (for index limits)
  const byteLength = new TextEncoder().encode(normalized).length;

  return {
    isValid: charLength <= charLimit,
    charLength,
    byteLength,
  };
}
Enter fullscreen mode Exit fullscreen mode

The key point is that this returns only "how MySQL would count it" without modifying the original data. Lone surrogates are excluded from the count because MySQL treats them as invalid UTF-8, but the input data itself is not modified.

Whether to reject input containing lone surrogates via validation can be decided by the calling code.

Reference Table

Concept MySQL PHP JavaScript
Character count (code points) CHAR_LENGTH(col) mb_strlen($str, 'UTF-8') [...str].length
Byte count LENGTH(col) strlen($str) new TextEncoder().encode(str).length
Visual character count - grapheme_strlen($str) Intl.Segmenter

Conclusion

Using [...str].length for surrogate pair handling is well-known, but there are cases where it's not enough.

Intl.Segmenter is correct for visual character count, but it's a bit tricky that it can't be used for database limit checks.

NFD normalization might be needed in some cases, but it seems rare enough that it's probably okay to ignore for now.

References

Top comments (0)