The Problem: When Database Migration Breaks Frontend Display
We once decided to add a 280-character limit to our posts table, but there was a problem: some existing posts were already longer than that.
So we wrote a Python migration script to trim the existing data to fit the new column limit:
# Migration script: Clean up existing long posts
def truncate_existing_posts():
    # Note: the `content__length` lookup assumes Django's Length transform
    # has been registered on the field (e.g. via register_lookup)
    long_posts = Post.objects.filter(content__length__gt=280)
    for post in long_posts:
        if len(post.content) > 280:
            post.content = post.content[:280]  # Truncate to 280 characters
            post.save()
    print(f"Truncated {long_posts.count()} posts to 280 characters")
# Example of what the migration does
original_post = "Amazing day with the family! π¨βπ¨βπ¦βπ¦β€οΈ Can't wait for more adventures together! ππ¨ Life is beautiful when you're surrounded by love and joy!"
print(f"Original length: {len(original_post)}") # 132 characters
truncated_post = original_post[:280] # Well within the limit
print(f"Truncated length: {len(truncated_post)}") # 132 characters β
print(f"Truncated content: {truncated_post}")
The migration runs successfully. All posts are now 280 characters or fewer according to Python. We deploy the new schema with character validation on the frontend.
But then users started reporting that their existing posts were flagged as invalid when they tried to edit them.
The Investigation: When Character Counting Goes Wrong
After some digging, we found that Python and JavaScript report different lengths for the same string when it contains emojis.
# Python
text_with_emoji = "Hello 👨‍💻 World"
print(f"Python len(): {len(text_with_emoji)}")  # 15 characters
# Let's see what Python sees
for i, char in enumerate(text_with_emoji):
    print(f"{i}: {repr(char)}")
# 0: 'H', 1: 'e', 2: 'l', 3: 'l', 4: 'o', 5: ' ', 6: '👨', 7: '\u200d', 8: '💻', 9: ' ', 10: 'W', 11: 'o', 12: 'r', 13: 'l', 14: 'd'
// JavaScript
const textWithEmoji = "Hello 👨‍💻 World";
console.log(`JavaScript length: ${textWithEmoji.length}`); // 17 characters
// Let's see what JavaScript sees
console.log([...textWithEmoji]);
// ['H', 'e', 'l', 'l', 'o', ' ', '👨', '‍', '💻', ' ', 'W', 'o', 'r', 'l', 'd']
It turns out that Python counts Unicode code points, while JavaScript counts UTF-16 code units. The same emoji is counted differently.
Three emojis, three lengths. Why???
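To see the mismatch from the backend side, here is a minimal Python sketch that reproduces what JavaScript's `.length` reports by counting UTF-16 code units. The helper names `utf16_length` and `passes_python_but_fails_js` are hypothetical and not part of the original migration.

```python
def utf16_length(text: str) -> int:
    """Return the number of UTF-16 code units, which is what JavaScript's .length reports."""
    # Each UTF-16 code unit is 2 bytes; "utf-16-le" avoids the 2-byte BOM prefix
    return len(text.encode("utf-16-le")) // 2


def passes_python_but_fails_js(text: str, limit: int = 280) -> bool:
    """True when len() accepts the text but a .length-based check would reject it."""
    return len(text) <= limit < utf16_length(text)


# "Hello 👨‍💻 World", written with escapes so the hidden code points are visible
sample = "Hello \U0001F468\u200D\U0001F4BB World"
print(len(sample))           # 15 code points
print(utf16_length(sample))  # 17 UTF-16 code units

# 200 man emojis: 200 code points for Python, 400 code units for JavaScript
print(passes_python_but_fails_js("\U0001F468" * 200))  # True
```

A check like this could be run over already-migrated posts to find the ones a `.length`-based frontend validator would still reject.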
The Curiosity Detour
At this point, the fix was obvious - just align the character counting methods between frontend and backend. Problem solved!
But another thing caught my eye:
// Simple emoji
const man = '👨';
console.log(man.length); // 2 in js, 1 in python
const developer = '👨‍💻';
console.log(developer.length); // 5 in js, 3 in python (?!)
const family = '👨‍👨‍👦‍👦';
console.log(family.length); // 11 in js, 7 in python (!!!)
Why was 👨‍👨‍👦‍👦 counting as 11 characters in JavaScript but only 7 in Python? What exactly was hiding inside that "single" family emoji?
That curiosity led me to study Unicode emoji specifications...
The Deep Dive: Unicode Emoji Specifications
So what's really happening under the hood? Let's examine the Unicode specifications that govern emoji behavior.
Zero-Width Joiners: The Invisible Glue
Some emojis are built from several base emojis using Zero-Width Joiners (ZWJ, U+200D): invisible characters that glue base emojis together:
// Let's break down the developer emoji
const developer = '👨‍💻';
console.log(Array.from(developer));
// Output: ['👨', '‍', '💻']
// That's: man + ZWJ + computer
// The family emoji breakdown
const family = '👨‍👨‍👦‍👦';
console.log(Array.from(family));
// Output: ['👨', '‍', '👨', '‍', '👦', '‍', '👦']
// That's: man + ZWJ + man + ZWJ + boy + ZWJ + boy
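The same breakdown can be done on the Python side. The sketch below uses the standard `unicodedata` module to name every code point in a sequence, which makes the invisible joiner easy to spot. `describe_code_points` is a hypothetical helper name.

```python
import unicodedata


def describe_code_points(text: str) -> list:
    """Return one 'U+XXXX NAME' string per code point in text."""
    return [
        f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}"
        for ch in text
    ]


developer = "\U0001F468\u200D\U0001F4BB"  # 👨‍💻 written as explicit escapes
for line in describe_code_points(developer):
    print(line)
# U+1F468 MAN
# U+200D ZERO WIDTH JOINER
# U+1F4BB PERSONAL COMPUTER
```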
Emoji Presentation Selector: The Style Controller
Another invisible character: U+FE0F VARIATION SELECTOR-16 (VS16), the emoji presentation selector. This forces a symbol to display in emoji style rather than text style:
// Medical symbol without emoji presentation selector
const textSymbol = '⚕'; // U+2695 only
console.log(textSymbol.length); // 1
// Medical symbol WITH emoji presentation selector
const emojiSymbol = '⚕️'; // U+2695 + U+FE0F
console.log(emojiSymbol.length); // 2
console.log(Array.from(emojiSymbol)); // ['⚕', '️']
Many common emojis include this hidden selector:
// These look like single characters but aren't:
const heart = '❤️'; // U+2764 + U+FE0F
const warning = '⚠️'; // U+26A0 + U+FE0F
console.log(Array.from(heart)); // ['❤', '️'] - has VS16
console.log(Array.from(warning)); // ['⚠', '️'] - has VS16
From the Unicode specification:
2695 FE0F (fully-qualified) → ⚕️ medical symbol (emoji presentation)
2695 (unqualified) → ⚕ medical symbol (text presentation)
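Because VS16 is invisible, two strings that look almost the same can compare as different. A small Python sketch of how this might be detected follows; the helper names are hypothetical. Stripping the selector is shown only for comparison, since it changes how the symbol may be rendered.

```python
VS16 = "\uFE0F"  # VARIATION SELECTOR-16, the emoji presentation selector


def has_emoji_presentation_selector(text: str) -> bool:
    """True if the text contains at least one VS16 code point."""
    return VS16 in text


def strip_vs16(text: str) -> str:
    """Remove every VS16; the remaining symbols may fall back to text style."""
    return text.replace(VS16, "")


text_heart = "\u2764"         # heart without VS16 (text presentation)
emoji_heart = "\u2764\uFE0F"  # heart with VS16 (emoji presentation)

print(len(text_heart), len(emoji_heart))       # 1 2
print(text_heart == emoji_heart)               # False
print(strip_vs16(emoji_heart) == text_heart)   # True
```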
Emoji Modifiers: Skin tones
For people emojis, we can use modifiers for customization:
// Skin tone modifiers
const artist = '🧑🏻‍🎨'; // light skin tone
console.log(artist.length); // 7
console.log(Array.from(artist));
// ['🧑', '🏻', '‍', '🎨']
// That's: person + light skin tone + ZWJ + artist palette
// Professional emojis with skin tone modifiers
const lightDeveloper = '👨🏻‍💻'; // light skin male developer
console.log(lightDeveloper.length); // 7
console.log(Array.from(lightDeveloper));
// ['👨', '🏻', '‍', '💻']
The five skin tone modifiers are:
🏻 (Light skin tone - U+1F3FB)
🏼 (Medium-light skin tone - U+1F3FC)
🏽 (Medium skin tone - U+1F3FD)
🏾 (Medium-dark skin tone - U+1F3FE)
🏿 (Dark skin tone - U+1F3FF)
// Same base emoji, different lengths due to modifiers
const wave = '👋'; // 2 characters
const waveLight = '👋🏻'; // 4 characters
const waveDark = '👋🏿'; // 4 characters
// Comparing different combinations
const doctor = '👩‍⚕️'; // 5 characters (woman + ZWJ + medical + VS16)
const lightDoctor = '👩🏻‍⚕️'; // 7 characters (+ skin tone modifier)
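Since the five modifiers occupy one contiguous range (U+1F3FB to U+1F3FF), they are easy to detect in code. As a rough Python sketch with hypothetical helper names, this counts and removes them to show how much length a modifier adds:

```python
# The five skin tone modifiers, U+1F3FB through U+1F3FF
SKIN_TONE_MODIFIERS = {chr(cp) for cp in range(0x1F3FB, 0x1F3FF + 1)}


def count_skin_tones(text: str) -> int:
    """Count how many skin tone modifier code points the text contains."""
    return sum(1 for ch in text if ch in SKIN_TONE_MODIFIERS)


def strip_skin_tones(text: str) -> str:
    """Return the text with every skin tone modifier removed."""
    return "".join(ch for ch in text if ch not in SKIN_TONE_MODIFIERS)


wave_light = "\U0001F44B\U0001F3FB"                      # 👋🏻
light_developer = "\U0001F468\U0001F3FB\u200D\U0001F4BB"  # 👨🏻‍💻

print(len(wave_light), len(strip_skin_tones(wave_light)))            # 2 1
print(len(light_developer), len(strip_skin_tones(light_developer)))  # 4 3
print(count_skin_tones(light_developer))                             # 1
```

Note that these are Python code point counts; the JavaScript `.length` values are larger because each of these emojis and modifiers takes two UTF-16 code units.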
Flag Sequences: Countries Built from Letters
Here's where it gets really interesting. Country flags are actually sequences of two Regional Indicator characters:
// Flag examples - each is 2 Regional Indicator characters
const usFlag = '🇺🇸'; // U+1F1FA (🇺) + U+1F1F8 (🇸) = "US"
const taiwanFlag = '🇹🇼'; // U+1F1F9 (🇹) + U+1F1FC (🇼) = "TW"
const japanFlag = '🇯🇵'; // U+1F1EF (🇯) + U+1F1F5 (🇵) = "JP"
// All flags have length 4 in JavaScript
console.log(usFlag.length); // 4
console.log(taiwanFlag.length); // 4
console.log(japanFlag.length); // 4
// Breaking them down shows the Regional Indicators
console.log(Array.from(usFlag)); // ['🇺', '🇸']
console.log(Array.from(taiwanFlag)); // ['🇹', '🇼']
// You can even manually construct flags:
const manualTW = '🇹🇼';
const constructedTW = '\u{1F1F9}\u{1F1FC}'; // Same thing!
console.log(manualTW === constructedTW); // true
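The mapping from letters to Regional Indicators is plain arithmetic: "A" maps to U+1F1E6, "B" to U+1F1E7, and so on. A minimal Python sketch of the conversion in both directions follows, with hypothetical function names. Whether the resulting pair actually renders as a flag depends on the platform and font.

```python
REGIONAL_INDICATOR_A = 0x1F1E6  # REGIONAL INDICATOR SYMBOL LETTER A


def country_code_to_flag(code: str) -> str:
    """Turn a two-letter code such as 'TW' into its Regional Indicator pair."""
    code = code.upper()
    if len(code) != 2 or not all("A" <= c <= "Z" for c in code):
        raise ValueError(f"Expected two ASCII letters, got {code!r}")
    return "".join(chr(REGIONAL_INDICATOR_A + ord(c) - ord("A")) for c in code)


def flag_to_country_code(flag: str) -> str:
    """Turn a Regional Indicator pair back into its two-letter code."""
    return "".join(chr(ord(ch) - REGIONAL_INDICATOR_A + ord("A")) for ch in flag)


tw = country_code_to_flag("TW")
print(tw == "\U0001F1F9\U0001F1FC")  # True
print(len(tw))                       # 2 code points (4 UTF-16 code units)
print(flag_to_country_code(tw))      # TW
```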
The Experiment: Interactive Exploration
At this point, I was fascinated by the complexity. How many different emoji combinations exist? What do they actually contain?
So I built "Emoji Architect" to explore these compositions interactively. With this tool you can:
- Browse all ZWJ sequence emojis
- See the real-time breakdown of any emoji (base emoji & modifiers)
- See JavaScript length calculations live
https://www.thingsaboutweb.dev/en/emojiarchitect
The Solution: Synchronized Character Counting
Now that we understand the problem, let's fix our backend and frontend validation mismatch.
The Core Issue: Different Counting Methods
Our original problem stemmed from inconsistent character counting:
# Python:
len("Hello 👨‍💻 World")  # 15
// JavaScript
"Hello 👨‍💻 World".length  // 17
My Initial Approach: Visual Character Counting
Initially, I was drawn to visual character counting because it felt like the right user experience. Why should users have to understand Unicode internals just to know if their post fits the character limit? Luckily, Intl can help us...
JavaScript implementation:
function getVisualLength(str) {
  if (typeof Intl.Segmenter !== 'undefined') {
    const segmenter = new Intl.Segmenter('en', {granularity: 'grapheme'});
    return [...segmenter.segment(str)].length;
  }
  // Fallback for older browsers
  return [...str].length;
}
getVisualLength("Hello!! I am a developer 👨‍💻 who loves 🧋 and 🏃‍♂️!!")
// Result: 46
// what users actually see
For Python:
import regex  # pip install regex
def get_visual_length(text):
    """Count grapheme clusters like JavaScript's Intl.Segmenter"""
    return len(regex.findall(r'\X', text))
get_visual_length("Hello!! I am a developer 👨‍💻 who loves 🧋 and 🏃‍♂️!!")
# Result: 46
# consistent with frontend
This seemed perfect: users would see exactly what they expect, and complex emoji sequences would count as single characters regardless of their Unicode complexity.
Reality Check: Database Constraints
But then I tried to save the data to the database...
// Frontend validation passes with visual counting
const bio = "Hello!! I am a developer 👨‍💻 who loves 🧋 and 🏃‍♂️!!";
console.log(`Visual length: ${getVisualLength(bio)}`); // 46 characters, passes
But the database insertion fails spectacularly:
SELECT CHAR_LENGTH("Hello!! I am a developer 👨\U+200D💻 who loves \U+1F9CB and 🏃\U+200D♂️!!")
/* 51! Which fails on a VARCHAR(50) constraint! */
Since there's no easy way to make the database limit content by visual length, the most practical approach is to settle for code point counting.
Second Solution: Counting Code Points
function codePointLength(text) {
  /**
   * Count Unicode code points like Python len() does
   * This matches Python's default string length behavior and database constraints
   */
  return [...text].length;
}
const text = "Hello!! I am a developer 👨‍💻 who loves 🧋 and 🏃‍♂️!!";
console.log(`Frontend: ${codePointLength(text)}`);
// 51 (which matches Python & DB)
Code point counting might not be as user-friendly as visual counting, but it works reliably across your entire stack.
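On the backend, the matching check is just `len()`, since Python strings already count code points. Here is a minimal sketch of a validator that mirrors `codePointLength`. The function names and the `MAX_POST_LENGTH` constant are hypothetical, and it assumes the database column counts code points the way the `CHAR_LENGTH` example does.

```python
MAX_POST_LENGTH = 280  # assumed limit, taken from the posts-table example


def code_point_length(text: str) -> int:
    """Python's len() on a str already counts Unicode code points."""
    return len(text)


def validate_post_content(content: str, limit: int = MAX_POST_LENGTH) -> None:
    """Raise ValueError when the content has more code points than the limit."""
    count = code_point_length(content)
    if count > limit:
        raise ValueError(f"Post is {count} code points long; the limit is {limit}")


# 280 man emojis pass here, even though JavaScript's .length would report 560
validate_post_content("\U0001F468" * 280)

try:
    validate_post_content("a" * 281)
except ValueError as err:
    print(err)  # Post is 281 code points long; the limit is 280
```

With both sides counting code points, a post that the frontend accepts should also be accepted by this check.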
Key Takeaways
After diving deep into Unicode specifications and building production solutions, here's what I learned:
- Test Across the Entire Stack: Visual character counting seems ideal, but database constraints and ecosystem compatibility often force you toward Unicode code point counting. Choose the solution that works with your entire stack, not just the frontend.
- Test with Complex Emoji Early: Don't just test with simple, single-code-point emojis. Include ZWJ sequences (👨‍💻), skin tone modifiers (👋🏻), and flag emojis (🇹🇼) in your validation logic from day one.
It's fun to poke around in Unicode and emoji, and I hope my Emoji Architect tool can help show how emojis are built.