Fin Chen

Posted on Jul 7

💥 The Day Emoji Caused a Mess: A Developer's 🚨🔍🐛🧪✅ Journey 👨‍💻

#webdev #learning

The Problem: When Database Migration Breaks Frontend Display

We once decided to add a 280-character constraint to our posts table, but there was a problem: some existing posts were already longer than this limit.

So we wrote a Python migration script to truncate the existing data so it would fit in the database column:

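# Migration script: clean up existing long posts
from django.db.models.functions import Length

# Post is the app's model; import it from wherever it lives in your project

def truncate_existing_posts():
    # Annotate with the database-side length so only candidate rows are loaded
    long_posts = Post.objects.annotate(content_len=Length("content")).filter(content_len__gt=280)

    truncated = 0
    for post in long_posts:
        if len(post.content) > 280:
            post.content = post.content[:280]  # Truncate to 280 code points
            post.save()
            truncated += 1

    # Count as we go: calling long_posts.count() here would re-query the already truncated table
    print(f"Truncated {truncated} posts to 280 characters")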

# Example of what the migration does
original_post = "Amazing day with the family! 👨‍👨‍👦‍👦❤️ Can't wait for more adventures together! 🚀🎨 Life is beautiful when you're surrounded by love and joy!"
print(f"Original length: {len(original_post)}")  # 140 code points

truncated_post = original_post[:280]  # Well within the limit, so nothing is cut
print(f"Truncated length: {len(truncated_post)}")  # 140 code points ✅
print(f"Truncated content: {truncated_post}")

The migration ran successfully. All posts were now 280 characters or fewer according to Python. We deployed the new schema with character validation on the frontend.

But then users started reporting that their post content became invalid when they edited it.

The Investigation: When Character Counting Goes Wrong

After some digging, we found that Python and JavaScript report different lengths for the same string when it contains emojis.

# Python
text_with_emoji = "Hello 👨‍💻 World"
print(f"Python len(): {len(text_with_emoji)}")  # 15 characters

# Let's see what Python sees
for i, char in enumerate(text_with_emoji):
    print(f"{i}: {repr(char)}")
# 0: 'H', 1: 'e', 2: 'l', 3: 'l', 4: 'o', 5: ' ', 6: '👨', 7: '\u200d', 8: '💻', 9: ' ', 10: 'W', 11: 'o', 12: 'r', 13: 'l', 14: 'd'

// JavaScript
const textWithEmoji = "Hello 👨‍💻 World";
console.log(`JavaScript length: ${textWithEmoji.length}`); // 17 characters

// Let's see what JavaScript sees
console.log([...textWithEmoji]); 
// ['H', 'e', 'l', 'l', 'o', ' ', '👨', '‍', '💻', ' ', 'W', 'o', 'r', 'l', 'd']

It turns out that Python counts Unicode code points, while JavaScript counts UTF-16 code units. The same emoji is counted differently.
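To see both counts side by side from Python, here is a minimal sketch. The helper names `code_point_length` and `utf16_length` are made up for this post, and the sketch assumes the text contains no lone surrogates (encoding would fail on those).

```python
def code_point_length(text: str) -> int:
    """Number of Unicode code points; this is what Python's len() reports."""
    return len(text)


def utf16_length(text: str) -> int:
    """Number of UTF-16 code units; this is what JavaScript's .length reports."""
    # Each UTF-16 code unit is 2 bytes; "utf-16-le" avoids the 2-byte BOM prefix.
    return len(text.encode("utf-16-le")) // 2


# "Hello 👨‍💻 World", spelled with escapes so the invisible ZWJ is explicit
text = "Hello \U0001F468\u200D\U0001F4BB World"
print(code_point_length(text))  # 15
print(utf16_length(text))       # 17
```

The two extra units come from 👨 and 💻: each lies outside the Basic Multilingual Plane, so UTF-16 needs a surrogate pair for it.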

Three emojis, three lengths. Why???

The Curiosity Detour

At this point, the fix was obvious - just align the character counting methods between frontend and backend. Problem solved!

But another thing caught my eye:

// Simple emoji
const man = '👨';
console.log(man.length); // 2 in js, 1 in python

const developer = '👨‍💻';
console.log(developer.length); // 5 in js, 3 in python (?!)

const family = '👨‍👨‍👦‍👦';
console.log(family.length); // 11 in js, 7 in python (!!!)

Why was 👨‍👨‍👦‍👦 counting as 11 characters in JavaScript but only 7 in Python? What exactly was hiding inside that "single" family emoji?

That curiosity led me to study Unicode emoji specifications...

The Deep Dive: Unicode Emoji Specifications

So what's really happening under the hood? Let's examine the Unicode specifications that govern emoji behavior.

Zero-Width Joiners: The Invisible Glue

Some emojis are built from several base emojis, using Zero-Width Joiners (ZWJ, U+200D): invisible characters that glue base emojis together:

// Let's break down the developer emoji
const developer = '👨‍💻';
console.log(Array.from(developer));
// Output: ['👨', '‍', '💻']
// That's: man + ZWJ + computer

// The family emoji breakdown
const family = '👨‍👨‍👦‍👦';
console.log(Array.from(family));
// Output: ['👨', '‍', '👨', '‍', '👦', '‍', '👦']
// That's: man + ZWJ + man + ZWJ + boy + ZWJ + boy
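The same breakdown can be done on the Python side with the standard `unicodedata` module, which also gives the official name of each code point. This is just a sketch; `describe` is a hypothetical helper name.

```python
import unicodedata


def describe(text: str) -> list[str]:
    """Return one 'U+XXXX NAME' entry per code point in text."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}" for ch in text]


developer = "\U0001F468\u200D\U0001F4BB"  # 👨‍💻
for line in describe(developer):
    print(line)
# U+1F468 MAN
# U+200D ZERO WIDTH JOINER
# U+1F4BB PERSONAL COMPUTER
```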

Emoji Presentation Selector: The Style Controller

Another invisible character: U+FE0F VARIATION SELECTOR-16 (VS16), the emoji presentation selector. This forces a symbol to display in emoji style rather than text style:

// Medical symbol without emoji presentation selector
const textSymbol = '⚕';     // U+2695 only
console.log(textSymbol.length); // 1

// Medical symbol WITH emoji presentation selector  
const emojiSymbol = '⚕️';    // U+2695 + U+FE0F
console.log(emojiSymbol.length); // 2
console.log(Array.from(emojiSymbol)); // ['⚕', '️']

Many common emojis include this hidden selector:

// These look like single characters but aren't:
const heart = '❤️';         // U+2764 + U+FE0F
const warning = '⚠️';       // U+26A0 + U+FE0F

console.log(Array.from(heart));     // ['❤', '️'] - has VS16
console.log(Array.from(warning));   // ['⚠', '️'] - has VS16

From the Unicode specification:

2695 FE0F (fully-qualified) — ⚕️ medical symbol (emoji presentation)
2695 (unqualified) — ⚕ medical symbol (text presentation)
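Both ZWJ and VS16 add to a string's length without drawing anything on their own. A rough way to see how much of a string is made of these invisible characters, assuming we only care about those two (the helper name `invisible_code_points` is made up here):

```python
ZWJ = "\u200D"   # ZERO WIDTH JOINER
VS16 = "\uFE0F"  # VARIATION SELECTOR-16


def invisible_code_points(text: str) -> int:
    """Count ZWJ and VS16 occurrences in text."""
    return sum(1 for ch in text if ch in (ZWJ, VS16))


heart = "\u2764\uFE0F"                   # ❤️
doctor = "\U0001F469\u200D\u2695\uFE0F"  # 👩‍⚕️
print(len(heart), invisible_code_points(heart))    # 2 1
print(len(doctor), invisible_code_points(doctor))  # 4 2
```

Note that `len()` here counts code points, so the numbers are smaller than the JavaScript `.length` values.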

Emoji Modifiers: Skin tones

For people emojis, we can use modifiers for customization:

// Skin tone modifiers
const artist = '🧑🏻‍🎨'; // light skin tone
console.log(artist.length); // 7
console.log(Array.from(artist)); 
// ['🧑', '🏻', '‍', '🎨']
// That's: person + light skin tone + ZWJ + artist palette

// Professional emojis with skin tone modifiers
const lightDeveloper = '👨🏻‍💻'; // light skin male developer
console.log(lightDeveloper.length); // 7
console.log(Array.from(lightDeveloper));
// ['👨', '🏻', '‍', '💻']

The five skin tone modifiers are:

🏻 (Light skin tone - U+1F3FB)
🏼 (Medium-light skin tone - U+1F3FC)
🏽 (Medium skin tone - U+1F3FD)
🏾 (Medium-dark skin tone - U+1F3FE)
🏿 (Dark skin tone - U+1F3FF)

// Same base emoji, different lengths due to modifiers
const wave = '👋';           // 2 characters
const waveLight = '👋🏻';     // 4 characters  
const waveDark = '👋🏿';      // 4 characters

// Comparing different combinations
const doctor = '👩‍⚕️';       // 5 characters (woman + ZWJ + medical + VS16)
const lightDoctor = '👩🏻‍⚕️'; // 7 characters (+ skin tone modifier)
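Because the five modifiers sit in one contiguous range (U+1F3FB to U+1F3FF), detecting or removing them is a simple range check. A small sketch in Python, with hypothetical helper names:

```python
SKIN_TONE_FIRST = 0x1F3FB  # 🏻 light skin tone
SKIN_TONE_LAST = 0x1F3FF   # 🏿 dark skin tone


def is_skin_tone_modifier(ch: str) -> bool:
    """True if ch is one of the five skin tone modifiers."""
    return SKIN_TONE_FIRST <= ord(ch) <= SKIN_TONE_LAST


def strip_skin_tones(text: str) -> str:
    """Remove skin tone modifiers, leaving the base emoji in place."""
    return "".join(ch for ch in text if not is_skin_tone_modifier(ch))


wave_light = "\U0001F44B\U0001F3FB"  # 👋🏻
print(len(wave_light))                                # 2 code points (4 in JavaScript)
print(strip_skin_tones(wave_light) == "\U0001F44B")  # True
```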

Flag Sequences: Countries Built from Letters

Here's where it gets really interesting. Country flags are actually sequences of two Regional Indicator characters:

// Flag examples - each is 2 Regional Indicator characters
const usFlag = '🇺🇸';        // U+1F1FA (🇺) + U+1F1F8 (🇸) = "US"
const taiwanFlag = '🇹🇼';     // U+1F1F9 (🇹) + U+1F1FC (🇼) = "TW"
const japanFlag = '🇯🇵';      // U+1F1EF (🇯) + U+1F1F5 (🇵) = "JP"

// All flags have length 4 in JavaScript
console.log(usFlag.length);     // 4
console.log(taiwanFlag.length); // 4
console.log(japanFlag.length);  // 4

// Breaking them down shows the Regional Indicators
console.log(Array.from(usFlag));     // ['🇺', '🇸']
console.log(Array.from(taiwanFlag)); // ['🇹', '🇼']

// You can even manually construct flags:
const manualTW = '🇹🇼';           
const constructedTW = '\u{1F1F9}\u{1F1FC}'; // Same thing!
console.log(manualTW === constructedTW);     // true
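Since the Regional Indicator characters run from 🇦 (U+1F1E6) to 🇿 in alphabetical order, a two-letter country code maps to a flag with a bit of arithmetic. Here is a sketch in Python; `flag_from_country_code` is a made-up name, and the function does not check whether the pair is a real country code, so an unknown pair will just render as two letter symbols.

```python
REGIONAL_INDICATOR_A = 0x1F1E6  # 🇦 REGIONAL INDICATOR SYMBOL LETTER A


def flag_from_country_code(code: str) -> str:
    """Map a two-letter code such as 'TW' to its Regional Indicator pair."""
    code = code.upper()
    if len(code) != 2 or not all("A" <= c <= "Z" for c in code):
        raise ValueError(f"expected two ASCII letters, got {code!r}")
    return "".join(chr(REGIONAL_INDICATOR_A + ord(c) - ord("A")) for c in code)


taiwan = flag_from_country_code("TW")
print(taiwan == "\U0001F1F9\U0001F1FC")  # True
print(len(taiwan))                       # 2 code points (4 in JavaScript)
```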

The Experiment: Interactive Exploration

At this point, I was fascinated by the complexity. How many different emoji combinations exist? What do they actually contain?

So I built "Emoji Architect" to explore these compositions interactively. With this tool you can:

Browse all ZWJ sequence emojis
See the real-time breakdown of any emoji (base emoji & modifiers)
See JavaScript length calculations live

🔗 https://www.thingsaboutweb.dev/en/emojiarchitect

The Solution: Synchronized Character Counting

Now that we understand the problem, let's fix our backend and frontend validation mismatch.

The Core Issue: Different Counting Methods

Our original problem stemmed from inconsistent character counting:

# Python
len("Hello 👨‍💻 World")  # 15

// JavaScript
"Hello 👨‍💻 World".length  // 17

My Initial Approach: Visual Character Counting

Initially, I was drawn to visual character counting because it felt like the right user experience. Why should users have to understand Unicode internals just to know whether their post fits the character limit? Luckily, Intl.Segmenter can help us here.

JavaScript implementation:

function getVisualLength(str) {
    if (typeof Intl.Segmenter !== 'undefined') {
        const segmenter = new Intl.Segmenter('en', {granularity: 'grapheme'});
        return [...segmenter.segment(str)].length;
    }
    // Fallback for older browsers
    return [...str].length;
}

getVisualLength("Hello!! I am a developer 👨‍💻 who loves 🧋 and 🏃‍♂️!!") 
// Result: 46 
// what users actually see

And for Python:

import regex  # pip install regex

def get_visual_length(text):
    """Count grapheme clusters like JavaScript's Intl.Segmenter"""
    return len(regex.findall(r'\X', text))

get_visual_length("Hello!! I am a developer 👨‍💻 who loves 🧋 and 🏃‍♂️!!") 
# Result: 46
# consistent with frontend

This seemed perfect—users would see exactly what they expect, and complex emoji sequences would count as single characters regardless of their Unicode complexity.

Reality Check: Database Constraints

But then I tried to save the data to the database...

// Frontend validation passes with visual counting
const bio = "Hello!! I am a developer 👨‍💻 who loves 🧋 and 🏃‍♂️!!";
console.log(`Visual length: ${getVisualLength(bio)}`);  // 46 characters ✅

But the database insertion fails spectacularly:

SELECT CHAR_LENGTH('Hello!! I am a developer 👨‍💻 who loves 🧋 and 🏃‍♂️!!');
/* 51! That exceeds a VARCHAR(50) column. */

Since there's no easy way to make the database limit content by visual length, the most practical approach is to settle for code point counting.

Second Solution: Count Code Points

/**
 * Count Unicode code points, like Python's len() does.
 * This matches Python's default string length behavior and the database constraint.
 */
function codePointLength(text) {
    // Spreading a string iterates by code point, not by UTF-16 code unit
    return [...text].length;
}


const text = "Hello!! I am a developer 👨‍💻 who loves 🧋 and 🏃‍♂️!!";
console.log(`Frontend: ${codePointLength(text)}`);     
// 51 (which matches Python & DB)

Code point counting might not be as user-friendly as visual counting, but it works reliably across your entire stack.
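One caveat when truncating by code points, as the migration script did: a plain slice can land in the middle of a ZWJ sequence and leave a dangling joiner or a lone modifier behind. The sketch below backs the cut up until it no longer splits a sequence. It is a heuristic built only on the standard library, not a full grapheme cluster implementation (for example, it does not keep flag pairs together), and `truncate_code_points` is a hypothetical helper name.

```python
import unicodedata

ZWJ = "\u200D"   # ZERO WIDTH JOINER
VS16 = "\uFE0F"  # VARIATION SELECTOR-16


def _attaches_to_previous(ch: str) -> bool:
    """True for characters that modify or extend the character before them."""
    return (
        ch in (ZWJ, VS16)
        or 0x1F3FB <= ord(ch) <= 0x1F3FF      # skin tone modifiers
        or unicodedata.combining(ch) != 0     # combining marks such as accents
    )


def truncate_code_points(text: str, limit: int) -> str:
    """Cut text to at most `limit` code points without splitting an emoji sequence."""
    if len(text) <= limit:
        return text
    cut = limit
    # Move left while the next character belongs to the previous one,
    # or the kept part would end with a joiner that has lost its right-hand side.
    while cut > 0 and (_attaches_to_previous(text[cut]) or text[cut - 1] == ZWJ):
        cut -= 1
    return text[:cut]


family = "\U0001F468\u200D\U0001F468\u200D\U0001F466\u200D\U0001F466"  # 👨‍👨‍👦‍👦
text = "ab" + family
print(len(text[:4]))                       # 4, but ends with a dangling ZWJ
print(len(truncate_code_points(text, 4)))  # 2, the whole family emoji is dropped
```

The result can be shorter than the limit, which seems like a fair trade for never storing half an emoji.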

Key Takeaways

After diving deep into Unicode specifications and building production solutions, here is what I learned:

Test Across the Entire Stack: Visual character counting seems ideal, but database constraints and ecosystem compatibility often force you toward Unicode code point counting. Choose the solution that works with your entire stack, not just the frontend.
Test with Complex Emoji Early: Don't just test with simple emojis like 😀. Include ZWJ sequences (👨‍💻), skin tone modifiers (👋🏻), and flag emojis (🇹🇼) in your validation logic from day one.

It's fun to poke around in Unicode and emoji, and I hope my Emoji Architect tool can help you see how emojis are built.

DEV Community