Fin Chen


💥 The Day Emoji Caused a Mess: A Developer's 🚨🔍🐛🧪✅ Journey 👨‍💻

The Problem: When Database Migration Breaks Frontend Display

We once decided to add a 280-character limit to our posts table, but there was a problem: some existing posts were already longer than that.

So we wrote a Python migration script to trim the existing data to fit the new column:

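# Migration script: clean up existing long posts
# (assumes the Django ORM; Post is the app's own model, import omitted)
from django.db.models.functions import Length

MAX_LENGTH = 280

def truncate_existing_posts():
    # Annotate each row with its database-side length, then filter on it
    long_posts = Post.objects.annotate(content_len=Length("content")).filter(content_len__gt=MAX_LENGTH)

    truncated = 0
    for post in long_posts:
        post.content = post.content[:MAX_LENGTH]  # Truncate to 280 characters
        post.save()
        truncated += 1

    print(f"Truncated {truncated} posts to {MAX_LENGTH} characters")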

# Example of what the migration does
original_post = "Amazing day with the family! 👨‍👨‍👦‍👦 Can't wait for more adventures together! 🚀🎨 Life is beautiful when you're surrounded by love and joy!"
print(f"Original length: {len(original_post)}")  # roughly 140 by Python's count, well under 280

truncated_post = original_post[:280]  # Well within the limit, so nothing is cut
print(f"Truncated length: {len(truncated_post)}")  # unchanged ✅
print(f"Truncated content: {truncated_post}")

The migration ran successfully. All posts were now 280 characters or fewer according to Python, so we deployed the new schema along with character validation on the frontend.

But then users started reporting that their posts were rejected as invalid when they tried to edit them.

The Investigation: When Character Counting Goes Wrong

After some digging, we found that Python and JavaScript report different lengths for the same string when it contains emoji.

# Python
text_with_emoji = "Hello 👨‍💻 World"
print(f"Python len(): {len(text_with_emoji)}")  # 15 characters

# Let's see what Python sees
for i, char in enumerate(text_with_emoji):
    print(f"{i}: {repr(char)}")
# 0: 'H', 1: 'e', 2: 'l', 3: 'l', 4: 'o', 5: ' ', 6: '👨', 7: '\u200d', 8: '💻', 9: ' ', 10: 'W', 11: 'o', 12: 'r', 13: 'l', 14: 'd'

// JavaScript
const textWithEmoji = "Hello 👨‍💻 World";
console.log(`JavaScript length: ${textWithEmoji.length}`); // 17 characters

// Let's see what JavaScript sees
console.log([...textWithEmoji]);
// ['H', 'e', 'l', 'l', 'o', ' ', '👨', '‍', '💻', ' ', 'W', 'o', 'r', 'l', 'd']

It turns out that Python counts Unicode code points, while JavaScript counts UTF-16 code units. The same emoji is counted differently.
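
To make the difference concrete, here is a minimal Python sketch that computes both counts for the same string. The helper name `utf16_length` is made up for this illustration; it mimics JavaScript's `.length` by encoding to UTF-16 and counting 2-byte units.

```python
def utf16_length(text: str) -> int:
    """Count UTF-16 code units, which is what JavaScript's .length reports."""
    # utf-16-le avoids the 2-byte BOM that plain "utf-16" would prepend
    return len(text.encode("utf-16-le")) // 2


# "Hello 👨‍💻 World", written with escapes so the invisible ZWJ is explicit
text = "Hello \U0001F468\u200D\U0001F4BB World"

print(len(text))           # 15 code points (Python's view)
print(utf16_length(text))  # 17 code units (JavaScript's view)
```

The two extra units come from 👨 and 💻: both sit outside the Basic Multilingual Plane, so UTF-16 needs a surrogate pair for each.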

Three emojis, three lengths. Why???

The Curiosity Detour

At this point, the fix was obvious - just align the character counting methods between frontend and backend. Problem solved!

But another thing caught my eye:

// Simple emoji
const man = '👨';
console.log(man.length); // 2 in JS, 1 in Python

const developer = '👨‍💻';
console.log(developer.length); // 5 in JS, 3 in Python (?!)

const family = '👨‍👨‍👦‍👦';
console.log(family.length); // 11 in JS, 7 in Python (!!!)

Why did 👨‍👨‍👦‍👦 count as 11 characters in JavaScript but only 7 in Python? What exactly was hiding inside that "single" family emoji?

That curiosity led me to study Unicode emoji specifications...

The Deep Dive: Unicode Emoji Specifications

So what's really happening under the hood? Let's examine the Unicode specifications that govern emoji behavior.

Zero-Width Joiners: The Invisible Glue

Some emojis are built from several base emojis glued together by Zero-Width Joiners (ZWJ, U+200D), which are invisible characters:

// Let's break down the developer emoji
const developer = '👨‍💻';
console.log(Array.from(developer));
// Output: ['👨', '‍', '💻']
// That's: man + ZWJ + computer

// The family emoji breakdown
const family = '👨‍👨‍👦‍👦';
console.log(Array.from(family));
// Output: ['👨', '‍', '👨', '‍', '👦', '‍', '👦']
// That's: man + ZWJ + man + ZWJ + boy + ZWJ + boy
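
The same breakdown can be done on the Python side. This is a small sketch using the standard `unicodedata` module; the `describe` helper is a name invented here, not part of any library.

```python
import unicodedata

ZWJ = "\u200d"  # ZERO WIDTH JOINER


def describe(emoji: str) -> list[str]:
    """Return the Unicode name of every code point in the string."""
    # Fall back to the U+XXXX form if the local Unicode database lacks a name
    return [unicodedata.name(ch, f"U+{ord(ch):04X}") for ch in emoji]


developer = "\U0001F468\u200D\U0001F4BB"  # 👨‍💻
print(describe(developer))
# ['MAN', 'ZERO WIDTH JOINER', 'PERSONAL COMPUTER']

family = "\U0001F468\u200D\U0001F468\u200D\U0001F466\u200D\U0001F466"  # 👨‍👨‍👦‍👦
print(len(family.split(ZWJ)))  # 4 base emojis
```

Splitting on the ZWJ recovers the base emojis, which is handy when you want to log or debug what a user actually typed.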

Emoji Presentation Selector: The Style Controller

Another invisible character: U+FE0F VARIATION SELECTOR-16 (VS16), the emoji presentation selector. This forces a symbol to display in emoji style rather than text style:

// Medical symbol without emoji presentation selector
const textSymbol = '⚕';     // U+2695 only
console.log(textSymbol.length); // 1

// Medical symbol WITH emoji presentation selector
const emojiSymbol = '⚕️';    // U+2695 + U+FE0F
console.log(emojiSymbol.length); // 2
console.log(Array.from(emojiSymbol)); // ['⚕', '️']

Many common emojis include this hidden selector:

// These look like single characters but aren't:
const heart = '❤️';         // U+2764 + U+FE0F
const warning = '⚠️';       // U+26A0 + U+FE0F

console.log(Array.from(heart));     // ['❤', '️'] - has VS16
console.log(Array.from(warning));   // ['⚠', '️'] - has VS16

From the Unicode specification:

  • 2695 FE0F (fully-qualified): ⚕️ medical symbol (emoji presentation)

  • 2695 (unqualified): ⚕ medical symbol (text presentation)
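
One practical consequence is that the fully-qualified and unqualified forms are different strings, so a naive equality check between them fails. The sketch below shows this in Python; `strip_vs16` is a hypothetical helper, and whether you want to normalize this way depends on your application.

```python
VS16 = "\ufe0f"  # VARIATION SELECTOR-16, the emoji presentation selector


def strip_vs16(text: str) -> str:
    """Remove emoji presentation selectors, leaving the bare symbols."""
    return text.replace(VS16, "")


heart_emoji = "\u2764\ufe0f"  # ❤️ fully-qualified
heart_text = "\u2764"         # ❤ unqualified

print(len(heart_emoji), len(heart_text))      # 2 1
print(heart_emoji == heart_text)              # False
print(strip_vs16(heart_emoji) == heart_text)  # True
```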

Emoji Modifiers: Skin tones

For people emojis, we can use modifiers for customization:

// Skin tone modifiers
const artist = '🧑🏻‍🎨'; // light skin tone
console.log(artist.length); // 7
console.log(Array.from(artist));
// ['🧑', '🏻', '‍', '🎨']
// That's: person + light skin tone + ZWJ + artist palette

// Professional emojis with skin tone modifiers
const lightDeveloper = '👨🏻‍💻'; // light skin male developer
console.log(lightDeveloper.length); // 7
console.log(Array.from(lightDeveloper));
// ['👨', '🏻', '‍', '💻']

The five skin tone modifiers are:

  • 🏻 (Light skin tone - U+1F3FB)

  • 🏼 (Medium-light skin tone - U+1F3FC)

  • 🏽 (Medium skin tone - U+1F3FD)

  • 🏾 (Medium-dark skin tone - U+1F3FE)

  • 🏿 (Dark skin tone - U+1F3FF)

// Same base emoji, different lengths due to modifiers
const wave = '👋';           // 2 characters
const waveLight = '👋🏻';     // 4 characters
const waveDark = '👋🏿';      // 4 characters

// Comparing different combinations
const doctor = '👩‍⚕️';       // 5 characters (woman + ZWJ + medical + VS16)
const lightDoctor = '👩🏻‍⚕️'; // 7 characters (+ skin tone modifier)
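
Since the five modifiers occupy one contiguous range (U+1F3FB to U+1F3FF), detecting or removing them takes only a range check. Here is a minimal Python sketch; the function names are made up for this example.

```python
SKIN_TONE_FIRST = 0x1F3FB  # light skin tone
SKIN_TONE_LAST = 0x1F3FF   # dark skin tone


def is_skin_tone(ch: str) -> bool:
    """True if the code point is one of the five skin tone modifiers."""
    return SKIN_TONE_FIRST <= ord(ch) <= SKIN_TONE_LAST


def strip_skin_tones(text: str) -> str:
    """Drop skin tone modifiers, leaving the base emoji sequence."""
    return "".join(ch for ch in text if not is_skin_tone(ch))


wave_light = "\U0001F44B\U0001F3FB"                       # 👋🏻
light_developer = "\U0001F468\U0001F3FB\u200D\U0001F4BB"  # 👨🏻‍💻

print(len(wave_light), len(strip_skin_tones(wave_light)))            # 2 1
print(len(light_developer), len(strip_skin_tones(light_developer)))  # 4 3
```

The lengths printed here are Python code points, so they are smaller than the JavaScript numbers above.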

Flag Sequences: Countries Built from Letters

Here's where it gets really interesting. Country flags are actually sequences of two Regional Indicator characters:

// Flag examples - each is 2 Regional Indicator characters
const usFlag = '🇺🇸';        // U+1F1FA (🇺) + U+1F1F8 (🇸) = "US"
const taiwanFlag = '🇹🇼';     // U+1F1F9 (🇹) + U+1F1FC (🇼) = "TW"
const japanFlag = '🇯🇵';      // U+1F1EF (🇯) + U+1F1F5 (🇵) = "JP"

// All flags have length 4 in JavaScript
console.log(usFlag.length);     // 4
console.log(taiwanFlag.length); // 4
console.log(japanFlag.length);  // 4

// Breaking them down shows the Regional Indicators
console.log(Array.from(usFlag));     // ['🇺', '🇸']
console.log(Array.from(taiwanFlag)); // ['🇹', '🇼']

// You can even manually construct flags:
const manualTW = '🇹🇼';
const constructedTW = '\u{1F1F9}\u{1F1FC}'; // Same thing!
console.log(manualTW === constructedTW);     // true
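
Because the Regional Indicators run from U+1F1E6 (A) to U+1F1FF (Z) in alphabetical order, a flag can be computed from a two-letter country code. The Python sketch below assumes plain ASCII input; the function name is invented for illustration, and whether the result renders as a flag depends on the platform's font.

```python
REGIONAL_INDICATOR_A = 0x1F1E6  # REGIONAL INDICATOR SYMBOL LETTER A


def flag_from_country_code(code: str) -> str:
    """Map a two-letter code such as 'TW' to its Regional Indicator pair."""
    code = code.upper()
    if len(code) != 2 or not all("A" <= c <= "Z" for c in code):
        raise ValueError(f"expected two ASCII letters, got {code!r}")
    # Shift each letter by its offset from 'A' into the Regional Indicator range
    return "".join(chr(REGIONAL_INDICATOR_A + ord(c) - ord("A")) for c in code)


print(flag_from_country_code("TW") == "\U0001F1F9\U0001F1FC")  # True
print(len(flag_from_country_code("US")))                       # 2 code points
```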

The Experiment: Interactive Exploration

At this point, I was fascinated by the complexity. How many different emoji combinations exist? What do they actually contain?

So I built "Emoji Architect" to explore these compositions interactively. With this tool you can:

  • Browse all ZWJ sequence emojis
  • See the real-time breakdown of any emoji (base emoji & modifiers)
  • See JavaScript length calculations live

🔗 https://www.thingsaboutweb.dev/en/emojiarchitect

The Solution: Synchronized Character Counting

Now that we understand the problem, let's fix our backend and frontend validation mismatch.

The Core Issue: Different Counting Methods

Our original problem stemmed from inconsistent character counting:

# Python
len("Hello 👨‍💻 World")  # 15

// JavaScript
"Hello 👨‍💻 World".length  // 17

My Initial Approach: Visual Character Counting

Initially, I was drawn to visual character counting because it felt like the right user experience. Why should users have to understand Unicode internals just to know if their post fits the character limit? Luckily, Intl.Segmenter can help here.

JavaScript implementation:

function getVisualLength(str) {
    if (typeof Intl.Segmenter !== 'undefined') {
        const segmenter = new Intl.Segmenter('en', {granularity: 'grapheme'});
        return [...segmenter.segment(str)].length;
    }
    // Fallback for older browsers: counts code points, not graphemes
    return [...str].length;
}

getVisualLength("Hello!! I am a developer 👨‍💻 who loves 🧋 and 🏃‍♂️!!")
// Result: 46
// what users actually see

And for Python:

import regex  # pip install regex

def get_visual_length(text):
    """Count grapheme clusters like JavaScript's Intl.Segmenter"""
    return len(regex.findall(r'\X', text))

get_visual_length("Hello!! I am a developer 👨‍💻 who loves 🧋 and 🏃‍♂️!!")
# Result: 46
# consistent with frontend

This seemed perfect: users would see exactly what they expect, and complex emoji sequences would count as single characters regardless of their Unicode complexity.

Reality Check: Database Constraints

But then I tried to save the data to the database:

// Frontend validation passes with visual counting
const bio = "Hello!! I am a developer 👨‍💻 who loves 🧋 and 🏃‍♂️!!";
console.log(`Visual length: ${getVisualLength(bio)}`);  // 46 characters ✅

But the database insertion fails spectacularly:

-- 👨‍💻 is 👨 + U+200D + 💻, 🧋 is U+1F9CB, 🏃‍♂️ is 🏃 + U+200D + ♂ + U+FE0F
SELECT CHAR_LENGTH('Hello!! I am a developer 👨‍💻 who loves 🧋 and 🏃‍♂️!!');
/* 51! Which fails on a VARCHAR(50) constraint! */

Since there's no easy way to make the database limit content by visual length, the most practical approach is to settle for code point counting.

Second Solution: Count Code Points

function codePointLength(text) {
    /**
     * Count Unicode code points like Python len() does
     * This matches Python's default string length behavior and database constraints
     */
    return [...text].length;
}

const text = "Hello!! I am a developer 👨‍💻 who loves 🧋 and 🏃‍♂️!!";
console.log(`Frontend: ${codePointLength(text)}`);
// 51 (which matches Python & DB)

Code point counting might not be as user-friendly as visual counting, but it works reliably across your entire stack.
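
One related gotcha when truncating by code points, as the migration script did with a plain slice: the cut can land in the middle of a ZWJ sequence and leave a dangling joiner or half an emoji. The sketch below is a simplified heuristic, not a full grapheme segmenter. It only backs off across ZWJ, VS16, and skin tone modifiers, and it does not handle flag pairs; the function names are invented for this example.

```python
ZWJ = "\u200d"
VS16 = "\ufe0f"


def _attaches_to_previous(ch: str) -> bool:
    """True for code points that modify or join onto the one before them."""
    return ch in (ZWJ, VS16) or 0x1F3FB <= ord(ch) <= 0x1F3FF


def truncate_code_points(text: str, limit: int) -> str:
    """Cut to at most `limit` code points without splitting an emoji sequence."""
    if len(text) <= limit:
        return text
    cut = limit
    # Move the cut left while it would separate a joiner or modifier
    # from its base, or strand the emoji that follows a ZWJ.
    while cut > 0 and (_attaches_to_previous(text[cut]) or text[cut - 1] == ZWJ):
        cut -= 1
    return text[:cut]


developer = "\U0001F468\u200D\U0001F4BB"  # 👨‍💻
print(truncate_code_points("ab" + developer, 4) == "ab")  # True: drops the whole emoji
print(truncate_code_points("hello", 3))                   # hel
```

For anything user-facing, a real grapheme segmenter such as the `regex` module's `\X` is the safer tool; this only illustrates why a bare `[:280]` slice is risky.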

Key Takeaways

After diving deep into Unicode specifications and building production solutions, here's what I learned:

  • Test Across the Entire Stack: Visual character counting seems ideal, but database constraints and ecosystem compatibility often force you toward Unicode code point counting. Choose the solution that works with your entire stack, not just the frontend.
  • Test with Complex Emoji Early: Don't just test with simple emojis like 😀. Include ZWJ sequences (👨‍💻), skin tone modifiers (👋🏻), and flag emojis (🇹🇼) in your validation logic from day one.

It's fun to poke around in Unicode and emoji, and I hope my Emoji Architect tool can help show how emojis are built.
