🌶️ SCP Emoji Attack: Exploiting Unicode for Storage Exhaustion in Messaging Systems

#database #systemdesign #validation #programming

Abstract

Many messaging platforms enforce message length limits in terms of characters rather than bytes. Because UTF-8 is a variable-width encoding, an emoji can count as a single “character” while consuming up to 4 bytes. This article introduces the SCP Emoji Attack (Single Code Point Emoji Attack), a method that uses 4-byte, single-code-point emojis to inflate storage usage and push databases toward denial-of-service conditions.

1. Introduction

Messaging systems typically restrict messages to a maximum number of characters (e.g., 100 characters). Developers often assume that one character ≈ one byte. However, Unicode breaks this assumption:

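ASCII letters/numbers → 1 byte each
Single Code Point (SCP) emojis outside the Basic Multilingual Plane (🚀, 🔥, 😀) → 4 bytes each
Multi-code-point emoji sequences (e.g., ❤️ = 2 code points, 6 bytes; 🏴‍☠️ = 4 code points, 13 bytes) → more bytes per emoji, but also more code points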

This discrepancy creates an amplification gap between the perceived limit and the actual storage used. SCP emojis give the attacker the highest byte count per counted character (4 bytes is the UTF-8 maximum for one code point), which makes the SCP Emoji Attack more practical than complex multi-code-point payloads.
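The gap between counted characters and stored bytes is easy to check. The sketch below is only an illustration: the `samples` dictionary and the `measure` helper are names made up for this example, and Python's `len()` is used because it counts code points.

```python
# Compare code-point count and UTF-8 byte length for a few sample strings.
# Escape sequences are used so the exact code points are visible.
samples = {
    "ASCII letter": "A",
    "rocket": "\U0001F680",                          # one code point
    "red heart": "\u2764\uFE0F",                     # U+2764 + variation selector
    "pirate flag": "\U0001F3F4\u200D\u2620\uFE0F",   # ZWJ sequence, four code points
}


def measure(text: str) -> tuple[int, int]:
    """Return (code points, UTF-8 bytes) for text."""
    return len(text), len(text.encode("utf-8"))


for name, text in samples.items():
    code_points, utf8_bytes = measure(text)
    print(f"{name}: {code_points} code point(s), {utf8_bytes} UTF-8 byte(s)")
```

Only the rocket combines a count of one with the full 4 bytes; the heart and the flag cost extra code points for their extra bytes.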

2. Why SCP Emojis Instead of Complex Emojis?

At first glance, an emoji with a larger per-emoji byte payload (e.g., 🏴‍☠️ = 13 bytes) seems like the stronger attack vector. However, SCP emojis are superior in practice for three reasons:

1- Character Limit Constraints

Most messaging systems enforce limits by code points or characters.
A complex emoji like 🏴‍☠️ consumes 4 code points, reducing how many can fit inside a 100-character message (25 maximum).
SCP emojis consume only 1 code point each, so a full 100 fit in the same message, maximizing the stored byte count.

2- Simplicity & Reliability

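SCP emojis are widely supported and are stored as a single, fixed 4-byte sequence. The main caveat is the counting unit: platforms that measure length in UTF-16 code units (for example, JavaScript's String.length) count 🚀 as 2, which halves how many fit.
Complex emojis behave less consistently across platforms (some count each code point, others count the whole sequence as one character).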
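To see how the same string gets a different "length" depending on the counting unit, here is a small sketch. The `length_views` helper is a hypothetical name for this example; the UTF-16 figure is derived by encoding to UTF-16 and dividing the byte count by two, which matches what a UTF-16-based length check would report.

```python
def length_views(text: str) -> dict[str, int]:
    """Report the length of text under three common counting units."""
    return {
        "code_points": len(text),                            # what Python's len() counts
        "utf16_units": len(text.encode("utf-16-le")) // 2,   # what a UTF-16 length check counts
        "utf8_bytes": len(text.encode("utf-8")),             # what UTF-8 storage consumes
    }


rocket = "\U0001F680"
pirate_flag = "\U0001F3F4\u200D\u2620\uFE0F"

print("rocket:", length_views(rocket))
print("pirate flag:", length_views(pirate_flag))
```

A validator and a database that disagree on the unit will also disagree on whether a message is within the limit, so it is worth checking which unit each layer actually uses.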

3- Amplification Efficiency

🏴‍☠️: 25 × 13 bytes = 325 bytes in a 100-character field.
🚀: 100 × 4 bytes = 400 bytes in the same field.

Despite the lower per-emoji size, SCP emojis produce a larger final payload because every counted character carries 4 bytes.

Thus, SCP emojis are the optimal choice for maximizing stored bytes within character-limited messaging fields.
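The comparison above can be reproduced with a few lines of arithmetic. This is a sketch that assumes the limit is enforced in code points; `max_payload_bytes` is a hypothetical helper, not part of any library.

```python
def max_payload_bytes(emoji: str, char_limit: int = 100) -> tuple[int, int]:
    """Return (repetitions, total UTF-8 bytes) for the most copies of `emoji`
    that fit in a field limited to `char_limit` code points."""
    repetitions = char_limit // len(emoji)   # len() counts code points
    return repetitions, repetitions * len(emoji.encode("utf-8"))


rocket = "\U0001F680"
pirate_flag = "\U0001F3F4\u200D\u2620\uFE0F"

print("rocket:", max_payload_bytes(rocket))
print("pirate flag:", max_payload_bytes(pirate_flag))
```

If a platform counted grapheme clusters instead of code points, the result would flip in favor of long sequences, so the conclusion depends on the counting rule assumed here.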

3. Attack Example: MariaDB with 50 GB Memory

Setup:

Database: MariaDB, utf8mb4, column VARCHAR(100)
Message limit: 100 characters
Payload: 🚀 (single-code-point emoji, 4 bytes each)

Calculation:

Per message: 100 × 4 = 400 bytes
If an attacker inserts 150 million messages:
- Naïve developer expectation (1 byte/char): ~15 GB
- Actual usage (SCP Emoji Attack): ~60 GB

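These figures count payload bytes only; row overhead and indexes add more. The actual total overshoots a 50 GB allocation sized from the naïve estimate, which can lead to:

Buffer pool pressure (the working set no longer fits in memory)
Slow queries and replication lag
Backup inflation
Eventual denial-of-service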
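The estimate in this section can be checked with a short calculation. This sketch uses decimal gigabytes (10^9 bytes) and counts payload bytes only; the constant names are made up for this example.

```python
MESSAGES = 150_000_000      # messages inserted by the attacker
CHARS_PER_MESSAGE = 100     # character limit per message
BUDGET_GB = 50              # allocation from the example
GB = 10**9                  # decimal gigabyte


def payload_gb(bytes_per_char: int) -> float:
    """Total payload size in GB if every character takes bytes_per_char bytes."""
    return MESSAGES * CHARS_PER_MESSAGE * bytes_per_char / GB


naive_gb = payload_gb(1)    # developer assumption: 1 byte per character
actual_gb = payload_gb(4)   # every character is a 4-byte emoji

print(f"naive estimate: {naive_gb:.0f} GB")
print(f"actual payload: {actual_gb:.0f} GB")
print(f"over budget: {actual_gb > BUDGET_GB}")
```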

4. Mitigation Strategies

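Enforce byte-based limits alongside character limits. In MariaDB, VARCHAR(n) counts characters, so widening the column to VARCHAR(400) raises the limit instead of capping bytes; check the encoded byte length in the application, or add a CHECK constraint on OCTET_LENGTH(), which returns bytes.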
Validate and normalize emojis at input.
Monitor for emoji-dense traffic patterns.
Rate-limit message creation per user.
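A byte-aware check at the application layer might look like the sketch below. The limits `MAX_CHARS` and `MAX_BYTES` and the function `validate_message` are assumptions for illustration; the right byte budget depends on how much non-ASCII text legitimate users need.

```python
MAX_CHARS = 100   # assumed character (code point) limit
MAX_BYTES = 200   # assumed UTF-8 byte budget per message


def validate_message(text: str,
                     max_chars: int = MAX_CHARS,
                     max_bytes: int = MAX_BYTES) -> bool:
    """Accept a message only if it fits both the character and the byte limit."""
    if len(text) > max_chars:
        return False
    try:
        encoded = text.encode("utf-8")
    except UnicodeEncodeError:
        # Lone surrogates cannot be encoded as UTF-8; reject them.
        return False
    return len(encoded) <= max_bytes


rocket = "\U0001F680"
print(validate_message("a" * 100))      # ASCII at the character limit
print(validate_message(rocket * 100))   # within the character limit, over the byte budget
print(validate_message(rocket * 50))    # exactly at the byte budget
```

Checking both limits keeps the familiar character limit for users while bounding what each row can cost in storage.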

5. Conclusion

The SCP Emoji Attack demonstrates that even “simple” single-code-point emojis can be weaponized to amplify database storage consumption. By choosing SCP emojis over complex grapheme clusters, attackers maximize per-message payload, get more consistent behavior across platforms, and stay within the visible character limit, so each message still passes validation.

This highlights the necessity of byte-aware validation and Unicode-conscious design in modern messaging platforms.
