Detecting Invisible Code: A 30-Line Scanner for Unicode Steganography

#security #python #javascript #opensource

Steganography is older than computers. Invisible ink, microdots, messages carved under wax tablets — humans have been hiding data in plain sight for millennia. What's new is that your package manager will run it for you.

There's a class of malware that hides executable payloads in characters you literally cannot see. Not obfuscated. Not minified. Invisible. The code is there — your editor just doesn't render it.

How It Works

Unicode has hundreds of characters that take up zero visual space: zero-width spaces (U+200B), zero-width joiners (U+200D), variation selectors (U+FE00-FE0F), and others. Normally harmless. But string them together in a specific pattern and you can encode arbitrary binary data.
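To make that concrete, here is a minimal sketch of one possible encoding, assuming the simplest scheme: one invisible character per bit. The names hide and reveal and the choice of U+200B/U+200C are mine, for illustration only; real samples may use other characters (variation selectors, for instance) and other mappings.

```python
# Illustrative only: one possible bit-to-character mapping (an assumption,
# not the scheme used by any specific malware family).
ZERO = "\u200b"  # zero-width space      -> bit 0
ONE = "\u200c"   # zero-width non-joiner -> bit 1


def hide(data: bytes) -> str:
    """Encode bytes as a run of invisible characters, 8 per byte."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(ONE if bit == "1" else ZERO for bit in bits)


def reveal(text: str) -> bytes:
    """Decode a run produced by hide(); every other character is ignored."""
    bits = "".join("1" if ch == ONE else "0" for ch in text if ch in (ZERO, ONE))
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


# The string literal looks like "hello" in most editors,
# but it carries 16 extra invisible characters (2 bytes x 8 bits).
carrier = 'greeting = "hello' + hide(b"hi") + '"'
print(reveal(carrier))  # b'hi'
```

Nothing here executes the hidden bytes. The point is only that a visually empty stretch of a string can carry real data.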

The attack:

1. The attacker publishes an npm package (or any repo) with what looks like normal code.
2. Buried in a string literal is a long sequence of invisible Unicode characters.
3. A small decoder function maps those characters back to executable JavaScript.
4. eval() runs it.

Your editor shows a normal-looking file with maybe a suspicious gap. The actual payload is hiding in plain sight — or rather, hiding in plain absence.

Recent examples (Glassworm, KOI loader) go further: the decoded payload contacts Solana blockchain addresses for command-and-control instructions, making the C2 infrastructure decentralized and nearly impossible to take down.

Why Your Editor Doesn't Catch It

Most editors follow the Unicode spec faithfully. Zero-width characters are supposed to be invisible; that is their defined behavior. VS Code has settings to highlight them (editor.unicodeHighlight.invisibleCharacters), but depending on your configuration and on which characters the attacker picked, the highlighting may not flag a sequence buried in a string literal.

The real problem: no one reads dependency code character by character. You audit the logic, not the encoding.

The Detection Logic

The good news — this is trivially detectable with static analysis. You don't need AI, you don't need a fancy tool. You need string iteration.

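The pattern to catch: sequences of 3+ consecutive invisible Unicode characters. Normal code almost never has this. Copy-paste artifacts produce one or two stray characters, not runs of dozens or hundreds. The main legitimate exception is emoji: some flag emoji are built from a run of tag characters, so a short hit right after an emoji deserves a look before you reject the file.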

Here are the codepoint ranges to watch:

INVISIBLE_RANGES = {
    'zero-width space':      (0x200B, 0x200B),
    'zero-width non-joiner': (0x200C, 0x200C),
    'zero-width joiner':     (0x200D, 0x200D),
    'LTR/RTL marks':         (0x200E, 0x200F),
    'bidi controls':         (0x202A, 0x202E),
    'word joiner':           (0x2060, 0x2060),
    'invisible operators':   (0x2061, 0x2064),
    'variation selectors':   (0xFE00, 0xFE0F),
    'BOM':                   (0xFEFF, 0xFEFF),
    'variation selectors+':  (0xE0100, 0xE01EF),
    'tag characters':        (0xE0001, 0xE007F),
}

def is_invisible(char):
    cp = ord(char)
    return any(low <= cp <= high for low, high in INVISIBLE_RANGES.values())

And the scanner itself:

def scan_for_stego(code):
    """Find sequences of consecutive invisible Unicode characters."""
    findings = []
    for line_num, line in enumerate(code.split('\n'), 1):
        run_length = 0
        for char in line:
            if is_invisible(char):
                run_length += 1
            else:
                if run_length >= 3:
                    findings.append((line_num, run_length))
                run_length = 0
        if run_length >= 3:
            findings.append((line_num, run_length))
    return findings

That's it. Run this against any file and you'll catch the pattern. A run of 3+ invisible characters in source code is abnormal. A run of 50+ is almost certainly a payload.
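If you prefer regular expressions, the same check can probably be written as a single character class. The sketch below is one way to do it, not the only one: find_runs, scan_path, and main are hypothetical names, and it assumes files are UTF-8 and silently skips anything that fails to decode. It also reports the column where each run starts, which helps on long lines.

```python
import re
import sys
from pathlib import Path

# Same codepoint ranges as INVISIBLE_RANGES above, written as one character
# class. {3,} matches a maximal run of three or more such characters.
INVISIBLE_RUN = re.compile(
    "[\u200b-\u200f\u202a-\u202e\u2060-\u2064\ufe00-\ufe0f\ufeff"
    "\U000e0001-\U000e007f\U000e0100-\U000e01ef]{3,}"
)


def find_runs(text):
    """Yield (line, column, length) for each run of 3+ invisible characters."""
    for line_num, line in enumerate(text.split("\n"), 1):
        for match in INVISIBLE_RUN.finditer(line):
            yield line_num, match.start() + 1, len(match.group())


def scan_path(root):
    """Scan one file, or every decodable UTF-8 file under a directory."""
    root = Path(root)
    files = [root] if root.is_file() else (p for p in root.rglob("*") if p.is_file())
    for path in files:
        try:
            text = path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            continue  # binary or unreadable file: skip rather than guess
        for line_num, col, length in find_runs(text):
            yield path, line_num, col, length


def main(targets):
    """Print one line per finding; return the number of findings."""
    count = 0
    for target in targets:
        for path, line_num, col, length in scan_path(target):
            print(f"{path}:{line_num}:{col}: {length} invisible characters")
            count += 1
    return count


if __name__ == "__main__":
    main(sys.argv[1:])
```

Saved as, say, stego_scan.py, running `python stego_scan.py node_modules` would print one line per finding and nothing at all for a clean tree.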

Combining With eval() Detection

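The steganographic payload is inert without an execution mechanism. In the documented cases, that mechanism is eval() in JavaScript; the Python equivalent is exec(). Scanning for both patterns together (invisible Unicode sequences plus eval/exec) keeps false positives very low. The trade-off is that a simple text match like the one below will miss other dynamic execution paths, such as JavaScript's Function constructor, so treat a WARNING as worth a look even without eval.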

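import re

# One pattern for both calls; \b keeps names like "retrieval(" from matching.
EVAL_CALL = re.compile(r'\b(?:eval|exec)\s*\(')

def has_eval(code):
    """True if the text contains a call to eval() or exec()."""
    return bool(EVAL_CALL.search(code))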

def assess_risk(code):
    sequences = scan_for_stego(code)
    uses_eval = has_eval(code)

    if sequences and uses_eval:
        return "CRITICAL: invisible Unicode + eval() — likely malicious"
    elif sequences:
        return f"WARNING: {len(sequences)} invisible Unicode sequence(s) found"
    elif uses_eval:
        return "INFO: eval/exec present — review context"
    return "CLEAN"

What You Can Do Today

1. Enable Unicode highlighting in your editor. VS Code: editor.unicodeHighlight.invisibleCharacters: true. It's not perfect, but it's free.
2. Add a pre-commit hook. Run the scanner above against staged files. Reject anything with invisible sequences. Takes milliseconds.
3. Audit your dependencies. Run the scanner against your node_modules or site-packages. You'll either find nothing (good) or find something you need to deal with immediately.
4. Question any eval(). In 2026, there are almost no legitimate use cases for eval in application code. Its presence in a dependency should trigger a manual review.
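
For the pre-commit hook, something like the following could work. It is a sketch under a few assumptions: the script name and the --staged flag are made up for this example, it reads the staged version of each file through git show, and it treats anything that is not valid UTF-8 as binary and skips it.

```python
#!/usr/bin/env python3
"""Pre-commit sketch: reject staged files that contain invisible runs."""
import os
import re
import subprocess
import sys

# Same ranges as INVISIBLE_RANGES in the article, as one character class.
INVISIBLE_RUN = re.compile(
    "[\u200b-\u200f\u202a-\u202e\u2060-\u2064\ufe00-\ufe0f\ufeff"
    "\U000e0001-\U000e007f\U000e0100-\U000e01ef]{3,}"
)


def staged_files():
    """Paths that are added, copied, or modified in the index."""
    out = subprocess.run(
        ["git", "diff", "--cached", "--name-only", "--diff-filter=ACM", "-z"],
        capture_output=True, check=True,
    ).stdout
    return [os.fsdecode(p) for p in out.split(b"\0") if p]


def staged_content(path):
    """Bytes of the staged (not working-tree) version of a file."""
    return subprocess.run(
        ["git", "show", f":{path}"], capture_output=True, check=True
    ).stdout


def check(blobs):
    """blobs: iterable of (name, bytes). Return a list of problem strings."""
    problems = []
    for name, raw in blobs:
        try:
            text = raw.decode("utf-8")
        except UnicodeDecodeError:
            continue  # not UTF-8: assume binary and skip
        for line_num, line in enumerate(text.split("\n"), 1):
            for match in INVISIBLE_RUN.finditer(line):
                problems.append(
                    f"{name}:{line_num}: {len(match.group())} invisible characters"
                )
    return problems


def main():
    problems = check((p, staged_content(p)) for p in staged_files())
    for problem in problems:
        print(problem, file=sys.stderr)
    return 1 if problems else 0  # non-zero exit makes git abort the commit


# Hypothetical flag, so the file can be imported or tested without touching git.
if __name__ == "__main__" and sys.argv[1:] == ["--staged"]:
    sys.exit(main())
```

A .git/hooks/pre-commit file that runs `python check_invisible.py --staged` would then block the commit whenever the script exits non-zero.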