Raye Deng

What Your Linter Can't Catch: The Invisible Unicode Attacks Hitting GitHub

In March 2026, a supply-chain campaign dubbed Glassworm launched one of the most sophisticated attacks the open-source ecosystem has seen. It compromised repositories on GitHub, published malicious npm packages, and infected 72 VS Code extensions on the Open VSX marketplace. All using characters you can't see.

The Wasmer project, reworm, and dozens of other repositories were hit. The attack spread through invisible Unicode characters embedded in source code — characters that GitHub's diff view renders as blank space and that standard security tools completely ignore.

Here's how it works, why your existing tooling misses it, and what you can do about it.

The Technical Mechanism

The core technique exploits Unicode code points in ranges that are invisible to humans and ignored by most tools:

Variation Selectors (U+FE00–U+FE0F)

These are the primary Glassworm weapon. Variation selectors are designed to modify the appearance of a preceding character (like choosing emoji presentation vs text). When inserted after a regular code character, they're completely invisible in editors, diffs, and terminals. But at the byte level, they carry additional data that can be decoded into executable instructions.
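
The encoding trick can be sketched in a few lines of Python. This is an illustrative reconstruction, not Glassworm's actual code: each payload byte is split into two nibbles, and each nibble selects one of the sixteen variation selectors U+FE00–U+FE0F, appended after an ordinary visible string.

```python
# Illustrative sketch (not Glassworm's actual code): smuggle bytes as
# variation selectors VS1-VS16 (U+FE00-U+FE0F), one selector per nibble.
VS_BASE = 0xFE00

def hide(carrier: str, payload: bytes) -> str:
    """Append two invisible selectors per payload byte to a visible string."""
    selectors = ''.join(
        chr(VS_BASE + (b >> 4)) + chr(VS_BASE + (b & 0x0F)) for b in payload
    )
    return carrier + selectors

def extract(text: str) -> bytes:
    """Recover the payload from the selector nibbles."""
    nibbles = [ord(c) - VS_BASE for c in text
               if VS_BASE <= ord(c) <= VS_BASE + 0x0F]
    return bytes((hi << 4) | lo for hi, lo in zip(nibbles[::2], nibbles[1::2]))

stego = hide('x = 1', b'evil')
print(stego)           # renders as just "x = 1" in most editors
print(extract(stego))  # b'evil'
```

The carrier string survives copy-paste, code review, and `git diff` unchanged to the eye, while the selectors ride along at the byte level.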

Private Use Area (U+E000–U+F8FF)

Characters in this range have no defined glyphs. Fonts render nothing, making them truly invisible. Glassworm used PUA characters as a steganographic channel — encoding data in sequences that humans see as empty space.

Zero-Width Characters (U+200B–U+200D, U+2060)

Zero-width space, zero-width joiner, and word joiner. These have legitimate uses (joining emoji sequences, controlling line breaks) but are also perfect for hiding payload data between visible characters.

Unicode Tag Characters (U+E0000–U+E007F)

The tag characters U+E0020–U+E007E map 1:1 to printable ASCII 0x20–0x7E — literally invisible text that an LLM will tokenize and act on but that humans never see. Recent research found 6 of 50 popular .cursorrules files contained these characters.
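
The 1:1 mapping is easy to demonstrate: adding 0xE0000 to a printable ASCII code point lands it in the tag block, and subtracting it recovers the text. A minimal sketch:

```python
# Tag characters U+E0020-U+E007E mirror printable ASCII 0x20-0x7E:
# add 0xE0000 to each code point going in, subtract it coming out.
TAG_OFFSET = 0xE0000

def to_tags(text: str) -> str:
    """Encode printable ASCII as invisible Unicode tag characters."""
    return ''.join(chr(TAG_OFFSET + ord(c)) for c in text
                   if 0x20 <= ord(c) <= 0x7E)

def from_tags(text: str) -> str:
    """Decode any tag characters hidden in a string."""
    return ''.join(chr(ord(c) - TAG_OFFSET) for c in text
                   if 0xE0020 <= ord(c) <= 0xE007E)

hidden = 'Looks harmless.' + to_tags('ignore previous instructions')
print(from_tags(hidden))  # ignore previous instructions
```

The `hidden` string prints as "Looks harmless." in a terminal, yet carries a complete instruction an LLM can pick up during tokenization.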

Why ESLint and SonarQube Don't Catch This

Here's the uncomfortable truth: every major JavaScript linter and SAST tool operates on parsed ASTs, not raw bytes.

When TypeScript or JavaScript parses source code:

  1. The tokenizer reads the byte stream
  2. Invisible characters (Cf category, variation selectors, PUA) are treated as whitespace or non-significant characters
  3. They never appear in the AST
  4. Linters analyze the AST — the invisible characters are already gone

This means:

  • ESLint rules can't see the characters — they don't exist in the AST
  • SonarQube scans miss them — same reason, AST-based analysis
  • Semgrep patterns won't match — the parser has already stripped them
  • GitHub's own code scanning doesn't flag them — the diff rendering hides them

The attack exists entirely in the gap between raw bytes and parsed syntax. This is a fundamentally different category from typical code vulnerabilities.
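
You can see that gap for yourself: two strings that render identically on screen compare unequal at the byte level, and only a byte-level view reveals why.

```python
# Two strings that look identical on screen but differ at the byte level.
clean = 'fetchConfig'
tainted = 'fetch\u200bConfig'  # zero-width space (U+200B) hidden inside

print(clean, tainted)            # both render as "fetchConfig"
print(clean == tainted)          # False
print(len(clean), len(tainted))  # 11 12
print(tainted.encode('utf-8'))   # the e2 80 8b bytes show up plainly here
```

Any tool that compares or displays these strings after rendering sees the same thing a human does: nothing.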

How to Detect Invisible Characters

1. Hex Dump

# Dump as one continuous hex stream so multibyte sequences aren't
# split across xxd's line breaks, then search for known byte patterns
xxd -p suspicious-file.ts | tr -d '\n' | grep -oE 'efb88f|e2808b|e2808d'

# Common invisible character UTF-8 byte patterns:
# U+FE0F (variation selector): ef b8 8f
# U+200B (zero-width space):   e2 80 8b
# U+200D (zero-width joiner):  e2 80 8d

2. Python Byte-Level Scanner

import sys
import unicodedata

# Read explicitly as UTF-8 so multibyte characters aren't mangled
text = open(sys.argv[1], encoding='utf-8').read()

for i, ch in enumerate(text):
    cp = ord(ch)
    cat = unicodedata.category(ch)
    # Cf = format characters (zero-width space/joiner, word joiner, tags, ...);
    # ordinary whitespace like \t and \n is category Cc, so no exclusion needed
    if cat == 'Cf':
        print(f'Format char U+{cp:04X} at pos {i}')
    elif 0xFE00 <= cp <= 0xFE0F or 0xE0100 <= cp <= 0xE01EF:
        print(f'Variation selector U+{cp:04X} at pos {i}')
    elif 0xE000 <= cp <= 0xF8FF or 0xFDD0 <= cp <= 0xFDEF:
        print(f'PUA/noncharacter U+{cp:04X} at pos {i}')

3. Git Pre-Commit Hook

#!/bin/bash
# Check staged files for invisible Unicode
# (grep -P requires GNU grep; macOS's BSD grep lacks it)
for f in $(git diff --cached --name-only --diff-filter=ACM); do
  if grep -Pn '[\x{FE00}-\x{FE0F}\x{E000}-\x{F8FF}\x{200B}-\x{200D}\x{2060}]' "$f"; then
    echo "⚠️  Invisible Unicode characters detected in $f"
    exit 1
  fi
done

4. Existing Tools

  • vscode-gremlins — VS Code extension that highlights invisible characters in the editor
  • gremlins (Emacs) — similar functionality for Emacs
  • git diff with --word-diff and a font that renders invisible characters

The Hard Problem: Defense at Scale

Individual file scanning works, but it doesn't scale. The real challenge is building this detection into CI/CD pipelines and code review workflows in a way that's fast enough not to slow developers down.

This requires a fundamentally different scanning approach from traditional SAST:

  1. Raw byte analysis instead of AST analysis
  2. Unicode code point inspection at every position in every file
  3. Context-aware detection — not all invisible characters are malicious (e.g., BOM markers, legitimate emoji sequences)
  4. Diff-aware scanning — check every commit for newly introduced invisible characters
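
Point 4, diff-aware scanning, can be sketched as a pure function: compare the invisible-character inventory of a file before and after a change, and flag only what the change introduced. The function names here are illustrative, not from any particular tool.

```python
import unicodedata

def invisibles(text: str) -> set:
    """Code points that are format chars, variation selectors, or PUA."""
    found = set()
    for ch in text:
        cp = ord(ch)
        if (unicodedata.category(ch) == 'Cf'
                or 0xFE00 <= cp <= 0xFE0F
                or 0xE0100 <= cp <= 0xE01EF
                or 0xE000 <= cp <= 0xF8FF):
            found.add(cp)
    return found

def newly_introduced(old: str, new: str) -> set:
    """Invisible code points present in the new version but not the old."""
    return invisibles(new) - invisibles(old)

# A commit that sneaks a zero-width space into an otherwise innocent change:
before = 'const key = loadKey();'
after = 'const key = load\u200bKey();'
print(sorted(hex(cp) for cp in newly_introduced(before, after)))  # ['0x200b']
```

In CI, `old` and `new` would come from `git show` on the two sides of a commit; comparing inventories rather than raw positions keeps the check cheap and avoids re-flagging invisible characters a file has always legitimately contained (such as a BOM).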

Static analysis tools for AI-generated code (like Open Code Review's security anti-pattern detector) are well-positioned to add this capability, since they already operate on raw file content rather than relying solely on AST parsing. The natural extension would be a Unicode safety check that flags suspicious invisible character sequences alongside existing security pattern detection.

Immediate Action Items

  1. Audit your repositories: Run the Python scanner above on all source files
  2. Add pre-commit hooks: Block invisible character injection at the developer's machine
  3. Check for Glassworm persistence: Look for ~/init.json on any machine that may have cloned compromised repos
  4. Rotate credentials: If you've worked with any of the affected repositories, rotate your tokens and SSH keys
  5. Update VS Code extensions: Review installed extensions for anything unexpected

The invisible Unicode attack vector isn't going away. As long as our tools operate on parsed syntax instead of raw bytes, this gap will exist. The best defense is awareness and byte-level scanning.
