Raye Deng

What Your Linter Can't Catch: The Invisible Unicode Attacks Hitting GitHub

In March 2026, a supply-chain campaign dubbed Glassworm launched one of the most sophisticated attacks the open-source ecosystem has seen. It compromised repositories on GitHub, published malicious npm packages, and infected 72 VS Code extensions on the Open VSX marketplace. All using characters you can't see.

The Wasmer project, reworm, and dozens of other repositories were hit. The attack spread through invisible Unicode characters embedded in source code — characters that GitHub's diff view renders as blank space and that standard security tools completely ignore.

Here's how it works, why your existing tooling misses it, and what you can do about it.

The Technical Mechanism

The core technique exploits Unicode code points in ranges that are invisible to humans and ignored by most tools:

Variation Selectors (U+FE00–U+FE0F)

These are the primary Glassworm weapon. Variation selectors are designed to modify the appearance of a preceding character (like choosing emoji presentation vs text). When inserted after a regular code character, they're completely invisible in editors, diffs, and terminals. But at the byte level, they carry additional data that can be decoded into executable instructions.
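
The encoding trick can be sketched in a few lines of Python. This is an illustrative reconstruction, not Glassworm's actual code: each payload byte is split into two nibbles, and each nibble selects one of the sixteen variation selectors U+FE00–U+FE0F, appended after an ordinary visible string.

```python
# Illustrative sketch (not Glassworm's actual code): smuggle bytes as
# variation selectors VS1-VS16 (U+FE00-U+FE0F), one selector per nibble.
VS_BASE = 0xFE00

def hide(carrier: str, payload: bytes) -> str:
    """Append two invisible selectors per payload byte to a visible string."""
    selectors = ''.join(
        chr(VS_BASE + (b >> 4)) + chr(VS_BASE + (b & 0x0F)) for b in payload
    )
    return carrier + selectors

def extract(text: str) -> bytes:
    """Recover the payload from the selector nibbles."""
    nibbles = [ord(c) - VS_BASE for c in text
               if VS_BASE <= ord(c) <= VS_BASE + 0x0F]
    return bytes((hi << 4) | lo for hi, lo in zip(nibbles[::2], nibbles[1::2]))

stego = hide('x = 1', b'evil')
print(stego)           # renders as just "x = 1" in most editors
print(extract(stego))  # b'evil'
```

The carrier string survives copy-paste, code review, and `git diff` unchanged to the eye, while the selectors ride along at the byte level.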

Private Use Area (U+E000–U+F8FF)

Characters in this range have no defined glyphs. Fonts render nothing, making them truly invisible. Glassworm used PUA characters as a steganographic channel — encoding data in sequences that humans see as empty space.

Zero-Width Characters (U+200B–U+200D, U+2060)

Zero-width space, zero-width joiner, and word joiner. These have legitimate uses (joining emoji sequences, controlling line breaks) but are also perfect for hiding payload data between visible characters.

Unicode Tag Characters (U+E0000–U+E007F)

The tag characters U+E0020–U+E007E map 1:1 to printable ASCII 0x20–0x7E — literally invisible text that an LLM will tokenize and act on but that humans never see. Recent research found 6 of 50 popular .cursorrules files contained these characters.
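
The 1:1 mapping is easy to demonstrate: adding 0xE0000 to a printable ASCII code point lands it in the tag block, and subtracting it recovers the text. A minimal sketch:

```python
# Tag characters U+E0020-U+E007E mirror printable ASCII 0x20-0x7E:
# add 0xE0000 to each code point going in, subtract it coming out.
TAG_OFFSET = 0xE0000

def to_tags(text: str) -> str:
    """Encode printable ASCII as invisible Unicode tag characters."""
    return ''.join(chr(TAG_OFFSET + ord(c)) for c in text
                   if 0x20 <= ord(c) <= 0x7E)

def from_tags(text: str) -> str:
    """Decode any tag characters hidden in a string."""
    return ''.join(chr(ord(c) - TAG_OFFSET) for c in text
                   if 0xE0020 <= ord(c) <= 0xE007E)

hidden = 'Looks harmless.' + to_tags('ignore previous instructions')
print(from_tags(hidden))  # ignore previous instructions
```

The `hidden` string prints as "Looks harmless." in a terminal, yet carries a complete instruction an LLM can pick up during tokenization.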

Why ESLint and SonarQube Don't Catch This

Here's the uncomfortable truth: every major JavaScript linter and SAST tool operates on parsed ASTs, not raw bytes.

When TypeScript or JavaScript parses source code:

  1. The tokenizer reads the byte stream
  2. Invisible characters (Cf category, variation selectors, PUA) are treated as whitespace or non-significant characters
  3. They never appear in the AST
  4. Linters analyze the AST — the invisible characters are already gone

This means:

  • ESLint rules can't see the characters — they don't exist in the AST
  • SonarQube scans miss them — same reason, AST-based analysis
  • Semgrep patterns won't match — the parser has already stripped them
  • GitHub's own code scanning doesn't flag them — the diff rendering hides them

The attack exists entirely in the gap between raw bytes and parsed syntax. This is a fundamentally different category from typical code vulnerabilities.
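
You can see that gap for yourself: two strings that render identically on screen compare unequal at the byte level, and only a byte-level view reveals why.

```python
# Two strings that look identical on screen but differ at the byte level.
clean = 'fetchConfig'
tainted = 'fetch\u200bConfig'  # zero-width space (U+200B) hidden inside

print(clean, tainted)            # both render as "fetchConfig"
print(clean == tainted)          # False
print(len(clean), len(tainted))  # 11 12
print(tainted.encode('utf-8'))   # the e2 80 8b bytes show up plainly here
```

Any tool that compares or displays these strings after rendering sees the same thing a human does: nothing.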

How to Detect Invisible Characters

1. Hex Dump

# Dump as one continuous hex stream so multibyte sequences aren't
# split across xxd's line breaks, then search for known byte patterns
xxd -p suspicious-file.ts | tr -d '\n' | grep -oE 'efb88f|e2808b|e2808d'

# Common invisible character UTF-8 byte patterns:
# U+FE0F (variation selector): ef b8 8f
# U+200B (zero-width space):   e2 80 8b
# U+200D (zero-width joiner):  e2 80 8d

2. Python Byte-Level Scanner

import sys
import unicodedata

# Read explicitly as UTF-8 so multibyte characters aren't mangled
text = open(sys.argv[1], encoding='utf-8').read()

for i, ch in enumerate(text):
    cp = ord(ch)
    cat = unicodedata.category(ch)
    # Cf = format characters (zero-width space/joiner, word joiner, tags, ...);
    # ordinary whitespace like \t and \n is category Cc, so no exclusion needed
    if cat == 'Cf':
        print(f'Format char U+{cp:04X} at pos {i}')
    elif 0xFE00 <= cp <= 0xFE0F or 0xE0100 <= cp <= 0xE01EF:
        print(f'Variation selector U+{cp:04X} at pos {i}')
    elif 0xE000 <= cp <= 0xF8FF or 0xFDD0 <= cp <= 0xFDEF:
        print(f'PUA/noncharacter U+{cp:04X} at pos {i}')

3. Git Pre-Commit Hook

#!/bin/bash
# Check staged files for invisible Unicode
# (grep -P requires GNU grep; macOS's BSD grep lacks it)
for f in $(git diff --cached --name-only --diff-filter=ACM); do
  if grep -Pn '[\x{FE00}-\x{FE0F}\x{E000}-\x{F8FF}\x{200B}-\x{200D}\x{2060}]' "$f"; then
    echo "⚠️  Invisible Unicode characters detected in $f"
    exit 1
  fi
done

4. Existing Tools

  • vscode-gremlins — VS Code extension that highlights invisible characters in the editor
  • gremlins (Emacs) — similar functionality for Emacs
  • git diff with --word-diff and a font that renders invisible characters

The Hard Problem: Defense at Scale

Individual file scanning works, but it doesn't scale. The real challenge is building this detection into CI/CD pipelines and code review workflows in a way that's fast enough not to slow developers down.

This requires a fundamentally different scanning approach from traditional SAST:

  1. Raw byte analysis instead of AST analysis
  2. Unicode code point inspection at every position in every file
  3. Context-aware detection — not all invisible characters are malicious (e.g., BOM markers, legitimate emoji sequences)
  4. Diff-aware scanning — check every commit for newly introduced invisible characters
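
Point 4, diff-aware scanning, can be sketched as a pure function: compare the invisible-character inventory of a file before and after a change, and flag only what the change introduced. The function names here are illustrative, not from any particular tool.

```python
import unicodedata

def invisibles(text: str) -> set:
    """Code points that are format chars, variation selectors, or PUA."""
    found = set()
    for ch in text:
        cp = ord(ch)
        if (unicodedata.category(ch) == 'Cf'
                or 0xFE00 <= cp <= 0xFE0F
                or 0xE0100 <= cp <= 0xE01EF
                or 0xE000 <= cp <= 0xF8FF):
            found.add(cp)
    return found

def newly_introduced(old: str, new: str) -> set:
    """Invisible code points present in the new version but not the old."""
    return invisibles(new) - invisibles(old)

# A commit that sneaks a zero-width space into an otherwise innocent change:
before = 'const key = loadKey();'
after = 'const key = load\u200bKey();'
print(sorted(hex(cp) for cp in newly_introduced(before, after)))  # ['0x200b']
```

In CI, `old` and `new` would come from `git show` on the two sides of a commit; comparing inventories rather than raw positions keeps the check cheap and avoids re-flagging invisible characters a file has always legitimately contained (such as a BOM).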

Static analysis tools for AI-generated code (like Open Code Review's security anti-pattern detector) are well-positioned to add this capability, since they already operate on raw file content rather than relying solely on AST parsing. The natural extension would be a Unicode safety check that flags suspicious invisible character sequences alongside existing security pattern detection.

Immediate Action Items

  1. Audit your repositories: Run the Python scanner above on all source files
  2. Add pre-commit hooks: Block invisible character injection at the developer's machine
  3. Check for Glassworm persistence: Look for ~/init.json on any machine that may have cloned compromised repos
  4. Rotate credentials: If you've worked with any of the affected repositories, rotate your tokens and SSH keys
  5. Update VS Code extensions: Review installed extensions for anything unexpected

The invisible Unicode attack vector isn't going away. As long as our tools operate on parsed syntax instead of raw bytes, this gap will exist. The best defense is awareness and byte-level scanning.
