DEV Community

Leon

Posted on • Originally published at taprun.dev

Facebook scrambles author names with Flexbox order — here's the 5-line diagnostic that proves it isn't custom fonts

A potential client posted on Reddit asking for a Facebook keyword-post scraper. Their budget: $500. My first instinct after looking at the page was to say no.

Here's what a naive scraper saw when it grabbed the first [role="article"] on the search results page:

oSodnprmmlffgfi1c3mSg0so0d000c0uh1l40llhe09n2991imm38opar · Shared with Public
In 24 months every serious website will talk. Get in before it's crowded.… See more
0:00 / 0:00
SNOWIE.AI
$67 Lifetime Deal!

The · Shared with Public marker and the post body are readable. But that first line — the one that should be the author's name — is gibberish: the page visually renders Snowie.Ai, while textContent returns oSodnprm….

I jumped straight to "Facebook ships custom-font character remapping at scale — this is un-scrapable at this budget, decline the gig." I was wrong. Here's what the actual answer turned out to be, and why I had to write a diagnostic before I could see it.

The seven ways rendered text can escape textContent

Before declaring a site un-scrapable, you have to know what you're looking at. There are exactly seven mechanisms by which the text a human sees on screen can diverge from what Node.textContent returns:

  1. Selector mismatch — not actually anti-scraping; you grabbed the wrong node. Defeat cost: minutes.
  2. CSS ::before / ::after content rules. Defeat cost: low — read the computed style.
  3. Flexbox order reordering — the DOM is scrambled, CSS re-sorts it visually. Defeat cost: low — sort children by computed order.
  4. Custom font glyph remapping — a .woff2 rebinds codepoints. Defeat cost: high — OCR the pixels or reverse each session's font table.
  5. Unicode homoglyph substitution. Defeat cost: low — NFKC plus confusable normalization.
  6. Canvas pixel rendering — no DOM text at all. Defeat cost: high — OCR only.
  7. WebAssembly runtime decryption. Defeat cost: extreme — reverse the WASM module, track session keys.

Each requires a different defeat strategy with wildly different economics. #1 is free (fix your selector). #4 and #6 start at ~$15K/year to maintain. #7 is measured in tens of thousands.
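Mechanism #5 is worth a concrete example, because it's the cheapest one to miss. A minimal sketch in plain Node (no DOM needed) — the Snowie.Ai string here is my own illustration, not an actual sample from Facebook:

```javascript
// Mechanism #5 in miniature: NFKC normalization folds Unicode "compatibility"
// variants (e.g. mathematical-alphabet letters) back to plain ASCII.
// Caveat: true cross-script homoglyphs (Cyrillic о for Latin o) survive NFKC
// and need a confusables table on top — hence "NFKC + confusable normalize".
const fancy = '\u{1D5B2}nowie.Ai'; // U+1D5B2 MATHEMATICAL SANS-SERIF CAPITAL S

console.log(fancy === 'Snowie.Ai');   // false — looks identical, bytes differ
console.log(fancy.normalize('NFKC')); // "Snowie.Ai"
```

If a scraped string looks right but fails equality checks against known values, run it through normalize('NFKC') before suspecting anything more exotic.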

So the only useful question is: which one does this site use? Without a diagnostic you're guessing — and guessing wrong costs you either a scraping contract you could have fulfilled, or a contract you over-promised on.

The 5-minute diagnostic

I wrote a throwaway script that walks the first [role="article"] and dumps the signals that separate the seven mechanisms:

const el = document.querySelector('[role="article"]');
console.log({
  textContent: el.textContent.substring(0, 300),
  innerText:   el.innerText.substring(0, 300),
  font_family: getComputedStyle(el).fontFamily,
  has_canvas:  !!el.querySelector('canvas'),
  // Also check DevTools → Network → filter .wasm
  child_sample: Array.from(el.children).slice(0, 10).map(c => ({
    tag: c.tagName,
    order: getComputedStyle(c).order,
    text_len: (c.textContent || '').length,
  })),
});

I ran it. The result killed every hypothesis except one:

  • font_family = system-ui, -apple-system, sans-serif — Facebook is using the OS default font. Mechanism #4 ruled out (no custom .woff2, no glyph remapping).
  • No <canvas> element. #6 ruled out.
  • No WASM in network traffic. #7 ruled out.
  • innerText = "Snowie.Ai\no\ns\no\ne\nt\nS\nd\nn\np\nr\n…". Character-per-line. Flex-column newlines — that's the giveaway.
  • Child elements of the scrambled container: each contained 1–2 characters with a non-zero order value like order: 17, order: 4, order: 23.

That's the signature of mechanism #3 — Flexbox order reordering. Facebook splits author display names into individual single-character spans and gives each a scrambled order value. The browser's flexbox layout re-sorts them for visual rendering. textContent returns DOM order, which is randomized per render.

And only the author name gets this treatment. Post body, engagement counts, aria-labels, and timestamps are plain text.

The fix is ten lines

// When children are all single-character and at least one has a non-zero CSS order,
// sort by order, concat — that's the real text as the browser would paint it.
const unscramble = (el) => {
  const kids = Array.from(el.children); // fresh array — safe to sort in place
  const orderOf = (c) => parseInt(getComputedStyle(c).order, 10) || 0;
  if (kids.length >= 4
      && kids.every(c => (c.textContent || '').length <= 2)
      && kids.some(c => orderOf(c) !== 0)) {
    return kids
      .sort((a, b) => orderOf(a) - orderOf(b))
      .map(c => c.textContent)
      .join('');
  }
  return (el.textContent || '').trim();
};

With this helper wired in, author_name extraction went from "oSodnprm…" to "Snowie.Ai". Everything else — text, like_count, lang — was already plain. No OCR, no WASM reversal, no font-table reverse engineering. Ten lines.

Two honest caveats

  • No native post permalink. Facebook search result cards don't expose a /posts/<id>/ href — the visible links are profile URLs with encrypted __cft__ tracking params. When no native ID is found, my scraper emits an fb_<hash> id that is stable across runs for the same author+body combination. Downstream deduplication still works; you just can't deep-link back to the post.
  • Lazy-loaded pagination. The search page initially renders 1–3 results; the rest arrive as the user scrolls. A production scraper drives scroll events in a loop until limit is satisfied or no new articles appear.

The generalizable lesson

Every time I've been asked "can you scrape site X?" and said no without running a diagnostic, I was wrong at least half the time. The reflex is understandable — the DOM returns garbage, you assume the worst — but the cost asymmetry is severe. Five minutes of running the seven-factor diagnostic versus walking away from a paying contract.

The protocol is:

  1. Extract textContent of the element the human can read. If it doesn't match the visible text, first confirm you grabbed the right node — mechanism #1 is a selector mismatch, and the fix is free.
  2. Check getComputedStyle(el).fontFamily. Points to a custom .woff2? Suspect #4.
  3. Walk children. If many are single-character and have non-zero order? Mechanism #3, unscramble with ten lines.
  4. Check for <canvas> descendants and .wasm network requests. Both absent? #6 and #7 are ruled out.
  5. Only after this do you get to say "un-scrapable."
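The protocol collapses into a pure function over the signals the diagnostic already collects. A sketch over plain data rather than the DOM — the input shape and return strings are my own naming, not an established API:

```javascript
// Hypothetical classifier over diagnostic signals. Assumed input shape:
// { textMatchesVisible, hasCanvas, hasWasm, customFont, childOrders, childTextLens }
const classify = (s) => {
  if (s.textMatchesVisible) return '#1 or none — check your selector, not the site';
  if (s.hasCanvas) return '#6 canvas rendering';
  if (s.hasWasm) return '#7 WASM decryption (suspected)';
  if (s.customFont) return '#4 font glyph remap (suspected)';
  // Flexbox-order signature: tiny text fragments carrying non-zero order values.
  const scrambled = s.childOrders.some(o => o !== 0)
                 && s.childTextLens.every(n => n <= 2);
  if (scrambled) return '#3 flexbox order';
  return 'unknown — check #2 (::before/::after) and #5 (homoglyphs)';
};

// The signals from the Facebook author-name container:
console.log(classify({
  textMatchesVisible: false, hasCanvas: false, hasWasm: false, customFont: false,
  childOrders: [17, 4, 23, 0], childTextLens: [1, 1, 2, 1],
})); // prints "#3 flexbox order"
```

The point of writing it down as code: the branches are cheap to evaluate and mutually exclusive, so "un-scrapable" only ever comes out as a residual, never as a first guess.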

Facebook is not un-scrapable for keyword search. They've applied a low-cost obfuscation to one high-value field (the author name you might use for audience targeting) and left everything else alone. That's a reasonable product decision — enough friction to discourage casual scrapers, not enough to break accessibility tooling that depends on rendered text. As a side effect, someone who runs the diagnostic wins.


Question for the room: anyone here actually shipped against mechanism #4 (custom font glyph remap) or #7 (WASM decryption) in production? Curious what the maintenance cost looks like once you account for font table rotation / WASM session-key changes. Drop your war stories in the comments.

Original post with full JSON-LD metadata + Tap reference: taprun.dev/blog/facebook-anti-scraping-flexbox-order
