A potential client posted on Reddit asking for a Facebook keyword-post scraper. Their budget: $500. My first instinct after looking at the page was to say no.
Here's what a naive scraper saw when it grabbed the first `[role="article"]` on the search results page:

```
oSodnprmmlffgfi1c3mSg0so0d000c0uh1l40llhe09n2991imm38opar · Shared with Public
In 24 months every serious website will talk. Get in before it's crowded.… See more
0:00 / 0:00
SNOWIE.AI
$67 Lifetime Deal!
```
The `· Shared with Public` suffix and the post body are readable. But the first line — the one that should be the author's name — is gibberish: `Snowie.Ai` rendered visually, `oSodnprm…` returned by `textContent`.
I jumped to "Facebook ships custom-font character remapping at scale — this is un-scrapable, decline the gig." I was wrong. Here's what the actual answer turned out to be, and why I had to write a diagnostic before I could see it.
The seven ways rendered text can escape `textContent`
Before declaring a site un-scrapable, you have to know what you're looking at. There are exactly seven mechanisms by which what a human sees on screen can diverge from what `Node.textContent` returns:
| # | Mechanism | Defeat cost |
|---|---|---|
| 1 | Selector mismatch (not actually anti-scraping — you grabbed the wrong node) | Minutes |
| 2 | CSS `::before` / `::after` content rules | Low — read computed style |
| 3 | Flexbox `order` reordering (DOM scrambled, CSS re-sorts visually) | Low — sort children by computed `order` |
| 4 | Custom font glyph remapping (`.woff2` rebinds codepoints) | High — OCR pixels or reverse each session's font table |
| 5 | Unicode homoglyph substitution | Low — NFKC + confusable normalization |
| 6 | Canvas pixel rendering (no DOM text at all) | High — OCR only |
| 7 | WebAssembly runtime decryption | Extreme — reverse the WASM module, track session keys |
Each requires a different defeat strategy with wildly different economics. #1 is free (fix your selector). #4 and #6 start at ~$15K/year to maintain. #7 is measured in tens of thousands.
So the only useful question is: which one does this site use? Without a diagnostic you're guessing — and guessing wrong costs you either a scraping contract you could have fulfilled, or a contract you over-promised on.
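Mechanism #5's "low" defeat cost deserves a concrete example. A minimal sketch, assuming Node or any modern browser: NFKC folds Unicode compatibility characters (fullwidth forms, mathematical alphanumerics) back to their ASCII bases. Note that true cross-script confusables — a Cyrillic о swapped for a Latin o — survive NFKC and need a confusables table (Unicode UTS #39) on top.

```javascript
// A mathematical sans-serif "S" (U+1D5B2) renders just like ASCII "S"
// but compares unequal — a classic homoglyph substitution.
const spoofed = '\u{1D5B2}nowie.Ai';

console.log(spoofed === 'Snowie.Ai');                    // false — homoglyph
console.log(spoofed.normalize('NFKC') === 'Snowie.Ai'); // true — NFKC folds it
```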
The 5-minute diagnostic
I wrote a throwaway script that walks the first [role="article"] and dumps the signals that separate the seven mechanisms:
```javascript
const el = document.querySelectorAll('[role="article"]')[0];
console.log({
  textContent: el.textContent.substring(0, 300),
  innerText: el.innerText.substring(0, 300),
  font_family: getComputedStyle(el).fontFamily,
  has_canvas: !!el.querySelector('canvas'),
  // Also check DevTools → Network → filter .wasm
  child_sample: Array.from(el.children).slice(0, 10).map(c => ({
    tag: c.tagName,
    order: getComputedStyle(c).order,
    text_len: (c.textContent || '').length,
  })),
});
```
I ran it. The result killed every hypothesis except one:
- `font_family = "system-ui, -apple-system, sans-serif"` — Facebook is using the OS default font. Mechanism #4 ruled out (no custom `.woff2`, no glyph remapping).
- No `<canvas>` element. #6 ruled out.
- No WASM in network traffic. #7 ruled out.
- `innerText = "Snowie.Ai\no\ns\no\ne\nt\nS\nd\nn\np\nr\n…"` — one character per line. Flex-column newlines — that's the giveaway.
- Child elements of the scrambled container each contained 1–2 characters with a non-zero `order` value like `order: 17`, `order: 4`, `order: 23`.
That's the signature of mechanism #3 — Flexbox `order` reordering. Facebook splits author display names into individual single-character spans and gives each a scrambled `order` value. The browser's flexbox layout re-sorts them for visual rendering. `textContent` returns DOM order, which is randomized per render.
And only the author name gets this treatment. Post body, engagement counts, aria-labels, and timestamps are plain text.
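The mechanism is easy to reproduce without a browser. Here's a pure-data model — the characters and `order` values below are invented for illustration, but the shape matches what the diagnostic printed:

```javascript
// Each entry models one single-character span in DOM source order,
// carrying the CSS `order` value assigned to it (values invented here).
const spans = [
  { char: 'o', order: 2 }, { char: 'S', order: 0 }, { char: 'i', order: 8 },
  { char: 'w', order: 3 }, { char: 'n', order: 1 }, { char: 'A', order: 7 },
  { char: 'e', order: 5 }, { char: '.', order: 6 }, { char: 'i', order: 4 },
];

// What textContent sees: DOM source order.
const domOrder = spans.map(s => s.char).join('');   // "oSiwnAe.i"

// What the browser paints: flexbox sorts by `order` before layout.
const painted = spans.slice()
  .sort((a, b) => a.order - b.order)
  .map(s => s.char).join('');                       // "Snowie.Ai"
```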
The fix is ten lines
```javascript
// When children are all single-character and at least one has a non-zero CSS order,
// sort by order, concat — that's the real text as the browser would paint it.
const unscramble = (el) => {
  const kids = Array.from(el.children);
  if (kids.length >= 4
      && kids.every(c => (c.textContent || '').length <= 2)
      && kids.some(c => parseInt(getComputedStyle(c).order || '0', 10) !== 0)) {
    return kids
      .slice()
      .sort((a, b) => parseInt(getComputedStyle(a).order || '0', 10)
                    - parseInt(getComputedStyle(b).order || '0', 10))
      .map(c => c.textContent)
      .join('');
  }
  return (el.textContent || '').trim();
};
```
With this helper wired in, `author_name` extraction went from `"oSodnprm…"` to `"Snowie.Ai"`. Everything else — `text`, `like_count`, `lang` — was already plain. No OCR, no WASM reversal, no font-table reverse engineering. Ten lines.
Two honest caveats
- No native post permalink. Facebook search result cards don't expose a `/posts/<id>/` href — the visible links are profile URLs with encrypted `__cft__` tracking params. When no native ID is found, my scraper emits an `fb_<hash>` ID that is stable across runs for the same author+body combination. Downstream deduplication still works; you just can't deep-link back to the post.
- Lazy-loaded pagination. The search page initially renders 1–3 results; the rest arrive as the user scrolls. A production scraper drives scroll events in a loop until `limit` is satisfied or no new articles appear.
The generalizable lesson
Every time I've been asked "can you scrape site X?" and said no without running a diagnostic, I was wrong at least half the time. The reflex is understandable — the DOM returns garbage, you assume the worst — but the cost asymmetry is severe. Five minutes of running the seven-factor diagnostic versus walking away from a paying contract.
The protocol is:
- Extract `textContent` of the element the human can read. If it matches visible text → mechanism #1; fix your selector.
- Check `getComputedStyle(el).fontFamily`. Points to a custom `.woff2`? Suspect #4.
- Walk children. Many single-character with a non-zero `order`? Mechanism #3; unscramble with ten lines.
- Check for `<canvas>` siblings and WASM network requests. Both absent? #6 and #7 are ruled out.
- Only after this do you get to say "un-scrapable."
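The protocol collapses into one pure function over the signals the diagnostic already collects. This is a heuristic sketch — the field names (`selectorMatchesVisible`, `textLen`, and so on) are mine, not an established API, and mechanism #2 is omitted because detecting it needs a computed-style `content` check on pseudo-elements:

```javascript
// Rough match for OS default font stacks; anything else suggests a custom font.
const SYSTEM_FONTS = /^(system-ui|-apple-system|BlinkMacSystemFont|sans-serif|serif|monospace)/i;

// Classify which mechanism a node exhibits, given pre-collected signals.
const classify = (s) => {
  if (s.selectorMatchesVisible) return 1;               // wrong node, not obfuscation
  if (s.hasWasm) return 7;                              // runtime decryption territory
  if (s.hasCanvas && s.children.length === 0) return 6; // canvas-only rendering
  const tiny = s.children.length >= 4 && s.children.every(c => c.textLen <= 2);
  if (tiny && s.children.some(c => c.order !== 0)) return 3; // flex-order scramble
  if (!SYSTEM_FONTS.test(s.fontFamily)) return 4;       // custom font → suspect remap
  return 5;                                             // last resort: homoglyph check
};
```

Feed it the signals from the Facebook card — system font, no canvas, no WASM, single-character children with scrambled `order` values — and it lands on 3.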
Facebook is not un-scrapable for keyword search. They've applied a low-cost obfuscation to one high-value field (the author name you might use for audience targeting) and left everything else alone. That's a reasonable product decision — enough friction to discourage casual scrapers, not enough to break accessibility tooling that depends on rendered text. As a side effect, someone who runs the diagnostic wins.
Question for the room: anyone here actually shipped against mechanism #4 (custom font glyph remap) or #7 (WASM decryption) in production? Curious what the maintenance cost looks like once you account for font table rotation / WASM session-key changes. Drop your war stories in the comments.
Original post with full JSON-LD metadata + Tap reference: taprun.dev/blog/facebook-anti-scraping-flexbox-order