Devil Scrapes

Posted on Jun 3

Threads Reply Scraper: export the full conversation tree of any public post

#webscraping #python #apify #data

Quick answer: Meta's official Threads API is gated behind a developer-account review and refuses third-party conversation reads. To export the full reply tree of any public threads.net post, you scrape the server-rendered HTML. A Threads reply scraper fetches the conversation payload Threads embeds in its initial page load and emits one flat row per post or reply, depth-linked via parent_reply_id, so you can rebuild the conversation graph with a single LEFT JOIN. The Apify Actor below does it for $0.005 per row (~$5.05 per 1,000), with TLS fingerprinting, proxy rotation, and retry logic handled for you.

When a post about your brand goes sideways — or a public official drops a controversial take — the reply tree under it is a structured dataset hiding in plain sight. Two hundred replies can be the difference between a PR team that spotted a pattern at 9 a.m. and one that found out at noon after a journalist called. The catch: Threads gives you a scrollable UI and nothing else. No download button. No public API for third-party conversation reads.

Here is what it takes to extract that conversation programmatically, why the obvious approaches fail, and how the Actor I built shortens it to one API call.

What is Threads? 🔎

Threads is Meta's text-based social network, launched in 2023 as a companion to Instagram. A post lives at https://www.threads.net/@{username}/post/{code} and accumulates a tree of replies: direct replies to the root, plus inline-expanded nested chains where users reply to each other. Meta server-renders the conversation payload into the initial HTML, and that is where this Actor reads it.

The public post page exposes — without any login — the root post text, author, timestamps, engagement counts, every top-level reply Threads inlines into the first HTML load, and the inline-expanded nested reply chains (depth 2+) embedded in the same payload. It does not expose, without extra work, replies behind a "Show replies" click, the reposter user list, or quote-post bodies.

Does Threads have a public API for reply trees? 📡

No. As of 2026, Meta's official Threads API requires a developer-account review and exposes only the post owner's own data — it will not return third-party conversation trees. The only programmatic surface for a public post's full reply tree is the server-rendered HTML threads.net returns when you GET a post URL. That HTML embeds the conversation payload inside <script type="application/json" data-sjs> blocks — but extracting them reliably requires a real browser TLS fingerprint, residential proxy coverage, and a parser that handles Meta's nested Relay format.

What the data looks like

Each post and reply comes back as one flat, typed row — a real record from a live-verified run on https://www.threads.net/@mosseri/post/DYX3oNcAO4r:

{
  "row_type": "post",
  "root_post_id": "3897828658278100523",
  "root_post_url": "https://www.threads.net/@mosseri/post/DYX3oNcAO4r",
  "parent_reply_id": null,
  "reply_id": "3897828658278100523",
  "reply_url": "https://www.threads.net/@mosseri/post/DYX3oNcAO4r",
  "reply_text": "Does DMing people back help with reach?",
  "author_username": "mosseri",
  "author_display_name": "Adam Mosseri",
  "author_user_id": "63482099442",
  "author_followers": null,
  "posted_at": "2026-05-15T13:36:48+00:00",
  "like_count": 427,
  "reply_count": 98,
  "repost_count": 12,
  "quote_count": 2,
  "depth": 0,
  "scraped_at": "2026-05-16T12:00:00+00:00"
}

Eighteen fields, same shape on every row, validated by Pydantic v2 before it's written. The row_type discriminator, the depth integer (0 = root, 1 = direct reply, 2+ = nested chain), and the parent_reply_id pointer are what turn a flat table into a graph: a single LEFT JOIN of the table against itself, matching each row's parent_reply_id to its parent's reply_id, rebuilds the tree.
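As a minimal sketch of that join, the snippet below loads three made-up rows into an in-memory SQLite table and pairs every row with its parent. The sample rows and the table name replies are hypothetical; only the column names come from the record above.

```python
import sqlite3

# Hypothetical sample rows: only the columns the join needs.
rows = [
    {"reply_id": "1", "parent_reply_id": None, "depth": 0, "reply_text": "root post"},
    {"reply_id": "2", "parent_reply_id": "1", "depth": 1, "reply_text": "direct reply"},
    {"reply_id": "3", "parent_reply_id": "2", "depth": 2, "reply_text": "nested reply"},
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE replies (reply_id TEXT PRIMARY KEY, parent_reply_id TEXT, "
    "depth INTEGER, reply_text TEXT)"
)
conn.executemany(
    "INSERT INTO replies VALUES (:reply_id, :parent_reply_id, :depth, :reply_text)",
    rows,
)

# LEFT JOIN (not INNER) so the root row, whose parent_reply_id is NULL, survives.
pairs = conn.execute(
    """
    SELECT child.reply_text, parent.reply_text
    FROM replies AS child
    LEFT JOIN replies AS parent ON child.parent_reply_id = parent.reply_id
    ORDER BY child.depth
    """
).fetchall()

for child_text, parent_text in pairs:
    print(f"{child_text!r} replies to {parent_text!r}")
```

The same self-join should carry over to any warehouse that speaks SQL; SQLite is used here only because it ships with Python.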

The naive approach (and why it falls apart) 🔥

The first thing a scraper-savvy person tries: open DevTools, grep the HTML for thread_items, replay the GET with Python's requests, parse the JSON blobs, iterate the edges. It breaks before the replay even returns the payload. Three reasons, each representing work this Actor absorbs:

1. TLS fingerprinting. Meta's servers inspect the JA3/JA4 signature of your TLS ClientHello. Python's stdlib SSL — and httpx — produce handshakes that match no real browser, so the server quietly degrades the response to a login-only stub: no payload, just a wall. We impersonate Chrome 131's TLS + HTTP/2 fingerprint via curl-cffi, so the handshake is indistinguishable from a real browser session.

2. Datacenter IP blocks. Meta blocks repeated requests from the same datacenter IP within minutes. We route every request through Apify's BUYPROXIES94952 residential proxy pool, rotate the session per URL, and request a fresh proxy session the moment we see a 403 or empty-payload response.

3. A nested Relay payload, not labeled JSON. The embedded <script data-sjs> block is the Relay framework's wire format — a deeply nested __bbox.result.data.data.edges structure where meaning depends on position. We enumerate every data-sjs block, keep the one whose body contains both thread_items and BarcelonaPostPage, then walk the edges[*].node.thread_items tree. On 408 / 429 / 5xx we retry with exponential backoff (base 2 s, cap 30 s, max 5 attempts) and fail loudly rather than return an empty dataset with a green status.

None of that is exotic. All of it is the difference between a one-off script and a scraper that survives Meta's payload updates.
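To make the retry policy from point 3 concrete, here is a rough sketch of a backoff schedule with a 2 s base, a 30 s cap, and 5 attempts. It assumes the delay doubles on each retry and adds no jitter, which the Actor may or may not do; backoff_delays and fetch_with_retry are illustrative names, and fetch stands in for whatever function performs the HTTP request.

```python
import time
from typing import Callable

# Statuses treated as transient: 408, 429, and every 5xx.
RETRYABLE = {408, 429} | set(range(500, 600))


def backoff_delays(base: float = 2.0, cap: float = 30.0, max_attempts: int = 5) -> list[float]:
    """Wait time before each retry; N attempts means N - 1 waits."""
    return [min(cap, base * 2 ** i) for i in range(max_attempts - 1)]


def fetch_with_retry(fetch: Callable[[], tuple[int, str]], sleep=time.sleep) -> tuple[int, str]:
    """Call fetch() until it returns a non-retryable status or attempts run out."""
    delays = backoff_delays()
    for attempt in range(len(delays) + 1):
        status, body = fetch()
        if status not in RETRYABLE:
            return status, body
        if attempt < len(delays):
            sleep(delays[attempt])
    # Fail loudly instead of handing back an empty result.
    raise RuntimeError(f"gave up after {len(delays) + 1} attempts (last status {status})")
```

With these defaults the cap never triggers, because the fourth wait is 16 s; it only matters if max_attempts is raised.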

The Actor ⚙️

I packaged the result as an Apify Actor: Threads Reply Tree Scraper.

Paste post URLs in the Apify Console and click Start, or call it programmatically:

from apify_client import ApifyClient

client = ApifyClient("YOUR_APIFY_TOKEN")
run = client.actor("DevilScrapes/threads-reply-tree").call(
    run_input={
        "postUrls": [
            "https://www.threads.net/@mosseri/post/DYX3oNcAO4r"
        ],
        "maxDepth": 3,
        "maxRepliesPerNode": 50,
        "useProxy": True,
    }
)
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

The three input knobs that matter:

postUrls — up to 50 Threads post URLs per run. Each is normalized (trailing slashes, query strings like ?xmt=..., and fragments are stripped) and deduplicated before any network call. A bad URL fails fast with a Pydantic ValidationError before the run charges you anything.
maxDepth — caps the reply depth emitted. Default 3 covers the root (depth 0), direct replies (depth 1), and the first two layers of inline-expanded nested chains. Raise it to 10 for every embedded chain, or set 1 for direct replies only.
maxRepliesPerNode — caps the number of top-level reply threads per post. Default 50, max 500. Controls cost on very large posts.
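The postUrls normalization described above can be approximated with the standard library. This is a sketch of the idea, not the Actor's actual code: normalize_post_url and dedupe_urls are hypothetical helper names, and lowercasing the host is an extra assumption.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_post_url(url: str) -> str:
    """Drop the query string, fragment, and trailing slashes from a post URL."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/")
    # Assumption: host names are compared case-insensitively.
    return urlunsplit((parts.scheme, parts.netloc.lower(), path, "", ""))


def dedupe_urls(urls: list[str]) -> list[str]:
    """Normalize every URL and keep the first occurrence of each, in input order."""
    return list(dict.fromkeys(normalize_post_url(u) for u in urls))


urls = [
    "https://www.threads.net/@mosseri/post/DYX3oNcAO4r/?xmt=abc#top",
    "https://www.threads.net/@mosseri/post/DYX3oNcAO4r",
]
print(dedupe_urls(urls))
```

Deduplicating after normalization is what keeps two spellings of the same post from being fetched, and billed, twice.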

What you would actually use this for 💡

Five concrete patterns:

Brand reputation triage. A viral mention of your brand lands on Threads. Instead of reading 600 replies manually, run the Actor, filter to like_count > 10 AND depth = 1, and you have the top-resonance replies in a spreadsheet in minutes.

Crisis comms. Pull every visible reply to a controversial post before a PR call, export to CSV, and share in Slack. The parent_reply_id chain shows whether a negative narrative is branching (many depth-2+ nodes) or staying flat.

Creator analytics. Run the Actor against your own posts periodically. reply_count and like_count on every node tell you which of your replies sparked sub-conversations vs. which went quiet.

Social-listening dashboards. Feed the rows into Looker, Hex, Streamlit, or Observable for a real-time reply topology view. Each row's depth, parent_reply_id, author_username, and engagement counts are the minimum columns a NetworkX or d3-force graph needs.

Conversation-graph research. Teams working on argument mining, polarisation, or content moderation can bootstrap labeled datasets from public posts — the flat table with a parent_reply_id tree pointer drops straight into standard graph-analysis pipelines.
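The first two patterns come down to a few lines over the exported rows. The sketch below assumes the rows are already loaded as a list of dicts with the depth and like_count fields shown earlier; the sample data and the helper names are made up for illustration.

```python
from collections import Counter


def top_direct_replies(rows, min_likes=10):
    """Direct replies (depth 1) with more than min_likes likes, most liked first."""
    hits = [r for r in rows if r["depth"] == 1 and (r["like_count"] or 0) > min_likes]
    return sorted(hits, key=lambda r: r["like_count"], reverse=True)


def depth_profile(rows):
    """Count reply rows at each depth, ignoring the root post (depth 0)."""
    return Counter(r["depth"] for r in rows if r["depth"] > 0)


def nested_share(rows):
    """Fraction of replies sitting at depth 2 or deeper; higher means more branching."""
    profile = depth_profile(rows)
    total = sum(profile.values())
    nested = sum(n for depth, n in profile.items() if depth >= 2)
    return nested / total if total else 0.0


# Hypothetical rows, trimmed to the fields used above.
sample = [
    {"reply_id": "1", "depth": 0, "like_count": 427},
    {"reply_id": "2", "depth": 1, "like_count": 35},
    {"reply_id": "3", "depth": 1, "like_count": 4},
    {"reply_id": "4", "depth": 2, "like_count": 12},
    {"reply_id": "5", "depth": 1, "like_count": 18},
]

print([r["reply_id"] for r in top_direct_replies(sample)])
print(nested_share(sample))
```

What counts as a "branching" conversation is a judgment call; the share of depth-2+ replies is just one simple proxy.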

Pricing — exact numbers 💰

Pay-per-event. You pay a fixed start fee per run plus a per-row fee for each row written. A run that returns no data costs only the start fee.

| Event | Price (USD) | When |
| --- | --- | --- |
| `actor-start` | $0.05 | Once per run, at boot |
| `result-row` | $0.005 | Per post or reply row written |

| Rows scraped | Actor starts | Total cost |
| --- | --- | --- |
| 100 rows | 1 | $0.55 |
| 500 rows | 1 | $2.55 |
| 1,000 rows | 1 | $5.05 |
| 5,000 rows | 1 | $25.05 |

A typical run on a single mid-sized post — root plus roughly 50 direct replies plus a handful of nested chains — emits 60–120 rows, costing $0.35–$0.65. Apify's $5 free trial credit covers your first ~990 rows, no credit card required.
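The arithmetic behind those figures is one start fee plus a per-row fee. A small sketch using the two prices listed above (the function names are illustrative, and Decimal is used only to avoid floating-point rounding on currency):

```python
from decimal import Decimal

ACTOR_START = Decimal("0.05")   # charged once per run
RESULT_ROW = Decimal("0.005")   # charged per post or reply row written


def run_cost(rows: int, runs: int = 1) -> Decimal:
    """Total cost in USD for a number of rows spread over a number of runs."""
    return ACTOR_START * runs + RESULT_ROW * rows


def rows_within_budget(budget: str, runs: int = 1) -> int:
    """How many rows fit in a budget once the start fees are paid."""
    return int((Decimal(budget) - ACTOR_START * runs) // RESULT_ROW)


for n in (100, 500, 1000, 5000):
    print(n, run_cost(n))
print(rows_within_budget("5"))
```

Splitting the same rows across more runs costs slightly more, since each run pays its own start fee.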

The technically interesting bit

The payload location — result.data.data.edges[*].node.thread_items[*].post inside a data-sjs-tagged script block — was live-verified on 2026-05-16 against the @mosseri post above. The initial HTML was ~790 KB and contained 52 separate JSON <script> payloads; the correct one is identified by the combination of thread_items and BarcelonaPostPage markers, while the rest are sidebar metadata, related-post carousels, and Relay hydration blobs.
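A stdlib-only sketch of that selection step might look like the following. It assumes the matching block's body is a single JSON document, and it searches recursively for thread_items instead of hard-coding the path, since the wrapper around result.data.data.edges may vary; the class and function names are hypothetical and this is not the Actor's parser.

```python
import json
from html.parser import HTMLParser

MARKERS = ("thread_items", "BarcelonaPostPage")


class SjsScripts(HTMLParser):
    """Collect the text of every <script ... data-sjs> block."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._in_sjs = False

    def handle_starttag(self, tag, attrs):
        if tag == "script" and "data-sjs" in dict(attrs):
            self._in_sjs = True
            self.blocks.append("")

    def handle_data(self, data):
        # Script text can arrive in several chunks, so accumulate it.
        if self._in_sjs:
            self.blocks[-1] += data

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_sjs = False


def find_conversation_payload(html):
    """Return the decoded data-sjs block that contains both markers, or None."""
    parser = SjsScripts()
    parser.feed(html)
    parser.close()
    for body in parser.blocks:
        if all(marker in body for marker in MARKERS):
            return json.loads(body)
    return None


def iter_thread_items(node):
    """Yield every thread_items list found anywhere in the decoded payload."""
    if isinstance(node, dict):
        if isinstance(node.get("thread_items"), list):
            yield node["thread_items"]
        for value in node.values():
            yield from iter_thread_items(value)
    elif isinstance(node, list):
        for value in node:
            yield from iter_thread_items(value)
```

Returning None when no block matches lets the caller treat a login-wall response as a distinct case instead of an empty conversation.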

The thread_items semantics are the catch: a length-N chain is a direct reply followed by N-1 inline-expanded nested replies, and each subsequent item's parent_reply_id is the pk of the previous chain item. The Actor therefore assigns depth=1 to the first item in a chain, depth=2 to the second, and depth=3 to the third. Most scrapers that flatten this payload collapse all chain items to depth=1 and lose the tree.
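The chain rule can be sketched in a few lines. This assumes each chain item exposes its id at item["post"]["pk"], as the path above suggests, and that the chain passed in does not include the root post; flatten_chain is an illustrative name, not the Actor's function.

```python
def flatten_chain(thread_items, root_id):
    """Turn one thread_items chain into rows carrying parent_reply_id and depth.

    The first item is a direct reply to the root (depth 1); every later item
    replies to the item just before it, so its depth is one greater.
    """
    rows = []
    parent_id = root_id
    for depth, item in enumerate(thread_items, start=1):
        reply_id = item["post"]["pk"]
        rows.append({"reply_id": reply_id, "parent_reply_id": parent_id, "depth": depth})
        parent_id = reply_id
    return rows


# Hypothetical three-item chain under root post "100".
chain = [{"post": {"pk": "201"}}, {"post": {"pk": "202"}}, {"post": {"pk": "203"}}]
for row in flatten_chain(chain, root_id="100"):
    print(row)
```

Carrying parent_id forward through the loop is the whole trick; resetting it to the root on every item is what collapses a chain to depth 1.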

Limitations 🚧

No "Show replies" pagination. Threads renders some nested replies behind a client-side click that triggers an XHR. The Actor emits exactly what the initial HTML contains — typically the root, all top-level direct replies, and any inline-expanded depth-2/3 chains Meta included. Hidden replies behind a "Show replies" button are not fetched.
No reposter user list. The /reposts/ sub-page loads via a client-side XHR using rotating GraphQL doc_id and lsd tokens. Repost counts are captured on every row; the list of who reposted is not.
No quote-post bodies. quote_count is captured; the text of quote-posts is out of scope for v1.
No media. Only reply_text is captured. Image alt text, video transcripts, and link-preview cards are not.
Private and login-walled posts return zero rows. If Threads serves a login wall, which depends on IP reputation, the Actor logs a WARNING and skips that URL. Enabling the residential proxy (useProxy: true) maximizes the success rate.

FAQ ❓

Is scraping public Threads posts legal?
The threads.net post URL is publicly accessible without any login. This Actor reads only what the public UI exposes to anyone with a browser, at a paced request rate, and collects no personal data beyond what the public post displays. You remain responsible for verifying your jurisdiction's data-protection rules and Meta's current Terms of Service before using scraped data commercially; the README includes a full ToS notice.

Do I need a Meta or Instagram account?
No. The Actor calls threads.net directly with a Chrome 131 TLS fingerprint. No Meta login, no API key, no OAuth flow.

Can I export to Google Sheets, BigQuery, or a warehouse?
Yes — the Apify Console exports to JSON, CSV, Excel, and XML after any run. You can also webhook the dataset on ACTOR.RUN.SUCCEEDED into Make, Zapier, or n8n, or pull it via the Apify Datasets REST API.

How is this different from other Threads scrapers on the Apify Store?
Existing Threads scrapers focus on profile-level post enumeration and cap at roughly 20 posts per profile with no reply expansion. This Actor does the opposite: give it one post URL and it returns the entire visible conversation underneath it, depth-linked, with engagement counts on every node — conversation-graph data, not just a profile feed.

Try it

The Actor is live: apify.com/DevilScrapes/threads-reply-tree. Free $5 trial credit, no credit card. Run it on any active Threads post and the full reply tree lands in your dataset in under a minute. Hit a post where the payload changed shape or a reply chain came back wrong? Drop it in the comments — we push fixes within days of a reproducible report.

Useful links:

Threads on threads.net — the target platform
Apify Datasets REST API docs — how to pull results programmatically
Meta Threads Developer Documentation — the official (limited) API for comparison

Built by Devil Scrapes — Apify Actors with attitude. Pay-per-event, transparent pricing, no junk fields. 😈
