
The AI Entrepreneur

Scraping Threads by Meta When There's No API

Threads has 300 million monthly users and zero public API. Here's how I built a scraper that extracts posts, profiles, and engagement data without logging in.


Why Threads?

Instagram scrapers are the most popular actors on Apify — over 191,000 users. Threads is Instagram's text-first sibling with the same audience, same brands, same influencers. But when I looked for Threads data tools in early 2026, there were almost none.

Meta has been protective of Threads data. No official API. No public data export. If you want to analyze what brands are posting, track influencer engagement, or build a monitoring tool, you're stuck manually scrolling.

The Recon: How Does Threads Deliver Data?

Before writing any code, I spent an hour with Chrome DevTools open on threads.net.

Discovery 1: Threads Uses Meta's Barcelona GraphQL API

Open the Network tab, visit any profile, and watch the requests. You'll see calls to www.threads.net/api/graphql with a doc_id parameter.

POST https://www.threads.net/api/graphql
Content-Type: application/x-www-form-urlencoded

doc_id=12345678901234567
variables={"userID":"314216"}

The response? Beautiful, structured JSON with everything: posts, likes, replies, follower counts.

Discovery 2: The Doc IDs Change

Meta rotates doc_id values. Some stay stable for weeks, others change daily. This meant I couldn't hardcode queries — I needed a fallback.
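One way to avoid hardcoding is to harvest candidate doc_ids from the JavaScript bundles the page itself loads. This is a hedged sketch, not the article's exact implementation: the regex and the 15-17 digit length are assumptions based on the IDs visible in DevTools.

```javascript
// Scan a script's source for doc_id values instead of hardcoding one.
// Assumption: doc_ids appear as 15-17 digit literals next to the key
// "doc_id" in Threads' bundled JS. Adjust the pattern if Meta changes it.
function extractDocIds(scriptSource) {
    const ids = new Set();
    const re = /"doc_id"\s*[:=]\s*"?(\d{15,17})"?/g;
    let match;
    while ((match = re.exec(scriptSource)) !== null) {
        ids.add(match[1]); // Set dedupes repeated occurrences
    }
    return [...ids];
}
```

Feed it the body of each script response you intercept and you get a pool of current doc_ids to try, with the DOM extractor as the last resort.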

Discovery 3: The DOM Has Everything

The rendered HTML contains most of the data in structured form. Threads uses React with server-side rendering, so the initial HTML already includes post content, timestamps, and metrics.

The Architecture: GraphQL First, DOM Fallback

1. Launch headless browser
2. Navigate to Threads profile
3. Set up CDP network interception
4. Wait for GraphQL responses
5. If GraphQL captured → parse structured data
6. If not → fall back to DOM extraction
7. Scroll for more posts
8. Return unified output
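The control flow of those eight steps can be sketched with the browser interactions injected as functions, so the GraphQL-first / DOM-fallback logic is visible on its own. All the function names here (openPage, waitForGraphql, and so on) are illustrative stand-ins, not the actor's real API.

```javascript
// Hedged sketch of the pipeline: GraphQL capture first, DOM extraction
// as fallback, then scroll until no new posts arrive.
async function scrapeProfile({ openPage, waitForGraphql, extractFromDOM, scrollOnce, maxScrolls = 10 }) {
    const page = await openPage();                 // steps 1-3: browser + interception
    const captured = await waitForGraphql(page);   // step 4: wait for GraphQL responses
    let posts = captured.threads;                  // step 5: structured data, if captured
    if (posts.length === 0) {
        posts = await extractFromDOM(page);        // step 6: DOM fallback
    }
    for (let i = 0; i < maxScrolls; i++) {         // step 7: scroll for more
        const more = await scrollOnce(page);
        if (!more.length) break;                   // nothing new loaded, stop
        posts = posts.concat(more);
    }
    return { posts };                              // step 8: unified output
}
```

Keeping the steps in one small orchestrator like this makes the fallback path easy to test without a live browser.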

The CDP Interception Layer

The key technique — intercepting network responses using Chrome DevTools Protocol:

const capturedData = { threads: [], profile: null };

// Puppeteer's response event (driven by CDP under the hood) fires for
// every network response, including Threads' GraphQL calls.
page.on('response', async (response) => {
    const url = response.url();
    if (url.includes('/api/graphql')) {
        try {
            const json = await response.json();
            if (json?.data?.userData?.user) {
                capturedData.profile = json.data.userData.user;
            }
            if (json?.data?.mediaData?.threads) {
                for (const thread of json.data.mediaData.threads) {
                    capturedData.threads.push(thread);
                }
            }
        } catch (e) { /* Not all responses are relevant JSON */ }
    }
});

This captures data as the page loads, before it renders. No DOM parsing needed when this works.

The DOM Fallback

When GraphQL fails (ad blockers, network issues, Meta changes):

async function extractFromDOM(page) {
    return await page.evaluate(() => {
        const posts = [];
        // Class names are hashed and can change between deploys;
        // the substring selectors below are best-effort.
        const articles = document.querySelectorAll('[data-pressable-container]');
        for (const article of articles) {
            posts.push({
                text: article.querySelector('[class*="bodyText"]')?.textContent?.trim() || '',
                timestamp: article.querySelector('time')?.getAttribute('datetime') || '',
                likes: parseInt(article.querySelector('[class*="likeCount"]')?.textContent?.replace(/,/g, '') || '0', 10),
            });
        }
        return posts;
    });
}

More brittle than GraphQL, but it works as a safety net.
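One wrinkle worth handling in the fallback: Threads sometimes renders abbreviated counts ("1.2K", "3M") rather than raw numbers, which a plain parseInt would read as 1 or 3. A hedged helper for those cases, assuming the K/M suffix convention seen in the UI:

```javascript
// Parse counts like "1.2K", "3M", or "3,450" into integers.
// Assumption: Threads only abbreviates with K (thousands) and M (millions).
function parseCount(text) {
    const m = /^([\d.,]+)\s*([KM])?$/i.exec(text.trim());
    if (!m) return 0;
    const n = parseFloat(m[1].replace(/,/g, ""));           // strip thousands separators
    const mult = { K: 1e3, M: 1e6 }[m[2]?.toUpperCase()] ?? 1;
    return Math.round(n * mult);
}
```

Swap this in for the parseInt call when the like count text carries a suffix.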

Handling Infinite Scroll

async function scrollForMore(page, maxPosts) {
    // capturedData is the shared store filled by the response listener above
    let previousHeight = 0;
    let attempts = 0; // consecutive scrolls that loaded nothing new
    while (capturedData.threads.length < maxPosts && attempts < 10) {
        await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
        await page.waitForTimeout(2000);
        const currentHeight = await page.evaluate(() => document.body.scrollHeight);
        if (currentHeight === previousHeight) attempts++;
        else { attempts = 0; previousHeight = currentHeight; }
    }
}

The GraphQL interceptor catches new data from each scroll automatically.
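Because scrolling can replay posts already captured, it's worth deduplicating by post ID before counting toward maxPosts. A minimal sketch; the `pk` field name is an assumption about the GraphQL payload, so adjust it to whatever unique ID your captured threads actually carry:

```javascript
// Drop threads whose post ID has already been seen across scrolls.
function dedupeThreads(threads) {
    const seen = new Set();
    const unique = [];
    for (const t of threads) {
        // Fall back to the serialized object when no ID field is present
        const id = t?.post?.pk ?? JSON.stringify(t);
        if (!seen.has(id)) {
            seen.add(id);
            unique.push(t);
        }
    }
    return unique;
}
```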

The Tricky Parts

No Login Scraping

Threads shows limited data to logged-out users — but enough. Public profiles display recent posts, bio, follower counts, and engagement. You lose some historical data, but for most use cases the public data is sufficient.

The advantage: no account risk. Can't get banned if you never log in.

Rate Limiting

My approach:

  • 2-3 second delays between page loads
  • Proxy rotation per profile
  • New browser context per request
  • Exponential backoff on 429s
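The backoff bullet can be sketched as a small retry wrapper: on an HTTP 429, wait an exponentially growing delay plus jitter, then retry. `loadPage` is an illustrative stand-in for whatever navigation call you use; the base delay and retry count are assumptions, not the actor's tuned values.

```javascript
// Retry a page load on HTTP 429 with exponential backoff plus jitter.
async function withBackoff(loadPage, { retries = 4, baseMs = 2000 } = {}) {
    for (let attempt = 0; attempt <= retries; attempt++) {
        const response = await loadPage();
        if (response.status !== 429) return response;
        // 2s, 4s, 8s, ... plus up to 500ms of jitter to avoid lockstep
        const delay = baseMs * 2 ** attempt + Math.random() * 500;
        await new Promise((resolve) => setTimeout(resolve, delay));
    }
    throw new Error("Rate limited after all retries");
}
```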

Data Normalization

GraphQL and DOM return different shapes. Everything normalizes to:

{
    "username": "zuck",
    "full_name": "Mark Zuckerberg",
    "followers": 12500000,
    "posts": [{
        "text": "...",
        "timestamp": "2026-03-10T14:30:00Z",
        "likes": 45000,
        "replies": 2300,
        "media": [{ "type": "image", "url": "..." }],
        "hashtags": ["meta", "ai"]
    }]
}
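A sketch of the per-post half of that normalization, mapping either source onto the unified schema. The GraphQL field names here (caption, taken_at, like_count) are assumptions based on payloads visible in DevTools, not a documented contract; the DOM path already arrives in near-final shape from extractFromDOM.

```javascript
// Normalize a post from either extraction path into one shape.
function normalizePost(raw, source) {
    if (source === "graphql") {
        const p = raw.post ?? raw;
        return {
            text: p.caption?.text ?? "",
            // taken_at is a Unix timestamp in seconds; emit ISO 8601
            timestamp: p.taken_at ? new Date(p.taken_at * 1000).toISOString() : "",
            likes: p.like_count ?? 0,
        };
    }
    // DOM fallback already emits { text, timestamp, likes }
    return { text: raw.text ?? "", timestamp: raw.timestamp ?? "", likes: raw.likes ?? 0 };
}
```

Normalizing at the edge like this means everything downstream (dedupe, output, pricing) only ever sees one shape.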

Performance

Tested on 50 profiles:

  • Extraction time: 8-12 seconds per profile
  • Success rate: 94% GraphQL, 100% with DOM fallback
  • Cost: $0.004 per post

50 posts from @zuck: $0.20. Any public profile, no login.

The Broader Lesson

When a platform doesn't offer an API, the data isn't hidden — it's just not served on a silver platter. The browser sees everything. If you can see it on screen, you can capture it.

The GraphQL-interception-plus-DOM-fallback pattern works for any React/GraphQL app. I've used it for Instagram Stories, Facebook Marketplace, and LinkedIn feeds.

Try It

🔗 Threads Scraper on Apify
📦 Source on GitHub

Input a username, get structured JSON. No login, no cookies, no Meta developer account.


Built with Puppeteer, Crawlee, and the Apify SDK.
