Threads has 300 million monthly users and zero public API. Here's how I built a scraper that extracts posts, profiles, and engagement data without logging in.
Why Threads?
Instagram scrapers are the most popular actors on Apify — over 191,000 users. Threads is Instagram's text-first sibling with the same audience, same brands, same influencers. But when I looked for Threads data tools in early 2026, there were almost none.
Meta has been protective of Threads data. No official API. No public data export. If you want to analyze what brands are posting, track influencer engagement, or build a monitoring tool, you're stuck manually scrolling.
The Recon: How Does Threads Deliver Data?
Before writing any code, I spent an hour with Chrome DevTools open on threads.net.
Discovery 1: Threads Uses Meta's Barcelona GraphQL API
Open the Network tab, visit any profile, and watch the requests. You'll see calls to www.threads.net/api/graphql with a doc_id parameter.
POST https://www.threads.net/api/graphql
Content-Type: application/x-www-form-urlencoded
doc_id=12345678901234567
variables={"userID":"314216"}
The response? Beautiful, structured JSON with everything: posts, likes, replies, follower counts.
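In fact, once you have a live doc_id, you can replay the query with a plain HTTP client, no browser required. Here's a minimal sketch; the doc_id shown is a placeholder (remember, Meta rotates these), and the header set is the bare minimum that worked for me, so treat both as assumptions to verify against a current capture:

```javascript
// Build the form-encoded body for a Barcelona GraphQL query.
// The doc_id is a placeholder -- capture a current one from the
// Network tab before using this, since Meta rotates them.
function buildGraphqlBody(docId, variables) {
  const params = new URLSearchParams();
  params.set('doc_id', docId);
  params.set('variables', JSON.stringify(variables));
  return params.toString();
}

// Usage (fetch is global in Node 18+):
// const res = await fetch('https://www.threads.net/api/graphql', {
//   method: 'POST',
//   headers: { 'Content-Type': 'application/x-www-form-urlencoded' },
//   body: buildGraphqlBody('12345678901234567', { userID: '314216' }),
// });
```

This only works for as long as the doc_id stays valid, which is exactly why the scraper itself intercepts live browser traffic instead.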
Discovery 2: The Doc IDs Change
Meta rotates doc_id values. Some stay stable for weeks, others change daily. This meant I couldn't hardcode queries — I needed a fallback.
Discovery 3: The DOM Has Everything
The rendered HTML contains most of the data in a structured form. Threads uses React with server-side rendering, so the initial HTML already includes post content, timestamps, and metrics before any client-side JavaScript runs.
The Architecture: GraphQL First, DOM Fallback
1. Launch headless browser
2. Navigate to Threads profile
3. Set up CDP network interception
4. Wait for GraphQL responses
5. If GraphQL captured → parse structured data
6. If not → fall back to DOM extraction
7. Scroll for more posts
8. Return unified output
The CDP Interception Layer
The key technique: intercepting network responses with Puppeteer's `page.on('response')` event, which rides on the Chrome DevTools Protocol under the hood.
```javascript
const capturedData = { threads: [], profile: null };

page.on('response', async (response) => {
  const url = response.url();
  if (url.includes('/api/graphql')) {
    try {
      const json = await response.json();
      if (json?.data?.userData?.user) {
        capturedData.profile = json.data.userData.user;
      }
      if (json?.data?.mediaData?.threads) {
        for (const thread of json.data.mediaData.threads) {
          capturedData.threads.push(thread);
        }
      }
    } catch (e) { /* Not all responses are relevant */ }
  }
});
```
This captures data as the page loads, before it renders. No DOM parsing needed when this works.
The DOM Fallback
When GraphQL fails (ad blockers, network issues, Meta changes):
```javascript
async function extractFromDOM(page) {
  return await page.evaluate(() => {
    const posts = [];
    const articles = document.querySelectorAll('[data-pressable-container]');
    for (const article of articles) {
      posts.push({
        text: article.querySelector('[class*="bodyText"]')?.textContent?.trim() || '',
        timestamp: article.querySelector('time')?.getAttribute('datetime') || '',
        likes: parseInt(article.querySelector('[class*="likeCount"]')?.textContent?.replace(/,/g, '') || '0', 10),
      });
    }
    return posts;
  });
}
```
More brittle than GraphQL, but works as a safety net.
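One hardening step worth adding: Threads sometimes renders abbreviated counts ("1.2K", "3M") rather than full numbers, and a bare `parseInt` silently drops the suffix. A small defensive parser covers both cases — the exact formats Threads uses can vary by locale and login state, so this is a best-effort sketch:

```javascript
// Parse engagement counts as rendered text: "45,000", "1.2K", "3M".
// Unrecognized formats fall back to 0 rather than NaN.
function parseCount(text) {
  if (!text) return 0;
  const cleaned = text.trim().replace(/,/g, '');
  const match = cleaned.match(/^([\d.]+)\s*([KM])?$/i);
  if (!match) return 0;
  const value = parseFloat(match[1]);
  const suffix = (match[2] || '').toUpperCase();
  const multiplier = suffix === 'M' ? 1e6 : suffix === 'K' ? 1e3 : 1;
  return Math.round(value * multiplier);
}

// parseCount('45,000') -> 45000
// parseCount('1.2K')   -> 1200
// parseCount('3M')     -> 3000000
```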
Handling Infinite Scroll
```javascript
// capturedData is the module-level object filled by the GraphQL
// interceptor above -- each scroll triggers new /api/graphql responses.
async function scrollForMore(page, maxPosts) {
  let previousHeight = 0;
  let attempts = 0;
  while (capturedData.threads.length < maxPosts && attempts < 10) {
    await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
    // Plain setTimeout instead of page.waitForTimeout, which newer
    // Puppeteer versions have removed.
    await new Promise((resolve) => setTimeout(resolve, 2000));
    const currentHeight = await page.evaluate(() => document.body.scrollHeight);
    if (currentHeight === previousHeight) {
      attempts++; // Page stopped growing -- likely the end of the feed
    } else {
      attempts = 0;
      previousHeight = currentHeight;
    }
  }
}
```
The GraphQL interceptor catches new data from each scroll automatically.
The Tricky Parts
No Login Scraping
Threads shows limited data to logged-out users — but enough. Public profiles display recent posts, bio, follower counts, and engagement. You lose some historical data, but for most use cases the public data is sufficient.
The advantage: no account risk. Can't get banned if you never log in.
Rate Limiting
My approach:
- 2-3 second delays between page loads
- Proxy rotation per profile
- New browser context per request
- Exponential backoff on 429s
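The backoff piece is simple enough to sketch in full. The function names here are illustrative, not from the actual actor code; the schedule doubles from a 2-second base and caps out so a long outage doesn't stall a run indefinitely:

```javascript
// Delay schedule: 2s, 4s, 8s, 16s... capped at maxDelayMs.
// Pure function, so the schedule is easy to test and tune.
function backoffDelay(attempt, baseMs = 2000, maxDelayMs = 60000) {
  return Math.min(baseMs * 2 ** attempt, maxDelayMs);
}

// Retry fn() on HTTP 429 until it succeeds or retries run out.
async function withBackoff(fn, maxRetries = 5) {
  for (let attempt = 0; ; attempt++) {
    const res = await fn();
    if (res.status !== 429 || attempt >= maxRetries) return res;
    await new Promise((resolve) => setTimeout(resolve, backoffDelay(attempt)));
  }
}
```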
Data Normalization
GraphQL and DOM return different shapes. Everything normalizes to:
```json
{
  "username": "zuck",
  "full_name": "Mark Zuckerberg",
  "followers": 12500000,
  "posts": [{
    "text": "...",
    "timestamp": "2026-03-10T14:30:00Z",
    "likes": 45000,
    "replies": 2300,
    "media": [{ "type": "image", "url": "..." }],
    "hashtags": ["meta", "ai"]
  }]
}
```
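The normalizer itself is a straightforward per-source mapping. One caveat on this sketch: the GraphQL field names (`caption.text`, `like_count`, `taken_at`) mirror Meta's Instagram-style payloads and are assumptions here, so verify them against a live response before relying on the mapping:

```javascript
// Normalize a post from either source into the unified output shape.
// GraphQL field names are assumed, not documented -- check a live capture.
function normalizePost(raw, source) {
  const text = source === 'graphql' ? (raw.caption?.text ?? '') : (raw.text ?? '');
  if (source === 'graphql') {
    return {
      text,
      // taken_at is a Unix timestamp in seconds; convert to ISO 8601
      timestamp: raw.taken_at ? new Date(raw.taken_at * 1000).toISOString() : '',
      likes: raw.like_count ?? 0,
      replies: raw.reply_count ?? 0,
      hashtags: extractHashtags(text),
    };
  }
  // DOM posts already arrive close to the target shape
  return {
    text,
    timestamp: raw.timestamp ?? '',
    likes: raw.likes ?? 0,
    replies: raw.replies ?? 0,
    hashtags: extractHashtags(text),
  };
}

function extractHashtags(text) {
  return [...text.matchAll(/#(\w+)/g)].map((m) => m[1]);
}
```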
Performance
Tested on 50 profiles:
- Extraction time: 8-12 seconds per profile
- Success rate: 94% GraphQL, 100% with DOM fallback
- Cost: $0.004 per post
50 posts from @zuck: $0.20. Any public profile, no login.
The Broader Lesson
When a platform doesn't offer an API, the data isn't hidden — it's just not served on a silver platter. The browser sees everything. If you can see it on screen, you can capture it.
The GraphQL-interception-plus-DOM-fallback pattern works for any React/GraphQL app. I've used it for Instagram Stories, Facebook Marketplace, and LinkedIn feeds.
Try It
🔗 Threads Scraper on Apify
📦 Source on GitHub
Input a username, get structured JSON. No login, no cookies, no Meta developer account.
Built with Puppeteer, Crawlee, and the Apify SDK.