Devil Scrapes

Posted on Jun 3

Stack Overflow Scraper: export any tag or query to JSON for $1.50/1K

#webscraping #python #apify #datascience

Quick answer: Stack Overflow and every other Stack Exchange site expose a v2.3 API at api.stackexchange.com, but it enforces a 300-request/day anonymous quota and returns paginated, gzip-encoded JSON with a backoff field you must honour or get banned. A Stack Overflow scraper wraps that API, handles quota, paging, and backoff signalling, and writes one clean row per question: title, body, tags, score, view count, accepted answer, asker, and timestamps. The Apify Actor below does it for $0.0015 per question (~$1.50 per 1,000), with fingerprint rotation, proxy threading, and retry logic handled for you.

Stack Overflow is the de-facto training corpus for developer tooling — every LLM fine-tune that "knows Python" learned a slice of it from the SO Q&A archive. In 2023 Stack Exchange Inc. paused the community data dump researchers had relied on for years. The API still works, but most teams hit the quota wall on day one or end up with a partial dataset because they didn't handle backoff correctly.

If you need fresh Stack Overflow data in 2026 — questions asked yesterday, current vote counts, new accepted answers — you're going to the API. This post explains what it gives you, where the friction lives, and how a single Actor call replaces the pagination boilerplate.

What is Stack Exchange? 🔎

Stack Exchange is a network of 170+ Q&A communities operated by Stack Exchange Inc. The flagship site is Stack Overflow (software development); the network also includes Server Fault, Super User, Cross Validated (statistics), Ask Ubuntu, Math, and dozens more. All sites share one API endpoint (api.stackexchange.com/2.3) with a site parameter to switch context.
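To make the shared endpoint concrete, here is a minimal sketch of how a /questions request URL could be built for any site slug. The helper name questions_url and its defaults are illustrative assumptions, not the Actor's internals; the query parameters are the ones the public API documents, with multiple tags joined by semicolons.

```python
from typing import Optional, Sequence
from urllib.parse import urlencode

API_ROOT = "https://api.stackexchange.com/2.3"


def questions_url(
    site: str,
    tags: Sequence[str],
    page: int = 1,
    pagesize: int = 100,
    sort: str = "votes",
    key: Optional[str] = None,
) -> str:
    """Build a /questions URL for one page (hypothetical helper)."""
    params = {
        "site": site,                # switches network site, e.g. "serverfault"
        "tagged": ";".join(tags),    # semicolon-delimited tag list
        "page": page,
        "pagesize": pagesize,        # the API caps this at 100
        "order": "desc",
        "sort": sort,
        "filter": "withbody",        # built-in filter that adds the body
    }
    if key:
        params["key"] = key          # registered key raises the daily quota
    return f"{API_ROOT}/questions?{urlencode(params)}"


print(questions_url("serverfault", ["nginx", "ssl"]))
```

Only the site value changes between Stack Overflow, Server Fault, and the rest of the network; the path and the other parameters stay the same.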

Each question carries a title, full HTML body, tags, net vote score, view count, answer count, an accepted-answer flag and ID, and the asker's display name and user ID. That's the unit of data this Actor ships.

Does Stack Overflow have a scraping API?

Partially. Stack Exchange publishes a documented v2.3 API that returns question data, but "API" here comes with caveats tutorials skip. Anonymous callers get 300 requests per day per IP; a registered API key lifts that to 10,000 per day. Pages return up to 100 items, so 10,000 requests covers at most 1,000,000 questions — workable for many uses, restrictive for a full-corpus pull. Critically, when the server is under load it sends a backoff field in the response body (not a header); you must sleep(backoff + 1) before the next call or you get throttled and eventually banned. Most one-off scripts miss this.
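The quota arithmetic above can be checked with a few lines. This is a rough sketch that assumes the whole daily quota is available to one job and that every page is full; the function names are made up for illustration.

```python
import math

PAGE_SIZE = 100        # maximum items per page
ANON_QUOTA = 300       # requests/day without a key
KEYED_QUOTA = 10_000   # requests/day with a registered key


def requests_needed(questions: int, page_size: int = PAGE_SIZE) -> int:
    """Number of page requests needed to fetch `questions` rows."""
    return math.ceil(questions / page_size)


def fits_daily_quota(questions: int, has_key: bool) -> bool:
    """True if the pull fits into one day's request quota."""
    quota = KEYED_QUOTA if has_key else ANON_QUOTA
    return requests_needed(questions) <= quota


print(requests_needed(1_000))                      # 10 requests
print(fits_daily_quota(30_000, has_key=False))     # exactly 300 requests
print(fits_daily_quota(1_000_000, has_key=True))   # exactly 10,000 requests
```

In practice retries and partially filled pages eat into the budget, so the real ceiling sits somewhat below these numbers.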

Beyond quota: the API gzip-compresses every response, returns timestamps as Unix integers rather than ISO strings, and includes accepted_answer_id only on questions that actually have an accepted answer. These are small friction points, but they add up across a pipeline.
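As an illustration of the compression point, here is a small sketch that decodes a raw response body whether or not it is still gzip-compressed. Many HTTP clients decompress transparently, so this only matters when you handle raw bytes yourself; decode_body is a hypothetical helper name.

```python
import gzip
import json
from typing import Any

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def decode_body(raw: bytes) -> dict[str, Any]:
    """Decode a response body into a dict, decompressing gzip if needed."""
    if raw[:2] == GZIP_MAGIC:
        raw = gzip.decompress(raw)
    return json.loads(raw.decode("utf-8"))


# Simulated wire payload: a compressed API-style wrapper object.
wire = gzip.compress(json.dumps({"items": [], "has_more": False}).encode("utf-8"))
print(decode_body(wire))
```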

What the data looks like

Every question becomes one flat, Pydantic-validated row. Seventeen fields, consistent shape every time:

{
  "question_id": 11227809,
  "site": "stackoverflow",
  "title": "Why is processing a sorted array faster than processing an unsorted array?",
  "body_html": "<p>Here is a piece of C++ code that shows some very peculiar behavior...</p>",
  "tags": ["java", "c++", "performance", "cpu-architecture", "branch-prediction"],
  "score": 27122,
  "view_count": 1783944,
  "answer_count": 28,
  "is_answered": true,
  "accepted_answer_id": 11227902,
  "link": "https://stackoverflow.com/questions/11227809/why-is-processing-a-sorted-array-faster",
  "owner_user_id": 508666,
  "owner_display_name": "GManNickG",
  "creation_date": 1341853987,
  "last_activity_date": 1716892810,
  "posted_at": "2012-07-09T17:13:07+00:00",
  "scraped_at": "2026-05-31T14:22:00+00:00"
}

posted_at is an ISO-8601 string derived from creation_date, so your pipeline doesn't need to do the conversion. body_html is present when you set includeBody: true (the default): the full question text in HTML, ready for chunking into a RAG store. It is null when you set includeBody: false to save quota.
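For readers who want to reproduce the timestamp conversion on the other Unix fields (last_activity_date, for example), a minimal sketch using only the standard library follows. The helper name to_iso_utc is an assumption for illustration, not the Actor's own code.

```python
from datetime import datetime, timezone


def to_iso_utc(epoch_seconds: int) -> str:
    """Convert a Unix timestamp (seconds) to an ISO-8601 string in UTC."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc).isoformat()


print(to_iso_utc(0))  # 1970-01-01T00:00:00+00:00
```

Passing tz=timezone.utc matters: without it, fromtimestamp uses the machine's local zone and the string loses its +00:00 offset.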

The naive approach (and why it falls apart)

The first pass most engineers write looks like this: call the /questions endpoint with &tagged=python, collect items, loop until has_more is false. It works on a small run. It breaks at scale.

1. The backoff field is load-bearing. When the Stack Exchange API returns "backoff": 5 in the JSON body, it expects you to wait 5 seconds before the next request. This is not an HTTP 429; it's a field in a 200 response. Code that ignores it hammers the server, gets throttled more aggressively, and can be cut off well before it has used its 10,000-request daily quota. We read backoff on every page and sleep backoff + 1 seconds. We retry on 408 / 429 / 5xx with exponential backoff starting at 2 seconds, up to 5 attempts.

2. The 300/day anonymous wall is invisible until it hits. If your script runs without an API key, it hits the 300-request cap sometime around midnight UTC and silently stops paginating. We surface partial-success clearly with a set_status_message — you always know how many questions landed before quota exhausted.

3. TLS fingerprinting still matters. The Stack Exchange API inspects the request profile. We rotate through Chrome / Firefox / Safari TLS fingerprints via curl-cffi, so every session looks like a real browser rather than a Python requests call with a bot-shaped ClientHello. On any block we rotate residential proxies through Apify Proxy, refreshing the session ID each time.

4. Deleted-user nullability. Questions asked by deleted accounts return owner: {} or no owner key. A parser that assumes item["owner"]["user_id"] exists throws a KeyError on a ~5% subset of questions. We handle this upstream; owner_user_id and owner_display_name are nullable fields that arrive as null rather than crashing your pipeline.
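Points 1 and 4 are the easiest to sketch in isolation. The snippet below is a simplified illustration, not the Actor's source: the doubling factor in the retry schedule is an assumption (the text above only says exponential, starting at 2 seconds), and all function names are hypothetical.

```python
from typing import Any, Optional

# Statuses treated as transient: 408, 429, and any 5xx.
RETRYABLE_STATUSES = frozenset({408, 429}) | frozenset(range(500, 600))


def should_retry(status: int) -> bool:
    """True if the HTTP status is worth retrying."""
    return status in RETRYABLE_STATUSES


def retry_delays(first_delay: float = 2.0, max_attempts: int = 5) -> list[float]:
    """Waits between attempts, doubling each time (assumed factor of 2).

    Five attempts leave four gaps: 2, 4, 8, 16 seconds.
    """
    return [first_delay * 2 ** n for n in range(max_attempts - 1)]


def owner_fields(item: dict[str, Any]) -> tuple[Optional[int], Optional[str]]:
    """Return (user_id, display_name), tolerating a missing or empty owner."""
    owner = item.get("owner") or {}
    return owner.get("user_id"), owner.get("display_name")


print(retry_delays())
print(owner_fields({"question_id": 1}))  # no owner key at all
```

Using item.get("owner") or {} covers both shapes described above, an empty owner object and a missing key, without raising KeyError.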

The Actor 🔧

I packaged the result as an Apify Actor: Stack Exchange Questions Scraper.

Three modes: fetch by tag (all questions matching a tag or tag list), free-text search, or user's questions by numeric user ID. Run it from the Apify Console by filling in the form, or programmatically:

from apify_client import ApifyClient

client = ApifyClient("YOUR_APIFY_TOKEN")

run = client.actor("DevilScrapes/stackexchange-questions-scraper").call(
    run_input={
        "site": "stackoverflow",
        "mode": "tagged",
        "tags": ["python", "asyncio"],
        "sortBy": "votes",
        "maxResults": 1000,
        "includeBody": True,
        "apiKey": "YOUR_STACKAPPS_KEY",  # optional — lifts quota to 10k/day
    }
)

for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item["title"], item["score"])

The site parameter accepts any Stack Exchange site slug — stackoverflow, superuser, serverfault, askubuntu, stats (Cross Validated), math, or any of the 170+ network sites. The sortBy parameter maps directly to the API's sort param: activity, votes, creation, or hot.

If you have a Stack Apps key (free to register at stackapps.com), pass it in apiKey and the quota ceiling rises from 300 to 10,000 requests per day. Without it, the Actor uses anonymous quota and surfaces a warning when it exhausts.
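Once a run has finished, the rows can be written to disk in whatever shape your pipeline prefers. Here is a minimal sketch that writes any iterable of row dicts as JSON Lines; write_jsonl and the file name are illustrative assumptions, and in practice the rows would come from iterate_items() as in the snippet above.

```python
import json
from pathlib import Path
from typing import Any, Iterable


def write_jsonl(rows: Iterable[dict[str, Any]], path: str) -> int:
    """Write one JSON object per line and return the number of rows written."""
    count = 0
    with Path(path).open("w", encoding="utf-8") as fh:
        for row in rows:
            # ensure_ascii=False keeps non-ASCII titles readable in the file
            fh.write(json.dumps(row, ensure_ascii=False) + "\n")
            count += 1
    return count


sample = [{"question_id": 1, "title": "Example question", "score": 10}]
print(write_jsonl(sample, "questions.jsonl"))
```

Because the function consumes an iterable lazily, it does not need the whole dataset in memory.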

Use cases

Five patterns that come up repeatedly in the communities where this Actor gets used:

RAG corpus for a domain copilot. Pull the 500 highest-voted questions tagged [your-library], include the body, chunk the HTML into 512-token segments, embed, and index. The corpus covers the canonical questions your users actually ask. At $1.50 / 1,000 questions, a 10,000-question seed corpus costs about $15.
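The chunking step might look roughly like this. It is a sketch under a loud assumption: whitespace-separated words stand in for tokens, whereas a real pipeline would count tokens with its embedding model's tokenizer. The helper names are hypothetical.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes and ignores tags."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: list[str] = []

    def handle_data(self, data: str) -> None:
        self.parts.append(data)


def html_to_text(html: str) -> str:
    """Strip tags from body_html and collapse runs of whitespace."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


def chunk_words(text: str, size: int = 512) -> list[str]:
    """Split text into chunks of at most `size` words (a proxy for tokens)."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


text = html_to_text("<p>Here is a piece of <b>C++</b> code</p>")
print(chunk_words(text, size=4))
```

Joining text nodes with a space keeps adjacent paragraphs from running together, at the cost of occasionally splitting a word that was only partly wrapped in a tag.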

DevRel unanswered-question monitoring. Schedule a daily run on your product's tag, sort by creation, diff against yesterday's dataset, and alert when new questions arrive with zero answers. Ship a fix, post the answer with attribution, and your tag's answer rate climbs — one of the most underused dev-relations signals available.
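The daily diff can be sketched in a few lines, assuming both days' datasets are already loaded as lists of row dicts with the question_id and answer_count fields shown earlier. The function name is made up for illustration.

```python
from typing import Any

Row = dict[str, Any]


def new_unanswered(today: list[Row], yesterday: list[Row]) -> list[Row]:
    """Rows that first appear today and still have zero answers."""
    seen = {row["question_id"] for row in yesterday}
    return [
        row
        for row in today
        if row["question_id"] not in seen and row["answer_count"] == 0
    ]


yesterday = [{"question_id": 1, "answer_count": 0}]
today = [
    {"question_id": 1, "answer_count": 0},  # already known
    {"question_id": 2, "answer_count": 3},  # new but answered
    {"question_id": 3, "answer_count": 0},  # new and unanswered
]
print(new_unanswered(today, yesterday))
```

Only question 3 would trigger an alert here; question 1 was already in yesterday's run and question 2 has answers.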

Competitor tag benchmarking. Run the Actor on [strapi] and [directus] weekly, compare view_count velocity on similar questions, and track whether a competitor is gaining or losing developer mindshare. Directional and cheap.

Stack Overflow data dump alternative. For fresh data — questions posted after the last dump, current vote counts, live accepted-answer status — the API is the only path. This Actor gives you dump-like row structure with current values, at a per-question price well below a commercial data license. Every row includes the canonical link you need for CC BY-SA 4.0 attribution.

Edtech content enrichment. Feed questions tagged [javascript] with score >= 50 into a learning platform as curated exercises. The accepted-answer ID gives you the canonical solution reference; filter by is_answered: true to keep only questions with known-good answers.
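That filter is a one-liner over the dataset rows. A minimal sketch, assuming rows shaped like the sample above; the function name and the default threshold parameter are illustrative.

```python
from typing import Any

Row = dict[str, Any]


def curated_exercises(rows: list[Row], min_score: int = 50) -> list[Row]:
    """Keep answered questions at or above the score threshold."""
    return [
        row
        for row in rows
        if row["is_answered"] and row["score"] >= min_score
    ]


rows = [
    {"question_id": 1, "score": 120, "is_answered": True, "accepted_answer_id": 9},
    {"question_id": 2, "score": 75, "is_answered": False, "accepted_answer_id": None},
    {"question_id": 3, "score": 12, "is_answered": True, "accepted_answer_id": 4},
]
print([row["question_id"] for row in curated_exercises(rows)])
```

Note that accepted_answer_id can still be null on an answered question, so check it separately if you need the canonical solution reference.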

Pricing — exact numbers 💰

Pay-per-event. You pay for questions that land in your dataset. No data, no charge (apart from the start fee, which is charged once per run).

Event	USD
Actor start (one-off per run)	$0.005
Per question written	$0.0015

Pull	Cost
100 questions	$0.16
1,000 questions	$1.51
10,000 questions	$15.01
100,000 questions	$150.01

Apify gives every new account $5 of free credit — no credit card required. That covers your first ~3,300 questions. For context, the Stack Exchange commercial data license is priced for enterprise budgets; this Actor operates in the per-question API layer, priced for builders.
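The figures in the table follow directly from the two event prices. A short sketch of the arithmetic, using Decimal to avoid floating-point noise and assuming a single run (so the start fee is charged once); the function names are illustrative.

```python
from decimal import ROUND_HALF_UP, Decimal

START_FEE = Decimal("0.005")      # charged once per run
PER_QUESTION = Decimal("0.0015")  # charged per question written


def run_cost(questions: int) -> Decimal:
    """Exact cost of one run, before rounding to cents."""
    return START_FEE + PER_QUESTION * questions


def to_cents(amount: Decimal) -> Decimal:
    """Round half-up to two decimals, matching the table above."""
    return amount.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


def questions_within(budget: Decimal) -> int:
    """How many questions a budget covers in a single run."""
    if budget < START_FEE:
        return 0
    return int((budget - START_FEE) // PER_QUESTION)


for n in (100, 1_000, 10_000, 100_000):
    print(n, to_cents(run_cost(n)))
print(questions_within(Decimal("5")))  # 3330, i.e. the ~3,300 quoted above
```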

The technically interesting bit

The Stack Exchange API's backoff signalling is genuinely unusual. Most APIs express rate limits through HTTP headers (Retry-After, X-RateLimit-Remaining) or via 429 status codes. Stack Exchange sends backoff as a field inside a 200 response body, alongside the actual items. The implication: a pagination loop that only checks HTTP status codes will never see the signal. It will appear to work, fetch all pages successfully, and silently burn through quota 3-5x faster than it should.

We handle this with a per-page check after every fetch: if payload.get("backoff") is present and positive, the scraper sleeps backoff + 1 seconds before the next request — the + 1 guards against clock skew. It's documented in the Stack Exchange API docs but easy to miss, and it's the difference between a scraper that runs reliably at scale and one that exhausts quota halfway through a large job.
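The per-page check can be sketched as a generator. This is a simplified illustration rather than the Actor's source: fetch_page is a hypothetical callable that returns the decoded JSON wrapper for a given page number, and both it and sleep are injected so the loop can be exercised without touching the network.

```python
import time
from typing import Any, Callable, Iterator

Page = dict[str, Any]


def iter_items(
    fetch_page: Callable[[int], Page],
    sleep: Callable[[float], None] = time.sleep,
    max_pages: int = 10_000,
) -> Iterator[dict[str, Any]]:
    """Yield items page by page, honouring the body-level backoff field."""
    page = 1
    while page <= max_pages:
        payload = fetch_page(page)
        yield from payload.get("items", [])
        if not payload.get("has_more"):
            break
        backoff = payload.get("backoff") or 0
        if backoff > 0:
            # backoff arrives inside a 200 response, so check it on every page
            sleep(backoff + 1)
        page += 1


# Fake two-page response; the first page asks for a 5-second backoff.
pages = {
    1: {"items": [{"question_id": 1}], "has_more": True, "backoff": 5},
    2: {"items": [{"question_id": 2}], "has_more": False},
}
waits: list[float] = []
ids = [item["question_id"] for item in iter_items(pages.__getitem__, waits.append)]
print(ids, waits)
```

A loop that inspected only the HTTP status would process both pages identically and never record the 6-second wait.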

Limitations 🚧

API quota ceiling. Without a Stack Apps key, the anonymous quota is 300 requests/day per IP. With a key, 10,000/day. At 100 items per page, that's a maximum of 1,000,000 questions per day with a key. Enough for most use cases; a constraint for full-corpus pulls.

Answers are out of scope. The withbody filter returns the question body, but answers live on a separate endpoint — see the planned stackexchange-answers-scraper sibling.

Comments, revisions, and vote history are not included. Those are separate API endpoints with their own pagination.

Search ranking differs from the website. The API's search uses its own relevance model, which is not the same as the full-text search you see in the browser UI. Complex queries may return fewer or different results than expected.

Deleted-user questions. Questions from deleted accounts return null owner_user_id and owner_display_name. Expected; documented; handled cleanly.

FAQ

Is scraping Stack Overflow legal?
This Actor uses the official Stack Exchange v2.3 API — it is not web scraping in the traditional sense. The API is publicly documented and free to use with a registered key. Stack Exchange content is licensed under CC BY-SA 4.0, which requires attribution when redistributing. Every row the Actor returns includes the canonical question URL (link) for attribution purposes. Consult your own legal counsel for commercial redistribution use cases.

Why not just use the Stack Overflow data dump?
The community data dump — historically released every few months on Archive.org — was paused by Stack Exchange Inc. in 2023. Even when available, it is a snapshot: vote counts and accepted-answer status are frozen at dump time. This Actor returns current values from the live API, which matters for monitoring use cases and AI training datasets that need recency.

Does the Stack Exchange API have an official export?
No. The API exposes data programmatically but has no "export all questions for a tag" endpoint. Pagination is required, and the backoff field must be honoured. That pagination + backoff handling is the core of what this Actor provides.

Why are some body_html fields null?
If you run with includeBody: false, body_html is null to save API quota. Set includeBody: true (the default) to include the full question body in HTML. Some very old questions may also have sparse body data on the API side — the Actor passes through whatever the API returns.

Try it

The Actor is live on the Apify Store: apify.com/DevilScrapes/stackexchange-questions-scraper.

Free $5 trial credit, no credit card. Run it against the python tag sorted by votes and you'll have 1,000 of Stack Overflow's most canonical Python questions in your dataset in under two minutes. Building a RAG corpus, monitoring a tag, or benchmarking a competitor's developer mindshare? Drop a comment — I ship features based on what people are actually doing with the data.

Built by Devil Scrapes — Apify Actors with attitude. Pay-per-event, transparent pricing, no junk fields. 😈
