Devil Scrapes

Posted on May 31

DEV.to Scraper: pull articles by tag, author, or feed into clean JSON

#webscraping #python #apify #data

Quick answer: DEV.to (built on the Forem platform) publishes a public v1 REST API at https://developers.forem.com/api/v1 — but it paginates 30 articles at a time and offers no bulk-by-tag export beyond the first 1,000 items. A DEV.to scraper fans out across those paginated endpoints, fetches each article's full Markdown body in parallel, and returns one clean typed row per article. The Apify Actor below does it for $0.002 per article ($2.00 per 1,000), with rate-limit pacing, retries, and Pydantic-validated rows handled for you.

DEV.to's feed is one of the richest free sources of developer-written technical content on the web. On any given day the python tag alone has thousands of articles — tutorials, opinions, walkthroughs, career posts — each with engagement signals (reactions, comments, reading time) attached. The platform's UI surfaces individual articles just fine. What it doesn't give you is a download button, a bulk-export endpoint, or a way to pull every article in a tag across the full history. The API hands back 30 at a time, then stops answering after 1,000 items per tag.

If you want this as a dataset — to seed a RAG corpus, run an engagement benchmark, or mirror an author's catalogue — you have to stitch it together yourself. Here's what that involves, and how I turned it into a one-call Actor.

What is DEV.to? 🔎

DEV.to is a community publishing platform for software developers, built on Forem — the open-source publishing engine that also powers CodeNewbie and several smaller communities. Launched in 2016, DEV.to hosts millions of articles across tags like python, webdev, typescript, beginners, and ai, written by everyone from student bloggers to senior engineers.

What makes DEV.to useful as a data source:

Every article carries structured engagement metrics: positive reactions, comments, and reading time
Articles are tagged with community-maintained taxonomy (lowercase tags like javascript, devops, aws)
The body of every article is available as raw Markdown — ready to embed in a vector store without stripping HTML
Authorship is consistent: every article carries the author's username and display name (author_username and author_name in the output record below), making per-author analysis straightforward

What the platform does not give you: a bulk export, a search-by-keyword endpoint, or a way to get all articles in a tag older than the most recent thousand.

Does DEV.to have an API? 🔌

Yes — but it has meaningful limits. DEV.to's Forem v1 API is public for read access, requires no API key for GET /articles, and is reasonably well-documented. That's genuinely the good news.

The constraints that send people looking for a scraper:

30 articles per page, hard cap. You can request per_page=30 — that's the max the server will honor. Getting 10,000 articles means 334 sequential (or carefully paced parallel) requests.
Tag endpoint cuts off at 1,000 items. Past page 34, the API returns empty arrays. There's no cursor mechanism, no since timestamp, and no workaround documented by Forem.
Body Markdown requires a second request. GET /articles returns metadata. To get body_markdown you need a GET /articles/:id call per article — one extra round trip for every row you want full-text on.
Rate limits are real and undocumented. Hit them and you get 429s with no Retry-After header. Retry naively and you accumulate backoff penalties.

None of that is a dealbreaker on its own. Together, for a 10,000-article corpus pull, it's several hours of babysitting a script. That's the gap the Actor fills.
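To put rough numbers on that, here is a back-of-envelope sketch of the request budget implied by the limits above. The `request_budget` helper is my own illustration, not part of the Actor, and it ignores the 1,000-item tag ceiling to show only the raw arithmetic.

```python
import math

PER_PAGE = 30  # page size described above


def request_budget(n_articles: int, include_body: bool = True) -> dict:
    """Estimate how many HTTP requests a pull of n_articles needs."""
    listing = math.ceil(n_articles / PER_PAGE)  # paginated listing calls
    detail = n_articles if include_body else 0   # one detail call per body
    return {"listing": listing, "detail": detail, "total": listing + detail}


print(request_budget(10_000))
print(request_budget(10_000, include_body=False))
```

The listing calls are the small part; the per-article body fetches dominate as soon as you want full text.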

What the data looks like

Each article becomes one flat, typed row. Here's a representative output record with all 16 fields from models.py:

{
  "id": 1893402,
  "slug": "build-a-rag-pipeline-with-python-and-chroma-3x7k",
  "title": "Build a RAG pipeline with Python and Chroma",
  "description": "A step-by-step guide to building a retrieval-augmented generation system using Python, ChromaDB, and the OpenAI API.",
  "url": "https://dev.to/pythonista/build-a-rag-pipeline-with-python-and-chroma-3x7k",
  "cover_image": "https://media2.dev.to/cdn-cgi/image/quality=100/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xyz.jpg",
  "tags": ["python", "ai", "machinelearning", "beginners"],
  "author_username": "pythonista",
  "author_name": "Alex Chen",
  "reading_time_minutes": 8,
  "positive_reactions_count": 347,
  "comments_count": 19,
  "body_markdown": "## Introduction\n\nRetrieval-augmented generation (RAG)...",
  "published_at": "2026-04-12T09:14:00Z",
  "edited_at": "2026-04-14T11:02:00Z",
  "scraped_at": "2026-05-29T08:22:41+00:00"
}

Sixteen fields, consistent shape every time, Pydantic-validated before the row is written. It drops straight into Pandas, a vector store, or BigQuery with no field-wrangling on your side.
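As a rough illustration of how a row like this might be produced, the sketch below flattens a raw API article into the 16-field shape. The real models.py isn't shown in this post, so treat everything here as an assumption: the `to_row` helper is hypothetical, the nested `user` object and the `tag_list` key are my reading of the API response, and a plain dict stands in for the Pydantic model to keep the example dependency-free.

```python
from datetime import datetime, timezone
from typing import Any


def to_row(article: dict[str, Any]) -> dict[str, Any]:
    """Flatten one raw article dict into the 16-field row shape shown above."""
    user = article.get("user") or {}      # assumed nested author object
    tags = article.get("tag_list") or []  # assumed: a list or a comma-separated string
    if isinstance(tags, str):
        tags = [t.strip() for t in tags.split(",") if t.strip()]
    return {
        "id": int(article["id"]),
        "slug": article.get("slug"),
        "title": article.get("title"),
        "description": article.get("description"),
        "url": article.get("url"),
        "cover_image": article.get("cover_image"),
        "tags": list(tags),
        "author_username": user.get("username"),
        "author_name": user.get("name"),
        "reading_time_minutes": article.get("reading_time_minutes"),
        "positive_reactions_count": article.get("positive_reactions_count", 0),
        "comments_count": article.get("comments_count", 0),
        # Empty or missing bodies become None instead of dropping the row.
        "body_markdown": article.get("body_markdown") or None,
        "published_at": article.get("published_at"),
        "edited_at": article.get("edited_at"),
        "scraped_at": datetime.now(timezone.utc).isoformat(timespec="seconds"),
    }
```

Normalising tags and null bodies in one place is what keeps the row shape identical across modes.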

The naive approach (and why it falls apart) 🔧

The obvious script goes like this:

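import httpx

# Naive pagination: walk the tag listing until the API returns an empty page.
articles = []
page = 1
while True:
    resp = httpx.get(
        "https://dev.to/api/articles",
        params={"tag": "python", "page": page, "per_page": 30},
    )
    # No status check here, so an error response is handled as if it were data.
    batch = resp.json()
    if not batch:
        # An empty list is read as "done", even if the listing simply ran out.
        break
    articles.extend(batch)
    page += 1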

This works until it doesn't. Three failure modes that matter at scale:

1. The 1,000-item ceiling. Around page 34 the API returns an empty list for any tag endpoint. There's no error, no header — just silence. A naive loop exits thinking it finished. You have 1,000 rows instead of the 12,000 that exist.

2. The body-fetch N+1 problem. Want Markdown? Every article needs a second GET /articles/:id. For 1,000 articles that's 1,000 extra requests. We handle this with a concurrency parameter — up to 16 parallel body fetches — so the Actor fans out rather than serializing. We pace those fetches and retry with exponential backoff on 408 / 429 / 5xx, up to 5 attempts per article before we surface a partial-success status rather than handing you a half-empty dataset.

3. Rate limits that arrive unannounced. DEV.to's rate-limit threshold isn't published; it varies by endpoint and time of day. We back off on rate-limit signals and reset the session rather than triggering a retry storm, and we report what happened through the run's status message (set_status_message) rather than silently returning fewer articles than you asked for.
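The retry policy described in point 2 can be sketched roughly like this. It's a minimal stand-in, not the Actor's code: `fetch_with_backoff`, the `fetch` callable returning a `(status, body)` tuple, and the one-second base delay with jitter are all assumptions made for illustration. Only the retryable statuses (408 / 429 / 5xx) and the five-attempt cap come from the text above.

```python
import random
import time
from typing import Callable, Optional, Tuple

RETRYABLE = {408, 429}
MAX_ATTEMPTS = 5  # attempts per article, as described above


def is_retryable(status: int) -> bool:
    """True for 408, 429 and any 5xx status."""
    return status in RETRYABLE or 500 <= status <= 599


def fetch_with_backoff(
    fetch: Callable[[], Tuple[int, Optional[dict]]],
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[dict]:
    """Call fetch() until it succeeds, fails permanently, or attempts run out."""
    for attempt in range(MAX_ATTEMPTS):
        status, body = fetch()
        if status == 200:
            return body
        if not is_retryable(status):
            return None  # e.g. 404: retrying will not help
        if attempt < MAX_ATTEMPTS - 1:
            # Exponential backoff plus jitter: about 1s, 2s, 4s, 8s by default.
            sleep(base_delay * 2 ** attempt + random.uniform(0, base_delay / 2))
    return None  # caller records this article as a partial failure
```

Passing `sleep` in as a parameter makes the policy easy to test without actually waiting.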

We rotate browser fingerprints via curl-cffi so requests look like a real browser's TLS handshake, and we thread Apify residential proxies on every session rotation — fresh exit IP, fresh cookie jar — so a single blocked IP doesn't kill the run.

The Actor

I packaged this as an Apify Actor: DEV.to Articles Scraper.

Open the Apify Console and click Start, or run it programmatically with the Apify Python client:

from apify_client import ApifyClient

client = ApifyClient("YOUR_APIFY_TOKEN")

run = client.actor("DevilScrapes/dev-to-articles-scraper").call(
    run_input={
        "mode": "tag",
        "tag": "python",
        "includeBody": True,
        "maxResults": 500,
        "concurrency": 8,
    }
)

for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item["title"], item["positive_reactions_count"])

The four mode values from the input schema:

Mode	What it fetches
`tag`	All articles for a given tag (e.g. `python`, `ai`, `webdev`)
`username`	All articles by a specific DEV.to author
`latest`	Global latest feed — newest articles across all tags
`top`	Top articles of all time across the platform

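Set includeBody: false if you only need metadata. That skips the per-article detail request, which is the bulk of the request count (for 1,000 articles, roughly 34 listing calls instead of 1,034), and shortens the run accordingly. maxResults caps the total rows; concurrency controls how many body fetches run in parallel (1–16, default 4).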
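If you're curious what a bounded fan-out like that looks like, here is a small sketch using a thread pool. The Actor's internals aren't shown in this post, so `fetch_bodies`, `clamp_concurrency`, and the `fetch_one` callable are hypothetical names, and threads are simply a convenient stdlib stand-in for whatever the Actor really uses.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, Optional


def clamp_concurrency(value: Optional[int], default: int = 4) -> int:
    """Keep the worker count inside the 1-16 range described above."""
    return max(1, min(16, default if value is None else value))


def fetch_bodies(
    ids: Iterable[int],
    fetch_one: Callable[[int], Optional[str]],
    concurrency: Optional[int] = None,
) -> dict[int, Optional[str]]:
    """Fetch article bodies in parallel; returns {article_id: markdown or None}."""
    id_list = list(ids)
    with ThreadPoolExecutor(max_workers=clamp_concurrency(concurrency)) as pool:
        # pool.map preserves input order, so zip pairs each id with its body.
        return dict(zip(id_list, pool.map(fetch_one, id_list)))
```

In a real run `fetch_one` would wrap the GET /articles/:id call plus its retry logic; here it can be any function from ID to body.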

Use cases

RAG corpus seeding. Pull mode=tag, tag=python, includeBody=true, maxResults=1000 to get 1,000 Python tutorials in raw Markdown. Each article is a natural document-level unit: body_markdown holds the whole post, so split long bodies on headings if your embedding model needs smaller chunks. Embed the result into Chroma, Pinecone, or Weaviate. The url and author_username fields give you citation metadata for free.

Trending tag dashboards. Schedule a daily run on mode=tag, tag=ai with maxResults=50. Diff today's positive_reactions_count against yesterday's — any article that gained more than 50 reactions in 24 hours is trending. Wire it into a Slack webhook and you have a free daily briefing.
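The day-over-day diff is only a few lines. This is a sketch under the assumption that you've already loaded yesterday's and today's datasets as lists of row dicts; the `trending` function name and the extra `gained` key are mine.

```python
def trending(yesterday: list[dict], today: list[dict], threshold: int = 50) -> list[dict]:
    """Return today's rows whose reaction count grew by more than threshold."""
    before = {row["id"]: row["positive_reactions_count"] for row in yesterday}
    hits = []
    for row in today:
        if row["id"] not in before:
            continue  # first sighting: no baseline to diff against
        gained = row["positive_reactions_count"] - before[row["id"]]
        if gained > threshold:
            hits.append({**row, "gained": gained})
    # Biggest movers first.
    return sorted(hits, key=lambda r: r["gained"], reverse=True)
```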

Author monitoring and portfolio analysis. Pull mode=username for a specific author to mirror their full catalogue. Useful for DevRel teams tracking competitor advocates, recruiters benchmarking engineering blog authors, or writers building a personal analytics dashboard outside DEV.to's own stats page.

Newsletter assembly. Pull mode=top or mode=latest with maxResults=10, sort by positive_reactions_count, and render the top 5 to Markdown for a weekly digest. The reading_time_minutes field tells readers upfront what they're committing to.
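A minimal version of that render step might look like the following. The `render_digest` helper and the digest layout are illustrative choices, not something the Actor ships.

```python
def render_digest(rows: list[dict], top_n: int = 5) -> str:
    """Render the top_n most-reacted rows as a numbered Markdown list."""
    ranked = sorted(rows, key=lambda r: r["positive_reactions_count"], reverse=True)
    lines = ["# Weekly digest", ""]
    for i, row in enumerate(ranked[:top_n], start=1):
        lines.append(
            f"{i}. [{row['title']}]({row['url']}) by {row['author_name']} "
            f"({row['reading_time_minutes']} min read)"
        )
    return "\n".join(lines)
```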

Engagement benchmarking. Pull 500 articles in a tag, group by author_username, and compute average reactions per post — a simple "who are the most impactful writers in this niche?" query for sponsorship research, guest-post pitching, or a contributor leaderboard.
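Without Pandas, the group-and-average step can be sketched in plain Python. The `author_leaderboard` name and the `min_posts` cutoff (to keep one-hit authors from topping the board) are assumptions I added for the example.

```python
from collections import defaultdict
from statistics import mean


def author_leaderboard(rows: list[dict], min_posts: int = 2) -> list[tuple]:
    """Return (author, average reactions, post count), best average first."""
    by_author = defaultdict(list)
    for row in rows:
        by_author[row["author_username"]].append(row["positive_reactions_count"])
    board = [
        (author, mean(counts), len(counts))
        for author, counts in by_author.items()
        if len(counts) >= min_posts
    ]
    return sorted(board, key=lambda entry: entry[1], reverse=True)
```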

Pricing — exact numbers 💰

Pay-per-event. You pay for articles written to your dataset, nothing for the ones you don't get.

$0.005 per run (covers the Actor warm-up)
$0.002 per article written to the dataset

Pull	Cost
30 articles (default)	$0.07
100 articles	$0.21
1,000 articles	$2.01
5,000 articles	$10.01
10,000 articles	$20.01

Apify's $5 free trial credit covers your first ~2,490 articles with no credit card required. No subscription, no minimum, no charge for runs that return zero results.
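If you want to budget a pull before running it, the pricing above reduces to one line of arithmetic. The helper names below are mine, and the sketch assumes a single run (one warm-up fee); the table above appears to round up to the nearest cent.

```python
from decimal import Decimal

RUN_FEE = Decimal("0.005")      # per run
PER_ARTICLE = Decimal("0.002")  # per article written to the dataset


def run_cost(n_articles: int) -> Decimal:
    """Exact cost in USD of one run that writes n_articles rows."""
    return RUN_FEE + PER_ARTICLE * n_articles


def articles_for_budget(budget: str) -> int:
    """How many articles a single run can write for the given USD budget."""
    remaining = Decimal(budget) - RUN_FEE
    return max(0, int(remaining // PER_ARTICLE))


print(run_cost(1_000))
print(articles_for_budget("5"))
```

Using Decimal keeps the fractions of a cent exact; for a $5 budget in one run this works out to slightly under 2,500 articles, in line with the trial-credit figure above.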

The technically interesting bit

DEV.to's API in practice cuts off article listings at 1,000 per tag, but the per-article GET /articles/:id endpoint has no such limit. So a full corpus is achievable by combining the listing endpoint (for IDs) with the detail endpoint (for bodies): even when the listing only goes 34 pages deep, you can supplement IDs from the username endpoint, the latest feed, or a prior run's dataset. The Actor exposes this as a design choice: mode=tag is the fast lane for recent articles; mode=latest is the slow lane for full-history accumulation over scheduled runs. Both paths produce identical row shapes, so your downstream pipeline never needs to know which mode fed it.
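Because the row shape is identical across modes, accumulating a corpus over several runs comes down to a merge keyed on id. Here's a rough sketch; `merge_runs` is a hypothetical helper, and it assumes every row carries scraped_at and published_at as ISO 8601 strings in a consistent UTC format, as in the sample record.

```python
def merge_runs(*runs: list[dict]) -> list[dict]:
    """Merge rows from several runs, keeping the freshest copy of each article."""
    latest: dict[int, dict] = {}
    for rows in runs:
        for row in rows:
            seen = latest.get(row["id"])
            # Same-format ISO 8601 timestamps compare correctly as strings.
            if seen is None or row["scraped_at"] > seen["scraped_at"]:
                latest[row["id"]] = row
    # Newest articles first.
    return sorted(latest.values(), key=lambda r: r["published_at"], reverse=True)
```

Feed it the datasets from a tag run, a username run, and a week of scheduled latest runs, and you get one deduplicated corpus.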

Limitations 🚧

Tag endpoint hard cap at ~1,000 items. The DEV.to v1 API does not paginate beyond this for the tag feed. Full-history pulls require either the username mode (per-author) or multiple scheduled latest-mode runs.
Body Markdown is the API's version. If an author used DEV.to's rich editor with embedded Liquid tags (custom video/link cards), those render as raw Liquid syntax in the Markdown — not HTML. Post-processing is on you.
No comment bodies. comments_count is in the metadata, but fetching individual comment threads would multiply the request count significantly. Not in scope for v1.
No reading-time filtering at the API level. The API doesn't accept a min_reading_time param, so download the dataset and filter on reading_time_minutes afterwards, for example in Pandas.
Private/draft articles are inaccessible. The public API only surfaces published, non-hidden articles.
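Two of these limitations are easy to paper over after the fact. The sketch below is one possible post-processing pass, not something the Actor does for you: the regex is a simple approximation of Liquid tag syntax, and both helper names are hypothetical.

```python
import re

# Matches Liquid tags such as {% embed https://example.com %} or {% youtube abc %}.
LIQUID_TAG = re.compile(r"\{%-?\s*(\w+)\s*(.*?)\s*-?%\}", re.DOTALL)


def replace_liquid(markdown: str) -> str:
    """Swap each Liquid tag for a plain-text placeholder like [youtube: abc]."""
    def _sub(match: re.Match) -> str:
        name, arg = match.group(1), match.group(2)
        # Tags without an argument (e.g. {% endraw %}) are dropped entirely.
        return f"[{name}: {arg}]" if arg else ""
    return LIQUID_TAG.sub(_sub, markdown)


def filter_by_reading_time(rows: list[dict], min_minutes: int) -> list[dict]:
    """Keep rows whose reading_time_minutes is at least min_minutes."""
    return [r for r in rows if (r.get("reading_time_minutes") or 0) >= min_minutes]
```

Whether you keep a placeholder or drop the tag depends on your use: for embeddings a short placeholder is usually harmless, while for rendering you'd map each tag to real HTML.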

FAQ

Is scraping DEV.to legal?
This Actor calls DEV.to's own published public API (https://developers.forem.com/api/v1) — no authentication bypassed, no HTML scraped, no undocumented endpoint hit. The API is designed for programmatic access. Standard advice: read DEV.to's Terms of Service, stay within polite request rates, and don't republish article bodies wholesale without attribution.

Can I export the dataset to a spreadsheet or warehouse?
Yes — export CSV, JSON, or Excel directly from the Apify Console. Alternatively, webhook the dataset on ACTOR.RUN.SUCCEEDED into Make, Zapier, or n8n, or fetch it via the Apify API for direct warehouse ingestion.

Does DEV.to have an official bulk-export API?
No. The Forem v1 API paginates 30 articles at a time and caps tag listings at approximately 1,000 items. There is no official bulk download, no CSV export, and no GraphQL endpoint on the public surface.

Why are some body_markdown fields null?
Some DEV.to articles link out to a canonical URL hosted on the author's own blog — the metadata (title, tags, reactions) lives on DEV.to but the body lives elsewhere. In those cases the API returns an empty or very short body; the Actor surfaces that faithfully as null rather than silently dropping the row.

Try it

The Actor is live on the Apify Store: apify.com/DevilScrapes/dev-to-articles-scraper.

Free $5 trial credit, no credit card. Run it on tag=python with maxResults=30 and you'll have a full typed dataset in under a minute — Markdown bodies included if you leave includeBody: true. Need a field that isn't there (comment threads, co-authors, series metadata)? Drop a note in the comments. We read every one.

Built by Devil Scrapes — the devil's in the data, and we keep it clean. 😈
