Devil Scrapes

Posted on Jun 2

Medium RSS Scraper: extract any author's articles as clean JSON

#webscraping #python #apify #automation

Quick answer: Medium publishes an RSS feed for every author at medium.com/feed/@username. A Medium RSS scraper fetches those feeds programmatically, parses the XML, and returns one typed row per article — title, link, author, tags, published date, and optional HTML body. The Apify Actor below does it for $0.002 per article (~$2.00 per 1,000), with proxy rotation, retries, and Pydantic-validated rows handled for you. It reads Medium's own public feed; no paywall access, no login required.

Here's the newsletter editor's problem: you follow 30 thought leaders on Medium and every Monday you want a curated digest of what they published last week. You could open 30 browser tabs and copy links — or pull all 30 feeds in a single API call and have clean JSON in Notion before coffee is ready.

RSS has been doing this job since long before Medium launched in 2012, and Medium's feeds are unambiguously public: Medium documents them in its own help docs. The catch is that reliably fetching 30+ feeds, handling Medium's periodic throttling of cloud IPs, and getting typed rows (not raw XML) into a downstream system is a non-trivial engineering exercise. Here's what that looks like, and how the Actor shortcuts it.

What is Medium? 🔎

Medium is a publishing platform where roughly 100 million readers consume articles from creators ranging from solo newsletter writers to large publications like Towards Data Science. Launched in 2012 by Twitter co-founder Ev Williams, it blends a social-network discovery layer with a WordPress-style publishing backend. For content-mining purposes, what makes Medium interesting is its RSS layer: every author and publication gets a canonical feed at medium.com/feed/@username (or medium.com/feed/publication-slug), updated when new posts land.

What the RSS feed gives you per author:

Every article title and canonical URL
The author display name
Published timestamp (RFC 822 in the feed's <pubDate>; the Actor output normalises it to ISO-8601)
Tags / categories the author applied
An HTML body excerpt via content:encoded — usually the full article for non-paywalled posts, a truncated teaser for member-only content

What it does not give you: clap counts, follower stats, reading time, or content behind Medium's paywall. Those require authenticated session scraping, which is out of scope for this Actor and, frankly, not what most buyers need.
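To make the field list above concrete, here is a minimal sketch of pulling those fields out of an RSS 2.0 document with the standard library. The sample feed is hand-written for illustration (not a captured Medium response), and the namespace URIs are the standard RSS content and Dublin Core modules, which I'm assuming Medium's feed uses. The function name and output keys are hypothetical, not the Actor's internals.

```python
import xml.etree.ElementTree as ET

# Standard namespace URIs for content:encoded and dc:creator (assumed to
# match what the feed declares).
NS = {
    "content": "http://purl.org/rss/1.0/modules/content/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hand-written sample, shaped like an author feed. Not real Medium output.
SAMPLE_FEED = """<rss version="2.0"
     xmlns:content="http://purl.org/rss/1.0/modules/content/"
     xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Stories by Example Author on Medium</title>
    <item>
      <title>Hello RSS</title>
      <link>https://medium.com/@example/hello-rss-abc123</link>
      <guid isPermaLink="false">https://medium.com/p/abc123</guid>
      <category>python</category>
      <category>rss</category>
      <dc:creator>Example Author</dc:creator>
      <pubDate>Wed, 22 Apr 2026 14:30:00 GMT</pubDate>
      <content:encoded><![CDATA[<p>Body HTML here.</p>]]></content:encoded>
    </item>
  </channel>
</rss>"""


def parse_items(xml_text):
    """Return one dict per <item>; missing elements come back as None."""
    root = ET.fromstring(xml_text)
    rows = []
    for item in root.iterfind("channel/item"):
        rows.append({
            "title": item.findtext("title"),
            "link": item.findtext("link"),
            "article_id": item.findtext("guid"),
            "author": item.findtext("dc:creator", namespaces=NS),
            "categories": [c.text for c in item.findall("category") if c.text],
            "published_raw": item.findtext("pubDate"),
            "content_html": item.findtext("content:encoded", namespaces=NS),
        })
    return rows


rows = parse_items(SAMPLE_FEED)
print(rows[0]["title"], rows[0]["categories"])
```

Because findtext returns None for absent elements, a missing guid or pubDate shows up as a null field instead of an exception.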

Does Medium have an export API? 🛠️

No. Medium deprecated its developer API in 2019, and the old api.medium.com endpoints are unmaintained fossils that only ever covered publishing, never reading. As of 2026 there is no official programmatic way to bulk-export an author's articles, past or present. The RSS feed is Medium's own supported alternative for machine readers; this Actor turns it into structured JSON at scale.

The longer answer: you can hit medium.com/feed/@username yourself with feedparser, and for a single author on your laptop that works. But Medium rate-limits cloud datacenter IPs, so 30 parallel feeds without proxy rotation and retry logic produce intermittent 429s, silent truncation, and empty feeds that look like success. Unofficial Medium APIs and full-page HTML scrapers either violate ToS or break on JavaScript-rendered pages. The RSS feed is the correct, stable surface — this Actor is the production-grade wrapper around it.

What the data looks like 📋

Each article lands as one flat, typed row. Every field below is defined in the Actor's models.py; nothing is invented:

{
  "username": "@TowersDS",
  "article_id": "https://medium.com/p/3a9e1bc72f04",
  "title": "Building a Real-Time Data Pipeline with Apache Kafka",
  "link": "https://towardsdatascience.com/building-a-real-time-data-pipeline-3a9e1bc72f04",
  "author": "Jordan Towers",
  "content_html": "<p>Kafka is overkill until the day it isn&#x27;t. Here&#x27;s the moment I changed my mind...</p>",
  "categories": ["data-engineering", "kafka", "python", "real-time"],
  "published": "2026-04-22T14:30:00+00:00",
  "scraped_at": "2026-05-28T09:15:42+00:00"
}

Nine fields, consistent shape across every run, validated with Pydantic v2 before hitting the dataset. It drops straight into Pandas, Airtable, Notion, BigQuery, or an n8n HTTP node — no XML wrangling on your side.
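If you want the same guarantee on the consuming side, a small typed wrapper is enough. This is a sketch using a standard-library dataclass as a stand-in; it is not the Actor's models.py (which uses Pydantic and isn't shown here). The class name is hypothetical, and I'm assuming only content_html and published can be null, since those are the two the post calls out.

```python
import json
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass(frozen=True)
class ArticleRow:
    """Consumer-side mirror of the nine output fields (hypothetical)."""
    username: str
    article_id: str
    title: str
    link: str
    author: str
    content_html: Optional[str]  # null when the body is absent or disabled
    categories: List[str]
    published: Optional[str]     # null when the feed's date is malformed
    scraped_at: str

    @classmethod
    def from_dict(cls, raw: dict) -> "ArticleRow":
        # Reject rows whose keys differ from the expected nine.
        expected = {f.name for f in fields(cls)}
        got = set(raw)
        if got != expected:
            raise ValueError(
                f"missing={sorted(expected - got)} extra={sorted(got - expected)}"
            )
        if not isinstance(raw["categories"], list):
            raise ValueError("categories must be a list")
        return cls(**raw)


SAMPLE_ROW = """{
  "username": "@example",
  "article_id": "https://medium.com/p/abc123",
  "title": "Hello RSS",
  "link": "https://medium.com/@example/hello-rss-abc123",
  "author": "Example Author",
  "content_html": null,
  "categories": ["python", "rss"],
  "published": null,
  "scraped_at": "2026-05-28T09:15:42+00:00"
}"""

row = ArticleRow.from_dict(json.loads(SAMPLE_ROW))
print(row.title, row.categories)
```

A shape check like this fails loudly if a field is ever renamed upstream, which is cheaper than discovering it in a Notion import.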

The naive approach (and why it falls apart) ⚙️

The first thing any Python developer tries:

import feedparser
feed = feedparser.parse("https://medium.com/feed/@username")

For a single author running on your laptop, this works. In production, across dozens of authors, on a cloud IP, it breaks in several ways:

1. IP throttling. Medium's infrastructure sees many parallel requests from the same datacenter IP range and starts returning 429 Too Many Requests or, worse, silently returning an empty feed body with a 200 OK status. A naive scraper writes zero rows and reports success. We rotate Apify residential proxies on every blocked request — fresh exit IP, fresh session — so the target sees geographically distributed, residential-looking traffic rather than a datacenter block.

2. Rate-limit pacing. Even without outright blocks, hammering 30 feeds simultaneously burns through rate limits quickly. We pace requests and retry with exponential backoff on 408 / 429 / 5xx — up to 5 attempts per feed, Retry-After headers honoured. When a feed is partially retrieved before a block, we surface the count via Actor.set_status_message rather than silently emitting an incomplete dataset.

3. Malformed XML. Medium's RSS occasionally ships malformed <pubDate> values, missing <guid> fields, and content:encoded payloads with unescaped HTML. We handle each failure shape explicitly — null fields rather than parser crashes, stable IDs even when Medium's GUID is absent.

4. TLS fingerprinting on the feed endpoints. Medium's edge network inspects TLS fingerprints on requests it suspects are automated. We impersonate real Chrome / Firefox / Safari TLS + HTTP/2 signatures via curl-cffi, so the handshake looks like a browser request, not Python's urllib3.

None of that is exotic engineering. All of it is exactly what separates a weekend script from a feed pipeline that runs Monday morning without supervision.
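For a feel of what the retry piece involves, here is a rough sketch of exponential backoff with Retry-After support and an empty-feed check. It is not the Actor's code. The Response type, the injected fetch callable, and the delay constants are all hypothetical, and treating a 200 with no <item> as a failure is my assumption (an author with zero posts would also trip it).

```python
import random
import time
from typing import Callable, Mapping, NamedTuple


class Response(NamedTuple):
    """Minimal stand-in for an HTTP response (hypothetical type)."""
    status: int
    body: str
    headers: Mapping[str, str]


def is_retryable(status: int) -> bool:
    # 408 / 429 / 5xx, as listed in the post.
    return status in (408, 429) or 500 <= status < 600


def fetch_with_retries(
    fetch: Callable[[str], Response],
    url: str,
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Return the feed body, or raise after max_attempts failures."""
    last_error = "no attempts made"
    for attempt in range(max_attempts):
        resp = fetch(url)
        if resp.status == 200 and "<item" in resp.body:
            return resp.body
        if resp.status == 200:
            # A 200 with no items is treated as a soft block, not success.
            last_error = "200 OK but no <item> elements"
        elif is_retryable(resp.status):
            last_error = f"HTTP {resp.status}"
        else:
            raise RuntimeError(f"non-retryable HTTP {resp.status} for {url}")
        if attempt < max_attempts - 1:
            retry_after = resp.headers.get("Retry-After", "")
            if retry_after.isdigit():
                # Only the seconds form is handled; HTTP-date is ignored here.
                delay = float(retry_after)
            else:
                # Exponential backoff plus a little jitter.
                delay = base_delay * 2 ** attempt + random.uniform(0, 0.5)
            sleep(delay)
    raise RuntimeError(
        f"gave up on {url} after {max_attempts} attempts: {last_error}"
    )
```

Passing fetch and sleep in as arguments keeps the sketch testable without a network; in practice fetch would wrap whatever HTTP client and proxy setup you use.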

The Actor 🚀

The result is packaged as an Apify Actor: Medium User Articles Scraper.

Paste usernames in the Console and click Start, or call it from Python:

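from apify_client import ApifyClient

# Authenticate with your personal Apify API token.
client = ApifyClient("YOUR_APIFY_TOKEN")

# Start the Actor and block until the run finishes.
run = client.actor("DevilScrapes/medium-user-articles-scraper").call(
    run_input={
        # Usernames (with or without "@"), a publication slug, or full feed URLs.
        "usernames": ["@TowersDS", "@cassidoo", "@swyx", "towards-data-science"],
        "maxArticlesPerUser": 10,
        "includeContent": True,
        "proxyConfiguration": {"useApifyProxy": True},
    }
)

# Stream the rows from the run's default dataset.
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item["title"], "—", item["published"])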

Input parameters from the schema:

Field	Type	Default	Notes
`usernames`	array	`["@medium"]`	Usernames (with or without `@`) or full profile/feed URLs
`maxArticlesPerUser`	integer	`10`	Default matches the 10 latest posts Medium's RSS exposes; maximum 50
`includeContent`	boolean	`true`	Surface `content:encoded` HTML body
`proxyConfiguration`	object	Apify Proxy	Residential rotation recommended

The Actor accepts full feed URLs too (https://medium.com/feed/@username) so you can drop it straight into an existing feed-monitoring workflow without any username parsing logic on your side.
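If you do need that normalisation in your own tooling, it is only a few lines. The sketch below is my guess at a reasonable mapping, not the Actor's resolver. The function name is hypothetical, bare names are assumed to be usernames (so pass publications as full feed URLs), and custom domains or subdomain profiles are not handled.

```python
from urllib.parse import urlparse

FEED_BASE = "https://medium.com/feed/"


def to_feed_url(value: str) -> str:
    """Map a username, profile URL, or feed URL to a medium.com feed URL."""
    value = value.strip()
    if not value:
        raise ValueError("empty username or URL")
    if value.startswith(("http://", "https://")):
        # Profile or feed URL: keep the path, drop any leading "feed/".
        path = urlparse(value).path.strip("/")
        if path.startswith("feed/"):
            path = path[len("feed/"):]
        if not path:
            raise ValueError(f"no username or slug in {value!r}")
        return FEED_BASE + path
    # Bare input: assumed to be a username, with or without "@".
    handle = value if value.startswith("@") else "@" + value
    return FEED_BASE + handle


for raw in ["@swyx", "swyx", "https://medium.com/@swyx",
            "https://medium.com/feed/towards-data-science"]:
    print(raw, "->", to_feed_url(raw))
```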

Use cases 💡

Newsletter digest curation. Pull 20–30 authors weekly, filter by categories for topic relevance, pipe to Notion or Airtable. The entire weekly curation pipeline for a boutique newsletter (30 authors × 10 articles = 300 rows) costs about $0.61 per run: $0.60 for the rows plus the $0.005 start fee.

Author monitoring with alerting. Run on a schedule, diff article IDs against a previous run's dataset, fire a Slack webhook when a new post appears. Many n8n and Make.com workflows do exactly this; the Actor is the first node.

Translation pipeline. Extract an author's last 10 articles in bulk, feed titles + HTML bodies to DeepL or OpenAI, write translated output to a second dataset. The content:encoded body is already HTML — a translation API can consume it directly without re-fetching.

Own-archive backfill. You're migrating off Medium to Ghost, Hashnode, or Astro. Pull your own @username feed to get titles, links, published dates, and body HTML. The output is structured enough to seed a Ghost content import without hand-copying every post.

Competitive content research. Watch competitors' authors, build a tag-frequency histogram across their categories arrays, and identify topics they're covering that you haven't.
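Two of these use cases, new-post alerting and tag research, come down to a few lines of post-processing on the dataset rows. A minimal sketch, assuming rows are dicts shaped like the sample output above; the function names are mine, and lower-casing tags before counting is a design choice you may not want.

```python
from collections import Counter


def new_articles(previous_rows, current_rows):
    """Rows in the current run whose article_id was not in the previous run."""
    seen = {row["article_id"] for row in previous_rows}
    return [row for row in current_rows if row["article_id"] not in seen]


def tag_histogram(rows):
    """Count how often each tag appears across all rows (case-insensitive)."""
    counts = Counter()
    for row in rows:
        counts.update(tag.lower() for tag in row.get("categories") or [])
    return counts


last_week = [
    {"article_id": "https://medium.com/p/aaa", "categories": ["python", "kafka"]},
]
this_week = last_week + [
    {"article_id": "https://medium.com/p/bbb", "categories": ["Python", "rss"]},
]

for row in new_articles(last_week, this_week):
    print("new:", row["article_id"])
print(tag_histogram(this_week).most_common(3))
```

In a scheduled workflow, new_articles is the step that decides whether the Slack webhook fires.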

Pricing — exact numbers 💰

Pay-per-event. You pay for articles that land in your dataset, nothing for articles that don't.

Event	Price	What triggers it
`actor-start`	$0.005	Once per run, covers warm-up
`result`	$0.002	Per article written to dataset

Articles pulled	Run cost
100	$0.21
500	$1.01
1,000	$2.01
5,000 (500 authors × 10)	$10.01

Apify's $5 free trial credit covers your first ~2,400 articles with no credit card. For a typical 30-author weekly digest (300 articles/run, about $0.61 each), that's 8 free runs before you'd need to top up, enough to validate your entire pipeline.
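The pricing arithmetic is simple enough to check yourself. The sketch below uses only the two event prices from the table; the function names are hypothetical, and rounding half-up to the cent is my assumption about how the run-cost column was produced.

```python
from decimal import Decimal, ROUND_HALF_UP

ACTOR_START = Decimal("0.005")  # charged once per run
PER_RESULT = Decimal("0.002")   # charged per article written


def run_cost(articles: int) -> Decimal:
    """Exact cost of one run that writes `articles` rows."""
    return ACTOR_START + PER_RESULT * articles


def display_cost(articles: int) -> Decimal:
    """Run cost rounded to the cent (half-up, an assumption)."""
    return run_cost(articles).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


def runs_covered(credit: str, articles_per_run: int) -> int:
    """How many whole runs a given credit pays for."""
    return int(Decimal(credit) // run_cost(articles_per_run))


for n in (100, 500, 1000, 5000):
    print(n, display_cost(n))
print(runs_covered("5", 300))
```

Decimal is used instead of float so that half-cent amounts round predictably.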

The technically interesting bit

Medium's RSS feed silently truncates content:encoded for member-only articles: you get a teaser paragraph rather than the full text, with no explicit flag in the feed XML that truncation happened. We detect this by comparing the word count of content:encoded against the article's estimated reading time (available in the <item> metadata for some feeds). When the discrepancy suggests truncation, we set content_html to the teaser and do not attempt to fetch the full article. The devil's in the data: a scraper that silently returns 40-word "teasers" labeled as full bodies is worse than no scraper at all.
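Here is roughly what a word-count-versus-reading-time check could look like. This is an illustrative sketch, not the Actor's implementation: the words-per-minute figure and the ratio threshold are numbers I picked, the function names are hypothetical, and the tag stripping is a crude regex that is good enough for counting words only.

```python
import re
from html import unescape
from typing import Optional

WORDS_PER_MINUTE = 250  # assumed reading speed, not a documented Medium value
MIN_RATIO = 0.5         # assumed threshold: under half the expected words

TAG_RE = re.compile(r"<[^>]+>")


def word_count(html: str) -> int:
    """Count words after stripping tags and decoding entities."""
    text = unescape(TAG_RE.sub(" ", html))
    return len(text.split())


def looks_truncated(content_html: str, reading_minutes: Optional[float]) -> Optional[bool]:
    """True or False when reading time is known; None when there is no signal."""
    if reading_minutes is None or reading_minutes <= 0:
        return None
    expected_words = reading_minutes * WORDS_PER_MINUTE
    return word_count(content_html) < expected_words * MIN_RATIO


teaser = "<p>" + " ".join(["word"] * 40) + "</p>"
print(looks_truncated(teaser, 7))     # 40 words against a 7-minute estimate
print(looks_truncated(teaser, None))  # no reading time available
```

Returning None when the reading time is missing keeps "unknown" distinct from "not truncated", which matters because only some feeds carry the estimate.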

Limitations 🚧

10 articles per author, maximum ~50. Medium's per-author RSS exposes the latest 10 posts by default; the maxArticlesPerUser cap tops out at 50. For a complete back-catalogue (some authors have hundreds of posts), RSS is not the right surface.
No claps, follower count, or reading-time stats. Those fields live in rendered HTML, not RSS. They'd require per-article page fetches, which is a different (and more fragile) Actor.
Paywalled content returns teasers. Member-only articles show a truncated excerpt in content:encoded. The Actor does not attempt to fetch paywalled full text.
Some publications don't expose per-author feeds. Large publications like Towards Data Science use publication-level RSS (medium.com/feed/towards-data-science), not per-author feeds inside the publication. Passing the publication slug works for the publication's feed; per-contributor filtering within a publication is not supported.
published is sometimes null. Some accounts emit malformed <pubDate> values; we surface null rather than guess.
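The null-rather-than-guess behaviour for dates is easy to reproduce with the standard library. A minimal sketch, assuming the feed uses RFC 822 style dates as RSS 2.0 specifies; the function name is hypothetical, and treating a timezone-less date as UTC is my assumption.

```python
from datetime import timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def parse_pub_date(raw: Optional[str]) -> Optional[str]:
    """Convert an RSS pubDate to an ISO-8601 UTC string, or None if unusable."""
    if not raw or not raw.strip():
        return None
    try:
        parsed = parsedate_to_datetime(raw.strip())
    except (TypeError, ValueError):
        # Older Python versions raise TypeError, newer ones ValueError.
        return None
    if parsed.tzinfo is None:
        # No usable offset in the source: assume UTC (an assumption).
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc).isoformat()


print(parse_pub_date("Wed, 22 Apr 2026 14:30:00 GMT"))
print(parse_pub_date("not a date"))
```

Downstream code then only has to handle two cases: a clean ISO-8601 string or None.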

FAQ ❓

Is scraping Medium's RSS feed legal?
Medium's RSS feeds are publicly accessible, documented in their own help centre, and explicitly intended for machine consumption (that's what RSS is for). This Actor reads only what Medium's own RSS layer exposes, paces requests to stay below rate limits, collects no personal data beyond author display names and public article metadata, and accesses no authenticated or paywalled content. As always, review your jurisdiction and your specific use case before deploying at scale.

Is there a Medium API I could use instead?
Medium's developer API (api.medium.com) was deprecated in 2019 and no longer supports reading articles programmatically. The RSS feed is Medium's own supported channel for machine readers; this Actor is the production-grade wrapper around it.

Can I export to Google Sheets, a database, or a webhook?
Yes — export CSV, JSON, or Excel directly from the Apify Console, or wire a webhook to ACTOR.RUN.SUCCEEDED to push the dataset into Make, n8n, Zapier, or any HTTP endpoint automatically.

Why is content_html null for some articles?
Two causes: (1) includeContent was set to false in the run input, or (2) Medium returned an empty content:encoded element for that article, which sometimes happens with older posts or certain publication feeds. The title, link, and published fields are still populated.

Try it

The Actor is live on the Apify Store: apify.com/DevilScrapes/medium-user-articles-scraper.

Free $5 trial credit, no credit card. Run it on @swyx or your own username and you'll have your last 10 articles in the dataset within seconds. Building an n8n or Make workflow around it? Drop the workflow JSON in the comments — I'm collecting patterns for a public cookbook.

External resources:

Medium RSS help documentation — Medium's own guide to their RSS surfaces
Apify Python client docs — how to call Actors and iterate datasets from Python
n8n RSS node documentation — for wiring Medium feeds into n8n workflows without code

Built by Devil Scrapes — Apify Actors with attitude. Pay-per-event pricing, honest limitations, no junk fields. 😈

DEV Community