Devil Scrapes

Posted on Jun 2

RSS Feed Scraper: parse any feed to clean JSON in the LLM era

#webscraping #python #apify #automation

Quick answer: An RSS feed scraper fetches any RSS 2.0 or Atom 1.0 URL and returns every item as a structured dataset row — title, link, author, published date, summary, full HTML content, tags, and GUID. There is no official "RSS to JSON API" you can call with a key; you parse the feed yourself or use a hosted service. The Apify Actor below handles RSS 2.0, Atom 1.0, and the common content:encoded / dc:creator extensions for $0.001 per item (~$1.00 per 1,000), with proxy rotation, fingerprint handling, and Pydantic-validated output included.

RSS never really died. It just went quiet for a decade while Twitter ate the link-sharing layer. In 2026 it is back — quietly, structurally, and with a new reason to exist: it is the format that LLM news-digest pipelines actually want. Flat XML, one item per entry, stable per-feed schemas. Clean enough to feed directly into a vector store with one line of Python.

The problem is that parsing RSS at scale is messier than the spec suggests. feedparser handles the happy path. It does not handle the roughly 1-in-20 real-world feeds that mix RSS 1.0 and 2.0 namespace collisions, emit content:encoded without declaring the content module, or publish timestamps that are half RFC 2822 and half something a developer invented in 2003. You want a hosted service that absorbs those edge cases, normalises both dialects, and hands you typed rows.

What is RSS? 🎙️

RSS (Really Simple Syndication) is an XML-based web feed format that publishes a list of items from a source — blog posts, news articles, podcast episodes, GitHub releases — in a machine-readable structure. Atom is a successor format with a slightly cleaner namespace model. Both describe the same concept: a channel with a list of entries, each carrying a title, a link, a date, and optionally a body.
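
To make the "same concept" point concrete, here is a minimal sketch that reads one RSS 2.0 document and one Atom 1.0 document with nothing but the standard library. Both sample documents are invented and trimmed (a valid Atom entry also carries an id), and list_entries is a hypothetical helper for illustration, not the Actor's code. It assumes well-formed XML, which real feeds do not always provide.

```python
import xml.etree.ElementTree as ET

# Atom elements live in this namespace; plain RSS 2.0 elements have none.
ATOM_NS = "{http://www.w3.org/2005/Atom}"

# Invented sample documents, trimmed to the fields discussed above.
RSS_SAMPLE = """<rss version="2.0"><channel>
  <title>Example Blog</title>
  <item>
    <title>Hello RSS</title>
    <link>https://example.com/hello-rss</link>
    <pubDate>Fri, 15 May 2026 20:00:00 GMT</pubDate>
  </item>
</channel></rss>"""

ATOM_SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example Blog</title>
  <entry>
    <title>Hello Atom</title>
    <link href="https://example.com/hello-atom"/>
    <updated>2026-05-15T20:00:00Z</updated>
  </entry>
</feed>"""


def list_entries(xml_text: str) -> list:
    """Return title/link/date dicts for a well-formed RSS 2.0 or Atom 1.0 document."""
    root = ET.fromstring(xml_text)
    if root.tag == "rss":
        # RSS: <item> elements, link is element text, date is <pubDate>.
        return [
            {
                "title": item.findtext("title"),
                "link": item.findtext("link"),
                "date": item.findtext("pubDate"),
            }
            for item in root.iter("item")
        ]
    if root.tag == f"{ATOM_NS}feed":
        # Atom: <entry> elements, link is an href attribute, date is <updated>.
        rows = []
        for entry in root.iter(f"{ATOM_NS}entry"):
            link = entry.find(f"{ATOM_NS}link")
            rows.append({
                "title": entry.findtext(f"{ATOM_NS}title"),
                "link": link.get("href") if link is not None else None,
                "date": entry.findtext(f"{ATOM_NS}updated"),
            })
        return rows
    raise ValueError(f"unrecognised root element: {root.tag}")
```

Note that the two dialects already disagree on where the link lives and how the date is written, even in this tiny example.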

Publishers expose feeds at stable URLs. Subscribers poll those URLs for new entries. The spec has been around since 1999. The format has no authentication layer and no official SDK: you fetch the XML and parse it.

Does RSS have an API? 🔌

No. RSS is the protocol — the feed URL is the API endpoint. There is no central RSS service you authenticate against. Each feed is a plain HTTP endpoint returning XML. What that means practically: there is nothing stopping a server from blocking your IP, serving bot-detection pages, requiring specific User-Agent strings, rate-limiting aggressive pollers, or returning malformed XML that a naive parser chokes on. We absorb all of those.

What the data looks like

Thirteen fields per item, same shape for RSS and Atom. Here is a real Hacker News RSS row:

{
  "feed_url": "https://news.ycombinator.com/rss",
  "feed_title": "Hacker News",
  "feed_format": "rss",
  "item_id": "https://news.ycombinator.com/item?id=41234567",
  "title": "Show HN: I built a Rust compiler backend for WebAssembly",
  "link": "https://news.ycombinator.com/item?id=41234567",
  "author": null,
  "summary": "Comments",
  "content_html": null,
  "categories": [],
  "published": "2026-05-15T20:00:00+00:00",
  "updated": null,
  "scraped_at": "2026-05-15T21:03:47+00:00"
}

Every row is Pydantic-validated before it lands in the dataset. ISO-8601 timestamps, nullable fields typed as T | None, stable field names regardless of whether the source was RSS or Atom. It drops straight into a pandas.DataFrame, a BigQuery load job, or LangChain's RecursiveCharacterTextSplitter without column-name gymnastics.
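
If you want to re-check rows on your side, the thirteen-field shape can be sketched as follows. This is a standard-library stand-in, not the Actor's Pydantic model (models.py is not shown in this post). The FeedItem name, the timezone check, and "atom" as the second feed_format value are assumptions.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional

TIMESTAMP_FIELDS = ("published", "updated", "scraped_at")


@dataclass(frozen=True)
class FeedItem:
    """Stand-in for the row schema shown above (hypothetical class name)."""
    feed_url: str
    feed_title: Optional[str]
    feed_format: str
    item_id: str
    title: Optional[str]
    link: Optional[str]
    author: Optional[str]
    summary: Optional[str]
    content_html: Optional[str]
    categories: List[str]
    published: Optional[str]
    updated: Optional[str]
    scraped_at: str

    def __post_init__(self) -> None:
        # "rss" appears in the sample row; "atom" is an assumed second value.
        if self.feed_format not in ("rss", "atom"):
            raise ValueError(f"unexpected feed_format: {self.feed_format!r}")
        for name in TIMESTAMP_FIELDS:
            value = getattr(self, name)
            if value is None:
                continue
            # fromisoformat raises ValueError on anything that is not ISO-8601.
            if datetime.fromisoformat(value).tzinfo is None:
                raise ValueError(f"{name} must carry a UTC offset")


# The Hacker News row from above, written as a Python dict.
SAMPLE_ROW = {
    "feed_url": "https://news.ycombinator.com/rss",
    "feed_title": "Hacker News",
    "feed_format": "rss",
    "item_id": "https://news.ycombinator.com/item?id=41234567",
    "title": "Show HN: I built a Rust compiler backend for WebAssembly",
    "link": "https://news.ycombinator.com/item?id=41234567",
    "author": None,
    "summary": "Comments",
    "content_html": None,
    "categories": [],
    "published": "2026-05-15T20:00:00+00:00",
    "updated": None,
    "scraped_at": "2026-05-15T21:03:47+00:00",
}

row = FeedItem(**SAMPLE_ROW)
```

Because the constructor takes exactly the thirteen keys, a missing or extra field fails loudly with a TypeError instead of slipping into your pipeline.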

The naive approach (and why it falls apart) ⚠️

The first pass everyone tries:

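import feedparser

# Fetch and parse the feed in one call.
d = feedparser.parse("https://theregister.com/headlines.rss")
for entry in d.entries:
    # Not every entry carries a date, so use .get() rather than attribute access.
    print(entry.get("title"), entry.get("published"))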

This works on the feeds that behave, roughly 80-85% of feeds in the wild. The rest surface a long tail of problems we handle so you don't have to:

1. TLS and User-Agent gating. Some publishers — particularly Substack, newsletters with Cloudflare in front, and corporate press-release feeds — inspect the TLS fingerprint and the User-Agent string. Python's default SSL stack doesn't impersonate a real browser. We rotate through Chrome / Firefox / Safari TLS fingerprints via curl-cffi so the server sees a recognisable browser handshake, not an anonymous Python socket.

2. Proxy-gated feeds. A subset of feeds block datacenter IP ranges. We thread Apify residential proxies with sticky sessions — same exit IP for the entire run on a given feed, paced to avoid triggering rate limits.

3. Malformed XML. RSS has no mandatory validator. Production feeds ship broken entity references (& not encoded as &amp;), mixed namespace declarations, duplicate GUIDs, and timestamp strings that look like they were authored by three different engineers on three different continents. We surface the items we can extract and log the recoverable errors rather than failing the entire run.

4. Retries and backoff. On 408 / 429 / 5xx we retry with exponential backoff (2 s → 4 s → 8 s → … cap 30 s), honour Retry-After headers, and rotate to a fresh proxy session on repeated blocks. Up to 5 attempts per feed. Partial success surfaces with a clear status message — we never silently return an empty dataset.

5. Dialect normalisation. The feed_format field tells you what you got; the field names are the same either way. dc:creator maps to author, content:encoded maps to content_html, Atom id maps to item_id. Downstream code shouldn't need to branch on dialect.
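
The retry policy in point 4 can be sketched in a few lines. This is an illustration under stated assumptions, not the Actor's implementation: fetch is a hypothetical callable returning (status, headers, body), only the delta-seconds form of Retry-After is handled, the server's value is taken as-is when present, and proxy-session rotation is left out.

```python
import time
from typing import Callable, Mapping, Optional, Tuple

Response = Tuple[int, Mapping[str, str], bytes]


def is_retryable(status: int) -> bool:
    """408, 429 and any 5xx are worth another attempt."""
    return status in (408, 429) or 500 <= status <= 599


def delay_for(attempt: int, retry_after: Optional[str] = None,
              base: float = 2.0, cap: float = 30.0) -> float:
    """Seconds to wait after the given 1-based attempt: 2, 4, 8, ... capped at 30."""
    if retry_after is not None and retry_after.strip().isdigit():
        # Assumption: a numeric Retry-After wins over the computed delay.
        return float(retry_after.strip())
    return min(cap, base * 2 ** (attempt - 1))


def fetch_with_retries(fetch: Callable[[], Response], max_attempts: int = 5,
                       sleep: Callable[[float], None] = time.sleep) -> Response:
    """Call fetch up to max_attempts times and return the last response seen."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    response = fetch()
    for attempt in range(1, max_attempts):
        status, headers, _body = response
        if not is_retryable(status):
            break
        sleep(delay_for(attempt, headers.get("Retry-After")))
        response = fetch()
    return response
```

Returning the last response, even a failed one, lets the caller report a clear status instead of an empty result. With five attempts the waits are 2, 4, 8 and 16 seconds, so the 30-second cap only matters if you raise max_attempts.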

None of this is exciting. It's the exact infrastructure that disappears when you use a hosted Actor rather than maintaining a Python cron job that breaks every six weeks when a feed changes its charset declaration.

The Actor 🛠️

The result is an Apify Actor: RSS / Atom Feed Scraper.

Paste a list of feed URLs in the Apify Console and click Start, or drive it from Python:

from apify_client import ApifyClient

client = ApifyClient("YOUR_APIFY_TOKEN")

run = client.actor("DevilScrapes/rss-feed-scraper").call(
    run_input={
        "feedUrls": [
            "https://news.ycombinator.com/rss",
            "https://feeds.feedburner.com/TechCrunch",
            "https://www.theregister.com/headlines.rss",
        ],
        "maxItemsPerFeed": 100,
        "includeContent": True,
        "proxyConfiguration": {"useApifyProxy": True},
    }
)

for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item["feed_title"], "|", item["title"], "|", item["published"])

Input parameters from the schema:

Field	Type	Default	Notes
`feedUrls`	array	`["https://news.ycombinator.com/rss"]`	One URL per item. Multiple feeds in one run.
`maxItemsPerFeed`	integer	50	1–1,000 cap per feed.
`includeContent`	boolean	true	Pulls `content:encoded` / Atom content body when available.
`userAgent`	string	`DevilScrapesBot/1.0`	Override if a publisher whitelists specific bots.
`proxyConfiguration`	object	`{"useApifyProxy": false}`	Set to residential for gated feeds.
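
If you build the input programmatically, a small client-side check catches typos before a run starts. The sketch below mirrors the defaults and the 1 to 1,000 cap from the table; build_run_input is a hypothetical helper, and the http(s) URL check is our own assumption rather than something the schema states.

```python
import copy

# Defaults copied from the input table above.
DEFAULTS = {
    "feedUrls": ["https://news.ycombinator.com/rss"],
    "maxItemsPerFeed": 50,
    "includeContent": True,
    "userAgent": "DevilScrapesBot/1.0",
    "proxyConfiguration": {"useApifyProxy": False},
}


def build_run_input(**overrides) -> dict:
    """Merge overrides onto the documented defaults and sanity-check the result."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown input fields: {sorted(unknown)}")
    run_input = {**copy.deepcopy(DEFAULTS), **overrides}

    urls = run_input["feedUrls"]
    if not urls or not all(
        isinstance(u, str) and u.startswith(("http://", "https://")) for u in urls
    ):
        raise ValueError("feedUrls must be a non-empty list of http(s) URLs")

    max_items = run_input["maxItemsPerFeed"]
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(max_items, bool) or not isinstance(max_items, int):
        raise ValueError("maxItemsPerFeed must be an integer")
    if not 1 <= max_items <= 1000:
        raise ValueError("maxItemsPerFeed must be between 1 and 1,000")
    return run_input
```

The returned dict can be passed as run_input to the client call shown earlier. For gated feeds, a proxyConfiguration override along the lines of {"useApifyProxy": True, "apifyProxyGroups": ["RESIDENTIAL"]} follows Apify's usual proxy input shape, though we have not shown that this Actor's schema accepts it here, so treat it as an assumption.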

Use cases 💡

1. LLM news-digest pipeline. Aggregate 20-50 feeds — tech news, competitor blogs, industry publications — run them through the Actor daily, embed the summary + content_html into a vector store, and let a language model write the digest. The Actor gives you clean rows with ISO timestamps; the LLM handles the synthesis. The categories field helps filter by topic before embedding.

2. Brand monitoring. Google Alerts generates RSS. So do Mention, Talkwalker Alerts, and most social-listening platforms. Pipe those alert feeds into the Actor, store the results in a named Apify dataset, and webhook the results into Slack on ACTOR.RUN.SUCCEEDED. Cheaper than a dedicated monitoring SaaS for a three-person team.

3. Podcast metadata extraction. Podcast RSS is just RSS with <enclosure>. Each episode row has a link pointing to the media file. This Actor parses them cleanly — useful for building podcast directories, tracking episode metadata, or cross-referencing guest names from author fields.

4. n8n / Make / Zapier automation workflows. Feed the Actor into a no-code workflow. The n8n Apify node lets you chain "run Actor → filter items → post to Notion" in under 10 minutes. The structured output means no custom code node is needed to massage the data shape.

5. Content pipeline for translation or republication. Corporate comms teams often need to monitor upstream parent-company or wire-service RSS feeds and route content to translation pipelines. The content_html field gives translators the full body, not just the teaser.
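
For the first use case, the step between the dataset and the embedding call might look like the sketch below. It keeps rows from the last 24 hours, optionally filters on categories, and flattens the HTML body to plain text. digest_documents and strip_tags are hypothetical helpers, the regex tag stripper is deliberately crude, and falling back from content_html to summary is one possible choice. It assumes timestamps carry a UTC offset, as in the sample row.

```python
import html
import re
from datetime import datetime, timedelta, timezone
from typing import Iterable, List, Optional, Set


def strip_tags(html_text: str) -> str:
    """Crude HTML-to-text: drop tags, decode entities, collapse whitespace."""
    without_tags = re.sub(r"<[^>]+>", " ", html_text)
    return " ".join(html.unescape(without_tags).split())


def digest_documents(rows: Iterable[dict], now: datetime,
                     max_age: timedelta = timedelta(hours=24),
                     topics: Optional[Set[str]] = None) -> List[dict]:
    """Turn recent dataset rows into {id, text, source, published} documents."""
    cutoff = now - max_age
    docs = []
    for row in rows:
        published = row.get("published")
        # Skip undated items and anything older than the digest window.
        if published is None or datetime.fromisoformat(published) < cutoff:
            continue
        # Optional topic filter, case-insensitive, on the categories field.
        if topics:
            categories = {c.lower() for c in row.get("categories") or []}
            if not topics & categories:
                continue
        body = row.get("content_html") or row.get("summary") or ""
        title = row.get("title") or ""
        docs.append({
            "id": row["item_id"],
            "text": f"{title}\n\n{strip_tags(body)}".strip(),
            "source": row.get("feed_title"),
            "published": published,
        })
    return docs
```

Each returned text is what you would hand to your embedding function, with id as the vector-store key so that a re-run overwrites rather than duplicates.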

Pricing — exact numbers 💰

Pay-per-event. You pay when events fire; no subscription, no minimum.

Event	Price
Actor start (one-off per run)	$0.005
Result emitted (per dataset item)	$0.001

Cost examples:

Items	Cost
100 items	$0.11
1,000 items	$1.01
10,000 items	$10.01
50,000 items (daily digest across 500 feeds)	$50.01

Apify's $5 free trial credit — no credit card — covers your first ~4,900 items. For context: if you're pulling 50 items from each of 20 feeds daily, that's 1,000 items/day and $1.01/day in Actor cost.
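
The arithmetic behind those figures is simple enough to sketch. The two prices come from the event table above; run_cost and items_covered are hypothetical helper names, and Decimal is used so the half-cent start fee does not pick up floating-point noise.

```python
from decimal import Decimal

ACTOR_START = Decimal("0.005")  # charged once per run
PER_ITEM = Decimal("0.001")     # charged per dataset item


def run_cost(items: int, runs: int = 1) -> Decimal:
    """Total cost in USD for a number of items spread over a number of runs."""
    if items < 0 or runs < 1:
        raise ValueError("items must be >= 0 and runs >= 1")
    return ACTOR_START * runs + PER_ITEM * items


def items_covered(budget: Decimal, runs: int = 1) -> int:
    """How many items a budget pays for after the per-run start fees."""
    remaining = budget - ACTOR_START * runs
    return max(0, int(remaining / PER_ITEM))


print(run_cost(1000))                    # 1.005, shown as $1.01 in the table
print(items_covered(Decimal("5")))       # 4995 items if spent in a single run
print(items_covered(Decimal("5"), 20))   # 4900 items if spread over 20 runs
```

The table rounds each total up to the next cent. How far the trial credit stretches depends on how many runs you split it across, since every run pays the start fee again.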

The technically interesting part

RSS was designed to be fetched with a simple HTTP GET. What makes it hard in practice is not the format — it's the accumulation of real-world deviations from the spec across 25+ years of publishers. The feedparser Python library is genuinely excellent and handles most of them. Our Actor wraps it, but the durable contribution is the combination of: (a) browser-fingerprint rotation so bot-detection layers see a real browser, (b) residential proxy rotation for IP-blocked feeds, and (c) the Pydantic layer that validates and normalises the parsed output before it reaches your dataset. The feedparser output for a malformed feed can include None values in unexpected places, truncated content, or date fields that are time.struct_time objects rather than strings. We normalise all of that to ISO-8601 strings or null before the row is written.
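
The date step is the easiest one to show. feedparser exposes parsed dates (for example published_parsed) as UTC time.struct_time values, so a normaliser along these lines turns them into ISO-8601 strings or None. This is a sketch of the idea rather than the Actor's code: both function names are hypothetical, and the RFC 2822 fallback for raw strings is our own addition.

```python
import calendar
import time
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def struct_time_to_iso(value: Optional[time.struct_time]) -> Optional[str]:
    """Convert a UTC struct_time (as feedparser returns) to an ISO-8601 string."""
    if value is None:
        return None
    # calendar.timegm treats the tuple as UTC, unlike time.mktime (local time).
    seconds = calendar.timegm(value)
    return datetime.fromtimestamp(seconds, tz=timezone.utc).isoformat()


def rfc2822_to_iso(raw: Optional[str]) -> Optional[str]:
    """Fallback for raw pubDate strings; returns None when the string is unparseable."""
    if not raw:
        return None
    try:
        parsed = parsedate_to_datetime(raw)
    except (TypeError, ValueError):
        # Older Python versions raise TypeError, newer ones ValueError.
        return None
    if parsed.tzinfo is None:
        # Assumption: dates without a usable offset are treated as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc).isoformat()
```

Using calendar.timegm instead of time.mktime matters here: mktime would silently shift every timestamp by the machine's local offset.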

Limitations 🚧

We do not follow <link rel="next"> pagination in paginated feeds — pass each page URL explicitly in feedUrls.
We do not render JavaScript-emitted feeds. If a publisher's "RSS URL" is actually a client-side JS app, you need a browser-based Actor instead.
content_html is only populated when the feed publishes content:encoded (RSS) or <content> (Atom). Many feeds are summary-only by design; the full article body lives on the publisher's site.
maxItemsPerFeed caps at 1,000 per the input schema. Archive-depth pulls beyond that are not supported yet; startFrom pagination across multiple requests is planned for v2.
We surface recoverable parse errors as warnings, not failures. If a feed has 30 valid items and 2 malformed ones, you get 30 rows, not zero.
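
The last point, warnings instead of failures, comes down to handling errors per item rather than per feed. A minimal sketch of that pattern follows; to_row and extract_rows are hypothetical names, and the rule that an item needs a guid or a link is an assumption made for the example.

```python
from typing import Iterable, List, Tuple


def to_row(raw: dict) -> dict:
    """Hypothetical per-item converter: an item needs a guid or link to get an id."""
    item_id = raw.get("guid") or raw.get("link")
    if not item_id:
        raise ValueError("item has neither guid nor link")
    return {"item_id": item_id, "title": raw.get("title"), "link": raw.get("link")}


def extract_rows(raw_items: Iterable[dict]) -> Tuple[List[dict], List[str]]:
    """Convert what can be converted; collect a warning for everything else."""
    rows: List[dict] = []
    warnings: List[str] = []
    for index, raw in enumerate(raw_items):
        try:
            rows.append(to_row(raw))
        except ValueError as exc:
            # One bad item is recorded and skipped instead of failing the feed.
            warnings.append(f"item {index} skipped: {exc}")
    return rows, warnings
```

The caller gets both lists back, so a run can report "30 rows, 2 warnings" instead of choosing between silence and a hard failure.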

FAQ ❓

Is parsing RSS feeds legal?
RSS feeds are explicitly published for machine consumption — that is the entire point of the format. Publishers make feed URLs public and expect automated polling. This Actor fetches only what the feed URL exposes, at a pace polite enough not to hammer the server. Always review the terms of service for specific publishers if you intend large-scale or commercial use.

Can I export to Google Sheets or a database?
Yes — export CSV, JSON, or Excel from the Apify Console, or webhook the dataset on ACTOR.RUN.SUCCEEDED into Make / n8n / Zapier. The Apify API also supports direct dataset reads for warehouse ingestion.

Is there an official RSS-to-JSON API I should use instead?
No central official service exists. Services like RSS2JSON.com and the Google Feed API (now deprecated) tried this; most have rate limits, paywalls after a few hundred requests per day, or are simply gone. A hosted Actor gives you the same conversion with no per-feed caps, structured output, and the proxy/fingerprint handling for feeds that block public IPs.

Does this work on Atom feeds and podcast RSS?
Yes to both. feed_format in each output row tells you which dialect the source used. Podcast enclosure URLs appear in the link field for episode rows. The Actor handles RSS 2.0, Atom 1.0, and the content:encoded / dc:creator namespace extensions used by most modern publishing platforms.

Try it

The Actor is on the Apify Store: apify.com/DevilScrapes/rss-feed-scraper.

Free $5 trial credit, no credit card. Point it at https://news.ycombinator.com/rss and you'll have a structured dataset in under 30 seconds. If you're building an RSS-to-LLM pipeline and hit an edge case — a feed dialect we misparse, a field you need that isn't in the schema — drop it in the comments. The output schema is locked to what models.py validates; new fields are shipped on request.

Useful references: RSS 2.0 specification, Atom syndication format (RFC 4287), Apify Actor SDK docs.

Built by Devil Scrapes — Apify Actors with attitude. Pay-per-event, transparent pricing, no junk fields. 😈