DEV Community

SEN LLC

Parsing RSS and Atom From Scratch in Rust with quick-xml

A small axum service that parses RSS 2.0 and Atom 1.0 feeds into unified JSON, built on quick-xml without pulling in a feed framework. The point is that once you actually look at the formats, they're not that complicated — and writing the parser yourself saves a meaningful chunk of your dependency tree.

Every backend service that touches feeds ends up re-implementing the same glue. Different element names for the same concept. Dates that are half in RFC 822 and half in RFC 3339. Namespaced content:encoded living next to plain description. Atom's link-as-attribute vs RSS's link-as-text. And then you need to do something sensible when the producer left a field out, because half of them do.

The normal answer in Rust is to reach for rss or atom_syndication — well-maintained crates that handle the whole specification. But both pull in their own dependency trees, you end up with two parsers that don't share a type, and the combined surface area is considerably larger than what you actually need. I wanted one binary that handled both formats with one output shape and no feed-specific transitive deps.

So I wrote it. The whole parser is ~500 lines of straightforward event-stream code on top of quick-xml. It ships as an 11.1 MB Docker image from a multi-stage build (rust:1.90-alpine builder, alpine:3.20 runtime), exposing POST /parse, GET /parse?url=, POST /normalize, and GET /health via axum.

GitHub: github.com/sen-ltd/feed-parser


The two-format story

If you sit down and read the specs back to back, RSS 2.0 and Atom 1.0 are doing the same thing with different element names and one big difference in how they encode links. The concepts map almost one-to-one:

Concept RSS 2.0 Atom 1.0
Feed container <channel> <feed>
Feed title <title> <title>
Feed link <link>text</link> <link href="…" rel="alternate"/>
Feed subtitle <description> <subtitle>
Item container <item> <entry>
Item title <title> <title>
Item URL <link>text</link> <link href="…"/>
Item id <guid> <id>
Item summary <description> <summary>
Item body <content:encoded> <content>
Publish time <pubDate> (RFC 822) <published> (RFC 3339)
Author <author> or <dc:creator> <author><name/></author>

That's the whole concept map. There's basically no feature in one that lacks a near-equivalent in the other; the main divergences are RSS's text-content <category> (Atom uses an empty <category term="x"/> element instead) and Atom's multiple <link rel="…"/> elements where RSS has a single text link.
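
To make the table concrete, here's the same hypothetical post expressed in both formats (made-up URLs and dates). Note where the link lives in each:

```xml
<!-- RSS 2.0: link is element text, date is RFC 822 -->
<rss version="2.0">
  <channel>
    <title>Example Blog</title>
    <link>https://example.com/</link>
    <item>
      <title>Hello</title>
      <link>https://example.com/hello</link>
      <guid>https://example.com/hello</guid>
      <pubDate>Thu, 01 Jan 2026 12:00:00 GMT</pubDate>
    </item>
  </channel>
</rss>

<!-- Atom 1.0: link is an href attribute, date is RFC 3339 -->
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example Blog</title>
  <link href="https://example.com/" rel="alternate"/>
  <entry>
    <title>Hello</title>
    <link href="https://example.com/hello"/>
    <id>https://example.com/hello</id>
    <published>2026-01-01T12:00:00Z</published>
  </entry>
</feed>
```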

Given this, you have two reasonable architectures:

  1. A unified parser that ingests both formats into one shared AST, with element name aliasing baked in.
  2. Two sibling parsers with one shared Feed output struct, dispatched by format detection.

I chose option 2. The aliasing approach sounds elegant on paper but the element-name mapping gets enough special cases (Atom's <link> being an attribute-bearing empty element, RSS's <guid isPermaLink="…">, Atom published falling back to updated when absent) that writing them as two separate flat functions with a shared output struct turned out to be both shorter and easier to reason about.
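
For orientation, here's a plausible sketch of the shared output struct both parsers fill in. The field names follow the ones used in the excerpts below; the actual struct in the repo presumably also derives serde::Serialize for the JSON response:

```rust
// Shared output shape for both the RSS 2.0 and Atom 1.0 parsers.
// Field names are inferred from the parser excerpts in this post.
#[derive(Default, Debug)]
pub struct Feed {
    pub format: String,                   // e.g. "rss2" or "atom10"
    pub title: Option<String>,
    pub link: Option<String>,
    pub description: Option<String>,      // RSS <description> / Atom <subtitle>
    pub items: Vec<Item>,
    pub item_count: usize,
    pub normalization_notes: Vec<String>, // non-fatal issues, e.g. bad dates
}

#[derive(Default, Debug)]
pub struct Item {
    pub title: Option<String>,
    pub link: Option<String>,
    pub guid: Option<String>,
    pub summary: Option<String>,          // RSS <description> / Atom <summary>
    pub content: Option<String>,          // <content:encoded> / Atom <content>
    pub author: Option<String>,
    pub categories: Vec<String>,
    pub pub_date_rfc3339: Option<String>,
}
```

Because every field the producer might omit is an Option or an empty Vec, "half the fields are missing" is the default case rather than an error path.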

Format detection

Detection is a tiny state machine over quick-xml's event stream. We only care about the first non-metadata element: if it's <rss version="2.0"> we emit Rss2, if it's <feed xmlns=".../Atom"> we emit Atom10, otherwise we return a clean bad_format error.

pub fn detect(xml: &[u8]) -> Result<Format, String> {
    let mut reader = Reader::from_reader(xml);
    reader.config_mut().trim_text(true);

    let mut buf = Vec::new();
    loop {
        match reader.read_event_into(&mut buf) {
            Ok(Event::Start(e)) | Ok(Event::Empty(e)) => {
                let local = std::str::from_utf8(e.name().local_name().as_ref())
                    .unwrap_or("").to_ascii_lowercase();

                match local.as_str() {
                    "rss" => {
                        let version = e.attributes().flatten()
                            .find(|a| a.key.as_ref() == b"version")
                            .and_then(|a| String::from_utf8(a.value.into_owned()).ok())
                            .unwrap_or_default();
                        if version == "2.0" { return Ok(Format::Rss2); }
                        if version.starts_with("0.9") { return Ok(Format::Rss09x); }
                        return Ok(Format::Rss2);
                    }
                    "feed" => {
                        let has_atom_ns = e.attributes().flatten().any(|a| {
                            let k = a.key.as_ref();
                            (k == b"xmlns" || k.starts_with(b"xmlns:"))
                                && a.value.as_ref().windows(4)
                                    .any(|w| w.eq_ignore_ascii_case(b"atom"))
                        });
                        if has_atom_ns { return Ok(Format::Atom10); }
                        return Err("root <feed> without Atom namespace".into());
                    }
                    _ => return Err(format!("unrecognized feed format: <{}>", local)),
                }
            }
            Ok(Event::Eof) => return Err("unrecognized feed format: empty document".into()),
            Ok(_) => {}
            Err(e) => return Err(format!("xml parse error: {}", e)),
        }
        buf.clear();
    }
}

A few things worth noting. First, local_name() strips namespace prefixes, so <atom:feed> and <feed xmlns="…/atom"> both match "feed". Second, we check for Atom's namespace rather than accepting any <feed> root element, because <feed> is a generic word that shows up in other XML schemas and we'd rather fail loudly than parse garbage. Third, when we see an unfamiliar <rss version> (0.91, 0.92, or something completely unexpected), we fall back to the RSS 2.0 parser, because the shape is identical in practice.

Walking the event stream

Both parsers share the same pattern: maintain a path: Vec<String> stack, accumulate text into current_text when we see Event::Text or Event::CData, and on Event::End assign the accumulated text to whichever field matches the element name. Here's the RSS item loop, trimmed for clarity:

pub fn parse(xml: &[u8], format_label: &str) -> Result<Feed, String> {
    let mut reader = Reader::from_reader(xml);
    reader.config_mut().trim_text(true);
    let mut feed = Feed { format: format_label.into(), ..Default::default() };

    let mut path: Vec<String> = Vec::new();
    let mut current_text = String::new();
    let mut current_item: Option<Item> = None;
    let mut content_encoded: Option<String> = None;

    let mut buf = Vec::new();
    loop {
        match reader.read_event_into(&mut buf) {
            Ok(Event::Start(e)) => {
                let name = local(e.name());
                path.push(name.clone());
                current_text.clear();
                if name == "item" && path.iter().any(|p| p == "channel") {
                    current_item = Some(Item::default());
                    content_encoded = None;
                }
            }
            Ok(Event::End(e)) => {
                let name = local(e.name());
                if let Some(item) = current_item.as_mut() {
                    match name.as_str() {
                        "title"       => set_if_empty(&mut item.title, &current_text),
                        "link"        => set_if_empty(&mut item.link, &current_text),
                        "description" => set_if_empty(&mut item.summary, &current_text),
                        "pubDate"     => set_if_empty(&mut item.pub_date_rfc3339, &current_text),
                        "guid"        => set_if_empty(&mut item.guid, &current_text),
                        "author"      => set_if_empty(&mut item.author, &current_text),
                        "creator"     => if item.author.is_none() {
                            set_if_empty(&mut item.author, &current_text);
                        },
                        "category" => {
                            let c = current_text.trim();
                            if !c.is_empty() { item.categories.push(c.to_string()); }
                        }
                        "encoded" => {
                            if !current_text.is_empty() {
                                content_encoded = Some(current_text.clone());
                            }
                        }
                        _ => {}
                    }
                } else if path.len() >= 2 && path[path.len() - 2] == "channel" {
                    match name.as_str() {
                        "title"       => set_if_empty(&mut feed.title, &current_text),
                        "link"        => set_if_empty(&mut feed.link, &current_text),
                        "description" => set_if_empty(&mut feed.description, &current_text),
                        _ => {}
                    }
                }
                if name == "item" {
                    if let Some(mut item) = current_item.take() {
                        if let Some(c) = content_encoded.take() {
                            item.content = Some(c);
                        }
                        feed.items.push(item);
                    }
                }
                path.pop();
                current_text.clear();
            }
            Ok(Event::Text(t)) => {
                if let Ok(s) = t.unescape() { current_text.push_str(&s); }
            }
            Ok(Event::CData(t)) => {
                if let Ok(s) = std::str::from_utf8(t.as_ref()) {
                    current_text.push_str(s);
                }
            }
            Ok(Event::Eof) => break,
            Ok(_) => {}
            Err(e) => return Err(format!("xml parse error: {}", e)),
        }
        buf.clear();
    }
    feed.item_count = feed.items.len();
    Ok(feed)
}

A couple of details that are easy to get wrong:

  • CDATA is raw text. Event::Text needs .unescape() to resolve &amp; and &lt;. Event::CData does not — the bytes are delivered verbatim and calling unescape() on them would double-process. Feeds use CDATA heavily for HTML bodies, so getting this right matters.
  • content:encoded vs description. The namespace content module is where "real" post bodies live in modern RSS feeds. description is meant to be a summary. We assign them to content and summary respectively. If a feed only has description, content stays null — we never fabricate it by copying summary into content.
  • set_if_empty. This is a small helper that refuses to overwrite an already-set Option<String>. It means duplicate elements (which are rare but happen) go to the first occurrence, matching reader-app behavior.
  • dc:creator as author fallback. local_name() strips the dc: prefix, so we match on "creator" and only assign it if author is still empty.
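
The set_if_empty helper itself isn't shown above; a plausible implementation (my assumption, not copied from the repo) is just:

```rust
// Assign only when the slot is still empty and the new text is non-blank.
// First occurrence wins, matching the duplicate-element behavior
// described in the bullet above.
fn set_if_empty(slot: &mut Option<String>, text: &str) {
    let t = text.trim();
    if slot.is_none() && !t.is_empty() {
        *slot = Some(t.to_string());
    }
}
```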

The Atom parser follows the same shape but handles links differently, because Atom encodes them as attribute-bearing empty elements:

fn handle_link_attrs(e: &BytesStart, feed: &mut Feed, current_item: &mut Option<Item>, path: &[String]) {
    let mut href = None;
    let mut rel = None;
    for attr in e.attributes().flatten() {
        match attr.key.as_ref() {
            b"href" => href = String::from_utf8(attr.value.into_owned()).ok(),
            b"rel"  => rel  = String::from_utf8(attr.value.into_owned()).ok(),
            _ => {}
        }
    }
    // Atom `rel` defaults to "alternate" when absent.
    if rel.as_deref().unwrap_or("alternate") != "alternate" { return; }
    let Some(href) = href else { return };
    if let Some(item) = current_item.as_mut() {
        if item.link.is_none() { item.link = Some(href); }
    } else if path.len() >= 2 && path[path.len() - 2] == "feed" && feed.link.is_none() {
        feed.link = Some(href);
    }
}

Atom feeds emit both <link href="…" rel="alternate"/> (the human-readable permalink) and <link href="…" rel="self"/> (the feed's own URL). You almost never want self — it points to the feed file, not the article — so we explicitly filter on rel="alternate" or missing rel (which defaults to alternate per RFC 4287).
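
Stripped of the XML plumbing, the selection rule is "first link whose rel is alternate or absent". As an illustrative helper (not from the repo):

```rust
// Given (href, rel) pairs in document order, pick the first permalink.
// A missing rel defaults to "alternate" per RFC 4287.
fn pick_permalink<'a>(links: &[(&'a str, Option<&str>)]) -> Option<&'a str> {
    links
        .iter()
        .find(|(_, rel)| rel.unwrap_or("alternate") == "alternate")
        .map(|(href, _)| *href)
}
```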

Date normalization

This is the bit that reliably surprises people. RSS 2.0 uses RFC 822 dates (Thu, 01 Jan 2026 12:00:00 GMT), Atom 1.0 uses RFC 3339 (2026-01-01T12:00:00Z), and you want a single output shape. chrono handles both:

pub fn normalize_date(raw: &str) -> Option<String> {
    let trimmed = raw.trim();
    if trimmed.is_empty() { return None; }
    if let Ok(dt) = DateTime::parse_from_rfc3339(trimmed) {
        return Some(dt.to_rfc3339_opts(SecondsFormat::Secs, true));
    }
    if let Ok(dt) = DateTime::parse_from_rfc2822(trimmed) {
        return Some(dt.to_rfc3339_opts(SecondsFormat::Secs, true));
    }
    None
}

Two things worth knowing. First, chrono::DateTime::parse_from_rfc2822 accepts the informal variants real feeds emit (single-digit day, GMT instead of +0000, weekday-less forms), so you don't need to pre-clean the input. Second, to_rfc3339_opts(Secs, true) emits Z for UTC offsets and +09:00 for others — the true parameter means "use Z when possible". We picked that form because every JSON consumer understands it and the Z form is shorter than +00:00.
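
To make the conversion tangible, here's a deliberately simplified, dependency-free sketch of the RFC 822 half. The service just calls chrono; this version ignores the obsolete named zones (EST, PDT, …) and missing-seconds forms that chrono handles for you:

```rust
// Simplified RFC 822 -> RFC 3339 conversion, for illustration only.
// Handles inputs like "Thu, 01 Jan 2026 12:00:00 GMT"; the real code
// delegates to chrono::DateTime::parse_from_rfc2822.
fn rfc822_to_rfc3339(raw: &str) -> Option<String> {
    // Drop the optional leading weekday ("Thu, ").
    let s = raw.trim();
    let s = s.split_once(',').map(|(_, r)| r.trim()).unwrap_or(s);
    let [day, mon, year, time, zone] = s.split_whitespace().collect::<Vec<_>>()[..]
        else { return None; };
    let day: u32 = day.parse().ok()?;
    let year: i32 = year.parse().ok()?;
    let mon = 1 + ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
                   "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
        .iter()
        .position(|m| m.eq_ignore_ascii_case(mon))? as u32;
    let offset = match zone {
        "GMT" | "UT" | "UTC" | "Z" | "+0000" | "-0000" => "Z".to_string(),
        z if z.len() == 5 && (z.starts_with('+') || z.starts_with('-')) =>
            format!("{}:{}", &z[..3], &z[3..]), // "+0900" -> "+09:00"
        _ => return None, // named zones omitted in this sketch
    };
    Some(format!("{year:04}-{mon:02}-{day:02}T{time}{offset}"))
}
```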

When a date fails both parsers, we emit null and push an explanation onto feed.normalization_notes:

pub fn normalize_feed(feed: &mut Feed) {
    for (i, item) in feed.items.iter_mut().enumerate() {
        if let Some(raw) = item.pub_date_rfc3339.take() {
            match normalize_date(&raw) {
                Some(iso) => item.pub_date_rfc3339 = Some(iso),
                None => feed.normalization_notes
                    .push(format!("item[{}]: could not parse date '{}'", i, raw)),
            }
        }
    }
}

Consumers get a clean signal — the date is null, the note explains why — instead of a silently-dropped field. That's the kind of small courtesy a general-purpose feed framework can't easily offer because its author doesn't know what your callers want to do with unparseable input.

Tradeoffs and non-goals

Writing a parser yourself means being honest about what you didn't build:

  • No RSS 1.0 / RDF. RSS 1.0 used an RDF/XML syntax that's structurally different from 2.0 and would need its own parser module. Real-world 1.0 feeds are extremely rare now — the format peaked around 2003 — so the parser rejects them at format detection with a clean 422.
  • No Atom 0.3. Deprecated in 2005 and essentially extinct.
  • No podcast or iTunes extensions. <itunes:duration>, <itunes:explicit>, enclosures, GUIDs-as-permalinks-for-episodes — all out. Add them as a consumer of this service, not inside it.
  • No JSON Feed. Different format, different job. If you want it, either add a second endpoint or a second service.
  • No custom error messages on each item. We report unparseable dates in a feed-level normalization_notes array, not per-item. That's enough for a reader to surface "1 item had a bad date" in a UI and good enough for ops to grep logs.
  • XXE is a non-issue. quick-xml doesn't resolve external entities or DOCTYPE declarations at all. There's no parser-level switch to flip. You cannot make the parser fetch a URL or read a local file by crafting a malicious feed, because the parser does not know how to do those things. This is one of the few security properties in the entire XML ecosystem that's genuinely "safe by default".
  • reqwest with rustls, not OpenSSL. reqwest's default-features = false, features = ["rustls-tls"] combination means the final binary statically links rustls instead of dynamically linking OpenSSL, which is exactly what you want on alpine. The resulting Docker image is 11.1 MB — chunkier than a pure-parse-only service would be, but still well under the 30 MB target and considerably smaller than the same service built with the OpenSSL variant.
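
The dependency line in question presumably looks something like this in Cargo.toml (the version number here is illustrative):

```toml
# Statically linked rustls instead of dynamically linked OpenSSL;
# this is what keeps the Alpine image small and self-contained.
reqwest = { version = "0.12", default-features = false, features = ["rustls-tls"] }
```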

Try it in 30 seconds

git clone https://github.com/sen-ltd/feed-parser
cd feed-parser
docker build -t feed-parser .
docker run --rm -p 8000:8000 feed-parser &

# Parse an RSS 2.0 feed:
curl -sS -X POST http://localhost:8000/parse \
  -H 'Content-Type: application/xml' \
  --data-binary @tests/fixtures/sample-rss2.xml | jq

# Parse an Atom feed — same output shape:
curl -sS -X POST http://localhost:8000/parse \
  -H 'Content-Type: application/atom+xml' \
  --data-binary @tests/fixtures/sample-atom.xml | jq

# Fetch and parse a remote feed:
curl -sS 'http://localhost:8000/parse?url=https://blog.rust-lang.org/feed.xml' | jq
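
The response shape, illustratively (based on the field names in the parser excerpts above; the exact JSON the service emits may differ):

```json
{
  "format": "rss2",
  "title": "Example Blog",
  "link": "https://example.com/",
  "description": "An example feed",
  "item_count": 1,
  "items": [
    {
      "title": "Hello",
      "link": "https://example.com/hello",
      "guid": "https://example.com/hello",
      "summary": "A short summary.",
      "content": null,
      "author": null,
      "categories": [],
      "pub_date_rfc3339": "2026-01-01T12:00:00Z"
    }
  ],
  "normalization_notes": []
}
```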

Takeaways

Feed parsing has a reputation for being painful because the specs are old, there are five of them, and most tutorials wave their hands at the format detection problem. But if you scope yourself to the two formats that actually matter — RSS 2.0 and Atom 1.0 — and lean on quick-xml's event stream, the whole thing collapses to about 500 lines and no feed-specific dependencies.

The real lesson for me wasn't "parsers are easy". It was "once you look at the actual shape of the data instead of the fifteen variants the specs permit, the problem is much smaller than the ecosystem suggests". That generalizes. A lot of "use a library for this" decisions in backend code are defending against hypothetical edge cases rather than describing the real input distribution. Feed parsing is a good example of a domain where the real input is narrow, the corner cases the specs worry about don't happen, and writing 500 lines yourself is cheaper than dragging a framework along for the rest of the project's life.

All 38 tests pass. Image is 11.1 MB. Total Rust dependencies: 8 direct.

Repo: github.com/sen-ltd/feed-parser
