
Scofield


I built a daily AI-news pipeline that scores stories with an LLM (RSS → AI → human-gated publish)

"Agentic AI" moves fast enough that keeping up is a part-time job, and most "AI news" is press-release noise. I wanted a builder-focused feed — every story scored on whether it actually matters to people who ship software — and I wanted it to update itself.

So I built a small pipeline: fetch a handful of RSS/Atom feeds daily, filter, dedup, score each story with one structured LLM call, and stage everything as a draft for a human to approve. No "fully autonomous content farm" — a human still presses publish.

Here's the whole thing, including the parts that bit me. It's a pattern you can reuse for any niche news/curation feed.

The shape

daily cron
  → fetch N RSS/Atom feeds (parallel)
  → parse → keyword filter → freshness cutoff
  → dedup (by normalized URL, in-batch + vs DB)
  → for each new story: one LLM call → {summary, why-it-matters, category, 5 scores}
  → drop low scores → insert as `draft`
  → human reviews drafts → flips status to `published`
  → pages read `published` from the DB (ISR-cached)

Five stages do the work; one human owns the trust boundary. Let's go through them.

1. Fetching: you don't need an RSS library (but watch CDATA)

Most quality sources still expose RSS or Atom. You can parse the well-formed ones with a tiny, dependency-free function instead of pulling in a library:

export function parseFeed(xml: string): RawFeedEntry[] {
  const items = xml.match(/<item\b[\s\S]*?<\/item>/gi) ?? []      // RSS
  const entries = xml.match(/<entry\b[\s\S]*?<\/entry>/gi) ?? []  // Atom
  const isAtom = items.length === 0 && entries.length > 0
  const blocks = isAtom ? entries : items

  return blocks
    .map((b) => ({
      title: stripTags(pick(b, "title") ?? ""),
      link: isAtom ? atomLink(b) : decode((pick(b, "link") ?? "").trim()),
      publishedAt: toISO(pick(b, "pubDate") ?? pick(b, "published") ?? pick(b, "updated")),
      summary: stripTags(pick(b, "description") ?? pick(b, "summary") ?? "").slice(0, 1200),
    }))
    .filter((e) => e.title && e.link)
}
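parseFeed leans on four helpers the post doesn't show: pick, atomLink, decode, and toISO. Their real implementations may differ. What follows is a minimal sketch of what they could look like, assuming regex-based extraction and a small hard-coded entity table; every function body here is an assumption for illustration.

```typescript
// Hypothetical helper sketches. The names come from parseFeed above;
// the bodies are assumptions, not the author's actual code.

// Return the inner text of the first <tag>...</tag> in a block, or null.
// The tag name is interpolated into a regex, so pass plain names only.
function pick(block: string, tag: string): string | null {
  const m = block.match(new RegExp(`<${tag}\\b[^>]*>([\\s\\S]*?)</${tag}>`, "i"))
  return m?.[1] ?? null
}

const ENTITIES: Record<string, string> = { amp: "&", lt: "<", gt: ">", quot: '"', apos: "'" }

// Decode numeric entities and the five predefined XML entities in one pass,
// so "&amp;lt;" becomes "&lt;" rather than "<". Unknown entities are left as-is.
function decode(s: string): string {
  return s.replace(/&(#x[0-9a-f]+|#\d+|[a-z]+);/gi, (whole: string, body: string) => {
    if (body[0] === "#") {
      const isHex = body[1] === "x" || body[1] === "X"
      const code = isHex ? parseInt(body.slice(2), 16) : parseInt(body.slice(1), 10)
      return Number.isFinite(code) && code <= 0x10ffff ? String.fromCodePoint(code) : whole
    }
    return ENTITIES[body.toLowerCase()] ?? whole
  })
}

// Atom puts the URL in <link href="...">. Prefer rel="alternate"
// (also the default when rel is absent), else fall back to the first link.
function atomLink(block: string): string {
  const links = block.match(/<link\b[^>]*>/gi) ?? []
  const attr = (tag: string, name: string): string | null => {
    const m = tag.match(new RegExp(`\\b${name}\\s*=\\s*["']([^"']*)["']`, "i"))
    return m?.[1] ?? null
  }
  const preferred =
    links.find((l) => (attr(l, "rel") ?? "alternate") === "alternate") ?? links[0]
  return preferred ? decode(attr(preferred, "href") ?? "") : ""
}

// Normalize a feed date to ISO 8601, or null if it is missing or unparseable.
function toISO(raw: string | null): string | null {
  if (!raw) return null
  const t = Date.parse(raw.trim())
  return Number.isNaN(t) ? null : new Date(t).toISOString()
}
```

Returning null from toISO for unparseable dates matters later: the freshness check treats a missing date as stale, so a feed with odd date formats gets dropped instead of flooding the queue.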

The gotcha that cost me an hour: a naive stripTags that runs replace(/<[^>]+>/g, " ") first will eat CDATA-wrapped titles whole. A title like <![CDATA[OpenAI ships something]]> looks like one big tag, so the regex deletes the entire thing and you get an empty title, which your .filter() then silently drops. Some feeds (Google, Hugging Face) don't wrap titles in CDATA, so they sneak through; others (OpenAI) wrap everything, and you get zero entries from a feed with 1,000 items.

Fix: unwrap CDATA before stripping tags.

function stripTags(input: string): string {
  const unwrapped = input.replace(/<!\[CDATA\[([\s\S]*?)\]\]>/g, "$1") // CDATA first!
  return decode(unwrapped.replace(/<[^>]+>/g, " ")).replace(/\s+/g, " ").trim()
}

Lesson: when a hand-rolled parser returns nothing from a 200-OK feed, suspect CDATA and entity decoding before you suspect your block-matching regex.
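The fetch step itself isn't shown in the post. One way to fetch N feeds in parallel without letting a slow or broken source sink the whole run is Promise.all over per-feed try/catch, with a timeout on each request. This is a sketch under assumptions: fetchAllFeeds, FeedResult, withTimeout, the injected fetchText parameter, and the 10-second default are hypothetical names and values, not the author's code.

```typescript
// Hypothetical sketch: fetch several feeds in parallel and tolerate failures.
type FeedResult =
  | { url: string; ok: true; xml: string }
  | { url: string; ok: false; error: string }

// Reject if `p` has not settled within `ms` milliseconds.
// Note: this stops waiting, it does not abort the underlying request.
function withTimeout<T>(p: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error(`timed out after ${ms}ms`)), ms)
    p.then(
      (value) => { clearTimeout(timer); resolve(value) },
      (err) => { clearTimeout(timer); reject(err) },
    )
  })
}

// `fetchText` is injected so the function can be exercised without a network.
// In production it could be: (url) => fetch(url).then((r) => r.text())
async function fetchAllFeeds(
  urls: string[],
  fetchText: (url: string) => Promise<string>,
  timeoutMs = 10_000,
): Promise<FeedResult[]> {
  return Promise.all(
    urls.map(async (url): Promise<FeedResult> => {
      try {
        // Each feed resolves to a result object, so one failure
        // never rejects the surrounding Promise.all.
        return { url, ok: true, xml: await withTimeout(fetchText(url), timeoutMs) }
      } catch (err) {
        return { url, ok: false, error: String(err) }
      }
    }),
  )
}
```

Results come back in the same order as urls, so a failing source is easy to log by name while the remaining feeds go on to parseFeed.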

2. Filtering: keyword match + a freshness cutoff you'll forget at your peril

Keyword filtering is the easy half — keep anything that mentions your topic vocabulary:

const KEYWORDS = ["agentic", "ai agent", "multi-agent", "mcp", "agent framework",
  "tool calling", "computer use", "coding agent", /* ... */]

const matched = (text: string) =>
  KEYWORDS.filter((k) => text.toLowerCase().includes(k))

The half people skip is a freshness cutoff, and it matters more than it sounds. OpenAI's feed returns its entire archive — ~1,000 items going back years. Your dedup only knows about what's already in your DB, so on every run those old posts look "new." Without a cutoff you'd slowly ingest years of stale news, a few per day, forever.

const MAX_AGE_DAYS = 21
const isFresh = (iso: string | null) =>
  !!iso && Date.now() - new Date(iso).getTime() <= MAX_AGE_DAYS * 864e5

A three-line predicate saves you from a feed that quietly backfills history into "latest news."
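Put together, the two filters form a single pass over the parsed entries. A sketch, assuming the entry shape returned by parseFeed; filterCandidates, the shortened keyword list, DAY_MS, and the injectable now parameter are illustrative additions, not the author's code.

```typescript
// Illustrative sketch combining the keyword filter and the freshness cutoff.
type RawFeedEntry = { title: string; link: string; publishedAt: string | null; summary: string }

const KEYWORDS = ["agentic", "ai agent", "multi-agent", "mcp", "tool calling"] // shortened list
const MAX_AGE_DAYS = 21
const DAY_MS = 864e5 // milliseconds per day

// Keywords found in the text, compared case-insensitively.
function matched(text: string): string[] {
  const lower = text.toLowerCase()
  return KEYWORDS.filter((k) => lower.includes(k))
}

// Fresh means: has a parseable date no older than MAX_AGE_DAYS.
// Entries with a missing or unparseable date are treated as stale.
function isFresh(iso: string | null, now: number = Date.now()): boolean {
  if (!iso) return false
  const age = now - new Date(iso).getTime() // NaN for unparseable dates
  return age <= MAX_AGE_DAYS * DAY_MS // any comparison with NaN is false
}

function filterCandidates(entries: RawFeedEntry[], now: number = Date.now()) {
  return entries
    .filter((e) => isFresh(e.publishedAt, now)) // cheap date check first
    .map((e) => ({ ...e, keywords: matched(`${e.title} ${e.summary}`) }))
    .filter((e) => e.keywords.length > 0)
}
```

Passing now as a parameter is only there to make the cutoff testable with fixed dates; in the daily run the default Date.now() applies.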

3. Dedup: key on the URL, not the title

Run this daily and you'll re-see the same story constantly. The instinct is to hash url + title. Don't — sources edit titles after publishing, and a changed title with the same URL would slip past a url+title key and create a duplicate.

Key on the normalized URL alone:

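function dedupKey(url: string): string {
  const u = new URL(url) // throws on malformed URLs, so callers should catch
  u.hash = ""            // ignore #fragments
  // Drop tracking params only; other params may carry the article id
  for (const k of [...u.searchParams.keys()])
    if (/^utm_|^ref$|^fbclid$/i.test(k)) u.searchParams.delete(k)
  // Strip trailing slashes and lowercase so cosmetic variants share one key
  return `${u.origin}${u.pathname.replace(/\/+$/, "")}${u.search}`.toLowerCase()
}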

Then enforce it in two places: filter against keys already in the DB before you spend money on an LLM call, and put a unique index on the column as the race-safe backstop (insert ... onConflictDoNothing). The article URL is naturally unique; titles are not.
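The pre-LLM half of that check is a set lookup. A sketch, assuming the known keys have already been loaded from the DB into a Set; filterUnseen and knownKeys are hypothetical names, and dedupKey is repeated from above so the block stands alone.

```typescript
// dedupKey as defined above, repeated so this block is self-contained.
function dedupKey(url: string): string {
  const u = new URL(url)
  u.hash = ""
  for (const k of [...u.searchParams.keys()])
    if (/^utm_|^ref$|^fbclid$/i.test(k)) u.searchParams.delete(k)
  return `${u.origin}${u.pathname.replace(/\/+$/, "")}${u.search}`.toLowerCase()
}

// Hypothetical sketch: keep only entries whose key is neither already in
// the DB (knownKeys) nor seen earlier in the same batch.
function filterUnseen<T extends { link: string }>(
  entries: T[],
  knownKeys: ReadonlySet<string>,
): T[] {
  const seen = new Set(knownKeys) // copy, so the caller's set is not mutated
  const unseen: T[] = []
  for (const entry of entries) {
    let key: string
    try {
      key = dedupKey(entry.link)
    } catch {
      continue // new URL() throws on malformed links; skip them
    }
    if (seen.has(key)) continue
    seen.add(key) // also catches duplicates across feeds in this run
    unseen.push(entry)
  }
  return unseen
}
```

The unique index stays in place as the backstop. This filter only avoids paying for LLM calls on stories that are already stored or that two feeds both carried today.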

4. Scoring: one structured LLM call per story

This is where the "builder-focused" part lives. For each new candidate, one structured call rewrites the story and scores it on five axes. Using the Vercel AI SDK, a string model id routes through the AI Gateway, and generateObject forces the model to return validated JSON:

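import { generateObject } from "ai"
import { z } from "zod"

// CATEGORIES (a non-empty tuple of category names) and candidate (the story
// text passed to the prompt) are defined elsewhere in the pipeline.
const Enriched = z.object({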
  title: z.string(),          // cleaned, PR superlatives removed
  summary: z.string(),        // rewritten in its own words, not copied
  builderInsight: z.string(), // "why this matters if you ship software"
  category: z.enum(CATEGORIES),
  tags: z.array(z.string()),
  scores: z.object({
    relevance: z.number(), builder: z.number(), quality: z.number(),
    freshness: z.number(), fit: z.number(),     // each 0–5
  }),
})

const { object } = await generateObject({
  model: "anthropic/claude-haiku-4.5", // cheap + fast is plenty here
  schema: Enriched,
  prompt: `Rewrite and score this for a builder audience. Be strict:
if it's only tangentially about AI agents, score relevance low.
Write the summary in your OWN words.\n\n${candidate}`,
})

// Five axes, each 0–5: the sum is 0–25, so × 4 maps it to 0–100
const importance = Object.values(object.scores).reduce((a, b) => a + b, 0) * 4 // → 0–100

A cheap model (Haiku-class / 4o-mini-class) is the right call — summarizing and scoring news doesn't need a frontier model, and you're doing this in bulk. Two things I learned the hard way:

  • Loosen the schema, clamp in code. If you constrain scores to 0–5 and tags to max(5) in the schema, the model occasionally returns a 6 or a sixth tag and the whole call hard-fails with "response did not match schema" — and you lose the story. Accept a plain number/array, then clamp(0,5) and slice(0,5) yourself. Validation belongs at the edges, not as a tripwire.
  • Retry transient failures. Rate limits, the occasional schema miss, timeouts — wrap the call in a 3-try backoff. Free-tier AI gateways throttle bursts aggressively (I hit it after ~5 calls), so cap how many you enrich per run and let the rest roll to tomorrow.
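The first bullet can be sketched as a small normalization step that runs on the model's output before anything is stored. normalize, RawEnriched, MAX_TAGS, and the choice to map non-finite values to the lower bound are assumptions for illustration; the × 4 scaling matches the importance formula above.

```typescript
// Hypothetical sketch of "loosen the schema, clamp in code".
type Scores = { relevance: number; builder: number; quality: number; freshness: number; fit: number }
type RawEnriched = { tags: string[]; scores: Scores }

const MAX_TAGS = 5

// Force a number into [lo, hi]. Non-finite values (NaN, Infinity) fall back to lo.
function clamp(n: number, lo: number, hi: number): number {
  return Number.isFinite(n) ? Math.min(hi, Math.max(lo, n)) : lo
}

function normalize(raw: RawEnriched) {
  const scores: Scores = {
    relevance: clamp(raw.scores.relevance, 0, 5),
    builder: clamp(raw.scores.builder, 0, 5),
    quality: clamp(raw.scores.quality, 0, 5),
    freshness: clamp(raw.scores.freshness, 0, 5),
    fit: clamp(raw.scores.fit, 0, 5),
  }
  const total =
    scores.relevance + scores.builder + scores.quality + scores.freshness + scores.fit
  return {
    tags: raw.tags.slice(0, MAX_TAGS), // a sixth tag is trimmed, not fatal
    scores,
    importance: total * 4, // 5 axes × max 5 × 4 = 100
  }
}
```

With this in place, a model that returns a 6 costs one clamped score instead of the whole story.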

Because failed stories never get inserted, their URL stays "unseen" and they're retried on the next run automatically. Nothing is lost; the work just spreads across days.
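A sketch of the retry bullet plus the per-run cap, assuming exponential backoff and a generic enrich function standing in for the LLM call. withRetry, enrichBatch, MAX_PER_RUN, and the delay values are hypothetical; the post only specifies three tries and a cap.

```typescript
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms))

// Try `fn` up to `tries` times, waiting baseDelayMs, then 2x, then 4x between attempts.
// This retries every error; a stricter version might retry only rate limits and timeouts.
async function withRetry<T>(fn: () => Promise<T>, tries = 3, baseDelayMs = 1000): Promise<T> {
  let lastError: unknown
  for (let attempt = 0; attempt < tries; attempt++) {
    try {
      return await fn()
    } catch (err) {
      lastError = err
      if (attempt < tries - 1) await sleep(baseDelayMs * 2 ** attempt)
    }
  }
  throw lastError
}

const MAX_PER_RUN = 5 // hypothetical cap; tune to your gateway's limits

// Enrich at most `cap` candidates, one at a time so bursts stay small.
// A story that still fails after all retries is skipped, not stored,
// so the next run sees its URL as unseen and tries again.
async function enrichBatch<C, R>(
  candidates: C[],
  enrich: (candidate: C) => Promise<R>,
  cap = MAX_PER_RUN,
  baseDelayMs = 1000,
): Promise<R[]> {
  const done: R[] = []
  for (const candidate of candidates.slice(0, cap)) {
    try {
      done.push(await withRetry(() => enrich(candidate), 3, baseDelayMs))
    } catch {
      // give up on this story for today
    }
  }
  return done
}
```

Running the calls sequentially is a deliberate trade: it is slower than firing them in parallel, but it keeps the burst size at one, which is what aggressive throttling punishes least.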

5. The human gate (the part that makes it trustworthy)

The pipeline produces draft, never published. A human skims the drafts — title, the AI's "why it matters," the score — and approves the good ones. In practice that's a 30-second CLI step:

news review                      # list drafts, highest score first
news publish <id> <id> ...       # draft → published
news archive <id>                # nope

This is the line I won't cross: an LLM is great at drafting and ranking, but "this is real, accurate, and worth a reader's attention" is a human call. It's the difference between a curated feed and an AI slop farm — and readers can tell.

Storage is just three states on one table (draft / published / archived), so "publishing" is a status flip, not a redeploy. Pages query published and cache with ISR.
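The status flip can be modeled as a tiny state machine. A sketch using an in-memory array in place of the table, assuming draft is the only state that can move; Story, TRANSITIONS, setStatus, and publishedOnly are hypothetical names, and the post doesn't say whether published or archived stories can change state again.

```typescript
// Hypothetical sketch of the draft / published / archived lifecycle.
type Status = "draft" | "published" | "archived"
type Story = { id: number; title: string; status: Status }

// Assumption: only drafts can be promoted or archived.
const TRANSITIONS: Record<Status, Status[]> = {
  draft: ["published", "archived"],
  published: [],
  archived: [],
}

// Return a new list with the requested stories moved to `next`,
// leaving any story whose current status does not allow the move untouched.
function setStatus(stories: Story[], ids: number[], next: Status): Story[] {
  const wanted = new Set(ids)
  return stories.map((s) =>
    wanted.has(s.id) && TRANSITIONS[s.status].includes(next) ? { ...s, status: next } : s,
  )
}

// What the public pages would read.
function publishedOnly(stories: Story[]): Story[] {
  return stories.filter((s) => s.status === "published")
}
```

In the real pipeline the same rule would live in the update query (for example, only updating rows whose status is still draft), so a double-submitted publish command is harmless.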

What I'd tell you to copy

  • Skip the RSS dependency for a known set of feeds — but unwrap CDATA before stripping tags.
  • Add a freshness cutoff. Feeds backfill history; your dedup won't catch it.
  • Dedup on normalized URL, with a DB unique index as the backstop.
  • One structured LLM call does rewrite + categorize + score. Loosen the schema, clamp in code, retry on transient errors.
  • Keep a human on publish. Let the model draft and rank; you decide what ships.

The whole thing is a daily cron, one DB table, and ~300 lines. The expensive part (an LLM scoring everything) costs pennies a day with a small model.

See it running

This pipeline feeds a live, builder-focused agentic-AI feed — every story carries a "why it matters for builders" note, not just a headline:

Agentic AI News →

And if you're shopping for the tools themselves: Best AI agent tools.

If you build your own version, the two things that will save you the most pain are the CDATA fix and the freshness cutoff. Ask me anything in the comments.
