
Alan West

Why Your AI News Aggregator Misses Half the Stories (and How to Fix It)

Every developer I know has tried to build some kind of automated briefing system. You wire up a few RSS feeds, maybe hit the Hacker News API, throw it at an LLM for summarization, and call it done. Then two weeks later you realize you missed a major framework release because your pipeline silently dropped it.

I've built three different versions of this for myself over the past year. Each time, I thought I'd nailed it. Each time, I was wrong. Here's what actually goes wrong and how to build a multi-source intelligence pipeline that doesn't quietly fail on you.

The Root Cause: Silent Failures Everywhere

The core problem isn't the AI summarization — that part is honestly the easy bit. The problem is source reliability and data quality upstream of your LLM.

Here's what typically happens:

  • An RSS feed changes its URL or schema, your parser returns empty results, and you never notice
  • Rate limiting kicks in on an API, you get partial data, and your pipeline treats it as "nothing new today"
  • Your LLM context window fills up with noise, so it drops the signal you actually cared about
  • Duplicate stories from different sources waste your token budget

The frustrating part? None of these throw errors. Your cron job runs, your script exits with code 0, and you get a cheerful summary of... incomplete data.
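To make that concrete, here is a minimal sketch of the naive pipeline (hypothetical `parse_feed` standing in for whatever parser you use) exiting cleanly on empty data:

```python
def parse_feed(raw: str) -> list:
    # Hypothetical parser: an upstream schema change makes it
    # silently return [] instead of raising
    return []

def naive_pipeline(raw_feed: str) -> tuple[int, str]:
    items = parse_feed(raw_feed)
    # No check on len(items): an empty feed becomes an
    # "empty but successful" run
    return 0, f"Today's briefing: {len(items)} stories."

code, summary = naive_pipeline("<rss>...</rss>")
# code is 0, so cron reports success; the summary cheerfully says "0 stories"
```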

Step 1: Build Source Fetchers That Know When They Fail

Stop treating source fetching as a simple HTTP GET. Each source needs a health check built into the fetcher itself.

import httpx
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class FetchResult:
    source: str
    items: list
    is_healthy: bool
    warning: str | None = None

async def fetch_with_health_check(
    source_name: str,
    url: str,
    min_expected_items: int = 3,
    max_age_hours: int = 24
) -> FetchResult:
    try:
        async with httpx.AsyncClient(timeout=15.0) as client:
            resp = await client.get(url)
            resp.raise_for_status()
            items = parse_feed(resp.text)  # your parser here

            # Health check: did we get suspiciously few items?
            if len(items) < min_expected_items:
                return FetchResult(
                    source=source_name,
                    items=items,
                    is_healthy=False,
                    warning=f"Only {len(items)} items (expected >= {min_expected_items})"
                )

            # Health check: is the newest item stale?
            # (assumes naive datetimes; use datetime.now(timezone.utc)
            # if your parser returns timezone-aware timestamps)
            newest = max(items, key=lambda x: x.published)
            if newest.published < datetime.now() - timedelta(hours=max_age_hours):
                return FetchResult(
                    source=source_name,
                    items=items,
                    is_healthy=False,
                    warning=f"Newest item is {max_age_hours}+ hours old"
                )

            return FetchResult(source=source_name, items=items, is_healthy=True)

    except Exception as e:  # covers httpx.HTTPError, timeouts, and parser errors
        return FetchResult(
            source=source_name, items=[], is_healthy=False,
            warning=f"Fetch failed: {str(e)}"
        )

The key insight: a successful HTTP response doesn't mean you got useful data. Checking item count and freshness catches 90% of the silent failures I've encountered.

Step 2: Deduplicate Before You Summarize

If the same story appears in four sources, you don't want your LLM spending tokens on it four times. But naive URL deduplication misses a lot — the same story often has completely different URLs across sources.

I use a two-pass approach: exact URL matching first, then fuzzy title similarity.

from difflib import SequenceMatcher

def deduplicate_items(items: list, similarity_threshold: float = 0.7) -> list:
    seen_urls = set()
    unique_items = []

    for item in items:
        # Pass 1: exact URL match
        normalized_url = item.url.rstrip("/").lower()
        if normalized_url in seen_urls:
            continue
        seen_urls.add(normalized_url)

        # Pass 2: fuzzy title matching against kept items
        is_duplicate = False
        for kept in unique_items:
            ratio = SequenceMatcher(
                None,
                item.title.lower(),
                kept.title.lower()
            ).ratio()
            if ratio > similarity_threshold:
                # Keep the one with more metadata (longer description, etc.)
                if len(item.description or "") > len(kept.description or ""):
                    unique_items.remove(kept)
                    unique_items.append(item)
                is_duplicate = True
                break

        if not is_duplicate:
            unique_items.append(item)

    return unique_items

A 0.7 similarity threshold works well in practice. Go lower and you'll merge stories that are related but distinct. Go higher and obvious duplicates slip through.
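To see why 0.7 sits in the sweet spot, compare `SequenceMatcher` ratios for a reworded duplicate versus a related-but-distinct pair (the headlines below are invented for illustration):

```python
from difflib import SequenceMatcher

def title_similarity(a: str, b: str) -> float:
    # Same comparison the dedup pass uses: case-insensitive ratio
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Same story, different headline wording: lands well above 0.7
dup = title_similarity(
    "Python 3.13 released with free-threaded mode",
    "Python 3.13 released, free-threaded mode included",
)

# Related but distinct stories: stays below the threshold
distinct = title_similarity(
    "Python 3.13 released with free-threaded mode",
    "Rust 1.80 released with LazyLock stabilized",
)
```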

Step 3: Structure Your LLM Prompt for Relevance, Not Just Summary

Here's where most people go wrong. They dump all their fetched items into a prompt that says "summarize these." The LLM dutifully summarizes everything, including stuff you don't care about, and buries what matters.

Instead, give the LLM a scoring rubric specific to what you care about.

def build_briefing_prompt(items: list, interests: list[str]) -> str:
    items_text = "\n---\n".join(
        f"Title: {item.title}\nSource: {item.source}\n"
        f"Description: {(item.description or '')[:500]}"  # truncate to save tokens; guard against None
        for item in items
    )

    return f"""You are generating a daily technical briefing.

Relevance criteria (score each item 1-5):
- Directly relates to: {', '.join(interests)}
- Announces a breaking change or security issue: always include
- Is a major release (not a patch): include
- Is general tech news with no actionable insight: exclude

Items to evaluate:
{items_text}

Return ONLY items scoring 3 or higher. For each:
1. One-line summary (what happened)
2. Why it matters (one sentence)
3. Action needed? (yes/no, with brief explanation if yes)

Order by relevance score descending."""

The "Action needed?" field is the real killer feature. Most briefings tell you what happened. Knowing whether you need to do something about it is what actually saves time.
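One companion guard worth adding here (my own addition, not part of the pipeline above): cap how much text goes into the prompt so the context window doesn't fill with noise, per the failure mode from the start of the post. A rough character budget is enough:

```python
def pack_items(descriptions: list[str], char_budget: int = 8000) -> list[str]:
    """Greedily keep item texts until a rough character budget is spent."""
    packed, used = [], 0
    for desc in descriptions:
        cost = len(desc) + 5  # + separator overhead
        if used + cost > char_budget:
            break  # items should be pre-sorted by priority, so we drop the tail
        packed.append(desc)
        used += cost
    return packed

# With a tiny 30-char budget, only the first two short items fit
kept = pack_items(["aaa", "bbbb", "cc" * 20], char_budget=30)
```

Sort items by source priority before packing so the truncation drops the least important tail, not a random slice.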

Step 4: Add a Circuit Breaker for Bad Days

Sometimes multiple sources fail at once. Maybe GitHub is having an incident, or your IP got rate-limited across several APIs. You need to know when your briefing is incomplete rather than getting a confident-sounding summary based on 30% of your usual data.

def should_send_briefing(results: list[FetchResult]) -> tuple[bool, str]:
    healthy = [r for r in results if r.is_healthy]
    total = len(results)
    health_ratio = len(healthy) / total if total > 0 else 0

    if health_ratio < 0.5:
        # More than half the sources failed — don't send a misleading briefing
        failed = [r.source for r in results if not r.is_healthy]
        return False, f"Skipping briefing: {len(failed)}/{total} sources unhealthy"

    if health_ratio < 0.8:
        # Some sources failed — send but with a warning header
        failed = [r.source for r in results if not r.is_healthy]
        return True, f"⚠️ Partial briefing: {', '.join(failed)} unavailable"

    return True, "All sources healthy"

This is the difference between a toy project and something you actually rely on. Without this, you'll eventually make a decision based on absence of information — "I didn't see any security advisories" when really your security feed was down.

Prevention: Making It Observable

After getting burned enough times, I added a few things that seem obvious in retrospect:

  • Log source health daily — even when everything is fine. When something breaks, you want history to spot the trend.
  • Track item counts per source over time. A feed that usually gives you 15 items but suddenly gives you 2 is probably broken, even if those 2 items are valid.
  • Send yourself the health report, not just the briefing. I have mine append a footer like "12 sources checked, 11 healthy, 47 items processed, 18 included."
  • Version your prompts. When you tweak the scoring rubric, keep the old one around. You'll want to A/B test whether your changes actually improved relevance.
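The footer from the third bullet is easy to generate from the same `FetchResult` records. A sketch (the helper name and wording are mine, following the example footer in the text):

```python
from dataclasses import dataclass, field

@dataclass
class FetchResult:
    source: str
    items: list = field(default_factory=list)
    is_healthy: bool = True

def health_footer(results: list[FetchResult], included: int) -> str:
    # Summarize the run so the briefing carries its own health report
    healthy = sum(1 for r in results if r.is_healthy)
    processed = sum(len(r.items) for r in results)
    return (f"{len(results)} sources checked, {healthy} healthy, "
            f"{processed} items processed, {included} included")

footer = health_footer(
    [FetchResult("hn", ["a", "b"]), FetchResult("rss", [], is_healthy=False)],
    included=1,
)
# "2 sources checked, 1 healthy, 2 items processed, 1 included"
```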

The Bigger Lesson

The pattern here applies way beyond news aggregation. Any pipeline where you're pulling data from multiple external sources, processing it, and making decisions based on the output has the same failure modes. Data pipelines, monitoring dashboards, CI/CD systems pulling from package registries — they all fail silently in the same ways.

The fix is always the same: treat the absence of data as a signal, not a non-event. A source returning zero results should be louder than a source returning a hundred. Build your health checks at the data layer, not the application layer, and make sure your system knows the difference between "nothing happened" and "I couldn't check."

I'm still iterating on my own setup, but these four patterns — health-checked fetchers, upstream deduplication, structured relevance scoring, and circuit breakers — have made it something I actually trust every morning instead of something I abandoned after a month.
