Ponsubash Raj R

Posted on Jul 5

I Built an AI Feed, Then Spent Most of the Time Fighting Bad Input

#ai #backend #dataengineering #llm

I thought I was building an AI app.
Turns out, I was building a garbage sorting machine with embeddings.

PROJECT REPOSITORY

The Brief: Make the Internet Less Annoying

Pulse is a personal AI feed for keeping up with AI engineering.

The idea was simple:

Pull content from RSS feeds, GitHub, arXiv, and Gmail newsletters.
Clean the text.
Ask an LLM to summarize and classify it.
Store embeddings.
Serve it in a mobile app with search, bookmarks, digest, trends, quizzes, and Ask mode.

In diagram form, the dream looked like this:

Very elegant. Very architectural. Very “drawn before reality entered the room”.

The actual version looked more like this:

The lesson came quickly:

The model is not the hard part. The hard part is getting sane input into the model.

"The Sources Were Messy" Is an Understatement

Pulse ingests from four main source types:

RSS / Atom feeds
GitHub repositories
arXiv papers
Gmail newsletters

Each one brought its own special personality disorder.

RSS feeds sound simple until you meet malformed XML. Some feeds work perfectly. Some return partial content. Some fail for a day and then return like nothing happened. Very mature behaviour.

GitHub was cleaner because the official GitHub Search API gives structured JSON. I used repository search as a fallback for AI-related repos, sorting by stars and limiting the result count. Still, even clean APIs need defensive handling. A repository might have no description. A URL might be invalid. The API might fail. A good ingestion pipeline should not fall apart because one repo decided to be mysterious.
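Here is a minimal sketch of that defensive handling. It assumes the documented shape of GitHub's repository search endpoint (`GET /search/repositories`, which returns an `items` array with `full_name`, `html_url`, and `description`). The helper names `build_search_url` and `normalize_repo`, and the output keys, are hypothetical and not taken from the Pulse codebase.

```python
from typing import Optional
from urllib.parse import urlencode, urlparse

GITHUB_SEARCH_URL = "https://api.github.com/search/repositories"


def build_search_url(query: str, limit: int = 20) -> str:
    """Build a repository search URL sorted by stars, capped at `limit` results."""
    params = {"q": query, "sort": "stars", "order": "desc", "per_page": limit}
    return f"{GITHUB_SEARCH_URL}?{urlencode(params)}"


def normalize_repo(item: dict) -> Optional[dict]:
    """Turn one search result into an article candidate, or None if unusable."""
    url = item.get("html_url") or ""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        return None  # missing or invalid URL: skip this repo, keep the batch
    name = item.get("full_name") or url
    return {
        "source_id": f"github:{name}",  # hypothetical ID format
        "title": name,
        # description is null for some repositories
        "text": (item.get("description") or "").strip(),
        "url": url,
    }
```

The point of returning `None` instead of raising is that one mysterious repo drops out of the batch quietly while the rest continue.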

arXiv was nicer because it has an official API with search_query, start, max_results, sortBy, and sortOrder parameters, documented in the arXiv API manual. I used category queries and sorted by submitted date. But arXiv abstracts still need cleaning. LaTeX needs stripping. The API gives you structured data, not finished product data. That distinction matters.
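As a rough sketch, a category query and a first-pass LaTeX cleanup could look like this. The query parameters are the ones named above; the function names and the regex-based cleanup are my own assumptions for illustration. Real abstracts will need more rules than these four.

```python
import re
from urllib.parse import urlencode

ARXIV_API_URL = "http://export.arxiv.org/api/query"


def build_arxiv_url(category: str, start: int = 0, max_results: int = 25) -> str:
    """Query one arXiv category, newest submissions first."""
    params = {
        "search_query": f"cat:{category}",
        "start": start,
        "max_results": max_results,
        "sortBy": "submittedDate",
        "sortOrder": "descending",
    }
    return f"{ARXIV_API_URL}?{urlencode(params)}"


def strip_latex(abstract: str) -> str:
    """Rough cleanup: unwrap simple commands, drop math delimiters, collapse whitespace."""
    text = re.sub(r"\\[a-zA-Z]+\{([^{}]*)\}", r"\1", abstract)  # \emph{x} -> x
    text = re.sub(r"\\([a-zA-Z]+)", r"\1", text)  # \alpha -> alpha
    text = re.sub(r"[${}]", "", text)  # leftover math delimiters and braces
    return " ".join(text.split())
```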

And then there was Gmail.

Gmail Newsletters Are GOATed

Great On Arrival, Awful To Transform.

Newsletters deliver the latest hand-picked news. Reading them every day is a genuinely good way to stay current.

But they are awful for processing.

A human sees:

“Here are five interesting AI links”.

A parser sees:

hidden preheader
sponsor block
unsubscribe link
view in browser link
social share buttons
CSS
HTML tables
actual article
footer
another footer
legal footer
unsubscribe again, just in case

Pulse uses the Gmail API, not IMAP. The Gmail users.messages.get endpoint supports retrieving message data: Gmail API docs. I intentionally used read-only access because this app has no business modifying my inbox.

The ingestion query only looks at selected newsletter senders, unread messages, and a recent time window.
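A sketch of how such a query string might be assembled, using standard Gmail search operators (`from:`, `is:unread`, `newer_than:`). The function name, the sender list, and the two-day default are assumptions for illustration. The scope constant is the Gmail API's read-only scope.

```python
# Read-only scope: the app can list and fetch messages but never modify them.
GMAIL_SCOPES = ["https://www.googleapis.com/auth/gmail.readonly"]


def build_gmail_query(senders: list[str], days: int = 2) -> str:
    """Match unread mail from the given senders within the last `days` days."""
    if not senders:
        # An empty sender list would otherwise match the whole inbox.
        raise ValueError("at least one newsletter sender is required")
    sender_clause = " OR ".join(f"from:{sender}" for sender in senders)
    return f"({sender_clause}) is:unread newer_than:{days}d"
```

The resulting string is what would be passed as the `q` parameter when listing messages.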

Then each email gets fetched in full, parsed completely, and turned into one or more article candidates.

The Gmail pipeline had to handle:

nested MIME parts
text/plain
text/html
redirect links
sponsor blocks
promotional-only emails
social/share/footer links
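The nested MIME part is the least obvious item on that list, so here is a small sketch. It assumes the message payload shape returned by the Gmail API (`mimeType`, `body.data` as URL-safe base64, and a `parts` array that can nest). The function names are hypothetical, and keeping only the first body of each type is a simplification.

```python
import base64
from typing import Optional


def decode_body(data: str) -> str:
    """Decode URL-safe base64 body data, restoring padding if it was stripped."""
    padded = data + "=" * (-len(data) % 4)
    return base64.urlsafe_b64decode(padded).decode("utf-8", errors="replace")


def collect_bodies(part: dict, found: Optional[dict] = None) -> dict:
    """Walk nested MIME parts and keep the first text/plain and text/html bodies."""
    if found is None:
        found = {}
    mime_type = part.get("mimeType", "")
    data = (part.get("body") or {}).get("data")
    if data and mime_type in ("text/plain", "text/html"):
        found.setdefault(mime_type, decode_body(data))
    for child in part.get("parts") or []:
        collect_bodies(child, found)  # multipart containers nest arbitrarily deep
    return found
```

Having both bodies available is useful: the HTML version carries the links, and the plain-text version is often a cleaner source of readable text.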

This is where the project stopped being “AI summarizer” and became “forensic email cleaner”.

The Link Problem

Newsletter links are often not the actual article links.

They are tracking links.

Something like:

https://newsletter.com/click?url=https%3A%2F%2Factual-article.com

Or worse:

https://tracking-domain.com/CL0/https:%2F%2Factual-site.com%2Fpost

So Pulse tries to recover the real destination.

The logic is intentionally bounded. It resolves at most one redirect and uses a timeout. Because if a newsletter tracker wants to become a distributed systems problem, I politely decline.

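import httpx

async def resolve_redirect(url: str) -> str:
    # Follow at most one hop, with a 5-second timeout.
    try:
        async with httpx.AsyncClient(timeout=5, follow_redirects=False) as client:
            response = await client.get(url)
    except httpx.HTTPError:
        return url  # fail open: keep the original link
    return response.headers.get("location") or url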

The rule was simple:

Recover useful links, but do not let one link hold the ingestion worker hostage.

If redirect resolution fails, the system keeps the original link or uses the inline newsletter context. Failing open is better than losing the article.
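Both example links earlier in this section carry the destination inside the URL itself, so one option is to try a purely local unwrap before making any network call. This is a sketch under that assumption, not confirmed Pulse code: the function name and the list of redirect parameter names are hypothetical.

```python
from urllib.parse import parse_qs, unquote, urlparse

# Assumed parameter names that trackers commonly use for the destination.
REDIRECT_PARAMS = ("url", "u", "redirect", "target")


def unwrap_tracking_url(link: str) -> str:
    """Recover an embedded destination URL, or return the link unchanged."""
    parsed = urlparse(link)

    # Case 1: destination passed as a query parameter (parse_qs decodes it).
    query = parse_qs(parsed.query)
    for name in REDIRECT_PARAMS:
        for value in query.get(name, []):
            if value.startswith(("http://", "https://")):
                return value

    # Case 2: destination embedded in the path, possibly percent-encoded.
    path = unquote(parsed.path)
    for scheme in ("https://", "http://"):
        index = path.find(scheme)
        if index != -1:
            return path[index:]

    return link  # nothing embedded: fail open with the original link
```

When the local unwrap finds nothing, the link falls through unchanged and the bounded redirect lookup can still be tried.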

Data Quality Decisions, Also Known As “Please Stop Feeding Trash To The LLM”

LLM calls cost quota. Burning it on garbage input is not AI engineering. It is donation.

So Pulse makes several data quality decisions before enrichment.

1. Skip Tiny Articles

Some records have almost no useful text. A title, a link, maybe three words. Very inspiring. Not worth a model call.

if len(clean_text) < 50:
    article.enrichment_status = "skipped"

In my corpus, 798 records were skipped because they had fewer than 50 useful characters.

That saved hundreds of LLM calls.

The LLM did not need to summarize “Click here”. Thankfully, I was capable enough to handle that complex academic material myself.

2. Deduplicate Aggressively

There are two kinds of duplicate problems:

same source item appears again
same content appears from a different path

Each item gets normalized and hashed.

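from hashlib import sha256

# Hash the normalized URL, title, and body together so a re-ingested item matches.
content_hash = sha256(
    f"{normalized_url}|{clean_title}|{clean_text}".encode()
).hexdigest()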

So Pulse uses both source IDs and content hashes.

This matters because ingestion is scheduled. If the same newsletter or feed entry returns again, the system should not create another article and proudly announce, “Good news, I found the same thing again”.
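The hash is only as good as the normalization in front of it. Here is one way the normalization step could look. It is a sketch with assumed rules (drop fragments and common tracking parameters, lowercase the host, trim trailing slashes, collapse whitespace), and the names `normalize_url`, `TRACKING_PARAMS`, and `compute_content_hash` are hypothetical.

```python
from hashlib import sha256
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse

TRACKING_PARAMS = {"fbclid", "gclid"}  # assumed list; utm_* is handled by prefix


def normalize_url(url: str) -> str:
    """Canonicalize a URL so cosmetic differences do not defeat deduplication."""
    parsed = urlparse(url.strip())
    query = [
        (key, value)
        for key, value in parse_qsl(parsed.query, keep_blank_values=True)
        if not key.lower().startswith("utm_") and key.lower() not in TRACKING_PARAMS
    ]
    return urlunparse((
        parsed.scheme.lower(),
        parsed.netloc.lower(),
        parsed.path.rstrip("/") or "/",
        "",                        # params (rarely used)
        urlencode(sorted(query)),  # stable parameter order
        "",                        # drop the fragment
    ))


def compute_content_hash(url: str, title: str, text: str) -> str:
    """Hash the normalized URL, title, and body with whitespace collapsed."""
    clean_title = " ".join(title.split())
    clean_text = " ".join(text.split())
    return sha256(f"{normalize_url(url)}|{clean_title}|{clean_text}".encode()).hexdigest()
```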

3. Cap Text Before Enrichment

The enrichment worker trims article text before sending it to Groq.

text = clean_body(article.raw_text or "", limit=3000)

This prevents long newsletters from becoming expensive prompt sludge.

The goal of enrichment is not to preserve every footer, tracking disclaimer, and “You are receiving this email because...” paragraph.
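The internals of `clean_body` are not shown here, so the following is only a sketch of what the capping step might do: collapse whitespace, then cut at the limit without splitting a word. The name `cap_text` is hypothetical.

```python
def cap_text(text: str, limit: int = 3000) -> str:
    """Collapse whitespace and truncate to at most `limit` characters."""
    collapsed = " ".join(text.split())
    if len(collapsed) <= limit:
        return collapsed
    cut = collapsed[:limit]
    if collapsed[limit] == " ":
        return cut  # the limit already falls on a word boundary
    boundary = cut.rfind(" ")
    # Back up to the last full word; keep the hard cut for one very long token.
    return cut[:boundary] if boundary > 0 else cut
```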

LLM Safety: Because JSON Mode Still Has Hobbies

The enrichment model returns structured metadata:

summary
category
importance
entities
keywords

The ideal response is JSON.

The actual response can be JSON, markdown-wrapped JSON, JSON with leading prose, JSON with trailing commas, or JSON wearing a small theatrical costume.

So Pulse does not trust the raw output.

It extracts a JSON object, then validates it with Pydantic. Pydantic supports custom validators for enforcing constraints and cleaning values: Pydantic validator docs.
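The extraction step can be sketched with the standard library alone. This version scans for the first `{` that starts a parseable object, which covers markdown fences and leading prose. It does not repair trailing commas: that case raises and is treated as a failed parse. The function name is hypothetical.

```python
import json


def extract_json_object(raw: str) -> dict:
    """Return the first parseable JSON object found in the model output."""
    decoder = json.JSONDecoder()
    start = raw.find("{")
    while start != -1:
        try:
            # raw_decode parses one value starting at `start` and ignores the rest.
            value, _end = decoder.raw_decode(raw, start)
            return value
        except json.JSONDecodeError:
            start = raw.find("{", start + 1)  # try the next candidate brace
    raise ValueError("no JSON object found in model output")
```

The returned dict is still untrusted at this point. It only proves the output was syntactically JSON, which is why schema validation comes next.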

The schema enforces rules like:

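from pydantic import BaseModel, Field

# Category and EntityMap are project-defined types, not shown here.
class EnrichmentResult(BaseModel):
    summary: str = Field(min_length=10, max_length=1000)
    category: Category
    importance: int = Field(ge=1, le=5)
    entities: EntityMap
    keywords: list[str]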

The Prompt Was Not Trusted Either

The prompt asks for:

two-sentence summary
one supported category
importance from 1 to 5
entities grouped by known keys
5 to 8 keywords
JSON only

But the system still validates everything afterward.

Because prompts are requests, not contracts.

A contract looks like this:

result = parse_enrichment(model_output)
article.summary = result.summary
article.category = result.category
article.importance = result.importance

The database only gets validated output.

If parsing fails, the article is marked failed or retried. It does not sneak into the feed half-broken and become the mobile app’s problem. Frontend developers deserve peace too. Occasionally.

Reliability: Do Not Enrich The Same Article Twice

The enrichment worker claims work using PostgreSQL row locks.

PostgreSQL supports FOR UPDATE SKIP LOCKED, which is useful for queue-like tables where multiple consumers should avoid fighting over the same row. SKIP LOCKED skips rows that cannot be locked immediately: PostgreSQL SELECT docs.

Pulse uses that pattern:

statement = (
    select(Article)
    .where(Article.enrichment_status == "pending")
    .order_by(Article.ingested_at.desc())
    .with_for_update(skip_locked=True)
    .limit(1)
)

Once an article is claimed:

article.enrichment_status = "processing"
await session.commit()

Then the worker calls the LLM.

On success:

article.enrichment_status = "done"
article.summary = result.summary
article.embedding = embedding

On quota exhaustion:

article.enrichment_status = "pending"

On handled failure:

article.enrichment_status = "failed"
article.retry_count += 1

The important rule:

Never leave a row stuck in processing.

A stuck processing row is the backend version of getting seen-zoned.
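One way to make that rule hard to break is to route every outcome through a single function that always assigns a final status. This is a database-free sketch of the three transitions above. `Job`, `QuotaExhausted`, and `settle` are hypothetical names, and catching every other exception as a handled failure is my assumption.

```python
from dataclasses import dataclass
from typing import Callable


class QuotaExhausted(Exception):
    """Raised by the enrichment call when the daily quota is used up."""


@dataclass
class Job:
    enrichment_status: str = "processing"
    retry_count: int = 0
    summary: str = ""


def settle(job: Job, enrich: Callable[[Job], str]) -> str:
    """Run enrichment and always move the job out of 'processing'."""
    try:
        job.summary = enrich(job)
        job.enrichment_status = "done"
    except QuotaExhausted:
        job.enrichment_status = "pending"  # not the article's fault: try later
    except Exception:
        job.enrichment_status = "failed"
        job.retry_count += 1
    return job.enrichment_status
```

This does not cover a worker that dies mid-call. That case needs something outside the worker, such as a periodic sweep for old processing rows.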

Quota Is A Product Feature

Groq enrichment uses a daily quota.

Pulse reserves quota before external calls:

if not await reserve_quota(quota_manager):
    article.enrichment_status = "pending"
    return "quota_exhausted"

This avoids half-started work.

It also lets the system degrade gracefully. If quota is exhausted:

ingestion can still store new articles
feed can still serve old articles
search still works
enrichment waits until quota resets

That is a much better failure mode than “everything exploded because one external service said no”.
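To illustrate the reserve-before-call idea, here is a small in-memory sketch of a daily quota counter. The `reserve_quota` call shown above is async and presumably backed by shared storage. This `DailyQuota` class is a hypothetical, single-process stand-in with an injectable clock so the daily reset can be tested.

```python
from datetime import date
from typing import Callable


class DailyQuota:
    """Count reservations per calendar day and refuse once the limit is hit."""

    def __init__(self, limit: int, today: Callable[[], date] = date.today):
        self.limit = limit
        self._today = today
        self._day = today()
        self._used = 0

    def reserve(self) -> bool:
        """Claim one unit of quota; return False if today's budget is spent."""
        current = self._today()
        if current != self._day:
            # A new day started: reset the counter.
            self._day, self._used = current, 0
        if self._used >= self.limit:
            return False
        self._used += 1
        return True
```

Because the reservation happens before the external call, a refused reservation costs nothing: the article simply stays pending.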

External APIs are not loyal friends. They are business relationships.

The Final Shape

After all this, the pipeline became:

Not glamorous.

But reliable.

And once this pipeline exists, the fun features become much easier:

semantic search
hybrid search
daily digest
trends
Ask mode
quizzes
mobile offline cache

AI apps are built on boring data discipline.

Takeaway

I started by thinking:

“I will build an AI feed”.

I ended up learning:

“I will build a defensive ingestion system, and if the data behaves, I may allow an LLM near it”.

The LLM was useful. But only after the input was cleaned, filtered, validated, deduplicated, capped, retried, locked, and politely threatened.

The real architecture was not:

content -> LLM -> magic

It was:

mess -> discipline -> model -> useful product

AI systems are not impressive when they work on spoon-fed input.

They are impressive when they survive the internet.
