DEV Community: Ponsubash Raj R

Free-Tier AI Made Me Build a Traffic Cop for Tokens

Ponsubash Raj R — Sun, 26 Jul 2026 07:21:56 +0000

My Project Brief

CourseFlow is my attempt to solve a very practical problem: long YouTube courses are useful, but consuming them is slow, messy, and usually trapped inside video timelines.

So I built a system that takes a course playlist and turns it into structured learning material:

transcripts
cleaned lesson notes
Anki flashcards
course-level exports
optional diagrams

PROJECT REPOSITORY: Open

The fun part is that this is not a toy “summarize one video” app. It has to process entire playlists, sometimes 50 or 60 videos long. That means many transcript chunks, many LLM calls, many retries, and many ways to discover that free-tier AI is not free from consequences.

The first version worked locally. Then I started asking the real question:

Can this process a full course without wasting API quota, crashing halfway, or pretending 429 errors are a lifestyle?

That is where the quota-aware scheduler came in.

The Local Architecture

For this blog, I am focusing only on the local architecture.

The local setup uses a normal distributed application shape, just running on one machine:

The frontend starts course processing. The backend stores the course, videos, jobs, notes, and usage records. Celery workers handle the slow work. Redis handles fast coordination. PostgreSQL stores durable truth.

That split matters.

Redis is fast, but temporary. PostgreSQL is slower, but trustworthy.

The AI provider is external, rate-limited, and has opinions.

Celery workers are concurrent and mildly chaotic, as workers usually are when left unsupervised.

So the core design rule became simple:

Redis decides who may go now. PostgreSQL remembers what actually happened.

That one sentence saved the system from a surprising amount of pain.

AI APIs Are Shared Distributed Resources

At first glance, an LLM API looks like a function call:

response = client.chat.completions.create(...)

Cute.

In reality, it behaves more like a shared distributed resource with multiple constraints:

requests per minute/requests per day
tokens per minute/tokens per day
audio seconds per hour
model-specific limits
organization-level limits
retry windows
partial failures

So if five workers call the API at the same time, they are not five independent workers anymore. They are five people drinking from the same bottle and acting shocked when it becomes empty.

This is why local “sleep for 10 seconds” throttling was not enough. It works only when one worker exists, the moon is aligned, and no request uses more tokens than expected.

CourseFlow needed an organization-wide scheduler.

The Useful Part: Groq Gives Rate Limit Headers

Groq exposes rate limit information through response headers, including:

x-ratelimit-limit-requests
x-ratelimit-remaining-requests
x-ratelimit-reset-requests
x-ratelimit-limit-tokens
x-ratelimit-remaining-tokens
x-ratelimit-reset-tokens
retry-after for 429 responses

Reference: Groq rate limit documentation

That is extremely useful because the server is the source of truth. Local counters are helpful, but the provider knows the real remaining quota.

So the scheduler treats provider headers as authoritative whenever available.

A simplified version of the header parsing looks like:

def parse_groq_headers(headers):
    return {
        "rpd_remaining": int(headers.get("x-ratelimit-remaining-requests", 0)),
        "rpd_reset": parse_duration(headers.get("x-ratelimit-reset-requests")),
        "tpm_remaining": int(headers.get("x-ratelimit-remaining-tokens", 0)),
        "tpm_reset": parse_duration(headers.get("x-ratelimit-reset-tokens")),
        "retry_after": parse_seconds(headers.get("retry-after")),
    }

Redis Atomic Reservations: The Bouncer at the Door

The biggest risk with concurrent workers is double-spending quota.

Imagine Redis says there are 2 requests left. Three workers check at the same time:

Worker A: sees 2 left
Worker B: sees 2 left
Worker C: sees 2 left

All three proceed.

Congratulations, you have now duplicated the source of spending.

To prevent that, CourseFlow uses atomic Redis reservations. The decision and the counter update must happen together.

Conceptually, the reservation looks like this:

-- Pseudocode
if remaining_requests < requested_requests then
  return "blocked"
end

if remaining_tokens < estimated_tokens then
  return "blocked"
end

remaining_requests = remaining_requests - requested_requests
remaining_tokens = remaining_tokens - estimated_tokens

return "reserved"

Redis Lua scripts execute atomically, so no other worker can sneak in halfway through. Redis Lua scripting.

This turns quota into something workers reserve before calling the API.

Not after.

After is too late. After is when you are writing an incident report to yourself.

PostgreSQL Durable Usage Ledger: The Memory That Survives Restart

Redis is great for fast coordination, but if the local machine restarts, Redis may lose volatile state depending on configuration.

PostgreSQL is where CourseFlow records durable usage:

CREATE TABLE groq_usage_ledger (
    id UUID PRIMARY KEY,
    model TEXT NOT NULL,
    usage_type TEXT NOT NULL,
    request_id TEXT,
    requests_used INTEGER NOT NULL,
    tokens_used INTEGER DEFAULT 0,
    audio_seconds_used NUMERIC DEFAULT 0,
    created_at TIMESTAMPTZ NOT NULL
);

The actual schema can evolve, but the idea is stable:

record successful requests
record token usage
record audio seconds
record model and request type
allow quota state to be reconstructed

PostgreSQL gives the durable database foundation here.

This matters because workers can crash. Redis can be rebuilt. But completed usage should not become expired information.

Short Limits vs Daily Exhaustion

Not all rate limits mean the same thing.

Some are short-term limits:

requests per minute
tokens per minute
audio seconds per hour

For these, the correct behavior is:

Wait, then retry.

The job should be marked as rate_limited, scheduled after the reset time, and resumed later.

Daily exhaustion is different:

requests per day
tokens per day
daily provider allowance

For these, retrying in 30 seconds is just performance art.

The correct behavior is:

Defer until the daily window resets.

CourseFlow maps short waits to rate_limited and daily exhaustion to deferred.

That distinction is important for UX too. A user should know whether the system is temporarily waiting or done for the day.

Whisper Is Not “Just Another API Call”

Whisper transcription adds another dimension: audio duration.

Groq’s speech-to-text API has its own constraints. Reference: Groq speech-to-text documentation.

For long videos, CourseFlow chunks audio and sends each chunk separately. But each chunk needs quota reservation too.

The scheduler reserves:

one request
estimated billable audio duration
model-specific Whisper capacity

The tricky bit is minimum billable duration.

If a provider bills short audio chunks using a minimum duration, your scheduler must account for that. Otherwise, you will underestimate usage.

Conceptually:

billable_seconds = max(actual_chunk_seconds, 10)

That tiny max() carries real engineering weight. Without it, 100 tiny chunks can look cheap locally while the provider counts them differently.

Why 429s Should Not Consume Quota

This one is subtle and important.

A 429 Too Many Requests means the provider rejected the request because a limit was exceeded.

So the local reservation should be released.

If the scheduler reserves quota, sends the request, receives a 429, and still counts the reservation as consumed, the system punishes itself twice:

Provider: "No."
Local scheduler: "Understood. I will now reduce my own local quota too."

Completely wrong.

The better behavior is:

Reserve estimated capacity.
Send request.
If success, reconcile with actual usage and headers.
If 429, release the reservation.
Set a shared blocked_until gate using retry-after or reset headers.

Simplified:

try:
    response = call_groq()
    scheduler.commit_success(reservation, response.headers, response.usage)
except RateLimitError as exc:
    scheduler.release(reservation)
    scheduler.block_until(parse_retry_time(exc.headers))
    raise

Basically, a rejected request should influence scheduling, not usage accounting.

That is how the system avoids slowly starving itself after temporary rate limits.

Resumability: The Quiet Superpower

The scheduler is not just about avoiding 429s. It also makes long course processing resumable.

A course can take a while. During that time:

a worker can crash
the machine can restart
Redis can be cleared
the API can rate limit
Whisper can fail on one chunk
one lesson can be deferred while others continue

CourseFlow handles this by making every meaningful unit durable:

videos have statuses
notes are stored
transcript chunks are tracked
Whisper chunks are stored
usage is recorded
retries have timestamps
completed work is skipped on restart

So if chunk 17 of a video hits a rate limit, the system does not redo chunks 1-16 like it has short-term memory issues.

It retries the blocked chunk later.

That is the difference between a demo and a system.

The Final Local Design

The final flow looks like this:

The important part is not one specific model or provider.

The important part is the pattern:

Estimate before request.
Reserve atomically.
Call provider.
Reconcile from headers.
Persist durable usage.
Retry only the failed unit.

That pattern works for LLMs, Whisper, image generation, and basically any paid or limited API where “oops” has a quota cost.

Takeaway

Free-tier AI sounds like a pricing detail.

It is not.

When you build a real application on top of free-tier or quota-limited APIs, you are suddenly dealing with distributed systems problems:

shared state
concurrent workers
partial failure
durable accounting
retry classification
idempotency
backpressure

In summary:

I wanted video summaries. I got consensus-adjacent quota coordination.

That is the engineering lesson CourseFlow taught me.

And honestly, it was a good one.

RAG Apps Don’t Fail at Generation. They Fail at Retrieval

Ponsubash Raj R — Sun, 19 Jul 2026 12:42:48 +0000

Why My RAG App Uses BM25, Vectors, Parent-Child Chunks, and Chat Memory Together

PROJECT REPOSITORY: Open

Most RAG projects start with a very confident idea:

“Let us split the PDF, embed the chunks, store them in a vector database, and ask questions.”

Beautiful. Elegant. Wrong often enough to be annoying.

That was also my first idea while building Docflow, a multi-user document question-answering app where users upload PDFs/images and chat with their own files. The application processes documents, stores chunks in Qdrant, retrieves relevant context, and sends that context to an LLM.

The first version looked simple:

And honestly, for a simple scenario with three clean paragraphs, this works nicely. Unfortunately, my documents were not three clean paragraphs. They were college lecture notes, policy PDFs, invoice files, scanned documents full of headers, footers, tables, page numbers, short codes, and lines like:

Company Confidential
Employee Benefits Guide
Page 7 of 42

Very official. Also very noisy.

So the retrieval system had to grow up a little.

Final retrieval design:

This blog explains why each piece exists.

The Project In One Minute

Docflow is a full-stack RAG application.

Users can:

register and log in
upload PDFs/images
manage uploaded files
create multiple chats
ask questions over only their own documents
continue old chats with memory
delete files from storage and index

The system uses:

FastAPI for backend APIs
React for UI
Celery + Redis for async document processing
S3/MinIO for raw file storage
Qdrant for vector search
BM25 for keyword retrieval
Groq LLM for final answers
SQLite/PostgreSQL for users, files, chats, and messages

The important part: retrieval is user-scoped. A user should never retrieve another user’s document chunks. That is not a feature. That is a lawsuit warming up.

Problem With The First Naive Idea: Just Embed Chunks And Search Qdrant

The first plan was simple:

Extract text from a PDF.
Split it into chunks.
Convert chunks into embeddings.
Store vectors in Qdrant.
Embed the user query.
Retrieve nearest chunks by cosine similarity.
Send those chunks to the LLM.

This is the classic semantic search flow. Sentence Transformers describes semantic search as embedding both the query and corpus into the same vector space, then finding the closest embeddings by semantic similarity: Sentence Transformers Semantic Search.

That already gives a decent system.

Vector search is good at meaning.

If the document says:

Employees can request paid leave after completing the probation period.

And the user asks:

When am I eligible for paid time off?

A vector model may understand that paid leave and paid time off are related.

Might.

That word is carrying a lot of emotional damage.

Semantic search can struggle with:

acronyms: PTO, SLA, KYC, GST
policy codes: HR-204, SEC-17, INV-009
exact clause names
invoice numbers
section titles
short technical queries
rare terms that are important

BM25: Because Exact Words Still Matter

To fix this, I added BM25.

BM25 is a classic lexical ranking method. It scores documents based on query term matches, term frequency, inverse document frequency, and document length.

Real Example

Suppose the uploaded document contains:

Payment terms: Net 30. The customer must complete payment within 30 calendar days from the invoice date.
Late payments may incur a 2% monthly service charge.

User asks:

What are the Net 30 payment terms and late fee?

Vector search may retrieve semantically similar billing chunks.

BM25 strongly boosts chunks containing:

Net
30
payment
late
fee

That gives better recall for technical phrases, policy names, codes, and numbers.

So now we have:

Vector search -> meaning
BM25 search   -> exact terms

Combining Rankings With RRF

Now we have two ranked lists:

Vector results:
1. chunk A
2. chunk B
3. chunk C

BM25 results:
1. chunk C
2. chunk D
3. chunk A

How do we combine them?

Bad idea:

final_score = vector_score + bm25_score

Why bad?

Because vector similarity and BM25 score are not on the same scale. Adding them directly is like adding kilograms and degrees Celsius.

So I used Reciprocal Rank Fusion, or RRF.

RRF combines rankings using rank positions instead of raw scores: Reciprocal Rank Fusion paper.

The scoring idea:

score(document) = sum over rankings: 1 / (k + rank)

This gives a nice behavior:

If a chunk ranks high in both vector and BM25, it rises.
If a chunk ranks high in only one method, it can still survive.
We do not care about incompatible score scales.

Simple. Effective.

Why Parent-Child Chunking Is Better

The next problem: what size should chunks be?

Small chunks are great for retrieval:

"Late payments may incur a 2% monthly service charge."

This is focused and searchable.

But small chunks are bad for final answering because the LLM may miss surrounding context:

Which invoice does this apply to?
Is the 2% charge monthly or one-time?
Does the 30-day period start from invoice date or delivery date?
Are there exceptions?

Large chunks are better for context, but worse for search. A giant chunk may contain the right answer buried under three pages of policy text, one approval matrix, and a footer reminding you that the document is confidential. Very searchable. Obviously.

So I use parent-child chunking.

Parent chunk: larger block, sent to LLM
Child chunk: smaller block, used for retrieval

Structure:

parent: abcd1234-p0
child:  abcd1234-p0-c0
child:  abcd1234-p0-c1
child:  abcd1234-p0-c2

Indexing stores both:

{
    "id": chunk["id"],
    "text": chunk["text"],
    "type": "child",
    "parent_id": "abcd1234-p0",
    "user_id": user_id,
    "file_id": file_id,
    "job_id": job_id,
    "filename": filename,
}

Search only looks at children:

FieldCondition(key="type", match=MatchValue(value="child"))

Then after RRF, the system fetches parent chunks:

parent_ids = [hit["parent_id"] for hit in fused_hits]
parents = retrieve_by_ids(parent_ids)

This gives the best of both worlds:

Child chunks -> precise search
Parent chunks -> useful LLM context

This matters a lot for real documents, where a rule, exception, effective date, and approval condition may be spread across nearby lines. Sending only the tiny matching sentence can make the LLM answer like it skimmed the policy during lunch and decided that was enough.

Answering Follow-Up Questions

The user does not always ask beautiful standalone questions.

They ask:

What is the company policy for remote work?

Then:

Who needs to approve it?

Humans understand it. A stateless API does not.

So each user can create multiple chats. Each chat has messages stored in the database:

chats
messages

When a new message arrives, Docflow loads recent messages:

history = _recent_chat_history(chat_id, user_id, limit=12)

Then it builds a retrieval query using recent context, so that the retrieval layer sees:

user: What is the company policy for remote work?
assistant: ...
user: Who needs to approve it?

Now the search query has enough context to understand that it probably means remote work.

Important design choice:

Chat history helps understand the question. It is not treated as factual source material.

That distinction matters. The LLM should not invent facts from conversation history. The uploaded documents remain the authority.

Grounded Answers And Fallback Behavior

The system prompt tells the model:

Use only the current user's retrieved document context for factual answers.
Use conversation history only to understand follow-up references.
If the answer is not in the context, say:
"I could not find this information in the provided documents."
If the user asks for an example and no example appears in the context,
say that no source example was found before giving a clearly labeled general example.

This is there because RAG systems should not pretend the document said something it did not say.

For example, if the document explains the expense reimbursement policy but gives no sample reimbursement scenario, and the user asks:

Give an example for it.

A good answer is:

I could not find an example in the provided documents.

General example:
If an employee spends ₹1,200 on approved client travel and uploads the receipt within the required period, the finance team may reimburse the amount after manager approval.

That is much better than:

The document says...

when the document absolutely did not say it. Classic LLM confidence.

How The Final Retrieval Flow Looks

Here is the final flow:

1. User asks a question
2. API loads recent chat history
3. API creates retrieval query
4. Query is embedded
5. Qdrant finds semantic child chunk matches
6. BM25 finds exact keyword child chunk matches
7. RRF combines both rankings
8. Best child chunks are mapped to parent chunks
9. Parent chunks become LLM context
10. LLM answers with sources
11. Chat history is saved

Takeaway

The main lesson from building Docflow:

RAG quality depends more on retrieval design than on just calling a powerful LLM.

Embeddings are useful, but they are not magic. BM25 is old, but old does not mean useless. Parent-child chunking sounds fancy, but it solves a very real context problem. Chat memory is necessary because users ask follow-up questions like normal humans. Grounded fallback behavior is needed because LLMs are very comfortable saying things with full confidence and zero evidence.

Final design:

Vectors for meaning
BM25 for exact terms
RRF for rank fusion
Child chunks for retrieval
Parent chunks for context
Chat history for follow-ups
Source-grounded prompting for honesty

A RAG app is not impressive because it uses embeddings. Everyone and their grandma can do that now.

It becomes impressive when it handles the messy parts:

noisy PDFs
exact terminology
multi-user isolation
follow-up questions
missing source evidence
explainable retrieval decisions

That is where the engineering starts.

The LLM Didn’t Need to See the Diagram. It Just Needed a Seat Number

Ponsubash Raj R — Sun, 12 Jul 2026 05:53:37 +0000

AI note apps love doing one very modern thing: taking a clean lecture PDF, feeding it to an LLM, and proudly returning notes with all the diagrams missing.

Amazing. We automated disappointment.

This project started with a simple goal: turn lecture PDFs and slides into useful study notes, without losing diagrams. Not “diagram summaries.” Not “imagine a flowchart here.” The actual source diagrams, in the right places.

PROJECT REPOSITORY: Open

Project Brief

I built Smart Notes Generator, a local-first app that takes PDFs and PowerPoint files, extracts text and diagrams, sends only the useful text structure to an LLM, and then rebuilds the final notes with the original diagrams inserted back.

The stack is simple:

React + TypeScript for the frontend.
FastAPI for the backend.
PyMuPDF for PDF text/image/vector extraction.
python-pptx for PowerPoint parsing.
Pillow for image scoring.
SQLite for local saved notes.
LLM API or manual copy-paste mode for generation.

The Normal Approach: Throw Everything at the Model and Pray

The naive approach is:

The problem is that most document extraction pipelines treat images as either:

Noise.
Base64 blobs.
Something to send to a vision model.
Someone else’s problem.

Sending every diagram to a vision-capable model sounds fancy, but it creates new problems:

It increases input size.
It adds cost.
It adds latency.
It still may not place the diagram correctly.
It can describe the diagram, but that is not the same as preserving it.

The key realization was:

The LLM does not need to understand every pixel of the diagram. It mostly needs to know where the diagram belongs.

That one sentence changed the architecture.

The Idea: Give Every Diagram a Seat Number

Instead of sending images to the model, I extract diagrams locally and replace them with stable placeholders.

Example:

A finite automaton can be represented using states and transitions.

{{IMG_001}}

The transition function defines how the automaton moves between states.

The LLM sees the placeholder as part of the source context. It can move it into the right location in the generated notes.

After generation, the backend replaces:

{{IMG_001}}

with the original local image.

So the model handles reasoning and structure. The system handles files, images, and reliability.

How It Works

The flow basically looks like:

A simplified version of the placeholder registry looks like this:

registry["IMG_001"] = {
    "placeholder": "{{IMG_001}}",
    "path": "/local/session/imgs/automata_p2_f0.png",
    "source_file": "lecture_automata.pdf",
    "page": 2,
    "score": 0.82,
    "context": "DFA transition diagram",
    "included": True,
}

Then the prompt contains text, not image bytes:

Use the following source material to create clear study notes.
Preserve image placeholders exactly where they belong.

Source:
A DFA consists of states, alphabet, transition function...

{{IMG_001}}

The accepting state is shown with a double circle.

The LLM returns something like:

## Deterministic Finite Automata

A **DFA** is a finite-state machine where each input symbol leads to exactly one next state.

{{IMG_001}}

The double-circled state represents an accepting state.

Then the postprocessor turns the token into a real figure:

def inject_images(markdown: str, registry: dict) -> str:
    for img_id, info in registry.items():
        figure_html = f"""
<figure>
  <img src="{info["path"]}" alt="{info["alt_text"]}" />
  <figcaption>{info["source_file"]}, page {info["page"] + 1}</figcaption>
</figure>
"""
        markdown = markdown.replace(info["placeholder"], figure_html)
    return markdown

But Not Every Image Deserves a VIP Pass

Lecture PDFs contain useful diagrams, yes. They also contain logos, headers, footers, slide backgrounds, decorative lines, and other visual confetti.

So I added image filtering.

Each extracted image gets a quality score based on:

Pixel dimensions
Aspect ratio
File size
Color entropy
Non-blankness
Duplicate hash

A simplified scoring idea:

score = (
    size_score    * 0.25 +
    aspect_score  * 0.20 +
    file_score    * 0.20 +
    entropy_score * 0.20 +
    blank_score   * 0.15
)

This removes tiny logos, blank slide backgrounds, repeated assets, and weird banner strips.

The user can still review the gallery and manually include or exclude images. Because yes, sometimes the “low quality” image is actually the one diagram the professor will ask for in the exam. Naturally.

Handling Failure: Because LLMs Have Vibes, Not Contracts

The LLM is told to preserve placeholders.

Does it always do that?

Of course not. It is an LLM, not a legally binding agreement.

Sometimes it drops {{IMG_003}}. So the system needs a fallback.

For every image, I store nearby text context and extract keywords:

"fallback_keywords": ["transition", "state", "dfa", "accepting", "alphabet"]

If a placeholder is missing after generation, the backend scans the generated notes and inserts the image near the paragraph with the strongest keyword match.

Simplified version:

def place_dropped_image(paragraphs, image):
    keywords = set(image["fallback_keywords"])

    best_index = 0
    best_score = 0

    for i, paragraph in enumerate(paragraphs):
        words = set(paragraph.lower().split())
        score = len(words & keywords)

        if score > best_score:
            best_score = score
            best_index = i

    paragraphs.insert(best_index + 1, image["placeholder"])
    return paragraphs

The LLM gets freedom to structure the notes. The system keeps guardrails around the parts that must not break.

Other Benefits I Got Almost for Free

1. Lower Cost

Images are not sent as model input. A placeholder like {{IMG_001}} is tiny. A base64 image is a suitcase full of nonsense as far as the prompt is concerned.

2. Better Privacy

Source files and extracted diagrams stay local. The model only sees text and placeholder IDs. For student notes, academic content, or internal training material, this matters.

3. Exact Diagrams

The final document uses the original image. No re-generated diagram. No “close enough.” No AI slop.

4. Easier Debugging

Placeholders make the pipeline inspectable.

If a diagram is missing, I can ask:

Was it extracted?
Was it filtered out?
Was the placeholder assigned?
Did the model drop it?
Did postprocessing fail?

That beats staring at a final blob of generated Markdown wondering where everything went.

The Actual Design Decision

The big decision was separating responsibilities:

Document parser: extract facts and files
Image filter: decide what is useful
Prompt builder: create clean LLM input
LLM: rewrite and organize
Postprocessor: restore local diagrams
Evaluator/RAG: help review and reuse notes

The LLM is not the system. It is one component inside the system.

That distinction matters.

A lot of AI apps are just:

This project is closer to:

That is the difference between a demo and an actual product.

Takeaway

The best AI engineering trick in this project was not a giant prompt. It was knowing what not to send to the model.

The diagram did not need to be seen. It needed to be tracked.

The LLM did not need image pixels. It needed a placeholder, nearby context, and a clear instruction.

The system did the rest.

And honestly, that is the lesson I keep coming back to:

Good AI systems are not built by asking the model to do everything. They are built by giving the model the right job, then surrounding it with boring, reliable software.

I Built an AI Feed, Then Spent Most of the Time Fighting Bad Input

Ponsubash Raj R — Sun, 05 Jul 2026 05:05:29 +0000

I thought I was building an AI app.
Turns out, I was building a garbage sorting machine with embeddings.

PROJECT REPOSITORY

The Brief: Make the Internet Less Annoying

Pulse is a personal AI feed for keeping up with AI engineering.

The idea was simple:

Pull content from RSS feeds, GitHub, arXiv, and Gmail newsletters.
Clean the text.
Ask an LLM to summarize and classify it.
Store embeddings.
Serve it in a mobile app with search, bookmarks, digest, trends, quizzes, and Ask mode.

In diagram form, the dream looked like this:

Very elegant. Very architectural. Very “drawn before reality entered the room”.

The actual version looked more like this:

The lesson came quickly:

The model is not the hard part. The hard part is getting sane input into the model.

"The Sources Were Messy" - is an understatement.

Pulse ingests from four main source types:

RSS / Atom feeds
GitHub repositories
arXiv papers
Gmail newsletters

Each one brought its own special personality disorder.

RSS feeds sound simple until you meet malformed XML. Some feeds work perfectly. Some return partial content. Some fail for a day and then return like nothing happened. Very mature behaviour.

GitHub was cleaner because the official GitHub Search API gives structured JSON. I used repository search as a fallback for AI-related repos, sorting by stars and limiting the result count. Still, even clean APIs need defensive handling. A repository might have no description. A URL might be invalid. The API might fail. A good ingestion pipeline should not fall apart because one repo decided to be mysterious.

arXiv was nicer because it has an official API with search_query, start, max_results, sortBy, and sortOrder parameters, documented in the arXiv API manual. I used category queries and sorted by submitted date. But arXiv abstracts still need cleaning. LaTeX needs stripping. The API gives you structured data, not finished product data. That distinction matters.

And then there was Gmail.

Gmail Newsletters Are GOATed

Great On Arrival, Awful To Transform.

Newsletters provide latest hand-picked news. Reading such newsletters everyday really boosts our knowledge.

But they are awful for processing.

A human sees:

“Here are five interesting AI links”.

A parser sees:

hidden preheader
sponsor block
unsubscribe link
view in browser link
social share buttons
CSS
HTML tables
actual article
footer
another footer
legal footer
unsubscribe again, just in case

Pulse uses the Gmail API, not IMAP. The Gmail users.messages.get endpoint supports retrieving message data: Gmail API docs. I intentionally used read-only access because this app has no business modifying my inbox.

The ingestion query only looks at selected newsletter senders, unread messages, and a recent time window.

Then each email gets fetched in full, parsed completely, and turned into one or more article candidates.

The Gmail pipeline had to handle:

nested MIME parts
text/plain
text/html
redirect links
sponsor blocks
promotional-only emails
social/share/footer links

This is where the project stopped being “AI summarizer” and became “forensic email cleaner”.

The Link Problem

Newsletter links are often not the actual article links.

They are tracking links.

Something like:

https://newsletter.com/click?url=https%3A%2F%2Factual-article.com

Or worse:

https://tracking-domain.com/CL0/https:%2F%2Factual-site.com%2Fpost

So Pulse tries to recover the real destination.

The logic is intentionally bounded. It resolves at most one redirect and uses a timeout. Because if a newsletter tracker wants to become a distributed systems problem, I politely decline.

async def resolve_redirect(url: str) -> str:
    async with httpx.AsyncClient(timeout=5, follow_redirects=False) as client:
        response = await client.get(url)
        location = response.headers.get("location")
        return location or url

The rule was simple:

Recover useful links, but do not let one link hold the ingestion worker hostage.

If redirect resolution fails, the system keeps the original link or uses the inline newsletter context. Failing open is better than losing the article.

Data Quality Decisions, Also Known As “Please Stop Feeding Trash To The LLM”

LLM calls cost quota. Burning it on garbage input is not AI engineering. It is donation.

So Pulse makes several data quality decisions before enrichment.

1. Skip Tiny Articles

Some records have almost no useful text. A title, a link, maybe three words. Very inspiring. Not worth a model call.

if len(clean_text) < 50:
    article.enrichment_status = "skipped"

In my corpus, 798 records were skipped because they had fewer than 50 useful characters.

That saved hundreds of LLM calls.

The LLM did not need to summarize “Click here”. Thankfully, I was capable enough to handle that complex academic material myself.

2. Deduplicate Aggressively

There are two kinds of duplicate problems:

same source item appears again
same content appears from a different path

Each item gets normalized and hashed.

content_hash = sha256(
    f"{normalized_url}|{clean_title}|{clean_text}".encode()
).hexdigest()

So Pulse uses both source IDs and content hashes.

This matters because ingestion is scheduled. If the same newsletter or feed entry returns again, the system should not create another article and proudly announce, “Good news, I found the same thing again”.

3. Cap Text Before Enrichment

The enrichment worker trims article text before sending it to Groq.

text = clean_body(article.raw_text or "", limit=3000)

This prevents long newsletters from becoming expensive prompt sludge.

The goal of enrichment is not to preserve every footer, tracking disclaimer, and “You are receiving this email because...” paragraph.

LLM Safety: Because JSON Mode Still Has Hobbies

The enrichment model returns structured metadata:

summary
category
importance
entities
keywords

The ideal response is JSON.

The actual response can be JSON, markdown-wrapped JSON, JSON with leading prose, JSON with trailing commas, or JSON wearing a small theatrical costume.

So Pulse does not trust the raw output.

It extracts a JSON object, then validates it with Pydantic. Pydantic supports custom validators for enforcing constraints and cleaning values: Pydantic validator docs.

The schema enforces rules like:

class EnrichmentResult(BaseModel):
    summary: str = Field(min_length=10, max_length=1000)
    category: Category
    importance: int = Field(ge=1, le=5)
    entities: EntityMap
    keywords: list[str]

The Prompt Was Not Trusted Either

The prompt asks for:

two-sentence summary
one supported category
importance from 1 to 5
entities grouped by known keys
5 to 8 keywords
JSON only

But the system still validates everything afterward.

Because prompts are requests, not contracts.

A contract looks like this:

result = parse_enrichment(model_output)
article.summary = result.summary
article.category = result.category
article.importance = result.importance

The database only gets validated output.

If parsing fails, the article is marked failed or retried. It does not sneak into the feed half-broken and become the mobile app’s problem. Frontend developers deserve peace too. Occasionally.

Reliability: Do Not Enrich The Same Article Twice

The enrichment worker claims work using PostgreSQL row locks.

PostgreSQL supports FOR UPDATE SKIP LOCKED, which is useful for queue-like tables where multiple consumers should avoid fighting over the same row. SKIP LOCKED skips rows that cannot be locked immediately: PostgreSQL SELECT docs

Pulse uses that pattern:

statement = (
    select(Article)
    .where(Article.enrichment_status == "pending")
    .order_by(Article.ingested_at.desc())
    .with_for_update(skip_locked=True)
    .limit(1)
)

Once an article is claimed:

article.enrichment_status = "processing"
await session.commit()

Then the worker calls the LLM.

On success:

article.enrichment_status = "done"
article.summary = result.summary
article.embedding = embedding

On quota exhaustion:

article.enrichment_status = "pending"

On handled failure:

article.enrichment_status = "failed"
article.retry_count += 1

The important rule:

Never leave a row stuck in processing.

A stuck processing row is the backend version of getting seen zoned.

Quota Is A Product Feature

Groq enrichment uses a daily quota.

Pulse reserves quota before external calls:

if not await reserve_quota(quota_manager):
    article.enrichment_status = "pending"
    return "quota_exhausted"

This avoids half-started work.

It also lets the system degrade gracefully. If quota is exhausted:

ingestion can still store new articles
feed can still serve old articles
search still works
enrichment waits until quota resets

That is a much better failure mode than “everything exploded because one external service said no”.

External APIs are not loyal friends. They are business relationships.

The Final Shape

After all this, the pipeline became:

Not glamorous.

But reliable.

And once this pipeline exists, the fun features become much easier:

semantic search
hybrid search
daily digest
trends
Ask mode
quizzes
mobile offline cache

AI apps are built on boring data discipline.

Takeaway

I started by thinking:

“I will build an AI feed”.

I ended up learning:

“I will build a defensive ingestion system, and if the data behaves, I may allow an LLM near it”.

The LLM was useful. But only after the input was cleaned, filtered, validated, deduplicated, capped, retried, locked, and politely threatened.

The real architecture was not:

content -> LLM -> magic

It was:

mess -> discipline -> model -> useful product

AI systems are not impressive when they work on spoon-fed input.

They are impressive when they survive the internet.

RAG Is Easy. Useful RAG Is the Hard Part

Ponsubash Raj R — Sat, 04 Jul 2026 13:55:57 +0000

Everybody says “just add RAG” like it is a button in settings.

It is not. I checked. Very disappointing.

The Brief: Personalized News Feeds

Pulse started as a personal AI intelligence feed.

Not a chatbot with a search bar glued to it. Not another app where an LLM confidently explains an article it has never seen. I wanted something more useful:

collect AI engineering content from RSS, GitHub, arXiv, and Gmail newsletters
summarize and classify articles
store embeddings
support exact, semantic, and hybrid search
answer questions from my own corpus
cite the articles it used
say “I do not know” when the corpus has no answer

That last part is important.

A RAG system that cannot say “I do not know” is not intelligent. It is just overconfident autocomplete in formal clothes.

The simple version looked like this:

Very clean. Very incomplete.

The useful version needed much more.

PROJECT REPOSITORY

The Actual System Architecture

Pulse uses a FastAPI backend, PostgreSQL with pgvector, Groq for generation, and an Expo Android app.

At a high level:

For retrieval, the important database columns are:

class Article(Base):
    title: Mapped[str]
    summary: Mapped[str | None]
    category: Mapped[str | None]
    keywords: Mapped[list[str] | None]
    embedding: Mapped[list[float] | None] = mapped_column(Vector(384))
    embedding_model: Mapped[str]
    enrichment_status: Mapped[str]
    hidden: Mapped[bool]

The vector column uses pgvector, which supports vector similarity search inside Postgres including cosine distance and approximate indexes: pgvector README

PostgreSQL also gives full-text search, documented in the PostgreSQL full-text search docs.

So Pulse does not choose between SQL search and vector search.

It uses both.

Because of course one search mode was too peaceful.

Why “Just Use Embeddings” Was Not Enough

Embeddings are useful. They are not magic.

If the user searches:

on-device foundation models

semantic search is great. It can find articles about local AI, small models, mobile inference, and related topics even if the exact words do not match.

But if the user searches:

Anthropic

exact search is often better. The word itself matters. I do not need a poetic interpretation of Anthropic. I need articles that mention Anthropic.

This is where pure vector search becomes annoying.

Vector search is good at meaning. Full-text search is good at exact language. A useful product usually needs both.

So Pulse supports three modes:

Exact      -> PostgreSQL full-text search
Semantic   -> pgvector cosine similarity
Hybrid     -> merge both result sets

Search Mode 1: Exact Search

Exact search uses PostgreSQL full-text search.

This works well for names, tools, companies, and terms that should match literally.

It is also fast and boring.

But boring is underrated. Many production systems are just boring things that work while exciting things are busy timing out.

Search Mode 2: Semantic Search

Semantic search embeds the query and compares it with article embeddings using cosine distance.

query_embedding = await call_embedder(query_text)

distance = Article.embedding.cosine_distance(query_embedding)

rows = await session.execute(
    select(Article, distance)
    .where(
        Article.enrichment_status == "done",
        Article.embedding.is_not(None),
        Article.hidden.is_(False),
    )
    .order_by(distance, Article.ingested_at.desc())
    .limit(limit)
)

Search Mode 3: Hybrid Search

Hybrid search combines exact and semantic results using Reciprocal Rank Fusion.

The idea is simple:

score = 1 / (k + rank)

If an article ranks well in exact search and semantic search, it rises. If it ranks well in only one, it still has a chance.

We merge both result lists:

scores[article_id] += rrf_score(exact_rank)
scores[article_id] += rrf_score(semantic_rank)

This made hybrid the default.

Why?

Because users do not wake up thinking:

“Today I shall formulate a query that is best served by cosine similarity.”

They type words. The system should adapt.

Hybrid search lets exact names win when they should, while semantic matches still catch broader ideas.

Ask Mode: RAG With Brakes

The Ask mode is where retrieval becomes generation.

The user asks:

What are the recent themes around AI coding tools?

Pulse does this:

Here, the rejection step matters.

If the top retrieved articles are weak, Pulse does not call the LLM.

This is not a failure.

This is the product behaving responsibly.

If I ask:

What is the weather in Mumbai?

Pulse should not a produce meteorology fan fiction.

It should say:

I do not have enough relevant context in the corpus.

Prompting With Context, Not Hope

The Ask prompt includes only controlled context:

Article ID
Title
Summary
URL
Similarity score
Recent conversation messages

Not raw HTML. Not full article bodies. Not the entire database. Not “please be accurate” as a magical spell.

A simplified prompt shape:

def build_ask_prompt(question, articles):
    context = "\n\n".join(
        f"[{article.id}]\n"
        f"Title: {article.title}\n"
        f"Summary: {article.summary}\n"
        f"URL: {article.url}"
        for article in articles
    )

    return f"""
Answer the user using only the context below.
If the context is not enough, say so.

Context:
{context}

Question:
{question}
"""

The answer includes citations back to article IDs and URLs.

This keeps the system grounded.

Not perfectly. Nothing with an LLM is perfect. But much better than letting the model free-climb the truth.

Personalization: Ranking Is Also Retrieval

Search is not the only retrieval problem.

The feed itself is retrieval.

Pulse learns from reading behavior:

short reads are weak signals
longer reads are stronger signals
read categories update category weights
article keywords update interest terms
bookmarks and hidden articles affect what should appear

The engagement score is intentionally simple:

def engagement_signal(duration_seconds: int):
    if duration_seconds < 5:
        return None
    if duration_seconds < 30:
        return 0.2
    if duration_seconds < 120:
        return 0.5
    return 1.0

No fake machine learning ceremony. No “neural preference engine” because I read one article for 14 seconds.

Category weights use an exponential moving average:

new_weight = old_weight + alpha * (signal - old_weight)

The feed score combines:

importance + category preference + recency + keyword overlap

Learning Features: RAG Was Only One Part Of The Loop

Once articles are cleaned, summarized, embedded, and ranked, other AI features become easier.

Pulse uses the same enriched corpus for:

1. Daily Digest

The digest selects recent high-importance enriched articles and asks Groq for a three-paragraph briefing.

This is not just summarization. It is scheduled synthesis.

2. Trends

Trend detection scans enriched entities from recent articles.

for entity in article.entities:
    mentions[normalized_entity].add(article.id)

trends = [
    entity for entity, article_ids in mentions.items()
    if len(article_ids) >= 3
]

This lets the app show repeated topics like companies, models, tools, or research themes.

3. LangGraph Quiz Agent

For learning retention, Pulse generates three-question quizzes from an article summary and entities.

LangGraph is useful for modeling multi-step agent flows.

Pulse uses the quiz flow for:

Quiz sessions are stored server-side with expiry. The answer key is not trusted from the client.

Because yes, even in a personal app, the client should not grade itself.

The Product Rule: Retrieval Before Generation

The biggest design rule became:

Retrieve first. Generate second. Refuse when retrieval is weak.

That rule shows up everywhere:

Search can run without Groq.
Ask mode refuses unrelated questions before spending quota.
Digest uses selected articles, not the entire database.
Quiz generation only works on enriched articles.
Feed ranking uses stored signals, not live model calls.

This made the system cheaper, faster, and less ridiculous.

LLMs are powerful. They are also expensive, rate-limited, and occasionally very committed to being wrong.

So Pulse uses them where they add value, and keeps boring deterministic code around them.

The Final Shape

The final RAG architecture looked like this:

That is more work than:

documents -> embeddings -> chatbot

Takeaway

RAG is easy when the input data is clean, the query is friendly, and nobody asks anything weird.

Useful RAG is different.

Useful RAG needs:

clean source data
validated enrichment
exact search
semantic search
hybrid ranking
relevance thresholds
citations
refusal paths
personalization

The hard part is not putting vectors in a database.

The hard part is deciding when the vector result is not good enough.

The hard part is not calling the LLM.

The hard part is knowing when not to call it.

That is what made Pulse useful.

Not because it could answer everything.

Because it knew when it could not.