<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Anurag Srivastava</title>
    <description>The latest articles on DEV Community by Anurag Srivastava (@anuragmerndev).</description>
    <link>https://dev.to/anuragmerndev</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1458057%2F413aa169-6a52-4db2-bcb5-99cf99772724.jpeg</url>
      <title>DEV Community: Anurag Srivastava</title>
      <link>https://dev.to/anuragmerndev</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/anuragmerndev"/>
    <language>en</language>
    <item>
      <title>Building AI SaaS MVP, the right way</title>
      <dc:creator>Anurag Srivastava</dc:creator>
      <pubDate>Mon, 04 May 2026 18:18:27 +0000</pubDate>
      <link>https://dev.to/anuragmerndev/the-ai-feature-is-the-easy-part-3e7l</link>
      <guid>https://dev.to/anuragmerndev/the-ai-feature-is-the-easy-part-3e7l</guid>
      <description>&lt;p&gt;Adding AI to a product takes an afternoon. An API key, a prompt, a fetch call. Done.&lt;/p&gt;

&lt;p&gt;Building the system that runs that AI feature in production is a different problem entirely. How do you isolate data between tenants so Org A never sees Org B's rows? How do you bill for usage and actually block access when the quota runs out? How do you build a dashboard that shows real numbers instead of placeholder charts?&lt;/p&gt;

&lt;p&gt;I built &lt;a href="https://tubedigest-web.vercel.app/" rel="noopener noreferrer"&gt;TubeDigest&lt;/a&gt;, a YouTube video summarizer. Teams paste a video URL, get a summary. The summarization is one OpenAI call. The multi-tenant org structure, subscription billing, usage enforcement, caching layer -- that's where the actual work went.&lt;/p&gt;

&lt;p&gt;This post is about all of that.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F93anfw2hofthvv1skse0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F93anfw2hofthvv1skse0.png" alt="TubeDigest landing page with hero text " width="800" height="452"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  What's actually running
&lt;/h2&gt;

&lt;p&gt;On the surface: paste a YouTube URL, get a summary. Behind that:&lt;/p&gt;

&lt;p&gt;Postgres Row Level Security handles tenant data isolation. Org A can't query Org B's rows even if the application code has a bug. The database enforces it, not my WHERE clauses.&lt;/p&gt;

&lt;p&gt;Billing runs through Dodo Payments with real subscriptions. Free tier has a monthly cap. Hit it, you're blocked. Upgrade, the limit goes up. Webhooks handle the whole lifecycle.&lt;/p&gt;

&lt;p&gt;Organizations are the tenant boundary. Invite team members, assign roles (owner or member), manage access. Actual multi-user orgs, not single-user accounts with a team label stapled on.&lt;/p&gt;

&lt;p&gt;There's also a caching layer in front of the AI pipeline. If any org has already summarized a particular video, the cached result gets returned. One transcript extraction and one OpenAI call serve every future request for that same video.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F32i1kf7e927az96kupz3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F32i1kf7e927az96kupz3.png" alt="Summarizer page with a YouTube URL pasted, showing a completed summary for a video about Bhangarh Fort with video thumbnail, summary text, copy and open buttons" width="800" height="428"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Tech stack
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Choice&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Frontend&lt;/td&gt;
&lt;td&gt;Next.js (App Router) + Tailwind CSS + shadcn/ui&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Backend&lt;/td&gt;
&lt;td&gt;NestJS (separate service)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Database&lt;/td&gt;
&lt;td&gt;PostgreSQL on Neon&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ORM&lt;/td&gt;
&lt;td&gt;TypeORM&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Auth&lt;/td&gt;
&lt;td&gt;Clerk&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Billing&lt;/td&gt;
&lt;td&gt;Dodo Payments&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AI&lt;/td&gt;
&lt;td&gt;OpenAI API&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Monorepo&lt;/td&gt;
&lt;td&gt;Turborepo + pnpm&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Frontend and backend are separate services, deployed independently. The frontend handles UI and auth. The backend owns the data, billing logic, and AI pipeline. They talk over REST -- 17 endpoints, all documented with Swagger.&lt;/p&gt;

&lt;p&gt;I split them because that's how I'd build a client's SaaS product. Not because it looks good on a diagram.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1i5qm179rhlpu79dgydz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1i5qm179rhlpu79dgydz.png" alt="TubeDigest architecture diagram showing Next.js frontend on Vercel, NestJS backend on Render, PostgreSQL on Neon with RLS, and connections to Clerk, Dodo Payments, OpenAI, and YouTube Captions API" width="804" height="583"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Data isolation
&lt;/h2&gt;

&lt;p&gt;Most multi-tenant apps put &lt;code&gt;WHERE org_id = ?&lt;/code&gt; on every query and hope nobody ever misses one. That works until someone forgets, and then it's a breach.&lt;/p&gt;

&lt;p&gt;TubeDigest uses Postgres Row Level Security. The database enforces who sees what.&lt;/p&gt;

&lt;p&gt;Every request goes through middleware that verifies the Clerk JWT, pulls the org_id from it, and sets it on the database connection:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="k"&gt;LOCAL&lt;/span&gt; &lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;org_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'the-org-id'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
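&lt;p&gt;The per-request wiring can be sketched like this. This is a sketch with names of my own (&lt;code&gt;withTenant&lt;/code&gt;, &lt;code&gt;FakeDb&lt;/code&gt;), not TubeDigest's actual code; it uses &lt;code&gt;set_config(..., true)&lt;/code&gt;, the parameterizable equivalent of &lt;code&gt;SET LOCAL&lt;/code&gt;:&lt;/p&gt;

```typescript
// Sketch of per-request tenant wiring. FakeDb stands in for a TypeORM
// DataSource so the shape is visible without a live Postgres; the real
// version would call dataSource.transaction() and manager.query().
class FakeDb {
  executed: string[] = [];
  async query(sql: string, params: string[] = []) {
    this.executed.push(sql.replace('$1', params[0] ?? ''));
  }
  async transaction(work: (m: FakeDb) => unknown) {
    await work(this); // real version wraps this in BEGIN ... COMMIT
  }
}

// Pin the org id for the duration of one transaction. set_config with
// is_local = true behaves like SET LOCAL: it resets at commit, so a
// pooled connection carries no tenant state into the next request.
async function withTenant(db: FakeDb, orgId: string, work: (m: FakeDb) => unknown) {
  await db.transaction(async (m: FakeDb) => {
    await m.query("SELECT set_config('app.org_id', $1, true)", [orgId]);
    await work(m); // every query in here is filtered by the RLS policies
  });
}
```

&lt;p&gt;Plain &lt;code&gt;SET LOCAL&lt;/code&gt; won't accept a bind parameter, which is why the &lt;code&gt;set_config&lt;/code&gt; form is the usual choice here.&lt;/p&gt;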


&lt;p&gt;From then on, every query on tenant-scoped tables gets filtered by the database engine, not by application code. RLS has to be switched on per table with &lt;code&gt;ALTER TABLE ... ENABLE ROW LEVEL SECURITY&lt;/code&gt;; the policy then does the filtering:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;POLICY&lt;/span&gt; &lt;span class="n"&gt;tenant_isolation&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt;
  &lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;org_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;current_setting&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'app.org_id'&lt;/span&gt;&lt;span class="p"&gt;)::&lt;/span&gt;&lt;span class="n"&gt;uuid&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This covers &lt;code&gt;users&lt;/code&gt;, &lt;code&gt;invitations&lt;/code&gt;, &lt;code&gt;subscriptions&lt;/code&gt;, &lt;code&gt;usage_records&lt;/code&gt;, and &lt;code&gt;user_summaries&lt;/code&gt;. The &lt;code&gt;videos&lt;/code&gt; table has no RLS because it's the shared cache. All orgs read from it.&lt;/p&gt;

&lt;p&gt;If my code has a bug that skips the org filter, the database catches it anyway. The safety net is in the infrastructure, not in my discipline.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsb5blg3obds5ta4ycs18.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsb5blg3obds5ta4ycs18.png" alt="Workspace settings Members tab showing 3 members with roles (1 Owner, 2 Members), an invite form with email input and role selector, and joined dates" width="800" height="452"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  Billing that works end to end
&lt;/h2&gt;

&lt;p&gt;A lot of SaaS demos have a pricing page and a test checkout button that goes nowhere. TubeDigest has a billing system that processes real payments.&lt;/p&gt;

&lt;p&gt;Dodo Payments handles subscriptions. Checkout sessions, webhook events, billing portal -- standard payment gateway patterns:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Sign up. Land on Free tier with a monthly summary cap&lt;/li&gt;
&lt;li&gt;Use up the quota. API returns 429s&lt;/li&gt;
&lt;li&gt;Click upgrade. Dodo checkout. Payment processes. Webhook fires&lt;/li&gt;
&lt;li&gt;Backend catches the webhook. Updates the subscription. Raises the limit&lt;/li&gt;
&lt;li&gt;User manages billing through the self-service portal&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Usage tracking is per org, per billing period. Every summary request increments a counter. The system knows how many summaries each org has used, when the period resets, and what the cap is. The dashboard shows all of this live.&lt;/p&gt;
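&lt;p&gt;The enforcement itself is small. A sketch of the gate (my names, not the repo's) that runs before any transcript or OpenAI work happens:&lt;/p&gt;

```typescript
// Reject over-quota requests with a 429 before doing any paid work.
type Usage = { used: number; limit: number; periodEnd: Date };

class QuotaError extends Error {
  status = 429;
  constructor() {
    super('Monthly summary limit reached. Upgrade to raise it.');
  }
}

function assertWithinQuota(usage: Usage, now: Date) {
  // A period that already ended should have been reset by the billing
  // webhook; treat it as fresh rather than blocking the user.
  if (now.getTime() > usage.periodEnd.getTime()) return;
  if (usage.used >= usage.limit) throw new QuotaError();
}
```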

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F61mtp0a61yo9d8d7uisu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F61mtp0a61yo9d8d7uisu.png" alt="Workspace settings Billing tab showing Free plan, usage this period for May 2026, summaries and seats counts, upgrade to Pro option at $10/mo, and link to Dodo customer portal" width="800" height="451"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  Duplicate payments
&lt;/h2&gt;

&lt;p&gt;This problem doesn't show up in demos. It shows up in production, and it wrecks things.&lt;/p&gt;

&lt;p&gt;User clicks checkout twice. Network hiccup makes Dodo fire the same webhook twice. User opens checkout in two browser tabs. All of these cause the same issue: duplicate payment processing. Double charges, phantom subscription upgrades, usage limits applied twice.&lt;/p&gt;

&lt;p&gt;TubeDigest processes webhooks idempotently. Every incoming event gets checked against what's already been processed. If the system has seen that event before, it acknowledges it and moves on. No state change. The subscription table stays clean no matter how many times the same event arrives.&lt;/p&gt;
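&lt;p&gt;The core of it can be sketched in a few lines (my names; I'm assuming Dodo's events carry a unique id, as payment webhooks generally do):&lt;/p&gt;

```typescript
// Idempotent webhook handling: remember processed event ids, acknowledge
// replays without changing state. In the real system the check and the
// state change share one transaction, with a unique index on the event
// id so two concurrent deliveries can't both slip through.
class WebhookProcessor {
  private seen = new Set();

  handle(event: { id: string; type: string }, apply: (e: object) => void) {
    if (this.seen.has(event.id)) {
      return 'duplicate'; // respond 200, no state change
    }
    this.seen.add(event.id);
    apply(event);
    return 'processed';
  }
}
```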

&lt;p&gt;Not a lot of code. But this is the kind of thing that separates a demo from a system that handles real money.&lt;/p&gt;


&lt;h2&gt;
  
  
  The AI pipeline
&lt;/h2&gt;

&lt;p&gt;The summarizer is simple on purpose. I wanted the complexity in the infrastructure, not the AI call. Here's the flow:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;User submits a YouTube URL&lt;/li&gt;
&lt;li&gt;Backend extracts the video ID, checks the &lt;code&gt;videos&lt;/code&gt; cache table&lt;/li&gt;
&lt;li&gt;Cache miss: pull transcript via YouTube's captions API&lt;/li&gt;
&lt;li&gt;Truncate to about 4000 tokens (roughly 20 minutes of spoken content)&lt;/li&gt;
&lt;li&gt;Send to OpenAI, get summary back&lt;/li&gt;
&lt;li&gt;Store transcript and summary in the cache, keyed by video ID&lt;/li&gt;
&lt;li&gt;Log the request in &lt;code&gt;user_summaries&lt;/code&gt;, increment the org's usage counter&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Cache hit? Skip steps 3 through 6. Return the stored summary immediately. A YouTube video that gets summarized by 500 different orgs costs one API call. Not 500.&lt;/p&gt;

&lt;p&gt;No captions on the video? Return a 422 with a clear message. Don't touch the usage counter. Don't charge for failures.&lt;/p&gt;
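&lt;p&gt;Put together, the flow looks roughly like this. A sketch, not the repo's code: the transcript fetch and the OpenAI call are injected stubs, and the cache is an in-memory map standing in for the &lt;code&gt;videos&lt;/code&gt; table:&lt;/p&gt;

```typescript
// videoId -> summary, shared by every org (the videos table in the real system)
const videoCache = new Map();

// deps.fetchTranscript(videoId) resolves to a transcript string, or null
// when the video has no captions; deps.summarize(text) is the OpenAI call.
async function summarizeVideo(videoId: string, deps: any) {
  const hit = videoCache.get(videoId);
  if (hit) return { summary: hit, cached: true }; // skip transcript + OpenAI

  const transcript = await deps.fetchTranscript(videoId);
  if (transcript === null) {
    // No captions: clear 422, and the usage counter is never touched.
    throw Object.assign(new Error('Video has no captions'), { status: 422 });
  }

  // Rough cut at ~4000 tokens, using the common ~4 characters/token rule.
  const clipped = transcript.slice(0, 16000);
  const summary = await deps.summarize(clipped);
  videoCache.set(videoId, summary);
  return { summary, cached: false };
}
```

&lt;p&gt;The usage increment and the &lt;code&gt;user_summaries&lt;/code&gt; log entry happen only after a successful return, never on the 422 path.&lt;/p&gt;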

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv88x55xgpleirsb55mze.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv88x55xgpleirsb55mze.png" alt="Summary history page showing 7 summaries with video thumbnails, titles, preview text, dates, and view/open links for each entry" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  Dashboard with real data
&lt;/h2&gt;

&lt;p&gt;Building a dashboard is easy. Filling it with real data is the hard part, because the data pipeline has to exist first.&lt;/p&gt;

&lt;p&gt;TubeDigest tracks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Summaries used this period, out of the plan's limit. Live counter&lt;/li&gt;
&lt;li&gt;Recent activity. Last 5-10 summaries with video titles and timestamps&lt;/li&gt;
&lt;li&gt;Daily usage. Bar chart of summaries per day across the billing period&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All of that comes from two tables: &lt;code&gt;usage_records&lt;/code&gt; and &lt;code&gt;user_summaries&lt;/code&gt;. No analytics service. No third party dashboard tool. The same data pipeline that tracks billing also feeds the dashboard.&lt;/p&gt;
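&lt;p&gt;The daily chart, for example, is one aggregation over &lt;code&gt;user_summaries&lt;/code&gt;. Sketched in memory to show the shape of the data (the real version would be a single &lt;code&gt;GROUP BY&lt;/code&gt; on the created-at date, with RLS scoping the rows):&lt;/p&gt;

```typescript
// Roll summary records up into per-day counts for the bar chart.
type SummaryRecord = { orgId: string; createdAt: Date };

function dailyUsage(records: SummaryRecord[], orgId: string) {
  const byDay = new Map(); // 'YYYY-MM-DD' -> count
  for (const r of records) {
    if (r.orgId !== orgId) continue; // in the real query, RLS handles this
    const day = r.createdAt.toISOString().slice(0, 10);
    byDay.set(day, (byDay.get(day) ?? 0) + 1);
  }
  return byDay;
}
```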

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F13mpuunzmqg7gux8qqbk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F13mpuunzmqg7gux8qqbk.png" alt="TubeDigest dashboard showing 7 summaries used out of 10 this month, 3 remaining, Free plan, a daily usage bar chart over 30 days, and recent activity list with 5 YouTube video summaries" width="800" height="452"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  Project structure
&lt;/h2&gt;

&lt;p&gt;Monorepo with Turborepo and pnpm:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tubedigest/
├── apps/
│   ├── web/     # Next.js frontend → Vercel
│   └── api/     # NestJS backend → Render
├── packages/    # Shared types
├── turbo.json
└── pnpm-workspace.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;CI runs lint, type-check, and build on every push via GitHub Actions. Vercel and Render auto-deploy when the pipeline passes. Swagger docs live at &lt;code&gt;/api/docs&lt;/code&gt; with request and response schemas for every endpoint.&lt;/p&gt;


&lt;h2&gt;
  
  
  What I'd add at scale
&lt;/h2&gt;

&lt;p&gt;These aren't oversights; they're deliberately deferred. If TubeDigest had real traffic, they'd be next:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Async processing. Summarization is synchronous right now. At scale, a job queue with Bull and Redis would handle long videos without blocking request threads.&lt;/li&gt;
&lt;li&gt;Chunk-and-merge for long transcripts. I truncate at about 4000 tokens, roughly 20 minutes of speech. A chunk, summarize, merge pipeline would handle 3-hour lectures, at the cost of more API calls.&lt;/li&gt;
&lt;li&gt;Integration tests for billing webhooks and RLS policies. The two most important paths in the system deserve the most test coverage.&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;

&lt;p&gt;Live demo: &lt;a href="https://tubedigest-web.vercel.app/" rel="noopener noreferrer"&gt;tubedigest-web.vercel.app&lt;/a&gt;&lt;/p&gt;


&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/anuragmerndev" rel="noopener noreferrer"&gt;
        anuragmerndev
      &lt;/a&gt; / &lt;a href="https://github.com/anuragmerndev/tubedigest" rel="noopener noreferrer"&gt;
        tubedigest
      &lt;/a&gt;
    &lt;/h2&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;TubeDigest&lt;/h1&gt;
&lt;/div&gt;

&lt;p&gt;A multi-tenant SaaS YouTube video summarizer. Paste a YouTube URL, get a concise AI-generated summary. Built with organization-based multi-tenancy, usage-based billing, and role-based access control.&lt;/p&gt;

&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Tech Stack&lt;/h2&gt;
&lt;/div&gt;

&lt;p&gt;&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;br&gt;
&lt;thead&gt;
&lt;br&gt;
&lt;tr&gt;
&lt;br&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;br&gt;
&lt;th&gt;Technology&lt;/th&gt;
&lt;br&gt;
&lt;/tr&gt;
&lt;br&gt;
&lt;/thead&gt;
&lt;br&gt;
&lt;tbody&gt;
&lt;br&gt;
&lt;tr&gt;
&lt;br&gt;
&lt;td&gt;Monorepo&lt;/td&gt;
&lt;br&gt;
&lt;td&gt;Turborepo + pnpm workspaces&lt;/td&gt;
&lt;br&gt;
&lt;/tr&gt;
&lt;br&gt;
&lt;tr&gt;
&lt;br&gt;
&lt;td&gt;Frontend&lt;/td&gt;
&lt;br&gt;
&lt;td&gt;Next.js (App Router) + Tailwind CSS + shadcn/ui&lt;/td&gt;
&lt;br&gt;
&lt;/tr&gt;
&lt;br&gt;
&lt;tr&gt;
&lt;br&gt;
&lt;td&gt;Backend&lt;/td&gt;
&lt;br&gt;
&lt;td&gt;NestJS&lt;/td&gt;
&lt;br&gt;
&lt;/tr&gt;
&lt;br&gt;
&lt;tr&gt;
&lt;br&gt;
&lt;td&gt;Auth&lt;/td&gt;
&lt;br&gt;
&lt;td&gt;Clerk&lt;/td&gt;
&lt;br&gt;
&lt;/tr&gt;
&lt;br&gt;
&lt;tr&gt;
&lt;br&gt;
&lt;td&gt;Database&lt;/td&gt;
&lt;br&gt;
&lt;td&gt;PostgreSQL (Neon) + TypeORM&lt;/td&gt;
&lt;br&gt;
&lt;/tr&gt;
&lt;br&gt;
&lt;tr&gt;
&lt;br&gt;
&lt;td&gt;Multi-tenancy&lt;/td&gt;
&lt;br&gt;
&lt;td&gt;Row-Level Security (RLS)&lt;/td&gt;
&lt;br&gt;
&lt;/tr&gt;
&lt;br&gt;
&lt;tr&gt;
&lt;br&gt;
&lt;td&gt;Billing&lt;/td&gt;
&lt;br&gt;
&lt;td&gt;Dodo Payments&lt;/td&gt;
&lt;br&gt;
&lt;/tr&gt;
&lt;br&gt;
&lt;tr&gt;
&lt;br&gt;
&lt;td&gt;AI&lt;/td&gt;
&lt;br&gt;
&lt;td&gt;OpenAI API&lt;/td&gt;
&lt;br&gt;
&lt;/tr&gt;
&lt;br&gt;
&lt;/tbody&gt;
&lt;br&gt;
&lt;/table&gt;&lt;/div&gt;&lt;/p&gt;

&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Features&lt;/h2&gt;
&lt;/div&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;AI Summarization&lt;/strong&gt; — paste a YouTube URL, get a summary powered by OpenAI&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-Tenancy&lt;/strong&gt; — organization-based isolation with Postgres RLS&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Role-Based Access&lt;/strong&gt; — owner and member roles with backend-enforced permissions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Usage Tracking&lt;/strong&gt; — per-organization usage limits and daily usage charts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Billing&lt;/strong&gt; — subscription management via Dodo Payments&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Team Management&lt;/strong&gt; — invite members, manage roles, revoke access&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Video Caching&lt;/strong&gt; — same video = reuse cached summary, saving API costs&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Project Structure&lt;/h2&gt;

&lt;/div&gt;

&lt;div class="snippet-clipboard-content notranslate position-relative overflow-auto"&gt;
&lt;pre class="notranslate"&gt;&lt;code&gt;tubedigest/
├── apps/
│   ├── web/        # Next.js frontend
│   └── api/        # NestJS backend
├──&lt;/code&gt;&lt;/pre&gt;…&lt;/div&gt;
&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/anuragmerndev/tubedigest" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;



&lt;p&gt;The AI feature took a day. The system around it took the rest of the week. That ratio tells you where the real engineering is.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>saas</category>
      <category>nextjs</category>
      <category>mvp</category>
    </item>
    <item>
      <title>I built a production RAG pipeline. Here's what most tutorials skip.</title>
      <dc:creator>Anurag Srivastava</dc:creator>
      <pubDate>Mon, 13 Apr 2026 19:21:21 +0000</pubDate>
      <link>https://dev.to/anuragmerndev/i-built-a-production-rag-pipeline-heres-what-most-tutorials-skip-272n</link>
      <guid>https://dev.to/anuragmerndev/i-built-a-production-rag-pipeline-heres-what-most-tutorials-skip-272n</guid>
      <description>&lt;p&gt;I wanted a RAG system that was fast to run and fast to set up for clients. Upload a PDF, ask questions, get answers with citations. Pretty standard stuff for anyone freelancing in the AI space.&lt;/p&gt;

&lt;p&gt;The problem is that every tutorial I found stops at a Jupyter notebook. Working demo, zero production readiness. No auth, no caching, no way to handle more than one user. The happy path, and nothing else.&lt;/p&gt;

&lt;p&gt;So I built the whole thing. Deployed, running, something I can actually show to clients.&lt;/p&gt;

&lt;p&gt;Here's what that looked like.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://adv-rag-ui.vercel.app" rel="noopener noreferrer"&gt;Live Demo&lt;/a&gt;&lt;/strong&gt; | &lt;strong&gt;&lt;a href="https://github.com/anuragmerndev/adv-rag" rel="noopener noreferrer"&gt;Backend Repo&lt;/a&gt;&lt;/strong&gt; | &lt;strong&gt;&lt;a href="https://github.com/anuragmerndev/adv-rag-ui" rel="noopener noreferrer"&gt;Frontend Repo&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flqtsvm1ykn2t2mspd0ey.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flqtsvm1ykn2t2mspd0ey.png" alt="AdvChat conversation showing a streamed answer about the backwards law with five source citation chips expanded below" width="800" height="520"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Following a question through the pipeline
&lt;/h2&gt;

&lt;p&gt;The easiest way to explain this system is to trace what happens when someone asks a question.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F243m7hli7bwp26g7zpff.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F243m7hli7bwp26g7zpff.png" alt="Architecture" width="883" height="548"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Say a user uploaded a 150-page book and types: &lt;em&gt;"What's the main argument of chapter 3?"&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The fingerprint problem
&lt;/h3&gt;

&lt;p&gt;First thing I need to figure out: have I seen this question before?&lt;/p&gt;

&lt;p&gt;Not this exact string. That's easy. But what about &lt;em&gt;"what is the main argument of chapter 3"&lt;/em&gt; without the contraction? Or &lt;em&gt;"What's the main argument of chapter 3??"&lt;/em&gt; with extra punctuation? Different casing?&lt;/p&gt;

&lt;p&gt;Same question, four different strings. A naive string comparison treats each one as unique, and each one costs an OpenAI embedding call. That adds up.&lt;/p&gt;

&lt;p&gt;The fix I landed on: normalize the query first. Expand contractions, strip punctuation, remove stopwords, collapse whitespace. Then SHA-256 the cleaned string.&lt;/p&gt;

&lt;p&gt;The normalization and fingerprinting code:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;normalizeQuery&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toLowerCase&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;trim&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

    &lt;span class="c1"&gt;// "don't" → "do not", "what's" → "what is"&lt;/span&gt;
    &lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;v&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nb"&gt;Object&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;entries&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;CONTRACTIONS&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;RegExp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s2"&gt;b&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;k&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s2"&gt;b`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;g&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="nx"&gt;v&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="nx"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;/&lt;/span&gt;&lt;span class="se"&gt;[^&lt;/span&gt;&lt;span class="sr"&gt;a-z0-9&lt;/span&gt;&lt;span class="se"&gt;\s]&lt;/span&gt;&lt;span class="sr"&gt;/g&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt; &lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;// strip punctuation&lt;/span&gt;
    &lt;span class="nx"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;/&lt;/span&gt;&lt;span class="se"&gt;\s&lt;/span&gt;&lt;span class="sr"&gt;+/g&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt; &lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;           &lt;span class="c1"&gt;// collapse spaces&lt;/span&gt;

    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;words&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt; &lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;w&lt;/span&gt; &lt;span class="o"&gt;=&amp;amp;&lt;/span&gt;&lt;span class="nx"&gt;gt&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;STOPWORDS&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;has&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;w&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;words&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt; &lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;trim&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;fingerprint&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;crypto&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createHash&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;sha256&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;normalizeQuery&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;query&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;digest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;hex&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;All four variations produce the same hash. One fingerprint, one cache key. The embedding cache and response cache both use this, so every performance shortcut downstream starts here.&lt;/p&gt;
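&lt;p&gt;To see it end to end, here's a self-contained version with a toy contraction and stopword list (the real lists are longer):&lt;/p&gt;

```typescript
import { createHash } from 'crypto';

// Toy lists for the demo; the real ones cover far more cases.
const CONTRACTIONS = { "what's": 'what is', "don't": 'do not' };
const STOPWORDS = new Set(['the', 'of', 'is', 'a']);

function normalizeQuery(input: string): string {
  let text = input.toLowerCase().trim();
  for (const [k, v] of Object.entries(CONTRACTIONS)) {
    text = text.replace(new RegExp(`\\b${k}\\b`, 'g'), v);
  }
  text = text.replace(/[^a-z0-9\s]/g, ' '); // strip punctuation
  text = text.replace(/\s+/g, ' ');         // collapse spaces
  return text.split(' ').filter((w) => !STOPWORDS.has(w)).join(' ').trim();
}

const fingerprint = (q: string) =>
  createHash('sha256').update(normalizeQuery(q)).digest('hex');

const variants = [
  "What's the main argument of chapter 3?",
  'what is the main argument of chapter 3',
  "What's the main argument of chapter 3??",
  "WHAT'S THE MAIN ARGUMENT OF CHAPTER 3",
];
const keys = new Set(variants.map(fingerprint));
// keys.size === 1: one cache key for all four phrasings
```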

&lt;h3&gt;
  
  
  Cache check
&lt;/h3&gt;

&lt;p&gt;The pipeline looks in Redis for a cached embedding vector at &lt;code&gt;emb:{fingerprint}&lt;/code&gt;. If it finds a 1536-dimensional array there, it skips the OpenAI embedding call. About 200ms saved before we even talk to the vector database.&lt;/p&gt;

&lt;p&gt;The embedding cache logic:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;queryEmbedding&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;[];&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;cached&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;cacheService&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getCache&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`emb:&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;fingerprint&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;cached&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;queryEmbedding&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;cached&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;queryEmbedding&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;embeddingService&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getEmbedding&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;query&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;cacheService&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setCache&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="s2"&gt;`emb:&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;fingerprint&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;queryEmbedding&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;There's also a response cache (&lt;code&gt;resp:{fingerprint}&lt;/code&gt;) that stores complete LLM answers. During testing I found it breaks in conversation mode: the same question asked as a follow-up has different context because of chat history, so the cached answer would be wrong. The response cache only kicks in for standalone queries; the embedding cache works everywhere.&lt;/p&gt;
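&lt;p&gt;The gate itself can be a one-liner: produce a &lt;code&gt;resp:&lt;/code&gt; key only when the conversation has no history. A sketch with illustrative names, not the actual service wiring:&lt;/p&gt;

```typescript
// Decide whether the response cache applies: standalone queries only.
// Follow-ups carry chat history, so a cached answer could be stale or wrong.
function responseCacheKey(fingerprint: string, historyLength: number): string | null {
  return historyLength === 0 ? `resp:${fingerprint}` : null;
}
```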

&lt;p&gt;I added an &lt;code&gt;X-Cache-Embed&lt;/code&gt; response header so the frontend knows which path was taken. Helps when debugging latency issues in production.&lt;/p&gt;

&lt;h3&gt;
  
  
  Vector search
&lt;/h3&gt;

&lt;p&gt;The embedding goes to Pinecone for similarity search. Multi-tenancy is where things get interesting.&lt;/p&gt;

&lt;p&gt;Most tutorials scope queries with metadata filters: &lt;code&gt;where: { userId: 'abc' }&lt;/code&gt;. That works until someone forgets the filter and suddenly users can see each other's documents.&lt;/p&gt;

&lt;p&gt;I used Pinecone namespaces instead. Each user's vectors live in a separate namespace. It's not a filter you can forget to add. The data is physically separated.&lt;/p&gt;

&lt;p&gt;Namespace-scoped operations&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Every operation is scoped to the user's namespace&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;index&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;namespace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;upsert&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;records&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;vectors&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;slice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;BATCH&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;index&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;namespace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;vector&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;queryEmbedding&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;topK&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;includeMetadata&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;index&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;namespace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;deleteMany&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;ids&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  Redacting context before it reaches the LLM
&lt;/h3&gt;

&lt;p&gt;Pinecone returns the top-K matching chunks. Before they go to the LLM, there's a step most tutorials ignore completely.&lt;/p&gt;

&lt;p&gt;If someone uploads a security doc with phrases like "bypass authentication" or "disable firewall rules," and asks the right question, the LLM will repeat those instructions. It's in the prompt context. The model doesn't know it shouldn't say that.&lt;/p&gt;

&lt;p&gt;So I scan the retrieved chunks for suspicious patterns and replace them with &lt;code&gt;[REDACTED_REASON]&lt;/code&gt; before the LLM ever sees the context. The response includes a policy field (&lt;code&gt;allow&lt;/code&gt; or &lt;code&gt;partial&lt;/code&gt;) so the client knows whether redaction happened.&lt;/p&gt;

&lt;p&gt;Pre-LLM redaction filter&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;preFilterDocs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;amp;&lt;/span&gt;&lt;span class="nx"&gt;gt&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;suspicious&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;
        &lt;span class="sr"&gt;/&lt;/span&gt;&lt;span class="se"&gt;\b(&lt;/span&gt;&lt;span class="sr"&gt;bypass|disable|ignore rules|unrestricted|open firewall|run arbitrary&lt;/span&gt;&lt;span class="se"&gt;)\b&lt;/span&gt;&lt;span class="sr"&gt;/gi&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;redacted&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;suspicious&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;[REDACTED_REASON]&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;found&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;suspicious&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;test&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;redacted&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;found&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;policyCheck&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;amp;&lt;/span&gt;&lt;span class="nx"&gt;gt&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;hasSuspicious&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;preFilterResults&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;some&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&amp;amp;&lt;/span&gt;&lt;span class="nx"&gt;gt&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;prefilter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;found&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;hasSuspicious&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;decision&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;partial&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;reason&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;context_redacted&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;decision&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;allow&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;reason&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;ok&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  Streaming the answer
&lt;/h3&gt;

&lt;p&gt;The redacted context hits GPT-4o-mini. Total response time is about 5.9 seconds, which sounds slow, but the user sees text after 3.5 seconds because of streaming. They're already reading before the model finishes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F19hnphqr2vwcekmomrzv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F19hnphqr2vwcekmomrzv.png" alt="AdvChat mid-conversation with a streaming response being generated in real time while the user asks about entitlement" width="800" height="520"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The tricky part: source citations need to come &lt;em&gt;after&lt;/em&gt; the stream ends. I split it into two SSE event types.&lt;/p&gt;

&lt;p&gt;The streaming protocol&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Stream LLM chunks as they arrive&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;llmService&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;streamAnswer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;redactedContext&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;amp;&lt;/span&gt;&lt;span class="nx"&gt;gt&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;fullAnswer&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="nx"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`data: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;chunk&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;chunk&lt;/span&gt; &lt;span class="p"&gt;})}&lt;/span&gt;&lt;span class="s2"&gt;\n\n`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="c1"&gt;// Final event: source citations + policy decision&lt;/span&gt;
&lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`data: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;done&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nx"&gt;provenance&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;policy&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;decision&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;policyResult&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;decision&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;reason&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;policyResult&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;reason&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;})}&lt;/span&gt;&lt;span class="s2"&gt;\n\n`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;end&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;&lt;code&gt;chunk&lt;/code&gt; events build the message in real time. &lt;code&gt;done&lt;/code&gt; triggers the source citation chips below the answer. Two event types, one connection.&lt;/p&gt;
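&lt;p&gt;On the client, consuming the protocol comes down to splitting the stream on blank lines and dispatching on &lt;code&gt;type&lt;/code&gt;. A sketch of that parsing half, assuming the two event shapes above (the helper names are mine, not the app's):&lt;/p&gt;

```typescript
type SseEvent =
  | { type: 'chunk'; data: string }
  | { type: 'done'; provenance?: unknown; policy?: unknown };

// Split raw SSE text into `data: {...}` frames and parse each payload.
function parseSseEvents(raw: string): SseEvent[] {
  return raw
    .split('\n\n')
    .map((frame) => frame.trim())
    .filter((frame) => frame.startsWith('data: '))
    .map((frame) => JSON.parse(frame.slice('data: '.length)) as SseEvent);
}

// Concatenate chunk events into the final message text.
function assembleAnswer(events: SseEvent[]): string {
  return events
    .filter((e): e is Extract<SseEvent, { type: 'chunk' }> => e.type === 'chunk')
    .map((e) => e.data)
    .join('');
}
```

In the app, &lt;code&gt;chunk&lt;/code&gt; events append to the message as they arrive, and the &lt;code&gt;done&lt;/code&gt; event carries the provenance that renders the citation chips.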




&lt;h2&gt;
  
  
  Frontend problems I didn't expect
&lt;/h2&gt;

&lt;p&gt;The backend was the interesting part. The frontend is where things broke in ways I didn't anticipate.&lt;/p&gt;

&lt;p&gt;Streaming and navigation don't mix in Next.js. When a user sends their first message, the app creates a conversation and updates the URL from &lt;code&gt;/chat&lt;/code&gt; to &lt;code&gt;/chat/abc123&lt;/code&gt;. I used &lt;code&gt;router.replace()&lt;/code&gt;. It unmounted the component. The SSE connection died. The user saw half an answer and got bounced to an error page.&lt;/p&gt;

&lt;p&gt;Fix: &lt;code&gt;window.history.replaceState()&lt;/code&gt;. Updates the URL without triggering React navigation. Component stays mounted, stream keeps going.&lt;/p&gt;

&lt;p&gt;There's also this: the backend always returns source citations even when it says "I don't know." It found context but couldn't answer. Showing citations under a non-answer confuses people. The frontend checks the response text against a regex and hides the source chips when the answer looks like a "no knowledge" response. Took me longer than it should have.&lt;/p&gt;
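&lt;p&gt;The check is a heuristic, so here's only its shape. The actual phrases depend on how the model words its refusals; this pattern list is illustrative:&lt;/p&gt;

```typescript
// Hypothetical refusal patterns -- tune against the model's real phrasing.
const NO_KNOWLEDGE = /\b(i don'?t know|no relevant (information|context)|couldn'?t find)\b/i;

// Hide the source chips when the answer reads like a non-answer.
function shouldShowCitations(answer: string): boolean {
  return !NO_KNOWLEDGE.test(answer);
}
```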

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8geg17qnxgpp679ignz4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8geg17qnxgpp679ignz4.png" alt="AdvChat showing the thinking indicator dots while waiting for the LLM to respond to a follow-up question" width="800" height="520"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Going to production
&lt;/h2&gt;

&lt;p&gt;Getting this running locally was one thing. Deploying it was a different exercise.&lt;/p&gt;

&lt;p&gt;The first Dockerfile was 900MB and ran the dev server. After a few rounds, I landed on a multi-stage build: TypeScript compiles in one stage, the production image only gets the compiled JavaScript and prod dependencies. About 250MB.&lt;/p&gt;
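&lt;p&gt;Roughly that shape, sketched from memory rather than copied from the repo (stage names, paths, and scripts are illustrative):&lt;/p&gt;

```dockerfile
# --- build stage: compile TypeScript ---
FROM node:20-alpine AS build
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build          # emits dist/

# --- runtime stage: compiled JS + prod deps only ---
FROM node:20-alpine
WORKDIR /app
COPY package*.json ./
RUN npm ci --omit=dev
COPY --from=build /app/dist ./dist
CMD ["node", "dist/index.js"]
```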

&lt;p&gt;First Railway deployment crashed because &lt;code&gt;uploads/&lt;/code&gt; didn't exist in the container. I'd put it in &lt;code&gt;.dockerignore&lt;/code&gt; (you don't want local PDFs in your image), but multer needs that directory to write temp files. One-line fix: &lt;code&gt;fs.mkdirSync('uploads', { recursive: true })&lt;/code&gt; at startup. Then I realized the files should be deleted after indexing anyway. No persistent storage, no S3, no disk space to worry about.&lt;/p&gt;

&lt;p&gt;The health endpoint hits Postgres, Redis, and Pinecone in parallel with individual timeouts. Railway restarts the container if it fails. Graceful shutdown drains connections with a 10-second limit. CORS is wide open locally, locked to the Vercel domain in production.&lt;/p&gt;
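&lt;p&gt;A sketch of what a parallel probe with per-service timeouts can look like. &lt;code&gt;withTimeout&lt;/code&gt; and the probe map are illustrative; the real clients expose their own ping calls:&lt;/p&gt;

```typescript
// Race each probe against a timeout so one slow dependency can't hang the check.
function withTimeout<T>(p: Promise<T>, ms: number): Promise<T> {
  return Promise.race([
    p,
    new Promise<T>((_, reject) =>
      setTimeout(() => reject(new Error('timeout')), ms),
    ),
  ]);
}

// Run all probes in parallel; the endpoint is healthy only if every service is up.
async function healthCheck(
  probes: Record<string, () => Promise<void>>,
  timeoutMs = 2000,
): Promise<{ ok: boolean; services: Record<string, 'up' | 'down'> }> {
  const names = Object.keys(probes);
  const results = await Promise.allSettled(
    names.map((name) => withTimeout(probes[name](), timeoutMs)),
  );
  const services: Record<string, 'up' | 'down'> = {};
  results.forEach((r, i) => {
    services[names[i]] = r.status === 'fulfilled' ? 'up' : 'down';
  });
  return { ok: Object.values(services).every((s) => s === 'up'), services };
}
```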

&lt;p&gt;None of this is interesting, but skip any of it and the deploy falls over.&lt;/p&gt;




&lt;h2&gt;
  
  
  Numbers
&lt;/h2&gt;

&lt;p&gt;From the production deployment:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;What&lt;/th&gt;
&lt;th&gt;How fast&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Index a 150-page PDF (840 chunks)&lt;/td&gt;
&lt;td&gt;~27s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Index a 42-page PDF (122 chunks)&lt;/td&gt;
&lt;td&gt;~7s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;First token appears&lt;/td&gt;
&lt;td&gt;~3.5s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Full streaming response&lt;/td&gt;
&lt;td&gt;~5.9s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Monthly cost&lt;/td&gt;
&lt;td&gt;~$5-8&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Most of the latency is OpenAI. Embedding generation and LLM inference, not server compute. Streaming covers the gap: 3.5 seconds to first visible text, and people start reading.&lt;/p&gt;

&lt;h3&gt;
  
  
  Tech stack
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Technology&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Frontend&lt;/td&gt;
&lt;td&gt;Next.js 15, Tailwind CSS, React Query&lt;/td&gt;
&lt;td&gt;App Router, SSE streaming, server-side auth&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Backend&lt;/td&gt;
&lt;td&gt;Express, TypeScript&lt;/td&gt;
&lt;td&gt;Full control over SSE and middleware&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Auth&lt;/td&gt;
&lt;td&gt;Clerk&lt;/td&gt;
&lt;td&gt;OAuth + webhook user sync&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Vector DB&lt;/td&gt;
&lt;td&gt;Pinecone&lt;/td&gt;
&lt;td&gt;Managed, namespace isolation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LLM&lt;/td&gt;
&lt;td&gt;OpenAI GPT-4o-mini&lt;/td&gt;
&lt;td&gt;Fast, cheap ($0.15/1M tokens)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Embeddings&lt;/td&gt;
&lt;td&gt;text-embedding-3-small&lt;/td&gt;
&lt;td&gt;1536 dims, $0.02/1M tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cache&lt;/td&gt;
&lt;td&gt;Redis&lt;/td&gt;
&lt;td&gt;Embedding + response caching&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Database&lt;/td&gt;
&lt;td&gt;PostgreSQL + Prisma&lt;/td&gt;
&lt;td&gt;Users, conversations, documents&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hosting&lt;/td&gt;
&lt;td&gt;Railway + Vercel&lt;/td&gt;
&lt;td&gt;~$5-8/month total&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  What I'd change
&lt;/h2&gt;

&lt;p&gt;My chunks are fixed at 500 tokens with 100 overlap. They cut mid-sentence sometimes, and the LLM struggles with the partial context. I'd switch to semantic chunking, splitting on paragraph boundaries instead.&lt;/p&gt;
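&lt;p&gt;A minimal version of that switch: split on blank lines and pack whole paragraphs up to a budget. This sketch counts characters for simplicity; the real thing would count tokens:&lt;/p&gt;

```typescript
// Pack paragraphs into chunks without ever cutting mid-sentence.
// maxLen is a character budget here -- a token counter would replace it.
function chunkByParagraph(text: string, maxLen = 2000): string[] {
  const paragraphs = text.split(/\n{2,}/).map((p) => p.trim()).filter(Boolean);
  const chunks: string[] = [];
  let current = '';
  for (const para of paragraphs) {
    if (current && current.length + para.length + 2 > maxLen) {
      chunks.push(current);
      current = para;
    } else {
      current = current ? `${current}\n\n${para}` : para;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}
```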

&lt;p&gt;I wrote a reranking service and then never connected it. A cross-encoder between vector search and the LLM would drop the marginal chunks before they eat up context window. It's sitting there, just not plugged in.&lt;/p&gt;

&lt;p&gt;The other thing: if the LLM errors mid-stream, the user sees a half-finished message and nothing else. No error state, no retry button. I need a &lt;code&gt;type: 'error'&lt;/code&gt; event in the SSE protocol. It's on my list.&lt;/p&gt;
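&lt;p&gt;The shape I have in mind, with stand-in types (&lt;code&gt;Writer&lt;/code&gt; mimics the Express response; the &lt;code&gt;done&lt;/code&gt; payload is trimmed down for the sketch):&lt;/p&gt;

```typescript
type Writer = { write(s: string): void; end(): void };

// Wrap the stream so a mid-generation failure emits a typed 'error'
// event instead of silently ending with half a message.
async function streamWithErrorEvent(
  stream: (onChunk: (chunk: string) => void) => Promise<void>,
  res: Writer,
): Promise<void> {
  try {
    await stream((chunk) => {
      res.write(`data: ${JSON.stringify({ type: 'chunk', data: chunk })}\n\n`);
    });
    res.write(`data: ${JSON.stringify({ type: 'done' })}\n\n`);
  } catch {
    res.write(`data: ${JSON.stringify({ type: 'error', message: 'generation_failed' })}\n\n`);
  } finally {
    res.end();
  }
}
```

The client then gets a terminal event either way, so it can render a retry button instead of a stalled half-answer.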




&lt;p&gt;&lt;em&gt;TypeScript, Express, Next.js, OpenAI, Pinecone, Redis, PostgreSQL, Clerk. Railway + Vercel. About $5-8/month.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;If you have questions about any of the architecture decisions, I'm happy to talk about them in the comments.&lt;/em&gt;&lt;/p&gt;


&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/anuragmerndev" rel="noopener noreferrer"&gt;
        anuragmerndev
      &lt;/a&gt; / &lt;a href="https://github.com/anuragmerndev/adv-rag" rel="noopener noreferrer"&gt;
        adv-rag
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      RAG-powered document Q&amp;amp;A API with streaming, dual-layer caching, and multi-tenant vector search
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;AdvChat — Backend API&lt;/h1&gt;
&lt;/div&gt;
&lt;p&gt;&lt;a rel="noopener noreferrer nofollow" href="https://camo.githubusercontent.com/d2b19c56bf530067483d8d2756fac7800e0aef54ef4360460d778c23ccc3db2b/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f547970655363726970742d3331373843363f6c6f676f3d74797065736372697074266c6f676f436f6c6f723d7768697465"&gt;&lt;img src="https://camo.githubusercontent.com/d2b19c56bf530067483d8d2756fac7800e0aef54ef4360460d778c23ccc3db2b/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f547970655363726970742d3331373843363f6c6f676f3d74797065736372697074266c6f676f436f6c6f723d7768697465" alt="TypeScript"&gt;&lt;/a&gt;
&lt;a rel="noopener noreferrer nofollow" href="https://camo.githubusercontent.com/ef0efa27f9a3268626104d5d2aed09ab5fb94809e76ccf6f2424524c5710d8c6/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f4e6f64652e6a732d3333393933333f6c6f676f3d6e6f6465646f746a73266c6f676f436f6c6f723d7768697465"&gt;&lt;img src="https://camo.githubusercontent.com/ef0efa27f9a3268626104d5d2aed09ab5fb94809e76ccf6f2424524c5710d8c6/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f4e6f64652e6a732d3333393933333f6c6f676f3d6e6f6465646f746a73266c6f676f436f6c6f723d7768697465" alt="Node.js"&gt;&lt;/a&gt;
&lt;a rel="noopener noreferrer nofollow" href="https://camo.githubusercontent.com/0b63d7c93454dec92ad25defd92a8bad94a5e8cedda9aee3660e94b119506300/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f457870726573732d3030303030303f6c6f676f3d65787072657373266c6f676f436f6c6f723d7768697465"&gt;&lt;img src="https://camo.githubusercontent.com/0b63d7c93454dec92ad25defd92a8bad94a5e8cedda9aee3660e94b119506300/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f457870726573732d3030303030303f6c6f676f3d65787072657373266c6f676f436f6c6f723d7768697465" alt="Express"&gt;&lt;/a&gt;
&lt;a rel="noopener noreferrer nofollow" href="https://camo.githubusercontent.com/d40f1442f28aa4d9dc385828b8ffbbb54a427e2a6f4474f8f4165de5168cdd7c/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f507269736d612d3244333734383f6c6f676f3d707269736d61266c6f676f436f6c6f723d7768697465"&gt;&lt;img src="https://camo.githubusercontent.com/d40f1442f28aa4d9dc385828b8ffbbb54a427e2a6f4474f8f4165de5168cdd7c/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f507269736d612d3244333734383f6c6f676f3d707269736d61266c6f676f436f6c6f723d7768697465" alt="Prisma"&gt;&lt;/a&gt;
&lt;a rel="noopener noreferrer nofollow" href="https://camo.githubusercontent.com/fac948737008a67ae4de2c8333bb822c2523e1dcdc96f42adcc0f7ed861fc9e7/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f4f70656e41492d3431323939313f6c6f676f3d6f70656e6169266c6f676f436f6c6f723d7768697465"&gt;&lt;img src="https://camo.githubusercontent.com/fac948737008a67ae4de2c8333bb822c2523e1dcdc96f42adcc0f7ed861fc9e7/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f4f70656e41492d3431323939313f6c6f676f3d6f70656e6169266c6f676f436f6c6f723d7768697465" alt="OpenAI"&gt;&lt;/a&gt;
&lt;a rel="noopener noreferrer nofollow" href="https://camo.githubusercontent.com/5cb32f5328be3851609a2baa5e66732c4b8183b6f6b3186fc5c6d2580c1e899a/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f50696e65636f6e652d3030303030303f6c6f676f3d70696e65636f6e65266c6f676f436f6c6f723d7768697465"&gt;&lt;img src="https://camo.githubusercontent.com/5cb32f5328be3851609a2baa5e66732c4b8183b6f6b3186fc5c6d2580c1e899a/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f50696e65636f6e652d3030303030303f6c6f676f3d70696e65636f6e65266c6f676f436f6c6f723d7768697465" alt="Pinecone"&gt;&lt;/a&gt;
&lt;a rel="noopener noreferrer nofollow" href="https://camo.githubusercontent.com/52ac35de4ca79ebcdf1a6fcb6446432433b381796f185f10ffb3c8ed38e3862b/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f52656469732d4443333832443f6c6f676f3d7265646973266c6f676f436f6c6f723d7768697465"&gt;&lt;img src="https://camo.githubusercontent.com/52ac35de4ca79ebcdf1a6fcb6446432433b381796f185f10ffb3c8ed38e3862b/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f52656469732d4443333832443f6c6f676f3d7265646973266c6f676f436f6c6f723d7768697465" alt="Redis"&gt;&lt;/a&gt;
&lt;a rel="noopener noreferrer nofollow" href="https://camo.githubusercontent.com/1d8555be6c7c523ee915e138ffdcda7cb9ce92310c100f02bde3b6a622ebedd3/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f506f737467726553514c2d3431363945313f6c6f676f3d706f737467726573716c266c6f676f436f6c6f723d7768697465"&gt;&lt;img src="https://camo.githubusercontent.com/1d8555be6c7c523ee915e138ffdcda7cb9ce92310c100f02bde3b6a622ebedd3/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f506f737467726553514c2d3431363945313f6c6f676f3d706f737467726573716c266c6f676f436f6c6f723d7768697465" alt="PostgreSQL"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;RAG-powered document Q&amp;amp;A API with streaming responses, dual-layer caching, multi-tenant vector search, and pre-LLM content redaction.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;a href="https://adv-rag-ui.vercel.app" rel="nofollow noopener noreferrer"&gt;Live Demo&lt;/a&gt;&lt;/strong&gt; | &lt;strong&gt;&lt;a href="https://github.com/anuragmerndev/adv-rag-ui" rel="noopener noreferrer"&gt;Frontend Repo&lt;/a&gt;&lt;/strong&gt; | &lt;strong&gt;&lt;a href=""&gt;Case Study&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Architecture&lt;/h2&gt;
&lt;/div&gt;
&lt;p&gt;&lt;a rel="noopener noreferrer" href="https://github.com/anuragmerndev/adv-rag/docs/architecture.drawio.png"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fanuragmerndev%2Fadv-rag%2FHEAD%2Fdocs%2Farchitecture.drawio.png" alt="Architecture"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Upload path:&lt;/strong&gt; PDF → parse (LangChain) → chunk (500 tokens, 100 overlap) → embed (OpenAI text-embedding-3-small) → store (Pinecone, namespaced per user)&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Query path:&lt;/strong&gt; Question → normalize → SHA-256 fingerprint → check embedding cache (Redis) → embed if miss → similarity search (Pinecone) → redact suspicious context → stream answer (GPT-4o-mini) → persist (Postgres)&lt;/p&gt;

&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Features&lt;/h2&gt;
&lt;/div&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;RAG Pipeline&lt;/strong&gt; — PDF upload, chunking, embedding, vector search, LLM answer generation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SSE Streaming&lt;/strong&gt; — real-time response streaming with a two-event protocol (chunks + done with provenance)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dual-Layer Caching&lt;/strong&gt; — embedding cache (&lt;code&gt;emb:&lt;/code&gt;) saves OpenAI calls; response cache (&lt;code&gt;resp:&lt;/code&gt;) for standalone queries&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Query Fingerprinting&lt;/strong&gt; — normalizes queries (contractions, stopwords, punctuation) then SHA-256 hashes for cache deduplication&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pre-LLM Redaction&lt;/strong&gt; — scans retrieved context…&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/anuragmerndev/adv-rag" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;



&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/anuragmerndev" rel="noopener noreferrer"&gt;
        anuragmerndev
      &lt;/a&gt; / &lt;a href="https://github.com/anuragmerndev/adv-rag-ui" rel="noopener noreferrer"&gt;
        adv-rag-ui
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      Chat interface for RAG-powered document Q&amp;amp;A with real-time streaming and source citations
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;AdvChat — Frontend&lt;/h1&gt;
&lt;/div&gt;
&lt;p&gt;&lt;a rel="noopener noreferrer nofollow" href="https://camo.githubusercontent.com/24e3a9b25d3998038db9f90538f8b259efb0a69b2f2181b4f2fe8b9446d7ff01/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f4e6578742e6a735f31352d3030303030303f6c6f676f3d6e657874646f746a73266c6f676f436f6c6f723d7768697465"&gt;&lt;img src="https://camo.githubusercontent.com/24e3a9b25d3998038db9f90538f8b259efb0a69b2f2181b4f2fe8b9446d7ff01/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f4e6578742e6a735f31352d3030303030303f6c6f676f3d6e657874646f746a73266c6f676f436f6c6f723d7768697465" alt="Next.js"&gt;&lt;/a&gt;
&lt;a rel="noopener noreferrer nofollow" href="https://camo.githubusercontent.com/e23c7ba0d846b09e0a728a9857cab1e5d9d54a88226d120f0a84b53cbcf55d7a/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f52656163745f31392d3631444146423f6c6f676f3d7265616374266c6f676f436f6c6f723d626c61636b"&gt;&lt;img src="https://camo.githubusercontent.com/e23c7ba0d846b09e0a728a9857cab1e5d9d54a88226d120f0a84b53cbcf55d7a/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f52656163745f31392d3631444146423f6c6f676f3d7265616374266c6f676f436f6c6f723d626c61636b" alt="React"&gt;&lt;/a&gt;
&lt;a rel="noopener noreferrer nofollow" href="https://camo.githubusercontent.com/d2b19c56bf530067483d8d2756fac7800e0aef54ef4360460d778c23ccc3db2b/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f547970655363726970742d3331373843363f6c6f676f3d74797065736372697074266c6f676f436f6c6f723d7768697465"&gt;&lt;img src="https://camo.githubusercontent.com/d2b19c56bf530067483d8d2756fac7800e0aef54ef4360460d778c23ccc3db2b/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f547970655363726970742d3331373843363f6c6f676f3d74797065736372697074266c6f676f436f6c6f723d7768697465" alt="TypeScript"&gt;&lt;/a&gt;
&lt;a rel="noopener noreferrer nofollow" href="https://camo.githubusercontent.com/c9503584b266d8f1187c86c14a5e385ae74d55966d2f136e9fb314fe6e5aa583/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f5461696c77696e645f4353535f342d3036423644343f6c6f676f3d7461696c77696e64637373266c6f676f436f6c6f723d7768697465"&gt;&lt;img src="https://camo.githubusercontent.com/c9503584b266d8f1187c86c14a5e385ae74d55966d2f136e9fb314fe6e5aa583/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f5461696c77696e645f4353535f342d3036423644343f6c6f676f3d7461696c77696e64637373266c6f676f436f6c6f723d7768697465" alt="Tailwind CSS"&gt;&lt;/a&gt;
&lt;a rel="noopener noreferrer nofollow" href="https://camo.githubusercontent.com/51f6f045cf161ef9c58ff905c1cdd0c57ff0a110fda2c17d7b0975b39778dc52/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f436c65726b2d3643343746463f6c6f676f3d636c65726b266c6f676f436f6c6f723d7768697465"&gt;&lt;img src="https://camo.githubusercontent.com/51f6f045cf161ef9c58ff905c1cdd0c57ff0a110fda2c17d7b0975b39778dc52/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f436c65726b2d3643343746463f6c6f676f3d636c65726b266c6f676f436f6c6f723d7768697465" alt="Clerk"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Chat interface for RAG-powered document Q&amp;amp;A with real-time streaming responses, source citations, conversation history, and drag-and-drop PDF upload.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;a href="https://adv-rag-ui.vercel.app" rel="nofollow noopener noreferrer"&gt;Live Demo&lt;/a&gt;&lt;/strong&gt; | &lt;strong&gt;&lt;a href="https://github.com/anuragmerndev/adv-rag" rel="noopener noreferrer"&gt;Backend Repo&lt;/a&gt;&lt;/strong&gt; | &lt;strong&gt;&lt;a href=""&gt;Case Study&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Architecture&lt;/h2&gt;
&lt;/div&gt;
&lt;p&gt;&lt;a rel="noopener noreferrer nofollow" href="https://raw.githubusercontent.com/anuragmerndev/adv-rag/main/docs/architecture.drawio.png"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fanuragmerndev%2Fadv-rag%2Fmain%2Fdocs%2Farchitecture.drawio.png" alt="Architecture"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The frontend connects to the Express backend via REST and SSE. Clerk handles authentication client-side, and the backend verifies tokens on every request.&lt;/p&gt;
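&lt;p&gt;On the receiving end, the SSE stream is a series of &lt;code&gt;event:&lt;/code&gt;/&lt;code&gt;data:&lt;/code&gt; frames separated by blank lines. A minimal parser, simplified for illustration (a real client must also handle frames split across network reads):&lt;/p&gt;

```javascript
// Parse a buffer of SSE text into { event, data } records.
// Assumes the backend's two-event protocol: `chunk` frames, then one `done`.
function parseSseFrames(buffer) {
  const frames = [];
  for (const block of buffer.split("\n\n")) {
    if (block.trim() === "") continue; // trailing blank separator
    let event = "message";             // SSE default when no event: line is present
    let data = "";
    for (const line of block.split("\n")) {
      if (line.startsWith("event: ")) event = line.slice(7);
      if (line.startsWith("data: ")) data = line.slice(6);
    }
    if (data !== "") frames.push({ event, data: JSON.parse(data) });
  }
  return frames;
}
```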

&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Features&lt;/h2&gt;
&lt;/div&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Streaming Chat&lt;/strong&gt; — SSE-based real-time response streaming with chunk-by-chunk rendering&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Source Citations&lt;/strong&gt; — expandable chips showing document name and quoted content per answer&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Conversation History&lt;/strong&gt; — persistent conversations with titles, message counts, sidebar navigation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PDF Upload&lt;/strong&gt; — drag-and-drop zone with file type validation (PDF only, 25MB limit) and progress feedback&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Document Management&lt;/strong&gt; — searchable list with multi-select checkboxes to scope queries to specific documents&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Smart Input&lt;/strong&gt; — arrow-key history navigation (up/down), Shift+Enter for multiline, auto-growing textarea&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Skeleton Loaders&lt;/strong&gt; — animated message-shaped placeholders while loading old conversations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Thinking Indicator&lt;/strong&gt; — pulsing dots while waiting for first token&lt;/li&gt;
&lt;li&gt;…&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/anuragmerndev/adv-rag-ui" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;


</description>
      <category>ai</category>
      <category>webdev</category>
      <category>javascript</category>
      <category>rag</category>
    </item>
  </channel>
</rss>
