From "dumb search" to intelligent reasoning β plus save anything with one click
Previously on Memory Palace...
A few weeks ago, I shared how I built Memory Palace – a RAG-powered knowledge management system that handles both external research (Pockets) and personal thoughts (Memories).
The feedback was amazing. But two things kept coming up:
"This is great, but sometimes the answers don't quite get what I'm asking..."
"I don't want to copy-paste URLs into a web app. Can I just... click a button?"
So I rebuilt the entire RAG pipeline. And built a Chrome Extension.
What's New in Part 2?
Agentic RAG Pipeline – AI That Actually Thinks
The biggest upgrade isn't visible – it's in how the system thinks. We went from "dumb retrieval" to a multi-step reasoning pipeline.
Chrome Extension – Save Anything With One Click
A full-featured browser extension that brings Memory Palace to every webpage.
Let's dive into both.
Part 1: The Agentic RAG Pipeline
The original RAG was simple:
- Embed query → Vector search → Get chunks → Send to LLM → Done
It worked. But it was... dumb. It didn't understand intent. It couldn't tell if you were asking a follow-up question. It treated "hello" the same as "compare the methodologies in my research papers."
The New Pipeline: 7 Steps of Intelligence
AGENTIC RAG PIPELINE

 1. Query Router ───▶ Skip RAG? (Greeting)
        │
        ▼
 2. Adaptive Retrieval Params
        │
        ▼
 3. Context Rewrite
        │
        ▼
 4. Multi-Query Generation
        │
        ▼
 5. Hybrid Search
        │
        ▼
 6. CRAG Grading
        │
        ▼
 7. Answer Synthesis + Stream
Let me explain each step:
Step 1: Query Router – Intent Classification
Before doing anything, we ask: "What kind of question is this?"
type QueryIntent =
| "no_retrieval" // "Hello!" - no sources needed
| "simple_lookup" // "What is X?" - direct fact lookup
| "comparison" // "Compare A and B" - needs multiple sources
| "summarization" // "Summarize..." - needs aggregation
| "analytical" // "Why does..." - deep reasoning required
| "follow_up"; // "Tell me more" - needs conversation context
Why it matters: If someone says "Hi, how are you?", we don't need to search 1000 chunks. We can respond directly.
if (routerResult.skipRetrieval) {
// Just respond, no RAG needed
sendEvent({ type: "token", payload: "Hello! How can I help you today?" });
return;
}
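For a concrete picture, here's roughly what that routing call could look like. This is a minimal sketch, not the exact implementation – the classifyIntent name, the prompt wording, and the model choice are all assumptions:
// Sketch only: classify intent with one small LLM call (names, prompt, and model are assumptions)
async function classifyIntent(
  query: string
): Promise<{ intent: QueryIntent; confidence: number; skipRetrieval: boolean }> {
  const res = await fetch("https://openrouter.ai/api/v1/chat/completions", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.OPENROUTER_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model: "openai/gpt-4o-mini", // assumption: any fast, cheap model works for routing
      messages: [
        {
          role: "system",
          content:
            'Classify the query as one of: no_retrieval, simple_lookup, comparison, summarization, analytical, follow_up. Reply with JSON like {"intent": "comparison", "confidence": 0.9}.',
        },
        { role: "user", content: query },
      ],
    }),
  });
  const data = await res.json();
  const parsed = JSON.parse(data.choices[0].message.content);
  return {
    intent: parsed.intent as QueryIntent,
    confidence: parsed.confidence ?? 0.5,
    skipRetrieval: parsed.intent === "no_retrieval",
  };
}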
Step 2: Adaptive Retrieval Parameters
Different questions need different retrieval strategies:
function getAdaptiveRetrievalParams(
intent: QueryIntent
): AdaptiveRetrievalParams {
switch (intent) {
case "comparison":
return {
chunkCount: 20, // Need more chunks for comparison
vectorWeight: 0.5, // Balance semantic + keyword
ftsWeight: 0.5,
expansionQueries: 5, // Generate more query variations
};
case "simple_lookup":
return {
chunkCount: 5, // Few chunks, high precision
vectorWeight: 0.7, // Lean into semantic
ftsWeight: 0.3,
expansionQueries: 2,
};
case "analytical":
return {
chunkCount: 15, // Lots of context for analysis
vectorWeight: 0.6,
ftsWeight: 0.4,
expansionQueries: 4,
};
// ...
}
}
The insight: "Compare Apple and Google's AI strategy" needs way more chunks than "What is Apple's market cap?"
Step 3: Context-Aware Query Rewriting
Follow-up questions are the hardest. When you ask "What about their revenue?", what does "their" mean?
We rewrite ambiguous queries using conversation history:
const rewrittenQuery = await rewriteQueryWithContext(
"What about their revenue?", // Original query
[
{ role: "user", content: "Compare Apple and Google AI" },
{ role: "assistant", content: "Apple focuses on..." },
]
);
// Result:
// {
// original: "What about their revenue?",
// rewritten: "What is Apple and Google's revenue?",
// extractedEntities: ["Apple", "Google", "revenue"],
// needsContext: true
// }
Now the search actually finds what you meant.
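Under the hood, the rewrite is one more LLM call whose prompt includes the recent turns. A rough sketch (the prompt wording and the callLLM helper are assumptions, not the actual code):
// Sketch only: resolve pronouns using the conversation history
async function rewriteQueryWithContext(
  query: string,
  history: { role: "user" | "assistant"; content: string }[]
) {
  const transcript = history.map((m) => `${m.role}: ${m.content}`).join("\n");
  const prompt =
    `Conversation so far:\n${transcript}\n\n` +
    `Rewrite the next question so it is fully self-contained (resolve pronouns like "their" or "it") ` +
    `and list the entities it refers to.\n\nQuestion: "${query}"\n` +
    `Respond as JSON: { "rewritten": string, "extractedEntities": string[], "needsContext": boolean }`;
  const raw = await callLLM(prompt); // assumption: shared LLM helper used by the other pipeline steps
  return { original: query, ...JSON.parse(raw) };
}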
Step 4: Multi-Query Generation
One query isn't enough. We generate variations to catch different phrasings in your sources:
User asks: "What are the risks of AI?"
We search for:
- "What are the risks of AI?"
- "AI dangers and downsides"
- "Negative impacts of artificial intelligence"
- "AI safety concerns"
- "Problems with AI adoption"
const searchQueries = await generateSearchQueriesStream(
effectiveQuery,
retrievalParams.expansionQueries // 2-5 queries based on intent
);
Result: 40% better recall on average. We find chunks that matter even if they don't use your exact words.
Step 5: Hybrid Search
Vector search is great for semantics. But sometimes you need exact matches.
We combine both:
-- Hybrid search: vector + full-text with adaptive weights
SELECT
chunk_id,
text,
(
({vectorWeight} * (1 - (embedding <=> query_embedding))) +
({ftsWeight} * ts_rank(search_vector, websearch_to_tsquery(query)))
) as combined_score
FROM chunks
ORDER BY combined_score DESC
LIMIT {chunkCount};
Example: If you search for "NVIDIA earnings Q3", vector search finds semantically similar chunks. Full-text search finds chunks with those exact words. Combined = best results.
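Since step 4 produced several query variations, each one runs through this search and the results get merged. Here's a sketch of how that fan-out and de-duplication could look – the hybrid_search RPC name and the embed helper are assumptions:
import { createClient } from "@supabase/supabase-js";

const supabase = createClient(process.env.SUPABASE_URL!, process.env.SUPABASE_SERVICE_KEY!);

// Sketch only: run the hybrid query for every expanded search query, then keep the best score per chunk
async function hybridSearchAll(queries: string[], params: AdaptiveRetrievalParams) {
  const resultsPerQuery = await Promise.all(
    queries.map(async (q) => {
      const { data } = await supabase.rpc("hybrid_search", {
        // assumption: the SQL above lives in a Postgres function
        query_text: q,
        query_embedding: await embed(q), // assumption: same embedding helper used at chunking time
        vector_weight: params.vectorWeight,
        fts_weight: params.ftsWeight,
        match_count: params.chunkCount,
      });
      return data ?? [];
    })
  );

  // Deduplicate across query variations, keeping the best score per chunk
  const best = new Map<string, any>();
  for (const rows of resultsPerQuery) {
    for (const row of rows) {
      const existing = best.get(row.chunk_id);
      if (!existing || row.combined_score > existing.combined_score) best.set(row.chunk_id, row);
    }
  }
  return [...best.values()].sort((a, b) => b.combined_score - a.combined_score);
}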
Step 6: CRAG – Corrective RAG (Chunk Grading)
Here's the innovation: not all retrieved chunks are relevant.
Before sending chunks to the LLM, we grade each one:
interface GradedChunk {
chunk: any;
relevance: "relevant" | "partially_relevant" | "irrelevant";
score: number; // 0-1
reasoning: string;
}
const cragResult = await gradeChunksRelevance(query, chunks);
// Result:
// {
// decision: 'sufficient' | 'needs_expansion' | 'no_relevant_sources',
// avgRelevanceScore: 0.73,
// relevantChunks: [...], // Only the good ones
// }
Three outcomes:
- sufficient → Good chunks found, proceed to answer
- needs_expansion → Chunks are borderline, try broader search
- no_relevant_sources → Nothing relevant, tell user honestly
Why this matters: Without CRAG, the LLM gets noisy context and hallucinates. With CRAG, it only sees relevant chunks.
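Acting on the decision is plain control flow. A sketch that reuses the names from the snippets above (the broadened fallback parameters are assumptions):
// Sketch only: route on the CRAG decision
let contextChunks = cragResult.relevantChunks;

if (cragResult.decision === "needs_expansion") {
  // Borderline chunks: cast a wider net, then grade again
  const broader = await hybridSearchAll(searchQueries, {
    ...retrievalParams,
    chunkCount: retrievalParams.chunkCount * 2,
    vectorWeight: 0.5,
    ftsWeight: 0.5,
  });
  contextChunks = (await gradeChunksRelevance(query, broader)).relevantChunks;
} else if (cragResult.decision === "no_relevant_sources") {
  // Be honest instead of hallucinating an answer from noise
  sendEvent({
    type: "token",
    payload: "I couldn't find anything relevant in your sources for that question.",
  });
  return;
}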
Step 7: Answer Synthesis with Streaming
Finally, we generate the answer with real-time streaming:
// Stream tokens as they're generated
for await (const token of chatGen) {
sendEvent({ type: "token", payload: token });
}
// Include citations
sendEvent({
type: "done",
payload: {
answer,
citations: sources.map((s) => ({ id: s.source_id, title: s.title })),
intent: routerResult.intent,
},
});
Real-time status updates throughout:
[Status] Analyzing query intent...
[Routing] Intent: comparison, Confidence: 0.89
[Status] Rewriting query with context...
[Rewriting] "What about their approach?" → "What is Apple and Google's approach to AI?"
[Status] Generating 4 search queries...
[Queries] ["Apple Google AI approach", "tech giants AI strategy", ...]
[Status] Searching 4 queries...
[Status] Grading chunk relevance...
[Grading] Decision: sufficient, Avg Score: 0.78, 12/15 chunks relevant
[Sources] Found 3 relevant sources
[Status] Generating answer...
[Token] Apple...
[Token] 's approach...
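Those status lines ride the same stream as the answer tokens. Here's a minimal sketch of what the sendEvent side could look like on a Fastify route using server-sent events – the route path and wiring are assumptions; only the event shape matches the snippets above:
import Fastify from "fastify";

const app = Fastify();

app.post("/chat", async (request, reply) => {
  reply.hijack(); // take over the raw response so we can stream
  reply.raw.writeHead(200, {
    "Content-Type": "text/event-stream",
    "Cache-Control": "no-cache",
    Connection: "keep-alive",
  });

  // Every pipeline step calls this to push progress to the browser
  const sendEvent = (event: { type: string; payload: unknown }) => {
    reply.raw.write(`data: ${JSON.stringify(event)}\n\n`);
  };

  sendEvent({ type: "status", payload: "Analyzing query intent..." });
  // ...run the 7-step pipeline, calling sendEvent along the way...

  reply.raw.end();
});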
Part 2: The Chrome Extension
Now onto the second major feature: save anything from anywhere.
Features
- One-click save – Save any webpage as a published memory instantly
- Built-in chat – Ask questions about your memories without leaving the page
- Smart extraction – Pulls full content from any website
- Secure login – Uses the same Supabase auth as the web app
The Architecture
                      Your Browser
┌──────────────────┐            ┌──────────────────┐
│     Popup UI     │            │  Content Script  │
│  - Login         │            │  - Extracts DOM  │
│  - Chat          │            │  - Full content  │
│  - Save button   │            │  - No limits     │
└────────┬─────────┘            └────────┬─────────┘
         │                               │
         ▼                               ▼
┌───────────────────────────────────────────────┐
│           Background Service Worker           │
└───────────────────────┬───────────────────────┘
                        │
                        ▼
            ┌─────────────────────────┐
            │    Memory Palace API    │
            │        (Railway)        │
            └─────────────────────────┘
The Game Changer: Unlike the worker, which fetches URLs and parses HTML, the extension runs directly in your browser with full DOM access:
- No content limits – Get the entire article, not just 50KB
- JavaScript-rendered content – Works on SPAs and dynamic sites
- Bypasses bot detection – You're a real browser, not a scraper
- Sees what you see – If you can read it, you can save it
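The three pieces talk over standard extension messaging. A rough sketch of the save flow – the message names, the endpoint path, and the accessToken handling are assumptions:
// popup.ts (sketch): ask the content script for the page, hand the result to the background worker
const [tab] = await chrome.tabs.query({ active: true, currentWindow: true });
const page = await chrome.tabs.sendMessage(tab.id!, { type: "EXTRACT_CONTENT" });
await chrome.runtime.sendMessage({ type: "SAVE_MEMORY", title: page.title, content: page.content });

// content-script.ts (sketch): respond with the extracted article text
chrome.runtime.onMessage.addListener((msg, _sender, sendResponse) => {
  if (msg.type === "EXTRACT_CONTENT") {
    sendResponse({ title: document.title, content: extractMainContent() }); // extraction logic shown below
  }
});

// background.ts (sketch): forward the save to the Memory Palace API
chrome.runtime.onMessage.addListener((msg, _sender, sendResponse) => {
  if (msg.type === "SAVE_MEMORY") {
    fetch("https://vedha-api-production.up.railway.app/memories", { // assumption: actual path may differ
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${accessToken}`, // assumption: Supabase session token stored by the popup
      },
      body: JSON.stringify({ title: msg.title, content: msg.content, status: "published" }),
    }).then(() => sendResponse({ ok: true }));
    return true; // keep the message channel open for the async response
  }
});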
Smart Content Extraction
We don't just grab document.body.innerText. We intelligently extract content:
Platform-specific selectors:
const mainSelectors = [
// Medium
"article[data-testid='post']",
".meteredContent",
// Substack
".post-content",
".available-content",
// WordPress
".entry-content",
".article-body",
// News sites
".story-body",
".article__body",
// Dev blogs
".markdown-body",
".prose",
// Generic fallbacks
"article",
"main",
'[role="main"]',
];
Find the best element (most content):
let mainElement = null;
let maxContentLength = 0;
for (const selector of mainSelectors) {
const element = document.querySelector(selector);
if (element && element.innerText.length > maxContentLength) {
mainElement = element;
maxContentLength = element.innerText.length;
}
}
Aggressive cleanup:
const removeSelectors = [
"script",
"style",
"noscript",
"iframe",
"nav",
"header",
"footer",
".sidebar",
".comments",
".advertisement",
".ad",
".social-share",
".related-posts",
".newsletter",
".cookie-notice",
"button",
"form",
];
Structure-preserving extraction:
const walkNode = (node) => {
  if (node.nodeType === Node.TEXT_NODE) {
    content += node.textContent;
  } else if (node.nodeType === Node.ELEMENT_NODE) {
    const tag = node.tagName.toLowerCase();
    if (["p", "div", "h1", "h2", "li"].includes(tag)) content += "\n";
    if (["h1", "h2", "h3"].includes(tag)) content += "\n## ";
    if (tag === "li") content += "• ";
    for (const child of node.childNodes) walkNode(child);
  }
};
Result: Clean, formatted text with headings and bullet points preserved.
Auto-Publish: Skip the Draft Stage
Previously: Create draft → Review → Publish → Chunk → Embed
Now: Save from extension → Immediately published and searchable
// Extension sends:
body: JSON.stringify({
org_id: orgId,
title,
content,
status: "published", // Skip the draft!
});
// API automatically queues for chunking:
if (body.status === "published") {
await queue.add("chunk-memory", { memoryId: memory.id });
}
Result: Save an article → Immediately searchable in chat. No extra clicks.
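On the other side of that queue, the worker picks the job up and does the chunk-and-embed work. A minimal BullMQ sketch – the queue name and the loadMemory/chunkText/embed/saveChunk helpers are assumptions:
import { Worker } from "bullmq";

// Sketch only: consume the "chunk-memory" jobs queued when a memory is published
new Worker(
  "memory-processing", // assumption: actual queue name may differ
  async (job) => {
    if (job.name !== "chunk-memory") return;
    const memory = await loadMemory(job.data.memoryId); // assumption: loads the row from Supabase
    const chunks = chunkText(memory.content);           // assumption: same chunker used for Pockets
    for (const text of chunks) {
      const embedding = await embed(text);              // assumption: embedding via OpenRouter
      await saveChunk({ memoryId: memory.id, text, embedding });
    }
  },
  { connection: { host: process.env.REDIS_HOST, port: 6379 } }
);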
GitHub Repositories
vedha-pocket-extension – Chrome Extension for one-click saves
https://github.com/venki0552/vedha-pocket-extension – NEW!
The Funny Bits (More Lessons Learned)
1. The "Why Is Everything Relevant?" Disaster
The first version of CRAG graded everything as "relevant" because the prompt was too lenient. The LLM was like "well, it could be related..."
Fix: Added strict grading criteria and asked for reasoning before the score.
2. The Query Router That Said "Hello" To Everything
Intent classification kept detecting greetings in legitimate questions because the prompt prioritized "be friendly."
Before: "Hello, can you compare the AI strategies?" β intent: 'no_retrieval'
After: Only no_retrieval for actual greetings with no substantive question.
3. The Timeout Cascade
Agentic RAG has 7 LLM calls. At 10 seconds each, that's 70 seconds worst case. Original code had no timeouts. Users waited. Forever.
Fix: 10-second timeout per step, graceful fallbacks:
const routerResult = await fetchWithTimeout(
url,
options,
LLM_TIMEOUT_MS // 10 seconds max
);
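There's nothing magic inside fetchWithTimeout – it's just fetch plus an AbortController. A minimal sketch, assuming the caller catches the abort error and falls back to default behavior:
const LLM_TIMEOUT_MS = 10_000;

// Abort the request if the LLM takes longer than the per-step budget
async function fetchWithTimeout(url: string, options: RequestInit, timeoutMs = LLM_TIMEOUT_MS) {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    return await fetch(url, { ...options, signal: controller.signal });
  } finally {
    clearTimeout(timer);
  }
}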
4. The "Multi-Query Made It Worse" Mystery
More queries = more results. But more results = more noise. CRAG was filtering out 90% of chunks.
Insight: Generate fewer, better queries. 3-5 is the sweet spot.
5. The "Why Is It All On One Line" Formatting Bug
Extension content extraction used:
content = content.replace(/\s+/g, " ");
Except \s+ matches newlines too. Every article became one giant paragraph.
Fix:
content = content.replace(/ +/g, " "); // Only spaces, preserve newlines
Updated Full Architecture
                         ┌──────────────────┐
                         │ Chrome Extension │
                         │  - Save pages    │
                         │  - Chat          │
                         └────────┬─────────┘
                                  │
                                  ▼
┌──────────────────┐     ┌──────────────────┐     ┌──────────────────┐
│   Next.js Web    │────▶│   Fastify API    │────▶│   BullMQ Worker  │
│    (Vercel)      │     │    (Railway)     │     │    (Railway)     │
│                  │     │  + Agentic RAG   │     │                  │
└──────────────────┘     └────────┬─────────┘     └────────┬─────────┘
                                  │                        │
                                  ▼                        ▼
                         ┌──────────────────┐     ┌──────────────────┐
                         │     Supabase     │     │    OpenRouter    │
                         │  (PostgreSQL +   │     │  (LLM + Embed)   │
                         │    pgvector)     │     │                  │
                         └──────────────────┘     └──────────────────┘
What's Next?
Completed
- [x] Agentic RAG Pipeline – Query routing, CRAG, adaptive retrieval
- [x] Chrome Extension – Save pages with one click
- [x] Auto-publish from extension
- [x] Real-time status streaming
Coming Soon
- [ ] Self-reflective answer grading – Retry if answer has hallucinations
- [ ] Firefox Extension
- [ ] Keyboard shortcuts – Ctrl+Shift+S to save
- [ ] Offline queue – Save when offline, sync when online
Try It Yourself
- Web App: https://vedha-pocket-web.vercel.app
- Extension: Clone from GitHub
- API: https://vedha-api-production.up.railway.app
All open source. All self-hostable.
Final Thoughts
Part 1 was about making RAG work. Part 2 is about making it smart.
The difference between "good enough" and "actually useful" is in the details:
- Understanding what kind of question you're asking
- Knowing when retrieval isn't needed
- Filtering noise before it reaches the LLM
- Meeting users where they are (in the browser)
The biggest insight: RAG isn't one thing. It's a pipeline. And every step in that pipeline is an opportunity to add intelligence.
Next up: making the system learn from feedback. If you downvote an answer, it should remember why.
Built with ❤️, even more ☕, and a deep appreciation for graceful timeouts.
– Venkat
Part 1: I Built a RAG-Powered Second Brain
GitHub Repos:
Tags: #AI #AgenticRAG #CRAG #ChromeExtension #OpenSource #Supabase #TypeScript #RAG #KnowledgeManagement