From "dumb search" to intelligent reasoning β plus save anything with one click
Previously on Memory Palace...
A few weeks ago, I shared how I built Memory Palace – a RAG-powered knowledge management system that handles both external research (Pockets) and personal thoughts (Memories).
The feedback was amazing. But two things kept coming up:
"This is great, but sometimes the answers don't quite get what I'm asking..."
"I don't want to copy-paste URLs into a web app. Can I just... click a button?"
So I rebuilt the entire RAG pipeline. And built a Chrome Extension.
What's New in Part 2?
Agentic RAG Pipeline – AI That Actually Thinks
The biggest upgrade isn't visible – it's in how the system thinks. We went from "dumb retrieval" to a multi-step reasoning pipeline.
Chrome Extension – Save Anything With One Click
A full-featured browser extension that brings Memory Palace to every webpage.
Let's dive into both.
Part 1: The Agentic RAG Pipeline
The original RAG was simple:
- Embed query → Vector search → Get chunks → Send to LLM → Done
It worked. But it was... dumb. It didn't understand intent. It couldn't tell if you were asking a follow-up question. It treated "hello" the same as "compare the methodologies in my research papers."
The New Pipeline: 7 Steps of Intelligence
AGENTIC RAG PIPELINE

 1. Query Router ───▶ Skip RAG? (Greeting)
        │
        ▼
 2. Adaptive Retrieval Params
        │
        ▼
 3. Context Rewrite
        │
        ▼
 4. Multi-Query Generation
        │
        ▼
 5. Hybrid Search
        │
        ▼
 6. CRAG Grading
        │
        ▼
 7. Answer Synthesis + Stream
Let me explain each step:
Step 1: Query Router – Intent Classification
Before doing anything, we ask: "What kind of question is this?"
type QueryIntent =
| "no_retrieval" // "Hello!" - no sources needed
| "simple_lookup" // "What is X?" - direct fact lookup
| "comparison" // "Compare A and B" - needs multiple sources
| "summarization" // "Summarize..." - needs aggregation
| "analytical" // "Why does..." - deep reasoning required
| "follow_up"; // "Tell me more" - needs conversation context
Why it matters: If someone says "Hi, how are you?", we don't need to search 1000 chunks. We can respond directly.
if (routerResult.skipRetrieval) {
// Just respond, no RAG needed
sendEvent({ type: "token", payload: "Hello! How can I help you today?" });
return;
}
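For a concrete picture, here's roughly what that routing call could look like. This is a minimal sketch, not the exact implementation – the classifyIntent name, the prompt wording, and the model choice are all assumptions:
// Sketch only: classify intent with one small LLM call (names, prompt, and model are assumptions)
async function classifyIntent(
  query: string
): Promise<{ intent: QueryIntent; confidence: number; skipRetrieval: boolean }> {
  const res = await fetch("https://openrouter.ai/api/v1/chat/completions", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.OPENROUTER_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model: "openai/gpt-4o-mini", // assumption: any fast, cheap model works for routing
      messages: [
        {
          role: "system",
          content:
            'Classify the query as one of: no_retrieval, simple_lookup, comparison, summarization, analytical, follow_up. Reply with JSON like {"intent": "comparison", "confidence": 0.9}.',
        },
        { role: "user", content: query },
      ],
    }),
  });
  const data = await res.json();
  const parsed = JSON.parse(data.choices[0].message.content);
  return {
    intent: parsed.intent as QueryIntent,
    confidence: parsed.confidence ?? 0.5,
    skipRetrieval: parsed.intent === "no_retrieval",
  };
}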
Step 2: Adaptive Retrieval Parameters
Different questions need different retrieval strategies:
function getAdaptiveRetrievalParams(
intent: QueryIntent
): AdaptiveRetrievalParams {
switch (intent) {
case "comparison":
return {
chunkCount: 20, // Need more chunks for comparison
vectorWeight: 0.5, // Balance semantic + keyword
ftsWeight: 0.5,
expansionQueries: 5, // Generate more query variations
};
case "simple_lookup":
return {
chunkCount: 5, // Few chunks, high precision
vectorWeight: 0.7, // Lean into semantic
ftsWeight: 0.3,
expansionQueries: 2,
};
case "analytical":
return {
chunkCount: 15, // Lots of context for analysis
vectorWeight: 0.6,
ftsWeight: 0.4,
expansionQueries: 4,
};
// ...
}
}
The insight: "Compare Apple and Google's AI strategy" needs way more chunks than "What is Apple's market cap?"
Step 3: Context-Aware Query Rewriting
Follow-up questions are the hardest. When you ask "What about their revenue?", what does "their" mean?
We rewrite ambiguous queries using conversation history:
const rewrittenQuery = await rewriteQueryWithContext(
"What about their revenue?", // Original query
[
{ role: "user", content: "Compare Apple and Google AI" },
{ role: "assistant", content: "Apple focuses on..." },
]
);
// Result:
// {
// original: "What about their revenue?",
// rewritten: "What is Apple and Google's revenue?",
// extractedEntities: ["Apple", "Google", "revenue"],
// needsContext: true
// }
Now the search actually finds what you meant.
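Under the hood, the rewrite is one more LLM call whose prompt includes the recent turns. A rough sketch (the prompt wording and the callLLM helper are assumptions, not the actual code):
// Sketch only: resolve pronouns using the conversation history
async function rewriteQueryWithContext(
  query: string,
  history: { role: "user" | "assistant"; content: string }[]
) {
  const transcript = history.map((m) => `${m.role}: ${m.content}`).join("\n");
  const prompt =
    `Conversation so far:\n${transcript}\n\n` +
    `Rewrite the next question so it is fully self-contained (resolve pronouns like "their" or "it") ` +
    `and list the entities it refers to.\n\nQuestion: "${query}"\n` +
    `Respond as JSON: { "rewritten": string, "extractedEntities": string[], "needsContext": boolean }`;
  const raw = await callLLM(prompt); // assumption: shared LLM helper used by the other pipeline steps
  return { original: query, ...JSON.parse(raw) };
}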
Step 4: Multi-Query Generation
One query isn't enough. We generate variations to catch different phrasings in your sources:
User asks: "What are the risks of AI?"
We search for:
- "What are the risks of AI?"
- "AI dangers and downsides"
- "Negative impacts of artificial intelligence"
- "AI safety concerns"
- "Problems with AI adoption"
const searchQueries = await generateSearchQueriesStream(
effectiveQuery,
retrievalParams.expansionQueries // 2-5 queries based on intent
);
Result: 40% better recall on average. We find chunks that matter even if they don't use your exact words.
Step 5: Hybrid Search
Vector search is great for semantics. But sometimes you need exact matches.
We combine both:
-- Hybrid search: vector + full-text with adaptive weights
SELECT
chunk_id,
text,
(
({vectorWeight} * (1 - (embedding <=> query_embedding))) +
({ftsWeight} * ts_rank(search_vector, websearch_to_tsquery(query)))
) as combined_score
FROM chunks
ORDER BY combined_score DESC
LIMIT {chunkCount};
Example: If you search for "NVIDIA earnings Q3", vector search finds semantically similar chunks. Full-text search finds chunks with those exact words. Combined = best results.
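Since step 4 produced several query variations, each one runs through this search and the results get merged. Here's a sketch of how that fan-out and de-duplication could look – the hybrid_search RPC name and the embed helper are assumptions:
import { createClient } from "@supabase/supabase-js";

const supabase = createClient(process.env.SUPABASE_URL!, process.env.SUPABASE_SERVICE_KEY!);

// Sketch only: run the hybrid query for every expanded search query, then keep the best score per chunk
async function hybridSearchAll(queries: string[], params: AdaptiveRetrievalParams) {
  const resultsPerQuery = await Promise.all(
    queries.map(async (q) => {
      const { data } = await supabase.rpc("hybrid_search", {
        // assumption: the SQL above lives in a Postgres function
        query_text: q,
        query_embedding: await embed(q), // assumption: same embedding helper used at chunking time
        vector_weight: params.vectorWeight,
        fts_weight: params.ftsWeight,
        match_count: params.chunkCount,
      });
      return data ?? [];
    })
  );

  // Deduplicate across query variations, keeping the best score per chunk
  const best = new Map<string, any>();
  for (const rows of resultsPerQuery) {
    for (const row of rows) {
      const existing = best.get(row.chunk_id);
      if (!existing || row.combined_score > existing.combined_score) best.set(row.chunk_id, row);
    }
  }
  return [...best.values()].sort((a, b) => b.combined_score - a.combined_score);
}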
Step 6: CRAG – Corrective RAG (Chunk Grading)
Here's the innovation: not all retrieved chunks are relevant.
Before sending chunks to the LLM, we grade each one:
interface GradedChunk {
chunk: any;
relevance: "relevant" | "partially_relevant" | "irrelevant";
score: number; // 0-1
reasoning: string;
}
const cragResult = await gradeChunksRelevance(query, chunks);
// Result:
// {
// decision: 'sufficient' | 'needs_expansion' | 'no_relevant_sources',
// avgRelevanceScore: 0.73,
// relevantChunks: [...], // Only the good ones
// }
Three outcomes:
- sufficient → Good chunks found, proceed to answer
- needs_expansion → Chunks are borderline, try broader search
- no_relevant_sources → Nothing relevant, tell user honestly
Why this matters: Without CRAG, the LLM gets noisy context and hallucinates. With CRAG, it only sees relevant chunks.
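Acting on the decision is plain control flow. A sketch that reuses the names from the snippets above (the broadened fallback parameters are assumptions):
// Sketch only: route on the CRAG decision
let contextChunks = cragResult.relevantChunks;

if (cragResult.decision === "needs_expansion") {
  // Borderline chunks: cast a wider net, then grade again
  const broader = await hybridSearchAll(searchQueries, {
    ...retrievalParams,
    chunkCount: retrievalParams.chunkCount * 2,
    vectorWeight: 0.5,
    ftsWeight: 0.5,
  });
  contextChunks = (await gradeChunksRelevance(query, broader)).relevantChunks;
} else if (cragResult.decision === "no_relevant_sources") {
  // Be honest instead of hallucinating an answer from noise
  sendEvent({
    type: "token",
    payload: "I couldn't find anything relevant in your sources for that question.",
  });
  return;
}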
Step 7: Answer Synthesis with Streaming
Finally, we generate the answer with real-time streaming:
// Stream tokens as they're generated
for await (const token of chatGen) {
sendEvent({ type: "token", payload: token });
}
// Include citations
sendEvent({
type: "done",
payload: {
answer,
citations: sources.map((s) => ({ id: s.source_id, title: s.title })),
intent: routerResult.intent,
},
});
Real-time status updates throughout:
[Status] Analyzing query intent...
[Routing] Intent: comparison, Confidence: 0.89
[Status] Rewriting query with context...
[Rewriting] "What about their approach?" → "What is Apple and Google's approach to AI?"
[Status] Generating 4 search queries...
[Queries] ["Apple Google AI approach", "tech giants AI strategy", ...]
[Status] Searching 4 queries...
[Status] Grading chunk relevance...
[Grading] Decision: sufficient, Avg Score: 0.78, 12/15 chunks relevant
[Sources] Found 3 relevant sources
[Status] Generating answer...
[Token] Apple...
[Token] 's approach...
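Those status lines ride the same stream as the answer tokens. Here's a minimal sketch of what the sendEvent side could look like on a Fastify route using server-sent events – the route path and wiring are assumptions; only the event shape matches the snippets above:
import Fastify from "fastify";

const app = Fastify();

app.post("/chat", async (request, reply) => {
  reply.hijack(); // take over the raw response so we can stream
  reply.raw.writeHead(200, {
    "Content-Type": "text/event-stream",
    "Cache-Control": "no-cache",
    Connection: "keep-alive",
  });

  // Every pipeline step calls this to push progress to the browser
  const sendEvent = (event: { type: string; payload: unknown }) => {
    reply.raw.write(`data: ${JSON.stringify(event)}\n\n`);
  };

  sendEvent({ type: "status", payload: "Analyzing query intent..." });
  // ...run the 7-step pipeline, calling sendEvent along the way...

  reply.raw.end();
});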
Part 2: The Chrome Extension
Now onto the second major feature: save anything from anywhere.
Features
- One-click save – Save any webpage as a published memory instantly
- Built-in chat – Ask questions about your memories without leaving the page
- Smart extraction – Pulls full content from any website
- Secure login – Uses the same Supabase auth as the web app
The Architecture
                      Your Browser
┌──────────────────┐            ┌──────────────────┐
│     Popup UI     │            │  Content Script  │
│  - Login         │            │  - Extracts DOM  │
│  - Chat          │            │  - Full content  │
│  - Save button   │            │  - No limits     │
└────────┬─────────┘            └────────┬─────────┘
         │                               │
         ▼                               ▼
┌───────────────────────────────────────────────┐
│           Background Service Worker           │
└───────────────────────┬───────────────────────┘
                        │
                        ▼
            ┌─────────────────────────┐
            │    Memory Palace API    │
            │        (Railway)        │
            └─────────────────────────┘
The Game Changer: Unlike the worker, which fetches URLs and parses HTML, the extension runs directly in your browser with full DOM access:
- No content limits – Get the entire article, not just 50KB
- JavaScript-rendered content – Works on SPAs and dynamic sites
- Bypasses bot detection – You're a real browser, not a scraper
- Sees what you see – If you can read it, you can save it
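The three pieces talk over standard extension messaging. A rough sketch of the save flow – the message names, the endpoint path, and the accessToken handling are assumptions:
// popup.ts (sketch): ask the content script for the page, hand the result to the background worker
const [tab] = await chrome.tabs.query({ active: true, currentWindow: true });
const page = await chrome.tabs.sendMessage(tab.id!, { type: "EXTRACT_CONTENT" });
await chrome.runtime.sendMessage({ type: "SAVE_MEMORY", title: page.title, content: page.content });

// content-script.ts (sketch): respond with the extracted article text
chrome.runtime.onMessage.addListener((msg, _sender, sendResponse) => {
  if (msg.type === "EXTRACT_CONTENT") {
    sendResponse({ title: document.title, content: extractMainContent() }); // extraction logic shown below
  }
});

// background.ts (sketch): forward the save to the Memory Palace API
chrome.runtime.onMessage.addListener((msg, _sender, sendResponse) => {
  if (msg.type === "SAVE_MEMORY") {
    fetch("https://vedha-api-production.up.railway.app/memories", { // assumption: actual path may differ
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${accessToken}`, // assumption: Supabase session token stored by the popup
      },
      body: JSON.stringify({ title: msg.title, content: msg.content, status: "published" }),
    }).then(() => sendResponse({ ok: true }));
    return true; // keep the message channel open for the async response
  }
});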
Smart Content Extraction
We don't just grab document.body.innerText. We intelligently extract content:
Platform-specific selectors:
const mainSelectors = [
// Medium
"article[data-testid='post']",
".meteredContent",
// Substack
".post-content",
".available-content",
// WordPress
".entry-content",
".article-body",
// News sites
".story-body",
".article__body",
// Dev blogs
".markdown-body",
".prose",
// Generic fallbacks
"article",
"main",
'[role="main"]',
];
Find the best element (most content):
let mainElement = null;
let maxContentLength = 0;
for (const selector of mainSelectors) {
const element = document.querySelector(selector);
if (element && element.innerText.length > maxContentLength) {
mainElement = element;
maxContentLength = element.innerText.length;
}
}
Aggressive cleanup:
const removeSelectors = [
"script",
"style",
"noscript",
"iframe",
"nav",
"header",
"footer",
".sidebar",
".comments",
".advertisement",
".ad",
".social-share",
".related-posts",
".newsletter",
".cookie-notice",
"button",
"form",
];
Structure-preserving extraction:
const walkNode = (node) => {
  if (node.nodeType === Node.TEXT_NODE) {
    content += node.textContent;
  } else if (node.nodeType === Node.ELEMENT_NODE) {
    const tag = node.tagName.toLowerCase();
    if (["p", "div", "h1", "h2", "li"].includes(tag)) content += "\n";
    if (["h1", "h2", "h3"].includes(tag)) content += "\n## ";
    if (tag === "li") content += "• ";
    for (const child of node.childNodes) walkNode(child);
  }
};
Result: Clean, formatted text with headings and bullet points preserved.
Auto-Publish: Skip the Draft Stage
Previously: Create draft → Review → Publish → Chunk → Embed
Now: Save from extension → Immediately published and searchable
// Extension sends:
body: JSON.stringify({
org_id: orgId,
title,
content,
status: "published", // Skip the draft!
});
// API automatically queues for chunking:
if (body.status === "published") {
await queue.add("chunk-memory", { memoryId: memory.id });
}
Result: Save an article → Immediately searchable in chat. No extra clicks.
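On the other side of that queue, the worker picks the job up and does the chunk-and-embed work. A minimal BullMQ sketch – the queue name and the loadMemory/chunkText/embed/saveChunk helpers are assumptions:
import { Worker } from "bullmq";

// Sketch only: consume the "chunk-memory" jobs queued when a memory is published
new Worker(
  "memory-processing", // assumption: actual queue name may differ
  async (job) => {
    if (job.name !== "chunk-memory") return;
    const memory = await loadMemory(job.data.memoryId); // assumption: loads the row from Supabase
    const chunks = chunkText(memory.content);           // assumption: same chunker used for Pockets
    for (const text of chunks) {
      const embedding = await embed(text);              // assumption: embedding via OpenRouter
      await saveChunk({ memoryId: memory.id, text, embedding });
    }
  },
  { connection: { host: process.env.REDIS_HOST, port: 6379 } }
);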
GitHub Repositories
vedha-pocket-extension – Chrome Extension for one-click saves
https://github.com/venki0552/vedha-pocket-extension – NEW!
The Funny Bits (More Lessons Learned)
1. The "Why Is Everything Relevant?" Disaster
The first version of CRAG graded everything as "relevant" because the prompt was too lenient. The LLM was like "well, it could be related..."
Fix: Added strict grading criteria and asked for reasoning before the score.
2. The Query Router That Said "Hello" To Everything
Intent classification kept detecting greetings in legitimate questions because the prompt prioritized "be friendly."
Before: "Hello, can you compare the AI strategies?" β intent: 'no_retrieval'
After: Only no_retrieval for actual greetings with no substantive question.
3. The Timeout Cascade
Agentic RAG has 7 LLM calls. At 10 seconds each, that's 70 seconds worst case. Original code had no timeouts. Users waited. Forever.
Fix: 10-second timeout per step, graceful fallbacks:
const routerResult = await fetchWithTimeout(
url,
options,
LLM_TIMEOUT_MS // 10 seconds max
);
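There's nothing magic inside fetchWithTimeout – it's just fetch plus an AbortController. A minimal sketch, assuming the caller catches the abort error and falls back to default behavior:
const LLM_TIMEOUT_MS = 10_000;

// Abort the request if the LLM takes longer than the per-step budget
async function fetchWithTimeout(url: string, options: RequestInit, timeoutMs = LLM_TIMEOUT_MS) {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    return await fetch(url, { ...options, signal: controller.signal });
  } finally {
    clearTimeout(timer);
  }
}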
4. The "Multi-Query Made It Worse" Mystery
More queries = more results. But more results = more noise. CRAG was filtering out 90% of chunks.
Insight: Generate fewer, better queries. 3-5 is the sweet spot.
5. The "Why Is It All On One Line" Formatting Bug
Extension content extraction used:
content = content.replace(/\s+/g, " ");
Except \s+ matches newlines too. Every article became one giant paragraph.
Fix:
content = content.replace(/ +/g, " "); // Only spaces, preserve newlines
Updated Full Architecture
                         ┌──────────────────┐
                         │ Chrome Extension │
                         │  - Save pages    │
                         │  - Chat          │
                         └────────┬─────────┘
                                  │
                                  ▼
┌──────────────────┐     ┌──────────────────┐     ┌──────────────────┐
│   Next.js Web    │────▶│   Fastify API    │────▶│   BullMQ Worker  │
│    (Vercel)      │     │    (Railway)     │     │    (Railway)     │
│                  │     │  + Agentic RAG   │     │                  │
└──────────────────┘     └────────┬─────────┘     └────────┬─────────┘
                                  │                        │
                                  ▼                        ▼
                         ┌──────────────────┐     ┌──────────────────┐
                         │     Supabase     │     │    OpenRouter    │
                         │  (PostgreSQL +   │     │  (LLM + Embed)   │
                         │    pgvector)     │     │                  │
                         └──────────────────┘     └──────────────────┘
What's Next?
Completed
- [x] Agentic RAG Pipeline – Query routing, CRAG, adaptive retrieval
- [x] Chrome Extension – Save pages with one click
- [x] Auto-publish from extension
- [x] Real-time status streaming
Coming Soon
- [ ] Self-reflective answer grading – Retry if answer has hallucinations
- [ ] Firefox Extension
- [ ] Keyboard shortcuts – Ctrl+Shift+S to save
- [ ] Offline queue – Save when offline, sync when online
Try It Yourself
- Web App: https://vedha-pocket-web.vercel.app
- Extension: Clone from GitHub
- API: https://vedha-api-production.up.railway.app
All open source. All self-hostable.
Final Thoughts
Part 1 was about making RAG work. Part 2 is about making it smart.
The difference between "good enough" and "actually useful" is in the details:
- Understanding what kind of question you're asking
- Knowing when retrieval isn't needed
- Filtering noise before it reaches the LLM
- Meeting users where they are (in the browser)
The biggest insight: RAG isn't one thing. It's a pipeline. And every step in that pipeline is an opportunity to add intelligence.
Next up: making the system learn from feedback. If you downvote an answer, it should remember why.
Built with ❤️, even more ☕, and a deep appreciation for graceful timeouts.
– Venkat
Part 1: I Built a RAG-Powered Second Brain
GitHub Repos:
Tags: #AI #AgenticRAG #CRAG #ChromeExtension #OpenSource #Supabase #TypeScript #RAG #KnowledgeManagement