Dipro Bhowmik

Steal my code: I built a RAG agent for sales people

Our sales team kept bugging us with the same questions: "How does the API handle rate limiting?" "Does the API support pagination?" "Can you explain embeddings to a customer?"

I thought, let me build them an AI agent instead.

I call him Allen - he searches our docs and answers questions instantly. No more Slack interruptions, no more stale wiki pages, no more "let me get back to you."

Source code here. Feel free to put this tutorial into Claude Code and see what it comes up with.

Here's how I did it.

What is RAG?

If you've been living under a rock for the last 3 years, let me explain what RAG is.

RAG stands for Retrieval-Augmented Generation.

It is a fancy term for: "search your docs, feed results to an LLM, get a smart answer."

Without RAG, your vanilla agent:

  • Won't know your product
  • Will make things up about features you don't have

With RAG:

  • The agent searches your actual docs
  • Uses that context to answer questions

RAG is also really easy to implement these days.
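The whole pattern fits in a few lines. Here's a minimal sketch (the names searchDocs and llm are illustrative, not Allen's actual code):

async function answerWithRag(question) {
  const docs = await searchDocs(question); // 1. retrieve: pull relevant chunks from your docs
  const prompt = `Context:\n${docs.join("\n\n")}\n\nQuestion: ${question}`; // 2. augment the prompt
  return await llm.invoke(prompt); // 3. generate: the LLM answers grounded in that context
}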

The infra / stack

We're using a really common stack today:

  • Frontend: Next.js (streaming chat interface)
  • Agent: LangChain (handles the search→think→respond loop)
  • LLM: Claude Sonnet 4.5 (smart enough to know when to search)
  • Search: Vector database with semantic + keyword search

Total setup time was 1 day with Cursor (one morning and then another afternoon a few weeks later).

Most of that time was fussing with the frontend to get the chat interface to look nice.

Setting up the UI

Frontend is just Next.js. Messages go to /api/chat and responses stream back:

const response = await fetch("/api/chat", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({ messages: apiMessages }),
});

By default the results will seem slow, because the client will wait for the whole response before rendering. I used Server-Sent Events so that words appear one-by-one like in ChatGPT or Cursor.
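Here's roughly what the client-side reader looks like (a sketch, assuming the route streams plain text chunks; renderAssistantMessage stands in for whatever updates your UI):

const reader = response.body.getReader();
const decoder = new TextDecoder();
let text = "";
while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  text += decoder.decode(value, { stream: true }); // append each chunk as it arrives
  renderAssistantMessage(text); // re-render so words appear one by one
}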

In light-mode, Allen is a random old white dude. In dark mode, Allen becomes someone trapped in a computer:

Building the agent

Wiring up to an LLM

I'm using Claude because I had an API key for it already:

import { ChatAnthropic } from "@langchain/anthropic";

const model = new ChatAnthropic({
  model: "claude-sonnet-4-5-20250929",
  apiKey: process.env.ANTHROPIC_API_KEY,
  thinking: {
    type: "enabled",
    budget_tokens: 10000,
  },
  maxTokens: 20000,
});

I initially did not have the "thinking" step there, but I realized it made my users trust the thing more. It's mostly filling the context window with reflection, but it gives users the feeling that Allen is really trying.

Setting up LangChain agent

LangChain makes this easy:

import { createAgent } from "langchain";

const agent = createAgent({
  model,
  tools: [searchDocuments],
  systemPrompt: `
You are Allen (Al), a documentation assistant for Shaped. 
Help users find answers about the Shaped platform and API.

<basic_guidelines>
- Be concise. Prefer short, direct answers over long explanations.
- Use code examples when they clarify the answer. 
- Use search tools at your disposal. 
</basic_guidelines>

<prefer_search>
- Use search tools at your disposal. Run search at most 4 times per question.
- After retrieving search results, think carefully about whether the results are relevant to the user's query. If the results don't contain the information needed to answer the question, try searching again with a different query or search mode.
- After 4 searches, if the content is still not found or not relevant, tell the user: "I couldn't find information about this in the Shaped documentation. This topic may not be covered in the available documentation."
- When you have enough context, answer without extra searches.
- Prefer a single, focused search but use multiple when required.
</prefer_search>`,
});

The system prompt is where you teach the agent how to do things. I added the prefer_search section to prevent it from hallucinating.

Making a Search tool

LangChain allows agents to have "tools", which are basically functions that they can call. I wrote a search tool, which is where RAG actually happens.

The agent can call this search function whenever the context requires it:

import { tool } from "langchain";
import { z } from "zod";

export const searchDocuments = tool(
  async (input) => {
    // Search happens here
  },
  {
    name: "search_documents",
    description: "Search the Shaped documentation for relevant content about a given topic",
    schema: z.object({
      query: z.string().describe("The search query to find relevant documents"),
      mode: z.enum(["vector", "lexical", "hybrid"])
        .describe(`The search mode. 
          Choose "vector" for semantic search: to return docs containing similar semantic meaning or phrase content to the input. 
          Choose "lexical" for BM25 lexical search: to return docs with specific keywords or IDs.
          Choose "hybrid" for a mix of strategies: 50% vector and 50% lexical.`)
    }),
  }
);

The description tells Allen what the tool does, and the schema describes the inputs he will use.
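The tool body itself just calls the search API and hands the hits back to the model as text. A rough sketch (getSearchResults is defined in the "Actually hitting the search API" section below; the title/content fields are illustrative):

async ({ query, mode }) => {
  const results = await getSearchResults(query, mode); // hit the search backend
  // Format the hits so the model can read them as one block of context
  return results.map((r) => `## ${r.title}\n${r.content}`).join("\n\n");
}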

Three Ways to Search

I implemented three different ways to search the documentation, to catch the different ways people phrase their queries. A single search tool handles all of them.

1. Vector Search (Semantic)

This is the "AI-powered" search everyone talks about. The items in your DB are turned into vectors using an embedding model, the query is turned into a vector too, and the search engine compares the query vector against the item vectors.

Good for: Natural language queries like "How do I authenticate?" (matches "authentication", "login", "API keys")

Bad for: "BM25" (may not find an exact match)

Since I'm using Shaped, I can do this with SQL:

SELECT * 
FROM text_search(
    query='$query', 
    text_embedding_ref='text_embedding', 
    mode='vector'
)
LIMIT 20

2. Keyword Search (Lexical)

This is old-school keyword search. Uses BM25 algorithm to find exact keyword matches.

Good for: "rate_limit parameter" (finds exact API names)

Bad for: "How do I log in?" (doesn't understand paraphrasing)

SELECT * 
FROM text_search(
    query='$query', 
    mode='lexical',
    fuzziness=0
)
LIMIT 20

3. Hybrid Search

This runs both searches and merges the results. I explicitly weight the results 50/50 between the two approaches, but you can change the mix.

Good for: Almost everything.

SELECT *
FROM text_search(
       query='$query_text', 
       mode='vector',
       text_embedding_ref='text_embedding', 
       name='vector_search'
     ),
     text_search(
       query='$query_text', 
       mode='lexical', 
       name='lexical_search'
     )
ORDER BY score(expression='0.5 * retrieval.vector_search + 0.5 * retrieval.lexical_search')
LIMIT 20

I default to hybrid. Semantic search catches paraphrasing, keyword search ensures exact terms aren't missed.

Actually hitting the search API

The search tool hits your vector database API. I'm using Shaped because I work there and get a free account:

// Shown here with the hybrid query; in practice you'd switch the SQL on `mode`.
export async function getSearchResults(query, mode) {
  const res = await fetch(SEARCH_API_URL, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "x-api-key": API_KEY,
    },
    body: JSON.stringify({
      query: `SELECT *
FROM text_search(
       query='$query_text', 
       mode='vector',
       text_embedding_ref='text_embedding', 
       name='vector_search'
     ),
     text_search(
       query='$query_text', 
       mode='lexical', 
       name='lexical_search'
     )
ORDER BY score(expression='0.5 * retrieval.vector_search \
    + 0.5 * retrieval.lexical_search')
LIMIT 20`,
      parameters: {
        query_text: query,
        mode: mode
      },        
      return_metadata: true,
    }),
  });
  return await res.json();
}

Lessons Learned

1. Chunk Size Matters

I started with tiny 100-token chunks. Allen retrieved single sentences that didn't give it enough context to answer meaningfully.

The sweet spot I found was chunks of about a paragraph each. Small enough to be specific, large enough to carry context.
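If your docs are in markdown, paragraph-level chunking can be as naive as splitting on blank lines (a sketch; adapt it to your docs' structure):

// Naive paragraph chunker: one chunk per markdown paragraph.
function chunkDoc(markdown, source) {
  return markdown
    .split(/\n\s*\n/)            // split on blank lines
    .map((p) => p.trim())
    .filter((p) => p.length > 0)
    .map((content, i) => ({ content, source, chunk_index: i }));
}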

2. Metadata Is Clutch

Add metadata to every chunk so you can display it to the user without an additional hop:

{
  "content": "...",
  "source": "API Reference",
  "section": "Tables",
  "h1": "Create Table",
  "h2": "Parameters",
  "last_updated": "2024-01-15"
}

It also lets you do filtered searches: "What changed in the API recently?" filters by last_updated.

3. Don't underestimate the system prompts

My first version was garbage: too simple. Key insights:

  • Tell it to search before answering (otherwise it guesses)
  • Set a max search limit (or it loops forever on hard questions)
  • Tell it to admit when docs don't have the answer
  • Make it cite sources (sales team loves this; see the snippet below)
  • It's better to be more verbose and specific
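For example, the citation rule can be its own tagged section in the system prompt (illustrative wording, not Allen's exact prompt):

<cite_sources>
- End every answer with the doc sections you used, e.g. "Source: API Reference > Tables".
- If the docs don't cover the question, say so instead of inventing a citation.
</cite_sources>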

4. Monitor what is retrieved and cited

If the agent retrieves docs but doesn't cite them, your retrieval is not great. I log every search and check if the results actually appear in the response.
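A crude version of that check (names illustrative, assuming each retrieved chunk carries a title):

// Log every search and flag answers that never mention a retrieved doc title.
function answerCitesRetrieval(answer, retrievedChunks) {
  console.log("retrieved:", retrievedChunks.map((c) => c.title));
  return retrievedChunks.some((c) => answer.includes(c.title));
}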

Stuff I'd add next

Re-ranking: After the initial search, run the results through another model to score semantic relevance. You can use a late-interaction model like ColBERT for this.

Feedback loop: Track which answers get thumbs-down, use that to improve chunking.

Multi-modal search: Our docs have diagrams; it would be cool to search those too.

Boring stuff: ROI

This was fun to build, but what makes it appealing to my manager is:

  • Makes the pre-sales team feel powerful
  • Fewer engineering Slack questions
  • Sales can answer questions without pulling in a technical person

Setup cost: <2 days of dev time, Shaped usage is within the $100 free tier, plus LLM API calls.

Build it yourself

I'm 90% sure that if you give Cursor this tutorial and the GitHub repo, it could build you the same thing in under an hour.

The actual code is pretty simple:

  1. Convert documentation into a chunked JSON file
  2. Upload JSON file to Shaped
  3. Wire up LangChain agent with search tool
  4. Deploy to Vercel or wherever
  5. Send to your boss

The hardest part isn't the code, it's figuring out how to chunk your docs appropriately. But once it clicks, you've got an agent that actually knows your product.

Roast my approach in the comments lol
