DEV Community

Shoaib Iqbal
Shoaib Iqbal

Posted on • Originally published at techcologic.com

RAG Systems with Claude: From Documentation to Production

Meta: Build production-grade RAG systems using Claude and vector search. Step-by-step guide to document retrieval, embedding, and cost optimization.

The Problem: How Do You Give Claude Your Company's Knowledge?

Claude has a 200K token context window—it can hold an entire book. But what if you need to:

  • Answer questions about docs that change monthly
  • Search across thousands of documents efficiently
  • Stay up-to-date without retraining
  • Control costs (processing 10MB of text every request is expensive)

A naive approach: throw everything into the prompt. This fails because:

  • You can't practically include all documents
  • Costs explode as documents grow
  • Irrelevant context confuses the model
  • Updates require new deployments

This is the RAG problem: Retrieval-Augmented Generation.

The Solution: Retrieve Relevant Context, Then Generate

RAG works in two steps:

1. RETRIEVAL: User asks a question
   → Search your documents for relevant context
   → Return top-5 most similar passages

2. GENERATION: Feed Claude the question + retrieved context
   → Claude answers based on your documents
   → Returns answer with citations
Enter fullscreen mode Exit fullscreen mode

This is powerful because:

  • Only relevant documents are processed (low cost)
  • Your docs can be updated independently
  • Answers are grounded in your knowledge
  • Fully auditable (you see which docs were used)

How Techcologic Builds RAG Systems

We use a three-layer architecture:

Layer 1: Embedding & Vector Search

Step 1: Chunk your documents into passages (500-1000 tokens each)

Document: "Claude API Overview.pdf" (200 pages)

Chunks: [
  "Claude is a large language model trained by Anthropic...",
  "To use Claude, you need an API key from...",
  "Claude 3 family includes Opus, Sonnet, and Haiku...",
  ... (200+ chunks)
]
Enter fullscreen mode Exit fullscreen mode

Step 2: Convert chunks to embeddings

Embedding Service: text-embedding-3-small (or Claude's embedding)
Chunk: "Claude is a large language model..."

Vector: [0.123, -0.456, 0.789, ..., 0.234] (1536 dimensions)
Enter fullscreen mode Exit fullscreen mode

Step 3: Store in vector database

Database: pgvector (PostgreSQL + vector extension)
        OR Pinecone, Weaviate, Milvus (cloud)

Table: documents
├─ id: chunk_id
├─ text: "Claude is a..."
├─ vector: [embeddings]
├─ source: "Claude API Overview.pdf"
└─ updated_at: 2024-06-15
Enter fullscreen mode Exit fullscreen mode

Layer 2: Retrieval on Query

When a user asks a question:

# 1. Embed the user's question
user_question = "How do I use Claude with streaming?"
query_vector = embed_model.embed(user_question)

# 2. Find similar documents in your database
similar = vector_db.search(
    query_vector,
    top_k=5,
    min_similarity=0.7
)

# 3. Result: Top-5 passages from your docs
retrieved = [
    {
        "text": "Claude supports streaming via server-sent events...",
        "source": "API Guide.pdf",
        "similarity": 0.94
    },
    ... (4 more)
]
Enter fullscreen mode Exit fullscreen mode

Layer 3: Generation with Claude

# Construct the augmented prompt
prompt = f"""
Use the following context from Techcologic documentation:

{retrieved_context}

User question: {user_question}

Answer the question using ONLY the context above.
If the answer isn't in the context, say: "I don't have information on this."
Include citations: (Source: document_name)
"""

# Call Claude with your knowledge
response = claude.message(prompt, max_tokens=500)
Enter fullscreen mode Exit fullscreen mode

Real Example: Internal Knowledge Base

Scenario: Techcologic's 50-page engineering handbook, constantly updated.

Without RAG:

  • Include entire handbook in every prompt (150K tokens)
  • Cost: $2.25 per query (expensive!)
  • Fails when handbook exceeds context window

With RAG (Techcologic approach):

  • Store handbook chunks in vector database
  • Retrieve only relevant sections per query (2-5K tokens)
  • Cost: $0.03 per query (75x cheaper!)
  • Handbook can grow unlimited

Comparison Table:

Approach Cost per Query Latency Scalability Updates
Naive (full context) $2-5 5-10s Limited to token window Requires redeploy
RAG with pgvector $0.02-0.05 1-2s Unlimited docs Instant
RAG + caching $0.005-0.01 <500ms Unlimited docs Instant

Building RAG Step-by-Step

Step 1: Prepare Documents

1. Collect your documents (PDFs, Markdown, text)
2. Extract text (PyPDF2, pdfplumber for PDFs)
3. Chunk into 500-1000 token pieces
4. Store in database with metadata
Enter fullscreen mode Exit fullscreen mode

Step 2: Set Up Vector Database

Option A: PostgreSQL + pgvector (self-hosted)
Option B: Pinecone (serverless)
Option C: Weaviate (open-source)

We recommend pgvector for most teams—it's cheap, reliable, debuggable.
Enter fullscreen mode Exit fullscreen mode

Step 3: Embed & Index

from anthropic import Anthropic

# Embed each document chunk
embeddings = model.embed(chunks)

# Store in vector DB
vector_db.insert(chunks, embeddings, metadata)
Enter fullscreen mode Exit fullscreen mode

Step 4: Build Retrieval Function

def retrieve_context(question: str, top_k: int = 5):
    query_vector = embed_model.embed(question)
    results = vector_db.search(query_vector, top_k)
    return [r.text for r in results]
Enter fullscreen mode Exit fullscreen mode

Step 5: Create Answer Function

def answer_question(question: str):
    context = retrieve_context(question)
    prompt = f"""Context: {context}

    Question: {question}
    Answer:"""

    response = claude.message(prompt, max_tokens=500)
    return response
Enter fullscreen mode Exit fullscreen mode

Common Pitfalls (and How to Avoid Them)

Problem Cause Solution
Low quality answers Irrelevant documents retrieved Improve chunking strategy, increase similarity threshold
High costs Too many tokens sent to Claude Optimize chunk size, retrieve fewer docs, use caching
Stale answers Documents never updated Set up automated sync, monitor freshness
Hallucination Model invents info not in docs Use system prompt: "Only answer from provided context"

Techcologic's RAG Stack

For production systems, we use:

Documents → Chunking (LangChain)
         → Embedding (text-embedding-3-small)
         → Storage (pgvector on RDS)
         → Retrieval (vector similarity search)
         → Generation (Claude API)
         → Monitoring (Langsmith, custom logging)
Enter fullscreen mode Exit fullscreen mode

Result: Production RAG systems that handle millions of queries, stay accurate, and cost <$0.02 per question.

Getting Started Today

If you're building with Claude and need to ground answers in your documents:

  1. Start small → Pick 5-10 important docs
  2. Chunk them → 500-token pieces
  3. Embed them → Use OpenAI embeddings or Claude's
  4. Store them → PostgreSQL + pgvector (free tier available)
  5. Test retrieval → Verify top-5 results make sense
  6. Add Claude → Build the augmented prompt
  7. Monitor → Track retrieval quality, token usage

This takes a weekend to prototype, a few days to production.

Ready to ship RAG? Book a Claude architecture call at Techcologic.


Key Takeaways:

  • RAG lets you augment Claude with your documents
  • Vector search finds relevant context in milliseconds
  • Costs drop 10-100x vs. naive approaches
  • Production RAG systems are reliable and maintainable

Top comments (1)

Collapse
 
alexshev profile image
Alex Shev

The production RAG question I keep coming back to is freshness versus authority. Embeddings help find likely context, but the answer still needs to preserve which source actually backs the claim and when that source was last read.