Tyson Cung
Building a RAG Pipeline That Actually Works

Most RAG tutorials show 20 lines of LangChain and call it production-ready. Then you try it on real documents and get garbage results.

Here's what every tutorial shows:

// The "tutorial RAG" that doesn't work
const loader = new DirectoryLoader('./docs');
const documents = await loader.load();
const textSplitter = new CharacterTextSplitter({
  chunkSize: 1000, 
  chunkOverlap: 0
});
const docs = await textSplitter.splitDocuments(documents);

const vectorstore = await Chroma.fromDocuments(docs, new OpenAIEmbeddings());
const retriever = vectorstore.asRetriever();

const chain = RetrievalQAChain.fromLLM(
  new OpenAI(), 
  retriever
);

This works great on the AI papers the tutorial author tested it on. Try it on your company's actual documents and you'll get irrelevant results, missed information, and confused users.

At my startup, we process thousands of documents daily. Contracts, manuals, reports, presentations. I've spent 6 months building RAG that actually works in production. Here's what I learned.

Why Most RAG Implementations Fail

Bad Chunking: Character splitting breaks sentences mid-thought. You lose context and get fragments that don't make sense.

Wrong Embedding Model: OpenAI's text-embedding-ada-002 is general-purpose but not optimized for your domain. Financial documents need different embeddings than technical manuals.

No Metadata Filtering: Real documents have structure - chapters, sections, authors, dates. Pure vector similarity ignores this valuable information.

Retrieval Without Reranking: Vector search finds similar text, not necessarily relevant text. The top 5 vector results might all be from the same paragraph.

No Quality Scoring: You have no idea if the retrieved context is good enough to answer the question. Users get confident-sounding wrong answers.
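The chunking failure is easy to demonstrate. A hypothetical fixed-size splitter (a sketch, not any particular library) slices straight through sentences and words:

```typescript
// Naive fixed-size character splitting: every `size` characters,
// regardless of sentence or word boundaries.
function naiveSplit(text: string, size: number): string[] {
  const chunks: string[] = [];
  for (let i = 0; i < text.length; i += size) {
    chunks.push(text.slice(i, i + size));
  }
  return chunks;
}

const doc = 'Payment is due within 45 days of invoice. A 2% discount applies if paid early.';
const chunks = naiveSplit(doc, 38);
// The first chunk ends mid-word, stranding "45 days" from the clause
// that gives it meaning. Vector search on that fragment is hopeless.
console.log(chunks[0]);
```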

Let me show you how we solved each of these problems.

Smart Chunking That Preserves Context

Instead of splitting on character count, we split on semantic boundaries:

interface DocumentChunk {
  id: string;
  content: string;
  metadata: {
    documentId: string;
    title: string;
    section?: string;
    pageNumber?: number;
    author?: string;
    documentType: string;
    createdAt: Date;
  };
  embedding?: number[];
}

export class SmartChunker {
  private maxChunkSize = 512; // tokens, not characters
  private overlapSize = 50; // token overlap between chunks

  async chunkDocument(document: Document): Promise<DocumentChunk[]> {
    // First, split on structural boundaries
    const sections = this.splitOnStructure(document.content);

    const chunks: DocumentChunk[] = [];

    for (const section of sections) {
      // Then split large sections into smaller chunks
      const sectionChunks = await this.splitSection(section, document);
      chunks.push(...sectionChunks);
    }

    return chunks;
  }

  private splitOnStructure(content: string): Array<{ text: string; section?: string }> {
    // Match ATX-style markdown headers (# Title). Setext headers
    // (a title underlined with === or ---) span two lines, so they
    // need lookahead and are handled separately below.
    const atxHeader = /^#+\s+(.+)$/;
    const sections: Array<{ text: string; section?: string }> = [];

    let currentSection = '';
    let currentHeader = '';

    const lines = content.split('\n');

    for (let i = 0; i < lines.length; i++) {
      const line = lines[i];
      const atxMatch = line.match(atxHeader);
      // Setext header: a non-empty line whose next line is === or ---
      const setextHeader =
        !atxMatch && line.trim() && /^(={3,}|-{3,})\s*$/.test(lines[i + 1] ?? '')
          ? line.trim()
          : null;

      if (atxMatch || setextHeader) {
        // Save previous section
        if (currentSection.trim()) {
          sections.push({
            text: currentSection.trim(),
            section: currentHeader
          });
        }

        // Start new section
        currentHeader = atxMatch ? atxMatch[1] : setextHeader!;
        currentSection = '';
        if (setextHeader) i++; // skip the underline row
      } else {
        currentSection += line + '\n';
      }
    }

    // Don't forget the last section
    if (currentSection.trim()) {
      sections.push({
        text: currentSection.trim(),
        section: currentHeader
      });
    }

    return sections;
  }

  private async splitSection(
    section: { text: string; section?: string },
    document: Document
  ): Promise<DocumentChunk[]> {
    const sentences = this.splitIntoSentences(section.text);
    const chunks: DocumentChunk[] = [];

    let currentChunk = '';
    let currentTokens = 0;

    for (const sentence of sentences) {
      const sentenceTokens = await this.countTokens(sentence);

      // If adding this sentence would exceed chunk size, save current chunk
      if (currentTokens + sentenceTokens > this.maxChunkSize && currentChunk) {
        chunks.push(this.createChunk(currentChunk, section, document));

        // Start new chunk with overlap
        const overlapText = this.getLastTokens(currentChunk, this.overlapSize);
        currentChunk = overlapText + sentence;
        currentTokens = await this.countTokens(currentChunk);
      } else {
        currentChunk += (currentChunk ? ' ' : '') + sentence;
        currentTokens += sentenceTokens;
      }
    }

    // Don't forget the last chunk
    if (currentChunk.trim()) {
      chunks.push(this.createChunk(currentChunk, section, document));
    }

    return chunks;
  }

  private splitIntoSentences(text: string): string[] {
    // Simple sentence splitting that keeps the terminal punctuation
    // (you might want to use a proper NLP library)
    return text
      .split(/(?<=[.!?])\s+/)
      .map(s => s.trim())
      .filter(s => s.length > 10); // Skip very short fragments
  }

  private async countTokens(text: string): Promise<number> {
    // Rough approximation: ~4 characters per token for English text.
    // Swap in a real tokenizer (e.g. tiktoken) for accurate counts.
    return Math.ceil(text.length / 4);
  }

  private getLastTokens(text: string, tokenCount: number): string {
    // Approximate the last N tokens by taking the last N words.
    const words = text.split(/\s+/);
    return words.slice(-tokenCount).join(' ') + ' ';
  }

  private createChunk(
    content: string,
    section: { section?: string },
    document: Document
  ): DocumentChunk {
    return {
      id: crypto.randomUUID(),
      content: content.trim(),
      metadata: {
        documentId: document.id,
        title: document.title,
        section: section.section,
        pageNumber: document.pageNumber,
        author: document.author,
        documentType: document.type,
        createdAt: document.createdAt,
      }
    };
  }
}

This preserves semantic boundaries and includes overlap so context doesn't get lost between chunks.
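The overlap logic can be sketched as a standalone function, using word counts as a rough stand-in for token counts (an assumption; production code should use a real tokenizer):

```typescript
// Sentence-aware chunking with overlap: when a chunk fills up, the
// last few words are carried into the next chunk so context survives
// the boundary.
function chunkSentences(sentences: string[], maxWords: number, overlapWords: number): string[] {
  const chunks: string[] = [];
  let current: string[] = [];

  for (const sentence of sentences) {
    const words = sentence.split(/\s+/);
    if (current.length + words.length > maxWords && current.length > 0) {
      chunks.push(current.join(' '));
      current = current.slice(-overlapWords); // carry overlap forward
    }
    current.push(...words);
  }
  if (current.length > 0) chunks.push(current.join(' '));

  return chunks;
}

const out = chunkSentences(
  ['The vendor shall invoice monthly.', 'Payment is due in 45 days.', 'Late payments accrue interest.'],
  10,
  3
);
// Adjacent chunks share their boundary words, so a clause is never
// completely severed from the sentence before it.
```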

Domain-Specific Embeddings

We use different embedding models based on document type:

// Assumed SDK imports for the providers below (the official
// 'openai' and '@aws-sdk/client-bedrock-runtime' packages):
import OpenAI from 'openai';
import { BedrockRuntimeClient, InvokeModelCommand } from '@aws-sdk/client-bedrock-runtime';

interface EmbeddingProvider {
  name: string;
  model: string;
  dimensions: number;
  embed(text: string): Promise<number[]>;
}

class OpenAIEmbeddings implements EmbeddingProvider {
  name = 'openai';
  model = 'text-embedding-3-large';
  dimensions = 3072;

  // Reuse one client instead of constructing it on every call
  private client = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

  async embed(text: string): Promise<number[]> {
    const response = await this.client.embeddings.create({
      model: this.model,
      input: text,
    });

    return response.data[0].embedding;
  }
}

class BedrockEmbeddings implements EmbeddingProvider {
  name = 'bedrock';
  model = 'amazon.titan-embed-text-v1';
  dimensions = 1536;

  private client = new BedrockRuntimeClient({ region: 'us-east-1' });

  async embed(text: string): Promise<number[]> {
    const command = new InvokeModelCommand({
      modelId: this.model,
      contentType: 'application/json',
      accept: 'application/json',
      body: JSON.stringify({ inputText: text }),
    });

    const response = await this.client.send(command);
    const result = JSON.parse(new TextDecoder().decode(response.body));

    return result.embedding;
  }
}

export class EmbeddingPipeline {
  private providers = {
    technical: new OpenAIEmbeddings(),
    legal: new BedrockEmbeddings(),
    general: new OpenAIEmbeddings(),
  };

  async generateEmbeddings(chunks: DocumentChunk[]): Promise<DocumentChunk[]> {
    const results: DocumentChunk[] = [];

    // Process in batches to avoid rate limits
    const batchSize = 10;

    for (let i = 0; i < chunks.length; i += batchSize) {
      const batch = chunks.slice(i, i + batchSize);
      console.log(`Processing batch ${Math.floor(i / batchSize) + 1}/${Math.ceil(chunks.length / batchSize)}`);

      const batchResults = await Promise.all(
        batch.map(async (chunk) => {
          const provider = this.selectProvider(chunk);
          const embedding = await provider.embed(chunk.content);

          return {
            ...chunk,
            embedding,
          };
        })
      );

      results.push(...batchResults);

      // Rate limiting - wait between batches
      if (i + batchSize < chunks.length) {
        await new Promise(resolve => setTimeout(resolve, 1000));
      }
    }

    return results;
  }

  private selectProvider(chunk: DocumentChunk): EmbeddingProvider {
    const docType = chunk.metadata.documentType?.toLowerCase();

    if (docType?.includes('technical') || docType?.includes('api')) {
      return this.providers.technical;
    } else if (docType?.includes('legal') || docType?.includes('contract')) {
      return this.providers.legal;
    } else {
      return this.providers.general;
    }
  }
}
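One caveat worth making explicit: text-embedding-3-large and Titan produce vectors of different dimensions (3072 vs 1536) in unrelated spaces, so chunks embedded by different providers must never be compared against each other or stored in the same vector index. A small cosine-similarity helper (a minimal sketch) makes the mismatch fail loudly instead of silently returning garbage:

```typescript
// Cosine similarity only makes sense between vectors from the SAME
// embedding model; mixing providers silently breaks retrieval.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error(`Dimension mismatch: ${a.length} vs ${b.length}`);
  }
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

In practice this means one vector index per embedding model: either keep a provider per collection, or store the provider name in metadata and always filter on it before searching.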

MongoDB Atlas Vector Search

Instead of spinning up a separate vector database, we use MongoDB Atlas Vector Search. If you're already using MongoDB for your app data, this is a no-brainer:

import { MongoClient } from 'mongodb';

interface VectorSearchResult {
  chunk: DocumentChunk;
  score: number;
}

export class MongoVectorStore {
  private client: MongoClient;
  private db: string;
  private collection: string;

  constructor(connectionString: string, database: string, collection: string) {
    this.client = new MongoClient(connectionString);
    this.db = database;
    this.collection = collection;
  }

  async saveChunks(chunks: DocumentChunk[]): Promise<void> {
    const collection = this.client.db(this.db).collection(this.collection);

    // MongoDB expects documents, not our interface
    const documents = chunks.map(chunk => ({
      _id: chunk.id,
      content: chunk.content,
      metadata: chunk.metadata,
      embedding: chunk.embedding,
      createdAt: new Date(),
    }));

    await collection.insertMany(documents);
  }

  async searchSimilar(
    queryEmbedding: number[],
    limit: number = 10,
    filters?: any
  ): Promise<VectorSearchResult[]> {
    const collection = this.client.db(this.db).collection(this.collection);

    // Build aggregation pipeline
    const pipeline: any[] = [
      {
        $vectorSearch: {
          index: 'vector_index', // You need to create this in Atlas
          path: 'embedding',
          queryVector: queryEmbedding,
          numCandidates: limit * 3, // Search more candidates than we return
          limit: limit,
        }
      }
    ];

    // Post-filter on metadata. Note: filtering AFTER $vectorSearch can
    // return fewer than `limit` results. For large collections, prefer
    // the `filter` option inside $vectorSearch itself (requires the
    // filtered fields to be indexed as type "filter" in Atlas).
    if (filters) {
      pipeline.push({ $match: filters });
    }

    // Add score to results
    pipeline.push({
      $addFields: {
        score: { $meta: 'vectorSearchScore' }
      }
    });

    const results = await collection.aggregate(pipeline).toArray();

    return results.map(doc => ({
      chunk: {
        id: doc._id,
        content: doc.content,
        metadata: doc.metadata,
        embedding: doc.embedding,
      },
      score: doc.score,
    }));
  }

  async searchWithMetadata(
    query: string,
    embedding: number[],
    filters: {
      documentType?: string;
      author?: string;
      dateRange?: { start: Date; end: Date };
      sections?: string[];
    }
  ): Promise<VectorSearchResult[]> {
    // Build MongoDB filter from our filters
    const mongoFilters: any = {};

    if (filters.documentType) {
      mongoFilters['metadata.documentType'] = filters.documentType;
    }

    if (filters.author) {
      mongoFilters['metadata.author'] = filters.author;
    }

    if (filters.dateRange) {
      mongoFilters['metadata.createdAt'] = {
        $gte: filters.dateRange.start,
        $lte: filters.dateRange.end
      };
    }

    if (filters.sections?.length) {
      mongoFilters['metadata.section'] = { $in: filters.sections };
    }

    return this.searchSimilar(embedding, 20, mongoFilters);
  }
}

The magic of MongoDB Atlas Vector Search is the $vectorSearch aggregation stage. You get vector similarity AND traditional database filtering in one query.
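The `index: 'vector_index'` reference assumes you've already created a vector search index in Atlas. A sketch of what that definition might look like for our chunk schema (field names match what we store; double-check the exact syntax against your Atlas version's docs):

```json
{
  "fields": [
    {
      "type": "vector",
      "path": "embedding",
      "numDimensions": 3072,
      "similarity": "cosine"
    },
    { "type": "filter", "path": "metadata.documentType" },
    { "type": "filter", "path": "metadata.author" },
    { "type": "filter", "path": "metadata.createdAt" }
  ]
}
```

Declaring metadata fields as type `filter` is what lets you pre-filter inside `$vectorSearch` itself, rather than post-filtering with `$match`. Note that `numDimensions` must match your embedding model, which is another reason not to mix providers in one collection.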

Reranking for Better Results

Vector search finds similar text. Reranking finds relevant text:

interface RankedResult {
  chunk: DocumentChunk;
  vectorScore: number;
  relevanceScore: number;
  finalScore: number;
}

export class CrossEncoderReranker {
  private model: string;

  constructor(model = 'cross-encoder/ms-marco-MiniLM-L-6-v2') {
    this.model = model;
  }

  async rerank(
    query: string,
    results: VectorSearchResult[],
    topK: number = 5
  ): Promise<RankedResult[]> {
    // For each result, calculate relevance score
    const rankedResults = await Promise.all(
      results.map(async (result) => {
        const relevanceScore = await this.calculateRelevance(
          query,
          result.chunk.content
        );

        // Combine vector similarity and relevance
        const finalScore = this.combineScores(
          result.score,
          relevanceScore
        );

        return {
          chunk: result.chunk,
          vectorScore: result.score,
          relevanceScore,
          finalScore,
        };
      })
    );

    // Sort by final score and return top K
    return rankedResults
      .sort((a, b) => b.finalScore - a.finalScore)
      .slice(0, topK);
  }

  private async calculateRelevance(query: string, text: string): Promise<number> {
    // In a real implementation, you'd use a proper cross-encoder model
    // For now, we'll use a simple heuristic

    const queryTokens = query.toLowerCase().split(/\s+/);
    const textTokens = text.toLowerCase().split(/\s+/);

    let matches = 0;
    for (const token of queryTokens) {
      if (textTokens.some(textToken => 
        textToken.includes(token) || token.includes(textToken)
      )) {
        matches++;
      }
    }

    return matches / queryTokens.length;
  }

  private combineScores(vectorScore: number, relevanceScore: number): number {
    // Weighted combination - tune these weights based on your data
    return (vectorScore * 0.7) + (relevanceScore * 0.3);
  }
}
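To see how the 70/30 weighting plays out, compare a chunk that is superficially similar but misses the query terms against one that actually contains them (illustrative numbers, not measured scores):

```typescript
// Weighted score combination from the reranker above:
// finalScore = 0.7 * vectorScore + 0.3 * relevanceScore.
const combine = (vec: number, rel: number) => vec * 0.7 + rel * 0.3;

// Chunk A: very similar wording, but zero query-term overlap.
const a = combine(0.90, 0.0);  // 0.63
// Chunk B: moderately similar, but contains the query terms.
const b = combine(0.75, 0.8);  // 0.765

// B outranks A despite the lower vector score.
```

The weights are a starting point; if your embeddings are strong and your queries are vague, push more weight onto the vector score, and vice versa.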

Retrieval-Augmented Generation

Now we put it all together:

interface RAGResponse {
  answer: string;
  sources: Array<{
    content: string;
    metadata: any;
    score: number;
  }>;
  confidence: number;
}

export class RAGPipeline {
  constructor(
    private vectorStore: MongoVectorStore,
    private embeddings: EmbeddingProvider,
    private reranker: CrossEncoderReranker,
    private aiGateway: string // Our gateway from the previous article
  ) {}

  async generateAnswer(
    query: string,
    filters?: any
  ): Promise<RAGResponse> {
    // 1. Generate embedding for the query
    const queryEmbedding = await this.embeddings.embed(query);

    // 2. Search for relevant chunks
    const searchResults = await this.vectorStore.searchWithMetadata(
      query,
      queryEmbedding,
      filters || {}
    );

    if (searchResults.length === 0) {
      return {
        answer: "I don't have enough information to answer that question.",
        sources: [],
        confidence: 0,
      };
    }

    // 3. Rerank results for relevance
    const rankedResults = await this.reranker.rerank(query, searchResults, 5);

    // 4. Check if we have good enough context
    const avgScore = rankedResults.reduce((sum, r) => sum + r.finalScore, 0) / rankedResults.length;

    if (avgScore < 0.3) { // Tune this threshold
      return {
        answer: "I found some related information, but I'm not confident it answers your question. Could you rephrase or be more specific?",
        sources: rankedResults.map(r => ({
          content: r.chunk.content,
          metadata: r.chunk.metadata,
          score: r.finalScore,
        })),
        confidence: avgScore,
      };
    }

    // 5. Build context for the AI
    const context = rankedResults
      .map(r => r.chunk.content)
      .join('\n\n');

    // 6. Generate answer using our AI gateway
    const response = await fetch(this.aiGateway + '/chat', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({
        model: 'claude-3.5-sonnet',
        messages: [
          {
            role: 'system',
            content: `You are a helpful assistant that answers questions based on the provided context. 
                      If the context doesn't contain enough information to answer the question, say so.
                      Always cite which part of the context you used for your answer.`
          },
          {
            role: 'user',
            content: `Context:\n${context}\n\nQuestion: ${query}`
          }
        ],
        maxTokens: 500,
      })
    });

    const aiResult = await response.json();

    return {
      answer: aiResult.content,
      sources: rankedResults.map(r => ({
        content: r.chunk.content,
        metadata: r.chunk.metadata,
        score: r.finalScore,
      })),
      confidence: avgScore,
    };
  }
}
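The confidence gate in step 4 is worth isolating, since it's what turns "confident-sounding wrong answers" into honest refusals. A minimal standalone version:

```typescript
// Average the reranked scores and refuse to answer below a threshold.
// 0.3 matched our data; tune it against a labeled evaluation set.
function confidenceGate(
  scores: number[],
  threshold = 0.3
): { confident: boolean; avg: number } {
  const avg = scores.reduce((sum, x) => sum + x, 0) / scores.length;
  return { confident: avg >= threshold, avg };
}

console.log(confidenceGate([0.8, 0.7, 0.6]));  // answer normally
console.log(confidenceGate([0.2, 0.15, 0.1])); // ask the user to rephrase
```

An average is the simplest choice; if your top result is usually decisive, gating on the maximum score instead is a reasonable variant.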

Quality Comparison: Before vs After

Here's a real example from our system. User question: "What are the payment terms in the Q3 vendor contract?"

Before (naive RAG):

  • Retrieved: 10 chunks about "Q3", "payments", and "contracts" from different documents
  • Answer: "The payment terms are 30 days. Q3 revenue was up 15%."
  • Wrong! Mixed up the vendor contract with financial reports.

After (our approach):

  • Filtered for document type: "contract"
  • Filtered for date range: Q3 2024
  • Vector search found relevant sections
  • Reranking prioritized actual payment terms over revenue mentions
  • Answer: "According to the Q3 vendor contract with Acme Corp, payment terms are NET 45 days from invoice date, with a 2% early payment discount if paid within 10 days."
  • Correct, specific, and cites the source!

The difference is night and day. Users went from getting confused, mixed-up answers to getting precise, actionable information. Our support tickets about "wrong AI responses" dropped to almost zero.
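In code, the "after" query maps directly onto the `searchWithMetadata` filters from earlier (values are illustrative, reconstructing the example rather than quoting our production config):

```typescript
// Filters for "payment terms in the Q3 vendor contract":
// scope to contracts and to the Q3 2024 date range before any
// vector similarity is considered.
const filters = {
  documentType: 'contract',
  dateRange: {
    start: new Date('2024-07-01'),
    end: new Date('2024-09-30'),
  },
};
```

Half the win in the example above came from these filters alone: the financial reports that polluted the naive answer never even entered the candidate set.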

MongoDB Atlas Advantages

Using MongoDB Atlas Vector Search instead of a dedicated vector database has real advantages:

  1. Same database for app data and vectors - no data sync issues
  2. Rich metadata filtering - use MongoDB's powerful query language
  3. Transactions - update documents and vectors atomically
  4. $vectorSearch aggregation - combine vector similarity with complex business logic
  5. No new infrastructure - if you already use MongoDB, you're done

Here's a complex query that would be painful in a pure vector DB:

// Find documents similar to the query, but only from contracts 
// signed in the last 6 months by authors in the legal team,
// excluding anything from the "archive" collection
const pipeline = [
  {
    $vectorSearch: {
      index: 'vector_index',
      path: 'embedding',
      queryVector: queryEmbedding,
      numCandidates: 100,
      limit: 20,
    }
  },
  {
    $match: {
      'metadata.documentType': 'contract',
      'metadata.author': { $in: ['legal-team-1', 'legal-team-2'] },
      'metadata.createdAt': { $gte: new Date(Date.now() - 6 * 30 * 24 * 60 * 60 * 1000) },
      'metadata.status': { $ne: 'archived' }
    }
  },
  {
    $lookup: {
      from: 'document_metadata',
      localField: 'metadata.documentId',
      foreignField: '_id',
      as: 'fullMetadata'
    }
  }
];

Pinecone and Weaviate both support metadata filtering, but the `$lookup` join against another collection is the part you won't get from a dedicated vector database.

Example Usage

You can see the complete RAG implementation in our examples repo at https://github.com/tysoncung/ai-platform-aws-examples/tree/main/02-rag-mongodb-atlas. It includes:

  • Smart chunking with semantic boundaries
  • Multiple embedding providers
  • MongoDB Atlas Vector Search setup
  • Cross-encoder reranking
  • Complete RAG pipeline with confidence scoring
  • CDK deployment for the infrastructure

What's Next

We've covered the foundation - gateway patterns for vendor flexibility and RAG for document understanding. In the next article, we'll tackle the hardest part: agents that don't suck.

Most agent frameworks fall apart on real multi-turn conversations. They lose context, repeat themselves, or get stuck in loops. I'll show you how we built agents that can handle complex workflows, use tools reliably, and maintain state across long conversations.

Plus, the patterns we've built so far (unified gateway + smart RAG) become the foundation for truly useful agents. You'll see how it all connects.


This is part 4 of an 8-part series on building a production AI platform. Find the complete code examples at https://github.com/tysoncung/ai-platform-aws-examples.
