Most RAG tutorials show you a happy path. Chunk text, embed it, retrieve the top-k, stuff it into a prompt, ship it.
That gets you 80% of the way there. The other 20% is where your system breaks in production: wrong embedding model, bloated context windows, synthesis models hallucinating outside retrieved facts, no fallback when Bedrock returns a 500.
I built Scout, an AI company research agent, for the Amazon Nova AI Hackathon. The core of it is a RAG pipeline that takes data extracted from five websites, embeds the resulting briefings, and enables semantic search across a history of past research. This post covers the architectural decisions that actually mattered.
The Setup
Scout has two AI tasks that look similar but have completely different requirements:
- Take raw scraped data from five sources (company website, LinkedIn, Crunchbase, Google News, job listings) and synthesize it into a structured briefing
- Store each completed briefing and let users search across them semantically ("find SaaS companies that raised Series B in the last year")
Both tasks involve language models. But they need different things from a model.
Synthesis needs reasoning. It has to reconcile conflicting data across sources, infer things from context ("their job listings mention Kubernetes and Terraform, so they are running infrastructure at scale"), and produce structured JSON.
Retrieval needs representation. It has to map text into a vector space where semantically similar things end up near each other, regardless of exact word overlap.
These are separate concerns. They should be separate models.
Architecture Decision: Two Models, Different APIs
Scout uses:
- Amazon Nova 2 Lite for synthesis, via the Bedrock Converse API
- Amazon Nova Multimodal Embeddings for semantic search, via the Bedrock InvokeModel API
The two models have different APIs. That matters.
The Converse API is the conversational interface. You send a message, get a message back. It handles multi-turn context, supports tool use, and gives you model-agnostic structure across Nova, Anthropic, Cohere, and others. It is the right interface for synthesis because synthesis is fundamentally a reasoning task: "here is messy data, produce this structured output."
The InvokeModel API is the raw inference interface. You serialize a request body yourself and get back whatever the model returns. It is lower-level but required for Nova Embed because embedding is not a conversational task. You are not having a dialogue with an embedding model. You are computing a dense vector.
The synthesis call looks like this:
response = client.converse(
    modelId=settings.bedrock_model_id,
    messages=[
        {
            "role": "user",
            "content": [{"text": prompt}],
        }
    ],
    inferenceConfig={
        "maxTokens": 2048,
        "temperature": 0.1,
    },
)
raw_text = response["output"]["message"]["content"][0]["text"]
The embedding call looks like this:
body = json.dumps({
    "schemaVersion": "nova-multimodal-embed-v1",
    "taskType": "SINGLE_EMBEDDING",
    "singleEmbeddingParams": {
        "embeddingPurpose": purpose,
        "embeddingDimension": 384,
        "text": {
            "truncationMode": "END",
            "value": text[:8000],
        },
    },
})
response = client.invoke_model(
    modelId=settings.nova_embed_model_id,
    body=body,
)
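Unlike Converse, InvokeModel hands you a raw body to decode yourself. A minimal sketch of that decoding step follows; the response shape (`embeddings` list with an `embedding` field) is my assumption from the single-embedding request above, so verify it against the current Bedrock documentation before relying on it:

```python
import json

def parse_embedding_response(raw_body: bytes) -> list[float]:
    # Assumed response shape for a SINGLE_EMBEDDING call:
    # {"embeddings": [{"embedding": [...]}]} -- check the Bedrock docs.
    payload = json.loads(raw_body)
    return payload["embeddings"][0]["embedding"]

# With boto3 this would be: parse_embedding_response(response["body"].read())
fake_body = json.dumps({"embeddings": [{"embedding": [0.1, 0.2]}]}).encode()
vector = parse_embedding_response(fake_body)
```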
Different interfaces, different mental models. You are not calling the same thing twice.
Why Nova Embed Over OpenAI ada-002
A few concrete reasons.
First, we were already in the AWS ecosystem. Nova Act for browser automation, Nova 2 Lite for synthesis, Bedrock for all of it. Adding a separate OpenAI dependency for embeddings would mean a second SDK, second credential chain, second billing account, second rate limit to track. That is not complexity you want in a system that already has five web extractors running in sequence.
Second, ada-002 produces 1536-dimensional vectors. Nova Embed lets you choose the output dimension; we request 384. That is not better or worse by default, but for our use case it is better. We are doing in-memory cosine similarity across a growing list of stored embeddings. Smaller vectors mean less memory, faster computation, and less storage. For a system that will rarely have more than a few thousand research records, 384 dimensions captures enough semantic signal.
Third, the embeddingPurpose parameter is genuinely useful. You set GENERIC_INDEX when storing and TEXT_RETRIEVAL when querying. The model adjusts its internal representation accordingly. That is a meaningful distinction for asymmetric retrieval tasks, where the query and the document are doing different things.
Storage: SQLite for Embeddings
This will feel wrong to engineers who have shipped vector databases. It is not wrong. It is a deliberate choice.
SQLite stores the embeddings as JSON in a BLOB column:
await db.execute(
    """
    INSERT OR REPLACE INTO embeddings (research_id, embedding, text_content, created_at)
    VALUES (?, ?, ?, ?)
    """,
    (
        research_id,
        json.dumps(embedding_vector),
        text_content,
        datetime.utcnow().isoformat(),
    ),
)
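For reference, here is a table definition compatible with that insert, runnable with the stdlib `sqlite3` module. The column types are my assumption, not Scout's actual schema; `INSERT OR REPLACE` keyed on `research_id` implies it is the primary key:

```python
import sqlite3

# Assumed schema matching the INSERT above (types are a guess).
SCHEMA = """
CREATE TABLE IF NOT EXISTS embeddings (
    research_id  TEXT PRIMARY KEY,
    embedding    BLOB NOT NULL,   -- JSON-encoded list of floats
    text_content TEXT NOT NULL,
    created_at   TEXT NOT NULL    -- ISO 8601 timestamp
)
"""

conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
conn.execute(
    "INSERT OR REPLACE INTO embeddings VALUES (?, ?, ?, ?)",
    ("r1", "[0.1, 0.2]", "Acme summary", "2025-01-01T00:00:00"),
)
```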
At query time, we pull all embeddings into memory and compute cosine similarity in Python:
all_embeddings = await get_all_embeddings()
scored = []
for research_id, vec, text_content in all_embeddings:
    sim = compute_similarity(query_vec, vec)
    scored.append((research_id, sim, text_content))
scored.sort(key=lambda x: x[1], reverse=True)
top = scored[:5]
Where compute_similarity is:
def compute_similarity(vec1: list[float], vec2: list[float]) -> float:
    dot = sum(a * b for a, b in zip(vec1, vec2))
    mag1 = math.sqrt(sum(a * a for a in vec1))
    mag2 = math.sqrt(sum(b * b for b in vec2))
    if mag1 == 0.0 or mag2 == 0.0:
        return 0.0
    return dot / (mag1 * mag2)
When does this break? When your embedding count hits tens of thousands and search latency becomes noticeable. At that point you have two options: migrate to pgvector in Postgres, or drop in Qdrant or ChromaDB. Both are straightforward migrations because the storage layer is cleanly separated from the retrieval logic.
For a system with hundreds or low thousands of records, SQLite plus in-memory similarity is fine. It has zero operational overhead, zero external dependencies, and deploys as a single file alongside the application. Choosing a managed vector database before you have the usage to justify it is premature optimization that adds real costs and real complexity.
Retrieval: What You Actually Feed the Embedding Model
The hardest part of RAG retrieval is not the similarity math. It is deciding what text to embed.
Raw structured JSON embeds poorly. A JSON blob with fields like "funding": {"total_raised": "$12M", "last_round": "Series A"} does not produce semantically meaningful vectors. The model has to do too much structural parsing to extract meaning.
For Scout, we construct a compact text representation specifically for embedding:
def _build_embedding_text(company_name: str, briefing) -> str:
    parts = [company_name]
    if briefing:
        if briefing.summary:
            parts.append(briefing.summary)
        if briefing.industry:
            parts.append(f"Industry: {briefing.industry}")
        if briefing.business_model:
            parts.append(f"Business model: {briefing.business_model}")
        if briefing.growth_signals:
            parts.extend(briefing.growth_signals[:3])
        if briefing.talking_points:
            parts.extend(briefing.talking_points[:3])
    return " ".join(parts)
This is deliberate. We include the summary (dense semantic content), industry and business model (categorical signals), and a few growth signals and talking points (specific observations that a query might target). We leave out funding numbers, employee counts, and contact info because those are structured lookups, not semantic search targets.
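As a quick illustration of the text that actually gets embedded, here is a run against a hypothetical briefing (the builder is restated so the snippet runs standalone; the company and field values are invented):

```python
from types import SimpleNamespace

# Restated from above so this example is self-contained.
def _build_embedding_text(company_name, briefing):
    parts = [company_name]
    if briefing:
        if briefing.summary:
            parts.append(briefing.summary)
        if briefing.industry:
            parts.append(f"Industry: {briefing.industry}")
        if briefing.business_model:
            parts.append(f"Business model: {briefing.business_model}")
        if briefing.growth_signals:
            parts.extend(briefing.growth_signals[:3])
        if briefing.talking_points:
            parts.extend(briefing.talking_points[:3])
    return " ".join(parts)

# A hypothetical briefing, just to show the shape of the embedded text.
briefing = SimpleNamespace(
    summary="B2B logistics SaaS for mid-market shippers.",
    industry="Logistics software",
    business_model="Seat-based B2B SaaS",
    growth_signals=["Hiring 12 engineers", "Opened Berlin office"],
    talking_points=["Recent SOC 2 Type II audit"],
)
text = _build_embedding_text("Acme Freight", briefing)
```

The result is a single flat string: company name, then prose summary, then labeled categorical fields, then the top signals. That is what the embedding model sees, not the nested JSON.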
The query uses TEXT_RETRIEVAL purpose, and the indexed document uses GENERIC_INDEX. They are asymmetric. The query is short ("find SaaS companies with B2B focus raising recently"). The document is longer and describes the company from multiple angles. The embedding model handles that asymmetry when given the correct purpose flag.
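In code, the asymmetry is just the request body from earlier parameterized by purpose (a sketch; the helper name is mine, but the body mirrors the one shown above):

```python
import json

def build_embed_body(text: str, purpose: str) -> str:
    # Same request body as in the embedding call above, with the
    # purpose flag exposed as a parameter.
    return json.dumps({
        "schemaVersion": "nova-multimodal-embed-v1",
        "taskType": "SINGLE_EMBEDDING",
        "singleEmbeddingParams": {
            "embeddingPurpose": purpose,
            "embeddingDimension": 384,
            "text": {"truncationMode": "END", "value": text[:8000]},
        },
    })

# Documents are indexed with GENERIC_INDEX; queries use TEXT_RETRIEVAL.
index_body = build_embed_body("Acme Freight. B2B logistics SaaS ...", "GENERIC_INDEX")
query_body = build_embed_body("find SaaS companies with B2B focus", "TEXT_RETRIEVAL")
```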
Synthesis: Keeping the Model Grounded
The synthesis step is where RAG pipelines hallucinate. You retrieve five documents, hand them to the model, ask for a structured briefing, and the model fills in gaps from its training data instead of from your retrieved context.
Two things prevent this in Scout.
First, the prompt is explicit about the rule:
Only include information from the source data. Never fabricate.
If a field has no data, use null or empty array.
That sounds obvious. Most production prompts I have read omit it. When you omit it, the model fills gaps. It is being helpful. It is also making things up.
Second, we use temperature: 0.1. Synthesis is not a creative task. We want the model to closely follow instructions and produce deterministic structured output. Higher temperature produces more varied output, which is exactly what you do not want when generating JSON from factual data.
The output is JSON. We strip markdown fences if present (the model adds them about 30% of the time even when told not to) and parse with Pydantic:
text = raw_text.strip()
if text.startswith("```"):
    lines = text.splitlines()
    text = "\n".join(lines[1:-1] if lines[-1].strip() == "```" else lines[1:])
briefing_data = json.loads(text.strip())
return Briefing(**briefing_data)
The JSON parse failure path is explicit. If the model returns malformed JSON, we log the error and return a Briefing with confidence=0.0 and an error message. The user knows synthesis failed. We do not silently return a partial result.
Production Lessons
Mode switching is necessary, not optional. Scout has three modes: mock (no API calls, canned data for development), http-fallback (real requests/BS4 scraping plus Bedrock), and nova-act (full browser automation). This was not planned. It emerged from the reality of building with APIs that have geographic restrictions and rate limits. The pattern works: the import structure switches at startup based on available credentials.
if app_settings.mock_mode:
    from backend.extractors.mock import MockWebsiteExtractor as WebsiteExtractor
elif app_settings.nova_act_api_key:
    from backend.extractors.website import WebsiteExtractor
else:
    from backend.extractors.http_website import HttpWebsiteExtractor as WebsiteExtractor
Embeddings should be non-fatal. The embedding step runs after synthesis completes. If it fails, the research job still succeeds. We wrap it in its own try/except and log a warning:
try:
    embedding_vector = get_embedding(text_for_embedding)
    if embedding_vector is not None:
        await save_embedding(research.id, embedding_vector, text_for_embedding)
except Exception as embed_err:
    logger.warning(f"[{research.id}] Embedding failed (non-fatal): {embed_err}")
Search is a feature on top of the core product. It should not take down the core product.
Rate limiting belongs in the API, not in your head. We added a simple in-memory rate limiter: 5 research requests per IP per hour. It uses a defaultdict of timestamps and a 1-hour sliding window. It is enough to prevent a single user from exhausting your Bedrock quota during a demo. Add it before you need it.
What I Would Change
The cosine similarity implementation should use numpy. The pure-Python version works. But sum(a * b for a, b in zip(vec1, vec2)) for 384-element vectors, called hundreds of times per search request, is slower than np.dot(vec1, vec2). This is the one place where the "ship first" principle should give way to "it costs 30 seconds to fix."
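The drop-in replacement is short. A sketch, keeping the same zero-vector guard as the pure-Python version:

```python
import numpy as np

def compute_similarity_np(vec1: list[float], vec2: list[float]) -> float:
    # Vectorized cosine similarity; same semantics as the pure-Python
    # version above, including the zero-magnitude guard.
    a = np.asarray(vec1, dtype=np.float64)
    b = np.asarray(vec2, dtype=np.float64)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        return 0.0
    return float(np.dot(a, b) / denom)
```

Going further, stacking all stored vectors into one matrix and scoring a query with a single matrix-vector product would remove the per-record Python loop entirely, but even the function above is a cheap win.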
The synthesis prompt should include confidence calibration examples. Right now, confidence is defined in terms of how many sources succeeded. A cleaner approach: few-shot examples that show the model what 0.9 confidence looks like vs. 0.5. The current system works, but the confidence scores are not well-calibrated across different company types.
The embedding text construction should be tuned by query type. Right now we embed the same text for every briefing. A better approach: embed the summary for natural language queries, embed specific fields (industry, stage, funding) for structured queries. This requires knowing query types ahead of time, which means more product work. Worth doing once you have real query data to analyze.
The Pattern That Matters
The core insight from building this: retrieval and synthesis are different problems with different failure modes.
Retrieval fails when your embedding text is wrong. Not when the model is bad. When you give the model JSON and expect it to embed semantically like prose. Fix the text, fix the retrieval.
Synthesis fails when the model is not constrained. Low temperature, explicit grounding rules, JSON output format, explicit null handling for missing data. Without those constraints, the model helps too much.
They are both language model problems. They are not the same language model problem.
I build production AI systems for companies. If you are dealing with RAG challenges, I would love to hear about them. astraedus.dev