Yoganand Govind

Posted on Jun 14

RAG Architecture for SaaS Applications Using Amazon Bedrock

#rag #aws #infrastructure #ai

My first real production batch on Autowired.ai cost 3x what I'd budgeted.

200 documents. One real customer test. And an AWS bill that showed me exactly how wrong my mental model was.

The initial architecture felt reasonable: Textract handles OCR, Bedrock extracts all the fields from the OCR output. Simple. Straightforward. And expensive at scale because I was sending every document's full OCR text to a frontier model for every field on every page.

This post is the story of how I got to a ~40% cost reduction. The biggest win wasn't a configuration change. It was rethinking what Bedrock should actually be doing.

What I Was Actually Paying For

Before touching anything, I instrumented every Bedrock call to log input tokens, output tokens, model ID, cache hit/miss, and latency. One week of real data:

System prompt tokens: ~35% of every input. The verification schema, field definitions, and output format instructions are completely static, yet they were sent in full on every single call.
Full OCR context: I was sending the complete Textract response to Bedrock. For an invoice targeting 10 specific fields, maybe 30% of that OCR content was actually relevant. I was paying for the other 70%.
Model tier mismatch: Claude Sonnet for everything, including structured form extraction, where fixed-field invoices have consistent, predictable layouts. Sonnet is ~5x Haiku pricing.
No result caching: documents from the same vendor, with the same template and layout, triggered a fresh Bedrock call every time.

Instrument first. Always. You can't optimise what you haven't measured.
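
As a rough illustration of that instrumentation, here is a minimal sketch that turns the `usage` object from an Anthropic-style response into one log record per call. The `BedrockCallLog` shape and the `toCallLog` name are made up for this example, and it assumes the response reports cached tokens separately from `input_tokens`.

```typescript
// Usage fields as reported in an Anthropic Messages response body.
// The two cache fields are assumed to be present only when prompt caching is used.
interface AnthropicUsage {
  input_tokens: number;
  output_tokens: number;
  cache_creation_input_tokens?: number;
  cache_read_input_tokens?: number;
}

// Hypothetical log record; the field names are illustrative only.
interface BedrockCallLog {
  modelId: string;
  inputTokens: number;
  outputTokens: number;
  cacheStatus: "hit" | "write" | "none";
  latencyMs: number;
}

function toCallLog(modelId: string, usage: AnthropicUsage, latencyMs: number): BedrockCallLog {
  const read = usage.cache_read_input_tokens ?? 0;
  const written = usage.cache_creation_input_tokens ?? 0;
  return {
    modelId,
    // Assumption: input_tokens excludes cached tokens, so add them back for the total.
    inputTokens: usage.input_tokens + read + written,
    outputTokens: usage.output_tokens,
    // A call counts as a hit if any tokens were read from cache.
    cacheStatus: read > 0 ? "hit" : written > 0 ? "write" : "none",
    latencyMs,
  };
}
```

Writing one such record per call to a structured log is enough to answer the questions above: where the input tokens go, which model handled what, and how often the cache is actually hit.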

The Change That Moved the Needle Most

The original flow was: Textract OCR → send full OCR to Bedrock → Bedrock extracts all fields.

The problem with this: Textract is actually very good at extracting structured fields from well-formatted documents—dates, totals, invoice numbers, and line items from tables. It's purpose-built for this. I was using a foundation model to re-derive values that a specialised OCR service had already extracted correctly.

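The new architecture has three stages:

Textract extraction: Textract pulls the structured fields and returns a confidence score for each
Gap-fill call: Bedrock extracts only the fields Textract missed or returned with low confidence
Verification call: Bedrock validates the combined extraction result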

Instead of sending a full 3-page invoice OCR to Bedrock for all 10 fields, you're sending the following:

Targeted OCR sections for the 2–3 fields Textract couldn't get (gap-fill call)
The combined extraction result for validation (verification call)

For structured documents like standard invoices, Textract reliably gets 70–80% of target fields. Bedrock only handles the hard cases and confirms the final output. Token count drops significantly.

For complex unstructured documents — contracts, freeform text — Textract's confidence is lower across more fields, so Bedrock handles more. The architecture adapts naturally based on Textract's per-field confidence scores.
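
A minimal sketch of how that routing could work, assuming the Textract response has already been parsed into one entry per target field. The `ExtractedField` shape, the `splitByConfidence` name, and the threshold are hypothetical; Textract reports confidence on a 0 to 100 scale.

```typescript
// Hypothetical shape: one entry per target field after parsing the Textract response.
interface ExtractedField {
  name: string;
  value: string | null; // null when Textract found nothing for this field
  confidence: number;   // 0-100, as Textract reports it
}

// Fields at or above the threshold are accepted as-is;
// everything else is handed to the Bedrock gap-fill call.
function splitByConfidence(
  fields: ExtractedField[],
  threshold: number
): { accepted: ExtractedField[]; missing: string[] } {
  const accepted: ExtractedField[] = [];
  const missing: string[] = [];
  for (const field of fields) {
    if (field.value !== null && field.confidence >= threshold) {
      accepted.push(field);
    } else {
      missing.push(field.name);
    }
  }
  return { accepted, missing };
}
```

On a clean invoice most fields land in `accepted` and the gap-fill call stays small; on a contract more fields fall into `missing`, which is how the split shifts work toward Bedrock without any per-document-type branching.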

This single shift was the largest cost reduction of everything I did.

Prompt Caching: The Fastest Configuration Win

Both the gap-fill and verification system prompts are static per workflow type. The field schema, output format, confidence rules — none of it changes between documents in a batch.

Amazon Bedrock supports prompt caching on certain models. Marking the system prompt block as cacheable means the first call in a parallel batch wave pays the cache write cost (~25% premium on that portion), and every subsequent call in the same 5-minute window hits the cache at ~10% of the normal input token price.

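import { BedrockRuntimeClient, InvokeModelCommand } from "@aws-sdk/client-bedrock-runtime";

// Region and credentials are resolved from the environment.
const bedrockClient = new BedrockRuntimeClient({});

const response = await bedrockClient.send(
  new InvokeModelCommand({
    modelId: "anthropic.claude-3-5-sonnet-20241022-v2:0",
    contentType: "application/json",
    accept: "application/json",
    body: JSON.stringify({
      // Required for Anthropic models invoked through Bedrock.
      anthropic_version: "bedrock-2023-05-31",
      system: [
        {
          type: "text",
          text: systemPrompt,
          // Marks the static system prompt as a cacheable prefix.
          cache_control: { type: "ephemeral" },
        },
      ],
      messages: [
        {
          role: "user",
          content: [{ type: "text", text: documentContext }],
        },
      ],
      max_tokens: 1024,
    }),
  })
);

// The response body is a byte array containing the model's JSON reply.
const result = JSON.parse(new TextDecoder().decode(response.body));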

For a 10-document parallel batch wave, the system prompt is written to the cache once, and the other nine calls read it from cache. At 1,100-token system prompts across hundreds of documents per batch, this adds up.

Measured impact: ~20% reduction in input token costs for standard batch workloads.
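
To make the arithmetic concrete, here is a small sketch for the system-prompt portion alone. It assumes the multipliers quoted above (1.25x for the cache write, 0.1x for cache reads) and that every call after the first lands inside the cache window; `prefixCostUnits` is a made-up helper name, and the result is in token-equivalents at the normal input price.

```typescript
// Cost of the static prompt prefix across a batch, in token-equivalents
// billed at the normal input rate.
function prefixCostUnits(
  calls: number,
  prefixTokens: number,
  writeMultiplier = 1.25, // first call pays the cache write premium
  readMultiplier = 0.1    // later calls pay the discounted read price
): { uncached: number; cached: number; savedFraction: number } {
  const uncached = calls * prefixTokens;
  const cached =
    prefixTokens * writeMultiplier + (calls - 1) * prefixTokens * readMultiplier;
  return { uncached, cached, savedFraction: 1 - cached / uncached };
}

// A 10-call wave with a 1,100-token system prompt.
const wave = prefixCostUnits(10, 1100);
console.log(wave.uncached, Math.round(wave.cached), wave.savedFraction.toFixed(3));
```

For that wave, the cached prefix costs roughly 2,365 token-equivalents instead of 11,000. The document context in each call is still billed at the normal rate, which is why the saving on total input cost is much smaller than the saving on the prefix.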

Sending the Right Context, Not All the Context

For the gap-fill call, you don't need the full Textract output — just the OCR blocks covering the fields Textract couldn't extract. Textract returns confidence scores per field and per block, so you can filter:

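// Keeps only the OCR text that covers fields Textract could not extract.
function filterOcrForMissingFields(
  textractResult: TextractOutput,
  missingFields: string[]
): string {
  // isBlockRelevantToFields is a helper defined elsewhere; it decides whether
  // a block covers one of the missing fields.
  return textractResult.blocks
    .filter(block => isBlockRelevantToFields(block, missingFields))
    .map(block => block.text)
    .join("\n");
}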

For the verification call, you don't need the raw OCR at all — you need the combined extraction result (Textract values + gap-fill values) structured as a validation input. Sending the full OCR here is wasted tokens.
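
A minimal sketch of such a validation input, assuming both extraction steps produce simple field-to-value maps. The `FieldValue` shape, the `source` tag, and the `buildVerificationInput` name are illustrative, not part of any Textract or Bedrock API.

```typescript
// Hypothetical shape for one field in the verification payload.
interface FieldValue {
  value: string;
  source: "textract" | "gap-fill"; // where the value came from
}

// Merges Textract values with gap-fill values into one compact JSON string.
// Gap-fill values overwrite Textract values for the same field name.
function buildVerificationInput(
  textractValues: Record<string, string>,
  gapFillValues: Record<string, string>
): string {
  const fields: Record<string, FieldValue> = {};
  for (const [name, value] of Object.entries(textractValues)) {
    fields[name] = { value, source: "textract" };
  }
  for (const [name, value] of Object.entries(gapFillValues)) {
    fields[name] = { value, source: "gap-fill" };
  }
  return JSON.stringify({ fields });
}
```

The payload is a few hundred characters for a 10-field invoice, compared with the full OCR text of a multi-page document.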

I also audited both system prompts. After removing redundant formatting instructions already encoded in the output schema, verbose field descriptions that could be tightened, and defensive edge-case handling that never triggered, each went from ~2,400 tokens to ~1,100 tokens. Zero accuracy impact.

Measured impact: ~15% reduction in per-invocation input token count.

Testing the Smaller Model — Per Task Type

Not per system. Per task type.

The gap-fill task and the verification task have different complexity profiles. Gap-fill on structured forms (filling in 2–3 missing fields from a standard invoice) is a simpler task than gap-fill on an unstructured contract. Verification (validating already-extracted values) is generally easier than deriving those values from raw text.

I tested Haiku vs. Sonnet on 50 representative documents per workflow type:

Structured form gap-fill + verification: Haiku within 2% accuracy of Sonnet. Switched to Haiku. ~5x cost reduction on these workflows.
Unstructured document gap-fill: Haiku 8–12% less accurate than Sonnet. Kept Sonnet. The quality gap mattered.
Verification on structured forms: Haiku performed well — validating extracted values is easier than extracting them. Switched.

Model selection is now a per-workflow config, not a system-wide setting:

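// Picks the model from the workflow's complexity tier.
// `task` is part of the signature but does not affect the result yet.
function selectModelForTask(workflow: Workflow, task: "gap-fill" | "verify"): string {
  if (workflow.complexityTier === "structured") {
    // Structured forms: Haiku was within 2% of Sonnet in testing.
    return "anthropic.claude-3-haiku-20240307-v1:0";
  }
  // Everything else stays on Sonnet, where the quality gap mattered.
  return "anthropic.claude-3-5-sonnet-20241022-v2:0";
}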

Measured impact: ~15% overall cost reduction, concentrated in high-volume structured extraction workflows.
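
If the two tasks ever need different models within a tier, one option is a lookup table keyed by tier and task. This is a sketch, not the production config: the `"unstructured"` tier name and the `modelFor` helper are assumptions, and verification on unstructured documents is assumed to stay on Sonnet because the post reports no test for that combination.

```typescript
type ComplexityTier = "structured" | "unstructured"; // "unstructured" is an assumed name
type Task = "gap-fill" | "verify";

const HAIKU = "anthropic.claude-3-haiku-20240307-v1:0";
const SONNET = "anthropic.claude-3-5-sonnet-20241022-v2:0";

// One entry per (tier, task) pair, so a single cell can change
// after a new accuracy test without touching any control flow.
const MODEL_BY_WORKFLOW: Record<ComplexityTier, Record<Task, string>> = {
  structured: { "gap-fill": HAIKU, verify: HAIKU },
  // Assumption: unstructured verification stays on Sonnet.
  unstructured: { "gap-fill": SONNET, verify: SONNET },
};

function modelFor(tier: ComplexityTier, task: Task): string {
  return MODEL_BY_WORKFLOW[tier][task];
}
```

The table makes the test results explicit in config: each cell corresponds to one row of the Haiku vs. Sonnet comparison.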

Application-Layer Result Caching

Beyond Bedrock's 5-minute prompt cache, I added document-level result caching at the application layer.

Cache key:

Extraction schema ID + version hash
Hash of normalised Textract output (whitespace-normalised)
Confidence threshold settings

Cache hit → return the stored extraction result. No gap-fill call, no verification call.

In production, cache hit rates for first-run batches are low. But during schema development — when you're running the same 20-document test set 5–10 times as you tune field definitions — the cache eliminates 80–90% of Bedrock calls.

Cache storage: DynamoDB with a 24-hour TTL. A configuration hash in the key means any schema change automatically invalidates affected documents.
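
A sketch of how such a key and TTL could be computed with Node's built-in `crypto` module. The `CacheKeyParts` shape and the function names are illustrative; the only DynamoDB-specific detail relied on is that a TTL attribute holds an expiry time in epoch seconds.

```typescript
import { createHash } from "node:crypto";

// Hypothetical inputs to the cache key, mirroring the three components above.
interface CacheKeyParts {
  schemaId: string;
  schemaVersionHash: string;
  ocrText: string;             // text derived from the Textract output
  confidenceThreshold: number;
}

function sha256(input: string): string {
  return createHash("sha256").update(input).digest("hex");
}

// Collapse all runs of whitespace so layout noise does not change the key.
function normaliseWhitespace(text: string): string {
  return text.replace(/\s+/g, " ").trim();
}

function resultCacheKey(parts: CacheKeyParts): string {
  const ocrHash = sha256(normaliseWhitespace(parts.ocrText));
  return sha256(
    [parts.schemaId, parts.schemaVersionHash, ocrHash, String(parts.confidenceThreshold)].join("|")
  );
}

// Expiry time for a DynamoDB TTL attribute, in epoch seconds.
function ttlEpochSeconds(nowMs: number, ttlHours = 24): number {
  return Math.floor(nowMs / 1000) + ttlHours * 3600;
}
```

Because the schema version hash is part of the key, editing a field definition produces a different key and the stale entry simply ages out under the TTL.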

Measured impact: 5–30%, highly workload-dependent. Highest during schema development.

The Architecture Diagram

Four layers: application cache (skip Bedrock entirely on repeat docs) → Bedrock prompt cache (cheap cache reads on static system prompts) → model selection (Haiku vs Sonnet per workflow type and task) → context filtering (gap-fill only for missing fields, structured input for verification).
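
The order of those layers can be sketched as a single function with its dependencies injected. Everything here is hypothetical scaffolding: `PipelineDeps`, `PreparedDoc`, and `extractDocument` are illustrative names, and the Bedrock calls are hidden behind `gapFill` and `verify` so that the control flow is the only thing on show.

```typescript
// Injected dependencies; real versions would wrap DynamoDB and Bedrock.
interface PipelineDeps {
  cacheGet(key: string): Promise<string | undefined>;
  cachePut(key: string, result: string): Promise<void>;
  gapFill(modelId: string, filteredOcr: string): Promise<string>;
  verify(modelId: string, textractValues: string, gapValues: string): Promise<string>;
}

// A document after Textract, model selection, and context filtering.
interface PreparedDoc {
  cacheKey: string;
  gapFillModelId: string;
  verifyModelId: string;
  filteredOcr: string;    // OCR for missing fields only; empty if none are missing
  textractValues: string; // serialised high-confidence Textract fields
}

async function extractDocument(doc: PreparedDoc, deps: PipelineDeps): Promise<string> {
  // Layer 1: application cache. A hit skips Bedrock entirely.
  const cached = await deps.cacheGet(doc.cacheKey);
  if (cached !== undefined) return cached;

  // Layers 2-4: the calls below use a cached system prompt, a per-task model,
  // and filtered context. Gap-fill is skipped when nothing is missing.
  const gapValues = doc.filteredOcr
    ? await deps.gapFill(doc.gapFillModelId, doc.filteredOcr)
    : "";
  const verified = await deps.verify(doc.verifyModelId, doc.textractValues, gapValues);

  await deps.cachePut(doc.cacheKey, verified);
  return verified;
}
```

Injecting the dependencies also makes it easy to count Bedrock calls in a test and confirm that a repeat document triggers none.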

What Didn't Work

Aggressive prompt compression. I stripped whitespace and punctuation from system prompts to reduce token count. Accuracy degraded measurably. Foundation models are trained on well-formatted text; stripping formatting works against the training distribution.

Shared universal prompt across workflow types. I tried building one system prompt with conditional sections, cacheable as a single prefix. Engineering complexity was high, the cache hit rate was lower than expected (different workflow types rarely executed close enough in time to share a cache window), and accuracy dropped on edge cases. Reverted.

Output post-processing to compress tokens. Asking the model to output abbreviated values and expanding them in Lambda saved some output tokens, but it added execution time and increased application complexity. Net cost difference: negligible.

Wrapping Up

The 40% isn't one clever trick. It's an architectural shift plus four boring, measurable optimisations applied to what was left:

Architecture first: Textract for what it handles well, Bedrock only for gap-fill and verification
Prompt caching: One field change, immediate impact on batch workloads
Context filtering: Send the right OCR to Bedrock, not all of it
Model tiering: Test per task type, not per system
Result caching: Highest impact during schema development

If you're running production Bedrock workloads and haven't measured where your tokens are going — start there. The data will tell you which of these applies.

DEV Community
