Ravindra Pandya

5 Hard-Earned Lessons from Building a Production App on Amazon Bedrock

Recently I had the opportunity to review one of our client's projects, which had been flagged for a sudden increase in cost along with a few other areas for improvement. The project was an AI-powered document analysis tool, built on Amazon Bedrock, for a document processing team.

The document processing team was handling files roughly 10x faster than before. But during the review - after spending some late nights in the AWS console - we identified some expensive learning moments.

Here are the five lessons that made the biggest difference - things that should have been considered on day one.

1. Model Selection Isn't About Picking the Biggest One

The developer's first instinct was to use Claude Opus. After all, if you're building something important, you use the most powerful model, right?

The team was processing various documents - extracting key information, metadata, and structured data. Pretty standard extraction tasks. For two weeks, everything ran through Opus and the results were great. Then during a review, we thought: "Why don't we try the smaller models?"

Until then, going smaller had seemed risky for "serious" work.

Out of curiosity, we tested the same documents with Claude Haiku. The accuracy? Identical for the use case. The cost difference? Roughly 15x cheaper.

That's when it clicked - Bedrock gives you multiple model tiers for a reason. Different tasks need different levels of capability.

The current approach:

  • Structured extraction and classification → Start with Haiku
  • General analysis or moderate complexity → Test Sonnet
  • Complex reasoning or nuanced interpretation → Upgrade to Opus

The team now prototypes new features with Haiku first, then moves up only when results aren't meeting requirements. The model flexibility in Bedrock has been one of its best features.

Here's how we can structure the model selection logic:

def select_model_for_task(task_type, complexity_score):
    """
    Bedrock offers multiple Claude models - pick based on actual need
    """
    if task_type == 'extraction' and complexity_score < 3:
        return 'anthropic.claude-3-haiku-20240307-v1:0'
    elif task_type == 'analysis' or complexity_score < 7:
        return 'anthropic.claude-3-sonnet-20240229-v1:0'
    else:
        return 'anthropic.claude-3-opus-20240229-v1:0'

2. CloudWatch Integration Saved the Budget (Once We Set It Up Right)

Here's something many developers don't fully appreciate at first: Bedrock integrates seamlessly with CloudWatch, but you need to actually configure meaningful monitoring.

When we deployed to production and end-users started processing documents in earnest, the daily AWS bill jumped from $50 to $380 overnight. We found out on a Friday afternoon when we checked the billing dashboard.

The problem wasn't Bedrock - it was the implementation. Every single request was being logged with its full input and output to CloudWatch Logs, and those logs were costing almost as much as the model invocations. Plus, there was zero rate limiting - if someone uploaded a 100-page document, it just got processed. Multiple times, if they clicked impatiently.

Here's what I learned about effective monitoring:

import json

import boto3

bedrock_runtime = boto3.client('bedrock-runtime')
cloudwatch = boto3.client('cloudwatch')

def invoke_bedrock_with_tracking(prompt, max_tokens=1000):
    # Estimate before calling to catch oversized requests
    estimated_input_tokens = len(prompt.split()) * 1.3

    if estimated_input_tokens > 5000:
        # Cap large requests early
        raise ValueError("Document too large - please split into sections")

    response = bedrock_runtime.invoke_model(
        modelId='anthropic.claude-3-haiku-20240307-v1:0',
        body=json.dumps({
            'anthropic_version': 'bedrock-2023-05-31',
            'max_tokens': max_tokens,
            'messages': [{'role': 'user', 'content': prompt}]
        })
    )

    # The response body is a stream - read and parse it once
    result = json.loads(response['body'].read())
    usage = result['usage']

    # Log metrics to CloudWatch, not full content
    cloudwatch.put_metric_data(
        Namespace='BedrockApp/DocumentProcessing',
        MetricData=[
            {
                'MetricName': 'InputTokens',
                'Value': usage['input_tokens'],
                'Unit': 'Count'
            },
            {
                'MetricName': 'OutputTokens',
                'Value': usage['output_tokens'],
                'Unit': 'Count'
            }
        ]
    )

    return result
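That snippet catches oversized requests, but it doesn't fix the impatient-clicking problem on its own. A minimal per-user rate limit looks something like the sketch below - this in-memory version is illustrative only; in production you'd likely back it with DynamoDB or ElastiCache so it works across instances:

import time
from collections import defaultdict, deque

# Rolling window of request timestamps per user (illustrative in-memory store)
_request_history = defaultdict(deque)

MAX_REQUESTS_PER_HOUR = 100

def check_rate_limit(user_id):
    """Return True if the user is still under their hourly request budget."""
    now = time.time()
    history = _request_history[user_id]

    # Drop timestamps older than one hour
    while history and now - history[0] > 3600:
        history.popleft()

    if len(history) >= MAX_REQUESTS_PER_HOUR:
        return False

    history.append(now)
    return True

Calling check_rate_limit(user_id) before invoke_bedrock_with_tracking(...) is enough to turn an impatient double-click into a polite "please wait" instead of a second model invocation.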

I also set up CloudWatch alarms that actually matter:

  • Alert if hourly cost exceeds $50
  • Alert if any single user makes >100 requests/hour
  • Alert if average response size exceeds 2000 tokens (usually indicates something's wrong with my prompts)

These alarms have caught two incidents where users found creative ways to accidentally trigger expensive workflows.
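Creating these alarms is just a few lines of boto3. Here's a sketch of one alarm on the custom token metric from the snippet above - the threshold and SNS topic ARN are illustrative placeholders, not our real values:

import boto3

cloudwatch = boto3.client('cloudwatch')

cloudwatch.put_metric_alarm(
    AlarmName='bedrock-input-tokens-hourly',
    Namespace='BedrockApp/DocumentProcessing',
    MetricName='InputTokens',
    Statistic='Sum',
    Period=3600,              # one-hour window
    EvaluationPeriods=1,
    Threshold=2_000_000,      # illustrative hourly token budget
    ComparisonOperator='GreaterThanThreshold',
    TreatMissingData='notBreaching',
    AlarmActions=['arn:aws:sns:us-east-1:123456789012:bedrock-cost-alerts']  # placeholder SNS topic
)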

The beauty of Bedrock being fully integrated with AWS? All my monitoring, alerting, and cost management tools work exactly the same way as the rest of my infrastructure.

3. Prompt Engineering Made the Difference Between "Good" and "Great"

We started with a prompt that looked very basic:

Analyze this document and extract important information.

Results were inconsistent. Sometimes we got JSON, sometimes unformatted natural language. Sometimes dates came back in different formats. The developer spent days debugging the parsing code before realizing the problem wasn't the code.

The breakthrough came when we started treating prompts like API specifications - detailed, structured, with clear expectations.

The current prompt structure:

def build_document_analysis_prompt(document_text):
    prompt = f"""You are a document analyzer. Extract specific information from various document types.

Your task:
1. Read the document carefully
2. Extract ONLY the following fields
3. Return results in valid JSON format

Required fields:
- document_type: Type of document (invoice, receipt, form, report, etc.)
- key_entities: List of important names, organizations, or entities (array of strings)
- dates: All dates mentioned (array in YYYY-MM-DD format)
- amounts: Any monetary values with currency (array of objects)
- summary: Brief 1-2 sentence summary (string)
- metadata: Any additional relevant information (object)

Rules:
- If a field is not found, use null (not empty string)
- Convert all dates to YYYY-MM-DD format
- Be specific about amounts and include currency codes
- Do not infer information not explicitly stated

Document text:
{document_text}

Return ONLY valid JSON with no markdown formatting or preamble."""

    return prompt

The improvement was dramatic. Inconsistency dropped from about 30% to under 5%.

Key insights:

  • Be explicit about format - Bedrock's models are capable, but they need clear instructions
  • Provide structure with numbered steps
  • Define what NOT to do (stopped the model from inventing missing dates)
  • Specify output format precisely (I was getting markdown code blocks until I said "no markdown formatting")

I now version my prompts in a prompts/ directory and A/B test changes against a set of 50 sample documents before deploying updates. AWS makes it easy to track these experiments since everything's tagged and logged in CloudWatch.
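A lightweight version of that test harness can be surprisingly simple. The sketch below assumes prompt files named by version in prompts/ and a run_analysis callable that wraps the invocation code from earlier and returns the model's text output - both are illustrative, not the exact setup:

import json
from pathlib import Path

def load_prompt(version):
    """Load a versioned prompt template from the prompts/ directory."""
    return Path(f"prompts/document_analysis_{version}.txt").read_text()

def score_prompt(version, sample_documents, run_analysis):
    """Return the fraction of sample docs that come back as valid JSON."""
    template = load_prompt(version)
    valid = 0
    for doc in sample_documents:
        output = run_analysis(template.format(document_text=doc))
        try:
            json.loads(output)
            valid += 1
        except (json.JSONDecodeError, TypeError):
            pass
    return valid / len(sample_documents)

# Compare two prompt versions over the same 50 sample documents:
# score_prompt('v1', samples, run_analysis) vs score_prompt('v2', samples, run_analysis)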

4. Building Resilient Error Handling from Day One

In development, everything worked smoothly. In production with many end-users uploading all kinds of documents? That's when we met every possible error condition.

The wake-up call came when the service went down for 20 minutes because we hit Bedrock's rate limits and the code just... stopped. No retry, no graceful degradation, just dead in the water.

Here's the error handling that's kept us running smoothly since:

import json
import logging
import random
import time

import boto3
from botocore.exceptions import ClientError

logger = logging.getLogger(__name__)
bedrock_runtime = boto3.client('bedrock-runtime')

def invoke_with_resilience(prompt, max_retries=3):
    """
    Bedrock is highly available, but your code should handle edge cases
    """
    for attempt in range(max_retries):
        try:
            response = bedrock_runtime.invoke_model(
                modelId='anthropic.claude-3-haiku-20240307-v1:0',
                body=json.dumps({
                    'anthropic_version': 'bedrock-2023-05-31',
                    'max_tokens': 2000,
                    'messages': [{'role': 'user', 'content': prompt}]
                })
            )
            return response

        except ClientError as e:
            error_code = e.response['Error']['Code']

            if error_code == 'ThrottlingException':
                # Bedrock has rate limits - use exponential backoff
                wait_time = (2 ** attempt) + random.uniform(0, 1)
                logger.info(f"Rate limited, waiting {wait_time:.2f}s")
                time.sleep(wait_time)
                continue

            elif error_code == 'ValidationException':
                # Input validation failed - don't retry
                logger.error(f"Invalid request format: {e}")
                return None

            elif error_code == 'ModelTimeoutException':
                # Request took too long
                if attempt < max_retries - 1:
                    logger.warning(f"Timeout on attempt {attempt + 1}, retrying...")
                    continue
                else:
                    logger.error("All retries exhausted on timeout")
                    return None

            else:
                # Unexpected error - retry with backoff
                logger.error(f"Unexpected error: {e}")
                if attempt < max_retries - 1:
                    time.sleep(2)
                    continue
                else:
                    raise

    return None

But here's what really matters for production: having a fallback strategy.

For non-critical requests, if Bedrock is temporarily unavailable, we queue the document for background processing and show the user: "Analysis queued - you'll receive results via email within an hour."

For time-sensitive requests, we have a simple rule-based extractor as a backup. It's not as good as Bedrock - maybe catches 60% of what the AI does - but it keeps users unblocked.
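Tying the two paths together is a small dispatcher. Here's a sketch, reusing invoke_with_resilience and build_document_analysis_prompt from earlier - the SQS queue URL and the rule_based_extract helper are placeholders for the actual fallback pieces:

import json

import boto3

sqs = boto3.client('sqs')
QUEUE_URL = 'https://sqs.us-east-1.amazonaws.com/123456789012/doc-analysis-queue'  # placeholder

def analyze_with_fallback(document_text, time_sensitive=False):
    prompt = build_document_analysis_prompt(document_text)
    response = invoke_with_resilience(prompt)

    if response is not None:
        return {'status': 'complete', 'result': response}  # caller parses the model output

    if time_sensitive:
        # Degrade to the simple rule-based extractor (placeholder) so the user isn't blocked
        return {'status': 'partial', 'result': rule_based_extract(document_text)}

    # Non-critical: queue for background processing and notify by email later
    sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps({'document': document_text}))
    return {'status': 'queued', 'result': None}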

The reliability of Bedrock itself has been solid. These error handlers are mostly catching our own mistakes (bad input formatting) or rate limit situations during peak usage.

5. Bedrock Guardrails Are a Production Requirement, Not Optional

Week five of production, someone on the document processing team uploaded a file and asked a question that made us realize we had a security problem.

We were echoing parts of inputs back in error messages. And documents often contain confidential information - personal details, financial data, proprietary information that shouldn't leak into logs or other users' sessions.

After a friendly but firm conversation with our security team, we implemented Bedrock Guardrails. This is one of those features that seems optional until you need it, then you can't believe you didn't set it up from day one.

What we configured:

# Apply guardrails to every Bedrock invocation
guardrail_config = {
    'guardrailIdentifier': 'document-processor-guardrail',
    'guardrailVersion': '1',
    'trace': 'ENABLED'  # Shows what triggered blocks - super useful
}

response = bedrock_runtime.invoke_model(
    modelId='anthropic.claude-3-haiku-20240307-v1:0',
    body=json.dumps(request_body),  # request_body built as in the earlier snippets
    guardrailIdentifier=guardrail_config['guardrailIdentifier'],
    guardrailVersion=guardrail_config['guardrailVersion'],
    trace=guardrail_config['trace']
)

The guardrails block:

  • PII (names, addresses, SSNs) from being processed or returned
  • Attempts to ask questions about other documents in the system
  • Prompt injection attempts
  • Requests that go beyond data extraction capabilities

The trace option is invaluable - when something gets blocked, we can see exactly what triggered the guardrail. Helped us tune the policies so legitimate use cases weren't getting caught.

I also added application-level sanitization as a defense-in-depth measure:

import re

def sanitize_before_processing(text):
    """
    Pre-process before sending to Bedrock Guardrails
    """
    # Redact emails
    text = re.sub(
        r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b', 
        '[EMAIL_REDACTED]', 
        text
    )

    # Redact phone numbers
    text = re.sub(
        r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b', 
        '[PHONE_REDACTED]', 
        text
    )

    # Redact SSNs
    text = re.sub(
        r'\b\d{3}-\d{2}-\d{4}\b', 
        '[SSN_REDACTED]', 
        text
    )

    return text

Between Bedrock Guardrails and application-level checks, I sleep better knowing there are multiple layers protecting sensitive information.


What I'd Tell Someone Starting with Bedrock Today

If you're about to build your first production application on Bedrock, here's what matters:

1. Start with the right model for your task - Bedrock's model variety is a feature, not a complication. Test smaller models first - you might be surprised.

2. Set up CloudWatch monitoring from day one - Cost alerts, usage metrics, and error tracking. Future you will be grateful.

3. Invest time in your prompts - Bedrock's models are incredibly capable, but clear, structured prompts make all the difference between good and great results.

4. Build retry logic and fallbacks - Not because Bedrock is unreliable (it's not), but because production systems need resilience at every layer.

5. Enable Guardrails before you go live - This isn't about trust, it's about defense in depth. Especially if you're handling any sensitive data.

The Results

Four months in, the document analyzer processes about 200 files per week across various document types. Monthly Bedrock costs run around $180, and the tool saves the document processing team an estimated 15 hours of manual work every week. Valuing that time at a conservative $45/hour, that's roughly $2,700 in savings per month against a $180 bill - incredible ROI.

The combination of Bedrock's managed infrastructure, flexible model options, and tight AWS integration meant I could focus on building features instead of managing ML infrastructure. I went from zero to production in three weeks with no ML ops team.

Would I choose Amazon Bedrock again for my next AI project? Absolutely. The platform gave me everything I needed - I just had to learn how to use it properly.

And honestly? These "mistakes" weren't really mistakes. They were the learning curve of building production AI applications. Every developer goes through it. The difference is that with Bedrock, the platform itself wasn't the hard part - it was learning to use AI effectively in production.


Building something with Bedrock? I'd love to hear what you're working on. Drop a comment below or connect with me - always happy to chat about lessons learned.
