Debapriya Dey

Posted on May 22

Building a Serverless AI Model Evaluation Platform on AWS

#ai #aws #llm #serverless

The Problem

A media company needed to evaluate which AI model produces the best podcast-style summaries from news articles. They wanted to:

Send an article to multiple AI models simultaneously
Compare the outputs side by side
Score each output automatically
Generate a visual comparison report

Doing this manually (copying articles into different model playgrounds, reading the outputs, and judging quality by hand) doesn't scale. They needed an automated evaluation pipeline that could run experiments on demand and produce consistent, comparable results.

What We Built

A fully serverless evaluation platform on AWS that accepts an article, runs it through multiple foundation models in parallel, scores each output using a separate AI judge, and produces an HTML comparison report. All triggered by a single API call.

The system handles the entire lifecycle:

Prompt optimization — an AI agent refines the user's instructions into an effective prompt
Parallel model invocation — multiple Bedrock models generate summaries simultaneously
Automated scoring — a scoring agent evaluates each output against quality criteria
Report generation — produces a formatted HTML comparison page

Architecture Overview

The 6-Step Workflow

The core of the system is a Step Functions state machine that orchestrates six Lambda functions in sequence. Here's what each step does and why it exists as a separate step.
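As a rough illustration of what "six Lambda functions in sequence" can look like, the sketch below builds an Amazon States Language definition in Python. The state names, the retry settings, and the `lambda_arns` mapping are assumptions made for the example, not the project's actual definition.

```python
import json

# Hypothetical state names mirroring the six steps described in this post.
STEPS = ["Validate", "InvokeModels", "StoreOutputs", "Score", "StoreScores", "GenerateHtml"]

def build_state_machine(lambda_arns):
    """Return an Amazon States Language dict that chains the steps in order."""
    states = {}
    for index, name in enumerate(STEPS):
        state = {
            "Type": "Task",
            "Resource": lambda_arns[name],  # ARN of the Lambda for this step
            # Assumed retry policy; tune or remove per step.
            "Retry": [{
                "ErrorEquals": ["States.TaskFailed"],
                "IntervalSeconds": 2,
                "MaxAttempts": 2,
                "BackoffRate": 2.0,
            }],
        }
        if index + 1 < len(STEPS):
            state["Next"] = STEPS[index + 1]
        else:
            state["End"] = True
        states[name] = state
    return {"StartAt": STEPS[0], "States": states}

def to_json(definition):
    """Serialize the definition for upload to Step Functions."""
    return json.dumps(definition, indent=2)
```

Each state's output becomes the next state's input, which is what lets a failure in one step stop the run before later steps incur cost.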

Step 1: Validate

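import json

def validate(event):
    """Read and validate the experiment definition from S3."""
    # Assumes the event carries an experiment_id and that s3 / BUCKET are module-level
    key = f"definitions/{event['experiment_id']}/definition.json"
    definition = json.loads(s3.get_object(Bucket=BUCKET, Key=key)["Body"].read())
    # Fail fast if required fields are missing or empty
    missing = [field for field in ("article", "models", "prompt") if not definition.get(field)]
    if missing:
        raise ValueError(f"Invalid definition, missing fields: {missing}")
    return definition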

Why a separate step? Fail-fast validation before incurring any Bedrock costs. If the definition is malformed, we stop here — no wasted model invocations.

Step 2: Invoke Models (Parallel)

This is where it gets interesting. We invoke multiple Bedrock models simultaneously using Python's ThreadPoolExecutor:

from concurrent.futures import ThreadPoolExecutor, as_completed

def invoke_models(definition):
    models = definition['models']  # e.g., ["meta.llama3-70b", "deepseek-r1", "amazon.nova-lite"]
    prompt = definition['prompt']
    article = definition['article']

    results = {}

    with ThreadPoolExecutor(max_workers=len(models)) as executor:
        futures = {
            executor.submit(invoke_bedrock, model_id, prompt, article): model_id
            for model_id in models
        }
        for future in as_completed(futures):
            model_id = futures[future]
            response = future.result()
            results[model_id] = {
                "output": response['output']['message']['content'][0]['text'],
                "usage": {
                    "input_tokens": response['usage']['inputTokens'],
                    "output_tokens": response['usage']['outputTokens']
                }
            }

    return results

Why ThreadPoolExecutor inside Lambda? Bedrock API calls are I/O-bound. Running them in parallel within a single Lambda invocation means we pay for one Lambda execution instead of three, and the total wall-clock time is roughly equal to the slowest model rather than the sum of all models.

Step 3: Store Outputs

Writes comparison.json to S3 — containing all model outputs but no scores yet. This creates a checkpoint: if scoring fails, we don't lose the generated content.

Step 4: Score (Parallel)

The scoring agent (Claude Haiku) evaluates each model's output against quality criteria. Again, parallel execution via ThreadPoolExecutor:

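import json

def score(outputs):
    scoring_prompt = """Rate this podcast summary on:
    - Accuracy (1-10): Does it faithfully represent the article?
    - Engagement (1-10): Would a listener find this compelling?
    - Structure (1-10): Is it well-organized for audio?
    Respond with JSON only."""

    scores = {}
    with ThreadPoolExecutor(max_workers=len(outputs)) as executor:
        futures = {
            # Assumes each entry has the Step 2 shape: {"output": text, "usage": {...}}
            executor.submit(invoke_bedrock, SCORING_MODEL, scoring_prompt, result["output"]): model_id
            for model_id, result in outputs.items()
        }
        for future in as_completed(futures):
            reply = future.result()['output']['message']['content'][0]['text']
            scores[futures[future]] = json.loads(reply)  # the prompt asks for JSON only
    return scores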

Why a separate scoring model? Using a different model (or at minimum, a separate invocation with a scoring-specific prompt) as the judge reduces self-evaluation bias. The scoring agent doesn't know which model produced which output: it only receives the summary text.
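Even with "Respond with JSON only" in the prompt, a judge model may wrap its answer in extra text or a code fence. The helper below is a minimal sketch of defensive parsing; the function name `parse_scores` and the lowercase criterion keys are assumptions for illustration.

```python
import json
import re

CRITERIA = ("accuracy", "engagement", "structure")

def parse_scores(raw_text):
    """Extract the JSON object from a judge reply and check each score is 1-10."""
    match = re.search(r"\{.*\}", raw_text, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in judge reply")
    data = json.loads(match.group(0))
    # Accept "Accuracy" as well as "accuracy".
    normalized = {str(key).lower(): value for key, value in data.items()}
    scores = {}
    for name in CRITERIA:
        value = normalized.get(name)
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if not is_number or not 1 <= value <= 10:
            raise ValueError(f"invalid or missing score for {name!r}: {value!r}")
        scores[name] = value
    return scores
```

Rejecting malformed replies here keeps a bad judge response from silently ending up in the report as a missing or out-of-range score.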

Step 5: Store Scores

Updates comparison.json with the scores attached to each model's output.
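A minimal sketch of that update, assuming comparison.json maps each model ID to its output entry and the scoring step returns a dict keyed by the same model IDs (both shapes are assumptions based on the snippets above):

```python
def attach_scores(comparison, scores):
    """Return a new comparison dict with a 'scores' entry added per model."""
    merged = {}
    for model_id, entry in comparison.items():
        # Models the judge failed on get None, so gaps stay visible in the report.
        merged[model_id] = {**entry, "scores": scores.get(model_id)}
    return merged
```

Returning a new dict rather than mutating the input keeps the Step 3 checkpoint data untouched if this step has to be retried.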

Step 6: Generate HTML

Produces a formatted comparison.html report that displays all outputs side by side with their scores. This is the final deliverable the user downloads.
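The report step can be sketched with nothing but the standard library. The layout below (one table column per model) and the function name `render_report` are assumptions; the important detail is escaping model output before embedding it in HTML, since generated text can contain markup.

```python
from html import escape

def render_report(comparison):
    """Render {model_id: {"output": str, "scores": dict or None}} as one HTML page."""
    columns = []
    for model_id, entry in comparison.items():
        scores = entry.get("scores") or {}
        score_items = "".join(
            f"<li>{escape(str(name))}: {escape(str(value))}</li>"
            for name, value in scores.items()
        )
        columns.append(
            f"<td><h2>{escape(model_id)}</h2><ul>{score_items}</ul>"
            f"<p>{escape(entry['output'])}</p></td>"
        )
    return (
        "<!DOCTYPE html><html><head><meta charset='utf-8'>"
        "<title>Model comparison</title></head><body>"
        f"<table><tr>{''.join(columns)}</tr></table></body></html>"
    )
```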

Why Amazon Bedrock's Converse API?

We use the Converse API rather than the model-specific InvokeModel API. The key advantage: one unified interface across all models.

def invoke_bedrock(model_id, system_prompt, user_message):
    response = bedrock_runtime.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": user_message}]}],
        system=[{"text": system_prompt}]
    )
    return response

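Switching from Llama to Claude to Nova Lite mostly comes down to changing the model_id string: the request format and the top-level response shape stay the same. Model-specific differences can still surface, for example models that don't accept a system prompt or reasoning models that return extra content blocks, so it's worth testing each new model ID once.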
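One place where models can differ is the content list: a reasoning model may return a reasoning block alongside the text block, in which case indexing `content[0]['text']` can fail. The helper below is a hedged sketch that collects every text block instead; the name `extract_text` is hypothetical, and the reasoning block shape used in the usage example is an assumption about what such a response looks like.

```python
def extract_text(response):
    """Concatenate all text blocks from a Converse-style response dict."""
    blocks = response["output"]["message"]["content"]
    # Skip non-text blocks (for example reasoning content) instead of assuming index 0.
    texts = [block["text"] for block in blocks if "text" in block]
    if not texts:
        raise ValueError("response contained no text block")
    return "".join(texts)
```

Example usage with a hand-built response:

```python
fake = {"output": {"message": {"content": [
    {"reasoningContent": {"reasoningText": {"text": "thinking..."}}},
    {"text": "Final summary."},
]}}}
print(extract_text(fake))
```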

The Converse API also returns token usage in every response — which we pass through to the caller for billing:

{
  "results": [
    {
      "model_id": "meta.llama3-70b-instruct-v1:0",
      "summary": "...",
      "usage": { "input_tokens": 1523, "output_tokens": 847 }
    }
  ],
  "total_usage": { "total_input_tokens": 4569, "total_output_tokens": 2541 }
}
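The total_usage block is a plain sum over the per-model entries. A minimal sketch, assuming the response shape shown above:

```python
def total_usage(results):
    """Sum per-model token counts into the total_usage block."""
    return {
        "total_input_tokens": sum(r["usage"]["input_tokens"] for r in results),
        "total_output_tokens": sum(r["usage"]["output_tokens"] for r in results),
    }
```

With three models that each report 1523 input and 847 output tokens, this yields the 4569 and 2541 totals from the example response.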

Cost Control: The Hardest Part

Here's the reality of building on top of foundation models: every API call costs money, and costs scale with input size. A single /run request invoking 3 models on a long article can cost $0.10–0.50. That sounds small until someone writes a script that calls it in a loop.
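To make the per-request cost concrete, token counts can be multiplied by a price table. The sketch below uses a made-up model ID and placeholder prices; real Bedrock prices vary by model and region and should be taken from the current price list.

```python
# Placeholder USD prices per 1,000 tokens. These are NOT real Bedrock prices.
PRICES_PER_1K = {
    "example.model-a": {"input": 0.001, "output": 0.002},
}

def estimate_cost(model_id, usage, prices=PRICES_PER_1K):
    """Estimate the USD cost of one model call from its token usage."""
    price = prices[model_id]
    input_cost = usage["input_tokens"] / 1000 * price["input"]
    output_cost = usage["output_tokens"] / 1000 * price["output"]
    return input_cost + output_cost
```

Summing this over the content models and the scoring calls gives an estimate for a whole /run request, which is the number worth logging next to each experiment.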

Billing Alarms (Day 1)

We set up CloudWatch billing alarms immediately:

CloudWatch Alarm ($10 threshold) → SNS → Email notification
CloudWatch Alarm ($25 threshold) → SNS → Email notification

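This is the bare minimum. You'll know when costs are climbing, even if you can't stop them automatically. Note that AWS publishes the billing metric (EstimatedCharges) only in us-east-1, so these alarms have to be created there even when the workload runs in another region.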
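The two alarms can be sketched with boto3's `put_metric_alarm`. The alarm names, the six-hour period, and the injected `cloudwatch` client are assumptions for the example; the namespace, metric name, and currency dimension are the standard ones for AWS billing metrics.

```python
def billing_alarm_params(threshold_usd, sns_topic_arn):
    """Build put_metric_alarm arguments for an estimated-charges alarm."""
    return {
        "AlarmName": f"billing-over-{threshold_usd}-usd",  # hypothetical naming scheme
        "Namespace": "AWS/Billing",
        "MetricName": "EstimatedCharges",
        "Dimensions": [{"Name": "Currency", "Value": "USD"}],
        "Statistic": "Maximum",
        "Period": 21600,  # six hours; the billing metric updates only a few times a day
        "EvaluationPeriods": 1,
        "Threshold": float(threshold_usd),
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],
    }

def create_billing_alarms(cloudwatch, sns_topic_arn, thresholds=(10, 25)):
    """Create one alarm per threshold using the given CloudWatch client."""
    for threshold in thresholds:
        cloudwatch.put_metric_alarm(**billing_alarm_params(threshold, sns_topic_arn))
```

In practice the client would be created with `boto3.client("cloudwatch", region_name="us-east-1")`, and billing alerts must be enabled in the account's billing preferences before the metric appears.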

API Security (Critical for Any AI-Backed API)

An unprotected API that invokes foundation models is essentially a public credit card. We learned this the hard way and now treat API security as P0 — before any external access:

API Keys on every endpoint (immediate protection)
Usage plans with per-key quotas (500 requests/day, 5000/month); in API Gateway, usage plans and API keys are REST API features
Rate limiting (10 req/s throttle) to prevent burst abuse
Request logging to attribute usage to specific callers

# Every request must include the API key
curl -X POST https://api.example.com/run \
  -H "x-api-key: btk_live_abc123def456" \
  -H "Content-Type: application/json" \
  -d '{"article": "...", "models": ["meta.llama3-70b"]}'

Without this, anyone who discovers your API URL can generate unbounded Bedrock charges.
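A hedged sketch of the key and usage-plan setup with boto3's `apigateway` client follows. It assumes a REST API (where usage plans and API keys are available), an injected client, hypothetical naming, and an assumed burst limit. A usage plan carries a single quota period, so only the daily figure is shown; a monthly cap would need a separate mechanism.

```python
def create_limited_key(apigw, api_id, stage, customer_name):
    """Create a usage plan plus an API key bound to it; return the key ID."""
    plan = apigw.create_usage_plan(
        name=f"{customer_name}-plan",
        apiStages=[{"apiId": api_id, "stage": stage}],
        throttle={"rateLimit": 10.0, "burstLimit": 20},  # burstLimit is an assumed value
        quota={"limit": 500, "period": "DAY"},
    )
    key = apigw.create_api_key(name=f"{customer_name}-key", enabled=True)
    apigw.create_usage_plan_key(
        usagePlanId=plan["id"], keyId=key["id"], keyType="API_KEY"
    )
    return key["id"]
```

One key per caller is what makes the request logs attributable: usage can then be traced back to a specific customer rather than to the API as a whole.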

Lessons Learned

1. Separate validation from execution

Bedrock calls are expensive. Validate everything before invoking any model. Check that the article isn't empty, the model IDs are valid, the prompt isn't too long. Fail at Step 1, not Step 2.
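The checks named above can be sketched as a pure function that returns every problem it finds. The allow-list and the character limit are hypothetical parameters; a real limit would more likely be expressed in tokens.

```python
def validation_errors(definition, allowed_models, max_prompt_chars=4000):
    """Return a list of human-readable problems; an empty list means valid."""
    errors = []

    article = definition.get("article")
    if not isinstance(article, str) or not article.strip():
        errors.append("article must be a non-empty string")

    models = definition.get("models")
    if not isinstance(models, list) or not models:
        errors.append("models must be a non-empty list")
    else:
        unknown = [m for m in models if not isinstance(m, str) or m not in allowed_models]
        if unknown:
            errors.append(f"unknown model IDs: {unknown}")

    prompt = definition.get("prompt")
    if not isinstance(prompt, str) or not prompt.strip():
        errors.append("prompt must be a non-empty string")
    elif len(prompt) > max_prompt_chars:
        errors.append(f"prompt exceeds {max_prompt_chars} characters")

    return errors
```

Collecting all errors instead of raising on the first one lets the API return a single, complete rejection message.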

2. ThreadPoolExecutor > separate Lambda invocations for parallel model calls

We considered using Step Functions' native parallel states or invoking separate Lambdas per model. ThreadPoolExecutor within a single Lambda turned out simpler:

One Lambda execution to pay for (not N)
Shared memory for the article text (no repeated S3 reads)
Simpler error handling
Total time ≈ slowest model, not sum of all

The tradeoff: if one model times out, the entire Lambda times out. We mitigate this with per-future timeouts.
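One way to sketch this mitigation is a shared deadline across all futures, which differs slightly from a strict per-future timeout. The function name `run_with_deadline` and the task shape are assumptions. Python cannot interrupt a running thread, so a slow call keeps running in the background; pairing this with client-side read timeouts on the Bedrock client is likely still needed.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait

def run_with_deadline(tasks, timeout_s):
    """Run {name: zero-argument callable} in parallel; return (results, errors)."""
    results, errors = {}, {}
    executor = ThreadPoolExecutor(max_workers=max(1, len(tasks)))
    futures = {executor.submit(fn): name for name, fn in tasks.items()}
    done, not_done = wait(futures, timeout=timeout_s)
    for future in done:
        name = futures[future]
        try:
            results[name] = future.result()
        except Exception as exc:  # one failing model should not sink the others
            errors[name] = repr(exc)
    for future in not_done:
        future.cancel()  # only stops tasks that have not started yet
        errors[futures[future]] = f"timed out after {timeout_s}s"
    executor.shutdown(wait=False)  # do not block on stragglers
    return results, errors
```

The caller gets partial results plus a per-model error map, so the report can show which models were missing instead of failing the whole experiment.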

3. Store intermediate results

Each step writes to S3 before the next step begins. If Step 4 (scoring) fails, we still have the model outputs from Step 3. We can retry scoring without re-invoking the content models.

4. Token usage is free metadata — always capture it

Bedrock returns inputTokens and outputTokens in every response. Capturing and returning this costs nothing but enables:

Per-customer billing
Cost forecasting
Identifying expensive prompts
Detecting anomalies (sudden spike in token usage = possible abuse)
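As a rough sketch of the anomaly case, a request's token count can be compared against the caller's recent history. The thresholds, the minimum sample count, and the function name are assumptions for illustration, not a tuned detector.

```python
from statistics import mean, pstdev

def is_usage_spike(history, latest, min_samples=5, sigmas=3.0):
    """Flag `latest` when it sits far above the caller's recent token usage."""
    if len(history) < min_samples:
        return False  # not enough data to judge
    baseline = mean(history)
    # Use at least 10% of the baseline as spread so flat histories don't over-trigger.
    spread = max(pstdev(history), 0.1 * baseline)
    return latest > baseline + sigmas * spread
```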

5. Start with S3, add a database when you need queries

For the POC, S3 handles all storage. It's simple, cheap, and sufficient for sequential read/write patterns. We're adding DynamoDB only now that we need to query experiment history by user — something S3 can't do efficiently.

What's Next

The platform is functional but evolving:

Selection History — DynamoDB-backed experiment sessions so users can revisit past comparisons and track which model they ultimately chose
Frontend UI — Visual interface for running experiments and browsing history
Cognito Authentication — User-level access control when the UI ships

Tech Stack Summary

Layer	Service	Why
API	API Gateway (HTTP API)	Low latency, pay-per-request
Compute	AWS Lambda (Python)	Serverless, scales to zero
Orchestration	Step Functions	Visual workflow, built-in retries
AI Models	Amazon Bedrock (Converse API)	Multi-model, unified interface
Storage	Amazon S3	Cheap, durable, simple
Monitoring	CloudWatch + SNS	Billing alarms, email alerts
Auth	API Keys (in place) + Cognito (planned)	Layered security
History (planned)	DynamoDB	Fast queries by user/session

Reach Out to Us

Interested in modernizing your cloud infrastructure and building enterprise-grade solutions? Storm Reply is driven by continuous learning and practical innovation. We specialize in designing and delivering scalable AWS architectures that support customers throughout their cloud journey, from early assessment to production-ready deployment.

With deep experience in AWS architecture, data engineering, and security best practices, we help enterprises migrate with confidence and move faster on their cloud transformation goals.

Let’s connect and explore how we can support your modernization initiatives.

🌐 Website: https://www.stormreply.cloud/

💼 LinkedIn: https://www.linkedin.com/company/storm-reply/posts/?feedView=all
Date: May 2026

The full system runs in eu-central-1 (Frankfurt), costs under $20/month excluding Bedrock usage, and handles the entire evaluation lifecycle in a single API call. Serverless means we pay nothing when nobody's running experiments, and scale automatically when they are.

If you're building something similar — any system where API calls trigger expensive downstream operations — lock down your API first, validate inputs aggressively, and always know what each request costs.

Built with AWS Lambda, Step Functions, and Amazon Bedrock.
