DEV Community

Debapriya Dey
Debapriya Dey

Posted on

Building an AI Model Evaluation Pipeline on AWS for Audio Content Generation

Executive Summary

A European digital media publisher needed to determine which foundation model on Amazon Bedrock produces the highest-quality podcast-style summaries from news articles. Rather than selecting a model based on general benchmarks, they built a serverless evaluation pipeline on AWS that runs structured experiments — comparing multiple models in parallel, scoring outputs with an LLM-as-Judge approach, and delivering actionable insights to both technical and editorial teams.

This post describes the business drivers, architectural approach, evaluation methodology, and outcomes of the proof of concept (PoC), built entirely on AWS-native services.


Business Challenge

The customer is a digital media publisher experiencing declining engagement as user consumption shifts toward flexible, audio-first formats. Their strategic objective is to evolve from traditional text delivery into personalized, AI-driven audio experiences — such as user-specific podcast-style summaries generated from their existing article library.

This initiative is expected to:

  • Increase content consumption by meeting users where they are (commutes, workouts, passive listening)
  • Unlock new monetization opportunities through premium audio tiers and advertising
  • Extend the value of existing editorial content without proportional editorial cost

The technical challenge: foundation models produce highly variable output quality depending on the model, prompt strategy, and content type. Selecting the wrong model risks hallucinated facts (unacceptable for a news publisher), poor audio readability, or unsustainable cost per article.

The customer needed a data-driven approach to model selection — not a one-off playground test, but a repeatable evaluation framework that could inform decisions across formats, topics, and evolving model capabilities.


PoC Scope and Objectives

The PoC focused on building an evaluation and experimentation pipeline — not the production audio generation system itself. The goal was to enable structured, repeatable testing of multiple foundation models and prompt strategies for summarization and script generation.

Core Requirements

Requirement Description
Model Evaluation & Selection Test multiple models on Amazon Bedrock in parallel; compare outputs on quality, tone, coherence, and editorial relevance
Prompt & Format Experimentation Support different summary styles (short-form, multi-speaker podcast, custom editorial prompts) to identify optimal content structures
Scalable Evaluation Pipeline Automated workflows triggered via API, with consistent storage of outputs and metadata
Evaluation Metrics Framework LLM-as-Judge scoring combined with native Bedrock evaluation metrics, covering accuracy, completeness, faithfulness, and style
Results Visualization Side-by-side comparison of outputs with cost, latency, and quality insights for technical and editorial stakeholders
AWS-Native Architecture Fully built on AWS services: Amazon Bedrock, AWS Step Functions, AWS Lambda, Amazon S3, Amazon API Gateway

Solution Architecture

The evaluation pipeline is fully serverless, deployed via Terraform, and designed around the principle of experiment-as-configuration — each evaluation run is defined by a JSON document specifying models, prompts, inputs, and scoring criteria.

Architecture Overview

AWS Architecture

AWS Services Used

Service Role in Architecture
Amazon Bedrock Foundation model access via unified Converse API; supports Llama, Claude, DeepSeek, Nova, and others
AWS Step Functions Orchestrates the multi-step evaluation workflow with built-in error handling and state management
AWS Lambda Serverless compute for API handling, model invocation, scoring, and report generation
Amazon API Gateway HTTP API entry point with CORS support; JWT authorizer ready for Amazon Cognito integration
Amazon S3 Persistent storage for experiment artifacts, outputs, and generated reports
Amazon Cognito User authentication (pre-built, ready to enable for production)
Amazon CloudWatch Structured logging, billing alarms, and operational monitoring
Amazon SNS Alert notifications when cost thresholds are breached

How It Works: The Evaluation Workflow

Phase 1: Prompt Generation and Review

The user submits an article and selects a scenario type (interview, monologue, debate, short summary). A Prompt Agent powered by Claude Haiku generates an optimized instruction prompt tailored to the article topic and requested format.

The user reviews and optionally edits the prompt before execution. This human-in-the-loop step prevents wasted model invocations on suboptimal prompts.

Phase 2: Parallel Model Invocation

AWS Step Functions triggers the evaluation workflow. The Invoker Lambda uses Python's ThreadPoolExecutor to invoke 2-5 Bedrock models simultaneously via the Converse API — a unified interface that eliminates provider-specific request/response handling.

Results are written to S3 progressively as each model completes, enabling the frontend to display partial results without waiting for the slowest model.

Phase 3: LLM-as-Judge Scoring

A separate Claude Haiku instance evaluates each model's output against the source text using a strict rubric. Five dimensions are scored on a 0-100 scale:

Dimension Evaluation Criteria
Accuracy Factual correctness; penalizes hallucinations, distortions, and vague references
Fluency Natural language quality; penalizes unfinished sentences, repetition, and awkward phrasing
Completeness Coverage of key points; penalizes missing sections or shallow treatment
Neutrality Objectivity; penalizes editorial opinion or speculation presented as fact
Prompt Compliance Adherence to format, tone, and constraints; violations cap the score

The rubric is deliberately strict: scores of 96-100 are "almost never given," and most solid outputs land in the 61-80 range. This forces meaningful differentiation between models.

Phase 4: Report Generation and Approval

The pipeline generates a self-contained HTML comparison report with:

  • Side-by-side model outputs
  • Color-coded score badges per dimension
  • Latency and cost metadata
  • Reviewer preference selection

Editorial stakeholders can view reports via presigned S3 URLs without requiring AWS console access. The approval workflow saves the selected output for downstream use.


Key Architectural Decisions

Why Amazon Bedrock's Converse API?

The Converse API provides a unified interface across all foundation models on Bedrock. Adding a new model to the evaluation requires only a configuration change — no code modifications for request formatting or response parsing. This is critical for an evaluation platform where the set of models under test changes frequently.

Why Serverless (Lambda + Step Functions)?

  • Cost at rest: zero. The platform costs nothing when no experiments are running.
  • Automatic scaling. Parallel model invocations scale with Lambda concurrency.
  • Operational simplicity. No servers to patch, no clusters to size.
  • Built-in observability. Step Functions provides visual execution tracking; Lambda integrates natively with CloudWatch.

Why LLM-as-Judge Over Traditional Metrics?

Traditional NLP metrics (ROUGE, BLEU) measure surface-level text similarity. They cannot evaluate:

  • Whether a podcast script sounds natural when read aloud
  • Whether the tone matches the editorial brand
  • Whether the format constraints were followed (no markdown, no stage directions)

An LLM-as-Judge captures these subjective quality dimensions that matter most to editorial teams. The strict rubric ensures scoring consistency across experiments.

Why Build Evaluation Before Production?

The cost of running 100 evaluation experiments ($15-50 in Bedrock usage) is negligible compared to the cost of building a production system on the wrong model and discovering quality issues after launch. The evaluation pipeline de-risks the model selection decision and creates a reusable framework for ongoing optimization.


Evaluation Metrics Framework

The platform supports two complementary evaluation approaches:

LLM-as-Judge (Primary)

A separate foundation model evaluates outputs against source text and prompt instructions. Scoring features are configurable per experiment — teams can define custom dimensions relevant to their use case.

The scoring agent receives:

  • The original source text
  • The instruction prompt given to the content model
  • The model's output
  • A detailed rubric with scoring guidelines

It returns a JSON object with integer scores per dimension, which are validated against the defined scale ranges.

Amazon Bedrock Native Evaluation (Secondary)

Where available, the platform also leverages Bedrock's built-in evaluation API for standardized metrics:

  • Accuracy — factual alignment
  • Robustness — consistency under variation
  • Toxicity — harmful content detection

These provide a baseline that complements the more nuanced LLM-as-Judge scores.

Open Area: Custom Metrics via API

An area under exploration is the ability to define and register custom evaluation metrics programmatically — for example, an "audio readability" metric that specifically penalizes text patterns that sound unnatural in text-to-speech synthesis.


Results and Insights

After running structured experiments across multiple articles, models, and prompt strategies:

Model quality varies significantly by format. A model that excels at short-form summaries may produce awkward multi-speaker scripts. Format-specific evaluation is essential — there is no single "best model" across all use cases.

Prompt engineering impact often exceeds model selection impact. The quality difference between a well-crafted prompt and a generic one frequently exceeds the difference between models. The Prompt Agent + human review loop captures this value early.

Hallucination rates correlate with topic complexity. Simple event reporting is handled well by all tested models. Complex topics with nuance (scientific findings, policy debates) show significantly higher hallucination variance.

Scoring consistency requires explicit rubric design. Without strict guidelines, the AI judge assigns uniformly high scores. The calibrated rubric forces differentiation that maps to real editorial quality differences.


Cost Profile

Component Monthly Cost
AWS Lambda (API + Invoker) ~$7
AWS Step Functions ~$2
Amazon S3 < $1
Amazon API Gateway < $1
Amazon CloudWatch ~$3
Infrastructure total < $15/month
Amazon Bedrock (variable) $0.15-0.50 per experiment

The serverless architecture means infrastructure costs are near-zero when the platform is idle. The primary cost driver is Bedrock model invocations — directly proportional to experiment volume and controllable via API rate limiting and usage quotas.


Security and Operational Considerations

  • API authentication: Amazon Cognito with JWT authorizer (pre-built, ready to enable)
  • Rate limiting: API Gateway usage plans with per-key quotas to control Bedrock spend
  • Cost controls: CloudWatch billing alarms at $10 and $25 thresholds with SNS notifications
  • Error sanitization: AWS resource identifiers (ARNs, account IDs) are stripped from user-facing error messages
  • Infrastructure as Code: Full Terraform deployment; reproducible and version-controlled
  • Token usage tracking: Bedrock response metadata passed through for downstream billing attribution

Strategic Outlook : : The Start of Something Bigger

This PoC is phase one of a broader initiative:

Phase Focus Status
Phase 1 Evaluation pipeline — model comparison, prompt experimentation, quality scoring ✅ Complete
Phase 2 Production summarization — selected model(s) deployed for automated generation Planned
Phase 3 Multi-format generation — short-form audio/video, text-to-speech integration Roadmap
Phase 4 Optimization — fine-tuning on editorial feedback, A/B testing, cost-based model routing Roadmap

The customer is evaluating AWS as the long-term platform to support end-to-end content generation workflows — from article ingestion through summarization, text-to-speech synthesis, and media distribution.


The TikTok-Style Video Opportunity

The customer's next ambition is to automatically generate daily short-form video content — 30-60 second clips summarizing top stories in a format optimized for TikTok, Instagram Reels, and YouTube Shorts. This requires chaining multiple AI capabilities:

  • Summarization (Amazon Bedrock) — condense articles into punchy, spoken-word scripts
  • Text-to-Speech (Amazon Polly or third-party) — generate natural-sounding narration
  • Visual generation — create or select imagery to accompany the audio
  • Assembly — combine audio, visuals, and captions into a publishable video

The evaluation pipeline built in Phase 1 directly informs this: the model and prompt strategy selected for summarization quality will power the script generation layer of the video pipeline. The same LLM-as-Judge framework can evaluate script quality for the shorter, punchier format that short-form video demands.

The customer is seeking a continued partnership with AWS to build this end-to-end workflow — from article ingestion through AI summarization, speech synthesis, and automated video publishing. This represents a multi-phase engagement with growing AWS service consumption across Bedrock, Polly, S3, MediaConvert, and potentially SageMaker for fine-tuning.


Expected Outcomes

  • A working evaluation pipeline for large-scale AI experimentation on Amazon Bedrock
  • Clear, data-driven comparison of foundation models and prompt strategies
  • Identification of the most suitable model(s) for summarization and script generation
  • A reusable framework for future optimization and potential fine-tuning
  • Reduced decision risk around model selection through structured experimentation

Conclusion

Foundation model selection for production use cases requires more than benchmark comparisons. It requires evaluation on your actual content, with your actual quality criteria, at sufficient scale to surface meaningful differences.

By building the evaluation infrastructure first — using Amazon Bedrock for model access, AWS Step Functions for orchestration, and an LLM-as-Judge approach for quality scoring — this customer established a repeatable, cost-effective framework for making model decisions with confidence.

The serverless architecture ensures the platform costs nothing at rest, scales automatically during experimentation, and deploys in minutes via Terraform. The evaluation framework will continue to serve as new models become available on Amazon Bedrock and as the customer expands into new content formats.


Why Continued Investment Matters

Funding this initiative accelerates experimentation on Amazon Bedrock, reduces decision risk around model selection, and establishes a reusable evaluation framework that supports not just the current summarization use case — but the entire roadmap from podcast generation through TikTok-style video automation. The customer's growing ambition maps directly to expanding AWS service adoption.


AWS Services Referenced

Amazon Bedrock · AWS Step Functions · AWS Lambda · Amazon API Gateway · Amazon S3 · Amazon Cognito · Amazon CloudWatch · Amazon SNS · Terraform (Infrastructure as Code)


This solution was built by Storm Reply as part of an AWS-funded proof of concept for AI-driven content generation.

Top comments (0)