If you look at the codebase of an early-stage AI startup, you will almost always find a file named utils.py or constants.js containing massive blocks of hardcoded text.
These are the LLM system prompts.
When a model hallucination occurs in production, a developer goes into the code, tweaks a few sentences in the prompt, runs a quick manual test, and pushes the change to production.
This works for prototypes, but in a production system it is a massive operational risk.
"Prompt drift" is real. A small change designed to fix an edge case can unintentionally break the formatting, tone, or logic for dozens of other use cases. If you want to build reliable AI systems, you have to stop treating prompts like magical incantations and start treating them like code.
Here is how a modern engineering team architects an automated, version-controlled CI/CD pipeline for LLM prompts using GitHub Actions, AWS CodePipeline, and AWS Systems Manager (SSM) Parameter Store.
The Core Problem: Tightly Coupled AI
When you hardcode prompts into your application logic (e.g., inside an AWS Lambda function), you tightly couple your application release cycle with your AI tuning cycle.
- To fix a typo in a prompt, you have to redeploy the entire application.
- You have no historical record of why a prompt changed and how it affected output quality.
- You have no automated gate preventing a "bad" prompt from reaching production.
The solution is to decouple the prompt from the code, version it in Git, evaluate it automatically, and inject it at runtime.
The Serverless Prompt Pipeline Architecture
To bring engineering rigor to our AI workflows, we need three distinct layers: Storage, Evaluation, and Runtime Injection.
1. The Git & Evaluation Flow
Instead of hardcoding strings, developers maintain a prompts.json or prompts.yaml file in their repository. When a pull request is opened, it triggers an evaluation pipeline.
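As a minimal sketch, a prompt file might look like the JSON embedded below, with a loader that fails fast if a template is malformed. The schema (version numbers, `required_placeholders`) is illustrative, not prescribed:

```python
import json

# Hypothetical prompts.json content; keys and structure are illustrative.
PROMPTS_JSON = """
{
  "customer_service_prompt": {
    "version": 3,
    "template": "You are a helpful support agent. Answer only from the provided context. Context: {context} Question: {question}",
    "required_placeholders": ["context", "question"]
  }
}
"""

def load_prompts(raw: str) -> dict:
    """Parse the prompt file and fail fast if a template is missing a placeholder."""
    prompts = json.loads(raw)
    for name, spec in prompts.items():
        for placeholder in spec["required_placeholders"]:
            if "{" + placeholder + "}" not in spec["template"]:
                raise ValueError(f"{name} is missing placeholder {{{placeholder}}}")
    return prompts

prompts = load_prompts(PROMPTS_JSON)
```

A check like this runs as the first, cheapest step of the PR pipeline, before any model is invoked.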
2. Runtime Injection (AWS SSM Parameter Store)
Once the CI/CD pipeline validates that the new prompt doesn't break existing functionality, it uses the AWS CLI/SDK to push the updated prompt string into AWS SSM Parameter Store (e.g., under the path /prod/llm/customer_service_prompt).
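The CI publish step can be a few lines of boto3. The client is passed in rather than created inside the function so the step stays testable; the parameter path matches the example above:

```python
def publish_prompt(ssm, name: str, value: str) -> int:
    """Push a validated prompt to SSM; returns the new parameter version."""
    resp = ssm.put_parameter(
        Name=name,
        Value=value,
        Type="String",
        Overwrite=True,  # each overwrite creates a new SSM parameter version
    )
    return resp["Version"]

# In the CI job (sketch):
#   import boto3
#   publish_prompt(boto3.client("ssm"),
#                  "/prod/llm/customer_service_prompt", new_prompt)
```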
When your application (running on AWS Lambda, ECS, or EKS) is invoked, it dynamically fetches the prompt from SSM.
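The runtime side is equally small. A sketch of the fetch inside a Lambda handler (client created once per execution environment, outside the handler):

```python
def fetch_prompt(ssm, name: str) -> str:
    """Fetch the current prompt string from SSM at invocation time."""
    resp = ssm.get_parameter(Name=name)
    return resp["Parameter"]["Value"]

# Inside the Lambda module (sketch):
#   import boto3
#   ssm = boto3.client("ssm")  # reused across warm invocations
#   def handler(event, context):
#       prompt = fetch_prompt(ssm, "/prod/llm/customer_service_prompt")
#       ...
```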
The CTO Perspective: Why Architect It This Way?
Building this pipeline requires upfront engineering effort. Here is why it is worth it for scaling teams:
1. Zero-Downtime Prompt Updates
Because the Lambda function fetches the prompt from SSM at runtime, your product managers or AI engineers can deploy prompt improvements instantly without requiring a full backend deployment or passing through a lengthy code build process.
2. Guarding Against Regression
The "Automated Evaluation Gate" is the most critical piece of this architecture. You maintain a "Golden Dataset" of 50-100 real user inputs and expected outputs.
During the CI phase, you run the proposed prompt against this dataset using an "LLM-as-a-judge" pattern. If the new prompt causes the model to start hallucinating or dropping required JSON keys, the pipeline fails the build automatically.
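The gate itself is a small loop; the threshold and callable names below are assumptions. The `generate` and `judge` callables are injected (e.g., thin wrappers around Amazon Bedrock invocations) so the gate logic can be unit-tested without touching a model:

```python
def run_eval_gate(candidate_prompt, golden_dataset, generate, judge, threshold=0.9):
    """
    Fail the build if the candidate prompt scores below `threshold` on the
    golden dataset. `generate(prompt, input)` produces a model output;
    `judge(input, expected, output)` is the LLM-as-a-judge verdict (bool).
    """
    passed = 0
    for case in golden_dataset:
        output = generate(candidate_prompt, case["input"])
        if judge(case["input"], case["expected"], output):
            passed += 1
    score = passed / len(golden_dataset)
    if score < threshold:
        # Non-zero exit fails the CI build automatically
        raise SystemExit(f"Eval gate failed: {score:.0%} < {threshold:.0%}")
    return score
```

In practice the judge prompt asks a second model whether the output preserves required JSON keys, tone, and factual grounding for each golden case.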
3. Auditability and Rollbacks
Because SSM Parameter Store supports versioning, you get an automatic audit trail. If Version 14 of your prompt causes issues in production, rolling back is simply a matter of reverting to Version 13 via the AWS Console or CLI.
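A rollback can also be scripted. SSM's `name:version` selector syntax lets you read a historical value and re-publish it as the newest version (re-publishing, rather than pointing at the old version, keeps "latest" as the single source of truth for the runtime):

```python
def rollback_prompt(ssm, name: str, target_version: int) -> int:
    """Revert a parameter by re-publishing a previous version's value."""
    old = ssm.get_parameter(Name=f"{name}:{target_version}")  # version selector
    resp = ssm.put_parameter(
        Name=name,
        Value=old["Parameter"]["Value"],
        Type="String",
        Overwrite=True,
    )
    return resp["Version"]

# Sketch: rollback_prompt(boto3.client("ssm"),
#                         "/prod/llm/customer_service_prompt", 13)
```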
Engineering Tradeoffs & Best Practices
If you implement this architecture tomorrow, keep these real-world constraints in mind:
- SSM API Limits: AWS SSM Parameter Store has API rate limits. If you have a high-traffic API (e.g., hundreds of requests per second), fetching the prompt from SSM on every single invocation will result in ThrottlingException errors.
  - The Fix: Implement caching inside your Lambda execution environment (e.g., caching the prompt in memory outside the handler function for 5 minutes), or use AWS AppConfig, which is explicitly designed for high-throughput dynamic configuration.
- Evaluation Costs: Running 100 tests through Claude 3.5 Sonnet on every single Git commit will spike your Amazon Bedrock bill.
  - The Fix: Run the full evaluation suite only on merges to the main branch, or use a smaller, cheaper model (like Claude 3 Haiku) to run quick sanity checks on feature branches.
- String Limits: Standard SSM parameters have a 4KB size limit. If you are using massive few-shot prompts with thousands of tokens, you will need to use the Advanced Parameter tier (up to 8KB) or store the prompt in an S3 bucket and store the S3 URI in SSM.
The Bottom Line
Generative AI is shifting from an experimental feature to a core architectural component of modern applications. If you wouldn't deploy database schema changes without testing and version control, you shouldn't deploy prompt changes without them either.
By combining GitOps, AWS CodePipeline, and SSM Parameter Store, you bridge the gap between AI experimentation and reliable software engineering.
How does your team currently manage LLM prompts? Are they hardcoded, stored in a database, or managed via an external tool? Let's discuss in the comments.