DEV Community: Anwaar Hussain

Resolving Control Tower StackSet Drift from Orphaned Accounts

Anwaar Hussain — Sat, 11 Jul 2026 09:38:29 +0000

I upgraded AWS Control Tower landing zone from version 3.3 to 4.0. The update failed the first time. StackSet instances showed INOPERABLE status with a Service-Linked Role conflict pointing to accounts that no longer existed in the organisation.

Turns out, accounts had been removed from AWS Organizations without first being un-enrolled from Control Tower. Their StackSet instances remained as orphans. Control Tower couldn't assume the execution role in deleted accounts. Every landing zone operation hit this wall.

Here's how I fixed it.

Why the Service-Linked Role conflict occurs

When Control Tower enrols an account, it creates a Service-Linked Role (AWSServiceRoleForAWSControlTower) and registers that account as a resource consumer of the role. If you close or remove the account from the organisation without un-enrolling it first, the role's resource reference still points to the deleted account. During a landing zone update, Control Tower tries to reconcile the StackSet across all registered accounts. It finds the orphaned reference, cannot reach the account, and fails with the INOPERABLE conflict.

The error

StackSet Id: AWSControlTowerBP-BASELINE-SERVICE-LINKED-ROLE Status: INOPERABLE ResourceLogicalId: ControlTowerServiceRole ResourceType: AWS::IAM::ServiceLinkedRole Reason: Resource of type 'AWS::IAM::ServiceLinkedRole' with identifier 'AWSServiceRoleForAWSControlTower' has a conflict. SLR [AWSServiceRoleForAWSControlTower] is in use by other resources referencing an orphaned account no longer in the organisation.

The fix

Step 1: Remove orphaned stack instances from the affected StackSet. Select "Retain Stacks" because the deleted account is unreachable.

aws cloudformation delete-stack-instances \
  --stack-set-name AWSControlTowerBP-BASELINE-SERVICE-LINKED-ROLE \
  --accounts <ORPHANED_ACCOUNT_ID> \
  --regions <GOVERNED_REGIONS> \
  --retain-stacks \
  --no-cli-pager

Step 2: Check your other Control Tower baseline StackSets for the same orphaned accounts and repeat. Common ones include:

AWSControlTowerBP-BASELINE-CONFIG
AWSControlTowerBP-BASELINE-CLOUDWATCH
AWSControlTowerBP-BASELINE-ROLES
AWSControlTowerBP-BASELINE-SERVICE-ROLES
AWSControlTowerBP-VPC-ACCOUNT-FACTORY-V1

The exact StackSets in your environment may vary depending on your Control Tower version and configuration. Check all StackSets prefixed with AWSControlTowerBP- for orphaned instances.

Step 3: Navigate to Service Catalog > Provisioned products. If any orphaned accounts have a provisioned product in TAINTED or ERROR state, terminate it.

Step 4: Re-run the Control Tower landing zone update.

Step 5: Re-register your Organisational Units. This propagates updated guardrails and baselines to enrolled accounts under the new landing zone version.

Console path: AWS Control Tower > Organization > select the OU > Re-register OU.

Things to note

Only remove instances for orphaned accounts. Do not touch active, enrolled accounts.
"Retain Stacks" is critical. Without it, the delete operation fails because the StackSet cannot assume a role in deleted accounts.
This issue can occur during any Control Tower landing zone upgrade, not just 3.3 to 4.0.

Prevention

Always un-enrol accounts from Control Tower before closing them in AWS Organizations. AWS documentation states this explicitly: "You must unenroll the account before you close it."

References

Durable Workflows on AWS: Lambda Durable Functions, Step Functions, and MWAA

Anwaar Hussain — Tue, 07 Jul 2026 12:05:30 +0000

Level: 300–400 | Reading time: ~10 minutes

Recently, a customer asked me when they should choose Amazon Managed Workflows for Apache Airflow (Amazon MWAA) versus AWS Step Functions. I walked them through their use cases, pointed them to the recently published AWS comparison blog, and mentioned AWS Lambda durable functions as something worth considering for their new data platform.

That made me realise how many teams are asking the same question. The AWS blog covers MWAA and Step Functions well. It references Lambda durable functions at the end, pointing readers to the documentation for more detail. This post expands that comparison by bringing Lambda durable functions into the picture.

In this post, you will learn how Lambda durable functions, Step Functions, and MWAA differ in developer experience and architectural fit. You get five real scenarios, the same workflow implemented three ways, and a decision framework you can apply to your own workloads.

Prerequisites

Familiarity with at least one AWS compute service (Lambda, Amazon ECS, or Amazon EC2). A basic understanding of workflow orchestration concepts such as DAGs, state machines, and checkpoints. The AWS Durable Execution SDK installed for Python, TypeScript, or Java.

Lambda durable functions in 30 seconds

Lambda durable functions use a checkpoint and replay mechanism. Your function runs for up to one year, recovering from interruptions by replaying from the last saved checkpoint. You write orchestration in application code, not in Amazon States Language (ASL) and not in Airflow DAGs.

The mental model is simple. Step Functions lives outside your code. Durable functions live inside it.

AWS positions these three services for different orchestration patterns. Durable functions handle application-level orchestration. Step Functions handles cross-service orchestration. MWAA handles data pipeline orchestration. They complement each other.

Developer experience compared

Lambda durable functions give you code-first orchestration. You write workflow logic in Python, TypeScript, or Java using the Durable Execution SDK. You test locally without cloud dependencies. You debug with familiar stack traces. There is no visual designer. You read code, not diagrams.

AWS Step Functions gives you DSL-driven orchestration. You define workflows in ASL or use Workflow Studio's visual canvas. You get 200+ native service integrations without writing Lambda glue code. Non-technical stakeholders can follow execution visually. The trade-off is that complex ASL gets verbose and Workflow Studio has limits on larger state machines.

Amazon MWAA gives you scheduler-driven orchestration. You write Python DAGs using the Apache Airflow ecosystem. You get scheduling, backfills, and SLA monitoring as first-class features. The operator ecosystem covers AWS and non-AWS systems. The trade-off is environment provisioning time and a baseline cost that only makes sense when you run multiple pipelines.

None of these services is universally better. Each fits a different architectural pattern.

Real scenarios: where each service wins

Scenario 1: Multi-step AI agent workflow

A customer submits a support ticket. You classify intent via Amazon Bedrock, extract entities, route to the right team, generate a draft response, wait for human review, then send the reply.

Durable functions fit naturally here. The logic is sequential, the human-in-the-loop wait costs nothing (no compute during the pause), and the entire workflow reads like a script. Step Functions also works well because of native Bedrock integration and visual tracing. MWAA is awkward for this pattern. There is no scheduling need, no backfill requirement, and you would force Airflow into a request/response pattern for which it was not designed.

Trade-off: Durable functions win on developer experience and testability. Step Functions wins on native integrations and visual tracing.

Scenario 2: Nightly data pipeline with conditional branching

Every night at 02:00, you pull data from three APIs, validate the schema, branch conditionally (trigger an AWS Glue crawler if the schema changed, skip if not), transform, load to Amazon Redshift, notify Slack, and support backfills for missed days.

MWAA wins clearly. Scheduling, backfill, SLA monitoring, Glue operators, and Redshift operators are all first-class. Step Functions works but you bolt on Amazon EventBridge for scheduling and build custom backfill logic. Durable functions are a poor fit here. No native scheduling, no backfill capability.

Winner: MWAA.

Scenario 3: E-commerce order fulfilment with parallel fan-out

An order arrives. You validate payment, reserve inventory, then fan out in parallel: generate an invoice to Amazon S3, notify the warehouse via Amazon SQS, send confirmation via Amazon SES, and update the CRM in Amazon DynamoDB. You wait for all branches to complete, then mark the order confirmed.

Step Functions wins clearly. The Parallel state handles fan-out natively. Each branch uses a different service integration directly without Lambda in between. Durable functions work for the sequential parts but parallel fan-out across multiple services requires additional orchestration. MWAA is overkill for this pattern.

Winner: Step Functions.

Scenario 4: Long-running document processing with human approval

A user uploads a contract PDF. You extract text with Amazon Textract, summarise clauses via Bedrock, flag risky clauses, then wait for legal team approval (which could take days). If approved, store and index. If rejected, notify the uploader with reasons.

Durable functions are strong here. The wait costs nothing. The logic reads top-to-bottom. Testing the approval branch is a unit test. Step Functions is also strong. The .waitForTaskToken pattern handles the human gate, and native Textract and Bedrock integrations avoid Lambda glue. MWAA is a poor fit because sensors polling for approval waste worker resources.

Trade-off: Durable functions win on developer experience. Step Functions wins on native integrations and visual audit trail.

Scenario 5: Multi-account infrastructure provisioning

A new team onboards. You create an AWS account via AWS Organizations, deploy baseline AWS CloudFormation stacks, configure networking, set up AWS Identity and Access Management (IAM) roles, wait for security team approval, enable Amazon GuardDuty and AWS Config, then notify via Amazon SNS.

Step Functions wins clearly. Every step is an AWS service call. Native integrations for Organizations, CloudFormation, and IAM mean zero custom code. The visual audit trail supports compliance requirements. Durable functions add no value here because you would write SDK calls for everything that Step Functions already integrates natively. MWAA is the wrong tool entirely.

Winner: Step Functions.

The scorecard

No single service dominates. That is the point.

Scenario 1: Durable functions ≈ Step Functions (trade-off)
Scenario 2: MWAA
Scenario 3: Step Functions
Scenario 4: Durable functions ≈ Step Functions (trade-off)
Scenario 5: Step Functions

Same workflow, three implementations

Taking Scenario 1 (AI agent workflow), here is the same logic implemented three ways.

Code samples in this post are illustrative. Refer to the AWS Durable Execution SDK documentation for the latest API syntax.

Lambda durable functions (Python)

from aws_durable_execution_sdk_python.config import Duration
from aws_durable_execution_sdk_python.context import DurableContext, StepContext, durable_step
from aws_durable_execution_sdk_python.execution import durable_execution


@durable_step
def classify_intent(step_context: StepContext, ticket: dict) -> str:
    return call_bedrock_classify(ticket)


@durable_step
def extract_entities(step_context: StepContext, ticket: dict) -> dict:
    return call_bedrock_extract(ticket)


@durable_step
def route_to_team(step_context: StepContext, intent: str, entities: dict) -> str:
    return determine_team(intent, entities)


@durable_step
def generate_draft(step_context: StepContext, ticket: dict, intent: str) -> str:
    return call_bedrock_draft(ticket, intent)


@durable_step
def send_reply(step_context: StepContext, ticket: dict, draft: str) -> None:
    dispatch_reply(ticket, draft)


@durable_execution
def lambda_handler(event, context: DurableContext) -> dict:
    ticket = event["ticket"]

    intent = context.step(classify_intent(ticket))
    entities = context.step(extract_entities(ticket))
    team = context.step(route_to_team(intent, entities))
    draft = context.step(generate_draft(ticket, intent))

    # Wait for human review window (24 hours max)
    context.wait(Duration.from_hours(24))

    context.step(send_reply(ticket, draft))

    return {"statusCode": 200, "body": f"Ticket routed to {team}, reply sent"}

Step Functions (ASL excerpt)

{
  "StartAt": "ClassifyIntent",
  "States": {
    "ClassifyIntent": {
      "Type": "Task",
      "Resource": "arn:aws:states:::bedrock:invokeModel",
      "Next": "ExtractEntities"
    },
    "ExtractEntities": {
      "Type": "Task",
      "Resource": "arn:aws:states:::bedrock:invokeModel",
      "Next": "RouteToTeam"
    },
    "RouteToTeam": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:...:route-to-team",
      "Next": "GenerateDraft"
    },
    "GenerateDraft": {
      "Type": "Task",
      "Resource": "arn:aws:states:::bedrock:invokeModel",
      "Next": "WaitForReview"
    },
    "WaitForReview": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke.waitForTaskToken",
      "Next": "SendReply"
    },
    "SendReply": {
      "Type": "Task",
      "Resource": "arn:aws:states:::ses:sendEmail",
      "End": true
    }
  }
}

Amazon MWAA (Airflow DAG excerpt)

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.sensors.external_task import ExternalTaskSensor

with DAG("ticket_workflow", schedule=None) as dag:
    classify = PythonOperator(task_id="classify", python_callable=call_bedrock_classify)
    extract = PythonOperator(task_id="extract", python_callable=call_bedrock_extract)
    route = PythonOperator(task_id="route", python_callable=route_to_team)
    draft = PythonOperator(task_id="draft", python_callable=call_bedrock_draft)
    review = ExternalTaskSensor(task_id="wait_review", timeout=86400)
    send = PythonOperator(task_id="send", python_callable=send_reply)

    classify >> extract >> route >> draft >> review >> send

The durable functions version is more verbose than a pseudocode sketch, but each step is explicit, testable, and replay-safe. The ASL version is 35 lines. The Airflow DAG is 12 lines. But line count is not the full picture. Step Functions gives you a visual execution trace without writing a single log statement. MWAA gives you scheduling and backfill without writing a single cron job. Each trade-off is real.

Gotchas I learned the hard way

Replay non-determinism catches you off guard. Durable functions replay from the last checkpoint on recovery. Any non-deterministic call inside the orchestration logic produces a different result on replay. This includes datetime.now(), random(), and HTTP calls. Perform non-deterministic operations inside durable steps and keep your orchestration logic deterministic.

# This breaks on replay — called inside orchestration logic, not a step
timestamp = datetime.now()

# This works — wrapped in a durable step
@durable_step
def get_timestamp(step_context: StepContext) -> str:
    return datetime.now().isoformat()

Checkpoint frequency involves trade-offs. Recovery speed, replay overhead, and persistence cost all pull in different directions. The right balance depends on your workload. Test it rather than assuming a universal rule.

MWAA environment provisioning takes time. Spinning up a new environment typically takes 20 to 30 minutes. Iteration speed is slower compared to durable functions or Step Functions where you deploy in seconds.

Large Step Functions workflows become difficult to manage visually. Workflow Studio works well for smaller state machines. As complexity grows, many teams eventually edit ASL directly. Plan for this from the start if your workflow is complex.

Portability varies. MWAA runs on open-source Apache Airflow. Your DAGs work on any compatible Apache Airflow deployment, though provider-specific operators may still need changes. Durable functions use the AWS Durable Execution SDK, which is AWS-specific, though the checkpoint/replay pattern itself is transferable. Step Functions uses ASL, which has no equivalent outside AWS.

Decision framework

Choose durable functions when your team owns the code and wants orchestration embedded in application logic. When you value local testability and fast iteration. When the workflow is application-centric rather than cross-service infrastructure glue.

Choose Step Functions when you orchestrate multiple AWS services without custom code. When non-technical stakeholders need visual workflow visibility. When you want native integrations (DynamoDB, SQS, Bedrock) without Lambda in between. When you need declarative error handling with retry, catch, and fallback patterns.

Choose MWAA when you run complex data pipelines with scheduling, backfills, and SLA monitoring. When your team already knows Airflow and its operator ecosystem. When you need dependency-aware DAGs across dozens of tasks. When you coordinate across AWS and non-AWS systems.

Choose a combination when the real world demands it. Step Functions for cross-service orchestration with a durable function handling complex logic within a single task. MWAA for scheduling with Step Functions handling individual pipeline steps. Durable functions for core business logic with Step Functions for parallel fan-out.

Conclusion

Lambda durable functions, Step Functions, and MWAA serve different developer preferences and architectural patterns. Durable functions bring orchestration into your application code. Step Functions orchestrates across AWS services visually. MWAA handles complex scheduled data pipelines. They complement each other rather than compete.

Start by identifying your orchestration pattern. Match it to the service that fits. Combine them when a single service does not cover the full workflow.

Further reading:

Questions or a scenario you want me to map? Drop them in the comments.

DadOps v1.0: 7 DevOps principles I applied to fatherhood

Anwaar Hussain — Thu, 18 Jun 2026 16:44:04 +0000

Nature has always inspired technology. Birds inspired flight. Ant colonies inspired distributed systems. Neural networks borrowed from the human brain.

In my case? DevOps principles from 10 years of cloud infrastructure work across multiple organisations in 4 countries inspired me to be a better father.

7 weeks ago, I supported the most high-risk deployment of my life: Baby v1.0. Mum led the delivery. Instant promotion to DadOps Engineer.

Week 1 confirmed what I already knew: I am not the lead architect. Mum is. She is the Principal Engineer, Product Owner, and Key Stakeholder. Baby v1.0 has a hard dependency on her. I am the DevOps Engineer making sure the primary region stays healthy.

One thing they do not tell you: baby has one mode of communication in the initial weeks. Crying. It could mean hunger, wind, nappy, overstimulation, or just "I exist and I am angry about it." The system will not stop alerting until you fix the root cause.

In this post, I share 7 DadOps principles that a decade of designing resilient cloud systems rewired in my brain. They made me a more intentional, present father.

DadOps Principle #1: Culture of collaboration — align with your stakeholder

DevOps says: Map stakeholders before you design. Build without the product owner and you build the wrong thing.

DadOps says: Week 1 mistake. I thought my job was to "support." Wrong framing.

Mum carried Baby v1.0 for 9 months, delivered her, and runs the primary region: feeding, recovery, bonding. Baby has a physical dependency on Mum that I cannot replicate. She is the database. I am cache.

The shift: replace "How can I support?" with "What does the stakeholder need to succeed?" That single question changed every decision I made.

🎯 Takeaway: Collaboration starts with understanding who owns what. Align with the stakeholder. Execute their vision, not yours.

DadOps Principle #2: Continuous monitoring — observe the system AND the operator

DevOps says: Monitor CPU, memory, error rates. But also monitor the engineers running the system. Burned-out on-call engineers cause system failures.

DadOps says: Week 2, I tracked Baby metrics religiously: feeds, nappies, sleep. All green. But Mum metrics were red: sleep debt, recovery time, mental load. I was monitoring the service but ignoring the operator.

DadOps fix, added "Mum observability":

Sleep metric: Last 3+ hour uninterrupted stretch?
Capacity metric: How many requests has she handled today?
Recovery metric: Post-birth healing is a long-running job, not a sprint.

As a husband, this matters as much as being a good dad. If the operator fails, the service fails.

🎯 Takeaway: Monitor the people running the system, not just the system itself. Stakeholder health is your most critical metric.

DadOps Principle #3: Automate everything — remove toil so the stakeholder can focus

DevOps says: Remove undifferentiated heavy lifting. Use managed services so engineers focus on what matters.

DadOps says: Mum's "what matters" = feeding, recovery, bonding. My job = remove everything else.

My toil-reduction backlog:

Cooking: She should not spend compute cycles on meals. I own meal prep.
Cleaning: Dishes, laundry, floors = cognitive load. I own them.
Night shifts: I agreed with Mum that I need my sleep between 10pm and 2am as I am a heavy sleeper during those hours. Besides that, I support her in everything else.
Pram and car seat: Assembled, tested, and loaded in the car. Ready to deploy at a moment's notice.

But toil removal is not just chores. It is also offloading Mum physically and mentally:

Contact naps from Day 1: Baby sleeping on Dad's chest builds bond and gives Mum a break. Non-negotiable from the start.
Singing and humming: Learn a few tunes. When Mum needs a break, a calm hum from Dad can settle baby just as well.
Park walks from Day 2: Against popular opinion, my wife encouraged me to get out early. Slinging baby to my chest for a walk or a quick grocery run gave Mum uninterrupted recovery time. Between weeks 6 to 8, those walks became the fastest way to soothe hysterical crying. Fresh air helps Mum recover too. Sometimes the best automation is just stepping outside.

🎯 Takeaway: Good DevOps engineers remove toil. Good dads remove toil from Mum. If she is washing bottles at 2am, I failed my SLA.

DadOps Principle #4: Shift-left — prepare before production

DevOps says: Test early. Catch issues before they reach production. Security, quality, and validation shift left into the earliest stages.

DadOps says: We assembled the cot, tested the car seat, packed the hospital bag, and set up the bottle station. All before the deployment date.

One thing I did not practice: nappy changes. Learned that live in production. Tip I picked up fast: keep the wipes warm. A cold wipe on a sleeping baby is like a failed deployment that wakes up the entire system.

Teams that scramble after go-live skipped shift-left. Same applies here. If you are assembling the cot while your wife is in labour, you skipped testing.

🎯 Takeaway: Preparation is not optional. Shift-left means fewer incidents in production.

DadOps Principle #5: CI/CD — small, frequent iterations beat big-bang deployments

DevOps says: Deploy small changes frequently. Each one is low-risk. Batch everything into one massive release and you invite failure.

DadOps says: The newborn cycle is a continuous loop: feed, burp, nappy change, sleep, repeat. Every 2-3 hours. No sprint planning. No backlog grooming. Just continuous delivery on a fixed cadence.

Skip three feeds and batch them? That is a big-bang deployment. It will fail. Loudly.

Week 6 growth spurt = unexpected traffic spike. Feeding frequency doubled overnight. No warning. No change request. The fix: stop fighting it. Scale with demand. This is expected behaviour, per the documentation I did not read.

One more thing: keep cool when baby is crying. Panic is contagious. If you stay calm, baby reads that signal. Treat it like a production alert. Acknowledge, assess, act. Do not escalate your own stress into the system.

🎯 Takeaway: Continuous delivery of care. Small, frequent, low-risk. Big-bang parenting causes outages (screaming).

DadOps Principle #6: Version control and IaC — document everything, make it reproducible

DevOps says: Infrastructure as Code means anyone can deploy the system. No tribal knowledge. No "only Dave knows how to do this."

DadOps says: I documented everything so my wife, grandparent, or visitor can operate independently:

Feeding instructions and schedule
Nap routine and white noise settings
Burping positions that work (4 tested, 2 reliable)
The tummy-on-arm hold that calms her in seconds

If the routine lives only in your head, you are a single point of failure.

How I earned trust: Shared a "DadOps Runbook" with positions that work, cry patterns, and escalation paths. Execute Mum's decisions exactly. No "but I saw on TikTok…" Handle cooking and cleaning without raising tickets. Status updates: "Incident resolved. Stakeholder can sleep 2 more hours."

Week 3: Mum was in every incident. Week 7: Mum sleeps while I handle wake-ups. She trusts the runbook.

🎯 Takeaway: Document your routines. Version them. Boring ops = trusted DevOps engineer. Trust = stakeholder does not have to think about you.

DadOps Principle #7: Feedback loops and continuous improvement — iterate, do not stagnate

DevOps says: Use operational data, incident reviews, and user feedback to drive the next iteration. Blameless post-mortems focus on systemic fixes, not blame.

DadOps says: Bad night? Do not blame each other. Run a blameless retro.

"What happened?" She woke every 45 minutes.
"Why?" Likely a growth spurt. Possibly overtired from a short nap day.
"What do we change?" Earlier bedtime tomorrow. Extra feed before the long stretch.

No blame. No "you should have done X." Data, root cause, action items.

Cornwall road trip, a real-world deployment test. Last-minute decision to freshen up the family with a 2-day trip to Cornwall. Day 1: Baby screamed in the car seat. Like a failed data transfer. Day 2: She settled. By the end of the trip, she was comfortable. The feedback loop worked. We iterated on positioning, timing stops around feeds, and white noise in the car. Each journey got smoother.

🎯 Takeaway: Iterate based on data, not emotion. Blameless retros strengthen the team. Every failed deployment teaches you something for the next one.

What is next: DadOps Roadmap v1.1

Vaccinations start week 8. Side effects are documented. I am prepping the runbook: pain relief dosage, temperature monitoring, comfort positions. Shift-left applies here too. Prepare before the incident, not during it.

Growth spurts will keep coming, faster and less predictable. The system scales whether you are ready or not.

The biggest shift ahead: Baby's dependency on Mum will reduce over the coming months. Weaning, solids, mobility. New services to deploy. Dad moves from DevOps Engineer to Co-Architect. Shared ownership increases. Responsibility scales with the system.

I am ready. DevOps taught me how.

Conclusion

In this post, I showed how 7 DevOps principles (collaboration, monitoring, automation, shift-left, CI/CD, version control, and feedback loops) apply directly to fatherhood.

Nature inspires science. DevOps inspired me to be a better dad.

The most important lesson: do not try to be the primary region. Be the DevOps engineer that keeps the primary region healthy.

To new dads in tech: you already know this job. You know on-call, incidents, stakeholders, capacity planning. DadOps is just a new stack with worse documentation and a non-negotiable SLA.

Now if you will excuse me, my key stakeholder just raised a Severity 1 alert: HTTP 418 I'm a teapot = I am hungry. Time to execute the runbook. 🚀

Dads in tech: what is your #1 DadOps lesson? Drop it below. Let us write the docs no one gave us. 💙

DevOps to MLOps: Treat the ML Model as Your New Workload

Anwaar Hussain — Fri, 05 Jun 2026 13:58:10 +0000

This post references AWS services, frameworks, and tools to explain the Machine Learning Operations (MLOps) concepts. The principles apply to any cloud platform, orchestration tool, or ML service. Swap them with your preferred solutions; the pipeline discipline remains the same. Note that a foundational understanding of ML models is a prerequisite to MLOps, but you can build it in parallel while applying your existing Continuous Integration/Continuous Deployment (CI/CD) skills.

Introduction

I recently completed an internal AWS program focused on MLOps, and the biggest takeaway was this: if you already know DevOps, you already know most of MLOps.

DevOps engineers building CI/CD pipelines for Infrastructure as Code (IaC), microservices, and serverless applications already have 80% of the skills needed for MLOps. The fundamentals of code versioning, continuous integration, continuous deployment, testing, deployment strategies, monitoring, and rollback all apply directly.

The difference? Your workload changed. Instead of deploying application code or infrastructure templates, you are deploying a trained model. The pipeline stages stay the same. The artifacts passing through them are different.

In this post, you will learn how DevOps pipeline concepts map to MLOps, what new considerations come with ML workloads, and how to structure your first ML pipeline using the tools you already know.

The mental model: your workload changed, not your pipeline

In DevOps, your workload is application code, a container image, or a CloudFormation template. You version it, test it, deploy it, monitor it, and roll it back when something breaks.

In MLOps, your workload is the model. A model is the output of training code + training data + hyperparameters. It produces an artifact (a serialised file) that you deploy to an endpoint for inference.

Everything else stays the same:

You version the model artifact the same way you version a container image.
You test the model the same way you run integration tests on a microservice.
You deploy the model the same way you deploy a Lambda function through stages.
You monitor the model the same way you monitor API latency and error rates.
You roll back the model the same way you roll back an API Gateway deployment.

The pipeline is familiar. The workload inside it is new.

Repository structure: organising your ML workload

In DevOps, you separate src/ from infra/ from pipeline/. The same principle applies in MLOps. You add a model/ directory. This is your new workload.

A consistent structure lets your CI/CD pipeline know exactly where to find training scripts, inference code, tests, and dependencies. No guessing, no hardcoded paths. Here is a generic ML repository layout:

ml-project/
├── model/
│   ├── train/
│   │   ├── Dockerfile               # Training container definition
│   │   ├── train.py                 # Training entry point
│   │   ├── preprocessing.py         # Feature engineering
│   │   └── requirements.txt         # Training dependencies
│   ├── inference/
│   │   ├── Dockerfile               # Inference container definition
│   │   ├── serve.py                 # Inference entry point
│   │   ├── predictor.py            # Prediction logic
│   │   └── requirements.txt        # Inference dependencies (lighter)
│   ├── tests/
│   │   ├── test_model_quality.py    # Accuracy, precision, recall
│   │   ├── test_bias.py            # Fairness metrics
│   │   └── test_data_quality.py    # Input validation
│   └── config/
│       ├── hyperparameters.json     # Training hyperparameters
│       └── baseline.json            # Model Monitor baseline
├── infra/
│   ├── lib/                         # AWS Cloud Development Kit (CDK) or CloudFormation stacks
│   └── config/                      # Environment-specific config
├── pipeline/
│   └── buildspec/                   # One buildspec per CI/CD stage
├── monitoring/
│   ├── baselines/                   # Drift detection baselines
│   └── alarms/                      # CloudWatch alarm definitions
├── docs/
│   └── architecture.png
├── README.md
└── .gitignore

Here is why this structure works:

model/train/ and model/inference/ are separated. Different dependencies, different containers, different lifecycle. Training runs once or on a schedule. Inference runs continuously. Keeping them separate means your inference container stays lightweight.
model/tests/ lives next to model code. Your CI pipeline runs model quality tests the same way it runs unit tests for application code.
model/config/ is versioned alongside the model. When you retrain, hyperparameters and baselines change together. Git tracks both.
pipeline/buildspec/ has one spec per stage. Same pattern as your existing AWS CodeBuild projects.

Amazon SageMaker expects /opt/ml/model/ for artifacts and /opt/ml/code/ for scripts in custom containers. Each Dockerfile lives inside its respective directory (model/train/ and model/inference/). Since the inference code maps directly to /opt/ml/code/, the COPY instruction is a one-liner. No path gymnastics.

Your model/ directory is to MLOps what src/ is to application development. It has source code, tests, dependencies, and config. Treat it the same way.

What stays the same

The core DevOps pipeline stages transfer directly to MLOps. Here is how each one maps.

Code versioning

You already version application code in Git. In MLOps, you version the same way but add:

Training code (your model/train/ directory)
Hyperparameters (JSON config files)
Data versions (using tools like Data Version Control (DVC) or SageMaker Experiments)
Model artifacts (tracked in SageMaker Model Registry)

The principle is identical. If you cannot reproduce it, you cannot trust it.

Continuous integration

Your existing CI runs linting, unit tests, and contract tests on every pull request. In MLOps, you add:

Schema validation (linting your API spec with tools like Spectral)
Model quality tests (accuracy, precision, recall against a baseline)
Data quality checks (input validation, missing values, type mismatches)

The pipeline still fails fast on the first broken test. The tests are different, not the pattern.

Continuous deployment and delivery

You already deploy through stages: dev, staging, production. In MLOps, the same pattern applies:

Deploy model to staging endpoint
Run integration tests against staging
Approval gate (manual or automated)
Deploy to production

AWS CodePipeline orchestrates this the same way it orchestrates your IaC deployments. The target changes from an AWS CloudFormation stack to a SageMaker endpoint.

Testing

Your testing pyramid still applies:

Unit tests: Does the training script run without errors?
Integration tests: Does the deployed endpoint return valid responses?
Contract tests: Does the model output match the expected schema?
Performance tests: Does inference latency meet Service Level Agreement (SLA) requirements?

You add model-specific tests: accuracy thresholds, bias checks, and drift baselines. The testing philosophy (fail fast, test early, automate everything) stays the same.

Deployment strategies

Blue/green and canary deployments work the same way:

Blue/green. Deploy new model version to a separate endpoint. Switch traffic atomically. Roll back instantly if metrics degrade.
Canary. Route 10% of traffic to the new model. Monitor prediction quality. Gradually increase to 100%.
Shadow. Send production traffic to both old and new models. Compare outputs without affecting users. This is unique to ML but follows the same traffic-splitting principle.

Other strategies like Linear (gradually shifting traffic in equal increments over time) also apply. The choice depends on your risk tolerance and rollback speed requirements.

SageMaker production variants handle traffic splitting between model versions natively. Same concept as weighted target groups, different workload.

Monitoring and feedback

Amazon CloudWatch metrics, alarms, and dashboards work the same way. You monitor:

Invocation count, latency, error rates (same as any API)
Model-specific metrics: prediction distribution, confidence scores, feature drift

AWS X-Ray traces requests end-to-end the same way it traces your microservices. The difference is you also trace which model version served each prediction.

Rollback

Amazon API Gateway deployment history and SageMaker endpoint rollback work the same way as rolling back an AWS Lambda function or Amazon Elastic Container Service (Amazon ECS) service. You point traffic back to the previous version.

The difference in MLOps: rollback is not just operational, it is regulatory. More on this in the rollback section below.

What is new for you

These are the ML-specific concepts that do not have a direct DevOps equivalent. They extend your pipeline rather than replace it.

Model training

Think of training as your "build" step, but for data. Instead of compiling code into a binary, you feed data through an algorithm to produce a model artifact.

SageMaker Training Jobs handle this on managed compute. You specify the training script, input data location (Amazon Simple Storage Service (Amazon S3)), instance type, and hyperparameters. SageMaker provisions the infrastructure, runs training, and stores the output artifact in S3.

The key difference from a code build: training can take minutes to days depending on data size and model complexity. This is why caching matters more in MLOps.

Model testing

In application development, "does it run" is a valid first test. In ML, a model can run perfectly and still produce wrong results.

Model testing validates performance:

Accuracy: Does the model predict correctly above a threshold?
Precision and recall: Does it balance false positives and false negatives?
Bias: Does it treat different groups fairly?
Robustness: Does it handle edge cases without failing silently?

You run these tests in CI the same way you run integration tests. If accuracy drops below baseline, the pipeline fails.

Fine-tuning

Fine-tuning is iterative improvement of an existing model using new or domain-specific data. Think of it as patching, but with data instead of code.

You take a pre-trained model, feed it additional data, and produce an updated artifact. The pipeline stages (test, validate, deploy) remain the same. The input changes from code to data.

Model monitoring (drift detection)

This is the biggest difference from traditional DevOps. Application code does not degrade over time. Models do.

Model drift happens when the real-world data distribution changes from what the model was trained on. The model still runs, still returns responses, but the quality of those responses degrades silently.

SageMaker Model Monitor continuously evaluates live inference data against a training baseline. It detects:

Data quality drift: Input features change shape or distribution.
Model quality drift: Accuracy, precision, or recall drops below threshold.
Bias drift: Fairness metrics shift post-deployment.

When drift is detected, Model Monitor fires an Amazon EventBridge event. You can trigger an alarm, notify the team, or initiate automated rollback.

In DevOps terms: Model Monitor is your health check, but for prediction quality rather than uptime.

DevOps vs MLOps pipeline: the parallel

The following diagram shows how every DevOps pipeline stage has a direct MLOps equivalent. The workload passing through the pipeline changed. The pipeline structure did not.

The left side is your world today. The right side is MLOps. Notice how every stage has a direct equivalent.

Caching: why it matters more in MLOps

In DevOps, a failed build takes seconds to minutes to re-run. In MLOps, a failed training job can waste hours or days of compute. Caching between pipeline stages becomes critical for cost and speed.

Model artifacts in S3. Once training completes, store the artifact in a versioned S3 bucket. If deployment fails, you do not retrain. You redeploy the cached artifact.
Feature Store. Engineered features are expensive to compute. Amazon SageMaker Feature Store caches them for reuse across training and inference. This avoids recomputing the same transformations repeatedly.
Version resolution cache. At inference time, resolving which model version to invoke on every request adds latency. A caching layer (such as Amazon DynamoDB with DynamoDB Accelerator (DAX)) resolves version mappings in microseconds rather than milliseconds.
Container images. Cache your training and inference container images in Amazon Elastic Container Registry (Amazon ECR). Rebuilding containers for every pipeline run wastes time when only the model artifact changed.

In DevOps, you cache dependencies (node_modules, pip packages). In MLOps, you cache everything above plus the model itself. The cost of recomputation is orders of magnitude higher.

Rollback: why it is non-negotiable in AI/ML

In traditional DevOps, rollback is an operational best practice. In MLOps, it is a regulatory requirement. Regulators are paying attention to AI failures and the penalties are significant.

AI incidents hit a record 362 in 2025, up from 233 in 2024 (Stanford HAI AI Index 2026).
The EU AI Act imposes fines up to EUR 35M or 7% of global revenue for non-compliant AI systems (Lawfare Analysis).
The Consumer Financial Protection Bureau (CFPB) fined Goldman Sachs $65M for algorithmic failures in Apple Card (CFPB Enforcement Action).
The Equal Employment Opportunity Commission (EEOC) fined iTutorGroup $365K for age-based algorithmic discrimination (EEOC Press Release).
Gartner predicts 40%+ of agentic AI projects will be cancelled by 2027 due to inadequate risk controls (Gartner Press Release).

Your rollback strategy needs to answer three questions:

How fast can you roll back? Target sub-5-minute recovery. API Gateway deployment history and SageMaker endpoint variants support instant traffic switching.
Can you prove which model served which prediction? Regulators require traceability. Log model version metadata with every inference request using structured CloudWatch Logs.
Is your audit trail immutable? Use AWS CloudTrail with immutable logging. No one can tamper with the evidence after the fact.

In DevOps, rollback prevents downtime. In MLOps, rollback prevents fines.

Getting started: your first MLOps pipeline on AWS

You do not need to learn a new orchestration tool or CI/CD platform. Start with what you know and extend your existing pipeline with ML-specific stages.

CodePipeline orchestrates the pipeline. Same service, same console, same execution flow.
CodeBuild runs each stage. Add a training buildspec that calls SageMaker Training Jobs.
S3 stores model artifacts. Same versioned bucket pattern you use for CloudFormation templates.
SageMaker Model Registry tracks model versions. Think of it as ECR for models instead of containers.
SageMaker Endpoints serve inference. Think of it as a managed ECS service for your model.
SageMaker Model Monitor watches for drift. Think of it as CloudWatch alarms for prediction quality.

Here is what a training stage buildspec looks like. If you have written a buildspec manifest for compiling code, this structure is familiar:

# pipeline/buildspec/train.yml
version: 0.2

phases:
  install:
    runtime-versions:
      python: 3.11
  pre_build:
    commands:
      - echo "Validating training config..."
      - python -m pytest model/tests/test_data_quality.py
  build:
    commands:
      - echo "Starting SageMaker Training Job..."
      - python model/train/train.py
        --config model/config/hyperparameters.json
        --output s3://${ARTIFACT_BUCKET}/models/${CODEBUILD_RESOLVED_SOURCE_VERSION}/
  post_build:
    commands:
      - echo "Registering model in Model Registry..."
      - aws sagemaker create-model-package
        --model-package-group-name ${MODEL_PACKAGE_GROUP}
        --inference-specification file://model/inference/spec.json
        --model-approval-status PendingManualApproval

artifacts:
  files:
    - model/config/hyperparameters.json
    - model/inference/spec.json

The AWS Prescriptive Guidance: DevOps Pipeline Accelerator provides a reference architecture for CI/CD pipelines. The same patterns (source, build, test, deploy, monitor) apply directly to MLOps.

Conclusion

In this post, we showed how DevOps pipeline fundamentals apply directly to MLOps. Code versioning, continuous integration, continuous deployment, testing, deployment strategies, monitoring, and rollback all transfer to the ML space.

The model is your new workload. Version it, test it, deploy it, monitor it, roll it back. The pipeline structure stays the same. What passes through it changes.

Start with your existing pipeline. Add model training as a build step, model quality tests as integration tests, Model Registry as your artifact store, and Model Monitor as your health check. You already know how to do this. The workload is different. The discipline is the same.

AIP-C01 last-minute revision: exam traps, memory hooks, and quick notes

Anwaar Hussain — Fri, 01 May 2026 15:22:37 +0000

In Part 1, I explained why the AWS Certified Generative AI Developer - Professional (AIP-C01) certification stands apart from other AWS certifications. This follow-up post is a concise, 30-60 minute pre-exam revision guide covering exam traps, memory hooks, and quick notes across all five domains.

Disclaimer: These notes are a quick revision companion only. They are not a substitute for thorough exam preparation. Always refer to official AWS documentation and the recommended courses listed at the end of this post for comprehensive preparation.

Domain 1: Foundation Model Integration, Data Management, and Compliance (31%)

Foundation Models (FMs): Large pre-trained transformer models available via Amazon Bedrock: AWS Nova, Claude (Anthropic), Llama (Meta), Amazon Titan (text, embeddings, image), Jurassic-2 (AI21 Labs), Stable Diffusion (Stability AI). Select FMs based on task, latency, cost, and token limits.

Fine-tuning vs RAG:

Fine-tuning adapts an FM to a specific use case with proprietary training data. Titan, Cohere, and Meta models support fine-tuning via Amazon Bedrock. Text models need labelled prompt-completion pairs; image models need Amazon Simple Storage Service (Amazon S3) paths linked to descriptions. Secure training data with Amazon Virtual Private Cloud (Amazon VPC) + AWS PrivateLink.
RAG provides dynamic, up-to-date knowledge through vector stores (Amazon OpenSearch Serverless, Amazon Aurora pgvector, Amazon MemoryDB, Amazon ElastiCache, MongoDB Atlas, Pinecone, Redis Enterprise Cloud).
🧠 Memory Hook: Fine-tune = "teach the model new tricks"; RAG = "give the model a cheat sheet"
⚠️ Exam Trap: Fine-tune for style/tone changes; RAG for dynamic, up-to-date knowledge

LoRA Adapters: Lightweight fine-tuning technique. Amazon SageMaker AI Model Registry stores adapter versions with rollback strategies.

Chunking Strategies: Fixed-size, Hierarchical (smaller child chunks for precision, larger parent chunks for context), Semantic (FM-based, breaks content by meaning not length). Chunk size affects retrieval precision vs context.

Hybrid Search: Combines keyword search + vector search. Amazon Bedrock reranker models re-score results for improved relevance.

Query Expansion and Decomposition: Amazon Bedrock query expansion broadens search; AWS Lambda query decomposition breaks complex queries into sub-queries; AWS Step Functions orchestrates multi-step retrieval.

Embedding Models: Amazon Titan Embeddings, Cohere Embed. Match embedding model to vector store dimensions.

Vector Store Optimization: Binary vectors (32x compression vs float32), FP16 (16-bit scalar quantization for HNSW). Amazon OpenSearch Service Hierarchical Indices route queries from small fast top-level index to detailed domain-specific indices.

Prompt Engineering: Prompt = Instructions + Context + Input data + Output indicator. Few-shot prompting (examples of desired outputs). Chain of Thought (CoT) forces step-by-step reasoning.

Prompt Caching: Reuse previously processed prompts to reduce cost and latency.

Amazon Bedrock Prompt Management: Create, evaluate, version, and share prompts across teams. Supports variables in reusable templates.

Data Governance: Data residency, encryption at rest (AWS Key Management Service (AWS KMS)), encryption in transit (Transport Layer Security (TLS) 1.2+).

Amazon Bedrock Data Automation (BDA): Extracts structured data from multimodal inputs (documents, images, videos, audio). Uses Blueprints to specify extraction fields. Output: JSON, CSV, markdown, HTML.

🧠 Memory Hook: BDA = "Swiss Army knife for document processing"

Amazon Transcribe: Speech-to-text with PII redaction, automatic language identification, custom vocabularies, and ML-powered toxicity detection.

Bedrock Cross-Region Inference: Provides resilient FM deployments across regions for fault tolerance.

Domain 2: Implementation and Integration (26%)

Bedrock Agents: Action Groups (Lambda functions) + Knowledge Bases + Prompt Templates + Session Management. Action Groups rely on OpenAPI (Swagger) schema uploaded to Amazon S3.

🧠 Memory Hook: Agent = "Brain (FM) + Hands (Action Groups) + Memory (Knowledge Bases)"

Model Context Protocol (MCP): Standardised interface (JSON-RPC 2.0 over HTTP or stdio) for agent-tool interactions. MCP servers via Lambda (stateless) or Amazon Elastic Container Service (Amazon ECS) (complex tools).

🧠 Memory Hook: MCP = "USB-C for AI agents, one plug fits all tools"

Agent Frameworks: Strands Agents, AWS Agent Squad, Amazon Bedrock AgentCore for autonomous systems with memory and state management.

Agent Memory: Short-term (chat history via Sessions and Events). Long-term (extracted insights, user preferences stored as Memory Records). AgentCore Memory provides scalable, serverless storage.

Multi-Agent Workflows: Orchestrator delegates subtasks to worker LLMs, Synthesizer combines results. Chain of Sequence (sequential) or Parallelisation (concurrent execution, voting).

🧠 Memory Hook: Multi-agent = "assembly line with a foreman (orchestrator) and workers"

Amazon Bedrock Flows: Multi-step workflow orchestration with visual builder or JSON. Chain models, prompts, and conditions.

Sync vs Async Inference: Sync for real-time (InvokeModel); async for batch/long-running (InvokeModelWithResponseStream). Amazon Simple Queue Service (Amazon SQS) for async patterns.

Step Functions: Complex multi-service workflows, human-in-the-loop, error handling, parallel processing.

⚠️ Exam Trap: Step Functions for complex orchestration; Bedrock Agents handle simple multi-step tasks automatically

API Patterns: REST (Amazon API Gateway), GraphQL (AWS AppSync with real-time subscriptions), WebSockets for streaming.

Resilience Patterns: Exponential Backoff for retries (AWS SDK built-in). Circuit Breaker pattern via Step Functions + Amazon DynamoDB. API Gateway rate limiting.

🧠 Memory Hook: Circuit Breaker = "fuse box that trips before the whole house burns down"

AWS Cloud Development Kit (AWS CDK) / AWS CloudFormation: IaC for deploying GenAI stacks across environments. One CDK app + Stage construct per environment. Explicit env (account + region) per stack. Separate AWS accounts per environment.

⚠️ Exam Trap: Omitting env triggers environment-agnostic synthesis, breaking context lookups
🧠 Memory Hook: "One blueprint, multiple construction sites"

Continuous Integration / Continuous Delivery or Deployment (CI/CD) + AWS CodeDeploy: Canary, blue/green, rolling deployments for Lambda and compute targets.

Configuration and Secrets Management:

AWS Systems Manager Parameter Store: Static config (endpoints, URLs, free at 4 KB)
AWS Secrets Manager: Credentials with automatic rotation
AWS AppConfig: Dynamic runtime config without redeployment (feature flags, guardrail thresholds)
⚠️ Exam Trap: "rotation" = Secrets Manager. "without redeploying" or "feature flags" = AWS AppConfig
🧠 Memory Hook: "Phone book, vault with auto-lock-change, remote control"

Human-in-the-Loop (HITL): AI drafts, human refines. Route uncertain cases based on confidence scores. Collect feedback via API Gateway, store in DynamoDB.

Amazon Q Family:

Amazon Q Developer: Code generation, security scans, IDE extensions
Amazon Q Business: Enterprise GenAI assistant with data connectors (Amazon S3, SharePoint, Slack, Salesforce)
Amazon Q Apps: No-code GenAI productivity apps using natural language

Amazon Q Developer Project Configuration:

Uses .amazonq/ directory at the project root
Key file: .amazonq/rules.md (or multiple .md files in .amazonq/rules/)
Rules provide project-specific context, coding standards, architecture patterns, and constraints to Amazon Q Developer
Rules are scoped to the project, not global. Keep them concise and actionable
🧠 Memory Hook: .amazonq/rules.md = "instruction manual you leave for your AI coding assistant"

Domain 3: AI Safety, Security, and Governance (20%)

Amazon Bedrock Guardrails: Content filters (hate, insults, sexual, violence), denied topics, word filters, PII detection/masking, contextual grounding check (prevents hallucinations by measuring response alignment with retrieved context).

🧠 Memory Hook: Guardrails = "bouncer at both doors" (input AND output filtering)

Defense-in-Depth for Content Safety: Amazon Comprehend pre-processing > Amazon Bedrock Guardrails > Lambda post-processing > API Gateway filtering. Includes threat detection for prompt injection, jailbreaks, and input sanitisation.

🧠 Memory Hook: Defense-in-depth = "multiple security checkpoints, not just one gate"

Hallucination Reduction: Amazon Bedrock Knowledge Bases for grounding, confidence scoring, JSON Schema for structured outputs.

Amazon VPC Endpoints + AWS PrivateLink: Keep Amazon Bedrock traffic private within your VPC. Essential for sensitive fine-tuning data.

AWS Identity and Access Management (IAM) + AWS IAM Identity Center: Centralised access management. IAM Access Analyzer validates policies for least privilege.

Service Control Policies (SCPs) + Resource Control Policies (RCPs): SCPs restrict what accounts can do; RCPs restrict resource access.

⚠️ Exam Trap: SCPs don't grant permissions, they only restrict

Additional Security Services:

Amazon Macie: Data security and DLP for Amazon S3
Amazon Cognito: User auth for web/mobile apps
AWS WAF: Web application firewall
AWS Encryption SDK: Client-side encryption

Responsible AI: Fairness, explainability, transparency, human oversight, privacy and security, safety, controllability, veracity and robustness, governance.

Amazon Comprehend: NLP for sentiment, entities, PII detection, custom classification and entity recognition.

🧠 Memory Hook: Comprehend = "reads and understands text like a human"

Governance and Compliance: SageMaker AI model cards for documentation. AWS Glue Data Catalog for data lineage. AWS CloudTrail audit logging. Continuous monitoring for misuse, drift, and bias.

Domain 4: Operational Efficiency and Optimization (12%)

Amazon CloudWatch GenAI Observability: Track latency, token usage (InputTokenCount, OutputTokenCount), errors, API invocation counts. Time to First Token (TTFT) for streaming latency. Amazon CloudWatch Synthetics for canary monitoring.

Bedrock CountTokens API: Free API to estimate prompt token count before invoking the model.

AWS X-Ray: End-to-end distributed tracing across API Gateway, Lambda, Amazon Bedrock, Knowledge Bases.

🧠 Memory Hook: X-Ray = "MRI for your application's request flow"

Provisioned Throughput vs On-Demand: Reserved capacity for consistent performance vs pay-per-use. Provisioning is associated with a specific model ARN.

Prompt Caching: Caches static prompt prefix (instructions, system prompt). Only dynamic content tokenised on subsequent calls.

Cost Optimisation: Right-size models, cache prompts, batch inference, monitor token usage. Context Pruning (limit RAG chunks, filter via metadata, summarise old chat history). AWS Cost Explorer and AWS Cost Anomaly Detection for tracking GenAI spend.

Dynamic Routing (Intelligent Prompt Routing): Built into Amazon Bedrock. Routes complex queries to larger models, simple queries to smaller/cheaper models.

🧠 Memory Hook: Dynamic Routing = "express lane for simple questions, full service for complex ones"

Non-deterministic Outputs: Temperature, top-p, top-k control randomness. Lower temperature = more deterministic.

🧠 Memory Hook: Temperature = "creativity dial". 0 = robot, 1 = poet

Amazon SageMaker Clarify: Detects bias by measuring imbalances across demographic groups. Bias metrics: Class Imbalance (CI), Difference in Proportions of Labels (DPL).

Amazon SageMaker Model Monitor: Alerts via CloudWatch on quality deviations and data drift.

Semantic Caching: Cache similar queries' results using result fingerprinting. Edge caching via Amazon CloudFront for reduced latency.

Domain 5: Testing, Validation, and Troubleshooting (11%)

Model Evaluation: Amazon Bedrock Model Evaluation for accuracy, robustness, toxicity. A/B testing, canary testing, cost-performance analysis.

LLM-as-a-Judge: Use an LLM to evaluate another LLM's outputs. Bedrock Evaluation Jobs measure RAG performance against benchmarks or LLM judges.

RAG Evaluation Metrics: Correctness, Completeness, Helpfulness, Logical Coherence, Faithfulness (how well responses align with retrieved text).

ROUGE Metric: Measures overlap of units (words, n-grams) between generated text and ground truth for summarisation or translation tasks.

Agent Debugging: Trace agent reasoning steps, validate action group responses, check knowledge base retrieval.

Bedrock Agent Tracing: Trace types: PreProcessing, Orchestration, PostProcessing, Guardrail traces. Shows which knowledge bases were hit, how action groups were invoked, and errors encountered.

Amazon SageMaker Ground Truth: Data labelling service for creating high-quality training datasets.

Troubleshooting Patterns: Inconsistent outputs, agent failures, retrieval misses, latency spikes.

Context Window Overflow: Dynamic chunking, prompt design optimisation, truncation error analysis.

Retrieval System Troubleshooting: Embedding quality diagnostics, drift monitoring, vectorisation resolution.

Amazon Augmented AI (Amazon A2I): Human review/correction loops for quality assurance. Vital due to non-deterministic nature of GenAI.

Exam decision boundaries

rotation = AWS Secrets Manager, not Parameter Store
without redeploying or feature flags = AWS AppConfig
consistent deployments across environments = one AWS CDK app with Stages
grounding or hallucination prevention = Amazon Bedrock Guardrails contextual grounding check or RAG with Knowledge Bases
standardised agent-tool interface = MCP
bias detection or explainability = Amazon SageMaker Clarify
data drift or model quality monitoring = Amazon SageMaker Model Monitor
human review loop = Amazon A2I
speech-to-text = Amazon Transcribe
text extraction from documents = Amazon Textract
conversational chatbot interface = Amazon Lex
contact centre AI = Amazon Connect + Amazon Lex
real-time subscriptions or GraphQL = AWS AppSync
event-driven = Amazon EventBridge
private Amazon Bedrock traffic = Amazon VPC Endpoints + AWS PrivateLink
sensitive data discovery in Amazon S3 = Amazon Macie

Key AWS services quick reference

Amazon Bedrock Ecosystem: Amazon Bedrock, Bedrock Agents, Amazon Bedrock AgentCore, Bedrock Knowledge Bases, Amazon Bedrock Guardrails, Amazon Bedrock Flows, Amazon Bedrock Prompt Management, Amazon Bedrock Data Automation (BDA), Bedrock Cross-Region Inference, Bedrock Model Evaluation

Agentic AI: Strands Agents, AWS Agent Squad, Model Context Protocol (MCP)

Data Processing and AI/ML: Amazon Textract, Amazon Transcribe, Amazon Comprehend, Amazon Rekognition, Amazon Lex, Amazon Titan, Amazon SageMaker AI, Amazon SageMaker Clarify, Amazon SageMaker Ground Truth, Amazon SageMaker JumpStart, Amazon SageMaker Model Monitor, SageMaker AI Model Registry, Amazon SageMaker Neo, Amazon A2I

Amazon Q Family: Amazon Q Developer, Amazon Q Business, Amazon Q Apps

Search and Vector: Amazon OpenSearch Service, Amazon Kendra, Amazon Neptune

Integration and Compute: AWS Lambda, Amazon Elastic Compute Cloud (Amazon EC2), AWS Step Functions, Amazon API Gateway, AWS AppSync, Amazon EventBridge, Amazon DynamoDB, Amazon SQS, Amazon Simple Notification Service (Amazon SNS), Amazon AppFlow

Infrastructure and Deployment: AWS CDK, AWS CloudFormation, AWS CodePipeline + AWS CodeBuild + AWS CodeDeploy, AWS AppConfig, AWS Systems Manager Parameter Store

Security, Identity, and Compliance: IAM + IAM Identity Center, AWS KMS, AWS Secrets Manager, Amazon Macie, Amazon Cognito, AWS WAF, Amazon VPC + AWS PrivateLink

Storage: Amazon S3, Amazon Elastic Block Store (Amazon EBS), Amazon Elastic File System (Amazon EFS)

Monitoring and Observability: Amazon CloudWatch, AWS X-Ray, AWS CloudTrail, AWS Cost Explorer, AWS Cost Anomaly Detection, Amazon Managed Grafana

Recommended preparation sources

I also highly recommend reading the relevant AWS service FAQ pages. They provide deeper understanding of service capabilities, limitations, and best practices that frequently appear in exam questions.

All the best on your AIP-C01 journey, and happy GenAI building! 🚀

Why AWS Certified GenAI Developer stands apart from other AWS certs

Anwaar Hussain — Wed, 15 Apr 2026 13:11:54 +0000

I recently passed the AWS Certified Generative AI Developer - Professional (AIP-C01) exam, bringing my total to 13 AWS certifications. In 2024, I earned my AWS Golden Jacket—a recognition reserved for those who achieve all 12 active AWS certifications. (AWS Machine Learning Specialty certification retired on March 31, 2026.) With this breadth of AWS certification experience, I can confidently say that AIP-C01 stands apart from every other AWS credential I've earned.

This isn't just another cloud certification with a new badge. While my journey through Solution Architect, DevOps Engineer, Security Specialty, and other AWS certifications taught me to architect, secure, and operate cloud infrastructure, the GenAI Developer certification demanded something fundamentally different. It required me to synthesize knowledge across traditional artificial intelligence and machine learning (AI/ML), large language models (LLMs), serverless architecture, and application development—validating skills that didn't exist as a cohesive discipline until recently.

AWS designed this certification to help address a critical gap: organizations need GenAI Developers and Architects who can design robust systems, implement secure solutions, integrate AI capabilities into existing applications, and operate these systems reliably at scale. The challenge is that this role requires expertise spanning multiple domains—a combination rarely validated by a single credential until now.

A different kind of preparation

Back in December 2025, when I started preparing for this certification, my approach was quite similar to before. I followed well-known courses, studied AWS documentation and service FAQs, set up quick configurations in the console, and worked through practice exams. By the time I completed all of that, unlike in the past, I had one clear thought: "You are not ready for this!"

Throughout my initial preparation, I kept recalling a narrative from 15 years ago during my Bachelor's degree in Telecommunications Engineering. We were told that jobs in the telecom sector were saturated post-boom from the 1990s and early 2000s. The rapid advancement in radio frequency (RF) and antenna technologies and the advent of new mobile network standards like 2G and 3G meant that all the jobs were taken by Electrical and Electronics Engineers, Network Engineers, and similar roles, which left field specialists with limited opportunities. I don't know how true that was as I clearly didn't pursue that industry for long.

This memory resurfaced because I saw a similar pattern emerging in the GenAI space. I found myself wondering if AI/ML Consultants, Data Scientists, DevOps Engineers, and Application Architects would simply take over the GenAI space, leaving no room for dedicated GenAI Developers and Architects. There's nothing wrong with professionals from these backgrounds switching to the GenAI domain—as long as the right skills and knowledge are acquired. The challenge comes when you rely solely on your major specialization and treat GenAI as a minor add-on rather than developing the comprehensive skill set this discipline demands.

Coming from a DevOps and Cloud Infrastructure Architect background, I recognized significant knowledge gaps. To fill those, I enrolled in AWS internal Area of Depth (AoD) programs—specifically Serverless Application, ML, and MLOps—to enhance my skills. These programs helped me understand AWS services like AWS Step Functions, AWS X-Ray, and AWS AppSync (particularly GraphQL APIs), along with REST APIs, WebSockets, and asynchronous and synchronous architectures on the application side. On the ML side, I gained understanding of the ML lifecycle on AWS, fine-tuning models, optimizing their parameters, and importing them to Bedrock to fill vital gaps in my knowledge.

What makes AIP-C01 different

To understand why this certification matters, it helps to look at how we got here. About three years ago, when ChatGPT/OpenAI took the world by storm with the GenAI and LLM revolution, we saw AWS flagbearer GenAI service Amazon Bedrock being used primarily for setting up chatbots, statbots, and AI assistants with Retrieval Augmented Generation (RAG) enabled and basic agentic setups. Those were small-scale and mostly proof-of-concept (PoC)-grade solutions. Before Agentic AI became mainstream, the focus was narrow—build an auxiliary AI tool, add some retrieval capabilities, and call it done.

As organizations moved beyond experimentation to production deployment, the industry recognized a critical skills gap. To address that, AWS formulated this certification to prepare developers and architects who can deliver GenAI solutions at production grade. The focus is not entirely on AI/ML or LLMs (a common misconception about GenAI), but on fitting GenAI into business-critical applications and architectures as a key tool in futuristic tech stacks. The certification covers Bedrock heavily, but not just as a service for running chatbots. It validates your ability to run agents with AWS-managed orchestration or agent frameworks: Strands, LangChain, etc managing agents running on Amazon Bedrock AgentCore. It's about building systems that integrate GenAI capabilities into enterprise applications that need to scale, perform reliably, and deliver measurable business value.

Most other AWS certifications test your knowledge of cloud services and best practices within defined domains. The GenAI Developer certification assumes you already understand these fundamentals and pushes you into territory that requires running GenAI workloads alongside business-critical applications in production environments.

The exam covers five domains that reflect real-world operational complexity:

Domain 1: Foundation Model Integration, Data Management, and Compliance tests your ability to select appropriate models, implement RAG architectures, and handle data governance.

Domain 2: Implementation and Integration validates you can build agentic AI systems and integrate GenAI capabilities into existing applications using serverless orchestration.

Domain 3: AI Safety, Security, and Governance helps you implement guardrails and responsible AI practices.

Domain 4: Operational Efficiency and Optimization focuses on monitoring GenAI applications and optimizing costs for production workloads.

Domain 5: Testing, Validation, and Troubleshooting covers debugging agent behaviors and resolving production issues.

Building production-grade GenAI applications

The certification validates more than just your ability to call foundation model APIs—it tests your understanding of how to architect complete GenAI solutions using serverless technologies and deploy them across multiple environments using AWS Cloud Development Kit (AWS CDK) and AWS CloudFormation through continuous integration and continuous delivery (CI/CD) pipelines.

Real-world implementations comprise of synchronous and asynchronous inference patterns, event-driven architectures using Amazon EventBridge, workflow orchestration with Step Functions, data processing with AWS Lambda, state management with Amazon DynamoDB, and security with AWS Identity and Access Management (AWS IAM). They require abilities to design serverless architectures that scale automatically, handle failures gracefully, and optimize costs.

Production-grade solutions leverage AWS AI/ML services to complement Amazon Bedrock. Amazon Comprehend provides natural language processing capabilities. Amazon Rekognition captures frames from videos for visual analysis. Amazon Bedrock Data Automation handles complex document processing, while Amazon Textract extracts text and data from documents.

Vector stores for semantic and hybrid search rely on Amazon OpenSearch Service and Amazon Simple Storage Service (Amazon S3). Prompt caching helps reduce costs by reusing previously processed prompts. Amazon Bedrock Prompt Management simplifies the creation, evaluation, versioning, and sharing of prompts to help you get the best responses from foundation models. Flow orchestration with Amazon Bedrock Flows enables you to design and execute complex multi-step workflows. Additionally, Amazon Bedrock Guardrails provides content filtering and safety controls to help you implement responsible AI practices.

Security and governance are critical. Keeping Amazon Bedrock traffic private requires Amazon Virtual Private Cloud (Amazon VPC) endpoints, while Service Control Policies (SCPs), Resource Control Policies (RCPs), and AWS IAM Identity Center manage access by identities and model resources centrally. Amazon CloudWatch GenAI Observability provides comprehensive monitoring for AI workloads, tracking latency, token usage, errors, and API invocation counts.

Beyond the core services, Lambda functions complement LLM flows through Amazon Bedrock Flows and Step Functions orchestration. Lambda enables custom processing logic within your GenAI workflows, handling tasks like data transformation, API integrations, and business logic execution. The certification tests your knowledge of various deployment strategies for compute resources using AWS CodeDeploy, including canary deployments, blue/green deployments, and rolling updates across Lambda functions and other compute targets. A critical aspect is understanding dynamic configuration loading through AWS AppConfig, which allows you to modify application behavior without redeployment—essential for managing feature flags, model parameters, and operational settings in production GenAI applications.

The certification assesses your ability to troubleshoot issues unique to GenAI applications—inconsistent model outputs, agent failures, non-deterministic behaviors, and the operational complexity of systems that make autonomous decisions. These skills help distinguish professionals who can deploy GenAI applications that deliver business value from those who primarily build PoC solutions.

Conclusion

AIP-C01 certification represents a new category of cloud certification—one that validates your ability to work across multiple disciplines and build production-ready GenAI applications. It's not just another AWS certification with a different badge. It's AWS's answer to the GenAI skills gap, designed to prepare professionals for roles that didn't exist a few years ago but are now critical to many organizations' AI strategies.

The market recognizes this value. According to Glassdoor data from April 2026, GenAI roles command strong compensation in both the US and UK markets. In the United States, GenAI Developers earn an average of US$81K/yr (range: US$63K-US$104K), GenAI Engineers earn US$100K/yr (range: US$76K-US$130K), and GenAI Architects earn US$140K/yr (range: US$105K-US$188K). In the United Kingdom, GenAI Engineers earn an average of £38K/yr (range: £29K-£48K). The salary progression clearly reflects the increasing complexity and business impact of these roles.

If you're considering this certification, prepare for an exam that challenges you to think like an architect, developer, and operator simultaneously. It tests your ability to synthesize knowledge across traditional AI/ML, LLMs, serverless architecture, and application development. When you pass, you'll have validated skills that are currently in high demand and valuable for building the next generation of AI-powered applications.

Ready to start your AIP-C01 journey? Begin by reviewing the official exam guide.