This post uses AWS services, frameworks, and tools to explain Machine Learning Operations (MLOps) concepts. The principles apply to any cloud platform, orchestration tool, or ML service. Swap in your preferred solutions; the pipeline discipline remains the same. A foundational understanding of ML models is a prerequisite for MLOps, but you can build it in parallel while applying your existing Continuous Integration/Continuous Deployment (CI/CD) skills.
Introduction
I recently completed an internal AWS program focused on MLOps, and the biggest takeaway was this: if you already know DevOps, you already know most of MLOps.
DevOps engineers building CI/CD pipelines for Infrastructure as Code (IaC), microservices, and serverless applications already have 80% of the skills needed for MLOps. The fundamentals of code versioning, continuous integration, continuous deployment, testing, deployment strategies, monitoring, and rollback all apply directly.
The difference? Your workload changed. Instead of deploying application code or infrastructure templates, you are deploying a trained model. The pipeline stages stay the same. The artifacts passing through them are different.
In this post, you will learn how DevOps pipeline concepts map to MLOps, what new considerations come with ML workloads, and how to structure your first ML pipeline using the tools you already know.
The mental model: your workload changed, not your pipeline
In DevOps, your workload is application code, a container image, or a CloudFormation template. You version it, test it, deploy it, monitor it, and roll it back when something breaks.
In MLOps, your workload is the model. A model is the output of training code + training data + hyperparameters. Training produces an artifact (a serialised file) that you deploy to an endpoint for inference.
Everything else stays the same:
- You version the model artifact the same way you version a container image.
- You test the model the same way you run integration tests on a microservice.
- You deploy the model the same way you deploy a Lambda function through stages.
- You monitor the model the same way you monitor API latency and error rates.
- You roll back the model the same way you roll back an API Gateway deployment.
The pipeline is familiar. The workload inside it is new.
Repository structure: organising your ML workload
In DevOps, you separate src/ from infra/ from pipeline/. The same principle applies in MLOps. You add a model/ directory. This is your new workload.
A consistent structure lets your CI/CD pipeline know exactly where to find training scripts, inference code, tests, and dependencies. No guessing, no hardcoded paths. Here is a generic ML repository layout:
ml-project/
├── model/
│   ├── train/
│   │   ├── Dockerfile                # Training container definition
│   │   ├── train.py                  # Training entry point
│   │   ├── preprocessing.py          # Feature engineering
│   │   └── requirements.txt          # Training dependencies
│   ├── inference/
│   │   ├── Dockerfile                # Inference container definition
│   │   ├── serve.py                  # Inference entry point
│   │   ├── predictor.py              # Prediction logic
│   │   └── requirements.txt          # Inference dependencies (lighter)
│   ├── tests/
│   │   ├── test_model_quality.py     # Accuracy, precision, recall
│   │   ├── test_bias.py              # Fairness metrics
│   │   └── test_data_quality.py      # Input validation
│   └── config/
│       ├── hyperparameters.json      # Training hyperparameters
│       └── baseline.json             # Model Monitor baseline
├── infra/
│   ├── lib/                          # AWS Cloud Development Kit (CDK) or CloudFormation stacks
│   └── config/                       # Environment-specific config
├── pipeline/
│   └── buildspec/                    # One buildspec per CI/CD stage
├── monitoring/
│   ├── baselines/                    # Drift detection baselines
│   └── alarms/                       # CloudWatch alarm definitions
├── docs/
│   └── architecture.png
├── README.md
└── .gitignore
Here is why this structure works:
- model/train/ and model/inference/ are separated. Different dependencies, different containers, different lifecycle. Training runs once or on a schedule. Inference runs continuously. Keeping them separate means your inference container stays lightweight.
- model/tests/ lives next to model code. Your CI pipeline runs model quality tests the same way it runs unit tests for application code.
- model/config/ is versioned alongside the model. When you retrain, hyperparameters and baselines change together. Git tracks both.
- pipeline/buildspec/ has one spec per stage. Same pattern as your existing AWS CodeBuild projects.
Amazon SageMaker expects /opt/ml/model/ for artifacts and /opt/ml/code/ for scripts in custom containers. Each Dockerfile lives inside its respective directory (model/train/ and model/inference/). Since the inference code maps directly to /opt/ml/code/, the COPY instruction is a one-liner. No path gymnastics.
Your model/ directory is to MLOps what src/ is to application development. It has source code, tests, dependencies, and config. Treat it the same way.
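One way to keep the layout honest is a small check that runs as the first CI step. The sketch below assumes the directory names from the example tree; `REQUIRED_PATHS` and `missing_paths` are illustrative names, not part of any AWS tooling.

```python
from pathlib import Path

# Paths taken from the example layout; adjust them to your own repository.
REQUIRED_PATHS = [
    "model/train/train.py",
    "model/train/Dockerfile",
    "model/inference/serve.py",
    "model/inference/Dockerfile",
    "model/tests",
    "model/config/hyperparameters.json",
    "pipeline/buildspec",
]


def missing_paths(repo_root, required=REQUIRED_PATHS):
    """Return the required files or directories that do not exist under repo_root."""
    root = Path(repo_root)
    return [path for path in required if not (root / path).exists()]
```

A CI step could call `missing_paths(".")` and fail the build when the returned list is not empty.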
What stays the same
The core DevOps pipeline stages transfer directly to MLOps. Here is how each one maps.
Code versioning
You already version application code in Git. In MLOps, you version the same way but add:
- Training code (your model/train/ directory)
- Hyperparameters (JSON config files)
- Data versions (using tools like Data Version Control (DVC) or SageMaker Experiments)
- Model artifacts (tracked in SageMaker Model Registry)
The principle is identical. If you cannot reproduce it, you cannot trust it.
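A minimal sketch of that idea: hash the three inputs that define a training run into one fingerprint and attach it to the artifact. The function name and the field names are assumptions for illustration. The fingerprint identifies the inputs, not the resulting weights, since training itself may not be bit-for-bit deterministic.

```python
import hashlib
import json


def model_fingerprint(code_commit, data_version, hyperparameters):
    """Hash the code commit, data version, and hyperparameters of a training run.

    The same three inputs always give the same fingerprint, so it can tag
    the artifact and its registry entry.
    """
    payload = json.dumps(
        {
            "code_commit": code_commit,
            "data_version": data_version,
            "hyperparameters": hyperparameters,
        },
        sort_keys=True,  # key order must not change the hash
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]
```

If two artifacts carry different fingerprints, at least one of code, data, or hyperparameters changed between the runs.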
Continuous integration
Your existing CI runs linting, unit tests, and contract tests on every pull request. In MLOps, you add:
- Schema validation (linting your API spec with tools like Spectral)
- Model quality tests (accuracy, precision, recall against a baseline)
- Data quality checks (input validation, missing values, type mismatches)
The pipeline still fails fast on the first broken test. The tests are different, not the pattern.
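As a rough illustration of a data quality check, the sketch below flags missing values and type mismatches in a batch of rows. The schema format (column name mapped to a Python type) and the function name are assumptions made for this example; a real pipeline might use a dedicated validation library instead.

```python
def validate_rows(rows, schema):
    """Return a list of readable problems; an empty list means the data passed.

    schema maps each column name to the Python type its values should have.
    """
    problems = []
    for i, row in enumerate(rows):
        for column, expected_type in schema.items():
            value = row.get(column)
            if value is None:
                problems.append(f"row {i}: missing value for '{column}'")
            elif not isinstance(value, expected_type):
                problems.append(
                    f"row {i}: '{column}' is {type(value).__name__}, "
                    f"expected {expected_type.__name__}"
                )
    return problems
```

A test such as model/tests/test_data_quality.py could assert that `validate_rows(sample, schema)` returns an empty list, so the pipeline stops before any training compute is spent.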
Continuous deployment and delivery
You already deploy through stages: dev, staging, production. In MLOps, the same pattern applies:
- Deploy model to staging endpoint
- Run integration tests against staging
- Approval gate (manual or automated)
- Deploy to production
AWS CodePipeline orchestrates this the same way it orchestrates your IaC deployments. The target changes from an AWS CloudFormation stack to a SageMaker endpoint.
Testing
Your testing pyramid still applies:
- Unit tests: Does the training script run without errors?
- Integration tests: Does the deployed endpoint return valid responses?
- Contract tests: Does the model output match the expected schema?
- Performance tests: Does inference latency meet Service Level Agreement (SLA) requirements?
You add model-specific tests: accuracy thresholds, bias checks, and drift baselines. The testing philosophy (fail fast, test early, automate everything) stays the same.
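The performance test can be as simple as comparing a latency percentile against the SLA. This is a sketch under assumptions: it uses the nearest-rank percentile method, and the 95th percentile default and function names are choices made for the example, not values from any SLA in this post.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of values at or below it."""
    if not samples:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def meets_sla(latencies_ms, sla_ms, p=95):
    """True when the p-th percentile latency is within the SLA."""
    return percentile(latencies_ms, p) <= sla_ms
```

In an integration stage, you would collect latencies from calls to the staging endpoint and fail the stage when `meets_sla` returns False.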
Deployment strategies
Blue/green and canary deployments work the same way:
- Blue/green. Deploy new model version to a separate endpoint. Switch traffic atomically. Roll back instantly if metrics degrade.
- Canary. Route 10% of traffic to the new model. Monitor prediction quality. Gradually increase to 100%.
- Shadow. Send production traffic to both old and new models. Compare outputs without affecting users. This is unique to ML but follows the same traffic-splitting principle.
Other strategies like Linear (gradually shifting traffic in equal increments over time) also apply. The choice depends on your risk tolerance and rollback speed requirements.
SageMaker production variants handle traffic splitting between model versions natively. Same concept as weighted target groups, different workload.
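To make the canary pattern concrete, here is a sketch that generates the weight payloads for a staged rollout. Each payload has the shape of the `DesiredWeightsAndCapacities` argument accepted by the SageMaker `UpdateEndpointWeightsAndCapacities` API. The variant names and the 10/50/100 step sizes are assumptions for illustration.

```python
def canary_steps(old_variant, new_variant, steps=(10, 50, 100)):
    """Yield one weights payload per rollout step.

    steps lists the percentage of traffic the new variant should receive
    at each stage of the rollout.
    """
    for new_percent in steps:
        if not 0 <= new_percent <= 100:
            raise ValueError(f"invalid traffic percentage: {new_percent}")
        yield [
            {"VariantName": old_variant, "DesiredWeight": float(100 - new_percent)},
            {"VariantName": new_variant, "DesiredWeight": float(new_percent)},
        ]


# Applying one step with boto3 would look roughly like this (not run here):
#   client = boto3.client("sagemaker")
#   client.update_endpoint_weights_and_capacities(
#       EndpointName="my-endpoint",            # hypothetical endpoint name
#       DesiredWeightsAndCapacities=payload,
#   )
```

Between steps, the pipeline would wait and check prediction quality metrics before moving on, and apply the first payload in reverse to back out.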
Monitoring and feedback
Amazon CloudWatch metrics, alarms, and dashboards work the same way. You monitor:
- Invocation count, latency, error rates (same as any API)
- Model-specific metrics: prediction distribution, confidence scores, feature drift
AWS X-Ray traces requests end-to-end the same way it traces your microservices. The difference is you also trace which model version served each prediction.
Rollback
Amazon API Gateway deployment history and SageMaker endpoint rollback work the same way as rolling back an AWS Lambda function or Amazon Elastic Container Service (Amazon ECS) service. You point traffic back to the previous version.
The difference in MLOps: rollback is not just operational, it is regulatory. More on this in the rollback section below.
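Pointing traffic back first requires choosing which version to point at. The sketch below picks the newest approved version older than the one currently serving. The list-of-dicts input is an assumption for this example; the status strings mirror the approval statuses used by SageMaker Model Registry.

```python
def rollback_target(versions, current_version):
    """Return the newest approved version older than current_version, or None."""
    candidates = [
        v["version"]
        for v in versions
        if v["status"] == "Approved" and v["version"] < current_version
    ]
    return max(candidates) if candidates else None
```

Returning None when no earlier approved version exists forces the caller to handle that case explicitly rather than rolling back to an unvetted model.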
What is new for you
These are the ML-specific concepts that do not have a direct DevOps equivalent. They extend your pipeline rather than replace it.
Model training
Think of training as your "build" step, but for data. Instead of compiling code into a binary, you feed data through an algorithm to produce a model artifact.
SageMaker Training Jobs handle this on managed compute. You specify the training script, input data location (Amazon Simple Storage Service (Amazon S3)), instance type, and hyperparameters. SageMaker provisions the infrastructure, runs training, and stores the output artifact in S3.
The key difference from a code build: training can take minutes to days depending on data size and model complexity. This is why caching matters more in MLOps.
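The inputs listed above map onto the request you send to the SageMaker `CreateTrainingJob` API. The sketch below builds that request as a plain dictionary. The default instance type, volume size, and one-hour runtime limit are placeholder assumptions, and the helper name is hypothetical.

```python
def training_job_request(job_name, image_uri, role_arn, train_s3_uri,
                         output_s3_uri, hyperparameters,
                         instance_type="ml.m5.xlarge", max_runtime_seconds=3600):
    """Build the keyword arguments for a SageMaker CreateTrainingJob call."""
    return {
        "TrainingJobName": job_name,
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,   # training container image in ECR
            "TrainingInputMode": "File",
        },
        "RoleArn": role_arn,
        "InputDataConfig": [
            {
                "ChannelName": "train",
                "DataSource": {
                    "S3DataSource": {
                        "S3DataType": "S3Prefix",
                        "S3Uri": train_s3_uri,
                        "S3DataDistributionType": "FullyReplicated",
                    }
                },
            }
        ],
        "OutputDataConfig": {"S3OutputPath": output_s3_uri},
        "ResourceConfig": {
            "InstanceType": instance_type,
            "InstanceCount": 1,
            "VolumeSizeInGB": 30,
        },
        # Cap the runtime so a stuck job cannot burn compute indefinitely.
        "StoppingCondition": {"MaxRuntimeInSeconds": max_runtime_seconds},
        # The API expects hyperparameter values as strings.
        "HyperParameters": {k: str(v) for k, v in hyperparameters.items()},
    }
```

With boto3, the call would be `boto3.client("sagemaker").create_training_job(**request)`. Building the request in a pure function keeps it easy to unit test without touching AWS.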
Model testing
In application development, "does it run" is a valid first test. In ML, a model can run perfectly and still produce wrong results.
Model testing validates performance:
- Accuracy: Does the model predict correctly above a threshold?
- Precision and recall: Does it balance false positives and false negatives?
- Bias: Does it treat different groups fairly?
- Robustness: Does it handle edge cases without failing silently?
You run these tests in CI the same way you run integration tests. If accuracy drops below baseline, the pipeline fails.
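A quality gate of this kind can be sketched in a few lines. The example assumes a binary classifier with labels 0 and 1, and a baseline expressed as minimum acceptable values per metric; both function names are illustrative.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, and recall for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    correct = sum(1 for t, p in pairs if t == p)
    return {
        "accuracy": correct / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


def failed_metrics(metrics, baseline):
    """Return the names of metrics that fall below their baseline floor."""
    return [name for name, floor in baseline.items() if metrics[name] < floor]
```

A test in model/tests/test_model_quality.py could assert that `failed_metrics(...)` is empty, which fails the pipeline the moment any metric drops below baseline.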
Fine-tuning
Fine-tuning is iterative improvement of an existing model using new or domain-specific data. Think of it as patching, but with data instead of code.
You take a pre-trained model, feed it additional data, and produce an updated artifact. The pipeline stages (test, validate, deploy) remain the same. The input changes from code to data.
Model monitoring (drift detection)
This is the biggest difference from traditional DevOps. Application code does not degrade over time. Models do.
Model drift happens when the real-world data distribution changes from what the model was trained on. The model still runs, still returns responses, but the quality of those responses degrades silently.
SageMaker Model Monitor continuously evaluates live inference data against a training baseline. It detects:
- Data quality drift: Input features change shape or distribution.
- Model quality drift: Accuracy, precision, or recall drops below threshold.
- Bias drift: Fairness metrics shift post-deployment.
When drift is detected, Model Monitor records the violations and publishes metrics to CloudWatch. From there, a CloudWatch alarm or an Amazon EventBridge rule can notify the team or initiate automated rollback.
In DevOps terms: Model Monitor is your health check, but for prediction quality rather than uptime.
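To build intuition for what a data drift check measures, here is a sketch of the Population Stability Index (PSI) for a single numeric feature. This is a stand-in technique chosen for illustration, not a description of how Model Monitor computes drift internally. The bin count and the epsilon floor are assumptions.

```python
import math


def population_stability_index(baseline, live, bins=10):
    """Compare the live distribution of one feature against its training baseline.

    Returns 0.0 for identical distributions; larger values mean more drift.
    """
    lo, hi = min(baseline), max(baseline)
    width = (hi - lo) / bins or 1.0  # avoid zero width for constant features

    def proportions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), bins - 1)  # clamp values outside the baseline range
            counts[idx] += 1
        # Floor at a small epsilon so log() is defined for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    expected = proportions(baseline)
    actual = proportions(live)
    return sum((a - e) * math.log(a / e) for a, e in zip(actual, expected))
```

A common rule of thumb treats a PSI above roughly 0.2 as a shift worth investigating, but the right threshold depends on your feature and your tolerance for false alarms.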
DevOps vs MLOps pipeline: the parallel
The following diagram shows how every DevOps pipeline stage has a direct MLOps equivalent. The workload passing through the pipeline changed. The pipeline structure did not.
The left side is your world today. The right side is MLOps.
Caching: why it matters more in MLOps
In DevOps, a failed build takes seconds to minutes to re-run. In MLOps, a failed training job can waste hours or days of compute. Caching between pipeline stages becomes critical for cost and speed.
- Model artifacts in S3. Once training completes, store the artifact in a versioned S3 bucket. If deployment fails, you do not retrain. You redeploy the cached artifact.
- Feature Store. Engineered features are expensive to compute. Amazon SageMaker Feature Store caches them for reuse across training and inference. This avoids recomputing the same transformations repeatedly.
- Version resolution cache. At inference time, resolving which model version to invoke on every request adds latency. A caching layer (such as Amazon DynamoDB with DynamoDB Accelerator (DAX)) resolves version mappings in microseconds rather than milliseconds.
- Container images. Cache your training and inference container images in Amazon Elastic Container Registry (Amazon ECR). Rebuilding containers for every pipeline run wastes time when only the model artifact changed.
In DevOps, you cache dependencies (node_modules, pip packages). In MLOps, you cache everything above plus the model itself. The cost of recomputation is orders of magnitude higher.
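The version resolution cache is the least familiar item in that list, so here is a minimal in-process sketch of the idea. It assumes a `resolve` callable that performs the slow lookup (for example, a DynamoDB read); the class name, the alias concept, and the 60-second TTL are illustrative choices.

```python
import time


class VersionCache:
    """Cache alias-to-model-version lookups for a fixed time-to-live."""

    def __init__(self, resolve, ttl_seconds=60.0, clock=time.monotonic):
        self._resolve = resolve      # slow lookup, called only on a miss
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so tests can control time
        self._entries = {}           # alias -> (version, expires_at)

    def get(self, alias):
        now = self._clock()
        entry = self._entries.get(alias)
        if entry is not None and entry[1] > now:
            return entry[0]          # fresh hit: skip the slow lookup
        version = self._resolve(alias)
        self._entries[alias] = (version, now + self._ttl)
        return version
```

The trade-off is staleness: after a rollback, requests may keep resolving to the old version until the TTL expires, so keep the TTL shorter than your rollback target.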
Rollback: why it is non-negotiable in AI/ML
In traditional DevOps, rollback is an operational best practice. In MLOps, it is a regulatory requirement. Regulators are paying attention to AI failures and the penalties are significant.
- AI incidents hit a record 362 in 2025, up from 233 in 2024 (Stanford HAI AI Index 2026).
- The EU AI Act imposes fines up to EUR 35M or 7% of global revenue for non-compliant AI systems (Lawfare Analysis).
- The Consumer Financial Protection Bureau (CFPB) fined Goldman Sachs $65M for algorithmic failures in Apple Card (CFPB Enforcement Action).
- iTutorGroup paid $365K to settle an Equal Employment Opportunity Commission (EEOC) lawsuit over age-based algorithmic discrimination (EEOC Press Release).
- Gartner predicts 40%+ of agentic AI projects will be cancelled by 2027 due to inadequate risk controls (Gartner Press Release).
Your rollback strategy needs to answer three questions:
- How fast can you roll back? Target sub-5-minute recovery. API Gateway deployment history and SageMaker endpoint variants support instant traffic switching.
- Can you prove which model served which prediction? Regulators require traceability. Log model version metadata with every inference request using structured CloudWatch Logs.
- Is your audit trail immutable? Use AWS CloudTrail with log file integrity validation, and store the logs in an S3 bucket with Object Lock so that entries cannot be altered or deleted after the fact.
In DevOps, rollback prevents downtime. In MLOps, rollback prevents fines.
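For the traceability question, the sketch below shows one way to emit a structured log line per inference. The field names are assumptions, not a schema required by CloudWatch. In AWS Lambda, lines printed to stdout are delivered to CloudWatch Logs, where JSON fields can then be queried.

```python
import json
import time
import uuid


def inference_log_record(model_name, model_version, latency_ms, request_id=None):
    """Build one JSON log line tying a prediction to the model version that served it."""
    record = {
        "timestamp": time.time(),
        "request_id": request_id or str(uuid.uuid4()),
        "model_name": model_name,
        "model_version": str(model_version),
        "latency_ms": latency_ms,
    }
    return json.dumps(record, sort_keys=True)


# Typical use inside an inference handler:
#   print(inference_log_record("churn-model", 7, latency_ms=12.5))
```

Logging the version as part of every record means an auditor can answer "which model made this decision" from the logs alone, without reconstructing deployment history.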
Getting started: your first MLOps pipeline on AWS
You do not need to learn a new orchestration tool or CI/CD platform. Start with what you know and extend your existing pipeline with ML-specific stages.
- CodePipeline orchestrates the pipeline. Same service, same console, same execution flow.
- CodeBuild runs each stage. Add a training buildspec that calls SageMaker Training Jobs.
- S3 stores model artifacts. Same versioned bucket pattern you use for CloudFormation templates.
- SageMaker Model Registry tracks model versions. Think of it as ECR for models instead of containers.
- SageMaker Endpoints serve inference. Think of it as a managed ECS service for your model.
- SageMaker Model Monitor watches for drift. Think of it as CloudWatch alarms for prediction quality.
Here is what a training stage buildspec looks like. If you have written a buildspec manifest for compiling code, this structure is familiar:
# pipeline/buildspec/train.yml
version: 0.2
phases:
  install:
    runtime-versions:
      python: 3.11
  pre_build:
    commands:
      - echo "Validating training config..."
      - python -m pytest model/tests/test_data_quality.py
  build:
    commands:
      - echo "Starting SageMaker Training Job..."
      - python model/train/train.py
          --config model/config/hyperparameters.json
          --output s3://${ARTIFACT_BUCKET}/models/${CODEBUILD_RESOLVED_SOURCE_VERSION}/
  post_build:
    commands:
      - echo "Registering model in Model Registry..."
      - aws sagemaker create-model-package
          --model-package-group-name ${MODEL_PACKAGE_GROUP}
          --inference-specification file://model/inference/spec.json
          --model-approval-status PendingManualApproval
artifacts:
  files:
    - model/config/hyperparameters.json
    - model/inference/spec.json
The AWS Prescriptive Guidance: DevOps Pipeline Accelerator provides a reference architecture for CI/CD pipelines. The same patterns (source, build, test, deploy, monitor) apply directly to MLOps.
Conclusion
In this post, we showed how DevOps pipeline fundamentals apply directly to MLOps. Code versioning, continuous integration, continuous deployment, testing, deployment strategies, monitoring, and rollback all transfer to the ML space.
The model is your new workload. Version it, test it, deploy it, monitor it, roll it back. The pipeline structure stays the same. What passes through it changes.
Start with your existing pipeline. Add model training as a build step, model quality tests as integration tests, Model Registry as your artifact store, and Model Monitor as your health check. You already know how to do this. The workload is different. The discipline is the same.
Further reading
- MLOps Foundation Roadmap for Enterprises with Amazon SageMaker
- AWS Well-Architected Machine Learning Lens
- Amazon SageMaker Model Monitor
- Multi-Account Model Deployment with Amazon SageMaker Pipelines
Time to ship. Your first model is waiting. 🚀
