A lot of people say they built an AI system, but what they really mean is they trained a model once and ran it on a laptop.
Production AI is a different game.
You need a pipeline that can ingest data, clean it, train models, version them, deploy them safely, expose them through an API, and keep watching latency, errors, drift, and cost after release.
That is what real AI infrastructure looks like.
In this post, I will break down a full AI deployment stack on AWS in simple words without removing the technical depth.
The big idea
A production AI platform usually has four layers:
- Data layer for collecting and storing data
- Training layer for model training and evaluation
- Serving layer for live inference
- Ops layer for security, CI/CD, logging, and monitoring
If you get these four parts right, the system becomes repeatable instead of fragile.
Architecture diagram
End-to-end request and training flow
At a high level, the pipeline works like this:
- data arrives from apps, logs, APIs, files, and user feedback
- raw data lands in Amazon S3
- AWS Glue or Lambda cleans and transforms the data
- processed data is saved back to S3 in train, validation, and test form
- Amazon SageMaker runs training and evaluation jobs
- the best model is stored in a model registry
- CI/CD deploys the chosen model to an inference service
- users hit the model through an API
- CloudWatch tracks performance, health, and alerts
That is the clean version.
Now let us go layer by layer.
1. Data ingestion layer
The first rule is simple: keep raw data raw.
Do not overwrite the original input files. Land them in S3 first and treat that bucket like your source of truth.
Typical sources include:
- application events
- CSV uploads
- clickstream or logs
- internal databases
- support tickets
- images, audio, or text documents
- user feedback from the product
Common AWS services here:
- Amazon S3 for raw storage
- AWS Lambda for lightweight event-driven ingestion
- Amazon Kinesis for streaming data
- Amazon EventBridge for trigger-based workflows
Why this matters
If a downstream transformation fails or a training job goes bad, you still have the original input and can rerun the pipeline from it instead of trying to reconstruct lost data.
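One way to make "keep raw data raw" concrete is to give every incoming file a write-once key, so a re-upload can never overwrite an earlier object. The sketch below is only an illustration: the `raw/source=.../dt=.../` layout and the `raw_object_key` helper are my own assumptions, not an S3 requirement.

```python
import hashlib
from datetime import datetime, timezone


def raw_object_key(source: str, filename: str, payload: bytes, received_at: datetime) -> str:
    """Build a write-once key for a raw input file.

    The layout (raw/source=<source>/dt=<day>/<hash>_<name>) is a hypothetical
    convention chosen for this example.
    """
    # A content hash in the key means different payloads never collide,
    # and the same payload always maps to the same key.
    digest = hashlib.sha256(payload).hexdigest()[:12]
    day = received_at.astimezone(timezone.utc).strftime("%Y-%m-%d")
    return f"raw/source={source}/dt={day}/{digest}_{filename}"


received = datetime(2024, 1, 15, 8, 30, tzinfo=timezone.utc)
key = raw_object_key("csv_uploads", "orders.csv", b"id,amount\n1,9.99\n", received)
other = raw_object_key("csv_uploads", "orders.csv", b"id,amount\n1,19.99\n", received)
print(key)
```

Because the hash is derived from the content, a changed file with the same name lands next to the old one instead of replacing it.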
2. Data processing and ETL
Once the raw data lands, it usually needs cleanup before training.
This stage can include:
- null and duplicate handling
- schema validation
- text normalization
- feature generation
- image resizing
- train, validation, and test splits
- quality checks before training
A practical AWS setup is:
- Glue Jobs for scheduled ETL
- Lambda for smaller transforms
- Athena for querying S3-backed datasets
- EMR if you need heavier Spark-style compute
Easy way to think about it
Raw data is messy.
Processed data is model-ready.
That transformation is the job of your ETL layer.
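To make the raw-to-model-ready step less abstract, here is a minimal sketch of three of the stages listed above: null handling, duplicate handling, and a deterministic train/validation/test split. The field names in `REQUIRED_FIELDS` and the 80/10/10 split are assumptions for illustration; a real Glue job would do the same thing at scale with Spark.

```python
import hashlib

REQUIRED_FIELDS = ("id", "text", "label")  # hypothetical schema for this example


def clean_rows(rows):
    """Drop rows with missing required fields or repeated ids, and normalize text."""
    seen = set()
    cleaned = []
    for row in rows:
        if any(row.get(field) in (None, "") for field in REQUIRED_FIELDS):
            continue  # null handling and a very basic schema check
        if row["id"] in seen:
            continue  # duplicate handling: keep the first occurrence
        seen.add(row["id"])
        normalized = " ".join(row["text"].lower().split())
        cleaned.append({**row, "text": normalized})
    return cleaned


def assign_split(row_id, val_pct=10, test_pct=10):
    """Map an id to a split via a hash bucket, so reruns give the same split."""
    bucket = int(hashlib.sha256(str(row_id).encode()).hexdigest(), 16) % 100
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + val_pct:
        return "validation"
    return "train"


rows = [
    {"id": 1, "text": "  Hello   World ", "label": 1},
    {"id": 1, "text": "duplicate of id 1", "label": 1},
    {"id": 2, "text": None, "label": 0},
    {"id": 3, "text": "Refund please", "label": 0},
]
cleaned = clean_rows(rows)
splits = {row["id"]: assign_split(row["id"]) for row in cleaned}
```

Hashing the id instead of shuffling randomly keeps a record in the same split across pipeline runs, which avoids leaking test rows into training when the dataset grows.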
3. Model training on SageMaker
Once the processed dataset is ready, training moves into SageMaker.
SageMaker helps with:
- training jobs
- hyperparameter tuning
- managed experiment runs
- GPU or CPU compute
- model artifact output
- pipeline automation
A clean training pipeline usually does this:
- read processed data from S3
- run the training script
- evaluate the result
- compare metrics against the current model
- register the best model version
- deploy only if it passes the quality bar
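The "compare metrics" and "deploy only if it passes the quality bar" steps can be sketched as a small gate function. The thresholds (`min_accuracy`, `min_gain`) and the metric dictionaries are hypothetical values I picked for the example, not SageMaker defaults.

```python
def should_promote(candidate, current, min_accuracy=0.80, min_gain=0.005):
    """Decide whether a candidate model may replace the current one.

    `candidate` and `current` are metric dicts such as {"accuracy": 0.91}.
    `current` is None when nothing is deployed yet.
    """
    if candidate["accuracy"] < min_accuracy:
        return False  # fails the absolute quality bar
    if current is None:
        return True  # first model that clears the bar
    # Require a real improvement, not just noise around the same score
    return candidate["accuracy"] - current["accuracy"] >= min_gain


first_release = should_promote({"accuracy": 0.85}, None)
clear_win = should_promote({"accuracy": 0.91}, {"accuracy": 0.89})
too_weak = should_promote({"accuracy": 0.75}, None)
marginal = should_promote({"accuracy": 0.890}, {"accuracy": 0.889})
```

Keeping the gate as plain code means it can run inside the pipeline and be reviewed like any other change.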
Example training script
import joblib
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# SageMaker mounts the "train" input channel here
train_path = "/opt/ml/input/data/train/train.csv"
df = pd.read_csv(train_path)

# Feature columns and target column
X = df[["text_length", "num_keywords"]]
y = df["label"]

# Hold out 20% of the rows for a quick evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

# Print the metric so it shows up in the training job logs
preds = model.predict(X_test)
print({"accuracy": accuracy_score(y_test, preds)})

# SageMaker expects model artifacts here
joblib.dump(model, "/opt/ml/model/model.joblib")
What is really happening here
SageMaker spins up a training environment, mounts the dataset into the container, runs the script, and saves the trained model artifact for later deployment.
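The hardcoded `/opt/ml/...` paths work, but SageMaker's framework containers also expose them as environment variables such as `SM_CHANNEL_TRAIN` and `SM_MODEL_DIR`. A small helper can read those and fall back to the default paths, which also makes the script easier to run locally. The `resolve_training_paths` name and the `train.csv` / `model.joblib` file names are assumptions carried over from the example script.

```python
import os
from pathlib import Path


def resolve_training_paths(env=None):
    """Return (training CSV path, model output path) for a training run.

    Reads the SageMaker-style environment variables when present and
    falls back to the conventional /opt/ml locations otherwise.
    """
    env = os.environ if env is None else env
    train_dir = Path(env.get("SM_CHANNEL_TRAIN", "/opt/ml/input/data/train"))
    model_dir = Path(env.get("SM_MODEL_DIR", "/opt/ml/model"))
    return train_dir / "train.csv", model_dir / "model.joblib"


# Inside a container with no overrides
default_train, default_model = resolve_training_paths({})

# Local run pointing at a scratch directory (hypothetical paths)
local_train, local_model = resolve_training_paths(
    {"SM_CHANNEL_TRAIN": "/tmp/data", "SM_MODEL_DIR": "/tmp/out"}
)
```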
4. Model registry and versioning
This part gets ignored way too often.
If you train multiple versions, you need to know:
- which dataset version trained the model
- which code version was used
- what the evaluation metrics were
- which model is live in production
- whether rollback is possible
That is why a model registry matters.
Without it, your system becomes: “I think model_final_v7_really_final.joblib is the latest one.”
That is an L.
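To show what a registry actually has to remember, here is a toy in-memory version that records the questions from the list above and supports rollback. It is a sketch only: `ModelVersion`, `InMemoryRegistry`, and the example URIs are hypothetical names, and a real setup would use SageMaker Model Registry or a database instead of a Python dict.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelVersion:
    version: int
    dataset_version: str  # which dataset trained the model
    code_version: str     # which code (for example a git commit) was used
    metrics: dict         # evaluation metrics at registration time
    artifact_uri: str     # where the model file lives


class InMemoryRegistry:
    def __init__(self):
        self._versions = {}
        self._live_history = []  # versions that went live, oldest first

    def register(self, dataset_version, code_version, metrics, artifact_uri):
        version = len(self._versions) + 1
        record = ModelVersion(version, dataset_version, code_version, metrics, artifact_uri)
        self._versions[version] = record
        return record

    def promote(self, version):
        if version not in self._versions:
            raise KeyError(f"unknown model version {version}")
        self._live_history.append(version)

    def live(self):
        if not self._live_history:
            return None
        return self._versions[self._live_history[-1]]

    def rollback(self):
        if len(self._live_history) < 2:
            raise RuntimeError("no earlier live version to roll back to")
        self._live_history.pop()
        return self.live()


registry = InMemoryRegistry()
v1 = registry.register("data-2024-01", "abc123", {"accuracy": 0.89}, "s3://example-bucket/models/1/")
v2 = registry.register("data-2024-02", "def456", {"accuracy": 0.91}, "s3://example-bucket/models/2/")
registry.promote(v1.version)
registry.promote(v2.version)
live_before = registry.live().version
live_after = registry.rollback().version
```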
5. Live inference layer
This is the production-facing part. It answers user requests in real time.
A common pattern is:
- Route 53 for DNS
- CloudFront for edge delivery
- WAF for traffic filtering
- ALB or API Gateway as the public entry point
- ECS Fargate, EKS, or a SageMaker endpoint for model inference
- ElastiCache Redis for hot-response caching
- Aurora or DynamoDB for metadata and app state
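The hot-response caching idea from the list above can be sketched in-process before any Redis is involved. This is a stand-in, not an ElastiCache client: `TTLCache` is a hypothetical class, and the injectable `clock` exists only so the expiry behavior can be checked without sleeping.

```python
import time


class TTLCache:
    """Tiny time-based cache: reuse a computed value until it expires."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._store = {}  # key -> (expires_at, value)

    def get_or_compute(self, key, compute):
        now = self._clock()
        hit = self._store.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]  # still fresh, skip the model call
        value = compute()
        self._store[key] = (now + self._ttl, value)
        return value


calls = []
fake_now = [0.0]  # mutable holder so the lambda below can see updates


def expensive_predict():
    calls.append(1)
    return 1


cache = TTLCache(ttl_seconds=30, clock=lambda: fake_now[0])
features = (120, 4)  # (text_length, num_keywords)
first = cache.get_or_compute(features, expensive_predict)
second = cache.get_or_compute(features, expensive_predict)
calls_while_fresh = len(calls)
fake_now[0] = 31.0  # move past the TTL
third = cache.get_or_compute(features, expensive_predict)
calls_after_expiry = len(calls)
```

Caching only pays off when identical feature vectors repeat often, so it is worth measuring the hit rate before adding a cache tier.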
What should you choose?
Use SageMaker Endpoints when:
- you want AWS-managed model serving
- you need autoscaling tightly linked to the model lifecycle
- your team prefers more managed ML infrastructure
Use ECS Fargate when:
- you want container-first deployment
- your API needs custom business logic around the model
- you want simpler ops than Kubernetes
Use EKS when:
- you run many models
- you need advanced orchestration
- you have platform engineering maturity
- you want more control over GPU scheduling and inference topology
For a lot of teams, SageMaker for training plus ECS Fargate for inference is a strong default.
6. FastAPI inference service example
Here is a simple production-style inference API:
from fastapi import FastAPI
from pydantic import BaseModel
import joblib

app = FastAPI()
model = joblib.load("model/model.joblib")


class PredictionRequest(BaseModel):
    text_length: int
    num_keywords: int


@app.get("/health")
def health():
    return {"status": "ok"}


@app.post("/predict")
def predict(data: PredictionRequest):
    features = [[data.text_length, data.num_keywords]]
    prediction = model.predict(features)[0]
    return {"prediction": int(prediction)}
This gives you two important endpoints:
- /health for health checks
- /predict for live inference
The ALB target group health check can poll /health, and ECS replaces tasks that keep failing it, so traffic only reaches healthy containers.
7. Containerizing the inference app
Once the API works locally, package it into a container and push it to ECR.
Dockerfile example
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY app.py .
COPY model/ ./model/
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8080"]
Then your AWS deployment flow becomes straightforward:
- build image
- push image to ECR
- deploy ECS task or service
- attach ALB and health checks
- scale based on CPU, memory, or custom metrics
8. Terraform for repeatable infrastructure
If you click everything manually in the console, it will work once and confuse you later.
Use Infrastructure as Code.
Terraform keeps the setup repeatable, reviewable, and easier to recover.
Example: S3 buckets
resource "aws_s3_bucket" "raw_data" {
  bucket = "my-ai-raw-data-bucket"
}

resource "aws_s3_bucket" "processed_data" {
  bucket = "my-ai-processed-data-bucket"
}

resource "aws_s3_bucket" "model_artifacts" {
  bucket = "my-ai-model-artifacts-bucket"
}
Example: ECR repository
resource "aws_ecr_repository" "inference_repo" {
  name = "ai-inference-service"

  image_scanning_configuration {
    scan_on_push = true
  }
}
Example: CloudWatch log group
resource "aws_cloudwatch_log_group" "inference_logs" {
  name              = "/ecs/ai-inference"
  retention_in_days = 14
}
That is the kind of setup you want in Git, not in your memory.
9. CI/CD pipeline
A good AI stack is not just about the model. It is about safe delivery.
Typical CI/CD flow:
- developer pushes to main
- tests run
- Docker image builds
- image gets pushed to ECR
- Terraform plan and apply run if needed
- ECS or SageMaker deployment updates
- health checks verify rollout
GitHub Actions example
name: Deploy AI Inference

on:
  push:
    branches: [main]

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v4

      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-east-1

      - name: Login to ECR
        uses: aws-actions/amazon-ecr-login@v2

      - name: Build image
        run: docker build -t ai-inference-service .

      - name: Tag image
        # Block scalar (|) keeps the backslash line continuation intact for the shell
        run: |
          docker tag ai-inference-service:latest \
            123456789012.dkr.ecr.us-east-1.amazonaws.com/ai-inference-service:latest

      - name: Push image
        run: |
          docker push \
            123456789012.dkr.ecr.us-east-1.amazonaws.com/ai-inference-service:latest
This removes a lot of manual deployment risk.
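The last step of the flow above, "health checks verify rollout", is worth sketching because it is usually the first thing skipped. Here is one possible shape, assuming a `check` callable that returns True when the new version answers its /health endpoint. The function name, the retry counts, and the "three passes in a row" rule are my own assumptions.

```python
import time


def wait_until_healthy(check, attempts=10, delay_seconds=3.0,
                       required_successes=3, sleep=time.sleep):
    """Poll `check()` until it passes `required_successes` times in a row.

    Returns True once the streak is reached, or False when attempts run out.
    An exception raised by `check` counts as a failed probe.
    """
    streak = 0
    for _ in range(attempts):
        try:
            ok = bool(check())
        except Exception:
            ok = False
        streak = streak + 1 if ok else 0
        if streak >= required_successes:
            return True
        sleep(delay_seconds)
    return False


def no_sleep(seconds):
    """Stand-in for time.sleep so the examples run instantly."""


# Simulated probes: one failure while the task starts, then healthy
probe_results = iter([False, True, True, True])
rolled_out = wait_until_healthy(lambda: next(probe_results), sleep=no_sleep)

# A service that never becomes healthy
never_ready = wait_until_healthy(lambda: False, attempts=3, sleep=no_sleep)


def broken_probe():
    raise ConnectionError("connection refused")


# A probe that raises is treated as unhealthy instead of crashing the pipeline
crashed = wait_until_healthy(broken_probe, attempts=2, sleep=no_sleep)
```

Requiring a streak rather than a single success helps avoid declaring victory on a container that answers once and then crashes.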
10. Security design
This is the part that saves you from future regret.
A minimum serious AWS setup should include:
- private subnets for compute and databases
- IAM least privilege for services and pipelines
- Secrets Manager instead of hardcoded credentials
- KMS encryption for S3, databases, and secrets
- Security Groups with tight inbound and outbound rules
- CloudTrail for audit logs
- WAF for basic web protection
Easy wording version
Security is basically answering four questions:
- who can access the system?
- what is exposed to the internet?
- where are the secrets stored?
- how do you know when something weird happens?
If you cannot answer those clearly, the architecture is not production ready yet.
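For the "who can access the system?" question, least privilege mostly means writing policies that name exact actions and resources. As a rough sketch, the helper below builds an IAM policy document that only allows reading objects from one bucket. The `read_only_bucket_policy` name and the `Sid` are hypothetical; the bucket name is borrowed from the Terraform example.

```python
import json


def read_only_bucket_policy(bucket: str, prefix: str = "") -> dict:
    """Build an IAM policy document that only allows s3:GetObject
    on objects under the given bucket and key prefix."""
    object_arn = f"arn:aws:s3:::{bucket}/{prefix}*"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ReadModelArtifacts",
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [object_arn],
            }
        ],
    }


policy = read_only_bucket_policy("my-ai-model-artifacts-bucket", prefix="models/")
policy_json = json.dumps(policy, indent=2)
print(policy_json)
```

An inference task with this policy can load its model file but cannot list, overwrite, or delete anything, which is all it needs at runtime.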
11. Monitoring and observability
Do not deploy AI and just hope for the best.
You need visibility into:
- latency
- 4xx and 5xx errors
- request volume
- CPU and memory
- container restarts
- model accuracy trends
- drift signals
- infrastructure cost
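"Drift signals" in the list above can sound vague, so here is one common way to put a number on it: the Population Stability Index (PSI), which compares how a feature was distributed at training time with how it looks in live traffic. This is a sketch under assumptions: the bin edges, the sample values, and the `eps` floor are made up for the example, and the alert threshold is something you would calibrate yourself.

```python
import math


def population_stability_index(expected, actual, bin_edges, eps=1e-6):
    """PSI between two non-empty samples of one numeric feature.

    `bin_edges` are ascending cut points; values fall into
    len(bin_edges) + 1 bins. `eps` avoids log(0) for empty bins.
    """
    def shares(values):
        counts = [0] * (len(bin_edges) + 1)
        for v in values:
            # Number of edges at or below v gives the bin index
            counts[sum(v >= edge for edge in bin_edges)] += 1
        return [max(c / len(values), eps) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


edges = [25, 50, 75]
training_lengths = [10, 20, 30, 40, 50, 60, 70, 80]
live_lengths = [60, 70, 80, 90, 95, 100, 85, 75]

stable = population_stability_index(training_lengths, training_lengths, edges)
drifted = population_stability_index(training_lengths, live_lengths, edges)
```

A PSI of zero means the two distributions match bin for bin, and larger values mean live inputs look less like the training data.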
A basic but solid AWS observability stack includes:
- CloudWatch Logs for service logs
- CloudWatch Metrics for runtime numbers
- CloudWatch Alarms for threshold-based alerts
- AWS X-Ray for tracing if you need deeper request visibility
- SNS for notifying humans when something breaks
Good alarms to set
- p95 latency above target
- error rate spike
- memory above 80 percent
- CPU above 80 percent
- unhealthy target count increase
- training job failure
- endpoint scaling anomaly
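The first two alarms in that list come down to simple arithmetic, which is worth seeing once. Below is a minimal sketch that computes a nearest-rank p95 and a 5xx error rate from raw samples. The targets (`p95_target_ms`, `max_error_rate`) are placeholder values, and in practice CloudWatch would compute these statistics for you.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least
    pct percent of all samples at or below it."""
    if not values:
        raise ValueError("percentile of an empty sample")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def breached_alarms(latencies_ms, status_codes, p95_target_ms=300, max_error_rate=0.01):
    """Return the names of alarms that would fire for this window."""
    alarms = []
    if percentile(latencies_ms, 95) > p95_target_ms:
        alarms.append("p95_latency")
    server_errors = sum(1 for code in status_codes if code >= 500)
    if server_errors / len(status_codes) > max_error_rate:
        alarms.append("error_rate")
    return alarms


# Two slow requests out of twenty are enough to push p95 over the target
bad_window = breached_alarms([100] * 18 + [900, 950], [200] * 98 + [500, 503])
good_window = breached_alarms([100] * 20, [200] * 100)
```

Alerting on p95 instead of the average is the point here: one slow request in twenty is invisible in a mean but obvious in the tail.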
12. Recommended production stack
If you want a practical end-to-end stack without overengineering, this is a strong setup:
- Route 53 for DNS
- CloudFront for edge delivery
- WAF for filtering bad traffic
- ALB for routing traffic to your app
- ECS Fargate for the inference API
- ECR for container images
- S3 for raw data, processed data, and model artifacts
- Glue for ETL
- SageMaker for training and registry workflows
- Aurora PostgreSQL or DynamoDB for metadata
- ElastiCache Redis for caching
- CloudWatch for observability
- Secrets Manager for secrets
- Terraform for IaC
- GitHub Actions for CI/CD
This is technical, scalable, and still realistic for a serious project.
13. Best default architecture choice
If you are building from scratch and want a clean answer, this is my default take:
- keep raw and processed data in S3
- use Glue for ETL
- train in SageMaker
- store model artifacts in S3 and registry metadata in SageMaker or your platform layer
- serve inference from ECS Fargate with FastAPI
- put CloudWatch, IAM, Secrets Manager, and Terraform around everything
That setup gives you:
- managed training
- controlled deployment
- simpler inference than full Kubernetes
- room to scale later
14. Final takeaway
Real AI infrastructure is not just the model.
It is the full system around the model:
- data coming in cleanly
- training jobs running reliably
- versions tracked properly
- inference exposed safely
- deployments automated
- monitoring always on
That is what turns a cool AI demo into production engineering.
If you can explain and build this pipeline clearly, you are already thinking more like an ML platform engineer than someone just calling an API.

