Nimesh Kulkarni

Posted on May 19

Full AI Infrastructure Deployment on AWS: Architecture, Pipeline, and Production Setup

#ai #aws #devops #architecture

A lot of people say they built an AI system, but what they really mean is they trained a model once and ran it on a laptop.

Production AI is a different game.

You need a pipeline that can ingest data, clean it, train models, version them, deploy them safely, expose them through an API, and keep watching latency, errors, drift, and cost after release.

That is what real AI infrastructure looks like.

In this post, I will break down a full AI deployment stack on AWS in plain language without losing the technical depth.

The big idea

A production AI platform usually has four layers:

Data layer for collecting and storing data
Training layer for model training and evaluation
Serving layer for live inference
Ops layer for security, CI/CD, logging, and monitoring

If you get these four parts right, the system becomes repeatable instead of fragile.

Architecture diagram

End-to-end request and training flow

At a high level, the pipeline works like this:

data arrives from apps, logs, APIs, files, and user feedback
raw data lands in Amazon S3
AWS Glue or Lambda cleans and transforms the data
processed data is saved back to S3 in train, validation, and test form
Amazon SageMaker runs training and evaluation jobs
the best model is stored in a model registry
CI/CD deploys the chosen model to an inference service
users hit the model through an API
CloudWatch tracks performance, health, and alerts

That is the clean version.

Now let us go layer by layer.

1. Data ingestion layer

The first rule is simple: keep raw data raw.

Do not overwrite the original input files. Land them in S3 first and treat that bucket like your source of truth.

Typical sources include:

application events
CSV uploads
clickstream or logs
internal databases
support tickets
images, audio, or text documents
user feedback from the product

Common AWS services here:

Amazon S3 for raw storage
AWS Lambda for lightweight event-driven ingestion
Amazon Kinesis for streaming data
Amazon EventBridge for trigger-based workflows

Why this matters

If a downstream transformation fails or a training job goes bad, you still have the original input. You can fix the broken step and rerun it instead of trying to recover data that was overwritten.
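One way to make "keep raw data raw" concrete is to give every incoming file a write-once key. Here is a minimal sketch, assuming a hypothetical key layout (raw/source=.../dt=.../) and a hypothetical helper named raw_object_key. S3 does not require this layout, it is just one convention.

```python
import hashlib
from datetime import datetime, timezone


def raw_object_key(source: str, filename: str, payload: bytes, received_at: datetime) -> str:
    """Build a write-once S3 key for a raw input file.

    The layout (raw/source=.../dt=.../) is an assumed convention,
    not something S3 requires.
    """
    day = received_at.astimezone(timezone.utc).strftime("%Y-%m-%d")
    # A content hash in the key means a changed file gets a new key
    # instead of replacing the earlier object.
    digest = hashlib.sha256(payload).hexdigest()[:12]
    return f"raw/source={source}/dt={day}/{digest}_{filename}"


key = raw_object_key(
    source="app_events",
    filename="events.csv",
    payload=b"user_id,event\n1,login\n",
    received_at=datetime(2024, 5, 19, 10, 30, tzinfo=timezone.utc),
)
print(key)
```

Because the key depends on the file content, uploading a corrected file creates a second object next to the first one rather than replacing it.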

2. Data processing and ETL

Once the raw data lands, it usually needs cleanup before training.

This stage can include:

null and duplicate handling
schema validation
text normalization
feature generation
image resizing
train, validation, and test splits
quality checks before training

A practical AWS setup is:

Glue Jobs for scheduled ETL
Lambda for smaller transforms
Athena for querying S3-backed datasets
EMR if you need heavier Spark-style compute

Easy way to think about it

Raw data is messy.

Processed data is model-ready.

That transformation is the job of your ETL layer.
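To show what that ETL job looks like at its smallest, here is a sketch in plain Python covering null handling, duplicate handling, and a deterministic train, validation, and test split. The field names in REQUIRED_FIELDS and the 80/10/10 ratio are assumptions for illustration. A real Glue job would do the same thing in Spark over files in S3.

```python
import hashlib

REQUIRED_FIELDS = ("id", "text_length", "num_keywords", "label")  # assumed schema


def clean(rows):
    """Drop rows with missing required fields and rows with a repeated id."""
    seen, kept = set(), []
    for row in rows:
        if any(row.get(field) is None for field in REQUIRED_FIELDS):
            continue  # null handling: skip incomplete rows
        if row["id"] in seen:
            continue  # duplicate handling: keep the first occurrence
        seen.add(row["id"])
        kept.append(row)
    return kept


def assign_split(record_id, train=0.8, validation=0.1):
    """Map an id to a split deterministically, so reruns give the same answer."""
    digest = hashlib.sha256(str(record_id).encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # value in [0, 1)
    if bucket < train:
        return "train"
    if bucket < train + validation:
        return "validation"
    return "test"


rows = [
    {"id": 1, "text_length": 120, "num_keywords": 4, "label": 1},
    {"id": 1, "text_length": 120, "num_keywords": 4, "label": 1},   # duplicate
    {"id": 2, "text_length": None, "num_keywords": 2, "label": 0},  # missing value
    {"id": 3, "text_length": 80, "num_keywords": 1, "label": 0},
]
cleaned = clean(rows)
splits = {row["id"]: assign_split(row["id"]) for row in cleaned}
print(len(cleaned), splits)
```

Hashing the id instead of shuffling means a record stays in the same split every time the job runs, which keeps test data from leaking into training across reruns.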

3. Model training on SageMaker

Once the processed dataset is ready, training moves into SageMaker.

SageMaker helps with:

training jobs
hyperparameter tuning
managed experiment runs
GPU or CPU compute
model artifact output
pipeline automation

A clean training pipeline usually does this:

read processed data from S3
run the training script
evaluate the result
compare metrics against the current model
register the best model version
deploy only if it passes the quality bar
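The last three steps are where most of the safety comes from, so here is a rough sketch of a quality gate. The function name should_promote, the single accuracy metric, and both thresholds are placeholders I made up. Your own bar will depend on the problem.

```python
def should_promote(candidate, current, min_accuracy=0.85, min_gain=0.01):
    """Return True only if the candidate clears the bar and beats the live model.

    The thresholds and the single accuracy metric are placeholders.
    """
    if candidate["accuracy"] < min_accuracy:
        return False  # fails the absolute quality bar
    if current is None:
        return True   # nothing deployed yet, so the bar is the only check
    return candidate["accuracy"] - current["accuracy"] >= min_gain


print(should_promote({"accuracy": 0.91}, {"accuracy": 0.88}))   # True
print(should_promote({"accuracy": 0.885}, {"accuracy": 0.88}))  # False
print(should_promote({"accuracy": 0.80}, None))                 # False
```

The second call shows why a minimum gain helps: a model that is only marginally better is usually not worth the risk of a rollout.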

Example training script

import pandas as pd
import joblib
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# SageMaker mounts training data here
train_path = "/opt/ml/input/data/train/train.csv"
df = pd.read_csv(train_path)

X = df[["text_length", "num_keywords"]]
y = df["label"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

preds = model.predict(X_test)
print({"accuracy": accuracy_score(y_test, preds)})

# SageMaker expects model artifacts here
joblib.dump(model, "/opt/ml/model/model.joblib")

What is really happening here

SageMaker spins up a training environment, mounts the dataset into the container, runs the script, and saves the trained model artifact for later deployment.
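The paths in the script are hardcoded, which is fine for a demo. As a small sketch of a more portable version: SageMaker's framework containers set the environment variables SM_CHANNEL_TRAIN and SM_MODEL_DIR, so a script can read those and fall back to the default paths. The helper name resolve_paths is my own, and in a fully custom container these variables may not be set, which is why the defaults are there.

```python
import os
from pathlib import Path


def resolve_paths(env=None):
    """Return (train_dir, model_dir), preferring SageMaker's environment variables."""
    env = os.environ if env is None else env
    # Fall back to the conventional container paths when the variables are absent.
    train_dir = Path(env.get("SM_CHANNEL_TRAIN", "/opt/ml/input/data/train"))
    model_dir = Path(env.get("SM_MODEL_DIR", "/opt/ml/model"))
    return train_dir, model_dir


train_dir, model_dir = resolve_paths()
print(train_dir / "train.csv")
print(model_dir / "model.joblib")
```

Passing a plain dict as env also lets you point the same script at local folders when testing outside SageMaker.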

4. Model registry and versioning

This part gets ignored way too often.

If you train multiple versions, you need to know:

which dataset version trained the model
which code version was used
what the evaluation metrics were
which model is live in production
whether rollback is possible

That is why a model registry matters.

Without it, your system becomes: “I think model_final_v7_really_final.joblib is the latest one.”

That is an L.
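To make the checklist above concrete, here is a toy in-memory registry. It is only a sketch of the record you want to keep per version. The class names, fields, and version strings are all hypothetical, and this is not the SageMaker Model Registry API.

```python
from dataclasses import dataclass, field


@dataclass
class ModelVersion:
    version: int
    dataset_version: str   # which data trained it
    code_version: str      # for example a git commit SHA
    metrics: dict          # evaluation results
    artifact_uri: str      # where the model file lives


@dataclass
class Registry:
    versions: dict = field(default_factory=dict)
    history: list = field(default_factory=list)  # versions that went live, in order

    def register(self, mv: ModelVersion) -> None:
        self.versions[mv.version] = mv

    def promote(self, version: int) -> None:
        if version not in self.versions:
            raise KeyError(f"unknown model version {version}")
        self.history.append(version)

    def live(self):
        return self.versions[self.history[-1]] if self.history else None

    def rollback(self):
        """Go back to the previously live version, if there is one."""
        if len(self.history) < 2:
            raise RuntimeError("no earlier version to roll back to")
        self.history.pop()
        return self.live()


registry = Registry()
registry.register(ModelVersion(1, "data-v1", "a1b2c3d", {"accuracy": 0.88},
                               "s3://my-ai-model-artifacts-bucket/v1/model.joblib"))
registry.register(ModelVersion(2, "data-v2", "e4f5a6b", {"accuracy": 0.91},
                               "s3://my-ai-model-artifacts-bucket/v2/model.joblib"))
registry.promote(1)
registry.promote(2)
print(registry.live().version)      # 2
print(registry.rollback().version)  # 1
```

With a record like this, "which model is live" and "can we roll back" become lookups instead of guesses.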

5. Live inference layer

This is the production-facing part. It answers user requests in real time.

A common pattern is:

Route 53 for DNS
CloudFront for edge delivery
WAF for traffic filtering
ALB or API Gateway as the public entry point
ECS Fargate, EKS, or a SageMaker endpoint for model inference
ElastiCache Redis for hot-response caching
Aurora or DynamoDB for metadata and app state

What should you choose?

Use SageMaker Endpoints when:

you want AWS-managed model serving
you need autoscaling tightly linked to the model lifecycle
your team prefers more managed ML infrastructure

Use ECS Fargate when:

you want container-first deployment
your API needs custom business logic around the model
you want simpler ops than Kubernetes

Use EKS when:

you run many models
you need advanced orchestration
you have platform engineering maturity
you want more control over GPU scheduling and inference topology

For a lot of teams, SageMaker for training plus ECS Fargate for inference is a strong default.

6. FastAPI inference service example

Here is a simple production-style inference API:

from fastapi import FastAPI
from pydantic import BaseModel
import joblib

app = FastAPI()
model = joblib.load("model/model.joblib")

class PredictionRequest(BaseModel):
    text_length: int
    num_keywords: int

@app.get("/health")
def health():
    return {"status": "ok"}

@app.post("/predict")
def predict(data: PredictionRequest):
    features = [[data.text_length, data.num_keywords]]
    prediction = model.predict(features)[0]
    return {"prediction": int(prediction)}

This gives you two important endpoints:

/health for health checks
/predict for live inference

The ALB target group (or an ECS container health check) can call /health to decide whether a task is healthy before it receives traffic.
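For completeness, here is a minimal sketch of what a client call to /predict looks like, using only the standard library. The base URL http://localhost:8080 is an assumption (it matches a local run on port 8080), and build_predict_request is a hypothetical helper. The actual network call is left commented out so the snippet runs without a server.

```python
import json
import urllib.request


def build_predict_request(base_url, text_length, num_keywords):
    """Build a POST request whose JSON body matches PredictionRequest."""
    body = json.dumps(
        {"text_length": text_length, "num_keywords": num_keywords}
    ).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_predict_request("http://localhost:8080", 120, 4)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))

# With the service running, send it like this:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```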

7. Containerizing the inference app

Once the API works locally, package it into a container and push it to ECR.

Dockerfile example

FROM python:3.11-slim

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY app.py .
COPY model/ ./model/

CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8080"]

Then your AWS deployment flow becomes straightforward:

build image
push image to ECR
deploy ECS task or service
attach ALB and health checks
scale based on CPU, memory, or custom metrics

8. Terraform for repeatable infrastructure

If you click everything manually in the console, it will work once and confuse you later.

Use Infrastructure as Code.

Terraform keeps the setup repeatable, reviewable, and easier to recover.

Example: S3 buckets

resource "aws_s3_bucket" "raw_data" {
  bucket = "my-ai-raw-data-bucket"
}

resource "aws_s3_bucket" "processed_data" {
  bucket = "my-ai-processed-data-bucket"
}

resource "aws_s3_bucket" "model_artifacts" {
  bucket = "my-ai-model-artifacts-bucket"
}

Example: ECR repository

resource "aws_ecr_repository" "inference_repo" {
  name = "ai-inference-service"

  image_scanning_configuration {
    scan_on_push = true
  }
}

Example: CloudWatch log group

resource "aws_cloudwatch_log_group" "inference_logs" {
  name              = "/ecs/ai-inference"
  retention_in_days = 14
}

That is the kind of setup you want in Git, not in your memory.

9. CI/CD pipeline

A good AI stack is not just about the model. It is about safe delivery.

Typical CI/CD flow:

developer pushes to main
tests run
Docker image builds
image gets pushed to ECR
Terraform plan and apply run if needed
ECS or SageMaker deployment updates
health checks verify rollout

GitHub Actions example

name: Deploy AI Inference

on:
  push:
    branches: [main]

jobs:
  deploy:
    runs-on: ubuntu-latest

    steps:
      - name: Checkout
        uses: actions/checkout@v4

      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-east-1

      - name: Login to ECR
        uses: aws-actions/amazon-ecr-login@v2

      - name: Build image
        run: docker build -t ai-inference-service .

      - name: Tag image
        run: |
          docker tag ai-inference-service:latest \
            123456789012.dkr.ecr.us-east-1.amazonaws.com/ai-inference-service:latest

      - name: Push image
        run: |
          docker push \
            123456789012.dkr.ecr.us-east-1.amazonaws.com/ai-inference-service:latest

This removes a lot of manual deployment risk.
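One thing the workflow above does not cover: it only pushes the latest tag, so there is no easy way to tell which commit is running or to roll back to an earlier image. A common alternative is to tag each image with the commit SHA. Here is a small sketch of building that URI. The helper name ecr_image_uri and the example SHA are made up for illustration.

```python
def ecr_image_uri(account_id: str, region: str, repository: str, git_sha: str) -> str:
    """Build an ECR image URI tagged with a short commit SHA."""
    if len(git_sha) < 7:
        raise ValueError("expected at least 7 characters of the commit SHA")
    registry = f"{account_id}.dkr.ecr.{region}.amazonaws.com"
    # Keep the first 12 characters: short enough to read, long enough to be unique
    # in most repositories.
    return f"{registry}/{repository}:{git_sha[:12]}"


print(ecr_image_uri(
    "123456789012",
    "us-east-1",
    "ai-inference-service",
    "9f2c1e7ab34d5f60718293a4b5c6d7e8f9012345",
))
# 123456789012.dkr.ecr.us-east-1.amazonaws.com/ai-inference-service:9f2c1e7ab34d
```

In GitHub Actions the commit SHA is available as github.sha, so the same idea can be applied directly in the tag and push steps.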

10. Security design

This is the part that saves you from future regret.

A minimum serious AWS setup should include:

private subnets for compute and databases
IAM least privilege for services and pipelines
Secrets Manager instead of hardcoded credentials
KMS encryption for S3, databases, and secrets
Security Groups with tight inbound and outbound rules
CloudTrail for audit logs
WAF for basic web protection

In plain words

Security is basically answering four questions:

who can access the system?
what is exposed to the internet?
where are the secrets stored?
how do you know when something weird happens?

If you cannot answer those clearly, the architecture is not production ready yet.
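As an example of what least privilege means in practice, here is a sketch that builds an IAM policy document for an inference task that only needs to read model files. The bucket name reuses the one from the Terraform example, while the models prefix and the helper name read_only_artifact_policy are assumptions. If the bucket is encrypted with a KMS key, the role would also need permission to decrypt with that key, which is left out here.

```python
import json


def read_only_artifact_policy(bucket: str, prefix: str) -> dict:
    """IAM policy document allowing reads of objects under one S3 prefix."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ReadModelArtifacts",
                "Effect": "Allow",
                # Read access only: no write, delete, or list permissions.
                "Action": ["s3:GetObject"],
                "Resource": [f"arn:aws:s3:::{bucket}/{prefix}/*"],
            }
        ],
    }


policy = read_only_artifact_policy("my-ai-model-artifacts-bucket", "models")
print(json.dumps(policy, indent=2))
```

The point is the shape: one action, one narrow resource, instead of s3:* on every bucket.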

11. Monitoring and observability

Do not deploy AI and just hope for the best.

You need visibility into:

latency
4xx and 5xx errors
request volume
CPU and memory
container restarts
model accuracy trends
drift signals
infrastructure cost

A basic but solid AWS observability stack includes:

CloudWatch Logs for service logs
CloudWatch Metrics for runtime numbers
CloudWatch Alarms for threshold-based alerts
AWS X-Ray for tracing if you need deeper request visibility
SNS for notifying humans when something breaks

Good alarms to set

p95 latency above target
error rate spike
memory above 80 percent
CPU above 80 percent
unhealthy target count increase
training job failure
endpoint scaling anomaly
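The first two alarms in that list are simple enough to sketch in a few lines. This is only an illustration of the logic, with made-up thresholds (300 ms and 1 percent) and hypothetical function names. In AWS you would define these as CloudWatch Alarms on metrics rather than compute them yourself.

```python
def p95(values):
    """95th percentile using the nearest-rank method."""
    if not values:
        raise ValueError("need at least one value")
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) as a 1-based rank
    return ordered[rank - 1]


def evaluate_alarms(latencies_ms, status_codes, p95_target_ms=300, max_error_rate=0.01):
    """Return the names of alarms that would fire for one window of requests."""
    fired = []
    if p95(latencies_ms) > p95_target_ms:
        fired.append("p95_latency_above_target")
    errors = sum(1 for code in status_codes if code >= 500)
    if errors / len(status_codes) > max_error_rate:
        fired.append("error_rate_spike")
    return fired


latencies = [100] * 94 + [450] * 6   # 6% of requests are slow
codes = [200] * 97 + [500] * 3       # 3% server errors
print(evaluate_alarms(latencies, codes))
# ['p95_latency_above_target', 'error_rate_spike']
```

Note that the average latency in this example is only about 121 ms, which looks healthy. That is why the alarm is on p95 and not on the mean.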

12. Recommended production stack

If you want a practical end-to-end stack without overengineering, this is a strong setup:

Route 53 for DNS
CloudFront for edge delivery
WAF for filtering bad traffic
ALB for routing traffic to your app
ECS Fargate for the inference API
ECR for container images
S3 for raw data, processed data, and model artifacts
Glue for ETL
SageMaker for training and registry workflows
Aurora PostgreSQL or DynamoDB for metadata
ElastiCache Redis for caching
CloudWatch for observability
Secrets Manager for secrets
Terraform for IaC
GitHub Actions for CI/CD

Every piece here is either a managed AWS service or a widely used tool, so the stack can scale while staying realistic for a small team to run.

13. Best default architecture choice

If you are building from scratch and want a clean answer, this is my default take:

keep raw and processed data in S3
use Glue for ETL
train in SageMaker
store model artifacts in S3 and registry metadata in SageMaker or your platform layer
serve inference from ECS Fargate with FastAPI
put CloudWatch, IAM, Secrets Manager, and Terraform around everything

That setup gives you:

managed training
controlled deployment
simpler inference than full Kubernetes
room to scale later

14. Final takeaway

Real AI infrastructure is not just the model.

It is the full system around the model:

data coming in cleanly
training jobs running reliably
versions tracked properly
inference exposed safely
deployments automated
monitoring always on

That is what turns a cool AI demo into production engineering.

If you can explain and build this pipeline clearly, you are already thinking more like an ML platform engineer than someone just calling an API.

