Nimesh Kulkarni

Full AI Infrastructure Deployment on AWS: Architecture, Pipeline, and Production Setup

A lot of people say they built an AI system, but what they really mean is they trained a model once and ran it on a laptop.

Production AI is a different game.

You need a pipeline that can ingest data, clean it, train models, version them, deploy them safely, expose them through an API, and keep watching latency, errors, drift, and cost after release.

That is what real AI infrastructure looks like.

In this post, I will break down a full AI deployment stack on AWS in plain language, without losing the technical depth.

The big idea

A production AI platform usually has four layers:

  1. Data layer for collecting and storing data
  2. Training layer for model training and evaluation
  3. Serving layer for live inference
  4. Ops layer for security, CI/CD, logging, and monitoring

If you get these four parts right, the system becomes repeatable instead of fragile.

Architecture diagram

[Figure: AWS AI infrastructure architecture diagram]

End-to-end request and training flow

At a high level, the pipeline works like this:

  • data arrives from apps, logs, APIs, files, and user feedback
  • raw data lands in Amazon S3
  • AWS Glue or Lambda cleans and transforms the data
  • processed data is saved back to S3 in train, validation, and test form
  • Amazon SageMaker runs training and evaluation jobs
  • the best model is stored in a model registry
  • CI/CD deploys the chosen model to an inference service
  • users hit the model through an API
  • CloudWatch tracks performance, health, and alerts

That is the clean version.

Now let us go layer by layer.

1. Data ingestion layer

The first rule is simple: keep raw data raw.

Do not overwrite the original input files. Land them in S3 first and treat that bucket like your source of truth.

Typical sources include:

  • application events
  • CSV uploads
  • clickstream or logs
  • internal databases
  • support tickets
  • images, audio, or text documents
  • user feedback from the product

Common AWS services here:

  • Amazon S3 for raw storage
  • AWS Lambda for lightweight event-driven ingestion
  • Amazon Kinesis for streaming data
  • Amazon EventBridge for trigger-based workflows

Why this matters

If a downstream transformation fails or a training job goes bad, you still have the original input. That saves you from pipeline pain later.
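One way to make "keep raw data raw" hard to break by accident is to give every incoming file a key that cannot collide with an earlier one. The sketch below is a minimal illustration, not a prescribed layout: the `raw/<source>/<date>/` prefix scheme and the `raw_object_key` helper are assumptions of mine, and the actual upload call is left out.

```python
import hashlib
from datetime import datetime, timezone


def raw_object_key(source: str, filename: str, payload: bytes, received_at: datetime) -> str:
    """Build a write-once key: partitioned by source and UTC date, prefixed with a content hash."""
    day = received_at.astimezone(timezone.utc).strftime("%Y/%m/%d")
    # Different content always gets a different key, so a re-upload never overwrites history.
    digest = hashlib.sha256(payload).hexdigest()[:16]
    return f"raw/{source}/{day}/{digest}-{filename}"


# Hypothetical example input
received = datetime(2024, 5, 1, 12, 30, tzinfo=timezone.utc)
print(raw_object_key("clickstream", "events.csv", b"user_id,event\n1,click\n", received))
```

Because the hash is part of the key, uploading the same bytes twice maps to the same object, while a corrected file lands next to the original instead of replacing it.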

2. Data processing and ETL

Once the raw data lands, it usually needs cleanup before training.

This stage can include:

  • null and duplicate handling
  • schema validation
  • text normalization
  • feature generation
  • image resizing
  • train, validation, and test splits
  • quality checks before training

A practical AWS setup is:

  • Glue Jobs for scheduled ETL
  • Lambda for smaller transforms
  • Athena for querying S3-backed datasets
  • EMR if you need heavier Spark-style compute

Easy way to think about it

Raw data is messy.

Processed data is model-ready.

That transformation is the job of your ETL layer.
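To make the raw-to-model-ready step concrete, here is a small sketch of three of the tasks listed above: null handling, duplicate handling, and a train, validation, and test split. It uses plain Python instead of Glue so it runs anywhere. The `id` field and the 80/10/10 split are assumptions for illustration; the feature names match the training script used later in this post.

```python
import hashlib

REQUIRED_FIELDS = ("id", "text_length", "num_keywords", "label")


def clean(rows):
    """Drop rows with a missing required field and rows whose id was already seen."""
    seen, out = set(), []
    for row in rows:
        if any(row.get(name) is None for name in REQUIRED_FIELDS):
            continue
        if row["id"] in seen:
            continue
        seen.add(row["id"])
        out.append(row)
    return out


def assign_split(row_id, val_pct=10, test_pct=10):
    """Map an id to a split with a stable hash, so reruns put each row in the same split."""
    bucket = int(hashlib.md5(str(row_id).encode()).hexdigest(), 16) % 100
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + val_pct:
        return "validation"
    return "train"


rows = [
    {"id": 1, "text_length": 120, "num_keywords": 4, "label": 1},
    {"id": 1, "text_length": 120, "num_keywords": 4, "label": 1},   # duplicate
    {"id": 2, "text_length": None, "num_keywords": 2, "label": 0},  # missing value
    {"id": 3, "text_length": 80, "num_keywords": 1, "label": 0},
]
for row in clean(rows):
    print(row["id"], assign_split(row["id"]))
```

Hashing the id instead of shuffling randomly means a row never drifts from train into test between pipeline runs, which keeps evaluation numbers comparable.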

3. Model training on SageMaker

Once the processed dataset is ready, training moves into SageMaker.

SageMaker helps with:

  • training jobs
  • hyperparameter tuning
  • managed experiment runs
  • GPU or CPU compute
  • model artifact output
  • pipeline automation

A clean training pipeline usually does this:

  1. read processed data from S3
  2. run the training script
  3. evaluate the result
  4. compare metrics against the current model
  5. register the best model version
  6. deploy only if it passes the quality bar
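Steps 4 to 6 boil down to a small decision function. The sketch below is one possible quality bar, assuming accuracy is the metric that matters; the `min_accuracy` and `min_gain` thresholds are made-up defaults you would tune for your own problem.

```python
def should_promote(candidate, current=None, min_accuracy=0.80, min_gain=0.01):
    """Return True only if the candidate clears the floor and beats the live model."""
    if candidate["accuracy"] < min_accuracy:
        return False  # never ship a model below the absolute floor
    if current is None:
        return True   # nothing in production yet, the floor is enough
    return candidate["accuracy"] - current["accuracy"] >= min_gain


print(should_promote({"accuracy": 0.90}, {"accuracy": 0.85}))
print(should_promote({"accuracy": 0.852}, {"accuracy": 0.85}))
```

Requiring a minimum gain, not just "greater than", avoids redeploying for improvements that are probably noise.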

Example training script

import pandas as pd
import joblib
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# SageMaker exposes the "train" input channel at this path (the channel name is set on the training job)
train_path = "/opt/ml/input/data/train/train.csv"
df = pd.read_csv(train_path)

X = df[["text_length", "num_keywords"]]
y = df["label"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

preds = model.predict(X_test)
print({"accuracy": accuracy_score(y_test, preds)})

# Files written to /opt/ml/model are packaged and uploaded to S3 as the model artifact
joblib.dump(model, "/opt/ml/model/model.joblib")

What is really happening here

SageMaker spins up a training container, makes the dataset from S3 available under /opt/ml/input/data/, runs the script, and uploads whatever the script writes to /opt/ml/model back to S3 as the model artifact for later deployment.

4. Model registry and versioning

This part gets ignored way too often.

If you train multiple versions, you need to know:

  • which dataset version trained the model
  • which code version was used
  • what the evaluation metrics were
  • which model is live in production
  • whether rollback is possible

That is why a model registry matters.

Without it, your system becomes: “I think model_final_v7_really_final.joblib is the latest one.”

That is an L.
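To show what a registry actually has to remember, here is a tiny in-memory stand-in. It is not the SageMaker Model Registry API; `ModelVersion`, `ModelRegistry`, and all field names are hypothetical and exist only to make the five questions above answerable in code.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ModelVersion:
    version: int
    artifact_uri: str          # where the model file lives
    dataset_version: str       # which data trained it
    git_commit: str            # which code trained it
    metrics: Dict[str, float]  # how it scored


class ModelRegistry:
    def __init__(self) -> None:
        self._versions: Dict[int, ModelVersion] = {}
        self._live_history: List[int] = []  # last item is the live version

    def register(self, artifact_uri, dataset_version, git_commit, metrics) -> ModelVersion:
        version = len(self._versions) + 1
        entry = ModelVersion(version, artifact_uri, dataset_version, git_commit, dict(metrics))
        self._versions[version] = entry
        return entry

    def promote(self, version: int) -> None:
        if version not in self._versions:
            raise KeyError(f"unknown model version {version}")
        self._live_history.append(version)

    def live(self) -> Optional[ModelVersion]:
        return self._versions[self._live_history[-1]] if self._live_history else None

    def rollback(self) -> Optional[ModelVersion]:
        if len(self._live_history) < 2:
            raise RuntimeError("no earlier version to roll back to")
        self._live_history.pop()
        return self.live()


registry = ModelRegistry()
v1 = registry.register("s3://example-bucket/models/v1/model.joblib", "2024-05-01", "abc1234", {"accuracy": 0.85})
v2 = registry.register("s3://example-bucket/models/v2/model.joblib", "2024-06-01", "def5678", {"accuracy": 0.90})
registry.promote(v1.version)
registry.promote(v2.version)
print(registry.rollback().version)
```

The bucket names, dates, and commit hashes are placeholders. The point is that rollback becomes a lookup, not an archaeology project.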

5. Live inference layer

This is the production-facing part. It answers user requests in real time.

A common pattern is:

  • Route 53 for DNS
  • CloudFront for edge delivery
  • WAF for traffic filtering
  • ALB or API Gateway as the public entry point
  • ECS Fargate, EKS, or a SageMaker endpoint for model inference
  • ElastiCache Redis for hot-response caching
  • Aurora or DynamoDB for metadata and app state

What should you choose?

Use SageMaker Endpoints when:

  • you want AWS-managed model serving
  • you need autoscaling tightly linked to the model lifecycle
  • your team prefers more managed ML infrastructure

Use ECS Fargate when:

  • you want container-first deployment
  • your API needs custom business logic around the model
  • you want simpler ops than Kubernetes

Use EKS when:

  • you run many models
  • you need advanced orchestration
  • you have platform engineering maturity
  • you want more control over GPU scheduling and inference topology

For a lot of teams, SageMaker for training plus ECS Fargate for inference is a strong default.

6. FastAPI inference service example

Here is a simple production-style inference API:

from fastapi import FastAPI
from pydantic import BaseModel
import joblib

app = FastAPI()
model = joblib.load("model/model.joblib")

class PredictionRequest(BaseModel):
    text_length: int
    num_keywords: int

@app.get("/health")
def health():
    return {"status": "ok"}

@app.post("/predict")
def predict(data: PredictionRequest):
    features = [[data.text_length, data.num_keywords]]
    prediction = model.predict(features)[0]
    return {"prediction": int(prediction)}

This gives you two important endpoints:

  • /health for health checks
  • /predict for live inference

The ALB target group health check, or an ECS container health check, can poll /health so that traffic only reaches tasks that are actually healthy.

7. Containerizing the inference app

Once the API works locally, package it into a container and push it to ECR.

Dockerfile example

FROM python:3.11-slim

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY app.py .
COPY model/ ./model/

CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8080"]

Then your AWS deployment flow becomes straightforward:

  • build image
  • push image to ECR
  • deploy ECS task or service
  • attach ALB and health checks
  • scale based on CPU, memory, or custom metrics
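The last step, scaling on a metric, is easier to reason about with numbers. The sketch below is a simplified proportional rule in the spirit of target tracking, not the exact algorithm AWS runs. The 60 percent CPU target and the task limits are assumed values.

```python
import math


def desired_task_count(current_tasks, observed_cpu_pct, target_cpu_pct=60.0,
                       min_tasks=2, max_tasks=20):
    """Scale the task count in proportion to how far the metric is from its target."""
    raw = math.ceil(current_tasks * observed_cpu_pct / target_cpu_pct)
    # Clamp so a metric spike or a quiet night cannot push the service out of bounds.
    return max(min_tasks, min(max_tasks, raw))


print(desired_task_count(current_tasks=4, observed_cpu_pct=90))  # 6
print(desired_task_count(current_tasks=4, observed_cpu_pct=30))  # 2
```

With 4 tasks at 90 percent CPU and a 60 percent target, the rule asks for 6 tasks. At 30 percent it drops to the minimum of 2.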

8. Terraform for repeatable infrastructure

If you click everything manually in the console, it will work once and confuse you later.

Use Infrastructure as Code.

Terraform keeps the setup repeatable, reviewable, and easier to recover.

Example: S3 buckets

resource "aws_s3_bucket" "raw_data" {
  bucket = "my-ai-raw-data-bucket"
}

resource "aws_s3_bucket" "processed_data" {
  bucket = "my-ai-processed-data-bucket"
}

resource "aws_s3_bucket" "model_artifacts" {
  bucket = "my-ai-model-artifacts-bucket"
}

Example: ECR repository

resource "aws_ecr_repository" "inference_repo" {
  name = "ai-inference-service"

  image_scanning_configuration {
    scan_on_push = true
  }
}

Example: CloudWatch log group

resource "aws_cloudwatch_log_group" "inference_logs" {
  name              = "/ecs/ai-inference"
  retention_in_days = 14
}

That is the kind of setup you want in Git, not in your memory.

9. CI/CD pipeline

A good AI stack is not just about the model. It is about safe delivery.

Typical CI/CD flow:

  1. developer pushes to main
  2. tests run
  3. Docker image builds
  4. image gets pushed to ECR
  5. Terraform plan and apply run if needed
  6. ECS or SageMaker deployment updates
  7. health checks verify rollout
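Step 7 is worth spelling out, because one green health check right after deploy proves very little. A minimal sketch, assuming you pass in your own `check` function (for example, one that calls /health and returns True on HTTP 200): require several consecutive successes before calling the rollout good. The function name and defaults are hypothetical.

```python
import time


def wait_until_healthy(check, required_successes=3, max_attempts=10,
                       delay_seconds=0.0, sleep=time.sleep):
    """Return True once `check` passes `required_successes` times in a row."""
    streak = 0
    for _ in range(max_attempts):
        streak = streak + 1 if check() else 0  # any failure resets the streak
        if streak >= required_successes:
            return True
        sleep(delay_seconds)
    return False


# Simulated rollout: one failed probe while the container warms up, then healthy.
results = iter([False, True, True, True])
print(wait_until_healthy(lambda: next(results)))
```

If the function returns False, the pipeline should fail the job and leave the previous version serving traffic.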

GitHub Actions example

name: Deploy AI Inference

on:
  push:
    branches: [main]

jobs:
  deploy:
    runs-on: ubuntu-latest

    steps:
      - name: Checkout
        uses: actions/checkout@v4

      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-east-1

      - name: Login to ECR
        uses: aws-actions/amazon-ecr-login@v2

      - name: Build image
        run: docker build -t ai-inference-service .

      # Use a block scalar (|) so the shell sees the backslash line continuation correctly
      - name: Tag image
        run: |
          docker tag ai-inference-service:latest \
            123456789012.dkr.ecr.us-east-1.amazonaws.com/ai-inference-service:latest

      - name: Push image
        run: |
          docker push \
            123456789012.dkr.ecr.us-east-1.amazonaws.com/ai-inference-service:latest

This removes a lot of manual deployment risk.

10. Security design

This is the part that saves you from future regret.

A minimum serious AWS setup should include:

  • private subnets for compute and databases
  • IAM least privilege for services and pipelines
  • Secrets Manager instead of hardcoded credentials
  • KMS encryption for S3, databases, and secrets
  • Security Groups with tight inbound and outbound rules
  • CloudTrail for audit logs
  • WAF for basic web protection

In plain words

Security is basically answering four questions:

  1. who can access the system?
  2. what is exposed to the internet?
  3. where are the secrets stored?
  4. how do you know when something weird happens?

If you cannot answer those clearly, the architecture is not production ready yet.
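To make "IAM least privilege" less abstract, here is a sketch that builds a policy document for the inference service: it may read model artifacts under one prefix and nothing else. The bucket name comes from the Terraform example earlier in this post. The `models/` prefix and the helper name are assumptions.

```python
import json


def read_only_artifact_policy(bucket: str, prefix: str) -> dict:
    """IAM policy document allowing only s3:GetObject under one bucket prefix."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ReadModelArtifacts",
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"arn:aws:s3:::{bucket}/{prefix}*"],
            }
        ],
    }


policy = read_only_artifact_policy("my-ai-model-artifacts-bucket", "models/")
print(json.dumps(policy, indent=2))
```

There is no write, delete, or list permission, and no access to the raw or processed buckets. If the inference container is compromised, the blast radius stays small.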

11. Monitoring and observability

Do not deploy AI and just hope for the best.

You need visibility into:

  • latency
  • 4xx and 5xx errors
  • request volume
  • CPU and memory
  • container restarts
  • model accuracy trends
  • drift signals
  • infrastructure cost
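Drift is the least obvious item on that list, so here is one common way to put a number on it: the Population Stability Index (PSI), which compares the distribution of a feature at training time with what production is seeing now. This is a sketch under my own assumptions (equal-width bins, a small floor for empty bins), not a built-in CloudWatch metric.

```python
import math


def population_stability_index(expected, actual, bins=10):
    """PSI = sum over bins of (actual% - expected%) * ln(actual% / expected%)."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if all expected values are equal

    def proportions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1  # out-of-range values go to the edge bins
        # A small floor avoids log(0) when a bin is empty.
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


training_lengths = list(range(100))             # hypothetical text_length values at training time
live_lengths = [v + 50 for v in range(100)]     # production inputs have shifted upward
print(population_stability_index(training_lengths, training_lengths))
print(population_stability_index(training_lengths, live_lengths))
```

Identical distributions score 0, and the score grows as production data moves away from the training data. You could publish the value as a custom metric and alarm on it like any other number.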

A basic but solid AWS observability stack includes:

  • CloudWatch Logs for service logs
  • CloudWatch Metrics for runtime numbers
  • CloudWatch Alarms for threshold-based alerts
  • AWS X-Ray for tracing if you need deeper request visibility
  • SNS for notifying humans when something breaks

Good alarms to set

  • p95 latency above target
  • error rate spike
  • memory above 80 percent
  • CPU above 80 percent
  • unhealthy target count increase
  • training job failure
  • endpoint scaling anomaly
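The first two alarms can be sketched in a few lines. CloudWatch computes percentiles and evaluates thresholds for you in practice, so this is only meant to show what the alarm is actually checking. The 300 ms target and 1 percent error budget are assumed values, and the percentile uses the simple nearest-rank method.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def evaluate_alarms(latencies_ms, status_codes, p95_target_ms=300.0, max_error_rate=0.01):
    """Summarize one evaluation window and flag which alarms would fire."""
    p95 = percentile(latencies_ms, 95)
    error_rate = sum(1 for code in status_codes if code >= 500) / len(status_codes)
    return {
        "p95_ms": p95,
        "error_rate": error_rate,
        "latency_alarm": p95 > p95_target_ms,
        "error_alarm": error_rate > max_error_rate,
    }


# Hypothetical window: 100 requests, latencies 1..100 ms, two server errors.
print(evaluate_alarms(list(range(1, 101)), [200] * 98 + [500] * 2))
```

Alarming on p95 instead of the average is deliberate: averages hide the slow tail that users actually feel.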

12. Recommended production stack

If you want a practical end-to-end stack without overengineering, this is a strong setup:

  • Route 53 for DNS
  • CloudFront for edge delivery
  • WAF for filtering bad traffic
  • ALB for routing traffic to your app
  • ECS Fargate for the inference API
  • ECR for container images
  • S3 for raw data, processed data, and model artifacts
  • Glue for ETL
  • SageMaker for training and registry workflows
  • Aurora PostgreSQL or DynamoDB for metadata
  • ElastiCache Redis for caching
  • CloudWatch for observability
  • Secrets Manager for secrets
  • Terraform for IaC
  • GitHub Actions for CI/CD

This is technical, scalable, and still realistic for a serious project.

13. Best default architecture choice

If you are building from scratch and want a clean answer, this is my default take:

  • keep raw and processed data in S3
  • use Glue for ETL
  • train in SageMaker
  • store model artifacts in S3 and registry metadata in SageMaker or your platform layer
  • serve inference from ECS Fargate with FastAPI
  • put CloudWatch, IAM, Secrets Manager, and Terraform around everything

That setup gives you:

  • managed training
  • controlled deployment
  • simpler inference than full Kubernetes
  • room to scale later

14. Final takeaway

Real AI infrastructure is not just the model.

It is the full system around the model:

  • data coming in cleanly
  • training jobs running reliably
  • versions tracked properly
  • inference exposed safely
  • deployments automated
  • monitoring always on

That is what turns a cool AI demo into production engineering.

If you can explain and build this pipeline clearly, you are already thinking more like an ML platform engineer than someone just calling an API.

References

  1. AWS, What is Amazon SageMaker? https://aws.amazon.com/sagemaker/
  2. AWS, What is Amazon ECS? https://aws.amazon.com/ecs/
  3. AWS, What is AWS Glue? https://aws.amazon.com/glue/
  4. AWS, What is Amazon CloudWatch? https://aws.amazon.com/cloudwatch/
  5. AWS, What is Amazon S3? https://aws.amazon.com/s3/
  6. HashiCorp, Terraform Use Cases https://developer.hashicorp.com/terraform/intro/use-cases
