
Building a Cost-Effective AutoML Platform on AWS: $10-25/month vs $150+ for SageMaker Endpoints

TL;DR: I built a serverless AutoML platform that trains ML models for ~$10-25/month. Upload CSV, select target column, get a trained model. No ML expertise required.

Prerequisites

To deploy this project yourself, you'll need:

  • AWS Account with admin access
  • AWS CLI v2 configured (aws configure)
  • Terraform >= 1.5
  • Docker installed and running
  • Node.js 18+ and pnpm (for frontend)
  • Python 3.11+ (for local development)

⏱️ Deployment time: ~15 minutes from clone to working platform

The Problem

AWS SageMaker Autopilot is powerful but expensive for prototyping. While training has a free tier (50 hours/month for 2 months), the real cost comes from real-time inference endpoints: a single ml.c5.xlarge endpoint costs ~$150/month running 24/7. I needed something simpler and cheaper for side projects.

Goals:

  • Upload CSV → Get trained model (.pkl)
  • Auto-detect classification vs regression
  • Generate EDA reports automatically
  • Cost under $25/month

Architecture Decision: Why Lambda + Batch (Not Containers Everywhere)

The key insight: the ML dependencies (265MB) exceed Lambda's 250MB (unzipped) deployment package limit, but the API layer doesn't need them.

Split architecture benefits:

  • Lambda: Fast cold starts (~200ms), cheap ($0.0000166/GB-sec)
  • Batch/Fargate Spot: 70% cheaper than on-demand, handles 15+ min jobs
  • No always-on containers = no idle costs
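
To make the split concrete, here's a minimal sketch of the API side (illustrative only, not the repository's actual handler): the FastAPI app is wrapped with Mangum for Lambda and ships with no ML dependencies at all.

# Minimal API-side Lambda entry point (illustrative sketch)
from fastapi import FastAPI
from mangum import Mangum

app = FastAPI(title="AutoML Lite API")

@app.get("/health")
def health():
    # No ML imports here, so the package stays small and cold starts stay fast
    return {"status": "ok"}

# Lambda handler entry point (e.g. "main.handler" in the Terraform config)
handler = Mangum(app)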

Data Flow

Tech Stack

| Component | Technology | Why |
|---|---|---|
| Backend API | FastAPI + Mangum | Async, auto-docs, Lambda-ready |
| Training | FLAML + scikit-learn | Fast AutoML, production-ready |
| Frontend | Next.js 16 + Tailwind | SSR support via Amplify |
| Infrastructure | Terraform | Reproducible, multi-env |
| CI/CD | GitHub Actions + OIDC | No stored AWS credentials |

Key Implementation Details

1. Smart Problem Type Detection

The UI automatically detects whether the selected target column calls for classification or regression:

# Classification if: <20 unique values OR <5% unique ratio
def detect_problem_type(column, row_count):
    unique_count = column.nunique()
    unique_ratio = unique_count / row_count

    if unique_count < 20 or unique_ratio < 0.05:
        return 'classification'
    return 'regression'
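
For example, applied to a dataset loaded with pandas (assuming the target column is named target):

import pandas as pd

df = pd.read_csv("train.csv")
problem_type = detect_problem_type(df["target"], len(df))
print(problem_type)  # 'classification' or 'regression'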

2. Environment Variable Cascade (Critical Pattern)

The training container runs autonomously on Batch. It receives ALL context via environment variables:

Terraform → Lambda env vars → batch_service.py → containerOverrides → train.py

If you add a parameter to train.py, you MUST also add it to containerOverrides in batch_service.py.

# batch_service.py
container_overrides = {
    'environment': [
        {'name': 'DATASET_ID', 'value': dataset_id},
        {'name': 'TARGET_COLUMN', 'value': target_column},
        {'name': 'JOB_ID', 'value': job_id},
        {'name': 'TIME_BUDGET', 'value': str(time_budget)},
        # ... all S3/DynamoDB configs
    ]
}
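
The overrides are then passed to AWS Batch when the job is submitted. A hedged sketch continuing the snippet above (the queue and job definition names are placeholders, not the project's real ones); on the other end, train.py reads the values back with os.environ:

import boto3

batch = boto3.client("batch")

response = batch.submit_job(
    jobName=f"train-{job_id}",
    jobQueue="automl-lite-queue",          # placeholder name
    jobDefinition="automl-lite-training",  # placeholder name
    containerOverrides=container_overrides,
)
print(response["jobId"])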

3. Auto-Calculated Time Budget

Based on dataset size:

| Rows | Time Budget |
|---|---|
| < 1K | 2 min |
| 1K-10K | 5 min |
| 10K-50K | 10 min |
| > 50K | 20 min |
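
In code this is a simple threshold lookup; a sketch based on the table above (the actual implementation may differ in boundaries or units, since FLAML's time_budget is expressed in seconds):

def calculate_time_budget_minutes(row_count: int) -> int:
    # Thresholds follow the table above
    if row_count < 1_000:
        return 2
    if row_count < 10_000:
        return 5
    if row_count < 50_000:
        return 10
    return 20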

4. Training Progress Tracking

Job status is tracked in near real time by polling DynamoDB every 5 seconds.
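
Conceptually the poller just reads the job item until it reaches a terminal state. A minimal sketch (in the real setup the frontend polls through the API; the table and attribute names here are assumptions):

import time
import boto3

dynamodb = boto3.resource("dynamodb")
jobs_table = dynamodb.Table("automl-lite-jobs")  # placeholder table name

def wait_for_job(job_id: str, interval_seconds: int = 5) -> dict:
    while True:
        item = jobs_table.get_item(Key={"job_id": job_id}).get("Item", {})
        if item.get("status") in ("COMPLETED", "FAILED"):
            return item
        time.sleep(interval_seconds)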

5. Generated Reports

EDA Report - Automatic data profiling:
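
As a rough idea of what such a report captures, here is a minimal pandas-based summary (illustrative only, not the project's actual report generator):

import pandas as pd

def basic_eda_summary(df: pd.DataFrame) -> dict:
    # Row/column counts, missing values and numeric statistics
    return {
        "rows": len(df),
        "columns": df.shape[1],
        "missing_per_column": df.isna().sum().to_dict(),
        "numeric_summary": df.describe().to_dict(),
    }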

Training Report - Model performance and feature importance.

CI/CD with GitHub Actions + OIDC

No AWS credentials stored in GitHub. Uses OIDC for secure, temporary authentication.

Required IAM Permissions (Least Privilege)

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "CoreServices",
      "Effect": "Allow",
      "Action": ["s3:*", "dynamodb:*", "lambda:*", "batch:*", "ecr:*"],
      "Resource": "arn:aws:*:*:*:automl-lite-*"
    },
    {
      "Sid": "APIGatewayAndAmplify",
      "Effect": "Allow",
      "Action": ["apigateway:*", "amplify:*"],
      "Resource": "*"
    },
    {
      "Sid": "IAMRoles",
      "Effect": "Allow",
      "Action": ["iam:*Role*", "iam:*RolePolicy*", "iam:PassRole"],
      "Resource": "arn:aws:iam::*:role/automl-lite-*"
    },
    {
      "Sid": "ServiceLinkedRoles",
      "Effect": "Allow",
      "Action": "iam:CreateServiceLinkedRole",
      "Resource": "arn:aws:iam::*:role/aws-service-role/*"
    },
    {
      "Sid": "Networking",
      "Effect": "Allow",
      "Action": ["ec2:Describe*", "ec2:*SecurityGroup*", "ec2:*Tags"],
      "Resource": "*"
    },
    {
      "Sid": "Logging",
      "Effect": "Allow",
      "Action": "logs:*",
      "Resource": "arn:aws:logs:*:*:log-group:/aws/*/automl-lite-*"
    }
  ]
}

Removed as not needed: CloudFront, X-Ray, and ECS (Batch manages the underlying ECS resources internally).

Deployment Flow

Push to dev  → Auto-deploy to DEV
Push to main → Plan → Manual Approval → Deploy to PROD

Granular deployments save time:

  • Lambda only: ~2 min
  • Training container: ~3 min
  • Frontend: ~3 min
  • Full infrastructure: ~10 min

Cost Breakdown (20 jobs/month)

| Service | Monthly Cost |
|---|---|
| AWS Amplify | $5-15 |
| Lambda + API Gateway | $1-2 |
| Batch (Fargate Spot) | $2-5 |
| S3 + DynamoDB | $1-2 |
| Total | $10-25 |

Comparison:

  • SageMaker with real-time endpoint: ~$150-300/month (ml.c5.xlarge 24/7)
  • This solution: $10-25/month (80-90% cheaper)
  • No always-on costs: pay only when training

Feature Comparison: SageMaker vs AutoML Lite

| Feature | SageMaker Autopilot | AWS AutoML Lite |
|---|---|---|
| Monthly Cost | ~$150+ (with endpoint) | $10-25 |
| Setup Time | 30+ min (Studio setup) | ~15 min |
| Portable Models | ❌ Locked to SageMaker | ✅ Download .pkl |
| ML Expertise Required | Medium | None |
| Auto Problem Detection | ✅ Yes | ✅ Yes |
| EDA Reports | ❌ Manual | ✅ Automatic |
| Infrastructure as Code | ❌ Console-heavy | ✅ Full Terraform |
| Cold Start | N/A (always-on) | ~200ms (Lambda) |
| Best For | Production ML pipelines | Prototyping & side projects |

Using Your Trained Model

Download the .pkl file and use Docker for predictions:

# Build prediction container
docker build -f scripts/Dockerfile.predict -t automl-predict .

# Show model info
docker run --rm -v ${PWD}:/data automl-predict /data/model.pkl --info

# Predict from CSV
docker run --rm -v ${PWD}:/data automl-predict \
  /data/model.pkl -i /data/test.csv -o /data/predictions.csv
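
Alternatively, load the model directly in Python (this assumes the .pkl is a scikit-learn/FLAML-style estimator exposing .predict, with any preprocessing bundled into it):

import pickle
import pandas as pd

with open("model.pkl", "rb") as f:
    model = pickle.load(f)

X = pd.read_csv("test.csv")
predictions = model.predict(X)
pd.DataFrame({"prediction": predictions}).to_csv("predictions.csv", index=False)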

Lessons Learned

  1. Container size matters: 265MB ML deps forced the Lambda/Batch split
  2. Environment variable cascade: Document your data flow or debugging becomes painful
  3. Fargate Spot is great: 70% savings, rare interruptions for short jobs
  4. FLAML over AutoGluon: Smaller footprint, faster training, similar results

What's Next? (Future Roadmap)

  • [ ] ONNX Export - Deploy models to edge devices
  • [ ] Model Comparison - Train multiple models, compare metrics side-by-side
  • [ ] Real-time Updates - WebSocket instead of polling
  • [ ] Multi-user Support - Cognito authentication
  • [ ] Hyperparameter UI - Fine-tune FLAML settings from the frontend
  • [ ] Email Notifications - Get notified when training completes

Contributions welcome! Check the GitHub Issues for good first issues.

Try It Yourself

GitHub: cristofima/AWS-AutoML-Lite

git clone https://github.com/cristofima/AWS-AutoML-Lite.git
cd AWS-AutoML-Lite/infrastructure/terraform
terraform init && terraform apply
