TL;DR: I built a serverless AutoML platform that trains ML models for ~$10-25/month. Upload CSV, select target column, get a trained model. No ML expertise required.
Prerequisites
To deploy this project yourself, you'll need:
- AWS Account with admin access
- AWS CLI v2 configured (`aws configure`)
- Terraform >= 1.5
- Docker installed and running
- Node.js 18+ and pnpm (for frontend)
- Python 3.11+ (for local development)
⏱️ Deployment time: ~15 minutes from clone to working platform
The Problem
AWS SageMaker Autopilot is powerful but expensive for prototyping. While training has a free tier (50 hours/month for 2 months), the real cost comes from real-time inference endpoints: a single ml.c5.xlarge endpoint costs ~$150/month running 24/7. I needed something simpler and cheaper for side projects.
Goals:
- Upload CSV → Get trained model (`.pkl`)
- Auto-detect classification vs regression
- Generate EDA reports automatically
- Cost under $25/month
Architecture Decision: Why Lambda + Batch (Not Containers Everywhere)
The key insight: ML dependencies (265MB) exceed Lambda's 250MB limit, but the API doesn't need them.
Split architecture benefits:
- Lambda: Fast cold starts (~200ms), cheap ($0.0000166/GB-sec)
- Batch/Fargate Spot: 70% cheaper than on-demand, handles 15+ min jobs
- No always-on containers = no idle costs
Data Flow
Upload CSV → S3 → API (Lambda) → AWS Batch training job (Fargate Spot) → trained model + reports written back to S3, job status tracked in DynamoDB.
Tech Stack
| Component | Technology | Why |
|---|---|---|
| Backend API | FastAPI + Mangum | Async, auto-docs, Lambda-ready |
| Training | FLAML + scikit-learn | Fast AutoML, production-ready |
| Frontend | Next.js 16 + Tailwind | SSR support via Amplify |
| Infrastructure | Terraform | Reproducible, multi-env |
| CI/CD | GitHub Actions + OIDC | No stored AWS credentials |
Key Implementation Details
1. Smart Problem Type Detection
The UI automatically detects if a column should be classification or regression:
```python
# Classification if: <20 unique values OR <5% unique ratio
def detect_problem_type(column, row_count):
    unique_count = column.nunique()
    unique_ratio = unique_count / row_count
    if unique_count < 20 or unique_ratio < 0.05:
        return 'classification'
    return 'regression'
```
2. Environment Variable Cascade (Critical Pattern)
Training container runs autonomously on Batch. It receives ALL context via environment variables:
Terraform → Lambda env vars → batch_service.py → containerOverrides → train.py
If you add a parameter to train.py, you MUST also add it to containerOverrides in batch_service.py.
```python
# batch_service.py
container_overrides = {
    'environment': [
        {'name': 'DATASET_ID', 'value': dataset_id},
        {'name': 'TARGET_COLUMN', 'value': target_column},
        {'name': 'JOB_ID', 'value': job_id},
        {'name': 'TIME_BUDGET', 'value': str(time_budget)},
        # ... all S3/DynamoDB configs
    ]
}
```
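On the receiving side, a minimal sketch of how train.py could read the cascaded variables. The variable names mirror the `containerOverrides` above; the fail-fast validation is my own addition for illustration, not confirmed from the repo:

```python
import os

# These names must match the containerOverrides block in batch_service.py;
# failing fast on a missing one surfaces cascade bugs immediately.
REQUIRED_VARS = ["DATASET_ID", "TARGET_COLUMN", "JOB_ID", "TIME_BUDGET"]

def load_job_config() -> dict:
    """Read all job context from environment variables set by Batch."""
    missing = [v for v in REQUIRED_VARS if v not in os.environ]
    if missing:
        raise RuntimeError(f"Missing env vars (check containerOverrides): {missing}")
    return {
        "dataset_id": os.environ["DATASET_ID"],
        "target_column": os.environ["TARGET_COLUMN"],
        "job_id": os.environ["JOB_ID"],
        "time_budget": int(os.environ["TIME_BUDGET"]),
    }
```

Crashing at startup with a list of missing variables is much easier to debug from CloudWatch logs than a `KeyError` halfway through training.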
3. Auto-Calculated Time Budget
Based on dataset size:
| Rows | Time Budget |
|---|---|
| < 1K | 2 min |
| 1K-10K | 5 min |
| 10K-50K | 10 min |
| > 50K | 20 min |
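The table above maps directly to a small helper. A sketch (the function name `calc_time_budget` is hypothetical; thresholds are taken from the table):

```python
def calc_time_budget(row_count: int) -> int:
    """Map dataset size to a FLAML time budget in seconds."""
    if row_count < 1_000:
        return 2 * 60
    if row_count < 10_000:
        return 5 * 60
    if row_count < 50_000:
        return 10 * 60
    return 20 * 60
```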
4. Training Progress Tracking
Real-time status via DynamoDB polling (every 5 seconds):
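A sketch of the polling loop, written against a generic `fetch_status` callable so it can be tested without AWS. In the real service that callable would wrap a DynamoDB `GetItem` on the job record; the status values shown are assumptions for illustration:

```python
import time
from typing import Callable

TERMINAL_STATES = {"COMPLETED", "FAILED"}  # assumed status values

def poll_job_status(fetch_status: Callable[[], str],
                    interval: float = 5.0, timeout: float = 1800.0) -> str:
    """Poll until the job reaches a terminal state or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status()  # e.g. a DynamoDB GetItem on the job item
        if status in TERMINAL_STATES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError("training job did not reach a terminal state")
        time.sleep(interval)
```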
5. Generated Reports
EDA Report - Automatic data profiling.
Training Report - Model performance and feature importance.
CI/CD with GitHub Actions + OIDC
No AWS credentials stored in GitHub. Uses OIDC for secure, temporary authentication.
Required IAM Permissions (Least Privilege)
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "CoreServices",
      "Effect": "Allow",
      "Action": ["s3:*", "dynamodb:*", "lambda:*", "batch:*", "ecr:*"],
      "Resource": "arn:aws:*:*:*:automl-lite-*"
    },
    {
      "Sid": "APIGatewayAndAmplify",
      "Effect": "Allow",
      "Action": ["apigateway:*", "amplify:*"],
      "Resource": "*"
    },
    {
      "Sid": "IAMRoles",
      "Effect": "Allow",
      "Action": ["iam:*Role*", "iam:*RolePolicy*", "iam:PassRole"],
      "Resource": "arn:aws:iam::*:role/automl-lite-*"
    },
    {
      "Sid": "ServiceLinkedRoles",
      "Effect": "Allow",
      "Action": "iam:CreateServiceLinkedRole",
      "Resource": "arn:aws:iam::*:role/aws-service-role/*"
    },
    {
      "Sid": "Networking",
      "Effect": "Allow",
      "Action": ["ec2:Describe*", "ec2:*SecurityGroup*", "ec2:*Tags"],
      "Resource": "*"
    },
    {
      "Sid": "Logging",
      "Effect": "Allow",
      "Action": "logs:*",
      "Resource": "arn:aws:logs:*:*:log-group:/aws/*/automl-lite-*"
    }
  ]
}
```
Removed (not needed): CloudFront, X-Ray, and ECS (Batch manages the underlying ECS resources internally).
Deployment Flow
Push to dev → Auto-deploy to DEV
Push to main → Plan → Manual Approval → Deploy to PROD
Granular deployments save time:
- Lambda only: ~2 min
- Training container: ~3 min
- Frontend: ~3 min
- Full infrastructure: ~10 min
Cost Breakdown (20 jobs/month)
| Service | Monthly Cost |
|---|---|
| AWS Amplify | $5-15 |
| Lambda + API Gateway | $1-2 |
| Batch (Fargate Spot) | $2-5 |
| S3 + DynamoDB | $1-2 |
| Total | $10-25 |
Comparison:
- SageMaker with real-time endpoint: ~$150-300/month (ml.c5.xlarge 24/7)
- This solution: $10-25/month (80-90% cheaper)
- No always-on costs: pay only when training
Feature Comparison: SageMaker vs AutoML Lite
| Feature | SageMaker Autopilot | AWS AutoML Lite |
|---|---|---|
| Monthly Cost | ~$150+ (with endpoint) | $10-25 |
| Setup Time | 30+ min (Studio setup) | ~15 min |
| Portable Models | ❌ Locked to SageMaker | ✅ Download .pkl |
| ML Expertise Required | Medium | None |
| Auto Problem Detection | ✅ Yes | ✅ Yes |
| EDA Reports | ❌ Manual | ✅ Automatic |
| Infrastructure as Code | ❌ Console-heavy | ✅ Full Terraform |
| Cold Start | N/A (always-on) | ~200ms (Lambda) |
| Best For | Production ML pipelines | Prototyping & side projects |
Using Your Trained Model
Download the .pkl file and use Docker for predictions:
```bash
# Build prediction container
docker build -f scripts/Dockerfile.predict -t automl-predict .

# Show model info
docker run --rm -v ${PWD}:/data automl-predict /data/model.pkl --info

# Predict from CSV
docker run --rm -v ${PWD}:/data automl-predict \
  /data/model.pkl -i /data/test.csv -o /data/predictions.csv
```
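If you'd rather skip Docker, the downloaded artifact can also be loaded directly in Python. A sketch, assuming the model was serialized with pickle and exposes a scikit-learn-style `predict` (you'll want the same FLAML/scikit-learn versions installed as at training time):

```python
import pickle

import pandas as pd

def predict_csv(model_path: str, input_csv: str, output_csv: str) -> pd.DataFrame:
    """Load a downloaded model.pkl and write predictions for a CSV."""
    with open(model_path, "rb") as f:
        model = pickle.load(f)  # assumes pickle serialization
    df = pd.read_csv(input_csv)
    df["prediction"] = model.predict(df)
    df.to_csv(output_csv, index=False)
    return df
```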
Lessons Learned
- Container size matters: 265MB ML deps forced the Lambda/Batch split
- Environment variable cascade: Document your data flow or debugging becomes painful
- Fargate Spot is great: 70% savings, rare interruptions for short jobs
- FLAML over AutoGluon: Smaller footprint, faster training, similar results
What's Next? (Future Roadmap)
- [ ] ONNX Export - Deploy models to edge devices
- [ ] Model Comparison - Train multiple models, compare metrics side-by-side
- [ ] Real-time Updates - WebSocket instead of polling
- [ ] Multi-user Support - Cognito authentication
- [ ] Hyperparameter UI - Fine-tune FLAML settings from the frontend
- [ ] Email Notifications - Get notified when training completes
Contributions welcome! Check the GitHub Issues for good first issues.
Try It Yourself
GitHub: cristofima/AWS-AutoML-Lite
```bash
git clone https://github.com/cristofima/AWS-AutoML-Lite.git
cd AWS-AutoML-Lite/infrastructure/terraform
terraform init && terraform apply
```