Diabetes Detection on AWS — Step-by-Step Complete Guide
A practical, step-by-step walkthrough to build a production-ready Diabetes Prediction web app using AWS SageMaker, S3, EC2, SNS, DynamoDB (or MongoDB Atlas), API Gateway, and Amplify. Includes code, CLI commands, deployment tips, and troubleshooting.
TL;DR (One-line)
Train a scikit-learn model (SageMaker or locally) → store it in S3 → host a Flask API on EC2 that loads the model from S3, predicts & stores results (DynamoDB or MongoDB), and notifies users via SNS → expose via API Gateway → host frontend with AWS Amplify.
GitHub reference: https://github.com/naman-0804/Diabetes_Prediction_onAWS
Why This Architecture?
- SageMaker / Local Training — Scalable, repeatable training
- S3 — Reliable model artifact storage
- EC2 + Flask — Easy-to-debug backend for predictions (can be swapped for Lambda)
- DynamoDB or MongoDB Atlas — Persistent user/prediction storage for retraining & analytics
- SNS — Email notification to users
- API Gateway — Secure, managed public endpoint
- Amplify — Fast GitHub → hosting pipeline for frontend
Prerequisites
- AWS account with permissions (or ability to create resources)
- AWS CLI installed and configured or console access
- Python 3.8+ locally
- WinSCP (or scp) for file transfer to EC2: https://winscp.net/eng/download.php
- GitHub repo: https://github.com/naman-0804/Diabetes_Prediction_onAWS
- Basic knowledge of Flask, React (Amplify), and AWS console
Project Roadmap (High Level)
- Prepare dataset & train model (SageMaker or local)
- Save model artifact (modelaws.joblib) and upload to S3
- Create SNS topic and subscription (email)
- Create DynamoDB table or MongoDB Atlas cluster (choose one)
- Launch EC2 instance, configure IAM role, transfer code, and deploy Flask backend
- Create API Gateway to front the EC2 endpoint (optional but recommended)
- Deploy frontend with Amplify (connect to GitHub)
- Test end-to-end and enable monitoring
Step 1 — Train the Model (SageMaker or Locally)
Option A — Quick Local Training (scikit-learn)
Create a notebook train_local.ipynb or a script train.py:
# train.py (minimal example using Pima Indian diabetes dataset)
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
import joblib
# load dataset (change path as needed)
df = pd.read_csv('diabetes.csv')  # Pima Indians Diabetes dataset (e.g., downloaded from Kaggle)
X = df.drop('Outcome', axis=1)
y = df['Outcome']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
# save model
joblib.dump(model, 'modelaws.joblib')
print("Saved modelaws.joblib")
Upload to S3:
aws s3 mb s3://your-bucket-name # create bucket (if not exists)
aws s3 cp modelaws.joblib s3://your-bucket-name/modelaws.joblib
Option B — Train in SageMaker (Recommended for Production)
Use a SageMaker notebook instance or Studio to run a similar scikit-learn script and save the artifact to S3 (SageMaker exports model artifacts to S3 directly). You can adapt the local train.py into a SageMaker training script.
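A rough sketch using the SageMaker Python SDK's SKLearn estimator is below. The entry point must follow SageMaker conventions (read data from the training channel, write the model to SM_MODEL_DIR), and the framework version, instance type, role ARN, and S3 paths are assumptions to adapt:
# sagemaker_launch.py (sketch; run from a notebook or any machine with SageMaker permissions)
import sagemaker
from sagemaker.sklearn.estimator import SKLearn

role = "arn:aws:iam::ACCOUNT_ID:role/SageMakerExecutionRole"  # your execution role
estimator = SKLearn(
    entry_point="train.py",        # adapted to read SM_CHANNEL_TRAIN and write to SM_MODEL_DIR
    framework_version="1.2-1",     # a scikit-learn container version; pick one your region supports
    instance_type="ml.m5.large",
    role=role,
    sagemaker_session=sagemaker.Session(),
)
# the "train" channel appears inside the container as SM_CHANNEL_TRAIN
estimator.fit({"train": "s3://your-bucket-name/data/diabetes.csv"})
print("Model artifact:", estimator.model_data)  # S3 path to the trained model.tar.gz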
Step 2 — Create S3 Bucket & Upload Model
# create bucket (us-east-1 needs no LocationConstraint; other regions require one)
aws s3api create-bucket --bucket your-bucket-name --region us-east-1
# for any other region, e.g. eu-west-1:
# aws s3api create-bucket --bucket your-bucket-name --region eu-west-1 \
#   --create-bucket-configuration LocationConstraint=eu-west-1
# upload model
aws s3 cp modelaws.joblib s3://your-bucket-name/modelaws.joblib
Security note: Do not make the bucket public. Use least privilege IAM role for EC2 to read from S3.
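If you prefer boto3 to the CLI, a minimal upload-and-verify sketch (bucket name and region are placeholders):
import boto3

s3 = boto3.client("s3", region_name="us-east-1")
s3.upload_file("modelaws.joblib", "your-bucket-name", "modelaws.joblib")
# confirm the object landed and check its size
head = s3.head_object(Bucket="your-bucket-name", Key="modelaws.joblib")
print("Uploaded", head["ContentLength"], "bytes")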
Step 3 — Create SNS Topic (Email Notifications)
# create topic
aws sns create-topic --name diabetes-results
# returns TopicArn (copy it)
# subscribe an email (user will need to confirm)
aws sns subscribe --topic-arn arn:aws:sns:REGION:ACCOUNT_ID:diabetes-results \
--protocol email --notification-endpoint user@example.com
# publish test message
aws sns publish --topic-arn arn:aws:sns:... --message "Test diabetes result"
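The same steps scripted with boto3, if you prefer (region, topic name, and email address are placeholders):
import boto3

sns = boto3.client("sns", region_name="us-east-1")
topic_arn = sns.create_topic(Name="diabetes-results")["TopicArn"]  # idempotent for an existing topic
sns.subscribe(TopicArn=topic_arn, Protocol="email", Endpoint="user@example.com")
# the subscriber must confirm via the email AWS sends before messages are delivered
sns.publish(TopicArn=topic_arn, Message="Test diabetes result", Subject="Test")
print("Topic ARN:", topic_arn)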
Step 4 — Storage for User Data: DynamoDB or MongoDB Atlas
Important: The architecture above lists MongoDB as an option, but the repo's backend.py uses DynamoDB. Pick one; both options are covered below.
Option A — DynamoDB (Serverless AWS Native)
Create the table (partition key: email, sort key: timestamp):
aws dynamodb create-table \
--table-name DiabetesResults \
--attribute-definitions AttributeName=email,AttributeType=S AttributeName=timestamp,AttributeType=S \
--key-schema AttributeName=email,KeyType=HASH AttributeName=timestamp,KeyType=RANGE \
--provisioned-throughput ReadCapacityUnits=5,WriteCapacityUnits=5
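A quick boto3 sanity check against the new table. Note that the DynamoDB resource API rejects Python floats, so decimal values must be passed as Decimal (the sample item below is a placeholder):
from decimal import Decimal
import boto3

dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
table = dynamodb.Table("DiabetesResults")

table.put_item(Item={
    "email": "test@example.com",
    "timestamp": "2024-01-01T00:00:00",
    "BMI": Decimal("25.5"),       # floats go in as Decimal
    "PredictedOutcome": 0,
})
resp = table.get_item(Key={"email": "test@example.com", "timestamp": "2024-01-01T00:00:00"})
print(resp.get("Item"))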
Option B — MongoDB Atlas (If You Prefer MongoDB)
- Create a free cluster at https://www.mongodb.com/cloud/atlas
- Create a database diabetes_db and a collection results
- Whitelist your EC2 public IP (or use VPC peering)
- Add a database user and copy the connection string
Example pymongo snippet to store results:
from pymongo import MongoClient
mongo_uri = "mongodb+srv://user:pass@cluster0.mongodb.net/diabetes_db?retryWrites=true&w=majority"
client = MongoClient(mongo_uri)
db = client['diabetes_db']
collection = db['results']
collection.insert_one(result_data)
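Optionally add a compound index so per-user history queries stay fast; this continues the snippet above (the query shape matches the result documents written by the backend):
# 'collection' is the results collection from the snippet above
collection.create_index([("email", 1), ("timestamp", -1)])

# fetch the latest result for a user, newest first
latest = collection.find_one({"email": "test@example.com"}, sort=[("timestamp", -1)])
print(latest)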
Step 5 — Launch EC2 & Deploy Backend
EC2 Launch Checklist (Quick)
- Launch instance (Amazon Linux 2 or Ubuntu LTS)
- Create / use a .pem key pair
- Security Group:
  - SSH (22) from your IP
  - HTTP (80) and HTTPS (443) if you use nginx
  - If you directly expose Flask (not recommended), open port 5000 (prefer behind nginx/API Gateway)
IMPORTANT: Create an IAM role for the EC2 instance that allows:
- s3:GetObject on your bucket
- sns:Publish & sns:Subscribe on the SNS topic
- dynamodb:PutItem on your table (if using DynamoDB)
Attach this role to the EC2 instance — do NOT store AWS keys in code.
Connect and Prepare Instance
Example for Amazon Linux 2:
# SSH connect
ssh -i yourkey.pem ec2-user@ec2-public-ip
# update
sudo yum update -y
sudo yum install -y python3 git
python3 -m venv venv
source venv/bin/activate
pip install --upgrade pip
Create requirements.txt:
flask
flask-cors
boto3
pandas
scikit-learn
joblib
gunicorn
pymongo # only if using MongoDB Atlas
Install:
pip install -r requirements.txt
Transfer Code
Use WinSCP (Windows) or scp to copy backend.py, requirements.txt, and other files to /home/ec2-user/app/.
Or git clone https://github.com/naman-0804/Diabetes_Prediction_onAWS.git
Step 6 — Improved Backend (Production-Ready Patterns)
Below is a polished backend.py based on the repo's original version, with environment variables, better validation, DynamoDB & MongoDB options, and IAM role usage (no static keys). Save it as backend.py:
# backend.py (improved, supports DynamoDB or MongoDB)
import os
import json
import boto3
import pandas as pd
import joblib
from decimal import Decimal
from flask import Flask, request, jsonify
from flask_cors import CORS
from datetime import datetime

app = Flask(__name__)
CORS(app)

# Configuration via environment variables
BUCKET_NAME = os.environ.get('MODEL_S3_BUCKET', 'your-bucket-name')
MODEL_KEY = os.environ.get('MODEL_S3_KEY', 'modelaws.joblib')
SNS_TOPIC_ARN = os.environ.get('SNS_TOPIC_ARN', '')
USE_MONGODB = os.environ.get('USE_MONGODB', 'false').lower() == 'true'
AWS_REGION = os.environ.get('AWS_REGION', 'us-east-1')

# AWS clients (the EC2 instance should have an IAM role with the required permissions)
s3 = boto3.client('s3', region_name=AWS_REGION)
sns_client = boto3.client('sns', region_name=AWS_REGION)
dynamodb = boto3.resource('dynamodb', region_name=AWS_REGION)

# DynamoDB table (if using)
DYNAMO_TABLE_NAME = os.environ.get('DYNAMO_TABLE_NAME', 'DiabetesResults')
table = None
if not USE_MONGODB:
    table = dynamodb.Table(DYNAMO_TABLE_NAME)

# Optional MongoDB (Atlas) setup
mongo_collection = None
if USE_MONGODB:
    from pymongo import MongoClient
    MONGO_URI = os.environ.get('MONGO_URI')  # set this in the EC2 environment
    client = MongoClient(MONGO_URI)
    mongo_collection = client.get_default_database()['results']

# Download model at startup
download_path = '/tmp/modelaws.joblib'
model = None
try:
    s3.download_file(BUCKET_NAME, MODEL_KEY, download_path)
    model = joblib.load(download_path)
    app.logger.info("Model loaded from S3.")
except Exception as e:
    app.logger.error("Error loading model from S3: %s", e)
    model = None


@app.route('/', methods=['GET'])
def health_check():
    return jsonify({'status': 'API is live'})


def validate_and_extract(data):
    fields = {
        'Pregnancies': int, 'Glucose': float, 'BloodPressure': float,
        'SkinThickness': float, 'Insulin': float, 'BMI': float,
        'DiabetesPedigreeFunction': float, 'Age': int
    }
    parsed = {}
    for key, t in fields.items():
        if key not in data:
            parsed[key] = 0 if t is not float else 0.0
        else:
            parsed[key] = t(data.get(key))
    return parsed


@app.route('/predict', methods=['POST'])
def predict():
    if model is None:
        return jsonify({'error': 'Model not loaded'}), 500
    try:
        data = request.get_json(force=True)
        email = data.get('email')
        if not email:
            return jsonify({'error': 'Email is required'}), 400

        inputs = validate_and_extract(data)
        input_df = pd.DataFrame([[
            inputs['Pregnancies'], inputs['Glucose'], inputs['BloodPressure'],
            inputs['SkinThickness'], inputs['Insulin'], inputs['BMI'],
            inputs['DiabetesPedigreeFunction'], inputs['Age']
        ]], columns=['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness',
                     'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age'])

        pred = int(model.predict(input_df)[0])

        # send SNS notification (publish only)
        if SNS_TOPIC_ARN:
            sns_client.publish(TopicArn=SNS_TOPIC_ARN,
                               Message=f"Diabetes Prediction Result: {pred}",
                               Subject='Diabetes Prediction Result')

        # prepare result document
        result_data = {
            'email': email,
            'timestamp': datetime.utcnow().isoformat(),
            **inputs,
            'PredictedOutcome': pred
        }

        # save to DynamoDB or MongoDB (compare with None: PyMongo collections
        # do not support truth-value testing)
        if USE_MONGODB and mongo_collection is not None:
            mongo_collection.insert_one(result_data)
        elif table is not None:
            # DynamoDB rejects Python floats, so convert them to Decimal
            item = {k: (Decimal(str(v)) if isinstance(v, float) else v)
                    for k, v in result_data.items()}
            table.put_item(Item=item)

        return jsonify({'PredictedLabel': pred})
    except Exception as e:
        app.logger.exception("Error in /predict")
        return jsonify({'error': str(e)}), 400


if __name__ == "__main__":
    app.run(host='0.0.0.0', port=int(os.environ.get('PORT', 5000)), debug=False)
Notes:
- Set env vars on the EC2 instance (or in a systemd service) for MODEL_S3_BUCKET, MODEL_S3_KEY, SNS_TOPIC_ARN, DYNAMO_TABLE_NAME, USE_MONGODB, MONGO_URI
- Prefer attaching an IAM role to the EC2 instance granting S3/DynamoDB/SNS read/write
Step 7 — API Gateway (Optional but Recommended)
- In the AWS Console → API Gateway → Create API (HTTP API or REST API)
- Create a POST route /predict integrated with:
  - an HTTP integration pointing to http://ec2-private-ip:8000/predict (if using a private integration behind an NLB/VPC link), or
  - the public EC2 URL http://ec2-public-ip/predict if nginx is set up to expose port 80
- Enable CORS if the frontend runs on a different domain
- Optionally use a custom domain & ACM certificate
Why use API Gateway?
Rate limiting, API keys, WAF, usage plans, logging, and a fixed endpoint even if EC2 IP changes (use Elastic IP for EC2 to avoid churn).
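If you would rather script the HTTP API than click through the console, boto3's "quick create" is a rough starting point (the target URL is a placeholder, and a private integration via VPC link needs more setup than shown):
import boto3

apigw = boto3.client("apigatewayv2", region_name="us-east-1")
# quick create: an HTTP API that proxies requests to the backend URL
api = apigw.create_api(
    Name="diabetes-api",
    ProtocolType="HTTP",
    Target="http://ec2-public-ip/predict",   # replace with your nginx/EC2 URL
)
print("Invoke URL:", api["ApiEndpoint"])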
Step 8 — Frontend & Amplify (Connect to GitHub)
- Create a React frontend (or use the existing frontend/ directory from the repo)
Example fetch call from frontend to API:
// sample POST to API Gateway endpoint
async function predict(formValues) {
const resp = await fetch('https://<api-id>.execute-api.<region>.amazonaws.com/prod/predict', {
method: 'POST',
headers: {'Content-Type':'application/json'},
body: JSON.stringify(formValues)
});
return resp.json();
}
- Deploy with the Amplify Console:
  - Connect Amplify to your GitHub repo & branch
  - Configure build settings (amplify.yml) if needed
  - Add environment variables in Amplify (e.g., API_URL)
Step 9 — Test End-to-End
Use curl to test the prediction endpoint:
curl -X POST https://<api-endpoint>/predict \
-H "Content-Type: application/json" \
-d '{
"email": "test@example.com",
"Pregnancies": 2,
"Glucose": 120,
"BloodPressure": 70,
"SkinThickness": 20,
"Insulin": 79,
"BMI": 25.5,
"DiabetesPedigreeFunction": 0.5,
"Age": 33
}'
Expected JSON:
{"PredictedLabel": 0}
Check:
- Email inbox for SNS confirmation and result
- DynamoDB table or MongoDB collection for the inserted record
- CloudWatch logs (EC2/cloudwatch agent or API Gateway logs) for errors
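The same end-to-end test as a small Python script (requires pip install requests; the endpoint is a placeholder):
import requests

payload = {
    "email": "test@example.com",
    "Pregnancies": 2, "Glucose": 120, "BloodPressure": 70,
    "SkinThickness": 20, "Insulin": 79, "BMI": 25.5,
    "DiabetesPedigreeFunction": 0.5, "Age": 33,
}
resp = requests.post("https://<api-endpoint>/predict", json=payload, timeout=30)
resp.raise_for_status()
print(resp.json())  # expect {"PredictedLabel": 0} or {"PredictedLabel": 1}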
Step 10 — Production Considerations & Improvements
- Use HTTPS: Set up ACM + CloudFront or nginx with certbot or use API Gateway with custom domain
- Use Elastic IP or an ALB/NLB in front of EC2 for stable endpoint
- Use Autoscaling and place app behind an ALB for traffic spikes
- Use CloudWatch: Collect logs and metrics (CPU, memory, latency)
- Implement authentication (Cognito / JWT) to secure API
- Implement a retraining pipeline: Pull new records from DynamoDB/MongoDB, retrain in SageMaker, version models in S3, and roll out updates (a rough sketch follows this list)
- Encrypt S3 objects (SSE) and enable server-side encryption on DynamoDB (if needed)
- Use parameter store or Secrets Manager for secrets (MongoDB URI), not environment variables if sensitive
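A rough sketch of that retraining step, assuming records live in the DynamoDB table from Step 4 and models are versioned under a models/ prefix in S3 (all names are placeholders; retraining on the model's own predictions is only a stand-in until verified labels are collected):
# retrain.py (sketch): pull stored records, retrain, upload a versioned model
from datetime import datetime
import boto3
import joblib
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

FEATURES = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
            "Insulin", "BMI", "DiabetesPedigreeFunction", "Age"]

table = boto3.resource("dynamodb", region_name="us-east-1").Table("DiabetesResults")

# scan the whole table (fine for small datasets; use DynamoDB exports/Athena at scale)
items, resp = [], table.scan()
items.extend(resp["Items"])
while "LastEvaluatedKey" in resp:
    resp = table.scan(ExclusiveStartKey=resp["LastEvaluatedKey"])
    items.extend(resp["Items"])

df = pd.DataFrame(items)
X = df[FEATURES].astype(float)          # DynamoDB returns numbers as Decimal; cast to float
y = df["PredictedOutcome"].astype(int)  # replace with verified labels when you have them

model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

version = datetime.utcnow().strftime("%Y%m%d%H%M%S")
joblib.dump(model, "/tmp/modelaws.joblib")
boto3.client("s3").upload_file("/tmp/modelaws.joblib", "your-bucket-name",
                               f"models/{version}/modelaws.joblib")
print("Uploaded model version", version)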
Troubleshooting Quick Tips
- Model not loaded — Verify BUCKET_NAME, MODEL_KEY, and that the EC2 IAM role has s3:GetObject
- 403 from DynamoDB/SNS — Check IAM policies attached to EC2 role
- Email not received — The subscriber must confirm SNS subscription (check spam folder)
- CORS errors — Enable CORS on API Gateway or respond with correct Access-Control headers from Flask
- Slow cold start — Warm the EC2 instance or use Lambda provisioned concurrency if migrating to serverless
Appendix A — Example requirements.txt
flask
flask-cors
boto3
pandas
scikit-learn
joblib
gunicorn
pymongo
Appendix B — IAM Policy JSON (Example for EC2 Role)
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject"],
      "Resource": ["arn:aws:s3:::your-bucket-name/*"]
    },
    {
      "Effect": "Allow",
      "Action": ["sns:Publish", "sns:Subscribe", "sns:ListSubscriptionsByTopic"],
      "Resource": ["arn:aws:sns:us-east-1:ACCOUNT_ID:diabetes-results"]
    },
    {
      "Effect": "Allow",
      "Action": ["dynamodb:PutItem", "dynamodb:UpdateItem", "dynamodb:GetItem"],
      "Resource": ["arn:aws:dynamodb:us-east-1:ACCOUNT_ID:table/DiabetesResults"]
    }
  ]
}
Appendix C — Where Your Project Files Go (Recommended Layout)
/home/ec2-user/app/
├── backend.py
├── requirements.txt
├── modelaws.joblib # not required on EC2; the app reads the model directly from S3
├── nginx.conf
└── systemd/diabetes-api.service
Final Notes
- The backend.py above upgrades the repo's original version to be safer and easier to deploy (env vars, optional MongoDB support, IAM role usage)
- Decide whether to use DynamoDB (AWS-native) or MongoDB Atlas — pick based on analytics needs and familiarity. The provided code supports both via the USE_MONGODB env var
This complete guide provides everything you need to deploy a production-ready diabetes prediction system on AWS. The architecture is scalable, secure, and follows AWS best practices for machine learning applications.