Author Bio: ML Engineer specializing in real-time ML systems and serverless architectures on AWS. Built production notification routing systems, processing millions of events daily.
Table of Contents
- Introduction
- The Problem
- Solution Architecture
- Core Components
- ML Pipeline
- Implementation
- Lessons Learned
- Performance
- What’s Next
- Conclusion
Introduction
Sending notifications at the wrong time is like knocking on someone’s door at 3 AM.
Even if the message is important — timing matters.
In this article, I’ll walk you through building a production-grade, ML-powered notification routing engine that predicts the optimal send time for each user based on their historical engagement patterns.
What We’ll Build
- Real-time ML inference system using Amazon SageMaker
- Event-driven architecture processing millions of events
- Automated ML training pipeline with feedback loops
- Infrastructure as Code using AWS CDK
- Cost-optimized serverless system
Tech Stack
- AWS Lambda (Java 21 + SnapStart)
- Amazon SageMaker (XGBoost)
- AWS Glue (PySpark ETL)
- Amazon Kinesis Data Streams
- DynamoDB
- EventBridge Scheduler
- AWS CDK (TypeScript)
...
The Problem: Notification Fatigue
Business Challenge
Modern applications send billions of notifications daily.
But:
- 50–70% of notifications go unread
- User churn increases by 30% due to notification fatigue
- Engagement rates vary 5–10x depending on timing
Traditional Approaches
- “Send at 9 AM local time”
- “Send when user was last active”
- “Batch and send at fixed intervals”
These approaches fail to account for individual user behavior.
Technical Challenges
- Real-time prediction (<500ms latency)
- Personalized decision per user
- Scale to millions of events
- Continuous learning via feedback
- Cost optimization
...
Solution Architecture
System Flows
- Event Ingestion (real-time)
- Decision & Scheduling (real-time)
- ML Training Pipeline (batch, daily)
...
Design Principles
Event-Driven & Decoupled
- Kinesis for streaming
- Services communicate via events
- Independent scaling
Serverless-First
- Lambda for compute
- DynamoDB for storage
- No infrastructure management
ML Feedback Loop
- Delivery outcomes feed training
- Daily retraining
- Continuous improvement
...
Core Components
Event Ingestion (Control Plane)
- Accepts user events (clicks, sends, etc.)
- Streams data to Kinesis
Key Decisions:
- Partition by userId → ordered processing
- Async processing → low latency
- SnapStart → reduced cold start
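The userId partitioning can be sketched as a small helper (stream name and event shape here are my assumptions, not the project's actual schema):

```python
import json

def build_kinesis_record(event: dict) -> dict:
    # Partitioning by userId keeps all of a user's events on one shard,
    # so they are processed in order per user.
    return {
        "Data": json.dumps(event).encode("utf-8"),
        "PartitionKey": event["userId"],
    }

# Inside the Lambda handler this would be sent via boto3, e.g.:
# boto3.client("kinesis").put_record(StreamName="user-events",
#                                    **build_kinesis_record(event))
```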
Event Processing
- Stores raw data in S3 (data lake)
- Updates DynamoDB user profiles
Key Decisions:
- Dual writes (S3 + DynamoDB)
- Atomic counters
- Partitioned storage
Decision Service (Real-Time ML)
- Predicts optimal send time
- Calls SageMaker endpoint
Flow:
- Fetch user profile
- Evaluate each hour in time window
- Score each candidate time
- Select best hour
- Schedule notification
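The flow above can be condensed into one loop; `score_fn` stands in for the SageMaker endpoint call and is injected here so the selection logic stays testable (a sketch, not the actual service code):

```python
from datetime import datetime, timedelta, timezone

def pick_best_hour(profile: dict, window_hours: int, score_fn) -> datetime:
    # score_fn(profile, hour_of_day) returns a predicted engagement
    # probability; in production it would call InvokeEndpoint.
    now = datetime.now(timezone.utc).replace(minute=0, second=0, microsecond=0)
    candidates = [now + timedelta(hours=h) for h in range(1, window_hours + 1)]
    # Score every candidate hour in the window and keep the best one.
    return max(candidates, key=lambda t: score_fn(profile, t.hour))
```

The chosen timestamp would then be handed to EventBridge Scheduler to create the one-off delivery schedule.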
...
ML Pipeline
End-to-End Flow
```
S3 Raw → Glue ETL → Features → SageMaker Training → Model → Endpoint
   ↑                                                           │
   └───────────── Feedback Loop (Clicks, Opens) ───────────────┘
```
Feature Engineering
- Hour of day
- Click rate (last 7 days)
- Number of sends per hour
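In plain Python the three features above might look like this (the real pipeline uses Glue/PySpark; event field names are illustrative):

```python
from collections import Counter
from datetime import datetime, timedelta

def build_features(events: list[dict], now: datetime) -> dict:
    # Restrict to the last 7 days up front; training on all-time data
    # by accident is exactly the time-window bug described later.
    recent = [e for e in events if e["ts"] >= now - timedelta(days=7)]
    sends = [e for e in recent if e["type"] == "notification_sent"]
    clicks = [e for e in recent if e["type"] == "notification_clicked"]
    return {
        "click_rate_7d": len(clicks) / len(sends) if sends else 0.0,
        "sends_per_hour": dict(Counter(e["ts"].hour for e in sends)),
    }
```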
Model Choice: XGBoost
- Fast inference (<50ms)
- Excellent for tabular data
- Built-in SageMaker support
- Handles overfitting well
Orchestration
- EventBridge → daily trigger
- Step Functions → workflow
- Glue → feature generation
- SageMaker → training
...
Implementation Highlights
DynamoDB (Single Table Design)
```json
{
  "pk": "USER#user_12345",
  "sk": "PROFILE",
  "lastSeenAt": "...",
  "counters": {
    "events": 1523,
    "notifications_sent": 47,
    "notifications_clicked": 12
  }
}
```
Benefits:
- Single read operation
- Atomic updates
- Cost-efficient
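An atomic counter increment against this item shape could be built like so (a sketch assuming the profile item and its `counters` map already exist; table name is hypothetical):

```python
def build_counter_update(user_id: str, counter: str) -> dict:
    # A single UpdateItem is atomic, so concurrent Lambda invocations can
    # increment the same counter without lost updates; if_not_exists
    # covers the first increment of a new counter field.
    return {
        "Key": {"pk": {"S": f"USER#{user_id}"}, "sk": {"S": "PROFILE"}},
        "UpdateExpression": "SET counters.#c = if_not_exists(counters.#c, :zero) + :one",
        "ExpressionAttributeNames": {"#c": counter},
        "ExpressionAttributeValues": {":zero": {"N": "0"}, ":one": {"N": "1"}},
    }

# e.g. boto3.client("dynamodb").update_item(
#     TableName="user-profiles",
#     **build_counter_update("user_12345", "notifications_clicked"))
```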
Infrastructure as Code (CDK)
- Version-controlled infrastructure
- Reproducible deployments
- Type-safe configurations
...
Lessons Learned
Feature Mismatch
Training features ≠ inference features → model breaks
👉 Always keep features consistent
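One cheap guard is to pin the feature order in a single constant shared by training and the inference Lambda, and fail loudly on drift (feature names here are illustrative, not the project's real schema):

```python
# Single source of truth for feature order, imported by both the
# training job and the inference path (illustrative names).
TRAINING_FEATURES = ["hour_of_day", "click_rate_7d", "sends_this_hour"]

def to_feature_vector(features: dict) -> list[float]:
    missing = [f for f in TRAINING_FEATURES if f not in features]
    if missing:
        # Better a loud error than a silently mis-ordered vector.
        raise KeyError(f"feature mismatch, missing: {missing}")
    return [float(features[f]) for f in TRAINING_FEATURES]
```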
Data Format Issues
CSV vs Parquet mismatch → failed training
👉 Standardize data formats
Wrong Event Types
Incorrect filters → zero training data
👉 Validate data pipelines
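A minimal pipeline check for this failure mode (a sketch; event-type names are assumptions):

```python
EXPECTED_EVENT_TYPES = {"notification_sent", "notification_clicked"}

def filter_training_events(rows: list[dict]) -> list[dict]:
    filtered = [r for r in rows if r["type"] in EXPECTED_EVENT_TYPES]
    if not filtered:
        # Zero rows almost always means the filter names are wrong,
        # not that there is genuinely no data.
        raise ValueError("event-type filter produced zero training rows")
    return filtered
```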
[NOTE: Actively updating this part]
SageMaker Mode Confusion
Mixing built-in + custom training → broken setup
👉 Choose one approach
Time Window Bugs
Using all-time data instead of last 7 days
👉 Always explicitly filter time windows
...
Performance & Cost
Performance
- API latency (p50): 180ms
- Inference latency: 45ms
- Throughput: 5,000 events/sec
Cost
- Cost estimate (at 1M events/day) still being finalized
...
Optimizations
- Lambda SnapStart
- DynamoDB on-demand
- S3 Intelligent-Tiering
- Small SageMaker instances
...
What’s Next
- Multi-Armed Bandits (better optimization)
- Multi-channel prediction (email vs SMS vs push)
- A/B testing framework
- Feature Store integration
- Model monitoring & drift detection
- AutoML integration
Conclusion
Building ML systems is:
10% modeling and 90% engineering
The real challenges are:
- Data pipelines
- Feature consistency
- Scalability
- Feedback loops
- Production debugging
Key Takeaways
- Use event-driven architecture
- Keep training and inference features aligned
- Prefer managed ML services
- Use Infrastructure as Code
- Optimize costs early
- Always implement feedback loops
Discussion
Have you built similar systems?
- What challenges did you face?
- How do you ensure feature consistency?
- Let’s discuss 👇
Tags: #AWS #MachineLearning #SageMaker #Serverless #MLOps #XGBoost #EventDriven #CDK
