DEV Community

Cover image for Building an ML-Powered Notification Router on AWS: A Production Architecture Guide
Yadab Sutradhar
Yadab Sutradhar

Posted on

Building an ML-Powered Notification Router on AWS: A Production Architecture Guide

Author Bio: ML Engineer specializing in real-time ML systems and serverless architectures on AWS. Built production notification routing systems, processing millions of events daily.

Table of Contents

  • Introduction
  • The Problem
  • Solution Architecture
  • Core Components
  • ML Pipeline
  • Implementation
  • Lessons Learned
  • Performance
  • What’s Next
  • Conclusion

Introduction

Sending notifications at the wrong time is like knocking on someone’s door at 3 AM.

Even if the message is important — timing matters.

In this article, I’ll walk you through building a production-grade, ML-powered notification routing engine that predicts the optimal send time for each user based on their historical engagement patterns.

What We’ll Build

  • Real-time ML inference system using Amazon SageMaker
  • Event-driven architecture processing millions of events
  • Automated ML training pipeline with feedback loops
  • Infrastructure as Code using AWS CDK
  • Cost-optimized serverless system

Tech Stack

  • AWS Lambda (Java 21 + SnapStart)
  • Amazon SageMaker (XGBoost)
  • AWS Glue (PySpark ETL)
  • Amazon Kinesis Data Streams
  • DynamoDB
  • EventBridge Scheduler
  • AWS CDK (TypeScript)

...

The Problem: Notification Fatigue

Business Challenge

Modern applications send billions of notifications daily.

But:

  • 50–70% of notifications go unread
  • User churn increases by 30% due to notification fatigue
  • Engagement rates vary 5–10x depending on timing

Traditional Approaches

  • “Send at 9 AM local time”
  • “Send when user was last active”
  • “Batch and send at fixed intervals”

These approaches fail to account for individual user behavior.

Technical Challenges

  • Real-time prediction (<500ms latency)
  • Personalized decision per user
  • Scale to millions of events
  • Continuous learning via feedback
  • Cost optimization

...

Solution Architecture

System Flows

  • Event Ingestion (real-time)
  • Decision & Scheduling (real-time)
  • ML Training Pipeline (batch, daily)

ML Smart Notification Routing Engine Complete Architecture Diagram

...

Design Principles

Event-Driven & Decoupled

  • Kinesis for streaming
  • Services communicate via events
  • Independent scaling

Serverless-First

  • Lambda for compute
  • DynamoDB for storage
  • No infrastructure management

ML Feedback Loop

  • Delivery outcomes feed training
  • Daily retraining
  • Continuous improvement

...

Core Components

Event Ingestion (Control Plane)

  • Accepts user events (clicks, sends, etc.)
  • Streams data to Kinesis

Key Decisions:

  • Partition by userId → ordered processing
  • Async processing → low latency
  • SnapStart → reduced cold start

Event Processing

  • Stores raw data in S3 (data lake)
  • Updates DynamoDB user profiles

Key Decisions:

  • Dual writes (S3 + DynamoDB)
  • Atomic counters
  • Partitioned storage

Decision Service (Real-Time ML)

  • Predicts optimal send time
  • Calls SageMaker endpoint

Flow:

  • Fetch user profile
  • Evaluate each hour in time window
  • Score each candidate time
  • Select best hour
  • Schedule notification

...

ML Pipeline

End-to-End Flow

S3 Raw → Glue ETL → Features → SageMaker Training → Model → Endpoint
↑____________________________________________________________↓
………………………..Feedback Loop (Clicks, Opens)…………………….

Feature Engineering

  • Hour of day
  • Click rate (last 7 days)
  • Number of sends per hour

Model Choice: XGBoost

  • Fast inference (<50ms)
  • Excellent for tabular data
  • Built-in SageMaker support
  • Handles overfitting well

Orchestration

  • EventBridge → daily trigger
  • Step Functions → workflow
  • Glue → feature generation
  • SageMaker → training

...

Implementation Highlights

DynamoDB (Single Table Design)

{
  "pk": "USER#user_12345",
  "sk": "PROFILE",
  "lastSeenAt": "...",
  "counters": {
    "events": 1523,
    "notifications_sent": 47,
    "notifications_clicked": 12
  }
}
Enter fullscreen mode Exit fullscreen mode

Benefits:

  • Single read operation
  • Atomic updates
  • Cost-efficient

Infrastructure as Code (CDK)

  • Version-controlled infrastructure
  • Reproducible deployments
  • Type-safe configurations

...

Lessons Learned

Feature Mismatch

Training features ≠ inference features → model breaks

👉 Always keep features consistent

Data Format Issues

CSV vs Parquet mismatch → failed training

👉 Standardize data formats

Wrong Event Types

Incorrect filters → zero training data

👉 Validate data pipelines

[NOTE: Actively updating this part]

SageMaker Mode Confusion

Mixing built-in + custom training → broken setup

👉 Choose one approach

Time Window Bugs

Using all-time data instead of last 7 days

👉 Always explicitly filter time windows

...

Performance & Cost

Performance

  • API latency (p50): 180ms
  • Inference latency: 45ms
  • Throughput: 5,000 events/sec

Cost

  • Under Estimation(1M events/day)

...

Optimizations

  • Lambda SnapStart
  • DynamoDB on-demand
  • S3 intelligent tiering
  • Small SageMaker instances

...

What’s Next

  • Multi-Armed Bandits (better optimization)
  • Multi-channel prediction (email vs SMS vs push)
  • A/B testing framework
  • Feature Store integration
  • Model monitoring & drift detection
  • AutoML integration

Conclusion

Building ML systems is:

10% modeling and 90% engineering

The real challenges are:

  • Data pipelines
  • Feature consistency
  • Scalability
  • Feedback loops
  • Production debugging

Key Takeaways

  • Use event-driven architecture
  • Keep training and inference features aligned
  • Prefer managed ML services
  • Use Infrastructure as Code
  • Optimize costs early
  • Always implement feedback loops

Discussion

Have you built similar systems?

  • What challenges did you face?
  • How do you ensure feature consistency?
  • Let’s discuss 👇

Smart Notification Routing Engine

ML-Powered Intelligent Notification Delivery System with Real-Time Optimization

AWS Java TypeScript Python SageMaker Status

Project Status: 🚧 Active Development - Core infrastructure and ML pipeline implemented. Performance benchmarking and production validation in progress.

Overview

A production-grade, enterprise-scale notification routing engine designed to leverage machine learning for optimizing message delivery timing and channel selection. This system addresses the critical problem of notification fatigue by intelligently predicting when users are most likely to engage with notifications, with projected engagement rate improvements of 40-60% compared to traditional uniform delivery strategies.

Built entirely on AWS serverless architecture, the system processes millions of events, trains ML models nightly, and serves real-time predictions with sub-second latency—all while maintaining strict security, observability, and cost optimization standards.

Key Capabilities

  • 🤖 ML-Driven Send-Time Optimization: XGBoost models predict optimal delivery windows per user
  • 📊 Real-Time Feature Engineering: Apache Spark ETL pipelines transform raw events into ML features
  • ⚡ Sub-Second

Tags: #AWS #MachineLearning #SageMaker #Serverless #MLOps #XGBoost #EventDriven #CDK

Top comments (0)