DEV Community

Cloudev

Automating AWS RDS Failover Alerts with Terraform and Lambda

Failover automation is one of the most important parts of building resilient cloud infrastructure. In this post, I’ll walk you through how I built a simple but powerful project that monitors an AWS RDS instance and sends real-time alerts whenever a failover occurs, using Terraform, Lambda, and SNS.
Why This Project Matters

High availability is critical for production databases. AWS RDS automatically handles failovers when an issue is detected in the primary instance, but knowing when it happens helps you respond faster and verify your recovery plan.

Instead of manually checking CloudWatch or relying on delayed alerts, this project automates everything from infrastructure deployment to continuous monitoring and notifications.
What You’ll Build

Here’s what the setup includes:

VPC – a private network for secure communication.

RDS (MySQL) – deployed in multiple availability zones for automatic failover.

EC2 instance – optional, for testing connectivity to the database.

SNS Topic – sends email alerts when a failover occurs.

Lambda Function – checks RDS health every 5 minutes.

EventBridge Rule (formerly CloudWatch Events) – triggers the Lambda function on a 5-minute schedule.

Terraform Backend – stores state files securely in S3 with a lockfile.

Folder Structure
aws-rds-failover/
├── modules/
│ ├── vpc/
│ ├── rds/
│ └── ec2/
├── lambda/
│ └── rds_health_check.py
├── environments/
│ ├── dev.tfvars
│ └── prod.tfvars
├── main.tf
├── variables.tf
├── outputs.tf
├── provider.tf

Each module handles a specific piece of infrastructure. This modular approach keeps your project clean and reusable.

Key Terraform Features
Backend Configuration
terraform {
  backend "s3" {
    bucket       = "my-terraform-states"
    key          = "global/s3/terraform.tfstate"
    region       = "eu-west-1"
    encrypt      = true
    use_lockfile = true
  }
}

provider "aws" {
  region = "eu-west-1"
}

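This keeps your Terraform state in an encrypted S3 object that a whole team can share. With use_lockfile enabled, Terraform writes a lock file next to the state in the same bucket, so two applies cannot modify the state at the same time.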

RDS Deployment
resource "aws_db_instance" "rds" {
  identifier              = "${var.environment}-db"
  allocated_storage       = var.allocated_storage
  engine                  = "mysql"
  engine_version          = "8.0"
  instance_class          = var.instance_type
  multi_az                = true  # standby replica in a second AZ for automatic failover
  publicly_accessible     = false # reachable only from inside the VPC
  db_subnet_group_name    = aws_db_subnet_group.rds_subnets.name
  vpc_security_group_ids  = [aws_security_group.rds_sg.id]
  backup_retention_period = 7     # days of automated backups
  skip_final_snapshot     = true  # convenient for dev; no snapshot is taken on destroy
  tags = {
    Environment = var.environment
  }
}

This configuration deploys a multi-AZ RDS instance with backups and automatic failover enabled.

SNS and Lambda Integration
resource "aws_sns_topic" "rds_failover_alerts" {
  name = "${var.environment}-rds-failover-alerts"
}

resource "aws_sns_topic_subscription" "email" {
  topic_arn = aws_sns_topic.rds_failover_alerts.arn
  protocol  = "email"
  endpoint  = var.alert_email
}

resource "aws_lambda_function" "rds_health_check" {
  filename      = "lambda.zip"
  function_name = "${var.environment}-rds-health-check"
  role          = aws_iam_role.lambda_role.arn
  handler       = "rds_health_check.lambda_handler"
  runtime       = "python3.11"
  # The default timeout is 3 seconds, shorter than the script's 5-second
  # connect_timeout, so raise it or a hung connection never reaches the alert.
  timeout       = 15

  environment {
    variables = {
      # Must be the hostname only (the aws_db_instance "address" attribute);
      # the "endpoint" attribute has the form host:port.
      RDS_ENDPOINT  = module.rds.endpoint
      DB_USER       = var.db_username
      DB_PASS       = var.db_password
      SNS_TOPIC_ARN = aws_sns_topic.rds_failover_alerts.arn
    }
  }
}

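The Lambda runs every 5 minutes (via EventBridge) to check database health. If it fails to connect, it sends an SNS alert. The EventBridge rule and the IAM role are not shown here. Note that because the database is not publicly accessible, the function also needs a vpc_config block placing it in subnets that can reach RDS, plus a route to SNS from those subnets (a NAT gateway or an SNS VPC endpoint).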
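The function resource above points at lambda.zip, and the handler string rds_health_check.lambda_handler means that rds_health_check.py has to sit at the root of that archive. The script also imports pymysql, which is not part of the Lambda Python runtime, so it has to be bundled too. Here is a minimal packaging sketch using only the standard library. The function name build_lambda_zip and the build/ directory are assumptions made for illustration, not something taken from the repository.

```python
import zipfile
from pathlib import Path


def build_lambda_zip(handler_file: Path, deps_dir: Path, out_path: Path) -> list[str]:
    """Write handler_file and everything under deps_dir to the root of a zip.

    Returns the sorted list of archive member names so the layout can be checked.
    """
    with zipfile.ZipFile(out_path, "w", zipfile.ZIP_DEFLATED) as zf:
        # The handler module goes at the archive root so that
        # "rds_health_check.lambda_handler" can be imported by Lambda.
        zf.write(handler_file, arcname=handler_file.name)
        # Dependencies (hypothetical build/ directory) are also stored
        # relative to the archive root, e.g. pymysql/__init__.py.
        if deps_dir.is_dir():
            for path in sorted(deps_dir.rglob("*")):
                if path.is_file():
                    zf.write(path, arcname=path.relative_to(deps_dir).as_posix())
    with zipfile.ZipFile(out_path) as zf:
        return sorted(zf.namelist())
```

A possible workflow, under the same assumptions: run pip install pymysql --target build/ once, then call build_lambda_zip(Path("lambda/rds_health_check.py"), Path("build"), Path("lambda.zip")) before terraform apply.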

Lambda Script (rds_health_check.py)
import os

import boto3
import pymysql  # not in the Lambda runtime; must be bundled into lambda.zip


def lambda_handler(event, context):
    # Connection details and the alert topic are injected by Terraform.
    rds_host = os.environ['RDS_ENDPOINT']
    user = os.environ['DB_USER']
    password = os.environ['DB_PASS']
    sns_topic = os.environ['SNS_TOPIC_ARN']

    sns = boto3.client('sns')

    try:
        # Opening and closing a connection is the whole health check.
        conn = pymysql.connect(host=rds_host, user=user, password=password, connect_timeout=5)
        conn.close()
        print("RDS is healthy.")
    except Exception as e:
        # Any connection error (failover, network, bad credentials) triggers an alert.
        print(f"RDS connection failed: {e}")
        sns.publish(
            TopicArn=sns_topic,
            Message=f"RDS failover or connectivity issue detected: {e}",
            Subject="RDS Failover Alert"
        )
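The script above alerts on the first failed connection, so a single dropped packet produces the same email as a real outage. One possible refinement is to retry a few times before alerting. This is a sketch, not part of the original project: check_with_retries and its parameters are hypothetical names, and the check argument stands in for the pymysql.connect call.

```python
import time
from typing import Callable, Optional


def check_with_retries(
    check: Callable[[], None],
    attempts: int = 3,
    delay_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[Exception]:
    """Run check() up to `attempts` times.

    Returns None as soon as one attempt succeeds, otherwise the exception
    raised by the final attempt.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error: Optional[Exception] = None
    for attempt in range(1, attempts + 1):
        try:
            check()
            return None  # healthy, no alert needed
        except Exception as exc:
            last_error = exc
            # Pause between attempts, but not after the last one.
            if attempt < attempts:
                sleep(delay_s)
    return last_error
```

In the handler, the SNS publish would then run only when the returned value is not None. The Lambda timeout has to cover the worst case, roughly attempts × (connect timeout + delay).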

How to Deploy

Initialize Terraform:

terraform init

Create a workspace (workspace new also switches to it, so the select command is only needed on later runs):

terraform workspace new dev
terraform workspace select dev

Apply configuration:

terraform apply -var-file=environments/dev.tfvars

Confirm the email subscription from AWS SNS.

Watch the logs in CloudWatch to see Lambda health checks in action.

What Happens During Failover

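When RDS fails over (for example, due to an AZ outage), AWS promotes the standby to primary. If a scheduled health check runs during this short downtime, the Lambda function fails to connect and sends an alert through SNS. Because the check runs only every 5 minutes, the alert can lag the failover by up to that long, and a failover that completes between two runs can go unnoticed.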
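To get a feel for how often a polling check catches a short outage, here is a rough back-of-the-envelope model. It assumes the outage starts at a uniformly random point in the schedule and that each check is instantaneous. Both are simplifications, and detection_probability is a name made up for this sketch.

```python
def detection_probability(outage_s: float, interval_s: float) -> float:
    """Chance that at least one periodic check lands inside the outage window.

    Model assumptions: checks fire every interval_s seconds, the outage start
    is uniformly distributed relative to the schedule, checks take no time.
    """
    if outage_s < 0 or interval_s <= 0:
        raise ValueError("outage_s must be >= 0 and interval_s must be > 0")
    # An outage at least as long as the interval always contains a check.
    return min(1.0, outage_s / interval_s)


# Illustrative numbers: a 90-second outage against the 5-minute schedule.
print(detection_probability(90, 300))
```

AWS documentation describes Multi-AZ failovers as typically taking 60 to 120 seconds, so under this model a 5-minute schedule would see only a minority of them. A shorter interval raises the odds at the cost of more invocations.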

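The RDS endpoint name does not change during a Multi-AZ failover; RDS updates its DNS record to point at the new primary. You can confirm the failover in the instance’s event list in the AWS Console and check that the application reconnects as expected.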
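In a Multi-AZ failover the endpoint hostname stays the same and the address behind it changes, so one way to observe the switch from a client is to resolve the hostname before and after. This is a small sketch using the standard library; resolve_ipv4 is a hypothetical helper, and the hostname you pass in would be your own instance’s endpoint.

```python
import socket


def resolve_ipv4(hostname: str) -> list[str]:
    """Return the sorted, de-duplicated IPv4 addresses a hostname resolves to."""
    infos = socket.getaddrinfo(
        hostname, None, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    # Each entry is (family, type, proto, canonname, sockaddr);
    # for IPv4 the sockaddr is an (address, port) tuple.
    return sorted({info[4][0] for info in infos})
```

Calling it once before a test failover and again afterwards, then comparing the two lists, should show a different address if the standby was promoted. Keep in mind that cached DNS answers on the client can delay the change.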
This setup combines Infrastructure as Code (Terraform) with serverless automation (Lambda) to monitor a Multi-AZ database and alert on connection failures. It’s inexpensive to run and a solid starting point for production use.

If you’re building AWS infrastructure, automating your monitoring and alerting like this saves time and reduces downtime when issues occur.
GitHub Repository

You can view and clone the full project here:
https://github.com/Copubah/aws-rds-multi-az-terraform

It includes all Terraform modules, Lambda source code, and environment setup files so you can deploy your own version right away.
