Speaker: Fabiano Honorato, Michelle Koo, Stephen Brandon @ AWS FSI Meetup 2025 Q4
Introduction to Brex
Financial operating system platform for managing expenses, travel, credit.
Engineering manager and team members discuss leveraging Amazon Aurora for resiliency and international expansion
Brex services
Corporate cards, expense management, travel, bill pay, and banking
Aim to help clients spend wisely
Importance of preparing infrastructure for disaster scenarios
Focus on the data layer: primarily PostgreSQL with PgBouncer, plus read replicas serving both application and analytical workloads
Merge smaller databases into a single database instance
Past disaster recovery process was manual and time-consuming
Goals for disaster recovery solution
Warm disaster recovery solution to decrease Recovery Time Objective (RTO) and Recovery Point Objective (RPO)
RTO: the maximum acceptable time to restore normal operations after a disaster
RPO: the maximum amount of data loss that can be tolerated
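The two objectives can be captured as a simple pass/fail check for disaster-recovery drills. A minimal sketch, with hypothetical targets (the talk did not share Brex's actual numbers):

```python
from datetime import timedelta

# Hypothetical targets for illustration only; not Brex's actual objectives.
RTO_TARGET = timedelta(minutes=15)  # max time to restore normal operations
RPO_TARGET = timedelta(minutes=1)   # max tolerable window of lost data

def meets_objectives(measured_recovery: timedelta,
                     measured_data_loss: timedelta) -> bool:
    """A drill passes only if both recovery time and data loss stay on target."""
    return measured_recovery <= RTO_TARGET and measured_data_loss <= RPO_TARGET

# A 10-minute recovery losing 30 seconds of writes would pass these targets.
print(meets_objectives(timedelta(minutes=10), timedelta(seconds=30)))
```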
Determining RPO and RTO
Analyze metrics, assess current capabilities, and conduct extensive testing
Understand how applications will handle additional latency and data loss
Choice of Amazon Aurora Global Database
Provides necessary features without significant changes to the current setup
Allows use of a secondary region when needed
Current implementation caveats
- Created a custom DNS endpoint for reads, serving both application and analytical traffic
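One way to implement such a custom read endpoint is a CNAME in Route 53 pointing at the Aurora reader endpoint. A hedged sketch of the change batch (the talk did not say how Brex implements it; zone IDs and hostnames here are hypothetical):

```python
# Sketch: build a Route 53 UPSERT that points a custom DNS name at an Aurora
# reader endpoint. All names below are hypothetical placeholders.
def reader_cname_change(record_name: str, reader_endpoint: str,
                        ttl: int = 60) -> dict:
    """Build the Route 53 change batch for the custom read endpoint."""
    return {
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": record_name,
                "Type": "CNAME",
                "TTL": ttl,
                "ResourceRecords": [{"Value": reader_endpoint}],
            },
        }]
    }

# Applying it would look roughly like this (requires boto3 and credentials):
# import boto3
# boto3.client("route53").change_resource_record_sets(
#     HostedZoneId="Z123EXAMPLE",  # hypothetical zone
#     ChangeBatch=reader_cname_change(
#         "db-read.internal.example.com",
#         "mycluster.cluster-ro-abc.us-east-1.rds.amazonaws.com",
#     ),
# )
```

A short TTL keeps the window small when the record is repointed during a switchover.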
Migration challenges and approach
Difficulty in migrating from PostgreSQL to Aurora due to potential application downtime
Focus on automation to minimize manual handling
Built a temporal workflow for running automated jobs to validate migration steps and prepare the environment
Performed the switchover to Aurora Global Database once the automation had validated the database status
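The validation gate can be modeled as a function that collects blocking problems before the switchover is allowed. This is a plain-Python sketch of the idea, not Brex's Temporal code; the check names and the lag threshold are illustrative:

```python
# Illustrative pre-switchover validation: return blocking problems; an empty
# list means the automation may proceed. Thresholds are made up for the sketch.
def validate_migration(status: dict) -> list[str]:
    problems = []
    if status.get("replica_state") != "available":
        problems.append("Aurora read replica is not available")
    if status.get("replica_lag_seconds", float("inf")) > 5:
        problems.append("replication lag above threshold")
    if status.get("pending_maintenance"):
        problems.append("pending maintenance actions on the instance")
    return problems

healthy = {"replica_state": "available",
           "replica_lag_seconds": 1,
           "pending_maintenance": []}
print(validate_migration(healthy))
```

In a Temporal workflow each check would typically run as a retryable activity, so transient API errors do not abort the whole migration.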
Downtime management during migration
The AWS-side migration requires only a short downtime window (2-3 minutes)
Used this window to repoint endpoints and the applications consuming the database, keeping the transition smooth
Using temporal workflows for automation
Current state before migration
- Application connected to PgBouncer, which connected to the PostgreSQL instance and a replica instance
Migration process
Created Aurora read replica through AWS with zero downtime
Workflow promoted Aurora read replica and created Aurora global cluster
Application connected to PgBouncer, which then connected to the Aurora global cluster using the global writer endpoint
Possibility to create another cluster for multi-region setup
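The replica-then-promote path maps to a small ordered sequence of RDS API calls. A hedged sketch of that sequence (identifiers are hypothetical, and the real workflow validates state between every step rather than firing them blindly):

```python
# Ordered (api_name, params) pairs the automation would issue via boto3.
# All identifiers are hypothetical placeholders.
def migration_calls(source_arn: str, cluster_id: str,
                    global_id: str) -> list[tuple[str, dict]]:
    return [
        # 1. Create an Aurora read replica of the RDS PostgreSQL instance
        #    (zero downtime: it replicates while the source keeps serving).
        ("create_db_cluster", {
            "DBClusterIdentifier": cluster_id,
            "Engine": "aurora-postgresql",
            "ReplicationSourceIdentifier": source_arn,
        }),
        # 2. Promote the replica cluster -- this is the short downtime window.
        ("promote_read_replica_db_cluster",
         {"DBClusterIdentifier": cluster_id}),
        # 3. Wrap the promoted cluster in a global cluster for multi-region use.
        ("create_global_cluster", {
            "GlobalClusterIdentifier": global_id,
            "SourceDBClusterIdentifier": cluster_id,
        }),
    ]

steps = migration_calls("arn:aws:rds:us-east-1:111111111111:db:pg-main",
                        "pg-main-aurora", "pg-main-global")
print([name for name, _ in steps])
```

Adding a secondary region afterward is then a matter of creating another cluster inside the same global cluster.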
Flux:
Tool for keeping Kubernetes clusters in sync with Git repositories (GitOps)
Workflow generated Flux git pull requests ahead of time
Workflow automatically merged pull requests after manual verification
Confirmation signal sent to workflow to proceed with downtime and promote Aurora global cluster
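The "prepare ahead, then wait for a human signal" pattern above can be modeled with a simple event gate. This is a simplified stand-in for a Temporal signal, shown with `asyncio` for a self-contained sketch:

```python
import asyncio

# Simplified model of the signal gate: the workflow does all safe preparation
# up front, then blocks until an operator signals it to begin the downtime.
async def migration_workflow(proceed: asyncio.Event, log: list[str]) -> None:
    log.append("flux PRs created and merged")   # prepared ahead of time
    await proceed.wait()                        # pause at the gate
    log.append("downtime started; promoting Aurora global cluster")

async def main() -> list[str]:
    log: list[str] = []
    proceed = asyncio.Event()
    task = asyncio.create_task(migration_workflow(proceed, log))
    await asyncio.sleep(0)   # let the workflow run up to the gate
    proceed.set()            # operator confirmation signal
    await task
    return log

print(asyncio.run(main()))
```

In Temporal the same shape is expressed with a signal handler plus `workflow.wait_condition`, which survives worker restarts, unlike this in-process event.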
Automatically reviewing Flux pull requests with AI
- AI review flagged errors or issues in the pull requests and left comments for human review
Dry run migration flag
Allowed testing of migration without causing destructive actions or downtime
Created Flux git pull requests ahead of time for review, but did not merge or promote cluster
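A dry-run flag of this kind usually works by skipping only the destructive steps while still recording what would have happened. A minimal sketch of that gating logic (function names are hypothetical):

```python
# Minimal dry-run gate: destructive steps are skipped but logged, so the full
# plan can be reviewed without downtime. Step names here are illustrative.
def run_step(name, action, *, destructive: bool, dry_run: bool, log: list):
    if dry_run and destructive:
        log.append(f"DRY RUN: would execute {name}")
        return None
    log.append(f"executing {name}")
    return action()

plan_log: list[str] = []
run_step("create flux PRs", lambda: "prs-created",
         destructive=False, dry_run=True, log=plan_log)
run_step("promote global cluster", lambda: "promoted",
         destructive=True, dry_run=True, log=plan_log)
print(plan_log)
```

Non-destructive steps (like opening the Flux pull requests) still run for real, which is why they could be reviewed ahead of time.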
Additional tools and processes used in the migration
Terraform: created a template for managing databases as a new global cluster
Added Terraform for each database after migration workflow completion
Managed reader instances in the global cluster through Terraform
Internal command line tool:
Added commands for teams to self-service switch over or failover for their Aurora global clusters
Failover: recovers from unplanned outages by switching to another region when the primary region is down
Switchover: used for controlled scenarios like operational maintenance or planned procedures with no data loss
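The two commands correspond to two different RDS API operations on the global cluster: `switchover_global_cluster` for the planned, zero-data-loss case, and `failover_global_cluster` for the unplanned case. A hedged sketch of the parameter difference (identifiers are hypothetical):

```python
# Sketch of the call each CLI command would make; identifiers are placeholders.
def region_change_call(global_id: str, target_cluster_arn: str,
                       planned: bool) -> tuple[str, dict]:
    params = {
        "GlobalClusterIdentifier": global_id,
        "TargetDbClusterIdentifier": target_cluster_arn,
    }
    if planned:
        # Switchover waits for replicas to sync: no data loss.
        return ("switchover_global_cluster", params)
    # Failover during an outage may lose unreplicated writes (the RPO window),
    # which the caller must explicitly accept.
    params["AllowDataLoss"] = True
    return ("failover_global_cluster", params)

print(region_change_call("pg-main-global", "arn:aws:rds:us-west-2:...", True)[0])
```

Wrapping both behind one internal CLI lets teams self-serve either path without knowing the underlying API shape.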
Iterative journey of improving workflow performance
Initial workflow took around 15 minutes for end-to-end automation
Downtime for promoting Aurora global cluster and creating Flux git pull requests
Sequential process with no parallelization
Addition of parallelization reduced workflow time to 10 minutes
Updated workflow to perform steps in parallel, including fetching credentials and creating Flux pull requests ahead of time
Introduced dry run flag for non-destructive testing of migration
Final restructuring of workflow achieved 3-minute performance time
Created Flux pull requests ahead of time, allowing workflow to pause until downtime window
Reduced git add operations to minimize costs
Added signal command for controlled initiation of downtime
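The 15-to-10-minute gain came from running independent preparation steps concurrently instead of one after another. A toy model of that change with `asyncio.gather` (step names and durations are illustrative):

```python
import asyncio
import time

# Toy model of the sequential-to-parallel change: independent preparation
# steps run concurrently, so total time is the slowest step, not the sum.
async def step(name: str, seconds: float) -> str:
    await asyncio.sleep(seconds)  # stands in for an AWS or Git API call
    return name

async def prepare_parallel() -> list[str]:
    return await asyncio.gather(
        step("fetch credentials", 0.05),
        step("create flux PRs", 0.05),
        step("validate database status", 0.05),
    )

start = time.monotonic()
results = asyncio.run(prepare_parallel())
print(results, f"{time.monotonic() - start:.2f}s")  # ~0.05s total, not 0.15s
```

The final 3-minute version went further: the slow preparation (the Flux pull requests) was moved entirely ahead of the downtime window, leaving only the promotion itself inside it.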
Lessons learned during the process
Thorough testing and deliberate deployment are crucial before formal migration
Start with staging environment, resolve issues, and then proceed to production
Automation reduces human error and enables easy replication for multiple databases
Simulate the migration using the dry run option to exercise the workflow without causing downtime (a dry run is a rehearsal that confirms the process works without triggering its real effects)
Iterative improvement by migrating a few databases each week