Speaker: Fabiano Honorato, Michelle Koo, Stephen Brandon @ AWS FSI Meetup 2025 Q4
Introduction to Brex
Financial operating system platform for managing expenses, travel, credit.
Engineering manager and team members discuss leveraging Amazon Aurora for resiliency and international expansion
Brex services
Corporate cards, expense management, travel, bill pay, and banking
Aim to help clients spend wisely
Importance of preparing infrastructure for disaster scenarios
Focus on the data layer: primarily PostgreSQL with PgBouncer, plus read replicas serving both application and analytical workloads
Merge smaller databases into a single database instance
Past disaster recovery process was manual and time-consuming
Goals for disaster recovery solution
Warm disaster recovery solution to decrease Recovery Time Objective (RTO) and Recovery Point Objective (RPO)
RTO: the maximum acceptable time to restore normal operations after a disaster
RPO: the maximum amount of data loss that can be tolerated
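The two objectives can be captured as a simple pass/fail check for disaster-recovery drills. A minimal sketch, with hypothetical targets (the talk did not share Brex's actual numbers):

```python
from datetime import timedelta

# Hypothetical targets for illustration only; not Brex's actual objectives.
RTO_TARGET = timedelta(minutes=15)  # max time to restore normal operations
RPO_TARGET = timedelta(minutes=1)   # max tolerable window of lost data

def meets_objectives(measured_recovery: timedelta,
                     measured_data_loss: timedelta) -> bool:
    """A drill passes only if both recovery time and data loss stay on target."""
    return measured_recovery <= RTO_TARGET and measured_data_loss <= RPO_TARGET

# A 10-minute recovery losing 30 seconds of writes would pass these targets.
print(meets_objectives(timedelta(minutes=10), timedelta(seconds=30)))
```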
Determining RPO and RTO
Analyze metrics, assess current capabilities, and conduct extensive testing
Understand how applications will handle additional latency and data loss
Choice of Amazon Aurora Global Database
Provides necessary features without significant changes to the current setup
Allows use of a secondary region when needed
Current implementation caveats
- Created a custom DNS endpoint for reads, serving both application and analytical traffic
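One way to implement such a custom read endpoint is a CNAME in Route 53 pointing at the Aurora reader endpoint. A hedged sketch of the change batch (the talk did not say how Brex implements it; zone IDs and hostnames here are hypothetical):

```python
# Sketch: build a Route 53 UPSERT that points a custom DNS name at an Aurora
# reader endpoint. All names below are hypothetical placeholders.
def reader_cname_change(record_name: str, reader_endpoint: str,
                        ttl: int = 60) -> dict:
    """Build the Route 53 change batch for the custom read endpoint."""
    return {
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": record_name,
                "Type": "CNAME",
                "TTL": ttl,
                "ResourceRecords": [{"Value": reader_endpoint}],
            },
        }]
    }

# Applying it would look roughly like this (requires boto3 and credentials):
# import boto3
# boto3.client("route53").change_resource_record_sets(
#     HostedZoneId="Z123EXAMPLE",  # hypothetical zone
#     ChangeBatch=reader_cname_change(
#         "db-read.internal.example.com",
#         "mycluster.cluster-ro-abc.us-east-1.rds.amazonaws.com",
#     ),
# )
```

A short TTL keeps the window small when the record is repointed during a switchover.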
Migration challenges and approach
Difficulty in migrating from PostgreSQL to Aurora due to potential application downtime
Focus on automation to minimize manual handling
Built a temporal workflow for running automated jobs to validate migration steps and prepare the environment
Performed the switchover to Aurora Global Database once the automation had validated the database status
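The validation gate can be modeled as a function that collects blocking problems before the switchover is allowed. This is a plain-Python sketch of the idea, not Brex's Temporal code; the check names and the lag threshold are illustrative:

```python
# Illustrative pre-switchover validation: return blocking problems; an empty
# list means the automation may proceed. Thresholds are made up for the sketch.
def validate_migration(status: dict) -> list[str]:
    problems = []
    if status.get("replica_state") != "available":
        problems.append("Aurora read replica is not available")
    if status.get("replica_lag_seconds", float("inf")) > 5:
        problems.append("replication lag above threshold")
    if status.get("pending_maintenance"):
        problems.append("pending maintenance actions on the instance")
    return problems

healthy = {"replica_state": "available",
           "replica_lag_seconds": 1,
           "pending_maintenance": []}
print(validate_migration(healthy))
```

In a Temporal workflow each check would typically run as a retryable activity, so transient API errors do not abort the whole migration.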
Downtime management during migration
The AWS-side migration requires only a short downtime window (2-3 minutes)
Used this window to repoint endpoints and the applications consuming the database, keeping the transition smooth
Using temporal workflows for automation
Current state before migration
- Application connected to PgBouncer, which connected to the PostgreSQL instance and a replica instance
Migration process
Created Aurora read replica through AWS with zero downtime
Workflow promoted Aurora read replica and created Aurora global cluster
Application connected to PgBouncer, which then connected to the Aurora global cluster using the global writer endpoint
Possibility to create another cluster for multi-region setup
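The replica-then-promote path maps to a small ordered sequence of RDS API calls. A hedged sketch of that sequence (identifiers are hypothetical, and the real workflow validates state between every step rather than firing them blindly):

```python
# Ordered (api_name, params) pairs the automation would issue via boto3.
# All identifiers are hypothetical placeholders.
def migration_calls(source_arn: str, cluster_id: str,
                    global_id: str) -> list[tuple[str, dict]]:
    return [
        # 1. Create an Aurora read replica of the RDS PostgreSQL instance
        #    (zero downtime: it replicates while the source keeps serving).
        ("create_db_cluster", {
            "DBClusterIdentifier": cluster_id,
            "Engine": "aurora-postgresql",
            "ReplicationSourceIdentifier": source_arn,
        }),
        # 2. Promote the replica cluster -- this is the short downtime window.
        ("promote_read_replica_db_cluster",
         {"DBClusterIdentifier": cluster_id}),
        # 3. Wrap the promoted cluster in a global cluster for multi-region use.
        ("create_global_cluster", {
            "GlobalClusterIdentifier": global_id,
            "SourceDBClusterIdentifier": cluster_id,
        }),
    ]

steps = migration_calls("arn:aws:rds:us-east-1:111111111111:db:pg-main",
                        "pg-main-aurora", "pg-main-global")
print([name for name, _ in steps])
```

Adding a secondary region afterward is then a matter of creating another cluster inside the same global cluster.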
Flux:
Tool for keeping Kubernetes clusters in sync with Git repositories (GitOps)
Workflow generated Flux git pull requests ahead of time
Workflow automatically merged pull requests after manual verification
Confirmation signal sent to workflow to proceed with downtime and promote Aurora global cluster
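The "prepare ahead, then wait for a human signal" pattern above can be modeled with a simple event gate. This is a simplified stand-in for a Temporal signal, shown with `asyncio` for a self-contained sketch:

```python
import asyncio

# Simplified model of the signal gate: the workflow does all safe preparation
# up front, then blocks until an operator signals it to begin the downtime.
async def migration_workflow(proceed: asyncio.Event, log: list[str]) -> None:
    log.append("flux PRs created and merged")   # prepared ahead of time
    await proceed.wait()                        # pause at the gate
    log.append("downtime started; promoting Aurora global cluster")

async def main() -> list[str]:
    log: list[str] = []
    proceed = asyncio.Event()
    task = asyncio.create_task(migration_workflow(proceed, log))
    await asyncio.sleep(0)   # let the workflow run up to the gate
    proceed.set()            # operator confirmation signal
    await task
    return log

print(asyncio.run(main()))
```

In Temporal the same shape is expressed with a signal handler plus `workflow.wait_condition`, which survives worker restarts, unlike this in-process event.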
Automatically reviewing Flux pull requests with AI
- AI review flagged errors or issues in the pull requests and left comments for human review
Dry run migration flag
Allowed testing of migration without causing destructive actions or downtime
Created Flux git pull requests ahead of time for review, but did not merge or promote cluster
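A dry-run flag of this kind usually works by skipping only the destructive steps while still recording what would have happened. A minimal sketch of that gating logic (function names are hypothetical):

```python
# Minimal dry-run gate: destructive steps are skipped but logged, so the full
# plan can be reviewed without downtime. Step names here are illustrative.
def run_step(name, action, *, destructive: bool, dry_run: bool, log: list):
    if dry_run and destructive:
        log.append(f"DRY RUN: would execute {name}")
        return None
    log.append(f"executing {name}")
    return action()

plan_log: list[str] = []
run_step("create flux PRs", lambda: "prs-created",
         destructive=False, dry_run=True, log=plan_log)
run_step("promote global cluster", lambda: "promoted",
         destructive=True, dry_run=True, log=plan_log)
print(plan_log)
```

Non-destructive steps (like opening the Flux pull requests) still run for real, which is why they could be reviewed ahead of time.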
Additional tools and processes used in the migration
Terraform: created a template for managing databases as a new global cluster
Added Terraform for each database after migration workflow completion
Managed reader instances in the global cluster through Terraform
Internal command line tool:
Added commands for teams to self-service switch over or failover for their Aurora global clusters
Failover: recovers from unplanned outages by switching to another region when the primary region is down
Switchover: used for controlled scenarios like operational maintenance or planned procedures with no data loss
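The two commands correspond to two different RDS API operations on the global cluster: `switchover_global_cluster` for the planned, zero-data-loss case, and `failover_global_cluster` for the unplanned case. A hedged sketch of the parameter difference (identifiers are hypothetical):

```python
# Sketch of the call each CLI command would make; identifiers are placeholders.
def region_change_call(global_id: str, target_cluster_arn: str,
                       planned: bool) -> tuple[str, dict]:
    params = {
        "GlobalClusterIdentifier": global_id,
        "TargetDbClusterIdentifier": target_cluster_arn,
    }
    if planned:
        # Switchover waits for replicas to sync: no data loss.
        return ("switchover_global_cluster", params)
    # Failover during an outage may lose unreplicated writes (the RPO window),
    # which the caller must explicitly accept.
    params["AllowDataLoss"] = True
    return ("failover_global_cluster", params)

print(region_change_call("pg-main-global", "arn:aws:rds:us-west-2:...", True)[0])
```

Wrapping both behind one internal CLI lets teams self-serve either path without knowing the underlying API shape.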
Iterative journey of improving workflow performance
Initial workflow took around 15 minutes for end-to-end automation
Downtime for promoting Aurora global cluster and creating Flux git pull requests
Sequential process with no parallelization
Addition of parallelization reduced workflow time to 10 minutes
Updated workflow to perform steps in parallel, including fetching credentials and creating Flux pull requests ahead of time
Introduced dry run flag for non-destructive testing of migration
Final restructuring of workflow achieved 3-minute performance time
Created Flux pull requests ahead of time, allowing workflow to pause until downtime window
Reduced git add operations to minimize costs
Added signal command for controlled initiation of downtime
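The 15-to-10-minute gain came from running independent preparation steps concurrently instead of one after another. A toy model of that change with `asyncio.gather` (step names and durations are illustrative):

```python
import asyncio
import time

# Toy model of the sequential-to-parallel change: independent preparation
# steps run concurrently, so total time is the slowest step, not the sum.
async def step(name: str, seconds: float) -> str:
    await asyncio.sleep(seconds)  # stands in for an AWS or Git API call
    return name

async def prepare_parallel() -> list[str]:
    return await asyncio.gather(
        step("fetch credentials", 0.05),
        step("create flux PRs", 0.05),
        step("validate database status", 0.05),
    )

start = time.monotonic()
results = asyncio.run(prepare_parallel())
print(results, f"{time.monotonic() - start:.2f}s")  # ~0.05s total, not 0.15s
```

The final 3-minute version went further: the slow preparation (the Flux pull requests) was moved entirely ahead of the downtime window, leaving only the promotion itself inside it.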
Lessons learned during the process
Thorough testing and deliberate deployment are crucial before formal migration
Start with staging environment, resolve issues, and then proceed to production
Automation reduces human error and enables easy replication for multiple databases
Simulate the migration using the dry run option to exercise the workflow without causing downtime (a dry run is a rehearsal that confirms the process works without triggering its real effects)
Iterative improvement by migrating a few databases each week