Eliana Lam

Posted on Nov 30, 2025 • Originally published at aws-user-group.com

Scaling Resilience

#systemdesign #distributedsystems #cloudcomputing #aws

Speaker: Gregory St. Cyr @ AWS FSI Meetup 2025 Q4

What is USAA?

USAA is a financial services company.
Offers a wide variety of products across banking, insurance, wealth management.
Emphasizes customer service as a key product.
Serves 14 million members, primarily service members and their families.
Focuses on availability and resiliency due to the unique needs of its members.

USAA’s Journey to Cloud

Began cloud deployment in 2019 with initial deployments in US East 1.
Gradual increase in cloud traffic and momentum over the next few years.
By June 2023, had dozens of deployments.
Adopted a “never let an outage go to waste” mentality.
Instead of repatriating, used outages as opportunities to mature the platform.
Enabled a secondary region by deploying applications to US West 2.
By 2024, had hundreds of applications deployed to AWS and dozens to US West 2.
Viewed the October outage as another learning and growth opportunity.

Multi-Region Deployment Strategy

Decided to adopt a multi-region deployment strategy.
Focused on three failover patterns: backup and restore, pilot light, and warm standby.
Backup and restore is critical, especially for data availability.
Cannot fail over to the secondary region without data availability.
Must have a process to get data to the secondary region, even if it involves recovering from snapshots.
Pilot light and warm standby are used for high and critical applications.
Goal is to improve Recovery Time Objective (RTO) and Recovery Point Objective (RPO).
Aim is to enhance resiliency and customer experience for military members.

Three Phases for Application Resilience:

Architectural Design:

Assess each application and identify dependencies.
Understand necessary actions to make the application resilient.
Consider downstream dependencies in the critical path.
Focus on incident response and recovery requirements.
Define necessary Recovery Time Objective (RTO).
Incorporate RTO as a requirement in architectural design.
Select the appropriate failover pattern (backup and restore, pilot light, or warm standby) for the use case.
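As a rough illustration of the selection step, the sketch below maps a required RTO to one of the three patterns. The threshold values and the name `select_failover_pattern` are assumptions made for this example, not figures from the talk; a real decision would also weigh RPO, cost, and downstream dependencies.

```python
# Hypothetical RTO ceilings in minutes; illustrative values only.
WARM_STANDBY_MAX_RTO_MINUTES = 30
PILOT_LIGHT_MAX_RTO_MINUTES = 240


def select_failover_pattern(required_rto_minutes: int) -> str:
    """Return the cheapest pattern assumed to satisfy the required RTO."""
    if required_rto_minutes <= 0:
        raise ValueError("required_rto_minutes must be positive")
    if required_rto_minutes <= WARM_STANDBY_MAX_RTO_MINUTES:
        # Tight RTO: only an already-running copy is assumed fast enough.
        return "warm standby"
    if required_rto_minutes <= PILOT_LIGHT_MAX_RTO_MINUTES:
        # Moderate RTO: core pieces stay on, the rest is scaled up on failover.
        return "pilot light"
    # Relaxed RTO: recovering from backups or snapshots is assumed acceptable.
    return "backup and restore"


if __name__ == "__main__":
    for rto in (15, 120, 1440):
        print(rto, "minutes ->", select_failover_pattern(rto))
```

Keeping the thresholds as named constants makes it easy for a central resilience team to adjust them in one place.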

Operational Readiness:

Configure necessary settings for deployment.
Develop playbooks and monitoring tools.
Ensure availability of resources to meet RTO and Recovery Point Objective (RPO) metrics and Service Level Objectives (SLOs).

Testing and Feedback:

Conduct resiliency testing and plan for outages.
Learn from outages and gather continuous feedback.
Perform retrospective analysis to confirm that the RTO and RPO targets in the SLOs are met.
Improve platform maturity through continuous learning and adaptation.
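The retrospective check can be sketched as simple timestamp arithmetic. This is a minimal example under stated assumptions: the function names and the three timestamps (outage start, service restored, last recoverable write) are hypothetical inputs that a team would pull from its own incident records.

```python
from datetime import datetime, timedelta


def achieved_rto(outage_start: datetime, service_restored: datetime) -> timedelta:
    """Time from the start of the outage until service was restored."""
    return service_restored - outage_start


def achieved_rpo(last_recoverable_write: datetime, outage_start: datetime) -> timedelta:
    """Window of writes made before the outage that could not be recovered."""
    return outage_start - last_recoverable_write


def meets_objectives(rto: timedelta, rpo: timedelta,
                     rto_target: timedelta, rpo_target: timedelta) -> bool:
    """True only when both the measured RTO and RPO are within target."""
    return rto <= rto_target and rpo <= rpo_target


if __name__ == "__main__":
    # Made-up timestamps for one test or incident.
    outage = datetime(2025, 1, 1, 12, 0)
    restored = datetime(2025, 1, 1, 12, 25)
    last_write = datetime(2025, 1, 1, 11, 57)
    rto = achieved_rto(outage, restored)
    rpo = achieved_rpo(last_write, outage)
    ok = meets_objectives(rto, rpo, timedelta(minutes=30), timedelta(minutes=5))
    print(rto, rpo, ok)
```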

Process and Tooling:

Centralized Resilience:

Organize and centralize resilience standards and frameworks.
Ensure a common set of standards across the organization.

Distributed DevOps Platform:

Build a DevOps platform that meets application engineers where they are.
Ensure consistency across all development teams.
Provide necessary tools and support for DevOps maturity.

Automation:

Look for opportunities to automate resiliency processes.
Implement self-service blueprints integrated into SDLC processes.
Provide infrastructure as code templates and quick start guides.
Ensure app teams can quickly start development with resiliency requirements built-in.

Specific Examples of USAA’s Resiliency Process

Custom Well-Architected Reviews:

Started with the Well-Architected Framework and applied it to all USAA applications.
Recognized consistent elements across applications due to company structure (centralization, DevOps platform).
Built a custom lens on top of the Well-Architected Review for more meaningful, automated scoring.
Identifies and highlights components each application can control to improve resiliency.

Shifting Left:

"Shift left" is a strategy in software development and IT service management that involves moving critical activities, such as testing, security, and quality assurance, to an earlier stage in the development or support lifecycle. This approach aims to find and fix issues sooner, which reduces costs, improves quality, and speeds up delivery.
Validate RTO early to reduce churn and improve resiliency while the application is being built.
Treat an understanding of resiliency as critical when solutioning business use cases.

Tooling Integration in CI/CD Pipeline:

Use GitLab in CI/CD pipeline.
Integrated AWS Resilience Hub and Resource Groups for insights.
Application teams receive immediate feedback on assessments during pipeline runs.
Evidence-based, real-time reporting on resilience before production implementation.
Lower environment stages also provide feedback.
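One way to picture the pipeline feedback is a small gate that compares an assessment's estimated recovery figures with the application's targets. The sketch below is an assumption-heavy illustration: the dictionary keys and function names are hypothetical and do not reflect the output format of AWS Resilience Hub or any USAA tooling.

```python
def find_violations(assessment: dict, rto_target_s: int, rpo_target_s: int) -> list:
    """Return readable messages for every target the assessment misses."""
    violations = []
    if assessment["estimated_rto_s"] > rto_target_s:
        violations.append(
            f"estimated RTO {assessment['estimated_rto_s']}s exceeds target {rto_target_s}s"
        )
    if assessment["estimated_rpo_s"] > rpo_target_s:
        violations.append(
            f"estimated RPO {assessment['estimated_rpo_s']}s exceeds target {rpo_target_s}s"
        )
    return violations


def exit_code(violations: list) -> int:
    """A pipeline job could pass this value to sys.exit to fail the stage."""
    return 1 if violations else 0


if __name__ == "__main__":
    # Hypothetical assessment result for one application.
    result = {"estimated_rto_s": 1500, "estimated_rpo_s": 600}
    problems = find_violations(result, rto_target_s=1800, rpo_target_s=300)
    for problem in problems:
        print("FAIL:", problem)
    print("exit code:", exit_code(problems))
```

Returning a list of messages rather than a bare pass or fail gives application teams the specific feedback they need to act on.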

Canary Deployment Strategy:

A canary deployment rolls out a new software version by gradually shifting a small percentage of user traffic to it.
Automated canary deployments through the pipeline.
Particularly useful for serverless and container-based applications.
Weighted routing to slowly roll out changes and test resiliency.
Allows learning from changes and understanding application resiliency as it rolls out to members.
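The weighted-routing idea can be sketched in a few lines. This is a simplified model, not the pipeline described in the talk: the function names, the step size, and the error-rate threshold are all assumptions chosen for illustration.

```python
import random


def choose_version(canary_percent: int, rng: random.Random) -> str:
    """Send roughly canary_percent of requests to the new version."""
    return "canary" if rng.randrange(100) < canary_percent else "stable"


def next_canary_percent(current: int, step: int,
                        error_rate: float, max_error_rate: float) -> int:
    """Advance the rollout by one step, or roll back to 0 if the canary is unhealthy."""
    if error_rate > max_error_rate:
        return 0
    return min(100, current + step)


if __name__ == "__main__":
    rng = random.Random(7)  # seeded so repeated runs behave the same way
    percent = 10
    hits = sum(choose_version(percent, rng) == "canary" for _ in range(10_000))
    print(f"{hits} of 10000 requests went to the canary at {percent}%")
    # Healthy canary: widen the rollout. Unhealthy canary: roll back.
    print(next_canary_percent(percent, 20, error_rate=0.001, max_error_rate=0.01))
    print(next_canary_percent(percent, 20, error_rate=0.05, max_error_rate=0.01))
```

Integer percentages are used here to avoid floating-point drift when the weight is stepped up repeatedly.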

Pattern Library Components

Quick Start Guides:

Allow developers to start quickly while ensuring consistency in development practices at USAA.
Include non-functional requirements related to resiliency.

Serverless Patterns:

Examples include an async API, a choreography pattern with EventBridge, and a real-time API with orchestration using Step Functions.
Provide quick starts for serverless applications.

Container Patterns:

Use Helm charts (packages that define, install, and upgrade applications in Kubernetes) as templates for deploying container-based applications based on specific needs.

Archival Data Patterns:

Archival moves inactive data from active systems to cheaper, long-term storage to improve performance and reduce costs. It can use tiered storage (on-premises, cloud, or hybrid) and strategies such as data partitioning to move data efficiently.
Include patterns for moving data from the application VPC to the data VPC for archival.
Automated snapshots and restore patterns for recovery from archival data VPC or local snapshots.
Ensures safe data movement and resiliency within or across regions based on application needs.

Selection Criteria:

Used to identify the best pattern for a use case based on reliability, performance, latency, and sensitivity.
Helps implement consistency in applications and choose the best resiliency architecture.

Example: Origination System for Deposits Application

Overview:

Critical system for opening new deposit accounts (checking, savings) at USAA.
Part of the new member onboarding process.
Complex multi-service architecture relying on downstream systems.

Before Multi-Region Resiliency Implementation:

Assessment showed poor performance in disaster recovery, durability, and observability.
Not resilient or available enough for members.

After Implementing Warm Standby Pattern:

Reduced RTO from 4 hours to 30 minutes.
Reduced RPO from 1 hour to 5 minutes.
Enabled failover within 30 minutes during the US East 1 outage, ensuring seamless member experience.
Significant business impact with improved recovery and data handling.

Architectural Design:

Platform account handles network traffic and routing.
Application deployed in US East 1 and US West 2 VPCs (warm standby implementation).
Use of DynamoDB global tables for near real-time replication to the secondary region.

Additional Considerations for Multi-Region Deployment

Idempotency:

Concerns whether a transaction can be posted more than once without adverse effects.
Critical in banking space where monetary transactions must not be duplicated.
Considered from a within-region resiliency perspective.

Retries and Transaction IDs:

Necessary to manage retries and maintain transaction IDs in DynamoDB.
Essential for managing state within Step Functions.
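A minimal sketch of idempotent posting keyed by transaction ID follows. It uses plain in-memory dictionaries as stand-ins for a DynamoDB table, and the names `post_deposit`, `processed`, and `balances` are hypothetical. A production version would need an atomic check-and-write (DynamoDB offers conditional writes for this kind of guard), which this single-threaded example does not attempt.

```python
def post_deposit(balances: dict, processed: dict,
                 transaction_id: str, account: str, amount_cents: int) -> int:
    """Apply a deposit once per transaction ID and return the resulting balance.

    A retry with the same transaction_id returns the recorded result
    instead of crediting the account a second time.
    """
    if transaction_id in processed:
        return processed[transaction_id]
    new_balance = balances.get(account, 0) + amount_cents
    balances[account] = new_balance
    # Record the outcome so later retries can be answered without re-posting.
    processed[transaction_id] = new_balance
    return new_balance


if __name__ == "__main__":
    balances, processed = {}, {}
    first = post_deposit(balances, processed, "txn-001", "acct-1", 5000)
    retry = post_deposit(balances, processed, "txn-001", "acct-1", 5000)
    other = post_deposit(balances, processed, "txn-002", "acct-1", 2500)
    print(first, retry, other, balances)
```

The caller generates the transaction ID once and reuses it on every retry; generating a fresh ID per attempt would defeat the check.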

State Management During Failover:

Challenge of maintaining state when failing over during a Step Function execution.
State is not maintained when moving to a secondary region.
Must design around this consideration to ensure resiliency during failover.
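The talk does not say how USAA designs around this, so the following is only one possible approach: checkpoint each completed step to a store that is replicated across regions, and let a new execution in the secondary region resume from the last checkpoint. The dictionary `checkpoints` stands in for such a replicated store, and every name here is hypothetical.

```python
class RegionFailure(Exception):
    """Simulates losing the primary region in the middle of a workflow."""


def run_workflow(steps, checkpoints: dict, execution_id: str, fail_before_step=None):
    """Run (name, action) steps in order, skipping steps already checkpointed.

    Returns the names of the steps executed during this call.
    """
    start = checkpoints.get(execution_id, 0)
    executed = []
    for index in range(start, len(steps)):
        if fail_before_step == index:
            raise RegionFailure(f"region lost before step {index}")
        name, action = steps[index]
        action()
        # Persist progress only after the step has finished.
        checkpoints[execution_id] = index + 1
        executed.append(name)
    return executed


if __name__ == "__main__":
    log = []
    steps = [
        ("validate", lambda: log.append("validate")),
        ("create_account", lambda: log.append("create_account")),
        ("notify", lambda: log.append("notify")),
    ]
    checkpoints = {}
    try:
        run_workflow(steps, checkpoints, "exec-1", fail_before_step=2)
    except RegionFailure as err:
        print("primary failed:", err)
    # The "secondary region" starts a new run against the same checkpoints.
    print("resumed steps:", run_workflow(steps, checkpoints, "exec-1"))
    print("all work done:", log)
```

Because a step could finish just before its checkpoint is written, each step still needs to be idempotent for a resume to be safe.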

Architectural Design (Phase One):

Incorporates considerations like idempotency and state management.
Ensures resiliency is designed into the failover procedure.

Operational Runbooks (Phase Two):

Describes procedures and considerations for operational resilience.
Includes handling retries, transaction IDs, and state management.

Testing and Continuous Feedback (Phase Three):

Learn from errors and discrepancies during testing and real-world scenarios.
Addresses issues like idempotency, performance degradation, cold starts, and cache hydration in the secondary region.

Key to Building Resilient Systems:

Understanding these considerations and accounting for them in the design.
Ensuring resiliency not just within one region but across multiple regions.
Critical for providing resilient services to USAA members.

USAA’s Resiliency Framework and Success Factors

100% New Applications Through Resiliency Framework:

All new applications at USAA now go through the resiliency framework.
Reassessing existing applications deployed to AWS to ensure they meet resiliency needs.

Shift to Conservative Multi-Region Strategy:

Moving more applications to a multi-region architecture based on learned best practices.
Example: the origination system's RTO improved from 4 hours to 30 minutes and its RPO from 1 hour to 5 minutes.

Improved Resilience and Architecture Reviews:

Resilience reviews and architecture reviews have become faster with increased learning and feedback.
Understanding trade-offs has accelerated the process.
Reduction in resilience-related incidents by 66% during recent outages.

Organizational Success Factors for Resiliency

Culture:

Shared responsibility among technologists, IT, and business partners.
Understanding the meaning of resiliency and defining Service Level Objectives (SLOs).

Architecture Leadership:

Executive sponsorship and dedicated funding for resiliency as a primary objective.
Using members’ needs as the mission to drive investment in resiliency.

Approach:

Started small with a single application deploying multi-region.
Learned from initial deployments and scaled with short feedback loops.
Improved platform maturity and consistency in resiliency implementation.

Tooling:

Investment in automation and self-service capabilities.
Blueprints within the SDLC and templates for Infrastructure as Code (IaC).
Ensuring developers can implement resilient practices without churn or toil.
Key to meeting business objectives while providing excellent member service.

Key Takeaways from the Session

Standardization Accelerates Adoption:

Use blueprints and templates to reduce complexity.
Simplify understanding of resiliency for both developers and business stakeholders.

Shift Left:

Move resiliency considerations earlier in the project ideation phase.
Foster partnerships with business partners to align resiliency strategy with business needs.

Continuous Testing: