
Kazuya


AWS re:Invent 2025 - FIS: High-performance instant payment processing at massive scale (IND3318)

🦄 Making great presentations more accessible.
This project aims to enhance multilingual accessibility and discoverability while maintaining the integrity of the original content. Detailed transcriptions and keyframes preserve the nuances and technical insights that make each session compelling.

Overview

📖 AWS re:Invent 2025 - FIS: High-performance instant payment processing at massive scale (IND3318)

In this video, FIS and AWS demonstrate how they built a cloud-native Money Movement Hub on AWS to process payments at massive scale. The solution handles 1,000+ payments per second with sub-5-second settlement times and 99.995% availability. Key architectural decisions include event-driven microservices on EKS, Aurora PostgreSQL with ElastiCache Redis caching, MSK for event streaming, and multi-region resiliency using native AWS cross-region replication. The team leveraged Keda for pod autoscaling, Karpenter for node scaling, and Conductor for workflow orchestration. They achieved production deployment in just nine months, now serving 30 clients with 100 signed. The presentation covers specific performance optimizations like Bottlerocket for fast pod startup, custom Spring annotations for read replica routing, and CloudWatch Application Signals for service level objective monitoring. The solution supports multiple payment rails including FedNow, TCH RTP, ACH, wire, and emerging digital currencies, with AI-driven smart routing and payment exception handling.


This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.

Main Part

Thumbnail 0

Introduction: Building a High-Performance Payment Solution on AWS

Hello and welcome everyone. To start, I'd like to ask a quick question. By a show of hands, how many of you used your mobile phone to make some sort of payment within the last week? As I suspected, practically everyone. So payments are an integral part of our daily lives, whether it is paying a few dollars for a cup of coffee or, like in my case, sending a whole bunch of dollars for my daughter's college education to her university. Behind the scenes, there is a complex and comprehensive technology infrastructure that makes this money movement and these payments possible.

In this session today, you will learn how FIS collaborated with AWS to build a high-performance and massively scalable payment solution on AWS. My name is Sameer Sharma, and I'm a Lead Principal Solutions Architect at AWS. Today I'm joined by Ade Sturley, Vice President of Money Movement at FIS, and Elango Sundararajan, who is a Senior Delivery Consultant at AWS. Together, we will share some key learnings and insights from our journey so that you can take these away and apply them as you build your own mission-critical applications on AWS.

Thumbnail 110

Let's get started. Here's a quick overview of our session today. We'll first start by telling you about FIS and the critical role FIS technology plays in global financial services. Ade will share with you why FIS decided to rethink payment solutions from the ground up. Then we'll show you a quick demo of instant payment processing. The goal here is to show you the complexity of the processing that happens to make every single payment possible.

Then we'll touch upon the technical requirements and challenges in building such a solution. But most of the time in today's session, we'll spend diving into the solution. We'll share with you the architectural design decisions that we made, the trade-offs that we balanced, and the technologies and services that we used to build this solution. We will also take a deep dive so that you can understand how we approach building massive scalability, high performance, high resiliency, and end-to-end observability. All of these are key pillars of building business-critical applications on AWS, and hopefully you'll learn some unique insights as to how we did it.

FIS's Role in Global Payments and the Changing Landscape of Financial Transactions

Then we will pivot and talk about a few things that are enabled by the architecture that we have designed, the roadmap, and finally we'll conclude by sharing some of the key learnings and lessons that you can take away. With that, I'd like to invite Ade on stage. Thank you, Sameer, and good afternoon. I'm Ade Sturley, the global product manager for enterprise payments at FIS, and I'm co-inventor of the solution that we're going to show you today. I know everybody is really keen to see the technical deep-dive solution details, but if you just bear with us for a little while, I want to give you the why behind what we did, based on our history of providing payments in the payments ecosystem around the world.

Thumbnail 250

Every day, payments move silently and securely across the globe. In fact, trillions of dollars move silently and securely across the globe. FIS is behind a lot of that movement. We're a silent, invisible player in the background processing those payments. When Sameer transferred money to his daughter today, when you bought coffee, when you checked your retirement balance, it was very likely FIS running behind the scenes. We think of money in three stages: money at rest, money in motion, and money at work. Money at rest is the secure and appropriate core banking storage of the accounts that you have. FIS provides core banking storage for the accounts you use every day.

We process 58% of large and retail organizations in the United States with our core banking, payment, and digital solutions. Money in Motion is very much about moving payments, treasury, and risk management. Today we process 16 billion transactions annually through our card and non-card networks. Money at Work is where we provide trading, lending, wealth, and retirement services, and we're managing $18 trillion in assets for our customers today. So we're not a startup; we're a big company that processes at scale.

Over 95% of the world's largest organizations use us globally. We move over $16 trillion of payments annually, and we work with over 50% of the world's most innovative companies. The scale piece of this is really important for everybody. If you imagine the years we've been working—around 60 years overall—you can start to see how we've had to evolve our payments and processes over time. This makes us unique because we've gone through many revolutions of the payment ecosystem, and this is important because we're doing this again.

Thumbnail 400

Partly this is because payment experience and expectations are changing. Everybody in this room is making payments over their mobile phone. Everybody in this room and outside wants things to be faster. I have a great story to tell you that just happened to me this weekend to give you an example of why our expectations have changed. For some reason, I accidentally overpaid a credit card, and I wanted to close the account that was associated with the bank because I'm moving banks.

I phoned up and was told to call back between 9 and 5 on a weekday. When I called back, the option was that they would refund the amount of money on my credit card and send me a check in the mail. That means I have multiple processes to do. I have to receive that check, go down to a branch, and put it into a different bank. So it's not automated. The second option was to transfer the money to an account and then it's up to me to move my money out of that account. Again, I'm into that non-automation.

We're seeing disappointment in the way we want to actually move money. I want to be able to take money out whenever I want to. I want to be able to transfer that as quickly as possible, and I want to do that in an embedded, constructive way without using multiple apps in multiple different ways and taking my time to make that process. Our expectations are causing a revolution in payments where we want everything to be done really quickly, and this causes problems for the financial institutions themselves.

Thumbnail 500

If we think about payment rails, the traditional batch types of payment rails that we're all used to are now changing. They're moving from batch to instant rails, but they're being added to. We're moving on from ACH and wire and other forms of payments, and now we're seeing all the other types of payments come to fruition, including instant payments. We're seeing things like FedNow and TCH RTP gain traction. We're also seeing additional payments coming from things like Venmo and Zelle and other types of payment-initiated services.

We also expect things to be always on. Banks have traditionally operated very much as batch-oriented or 9-to-5 services, but now we're always on. I want to make payments at any time of the day, no matter when. That changes fundamental batch operations, and fundamental operations generally, within an institution. If we think about how these were all wired previously, lots of payments were wired into digital environments and core environments, and it's like spaghetti. We're only adding to the problem when we bring new rails forward, like digital currency.

The way banks used to win was by rates. They attracted you by rates. But the trouble with that is now experience is taking over rates. The experience that we have is more important; how fast we can make things happen is more important to us. So we're seeing experience win over rates, and we're seeing the large banks win that game because of the experience they're providing.

Thumbnail 630

Of course, we have to comply with all regulations, risk management, and everything else that goes with the payment ecosystem. We're seeing a change in behavior, and now we're seeing the problems it's causing in the institutions. At FIS, we decided to rethink the way payments are made from the ground up.

Rethinking Payments from the Ground Up: The Money Movement Hub Vision

This was despite having many different payment systems. We had legacy payment systems, ACH silo systems, wire systems, and real-time systems. We deal with bill pay and Zelle, and we're a major Zelle provider. We handled these in very isolated cases because our clients needed those rails at the time they needed them. However, rails are table stakes now. The most important capability we need is in orchestration and execution.

We need more flexible orchestration and execution to determine how payments should be made. What is the fastest, safest, cheapest, and most strategic way for our financial institutions to make payments? Those are the things that matter. The value drivers are in our liquidity and risk management, not in moving money. We can move money in so many different ways, and the choices we make matter. We need to move money in a more intelligent way.

Why do I need to make a decision about whether to put something over ACH, wire, instant digital currency, or another option? Why can't that be automated? Why can't I have that offer to be automated or allow me to make that call? If I want to make a payment now, that's real-time. If I want to make it tomorrow, that's ACH. If I want to make a high-value payment, that's different. If I want to move money over a particular stablecoin, that's digital currency movement. It shouldn't matter what they are. The intelligent piece is what matters most.

Rails are not plumbing. Rails will always expand. Everything is about orchestration and execution. We have to think about how to bring all the rails together, how to orchestrate them from single entry and exit points, and how to ensure that decisioning is done in real-time. You have smart routing and intelligence behind the decisions you're making. You're providing the capability for those things to be programmable for the institutions we work with, and you're providing the mechanism for inline risk management and fraud detection.

We thought about payments from the ground up, despite having many payment systems around the globe. We reached an aha moment probably at the beginning of 2024 when working together with our AWS friends. We came up with a solution we called the Money Movement Hub. This was about putting all of those practices in place to address the challenges we were facing because of changed expectations and the challenges our financial institutions were having. We had to provide fixes and a different way of doing payment processing.

Thumbnail 810

We built a solution in the cloud, built from scratch, designed for multi-tenancy, designed to provide multi-rail access to our customers. It was designed to interact with multi-digital environments and designed to integrate with multi-core environments, both in FIS and outside of FIS. That was a specific challenge for us, given the fact that we've traditionally built these single isolated rails, which were extremely successful. We're also building this for 1,000 institutions and moving millions and millions of payments every single day across those rails.

The biggest part of this process is very much about how you bring everything together so that you protect and preserve the future of our institutions. They don't have to introduce new rails, and we'll talk a little bit about that later on.

Thumbnail 930

Thumbnail 940

Demo: Real-Time Payment Execution and Orchestration in Action

I'm going to show a little demo. I want to show this from the perspective of execution, not from the perspective of what you see from payment initiation. There are lots of people who do payment initiation. Bear with me a second while we bring this up. You should see this. I'll get Elango to go fix it for me. While we're talking about the demo, it's really important to understand what we've put together. We created a universal API for all payment types. That universal API is exposable to our institutions or their customers. That was critical for us to basically take any single payment type and drive it through those areas.

We use a standard pain.001 message, if that makes sense for anybody working in instant payments, but we made that work with ACH and the other payment types. We also created SDKs and designed those from a payment initiation perspective so we could launch those SDKs into our digital providers, who are now turning into what we call portal technologies. It was important for us to put something together that showed behind the scenes how things would operate from an execution perspective. What I'm going to show you now is more about that execution and that orchestration than actually the initiation of the payment.
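
To make the shape of that universal API a little more concrete, here is a minimal sketch of what a pain.001-style initiation payload could look like when modeled in the solution's backend language, Java. The record and its field names are illustrative assumptions, not FIS's actual API.

```java
// Illustrative only: a simplified, hypothetical model of a pain.001-style
// payment initiation request. Field names are assumptions, not FIS's actual API.
import java.math.BigDecimal;
import java.time.Instant;

public record PaymentInitiationRequest(
        String endToEndId,          // unique reference carried across the payment's life
        String debtorAccount,       // account to debit
        String creditorAccount,     // account to credit
        String creditorName,        // e.g. "Jane"
        BigDecimal amount,
        String currency,
        String remittanceInfo,      // e.g. "re:Invent tickets"
        String requestedRail,       // FEDNOW, RTP, ACH, WIRE, or null to let smart routing decide
        Instant requestedExecutionDate
) {}
```

In this reading, the same structure would carry an ACH entry, a wire, or an instant payment, with the rail either specified by the caller or left to the routing logic.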

Thumbnail 1010

Let's take a standard UI for a transaction transfer. You see your balances, you want to make a transfer, you decide to click on the transfer, you want to try and move money. You'll see a standard type of transfer money movement screen. We're trying to capture the information from the customer, we're trying to capture the information from the perspective of how much we're paying, we're trying to capture the information of who we're paying, and so on. In this case, we're paying Jane, and we're just transferring money for the purposes of paying for tickets to re:Invent. I'm going to show you what's happening behind the scenes in this area.

Thumbnail 1050

Thumbnail 1060

This is the API, and you'll see how it's basically constructed as a standard pain.001 message. But you'll also see that if I go in here and make some changes, say I want to pay $100 and then $150, you'll see the dynamic API being updated. I'm going to put Disney World tickets in this one, but I'm actually going to pay for re:Invent. You'll see how that's dynamically changed in that API. The purpose was to expose that out to everybody so that this becomes a standard way of interacting with us, whether you're doing an ACH file in terms of an ACH set of transactions and mass transactions, or whether you're doing a single instant transaction or a wire. This even works for digital currency, with a slightly different screen where you'll see a lot of FX rates, but it gives you the idea of what's going on.

Thumbnail 1100

Thumbnail 1120

Thumbnail 1140

Then I'm going to scroll down and you'll see there's a Submit Transfer. As I submit the transfer, we're doing a very quick validation check. We're making sure that that transfer can be done. You'll see as I scroll over that the areas in here are being updated. You'll see the name of Jane and the changes that I've made in that payment transfer. I'm going to make a transfer. Things happen incredibly quickly in payments. We have to transact in less than 5 seconds for instant payments. That means a conversation with the rail itself. I'm basically asking the question on a send: I want to send you some money, do you accept it? But before I even get there, there's a load of things that are happening in the background. I'm going to go back in a second and show some of these things.

Thumbnail 1150

Thumbnail 1160

Then on the other side, we have the receive. So now it's about what I'm doing as the receiving institution. Let's go back and have a quick look at those things. All the validations that we did as part of that service: we checked the core banking systems, so we actually went out via APIs calling the core bank systems to ask: is this account valid, does it have the money to make sure that transaction can happen, is that account in a good status that allows the transaction to happen? We also looked at fraud. We want to make sure that we're checking the fraud mechanisms: we're checking for transaction screening, we're checking for AML, and we're checking for transaction monitoring. We're doing that inline in the transaction, and remember that time frame is 5 seconds back and forth to the scheme. We're also posting to the core, so we're basically saying: if I want to move the $100 that I did, I now need to post that into the core as a debit for the send transaction. Everything happens with the back and forth to the scheme within that 5 seconds and at that scale.

Thumbnail 1220

On the receive side, you'll see that we did a very similar thing. Especially if it's not on us—on us is different—but if we're doing a received transaction coming from outside the scheme, then we're also going back into the account. We're checking that the account can receive that credit. We're checking if there are any particular issues with that account receiving it, and we're checking the fraud of the incoming item on both the sender and the beneficiary of those transactions. There's a lot going on in a very simple use case of an instant payment. All of that has to comply with all the regulations that happen across the multiple schemes and obviously within the time frames that are set for the experience that needs to happen.

Thumbnail 1300

Thumbnail 1310

Technical Requirements and AWS Solution Architecture Overview

I'm going to hand this now back to Sameer, who's going to go into the technical details of the solution. We'll speak a little bit later about the outcomes and what we achieved. I have had the privilege to work with Ade and the rest of the FIS team since the day they decided to rethink payments from the ground up. The very first thing we did was start collecting all the requirements. We collected all the functional and nonfunctional requirements. There's a humongous list of requirements for a solution like this, but let me highlight three key technical requirements that we had to meet, which define the key characteristics of the solution that we built.

First and foremost is performance. Each payment rail has its own performance requirements, but for the Money Movement Hub, we set the requirement to not only execute but settle payments in under 5 seconds. This applies to instant payments and real-time payments. For scalability, Money Movement Hub is a multi-tenant solution. Many financial institutions are simultaneously sending payments through Money Movement Hub, which means at any given point in time the payment volume can be exceptionally high or exceptionally low. The solution has to scale dynamically and rapidly, up and down. We designed the system to process over 1,000 payments per second.

Thumbnail 1400

Finally, on availability, this is not a 9 to 5 solution—this is a 24/7, 365-day solution. The requirement on availability was 99.995% availability. In real terms, this means we can afford less than 30 minutes of downtime in an entire year of operation. Apart from these key requirements, we also had to overcome some key challenges. Payment solutions don't live in isolation. They need to interact with existing banking solutions. A lot of banking solutions still use legacy technologies and still reside in on-premises data centers. Our solution had to seamlessly integrate with on-premises banking solutions while meeting all the requirements I highlighted.

We also wanted to make sure that as new payment rails come on, we can easily extend the system without a complete rewrite, which was the old way of doing things. One example of this is digital currency and stablecoins, which are becoming a new form of payment rail. We will talk about how that is enabled by this architecture as well. Finally, because there are so many subsystems involved in a solution like this, it is critical for us to overcome the challenge of end-to-end observability. What we ultimately care about is whether the payment that was initiated was received by the end receiver. It might touch many subsystems in between. End-to-end monitoring's job is to collect all the signals and make sure there is complete observability from start to end and up and down the stack.

Thumbnail 1480

Thumbnail 1500

Now I'll show you a very high-level functional architecture which highlights some of the main subsystems involved in such a payment processing solution. At the heart of this solution is what I call the payment processing pipeline. Essentially, payments that come in from financial institutions go through a pipeline process.

The pipeline process includes several key subsystems such as payment validation, which you saw in the demo, payment orchestration, and finally payment execution. The payments are then sent over one of many payment rails. We discussed earlier that payment rails could include FedNow, real-time payments, wire payments, SWIFT payments, and now even digital currency-based payments. While the payment is being processed through the payment processing pipeline, we not only have to interact with banking systems located in on-premises data centers, but we also have to perform real-time fraud and risk management. This is different from payments that used to settle in two days, three days, or even after twenty-four hours, where you had time to evaluate them. Here we literally have to evaluate in real time whether the transaction has a potential fraud risk.

Thumbnail 1570

Thumbnail 1580

Having seen this functional architecture, I will now build for you the AWS solution. Let's take a look at how this solution looks on AWS. The Money Movement Hub solution has been built and is running in production on AWS. However, outside of AWS there are a couple of entities. First and foremost are the financial institutions who are the users of this solution. As we said, the solution exposes APIs and the financial institutions interact via APIs. This interaction can happen over multiple channels. I'm showing here public Internet with secure HTTPS-based API interactions, but some financial institutions are already on AWS. In that case, they will access Money Movement Hub over AWS's own secure private network without needing to go over the Internet.

Thumbnail 1620

Thumbnail 1650

Secondly, we have banking systems in on-premises data centers. Given the low latency and high performance requirements of our solution, we decided to use AWS Direct Connect. For those who are familiar with it, AWS Direct Connect offers very high throughput and very low latency connectivity between AWS and your data centers, and that's what we are leveraging for on-premises connectivity. Then at a very high level, we chose a multi-account strategy. What does a multi-account strategy give you? A multi-account strategy essentially enables you to isolate different subsystems of your overall solution into isolated accounts. AWS accounts offer a beautiful isolation boundary. It can be a compliance boundary, a fault boundary, or a security boundary. But for us there was one more important thing: we had different teams working on fraud and risk solutions, so we had a separate account for them. We had a separate team working on integration with different payment rails, so we had a separate account for them as well. Then we had a core team working on the main payment processing pipelines.

Thumbnail 1690

Each of these teams had a different operating model. For example, the payment rails team was a small team that did infrastructure and application all in one team. They could work that way. The payment processing team had a dedicated infrastructure team and a dedicated applications team, so a different operating model. Having multiple accounts allowed us to have these different models work seamlessly, and we interconnected all the services using AWS PrivateLink, which goes over the AWS private network. Now let's take a look at the heart of the solution, which is the payment processing.

Thumbnail 1740

Core Technology Stack: EKS, Aurora PostgreSQL, and Event-Driven Design

First, let me touch upon the compute layer. As we highlighted, the compute layer is going to be critical in payment processing because it has to have high performance and high scalability. At the highest level, the design decision we made was to use what we call event-driven design with microservices that are decoupled and completely asynchronous processing. This architecture allows us to independently scale different microservices, which gives us immense horizontal scalability and also creates resiliency because one service can fail while all other services can continue working. We chose to use containerized compute to implement these services, and as we all know, EKS, or Elastic Kubernetes Service on AWS, is probably one of the most popular orchestration solutions for containers, so that's what we are using to orchestrate the containers. Here I'm showing you three of the key microservices.

While there are dozens and dozens of microservices in the overall system, here I'm showing you the payment validation, orchestration, and execution services just as an example, but all are built as containers and orchestrated using Kubernetes. Not only did we use these native AWS services, we also used a few open source technologies like Keda for pod scaling and Karpenter for node scaling. We also use another open source tool called Conductor for workflow management. Soon we will dive a little bit deeper to explain what these technologies are and what they offer.

Thumbnail 1860

Any high performance payment solution not only needs high performance compute, it also needs a high performance transactional database. For database, we chose Amazon Aurora with a PostgreSQL engine. Amazon Aurora is a cloud native database that is built from the ground up to support high scalability and high performance. It offers up to 15 read replicas that you can have within the same region. Its IO throughput is extremely high, and you can get IO latency as low as single digit milliseconds. For all those reasons, Aurora PostgreSQL was the perfect transactional database for our solution.

But we weren't just satisfied with having a high performance database. We also felt the need for a caching layer that provides not millisecond but microsecond latency. For this we chose Amazon ElastiCache Redis to build our in-memory cache. The in-memory cache is used by the entire payment orchestration workflow, especially to keep real-time state as we scale in real time to thousands of payments per second.
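
As an illustration of how that in-memory state layer might be used, here is a minimal sketch, assuming Spring Data Redis and hypothetical key names and TTLs, of keeping in-flight payment state in ElastiCache Redis. It is not FIS's actual code.

```java
// A minimal sketch (not FIS's actual code) of keeping in-flight payment state in
// ElastiCache Redis via Spring Data Redis. Key prefix and TTL are assumptions.
import java.time.Duration;
import org.springframework.data.redis.core.StringRedisTemplate;
import org.springframework.stereotype.Component;

@Component
public class PaymentStateCache {

    private final StringRedisTemplate redis;

    public PaymentStateCache(StringRedisTemplate redis) {
        this.redis = redis;
    }

    // Store the current orchestration state with a short TTL so stale entries expire
    // once the instant-payment window has long passed.
    public void saveState(String paymentId, String state) {
        redis.opsForValue().set("payment:state:" + paymentId, state, Duration.ofMinutes(10));
    }

    public String getState(String paymentId) {
        return redis.opsForValue().get("payment:state:" + paymentId);
    }
}
```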

Thumbnail 1960

Not only are we processing payments in real time, we are also generating humongous amounts of data. We need to store and process this data for compliance reasons, but we are in the age and era of AI. AI is fueled by data, and for all the good reasons we decided to build an S3-based data lake to store all the data generated by the system. I mentioned to you we are using an event-driven design, which means there are a lot of events being generated by the system and consumed by the system. To process these events and move them from producers to consumers, we chose to use Amazon MSK, or Managed Streaming for Kafka. For all the non-real time processing of the data that is sitting in S3 buckets, we use AWS Glue. These are very standard services, but for a high performance system, these were the right choices for the solution.
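
To make the event-driven flow concrete, here is a small sketch of publishing a payment event to an MSK topic with the standard Apache Kafka Java client. The broker address, topic name, and event payload are illustrative assumptions, not details from the presentation.

```java
// A minimal sketch of publishing a payment event to a Kafka/MSK topic.
// Broker endpoint, topic name, and payload shape are placeholders.
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class PaymentEventPublisher {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "b-1.example-msk.amazonaws.com:9092"); // placeholder broker
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        props.put("acks", "all"); // wait for all in-sync replicas before acknowledging

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            String paymentId = "pmt-12345";
            String event = "{\"paymentId\":\"" + paymentId + "\",\"status\":\"VALIDATED\"}";
            // Keying by payment ID keeps all events for one payment on the same partition,
            // preserving ordering for downstream consumers.
            producer.send(new ProducerRecord<>("payment-events", paymentId, event));
        }
    }
}
```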

Thumbnail 2000

Observability, as I mentioned, was an important element we had to address. Not only are we using all the capabilities of Amazon CloudWatch for observability, but we're also going to highlight one unique capability that some of you may not be familiar with. It is a relatively newer capability called Application Signals. This allows you to monitor at the application level, not just the infrastructure level, meaning service level objective monitoring. We'll talk a little bit more about that as well.

Thumbnail 2030

Finally, to round things out, we all know a system like this needs to be extremely secure, so we use AWS security services like KMS, Secrets Manager, and HSMs. Now that you have a picture of the overarching solution, I would like to invite my colleague Elango to really dive deep into how we addressed scalability, performance, resiliency, and observability. Elango.

Thumbnail 2090

Deep Dive: Achieving Scalability, Performance, Resiliency, and Observability

Thank you, Sameer. The solution we had to build has to provide scalability, performance, resiliency, and end-to-end observability. Let's dive into all these aspects in the subsequent slides. The backend is written in Java 21, and the front end is written in Angular. The entire solution is deployed on EKS. To provide the massive scalability that we require, we use Keda, the pod autoscaler. Keda looks at two metrics: the HTTP request count metric and the ElastiCache queue depth metric. The HTTP request count metric is a direct correlation to the number of payment APIs coming in. Each payment API call, as we saw, is a complex orchestration. For that, we use open source Conductor.

Conductor is an orchestration engine that converts all those orchestrations into workflows. Each workflow is divided into multiple smaller units of work called tasks. These tasks are put into a queue, and the queue depth is the number of tasks that we have to complete to process a payment. Keda looks into these two metrics and based on that, it scales the pods. We want the pods to be immediately schedulable and start running.
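
As an illustration of the task-and-queue model described above, here is a minimal sketch of a Conductor worker written against the open source Conductor Java client. The task name and input fields are assumptions for illustration, not the actual FIS workflow definitions, and the real validation logic would call out to core banking and fraud services.

```java
// A minimal sketch (assumed task and field names) of a Conductor worker that polls
// and executes one task type from the payment workflow's task queue.
import com.netflix.conductor.client.worker.Worker;
import com.netflix.conductor.common.metadata.tasks.Task;
import com.netflix.conductor.common.metadata.tasks.TaskResult;

public class ValidatePaymentWorker implements Worker {

    @Override
    public String getTaskDefName() {
        // Task definition name this worker polls for; the depth of such task queues
        // is one of the metrics the pod autoscaler watches.
        return "validate_payment";
    }

    @Override
    public TaskResult execute(Task task) {
        String account = (String) task.getInputData().get("debtorAccount");
        TaskResult result = new TaskResult(task);

        // Placeholder check only; the real service validates against core banking and fraud systems.
        boolean valid = account != null && !account.isBlank();
        result.addOutputData("accountValid", valid);
        result.setStatus(valid ? TaskResult.Status.COMPLETED : TaskResult.Status.FAILED);
        return result;
    }
}
```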

Thumbnail 2140

For that, we use warm pools. In the warm pools we run low priority pods, and all our money movement hub pods have high priority. So Kubernetes is using the priority class to preempt the low priority pods from the warm nodes and then schedules our high priority money movement hub pods in the warm nodes. Our pods are immediately schedulable and start running. This solution works for most cases. In cases where we need additional worker nodes, we use Karpenter, which quickly brings in additional worker nodes and attaches them to the EKS cluster.

Thumbnail 2180

The next problem we had to solve was providing enough IPs for our pods. For that, we use the VPC CNI plugin with custom networking. We use secondary non-routable IP ranges for our pods, while the worker nodes run in a routable IP space. We use /20 CIDR blocks for the non-routable pod ranges, which gives us enough IPs for our pods to scale.

Thumbnail 2210

With respect to the application, we built it using the open-closed principle: open for extension, closed for modification. Our business demands a lot of changes to the application, and as you saw, a payment is a complex orchestration workflow. One such workflow is shown here. You can see how we are using parallel tasks and how we have multiple deciders, which change the workflow based on multiple parameters. Conductor helps us a lot here. Conductor also manages multiple versions, so we can use any version of a workflow, and if there is a problem, we can roll back to an older version too. Our operations teams love Conductor. They can go to the Conductor UI, monitor what is happening across all the workflows, and rerun any workflow, so they really love it from an operations perspective.

Thumbnail 2260

When it comes to performance, we look at two aspects: pod startup time and data access. For pod startup, we use Bottlerocket, a purpose-built container operating system. We also try to keep our application Docker image small. We use multi-stage Docker builds to add only the libraries that are needed at runtime. We also carefully evaluate all the Java libraries and add only what is really necessary. This helps us keep our Docker image small. We also use Spring's lazy bean loading to load only the beans that are needed at startup. These techniques help us start our pods in under four seconds.
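
For the Spring lazy bean loading technique mentioned above, a minimal sketch of how it can be switched on in a Spring Boot application looks like this; the application class name is illustrative.

```java
// A small sketch of enabling Spring's lazy bean initialization, one of the techniques
// mentioned for faster pod startup. The class name is a placeholder.
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;

@SpringBootApplication
public class PaymentServiceApplication {

    public static void main(String[] args) {
        SpringApplication app = new SpringApplication(PaymentServiceApplication.class);
        // Beans are created on first use instead of at startup, shortening boot time;
        // equivalent to setting spring.main.lazy-initialization=true.
        app.setLazyInitialization(true);
        app.run(args);
    }
}
```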

Thumbnail 2310

Thumbnail 2340

For data access, we created custom Spring annotations to send all read SQL queries to the Aurora read replicas. We use two to three read replicas, which provides enough performance for our read queries. Anytime we need a fast write, we use a write-through cache, wherein we write everything to ElastiCache and then asynchronously update our Aurora endpoint. We are also evaluating a couple of things. We are looking at Spring Ahead-of-Time processing at build time. We are also looking at GraalVM to create native images. One of the key features available with Bottlerocket is seekable OCI, which lazily pulls container image content in parallel on demand; we are evaluating that as well. All of this gives us fast pod startup.
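
The custom-annotation approach to read replica routing can be sketched roughly as follows: an annotation marks read-only methods, an aspect flips a per-thread flag, and a routing DataSource resolves that flag to the Aurora reader endpoint. This is an assumed implementation of the pattern described in the talk, not FIS's actual code, and all names are placeholders.

```java
// A minimal sketch of routing annotated read methods to a reader DataSource.
// "reader" and "writer" keys would map to the Aurora reader and cluster endpoints
// configured elsewhere via setTargetDataSources(). Names are assumptions.
import java.lang.annotation.*;
import org.aspectj.lang.ProceedingJoinPoint;
import org.aspectj.lang.annotation.Around;
import org.aspectj.lang.annotation.Aspect;
import org.springframework.jdbc.datasource.lookup.AbstractRoutingDataSource;
import org.springframework.stereotype.Component;

@Target(ElementType.METHOD)
@Retention(RetentionPolicy.RUNTIME)
@interface ReadReplicaRouted {}

class DataSourceContext {
    private static final ThreadLocal<String> KEY = ThreadLocal.withInitial(() -> "writer");
    static void useReader() { KEY.set("reader"); }
    static void reset() { KEY.remove(); }
    static String current() { return KEY.get(); }
}

@Aspect
@Component
class ReadReplicaAspect {
    // Any method annotated with @ReadReplicaRouted runs against the reader endpoint.
    @Around("@annotation(ReadReplicaRouted)")
    public Object routeToReader(ProceedingJoinPoint pjp) throws Throwable {
        DataSourceContext.useReader();
        try {
            return pjp.proceed();
        } finally {
            DataSourceContext.reset();
        }
    }
}

class ReplicaRoutingDataSource extends AbstractRoutingDataSource {
    // Spring calls this on each connection lookup to pick the target DataSource.
    @Override
    protected Object determineCurrentLookupKey() {
        return DataSourceContext.current();
    }
}
```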

Thumbnail 2370

Thumbnail 2400

For multi-region resiliency, as mentioned, for any service that we adopt we look at whether it provides native cross-region replication. Amazon Aurora provides a global database, ElastiCache provides a global datastore, and S3 provides native cross-region replication. For Kafka, we use MSK Replicator. MSK Replicator not only replicates the messages but also replicates the consumer offsets, so if the application is running in another region, it can consume from where it stopped. We use Route 53's failover routing to route traffic based on region failures. We constantly evaluate this setup using AWS Fault Injection Service, which provides multiple scenarios where we can simulate failures. For places where we need custom scripts, we use a laptop.

Thumbnail 2420

Since we have multiple microservices running in multiple accounts, we want to provide unified observability. For that, we use OpenTelemetry. AWS Distro for OpenTelemetry is the library that we chose, with auto-instrumentation for Java and Python. It captures all the observability signals—logs, metrics, and traces—and sends them to CloudWatch Application Signals. CloudWatch Application Signals is a new application monitoring service in CloudWatch that provides an app dashboard, telemetry, tracing, root cause analysis, and end-to-end visibility.

Thumbnail 2490

From our accounts, we send data to the centralized observability account. We also use Firehose and AWS Lambda to send logs and metrics. For traces, we use trace propagation, where the same trace context is carried across multiple APIs so we get a tracer-bullet view of how the API call flows through the system. I want to highlight one important feature of Application Signals called Service Level Objectives. Service Level Objectives track objectives based on multiple service level indicators. In this example, we are tracking an objective that the receive API latency stays under one second, and we need to achieve this 99.9 percent of the time. The indicator we use here is HTTP latency. This dashboard quickly gives us a glimpse of what our goal is, what is happening with respect to that objective, and when we last met the objective.
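
Most of the instrumentation comes from the ADOT Java auto-instrumentation agent, but as a small illustration of how business context can ride along with the propagated trace, here is a hedged sketch of tagging the current span with payment attributes. The attribute keys are assumptions, not ones named in the session.

```java
// A small sketch (not FIS's code) of enriching the current auto-instrumented span with
// business context so traces can be searched by payment. Attribute keys are illustrative.
import io.opentelemetry.api.trace.Span;

public class PaymentTracing {

    public static void tagCurrentSpan(String paymentId, String rail) {
        // With Java auto-instrumentation, Span.current() is the span created for the
        // in-flight request; attributes added here travel with the propagated trace.
        Span span = Span.current();
        span.setAttribute("payment.id", paymentId);
        span.setAttribute("payment.rail", rail);
    }
}
```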

Thumbnail 2580

Production Outcomes and Future Roadmap: From Pilot to Mass Deployment

Our leadership team loves this because at any point in time when they ask what is happening with our app, we just show them this metric and they quickly understand the performance we are giving to customers. The banks also like it because it helps them from a complaints perspective. SLOs can be obtained not only from EKS workloads but also from ECS and EC2 workloads. These are some of the techniques we use to meet the business needs. I will pass on to Ade. Thank you, Elango. We are going to talk about outcomes, and it is quite incredible how much time we saved by having this technology behind us to build the product in a rapid timeframe. From concept to production was nine months: the actual build started in August 2024, and we were in production in January 2025, in pilot with our first two clients.

This was remarkable for FIS. Traditionally, in our payment systems, you would go and install, you would take time to actually deploy, you would actually onboard, and that could turn into many months. So it was a key requirement when we first started development that we would build something that would be easy to implement and easy to deploy. We turned the deployment of what we would call regular implementation into an onboarding activity, meaning we built it into the product. By doing that, however, we caused other issues within our organization. FIS has a pretty strict change management implementation profile and also has a lot of manual testing.

Even though we run sprints and very quick development internally, we traditionally have not had that kind of automation around these systems. Observability also brought a new challenge regarding how our teams were organized and built for support purposes. They needed to understand the new technologies from an orchestration perspective. We hired full stack people into our support organizations so they understood infrastructure as code, the automation that we were using, and the deployment methods that we had. We moved to a CI/CD model using Harness for CD to allow us to deploy quickly. We are now in a state where we can deploy three or four releases into production every single week. This was a complete change for FIS. We had to change our change management solutions and bring different people in to support this new approach.

We also had to introduce new automation of testing. With Harness, Zephyr, and Cucumber, we created test automation that allowed us to run those tests on a regular basis. As we implement new parts into production, we continue with that automated testing to reach our end point. That was all new for us, and we carry on.

The outcome was effectively going into production in January. We embedded the system, went through hardening, and completed our performance testing. We looked at our timings throughout the chain to meet our SLAs. We also adapted some of our core technologies to help us reach those cores more rapidly.

Once we built this solution, it got its fingers into every part of the FIS ecosystem, which allowed us to improve that ecosystem and then share it with other parts of FIS. This became a really strategic play for us. When you look at where we are now, we have 30 clients in production and have signed 100 clients. We are going to start mass migrations next year and have been adding different capabilities onto the platform.

Thumbnail 2810

We started with instant payments, starting with FedNow and TCH. We wanted to prove that we could build something at scale that would handle instant payments, and then we are adding additional rails. New payment rails are a given. I said at the beginning that payment rails are table stakes now. Adding ACH and wire are the things that everybody does. The more important part is how we embed them now in the orchestration process.

How do we get intelligence into our environment? How do we make smarter decisions on routing those payments? As we speak, we are adding digital currencies and stablecoins. We have a partnership with Circle, and that partnership is growing. We will be live in production in the hub very soon, in Q1, and it is a good example of adding another scheme very quickly into the ecosystem.

Once we got our orchestration and execution correct, rail additions are pretty straightforward. We are also building SDKs for initiation and building ways of improving the experience so that our customers, our clients, and their corporates can make quicker decisions and have that automated. The AI components of this become really crucial. There are two ways we are using AI.

One is the improvement of payment exceptions. How do you repair payments that have gone into a state where maybe a core was unavailable or the account was unable to post? We do not want to leave any payment behind on whatever rail that happens. This was critical in the way we actually built the testing automation to get us to the point of being able to quickly add on the rails.

The other way we use AI is for smart routing and intelligent routing. Now I start getting really interested about actually satisfying what I said at the beginning about customer expectations. If I want to move money today, I do not have to tell you what rail it needs to be. Let the system figure it out based on my history, based on how it is processed, and based on the configuration of my consumer base. All that comes from data.

The data lake that has been built contains a massive amount of data that is helping us and feeding us into the loop of payment execution. As we go through that payment execution, we go through the fraud data, we go through the fraud execution, we determine what that fraud looks like, and that becomes a wider signal across our rails. That becomes more important data, which feeds into the loop of the data exchange, which allows the AI to start making more decisions based on where those payment routes should be.

The next piece of this is also international, so AI is going to help us go across borders. We are adding push to card and push to account on both of the two major networks. Then we are adding universal rails for the purposes of moving money in a more efficient way, whether that is through digital currencies for off-ramping and on-ramping, or a particular rail for US payment methods. These are all critical items feeding into the next piece of this.

The technology that's been built is allowing us to rapidly add capabilities. We have 11 initiatives for next year. The initiatives cover refinement of the AI models over the data lake so that they start making decisions; the build-out of payment exception processing and payment repair processing, so that AI makes our operations more efficient; and self-service for our financial institutions. We're enabling our financial institutions to come in and make the decisions they need to make, rather than us making the decisions and rather than them having to go through manual operations.

Remember at the beginning, I talked about that expectation around automation. We're now starting to put the automation in place, which is possible because of the way we built the platform in the first place. These were all great outcomes that effectively allowed us to get to a point of being able to deploy into production every single week, and now start our migrations from our legacy platforms and decommission those platforms over time.

Key Lessons Learned: Building Mission-Critical Applications on AWS

I'm going to hand over to Sameer to do the lessons learned. From my perspective, this has been a great collaboration, and the technology has allowed us to move as quickly as we can. In this session we tried to quickly walk you through our journey, from the time FIS conceived of this solution, to gathering the requirements, architecting the solution, building the solution, and now having the solution in production on AWS in very short order. Plus, it is scaling up with a lot of customers.

Thumbnail 3120

Thumbnail 3130

Thumbnail 3140

Whenever we go through a journey like this, we learn a lot of lessons. Not everything works right the first time. So what I want to do is share four key lessons from this journey. Hopefully these will be valuable to you. First and foremost, whenever you are thinking of designing and building systems that require massive scale, start by thinking about event-driven architectures. Event-driven designs allow you to decouple various subsystems, implement them as microservices, and do completely asynchronous processing. These technologies give you massive horizontal scalability, which means you can add more compute and your system automatically scales. We did this in our solution by using Kubernetes to orchestrate our containers.

Thumbnail 3180

The second thing I wanted to highlight was high performance. To build high performance systems, especially those with real-time requirements, it's very important to identify your real-time processing pipeline. Not everything in the system needs to be real-time, but for the pieces that do, we need to optimize every single step of that pipeline. In our solution, we not only optimized the compute layer and chose a high performance database layer, we also added a high performance caching layer. We also optimized boot-up times and the size of Docker images. We started keeping warm pools of nodes so that when the system experienced load, we could instantaneously scale rather than booting cold nodes from scratch. These are all techniques that allow you to build high performance systems.

Thumbnail 3240

Now, resiliency is exceptionally important for a solution like this. In AWS, most services actually inherently offer you multi-AZ high availability, which means most services operate across multiple availability zones within AWS. That's a core architectural design principle we have for a solution like this where we could not tolerate more than 30 minutes of downtime in an entire year. We also chose to build multi-region resiliency, which means this solution simultaneously runs in at least two different AWS regions, for example, US-EAST-1 and US-WEST-2. Now, sometimes people think that building multi-region resiliency is going to be either expensive or difficult. In this case, what we showed to you was by prudently choosing some AWS services, you can actually get a long way there.

Thumbnail 3330

You can actually get a long way there because some of the services natively support multi-region replication. For example, Aurora Global Database replicates across regions, as does ElastiCache for Redis. Managed Streaming for Kafka also supports multi-region replication, and of course S3 has cross-region replication as well. To build unified end-to-end observability, it's important to think left to right, end to end, as well as top to bottom. Left to right means you cannot leave your on-premises systems as separate silos for observability. We need to aggregate all the signals (logs, metrics, traces) and bring them all together. That really is a powerful unlock for a solution like this, giving you deep visibility into what went wrong, where it went wrong, and how to fix it.

Secondly, many times we focus on infrastructure or technical metrics, but in this solution we highlighted how, by using something called Application Signals, we're not only monitoring the technical aspects of the solution (the infrastructure and the application), we're also monitoring service level objectives, which are more akin to business KPIs. So hopefully all four of these lessons will be useful to you as you build your own solutions on AWS.

Thumbnail 3410

With that, I truly want to thank you for sharing your time with us. If you'd like to continue your learning journey, I have some QR codes here that you can scan. Please do take a moment to share feedback in the mobile app. It does help us a lot to improve the sessions in the future. Thank you very much.


This article is entirely auto-generated using Amazon Bedrock.
