Kazuya

Posted on Dec 8, 2025

AWS re:Invent 2025 - Architecture lessons: Three failures and how to prevent them (DEV341)

🦄 Making great presentations more accessible.
This project enhances multilingual accessibility and discoverability while preserving the original content. Detailed transcriptions and keyframes capture the nuances and technical insights that convey the full value of each session.

Note: A comprehensive list of re:Invent 2025 transcribed articles is available in this Spreadsheet!

Overview

📖 AWS re:Invent 2025 - Architecture lessons: Three failures and how to prevent them (DEV341)

In this video, Bruno Marangoni, a Latam Senior Solutions Architect at DXC, shares three architecture failure lessons from Latin American customers. Lesson one covers a Brazilian e-commerce company with EC2 instances unevenly spread across availability zones, manual scaling, no caching, and 20 to 22 milliseconds of latency caused by poor RDS placement. Implementing Auto Scaling, caching, and load balancer consolidation reduced response time by 80 milliseconds. Lesson two describes an enterprise bank's ECS cluster that went down for six hours, costing 1.2 million reais (almost $250,000), when the security team upgraded without coordinating with the platform team, highlighting that architecture fails when processes and communication break down. Lesson three covers a company that treated dev environments as production, with no right-sizing or shutdown policies, and saved 60,000 reais (almost $11,000) per month after implementing proper financial governance and tagging strategies.

This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.

Main Part

Failure Lesson One: E-Commerce Resilience Through Balanced Architecture and Auto Scaling

Hey, nice to meet you guys. It's a huge pleasure to be here, and I'm going to present architecture lessons from customers in Latin America that I work with. These are failures they are experiencing, but we help them to prevent that, and I will teach you how you can do the same. So I'm Bruno Marangoni, I'm a Latam Senior Solutions Architect at DXC.

So our agenda starts with an introduction to this architecture perspective. Then come the three failures: number one is about resilience, number two is about security and operational excellence, and number three is about multi-account and cost efficiency. And at the end, I'm going to share some final thoughts. We are all architects, we all have failures, and also real lessons. It's all about best practices, right? So we learn about the errors or the issues from each other and we can apply them in our own perspective or scenario.

Every architecture looks perfect until it meets reality, right? So we saw a couple of days ago Amazon had a problem, and we all know the results of that problem. Some clients got stuck on the infrastructure because they tried to prevent issues, but they didn't test it, right? So that's the problem. The real problem of a perfect architecture is when we draw it and implement it, but we don't actually test it. That's the problem.

So my first failure lesson is about resilience. I have a customer, this customer is actually a Brazilian customer. They are an e-commerce fulfillment company, and they host two of the major mobile companies to sell cell phones. They sell iPhones, accessories, Motorola, and so on, and they get about 30,000 calls per day on their website. I'm just saying here this is a mix of both customers, and this is one of the problems. They don't isolate the customers, they just put them into the same architecture, which is actually a contract problem, right? So at the end, their customers are just sharing the same infrastructure.

It's just a basic three-tier application, so load balancer, web and application layer, and database. We all learned in AWS this is the most perfect web three-tier application. So load balancer, we just balance the EC2 instances or whatever, doing auto scaling, and also we do the same for our database, RDS or even a database on EC2, right, balanced over the availability zones.

But this is the real architecture for this customer. They have manual scaling, and we are talking about e-commerce, right? So they have seasonal peaks and marketing campaigns like Black Friday. With a web application, doing manual scaling is the worst part, right? So as you can see here, let me point this out. If you count them, they have about five instances in one availability zone and just one in another availability zone, and I asked them why.

Was there a reason for that? They just said no, we just added it this way. Here also, they have the primary database in a different availability zone from where most of the instances are hosted. We know AWS has a region component, and in the region we have availability zones. AWS says it's a region, you can add in whatever availability zone you want, but an availability zone is essentially a separate data center, right? In this case here, I got almost 20 to 22 milliseconds of latency when I go to this instance and call the RDS database. So I'm adding latency for the communication between the web application and the database in this scenario.

When we rearranged that and I put the primary RDS here, I decreased the time of the calls to the database. Also, we are talking about e-commerce, right? There's no caching, it's a disaster here. Also, as they do manual scaling, they don't even have an EC2 golden image to build a new instance when they need it for seasonal marketing or Black Friday or whatever. So they have to ask support to create a new instance, deploy, and install every dependency of the application before adding it behind the load balancer. Too many manual steps, right?

So here is what we did for this customer. First, balance the instances, right? If we need more availability for an e-commerce site, that is the reason to distribute the instances evenly between availability zones. If availability is not a concern for this customer, we can just put everything inside one availability zone. But we are talking about e-commerce. If availability zone one goes down, the application will go down too.
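To make the trade-off concrete, here is a minimal Python sketch comparing how much capacity survives a zone failure before and after balancing. The zone names `az-1` and `az-2` and both helper functions are illustrative assumptions, not something shown in the talk; only the "about five instances in one zone, one in another" layout comes from the speaker.

```python
def spread_evenly(instance_count: int, zones: list[str]) -> dict[str, int]:
    """Distribute instances across zones so counts differ by at most one."""
    base, extra = divmod(instance_count, len(zones))
    # The first `extra` zones each take one additional instance.
    return {zone: base + (1 if i < extra else 0) for i, zone in enumerate(zones)}


def surviving_fraction(placement: dict[str, int], failed_zone: str) -> float:
    """Fraction of total capacity left if `failed_zone` goes down."""
    total = sum(placement.values())
    return (total - placement.get(failed_zone, 0)) / total


# Layout described in the talk: about five instances in one zone, one in another.
before = {"az-1": 5, "az-2": 1}
# Hypothetical balanced layout with the same number of instances.
after = spread_evenly(6, ["az-1", "az-2"])

print("before:", surviving_fraction(before, "az-1"))
print("after: ", surviving_fraction(after, "az-1"))
```

With the unbalanced layout, losing the crowded zone leaves one instance out of six; with the balanced layout, half of the capacity remains.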

Also, we created an image, an EC2 golden image, to create an Auto Scaling group. So we have done the template for the EC2. We filled out the parameters and created the way that the Auto Scaling works. For the RDS, as we just spread the instances with Auto Scaling, there's no reason to just move the primary to here.
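As a rough illustration of "filling out the parameters", the sketch below builds the keyword arguments for an Auto Scaling group that uses a launch template and spans subnets in several availability zones. The group name, template name, subnet IDs, and sizes are hypothetical placeholders; the talk does not show the customer's actual values. The function only builds a dictionary and does not call AWS.

```python
def build_asg_params(name, launch_template_name, subnet_ids, min_size, max_size):
    """Build keyword arguments for an Auto Scaling group spanning several subnets."""
    if min_size > max_size:
        raise ValueError("min_size must not exceed max_size")
    return {
        "AutoScalingGroupName": name,
        # The launch template points at the golden image (AMI) and instance type.
        "LaunchTemplate": {
            "LaunchTemplateName": launch_template_name,
            "Version": "$Latest",
        },
        "MinSize": min_size,
        "MaxSize": max_size,
        "DesiredCapacity": min_size,
        # One subnet per availability zone, passed as a comma-separated string.
        "VPCZoneIdentifier": ",".join(subnet_ids),
        # Use the load balancer's health checks to replace unhealthy instances.
        "HealthCheckType": "ELB",
        "HealthCheckGracePeriod": 300,
    }


# Hypothetical values for illustration only.
params = build_asg_params(
    name="ecommerce-web",
    launch_template_name="ecommerce-web-golden",
    subnet_ids=["subnet-a", "subnet-b"],
    min_size=2,
    max_size=8,
)
print(params["VPCZoneIdentifier"])
```

Assuming boto3 is installed and credentials are configured, a dictionary like this could be passed as `boto3.client("autoscaling").create_auto_scaling_group(**params)`.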

One thing I didn't show you guys: an e-commerce site for mobile phones has a lot of photos, right? Because they are selling mobile phones. So there are photos for the iPhone, photos for whatever cell phone they are going to sell. They have an S3 bucket here to host the images. But I added a caching strategy here. So the first one is for the web architecture, the frontend, right, to cache the images for the photos. So I added more performance when answering the calls from customers, because they don't need to download the image again.

We are adding caching on the development parts as well. Before this, every time a customer went to the website, they would call the database and that database would answer the call with the images. So every time one customer visited, they would download everything. We added caching for the database as well. At the end of this architecture, we cut about 80 milliseconds from the response time when a customer goes to the e-commerce website. So we got significant performance improvements for this client. We actually delivered very good data in Brazil, which is good to sell to our customers as improvement.
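The database caching described here follows the general cache-aside idea: look in the cache first, and only go to the database on a miss. The sketch below is a minimal in-memory version with a time-to-live. The talk does not say which caching service the customer used, so `TTLCache`, `load_product_from_db`, and the 60-second TTL are all assumptions for illustration.

```python
import time
from typing import Callable


class TTLCache:
    """Cache-aside helper: serve from memory, fall back to a loader on a miss."""

    def __init__(self, loader: Callable[[str], str], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self._loader = loader
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: dict[str, tuple[float, str]] = {}

    def get(self, key: str) -> str:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # cache hit: no database call
        value = self._loader(key)  # cache miss or expired: go to the database
        self._entries[key] = (now + self._ttl, value)
        return value


calls = []


def load_product_from_db(product_id: str) -> str:
    """Stand-in for a real database query (hypothetical)."""
    calls.append(product_id)  # record every trip to the "database"
    return f"product-{product_id}"


cache = TTLCache(load_product_from_db, ttl_seconds=60)
cache.get("iphone")
cache.get("iphone")  # served from the cache
print("database calls:", len(calls))
```

The TTL bounds how stale a cached product can be; a shorter value trades hit rate for freshness.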

So for lesson one, the modifications we made: we optimized the image bootstrap, implemented auto scaling, balanced the web application between availability zones, and consolidated load balancers. I didn't tell you guys, but they have on average 3,000 calls per day. They had created six load balancers for development, QA, production, for other parts of the website like for admin, and I couldn't remember the last one. But they created six load balancers, one for each part of the website. We consolidated them and actually decreased the amount of money they were paying. We also implemented caching for the database layer and web layer.

So after the modifications, the e-commerce site could handle more peak customers. In this discussion, we were going to do a Black Friday. If you guys remember, the iPhone was released three or four years ago when Apple just released the iPhone in Brazil. We are talking about the most important two companies in Brazil that sell iPhones and other phones, and their websites went down. They lost money, and the mobile company lost money as well.

So in this lesson, availability is an architecture decision, not a checkbox. You have to do your trade-offs. We have done this trade-off with them. When the website went down on Black Friday, the engineering team said, "Okay, let's go and spread the application between the six availability zones in Northern Virginia." Why? There's a reason for that. It's availability. We can get more availability, but there's no need, because the average number of calls per day, 2,000 to 3,000, is too small. So there's no reason for that. We approved all that, and after we changed everything, they were satisfied.

Failure Lesson Two: When Security Hub Alerts Meet Siloed Communication in ECS Updates

Cool. Second lesson, failure lesson two. This is security plus operational excellence. For this customer, they got an Amazon ECS cluster plus Security Hub. So the cluster was for an Enterprise Bank with the most critical application running on it.

This application is a customer service platform, so it's the most mission-critical workload on ECS. Security Hub was enabled in the whole account following best practices, including the foundational security best practices. The alerts were generated, but some of them were ignored, and the version update notice for ECS was popping up every month, every day, and they just kept postponing it.

So the problem is the ECS was not updated. Is there anyone here who has had a problem with ECS that you had to update? Yeah, it's dangerous. It's dangerous. So just to let you know, GuardDuty has some features that are very good with ECS. You have to enable GuardDuty and also enable it on your ECS cluster. Here I just printed the images from the services, but the point is here. So the Security Hub alerts were warning that support for the ECS version was ending soon, right?

So the whole ticketing system was running very well. They were alerting you that you should upgrade as quickly as possible, but they were just ignoring and suppressing the alerts. Yeah, they were coming. So we work with teams to have a DevSecOps cycle to break down the silos, but sometimes the communication doesn't work, right? So the guys from the platform team were suppressing the alerts, and they didn't talk with the security team.
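One way to keep suppressed alerts from quietly piling up is a periodic audit that lists findings left suppressed for too long and sends them to both teams. The sketch below works on plain dictionaries loosely modeled on security findings; the field names (`title`, `workflow_status`, `suppressed_at`), the sample finding titles, and the 30-day threshold are assumptions for illustration, not the Security Hub API or anything shown in the talk.

```python
from datetime import datetime, timedelta, timezone


def stale_suppressed(findings, now, max_age_days=30):
    """Return titles of findings that have been suppressed for too long."""
    cutoff = now - timedelta(days=max_age_days)
    stale = []
    for finding in findings:
        if finding["workflow_status"] != "SUPPRESSED":
            continue  # only suppressed findings are audited here
        suppressed_at = datetime.fromisoformat(finding["suppressed_at"])
        if suppressed_at <= cutoff:
            stale.append(finding["title"])
    return stale


# Hypothetical findings for illustration only.
findings = [
    {"title": "Cluster version support ending soon",
     "workflow_status": "SUPPRESSED",
     "suppressed_at": "2025-01-10T09:00:00+00:00"},
    {"title": "Public S3 bucket",
     "workflow_status": "NEW",
     "suppressed_at": "2025-01-10T09:00:00+00:00"},
    {"title": "Unused security group",
     "workflow_status": "SUPPRESSED",
     "suppressed_at": "2025-05-28T09:00:00+00:00"},
]

now = datetime(2025, 6, 1, tzinfo=timezone.utc)
for title in stale_suppressed(findings, now):
    print("needs review by platform and security:", title)
```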

And one day the security team said enough. Every time this alert is getting to us, let's apply the upgrade for ECS. And what happened? Yeah, this is the documentation for ECS on the AWS documentation. Once you upgrade the cluster, you can't downgrade, right? And even worse, if you upgrade and you were on a version that is not supported anymore on the console, you cannot go back, and that's the problem here.

AWS doesn't say it in the documentation, but sometimes you can call your Technical Account Manager, your Solutions Architect, your partner and say, man, my application is down, I'm losing money, I have to go back. Yeah, they could help you. In this case it was done. So we called AWS as a partner and helped this customer downgrade the cluster again and bring down and bring up the cluster, and the application came back.

So here, the problem is the cluster was up, but the application was down, right? Some ways that the customer could have avoided this. Just look for CI/CD pipeline validation for updates and also governance and architecture. I'm not going through each one, but here the point is everything was working. They had pretty good governance, they had pretty good processes.

The problem here wasn't the architecture itself. The problem was the delay in the platform team and the security team talking to each other and planning a day to do the upgrade. The architecture failed when the process failed, and the process was broken by siloed communication in this case. Just to give you a number, this customer was down for six hours, and it cost 1.2 million reais, which is almost $250,000. The point here is not just the $250,000, but the SLA, because this was a bank, a huge bank.
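A process failure like this can be partly guarded against with a pipeline gate that refuses to run an upgrade until both teams have signed off. The sketch below is one hypothetical shape for such a check; the `UpgradeRequest` fields, the team names, and the specific conditions are assumptions, not the bank's actual process.

```python
from dataclasses import dataclass, field

# Teams that must approve before an upgrade runs (assumed for illustration).
REQUIRED_TEAMS = {"platform", "security"}


@dataclass
class UpgradeRequest:
    target: str
    approvals: set[str] = field(default_factory=set)
    tested_in_staging: bool = False
    rollback_plan: bool = False


def blockers(request: UpgradeRequest) -> list[str]:
    """List every reason the upgrade must not proceed; empty means go."""
    problems = []
    for team in sorted(REQUIRED_TEAMS - request.approvals):
        problems.append(f"missing approval: {team}")
    if not request.tested_in_staging:
        problems.append("not validated in staging")
    if not request.rollback_plan:
        problems.append("no rollback plan")
    return problems


# The situation from the talk: one team acts without the other.
unilateral = UpgradeRequest(target="ecs-cluster", approvals={"security"})
print(blockers(unilateral))

coordinated = UpgradeRequest(
    target="ecs-cluster",
    approvals={"security", "platform"},
    tested_in_staging=True,
    rollback_plan=True,
)
print(blockers(coordinated))
```

Because this upgrade could not simply be rolled back, the rollback-plan check is the condition that matters most here.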

Failure Lesson Three: Multi-Account Strategy and Cost Efficiency in Cloud Migration

Okay, cool. Lesson 3: multi-account and cost efficiency. Here it's pretty simple. The company was on a migration journey to the cloud. They migrated 70% of workloads that were not in production, so dev and QA. Every workload was running on the main instance, including EC2, ECS, with no right-sizing, no operational window, and no shutdown policy. They treated dev as production, so everything was on all the time. Also, for supporting the dev environment, they used the same SLA at the same cost as production.

So what happened? There was a lack of financial governance, fear of shutting things down without a clear policy, and a lack of tags and communication. If no one owns the cost, everyone wants the resources. The point here is the numbers. We saved this customer 60,000 reais per month, almost $11,000, and the annual saving was pretty good. So cloud efficiency should be true. It's pretty important that you know where the cost should go, and that's the cloud way.
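A shutdown policy driven by tags can be expressed as a small decision function. The sketch below is a minimal example under stated assumptions: the tag keys `environment` and `keep-running`, and the 08:00 to 20:00 weekday operational window, are hypothetical and not taken from the talk.

```python
from datetime import datetime


def should_stop(tags: dict[str, str], now: datetime,
                start_hour: int = 8, end_hour: int = 20) -> bool:
    """Decide whether a resource should be stopped outside its operational window."""
    environment = tags.get("environment", "").lower()
    if environment not in {"dev", "qa"}:
        return False  # never touch production or untagged resources
    if tags.get("keep-running", "").lower() == "true":
        return False  # explicit opt-out, owned by whoever set the tag
    is_weekend = now.weekday() >= 5  # Saturday is 5, Sunday is 6
    in_window = start_hour <= now.hour < end_hour
    return is_weekend or not in_window


monday_night = datetime(2025, 12, 8, 23, 0)
monday_morning = datetime(2025, 12, 8, 10, 0)

print(should_stop({"environment": "dev"}, monday_night))
print(should_stop({"environment": "dev"}, monday_morning))
print(should_stop({"environment": "prod"}, monday_night))
```

Untagged resources are deliberately left alone in this sketch, which is also why a tagging strategy has to come first: the policy can only act on what has an owner and an environment.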

Final thoughts. I would like you guys to just take a picture because it's important for you to understand everything, the key takeaways that we have talked about here. Awesome. You can find me on LinkedIn. Yeah, it was pretty good talking and bringing the lessons from this architecture. And yeah, that's all. Thank you very much.

Thank you, Bruno. Bruno will be next to theater two if you guys have any questions.

This article is entirely auto-generated using Amazon Bedrock.
