Kazuya

AWS re:Invent 2025 - Accelerating the Connected Future: EDA for the unpredictable (IND308)

🦄 Making great presentations more accessible.
This project aims to enhance multilingual accessibility and discoverability while maintaining the integrity of the original content. Detailed transcriptions and keyframes preserve the nuances and technical insights that make each session compelling.

Overview

📖 AWS re:Invent 2025 - Accelerating the Connected Future: EDA for the unpredictable (IND308)

In this video, Christian Mueller from AWS and Ludwig Goohsen from BMW Group present how BMW modernized their Connected Drive remote services backend using AWS serverless architecture. They detail the migration from a monolithic Java EE application running on Amazon EKS to an event-driven serverless solution using API Gateway, Step Functions, Lambda, SNS, SQS, and DynamoDB. The presentation covers their two-week pilot testing Remote Horn Blow and Remote Light Flash use cases, achieving sub-second P99 latency while processing over 1 million requests. Key improvements include 60% faster time to market, 20% cost reduction, near-zero infrastructure maintenance, and blue-green deployment capabilities. They explain architectural decisions like GraalVM native compilation for Lambda cold starts, subscription filters for efficient event routing, and a serverless vehicle simulator for testing. The solution now handles 2.5 million daily events and 100 million API calls, supporting BMW's 24.5 million connected vehicles with scalability to triple capacity by 2027.


This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.

Main Part

Thumbnail 0

Introduction: BMW and AWS Partnership on Remote Services Modernization

Hello everyone. Welcome to day one at AWS re:Invent 2025. In the next hour, we will explore how BMW accelerates the connected future with an event-driven architecture built on AWS serverless services for their unpredictable workload. When we engaged in 2023, BMW had a working solution for their remote services. It was functioning and met requirements, solving the business need. However, having a working solution did not satisfy us, and we both asked ourselves what we could do to make a good solution great by insisting on our highest standards. This is what we want to share with you today.

My name is Christian Mueller. I'm a Principal Solutions Architect with AWS and I have had the pleasure of working with BMW for the last five years. Today, I'm excited to co-present with Ludwig. Thanks, Christian. I'm also very delighted to share this breakout session with you today about one of the most widely used BMW Connected Drive services. My name is Ludwig Goohsen, and I oversee the development of the remote services backend as a Product Manager for BMW Group.

Thumbnail 100

So let's dive right into it. Sounds good. Today we will share the significant improvements BMW Connected Drive could realize by moving and modernizing their remote services application to a fully event-driven serverless solution on AWS by reimagining their current solution and taking advantage of running in the AWS cloud. Ludwig will start by providing you with an overview of what BMW Connected Drive is all about and what remote services offers to BMW customers. I will continue by diving into the remote services architecture that was in place in 2023 when we engaged. Then I will dive into the new architecture, the decisions we made, and the learnings we gained from it. Ludwig will continue by explaining how BMW iterated over this new initial architecture and improved it even further. Then at the end, Ludwig will close with the benefits BMW could realize after moving this new architecture into production earlier this year.

Thumbnail 170

Thumbnail 190

Thumbnail 200

Thumbnail 210

BMW Connected Drive: Scale and Impact of Remote Services

Let's start by establishing some context. Across the entire product portfolio, from BMW, Mini, and even Rolls-Royce, the BMW Group is providing compelling and industry-leading connected services. We call them Connected Drive. From interacting with your vehicle via the MyBMW smartphone app to in-car gaming, in-car streaming, all the way to your intelligent personal assistant based on large language models, the list is very long, and these are just a few examples. These features transform your vehicle from being a simple object that moves you from A to B into an intelligent, connected companion. Over 20 years ago, BMW connected its first vehicle. Today we have the largest fleet of connected vehicles with over 24.5 million. Out of those, we regularly update more than 10.5 million over the air. Managing this scale requires handling more than 16.6 billion requests and processing more than 184 terabytes of data each day.

Thumbnail 240

Thumbnail 260

However, with the growing fleet and more data-driven products and features, we expect those numbers—the requests and the traffic processed each day—to triple within the next two years. As you can see, we require a very strong, resilient, and scalable backend, and that's one of the reasons we have partnered with AWS for 10 years already. One of the most prominent products of Connected Drive is the MyBMW app. It's the ultimate indispensable vehicle companion that fuses our customers and their vehicles together. We recently hit the milestone of 50 million active users while maintaining a 4.8-star rating on iOS and Android. The app provides a multitude of different features, from in-vehicle live data to information about recommended services and repairs, the option to book appointments at your local dealerships, and of course the remote services. That's the topic of this breakout session.

Thumbnail 310

Remote Services empower our customers to control their vehicles from anywhere at any time. They aim to make the lives of our customers easier. Imagine you're getting ready for your commute to work and you could pre-climatize your vehicle to your preference while still being in bed. Or picture another scenario where you leave and lock your vehicle, and after a couple of minutes, you receive a push notification that you forgot to close one of the windows, and you can do so with a simple tap on the screen.

Thumbnail 370

Thumbnail 380

There are all sorts of different use cases. These are just a few examples. On the far right, you see another one that involves events from within the vehicle. If someone is trying to steal your car, you receive a push notification, and if you opt in, you will also receive videos from the inside and the outside of the vehicle. Currently, we're handling 2.5 million events every day, and that involves processing more than 100 million API calls every single day.

Thumbnail 400

Thumbnail 420

Evolution of Remote Services Technology from 2006 to Neue Klasse

Before we look at how Remote Services appeared in 2023, I want to recap how Remote Services evolved over time. As you can see, in 2006, the very first use case of door unlock was built on top of the communication standard CDMA here in the US. Let's travel back in time and listen to one of the very first Remote Services executions. It was actually done with simple voice calls. Additionally, we needed a way to remotely wake up the vehicles. If you park your vehicle and leave it for hours, days, or weeks, we cannot have it always on, so we need to remotely wake it up.

Thumbnail 440

Thumbnail 470

Thumbnail 490

Back in 2006, this was not possible because there were no always-on modems. Instead, the vehicle regularly booted its modem by itself, and the BMW call centers, which were the only ones triggering Remote Services back then, were constantly calling the vehicle until they reached it. This is quite interesting from a technological point of view, but it doesn't scale. That's why in 2009, with the first always-on modems, we introduced remote wake-up with SMS. We also added another layer of communication, which was SMS to talk to the vehicle. What we didn't do is sunset the older generation, because if you buy a vehicle, you want it to be connected for decades.

Thumbnail 500

Thumbnail 510

That's always the pattern with each new vehicle generation: we keep adding new technology while not being able to sunset any of the older ones. For example, in 2013, we added an HTTP-based communication protocol, and then in 2018, MQTT was added. Then in 2021, we became able to remotely wake up the vehicles with UDP/IP triggers. The only two vehicle generations that have been sunsetted are the first ones, simply because the communication standard is not supported anymore.

Thumbnail 580

Thumbnail 590

We didn't stop there. It's not 2021 anymore. Most of you have heard that BMW is reinventing the vehicle once again with its Neue Klasse technology. Let's stop right here at the MyBMW app, which is the front end to our Remote Services backend. What we've just seen is the brand new BMW iX3, and it's not just any new electric vehicle that pushes boundaries. It's actually the beginning of a new era because between now and 2027, BMW will release more than 40 new and updated models that will benefit from Neue Klasse technology.

Thumbnail 640

Thumbnail 650

As Neue Klasse challenged everything inside the vehicle, we as the remote services backend team asked ourselves the same questions. We do have a working solution, but is it good enough? Can we improve something? Are we future ready? To answer all these questions, I'll hand back to Christian to look at how we assessed the situation in 2023. Thank you, Ludwig. Now that we know what Connected Drive is and what remote services offers, let's take a look at the architecture that was in place in 2023. You need to know that BMW Remote Services was migrated to AWS in 2021 using a lift and shift approach. Previously, it was running in BMW data centers, containerized in OpenShift.

Thumbnail 670

Thumbnail 680

Thumbnail 690

The 2023 Architecture Assessment: Challenges of the Monolithic System

In most cases, as we have seen, a BMW customer is triggering a request to the remote service backend, for example, to locate their car. To fulfill such a request, the request is usually enriched with additional data from within BMW, and of course the request is also validated. Afterwards, the request is forwarded to the MQTT broker, which is responsible for the communication with the vehicle. At this time, Remote Services was mainly leveraging Amazon EKS and Amazon Relational Database Service. While this architecture looks quite simple, it does not tell the whole story.

Thumbnail 730

This nice little container icon you can see here is hiding the fact that there is a large monolith behind it. If we zoom into this container, you can see the stack was running on top of Linux and the Java runtime inside a Java EE server. This sounds familiar, right? Back in 2006, we developed applications in this way. However, over the last years we also learned that for a large application which needs to scale to handle millions of requests, this is probably not the best architecture. Managing the complexity also became a problem.

Thumbnail 780

Thumbnail 810

Ludwig just mentioned there was no real opportunity to sunset legacy code because they have to support functions for decades. Over time, technical debt grew. While Remote Services has served BMW customers well for decades, the team, as Ludwig mentioned, felt the service was aging and presented them with challenges. Ludwig mentioned that BMW expects to triple the number of requests in the next two years, and it will probably not stop there. Ludwig also mentioned that by 2027, BMW plans to release more than 40 new models. Time to market is very critical for remote services to offer new services or to adapt services to new vehicles.

Thumbnail 830

Thumbnail 860

As a BMW customer, you expect a premium product. This is also true for remote services. Remote services have to make sure that even under increased load, the system is reliable and provides the same low latency as today or even lower. This was already a challenge at high demand in 2023. With the increased number of requests, people were also asking what this means for AWS infrastructure cost. Will it increase in the same way? What can we do about it? Optimizing remote services for cost efficiency is also an important topic.

Thumbnail 880

Thumbnail 890

Thumbnail 900

With this unique opportunity to rebuild remote services from scratch, we also took the opportunity to talk to our key stakeholders and ask them: if you could improve just one thing in remote services, what would it be? We talked to the customer, and for the customer, having a reliable service that responds fast was the most important thing. We also talked to the DevOps engineer, who was concerned about delivering new features faster.

Thumbnail 910

Thumbnail 930

Thumbnail 940

Senior management was telling us they would like our engineers to focus on new features and not spend so much time on maintenance tasks. Ludwig was asking how we could get our AWS bill down. The reliability engineer we talked to was mostly interested in having a robust system that can scale quickly.

The Two-Week Pilot: Building an Event-Driven Serverless Architecture

We know that running a proof of concept is a considerable investment and can be quite costly. Therefore, we agreed upfront to run only an intense two-week pilot. We intentionally call it a pilot because we want to use this as an accelerator if successful, and we do not want to throw away the code we have developed. We selected two representative use cases from remote services: Remote Horn Blow and Remote Light Flash. The pilot was intended to give us the data points and the confidence that the new architecture will solve the challenges the team was facing and also achieve the goals our key stakeholders were asking for.

Thumbnail 1010

Thumbnail 1020

Thumbnail 1030

For our pilot, we started on a blank sheet with only two hard requirements. First, we must support the current interfaces that the MyBMW app is using, so we were not able to change this. Second, we also have to support the same bidirectional interface to the MQTT broker. Happily, BMW already had a vehicle mock which we used in this pilot to simulate a real vehicle because we also wanted to do load tests, and therefore everything needs to be automated.

Thumbnail 1040

We analyzed the key building blocks in remote services and identified four main building blocks. Starting at the top left, we have the event creation component which is responsible for receiving remote service requests coming from the MyBMW app. We have the outgoing vehicle communication service which is responsible for communication with the MQTT broker. On the bottom right, the incoming vehicle communication service is responsible for taking the asynchronous requests coming back from the vehicle and processing them. At the top right, the event processing component is responsible for event distribution, persistence, and lifecycle management.

Thumbnail 1100

Thumbnail 1120

In addition, we decided to implement one subscription mock so that we can measure end-to-end latency of a remote service execution. This measures from the time when we first see the request until we reply with a push notification to the MyBMW app. If you look under the hood, this is how these services look. Because we broke down these services into independent components, this helped us to work on these services independently after we agreed on the event format, which is the same for all of these components.
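To make this concrete, a shared event contract could be modeled as a small value type like the purely illustrative sketch below; the field names are assumptions rather than BMW's actual schema, but the status values mirror the ones mentioned later in the talk.

```java
// Purely illustrative sketch of a shared event format agreed on by all components.
// Field names are assumptions; only the status values are taken from the talk.
public record RemoteServiceEvent(
        String eventId,         // correlation id used across all components
        String vin,             // vehicle identifier
        String serviceType,     // e.g. REMOTE_HORN_BLOW or REMOTE_LIGHT_FLASH
        String status,          // CREATED, PENDING, RUNNING, EXECUTED, FAILED
        long timestampEpochMs) {
}
```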

Thumbnail 1190

The best way to understand how these services work together is by walking through one example use case, such as the Remote Horn Blow. It starts at the top left in the MyBMW app where a customer requests the remote horn blow, and this request is received by an API gateway. This API gateway is integrated with our web application firewall, which validates the request, and a valid request is then directly forwarded to an AWS Step Functions Express workflow. Here we are leveraging the direct service integration from API Gateway to Step Functions. This helps us keep both latency and cost low because we do not need an additional Lambda function in between. If this is new to you, you have a QR code here and also the link to our documentation.
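As a rough sketch of such a direct integration, the AWS CDK for Java (assuming a recent aws-cdk-lib v2) offers a StepFunctionsRestApi construct that wires API Gateway methods straight to an Express workflow; the workflow definition below is only a placeholder for the real event creation logic.

```java
import software.amazon.awscdk.services.apigateway.StepFunctionsRestApi;
import software.amazon.awscdk.services.stepfunctions.DefinitionBody;
import software.amazon.awscdk.services.stepfunctions.Pass;
import software.amazon.awscdk.services.stepfunctions.StateMachine;
import software.amazon.awscdk.services.stepfunctions.StateMachineType;
import software.constructs.Construct;

public class RemoteServicesApi extends Construct {
    public RemoteServicesApi(final Construct scope, final String id) {
        super(scope, id);

        // Express workflows can be invoked synchronously, which the direct
        // API Gateway integration relies on.
        StateMachine eventCreation = StateMachine.Builder.create(this, "EventCreation")
                .stateMachineType(StateMachineType.EXPRESS)
                .definitionBody(DefinitionBody.fromChainable(
                        new Pass(this, "PlaceholderForRealWorkflow")))
                .build();

        // REST API whose methods start the state machine directly,
        // with no Lambda function in between.
        StepFunctionsRestApi.Builder.create(this, "Api")
                .stateMachine(eventCreation)
                .build();
    }
}
```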

Thumbnail 1210

When this event is in our Step Functions Express workflow, we start with a parallel task step, which allows us to process an event within two or more components at the same time. While the left side is where the real business logic is happening, I would like to focus on the right part first, because this message is then forwarded to the event processing service, a component that is called multiple times during the execution of one remote service request.
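In the CDK's Java API, such a parallel step is a Parallel state with one branch per component; the sketch below uses placeholder Pass states for the two branches described here.

```java
import software.amazon.awscdk.services.stepfunctions.Chain;
import software.amazon.awscdk.services.stepfunctions.Parallel;
import software.amazon.awscdk.services.stepfunctions.Pass;
import software.constructs.Construct;

public class EventCreationDefinition {
    // Sketch only: both branches are stand-ins for the real states.
    static Chain build(Construct scope) {
        Parallel fanOut = Parallel.Builder.create(scope, "ProcessInParallel").build();
        fanOut.branch(new Pass(scope, "ValidateAndEnrich"));        // business logic branch
        fanOut.branch(new Pass(scope, "ForwardToEventProcessing")); // event processing branch
        return Chain.start(fanOut);
    }
}
```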

Thumbnail 1280

This message or event is forwarded directly from Step Functions via the direct service integration to an SNS topic. As you can see, we have three SQS queues subscribed to this SNS topic. You may also recognize this little red filter symbol right next to the SNS topic. We use subscription filters here because not all of these components are interested in all events. By using subscription filters, you can avoid unnecessary work in your downstream components, which means just consuming a message and then figuring out that you are not interested in it and throwing it away. Especially with a Lambda function, this is just compute you have to pay for where you do not get any value out of it.
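As a sketch of how such a filter can be attached with the AWS SDK for Java v2, the subscription ARN, attribute key, and status values below are placeholders rather than the actual BMW event schema.

```java
import software.amazon.awssdk.services.sns.SnsClient;
import software.amazon.awssdk.services.sns.model.SetSubscriptionAttributesRequest;

public class SubscriptionFilterSetup {
    public static void main(String[] args) {
        // Placeholder ARN of an SQS subscription on the event status topic.
        String subscriptionArn =
                "arn:aws:sns:eu-west-1:123456789012:event-status:00000000-0000-0000-0000-000000000000";

        // Only deliver final events to this queue; everything else is dropped
        // by SNS before it reaches the downstream Lambda function.
        String filterPolicy = "{\"eventStatus\": [\"EXECUTED\", \"FAILED\"]}";

        try (SnsClient sns = SnsClient.create()) {
            sns.setSubscriptionAttributes(SetSubscriptionAttributesRequest.builder()
                    .subscriptionArn(subscriptionArn)
                    .attributeName("FilterPolicy")
                    .attributeValue(filterPolicy)
                    .build());
        }
    }
}
```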

Thumbnail 1310

Thumbnail 1350

Now let's focus on the middle box, the event persister component. This component is responsible for storing every event in DynamoDB so that we have a history of all events we have processed. This process is quite straightforward, and after this event is processed, the request is in the status "created" in our database. The only thing to mention here is that to keep latency and cost low, we use batching in the integration from SQS to our Lambda function. We do not use a batch window, but a batch size of ten provided us with the best balance between speed and cost efficiency.
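A hedged sketch of such an event source mapping with the AWS SDK for Java v2 is shown below; function and queue names are placeholders, and leaving the batching window unset means Lambda does not wait to fill a batch.

```java
import software.amazon.awssdk.services.lambda.LambdaClient;
import software.amazon.awssdk.services.lambda.model.CreateEventSourceMappingRequest;

public class EventPersisterBatching {
    public static void main(String[] args) {
        try (LambdaClient lambda = LambdaClient.create()) {
            // Batch size of ten, no batching window: records are delivered to the
            // event persister function as soon as they are available.
            lambda.createEventSourceMapping(CreateEventSourceMappingRequest.builder()
                    .functionName("event-persister")                              // placeholder
                    .eventSourceArn("arn:aws:sqs:eu-west-1:123456789012:event-persister-queue")
                    .batchSize(10)
                    .build());
        }
    }
}
```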

Thumbnail 1420

At the same time, in the top right, the event distribution layer also processed this event. In this Lambda function, we transform the event into the format an external subscriber is expecting. In this pilot, we only subscribed to final events because our subscription mock is only interested in final events. In this component, we calculate the entire time it takes from seeing the request for the first time until we have pushed the notification about the successful execution of the request to the MyBMW app. We do this by leveraging the embedded metric format. This is an easy way where you just log a predefined JSON structure to standard out, and CloudWatch Logs takes care of creating a custom metric for you. In our case, this was a metric of the duration between the timestamp when we first saw the request in API Gateway, which we store in DynamoDB, and the time when we see the final event here.
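A minimal sketch of such an embedded metric format log line is shown below; the namespace, dimension, and metric names are assumptions for illustration only.

```java
// Logging this JSON structure to standard out is enough for CloudWatch Logs to
// create a custom metric from it (embedded metric format). Names are illustrative.
public class EndToEndLatencyMetric {

    public static void emit(String remoteServiceType, long durationMillis) {
        String emf = String.format(
                "{\"_aws\":{\"Timestamp\":%d,\"CloudWatchMetrics\":[{"
                        + "\"Namespace\":\"RemoteServices\","
                        + "\"Dimensions\":[[\"RemoteServiceType\"]],"
                        + "\"Metrics\":[{\"Name\":\"EndToEndLatency\",\"Unit\":\"Milliseconds\"}]}]},"
                        + "\"RemoteServiceType\":\"%s\","
                        + "\"EndToEndLatency\":%d}",
                System.currentTimeMillis(), remoteServiceType, durationMillis);
        System.out.println(emf);
    }
}
```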

At the bottom of this event processing box, we have the event lifecycle component. This is a component we discussed quite heavily regarding what the best architecture would be. For this, you have to know that within remote services, at a given time there can be only one active or in-flight remote service per vehicle. This means that for every request triggered by the MyBMW app, we have to make sure it ends in a final state, whether it was processed successfully or failed. Imagine you park your car in a spot where you do not have connectivity: without this component, the request would stay in the pending state forever because the car never receives the message and never responds.

Thumbnail 1530

In this component, we have a Lambda function which calculates the maximum expected time for an event status change depending on the remote service type and the status. Our solution puts a new message into a second queue, which is an SQS delay queue. For example, if you expect a status change within the next 10 seconds, we would add this message to the second queue with a delay of 10 seconds. The message remains in the queue but cannot be read by the subscribed Lambda function until this time expires.

After the 10 seconds elapse, the message becomes visible to the Lambda function consuming this event. It checks DynamoDB to determine whether a state change occurred in the interim. If yes, everything is good and we can drop the message. However, if there was no state change, this Lambda function generates a new event into the event processing component indicating that the request execution failed.
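The scheduling half of this component boils down to a single SQS call; a sketch with the AWS SDK for Java v2 could look like this, with the queue URL and message body as placeholders.

```java
import software.amazon.awssdk.services.sqs.SqsClient;
import software.amazon.awssdk.services.sqs.model.SendMessageRequest;

public class EventLifecycleTimer {

    // The message stays invisible for `expectedSeconds` (max 900); only then can
    // the consuming Lambda read it and compare the current status in DynamoDB.
    public static void scheduleTimeoutCheck(SqsClient sqs, String delayQueueUrl,
                                            String eventId, int expectedSeconds) {
        sqs.sendMessage(SendMessageRequest.builder()
                .queueUrl(delayQueueUrl)
                .messageBody("{\"eventId\":\"" + eventId + "\"}")
                .delaySeconds(expectedSeconds)   // e.g. 10 seconds for a horn blow
                .build());
    }
}
```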

Thumbnail 1620

Let us return to the event creation service on the left side where the real business logic is happening. In parallel to what I have explained, we are processing the incoming request. Here again we are using the parallel state which allows us to validate and enrich the event to save time. BMW is heavily using Java, and our Lambda functions are implemented in Java. To keep the Lambda cold start latency low and costs low, we are using GraalVM to compile these Lambda functions to native code, which means you do not need a Java runtime to run it. It is simply a native binary you can execute, and it helped us significantly lower the latency.

After these Lambda functions execute successfully and the event is enriched, it goes into another parallel step. On the right side, we send this event to the event processing component and eventually the status changes to pending. On the left side, we put the message into an SQS queue which belongs to the outgoing vehicle communication service. This message is consumed by an AWS Fargate service, which translates this event into an MQTT message and forwards it to the MQTT broker.

When the vehicle is connected to the MQTT broker, it receives the message and starts working on this request. Asynchronously, the vehicle sends back a message to the MQTT broker acknowledging receipt of this message. This message is consumed by an AWS Fargate service running our incoming vehicle communications service. This Fargate service simply forwards this message to our second step function, which then in parallel forwards this event to our event processing service. The event processing service updates the status in our database to running, so we know that the request was received by the car and the car has acknowledged receipt.

Thumbnail 1780

Depending on the remote service type and the status or event type which we received, we have to execute a more complex or much simpler workflow to work on this response from the vehicle. In our use case, there is nothing else to do and we have a very simple workflow here where we do not have to do anything else. When the vehicle finishes the execution of this request, it sends another asynchronous message in the exact same way as I have explained to the MQTT broker, which is then processed in the same way, and the status in the event processing component is changed to executed.

Thumbnail 1840

Thumbnail 1860

Pilot Results: Achieving Goals in Time to Market, Maintenance, Scalability, and Cost

Now let's take a look at the results of this pilot and what we achieved. Let's go through the goals we set for ourselves at the start of the pilot and see how we performed. One of our first goals was to significantly improve time to market. If you recall the high-level architecture Christian just explained, you see the four main components and how they are broken down into these microservices. By breaking down our previous monolithic architecture into these small building blocks, each responsible for a certain piece of business logic, we established a very strong separation of concerns.

Thumbnail 1890

Thumbnail 1910

What this allowed us to do was split up and parallelize the work. During the pilot, we had several two-person teams working on completely independent parts of the application, and this significantly sped up the entire development process. Additionally, what we introduced with this architecture was very high extensibility. For example, if you were to add a new remote service use case, it's mainly not code that we have to write. It's mainly going through all these little modules and deciding if it's relevant for the use case or not, so mainly it's adjusting subscriptions. Of course, we have to serialize and deserialize the payloads to and from the vehicle, but other than this, we can already introduce a new command.

Thumbnail 1920

Thumbnail 1950

Additionally, if you look at the Step Functions in the event creation box, we can easily extend it. If there's a requirement to do additional pre-checks, we can call another API and perform another conditional check. Also in event processing, these modules are all subscribed to the event status topic. This can be easily extended. For example, we are running in more than 100 different countries worldwide, and they all have different requirements from authorities. Sometimes we need to disclose certain information when we fetch data from vehicles. We can just add a regulatory disclosure module, have it subscribe to the relevant topics, vehicles, and commands, format the data, and push it out to the authorities. So it's very extensible, and we clearly reached that goal.

Thumbnail 1980

Thumbnail 2000

Thumbnail 2010

Thumbnail 2020

Thumbnail 2030

Thumbnail 2040

Thumbnail 2050

A second goal of ours was to reduce the maintenance effort for the team to run remote services. I'm going to do this by comparing the legacy architecture and the serverless pilot side by side. First, at the top, let's compare the code complexity and code maintenance. It's identical: in both worlds, we had to take care of our framework, the Java JDK, and all of our dependencies. Far more interesting is the infrastructure maintenance. In the legacy world, we were using a proprietary API gateway where we had to update its versions and install security patches. We were running on Kubernetes, so we had to keep track of all those versions and check for breaking changes. The containers inside the cluster were responsible for the operating system and the Java runtime. We were using an Aurora RDS database, which also comes with versioning. And since we're not supposed to persist the events forever, we had another Node.js script that would regularly delete old events, as Aurora doesn't have anything like this built in.

Thumbnail 2060

Thumbnail 2070

Thumbnail 2080

Thumbnail 2090

If you compare this now with the new serverless world, we use Amazon API Gateway, and it's fully managed by AWS for us. We don't have to take care of anything. Then instead of Kubernetes, we run Fargate, Lambda functions, Step Functions, queues, and topics. They're also all managed by AWS, so we don't have to take care of them. You might say that the Java runtime is still attached to the Lambda function, but as Christian just explained, since we are building natively with GraalVM and only pushing binaries to Lambda, there's also no Java runtime to take care of. Then instead of Aurora RDS, we use DynamoDB, also fully managed, and it brings time-to-live functionality at the item level: DynamoDB deletes the events for us after a certain amount of time.
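Enabling that behaviour is a one-time table configuration; a sketch with the AWS SDK for Java v2 could look like the following, with the table and attribute names as placeholders.

```java
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.TimeToLiveSpecification;
import software.amazon.awssdk.services.dynamodb.model.UpdateTimeToLiveRequest;

public class EnableEventTtl {
    public static void main(String[] args) {
        try (DynamoDbClient dynamo = DynamoDbClient.create()) {
            // DynamoDB deletes items automatically once the epoch timestamp stored
            // in the "expiresAt" attribute has passed; no cleanup script required.
            dynamo.updateTimeToLive(UpdateTimeToLiveRequest.builder()
                    .tableName("remote-service-events")          // placeholder table name
                    .timeToLiveSpecification(TimeToLiveSpecification.builder()
                            .enabled(true)
                            .attributeName("expiresAt")
                            .build())
                    .build());
        }
    }
}
```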

Thumbnail 2120

As you can see, with the new serverless pilot we stripped away almost all maintenance effort. The only remaining effort was the vehicle connector running on Fargate.

Thumbnail 2160

In this new architecture, we also have to ensure that we can scale to the level BMW needs, even under sustained load. For this, we set up a load test to verify and demonstrate this capability. Here on the left side, you can see our Artillery configuration. We are using Artillery, a well-known load testing tool. In the first two minutes, we slowly increased from one request per second to five. In the next five minutes, we increased the load from five to one hundred requests per second. Then, over the next almost three hours, we ran a sustained load of one hundred requests per second. You can also see that we configured two types of remote services: Remote Horn Blow and Remote Light Flash.

Thumbnail 2180

Thumbnail 2200

Thumbnail 2240

The result of this new architecture was that we could run this sustained load for almost three hours and process more than one million events or requests without a single error. Because we were moving from a monolithic application to a distributed application, people had concerns that the added latency would not be acceptable and that we might not be able to keep the P99 under one second. During the load test, we also measured the end-to-end latency with our custom metric. Our P99 was well below one second. If you look at the P50, we could demonstrate that every second message was processed in less than four hundred milliseconds.

Thumbnail 2250

Thumbnail 2280

If you are wondering what this latency spike is, we looked into it after our pilot and solved the issue. It was caused by our pilot setup not configuring HTTP request timeouts properly. Now, when a request does not come back within a second, we simply drop the request and retry. This improved the P100 metric as well.
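A minimal illustration of that pattern with the JDK's built-in HttpClient is sketched below; the one-second values mirror the talk, while the single retry is deliberately simplified.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.net.http.HttpTimeoutException;
import java.time.Duration;

public class TimeoutAndRetry {

    private static final HttpClient CLIENT = HttpClient.newBuilder()
            .connectTimeout(Duration.ofSeconds(1))
            .build();

    static HttpResponse<String> callWithRetry(String url) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofSeconds(1))   // drop requests that take longer than a second
                .build();
        try {
            return CLIENT.send(request, HttpResponse.BodyHandlers.ofString());
        } catch (HttpTimeoutException e) {
            // One immediate retry on timeout; production code would bound retries and back off.
            return CLIENT.send(request, HttpResponse.BodyHandlers.ofString());
        }
    }
}
```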

Thumbnail 2310

Cost is an important aspect for BMW as well. After we defined the architecture, we used the AWS Calculator to calculate the expected AWS infrastructure cost. The quick rough estimate already indicated that we could substantially reduce AWS infrastructure costs with this new architecture. The AWS Cost Explorer provided us with the proof after running the load test. When we provisioned the pilot architecture, every AWS service was tagged, which helped us attribute the AWS cost to each service. After we completed this load test, we looked up in Cost Explorer the AWS cost broken down on an hourly basis.

For services like Lambda, SNS, SQS, Step Functions, and API Gateway, which come with a pay-as-you-go model, we could simply take the cost of executing one million requests and extrapolate this to the scale BMW typically sees in a month. For other services like Fargate and DynamoDB storage cost and CloudWatch storage cost, we also took the cost from running this load test for three hours and extrapolated the cost of running it for a full month. The result of this load test showed us that we were able to decrease AWS infrastructure costs by twenty percent.

Enterprise Implementation: Four Key Optimizations from Pilot to Production

After we had successfully finished the pilot and reached all the goals, we decided to build the full enterprise solution. During the implementation, we went through a couple of iterations, and I want to take you through the most interesting ones.

Thumbnail 2420

Thumbnail 2430

Thumbnail 2440

Thumbnail 2460

For the first optimization, let's look at the communication towards the vehicle via MQTT. So far in the presentation, we've always abstracted that away by saying the vehicle is connected to the back end with an MQTT broker residing in another account. Let's zoom in a bit further here. In our legacy architecture, we connected to the broker inside our monolith running on Kubernetes by implementing the BMW libraries wrapped around the broker. The first step when we implemented the pilot was moving that into a container running on Fargate. That still implemented the libraries and directly connected to the broker, and it did work. However, we were not completely happy with this approach.

Thumbnail 2480

Thumbnail 2500

Thumbnail 2530

We reached out to the broker team and discussed some ideas for improvements. Luckily, they were already working on something. They would build an API that abstracts away how we place messages on the broker. They would offer an API that we could simply invoke from within our new account, pass the payload we want to send to the vehicle, and reference a topic. They would then place that message onto the broker for us. There are many benefits to this approach. With different vehicle generations, we have different framework versions and library versions, but it's all abstracted away now. We just have to call an API. For the MQTT topics we're telling the broker to subscribe to, they would receive that message and place it into our incoming vehicle message queue.

Thumbnail 2550

Thumbnail 2570

Just like that, we were able to remove the connector on Fargate that was directly connected to the MQTT broker. With serverless components, we're just invoking an API and receiving the results in our queue. If you recall the maintenance slide from before where I mentioned there's some maintenance left, now having that serverless connection towards the MQTT broker, we've almost stripped down all of the maintenance efforts. There are still some certificates to handle and some secrets, but no more software maintenance.

Thumbnail 2590

Thumbnail 2600

Then comes optimization number two: the vehicle simulator. In order to validate new vehicle generations, new commands, and new use cases, we cannot always rely on proper hardware. Sometimes the prototypes are built very late in the development stage, so we need some sort of vehicle mock. We have to take the car out of the picture. At the top, you see the legacy implementation. We had already built a very high-level vehicle simulator that was running on Fargate, connecting to the broker as a vehicle. This implied having all the different security aspects in place, such as certificate handling, so it was quite a complex project. We could only mock a few use cases, and just the good case where everything is working.

Thumbnail 2650

Thumbnail 2690

Now with the serverless connection to the broker, you see at the bottom, we thought we could do something else. We can mock that API from the broker in our own account and use a Lambda function to place mocked messages and vehicle messages into our incoming vehicle queue. Just like this, we could remove all the broker and all the legacy stuff out of the picture and simulate thousands and thousands of use cases, edge cases, and error cases. Mainly, it's just configuration—what type of messages we're placing into our own queue. This allowed us to shift left the entire validation phase. At BMW, there are entire departments that would later test all the use cases manually with a real vehicle. By having this simulator in place now, we can shift left all that validation.
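A hypothetical sketch of such a simulator function in Java is shown below; the queue URL, payload fields, and handler signature are assumptions, the point being that a test case is turned into a mocked vehicle message on the incoming queue.

```java
import java.util.Map;
import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import software.amazon.awssdk.services.sqs.SqsClient;
import software.amazon.awssdk.services.sqs.model.SendMessageRequest;

public class VehicleSimulatorHandler implements RequestHandler<Map<String, String>, String> {

    private static final SqsClient SQS = SqsClient.create();
    private static final String INCOMING_QUEUE_URL =
            System.getenv("INCOMING_VEHICLE_QUEUE_URL");   // placeholder configuration

    @Override
    public String handleRequest(Map<String, String> testCase, Context context) {
        // The test case describes the scenario to simulate, e.g. an acknowledgement,
        // a final "executed" message, or an error case a real car would send.
        String mockedVehicleMessage = String.format(
                "{\"eventId\":\"%s\",\"status\":\"%s\"}",
                testCase.get("eventId"),
                testCase.getOrDefault("status", "EXECUTED"));

        SQS.sendMessage(SendMessageRequest.builder()
                .queueUrl(INCOMING_QUEUE_URL)
                .messageBody(mockedVehicleMessage)
                .build());
        return "SIMULATED";
    }
}
```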

So while having the connector and the simulator in place, we went on to blue and green deployments.

Thumbnail 2710

Thumbnail 2720

When we were updating our infrastructure with Terraform, there was mostly downtime involved. If we decommissioned and commissioned new Lambda functions or changed something in the Step Function or API Gateway, we experienced significant downtime. We experimented with Lambda aliases and API Gateway stages, but we realized we couldn't completely separate the customer traffic from the new deployments.

Thumbnail 2730

Here's what we did instead. We added an Application Load Balancer in the very front and gave it target groups. Based on weights, it would route the traffic to either a blue or a green execution environment. We duplicated all our relevant components, and we can easily do so because it's pay as you go. We don't pay any extra for it. There are also some shared components like the persistence layer, but all the main components have been duplicated.

Thumbnail 2770

We added a custom header to the Application Load Balancer, which would allow us to override the weights. We could have all the customers run on blue, deploy a new version on green, and then use the vehicle simulator to run tests against the green environment. Only if all these test cases are 100 percent successful do we switch the traffic over canary-style from blue to green.
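A rough CDK (Java) sketch of that listener setup is shown below; the header name and the 90/10 split are illustrative, not BMW's actual configuration.

```java
import java.util.List;
import software.amazon.awscdk.services.elasticloadbalancingv2.AddApplicationActionProps;
import software.amazon.awscdk.services.elasticloadbalancingv2.ApplicationListener;
import software.amazon.awscdk.services.elasticloadbalancingv2.ApplicationTargetGroup;
import software.amazon.awscdk.services.elasticloadbalancingv2.ListenerAction;
import software.amazon.awscdk.services.elasticloadbalancingv2.ListenerCondition;
import software.amazon.awscdk.services.elasticloadbalancingv2.WeightedTargetGroup;

public final class BlueGreenRouting {

    static void configure(ApplicationListener listener,
                          ApplicationTargetGroup blue,
                          ApplicationTargetGroup green) {

        // A custom header lets internal test traffic (e.g. the vehicle simulator)
        // force the green environment, independent of the weights.
        listener.addAction("ForceGreen", AddApplicationActionProps.builder()
                .priority(10)
                .conditions(List.of(ListenerCondition.httpHeader(
                        "X-Deployment-Target", List.of("green"))))
                .action(ListenerAction.forward(List.of(green)))
                .build());

        // Default action: weighted split between blue and green, shifted
        // canary-style once the green environment has passed all tests.
        listener.addAction("WeightedDefault", AddApplicationActionProps.builder()
                .action(ListenerAction.weightedForward(List.of(
                        WeightedTargetGroup.builder().targetGroup(blue).weight(90).build(),
                        WeightedTargetGroup.builder().targetGroup(green).weight(10).build())))
                .build());
    }
}
```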

Thumbnail 2810

This was the outbound traffic, where customers trigger the app, API Gateway is invoked, and we send a message to the vehicle. Now the interesting question is what we do on the way back. When the vehicle answers us and acknowledges that it received one of the messages, we get the message on our queue. We use a Lambda function to deserialize the payload, but then we actually don't know which environment it was triggered on. Was it blue or was it green? During the deployment when we have half the traffic on either environment, we want to make sure that it's running in the same environment it was triggered on.

Thumbnail 2850

Thumbnail 2870

When we send messages to the vehicle, we lose all context, and in our stateless system we don't know which environment it was. So we invoked another Step Function. Within that Step Function, we fetch data from DynamoDB using the event ID we created during event creation, next to which we persisted the execution environment. We look up the event ID we received from the vehicle and see, for example, that it was triggered on blue. Having found the result, we take the right-hand path of the Step Function and place the message onto the blue topic.

Thumbnail 2890

Thumbnail 2900

Thumbnail 2910

But also, if you recall, there are some events from the vehicle that the user didn't trigger at all, like a theft event when someone is trying to steal your car. When we fetch that event ID, we won't find anything in the database. We then follow the left path of the Step Function and fetch another table that reflects the weights of the Application Load Balancer. That way we know whether it's fifty-fifty or eighty-twenty, and then we also know where to process the message and which topic to put it on. Just like this, we have increased our resilience tremendously.
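A simplified Java sketch of that routing decision could look like the following; table and attribute names are assumptions, and the weights lookup is stubbed out.

```java
import java.util.Map;
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
import software.amazon.awssdk.services.dynamodb.model.GetItemRequest;
import software.amazon.awssdk.services.dynamodb.model.GetItemResponse;

public class ReturnPathRouter {

    private final DynamoDbClient dynamo = DynamoDbClient.create();

    // Table and attribute names are illustrative placeholders.
    String resolveEnvironment(String eventId) {
        GetItemResponse response = dynamo.getItem(GetItemRequest.builder()
                .tableName("remote-service-events")
                .key(Map.of("eventId", AttributeValue.fromS(eventId)))
                .build());

        if (response.hasItem() && response.item().containsKey("environment")) {
            // Customer-triggered request: keep it on the environment (blue/green)
            // it was originally created on.
            return response.item().get("environment").s();
        }

        // Vehicle-initiated event (e.g. a theft alarm): there is no originating
        // request, so fall back to a table mirroring the load balancer weights.
        return environmentFromWeightsTable();
    }

    private String environmentFromWeightsTable() {
        // Sketch only: read the current blue/green weights and pick accordingly.
        return "blue";
    }
}
```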

Thumbnail 2930

The very last optimization I want to share is developer decoupling. In order to provide quality releases, we set up different stages in our accounts, so like the production environment and the integration and test environment. Developers, of course, want to be able to quickly smoke test their new implementations locally. With the legacy approach, this was quite easily doable by just running a local Docker container and spinning up that implementation. Well, in the serverless world, it became a bit harder because people could invoke their Lambda functions locally, but for end-to-end testing, you need the whole chain. You need Step Functions and queues, and you want to check the subscriptions on your topics.

Thumbnail 2970

Thumbnail 2980

Thumbnail 3000

So the developers needed to deploy to our test stage and then run the tests. With several developers in place, this became a bottleneck: because there were conflicts, we had to sequence the deployments and the testing. So we thought back to what we had done for the blue-green deployment, where we had duplicated all components. We thought, if we can duplicate them, why not split them into more parts and give each developer their own set of components? We have a CI pipeline, and each developer can now simply commission or decommission their own set of components. Since it's all pay-as-you-go, we don't pay anything extra here.

Thumbnail 3020

Success Summary: Benefits Realized from the Serverless Transformation

To summarize the presentation, I think most of you have guessed it already: this was a full success story for us. We successfully completed the pilot, implemented the enterprise solution, and migrated all the traffic onto it.

Thumbnail 3040

I want to quickly summarize all the benefits we're receiving from it. First, we have almost limitless scaling because AWS takes care of that for us. Our site reliability engineers don't have to worry in the middle of the night anymore. Additionally, we improved time to market significantly, by 60 percent. We reduced the maintenance effort quite a bit and therefore have considerable efficiency gains. That's also one of the reasons half of our team can now focus on other innovations: we can extend and run the stack with fewer people.

We could reduce our costs. There are no idle non-production environments producing a lot of costs anymore. With the serverless simulator in place and the shifting left of the entire validation phase, we save tens of thousands of euros because entire BMW departments have to test less later on. We could do all this while keeping the latency the same. If you come from a monolithic architecture and move to a distributed event-driven system with all these components connected via the network, it typically adds latency. We did some iterations and tweaking and could actually keep the latency identical while still benefiting from all these serverless innovations.

With this, I'd like to thank you all for coming.


This article is entirely auto-generated using Amazon Bedrock.
