<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Álvaro López Espinosa</title>
    <description>The latest articles on DEV Community by Álvaro López Espinosa (@alvaroloes).</description>
    <link>https://dev.to/alvaroloes</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F496567%2F5ba0b16e-3c2b-424e-84e7-1652301d6cd4.png</url>
      <title>DEV Community: Álvaro López Espinosa</title>
      <link>https://dev.to/alvaroloes</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/alvaroloes"/>
    <language>en</language>
    <item>
      <title>Booster Framework raises the bar in scalability</title>
      <dc:creator>Álvaro López Espinosa</dc:creator>
      <pubDate>Fri, 26 Jan 2024 11:50:19 +0000</pubDate>
      <link>https://dev.to/boostercloud/booster-framework-raises-the-bar-in-scalability-28l4</link>
      <guid>https://dev.to/boostercloud/booster-framework-raises-the-bar-in-scalability-28l4</guid>
      <description>&lt;p&gt;Thanks to the collaboration with our partners using &lt;a href="https://www.boosterframework.com/"&gt;Booster Framework&lt;/a&gt; in Azure, version 2.2 has completely &lt;strong&gt;redesigned how the event processing is handled&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;It now combines &lt;a href="https://learn.microsoft.com/en-us/azure/cosmos-db/change-feed"&gt;Azure Cosmos DB change feed&lt;/a&gt; and &lt;a href="https://learn.microsoft.com/en-us/azure/event-hubs/event-hubs-about"&gt;Azure Event Hubs&lt;/a&gt; so that it can process virtually any number of events without affecting the rest of the application, while still keeping &lt;strong&gt;very low processing times&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;How it works&lt;/h2&gt;

&lt;p&gt;Before this change, event processing was driven solely by the Azure Cosmos DB change feed, which in most situations forced Booster to use a single processing unit (&lt;a href="https://learn.microsoft.com/en-us/azure/azure-functions/functions-overview"&gt;Azure Function App&lt;/a&gt;) for all events. This approach can still process almost any number of events, but when thousands of events are registered within a few seconds, you can experience long delays: read models take tens of seconds to reflect changes, event handlers run long after the corresponding event is registered, and so on.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/boostercloud/booster/releases/tag/v2.2.0"&gt;Booster 2.2&lt;/a&gt; now uses Azure Cosmos DB change feed just to send the event to Azure Event Hubs, where it is &lt;strong&gt;distributed optimally among partitions and sent to the corresponding event processor&lt;/strong&gt;. The great advantage of this approach is that we can have many Azure Function Apps (not just one) processing events in parallel, &lt;strong&gt;reducing the latency to the minimum&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Despite this extra level of concurrency and parallelism, all the guarantees Booster offers about data consistency and event ordering are preserved.&lt;/p&gt;

&lt;p&gt;Of course, this change doesn’t affect the developer experience or the code of any current Booster application.&lt;/p&gt;

&lt;h2&gt;Conclusions&lt;/h2&gt;

&lt;p&gt;Booster has always focused on scalability (one of the reasons behind architectural decisions such as adopting an event-driven design and CQRS), but this new version &lt;strong&gt;takes it to the level of the most demanding enterprises in the world&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;We keep pushing hard to bring the best of the event-sourcing, CQRS, and serverless worlds to a framework that offers a neat developer experience focused on what matters. &lt;em&gt;Is there something you think can be improved?&lt;/em&gt; Go ahead and make a suggestion! Remember that &lt;strong&gt;&lt;a href="https://github.com/boostercloud/booster"&gt;Booster is open-source&lt;/a&gt;&lt;/strong&gt;.&lt;/p&gt;

</description>
      <category>typescript</category>
      <category>eventdriven</category>
      <category>serverless</category>
      <category>performance</category>
    </item>
    <item>
      <title>Why a DynamoDB call can take 5 minutes to complete</title>
      <dc:creator>Álvaro López Espinosa</dc:creator>
      <pubDate>Fri, 30 Apr 2021 14:21:48 +0000</pubDate>
      <link>https://dev.to/boostercloud/why-a-dynamodb-call-can-take-5-minutes-to-complete-52g2</link>
      <guid>https://dev.to/boostercloud/why-a-dynamodb-call-can-take-5-minutes-to-complete-52g2</guid>
      <description>&lt;p&gt;As part of the work I was doing developing &lt;a href="https://www.booster.cloud"&gt;Booster Framework&lt;/a&gt;, I was in charge of creating heavy load tests to ensure that its auto-scalability feature works and that the data integrity is preserved (under the AWS provider).&lt;/p&gt;

&lt;h2&gt;The problem&lt;/h2&gt;

&lt;p&gt;After many hours of work, load testing Booster at 3,000 requests per second was working pretty well. However, we sometimes saw this problem:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The final counter of processed events had to reach 60,000 for the test to be considered successful, but sometimes &lt;strong&gt;it got stuck at a random number very close to the target&lt;/strong&gt; (for example, 59,489).&lt;/li&gt;
&lt;li&gt;We didn't see any errors anywhere.&lt;/li&gt;
&lt;li&gt;Suddenly, &lt;strong&gt;after 5 minutes or so, the right number appeared&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This was unacceptable.&lt;/p&gt;

&lt;h2&gt;Searching for the root cause&lt;/h2&gt;

&lt;p&gt;After a lot of investigation, we found that about 0.00001% of all the &lt;strong&gt;DynamoDB calls&lt;/strong&gt; we made (it didn't matter whether they were a “query”, “get”, or “put”) &lt;strong&gt;were taking 4 or 5 minutes to succeed&lt;/strong&gt;, while the usual time for them was between 5 and 20 milliseconds.&lt;/p&gt;

&lt;p&gt;– &lt;em&gt;"Aha! That's the problem!"&lt;/em&gt;, I thought.&lt;/p&gt;

&lt;p&gt;Yes, but &lt;em&gt;why were those calls taking that long?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The AWS SDK has a &lt;em&gt;built-in retry mechanism&lt;/em&gt;. When you do a DynamoDB call, for example, it will automatically redo the request if it gets an error that’s considered “retriable” (like a temporary network error). &lt;/p&gt;

&lt;p&gt;It keeps retrying using an &lt;a href="https://docs.aws.amazon.com/general/latest/gr/api-retries.html"&gt;exponential back-off algorithm&lt;/a&gt;: it waits 50 ms before the first retry. If that fails, it waits twice as long (100 ms) before the next retry. If that fails too, it waits 200 ms, then 400 ms, then 800 ms, then 1.6 seconds, and so on.&lt;/p&gt;

&lt;p&gt;– &lt;em&gt;"That's why calls take so long!!!"&lt;/em&gt;, I told to myself.&lt;br&gt;
– &lt;em&gt;"Nein, nein, nein!!!"&lt;/em&gt;, the facts replied to me&lt;/p&gt;

&lt;p&gt;It turned out that only &lt;em&gt;10 retries are attempted by default&lt;/em&gt;, and those waits add up to only about &lt;strong&gt;51 seconds&lt;/strong&gt; in total.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Where did the rest of the time (more than 4 minutes) come from?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;To the point:&lt;/strong&gt; It turned out that those 0.00001% DynamoDB calls were done with an HTTP socket in a bad state (because these things happen randomly) and the requests never reached DynamoDB.&lt;/p&gt;

&lt;p&gt;The AWS SDK has a default timeout of 2 minutes, so the call was stuck that amount of time before returning an error (a "socket timeout error").&lt;/p&gt;

&lt;p&gt;This means that, in order for the first retry to occur, we need to wait 2 minutes. Then, it could be that the second retry also finds the socket in a bad state (it is being reused by thousands of requests), so we wait 2 more minutes.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This explains why a simple DynamoDB call occasionally took 4 or 5 minutes.&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;Solution&lt;/h2&gt;

&lt;p&gt;Surprisingly, &lt;strong&gt;the solution was simply to reduce the timeout from 2 minutes to, for example, 5 seconds&lt;/strong&gt;. This way we fail much earlier, the AWS SDK can do the next retry (with a new socket connection) sooner, and the chances of success are much higher.&lt;/p&gt;

&lt;p&gt;This is the code to do this in the TypeScript AWS SDK (v2):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;dynamoDB&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;DynamoDB&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;DocumentClient&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;DynamoDB&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;DocumentClient&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;maxRetries&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;httpOptions&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;timeout&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;5000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;Lessons learned&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;a) Networking can fail randomly for no apparent reason&lt;/strong&gt;. Even when there are no bugs, network conditions alone can cause failures.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;b) Timeouts are extremely important&lt;/strong&gt;. Every request should have a timeout; without one, a request can get stuck because of point &lt;strong&gt;a)&lt;/strong&gt; above, with no way to stop it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;c) Retries are a must for any request&lt;/strong&gt;. Because of points &lt;strong&gt;a)&lt;/strong&gt; and &lt;strong&gt;b)&lt;/strong&gt;, you need a retry policy. &lt;em&gt;Not having one is simply a bug&lt;/em&gt;, because any network request is expected to fail occasionally for no apparent reason. Going without a retry policy is like writing a switch statement with no default case while missing a case condition.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;d) Load tests are extremely useful&lt;/strong&gt; for finding the kinds of errors you would never think of.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Extra tip!&lt;/h2&gt;

&lt;p&gt;If you find yourself facing similar abnormally long waits on any AWS SDK call, try &lt;strong&gt;setting the parameter "maxRetries" to 0&lt;/strong&gt; and running again. This exposes the error that's causing the retries, so you can act on it.&lt;/p&gt;
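&lt;p&gt;With the v2 DocumentClient shown above, that diagnostic tweak is just a config change (a sketch; restore your normal retry settings once the root cause is fixed):&lt;/p&gt;

```typescript
import { DynamoDB } from "aws-sdk";

// Disable automatic retries so the underlying error surfaces immediately
// instead of being masked by silent retry attempts.
const diagnosticClient = new DynamoDB.DocumentClient({
  maxRetries: 0,
});
```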

&lt;p&gt;Of course, don't forget to enable the retries again after the error is fixed!!&lt;/p&gt;

</description>
      <category>aws</category>
      <category>dynamodb</category>
      <category>database</category>
      <category>booster</category>
    </item>
  </channel>
</rss>
