In early 2019, Betclic had reached a turning point, and we decided to move to a microservices approach and migrate to the cloud.
The first step of this change was the implementation of an Event-Driven Architecture (EDA).
Betclic presentation
Before discussing architecture and EDA, it helps to know what Betclic does, because it explains our choices.
Betclic is a French online gambling company. Betclic operates in sports betting, horse race betting, and online poker.
What sets gambling apart is that our traffic isn’t predictable. We have high traffic peaks during significant events and on weekends. Traffic during a soccer event is a typical example.
EDA wishlist
In early 2019, we ran a workshop to identify the key features of our ideal EDA. We wanted something:
allowing multiple reads: the same message can be read by various services
with error recovery: in case of error, we would like to replay events
easy to monitor: particularly with Datadog, our centralized monitoring solution
with schema registry
language-agnostic: we use 3 main languages, .NET, Kotlin, and TypeScript (for Lambda), and the chosen solution must work with all of them
with analytics capabilities: we would like to integrate all our events of EDA in our data lake
fully managed
with auto scaling: an essential feature for us; we have unpredictable traffic with big spikes at the end of matches, so we need an EDA that can absorb them
with low latency and high availability: also an essential feature; we want our EDA to serve critical use cases for our end users, so it must be as close to real time and as available as possible
well-integrated with AWS services: EDA is one of our first building blocks in the cloud, and we want it to help us migrate to AWS
easy to use: we always want something easy to use :) but it’s particularly true here because EDA is a centralized solution that the majority of our developers will use (and back in 2019, most of our developers weren’t cloud developers)
cost-effective: what does the solution cost for what it delivers?
2019: Amazon SNS + Amazon SQS: The best solution
Remember that this was early 2019: neither Amazon EventBridge (announced in July 2019) nor Amazon MQ for RabbitMQ (announced in November 2020) was available yet.
Based on our needs, we decided to combine Amazon SNS with Amazon SQS.
Amazon SNS is essential in our solution because it allows multiple reads of the same event and gives us a centralized entry point.
All events are published in a centralized Amazon SNS
Each functional domain has its own topic
Each functional domain can publish only on its own topic
Lambda/Fargate/On-Premise publish in the same manner
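To make the publishing side concrete, here is a minimal sketch of how a publish request for a domain topic could be built. The topic ARN, the `event_type` attribute name, and the payload fields are hypothetical and are not taken from our actual implementation; the request shape follows the parameters of the SNS Publish API.

```python
import json

# Hypothetical ARN of the topic owned by a "betting" functional domain (assumption).
BETTING_TOPIC_ARN = "arn:aws:sns:eu-west-1:123456789012:betting-events"


def build_publish_request(topic_arn, event_type, payload):
    """Build the keyword arguments for an SNS Publish call.

    The event type is sent as a message attribute so that subscribers
    can filter on it without parsing the message body.
    """
    return {
        "TopicArn": topic_arn,
        "Message": json.dumps(payload),
        "MessageAttributes": {
            # "event_type" is an illustrative attribute name, not a convention from the article.
            "event_type": {"DataType": "String", "StringValue": event_type},
        },
    }


request = build_publish_request(
    BETTING_TOPIC_ARN, "BetPlaced", {"betId": "b-1", "stake": 10}
)
# With boto3, the request could then be sent with:
#   boto3.client("sns").publish(**request)
```

Because the request is plain data, the same helper could be shared by Lambda, Fargate, and on-premise publishers, each supplying its own client.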
A service consumes only events from another functional domain
Each service has its own queue with its own filter
A filter is based on Amazon SNS message attributes and is limited to 10 attributes
Fargate/On-Premise services consume in the same manner
For Lambda, we prefer to create an Amazon SNS subscription of type Lambda with a filter and consume directly from Amazon SNS
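The filtering and multiple-read behavior described above can be illustrated with a simplified model. This is only a sketch of exact-match filtering on string attributes; real SNS filter policies support more operators, and the queue names and attribute names below are hypothetical.

```python
def matches_filter_policy(policy, attributes):
    """Return True if the message attributes satisfy the filter policy.

    Simplified model: every attribute named in the policy must be present,
    and its value must be one of the allowed values for that attribute.
    """
    return all(attributes.get(name) in allowed for name, allowed in policy.items())


def fan_out(subscriptions, attributes):
    """Return the names of the queues that would receive a copy of the message."""
    return [
        queue_name
        for queue_name, policy in subscriptions.items()
        if matches_filter_policy(policy, attributes)
    ]


# Hypothetical queues, one per consuming service, each with its own filter.
subscriptions = {
    "notification-service-queue": {"event_type": ["BetPlaced", "BetSettled"]},
    "bonus-service-queue": {"event_type": ["BetSettled"]},
    "soccer-stats-queue": {"event_type": ["BetPlaced"], "sport": ["soccer"]},
}

placed_on_tennis = {"event_type": "BetPlaced", "sport": "tennis"}
settled_on_soccer = {"event_type": "BetSettled", "sport": "soccer"}

queues_for_placed = fan_out(subscriptions, placed_on_tennis)
queues_for_settled = fan_out(subscriptions, settled_on_soccer)
```

In this model, the same published event is delivered to every queue whose filter matches, which is what lets several services read it independently.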
Custom features
This architecture covers most of the wishlist features listed above, except:
schema registry
error recovery
analytics capabilities
We decided to make a custom implementation for these features.
Error recovery and analytics capabilities are implemented in the same architecture
All events are stored in our Data Lake with the help of Kinesis Data Firehose and Amazon S3
In case of an issue, we can replay events by extracting them from our data lake to Amazon S3. The S3 object triggers a Lambda, and the Lambda pushes the events to the correct topic in Amazon SNS.
We store events in a file in case of a connectivity failure between our on-premise infrastructure and AWS. Once connectivity comes back, the events in the files are pushed to the correct Amazon SNS topic with the help of FluentD
One feature still missing from error recovery is the ability to replay events to a specific SQS queue. All events are replayed to an SNS topic, so every consumer service receives them again. **Idempotence** is necessary for consumer services.
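Since a replay redelivers events to every consumer, each consumer has to tolerate duplicates. The following is a minimal sketch of one way to do that, assuming each event carries a unique `eventId` field (a hypothetical name) and using an in-memory set where a real service would need a durable store.

```python
class IdempotentConsumer:
    """Wrap an event handler so that each event is processed at most once."""

    def __init__(self, handler):
        self._handler = handler
        # In-memory for illustration only; a real service would need durable storage.
        self._seen_ids = set()

    def consume(self, event):
        """Process the event unless its ID was already handled.

        Returns True if the handler ran, False if the event was a duplicate.
        """
        event_id = event["eventId"]  # hypothetical unique identifier field
        if event_id in self._seen_ids:
            return False
        self._handler(event)
        # Record the ID only after the handler succeeded, so failures can be retried.
        self._seen_ids.add(event_id)
        return True


processed = []
consumer = IdempotentConsumer(processed.append)

event = {"eventId": "evt-42", "type": "BetSettled"}
first_delivery = consumer.consume(event)   # original delivery
replayed = consumer.consume(event)         # same event after a replay
```

Here the replayed event is ignored, so replaying a whole topic doesn’t apply the same change twice in a service that had already processed it.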
Concerning the schema registry, nothing is in production for now. So far, we have developed a portal listing all the available events and a debug console to validate the event format in the Stage environment.
Cost
550 million events are published per month
The average size of an event is 630 B
1.1 billion events are transferred to consuming domains (on average, 2 services consume each event)
1.5 billion Amazon SQS requests are made to consume and delete events
Publication cost in Amazon SNS
- 550 million events published = $ 275.50
Consumption cost in Amazon SNS + Amazon SQS
Amazon SNS: Data transfer = 1.1 billion events x 630 B = 693 GB transferred ==> 693 x 0.09 (price per GB out) = $ 63.78
Amazon SQS: Events sent = 1.1 billion events = $ 440
Amazon SQS: Receive + Delete = 1.5 billion requests = $ 600
Consumption cost = $ 1103.78
Error recovery + analytics capabilities
550 million events published to Kinesis Data Firehose = $ 104.50
346.5 GB of data transferred from Amazon SNS to Kinesis Data Firehose = $ 31.09
Kinesis Data Firehose cost = $ 81.30
S3 cost with 1 month retention = 346.5 GB of data = $ 7.97
Total cost = $ 1603.14
The Lambda cost for replay isn’t estimated because we don’t know how many replays we do monthly.
The price details are available in the AWS calculator.
2021: Amazon EventBridge?
As we saw, in 2019 our best choice for EDA was Amazon SNS + Amazon SQS. Is that still true in 2021?
Many interesting features have been released for Amazon EventBridge over the past 2 years:
Content filtering (02/2020)
Schema registry (04/2020)
Dead letter queues (10/2020)
Event replay (11/2020)
Increased quotas (11/2020)
Cross account (04/2021)
According to this, all the features of our wishlist are natively supported.
We can imagine doing something like this for publishing events with Amazon EventBridge:
All events are published in a centralized Amazon EventBridge
Each functional domain has its own Event Bus
Each functional domain can publish only in its own Event Bus
Lambda/Fargate/On-Premise publish in the same manner
Schema Registry is natively supported: events are defined in Schema Registry, and developers generate code based on these events (code bindings aren’t supported in the language that we use at Betclic)
Schema Discovery shows which events are published to Amazon EventBridge and helps identify events that don’t respect the contract
Schema Discovery and Schema Registry are fully managed
Schema Discovery is generally activated on-demand in Production.
For consuming events, something like this:
A service consumes only events from another functional domain
To send events to other accounts, we need to publish on a new Amazon EventBridge Event Bus
Events can be filtered on any data in the event
Amazon EventBridge can’t publish directly to Fargate or On-Premise; in this case, we need to go through an Amazon SQS queue to consume
Each service has its own EventBridge Rule, its own queue with its own filter
Fargate/On-Premise services consume in the same manner
For Lambda, we consume directly from Amazon EventBridge through a rule
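Filtering on the event content, rather than only on attributes, can be pictured with a simplified matcher. This sketch only handles exact-value matching on nested fields; real EventBridge event patterns support more operators, and the `source` and `detail` values below are hypothetical.

```python
def matches_pattern(pattern, event):
    """Return True if the event satisfies the pattern.

    Simplified model: a list in the pattern holds the allowed values for a field,
    and a nested dict in the pattern is matched against the nested event object.
    """
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            # Recurse into nested objects such as "detail".
            if not isinstance(actual, dict) or not matches_pattern(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


# Hypothetical rule: only settled soccer bets coming from the betting domain.
rule_pattern = {
    "source": ["betting"],
    "detail": {"status": ["SETTLED"], "sport": ["soccer"]},
}

soccer_event = {
    "source": "betting",
    "detail": {"betId": "b-1", "status": "SETTLED", "sport": "soccer"},
}
tennis_event = {
    "source": "betting",
    "detail": {"betId": "b-2", "status": "SETTLED", "sport": "tennis"},
}

soccer_matches = matches_pattern(rule_pattern, soccer_event)
tennis_matches = matches_pattern(rule_pattern, tennis_event)
```

Compared with attribute-based filtering, the publisher doesn’t need to copy fields into separate attributes for consumers to filter on them.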
Error recovery and analytics capabilities
Error recovery and analytics capabilities are fully managed with Amazon EventBridge
All events are stored in our Data Lake with the help of Kinesis Data Firehose and Amazon S3
In case of an issue, we can replay events directly from Amazon EventBridge
Events can be replayed to a specific rule or to all the rules
Cost
Based on the same volumes as for Amazon SNS + SQS, we get the following.
Publication cost in Amazon EventBridge
- 550 million events published = $ 550
Consumption cost in Amazon EventBridge
Amazon EventBridge: Events transferred to other Event Buses = 1.1 billion events = $ 1100
Amazon SQS: Events sent = 1.1 billion events = $ 440
Amazon SQS: Receive + Delete = 1.5 billion requests = $ 600
Consumption cost = $ 2140
Error recovery + analytics capabilities
Kinesis Data Firehose cost = $ 81.30
S3 cost with 1 month retention = 346.5 GB of data = $ 7.97
EventBridge archive for 3 months = 346.5 GB x 3 months = 1039 GB x $ 0.11 = $ 114.29
Total cost = $ 2893.56
EventBridge replay and Schema Registry aren’t estimated because we don’t know how many replays/discoveries we will do monthly.
The price details (without the EventBridge archive) are available in the AWS calculator.
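As a quick check of the arithmetic, the EventBridge line items above can be summed and compared with the $ 1603.14 total from the Amazon SNS + SQS section. The figures are copied from the estimates in this article; only the dictionary keys are made up for readability.

```python
from decimal import Decimal

# Monthly EventBridge line items, in USD, as listed above.
eventbridge_items = {
    "publication": Decimal("550.00"),
    "transfer_to_other_event_buses": Decimal("1100.00"),
    "sqs_events_sent": Decimal("440.00"),
    "sqs_receive_and_delete": Decimal("600.00"),
    "kinesis_data_firehose": Decimal("81.30"),
    "s3_one_month_retention": Decimal("7.97"),
    "eventbridge_archive_three_months": Decimal("114.29"),
}

eventbridge_total = sum(eventbridge_items.values(), Decimal("0"))

# Total stated in the Amazon SNS + SQS cost section.
sns_sqs_total = Decimal("1603.14")

extra_cost = eventbridge_total - sns_sqs_total
extra_cost_pct = (extra_cost / sns_sqs_total * 100).quantize(Decimal("0.1"))

print(f"EventBridge total: $ {eventbridge_total}")
print(f"Extra cost vs SNS + SQS: $ {extra_cost} ({extra_cost_pct} %)")
```

Using `Decimal` rather than floats keeps the cents exact when adding up the line items.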
Conclusion
Amazon EventBridge has all the features that we want natively supported. No need to implement custom features with EventBridge.
However, cross-account EventBridge is more complicated than Amazon SNS because 2 rules need to be updated to consume a new event in a domain, whereas SNS needs only the creation of 1 subscription with the correct filter.
It would have been nice if EventBridge could deliver events directly to Fargate and, even better, to on-premise servers. Because this feature is missing, we still need to consume from Amazon SQS.
So in terms of features, Amazon EventBridge wins over Amazon SNS + SQS.
Concerning the price, it’s a little bit different.
The cost of the architecture with Amazon SNS + SQS, in our case, is $ 1603.14.
While with EventBridge, the cost of the same architecture is $ 2893.56.
An 80% cost difference is huge. Is it the price of the features that Amazon SNS lacks compared to EventBridge?
At Betclic, the choice is easy: we have already developed the missing features that we want, so we are not interested in migrating to Amazon EventBridge. But we follow Amazon EventBridge news carefully because it’s a constantly evolving service.
Top comments (3)
Interesting comparison. I was wondering if you looked into comparing the performances. I think SNS has higher throughput than EventBridge.
I haven't actually tried it but API Destination apparently allows you to send events from EB to any HTTP API endpoint, such as on-prem? aws.amazon.com/blogs/compute/using...
No, I didn't compare performance. But the AWS SNS + SQS solution has been in production for 2 years, its performance is good (near real-time), and it's enough for our use case.
Concerning EB with an API Destination, the issue is that it's complicated to manage, and we start to drift away from an EDA.
There is also an additional cost for using API Destination: $ 0.20 / million.
AWS has now released global endpoints in Eventbridge that support failover. SNS/SQS don't have that yet. What are your thoughts on it? Is it worth migrating to Eventbridge for that? Have you built something similar with SNS/SQS and has it been successful?