Fredy Daniel Flores Lemus

From On-Premise Monolith to Scalable AWS Architecture: The Ticket Sales Case Study

The Problem Statement

Imagine the following scenario: a ticket sales application residing on a physical server (On-premise). Currently, the application is a monolith written in Node.js; it handles persistence in a MySQL database hosted on the same server, and stores static files (like event posters) directly on the local hard drive.

This architecture faces critical issues when tickets for a famous artist go on sale: the server crashes under the traffic, the database gets locked, and images load extremely slowly.

Monolith architecture

To address these root problems, the decision is made to migrate the application to AWS. This is where architecture planning begins, based on the following non-functional requirements:

  • High Availability (HA): If a server or zone fails, the app must continue operating without interruption.
  • Scalability: The system must handle user load and absorb traffic spikes during major events on demand.
  • Persistence: Transaction integrity is vital; no sale can be lost.
  • Security: The database must be protected and isolated from public internet access.

The Challenge

We need to structure the solution by addressing four fundamental pillars:

  1. Compute: Where do we run the application and how do we manage traffic?
  2. Database: Which service do we use for MySQL and how do we optimize reads without saturating the system?
  3. Static Storage: How do we serve poster images to ensure fast loading?
  4. Network & Security: How do we organize the network (VPC) to protect data while allowing user access to the web?

The Architecture Proposal

For the compute layer, we can run the application on EC2 instances managed by an Auto Scaling Group. This allows us to scale horizontally on demand to handle traffic spikes. In front, we will place an Application Load Balancer (ALB) to distribute requests among the instances, which will be spread across different Availability Zones (AZs) to ensure high availability.
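To make this concrete, here is a minimal AWS CDK (TypeScript) sketch of the compute layer. The construct names, instance sizes, scaling limits, and the port 3000 the Node.js app is assumed to listen on are placeholder assumptions, not values from the case study; the VPC is declared here with the three subnet tiers that the later sketches reuse.

```typescript
import * as cdk from 'aws-cdk-lib';
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import * as autoscaling from 'aws-cdk-lib/aws-autoscaling';
import * as elbv2 from 'aws-cdk-lib/aws-elasticloadbalancingv2';

const app = new cdk.App();
const stack = new cdk.Stack(app, 'TicketSalesStack');

// VPC spread across two Availability Zones, with one subnet tier per layer
const vpc = new ec2.Vpc(stack, 'TicketVpc', {
  maxAzs: 2,
  subnetConfiguration: [
    { name: 'public', subnetType: ec2.SubnetType.PUBLIC, cidrMask: 24 },
    { name: 'app', subnetType: ec2.SubnetType.PRIVATE_WITH_EGRESS, cidrMask: 24 },
    { name: 'data', subnetType: ec2.SubnetType.PRIVATE_ISOLATED, cidrMask: 24 },
  ],
});

// The Node.js monolith runs on EC2 instances in the private "app" subnets,
// managed by an Auto Scaling Group that grows and shrinks with CPU load
const asg = new autoscaling.AutoScalingGroup(stack, 'AppAsg', {
  vpc,
  vpcSubnets: { subnetType: ec2.SubnetType.PRIVATE_WITH_EGRESS },
  instanceType: ec2.InstanceType.of(ec2.InstanceClass.T3, ec2.InstanceSize.MEDIUM),
  machineImage: ec2.MachineImage.latestAmazonLinux2023(),
  minCapacity: 2,  // at least one instance per AZ
  maxCapacity: 10, // room to absorb an on-sale traffic spike
});
asg.scaleOnCpuUtilization('CpuScaling', { targetUtilizationPercent: 60 });

// Internet-facing ALB in the public subnets, forwarding to the instances
const alb = new elbv2.ApplicationLoadBalancer(stack, 'PublicAlb', {
  vpc,
  internetFacing: true,
});
const listener = alb.addListener('HttpListener', { port: 80 });
listener.addTargets('NodeApp', { port: 3000, targets: [asg] }); // app port is an assumption
```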

For MySQL, we will use the managed service Amazon RDS. To optimize performance, we will evaluate two strategies: using Read Replicas or implementing Amazon ElastiCache (we will define the best option later).
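As a rough sketch of the database tier, reusing the `stack` and `vpc` from the compute sketch above: the engine version, instance class, storage size, and the optional read replica are illustrative assumptions only.

```typescript
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import * as rds from 'aws-cdk-lib/aws-rds';

// Managed MySQL in the isolated "data" subnets, with a standby in a second AZ
const db = new rds.DatabaseInstance(stack, 'TicketsDb', {
  engine: rds.DatabaseInstanceEngine.mysql({ version: rds.MysqlEngineVersion.VER_8_0 }),
  vpc,
  vpcSubnets: { subnetType: ec2.SubnetType.PRIVATE_ISOLATED },
  instanceType: ec2.InstanceType.of(ec2.InstanceClass.R6G, ec2.InstanceSize.LARGE),
  multiAz: true,         // automatic failover for high availability
  allocatedStorage: 100, // GiB; sizing is a placeholder
});

// Optional read replica, one of the two read-scaling strategies we are evaluating
new rds.DatabaseInstanceReadReplica(stack, 'TicketsDbReplica', {
  sourceDatabaseInstance: db,
  instanceType: ec2.InstanceType.of(ec2.InstanceClass.R6G, ec2.InstanceSize.LARGE),
  vpc,
  vpcSubnets: { subnetType: ec2.SubnetType.PRIVATE_ISOLATED },
});
```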

Static content (poster images) will be migrated to an S3 bucket, with Amazon CloudFront as a CDN to cache content and drastically reduce load times globally.
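A minimal sketch of that static content tier, again assuming the same `stack`. The bucket settings and cache policy are illustrative choices, not prescriptions.

```typescript
import * as s3 from 'aws-cdk-lib/aws-s3';
import * as cloudfront from 'aws-cdk-lib/aws-cloudfront';
import * as origins from 'aws-cdk-lib/aws-cloudfront-origins';

// Private bucket for event posters; users never hit S3 directly
const posters = new s3.Bucket(stack, 'PosterBucket', {
  blockPublicAccess: s3.BlockPublicAccess.BLOCK_ALL,
  encryption: s3.BucketEncryption.S3_MANAGED,
});

// CloudFront caches the images at edge locations close to the users
new cloudfront.Distribution(stack, 'PosterCdn', {
  defaultBehavior: {
    origin: new origins.S3Origin(posters),
    cachePolicy: cloudfront.CachePolicy.CACHING_OPTIMIZED,
  },
});
```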

Finally, for Network Security, we will implement a three-tier architecture within the VPC:

  1. The Load Balancer (our entry point) will reside in a public subnet.
  2. Both the application instances and the database will be located in private subnets.
  3. We will use Security Groups to strictly restrict access between layers (sketched below).
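Here is a sketch of those Security Groups in CDK, assuming the `vpc` from the compute sketch and the same assumed ports (80/443 on the ALB, 3000 for the Node.js app, 3306 for MySQL). The key point is that each rule references the security group of the layer above it rather than an IP range.

```typescript
import * as ec2 from 'aws-cdk-lib/aws-ec2';

// One security group per tier; traffic can only flow internet -> ALB -> app -> database
const albSg = new ec2.SecurityGroup(stack, 'AlbSg', { vpc });
const appSg = new ec2.SecurityGroup(stack, 'AppSg', { vpc });
const dbSg = new ec2.SecurityGroup(stack, 'DbSg', { vpc });

// Internet -> ALB: anyone may reach the load balancer over HTTP/HTTPS
albSg.addIngressRule(ec2.Peer.anyIpv4(), ec2.Port.tcp(80));
albSg.addIngressRule(ec2.Peer.anyIpv4(), ec2.Port.tcp(443));

// ALB -> app instances: only the load balancer may hit the Node.js port
appSg.addIngressRule(albSg, ec2.Port.tcp(3000));

// App -> database: only the application tier may open MySQL connections
dbSg.addIngressRule(appSg, ec2.Port.tcp(3306));
```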

AWS three tier architecture

AWS three tier architecture - Networking

Deep Dive: Distributed Systems Challenges

The architecture above is solid and meets the infrastructure requirements. However, moving from a monolith to a distributed environment exposes our design to two critical logical problems:

1. The User Session

The original application stored the session in the server's RAM. In the new architecture, the combination of Auto Scaling + Load Balancer means that if the balancer routes us to a different instance than the one we logged into, we lose that state, resulting in a terrible user experience (an unexpected logout).

loosing session

How do we solve this? We make the application stateless. Instead of storing the session locally, we externalize it to Amazon ElastiCache. Being an in-memory database, it offers sub-millisecond latency and ensures that even if the user lands on a different instance, their session remains centrally available.
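To make the idea concrete, here is a minimal stateless-session sketch in Node.js (TypeScript) using Express and ioredis. The /login and /me routes, the cookie name, the 30-minute TTL, and the REDIS_URL environment variable are all assumptions for illustration; a production app would likely use a session middleware on top of the same pattern.

```typescript
import express from 'express';
import cookieParser from 'cookie-parser';
import Redis from 'ioredis';
import { randomUUID } from 'crypto';

// Every instance in the Auto Scaling Group talks to the same ElastiCache endpoint
const redis = new Redis(process.env.REDIS_URL ?? 'redis://localhost:6379');
const app = express();
app.use(express.json());
app.use(cookieParser());

const SESSION_TTL_SECONDS = 60 * 30; // 30 minutes (assumption)

app.post('/login', async (req, res) => {
  // ...validate credentials against the users table (omitted)...
  const sessionId = randomUUID();
  // The session lives in Redis, not in this instance's RAM
  await redis.set(
    `session:${sessionId}`,
    JSON.stringify({ userId: req.body.userId }),
    'EX',
    SESSION_TTL_SECONDS,
  );
  res.cookie('sessionId', sessionId, { httpOnly: true });
  res.json({ ok: true });
});

app.get('/me', async (req, res) => {
  // Works no matter which EC2 instance the ALB routed this request to
  const raw = await redis.get(`session:${req.cookies.sessionId}`);
  if (!raw) {
    res.status(401).json({ error: 'Not logged in' });
    return;
  }
  res.json(JSON.parse(raw));
});

app.listen(3000);
```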

Stateless session workflow

2. Data Consistency (Race Condition)

Here we revisit the debate between using Read Replicas or ElastiCache. Imagine User A buys a ticket. Milliseconds later, User B checks that same seat. If we use Read Replicas, there is a small delay (replication lag) before User A's purchase is reflected in all copies. This could lead User B to attempt to purchase an already sold seat, causing an error or, worse, overbooking.

race condition workflow

How do we handle immediate availability without saturating the database? The ideal solution is ElastiCache (Redis). Read Replicas are not ideal for real-time stock control due to the aforementioned lag. Instead, Redis allows us to leverage its atomicity. Since Redis is single-threaded (it processes operations one by one), it acts as a perfect control mechanism: if multiple purchase requests for the same seat arrive simultaneously, Redis "queues" them and processes them sequentially, allowing only the first transaction to succeed. This not only solves the race condition but also offloads read traffic from the main database.
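A minimal sketch of that control mechanism, assuming ioredis and the same Redis endpoint as the session example. The key naming, the 120-second hold, and the confirmSale helper are hypothetical; the core idea is Redis's atomic SET ... NX, which succeeds for exactly one of the concurrent requests.

```typescript
import Redis from 'ioredis';

const redis = new Redis(process.env.REDIS_URL ?? 'redis://localhost:6379');
const HOLD_SECONDS = 120; // how long a seat stays reserved while payment completes (assumption)

// Returns true only for the first caller; every concurrent request for the
// same seat finds the key already set and is rejected.
async function reserveSeat(eventId: string, seatId: string, userId: string): Promise<boolean> {
  const result = await redis.set(
    `hold:${eventId}:${seatId}`, // one key per seat
    userId,
    'EX', HOLD_SECONDS,          // the hold expires if the purchase is abandoned
    'NX',                        // only set if the key does not exist: atomic check-and-set
  );
  return result === 'OK';
}

// Example usage inside the purchase flow (confirmSale is a hypothetical helper
// that would persist the final transaction in RDS MySQL):
async function buyTicket(eventId: string, seatId: string, userId: string) {
  if (!(await reserveSeat(eventId, seatId, userId))) {
    throw new Error('Seat already taken');
  }
  // await confirmSale(eventId, seatId, userId);
}
```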

Concurrency well handled

Conclusion

Migrating from an on-premises environment to the cloud isn't just about moving servers (Lift & Shift); it's about rethinking how our application handles state and concurrency.

By integrating Amazon ElastiCache (Redis) into our architecture, we didn't just gain read speed; we also solved two of the most complex problems in distributed systems: session management in stateless applications and data integrity during race conditions.

With this architecture, we've moved from a server that collapses when a famous artist goes on sale to an elastic, robust infrastructure ready to scale automatically with demand.
