Small Architecture Re-cap:
Before jumping into the article, I would like to recap some concepts:
What is an Availability Zone (AZ)?
An AZ in AWS is a physically separate and isolated data center (or a group of data centers) within a Region. AWS Regions consist of multiple AZs, with a minimum of 3 AZs per region.
Each AZ has independent power, cooling, and networking, ensuring high availability and fault tolerance.
Why Is It Best Practice to Use Different Subnets in Different AZs for Resiliency?
When designing highly available architectures in AWS, it is considered best practice to distribute workloads across multiple Availability Zones while segmenting different services into different subnets.
For example:
- The database should be in its own subnets, deployed across multiple AZs
- The frontend should be in different subnets than the database, and also deployed across multiple AZs
Following this principle, we enhance security, improve scalability, and increase resiliency.
Public vs Private Subnets
A public subnet has a direct route to the internet.
E.g., a web server will be located in a public subnet and will be directly exposed to the internet
A private subnet does not have a direct route to the Internet.
E.g., a database will be located in a private subnet and won't have to be exposed to the internet
- It is possible to allow communication from the resources inside of a private subnet to the Internet, but not the other way around. To achieve this, some additional networking services have to be deployed.
Now, yes, let's jump into it!
Some months ago, a friend challenged me to architect an application.
I selected the topic and started to think about how I could create it, thinking as a solutions architect. The small project escalated, and that is the main reason why you are reading this.
How would you create a ping dashboard?
Thank you, Batman! Exactly, the network!
You can't start building the application without considering the network scope. I looked at myself in the mirror and started designing what the application would look like:
I want to create a Multi-AZ resilient ping dashboard with services exposed in public and private subnets. In the future, I want to extend parts of the application to other regions to expand the measurements.
Now that I know which type of structure the network will have, let's deep dive into the Application scope:
The ping dashboard will have a web server accessible via the Internet in a public subnet and display latency data from several virtual machines used as probes from different private subnets.
These probes will take ping measurements against a specific target (URL or IP) sent by the web server, to understand and measure how long it takes to reach that destination from AWS.
The probes will write the data to a database, which the web server will use to expose the results to the end user.
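Just to make the idea concrete, here is a minimal sketch of what a probe could do (the pymysql library, the table name, and the column names are assumptions for illustration, not the final implementation):

```python
import subprocess
import time

import pymysql  # assumption: a MySQL client library; any driver would do


def ping_once(target: str) -> float | None:
    """Ping the target once and return the latency in ms, or None on failure."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "2", target],
        capture_output=True, text=True,
    )
    if result.returncode != 0:
        return None
    # Linux ping prints something like "time=12.3 ms"; extract the number.
    for token in result.stdout.split():
        if token.startswith("time="):
            return float(token.split("=")[1])
    return None


def store_measurement(target: str, latency_ms: float) -> None:
    """Write one measurement to the shared database (hypothetical schema)."""
    conn = pymysql.connect(host="rds-endpoint", user="probe",
                           password="change-me", database="pingdb")
    with conn.cursor() as cur:
        cur.execute(
            "INSERT INTO measurements (target, latency_ms, taken_at) VALUES (%s, %s, %s)",
            (target, latency_ms, time.strftime("%Y-%m-%d %H:%M:%S")),
        )
    conn.commit()
    conn.close()
```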
VPC Structure
I have selected the eu-south-2 region (Spain) and the IPv4 network 10.0.0.0/24.
Because I don't need so many IPs (a /24 contains 256 IPs), I want to segregate this network into smaller subnets and place different components inside each AZ (in different subnets).
Let's do some subnetting.
We will pass from /24 to /28; this means that we will have 16 new networks of 16 addresses each (11 usable, because AWS reserves 5 IPs in every subnet).
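If you want to double-check the subnetting, here is a quick Python sketch using only the standard ipaddress module:

```python
import ipaddress

# Split the VPC CIDR into /28 subnets (16 addresses each).
vpc = ipaddress.ip_network("10.0.0.0/24")
subnets = list(vpc.subnets(new_prefix=28))

print(f"{len(subnets)} subnets of {subnets[0].num_addresses} addresses each")
for subnet in subnets:
    # AWS reserves the first 4 addresses and the last one of every subnet,
    # so only 11 of the 16 addresses are usable for instances.
    print(f"{subnet} -> {subnet.num_addresses - 5} usable IPs")
```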
I have grouped the networks I will use in the different AZs.
AZ A - Networks from 10.0.0.0/28 to 10.0.0.64/28
- There are 5 networks in this range:
- 10.0.0.0/28
- 10.0.0.16/28
- 10.0.0.32/28
- 10.0.0.48/28
- 10.0.0.64/28
AZ B - Networks from 10.0.0.80/28 to 10.0.0.144/28
- There are 5 networks in this range:
- 10.0.0.80/28
- 10.0.0.96/28
- 10.0.0.112/28
- 10.0.0.128/28
- 10.0.0.144/28
AZ C - Networks from 10.0.0.160/28 to 10.0.0.240/28
- There are 6 networks in this range:
- 10.0.0.160/28
- 10.0.0.176/28
- 10.0.0.192/28
- 10.0.0.208/28
- 10.0.0.224/28
- 10.0.0.240/28
Ok, what are we going to build on these networks?
Based on the specifications, we need:
- A DB in a private subnet
- A probe in a private subnet
- A Web Server in a public subnet
In other words, for each AZ, we only need three subnets.
Why don't you put everything in the same subnet?
For security reasons, each component should be isolated from the others. If there is a breach, the network segmentation and our firewall rules will limit access from one subnet to another.
Drawing of our design
I have selected 3 networks per AZ from the list above.
Looks sexy, doesn't it? Let's continue.
If the application needs to be accessed from the Internet, we must deploy an Internet Gateway. The Internet Gateway is a networking component attached to the VPC that gives the web server access to the Internet; it does not live in any subnet.
The probes need Internet access, and we want to prevent external services from reaching them; a NAT Gateway is the solution. The NAT Gateway must be placed in a public subnet, and in combination with the Internet Gateway it will only allow traffic to flow from the inside out. A NAT Gateway will be placed in each AZ.
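To make the routing difference explicit, here is a minimal boto3 sketch (the route table, Internet Gateway, and NAT Gateway IDs are placeholders):

```python
import boto3

ec2 = boto3.client("ec2", region_name="eu-south-2")

# Public route table: default route to the Internet Gateway.
ec2.create_route(
    RouteTableId="rtb-public-az-a",
    DestinationCidrBlock="0.0.0.0/0",
    GatewayId="igw-0123456789abcdef0",
)

# Private route table: default route to the NAT Gateway of the same AZ,
# so the probes can reach the Internet but cannot be reached from it.
ec2.create_route(
    RouteTableId="rtb-private-az-a",
    DestinationCidrBlock="0.0.0.0/0",
    NatGatewayId="nat-0123456789abcdef0",
)
```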
Is placing one NAT Gateway per AZ necessary?
We will answer this question in the second and third iterations 🌚.
Let's add some EC2 instances!
- I have selected the t3.micro for the probes and the web servers.
Yes, web servers (plural): we will place two web servers behind a Network Load Balancer, which will distribute the traffic between the AZs. Why not three web servers?
- We will build the web servers inside an Auto Scaling group that spans different AZs.
- If the ELB detects a failure in the application's health checks, the AutoScaling Group will deploy the web server in the same or in another AZ.
- If the AZ fails, the ELB will stop sending the traffic to that instance, and the AutoScaling Group will automatically re-deploy it to another AZ.
The networking probes are not crucial; we can afford failure in them, but not in the application itself.
We will bind an Elastic IP to the NLB.
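As a rough sketch of the Auto Scaling side (all names and ARNs are placeholders; the launch template and target group are assumed to exist already):

```python
import boto3

autoscaling = boto3.client("autoscaling", region_name="eu-south-2")

autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="ping-dashboard-web",
    MinSize=2,
    MaxSize=2,
    DesiredCapacity=2,
    LaunchTemplate={"LaunchTemplateName": "web-server-template", "Version": "$Latest"},
    # Public subnets in the three AZs (placeholder IDs).
    VPCZoneIdentifier="subnet-public-az-a,subnet-public-az-b,subnet-public-az-c",
    TargetGroupARNs=["arn:aws:elasticloadbalancing:eu-south-2:123456789012:targetgroup/web/abc123"],
    HealthCheckType="ELB",          # replace instances that fail the NLB health checks
    HealthCheckGracePeriod=120,
)
```

With HealthCheckType set to "ELB", the group replaces any instance that fails the load balancer health checks, which is exactly the behaviour described above.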
Let's add the DBs!
- I have chosen an RDS Multi-AZ MySQL architecture with a master database and two DBs on standby.
- The Master DB updates the replicas automatically.
- If the master DB fails, one of the DBs in Standby will take over.
- The DB instance class for this simple design is db.t3.micro
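For reference, here is a minimal boto3 sketch of the Multi-AZ deployment (identifiers and credentials are placeholders; note that MultiAZ=True creates a single standby, while a two-standby deployment would be an RDS Multi-AZ DB cluster):

```python
import boto3

rds = boto3.client("rds", region_name="eu-south-2")

rds.create_db_instance(
    DBInstanceIdentifier="ping-dashboard-db",     # placeholder name
    DBInstanceClass="db.t3.micro",
    Engine="mysql",
    MasterUsername="admin",
    MasterUserPassword="change-me",               # placeholder credentials
    AllocatedStorage=20,
    MultiAZ=True,                                 # synchronous standby in another AZ, automatic failover
    DBSubnetGroupName="ping-dashboard-private",   # subnet group spanning the private DB subnets
    PubliclyAccessible=False,
)
```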
Let's expand our network!
Following the same principle and because we want to extend to other regions for taking measurements, we can create the following services:
- A new VPC using the 10.1.0.0/24 in Frankfurt, for example
- A public and private subnet for the probe in each AZ
- A NAT gateway in each AZ
- An Internet Gateway
- The probes
- VPC peering for interconnecting the new region with the main one
Security Measures
- Session Manager will be used to connect to the web server and the probes, removing the need to expose SSH; SSH is disabled by default
- There will be a Security Group (SG) per service (see the sketch after this list):
- The Web Server SG will allow connections from the NLB
- The NLB will allow connections from the internet
- The Database SG will allow connections coming from the Web Server SG and from the Probes SG
- The Probes SG will allow sending and receiving ICMP traffic to and from anywhere (0.0.0.0/0), plus traffic from the Web Server SG
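Here is a small boto3 sketch of the two most interesting rules, the ones that reference other Security Groups instead of IP ranges (the group IDs are placeholders):

```python
import boto3

ec2 = boto3.client("ec2", region_name="eu-south-2")

# Placeholder security group IDs created beforehand.
WEB_SG, DB_SG, PROBE_SG = "sg-web", "sg-db", "sg-probes"

# Database SG: only MySQL traffic coming from the Web Server SG and the Probes SG.
ec2.authorize_security_group_ingress(
    GroupId=DB_SG,
    IpPermissions=[{
        "IpProtocol": "tcp", "FromPort": 3306, "ToPort": 3306,
        "UserIdGroupPairs": [{"GroupId": WEB_SG}, {"GroupId": PROBE_SG}],
    }],
)

# Probes SG: ICMP (ping) allowed from anywhere.
ec2.authorize_security_group_ingress(
    GroupId=PROBE_SG,
    IpPermissions=[{
        "IpProtocol": "icmp", "FromPort": -1, "ToPort": -1,
        "IpRanges": [{"CidrIp": "0.0.0.0/0"}],
    }],
)
```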
Approximate costs per month
- 8 EC2 t3.micro instances (On for 24h // On Demand) = 8 x 8.32USD = 66.56 USD
- 2 Internet Gateways = FREE
- 6 NAT Gateways (1 GB of monthly traffic) = 6 x 35.09 USD = 210.54 USD
- VPC Peering between Spain and Germany - Each probe will generate about 6 MB per day = 180 MB per month x 6 probes ≈ 1 GB of traffic = 0.04 USD
- Network Load Balancer ≈ 20 USD
- Elastic IP = 3.65 USD per IP
- RDS Multi-AZ (db.t3.micro) = 77.13 USD
Total: 377.92 USD - Monthly invoice
Estimated Carbon footprint
40 KgCO₂eq (Between EC2 and RDS instances)
Second Iteration: Searching for savings
- How could we save some money?
- What tradeoffs must we accept to make this possible?
Let's start with the network:
We can eliminate the NAT Gateways that are not indispensable, drastically reducing costs. A single NAT Gateway can be placed in one AZ and serve the instances in the other AZs.
If this NAT Gateway fails for any reason, no instance inside the private subnets can access the Internet until AWS redeploys it.
- Because the NAT Gateway is an AWS-managed service, if it fails it will be redeployed automatically, but this can take up to 15 minutes.
A good approach would be to create an automated failover function (using Lambda) that checks the health of the current NAT Gateway and updates the routes to point at a healthy one if the current one starts to fail.
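A minimal sketch of what that Lambda function could look like (route table and NAT Gateway IDs are placeholders; the trigger, e.g. a CloudWatch alarm, is left out):

```python
import boto3

ec2 = boto3.client("ec2", region_name="eu-south-2")

# Placeholders: the private route tables and the standby NAT Gateway to fail over to.
PRIVATE_ROUTE_TABLES = ["rtb-private-az-a", "rtb-private-az-b", "rtb-private-az-c"]
STANDBY_NAT_GATEWAY = "nat-standby"


def lambda_handler(event, context):
    """Point the default route of every private route table at the standby NAT Gateway.

    Meant to be triggered by an alarm/health check when the primary NAT Gateway fails.
    """
    for rtb in PRIVATE_ROUTE_TABLES:
        ec2.replace_route(
            RouteTableId=rtb,
            DestinationCidrBlock="0.0.0.0/0",
            NatGatewayId=STANDBY_NAT_GATEWAY,
        )
    return {"updated_route_tables": PRIVATE_ROUTE_TABLES}
```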
I have kept two NAT Gateways in Spain to ensure high availability.
I have left one in the German region because it is not critical.
- If that NAT Gateway fails, I will simply have to wait until it is redeployed
Let's continue with the DB!:
By removing the Multi-AZ approach, we will save more money at the end of the month. But what are our tradeoffs?
- RDS Single-AZ is not covered by the 99.95% RDS Multi-AZ SLA
- Recoverable failures, such as failures of the DB instance itself, are handled automatically within the same AZ, with an RTO under 30 minutes (this can vary depending on the instance size).
- Unrecoverable failures, such as an EBS volume failure: the RTO would be the time it takes to start up a new RDS instance and then apply all the changes since the last backup. This has to be triggered manually or automated via a Lambda script.
- Availability Zone failures require manual or scripted recovery via point-in-time restoration in another AZ (a sketch follows this list).
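This is roughly what that scripted recovery could look like with boto3 (instance identifiers, AZ, and subnet group are placeholders):

```python
import boto3

rds = boto3.client("rds", region_name="eu-south-2")

# Restore the single-AZ instance into another AZ from the latest backup data.
rds.restore_db_instance_to_point_in_time(
    SourceDBInstanceIdentifier="ping-dashboard-db",
    TargetDBInstanceIdentifier="ping-dashboard-db-recovered",
    UseLatestRestorableTime=True,          # apply all changes captured since the last backup
    DBInstanceClass="db.t3.micro",
    AvailabilityZone="eu-south-2b",        # recover into a healthy AZ
    DBSubnetGroupName="ping-dashboard-private",
)
```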
Let's end with the EC2:
This will be a paradigm change in our approach.
- We have killed the Load Balancer and the web server instance in AZ B. The Elastic IP of the NLB has now been delegated to the web server
What is the tradeoff of doing this?
- AWS monitors the health of the hardware but does not keep track of the health of our application as the Load Balancer did.
- For this case, we should use Route 53 health checks and Lambda functions to detect failures in the app, automatically terminating the unhealthy instance and letting the Auto Scaling Group replace it (see the sketch after this list).
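A minimal sketch of that idea, assuming the Lambda receives the failing instance ID from the health-check alarm (the event payload shape is an assumption):

```python
import boto3

autoscaling = boto3.client("autoscaling", region_name="eu-south-2")


def lambda_handler(event, context):
    instance_id = event["instance_id"]      # hypothetical event payload
    # Mark the instance unhealthy so the Auto Scaling Group terminates and replaces it.
    autoscaling.set_instance_health(
        InstanceId=instance_id,
        HealthStatus="Unhealthy",
        ShouldRespectGracePeriod=False,
    )
    return {"replaced_instance": instance_id}
```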
But there is one more thing...
What if the web server automates the process by turning the probes on only when a user requests a measurement through the web interface, or at set time intervals, and then switches them off again afterward?
We could save a lot of money with this automated scheduler...
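The scheduler itself can be tiny; here is a sketch using boto3 (the instance IDs are placeholders, and in practice they could be discovered by tag):

```python
import boto3

ec2 = boto3.client("ec2", region_name="eu-south-2")

# Placeholder probe instance IDs.
PROBE_INSTANCES = ["i-probe-az-a", "i-probe-az-b", "i-probe-az-c"]


def start_probes():
    """Called by the web server (or an EventBridge schedule) when a measurement is requested."""
    ec2.start_instances(InstanceIds=PROBE_INSTANCES)


def stop_probes():
    """Called after the probes have reported their measurements."""
    ec2.stop_instances(InstanceIds=PROBE_INSTANCES)
```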
Approximate costs per month
- Probes: 6 EC2 t3.micro instances (Working for 1 hour per day) = 6 VMs x 0.34 USD (30 hours in a month) = 2.04 USD
- Web Server: 1 EC2 t3.micro instance (Working for 24h per day) = 8.32 USD
- 2 Internet Gateways = FREE
- 3 NAT Gateways (2 GB of monthly traffic) = 105.42 USD
- VPC Peering between Spain and Germany - Each probe will generate about 6 MB per day = 180 MB per month x 6 probes ≈ 1 GB of traffic = 0.04 USD
- Elastic IP = 3.65 USD per IP
- RDS Single-AZ (t3.micro) = 50.66 USD
Total: 170.13 USD - Monthly invoice // 2.22 times less
Estimated Carbon footprint
8.991 KgCO₂eq (Between EC2 and RDS instances) // 4.45 times less
Third Iteration: Hold my Beer
Let's swap to ARM:
By changing the EC2 instance type from x86 to ARM Graviton, we will reduce costs, increase electrical efficiency, and reduce the carbon footprint by about 40% compared to the x86 family.
Therefore, the t3.micro and db.t3.micro have been swapped for t4g.micro and db.t4g.micro.
Note: In some cases, the use of ARM architecture requires re-compilation of some workloads. Before migrating, make sure the whole application is compatible.
What have you done to the network?
To squeeze out some extra savings, we can get rid of the NAT Gateways and use NAT instances instead, at the cost of adding more operational overhead to the solution.
We are moving from a managed service to building it ourselves, and maybe this is not the best approach, but it is worth pointing out that this possibility exists.
Advantages:
- Lower cost
- In combination with an AutoScaling group and Route53, we can create automatic failover, building a resilient system.
Disadvantages:
- The maximum throughput is 5 Gbps, so we should monitor this metric to make sure we do not hit the limit and start dropping traffic. In comparison, NAT Gateways can scale up to 100 Gbps of throughput.
- This design requires an extra Elastic IP per NAT instance.
- Manual management of the instance; it does not scale by default
This is the AWS Tutorial on how to configure it.
In our case, if the NAT instance fails either in Spain or Germany, the downtime will be the time it takes to redeploy it. This is the tradeoff I have chosen.
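For completeness, here is a sketch of the two AWS-side steps that make a NAT instance work (IDs are placeholders; the instance itself also needs IP forwarding and NAT configured at the OS level, as the tutorial explains):

```python
import boto3

ec2 = boto3.client("ec2", region_name="eu-south-2")

# The NAT instance must not enforce source/destination checks,
# otherwise it will drop the traffic it forwards for the private subnets.
ec2.modify_instance_attribute(
    InstanceId="i-nat-instance",
    SourceDestCheck={"Value": False},
)

# Point the private route table's default route at the NAT instance's network interface.
ec2.replace_route(
    RouteTableId="rtb-private-az-a",
    DestinationCidrBlock="0.0.0.0/0",
    NetworkInterfaceId="eni-nat-instance",
)
```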
Approximate costs per month
- Probes: 6 EC2 t4g.micro instances (Working for 1 hour per day) = 6 VMs x 0.28 USD (30 hours in a month) = 1.68 USD
- Web Server: 1 EC2 t4g.micro instance (Working for 24h per day) = 6.72 USD
- NAT Instances: 2 EC2 t4g.micro (Working for 24 hours per day) = 13.44 USD
- 2 Internet Gateways = FREE
- VPC Peering between Spain and Germany - Each probe will generate about 6 MB per day = 180 MB per month x 6 probes ≈ 1 GB of traffic = 0.04 USD
- Elastic IP = 3.65 USD per IP x 3 (2 NAT Instances + Web Server) = 10.95 USD
- RDS Single-AZ (db.t4g.micro) = 49.20 USD
Total: 82.03 USD - Monthly invoice // 4.6 times less
Estimated Carbon footprint
5.293 KgCO₂eq (Between EC2 and RDS instances) // 7.56 times less
Conclusion
This final iteration is the most optimized architecture in terms of resources, being both the most efficient and the cheapest. It was not easy to visualize, but everything started to fall into place when I began drawing the problem with paper and a pen.
This is my approach to this problem, but there are plenty of valid solutions. What would you have done differently? I'll read you in the comments.
PS: Should AWS promote this type of exercise for the community and hold architecture contests?