Small Architecture Re-cap:
Before jumping into the article, I would like to recap some concepts:
What is an Availability Zone (AZ)?
An AZ in AWS is a physically separate and isolated data center (or a group of data centers) within a Region. AWS Regions consist of multiple AZs, with a minimum of 3 AZs per region.
Each AZ has independent power, cooling, and networking, ensuring high availability and fault tolerance.
Why Is It Best Practice to Use Different Subnets in Different AZs for Resiliency?
When designing highly available architectures in AWS, it is considered best practice to distribute workloads across multiple Availability Zones while segmenting different services into different subnets.
For example:
- The database should be in its own subnets, deployed across multiple AZs
- The frontend should be in different subnets than the database, and also deployed across multiple AZs
Following this principle, we enhance security, improve scalability, and increase resiliency.
Public vs Private Subnets
A public subnet has a direct route to the internet.
E.g., a web server will be located in a public subnet and will be directly exposed to the internet
A private subnet does not have a direct route to the Internet.
E.g., a database will be located in a private subnet and won't have to be exposed to the internet
- It is possible to allow communication from the resources inside of a private subnet to the Internet, but not the other way around. To achieve this, some additional networking services have to be deployed.
Now, yes, let's jump into it!
Some months ago, a friend challenged me to architect an application.
I selected the topic and started to think about how I could create it, thinking as a solutions architect. The small project escalated, and that is the main reason why you are reading this.
How would you create a ping dashboard?
Thank you, Batman! Exactly, the network!
You can't start building the application without considering the network scope. I looked at myself in the mirror and started designing what the application would look like:
I want to create a Multi-AZ resilient ping dashboard with services exposed in public and private subnets. In the future, I want to extend parts of the application to other regions to expand the measurements.
Now that I know which type of structure the network will have, let's deep dive into the Application scope:
The ping dashboard will have a web server accessible via the Internet in a public subnet and display latency data from several virtual machines used as probes from different private subnets.
These probes will take ping measurements against a specific target (URL or IP) sent by the web server, to understand and measure how long it takes to reach that destination from AWS.
The probes will write the data to a database, which the web server will use to expose the results to the end user.
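Just to make the idea concrete, here is a minimal sketch of what a probe could do (the pymysql library, the table name, and the column names are assumptions for illustration, not the final implementation):

```python
import subprocess
import time

import pymysql  # assumption: a MySQL client library; any driver would do


def ping_once(target: str) -> float | None:
    """Ping the target once and return the latency in ms, or None on failure."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "2", target],
        capture_output=True, text=True,
    )
    if result.returncode != 0:
        return None
    # Linux ping prints something like "time=12.3 ms"; extract the number.
    for token in result.stdout.split():
        if token.startswith("time="):
            return float(token.split("=")[1])
    return None


def store_measurement(target: str, latency_ms: float) -> None:
    """Write one measurement to the shared database (hypothetical schema)."""
    conn = pymysql.connect(host="rds-endpoint", user="probe",
                           password="change-me", database="pingdb")
    with conn.cursor() as cur:
        cur.execute(
            "INSERT INTO measurements (target, latency_ms, taken_at) VALUES (%s, %s, %s)",
            (target, latency_ms, time.strftime("%Y-%m-%d %H:%M:%S")),
        )
    conn.commit()
    conn.close()
```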
VPC Structure
I have selected the eu-south-2 region (Spain) and the IPv4 network 10.0.0.0/24.
Because I don't need so many IPs (a /24 contains 256 IPs), I want to segregate this network into smaller subnets and place different components inside each AZ (in different subnets).
Let's do some subnetting.
We will pass from /24 to /28; this means that we will have 16 new networks of 16 addresses each (11 usable, because AWS reserves 5 IPs in every subnet).
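If you want to double-check the subnetting, here is a quick Python sketch using only the standard ipaddress module:

```python
import ipaddress

# Split the VPC CIDR into /28 subnets (16 addresses each).
vpc = ipaddress.ip_network("10.0.0.0/24")
subnets = list(vpc.subnets(new_prefix=28))

print(f"{len(subnets)} subnets of {subnets[0].num_addresses} addresses each")
for subnet in subnets:
    # AWS reserves the first 4 addresses and the last one of every subnet,
    # so only 11 of the 16 addresses are usable for instances.
    print(f"{subnet} -> {subnet.num_addresses - 5} usable IPs")
```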
I have grouped the networks I will use in the different AZs.
AZ A - Networks from 10.0.0.0/28 to 10.0.0.64/28
- There are 5 networks in this range:
- 10.0.0.0/28
- 10.0.0.16/28
- 10.0.0.32/28
- 10.0.0.48/28
- 10.0.0.64/28
AZ B - Networks from 10.0.0.80/28 to 10.0.0.144/28
- There are 5 networks in this range:
- 10.0.0.80/28
- 10.0.0.96/28
- 10.0.0.112/28
- 10.0.0.128/28
- 10.0.0.144/28
AZ C - Networks from 10.0.0.160/28 to 10.0.0.240/28
- There are 6 networks in this range:
- 10.0.0.160/28
- 10.0.0.176/28
- 10.0.0.192/28
- 10.0.0.208/28
- 10.0.0.224/28
- 10.0.0.240/28
Ok, what are we going to build on these networks?
Based on the specifications, we need:
- A DB in a private subnet
- A probe in a private subnet
- A Web Server in a public subnet
In other words, for each AZ, we only need three subnets.
Why don't you put everything in the same subnet?
For security reasons, each component should be isolated from the others. If there is a breach, the network segmentation and our firewall rules will limit access from one subnet to another.
Drawing of our design
I have selected 3 networks per AZ from the list above.
Looks sexy, doesn't it? Let's continue.
If the application needs to be accessed from the Internet, we must deploy an Internet Gateway. The Internet Gateway is a networking component attached to the VPC that gives the web server access to the Internet; it does not live in any subnet.
The probes need Internet access, and we want to prevent external services from reaching them; a NAT Gateway is the solution. The NAT Gateway must be placed in a public subnet, and in combination with the Internet Gateway it will only allow traffic to flow from the inside out. A NAT Gateway will be placed in each AZ.
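To make the routing difference explicit, here is a minimal boto3 sketch (the route table, Internet Gateway, and NAT Gateway IDs are placeholders):

```python
import boto3

ec2 = boto3.client("ec2", region_name="eu-south-2")

# Public route table: default route to the Internet Gateway.
ec2.create_route(
    RouteTableId="rtb-public-az-a",
    DestinationCidrBlock="0.0.0.0/0",
    GatewayId="igw-0123456789abcdef0",
)

# Private route table: default route to the NAT Gateway of the same AZ,
# so the probes can reach the Internet but cannot be reached from it.
ec2.create_route(
    RouteTableId="rtb-private-az-a",
    DestinationCidrBlock="0.0.0.0/0",
    NatGatewayId="nat-0123456789abcdef0",
)
```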
Is placing one NAT Gateway per AZ necessary?
We will answer this question in the second and third iterations 🌚.
Let's add some EC2 instances!
- I have selected the t3.micro for the probes and the web servers.
Yes, web servers (plural): we will place two web servers behind a Network Load Balancer, which will distribute the traffic between the AZs. Why not three web servers?
- We will build the web servers inside an Auto Scaling group that spans different AZs.
- If the ELB detects a failure in the application's health checks, the AutoScaling Group will deploy the web server in the same or in another AZ.
- If the AZ fails, the ELB will stop sending the traffic to that instance, and the AutoScaling Group will automatically re-deploy it to another AZ.
The networking probes are not crucial; we can afford failure in them, but not in the application itself.
We will bind an Elastic IP to the NLB.
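As a rough sketch of the Auto Scaling side (all names and ARNs are placeholders; the launch template and target group are assumed to exist already):

```python
import boto3

autoscaling = boto3.client("autoscaling", region_name="eu-south-2")

autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="ping-dashboard-web",
    MinSize=2,
    MaxSize=2,
    DesiredCapacity=2,
    LaunchTemplate={"LaunchTemplateName": "web-server-template", "Version": "$Latest"},
    # Public subnets in the three AZs (placeholder IDs).
    VPCZoneIdentifier="subnet-public-az-a,subnet-public-az-b,subnet-public-az-c",
    TargetGroupARNs=["arn:aws:elasticloadbalancing:eu-south-2:123456789012:targetgroup/web/abc123"],
    HealthCheckType="ELB",          # replace instances that fail the NLB health checks
    HealthCheckGracePeriod=120,
)
```

With HealthCheckType set to "ELB", the group replaces any instance that fails the load balancer health checks, which is exactly the behaviour described above.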
Let's add the DBs!
- I have chosen an RDS Multi-AZ MySQL architecture with a master database and two DBs on standby.
- The Master DB updates the replicas automatically.
- If the master DB fails, one of the DBs in Standby will take over.
- The DB instance class for this simple design is db.t3.micro
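For reference, here is a minimal boto3 sketch of the Multi-AZ deployment (identifiers and credentials are placeholders; note that MultiAZ=True creates a single standby, while a two-standby deployment would be an RDS Multi-AZ DB cluster):

```python
import boto3

rds = boto3.client("rds", region_name="eu-south-2")

rds.create_db_instance(
    DBInstanceIdentifier="ping-dashboard-db",     # placeholder name
    DBInstanceClass="db.t3.micro",
    Engine="mysql",
    MasterUsername="admin",
    MasterUserPassword="change-me",               # placeholder credentials
    AllocatedStorage=20,
    MultiAZ=True,                                 # synchronous standby in another AZ, automatic failover
    DBSubnetGroupName="ping-dashboard-private",   # subnet group spanning the private DB subnets
    PubliclyAccessible=False,
)
```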
Let's expand our network!
Following the same principle and because we want to extend to other regions for taking measurements, we can create the following services:
- A new VPC using the 10.1.0.0/24 in Frankfurt, for example
- A public and private subnet for the probe in each AZ
- A NAT gateway in each AZ
- An Internet Gateway
- The probes
- VPC peering for interconnecting the new region with the main one
Security Measures
- Session Manager will be used to connect to the web server and the probes, removing the need to expose SSH; SSH is disabled by default
- There will be a Security Group (SG) per service (see the sketch after this list):
- The Web Server SG will allow connections from the NLB
- The NLB will allow connections from the internet
- The Database SG will allow connections coming from the Web Server SG and from the Probes SG
- The Probes SG will allow sending and receiving ICMP traffic to and from anywhere (0.0.0.0/0), plus traffic from the Web Server SG
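Here is a small boto3 sketch of the two most interesting rules, the ones that reference other Security Groups instead of IP ranges (the group IDs are placeholders):

```python
import boto3

ec2 = boto3.client("ec2", region_name="eu-south-2")

# Placeholder security group IDs created beforehand.
WEB_SG, DB_SG, PROBE_SG = "sg-web", "sg-db", "sg-probes"

# Database SG: only MySQL traffic coming from the Web Server SG and the Probes SG.
ec2.authorize_security_group_ingress(
    GroupId=DB_SG,
    IpPermissions=[{
        "IpProtocol": "tcp", "FromPort": 3306, "ToPort": 3306,
        "UserIdGroupPairs": [{"GroupId": WEB_SG}, {"GroupId": PROBE_SG}],
    }],
)

# Probes SG: ICMP (ping) allowed from anywhere.
ec2.authorize_security_group_ingress(
    GroupId=PROBE_SG,
    IpPermissions=[{
        "IpProtocol": "icmp", "FromPort": -1, "ToPort": -1,
        "IpRanges": [{"CidrIp": "0.0.0.0/0"}],
    }],
)
```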
Approximate costs per month
- 8 EC2 t3.micro instances (On for 24h // On Demand) = 8 x 8.32USD = 66.56 USD
- 2 Internet Gateways = FREE
- 6 NAT Gateways (1 GB of monthly traffic) = 6 x 35.09 USD = 210.54 USD
- VPC Peering between Spain and Germany - Each probe will generate about 6 MB per day = 180 MB per month x 6 probes ≈ 1 GB of traffic = 0.04 USD
- Network Load Balancer ≈ 20 USD
- Elastic IP = 3.65 USD per IP
- RDS Multi-AZ (db.t3.micro) = 77.13 USD
Total: 377.92 USD - Monthly invoice
Estimated Carbon footprint
40 KgCO₂eq (Between EC2 and RDS instances)
Second Iteration: Searching for savings
- How could we save some money?
- What tradeoffs must we accept to make this possible?
Let's start with the network:
We can eliminate the NAT Gateways that are not indispensable, drastically reducing costs. A single NAT Gateway can be placed in one AZ and serve the instances in the other AZs.
If this NAT Gateway fails for any reason, no instance inside the private subnets can access the Internet until AWS redeploys it.
- Because the NAT Gateway is an AWS-managed service, if it fails it will be redeployed automatically, but this can take up to 15 minutes.
A good approach would be to create an automated failover function (using Lambda) that checks the health of the current NAT Gateway and updates the routes to point at a healthy one if the current one starts to fail.
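A minimal sketch of what that Lambda function could look like (route table and NAT Gateway IDs are placeholders; the trigger, e.g. a CloudWatch alarm, is left out):

```python
import boto3

ec2 = boto3.client("ec2", region_name="eu-south-2")

# Placeholders: the private route tables and the standby NAT Gateway to fail over to.
PRIVATE_ROUTE_TABLES = ["rtb-private-az-a", "rtb-private-az-b", "rtb-private-az-c"]
STANDBY_NAT_GATEWAY = "nat-standby"


def lambda_handler(event, context):
    """Point the default route of every private route table at the standby NAT Gateway.

    Meant to be triggered by an alarm/health check when the primary NAT Gateway fails.
    """
    for rtb in PRIVATE_ROUTE_TABLES:
        ec2.replace_route(
            RouteTableId=rtb,
            DestinationCidrBlock="0.0.0.0/0",
            NatGatewayId=STANDBY_NAT_GATEWAY,
        )
    return {"updated_route_tables": PRIVATE_ROUTE_TABLES}
```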
I have kept two NAT Gateways in Spain to ensure high availability.
I have left one in the German region because it is not critical.
- If that NAT Gateway fails, I will simply have to wait until it is redeployed
Let's continue with the DB!:
By removing the Multi-AZ approach, we will save more money at the end of the month. But what are our tradeoffs?
- RDS Single-AZ is not covered by the 99.95% RDS Multi-AZ SLA
- Recoverable failures, such as failures of the DB instance itself, are handled automatically within the same AZ, with an RTO under 30 minutes (this can vary depending on the instance size).
- Unrecoverable failures, such as an EBS volume failure: the RTO would be the time it takes to start up a new RDS instance and then apply all the changes since the last backup. This has to be triggered manually or automated via a Lambda script.
- Availability Zone failures require manual or scripted recovery via point-in-time restoration in another AZ (a sketch follows this list).
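This is roughly what that scripted recovery could look like with boto3 (instance identifiers, AZ, and subnet group are placeholders):

```python
import boto3

rds = boto3.client("rds", region_name="eu-south-2")

# Restore the single-AZ instance into another AZ from the latest backup data.
rds.restore_db_instance_to_point_in_time(
    SourceDBInstanceIdentifier="ping-dashboard-db",
    TargetDBInstanceIdentifier="ping-dashboard-db-recovered",
    UseLatestRestorableTime=True,          # apply all changes captured since the last backup
    DBInstanceClass="db.t3.micro",
    AvailabilityZone="eu-south-2b",        # recover into a healthy AZ
    DBSubnetGroupName="ping-dashboard-private",
)
```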
Let's end with the EC2:
This will be a paradigm change in our approach.
- We have killed the Load Balancer and the web server instance in AZ B. The Elastic IP of the NLB has now been delegated to the web server
What is the tradeoff of doing this?
- AWS monitors the health of the hardware but does not keep track of the health of our application as the Load Balancer did.
- For this case, we should use Route 53 health checks and Lambda functions to detect failures in the app, automatically terminating the unhealthy instance and letting the Auto Scaling Group replace it (see the sketch after this list).
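A minimal sketch of that idea, assuming the Lambda receives the failing instance ID from the health-check alarm (the event payload shape is an assumption):

```python
import boto3

autoscaling = boto3.client("autoscaling", region_name="eu-south-2")


def lambda_handler(event, context):
    instance_id = event["instance_id"]      # hypothetical event payload
    # Mark the instance unhealthy so the Auto Scaling Group terminates and replaces it.
    autoscaling.set_instance_health(
        InstanceId=instance_id,
        HealthStatus="Unhealthy",
        ShouldRespectGracePeriod=False,
    )
    return {"replaced_instance": instance_id}
```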
But there is one more thing...
What if the web server automates the process by turning the probes on only when a user requests a measurement through the web interface, or at set time intervals, and then switches them off again afterward?
We could save a lot of money with this automated scheduler...
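The scheduler itself can be tiny; here is a sketch using boto3 (the instance IDs are placeholders, and in practice they could be discovered by tag):

```python
import boto3

ec2 = boto3.client("ec2", region_name="eu-south-2")

# Placeholder probe instance IDs.
PROBE_INSTANCES = ["i-probe-az-a", "i-probe-az-b", "i-probe-az-c"]


def start_probes():
    """Called by the web server (or an EventBridge schedule) when a measurement is requested."""
    ec2.start_instances(InstanceIds=PROBE_INSTANCES)


def stop_probes():
    """Called after the probes have reported their measurements."""
    ec2.stop_instances(InstanceIds=PROBE_INSTANCES)
```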
Approximate costs per month
- Probes: 6 EC2 t3.micro instances (Working for 1 hour per day) = 6 VMs x 0.34 USD (30 hours in a month) = 2.04 USD
- Web Server: 1 EC2 t3.micro instance (Working for 24h per day) = 8.32 USD
- 2 Internet Gateways = FREE
- 3 NAT Gateways (2 GB of monthly traffic) = 105.42 USD
- VPC Peering between Spain and Germany - Each probe will generate about 6 MB per day = 180 MB per month x 6 probes ≈ 1 GB of traffic = 0.04 USD
- Elastic IP = 3.65 USD per IP
- RDS Single-AZ (t3.micro) = 50.66 USD
Total: 170.13 USD - Monthly invoice // 2.22 times less
Estimated Carbon footprint
8.991 KgCO₂eq (Between EC2 and RDS instances) // 4.45 times less
Third Iteration: Hold my Beer
Let's swap to ARM:
By changing the EC2 instance type from x86 to ARM Graviton, we will reduce costs, increase electrical efficiency, and reduce the carbon footprint by about 40% compared to the x86 family.
Therefore, the t3.micro and db.t3.micro have been swapped for t4g.micro and db.t4g.micro.
Note: In some cases, the use of ARM architecture requires re-compilation of some workloads. Before migrating, make sure the whole application is compatible.
What have you done to the network?
To squeeze out some extra savings, we can get rid of the NAT Gateways and use NAT instances instead, at the cost of adding more operational overhead to the solution.
We are moving from a managed service to building it ourselves, and maybe this is not the best approach, but it is worth pointing out that this possibility exists.
Advantages:
- Lower cost
- In combination with an AutoScaling group and Route53, we can create automatic failover, building a resilient system.
Disadvantages:
- The maximum throughput is 5 Gbps, so we should monitor this metric to make sure we do not hit the limit and start dropping traffic. In comparison, NAT Gateways can scale up to 100 Gbps of throughput.
- This design requires an extra Elastic IP per NAT instance.
- Manual management of the instance; it does not scale by default
This is the AWS Tutorial on how to configure it.
In our case, if the NAT instance fails either in Spain or Germany, the downtime will be the time it takes to redeploy it. This is the tradeoff I have chosen.
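For completeness, here is a sketch of the two AWS-side steps that make a NAT instance work (IDs are placeholders; the instance itself also needs IP forwarding and NAT configured at the OS level, as the tutorial explains):

```python
import boto3

ec2 = boto3.client("ec2", region_name="eu-south-2")

# The NAT instance must not enforce source/destination checks,
# otherwise it will drop the traffic it forwards for the private subnets.
ec2.modify_instance_attribute(
    InstanceId="i-nat-instance",
    SourceDestCheck={"Value": False},
)

# Point the private route table's default route at the NAT instance's network interface.
ec2.replace_route(
    RouteTableId="rtb-private-az-a",
    DestinationCidrBlock="0.0.0.0/0",
    NetworkInterfaceId="eni-nat-instance",
)
```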
Approximate costs per month
- Probes: 6 EC2 t4g.micro instances (Working for 1 hour per day) = 6 VMs x 0.28 USD (30 hours in a month) = 1.68 USD
- Web Server: 1 EC2 t4g.micro instance (Working for 24h per day) = 6.72 USD
- NAT Instances: 2 EC2 t4g.micro (Working for 24 hours per day) = 13.44 USD
- 2 Internet Gateways = FREE
- VPC Peering between Spain and Germany - Each probe will generate about 6 MB per day = 180 MB per month x 6 probes ≈ 1 GB of traffic = 0.04 USD
- Elastic IP = 3.65 USD per IP x 3 (2 NAT Instances + Web Server) = 10.95 USD
- RDS Single-AZ (db.t4g.micro) = 49.20 USD
Total: 82.03 USD - Monthly invoice // 4.6 times less
Estimated Carbon footprint
5.293 KgCO₂eq (Between EC2 and RDS instances) // 7.56 times less
Conclusion
This final iteration is the most optimized architecture in terms of resources, being both the most efficient and the cheapest. It was not easy to visualize, but everything started to fall into place when I began drawing the problem with paper and a pen.
This is my approach to this problem, but there are plenty of valid solutions. What would you have done differently? I'll read you in the comments.
PS: Should AWS promote this type of exercise for the community and hold architecture contests?