In a previous post I showed how you can run a multi-region service while keeping response times low, using an architectural pattern called Availability Zone Affinity.
However, the previous design has a considerable issue: it doesn't perform a regional failover. In other words, if an entire region goes down, the service becomes inoperable for the customers closest to that region.
To overcome this problem, this project shows how Global Accelerator can be used to provide a single point of entry to your service with static IPs available globally.
The code of this project is available here
Working with Global Accelerator
So, to improve the previous design, we will add a new entry point on top of it using Global Accelerator (GA). This AWS service offers a global fixed endpoint with two static IPs.
When a web client uses this endpoint, its traffic is sent to the nearest point of presence of the AWS Edge network, and from there it goes through the AWS backbone instead of going through the Internet all the way to the intended resource (which can be a load balancer or an EC2 instance).
GA is used by several customers. Let's take Okta as an example. Okta follows a multi-tenant architecture with subdomains for its customers, and for LinkedIn you can see the GA endpoint exposed as a CNAME record, as shown in the image below.
Feel free to test it with other Okta customers, such as Zoom, to see a different GA endpoint.
AWS also provides a webpage that allows you to see the differences in response times from different regions when you use GA as opposed to going via the public Internet to reach an AWS endpoint: https://speedtest.globalaccelerator.aws/
Implementing the Design
On its own, GA can provide the same features Route 53 offers with latency-based and failover records, so it could replace them entirely. However, in the code we will keep everything deployed previously to allow some comparison between the approaches.
In summary, in the place of Route53, using GA in the design looks something like this:
To understand step by step how the design was made, please visit the aws-route53-global-dns project to learn more.
A new file has been added at aws-route53-global-dns/terraform/ga.tf with all the relevant code. This change was made in a separate branch so we can keep track of the changes and leave the previous project untouched.
GA follows a component hierarchy of listener -> endpoint group -> endpoint.
Listeners define the port and network protocol to listen on and which endpoint groups should receive the traffic.
An endpoint group describes a regional group of endpoints, which can be Application Load Balancers, EC2 instances, or, in this case, Network Load Balancers (NLBs). For each endpoint you can set a weight, which defines how traffic is balanced across the endpoints within the group.
In the code you can see endpoints defined as follows:
endpoint_configuration {
  client_ip_preservation_enabled = false
  endpoint_id                    = module.services_eu["eu1"].nlb_arn
  weight                         = 255
}

endpoint_configuration {
  client_ip_preservation_enabled = false
  endpoint_id                    = module.services_eu["eu2"].nlb_arn
  weight                         = 1
}
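For context, these endpoint_configuration blocks sit inside an endpoint group, which in turn hangs off a listener and the accelerator itself. A minimal sketch of that hierarchy with the Terraform AWS provider (the resource names and the eu-west-1 region here are illustrative, not taken from the project):

```hcl
resource "aws_globalaccelerator_accelerator" "this" {
  name            = "service-accelerator" # illustrative name
  ip_address_type = "IPV4"
  enabled         = true
}

resource "aws_globalaccelerator_listener" "https" {
  accelerator_arn = aws_globalaccelerator_accelerator.this.id
  protocol        = "TCP"

  port_range {
    from_port = 443
    to_port   = 443
  }
}

resource "aws_globalaccelerator_endpoint_group" "eu" {
  listener_arn          = aws_globalaccelerator_listener.https.id
  endpoint_group_region = "eu-west-1" # illustrative region

  # The weighted endpoint_configuration blocks go here.
  endpoint_configuration {
    client_ip_preservation_enabled = false
    endpoint_id                    = module.services_eu["eu1"].nlb_arn
    weight                         = 255
  }
}
```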
With the above configuration, we are defining that the primary endpoint in the EU (eu1) should receive 255/256 of the traffic, while the secondary endpoint used for failover receives 1/256.
An endpoint with weight 0 doesn't receive traffic as long as another endpoint group has healthy endpoints. In other words, if we had the eu2 weight set to 0 and eu1 stopped working, the traffic would fail over to the other region, not to eu2.
⚠️ Note: the failover cluster receives a small portion of traffic (1/256 ≈ 0.39%), which makes this an active-active setup, in contrast to the active-passive configuration we had before. This has the benefit of ensuring that the failover cluster is always operational, since it receives a portion of customer traffic at all times.
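For comparison, the weight-0 variant discussed above would look like this, with only the weight changing (a sketch, reusing the project's module reference):

```hcl
endpoint_configuration {
  client_ip_preservation_enabled = false
  endpoint_id                    = module.services_eu["eu2"].nlb_arn
  # With weight 0, eu2 receives no traffic while other healthy endpoints
  # exist; if eu1 failed, GA would fail over to a healthy endpoint group
  # in another region rather than to eu2.
  weight = 0
}
```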
As another implementation detail: as of now, GA doesn't support client IP preservation when traffic is forwarded to an NLB with a TLS listener. That is why you see client_ip_preservation_enabled = false in the code.
This addition alone is enough to test GA without impacting the existing infrastructure, which shows the benefit of a progressive design that allows improvement by composition, with minimal change to existing components.
Testing the Design
As mentioned before, a new branch global_accelerator has been created on the aws-route-53-global-dns project with the changes required to add the GA.
To deploy the whole thing, just run:

cd aws-route53-global-dns/terraform/
terraform apply
The deployment can take several minutes. For more information regarding the deployment procedure, please refer to this more detailed description.
The previous domain names still work, so we can test them and compare the differences. To hit the GA you can use the domain name service.dkneipp.com. To reach the closest primary NLB you can use www.service.dkneipp.com, as shown below.
💡 From the domain name using the Global Accelerator, we can see the two anycast IPs. So even in the event of a regional failover, the IPs of your service don't change, which allows customers to define network policies for your service based only on IPs if required.
Note: service.dkneipp.com now works at the zone apex. Unlike CNAME records, which cannot coexist with the SOA record of the zone, the alias A record used for the GA endpoint can.
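As a sketch of how such an alias record can be defined in Terraform (the Route 53 zone and accelerator resource names here are assumptions, not taken from the project):

```hcl
resource "aws_route53_record" "apex" {
  zone_id = aws_route53_zone.main.zone_id # assumed zone resource name
  name    = "service.dkneipp.com"
  type    = "A"

  # Alias records can sit at the zone apex, where a CNAME is not allowed.
  alias {
    name                   = aws_globalaccelerator_accelerator.this.dns_name
    zone_id                = aws_globalaccelerator_accelerator.this.hosted_zone_id
    evaluate_target_health = true
  }
}
```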
Now, let's do some testing.
Failover
To simulate an issue in one of the web servers, the instance is removed from the associated target group (as shown here).
After that, you can see that external web clients are seamlessly redirected to the secondary web server, as shown below:
This is the command used to perform the test shown above:
while true; do sleep 2 && curl -w 'Total: %{time_total}s\n' 'https://service.dkneipp.com' && date +%T && echo ""; done;
This was also performed in the previous project. The interesting addition here is the cross-region failover. After all the web servers in the region are taken down:
You can see the failover happening automatically, and the web server in the other region starts to respond to the traffic (now with much higher response times, but with the service still operational). However, in this case the failover was not transparent: the end user would have experienced issues for around 17 seconds.
And finally, once the primary web server is live again, the recovery is also automatic after a transition window, as seen below:
Response times
To get more interesting statistics over response times (such as the average with standard deviation, and percentiles), I've created a small utility in Go that collects them from the response times of GET requests to a specified URL.
The code of the utility is in http-latency-test/. Binaries for macOS on ARM and x86 Linux have already been built, and you can use the Makefile to build a binary from source if required.
The utility accepts the following arguments:

- -count: maximum number of requests; pass 0 to keep it running forever;
- -sleep: the amount of time in milliseconds to wait between requests (default is 500 ms);
- -url: the endpoint to make the request to.
So, a simple test can be: ./http-latency-test --url https://service.dkneipp.com --count 1000 --sleep 100.
With this tool, a test was performed to compare the response times of GA against hitting the closest NLB directly. This was done for both the Europe and South America regions, and the results are shown below.
[Table: response-time comparison charts, one panel for Europe and one for South America]
The interesting thing to point out is that Global Accelerator delivers the same or better response times than hitting the NLB directly. For the European region, we can see an average improvement of 22% in response times! 🤩
However, this varies by region and also depends on several networking factors, such as the location of the web client and its Internet connection conditions.
As mentioned before, this improvement is due to the network path taken by the packets when the request is made. When hitting the NLB directly, the traffic goes through the Internet before reaching the AWS resource. When using GA, the traffic goes through the AWS backbone as much as possible.
Cleanup
A simple terraform destroy should delete all 127 resources.
Conclusion
This project shows how you can use the AWS Global Accelerator to make a service globally available, resilient to failures in availability zones and entire regions, while also keeping response times low for users worldwide.
AWS Global Accelerator can be used for many other use cases, such as blue-green deployments, custom routing to build sessions for online games, or simply providing a static IP and endpoint to customers without having to rely on DNS.
I encourage you to have a look if you manage AWS environments and don't already know about this service.