DEV Community

devgrowth


Gradually Shift Traffic with AWS Route 53 Weighted Routing Policy

🤔 Why do we need to shift traffic gradually?

It is common in software to migrate an existing service to new infrastructure, such as moving to the cloud. The business logic remains the same, but it is still a drastic change because the new service stack runs on new infrastructure. We can use extensive integration tests, load tests, etc. to check that the new service works as expected. For critical services, however, it is still safer to shift production traffic to the new stack gradually, so that the service owner can verify in a safe and incremental way that the new service is robust and scalable, and clients have time to adjust if necessary.

High level idea of traffic shifting

You may think there are already tools for this "gradual shifting" scenario. For example, many feature flag tools, either built in-house or from third-party vendors, probably offer functionality like this, and they can be the right solution in some cases, especially for experimenting with a new feature. For other scenarios you have to consider whether it is the right solution: Will it add latency that could break your service's SLA? Can the tool handle the volume when all of a service's traffic goes through it? Is there cleanup work to do on the feature flag after all traffic is migrated?

More often, if there is no plan to maintain the existing service stack, it is common to "redirect" the traffic using DNS resolution, and this is what this article focuses on. (I've also seen people do a one-time flip using DNS, but this is usually not recommended considering the risk.)


💡 How do we use DNS to solve this problem

Before we get into the details of using AWS Route 53, we need to understand a bit more about DNS and Route 53 itself:

Amazon Route 53 is a highly available and scalable cloud Domain Name System (DNS) service. It lets you customize DNS routing policies, for example to reduce latency.

So how does DNS (Domain Name System) work?

source: AWS - What is DNS?

In a nutshell, DNS is the phone book that translates a human-friendly domain name such as example.com into a machine-readable IP address such as 192.0.2.244.

When a request is initiated, the DNS lookup walks a hierarchy of name servers, each of which resolves part of the name. In the diagram above, the query for www.example.com goes first to a DNS root name server, then to the name server for the .com TLD, and finally to the Route 53 name server, which holds the record for www.example.com and returns the IP address the client needs to make its request to the host. The result is heavily cached along this path.
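To illustrate why caching matters for what follows, here is a minimal sketch of a resolver cache that keeps an answer until its TTL expires. The `CachingResolver` class, the `upstream` callable, and the injected clock are hypothetical names for illustration only; real resolvers are considerably more involved.

```python
import time
from typing import Callable, Dict, Tuple


class CachingResolver:
    """Toy resolver cache: remembers an answer until its TTL expires."""

    def __init__(
        self,
        upstream: Callable[[str], Tuple[str, int]],
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        # upstream(name) stands in for the full hierarchy walk and
        # returns (ip_address, ttl_seconds). Hypothetical interface.
        self._upstream = upstream
        self._clock = clock
        self._cache: Dict[str, Tuple[str, float]] = {}
        self.upstream_queries = 0

    def resolve(self, name: str) -> str:
        now = self._clock()
        cached = self._cache.get(name)
        if cached is not None and cached[1] > now:
            # Still fresh: answer from cache without asking upstream.
            return cached[0]
        ip, ttl = self._upstream(name)
        self.upstream_queries += 1
        self._cache[name] = (ip, now + ttl)
        return ip
```

With a 60-second TTL, a client that resolved the name just before a record change keeps using the old answer for up to a minute, which is why traffic lingers after a DNS update.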
Now that we know how DNS works, let's see how to implement traffic shifting with AWS Route 53:

Scenario A: Ask clients to use new API domain/URL

If it is easy to ask clients to use a new API URL, adding weighted routing policy records is relatively straightforward. You first get a new domain delegated, either from your company's internal infrastructure or from a third-party service provider, then create a new (public) hosted zone in Route 53.

Add a new hosted zone in Route 53

Click on the hosted zone, and you should see an NS (name server) record and an SOA (start of authority) record that were created automatically. For this scenario you don't need to change these records; just leave them as they are and know that they hold administrative information for DNS resolution. We will talk more about them in the other scenario.
The next step is to create records with a weighted routing policy:

create weighted record to shift traffic

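After creating a record that points to the new service (with 155 as the weight), we can create another record that points to the existing service with a weight of 100. Route 53 answers with each record in proportion to its weight divided by the sum of all weights in the group, so these values send about 61% (155/255) of lookups to the new service; for a cautious first step you would start the new record at a much lower weight. We should then see these two records in the hosted zone, as below: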

weighted records in hosted zone
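To make the weight arithmetic concrete, here is a small sketch that turns a set of record weights into expected traffic shares. The function name and the record labels are hypothetical. The 0 to 255 range and the "all weights zero means equal probability" rule reflect my understanding of the Route 53 documentation, so treat them as assumptions to verify.

```python
from typing import Dict


def traffic_shares(weights: Dict[str, int]) -> Dict[str, float]:
    """Expected fraction of DNS answers per record: weight / sum(weights)."""
    if not weights:
        raise ValueError("at least one weighted record is required")
    for name, weight in weights.items():
        # Assumption: Route 53 accepts integer weights from 0 to 255.
        if not 0 <= weight <= 255:
            raise ValueError(f"weight for {name!r} must be between 0 and 255")
    total = sum(weights.values())
    if total == 0:
        # Assumption: with every weight at 0, all records are equally likely.
        return {name: 1 / len(weights) for name in weights}
    return {name: weight / total for name, weight in weights.items()}


# The weights used in this article's example.
shares = traffic_shares({"new-service": 155, "existing-service": 100})
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

These are shares of DNS answers, not of requests; because of caching, the request split only approaches them across many clients and over time.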

Once this is set up, you can gradually change the weights so that the new service stack eventually receives all the traffic.
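If you would rather script the weight changes than click through the console, one option is to generate a change batch per step. The sketch below only builds the payloads, following the shape of the Route 53 ChangeResourceRecordSets request as I understand it. The domain names, set identifiers, TTL, and the schedule itself are hypothetical placeholders.

```python
import json
from typing import Any, Dict, List, Tuple


def weighted_change_batch(
    record_name: str,
    new_target: str,
    existing_target: str,
    new_weight: int,
    existing_weight: int,
    ttl: int = 60,
) -> Dict[str, Any]:
    """Build one change batch that updates both weighted CNAME records."""

    def upsert(set_id: str, target: str, weight: int) -> Dict[str, Any]:
        return {
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": record_name,
                "Type": "CNAME",
                "SetIdentifier": set_id,  # distinguishes records sharing a name
                "Weight": weight,
                "TTL": ttl,
                "ResourceRecords": [{"Value": target}],
            },
        }

    return {
        "Comment": f"new={new_weight} existing={existing_weight}",
        "Changes": [
            upsert("new-stack", new_target, new_weight),
            upsert("existing-stack", existing_target, existing_weight),
        ],
    }


# Hypothetical schedule: (new weight, existing weight) per step.
SCHEDULE: List[Tuple[int, int]] = [(0, 100), (10, 90), (50, 50), (100, 0)]

batches = [
    weighted_change_batch(
        "api.example.com", "new.example.net", "old.example.net", new, old
    )
    for new, old in SCHEDULE
]
print(json.dumps(batches[1], indent=2))
```

Each batch could then be submitted with whatever client you use, for example as the `ChangeBatch` argument of boto3's `change_resource_record_sets`, waiting and watching your metrics between steps. Putting both records in one batch keeps the two weights consistent with each other.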

Scenario B: No change required from clients to use weighted record

In reality, production services often serve a wide range of clients, and it can be challenging to ask every client to use a new API domain/URL. Luckily, we can still control behind the scenes which endpoints clients use. To understand how this approach works, we need to understand the role of the name server:

An NS record (or nameserver record) is a DNS record that contains the name of the authoritative name server for a domain or DNS zone. When a client queries for an IP address, the DNS lookup follows the NS record to that authoritative name server, which then answers with the IP address of the intended destination.

In other words, a name server (or DNS server) holds all of the DNS zone files and records for a domain. As mentioned in scenario A, when you create a hosted zone in Route 53, an NS record and an SOA record are created by default; the name servers listed there know all the records you create in this hosted zone.

How do we make clients use the weighted records we create in Route 53 without any change on their side? Suppose the clients use the endpoint medium.com, and this domain is managed by some internal infrastructure or a third-party tool. You can get the corresponding name server records with this command:

dig medium.com +noall +answer NS

; <<>> DiG 9.10.6 <<>> medium.com +noall +answer NS
;; global options: +cmd
medium.com.  86400 IN NS alina.ns.cloudflare.com.
medium.com.  86400 IN NS kip.ns.cloudflare.com.

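In this case, we want to use the Route 53 name servers instead of the Cloudflare name servers, so that DNS resolution flows through Route 53 instead of Cloudflare and the weighted records we set up in the Route 53 hosted zone take effect. In practice that means creating a hosted zone for the domain in Route 53, recreating the existing records there alongside the weighted ones, and then replacing the domain's NS records, wherever they are managed today, with the name servers listed in the Route 53 hosted zone.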
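After switching name servers, it helps to confirm that the delegation really points where you expect. The following is a rough sketch that parses the answer lines printed by `dig <domain> +noall +answer NS` and compares them with an expected set. Both function names are hypothetical, and the expected name servers must be copied from your own hosted zone's NS record.

```python
from typing import Iterable, Set


def parse_ns_answers(dig_output: str) -> Set[str]:
    """Extract name server host names from dig's answer section."""
    servers: Set[str] = set()
    for line in dig_output.splitlines():
        line = line.strip()
        if not line or line.startswith(";"):
            continue  # skip blank lines and dig's comment lines
        fields = line.split()
        # Answer lines look like: <name> <ttl> <class> <type> <rdata>
        if len(fields) >= 5 and fields[3].upper() == "NS":
            servers.add(fields[4].rstrip(".").lower())
    return servers


def delegation_matches(dig_output: str, expected: Iterable[str]) -> bool:
    """True when the observed NS set equals the expected NS set."""
    wanted = {name.rstrip(".").lower() for name in expected}
    return parse_ns_answers(dig_output) == wanted


# The output shown above, before any migration.
sample = """\
; <<>> DiG 9.10.6 <<>> medium.com +noall +answer NS
;; global options: +cmd
medium.com.  86400 IN NS alina.ns.cloudflare.com.
medium.com.  86400 IN NS kip.ns.cloudflare.com.
"""
print(sorted(parse_ns_answers(sample)))
```

Because NS records are cached too (note the 86400-second TTL in the sample), different resolvers may keep returning the old name servers for a while after the change.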


📝 Lessons Learned

  • DNS is cached heavily: changes are not instant, and traffic will linger on the old target for a while after you make a change.
  • A shorter TTL on the records makes rollback easier. Once the migration is complete and there is no need for rollback, you can change the TTL back to the default or to whatever value you see fit.
  • During the testing phase, if you have a load test set up, remember that a host resolves to one of the record destinations and keeps using it for the duration of the TTL. It is better to run the load test from several hosts, and for longer than the DNS TTL; only then will you see traffic follow the DNS weights.
  • If the domains in your DNS records are not publicly accessible, you cannot use the commonly used Route 53 health check, since it requires the domain to be publicly reachable. There are other metrics, such as DNS query metrics, but if the domains can only be reached internally, you will have to find other ways to measure the health of requests.
  • To make your application more resilient, Route 53 Application Recovery Controller (ARC) can help, and AWS recently released a feature called zonal shift that you can use to mitigate grey failures.
  • One thing I noticed: if the weights send 100% of traffic to A and 0% to B, in theory you would see no traffic going to B. However, if A cannot be resolved, Route 53 will fall back to other records such as B.
  • When you replace the name server records as described in scenario B, remember to do it in a single operation (adding the new records and deleting the old ones together); otherwise there will be a period during which the domain cannot be resolved.
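The load-testing point above is easy to underestimate, so here is a small simulation of it. Each simulated host picks a target according to the weights, caches that choice for the TTL, and sends one request per second. All names and numbers are hypothetical, and the model ignores shared resolver caches, which would make the effect stronger.

```python
import random
from collections import Counter
from typing import Dict


def simulate_load_test(
    weights: Dict[str, int],
    hosts: int,
    duration_s: int,
    ttl_s: int,
    seed: int = 0,
) -> Counter:
    """Count requests per target when every host caches DNS for ttl_s."""
    # Assumes at least one weight is greater than zero.
    rng = random.Random(seed)
    names = list(weights)
    record_weights = [weights[name] for name in names]
    hits: Counter = Counter()
    for _ in range(hosts):
        cached = None
        expires_at = 0
        for second in range(duration_s):
            if cached is None or second >= expires_at:
                # Cache miss: a fresh weighted answer, kept for one TTL.
                cached = rng.choices(names, weights=record_weights, k=1)[0]
                expires_at = second + ttl_s
            hits[cached] += 1  # one request per second per host
    return hits


weights = {"new": 10, "existing": 90}
# One host, test no longer than the TTL: everything lands on one target.
print(simulate_load_test(weights, hosts=1, duration_s=60, ttl_s=60))
# Many hosts, test spanning many TTLs: the split approaches the weights.
print(simulate_load_test(weights, hosts=200, duration_s=600, ttl_s=60))
```

The first run makes a single DNS decision, so it cannot reflect a 10/90 split at all; the second makes 2,000 decisions and lands close to it.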

If you are interested in learning more about DNS or networking in general, this Coursera course from Google and this course from the University of Colorado give you an overview of computer networking. If you prefer books, there is the classic Computer Networking: A Top-Down Approach. Hope this helps!

