In 2015, AWS announced the managed NAT Gateway service as a (better) alternative for NAT Instances. And indeed, NAT Gateways have many advantages over more traditional NAT Instances, as outlined by Amazon in this comparison. However, at Kabisa were are moving in the opposite direction: from NAT Gateways toward NAT instances. In this blog post I will outline our reasons for doing so, and say something about how we were able to migrate away from NAT Gateways without much effort.
Both NAT Gateways and Instances are used to provide internet access to nodes and services in a private subnet (a subnet without direct connection to the internet (gateway)) in a VPC (Virtual Private Cloud). Without NAT (Network Address Translation), those nodes wouldn't be able to create outgoing connections to the internet. By default, the route table for a private subnet only contains a local route, meant to route traffic within the VPC. A default private subnet looks like this:
This means that every service that uses this route table will not be able to pull updates from the internet, or make API calls to external services such as your ERP software. Sometimes this is fine, especially if you have a highly available cluster with a load balancer and disposable infrastructure. Then software updates are simply managed by spinning up a new instance with a new AMI (Amazon Machine Image) and throwing the old one away.
Schematically, it could look like this:
In this scenario the cluster and RDS aren't directly connected to the Internet Gateway, so outgoing traffic isn't possible and incoming traffic is only possible through the Load Balancer. If we want the cluster to have internet access, it can look something like this:
In order to get information or data from the internet, some kind of external connection is required. That's when you need a NAT Instance or Gateway. The route table then looks something like this for a NAT Instance:
The ENI is the Elastic Network Interface of the NAT Instance. A NAT Instance is nothing more than a regular EC2 instance, running an Amazon-provided NAT AMI, that performs Network Address Translation. In the dropdown menu you can simply select the NAT instance that you wish to use. It will be automatically converted to the right ENI so you need not worry about those.
In December 2015, Amazon announced a new service: NAT Gateways. NAT Gateways are fully managed by Amazon and are built to be highly available and scalable. A normal EC2 instance has a certain amount of (network) capacity, but it won't scale as the load increases. EC2 is also not inherently highly available.
Because it's fully managed, creating a NAT gateway is extremely easy. All you have to specify is which (elastic) IP to use and the subnet it should be located in. A route table that uses a NAT Gateway will look like this:
Once again, you'll be able to just browse your NAT Gateway in the dropdown and pick the one you wish to use.
Depending on your business and technical requirements you can choose to use one NAT Gateway across all AZs (Availability Zones) or, for example, create a different Gateway for every AZ. The latter can be cost-effective if you have lots of traffic and wish to limit inter-AZ data traffic. Data traffic within an Availability Zone is free. Or it can be a requirement to have outgoing traffic be able to withstand the failure of one or more AZs.
Most of the technical differences are outlined by AWS on this page. The main take-away should be that NAT Gateways are managed and more highly-available and scalable. From a technical perspective, NAT Gateways are superior in almost every sense, except if you need a LOT of flexibility for your outgoing traffic, such as a custom firewall/routing that can't be done through Security Groups, Network ACLs or route tables.
So why are we phasing out many of our NAT Gateways in favour of NAT Instances? The answer to this question is fairly simple: costs. We have many DTA (Development, Test, Acceptance) environments in AWS, where costs are a much bigger factor than for a fancy highly available production environment.
NAT Gateways aren't exactly what you'd call cheap. Just having a NAT Gateway costs $0.048 per hour in the region eu-west-1 (Ireland). This translates to roughly $35 per month. Considering that you can get a t3.medium EC2 instance for this kind of money, a NAT Gateway looks disproportionately expensive. Especially for test environments with very little outgoing data traffic.
This is why we are currently only running NAT Gateways in our production environments anymore. It's important to note that the NAT Instance doesn't have to run the application; it just has to serve as infrastructure node to route traffic to the internet. Incoming traffic doesn't pass through a NAT Instance or Gateway, as it's outgoing only. Considering even a t3.nano instance gets 5Gbit/s of bandwidth these days, I don't see the NAT Instance becoming a bottleneck for these types of environments anytime soon.
The hourly cost for a t3.nano instance is merely $0.0057 per hour, or $4.161 per month. With smart purchasing, such as using Reserved Instances, you can even get one for as cheap as $2.75 per month. That's only 7% of the cost of a NAT Gateway.
In addition, a NAT Instance is basically just a regular Linux box, so it can also serve as jump host or bastion host from which to reach the private instances. So in a sense you're already paying for it if you are already using such a jump host, meaning the marginal costs will be even lower than the $2.75 to $4.17 per month.
There are just 4 things you need to do in order to get a working NAT Instance:
- Set up an EC2 instance that is capable of performing NAT operations
- Disable the "Source/Destination check" in AWS
- Add a rule to the routing table(s) to route internet-directed traffic through the newly created instance
- Make sure the security group and Network ACLs of the NAT instance accepts connections from the hosts it needs to route traffic for.
Below I will illustrate in more depth what I mean and how to achieve this.
Considering Amazon's own NAT Instance AMIs haven't been updated in a while (almost 6 months at the time of writing), I'm not even sure they're still properly maintained. Therefore we decided to use a regular Amazon Linux 2 AMI and configure it to perform NAT. This is easier than it sounds. All you need is 2 commands:
sysctl -w net.ipv4.ip_forward=1 /sbin/iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
As it's always recommended to use disposable infrastructure, meaning you should be able to throw away your current box and be able to generate a new box with the same functionality. In essence, reproducible infrastructure. To this end we are using Terraform so we can code our infra. This is called Infrastructructure as Code (IaC) and is highly recommended if you want manageable, stable and reproducible infrastructure. However, Terraform is outside the scope of this blog post. Suffice it to say, you can either put these commands in the EC2 instance's "User Data" using IaC tools like Terraform or Cloudformation, or you can just login and execute these commands as root.
Disabling the Source/Destination check in AWS to not block traffic that doesn't originate from or is destined for that particular instance. For an instance that essentially functions as a router, this check has to be disabled. This can easily be done through the AWS Console by right-clicking on the EC2 instance:
The last step is to let AWS know that it can/has to send internet-bound traffic through the newly created NAT EC2 instance. This should only be done for the (private) subnets for which you wish to use this NAT instance. Adding the rule is simple, as you can simply select the instance and use 0.0.0.0/0 as shown in one of the earlier screenshots.
If you were previously using a NAT Gateway, that rule will show up here. Obviously you can remove that rule, as you now have a much more cost-effective NAT instance that you can use!
If you're seeing a route to 0.0.0.0/0 with as target an Internet Gateway (igw-xxx), you're modifying a public subnet; not a private subnet. A public subnet means that traffic is routed directly to and from the internet gateway, so that nodes in this subnet are internet accessible and have internet access. It's possible to remove this rule but do note that this will essentially convert your public subnet into a private one, meaning the nodes in this subnet won't accept incoming connections anymore on their respective public IP addresses.
What I recommend is using the Security Group(s) (SG) for the hosts that need internet access as source security group in the security group for the NAT instance, and to allow all incoming traffic from those SGs, unless you want to restrict traffic from the private instances more than the traffic originating from the NAT instance itself.
But if not, I would recommend to filter outgoing traffic for either the NAT instance itself or the private instances in the most sensible place: the outgoing SGs or Network ACLs (NACLs) for the respective hosts. Note that if you deny certain outgoing traffic for the NAT instance, it will also not be able to route traffic to those destinations for the private instances so you'll have to manage those differences in the incoming SGs and NACLs of the NAT instance.
That's all! You can easily test if it works by looking up the public IP of your NAT instance and then SSH'ing into one of the private nodes through a jump box or VPN. With a command like curl ifconfig.co you should be able to get the public IP with which that node connects to the internet. This should be equal to the public IP address of the NAT instance.
This is something we hear often, and we have thought about as well, especially for non-critical environments. It would obviously solve the problem as well and we wouldn't need a NAT instance or NAT Gateway, because every instance or service would have its own public IP address. There are 2 main reasons we have decided against this.
Security should be layered. We don't want a single configuration error to lead to the resources suddenly becoming publicly available. Ideally, the perfect Security Group and NACL would essentially achieve the same thing, but it's good to have those resources be isolated from the internet no matter what's set up in the SG or NACL.
In addition, our acceptance environments often contain almost the same data as production environments, so we are unwilling to accept a lower level of security.
We are setting up our environments as code, using Terraform. Architecturally it's much easier and uniform to just swap out a NAT Gateway for a NAT instance, rather than moving all resources to public subnets.
Of course. Here are a few:
- Use Reserved Instances to get 10-35% discount on EC2 or RDS instances
- Use Spot Instances in a highly available cluster for up to 90% discount
- (Automatically) shut down resources when they're not in use, e.g. test environments outside of office hours
- Use the correct tier for S3 Storage or use the recently introduced intelligent tiering feature
- Change previous-generation instances to current or newer generation instance types, e.g. from t2 to t3. Not only will it be faster, it'll also be cheaper.
- Use metric-based autoscaling