Bruno Ferreira

Posted on • Originally published at codacy.com

DNS Hell in k8s

Here at Codacy, everyone has been working really hard over the last few months to move all of our services to Kubernetes. And it has been a bumpy road... From having to run an nfs-server provisioner so that pods can share files, to launching our own scheduler to avoid scaling issues, we keep hitting interesting problems along the way, which gives us the opportunity to do and learn new things every day. That is pretty cool.

In this blog post I will talk about one of the most common issues that almost everyone, including us, seems to struggle with when moving to k8s: intermittent DNS delays.

One of our components, which we simply call the worker, runs as a short-lived pod and is responsible for spawning other pods that run the analysis tools (eslint, pylint, etc.), gathering their data and saving it.
Because of some pitfalls in our architecture, we agreed that this was one of the toughest parts of the system to put on k8s and therefore decided to tackle it first.
During one of the first attempts to run these workers on k8s in production, after some minutes we noticed a problem that had not shown up during our tests in the development and staging environments.
Some workers were throwing UnknownHostExceptions while trying to access Graylog and the databases, which were running outside the cluster. The failures seemed to increase as the number of workers running on the nodes increased. After some research, we found lots of users complaining about intermittent DNS delays of ~5s in this GitHub issue. This was a problem, since the resolver's default timeout for a DNS query is 5 seconds. We went through almost all of the solutions mentioned in the issue thread:

  1. "arp table overflow on the nodes (arp -n showing more than 1000 entries). Increasing the limits solved the problem"

    We checked this on our nodes and it was not a problem for us, since every node had only around 50 entries.

  2. "dnsPolicy: Default works without delays"

    Well, actually, this was a bit confusing for us because, despite being called Default, this dnsPolicy is not the default policy in k8s. The default DNS policy is "ClusterFirst", i.e., "any DNS query that does not match the configured cluster domain suffix, is forwarded to the upstream nameserver inherited from the node", while the "Default" DNS policy just "inherits the name resolution configuration from the node that the pods run on". We tested this by running some test pods that tried to resolve www.google.com, and this configuration decreased the average resolution time from ~5s to ~2s. That is still a long time to resolve a name, but we decided to try it on our workers anyway. However, after some minutes we still got UnknownHostExceptions while trying to access the external services on startup.

  3. "Use fully-qualified names when possible"

    As explained by Marco Pracucci in his blog, in k8s, if you're trying to resolve the name of something outside of the cluster with the default configuration in /etc/resolv.conf, any name that contains fewer than 5 dots will first cycle through all of the search domains. For example, to resolve codacy.com, the names codacy.com.kube-system.svc.cluster.local., codacy.com.svc.cluster.local., codacy.com.cluster.local., codacy.com.ec2.internal. and finally codacy.com. must be looked up, for both A and AAAA records. A name written with a trailing dot (codacy.com.) is treated as fully qualified and skips the search list. While this didn't solve our issue, it was good to be aware of, so we could tweak our apps and get better performance.

  4. "using the option single-request-reopen on /etc/resolv.conf fixed the problem"

    Fortunately, the folks at Weaveworks and at XING investigated this problem in depth and explained in detail why this can be a solution. Basically, the root cause of these delays is the Linux connection tracking mechanism, aka conntrack, which is implemented as a kernel module and is inherently racy. According to the man page for resolv.conf, the single-request-reopen option enables sequential lookups using different sockets for the A and AAAA requests, which reduces the race conditions. We also tried this but, after some minutes, workers kept failing, and when we ran conntrack -S on the nodes, the insert_failed counter was still increasing.

  5. "use tcp for DNS requests"

    UDP gives no delivery guarantee, so when the kernel drops one of the UDP packets because of the races, the client only re-sends it after a timeout (5 seconds). Therefore, we decided to try using TCP for DNS as a workaround for this issue, as it was also one of the workarounds suggested by Pavithra Ramesh and Blake Barnett in their recent talk at KubeCon Europe. Although TCP makes each lookup a bit slower, it actually solved the problem for us and the pods stopped failing.
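
The search-domain expansion described in item 3 can be sketched in a few lines of Python. This is only an approximation of what a glibc-style resolver does, not its actual implementation; the function name `candidate_names` is hypothetical, and the search list is the one implied by the codacy.com example (a pod in the kube-system namespace on EC2).

```python
def candidate_names(name, search, ndots=5):
    """Return the names a glibc-style resolver would try, in order (approximation)."""
    # A trailing dot marks the name as fully qualified: the search list is skipped.
    if name.endswith("."):
        return [name]
    expanded = [f"{name}.{domain}." for domain in search]
    absolute = f"{name}."
    if name.count(".") >= ndots:
        # Enough dots: the name is tried as-is first, then with the search domains.
        return [absolute] + expanded
    # Fewer than ndots dots: the search domains are tried first, the name itself last.
    return expanded + [absolute]


# Search list taken from the example in the text.
SEARCH = [
    "kube-system.svc.cluster.local",
    "svc.cluster.local",
    "cluster.local",
    "ec2.internal",
]

for fqdn in candidate_names("codacy.com", SEARCH):
    print(fqdn)
```

With ndots set to 5, codacy.com expands to five candidate names, each looked up for both A and AAAA records, while codacy.com. (with the trailing dot) yields a single candidate.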

The good news is that these DNS problems caused by race conditions in the connection tracking mechanism (fun fact: this is briefly "documented" in Linux's source code) already have two patches to fix them (if you're brave enough, you can take a look at them here and here).
However, the most recent patch is only available since version 5 of the Linux kernel, and it's not always possible to control the kernel version of the nodes where your pods will run. In our case, since we are running on EKS, we are using Amazon Linux 2 and the most recent update (7/18/2019) only supports the 4.19 kernel, which only contains one of the patches.
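
To see whether a node is still hitting the race on a given kernel, the insert_failed counter from `conntrack -S` can be compared between two points in time. The sketch below is a minimal parser, assuming the per-CPU key=value layout that `conntrack -S` prints; the exact fields vary between versions, the sample text is made up for illustration, and `total_insert_failed` is a hypothetical helper name.

```python
import re


def total_insert_failed(conntrack_stats: str) -> int:
    """Sum the insert_failed counter over every per-CPU line of `conntrack -S` output."""
    return sum(int(n) for n in re.findall(r"\binsert_failed=(\d+)", conntrack_stats))


# Made-up samples in the per-CPU key=value layout; real output comes from the node.
SAMPLE_BEFORE = """\
cpu=0 found=12 invalid=3 insert=0 insert_failed=7 drop=7 early_drop=0 error=0 search_restart=2
cpu=1 found=9 invalid=1 insert=0 insert_failed=4 drop=4 early_drop=0 error=0 search_restart=0
"""
SAMPLE_AFTER = SAMPLE_BEFORE.replace("insert_failed=4", "insert_failed=9")

before = total_insert_failed(SAMPLE_BEFORE)
after = total_insert_failed(SAMPLE_AFTER)
print(f"insert_failed grew by {after - before}")
```

A total that keeps growing between snapshots suggests packets are still being dropped by the race, which is what we observed after trying single-request-reopen.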

Meanwhile, the most appropriate solution seems to be NodeLocal DNSCache (beta in k8s v1.15), already detailed in the official k8s documentation. This solution aims to improve the overall DNS performance of the cluster by running DNS caching agents on every node as a DaemonSet, so pods can reach the agent running on their own node. This reduces the number of upstream trips, which still go through conntrack and so remain exposed to the race conditions described above.

We didn't try other solutions, such as using a sidecar on every pod that runs tc to delay DNS packets, since they seemed more complex and required more configuration. We also discarded every solution that required any change to the node configuration, since we can also deploy our application on-premises and, in that case, the cluster nodes are managed by our customers.
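
Resolver options such as the TCP workaround from item 5 can be set per pod, which avoids touching the nodes. The post does not say which mechanism was actually used, so the following is only a sketch under two assumptions: the pod image uses glibc, whose resolv.conf supports the `use-vc` option to force TCP, and the option is passed through the pod spec's `dnsConfig.options` field. The helper `pod_dns_config` is hypothetical.

```python
import json


def pod_dns_config(*options):
    """Build a pod-spec dnsConfig fragment from resolv.conf option names."""
    # Each entry becomes an `options` item in the pod's generated /etc/resolv.conf.
    return {"dnsConfig": {"options": [{"name": opt} for opt in options]}}


# Assumption: `use-vc` (glibc) makes the resolver send DNS queries over TCP.
fragment = pod_dns_config("use-vc")
print(json.dumps(fragment, indent=2))
```

The same helper could emit single-request-reopen from item 4 instead. Images based on a libc that ignores these options would need a different approach.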

In the end, after some research and the typical "trial and error", we were able to find a workaround for this issue until we get to a proper solution in the future, enabling us to proceed and hit the next problem.
😄

See the original article here
