Recently, my team and I set out to build an open-source tool for Kubernetes that automatically creates a functional map of which pods are talking to which pods: a map of the network of services running in the cluster. In this blog, I wanted to share how we approached the problem of figuring out “who’s calling whom” within a Kubernetes cluster.
The context was to provide a starting point for making it easy to access services securely: the network map would show what access is needed now, the cluster could be configured to only allow this intended access, and the map could then be evolved as the functional needs evolve, so security would always align with desired functionality rather than get in its way. You can read more about intent-based access control here.
When we say “who’s calling whom”, we mean at the logical, functional level: e.g. “checkout service” calls “catalog service”. Having a logical network map of your cluster seems to be pretty useful regardless of context. In our experience, people who operate the cluster don’t necessarily know all the ways that services call on and depend on one another, so this may be useful for them for all sorts of reasons. It’s likely interesting for security, for dependency analysis, and just to catch things that look odd.
So how do we tell who’s calling whom? One approach was the way Monzo Bank started their exploration of service-to-service calls: by analyzing service code to figure out what calls it intends to make. But we were wary that it would be very difficult if not impossible to make a general-purpose tool that covered most languages and would robustly catch most calls.
So, instead, we opted for an approach based on sniffing network traffic over a period of time, assuming that all intended calls would happen at least once during that time, and figuring out who’s on either side of these calls. True, we might miss infrequent calls, but that’s a limitation we’d reveal in our docs; once developers were handed bootstrapped files describing most of their calls, they could always add any missed calls manually, if needed.
We were left with two things to work out:
- Who’s calling whom at the network level, i.e. pairs of IP addresses; and
- What's the functional name of the services with those IP addresses.
In this blog post, we’ll describe the former, and leave the functional name resolution to a future post.
If you're already curious about the solution and want to check it out for yourself, you can browse https://github.com/otterize/network-mapper
Goals and constraints
We’re a very technical, curious, and persistent team, and we knew it wouldn’t take much to take us down a rabbit hole. So before brainstorming how to solve this, we made sure to write down what exactly we need to solve and what constraints an acceptable solution must meet:
- Map pod-to-pod communication as pairs of IPs (client and server)
- Focus on in-cluster traffic for now (not cluster ingress and egress)
- Should work in most Kubernetes “flavors” without needing to tailor for each one
- Must be easy to get started
- Must export the output as structured text, for bootstrapping intents files
- Minimize dependencies on other tools which users may not have
- Minimize the impact on the cluster being observed
We pinned this on our wall (at least virtually) and set off to research a solution – this is always the fun part!
Don’t reinvent the wheel
It’s often been said that the tool most used by developers is Google search, and this was no exception. Like most devs, we’re lazy, in a good way: we want to learn from others, we want to reuse what’s already out there, and we want to spend our time pushing the boundaries instead of reinventing the wheel. That’s especially true for a project like this: we wouldn’t want to find out after building something that it was not needed after all.
So we started to look for open-source software that could sniff pod-to-pod traffic. It had to be OSS because our overall solution would itself be OSS, so this part would need to fit in. There are certainly several projects out there. Perhaps the most well-known is Calico by Tigera, so we started there.
Logging traffic with Calico
We drew inspiration from a blog post from Monzo about network isolation. They used Calico’s logging capabilities to detect connections that would have been blocked by network policies and used the information to update their existing network policies. We considered using the same logging capability to achieve observability for all connections. The following snippet demonstrates a Calico network policy that logs and allows all connections:
apiVersion: projectcalico.org/v3
kind: NetworkPolicy
metadata:
  name: log-and-allow-all
  namespace: production
spec:
  types:
    - Ingress
    - Egress
  ingress:
    - action: Log
    - action: Allow
  egress:
    - action: Log
    - action: Allow
In generic Kubernetes network policies, there is no action field. The Calico CNI plugin (Kubernetes network plugin that implements the Container Network Interface) provides this functionality, and in particular provides logging even for allowed traffic. And this worked when we tried it in our test clusters and in our own back end.
But we also realized the hardships it might cause:
- We’ll have to ship Calico with our solution, or require it as a prerequisite.
- More importantly, it might conflict with an existing Kubernetes network plugin the user might be using.
- And we were also pretty convinced that logging all allowed requests would push many clusters to the edge, resource-wise.
So while Calico network policies provide many more capabilities than native ones, using Calico as part of Otterize OSS would not meet our goals and constraints. (Support for Calico could always be added later should the community be interested.)
Any other OSS solutions out there?
Kubernetes is the most popular container orchestration tool by far, it’s also open source, and it’s backed by a massive and active community. So we were completely expecting that we’d find a wide array of tools that at least partially solve the same issues we’re tackling.
One promising candidate was Mizu, which is in fact a traffic viewer for Kubernetes. It’s easy to get started with it, has a great UI for viewing traffic, and a cool network mapper that’s great for troubleshooting. But it was designed for troubleshooting and debugging, and apparently not for reuse, since there is no way to export the map or obtain it via the CLI, at least that we could find. For us, the tool needs to be part of the larger solution, so not having a way to export as text is a deal breaker for our use case. We could fork the project, implement an export, and send back a pull request with our enhancement. But…
When looking to enhance a project, it’s important to understand what the project is aiming to do, not just what it does right now. Because Mizu is aimed at debugging use cases, it needs to look at all traffic, it needs to capture payloads and not just connections, and it needs various powerful capabilities such as seeing the plain text content of encrypted traffic. It’s simply not designed to be lightweight, reusable, low-impact, and easy to understand and vet. Adding an export feature would still leave its design goals far from our stated goals – it’s just not a good fit.
We found other tools for monitoring, alerting, and logging, but none of the ones we looked at came close to meeting our goals. After some time, it became clear we needed to consider building something from scratch.
Sniffing network traffic
We’ve got a lot of collective experience in our team sniffing network traffic at various levels, and Kubernetes offers a rich and reasonably standardized platform to apply this. Let’s recall that all we were trying to do was to identify pairs of pod IPs, i.e. a caller and a callee. So looking at all the traffic in the cluster would be heavy-handed and wasteful. A better idea would be to look only at the SYN packets that initiate any TCP session and establish the connection between caller and callee. But upon further thought, we realized that we’d still likely see many connections over the lifetime of the same pair of pods. An even better approach presented itself: how about looking just at the DNS query made by the caller to initially find the IP address of the callee? Or even better, why not just look at DNS replies, which should be the most efficient approach? We went with DNS replies over TCP SYN because:
- DNS responses are still available even when TCP gets blocked for some reason.
- DNS is typically less traffic intensive for several reasons:
  - DNS can benefit significantly from caching, so e.g. any external traffic with a TTL will usually hit the cache and won’t generate more DNS load.
  - When TCP connections are blocked, we would still see TCP retransmissions: common TCP stacks (that run within the kernel, so they don’t re-resolve between attempts) will retransmit TCP SYNs 3 times before giving up.
- We could look at the names in DNS responses and determine whether they likely pointed at Kubernetes endpoints just by parsing them (see the sketch just after this list), which is much less intensive than attempting to resolve all IP addresses seen in TCP SYNs to pods.
- DNS offers intriguing possibilities to also discover traffic directed outside the cluster, perhaps to another cluster or to a non-Kubernetes-based service, because DNS replies contain the target service name as text. We are planning to expand the reach of the mapper beyond in-cluster traffic, and of course as the network mapper is open source, users can extend it to implement additional, perhaps custom, resolution mechanisms.
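To make that parsing point concrete, here’s a hypothetical Go helper, not code from the mapper itself, that classifies a resolved name as an in-cluster service purely by its suffix. It assumes the default cluster domain (cluster.local) and ignores edge cases such as headless-service pod records and custom cluster domains:

package main

import (
	"fmt"
	"strings"
)

// isClusterServiceName reports whether a DNS name looks like a Kubernetes
// service FQDN of the form <service>.<namespace>.svc.<cluster-domain>.
// The cluster domain is assumed to be the default, cluster.local.
func isClusterServiceName(name string) bool {
	name = strings.TrimSuffix(name, ".") // DNS answers are often fully qualified
	if !strings.HasSuffix(name, ".svc.cluster.local") {
		return false
	}
	rest := strings.TrimSuffix(name, ".svc.cluster.local")
	// Expect exactly <service>.<namespace>; anything longer (e.g. a pod record
	// of a headless service) or shorter is ignored by this simplified check.
	return len(strings.Split(rest, ".")) == 2
}

func main() {
	fmt.Println(isClusterServiceName("cartservice.labstg.svc.cluster.local.")) // true
	fmt.Println(isClusterServiceName("api.github.com."))                       // false
}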
Approaches to sniffing DNS
The straightforward way to get at DNS queries would be to use an efficient tool like tcpdump (or equivalent) to sniff for DNS queries, and then process them to figure out pod IP pairs.
But another approach that could work in Kubernetes, because the DNS servers are within the cluster itself, would be to work directly with the DNS server pods. In most Kubernetes clusters, whether standalone or managed (GKE, AKS, EKS), the cluster DNS is either coredns or kube-dns. That was great for minimizing how many configuration options we’d need to support. We realized we could edit the coredns or kube-dns configmap resources to enable their log option, which would make them log all the queries they handle. We’ll cover exactly how it’s done in more detail below.
Both approaches seemed reasonable, so we tried them both. We started with the latter, thinking that it might be simpler and would not require any traffic sniffing, which could avoid some of the limitations of traffic sniffing down the line.
Monitoring Kubernetes DNS server logs
To enable the DNS server logging option, we simply edited the appropriate configmap for our DNS provider in the kube-system namespace and added the log option. For our EKS cluster, the coredns configmap looked like this:
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        log    <--- ADD THIS; PODS WILL RELOAD THE CONFIG & START TO LOG
        errors
        health
The coredns pods reloaded the configuration and, voilà, we could see queries being logged:
172.31.4.118:51448 - 48460 "AAAA IN nginx-svc.namespace.svc.cluster.local
It seemed all was well with the world. We could move on to parsing these queries and start building the network map.
But lo and behold, after a minute or two, we could not see any more queries being logged! Were there still DNS queries being made? Certainly. So how come they’re no longer being logged?
It turns out that, in a managed Kubernetes cluster, some things like the cluster’s DNS configurations are, well, managed for you. In an EKS cluster, for example, the coredns add-on is installed automatically when creating a cluster via the AWS management console. We noticed our changes to the coredns configmap went missing, and when checking the cluster’s audit log we saw the changes were overwritten by the eks:addon-manager user. The same behavior was observed in a GKE cluster, with the kube-dns configmap being overwritten by the cloud provider.
There are crude actions you can take to prevent this behavior. For example, we removed the patch permissions for configmaps from the eks:addon-manager cluster role, and saw that our changes were no longer being overwritten. But those kinds of actions aren’t a good option, in our opinion: EKS should be able to manage the cluster. We felt that actions like editing default ClusterRole resources are aggressive, may have unforeseen consequences, and will be frowned upon by users wanting to adopt our solution. We also did not want to tailor a specific solution for each cloud provider, which would go against our goal of “one size fits most”. Nor did we want to tell our standalone Kubernetes users how to manage their DNS configs. All these reasons added up to a lot of discomfort with this solution, even if it initially seemed like the most straightforward.
Direct DNS network sniffing - success!
And so we went back to the first DNS-based approach: filtering out of the network traffic all but DNS queries, and processing them to build out a map of pod-to-pod IP pairs.
We set up a quick tcpdump-based POC of a DaemonSet running with the hostNetwork: true option so it ran within the same network as the host of the Kubernetes node itself. It simply captured DNS requests (UDP port 53) and logged them. According to the Pod Security Policy documentation, any pod using the host’s network “could be used to snoop on network activity of other pods on the same node”.
We could now see our solution in action:
# An example DNS query to 'cartservice' from our lab namespace
192.168.15.228.43303 > 192.168.33.60.53: 39615+ A? cartservice.labstg.svc.cluster.local. (54)
192.168.15.228.43303 > 192.168.33.60.53: 41003+ A? cartservice.svc.cluster.local. (47)
192.168.15.228.43303 > 192.168.33.60.53: 42023+ A? cartservice.cluster.local. (43)
192.168.15.228.43303 > 192.168.33.60.53: 56922+ A? cartservice.ec2.internal. (42)
192.168.33.60.53 > 192.168.15.228.43303: 39615*- 1/0/0 cartservice.labstg.svc.cluster.local. A 10.100.115.187 (106)
And we now knew we could reliably tail DNS queries in the cluster. So we set about converting our solution from a tcpdump-based POC to a more robust Go solution, using gopacket to actually parse the requests and build connection pairs.
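To give a flavor of what that conversion looks like, here is a minimal gopacket sketch, not the mapper’s actual code, that captures DNS traffic on the node (as the DaemonSet does with hostNetwork: true) and prints each queried name together with the packet’s source and destination IPs, much like the tcpdump POC did:

package main

import (
	"fmt"
	"log"

	"github.com/google/gopacket"
	"github.com/google/gopacket/layers"
	"github.com/google/gopacket/pcap"
)

func main() {
	// Capture on all interfaces of the node; the DaemonSet pod shares the
	// host's network namespace, so this sees other pods' DNS traffic too.
	handle, err := pcap.OpenLive("any", 65536, true, pcap.BlockForever)
	if err != nil {
		log.Fatal(err)
	}
	defer handle.Close()

	// The BPF filter runs in the kernel, so only DNS packets reach user space.
	if err := handle.SetBPFFilter("udp port 53"); err != nil {
		log.Fatal(err)
	}

	source := gopacket.NewPacketSource(handle, handle.LinkType())
	for packet := range source.Packets() {
		dnsLayer := packet.Layer(layers.LayerTypeDNS)
		if dnsLayer == nil {
			continue
		}
		dns := dnsLayer.(*layers.DNS)
		netLayer := packet.NetworkLayer()
		if netLayer == nil {
			continue
		}
		flow := netLayer.NetworkFlow()
		for _, question := range dns.Questions {
			fmt.Printf("%s -> %s: %s\n", flow.Src(), flow.Dst(), string(question.Name))
		}
	}
}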
We also noticed, as you can see in the tcpdump example above, that we were still seeing multiple requests being sent for a single DNS lookup, and figured that it was due to our pods having multiple entries under search in their resolv.conf file. From the man page for resolv.conf:
Resolver queries having fewer than ndots dots (default is 1) in them will be attempted using each component of the search path in turn until a match is found.
This means that if a DNS query occurs in a pod using just a service name (without a fully qualified domain name), each search suffix will be tried, hence the extra DNS requests. So a lookup like nslookup cartservice actually generates one query per search suffix, as in the example above, even though only one of them returns the answer we’re after. That presented an obvious optimization: why not listen only for DNS answers? After changing our code to filter out all but DNS answers, we only see the one line:
192.168.33.60.53 > 192.168.15.228.43303: 39615*- 1/0/0 cartservice.labstg.svc.cluster.local. A 10.100.115.187 (106)
Success! We now process even less data, further reducing our resource requirements.
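Conceptually, the answers-only refinement boils down to something like the fragment below, which would slot into the capture loop of the sketch above (and reuses its gopacket imports). The connectionPair type and function name are illustrative, not the mapper’s real data model; the key observation is that in a response, the packet’s destination IP is the caller and the A records carry the callee’s IP:

// connectionPair is an illustrative type, not the mapper's actual data model.
type connectionPair struct {
	CallerIP string // pod that issued the DNS query
	CalleeIP string // IP address the queried name resolved to
}

// pairsFromDNSResponse extracts caller/callee IP pairs from one captured
// packet, ignoring everything that is not a DNS response carrying answers.
func pairsFromDNSResponse(packet gopacket.Packet) []connectionPair {
	dnsLayer := packet.Layer(layers.LayerTypeDNS)
	if dnsLayer == nil {
		return nil
	}
	dns := dnsLayer.(*layers.DNS)
	// QR is set only on responses; also skip empty responses such as the
	// NXDOMAIN replies produced by the wrong search-path suffixes.
	if !dns.QR || len(dns.Answers) == 0 {
		return nil
	}
	netLayer := packet.NetworkLayer()
	if netLayer == nil {
		return nil
	}
	// The reply travels back to the pod that asked, so its destination IP
	// identifies the caller.
	caller := netLayer.NetworkFlow().Dst().String()

	var pairs []connectionPair
	for _, answer := range dns.Answers {
		if answer.Type == layers.DNSTypeA && answer.IP != nil {
			pairs = append(pairs, connectionPair{CallerIP: caller, CalleeIP: answer.IP.String()})
		}
	}
	return pairs
}

The same reduction can also be pushed into the BPF filter itself (e.g. udp src port 53), so query packets never even leave the kernel.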
Completing the picture
DNS replies, of course, will only be generated when a connection is initiated. What about long-lived connections in the cluster that were initiated before the tool was turned on? To deal with those, we took an approach similar to the well-known netstat tool. In addition to capturing DNS traffic, we parse the files representing existing connections in /proc on each node, which provides us with the IP addresses of the endpoints participating in a connection. There are other means for finding all open connections and resolving them to the relevant pods (such as with eBPF), but we preferred a method which users can easily reason about. Many people know what netstat outputs – not many understand eBPF well.
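As a rough illustration of that netstat-like approach, here is a standalone Go sketch, again not the mapper’s actual code, that lists established IPv4 connections by parsing /proc/net/tcp. A real implementation also has to account for each pod having its own network namespace (for example by reading /proc/<pid>/net/tcp for pod processes), which this sketch glosses over:

package main

import (
	"bufio"
	"encoding/hex"
	"fmt"
	"net"
	"os"
	"strconv"
	"strings"
)

// tcpEstablished is the value of the "st" column for ESTABLISHED connections.
const tcpEstablished = "01"

// parseHexAddr turns an address like "0100007F:1F90" (little-endian hex IP
// plus hex port) into "127.0.0.1:8080".
func parseHexAddr(s string) (string, error) {
	parts := strings.Split(s, ":")
	if len(parts) != 2 {
		return "", fmt.Errorf("unexpected address %q", s)
	}
	raw, err := hex.DecodeString(parts[0])
	if err != nil || len(raw) != 4 {
		return "", fmt.Errorf("unexpected IPv4 address %q", parts[0])
	}
	port, err := strconv.ParseUint(parts[1], 16, 16)
	if err != nil {
		return "", err
	}
	// The kernel writes the address in host byte order, so reverse the bytes.
	ip := net.IPv4(raw[3], raw[2], raw[1], raw[0])
	return fmt.Sprintf("%s:%d", ip, port), nil
}

func main() {
	f, err := os.Open("/proc/net/tcp") // IPv4 only; IPv6 lives in /proc/net/tcp6
	if err != nil {
		panic(err)
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	scanner.Scan() // skip the header line
	for scanner.Scan() {
		fields := strings.Fields(scanner.Text())
		// Columns: sl, local_address, rem_address, st, ...
		if len(fields) < 4 || fields[3] != tcpEstablished {
			continue
		}
		local, err1 := parseHexAddr(fields[1])
		remote, err2 := parseHexAddr(fields[2])
		if err1 != nil || err2 != nil {
			continue
		}
		fmt.Printf("%s -> %s\n", local, remote)
	}
}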
Wrapping it up
After multiple iterations, research sessions and some trial & error, we could produce an exportable list of network connections in any Kubernetes cluster. You might recall that our larger goal was to get to a logical (functional) map of pod-to-pod traffic, and that will be covered in a future posting. After adding that capability, here’s an example output from our project, now called network-mapper, when pointed at one of the clusters in our “lab” environment:
cartservice in namespace otterize-ecom-demo calls:
  - redis-cart
checkoutservice in namespace otterize-ecom-demo calls:
  - cartservice
  - currencyservice
  - emailservice
  - paymentservice
  - productcatalogservice
  - shippingservice
frontend in namespace otterize-ecom-demo calls:
  - adservice
  - cartservice
  - checkoutservice
  - currencyservice
  - productcatalogservice
  - recommendationservice
  - shippingservice
loadgenerator in namespace otterize-ecom-demo calls:
  - frontend
recommendationservice in namespace otterize-ecom-demo calls:
  - productcatalogservice
Our solution has low cluster impact, no external dependencies, should work in any Kubernetes flavor, and is quite minimal, making it easy to plug in wherever a pod-to-pod communication map is needed. We felt we successfully achieved our goals without violating any of the principles and constraints we set at the beginning of our journey. After more polish and bug fixes, we were ready to release it to the OSS community.
Please take a look; we’d love your feedback and input. Of course, pull requests are always appreciated ;-)