Yann Soubeyrand for Camptocamp Infrastructure Solutions


Investigating a timeout issue in an Istio service mesh

Last week, my colleague Marc and I faced a timeout issue in an Istio service mesh. An idle PostgreSQL connection was shut down precisely one hour after it had been opened. During our investigations, I had to capture the network traffic entering and leaving the PostgreSQL client pod.

Ksniff

For this, I used Ksniff. This Kubernetes plugin tries to deploy a statically compiled tcpdump binary inside the pod whose traffic you want to capture, and then streams the captured packets to a Wireshark instance running on your workstation. This plugin is just awesome, thanks a lot to its author Eldad Rudich!

Envoy traffic interception

Istio works by injecting an Envoy proxy sidecar container into every pod, which intercepts inbound and outbound network traffic. So when a process inside your container communicates with an external server, there are in fact two TCP connections: one between the process and Envoy, and one between Envoy and the distant server. However, in Wireshark, I saw only the packets between Envoy and the distant server and the packets from Envoy to the process, but not the packets from the process to Envoy.

In the Wireshark screenshot below, 10.0.9.225 is the IP address of the process, 172.20.94.219 is the IP address of the virtual service the process communicates with, and 10.0.5.234 is the IP address of the distant (real) server backing the virtual service.

Wireshark screenshot

I was a bit surprised 🤔 So I looked into how Istio diverts the network traffic through Envoy. It does so by adding iptables REDIRECT rules that send the traffic to port 15001 on localhost, which Envoy is listening on. Indeed, I could see it in Wireshark:

Wireshark screenshot

But then, how can Envoy know where to forward the traffic if all the traffic it sees entering is destined for 127.0.0.1:15001? Everything I knew about IP networking was called into question! However, after some time spent looking for the answer, I finally found it!

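When a connection hits an iptables REDIRECT rule, the kernel's connection tracking keeps a record of the original destination. The process that accepts the redirected connection can retrieve it by calling getsockopt on the socket with an option named SO_ORIGINAL_DST. Envoy just has to read this option to decide where to forward the connection.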
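
To make this more concrete, here is a minimal Python sketch of how a process could read that option on an accepted socket. This is not Envoy's code, just an illustration under a few assumptions: it is Linux and IPv4 only, the function names are made up for this example, and the constant 80 is the value of SO_ORIGINAL_DST from the Linux netfilter headers, which Python's socket module does not export.

```python
import socket
import struct
from typing import Tuple

# Value of SO_ORIGINAL_DST in <linux/netfilter_ipv4.h>. Python's socket
# module does not define it, so it is hard-coded here (assumption: Linux).
SO_ORIGINAL_DST = 80


def parse_sockaddr_in(raw: bytes) -> Tuple[str, int]:
    """Decode a 16-byte struct sockaddr_in into an (ip, port) pair."""
    # Layout: 2 bytes family (skipped), 2 bytes port in network order,
    # 4 bytes IPv4 address, 8 bytes of padding (skipped).
    port, packed_ip = struct.unpack("!2xH4s8x", raw[:16])
    return socket.inet_ntoa(packed_ip), port


def original_destination(conn: socket.socket) -> Tuple[str, int]:
    """Return the destination a redirected connection was first aimed at.

    Hypothetical helper: only meaningful on a socket accepted after an
    iptables REDIRECT rule matched the connection.
    """
    raw = conn.getsockopt(socket.SOL_IP, SO_ORIGINAL_DST, 16)
    return parse_sockaddr_in(raw)
```

A proxy listening on the redirect port (15001 in our case) would call something like original_destination(conn) right after accept() and then open its own connection to the returned address. IPv6 traffic would need a different option and address layout, which this sketch does not cover.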

Conclusion

We've seen how network traffic is redirected in an Istio service mesh using iptables REDIRECT rules and the SO_ORIGINAL_DST socket option.

Going back to our original issue, the investigations confirmed that the culprit was Envoy's idle timeout. From what I understand, it should be possible to configure it, but I haven't figured out how to do so yet.
