Post-mortem: Kubernetes pods don't start because of too many services

#kubernetes #sre

A few weeks ago, my team ran into an interesting limit in Kubernetes: having too many services in a namespace caused an unexpected problem in our development environment.

What we saw first

We first noticed that our REST API was unavailable, and quickly found that its pods were not running. As we started investigating, we saw the following:

All of the pods for our REST API deployment were in Init:CrashLoopBackOff.
Some pods in our compute deployment had the same issue, but some of those pods were working fine.
The pods that were working fine tended to be over a day old, so we suspected they would hit the same issue if they restarted.

We started by investigating whether there was a bug in the init container, but that turned out not to be the problem. Looking at the logs for the init container, the only line logged was:

standard_init_linux.go:228: exec user process caused: argument list too long

After we escalated to our Kubernetes team, they explained that this error means the container's process could not be started because the arguments and environment passed to it were too large. In our case, too many environment variables were being passed to the container.

Who put all these environment variables in my pod?

Our Kubernetes experts told us about enableServiceLinks, a setting on the pod spec that, when enabled, adds environment variables to the pod for each active service. From the documentation:

When a Pod runs on a Node, the kubelet adds a set of environment variables for each active Service.

Note: If the service environment variables are not desired (because possible clashing with expected program ones, too many variables to process, only using DNS, etc) you can disable this mode by setting the enableServiceLinks flag to false on the pod spec.

To verify this was the cause, we started up a pod with enableServiceLinks: false and it started fine. After removing that line, we hit the same error we had above.
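
For reference, here is a minimal sketch of where the setting lives. It builds a pod manifest as a Python dictionary and prints it as JSON, which kubectl accepts on standard input. The pod name, image, and command are placeholders for illustration, not values from our environment.

```python
import json


def pod_manifest(name: str, image: str, enable_service_links: bool = False) -> dict:
    """Build a minimal Pod manifest with enableServiceLinks set explicitly."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # Pod-level field; Kubernetes treats it as true when it is omitted.
            "enableServiceLinks": enable_service_links,
            "containers": [
                {
                    "name": name,
                    "image": image,
                    # Keep the container alive long enough to inspect it.
                    "command": ["sleep", "3600"],
                }
            ],
        },
    }


if __name__ == "__main__":
    # Placeholder name and image, for illustration only.
    manifest = pod_manifest("service-links-test", "busybox:1.36")
    print(json.dumps(manifest, indent=2))
```

If this were saved as pod_manifest.py (a hypothetical file name), the output could be piped to kubectl apply -f - to create the test pod.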

You can also check this yourself. In a working pod, list the service environment variables that were injected:

kubectl exec -ti my-pod -- sh -c "env | grep SERVICE | sort"
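
To get a feel for how quickly these variables add up, the sketch below reproduces the naming scheme for a service with a single unnamed port, following the example in the Kubernetes Service documentation. It is an approximation rather than the kubelet's actual code: services with several ports or named ports produce additional variables, and the service name and IP used here are examples only.

```python
def service_env_vars(service_name: str, cluster_ip: str, port: int,
                     protocol: str = "TCP") -> dict:
    """Approximate the variables injected for a single-port service."""
    # Service names are upper-cased and dashes become underscores.
    prefix = service_name.upper().replace("-", "_")
    proto = protocol.lower()
    url = f"{proto}://{cluster_ip}:{port}"
    port_prefix = f"{prefix}_PORT_{port}_{protocol.upper()}"
    return {
        f"{prefix}_SERVICE_HOST": cluster_ip,
        f"{prefix}_SERVICE_PORT": str(port),
        f"{prefix}_PORT": url,
        port_prefix: url,
        f"{port_prefix}_PROTO": proto,
        f"{port_prefix}_PORT": str(port),
        f"{port_prefix}_ADDR": cluster_ip,
    }


def env_block_size(env: dict) -> int:
    """Bytes needed to pass env to a process as NUL-terminated NAME=value strings."""
    return sum(len(name) + 1 + len(value) + 1 for name, value in env.items())


if __name__ == "__main__":
    # Example values only.
    env = service_env_vars("redis-primary", "10.0.0.11", 6379)
    for name in sorted(env):
        print(f"{name}={env[name]}")
    print(len(env), "variables,", env_block_size(env), "bytes")
```

Under these assumptions, every single-port service adds seven variables to every pod in the namespace, so orphaned services grow each pod's environment even if nothing ever talks to them.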

Where did all these services come from?

So, we had too many environment variables because we had too many services in our namespace. Why did we have so many services?

The underlying cause was a process in our system that dynamically creates deployments and services. It had a bug: when it cleaned up the deployments, it did not clean up the services that went with them. Over many weeks, the orphaned services piled up until we finally hit the limit that caused pods to fail to start.
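
A check for orphaned services could look roughly like the sketch below. It rests on an assumption that is not true of every setup and is not described above: that each dynamically created service has the same name as its deployment. The function name and the example names are hypothetical, and the two name lists would have to come from the cluster, for example from kubectl get services and kubectl get deployments.

```python
def find_orphaned_services(service_names, deployment_names,
                           protected=("kubernetes",)):
    """Return services that have no deployment with the same name.

    Assumes a one-to-one naming convention between services and
    deployments; names in `protected` are never reported.
    """
    live = set(deployment_names)
    return sorted(
        name for name in set(service_names)
        if name not in live and name not in protected
    )


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    services = ["job-1", "job-2", "job-3", "rest-api"]
    deployments = ["job-3", "rest-api"]
    print(find_orphaned_services(services, deployments))
```

Reporting the orphans before deleting anything seems the safer default, since a service without a same-named deployment may still be in use under a different convention.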

Takeaways

This is my favorite kind of incident because:

We hit this in development before it hit production.
We learned about a deep part of our infrastructure we would not have known about otherwise.

So what did we learn and plan to do next?

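If enableServiceLinks is enabled (which is the default), Kubernetes will populate environment variables in your pod for each active service, so the pod's environment grows with the number of services in the namespace.
To catch this sooner: We can monitor the number of services in the namespace so that we notice before it reaches the point where pods fail to start.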
To mitigate faster: Write a runbook explaining the problem and how we solved it in this instance.
To prevent this again: Alert when we approach the threshold, or if we see the "argument list too long" log message in our containers again.
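
The monitoring and alerting items could start from something as small as the sketch below. The limit is an assumption that has to be measured for a given cluster, since it depends on how large the injected environment becomes per service. The numbers in the example are made up, and the 80% warning ratio is an arbitrary starting point.

```python
def service_count_status(count: int, limit: int, warn_ratio: float = 0.8) -> str:
    """Classify a namespace's service count against an assumed limit."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    if count >= limit:
        return "critical"
    if count >= limit * warn_ratio:
        return "warning"
    return "ok"


if __name__ == "__main__":
    # Made-up numbers: the real limit must be determined per cluster.
    for count in (100, 800, 1000):
        print(count, service_count_status(count, limit=1000))
```

The count itself could come from kubectl get services -n <namespace> --no-headers | wc -l, run on a schedule and fed into whatever alerting system is already in place.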