I’m drawn to the puzzles of distributed systems and abstracted platforms – the problems that only crop up when lots of moving pieces work in tandem (or not!).
I recently encountered one of these issues a few weeks after a platform migration to AWS EKS.
The initial problem manifested itself as an application async worker issue.
- Async process queues began stacking up and triggering alerts.
- Investigation of the worker process revealed that:
- Workers reported healthy
- Workers seemed to be processing the maximum number of threads per worker
- Workers were using minimal compute resources
- Some of the queue was getting processed
- Re-deploying the async worker Kubernetes (k8s) pods resolved the immediate problem and the queues started draining again.
Our core app reported a small number of failed database requests at the same time that queues started stacking. This pointed us at the network and our DB connection pooler, pgbouncer, both of which looked fine. However , a couple of pgbouncer k8s pods had migrated to different k8s nodes a few minutes before we saw the queue issue.
This got us looking at node autoscaling. The node the migrated pgbouncer pods were running on had been autoscaled down, forcing their restart on another node. This is expected behavior. It was, however, unexpected that our async workers’ connections to pgbouncer wouldn’t time out and attempt a re-connect.
The async worker threads were holding open connections that would never complete or fail, stalling them out and preventing them from processing new queue items.
Wanting to lean into k8s’ transience and statelessness we approached the problem from a few angles:
- Liveness probes w/ a DB connection health check – We already had these configured for pgbouncer and most other components but not for our async workers. Thinking through this option, we suspected there would be issues with false negatives so decided to put it on the back burner.
- Async worker DB connection time outs – This had global app ramifications as it required reconfiguration of the pg gem or Rails ActiveRecord, both of which felt like bad options and turned out to actually be pretty gnarly when tested.
- Configure a k8s container lifecycle hook for pgbouncer. This was already in place but didn’t appear to be working consistently.
- Set up a dedicated node pool just for pgbouncer and disable autoscaling. This seems to be what most people running pgbouncer in k8s are doing, but it felt bad philosophically, so we set it aside as a last resort.
Most effort focused on the lifecycle hook option. Making sure pgbouncer received a SIGINT instead of a SIGTERM let it close out running connections safely and reject new connections. This looked like it was going to solve the problem.
It did not.
This issue plagued us for a few weeks while we worked on higher priority items and performed research / testing. The problem was consistently tied to autoscaling down and happened at roughly the same time, but didn’t occur every day or every time a pgbouncer-hosting node scaled down.
Having run out of attractive options, we built out a dedicated node pool for pgbouncer and began testing it in QA. Again, this felt bad – adding a static component to a dynamic architecture.
Prior to deploying to production, we had another queue backup.
We looked at the pgbouncer logs during the last autoscaling event and noticed that neither SIGINT or SIGTERM were getting called via the lifecycle preStop hook for the container. Then we looked at how the node was getting autoscaled and compared to an event where SIGINT was issued (and the preStop hook did trigger).
When the k8s cluster autoscaler was responsible for autoscaling a node down, SIGINT was sent and pgbouncer shut down gracefully. When AWS autoscaling performed a rebalance (making sure an equal number of instances is running in each availability zone), neither SIGINT or SIGTERM were sent to the pgbouncer pod and it died gracelessly.
This explained why the issue had been inconsistent – it only happened after the k8s cluster autoscaler scaled down and then the AWS autoscaler performed a rebalance across availability zones and just happened to pick a node with pgbouncer on it.
Turns out, this is a known, if lightly documented issue. Spelunking in the Kubernetes Autoscaler docs revealed:
Cluster autoscaler does not support Auto Scaling Groups which span multiple Availability Zones; instead you should use an Auto Scaling Group for each Availability Zone and enable the –balance-similar-node-groups feature. If you do use a single Auto Scaling Group that spans multiple Availability Zones you will find that AWS unexpectedly terminates nodes without them being drained because of the rebalancing feature.
Which was our exact scenario. Derp.
The workarounds people are using:
- The thing in the docs – creating separate AWS autoscaling groups in each AZ and letting the k8s cluster autoscaler handle balancing with the –balance-similar-node-groups flag . This is kind of ugly and introduces more scaling and load-balancer complexity on the AWS side.
- Creating a node drain Lambda tied to an AWS autoscaling lifecycle hook that pauses the termination and issues drain commands on the node that’s being scaled down to let pods migrate gracefully. There is almost zero documentation for this solution. An example was added to the AWS-Samples repo about a week prior to us discovering the root cause.
It’s pretty simple, at least in hindsight.
There is much about Kubernetes that feels automagical, especially when you’re relatively new to it. Some of that is by design, it’s meant to abstract away a lot of the pain/details of the underlying infrastructure. In this case, I trusted too much in the automagicalness of the cluster autoscaler and assumed it was more tightly integrated into AWS than it currently is.
This is not a gripe, more a testament to how impressive the Kubernetes ecosystem is overall. It generally is automagical, so it’s easy to make assumptions. This is just an edge case the automagic hasn’t reached yet.