<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Hashfyre</title>
    <description>The latest articles on DEV Community by Hashfyre (@hashfyre).</description>
    <link>https://dev.to/hashfyre</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F27219%2F5716ec08-7c38-485d-8246-5c074906fb0b.png</url>
      <title>DEV Community: Hashfyre</title>
      <link>https://dev.to/hashfyre</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/hashfyre"/>
    <language>en</language>
    <item>
      <title>Root Cause Chronicles: Quivering Queue</title>
      <dc:creator>Hashfyre</dc:creator>
      <pubDate>Tue, 16 Jan 2024 07:36:33 +0000</pubDate>
      <link>https://dev.to/infracloud/root-cause-chronicles-quivering-queue-2jc1</link>
      <guid>https://dev.to/infracloud/root-cause-chronicles-quivering-queue-2jc1</guid>
      <description>&lt;p&gt;River was buried with work. It was time for the Quarterly Earnings call, and they seemed to be getting nowhere. Additional piles of work were being dumped on their desk every day. Business teams did not follow scrum or sprints like the tech folks  on the second floor. You had to do whatever was assigned to you, and yesterday.&lt;/p&gt;

&lt;p&gt;They had to clear all the pending orders for Q3 that somehow got stuck in the system due to glitches or downtimes, and prepare their quarterly summary of dispatch throughput once that was done. River also had to write their self-assessment and performance reviews for all three interns on the team.&lt;/p&gt;

&lt;p&gt;River had already been reminded thrice that day by their senior to clear the orders backlog. But River kept getting drawn into meetings and paperwork every time they attempted to sit down with it. &lt;/p&gt;

&lt;p&gt;Carter from the second floor had given River some sort of script to finish up the &lt;strong&gt;pending orders that were not yet dispatched to paying customers&lt;/strong&gt;, but had asked River to monitor its progress. &lt;em&gt;“It’s an ad hoc script I’m cobbling together to get this done fast. This may cause issues down the line. Keep checking on it and ping me if something goes wrong”&lt;/em&gt;, they had said. Fingers crossed, River pressed the &lt;strong&gt;Run&lt;/strong&gt; button on the pipeline for this script and waited.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;“Hey River, care to come in for a minute?”&lt;/em&gt; their senior beckoned, pointing at the nearest meeting room, “We need to get on top of your performance review.”   &lt;/p&gt;

&lt;p&gt;“Hey, sure…”, River hesitated, “But I think I should monitor this script I just ran. You have been asking me to clear up that backlog of orders for a while; Carter gave me this script to get it resolved.” &lt;/p&gt;

&lt;p&gt;“It’s automated, right?” the senior asked. &lt;/p&gt;

&lt;p&gt;“Yeah, sure. But Carter asked me to inform them if anything goes wrong!” &lt;/p&gt;

&lt;p&gt;“It’ll be alright, Carter knows their job. You don’t need to wait on it. Just come in, and we can finish this conversation. My superiors are breathing down my neck too.”&lt;/p&gt;

&lt;h2&gt;
  
  
  30 minutes later…
&lt;/h2&gt;

&lt;p&gt;“Hey, Robin!” someone tapped on their shoulder.  &lt;/p&gt;

&lt;p&gt;Robin was neck-deep trying to finish writing a new &lt;a href="https://www.terraform.io/"&gt;Terraform&lt;/a&gt; module and replied without looking up, “What’s gone wrong?”  &lt;/p&gt;

&lt;p&gt;&lt;em&gt;“&lt;strong&gt;Payments are failing&lt;/strong&gt;, customers are unable to buy anything, it just gets stuck on the web page.”&lt;/em&gt;  &lt;/p&gt;

&lt;p&gt;Robin switched to the &lt;a href="https://grafana.com/"&gt;Grafana&lt;/a&gt; dashboard tab, and sure enough, the 5xx volume on web service was rising. It had not hit the critical alert thresholds yet, but customers had already started noticing.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--IjsnU6yD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ipo0pvkeqi0rgcss29ch.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--IjsnU6yD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ipo0pvkeqi0rgcss29ch.png" alt="Web service showing ~15% 5xx responses" width="800" height="307"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;center&gt;(Web service showing ~15% 5xx responses)&lt;/center&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--KxwEZ7sZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/yps1k9f220vnr8s31ol6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--KxwEZ7sZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/yps1k9f220vnr8s31ol6.png" alt="Payment service showing ~18% 5xx responses" width="800" height="307"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;center&gt;(Payment service showing ~18% 5xx responses)&lt;/center&gt;

&lt;h2&gt;
  
  
  Another day at Robot-Shop
&lt;/h2&gt;

&lt;p&gt;River, Carter, and Robin all work for Robot-Shop Inc. They run a sufficiently complex &lt;a href="https://cloud.google.com/learn/what-is-cloud-native"&gt;cloud native architecture&lt;/a&gt; to address the needs of their million-plus customers. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The traffic from load-balancer is routed via a gateway service optimized for traffic ingestion, called &lt;strong&gt;Web&lt;/strong&gt;, which distributes the traffic across various other services.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;User&lt;/strong&gt; handles user registrations and sessions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Catalogue&lt;/strong&gt; maintains the inventory in a &lt;strong&gt;&lt;a href="https://www.mongodb.com/"&gt;MongoDB&lt;/a&gt; datastore&lt;/strong&gt;. &lt;/li&gt;
&lt;li&gt;Customers can see the ratings of available products via the &lt;strong&gt;Ratings&lt;/strong&gt; service APIs.&lt;/li&gt;
&lt;li&gt;They choose products they like and add them to the &lt;strong&gt;Cart&lt;/strong&gt;, a service backed by &lt;strong&gt;&lt;a href="https://redis.io/"&gt;Redis&lt;/a&gt; cache&lt;/strong&gt; to temporarily hold the customer’s choices. &lt;/li&gt;
&lt;li&gt;Once the customer pays via the &lt;strong&gt;Payment&lt;/strong&gt; service, the purchased items are published as &lt;strong&gt;orders&lt;/strong&gt; to a &lt;strong&gt;&lt;a href="https://www.rabbitmq.com/"&gt;RabbitMQ&lt;/a&gt;&lt;/strong&gt; channel.&lt;/li&gt;
&lt;li&gt;These &lt;strong&gt;orders&lt;/strong&gt; are consumed by the &lt;strong&gt;Dispatch&lt;/strong&gt; service and prepared for shipping. &lt;strong&gt;Shipping&lt;/strong&gt; uses &lt;strong&gt;&lt;a href="https://www.mysql.com/"&gt;MySQL&lt;/a&gt;&lt;/strong&gt; as its datastore, as does &lt;strong&gt;Ratings&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;
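
&lt;p&gt;The Payment-to-Dispatch handoff above can be sketched as a toy in-process producer/consumer model. This is only an illustration: Python's queue module stands in for RabbitMQ, and the queue name, order shape, and counts here are hypothetical, not taken from the real stack.&lt;/p&gt;

```python
import queue
import threading

orders = queue.Queue()  # stand-in for the RabbitMQ "orders" channel

def payment_publish(order_id):
    # Payment service: after a successful charge, publish the order.
    orders.put({"order_id": order_id, "status": "paid"})

def dispatch_consume(results):
    # Dispatch service: consume orders and prepare them for shipping.
    while True:
        order = orders.get()
        if order is None:  # sentinel: no more orders
            break
        order["status"] = "dispatched"
        results.append(order)

shipped = []
consumer = threading.Thread(target=dispatch_consume, args=(shipped,))
consumer.start()
for i in range(5):
    payment_publish(i)
orders.put(None)
consumer.join()
print(len(shipped))  # prints 5
```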

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--iwPSMXbp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/k33g0omvp1s7frrc70cq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--iwPSMXbp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/k33g0omvp1s7frrc70cq.png" alt="High-Level Architecture of Robot-shop Application Stack" width="800" height="374"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;center&gt;(High-Level Architecture of Robot-Shop Application Stack)&lt;/center&gt;

&lt;p&gt;Only now, payment failures were rising at an alarming rate. &lt;em&gt;“&lt;strong&gt;OK, let’s look at payment logs.&lt;/strong&gt;”&lt;/em&gt; Robin clicked on the attached Grafana dashboard for Payments logs from &lt;a href="https://grafana.com/oss/loki/"&gt;Loki&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--CvD4RK1j--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/30xrrobufpcxwcqa3z5l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--CvD4RK1j--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/30xrrobufpcxwcqa3z5l.png" alt="Payment service logs showing AMQP timeout handshake with rabbitmq-cluster" width="800" height="185"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;“&lt;strong&gt;Payments can’t connect to RabbitMQ! Is the cluster down?&lt;/strong&gt;”&lt;/em&gt; This could be bad. Robin was distraught. If the RabbitMQ cluster was entirely down, it would be hard to recover any messages that were in transit and in-memory. The only way of recovery would be to replay failure logs from the Payments service later. But RabbitMQ was not down, Robin would have received immediate alerts if it were so. &lt;em&gt;“&lt;strong&gt;Let’s check the queue health&lt;/strong&gt;.”&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--pv_7odLZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/hgxzlxyxzdz8joeacail.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--pv_7odLZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/hgxzlxyxzdz8joeacail.png" alt="RabbitMQ dashboard showing ~154 messages per second and 266 active connections" width="800" height="399"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;“&lt;strong&gt;Oh, wow! Seems the number of messages has gone up drastically in the last 30 minutes&lt;/strong&gt;. Are we really processing a record number of payments at the end of a quarter?”&lt;/em&gt; Robin thought. “Close to 300-350 orders/second. Let me correlate that with the Payments service request count.” Robin jumped around the dashboards with a few keystrokes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--YhOmdoZh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6yzh7eqoufkt41snro2b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--YhOmdoZh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6yzh7eqoufkt41snro2b.png" alt="Payment service dashboard showing extremely low number of requests per second" width="800" height="372"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;“&lt;strong&gt;This does not compute. It can’t be payments!&lt;/strong&gt; The service is only processing 1-2 payments/second!”&lt;/em&gt; Robin exclaimed. &lt;/p&gt;

&lt;p&gt;&lt;em&gt;“&lt;strong&gt;Check the publisher count of the queue against the number of payment pods in the deployment!&lt;/strong&gt;”&lt;/em&gt; Blake was there, as were a few other folks around her seat. Robin never felt too comfortable when people looked on as they typed. It was nerve-wracking.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--1kvrfGGj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5bzfwy04uz4kq7wuifju.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--1kvrfGGj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5bzfwy04uz4kq7wuifju.png" alt="RabbitMQ dashboard showing 400+ publishers" width="800" height="399"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;“&lt;strong&gt;Whoa, why do we have ~465 publishers?&lt;/strong&gt;”&lt;/em&gt; Blake was right as usual. “Even with auto-scaling enabled, payment doesn’t usually scale beyond 20-30 pods, and right now, we only have 3 of them, the minimum count.” &lt;/p&gt;

&lt;p&gt;This was expected: customers never shopped much at the end of the quarter, with tax filing around the corner. Overall system load was expected to be at an all-time low, as per seasonal traffic analysis data.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;“&lt;strong&gt;Who, or what, is publishing orders to the queue? Also, how is our consumer service, Dispatch doing?&lt;/strong&gt;”&lt;/em&gt; Blake was curious now; this could escalate very fast if they could not find a solution soon. Blake pulled up a chair and sat down, looking at the graphs Robin had pulled up.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--OVb_1b5g--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/uoxul8ewtntv7u1q8y98.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--OVb_1b5g--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/uoxul8ewtntv7u1q8y98.png" alt="RabbitMQ dashboard showing message confirmations sent back to publisher from the consumers" width="800" height="748"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;“Consumers look alright! But I wish we had instrumented our apps with &lt;a href="https://opentelemetry.io/"&gt;OpenTelemetry&lt;/a&gt;.”&lt;/em&gt; Blake sighed at the onlooking Backend folks. &lt;em&gt;“Right now, we know we have too many publishers, but we don’t know who they are! We need to complete our Tracing sprint, and soon.”&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vl4yv52z55rpvpm4lzwo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vl4yv52z55rpvpm4lzwo.png" alt="Detailed Kubernetes native architecture of Robot-shop" width="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;center&gt;(Detailed Kubernetes native architecture of Robot-Shop)&lt;/center&gt;

&lt;p&gt;“We have Kiali. It should show which services other than Payments and Dispatch are connected to RabbitMQ,” Robin said.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--K0eyNahA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wwnfjnsqfi10xskvy1jl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--K0eyNahA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wwnfjnsqfi10xskvy1jl.png" alt="Kiali dashboard showing workload pending-orders-recreation sending heavy message load to RabbitMQ cluster" width="800" height="299"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Robin fiddled around with namespace filters until they zoomed in on something. &lt;/p&gt;

&lt;p&gt;&lt;em&gt;“It seems some workload called pending-orders-recreation is connecting to RabbitMQ from a namespace called &lt;strong&gt;pending-orders&lt;/strong&gt;.”&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;“Let's see what’s running there. I do not know this namespace.”&lt;/p&gt;

&lt;p&gt;And lo, around 400 pods with the prefix &lt;strong&gt;&lt;em&gt;pending-orders-recreation-xxx&lt;/em&gt;&lt;/strong&gt; were running there.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--rox4Izco--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mv621yne2lq30wlm1qku.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--rox4Izco--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mv621yne2lq30wlm1qku.png" alt="K9S Dashboard showing 400+ pods of pending-orders-recreation in the pending-orders namespace" width="800" height="335"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;“Who could be running this?” Robin pinged the #engineering channel on Slack:&lt;br&gt;&lt;br&gt;
Hey Team, has anyone deployed a new workload to a namespace called &lt;strong&gt;&lt;em&gt;pending-orders&lt;/em&gt;&lt;/strong&gt; in production?&lt;/p&gt;
&lt;h2&gt;
  
  
  The root cause of all things
&lt;/h2&gt;

&lt;p&gt;Soon, Blake and Robin heard someone rushing up to them. It was Carter: &lt;em&gt;“Hey, that was me. That workload you pinged about just now. What’s up with it?”&lt;/em&gt;   &lt;/p&gt;

&lt;p&gt;&lt;em&gt;“It seems to have taken down our live payments system, what are you running there?”&lt;/em&gt; Blake asked.   &lt;/p&gt;

&lt;p&gt;&lt;em&gt;“Well, it’s the end of the quarter, and the Biz team wanted to automate reconciliation of all pending orders for the last three months that were never dispatched due to glitches or errors. Remember the &lt;a href="https://dev.to/blogs/root-cause-chronicles-connection-collapse/"&gt;downtime we had last month&lt;/a&gt; on the User service, and a bunch of payments went bad? We are recreating those from scratch now.”&lt;/em&gt;  &lt;/p&gt;

&lt;p&gt;&lt;em&gt;“All of them? How many are we talking about?”&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;“Ah, close to half a million?”&lt;/em&gt; Carter was sweating, &lt;em&gt;“They need to complete by today, it’s the end of the quarter. Otherwise, our numbers will look all wrong. How do we fix this now?”&lt;/em&gt;  &lt;/p&gt;
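&lt;p&gt;A quick back-of-envelope drain-time estimate shows the scale of the problem. Only the roughly half-million backlog and the 3-pod Dispatch replica count come from the story; the per-pod consumption rate below is a hypothetical figure, since the article never states Dispatch's throughput.&lt;/p&gt;

```python
backlog = 500_000    # pending orders Carter mentions (~half a million)
per_pod_rate = 50    # msgs/sec per Dispatch pod -- hypothetical, not stated in the article

def drain_hours(backlog, pods, per_pod_rate):
    # Hours to drain the backlog, ignoring live traffic and re-queues.
    return backlog / (pods * per_pod_rate) / 3600

print(round(drain_hours(backlog, 3, per_pod_rate), 2))   # ~0.93 h at the current 3 pods
print(round(drain_hours(backlog, 10, per_pod_rate), 2))  # ~0.28 h at 10 pods
```

Even under generous assumptions, clearing the backlog before end of day hinges on consumer capacity, which is why the conversation turns to scaling Dispatch.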

&lt;p&gt;&lt;em&gt;“Can we stop the workload and let the queue recover?”&lt;/em&gt; Blake queried. &lt;em&gt;“Because right now the cluster is dying under the load. Most messages are getting re-queued.”&lt;/em&gt;  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--D7X362mJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/heumlid8y9nsi7sulfv2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--D7X362mJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/heumlid8y9nsi7sulfv2.png" alt="RabbitMQ dashboard showing unroutable messages being returned back to publisher as queue length is full" width="800" height="272"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;center&gt;(Unroutable messages returned to publishers)&lt;/center&gt;

&lt;p&gt;&lt;em&gt;“Can’t we just scale up dispatch to consume the orders faster?”&lt;/em&gt; Carter was not in favor of their deadline slipping. &lt;/p&gt;

&lt;p&gt;&lt;em&gt;“We can deploy a quick event-driven autoscaler with KEDA. We should have had it in the first place, but this subsystem had never been required to deal with scale spikes before.”&lt;/em&gt; &lt;/p&gt;

&lt;p&gt;Thankfully, the &lt;a href="https://keda.sh/"&gt;KEDA&lt;/a&gt; operator was already part of the cluster, and all Robin had to do was create a ScaledObject manifest targeting the dispatch Deployment, scaling on the rabbitmq_global_messages_received_total metric from &lt;a href="https://prometheus.io/"&gt;Prometheus&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;keda.sh/v1alpha1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ScaledObject&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dispatch-scale-up&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;robot-shop&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;scaleTargetRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deployment&lt;/span&gt;                                       
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dispatch&lt;/span&gt;                                         
  &lt;span class="na"&gt;pollingInterval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  &lt;span class="s"&gt;10&lt;/span&gt;                                     
  &lt;span class="na"&gt;cooldownPeriod&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;   &lt;span class="s"&gt;300&lt;/span&gt;                                    
  &lt;span class="na"&gt;minReplicaCount&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;3&lt;/span&gt;                                       
  &lt;span class="na"&gt;maxReplicaCount&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;10&lt;/span&gt;                                      
  &lt;span class="na"&gt;advanced&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;                                                
    &lt;span class="na"&gt;restoreToOriginalReplicaCount&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="na"&gt;horizontalPodAutoscalerConfig&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;                   
      &lt;span class="na"&gt;behavior&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;                                     
        &lt;span class="na"&gt;scaleUp&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;stabilizationWindowSeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;
          &lt;span class="na"&gt;policies&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Pods&lt;/span&gt;
            &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
            &lt;span class="na"&gt;periodSeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
  &lt;span class="na"&gt;triggers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;prometheus&lt;/span&gt;
    &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;serverAddress&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http://prometheus-stack-kube-prom-prometheus.monitoring.svc.cluster.local:9090&lt;/span&gt;
      &lt;span class="na"&gt;query&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
        &lt;span class="s"&gt;sum(rate(rabbitmq_global_messages_received_total[60s]))&lt;/span&gt;
      &lt;span class="na"&gt;threshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;10'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ExVMgwIC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vho4yijjqf0mokvcq6jc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ExVMgwIC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vho4yijjqf0mokvcq6jc.png" alt="Terminal screenshot showing Dispatch service scaling up thanks to Horizontal Pod Autoscaler configuration based on global_messages_received metric" width="800" height="239"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;“Dispatch has scaled up; it’ll keep adding pods as long as the rate of orders published to the queue stays above the threshold of 10 messages/second,”&lt;/em&gt; Robin said.   &lt;/p&gt;
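&lt;p&gt;Roughly, the Prometheus trigger maps the metric to a replica count as in this simplified sketch. The threshold of 10 and the min/max of 3/10 mirror the manifest above; the real KEDA/HPA loop also applies the stabilization window and scaleUp policy, which this sketch ignores.&lt;/p&gt;

```python
import math

def desired_replicas(metric_value, threshold=10, min_replicas=3, max_replicas=10):
    # AverageValue-style target: replicas grow with the metric divided by the
    # threshold, clamped to the manifest's minReplicaCount/maxReplicaCount.
    want = math.ceil(metric_value / threshold)
    return max(min_replicas, min(max_replicas, want))

print(desired_replicas(5))    # 3  -- below threshold, stay at minReplicaCount
print(desired_replicas(65))   # 7  -- ceil(65 / 10)
print(desired_replicas(350))  # 10 -- capped at maxReplicaCount
```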

&lt;p&gt;&lt;em&gt;“Let’s wait a few minutes and see if scaling the consumers resolves the issue,”&lt;/em&gt; Blake commented.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--3CkaJn0D--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ja70ax4xe0ida6dplkzi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--3CkaJn0D--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ja70ax4xe0ida6dplkzi.png" alt="Terminal screenshot showing rabbitmq pods getting OOMKilled due to load" width="800" height="114"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;center&gt;(RabbitMQ pods getting OOMKilled due to large number of connections)&lt;/center&gt;

&lt;p&gt;&lt;em&gt;“This doesn’t look like a fix.”&lt;/em&gt; Blake pointed at the memory consumption and connections per pod metric, &lt;em&gt;“See here, this is not about scaling up the consumers. RabbitMQ pods are unable to handle that many connections, given how little memory we have allocated to them. Can we use KEDA to scale up the RabbitMQ statefulset?”&lt;/em&gt;  &lt;/p&gt;

&lt;p&gt;&lt;em&gt;“We are using RabbitMQ Operator, not sure if they allow autoscaling. They might wrap a HorizontalPodAutoscaler resource within the operator itself!”&lt;/em&gt; Robin shrugged. They started digging into the problem.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--XCaL5MU7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jbly9l8gassde0ouk2p7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--XCaL5MU7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jbly9l8gassde0ouk2p7.png" alt="Github Issue Screenshot showing maintainer of rabbitmq/cluster-operator commenting on why high-availability and autoscaling are out-of-scope of the operator's design." width="800" height="598"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/rabbitmq/cluster-operator/issues/800#issuecomment-897074779"&gt;(Add Support For Pods Auto Scaling)&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;&lt;em&gt;“Doesn’t look like it. The maintainers do not recommend auto-scaling RabbitMQ for legitimate reasons and cap the scale at 9 nodes or pods. Let’s just manually scale up for now. I’m going to scale them up vertically too after the new nodes are up, and add more memory per node.”&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;RMQ_CLUSTER_NS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;robot-shop
kubectl patch rabbitmqcluster rabbitmq-cluster &lt;span class="nt"&gt;--type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'json'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-p&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'[
        {"op": "replace", "path": "/spec/replicas", "value": 6}, 
        {"op": "replace", "path": "/spec/resources/limits/memory", "value": "400Mi"}, 
        {"op": "replace", "path": "/spec/resources/requests/memory", "value": "100Mi"}
        ]'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
     &lt;span class="nt"&gt;-n&lt;/span&gt; &lt;span class="nv"&gt;$RMQ_CLUSTER_NS&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
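
&lt;p&gt;As a rough sanity check on the new memory limit: each RabbitMQ connection carries bookkeeping overhead on top of the messages themselves. The ~100 kB per-connection figure below is a ballpark assumption (actual usage varies by workload and is dominated by message payloads), and the connection count comes from the dashboard earlier in the story.&lt;/p&gt;

```python
connections = 465   # publisher count observed on the RabbitMQ dashboard
per_conn_kb = 100   # rough per-connection footprint -- an assumption, varies by workload
limit_mb = 400      # new per-pod memory limit from the patch above

conn_mb = connections * per_conn_kb / 1024
print(round(conn_mb, 1))                   # ~45.4 MB just for connection state
print(round(conn_mb / limit_mb * 100, 1))  # ~11.4% of one pod's limit
```

Connection state alone now fits comfortably, but the earlier OOMKills show why the previous, smaller limit could not absorb hundreds of publishers plus their queued messages.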



&lt;p&gt;&lt;em&gt;“The new nodes will take a few seconds to start accepting connections. Also, we are not using mirrored or &lt;a href="https://www.rabbitmq.com/quorum-queues.html"&gt;quorum queues&lt;/a&gt;; this was thought to be a small workload, and we decided a simple direct queue would suffice. The cluster was not modeled for spiky loads such as today’s.”&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--4gX1Pg1t--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/cfl5mzzk0t67lauwe1np.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--4gX1Pg1t--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/cfl5mzzk0t67lauwe1np.png" alt="RabbitMQ Admin dashboard screenshot showingthe type of queue configured by the application as classic direct queue." width="800" height="59"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;“If we are going to run spiky workloads like this on the queueing infrastructure, we need to benchmark for them and publish those benchmarks across teams; otherwise, other teams cannot know whether a given piece of infrastructure is resilient enough. Teams also need to communicate ahead of time, with the Infrastructure team in the loop, before running auxiliary Biz workloads like this,”&lt;/em&gt; Blake thought.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;“Clock the RCA time, Robin,”&lt;/em&gt; Blake said.  &lt;/p&gt;

&lt;p&gt;&lt;em&gt;“5 minutes tops. We handled this before it could get out of hand.”&lt;/em&gt; Robin replied.  &lt;/p&gt;

&lt;p&gt;It was fast troubleshooting: payments were back online, and the upgraded cluster handled the increased publisher count from both the pending-order recreation and live payments.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key takeaways
&lt;/h2&gt;

&lt;p&gt;Many incidents occur in sufficiently complex systems when the underlying infrastructure is not optimally benchmarked. Benchmarking is the first step towards optimal capacity planning. Base infrastructure like caches, queues, and databases needs to be tested not only for the seasonal rise and fall of traffic, but also for resilience against sudden unplanned traffic spikes, whether initiated by customers or by internal needs like batch jobs or analytics.&lt;/p&gt;

&lt;p&gt;During the above incident, the team found that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Communication broke down between the business and tech/infra teams.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The business team was unaware whether or not the underlying infra could sustain a sudden spike of 400-500 concurrent jobs, up from the standard 20-30 payments per minute.&lt;/li&gt;
&lt;li&gt;The business team never communicated with the infrastructure team about running a new workload.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;

&lt;p&gt;There was no internally published benchmark or SLI (service level indicator) that informed other teams of the capabilities of the systems their work relied on.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The infrastructure team did not have enough policy-oriented safeguards on production infrastructure to prevent unsanctioned workloads from being side-loaded without their knowledge.&lt;/li&gt;
&lt;li&gt;The queue infrastructure and consumers were not configured with event-driven auto-scaling to respond to traffic and connection spikes.&lt;/li&gt;
&lt;li&gt;The elasticity and scalability of RabbitMQ had not been explored in enough depth; the team had assumed publisher concurrency would stay consistent over time.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Open and transparent cross-team collaboration is key to running software efficiently and scalably in modern systems. Assumptions are almost always erroneous unless confidence is built around them with proper benchmarking of infrastructure. Systems sometimes have implicit limitations that make them unsuitable for specific types of workloads.&lt;/p&gt;

&lt;h2&gt;
  
  
  Author’s note
&lt;/h2&gt;

&lt;p&gt;Hope you all enjoyed that story about a hypothetical troubleshooting scenario. We see incidents like this and more across various clients we work with at InfraCloud.  The above &lt;a href="https://github.com/infracloudio/sre-stack/tree/main/scenarios/scenario-04"&gt;scenario can be reproduced using our open source repository&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;We are working on adding more such reproducible production outages and subsequent mitigations to this repository.  &lt;/p&gt;

&lt;p&gt;We would love to hear about the incidents where your knowledge of queueing theory came in handy. What is the largest RabbitMQ deployment you have run?&lt;/p&gt;

&lt;p&gt;If you have any questions, you can connect with me on &lt;a href="https://twitter.com/hashfyre"&gt;Twitter&lt;/a&gt; or &lt;a href="https://in.linkedin.com/in/joybhattacherjee"&gt;LinkedIn&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.rabbitmq.com/quorum-queues.html"&gt;RabbitMQ Quorum Queues&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.rabbitmq.com/ha.html"&gt;RabbitMQ Classic Queue Mirroring&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://copyprogramming.com/howto/rabbitmq-throttling-fast-producer-against-large-queues-with-slow-consumer"&gt;Managing Fast Producers and Slow Consumers in RabbitMQ through Throttling of Large Queues&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>rca</category>
      <category>rootcauseanalysis</category>
      <category>rabbitmq</category>
      <category>cloudnative</category>
    </item>
    <item>
      <title>Root Cause Chronicles: Connection Collapse</title>
      <dc:creator>Hashfyre</dc:creator>
      <pubDate>Fri, 15 Dec 2023 09:33:28 +0000</pubDate>
      <link>https://dev.to/infracloud/root-cause-chronicles-connection-collapse-52kc</link>
      <guid>https://dev.to/infracloud/root-cause-chronicles-connection-collapse-52kc</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--LImASrjS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/h9x0cw6egg4ljxzw8fre.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--LImASrjS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/h9x0cw6egg4ljxzw8fre.png" alt="Post Banner" width="800" height="419"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;On a usual Friday evening, Robin had just wrapped up their work, wished their colleagues a happy weekend, and turned in for the night. At exactly 3 am, Robin receives a call from the organization’s automated paging system: &lt;strong&gt;&lt;em&gt;“High P90 Latency Alert on Shipping Service: 9.28 seconds”.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Robin works as an SRE for Robot-Shop, an e-commerce company that sells various robotics parts and accessories, and this message does not bode well for them tonight. They prepare themselves for a long, arduous night ahead and turn on their work laptop.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setting the Field
&lt;/h2&gt;

&lt;p&gt;Robot-Shop runs a sufficiently complex cloud native architecture to address the needs of their million-plus customers. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Traffic from the load balancer is routed via a gateway service optimized for traffic ingestion, called &lt;strong&gt;Web&lt;/strong&gt;, which distributes it across the other services.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;User&lt;/strong&gt; handles user registrations and sessions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Catalogue&lt;/strong&gt; maintains the inventory in a &lt;strong&gt;MongoDB datastore.&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Customers can see the ratings of available products via the &lt;strong&gt;Ratings&lt;/strong&gt; service APIs.&lt;/li&gt;
&lt;li&gt;They choose products they like and add them to the &lt;strong&gt;Cart&lt;/strong&gt;, a service backed by &lt;strong&gt;Redis cache&lt;/strong&gt; to temporarily hold the customer’s choices. &lt;/li&gt;
&lt;li&gt;Once the customer pays via the &lt;strong&gt;Payment&lt;/strong&gt; service, the purchased items are published to a &lt;strong&gt;RabbitMQ&lt;/strong&gt; channel.&lt;/li&gt;
&lt;li&gt;These are consumed by the &lt;strong&gt;Dispatch&lt;/strong&gt; service and prepared for shipping. &lt;strong&gt;Shipping&lt;/strong&gt; uses &lt;strong&gt;MySQL&lt;/strong&gt; as its datastore, as does &lt;strong&gt;Ratings&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--eTxNkT2e--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wq3buafq03q3958n30xb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--eTxNkT2e--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wq3buafq03q3958n30xb.png" alt="High level architecture" width="800" height="374"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;center&gt;(Figure 1: High Level Architecture of Robot-shop Application stack)&lt;/center&gt;

&lt;h2&gt;
  
  
  Troubles in the Dark
&lt;/h2&gt;

&lt;p&gt;“&lt;strong&gt;&lt;em&gt;OK, let’s look at the latency dashboards first.”&lt;/em&gt;&lt;/strong&gt; Robin clicks on the attached Grafana dashboard on the Slack notification for the alert sent by PagerDuty. This opens up the latency graph of the &lt;strong&gt;&lt;em&gt;Shipping service.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--fdDXyAG8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/67uqywkj9gl53z5oexb8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--fdDXyAG8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/67uqywkj9gl53z5oexb8.png" alt="uptick in P90 latency for shipping service via Grafana Dashboard" width="800" height="232"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;“How did it go from 1s to ~9.28s within 4-5 minutes? Did traffic spike?”&lt;/em&gt;&lt;/strong&gt; Robin decides to focus on the &lt;strong&gt;Gateway ops/sec&lt;/strong&gt; panel of the dashboard. The number is around &lt;strong&gt;~140 ops/sec.&lt;/strong&gt; Robin knows this data is coming from their &lt;strong&gt;Istio gateway&lt;/strong&gt; and is reliable. The current number is more than affordable for Robot-Shop’s cluster, though there is a steady uptick in the request-count for Robot-Shop.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--OHLcyZ6c--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ddkk8239kc8k002jx4ov.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--OHLcyZ6c--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ddkk8239kc8k002jx4ov.png" alt="uptick in overall requests per second via Grafana Dashboard" width="800" height="254"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;None of the other services show any signs of wear and tear, only &lt;strong&gt;Shipping&lt;/strong&gt;. Robin understands this is a localized incident and decides to look at the shipping logs. The logs are sourced from &lt;strong&gt;Loki&lt;/strong&gt;, and the widget is conveniently placed right beneath the latency panel, showing logs from all services in the selected time window. &lt;strong&gt;&lt;em&gt;Nothing in the logs,&lt;/em&gt; and &lt;em&gt;no errors regarding connection timeouts or failed transactions.&lt;/em&gt;&lt;/strong&gt; So far the only thing going wrong is the latency, but no requests are failing yet; they are only getting delayed by a very long time. Robin makes a note: &lt;strong&gt;&lt;em&gt;We need to adjust frontend timeouts for these APIs. We should have already gotten a barrage of request timeout errors as an added signal.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Did a developer deploy an unapproved change yesterday?&lt;/em&gt;&lt;/strong&gt; Usually, the support team is informed of any urgent hotfixes before the weekend. Robin decides to check the ArgoCD Dashboards for any changes to shipping or any other services. &lt;strong&gt;&lt;em&gt;Nothing there either, no new feature releases in the last 2 days.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Did the infrastructure team make any changes to the underlying Kubernetes cluster? Any version upgrades?&lt;/em&gt;&lt;/strong&gt; The Infrastructure team uses &lt;strong&gt;&lt;em&gt;Atlantis&lt;/em&gt;&lt;/strong&gt; to gate and deploy the cluster updates via Terraform modules. &lt;strong&gt;The last date of change is from the previous week.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;With no errors seen in the logs and partial service degradation as the only signal available to them, Robin cannot make any more headway into this problem. &lt;strong&gt;&lt;em&gt;Something else may be responsible, could it be an upstream or downstream service that the shipping service depends on? Is it one of the datastores?&lt;/em&gt;&lt;/strong&gt; Robin pulls up the &lt;strong&gt;Kiali service graph&lt;/strong&gt; that uses Istio’s mesh to display the service topology to look at the dependencies.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--EYCgt2uS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vuhvnmdbn1etchrjkz7k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--EYCgt2uS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vuhvnmdbn1etchrjkz7k.png" alt="Kiali dashboard showing the Robot-Shop stack" width="800" height="439"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Robin sees that &lt;strong&gt;Shipping&lt;/strong&gt; has now started throwing its first &lt;strong&gt;5xx errors&lt;/strong&gt;, and both &lt;strong&gt;Shipping and Ratings&lt;/strong&gt; are talking to something labeled as &lt;strong&gt;&lt;em&gt;PassthroughCluster.&lt;/em&gt;&lt;/strong&gt; The support team does not maintain any of these platforms and does not have access to the runtimes or the codebase. &lt;strong&gt;&lt;em&gt;“I need to get relevant people involved at this point and escalate to folks in my team with higher access levels,” Robin thinks.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Stakeholders Assemble
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;It’s already been 5 minutes since the first report and customers are now getting affected.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Vu7XuFmD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/uf2flmu0hp58hobqx8xj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Vu7XuFmD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/uf2flmu0hp58hobqx8xj.png" alt="Detailed Kubernetes native architecture of Robot-shop" width="800" height="554"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;center&gt;(Figure 5: Detailed Kubernetes native architecture of Robot-shop)&lt;/center&gt;

&lt;p&gt;Robin’s team lead Blake joins in on the call, and they also add the backend engineer who owns Shipping service as an SME. The product manager responsible for &lt;strong&gt;Shipping&lt;/strong&gt; has already received the first complaints from the customer support team who has escalated the incident to them; they see the ongoing call on the &lt;strong&gt;&lt;em&gt;#live-incidents&lt;/em&gt;&lt;/strong&gt; channel on Slack, and join in. &lt;strong&gt;P90&lt;/strong&gt; latency alerts are now clogging the production alert channel as the metric has risen to &lt;strong&gt;&lt;em&gt;~4.39 minutes, and 30% of the requests are receiving 5xx responses.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--syTp0QnL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/gou75n16mko3pvmzo79y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--syTp0QnL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/gou75n16mko3pvmzo79y.png" alt="Greater uptick in P90 latency for shipping service via Grafana Dashboard" width="800" height="254"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The team now has multiple signals converging on the problem. Blake digs through &lt;strong&gt;&lt;em&gt;shipping logs&lt;/em&gt;&lt;/strong&gt; again and sees  &lt;strong&gt;&lt;em&gt;errors around MySQL connections&lt;/em&gt;&lt;/strong&gt;. At this time, the &lt;strong&gt;Ratings&lt;/strong&gt; service also starts throwing 5xx errors – the problem is now getting compounded.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--6BiXxvPY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/dihwj1dp0icplqkru2ir.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--6BiXxvPY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/dihwj1dp0icplqkru2ir.png" alt="Shipping service logs showing JDBC connection timeouts via Loki Logs" width="800" height="345"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--eeihT3ig--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/uidodbpu6ip4r4vn3wz7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--eeihT3ig--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/uidodbpu6ip4r4vn3wz7.png" alt="84% Request failures in Ratings service via Grafana Dashboard" width="800" height="270"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Product Manager (PM) says their customer support team is reporting frustration from more and more users who are unable to see the shipping status of the orders they have already paid for and who are supposed to get the deliveries that day. Users who just logged in are unable to see product ratings and are refreshing the pages multiple times to see if the information they want is available. &lt;/p&gt;

&lt;p&gt;“If customers can’t make purchase decisions quickly, they’ll go to our competitors,” the PM informs the team.&lt;/p&gt;

&lt;p&gt;Blake looks at the &lt;strong&gt;&lt;em&gt;PassthroughCluster node&lt;/em&gt;&lt;/strong&gt; on &lt;strong&gt;Kiali&lt;/strong&gt;, and it hits them: it’s the RDS instance. &lt;strong&gt;The platform team had forgotten to add RDS as an External Service in their Istio configuration.&lt;/strong&gt; It was an honest oversight that could cost Robot-Shop significant revenue today.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--NlDuvlOU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ht7zf6s8n5u4k145m824.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--NlDuvlOU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ht7zf6s8n5u4k145m824.png" alt="Kiali dashboard showing Shipping and Ratings connecting to unknown external service as PassThroughCluster" width="800" height="241"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;“I think MySQL is unable to handle new connections for some reason,” Blake says.&lt;/strong&gt; They pull up the MySQL metrics dashboards and look at the number of Database Connections. It has gone up significantly and then flattened. &lt;strong&gt;“Why don’t we have an alert threshold here? It seems like we might have maxed out the MySQL connection pool!”&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--bvkBxzUd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6q7t2xohdyl4cj6fhq6m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--bvkBxzUd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6q7t2xohdyl4cj6fhq6m.png" alt="Large uptick in MySQL Database connection count via Grafana Dashboard" width="800" height="384"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To verify their hypothesis, Blake looks at the Parameter Group for the RDS Instance. It uses the default-mysql-5.7 Parameter group, and max_connections is set to:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;{DBInstanceClassMemory/12582880}&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;But what does that number really mean? Blake decides not to waste time checking the RDS instance type and computing the number. Instead, they log into the RDS instance with mysql-cli and run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="o"&gt;#&lt;/span&gt;&lt;span class="n"&gt;mysql&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;SHOW&lt;/span&gt; &lt;span class="n"&gt;VARIABLES&lt;/span&gt; &lt;span class="k"&gt;LIKE&lt;/span&gt; &lt;span class="nv"&gt;"max_connections"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--eLSITxCz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qy6ki4b84c5wsh6andta.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--eLSITxCz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qy6ki4b84c5wsh6andta.png" alt="MySQL query output showing max_connections" width="545" height="206"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then Blake runs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="o"&gt;#&lt;/span&gt;&lt;span class="n"&gt;mysql&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;SHOW&lt;/span&gt; &lt;span class="n"&gt;processlist&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;“&lt;strong&gt;I need to know exactly how many,&lt;/strong&gt;" Blake thinks, and runs:&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="o"&gt;#&lt;/span&gt;&lt;span class="n"&gt;mysql&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;host&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;information_schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;processlist&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--0Wqxivh_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8qk93mukhulnikfh3len.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--0Wqxivh_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8qk93mukhulnikfh3len.png" alt="MySQL query output showing connected processes" width="800" height="805"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The count exceeds max_connections. Their hypothesis is now validated: Blake sees &lt;strong&gt;&lt;em&gt;a lot of connections sitting in &lt;code&gt;sleep()&lt;/code&gt; mode for more than ~1000 seconds, all created by the shipping user.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;
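&lt;p&gt;What Blake spotted amounts to a simple filter over the process list: long-idle connections owned by one user. A minimal sketch with made-up rows (the IDs and times are illustrative, not taken from the incident):&lt;/p&gt;

```python
# Hypothetical rows mimicking SHOW PROCESSLIST output:
# (id, user, command, time in seconds).
processlist = [
    (101, "shipping", "Sleep", 1423),
    (102, "shipping", "Sleep", 1187),
    (103, "ratings",  "Query", 3),
    (104, "shipping", "Query", 1),
]

# Connections idle (command 'Sleep') beyond a threshold are the leak:
# each one holds a slot in the connection pool without doing any work.
IDLE_THRESHOLD_S = 1000
stale = [pid for pid, user, cmd, t in processlist
         if user == "shipping" and cmd == "Sleep" and t > IDLE_THRESHOLD_S]
print(stale)  # [101, 102]
```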

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--HU1ITv7O--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1dbbu7pxesj1zzsi9x0d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--HU1ITv7O--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1dbbu7pxesj1zzsi9x0d.png" alt="Affected Subsystems of Robot-shop" width="535" height="606"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;center&gt;(Figure 13: Affected Subsystems of Robot-shop)&lt;/center&gt;

&lt;p&gt;“I think we have it,” Blake says, &lt;strong&gt;&lt;em&gt;“Shipping is not properly handling connection timeouts with the DB; it’s not refreshing its unused connection pool.”&lt;/em&gt;&lt;/strong&gt; The backend engineer pulls up the Java JDBC datasource code for shipping and says that it’s using defaults for max-idle, max-wait, and various other Spring datasource configurations. “&lt;strong&gt;These need to be fixed,”&lt;/strong&gt; they say.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--kHft0G0o--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/q1e96p2jvnti1x29f6sc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--kHft0G0o--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/q1e96p2jvnti1x29f6sc.png" alt="Code snippet showing shipping service JDBC connector function" width="800" height="568"&gt;&lt;/a&gt;&lt;br&gt;
“That would need significant time,” the PM responds, “and we need to mitigate this incident ASAP. We cannot have unhappy customers.” &lt;/p&gt;

&lt;p&gt;Blake knows that RDS has a stored procedure to kill idle/bad processes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;mysql&lt;/span&gt;&lt;span class="o"&gt;#&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;CALL&lt;/span&gt; &lt;span class="n"&gt;mysql&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rds_kill&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;processID&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Blake tests this out and asks Robin to quickly write a bash script to kill all idle processes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;

&lt;span class="c"&gt;# MySQL connection details&lt;/span&gt;
&lt;span class="nv"&gt;MYSQL_USER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&amp;lt;user&amp;gt;"&lt;/span&gt;
&lt;span class="nv"&gt;MYSQL_PASSWORD&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&amp;lt;passwd&amp;gt;"&lt;/span&gt;
&lt;span class="nv"&gt;MYSQL_HOST&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&amp;lt;rds-name&amp;gt;.&amp;lt;id&amp;gt;.&amp;lt;region&amp;gt;.rds.amazonaws.com"&lt;/span&gt;

&lt;span class="c"&gt;# Get process list IDs&lt;/span&gt;
&lt;span class="nv"&gt;PROCESS_IDS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nv"&gt;MYSQL_PWD&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$MYSQL_PASSWORD&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; mysql &lt;span class="nt"&gt;-h&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$MYSQL_HOST&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;-u&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$MYSQL_USER&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;-N&lt;/span&gt; &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s2"&gt;"SELECT ID FROM INFORMATION_SCHEMA.PROCESSLIST WHERE USER='shipping'"&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;for &lt;/span&gt;ID &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="nv"&gt;$PROCESS_IDS&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do 
    &lt;/span&gt;&lt;span class="nv"&gt;MYSQL_PWD&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$MYSQL_PASSWORD&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; mysql &lt;span class="nt"&gt;-h&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$MYSQL_HOST&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;-u&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$MYSQL_USER&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s2"&gt;"CALL mysql.rds_kill(&lt;/span&gt;&lt;span class="nv"&gt;$ID&lt;/span&gt;&lt;span class="s2"&gt;)"&lt;/span&gt;
    &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Terminated connection with ID &lt;/span&gt;&lt;span class="nv"&gt;$ID&lt;/span&gt;&lt;span class="s2"&gt; for user 'shipping'"&lt;/span&gt;
&lt;span class="k"&gt;done&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The team runs this immediately, and the connection pool frees up for the moment. Everyone lets out a visible sigh of relief. “&lt;strong&gt;&lt;em&gt;But this won’t hold for long; we need a hotfix for&lt;/em&gt;&lt;/strong&gt; DataSource handling in &lt;strong&gt;&lt;em&gt;Shipping&lt;/em&gt;&lt;/strong&gt;,” Blake says. The backend engineer says they are on it, and soon they have a patch ready that adds better defaults for&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="n"&gt;spring&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;datasource&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;max&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;active&lt;/span&gt;
&lt;span class="n"&gt;spring&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;datasource&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;max&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;age&lt;/span&gt;
&lt;span class="n"&gt;spring&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;datasource&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;max&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;idle&lt;/span&gt;
&lt;span class="n"&gt;spring&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;datasource&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;max&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;lifetime&lt;/span&gt;
&lt;span class="n"&gt;spring&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;datasource&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;max&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;open&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;prepared&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;statements&lt;/span&gt; 
&lt;span class="n"&gt;spring&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;datasource&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;max&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;wait&lt;/span&gt;
&lt;span class="n"&gt;spring&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;datasource&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;maximum&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;pool&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;
&lt;span class="n"&gt;spring&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;datasource&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;min&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;evictable&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;idle&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;millis&lt;/span&gt; 
&lt;span class="n"&gt;spring&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;datasource&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;min&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;idle&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
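&lt;p&gt;A minimal sketch of what such overrides might look like in a Spring &lt;code&gt;application.properties&lt;/code&gt; file (the values are illustrative assumptions, not tuned recommendations; the right numbers depend on the max_connections budget shared across all services using the database):&lt;/p&gt;

```
# Illustrative pool sizing only; tune against the shared max_connections budget.
spring.datasource.max-active=50
spring.datasource.max-idle=10
spring.datasource.min-idle=5
spring.datasource.max-wait=10000
spring.datasource.min-evictable-idle-time-millis=60000
spring.datasource.max-age=1800000
```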



&lt;p&gt;The team approves and deploys the hotfix, finally mitigating a ~30-minute-long incident.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;p&gt;Incidents like this can occur in any organization with a sufficiently complex architecture involving microservices written in different languages and frameworks, datastores, queues, caches, and cloud native components. A lack of end-to-end architectural understanding, compounded by information silos, only lengthens mitigation timelines.&lt;/p&gt;

&lt;p&gt;During the RCA, the team found they had to improve on multiple fronts.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Frontend code had long timeouts and allowed for large latencies in API responses.&lt;/li&gt;
&lt;li&gt;The L1 Engineer did not have an end-to-end understanding of the whole architecture.&lt;/li&gt;
&lt;li&gt;The service mesh dashboard on Kiali did not show External Services correctly, causing confusion.&lt;/li&gt;
&lt;li&gt;RDS MySQL database metrics dashboards did not send an early alert, as no max_connection (alert) or high_number_of_connections (warning) thresholds were set.&lt;/li&gt;
&lt;li&gt;The database connection code was written with the assumption that sane defaults for connection pool parameters were good enough, which proved incorrect.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The pressure to resolve incidents quickly, which often comes from peers, leadership, and members of affected teams, only adds to the chaos of incident management and causes more human errors. Coordinating incidents through a dedicated Incident Commander role has produced more controllable outcomes for organizations around the world. An Incident Commander assumes responsibility for managing resources, planning, and communications during a live incident, effectively reducing conflict and noise.&lt;/p&gt;

&lt;p&gt;When multiple stakeholders are affected by an incident, resolutions need to be handled in order of business priority, working on immediate mitigations first, then getting the customer experience back at nominal levels, and only afterward focusing on long-term preventions. Coordinating these priorities across stakeholders is one of the most important functions of an Incident Commander.&lt;/p&gt;

&lt;p&gt;Troubleshooting a complex architecture remains challenging. However, with the &lt;a href="https://handbook.gitlab.com/handbook/customer-success/professional-services-engineering/workflows/internal/root-cause-analysis/"&gt;Blameless RCA Framework&lt;/a&gt; coupled with periodic metric reviews, a team can focus on incremental but constant improvements to its system observability. The team can also convert every successful resolution into a playbook for L1 SREs and support teams, ensuring that similar errors are handled well in the future.&lt;/p&gt;
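
&lt;p&gt;As one concrete outcome of such a metric review, the missing connection-count alert could be added as a CloudWatch alarm on the RDS DatabaseConnections metric. The alarm name, instance identifier, threshold, and SNS topic below are hypothetical placeholders, not values from the incident:&lt;/p&gt;

```shell
# Warn when connections stay at or above 80 for three consecutive minutes,
# giving the on-call engineer lead time before max_connections is exhausted.
aws cloudwatch put-metric-alarm \
  --alarm-name orders-db-high-connections \
  --namespace AWS/RDS \
  --metric-name DatabaseConnections \
  --dimensions Name=DBInstanceIdentifier,Value=orders-db \
  --statistic Maximum \
  --period 60 \
  --evaluation-periods 3 \
  --threshold 80 \
  --comparison-operator GreaterThanOrEqualToThreshold \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:oncall-alerts
```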

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--guKmVLdR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/x8zkyjnlzbmukl9amp32.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--guKmVLdR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/x8zkyjnlzbmukl9amp32.png" alt="feedback loop" width="337" height="352"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A concerted effort around a clear feedback loop of &lt;strong&gt;Incident -&amp;gt; Resolution -&amp;gt; RCA -&amp;gt; Playbook Creation&lt;/strong&gt; eventually rids the system of most unknown-unknowns, allowing teams to focus on product development instead of chaotic incident handling.&lt;/p&gt;

&lt;h2&gt;That’s a wrap&lt;/h2&gt;

&lt;p&gt;We hope you enjoyed this story of a hypothetical but complex troubleshooting scenario. We see incidents like this, and more, across the various clients we work with at InfraCloud. The &lt;a href="https://github.com/infracloudio/sre-stack/tree/main/scenarios/scenario-02"&gt;above scenario can be reproduced using our open source repository&lt;/a&gt;. We are working on adding more such reproducible production outages and their mitigations to this repository.&lt;/p&gt;

&lt;p&gt;We would love to hear from you about your own 3 am incidents. If you have any questions, you can connect with me on &lt;a href="https://twitter.com/hashfyre"&gt;Twitter&lt;/a&gt; and &lt;a href="https://in.linkedin.com/in/joybhattacherjee"&gt;LinkedIn&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;References&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/mysql-stored-proc-ending.html"&gt;Ending a session or query - Amazon Relational Database Service&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://handbook.gitlab.com/handbook/customer-success/professional-services-engineering/workflows/internal/root-cause-analysis/"&gt;Blameless Root Cause Analyses - The GitLab Handbook&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.baeldung.com/spring-boot-tomcat-connection-pool"&gt;Configuring a Tomcat Connection Pool in Spring Boot - Baeldung&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.spring.io/spring-framework/reference/data-access/jdbc/connections.html"&gt;Controlling Database Connections: Spring Framework&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://istio.io/latest/docs/tasks/traffic-management/egress/egress-control/"&gt;Istio / Accessing External Services&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://istio.io/v1.12/blog/2019/monitoring-external-service-traffic/"&gt;Monitoring Blocked and Passthrough External Service Traffic&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/CHAP_Limits.html#RDS_Limits.MaxConnections"&gt;Quotas and constraints for Amazon RDS - Amazon Relational Database Service&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.atlassian.com/incident-management/incident-response/incident-commander#2-why-do-teams-need-an-incident-commander"&gt;The role of the incident commander - Atlassian&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>rca</category>
      <category>troubleshooting</category>
      <category>sre</category>
      <category>incidentresponse</category>
    </item>
  </channel>
</rss>
