<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Hashfyre</title>
    <description>The latest articles on DEV Community by Hashfyre (@hashfyre).</description>
    <link>https://dev.to/hashfyre</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F27219%2F5716ec08-7c38-485d-8246-5c074906fb0b.png</url>
      <title>DEV Community: Hashfyre</title>
      <link>https://dev.to/hashfyre</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/hashfyre"/>
    <language>en</language>
    <item>
      <title>Root Cause Chronicles: Quivering Queue</title>
      <dc:creator>Hashfyre</dc:creator>
      <pubDate>Tue, 16 Jan 2024 07:36:33 +0000</pubDate>
      <link>https://dev.to/infracloud/root-cause-chronicles-quivering-queue-2jc1</link>
      <guid>https://dev.to/infracloud/root-cause-chronicles-quivering-queue-2jc1</guid>
      <description>&lt;p&gt;River was buried with work. It was time for the Quarterly Earnings call, and they seemed to be getting nowhere. Additional piles of work were being dumped on their desk every day. Business teams did not follow scrum or sprints like the tech folks  on the second floor. You had to do whatever was assigned to you, and yesterday.&lt;/p&gt;

&lt;p&gt;They had to clear all the pending orders for Q3 that somehow got stuck in the system due to glitches or downtimes, and prepare their quarterly summary of dispatch throughput once that was done. River also had to write their self-assessment and performance reviews for all three interns on the team.&lt;/p&gt;

&lt;p&gt;River had already been reminded thrice that day by their senior to clear the orders backlog. But River kept getting drawn into meetings and paperwork every time they attempted to sit down with it. &lt;/p&gt;

&lt;p&gt;Carter from the second floor had given River some sort of script to finish up the &lt;strong&gt;pending orders that were not yet dispatched to paying customers&lt;/strong&gt;, but had asked River to monitor its progress. &lt;em&gt;“It’s an ad hoc script I’m cobbling together to get this done fast. This may cause issues down the line. Keep checking on it and ping me if something goes wrong”&lt;/em&gt;, they had said. Fingers crossed, River pressed the &lt;strong&gt;Run&lt;/strong&gt; button on the pipeline for this script and waited.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;“Hey River, care to come in for a minute?”&lt;/em&gt; their senior beckoned, pointing at the nearest meeting room, “We need to get on top of your performance review.”   &lt;/p&gt;

&lt;p&gt;“Hey, sure…”, River hesitated, “But I think I should monitor this script I just ran. You have been asking me to clear up that backlog of orders for a while; Carter gave me this script to get it resolved.” &lt;/p&gt;

&lt;p&gt;“It’s automated, right?” the senior asked. &lt;/p&gt;

&lt;p&gt;“Yeah, sure. But Carter asked me to inform them if anything goes wrong!” &lt;/p&gt;

&lt;p&gt;“It’ll be alright, Carter knows their job. You don’t need to wait on it. Just come in, and we can finish this conversation. My superiors are breathing down my neck too.”&lt;/p&gt;

&lt;h2&gt;
  
  
  30 minutes later…
&lt;/h2&gt;

&lt;p&gt;“Hey, Robin!” someone tapped on their shoulder.  &lt;/p&gt;

&lt;p&gt;Robin was neck-deep trying to finish writing a new &lt;a href="https://www.terraform.io/"&gt;Terraform&lt;/a&gt; module and replied without looking up, “What’s gone wrong?”  &lt;/p&gt;

&lt;p&gt;&lt;em&gt;“&lt;strong&gt;Payments are failing&lt;/strong&gt;, customers are unable to buy anything, it just gets stuck on the web page.”&lt;/em&gt;  &lt;/p&gt;

&lt;p&gt;Robin switched to the &lt;a href="https://grafana.com/"&gt;Grafana&lt;/a&gt; dashboard tab, and sure enough, the 5xx volume on web service was rising. It had not hit the critical alert thresholds yet, but customers had already started noticing.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--IjsnU6yD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ipo0pvkeqi0rgcss29ch.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--IjsnU6yD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ipo0pvkeqi0rgcss29ch.png" alt="Web service showing ~15% 5xx responses" width="800" height="307"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;center&gt;(Web service showing ~15% 5xx responses)&lt;/center&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--KxwEZ7sZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/yps1k9f220vnr8s31ol6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--KxwEZ7sZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/yps1k9f220vnr8s31ol6.png" alt="Payment service showing ~18% 5xx responses" width="800" height="307"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;center&gt;(Payment service showing ~18% 5xx responses)&lt;/center&gt;

&lt;h2&gt;
  
  
  Another day at Robot-Shop
&lt;/h2&gt;

&lt;p&gt;River, Carter, and Robin all work for Robot-Shop Inc. They run a sufficiently complex &lt;a href="https://cloud.google.com/learn/what-is-cloud-native"&gt;cloud native architecture&lt;/a&gt; to address the needs of their million-plus customers. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The traffic from load-balancer is routed via a gateway service optimized for traffic ingestion, called &lt;strong&gt;Web&lt;/strong&gt;, which distributes the traffic across various other services.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;User&lt;/strong&gt; handles user registrations and sessions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Catalogue&lt;/strong&gt; maintains the inventory in a &lt;strong&gt;&lt;a href="https://www.mongodb.com/"&gt;MongoDB&lt;/a&gt; datastore&lt;/strong&gt;. &lt;/li&gt;
&lt;li&gt;Customers can see the ratings of available products via the &lt;strong&gt;Ratings&lt;/strong&gt; service APIs.&lt;/li&gt;
&lt;li&gt;They choose products they like and add them to the &lt;strong&gt;Cart&lt;/strong&gt;, a service backed by &lt;strong&gt;&lt;a href="https://redis.io/"&gt;Redis&lt;/a&gt; cache&lt;/strong&gt; to temporarily hold the customer’s choices. &lt;/li&gt;
&lt;li&gt;Once the customer pays via the &lt;strong&gt;Payment&lt;/strong&gt; service, the purchased items are published as &lt;strong&gt;orders&lt;/strong&gt; to a &lt;strong&gt;&lt;a href="https://www.rabbitmq.com/"&gt;RabbitMQ&lt;/a&gt;&lt;/strong&gt; channel.&lt;/li&gt;
&lt;li&gt;These &lt;strong&gt;orders&lt;/strong&gt; are consumed by the &lt;strong&gt;Dispatch&lt;/strong&gt; service and prepared for shipping. &lt;strong&gt;Shipping&lt;/strong&gt; uses &lt;strong&gt;&lt;a href="https://www.mysql.com/"&gt;MySQL&lt;/a&gt;&lt;/strong&gt; as its datastore, as does &lt;strong&gt;Ratings&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;
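
&lt;p&gt;The Payment-to-Dispatch handoff above can be sketched as a toy in-process producer/consumer model. This is only an illustration: Python's queue module stands in for RabbitMQ, and the queue name, order shape, and counts here are hypothetical, not taken from the real stack.&lt;/p&gt;

```python
import queue
import threading

orders = queue.Queue()  # stand-in for the RabbitMQ "orders" channel

def payment_publish(order_id):
    # Payment service: after a successful charge, publish the order.
    orders.put({"order_id": order_id, "status": "paid"})

def dispatch_consume(results):
    # Dispatch service: consume orders and prepare them for shipping.
    while True:
        order = orders.get()
        if order is None:  # sentinel: no more orders
            break
        order["status"] = "dispatched"
        results.append(order)

shipped = []
consumer = threading.Thread(target=dispatch_consume, args=(shipped,))
consumer.start()
for i in range(5):
    payment_publish(i)
orders.put(None)
consumer.join()
print(len(shipped))  # prints 5
```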

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--iwPSMXbp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/k33g0omvp1s7frrc70cq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--iwPSMXbp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/k33g0omvp1s7frrc70cq.png" alt="High-Level Architecture of Robot-shop Application Stack" width="800" height="374"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;center&gt;(High-Level Architecture of Robot-Shop Application Stack)&lt;/center&gt;

&lt;p&gt;Only now, payment failures were rising at an alarming rate. &lt;em&gt;“&lt;strong&gt;OK, let’s look at payment logs.&lt;/strong&gt;”&lt;/em&gt; Robin clicked on the attached Grafana dashboard for Payments logs from &lt;a href="https://grafana.com/oss/loki/"&gt;Loki&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--CvD4RK1j--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/30xrrobufpcxwcqa3z5l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--CvD4RK1j--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/30xrrobufpcxwcqa3z5l.png" alt="Payment service logs showing AMQP timeout handshake with rabbitmq-cluster" width="800" height="185"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;“&lt;strong&gt;Payments can’t connect to RabbitMQ! Is the cluster down?&lt;/strong&gt;”&lt;/em&gt; This could be bad. Robin was distraught. If the RabbitMQ cluster was entirely down, it would be hard to recover any messages that were in transit and in-memory. The only way of recovery would be to replay failure logs from the Payments service later. But RabbitMQ was not down, Robin would have received immediate alerts if it were so. &lt;em&gt;“&lt;strong&gt;Let’s check the queue health&lt;/strong&gt;.”&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--pv_7odLZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/hgxzlxyxzdz8joeacail.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--pv_7odLZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/hgxzlxyxzdz8joeacail.png" alt="RabbitMQ dashboard showing ~154 messages per second and 266 active connections" width="800" height="399"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;“&lt;strong&gt;Oh, wow! Seems the number of messages has gone up drastically in the last 30 minutes&lt;/strong&gt;. Are we really processing a record number of payments at the end of a quarter?”&lt;/em&gt; Robin thought. “Close to 300-350 orders/second. Let me correlate that with the Payments service request count.” Robin jumped around the dashboards with a few keystrokes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--YhOmdoZh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6yzh7eqoufkt41snro2b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--YhOmdoZh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6yzh7eqoufkt41snro2b.png" alt="Payment service dashboard showing extremely low number of requests per second" width="800" height="372"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;“&lt;strong&gt;This does not compute. It can’t be payments!&lt;/strong&gt; The service is only processing 1-2 payments/second!”&lt;/em&gt; Robin exclaimed. &lt;/p&gt;

&lt;p&gt;&lt;em&gt;“&lt;strong&gt;Check the publisher count of the queue against the number of payment pods in the deployment!&lt;/strong&gt;”&lt;/em&gt; Blake was there, as were a few other folks around her seat. Robin never felt too comfortable when people looked on as they typed. It was nerve-wracking.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--1kvrfGGj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5bzfwy04uz4kq7wuifju.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--1kvrfGGj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5bzfwy04uz4kq7wuifju.png" alt="RabbitMQ dashboard showing 400+ publishers" width="800" height="399"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;“&lt;strong&gt;Whoa, why do we have ~465 publishers?&lt;/strong&gt;”&lt;/em&gt; Blake was right as usual. “Even with auto-scaling enabled, payment doesn’t usually scale beyond 20-30 pods, and right now, we only have 3 of them, the minimum count.” &lt;/p&gt;

&lt;p&gt;This was expected: customers never shopped much at the end of the quarter, with tax filing around the corner. Overall system load was expected to be at an all-time low, as per seasonal traffic analysis data.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;“&lt;strong&gt;Who, or what, is publishing orders to the queue? Also, how is our consumer service, Dispatch doing?&lt;/strong&gt;”&lt;/em&gt; Blake was curious now; this could escalate very fast if they could not find a solution soon. Blake pulled up a chair and sat down, looking at the graphs Robin had pulled up.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--OVb_1b5g--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/uoxul8ewtntv7u1q8y98.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--OVb_1b5g--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/uoxul8ewtntv7u1q8y98.png" alt="RabbitMQ dashboard showing message confirmations sent back to publisher from the consumers" width="800" height="748"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;“Consumers look alright! But I wish we had instrumented our apps with &lt;a href="https://opentelemetry.io/"&gt;OpenTelemetry&lt;/a&gt;.”&lt;/em&gt; Blake sighed at the onlooking Backend folks. &lt;em&gt;“Right now, we know we have too many publishers, but we don’t know who they are! We need to complete our Tracing sprint, and soon.”&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vl4yv52z55rpvpm4lzwo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vl4yv52z55rpvpm4lzwo.png" alt="Detailed Kubernetes native architecture of Robot-shop" width="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;center&gt;(Detailed Kubernetes native architecture of Robot-Shop)&lt;/center&gt;

&lt;p&gt;“We have Kiali. It should show which services other than Payments and Dispatch are connected to RabbitMQ,” Robin said.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--K0eyNahA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wwnfjnsqfi10xskvy1jl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--K0eyNahA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wwnfjnsqfi10xskvy1jl.png" alt="Kiali dashboard showing workload pending-orders-recreation sending heavy message load to RabbitMQ cluster" width="800" height="299"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Robin fiddled around with namespace filters until they zoomed in on something. &lt;/p&gt;

&lt;p&gt;&lt;em&gt;“It seems some workload called pending-orders-recreation is connecting to RabbitMQ from a namespace called &lt;strong&gt;pending-orders&lt;/strong&gt;.”&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;“Let's see what’s running there. I do not know this namespace.”&lt;/p&gt;

&lt;p&gt;And lo, around 400 pods with the prefix &lt;strong&gt;&lt;em&gt;pending-orders-recreation-xxx&lt;/em&gt;&lt;/strong&gt; were running there.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--rox4Izco--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mv621yne2lq30wlm1qku.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--rox4Izco--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mv621yne2lq30wlm1qku.png" alt="K9S Dashboard showing 400+ pods of pending-orders-recreation in the pending-orders namespace" width="800" height="335"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;“Who could be running this?” Robin pinged the #engineering channel on Slack:&lt;br&gt;&lt;br&gt;
Hey Team, has anyone deployed a new workload to a namespace called &lt;strong&gt;&lt;em&gt;pending-orders&lt;/em&gt;&lt;/strong&gt; in production?&lt;/p&gt;
&lt;h2&gt;
  
  
  The root cause of all things
&lt;/h2&gt;

&lt;p&gt;Soon, Blake and Robin heard someone rushing up to them. It was Carter: &lt;em&gt;“Hey, that was me. That workload you pinged about just now. What’s up with it?”&lt;/em&gt;   &lt;/p&gt;

&lt;p&gt;&lt;em&gt;“It seems to have taken down our live payments system, what are you running there?”&lt;/em&gt; Blake asked.   &lt;/p&gt;

&lt;p&gt;&lt;em&gt;“Well, it’s the end of the quarter, and the Biz team wanted to automate reconciliation of all pending orders for the last three months that were never dispatched due to glitches or errors. Remember the &lt;a href="https://dev.to/blogs/root-cause-chronicles-connection-collapse/"&gt;downtime we had last month&lt;/a&gt; on the User service, and a bunch of payments went bad? We are recreating those from scratch now.”&lt;/em&gt;  &lt;/p&gt;

&lt;p&gt;&lt;em&gt;“All of them? How many are we talking about?”&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;“Ah, close to half a million?”&lt;/em&gt; Carter was sweating, &lt;em&gt;“They need to complete by today, it’s the end of the quarter. Otherwise, our numbers will look all wrong. How do we fix this now?”&lt;/em&gt;  &lt;/p&gt;
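&lt;p&gt;A quick back-of-envelope drain-time estimate shows the scale of the problem. Only the roughly half-million backlog and the 3-pod Dispatch replica count come from the story; the per-pod consumption rate below is a hypothetical figure, since the article never states Dispatch's throughput.&lt;/p&gt;

```python
backlog = 500_000    # pending orders Carter mentions (~half a million)
per_pod_rate = 50    # msgs/sec per Dispatch pod -- hypothetical, not stated in the article

def drain_hours(backlog, pods, per_pod_rate):
    # Hours to drain the backlog, ignoring live traffic and re-queues.
    return backlog / (pods * per_pod_rate) / 3600

print(round(drain_hours(backlog, 3, per_pod_rate), 2))   # ~0.93 h at the current 3 pods
print(round(drain_hours(backlog, 10, per_pod_rate), 2))  # ~0.28 h at 10 pods
```

Even under generous assumptions, clearing the backlog before end of day hinges on consumer capacity, which is why the conversation turns to scaling Dispatch.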

&lt;p&gt;&lt;em&gt;“Can we stop the workload and let the queue recover?”&lt;/em&gt; Blake queried. &lt;em&gt;“Because right now the cluster is dying under the load. Most messages are getting re-queued.”&lt;/em&gt;  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--D7X362mJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/heumlid8y9nsi7sulfv2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--D7X362mJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/heumlid8y9nsi7sulfv2.png" alt="RabbitMQ dashboard showing unroutable messages being returned back to publisher as queue length is full" width="800" height="272"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;center&gt;(Unroutable messages returned to publishers)&lt;/center&gt;

&lt;p&gt;&lt;em&gt;“Can’t we just scale up dispatch to consume the orders faster?”&lt;/em&gt; Carter was not in favor of their deadline slipping. &lt;/p&gt;

&lt;p&gt;&lt;em&gt;“We can deploy a quick event-driven autoscaler with KEDA. We should have had it in the first place, but this subsystem had never been required to deal with scale spikes before.”&lt;/em&gt; &lt;/p&gt;

&lt;p&gt;Thankfully, the &lt;a href="https://keda.sh/"&gt;KEDA&lt;/a&gt; operator was already part of the cluster, and all Robin had to do was create a ScaledObject manifest targeting the dispatch Deployment, scaling on the rabbitmq_global_messages_received_total metric from &lt;a href="https://prometheus.io/"&gt;Prometheus&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;keda.sh/v1alpha1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ScaledObject&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dispatch-scale-up&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;robot-shop&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;scaleTargetRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deployment&lt;/span&gt;                                       
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dispatch&lt;/span&gt;                                         
  &lt;span class="na"&gt;pollingInterval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  &lt;span class="s"&gt;10&lt;/span&gt;                                     
  &lt;span class="na"&gt;cooldownPeriod&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;   &lt;span class="s"&gt;300&lt;/span&gt;                                    
  &lt;span class="na"&gt;minReplicaCount&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;3&lt;/span&gt;                                       
  &lt;span class="na"&gt;maxReplicaCount&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;10&lt;/span&gt;                                      
  &lt;span class="na"&gt;advanced&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;                                                
    &lt;span class="na"&gt;restoreToOriginalReplicaCount&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="na"&gt;horizontalPodAutoscalerConfig&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;                   
      &lt;span class="na"&gt;behavior&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;                                     
        &lt;span class="na"&gt;scaleUp&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;stabilizationWindowSeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;
          &lt;span class="na"&gt;policies&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Pods&lt;/span&gt;
            &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
            &lt;span class="na"&gt;periodSeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
  &lt;span class="na"&gt;triggers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;prometheus&lt;/span&gt;
    &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;serverAddress&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http://prometheus-stack-kube-prom-prometheus.monitoring.svc.cluster.local:9090&lt;/span&gt;
      &lt;span class="na"&gt;query&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
        &lt;span class="s"&gt;sum(rate(rabbitmq_global_messages_received_total[60s]))&lt;/span&gt;
      &lt;span class="na"&gt;threshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;10'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ExVMgwIC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vho4yijjqf0mokvcq6jc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ExVMgwIC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vho4yijjqf0mokvcq6jc.png" alt="Terminal screenshot showing Dispatch service scaling up thanks to Horizontal Pod Autoscaler configuration based on global_messages_received metric" width="800" height="239"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;“Dispatch has scaled up; it’ll keep adding pods as long as the rate of orders published to the queue stays above the threshold of 10 messages/second,”&lt;/em&gt; Robin said.   &lt;/p&gt;
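&lt;p&gt;Roughly, the Prometheus trigger maps the metric to a replica count as in this simplified sketch. The threshold of 10 and the min/max of 3/10 mirror the manifest above; the real KEDA/HPA loop also applies the stabilization window and scaleUp policy, which this sketch ignores.&lt;/p&gt;

```python
import math

def desired_replicas(metric_value, threshold=10, min_replicas=3, max_replicas=10):
    # AverageValue-style target: replicas grow with the metric divided by the
    # threshold, clamped to the manifest's minReplicaCount/maxReplicaCount.
    want = math.ceil(metric_value / threshold)
    return max(min_replicas, min(max_replicas, want))

print(desired_replicas(5))    # 3  -- below threshold, stay at minReplicaCount
print(desired_replicas(65))   # 7  -- ceil(65 / 10)
print(desired_replicas(350))  # 10 -- capped at maxReplicaCount
```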

&lt;p&gt;&lt;em&gt;“Let’s wait a few minutes and see if scaling the consumers resolves the issue,”&lt;/em&gt; Blake commented.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--3CkaJn0D--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ja70ax4xe0ida6dplkzi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--3CkaJn0D--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ja70ax4xe0ida6dplkzi.png" alt="Terminal screenshot showing rabbitmq pods getting OOMKilled due to load" width="800" height="114"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;center&gt;(RabbitMQ pods getting OOMKilled due to large number of connections)&lt;/center&gt;

&lt;p&gt;&lt;em&gt;“This doesn’t look like a fix.”&lt;/em&gt; Blake pointed at the memory consumption and connections per pod metric, &lt;em&gt;“See here, this is not about scaling up the consumers. RabbitMQ pods are unable to handle that many connections, given how little memory we have allocated to them. Can we use KEDA to scale up the RabbitMQ statefulset?”&lt;/em&gt;  &lt;/p&gt;

&lt;p&gt;&lt;em&gt;“We are using RabbitMQ Operator, not sure if they allow autoscaling. They might wrap a HorizontalPodAutoscaler resource within the operator itself!”&lt;/em&gt; Robin shrugged. They started digging into the problem.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--XCaL5MU7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jbly9l8gassde0ouk2p7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--XCaL5MU7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jbly9l8gassde0ouk2p7.png" alt="Github Issue Screenshot showing maintainer of rabbitmq/cluster-operator commenting on why high-availability and autoscaling are out-of-scope of the operator's design." width="800" height="598"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/rabbitmq/cluster-operator/issues/800#issuecomment-897074779"&gt;(Add Support For Pods Auto Scaling)&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;&lt;em&gt;“Doesn’t look like it. The maintainers do not recommend auto-scaling RabbitMQ for legitimate reasons and cap the scale at 9 nodes or pods. Let’s just manually scale up for now. I’m going to scale them up vertically too after the new nodes are up, and add more memory per node.”&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;RMQ_CLUSTER_NS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;robot-shop
kubectl patch rabbitmqcluster rabbitmq-cluster &lt;span class="nt"&gt;--type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'json'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-p&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'[
        {"op": "replace", "path": "/spec/replicas", "value": 6}, 
        {"op": "replace", "path": "/spec/resources/limits/memory", "value": "400Mi"}, 
        {"op": "replace", "path": "/spec/resources/requests/memory", "value": "100Mi"}
        ]'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
     &lt;span class="nt"&gt;-n&lt;/span&gt; &lt;span class="nv"&gt;$RMQ_CLUSTER_NS&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
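
&lt;p&gt;As a rough sanity check on the new memory limit: each RabbitMQ connection carries bookkeeping overhead on top of the messages themselves. The ~100 kB per-connection figure below is a ballpark assumption (actual usage varies by workload and is dominated by message payloads), and the connection count comes from the dashboard earlier in the story.&lt;/p&gt;

```python
connections = 465   # publisher count observed on the RabbitMQ dashboard
per_conn_kb = 100   # rough per-connection footprint -- an assumption, varies by workload
limit_mb = 400      # new per-pod memory limit from the patch above

conn_mb = connections * per_conn_kb / 1024
print(round(conn_mb, 1))                   # ~45.4 MB just for connection state
print(round(conn_mb / limit_mb * 100, 1))  # ~11.4% of one pod's limit
```

Connection state alone now fits comfortably, but the earlier OOMKills show why the previous, smaller limit could not absorb hundreds of publishers plus their queued messages.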



&lt;p&gt;&lt;em&gt;“The new nodes will take a few seconds to start accepting connections. Also, we are not using mirrored or &lt;a href="https://www.rabbitmq.com/quorum-queues.html"&gt;quorum queues&lt;/a&gt;; this was thought to be a small workload, and we decided a simple direct queue would suffice. The cluster was not modeled for spiky loads such as today’s.”&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--4gX1Pg1t--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/cfl5mzzk0t67lauwe1np.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--4gX1Pg1t--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/cfl5mzzk0t67lauwe1np.png" alt="RabbitMQ Admin dashboard screenshot showingthe type of queue configured by the application as classic direct queue." width="800" height="59"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;“If we are going to run spiky workloads like this on the queueing infrastructure, we need to benchmark for them and publish those benchmarks across teams; otherwise, other teams cannot know whether a given piece of infrastructure is resilient enough. Teams also need to communicate ahead of time, with the Infrastructure team in the loop, before running auxiliary Biz workloads like this,”&lt;/em&gt; Blake thought.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;“Clock the RCA time, Robin,”&lt;/em&gt; Blake said.  &lt;/p&gt;

&lt;p&gt;&lt;em&gt;“5 minutes tops. We handled this before it could get out of hand.”&lt;/em&gt; Robin replied.  &lt;/p&gt;

&lt;p&gt;It was fast troubleshooting: payments were back online, and the upgraded cluster handled the increased publisher count from both the pending-order recreation and live payments.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key takeaways
&lt;/h2&gt;

&lt;p&gt;Many incidents occur in sufficiently complex systems when the underlying infrastructure is not optimally benchmarked. Benchmarking is the first step towards optimal capacity planning. Base infrastructure like caches, queues, and databases needs to be tested not only for the seasonal rise and fall of traffic, but also for resilience against sudden unplanned traffic spikes, whether initiated by customers or by internal needs like batch jobs or analytics.&lt;/p&gt;

&lt;p&gt;During the above incident, the team found that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Communication broke down between the business and tech/infra teams.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The business team was unaware whether or not the underlying infra could sustain a sudden spike of 400-500 concurrent jobs, up from the standard 20-30 payments per minute.&lt;/li&gt;
&lt;li&gt;The business team never communicated with the infrastructure team about running a new workload.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;

&lt;p&gt;There was no internally published benchmark or SLI (service level indicator) that informed other teams of the capabilities of the systems their work relied on.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The infrastructure team did not have enough policy-oriented safeguards on production infrastructure to prevent unsanctioned workloads from being side-loaded without their knowledge.&lt;/li&gt;
&lt;li&gt;The queue infrastructure and consumers were not configured with event-driven auto-scaling to respond to traffic and connection spikes.&lt;/li&gt;
&lt;li&gt;The elasticity and scalability of RabbitMQ had not been explored in enough depth; the team had assumed publisher concurrency would stay consistent over time.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Open and transparent cross-team collaboration is key to running software efficiently and scalably in modern systems. Assumptions are almost always erroneous unless confidence is built around them with proper benchmarking of infrastructure. Systems sometimes have implicit limitations that make them unsuitable for specific types of workloads.&lt;/p&gt;

&lt;h2&gt;
  
  
  Author’s note
&lt;/h2&gt;

&lt;p&gt;Hope you all enjoyed that story about a hypothetical troubleshooting scenario. We see incidents like this and more across various clients we work with at InfraCloud.  The above &lt;a href="https://github.com/infracloudio/sre-stack/tree/main/scenarios/scenario-04"&gt;scenario can be reproduced using our open source repository&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;We are working on adding more such reproducible production outages and subsequent mitigations to this repository.  &lt;/p&gt;

&lt;p&gt;We would love to hear about the incidents where your knowledge of queueing theory came in handy. What is the largest RabbitMQ deployment you have run?&lt;/p&gt;

&lt;p&gt;If you have any questions, you can connect with me on &lt;a href="https://twitter.com/hashfyre"&gt;Twitter&lt;/a&gt; or &lt;a href="https://in.linkedin.com/in/joybhattacherjee"&gt;LinkedIn&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.rabbitmq.com/quorum-queues.html"&gt;RabbitMQ Quorum Queues&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.rabbitmq.com/ha.html"&gt;RabbitMQ Classic Queue Mirroring&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://copyprogramming.com/howto/rabbitmq-throttling-fast-producer-against-large-queues-with-slow-consumer"&gt;Managing Fast Producers and Slow Consumers in RabbitMQ through Throttling of Large Queues&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>rca</category>
      <category>rootcauseanalysis</category>
      <category>rabbitmq</category>
      <category>cloudnative</category>
    </item>
    <item>
      <title>Root Cause Chronicles: Connection Collapse</title>
      <dc:creator>Hashfyre</dc:creator>
      <pubDate>Fri, 15 Dec 2023 09:33:28 +0000</pubDate>
      <link>https://dev.to/infracloud/root-cause-chronicles-connection-collapse-52kc</link>
      <guid>https://dev.to/infracloud/root-cause-chronicles-connection-collapse-52kc</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--LImASrjS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/h9x0cw6egg4ljxzw8fre.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--LImASrjS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/h9x0cw6egg4ljxzw8fre.png" alt="Post Banner" width="800" height="419"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;On a usual Friday evening, Robin had just wrapped up their work, wished their colleagues a happy weekend, and turned in for the night. At exactly 3 am, Robin receives a call from the organization’s automated paging system: &lt;strong&gt;&lt;em&gt;“High P90 Latency Alert on Shipping Service: 9.28 seconds”.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Robin works as an SRE for Robot-Shop, an e-commerce company that sells various robotics parts and accessories, and this message does not bode well for them tonight. They prepare themselves for a long, arduous night ahead and turn on their work laptop.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setting the Field
&lt;/h2&gt;

&lt;p&gt;Robot-Shop runs a sufficiently complex cloud native architecture to address the needs of their million-plus customers. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Traffic from the load balancer is routed via a gateway service optimized for traffic ingestion, called &lt;strong&gt;Web&lt;/strong&gt;, which distributes it across the other services.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;User&lt;/strong&gt; handles user registrations and sessions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Catalogue&lt;/strong&gt; maintains the inventory in a &lt;strong&gt;MongoDB datastore.&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Customers can see the ratings of available products via the &lt;strong&gt;Ratings&lt;/strong&gt; service APIs.&lt;/li&gt;
&lt;li&gt;They choose products they like and add them to the &lt;strong&gt;Cart&lt;/strong&gt;, a service backed by &lt;strong&gt;Redis cache&lt;/strong&gt; to temporarily hold the customer’s choices. &lt;/li&gt;
&lt;li&gt;Once the customer pays via the &lt;strong&gt;Payment&lt;/strong&gt; service, the purchased items are published to a &lt;strong&gt;RabbitMQ&lt;/strong&gt; channel.&lt;/li&gt;
&lt;li&gt;These are consumed by the &lt;strong&gt;Dispatch&lt;/strong&gt; service and prepared for shipping. &lt;strong&gt;Shipping&lt;/strong&gt; uses &lt;strong&gt;MySQL&lt;/strong&gt; as its datastore, as does &lt;strong&gt;Ratings&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--eTxNkT2e--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wq3buafq03q3958n30xb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--eTxNkT2e--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wq3buafq03q3958n30xb.png" alt="High level architecture" width="800" height="374"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;center&gt;(Figure 1: High Level Architecture of Robot-shop Application stack)&lt;/center&gt;

&lt;h2&gt;
  
  
  Troubles in the Dark
&lt;/h2&gt;

&lt;p&gt;“&lt;strong&gt;&lt;em&gt;OK, let’s look at the latency dashboards first.”&lt;/em&gt;&lt;/strong&gt; Robin clicks on the attached Grafana dashboard on the Slack notification for the alert sent by PagerDuty. This opens up the latency graph of the &lt;strong&gt;&lt;em&gt;Shipping service.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--fdDXyAG8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/67uqywkj9gl53z5oexb8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--fdDXyAG8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/67uqywkj9gl53z5oexb8.png" alt="uptick in P90 latency for shipping service via Grafana Dashboard" width="800" height="232"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;“How did it go from 1s to ~9.28s within 4-5 minutes? Did traffic spike?”&lt;/em&gt;&lt;/strong&gt; Robin decides to focus on the &lt;strong&gt;Gateway ops/sec&lt;/strong&gt; panel of the dashboard. The number is around &lt;strong&gt;~140 ops/sec.&lt;/strong&gt; Robin knows this data is coming from their &lt;strong&gt;Istio gateway&lt;/strong&gt; and is reliable. The current number is more than affordable for Robot-Shop’s cluster, though there is a steady uptick in the request-count for Robot-Shop.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--OHLcyZ6c--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ddkk8239kc8k002jx4ov.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--OHLcyZ6c--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ddkk8239kc8k002jx4ov.png" alt="uptick in overall requests per second via Grafana Dashboard" width="800" height="254"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;None of the other services show any signs of wear and tear, only &lt;strong&gt;Shipping&lt;/strong&gt;. Robin understands this is a localized incident and decides to look at the shipping logs. The logs are sourced from &lt;strong&gt;Loki&lt;/strong&gt;, and the widget is conveniently placed right beneath the latency panel, showing logs from all services in the selected time window. &lt;strong&gt;&lt;em&gt;Nothing in the logs,&lt;/em&gt; and &lt;em&gt;no errors regarding connection timeouts or failed transactions.&lt;/em&gt;&lt;/strong&gt; So far the only thing going wrong is the latency, but no requests are failing yet; they are only getting delayed by a very long time. Robin makes a note: &lt;strong&gt;&lt;em&gt;We need to adjust frontend timeouts for these APIs. We should have already gotten a barrage of request timeout errors as an added signal.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Did a developer deploy an unapproved change yesterday?&lt;/em&gt;&lt;/strong&gt; Usually, the support team is informed of any urgent hotfixes before the weekend. Robin decides to check the ArgoCD Dashboards for any changes to shipping or any other services. &lt;strong&gt;&lt;em&gt;Nothing there either, no new feature releases in the last 2 days.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Did the infrastructure team make any changes to the underlying Kubernetes cluster? Any version upgrades?&lt;/em&gt;&lt;/strong&gt; The Infrastructure team uses &lt;strong&gt;&lt;em&gt;Atlantis&lt;/em&gt;&lt;/strong&gt; to gate and deploy the cluster updates via Terraform modules. &lt;strong&gt;The last date of change is from the previous week.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;With no errors seen in the logs and partial service degradation as the only signal available to them, Robin cannot make any more headway into this problem. &lt;strong&gt;&lt;em&gt;Something else may be responsible, could it be an upstream or downstream service that the shipping service depends on? Is it one of the datastores?&lt;/em&gt;&lt;/strong&gt; Robin pulls up the &lt;strong&gt;Kiali service graph&lt;/strong&gt; that uses Istio’s mesh to display the service topology to look at the dependencies.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--EYCgt2uS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vuhvnmdbn1etchrjkz7k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--EYCgt2uS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vuhvnmdbn1etchrjkz7k.png" alt="Kiali dashboard showing the Robot-Shop stack" width="800" height="439"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Robin sees that &lt;strong&gt;Shipping&lt;/strong&gt; has now started throwing its first &lt;strong&gt;5xx errors&lt;/strong&gt;, and both &lt;strong&gt;Shipping and Ratings&lt;/strong&gt; are talking to something labeled as &lt;strong&gt;&lt;em&gt;PassthroughCluster.&lt;/em&gt;&lt;/strong&gt; The support team does not maintain any of these platforms and does not have access to the runtimes or the codebase. &lt;strong&gt;&lt;em&gt;“I need to get relevant people involved at this point and escalate to folks in my team with higher access levels,” Robin thinks.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Stakeholders Assemble
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;It’s already been 5 minutes since the first report and customers are now getting affected.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Vu7XuFmD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/uf2flmu0hp58hobqx8xj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Vu7XuFmD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/uf2flmu0hp58hobqx8xj.png" alt="Detailed Kubernetes native architecture of Robot-shop" width="800" height="554"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;center&gt;(Figure 5: Detailed Kubernetes native architecture of Robot-shop)&lt;/center&gt;

&lt;p&gt;Robin’s team lead Blake joins in on the call, and they also add the backend engineer who owns Shipping service as an SME. The product manager responsible for &lt;strong&gt;Shipping&lt;/strong&gt; has already received the first complaints from the customer support team who has escalated the incident to them; they see the ongoing call on the &lt;strong&gt;&lt;em&gt;#live-incidents&lt;/em&gt;&lt;/strong&gt; channel on Slack, and join in. &lt;strong&gt;P90&lt;/strong&gt; latency alerts are now clogging the production alert channel as the metric has risen to &lt;strong&gt;&lt;em&gt;~4.39 minutes, and 30% of the requests are receiving 5xx responses.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--syTp0QnL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/gou75n16mko3pvmzo79y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--syTp0QnL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/gou75n16mko3pvmzo79y.png" alt="Greater uptick in P90 latency for shipping service via Grafana Dashboard" width="800" height="254"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The team now has multiple signals converging on the problem. Blake digs through &lt;strong&gt;&lt;em&gt;shipping logs&lt;/em&gt;&lt;/strong&gt; again and sees  &lt;strong&gt;&lt;em&gt;errors around MySQL connections&lt;/em&gt;&lt;/strong&gt;. At this time, the &lt;strong&gt;Ratings&lt;/strong&gt; service also starts throwing 5xx errors – the problem is now getting compounded.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--6BiXxvPY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/dihwj1dp0icplqkru2ir.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--6BiXxvPY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/dihwj1dp0icplqkru2ir.png" alt="Shipping service logs showing JDBC connection timeouts via Loki Logs" width="800" height="345"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--eeihT3ig--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/uidodbpu6ip4r4vn3wz7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--eeihT3ig--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/uidodbpu6ip4r4vn3wz7.png" alt="84% Request failures in Ratings service via Grafana Dashboard" width="800" height="270"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Product Manager (PM) says their customer support team is reporting frustration from more and more users who are unable to see the shipping status of the orders they have already paid for and who are supposed to get the deliveries that day. Users who just logged in are unable to see product ratings and are refreshing the pages multiple times to see if the information they want is available. &lt;/p&gt;

&lt;p&gt;“If customers can’t make purchase decisions quickly, they’ll go to our competitors,” the PM informs the team.&lt;/p&gt;

&lt;p&gt;Blake looks at the &lt;strong&gt;&lt;em&gt;PassthroughCluster node&lt;/em&gt;&lt;/strong&gt; on &lt;strong&gt;Kiali&lt;/strong&gt;, and it hits them: it’s the RDS instance. &lt;strong&gt;The platform team had forgotten to add RDS as an External Service in their Istio configuration.&lt;/strong&gt; It was an honest oversight that could cost Robot-Shop significant revenue today.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--NlDuvlOU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ht7zf6s8n5u4k145m824.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--NlDuvlOU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ht7zf6s8n5u4k145m824.png" alt="Kiali dashboard showing Shipping and Ratings connecting to unknown external service as PassThroughCluster" width="800" height="241"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;“I think MySQL is unable to handle new connections for some reason,” Blake says.&lt;/strong&gt; They pull up the MySQL metrics dashboards and look at the number of Database Connections. It has gone up significantly and then flattened. &lt;strong&gt;“Why don’t we have an alert threshold here? It seems like we might have maxed out the MySQL connection pool!”&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--bvkBxzUd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6q7t2xohdyl4cj6fhq6m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--bvkBxzUd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6q7t2xohdyl4cj6fhq6m.png" alt="Large uptick in MySQL Database connection count via Grafana Dashboard" width="800" height="384"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To verify their hypothesis, Blake looks at the Parameter Group for the RDS Instance. It uses the default-mysql-5.7 Parameter group, and max_connections is set to:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;{DBInstanceClassMemory/12582880}&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;But what does that number really mean? Blake decides not to waste time checking the RDS instance type and computing the number. Instead, they log into the RDS instance with mysql-cli and run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="o"&gt;#&lt;/span&gt;&lt;span class="n"&gt;mysql&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;SHOW&lt;/span&gt; &lt;span class="n"&gt;VARIABLES&lt;/span&gt; &lt;span class="k"&gt;LIKE&lt;/span&gt; &lt;span class="nv"&gt;"max_connections"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--eLSITxCz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qy6ki4b84c5wsh6andta.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--eLSITxCz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qy6ki4b84c5wsh6andta.png" alt="MySQL query output showing max_connections" width="545" height="206"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then Blake runs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="o"&gt;#&lt;/span&gt;&lt;span class="n"&gt;mysql&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;SHOW&lt;/span&gt; &lt;span class="n"&gt;processlist&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;“&lt;strong&gt;I need to know exactly how many,&lt;/strong&gt;" Blake thinks, and runs:&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="o"&gt;#&lt;/span&gt;&lt;span class="n"&gt;mysql&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;host&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;information_schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;processlist&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--0Wqxivh_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8qk93mukhulnikfh3len.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--0Wqxivh_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8qk93mukhulnikfh3len.png" alt="MySQL query output showing connected processes" width="800" height="805"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The count exceeds max_connections. Their hypothesis is now validated: Blake sees &lt;strong&gt;&lt;em&gt;a lot of connections sitting in &lt;code&gt;sleep()&lt;/code&gt; mode for more than ~1000 seconds, all created by the shipping user.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;
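&lt;p&gt;What Blake spotted amounts to a simple filter over the process list: long-idle connections owned by one user. A minimal sketch with made-up rows (the IDs and times are illustrative, not taken from the incident):&lt;/p&gt;

```python
# Hypothetical rows mimicking SHOW PROCESSLIST output:
# (id, user, command, time in seconds).
processlist = [
    (101, "shipping", "Sleep", 1423),
    (102, "shipping", "Sleep", 1187),
    (103, "ratings",  "Query", 3),
    (104, "shipping", "Query", 1),
]

# Connections idle (command 'Sleep') beyond a threshold are the leak:
# each one holds a slot in the connection pool without doing any work.
IDLE_THRESHOLD_S = 1000
stale = [pid for pid, user, cmd, t in processlist
         if user == "shipping" and cmd == "Sleep" and t > IDLE_THRESHOLD_S]
print(stale)  # [101, 102]
```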

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--HU1ITv7O--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1dbbu7pxesj1zzsi9x0d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--HU1ITv7O--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1dbbu7pxesj1zzsi9x0d.png" alt="Affected Subsystems of Robot-shop" width="535" height="606"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;center&gt;(Figure 13: Affected Subsystems of Robot-shop)&lt;/center&gt;

&lt;p&gt;“I think we have it,” Blake says, &lt;strong&gt;&lt;em&gt;“Shipping is not properly handling connection timeouts with the DB; it’s not refreshing its unused connection pool.”&lt;/em&gt;&lt;/strong&gt; The backend engineer pulls up the Java JDBC datasource code for shipping and says that it’s using defaults for max-idle, max-wait, and various other Spring datasource configurations. “&lt;strong&gt;These need to be fixed,”&lt;/strong&gt; they say.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--kHft0G0o--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/q1e96p2jvnti1x29f6sc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--kHft0G0o--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/q1e96p2jvnti1x29f6sc.png" alt="Code snippet showing shipping service JDBC connector function" width="800" height="568"&gt;&lt;/a&gt;&lt;br&gt;
“That would need significant time,” the PM responds, “and we need to mitigate this incident ASAP. We cannot have unhappy customers.” &lt;/p&gt;

&lt;p&gt;Blake knows that RDS has a stored procedure to kill idle/bad processes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;mysql&lt;/span&gt;&lt;span class="o"&gt;#&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;CALL&lt;/span&gt; &lt;span class="n"&gt;mysql&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rds_kill&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;processID&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Blake tests this out and asks Robin to quickly write a bash script to kill all idle processes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;

&lt;span class="c"&gt;# MySQL connection details&lt;/span&gt;
&lt;span class="nv"&gt;MYSQL_USER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&amp;lt;user&amp;gt;"&lt;/span&gt;
&lt;span class="nv"&gt;MYSQL_PASSWORD&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&amp;lt;passwd&amp;gt;"&lt;/span&gt;
&lt;span class="nv"&gt;MYSQL_HOST&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&amp;lt;rds-name&amp;gt;.&amp;lt;id&amp;gt;.&amp;lt;region&amp;gt;.rds.amazonaws.com"&lt;/span&gt;

&lt;span class="c"&gt;# Get process list IDs&lt;/span&gt;
&lt;span class="nv"&gt;PROCESS_IDS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nv"&gt;MYSQL_PWD&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$MYSQL_PASSWORD&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; mysql &lt;span class="nt"&gt;-h&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$MYSQL_HOST&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;-u&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$MYSQL_USER&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;-N&lt;/span&gt; &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s2"&gt;"SELECT ID FROM INFORMATION_SCHEMA.PROCESSLIST WHERE USER='shipping'"&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;for &lt;/span&gt;ID &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="nv"&gt;$PROCESS_IDS&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do 
    &lt;/span&gt;&lt;span class="nv"&gt;MYSQL_PWD&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$MYSQL_PASSWORD&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; mysql &lt;span class="nt"&gt;-h&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$MYSQL_HOST&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;-u&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$MYSQL_USER&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s2"&gt;"CALL mysql.rds_kill(&lt;/span&gt;&lt;span class="nv"&gt;$ID&lt;/span&gt;&lt;span class="s2"&gt;)"&lt;/span&gt;
    &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Terminated connection with ID &lt;/span&gt;&lt;span class="nv"&gt;$ID&lt;/span&gt;&lt;span class="s2"&gt; for user 'shipping'"&lt;/span&gt;
&lt;span class="k"&gt;done&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The team runs this immediately, and the connection pool frees up for the moment. Everyone lets out a visible sigh of relief. “&lt;strong&gt;&lt;em&gt;But this won’t hold for long; we need a hotfix for&lt;/em&gt;&lt;/strong&gt; DataSource handling in &lt;strong&gt;&lt;em&gt;Shipping&lt;/em&gt;&lt;/strong&gt;,” Blake says. The backend engineer says they are on it, and soon they have a patch ready that adds better defaults for&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="n"&gt;spring&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;datasource&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;max&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;active&lt;/span&gt;
&lt;span class="n"&gt;spring&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;datasource&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;max&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;age&lt;/span&gt;
&lt;span class="n"&gt;spring&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;datasource&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;max&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;idle&lt;/span&gt;
&lt;span class="n"&gt;spring&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;datasource&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;max&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;lifetime&lt;/span&gt;
&lt;span class="n"&gt;spring&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;datasource&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;max&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;open&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;prepared&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;statements&lt;/span&gt; 
&lt;span class="n"&gt;spring&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;datasource&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;max&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;wait&lt;/span&gt;
&lt;span class="n"&gt;spring&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;datasource&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;maximum&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;pool&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;
&lt;span class="n"&gt;spring&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;datasource&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;min&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;evictable&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;idle&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;millis&lt;/span&gt; 
&lt;span class="n"&gt;spring&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;datasource&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;min&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;idle&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
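&lt;p&gt;A minimal sketch of what such overrides might look like in a Spring &lt;code&gt;application.properties&lt;/code&gt; file (the values are illustrative assumptions, not tuned recommendations; the right numbers depend on the max_connections budget shared across all services using the database):&lt;/p&gt;

```
# Illustrative pool sizing only; tune against the shared max_connections budget.
spring.datasource.max-active=50
spring.datasource.max-idle=10
spring.datasource.min-idle=5
spring.datasource.max-wait=10000
spring.datasource.min-evictable-idle-time-millis=60000
spring.datasource.max-age=1800000
```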



&lt;p&gt;The team approves and deploys the hotfix, finally mitigating a ~30-minute-long incident.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;p&gt;Incidents like this can occur in any organization with a sufficiently complex architecture involving microservices written in different languages and frameworks, datastores, queues, caches, and cloud native components. A lack of end-to-end architectural understanding, compounded by information silos, only lengthens mitigation timelines.&lt;/p&gt;

&lt;p&gt;During the RCA, the team found they had to improve on multiple fronts.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Frontend code had long timeouts and allowed for large latencies in API responses.&lt;/li&gt;
&lt;li&gt;The L1 Engineer did not have an end-to-end understanding of the whole architecture.&lt;/li&gt;
&lt;li&gt;The service mesh dashboard on Kiali did not show External Services correctly, causing confusion.&lt;/li&gt;
&lt;li&gt;RDS MySQL database metrics dashboards did not send an early alert, as no max_connection (alert) or high_number_of_connections (warning) thresholds were set.&lt;/li&gt;
&lt;li&gt;The database connection code was written with the assumption that sane defaults for connection pool parameters were good enough, which proved incorrect.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The pressure to resolve incidents quickly, which often comes from peers, leadership, and members of affected teams, only adds to the chaos of incident management and causes more human errors. Coordinating incidents through a dedicated Incident Commander role has produced more controllable outcomes for organizations around the world. An Incident Commander assumes responsibility for managing resources, planning, and communications during a live incident, effectively reducing conflict and noise.&lt;/p&gt;

&lt;p&gt;When multiple stakeholders are affected by an incident, resolutions need to be handled in order of business priority, working on immediate mitigations first, then getting the customer experience back at nominal levels, and only afterward focusing on long-term preventions. Coordinating these priorities across stakeholders is one of the most important functions of an Incident Commander.&lt;/p&gt;

&lt;p&gt;Troubleshooting a complex architecture remains challenging. However, with the &lt;a href="https://handbook.gitlab.com/handbook/customer-success/professional-services-engineering/workflows/internal/root-cause-analysis/"&gt;Blameless RCA Framework&lt;/a&gt; coupled with periodic metric reviews, a team can focus on incremental but constant improvements to its system observability. The team can also convert every successful resolution into a playbook for L1 SREs and support teams, ensuring that similar errors are handled well in the future.&lt;/p&gt;
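
&lt;p&gt;As one concrete outcome of such a metric review, the missing connection-count alert could be added as a CloudWatch alarm on the RDS DatabaseConnections metric. The alarm name, instance identifier, threshold, and SNS topic below are hypothetical placeholders, not values from the incident:&lt;/p&gt;

```shell
# Warn when connections stay at or above 80 for three consecutive minutes,
# giving the on-call engineer lead time before max_connections is exhausted.
aws cloudwatch put-metric-alarm \
  --alarm-name orders-db-high-connections \
  --namespace AWS/RDS \
  --metric-name DatabaseConnections \
  --dimensions Name=DBInstanceIdentifier,Value=orders-db \
  --statistic Maximum \
  --period 60 \
  --evaluation-periods 3 \
  --threshold 80 \
  --comparison-operator GreaterThanOrEqualToThreshold \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:oncall-alerts
```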

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--guKmVLdR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/x8zkyjnlzbmukl9amp32.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--guKmVLdR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/x8zkyjnlzbmukl9amp32.png" alt="feedback loop" width="337" height="352"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A concerted effort around a clear feedback loop of &lt;strong&gt;Incident -&amp;gt; Resolution -&amp;gt; RCA -&amp;gt; Playbook Creation&lt;/strong&gt; eventually rids the system of most unknown-unknowns, allowing teams to focus on product development instead of chaotic incident handling.&lt;/p&gt;

&lt;h2&gt;That’s a wrap&lt;/h2&gt;

&lt;p&gt;We hope you enjoyed this story of a hypothetical but complex troubleshooting scenario. We see incidents like this, and more, across the various clients we work with at InfraCloud. The &lt;a href="https://github.com/infracloudio/sre-stack/tree/main/scenarios/scenario-02"&gt;above scenario can be reproduced using our open source repository&lt;/a&gt;. We are working on adding more such reproducible production outages and their mitigations to this repository.&lt;/p&gt;

&lt;p&gt;We would love to hear from you about your own 3 am incidents. If you have any questions, you can connect with me on &lt;a href="https://twitter.com/hashfyre"&gt;Twitter&lt;/a&gt; and &lt;a href="https://in.linkedin.com/in/joybhattacherjee"&gt;LinkedIn&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;References&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/mysql-stored-proc-ending.html"&gt;Ending a session or query - Amazon Relational Database Service&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://handbook.gitlab.com/handbook/customer-success/professional-services-engineering/workflows/internal/root-cause-analysis/"&gt;Blameless Root Cause Analyses - The GitLab Handbook&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.baeldung.com/spring-boot-tomcat-connection-pool"&gt;Configuring a Tomcat Connection Pool in Spring Boot - Baeldung&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.spring.io/spring-framework/reference/data-access/jdbc/connections.html"&gt;Controlling Database Connections: Spring Framework&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://istio.io/latest/docs/tasks/traffic-management/egress/egress-control/"&gt;Istio / Accessing External Services&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://istio.io/v1.12/blog/2019/monitoring-external-service-traffic/"&gt;Monitoring Blocked and Passthrough External Service Traffic&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/CHAP_Limits.html#RDS_Limits.MaxConnections"&gt;Quotas and constraints for Amazon RDS - Amazon Relational Database Service&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.atlassian.com/incident-management/incident-response/incident-commander#2-why-do-teams-need-an-incident-commander"&gt;The role of the incident commander - Atlassian&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>rca</category>
      <category>troubleshooting</category>
      <category>sre</category>
      <category>incidentresponse</category>
    </item>
  </channel>
</rss>
