<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Dipesh Mittal</title>
    <description>The latest articles on DEV Community by Dipesh Mittal (@dimittal).</description>
    <link>https://dev.to/dimittal</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F918604%2Ffe61289e-57ac-4af1-bd48-78ecb9a74301.jpg</url>
      <title>DEV Community: Dipesh Mittal</title>
      <link>https://dev.to/dimittal</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/dimittal"/>
    <language>en</language>
    <item>
      <title>Mistakes to avoid in Observability</title>
      <dc:creator>Dipesh Mittal</dc:creator>
      <pubDate>Fri, 21 Apr 2023 11:13:00 +0000</pubDate>
      <link>https://dev.to/drdroid/mistakes-to-avoid-in-observability-2cgd</link>
      <guid>https://dev.to/drdroid/mistakes-to-avoid-in-observability-2cgd</guid>
      <description>&lt;h4&gt;
  
  
  TABLE OF CONTENTS:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Common Mistakes in Observability&lt;/li&gt;
&lt;li&gt;
Not building context of your product

&lt;ul&gt;
&lt;li&gt;&lt;a href="//#chapter-2.1"&gt;Not tracking what the customer sees&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="//#chapter-2.2"&gt;Following the same metric sampling rate and thresholds 
across services:&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
Having a setup that is hard to investigate / triage

&lt;ul&gt;
&lt;li&gt;&lt;a href="//#chapter-3.1"&gt;Only tracking what the customer sees&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="//#chapter-3.2"&gt;Lack of instrumentation guidelines for new services&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="//#chapter-3.3"&gt;Not setting up tracing&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="//#chapter-3.4"&gt;Inaccessible or hard-to-find data&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="//#chapter-3.5"&gt;Adopting tools without alignment / Using too many tools&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
Creating fatigue instead of actionable insights

&lt;ul&gt;
&lt;li&gt;&lt;a href="//#chapter-4.1"&gt;Too many dashboards create too much noise&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="//#chapter-4.2"&gt;Only having time-series graph-based dashboards&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="//#chapter-4.3"&gt;Too many alerts&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
Cultural Gaps

&lt;ul&gt;
&lt;li&gt;&lt;a href="//#chapter-5.1"&gt;Positioning observability as tool to use during issues and incidents only&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="//#chapter-5.2"&gt;Having a single point of failure for the observability tool&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The capabilities of observability tools to manage crisis situations have significantly improved over the past decade. Despite that, many issues that could have been avoided with good observability still end up in production.&lt;/p&gt;

&lt;p&gt;Drawing on our experience, and inspired by inputs from experts like &lt;a href="https://www.linkedin.com/in/stephentownshend/"&gt;Stephen&lt;/a&gt; &amp;amp; &lt;a href="https://www.linkedin.com/in/soumyadeepmukherjee/"&gt;Soumyadeep&lt;/a&gt;, this article covers some common, avoidable mistakes in observability:&lt;/p&gt;

&lt;h2&gt;
  
  
  Common Mistakes in Observability&lt;a&gt;&lt;/a&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Not building context of your product:&lt;a&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;h4&gt;
  
  
  1. Not tracking what the customer sees:&lt;a&gt;&lt;/a&gt;
&lt;/h4&gt;

&lt;p&gt;It is easier to set up monitoring at the host level from your observability data, but host metrics don’t tell you the customer’s perspective. Set up metrics that give you insight into how the customer would be impacted.&lt;/p&gt;

&lt;p&gt;For example, in my previous job, we created custom dashboards to monitor the API response time our clients experienced and the success rate of our supply-demand matching algorithm, since both were direct indicators of our end customers’ experience.&lt;/p&gt;
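&lt;p&gt;As an illustration of the idea (not our exact setup), customer-facing indicators like a p95 response time or a matching success rate can be computed from raw samples; the numbers below are made up:&lt;/p&gt;

```python
import math

# Illustrative sketch: computing customer-facing indicators from raw samples.
# The data, and the choice of p95, are assumptions for the example.

def percentile(samples, pct):
    """Nearest-rank percentile of a list of numbers."""
    ordered = sorted(samples)
    rank = max(0, math.ceil(pct / 100.0 * len(ordered)) - 1)
    return ordered[rank]

def success_rate(outcomes):
    """Fraction of successful outcomes, e.g. matches served correctly."""
    return sum(1 for ok in outcomes if ok) / len(outcomes)

response_times_ms = [120, 95, 340, 110, 105, 980, 130, 125, 101, 99]
print("p95 response time:", percentile(response_times_ms, 95), "ms")
print("success rate:", success_rate([True, True, False, True, True]))
```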

&lt;p&gt;The flip side is, what if you track &lt;strong&gt;only what the customer sees&lt;/strong&gt;? Continue reading to see how that could be problematic too.&lt;/p&gt;

&lt;h4&gt;
  
  
  2. Following the same metric sampling rate and thresholds across services:&lt;a&gt;&lt;/a&gt;
&lt;/h4&gt;

&lt;p&gt;When setting up alerts, spend time identifying the sweet spot for the metric sampling rate and the thresholds in them; these can vary with your users’ requirements and use case. Business-critical flows warrant stringent thresholds, while internal tools can have relatively lenient ones.&lt;/p&gt;
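&lt;p&gt;A minimal sketch of per-service thresholds, with hypothetical service names and numbers; a real setup would live in your alerting tool’s configuration rather than application code:&lt;/p&gt;

```python
# Illustrative sketch: per-service alert thresholds instead of one global
# value. The service names and numbers are hypothetical.

THRESHOLDS = {
    # business-critical flow: tight latency budget, paged quickly
    "checkout-api": {"p99_latency_ms": 500, "error_rate_pct": 0.5},
    # internal tool: relatively lenient thresholds
    "admin-dashboard": {"p99_latency_ms": 3000, "error_rate_pct": 5.0},
}

def breached(service, metric, observed):
    """Return True when an observed value exceeds the service's threshold."""
    return observed > THRESHOLDS[service][metric]

print(breached("checkout-api", "p99_latency_ms", 800))     # tight budget
print(breached("admin-dashboard", "p99_latency_ms", 800))  # lenient budget
```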

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--euRwlZD9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/o3ukn0p2d2a6c4abtvqf.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--euRwlZD9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/o3ukn0p2d2a6c4abtvqf.jpg" width="800" height="200"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Having a setup that is hard to investigate / triage:&lt;a&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;h4&gt;
  
  
  1. Only tracking what the customer sees:&lt;a&gt;&lt;/a&gt;
&lt;/h4&gt;

&lt;p&gt;High-level metrics (e.g. response time) are useful for seeing overall health, but when one of them goes bad, your team will immediately need to peek into more detailed metrics (e.g. CPU / memory / IOPS) to find the root cause faster.&lt;/p&gt;

&lt;p&gt;Having deeper, second-order dashboards alongside an overall health dashboard helps the team investigate faster.&lt;/p&gt;

&lt;h4&gt;
  
  
  2. Lack of instrumentation guidelines for new services:&lt;a&gt;&lt;/a&gt;
&lt;/h4&gt;

&lt;p&gt;When instrumentation is not done at the source, or is done differently in different parts of the code, finding the root cause of misbehaviour becomes harder: search queries become difficult with inconsistent logs, and monitoring becomes difficult with high variance in which metrics are tracked.&lt;/p&gt;

&lt;p&gt;It is recommended to share common instructions with the team on how to instrument (logs, metrics and traces) any new service or component.&lt;/p&gt;
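&lt;p&gt;One way such a guideline can be enforced is a shared logging helper, so every service emits the same envelope; the field names here (service, level, message) are just example conventions:&lt;/p&gt;

```python
import json
import time

# Illustrative sketch: a shared helper so every service logs with the same
# fields, making cross-service search queries consistent.

def log_event(service, level, message, **fields):
    record = {
        "ts": time.time(),
        "service": service,
        "level": level,
        "message": message,
    }
    record.update(fields)          # service-specific context, same envelope
    line = json.dumps(record, sort_keys=True)
    print(line)
    return line

log_event("payments", "ERROR", "charge failed", order_id="o-123", retry=2)
```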

&lt;h4&gt;
  
  
  3. Not setting up tracing:&lt;a&gt;&lt;/a&gt;
&lt;/h4&gt;

&lt;p&gt;Do you follow a microservices architecture with multiple components calling each other? Tracing enables you to follow a particular request within your code and across services. Set up tracing (at least) on your most critical product flows. It will save you crucial triaging time, especially during SEV0 / P0 incidents.&lt;/p&gt;
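&lt;p&gt;A toy illustration of the core idea (not a real tracing library): a trace id minted at the edge travels with every downstream call, so all the work done for one request can be stitched together later:&lt;/p&gt;

```python
import uuid

# Toy illustration (not a real tracing library): the same trace id is passed
# along with every downstream call, so all spans for one request share it.

def handle_order(trace_id):
    return [(trace_id, "order-service", "validate cart")] + charge_card(trace_id)

def charge_card(trace_id):
    return [(trace_id, "payment-service", "charge card")]

trace_id = uuid.uuid4().hex
spans = [(trace_id, "api-gateway", "receive request")] + handle_order(trace_id)
for tid, service, step in spans:
    print(tid[:8], service, step)
```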

&lt;h4&gt;
  
  
  4. Inaccessible or hard-to-find data&lt;a&gt;&lt;/a&gt;
&lt;/h4&gt;

&lt;p&gt;Limiting access to observability data and creating bottlenecks around it leads to data silos for engineers trying to understand systems that are deeply interconnected. Democratic access to observability data empowers teams to triage faster, without needing assistance.&lt;/p&gt;

&lt;h4&gt;
  
  
  5. Adopting tools without alignment / Using too many tools:&lt;a&gt;&lt;/a&gt;
&lt;/h4&gt;

&lt;p&gt;Data scattered across multiple dashboards (e.g. logs split across server files, CloudWatch &amp;amp; Kibana) creates an artificial need for context switching and slows down the investigation process. Additionally, mandating tools without alignment from the engineers who will use them can lead to poor adoption and, hence, difficulty in investigations.&lt;/p&gt;

&lt;h3&gt;
  
  
  Creating fatigue instead of actionable insights:&lt;a&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;h4&gt;
  
  
  1. Too many dashboards create too much noise&lt;a&gt;&lt;/a&gt;
&lt;/h4&gt;

&lt;p&gt;Your service is a small portion of the overall product architecture. Standalone dashboards for each service can be avoided when services are closely intertwined or cascade into one another. Combining dashboards for critical flows makes data easier to find and gives a holistic picture of the situation.&lt;/p&gt;

&lt;h4&gt;
  
  
  2. Only having time-series graph-based dashboards&lt;a&gt;&lt;/a&gt;
&lt;/h4&gt;

&lt;p&gt;Not every metric needs to be a graph. For values as direct as the error rate or the number of live pods, keep numerical counters in your dashboards as well. That makes them easy to find and absorb for quick action.&lt;/p&gt;

&lt;h4&gt;
  
  
  3. Too many alerts&lt;a&gt;&lt;/a&gt;
&lt;/h4&gt;

&lt;p&gt;While you would want to know about everything that is going wrong with your system, only set up alerts with thresholds you are ready to wake up at 3 AM for. For everything else, rely on dashboards alone, since those issues are not critical for your customer.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cultural Gaps:&lt;a&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;h4&gt;
  
  
  1. Positioning observability as a tool to use only during issues and incidents&lt;a&gt;&lt;/a&gt;
&lt;/h4&gt;

&lt;p&gt;Beyond incident triaging, observability is an important mechanism to help teams understand how their systems behave. It reveals the performance of different requests, APIs and errors, which provides an opportunity to improve the quality of applications.&lt;/p&gt;

&lt;p&gt;Without this, teams accumulate tech debt that becomes too expensive to pay down later. Read more on how to promote observability in your team &lt;a href="https://charity.wtf/2019/12/17/questionable-advice-how-do-i-get-my-team-into-observability/"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h4&gt;
  
  
  2. Having a single point of failure for the observability tool&lt;a&gt;&lt;/a&gt;
&lt;/h4&gt;

&lt;p&gt;Even once your team is trained at using the tool and has adopted it well, you should share knowledge of the set-up process and of how to create dashboards and alerts. That will not only give your team a deeper perspective on how the tool works, it will also remove the dependency on individual developers.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Observability can become your most powerful weapon for driving a data-first culture in your engineering team. Read more about this &lt;a href="https://notes.drdroid.io/building-a-data-driven-engineering-culture-with-good-observability-practices"&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Are there other mistakes your team corrected over the last few years of your observability journey? Share them in the comments below and help others avoid them!&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>observability</category>
      <category>developers</category>
      <category>doctordroid</category>
    </item>
    <item>
      <title>Observability | Simplified</title>
      <dc:creator>Dipesh Mittal</dc:creator>
      <pubDate>Fri, 21 Apr 2023 09:53:00 +0000</pubDate>
      <link>https://dev.to/drdroid/observability-simplified-34m4</link>
      <guid>https://dev.to/drdroid/observability-simplified-34m4</guid>
      <description>&lt;h4&gt;
  
  
  TABLE OF CONTENTS
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
What is Observability?

&lt;ul&gt;
&lt;li&gt;&lt;a href="//#chapter-1.1"&gt;Logs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="//#chapter-1.2"&gt;Metrics&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="//#chapter-1.3"&gt;Traces&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;What is monitoring?&lt;/li&gt;
&lt;li&gt;
How does this work?

&lt;ul&gt;
&lt;li&gt;&lt;a href="//#chapter-3.1"&gt;Ok, but do I need to know how this works? 😬&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Bonus section&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Observability is a term commonly thrown around in developer circles, often coupled with monitoring &amp;amp; alerting. Many popular tools claim to solve your problems end to end, and plenty of debate surrounds the open-source technologies and protocols in this space. This article tries to simplify some of these terms and explain how observability really works.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Observability?&lt;a&gt;&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;Observability is the practice of having data about your system that can help you know the unknown. It doesn’t refer to your metrics dashboards (that is monitoring) or to the alerts you set up. Observability is the process of instrumenting your systems and collecting the data that lets you observe how they behave, stay aware of their health and gather detailed knowledge of how they are working.&lt;/p&gt;

&lt;p&gt;There are 3 common types of data sources (called telemetry data) that help you uncover the truth:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Logs &lt;a&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;If you’re a developer, the first thing you add while testing your code is logs. They can be either system-generated (e.g. by nginx) or manually generated, and they can carry a variety of data that conveys vital information about the execution of the code.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Metrics &lt;a&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Metrics are numerical values quantifying some behavioural aspect of your software, saved in time-series storage so they can be viewed over a period of time. Most software emits metrics, be it your service running on a pod or the k8s cluster itself.&lt;/p&gt;

&lt;p&gt;To put it into context, the throughput (requests per minute) or the average response time of your API calls are metrics you’ll be familiar with and must have noticed in dashboards.&lt;/p&gt;
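&lt;p&gt;As a sketch of how such metrics come to be, here is raw request data being bucketed into per-minute throughput and average latency; the sample data is made up:&lt;/p&gt;

```python
from collections import defaultdict

# Illustrative sketch: turning raw (timestamp_s, latency_ms) samples into
# the two familiar per-minute metrics. The sample data is made up.

samples = [(0, 120), (10, 80), (59, 100), (61, 300), (90, 100)]

per_minute = defaultdict(list)
for ts, latency in samples:
    per_minute[ts // 60].append(latency)   # bucket by minute

for minute in sorted(per_minute):
    latencies = per_minute[minute]
    rpm = len(latencies)
    avg = sum(latencies) / rpm
    print(f"minute {minute}: throughput={rpm} req/min, avg latency={avg:.0f} ms")
```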

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--7dFjPg2H--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bdhud02s281c44roseku.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--7dFjPg2H--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bdhud02s281c44roseku.jpg" width="800" height="515"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Traces &lt;a&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;You can think of traces as a specialized form of logs, designed to give details around the set of steps your “request” took. It splits your entire execution into smaller chunks, including code level logic, DB queries &amp;amp; downstream calls. These executions (called “spans”) are easily identifiable with their names and their prefixes. Common names that you might have seen if your team has already setup traces:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Datastore - DB queries and connection handling steps&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;External - Calls made outside your service over a network protocol like HTTP, MQTT etc.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Function - Code execution within the current program&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Other span names can come up based on your instrumentation agent.&lt;/p&gt;
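&lt;p&gt;A toy span recorder (not a real tracer) that shows the kind of named, timed chunks a trace is made of, using span names along the conventions above:&lt;/p&gt;

```python
import time
from contextlib import contextmanager

# Toy span recorder (not a real tracing library) showing the named, timed
# chunks a trace is made of; span names mirror the conventions above.

spans = []

@contextmanager
def span(name):
    start = time.perf_counter()
    try:
        yield
    finally:
        spans.append((name, time.perf_counter() - start))

with span("Function: handle_request"):
    with span("Datastore: SELECT user"):
        time.sleep(0.01)           # stand-in for a DB query
    with span("External: HTTP call to payments"):
        time.sleep(0.02)           # stand-in for a downstream call

for name, seconds in spans:
    print(f"{name}: {seconds * 1000:.1f} ms")
```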

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--eBPd3Fny--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kps0l3fwv14ugyjfv7v0.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--eBPd3Fny--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kps0l3fwv14ugyjfv7v0.jpg" width="800" height="634"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What is monitoring? &lt;a&gt;&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;Monitoring is the part where you use the telemetry data to set up dashboards and visualisations of metrics you already know you need to track, so you can view the system’s health at any point in time. Observability means having data such that even when you don’t know what you need to track, you can still investigate your system deeply enough.&lt;/p&gt;

&lt;h2&gt;
  
  
  How does this work? &lt;a&gt;&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;Below is a sample flow of what observability looks like when integrated within your microservices architecture.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--DW8e4SGM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vn05fvr7hh0t79y1jmhk.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--DW8e4SGM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vn05fvr7hh0t79y1jmhk.jpg" width="792" height="743"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Instrumentation&lt;/em&gt;&lt;/strong&gt;: Refers to how the telemetry data is generated within the system. Typically, it involves adding a small piece of code/program (instrumentation agent) to your existing code.&lt;/p&gt;

&lt;h3&gt;
  
  
  Ok, but do I need to know how this works? 😬 &lt;a&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;No, not really. You can go ahead with a commercial tool, and all you need to do is follow a couple of lines of setup instructions. All the steps mentioned above are taken care of for you, so the details are abstracted away and you can directly start monitoring your system.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Caveat&lt;/strong&gt;: As your system scales, the cost of the commercial tools will start pinching and you might consider moving to OSS.&lt;/p&gt;

&lt;h2&gt;
  
  
  Bonus section:&lt;a&gt;&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;Over the last couple of years, the term o11y has been gaining popularity as shorthand for the word Observability (e.g. the event &lt;a href="https://o11yfest.org/"&gt;o11yfest&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;How so? Find the output of this code to see:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def word_encoder(word):
    word = word.replace(" ","") #removing spaces
    mid_char_count = len(word) - 2
    encoding = word[0].lower() + str(mid_char_count) + word[-1].lower()
    print(encoding)
    return encoding

word_encoder("observability")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Wondering where this inspiration is coming from? Find the output of these function calls and you’ll know the answer :)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;word_encoder("kubernetes")
word_encoder("Andreessen Horowitz")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Fun fact: these words are “&lt;a href="https://en.wikipedia.org/wiki/Numeronym"&gt;numeronyms&lt;/a&gt;”.&lt;/p&gt;

&lt;p&gt;If you have come across any other jargon that needs to be simplified, &lt;strong&gt;mention them in the comments!&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;We are shortly publishing a comparison of the most relevant open source &amp;amp; commercial tools for observability. If you would like to get a copy of it, sign up below!
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
      <category>webdev</category>
      <category>developers</category>
      <category>monitoring</category>
      <category>doctordroid</category>
    </item>
    <item>
      <title>Observability of APIs in Production Environment</title>
      <dc:creator>Dipesh Mittal</dc:creator>
      <pubDate>Fri, 21 Apr 2023 09:33:00 +0000</pubDate>
      <link>https://dev.to/drdroid/observability-of-apis-in-production-environment-4h06</link>
      <guid>https://dev.to/drdroid/observability-of-apis-in-production-environment-4h06</guid>
      <description>&lt;p&gt;TABLE OF CONTENTS:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;1. Why are we talking about APIs&lt;/li&gt;
&lt;li&gt;2. What does a high performant API mean?&lt;/li&gt;
&lt;li&gt;3. Setting up observability&lt;/li&gt;
&lt;li&gt;4. Setting up monitoring&lt;/li&gt;
&lt;li&gt;
5. API Symptoms &amp;amp; root causes

&lt;ul&gt;
&lt;li&gt;&lt;a href="//#chapter-5.1"&gt;Common API Errors&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="//#chapter-5.2"&gt;Degradation of API latency&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="//#chapter-5.3"&gt;Other Reasons&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
6. Bonus Section

&lt;ul&gt;
&lt;li&gt;&lt;a href="//#chapter-6.1"&gt;Investigation Strategy for APIs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="//#chapter-6.2"&gt;Cheatsheet for fixing errors&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you’re someone who understands instrumentation well, feel free to jump directly to the &lt;a href="https://dev.tourl"&gt;symptoms and investigation&lt;/a&gt; section.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Why are we talking about APIs&lt;a&gt;&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;Software is eating the world. Adoption of &lt;a href="https://techcrunch.com/2015/05/06/apis-fuel-the-software-thats-eating-the-world/"&gt;API-first approach has been one of the key drivers for this fast paced software development.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;APIs are the communication pathways through which programs talk to each other. They have become a powerful tool for abstracting away the underlying implementation of a piece of software, exposing only what the caller needs to interact with.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--8sqFvRUa--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qy8jrtynpve23fqohysj.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--8sqFvRUa--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qy8jrtynpve23fqohysj.jpg" width="500" height="559"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  2. What does a high performant API mean?&lt;a&gt;&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;APIs come with certain promises, like repeatable request/response structures (contracts), a predictable speed of receiving the response (SLAs) and logical outcomes (status codes). Here are four expectations from an API:&lt;/p&gt;

&lt;p&gt;1. Predictable &amp;amp; fast latency&lt;br&gt;
APIs are written for a specific purpose, and that purpose must be fulfilled in a predictable time period. The faster and more predictable your APIs are, the better the caller’s experience.&lt;/p&gt;

&lt;p&gt;2. No errors &amp;amp; logical status codes&lt;br&gt;
Runtime exceptions will cause your code to exit and throw 5xx errors to the client unless they are handled, e.g. by custom middleware. Wherever an exception can occur, clear reasoning for the error and an appropriate status code must be put in place.&lt;/p&gt;

&lt;p&gt;3. Scalability&lt;br&gt;
The performance and behaviour of the API should not change with how much traffic it takes. You can tell your clients/users about upper limits on how much traffic you can handle, but the API should behave consistently below those limits.&lt;/p&gt;

&lt;p&gt;4. Consistent contracts&lt;br&gt;
Abstraction means the caller has no way of knowing whether the structure or the code behind the API has changed. Any change to the API payload, response or behaviour needs to be communicated to the caller explicitly.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--OR_jZ8tw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ugatwalu81xgxfndh4a7.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--OR_jZ8tw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ugatwalu81xgxfndh4a7.jpg" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Why are we talking specifically about production&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--QDlICJl_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mlug9b1gqcoo4j1m8sqe.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--QDlICJl_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mlug9b1gqcoo4j1m8sqe.jpg" width="735" height="500"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Attaching a debugger or running unit tests typically lets you evaluate and test the functionality of your APIs in a staging environment. But that doesn’t replicate the complexities &amp;amp; challenges of the production environment.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Setting up observability&lt;a&gt;&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;To identify whether an API is performing well, we need to observe its behaviour in production. This is done by instrumenting the service code that serves the API and viewing its metrics &amp;amp; traces. If you are new to instrumentation, &lt;a href="https://notes.drdroid.io/observability-instrumentation-monitoring-simplified-meaning"&gt;read more about it here&lt;/a&gt;.&lt;/p&gt;

&lt;h5&gt;
  
  
  Logs
&lt;/h5&gt;

&lt;p&gt;You can write log statements and pass them to your logging framework so they are available for querying later. Smart logs that record which stage of the code your API request has reached, and what the values of key variables are, can yield a lot of insight. Adding a &lt;a href="https://www.rapid7.com/blog/post/2016/12/23/the-value-of-correlation-ids/"&gt;unique identifier to the logs&lt;/a&gt; will help you search them better (especially relevant if you do not have tracing implemented).&lt;/p&gt;
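&lt;p&gt;A minimal sketch of the correlation-id idea using only Python’s standard logging module; the field name corr_id is just an example convention:&lt;/p&gt;

```python
import io
import logging
import uuid

# Illustrative sketch: stamping every log line of one request with the same
# correlation id so they can be searched together later.

buf = io.StringIO()
handler = logging.StreamHandler(buf)
handler.setFormatter(logging.Formatter("%(corr_id)s %(levelname)s %(message)s"))
logger = logging.getLogger("api")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

def handle_request():
    corr_id = uuid.uuid4().hex[:12]
    extra = {"corr_id": corr_id}     # same id on every line of this request
    logger.info("request received", extra=extra)
    logger.info("validation passed", extra=extra)
    logger.info("response sent", extra=extra)
    return corr_id

corr_id = handle_request()
print(buf.getvalue())
```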

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--UxbNzarJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/g038pw8ole2hra72dhis.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--UxbNzarJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/g038pw8ole2hra72dhis.jpg" width="559" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h5&gt;
  
  
  Errors
&lt;/h5&gt;

&lt;p&gt;If you find yourself combing logs for runtime exceptions in an API, Sentry, GlitchTip or an equivalent will save you time: they pinpoint the error causes and stack traces! (Both are open source.)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--e648fLR2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xeli3emjjw0y43n2sqer.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--e648fLR2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xeli3emjjw0y43n2sqer.jpg" width="800" height="497"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h5&gt;
  
  
  Metrics
&lt;/h5&gt;

&lt;p&gt;A quick health check of any API can be done with a quick scan of historical time-series metrics for the following data points:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Traffic / Request count (in requests / minute)&lt;/li&gt;
&lt;li&gt;Latency (in milliseconds)&lt;/li&gt;
&lt;li&gt;Error rate (% error)&lt;/li&gt;
&lt;/ol&gt;
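&lt;p&gt;As a sketch, these three quick-scan numbers can be derived from a window of (status code, latency) observations; the data, and the choice of counting only 5xx as errors, are assumptions for the example:&lt;/p&gt;

```python
# Illustrative sketch: deriving the three quick-scan numbers from a window of
# (status_code, latency_ms) observations; the data is made up, and only 5xx
# responses are counted as errors here.

window = [(200, 110), (200, 95), (500, 40), (200, 130), (404, 60)]
WINDOW_MINUTES = 1

request_count = len(window)
traffic_rpm = request_count / WINDOW_MINUTES
avg_latency_ms = sum(lat for _, lat in window) / request_count
error_rate_pct = 100 * sum(1 for code, _ in window if code >= 500) / request_count

print(f"traffic: {traffic_rpm:.0f} req/min")
print(f"avg latency: {avg_latency_ms:.0f} ms")
print(f"error rate: {error_rate_pct:.0f}%")
```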

&lt;h5&gt;
  
  
  Traces
&lt;/h5&gt;

&lt;p&gt;Traces enable you to see step-wise details of your code during execution. For example, clicking on the DB call within the steps will tell you which query ran and what is its average behaviour as a metric.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--oiELSWJz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3josj4z8xszxzk6xs4zo.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--oiELSWJz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3josj4z8xszxzk6xs4zo.jpg" width="800" height="634"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Metrics and traces can be set up using commercial tools or open-source alternatives (Prometheus / Jaeger). More details on this will be released in a blog shortly.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Setting up monitoring&lt;a&gt;&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;The telemetry data mentioned above still needs to be made available in a UI that is accessible to the user. The following are the two essentials to set up here:&lt;/p&gt;

&lt;p&gt;1. Dashboards - A quick read about &lt;a href="https://sre.google/sre-book/monitoring-distributed-systems/#xref_monitoring_golden-signals"&gt;Golden Signals&lt;/a&gt; will give you an overview of the essential software metrics to track. Here are two tips for making your dashboards effective:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Grouping of metrics: Group metrics at the API, service or product-workflow level depending on the criticality of the API. If it’s a business-critical API (e.g. payment or login), create a dedicated dashboard; if not, it can be part of the service dashboard.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Accessibility: Add links to relevant dashboards in your troubleshooting playbooks and give democratic access to all dashboards to your users.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;2. Alerts - Setting up an optimal alerting system with few false positives is an iterative process that is hard to summarise here. For now, you can read &lt;a href="https://sre.google/workbook/alerting-on-slos/"&gt;this guide&lt;/a&gt; by Google, which explains how to iterate on your alerts.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. API Symptoms &amp;amp; root causes&lt;a&gt;&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;Sooner or later, you are likely to hit a scenario where your API is not performing right. It could be for one of several reasons.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--DACDh14P--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wl93k7uqd67lfck9gsk3.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--DACDh14P--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wl93k7uqd67lfck9gsk3.jpg" width="800" height="554"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Common API Errors&lt;a&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;API errors can largely be classified into two categories: handled or unhandled.&lt;/p&gt;

&lt;h4&gt;
  
  
  Handled Errors
&lt;/h4&gt;

&lt;h5&gt;
  
  
  1. HTTP 400:
&lt;/h5&gt;

&lt;p&gt;For validation failures in the request data of the API, you return a 400 error. If that happens a lot, it means either that callers are sending bad data too frequently or that you have added validations that reject correct requests. How to fix: Check the pull requests of recent releases in that service for a changed validation; or, if you log validation-failure cases, identify which callers are failing the most and ask them to correct their request data.&lt;/p&gt;
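&lt;p&gt;A minimal sketch of the validation pattern described above, with hypothetical field names: reject bad request data with a 400 and a clear reason, so callers can correct themselves:&lt;/p&gt;

```python
# Illustrative sketch of the validation pattern: reject bad request data with
# a 400 and a clear reason. The required fields here are hypothetical.

REQUIRED_FIELDS = ("user_id", "amount")

def handle_payment(payload):
    for field in REQUIRED_FIELDS:
        if field not in payload:
            # log the caller and reason here so frequent offenders are visible
            return 400, {"error": f"missing required field: {field}"}
    if payload["amount"] > 0:
        return 200, {"status": "accepted"}
    return 400, {"error": "amount must be positive"}

print(handle_payment({"user_id": "u1", "amount": 25}))
print(handle_payment({"user_id": "u1"}))
```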

&lt;h5&gt;
  
  
  2. HTTP 401 / 403:
&lt;/h5&gt;

&lt;p&gt;Failed authentication or authorisation results in 401 and 403 errors. If they happen too often, it means either that your authentication tokens are being generated improperly or that the token-checking process is failing. Most often, the access-token storage layer has an issue. How to fix: Using your monitoring tools, check the API that returns the auth token to the user app for errors. If that doesn’t work, pick a sample token from your logs that is failing authentication and trace how the user got it (if your company policy allows it).&lt;/p&gt;

&lt;h5&gt;
  
  
  3. HTTP 404 / 405:
&lt;/h5&gt;

&lt;p&gt;If the endpoint the client is hitting on your service isn’t exposed, you return a 404. If the endpoint exists but the HTTP verb used in the call isn’t supported, you return a 405. Modern web frameworks mostly handle both themselves, so any occurrence points to an incorrectly integrated client. &lt;strong&gt;How to fix&lt;/strong&gt;: Isolate the clients generating these errors in your error monitoring tool and ask them to correct the integration, sharing the correct documentation for the API.&lt;/p&gt;

&lt;h5&gt;
  
  
  4. HTTP 429:
&lt;/h5&gt;

&lt;p&gt;In rarer scenarios, your clients may be exceeding the rate limits you have set on the APIs, and throttling kicks in, returning HTTP status code 429 for each extra hit. This is a practice you follow to protect your servers from being hogged by a few clients. &lt;strong&gt;How to fix&lt;/strong&gt;: Either relax the throttling at your end if your business requires it (as long as your system can handle the load), or ask the client to check at their end why they are making so many requests.&lt;/p&gt;
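&lt;p&gt;Throttling like this is commonly implemented as a token bucket. A minimal sketch (the capacity and refill rate are illustrative, not a recommendation):&lt;/p&gt;

```python
import time

# Hedged sketch of server-side throttling: each client gets `capacity`
# tokens, refilled at `rate` tokens per second; a request arriving with no
# token left would be answered with HTTP 429.
class TokenBucket:
    def __init__(self, capacity, rate):
        self.capacity = capacity
        self.rate = rate
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True   # serve the request
        return False      # return 429 to the caller

bucket = TokenBucket(capacity=2, rate=1.0)
results = [bucket.allow() for _ in range(3)]  # third call exhausts the bucket
```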

&lt;h4&gt;
  
  
  Unhandled Errors
&lt;/h4&gt;

&lt;h5&gt;
  
  
  1. HTTP 500:
&lt;/h5&gt;

&lt;p&gt;When an error your code hasn’t handled occurs, the web framework will typically return a 500 error. It indicates that your code, and the variables it was handling, ended up in a state it couldn’t deal with, such as a NullPointerException. These unhandled errors show up in your error monitoring tools. &lt;strong&gt;How to fix&lt;/strong&gt;: Your error monitoring tool or logs will tell you which line of code is causing the error. It could have been introduced by a new release or by new data flowing in that wasn’t there earlier. Most likely you’ll need to make a code fix or disable the feature that caused the breakage.&lt;/p&gt;
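&lt;p&gt;The catch-all behaviour most frameworks give you can be sketched like this (the endpoint here is hypothetical):&lt;/p&gt;

```python
import logging

# Hedged sketch: an exception the endpoint code did not handle is logged
# with its stack trace and mapped to a 500 response instead of crashing
# the worker. `handler` stands in for any endpoint function.
def dispatch(handler, request):
    try:
        return 200, handler(request)
    except Exception:
        # This log line is what your error monitoring tool picks up and
        # uses to point you at the failing line of code.
        logging.exception("unhandled error while serving request")
        return 500, "internal server error"

def buggy_handler(request):
    return request["user"]["name"]  # KeyError if "user" is missing

status, body = dispatch(buggy_handler, {})
```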

&lt;h5&gt;
  
  
  2. HTTP 502:
&lt;/h5&gt;

&lt;p&gt;If your API is returning 502 errors, a gateway or proxy in front of a downstream server either got an invalid response from it or could not reach it at all, for example because the configured hostname for a downstream API call is incorrect or its DNS resolution is failing. &lt;strong&gt;How to fix&lt;/strong&gt;: Retries in your caller code usually absorb transient network glitches, but don’t add too many retries: if the hostname is genuinely unavailable, they can clog your processing queues. Add logs so that you can identify right away which downstream server in your API context is throwing this error.&lt;/p&gt;
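&lt;p&gt;A bounded retry with backoff, the kind recommended above, might look like this (the downstream call is a stand-in):&lt;/p&gt;

```python
import time

# Hedged sketch: retry a transient failure a couple of times with
# exponential backoff, then give up so a genuinely dead downstream does
# not clog the processing queues. `call_downstream` is hypothetical.
def call_with_retries(call_downstream, attempts=3, backoff=0.01):
    last_error = None
    for attempt in range(attempts):
        try:
            return call_downstream()
        except ConnectionError as exc:
            last_error = exc
            time.sleep(backoff * (2 ** attempt))  # 0.01s, 0.02s, ...
    raise last_error  # surface the failure after the final attempt

calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] >= 3:
        return "ok"          # downstream recovers on the third attempt
    raise ConnectionError("502 from gateway")

result = call_with_retries(flaky)
```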

&lt;h5&gt;
  
  
  3. HTTP 503:
&lt;/h5&gt;

&lt;p&gt;503 errors happen when your service is unavailable to take requests. This can happen when the web container is unable to connect to the application server, or when your load balancer has no healthy targets to serve the requests. &lt;strong&gt;How to fix&lt;/strong&gt;: Check whether your service’s load balancer has healthy targets to send requests to. Most often this happens because health checks to the targets are failing, either because they are too slow or because they have run out of available connections. Adding more targets can solve a connection pool issue, but if the new targets also go unhealthy, it is likely linked to latency degradation of the health-check API.&lt;/p&gt;

&lt;h3&gt;
  
  
  Degradation of API latency
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--naHTBMUf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jofktw4tbf0xbrrkb3j3.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--naHTBMUf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jofktw4tbf0xbrrkb3j3.jpg" width="800" height="632"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Increasing response time is a pain point that comes with growing scale or poorly written code. Either way, the best place to see it is in the traces. The time an API took to respond can be deconstructed into the smaller steps it had to execute, called &lt;a href="https://plumbr.io/blog/monitoring/distributed-tracing-for-dummies"&gt;spans&lt;/a&gt;. By looking at them, you can find the slow-moving parts.&lt;/p&gt;

&lt;p&gt;Typical reasons for API slowness:&lt;/p&gt;

&lt;h5&gt;
  
  
  a. DB queries are taking time
&lt;/h5&gt;

&lt;p&gt;DB call spans tell you the time it took to connect to your DB and run the query. Compare them with the DB spans from a period when the API was working fine. Slowness in these spans could be caused by:&lt;/p&gt;

&lt;p&gt;i) &lt;strong&gt;New code changes&lt;/strong&gt; with inefficient queries (not using the correct index when selecting from a large data set, or fetching datasets that are too big). &lt;strong&gt;How to fix&lt;/strong&gt;: The fastest fix is disabling the feature that issues the query or rolling back your changes. If neither is possible, a quick remediation is introducing new indexes in real time, although that is highly discouraged.&lt;/p&gt;

&lt;p&gt;ii) &lt;strong&gt;DB is under stress&lt;/strong&gt; and queries are taking time (this can be confirmed by checking if all queries to the same DB are taking longer than before or not).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How to fix&lt;/strong&gt;: This could be due to a crunch of different type of resources in the DB. A detailed note on this will be published soon.&lt;/p&gt;

&lt;p&gt;iii) &lt;strong&gt;In relational DBs, the concerned table could be locked&lt;/strong&gt;. If you are writing into a particular table and it is locked by another thread, your query could be slow and eventually time out, depending on your DB setup. &lt;strong&gt;How to fix&lt;/strong&gt;: The queries currently running on the database need to be checked. Different DBs store and expose this data differently; here is how you find it in &lt;a href="https://oracle-base.com/articles/mysql/mysql-identify-locked-tables"&gt;MySQL&lt;/a&gt; and &lt;a href="https://jaketrent.com/post/find-kill-locks-postgres/"&gt;PostgreSQL&lt;/a&gt;. The session running the locking query must be killed. These steps can most likely only be performed by your DevOps or DBA team.&lt;/p&gt;

&lt;h5&gt;
  
  
  b. External API call is taking time
&lt;/h5&gt;

&lt;p&gt;If your API is making a call synchronously to some other API, your slowness could be due to that. This could be a call to Redis or a broker or some other internal/external API.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How to fix&lt;/strong&gt;: As a caller of APIs anywhere in your code, always set up timeouts to protect your own customers’ experience. You should also look at implementing &lt;a href="https://microservices.io/patterns/reliability/circuit-breaker.html"&gt;circuit breakers&lt;/a&gt; if you depend on many such downstream APIs and can afford them being temporarily unavailable in your product. In any case, reach out to the owner of that API immediately if you can’t rectify it.&lt;/p&gt;
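&lt;p&gt;The circuit breaker pattern mentioned above can be sketched very minimally; this is not a production implementation, and the threshold and cooldown values are illustrative:&lt;/p&gt;

```python
import time

# Hedged sketch: after `threshold` consecutive failures the breaker opens
# and further calls fail fast for `cooldown` seconds, protecting your own
# latency from a slow or dead downstream.
class CircuitBreaker:
    def __init__(self, threshold=3, cooldown=30.0):
        self.threshold = threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = None

    def call(self, fn):
        if self.opened_at is not None:
            elapsed = time.monotonic() - self.opened_at
            if elapsed >= self.cooldown:
                self.opened_at = None  # cooldown over: allow a retry
                self.failures = 0
            else:
                raise RuntimeError("circuit open: failing fast")
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0
        return result

breaker = CircuitBreaker(threshold=2, cooldown=60.0)
def down():
    raise TimeoutError("downstream timed out")

for _ in range(2):
    try:
        breaker.call(down)
    except TimeoutError:
        pass
# The breaker is now open; the next call fails fast without touching downstream.
```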

&lt;h5&gt;
  
  
  c. Code execution is taking time
&lt;/h5&gt;

&lt;p&gt;This happens when your service runs on under-provisioned infrastructure. You can identify it by checking code spans in the request trace; they can be recognised by names starting with 'Function' or your programming language.&lt;/p&gt;

&lt;p&gt;i) &lt;strong&gt;CPU&lt;/strong&gt; - If a server takes more requests than it can handle in terms of CPU cycles, it becomes slower overall as processes fight each other for processing power. &lt;strong&gt;How to fix&lt;/strong&gt;: &lt;em&gt;Robust auto scaling based on CPU&lt;/em&gt; must be set up on your service hosts to make sure no host goes over the tipping point with respect to request traffic. Make sure a host doesn’t accept more traffic than it can handle by fine-tuning the number of connections it can take in.&lt;/p&gt;

&lt;p&gt;ii) &lt;strong&gt;Memory&lt;/strong&gt; - If the processes running on the host are utilising the memory but aren't releasing it, that would make the memory unavailable for other processes to use. Although most modern languages do auto garbage collection for freeing up memory, poorly written code can still cause it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How to fix: &lt;em&gt;Quick remediation for memory issues&lt;/em&gt;&lt;/strong&gt; on hosts is restarting your application process on it, but for long term code changes might be needed to remove the erroneous code. Make sure you use the latest stable version for all third party libraries as they would have been tested well for memory leaks by the authors and the community.&lt;/p&gt;

&lt;h5&gt;
  
  
  d. Insufficient connection pool
&lt;/h5&gt;

&lt;p&gt;Your web containers could be struggling to get connections to your application layer because the connection pool is exhausted. This happens when you have maxed out both the number of connections per host and the number of hosts. It can also happen due to poor configuration even though you could handle more load.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How to fix&lt;/strong&gt;: This also can be solved using auto scaling on your hosts and auto scaling of workers on your hosts up to the limit each can handle. Quick remediation would be addition of more hosts.&lt;/p&gt;

&lt;h2&gt;
  
  
  Other Reasons
&lt;/h2&gt;

&lt;p&gt;Another issue that you might face is that the API is within expected latency, has normal error rate but is not responding as per expectation.&lt;/p&gt;

&lt;p&gt;This usually means some logical change has gone inside the system that has broken the API. Some obvious reasons could be:&lt;/p&gt;

&lt;p&gt;a) For read-only APIs, it could be due to the &lt;em&gt;underlying data being corrupted or missing&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How to fix&lt;/strong&gt;: Check what process inserts/updates that data in the storage. Putting logs in both the insertion and the read API can tell which part isn’t working right. If you don’t have logs, try and make the API call for reading data which exists and should have been returned. The result of this test can help you isolate the problem.&lt;/p&gt;

&lt;p&gt;b) Some &lt;strong&gt;feature flag&lt;/strong&gt; could have caused misbehaviour in the APIs. Lack of proper testing can leave bugs in the new feature or unintended consequences in existing product flows.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How to fix&lt;/strong&gt;: Look to disable the feature or roll back the release entirely to remediate quickly.&lt;/p&gt;

&lt;p&gt;c) Although unlikely, it could be caused by &lt;strong&gt;bad data coming in from your API caller&lt;/strong&gt; due to an issue at their end but is not causing any exception.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How to fix&lt;/strong&gt;: There should be good validations set up on request data and any anomalies in it must be notified to the caller through 4xx status codes or logged for being noticed.&lt;/p&gt;

&lt;p&gt;A very useful way to identify the root cause of incorrect API behaviour is to compare current API traces with past ones. Differences in the spans and their latency tell you which code flow is no longer being taken, or is newly being taken, and that can help you find a pattern in the change happening underneath.&lt;/p&gt;

&lt;h2&gt;
  
  
  6. Bonus Section
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Investigation Strategy for APIs
&lt;/h3&gt;

&lt;p&gt;As you start investigating, what do you check first, and what next? Here’s the mental model I follow to resolve any issue related to API latency.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--xpJ8D_gH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/uzhe9sov8lgfm0cd3zya.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--xpJ8D_gH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/uzhe9sov8lgfm0cd3zya.jpg" width="800" height="578"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Cheatsheet for fixing errors
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--5osCyI7U--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/miffam2wu6v71xk5g4hk.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--5osCyI7U--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/miffam2wu6v71xk5g4hk.jpg" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;How do you debug an API? Tell us about your debugging strategies in the comments below!&lt;/p&gt;

</description>
    </item>
    <item>
      <title>API callback &amp; webhooks monitoring</title>
      <dc:creator>Dipesh Mittal</dc:creator>
      <pubDate>Fri, 21 Apr 2023 07:48:00 +0000</pubDate>
      <link>https://dev.to/drdroid/api-callback-webhooks-monitoring-1pi3</link>
      <guid>https://dev.to/drdroid/api-callback-webhooks-monitoring-1pi3</guid>
      <description>&lt;ul&gt;
&lt;li&gt;The 3 different types of webhooks to monitor&lt;/li&gt;
&lt;li&gt;Important practices to follow when building webhooks&lt;/li&gt;
&lt;li&gt;There are 2 ways you can monitor these webhooks&lt;/li&gt;
&lt;li&gt;How do you set up the monitoring process?&lt;/li&gt;
&lt;li&gt;3 Ways you can fix your webhooks before contacting the application owner&lt;/li&gt;
&lt;li&gt;Maximize the user experience by monitoring webhook callbacks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Webhooks are crucial components of a multi-system architecture. They enable one application to send automated messages or notifications to another application when a certain event occurs, without the need for constant polling. Hence, monitoring them is super important.&lt;/p&gt;

&lt;p&gt;At Doctor Droid, we are enabling API callback monitoring through a context-based linking between events. To try it out, sign up here.&lt;/p&gt;

&lt;p&gt;In this blog, I’m going to walk you through how you can monitor webhooks so that your user experience is never compromised.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 3 different types of webhooks to monitor
&lt;/h2&gt;

&lt;h5&gt;
  
  
  1. FYI webhooks (e.g. SMS got delivered).
&lt;/h5&gt;

&lt;p&gt;These webhooks are stored for future analysis and performance measurement and do not trigger real-time actions.&lt;/p&gt;

&lt;h5&gt;
  
  
  2. Critical webhooks (e.g. Payment completed).
&lt;/h5&gt;

&lt;p&gt;Essential webhooks play a crucial role in enabling product flow and directly impact the customer experience. These types of webhooks typically require immediate attention and should trigger alerts to ensure timely response and resolution.&lt;/p&gt;

&lt;h5&gt;
  
  
  3. Human action webhooks (e.g. third-party vendor updates).
&lt;/h5&gt;

&lt;p&gt;These are vital for monitoring operational activities and may have some indirect impacts on the system's triggers.&lt;/p&gt;

&lt;h2&gt;
  
  
  Important practices to follow when building webhooks
&lt;/h2&gt;

&lt;p&gt;The aim here is to build secure and efficient webhooks to create an application that provides a seamless user experience. Here are some of the best practices for you to follow:&lt;/p&gt;

&lt;h5&gt;
  
  
  1. Keep a fallback for polling the caller.
&lt;/h5&gt;

&lt;p&gt;There could be situations where the calling application is unable to make those calls. It’s recommended to keep a fallback handy: after a certain time from the forward call, poll that application for the same data it would have sent you in the webhook.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--J3h31zLx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ue0e50xp8erue8v9gyo4.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--J3h31zLx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ue0e50xp8erue8v9gyo4.jpg" width="800" height="434"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This makes sure your product journey remains intact. However, if their application is not working then this won’t be useful. You should make a note internally (or alert) if this polling fails to yield the desired output.&lt;/p&gt;
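&lt;p&gt;The fallback can be sketched as a simple rule: use the webhook if it arrived, poll after a deadline, otherwise keep waiting. The status API (&lt;code&gt;fetch_status&lt;/code&gt;) and the deadline value here are hypothetical:&lt;/p&gt;

```python
import time

# Hedged sketch of a polling fallback: if the webhook for a transaction has
# not arrived within `deadline` seconds of the forward call, poll the
# partner's (hypothetical) status API instead of waiting forever.
def resolve_status(txn, received_webhooks, fetch_status, deadline=300):
    if txn["id"] in received_webhooks:
        return received_webhooks[txn["id"]]   # webhook arrived: use it
    age = time.time() - txn["created_at"]
    if age >= deadline:
        return fetch_status(txn["id"])        # fall back to polling
    return "pending"                          # still within the deadline

webhooks = {}
old_txn = {"id": "t1", "created_at": time.time() - 600}
status = resolve_status(old_txn, webhooks, lambda txn_id: "completed")
```

&lt;p&gt;If the poll also fails to yield the desired output, that is the point at which to raise an internal note or alert.&lt;/p&gt;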

&lt;h5&gt;
  
  
  2. Always put validations for request schema coming into your application
&lt;/h5&gt;

&lt;p&gt;The trick here is to only read those fields which are relevant to you and discard the rest. If this integration is important for financial reconciliation in the future, keep a copy of the incoming webhook requests in some persistent storage. This will also help in case of a callback request failure, as you'll have a copy of the original request to refer to.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--6k41eQ6a--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/emv7ku6oft3h0hcer7pe.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--6k41eQ6a--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/emv7ku6oft3h0hcer7pe.jpg" width="800" height="707"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h5&gt;
  
  
  3. Throw alerts for deserialization errors in webhook requests
&lt;/h5&gt;

&lt;p&gt;Since this application is not under your control, there is a rare but real chance they may change their request schema without prior notice. To make sure you are aware when this happens, set up error handling at your deserialization and request validation layer.&lt;/p&gt;

&lt;h5&gt;
  
  
  4. Make the caller aware of your response status
&lt;/h5&gt;

&lt;p&gt;When handling webhooks, it's important to consider how failures are handled, specifically in the case of validation errors or serialization errors. Webhooks are typically called in a "fire-and-forget" mode, meaning that the caller may not be aware of any breakage in the webhook request data.&lt;/p&gt;

&lt;p&gt;This can lead to the receiver suffering from issues without the caller being aware of them. To prevent this, it's crucial to make sure that the caller is aware of the response status to ensure proper communication and handling of errors.&lt;/p&gt;

&lt;h2&gt;
  
  
  There are 2 ways you can monitor these webhooks
&lt;/h2&gt;

&lt;h5&gt;
  
  
  Stateless monitoring
&lt;/h5&gt;

&lt;p&gt;For high-traffic webhooks, like telephony callbacks on SMS delivery, you can measure overall behaviour by counting and analyzing a field in the webhook API request. No need to map it to the original request.&lt;/p&gt;

&lt;h5&gt;
  
  
  Stateful monitoring
&lt;/h5&gt;

&lt;p&gt;To track the entity for which the callback is received, map the callback status from the bank to the transaction initiated via API. Alerts should be set up for missed or delayed webhook calls, as this helps to take further actions to maintain customer experience and product journey. There are two ways you can set up alerts:&lt;/p&gt;

&lt;h5&gt;
  
  
  1. At an individual level
&lt;/h5&gt;

&lt;p&gt;You need to be informed of how long each webhook has been missing. There’s little tolerance for failure, given the severe repercussions, so every failure needs to be reported and investigated.&lt;/p&gt;

&lt;p&gt;For example, in the financial world, the processing of payments can’t be paused or delayed unless there’s a problem at the recipient’s bank, but that also must be known at the earliest possible time.&lt;/p&gt;

&lt;h5&gt;
  
  
  2. At an aggregated level
&lt;/h5&gt;

&lt;p&gt;In this situation, you know some leakage is happening but there’s a tolerance range. To monitor potential leakage within an acceptable range, track webhooks against orders in your system, such as those fulfilled by third-party vendors. Although real-time performance monitoring may not be feasible, you can still ensure that the performance doesn't deteriorate beyond a specified threshold.&lt;/p&gt;

&lt;h2&gt;
  
  
  How do you set up the monitoring process?
&lt;/h2&gt;

&lt;h5&gt;
  
  
  1. For stateless monitoring
&lt;/h5&gt;

&lt;h5&gt;
  
  
  a. Use logs
&lt;/h5&gt;

&lt;p&gt;Add a log for each incoming request and for the field in it that defines success/failure (or whatever state matters to you). Ship these logs to Grafana Loki or an ELK stack and plot the count and trend of that status value. This is limited by the retention period of your logs, so it is not a good fit if you also want to look at much older performance data.&lt;/p&gt;

&lt;h5&gt;
  
  
  b. Store in DB
&lt;/h5&gt;

&lt;p&gt;Put the webhook call data into a DB of your choice as an immutable entry. Plot it using any data visualization tool you like, such as Sisense, Metabase, or Redash. This data can be kept forever: you can archive it to S3 as Parquet files every week and move queries over older data onto an S3 + Athena stack.&lt;/p&gt;

&lt;h5&gt;
  
  
  c. Metrics
&lt;/h5&gt;

&lt;p&gt;Keep a counter for each incoming request and separate counters for each success and failure status. Use Prometheus to scrape them and view them in an observability tool of your choice, such as Grafana, or commercial tools like New Relic or Datadog. This again is limited by data retention. Tools like Chronosphere let you resample past data by coalescing it, so you get longer retention with reduced granularity over time.&lt;/p&gt;
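&lt;p&gt;The counters themselves are trivial; here is a stdlib stand-in for what would, in practice, be Prometheus counters scraped from a /metrics endpoint (metric names are hypothetical):&lt;/p&gt;

```python
from collections import Counter

# Hedged sketch: one counter per incoming webhook request and one per
# terminal status. A Prometheus client library would expose these for
# scraping; a plain Counter shows the bookkeeping.
metrics = Counter()

def record_webhook(status):
    metrics["webhook_requests_total"] += 1
    metrics["webhook_status_total." + status] += 1

for s in ("delivered", "delivered", "failed"):
    record_webhook(s)
```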

&lt;h5&gt;
  
  
  2. For stateful monitoring
&lt;/h5&gt;

&lt;p&gt;There is no easy way to monitor this. You want to keep a reference to your forward transaction when receiving or waiting for the webhook, and map this behaviour onto alerts or charts.&lt;/p&gt;

&lt;h5&gt;
  
  
  a. Using logs
&lt;/h5&gt;

&lt;p&gt;You can log forward calls and incoming webhook calls with a common log message that links them when searching and plotting. Use an ID to represent the entity in question; you can then find both logs, or detect the absence of the webhook, by searching for that ID in any log visualization tool. However, if you want to do this in aggregation, logging solutions fall short.&lt;/p&gt;

&lt;h5&gt;
  
  
  b. Store in DB
&lt;/h5&gt;

&lt;p&gt;You store both forward and webhook transactions in a DB and run periodic queries for webhooks being missed. This can be used to learn of each individual failure (with the periodic query running every few seconds) or in aggregation every ‘X’ minutes, and the result can be plotted on a chart or sent as a notification. These features exist in data visualization tools like Metabase, Superset, and Sisense. You may not want to run these periodic queries on your OLTP DB, so you may want to replicate it, or set up an ETL pipeline into a data lake or warehouse such as Snowflake and run the queries there. Building this in-house is not only tedious but also adds a lot of DevOps overhead, while using a cloud solution for it can be very expensive.&lt;/p&gt;
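&lt;p&gt;The periodic “missed webhook” query is essentially a LEFT JOIN between the forward transactions and the webhooks received. A runnable sketch with SQLite and an illustrative schema (table and column names are hypothetical):&lt;/p&gt;

```python
import sqlite3
import time

# Hedged sketch: find forward transactions older than a cutoff (here 600s)
# that have no matching webhook row. In production this runs periodically
# against a replica or warehouse, not the OLTP primary.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE forward_txn (id TEXT, created_at REAL)")
conn.execute("CREATE TABLE webhook (txn_id TEXT, received_at REAL)")
now = time.time()
conn.execute("INSERT INTO forward_txn VALUES ('t1', ?)", (now - 900,))
conn.execute("INSERT INTO forward_txn VALUES ('t2', ?)", (now - 900,))
conn.execute("INSERT INTO webhook VALUES ('t1', ?)", (now - 880,))  # t2 is missed

missed = conn.execute(
    "SELECT f.id FROM forward_txn f "
    "LEFT JOIN webhook w ON w.txn_id = f.id "
    "WHERE w.txn_id IS NULL AND (? - f.created_at) >= 600",
    (now,),
).fetchall()
```

&lt;p&gt;Each row returned is a candidate for an alert or an aggregated leakage chart.&lt;/p&gt;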

&lt;p&gt;&lt;strong&gt;Quick note:&lt;/strong&gt; Dr Droid specializes in the stateful monitoring of products. Sign up here to get started.&lt;/p&gt;

&lt;h2&gt;
  
  
  3 Ways you can fix your webhooks before contacting the application owner
&lt;/h2&gt;

&lt;p&gt;1. If the number of webhooks is dropping, do the following:&lt;/p&gt;

&lt;p&gt;a. Inspect the forward action that causes webhook calls. Maybe your forward action itself is breaking or not happening, causing the application to not trigger callback requests.&lt;/p&gt;

&lt;p&gt;b. Check for any rate-limiting errors in your nginx logs or any serialization/authentication errors your server might be throwing for the webhook call. These happen less often but can be a reason for breakage.&lt;/p&gt;

&lt;p&gt;2. If the webhook count looks fine but some are missing for specific cases, identify a pattern in the forward calls for which webhooks are missing. That can help you determine whether the application triggering webhooks is failing for cases matching that pattern.&lt;/p&gt;

&lt;p&gt;3. If you still can’t identify why your webhooks are missing, contact the application owner.&lt;/p&gt;

&lt;h2&gt;
  
  
  Maximize the user experience by monitoring webhook callbacks
&lt;/h2&gt;

&lt;p&gt;There are no two ways about it - failing to monitor webhook callbacks will expose you to several application errors that will cripple application performance. Simply follow the steps outlined in this article and you’ll boost your chances of delivering an application that provides a seamless user experience.&lt;br&gt;
If you want a fast and secure way of detecting technical issues before they impact your business, then &lt;a href="https://www.loom.com/share/527944cfeabf49eab231d701f7c8431d"&gt;watch this demo&lt;/a&gt; on how DrDroid can do this for you.&lt;/p&gt;

</description>
      <category>doctordroid</category>
      <category>webdev</category>
      <category>monitoring</category>
      <category>api</category>
    </item>
    <item>
      <title>Some Post</title>
      <dc:creator>Dipesh Mittal</dc:creator>
      <pubDate>Wed, 31 Aug 2022 12:04:09 +0000</pubDate>
      <link>https://dev.to/dimittal/some-post-5dm9</link>
      <guid>https://dev.to/dimittal/some-post-5dm9</guid>
      <description>&lt;p&gt;Want to see how it looks like.&lt;/p&gt;

</description>
      <category>javascript</category>
      <category>webdev</category>
      <category>beginners</category>
      <category>programming</category>
    </item>
  </channel>
</rss>
