<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: gaurang101197</title>
    <description>The latest articles on DEV Community by gaurang101197 (@gaurang101197).</description>
    <link>https://dev.to/gaurang101197</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F474341%2Fbc9ff8e7-4e74-44b7-a2b7-251d4211e296.jpg</url>
      <title>DEV Community: gaurang101197</title>
      <link>https://dev.to/gaurang101197</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/gaurang101197"/>
    <language>en</language>
    <item>
      <title>Resource cap on the non-user-facing workload in ClickHouse</title>
      <dc:creator>gaurang101197</dc:creator>
      <pubDate>Sat, 31 Jan 2026 09:49:29 +0000</pubDate>
      <link>https://dev.to/gaurang101197/resource-cap-on-the-non-user-facing-workload-in-clickhouse-5cn</link>
      <guid>https://dev.to/gaurang101197/resource-cap-on-the-non-user-facing-workload-in-clickhouse-5cn</guid>
      <description>&lt;p&gt;If you are looking for a way to safeguard your critical ClickHouse workload from failures caused by ad hoc, unwanted, and undesired queries, you are in the right place. This blog is for you.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why is it important to cap resource usage for the non-user-facing workload?
&lt;/h2&gt;

&lt;p&gt;ClickHouse is built to use all available resources to execute a query. A single bad query can consume every available resource and impact other critical business queries. We also don't want someone to run an expensive query by mistake (while debugging an issue or performing ad hoc analysis) and degrade the user-facing queries. So it is critical to apply resource usage limits to non-user-facing queries to safeguard our critical workload.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to safeguard the critical workload
&lt;/h2&gt;

&lt;p&gt;It is a best practice to create a separate role and user for each use case in ClickHouse. This makes it easier to manage access and configure different settings per role/user.&lt;/p&gt;

&lt;p&gt;So in this blog, we assume that separate roles/users exist for the user-facing and non-user-facing workloads.&lt;/p&gt;

&lt;p&gt;In practice, we should limit both the number of concurrent queries and the amount of memory a single user can consume. ClickHouse has a limit on the maximum number of concurrent queries it can run at any given time, and it rejects new queries once this limit is breached (even if memory usage is low). So we also want to cap the maximum number of concurrent queries our non-user-facing workload can run.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is a settings profile?
&lt;/h3&gt;

&lt;p&gt;A settings profile is a way to define a group of settings and attach it to a given user or role.&lt;/p&gt;

&lt;h3&gt;
  
  
  Creating a settings profile to limit the memory usage and concurrent number of queries
&lt;/h3&gt;

&lt;p&gt;The query below creates a settings profile that limits max memory usage to 1 GB and the maximum number of concurrent queries to 100 for a given role/user.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Limit max memory usage to 1GB and concurrent query to 100&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;SETTINGS&lt;/span&gt; &lt;span class="n"&gt;PROFILE&lt;/span&gt; &lt;span class="n"&gt;restrict_resource_on_non_user_facing_workload&lt;/span&gt; 
&lt;span class="n"&gt;SETTINGS&lt;/span&gt; &lt;span class="n"&gt;max_memory_usage_for_user&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1000000000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_concurrent_queries_for_user&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt; 
&lt;span class="k"&gt;to&lt;/span&gt; &lt;span class="n"&gt;non_user_facing_role&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- You can alter the settings using below query.&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="n"&gt;SETTINGS&lt;/span&gt; &lt;span class="n"&gt;PROFILE&lt;/span&gt; &lt;span class="n"&gt;restrict_resource_on_non_user_facing_workload&lt;/span&gt; 
&lt;span class="n"&gt;SETTINGS&lt;/span&gt; &lt;span class="n"&gt;max_memory_usage_for_user&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1000000000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_concurrent_queries_for_user&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt; 
&lt;span class="k"&gt;to&lt;/span&gt; &lt;span class="n"&gt;non_user_facing_role&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Below query deleted the provided settings profile&lt;/span&gt;
&lt;span class="k"&gt;DROP&lt;/span&gt; &lt;span class="n"&gt;SETTINGS&lt;/span&gt; &lt;span class="n"&gt;PROFILE&lt;/span&gt; &lt;span class="n"&gt;restrict_resource_on_non_user_facing_workload&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Settings that can be useful
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://clickhouse.com/docs/operations/settings/settings#max_memory_usage_for_user" rel="noopener noreferrer"&gt;max_memory_usage_for_user&lt;/a&gt; - The maximum amount of RAM in bytes to use for running a user's queries on a single server.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://clickhouse.com/docs/operations/settings/settings#max_concurrent_queries_for_user" rel="noopener noreferrer"&gt;max_concurrent_queries_for_user&lt;/a&gt; - The maximum number of simultaneously processed queries per user.

&lt;ul&gt;
&lt;li&gt;Even though the limit on total memory usage gives us a starting point, we should cap the number of concurrent queries as well, because a ClickHouse node can run 1000 queries at a time. So we should also limit the number of concurrent queries run by the non-user-facing workload.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;a href="https://clickhouse.com/docs/operations/settings/settings#max_rows_to_read" rel="noopener noreferrer"&gt;max_rows_to_read&lt;/a&gt; - The maximum number of rows that can be read from a table when running a query. The restriction is checked for each processed chunk of data, applied only to the deepest table expression and when reading from a remote server, checked only on the remote server.

&lt;ul&gt;
&lt;li&gt;This can be skipped if we set a per-user memory limit, which will indirectly restrict the number of rows read. Our goal is to restrict resource usage, and if max_memory_usage_for_user achieves that, we can leave this setting alone.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;a href="https://clickhouse.com/docs/operations/settings/settings#max_bytes_to_read" rel="noopener noreferrer"&gt;max_bytes_to_read&lt;/a&gt; - The maximum number of bytes (of uncompressed data) that can be read from a table when running a query. The restriction is checked for each processed chunk of data, applied only to the deepest table expression and when reading from a remote server, checked only on the remote server.

&lt;ul&gt;
&lt;li&gt;Like max_rows_to_read, this can be skipped if we set a per-user memory limit.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
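&lt;p&gt;Once the profile is in place, you can verify it directly from SQL. A quick sanity check might look like the sketch below (using the profile name from earlier; the &lt;code&gt;system&lt;/code&gt; table column names are per recent ClickHouse versions, so adjust if yours differ):&lt;/p&gt;

```sql
-- Inspect the definition of the profile created above
SHOW CREATE SETTINGS PROFILE restrict_resource_on_non_user_facing_workload;

-- List all settings profiles known to the server
SELECT name, num_elements FROM system.settings_profiles;

-- See each individual setting attached to the profile
SELECT setting_name, value
FROM system.settings_profile_elements
WHERE profile_name = 'restrict_resource_on_non_user_facing_workload';
```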

&lt;h2&gt;
  
  
  Testing
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Memory limit
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="k"&gt;system&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;numbers_mt&lt;/span&gt; &lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;1000000&lt;/span&gt;
&lt;span class="n"&gt;SETTINGS&lt;/span&gt; &lt;span class="n"&gt;max_memory_usage_for_user&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10000&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Expected Error&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User memory limit exceeded: would use 88.41 KiB (attempt to allocate chunk of 0.00 B bytes), maximum: 9.77 KiB. OvercommitTracker decision: Query was selected to stop by OvercommitTracker: While executing AggregatingTransform. 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Concurrent queries
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Run below query from multiple session&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="k"&gt;system&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;numbers_mt&lt;/span&gt; &lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;1000000&lt;/span&gt;
&lt;span class="n"&gt;SETTINGS&lt;/span&gt; &lt;span class="n"&gt;max_concurrent_queries_for_user&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- To get list of running queries by current user&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;running&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="k"&gt;system&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;processes&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="k"&gt;user&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;currentUser&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Expected error&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Error: Too many simultaneous queries for user XYZ. Current: 1, maximum: 1.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
      <category>clickhouse</category>
      <category>dataengineering</category>
      <category>dbadmin</category>
      <category>observability</category>
    </item>
    <item>
      <title>Improving Node app responsiveness using partitioning</title>
      <dc:creator>gaurang101197</dc:creator>
      <pubDate>Mon, 19 Jan 2026 04:18:10 +0000</pubDate>
      <link>https://dev.to/gaurang101197/improving-node-app-responsiveness-using-partitioning-201i</link>
      <guid>https://dev.to/gaurang101197/improving-node-app-responsiveness-using-partitioning-201i</guid>
      <description>&lt;h2&gt;
  
  
  Problem Statement
&lt;/h2&gt;

&lt;p&gt;We have a GET endpoint that fetches documents from the database, transforms them, and returns them. For a few clients we had to fetch a large number of documents, which used to block Node's event loop thread. When the event loop thread is blocked, it causes the problems below:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The app becomes unresponsive to new requests.&lt;/li&gt;
&lt;li&gt;The liveness probe fails and k8s restarts the pod.&lt;/li&gt;
&lt;li&gt;It adds unpredictable delays to small, fast requests.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Restarts became very frequent, and we could not directly add pagination and deprecate this legacy endpoint. So we needed to figure out a short-term solution, requiring few code changes, to prevent the restarts and improve the responsiveness of the application.&lt;/p&gt;

&lt;h2&gt;
  
  
  Solution
&lt;/h2&gt;

&lt;p&gt;There is a concept called &lt;a href="https://nodejs.org/en/learn/asynchronous-work/dont-block-the-event-loop#partitioning" rel="noopener noreferrer"&gt;partitioning&lt;/a&gt;. It is very simple: break your large synchronous processing into smaller tasks, and yield to the event loop between tasks so it can work on other requests in between. Below is simple pseudocode that unblocks the event loop between each task (batch) and lets it serve other requests.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Return a Promise which resolves immediately. But event loop continue processing in next cycle which unblock the event loop and let it work upon other tasks as well.&lt;/span&gt;
&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;yieldToEventLoop&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="k"&gt;void&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="k"&gt;void&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;resolve&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;setImmediate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;resolve&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="nx"&gt;docs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="nx"&gt;smallBatchSize&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;batch&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;docs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;slice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;smallBatchSize&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="c1"&gt;// Do your processing of batch here&lt;/span&gt;
  &lt;span class="nx"&gt;output&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push&lt;/span&gt;&lt;span class="p"&gt;(...&lt;/span&gt;&lt;span class="nf"&gt;processBatch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;

  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;yieldToEventLoop&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
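&lt;p&gt;For completeness, here is a runnable sketch of the same pattern in plain JavaScript. &lt;code&gt;processBatch&lt;/code&gt; is a hypothetical stand-in for the real per-batch transformation:&lt;/p&gt;

```javascript
// Resolve on the next event-loop iteration so other queued work can run first.
function yieldToEventLoop() {
  return new Promise((resolve) => setImmediate(resolve));
}

// Hypothetical stand-in for the real per-batch transformation.
function processBatch(batch) {
  return batch.map((doc) => doc * 2);
}

// Process docs in small batches, yielding to the event loop between batches.
async function processAll(docs, smallBatchSize) {
  const output = [];
  for (let i = 0; i < docs.length; i += smallBatchSize) {
    const batch = docs.slice(i, i + smallBatchSize);
    output.push(...processBatch(batch));
    await yieldToEventLoop();
  }
  return output;
}

// Usage: five documents in batches of two.
processAll([1, 2, 3, 4, 5], 2).then((out) => console.log(out)); // prints [ 2, 4, 6, 8, 10 ]
```

&lt;p&gt;The output is identical to a single synchronous pass; only the scheduling changes, which is exactly the point.&lt;/p&gt;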



&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;It does not make processing faster or remove the CPU cost.&lt;/li&gt;
&lt;li&gt;It improves &lt;strong&gt;responsiveness&lt;/strong&gt; of the application.&lt;/li&gt;
&lt;li&gt;One heavy request can starve the entire Node process. Chunk + yield can keep the event loop alive and provide temporary relief.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;If your server relies heavily on complex calculations, you should think about whether Node.js is really a good fit. Node.js excels for I/O-bound work, but for expensive computation it might not be the best option. &lt;a href="https://nodejs.org/en/learn/asynchronous-work/dont-block-the-event-loop#partitioning" rel="noopener noreferrer"&gt;Reference&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://nodejs.org/en/learn/asynchronous-work/dont-block-the-event-loop" rel="noopener noreferrer"&gt;https://nodejs.org/en/learn/asynchronous-work/dont-block-the-event-loop&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://nodejs.org/en/learn/asynchronous-work/understanding-setimmediate" rel="noopener noreferrer"&gt;https://nodejs.org/en/learn/asynchronous-work/understanding-setimmediate&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>node</category>
      <category>eventloop</category>
      <category>programming</category>
      <category>typescript</category>
    </item>
    <item>
      <title>Alert on counter discontinuation in Grafana</title>
      <dc:creator>gaurang101197</dc:creator>
      <pubDate>Wed, 22 Jan 2025 04:32:47 +0000</pubDate>
      <link>https://dev.to/gaurang101197/alert-on-counter-discontinue-in-grafana-5fap</link>
      <guid>https://dev.to/gaurang101197/alert-on-counter-discontinue-in-grafana-5fap</guid>
      <description>&lt;h2&gt;
  
  
  Requirement
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;We have a counter named &lt;em&gt;heartbeat_count&lt;/em&gt; which indicates whether an application is up or not. It has a label called &lt;em&gt;application&lt;/em&gt;, which is the application name.&lt;/li&gt;
&lt;li&gt;Each application sends this heartbeat metric every 15 seconds.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now, we want to get an alert whenever any application stops pushing the heartbeat metric for 5 minutes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Solution
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;count_over_time()&lt;/code&gt; - this function counts how many times a metric has a value in the given time window. So if an application is sending the metric every 15 seconds, count_over_time(heartbeat_count{application="ABC"}[1m]) gives 4 (the metric has a value 4 times in the last minute, as it is pushed every 15 seconds).&lt;/p&gt;

&lt;p&gt;So, over 10 minutes, &lt;em&gt;count_over_time&lt;/em&gt; should be 40 for an application working fine. We can use this function to send an alert if the counter is missing 20 samples in the last 10 minutes. The query below prints the heartbeat count over the last 10 minutes by application. If the value for any application goes below 20, the counter's value has been missing 20 times, i.e. roughly 5 minutes' worth, in the last 10 minutes (the 5 minutes might not be contiguous, but that is a limitation of this solution).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sum by(application) (count_over_time(heartbeat_count{application!=""}[10m]))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The query below gives us the applications whose heartbeat counter has been missing 20 times in the last 10 minutes, and we can easily set up an alert on it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sum by(application) (count_over_time(heartbeat_count{application!=""}[10m])) &amp;lt; 20
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Limitations
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Whenever a new application starts, an alert is sent because the new application has no counter values for the past 10 minutes. This can be ignored.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Resource
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://prometheus.io/docs/prometheus/latest/querying/functions/#aggregation_over_time" rel="noopener noreferrer"&gt;https://prometheus.io/docs/prometheus/latest/querying/functions/#aggregation_over_time&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>grafana</category>
      <category>prometheus</category>
      <category>alerting</category>
      <category>devops</category>
    </item>
    <item>
      <title>Plotting Histogram Distribution Over Time in Grafana</title>
      <dc:creator>gaurang101197</dc:creator>
      <pubDate>Sat, 10 Aug 2024 08:02:01 +0000</pubDate>
      <link>https://dev.to/gaurang101197/plotting-histogram-distribution-over-time-in-grafana-469n</link>
      <guid>https://dev.to/gaurang101197/plotting-histogram-distribution-over-time-in-grafana-469n</guid>
      <description>&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwnhrv6wqanc65olbzd08.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwnhrv6wqanc65olbzd08.png" alt="Histogram Distribution Over Time"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you are looking to plot a histogram distribution over time as shown in the image above, this blog is for you. It does not cover the internals of histograms or Grafana.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Histogram Distribution Over Time
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;It helps you understand what the distribution looks like over time.&lt;/li&gt;
&lt;li&gt;It is very useful for finding the time period when the distribution skewed.&lt;/li&gt;
&lt;li&gt;While a histogram distribution summarizes the data and is useful for checking system performance at a glance, the distribution over time helps detect the time period when performance degrades.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Pre-requisite
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Internals of histogram: &lt;a href="https://prometheus.io/docs/practices/histograms/" rel="noopener noreferrer"&gt;https://prometheus.io/docs/practices/histograms/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;It is better to have hands-on experience with Prometheus histograms and prior experience with Grafana.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Use-case
&lt;/h3&gt;

&lt;p&gt;Plot the latency distribution over time of any operation, e.g. API latency or DB latency.&lt;/p&gt;

&lt;h3&gt;
  
  
  Setup
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Measure latency metric using Prometheus Histogram.&lt;/li&gt;
&lt;li&gt;Metric name is &lt;code&gt;my_latency_metric&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Histogram buckets used are &lt;code&gt;[0, 80, 160, 320, 640, 1280, 2560, 5120]&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Step 1: Panel visualization
&lt;/h2&gt;

&lt;p&gt;Select &lt;a href="https://grafana.com/docs/grafana/latest/panels-visualizations/visualizations/heatmap/" rel="noopener noreferrer"&gt;Heatmap&lt;/a&gt; in Panel section shown as below image.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm58yuo5507zg2zv9jmc3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm58yuo5507zg2zv9jmc3.png" alt="Heatmap Panel"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 2: Query
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

round(sum by (le) (increase(my_latency_metric_bucket{label_name=~"label_value"}[$__interval])))


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;label_name=~"label_value"&lt;/code&gt; - [Optional] filters the metric data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;increase&lt;/code&gt; - Calculates the increase between two data points. We have used &lt;code&gt;$__interval&lt;/code&gt; so that Grafana automatically supplies an appropriate interval.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Quote from prometheus &lt;a href="https://prometheus.io/docs/prometheus/latest/querying/functions/#increase" rel="noopener noreferrer"&gt;documentation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;increase(v range-vector)&lt;/code&gt; calculates the increase in the time series in the range vector. Breaks in monotonicity (such as counter resets due to target restarts) are automatically adjusted for. &lt;strong&gt;The increase is extrapolated to cover the full time range as specified in the range vector selector, so that it is possible to get a non-integer result even if a counter increases only by integer increments&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;increase&lt;/code&gt; acts on native histograms by calculating a new histogram where each component (sum and count of observations, buckets) is the increase between the respective component in the first and last native histogram in &lt;code&gt;v&lt;/code&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;sum by (le)&lt;/code&gt;: Sums metric values by &lt;code&gt;le&lt;/code&gt; (the histogram bucket label). Suppose you measure the latency of an API deployed on k8s with multiple pods, with the pod id as a label. Each pod emits its own latency data, but we want a picture of the overall deployment, so we need to aggregate the data across all pods; &lt;code&gt;sum by (le)&lt;/code&gt; does exactly this, aggregating the increase from each pod by &lt;code&gt;le&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;round&lt;/code&gt;: As you might know, &lt;code&gt;increase&lt;/code&gt; can return non-integer values, and a non-integer number looks odd for a counter. To avoid this, we use the &lt;code&gt;round&lt;/code&gt; function to convert all values to integers.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Step 3: Query Options
&lt;/h2&gt;

&lt;p&gt;Select &lt;code&gt;heatmap&lt;/code&gt; in Format and type &lt;code&gt;{{le}}&lt;/code&gt; in Legend in the query options, as shown in the image below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl8sqycueujcp5ljy170v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl8sqycueujcp5ljy170v.png" alt="Query Option"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 4: Panel Query Options
&lt;/h2&gt;

&lt;p&gt;Set &lt;code&gt;Min Interval&lt;/code&gt; to twice the scrape interval. In the given example, I have used &lt;code&gt;1m&lt;/code&gt;. &lt;strong&gt;This handles variation in the scrape interval, if any&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz0qcjrea243c387ospwi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz0qcjrea243c387ospwi.png" alt="Panel Query Options"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Reference
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://grafana.com/blog/2020/06/23/how-to-visualize-prometheus-histograms-in-grafana/" rel="noopener noreferrer"&gt;https://grafana.com/blog/2020/06/23/how-to-visualize-prometheus-histograms-in-grafana/&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>grafana</category>
      <category>prometheus</category>
      <category>observability</category>
      <category>histogram</category>
    </item>
    <item>
      <title>Plotting Histogram Distribution In Grafana</title>
      <dc:creator>gaurang101197</dc:creator>
      <pubDate>Sat, 10 Aug 2024 07:29:47 +0000</pubDate>
      <link>https://dev.to/gaurang101197/plotting-histogram-distribution-in-grafana-3eo8</link>
      <guid>https://dev.to/gaurang101197/plotting-histogram-distribution-in-grafana-3eo8</guid>
      <description>&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fimlir1hh354i7v66ua6g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fimlir1hh354i7v66ua6g.png" alt="Histogram Distribution" width="800" height="249"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you are looking to plot a histogram distribution as shown in the image above, this blog is for you. It does not cover the internals of histograms or Grafana.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Histogram Distribution
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;A histogram distribution gives an overview of what the data distribution looks like for the selected period.&lt;/li&gt;
&lt;li&gt;An API latency histogram is incredibly useful for understanding the performance and behavior of an API.&lt;/li&gt;
&lt;li&gt;Range of latency: the histogram distribution shows how latency is spread out across different buckets. This helps us understand the typical range of response times.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Pre-requisite
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Internals of histogram: &lt;a href="https://prometheus.io/docs/practices/histograms/" rel="noopener noreferrer"&gt;https://prometheus.io/docs/practices/histograms/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;It is better to have hands-on experience with Prometheus histograms and prior experience with Grafana.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Use-case
&lt;/h3&gt;

&lt;p&gt;Plot the latency distribution for a selected time period, e.g. API latency or DB latency.&lt;/p&gt;

&lt;h3&gt;
  
  
  Setup
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Measure latency metric using Prometheus Histogram.&lt;/li&gt;
&lt;li&gt;Metric name is &lt;code&gt;my_latency_metric&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Histogram buckets used are &lt;code&gt;[0, 80, 160, 320, 640, 1280, 2560, 5120]&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Step 1: Panel visualization
&lt;/h2&gt;

&lt;p&gt;Select &lt;a href="https://grafana.com/docs/grafana/latest/panels-visualizations/visualizations/bar-gauge/" rel="noopener noreferrer"&gt;Bar Gauge Panel&lt;/a&gt; as panel.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyfhzjhoy02ca1yu4rd31.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyfhzjhoy02ca1yu4rd31.png" alt="Bar gauge" width="798" height="1458"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 2: Query
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;round(sum by (le) (increase(my_latency_metric_bucket{label_name=~"label_value"}[$__interval])))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;label_name=~"label_value"&lt;/code&gt; - [Optional] filters the metric.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;increase&lt;/code&gt; - Calculates the increase between two data points. We have used &lt;code&gt;$__interval&lt;/code&gt; so that Grafana automatically supplies an appropriate interval.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Quote from prometheus &lt;a href="https://prometheus.io/docs/prometheus/latest/querying/functions/#increase" rel="noopener noreferrer"&gt;documentation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;increase(v range-vector)&lt;/code&gt; calculates the increase in the time series in the range vector. Breaks in monotonicity (such as counter resets due to target restarts) are automatically adjusted for. &lt;strong&gt;The increase is extrapolated to cover the full time range as specified in the range vector selector, so that it is possible to get a non-integer result even if a counter increases only by integer increments&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;increase&lt;/code&gt; acts on native histograms by calculating a new histogram where each component (sum and count of observations, buckets) is the increase between the respective component in the first and last native histogram in &lt;code&gt;v&lt;/code&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;sum by (le)&lt;/code&gt;: Sums metric values by &lt;code&gt;le&lt;/code&gt; (the histogram bucket label). Suppose you measure the latency of an API deployed on Kubernetes with multiple pods, with the pod ID as a label. Each pod emits its own latency data, but we want a picture of the overall deployment, so we need to aggregate the data across all pods. &lt;code&gt;sum by (le)&lt;/code&gt; does exactly this: it sums the increase observed in each pod, grouped by &lt;code&gt;le&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;round&lt;/code&gt;: As noted above, &lt;code&gt;increase&lt;/code&gt; can return non-integer values, and a non-integer count looks odd. We use the &lt;code&gt;round&lt;/code&gt; function to convert all values to integers.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
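&lt;p&gt;Putting the pieces together, here is a small pure-Python sketch of what the query computes, using made-up per-pod counter values. Note that the real &lt;code&gt;increase&lt;/code&gt; function also extrapolates over the interval and corrects for counter resets; this toy version ignores both.&lt;/p&gt;

```python
# Cumulative bucket counts per pod at the start and end of the interval,
# keyed by the le label. All numbers are invented for illustration.
start = {
    "pod-a": {"80": 10, "160": 25, "320": 30},
    "pod-b": {"80": 4,  "160": 9,  "320": 12},
}
end = {
    "pod-a": {"80": 13, "160": 31, "320": 40},
    "pod-b": {"80": 6,  "160": 14, "320": 20},
}

# increase(): per-pod difference over the interval, then
# sum by (le): aggregate across pods, then round().
totals = {}
for pod in end:
    for le in end[pod]:
        totals[le] = totals.get(le, 0) + round(end[pod][le] - start[pod][le])

print(totals)  # {'80': 5, '160': 11, '320': 18}
```

Each resulting series (one per &lt;code&gt;le&lt;/code&gt;) becomes one bar in the gauge, which is why the next step sets the legend to &lt;code&gt;{{le}}&lt;/code&gt;.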

&lt;h2&gt;
  
  
  Step 3: Query Options
&lt;/h2&gt;

&lt;p&gt;Select &lt;code&gt;Heatmap&lt;/code&gt; as the Format and type &lt;code&gt;{{le}}&lt;/code&gt; in the Legend field, as shown in the image below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxp07g44z43g5824pellx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxp07g44z43g5824pellx.png" alt="Latency Histogram Query Option" width="800" height="173"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 4: Panel Query Options
&lt;/h2&gt;

&lt;p&gt;Set &lt;code&gt;Min Interval&lt;/code&gt; to twice the scrape interval. In this example, I have used &lt;code&gt;1m&lt;/code&gt;. &lt;strong&gt;This handles any variation in the scrape interval&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz0qcjrea243c387ospwi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz0qcjrea243c387ospwi.png" alt="Panel Query Options" width="800" height="194"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 5: Value options
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;Want to know more? &lt;a href="https://grafana.com/docs/grafana/latest/panels-visualizations/visualizations/bar-gauge/#value-options" rel="noopener noreferrer"&gt;https://grafana.com/docs/grafana/latest/panels-visualizations/visualizations/bar-gauge/#value-options&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Select &lt;code&gt;Total&lt;/code&gt; as the calculation, as shown in the image below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft5cts6gntr4t9bfijdih.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft5cts6gntr4t9bfijdih.png" alt="Bar Gauge Value Option" width="800" height="935"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Reference
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://grafana.com/blog/2020/06/23/how-to-visualize-prometheus-histograms-in-grafana/" rel="noopener noreferrer"&gt;https://grafana.com/blog/2020/06/23/how-to-visualize-prometheus-histograms-in-grafana/&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>grafana</category>
      <category>prometheus</category>
      <category>observability</category>
    </item>
  </channel>
</rss>
