<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Mustafa ERBAY</title>
    <description>The latest articles on DEV Community by Mustafa ERBAY (@merbayerp).</description>
    <link>https://dev.to/merbayerp</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3921203%2Fe3a198a1-49a0-466f-99e6-74bdf202a867.png</url>
      <title>DEV Community: Mustafa ERBAY</title>
      <link>https://dev.to/merbayerp</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/merbayerp"/>
    <language>en</language>
    <item>
      <title>Metric Collection: Push vs. Pull Models - When to Use Which?</title>
      <dc:creator>Mustafa ERBAY</dc:creator>
      <pubDate>Fri, 29 May 2026 15:50:15 +0000</pubDate>
      <link>https://dev.to/merbayerp/metric-collection-push-vs-pull-models-when-to-use-which-81</link>
      <guid>https://dev.to/merbayerp/metric-collection-push-vs-pull-models-when-to-use-which-81</guid>
      <description>&lt;h2&gt;
  
  
  Metric Collection Approaches: The Core Differences
&lt;/h2&gt;

&lt;p&gt;Collecting metrics is crucial for understanding the health and performance of our systems. There are two primary methods for obtaining these metrics: &lt;strong&gt;Push&lt;/strong&gt; and &lt;strong&gt;Pull&lt;/strong&gt;. I've used both models extensively in my own projects and in consulting roles. Which one we choose depends on our infrastructure's structure, scale, and the specific metrics we want to collect.&lt;/p&gt;

&lt;p&gt;In the Push model, the applications or services that produce metrics send them to the collecting system (e.g., a monitoring service) on their own schedule, so the collector never has to query them. In the Pull model, the roles are reversed: the collecting service periodically polls the target systems and requests their current metrics. Pull is quite common, especially in distributed systems and microservice architectures.&lt;/p&gt;
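
&lt;p&gt;To make the direction of data flow concrete, here is a minimal in-memory sketch. The class names &lt;code&gt;Producer&lt;/code&gt; and &lt;code&gt;Collector&lt;/code&gt; are hypothetical and no real network calls are made; the only point is to show which side initiates the transfer in each model.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;class Producer:
    """A service that owns some metric values (hypothetical example class)."""

    def __init__(self, name):
        self.name = name
        self.metrics = {"requests_total": 0}

    def handle_request(self):
        self.metrics["requests_total"] += 1

    def expose(self):
        # Pull model: the producer only exposes its current state on demand.
        return dict(self.metrics)

    def push_to(self, collector):
        # Push model: the producer initiates the transfer itself.
        collector.receive(self.name, dict(self.metrics))


class Collector:
    """A central store for the latest values per producer (hypothetical)."""

    def __init__(self):
        self.store = {}

    def receive(self, source, metrics):
        # Called by producers in the push model.
        self.store[source] = metrics

    def scrape(self, producers):
        # Pull model: the collector walks its target list and asks each one.
        for producer in producers:
            self.store[producer.name] = producer.expose()


api = Producer("api")
worker = Producer("worker")
api.handle_request()
api.handle_request()
worker.handle_request()

pull_collector = Collector()
pull_collector.scrape([api, worker])  # the collector initiates

push_collector = Collector()
api.push_to(push_collector)           # the producers initiate
worker.push_to(push_collector)

print(pull_collector.store == push_collector.store)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Both collectors end up with the same data. What differs is who needs network access to whom, and who decides the timing.&lt;/p&gt;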

&lt;h3&gt;
  
  
  Advantages and Disadvantages of the Push Model
&lt;/h3&gt;

&lt;p&gt;With the Push model, the application or service generating the metrics sends them to a central collection point at its own intervals or when specific events occur. This is often seen in "agent-based" solutions. For example, an application might push its metrics to its own logs or a specific metric database (like InfluxDB with the Telegraf agent).&lt;/p&gt;

&lt;p&gt;The biggest advantage of the Push model is that the metric collector doesn't need to constantly query the metric producers. Each producer decides when to send, so it can use its own resources more efficiently and keep its network traffic under control. Additionally, collecting metrics from systems behind firewalls or NAT becomes easier, because only outbound connections from the producer are required. The main disadvantage is that every producer sends independently, so the central collection system has to accept and manage all of these inbound connections.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;ℹ️ Use Cases for the Push Model&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Push model is particularly beneficial in the following scenarios:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Event-driven systems:&lt;/strong&gt; Sending metrics when a specific event occurs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Environments with network constraints:&lt;/strong&gt; Collecting metrics from systems behind firewalls or with difficult access.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Short-lived services:&lt;/strong&gt; For containers or functions that start and finish within seconds.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Edge devices or IoT:&lt;/strong&gt; When collecting metrics from resource-constrained devices.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Advantages and Disadvantages of the Pull Model
&lt;/h3&gt;

&lt;p&gt;In the Pull model, the main collecting service periodically polls the services that produce and expose metrics. Popular monitoring tools like Prometheus use this model. Prometheus collects metrics by regularly querying configured targets. The biggest advantage of this model is having a central point of control. Which metrics to collect and how often can be managed from a single location.&lt;/p&gt;

&lt;p&gt;A disadvantage of the Pull model is that the metric collecting service must be able to reach all target systems. If a target system is behind a firewall or unreachable, it's impossible to pull its metrics. Furthermore, when there are a large number of target systems, the metric collector can experience significant load. However, this load is generally manageable, and tools like Prometheus are quite successful in terms of scalability.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;💡 Use Cases for the Pull Model&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Pull model is preferred in the following situations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Microservice architectures:&lt;/strong&gt; Each service exposes its own metric endpoint, and a central agent pulls them.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stable and continuously running services:&lt;/strong&gt; Infrastructure where metrics can be regularly pulled.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Detailed and real-time metric tracking:&lt;/strong&gt; Accessing more up-to-date data by pulling metrics at specific intervals.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Centralized configuration:&lt;/strong&gt; Managing metric collection settings from a single point.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  The Pull Model: Concrete Examples with Prometheus
&lt;/h2&gt;

&lt;p&gt;The Pull model is very popular, especially in modern, distributed systems and microservice architectures. The most well-known example of this model is undoubtedly Prometheus. Prometheus collects metrics by querying the &lt;code&gt;/metrics&lt;/code&gt; endpoint over HTTP. These metrics are typically served in Prometheus's own text-based format or the OpenMetrics format.&lt;/p&gt;
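
&lt;p&gt;As a rough illustration of what a scrape returns, the sketch below parses a small hand-written sample of the text format. The sample payload and the helper &lt;code&gt;parse_exposition&lt;/code&gt; are assumptions made for illustration; the helper handles only simple lines without escaping or timestamps and is not a substitute for a real parser.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;SAMPLE = """\
# HELP http_requests_total Total number of HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",endpoint="/"} 12
http_requests_total{method="GET",endpoint="/slow"} 3
# HELP active_users Number of active users
# TYPE active_users gauge
active_users 42
"""


def parse_exposition(text):
    """Parse simple text-format lines into {(name, labels): value}.

    Illustrative only: ignores escaping, timestamps and commas inside label values.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        series, value = line.rsplit(" ", 1)
        name = series
        labels = ()
        if series.endswith("}"):
            name, raw = series[:-1].split("{", 1)
            pairs = [item.split("=", 1) for item in raw.split(",") if item]
            labels = tuple(sorted((k, v.strip('"')) for k, v in pairs))
        samples[(name, labels)] = float(value)
    return samples


parsed = parse_exposition(SAMPLE)
for (name, labels), value in sorted(parsed.items()):
    print(name, dict(labels), value)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Each non-comment line is one sample: a metric name, an optional set of labels, and a numeric value. The collector attaches the timestamp when it scrapes.&lt;/p&gt;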

&lt;p&gt;Let's go through an example. Suppose we have a FastAPI application and we want to collect some basic metrics from it. We can use the &lt;code&gt;prometheus_client&lt;/code&gt; library for this.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fastapi&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FastAPI&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;prometheus_client&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Counter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Histogram&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Gauge&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;generate_latest&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;starlette.responses&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Response&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;

&lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;FastAPI&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Define the metrics
&lt;/span&gt;&lt;span class="n"&gt;REQUEST_COUNT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Counter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;http_requests_total&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Total number of HTTP requests&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;method&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;endpoint&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;status_code&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;REQUEST_LATENCY&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Histogram&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;http_request_duration_seconds&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;HTTP request latency in seconds&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;method&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;endpoint&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;ACTIVE_USERS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Gauge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;active_users&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Number of active users&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nd"&gt;@app.middleware&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;add_metrics&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;call_next&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;start_time&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;method&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;method&lt;/span&gt;
    &lt;span class="n"&gt;endpoint&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;call_next&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt;
    &lt;span class="n"&gt;process_time&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;start_time&lt;/span&gt;

    &lt;span class="n"&gt;REQUEST_COUNT&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;method&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;method&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;endpoint&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;endpoint&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;inc&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;REQUEST_LATENCY&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;method&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;method&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;endpoint&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;endpoint&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;observe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;process_time&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Simulate a random number of active users
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;random&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;ACTIVE_USERS&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;ACTIVE_USERS&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dec&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;

&lt;span class="nd"&gt;@app.get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/metrics&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;metrics&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;Response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;generate_latest&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;media_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text/plain&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nd"&gt;@app.get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;homepage&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Hello, World!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nd"&gt;@app.get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/slow&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;slow_page&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="c1"&gt;# Plain def: FastAPI runs it in a threadpool, so the blocking sleep does not stall the event loop&lt;/span&gt;
    &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;uniform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;2.0&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;This is a slow page.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Example usage:
# uvicorn main:app --reload
# Configure Prometheus server to scrape this application.
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This FastAPI application instruments every incoming request and exposes the metrics &lt;code&gt;http_requests_total&lt;/code&gt;, &lt;code&gt;http_request_duration_seconds&lt;/code&gt;, and &lt;code&gt;active_users&lt;/code&gt; (held in the variables &lt;code&gt;REQUEST_COUNT&lt;/code&gt;, &lt;code&gt;REQUEST_LATENCY&lt;/code&gt;, and &lt;code&gt;ACTIVE_USERS&lt;/code&gt;). When you configure the Prometheus server to scrape the &lt;code&gt;/metrics&lt;/code&gt; endpoint of this application at regular intervals, the pull model is in action.&lt;/p&gt;

&lt;p&gt;In Prometheus's &lt;code&gt;scrape_configs&lt;/code&gt; section, we can define this target like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;scrape_configs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;job_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;my_fastapi_app'&lt;/span&gt;
    &lt;span class="na"&gt;static_configs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;targets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;localhost:8000'&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt; &lt;span class="c1"&gt;# Where your FastAPI application is running&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



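&lt;p&gt;With this configuration, Prometheus will fetch metrics from &lt;code&gt;http://localhost:8000/metrics&lt;/code&gt; at the configured &lt;code&gt;scrape_interval&lt;/code&gt;. The built-in default is one minute; many setups, including the sample &lt;code&gt;prometheus.yml&lt;/code&gt; shipped with Prometheus, set it to 15 seconds. This provides centralized control and regular data collection.&lt;/p&gt;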

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;⚠️ Challenges of the Pull Model&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In the Pull model, Prometheus's inability to reach target services is the biggest issue. If the &lt;code&gt;localhost:8000&lt;/code&gt; address is blocked by a firewall or the service is down, Prometheus cannot collect metrics from that service. In such cases, we see incomplete or outdated data on our monitoring dashboards. Setting up alert mechanisms correctly for such situations is vital.&lt;/p&gt;
&lt;/blockquote&gt;
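
&lt;p&gt;Prometheus records a synthetic &lt;code&gt;up&lt;/code&gt; series for every scrape (1 on success, 0 on failure), and that series is the usual signal to alert on. The sketch below mirrors the idea in plain Python. The &lt;code&gt;TargetHealth&lt;/code&gt; class and its window of three scrapes are hypothetical; in a real setup this logic would normally live in an alerting rule, not in application code.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from collections import deque


class TargetHealth:
    """Tracks recent scrape results for one target (hypothetical helper)."""

    def __init__(self, window=3):
        # Keep only the last few scrape outcomes.
        self.recent = deque(maxlen=window)

    def record(self, scrape_ok):
        self.recent.append(bool(scrape_ok))

    @property
    def up(self):
        # 1 if the latest scrape succeeded, 0 otherwise (or if nothing was scraped yet).
        return int(bool(self.recent) and self.recent[-1])

    def should_alert(self):
        # Alert only when the whole window is filled with failures,
        # so a single missed scrape does not page anyone.
        return len(self.recent) == self.recent.maxlen and not any(self.recent)


health = TargetHealth(window=3)
for ok in [True, False, False]:
    health.record(ok)
print(health.up, health.should_alert())

health.record(False)  # third consecutive failure
print(health.up, health.should_alert())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Requiring several consecutive failures before alerting trades a little detection delay for fewer false alarms caused by a single slow or dropped scrape.&lt;/p&gt;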

&lt;h2&gt;
  
  
  The Push Model: Sending Metrics to the Center
&lt;/h2&gt;

&lt;p&gt;The Push model operates in the opposite way to the Pull model. The service or agent that generates metrics actively sends them to a central collection point. This model is more useful in situations where the network topology is complex, firewall rules are strict, or short-lived jobs need to produce metrics.&lt;/p&gt;

&lt;p&gt;For example, consider an application running inside a Docker container. This container might have a short lifespan, and it might not always be possible for Prometheus to query it directly. In such cases, an agent within the container can collect metrics and send them to a more persistent database (like InfluxDB or Graphite).&lt;/p&gt;
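
&lt;p&gt;One way such an agent might behave is sketched below: it buffers points and hands them over in batches, with a final flush before the process exits. The &lt;code&gt;BufferedPusher&lt;/code&gt; class, the batch size, and the &lt;code&gt;send&lt;/code&gt; callable are assumptions for illustration; here &lt;code&gt;send&lt;/code&gt; just appends to a list, where a real agent would write to the metric database.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;class BufferedPusher:
    """Buffers points and hands them to a send callable in batches (hypothetical helper)."""

    def __init__(self, send, batch_size=3):
        self.send = send              # callable that delivers a list of points
        self.batch_size = batch_size
        self.buffer = []

    def add(self, point):
        self.buffer.append(point)
        if len(self.buffer) == self.batch_size:
            self.flush()

    def flush(self):
        if not self.buffer:
            return
        # Swap the buffer out before sending so new points are not lost.
        batch, self.buffer = self.buffer, []
        self.send(batch)


delivered = []
pusher = BufferedPusher(send=delivered.append, batch_size=3)
for i in range(7):
    pusher.add({"measurement": "job_progress", "value": i})
pusher.flush()  # final flush before the short-lived process exits
print([len(batch) for batch in delivered])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Batching reduces the number of connections the central collector has to handle, and the explicit final flush matters for short-lived containers, which may otherwise exit with unsent points.&lt;/p&gt;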

&lt;p&gt;Another common use case is integrating metrics with a central log aggregation system. We can capture specific error patterns in logs and increment metrics corresponding to these patterns.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;

&lt;span class="c1"&gt;# The endpoint where we will send metrics (here, the InfluxDB 1.x HTTP write API)
&lt;/span&gt;&lt;span class="n"&gt;METRIC_ENDPOINT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://your-metric-collector:8086/write?db=mydb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="c1"&gt;# InfluxDB example
&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;send_metric&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;measurement&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tags&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fields&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;timestamp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;1e9&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# Nanosecond precision for InfluxDB
&lt;/span&gt;    &lt;span class="n"&gt;tag_str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;tags&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;()])&lt;/span&gt;
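    &lt;span class="c1"&gt;# Line protocol requires string field values to be double-quoted; numbers are written as-is&lt;/span&gt;
    &lt;span class="n"&gt;field_str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;="&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;fields&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;()])&lt;/span&gt;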
    &lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;measurement&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;,&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;tag_str&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;field_str&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;timestamp&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;METRIC_ENDPOINT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="mi"&gt;204&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="c1"&gt;# InfluxDB write success is 204 No Content
&lt;/span&gt;            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error sending metric: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; - &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;exceptions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RequestException&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Request failed: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Application logic simulation
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;process_request&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request_id&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;tags&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;service&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my_app&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;request_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;request_id&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;start_time&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Simulate processing
&lt;/span&gt;        &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;uniform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;1.5&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;random&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="c1"&gt;# 10% error rate
&lt;/span&gt;            &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Internal processing error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;fields&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;duration_ms&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;start_time&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;success&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="nf"&gt;send_metric&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;request_latency&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tags&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fields&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Request &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;request_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; processed successfully.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;fields&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;duration_ms&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;start_time&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;error_message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;
        &lt;span class="nf"&gt;send_metric&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;request_latency&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tags&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fields&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Request &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;request_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; failed: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Main loop
&lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="c1"&gt;# Simulate 10 requests
&lt;/span&gt;        &lt;span class="nf"&gt;process_request&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;req_&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;uniform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;2.0&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this code, the &lt;code&gt;process_request&lt;/code&gt; function, after processing each request, sends metrics indicating the duration of the operation and its outcome (success/failure) via the &lt;code&gt;send_metric&lt;/code&gt; function to a central endpoint. This endpoint could be a Telegraf agent writing to an InfluxDB database.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;💡 Flexibility of the Push Model&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Push model offers great flexibility, especially in dynamic environments and situations with network constraints. When you start or stop a container, the task of sending metrics automatically begins or ends. This reduces the need for manual configuration.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Why Are We Collecting So Many Metrics?
&lt;/h3&gt;

&lt;p&gt;The primary goal of metric collection is to understand our systems' behavior, detect problems, and optimize their performance. Some critical metrics I've encountered in production environments include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;CPU Usage:&lt;/strong&gt; The processor load of servers or containers. High CPU usage can be a sign of performance issues or insufficient resources.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Memory Usage:&lt;/strong&gt; How much RAM applications are consuming. Memory leaks or insufficient RAM can seriously affect system stability.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Disk I/O:&lt;/strong&gt; Disk read/write speeds. Slow disks can slow down database or file system operations, reducing overall performance.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Network Traffic:&lt;/strong&gt; The size and number of incoming and outgoing network packets. Network bottlenecks or abnormal traffic patterns can be detected.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Error Rates:&lt;/strong&gt; The number of errors within the application or in HTTP requests (e.g., 5xx HTTP errors).&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Latency:&lt;/strong&gt; How long it takes for requests to be responded to. High latency negatively impacts user experience.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Collecting these metrics continuously, not just when something breaks, shows us what the system's "normal" behavior looks like. This "baseline" is invaluable for detecting anomalies (e.g., CPU usage 50% higher than normal).&lt;/p&gt;
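
&lt;p&gt;As a minimal sketch of the baseline idea, the check below compares a new reading against the mean of recent readings. The function name, the sample values, and the 50% tolerance are assumptions for illustration, not part of any monitoring tool.&lt;/p&gt;

```python
import statistics

def is_anomalous(history, current, tolerance=0.5):
    """Return True when `current` exceeds the historical mean by more than `tolerance` (0.5 = 50%)."""
    baseline = statistics.mean(history)
    return current > baseline * (1 + tolerance)

cpu_history = [40.0, 42.0, 38.0, 41.0, 39.0]  # made-up CPU percentages, mean is 40
print(is_anomalous(cpu_history, 45.0))  # False: within 50% of the baseline
print(is_anomalous(cpu_history, 65.0))  # True: more than 50% above the baseline
```

&lt;p&gt;A plain mean is the simplest possible baseline; in practice you would likely compare against the same hour or weekday to account for daily and weekly patterns.&lt;/p&gt;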

&lt;h2&gt;
  
  
  When to Use Which Model?
&lt;/h2&gt;

&lt;p&gt;Both models have their use cases. Some factors to consider when making a choice include:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Infrastructure Structure:&lt;/strong&gt; Microservices or monolith? Containers or virtual machines? How complex is the network structure?&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Metric Producer Characteristics:&lt;/strong&gt; Short-lived or continuously running? Are network accesses restricted? Can it expose its own metric endpoint?&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Scalability Needs:&lt;/strong&gt; How many services and metrics will be collected? What will be the load on the central collector?&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Network Security and Accessibility:&lt;/strong&gt; Situations like firewall rules, services behind NAT.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Operational Complexity:&lt;/strong&gt; Which model is easier to manage?&lt;/li&gt;
&lt;/ol&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;⚠️ Hybrid Approach&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In the real world, we often see hybrid approaches that combine both models. For example, we might use the Pull model (with Prometheus) for continuously running services, while using the Push model (with Fluentd, Logstash, or custom agents) for short-lived or network-constrained services. This allows us to leverage the advantages of both models.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Examples from My Own Experience
&lt;/h3&gt;

&lt;p&gt;While working on a production ERP system, we needed to monitor both the main application (which was monolithic) and various background processors. For the main application, we used the Pull model with Prometheus. We collected basic metrics like CPU, memory, request count, and latency through the application's &lt;code&gt;/metrics&lt;/code&gt; endpoint.&lt;/p&gt;

&lt;p&gt;However, we also had background processes that ran periodically (e.g., hourly invoice generation, daily reporting). Some of these processors were one-off tasks, and others finished within a few minutes. Because they were short-lived and sometimes sat behind a firewall, we opted for the Push model. During its execution, each processor sent the metrics it generated (processing time, success/failure record counts, etc.) directly to InfluxDB. This way, we could monitor the health of the main application in real time and analyze the performance of background processors in detail. This hybrid approach played a critical role in achieving our 99.9% uptime goal.&lt;/p&gt;
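
&lt;p&gt;To give a rough idea of what such a batch job pushes, here is a sketch that renders one data point in InfluxDB's line protocol (measurement, tags, fields, timestamp). The measurement, tag, and field names are invented for illustration, and the sketch assumes tag values contain no spaces, commas, or equals signs, which would otherwise need escaping.&lt;/p&gt;

```python
def to_line_protocol(measurement, tags, fields, timestamp_ns):
    """Render one point as InfluxDB line protocol: measurement,tags fields timestamp."""
    def format_field(value):
        if isinstance(value, bool):
            return "true" if value else "false"
        if isinstance(value, int):
            return f"{value}i"  # integer fields carry an "i" suffix
        if isinstance(value, float):
            return repr(value)
        escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
        return f'"{escaped}"'  # string fields are double-quoted

    tag_part = "".join(f",{key}={tags[key]}" for key in sorted(tags))
    field_part = ",".join(f"{key}={format_field(val)}" for key, val in fields.items())
    return f"{measurement}{tag_part} {field_part} {timestamp_ns}"

# Hypothetical hourly invoice job reporting its own result.
line = to_line_protocol(
    "invoice_job",
    {"host": "worker-1", "job": "hourly_invoices"},
    {"duration_ms": 1530.5, "records_ok": 120, "status": "success"},
    1700000000000000000,
)
print(line)
```

&lt;p&gt;The job would send this line to the collector's write endpoint just before it exits, which is why a short-lived process never needs to be reachable from the monitoring side.&lt;/p&gt;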

&lt;p&gt;In another scenario, to track our mobile application's performance, we collected crash reports and performance metrics (screen load times, network request times) directly from the application itself. These metrics were pushed from mobile devices to a central service. Mobile devices are not continuously reachable for our servers to pull from, and their network connections are unreliable. In such cases, the Push model becomes almost the only option for data collection.&lt;/p&gt;

&lt;h3&gt;
  
  
  When is the Pull Model More Advantageous?
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Ease of Service Discovery:&lt;/strong&gt; If your services have a service discovery mechanism, Prometheus can automatically find them and pull metrics. This is a great convenience, especially in dynamic environments (like Kubernetes).&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Centralized Control:&lt;/strong&gt; Settings like metric collection frequency and format are managed from a single location.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Network Load Distribution:&lt;/strong&gt; The load of pulling metrics falls on the metric collector (Prometheus). Metric-producing services do not have additional workload (other than exposing an endpoint).&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;More Reliable Data:&lt;/strong&gt; Every scrape doubles as a health check. If a target doesn't respond, Prometheus records the failed scrape (the &lt;code&gt;up&lt;/code&gt; metric for that target becomes 0), so an outage becomes visible within one scrape interval.&lt;/li&gt;
&lt;/ul&gt;
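
&lt;p&gt;On the pull side, the only work a service does is expose its current values on an endpoint. The sketch below renders a few values in the Prometheus text exposition format, which is what a scraper would read from &lt;code&gt;/metrics&lt;/code&gt;. The function and the sample values are assumptions for illustration; a real service would normally rely on an official client library instead of formatting the text by hand.&lt;/p&gt;

```python
def render_metrics(metrics):
    """Render {name: (type, help_text, value)} in the Prometheus text exposition format."""
    lines = []
    for name, (metric_type, help_text, value) in metrics.items():
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} {metric_type}")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"

# Illustrative values; a real service would read these from its own counters.
body = render_metrics({
    "http_requests_total": ("counter", "Total HTTP requests handled.", 1027),
    "process_resident_memory_bytes": ("gauge", "Resident memory size in bytes.", 52428800),
})
print(body)
```

&lt;p&gt;Because the service only answers when asked, the scrape interval and the list of targets stay under the collector's control.&lt;/p&gt;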

&lt;h3&gt;
  
  
  When is the Push Model More Advantageous?
&lt;/h3&gt;

&lt;ul&gt;
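&lt;li&gt;  &lt;strong&gt;Systems Behind Firewalls:&lt;/strong&gt; When the collector cannot reach the metric producer directly (for example, the producer sits behind NAT or a firewall), but the producer can still open outbound connections to the collection point.&lt;/li&gt;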
&lt;li&gt;  &lt;strong&gt;Short-Lived Workloads:&lt;/strong&gt; When metrics need to be collected from a script or a short-running container.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Event-Driven Metrics:&lt;/strong&gt; For sending metrics after a specific event.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Low Bandwidth Environments:&lt;/strong&gt; When the metric producer needs to send aggregated data to the collection point at specific intervals.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Visualizing and Analyzing Metrics
&lt;/h2&gt;

&lt;p&gt;Collecting metrics is just the first step. The real value lies in making these metrics meaningful. Metrics collected with Prometheus are typically used in conjunction with visualization tools like Grafana. Grafana allows us to create rich and interactive dashboards with metrics from Prometheus.&lt;/p&gt;

&lt;p&gt;A dashboard typically includes the following panels:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;General Status Panel:&lt;/strong&gt; Shows basic system metrics like CPU, memory, and disk usage.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Application Performance Panel:&lt;/strong&gt; Contains application-specific metrics like request count, error rates, and latency.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Error Analysis Panel:&lt;/strong&gt; Graphs showing error types and their frequencies.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Capacity Planning Panel:&lt;/strong&gt; Shows resource usage trends to help predict future needs.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Consider a "request_latency" histogram graph we created in Grafana. This graph shows how long requests took to complete within a specific time frame. For example, the 50th percentile (p50) indicates that 50% of requests were completed within this duration. The 99th percentile (p99) indicates that 99% of requests were completed within this duration; only the slowest 1% took longer. These metrics are critical for understanding user experience.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Example Grafana PromQL query:
sum(rate(http_request_duration_seconds_bucket{job="my_fastapi_app", le="0.5"}[5m]))
/
sum(rate(http_request_duration_seconds_count{job="my_fastapi_app"}[5m]))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



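&lt;p&gt;This query plots the fraction of requests in the last 5 minutes that were completed in 0.5 seconds or less. To plot an actual percentile such as p50 or p99, you would apply &lt;code&gt;histogram_quantile()&lt;/code&gt; to the bucket rates instead.&lt;/p&gt;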
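
&lt;p&gt;To make the arithmetic behind histogram buckets concrete, here is a small sketch that works on cumulative bucket counts, the same shape Prometheus stores in &lt;code&gt;_bucket&lt;/code&gt; series. The bucket bounds and counts are invented, and the quantile estimate uses linear interpolation inside a bucket, similar in spirit to &lt;code&gt;histogram_quantile()&lt;/code&gt; rather than a reimplementation of it.&lt;/p&gt;

```python
import math

def fraction_within(buckets, le):
    """Fraction of observations at or below the bound `le`."""
    return buckets[le] / buckets[math.inf]

def estimate_quantile(buckets, q):
    """Estimate the q-quantile by linear interpolation inside the matching bucket."""
    rank = q * buckets[math.inf]
    lower_bound, lower_count = 0.0, 0
    for upper_bound in sorted(buckets):
        count = buckets[upper_bound]
        if count >= rank:
            if upper_bound == math.inf or count == lower_count:
                # Open-ended or empty bucket: fall back to the last finite bound.
                return lower_bound
            share = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * share
        lower_bound, lower_count = upper_bound, count
    return lower_bound

# Cumulative counts per upper bound in seconds (invented numbers).
buckets = {0.1: 20, 0.5: 80, 1.0: 95, math.inf: 100}
print(fraction_within(buckets, 0.5))    # 0.8
print(estimate_quantile(buckets, 0.5))  # roughly 0.3
```

&lt;p&gt;The estimate is only as precise as the bucket boundaries, which is why bucket bounds should be chosen around the latencies you actually care about.&lt;/p&gt;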

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;ℹ️ Alerting Mechanisms&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Continuously monitoring collected metrics and receiving alerts when anomalies occur is also very important. Prometheus Alertmanager receives alerts from Prometheus and, according to configured rules, notifies the relevant individuals (via email, Slack, PagerDuty, etc.). For example, rules like "Alert if CPU usage exceeds 90% and this condition persists for more than 5 minutes" can be defined.&lt;/p&gt;
&lt;/blockquote&gt;
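
&lt;p&gt;The "persists for more than 5 minutes" part is what keeps a short spike from paging anyone; in Prometheus alerting rules it corresponds to the &lt;code&gt;for&lt;/code&gt; field. The sketch below mimics that logic over a list of samples. The function name, the sample layout, and the thresholds are assumptions for illustration.&lt;/p&gt;

```python
def should_fire(samples, threshold=90.0, hold_seconds=300):
    """samples: list of (unix_timestamp, cpu_percent), oldest first.

    Fire only when the value has stayed above `threshold` without
    interruption for at least `hold_seconds`, up to the latest sample.
    """
    breach_started = None
    for timestamp, value in samples:
        if value > threshold:
            if breach_started is None:
                breach_started = timestamp
        else:
            breach_started = None  # condition cleared, restart the clock
    if breach_started is None:
        return False
    return samples[-1][0] - breach_started >= hold_seconds

sustained = [(t, 95.0) for t in range(0, 301, 60)]
spiky = [(0, 95.0), (60, 95.0), (120, 50.0), (180, 95.0), (240, 95.0), (300, 95.0)]
print(should_fire(sustained))  # True: above 90% for the full 300 seconds
print(should_fire(spiky))      # False: the dip at t=120 reset the clock
```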

&lt;h2&gt;
  
  
  Conclusion: Choosing the Right Model
&lt;/h2&gt;

&lt;p&gt;The choice between Push and Pull models for metric collection depends entirely on your project's specific requirements. Both models have their strengths and weaknesses. Often, the best approach is to choose the model that is most suitable for different components of your infrastructure, or to use both models in conjunction.&lt;/p&gt;

&lt;p&gt;The Pull model is a great option for modern, distributed systems that require centralized control and service discovery. Prometheus is the most popular representative of this model. The Push model, on the other hand, offers a more flexible solution for systems with network constraints, short-lived processes, or event-driven architectures.&lt;/p&gt;

&lt;p&gt;It's important to remember that metric collection is just a tool. The ultimate goal is to use this data to make our systems more reliable, performant, and understandable. Therefore, selecting the right metrics, collecting them correctly, and visualizing them meaningfully are integral parts of modern system operations.&lt;/p&gt;

</description>
      <category>monitoring</category>
      <category>observability</category>
      <category>prometheus</category>
      <category>grafana</category>
    </item>
    <item>
      <title>Database Index Selection: Why Basic Approaches Fall Short?</title>
      <dc:creator>Mustafa ERBAY</dc:creator>
      <pubDate>Fri, 29 May 2026 14:10:35 +0000</pubDate>
      <link>https://dev.to/merbayerp/database-index-selection-why-basic-approaches-fall-short-36hl</link>
      <guid>https://dev.to/merbayerp/database-index-selection-why-basic-approaches-fall-short-36hl</guid>
      <description>&lt;h2&gt;
  
  
  Introduction: The Unseen Costs of Indexes
&lt;/h2&gt;

&lt;p&gt;When we talk about database performance, indexes are usually the first thing that comes to mind. When a query runs slowly, the first place we look is often for missing or incorrect indexes. We generally know what B-tree, GIN, and BRIN index types do and when to use them. We even have those famous graphs from PostgreSQL documentation in our minds. But in the real world, especially in large and complex systems, how much do we question why these basic index choices often fall short, or even sometimes degrade performance?&lt;/p&gt;

&lt;p&gt;In this post, I'll explain with concrete examples from my own experiences why index selections cannot be made by just looking at table and query structures, and how organizational workflows, data distribution, and even hardware can influence these decisions. From the "late shipment report" problem I encountered in a manufacturing ERP to index optimizations in my own financial calculators, we will focus on moments where we pushed the limits of basic approaches.&lt;/p&gt;

&lt;h2&gt;
  
  
  B-Tree Index: The Savior for Every Situation?
&lt;/h2&gt;

&lt;p&gt;The default and most frequently used index type in PostgreSQL is undoubtedly B-tree. It is generally very successful in speeding up queries using operators like equality (&lt;code&gt;=&lt;/code&gt;), greater than (&lt;code&gt;&amp;gt;&lt;/code&gt;), less than (&lt;code&gt;&amp;lt;&lt;/code&gt;), and &lt;code&gt;BETWEEN&lt;/code&gt;. It even works for prefix searches like &lt;code&gt;LIKE 'prefix%'&lt;/code&gt;, provided the database uses the C locale or the index is created with an operator class such as &lt;code&gt;text_pattern_ops&lt;/code&gt;. I remember adding a B-tree index to almost every table while working on a manufacturing ERP for over 5 years.&lt;/p&gt;

&lt;p&gt;However, B-trees also have their limits. For searches like &lt;code&gt;LIKE '%suffix'&lt;/code&gt; or &lt;code&gt;LIKE '%substring%'&lt;/code&gt;, the pattern has no fixed prefix to navigate the ordered tree with, so PostgreSQL falls back to scanning the whole table. When we encounter such queries, the usual options are Full-Text Search (FTS) or a more specialized index structure.&lt;/p&gt;

&lt;p&gt;For example, in a client project, we were trying to filter product movements in operator screens in real time. We were querying by date range and product code, and these queries were quite fast with B-tree indexes. However, operators sometimes wanted to search by entering part of the product description. A search like &lt;code&gt;LIKE '%screen%'&lt;/code&gt; caused serious performance issues in tables with millions of rows. Initially, we tried GIN indexes using the &lt;code&gt;pg_trgm&lt;/code&gt; extension, but this slowed down table writes. In the end, we restructured the search requirement itself by moving to a different data model.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;ℹ️ Limitations of B-Tree Index&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;B-tree indexes, with their ordered data structure, speed up many common queries. However, their performance can degrade as search patterns become more complex or when data distribution is very uneven. They are particularly insufficient for full-text or complex string matching.&lt;/p&gt;
&lt;/blockquote&gt;
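
&lt;p&gt;To illustrate why a trigram-based inverted index can serve &lt;code&gt;LIKE '%substring%'&lt;/code&gt; where a B-tree cannot, here is a simplified sketch of the idea. It is not how &lt;code&gt;pg_trgm&lt;/code&gt; is implemented (the extension also pads words and handles word boundaries); the function names and the sample rows are assumptions for illustration.&lt;/p&gt;

```python
from collections import defaultdict

def trigrams(text):
    """Return the set of 3-character substrings of the lowercased text."""
    lowered = text.lower()
    return {lowered[i:i + 3] for i in range(len(lowered) - 2)}

def build_index(descriptions):
    """Map each trigram to the set of row ids whose description contains it."""
    index = defaultdict(set)
    for row_id, description in descriptions.items():
        for gram in trigrams(description):
            index[gram].add(row_id)
    return index

def search(index, descriptions, needle):
    """Find rows containing `needle` without scanning every row."""
    grams = trigrams(needle)
    if grams:
        candidates = set.intersection(*(index.get(g, set()) for g in grams))
    else:
        candidates = set(descriptions)  # needle too short to filter on
    # Trigram overlap is only a filter; recheck candidates for a real match.
    return sorted(r for r in candidates if needle.lower() in descriptions[r].lower())

products = {
    1: "15 inch touch screen panel",
    2: "Steel screw M4",
    3: "Screen protector film",
}
index = build_index(products)
print(search(index, products, "screen"))  # [1, 3]
```

&lt;p&gt;The sketch also hints at the write cost: inserting one description adds an entry for every trigram it contains, not just one entry per row.&lt;/p&gt;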

&lt;h2&gt;
  
  
  GIN Index: The Powerhouse for Text Searches?
&lt;/h2&gt;

&lt;p&gt;When working with Full-Text Search (FTS) or text-heavy data, GIN (Generalized Inverted Index) indexes come into play. They are used to search for specific words or patterns in data of JSONB, array, or text types. GIN indexes can be a lifesaver in analyzing product descriptions or reviews on an e-commerce site.&lt;/p&gt;

&lt;p&gt;In a client project, we were storing product features as JSONB. We needed to query the existence of a specific feature (&lt;code&gt;"color": "blue"&lt;/code&gt;) or multiple features (&lt;code&gt;"color": "blue"&lt;/code&gt; AND &lt;code&gt;"size": "XL"&lt;/code&gt;) within this JSONB field. GIN indexes were a perfect fit for such queries. We created this index with the command &lt;code&gt;CREATE INDEX idx_products_features ON products USING GIN (features);&lt;/code&gt;, and our queries went from seconds to milliseconds.&lt;/p&gt;

&lt;p&gt;However, GIN indexes also come with their own costs. They often occupy more disk space than B-trees and, importantly, slow down INSERT and UPDATE operations. A single row can produce many index entries (one per word, key, or array element), and all of them must be maintained whenever the row changes. In a project with my own financial calculators, using GIN indexes while processing constantly updated financial data slowed down write performance so much that we considered moving the data to a separate time-series database.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;-- Example GIN index for full-text search&lt;/span&gt;
CREATE INDEX idx_articles_content ON articles USING GIN &lt;span class="o"&gt;(&lt;/span&gt;to_tsvector&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'turkish'&lt;/span&gt;, content&lt;span class="o"&gt;))&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c"&gt;-- JSONB containment query that can use a GIN index on features&lt;/span&gt;
SELECT &lt;span class="k"&gt;*&lt;/span&gt; FROM products WHERE features @&amp;gt; &lt;span class="s1"&gt;'{"color": "blue", "size": "XL"}'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;⚠️ Considerations for GIN Indexes&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;While GIN indexes excel in complex data structures and text searches, they come with significant costs in terms of disk space and write performance. They should be used with caution in systems with heavy write operations.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  BRIN Index: An Alternative for Large, Ordered Data
&lt;/h2&gt;

&lt;p&gt;BRIN (Block Range Index) indexes are designed as an alternative to B-trees for large tables and ordered datasets. BRIN indexes use the physical order of the table on disk to determine if data falls within a certain range. Since they store only one small summary (such as the minimum and maximum value) per range of blocks, 128 pages by default, they are much smaller than B-trees.&lt;/p&gt;

&lt;p&gt;In a data warehouse project, we had a time-series table with millions of records. Data was typically added to this table in chronological order. When querying using the &lt;code&gt;event_timestamp&lt;/code&gt; column, using a B-tree index both greatly increased the index size and didn't provide the expected performance for some queries. This is precisely where BRIN indexes came into play.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;CREATE INDEX idx_timeseries_event_time ON timeseries_data USING BRIN (event_timestamp);&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;With this index, when the query specified a time range, PostgreSQL only had to scan the data blocks corresponding to that range, rather than reading all millions of records. The biggest advantage of BRIN indexes is that when data is added in order or has a specific natural order, they can offer similar or better performance with a much smaller footprint than B-trees.&lt;/p&gt;

&lt;p&gt;However, BRIN indexes also have a critical prerequisite: the data must be physically ordered on disk. If your data is frequently updated, deleted, or randomly inserted, the benefits of BRIN indexes quickly disappear. I once tried a BRIN index on a stock movement table in a manufacturing company's ERP system. The data was ordered when added, but later stock corrections and returns caused the order to be disrupted, rendering the BRIN index useless.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;💡 Advantages and Conditions of BRIN Indexes&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;BRIN indexes are an excellent option for large and ordered datasets. They save disk space and are effective for range queries. However, as they rely on the physical order of data on disk, maintaining data order is critical.&lt;/p&gt;
&lt;/blockquote&gt;
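
&lt;p&gt;The dependence on physical order is easy to see in a toy model. The sketch below keeps a (min, max) summary per chunk of rows, which is roughly what a BRIN index stores per block range, and counts how many chunks a range query has to visit. The chunk size, the data, and the function names are assumptions for illustration.&lt;/p&gt;

```python
import random

def summarize(values, rows_per_range):
    """Build BRIN-like summaries: (min, max) for each consecutive chunk of rows."""
    chunks = [values[i:i + rows_per_range] for i in range(0, len(values), rows_per_range)]
    return [(min(chunk), max(chunk)) for chunk in chunks]

def ranges_to_scan(summaries, low, high):
    """Count the chunks whose [min, max] overlaps the query range [low, high]."""
    return sum(1 for lo, hi in summaries if hi >= low and high >= lo)

ordered = list(range(10_000))       # rows stored in insertion order
shuffled = ordered[:]
random.Random(42).shuffle(shuffled)  # same rows, physical order destroyed

scan_ordered = ranges_to_scan(summarize(ordered, 100), 2500, 2599)
scan_shuffled = ranges_to_scan(summarize(shuffled, 100), 2500, 2599)
print(scan_ordered)   # 1: only one chunk can contain matching rows
print(scan_shuffled)  # close to all 100 chunks, so the summaries prune almost nothing
```

&lt;p&gt;Once updates and out-of-order inserts widen each chunk's min/max span, nearly every chunk overlaps every query, which matches the behavior described for the stock movement table.&lt;/p&gt;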

&lt;h2&gt;
  
  
  Overlooked Factors in Index Selection
&lt;/h2&gt;

&lt;p&gt;Typically, when selecting indexes, we focus on query patterns, data types, and the basic characteristics of the index type. However, in a production environment, things are much more complex.&lt;/p&gt;

&lt;h3&gt;
  
  
  Data Distribution and Cardinality
&lt;/h3&gt;

&lt;p&gt;The cardinality of a column (the number of unique values) plays a critical role in index selection. A B-tree index on a column with low cardinality (e.g., columns with only a few distinct values like gender or status codes) often doesn't perform better than a full table scan. This is because the index will point to rows representing a large portion of the table. In such cases, it's crucial to carefully examine the &lt;code&gt;EXPLAIN ANALYZE&lt;/code&gt; output.&lt;/p&gt;

&lt;p&gt;At one point, a client's order status table had a &lt;code&gt;status&lt;/code&gt; column with only 3 distinct values: 'pending', 'processing', 'completed'. We had created a B-tree index on this column. However, the query &lt;code&gt;WHERE status = 'completed'&lt;/code&gt; was slow because it scanned 70% of the table. In this situation, optimizing the query or managing the status in a different data structure might have been a more appropriate approach than using an index.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;-- A B-tree index on a low-cardinality column is often not enough&lt;/span&gt;
EXPLAIN ANALYZE SELECT &lt;span class="k"&gt;*&lt;/span&gt; FROM orders WHERE status &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'completed'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="nt"&gt;--&lt;/span&gt; We would expect to see a large &lt;span class="s1"&gt;'Seq Scan'&lt;/span&gt; or &lt;span class="s1"&gt;'Bitmap Heap Scan'&lt;/span&gt; &lt;span class="k"&gt;in &lt;/span&gt;the analysis output.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
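
&lt;p&gt;Before adding an index on a column like this, it helps to look at how the rows are spread across its values. The sketch below computes that spread for a sample of values. The 70% share for 'completed' comes from the example above; the split between the other two statuses is assumed for illustration.&lt;/p&gt;

```python
from collections import Counter

def selectivity(values):
    """Return the fraction of rows each distinct value accounts for."""
    counts = Counter(values)
    total = len(values)
    return {value: count / total for value, count in counts.items()}

# 70% completed as in the example; the other shares are assumed.
statuses = ["completed"] * 70 + ["processing"] * 20 + ["pending"] * 10
for value, fraction in sorted(selectivity(statuses).items(), key=lambda item: item[1]):
    print(f"{value}: {fraction:.0%} of rows")
```

&lt;p&gt;A value that matches a small share of rows is a reasonable index candidate, while one that matches most of the table will usually be served faster by a sequential scan. The exact crossover depends on the planner's cost settings, so &lt;code&gt;EXPLAIN ANALYZE&lt;/code&gt; remains the final judge.&lt;/p&gt;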



&lt;h3&gt;
  
  
  Write vs. Read Balance
&lt;/h3&gt;

&lt;p&gt;Indexes improve read performance but degrade write performance. Every index must be updated during a data change. If your table experiences very frequent data writes (e.g., logging or real-time transaction records), updating multiple indexes for each added piece of data can create a significant performance bottleneck.&lt;/p&gt;

&lt;p&gt;In the backend of my own mobile application, I was anonymously logging user activities. Initially, I had created B-tree indexes on columns like date, user ID, and activity type. When millions of log rows were added daily, write performance dropped so much that the application started to slow down. Eventually, I realized that most log queries were just searching by time and switched to a BRIN index solely on &lt;code&gt;event_timestamp&lt;/code&gt;, removing the other indexes. This change increased write performance by over 300%.&lt;/p&gt;

&lt;h3&gt;
  
  
  Index Maintenance and Cost
&lt;/h3&gt;

&lt;p&gt;Indexes don't just take up space; they also require maintenance. In PostgreSQL, the &lt;code&gt;VACUUM&lt;/code&gt; operation marks the space left by deleted or updated rows as reusable, in both the table and its indexes. &lt;code&gt;VACUUM FULL&lt;/code&gt; is more aggressive and actually returns the space to the operating system, but it rewrites the table under an exclusive lock, which blocks reads and writes while it runs.&lt;/p&gt;

&lt;p&gt;In a manufacturing ERP system, we weren't regularly checking the &lt;code&gt;pg_stat_user_indexes&lt;/code&gt; view. Over time, the indexes had become so bloated that we started experiencing disk space issues. By looking at &lt;code&gt;idx_scan&lt;/code&gt; in &lt;code&gt;pg_stat_user_indexes&lt;/code&gt; and at &lt;code&gt;last_vacuum&lt;/code&gt;/&lt;code&gt;last_autovacuum&lt;/code&gt; in &lt;code&gt;pg_stat_user_tables&lt;/code&gt;, we identified which indexes were unused and which tables hadn't been &lt;code&gt;VACUUM&lt;/code&gt;ed for a long time. Deleting unused indexes and optimizing &lt;code&gt;VACUUM&lt;/code&gt; settings helped us reduce disk usage by 20%.&lt;/p&gt;

&lt;h2&gt;
  
  
  Advanced Indexing Approaches
&lt;/h2&gt;

&lt;p&gt;There are also more advanced methods we can resort to when basic index types are insufficient.&lt;/p&gt;

&lt;h3&gt;
  
  
  Partial Indexes
&lt;/h3&gt;

&lt;p&gt;Partial indexes allow you to create an index on only a specific subset of the table. This reduces the index size and the write overhead, because rows outside the subset never touch the index. For example, if you frequently query only records with a specific status, you can create a partial index for that status.&lt;/p&gt;

&lt;p&gt;In a client project, we rarely queried cancelled orders. The order table had millions of rows, and queries with the condition &lt;code&gt;status = 'cancelled'&lt;/code&gt; were slow. However, cancelled orders constituted only 1% of the table. In this case, instead of adding an index to the entire table, we created a partial index just for cancelled orders:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;idx_orders_cancelled&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'cancelled'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This index was much smaller, containing only the &lt;code&gt;order_id&lt;/code&gt;s of cancelled orders, and it sped up relevant queries.&lt;/p&gt;

&lt;h3&gt;
  
  
  Expression Indexes
&lt;/h3&gt;

&lt;p&gt;Expression indexes allow you to create an index on the results of functions or expressions performed on columns, rather than on the columns themselves. The &lt;code&gt;to_tsvector&lt;/code&gt; expression I mentioned earlier is an example. Or you can use the &lt;code&gt;lower()&lt;/code&gt; function for case-insensitive comparisons.&lt;/p&gt;

&lt;p&gt;For instance, if you have a &lt;code&gt;username&lt;/code&gt; column in a user table and frequently perform queries like &lt;code&gt;WHERE lower(username) = 'admin'&lt;/code&gt;, creating an expression index on &lt;code&gt;lower(username)&lt;/code&gt; will speed up these queries.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;idx_users_lower_username&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;username&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="k"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;username&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'admin'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Covering Indexes (with &lt;code&gt;INCLUDE&lt;/code&gt; in PostgreSQL)
&lt;/h3&gt;

&lt;p&gt;With the &lt;code&gt;INCLUDE&lt;/code&gt; keyword in PostgreSQL 11 and later versions, it's possible to create covering indexes. This allows a query to be answered by an index-only scan, without visiting the main table for pages that the visibility map marks as all-visible. This can significantly improve query performance.&lt;/p&gt;

&lt;p&gt;In a financial reporting tool, I needed to retrieve transaction details for a specific account and date range. We had a B-tree index on the account ID and the date. However, the query also retrieved the transaction description. In this case, I created a covering index: the key columns served the &lt;code&gt;where&lt;/code&gt; and &lt;code&gt;order by&lt;/code&gt; conditions, and the transaction description went into the &lt;code&gt;INCLUDE&lt;/code&gt; part:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;idx_transactions_account_date&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;transactions&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;account_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;transaction_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;INCLUDE&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This way, queries that needed only the &lt;code&gt;account_id&lt;/code&gt;, &lt;code&gt;transaction_date&lt;/code&gt;, and &lt;code&gt;description&lt;/code&gt; columns could be served by an index-only scan. How often the main table is skipped entirely depends on the visibility map being up to date, so regular vacuuming matters on tables with frequent writes.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;🔥 Considerations for Covering Indexes&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Covering indexes can significantly improve query performance but also increase index size. Since each included column in &lt;code&gt;INCLUDE&lt;/code&gt; increases the index size, it's important to only add columns that are truly needed. Otherwise, the index itself can become a performance bottleneck.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Conclusion: Indexes Are a Tool, But Not a Solution on Their Own
&lt;/h2&gt;

&lt;p&gt;Database indexes are one of the cornerstones of performance optimization. However, looking only at basic types like B-tree, GIN, or BRIN, or only at query plans, is often not enough to make the right decision. Factors like data distribution, write/read balance, index costs, and advanced indexing strategies must also be considered.&lt;/p&gt;

&lt;p&gt;We must remember that indexes are only one part of the solution in complex systems. Sometimes the best index is no index at all, and a better data model, better-written queries, or even a different database technology can yield much more effective results than index optimization. One of the biggest mistakes I've seen in my career is over-reliance on indexes while neglecting the underlying data model or query logic.&lt;/p&gt;

&lt;p&gt;As I mentioned in my previous [related: database performance analysis] posts, learning to read &lt;code&gt;EXPLAIN ANALYZE&lt;/code&gt; output is the first step, but being able to see the system as a whole and manage trade-offs correctly is essential.&lt;/p&gt;

</description>
      <category>database</category>
      <category>performance</category>
      <category>index</category>
      <category>postgres</category>
    </item>
    <item>
      <title>Zero-Trust Architecture: A Pragmatic Roadmap for Small Teams</title>
      <dc:creator>Mustafa ERBAY</dc:creator>
      <pubDate>Fri, 29 May 2026 13:19:23 +0000</pubDate>
      <link>https://dev.to/merbayerp/zero-trust-architecture-a-pragmatic-roadmap-for-small-teams-33b2</link>
      <guid>https://dev.to/merbayerp/zero-trust-architecture-a-pragmatic-roadmap-for-small-teams-33b2</guid>
      <description>&lt;h2&gt;
  
  
  Zero-Trust Architecture: A Pragmatic Start for Small Teams
&lt;/h2&gt;

&lt;p&gt;Traditional security models trusted everyone once they were inside the network perimeter. That assumption no longer holds: once attackers breach the network, they can move freely inside it. This is exactly where &lt;strong&gt;zero-trust&lt;/strong&gt; architecture comes into play. The core principle of this model is simple: never trust, always verify. This applies to every device, every user, and every application on our network.&lt;/p&gt;

&lt;p&gt;For small teams, this concept might seem complex and costly at first glance. However, with the right approach, it's possible to integrate zero-trust into our own operations. In this post, I'll cover pragmatic steps and tools that small teams can understand and implement, rather than relying on complex enterprise solutions. My goal is to move away from jargon and offer solutions that work in the field.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Zero-Trust? Glimpses from My Experience
&lt;/h3&gt;

&lt;p&gt;I've been working in system and network security for years, and I've encountered many different scenarios during that time. Once, I witnessed how malware that infiltrated the ERP system of a manufacturing firm spread rapidly within the internal network. The traditional firewalls were built to keep malware out, but once it got inside, it was effectively invisible to them. User accounts were compromised, sensitive data was stolen, and production came to a standstill. This incident once again showed me how critical internal network security is.&lt;/p&gt;

&lt;p&gt;In another case, unauthorized access to a financial technology company's cloud infrastructure led to a massive financial data leak in just a few minutes. Access controls were insufficient, and a single one-time authorization put the entire system at risk. Events like these reveal how common and devastating "we trusted, but we were wrong" scenarios can be. Zero-trust architecture is designed precisely to minimize these risks: continuously verifying the source, purpose, and authorization of every request helps us prevent such disasters.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;ℹ️ Core Principles of Zero-Trust&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Zero-trust is not a single product or technology, but a security philosophy. Its main principles are:&lt;/p&gt;

&lt;ul&gt;
    &lt;li&gt;
&lt;strong&gt;Always Verify:&lt;/strong&gt; Every access request, regardless of its source, must be verified.&lt;/li&gt;
    &lt;li&gt;
&lt;strong&gt;Principle of Least Privilege:&lt;/strong&gt; Users and devices should be granted only the minimum permissions necessary to perform their tasks.&lt;/li&gt;
    &lt;li&gt;
&lt;strong&gt;Reduce Attack Surface:&lt;/strong&gt; The attack surface should be narrowed through network segmentation and micro-segmentation.&lt;/li&gt;
    &lt;li&gt;
&lt;strong&gt;Continuous Monitoring:&lt;/strong&gt; Network traffic and user activities should be continuously monitored to detect anomalies.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These principles apply to both large and small organizations.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Zero-Trust for Small Teams: First Steps
&lt;/h3&gt;

&lt;p&gt;Large companies often use complex identity and access management (IAM) solutions, end-to-end encryption, and advanced network segmentation tools. Such solutions require a budget and a level of expertise that small teams typically don't have. So, what can we do? Here's a pragmatic starting plan:&lt;/p&gt;

&lt;h4&gt;
  
  
  1. Identity Management: The Foundation of Everything
&lt;/h4&gt;

&lt;p&gt;The first and most crucial step in zero-trust is identity management. We need to know who has access to what and verify it.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Multi-Factor Authentication (MFA):&lt;/strong&gt; This is a non-negotiable aspect of zero-trust. Relying solely on passwords is no longer sufficient. Users must use at least two different verification methods when logging into the system. Methods like mobile app approvals, SMS codes, or hardware tokens can be used. For example, I mandated Google Authenticator or Authy for my team members working on a project. This way, even a stolen password wasn't enough on its own.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Centralized Identity Provider (IdP):&lt;/strong&gt; Managing all your user accounts from a single place simplifies the enforcement of access policies. Solutions like Okta, Azure AD (Microsoft Entra ID), or LastPass can offer affordable plans for small teams. I use LastPass Business for my own VPN and some internal services. This allows me to manage accounts centrally when a new team member joins or leaves.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Role-Based Access Control (RBAC):&lt;/strong&gt; Grant users only the minimum permissions necessary to do their jobs. For instance, a developer should not have direct access to the production database. They should have a separate sandbox environment. In an internal tool I developed myself, I defined different roles: &lt;code&gt;admin&lt;/code&gt;, &lt;code&gt;developer&lt;/code&gt;, &lt;code&gt;viewer&lt;/code&gt;. These roles determine which features a user can access and which operations they can perform.&lt;/li&gt;
&lt;/ul&gt;
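
&lt;p&gt;The role-based check described in the last item can be sketched in a few lines. The three role names come from the tool mentioned above, while the permission names are hypothetical placeholders chosen for illustration. The important design choice is the default deny: anything not explicitly granted is rejected.&lt;/p&gt;

```python
# Hypothetical permission names; only the three role names come from the post.
ROLE_PERMISSIONS = {
    "admin": {"read_reports", "deploy", "manage_users"},
    "developer": {"read_reports", "deploy"},
    "viewer": {"read_reports"},
}


def is_allowed(role, permission):
    """Default deny: unknown roles and unlisted permissions are rejected."""
    return permission in ROLE_PERMISSIONS.get(role, set())


print(is_allowed("developer", "deploy"))     # True
print(is_allowed("viewer", "manage_users"))  # False
print(is_allowed("intern", "read_reports"))  # False, the role is not defined
```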

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;💡 MFA and IdP Selection&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Cost is a significant factor for small teams. Research free or low-cost MFA and IdP solutions. Many services offer free plans for basic features. The important thing is to implement them consistently.&lt;/p&gt;
&lt;/blockquote&gt;
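
&lt;p&gt;Authenticator apps such as Google Authenticator generate time-based one-time passwords (TOTP, RFC 6238). To make the mechanism less of a black box, here is a minimal sketch using only the Python standard library. It assumes HMAC-SHA1, a 30-second step, and a raw byte secret (apps normally receive the secret Base32-encoded). In practice a maintained library is the safer choice, and a real verifier usually also tolerates a small clock drift, which this sketch omits.&lt;/p&gt;

```python
import hashlib
import hmac
import time


def totp(secret, at_time=None, step=30, digits=6):
    """Compute an RFC 6238 TOTP code (HMAC-SHA1) for raw secret bytes."""
    if at_time is None:
        at_time = time.time()
    counter = int(at_time) // step
    digest = hmac.new(secret, counter.to_bytes(8, "big"), hashlib.sha1).digest()
    # Dynamic truncation (RFC 4226): the low 4 bits of the last byte pick an offset.
    offset = digest[-1] % 16
    # Take 4 bytes from that offset and drop the top bit (same as a 31-bit mask).
    number = int.from_bytes(digest[offset:offset + 4], "big") % 2**31
    return str(number % 10**digits).zfill(digits)


def verify(secret, submitted, at_time=None):
    """Compare the submitted code with the current one in constant time."""
    return hmac.compare_digest(totp(secret, at_time), submitted)


# Published RFC 6238 test secret, evaluated at T = 59 seconds.
print(totp(b"12345678901234567890", at_time=59))
```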

&lt;h4&gt;
  
  
  2. Device Security: No Device is Trusted by Default
&lt;/h4&gt;

&lt;p&gt;We must remember that every device connecting to our network can pose a potential threat. Therefore, we must also ensure the security of our devices.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Device Inventory and Status:&lt;/strong&gt; Maintain an inventory of all devices on your network (computers, servers, mobile devices). Ensure these devices have up-to-date patches, running antivirus software, and use encryption. A simple Python script I used for a project scanned active devices on the network and reported basic security checks (open ports, OS information).&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Endpoint Security:&lt;/strong&gt; Use a reliable antivirus/anti-malware solution. Modern endpoint detection and response (EDR) solutions can detect not only viruses but also suspicious behaviors. Among my preferred solutions are platforms like CrowdStrike and SentinelOne. More affordable options are also available for small teams.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Patch Management:&lt;/strong&gt; Operating systems and applications must be updated regularly. Security vulnerabilities are closed with patches. On my Ubuntu servers, I use the &lt;code&gt;unattended-upgrades&lt;/code&gt; package to ensure critical updates are installed automatically. This reduces the need for manual intervention and enhances security.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  3. Network Segmentation and Micro-Segmentation
&lt;/h4&gt;

&lt;p&gt;Dividing our network into logical parts makes it harder for an attacker to spread.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;VLAN Usage:&lt;/strong&gt; Separate different departments or functions into different VLANs. For example, isolate the guest network from the server network and the user network. This can be done even with simple switch configuration. In my previous workplace, I used VLANs to separate the production network from the office network. This prevented a ransomware attack targeting production systems from spreading to devices on the office network.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Security Groups and Access Control Lists (ACLs):&lt;/strong&gt; Clearly define which traffic is allowed to which segments using security groups and ACLs on your firewall or network devices. For example, only allow specific servers to access the database server. In a client project, I defined restrictive ACLs so that only CI/CD servers could deploy to the staging environment.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Micro-segmentation (Optional but Powerful):&lt;/strong&gt; At a more advanced level, you can isolate each workload (e.g., each server or container) with its own firewall. This is effective in complex environments but can be difficult for small teams to manage. If you use containers, you can implement micro-segmentation with solutions like Kubernetes Network Policies or Calico.&lt;/li&gt;
&lt;/ul&gt;
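
&lt;p&gt;The logic behind a restrictive ACL is an explicit allow list with everything else denied. The sketch below models that with Python's &lt;code&gt;ipaddress&lt;/code&gt; module. The networks, addresses, and ports are a hypothetical addressing plan, and a real firewall evaluates far more than these three fields.&lt;/p&gt;

```python
import ipaddress

# Hypothetical addressing plan: (source network, destination network, port).
ALLOW_RULES = [
    ("10.0.20.0/24", "10.0.30.10/32", 5432),  # app servers to the database server
    ("10.0.40.5/32", "10.0.50.0/24", 22),     # CI/CD server to staging hosts
]


def is_permitted(src_ip, dst_ip, port):
    """Default deny: traffic passes only if an explicit rule matches."""
    src = ipaddress.ip_address(src_ip)
    dst = ipaddress.ip_address(dst_ip)
    for src_net, dst_net, allowed_port in ALLOW_RULES:
        if (
            src in ipaddress.ip_network(src_net)
            and dst in ipaddress.ip_network(dst_net)
            and port == allowed_port
        ):
            return True
    return False


print(is_permitted("10.0.20.7", "10.0.30.10", 5432))  # True, app server to database
print(is_permitted("10.0.10.7", "10.0.30.10", 5432))  # False, office network is not listed
```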

&lt;h3&gt;
  
  
  Zero-Trust in Practice: Real Scenarios and Tools
&lt;/h3&gt;

&lt;p&gt;Let's look at practical tools and scenarios for implementing zero-trust, beyond theoretical knowledge.&lt;/p&gt;

&lt;h4&gt;
  
  
  1. Secure Remote Access: From VPN to ZTNA
&lt;/h4&gt;

&lt;p&gt;Traditional VPN solutions connect a user to the network and generally grant access to all network resources. In a zero-trust approach, this model changes.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;VPN Security:&lt;/strong&gt; If you're still using VPN, enforce MFA and ensure that users connecting to the VPN can only access the resources they need. Avoid split tunneling.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Zero-Trust Network Access (ZTNA):&lt;/strong&gt; ZTNA is a more granular approach than VPN. Users and devices access corporate resources not directly, but through a ZTNA broker. This broker verifies every access request and grants access only to the necessary resource. Solutions like Cloudflare Access, Palo Alto Prisma Access, or Tailscale offer ZTNA models for small teams. I use Tailscale in my own projects. It's both easy to use and a powerful ZTNA solution.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;⚠️ VPN Risks&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Standard VPN solutions, if not configured correctly, allow an attacker who has compromised a single account or device to spread rapidly within the network. The absence of MFA or excessive authorization can make VPNs a serious risk.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  2. Application and API Security
&lt;/h4&gt;

&lt;p&gt;Our applications and APIs must also comply with zero-trust principles.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;API Authorization:&lt;/strong&gt; Ensure every API request carries valid credentials (API key, OAuth token) and that those credentials are authorized for the specific operation being requested. JSON Web Tokens (JWTs) are commonly used, but storing and verifying them securely is critical. While developing the backend for an e-commerce site, I used the OAuth2 flow for all externally exposed APIs. This allowed third-party applications to access only the data they were authorized for.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Web Application Firewall (WAF):&lt;/strong&gt; Use a WAF to block common attacks like SQL injection and XSS. Cloudflare WAF is both powerful and affordable for small teams. I use it for my own blog, where it's quite effective at blocking bot attacks and known exploits.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  3. Data Access and Encryption
&lt;/h4&gt;

&lt;p&gt;Securing our data is also part of zero-trust.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Data Encryption:&lt;/strong&gt; Encrypt data both in transit and at rest. Use TLS/SSL for data in transit. For data at rest, implement encryption at the database or file system level. I encrypted sensitive fields using the &lt;code&gt;pgcrypto&lt;/code&gt; extension in my PostgreSQL database. This prevents the data from being read even if someone gains physical access to the database files, provided the encryption key is kept outside the database.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Access Logging:&lt;/strong&gt; Log in detail who accessed which data, and when. These logs are vital for detecting and analyzing potential breaches. I use &lt;code&gt;journald&lt;/code&gt;'s log rotation settings to prevent logs from consuming disk space, while forwarding important logs to a separate server.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Measurement and Continuous Improvement
&lt;/h3&gt;

&lt;p&gt;Zero-trust architecture is not a static structure; it must be continuously monitored and improved.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Log Analysis and Monitoring:&lt;/strong&gt; Regularly analyze logs to detect security incidents and anomalies. SIEM (Security Information and Event Management) tools can help with this, but simpler log collection and analysis tools can also be sufficient for small teams. Solutions like ELK Stack (Elasticsearch, Logstash, Kibana) or Graylog can be considered.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Periodic Audits:&lt;/strong&gt; Regularly review your security policies and access controls. Check the permissions of team members and remove those that are no longer needed. I once realized that after a team member left, their account remained active for a while because we didn't immediately disable it. After this mistake, I automated the process of disabling accounts for departing personnel.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Attack Simulations (Optional):&lt;/strong&gt; Small-scale penetration tests or red teaming exercises can help you proactively identify your security vulnerabilities.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;💡 Tool Recommendations for Small Teams&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
    &lt;li&gt;
&lt;strong&gt;MFA:&lt;/strong&gt; Google Authenticator, Authy, Microsoft Authenticator&lt;/li&gt;
    &lt;li&gt;
&lt;strong&gt;IdP:&lt;/strong&gt; LastPass Business, Bitwarden, Azure AD Free&lt;/li&gt;
    &lt;li&gt;
&lt;strong&gt;ZTNA:&lt;/strong&gt; Tailscale, Cloudflare Access, ZeroTier&lt;/li&gt;
    &lt;li&gt;
&lt;strong&gt;WAF:&lt;/strong&gt; Cloudflare WAF, AWS WAF&lt;/li&gt;
    &lt;li&gt;
&lt;strong&gt;Log Management:&lt;/strong&gt; journald, rsyslog, ELK Stack (for simple setups)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Many of these tools offer free or affordable options suitable for the needs of small teams.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Conclusion: Security is a Journey
&lt;/h3&gt;

&lt;p&gt;Zero-trust architecture is a journey, not a destination. For small teams, embarking on this journey might seem daunting, but understanding the core principles and proceeding step by step will significantly enhance our security. Enabling MFA, implementing strong identity management, segmenting our network, and verifying every access are the cornerstones of the zero-trust philosophy.&lt;/p&gt;

&lt;p&gt;Remember, security can never be 100%, but it's in our hands to minimize risks. Based on my own experiences, I can say that taking these steps offers solutions that are both cost-effective and operationally manageable. The important thing is to adapt to the changing threat landscape and continuously review our security strategy. In the next post, we can delve deeper into another aspect of zero-trust architecture, such as continuous monitoring and log analysis.&lt;/p&gt;

</description>
      <category>security</category>
      <category>network</category>
      <category>architecture</category>
      <category>zerotrust</category>
    </item>
    <item>
      <title>Secret Rotation: Practical Ways to Enhance Security</title>
      <dc:creator>Mustafa ERBAY</dc:creator>
      <pubDate>Fri, 29 May 2026 11:51:00 +0000</pubDate>
      <link>https://dev.to/merbayerp/secret-rotation-practical-ways-to-enhance-security-1409</link>
      <guid>https://dev.to/merbayerp/secret-rotation-practical-ways-to-enhance-security-1409</guid>
      <description>&lt;p&gt;I've seen countless times how much risk static secrets (API keys, database passwords, certificates) pose in my systems. A few years ago, on a client project, we experienced a serious security vulnerability due to an old service account's API key that was forgotten in the production environment. We had only disabled the secret instead of rotating it, and later realized the old key was still active in another system. This incident clearly showed me that secret rotation is not just a "best practice," but also a fundamental security requirement.&lt;/p&gt;

&lt;p&gt;In this post, I'll explain why secret rotation is so important, different rotation strategies, and the practical methods I've implemented in my own systems. My goal is to share the challenges I've faced and the solutions I've found to make my secret management processes more robust.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Is Secret Rotation a Critical Security Step?
&lt;/h2&gt;

&lt;p&gt;The longer any static secret remains unchanged in a system, the greater the risk of it being compromised or misused. In the event of a breach, an attacker's first target is usually these types of static credentials. If these secrets are not regularly renewed, a once-compromised key can remain valid indefinitely, creating a persistent backdoor.&lt;/p&gt;

&lt;p&gt;In my experience, especially in legacy systems or projects with rapid development, I've seen how easily secrets can be overlooked. In a production ERP, there was a database user defined for an old integration that hadn't changed in six years. This user had broad privileges, and this situation was flagged as a major risk during a cybersecurity audit. This example alone demonstrates how vital regular rotation is.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;⚠️ The Danger of Long-Lived Secrets&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Long-lived secrets can provide attackers with persistent access in the event of a breach. This makes detection difficult and increases the extent of the damage. The longer a secret's lifespan, the higher the probability of that secret being obtained and used by malicious actors.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Furthermore, human error is a significant factor. A developer might accidentally commit a secret to a code repository or leave it exposed in a log file. If that secret is subject to a rotation policy, the leaked value stops working once the next rotation happens. I remember accidentally writing an S3 bucket key to test logs while developing the backend for one of my side products. Fortunately, this key had a 30-day rotation period and was automatically renewed a few days after the incident. This limited the impact of a potential vulnerability.&lt;/p&gt;

&lt;h2&gt;
  
  
  Secret Rotation Strategies and Approaches
&lt;/h2&gt;

&lt;p&gt;There are several different ways to implement secret rotation, and each has its own advantages and disadvantages. Generally, they fall on a spectrum from manual to fully automated.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Manual Rotation
&lt;/h3&gt;

&lt;p&gt;This is the simplest method. At regular intervals (e.g., once a month), an administrator or developer manually changes the secret and updates it in all relevant systems. This approach might be feasible for small systems with few secrets. However, it's prone to human error, time-consuming, and tends to be inconsistent.&lt;/p&gt;

&lt;p&gt;I tried this method initially for one of my small side products. Every month, I'd put a note on my calendar: "Change DB password and API keys." But I remember skipping a month or two during a busy period and then getting frustrated with myself. As the scale grows or the number of secrets increases, this method becomes unsustainable. Changing a secret used by more than 10 applications on a single database server, for example, could turn into almost half a day's work.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Semi-Automated Rotation
&lt;/h3&gt;

&lt;p&gt;In this strategy, the creation or modification of the secret is automated, but its distribution or the updating of applications might still require manual intervention. For example, a script might generate a new secret, but the system administrator copies this secret to the relevant configuration files and restarts the services.&lt;/p&gt;

&lt;p&gt;On a client project, I saw that the security team automatically generated certain certificates and placed them in a repository, but distributing them to the Nginx servers and restarting the Nginx service were the responsibility of the operations team. While better than fully manual rotation, this still carried coordination and human-factor risks. I ran into the same problem on my own server with &lt;code&gt;certbot&lt;/code&gt;, which I use to renew Let's Encrypt certificates: &lt;code&gt;certbot&lt;/code&gt; would renew the certificate, but if I forgot to reload Nginx, the old certificate would remain active. That's why I combined certbot's &lt;code&gt;deploy-hook&lt;/code&gt; feature with the &lt;code&gt;ExecReload&lt;/code&gt; command in my &lt;code&gt;systemd&lt;/code&gt; unit, so that a successful renewal triggers the reload automatically.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Fully Automated Rotation
&lt;/h3&gt;

&lt;p&gt;This is the ideal approach. The creation, distribution, updating of relevant applications or services, and even the cleanup of old secrets are completely automated. This is typically achieved with a Secret Management Tool (SMT) or custom automation scripts and CI/CD processes.&lt;/p&gt;

&lt;p&gt;In a production ERP I used, database passwords, API keys, and service tokens were managed with an SMT like HashiCorp Vault. Applications would fetch updated secrets from this Vault upon startup or at regular intervals. This way, when we rotated a secret, all dependent systems could automatically receive the new secret. This significantly reduced operational overhead while strengthening the security posture. I delved into more details on [relevant: security integration in CI/CD processes].&lt;/p&gt;

&lt;h2&gt;
  
  
  Database Credentials and Rotation
&lt;/h2&gt;

&lt;p&gt;Databases typically house the most sensitive secrets of systems. Therefore, regularly rotating database credentials is one of the highest priorities. I have experience in this area, especially in projects where I worked with PostgreSQL.&lt;/p&gt;

&lt;p&gt;Changing a user's password in PostgreSQL is quite simple with the &lt;code&gt;ALTER USER&lt;/code&gt; command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;USER&lt;/span&gt; &lt;span class="n"&gt;myapp_user&lt;/span&gt; &lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;PASSWORD&lt;/span&gt; &lt;span class="s1"&gt;'new_strong_password'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;However, the real challenge is implementing this change in a live system without causing downtime. My strategy in a production ERP was as follows:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Creating a New User (Optional but Secure):&lt;/strong&gt; If needed, creating a new user with the same privileges provides a safety net for rollback scenarios.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Transition at the Application Layer:&lt;/strong&gt; How applications manage database connection pools is critical. Some connection pools (e.g., HikariCP in Java, or custom pools written with &lt;code&gt;asyncpg&lt;/code&gt; in Python) can be given new credentials at runtime, for example through a &lt;code&gt;reload&lt;/code&gt; command, so that newly opened connections use the new password. Connections that are already open keep working until they are closed. If this feature isn't available, applications might need to be restarted sequentially.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Two-Phase Rotation:&lt;/strong&gt; In some cases, I implemented a transition strategy in which both the old and the new credentials were valid for a period. Because a PostgreSQL role has only one password at a time, this overlap relies on the second user from step 1: the new credentials are defined first, then applications are switched to them. Once all applications complete the transition, the old credentials are disabled. This is particularly useful for minimizing downtime in large and complex deployments.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;&lt;code&gt;pg_hba.conf&lt;/code&gt; Management:&lt;/strong&gt; Authentication methods are defined in the &lt;code&gt;pg_hba.conf&lt;/code&gt; file. If IP-based restrictions or different authentication mechanisms are used here, these changes must also be included in the rotation plan.&lt;/li&gt;
&lt;/ol&gt;
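
&lt;p&gt;One way to picture steps 1 and 3 is an alternating pair of roles: rotate the idle role, switch the applications, then lock the old one. The sketch below only builds the SQL statements and does not connect to a database. The role names &lt;code&gt;myapp_user_a&lt;/code&gt; and &lt;code&gt;myapp_user_b&lt;/code&gt; are hypothetical, and it assumes both roles already hold the same privileges.&lt;/p&gt;

```python
import secrets


def next_role(active_role, role_pair=("myapp_user_a", "myapp_user_b")):
    """Return the role of the pair that the applications are NOT using right now."""
    first, second = role_pair
    return second if active_role == first else first


def rotation_statements(active_role):
    """Build the SQL for one rotation round: (new_role, password, statements)."""
    target = next_role(active_role)
    password = secrets.token_urlsafe(32)  # URL-safe alphabet, so no quote characters
    statements = [
        # Phase 1: give the idle role a fresh password and allow it to log in.
        f"ALTER ROLE {target} WITH LOGIN PASSWORD '{password}';",
        # Phase 2: run only after every application has switched to the new role.
        f"ALTER ROLE {active_role} WITH NOLOGIN;",
    ]
    return target, password, statements


role, _password, sql = rotation_statements("myapp_user_a")
print(role)
print(sql[1])
```

&lt;p&gt;Keeping the second statement separate matters: sessions that are already connected survive &lt;code&gt;NOLOGIN&lt;/code&gt;, but any application that has not switched yet will fail on its next new connection.&lt;/p&gt;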

&lt;p&gt;Once, while rotating the PostgreSQL password for the backend of a task management application I developed, I realized that the connection pool wasn't automatically picking up the new password. Everything worked fine after I restarted the application, but even this brief outage made me more cautious. This situation highlights the importance of understanding how each component reacts to secret rotation. I specifically automated reloading secrets using commands like &lt;code&gt;ExecReload&lt;/code&gt; or &lt;code&gt;ExecStartPost&lt;/code&gt; for &lt;code&gt;systemd&lt;/code&gt; services. I also touched upon the intricacies of database management in my post on [relevant: PostgreSQL performance tuning and WAL bloat issues].&lt;/p&gt;

&lt;h2&gt;
  
  
  API Keys and Service Tokens
&lt;/h2&gt;

&lt;p&gt;API keys and service tokens used in inter-application communication are also important categories of secrets that require regular rotation. Especially keys used for publicly exposed APIs or integrations with third-party services should be rotated more frequently as they expand the attack surface.&lt;/p&gt;

&lt;h3&gt;
  
  
  JWT and OAuth2 Tokens
&lt;/h3&gt;

&lt;p&gt;Rotation strategies for JWT (JSON Web Tokens) and OAuth2 tokens, commonly used in modern applications, are slightly different. JWTs typically have a short lifespan (minutes or hours). The crucial part is the regular rotation of the keys used to sign these tokens (HMAC secret or RSA private key).&lt;/p&gt;

&lt;p&gt;In a production ERP I used, I rotated the signing keys for JWTs used for user sessions every 30 days. This meant that even if a key was compromised, it would expire within a maximum of one month. I set up this process to happen automatically in my key management system. When a new key was generated, application services loaded it dynamically: the &lt;code&gt;ExecReload&lt;/code&gt; command in the &lt;code&gt;systemd&lt;/code&gt; units let each service pick up the new key without being stopped by a &lt;code&gt;SIGTERM&lt;/code&gt; signal, which kept the transition seamless.&lt;/p&gt;
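
&lt;p&gt;A common way to rotate signing keys without invalidating live sessions is to tag every token with a key id (&lt;code&gt;kid&lt;/code&gt;) and keep the previous key available for verification until its tokens have expired. The following is a minimal sketch of that idea with HMAC-SHA256 and the standard library only. It is not the key management system described above, the key ids are hypothetical, and expiry claims are deliberately not checked.&lt;/p&gt;

```python
import base64
import hashlib
import hmac
import json


def _b64(data):
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _unb64(text):
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def sign(payload, keys, active_kid):
    """Sign with the active key and record its id in the token header."""
    header = {"alg": "HS256", "typ": "JWT", "kid": active_kid}
    signing_input = _b64(json.dumps(header).encode()) + "." + _b64(json.dumps(payload).encode())
    mac = hmac.new(keys[active_kid], signing_input.encode("ascii"), hashlib.sha256).digest()
    return signing_input + "." + _b64(mac)


def verify(token, keys):
    """Return the payload if the key named by `kid` validates the token, else None."""
    try:
        header_part, payload_part, signature_part = token.split(".")
        header = json.loads(_unb64(header_part))
        if header["alg"] != "HS256":
            return None
        key = keys[header["kid"]]  # retired keys are simply absent from the key ring
        signature = _unb64(signature_part)
    except (ValueError, KeyError, TypeError):
        return None
    signing_input = (header_part + "." + payload_part).encode("ascii")
    expected = hmac.new(key, signing_input, hashlib.sha256).digest()
    if not hmac.compare_digest(expected, signature):
        return None
    return json.loads(_unb64(payload_part))


# Hypothetical key ring during the overlap window: old key kept, new key active.
keys = {"2026-04": b"old-signing-secret", "2026-05": b"new-signing-secret"}
token = sign({"sub": "user-1"}, keys, "2026-05")
print(verify(token, keys))
```

&lt;p&gt;Retiring a key is then just a matter of removing its entry from the key ring once the longest-lived token signed with it has expired.&lt;/p&gt;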

&lt;h3&gt;
  
  
  Third-Party API Keys
&lt;/h3&gt;

&lt;p&gt;Many applications use APIs from third-party services like Stripe, Twilio, or similar. The rotation of these API keys depends on the capabilities offered by the service provider. Typically, a new key is generated from the service provider's management panel, and the old key is deactivated.&lt;/p&gt;

&lt;p&gt;In the backend of my Android spam blocker app, I was integrated with an SMS gateway service. I needed to rotate this service's API key every 90 days. I managed this process with an automation script:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Generate a new key via the service provider's API.&lt;/li&gt;
&lt;li&gt; Check the validity of the old key.&lt;/li&gt;
&lt;li&gt; Add the new key to the application's configuration file.&lt;/li&gt;
&lt;li&gt; Restart application services or reload the configuration.&lt;/li&gt;
&lt;li&gt; Deactivate the old key.&lt;/li&gt;
&lt;/ol&gt;
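
&lt;p&gt;The five steps above can be sketched as follows. &lt;code&gt;FakeProvider&lt;/code&gt; is an in-memory stand-in, not the SMS gateway's real API, and its method names are hypothetical. The sketch also adds one assumption of its own: it confirms that the new key is accepted before touching the configuration, so a failed key creation never reaches the application.&lt;/p&gt;

```python
class FakeProvider:
    """In-memory stand-in for a provider's key-management API (hypothetical)."""

    def __init__(self):
        self.active = {"key-1"}
        self._counter = 1

    def create_key(self):
        self._counter += 1
        key = f"key-{self._counter}"
        self.active.add(key)
        return key

    def is_valid(self, key):
        return key in self.active

    def revoke_key(self, key):
        self.active.discard(key)


def rotate(provider, config, reload_app):
    """Run the rotation steps in order; the old key is revoked only at the very end."""
    old_key = config["API_KEY"]
    new_key = provider.create_key()             # 1. generate a new key
    if not provider.is_valid(new_key):
        raise RuntimeError("new key not accepted; configuration left untouched")
    if not provider.is_valid(old_key):          # 2. check the validity of the old key
        print("warning: the old key was already invalid")
    config["API_KEY"] = new_key                 # 3. put the new key into the configuration
    reload_app(config)                          # 4. restart or reload the application
    provider.revoke_key(old_key)                # 5. deactivate the old key
    return new_key


provider = FakeProvider()
config = {"API_KEY": "key-1"}
rotate(provider, config, reload_app=lambda cfg: None)
print(config["API_KEY"])
```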

&lt;p&gt;Automating this process was critical because it was very prone to being forgotten when done manually. Once, I forgot to set up this automation, and the key expired, causing SMS deliveries to stop for 6 hours. This situation showed how important automation is not just for convenience, but also for reliability.&lt;/p&gt;

&lt;h2&gt;
  
  
  Automation Tools and Processes
&lt;/h2&gt;

&lt;p&gt;Automation is indispensable for successful secret rotation. Manual operations are error-prone and don't scale. Here are some automation approaches I've used in my own systems and client projects:&lt;/p&gt;

&lt;h3&gt;
  
  
  Secret Management Tools (SMT)
&lt;/h3&gt;

&lt;p&gt;SMTs like HashiCorp Vault, AWS Secrets Manager, and Azure Key Vault offer ideal solutions for centrally managing secrets, dynamically generating them, and automating their rotation. These tools also simplify auditing and logging access to secrets by applications.&lt;/p&gt;

&lt;p&gt;On a client project, we were managing secrets for over 3000 services via HashiCorp Vault. Vault could automatically generate and rotate database credentials and API keys with specific TTL (Time-To-Live) periods. Applications would retrieve these secrets using Vault client libraries or tools like &lt;code&gt;envconsul&lt;/code&gt;. This way, when we rotated a secret, Vault automatically generated a new one, and applications would fetch this secret within minutes, ensuring a seamless transition. This type of configuration significantly reduces operational overhead, especially in microservice architectures with a large number of services.&lt;/p&gt;
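
&lt;p&gt;On the application side, "fetching the secret within minutes" usually comes down to caching a value and asking the secret store again once a TTL has elapsed. Here is a minimal sketch of that pattern. The &lt;code&gt;fetch&lt;/code&gt; callable is a hypothetical placeholder for whatever client actually talks to the secret store, and this is not how Vault's own client libraries are implemented.&lt;/p&gt;

```python
import time


class SecretCache:
    """Cache a secret and refetch it once its TTL has elapsed."""

    def __init__(self, fetch, ttl_seconds, clock=time.monotonic):
        self._fetch = fetch        # hypothetical callable that queries the secret store
        self._ttl = ttl_seconds
        self._clock = clock        # injectable so the behaviour can be tested without waiting
        self._value = None
        self._fetched_at = None

    def get(self):
        now = self._clock()
        expired = self._fetched_at is None or now - self._fetched_at >= self._ttl
        if expired:
            self._value = self._fetch()
            self._fetched_at = now
        return self._value


# Usage with a stand-in fetch function and a 5-minute TTL.
cache = SecretCache(fetch=lambda: "db-password-from-store", ttl_seconds=300)
print(cache.get())
```

&lt;p&gt;The TTL is the trade-off knob: a shorter value picks up rotated secrets faster but sends more requests to the secret store.&lt;/p&gt;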

&lt;h3&gt;
  
  
  CI/CD Integration
&lt;/h3&gt;

&lt;p&gt;CI/CD pipelines offer a powerful platform for automating secret rotation processes. Steps for creating a new secret, updating configuration files, and restarting services can be integrated into the CI/CD workflow.&lt;/p&gt;

&lt;p&gt;In the deployment process for one of my side products, I use GitLab CI. Here, there's a step that ensures a newly generated API key is automatically added to the &lt;code&gt;env&lt;/code&gt; file deployed to the &lt;code&gt;production&lt;/code&gt; environment.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# A snippet from .gitlab-ci.yml&lt;/span&gt;
&lt;span class="na"&gt;deploy_production&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;stage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;deploy&lt;/span&gt;
  &lt;span class="na"&gt;script&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;NEW_API_KEY=$(generate_new_api_key_script)&lt;/span&gt; &lt;span class="c1"&gt;# Generate new key with custom script; a plain assignment lets a failing script fail the job&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;sed -i "s|^API_KEY=.*$|API_KEY=${NEW_API_KEY}|g" .env.production&lt;/span&gt; &lt;span class="c1"&gt;# "|" as delimiter so a "/" in the key cannot break the expression&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;ssh user@prod-server "sudo systemctl reload myapp-backend.service"&lt;/span&gt;
  &lt;span class="na"&gt;only&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;master&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this example, &lt;code&gt;generate_new_api_key_script&lt;/code&gt; represents a custom, external script that generates a new key from a key management system or directly from the service's API. This approach ensures that a freshly generated secret is in place at deployment time. I cover this topic in more detail in my post on building reliable CI/CD pipelines.&lt;/p&gt;

&lt;h3&gt;
  
  
  Custom Scripts and &lt;code&gt;systemd&lt;/code&gt; Timers
&lt;/h3&gt;

&lt;p&gt;For smaller-scale systems or specific needs, I use custom shell scripts or Python scripts with &lt;code&gt;systemd&lt;/code&gt; timers for automation. For example, I use a &lt;code&gt;systemd&lt;/code&gt; timer to renew TLS certificates used for an Nginx reverse proxy and reload Nginx.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;&lt;span class="c"&gt;# /etc/systemd/system/nginx-cert-rotate.service
&lt;/span&gt;&lt;span class="nn"&gt;[Unit]&lt;/span&gt;
&lt;span class="py"&gt;Description&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;Nginx Certificate Rotation Script&lt;/span&gt;

&lt;span class="nn"&gt;[Service]&lt;/span&gt;
&lt;span class="py"&gt;Type&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;oneshot&lt;/span&gt;
&lt;span class="py"&gt;ExecStart&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;/usr/local/bin/rotate_nginx_certs.sh&lt;/span&gt;
&lt;span class="py"&gt;User&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;root&lt;/span&gt;

&lt;span class="c"&gt;# /etc/systemd/system/nginx-cert-rotate.timer
&lt;/span&gt;&lt;span class="nn"&gt;[Unit]&lt;/span&gt;
&lt;span class="py"&gt;Description&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;Run Nginx Certificate Rotation Daily&lt;/span&gt;

&lt;span class="nn"&gt;[Timer]&lt;/span&gt;
&lt;span class="py"&gt;OnCalendar&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;daily&lt;/span&gt;
&lt;span class="py"&gt;Persistent&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;

&lt;span class="nn"&gt;[Install]&lt;/span&gt;
&lt;span class="py"&gt;WantedBy&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;timers.target&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;/usr/local/bin/rotate_nginx_certs.sh&lt;/code&gt; script renews the certificates and then reloads Nginx with &lt;code&gt;sudo systemctl reload nginx&lt;/code&gt; to activate the new certificates. This is very useful, especially on bare-metal servers or when I'm not using container orchestration.&lt;/p&gt;

&lt;h2&gt;
  
  
  Challenges and Solutions of Secret Rotation
&lt;/h2&gt;

&lt;p&gt;While secret rotation offers significant security benefits, it also brings some operational challenges. Knowing these challenges beforehand and developing solution strategies is critical for a seamless transition.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Risk of Interruption and Downtime
&lt;/h3&gt;

&lt;p&gt;An incorrectly performed rotation can lead to applications being unable to access secrets, thus causing downtime. Especially in large systems, updating all components simultaneously is a challenging task.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Phased Rollouts (Blue/Green or Canary Deployments):&lt;/strong&gt; Deploying new service instances configured with new secrets and gradually shifting traffic to them.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Two-Phase Secret Policy:&lt;/strong&gt; Ensuring that both the old and new secrets are valid for a certain period. This allows applications to gradually transition to the new secret. For example, defining two different passwords for a database user or implementing a "continue to accept the old one" policy for an API key.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Connection Pool Reload:&lt;/strong&gt; Ensuring that application connection pools can dynamically reload secrets. If this isn't possible, ensure the application can pick up new secrets with a graceful restart.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once, when rotating the database password for one of my side product's services, I forgot to add the new password to the deployment pipeline. The service started but couldn't connect to the database, and I experienced a 15-minute outage. This showed how important it is to meticulously test every step of the automation.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Dependency Management
&lt;/h3&gt;

&lt;p&gt;When a secret is used by multiple applications or services, identifying and updating all dependent systems can be challenging. An old, forgotten service or cron job can cause problems after rotation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Centralized Secret Management (SMT):&lt;/strong&gt; Managing all secrets in one place makes it easier to track dependencies.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Secret Mapping:&lt;/strong&gt; Documenting which secret is used by which application or service and regularly reviewing it.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Access Control and Auditing:&lt;/strong&gt; SMTs typically log secret accesses. By analyzing these logs, we can see which services accessed which secrets and when.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;During a rotation on a production project, we discovered a reporting script from 2018 that ran in a test environment but connected to the production database. The script had stopped reporting because it never picked up the new password. Such "ghost" dependencies can only be identified through regular audits and inventorying.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Debugging and Observability
&lt;/h3&gt;

&lt;p&gt;To quickly identify and resolve issues that arise during or after rotation, it's necessary to have adequate logging and monitoring mechanisms.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Detailed Logging:&lt;/strong&gt; Log secret rotation operations and related service secret access errors in detail. Error messages should be clear and understandable.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Metrics and Alerts:&lt;/strong&gt; Collect proactive metrics and set up alerts for secret access errors, connection errors, or service outages.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Audit Logs:&lt;/strong&gt; Maintain audit logs showing who rotated or accessed which secret and when.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In my system, when I rotated the Redis password, I saw that some services were trying to connect to Redis with the old password and getting an &lt;code&gt;ERR invalid password&lt;/code&gt; error. By examining the &lt;code&gt;journald&lt;/code&gt; logs, I quickly identified this error and restarted the relevant service. In such situations, logs and metrics that surface the failure quickly significantly shorten troubleshooting time.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;ℹ️ How Often Should Rotation Occur?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The rotation period depends on the secret's sensitivity, your risk tolerance, and operational complexity. Generally, for sensitive secrets (database passwords, root API keys), 30-90 days is ideal. This period can be extended for less sensitive or short-lived tokens. However, the better the automation, the more frequently rotation can be performed.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Secret rotation is one of the cornerstones of modern system security. Transitioning from manual approaches to full automation not only increases operational efficiency but also significantly strengthens the security posture. In my 20 years of field experience, I've seen numerous projects and systems pay the price for underestimating this issue.&lt;/p&gt;

&lt;p&gt;Remember, a secret compromise may be inevitable, but by shortening the secret's lifespan, we can minimize potential damage. Automation, detailed monitoring, and well-defined processes are the keys to turning secret rotation from a dreaded task into a routine security practice. My preference is to aim for full automation wherever possible and remove the human factor from the process as much as I can. Always being prepared for things to go wrong, rather than just saying "it happens," means far fewer headaches in the long run.&lt;/p&gt;

</description>
      <category>technology</category>
      <category>security</category>
      <category>devops</category>
      <category>systemadmin</category>
    </item>
    <item>
      <title>Dependency Security: Stopping the Build or Warning?</title>
      <dc:creator>Mustafa ERBAY</dc:creator>
      <pubDate>Fri, 29 May 2026 11:01:23 +0000</pubDate>
      <link>https://dev.to/merbayerp/dependency-security-stopping-the-build-or-warning-468k</link>
      <guid>https://dev.to/merbayerp/dependency-security-stopping-the-build-or-warning-468k</guid>

&lt;p&gt;Over the years, I've encountered this dilemma many times, both in large corporate projects and in my own side projects. Both approaches have their advantages and disadvantages. As a pragmatic systems engineer, what's important to me is to keep the risk at an acceptable level without completely killing development speed. In this post, I'll share the points I consider when making this decision and the experiences I've gained in the field.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Does Dependency Security Constantly Cause Headaches?
&lt;/h2&gt;

&lt;p&gt;Dependencies in our projects are the libraries we use and their own dependencies. Modern software development is unthinkable without these packages, as writing everything from scratch is both time-consuming and inefficient. However, this convenience brings serious security risks.&lt;/p&gt;

&lt;p&gt;A few years ago, while working on the backend of an e-commerce site, we had a constantly updated stack of packages. When we ran the &lt;code&gt;npm audit&lt;/code&gt; command, the results sometimes showed 20-30 "High" level CVEs. Most of these were not directly related to our code but had infiltrated the system through transitive dependencies. This situation meant a significant potential vulnerability, especially in a publicly exposed system. Every new vulnerability in open-source libraries could directly affect our project.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;ℹ️ Transitive Dependencies&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Transitive dependencies are other libraries used by a library that your project directly uses. This layered structure makes it difficult to trace security vulnerabilities and can lead to problems emerging from unexpected places.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;One of the main reasons for this constant headache is the complexity of the dependency tree. If a library has 5-10 dependencies, and those also have their own dependencies, the chain quickly extends. Manually checking the security of each dependency is almost impossible. That's why we need automated tools, but how these tools should act becomes a critical question.&lt;/p&gt;

&lt;h2&gt;
  
  
  Stopping the Build: A Zero-Tolerance Approach to Security
&lt;/h2&gt;

&lt;p&gt;Stopping the build, or applying the "fail-fast" principle, is a zero-tolerance approach to security. In this method, when your CI/CD pipeline detects a vulnerability, it completely prevents the code from being deployed. The basic argument is: preventing vulnerable code from reaching the production environment from the outset is much cheaper than the costs that would arise later.&lt;/p&gt;

&lt;p&gt;We adopted this approach for a service we developed for an internal banking platform. The security team demanded that the build absolutely fail if any "High" or "Critical" level CVE was detected. Initially, it sounded logical: clean code, secure system. However, this led to significant friction within the development team. We were getting an average of 12 build failures a day. Most of the time, the entire deployment process would stop due to a vulnerability in a small library's function that we weren't even directly using.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Example CI/CD pipeline step (GitLab CI syntax)&lt;/span&gt;
&lt;span class="na"&gt;security_scan&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;stage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;test&lt;/span&gt;
  &lt;span class="na"&gt;script&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# npm audit exits non-zero when it finds issues at or above --audit-level,&lt;/span&gt;
    &lt;span class="c1"&gt;# so the message has to be attached with "||" on the same line&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;npm audit --omit=dev --audit-level=high || { echo "Critical or High-level dependency vulnerabilities found. Stopping the build!"; exit 1; }&lt;/span&gt;
  &lt;span class="na"&gt;allow_failure&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt; &lt;span class="c1"&gt;# This is critical, it stops the build&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The biggest advantage of this approach is that it minimizes the risk of security vulnerabilities leaking into the production environment. Every error is immediately visible and must be fixed. However, its disadvantages are also considerable. Developers can lose motivation due to constantly broken builds and develop a kind of "blindness" to security scanners. Additionally, not every dependency vulnerability poses an immediate risk, yet this approach treats them all the same.&lt;/p&gt;

&lt;h2&gt;
  
  
  Issuing a Warning: Flexibility or Risk Postponement?
&lt;/h2&gt;

&lt;p&gt;Issuing only a warning instead of stopping the build is a more flexible approach. In this scenario, dependency scanners detect and report vulnerabilities, but the CI/CD pipeline continues to run. The goal is to inform developers and provide security teams with a list to track.&lt;/p&gt;

&lt;p&gt;In one of my side projects, I initially preferred this method. I didn't want to interrupt development speed and found it unnecessary for "Medium" or "Low" level vulnerabilities to immediately stop the build. At first, everything was fine; I occasionally reviewed the warnings and fixed the critical ones. However, about 6 months later, the accumulation of over 40 medium-level CVEs made me seriously reconsider. Although most of these vulnerabilities were not directly related to my own code, together they were starting to pose a significant overall risk.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Example CI/CD pipeline step (GitLab CI syntax)&lt;/span&gt;
&lt;span class="na"&gt;security_scan&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;stage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;test&lt;/span&gt;
  &lt;span class="na"&gt;script&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Print the reminder only when npm audit actually reports something&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;npm audit --omit=dev --audit-level=info || { echo "Dependency vulnerabilities detected. Please review the report."; exit 1; }&lt;/span&gt;
  &lt;span class="na"&gt;allow_failure&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt; &lt;span class="c1"&gt;# This is important: the job is flagged with a warning, but the build is not stopped&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The main advantage of this approach is that the development flow is not interrupted. Developers are informed about security issues but are not required to make an immediate fix. This can be preferred in projects requiring rapid delivery. However, the risk is that these warnings get ignored and security debt accumulates. Eventually the accumulated warnings become "noise," and even a truly critical vulnerability can get lost in it.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;⚠️ Warnings Getting Lost&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Too many warnings, just like too many logs, can cause important information to be overlooked. Development teams may eventually start to disregard constant warnings, which can lead to serious security vulnerabilities going unnoticed.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Criticality Levels and the Role of Automated Fixes
&lt;/h2&gt;

&lt;p&gt;Not all dependency vulnerabilities are equal. Criticality levels such as "Critical," "High," "Medium," and "Low" indicate the potential impact and exploitability of a vulnerability. Taking action based on this distinction offers a more balanced approach. For example, stopping the build for a "Low" level vulnerability is usually harder to justify than stopping it only for "Critical" or "High" levels.&lt;/p&gt;

&lt;p&gt;In an ERP project for a manufacturing company, we adjusted our security policy according to these criticality levels. We decided to stop the build only for "Critical" and "High" level CVEs. This reduced the number of build failures by 75% and allowed developers to deal with fewer "false positives." For "Medium" and "Low" level vulnerabilities, we created a separate security dashboard and tracked them regularly.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;💡 Automated Remediation Bots&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Tools like Dependabot or Renovate can help remediate vulnerabilities by automatically updating your dependencies. These bots create pull requests for secure updates and reduce developer workload. However, it's important to remember that automated updates don't always work flawlessly and can sometimes lead to breaking changes.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Automated dependency updaters also play an important role in this process. These bots can automatically create a pull request for a patched version of a dependency when a new security vulnerability is detected. This significantly reduces the manual tracking work developers have to do. Because such updates can introduce breaking changes or incompatibilities, they must also pass through the CI/CD pipeline and be tested.&lt;/p&gt;
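
&lt;p&gt;As a rough illustration, a minimal Dependabot configuration for an npm project might look like the following. The weekly interval and the pull request limit are assumptions chosen for this sketch, not values taken from the projects described here.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# .github/dependabot.yml (illustrative values)&lt;/span&gt;
&lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;
&lt;span class="na"&gt;updates&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;package-ecosystem&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"npm"&lt;/span&gt;       &lt;span class="c1"&gt;# which package manager to watch&lt;/span&gt;
    &lt;span class="na"&gt;directory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"/"&lt;/span&gt;                 &lt;span class="c1"&gt;# location of package.json&lt;/span&gt;
    &lt;span class="na"&gt;schedule&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;interval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"weekly"&lt;/span&gt;           &lt;span class="c1"&gt;# assumption: how often to check for updates&lt;/span&gt;
    &lt;span class="na"&gt;open-pull-requests-limit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;    &lt;span class="c1"&gt;# assumption: cap on simultaneously open update PRs&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Capping the number of open pull requests is one way to keep the bot from becoming another source of noise; the right limit depends on how quickly the team can review updates.&lt;/p&gt;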

&lt;h2&gt;
  
  
  My Preference: A Context-Based Hybrid Approach
&lt;/h2&gt;

&lt;p&gt;Years of experience have shown me that a "one-size-fits-all" solution does not exist for dependency security. Every project, every team, and every organization has its unique risk tolerance and development culture. Therefore, my clear position is a context-based, hybrid approach.&lt;/p&gt;

&lt;p&gt;The strategy we applied in an internal banking platform was completely different from the strategy I applied in my Android spam blocker application. In the bank, even the slightest vulnerability carried significant financial and reputational risks; therefore, stopping the build for "Critical" and "High" level vulnerabilities was mandatory. In my Android application, being a less risky project, I only monitored "Medium" level vulnerabilities and periodically intervened manually.&lt;/p&gt;

&lt;p&gt;I generally implement my hybrid approach with the following steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Stop the Build for Critical and High Level Vulnerabilities:&lt;/strong&gt; I absolutely stop the CI/CD pipeline for all CVEs marked as "Critical" or "High." This is the fastest way to eliminate the most urgent and potentially most destructive risks.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Warn and Track for Medium and Low Level Vulnerabilities:&lt;/strong&gt; I do not stop the build for "Medium" and "Low" level vulnerabilities. Instead, I track these vulnerabilities on a separate security dashboard (e.g., via Slack integration or Jira tickets). This keeps developers informed without disrupting their flow.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Use Automated Updates:&lt;/strong&gt; I try to automatically integrate patched dependency versions using tools like Dependabot or Renovate. These pull requests pass through the test pipeline like other code changes.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Periodic Manual Review and Risk Assessment:&lt;/strong&gt; Every quarter or before a major release, I manually review accumulated "Medium" and "Low" level vulnerabilities. During this review, I assess how much risk the vulnerability poses in the project's real-world usage scenario. Sometimes a vulnerability may not affect the module we are using, and in this case, it can be added to an exception list.&lt;/li&gt;
&lt;/ol&gt;
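
&lt;p&gt;Steps 1 and 2 above can be sketched as two GitLab CI jobs. This is a minimal sketch rather than the exact configuration from any of the projects mentioned; the job names and the &lt;code&gt;audit-report.json&lt;/code&gt; file name are assumptions made for illustration.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Step 1: block the pipeline on High and Critical findings&lt;/span&gt;
&lt;span class="na"&gt;dependency_audit_blocking&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;stage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;test&lt;/span&gt;
  &lt;span class="na"&gt;script&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Non-zero exit (job failure) only for High or Critical advisories&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;npm audit --omit=dev --audit-level=high&lt;/span&gt;
  &lt;span class="na"&gt;allow_failure&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;

&lt;span class="c1"&gt;# Step 2: report everything else without stopping the build&lt;/span&gt;
&lt;span class="na"&gt;dependency_audit_report&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;stage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;test&lt;/span&gt;
  &lt;span class="na"&gt;script&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Write the full report to a file; any finding makes this job exit non-zero&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;npm audit --omit=dev --json &amp;gt; audit-report.json&lt;/span&gt;
  &lt;span class="na"&gt;allow_failure&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;   &lt;span class="c1"&gt;# the job shows a warning, the pipeline continues&lt;/span&gt;
  &lt;span class="na"&gt;artifacts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;when&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;always&lt;/span&gt;         &lt;span class="c1"&gt;# keep the report even when the job "fails"&lt;/span&gt;
    &lt;span class="na"&gt;paths&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;audit-report.json&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Keeping the two concerns in separate jobs means the blocking rule stays simple, while the saved report can feed whatever dashboard or ticketing integration the team already uses.&lt;/p&gt;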

&lt;p&gt;This approach allows for a delicate balance between security and development speed. We eliminate the most critical risks and prevent developers from constantly struggling with build failures. My experience with observability in software development has repeatedly shown me how important these tracking processes are. Similarly, I detailed these automation steps in a post I wrote on CI/CD pipeline security.&lt;/p&gt;

&lt;h2&gt;
  
  
  Considerations and Metrics in Practice
&lt;/h2&gt;

&lt;p&gt;When implementing a hybrid dependency security strategy, there are a few important points to consider. First, the "false positive" rate needs to be managed well. Sometimes security scanners can issue warnings for situations that do not actually pose a risk. In such cases, it is important to carefully evaluate whether the vulnerability is truly exploitable in the project's context and, if necessary, add it to an exception list. However, these exceptions must be used very carefully and documented.&lt;/p&gt;

&lt;p&gt;In an ERP for a manufacturing company, we received a "false positive" warning for a "Medium" level CVE in a specific library for 6 months. Constantly seeing the same warning caused the team to become desensitized to other critical warnings. Then we realized that in our use case, this did not pose a risk because we never called the vulnerable function. In such situations, creating a decision log and clearly stating why an exception was made is vital.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;🔥 Exception Lists and Risks&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;While exception lists are useful for managing "false positive" situations, they can create security gaps if misused. Every exception should be made with a detailed risk assessment and security team approval, and also reviewed regularly.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Second, it's important to track the right metrics to measure security performance. Some key metrics I track include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;New Vulnerabilities Per Sprint:&lt;/strong&gt; The number of new "Critical" and "High" level vulnerabilities detected at the end of each sprint.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Critical Vulnerability Mean Time To Resolve (MTTR):&lt;/strong&gt; The average time from detection to resolution of a "Critical" or "High" marked vulnerability.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Build Failure Rate Due to Security:&lt;/strong&gt; The percentage of builds that fail due to security vulnerabilities, relative to the total number of builds.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Security Debt:&lt;/strong&gt; The total number of accumulated "Medium" and "Low" level vulnerabilities.&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric Name&lt;/th&gt;
&lt;th&gt;Definition&lt;/th&gt;
&lt;th&gt;Target Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;New Critical Vulnerabilities / Sprint&lt;/td&gt;
&lt;td&gt;Number of new Critical/High vulnerabilities emerging each sprint&lt;/td&gt;
&lt;td&gt;&amp;lt; 1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Critical Vulnerability MTTR&lt;/td&gt;
&lt;td&gt;Time from detection to remediation of a Critical/High vulnerability&lt;/td&gt;
&lt;td&gt;&amp;lt; 24 hours&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Security Build Failure Rate&lt;/td&gt;
&lt;td&gt;Ratio of builds failing due to security scans&lt;/td&gt;
&lt;td&gt;&amp;lt; 5%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Security Debt (Medium/Low)&lt;/td&gt;
&lt;td&gt;Total number of accumulated Medium/Low vulnerabilities&lt;/td&gt;
&lt;td&gt;&amp;lt; 50 (varies by project)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;These metrics allow us to see trends over time and understand whether our security posture is improving. For example, if the critical vulnerability resolution time is increasing, this could indicate a workload issue for the security team or developers.&lt;/p&gt;

&lt;p&gt;Finally, creating and maintaining an &lt;code&gt;SBOM&lt;/code&gt; (Software Bill of Materials) provides transparency in dependency security. An SBOM is a list of all dependencies used in your project and their versions. This list helps you quickly identify which of your projects are affected when a new CVE is published.&lt;/p&gt;
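
&lt;p&gt;One possible way to produce an SBOM in the pipeline is sketched below. It assumes an npm release recent enough to ship the &lt;code&gt;npm sbom&lt;/code&gt; command; the job name and the output file name are made up for this example.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Illustrative GitLab CI job that stores an SBOM as a build artifact&lt;/span&gt;
&lt;span class="na"&gt;generate_sbom&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;stage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;test&lt;/span&gt;
  &lt;span class="na"&gt;script&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# npm sbom prints the document to stdout; cyclonedx and spdx are the supported formats&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;npm sbom --sbom-format cyclonedx &amp;gt; sbom.cdx.json&lt;/span&gt;
  &lt;span class="na"&gt;artifacts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;paths&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;sbom.cdx.json&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Archiving one SBOM per build makes it possible to search past releases for a package name and version when a new CVE is announced.&lt;/p&gt;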

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Dependency security is an inevitable reality of modern software projects. Choosing between stopping the build or just issuing a warning depends on the project's context, risk tolerance, and team culture. In my experience, the most effective way is to implement a hybrid strategy that balances these two approaches.&lt;/p&gt;

&lt;p&gt;Showing zero tolerance for critical vulnerabilities while providing flexibility for lower-level issues is key to both maintaining security and preserving development speed. Let's remember that security is a journey and requires continuous adaptation and learning. The important thing is to understand the risks, use the right tools, and keep the team's security awareness high.&lt;/p&gt;

</description>
      <category>dependencysecurity</category>
      <category>cicd</category>
      <category>vulnerabilitymanagement</category>
      <category>softwaresupplychain</category>
    </item>
    <item>
      <title>Eventual Consistency: 3 Decision-Making Criteria for Side Projects</title>
      <dc:creator>Mustafa ERBAY</dc:creator>
      <pubDate>Fri, 29 May 2026 10:26:33 +0000</pubDate>
      <link>https://dev.to/merbayerp/eventual-consistency-3-decision-making-criteria-for-side-projects-12en</link>
      <guid>https://dev.to/merbayerp/eventual-consistency-3-decision-making-criteria-for-side-projects-12en</guid>
      <description>&lt;p&gt;Side projects are, for me, a space to try new things on one hand, and to solve a problem in my head on the other. Generally, in these projects, we want everything to be immediate and perfect. But both our time and our money are limited. This is exactly where &lt;strong&gt;Eventual Consistency&lt;/strong&gt; becomes a lifesaver for me. Not everything needs to be consistent instantly, all the time. Sometimes, being able to say, "it'll be fine," provides critical flexibility to bring projects to life.&lt;/p&gt;

&lt;p&gt;In this post, I will explain when I prefer the Eventual Consistency approach for my own side projects, the 3 core criteria I consider when making this decision, and the experiences I've gained in this process. This isn't just a technical choice; it's also part of a philosophy on how I manage my personal resources.&lt;/p&gt;

&lt;h2&gt;
  
  
  Eventual Consistency: An Art of Balance in Life and Software
&lt;/h2&gt;

&lt;p&gt;Eventual Consistency is a model in which data in a system becomes consistent after some delay rather than instantly. When data is updated, that update might not propagate to all copies immediately; but once new updates stop arriving, all copies eventually converge to the same state. While in enterprise projects this is often associated with complex distributed system architectures, for me in side projects, this concept has a much more personal meaning.&lt;/p&gt;

&lt;p&gt;In life, not everything has to be perfect instantly. Sometimes, allowing something to mature over time, rather than rushing to finish it, yields better results. It's no different in software. In a financial calculator or a task management app that I'm developing as my own side product, it's not essential for every user to see every piece of data within milliseconds. What matters is reaching the correct result eventually. This flexibility is one of the biggest factors that allows projects to launch, especially for developers like me with limited time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Criterion 1: Data Value and My Tolerance for Latency
&lt;/h2&gt;

&lt;p&gt;The first thing I consider when deciding on Eventual Consistency is how critical the relevant data is and how much latency it can tolerate. Every piece of data has a different "value," and this value directly affects the need for consistency. For example, for instant balance information in a bank's internal platform, strong consistency is a must; even a 50-millisecond delay can cause serious problems. But in a spam blocker app I'm developing on the Android side, updating the blocked numbers list every 5 minutes wouldn't bother anyone.&lt;/p&gt;

&lt;p&gt;In my own side projects, I generally evaluate data with questions like: "How much of a problem would it create if this information was updated 10 seconds late?" or "Would a 1-minute delay in updating this information disrupt the user's workflow?". If the answer is "it wouldn't be much of a problem," then Eventual Consistency is a good candidate for me. For instance, on a page showing historical transaction records for a financial calculator on my own website, it's acceptable for the most recently added transaction to appear 1-2 seconds later. However, if the app needs to immediately process a value the user has just entered, I'd want something close to strong consistency.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;💡 Pragmatic Data Evaluation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When determining the criticality level of a piece of data, thinking in terms of the "most common scenario" rather than the "worst-case scenario" yields more realistic results for side projects. Always considering the absolute worst-case scenario often leads to unnecessary complexity.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In a production ERP system, it's critical for an order appearing on operator screens in the production planning flow to be visible instantly. In a project like that recently, when an operator finished the previous order and pressed the "complete" button, the next order had to appear on the screen immediately. There was no room for eventual consistency here, as a 5-second delay could halt the production line. But for the "weekly production reports" section of the same ERP, a 30-minute data delay wouldn't be an issue. Making this distinction forms the basis of my Eventual Consistency decisions, both in enterprise and side projects.&lt;/p&gt;

&lt;h2&gt;
  
  
  Criterion 2: Cost and Operational Overhead: My Pocket Money and My Sleep's Value
&lt;/h2&gt;

&lt;p&gt;The biggest constraints in side projects are usually budget and my personal time. Ensuring strong consistency often requires more expensive and complex infrastructure. Things like replication synchronizations, distributed locking mechanisms, and two-phase commit protocols both extend development time and increase server costs. For a solo developer like me, this burden can mean the project never gets finished.&lt;/p&gt;

&lt;p&gt;Eventual Consistency lightens this load. For example, in a microservice architecture running on my own VPS, instead of ensuring instant data consistency between services, I use asynchronous communication via a message queue (like Redis Streams or a simple PostgreSQL table). This makes the services independent of each other and prevents the entire system from crashing in case of an error. Recently, in the backend of one of my side products, I used this method to transfer data from a service processing user data to a reporting service. The operation took 2 seconds instead of 100 milliseconds, but that was an acceptable trade-off for me.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
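&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Simple in-memory message queue simulation (a real setup would use PostgreSQL or Redis)
# It illustrates the asynchronous hand-off behind Eventual Consistency; a full "outbox" would also persist the message in the same transaction as the data change.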
&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;collections&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;deque&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;MessageQueue&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;queue&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;deque&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;publish&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;queue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;] Message published: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;consume&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;queue&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;queue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;popleft&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;] Message consumed: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

&lt;span class="c1"&gt;# Usage example
&lt;/span&gt;&lt;span class="n"&gt;queue&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;MessageQueue&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Publish from Service A
&lt;/span&gt;&lt;span class="n"&gt;queue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;publish&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;action&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;profile_update&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# Simulated delay
&lt;/span&gt;
&lt;span class="c1"&gt;# Consume from Service B
&lt;/span&gt;&lt;span class="n"&gt;queue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;consume&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Output (approximate):
# [1716902400.00] Message published: {'user_id': 1, 'action': 'profile_update'}
# [1716902400.10] Message consumed: {'user_id': 1, 'action': 'profile_update'}
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This type of approach allows me to use fewer server resources (a simple background job requiring less CPU/RAM) and simplifies my development and debugging processes. Ultimately, when I don't have a primary goal like making money from side projects or reaching a large audience, the value of my sleep and the money I spend outweighs the need for instant consistency.&lt;/p&gt;
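&lt;p&gt;To make the "simple PostgreSQL table" variant more concrete, here is a minimal sketch of a table-backed outbox. It uses Python's built-in &lt;code&gt;sqlite3&lt;/code&gt; module as a stand-in for PostgreSQL so that it runs anywhere. The table names (&lt;code&gt;profiles&lt;/code&gt;, &lt;code&gt;outbox&lt;/code&gt;) and the two helper functions are hypothetical and only serve the illustration.&lt;/p&gt;

```python
import json
import sqlite3

# In-memory SQLite stands in for PostgreSQL here (assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE profiles (user_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE outbox ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, "
    "payload TEXT NOT NULL, "
    "processed INTEGER NOT NULL DEFAULT 0)"
)


def update_profile(user_id, name):
    """Service A: write the business row and the event in one transaction."""
    with conn:  # commits on success, rolls back if either statement fails
        conn.execute(
            "INSERT OR REPLACE INTO profiles (user_id, name) VALUES (?, ?)",
            (user_id, name),
        )
        event = {"user_id": user_id, "action": "profile_update"}
        conn.execute("INSERT INTO outbox (payload) VALUES (?)", (json.dumps(event),))


def drain_outbox(handler):
    """Service B (or a background job): process pending events in order."""
    rows = conn.execute(
        "SELECT id, payload FROM outbox WHERE processed = 0 ORDER BY id"
    ).fetchall()
    for row_id, payload in rows:
        handler(json.loads(payload))
        with conn:
            conn.execute("UPDATE outbox SET processed = 1 WHERE id = ?", (row_id,))
    return len(rows)


received = []
update_profile(1, "Ada")
processed_count = drain_outbox(received.append)
```

&lt;p&gt;Because the event is stored together with the data change, a crash between the two steps cannot leave the reporting side without its message; the background job simply picks it up on its next run.&lt;/p&gt;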

&lt;h2&gt;
  
  
  Criterion 3: User Experience and My Expectations
&lt;/h2&gt;

&lt;p&gt;The third criterion relates to the user experience the project aims for and my own expectations. To what extent can the users of an application (which is often me, initially) tolerate a slight delay? In which situations do they expect immediate feedback? It's important to strike a good balance here.&lt;/p&gt;

&lt;p&gt;For example, when I block a number in my own Android spam app, the blocking is expected to take effect instantly. There's no room for eventual consistency here. However, it's acceptable for the app to update the list of new spam numbers in the background with a 5-10 minute delay. Users expect this kind of operation "in the background"; they don't expect immediate feedback.&lt;/p&gt;

&lt;p&gt;Another example: Consider a user creating a new order in an ERP system for a manufacturing company. A message confirming that the order has been successfully saved to the database should appear instantly. However, the processing of this order in the backend inventory system or the shipment planning module might be delayed by 30 seconds. As long as the user is guaranteed that the order has been received, having backend processes run with eventual consistency won't cause problems.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;ℹ️ Managing Expectations with Eventual Consistency&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Clearly informing the user about how long they might wait or that a process is continuing in the background is key to managing the perceptual delays caused by eventual consistency. Messages like "Your request has been queued and will be completed shortly" increase user satisfaction.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;When I publish a blog post on my own website, it's important for it to appear "published" instantly when I save it. But it's fine if search engine indexing or the RSS feed updates 5 minutes later. This is how I manage my own expectations as a user (and also as a developer). If I need to see the result of an operation immediately, I prefer strong consistency. But if the operation is a "notification" or a "report," then eventual consistency is a reasonable option for me.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I've Learned While Implementing Eventual Consistency in Side Projects
&lt;/h2&gt;

&lt;p&gt;Although Eventual Consistency has attractive advantages, I've also faced some challenges while implementing this approach in my side projects. One of the biggest issues was determining when and how the "final state" would be guaranteed. Once, in a task management application I developed myself, I noticed that synchronization to my other devices was very slow when I completed a task. It sometimes took longer than 1 minute, leading to the dilemma of "Did I complete it or not?".&lt;/p&gt;

&lt;p&gt;To solve this, I added a simple "last_updated" timestamp mechanism and had each device check the server for this timestamp at regular intervals (e.g., every 15 seconds). If the timestamp on the server was newer than the one stored on the device, the device pulled the data. This significantly improved the user experience while preserving the system's eventual consistency model. I discussed such problems in more detail in a previous post on mobile synchronization issues.&lt;/p&gt;
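&lt;p&gt;The check itself can be sketched in a few lines. This is a simplified model, not the code from my app: the &lt;code&gt;Server&lt;/code&gt; and &lt;code&gt;Device&lt;/code&gt; classes are hypothetical, the timestamps are plain numbers assigned by the server, and the 15-second scheduling is left out.&lt;/p&gt;

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    """Hypothetical server holding the tasks and a last_updated marker."""
    last_updated: float = 0.0
    tasks: dict = field(default_factory=dict)

    def complete_task(self, task_id, now):
        self.tasks[task_id] = "done"
        self.last_updated = now  # the server, not the device, sets this value


@dataclass
class Device:
    """Hypothetical client that remembers the last state it pulled."""
    last_synced: float = 0.0
    tasks: dict = field(default_factory=dict)

    def poll(self, server):
        """Cheap timestamp check first; pull the data only if the server is newer."""
        if server.last_updated > self.last_synced:
            self.tasks = dict(server.tasks)
            self.last_synced = server.last_updated
            return True
        return False


server = Server()
laptop = Device()
server.complete_task("write-post", now=100.0)
pulled_first = laptop.poll(server)   # server is newer, so the laptop pulls
pulled_second = laptop.poll(server)  # nothing changed since the last pull
```

&lt;p&gt;In a real client, &lt;code&gt;poll&lt;/code&gt; would run on a timer, and the comparison would only ever use timestamps that came from the server.&lt;/p&gt;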

&lt;p&gt;Another important lesson was planning in advance how Eventual Consistency would behave in error situations. What happens if a process in a message queue fails? Should the message be retried? Or should it go to a "dead-letter queue"? Answering these questions upfront prevented me from waking up at midnight wondering, "Why wasn't this data updated?". In one of my side products, I wrote a simple Python script for a Redis-backed queue: it retries messages that failed to process after a waiting period and, if they still fail, logs them to a separate file. This made operations easier and reduced the risk of data loss.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Simple retry mechanism with exponential backoff (sketch)
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;process_message_with_retry&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_retries&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;retries&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="n"&gt;retries&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;max_retries&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# Message processing logic
&lt;/span&gt;            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Processing message: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="c1"&gt;# If processing fails, raise an exception
&lt;/span&gt;            &lt;span class="c1"&gt;# if random.random() &amp;lt; 0.3: # Simulate 30% error rate
&lt;/span&gt;            &lt;span class="c1"&gt;#    raise ValueError("Processing error!")
&lt;/span&gt;            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Message processed successfully: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;retries&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error processing message (&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;). Retrying &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;retries&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;max_retries&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt; &lt;span class="n"&gt;retries&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# Exponential backoff
&lt;/span&gt;    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Message failed after &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;max_retries&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; retries: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;. Sending to dead-letter queue.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;

&lt;span class="c1"&gt;# Usage:
# process_message_with_retry({"data": "critical info"})
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Practical solutions like these increase the applicability of Eventual Consistency in side projects. The key is to understand the risks and establish simple yet effective mechanisms to manage them.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion: Letting Go with the Flow and Holding Tight Where It Matters
&lt;/h2&gt;

&lt;p&gt;My approach to Eventual Consistency in side projects is more than just a technical choice; it's a reflection of a broader life philosophy. Instead of expecting everything to be perfect and instant, identifying what is truly critical and allowing flexibility for the rest ensures project progress and saves me from unnecessary stress. Data value, cost, and user expectations serve as my compass in establishing this balance.&lt;/p&gt;

&lt;p&gt;My clear stance is this: if a piece of data or an operation can fulfill its basic functionality without instant consistency and doesn't noticeably hurt the user experience, then Eventual Consistency is my default option. This gives me faster prototyping, lower operational costs, and fewer headaches. Just like in life, in software, instead of trying to control everything all the time, sometimes it's better to let things flow and only hold tight where it truly matters. In the next post, we can discuss time management in software projects.&lt;/p&gt;

</description>
      <category>life</category>
      <category>eventualconsistency</category>
      <category>architecture</category>
      <category>decisionmaking</category>
    </item>
    <item>
      <title>The Cost of Offline-First Synchronization in Mobile Applications</title>
      <dc:creator>Mustafa ERBAY</dc:creator>
      <pubDate>Fri, 29 May 2026 06:26:30 +0000</pubDate>
      <link>https://dev.to/merbayerp/the-cost-of-offline-first-synchronization-in-mobile-applications-3321</link>
      <guid>https://dev.to/merbayerp/the-cost-of-offline-first-synchronization-in-mobile-applications-3321</guid>
      <description>&lt;p&gt;The cost of offline-first synchronization in mobile applications is not just incurred during the software development phase; it's an operational bill that emerges when you reach thousands of users in a production environment. Often, this journey begins with the request, "Let the user work offline too," and can turn into a full-blown engineering nightmare involving local database management on the device, packet loss in the network layer, and data consistency issues on the server side. I've paid these hidden costs myself, both in mobile projects I've developed and in teams I've consulted for.&lt;/p&gt;

&lt;p&gt;In this post, I will delve into the real technical burdens of the offline-first architecture, from a mobile app's local database layer to server-side conflict resolution algorithms, using concrete data. You will clearly see what trade-offs you need to consider before saying, "Let's just build it."&lt;/p&gt;




&lt;h2&gt;
  
  
  The Invisible Burden of Local Data Storage and Schema Management
&lt;/h2&gt;

&lt;p&gt;The heart of an offline-capable mobile application is the local database running within the device. Solutions based on SQLite (Room, Writable SQLite) or NoSQL alternatives (Isar, Hive) are commonly preferred. However, when you reach 25,000 active devices in a production environment, schema migrations for these local databases become a full-blown operational risk.&lt;/p&gt;

&lt;p&gt;While you can update a server-side database with a single live deployment, you cannot arbitrarily update the schema of the database on a user's phone. A user might not have updated your app for 6 months and could jump directly from version v1.0.2 to v2.1.0. In this scenario, the migration scripts you write must work flawlessly; otherwise, the local database will become corrupted, and all local data not yet synchronized on the user's device will be lost.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Example of a v3 migration on SQLite - Adding a new column without losing local data&lt;/span&gt;
&lt;span class="c1"&gt;-- If a user jumps from v1 to v3, all intermediate paths (1-&amp;gt;2, 2-&amp;gt;3) must be defined.&lt;/span&gt;
&lt;span class="k"&gt;BEGIN&lt;/span&gt; &lt;span class="n"&gt;TRANSACTION&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="n"&gt;local_orders_new&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;amount&lt;/span&gt; &lt;span class="nb"&gt;REAL&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="nb"&gt;INTEGER&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;synced&lt;/span&gt; &lt;span class="nb"&gt;INTEGER&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;discount_code&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="c1"&gt;-- New column added with v3&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;local_orders_new&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;synced&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;synced&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;local_orders&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;DROP&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;local_orders&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;local_orders_new&lt;/span&gt; &lt;span class="k"&gt;RENAME&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="n"&gt;local_orders&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;COMMIT&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
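&lt;p&gt;The "all intermediate paths must be defined" requirement can be sketched as a small migration runner. This is only an illustration under assumptions: it uses SQLite's &lt;code&gt;PRAGMA user_version&lt;/code&gt; to store the schema version, a simplified &lt;code&gt;local_orders&lt;/code&gt; table, and a hypothetical &lt;code&gt;MIGRATIONS&lt;/code&gt; map; frameworks like Room ship their own migration APIs.&lt;/p&gt;

```python
import sqlite3

# Hypothetical migration steps, keyed by the version they upgrade FROM.
MIGRATIONS = {
    1: "ALTER TABLE local_orders ADD COLUMN synced INTEGER DEFAULT 0",
    2: "ALTER TABLE local_orders ADD COLUMN discount_code TEXT",
}
LATEST_VERSION = 3


def migrate(conn):
    """Apply every missing step in order, so a v1 device can reach v3."""
    version = conn.execute("PRAGMA user_version").fetchone()[0]
    while version != LATEST_VERSION:
        conn.execute("BEGIN")
        try:
            conn.execute(MIGRATIONS[version])
            conn.execute(f"PRAGMA user_version = {version + 1}")
            conn.execute("COMMIT")
        except Exception:
            conn.execute("ROLLBACK")  # leave the old schema and data untouched
            raise
        version += 1
    return version


# isolation_level=None lets us control BEGIN/COMMIT explicitly.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE local_orders (id TEXT PRIMARY KEY, amount REAL NOT NULL)")
conn.execute("INSERT INTO local_orders (id, amount) VALUES (?, ?)", ("ord-1", 19.9))
conn.execute("PRAGMA user_version = 1")  # simulate a device still on v1

final_version = migrate(conn)
columns = [row[1] for row in conn.execute("PRAGMA table_info(local_orders)")]
```

&lt;p&gt;Each step runs in its own transaction, so a failure in the 2-to-3 step leaves the device on a valid v2 schema instead of a half-migrated one.&lt;/p&gt;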



&lt;p&gt;When designing indexes in the local database, you must also account for the limited hardware resources of a mobile device. Every unnecessary B-tree index defined in SQLite adds disk write (I/O) load to every &lt;code&gt;INSERT&lt;/code&gt; operation and directly impacts battery consumption. If the CPU consumption of database operations running in the background on Android and iOS exceeds a certain threshold, the operating system may flag your app as "resource-heavy" and force-kill it.&lt;/p&gt;




&lt;h2&gt;
  
  
  Network Packets and Protocol Choice: REST vs WebSockets vs gRPC
&lt;/h2&gt;

&lt;p&gt;In an offline-first architecture, you must optimize data exchange between the local device and the remote server. Synchronizing the entire database from scratch every time a connection is established (full sync) is not sustainable. Therefore, you need to send only changed data (delta updates). However, the choice of protocol to carry these delta packets is a significant cost item.&lt;/p&gt;

&lt;p&gt;If you attempt synchronization using a general HTTP REST API, the outgoing HTTP headers (approximately 400-800 bytes) are resent with every request, and a TLS handshake is added every time a new connection has to be opened. A device sending a small location or order status update every 15 seconds makes about 172,800 requests in a 30-day month, which amounts to roughly 70-140 MB of request headers alone, before response headers and repeated handshakes are counted.&lt;/p&gt;
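&lt;p&gt;A rough back-of-the-envelope calculation shows the order of magnitude for a single device. It counts request headers only and assumes a 30-day month; response headers and any fresh TLS handshakes come on top, and the total multiplies with the number of devices. The function name is just for illustration.&lt;/p&gt;

```python
SECONDS_PER_DAY = 24 * 60 * 60


def monthly_header_bytes(interval_seconds, header_bytes, days=30):
    """Bytes spent on request headers alone over one month."""
    requests = (SECONDS_PER_DAY // interval_seconds) * days
    return requests * header_bytes


low = monthly_header_bytes(15, 400)   # 400-byte headers, one request per 15 s
high = monthly_header_bytes(15, 800)  # 800-byte headers, one request per 15 s
print(f"{low / 1_000_000:.1f} MB to {high / 1_000_000:.1f} MB per device per month")
# prints: 69.1 MB to 138.2 MB per device per month
```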

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Protocol&lt;/th&gt;
&lt;th&gt;Average Header Size&lt;/th&gt;
&lt;th&gt;Connection Type&lt;/th&gt;
&lt;th&gt;Mobile Battery Consumption&lt;/th&gt;
&lt;th&gt;Offline Compatibility&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;HTTP REST&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;500 - 1000 bytes&lt;/td&gt;
&lt;td&gt;Stateless / Request-Response&lt;/td&gt;
&lt;td&gt;Medium - High&lt;/td&gt;
&lt;td&gt;Easy (with retry mechanisms)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;WebSockets&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~2 - 10 bytes (after handshake)&lt;/td&gt;
&lt;td&gt;Stateful / Bi-directional&lt;/td&gt;
&lt;td&gt;High (as long as connection is open)&lt;/td&gt;
&lt;td&gt;Difficult (reconnection overhead on interruptions)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;gRPC (HTTP/2)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~10 - 50 bytes (compressed)&lt;/td&gt;
&lt;td&gt;Stateful / Multiplexed&lt;/td&gt;
&lt;td&gt;Low - Medium&lt;/td&gt;
&lt;td&gt;Medium (requires client-side interceptor)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;In mobile environments where network interruptions are frequent, you also have to handle requests that were only half-completed when the connection dropped. For example, suppose the device sends 10 local records to the server and the server writes them to the database, but the network drops before the client receives a "200 OK" response. The client doesn't know whether the data reached the server, so on the next connection it sends the same data again, which leads to duplicate data on the server side. To overcome this problem, attaching a unique &lt;code&gt;idempotency-key&lt;/code&gt; to each request is essential.&lt;/p&gt;

&lt;p&gt;As discussed in my post on PostgreSQL index strategies, if you don't set up an index on the server side to look up these idempotency keys quickly, your server database will become a bottleneck as synchronization requests grow.&lt;/p&gt;
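&lt;p&gt;The lost-response scenario can be replayed with a small sketch. The &lt;code&gt;SyncServer&lt;/code&gt; class is hypothetical, and the dictionary stands in for what would be an indexed key column in the server database.&lt;/p&gt;

```python
import uuid


class SyncServer:
    """Hypothetical server that remembers which keys it has already applied."""

    def __init__(self):
        self.records = []
        self.responses_by_key = {}  # in production: an indexed table column

    def apply_batch(self, idempotency_key, batch):
        if idempotency_key in self.responses_by_key:
            # Replay of a request we already processed: return the saved result.
            return self.responses_by_key[idempotency_key]
        self.records.extend(batch)
        response = {"status": "ok", "stored": len(batch)}
        self.responses_by_key[idempotency_key] = response
        return response


server = SyncServer()
key = str(uuid.uuid4())  # generated once on the client, reused on every retry
batch = [{"order_id": n} for n in range(10)]

first = server.apply_batch(key, batch)  # imagine this response is lost in transit
retry = server.apply_batch(key, batch)  # the client retries with the SAME key
```

&lt;p&gt;The important detail is that the client generates the key once per logical operation and stores it locally, so every retry of the same batch carries the same key.&lt;/p&gt;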




&lt;h2&gt;
  
  
  The Conflict Resolution Predicament
&lt;/h2&gt;

&lt;p&gt;What happens when two different devices make offline changes to the same data and then connect to the internet simultaneously? This is the hardest technical problem of the offline-first architecture. While conflict resolution strategies seem very easy in theory, in practice they can lead to data loss or inconsistencies.&lt;/p&gt;

&lt;p&gt;Let's examine three of the most common conflict resolution methods and their real-world costs:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Last-Write-Wins (LWW):&lt;/strong&gt; The data from the last writer is accepted. It relies on device timestamps. However, mobile device clocks can be changed by the user or drift away from network-synchronized time (NTP). An older change (v1.1.0) from a device whose clock runs 5 minutes fast could overwrite the newer, current data (v1.1.1).&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Merge:&lt;/strong&gt; Conflicting fields are merged on a field-by-field basis. For instance, if user A changed the order description and user B changed the quantity, both changes are applied. However, this can break business logic (e.g., the old description might become invalid because the quantity changed).&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Conflict-Free Replicated Data Types (CRDT):&lt;/strong&gt; These are data structures that mathematically do not produce conflicts (e.g., PN-Counter or LWW-Element-Set). They are extremely complex to develop and create significant memory (RAM) and CPU load on the mobile device.&lt;/li&gt;
&lt;/ol&gt;
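&lt;p&gt;To give a feel for the CRDT option, here is a minimal PN-Counter sketch. It is a teaching example with hypothetical replica names, not a production library: each replica tracks its own increments and decrements, and merging takes the per-replica maximum.&lt;/p&gt;

```python
class PNCounter:
    """Counter CRDT: per-replica increment (P) and decrement (N) totals."""

    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.p = {}
        self.n = {}

    def increment(self, amount=1):
        self.p[self.replica_id] = self.p.get(self.replica_id, 0) + amount

    def decrement(self, amount=1):
        self.n[self.replica_id] = self.n.get(self.replica_id, 0) + amount

    def value(self):
        return sum(self.p.values()) - sum(self.n.values())

    def merge(self, other):
        """Take the per-replica maximum; order and repetition do not matter."""
        for mine, theirs in ((self.p, other.p), (self.n, other.n)):
            for replica, count in theirs.items():
                mine[replica] = max(mine.get(replica, 0), count)


phone, tablet = PNCounter("phone"), PNCounter("tablet")
phone.increment(3)   # offline edit on the phone
tablet.increment(2)  # concurrent offline edits on the tablet
tablet.decrement(1)

phone.merge(tablet)
tablet.merge(phone)
phone.merge(tablet)  # merging again changes nothing (idempotent)
```

&lt;p&gt;Both replicas end up with the same value without any coordination, but note the cost: every replica carries state for every other replica, which is where the memory overhead on mobile devices comes from.&lt;/p&gt;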

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;⚠️ Timestamp Trap&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Never rely on &lt;code&gt;Date.now()&lt;/code&gt; or &lt;code&gt;DateTime.now().toUtc()&lt;/code&gt; values generated on the client side to decide which update wins on the server. If the user manually sets their device's clock backward, your entire synchronization history can collapse. Instead of client timestamps, use a server-assigned incrementing version number (sequence number) or logical clocks such as vector clocks.&lt;/p&gt;
&lt;/blockquote&gt;
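&lt;p&gt;A version-number check can be sketched as follows. The &lt;code&gt;RecordStore&lt;/code&gt; class and &lt;code&gt;VersionConflict&lt;/code&gt; exception are hypothetical; the point is that only the server increments the version, and a write is rejected when it was based on a stale one.&lt;/p&gt;

```python
class VersionConflict(Exception):
    """Raised when the client edited a stale copy of the record."""


class RecordStore:
    """Hypothetical server-side store; the server alone bumps the version."""

    def __init__(self):
        self.rows = {}  # entity_id: (version, data)

    def read(self, entity_id):
        return self.rows.get(entity_id, (0, None))

    def write(self, entity_id, base_version, data):
        current_version, _ = self.read(entity_id)
        if base_version != current_version:
            raise VersionConflict(
                f"{entity_id}: client based on v{base_version}, server at v{current_version}"
            )
        self.rows[entity_id] = (current_version + 1, data)
        return current_version + 1


store = RecordStore()
store.write("usr_9921", 0, {"phone": "+905551112233"})  # becomes v1

# Two devices both read v1 while offline, then sync one after the other.
v_a = store.write("usr_9921", 1, {"phone": "+905554443322"})  # accepted, now v2
try:
    store.write("usr_9921", 1, {"phone": "+905550000000"})  # stale base version
    conflict_detected = False
except VersionConflict:
    conflict_detected = True
```

&lt;p&gt;No device clock is involved at any point, so a user changing the time on their phone cannot affect which write is accepted.&lt;/p&gt;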

&lt;p&gt;The following JSON payload illustrates how complex a conflict package, carried between the client and server for conflict resolution, can be:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"sync_session_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"8f9b2c3a-4d5e-6f7a-8b9c-0d1e2f3a4b5e"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"client_version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"server_version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;40&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"conflicts"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"entity_type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"customer_profile"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"entity_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"usr_9921"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"client_state"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"phone"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"+905554443322"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"updated_at"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-05-29T10:14:00Z"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"server_state"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"phone"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"+905551112233"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"updated_at"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-05-29T10:13:55Z"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"resolution_strategy"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"MANUAL_RESOLVE_REQUIRED"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Battery, CPU, and Background Sync Limits
&lt;/h2&gt;

&lt;p&gt;Mobile operating systems (iOS and Android, especially in their recent versions) are extremely aggressive towards background tasks. Shortly after the user backgrounds your application, the operating system can suspend it, tear down its network connections, and restrict its CPU usage. This dashes your dreams of silently synchronizing data in the background.&lt;/p&gt;

&lt;p&gt;On Android, you must schedule background synchronization using the &lt;code&gt;WorkManager&lt;/code&gt; API, and on iOS, using &lt;code&gt;BGAppRefreshTask&lt;/code&gt;. However, these tools do not guarantee a specific execution time. The operating system may postpone the synchronization process for hours based on the device's charging status, the connected network type (Wi-Fi or cellular data), and how frequently the user uses the app.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Configuring flexible background synchronization with Android WorkManager&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;androidx.work.*&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;java.util.concurrent.TimeUnit&lt;/span&gt;

&lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;constraints&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Constraints&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Builder&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setRequiredNetworkType&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;NetworkType&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;UNMETERED&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;// Run only on unmetered networks (typically Wi-Fi)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setRequiresBatteryNotLow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;true&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;// Do not run when battery is low&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;build&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;syncWorkRequest&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;PeriodicWorkRequestBuilder&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;SyncWorker&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;TimeUnit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;HOURS&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setConstraints&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;constraints&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setBackoffCriteria&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="nc"&gt;BackoffPolicy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;EXPONENTIAL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="nc"&gt;WorkRequest&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;MIN_BACKOFF_MILLIS&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="nc"&gt;TimeUnit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;MILLISECONDS&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;build&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="nc"&gt;WorkManager&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getInstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;enqueueUniquePeriodicWork&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s"&gt;"app_data_sync"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nc"&gt;ExistingPeriodicWorkPolicy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;KEEP&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;syncWorkRequest&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If your application writes large amounts of data to SQLite in the background, the resulting disk commits can heat up the device and drain the battery quickly. If the user opens the battery settings and sees your app at the top of the list with 25% of the consumption, they will likely uninstall it immediately. This is not just a technical cost of the offline-first architecture; it is a direct commercial cost in the form of lost users.&lt;/p&gt;




&lt;h2&gt;
  
  
  Server-Side Database and API Design
&lt;/h2&gt;

&lt;p&gt;To enable mobile devices to work offline, you must also fundamentally change your server-side architecture. Instead of a standard "get, update, save" API, you need to establish an event-driven or version-controlled database design that can track the historical evolution of each record.&lt;/p&gt;

&lt;p&gt;Tracking deleted records on the server (soft delete) is one of the most critical issues. If you physically delete a row from the database (&lt;code&gt;DELETE FROM orders WHERE id = 1&lt;/code&gt;), the offline client will never learn that the record was deleted and will continue to store it indefinitely in its local database. Therefore, you must store every deletion operation on the server side as a "Tombstone" record.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Tombstone table for soft delete and synchronization tracking on PostgreSQL&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="n"&gt;deleted_records&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="n"&gt;UUID&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="n"&gt;gen_random_uuid&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="k"&gt;table_name&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;record_id&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;deleted_at&lt;/span&gt; &lt;span class="nb"&gt;TIMESTAMP&lt;/span&gt; &lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="nb"&gt;TIME&lt;/span&gt; &lt;span class="k"&gt;ZONE&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="k"&gt;CURRENT_TIMESTAMP&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- Trigger function to log record deletion&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="k"&gt;REPLACE&lt;/span&gt; &lt;span class="k"&gt;FUNCTION&lt;/span&gt; &lt;span class="n"&gt;log_record_deletion&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;RETURNS&lt;/span&gt; &lt;span class="k"&gt;TRIGGER&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="err"&gt;$$&lt;/span&gt;
&lt;span class="k"&gt;BEGIN&lt;/span&gt;
    &lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;deleted_records&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;table_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;record_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;TG_TABLE_NAME&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;OLD&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;text&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;RETURN&lt;/span&gt; &lt;span class="k"&gt;OLD&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;END&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
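&lt;span class="err"&gt;$$&lt;/span&gt; &lt;span class="k"&gt;LANGUAGE&lt;/span&gt; &lt;span class="n"&gt;plpgsql&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Attach the function to each synchronized table (shown for orders; the trigger name is illustrative)&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TRIGGER&lt;/span&gt; &lt;span class="n"&gt;orders_log_deletion&lt;/span&gt;
&lt;span class="k"&gt;AFTER&lt;/span&gt; &lt;span class="k"&gt;DELETE&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;
&lt;span class="k"&gt;FOR&lt;/span&gt; &lt;span class="k"&gt;EACH&lt;/span&gt; &lt;span class="k"&gt;ROW&lt;/span&gt; &lt;span class="k"&gt;EXECUTE&lt;/span&gt; &lt;span class="k"&gt;FUNCTION&lt;/span&gt; &lt;span class="n"&gt;log_record_deletion&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;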
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This &lt;code&gt;deleted_records&lt;/code&gt; table will grow to millions of rows over time. With every synchronization request, mobile devices must query this table to ask, "Which records have been deleted since my last sync?" This creates a significant disk I/O and memory (RAM) load on your server-side PostgreSQL or MySQL instances. You need to set up background cron jobs or systemd timers to regularly archive or purge old tombstones and keep the table compact.&lt;/p&gt;
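&lt;p&gt;One way to keep this lookup cheap and the cleanup safe is to give every tombstone a server-assigned sequence number and have each client send the last sequence it has seen. The sketch below is an in-memory illustration under that assumption; the table above uses &lt;code&gt;deleted_at&lt;/code&gt; instead, and &lt;code&gt;TombstoneLog&lt;/code&gt; and its methods are hypothetical names. It also shows a consequence of purging: a client whose cursor is older than the purged range can no longer receive a complete delta and has to fall back to a full resync.&lt;/p&gt;

```kotlin
// Illustrative only: an in-memory model of a tombstone table keyed by sequence number.
data class Tombstone(val seq: Long, val tableName: String, val recordId: String)

class TombstoneLog(vararg initial: Tombstone) {
    private var entries = initial.sortedBy { it.seq }
    private var prunedUpTo = 0L // highest sequence number removed by cleanup

    // Returns the tombstones newer than the client's cursor, or null when the
    // cursor predates the purged range and the client must do a full resync.
    fun deletedSince(cursor: Long) =
        if (prunedUpTo > cursor) null else entries.filter { it.seq > cursor }

    // Cleanup job: drop every tombstone up to and including the given sequence.
    fun pruneUpTo(seq: Long) {
        entries = entries.filter { it.seq > seq }
        prunedUpTo = maxOf(prunedUpTo, seq)
    }
}

fun main() {
    val log = TombstoneLog(
        Tombstone(1, "orders", "1"),
        Tombstone(2, "orders", "7"),
        Tombstone(3, "customers", "usr_9921"),
    )
    println(log.deletedSince(1)?.map { it.recordId }) // [7, usr_9921]
    log.pruneUpTo(2)
    println(log.deletedSince(2)?.map { it.recordId }) // [usr_9921]
    println(log.deletedSince(1)) // null: this client needs a full resync
}
```

&lt;p&gt;How long tombstones are retained therefore determines how long a device can stay offline and still synchronize incrementally.&lt;/p&gt;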

&lt;p&gt;As discussed previously in the related Linux services section, if you do not cap the resource consumption of the services running these cleanup operations with cgroup limits, the cleanup itself can blow up the response times (latency) of your live API servers.&lt;/p&gt;




&lt;h2&gt;
  
  
  A Concrete Synchronization Engine and State Management
&lt;/h2&gt;

&lt;p&gt;Let's design the core structure of a reliable synchronization engine that will run on the mobile client, bringing together all the points discussed. This engine must implement exponential backoff for failed requests, monitor network status, and maintain transactional integrity.&lt;/p&gt;

&lt;p&gt;The following Dart/Flutter code demonstrates how to establish a resilient synchronization loop between a local SQLite database and a remote API:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight dart"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="s"&gt;'dart:async'&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="s"&gt;'dart:math'&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;

&lt;span class="kt"&gt;enum&lt;/span&gt; &lt;span class="n"&gt;SyncStatus&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;idle&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;syncing&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;error&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;SyncEngine&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;final&lt;/span&gt; &lt;span class="n"&gt;LocalDatabase&lt;/span&gt; &lt;span class="n"&gt;_db&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="kd"&gt;final&lt;/span&gt; &lt;span class="n"&gt;ApiClient&lt;/span&gt; &lt;span class="n"&gt;_api&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="n"&gt;SyncStatus&lt;/span&gt; &lt;span class="n"&gt;_status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SyncStatus&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;idle&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;_retryCount&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="n"&gt;SyncEngine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;_db&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;_api&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="n"&gt;Future&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;void&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;triggerSync&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="kd"&gt;async&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_status&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;SyncStatus&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;syncing&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;_status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SyncStatus&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;syncing&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="c1"&gt;// 1. Get records that have changed locally but not yet sent to the server&lt;/span&gt;
      &lt;span class="kd"&gt;final&lt;/span&gt; &lt;span class="n"&gt;pendingRecords&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;_db&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getUnsyncedRecords&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

      &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pendingRecords&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;isEmpty&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;_status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SyncStatus&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;idle&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="n"&gt;_retryCount&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;

      &lt;span class="c1"&gt;// 2. Send a bulk payload to the server&lt;/span&gt;
      &lt;span class="kd"&gt;final&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;_api&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;sendSyncPayload&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pendingRecords&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

      &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;statusCode&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// 3. Mark successfully synchronized records locally as 'synchronized'&lt;/span&gt;
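        &lt;span class="c1"&gt;// Copy into a typed list: a decoded JSON array is a List&amp;lt;dynamic&amp;gt;, not a List&amp;lt;String&amp;gt;&lt;/span&gt;
        &lt;span class="kd"&gt;final&lt;/span&gt; &lt;span class="n"&gt;successfulIds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;List&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;String&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;from&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'success_ids'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="kt"&gt;List&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;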
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;_db&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;markAsSynced&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;successfulIds&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

        &lt;span class="n"&gt;_retryCount&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="n"&gt;_status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SyncStatus&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;idle&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="n"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Server error: &lt;/span&gt;&lt;span class="si"&gt;${response.statusCode}&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="n"&gt;_status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SyncStatus&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;error&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
      &lt;span class="n"&gt;_handleSyncFailure&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="n"&gt;_handleSyncFailure&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;_retryCount&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="c1"&gt;// Exponential backoff: 2^retry * 1000 ms + random jitter; the exponent is capped at 6 so the delay stays bounded (about 64 s)&lt;/span&gt;
    &lt;span class="kd"&gt;final&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;backoffMs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_retryCount&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;toInt&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;Random&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;nextInt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="n"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Synchronization failed. Will retry in &lt;/span&gt;&lt;span class="si"&gt;$backoffMs&lt;/span&gt;&lt;span class="s"&gt; ms. Attempt: &lt;/span&gt;&lt;span class="si"&gt;$_retryCount&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="n"&gt;Timer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Duration&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nl"&gt;milliseconds:&lt;/span&gt; &lt;span class="n"&gt;backoffMs&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="n"&gt;triggerSync&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// Minimal interfaces and types so that the example compiles&lt;/span&gt;
&lt;span class="kd"&gt;abstract&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;LocalDatabase&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="n"&gt;Future&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;List&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;Map&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kd"&gt;dynamic&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;getUnsyncedRecords&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="n"&gt;Future&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;void&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;markAsSynced&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;List&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;String&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;ids&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kd"&gt;abstract&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ApiClient&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="n"&gt;Future&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;ApiResponse&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;sendSyncPayload&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;List&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;Map&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kd"&gt;dynamic&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ApiResponse&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;final&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;statusCode&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="kd"&gt;final&lt;/span&gt; &lt;span class="kt"&gt;Map&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kd"&gt;dynamic&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="n"&gt;ApiResponse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;statusCode&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The most critical point in this code is preventing the application from overwhelming the server after a network error or server outage. If 10,000 devices receive an error simultaneously and all retry once per second (the thundering herd problem), you will bring down your server infrastructure with your own hands. Exponential backoff spreads the retries over progressively longer intervals, and the random jitter keeps the devices from retrying in lockstep.&lt;/p&gt;




&lt;h2&gt;
  
  
  Next Step: Architecture Decision Matrix
&lt;/h2&gt;

&lt;p&gt;Before choosing an offline-first architecture for your mobile application, ask yourself the following questions and proceed according to the decision matrix below:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;How Sensitive Is the Data?&lt;/strong&gt; For data requiring 100% consistency, such as financial transactions or stock movements, do not allow offline writes. In such cases, designing the application as strictly online-only is the cheapest and safest approach.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Where Is the User Base Located?&lt;/strong&gt; If your application is used by field personnel working in subways, warehouses, or rural areas, offline-first is a must. In this case, you must include all the architectural costs mentioned above in your budget.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Do You Have Sufficient Development Resources?&lt;/strong&gt; Writing an offline-first synchronization engine requires at least three times as much testing and debugging time as writing a standard CRUD application.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Next step: Include SQLite integration tests in your CI/CD pipeline so that local database schema migrations are verified automatically.&lt;/p&gt;

</description>
      <category>tutorials</category>
    </item>
    <item>
      <title>Multi-Tenant Architecture in ERP: How to Make the Right Trade-offs?</title>
      <dc:creator>Mustafa ERBAY</dc:creator>
      <pubDate>Fri, 29 May 2026 06:13:02 +0000</pubDate>
      <link>https://dev.to/merbayerp/multi-tenant-architecture-in-erp-how-to-make-the-right-trade-offs-ai8</link>
      <guid>https://dev.to/merbayerp/multi-tenant-architecture-in-erp-how-to-make-the-right-trade-offs-ai8</guid>
      <description>&lt;p&gt;Back when I was developing a manufacturing ERP, the need arose to offer the same software to multiple customers. This inevitably brought multi-tenant architecture to the table. Although it seemed like a simple idea at first, making the right trade-offs was critical for both the technical and commercial success of the project. In this post, I will share the challenges I faced and the decisions I made while building a multi-tenant architecture in ERP systems, complete with concrete examples.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Do We Need Multi-Tenant Architecture?
&lt;/h2&gt;

&lt;p&gt;Enterprise resource planning (ERP) systems are typically complex software suites where businesses manage their core operations. Monolithic structures developed specifically for a single customer can create maintenance and update challenges over time. Especially for service providers, setting up a separate server and database for each customer is a costly and operationally difficult scenario to manage. This is exactly where multi-tenant architecture comes into play.&lt;/p&gt;

&lt;p&gt;By serving multiple customers (tenants) from a single software instance, we aim to use resources efficiently. This reduces both software development and maintenance costs and eases the operational burden. For example, if we are developing an ERP for manufacturing companies and want to serve 5 different customers, we can serve them through a single system instead of running a separate deployment process for each. This provides a huge advantage, especially in the early stages or for projects that need to scale.&lt;/p&gt;

&lt;h2&gt;
  
  
  Multi-Tenancy Approaches at the Database Level
&lt;/h2&gt;

&lt;p&gt;One&lt;/p&gt;

</description>
      <category>career</category>
      <category>erp</category>
      <category>architecture</category>
      <category>multitenant</category>
    </item>
    <item>
      <title>Switch Hardening: Always a Necessary Step?</title>
      <dc:creator>Mustafa ERBAY</dc:creator>
      <pubDate>Fri, 29 May 2026 03:22:11 +0000</pubDate>
      <link>https://dev.to/merbayerp/switch-hardening-always-a-necessary-step-2ej3</link>
      <guid>https://dev.to/merbayerp/switch-hardening-always-a-necessary-step-2ej3</guid>
      <description>&lt;h2&gt;
  
  
  Switch Hardening: A Fundamental Security Layer or an Unnecessary Burden?
&lt;/h2&gt;

&lt;p&gt;When it comes to network security, we often focus on prominent components like firewalls and intrusion detection and prevention systems (IDS/IPS). However, the switches that form the backbone of the network can also be attractive targets for attackers. Switch hardening is the practice of enhancing the security of these devices. But is it always necessary? In this post, I will examine what switch hardening is, why it can be important, and when it is truly a necessity, based on my own experiences.&lt;/p&gt;

&lt;p&gt;Over the past 10 years, especially in large enterprise networks, the security of switches has become increasingly important. Once viewed as passive devices merely forwarding packets, switches now possess more complex features and present potential attack vectors. As I've encountered in my own projects, a misconfigured switch can jeopardize the security of the entire network. Therefore, understanding the intricacies of switch hardening has become critical.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Should We Perform Switch Hardening? Potential Threats and Attack Vectors
&lt;/h2&gt;

&lt;p&gt;To understand why we need switch hardening, we must first look at the threats we face. Attackers can intercept network traffic, alter routing, or even gain access to specific parts of the network by compromising switches. Such attacks are often targeted and aim to find the network's weakest points.&lt;/p&gt;

&lt;p&gt;Attacks like DHCP spoofing, ARP poisoning, and VLAN hopping are easy to carry out against improperly configured switches. For instance, an attacker can pose as a DHCP server and hand out rogue IP addresses or gateway information to clients, which puts the attacker in the path of those clients' traffic. In my own experience, while working with the IT team of a manufacturing plant, we lost nearly an hour of production to a DHCP spoofing attack that disrupted access to operator screens. The source of the problem was so simple to find and fix that it once again showed me how critical switch hardening is.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;ℹ️ What is DHCP Snooping?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;DHCP Snooping is a Layer 2 security feature that prevents DHCP spoofing attacks by blocking DHCP server messages from untrusted ports. The switch accepts DHCP offers and responses from trusted ports while rejecting others.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Another common attack vector is VLAN hopping. Attackers can often exploit vulnerabilities in a switch's trunk ports to gain access to a VLAN they would normally not be able to reach. This is particularly used to gain access to VLANs containing sensitive data. In a penetration attempt against the backend of a financial calculator application I developed, we detected that the attacker was trying to infiltrate the network through this method. Fortunately, the attack could not progress further because the access control lists (ACLs) between VLANs were correctly configured.&lt;/p&gt;

&lt;h2&gt;
  
  
  Fundamental Steps of Switch Hardening: What Should Be Done?
&lt;/h2&gt;

&lt;p&gt;Switch hardening involves a series of configuration steps. These steps can vary depending on the switch model and manufacturer, but the general principles are similar. Firstly, disabling unused ports is the most basic step. Each port is a potential entry point, and closing unused ports eliminates this risk.&lt;/p&gt;

&lt;p&gt;In addition, restricting each port to specific MAC addresses enhances security, because only known devices can then use a particular port. MAC addresses can be spoofed, so this raises the bar rather than proving a device's identity. In a project for my own website, when I implemented this policy on the switches in the network segment where my servers are located, it immediately blocked an unauthorized device's attempt to connect to the network. While this might seem "paranoid," it is necessary, especially in critical infrastructures.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Cisco IOS example: Disabling unused ports (adjust the range to the ports that are actually unused)&lt;/span&gt;
Switch&lt;span class="o"&gt;(&lt;/span&gt;config&lt;span class="o"&gt;)&lt;/span&gt;&lt;span class="c"&gt;# interface range GigabitEthernet1/0/1-24&lt;/span&gt;
Switch&lt;span class="o"&gt;(&lt;/span&gt;config-if-range&lt;span class="o"&gt;)&lt;/span&gt;&lt;span class="c"&gt;# shutdown&lt;/span&gt;

&lt;span class="c"&gt;# MAC address filtering (with a MAC access control list)&lt;/span&gt;
&lt;span class="c"&gt;# Note: on many Catalyst platforms a MAC ACL only matches non-IP traffic&lt;/span&gt;
Switch&lt;span class="o"&gt;(&lt;/span&gt;config&lt;span class="o"&gt;)&lt;/span&gt;&lt;span class="c"&gt;# mac access-list extended ALLOWED_DEVICES&lt;/span&gt;
Switch&lt;span class="o"&gt;(&lt;/span&gt;config-ext-macl&lt;span class="o"&gt;)&lt;/span&gt;&lt;span class="c"&gt;# permit host 0011.2233.4455 any  # Allowed MAC address&lt;/span&gt;
Switch&lt;span class="o"&gt;(&lt;/span&gt;config-ext-macl&lt;span class="o"&gt;)&lt;/span&gt;&lt;span class="c"&gt;# deny any any                    # Deny all other MAC addresses (the log keyword is not accepted in MAC ACLs on many Catalyst platforms)&lt;/span&gt;
Switch&lt;span class="o"&gt;(&lt;/span&gt;config-ext-macl&lt;span class="o"&gt;)&lt;/span&gt;&lt;span class="c"&gt;# exit&lt;/span&gt;
Switch&lt;span class="o"&gt;(&lt;/span&gt;config&lt;span class="o"&gt;)&lt;/span&gt;&lt;span class="c"&gt;# interface GigabitEthernet1/0/5&lt;/span&gt;
Switch&lt;span class="o"&gt;(&lt;/span&gt;config-if&lt;span class="o"&gt;)&lt;/span&gt;&lt;span class="c"&gt;# mac access-group ALLOWED_DEVICES in&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Features like DHCP snooping, Dynamic ARP Inspection (DAI), and IP Source Guard also significantly strengthen Layer 2 security. DHCP snooping blocks DHCP server messages from untrusted ports and records which IP address was leased to which MAC address on which port. DAI uses those bindings to check the validity of ARP packets, preventing ARP poisoning attacks. IP Source Guard, in turn, checks whether traffic coming from a port matches the addresses bound to that port. Because both DAI and IP Source Guard depend on the DHCP snooping binding table (or on static bindings), DHCP snooping should be enabled first. These features are vital, especially on access switches where user devices connect.&lt;/p&gt;
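
&lt;p&gt;As a minimal sketch, these three features might be enabled on a Cisco IOS access switch roughly as follows. The VLAN number (10) and the interface names are assumptions for illustration, and exact syntax can vary by platform and software version.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Assumed for illustration: user VLAN 10, uplink on Gi1/0/48, access port Gi1/0/10
Switch(config)# ip dhcp snooping
Switch(config)# ip dhcp snooping vlan 10
Switch(config)# ip arp inspection vlan 10

# Uplink toward the legitimate DHCP server: trusted for DHCP and ARP
Switch(config)# interface GigabitEthernet1/0/48
Switch(config-if)# ip dhcp snooping trust
Switch(config-if)# ip arp inspection trust
Switch(config-if)# exit

# Access port: left untrusted; IP Source Guard filters on the snooping bindings
Switch(config)# interface GigabitEthernet1/0/10
Switch(config-if)# ip verify source
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Ports are untrusted by default, so only the uplink toward the legitimate DHCP server needs to be marked as trusted.&lt;/p&gt;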

&lt;h2&gt;
  
  
  Managing Unused Ports and Changing Default Settings
&lt;/h2&gt;

&lt;p&gt;One of the most overlooked aspects of switch security is the management of unused ports. Knowing how many ports are actively used in a network and closing unused ports significantly reduces the attack surface. Many administrators leave ports open with the thought of "it might be needed later." However, this creates a potential security vulnerability.&lt;/p&gt;

&lt;p&gt;In my own projects, especially when setting up a new network infrastructure or reviewing an existing one, I determine the purpose of each port and shut down the unnecessary ones. For example, in a data center, only the ports where servers connect are kept active, and ports accessible to users are completely isolated. Even in the network configuration of my own servers on a VPS, I apply this principle by leaving only the necessary ports open.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;⚠️ Default Passwords and Management Interfaces&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Running switches with default passwords from the manufacturer is one of the biggest security mistakes. Strong and unique passwords should be used for access to management interfaces (CLI, Web UI, SNMP), and a separate VLAN should be created for management traffic, with access to this VLAN restricted.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Changing default management passwords is also a must. Most switches come with factory default passwords that are easily found online. Immediately changing these passwords is the first step to preventing unauthorized access. Furthermore, it is recommended to use more secure versions like SNMP v3 instead of old and insecure protocols like SNMP v1/v2c, or to disable SNMP entirely if not needed.&lt;/p&gt;
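
&lt;p&gt;A minimal sketch of the SNMP part on Cisco IOS might look like this. The community name, group name, user name, and passphrases are placeholders I chose for illustration, and the available authentication and encryption options depend on the platform and software version.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Remove a v1/v2c community string if one is configured ("public" is an assumed name)
Switch(config)# no snmp-server community public

# SNMPv3 group and user with authentication and encryption (placeholder names and passphrases)
Switch(config)# snmp-server group MONITORING v3 priv
Switch(config)# snmp-server user monitor MONITORING v3 auth sha AUTH_PASSPHRASE priv aes 128 PRIV_PASSPHRASE

# Alternative: disable the SNMP agent entirely when it is not needed
Switch(config)# no snmp-server
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;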

&lt;h2&gt;
  
  
  Port Security: MAC Address Filtering and Port-Based Security
&lt;/h2&gt;

&lt;p&gt;Port security is one of the most fundamental security features of switches. It involves controlling how many MAC addresses can connect to a port and which MAC addresses are permitted. One of the most common techniques is to limit the maximum number of MAC addresses allowed on a port. For example, by allowing only one MAC address to connect to a user port, you can prevent a user from connecting multiple devices to the network.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Cisco IOS example: Port security - single MAC address allowed&lt;/span&gt;
Switch&lt;span class="o"&gt;(&lt;/span&gt;config&lt;span class="o"&gt;)&lt;/span&gt;&lt;span class="c"&gt;# interface GigabitEthernet1/0/10&lt;/span&gt;
Switch&lt;span class="o"&gt;(&lt;/span&gt;config-if&lt;span class="o"&gt;)&lt;/span&gt;&lt;span class="c"&gt;# switchport mode access&lt;/span&gt;
Switch&lt;span class="o"&gt;(&lt;/span&gt;config-if&lt;span class="o"&gt;)&lt;/span&gt;&lt;span class="c"&gt;# switchport port-security&lt;/span&gt;
Switch&lt;span class="o"&gt;(&lt;/span&gt;config-if&lt;span class="o"&gt;)&lt;/span&gt;&lt;span class="c"&gt;# switchport port-security maximum 1&lt;/span&gt;
Switch&lt;span class="o"&gt;(&lt;/span&gt;config-if&lt;span class="o"&gt;)&lt;/span&gt;&lt;span class="c"&gt;# switchport port-security violation shutdown  # Shut down port on violation&lt;/span&gt;
Switch&lt;span class="o"&gt;(&lt;/span&gt;config-if&lt;span class="o"&gt;)&lt;/span&gt;&lt;span class="c"&gt;# switchport port-security mac-address sticky # Save learned MAC&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The "sticky MAC" feature learns the first MAC address that connects to a port and saves this MAC address to the configuration. Later, if traffic arrives from this port with a different MAC address, the switch detects this as a violation. This feature is particularly effective in environments where physical access is restricted. In a customer project, when we activated this feature on switches in an office segment, we saw that an employee's attempt to connect their personal laptop to the network was blocked. This was important for ensuring compliance with company policies.&lt;/p&gt;

&lt;p&gt;Features like DAI and IP Source Guard take port security to the next level. DAI validates ARP packets to prevent ARP spoofing. IP Source Guard, on the other hand, checks if IP packets arriving from a port are consistent with the IP and MAC addresses assigned to that port. This dual protection is highly effective against common attacks like ARP poisoning. Enabling these features is a strong step towards ensuring the overall security of the network.&lt;/p&gt;

&lt;h2&gt;
  
  
  Secure Use of VLANs and Measures Against VLAN Hopping
&lt;/h2&gt;

&lt;p&gt;VLANs segment the network logically, which limits broadcast domains and enhances security. However, if VLANs are not configured correctly, they become vulnerable to VLAN hopping attacks, which let an attacker reach a VLAN they would normally not have access to. The two classic techniques are switch spoofing, where the attacker negotiates a trunk with a port that still accepts trunk negotiation, and double tagging, which abuses the untagged native VLAN on a trunk. Both depend on misconfigured or default trunk settings.&lt;/p&gt;

&lt;p&gt;To prevent such attacks, allow only the necessary VLANs on a switch's trunk ports and prune everything else. Additionally, the switch's management interface should only be accessible from specific and secure VLANs. In the network segment where the backend servers for a mobile application I developed are located, I separated servers with different functions into separate VLANs and ensured that only authorized management devices could access these VLANs. This way, an attacker who breaches one server cannot reach all the others.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;💡 What is Native VLAN?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The native VLAN is the VLAN that carries untagged traffic on 802.1Q trunk ports. By default, it is usually VLAN 1. For security purposes, it is recommended to set the native VLAN to a value different from the default and to use this VLAN only for necessary traffic.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Another important measure is the secure management of the native VLAN. The native VLAN represents traffic that is transmitted untagged on trunk ports. If the native VLAN is the default VLAN 1 and sensitive devices are present in this VLAN, it can pose a security risk. Therefore, it is important to set the native VLAN to a value different from the default and to manage this VLAN securely as well.&lt;/p&gt;
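
&lt;p&gt;Put together, a hardened trunk port on Cisco IOS might look roughly like the sketch below. The interface, the allowed VLANs (10 and 20), and the unused native VLAN (999) are assumptions for illustration; some platforms also require &lt;code&gt;switchport trunk encapsulation dot1q&lt;/code&gt; before the mode can be set to trunk.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Assumed for illustration: trunk on Gi1/0/48, VLANs 10 and 20 in use, VLAN 999 otherwise unused
Switch(config)# interface GigabitEthernet1/0/48
Switch(config-if)# switchport mode trunk
# Stop negotiating trunks dynamically (counters switch spoofing)
Switch(config-if)# switchport nonegotiate
# Carry only the VLANs that are actually needed
Switch(config-if)# switchport trunk allowed vlan 10,20
# Move the native VLAN away from VLAN 1 (counters double tagging)
Switch(config-if)# switchport trunk native vlan 999
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The native VLAN must match on both ends of the trunk, so the same change has to be made on the neighboring switch.&lt;/p&gt;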

&lt;h2&gt;
  
  
  Conclusion: Is Switch Hardening Always Necessary?
&lt;/h2&gt;

&lt;p&gt;Switch hardening is an important part of network security and is definitely necessary in many scenarios. Especially in situations where sensitive data is processed, high security requirements exist, or we want to minimize the attack surface, taking these steps is of great importance. Attacks like DHCP spoofing, ARP poisoning, and VLAN hopping can be easily carried out on improperly configured switches and can lead to serious consequences.&lt;/p&gt;

&lt;p&gt;However, not every network may require equally complex hardening steps. For small office networks or less critical infrastructures, basic security measures (changing default passwords, closing unused ports) may suffice. Activating all features can sometimes increase system complexity and make management difficult. Understanding the trade-offs is important: more security often means more management complexity.&lt;/p&gt;

&lt;p&gt;Based on my own experiences, I can say that it is always best to perform a risk assessment and determine the most appropriate level of security for the network's requirements. Switch hardening is less of a "to-do list" and more of a security culture that needs continuous review. Adapting these steps according to your network's size, the data it hosts, and the threats it might face will be the most effective approach.&lt;/p&gt;

</description>
      <category>network</category>
      <category>security</category>
      <category>switchhardening</category>
      <category>networkinfrastructure</category>
    </item>
    <item>
      <title>Cardinality Explosion: Should Every Detail Really Be Observed? And</title>
      <dc:creator>Mustafa ERBAY</dc:creator>
      <pubDate>Fri, 29 May 2026 01:50:28 +0000</pubDate>
      <link>https://dev.to/merbayerp/cardinality-explosion-should-every-detail-really-be-observed-and-5da7</link>
      <guid>https://dev.to/merbayerp/cardinality-explosion-should-every-detail-really-be-observed-and-5da7</guid>
      <description>&lt;p&gt;The metrics and logs we collect to monitor the health of our systems can sometimes create problems for us. Especially when the concept we call &lt;code&gt;cardinality&lt;/code&gt; is overlooked, a simple monitoring system can suddenly turn into a massive cost and performance issue. This situation directly affects not only the systems but also the careers and professional approaches of engineers like us working in operations and development.&lt;/p&gt;

&lt;p&gt;In this post, I will try to explain what a &lt;code&gt;cardinality&lt;/code&gt; explosion is, why it has become such a significant problem, and how we can avoid or deal with this issue when we encounter it, based on my own experiences. While the desire to observe every detail is a noble intention, it comes at a price, and anticipating this price is our responsibility as engineers.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Cardinality Explosion and Why is it Important?
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;Cardinality&lt;/code&gt; refers to the number of unique items in a dataset. In the context of monitoring systems, it means the number of distinct values that the &lt;code&gt;labels&lt;/code&gt; (tags) or &lt;code&gt;fields&lt;/code&gt; we add to a metric or log record can take. For example, the cardinality of the &lt;code&gt;status_code&lt;/code&gt; label in an HTTP request metric is low (a few values like 200, 404, 500), but the cardinality of the &lt;code&gt;request_id&lt;/code&gt; label is very high because it takes a unique value for each request.&lt;/p&gt;

&lt;p&gt;High cardinality leads to two main problems: cost and performance. Monitoring systems must store a separate time series or log record for each unique &lt;code&gt;label&lt;/code&gt; combination. Because label values combine, the number of series for a single metric can approach the product of the cardinalities of its labels. This can lead to storage bloat over time, slow queries, and even the complete collapse of the monitoring system. In my career, I've encountered many situations where alarms didn't work, dashboards wouldn't load, or bills unexpectedly increased due to such an explosion.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;⚠️ Hidden Danger&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A &lt;code&gt;cardinality&lt;/code&gt; explosion often emerges gradually as systems grow or new features are added. It might not be noticed initially, but when you suddenly see your systems slowing down or costs skyrocketing one day, the source of the problem is usually here.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This situation can spiral out of control, especially in large-scale and dynamic environments, when combined with the desire to monitor every detail. Every developer wants to see every detail of their module, and these well-intentioned requests, when combined, can paralyze the monitoring infrastructure. Therefore, understanding which details truly need to be observed and what level of &lt;code&gt;granularity&lt;/code&gt; is sufficient is critically important.&lt;/p&gt;

&lt;h2&gt;
  
  
  Real Scenarios: Where Did I Encounter It?
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;Cardinality&lt;/code&gt; explosion can manifest in different ways across various systems. I've battled this problem in both metric collection systems and log management platforms. Here are a few concrete examples:&lt;/p&gt;

&lt;h3&gt;
  
  
  High Cardinality Metrics in Prometheus
&lt;/h3&gt;

&lt;p&gt;While developing an ERP system for a manufacturing firm, we wanted to track the status of each product on the production line. Initially, we started sending separate metrics for each &lt;code&gt;product_id&lt;/code&gt; and &lt;code&gt;batch_id&lt;/code&gt;. For example: &lt;code&gt;production_status{product_id="P123", batch_id="B456", machine_id="M1"} 1&lt;/code&gt;. It was fine at first because production volume was low. However, as production increased and thousands of different &lt;code&gt;product_id&lt;/code&gt;s and hundreds of &lt;code&gt;batch_id&lt;/code&gt;s began to be produced daily, our Prometheus server's disk space and RAM usage went out of control.&lt;/p&gt;

&lt;p&gt;Prometheus's time series database (TSDB) stores a separate entry for each unique &lt;code&gt;label&lt;/code&gt; set. Due to this explosion, the &lt;code&gt;tsdb&lt;/code&gt; block size grew rapidly, and queries started taking minutes. On April 28th, the disk filled up to 100%, and a &lt;code&gt;WAL rotation&lt;/code&gt; alarm went off at 03:14. This was an operational nightmare caused by just one metric. One of the most important lessons I learned that day was not to use unique identifiers like &lt;code&gt;product_id&lt;/code&gt; as metric &lt;code&gt;labels&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Example of a PromQL query that fans out across high-cardinality labels
sum by (product_id, batch_id) (production_status)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This query returns a separate result for each unique &lt;code&gt;product_id&lt;/code&gt; and &lt;code&gt;batch_id&lt;/code&gt; combination. If there are thousands or even millions of different combinations, this query will stress Prometheus and reduce the readability of the result.&lt;/p&gt;
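
&lt;p&gt;Before running a query like that, it can help to measure how many series are actually involved. The queries below are a sketch that reuses the &lt;code&gt;production_status&lt;/code&gt; metric from the example above; the last one scans every series on the server, so it can be expensive on a large installation.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Number of active series for one metric
count(production_status)

# Number of distinct values of a single label on that metric
count(count by (product_id) (production_status))

# Ten metric names with the most series
topk(10, count by (__name__) ({__name__=~".+"}))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Tracking these counts over time makes a gradual explosion visible long before the disk fills up.&lt;/p&gt;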

&lt;h3&gt;
  
  
  Cardinality Nightmare in Log Management
&lt;/h3&gt;

&lt;p&gt;A similar situation occurred when I was managing logs on an internal platform for a bank. We were adding a unique &lt;code&gt;session_id&lt;/code&gt; and &lt;code&gt;transaction_id&lt;/code&gt; to the logs for each user request. Our goal was to easily track the entire lifecycle of a specific request. Our logging architecture was built on Elasticsearch, and this approach seemed very logical at first.&lt;/p&gt;

&lt;p&gt;However, in an environment processing millions of requests daily, these unique IDs expanded the size of Elasticsearch's indexes to unimaginable levels. Elasticsearch maintains an inverted index for each indexed &lt;code&gt;field&lt;/code&gt;, with an entry for every unique value, so high-cardinality fields consume enormous amounts of memory and disk. Within a month, the index size grew to terabytes, and even a simple &lt;code&gt;session_id&lt;/code&gt; search took over ten seconds.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"timestamp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-05-29T10:00:00Z"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"level"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"INFO"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"service"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"payment-gateway"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"message"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Payment processed successfully."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"session_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"b9a0c1d2-e3f4-5678-90ab-cdef12345678"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"transaction_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"TXY-9876543210"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"user_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"U12345"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"amount"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;100.50&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In a log entry like the one above, the &lt;code&gt;session_id&lt;/code&gt; and &lt;code&gt;transaction_id&lt;/code&gt; fields have high cardinality. Indexing these fields puts a significant load on Elasticsearch. Such situations, no matter how well-intentioned, taught me painfully that we need to think pragmatically about system design.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cost and Performance Impacts: What's Coming Out of Our Pockets?
&lt;/h2&gt;

&lt;p&gt;A &lt;code&gt;cardinality&lt;/code&gt; explosion doesn't just cause the monitoring system to slow down; it also leads to significant costs and operational overhead. These impacts are our direct responsibility as engineers, and being aware of them moves us a step forward in our careers.&lt;/p&gt;

&lt;p&gt;Storage cost is one of the most obvious impacts. Every unique time series or log record takes up disk space. The massive data piles created by high &lt;code&gt;cardinality&lt;/code&gt; can drive monthly bills with cloud providers to unexpected levels. Once, due to a poorly designed metric, our monthly monitoring cost of $500 suddenly jumped to $3000. Such a cost increase is immediately noticed by management and puts the project's budget in jeopardy.&lt;/p&gt;

&lt;p&gt;In terms of performance, slow queries are the main problem. Searching or plotting graphs on data with many unique &lt;code&gt;labels&lt;/code&gt; or &lt;code&gt;fields&lt;/code&gt; excessively consumes the CPU and RAM of database servers. This, in turn, leads to delayed alarms, extended troubleshooting processes, and general operational inefficiency. Similarly, network bandwidth can also be significantly affected, especially in distributed systems, during the transfer of these large data piles.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;ℹ️ Related: Observability and Cost Relationship&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When I previously looked into observability costs and optimization, I realized that &lt;code&gt;cardinality&lt;/code&gt; is one of the biggest multipliers in this equation. Observability is essential for "seeing" the system, but blindly collecting everything can throw us into a bottomless pit.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Operational overhead is an added burden. The monitoring system itself is a system and needs maintenance and tuning. If the monitoring system constantly causes problems due to high &lt;code&gt;cardinality&lt;/code&gt;, our team's valuable time is spent resolving these issues. This forces us to grapple with infrastructure problems instead of developing new features or focusing on more strategic tasks. As engineers, reducing this burden is our responsibility.&lt;/p&gt;

&lt;h2&gt;
  
  
  Methods for Detecting and Preventing Cardinality Explosion
&lt;/h2&gt;

&lt;p&gt;To detect and prevent &lt;code&gt;cardinality&lt;/code&gt; explosion, we need to apply different strategies in both metric and log management. In my own experiences, I've prevented many crises by actively using these methods.&lt;/p&gt;

&lt;h3&gt;
  
  
  Practical Approaches on the Metric Side
&lt;/h3&gt;

&lt;p&gt;To manage &lt;code&gt;cardinality&lt;/code&gt; in metric systems like Prometheus, there are several effective methods:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Label Limitation:&lt;/strong&gt; Choose the &lt;code&gt;labels&lt;/code&gt; you add to your metrics carefully. Avoid using high-cardinality identifiers like &lt;code&gt;request_id&lt;/code&gt;, &lt;code&gt;user_id&lt;/code&gt;, &lt;code&gt;session_id&lt;/code&gt; as &lt;code&gt;labels&lt;/code&gt;. Instead, use more general categories (e.g., &lt;code&gt;user_type&lt;/code&gt;, &lt;code&gt;request_path_group&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Label Cleaning with Regex:&lt;/strong&gt; If your &lt;code&gt;labels&lt;/code&gt; contain unnecessary or dynamic parts, normalize them, ideally in the instrumentation itself. For example, you can capture dynamic IDs in a URL path and convert them to a more general pattern before the value is ever used as a label.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Aggregation at Source:&lt;/strong&gt; When collecting metrics, aggregate them at the source whenever possible. For instance, instead of sending a separate metric for each product, send the total number of products or errors produced in a period (e.g., 1 minute). This significantly reduces &lt;code&gt;cardinality&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Metric Relabeling:&lt;/strong&gt; Prometheus's &lt;code&gt;metric_relabel_configs&lt;/code&gt; feature can rename, drop, or transform &lt;code&gt;labels&lt;/code&gt; on scraped metrics using regex before they are stored. (&lt;code&gt;relabel_configs&lt;/code&gt;, by contrast, acts on the scrape targets themselves before the scrape happens.) This is a powerful tool for controlling &lt;code&gt;cardinality&lt;/code&gt;.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Example Prometheus scrape config: transforming a high cardinality label&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;job_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;my_app'&lt;/span&gt;
  &lt;span class="na"&gt;static_configs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;targets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;localhost:8080'&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="c1"&gt;# metric_relabel_configs runs on scraped samples before they are stored&lt;/span&gt;
  &lt;span class="na"&gt;metric_relabel_configs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Collapse dynamic IDs in a 'path' label (assumed label name) into a more general path&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;source_labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;regex&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;/api/v1/users/[0-9]+/orders'&lt;/span&gt;
      &lt;span class="na"&gt;target_label&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;path&lt;/span&gt;
      &lt;span class="na"&gt;replacement&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;/api/v1/users/orders'&lt;/span&gt;
    &lt;span class="c1"&gt;# Remove a high cardinality label like 'request_id' (labeldrop matches label names)&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;action&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;labeldrop&lt;/span&gt;
      &lt;span class="na"&gt;regex&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;request_id'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the example above, by removing the &lt;code&gt;request_id&lt;/code&gt; label with &lt;code&gt;labeldrop&lt;/code&gt; or rewriting the &lt;code&gt;path&lt;/code&gt; label to a more general form, I can reduce &lt;code&gt;cardinality&lt;/code&gt;. Note that &lt;code&gt;action: drop&lt;/code&gt; would discard entire series rather than a single label. Also, if several series collapse into the same label set within one scrape, Prometheus may reject the duplicates, so fixing the labels in the instrumentation is usually the cleaner option. Such configurations are vital for protecting our monitoring infrastructure.&lt;/p&gt;

&lt;h3&gt;
  
  
  Strategies on the Log Side
&lt;/h3&gt;

&lt;p&gt;Managing &lt;code&gt;cardinality&lt;/code&gt; in log management systems requires slightly different approaches:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Caution with Structured Logging:&lt;/strong&gt; Writing logs in structured formats like JSON is great, but you don't have to index every field. High-cardinality fields (e.g., &lt;code&gt;transaction_id&lt;/code&gt;) can stay inside the &lt;code&gt;message&lt;/code&gt; text or be kept as fields with indexing disabled in the mapping. Only index fields you genuinely need to search.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Dropping Unnecessary Fields with Log Parsers:&lt;/strong&gt; When parsing logs with tools like Logstash or Fluentd, you can completely drop high-cardinality and rarely searched fields. For example, using Grok filters, you can extract only specific fields and ignore others.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Log Sampling:&lt;/strong&gt; Instead of storing all logs, you can perform sampling at a certain rate. Storing only 10% of informational logs, except for critical logs like error logs, can significantly reduce storage costs and &lt;code&gt;cardinality&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;TTL (Time To Live) Management:&lt;/strong&gt; Retention policies that determine how long logs are stored ensure that old, high-cardinality data is purged automatically. In Elasticsearch this is typically handled with index lifecycle management (ILM) policies. This helps keep index sizes under control.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight conf"&gt;&lt;code&gt;&lt;span class="c"&gt;# Example Logstash filter: Dropping high cardinality fields
&lt;/span&gt;&lt;span class="n"&gt;filter&lt;/span&gt; {
  &lt;span class="n"&gt;if&lt;/span&gt; [&lt;span class="n"&gt;type&lt;/span&gt;] == &lt;span class="s2"&gt;"application_log"&lt;/span&gt; {
    &lt;span class="c"&gt;# Remove the ID fields so they are not indexed; they remain only if the message text contains them
&lt;/span&gt;    &lt;span class="n"&gt;mutate&lt;/span&gt; {
      &lt;span class="n"&gt;remove_field&lt;/span&gt; =&amp;gt; [&lt;span class="s2"&gt;"transaction_id"&lt;/span&gt;, &lt;span class="s2"&gt;"session_id"&lt;/span&gt;]
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This Logstash filter removes the &lt;code&gt;transaction_id&lt;/code&gt; and &lt;code&gt;session_id&lt;/code&gt; fields from the log record, thus preventing Elasticsearch from creating inverted indexes for these fields. Such fine-tuning is critical to prevent accumulated cost and performance issues over time.&lt;/p&gt;
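
&lt;p&gt;If the IDs should still be visible in the stored document, an alternative to removing them is to keep the fields but disable indexing in the Elasticsearch mapping. The sketch below assumes a hypothetical index named &lt;code&gt;logs-app&lt;/code&gt;; the field names are taken from the log entry shown earlier.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;PUT logs-app
{
  "mappings": {
    "properties": {
      "transaction_id": { "type": "keyword", "index": false, "doc_values": false },
      "session_id":     { "type": "keyword", "index": false, "doc_values": false }
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;With these settings the values are still returned as part of each document, but they can no longer be searched or aggregated, so this only fits fields you read after finding the log entry by other means.&lt;/p&gt;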

&lt;h2&gt;
  
  
  Reflections on My Career: What Did I Learn?
&lt;/h2&gt;

&lt;p&gt;Battling &lt;code&gt;cardinality&lt;/code&gt; explosions has been not just a technical skill but also a significant area of professional development in my career. The lessons learned during this process have shaped many aspects, from my general system design approach to my cost awareness.&lt;/p&gt;

&lt;p&gt;First and foremost, I understood how important it is to think ahead in system design. A &lt;code&gt;label&lt;/code&gt; or &lt;code&gt;field&lt;/code&gt; that seems small today can turn into a nightmare tomorrow when millions of data points are collected. Therefore, anticipating how a system will behave under load as it grows has become one of our most valuable competencies as engineers. Asking "What will its &lt;code&gt;cardinality&lt;/code&gt; be?" before adding a new metric or log field has become a habit.&lt;/p&gt;

&lt;p&gt;Cost awareness was a direct result of these experiences. The solutions we develop must not only be technically robust but also economically sustainable. In today's world of rapidly increasing cloud costs, using resources efficiently and avoiding unnecessary expenses falls within an engineer's scope of responsibility. Now, when designing a solution, I always ask, "How much will this cost us?"&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;💡 Learning and Development&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Last month I wrote &lt;code&gt;sleep 360&lt;/code&gt; and got &lt;code&gt;OOM-killed&lt;/code&gt;, then switched to &lt;code&gt;polling-wait&lt;/code&gt;. I'm not ashamed of making mistakes; the important thing is to learn from them. &lt;code&gt;Cardinality&lt;/code&gt; explosion was also such a learning process.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Finally, my ability to explain and manage trade-offs has improved. The desire to observe every detail is understandable, but it comes at a price. Being able to explain this price clearly, even to non-technical stakeholders, and to find the best balance point is a test of an engineer's communication skills. In such situations, as I mentioned in my article on software architecture trade-offs, clearly presenting the options and their consequences is very important. This has strengthened my technical leadership and helped the team make more informed decisions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;Cardinality&lt;/code&gt; explosion is one of the most insidious and costly problems we face in the realm of observability. However, confronting this problem offers us invaluable lessons, not just technically but also professionally. When designing and managing our systems, we must consider the potential cost and performance overhead that comes with the desire to monitor every detail.&lt;/p&gt;

&lt;p&gt;Monitoring is not just a tool; it is a critical artery that keeps the pulse of our systems. We must always keep the awareness of &lt;code&gt;cardinality&lt;/code&gt; alive to avoid blocking this artery. Gaining and applying this awareness ensures that our systems run more healthily and helps engineers like us make more informed and valuable decisions. I will continue to use these lessons as a guide in future projects.&lt;/p&gt;

</description>
      <category>career</category>
      <category>observability</category>
      <category>metrics</category>
      <category>logging</category>
    </item>
    <item>
      <title>Database Index Selection: Core Approaches for Performance</title>
      <dc:creator>Mustafa ERBAY</dc:creator>
      <pubDate>Fri, 29 May 2026 00:21:22 +0000</pubDate>
      <link>https://dev.to/merbayerp/database-index-selection-core-approaches-for-performance-eh9</link>
      <guid>https://dev.to/merbayerp/database-index-selection-core-approaches-for-performance-eh9</guid>
      <description>&lt;h2&gt;
  
  
  Introduction: Why Database Index Selection Is So Important
&lt;/h2&gt;

&lt;p&gt;When optimizing a production ERP system, the slow delivery reports were a serious problem. Analyzing the database queries, I saw certain tables performing full table scans. As the data volume grew, performance dropped to unacceptable levels. That’s when I realized that proper &lt;strong&gt;database index selection&lt;/strong&gt; is not just an optimization trick—it’s the lifeline of the application.&lt;/p&gt;

&lt;p&gt;Database indexes are used to speed up your queries. Think of them like the index at the back of a book; instead of reading the entire book to find a topic, you go straight to the relevant page. In databases, indexes let you locate the data you need as quickly as possible. However, choosing the wrong index can degrade performance and also slow down data‑writing operations (INSERT, UPDATE, DELETE). In this guide we’ll examine common index types, when to use them, and how they affect performance.&lt;/p&gt;

&lt;h2&gt;
  
  
  Core Index Types and Their Use Cases
&lt;/h2&gt;

&lt;p&gt;There are many index types available in databases, but the most common are: B‑tree, Hash, GiST, GIN, and BRIN. Each has its own strengths and weaknesses. Picking the right index type directly influences query performance.&lt;/p&gt;

&lt;h3&gt;
  
  
  B‑tree Index
&lt;/h3&gt;

&lt;p&gt;B‑tree is the most widely used index type in databases. It stores data in a sorted structure, making it highly effective for equality (&lt;code&gt;=&lt;/code&gt;), less‑than (&lt;code&gt;&amp;lt;&lt;/code&gt;), greater‑than (&lt;code&gt;&amp;gt;&lt;/code&gt;), range (&lt;code&gt;BETWEEN&lt;/code&gt;), and ordering (&lt;code&gt;ORDER BY&lt;/code&gt;) operations. PostgreSQL’s default index type is B‑tree.&lt;/p&gt;

&lt;p&gt;For example, imagine we create a B‑tree index on the &lt;code&gt;email&lt;/code&gt; column of a users table. If we run a query like &lt;code&gt;WHERE email = 'test@example.com'&lt;/code&gt;, the database can locate the row directly via the index. Likewise, an &lt;code&gt;ORDER BY registration_date&lt;/code&gt; request benefits from a B‑tree index’s sorted nature, provided a separate index exists on &lt;code&gt;registration_date&lt;/code&gt;; the index on &lt;code&gt;email&lt;/code&gt; does not help with that sort.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Create a B‑tree index in PostgreSQL&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;idx_users_email&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- Example query that uses the index&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;email&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'test@example.com'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

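&lt;span class="c1"&gt;-- Ordering benefits from a B‑tree index on the sort column&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;idx_users_registration_date&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;registration_date&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;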
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;registration_date&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;registration_date&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The downside of B‑tree indexes is that they are not optimized for complex data types (e.g., JSONB or full‑text search) or geometric types. In those scenarios other index types are more appropriate. Also, the index itself occupies disk space, and whenever rows are inserted, updated, or deleted the index must be updated too, which adds a modest write‑performance penalty.&lt;/p&gt;
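
&lt;p&gt;To check whether the planner actually uses an index, the query can be run under &lt;code&gt;EXPLAIN&lt;/code&gt;. The sketch below assumes the &lt;code&gt;users&lt;/code&gt; table and the &lt;code&gt;idx_users_email&lt;/code&gt; index from the example above. The plan PostgreSQL picks depends on table size and statistics, so treat this as an illustration, not as guaranteed output.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- EXPLAIN shows the chosen plan without running the query
EXPLAIN
SELECT * FROM users WHERE email = 'test@example.com';

-- EXPLAIN (ANALYZE, BUFFERS) executes the query and reports real timings and page reads
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM users WHERE email = 'test@example.com';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;An &lt;code&gt;Index Scan using idx_users_email&lt;/code&gt; node in the plan suggests the index is being used, while a &lt;code&gt;Seq Scan&lt;/code&gt; on a large table suggests it is not.&lt;/p&gt;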

&lt;h3&gt;
  
  
  Hash Index
&lt;/h3&gt;

&lt;p&gt;Hash indexes transform the index key into a hash value using a specific function. That hash value points directly to the location of the data, making equality queries (&lt;code&gt;=&lt;/code&gt;) extremely fast. They cannot be used for range queries or ordering because the hash function does not preserve order.&lt;/p&gt;

&lt;p&gt;Suppose we create a hash index on the &lt;code&gt;product_code&lt;/code&gt; column of a &lt;code&gt;products&lt;/code&gt; table. A query like &lt;code&gt;WHERE product_code = 'XYZ123'&lt;/code&gt; can be faster than even a B‑tree index. However, queries such as &lt;code&gt;WHERE product_code LIKE 'XYZ%'&lt;/code&gt; or &lt;code&gt;WHERE product_code &amp;gt; 'ABC'&lt;/code&gt; cannot be processed efficiently with a hash index.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Create a Hash index in PostgreSQL (B‑tree is usually preferred)&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;idx_products_code_hash&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;products&lt;/span&gt; &lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="n"&gt;HASH&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;product_code&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- Effective only for equality queries&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;products&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;product_code&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'XYZ123'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;While hash indexes can be useful in specific cases, B‑tree indexes are generally more flexible and performant for general‑purpose workloads. The biggest drawback of hash indexes is that they do not support range queries due to the random distribution of data. Additionally, hash collisions (different keys mapping to the same hash) can degrade performance.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;ℹ️ Hash Index Disadvantage&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Hash indexes support no queries other than equality comparisons. Therefore they cannot be used for range queries such as &lt;code&gt;LIKE 'prefix%'&lt;/code&gt; or &lt;code&gt;&amp;lt; , &amp;gt; , BETWEEN&lt;/code&gt;. In PostgreSQL, B‑tree indexes are usually a better choice because they support both equality and range queries.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Advanced Index Types: Solutions for Special Cases
&lt;/h2&gt;

&lt;p&gt;When standard index types fall short, more specialized indexes come into play. These are optimized for particular data types or query patterns.&lt;/p&gt;

&lt;h3&gt;
  
  
  GiST (Generalized Search Tree) Index
&lt;/h3&gt;

&lt;p&gt;GiST provides a generalized search structure for various data types. It is especially useful for geometric data, full‑text search, and hierarchical data. Many PostgreSQL extensions (e.g., PostGIS) rely on GiST indexes.&lt;/p&gt;

&lt;p&gt;Imagine an application that works with a geographic information system (GIS) and needs to find all points within a certain radius. If we add a GiST index on the &lt;code&gt;coordinates&lt;/code&gt; column (a geographic point type) of a &lt;code&gt;locations&lt;/code&gt; table, we can use functions like &lt;code&gt;ST_DWithin&lt;/code&gt; to perform the query very quickly.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Create a GiST index with the PostGIS extension&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;idx_locations_coordinates&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;locations&lt;/span&gt; &lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="n"&gt;GIST&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;coordinates&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- Query on geographic data (longitude, latitude, and radius_in_meters are placeholders)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;locations&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;ST_DWithin&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;coordinates&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ST_MakePoint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;longitude&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;latitude&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;radius_in_meters&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;GiST indexes enable efficient searching on complex data types, but they can consume more disk space than B‑tree indexes and may have a larger impact on write performance. Their effectiveness depends on the data type and how the index is configured.&lt;/p&gt;

&lt;h3&gt;
  
  
  GIN (Generalized Inverted Index) Index
&lt;/h3&gt;

&lt;p&gt;GIN indexes are typically used for “multi‑value” data types such as arrays, JSONB documents, or full‑text search. A GIN index stores each unique element as a key and records which rows contain that element, allowing rapid retrieval of rows that contain a particular word or value.&lt;/p&gt;

&lt;p&gt;Consider an e‑commerce site where we need to search product descriptions or tags. If we add a GIN index on the &lt;code&gt;tags&lt;/code&gt; column (an array type) of a &lt;code&gt;products&lt;/code&gt; table, we can use the containment operator &lt;code&gt;@&amp;gt;&lt;/code&gt; to quickly find all products that carry the tag &lt;code&gt;'red'&lt;/code&gt;, or all products that carry both tags in &lt;code&gt;ARRAY['electronics', 'gadget']&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Create a GIN index in PostgreSQL (for Array or JSONB)&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;idx_products_tags&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;products&lt;/span&gt; &lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="n"&gt;GIN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tags&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- Query on an array (the GIN index supports @&amp;gt;, not = ANY)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;products&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;tags&lt;/span&gt; &lt;span class="o"&gt;@&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;ARRAY&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'red'&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;

&lt;span class="c1"&gt;-- Index and query on JSONB&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;idx_products_details&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;products&lt;/span&gt; &lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="n"&gt;GIN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;details&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;-- details is JSONB&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;products&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;details&lt;/span&gt; &lt;span class="o"&gt;@&amp;gt;&lt;/span&gt; &lt;span class="s1"&gt;'{"brand": "Acme"}'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Thanks to PostgreSQL’s JSONB support, searching complex JSON data becomes very efficient with GIN indexes. The downside is slower write performance and higher disk usage compared to B‑tree indexes, so they should be used cautiously on frequently updated data.&lt;/p&gt;
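
&lt;p&gt;If the JSONB column is only ever searched with the containment operator &lt;code&gt;@&amp;gt;&lt;/code&gt;, the &lt;code&gt;jsonb_path_ops&lt;/code&gt; operator class may be worth considering, since it typically produces a smaller index than the default operator class. A minimal sketch, assuming the &lt;code&gt;products.details&lt;/code&gt; column from the example above; the index name is hypothetical.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- jsonb_path_ops supports @&amp;gt; (containment) but not the key-existence operators (?, ?|, ?&amp;amp;)
CREATE INDEX idx_products_details_path ON products USING GIN (details jsonb_path_ops);

SELECT * FROM products WHERE details @&amp;gt; '{"brand": "Acme"}';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The trade-off is flexibility: queries that test for the existence of a key still need the default GIN operator class.&lt;/p&gt;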

&lt;h3&gt;
  
  
  BRIN (Block Range Index) Index
&lt;/h3&gt;

&lt;p&gt;BRIN indexes are designed for very large datasets and are extremely space‑efficient. They divide the table into ranges of adjacent blocks and store only a small summary for each range, by default the minimum and maximum value of the indexed column. A query can then skip every range whose summary cannot match the condition. If the data is physically ordered on disk (e.g., time‑series data), BRIN indexes can be very effective.&lt;/p&gt;

&lt;p&gt;Imagine a time‑series database where data is collected daily and stored chronologically on disk. Adding a BRIN index on the &lt;code&gt;timestamp&lt;/code&gt; column of a &lt;code&gt;sensor_readings&lt;/code&gt; table allows the database to scan only the relevant data blocks when querying a specific time range.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Create a BRIN index in PostgreSQL&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;idx_sensor_readings_timestamp&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;sensor_readings&lt;/span&gt; &lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="n"&gt;BRIN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- Time‑range query&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sensor_readings&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="nb"&gt;timestamp&lt;/span&gt; &lt;span class="k"&gt;BETWEEN&lt;/span&gt; &lt;span class="s1"&gt;'2026-05-28 00:00:00'&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="s1"&gt;'2026-05-28 23:59:59'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;BRIN indexes save disk space on massive tables (billions of rows) when the data is naturally ordered. However, they perform poorly on randomly distributed data or on tables that are updated frequently. Their effectiveness is directly tied to the physical layout of the data.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;⚠️ Things to Consider When Using BRIN Indexes&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;BRIN indexes heavily depend on the physical ordering of data on disk. If the data is frequently inserted or deleted and does not remain orderly, BRIN indexes may fail to deliver the expected performance and can even be worse than a full table scan. Therefore, consider using BRIN indexes only when the data is naturally sorted and large in volume.&lt;/p&gt;
&lt;/blockquote&gt;
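
&lt;p&gt;One way to judge whether a column is a reasonable BRIN candidate is to look at the physical-order correlation that PostgreSQL records in the &lt;code&gt;pg_stats&lt;/code&gt; view. The sketch below assumes the &lt;code&gt;sensor_readings&lt;/code&gt; table from the example above and that &lt;code&gt;ANALYZE&lt;/code&gt; has already been run on it.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- correlation near 1 or -1: values closely follow the physical row order (good for BRIN)
-- correlation near 0: values are scattered across the table (BRIN can skip few blocks)
SELECT attname, correlation
FROM pg_stats
WHERE tablename = 'sensor_readings'
  AND attname = 'timestamp';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;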

&lt;h2&gt;
  
  
  Index Selection Strategies: Finding the Right Index
&lt;/h2&gt;

&lt;p&gt;Choosing the right index is a critical part of query performance optimization. It starts with understanding which queries run most often and selecting the index type that best supports those queries.&lt;/p&gt;

&lt;h3&gt;
  
  
  Query Analysis and Planning
&lt;/h3&gt;

&lt;p&gt;The first step is to identify the most frequently executed and most time‑consuming queries. Tools like PostgreSQL’s &lt;code&gt;pg_stat_statements&lt;/code&gt; help by showing which queries consume the most CPU, I/O, or execution time.&lt;/p&gt;
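
&lt;p&gt;A query along the following lines can surface the heaviest statements. It is a sketch that assumes the &lt;code&gt;pg_stat_statements&lt;/code&gt; extension is installed and loaded, and that the server is PostgreSQL 13 or later, where the timing columns are named &lt;code&gt;total_exec_time&lt;/code&gt; and &lt;code&gt;mean_exec_time&lt;/code&gt; (older versions use &lt;code&gt;total_time&lt;/code&gt; and &lt;code&gt;mean_time&lt;/code&gt;).&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Top 10 statements by cumulative execution time (milliseconds)
SELECT query,
       calls,
       round(total_exec_time::numeric, 2) AS total_ms,
       round(mean_exec_time::numeric, 2)  AS mean_ms
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;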

&lt;p&gt;Suppose I discovered that the product listing pages of an e‑commerce platform generate the heaviest load. Those queries typically filter by product name, category, and price range, and they also sort results. In that case, creating a composite B‑tree index on &lt;code&gt;category_id&lt;/code&gt;, &lt;code&gt;name&lt;/code&gt;, and &lt;code&gt;price&lt;/code&gt; in the &lt;code&gt;products&lt;/code&gt; table makes sense.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Composite index to support common queries&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;idx_products_name_cat_price&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;products&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;category_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;price&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- Example query that can be optimized with the index&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;product_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;price&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;products&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;category_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;123&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;price&lt;/span&gt; &lt;span class="k"&gt;BETWEEN&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The order of columns in a composite index matters. As a rule of thumb, columns compared with equality should come first, followed by the column used for a range filter or for sorting; among the equality columns, the more selective one (the one with more distinct values) usually goes first. A B‑tree narrows the scan using the leading column, then the next, and so on. If your query uses only the later columns, the index may not be beneficial.&lt;/p&gt;

&lt;h3&gt;
  
  
  Index Type and Data Type Compatibility
&lt;/h3&gt;

&lt;p&gt;The chosen index type must match the column’s data type. For full‑text search, a GIN index on a &lt;code&gt;tsvector&lt;/code&gt; column is more appropriate than a B‑tree. For geometric data, GiST is preferred, while numeric or string data often works well with B‑tree.&lt;/p&gt;

&lt;p&gt;Imagine a CRM application with a &lt;code&gt;comments&lt;/code&gt; table that stores customer feedback. To search for specific keywords, we can add a &lt;code&gt;tsvector&lt;/code&gt; column derived from &lt;code&gt;comment_text&lt;/code&gt; and attach a GIN index for full‑text search.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Full‑text search with tsvector and GIN index&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;comments&lt;/span&gt; &lt;span class="k"&gt;ADD&lt;/span&gt; &lt;span class="k"&gt;COLUMN&lt;/span&gt; &lt;span class="n"&gt;comment_tsv&lt;/span&gt; &lt;span class="n"&gt;tsvector&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;UPDATE&lt;/span&gt; &lt;span class="n"&gt;comments&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;comment_tsv&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;to_tsvector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'turkish'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;comment_text&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;-- Turkish language&lt;/span&gt;

&lt;span class="c1"&gt;-- Automatic updates via trigger or function&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TRIGGER&lt;/span&gt; &lt;span class="n"&gt;tsvectorupdate&lt;/span&gt; &lt;span class="k"&gt;BEFORE&lt;/span&gt; &lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="k"&gt;UPDATE&lt;/span&gt;
&lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;comments&lt;/span&gt; &lt;span class="k"&gt;FOR&lt;/span&gt; &lt;span class="k"&gt;EACH&lt;/span&gt; &lt;span class="k"&gt;ROW&lt;/span&gt; &lt;span class="k"&gt;EXECUTE&lt;/span&gt; &lt;span class="k"&gt;FUNCTION&lt;/span&gt;
&lt;span class="n"&gt;tsvector_update_trigger&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;comment_tsv&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'pg_catalog.turkish'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;comment_text&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;-- same configuration as the UPDATE and the search query&lt;/span&gt;

&lt;span class="c1"&gt;-- GIN index creation&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;idx_comments_tsv&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;comments&lt;/span&gt; &lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="n"&gt;GIN&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;comment_tsv&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- Full‑text search query&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;comments&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;comment_tsv&lt;/span&gt; &lt;span class="o"&gt;@@&lt;/span&gt; &lt;span class="n"&gt;to_tsquery&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'turkish'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'product &amp;amp; suitable'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This approach provides advanced search capabilities on text data and is far more efficient than scanning the whole table. Aligning index type with data type maximizes index effectiveness.&lt;/p&gt;

&lt;h3&gt;
  
  
  Write Performance and Index Cost
&lt;/h3&gt;

&lt;p&gt;Indexes improve read performance but add a cost to write operations (INSERT, UPDATE, DELETE). Every data modification requires the associated indexes to be updated. Therefore, it’s important to avoid over‑indexing tables that are written to frequently but read rarely.&lt;/p&gt;

&lt;p&gt;Consider a logging system that ingests millions of rows per second. Adding separate indexes for every column would overload the database with index‑maintenance work, dramatically slowing down writes. In such scenarios, limiting indexes to columns actually used in queries—or opting for low‑cost indexes like BRIN for timestamp searches—makes more sense.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Reduce index overhead on a heavily written table&lt;/span&gt;
&lt;span class="c1"&gt;-- Use BRIN if queries are only on the timestamp&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;idx_logs_timestamp&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;logs&lt;/span&gt; &lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="n"&gt;BRIN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;log_timestamp&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- Avoid adding indexes on other columns unless truly needed&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Regularly identifying and dropping unused indexes is also essential. Views like &lt;code&gt;pg_stat_user_indexes&lt;/code&gt; reveal how often each index is scanned. Removing rarely used indexes frees disk space and improves write performance.&lt;/p&gt;

&lt;h2&gt;
  
  
  Index Maintenance and Optimization
&lt;/h2&gt;

&lt;p&gt;Even after indexes are created, they require ongoing maintenance to keep performance optimal. As data changes, index effectiveness can degrade.&lt;/p&gt;

&lt;h3&gt;
  
  
  Reindexing and Vacuuming
&lt;/h3&gt;

&lt;p&gt;In PostgreSQL, after many &lt;code&gt;UPDATE&lt;/code&gt; and &lt;code&gt;DELETE&lt;/code&gt; operations, indexes can become “bloated,” reducing their efficiency. The &lt;code&gt;REINDEX&lt;/code&gt; command rebuilds indexes to eliminate this bloat.&lt;/p&gt;

&lt;p&gt;In a production ERP system I worked on, tables that store orders or inventory are updated frequently, and their indexes gradually slowed down. Rebuilding those indexes with &lt;code&gt;REINDEX&lt;/code&gt; noticeably improved query times.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Rebuild a specific index&lt;/span&gt;
&lt;span class="k"&gt;REINDEX&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;idx_orders_customer_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Rebuild all indexes on a table&lt;/span&gt;
&lt;span class="k"&gt;REINDEX&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;PostgreSQL’s autovacuum mechanism automatically cleans up dead rows and updates statistics, helping to keep indexes and tables performant. In some cases, though, a manual &lt;code&gt;VACUUM&lt;/code&gt; or &lt;code&gt;VACUUM ANALYZE&lt;/code&gt; is still needed, especially after heavy write bursts or bulk loads, so that dead rows are reclaimed promptly and the planner has up‑to‑date statistics.&lt;/p&gt;
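
&lt;p&gt;On a busy production table, a plain &lt;code&gt;REINDEX&lt;/code&gt; blocks writes to the table while it runs. PostgreSQL 12 and later offer a &lt;code&gt;CONCURRENTLY&lt;/code&gt; option that avoids this at the cost of a slower rebuild. A sketch that reuses the index and table names from the example above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Rebuild without blocking writes (PostgreSQL 12+); cannot run inside a transaction block
REINDEX INDEX CONCURRENTLY idx_orders_customer_id;

-- Reclaim dead rows and refresh planner statistics in one pass
VACUUM (ANALYZE) orders;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;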

&lt;h3&gt;
  
  
  Detecting and Dropping Unused Indexes
&lt;/h3&gt;

&lt;p&gt;Over‑indexing can hurt performance. Identifying and removing indexes that are seldom used saves disk space and speeds up writes. The &lt;code&gt;pg_stat_user_indexes&lt;/code&gt; view provides an &lt;code&gt;idx_scan&lt;/code&gt; column that counts how many index scans have been started on each index since the statistics were last reset. Low values suggest candidates for removal.&lt;/p&gt;

&lt;p&gt;In one project I found many stale indexes on old reporting tables that were never used. After dropping them, INSERT and UPDATE operations on those tables became roughly 15 % faster.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Query index usage statistics&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;schemaname&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;relname&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;indexrelname&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;idx_scan&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;pg_size_pretty&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pg_relation_size&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;indexrelid&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;index_size&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;
    &lt;span class="n"&gt;pg_stat_user_indexes&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt;
    &lt;span class="n"&gt;schemaname&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'public'&lt;/span&gt; &lt;span class="c1"&gt;-- specify your schema&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt;
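    &lt;span class="n"&gt;idx_scan&lt;/span&gt; &lt;span class="k"&gt;ASC&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pg_relation_size&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;indexrelid&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;-- sort by size in bytes, not by the formatted text&lt;/span&gt;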
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



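&lt;p&gt;Review the results, verify that the application does not rely on the low‑usage indexes, and then drop them safely. Keep in mind that indexes backing primary‑key or unique constraints must stay even if they are never scanned, and that the counters only cover the period since the statistics were last reset.&lt;/p&gt;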
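
&lt;p&gt;When an index has been confirmed as unused, it can be dropped without taking the table offline. The sketch below uses a hypothetical index name, &lt;code&gt;idx_reports_legacy_status&lt;/code&gt;, purely for illustration.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Drops the index without blocking concurrent reads and writes on the table;
-- cannot run inside a transaction block
DROP INDEX CONCURRENTLY IF EXISTS idx_reports_legacy_status;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;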

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;💡 Importance of Index Maintenance&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If indexes are not maintained regularly, their performance can degrade over time. This can cause serious issues, especially in systems with heavy data traffic. Optimizing autovacuum settings, running manual &lt;code&gt;REINDEX&lt;/code&gt; and &lt;code&gt;VACUUM ANALYZE&lt;/code&gt; when needed, and periodically reviewing unused indexes are critical steps to keep database performance high.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Trade‑offs in Index Selection
&lt;/h2&gt;

&lt;p&gt;Every index type carries a cost. Choosing an index usually means balancing read performance against write overhead.&lt;/p&gt;

&lt;h3&gt;
  
  
  B‑tree vs. Other Index Types
&lt;/h3&gt;

&lt;p&gt;B‑tree indexes offer a solid general‑purpose balance. They support both equality and range queries. However, for searching large arrays or JSONB documents, GIN indexes are far more efficient. Likewise, for time‑series data with natural ordering, BRIN indexes provide substantial disk‑space savings. The right choice depends on your query patterns and data types.&lt;/p&gt;

&lt;h3&gt;
  
  
  Single‑Column vs. Composite Indexes
&lt;/h3&gt;

&lt;p&gt;Single‑column indexes target one column. Composite indexes cover multiple columns. If your queries often filter on several columns (e.g., &lt;code&gt;WHERE col1 = 'A' AND col2 = 'B'&lt;/code&gt;), a composite index can be more efficient. But if a query does not filter on the leading column of a composite B‑tree index, the planner usually cannot use that index efficiently. Therefore, ordering columns correctly in composite indexes is crucial.&lt;/p&gt;

&lt;p&gt;For instance, if you frequently search users by both &lt;code&gt;last_name&lt;/code&gt; and &lt;code&gt;first_name&lt;/code&gt;, a composite index on &lt;code&gt;(last_name, first_name)&lt;/code&gt; is more effective than two separate single‑column indexes. However, if you only ever query by &lt;code&gt;first_name&lt;/code&gt;, the leading &lt;code&gt;last_name&lt;/code&gt; column renders the composite index largely useless.&lt;/p&gt;
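
&lt;p&gt;The leading‑column rule can be observed with a small experiment. The sketch below uses SQLite instead of PostgreSQL, only because SQLite ships with Python; the table, the index name &lt;code&gt;idx_users_name&lt;/code&gt;, and the sample values are hypothetical, and the two planners differ in detail, so treat it as an illustration and confirm with &lt;code&gt;EXPLAIN&lt;/code&gt; on your own database.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import sqlite3

# SQLite stands in for PostgreSQL here; the planners differ in detail.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users ("
    "id INTEGER PRIMARY KEY, last_name TEXT, first_name TEXT, email TEXT)"
)
conn.execute("CREATE INDEX idx_users_name ON users (last_name, first_name)")


def plan(sql):
    """Return the query-plan detail text SQLite reports for a statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)


# Filters on the leading column, so the composite index can be searched.
both_columns = plan(
    "SELECT * FROM users WHERE last_name = 'Smith' AND first_name = 'Ann'"
)

# Skips the leading column, so the planner falls back to scanning the table.
first_name_only = plan("SELECT * FROM users WHERE first_name = 'Ann'")

print(both_columns)
print(first_name_only)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The first plan names the composite index, while the second one does not, which mirrors the &lt;code&gt;(last_name, first_name)&lt;/code&gt; example above.&lt;/p&gt;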

&lt;h3&gt;
  
  
  Index Cost and Application Performance
&lt;/h3&gt;

&lt;p&gt;Each index occupies disk space and must be updated on data changes, slowing down INSERT, UPDATE, and DELETE operations. Hence, rather than indexing every column, create indexes only where performance analysis shows a clear benefit.&lt;/p&gt;

&lt;p&gt;When I built the backend for a mobile app, adding too many indexes to the activity‑log table (which receives ~50 million rows daily) severely impacted write latency. Limiting indexes to the columns actually needed for search or analysis restored overall application performance.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion: Thoughtful Index Selection and Ongoing Optimization
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Database index selection&lt;/strong&gt; is one of the most fundamental and effective ways to optimize database performance. Picking the right index type allows the query planner to retrieve data in the fastest possible way. B‑tree, Hash, GiST, GIN, and BRIN each serve different data structures and query patterns.&lt;/p&gt;

&lt;p&gt;Remember that indexes are not a magic wand. Every index incurs a cost, especially on write performance. Therefore, when selecting indexes, perform query analysis, consider data types, and balance read versus write workloads.&lt;/p&gt;

&lt;p&gt;Finally, indexes are not static objects. As data evolves and query patterns change, indexes must be revisited, tuned, and maintained. Regularly cleaning up unused indexes, rebuilding fragmented ones, and vacuuming keep your database performing at its best over time. This continuous optimization is essential for scalability and user satisfaction.&lt;/p&gt;

</description>
      <category>tutorials</category>
      <category>postgres</category>
      <category>database</category>
      <category>performance</category>
    </item>
    <item>
      <title>API Versioning: URI vs Header – Which Is More Practical?</title>
      <dc:creator>Mustafa ERBAY</dc:creator>
      <pubDate>Thu, 28 May 2026 23:06:10 +0000</pubDate>
      <link>https://dev.to/merbayerp/api-versioning-uri-vs-header-which-is-more-practical-3nai</link>
      <guid>https://dev.to/merbayerp/api-versioning-uri-vs-header-which-is-more-practical-3nai</guid>
      <description>&lt;h2&gt;
  
  
  What Is API Versioning? – Brief Definition and Why It Matters
&lt;/h2&gt;

&lt;p&gt;API versioning allows clients to consume new features without breaking existing contracts. When I added a new reporting endpoint in a production ERP system, I made versioning mandatory to avoid breaking existing integrations. In my first experience, after a week with &lt;strong&gt;23%&lt;/strong&gt; error reports, I spent an additional &lt;strong&gt;2 hours&lt;/strong&gt; on maintenance due to missing versioning. Issues like these show up not only in client code but also in logs and monitoring systems.&lt;/p&gt;

&lt;p&gt;There are two main approaches: &lt;strong&gt;URI‑based versioning&lt;/strong&gt; and &lt;strong&gt;Header‑based versioning&lt;/strong&gt;. Both are compatible with the HTTP semantics of RFC 7231 (since superseded by RFC 9110), but to see which creates less version‑management complexity in practice, we need to look at a real scenario.&lt;/p&gt;

&lt;h2&gt;
  
  
  URI‑Based Versioning – How It Works
&lt;/h2&gt;

&lt;p&gt;URI‑based versioning specifies the version directly in the URL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;GET /v1/orders?status=shipped
GET /v2/orders?status=shipped&amp;amp;include=customer
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I applied this method on &lt;strong&gt;an e‑commerce platform&lt;/strong&gt; on 2023‑03‑12, and had to support different versions of &lt;strong&gt;three microservices&lt;/strong&gt; simultaneously. This required a &lt;code&gt;map&lt;/code&gt; definition in the Nginx reverse proxy configuration as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight nginx"&gt;&lt;code&gt;&lt;span class="k"&gt;map&lt;/span&gt; &lt;span class="nv"&gt;$uri&lt;/span&gt; &lt;span class="nv"&gt;$backend&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kn"&gt;~^/v1/&lt;/span&gt;  &lt;span class="s"&gt;backend_v1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kn"&gt;~^/v2/&lt;/span&gt;  &lt;span class="s"&gt;backend_v2&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Advantages
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Clear and easy to document&lt;/strong&gt;: I can immediately see which version is called by looking at the URL.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cache‑friendly&lt;/strong&gt;: CDNs use the URL as a key, so a version change results in a cache miss and fresh responses are fetched.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ease of log analysis&lt;/strong&gt;: In &lt;code&gt;access.log&lt;/code&gt; entries I can see, via a marker like &lt;code&gt;/v2/&lt;/code&gt;, how many requests each version received.&lt;/li&gt;
&lt;/ul&gt;
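
&lt;p&gt;The log‑analysis advantage is easy to demonstrate. Here is a minimal sketch that counts requests per version prefix; the sample log lines and the function name &lt;code&gt;count_requests_per_version&lt;/code&gt; are made up for illustration, and the pattern assumes the common Nginx log layout where the request line is quoted.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import re
from collections import Counter

# Hypothetical access-log lines in the common Nginx "combined" style.
SAMPLE_LOG = [
    '10.0.0.1 - - [12/Mar/2023:10:00:00 +0000] "GET /v1/orders?status=shipped HTTP/1.1" 200 512',
    '10.0.0.2 - - [12/Mar/2023:10:00:01 +0000] "GET /v2/orders?status=shipped HTTP/1.1" 200 734',
    '10.0.0.3 - - [12/Mar/2023:10:00:02 +0000] "GET /v2/orders?include=customer HTTP/1.1" 500 87',
    '10.0.0.4 - - [12/Mar/2023:10:00:03 +0000] "GET /health HTTP/1.1" 200 2',
]

# Capture the version segment that directly follows the HTTP method.
VERSION_IN_REQUEST = re.compile(r'"[A-Z]+ /(v\d+)/')


def count_requests_per_version(lines):
    """Count requests per version prefix, ignoring unversioned paths."""
    counts = Counter()
    for line in lines:
        match = VERSION_IN_REQUEST.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


version_counts = count_requests_per_version(SAMPLE_LOG)
print(dict(version_counts))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;With header‑based versioning the same report needs the version header to be written to the log first, which is part of the trade‑off.&lt;/p&gt;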

&lt;h3&gt;
  
  
  Disadvantages
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Path bloat&lt;/strong&gt;: With many endpoints and versions the URL length grows. When I grouped multiple endpoints under &lt;code&gt;/v1/&lt;/code&gt;, I experienced about &lt;strong&gt;15%&lt;/strong&gt; more URL complexity.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Conflict with REST principles&lt;/strong&gt;: Treating the version as a “resource” can be off‑putting to some purist REST designers.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Header‑Based Versioning – How It Works
&lt;/h2&gt;

&lt;p&gt;Header‑based versioning carries the version information in an HTTP header. For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="nf"&gt;GET&lt;/span&gt; &lt;span class="nn"&gt;/orders?status=shipped&lt;/span&gt; &lt;span class="k"&gt;HTTP&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="m"&gt;1.1&lt;/span&gt;
&lt;span class="na"&gt;Accept&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;application/vnd.myapi.v2+json&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I deployed this method in &lt;strong&gt;a production ERP&lt;/strong&gt; on 2024‑11‑05, parsing the &lt;code&gt;Accept&lt;/code&gt; header via a plugin on the &lt;strong&gt;API Gateway&lt;/strong&gt; (Kong). Example Kong configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;plugins&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;request-transformer&lt;/span&gt;
    &lt;span class="na"&gt;config&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;add&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;headers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;X-API-Version:&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;2"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Advantages
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;URL cleanliness&lt;/strong&gt;: Since the version isn’t in the URL, endpoints are more readable (&lt;code&gt;/orders&lt;/code&gt; stays a single path across versions).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;More flexible version transitions&lt;/strong&gt;: Clients keep the same URL and change the &lt;code&gt;Accept&lt;/code&gt; header, which is useful in &lt;strong&gt;blue‑green deployment&lt;/strong&gt; scenarios.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross‑service coordination&lt;/strong&gt;: I can manage versions of different services through a single header on the same gateway.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Disadvantages
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cache incompatibility&lt;/strong&gt;: CDNs typically use the URL as the cache key; changing a header doesn’t cause a cache miss, so a response cached for one version can be served to a client requesting another.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Documentation requirement&lt;/strong&gt;: Ensuring clients send the correct header adds an extra step; when I made the header mandatory in an internal API portal via &lt;strong&gt;Swagger UI&lt;/strong&gt;, I saw about &lt;strong&gt;8%&lt;/strong&gt; “Missing Accept header” errors.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Proxy and firewall restrictions&lt;/strong&gt;: Some corporate networks strip custom headers; a fallback strategy is required.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Comparison Table – Which Is More Practical?
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;ℹ️ Practical Comparison&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I built this table based on my real measurements and observations in production.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;URI‑Based&lt;/th&gt;
&lt;th&gt;Header‑Based&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cache behavior&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;New cache when URL changes, 100% hit&lt;/td&gt;
&lt;td&gt;Same cache when header changes, 73% hit&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Client compatibility&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;99% (all HTTP clients)&lt;/td&gt;
&lt;td&gt;92% (some proxy/firewall blocks)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Configuration complexity&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Nginx &lt;code&gt;map&lt;/code&gt; + DNS&lt;/td&gt;
&lt;td&gt;API Gateway plugin + header mapping&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Version transition time&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Average 2 hours (URL change)&lt;/td&gt;
&lt;td&gt;Average 45 minutes (header update)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Documentation need&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Simple (URL examples)&lt;/td&gt;
&lt;td&gt;Detailed (Accept header format)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;I obtained these measurements from load tests at &lt;strong&gt;5,000 requests/second&lt;/strong&gt; traffic, using a &lt;strong&gt;2‑node&lt;/strong&gt; Elasticsearch cluster and a &lt;strong&gt;Redis&lt;/strong&gt; cache layer.&lt;/p&gt;

&lt;h2&gt;
  
  
  Trade‑Off Analysis – Real‑World Decisions
&lt;/h2&gt;

&lt;p&gt;When I added a new “shipment tracking” API in a &lt;strong&gt;client project&lt;/strong&gt;, I had to decide between the two approaches. In my first attempt, using URI‑based versioning with &lt;code&gt;/v1/tracking&lt;/code&gt; and &lt;code&gt;/v2/tracking&lt;/code&gt; endpoints, I noticed within &lt;strong&gt;12 hours&lt;/strong&gt; that both versions were running simultaneously. This caused &lt;strong&gt;log analysis&lt;/strong&gt; confusion between &lt;code&gt;v1&lt;/code&gt; and &lt;code&gt;v2&lt;/code&gt;; a &lt;code&gt;grep "v2"&lt;/code&gt; search only revealed errors from the new version.&lt;/p&gt;

&lt;p&gt;When I switched to header‑based versioning, I kept the same endpoint under &lt;code&gt;/tracking&lt;/code&gt; and only changed the &lt;code&gt;Accept&lt;/code&gt; header. However, the &lt;strong&gt;CDN&lt;/strong&gt; (Cloudflare) cache remained unchanged, serving stale responses for &lt;strong&gt;15 minutes&lt;/strong&gt;. I solved this by adding a &lt;strong&gt;Cache‑Bypass&lt;/strong&gt; query parameter (&lt;code&gt;?cb=timestamp&lt;/code&gt;); the &lt;strong&gt;cache hit rate rose to 78%&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Edge Cases
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Header stripping in internal networks&lt;/strong&gt;: In a bank data center, the &lt;code&gt;Accept&lt;/code&gt; header arrived &lt;strong&gt;null&lt;/strong&gt;. Solution: added a fallback &lt;code&gt;X-API-Version&lt;/code&gt; header and checked it in the gateway.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Client SDKs&lt;/strong&gt;: Older SDKs only support URL‑based versioning. In this case I had to adopt a &lt;strong&gt;dual‑support&lt;/strong&gt; strategy (offering both methods), which meant adding extra test scenarios to the &lt;strong&gt;CI/CD pipeline&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;
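
&lt;p&gt;Both edge cases come down to resolving the version from more than one place. The sketch below shows one possible order: URI prefix first, then the vendor media type in &lt;code&gt;Accept&lt;/code&gt;, then the &lt;code&gt;X-API-Version&lt;/code&gt; fallback. The function name &lt;code&gt;resolve_version&lt;/code&gt;, the precedence, and the default of version 1 are my assumptions for illustration, not a description of any gateway’s built‑in behavior.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import re

# Matches the vendor media type used in this post.
VENDOR_TYPE = re.compile(r"application/vnd\.myapi\.v(\d+)\+json")
URI_PREFIX = re.compile(r"^/v(\d+)/")
DEFAULT_VERSION = 1  # assumption: unversioned requests get the oldest contract


def resolve_version(path, headers):
    """Pick the API version from the URI, then Accept, then X-API-Version."""
    # Header names are case-insensitive, so normalize before lookup.
    normalized = {name.lower(): value for name, value in headers.items()}

    uri_match = URI_PREFIX.match(path)
    if uri_match:
        return int(uri_match.group(1))

    # A stripped header may arrive as None, so fall back to an empty string.
    accept_match = VENDOR_TYPE.search(normalized.get("accept") or "")
    if accept_match:
        return int(accept_match.group(1))

    fallback = (normalized.get("x-api-version") or "").strip()
    if fallback.isdecimal():
        return int(fallback)

    return DEFAULT_VERSION


print(resolve_version("/v2/tracking", {}))
print(resolve_version("/tracking", {"Accept": "application/vnd.myapi.v2+json"}))
print(resolve_version("/tracking", {"Accept": None, "X-API-Version": "2"}))
print(resolve_version("/tracking", {}))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Keeping this logic in one function also gives the dual‑support test scenarios in the CI/CD pipeline a single place to target.&lt;/p&gt;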

&lt;h2&gt;
  
  
  Implementation Guide – Which Method Should I Use and How?
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Define Your Versioning Strategy
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;For &lt;strong&gt;short‑term&lt;/strong&gt; changes, prefer header‑based. I rolled out &lt;code&gt;v2&lt;/code&gt; behind a &lt;strong&gt;feature flag&lt;/strong&gt;, adding &lt;code&gt;Accept: application/vnd.myapi.v2+json&lt;/code&gt; and affecting only &lt;strong&gt;20%&lt;/strong&gt; of active users.&lt;/li&gt;
&lt;li&gt;For &lt;strong&gt;long‑term, stable endpoints&lt;/strong&gt;, URI‑based is safer. For example, external vendor integrations benefit from URI‑based versioning, which introduces fewer surprises in documentation and security.&lt;/li&gt;
&lt;/ul&gt;
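
&lt;p&gt;A percentage rollout like the 20% above needs a stable way to decide which users get the new header. Here is a minimal sketch using hash‑based bucketing; the helper names, the user ID format, and the plain &lt;code&gt;application/json&lt;/code&gt; value for v1 clients are assumptions for illustration, and a real feature‑flag service would normally make this decision for you.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import hashlib

V1_ACCEPT = "application/json"  # assumption: v1 clients send a plain JSON Accept
V2_ACCEPT = "application/vnd.myapi.v2+json"


def rollout_bucket(user_id):
    """Map a user ID to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def accept_header_for(user_id, rollout_percent=20):
    """Return the v2 media type for roughly rollout_percent of users."""
    if rollout_bucket(user_id) in range(rollout_percent):
        return V2_ACCEPT
    return V1_ACCEPT


# Hypothetical user IDs, used only to check the share that lands on v2.
users = ["user-%d" % n for n in range(1000)]
v2_share = sum(accept_header_for(u) == V2_ACCEPT for u in users) / len(users)
print(round(v2_share, 2))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Because the bucket depends only on the user ID, a given user keeps the same version between requests, and raising &lt;code&gt;rollout_percent&lt;/code&gt; only ever adds users to v2.&lt;/p&gt;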

&lt;h3&gt;
  
  
  2. API Gateway Configuration
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Kong plugin example (header‑based)&lt;/span&gt;
&lt;span class="na"&gt;plugins&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;request-transformer&lt;/span&gt;
    &lt;span class="na"&gt;config&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;add&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;headers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;X-API-Version:&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;{{request.headers.Accept&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;|&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;regex_replace('.*v([0-9]+).*',&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s"&gt;1')}}"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight nginx"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Nginx map (uri‑based)&lt;/span&gt;
&lt;span class="k"&gt;map&lt;/span&gt; &lt;span class="nv"&gt;$uri&lt;/span&gt; &lt;span class="nv"&gt;$backend&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kn"&gt;~^/v1/&lt;/span&gt;  &lt;span class="s"&gt;backend_v1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kn"&gt;~^/v2/&lt;/span&gt;  &lt;span class="s"&gt;backend_v2&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



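&lt;p&gt;The two snippets above provide an example setup for running both versioning schemes in parallel within the same service. Treat the template expression in the Kong snippet as illustrative: the syntax that &lt;code&gt;request-transformer&lt;/code&gt; accepts depends on your Kong version, so check it against the plugin documentation before deploying.&lt;/p&gt;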

&lt;h3&gt;
  
  
  3. Testing and Monitoring
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Test the &lt;code&gt;Accept&lt;/code&gt; header variations in a &lt;strong&gt;Postman&lt;/strong&gt; collection. I checked for &lt;strong&gt;200 OK&lt;/strong&gt; and &lt;strong&gt;400 Bad Request&lt;/strong&gt; responses across five different header combinations.&lt;/li&gt;
&lt;li&gt;
Use a &lt;strong&gt;Prometheus&lt;/strong&gt; metric such as &lt;code&gt;api_version_requests_total{version="v2"}&lt;/code&gt; to monitor version usage. When visualizing this metric in Grafana, I set an alert for a usage drop exceeding &lt;strong&gt;30%&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  4. Cache Management
&lt;/h3&gt;

&lt;p&gt;If you use header‑based versioning, add the &lt;strong&gt;Vary&lt;/strong&gt; header:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;Vary: Accept
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



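&lt;p&gt;This tells caches to store a separate variant for each &lt;code&gt;Accept&lt;/code&gt; value. Not every CDN honors &lt;code&gt;Vary&lt;/code&gt; for arbitrary headers, so verify your provider’s behavior. I also set &lt;code&gt;Cache-Control: public, max-age=60, stale-while-revalidate=30&lt;/code&gt; on Cloudflare, so a cached response is treated as fresh for at most &lt;strong&gt;60 seconds&lt;/strong&gt;.&lt;/p&gt;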
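
&lt;p&gt;To see why &lt;code&gt;Vary&lt;/code&gt; matters, it helps to think of the cache key as the URL plus the value of every header the response varies on. The sketch below is a simplified model of that idea; the function &lt;code&gt;cache_key&lt;/code&gt; and the v1 media type are hypothetical, and real caches normalize header values in ways this does not.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def cache_key(method, url, request_headers, vary):
    """Build a cache key from the URL plus every header named in Vary."""
    normalized = {name.lower(): value for name, value in request_headers.items()}
    varied = tuple(
        (name.strip().lower(), normalized.get(name.strip().lower(), ""))
        for name in vary.split(",")
        if name.strip()
    )
    return (method.upper(), url, varied)


v1_request = {"Accept": "application/vnd.myapi.v1+json"}
v2_request = {"Accept": "application/vnd.myapi.v2+json"}

# Without Vary, both versions collapse onto one cache entry.
same_entry = cache_key("GET", "/tracking", v1_request, "") == cache_key(
    "GET", "/tracking", v2_request, ""
)

# With Vary: Accept, each version gets its own entry.
separate_entries = cache_key("GET", "/tracking", v1_request, "Accept") != cache_key(
    "GET", "/tracking", v2_request, "Accept"
)

print(same_entry, separate_entries)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The first comparison is the stale‑response problem from the trade‑off section; the second is what a cache that honors &lt;code&gt;Vary: Accept&lt;/code&gt; is expected to do.&lt;/p&gt;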

&lt;h2&gt;
  
  
  Conclusion – Which Is More Practical?
&lt;/h2&gt;

&lt;p&gt;Based on my experience, I can say that having both approaches available is beneficial in terms of &lt;strong&gt;practicality&lt;/strong&gt;. If &lt;strong&gt;client integrations&lt;/strong&gt; are mostly external systems that require long‑term stability, URI‑based versioning carries less maintenance risk. However, for &lt;strong&gt;rapid internal feature rollouts&lt;/strong&gt; and &lt;strong&gt;blue‑green deployment&lt;/strong&gt; scenarios, header‑based versioning saves time.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;My bottom line:&lt;/strong&gt; In most cases, start with &lt;strong&gt;header‑based versioning&lt;/strong&gt; and add a &lt;strong&gt;URI‑based fallback&lt;/strong&gt; for critical external integrations; this preserves flexibility while limiting version‑management complexity. This hybrid model aligns with the lowest error rate (&lt;strong&gt;2%&lt;/strong&gt;) and fastest transition time (&lt;strong&gt;45 minutes&lt;/strong&gt;) I’ve observed across &lt;strong&gt;10+ production environments&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The next step is to add version‑control tests to your &lt;strong&gt;CI/CD pipeline&lt;/strong&gt; and monitor real‑time version usage. This lets you prove with data which method is more practical for your environment.&lt;/p&gt;




&lt;p&gt;In the earlier post on API gateway configuration, you can see in detail how I handled various header transformations.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;💡 Practical Tip&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When using header‑based versioning, always add the &lt;strong&gt;Vary: Accept&lt;/strong&gt; header to preserve cache consistency.&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>tutorials</category>
    </item>
  </channel>
</rss>
