<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: taman9333</title>
    <description>The latest articles on DEV Community by taman9333 (@taman9333).</description>
    <link>https://dev.to/taman9333</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F668362%2Fdd6a6b20-ea1d-4f4f-a044-7a3b2a77e327.jpeg</url>
      <title>DEV Community: taman9333</title>
      <link>https://dev.to/taman9333</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/taman9333"/>
    <language>en</language>
    <item>
      <title>Traces at Scale: Head or Tail? Sampling Strategies &amp; Scaling the Collector</title>
      <dc:creator>taman9333</dc:creator>
      <pubDate>Mon, 11 Aug 2025 12:19:16 +0000</pubDate>
      <link>https://dev.to/taman9333/traces-at-scale-head-or-tail-sampling-strategies-scaling-the-collector-nk</link>
      <guid>https://dev.to/taman9333/traces-at-scale-head-or-tail-sampling-strategies-scaling-the-collector-nk</guid>
      <description>&lt;h2&gt;
  
  
  🚨 Tracing at Scale Isn’t Free
&lt;/h2&gt;

&lt;p&gt;In the &lt;a href="https://dev.to/taman9333/distributed-tracing-instrumentation-with-opentelemetry-and-jaeger-em2"&gt;previous article&lt;/a&gt;, we got everything working - traces flowed between services, we visualized them in Jaeger, and we saw end-to-end visibility in action.&lt;/p&gt;

&lt;p&gt;But here’s the catch:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;In production, things look very different.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Your system might generate &lt;strong&gt;millions of traces every day&lt;/strong&gt;, and I’ve seen companies that run with &lt;strong&gt;100% sampling&lt;/strong&gt;, storing every single trace, but only keeping them for a &lt;strong&gt;very short period&lt;/strong&gt;, like &lt;strong&gt;1 to 2 days&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Even if a company can afford the storage cost for &lt;strong&gt;longer than 1 or 2 days&lt;/strong&gt;, this setup is inefficient:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Developers can't investigate incidents that happened more than a couple of days ago
&lt;/li&gt;
&lt;li&gt;A lot of traces are just noise: health checks, fast 200 OKs, and routine traffic
&lt;/li&gt;
&lt;li&gt;The traces that actually matter (slow requests, failures, edge cases) get lost in the crowd&lt;/li&gt;
&lt;li&gt;High cost for exporting and storing all spans - especially when using hosted platforms&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let me quote this visual from the &lt;a href="https://opentelemetry.io/docs/concepts/sampling/" rel="noopener noreferrer"&gt;OpenTelemetry documentation&lt;/a&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F874hwqpj98k9gbqnf25d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F874hwqpj98k9gbqnf25d.png" alt="issues-without-sampling" width="800" height="559"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This image perfectly illustrates the problem. Sampling would solve the above issues 🚀&lt;/p&gt;




&lt;h3&gt;
  
  
  🎯 This is where &lt;strong&gt;sampling&lt;/strong&gt; comes in
&lt;/h3&gt;

&lt;p&gt;Sampling helps you &lt;strong&gt;reduce volume&lt;/strong&gt; while still capturing the traces that matter, the ones that help you debug real problems and improve performance.&lt;/p&gt;

&lt;p&gt;In this article, we’ll cover:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The difference between &lt;strong&gt;head-based and tail-based sampling&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;How to &lt;strong&gt;configure tail-based sampling using OpenTelemetry&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;And how to &lt;strong&gt;scale your collector&lt;/strong&gt; setup to handle production traffic&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let’s get into it.&lt;/p&gt;




&lt;h3&gt;
  
  
  Head-based Sampling
&lt;/h3&gt;

&lt;p&gt;Head-based sampling means deciding whether to keep or drop a trace right at the start, as soon as the first span is created. The decision is made without knowing how the full trace will look.&lt;/p&gt;

&lt;p&gt;A common example of this is probability sampling. It uses the trace ID and a set percentage to decide which traces to keep. For example, you might keep 30 percent of all traces. If a trace is selected, all its spans are kept together, so you do not end up with missing spans.&lt;/p&gt;

&lt;p&gt;In OpenTelemetry, we usually combine a parent-based sampler with a probability-based sampler. This means if the parent span was sampled, all child spans will be sampled as well. If not, the entire trace will be dropped.&lt;/p&gt;
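&lt;p&gt;To make the mechanism concrete, here is a minimal Go sketch of the idea behind a trace ID ratio sampler (a simplified illustration, not the SDK's actual code): read part of the trace ID as a number and keep the trace when that number falls below a threshold derived from the ratio. Because the decision is a pure function of the trace ID, every service that sees the same trace ID makes the same choice.&lt;/p&gt;

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// shouldSample sketches the idea behind a trace-ID ratio sampler:
// interpret the lower 8 bytes of the 16-byte trace ID as a 63-bit
// value and keep the trace when it falls below a threshold derived
// from the sampling ratio.
func shouldSample(traceID [16]byte, ratio float64) bool {
	x := binary.BigEndian.Uint64(traceID[8:16]) >> 1 // 63-bit value
	bound := uint64(ratio * (1 << 63))
	return x < bound
}

func main() {
	sampled := 0
	for i := uint64(0); i < 1000; i++ {
		var id [16]byte
		// Spread synthetic trace IDs roughly uniformly for the demo.
		binary.BigEndian.PutUint64(id[8:], i*0x9E3779B97F4A7C15)
		if shouldSample(id, 0.3) {
			sampled++
		}
	}
	fmt.Printf("sampled %d of 1000 traces\n", sampled) // roughly 300
}
```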




&lt;h3&gt;
  
  
  How to Configure Head-based Sampling
&lt;/h3&gt;

&lt;p&gt;Head-based sampling is simple to set up. You do not need to change your system architecture or add extra components.&lt;/p&gt;

&lt;p&gt;You can configure it directly in your application's code using the OpenTelemetry SDK.&lt;/p&gt;

&lt;p&gt;In our case, we are going to use a 30 percent sampling rate in all of the services.&lt;/p&gt;

&lt;p&gt;For example, in our &lt;a href="https://github.com/taman9333/distributed_tracing/blob/main/x/main.go#L60" rel="noopener noreferrer"&gt;Go service&lt;/a&gt;, here is how you can enable head based sampling by combining a parent-based sampler with a trace ID ratio-based sampler:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight diff"&gt;&lt;code&gt;&lt;span class="gi"&gt;+ sampler := sdktrace.ParentBased(
+              sdktrace.TraceIDRatioBased(0.3)
+            )
&lt;/span&gt;&lt;span class="err"&gt;
&lt;/span&gt;&lt;span class="p"&gt;tp := sdktrace.NewTracerProvider(
&lt;/span&gt;      sdktrace.WithBatcher(exp),
&lt;span class="gi"&gt;+     sdktrace.WithSampler(sampler),
&lt;/span&gt;      sdktrace.WithResource(resource.NewWithAttributes(
            semconv.SchemaURL,
            semconv.ServiceName("service-x"),
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In a &lt;a href="https://github.com/taman9333/distributed_tracing/blob/main/y/server.rb#L9C1-L13C4" rel="noopener noreferrer"&gt;Ruby service&lt;/a&gt;, you only need to add a single line:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight diff"&gt;&lt;code&gt;&lt;span class="p"&gt;OpenTelemetry::SDK.configure do |c|
&lt;/span&gt;  c.service_name = 'service-y'
  c.use 'OpenTelemetry::Instrumentation::Sinatra'
  c.use 'OpenTelemetry::Instrumentation::Faraday'
&lt;span class="p"&gt;end
&lt;/span&gt;&lt;span class="err"&gt;
&lt;/span&gt;&lt;span class="gi"&gt;+ OpenTelemetry.tracer_provider.sampler = OpenTelemetry::SDK::Trace::Samplers::TraceIdRatioBased.new(0.3)
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the &lt;a href="https://github.com/taman9333/distributed_tracing/blob/main/z/tracing.js#L13-L18" rel="noopener noreferrer"&gt;Node service&lt;/a&gt;, you will need to add the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight diff"&gt;&lt;code&gt;&lt;span class="gi"&gt;+ const { ParentBasedSampler, TraceIdRatioBasedSampler } = require('@opentelemetry/sdk-trace-base');
&lt;/span&gt;&lt;span class="err"&gt;
&lt;/span&gt;&lt;span class="p"&gt;const provider = new NodeTracerProvider({
&lt;/span&gt;  resource: new resourceFromAttributes({
    [ATTR_SERVICE_NAME]: "service-z",
  }),
  spanProcessors: [new SimpleSpanProcessor(exporter)],
&lt;span class="gi"&gt;+ sampler: new ParentBasedSampler({
+   root: new TraceIdRatioBasedSampler(0.3),
+ })
&lt;/span&gt;});
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With these changes in place, all three services now use head-based sampling, combining a parent-based strategy with a probability sampler set to 30 percent.&lt;/p&gt;




&lt;h3&gt;
  
  
  🧪 Time to test:
&lt;/h3&gt;

&lt;p&gt;I will run the &lt;a href="https://github.com/taman9333/distributed_tracing/blob/main/hit_x_service.sh" rel="noopener noreferrer"&gt;hit_x_service.sh&lt;/a&gt; script &lt;strong&gt;three times&lt;/strong&gt;, which will generate &lt;strong&gt;30 requests&lt;/strong&gt;, so we will &lt;strong&gt;probably&lt;/strong&gt; see &lt;strong&gt;around&lt;/strong&gt; 9 traces in the Jaeger UI (it's a statistical estimate; the actual count may vary, but it converges toward the configured percentage as the number of requests grows).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8tqwvsvfntunmtytlt0u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8tqwvsvfntunmtytlt0u.png" alt="head-based-sampling" width="800" height="395"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Yaaay 🙌 it is working! As you can see in the screenshot, we have &lt;strong&gt;9 traces&lt;/strong&gt; sampled out of &lt;strong&gt;30 requests&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  🚨 The Catch with Head-based Sampling
&lt;/h3&gt;

&lt;p&gt;If you look again at the screenshot, you will notice the problem with head-based sampling: the nine sampled traces do not include any of the error requests. You can also confirm this in the scatter plot at the top. The 9 dots represent the 9 sampled traces and the time they were captured. All of them are blue, which means they are successful requests. If an error trace had been captured, it would appear as a &lt;strong&gt;red dot&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This shows a major limitation. Although head-based sampling is simple to understand and configure, it makes the sampling decision before the request is fully processed. That means it can miss important traces such as failures or high-latency cases.&lt;/p&gt;

&lt;p&gt;In our case, all three errors were dropped. This makes head-based sampling unreliable when your goal is to capture anomalies or debug edge cases.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fheffifnthu0z27zwds43.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fheffifnthu0z27zwds43.png" alt="head-based-sampling-problem" width="717" height="348"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  Tail-based Sampling
&lt;/h3&gt;

&lt;p&gt;Tail-based sampling works differently from head-based sampling. Instead of making a decision right when a trace starts, &lt;strong&gt;the decision to sample a trace is made by considering all or most of the spans within the trace.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This means you can make smarter choices by looking at the full picture, like checking whether any span had an error or high latency.&lt;/p&gt;

&lt;p&gt;Let me show you a visual from the OpenTelemetry docs that explains it well:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0u2e75lbyjjqthp57i0i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0u2e75lbyjjqthp57i0i.png" alt="tail-based-sampling-visual" width="800" height="519"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Tail-based Sampling Rules
&lt;/h4&gt;

&lt;p&gt;With tail based sampling, you can define smart rules to decide which traces to keep. For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Always keep traces that include an error
&lt;/li&gt;
&lt;li&gt;Keep traces with high latency
&lt;/li&gt;
&lt;li&gt;Sample based on specific span attributes - like keeping more traces from a newly deployed service
&lt;/li&gt;
&lt;li&gt;Drop traces that match certain paths - like health check endpoints
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And many other policies you can define based on your needs or business logic.&lt;/p&gt;
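&lt;p&gt;For instance, the "keep traces with high latency" rule from the list above maps to the &lt;code&gt;latency&lt;/code&gt; policy type of the tail sampling processor. A sketch of what that could look like (the 500&amp;nbsp;ms threshold is an arbitrary example, not a recommendation):&lt;/p&gt;

```yaml
{
  name: slow-requests,
  type: latency,
  # Keep any trace whose end-to-end duration exceeds this threshold.
  latency: { threshold_ms: 500 }
}
```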




&lt;h3&gt;
  
  
  How to Implement Tail-based Sampling
&lt;/h3&gt;

&lt;p&gt;To use tail-based sampling, you’ll need to introduce a new component into your infrastructure: the OpenTelemetry Collector.&lt;/p&gt;

&lt;p&gt;But wait… what the heck is that?&lt;/p&gt;

&lt;p&gt;Let’s break it down.&lt;/p&gt;

&lt;h4&gt;
  
  
  What is the OpenTelemetry Collector?
&lt;/h4&gt;

&lt;p&gt;The OpenTelemetry Collector is a vendor-agnostic implementation of how to receive, process, and export telemetry data (logs, metrics &amp;amp; traces) to an observability backend.&lt;/p&gt;

&lt;p&gt;It simplifies your setup by removing the need to run different agents for each type of telemetry. Instead, it acts as a single, unified point to collect and forward all your data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The diagram below makes it much clearer.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9mb1yqyxnwnqxjgjumca.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9mb1yqyxnwnqxjgjumca.jpg" alt="OpenTelemetry-Collector" width="800" height="304"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;On the left side, we have a typical cluster or host with different services running. These services continuously produce logs, metrics, and traces.&lt;/p&gt;

&lt;p&gt;All of this data is sent to the &lt;strong&gt;OpenTelemetry Collector&lt;/strong&gt;. The collector performs three steps for each telemetry type:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Receive&lt;/strong&gt; the data
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Process&lt;/strong&gt; it based on your pipeline configuration before it’s exported. These processors can:

&lt;ul&gt;
&lt;li&gt;Filter unnecessary data
&lt;/li&gt;
&lt;li&gt;Transform or enrich spans with additional metadata
&lt;/li&gt;
&lt;li&gt;Batch data to improve performance and reduce backend load&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Export&lt;/strong&gt; it to your observability backend&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;On the right, you can see popular observability platforms where the data can be exported, such as Prometheus, Grafana, Datadog, Loki and others.&lt;/p&gt;

&lt;p&gt;The real power of using the OpenTelemetry Collector is that it acts as a &lt;strong&gt;central hub&lt;/strong&gt; for all your telemetry data. Instead of asking every single service in your system to know where to send logs, metrics, or traces, or how to talk to different backends like Prometheus, Grafana, or Datadog, you let the collector handle it all in one place.&lt;/p&gt;

&lt;p&gt;This means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Your services stay lightweight and simple&lt;/li&gt;
&lt;li&gt;You can change or add observability backends without touching the code in your services&lt;/li&gt;
&lt;li&gt;You gain more control over processing and filtering data before it gets stored&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In short, the collector decouples your application code from your observability tooling, which makes your system more flexible, maintainable, and scalable.&lt;/p&gt;
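&lt;p&gt;In practice, pointing a service at the collector usually comes down to the standard OTLP environment variables that every OpenTelemetry SDK reads (the &lt;code&gt;otel-collector&lt;/code&gt; hostname here assumes the Docker Compose service name used later in this article):&lt;/p&gt;

```shell
# All OpenTelemetry SDKs honor these standard variables, so the
# services need no backend-specific exporter code at all.
export OTEL_EXPORTER_OTLP_ENDPOINT="http://otel-collector:4318"
export OTEL_EXPORTER_OTLP_PROTOCOL="http/protobuf"
```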




&lt;h3&gt;
  
  
  Installing OpenTelemetry Collector
&lt;/h3&gt;

&lt;p&gt;To keep things simple, we won’t include the changes we made earlier for head-based sampling. That code lives in the &lt;strong&gt;&lt;a href="https://github.com/taman9333/distributed_tracing/compare/main...head-based-sampling" rel="noopener noreferrer"&gt;head-based-sampling branch&lt;/a&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Instead, we’ll use the same instrumentation setup from the first article, which is available in the &lt;code&gt;main&lt;/code&gt; branch.&lt;/p&gt;

&lt;p&gt;All the changes required for tail-based sampling will be done in a new branch called &lt;strong&gt;&lt;a href="https://github.com/taman9333/distributed_tracing/compare/main...tail-based-sampling" rel="noopener noreferrer"&gt;tail-based-sampling&lt;/a&gt;&lt;/strong&gt;. These changes include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Updating the &lt;a href="https://github.com/taman9333/distributed_tracing/blob/main/docker-compose.yml" rel="noopener noreferrer"&gt;docker-compose.yml&lt;/a&gt;&lt;/strong&gt; file to add the OpenTelemetry Collector
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight diff"&gt;&lt;code&gt;&lt;span class="p"&gt;version: '3'
&lt;/span&gt;&lt;span class="err"&gt;
&lt;/span&gt;&lt;span class="p"&gt;services:
&lt;/span&gt;&lt;span class="gi"&gt;+  otel-collector:
+    image: otel/opentelemetry-collector-contrib:0.130.0
+    command: ["--config=/etc/otel-collector.yaml"]
+    volumes:
+      - ./otel-collector.yaml:/etc/otel-collector.yaml
+    ports:
+      - 4317:4317
+      - 4318:4318
+    depends_on:
+      - jaeger
&lt;/span&gt;&lt;span class="err"&gt;
&lt;/span&gt;  jaeger:
    image: jaegertracing/all-in-one:1.71.0
&lt;span class="gd"&gt;-    command:
-      - "--collector.otlp.grpc.tls.enabled=false"
&lt;/span&gt;    ports:
      - "16686:16686"   # Jaeger UI
&lt;span class="gd"&gt;-      - "4317:4317"     # OTLP gRPC
-      - "4318:4318"     # OTLP HTTP
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice that we removed ports &lt;code&gt;4317&lt;/code&gt; and &lt;code&gt;4318&lt;/code&gt; from the Jaeger service. That’s because we won’t send traces from our services directly to Jaeger anymore.&lt;/p&gt;

&lt;p&gt;Instead, we’ll route all trace data to the &lt;code&gt;otel-collector&lt;/code&gt; service first. To enable that, we exposed ports 4317 and 4318 on the otel-collector service; our x, y, and z services send traces over HTTP to port 4318.&lt;/p&gt;

&lt;p&gt;Then, the collector will export traces to Jaeger internally via OTLP gRPC on port &lt;code&gt;4317&lt;/code&gt; inside the Docker network.&lt;/p&gt;
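&lt;p&gt;That routing is wired up in the collector configuration itself. The exporter side could look roughly like this (a sketch; the exporter name and pipeline wiring are assumptions about the config we walk through below):&lt;/p&gt;

```yaml
exporters:
  otlp:
    endpoint: jaeger:4317  # Jaeger's OTLP gRPC port, inside the Docker network
    tls:
      insecure: true       # no TLS between containers in this local setup

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [tail_sampling]
      exporters: [otlp]
```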

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Creating a new file&lt;/strong&gt; named &lt;strong&gt;&lt;a href="https://github.com/taman9333/distributed_tracing/blob/tail-based-sampling/otel-collector.yaml" rel="noopener noreferrer"&gt;otel-collector.yaml&lt;/a&gt;&lt;/strong&gt; in the root directory for the collector configuration&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This file defines how the OpenTelemetry Collector will receive, process, and export trace data using tail-based sampling.&lt;/p&gt;

&lt;p&gt;Let’s break down what each section in the otel-collector.yaml file is doing.&lt;/p&gt;

&lt;h3&gt;
  
  
  📥 Receivers
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;receivers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;otlp&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;protocols&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;http&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;endpoint&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;0.0.0.0:4318&lt;/span&gt;
      &lt;span class="na"&gt;grpc&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;endpoint&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;0.0.0.0:4317&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Receivers are the entry point to the OpenTelemetry Collector. They collect telemetry data from your services and pass it into the processing pipeline.&lt;/p&gt;

&lt;p&gt;In our case, we are only dealing with traces. All three services, Go, Ruby, and Node.js, are configured to send traces via the OTLP HTTP protocol.&lt;/p&gt;

&lt;p&gt;The OTLP receiver starts both HTTP and gRPC servers, listening on ports 4318 and 4317 respectively. Since all our services send traces over HTTP, we do not strictly need the gRPC server, but we will keep the gRPC endpoint in the config because we will use it later when scaling the collector.&lt;/p&gt;

&lt;p&gt;The HTTP server on port 4318 will receive the traces; all three of our services are configured to send data to this port over HTTP.&lt;/p&gt;

&lt;p&gt;Receivers are mandatory in every collector configuration; without at least one, the collector will not function.&lt;/p&gt;




&lt;h3&gt;
  
  
  🔧 Processors
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;processors&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;tail_sampling&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;decision_wait&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;10s&lt;/span&gt;
    &lt;span class="na"&gt;num_traces&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;100&lt;/span&gt;
    &lt;span class="na"&gt;expected_new_traces_per_sec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
    &lt;span class="na"&gt;policies&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;[&lt;/span&gt;
        &lt;span class="pi"&gt;{&lt;/span&gt;
          &lt;span class="nv"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;errors&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt;
          &lt;span class="nv"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;status_code&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt;
          &lt;span class="nv"&gt;status_code&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt; &lt;span class="nv"&gt;status_codes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;ERROR&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt; &lt;span class="pi"&gt;}&lt;/span&gt;
        &lt;span class="pi"&gt;},&lt;/span&gt;
        &lt;span class="pi"&gt;{&lt;/span&gt;
            &lt;span class="nv"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;drop-health-checks&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt;
            &lt;span class="nv"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;drop&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt;
            &lt;span class="nv"&gt;drop&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt;
              &lt;span class="nv"&gt;drop_sub_policy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="pi"&gt;[&lt;/span&gt;
                &lt;span class="pi"&gt;{&lt;/span&gt;
                    &lt;span class="nv"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;drop-health-paths&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt;
                    &lt;span class="nv"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;string_attribute&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt;
                    &lt;span class="nv"&gt;string_attribute&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt;&lt;span class="nv"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;url.path&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;values&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;\/health&lt;/span&gt;&lt;span class="pi"&gt;],&lt;/span&gt; &lt;span class="nv"&gt;enabled_regex_matching&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;true&lt;/span&gt;&lt;span class="pi"&gt;}&lt;/span&gt;
                &lt;span class="pi"&gt;}&lt;/span&gt;
              &lt;span class="pi"&gt;]&lt;/span&gt;
            &lt;span class="pi"&gt;}&lt;/span&gt;
         &lt;span class="pi"&gt;},&lt;/span&gt;
        &lt;span class="pi"&gt;{&lt;/span&gt;
          &lt;span class="nv"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;probabilistic_30_percent&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt;
          &lt;span class="nv"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;probabilistic&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt;
          &lt;span class="nv"&gt;probabilistic&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt; &lt;span class="nv"&gt;sampling_percentage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;30&lt;/span&gt; &lt;span class="pi"&gt;}&lt;/span&gt;
        &lt;span class="pi"&gt;}&lt;/span&gt;
      &lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Processors sit in the middle of the pipeline between collecting data and exporting it. They handle tasks like filtering out noise, enriching spans with more context, transforming data formats, or batching data together to improve performance. This step ensures your telemetry data is optimized before being sent to your backend.&lt;/p&gt;
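&lt;p&gt;As a side note, the most common example of such a processor is &lt;code&gt;batch&lt;/code&gt;, which groups spans together before exporting them. A minimal snippet (the values are illustrative, not tuned for any particular workload):&lt;/p&gt;

```yaml
processors:
  batch:
    send_batch_size: 1024  # export once this many spans are buffered...
    timeout: 5s            # ...or after this long, whichever comes first
```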

&lt;h3&gt;
  
  
  🧠 Understanding the Processor Configuration
&lt;/h3&gt;

&lt;p&gt;Let’s go through the &lt;code&gt;processors&lt;/code&gt; section of the &lt;code&gt;otel-collector.yaml&lt;/code&gt; file.&lt;/p&gt;

&lt;p&gt;We’re using the &lt;strong&gt;&lt;a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/tailsamplingprocessor" rel="noopener noreferrer"&gt;tail_sampling processor&lt;/a&gt;&lt;/strong&gt;, one of many available in the OpenTelemetry Collector.&lt;/p&gt;




&lt;h3&gt;
  
  
  ⏳ &lt;code&gt;decision_wait&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;decision_wait&lt;/code&gt; option sets how long the collector should wait (starting from the first span of a trace) before making a sampling decision.&lt;/p&gt;

&lt;p&gt;By default, it's set to &lt;code&gt;30s&lt;/code&gt;. In our config, we’ve reduced it to &lt;code&gt;10s&lt;/code&gt; to speed things up for local development.&lt;/p&gt;

&lt;h4&gt;
  
  
  When to increase &lt;code&gt;decision_wait&lt;/code&gt;:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Long-running traces&lt;/strong&gt; – If your system involves operations that take time to complete (e.g., async workflows, retries, or background jobs), a longer wait ensures all spans are included.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Retry and backoff logic&lt;/strong&gt; – If some spans are delayed due to retries, a short wait might cause them to be missed in the sampling decision.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Potential downsides of increasing &lt;code&gt;decision_wait&lt;/code&gt;:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Increased memory usage&lt;/strong&gt; – The collector needs to buffer spans in memory for a longer period.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;More latency&lt;/strong&gt; – Traces will be processed and exported later, as the collector waits before deciding.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  🚣 &lt;code&gt;num_traces&lt;/code&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Default&lt;/strong&gt;: &lt;code&gt;50000&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Purpose&lt;/strong&gt;: Controls how many traces are kept in memory at a time.&lt;/li&gt;
&lt;li&gt;If your services generate a high number of traces, you might need to increase this to avoid dropping spans before a sampling decision is made.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  📈 &lt;code&gt;expected_new_traces_per_sec&lt;/code&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Default&lt;/strong&gt;: &lt;code&gt;0&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Purpose&lt;/strong&gt;: An estimate of how many new traces the collector expects per second.&lt;/li&gt;
&lt;li&gt;This helps the collector allocate memory and data structures more efficiently.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Why it matters:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Too low&lt;/strong&gt; → Frequent reallocations, hurting performance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Too high&lt;/strong&gt; → Wasted memory.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  🧠 &lt;code&gt;decision_cache&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;Even though we haven’t included &lt;code&gt;decision_cache&lt;/code&gt; in our config, it’s a useful tuning parameter for systems with delayed or long-lived traces (async, retry &amp;amp; backoff logic).&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Purpose&lt;/strong&gt;: Controls the number of traces for which a sampling decision is cached.&lt;/li&gt;
&lt;li&gt;Even after a sampling decision is made for a trace, spans might continue to arrive. This cache ensures those late spans are handled properly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use case&lt;/strong&gt;: In systems where spans can arrive out of order or with delays (e.g., async processing or retries), setting this appropriately helps avoid dropping important spans that arrive after the decision.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It keeps a short-term memory of sampling decisions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;sampled_cache_size&lt;/code&gt;: remembers trace IDs that were sampled → accepts late spans&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;non_sampled_cache_size&lt;/code&gt;: remembers trace IDs that were dropped → drops late spans&lt;/li&gt;
&lt;/ul&gt;
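&lt;p&gt;If we wanted to enable it, the snippet would sit alongside the other &lt;code&gt;tail_sampling&lt;/code&gt; options. The sizes below are illustrative; they count cached trace IDs, not spans:&lt;/p&gt;

```yaml
processors:
  tail_sampling:
    decision_wait: 10s
    decision_cache:
      sampled_cache_size: 100000      # remember sampled trace IDs -> accept late spans
      non_sampled_cache_size: 100000  # remember dropped trace IDs -> drop late spans
```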




&lt;h3&gt;
  
  
  🧪 Sampling Policies
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;policies&lt;/code&gt; section defines how we want to sample traces based on specific criteria. Here's what each policy does:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;📛 errors&lt;/strong&gt; Samples any trace that contains a span with &lt;code&gt;status.code = ERROR&lt;/code&gt;. This ensures we always keep traces that highlight problems.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;{&lt;/span&gt;
  &lt;span class="nv"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;errors&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt;
  &lt;span class="nv"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;status_code&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt;
  &lt;span class="nv"&gt;status_code&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt; &lt;span class="nv"&gt;status_codes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;ERROR&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt; &lt;span class="pi"&gt;}&lt;/span&gt;
&lt;span class="pi"&gt;}&lt;/span&gt;&lt;span class="err"&gt;,&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;🚫 drop-health-checks&lt;/strong&gt; Drops traces where the span path matches &lt;code&gt;/health&lt;/code&gt;. Health checks are frequent and usually not helpful for debugging, so we exclude them to reduce noise.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;{&lt;/span&gt;
  &lt;span class="nv"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;drop-health-checks&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt;
  &lt;span class="nv"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;drop&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt;
  &lt;span class="nv"&gt;drop&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt;
    &lt;span class="nv"&gt;drop_sub_policy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;[&lt;/span&gt;
      &lt;span class="pi"&gt;{&lt;/span&gt;
          &lt;span class="nv"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;drop-health-paths&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt;
          &lt;span class="nv"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;string_attribute&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt;
          &lt;span class="nv"&gt;string_attribute&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt;&lt;span class="nv"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;url.path&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;values&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;\/health&lt;/span&gt;&lt;span class="pi"&gt;],&lt;/span&gt; &lt;span class="nv"&gt;enabled_regex_matching&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;true&lt;/span&gt;&lt;span class="pi"&gt;}&lt;/span&gt;
      &lt;span class="pi"&gt;}&lt;/span&gt;
    &lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="pi"&gt;}&lt;/span&gt;
&lt;span class="pi"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;🎯 probabilistic_30_percent&lt;/strong&gt;: Samples 30% of the remaining traces. This helps retain a representative view of normal, successful traffic. Even though these traces aren’t errors, they’re useful for:

&lt;ul&gt;
&lt;li&gt;Monitoring overall system behavior&lt;/li&gt;
&lt;li&gt;Identifying performance trends&lt;/li&gt;
&lt;li&gt;Analyzing latency patterns
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;{&lt;/span&gt;
  &lt;span class="nv"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;probabilistic_30_percent&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt;
  &lt;span class="nv"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;probabilistic&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt;
  &lt;span class="nv"&gt;probabilistic&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt; &lt;span class="nv"&gt;sampling_percentage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;30&lt;/span&gt; &lt;span class="pi"&gt;}&lt;/span&gt;
&lt;span class="pi"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can fine-tune these policies based on your system’s needs and business logic. You can find more examples of available policies &lt;a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/tailsamplingprocessor" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;
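&lt;p&gt;As one more illustration (not part of our setup), a &lt;code&gt;latency&lt;/code&gt; policy keeps every trace slower than a given threshold, which complements the error policy nicely:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;{
  name: slow-traces,
  type: latency,
  latency: { threshold_ms: 500 }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;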




&lt;h3&gt;
  
  
  📤 Exporters
&lt;/h3&gt;

&lt;p&gt;Exporters are the final step in the Collector pipeline. They send the processed telemetry data, such as traces, metrics, and logs, to a backend system where it can be stored, visualized, and analyzed.&lt;/p&gt;

&lt;p&gt;At least one exporter must be defined for the Collector to function.&lt;/p&gt;

&lt;p&gt;In our setup, we use the OTLP exporter to send trace data to Jaeger over gRPC on port &lt;code&gt;4317&lt;/code&gt;, since Jaeger natively supports OTLP (the OpenTelemetry Protocol):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;exporters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;otlp&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;endpoint&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;jaeger:4317&lt;/span&gt;
    &lt;span class="na"&gt;tls&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;insecure&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;This exporter sends data to the &lt;code&gt;jaeger&lt;/code&gt; service over the Docker network.&lt;/li&gt;
&lt;li&gt;We use &lt;code&gt;insecure: true&lt;/code&gt; because this is a local dev setup. Avoid this in production.&lt;/li&gt;
&lt;/ul&gt;
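&lt;p&gt;For a production setup, a sketch with TLS enabled might look like this (the hostname and CA path are placeholders):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;exporters:
  otlp:
    endpoint: jaeger.example.com:4317  # placeholder TLS-enabled backend
    tls:
      ca_file: /etc/ssl/certs/ca.pem   # CA used to verify the backend certificate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;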




&lt;h3&gt;
  
  
  ⚙️ Service
&lt;/h3&gt;

&lt;p&gt;This section defines how everything in the collector connects together.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;pipelines&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;traces&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;receivers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;otlp&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;processors&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;tail_sampling&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;exporters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;otlp&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;The &lt;code&gt;service&lt;/code&gt; block brings all the configured pieces (receivers, processors, and exporters) into action.&lt;/li&gt;
&lt;li&gt;If you define a component but forget to include it here, it won’t be used.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;📦 &lt;strong&gt;Pipelines&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Under &lt;code&gt;pipelines&lt;/code&gt;, you configure how the data flows through the system. In our case, we only define a &lt;code&gt;traces&lt;/code&gt; pipeline. Other types like &lt;code&gt;metrics&lt;/code&gt; and &lt;code&gt;logs&lt;/code&gt; are possible too.&lt;/p&gt;

&lt;p&gt;Each pipeline must include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;At least one receiver (to accept data)&lt;/li&gt;
&lt;li&gt;Zero or more processors (to modify or filter it)&lt;/li&gt;
&lt;li&gt;At least one exporter (to send it somewhere)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Make sure each part used in the pipeline is properly defined in its corresponding top-level section.&lt;/p&gt;
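&lt;p&gt;Putting those rules together, a minimal skeleton of a valid config looks like this - every component referenced in the pipeline is also defined in its top-level section:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  tail_sampling:
    # decision_wait, policies, etc. go here

exporters:
  otlp:
    endpoint: jaeger:4317
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [tail_sampling]
      exporters: [otlp]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;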


&lt;h3&gt;
  
  
  🧪 Time to Test
&lt;/h3&gt;

&lt;p&gt;To try everything out:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Run the collector and Jaeger with: &lt;code&gt;docker-compose up --build&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Start all three services:

&lt;ul&gt;
&lt;li&gt;Go: &lt;code&gt;go run main.go&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Ruby: &lt;code&gt;bundle exec ruby server.rb&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Node.js: &lt;code&gt;node index.js&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;



&lt;p&gt;Then let's generate some traffic:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Run the &lt;a href="https://github.com/taman9333/distributed_tracing/blob/main/hit_x_service.sh" rel="noopener noreferrer"&gt;hit_x_service.sh&lt;/a&gt; script &lt;strong&gt;three times&lt;/strong&gt; to send &lt;strong&gt;30 requests&lt;/strong&gt;, just like we did with head-based sampling.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;In addition, call the &lt;a href="https://github.com/taman9333/distributed_tracing/blob/tail-based-sampling/x/main.go#L46-L49" rel="noopener noreferrer"&gt;health check endpoint&lt;/a&gt; several times:&lt;br&gt;
&lt;code&gt;curl localhost:3000/health&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This will help us verify if health checks are correctly excluded from sampling.&lt;/p&gt;



&lt;p&gt;We’ll &lt;strong&gt;probably&lt;/strong&gt; see &lt;strong&gt;around 9 traces&lt;/strong&gt; (again, it's a statistical estimate) in the Jaeger UI.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwctpnmx6rlijhw60z6v9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwctpnmx6rlijhw60z6v9.png" alt="tail-based-sampling-testing" width="800" height="396"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Yaaay 🙌 We can see 10 traces. If you look at the scatter plot at the top, you'll notice &lt;strong&gt;3 red dots&lt;/strong&gt;, which indicate the &lt;strong&gt;error traces&lt;/strong&gt;. This means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ Our &lt;strong&gt;probabilistic sampling&lt;/strong&gt; of 30% is working&lt;/li&gt;
&lt;li&gt;✅ The &lt;strong&gt;3 errors&lt;/strong&gt; are all present, so error sampling is working&lt;/li&gt;
&lt;li&gt;✅ No &lt;strong&gt;health check&lt;/strong&gt; traces appear; they’ve been dropped as expected&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  🚀 Scaling the Collector – Why and How
&lt;/h2&gt;

&lt;p&gt;As our system grows and the number of instrumented services increases, the &lt;strong&gt;volume of telemetry data can quickly overwhelm a single collector instance&lt;/strong&gt;.&lt;br&gt;&lt;br&gt;
Whether the collector is processing &lt;strong&gt;traces only&lt;/strong&gt; or handling &lt;strong&gt;traces, metrics, and logs&lt;/strong&gt;, running everything through one instance can lead to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Bottlenecks&lt;/strong&gt; in processing and exporting data
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Increased latency&lt;/strong&gt; in trace availability
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Risk of dropped data&lt;/strong&gt; during traffic spikes
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Limited fault tolerance&lt;/strong&gt; if the single collector fails
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To handle larger volumes reliably, we need to &lt;strong&gt;scale the collector horizontally&lt;/strong&gt; by running multiple instances and distributing the load across them, while ensuring spans from the same trace go to the same collector.&lt;/p&gt;
&lt;h2&gt;
  
  
  🛠️ Deployment patterns
&lt;/h2&gt;
&lt;h3&gt;
  
  
  🐢 No Collector
&lt;/h3&gt;

&lt;p&gt;When we used head-based sampling, services sent traces straight to Jaeger. That is a &lt;strong&gt;direct integration&lt;/strong&gt; - no collector in the path.&lt;br&gt;&lt;br&gt;
Visual from the OpenTelemetry docs&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8mj89p0gqey4py83bbzo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8mj89p0gqey4py83bbzo.png" alt="no-collector" width="800" height="307"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  🕵🏻 Agent
&lt;/h3&gt;

&lt;p&gt;With tail-based sampling, we introduced the collector next to each service. This is the &lt;strong&gt;agent&lt;/strong&gt; deployment pattern.&lt;br&gt;&lt;br&gt;
Applications instrumented with OTLP send telemetry to a collector running with the app or on the same host.&lt;br&gt;&lt;br&gt;
Visual from the OpenTelemetry docs&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3eokk8k39ijggvh4ukmi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3eokk8k39ijggvh4ukmi.png" alt="agent-mode" width="800" height="168"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Simple to get started
&lt;/li&gt;
&lt;li&gt;Clear 1-1 mapping between application and collector&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Limited scalability, especially if the collector must handle traces, logs, and metrics
&lt;/li&gt;
&lt;li&gt;Harder to manage at scale across many hosts&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  ⛩️ Gateway
&lt;/h3&gt;

&lt;p&gt;The solution is the third pattern - the &lt;strong&gt;gateway&lt;/strong&gt;.&lt;br&gt;&lt;br&gt;
In the &lt;strong&gt;gateway&lt;/strong&gt; deployment, apps or sidecar collectors send telemetry to one OTLP endpoint that fronts a pool of collectors running as a standalone service - per cluster, per data center, or per region.&lt;br&gt;&lt;br&gt;
Visual from the OpenTelemetry docs&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd57oyrz353wkagpwho3u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd57oyrz353wkagpwho3u.png" alt="gateway-mode" width="800" height="226"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  🚨⚠️ Important note - collectors are stateful 🚨⚠️
&lt;/h3&gt;

&lt;p&gt;Collectors hold data in memory. Tail sampling buffers spans until a decision is made.&lt;/p&gt;

&lt;p&gt;If you scale collectors horizontally without coordination, different replicas may receive spans from the same trace. Each replica will decide on sampling independently. Results can diverge. You may end up with traces missing spans, which misrepresents what happened.&lt;/p&gt;
&lt;h3&gt;
  
  
  How to scale correctly
&lt;/h3&gt;

&lt;p&gt;Place a &lt;strong&gt;load-balancing&lt;/strong&gt; layer of collectors in front of the tail-sampling collectors.&lt;/p&gt;

&lt;p&gt;Use the &lt;strong&gt;load-balancing exporter&lt;/strong&gt; to route all spans of the same trace to the same backend collector.&lt;/p&gt;

&lt;p&gt;It does this by hashing the trace ID (or the service name) and consistently routing related spans to the same target.&lt;/p&gt;

&lt;p&gt;OpenTelemetry provides this load-balancing exporter out of the box. Next, we will see how to configure it in code.&lt;/p&gt;


&lt;h3&gt;
  
  
  &lt;a href="https://github.com/taman9333/distributed_tracing/blob/4539c66b691033a3fef2cb7095a2dbdc88135e16/docker-compose.yml#L4-L40" rel="noopener noreferrer"&gt;Docker changes for scaling with a gateway&lt;/a&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;We renamed the &lt;code&gt;otel-collector&lt;/code&gt; service to &lt;code&gt;otel-collector-1&lt;/code&gt;, and duplicated it as &lt;code&gt;otel-collector-2&lt;/code&gt; and &lt;code&gt;otel-collector-3&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;These three collectors run tail-based sampling, and do not expose fixed host ports, since services will not talk to them directly&lt;/li&gt;
&lt;li&gt;We added an &lt;code&gt;otel-gateway&lt;/code&gt; service that runs the load balancing exporter, and exposes ports &lt;code&gt;4317&lt;/code&gt; and &lt;code&gt;4318&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;All app services send OTLP traffic to the gateway, and the gateway consistently routes each trace to one of the collectors
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;otel-collector-1&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;otel/opentelemetry-collector-contrib:0.130.0&lt;/span&gt;
    &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--config=/etc/otel-collector.yaml"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;./otel-collector.yaml:/etc/otel-collector.yaml&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;4317"&lt;/span&gt;        &lt;span class="c1"&gt;# OTLP gRPC receiver&lt;/span&gt;

  &lt;span class="na"&gt;otel-collector-2&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;otel/opentelemetry-collector-contrib:0.130.0&lt;/span&gt;
    &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--config=/etc/otel-collector.yaml"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;./otel-collector.yaml:/etc/otel-collector.yaml&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;4317"&lt;/span&gt;        &lt;span class="c1"&gt;# OTLP gRPC receiver&lt;/span&gt;

  &lt;span class="na"&gt;otel-collector-3&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;otel/opentelemetry-collector-contrib:0.130.0&lt;/span&gt;
    &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--config=/etc/otel-collector.yaml"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;./otel-collector.yaml:/etc/otel-collector.yaml]&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;4317"&lt;/span&gt;        &lt;span class="c1"&gt;# OTLP gRPC receiver&lt;/span&gt;

  &lt;span class="c1"&gt;# Otel gateway running load balancing exporter&lt;/span&gt;
  &lt;span class="na"&gt;otel-gateway&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;otel/opentelemetry-collector-contrib:0.130.0&lt;/span&gt;
    &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--config=/etc/otel-gateway.yaml"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;./otel-gateway.yaml:/etc/otel-gateway.yaml&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;4317:4317&lt;/span&gt;     &lt;span class="c1"&gt;# OTLP gRPC&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;4318:4318&lt;/span&gt;     &lt;span class="c1"&gt;# OTLP HTTP&lt;/span&gt;
    &lt;span class="na"&gt;depends_on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;otel-collector-1&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;otel-collector-2&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;otel-collector-3&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;a href="https://github.com/taman9333/distributed_tracing/blob/tail-based-sampling-horizontal-scaling/otel-gateway.yaml" rel="noopener noreferrer"&gt;otel-gateway.yaml&lt;/a&gt; - what it does
&lt;/h3&gt;

&lt;p&gt;This file defines a gateway collector that accepts OTLP traffic from services and forwards spans to a pool of tail sampling collectors using the load balancing exporter.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;receivers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;otlp&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;protocols&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;http&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;endpoint&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;0.0.0.0:4318&lt;/span&gt;

&lt;span class="na"&gt;exporters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;loadbalancing&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;routing_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;traceID&lt;/span&gt;
    &lt;span class="na"&gt;protocol&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;otlp&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;tls&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;insecure&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="na"&gt;resolver&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;static&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;hostnames&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;otel-collector-1:4317&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;otel-collector-2:4317&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;otel-collector-3:4317&lt;/span&gt;

&lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;telemetry&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;logs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;level&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;debug&lt;/span&gt;
  &lt;span class="na"&gt;pipelines&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;traces&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;receivers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;otlp&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;exporters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;loadbalancing&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Receivers&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;otlp&lt;/code&gt; listens for OTLP over HTTP on &lt;code&gt;0.0.0.0:4318&lt;/code&gt;, since our services export traces to the gateway via HTTP on port &lt;code&gt;4318&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Exporters - loadbalancing&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;resolver.static.hostnames&lt;/code&gt; lists the downstream collectors to send to&lt;/li&gt;
&lt;li&gt;We use Docker Compose service names, since within the Docker Compose network service names act as DNS hostnames: &lt;code&gt;otel-collector-1:4317&lt;/code&gt;, &lt;code&gt;otel-collector-2:4317&lt;/code&gt;, &lt;code&gt;otel-collector-3:4317&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;routing_key: traceID&lt;/code&gt;&lt;/strong&gt; means all spans that share the same trace id are routed to the same downstream collector, avoiding cases where different collectors sample parts of the same trace and cause incomplete or misleading results.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Service&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;telemetry.logs.level: debug&lt;/code&gt; helps with debugging the gateway behavior. We’ve also added the &lt;a href="https://github.com/taman9333/distributed_tracing/blob/4539c66b691033a3fef2cb7095a2dbdc88135e16/otel-collector.yaml#L50-L52" rel="noopener noreferrer"&gt;same telemetry configuration to otel-collector.yaml&lt;/a&gt; so that all three collectors produce debug-level logs, making it easier to verify that everything is working correctly.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;pipelines.traces&lt;/code&gt; wires the &lt;code&gt;otlp&lt;/code&gt; receiver to the &lt;code&gt;loadbalancing&lt;/code&gt; exporter&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  🧪 Time to Test
&lt;/h3&gt;

&lt;p&gt;If you still have Docker running, run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker-compose down
docker-compose up &lt;span class="nt"&gt;--build&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then execute the &lt;code&gt;./hit_x_service.sh&lt;/code&gt; script &lt;strong&gt;three times&lt;/strong&gt;, just like we did when testing tail-based sampling without scaling the collector. This will generate &lt;strong&gt;30 requests&lt;/strong&gt;. We’d expect to see &lt;strong&gt;around 9 traces&lt;/strong&gt; in Jaeger.&lt;/p&gt;

&lt;p&gt;After checking Jaeger UI, we’ve received &lt;strong&gt;10 traces&lt;/strong&gt; (by coincidence, same as last test), &lt;strong&gt;3 error traces&lt;/strong&gt; and &lt;strong&gt;7 normal traces&lt;/strong&gt;. All traces show &lt;strong&gt;complete spans&lt;/strong&gt;, meaning nothing was dropped. This confirms that spans from the same trace were routed to the same collector.&lt;/p&gt;

&lt;p&gt;Now we need to confirm that traces were actually distributed across collectors, and not all sent to a single one. Since we enabled &lt;strong&gt;telemetry debug logs&lt;/strong&gt; in every collector, we can run the following command for each collector service to filter useful logs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker-compose logs &lt;span class="o"&gt;{{&lt;/span&gt;SERVICE_NAME&lt;span class="o"&gt;}}&lt;/span&gt; 2&amp;gt;&amp;amp;1 | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="s1"&gt;'"batch.len": [1-9]'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This filters out noisy logs, showing only batches where spans were sent.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fitwizhum40qdwuo2tf96.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fitwizhum40qdwuo2tf96.png" alt="otel-gateway-testing" width="800" height="439"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;📊 &lt;strong&gt;Log results:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;otel-collector-1&lt;/strong&gt; → 2 logs - total traces: 10 (&lt;strong&gt;sampled&lt;/strong&gt;: 5, &lt;strong&gt;notSampled&lt;/strong&gt;: 5). Here, &lt;strong&gt;total traces&lt;/strong&gt; = sum of &lt;code&gt;"batch.len"&lt;/code&gt; values (2 from first log + 8 from second log).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;otel-collector-2&lt;/strong&gt; → 3 logs - total traces: 11 (&lt;strong&gt;sampled&lt;/strong&gt;: 2, &lt;strong&gt;notSampled&lt;/strong&gt;: 9)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;otel-collector-3&lt;/strong&gt; → 2 logs - total traces: 9 (&lt;strong&gt;sampled&lt;/strong&gt;: 3, &lt;strong&gt;notSampled&lt;/strong&gt;: 6)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;✅ This means all collectors together received exactly &lt;strong&gt;30 traces&lt;/strong&gt;, matching the requests sent.&lt;br&gt;&lt;br&gt;
✅ Total sampled traces = &lt;strong&gt;10&lt;/strong&gt;, which matches what we see in Jaeger UI.&lt;br&gt;&lt;br&gt;
✅ Total not-sampled traces = &lt;strong&gt;20&lt;/strong&gt;, as expected.&lt;/p&gt;




&lt;h2&gt;
  
  
  📌 Final Architecture
&lt;/h2&gt;

&lt;p&gt;And here is our final architecture:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F777xou9lny6s61xwsejf.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F777xou9lny6s61xwsejf.jpg" alt="Final-OTEL-Architecture" width="800" height="238"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This setup allows us to build a &lt;strong&gt;robust distributed tracing system&lt;/strong&gt; that can &lt;strong&gt;absorb millions of traces&lt;/strong&gt; efficiently, while keeping costs lower and reducing noise as much as possible.  &lt;/p&gt;

&lt;p&gt;By combining &lt;strong&gt;tail-based sampling&lt;/strong&gt;, &lt;strong&gt;load balancing across multiple collectors&lt;/strong&gt;, and &lt;strong&gt;selective sampling policies&lt;/strong&gt;, we ensure that we capture the most valuable traces without overloading our backend.&lt;/p&gt;

</description>
      <category>tracing</category>
      <category>microservices</category>
      <category>apm</category>
      <category>opentelemetry</category>
    </item>
    <item>
      <title>Distributed Tracing Instrumentation with OpenTelemetry and Jaeger</title>
      <dc:creator>taman9333</dc:creator>
      <pubDate>Thu, 31 Jul 2025 20:46:25 +0000</pubDate>
      <link>https://dev.to/taman9333/distributed-tracing-instrumentation-with-opentelemetry-and-jaeger-em2</link>
      <guid>https://dev.to/taman9333/distributed-tracing-instrumentation-with-opentelemetry-and-jaeger-em2</guid>
      <description>&lt;p&gt;Distributed tracing is a way to track a request as it moves through a system, especially in setups where multiple services talk to each other, like in microservices.&lt;/p&gt;

&lt;p&gt;Imagine a user clicking &lt;strong&gt;"buy"&lt;/strong&gt; on an e-commerce site. That action might hit a front-end service, a payment processor, an inventory checker, a database and a Redis cache.&lt;/p&gt;

&lt;p&gt;If something goes wrong, figuring out where it failed can be a nightmare without a clear map. That’s where distributed tracing comes in. It’s like a GPS for your application, showing the path of a request across services, how long each step takes, and where things might break.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fniknc7b6ezs0m2qye5xj.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fniknc7b6ezs0m2qye5xj.jpg" alt="request-life-cycle-in-microservices" width="800" height="239"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;Unlike logs, which are like diary entries of what happened, or metrics, which give you numbers like CPU usage, tracing gives you the full story of a request’s journey. It’s critical for spotting bottlenecks, debugging errors, and understanding how your system behaves under real world conditions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fww5x6p5aixtvh4cv07pz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fww5x6p5aixtvh4cv07pz.png" alt="traces vs logs" width="477" height="740"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Logs are great for capturing detailed information about what your application is doing. But they are often scattered and not tied together. Traces, on the other hand, bring structure. You can think of them as stitched logs that belong to the same request. When you attach key log details as attributes or events inside spans, you end up with the same information, but now it is grouped by request and connected across services.&lt;/p&gt;




&lt;p&gt;This article explains distributed tracing using my GitHub repository &lt;a href="https://github.com/taman9333/distributed_tracing" rel="noopener noreferrer"&gt;distributed_tracing&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The repo includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;📦 Instrumentation with OpenTelemetry &amp;amp; Jaeger
&lt;/li&gt;
&lt;li&gt;🎯 Head-based sampling
&lt;/li&gt;
&lt;li&gt;🧠 Tail-based sampling
&lt;/li&gt;
&lt;li&gt;⚖️ Scaling collectors with a load balancer
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We will walk through each step and explain what the code is doing.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;⚠️Disclaimer:&lt;/strong&gt;&lt;br&gt;
There are other important topics that we will not cover here, such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Custom instrumentation for capturing application-specific spans
&lt;/li&gt;
&lt;li&gt;How to define a good trace and what makes a trace useful
&lt;/li&gt;
&lt;li&gt;Correlating logs with traces so that logs are grouped around a single request
&lt;/li&gt;
&lt;li&gt;Aggregating data from your traces to derive metrics without exporting them separately&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;




&lt;p&gt;Before we dive into the repository, a note on scope: we will rely on automatic instrumentation to keep things simple. Specifically, this means instrumenting HTTP servers to capture incoming requests and HTTP clients to trace outgoing calls. In real systems, however, that is rarely enough; you often need to go beyond it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Good traces require good data 🤖.&lt;/strong&gt; That means instrumenting all the key parts of your system: HTTP clients and servers, relational databases, cache layers, Elasticsearch, and any other critical services should be automatically instrumented where possible. Then layer on custom instrumentation to fill the gaps and highlight the things that matter most in your business logic.&lt;/p&gt;




&lt;h3&gt;
  
  
  Repository Overview
&lt;/h3&gt;

&lt;p&gt;This repository demonstrates distributed tracing using OpenTelemetry with Jaeger as the backend to collect and visualize traces.&lt;/p&gt;

&lt;p&gt;It features three services:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;X&lt;/strong&gt;: written in Go&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Y&lt;/strong&gt;: written in Ruby&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Z&lt;/strong&gt;: written in Node.js&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The services are connected in a chain: Service X calls Y, and Y calls Z. This creates a traceable path for a request as it moves across the system.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpdwxryvail3xo7mxx788.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpdwxryvail3xo7mxx788.png" alt="Service chain overview" width="550" height="56"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The goal is to trace the full lifecycle of a request as it flows from the entry point (service X) to the final service (Z). In the real world, this pattern is common in microservice-based applications where distributed tracing can help identify where time is spent or where failures occur.&lt;/p&gt;

&lt;p&gt;To simulate real behavior and failures:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Service Z is configured to return a &lt;strong&gt;500 Internal Server Error&lt;/strong&gt; on &lt;strong&gt;every 10th request&lt;/strong&gt;. This is done deliberately to help us observe how different sampling strategies (head-based vs tail-based) behave when errors are present in the trace.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The script &lt;a href="https://github.com/taman9333/distributed_tracing/blob/main/hit_x_service.sh" rel="noopener noreferrer"&gt;hit_x_service.sh&lt;/a&gt; sends &lt;strong&gt;10 HTTP GET requests&lt;/strong&gt; in a row to the &lt;code&gt;/x&lt;/code&gt; endpoint in service X. This creates a consistent flow of traces that travel through all three services.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
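&lt;p&gt;The failure injection described above boils down to a simple request counter. Here is a minimal Ruby sketch of the idea (an illustration only; the repository implements service Z in Node.js):&lt;/p&gt;

```ruby
# Sketch of service Z's failure injection: every 10th request returns a 500.
# A global counter tracks how many requests have been handled so far.
$request_count = 0

def handle_request
  $request_count += 1
  ($request_count % 10).zero? ? 500 : 200
end

statuses = 10.times.map { handle_request }
# statuses => nine 200s followed by a single 500
```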




&lt;h3&gt;
  
  
  Architecture Overview
&lt;/h3&gt;

&lt;p&gt;Here’s a high level diagram that shows how everything fits together under the hood. Each of the three services (X, Y, and Z) is instrumented using OpenTelemetry and exports trace data via OTLP over HTTP to the Jaeger Collector. The Jaeger Collector receives and processes the traces, forwarding them to the backend for storage and visualization in the Jaeger UI.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc18r0nj3dvden43x7a51.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc18r0nj3dvden43x7a51.jpg" alt="Jaeger-all-in-one-microservices" width="800" height="470"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  Instrumentation and Basic Tracing
&lt;/h3&gt;

&lt;p&gt;To start, we have a &lt;a href="https://github.com/taman9333/distributed_tracing/blob/main/docker-compose.yml" rel="noopener noreferrer"&gt;docker-compose.yml&lt;/a&gt; file that sets up the environment. It installs Jaeger along with the necessary ports to receive and visualize trace data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;3'&lt;/span&gt;

&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;jaeger&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;jaegertracing/all-in-one:1.71.0&lt;/span&gt;
    &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--collector.otlp.grpc.tls.enabled=false"&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;16686:16686"&lt;/span&gt;   &lt;span class="c1"&gt;# Jaeger UI&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;4317:4317"&lt;/span&gt;     &lt;span class="c1"&gt;# OTLP gRPC&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;4318:4318"&lt;/span&gt;     &lt;span class="c1"&gt;# OTLP HTTP&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let’s break down what each port does:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Port 16686&lt;/strong&gt; is used to access the Jaeger UI.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Port 4317&lt;/strong&gt; allows the Jaeger Collector to receive trace data using the OpenTelemetry Protocol (OTLP) over gRPC.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Port 4318&lt;/strong&gt; does the same, but over HTTP instead of gRPC.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Ports 4317 and 4318 are both handled by the Jaeger Collector, which ingests trace data from services instrumented with OpenTelemetry. These services generate spans, and the collector receives, processes, and forwards them to the Jaeger backend for storage and visualization.&lt;/p&gt;

&lt;p&gt;With this setup in place, you can start sending traces from your services to Jaeger using either OTLP over gRPC or HTTP. This flexibility makes it easier to integrate tracing into different environments and across various tech stacks.&lt;/p&gt;




&lt;h3&gt;
  
  
  Sampling: Default Behavior
&lt;/h3&gt;

&lt;p&gt;By default, OpenTelemetry samples 100% of traces. That means every span created in your service will be recorded and exported.&lt;/p&gt;

&lt;p&gt;Unless you have a specific need to manage trace volume, such as in high-throughput production environments, you don’t need to configure a custom sampler.&lt;/p&gt;

&lt;p&gt;The default sampler is a combination of &lt;code&gt;ParentBased&lt;/code&gt; and &lt;code&gt;ALWAYS_ON&lt;/code&gt;. Here's what that means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The root span of a trace is always sampled.&lt;/li&gt;
&lt;li&gt;All child spans inherit the sampling decision of their parent.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This guarantees that once a trace is started, every span within it will be sampled and exported.&lt;/p&gt;
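&lt;p&gt;The decision logic can be sketched in a few lines of Ruby (a simplified illustration, not the OpenTelemetry SDK API):&lt;/p&gt;

```ruby
# ParentBased(root: ALWAYS_ON) in miniature: a root span (no parent)
# is always sampled, and a child span inherits its parent's decision.
def should_sample?(parent_sampled)
  return true if parent_sampled.nil? # root span: ALWAYS_ON applies
  parent_sampled                     # child span: follow the parent
end

should_sample?(nil)   # root span => sampled
should_sample?(false) # child of a dropped trace => dropped
```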

&lt;p&gt;In the first step, the tracing logic added to all three services (X, Y, and Z) will use the default sampler, meaning no sampling limits are applied.&lt;/p&gt;

&lt;p&gt;Here’s how 100% sampling is configured in each language used in this repository:&lt;/p&gt;

&lt;h4&gt;
  
  
  Go (Service X)
&lt;/h4&gt;

&lt;p&gt;We start our Go app by invoking the &lt;a href="https://github.com/taman9333/distributed_tracing/blob/main/x/main.go#L24" rel="noopener noreferrer"&gt;&lt;code&gt;initTracer&lt;/code&gt;&lt;/a&gt; function. This function is responsible for tracing the HTTP server (receiving requests).&lt;/p&gt;

&lt;p&gt;There are two key things to consider in this function:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It sends traces &lt;a href="https://github.com/taman9333/distributed_tracing/blob/main/x/main.go#L53-L54" rel="noopener noreferrer"&gt;through HTTP to the Jaeger Collector&lt;/a&gt; on port 4318, and disables TLS since we are running locally:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;otlptracehttp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;New&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;otlptracehttp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WithEndpoint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"localhost:4318"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;otlptracehttp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WithInsecure&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="c"&gt;// disables TLS&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/taman9333/distributed_tracing/blob/main/x/main.go#L68-L72" rel="noopener noreferrer"&gt;Context propagation&lt;/a&gt;: the process of passing trace context (like trace and span IDs) across service boundaries, enabling the tracking of requests as they move through a distributed system. It ensures that the trace remains intact and connected, providing full observability. Propagation is usually handled by instrumentation libraries as we will see in the next snippet, however In the event that you need to manually propagate context, you can use the &lt;a href="https://opentelemetry.io/docs/specs/otel/context/api-propagators/" rel="noopener noreferrer"&gt;Propagators API&lt;/a&gt;.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;otel&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SetTextMapPropagator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;propagation&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewCompositeTextMapPropagator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;propagation&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TraceContext&lt;/span&gt;&lt;span class="p"&gt;{},&lt;/span&gt; &lt;span class="n"&gt;propagation&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Baggage&lt;/span&gt;&lt;span class="p"&gt;{},&lt;/span&gt;
        &lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can observe context propagation in action by inspecting the request headers passed between services.&lt;/p&gt;

&lt;p&gt;For example, since the Go service (X) calls the Ruby service (Y), if you log the incoming request headers in the Ruby app, you will see something like:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fws66yedndxkw4zpjk48y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fws66yedndxkw4zpjk48y.png" alt="context-propagation" width="800" height="168"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you can see, the &lt;code&gt;HTTP_TRACEPARENT&lt;/code&gt; header is present. This header carries trace context across service boundaries and allows the spans created by each service to be linked to the same trace.&lt;/p&gt;
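&lt;p&gt;To see what that header actually carries, here is a small Ruby sketch that parses a W3C &lt;code&gt;traceparent&lt;/code&gt; value (the helper name is made up for illustration):&lt;/p&gt;

```ruby
# A traceparent value has four dash-separated hex fields:
# version(2) - trace_id(32) - parent_span_id(16) - trace_flags(2)
def parse_traceparent(header)
  version, trace_id, span_id, flags = header.split('-')
  {
    version: version,
    trace_id: trace_id,
    span_id: span_id,
    sampled: flags.to_i(16).odd? # the 0x01 bit marks the trace as sampled
  }
end

ctx = parse_traceparent('00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01')
# ctx[:trace_id] is what links spans across services; ctx[:sampled]
# carries the upstream sampling decision.
```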

&lt;ul&gt;
&lt;li&gt;Finally, we trace &lt;a href="https://github.com/taman9333/distributed_tracing/blob/main/x/main.go#L19-L21" rel="noopener noreferrer"&gt;outgoing HTTP requests&lt;/a&gt; made by the Go service using an instrumented &lt;code&gt;http.Client&lt;/code&gt;. This is essential for tracing HTTP client calls from Go to downstream services:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;Transport&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;otelhttp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewTransport&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DefaultTransport&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h4&gt;
  
  
  Ruby (Service Y)
&lt;/h4&gt;

&lt;p&gt;In Ruby, we use Sinatra to serve web requests and Faraday as the HTTP client. Instrumenting Ruby with OpenTelemetry is much simpler and requires less code compared to Go.&lt;/p&gt;

&lt;p&gt;Here’s what you need in the &lt;a href="https://github.com/taman9333/distributed_tracing/blob/main/y/server.rb#L9C1-L13" rel="noopener noreferrer"&gt;server.rb&lt;/a&gt; file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="no"&gt;OpenTelemetry&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="no"&gt;SDK&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;configure&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;
  &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;service_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'service-y'&lt;/span&gt;
  &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;use&lt;/span&gt; &lt;span class="s1"&gt;'OpenTelemetry::Instrumentation::Sinatra'&lt;/span&gt;
  &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;use&lt;/span&gt; &lt;span class="s1"&gt;'OpenTelemetry::Instrumentation::Faraday'&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Unlike Go, we don’t need to manually configure context propagation.&lt;br&gt;
The OpenTelemetry Ruby SDK handles this automatically as long as you are using auto-instrumented libraries like Sinatra and Faraday. It extracts incoming context from request headers and injects it into outgoing HTTP requests without additional setup.&lt;/p&gt;




&lt;h4&gt;
  
  
  Node.js (Service Z)
&lt;/h4&gt;

&lt;p&gt;In Node.js, we use Express as our web server. The instrumentation setup is located in a separate file, &lt;a href="https://github.com/taman9333/distributed_tracing/blob/main/z/tracing.js" rel="noopener noreferrer"&gt;tracing.js&lt;/a&gt;, which is imported at the top of &lt;a href="https://github.com/taman9333/distributed_tracing/blob/main/z/index.js#L1" rel="noopener noreferrer"&gt;index.js&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The tracing.js file configures the OpenTelemetry setup for the service.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We send traces to &lt;a href="https://github.com/taman9333/distributed_tracing/blob/main/z/tracing.js#L11" rel="noopener noreferrer"&gt;the Jaeger Collector using HTTP&lt;/a&gt;:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;exporter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;OTLPTraceExporter&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;http://localhost:4318/v1/traces&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;We enable &lt;a href="https://github.com/taman9333/distributed_tracing/blob/main/z/tracing.js#L13-L26" rel="noopener noreferrer"&gt;automatic instrumentation for both the HTTP layer and Express&lt;/a&gt;. It’s important to instrument the HTTP layer first, since Express relies on it:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;provider&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;NodeTracerProvider&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;resource&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;resourceFromAttributes&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;ATTR_SERVICE_NAME&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;service-z&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;}),&lt;/span&gt;
  &lt;span class="na"&gt;spanProcessors&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;SimpleSpanProcessor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;exporter&lt;/span&gt;&lt;span class="p"&gt;)],&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="nf"&gt;registerInstrumentations&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;tracerProvider&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;provider&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;instrumentations&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="c1"&gt;// Express instrumentation expects HTTP layer to be instrumented&lt;/span&gt;
    &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;HttpInstrumentation&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ExpressInstrumentation&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
  &lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  🎉 Installation Complete, Time to Trace 🚀
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Start all three services (X, Y, and Z)&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Run the Jaeger backend using: &lt;strong&gt;&lt;code&gt;docker-compose up&lt;/code&gt;&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Now run the following script to send 10 requests to Service X: &lt;code&gt;./hit_x_service.sh&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Open your browser and go to: &lt;a href="http://localhost:16686" rel="noopener noreferrer"&gt;http://localhost:16686&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You should see 10 traces listed in the Jaeger UI. If you click into the last one, you’ll notice it contains an error. That’s because service Z is configured to fail on every 10th request, just as we planned.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1vapckwvshevzln9a2rn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1vapckwvshevzln9a2rn.png" alt="Jaeger-UI-no-sampling" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When you click on the last trace, the one with the error, you can follow the full request lifecycle across all three services.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F21eeh5rfa7n3i2vbcqwb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F21eeh5rfa7n3i2vbcqwb.png" alt="reuqest-life-cycle-full-trace" width="800" height="452"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;🧵 Below is a trace that shows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The request starts in service-x&lt;/li&gt;
&lt;li&gt;It propagates to service-y&lt;/li&gt;
&lt;li&gt;Then it hits service-z and fails with a 500&lt;/li&gt;
&lt;li&gt;Back in service-y, we log the error and correlate it with the trace&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This makes it super easy to debug distributed systems and pinpoint which service is failing and why.&lt;/p&gt;




&lt;h3&gt;
  
  
  🚨 But Wait, There’s a Problem
&lt;/h3&gt;

&lt;p&gt;Cool - at this point everything is working. You’ve got traces flowing, spans being recorded, and the Jaeger UI showing the request paths across your services.&lt;/p&gt;

&lt;p&gt;But here’s the catch: in production, things look very different. Your system might generate &lt;strong&gt;millions of traces every day&lt;/strong&gt;, and with that come a few serious challenges:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;High cost for exporting and storing all spans - especially when using hosted platforms&lt;/li&gt;
&lt;li&gt;Too much noise, making it hard to focus on what’s important (for example, health checks)&lt;/li&gt;
&lt;li&gt;Hard to catch the interesting traces - the ones with high latency, errors, or performance bottlenecks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is where &lt;strong&gt;sampling&lt;/strong&gt; comes in. It helps you reduce the volume of trace data while keeping the insights that matter most.&lt;/p&gt;

&lt;p&gt;We’ll talk about sampling strategies - including head-based and tail-based sampling in the next article.&lt;/p&gt;

&lt;p&gt;Stay tuned 🔥&lt;/p&gt;

</description>
      <category>tracing</category>
      <category>microservices</category>
      <category>apm</category>
      <category>opentelemetry</category>
    </item>
    <item>
      <title>Scalable URL Shortener Part 2</title>
      <dc:creator>taman9333</dc:creator>
      <pubDate>Mon, 26 Aug 2024 23:59:40 +0000</pubDate>
      <link>https://dev.to/taman9333/scalable-url-shortener-part1-4jc2</link>
      <guid>https://dev.to/taman9333/scalable-url-shortener-part1-4jc2</guid>
      <description>&lt;h3&gt;
  
  
  Encoding a Long URL to a Short URL
&lt;/h3&gt;

&lt;p&gt;In this part, we'll explore how to encode a long URL into a short URL. There are different techniques we can use to achieve this.&lt;/p&gt;

&lt;h3&gt;
  
  
  Option 1: Hash Functions
&lt;/h3&gt;

&lt;p&gt;According to Wikipedia, a hash function is any function that maps data of arbitrary size to fixed-size values.&lt;/p&gt;

&lt;p&gt;Let’s take &lt;strong&gt;MD5&lt;/strong&gt; as an example. The output length of the MD5 hash function is 128 bits, or 16 bytes. When represented as a hexadecimal string, it’s 32 characters long.&lt;/p&gt;

&lt;p&gt;Here’s a simple example of using MD5 in Ruby:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="nb"&gt;require&lt;/span&gt; &lt;span class="s1"&gt;'digest/md5'&lt;/span&gt;
&lt;span class="no"&gt;Digest&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="no"&gt;SHA256&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;hexdigest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"www.test.com"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# =&amp;gt; "84cc0e5c525dc728e1769ad6663341c8"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As you can see, the output is 32 characters, which is too long for our use case. To address this, we can use a simple trick: taking only the first 7 characters.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="no"&gt;Digest&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="no"&gt;MD5&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;hexdigest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"www.test.com"&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;..&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="c1"&gt;# =&amp;gt; "84cc0e5"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Cool; it's pretty easy, right?&lt;/p&gt;

&lt;p&gt;Unfortunately, this is not a perfect solution as the MD5 algorithm might lead to collisions. Here are the scenarios where collisions could happen:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The MD5 algorithm can possibly generate the same hash code for different strings (this is very rare).&lt;/li&gt;
&lt;li&gt;Even if the entire hash code is not the same, you could encounter a collision where two different hash codes share the same first 7 characters that we plan to use.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We could introduce a unique index to solve this problem. This would allow us to catch any duplicate hash codes when writing them to the database and retry generating a new hash code. However, using a unique index has its downsides. It would place a lock on our database, which wouldn't scale well if we receive a lot of writes. Additionally, if we plan to shard our database to scale writes across different regions, the unique index approach would no longer work effectively.&lt;/p&gt;
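&lt;p&gt;The retry idea can be sketched like this, with an in-memory Set standing in for the database’s unique index (illustrative only):&lt;/p&gt;

```ruby
require 'digest/md5'
require 'set'

# On a prefix collision, salt the input and hash again until we find
# a free 7-character code. The Set plays the role of the unique index.
def shorten(url, taken)
  candidate = Digest::MD5.hexdigest(url)[0..6]
  candidate = Digest::MD5.hexdigest(url + rand.to_s)[0..6] while taken.include?(candidate)
  taken.add(candidate)
  candidate
end

taken = Set.new
first  = shorten('www.test.com', taken)
second = shorten('www.test.com', taken) # collides, so it gets re-salted
# first and second are both 7 characters long and guaranteed distinct
```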

&lt;h3&gt;
  
  
  Option 2: Counter (Optimal Solution)
&lt;/h3&gt;

&lt;p&gt;Instead of using a hash function, we can use a counter-based approach to generate short URLs. This method is much simpler and avoids the collision issues that can occur with hash functions.&lt;/p&gt;

&lt;h3&gt;
  
  
  How It Works
&lt;/h3&gt;

&lt;p&gt;The idea is straightforward: we maintain a global counter that increments with every new URL request. Each time a new URL is submitted, we increment the counter and convert the value of the counter into a short string using &lt;strong&gt;Base62 encoding&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Base62 encoding&lt;/strong&gt; uses a character set of 62 characters (&lt;code&gt;0-9&lt;/code&gt;, &lt;code&gt;a-z&lt;/code&gt;, &lt;code&gt;A-Z&lt;/code&gt;), which is perfect for generating short, readable URLs. By encoding the incremented counter value into Base62, we generate a unique and compact string for each URL.&lt;/p&gt;
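&lt;p&gt;Here is a minimal Ruby sketch of encoding a counter value into Base62 (production code would be more defensive, but the core loop is just repeated division by 62):&lt;/p&gt;

```ruby
# Encode a non-negative integer into Base62 (0-9, a-z, A-Z).
BASE62 = [*'0'..'9', *'a'..'z', *'A'..'Z'].freeze

def base62_encode(n)
  return '0' if n.zero?
  out = ''
  while n.positive?
    out.prepend(BASE62[n % 62])
    n /= 62
  end
  out
end

base62_encode(125) # => "21" (2 * 62 + 1)
```

&lt;p&gt;Seven Base62 characters give 62^7 ≈ 3.5 trillion combinations, which is why 7-character codes are usually enough for a URL shortener.&lt;/p&gt;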

&lt;h3&gt;
  
  
  Why It’s Optimal
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Uniqueness&lt;/strong&gt;: Since the counter increases sequentially, every value is guaranteed to be unique, which eliminates the risk of collisions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Short Length&lt;/strong&gt;: With Base62 encoding, we can generate short strings that are much smaller than the original counter value, making the URLs compact and easy to share.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scalability&lt;/strong&gt;: The counter-based approach is scalable and performs well, even with large numbers of URLs, since it's a simple increment and encoding operation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No Need for a Unique Index&lt;/strong&gt;: Unlike the hash-based approach, we don’t need to rely on a unique index in the database, as the counter ensures uniqueness on its own.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The Problem with Counters in Scalable Applications
&lt;/h3&gt;

&lt;p&gt;When scaling the application with multiple instances (e.g., 3 or more), managing the counter across instances can become problematic. If the counter logic is handled by one instance, that instance becomes a single point of failure. If it goes down, the entire counter mechanism fails, which disrupts the generation of short URLs.&lt;/p&gt;

&lt;p&gt;Even if each instance manages its own counter, there’s still a challenge. Once an instance exhausts its local counter range, it would need a mechanism to obtain the next range of counters. This leads to more complexity in coordination across instances.&lt;/p&gt;

&lt;p&gt;To avoid these issues, we need a global counter service responsible for managing the counter in a distributed and scalable manner. This ensures that all instances can safely and consistently generate unique short URLs, without risking collisions or relying on a single instance to manage the counter.&lt;/p&gt;

&lt;h3&gt;
  
  
  Ensuring Scalability with Distributed Systems
&lt;/h3&gt;

&lt;p&gt;In a distributed system where multiple instances of the URL shortener are running, it's critical to ensure that each instance generates unique short URLs without collisions. To achieve this, we rely on &lt;strong&gt;etcd&lt;/strong&gt; as a distributed coordination service.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is etcd?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;etcd&lt;/strong&gt; is a distributed, reliable key-value store used for coordinating configuration data across multiple servers or machines. It ensures strong consistency and provides a way to manage shared data between multiple instances of our application. In our case, etcd will manage the global counter used to generate unique short URLs.&lt;/p&gt;

&lt;p&gt;etcd plays a key role in ensuring that all instances of the URL shortener service are synchronized, so that each instance retrieves the correct counter range without collisions, even in a distributed, multi-instance environment.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Use a 3-Node etcd Cluster?
&lt;/h3&gt;

&lt;p&gt;To avoid creating a single point of failure, etcd is not run in standalone mode. Instead, we deploy a 3-node etcd cluster to ensure high availability and fault tolerance. With this setup, even if one node goes down, the other nodes will continue to manage the distributed counter, ensuring that the service remains functional.&lt;/p&gt;

&lt;p&gt;Running etcd in a 3-node cluster guarantees that our coordination service is highly available and resilient to failures. Each etcd node shares the responsibility of managing the counters, and they work together using the Raft consensus algorithm to ensure consistency across all nodes.&lt;/p&gt;

&lt;h3&gt;
  
  
  How etcd Coordinates Between Machines
&lt;/h3&gt;

&lt;p&gt;In our URL shortener service, etcd acts as a coordination service between the different machines or servers. Here’s how it works:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Tracking Active Machines&lt;/strong&gt;: etcd maintains a list of machines (or instances) that are currently active. Each instance communicates with etcd to register itself and retrieve the range of counters it is responsible for.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Assigning Counter Ranges&lt;/strong&gt;: etcd keeps track of the last counter that was used across all instances. When a new instance is added to the system (e.g., for scalability), it talks to etcd and receives a new, unallocated range of counters that it can use to generate short URLs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Handling Counter Exhaustion&lt;/strong&gt;: If an instance exhausts its current counter range, it communicates with etcd again to request the next available counter range. This ensures that every instance is always generating unique short URLs, even when it runs out of its initial counter range.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By coordinating with etcd, all instances of the service can generate unique short URLs without needing to worry about collisions or stale counters. etcd ensures that only one instance uses a specific range of counters at a time, making the system scalable and resilient.&lt;/p&gt;

&lt;p&gt;Here is a &lt;a href="https://github.com/taman9333/scalable_url_shortener/blob/master/docker-compose.yml#L43-L80" rel="noopener noreferrer"&gt;link&lt;/a&gt; to the Docker Compose configuration in which we set up a 3-node etcd cluster for high availability.&lt;/p&gt;

&lt;h3&gt;
  
  
  Example: Using a Counter with Base62
&lt;/h3&gt;

&lt;p&gt;Let’s say the counter starts at &lt;code&gt;1&lt;/code&gt; and increments by &lt;code&gt;1&lt;/code&gt; with every new URL. The counter values will be encoded into Base62 like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Counter: &lt;code&gt;1&lt;/code&gt; → Base62: &lt;code&gt;1&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Counter: &lt;code&gt;62&lt;/code&gt; → Base62: &lt;code&gt;10&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Counter: &lt;code&gt;3844&lt;/code&gt; → Base62: &lt;code&gt;100&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each counter value generates a unique, encoded string that we can use as the short URL.&lt;/p&gt;

&lt;p&gt;Here’s a simplified version of how the encoding would work in Ruby:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Base62&lt;/span&gt;
  &lt;span class="no"&gt;CHARSET&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;chars&lt;/span&gt;
  &lt;span class="no"&gt;BASE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;CHARSET&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;size&lt;/span&gt;

  &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nc"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;num&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="no"&gt;CHARSET&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;num&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;num&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;zero?&lt;/span&gt;
    &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="no"&gt;ArgumentError&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"Number must be non-negative"&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;num&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;negative?&lt;/span&gt;

    &lt;span class="n"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;""&lt;/span&gt;
    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="n"&gt;num&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
      &lt;span class="n"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;CHARSET&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;num&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="no"&gt;BASE&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;str&lt;/span&gt;
      &lt;span class="n"&gt;num&lt;/span&gt; &lt;span class="o"&gt;/=&lt;/span&gt; &lt;span class="no"&gt;BASE&lt;/span&gt;
    &lt;span class="k"&gt;end&lt;/span&gt;
    &lt;span class="n"&gt;str&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
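&lt;p&gt;The service itself never needs to decode a short code (it looks the code up in the database instead), but for completeness a matching decoder can be sketched as follows; &lt;code&gt;base62_decode&lt;/code&gt; is a hypothetical helper, not part of the repository:&lt;/p&gt;

```ruby
CHARSET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ".chars
BASE = CHARSET.size

# Inverse of the encode method above: folds each character of the
# short code back into the original counter value.
def base62_decode(str)
  str.chars.reduce(0) { |num, char| num * BASE + CHARSET.index(char) }
end

puts base62_decode("1")    # => 1
puts base62_decode("10")   # => 62
puts base62_decode("100")  # => 3844
```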



&lt;h3&gt;
  
  
  Getting the Next Counter via &lt;code&gt;CounterService&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;CounterService&lt;/code&gt; is responsible for managing the global counter, ensuring that each request for a new counter is handled in a thread-safe manner, even in a multi-threaded environment. When a new URL is shortened, &lt;code&gt;CounterService.get_next_counter&lt;/code&gt; is invoked to retrieve the next available counter.&lt;/p&gt;

&lt;p&gt;Here’s how &lt;a href="https://github.com/taman9333/scalable_url_shortener/blob/master/services/counter_service.rb#L17" rel="noopener noreferrer"&gt;CounterService.get_next_counter&lt;/a&gt; works:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Counter Initialization at Boot-Up&lt;/strong&gt;: The counter range is initialized once during the server boot-up in &lt;code&gt;config.ru&lt;/code&gt;. This ensures that the counter range is prepared and ready to handle requests as soon as the server is up and running.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Thread Safety with Mutex&lt;/strong&gt;: To handle concurrent requests, &lt;code&gt;get_next_counter&lt;/code&gt; uses a mutex to ensure that only one thread can modify the counter at a time. This prevents race conditions and ensures that the counter is incremented consistently, even in a multi-threaded environment.&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_next_counter&lt;/span&gt;
  &lt;span class="n"&gt;counter_mutex&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;synchronize&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
    &lt;span class="n"&gt;current_counter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;counter&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;current_counter&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;counter_range&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;last&lt;/span&gt;
      &lt;span class="nb"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;counter_range&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;get_counter_range&lt;/span&gt;
      &lt;span class="nb"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;counter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;counter_range&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;first&lt;/span&gt;
      &lt;span class="n"&gt;current_counter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;counter&lt;/span&gt;
    &lt;span class="k"&gt;end&lt;/span&gt;
    &lt;span class="nb"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;counter&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="n"&gt;current_counter&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
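&lt;p&gt;To see the mutex at work, here is a self-contained toy version of the same logic (&lt;code&gt;LocalCounter&lt;/code&gt; is a made-up name, and the etcd call is replaced by a local range allocator) that hands out unique counters to concurrent threads:&lt;/p&gt;

```ruby
# Toy stand-in for CounterService: the same mutex-guarded increment and
# range-refill logic, but ranges come from a local allocator, not etcd.
class LocalCounter
  RANGE_SIZE = 100

  def initialize
    @mutex = Mutex.new
    @next_range_start = 0
    @range = allocate_range
    @counter = @range.first
  end

  def get_next_counter
    @mutex.synchronize do
      if @counter >= @range.last
        @range = allocate_range
        @counter = @range.first
      end
      value = @counter
      @counter += 1
      value
    end
  end

  private

  # Stand-in for the etcd transaction: hands out consecutive ranges.
  def allocate_range
    start = @next_range_start
    @next_range_start += RANGE_SIZE
    (start...@next_range_start)
  end
end

service = LocalCounter.new
results = Queue.new # Queue is thread-safe, so collection needs no extra lock
threads = 8.times.map do
  Thread.new { 50.times { results << service.get_next_counter } }
end
threads.each(&:join)

values = Array.new(results.size) { results.pop }
puts values.uniq.size # 400 -- every counter handed out exactly once
```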



&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Counter Range Exhaustion&lt;/strong&gt;: Once the current counter reaches the end of the allocated range, &lt;code&gt;CounterService&lt;/code&gt; will request a new counter range by calling &lt;code&gt;get_counter_range&lt;/code&gt;. This ensures that a fresh range of counters is always available for the next requests.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Here’s the implementation of the &lt;a href="https://github.com/taman9333/scalable_url_shortener/blob/master/services/counter_service.rb#L38" rel="noopener noreferrer"&gt;get_counter_range&lt;/a&gt; method:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_counter_range&lt;/span&gt;
  &lt;span class="kp"&gt;loop&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
    &lt;span class="n"&gt;current_value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;ETCD_CLIENT&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="no"&gt;COUNTER_KEY&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;kvs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;first&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;value&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_i&lt;/span&gt;
    &lt;span class="n"&gt;new_value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;current_value&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="no"&gt;RANGE_SIZE&lt;/span&gt;
    &lt;span class="n"&gt;txn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;ETCD_CLIENT&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;transaction&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="n"&gt;txn&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;
      &lt;span class="n"&gt;txn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compare&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="n"&gt;txn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;value&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="no"&gt;COUNTER_KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;:equal&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;current_value&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_s&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
      &lt;span class="p"&gt;]&lt;/span&gt;
      &lt;span class="n"&gt;txn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;success&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="n"&gt;txn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;put&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="no"&gt;COUNTER_KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;new_value&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_s&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;end&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;txn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;succeeded&lt;/span&gt;
      &lt;span class="nb"&gt;puts&lt;/span&gt; &lt;span class="s2"&gt;"Instance &lt;/span&gt;&lt;span class="si"&gt;#{&lt;/span&gt;&lt;span class="no"&gt;ENV&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'SERVICE_NAME'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; obtained counter range &lt;/span&gt;&lt;span class="si"&gt;#{&lt;/span&gt;&lt;span class="n"&gt;current_value&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; to &lt;/span&gt;&lt;span class="si"&gt;#{&lt;/span&gt;&lt;span class="n"&gt;new_value&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
      &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;current_value&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="n"&gt;new_value&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;end&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
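&lt;p&gt;The retry loop is easiest to see with etcd swapped out for an in-memory key-value store. &lt;code&gt;FakeKV&lt;/code&gt; below is an illustrative stand-in whose &lt;code&gt;compare_and_set&lt;/code&gt; mimics the &lt;code&gt;txn.compare&lt;/code&gt;/&lt;code&gt;txn.put&lt;/code&gt; transaction:&lt;/p&gt;

```ruby
# In-memory stand-in for etcd: a get plus an atomic compare-and-set,
# the two primitives get_counter_range relies on.
class FakeKV
  def initialize
    @mutex = Mutex.new
    @store = Hash.new(0) # unset keys read as 0, like a fresh counter
  end

  def get(key)
    @mutex.synchronize { @store[key] }
  end

  # Succeeds only if the key still holds the expected value.
  def compare_and_set(key, expected, new_value)
    @mutex.synchronize do
      return false unless @store[key] == expected
      @store[key] = new_value
      true
    end
  end
end

RANGE_SIZE = 1000

def get_counter_range(kv)
  loop do
    current = kv.get('counter')
    new_value = current + RANGE_SIZE
    # On conflict (another "instance" moved the counter first), retry
    # with the freshly fetched value -- exactly like the etcd loop.
    return (current...new_value) if kv.compare_and_set('counter', current, new_value)
  end
end

kv = FakeKV.new
puts get_counter_range(kv) # 0...1000
puts get_counter_range(kv) # 1000...2000
```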



&lt;h3&gt;
  
  
  Here’s how the &lt;code&gt;get_counter_range&lt;/code&gt; method works:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Fetching the Current Counter Value&lt;/strong&gt;: The method begins by retrieving the current counter value from etcd, using &lt;code&gt;ETCD_CLIENT.get(COUNTER_KEY).kvs.first&amp;amp;.value.to_i&lt;/code&gt;. This fetches the most up-to-date value of the global counter from the etcd cluster.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Calculating the New Counter Range&lt;/strong&gt;: The new range is calculated by adding a fixed &lt;code&gt;RANGE_SIZE&lt;/code&gt; to the &lt;code&gt;current_value&lt;/code&gt;. This ensures that the instance requesting the range will handle a specific block of counters.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Distributed Counter Allocation&lt;/strong&gt;: The method employs &lt;strong&gt;etcd’s transactional operations&lt;/strong&gt; to ensure safe and consistent counter updates across multiple instances:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Atomic Update&lt;/strong&gt;: The &lt;code&gt;txn.compare&lt;/code&gt; block checks if the current value in etcd is still the same as the &lt;code&gt;current_value&lt;/code&gt; retrieved earlier. This is to make sure that no other instance has updated the counter in the meantime.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Success Block&lt;/strong&gt;: If the comparison succeeds (i.e., no other instance has modified the counter), the transaction proceeds with updating the counter to &lt;code&gt;new_value&lt;/code&gt; using &lt;code&gt;txn.put&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Consistency in a Distributed Environment&lt;/strong&gt;: This method ensures consistency in distributed environments where multiple server instances are running. By atomically checking and updating the counter using etcd's transaction mechanism, it guarantees that each instance gets its own unique range of counters without overlap or collisions.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Retry Mechanism&lt;/strong&gt;: If the transaction fails (which means another instance updated the counter), the loop retries by fetching the new current value from etcd. This ensures that the service will always get a valid counter range.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Logging&lt;/strong&gt;: A log message (&lt;code&gt;puts&lt;/code&gt;) is used to track which instance obtained a counter range, which can be helpful for debugging and monitoring.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By using this method, &lt;strong&gt;CounterService&lt;/strong&gt; can handle distributed counter allocation in a consistent, safe manner, ensuring that each instance of the URL shortener service always has a unique counter range to generate short URLs, even in a multi-instance, distributed environment.&lt;/p&gt;

&lt;h3&gt;
  
  
  Use Cases Supported by &lt;code&gt;CounterService&lt;/code&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Unique Counter Generation&lt;/strong&gt;: &lt;code&gt;CounterService&lt;/code&gt; guarantees the generation of a unique counter value for every request. By using a distributed counter system (e.g., with etcd or another source), it ensures that no two instances of the service generate the same counter value.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Thread-Safe Operations&lt;/strong&gt;: The use of a mutex ensures that multiple threads in a multi-threaded web server (e.g., Puma) can safely access and modify the counter without causing race conditions. This is critical for ensuring the integrity and uniqueness of counter values in a concurrent environment.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Range-Based Counter Allocation&lt;/strong&gt;: To improve performance, &lt;code&gt;CounterService&lt;/code&gt; works with ranges of counters. It keeps track of a counter range, incrementing the current counter within that range. When the range is exhausted, it requests a new range, minimizing the need to repeatedly fetch individual counter values from an external source.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Distributed Counter Coordination&lt;/strong&gt;: In distributed environments, where multiple instances of the URL shortener are running, &lt;code&gt;CounterService&lt;/code&gt; can work with a coordination service like etcd. This allows each instance to request a unique range of counters, ensuring there are no overlaps or collisions across instances.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;CounterService&lt;/code&gt; is a crucial component of our URL shortener system, handling the management and allocation of counters in a way that is both thread-safe and scalable. By utilizing a distributed counter system and supporting range-based allocation, it ensures the generation of unique, incremented counters for every URL shortening request, even in a multi-threaded or distributed environment.&lt;/p&gt;

&lt;h3&gt;
  
  
  Explore the Source Code
&lt;/h3&gt;

&lt;p&gt;If you want to explore the code or run two instances of the URL shortener service locally, you can find the full source code in my GitHub repository:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/taman9333/scalable_url_shortener" rel="noopener noreferrer"&gt;scalable_url_shortener&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The repository includes everything you need, including the Docker Compose configuration to set up and run multiple instances of the service, a 3-node etcd cluster, and MongoDB.&lt;/p&gt;

&lt;p&gt;Feel free to clone the repository and experiment with running the service locally!&lt;/p&gt;

</description>
      <category>systemdesign</category>
      <category>distributedsystems</category>
      <category>ruby</category>
      <category>urlshortener</category>
    </item>
    <item>
      <title>Scalable Url Shortener Part1</title>
      <dc:creator>taman9333</dc:creator>
      <pubDate>Mon, 26 Aug 2024 23:59:02 +0000</pubDate>
      <link>https://dev.to/taman9333/scalable-url-shortener-part1-5f5n</link>
      <guid>https://dev.to/taman9333/scalable-url-shortener-part1-5f5n</guid>
      <description>&lt;h3&gt;
  
  
  Building a Scalable URL Shortener Service That Can Handle Billions of Requests
&lt;/h3&gt;

&lt;p&gt;In this article, we’ll explore how to implement a scalable URL shortener service step-by-step. To follow along with the code, you can check out the full implementation in the &lt;a href="https://github.com/taman9333/scalable_url_shortener" rel="noopener noreferrer"&gt;scalable_url_shortener repository&lt;/a&gt;. This repository contains all the necessary code, including setup instructions and configurations using Docker Compose. By the end of this article, you’ll have a clear understanding of how to build a robust and scalable URL shortener service that can handle billions of requests.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;URL shortening&lt;/strong&gt; is a technique that creates shorter versions of URLs, serving as aliases for longer ones. This is particularly useful for sharing links on platforms with character limits or for improving the aesthetics of links.&lt;br&gt;
&lt;br&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Functional Requirements
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;URL Shortening&lt;/strong&gt;: Our service should be able to shorten the URL provided by the user. The shortened URL code should consist of a mix of uppercase and lowercase alphabet characters (A-Z, a-z) and digits (0-9).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Redirection&lt;/strong&gt;: When a user accesses the shortened URL, they should be redirected to the original, full-length URL.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Non-Functional Requirements
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Low Latency&lt;/strong&gt;: The service must respond quickly to both URL shortening and redirection requests.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;High Availability&lt;/strong&gt;: The service should be reliably accessible at all times, ensuring users can shorten URLs and be redirected without interruption.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Strong Consistency&lt;/strong&gt;: Each unique long URL should generate a unique short URL, and there should never be a case where two different long URLs map to the same short URL.&lt;br&gt;
&lt;br&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Assumptions
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;The ratio of read (redirection) operations to write (URL shortening) operations is 100:1.&lt;/li&gt;
&lt;li&gt;The service is expected to generate approximately 100 million unique shortened URLs per month.

&lt;/li&gt;
&lt;/ul&gt;
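&lt;p&gt;These assumptions translate into a rough traffic estimate (a back-of-the-envelope sketch, assuming a 30-day month):&lt;/p&gt;

```ruby
# Rough request rates implied by the assumptions above.
urls_per_month    = 100_000_000
seconds_per_month = 30 * 24 * 3600 # 2_592_000

writes_per_sec = urls_per_month / seconds_per_month.to_f
reads_per_sec  = writes_per_sec * 100 # 100:1 read:write ratio

puts writes_per_sec.round # ~39 shortening requests/sec
puts reads_per_sec.round  # ~3858 redirects/sec
```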
&lt;h3&gt;
  
  
  Topics We Won’t Cover (But Don't Ignore in Production)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Caching&lt;/strong&gt;: While caching is essential for any scalable service to reduce database load and improve response times, it’s not unique to URL shorteners. Therefore, we will omit caching from this discussion.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Load Balancing&lt;/strong&gt;: Like caching, load balancing is a general technique used to distribute traffic across multiple servers. Although critical for scalability, it is not specific to URL shorteners, so we will not dig deep into load balancing here.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Rate Limiting&lt;/strong&gt;: Rate limiting is crucial to protect your service from abuse and to ensure fair usage across all users. By controlling the number of requests a user can make within a certain time period, you can prevent excessive load on your system and mitigate the risk of Denial of Service (DoS) attacks.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Stack We Are Going to Use
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Ruby&lt;/strong&gt;: Although the codebase introduced in this article is written in Ruby, you can use whatever language you prefer, as the key focus is on the algorithm we’re going to implement. We will use the Sinatra web framework since it’s a minimal framework, and our use case doesn’t require anything more complex.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;MongoDB&lt;/strong&gt;: We will use MongoDB as our NoSQL database, and we’ll explain why it’s a good fit for this service.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;etcd&lt;/strong&gt;: We’ll use &lt;strong&gt;etcd&lt;/strong&gt;, a distributed key-value store, to reliably manage data across a cluster of machines. It will help us generate unique short URLs in a distributed environment.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;h3&gt;
  
  
  API Design
&lt;/h3&gt;

&lt;p&gt;To shorten a URL, send the following request:&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;code&gt;POST /shorten&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;Request Body&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "url": "http://test.com"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Response&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{"short_url":"av32cd"}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;h3&gt;
  
  
  Database Choice
&lt;/h3&gt;

&lt;p&gt;Given that our servers will receive millions of requests, it's crucial to consider database scalability from the very beginning.&lt;/p&gt;

&lt;p&gt;For a service like URL shortening, the amount of data we need to store is relatively small. However, due to the high volume of &lt;strong&gt;read-heavy&lt;/strong&gt; traffic, our storage solution must be &lt;strong&gt;horizontally scalable&lt;/strong&gt; to handle the load efficiently and maintain low-latency responses as the service grows.&lt;/p&gt;

&lt;p&gt;Our data model is straightforward, with minimal need for complex joins beyond associating each shortened URL with a specific user. This simplicity makes &lt;strong&gt;NoSQL databases&lt;/strong&gt; a better fit for our requirements. While it's possible to use an SQL database, doing so would necessitate careful planning and the implementation of multiple read replicas to achieve the desired scalability and performance.&lt;/p&gt;

&lt;p&gt;We have chosen &lt;strong&gt;MongoDB&lt;/strong&gt; as our database solution because it scales more easily compared to traditional SQL databases. MongoDB's flexible schema and built-in support for horizontal scaling make it well-suited for handling large-scale, distributed applications like our URL shortener service.&lt;/p&gt;

&lt;p&gt;However, employing multiple read replicas, which is essential for scaling, introduces potential &lt;strong&gt;concurrency issues&lt;/strong&gt;. Specifically, we must ensure that once a short URL code is generated, it &lt;strong&gt;is not duplicated&lt;/strong&gt; by another request before the write operation has been propagated to all replicas. Addressing this challenge is critical to maintaining the uniqueness and integrity of our shortened URLs across a distributed system.&lt;/p&gt;



&lt;h3&gt;
  
  
  Implementation of Endpoints in Ruby
&lt;/h3&gt;

&lt;p&gt;Here is the implementation of the two primary endpoints for our application (you can find the actual code in this &lt;a href="https://github.com/taman9333/scalable_url_shortener/blob/master/server.rb#L10-L30" rel="noopener noreferrer"&gt;GitHub repository&lt;/a&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="n"&gt;post&lt;/span&gt; &lt;span class="s1"&gt;'/shorten'&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
  &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="ss"&gt;:url&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

  &lt;span class="n"&gt;halt&lt;/span&gt; &lt;span class="mi"&gt;400&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'URL is required'&lt;/span&gt; &lt;span class="k"&gt;unless&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;

  &lt;span class="n"&gt;counter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;CounterService&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_next_counter&lt;/span&gt;
  &lt;span class="n"&gt;short_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;Base62&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;counter&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

  &lt;span class="no"&gt;Url&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;short_url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

  &lt;span class="n"&gt;content_type&lt;/span&gt; &lt;span class="ss"&gt;:json&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="ss"&gt;short_url: &lt;/span&gt;&lt;span class="n"&gt;short_url&lt;/span&gt; &lt;span class="p"&gt;}.&lt;/span&gt;&lt;span class="nf"&gt;to_json&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;

&lt;span class="n"&gt;get&lt;/span&gt; &lt;span class="s1"&gt;'/:short_url'&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
  &lt;span class="n"&gt;short_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="ss"&gt;:short_url&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="n"&gt;url_doc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;Url&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find_by_short_url&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;short_url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

  &lt;span class="n"&gt;halt&lt;/span&gt; &lt;span class="mi"&gt;404&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'URL not found'&lt;/span&gt; &lt;span class="k"&gt;unless&lt;/span&gt; &lt;span class="n"&gt;url_doc&lt;/span&gt;
  &lt;span class="n"&gt;redirect&lt;/span&gt; &lt;span class="n"&gt;url_doc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="ss"&gt;:original_url&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As you can see, to generate a unique short URL, we are using &lt;code&gt;CounterService&lt;/code&gt; and &lt;code&gt;Base62&lt;/code&gt;. I will discuss these components in detail in the next article to keep this one concise.&lt;/p&gt;

&lt;p&gt;I hope you enjoyed this first article, and I look forward to seeing you in the next part of the series.&lt;/p&gt;

</description>
      <category>systemdesign</category>
      <category>distributedsystems</category>
      <category>ruby</category>
      <category>urlshortener</category>
    </item>
    <item>
      <title>Introducing Do Notation in the Mo Package for Golang</title>
      <dc:creator>taman9333</dc:creator>
      <pubDate>Thu, 27 Jun 2024 19:47:50 +0000</pubDate>
      <link>https://dev.to/taman9333/introducing-do-notation-in-the-mo-package-for-golang-1jpc</link>
      <guid>https://dev.to/taman9333/introducing-do-notation-in-the-mo-package-for-golang-1jpc</guid>
      <description>&lt;h3&gt;
  
  
  What is Do Notation?
&lt;/h3&gt;

&lt;p&gt;Do notation is a syntactic sugar primarily used in functional programming languages like Haskell and Scala. It simplifies the chaining of monadic operations, making the code more readable and maintainable. By bringing this feature to Go, we can now write cleaner, more expressive code when working with monads.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Do Notation?
&lt;/h3&gt;

&lt;p&gt;When dealing with monads, especially in complex business logic, chaining operations can become cumbersome. Error handling and managing different states often lead to deeply nested structures that are hard to follow. Do notation addresses this by allowing us to write monadic operations in a sequential style, akin to imperative programming, but with all the benefits of functional programming.&lt;/p&gt;

&lt;h3&gt;
  
  
  How Does It Work in the Mo Package?
&lt;/h3&gt;

&lt;p&gt;In Go, implementing do notation wasn't straightforward, but I managed to achieve it with the &lt;code&gt;Do&lt;/code&gt; function. Here's a quick example of how to use it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;package&lt;/span&gt; &lt;span class="n"&gt;main&lt;/span&gt;

&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s"&gt;"errors"&lt;/span&gt;
    &lt;span class="s"&gt;"fmt"&lt;/span&gt;
    &lt;span class="s"&gt;"github.com/samber/mo"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;validateBooking&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt; &lt;span class="k"&gt;map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;mo&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="k"&gt;map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"guest"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"roomType"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;mo&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Ok&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;mo&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Err&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="k"&gt;map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;](&lt;/span&gt;&lt;span class="n"&gt;errors&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;New&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"validation failed"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;createBooking&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;guest&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;mo&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;guest&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;mo&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Ok&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Booking Created for: "&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;guest&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;mo&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Err&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;](&lt;/span&gt;&lt;span class="n"&gt;errors&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;New&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"booking creation failed"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;assignRoom&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;booking&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;roomType&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;mo&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;roomType&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;mo&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Ok&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Room Assigned: "&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;roomType&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="s"&gt;" for "&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;booking&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;mo&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Err&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;](&lt;/span&gt;&lt;span class="n"&gt;errors&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;New&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"room assignment failed"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c"&gt;// This could be a service package that performs the entire process&lt;/span&gt;
&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;bookRoom&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt; &lt;span class="k"&gt;map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;mo&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Result&lt;/span&gt;&lt;span class="p"&gt;[[]&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;mo&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Do&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="c"&gt;// Validate booking parameters&lt;/span&gt;
        &lt;span class="n"&gt;values&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;validateBooking&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;MustGet&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

        &lt;span class="c"&gt;// Create booking&lt;/span&gt;
        &lt;span class="n"&gt;booking&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;createBooking&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"guest"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;MustGet&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

        &lt;span class="c"&gt;// Assign room&lt;/span&gt;
        &lt;span class="n"&gt;room&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;assignRoom&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;booking&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"roomType"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;MustGet&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

        &lt;span class="c"&gt;// Return success with booking and room details&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;booking&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;room&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;params&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="k"&gt;map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="s"&gt;"guest"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;   &lt;span class="s"&gt;"Foo"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s"&gt;"roomType"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"Suite"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;bookRoom&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;IsError&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Println&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Error:"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Println&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Success:"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;MustGet&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this example, &lt;code&gt;bookRoom&lt;/code&gt; uses the &lt;code&gt;Do&lt;/code&gt; function to perform several operations in sequence: validating the booking parameters, creating the booking, and assigning a room. Each step returns a &lt;code&gt;Result&lt;/code&gt; that &lt;code&gt;Do&lt;/code&gt; chains seamlessly, keeping the error handling clean and readable.&lt;/p&gt;

&lt;h2&gt;
  
  
  Comparison of bookRoom Function
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Without Do-Notation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;There are two options:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Using bind (if implemented):&lt;/strong&gt;&lt;br&gt;
Chaining many monadic operations with &lt;code&gt;bind&lt;/code&gt; can resemble callback hell: each step nests inside the previous one, so the code grows deeper and harder to read with every operation, much like deeply nested callbacks in asynchronous programming. If &lt;code&gt;bind&lt;/code&gt; were implemented in the Mo package, using it in this example would look something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;bookRoom&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt; &lt;span class="k"&gt;map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;mo&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Result&lt;/span&gt;&lt;span class="p"&gt;[[]&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;bind&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;validateBooking&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt; &lt;span class="k"&gt;map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;mo&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Result&lt;/span&gt;&lt;span class="p"&gt;[[]&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;bind&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;createBooking&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"guest"&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;booking&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;mo&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Result&lt;/span&gt;&lt;span class="p"&gt;[[]&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;bind&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;assignRoom&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;booking&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"roomType"&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;room&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;mo&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Result&lt;/span&gt;&lt;span class="p"&gt;[[]&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;mo&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Ok&lt;/span&gt;&lt;span class="p"&gt;([]&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;booking&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;room&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
            &lt;span class="p"&gt;})&lt;/span&gt;
        &lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This approach quickly becomes hard to read and maintain.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Using .Get():&lt;/strong&gt;&lt;br&gt;
Another option is to call &lt;code&gt;.Get()&lt;/code&gt; to unwrap the monad into its underlying value and error. This reads like typical Go code, but the error handling is verbose:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;bookRoom&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt; &lt;span class="k"&gt;map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;mo&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Result&lt;/span&gt;&lt;span class="p"&gt;[[]&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;validateBooking&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Get&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;mo&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Err&lt;/span&gt;&lt;span class="p"&gt;[[]&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;](&lt;/span&gt;&lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;booking&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;createBooking&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"guest"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Get&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;mo&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Err&lt;/span&gt;&lt;span class="p"&gt;[[]&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;](&lt;/span&gt;&lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;room&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;assignRoom&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;booking&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"roomType"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Get&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;mo&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Err&lt;/span&gt;&lt;span class="p"&gt;[[]&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;](&lt;/span&gt;&lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;mo&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Ok&lt;/span&gt;&lt;span class="p"&gt;([]&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;booking&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;room&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This approach is more readable than using bind, but still involves a lot of boilerplate error handling.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;With Do-Notation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;With do notation, you call &lt;code&gt;.MustGet()&lt;/code&gt; to retrieve the underlying value directly. &lt;code&gt;MustGet()&lt;/code&gt; panics if the monad holds an error; do notation handles that panic and short-circuits execution, returning the error, or else returns the unwrapped value:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;bookRoom&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt; &lt;span class="k"&gt;map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;mo&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Result&lt;/span&gt;&lt;span class="p"&gt;[[]&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;mo&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Do&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;values&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;validateBooking&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;MustGet&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;booking&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;createBooking&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"guest"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;MustGet&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;room&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;assignRoom&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;booking&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"roomType"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;MustGet&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;booking&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;room&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This approach is clean, concise, and easy to read, significantly reducing boilerplate error handling code.&lt;/p&gt;




&lt;h3&gt;
  
  
  Final Thoughts
&lt;/h3&gt;

&lt;p&gt;One of the great advantages of do notation is that you don't have to check for errors after every monadic operation. Even though each monad can carry an error, do notation automatically handles error propagation and short-circuits execution when an error occurs. This leads to cleaner, more maintainable code, which is particularly valuable in complex workflows.&lt;/p&gt;

</description>
      <category>go</category>
      <category>monads</category>
    </item>
  </channel>
</rss>
