<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Kanaga abishek</title>
    <description>The latest articles on DEV Community by Kanaga abishek (@abishek2981).</description>
    <link>https://dev.to/abishek2981</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3936917%2F6fdb8881-b4e3-4239-8921-da02cb39318d.jpg</url>
      <title>DEV Community: Kanaga abishek</title>
      <link>https://dev.to/abishek2981</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/abishek2981"/>
    <language>en</language>
    <item>
      <title>I built a distributed tracing system from scratch — here's what I learned about Cassandra, gRPC, and critical path analysis</title>
      <dc:creator>Kanaga abishek</dc:creator>
      <pubDate>Mon, 18 May 2026 00:32:40 +0000</pubDate>
      <link>https://dev.to/abishek2981/i-built-a-distributed-tracing-system-from-scratch-heres-what-i-learned-about-cassandra-grpc-4e0l</link>
      <guid>https://dev.to/abishek2981/i-built-a-distributed-tracing-system-from-scratch-heres-what-i-learned-about-cassandra-grpc-4e0l</guid>
      <description>&lt;p&gt;A few months ago I was freelancing on a client project. Every API call was slow. Some were taking 800ms, some 1.2 seconds — but nobody could pinpoint why.&lt;/p&gt;

&lt;p&gt;The codebase touched 6 services. Debugging meant manually correlating logs across all of them, file by file, hoping to find where time was being lost. It took hours per incident.&lt;/p&gt;

&lt;p&gt;That experience made me ask a question I couldn't let go of:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Is there a way to get a blueprint of every function, service, and database call that happens for a single API request — automatically?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That led me to Jaeger, Zipkin, and the OpenTelemetry protocol. I was genuinely impressed that someone had built a system that could trace the entire path of an API call and tell you exactly where latency occurs.&lt;/p&gt;

&lt;p&gt;Then I did what any curious engineer would do — I decided to build my own.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Built
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Lumen&lt;/strong&gt; — a self-hosted distributed tracing system that collects, stores, and analyzes traces from your microservices using OpenTelemetry.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fubo0u5lm4ldvpgu3thuh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fubo0u5lm4ldvpgu3thuh.png" alt="Architecture Diagram" width="800" height="547"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;One command to run everything:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker-compose up
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Point any OpenTelemetry SDK at port 9090 and Lumen starts receiving traces immediately. No account. No API key. Your trace data never leaves your infrastructure.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Build It When Jaeger Already Exists?
&lt;/h2&gt;

&lt;p&gt;Jaeger is excellent. I'm not competing with it.&lt;/p&gt;

&lt;p&gt;The goal was to understand how distributed tracing works under the hood — not just use it. Building forces you to answer questions that using never asks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Why does the storage schema need two tables instead of one?&lt;/li&gt;
&lt;li&gt;What happens when gRPC threads block waiting for Cassandra?&lt;/li&gt;
&lt;li&gt;How do you correctly calculate how much time a span spent on its own work when its children ran in parallel?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Every one of those questions led to a design decision I can now explain from first principles. That's the whole point.&lt;/p&gt;




&lt;h2&gt;
  
  
  Three Things I Learned Building It
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Cassandra Schema Design Is Query Design
&lt;/h3&gt;

&lt;p&gt;In a relational database you design a schema and add indexes for the queries you need. In Cassandra it's the opposite — you design a table for each query pattern.&lt;/p&gt;

&lt;p&gt;Lumen needs two queries:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"Give me all spans for trace ID X" → partition by &lt;code&gt;trace_id&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;"List recent traces for service checkout" → partition by &lt;code&gt;(service_name, hour_bucket)&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I can't use one table for both. A secondary index on &lt;code&gt;service_name&lt;/code&gt; in the spans table causes Cassandra to ask every node in the cluster whether it has matching rows — a scatter-gather query that gets slower as you add nodes. The opposite of what you want.&lt;/p&gt;

&lt;p&gt;So I built two tables. One per access pattern.&lt;/p&gt;

&lt;p&gt;I also hit the hot partition problem. Partitioning the trace index by &lt;code&gt;service_name&lt;/code&gt; alone means every checkout-service trace lands in a single, ever-growing partition on one Cassandra node (and its replicas). The fix is time bucketing: partition by &lt;code&gt;(service_name, hour_bucket)&lt;/code&gt;. 30 days of traces for a service then spread over 720 partitions (30 × 24) across the cluster instead of one overloaded node.&lt;/p&gt;
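
&lt;p&gt;As a rough illustration of the bucketing arithmetic, here is a minimal Java sketch. The class and method names (&lt;code&gt;HourBucket&lt;/code&gt;, &lt;code&gt;hourBucket&lt;/code&gt;, &lt;code&gt;bucketsInWindow&lt;/code&gt;) are hypothetical and not taken from Lumen's code, and it assumes the bucket is simply the number of whole hours since the Unix epoch.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;import java.time.Duration;
import java.time.Instant;

public class HourBucket {
    // Hypothetical bucket function: whole hours since the Unix epoch (UTC).
    static long hourBucket(Instant timestamp) {
        return Math.floorDiv(timestamp.getEpochSecond(), 3600L);
    }

    // Number of hourly partitions one service produces over a retention window.
    static long bucketsInWindow(Duration window) {
        return window.toHours();
    }

    public static void main(String[] args) {
        Instant a = Instant.parse("2026-05-18T00:32:40Z");
        Instant b = Instant.parse("2026-05-18T01:05:00Z");
        // Spans that start in consecutive hours land in adjacent buckets.
        System.out.println(hourBucket(b) - hourBucket(a));        // 1
        System.out.println(bucketsInWindow(Duration.ofDays(30))); // 720
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Under that assumption the partition key for a span would be the pair of its service name and &lt;code&gt;hourBucket(spanStart)&lt;/code&gt;, and a "recent traces" query only needs to read the newest few buckets.&lt;/p&gt;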

&lt;h3&gt;
  
  
  2. Naive Self-Time Calculation Produces Negative Numbers
&lt;/h3&gt;

&lt;p&gt;When you want to know how much time a span spent doing its own work — excluding time spent waiting on children — the naive approach is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;selfTime = span.duration - sum(child.duration)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is wrong when children overlap in time, because the overlapping interval is subtracted once per child. If two child spans run concurrently:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;parent  [0ms ─────────────── 100ms]
child-A [10ms ──── 50ms]           = 40ms
child-B [30ms ──────── 80ms]       = 50ms

Naive sum:          100 - (40+50) = 10ms
Correct (interval union):
  merge [10,50] and [30,80] → [10,80] = 70ms covered
  100 - 70 = 30ms self time
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The fix is interval union: merge overlapping child time ranges before subtracting. Without this, services that make parallel downstream calls get understated self-times, and the value goes negative as soon as the summed child durations exceed the parent's duration, which makes bottleneck detection meaningless.&lt;/p&gt;
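
&lt;p&gt;The interval-union calculation can be sketched in a few lines of Java. This is a minimal version under stated assumptions, not Lumen's actual implementation: the &lt;code&gt;SelfTime&lt;/code&gt; class is hypothetical, each child is represented as a two-element &lt;code&gt;{startMs, endMs}&lt;/code&gt; array, and child ranges are clamped to the parent's range.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;import java.util.Arrays;

public class SelfTime {
    // Returns parent duration minus the union of child intervals.
    // children[i] = {startMs, endMs}
    static long selfTime(long parentStart, long parentEnd, long[][] children) {
        long[][] sorted = children.clone();
        Arrays.sort(sorted, (a, b) -> Long.compare(a[0], b[0]));

        long covered = 0;
        long cursor = parentStart; // time before cursor is already accounted for
        for (long[] child : sorted) {
            long start = Math.max(child[0], cursor);
            long end = Math.min(child[1], parentEnd);
            if (end > start) {
                covered += end - start; // count only the part not seen yet
                cursor = end;
            }
        }
        return (parentEnd - parentStart) - covered;
    }

    public static void main(String[] args) {
        // The example above: child-A [10,50] and child-B [30,80] in a 100ms parent.
        System.out.println(selfTime(0, 100, new long[][] { {10, 50}, {30, 80} })); // 30
        // Two children that both span the whole parent: the naive formula gives -100.
        System.out.println(selfTime(0, 100, new long[][] { {0, 100}, {0, 100} })); // 0
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Sorting by start time and sweeping a cursor forward means each millisecond of child time is counted at most once, so the result can never drop below zero.&lt;/p&gt;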

&lt;h3&gt;
  
  
  3. gRPC Threads Should Never Touch the Database
&lt;/h3&gt;

&lt;p&gt;My first implementation had the gRPC handler write directly to Cassandra. At high volume this creates a bottleneck:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;10,000 spans/second × 10ms Cassandra write = 100 concurrent threads needed
gRPC thread pool default                   = ~20 threads
Result                                     = 80% of spans rejected
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The fix is a &lt;code&gt;LinkedBlockingQueue&lt;/code&gt; between the gRPC handler and Cassandra. The handler calls &lt;code&gt;offer()&lt;/code&gt;, which returns in microseconds whether the queue accepted the span or was full, and moves on. A background thread drains up to 500 spans every 100ms and writes them to Cassandra.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="c1"&gt;// gRPC thread — never blocks&lt;/span&gt;
&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;export&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;ExportTraceServiceRequest&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; 
                   &lt;span class="nc"&gt;StreamObserver&lt;/span&gt; &lt;span class="n"&gt;observer&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Span&lt;/span&gt; &lt;span class="n"&gt;span&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;extractSpans&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="kt"&gt;boolean&lt;/span&gt; &lt;span class="n"&gt;accepted&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ingestionQueue&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;offer&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;(!&lt;/span&gt;&lt;span class="n"&gt;accepted&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="n"&gt;droppedCount&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;incrementAndGet&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;observer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;onNext&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;successResponse&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
    &lt;span class="n"&gt;observer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;onCompleted&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// Background thread — runs every 100ms&lt;/span&gt;
&lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;writeLoop&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;running&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="nc"&gt;List&lt;/span&gt; &lt;span class="n"&gt;batch&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ArrayList&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;gt;();&lt;/span&gt;
        &lt;span class="n"&gt;queue&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;drainTo&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;(!&lt;/span&gt;&lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;isEmpty&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;forEach&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nl"&gt;repository:&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;save&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="o"&gt;}&lt;/span&gt;
        &lt;span class="nc"&gt;Thread&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;sleep&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



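&lt;p&gt;gRPC threads never block on Cassandra. Ingestion latency is decoupled from write latency: if writes fall behind and the queue fills, &lt;code&gt;offer()&lt;/code&gt; returns &lt;code&gt;false&lt;/code&gt; and the span is counted as dropped instead of stalling the handler.&lt;/p&gt;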
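
&lt;p&gt;A quick sanity check on the numbers in this section, as a minimal Java sketch. The &lt;code&gt;DrainCapacity&lt;/code&gt; class and its method names are hypothetical; it assumes each span holds a thread for the full write in the direct-write design, and it ignores write time when estimating the drain ceiling of the background loop.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;public class DrainCapacity {
    // Threads kept busy when every span blocks a thread for writeMs.
    static long threadsNeeded(long spansPerSecond, long writeMs) {
        return spansPerSecond * writeMs / 1000L;
    }

    // Upper bound on spans drained per second by a loop that takes
    // up to batchSize spans and then sleeps intervalMs.
    static long maxDrainPerSecond(int batchSize, long intervalMs) {
        return batchSize * 1000L / intervalMs;
    }

    public static void main(String[] args) {
        System.out.println(threadsNeeded(10_000, 10));   // 100
        System.out.println(maxDrainPerSecond(500, 100)); // 5000
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;With a batch of 500 and a 100ms sleep the loop tops out at 5,000 spans per second, so a sustained 10,000 spans per second would eventually fill the queue and start dropping spans. The batch size and interval are the knobs to tune against the expected load.&lt;/p&gt;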




&lt;h2&gt;
  
  
  What Lumen Actually Shows You
&lt;/h2&gt;

&lt;p&gt;Here's a real trace from a simulated checkout service:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm2lde30818sssx4ox1si.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm2lde30818sssx4ox1si.png" alt="User Interface" width="800" height="456"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsn399zr8wrg5yi0fbm1z.png" alt="Trace Detail" width="800" height="1169"&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Connect Your App in 30 Seconds
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Java / Spring Boot&lt;/strong&gt; — zero code changes required:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;java &lt;span class="nt"&gt;-javaagent&lt;/span&gt;:opentelemetry-javaagent.jar &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-Dotel&lt;/span&gt;.service.name&lt;span class="o"&gt;=&lt;/span&gt;my-service &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-Dotel&lt;/span&gt;.exporter.otlp.endpoint&lt;span class="o"&gt;=&lt;/span&gt;http://localhost:9090 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-Dotel&lt;/span&gt;.traces.exporter&lt;span class="o"&gt;=&lt;/span&gt;otlp &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-jar&lt;/span&gt; your-app.jar
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Docs: &lt;a href="https://github.com/kanagaabishek/lumen/blob/master/docs/integrations/java.md" rel="noopener noreferrer"&gt;github.com/kanagaabishek/lumen/blob/master/docs/integrations/java.md&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Node.js:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;OTLPTraceExporter&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@opentelemetry/exporter-trace-otlp-grpc&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;exporter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;OTLPTraceExporter&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;http://localhost:9090&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;credentials&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;credentials&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createInsecure&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Docs: &lt;a href="https://github.com/kanagaabishek/lumen/blob/master/docs/integrations/node.md" rel="noopener noreferrer"&gt;github.com/kanagaabishek/lumen/blob/master/docs/integrations/node.md&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Python:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;OTEL_SERVICE_NAME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;my-service &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nv"&gt;OTEL_EXPORTER_OTLP_ENDPOINT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;http://localhost:9090 &lt;span class="se"&gt;\&lt;/span&gt;
opentelemetry-instrument python app.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Docs: &lt;a href="https://github.com/kanagaabishek/lumen/blob/master/docs/integrations/python.md" rel="noopener noreferrer"&gt;github.com/kanagaabishek/lumen/blob/master/docs/integrations/python.md&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tail-based sampling&lt;/strong&gt; — store only slow or errored traces, not 100% of everything&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Kafka-backed ingestion&lt;/strong&gt; — horizontal scaling of the write path across multiple Lumen instances&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Service dependency graph&lt;/strong&gt; — visualize which services call which, with average latency on each edge&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Alert rules&lt;/strong&gt; — notify when p99 latency for a service exceeds a threshold&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/kanagaabishek/lumen
&lt;span class="nb"&gt;cd &lt;/span&gt;lumen
docker-compose up
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Open &lt;code&gt;http://localhost:8080&lt;/code&gt;. Select a service. Click a trace.&lt;/p&gt;

&lt;p&gt;I'd love feedback on the engineering decisions. What would you change about the architecture? I'm especially curious whether anyone has built something similar and made different tradeoffs on the storage layer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub: &lt;a href="https://github.com/kanagaabishek/lumen" rel="noopener noreferrer"&gt;github.com/kanagaabishek/lumen&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>distributedsystems</category>
      <category>java</category>
      <category>opentelemetry</category>
      <category>cassandra</category>
    </item>
  </channel>
</rss>
