I built a distributed tracing system from scratch — here's what I learned about Cassandra, gRPC, and critical path analysis

Kanaga abishek — Mon, 18 May 2026 00:32:40 +0000

A few months ago I was freelancing on a client project. Every API call was slow. Some were taking 800ms, some 1.2 seconds — but nobody could pinpoint why.

The codebase touched 6 services. Debugging meant manually correlating logs across all of them, file by file, hoping to find where time was being lost. It took hours per incident.

That experience made me ask a question I couldn't let go of:

Is there a way to get a blueprint of every function, service, and database call that happens for a single API request — automatically?

That led me to Jaeger, Zipkin, and the OpenTelemetry protocol. I was genuinely impressed that someone had built a system that could trace the entire path of an API call and tell you exactly where latency occurs.

Then I did what any curious engineer would do — I decided to build my own.

What I Built

Lumen — a self-hosted distributed tracing system that collects, stores, and analyzes traces from your microservices using OpenTelemetry.

One command to run everything:

docker-compose up

Point any OpenTelemetry SDK at port 9090 and Lumen starts receiving traces immediately. No account. No API key. Your trace data never leaves your infrastructure.

Why Build It When Jaeger Already Exists?

Jaeger is excellent. I'm not competing with it.

The goal was to understand how distributed tracing works under the hood — not just use it. Building forces you to answer questions that using never asks:

Why does the storage schema need two tables instead of one?
What happens when gRPC threads block waiting for Cassandra?
How do you correctly calculate how much time a span spent on its own work when its children ran in parallel?

Every one of those questions led to a design decision I can now explain from first principles. That's the whole point.

Three Things I Learned Building It

1. Cassandra Schema Design Is Query Design

In a relational database you design a schema and add indexes for the queries you need. In Cassandra it's the opposite — you design a table for each query pattern.

Lumen needs two queries:

"Give me all spans for trace ID X" → partition by trace_id
"List recent traces for service checkout" → partition by (service_name, hour_bucket)

I can't use one table for both. A secondary index on service_name in the spans table causes Cassandra to ask every node in the cluster whether it has matching rows — a scatter-gather query that gets slower as you add nodes. The opposite of what you want.

So I built two tables. One per access pattern.

I also hit the hot partition problem. Partitioning the trace index by service_name alone means all checkout-service traces land in a single partition, and therefore on the same replica nodes, and that partition keeps growing. The fix is time bucketing: partition by (service_name, hour_bucket). 30 days of traces for one service become 720 partitions (30 × 24) spread across the cluster instead of one overloaded node.
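A minimal Java sketch of the bucketing described above, under stated assumptions: the table and column names in the two CQL strings are illustrative, not Lumen's actual schema, and hour_bucket is taken to be whole hours since the Unix epoch. The helper class TraceIndexBuckets is hypothetical. It also makes the cost of bucketing visible: listing recent traces for a service means querying one partition per hour in the requested range.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

final class TraceIndexBuckets {
    // Assumed schema, one table per access pattern (names are illustrative).
    static final String SPANS_BY_TRACE =
        "CREATE TABLE IF NOT EXISTS spans_by_trace ("
        + " trace_id text, span_id text, parent_span_id text, service_name text,"
        + " start_time timestamp, duration_us bigint,"
        + " PRIMARY KEY ((trace_id), span_id))";

    static final String TRACES_BY_SERVICE =
        "CREATE TABLE IF NOT EXISTS traces_by_service ("
        + " service_name text, hour_bucket bigint, start_time timestamp, trace_id text,"
        + " PRIMARY KEY ((service_name, hour_bucket), start_time, trace_id))"
        + " WITH CLUSTERING ORDER BY (start_time DESC, trace_id ASC)";

    private TraceIndexBuckets() {}

    /** Whole hours since the Unix epoch; used as the second partition key component. */
    static long hourBucket(Instant t) {
        return Math.floorDiv(t.getEpochSecond(), 3600L);
    }

    /**
     * Hour buckets covering [fromInclusive, toExclusive), newest first.
     * A "recent traces" query issues one partition read per returned bucket.
     */
    static List<Long> bucketsBetween(Instant fromInclusive, Instant toExclusive) {
        List<Long> buckets = new ArrayList<>();
        if (!toExclusive.isAfter(fromInclusive)) {
            return buckets; // empty or inverted range
        }
        long first = hourBucket(fromInclusive);
        long last = hourBucket(toExclusive.minusNanos(1));
        for (long b = last; b >= first; b--) {
            buckets.add(b);
        }
        return buckets;
    }
}
```

For a 30-day window that starts on an hour boundary, bucketsBetween returns 720 buckets, which matches the partition count above. In practice a reader would walk the buckets newest first and stop once it has enough traces to fill a page.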

2. Naive Self-Time Calculation Produces Negative Numbers

When you want to know how much time a span spent doing its own work — excluding time spent waiting on children — the naive approach is:

selfTime = span.duration - sum(child.duration)

This is wrong when children overlap in time. If two child spans run concurrently:

parent  [0ms ─────────────── 100ms]
child-A [10ms ──── 50ms]           = 40ms
child-B [30ms ──────── 80ms]       = 50ms

Naive sum:          100 - (40+50) = 10ms
Correct (interval union):
  merge [10,50] and [30,80] → [10,80] = 70ms covered
  100 - 70 = 30ms self time

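The fix is interval union: merge overlapping child time ranges before subtracting. Without it, self-time is undercounted whenever children overlap (10ms instead of 30ms above). Once the children's durations add up to more than the parent's duration, the naive result goes negative: two 70ms children running side by side under a 100ms parent give 100 - 140 = -40ms. For services that make parallel downstream calls, that makes bottleneck detection meaningless.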
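The interval union can be sketched in Java as follows. This is a minimal version under stated assumptions, not Lumen's actual code: the SelfTime and Interval names are hypothetical, times are plain millisecond offsets, and children are clipped to the parent's range so that a child outliving its parent cannot push the result below zero.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class SelfTime {
    /** Half-open time range [startMs, endMs) in milliseconds. */
    static final class Interval {
        final long startMs;
        final long endMs;

        Interval(long startMs, long endMs) {
            this.startMs = startMs;
            this.endMs = endMs;
        }
    }

    private SelfTime() {}

    /** Naive version: subtracts every child duration, double counting overlaps. */
    static long naiveSelfTimeMs(Interval parent, List<Interval> children) {
        long sum = 0;
        for (Interval c : children) {
            sum += c.endMs - c.startMs;
        }
        return (parent.endMs - parent.startMs) - sum;
    }

    /** Interval-union version: subtracts the time covered by at least one child. */
    static long selfTimeMs(Interval parent, List<Interval> children) {
        List<Interval> sorted = new ArrayList<>(children);
        sorted.sort(Comparator.comparingLong((Interval i) -> i.startMs));

        long covered = 0;
        long curStart = 0;
        long curEnd = 0;
        boolean open = false;
        for (Interval c : sorted) {
            // Assumption: clip each child to the parent's range.
            long s = Math.max(c.startMs, parent.startMs);
            long e = Math.min(c.endMs, parent.endMs);
            if (e <= s) {
                continue; // child lies entirely outside the parent
            }
            if (!open) {
                curStart = s;
                curEnd = e;
                open = true;
            } else if (s <= curEnd) {
                curEnd = Math.max(curEnd, e); // overlaps or touches: extend the merged range
            } else {
                covered += curEnd - curStart; // gap: close the merged range, start a new one
                curStart = s;
                curEnd = e;
            }
        }
        if (open) {
            covered += curEnd - curStart;
        }
        return (parent.endMs - parent.startMs) - covered;
    }
}
```

With the parent [0, 100] and children [10, 50] and [30, 80] from the diagram, naiveSelfTimeMs returns 10 and selfTimeMs returns 30. Sorting dominates the cost, so the calculation is O(n log n) in the number of children per span.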

3. gRPC Threads Should Never Touch the Database

My first implementation had the gRPC handler write directly to Cassandra. At high volume this creates a bottleneck:

10,000 spans/second × 10ms Cassandra write = 100 concurrent threads needed
gRPC thread pool default                   = ~20 threads
Capacity: 20 threads ÷ 10ms per write      = 2,000 spans/second
Result                                     = 80% of spans rejected

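The fix is a bounded LinkedBlockingQueue between the gRPC handler and Cassandra. The handler calls offer(), which returns in microseconds whether the queue accepts the span or is full and drops it, and moves on. A background thread drains up to 500 spans every 100ms and writes them to Cassandra. Those two numbers cap sustained writes at roughly 5,000 spans/second, so the batch size and interval have to be tuned to the expected load.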

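// gRPC thread: never blocks on Cassandra
public void export(ExportTraceServiceRequest request,
                   StreamObserver<ExportTraceServiceResponse> observer) {
    for (Span span : extractSpans(request)) {
        // offer() returns false immediately when the bounded queue is full
        boolean accepted = ingestionQueue.offer(span);
        if (!accepted) droppedCount.incrementAndGet();
    }
    observer.onNext(successResponse());
    observer.onCompleted();
}

// Background thread: drains up to 500 spans, then sleeps 100ms
private void writeLoop() {
    while (running) {
        List<Span> batch = new ArrayList<>();
        ingestionQueue.drainTo(batch, 500);
        if (!batch.isEmpty()) {
            batch.forEach(repository::save);
        }
        try {
            Thread.sleep(100);
        } catch (InterruptedException e) {
            // Restore the interrupt flag and stop the writer cleanly
            Thread.currentThread().interrupt();
            return;
        }
    }
}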

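gRPC threads never block on Cassandra. Ingestion is decoupled from write latency: when writes fall behind, the queue fills up and new spans are dropped and counted instead of stalling the handler.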
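The drop-on-full behavior is easy to check in isolation. Below is a small self-contained Java sketch of the same pattern; BoundedIngestBuffer is a hypothetical class written for illustration, not part of Lumen, and the sink stands in for whatever performs the Cassandra write.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Consumer;

final class BoundedIngestBuffer<T> {
    private final LinkedBlockingQueue<T> queue;
    private final AtomicLong dropped = new AtomicLong();

    BoundedIngestBuffer(int capacity) {
        // A capacity is required: an unbounded queue never rejects and can exhaust the heap.
        this.queue = new LinkedBlockingQueue<>(capacity);
    }

    /** Producer side (the gRPC handler): never blocks, counts drops when full. */
    boolean submit(T item) {
        boolean accepted = queue.offer(item);
        if (!accepted) {
            dropped.incrementAndGet();
        }
        return accepted;
    }

    /** Consumer side (the writer thread): hands up to maxBatch items to the sink. */
    int drainOnce(int maxBatch, Consumer<List<T>> sink) {
        List<T> batch = new ArrayList<>();
        queue.drainTo(batch, maxBatch);
        if (!batch.isEmpty()) {
            sink.accept(batch);
        }
        return batch.size();
    }

    long droppedCount() {
        return dropped.get();
    }

    int size() {
        return queue.size();
    }
}
```

With a capacity of 2, a third submit returns false and increments the drop counter, and the next drainOnce frees the space again. Exposing droppedCount as a metric is what makes this design safe to operate: a nonzero value means the writer is the bottleneck and the batch size, interval, or number of writer threads needs to change.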

What Lumen Actually Shows You

Here's a trace captured from a simulated checkout service:

[Image: trace from the simulated checkout service]

Connect Your App in 30 Seconds

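Java / Spring Boot (zero code changes required):

java -javaagent:opentelemetry-javaagent.jar \
  -Dotel.service.name=my-service \
  -Dotel.exporter.otlp.endpoint=http://localhost:9090 \
  -Dotel.exporter.otlp.protocol=grpc \
  -Dotel.traces.exporter=otlp \
  -jar your-app.jar

The protocol flag is there because version 2.x of the Java agent defaults to http/protobuf for OTLP. It assumes port 9090 speaks gRPC, as the ingestion handler above suggests.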

Docs: github.com/kanagaabishek/lumen/blob/master/docs/integrations/java.md

Node.js:

const { credentials } = require('@grpc/grpc-js');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-grpc');

const exporter = new OTLPTraceExporter({
    url: 'http://localhost:9090',
    credentials: credentials.createInsecure()
});

Docs: github.com/kanagaabishek/lumen/blob/master/docs/integrations/node.md

Python:

OTEL_SERVICE_NAME=my-service \
OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:9090 \
opentelemetry-instrument python app.py

Docs: github.com/kanagaabishek/lumen/blob/master/docs/integrations/python.md

What's Next

Tail-based sampling — store only slow or errored traces, not 100% of everything
Kafka-backed ingestion — horizontal scaling of the write path across multiple Lumen instances
Service dependency graph — visualize which services call which, with average latency on each edge
Alert rules — notify when p99 latency for a service exceeds a threshold

Try It

git clone https://github.com/kanagaabishek/lumen
cd lumen
docker-compose up

Open http://localhost:8080. Select a service. Click a trace.

I'd love feedback on the engineering decisions. What would you change about the architecture? I'm especially curious whether anyone has built something similar and made different tradeoffs on the storage layer.

GitHub: github.com/kanagaabishek/lumen

DEV Community: Kanaga abishek