Kanaga abishek

Posted on May 18

I built a distributed tracing system from scratch — here's what I learned about Cassandra, gRPC, and critical path analysis

#distributedsystems #java #opentelemetry #cassandra

A few months ago I was freelancing on a client project. Every API call was slow. Some were taking 800ms, some 1.2 seconds — but nobody could pinpoint why.

The codebase touched 6 services. Debugging meant manually correlating logs across all of them, file by file, hoping to find where time was being lost. It took hours per incident.

That experience made me ask a question I couldn't let go of:

Is there a way to get a blueprint of every function, service, and database call that happens for a single API request — automatically?

That led me to Jaeger, Zipkin, and the OpenTelemetry protocol. I was genuinely impressed that someone had built a system that could trace the entire path of an API call and tell you exactly where latency occurs.

Then I did what any curious engineer would do — I decided to build my own.

What I Built

Lumen — a self-hosted distributed tracing system that collects, stores, and analyzes traces from your microservices using OpenTelemetry.

One command to run everything:

docker-compose up

Point any OpenTelemetry SDK at port 9090 and Lumen starts receiving traces immediately. No account. No API key. Your trace data never leaves your infrastructure.

Why Build It When Jaeger Already Exists?

Jaeger is excellent. I'm not competing with it.

The goal was to understand how distributed tracing works under the hood — not just use it. Building forces you to answer questions that using never asks:

Why does the storage schema need two tables instead of one?
What happens when gRPC threads block waiting for Cassandra?
How do you correctly calculate how much time a span spent on its own work when its children ran in parallel?

Every one of those questions led to a design decision I can now explain from first principles. That's the whole point.

Three Things I Learned Building It

1. Cassandra Schema Design Is Query Design

In a relational database you design a schema and add indexes for the queries you need. In Cassandra it's the opposite — you design a table for each query pattern.

Lumen needs two queries:

"Give me all spans for trace ID X" → partition by trace_id
"List recent traces for service checkout" → partition by (service_name, hour_bucket)

I can't use one table for both. A secondary index on service_name in the spans table causes Cassandra to ask every node in the cluster whether it has matching rows — a scatter-gather query that gets slower as you add nodes. The opposite of what you want.

So I built two tables. One per access pattern.

I also hit the hot partition problem. Partitioning the trace index by service_name alone means all checkout-service traces land on one Cassandra node. The fix is time bucketing — partition by (service_name, hour_bucket). 30 days becomes 720 partitions spread across the cluster instead of one overloaded node.
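The time-bucketing idea can be sketched in a few lines of Java. This is a minimal illustration, not Lumen's actual code: the class name `TraceIndexBuckets`, the `PartitionKey` record, and the choice of "hours since the Unix epoch" as the bucket value are all assumptions made for the example. It also shows the read-side consequence: listing recent traces means querying one partition per hour in the requested window.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

public class TraceIndexBuckets {

    /** Hypothetical composite partition key: (service_name, hour_bucket). */
    public record PartitionKey(String serviceName, long hourBucket) {}

    /** Hours since the Unix epoch; every timestamp in the same clock hour shares a bucket. */
    public static long hourBucket(Instant timestamp) {
        return Math.floorDiv(timestamp.getEpochSecond(), 3600L);
    }

    /** Partition keys to query, newest first, to list a service's traces in [from, to]. */
    public static List<PartitionKey> partitionsFor(String serviceName, Instant from, Instant to) {
        List<PartitionKey> keys = new ArrayList<>();
        for (long bucket = hourBucket(to); bucket >= hourBucket(from); bucket--) {
            keys.add(new PartitionKey(serviceName, bucket));
        }
        return keys;
    }

    public static void main(String[] args) {
        Instant to = Instant.parse("2024-05-18T12:30:00Z");
        Instant from = to.minus(Duration.ofHours(2));
        // 10:30 to 12:30 touches the 10:00, 11:00 and 12:00 buckets
        System.out.println(partitionsFor("checkout", from, to).size()); // prints 3
    }
}
```

A 30-day window that starts on the hour expands to 720 keys for a single service, which matches the partition count above.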

2. Naive Self-Time Calculation Produces Negative Numbers

When you want to know how much time a span spent doing its own work — excluding time spent waiting on children — the naive approach is:

selfTime = span.duration - sum(child.duration)

This is wrong when children overlap in time. If two child spans run concurrently:

parent  [0ms ─────────────── 100ms]
child-A [10ms ──── 50ms]           = 40ms
child-B [30ms ──────── 80ms]       = 50ms

Naive sum:          100 - (40+50) = 10ms
Correct (interval union):
  merge [10,50] and [30,80] → [10,80] = 70ms covered
  100 - 70 = 30ms self time

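The fix is interval union: merge overlapping child time ranges before subtracting. In the example above the naive result is only too small (10ms instead of 30ms), but with more overlap it goes negative: three children that each run from 10ms to 70ms inside the same 100ms parent give 100 - 180 = -80ms. Either way, bottleneck detection becomes meaningless for services that make parallel downstream calls.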
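Here is a minimal Java sketch of the interval-union calculation next to the naive one. The `SelfTime` class and `Interval` record are hypothetical names for this example, not Lumen's types. Clamping children to the parent's time range is an extra assumption added here so that async children that outlive the parent do not push the result below zero.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class SelfTime {

    /** Time range [startMs, endMs) of one span. Hypothetical type for illustration. */
    public record Interval(long startMs, long endMs) {}

    /** Naive version: double-counts any time where children overlap. */
    public static long naiveSelfTime(Interval parent, List<Interval> children) {
        long sum = 0;
        for (Interval c : children) sum += c.endMs() - c.startMs();
        return (parent.endMs() - parent.startMs()) - sum;
    }

    /** Interval-union version: each covered millisecond is subtracted exactly once. */
    public static long selfTime(Interval parent, List<Interval> children) {
        List<Interval> sorted = new ArrayList<>(children);
        sorted.sort(Comparator.comparingLong(Interval::startMs));

        long covered = 0;
        long curStart = 0, curEnd = 0;
        boolean open = false; // true once a merged range is in progress
        for (Interval c : sorted) {
            // Assumption: clamp each child to the parent's range.
            long s = Math.max(c.startMs(), parent.startMs());
            long e = Math.min(c.endMs(), parent.endMs());
            if (s >= e) continue; // nothing left after clamping
            if (open && s <= curEnd) {
                curEnd = Math.max(curEnd, e); // overlaps: extend the merged range
            } else {
                if (open) covered += curEnd - curStart; // close the previous range
                curStart = s;
                curEnd = e;
                open = true;
            }
        }
        if (open) covered += curEnd - curStart;
        return (parent.endMs() - parent.startMs()) - covered;
    }

    public static void main(String[] args) {
        Interval parent = new Interval(0, 100);
        List<Interval> children = List.of(new Interval(10, 50), new Interval(30, 80));
        System.out.println(naiveSelfTime(parent, children)); // prints 10
        System.out.println(selfTime(parent, children));      // prints 30
    }
}
```

Sorting by start time dominates the cost, so the calculation is O(n log n) in the number of direct children of a span.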

3. gRPC Threads Should Never Touch the Database

My first implementation had the gRPC handler write directly to Cassandra. At high volume this creates a bottleneck:

10,000 spans/second × 10ms Cassandra write = 100 concurrent threads needed
gRPC thread pool default                   = ~20 threads
Result                                     = 80% of spans rejected

The fix is a bounded LinkedBlockingQueue between the gRPC handler and Cassandra. The handler calls offer(), which returns in microseconds whether the queue accepts or drops the span, and moves on. A background thread drains up to 500 spans every 100ms and writes that batch to Cassandra.

// gRPC thread: never blocks
public void export(ExportTraceServiceRequest request,
                   StreamObserver<ExportTraceServiceResponse> observer) {
    for (Span span : extractSpans(request)) {
        // offer() returns false immediately when the bounded queue is full
        boolean accepted = ingestionQueue.offer(span);
        if (!accepted) droppedCount.incrementAndGet();
    }
    observer.onNext(successResponse());
    observer.onCompleted();
}

// Background thread: drains up to 500 spans, then sleeps 100ms
private void writeLoop() {
    while (running) {
        List<Span> batch = new ArrayList<>();
        ingestionQueue.drainTo(batch, 500);
        if (!batch.isEmpty()) {
            batch.forEach(repository::save);
        }
        try {
            Thread.sleep(100);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // restore the flag and stop the loop
            return;
        }
    }
}

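gRPC threads never block on Cassandra. Ingestion latency is decoupled from write latency; if writes fall behind for long enough, the queue fills and spans are dropped instead of stalling callers.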
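One detail the pattern depends on: offer() only reports a drop when the queue has an explicit capacity. The sketch below isolates that behaviour in a self-contained class. `BoundedIngestBuffer` and its method names are assumptions for illustration, not Lumen's API, and the capacity of 3 is only there to make the drop visible.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

public class BoundedIngestBuffer<T> {
    private final LinkedBlockingQueue<T> queue;
    private final AtomicLong dropped = new AtomicLong();

    public BoundedIngestBuffer(int capacity) {
        // Without an explicit capacity the queue holds up to Integer.MAX_VALUE
        // items, so offer() would effectively never return false.
        this.queue = new LinkedBlockingQueue<>(capacity);
    }

    /** Non-blocking: returns false and counts a drop when the buffer is full. */
    public boolean offer(T item) {
        boolean accepted = queue.offer(item);
        if (!accepted) dropped.incrementAndGet();
        return accepted;
    }

    /** Removes and returns up to maxBatch items without blocking. */
    public List<T> drainBatch(int maxBatch) {
        List<T> batch = new ArrayList<>();
        queue.drainTo(batch, maxBatch);
        return batch;
    }

    public long droppedCount() {
        return dropped.get();
    }

    public static void main(String[] args) {
        BoundedIngestBuffer<String> buffer = new BoundedIngestBuffer<>(3);
        for (int i = 0; i < 5; i++) buffer.offer("span-" + i);
        System.out.println(buffer.droppedCount());         // prints 2
        System.out.println(buffer.drainBatch(500).size()); // prints 3
    }
}
```

The drain settings also set a ceiling on sustained throughput: up to 500 spans every 100ms is at most 5,000 spans per second. At a steady 10,000 spans per second the queue would fill and roughly half the spans would be dropped, so batch size and interval need to be sized against the expected ingest rate.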

What Lumen Actually Shows You

Here's a real trace from a simulated checkout service:

Connect Your App in 30 Seconds

Java / Spring Boot — zero code changes required:

java -javaagent:opentelemetry-javaagent.jar \
  -Dotel.service.name=my-service \
  -Dotel.exporter.otlp.endpoint=http://localhost:9090 \
  -Dotel.exporter.otlp.protocol=grpc \
  -Dotel.traces.exporter=otlp \
  -jar your-app.jar

Docs: github.com/kanagaabishek/lumen/blob/master/docs/integrations/java.md

Node.js:

const { credentials } = require('@grpc/grpc-js');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-grpc');

const exporter = new OTLPTraceExporter({
    url: 'http://localhost:9090',
    credentials: credentials.createInsecure()
});

Docs: github.com/kanagaabishek/lumen/blob/master/docs/integrations/node.md

Python:

OTEL_SERVICE_NAME=my-service \
OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:9090 \
opentelemetry-instrument python app.py

Docs: github.com/kanagaabishek/lumen/blob/master/docs/integrations/python.md

What's Next

Tail-based sampling — store only slow or errored traces, not 100% of everything
Kafka-backed ingestion — horizontal scaling of the write path across multiple Lumen instances
Service dependency graph — visualize which services call which, with average latency on each edge
Alert rules — notify when p99 latency for a service exceeds a threshold

Try It

git clone https://github.com/kanagaabishek/lumen
cd lumen
docker-compose up

Open http://localhost:8080. Select a service. Click a trace.

I'd love feedback on the engineering decisions. What would you change about the architecture? I'm especially curious if anyone has built something similar and made different tradeoffs on the storage layer.

GitHub: github.com/kanagaabishek/lumen

Top comments (6)

Sol • May 21

Really useful build writeup, especially your critical-path emphasis.

One spec-level question from the OpenTelemetry side: in the current GenAI semantic-conventions registry (raw file semantic-conventions-genai/.../gen-ai.md), I can see usage/token attributes (12 gen_ai.usage.* keys) but no cost/currency attribute. If you were adding LLM chargeback on top of Lumen, would you attach price-book metadata at ingest time, or keep token telemetry pure and do cost joins downstream?

I’m trying to compare which model stays auditable in multi-tenant traces.

Kanaga abishek • May 22

Lumen doesn't handle either of these yet, but both are worth thinking through.

For cost attribution, I'd keep token telemetry pure and join downstream against a price book table. Enriching at ingest locks historical data to the price at that moment: if pricing changes, old traces become wrong for chargeback comparisons. Joining at query time keeps the raw telemetry auditable.

Sol • May 22

The join-at-query-time approach is the right call — this is essentially the immutable fact table + slowly-changing dimension pattern, and you sidestep the stale-price-in-raw-data problem you named.

One practical addition for high-volume telemetry (millions of spans/day): materializing the enriched cost view on a schedule (hourly or daily) makes chargeback reports interactive without touching the raw telemetry. Raw spans stay immutable and auditable; the materialized layer is purely a reporting surface. The two layers stay cleanly separated.

Are there plans to support pre-computed cost rollups at team or project level, or is the expectation that consumers build that aggregation layer on top of Lumen?

Sol • May 21

Really useful breakdown. One cost-attribution edge case I’m watching: OTel GenAI content-part semantics are shifting for multimodal payloads. Issue #3672 (opened 2026-04-28) and PR #3673 (closed 2026-05-11) propose document modality plus byte_size on Blob/File/Uri parts, plus StrippedPart fail-closed behavior.

If your tracer stores usage only at span level, those part-level bytes can vanish from tenant chargeback joins. Have you tested whether your critical-path pipeline preserves content-part attributes end-to-end, or are they normalized away in Cassandra writes?

Refs:
github.com/open-telemetry/semantic...
github.com/open-telemetry/semantic...

Kanaga abishek • May 22

Good catch. Lumen currently stores attributes as MAP<TEXT, TEXT> in Cassandra, so repeated keys from multiple content parts would collide and the last write wins. Part-level byte_size would vanish exactly as you describe.
I haven't tested the critical-path pipeline against multimodal payloads, so this exposes a real gap. I think fixing it properly needs either LIST<TUPLE<TEXT,TEXT>> or a separate span_attributes table with (span_id, key, value, index) to preserve repeated keys without collision. Thanks for pointing at #3672/#3673; I hadn't seen that PR. The byte_size on content parts is an interesting hook for cost attribution beyond token counts alone.

Sol • May 22

The separate span_attributes table (span_id, key, value, index) is architecturally cleaner than LIST<TUPLE<TEXT,TEXT>> for a few reasons: secondary indexes on (key, value) let you query "find all spans where model=gpt-4o" without a full scan; new attribute keys can appear without schema migrations; and the index column preserves ordering for multi-part content.

The byte_size-per-part hook is especially useful when a span has mixed input types — a request where 90% of tokens come from a cached image prefix should attribute cost differently than a purely text prompt. That distinct-by-part accounting is exactly what #3673 is trying to standardize at the semantic conventions level: byte_size becomes the common denominator for cross-modal cost attribution, not just a debug field.