A few months ago I was freelancing on a client project. Every API call was slow. Some were taking 800ms, some 1.2 seconds — but nobody could pinpoint why.
The codebase touched 6 services. Debugging meant manually correlating logs across all of them, file by file, hoping to find where time was being lost. It took hours per incident.
That experience made me ask a question I couldn't let go of:
Is there a way to get a blueprint of every function, service, and database call that happens for a single API request — automatically?
That led me to Jaeger, Zipkin, and the OpenTelemetry protocol. I was genuinely impressed that someone had built a system that could trace the entire path of an API call and tell you exactly where latency occurs.
Then I did what any curious engineer would do — I decided to build my own.
What I Built
Lumen — a self-hosted distributed tracing system that collects, stores, and analyzes traces from your microservices using OpenTelemetry.
One command to run everything:
docker-compose up
Point any OpenTelemetry SDK at port 9090 and Lumen starts receiving traces immediately. No account. No API key. Your trace data never leaves your infrastructure.
Why Build It When Jaeger Already Exists?
Jaeger is excellent. I'm not competing with it.
The goal was to understand how distributed tracing works under the hood — not just use it. Building forces you to answer questions that using never asks:
- Why does the storage schema need two tables instead of one?
- What happens when gRPC threads block waiting for Cassandra?
- How do you correctly calculate how much time a span spent on its own work when its children ran in parallel?
Every one of those questions led to a design decision I can now explain from first principles. That's the whole point.
Three Things I Learned Building It
1. Cassandra Schema Design Is Query Design
In a relational database you design a schema and add indexes for the queries you need. In Cassandra it's the opposite — you design a table for each query pattern.
Lumen needs two queries:
- "Give me all spans for trace ID X" → partition by trace_id
- "List recent traces for service checkout" → partition by (service_name, hour_bucket)
I can't use one table for both. A secondary index on service_name in the spans table causes Cassandra to ask every node in the cluster whether it has matching rows — a scatter-gather query that gets slower as you add nodes. The opposite of what you want.
So I built two tables. One per access pattern.
I also hit the hot partition problem. Partitioning the trace index by service_name alone means all checkout-service traces land in one partition, and therefore on one Cassandra node and its replicas. The fix is time bucketing: partition by (service_name, hour_bucket). A 30-day window for one service becomes 720 partitions (30 × 24) spread across the cluster instead of one overloaded node.
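Here is a minimal sketch of the bucketing idea, not Lumen's actual code. The table names, column names, and the `TraceIndexBuckets` class are hypothetical, and I'm assuming the bucket is simply "hours since the Unix epoch in UTC".

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

final class TraceIndexBuckets {
    // Hypothetical CQL for the two access patterns; names are illustrative only.
    static final String SPANS_BY_TRACE =
        "CREATE TABLE IF NOT EXISTS spans_by_trace ("
      + " trace_id text, span_id text, start_time timestamp, duration_us bigint,"
      + " PRIMARY KEY ((trace_id), span_id))";

    static final String TRACES_BY_SERVICE =
        "CREATE TABLE IF NOT EXISTS traces_by_service ("
      + " service_name text, hour_bucket bigint, start_time timestamp, trace_id text,"
      + " PRIMARY KEY ((service_name, hour_bucket), start_time, trace_id))"
      + " WITH CLUSTERING ORDER BY (start_time DESC, trace_id ASC)";

    /** Hours since the Unix epoch (UTC) for the given instant. */
    static long hourBucket(Instant t) {
        return Math.floorDiv(t.getEpochSecond(), 3600L);
    }

    /** Every hour bucket that overlaps the half-open window [from, to), oldest first. */
    static List<Long> bucketsBetween(Instant from, Instant to) {
        List<Long> buckets = new ArrayList<>();
        if (!from.isBefore(to)) {
            return buckets; // empty or inverted window
        }
        long first = hourBucket(from);
        long last = hourBucket(to.minusNanos(1)); // 'to' itself is excluded
        for (long b = first; b <= last; b++) {
            buckets.add(b);
        }
        return buckets;
    }
}
```

The tradeoff is on the read side: "recent traces for checkout" now means one query per bucket in the window, newest bucket first, stopping once enough rows have come back.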
2. Naive Self-Time Calculation Produces Negative Numbers
When you want to know how much time a span spent doing its own work — excluding time spent waiting on children — the naive approach is:
selfTime = span.duration - sum(child.duration)
This is wrong when children overlap in time. If two child spans run concurrently:
parent [0ms ─────────────── 100ms]
child-A [10ms ──── 50ms] = 40ms
child-B [30ms ──────── 80ms] = 50ms
Naive sum: 100 - (40+50) = 10ms
Correct (interval union):
merge [10,50] and [30,80] → [10,80] = 70ms covered
100 - 70 = 30ms self time
The fix is interval union: merge overlapping child time ranges before subtracting. Without this, services that make parallel downstream calls have their self-time understated, and as soon as the children's combined duration exceeds the parent's duration the value goes negative, which makes bottleneck detection meaningless.
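The interval-union calculation can be sketched as follows. This is an illustration rather than Lumen's implementation: the `SelfTime` class is hypothetical, each child is a `{start, end}` pair in milliseconds, and I'm assuming child ranges should be clipped to the parent's bounds.

```java
import java.util.ArrayList;
import java.util.List;

final class SelfTime {
    /** Naive version: subtracts every child's duration, double-counting overlaps. */
    static long naive(long parentStart, long parentEnd, long[][] children) {
        long sum = 0;
        for (long[] c : children) {
            sum += c[1] - c[0];
        }
        return (parentEnd - parentStart) - sum;
    }

    /** Merges overlapping child ranges first, then subtracts the covered time. */
    static long intervalUnion(long parentStart, long parentEnd, long[][] children) {
        List<long[]> clipped = new ArrayList<>();
        for (long[] c : children) {
            long s = Math.max(c[0], parentStart);
            long e = Math.min(c[1], parentEnd);
            if (s < e) {
                clipped.add(new long[] {s, e});
            }
        }
        clipped.sort((a, b) -> Long.compare(a[0], b[0])); // sort by start time

        long covered = 0;
        long curStart = 0;
        long curEnd = 0;
        boolean open = false;
        for (long[] c : clipped) {
            if (!open) {
                curStart = c[0];
                curEnd = c[1];
                open = true;
            } else if (c[0] <= curEnd) {
                curEnd = Math.max(curEnd, c[1]); // overlaps: extend the current run
            } else {
                covered += curEnd - curStart;    // gap: close the run, start a new one
                curStart = c[0];
                curEnd = c[1];
            }
        }
        if (open) {
            covered += curEnd - curStart;
        }
        return (parentEnd - parentStart) - covered;
    }
}
```

For the example above (parent 0 to 100, children 10 to 50 and 30 to 80), `naive` returns 10 and `intervalUnion` returns 30. With children 0 to 60 and 20 to 100, `naive` returns -40 while `intervalUnion` returns 0.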
3. gRPC Threads Should Never Touch the Database
My first implementation had the gRPC handler write directly to Cassandra. At high volume this creates a bottleneck:
10,000 spans/second × 10ms Cassandra write = 100 concurrent threads needed
gRPC thread pool default = ~20 threads
Result = 80% of spans rejected
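The arithmetic above is Little's law (requests in flight = arrival rate × time per request). A small sketch, assuming every span is a separate blocking write and the pool size is fixed; the `IngestMath` class and its method names are hypothetical:

```java
final class IngestMath {
    /** Little's law: threads busy at once = arrival rate x time each write takes. */
    static double threadsNeeded(double spansPerSecond, double writeLatencyMs) {
        return spansPerSecond * writeLatencyMs / 1000.0;
    }

    /** Fraction of spans that cannot be served when only poolSize threads exist. */
    static double rejectedFraction(double threadsNeeded, int poolSize) {
        if (threadsNeeded <= poolSize) {
            return 0.0;
        }
        return 1.0 - poolSize / threadsNeeded;
    }
}
```

`threadsNeeded(10_000, 10)` gives 100, and `rejectedFraction(100, 20)` gives 0.8, matching the numbers above.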
The fix is a bounded LinkedBlockingQueue between the gRPC handler and Cassandra. The handler calls offer(), which returns in microseconds whether the queue accepts the span or is full and drops it, and moves on. A background thread drains up to 500 spans every 100ms and batch-writes them to Cassandra.
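// gRPC thread: never blocks
public void export(ExportTraceServiceRequest request,
                   StreamObserver<ExportTraceServiceResponse> observer) {
    for (Span span : extractSpans(request)) {
        // offer() returns false immediately when the bounded queue is full
        boolean accepted = ingestionQueue.offer(span);
        if (!accepted) droppedCount.incrementAndGet();
    }
    observer.onNext(successResponse());
    observer.onCompleted();
}

// Background thread: wakes up every 100ms
private void writeLoop() {
    while (running) {
        List<Span> batch = new ArrayList<>();
        // Take at most 500 queued spans without waiting
        ingestionQueue.drainTo(batch, 500);
        if (!batch.isEmpty()) {
            batch.forEach(repository::save);
        }
        try {
            Thread.sleep(100);
        } catch (InterruptedException e) {
            // Restore the interrupt flag and stop the loop on shutdown
            Thread.currentThread().interrupt();
            return;
        }
    }
}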
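gRPC threads never block. Ingestion throughput is decoupled from write throughput: when Cassandra falls behind, the queue fills and new spans are dropped and counted instead of stalling the handler.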
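To see the drop-on-full behavior in isolation, here is a small self-contained sketch. The `IngestBuffer` class is hypothetical and stands in for the queue plus drop counter; it is not Lumen's code. One detail matters: the capacity has to be passed explicitly, because a LinkedBlockingQueue created without one has a capacity of Integer.MAX_VALUE and offer() would effectively never return false.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

final class IngestBuffer<T> {
    private final LinkedBlockingQueue<T> queue;
    private final AtomicLong dropped = new AtomicLong();

    IngestBuffer(int capacity) {
        // Explicit capacity makes the queue bounded, so offer() can fail fast.
        this.queue = new LinkedBlockingQueue<>(capacity);
    }

    /** Non-blocking enqueue; counts the item as dropped when the queue is full. */
    boolean submit(T item) {
        boolean accepted = queue.offer(item);
        if (!accepted) {
            dropped.incrementAndGet();
        }
        return accepted;
    }

    /** Removes up to maxBatch items without waiting and returns them in FIFO order. */
    List<T> drain(int maxBatch) {
        List<T> batch = new ArrayList<>();
        queue.drainTo(batch, maxBatch);
        return batch;
    }

    long droppedCount() {
        return dropped.get();
    }

    int size() {
        return queue.size();
    }
}
```

With a capacity of 3, submitting five items accepts the first three and counts two as dropped; draining then frees room for new items. Exposing the drop counter as a metric is what tells you the queue is undersized or the writer is too slow.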
What Lumen Actually Shows You
Here's a real trace from a simulated checkout service:
Connect Your App in 30 Seconds
Java / Spring Boot — zero code changes required:
java -javaagent:opentelemetry-javaagent.jar \
  -Dotel.service.name=my-service \
  -Dotel.exporter.otlp.endpoint=http://localhost:9090 \
  -Dotel.exporter.otlp.protocol=grpc \
  -Dotel.traces.exporter=otlp \
  -jar your-app.jar
Docs: github.com/kanagaabishek/lumen/blob/master/docs/integrations/java.md
Node.js:
const { credentials } = require('@grpc/grpc-js');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-grpc');

const exporter = new OTLPTraceExporter({
  url: 'http://localhost:9090',
  credentials: credentials.createInsecure()
});
Docs: github.com/kanagaabishek/lumen/blob/master/docs/integrations/node.md
Python:
OTEL_SERVICE_NAME=my-service \
OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:9090 \
opentelemetry-instrument python app.py
Docs: github.com/kanagaabishek/lumen/blob/master/docs/integrations/python.md
What's Next
- Tail-based sampling — store only slow or errored traces, not 100% of everything
- Kafka-backed ingestion — horizontal scaling of the write path across multiple Lumen instances
- Service dependency graph — visualize which services call which, with average latency on each edge
- Alert rules — notify when p99 latency for a service exceeds a threshold
Try It
git clone https://github.com/kanagaabishek/lumen
cd lumen
docker-compose up
Open http://localhost:8080. Select a service. Click a trace.
I'd love feedback on the engineering decisions. What would you change about the architecture? I'm especially curious if anyone has built something similar and made different tradeoffs on the storage layer.
GitHub: github.com/kanagaabishek/lumen

