Mohammad Awwaad

Posted on May 28

Decoding the Observability Pipeline: A Java Architect's Guide to Metrics, Logs, and Traces

#java #observability #architecture

If you’ve spent any time modernizing a Java-based microservices architecture recently, you’ve likely hit the "Observability Wall." The ecosystem is drowning in tools. We hear about Prometheus, Loki, OpenSearch, Zipkin, Tempo, OpenTelemetry, Grafana Alloy, Datadog—the list never ends.

Observability isn't about collecting tools; it’s about establishing reliable data streams that tell you exactly what your system is doing.

In this article, we’re going to demystify observability architecture in the Java ecosystem, structure it into a clean 4-phase pipeline, and discuss the architectural realities of when to choose an all-in-one stack versus a customized best-of-breed setup.

The 3 Pillars of Observability

Before we talk about architecture, we have to define the data. Observability relies on three distinct types of telemetry data, commonly known as the "Three Pillars":

Logs (The "Event"): A discrete, timestamped record of an event (e.g., Payment failed for user 123).
Metrics (The "Aggregate"): Numbers measured over time. They tell you the overall health of a system without caring about individual requests (e.g., CPU is at 90% or API error rate is 5/sec).
Traces (The "Journey"): The interconnected lifecycle of a single request as it hops across multiple microservices.

The 4 Phases of the Observability Pipeline

No matter which tools you choose, your observability architecture will always follow this four-phase pipeline.

Phase 1: Instrumentation (Application Code)

Where the data is born.
In the Java world, we decouple our business logic from the backend storage using standard facades.

Metrics & Traces: We use Micrometer for metrics and Micrometer Tracing for traces (or the OpenTelemetry SDK for both). You record a metric in code (e.g., meterRegistry.counter("orders").increment();) or via annotations, and Micrometer translates it into the format your chosen monitoring backend expects.
Logs: We use the battle-tested SLF4J + Logback.
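To make the facade idea concrete, here is a minimal sketch that uses only the JDK. The names CounterFacade, InMemoryCounters, and OrderService are hypothetical and are not Micrometer's API; the point is only that the business class depends on an interface, so the backend behind it can change.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical facade: business code depends only on this interface.
interface CounterFacade {
    void increment(String name);
    long count(String name);
}

// One possible backend: counts held in memory. Another implementation could
// forward to any monitoring system without touching OrderService.
class InMemoryCounters implements CounterFacade {
    private final Map<String, LongAdder> counters = new ConcurrentHashMap<>();

    @Override
    public void increment(String name) {
        counters.computeIfAbsent(name, k -> new LongAdder()).increment();
    }

    @Override
    public long count(String name) {
        LongAdder adder = counters.get(name);
        return adder == null ? 0L : adder.sum();
    }
}

// Business logic knows the facade, never the storage backend.
class OrderService {
    private final CounterFacade metrics;

    OrderService(CounterFacade metrics) {
        this.metrics = metrics;
    }

    void placeOrder(String orderId) {
        // ... business logic would run here ...
        metrics.increment("orders");
    }
}
```

Swapping InMemoryCounters for another CounterFacade implementation leaves OrderService untouched, which is the same property Micrometer and SLF4J give you at much larger scale.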

Phase 2: Agents & Collectors (The Data Movers)

How the data leaves the host.
Instead of having your application push data directly to a database, you offload that work to a sidecar or node agent.

Universal Routers: Tools like the OpenTelemetry (OTel) Collector or Grafana Alloy can route metrics, traces, and logs simultaneously.
Log Routers: Purpose-built, high-performance log agents like Fluent Bit or Promtail (note that Grafana has deprecated Promtail in favor of Alloy).
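Conceptually, a router in this phase is a table from signal type to one or more destinations. The sketch below shows that shape in plain Java. Signal, Exporter, RecordingExporter, and TelemetryRouter are hypothetical names for illustration, not the configuration model of the OTel Collector, Alloy, or Fluent Bit.

```java
import java.util.ArrayList;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

enum Signal { METRIC, TRACE, LOG }

// Hypothetical destination; a real exporter would send over the network.
interface Exporter {
    void export(String payload);
}

// Keeps everything it receives so the example can be inspected.
class RecordingExporter implements Exporter {
    final List<String> received = new ArrayList<>();

    @Override
    public void export(String payload) {
        received.add(payload);
    }
}

class TelemetryRouter {
    private final Map<Signal, List<Exporter>> routes = new EnumMap<>(Signal.class);

    // A signal type may fan out to several exporters.
    void addRoute(Signal signal, Exporter exporter) {
        routes.computeIfAbsent(signal, s -> new ArrayList<>()).add(exporter);
    }

    // Sends the payload to every exporter registered for the signal and
    // returns how many exporters received it.
    int route(Signal signal, String payload) {
        List<Exporter> targets = routes.getOrDefault(signal, new ArrayList<>());
        for (Exporter exporter : targets) {
            exporter.export(payload);
        }
        return targets.size();
    }
}
```

Registering two exporters for Signal.LOG delivers each log payload to both, while a signal with no route is simply dropped in this sketch.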

Phase 3: Storage Backends (The Databases)

Where the data lives.

Metrics DB: Prometheus, Mimir, or Datadog.
Trace DB: Tempo, Zipkin, or Jaeger.
Log DB: Loki, OpenSearch, or Elasticsearch.

Phase 4: Visualization

Where you investigate the data.

Unified UI: Grafana is the most widely used option here, querying multiple backends from a single pane of glass.

The 3 Pillars in Action: Sample Use-Cases

Let’s look at how data moves through this pipeline to solve actual production problems.

1. The Metric Journey: Tracking API Load

Action: You need to monitor how many times the /api/v1/orders endpoint is hit.

Phase 1 (Instrumentation): You use Micrometer to increment a counter. Spring Boot Actuator exposes this data at /actuator/prometheus.
Phase 2 & 3 (Collection/Storage): Prometheus scrapes your app on a configured interval (for example, every 15 seconds), reads the current counter value, and saves it in its Time-Series Database (TSDB).
Phase 4 (Visualization): In Grafana, you build a dashboard using PromQL to calculate the request rate over the last hour.
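The sketch below illustrates two steps of that journey in plain Java: what a counter looks like in the Prometheus text exposition format, and how a per-second rate falls out of two scraped samples. The metric name orders_total and the uri label are assumptions for illustration, and perSecondRate is a deliberate simplification: unlike PromQL's rate(), it ignores counter resets and extrapolation.

```java
// One scraped value of a counter at a point in time.
class CounterSample {
    final double epochSeconds;
    final double value;

    CounterSample(double epochSeconds, double value) {
        this.epochSeconds = epochSeconds;
        this.value = value;
    }
}

class PrometheusSketch {
    // Renders one counter with a single label in the text exposition format.
    // Label values are not escaped here, so keep them free of quotes.
    static String render(String name, String labelKey, String labelValue, double value) {
        return "# TYPE " + name + " counter\n"
                + name + "{" + labelKey + "=\"" + labelValue + "\"} " + value + "\n";
    }

    // Average per-second increase between two scrapes.
    static double perSecondRate(CounterSample first, CounterSample last) {
        double elapsed = last.epochSeconds - first.epochSeconds;
        if (elapsed <= 0) {
            throw new IllegalArgumentException("samples must be in time order");
        }
        return (last.value - first.value) / elapsed;
    }
}
```

For example, a counter that moves from 100 to 130 between two scrapes taken 15 seconds apart gives a rate of 2 requests per second.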

2. The Trace Journey: Finding a Database Bottleneck

Action: A request passes through your Gateway, calls an Order Service, and queries a database. It's too slow.

Phase 1: Micrometer Tracing generates a root Trace ID at the gateway and injects it into the HTTP headers. As downstream services execute, they create "Child Spans" linked to that ID.
Phase 2: The app pushes these spans via the OTLP protocol to an OpenTelemetry Collector.
Phase 3: The collector ships the data to Zipkin (or Tempo), which records the exact parent-child timings.
Phase 4: You paste the Trace ID into Grafana and see a waterfall diagram proving the PostgreSQL query took 2.5 seconds, isolating the exact bottleneck.
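To show what "injecting the Trace ID into the HTTP headers" can look like on the wire, here is a minimal sketch assuming W3C Trace Context propagation, where the traceparent header has the form version-traceId-spanId-flags. TraceContext is a hypothetical class written for this article, not Micrometer Tracing's API, and the flags are hard-coded to 01 (sampled).

```java
import java.security.SecureRandom;

class TraceContext {
    private static final SecureRandom RANDOM = new SecureRandom();

    final String traceId;      // 32 hex characters, shared by every span in the trace
    final String spanId;       // 16 hex characters, unique per span
    final String parentSpanId; // null when the parent is unknown or this is the root

    private TraceContext(String traceId, String spanId, String parentSpanId) {
        this.traceId = traceId;
        this.spanId = spanId;
        this.parentSpanId = parentSpanId;
    }

    // Gateway side: start a new trace.
    static TraceContext newRoot() {
        return new TraceContext(randomHex(16), randomHex(8), null);
    }

    // Downstream side: same trace ID, new span ID, linked to this span.
    TraceContext newChild() {
        return new TraceContext(traceId, randomHex(8), spanId);
    }

    // Header value to send with the outgoing HTTP request.
    String toTraceparent() {
        return "00-" + traceId + "-" + spanId + "-01";
    }

    // Rebuilds the caller's context from an incoming header value.
    static TraceContext fromTraceparent(String header) {
        String[] parts = header.split("-");
        if (parts.length != 4 || parts[1].length() != 32 || parts[2].length() != 16) {
            throw new IllegalArgumentException("malformed traceparent: " + header);
        }
        return new TraceContext(parts[1], parts[2], null);
    }

    private static String randomHex(int bytes) {
        byte[] buffer = new byte[bytes];
        RANDOM.nextBytes(buffer);
        StringBuilder hex = new StringBuilder(bytes * 2);
        for (byte b : buffer) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }
}
```

A downstream service would call TraceContext.fromTraceparent(header).newChild(), which keeps the gateway's trace ID and records the gateway's span as the parent. That parent link is what lets the trace backend draw the waterfall.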

3. The Log Journey: Correlating an Error

Action: A customer’s checkout fails.

Phase 1: Your code calls log.error() and Logback writes the event. Crucially, the tracing library has already placed the active Trace ID in the MDC (Mapped Diagnostic Context), so Logback can include it in the JSON log payload.
Phase 2: Fluent Bit (or Promtail) tails the console output, tags it with environment labels (app=payment-service), and routes it.
Phase 3: Loki indexes the labels and compresses the text.
Phase 4: In Grafana, you search for the user ID, find the error, and immediately copy the associated Trace ID to view the trace waterfall.
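The correlation trick in Phase 1 is easier to see in code. The sketch below imitates it with the JDK only: DiagnosticContext is a hypothetical stand-in for SLF4J's MDC (a per-thread key/value map), and JsonLogLine is a hypothetical stand-in for a JSON log encoder. The field names level, message, and traceId are assumptions; real encoders let you configure them.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Stand-in for SLF4J's MDC: a key/value map scoped to the current thread.
class DiagnosticContext {
    private static final ThreadLocal<Map<String, String>> CONTEXT =
            ThreadLocal.withInitial(() -> new LinkedHashMap<String, String>());

    static void put(String key, String value) {
        CONTEXT.get().put(key, value);
    }

    static void clear() {
        CONTEXT.get().clear();
    }

    static Map<String, String> snapshot() {
        return new LinkedHashMap<>(CONTEXT.get());
    }
}

// Stand-in for a JSON encoder: every context entry becomes a log field.
class JsonLogLine {
    static String format(String level, String message) {
        StringBuilder json = new StringBuilder("{");
        json.append("\"level\":\"").append(escape(level)).append("\"");
        json.append(",\"message\":\"").append(escape(message)).append("\"");
        for (Map.Entry<String, String> entry : DiagnosticContext.snapshot().entrySet()) {
            json.append(",\"").append(escape(entry.getKey())).append("\":\"")
                .append(escape(entry.getValue())).append("\"");
        }
        return json.append("}").toString();
    }

    // Minimal escaping (backslash and double quote only); control characters
    // are not handled in this sketch.
    private static String escape(String text) {
        return text.replace("\\", "\\\\").replace("\"", "\\\"");
    }
}
```

Because the trace ID travels in the context rather than in the log call itself, the business code that logs the error never has to know about tracing.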

Push vs. Pull: The Architectural Divide

When telemetry moves from your application toward storage (Phases 1 through 3), each hop uses either a Push or a Pull model.

Almost everything in observability is a Push model: Logs and Traces are fired off by the application, pushed through an agent, and pushed into a database.

Metrics are the major exception.

Prometheus = PULL. Prometheus reaches directly into your Java microservice via an HTTP GET request (usually /actuator/prometheus) and pulls the data.
Mimir = PUSH. Grafana Mimir does not pull from your app. An agent (like Grafana Alloy) scrapes the metrics locally and then pushes them over the network to Mimir using the Prometheus remote write protocol.
Datadog = PUSH. Because Datadog is a cloud SaaS, it cannot reach into your private VPC to scrape endpoints. A local agent gathers the data and pushes it to the Datadog cloud.
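The "pull locally, push remotely" pattern used by agents can be sketched in a few lines. ScrapingAgent and RemoteStore below are hypothetical names; the Supplier stands in for an HTTP GET against the app's metrics endpoint, and RemoteStore.write stands in for a network call such as remote write. Batching by count is an assumption made to keep the example small.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Stands in for a remote backend that only accepts pushed batches.
class RemoteStore {
    final List<String> stored = new ArrayList<>();

    void write(List<String> batch) {
        stored.addAll(batch);
    }
}

// Stands in for a local agent: pull from the app, push to the store.
class ScrapingAgent {
    private final Supplier<String> scrapeTarget; // pull side
    private final RemoteStore store;             // push side
    private final int batchSize;
    private final List<String> buffer = new ArrayList<>();

    ScrapingAgent(Supplier<String> scrapeTarget, RemoteStore store, int batchSize) {
        this.scrapeTarget = scrapeTarget;
        this.store = store;
        this.batchSize = batchSize;
    }

    // One scrape cycle: pull the current value, buffer it, and push the
    // buffer once it holds batchSize samples.
    void scrapeOnce() {
        buffer.add(scrapeTarget.get());
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    // Pushes whatever is buffered, for example on shutdown.
    void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        store.write(new ArrayList<>(buffer));
        buffer.clear();
    }
}
```

Note that the application never initiates anything in this sketch. The agent owns both the pull and the push, which is why the remote backend does not need network access to the application itself.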

Architectural Decisions: LGTM vs. Custom Stacks

The most common architectural question is: "Should I use the full Grafana LGTM stack (Loki, Grafana, Tempo, Mimir), or build a customized stack with Prometheus, Zipkin, and OpenSearch?"

Here is how you make that decision.

1. The Value of Native Correlation (The LGTM Argument)

If you are starting fresh, the LGTM stack (combined with Grafana Alloy as the Universal Router) is a strong default. Its primary advantage is that Loki, Tempo, and Mimir were designed to work together, so the "click a log line to instantly see the trace waterfall" feature needs very little configuration.

2. Scaling Metrics: Prometheus vs. Mimir

While the "M" in LGTM stands for Mimir, Mimir is a heavy, distributed microservice architecture designed for tens of millions of active time series across multiple tenants. For most mid-sized systems, a standalone Prometheus binary is vastly easier to configure and consumes a fraction of the hardware resources.

3. Log Routing: Alloy vs. Fluent Bit

Grafana Alloy (the successor to Grafana Agent) is an excellent Universal Router. However, if your enterprise requires routing logs to multiple disparate destinations (e.g., OpenSearch for searching, S3 for compliance, and Kafka for analytics), Fluent Bit is often the better fit: it is written in C, uses only a few megabytes of memory, and is a battle-tested choice for complex log topologies.

4. Log Storage: Loki vs. OpenSearch/Elasticsearch

Loki is incredibly cost-effective because it only indexes metadata labels (like env=prod), not the raw text itself.

However, if your business requires heavy, full-text wildcard searches across terabytes of complex JSON payloads, Loki will struggle, because it has to scan the raw text of every stream the labels select. In those cases, you need the heavy lifting of a true inverted index provided by OpenSearch (the open-source fork of Elasticsearch) or Elasticsearch itself.
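The trade-off between the two storage models comes down to what gets indexed at write time. The sketch below contrasts them with two toy in-memory stores. LabelIndexedStore and InvertedIndexStore are hypothetical names; real Loki and OpenSearch add chunking, compression, sharding, and far richer query languages, so treat this only as an illustration of the data structures.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Label-indexed model: only the label is indexed, the text is scanned.
class LabelIndexedStore {
    private final Map<String, List<String>> streams = new HashMap<>();

    // Cheap write: append the line to the stream chosen by its label.
    void append(String label, String line) {
        streams.computeIfAbsent(label, k -> new ArrayList<>()).add(line);
    }

    // The label lookup is an index hit; the text filter is a linear scan
    // over every line in the selected stream.
    List<String> query(String label, String needle) {
        List<String> hits = new ArrayList<>();
        for (String line : streams.getOrDefault(label, new ArrayList<>())) {
            if (line.contains(needle)) {
                hits.add(line);
            }
        }
        return hits;
    }
}

// Inverted-index model: every token points at the documents containing it.
class InvertedIndexStore {
    private final List<String> documents = new ArrayList<>();
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Costlier write: tokenize the line and update one posting list per token.
    void append(String line) {
        int id = documents.size();
        documents.add(line);
        for (String token : line.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, k -> new LinkedHashSet<>()).add(id);
            }
        }
    }

    // A term lookup is a map hit; no document text is scanned.
    List<String> query(String term) {
        List<String> hits = new ArrayList<>();
        for (int id : postings.getOrDefault(term.toLowerCase(), new LinkedHashSet<>())) {
            hits.add(documents.get(id));
        }
        return hits;
    }
}
```

The label-indexed store does almost no work on write and stores no per-token index, which is where the cost savings come from. The inverted index pays that cost up front so that a term search never has to read the raw lines.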

5. Trading Capital for Time: Datadog / Instana

Managing the storage, retention, and scaling of open-source databases (like OpenSearch or Mimir) requires dedicated engineering effort.

Commercial APM platforms like Datadog or Instana exist so that organizations can trade capital (money) for time (engineering hours). You pay a premium subscription fee to offload the database infrastructure and get features such as automated root cause analysis and auto-instrumentation out of the box.

Conclusion

There is no single "correct" observability architecture. The LGTM stack offers incredible developer experience, while a custom stack (Prometheus, Zipkin, OpenSearch, Fluent Bit) often aligns better with existing legacy infrastructure and heavy search requirements.

By understanding the 4-phase pipeline and keeping your Java applications cleanly instrumented with Micrometer and SLF4J, you can swap out the backend tools as your scale and budget dictate without rewriting your business logic.

Additional Resource: For an excellent visual breakdown of how the modern OTel pipeline comes together, check out the video "What's New & Next in Grafana Alloy". The talk provides a great live demonstration of how Grafana Alloy, which Grafana describes as an OpenTelemetry Collector distribution, routes millions of traces and logs in production.
