(Originally published on Medium, Image by Gerd Altmann from Pixabay)
Introduction
Distributed tracing connects network calls across the various services within an organization. In my case, that means I can:
- From Android: see the time taken for a request to be sent from the device
- From Spring Boot: see any error logs associated with a specific user
- From the database: see how much time each query takes, to gauge their efficiency
Traces are made of spans; each span has attributes and may have a parent span. A span without a parent is known as the root span. More about that in the OpenTelemetry documentation.
In this example:
- The root span is named GET, starting at the Android app.
- It’s composed of 5 spans, 3 of which are database queries, and the longest-running one is the root.
- The next span after the app starts at the backend.
The trace took 1.91 seconds in total.
As mentioned before, my choice of tools for OpenTelemetry will be the ones from the Grafana stack, and so I’ll be using Tempo.
Trace Propagation
Briefly, I’ll go over how traces work; the W3C Trace Context and OpenTelemetry documentation explain it in much better detail.
Spans are generated within each service before being sent to the span collector. Spans can be created with parent information using trace headers.
A trace header is passed from the first component down the service trail. If a service has tracing enabled and doesn’t detect a compatible trace header, it creates a new one. Trace headers come in multiple formats, but W3C Trace Context is the main one.
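For reference, a W3C trace context header is a single traceparent line; the example below uses the illustrative IDs from the W3C specification:

```
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
```

The four dash-separated fields are the version, the trace ID (shared by every span in the trace), the parent span ID, and the sampling flags.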
Since the Android app is instrumented, the trace starts from there; otherwise, it would have started and ended at the backend as a single-service trace.
And the resulting trace can be viewed in Tempo:
Trace Correlation
This allows one to filter logs by HTTP request instead of having to narrow down the approximate time of the request. Correlation connects traces to the other 2* OpenTelemetry signals, logs and metrics, and allows one to jump back and forth between them.
(* profiles are an upcoming 4th signal in OpenTelemetry. For now, Grafana Pyroscope can be used alongside traces, but that will be discussed in the performance test article)
Log Correlation with Loki
The key attributes here are trace_id and span_id. These have to be set as structured metadata on the log entry. In Spring, they can be extracted from the request, for example via RequestContextHolder. The request attributes then have to be converted into log attributes that can be parsed by a Promtail pipeline, as I explained in my logging article. With an auto-instrumented app, such as one using the Java Agent, these steps can be skipped because the attributes are extracted automatically.
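As a rough sketch, a Promtail pipeline that lifts trace_id and span_id out of a JSON-formatted log line and attaches them as structured metadata could look like this (the JSON field names are assumptions about the log format):

```yaml
pipeline_stages:
  # Parse the JSON log line and extract the two trace attributes
  - json:
      expressions:
        trace_id: trace_id
        span_id: span_id
  # Attach them as structured metadata rather than indexed labels
  - structured_metadata:
      trace_id:
      span_id:
```

Keeping them as structured metadata avoids blowing up the label index, since both IDs are effectively unique per request.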
Loki to Tempo
After getting the trace attributes, I configured Loki to use Derived Fields to link to Tempo.
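In Grafana’s datasource provisioning format, such a derived field can be sketched roughly like this (the Tempo datasource UID is a placeholder for your own; note that `$` must be escaped as `$$` in provisioning files):

```yaml
datasources:
  - name: Loki
    type: loki
    jsonData:
      derivedFields:
        # Turn the trace_id on a log entry into a link to the matching Tempo trace
        - name: TraceID
          matcherType: label
          matcherRegex: trace_id
          datasourceUid: tempo   # placeholder: UID of the Tempo datasource
          url: '$${__value.raw}'
```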
And this is an example of what will appear when properly configured:
Note the “tempo” buttons:
Tempo to Loki
Example link:
When configured properly, this “LOG” button, or a chain link icon, will appear on the traces.
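Provisioned on the Tempo datasource, the trace-to-logs link can be sketched like this (the Loki UID and the time shifts are assumptions for my setup):

```yaml
datasources:
  - name: Tempo
    type: tempo
    jsonData:
      tracesToLogsV2:
        datasourceUid: loki      # placeholder: UID of the Loki datasource
        # Widen the search window around the span so slow-to-arrive logs are caught
        spanStartTimeShift: '-5m'
        spanEndTimeShift: '5m'
        filterByTraceID: true
```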
Metric Correlation with Prometheus
For Prometheus, it’s slightly more complicated. All labels are indexed, so a high-cardinality field like trace_id is out of the question, and there’s no equivalent of Loki’s structured metadata either. Instead, exemplars are needed; the feature is available, but still experimental.
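An exemplar rides along with a histogram bucket in the exposition format, as an extra suffix after the sample value; the metric name below is illustrative:

```
# TYPE http_server_duration_seconds histogram
http_server_duration_seconds_bucket{le="0.5"} 129 # {trace_id="4bf92f3577b34da6a3ce929d0e0e4736"} 0.403
```

The trailing part is the exemplar: one sample observation of 0.403 s tagged with the trace that produced it. Prometheus only stores these when started with --enable-feature=exemplar-storage.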
For some reason, I’ve only gotten it to work with histogram_quantile.
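For example, a quantile query like the following (metric name again illustrative) should show exemplars as dots alongside the latency curve once the “Exemplars” toggle is enabled on the Grafana panel:

```
histogram_quantile(
  0.95,
  sum by (le) (rate(http_server_duration_seconds_bucket[5m]))
)
```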
There’s also Traces to Metrics, but I won’t talk about it here.
Custom Trace Attributes
My main use case for this is to allow traces to be filtered by a user's device, acting as a very basic form of session replay.
Instrumentation
Because of the fundamental differences between Android’s runtime (ART) and the standard JVM, auto-instrumentation is generally not available, so I’ve gone the route of manual instrumentation.
Retrofit is my library of choice for making HTTP calls, and so I chose to use the OkHttp instrumentation.
This is a summarized version of my Dagger Provides method for the Retrofit object:
@Provides
@Named("authorized")
fun retrofitAuthorized(
    buildConstants: BuildConstants,
    openTelemetry: OpenTelemetry,
    dataProvider: DeviceDataProvider,
): Retrofit {
    val okHttpClient = OkHttpClient.Builder()
        .addNetworkInterceptor {
            // The OkHttp instrumentation has already started a span at this point
            val span = Span.current()
            if (span.isRecording) {
                val userId = dataProvider.userId()
                if (userId != null) {
                    span.setAttribute(
                        Constants.CustomTraceAttributes.UserId.attributeName,
                        userId.toString()
                    )
                }
                val deviceId = dataProvider.deviceId()
                if (deviceId != null) {
                    span.setAttribute(
                        Constants.CustomTraceAttributes.DeviceId.attributeName,
                        deviceId.toString()
                    )
                }
            }
            it.proceed(it.request())
        }
        .build()
    val callFactory = OkHttpTelemetry.builder(openTelemetry)
        .build()
        .newCallFactory(okHttpClient)
    return Retrofit.Builder()
        .baseUrl(buildConstants.exTrackBaseUrl())
        .callFactory(callFactory)
        .build()
}
I’m using Span.current() here because the instrumentation already starts a span through a custom interceptor.
And the definition for DeviceDataProvider is simply this:
interface DeviceDataProvider {
    fun deviceId(): UUID?
    fun userId(): UUID?
}
I store the device ID in a SharedPreferences file.
And the result is that I can search Tempo with a query like this:
{resource.service.name="expense-tracker-android"} && {span.extrack.deviceId = "8ea68a2d-ffba-42a8-970c-02a397210fb0"}
I also added these attributes to my Loki logs, since I could afford to have non-indexed metadata there; but as discussed in my Prometheus article, I didn’t add them to my metrics because of their high cardinality.
Conclusion
With tracing in place, I now have a tool that works as a form of "super-logger" to help me narrow down API calls faster.
Now that I’ve concluded the “OpenTelemetry” portion of my ExTrack series, I’ll be discussing how I was able to do performance tests on my server and how all 3 signals provided great value to me. That will be all for now.
Thank you for your time.