(Originally published on Medium, Image by Boskampi from Pixabay)
Introduction
When it comes to logging, there are multiple concerns; I'll highlight a few in this post:
- Retention: to prevent server disk space from becoming scarce
- Search: the core use case of logs
- Formatting: aids in readability and search
Across different tools and platforms, such as programming languages and operating systems, these aspects vary significantly. One of the main selling points of OpenTelemetry is having a central place to view all those logs, whether they come from:
- A file on the local machine, like with my MySQL server
- A file on a remote machine, like with my NGINX server
- A mobile device, like with ExTrack:
As mentioned in my earlier articles, my tools of choice for OpenTelemetry come from the Grafana stack, so I’ll be using Loki.
Loki
Loki’s main description is:
Like Prometheus, but for logs.
That's why I placed this article after the Metrics one.
Loki mainly relies on labels in order to narrow down logs while searching, similar to Prometheus.

(service_name and level, 2 of the most common labels in my case)
It also supports structured metadata, which is similar to labels but not indexed, so it can hold high-cardinality values such as an IP address or a user ID. The main idea when working with Loki is to filter by a few key indexed labels, and then narrow down with the various search operators.
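As a sketch, that workflow maps onto a LogQL query like the one below: start from an indexed label selector, then narrow down with a line filter and a structured-metadata filter. (The label and metadata names match the configs later in this post; the IP is a made-up example value.)

```logql
{service_name="nginx_access_logs", status="404"} |= "GET" | remote_addr = "203.0.113.7"
```

Only the selector in curly braces touches the index; everything after the first pipe is evaluated by scanning the matching log lines.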
The 2 main ways I push to Loki are:
- Scraping the raw log files with Grafana Promtail*, which then pushes them to Loki
- Using the OpenTelemetry Java Agent in combination with Grafana Alloy using OTLP
(*Promtail is at end-of-life, Alloy is the recommended alternative)
Promtail
At a basic level, Promtail uses a pipeline to process log lines into the final output sent to Loki. A pipeline consists of a wide variety of stages; the most notable one in my case is the json stage, since I can use JSON for both the MySQL error log and the NGINX access logs.
MySQL Promtail Config
- job_name: mysql_error_logs
  static_configs:
    - targets:
        - localhost
      labels:
        service_name: mysql_error_logs
        __path__: "C:/ProgramData/MySQL/MySQL Server 8.0/Data/LAPTOP-7H9JJDHB.err.00.json"
  pipeline_stages:
    - json:
        expressions:
          time: time
          msg: msg
          prio: prio
          level_derived: prio
          level: label
          err_code: err_code
          err_symbol: err_symbol
          SQL_state: SQL_state
          subsystem: subsystem
    # https://dev.mysql.com/doc/refman/8.4/en/error-log-event-fields.html
    - template:
        source: level_derived
        template: '{{ if eq .Value "0" }}System{{ else if eq .Value "1" }}Error{{ else if eq .Value "2" }}Warning{{ else }}Note{{ end }}'
    - timestamp:
        source: time
        format: RFC3339
    - labels:
        level:
        level_derived:
    - structured_metadata:
        prio:
        err_code: err_code
        err_symbol: err_symbol
        SQL_state: SQL_state
        subsystem: subsystem
    - output:
        source: msg
- I extract all the key-value pairs using the json stage
- The timestamp stage is there to ensure Promtail uses the time from the log statement instead of the time the file was last scraped
- I chose to turn the level attribute into a label because log levels are usually the most important label, and also because the cardinality is quite low (only 4 values)
Example
Equivalent logs in Loki:
NGINX Logs
- job_name: nginx_access_logs
  static_configs:
    - targets:
        - localhost
      labels:
        service_name: nginx_access_logs
        __path__: "/var/log/nginx/access-json.log"
  pipeline_stages:
    - json:
        expressions:
          timestamp: timestamp
          remote_addr: remote_addr
          message: message
          status: status
          request_method: request_method
          hostname: hostname
          trace_id: trace_id
          span_id: span_id
          server_name: server_name
    - timestamp:
        source: timestamp
        format: RFC3339
    - labels:
        status:
        request_method:
    - structured_metadata:
        remote_addr:
        hostname:
        trace_id:
        span_id:
        server_name:
    - output:
        source: message
Here, once again, I chose just 2 labels, with a combined maximum cardinality of 4,500: 9 HTTP methods times 500 possible status codes. In practice it will be much smaller, since only a fraction of those combinations ever occur.
And here is the custom NGINX log format:
log_format json escape=json '{"remote_addr":"$remote_addr","timestamp":"$time_iso8601","message":"$request","status":"$status","request_method":"$request_method","hostname":"$hostname","trace_id": "$otel_trace_id","span_id":"$otel_span_id", "user_agent": "$http_user_agent","server_name":"$server_name"}';
access_log /var/log/nginx/access-json.log json;
Java Agent
Backend
For the backend, the Java Agent automatically exports the Logback logs via OTLP to Grafana Alloy, where some resource attributes are promoted to labels by default. The relevant ones for me are deployment.environment.name, to differentiate dev, staging, and production, and service.name.
My configuration using environment variables for OpenTelemetry and Java:
export APP_VERSION=$(cat app-version)
export OTEL_SERVICE_NAME=expense_tracker_backend
export OTEL_JAVA_AGENT_LOCATION=/opt/opentelemetry
export JAVA_TOOL_OPTIONS="-javaagent:$OTEL_JAVA_AGENT_LOCATION/opentelemetry-javaagent.jar"
export SPRING_PROFILES_ACTIVE=dev
export OTEL_EXPORTER_OTLP_ENDPOINT=http://telemetry.davidgrath.com:4318
export OTEL_RESOURCE_ATTRIBUTES=deployment.environment.name=dev,deployment.environment=dev,service.version=$APP_VERSION
java -jar "/opt/expense_tracker/expense-tracker-backend.jar"
The domain name is my local DNS entry pointing to my Grafana Alloy instance, which is configured like this:
otelcol.exporter.otlphttp "loki_exporter" {
  client {
    endpoint = "http://localhost:3100/otlp"
  }
}

otelcol.receiver.otlp "otlp_receiver" {
  grpc {
  }
  http {
  }
  output {
    logs = [otelcol.exporter.otlphttp.loki_exporter.input]
  }
}
Example
Resulting Loki entry
Android
For the Android version, the runtime is fundamentally different from a standard JVM, so auto-instrumentation isn't available. Manual instrumentation is needed, and can be achieved with the Logback library.
I’m using Dependency Injection with Dagger, so here’s how the OpenTelemetry object is configured in my Dagger Provides method:
@Provides
@Singleton
fun openTelemetry(buildConstants: BuildConstants): OpenTelemetry {
    val logsUrl = "${buildConstants.telemetryHttpUrl()}/v1/logs"
    val resource = Resource.builder()
        .put(ServiceAttributes.SERVICE_NAME, "expense-tracker-android")
        .put("android.os.api_level", Build.VERSION.SDK_INT.toString()) // From the semconv docs
        .put("deployment.environment.name", buildConstants.environmentName())
        .build()
    val sdk = OpenTelemetrySdk.builder()
        .setPropagators(
            ContextPropagators.create(
                TextMapPropagator.composite(
                    W3CTraceContextPropagator.getInstance(),
                    W3CBaggagePropagator.getInstance()
                )
            )
        )
        .setLoggerProvider(
            SdkLoggerProvider.builder()
                .addResource(resource)
                .addLogRecordProcessor(
                    BatchLogRecordProcessor.builder(
                        OtlpHttpLogRecordExporter.builder().setEndpoint(logsUrl).build()
                    ).build()
                )
                .build()
        )
        .buildAndRegisterGlobal()
    OpenTelemetryAppender.install(sdk)
    return sdk
}
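For the install call to have any effect, Logback also has to route records to the OpenTelemetry appender. A minimal logback.xml sketch of that wiring (the appender names are my own, and I'm assuming the captureMdcAttributes option from the appender's documentation):

```xml
<configuration>
  <!-- Normal console output -->
  <appender name="console" class="ch.qos.logback.core.ConsoleAppender">
    <encoder>
      <pattern>%d{HH:mm:ss.SSS} %-5level %logger{36} - %msg%n</pattern>
    </encoder>
  </appender>

  <!-- Bridges Logback events into the OpenTelemetry SDK installed above -->
  <appender name="otel"
            class="io.opentelemetry.instrumentation.logback.appender.v1_0.OpenTelemetryAppender">
    <!-- Forward all MDC entries (e.g. a device ID) as log-record attributes -->
    <captureMdcAttributes>*</captureMdcAttributes>
  </appender>

  <root level="INFO">
    <appender-ref ref="console"/>
    <appender-ref ref="otel"/>
  </root>
</configuration>
```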
Forgetting to append /v1/logs to the endpoint gave me a bit of trouble before I figured it out.
With the help of Mapped Diagnostic Context (MDC), I'm able to attach a randomly generated UUID so that I can filter my logs by deviceId:
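A minimal sketch of how attaching that value can look with SLF4J's MDC. The deviceId key and the DeviceIdProvider helper are my illustrative names, not from any library; in the real app the ID would be persisted (e.g. in SharedPreferences) so it survives restarts:

```kotlin
import org.slf4j.LoggerFactory
import org.slf4j.MDC
import java.util.UUID

// Hypothetical helper: generates the ID once per process.
object DeviceIdProvider {
    val deviceId: String by lazy { UUID.randomUUID().toString() }
}

fun main() {
    val logger = LoggerFactory.getLogger("ExTrack")
    // Every log statement on this thread now carries the deviceId, which the
    // OpenTelemetry Logback appender can forward as a log-record attribute
    // when MDC capture is enabled.
    MDC.put("deviceId", DeviceIdProvider.deviceId)
    logger.info("App started")
}
```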
Logging crashes
There are better tools for the job, including ACRA and Firebase Crashlytics. There’s even one within OpenTelemetry itself, but I wanted something basic that I could see directly in Loki.
By making use of the OpenTelemetry SDK and UncaughtExceptionHandler, I’m able to upload the log before the app shuts down:
val defaultUncaughtExceptionHandler = Thread.getDefaultUncaughtExceptionHandler()
val handler = object : UncaughtExceptionHandler {
    override fun uncaughtException(t: Thread, e: Throwable) {
        val processor = appComponent.logRecordProcessor()
        LOGGER.error("Uncaught exception", e)
        processor.forceFlush().join(200, TimeUnit.MILLISECONDS)
        processor.logRecordExporter.flush().join(200, TimeUnit.MILLISECONDS)
        // VERY IMPORTANT! Return the flow back to the system
        defaultUncaughtExceptionHandler?.uncaughtException(t, e)
    }
}
Thread.setDefaultUncaughtExceptionHandler(handler)
And I can see what happened in Loki:
This method feels a little hacky since I'm accessing the SDK directly, but I'll work with it since it gets me my logs.
Configuration summary
Conclusion
Now that my configuration is complete, I have a central place where I can search my logs across all my servers and services, while SSH remains a viable fallback.
In the next article, I’ll make use of distributed tracing to enable me to track the flow of execution from the app to the backend to the database using Tempo.
If you have any feedback, feel free to share it in the comments.
Thank you for your time.