Kevin Kimani

Next-Level Observability with OpenTelemetry

When something goes wrong in my applications, logging is almost always the first tool I reach for. I'll throw a few log statements at the start and end of a function, sprinkle some into the exception handlers, and I'll have a basic picture of what's happening. For simple services running on a single instance, that's usually enough; I can scroll through the log file, spot the error, and trace it back to its cause.

But as my systems have grown, that same approach has stopped working. Logs pile up from multiple sources, executions start interleaving, and the error I'm staring at no longer tells me enough. I can see that something failed, but I can't trace why.

In this tutorial, I'll walk you through how I moved beyond basic logging by instrumenting a Kotlin and Spring Boot backend service with OpenTelemetry. You'll learn how OpenTelemetry's tracing model gives you the execution context that logs alone can't provide. By the end of this guide, you'll have a working instrumented service and a clear mental model for building more observable backend systems.

Why I Needed Next-Level Observability

Modern backend systems are rarely linear. One operation might fan out to several downstream services, retry on failure, or execute concurrently across multiple instances and threads. All of these patterns create failure modes that are genuinely hard to explain after the fact.

I ran into this personally when a background job processing hundreds of records across overlapping executions started throwing errors. My logs showed the errors, but they couldn't tell me which execution each error belonged to, or whether the other concurrent executions succeeded or failed in the same way.

The gap between seeing an error and understanding what led to it is the problem. I'd have the error message and a timestamp, but no way to connect either to the broader execution context. In a system running multiple concurrent job executions, log lines from different runs freely interleave, and thread names get recycled by the thread pool. Without a way to uniquely identify each execution, every line in my log file was an isolated fact with no reliable way to group it with others from the same run.

Observability gave me a structured view of what my system did and in what order. It does this through traces: records of complete operations that carry unique identifiers. I can filter my logs by those trace identifiers and see the entire history of a specific execution at a glance. Metrics add another dimension by revealing patterns over time that no single log entry can show. Together, they turned my debugging from guesswork into a structured investigation.

What OpenTelemetry Provides

OpenTelemetry is an open-source observability framework that defines a unified model for collecting three types of signals from your applications:

  • Traces: represent the full lifecycle of a request or operation as it moves through your system. A trace is made up of spans, where each span represents a unit of work within the operation, such as an HTTP call or a background task. Each span carries a trace ID and a span ID as part of its context: the trace ID ties the span back to its parent operation (the trace), and the span ID uniquely identifies that specific step within it.
  • Metrics: capture aggregated measurements over time, such as operation durations and error rates, giving you a statistical view of overall system health.
  • Logs: represent discrete events that happened at a specific point in time. When correlated with trace context, logs stop being isolated entries in a file and become anchored events within a specific execution, which makes it easy to understand exactly what happened and why.
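To make the identifiers concrete: in the W3C Trace Context format that OpenTelemetry uses, a trace ID is 16 random bytes and a span ID is 8 random bytes, conventionally rendered as lowercase hex (32 and 16 characters respectively). A minimal Kotlin sketch of generating IDs in that shape, just to illustrate the format you'll see in the logs later:

```kotlin
import java.security.SecureRandom

// W3C Trace Context: trace IDs are 16 bytes, span IDs are 8 bytes,
// conventionally rendered as lowercase hex (32 and 16 characters).
fun randomHexId(numBytes: Int): String {
    val bytes = ByteArray(numBytes)
    SecureRandom().nextBytes(bytes)
    return bytes.joinToString("") { "%02x".format(it.toInt() and 0xff) }
}

fun main() {
    val traceId = randomHexId(16) // identifies the whole operation
    val spanId = randomHexId(8)   // identifies one step within it
    println("trace_id=$traceId span_id=$spanId")
}
```

This is only an illustration of the ID format; in practice, the SDK or the Java Agent generates and propagates these IDs for you.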

The OpenTelemetry ecosystem has three main components: instrumentation, the Collector, and exporters. Instrumentation is how you integrate OpenTelemetry into your application, using language-specific SDKs to create spans, record metrics, and propagate context. The Collector is an optional middleware component that receives, processes, and exports telemetry data to one or more backends. Exporters are the plugins that actually send your data to a specific destination, like Prometheus or Jaeger.

What made me commit to OpenTelemetry as a long-term strategy is its vendor neutrality. Before it existed, instrumentation was tightly coupled to specific vendors, and switching meant rewriting instrumentation code throughout your entire codebase. OpenTelemetry fixed that by separating instrumentation from destination; I can instrument my service once using the standard API, and then swap exporters as my infrastructure evolves.

Setting Up Next-Level Observability with OpenTelemetry

If you want to follow along, you'll need the following:

  • A recent JDK (Java 17 or later); the project ships with the Gradle wrapper, so no separate Gradle installation is needed
  • Git and curl, for cloning the starter template and downloading the OpenTelemetry Java Agent
  • A code editor of your choice

Here is a rough architecture diagram of what we'll build:

Rough architecture diagram

The application consists of a Spring Boot Service running in a single Java Virtual Machine (JVM). The Task Scheduler triggers the OrderSummaryJob at regular intervals. The job reads orders from an embedded H2 database via Spring Data JPA, processes them, and writes summaries back to the database. The OpenTelemetry Java Agent sits within the JVM, automatically instrumenting the job and injecting trace context into the Mapped Diagnostic Context (MDC). This context flows through the log output, allowing you to correlate all logs from a single execution when multiple executions run concurrently.

Setting Up the Starter Template

To keep things focused on observability, I've prepared a Kotlin and Spring Boot application with a scheduled job already in place:

git clone --single-branch -b starter-template https://github.com/kimanikevin254/jetbrains-otel-order-summary.git

Open the project in your code editor.

The most important file in this project is src/main/kotlin/com/example/order_summary/service/OrderSummaryJob.kt, which defines the scheduled job. The job reads orders created in the last 24 hours from an H2 database via Spring Data repositories, processes them one by one, and writes summaries back to the database. The job runs every five minutes using Spring's @Scheduled annotation. The summaries generated by this job can later be consumed by other parts of a larger system, such as dashboards, analytics pipelines, or downstream services that need a periodic snapshot of order volume and revenue.

The logging approach is straightforward and very common. You log when the job starts, log each order being processed, log if an exception occurs, and log when the job finishes:

@Service
class OrderSummaryJob(
   private val orderRepository: OrderRepository,
   private val orderSummaryRepository: OrderSummaryRepository
) {
   private val logger = LoggerFactory.getLogger(OrderSummaryJob::class.java)

   @Scheduled(fixedDelay = 300000) // 5mins in ms
   fun generateSummary() {
       logger.info("Starting order summary job...")

       val periodEnd = LocalDateTime.now()
       val periodStart = periodEnd.minusHours(24)

       val orders = orderRepository.findByCreatedAtAfter(periodStart)
       if (orders.isEmpty()) {
           logger.info("No orders found in the last 24 hours. Skipping summary generation.")
           return
       }
       logger.info("Found ${orders.size} orders to process")

       var processedCount = 0
       var totalAmount = BigDecimal.ZERO

       for (order in orders) {
           try {
               logger.info("Processing order ${order.id} for customer ${order.customerId}...")

               // Simulate processing work
               Thread.sleep(2000)

               // Simulate occasional failures
               if (order.amount > BigDecimal("400")) {
                   throw RuntimeException("Order amount exceeds threshold: ${order.amount}")
               }

               totalAmount = totalAmount.add(order.amount)
               processedCount++
           } catch (e: Exception) {
               logger.error("Failed to process order ${order.id}: ${e.message}")
               // Continue processing other orders
           }
       }

       val summary = OrderSummary(
           totalOrders = orders.size,
           totalAmount = totalAmount,
           periodStart = periodStart,
           periodEnd = periodEnd
       )

       orderSummaryRepository.save(summary)
       logger.info("Order summary job completed. Total: ${orders.size} orders, Amount $totalAmount")
   }
}

To see it in action, run the following command in your terminal:

./gradlew bootRun

Once the application starts, you can see the logs. Execute the command tail -f logs/order-summary.log in a separate terminal to stream the logs from the configured log file:

2026-02-24T15:49:47.304+03:00  INFO 605417 --- [order-summary] [scheduling-1] c.e.o.service.OrderSummaryJob            : Found 12 orders to process
2026-02-24T15:49:47.305+03:00  INFO 605417 --- [order-summary] [scheduling-1] c.e.o.service.OrderSummaryJob            : Processing order 1 for customer CUST-10001...
2026-02-24T15:49:49.306+03:00  INFO 605417 --- [order-summary] [scheduling-1] c.e.o.service.OrderSummaryJob            : Processing order 3 for customer CUST-10003...
2026-02-24T15:49:51.307+03:00  INFO 605417 --- [order-summary] [scheduling-1] c.e.o.service.OrderSummaryJob            : Processing order 7 for customer CUST-10007...
2026-02-24T15:49:53.308+03:00  INFO 605417 --- [order-summary] [scheduling-1] c.e.o.service.OrderSummaryJob            : Processing order 8 for customer CUST-10008...
2026-02-24T15:49:55.308+03:00  INFO 605417 --- [order-summary] [scheduling-1] c.e.o.service.OrderSummaryJob            : Processing order 9 for customer CUST-10009...
2026-02-24T15:49:57.310+03:00 ERROR 605417 --- [order-summary] [scheduling-1] c.e.o.service.OrderSummaryJob            : Failed to process order 9: Order amount exceeds threshold: 458.23
...

2026-02-24T15:50:11.322+03:00 ERROR 605417 --- [order-summary] [scheduling-1] c.e.o.service.OrderSummaryJob            : Failed to process order 19: Order amount exceeds threshold: 427.98
2026-02-24T15:50:11.340+03:00  INFO 605417 --- [order-summary] [scheduling-1] c.e.o.service.OrderSummaryJob            : Order summary job completed. Total: 12 orders, Amount 1680.31

The logs show clean, linear execution. Each step follows the previous, and errors are easy to associate with what was happening. This works well for infrequent, single-instance execution.

Introducing Complexity

As a project evolves, two changes might occur:

  • The business might demand near real-time visibility into the order metrics, which means the job needs to run more frequently, say every five seconds instead of every five minutes.
  • The application may be deployed across multiple instances for high availability.

Let's start by running the job more frequently to see how this affects our current logging approach. To do this, open the src/main/kotlin/com/example/order_summary/OrderSummaryApplication.kt file and add the following line of code to the main application class:

@EnableAsync

This enables Spring's asynchronous method execution, which the job will rely on via @Async in the next step. Remember to add the following import statement to the same file:

import org.springframework.scheduling.annotation.EnableAsync

Open the src/main/kotlin/com/example/order_summary/service/OrderSummaryJob.kt file and replace @Scheduled(fixedDelay = 300000) // 5mins with the following:

@Async
@Scheduled(fixedRate = 5000) // 5secs in ms

Changing from fixedDelay to fixedRate means the job starts every five seconds regardless of whether the previous execution has finished. Adding @Async ensures that each execution runs on its own thread from Spring's task executor pool, preventing slow jobs from blocking the scheduler. This is a common pattern when scaling background jobs to handle higher throughput.

Remember to add the following import statement to the same file:

import org.springframework.scheduling.annotation.Async
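The scheduling difference is easiest to see as arithmetic over start times: with fixedRate, run n is scheduled at n × period regardless of how long each run takes, while with fixedDelay, run n starts only delay seconds after run n − 1 finishes. A minimal pure-Kotlin sketch (no actual scheduler involved) for an 8-second job on a 5-second schedule:

```kotlin
// Start times (in seconds) of the first few runs of a job that takes
// `duration` seconds, under Spring's two scheduling policies.
fun fixedRateStarts(period: Long, runs: Int): List<Long> =
    // run n starts at n * period, no matter how long each run takes
    (0 until runs).map { it * period }

fun fixedDelayStarts(delay: Long, duration: Long, runs: Int): List<Long> {
    // run n starts `delay` seconds after run n-1 FINISHES
    val starts = mutableListOf(0L)
    repeat(runs - 1) { starts.add(starts.last() + duration + delay) }
    return starts
}

fun main() {
    // An 8-second job on a 5-second schedule:
    println(fixedRateStarts(5, 4))     // [0, 5, 10, 15]
    println(fixedDelayStarts(5, 8, 4)) // [0, 13, 26, 39]
}
```

With fixedRate (plus @Async so executions get their own threads), a new run starts before the previous one finishes, which is exactly what produces the interleaved logs below; with fixedDelay, runs can never overlap.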

Restart the application and observe the logs. You should see something like this:

2026-02-24T16:17:40.469+03:00  INFO 610799 --- [order-summary] [task-1] c.e.o.service.OrderSummaryJob            : Starting order summary job...
2026-02-24T16:17:40.596+03:00  INFO 610799 --- [order-summary] [task-1] c.e.o.service.OrderSummaryJob            : Found 12 orders to process
2026-02-24T16:17:40.597+03:00  INFO 610799 --- [order-summary] [task-1] c.e.o.service.OrderSummaryJob            : Processing order 1 for customer CUST-10001...
...
2026-02-24T16:17:47.468+03:00  INFO 610799 --- [order-summary] [task-2] c.e.o.service.OrderSummaryJob            : Processing order 3 for customer CUST-10003...
2026-02-24T16:17:48.602+03:00  INFO 610799 --- [order-summary] [task-1] c.e.o.service.OrderSummaryJob            : Processing order 9 for customer CUST-10009...
2026-02-24T16:17:49.469+03:00  INFO 610799 --- [order-summary] [task-2] c.e.o.service.OrderSummaryJob            : Processing order 7 for customer CUST-10007...
2026-02-24T16:17:50.460+03:00  INFO 610799 --- [order-summary] [task-3] c.e.o.service.OrderSummaryJob            : Starting order summary job...
2026-02-24T16:17:50.472+03:00  INFO 610799 --- [order-summary] [task-3] c.e.o.service.OrderSummaryJob            : Found 12 orders to process
2026-02-24T16:17:50.473+03:00  INFO 610799 --- [order-summary] [task-3] c.e.o.service.OrderSummaryJob            : Processing order 1 for customer CUST-10001...
...
2026-02-24T16:17:54.476+03:00  INFO 610799 --- [order-summary] [task-3] c.e.o.service.OrderSummaryJob            : Processing order 7 for customer CUST-10007...
2026-02-24T16:17:54.608+03:00  INFO 610799 --- [order-summary] [task-1] c.e.o.service.OrderSummaryJob            : Processing order 14 for customer CUST-10014...
2026-02-24T16:17:55.460+03:00  INFO 610799 --- [order-summary] [task-4] c.e.o.service.OrderSummaryJob            : Starting order summary job...
2026-02-24T16:17:55.473+03:00  INFO 610799 --- [order-summary] [task-4] c.e.o.service.OrderSummaryJob            : Found 12 orders to process
2026-02-24T16:17:55.474+03:00  INFO 610799 --- [order-summary] [task-4] c.e.o.service.OrderSummaryJob            : Processing order 1 for customer CUST-10001...
2026-02-24T16:17:55.475+03:00 ERROR 610799 --- [order-summary] [task-2] c.e.o.service.OrderSummaryJob            : Failed to process order 9: Order amount exceeds threshold: 458.23

The logs are now completely interleaved. Executions from task-1, task-2, and task-3 are all running simultaneously, processing the orders, and logging to the same output. When an error occurs, like the failure on order 9 at 16:17:55, it's not easy to figure out which job execution the log belongs to and which orders were successfully processed before the error occurred in that specific execution.

You might think searching by thread name, such as task-1, would solve this, but Spring's thread pool reuses threads. After task-1 finishes its first execution, the thread returns to the pool and later picks up the ninth execution, then the seventeenth, and so on. Searching by thread name therefore gives you mixed logs from multiple unrelated executions. In production, where multiple application instances run behind a load balancer, thread names aren't even unique across your system.

This is where plain logging breaks down: I can see that something failed, but I can't explain what happened leading up to it.

A Better Solution: Adding OpenTelemetry

To fix the missing execution context, I wanted each job execution to be treated as a logical unit of work, with a unique trace ID attached to it. All logs emitted during that job should include that ID. Even when multiple executions run concurrently and logs interleave, I can filter by trace ID and see exactly what happened in a single run.

OpenTelemetry provides several ways to instrument applications. I'll use the Java Agent here, which automatically instruments your application without requiring any changes to the source code. It's genuinely the path of least resistance.

Let's start by downloading the agent JAR file. Execute the following commands in the project root folder to create a directory for the agent and download it:

mkdir -p agents
curl -L https://github.com/open-telemetry/opentelemetry-java-instrumentation/releases/latest/download/opentelemetry-javaagent.jar \
-o agents/opentelemetry-javaagent.jar

Verify the download using the following command:

ls -lh agents/opentelemetry-javaagent.jar

You should see a file around 24MB in size.

Add the agents/ directory to your .gitignore file so that the JAR file is not committed to version control. You can use the command echo "agents/" >> .gitignore or add it manually.

Once you've confirmed that the agent was downloaded successfully, it's time to configure it. You need to attach it to the JVM when running the application by passing it as an argument. Open the build.gradle.kts file and add the following configuration:

tasks.bootRun {
   jvmArgs = listOf(
       "-javaagent:${projectDir}/agents/opentelemetry-javaagent.jar",
       "-Dotel.service.name=order-summary-service",
       "-Dotel.traces.exporter=logging",
       "-Dotel.metrics.exporter=none",
       "-Dotel.logs.exporter=none"
   )
}

Here's what each argument does:

  • -javaagent tells the JVM to load the OpenTelemetry agent before your application starts
  • -Dotel.service.name sets the name of your service in the telemetry data
  • -Dotel.traces.exporter=logging prints trace data to the console. No external backend is needed for this guide
  • -Dotel.metrics.exporter=none and -Dotel.logs.exporter=none disable metrics and log exporting since that is outside the scope of this guide

Lastly, you need to update the log patterns to include trace context. The OpenTelemetry agent automatically injects trace_id and span_id into the logging context (Mapped Diagnostic Context). To display these values in your application logs, open the src/main/resources/application.properties file and add the following:

logging.pattern.console=%d{HH:mm:ss.SSS} [%thread] [trace_id=%mdc{trace_id} span_id=%mdc{span_id}] %-5level %logger{36} - %msg%n
logging.pattern.file=%d{HH:mm:ss.SSS} [%thread] [trace_id=%mdc{trace_id} span_id=%mdc{span_id}] %-5level %logger{36} - %msg%n

The %mdc{trace_id} and %mdc{span_id} directives extract values from the MDC that the agent populates automatically.
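If you're wondering how this works under the hood: the MDC is essentially a per-thread map of key/value pairs that the logging framework reads when formatting each line, and the agent writes the active span's IDs into it. Here's a simplified pure-Kotlin sketch of that mechanism; it's a toy stand-in for the real SLF4J/Logback MDC, not what the agent actually ships:

```kotlin
// A toy stand-in for the Mapped Diagnostic Context: a per-thread map of
// key/value pairs that a log formatter can read. The real MDC lives in
// SLF4J/Logback; the OpenTelemetry agent writes trace_id/span_id into it
// whenever a span is active on the current thread.
object ToyMdc {
    private val context = ThreadLocal.withInitial { mutableMapOf<String, String>() }
    fun put(key: String, value: String) { context.get()[key] = value }
    fun get(key: String): String = context.get()[key] ?: ""
}

// Formats a line roughly the way the %mdc{...} pattern directives would.
fun formatLogLine(message: String): String =
    "[trace_id=${ToyMdc.get("trace_id")} span_id=${ToyMdc.get("span_id")}] $message"

fun main() {
    ToyMdc.put("trace_id", "da673f1ec49eba77264c5912584e7183")
    ToyMdc.put("span_id", "74c708e335a974e3")
    println(formatLogLine("Starting order summary job..."))
}
```

Because the map is thread-local, two concurrent executions on task-1 and task-2 each see their own trace_id, which is why the interleaved log lines carry the correct IDs.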

Now, let's restart the application and observe the logs:

17:19:43.715 [task-1] [trace_id=da673f1ec49eba77264c5912584e7183 span_id=74c708e335a974e3] INFO  c.e.o.service.OrderSummaryJob - Starting order summary job...
17:19:43.856 [task-1] [trace_id=da673f1ec49eba77264c5912584e7183 span_id=74c708e335a974e3] INFO  c.e.o.service.OrderSummaryJob - Found 12 orders to process
17:19:43.857 [task-1] [trace_id=da673f1ec49eba77264c5912584e7183 span_id=74c708e335a974e3] INFO  c.e.o.service.OrderSummaryJob - Processing order 1 for customer CUST-10001...
17:19:45.860 [task-1] [trace_id=da673f1ec49eba77264c5912584e7183 span_id=74c708e335a974e3] INFO  c.e.o.service.OrderSummaryJob - Processing order 3 for customer CUST-10003...

...

17:19:53.704 [task-3] [trace_id=4a969bbb00634e0ee36b2fbda1399d8a span_id=0a602f1a58df2f71] INFO  c.e.o.service.OrderSummaryJob - Starting order summary job...
17:19:53.715 [task-3] [trace_id=4a969bbb00634e0ee36b2fbda1399d8a span_id=0a602f1a58df2f71] INFO  c.e.o.service.OrderSummaryJob - Found 12 orders to process
17:19:53.715 [task-3] [trace_id=4a969bbb00634e0ee36b2fbda1399d8a span_id=0a602f1a58df2f71] INFO  c.e.o.service.OrderSummaryJob - Processing order 1 for customer CUST-10001...
17:19:53.868 [task-1] [trace_id=da673f1ec49eba77264c5912584e7183 span_id=74c708e335a974e3] ERROR c.e.o.service.OrderSummaryJob - Failed to process order 9: Order amount exceeds threshold: 458.23

All logs from the first execution share the same trace_id (da673f1ec49eba77264c5912584e7183), while logs from the third execution have a different trace_id (4a969bbb00634e0ee36b2fbda1399d8a). Even though both executions are running concurrently and their logs are interleaved, you can now filter by trace ID to isolate a single execution.

For example, to see logs only from the first execution, you could search for trace_id=da673f1ec49eba77264c5912584e7183 in a log aggregation tool such as Amazon CloudWatch.
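Locally, the equivalent is a single grep over the log file (grep 'trace_id=da673f1ec49eba77264c5912584e7183' logs/order-summary.log). The filtering logic itself is trivial once every line carries the ID, as this small Kotlin sketch with made-up log lines shows:

```kotlin
// Given interleaved log lines from concurrent executions, keep only
// the lines belonging to a single trace.
fun filterByTraceId(logLines: List<String>, traceId: String): List<String> =
    logLines.filter { it.contains("trace_id=$traceId") }

fun main() {
    // Hypothetical interleaved output from two concurrent executions
    val lines = listOf(
        "[trace_id=aaa span_id=111] Starting order summary job...",
        "[trace_id=bbb span_id=222] Starting order summary job...",
        "[trace_id=aaa span_id=111] Processing order 1...",
        "[trace_id=bbb span_id=222] Failed to process order 9",
    )
    // Prints only the two lines from the first execution
    filterByTraceId(lines, "aaa").forEach(::println)
}
```

Real trace IDs are fixed-length (32 hex characters), so a simple substring match like this is unambiguous in practice.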

All the code used in this tutorial is available on GitHub.

Where to Go From Here

Once I had trace IDs in my logs, a few next steps were obvious.

For even more powerful filtering, I could add structured fields to logs for record IDs or job phases, for example, logging order_id as a dedicated field alongside the trace ID, letting me query all executions that touched a specific order.

I could also export logs alongside traces to an observability backend like Jaeger or Grafana, which would let me visualize the full trace as a timeline showing how long each step took and where errors occurred. The OpenTelemetry agent supports multiple backends by simply changing the exporter configuration, so I can start with logging (as we did here) and migrate to a full observability platform later without touching my instrumentation code.
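As a sketch of what that swap might look like (assuming a Collector or Jaeger instance listening on the default OTLP gRPC port, 4317), only the flags in build.gradle.kts change; the application code stays untouched:

```kotlin
tasks.bootRun {
    jvmArgs = listOf(
        "-javaagent:${projectDir}/agents/opentelemetry-javaagent.jar",
        "-Dotel.service.name=order-summary-service",
        // Swap the console exporter for OTLP; everything else is unchanged
        "-Dotel.traces.exporter=otlp",
        "-Dotel.exporter.otlp.endpoint=http://localhost:4317"
    )
}
```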

The same pattern works for API handlers, other background jobs, or any async work in the system. Once OpenTelemetry is in place, every part of the application automatically benefits from trace context propagation.

Conclusion

Good observability isn't just about adding more logs. It's about adding structure and context to the signals your system already emits. With OpenTelemetry, I was able to turn interleaved, confusing output into isolated, queryable execution traces.

Building with observability in mind changes how I think about system design. Traces reveal where boundaries should exist. Metrics show trends that make capacity planning and SLO definition more grounded in actual production behavior. And well-instrumented code is simply more readable. When every operation is traced, you think more carefully about what constitutes a meaningful unit of work, which benefits both the tools and the people reading the code later.
