Making Instrumentation Extensible

#instrumentation #observability #sre #devops

Observability-driven development requires both rich query capabilities and sufficient instrumentation in order to capture the nuances of developers' intention and useful dimensions of cardinality. When our systems are running in containers, we need an equivalent to our local debugging tools that is as easy to use as Printf and as powerful as gdb. We should empower developers to write instrumentation by ensuring that it's easy to add context to our data, and requires little maintenance work to add or replace telemetry providers after the fact. Instead of thinking about individual counters or log lines in isolation, we need to consider how the telemetry we might want to transmit fits into a wider whole.

Aggregated counters, gauges, and histograms can provide us with information broken down by host or endpoint, but not necessarily higher cardinality fields that we need for a rich understanding of our distributed systems. Automatic instrumentation of a language server framework, such as framework support for Node.js Express or Go http.Server in Honeycomb's Beelines, can only provide a modest amount of context. It will capture request header fields such as URL and response durations/error codes, but not anything from the business logic or involving the logged-in user's metadata. Because observability requires the ability to understand the impact of user behavior upon our applications, we cannot just stop with collecting surface level data. Thus, we'll need to make changes to our code to instrument it.

Instrumentation should be reusable

Typically, instrumenting code involves adding a vendor's library or a standard package like OpenCensus or slf4j to one's dependencies, then calling the library directly from instrumented code. If multiple providers and kinds of telemetry (e.g. logs, metrics, traces, events…) are in use, calls to each wind up sprinkled across the codebase. But should we have to re-instrument our entire codebase every time we gain access to new methods of data aggregation/visualization or change observability providers? Of course not. This gives rise to the need to separate observability plumbing from your business logic, or domain-specific code.

To address this problem of abstracting instrumentation, Tyler Treat envisions a solution involving centralized inter-process collection in "The Observability Pipeline", and Pete Hodgson suggests abstracting collection in-process in "Domain Oriented Observability". Tyler's article explains creating structured events, and then streaming them to an out of process service that can aggregate them and send them onwards to a variety of instrumentation sinks. Pete's article suggests creating a separate class for handling the vendor-specific pieces of instrumentation, but still relies upon tight coupling between the instrumentation code and the domain-specific code--e.g. creating a method in the instrumentation code to handle each potential property we might want to record about an event (e.g. discountCodeApplied(), discountLookup{Failed,Succeeded}()).

Why not both?

However, there's a simpler, within-process approach that is easier for developers to understand, test, configure, maintain, and operate. It's a fusion of that described by Pete in "Event-Based Observability" and "Collecting Instrumentation Context", and by Tyler's distributed event buffering solution. With the improved solution, we neither need to have an advanced understanding of mocking functions and classes, nor do we need to operate a Kafka pipeline from day 0. Instead, we just generate and consume structured events.

Within each span of work done in the domain-specific business logic, we populate a weakly-typed context dictionary with key/value pairs added from within instrumented code, as well as the default standard contextual fields (e.g. requestId, startTime, endTime, etc). Child units of work become separate contexts and spans, with appropriate fields (e.g. parentId, requestId) templated (or "partially applied" in Pete's words) from the parent context/span. Adding telemetry becomes as easy as Printf for developers -- it's just setting a ctx[key] = val only for keys and values relevant to your code. We no longer need to create one function call to the instrumentation adapter for each telemetry action. Using Pete's example, we might set discountCode => FREESHIPPING, responseCode => 403, or discountLookupSucceeded => {true,false,nil} within one event instead of making the multiple function calls above, or emitting multiple distinct "Announcement" objects for only one work unit. Writing tests to validate that the generated context map is correct becomes straightforward to do in table-based testsuites (e.g. go functest), rather than requiring mocking functions and classes.

Once the work unit finishes, its context dictionary is sent in-process to the instrumentation adapter where any number of listeners can interpret it. Each listener sees the context maps for each received event, decides whether it's relevant to it, and if so, translates it according to its own rules into metrics, traces/events/structured logs, or human-readable logs. We no longer need to duplicate calls to the same instrumentation provider from each kind of telemetry function call, but can create single listeners for each common metric (e.g. response time metrics collection, response code) that act on a wider range of events. We can then measure the correctness of listeners, ensuring that each processor is only interested in the correct set of structured events, and dispatches them to the upstream structured event, log, metric, or trace provider(s)' APIs appropriately.

Correspondence is more useful when it's about the outcome

Unlike Tyler's streaming design there need not be a 1:1 correspondence between listeners/routers and instrumentation sinks. Instead, the correspondence is between the action we'd like to coalesce or report on, and what related calls we make -- e.g. performing more than one metric counter increment, etc. to the same sink, or even scattering increments across many different sinks if we're transitioning between providers. This makes the code much more testable, as it's focused on the intent of "record these values from this specific kind of event, to whatever Sinks are relevant", rather than a catch-all of "duplicate everything in the kitchen we do in Sink A in Sink B instead". And the value of event stores such as Honeycomb quickly becomes clearer -- because you don't have to do anything different to aggregate or process each such structured event, only pass it on to us directly. Let us worry about how to efficiently query the data when you ask a question, such as P99(duration_ms) or COUNT WHERE err exists GROUP BY customer_id ORDER BY COUNT desc.

Decoupling event creation from event consumption, even within the same process, is a great step between instrumentation spaghetti and needing a Kafka, Kinesis or PubSub queue. Never create a distributed system unless you need to, and run as few distributed systems as possible. Same-process structured event creation and structured event consumption is super easy to work with, test, and reason about, to boot! As you grow and your needs scale, you may wind up reaching for that Kafka queue. But you'll have an easier migration path, if so.

Ideas for future-proofing

How does this relate to OpenTelemetry née Open{Census,Tracing}? Despite the creation of the new consensus standard, the ongoing transition to OpenTelemetry is proof indeed that we ought to future-proof our work by ensuring we can switch to and from instrumentation providers, including those that do not support the current newest standard, without further breaking domain code. Instead of using the OpenTelemetry API directly within your domain-specific code, it still may be wise to use one context/span propagation library of your choice (which could still be OTel's), and write an InstrumentationAdapter that passes data it receives through to OpenTelemetry's metrics & trace consumers, as well as to legacy and future instrumentation providers.

I hope that this article was helpful! If you're looking for more detailed examples of how Honeycomb Beelines work, check out our Examples repo in Github , such as this example of using our Beeline for Go alongside custom instrumentation.

Looking to find out more? Get started with Honeycomb for free.