Understanding OpenTelemetry and Observability for SRE

Yash Nigam — Sun, 27 Oct 2024 20:19:57 +0000

An understanding of OpenTelemetry and observability is essential for an SRE in any organization. This blog post is my attempt to lay out a good understanding of OpenTelemetry (OT) after reading the following book:

Cloud-Native Observability with OpenTelemetry from Packt Publishing

At a high level, OT can be described as:

A framework to produce telemetry from your applications using open standards

Built around the concept of signals: traces, metrics, and logs

Telemetry for these signals is produced using the OT APIs

Provides tools to gain visibility into the performance of your services by combining tracing, metrics, and logging

Allows you to instrument your application code through vendor-neutral APIs, libraries and tools

Before proceeding with OpenTelemetry, let us go through some related concepts and technologies:

Cloud Native Applications

There has been a shift to a microservices-based architecture for deploying and running applications, aided by cloud technologies such as Kubernetes and serverless.
Applications are now distributed across multiple cloud services and scaled horizontally, producing logs in multiple places.
The services are loosely coupled and operate independently.
In such setups, latency is introduced between calling services, as each service sits in its own container.

A Shift towards DevOps

Small teams (4 to 6 people) manage their own microservices.
Developers own the code through its whole lifecycle: they write, test, build, package, deploy and operate it in production (with the aid of SREs).
This accelerates feature development.
However, as the number of microservices increases, no one has the full picture, and it becomes difficult to find what caused an outage.
Dev teams have to learn multiple tools for building, deploying, monitoring, etc., which shifts their focus away from their main task: coding.
They may struggle to identify the root cause of production issues, as there is not enough visibility across the managed systems.

Observability

Observability can be defined in different ways:

As per https://en.wikipedia.org/wiki/Observability, "In control theory, observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs."
The ability to answer questions such as:
- Is the system doing what I think it should be?
- If a problem occurred in production, what evidence would you have to be able to identify it?
- Why is this service suddenly overwhelmed when it was fine just a minute ago?
- If a specific condition from a client triggers an anomaly in some underlying service, would you know it without customers or support calling you?
Empowering the people who build and operate distributed applications to understand their code's behaviour while running in production

OT ultimately enables observability for the application on which it is configured. Historically, observability has been achieved using the following:

1. Centralized logging

For an application which is large and distributed across enough systems, searching through the logs on individual machines is not practical.
Applications can also run on ephemeral machines that may no longer be present when we need those logs.
The logs need to be available in a central location for persistent storage and searchability, and thus centralized logging was born.
Tools for logging
- Fluentd
- Logstash
- Apache Flume

2. Metrics and dashboards

Measuring application and system performance by collecting metrics (signals).
Metrics can also be used to configure alerting, for example when an error rate becomes greater than an acceptable percentage.
Tools:
- Prometheus
- StatsD
- Graphite
- Grafana
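
As a rough illustration of the alerting idea above, the sketch below computes an error rate from two counts and compares it with a threshold. The function names and the 5% threshold are assumptions made for this example; in practice a rule like this would usually live in the alerting layer of a monitoring tool rather than in application code.

```python
def error_rate(error_count: int, request_count: int) -> float:
    """Return the fraction of requests that failed (0.0 when there is no traffic)."""
    if request_count == 0:
        return 0.0
    return error_count / request_count


def should_alert(error_count: int, request_count: int, threshold: float = 0.05) -> bool:
    """Hypothetical rule: alert once the error rate exceeds the acceptable threshold."""
    return error_rate(error_count, request_count) > threshold


# 12 failures out of 200 requests is a 6% error rate, above the assumed 5% threshold.
print(should_alert(error_count=12, request_count=200))
```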

3. Tracing and analysis

Tracing an application means being able to step through the application code and verify it is doing what is expected (generally done via a debugger in an IDE).
This becomes impossible when debugging an application that is spread across multiple services on different hosts across a network.
Google whitepaper on the same: Dapper (https://research.google/pubs/pub36356/)
Tools:
- OpenTracing
- Zipkin
- Jaeger

4. Challenges

Multiple tools for logging, tracing and metrics monitoring
Multiple standards, libraries, methods
Time needed to instrument the code/application to generate logs, traces and metrics, and time needed to integrate the tools, depending on complexity
Uncertain return on investment (ROI) for this effort

Describing OpenTelemetry

OT is an ecosystem, or a framework, for applications running on cloud-native services. Its goals are to:

Standardize how applications are instrumented and how telemetry data is generated, collected, and transmitted

Give users the tools necessary to correlate that telemetry across systems, languages, and applications

To do so, it provides:

An open specification

Language-specific APIs and SDKs

Instrumentation libraries

Semantic conventions

An agent to collect telemetry

A protocol to organize, transmit, and receive the data

OpenTelemetry has implementations in 11 languages

Core Concepts / Categories of Concerns of OpenTelemetry

1. Signals

Signals represent the core of the telemetry data that is generated by instrumenting an application.
The signals are: a) Tracing b) Baggage c) Metrics d) Logging
The real power of OpenTelemetry is that it allows its users to correlate data across signals to get a better understanding of their systems.

2. Specification

https://github.com/open-telemetry/opentelemetry-specification

3. Data Model

4. API

Providing users with an API allows them to go through the process of instrumenting their code in a way that is vendor-agnostic.
The API is decoupled from the code that generates the telemetry, allowing users the flexibility to swap out the underlying implementations as they see fit
By design, a user who instruments their code using the API but does not configure the SDK will not see any telemetry produced.
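
The decoupling described above can be sketched with a toy model. This is not the OpenTelemetry API; `NoOpTracer`, `RecordingTracer`, `set_tracer` and `get_tracer` are hypothetical names used only to show why instrumented code keeps working, silently, until an implementation is registered.

```python
from contextlib import contextmanager


class NoOpTracer:
    """Stand-in for what an API hands out when no SDK is configured: it records nothing."""

    @contextmanager
    def start_span(self, name):
        yield None


class RecordingTracer:
    """Stand-in for an SDK-backed tracer: it keeps the names of finished spans."""

    def __init__(self):
        self.finished = []

    @contextmanager
    def start_span(self, name):
        try:
            yield name
        finally:
            self.finished.append(name)


_tracer = NoOpTracer()  # default until an implementation is registered


def set_tracer(tracer):
    """Hypothetical registration hook, playing the role of configuring an SDK."""
    global _tracer
    _tracer = tracer


def get_tracer():
    return _tracer


def add_to_cart(item):
    # Instrumented application code depends only on the API-level functions.
    with get_tracer().start_span("add_to_cart"):
        return f"added {item}"


add_to_cart("orange")      # no SDK configured: runs fine, produces no telemetry
sdk_tracer = RecordingTracer()
set_tracer(sdk_tracer)
add_to_cart("orange")      # same code, telemetry is now recorded
print(sdk_tracer.finished)
```

The application function never changes; only the registered implementation does, which is what lets users swap the underlying implementation.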

5. SDK

The SDK does most of the heavy lifting in OT.
It implements the underlying system that generates, aggregates, and transmits telemetry data.
It provides the controls to configure how telemetry should be collected, where it should be transmitted, and how.
Configuration of the SDK is supported via in-code configuration, as well as via environment variables defined in the specification.
As it is decoupled from the API, using the SDK provided by OpenTelemetry is an option for users, but it is not required. Users and vendors are free to implement their own SDKs.

6. Instrumentation Libraries

Instrumentation libraries ensure users can get up and running quickly.
They provide instrumentation for popular open source projects and frameworks. In Python, the instrumentation libraries cover Flask, Requests, Django, and others.

7. Pipelines

Pipelines take the telemetry generated for each signal and export it to a data store.
Each signal implementation offers a series of mechanisms to generate, process, and transmit telemetry.
Provider > Generator > Processor > Exporter

Providers

The starting point of the telemetry pipeline is the provider.
A provider is a configurable factory that is used to give application code access to an entity used to generate telemetry data.
Although multiple providers may be configured within an application, a default global provider may also be made available via the SDK.
Providers should be configured early in the application code, prior to any telemetry data being generated.

Generators

To generate telemetry data at different points in the code, the SDK makes available a telemetry generator that is instantiated by a provider.
This generator is what most users will interact with through the instrumentation of their application and the use of the API.
Generators are named differently depending on the signal: for example, the tracing signal calls this a tracer, and the metrics signal calls it a meter.

Processors

Once the telemetry data has been generated, processors provide the ability to further modify the contents of the data.
Processors may determine the frequency at which data should be processed or how the data should be exported.

Exporters

Exporters translate the internal data model of OpenTelemetry into the format expected by the configured destination.
Multiple export formats and protocols are supported by the OpenTelemetry project:
- OpenTelemetry Protocol (OTLP)
- Console
- Jaeger
- Zipkin
- Prometheus
- OpenCensus
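
To make the Provider > Generator > Processor > Exporter flow more concrete, here is a toy pipeline written with plain Python classes. It is a sketch under simplifying assumptions, not the OpenTelemetry SDK: every class name below is hypothetical, and the exporter only appends to an in-memory list. In the Python SDK the tracing counterparts are classes such as `TracerProvider`, `BatchSpanProcessor` and `ConsoleSpanExporter`.

```python
class ListExporter:
    """Exporter stand-in: 'sends' data by appending it to an in-memory list."""

    def __init__(self):
        self.exported = []

    def export(self, record):
        self.exported.append(record)


class AttributeProcessor:
    """Processor stand-in: enriches each record, then hands it to the exporter."""

    def __init__(self, exporter, extra):
        self.exporter = exporter
        self.extra = extra

    def on_end(self, record):
        self.exporter.export({**record, **self.extra})


class Tracer:
    """Generator stand-in: the object application code calls to create telemetry."""

    def __init__(self, name, processors):
        self.name = name
        self.processors = processors

    def record_span(self, span_name):
        record = {"tracer": self.name, "span": span_name}
        for processor in self.processors:
            processor.on_end(record)


class Provider:
    """Provider stand-in: a configurable factory that hands out generators."""

    def __init__(self):
        self.processors = []

    def add_processor(self, processor):
        self.processors.append(processor)

    def get_tracer(self, name):
        return Tracer(name, self.processors)


# Configure the pipeline early, before any telemetry is generated.
exporter = ListExporter()
provider = Provider()
provider.add_processor(AttributeProcessor(exporter, {"environment": "demo"}))

tracer = provider.get_tracer("shopper")
tracer.record_span("browse")
print(exporter.exported)
```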

8. Resources

Resources are used to identify the source of the telemetry data, whether a machine, container, or function.
They are used at analysis time to correlate different events occurring in the same resource.
Resource attributes are added to the telemetry data from signals at export time.
Resources are associated with providers.

9. Context propagation

Context propagation is the core concept of distributed tracing.
It provides the ability to pass valuable contextual information between services that are separated by a logical boundary.
It is what allows distributed tracing to tie requests together across multiple systems.
It allows user-defined values (baggage) to be propagated as well.
The OpenTelemetry specification defines a context API for this purpose.
Python has a built-in context mechanism, ContextVar (in the contextvars module).
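
Since ContextVar is mentioned, here is a small standard-library sketch of how a value can travel down a call stack without being passed as an argument. The variable name `current_trace_id` and the functions are hypothetical. This only covers context within one process; crossing a service boundary additionally requires writing the context into something like HTTP headers.

```python
import contextvars

# Hypothetical context key holding the ID of the request currently being handled.
current_trace_id = contextvars.ContextVar("current_trace_id", default=None)


def handle_item():
    # Deep in the call stack, the value is read without being passed as an argument.
    return f"trace={current_trace_id.get()}"


def handle_request(trace_id):
    token = current_trace_id.set(trace_id)
    try:
        return handle_item()
    finally:
        current_trace_id.reset(token)  # restore the previous value


print(handle_request("abc123"))
print(current_trace_id.get())
```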

Auto-Instrumentation, Manual Instrumentation and Challenges

Why auto-instrumentation
- The upfront cost of instrumenting code can be a deterrent to even getting started, especially if a solution is complicated to implement and will fail to deliver any value for a long time.
- Auto-instrumentation looks to alleviate some of the burdens of instrumenting code manually
Challenges of manual instrumentation
- The libraries and APIs that are provided by telemetry frameworks can be hard to learn how to use
- Instrumenting applications can be tricky. This can be especially true for legacy applications where the original author of the code is no longer around
- Knowing what to instrument and how it should be done takes practice
- Modifying code means compiling the code, building the artifact and deploying again
- Instrumentation may need to be disabled for a specific code block/module/plugin
Components of auto-instrumentation
- 1. Instrumentation libraries
  - Python: Flask, Django, boto
- 2. Agent/runner
  - Automatically invokes the instrumentation libraries without additional work on the part of the user
  - Configures OpenTelemetry and loads the instrumentation libraries that can then be used to generate telemetry
  - What it cannot do
    - It cannot instrument application-specific code
    - It may instrument things you're not interested in. This may result in the same network call being recorded multiple times, or in generated data that you're not interested in using
Instrumentation libraries in Python
- Calls to instrumented libraries are intercepted and replaced at runtime via a technique known as monkey patching (https://en.wikipedia.org/wiki/Monkey_patch).
- The instrumenting library receives the original call, produces telemetry data, and then calls the underlying library.
- The Python implementation ships a script, opentelemetry-instrument, that can be called to wrap any Python application.
- The opentelemetry-instrument script finds all the instrumentations that have been installed in an environment by loading the entry points registered under the opentelemetry_instrumentor name.

Overview of Traces, Spans, Logs and Metrics using a sample application with OpenTelemetry

A sample application running in a Docker Compose environment.

3 application containers which emit data: shopper, grocery store and legacy-inventory
An OpenTelemetry Collector
Loki: visualized by Grafana
Jaeger: http://localhost:16686
Prometheus: http://localhost:9090/
Grafana: http://localhost:3000/explore

Traces

Trace Context specification
- https://www.w3.org/TR/trace-context-1/
A distributed trace contains events that cross process, network and security boundaries
The work captured in a trace is broken into separate units or operations, each represented by a span
This specification defines standard HTTP headers and a value format to propagate context information that enables distributed tracing scenarios
Distributed tracing is the foundation behind the tracing signal of OpenTelemetry.
A distributed trace is a series of event data generated at various points throughout a system tied together via a unique identifier.
This identifier is propagated across all components responsible for any operation required to complete the request, allowing each operation to associate the event data to the originating request
Example Jaeger trace

A Trace shows
- Trace ID
- Start date time
- Duration
- Count of services

Spans

A span can represent a method call or a subset of the code being called within a method.
Multiple spans within a trace are linked together in a parent-child relationship, with each child span containing information about its parent.
The first span in a trace is called the root span and is identified because it does not have a parent span identifier
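
The parent-child relationship can be sketched with a small data class. This is a toy model rather than the OpenTelemetry span type; the `Span` class and the span names are hypothetical, and the IDs are random hex strings of 16 bytes (trace) and 8 bytes (span).

```python
import secrets
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    name: str
    trace_id: str
    parent_id: Optional[str] = None  # None marks the root span
    span_id: str = field(default_factory=lambda: secrets.token_hex(8))

    def child(self, name: str) -> "Span":
        # A child shares the trace ID and records its parent's span ID.
        return Span(name=name, trace_id=self.trace_id, parent_id=self.span_id)


root = Span(name="visit store", trace_id=secrets.token_hex(16))
browse = root.child("browse")
add_item = browse.child("add item to cart")

for span in (root, browse, add_item):
    print(span.name, "parent:", span.parent_id)
```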

In the example Jaeger trace, two spans can be seen:
The first has a duration of 7.01 milliseconds, the second 260 milliseconds
Each span has a span ID
Tags: key-value pairs which give information about the operation being performed
Process: represents which process executed this operation
SpanContext:
- Contains information about the trace and must be propagated throughout the system.
- The elements of a trace available within a span context include the following:
  - A unique identifier, referred to as a trace ID, identifies the request through the system.
  - A second identifier, the span ID, is associated with the span that last interacted with the context. This may also be referred to as the parent identifier.
  - Trace flags include additional information about the trace, such as the sampling decision and trace level.
  - Vendor-specific information is carried forward using a trace state field. This allows individual vendors to propagate information necessary for their systems to interpret the tracing data.
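
The W3C Trace Context specification carries the trace ID, parent (span) ID and trace flags in a `traceparent` HTTP header of the form `version-traceid-parentid-flags`. The parser below is a minimal sketch for version `00` only; `TraceParent` and `parse_traceparent` are hypothetical names, and a production implementation would validate more (hex characters, all-zero IDs, future versions).

```python
from typing import NamedTuple


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> TraceParent:
    """Parse a version-00 traceparent value: version-traceid-parentid-flags."""
    version, trace_id, parent_id, flags = header.strip().split("-")
    if len(trace_id) != 32 or len(parent_id) != 16 or len(flags) != 2:
        raise ValueError(f"malformed traceparent: {header!r}")
    # Bit 0 of the flags byte carries the sampling decision.
    sampled = bool(int(flags, 16) & 0x01)
    return TraceParent(version, trace_id, parent_id, sampled)


# Example value taken from the W3C Trace Context specification.
ctx = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
print(ctx.trace_id, ctx.parent_id, ctx.sampled)
```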

Metrics

Metrics provide information about the state of a running system to developers and operators.
The data collected via metrics can be aggregated over time to identify trends and patterns in applications, and graphed through various tools and visualizations.
Metrics are critical to monitoring the health of an application and deciding when an on-call engineer should be alerted.
Metrics form the basis of service level indicators (SLIs) (https://en.wikipedia.org/wiki/Service_level_indicator) that measure the performance of an application.
These indicators are then used to set service level objectives (SLOs) (https://en.wikipedia.org/wiki/Service-level_objective) that organizations use to calculate error budgets.
The OpenTelemetry metrics signal draws primarily on existing formats:
- OpenMetrics
- StatsD
- Prometheus
Metrics may capture data using various data point types.
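
To show how an SLO turns into an error budget, here is a short worked calculation. The 99.9% target and the request counts are assumed numbers chosen for illustration, and the function names are hypothetical.

```python
def error_budget(slo_target: float, total_requests: int) -> int:
    """Number of requests allowed to fail in the window without breaching the SLO."""
    return round((1 - slo_target) * total_requests)


def budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> int:
    """Failures still affordable before the SLO is breached (negative once breached)."""
    return error_budget(slo_target, total_requests) - failed_requests


# Assumed numbers: a 99.9% availability SLO over 1,000,000 requests.
print(error_budget(0.999, 1_000_000))
print(budget_remaining(0.999, 1_000_000, 400))
```

With these numbers the budget is 1,000 failed requests, so 400 failures leave 600 before the objective is missed.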

Searching for a Metric in Prometheus

Prometheus collects and stores metrics over time in a time series database, which can be queried by metric name.
The request counter metric is a counter which counts the number of incoming requests to a service.
After querying by the metric name "request_counter", 3 rows are returned.

Each row is for a different service and shows the request count value, which is an integer. As the metric is of type counter, its value only increases.
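
The behaviour of a counter with one row per service can be mimicked in a few lines. This is a toy class, not a Prometheus or OpenTelemetry client, and the service label values are assumptions based on the sample application's service names.

```python
from collections import defaultdict


class RequestCounter:
    """Toy monotonic counter keyed by a service label (not a real metrics client)."""

    def __init__(self):
        self._values = defaultdict(int)

    def add(self, service: str, amount: int = 1) -> None:
        if amount < 0:
            raise ValueError("a counter can only increase")
        self._values[service] += amount

    def rows(self):
        # One row per label value, similar to one time series per service.
        return dict(self._values)


request_counter = RequestCounter()
for service in ("shopper", "grocery-store", "grocery-store", "legacy-inventory"):
    request_counter.add(service)

print(request_counter.rows())
```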

Logs

A log is a record of events written to output.
Loki stores all the logs generated by the grocery store application, and Grafana is used to view them.
Filter the logs using the {job="shopper"} query to retrieve all the logs generated by the shopper application.
A normal message on console output would be:
shopper | INFO:shopper:message="add orange to cart"

However, the same message in Loki contains more details, such as trace ID, span ID, time, etc.
Because the log contains the trace ID, it can be correlated in Jaeger with the trace and span details.
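
One way to picture how a trace ID ends up in a log line is a standard-library `logging.Filter` that attaches IDs to each record. This is a sketch and not how the sample application is wired: `TraceContextFilter` is a hypothetical class, the IDs are hard-coded placeholders, and a real setup would read them from the active span.

```python
import io
import logging


class TraceContextFilter(logging.Filter):
    """Attach trace and span IDs to every record so the formatter can print them."""

    def __init__(self, trace_id: str, span_id: str):
        super().__init__()
        self.trace_id = trace_id
        self.span_id = span_id

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = self.trace_id
        record.span_id = self.span_id
        return True  # never drop the record, only enrich it


stream = io.StringIO()  # captured in memory here; a real setup would ship logs onward
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter(
    "%(levelname)s:%(name)s:trace_id=%(trace_id)s span_id=%(span_id)s %(message)s"
))

logger = logging.getLogger("shopper")
logger.setLevel(logging.INFO)
logger.propagate = False
logger.addHandler(handler)
# Hypothetical IDs; in a real application they would come from the active span.
logger.addFilter(TraceContextFilter("4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7"))

logger.info('message="add orange to cart"')
print(stream.getvalue())
```

With the trace ID present in every line, a log entry can be matched to the corresponding trace in a tracing backend.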

DEV Community: Yash Nigam