Tracing and Metrics
Let's look at how tracing and metrics are collected and used within the system. We will follow a single request to see where they are added and implemented.
Tracing Lifecycle
Tracing is used to observe the path a request takes through the system, providing a holistic view of the processing flow, including execution times, dependencies, and errors along the way. Here's a detailed breakdown:
- Request Ingress (HTTP Layer):
  - Instrumentation: OpenTelemetry (OTEL) `HttpInstrumentation` automatically creates a new root span when a request comes in. The span's name is `http.request`. The span's context (traceId, spanId) can be accessed through the `trace.getActiveSpan()` function.
  - Attributes: The instrumentation also sets relevant attributes on the span, including the tenant id (if available), method, url, status code, route, headers, and more, based on the `applyCustomAttributesOnSpan` and `headersToSpanAttributes` configuration in the `HttpInstrumentation`.
  - Logging: The `logRequest` plugin is executed, capturing the request information in a log line.
  - Context Propagation: The span's context (including `traceId` and `spanId`) is attached to the current asynchronous execution context, ensuring consistent propagation to downstream operations.
- Authentication and Authorization (HTTP Layer):
  - Plugin: The `jwt` plugin is activated.
  - Span Creation: The `ClassInstrumentation` plugin detects `jwt.verify` calls, then creates and ends a new span named `jwt.verify`, which represents authentication.
  - Span Attributes: The span has metadata such as the role that comes from the payload.
- Database Operations (Internal Layer):
  - Plugin: The `db` plugin initializes the database connection pool and retrieves a connection.
  - Span Creation: The `ClassInstrumentation` plugin detects `StorageKnexDB.runQuery` calls and creates and ends a new span named `StorageKnexDB.runQuery.{{queryName}}`.
  - Span Attributes: The spans store the query name if present.
  - Context Propagation: The database query is instrumented using `@opentelemetry/instrumentation-knex`, and any calls to the database or database pool will be automatically linked to the main trace context.
- Business Logic (Storage Layer):
  - Span Creation: When `Storage`, `ObjectStorage`, `Uploader`, etc. methods are called, `ClassInstrumentation` creates a span such as `Storage.createBucket` or `Uploader.upload`, using the `methodsToInstrument` configuration.
  - Span Attributes: Span's name and attributes are configured by `setName` and `setAttributes` functions when available.
  - Chaining: If nested calls within the same trace are made, those spans will automatically be children of the parent operation; this also happens with database calls using the knex instrumentation.
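The method-wrapping idea behind `ClassInstrumentation` can be sketched roughly as follows. This is a minimal, synchronous model and not the project's actual implementation: `instrumentClass`, `SketchSpan`, `recordedSpans`, and `DemoStorage` are hypothetical names, and the spans are plain objects rather than OTEL spans. Only the `methodsToInstrument`, `setName`, and `setAttributes` option names are taken from the description above.

```typescript
// Simplified model of a span; a real OTEL span carries much more.
interface SketchSpan {
  name: string;
  attributes: Record<string, unknown>;
  ended: boolean;
}

// Every span the sketch creates is collected here so it can be inspected.
const recordedSpans: SketchSpan[] = [];

interface InstrumentOptions {
  // Names of the methods to wrap (mirrors the `methodsToInstrument` idea).
  methodsToInstrument: string[];
  // Optional hooks that override the default name or add attributes.
  setName?: (defaultName: string, args: unknown[]) => string;
  setAttributes?: (args: unknown[]) => Record<string, unknown>;
}

// Hypothetical helper: wraps the listed prototype methods so that each call
// is surrounded by a span named `<ClassName>.<method>`.
function instrumentClass(
  target: { name: string; prototype: any },
  options: InstrumentOptions
): void {
  for (const method of options.methodsToInstrument) {
    const original = target.prototype[method];
    if (typeof original !== "function") continue;

    target.prototype[method] = function (this: unknown, ...args: unknown[]) {
      const defaultName = `${target.name}.${method}`;
      const span: SketchSpan = {
        name: options.setName ? options.setName(defaultName, args) : defaultName,
        attributes: options.setAttributes ? options.setAttributes(args) : {},
        ended: false,
      };
      recordedSpans.push(span);
      try {
        return original.apply(this, args);
      } finally {
        // Sync-only sketch: an async method would end the span when its promise settles.
        span.ended = true;
      }
    };
  }
}

// Hypothetical class standing in for the storage layer.
class DemoStorage {
  createBucket(name: string): string {
    return `bucket:${name}`;
  }
}

instrumentClass(DemoStorage, {
  methodsToInstrument: ["createBucket"],
  setAttributes: (args) => ({ bucket: args[0] }),
});

const created = new DemoStorage().createBucket("avatars");
console.log(created, recordedSpans);
```

Wrapping at the prototype level keeps the business code free of tracing calls, which is presumably why a configuration-driven approach is used here.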
- Backend Interactions (Storage/Backend Layer):
  - Span Creation: When a function like `S3Backend.getObject` or `S3Backend.uploadObject` is called, a new span with the name `S3Backend.getObject` or `S3Backend.uploadObject` is created by `ClassInstrumentation`, respectively.
  - Span Attributes: Span contains attributes such as `operation: command.constructor.name`.
  - Context Propagation: When an API call is made using `aws-sdk`, the request is automatically added to the OTEL context to ensure all spans are connected; this happens because of the `@opentelemetry/instrumentation-aws-sdk` instrumentation.
- Async Operations:
  - Context Propagation: If async operations are performed, and a new span is to be created, the context should be carried forward using the `trace.getTracer().startActiveSpan` function to create and automatically activate the span, so that the new span will correctly be associated with the request.
  - Span Attributes: It might contain attributes depending on what method is calling it.
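To illustrate why an "active" span survives across `await` boundaries, here is a minimal sketch built on Node's `AsyncLocalStorage`. It is a stand-in written for this explanation, not the OTEL API: the `startActiveSpan` and `getActiveSpan` functions below are local helpers that only borrow the names, and `SketchSpan`, `finishedSpans`, and `handleRequest` are hypothetical.

```typescript
import { AsyncLocalStorage } from "node:async_hooks";

interface SketchSpan {
  name: string;
  traceId: string;
  spanId: string;
  parentSpanId?: string;
}

// Holds the span that is "active" for the current async execution context.
const activeSpanStore = new AsyncLocalStorage<SketchSpan>();
const finishedSpans: SketchSpan[] = [];
let nextId = 1;

function getActiveSpan(): SketchSpan | undefined {
  return activeSpanStore.getStore();
}

// Local stand-in for `startActiveSpan`: creates a span as a child of whatever
// span is currently active, then runs `fn` with the new span active.
async function startActiveSpan<T>(
  name: string,
  fn: (span: SketchSpan) => Promise<T>
): Promise<T> {
  const parent = getActiveSpan();
  const span: SketchSpan = {
    name,
    traceId: parent ? parent.traceId : `trace-${nextId++}`,
    spanId: `span-${nextId++}`,
    parentSpanId: parent?.spanId,
  };
  try {
    return await activeSpanStore.run(span, () => fn(span));
  } finally {
    finishedSpans.push(span); // the span "ends" once its work settles
  }
}

async function handleRequest(): Promise<void> {
  await startActiveSpan("http.request", async () => {
    // Async work happens before the nested span is started.
    await new Promise((resolve) => setTimeout(resolve, 5));
    await startActiveSpan("Uploader.upload", async () => {
      await new Promise((resolve) => setTimeout(resolve, 5));
    });
  });
}
```

Calling `handleRequest()` leaves two entries in `finishedSpans`; the `Uploader.upload` entry shares the root's `traceId` and points at it through `parentSpanId`, even though a timer fired in between.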
- Response Phase (HTTP Layer):
  - Plugin: `traceServerTime` is used to measure the server time it took to complete the response, capturing time spent in queue, database, HTTP operations, etc.
  - Span Attributes: If the tracing mode is set to `debug`, spans collected using the TraceCollector are serialized as JSON and added as a `stream` attribute to the main HTTP span.
  - Span End: All spans created using the active context, including the root `http.request` span, are ended. This finalizes the span, collects metrics, and prepares it for export.
  - Logging: The `logRequest` plugin logs the request, including the status code and response time, and the `serverTimes` that were captured with the `traceServerTime` plugin.
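One way to picture how per-category server times could be derived from finished spans is the sketch below. How `traceServerTime` actually computes its values is not described here, so everything in this block is an assumption made for illustration: the `FinishedSpan` shape, the `summarizeServerTimes` helper, the prefix-to-category table (including the `Queue.` prefix), and the sample durations.

```typescript
interface FinishedSpan {
  name: string;
  durationMs: number;
}

// Hypothetical grouping rules: span-name prefix -> timing category.
const categoryByPrefix: Array<[string, string]> = [
  ["StorageKnexDB.", "database"],
  ["S3Backend.", "http"],
  ["Queue.", "queue"],
];

// Sums span durations per category; spans matching no prefix land in "other".
// Nested spans are counted independently, so totals may overlap in wall time.
function summarizeServerTimes(spans: FinishedSpan[]): Record<string, number> {
  const totals: Record<string, number> = {};
  for (const span of spans) {
    const match = categoryByPrefix.find(([prefix]) => span.name.startsWith(prefix));
    const category = match ? match[1] : "other";
    totals[category] = (totals[category] ?? 0) + span.durationMs;
  }
  return totals;
}

// Made-up sample data for the sketch.
const serverTimes = summarizeServerTimes([
  { name: "StorageKnexDB.runQuery.findBucket", durationMs: 12 },
  { name: "StorageKnexDB.runQuery.listObjects", durationMs: 8 },
  { name: "S3Backend.getObject", durationMs: 40 },
  { name: "jwt.verify", durationMs: 1 },
]);
console.log(serverTimes);
```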
- Exporting Spans:
  - OTLP Exporter: If configured, the OpenTelemetry SDK's `BatchSpanProcessor` batches up the finished spans and passes them to the `OTLPTraceExporter`, which sends them to the OTEL endpoint.
    - This is done asynchronously in the background.
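The batching behavior can be sketched with a small queue that hands spans to an export callback in groups. This is a simplified model under stated assumptions, not the SDK class: `SketchBatchProcessor`, `SpanRecord`, and `ExportFn` are hypothetical names, the export callback is synchronous, and the timer-based flush that a real batch processor also performs is left out. The `maxExportBatchSize` name mirrors the SDK option of the same name.

```typescript
interface SpanRecord {
  name: string;
}

type ExportFn = (batch: SpanRecord[]) => void;

class SketchBatchProcessor {
  private queue: SpanRecord[] = [];
  private readonly exportBatch: ExportFn;
  private readonly maxExportBatchSize: number;

  constructor(exportBatch: ExportFn, maxExportBatchSize: number) {
    this.exportBatch = exportBatch;
    this.maxExportBatchSize = maxExportBatchSize;
  }

  // Called whenever a span ends; exports as soon as a full batch is queued.
  onEnd(span: SpanRecord): void {
    this.queue.push(span);
    if (this.queue.length >= this.maxExportBatchSize) this.flush();
  }

  // Drains the queue in batch-sized chunks (for example on shutdown).
  flush(): void {
    while (this.queue.length > 0) {
      this.exportBatch(this.queue.splice(0, this.maxExportBatchSize));
    }
  }
}

// Usage: collect exported batches in memory instead of sending them anywhere.
const exported: string[][] = [];
const processor = new SketchBatchProcessor((batch) => {
  exported.push(batch.map((span) => span.name));
}, 2);

processor.onEnd({ name: "jwt.verify" });
processor.onEnd({ name: "StorageKnexDB.runQuery" }); // batch is full, so it is exported
processor.onEnd({ name: "http.request" });
processor.flush(); // exports the remaining span
```

Batching trades a small export delay for far fewer network calls, which is why it suits a request path where spans end at a high rate.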
Metrics Collection
Metrics provide a numerical representation of system behavior, such as request rates, duration, etc. This system exposes metrics via a Prometheus endpoint, and here's how they are collected:
- Request Level (HTTP Layer):
  - Counters: The `fastify-metrics` plugin collects HTTP request-related metrics such as request counts, duration, and error counts. It also records the request data in the `storage_api_http_request_duration_seconds` metric.
  - Labels: Metrics collected by the `fastify-metrics` plugin include `method`, `route`, and `status_code`.
  - Label Aggregation: All Prometheus labels for HTTP are stored in `fastify-metrics` as tags.
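A labeled duration histogram of this kind can be modeled in a few lines. The sketch below is a simplified stand-in rather than the `fastify-metrics` or Prometheus client API: `SketchHistogram` and `Series` are hypothetical, the bucket boundaries and the route value are made up, and the implicit `+Inf` bucket is represented only by `count`. The label names `method`, `route`, and `status_code` come from the list above.

```typescript
type Labels = Record<string, string>;

interface Series {
  bucketCounts: number[]; // cumulative: one counter per upper bound
  sum: number;
  count: number;
}

class SketchHistogram {
  private readonly upperBounds: number[];
  private readonly series = new Map<string, Series>();

  constructor(upperBounds: number[]) {
    this.upperBounds = upperBounds.slice().sort((a, b) => a - b);
  }

  // One time series exists per unique label combination.
  private keyFor(labels: Labels): string {
    return Object.keys(labels)
      .sort()
      .map((k) => `${k}=${labels[k]}`)
      .join(",");
  }

  observe(labels: Labels, value: number): void {
    const key = this.keyFor(labels);
    const existing = this.series.get(key);
    const entry: Series =
      existing ?? { bucketCounts: this.upperBounds.map(() => 0), sum: 0, count: 0 };
    if (!existing) this.series.set(key, entry);

    // Buckets are cumulative: a value increments every bucket whose bound it fits under.
    this.upperBounds.forEach((bound, i) => {
      if (value <= bound) entry.bucketCounts[i] += 1;
    });
    entry.sum += value;
    entry.count += 1;
  }

  get(labels: Labels): Series | undefined {
    return this.series.get(this.keyFor(labels));
  }
}

// Usage with made-up buckets (seconds) and a hypothetical route.
const httpDuration = new SketchHistogram([0.05, 0.1, 0.5, 1]);
const requestLabels = { method: "GET", route: "/object/:bucket/*", status_code: "200" };
httpDuration.observe(requestLabels, 0.07);
httpDuration.observe(requestLabels, 0.3);
const snapshot = httpDuration.get(requestLabels);
```

Because each distinct label combination becomes its own series, labels with many possible values (such as raw URLs instead of route templates) would multiply the number of series quickly.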
- Database Operations (Internal Layer):
  - Histograms: The `DbQueryPerformance` histogram records the time it takes for a database query to complete. It captures the time spent waiting for a connection and the database query time, and is labeled with `region` and the method name, stored as `name`.
  - Connections: The metrics `DbActivePool` and `DbActiveConnection` track the pool connection counts and active connections.
- S3 Operations (Storage/Backend Layer):
  - Histograms: The `S3UploadPart` histogram records the time it takes to upload a part of a large file to the object storage service. It has one label, `region`.
- Uploads:
  - Gauges: `FileUploadStarted` is incremented when the file upload process starts. `FileUploadedSuccess` is incremented when an upload completes successfully.
  - Labels: The upload-related metrics include labels for region and upload type (standard or multipart).
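To show what such a labeled gauge looks like once Prometheus scrapes it, the sketch below increments a gauge per label set and renders it in the Prometheus text exposition format. It is an illustration only: `SketchGauge` is a hypothetical class, the metric name `file_upload_started`, the label key `type`, and the region value are assumptions, and label-value escaping is omitted.

```typescript
class SketchGauge {
  private readonly name: string;
  private readonly values = new Map<string, number>();

  constructor(name: string) {
    this.name = name;
  }

  // Increments the series identified by this exact label combination.
  inc(labels: Record<string, string>, by: number = 1): void {
    const key = Object.keys(labels)
      .sort()
      .map((k) => `${k}="${labels[k]}"`)
      .join(",");
    this.values.set(key, (this.values.get(key) ?? 0) + by);
  }

  // Renders the text exposition format: a TYPE line, then one line per series.
  render(): string {
    const lines = [`# TYPE ${this.name} gauge`];
    this.values.forEach((value, key) => {
      lines.push(`${this.name}{${key}} ${value}`);
    });
    return lines.join("\n");
  }
}

// Usage with a hypothetical metric name and label values.
const fileUploadStarted = new SketchGauge("file_upload_started");
fileUploadStarted.inc({ region: "us-east-1", type: "standard" });
fileUploadStarted.inc({ region: "us-east-1", type: "multipart" });
fileUploadStarted.inc({ region: "us-east-1", type: "standard" });

const exposition = fileUploadStarted.render();
console.log(exposition);
```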
- Queue Operations (Internal Layer):
  - Histograms: `QueueJobSchedulingTime` records the time it took to schedule the message to the queue, labeled with `name`, which is usually the queue name, and `region`.
  - Gauges: `QueueJobScheduled` for the messages scheduled to be processed by the queue, labeled with `name` and `region`, `QueueJobCompleted` for the number of completed messages, `QueueJobRetryFailed` for the number of failed retries on each message, and `QueueJobError`, which is the total count of errored messages. The labels used here are the queue names and region.
- HTTP Agent Metrics:
  - Gauges: `HttpPoolSocketsGauge` for the number of active sockets, `HttpPoolFreeSocketsGauge` for the number of free sockets, `HttpPoolPendingRequestsGauge` for the pending requests, and `HttpPoolErrorGauge` for the errors, each of them having `name`, `region`, `protocol`, and `type` as labels.
- Supavisor Metrics:
  - Custom Exporter: Supavisor has a custom Prometheus exporter that collects information about pool sizes, tenant status, connected clients, etc. Prometheus scrapes these according to its config file.
Key Takeaways:
- End-to-End Visibility: Tracing provides a complete view of the request lifecycle, including HTTP, DB, and file I/O operations.
- Resource-Specific Metrics: Metrics provide an overview of request performance with different labels.
- Integration with OpenTelemetry: The use of OpenTelemetry allows traces to be sent to observability backends.
- Integration with Prometheus: The usage of Prometheus makes it easy to collect and visualize metrics.