Tracing and Metrics
Let's look at how tracing and metrics are collected and used within the system. We will follow a single request to see where each one is added and how it is implemented.
Tracing Lifecycle
Tracing is used to observe the path a request takes through the system, providing a holistic view of the processing flow, including execution times, dependencies, and errors along the way. Here’s a detailed breakdown:
- Request Ingress (HTTP Layer):
  - Instrumentation: The OpenTelemetry (OTEL) HttpInstrumentation automatically creates a new root span when a request comes in. The span's name is http.request. The span's context (traceId, spanId) can be accessed through the trace.getActiveSpan() function.
  - Attributes: The instrumentation also sets relevant attributes on the span, including the tenant id (if available), method, URL, status code, route, headers, and more, based on the applyCustomAttributesOnSpan and headersToSpanAttributes configuration in the HttpInstrumentation.
  - Logging: The logRequest plugin is executed, capturing the request information in a log line.
  - Context Propagation: The span's context (including traceId and spanId) is attached to the current asynchronous execution context, ensuring consistent propagation to downstream operations.
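To make the header-to-attribute step concrete, here is a minimal sketch of copying an allow-list of headers onto a span as attributes. It is not the OTEL implementation: headersToAttributes, HeaderMap, and the http.<kind>.header.<name> key format are assumptions for illustration, and the sketch assumes header names arrive lowercased, as Node delivers them.

```typescript
// Hypothetical helper, NOT the OTEL implementation: it only mirrors the idea
// behind headersToSpanAttributes (copy selected headers onto a span).
type HeaderMap = Record<string, string | string[] | undefined>;
type HeaderAttributes = Record<string, string[]>;

function headersToAttributes(
  kind: "request" | "response",
  headers: HeaderMap,
  allowList: string[],
): HeaderAttributes {
  const attributes: HeaderAttributes = {};
  for (const name of allowList) {
    // Assumes the header map uses lowercased names.
    const key = name.toLowerCase();
    const value = headers[key];
    if (value === undefined) continue; // header absent: no attribute is set
    // Attribute key format is an assumption modeled on OTEL naming.
    const attributeKey = `http.${kind}.header.${key.replace(/-/g, "_")}`;
    attributes[attributeKey] = Array.isArray(value) ? value : [value];
  }
  return attributes;
}

// Only allow-listed headers that are actually present become attributes.
const attrs = headersToAttributes(
  "request",
  { "x-forwarded-host": "tenant.example.com", "user-agent": "curl/8.0" },
  ["X-Forwarded-Host", "X-Request-Id"],
);
```

Using an allow-list rather than copying every header keeps sensitive values such as authorization tokens out of the exported trace.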
- Authentication and Authorization (HTTP Layer):
  - Plugin: The jwt plugin is activated.
  - Span Creation: The ClassInstrumentation plugin detects jwt.verify calls and creates and ends a new span named jwt.verify, which represents authentication.
  - Span Attributes: The span has metadata such as the role that comes from the payload.
- Database Operations (Internal Layer):
  - Plugin: The db plugin initializes the database connection pool and retrieves a connection.
  - Span Creation: The ClassInstrumentation plugin detects StorageKnexDB.runQuery calls and creates and ends a new span named StorageKnexDB.runQuery.{{queryName}}.
  - Span Attributes: The spans store the query name if present.
  - Context Propagation: The database query is instrumented using @opentelemetry/instrumentation-knex, and any calls to the database or database pool will be automatically linked to the main trace context.
- Business Logic (Storage Layer):
  - Span Creation: When methods on Storage, ObjectStorage, Uploader, etc. are called, ClassInstrumentation creates a span such as Storage.createBucket or Uploader.upload, using the methodsToInstrument configuration.
  - Span Attributes: The span's name and attributes are configured by the setName and setAttributes functions when available.
  - Chaining: If nested calls are made within the same trace, their spans automatically become children of the parent operation; the same applies to database calls through the knex instrumentation.
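The method-wrapping idea behind this step can be sketched in a few lines. This is a toy stand-in, not the project's ClassInstrumentation: instrumentClass, the in-memory recorded array, and the two toy classes are hypothetical, and only methodsToInstrument, setName, and setAttributes are names taken from the text above. The sketch handles synchronous methods only; a promise-returning method would need the span to end when the promise settles.

```typescript
// Toy stand-in for a class instrumentation; spans are kept in memory.
interface RecordedSpan {
  name: string;
  parent?: string;
  attributes: Record<string, unknown>;
}

interface InstrumentOptions {
  methodsToInstrument: string[];
  setName?: (defaultName: string, args: unknown[]) => string;
  setAttributes?: (args: unknown[]) => Record<string, unknown>;
}

const recorded: RecordedSpan[] = [];
const activeStack: string[] = []; // names of the spans that are currently open

function instrumentClass(
  target: { name: string; prototype: any },
  options: InstrumentOptions,
): void {
  for (const method of options.methodsToInstrument) {
    const original = target.prototype[method];
    if (typeof original !== "function") continue;
    target.prototype[method] = function (this: unknown, ...args: unknown[]) {
      const defaultName = `${target.name}.${method}`;
      const name = options.setName?.(defaultName, args) ?? defaultName;
      recorded.push({
        name,
        parent: activeStack[activeStack.length - 1], // innermost open span
        attributes: options.setAttributes?.(args) ?? {},
      });
      activeStack.push(name);
      try {
        return original.apply(this, args);
      } finally {
        activeStack.pop(); // the span ends when the method returns or throws
      }
    };
  }
}

// Toy classes standing in for the real storage classes.
class Uploader {
  upload(key: string): string {
    return `uploaded:${key}`;
  }
}

class ObjectStorage {
  private uploader: Uploader;
  constructor(uploader: Uploader) {
    this.uploader = uploader;
  }
  createBucket(name: string): string {
    return this.uploader.upload(`${name}/.keep`);
  }
}

instrumentClass(Uploader, { methodsToInstrument: ["upload"] });
instrumentClass(ObjectStorage, {
  methodsToInstrument: ["createBucket"],
  setAttributes: (args) => ({ bucket: args[0] }),
});

const result = new ObjectStorage(new Uploader()).createBucket("avatars");
```

After the call, recorded holds two spans, and the Uploader.upload span names ObjectStorage.createBucket as its parent, which is the chaining behaviour described above.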
- Backend Interactions (Storage/Backend Layer):
  - Span Creation: When a function like S3Backend.getObject or S3Backend.uploadObject is called, ClassInstrumentation creates a new span with the same name (S3Backend.getObject or S3Backend.uploadObject, respectively).
  - Span Attributes: The span contains attributes such as operation: command.constructor.name.
  - Context Propagation: When an API call is made using aws-sdk, the request is automatically added to the OTEL context so that all spans are connected; the @opentelemetry/instrumentation-aws-sdk instrumentation takes care of this.
- Async Operations:
  - Context Propagation: If async operations are performed and a new span is to be created, the context should be carried forward using the trace.getTracer().startActiveSpan function to create and automatically activate the span, so that the new span will correctly be associated with the request.
  - Span Attributes: It might contain attributes depending on what method is calling it.
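The reason an "active span" survives asynchronous hops can be shown with Node's AsyncLocalStorage, which keeps a value attached to the current asynchronous execution context. The sketch below is an illustration of that mechanism, not the OTEL API: MiniSpan, the local startActiveSpan helper, and the span names are hypothetical.

```typescript
import { AsyncLocalStorage } from "node:async_hooks";

// Hypothetical miniature span; the real OTEL Span interface is richer.
interface MiniSpan {
  name: string;
  traceId: string;
  parentName?: string;
  end(): void;
}

const activeSpan = new AsyncLocalStorage<MiniSpan>();
const ended: MiniSpan[] = [];
let nextTrace = 1;

// Creates a span, makes it the active one while fn runs, and hands it to fn.
// As with OTEL, the callback is responsible for calling span.end().
function startActiveSpan<T>(name: string, fn: (span: MiniSpan) => T): T {
  const parent = activeSpan.getStore(); // the span active in this async context
  const span: MiniSpan = {
    name,
    traceId: parent ? parent.traceId : `trace-${nextTrace++}`,
    parentName: parent?.name,
    end: () => {
      ended.push(span);
    },
  };
  return activeSpan.run(span, () => fn(span));
}

async function resizeImage(): Promise<void> {
  // A timer hop: the store is still available after the await.
  await new Promise<void>((resolve) => setTimeout(resolve, 1));
  startActiveSpan("image.resize", (span) => span.end());
}

async function handleRequest(): Promise<MiniSpan[]> {
  return startActiveSpan("http.request", async (root) => {
    await resizeImage();
    root.end();
    return ended;
  });
}

const done = handleRequest();
```

When done resolves, the image.resize span shares the root's traceId and records http.request as its parent, even though it was created after a timer fired.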
- Response Phase (HTTP Layer):
  - Plugin: traceServerTime is used to measure the server time it took to complete the response, capturing time spent in queue, database, HTTP operations, etc.
  - Span Attributes: If the tracing mode is set to debug, spans collected using the TraceCollector are serialized as JSON and added as a stream attribute to the main HTTP span.
  - Span End: All spans created using the active context, including the root http.request span, are ended. This finalizes the spans, collects metrics, and prepares them for export.
  - Logging: The logRequest plugin logs the request, including the status code and response time, and the serverTimes that were captured with the traceServerTime plugin.
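One way to picture the server-time breakdown is as a sum of span durations per category. The source does not show the actual serverTimes format, so everything in this sketch is an assumption: the TimedSpan shape, the summarizeServerTimes helper, the category-to-prefix mapping, and the sample durations.

```typescript
// Hypothetical shapes; the real traceServerTime output is not shown in the text.
interface TimedSpan {
  name: string;
  durationMs: number;
}

// Sums span durations per category, matching spans by name prefix.
function summarizeServerTimes(
  spans: TimedSpan[],
  categories: Record<string, string[]>,
): Record<string, number> {
  const totals: Record<string, number> = {};
  for (const category of Object.keys(categories)) {
    const prefixes = categories[category];
    totals[category] = spans
      .filter((span) => prefixes.some((prefix) => span.name.startsWith(prefix)))
      .reduce((sum, span) => sum + span.durationMs, 0);
  }
  return totals;
}

// Sample data is made up for illustration.
const serverTimes = summarizeServerTimes(
  [
    { name: "StorageKnexDB.runQuery.findObject", durationMs: 10 },
    { name: "StorageKnexDB.runQuery.insertObject", durationMs: 5 },
    { name: "S3Backend.getObject", durationMs: 40 },
    { name: "Uploader.upload", durationMs: 7 }, // matches no category
  ],
  {
    database: ["StorageKnexDB.runQuery"],
    http: ["S3Backend."],
    queue: ["Queue."],
  },
);
```

Note that summing nested spans would count the same time twice, so a real implementation would have to aggregate only spans that do not contain one another.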
- Exporting Spans:
  - OTLP Exporter: If configured, the OpenTelemetry SDK's BatchSpanProcessor batches the finished spans and sends them to the OTEL endpoint through the OTLPTraceExporter.
  - This is done asynchronously in the background.
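Batching trades a little latency for far fewer network calls. The following sketch shows the queue-and-flush idea only: MiniBatchProcessor and its members are hypothetical, and the timer-based flush that the real BatchSpanProcessor also performs is replaced here by an explicit flush() call.

```typescript
interface FinishedSpan {
  name: string;
}

type ExportFn = (batch: FinishedSpan[]) => void;

// Hypothetical miniature of a batching span processor.
class MiniBatchProcessor {
  private queue: FinishedSpan[] = [];
  private exportBatch: ExportFn;
  private maxBatchSize: number;

  constructor(exportBatch: ExportFn, maxBatchSize: number) {
    this.exportBatch = exportBatch;
    this.maxBatchSize = maxBatchSize;
  }

  // Called whenever a span ends; exports once a full batch has accumulated.
  onEnd(span: FinishedSpan): void {
    this.queue.push(span);
    if (this.queue.length >= this.maxBatchSize) this.flush();
  }

  // Exports whatever is queued, up to one batch at a time.
  flush(): void {
    if (this.queue.length === 0) return;
    const batch = this.queue.splice(0, this.maxBatchSize);
    this.exportBatch(batch);
  }
}

const exportedSizes: number[] = [];
const processor = new MiniBatchProcessor((batch) => {
  exportedSizes.push(batch.length); // a real exporter would send the batch over the network
}, 2);

for (let i = 0; i < 5; i++) processor.onEnd({ name: `span-${i}` });
processor.flush(); // e.g. on shutdown, so the last partial batch is not lost
```

Five spans with a batch size of two produce two full exports, and the final flush() sends the remaining span.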
Metrics Collection
Metrics provide a numerical representation of system behavior, such as request rates, duration, etc. This system exposes metrics via a Prometheus endpoint, and here's how they are collected:
- Request Level (HTTP Layer):
  - Counters: The fastify-metrics plugin collects HTTP request-related metrics such as request counts, duration, and error counts. It also stores the request data into the storage_api_http_request_duration_seconds metric.
  - Labels: Metrics collected by the fastify-metrics plugin include method, route, and status_code.
  - Label Aggregation: All Prometheus labels for HTTP are stored in fastify-metrics as tags.
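To show what a labeled duration histogram looks like once Prometheus scrapes it, here is a small dependency-free sketch that records observations and renders them in the Prometheus text exposition format. It is not how fastify-metrics is implemented: MiniHistogram, the bucket boundaries, and the sample route are assumptions, and the sketch does no label-value escaping and expects at least one label.

```typescript
type Labels = Record<string, string>;

interface Series {
  counts: number[]; // cumulative count per bucket upper bound
  sum: number;
  count: number;
}

// Hypothetical minimal histogram; real clients also escape label values.
class MiniHistogram {
  private name: string;
  private buckets: number[];
  private series = new Map<string, Series>();

  constructor(name: string, buckets: number[]) {
    this.name = name;
    this.buckets = [...buckets].sort((a, b) => a - b);
  }

  observe(labels: Labels, value: number): void {
    const labelText = Object.keys(labels)
      .map((key) => `${key}="${labels[key]}"`)
      .join(",");
    let s = this.series.get(labelText);
    if (!s) {
      s = { counts: this.buckets.map(() => 0), sum: 0, count: 0 };
      this.series.set(labelText, s);
    }
    // Buckets are cumulative: a value counts toward every bound it fits under.
    for (let i = 0; i < this.buckets.length; i++) {
      if (value <= this.buckets[i]) s.counts[i] += 1;
    }
    s.sum += value;
    s.count += 1;
  }

  render(): string {
    const lines: string[] = [`# TYPE ${this.name} histogram`];
    this.series.forEach((s, labelText) => {
      this.buckets.forEach((upper, i) => {
        lines.push(`${this.name}_bucket{${labelText},le="${upper}"} ${s.counts[i]}`);
      });
      lines.push(`${this.name}_bucket{${labelText},le="+Inf"} ${s.count}`);
      lines.push(`${this.name}_sum{${labelText}} ${s.sum}`);
      lines.push(`${this.name}_count{${labelText}} ${s.count}`);
    });
    return lines.join("\n");
  }
}

const httpDuration = new MiniHistogram(
  "storage_api_http_request_duration_seconds",
  [0.1, 0.5, 1],
);
const labels = { method: "GET", route: "/object/:id", status_code: "200" };
httpDuration.observe(labels, 0.25);
httpDuration.observe(labels, 0.5);
const exposition = httpDuration.render();
```

Each distinct combination of method, route, and status_code becomes its own time series, which is why high-cardinality label values (raw URLs, user ids) are best avoided.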
- Database Operations (Internal Layer):
  - Histograms: The DbQueryPerformance histogram records the time it takes for a database query to complete. It captures the time spent waiting for a connection and the database query time, and it is labeled with region and the method name, stored as name.
  - Connections: The DbActivePool and DbActiveConnection metrics track the pool connection counts and active connections.
- S3 Operations (Storage/Backend Layer):
  - Histograms: The S3UploadPart histogram records the time it takes to upload a part of a large file to the object storage service. It has one label, region.
- Uploads:
  - Gauges: FileUploadStarted is incremented when the file upload process starts. FileUploadedSuccess is incremented when an upload completes successfully.
  - Labels: The upload-related metrics include labels for region and upload type (standard or multipart).
- Queue Operations (Internal Layer):
  - Histograms: QueueJobSchedulingTime records the time it took to schedule the message to the queue, labeled with name (usually the queue name) and region.
  - Gauges: QueueJobScheduled counts the messages scheduled to be processed by the queue, QueueJobCompleted counts the completed messages, QueueJobRetryFailed counts the failed retries on each message, and QueueJobError is the total count of errored messages. All of them are labeled with name (the queue name) and region.
- HTTP Agent Metrics:
  - Gauges: HttpPoolSocketsGauge tracks the number of active sockets, HttpPoolFreeSocketsGauge the number of free sockets, HttpPoolPendingRequestsGauge the pending requests, and HttpPoolErrorGauge the errors. Each of them has name, region, protocol, and type as labels.
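The socket, free-socket, and pending-request values could plausibly be derived from the per-host dictionaries that Node's http.Agent exposes as sockets, freeSockets, and requests. The sketch below counts entries in objects of that shape; AgentLike, snapshotAgent, and the sample data are hypothetical, and whether the project reads the agent this way is an assumption.

```typescript
// Structural type mirroring the sockets, freeSockets and requests fields of
// Node's http.Agent: each maps a "host:port" style key to an array.
type PoolDict = Record<string, unknown[] | undefined>;

interface AgentLike {
  sockets: PoolDict;
  freeSockets: PoolDict;
  requests: PoolDict;
}

function countEntries(dict: PoolDict): number {
  let total = 0;
  for (const key of Object.keys(dict)) {
    total += dict[key]?.length ?? 0;
  }
  return total;
}

// Produces the three values a gauge collector would publish on each scrape.
function snapshotAgent(agent: AgentLike): { active: number; free: number; pending: number } {
  return {
    active: countEntries(agent.sockets),
    free: countEntries(agent.freeSockets),
    pending: countEntries(agent.requests),
  };
}

// Made-up sample data shaped like an agent with two busy and one idle socket.
const snapshot = snapshotAgent({
  sockets: { "s3.example.com:443:": [{}, {}] },
  freeSockets: { "s3.example.com:443:": [{}] },
  requests: {},
});
```

Because gauges report a current value rather than a running total, taking a fresh snapshot at scrape time is usually simpler than incrementing and decrementing on every socket event.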
- Supavisor Metrics:
  - Custom Exporter: Supavisor has a custom Prometheus exporter that collects information about pool sizes, tenant status, connected clients, etc. These metrics are scraped according to the Prometheus config file.
Key Takeaways:
- End-to-End Visibility: Tracing provides a complete view of the request lifecycle, including HTTP, DB, and file I/O operations.
- Resource-Specific Metrics: Metrics provide an overview of request performance with different labels.
- Integration with OpenTelemetry: The use of OpenTelemetry allows traces to be sent to observability backends.
- Integration with Prometheus: The usage of Prometheus makes it easy to collect and visualize metrics.