DEV Community: Yuri Grinshteyn

Integrating Tracing and Logging with OpenTelemetry and Stackdriver

Yuri Grinshteyn — Wed, 05 Feb 2020 00:16:35 +0000

One of the main benefits of using an all-in-one observability suite like Stackdriver is that it provides all of the capabilities you may need. Specifically, your metrics, traces, and logs are all in one place, and with the GA release of Monitoring in the Cloud Console, that's more true than ever before. However, for the most part, each of these data elements is still largely independent, and I wanted to try to unify two of them - traces and logs.

The idea for the project was inspired by the excellent work Alex Amies did in his Reference Guide on using OpenCensus to measure Spanner performance and troubleshoot latency. Specifically, he included an applog package that integrated traces and logs in OpenCensus:

I wanted to follow my post on tracing with OpenTelemetry and attempt to create integrated traces and logs. Let's dive in!

The app

I created a very basic Go app using the Mux router that:

Receives a request on /.
Sleeps between 0 and 9 seconds.
Makes a backend call (to https://www.google.com)

My intent was to create a root span with two children - one for the delay that simulates an internal process and another for the backend call.

The code

Main function

func main() {
    initTracer()
    initLogger()
    defer closeLogger()

    r := mux.NewRouter()
    r.HandleFunc("/", mainHandler)

    if env == "LOCAL" {
        http.ListenAndServe("localhost:8080", r)
    } else {
        http.ListenAndServe(":8080", r)
    }
}

The main function simply sets up my tracing and logging and uses the mainHandler to respond to requests on /.

Tracing setup

func initTracer() {
    // Create Stackdriver exporter to be able to retrieve
    // the collected spans.
    exporter, err := stackdriver.NewExporter(
        stackdriver.WithProjectID(projectID),
    )
    if err != nil {
        log.Fatal(err)
    }

    // For the demonstration, use sdktrace.AlwaysSample sampler to sample all traces.
    // In a production application, use sdktrace.ProbabilitySampler with a desired probability.
    tp, err := sdktrace.NewProvider(sdktrace.WithConfig(sdktrace.Config{DefaultSampler: sdktrace.AlwaysSample()}),
        sdktrace.WithSyncer(exporter))
    if err != nil {
        log.Fatal(err)
    }
    global.SetTraceProvider(tp)
}

The tracing setup is pretty straightforward - I'm simply using the exporter written by Yoshi Yamaguchi, a fantastic Developer Advocate. It's the same exporter I used in my post on tracing without any changes.

Logging setup

This is where things start to get interesting.

func initLogger() {
    ctx := context.Background()
    var err error
    loggingClient, err = logging.NewClient(ctx, projectID)
    if err != nil {
        fmt.Printf("Failed to create logging client: %v", err)
        return
    }
    fmt.Printf("Stackdriver Logging initialized with project id %s, see Cloud "+
        "Console under GCE VM instance > all instance_id\n", projectID)
}

I've largely lifted this from Alex's work. The init function simply sets up the logging client.

Writing logs

This is where the trace/logging integration really happens.

// Send to Cloud Logging service including reference to current span
func printWithTrace(ctx context.Context, format string, v ...interface{}) {
    printf(ctx, logging.Info, format, v...)
}

// Send to Cloud Logging service including reference to current span
func printf(ctx context.Context, severity logging.Severity, format string,
    v ...interface{}) {
    span := trace.SpanFromContext(ctx)
    sCtx := span.SpanContext()
    tr := sCtx.TraceIDString()
    lg := loggingClient.Logger(LOGNAME)
    trace := fmt.Sprintf("projects/%s/traces/%s", projectID, tr)
    lg.Log(logging.Entry{
        Severity: severity,
        Payload:  fmt.Sprintf(format, v...),
        Trace:    trace,
        SpanID:   sCtx.SpanIDString(),
    })
}

In Stackdriver, traces and logs can be connected by writing the trace ID and the span ID into the trace and spanId fields of the log entry. Here, I'm using the context to retrieve the current span, and then its span context, to get both IDs. I then set them on the log entry alongside the message payload. Here's what a resulting log message looks like:

Notice that the spanId and trace fields are populated appropriately.
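To make the shape of that correlation concrete, here is a minimal sketch that builds the same two fields without the client library. The logEntry type and the helper names are hypothetical, the IDs are made up, and the JSON field names assume the LogEntry format used by the Logging API:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// logEntry mirrors only the fields discussed above. It is a hypothetical
// helper type for illustration, not the client library's logging.Entry.
type logEntry struct {
	Severity    string `json:"severity"`
	TextPayload string `json:"textPayload"`
	Trace       string `json:"trace"`
	SpanID      string `json:"spanId"`
}

// traceResourceName builds the fully qualified trace name,
// projects/<project ID>/traces/<trace ID>, as printf() does above.
func traceResourceName(projectID, traceID string) string {
	return fmt.Sprintf("projects/%s/traces/%s", projectID, traceID)
}

// newCorrelatedEntry sets the two correlation fields next to the message.
func newCorrelatedEntry(projectID, traceID, spanID, msg string) logEntry {
	return logEntry{
		Severity:    "INFO",
		TextPayload: msg,
		Trace:       traceResourceName(projectID, traceID),
		SpanID:      spanID,
	}
}

func main() {
	// Example project and IDs are made up for illustration.
	e := newCorrelatedEntry("my-project",
		"4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7",
		"backend call done")
	out, err := json.MarshalIndent(e, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

The key detail is that the trace field carries the full resource name, including the project, while spanId carries only the bare span ID.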

Viewing traces

I can run the app locally (after using gcloud auth application-default login to write default credentials) and send traffic to http://localhost:8080. Here's a resulting trace:

Note that the trace contains both Events - added with the span.AddEvent() method - and logs, written as described above.

Viewing logs

I can click on one of the log entries to see the full details of the log message:

I can then click Open in Logs Viewer and see this log entry there:

In conclusion...

I was very glad to see that this somewhat underappreciated functionality from OpenCensus still works in OpenTelemetry with minor changes. Specifically, I had to find the new APIs to use in printf() to extract the span from context and then get its span ID and trace ID, and this does not seem to be well documented. With that said, I hope this brief tutorial is useful to others looking to build a more integrated approach to observability with Stackdriver, especially in distributed systems. Many thanks to Alex for doing the original work on this, and thank you for reading!

Distributed Tracing with OpenTelemetry in Go

Yuri Grinshteyn — Thu, 09 Jan 2020 22:26:33 +0000

Toward the end of last year, I had the good fortune of publishing a reference guide on using OpenCensus for distributed tracing. In it, I covered distributed tracing fundamentals, like traces, spans, and context propagation, and demonstrated using OpenCensus to instrument a simple pair of frontend/backend services written in Go. Since then, the OpenCensus and OpenTracing projects have merged into OpenTelemetry, a "single set of APIs, libraries, agents, and collector services to capture distributed traces and metrics from your application." I wanted to attempt to reproduce the work I did in OpenCensus using the new project and see how much has changed.

Objective

For this exercise, I built a simple demo. It consists of two services. The frontend service receives an incoming request and makes a request to the backend. The backend receives the request and returns a response. Our objective is to trace this interaction to determine the overall response latency and understand how the two services and the network connectivity between them contribute to the overall latency.

In the original guide, the two services were deployed in two separate GKE clusters, but that is actually not necessary to demonstrate distributed tracing. For this exercise, we'll simply run both services locally.

Primitives

While the basic concepts are covered in the reference guide and in much greater detail in the Google Dapper research paper, it's still worth briefly covering them here so that we can then understand how they're implemented in the code.

From the reference guide:

A trace is the total of information that describes how a distributed system responds to a user request. Traces are composed of spans, where each span represents a specific request and response pair involved in serving the user request. The parent span describes the latency as observed by the end user. Each of the child spans describes how a particular service in the distributed system was called and responded to, with latency information captured for each.

This is well illustrated in the aforementioned research paper using this diagram:

Implementation

Let's take a look at how we can implement distributed tracing in our frontend/backend service pair using OpenTelemetry.

Note that most of this is adapted from the samples published by OpenTelemetry in their GitHub repo. I made relatively minor changes to add custom spans and use the Mux router, rather than just basic HTTP handling.

Frontend code

We'll start by reviewing the frontend code.

Imports

First, the imports:

import (
    "fmt"
    "log"
    "net/http"
    "os"
    "context"
    "io/ioutil"
    "google.golang.org/grpc/codes"

    "github.com/gorilla/mux"

    "go.opentelemetry.io/otel/api/distributedcontext"
    "go.opentelemetry.io/otel/api/global"
    "go.opentelemetry.io/otel/api/trace"
    "go.opentelemetry.io/otel/exporter/trace/stackdriver"
    "go.opentelemetry.io/otel/plugin/httptrace"
    sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

Mostly, we're using a variety of OpenTelemetry libraries at this point. We'll also use the Mux router to handle HTTP requests (I mostly use it because it seems to be similar to Express in Node.js).

Main function

Next, let's have a look at the main() function for our service:

func main() {
    initTracer()

    r := mux.NewRouter()
    r.HandleFunc("/", mainHandler)

    if (env=="LOCAL") {
        http.ListenAndServe("localhost:8080", r)
    } else {
        http.ListenAndServe(":8080", r)
    }
}

As you can tell, this is pretty straightforward. We initialize tracing right at the start and use a Mux router to handle a single route for requests to /. We then start the server on port 8080. I added an environment variable check so that, when I'm running the code locally, the server binds to localhost only, which bypasses the macOS prompt to allow inbound network connections, as per these instructions.
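The snippet above doesn't show where env comes from, so here is a minimal sketch of that check in isolation. The ENV variable name and the listenAddr helper are assumptions made for illustration, not names taken from the original code:

```go
package main

import (
	"fmt"
	"os"
)

// listenAddr returns a loopback-only address for local runs and an
// all-interfaces address otherwise. Binding to localhost is what avoids
// the inbound-connection prompt described above.
func listenAddr(env string) string {
	if env == "LOCAL" {
		return "localhost:8080"
	}
	return ":8080"
}

func main() {
	// ENV is a hypothetical variable name chosen for this sketch.
	env := os.Getenv("ENV")
	fmt.Printf("would listen on %s\n", listenAddr(env))
}
```

In a real server, the returned address would be passed to http.ListenAndServe, whose returned error is worth logging rather than discarding.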

Initialize tracing

Next, let's take a look at the initTracer() function:

func initTracer() {

    // Create Stackdriver exporter to be able to retrieve
    // the collected spans.
    exporter, err := stackdriver.NewExporter(
        stackdriver.WithProjectID(projectID),
    )
    if err != nil {
        log.Fatal(err)
    }

    // For the demonstration, use sdktrace.AlwaysSample sampler to sample all traces.
    // In a production application, use sdktrace.ProbabilitySampler with a desired probability.
    tp, err := sdktrace.NewProvider(sdktrace.WithConfig(sdktrace.Config{DefaultSampler: sdktrace.AlwaysSample()}),
        sdktrace.WithSyncer(exporter))
    if err != nil {
        log.Fatal(err)
    }
    global.SetTraceProvider(tp)
}

Here, we're simply instantiating the Stackdriver exporter and setting the sampling parameter to capture every trace.

Handle requests

Finally, let's look at the mainHandler() function that is called to handle requests to /.

func mainHandler(w http.ResponseWriter, r *http.Request) {

    tr := global.TraceProvider().Tracer("OT-tracing-demo")

    client := http.DefaultClient
    ctx := distributedcontext.NewContext(context.Background())

    var body []byte

    err := tr.WithSpan(ctx, "incoming call",  // root span here
        func(ctx context.Context) error {

            // create child span
            ctx, childSpan := tr.Start(ctx, "backend call")
            childSpan.AddEvent(ctx, "making backend call")

            // create backend request
            req, _ := http.NewRequest("GET", backendAddr, nil)

            // inject context
            ctx, req = httptrace.W3C(ctx, req)
            httptrace.Inject(ctx, req)

            // do request
            log.Printf("Sending request...\n")
            res, err := client.Do(req)
            if err != nil {
                panic(err)
            }
            body, err = ioutil.ReadAll(res.Body)
            _ = res.Body.Close()

            // close child span
            childSpan.End()

            trace.SpanFromContext(ctx).SetStatus(codes.OK)
            log.Printf("got response: %s\n", res.Status)
            fmt.Printf("%v\n", "OK") //change to status code from backend
            return err
        })

    if err != nil {
        panic(err)
    }   
}

Here, we're setting the name of the tracer to "OT-tracing-demo" and starting a root span labeled "incoming call". We then create a child span of that labeled "backend call" and pass the context to it. We then create a request to our backend server, whose location is defined in an env variable and inject our context into that request - we'll see how that context is used in the backend in a bit. Finally, we make the request, get the status code, and output a confirmation message. Pretty straightforward!

A couple of things to note further:

I am explicitly closing the child span, rather than using defer, for more control over exactly when the timer is stopped.
I am adding events to spans for clearer labeling.

Now, let's look at our backend.

Backend code

Much of the code here is very similar to the frontend - we use the same exact main() and initTracer() functions to run the server and initialize tracing.

mainHandler

func mainHandler(w http.ResponseWriter, req *http.Request) {
    // start tracer
    tr := global.TraceProvider().Tracer("backend")
    // get context from incoming request
    attrs, entries, spanCtx := httptrace.Extract(req.Context(), req)

    // create request using context
    req = req.WithContext(distributedcontext.WithMap(req.Context(), distributedcontext.NewMap(distributedcontext.MapUpdate{
        MultiKV: entries,
    })))

    // create span
    ctx, span := tr.Start(
        req.Context(),
        "backend call received",
        trace.WithAttributes(attrs...),
        trace.ChildOf(spanCtx),
    )

    span.AddEvent(ctx, "handling backend call")

    // output
    log.Printf("backend call received")
    fmt.Fprint(w, "OK")
    // close span
    span.End()
}

The mainHandler() function does look quite different. Here, we extract the span context from the incoming request, create a new request object using that context, and create a new span using that request context. We also add an event to our span for explicit labeling. Finally, we return "OK" to the caller and close our span. Again, I could have used defer span.End() instead of doing it explicitly.

Note the difference between span context and request context. This is specifically relevant when accepting incoming context and using it to create child spans. For further exploration of these two, take a look at the relevant documentation from OpenTracing.
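To illustrate what actually travels between the two services, here is a minimal sketch of a span context being serialized and parsed. It assumes the W3C Trace Context traceparent header format (version-traceid-parentid-flags), which is what the httptrace.W3C helper name suggests. The spanContext type and both helpers are hypothetical and not part of the OpenTelemetry API:

```go
package main

import (
	"encoding/hex"
	"errors"
	"fmt"
	"strings"
)

// spanContext holds only the identifiers that cross the process boundary.
type spanContext struct {
	TraceID string // 32 hex characters
	SpanID  string // 16 hex characters
	Sampled bool
}

// formatTraceparent is what the frontend conceptually does on inject.
func formatTraceparent(sc spanContext) string {
	flags := "00"
	if sc.Sampled {
		flags = "01"
	}
	return fmt.Sprintf("00-%s-%s-%s", sc.TraceID, sc.SpanID, flags)
}

// parseTraceparent is what the backend conceptually does on extract.
func parseTraceparent(h string) (spanContext, error) {
	parts := strings.Split(h, "-")
	if len(parts) != 4 || len(parts[0]) != 2 || len(parts[1]) != 32 ||
		len(parts[2]) != 16 || len(parts[3]) != 2 {
		return spanContext{}, errors.New("malformed traceparent header")
	}
	// Both IDs must be valid hex.
	if _, err := hex.DecodeString(parts[1] + parts[2]); err != nil {
		return spanContext{}, errors.New("trace or span ID is not hex")
	}
	flags, err := hex.DecodeString(parts[3])
	if err != nil {
		return spanContext{}, errors.New("flags are not hex")
	}
	// The lowest flag bit marks the trace as sampled.
	return spanContext{
		TraceID: parts[1],
		SpanID:  parts[2],
		Sampled: flags[0]&0x01 == 0x01,
	}, nil
}

func main() {
	// IDs are made up for illustration.
	sc := spanContext{
		TraceID: "4bf92f3577b34da6a3ce929d0e0e4736",
		SpanID:  "00f067aa0ba902b7",
		Sampled: true,
	}
	h := formatTraceparent(sc)
	fmt.Println(h)

	got, err := parseTraceparent(h)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", got)
}
```

The span context is just these few identifiers, which is why it can cross a network hop in a header, whereas the request context is an in-process Go value that never leaves the service.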

Traces

Now that we've seen how to implement tracing instrumentation in our code, let's take a look at what this instrumentation creates. We can run both frontend and backend locally after setting the relevant environment variables for each and using gcloud auth login to log in to Google Cloud. Once we do that, we can hit the frontend on http://localhost:8080 and issue a few requests. This should immediately result in traces being written to Stackdriver:

You can see the span names we specified in our code and the events we added for clearer labeling. One additional thing I was pleasantly surprised by is that OpenTelemetry explicitly adds steps for the HTTP/networking stack, including DNS, connecting, and sending and receiving data.

Conclusion

I greatly enjoyed attempting to reproduce the work I did in OpenCensus with OpenTelemetry and eventually found it understandable and clear, especially once I was pointed to the tracer.Start() method to create child spans. Come back next time when I attempt to use the stats features of OpenTelemetry to create custom metrics. Until then!

SLOs with Stackdriver Service Monitoring

Yuri Grinshteyn — Tue, 07 Jan 2020 19:49:29 +0000

Service Level Objectives or SLOs are one of the fundamental principles of site reliability engineering. We use them to precisely quantify the reliability target we want to achieve in our service. We also use their inverse, error budgets, to make informed decisions about how much risk we can take on at any given time. This lets us determine, for example, whether we can go ahead with a push to production or infrastructure upgrade.

However, Stackdriver has never given us the ability to actually create, track, alert, and report on SLOs - until now. The Service Monitoring API was released to public beta at NEXT London in the fall, and I wanted to take the opportunity to try it out. Here's what I found.

The service

Before I could create a service level objective, I needed a service. Because the initial release of the API only supports App Engine, Istio on GKE, and Cloud Endpoints, I thought I'd try the simplest option - App Engine Standard. I created a basic Hello World app in Go using the Mux router - here's the code for it:

package main

import (
    "fmt"
    "net/http"

    "github.com/gorilla/mux"
)

func main() {
    r := mux.NewRouter()

    r.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
        fmt.Fprintf(w, "Hello World!")
    })

    http.ListenAndServe(":8080", r)
}

I then created the app.yaml file:

runtime: go112 # replace with go111 for Go 1.11

Next, I deployed the app using gcloud app deploy. At this point, the build process was failing, and I ended up having to follow these instructions to define a module. I am not an expert in Go, and I'm guessing that this is a matter of local environment configuration that I just couldn't be bothered to sort out. Nevertheless, I was able to deploy the app after following those instructions.

Finally, I set up a global Uptime Check to get a steady stream of traffic flowing to my new app:

Defining the SLI

Now that I had a service created, I was ready to proceed. I found the documentation on concepts very helpful, even as someone mostly familiar with this topic. For this exercise, I wanted to create a simple availability Service Level Indicator (SLI) to measure the percentage of "good" requests as a fraction of total. That required three decisions:

How to count total requests
How to count "good" requests
What time frame to use for my SLI

Thankfully, App Engine exposes a useful response count metric:

Note that this metric is not written if the GAE application is disabled (as I learned by attempting to simulate a failure by disabling the app).

This metric can further be filtered by response code:

I decided to use the unfiltered metric to count total requests and filter requests with a response code of 200 to count "good" requests for the sake of simplicity.

Note that this is likely far too simplistic for any production use. For example, this would count 404s as "bad" requests, when they are likely to be the result of misconfigured clients or even external scanners.
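As a minimal sketch of what this SLI measures, the function below computes good over total from per-response-code counts, using the same "only 200 is good" rule. The counts are made up, and treating an empty window as fully available is an assumption of the sketch:

```go
package main

import "fmt"

// availability returns good/total, where "good" means HTTP 200 only,
// mirroring the deliberately simplistic definition used in this post.
func availability(countsByCode map[int]int64) float64 {
	var good, total int64
	for code, n := range countsByCode {
		total += n
		if code == 200 {
			good += n
		}
	}
	if total == 0 {
		// No traffic at all: report full availability (an assumption).
		return 1
	}
	return float64(good) / float64(total)
}

func main() {
	// Made-up counts for illustration.
	counts := map[int]int64{200: 970, 404: 20, 500: 10}
	fmt.Printf("availability: %.3f\n", availability(counts))
}
```

With these numbers, the 20 responses with a 404 drag availability down just as much as real server errors would, which is exactly the weakness noted above.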

I then chose a 1-day rolling window as my SLO time frame. For a lot more information on how to choose SLIs and SLOs, I highly recommend the Art of SLOs workshop, which the Google CRE team has recently released.

Creating the SLI and SLO

At this point, I was ready to use the API to define my SLO. As recommended in the "Building the SLI" section, I used the Metrics Explorer to create a chart that showed my "total" request count:

From there, I was able to copy the JSON for the filter:

"metric.type=\"appengine.googleapis.com/http/server/response_count\" resource.type=\"gae_app\" resource.label.\"module_id\"=\"default\""

I then modified the chart to only count the "good" requests, filtering on response_code=200 and copied that JSON:

"metric.type=\"appengine.googleapis.com/http/server/response_count\" resource.type=\"gae_app\" resource.label.\"module_id\"=\"default\" metric.label.\"response_code\"=\"200\""

At this point, I was ready to build the SLI:

  "requestBased": {
    "goodTotalRatio": {
      "totalServiceFilter": "metric.type=\"appengine.googleapis.com/http/server/response_count\" resource.type=\"gae_app\" resource.label.\"module_id\"=\"default\"",
      "goodServiceFilter": "metric.type=\"appengine.googleapis.com/http/server/response_count\" resource.type=\"gae_app\" resource.label.\"module_id\"=\"default\" metric.label.\"response_code\"=\"200\""
    }
  }

I chose the "requestBased" type of SLI, because I was looking to capture the fraction of good over total requests. The other options include basic, which might have been good enough for my purpose here, and "window-based", which lets you count the number of periods during which the service meets a defined health threshold. I may come back and revisit the latter in another post.

From there, I defined the SLO:

{
  "serviceLevelIndicator": {
    "requestBased": {
      "goodTotalRatio": {
        "totalServiceFilter": "metric.type=\"appengine.googleapis.com/http/server/response_count\" resource.type=\"gae_app\" resource.label.\"module_id\"=\"default\"",
        "goodServiceFilter": "metric.type=\"appengine.googleapis.com/http/server/response_count\" resource.type=\"gae_app\" resource.label.\"module_id\"=\"default\" metric.label.\"response_code\"=\"200\""
      }
    }
  },
  "goal": 0.98,
  "rollingPeriod": "86400s",
  "displayName": "98% Successful requests in a rolling day"
}
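Hand-writing JSON with escaped quotes is error-prone, so here is a minimal sketch that builds the same request body in Go and lets encoding/json handle the escaping. The struct and function names are hypothetical; only the JSON field names and filter strings come from the request above:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Filters copied from the Metrics Explorer step above.
const (
	totalFilter = `metric.type="appengine.googleapis.com/http/server/response_count" resource.type="gae_app" resource.label."module_id"="default"`
	goodFilter  = totalFilter + ` metric.label."response_code"="200"`
)

type goodTotalRatio struct {
	TotalServiceFilter string `json:"totalServiceFilter"`
	GoodServiceFilter  string `json:"goodServiceFilter"`
}

type requestBased struct {
	GoodTotalRatio goodTotalRatio `json:"goodTotalRatio"`
}

type serviceLevelIndicator struct {
	RequestBased requestBased `json:"requestBased"`
}

type serviceLevelObjective struct {
	ServiceLevelIndicator serviceLevelIndicator `json:"serviceLevelIndicator"`
	Goal                  float64               `json:"goal"`
	RollingPeriod         string                `json:"rollingPeriod"`
	DisplayName           string                `json:"displayName"`
}

// newAvailabilitySLO fills in a request-based availability SLO.
func newAvailabilitySLO(goal float64, rollingPeriod, displayName string) serviceLevelObjective {
	return serviceLevelObjective{
		ServiceLevelIndicator: serviceLevelIndicator{
			RequestBased: requestBased{
				GoodTotalRatio: goodTotalRatio{
					TotalServiceFilter: totalFilter,
					GoodServiceFilter:  goodFilter,
				},
			},
		},
		Goal:          goal,
		RollingPeriod: rollingPeriod,
		DisplayName:   displayName,
	}
}

func main() {
	slo := newAvailabilitySLO(0.98, "86400s", "98% Successful requests in a rolling day")
	body, err := json.MarshalIndent(slo, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body))
}
```

The printed body can be pasted into Postman or passed to curl as-is.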

Finally, I submitted the request to the API using Postman - you could do the same using the API Explorer or even curl. The response was successful and returned the SLO name in the body:

{
  "name": "projects/<project number>/services/gae:<project ID>_default/serviceLevelObjectives/<SLO name>",
  "serviceLevelIndicator": {
    "requestBased": {
      "goodTotalRatio": {
        "goodServiceFilter": "metric.type=\"appengine.googleapis.com/http/server/response_count\" resource.type=\"gae_app\" resource.label.\"module_id\"=\"default\" metric.label.\"response_code\"=\"200\"",
        "totalServiceFilter": "metric.type=\"appengine.googleapis.com/http/server/response_count\" resource.type=\"gae_app\" resource.label.\"module_id\"=\"default\""
      }
    }
  },
  "goal": 0.98,
  "rollingPeriod": "86400s",
  "displayName": "98% Successful requests in a rolling day"
}

Alerting on SLO

Now that my SLO was defined, I wanted to achieve two things - create an alert for SLO violation and figure out how to get a status without tripping an alert. I was able to use the UI to create an alerting policy using the "SLO BURN RATE" condition type:

When setting up the alerting policy, I ran into two fields whose meaning was not immediately clear to me. The first one is Lookback Duration. I was able to find an explanation in the documentation - because burn rate is fundamentally a rate of change condition, you have the option of specifying a custom lookback window. For other rate of change conditions, the lookback is set to 10 minutes and cannot be changed. From the doc for rate of change conditions:

The condition averages the values of the metric from the past 10 minutes, then compares the result with the 10-minute average that was measured just before the duration window. The 10-minute lookback window used by a metric rate of change condition is a fixed value; you can't change it. However, you do specify the duration window when you create a condition.

The second field that confused me was the threshold. As far as I can tell, this is the threshold for the burn rate itself - how fast the error budget is being consumed over the specified lookback duration, relative to the rate that would use up exactly the whole budget by the end of the SLO period. So, using 10 minutes for the lookback and 10 for the threshold would result in a condition that would trip if, over a 10 minute period, the error budget was being burned 10 times faster than that.

Triggering alert

Once my alerting policy was configured, I wanted to see what would happen if there was an availability issue. I rewrote the service to throw an error a little more than half the time:

func main() {
    r := mux.NewRouter()
    r.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {

        rand.Seed(time.Now().UnixNano())
        n := rand.Intn(10) // n will be between 0 and 9
        fmt.Printf("random number was %d\n", n)
        if n < 6 {
            http.Error(w, "error!", 500)

        } else {
            fmt.Fprintf(w, "Hello World!")
        }

    })

    http.ListenAndServe(":8080", r)
}

and redeployed the app. Fairly quickly, I got an incident:

I was satisfied with this and redeployed the app with the original code to get it working again. In short order, the incident was resolved.

Retrieving SLO Status

Alerting on error budget burn is obviously necessary, but there will be times when we'll need to know the status of our SLO long before an issue. As such, I needed a way to query the SLO data. I followed the documentation and used the timeSeries.list method of the monitoring API.

I thought of two primary questions I would want to be able to answer - what is the availability of my service over a given time period and what is the status of my error budget at a given point in time?

SLO Status

The first is answered using the "select_slo_health" time series. I sent a request to https://monitoring.googleapis.com/v3/projects/stack-doctor/timeSeries with the following parameters:

name:projects/<project ID>
filter:select_slo_health("projects/<project number>/services/gae:<project ID>_default/serviceLevelObjectives/<SLO name>")
interval.endTime:2020-01-06T17:17:00.0Z
interval.startTime:2020-01-05T17:17:00.0Z
aggregation.alignmentPeriod:3600s
aggregation.perSeriesAligner:ALIGN_MEAN

I used the SLO name that was returned when I created the SLO in the previous steps. I could have also used a call to https://monitoring.googleapis.com/v3/projects/<project number>/services/gae:<project ID>_default/serviceLevelObjectives to retrieve the SLOs I have configured against my default App Engine service.
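For anyone scripting this rather than using Postman, here is a minimal sketch that assembles the same request URL with the standard library. The sloHealthURL helper and the example project and SLO names are hypothetical; the parameter names are the ones listed above:

```go
package main

import (
	"fmt"
	"net/url"
)

// sloHealthURL builds a timeSeries.list URL that selects SLO health
// for the given SLO over the given interval.
func sloHealthURL(projectID, sloName, startTime, endTime string) string {
	q := url.Values{}
	q.Set("filter", `select_slo_health("`+sloName+`")`)
	q.Set("interval.startTime", startTime)
	q.Set("interval.endTime", endTime)
	q.Set("aggregation.alignmentPeriod", "3600s")
	q.Set("aggregation.perSeriesAligner", "ALIGN_MEAN")

	u := url.URL{
		Scheme:   "https",
		Host:     "monitoring.googleapis.com",
		Path:     "/v3/projects/" + projectID + "/timeSeries",
		RawQuery: q.Encode(), // percent-encodes the quotes and slashes in the filter
	}
	return u.String()
}

func main() {
	// Project and SLO names below are placeholders.
	fmt.Println(sloHealthURL(
		"my-project",
		"projects/123/services/gae:my-project_default/serviceLevelObjectives/my-slo",
		"2020-01-05T17:17:00Z",
		"2020-01-06T17:17:00Z",
	))
}
```

The request still needs an OAuth access token in the Authorization header, which this sketch leaves out.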

I specified a 24-hour interval to retrieve the data with an alignment of 1 hour using the mean aligner. If I were going to chart the data, I could have used a shorter alignment period and a more precise aligner, but this sufficed for my purposes. The results looked like this:

{
  "timeSeries": [
    {
      "metric": {
        "type": "select_slo_health(\"projects/860128900282/services/gae:stack-doctor_default/serviceLevelObjectives/2IooYmjTSROak0g9f-DmpA\")"
      },
      "resource": {
        "type": "gae_app",
        "labels": {
          "project_id": "stack-doctor"
        }
      },
      "metricKind": "GAUGE",
      "valueType": "DOUBLE",
      "points": [
        {
          "interval": {
            "startTime": "2020-01-06T17:17:00Z",
            "endTime": "2020-01-06T17:17:00Z"
          },
          "value": {
            "doubleValue": 1
    …

As expected, the output shows me the fractional ratio of good requests to total requests for each interval (that matches my alignmentPeriod) within the total interval (as specified by interval.startTime and endTime). For my service, each value was 1, meaning that 100% of the requests were good for each hourly interval.
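A response like this is easy to post-process. The sketch below decodes only the fields shown in the output above and reports the lowest value in the window. The worstValue helper and the sample body are made up for illustration:

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// timeSeriesResponse declares only the parts of the response used here.
type timeSeriesResponse struct {
	TimeSeries []struct {
		Points []struct {
			Value struct {
				DoubleValue float64 `json:"doubleValue"`
			} `json:"value"`
		} `json:"points"`
	} `json:"timeSeries"`
}

// worstValue returns the lowest doubleValue across all returned points.
func worstValue(body []byte) (float64, error) {
	var resp timeSeriesResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return 0, err
	}
	worst, found := 0.0, false
	for _, ts := range resp.TimeSeries {
		for _, p := range ts.Points {
			if !found || p.Value.DoubleValue < worst {
				worst, found = p.Value.DoubleValue, true
			}
		}
	}
	if !found {
		return 0, errors.New("response contains no points")
	}
	return worst, nil
}

func main() {
	// A trimmed, made-up response in the same shape as the output above.
	body := []byte(`{"timeSeries":[{"points":[
		{"value":{"doubleValue":1}},
		{"value":{"doubleValue":0.97}}
	]}]}`)
	v, err := worstValue(body)
	if err != nil {
		panic(err)
	}
	fmt.Printf("worst hourly value: %.2f\n", v)
}
```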

Error Budget Status

The second question I wanted to answer is "how much error budget do I have left?" The operator for that is select_slo_budget_fraction. The only change in the request is to change the filter:

filter:select_slo_budget_fraction("projects/860128900282/services/gae:stack-doctor_default/serviceLevelObjectives/2IooYmjTSROak0g9f-DmpA")  
interval.endTime:2020-01-06T17:17:00.0Z  
interval.startTime:2020-01-05T17:17:00.0Z  
aggregation.alignmentPeriod:3600s  
aggregation.perSeriesAligner:ALIGN_MEAN

After making a request to the timeSeries.list method, I got the following return:

{
  "timeSeries": [
    {
      "metric": {
        "type": "select_slo_budget_fraction(\"projects/860128900282/services/gae:stack-doctor_default/serviceLevelObjectives/2IooYmjTSROak0g9f-DmpA\")"
      },
      "resource": {
        "type": "gae_app",
        "labels": {
          "project_id": "stack-doctor"
        }
      },
      "metricKind": "GAUGE",
      "valueType": "DOUBLE",
      "points": [
        {
          "interval": {
            "startTime": "2020-01-06T17:17:00Z",
            "endTime": "2020-01-06T17:17:00Z"
          },
          "value": {
            "doubleValue": 1
          }
        },
    ….

As before, each "value" represents the fraction of the error budget remaining at that point. As my service is not burning error budget, the numbers stay at 1. I could have also used the select_slo_budget operator to get the actual remaining budget - the count of errors remaining.
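To make the two numbers concrete, here is a minimal sketch of the arithmetic behind a remaining-budget fraction for a request-based SLO. It is one plausible reading of what the service reports, not its actual implementation, and the request counts are made up:

```go
package main

import "fmt"

// budgetFractionRemaining returns the share of the error budget left,
// given the SLO goal and the request counts for the compliance period.
func budgetFractionRemaining(goal float64, total, bad int64) float64 {
	// Number of bad requests the SLO tolerates over the period.
	allowed := (1 - goal) * float64(total)
	if allowed <= 0 {
		// No traffic or a 100% goal: these degenerate cases are not
		// modeled here, so the sketch simply reports a full budget.
		return 1
	}
	return 1 - float64(bad)/allowed
}

func main() {
	// With a 98% goal and 1000 requests, about 20 errors are allowed.
	// Having used 5 of them leaves roughly three quarters of the budget.
	fmt.Printf("%.2f\n", budgetFractionRemaining(0.98, 1000, 5))

	// No errors at all leaves the budget untouched, as in the output above.
	fmt.Printf("%.2f\n", budgetFractionRemaining(0.98, 1000, 0))
}
```

A negative result would mean the budget is overspent, that is, the SLO has been violated for the period.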

In conclusion...

I hope you found this introduction to the Service Monitoring API useful. Thank you for reading, and let me know if you have any feedback. Until next time!

How I learned to stop worrying and debug in production

Yuri Grinshteyn — Wed, 04 Dec 2019 19:19:43 +0000

Incident management is one of the core practices of Site Reliability Engineering. As part of that, the SRE Book recommends focusing on prioritization during the incident itself. Specifically:

Prioritize. Stop the bleeding, restore service, and preserve the evidence for root-causing.

However, there may still be times when you need to debug in production - for example, you may be struggling to reproduce a problem locally or in a dev environment, and production may be the only place where it happens reliably enough. In this situation, redeploying with additional logging enabled may not be an option, especially if there's an incident in progress. There may be other times when an error isn't enough of a problem to impact your SLO, but you still want to fix it.

In the past, these situations required developers to add additional instrumentation to their code and wait for a new deployment for that instrumentation to show something useful. But what if that wasn't necessary? What if you could just inspect the code in production or add logging on the fly without having to redeploy?

Stackdriver Debugger is intended to enable exactly this. I wanted to try it for myself - here's how it went.

Setup

Agent

I started by writing a simple Node Express app that would return a 500 error half the time based on a random number generator. In order to use Debugger, I included the agent at the top of my code:

require('@google-cloud/debug-agent').start({
  projectId: projectID,
  keyFilename: './key.json',
  serviceContext: {
    service: serviceName,
    version: serviceVersion
  },
  allowExpressions: true
});

Some things of note here:

The agent takes the project ID and credentials as parameters. The latter is not necessary if running somewhere where access to the Debugger API doesn't need explicit authentication, like GCE, GKE, or App Engine. My plan was to simply run this locally, so I needed to add it explicitly.
The service and version parameters allow you to have multiple services available to debug at the same time.
The allowExpressions flag enables Debugger to, for example, view static or global variables that are not part of the local variable set.

Once I ran the code, I was able to select my app and the specified version:

Code

At this point, I was presented with a screen to select my source code.

Because my code is in GitHub, I chose that path. Once I selected my repo and branch, I was able to see the code!

Debugging

Snapshots

At this point, I was ready to start actually using Debugger. The first capability it provides is Snapshots - it's the ability to see the value of variables at a specific execution point - again, without actually stopping code execution. Specifically, I wanted to see if I could get it to show me the value of my randomInt variable that I was using to trigger an error. I added a snapshot by clicking on the line number:

I then refreshed the page running on my local server until I got an error and got a snapshot:

I was curious about the fact that my randomInt variable didn't show up. After a bit of digging, I realized that I had to add an expression to capture it:

This time, when I loaded the page, it was captured as I expected:

I was very happy with this - I can definitely see how useful this would be in a production debugging or troubleshooting scenario to, for example, capture the values that are being used in a calculation. I can even add multiple snapshot points and see the value of the variable change:

Logpoints

The second major capability of Debugger is the ability to add logging on the fly - that is, to essentially create additional log entries that get ingested into Stackdriver Logging and persist for 24 hours. Adding one is as easy as switching to the Logpoint tab, selecting a line of code, and specifying the message to be written:

Now, when I reloaded the page, I saw additional logging in my local console:

The log entries were added to stdout rather than sent to Stackdriver Logging automatically, so I needed to run this somewhere in GCP. I built a container image using Google Cloud Build and deployed it to Cloud Run. Once that was done, I saw my new service in Debugger:

I then selected the code from GitHub again and added a logpoint. This time, I got logs!

This is very cool - it's such an easy way to get more debugging info - without having to redeploy the app!

In conclusion…

Debugging in production CAN be done! Debugger is obviously not a replacement for being able to step through the code in an IDE (although it does have IDE integration), but it's a great way to add more debugging information to an app on the fly - without having to redeploy. Thanks for reading!

What's the best way to log errors (in Node.js)?

Yuri Grinshteyn — Fri, 15 Nov 2019 19:14:18 +0000

I wanted to address another one in the mostly-in-my-head series of questions with the running title of "things people often ask me". Today's episode in the series is all about logging errors to Stackdriver. Specifically, I've found that folks are somewhat confused about the multiple options they have for error logging and even more so when they want to understand how to log and track exceptions. My opinion is that this is in part caused by Stackdriver providing multiple features that enable this - Error Reporting and Logging. This is further confusing because Error Reporting is in a way a subset of Logging. As such, I set out to explore exactly what happens when I tried to log both errors and exceptions using Logging and Error Reporting in a sample Node.js app. Let's see what I found!

Logging Errors

I think that the confusion folks face starts with the fact that Stackdriver actually supports three different options for logging in Node.js - Bunyan, Winston, and the API client library. I wanted to see how the first two treat error logs. At this point, I do not believe we recommend using the client library directly (in the same way that we recommend using OpenCensus for metric telemetry, rather than calling the Monitoring API directly).

Logging with Bunyan

The documentation is pretty straightforward - setting up Bunyan logging in my app was very easy.

// *************** Bunyan logging setup *************
const bunyan = require('bunyan');
const {LoggingBunyan} = require('@google-cloud/logging-bunyan');
// Creates a Bunyan Stackdriver Logging client
const loggingBunyan = new LoggingBunyan();
// Create a Bunyan logger that streams to Stackdriver Logging
const bunyanLogger = bunyan.createLogger({
  name: serviceName, // this is set by an env var or as a parameter
  streams: [
    // Log to the console at 'info' and above
    {stream: process.stdout, level: 'info'},
    // And log to Stackdriver Logging, logging at 'info' and above
    loggingBunyan.stream('info'),
  ],
});

From there, logging an error message is as simple as:

app.get('/bunyan-error', (req, res) => {
    bunyanLogger.error('Bunyan error logged');
    res.send('Bunyan error logged!');
})

When I ran my app, I saw this logging output in the console:

{"name":"node-error-reporting","hostname":"ygrinshteyn-macbookpro1.roam.corp.google.com","pid":5539,"level":50,"msg":"Bunyan error logged","time":"2019-11-15T17:19:58.001Z","v":0}

And this in Stackdriver Logging:

Note that the log entry is created against the "global" resource because it is being sent from my local machine, which is not running on GCP, and the logName is bunyan_log. The output is nicely structured, and the severity is set to ERROR.
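
The console output shows a numeric "level":50, while the entry in Stackdriver Logging carries the ERROR severity. The sketch below illustrates how that translation could work. The numeric levels are Bunyan's own; the severity mapping is my assumption about what the Stackdriver Bunyan stream does, and severityFor is a hypothetical helper, not part of either library.

```javascript
// Assumed mapping from Bunyan's numeric levels to Stackdriver severities.
const LEVEL_TO_SEVERITY = new Map([
  [60, 'CRITICAL'], // fatal
  [50, 'ERROR'],    // error
  [40, 'WARNING'],  // warn
  [30, 'INFO'],     // info
  [20, 'DEBUG'],    // debug
  [10, 'DEBUG'],    // trace
]);

// Hypothetical helper: falls back to DEFAULT for unknown levels.
function severityFor(level) {
  return LEVEL_TO_SEVERITY.get(level) || 'DEFAULT';
}

// The console line from above, with the hostname omitted for brevity.
const consoleLine =
  '{"name":"node-error-reporting","pid":5539,"level":50,' +
  '"msg":"Bunyan error logged","time":"2019-11-15T17:19:58.001Z","v":0}';
const record = JSON.parse(consoleLine);

console.log(`${record.msg} -> ${severityFor(record.level)}`);
// prints "Bunyan error logged -> ERROR"
```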

Logging with Winston

I again followed the documentation to set up the Winston client:

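// ************* Winston logging setup *****************
const winston = require('winston');
const {LoggingWinston} = require('@google-cloud/logging-winston');
const loggingWinston = new LoggingWinston();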
// Create a Winston logger that streams to Stackdriver Logging
const winstonLogger = winston.createLogger({
  level: 'info',
  transports: [
    new winston.transports.Console(),
    // Add Stackdriver Logging
    loggingWinston,
  ],
});

Then I logged an error:

app.get('/winston-error', (req, res) => {
    winstonLogger.error('Winston error logged');
    res.send('Winston error logged!');
})

This time, the console output was much more concise:

{"message":"Winston error logged","level":"error"}

Here's what I saw in the Logs Viewer:

The severity was again set properly, but there's a lot less information in this entry. For example, my hostname is not logged. This may be a good choice for folks looking to reduce the amount of data that is logged while still retaining enough information to be useful.
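
To put a rough number on the difference, here is a minimal sketch that compares the two console lines shown above. It only measures the console output, which I'm assuming is a fair proxy for what each logger sends to Stackdriver; summarizeLine is a hypothetical helper, and the hostname is replaced with a placeholder.

```javascript
// The two console lines from above (hostname replaced with a placeholder).
const bunyanLine =
  '{"name":"node-error-reporting","hostname":"<hostname>","pid":5539,' +
  '"level":50,"msg":"Bunyan error logged","time":"2019-11-15T17:19:58.001Z","v":0}';
const winstonLine = '{"message":"Winston error logged","level":"error"}';

// Hypothetical helper: reports the size and top-level fields of a JSON log line.
function summarizeLine(line) {
  return {
    bytes: Buffer.byteLength(line, 'utf8'),
    fields: Object.keys(JSON.parse(line)),
  };
}

const bunyanInfo = summarizeLine(bunyanLine);
const winstonInfo = summarizeLine(winstonLine);

console.log('Bunyan:', bunyanInfo.bytes, 'bytes,', bunyanInfo.fields.join(', '));
console.log('Winston:', winstonInfo.bytes, 'bytes,', winstonInfo.fields.join(', '));
```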

Error Reporting

At this point, I had a good understanding of how logging errors works. I next wanted to investigate whether using Error Reporting for this purpose would provide additional value. First, I set up Error Reporting in the app:

// ************** Stackdriver Error Reporting setup **************
const {ErrorReporting} = require('@google-cloud/error-reporting');
const errors = new ErrorReporting(
  {
    projectId: projectID,
    reportMode: 'always',
    serviceContext: {
      service: serviceName,
      version: '1'
    }
  }
);

I then sent an error using the client:

app.get('/report-error', (req, res) => {
  res.send('Stackdriver error reported!');
  errors.report('Stackdriver error reported');
})

This time, there was no output in the console AND nothing was logged to Stackdriver Logging. I went to Error Reporting to find my error:

When I clicked on the error, I was able to get a lot of detail:

This is great because I can see when the error started happening, I get a histogram if and when it continues to happen, and I get a full stack trace showing me exactly where in my code the error is generated - this is all incredibly valuable information that I don't get from simply logging with the ERROR severity.

The tradeoff here is that this message never makes it to Stackdriver Logging. This means that I can't use errors reported through Error Reporting to, for example, create log based metrics, which may make for a great SLI and/or alerting policy condition.
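
One possible way around this tradeoff is to send a non-exception error to both destinations. This is a minimal sketch, not something I tested against Stackdriver: logAndReport is a hypothetical helper, and the stand-in logger and reporter exist only so the snippet runs without GCP credentials. In the app above, bunyanLogger and errors would take their place.

```javascript
// Hypothetical helper: writes a non-exception error to both destinations.
// `logger` is anything with an error(message) method (e.g. bunyanLogger);
// `reporter` is anything with a report(message) method (e.g. errors).
function logAndReport(logger, reporter, message) {
  logger.error(message);    // intended for Stackdriver Logging and log-based metrics
  reporter.report(message); // intended for Error Reporting and its stack trace view
}

// Stand-ins that record calls instead of talking to GCP.
const seen = [];
const fakeLogger = {error: (msg) => seen.push(`log:${msg}`)};
const fakeReporter = {report: (msg) => seen.push(`report:${msg}`)};

logAndReport(fakeLogger, fakeReporter, 'Stackdriver error reported');
console.log(seen);
```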

Logging Exceptions

Next, I wanted to investigate what would happen if my app were to throw an exception and log it - how would it show up? I used Bunyan to log an exception:

app.get('/log-exception', (req, res) => {
  res.send('exception');
  bunyanLogger.error(new Error('exception logged'));
})

The console output contained the entire exception:

{"name":"node-error-reporting","hostname":"<hostname>","pid":5539,"level":50,"err":{"message":"exception logged","name":"Error","stack":"Error: exception logged\n    at app.get (/Users/ygrinshteyn/src/error-reporting-demo/app.js:72:22)\n    at Layer.handle [as handle_request] (/Users/ygrinshteyn/src/error-reporting-demo/node_modules/express/lib/router/layer.js:95:5)\n    at next (/Users/ygrinshteyn/src/error-reporting-demo/node_modules/express/lib/router/route.js:137:13)\n    at Route.dispatch (/Users/ygrinshteyn/src/error-reporting-demo/node_modules/express/lib/router/route.js:112:3)\n    at Layer.handle [as handle_request] (/Users/ygrinshteyn/src/error-reporting-demo/node_modules/express/lib/router/layer.js:95:5)\n    at /Users/ygrinshteyn/src/error-reporting-demo/node_modules/express/lib/router/index.js:281:22\n    at Function.process_params (/Users/ygrinshteyn/src/error-reporting-demo/node_modules/express/lib/router/index.js:335:12)\n    at next (/Users/ygrinshteyn/src/error-reporting-demo/node_modules/express/lib/router/index.js:275:10)\n    at expressInit (/Users/ygrinshteyn/src/error-reporting-demo/node_modules/express/lib/middleware/init.js:40:5)\n    at Layer.handle [as handle_request] (/Users/ygrinshteyn/src/error-reporting-demo/node_modules/express/lib/router/layer.js:95:5)"},"msg":"exception logged","time":"2019-11-15T17:47:50.981Z","v":0}

The logging entry looked like this:

And the jsonPayload contained the exception:

This is definitely a lot of useful data. I next wanted to see if Error Reporting would work as advertised and identify this exception in the log as an error. After carefully reviewing the documentation, I realized that this functionality works specifically on GCE, GKE, App Engine, and Cloud Functions, whereas I was just running my code on my local desktop. I tried running the code in Cloud Shell and immediately got a new entry in Error Reporting:

The full stack trace of the exception is available in the detail view:

So, logging an exception gives me the best of both worlds - I get a logging entry that I can use for things like log based metrics, and I get an entry in Error Reporting that I can use for analysis and tracking.

Reporting Exceptions

I next wanted to see what would happen if I used Error Reporting to report the same exception.

app.get('/report-exception', (req, res) => {
  res.send('exception');
  errors.report(new Error('exception reported'));
})

Once again, there was no console output. My error was immediately visible in Error Reporting:

And somewhat to my surprise, I was able to see an entry in Logging, as well:

As it turns out, exceptions are recorded in both Error Reporting AND Logging - no matter which of the two you use to send them.

So, what now?

Here's what I've learned from this exercise:

Bunyan logging is more verbose than Winston, which could be a consideration if cost is an issue.
Exceptions can be sent to Stackdriver through Logging or Error Reporting - they will then be available in both.
Using Error Reporting to report non-exception errors adds a lot of value for developers, but gives up value for SREs or ops folks who need to use logs for metrics or SLIs.

Thanks for joining me - come back soon for more!

Can you alert on logs in Stackdriver?

Yuri Grinshteyn — Tue, 15 Oct 2019 17:44:37 +0000

Introduction

One of the more common questions I hear from folks who are using either or both Stackdriver Logging and Monitoring is "how do I create an alert on errors in my logs?" Generally, they ask for one of two reasons:

They really want to be notified every time an error is logged - as you might expect, this is an opportunity for me to talk to them about reliability targets, SLOs, and good alerting practices.
They use logs as a kind of SLI and really do need to be alerted when the number of messages that meet some criteria (e.g. errors) exceeds a particular threshold.

In this post, I am going to address the latter scenario. I'll start by covering what logs and metrics are typically used for and the kinds of data they contain. From there, I'll review creating metrics from logs in Stackdriver and using those metrics to create charts in dashboards and alerts.

Note that a lot of this information is already covered in the documentation and in this excellent blog post by Mary Koes, a product manager on the Stackdriver team. Nevertheless, this is my attempt to create a single coherent story with practical examples of how to get started with log-based metrics in Stackdriver.

Logs and metrics

Before we start, it's important to have a common understanding of exactly what we mean when we talk about logs and metrics. Together, they are two of the three pillars (with traces being the third) of observability.

Note that I am generally referring to "observability 1.0" here. I will readily defer to folks like Charity Majors and Liz Fong-Jones on the current state of the art where observability is concerned. You can start by reading Charity's retrospective here, and Ben Sigelman addresses the idea of the three pillars directly here.

With that sorted - let's get back to the issue at hand. We use logs as data points that specifically describe an event that takes place in our system. Logs are written by our code, by the platform our code is running on, and the infrastructure we depend on (for the purposes of this post, I'm leaving audit logs out of scope, and I will return to them directly in a separate post). Because logs in modern systems are the descendant of (and sometimes still are) text log files written to disk, I consider a log entry, analogous to a line in a log file, to be the quantum unit of logging. An entry will generally consist of exactly two things - a timestamp that indicates either when the event took place or when it was ingested into our logging system and the text payload, either as unstructured text data or structured data, most commonly in JSON. Logs can also carry associated metadata, especially when they're ingested into Stackdriver, like the resource that's writing them, the log name, and a severity for each entry. We use logs for two main purposes:

Event logs describe specific events that take place within our system - we may use them to output messages that assure the developer that things are working well ("task succeeded") or to provide information when things are not ("received exception from server").
Transaction logs describe the details of every transaction processed by a system or component. For example, a load balancer will log every request that it receives, whether it was successfully completed or not, and include additional information like the requested URL, HTTP response code, and possibly things like which backend was used to serve the request.
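
To make the anatomy of a log entry concrete, here is a small sketch of a structured entry. The field names follow my understanding of the Stackdriver LogEntry format; the project, log name, and payload are made-up placeholders, and essentials is a hypothetical helper.

```javascript
// A hypothetical structured log entry with Stackdriver-style metadata.
const exampleEntry = {
  timestamp: '2019-10-15T17:44:37.000Z',
  severity: 'ERROR',
  logName: 'projects/my-project/logs/my-log', // placeholder project and log
  resource: {type: 'global'},
  jsonPayload: {message: 'received exception from server'},
};

// Hypothetical helper: strips an entry down to the two things every
// entry has, a timestamp and a payload (structured or plain text).
function essentials(entry) {
  const payload =
    entry.jsonPayload !== undefined ? entry.jsonPayload : entry.textPayload;
  return {timestamp: entry.timestamp, payload};
}

console.log(essentials(exampleEntry));
```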

Metrics, on the other hand, are not generally thought of as describing specific events (again, this is changing). More commonly, they're used to represent the state or health of your system over time. A metric is made up of a series of points, each of which includes the timestamp and a numeric value. Metrics also have metadata associated with them - the series of points, commonly referred to as a timeseries, will include things like the name, description, and often labels that help determine which resource is writing the metric.
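
For comparison with a log entry, here is a simplified sketch of a timeseries. It is only an illustration of the shape described above, not the exact format the Monitoring API uses; the metric name, labels, and values are made up, and latestValue is a hypothetical helper.

```javascript
// A simplified, hypothetical timeseries: metadata plus a series of points.
const exampleSeries = {
  name: 'request_latency',
  description: 'Latency of requests, in milliseconds',
  labels: {service: 'my-service'}, // placeholder label
  points: [
    {timestamp: '2019-10-15T17:44:00Z', value: 120},
    {timestamp: '2019-10-15T17:46:00Z', value: 95},
    {timestamp: '2019-10-15T17:45:00Z', value: 310},
  ],
};

// Hypothetical helper: returns the value of the most recent point.
function latestValue(series) {
  const sorted = [...series.points].sort(
    (a, b) => Date.parse(a.timestamp) - Date.parse(b.timestamp)
  );
  return sorted.length > 0 ? sorted[sorted.length - 1].value : null;
}

console.log(latestValue(exampleSeries)); // prints 95
```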

In Stackdriver specifically, metrics are the only kind of data that can be used to create alerts via alerting policies. As such, it'll be important to understand how to use logs to create metrics - let's do that now.

Log-based metrics

The term "log-based metrics" is rather specific to Stackdriver, but the idea is straightforward. First, Stackdriver provides a simple mechanism to count the number of log entries per minute that match a filter - referred to as a "counter metric". This is what we'll use if we want to, for example, use load balancer logs as our service level indicator for availability of a service - we can create a metric that counts how many errors we see, and we'll use an alerting policy to alert when that value exceeds a threshold we deem acceptable. The process is documented here, but I find that a specific example is always helpful.
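
Conceptually, a counter metric boils down to something like the sketch below. This is only a local illustration of the idea, not how Stackdriver implements it; countPerMinute, the sample entries, and the threshold are all hypothetical, and the timestamps are assumed to be ISO 8601 strings in UTC.

```javascript
// Counts the entries that match a filter, grouped by the minute they were written.
function countPerMinute(entries, matches) {
  const counts = new Map();
  for (const entry of entries) {
    if (!matches(entry)) continue;
    const minute = entry.timestamp.slice(0, 16); // 'YYYY-MM-DDTHH:MM'
    counts.set(minute, (counts.get(minute) || 0) + 1);
  }
  return counts;
}

const sampleEntries = [
  {timestamp: '2019-10-15T17:44:05Z', severity: 'ERROR'},
  {timestamp: '2019-10-15T17:44:40Z', severity: 'INFO'},
  {timestamp: '2019-10-15T17:44:59Z', severity: 'ERROR'},
  {timestamp: '2019-10-15T17:45:10Z', severity: 'ERROR'},
];

// The "filter" here is simply: severity is ERROR.
const errorCounts = countPerMinute(sampleEntries, (e) => e.severity === 'ERROR');

// A threshold check, in the spirit of an alerting policy condition.
const threshold = 1;
const busyMinutes = [...errorCounts]
  .filter(([, count]) => count > threshold)
  .map(([minute]) => minute);

console.log([...errorCounts]); // [['2019-10-15T17:44', 2], ['2019-10-15T17:45', 1]]
console.log(busyMinutes);      // ['2019-10-15T17:44']
```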

Counter metric - errors

Let's take a look at a simple example. I've written a small app that writes an error 20% of the time. When I run the code locally after authenticating through the Google Cloud SDK, here are the log entries:

From here, I'd like to know when my error rate exceeds a particular threshold. First, I need to create a filter for all the logs that contain the error. An easy way is to expand the log, find the message I'd like to key on in the payload, click it, and select show matching entries.

This creates an Advanced Filter and shows me the failure messages:

Next, I can use the Create Metric feature to create a Counter Metric:

Once I click Create Metric, I can then go to Stackdriver Monitoring and see it there:

Distribution metric - latency

Now that you've seen how to create a simple counter metric that will track the number of errors per minute, let's take a look at the other reason we might want to use log-based metrics - to track a specific numeric value in the log payload. I've created a second example - this time, I'm introducing a randomly generated delay in the code and logging it as the latency. Here's what the payload looks like:

I want to create a metric that will capture the value in the "message" field. To do that, I again use the "Show matching entries" feature and create a metric from the selection. I need to use a regular expression to parse the field and extract the numeric value. Note that I modified the selection filter to look for all the messages coming from my local machine, where the Node.js code is running, by using the logName filter.
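
The extraction step can be tried out locally before creating the metric. The sketch below assumes a message of the form "latency: 480", which is a made-up format, so the pattern would need adjusting to the real payload. extractLatency and toDistribution are hypothetical helpers that only mimic the idea of a distribution metric.

```javascript
// Hypothetical message format; the real payload may differ.
const LATENCY_PATTERN = /latency: (\d+(?:\.\d+)?)/;

// Pulls the numeric value out of the "message" field, or null if absent.
function extractLatency(entry) {
  const message = entry.jsonPayload && entry.jsonPayload.message;
  const match = typeof message === 'string' ? message.match(LATENCY_PATTERN) : null;
  return match ? Number(match[1]) : null;
}

// Groups values into fixed-width buckets, keyed by each bucket's lower bound.
function toDistribution(values, bucketWidth) {
  const buckets = new Map();
  for (const value of values) {
    const lowerBound = Math.floor(value / bucketWidth) * bucketWidth;
    buckets.set(lowerBound, (buckets.get(lowerBound) || 0) + 1);
  }
  return buckets;
}

const latencyEntries = [
  {jsonPayload: {message: 'latency: 120'}},
  {jsonPayload: {message: 'latency: 480'}},
  {jsonPayload: {message: 'latency: 455'}},
  {jsonPayload: {message: 'something else'}}, // no value, so it is skipped
];

const latencies = latencyEntries.map(extractLatency).filter((v) => v !== null);
const distribution = toDistribution(latencies, 100);

console.log(latencies);         // [120, 480, 455]
console.log([...distribution]); // [[100, 1], [400, 2]]
```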

As before, I create the metric and view it in Metrics Explorer:

Using metrics

Now that we have our metrics created, we can use them just like any other metric in Stackdriver Monitoring. We can create charts with them and use them in Alerting Policies.

Charts

As an example, I created a chart for the latency I'm writing as a log value:

One great thing about log-based metrics is that you can easily see the logs that feed them. Click on the 3 dot menu for the chart and select View Logs:

The result is an advanced filter that shows you the logs that were ingested within the timeframe that the chart or dashboard was set to select:

Alerts

To address the original question raised at the start - we can also use our metrics as the basis for alerts. For example, if we wanted to know when our error rate exceeded a specific threshold, we can simply use our error metric in an alerting policy condition:

We can do the same for our distribution metric that captures latency:

Note that until a week or so ago, the documentation stated that alerting is not supported for distribution metrics - this is not true, and you can alert on distribution metrics by using a percentile aligner (with thanks to Summit for catching the documentation error).
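
To illustrate what alerting on a percentile means, here is a minimal sketch that computes a nearest-rank percentile over the raw samples in one window and compares it to a threshold. Stackdriver works from the distribution's buckets rather than raw samples, so this is an approximation of the idea only; the function names, sample values, and threshold are hypothetical.

```javascript
// Nearest-rank percentile: the smallest sample such that at least p percent
// of all samples are less than or equal to it.
function percentile(samples, p) {
  if (samples.length === 0) return null;
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

// True when the chosen percentile of the window exceeds the threshold.
function breachesThreshold(samples, p, threshold) {
  const value = percentile(samples, p);
  return value !== null && value > threshold;
}

// Made-up latency samples (in ms) for a single alignment window.
const windowSamples = [120, 455, 480, 950, 130, 210, 300, 275, 640, 190];

console.log(percentile(windowSamples, 50));             // 275
console.log(percentile(windowSamples, 99));             // 950
console.log(breachesThreshold(windowSamples, 99, 900)); // true
console.log(breachesThreshold(windowSamples, 50, 900)); // false
```

Alerting on a high percentile rather than the mean keeps a handful of slow requests from being averaged away.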

Summary and conclusion

I hope that you now have a better understanding of how to create metrics from logs in Stackdriver and how you can use those metrics to visualize data with charts and create alerts with alerting policies. As always, I appreciate your feedback, questions, and ideas for topics to address in the future.

Introduction to open source observability tools on Kubernetes

Yuri Grinshteyn — Mon, 07 Oct 2019 14:08:04 +0000

This is my first post here, so I'll just start by cross-posting an article I just published on opensource.com:

https://opensource.com/article/19/10/open-source-observability-kubernetes

Enjoy!