<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Yuri Grinshteyn</title>
    <description>The latest articles on DEV Community by Yuri Grinshteyn (@yurigrinshteyn).</description>
    <link>https://dev.to/yurigrinshteyn</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F193385%2F3b324705-8d19-4e67-abc8-e3050efeefc7.jpeg</url>
      <title>DEV Community: Yuri Grinshteyn</title>
      <link>https://dev.to/yurigrinshteyn</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/yurigrinshteyn"/>
    <language>en</language>
    <item>
      <title>Integrating Tracing and Logging with OpenTelemetry and Stackdriver</title>
      <dc:creator>Yuri Grinshteyn</dc:creator>
      <pubDate>Wed, 05 Feb 2020 00:16:35 +0000</pubDate>
      <link>https://dev.to/yurigrinshteyn/integrating-tracing-and-logging-with-opentelemetry-and-stackdriver-1ec0</link>
      <guid>https://dev.to/yurigrinshteyn/integrating-tracing-and-logging-with-opentelemetry-and-stackdriver-1ec0</guid>
      <description>&lt;p&gt;One of the main benefits of using an all-in-one observability suite like Stackdriver is that it provides all of the capabilities you may need.  Specifically, your metrics, traces, and logs are all in one place, and with the GA &lt;a href="https://cloud.google.com/monitoring/docs/monitoring_in_console"&gt;release&lt;/a&gt; of Monitoring in the Cloud Console, that's more true than ever before. However, for the most part, each of these data elements are still mostly independent, and I wanted to attempt to try to unify two of them - traces and logs.&lt;/p&gt;

&lt;p&gt;The idea for the project was inspired by the excellent work &lt;a href="https://github.com/alexamies"&gt;Alex Amies&lt;/a&gt; did in his &lt;a href="https://cloud.google.com/solutions/troubleshooting-app-latency-with-cloud-spanner-and-opencensus"&gt;Reference Guide&lt;/a&gt; on using OpenCensus to measure Spanner performance and troubleshoot latency.  Specifically, he included an applog &lt;a href="https://github.com/GoogleCloudPlatform/opencensus-spanner-demo/tree/master/applog"&gt;package&lt;/a&gt; that integrated traces and logs in OpenCensus:  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--__lfI6i_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cloud.google.com/solutions/images/troubleshooting-app-latency-with-cloud-spanner-and-opencensus-7-trace-log.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--__lfI6i_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cloud.google.com/solutions/images/troubleshooting-app-latency-with-cloud-spanner-and-opencensus-7-trace-log.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I wanted to follow my &lt;a href="https://dev.to/yurigrinshteyn/distributed-tracing-with-opentelemetry-in-go-473h"&gt;post&lt;/a&gt; on tracing with OpenTelemetry and attempt to create integrated traces and logs.  Let's dive in!&lt;/p&gt;

&lt;h2&gt;
  
  
  The app
&lt;/h2&gt;

&lt;p&gt;I created a very basic Go app using the Mux router that:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Receives a request on &lt;code&gt;/&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Sleeps between 0 and 9 seconds.&lt;/li&gt;
&lt;li&gt;Makes a backend call (to &lt;a href="https://www.google.com"&gt;https://www.google.com&lt;/a&gt;)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;My intent was to create a root span with two children - one for the delay that simulates an internal process and another for the backend call.  &lt;/p&gt;

&lt;h2&gt;
  
  
  The code
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Main function
&lt;/h3&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;initTracer&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;initLogger&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;defer&lt;/span&gt; &lt;span class="n"&gt;closeLogger&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;mux&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewRouter&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;HandleFunc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"/"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mainHandler&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;env&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s"&gt;"LOCAL"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ListenAndServe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"localhost:8080"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ListenAndServe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;":8080"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;The main function simply sets up my tracing and logging and uses the &lt;code&gt;mainHandler&lt;/code&gt; to respond to requests on &lt;code&gt;/&lt;/code&gt;. &lt;/p&gt;

&lt;h3&gt;
  
  
  Tracing setup
&lt;/h3&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;initTracer&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c"&gt;// Create Stackdriver exporter to be able to retrieve&lt;/span&gt;
    &lt;span class="c"&gt;// the collected spans.&lt;/span&gt;
    &lt;span class="n"&gt;exporter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;stackdriver&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewExporter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;stackdriver&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WithProjectID&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;projectID&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Fatal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c"&gt;// For the demonstration, use sdktrace.AlwaysSample sampler to sample all traces.&lt;/span&gt;
    &lt;span class="c"&gt;// In a production application, use sdktrace.ProbabilitySampler with a desired probability.&lt;/span&gt;
    &lt;span class="n"&gt;tp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;sdktrace&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewProvider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sdktrace&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WithConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sdktrace&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Config&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;DefaultSampler&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;sdktrace&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AlwaysSample&lt;/span&gt;&lt;span class="p"&gt;()}),&lt;/span&gt;
        &lt;span class="n"&gt;sdktrace&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WithSyncer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;exporter&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Fatal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;global&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SetTraceProvider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;The tracing set up is pretty straightforward - I'm simply using the exporter written by &lt;a href="https://github.com/ymotongpoo"&gt;Yoshi Yamaguchi&lt;/a&gt;, a fantastic Developer Advocate.  It's the same exporter I used in my post on tracing without any changes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Logging setup
&lt;/h3&gt;

&lt;p&gt;This is where things start to get interesting.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;initLogger&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Background&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;
    &lt;span class="n"&gt;loggingClient&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;projectID&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Printf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Failed to create logging client: %v"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Printf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Stackdriver Logging initialized with project id %s, see Cloud "&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;
        &lt;span class="s"&gt;" Console under GCE VM instance &amp;gt; all instance_id&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;projectID&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;I've largely lifted this from Alex's &lt;a href="https://github.com/GoogleCloudPlatform/opencensus-spanner-demo/blob/master/applog/applog.go"&gt;work&lt;/a&gt;.  The init function simply sets up the logging client.&lt;/p&gt;

&lt;h3&gt;
  
  
  Writing logs
&lt;/h3&gt;

&lt;p&gt;This is where the trace/logging integration really happens.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// Send to Cloud Logging service including reference to current span&lt;/span&gt;
&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;printWithTrace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;format&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="k"&gt;interface&lt;/span&gt;&lt;span class="p"&gt;{})&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;printf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Info&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;format&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c"&gt;// Send to Cloud Logging service including reference to current span&lt;/span&gt;
&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;printf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;severity&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Severity&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;format&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="k"&gt;interface&lt;/span&gt;&lt;span class="p"&gt;{})&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;span&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SpanFromContext&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;sCtx&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SpanContext&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;tr&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;sCtx&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TraceIDString&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;lg&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;loggingClient&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Logger&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;LOGNAME&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;trace&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Sprintf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"projects/%s/traces/%s"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;projectID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;lg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Entry&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;Severity&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;severity&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;Payload&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;  &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Sprintf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;format&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;Trace&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;    &lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;SpanID&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;   &lt;span class="n"&gt;sCtx&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SpanIDString&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;In Stackdriver, traces and logs can be connected by writing the span ID and the trace ID in the payload of the log message.  Here, I'm using the context to extract both the span and trace and then extract their IDs.  I then write them to the log payload.  Here's what a resulting log message looks like:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--eRFNwmfE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://github.com/yuriatgoogle/stack-doctor/raw/master/opentelemetry-traces-logs/images/logentry.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--eRFNwmfE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://github.com/yuriatgoogle/stack-doctor/raw/master/opentelemetry-traces-logs/images/logentry.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Notice that the &lt;code&gt;spanId&lt;/code&gt; and &lt;code&gt;trace&lt;/code&gt; fields are populated appropriately.&lt;/p&gt;

&lt;h2&gt;
  
  
  Viewing traces
&lt;/h2&gt;

&lt;p&gt;I can run the app locally (after using &lt;code&gt;gcloud auth application-default login&lt;/code&gt; to write default credentials) and send traffic to &lt;a href="http://localhost:8080"&gt;http://localhost:8080&lt;/a&gt;.  Here's a resulting trace:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--6xqNLqNu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://github.com/yuriatgoogle/stack-doctor/raw/master/opentelemetry-traces-logs/images/trace.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--6xqNLqNu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://github.com/yuriatgoogle/stack-doctor/raw/master/opentelemetry-traces-logs/images/trace.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Note that the trace contains both Events - added with the &lt;code&gt;span.AddEvent()&lt;/code&gt; method - and logs, written as described above.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Viewing logs
&lt;/h2&gt;

&lt;p&gt;I can click on one of the log entries to see the full details of the log message:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--B38DI_nQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://github.com/yuriatgoogle/stack-doctor/raw/master/opentelemetry-traces-logs/images/tracewithlogs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--B38DI_nQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://github.com/yuriatgoogle/stack-doctor/raw/master/opentelemetry-traces-logs/images/tracewithlogs.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I can then click Open in Logs Viewer and see this log entry there:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--wBPYxw1E--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://github.com/yuriatgoogle/stack-doctor/raw/master/opentelemetry-traces-logs/images/logs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--wBPYxw1E--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://github.com/yuriatgoogle/stack-doctor/raw/master/opentelemetry-traces-logs/images/logs.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  In conclusion...
&lt;/h2&gt;

&lt;p&gt;I was very glad to see that this somewhat underappreciated functionality from OpenCensus still works in OpenTelemetry with minor changes.  Specifically, I had to find the new APIs to use in &lt;code&gt;printf()&lt;/code&gt; to extract the span from context and then get its span ID and trace ID, and this does not seem to be well documented.  With that said, I hope this brief tutorial is useful to others looking to build a more integrated approach to observability with Stackdriver, especially in distributed systems.  Many thanks for Alex for doing the original work on this, and thank you for reading!&lt;/p&gt;

</description>
      <category>opentelemetry</category>
      <category>tracing</category>
      <category>stackdriver</category>
      <category>logging</category>
    </item>
    <item>
      <title>Distributed Tracing with OpenTelemetry in Go</title>
      <dc:creator>Yuri Grinshteyn</dc:creator>
      <pubDate>Thu, 09 Jan 2020 22:26:33 +0000</pubDate>
      <link>https://dev.to/yurigrinshteyn/distributed-tracing-with-opentelemetry-in-go-473h</link>
      <guid>https://dev.to/yurigrinshteyn/distributed-tracing-with-opentelemetry-in-go-473h</guid>
      <description>&lt;p&gt;Toward the end of last year, I had the good fortune of publishing a reference &lt;a href="https://cloud.google.com/solutions/using-distributed-tracing-to-observe-microservice-latency-with-opencensus-and-stackdriver-trace"&gt;guide&lt;/a&gt; on using OpenCensus for distributed tracing.  In it, I covered distributed tracing fundamentals, like traces, spans, and context propagation, and demonstrated using OpenCensus to instrument a simple pair of frontend/backend services written in Go.  Since then, the OpenCensus and OpenTracing projects have merged into &lt;a href="https://opentelemetry.io"&gt;OpenTelemetry&lt;/a&gt;, a "single set of APIs, libraries, agents, and collector services to capture distributed traces and metrics from your application."  I wanted to attempt to reproduce the work I did in OpenCensus using the new project and see how much has changed.&lt;/p&gt;

&lt;h1&gt;
  
  
  Objective
&lt;/h1&gt;

&lt;p&gt;For this exercise, I built a simple &lt;a href="https://github.com/yuriatgoogle/stack-doctor/opentelemetry-tracing-demo"&gt;demo&lt;/a&gt;.  It consists of two services. The frontend service receives an incoming request and makes a request to the backend.  The backend receives the request and returns a response.  Our objective is to trace this interaction to determine the overall response latency and understand how the two services and the nework connectivity between them contribute to the overall latency.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--XLaObwpw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://github.com/yuriatgoogle/stack-doctor/raw/master/opentelemetry-tracing-demo/images/1-architecture.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--XLaObwpw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://github.com/yuriatgoogle/stack-doctor/raw/master/opentelemetry-tracing-demo/images/1-architecture.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the original guide, the two services were deployed in two separate GKE clusters, but that is actually not necessary to demonstrate distributed tracing.  For this exercise, we'll simply run both services locally.&lt;/p&gt;

&lt;h1&gt;
  
  
  Primitives
&lt;/h1&gt;

&lt;p&gt;While the basic concepts are covered in the reference guide and in much greater detail in the Google Dapper research &lt;a href="https://research.google.com/archive/papers/dapper-2010-1.pdf"&gt;paper&lt;/a&gt;, it's still worth briefly covering them here such that we can then understand how they're implemented in the code.  &lt;/p&gt;

&lt;p&gt;From the reference guide:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A &lt;em&gt;trace&lt;/em&gt; is the total of information that describes how a distributed system responds to a user request. Traces are composed of &lt;em&gt;spans&lt;/em&gt;, where each span represents a specific request and response pair involved in serving the user request. The &lt;em&gt;parent&lt;/em&gt; span describes the latency as observed by the end user. Each of the &lt;em&gt;child&lt;/em&gt; spans describes how a particular service in the distributed system was called and responded to, with latency information captured for each.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is well illustrated in the aforementioned research paper using this diagram:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--FbdC4IlC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://github.com/yuriatgoogle/stack-doctor/raw/master/opentelemetry-tracing-demo/images/2-diagram.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--FbdC4IlC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://github.com/yuriatgoogle/stack-doctor/raw/master/opentelemetry-tracing-demo/images/2-diagram.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Implementation
&lt;/h1&gt;

&lt;p&gt;Let's take a look at how we can implement distributed tracing in our frontend/backend service pair using OpenTelemetry. &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt; that most of this is adopted from the &lt;a href="https://github.com/open-telemetry/opentelemetry-go/tree/master/example"&gt;samples&lt;/a&gt; published by OpenTelemetry in their Github &lt;a href="https://github.com/open-telemetry/opentelemetry-go"&gt;repo&lt;/a&gt;. I made relatively minor changes to add custom spans and use the Mux router, rather than just basic HTTP handling. &lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Frontend code
&lt;/h2&gt;

&lt;p&gt;We'll start by reviewing the frontend &lt;a href="https://github.com/yuriatgoogle/stack-doctor/blob/master/opentelemetry-tracing-demo/go/frontend/frontend.go"&gt;code&lt;/a&gt;.  &lt;/p&gt;

&lt;h3&gt;
  
  
  Imports
&lt;/h3&gt;

&lt;p&gt;First, the imports:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s"&gt;"fmt"&lt;/span&gt;
    &lt;span class="s"&gt;"log"&lt;/span&gt;
    &lt;span class="s"&gt;"net/http"&lt;/span&gt;
    &lt;span class="s"&gt;"os"&lt;/span&gt;
    &lt;span class="s"&gt;"context"&lt;/span&gt;
    &lt;span class="s"&gt;"io/ioutil"&lt;/span&gt;
    &lt;span class="s"&gt;"google.golang.org/grpc/codes"&lt;/span&gt;

    &lt;span class="s"&gt;"github.com/gorilla/mux"&lt;/span&gt;

    &lt;span class="s"&gt;"go.opentelemetry.io/otel/api/distributedcontext"&lt;/span&gt;
    &lt;span class="s"&gt;"go.opentelemetry.io/otel/api/global"&lt;/span&gt;
    &lt;span class="s"&gt;"go.opentelemetry.io/otel/api/trace"&lt;/span&gt;
    &lt;span class="s"&gt;"go.opentelemetry.io/otel/exporter/trace/stackdriver"&lt;/span&gt;
    &lt;span class="s"&gt;"go.opentelemetry.io/otel/plugin/httptrace"&lt;/span&gt;
    &lt;span class="n"&gt;sdktrace&lt;/span&gt; &lt;span class="s"&gt;"go.opentelemetry.io/otel/sdk/trace"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Mostly, we're using a variety of OpenTelemetry libraries at this point.  We'll also use the &lt;a href="https://github.com/gorilla/mux"&gt;Mux&lt;/a&gt; router to handle HTTP requests (I mostly use it because it seem to be similar to Express in Node.js).  &lt;/p&gt;

&lt;h3&gt;
  
  
  Main function
&lt;/h3&gt;

&lt;p&gt;Next, let's have a look at the &lt;code&gt;main()&lt;/code&gt; function for our service:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;initTracer&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;mux&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewRouter&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;HandleFunc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"/"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mainHandler&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="s"&gt;"LOCAL"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ListenAndServe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"localhost:8080"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ListenAndServe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;":8080"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;As you can tell, this is pretty straighforward.  We initialize tracing right at the start and use a Mux router to handle a single route for requests to &lt;code&gt;/&lt;/code&gt;.  We then start the server on port 8080.  I added an environment variable to check to see whether I'm running the code locally to bypass the MacOS prompt to allow inbound network connections as per these &lt;a href="https://medium.com/@leeprovoost/suppressing-accept-incoming-network-connections-warnings-on-osx-7665b33927ca"&gt;instructions&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Initialize tracing
&lt;/h3&gt;

&lt;p&gt;Next, let's take a look at the &lt;code&gt;initTracer()&lt;/code&gt; function:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;initTracer&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;

    &lt;span class="c"&gt;// Create Stackdriver exporter to be able to retrieve&lt;/span&gt;
    &lt;span class="c"&gt;// the collected spans.&lt;/span&gt;
    &lt;span class="n"&gt;exporter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;stackdriver&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewExporter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;stackdriver&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WithProjectID&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;projectID&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Fatal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c"&gt;// For the demonstration, use sdktrace.AlwaysSample sampler to sample all traces.&lt;/span&gt;
    &lt;span class="c"&gt;// In a production application, use sdktrace.ProbabilitySampler with a desired probability.&lt;/span&gt;
    &lt;span class="n"&gt;tp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;sdktrace&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewProvider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sdktrace&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WithConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sdktrace&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Config&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;DefaultSampler&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;sdktrace&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AlwaysSample&lt;/span&gt;&lt;span class="p"&gt;()}),&lt;/span&gt;
        &lt;span class="n"&gt;sdktrace&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WithSyncer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;exporter&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Fatal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;global&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SetTraceProvider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Here, we're simply instantiating the Stackdriver exporter and settting the sampling parameter to capture every trace.  &lt;/p&gt;

&lt;h3&gt;
  
  
  Handle requests
&lt;/h3&gt;

&lt;p&gt;Finally, let's look at the &lt;code&gt;mainHandler()&lt;/code&gt; function that is called to handle requests to &lt;code&gt;/&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;mainHandler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ResponseWriter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;

    &lt;span class="n"&gt;tr&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;global&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TraceProvider&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Tracer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"OT-tracing-demo"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DefaultClient&lt;/span&gt;
    &lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;distributedcontext&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewContext&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Background&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

    &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;body&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt;

    &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;tr&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WithSpan&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"incoming call"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c"&gt;// root span here&lt;/span&gt;
        &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;

            &lt;span class="c"&gt;// create child span&lt;/span&gt;
            &lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;childSpan&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;tr&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Start&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"backend call"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;childSpan&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AddEvent&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"making backend call"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

            &lt;span class="c"&gt;// create backend request&lt;/span&gt;
            &lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewRequest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"GET"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;backendAddr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

            &lt;span class="c"&gt;// inject context&lt;/span&gt;
            &lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;req&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;httptrace&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;W3C&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;httptrace&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Inject&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

            &lt;span class="c"&gt;// do request&lt;/span&gt;
            &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Printf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Sending request...&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;res&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Do&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="nb"&gt;panic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ioutil&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ReadAll&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;res&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Body&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;res&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Body&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

            &lt;span class="c"&gt;// close child span&lt;/span&gt;
            &lt;span class="n"&gt;childSpan&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;End&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

            &lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SpanFromContext&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SetStatus&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;codes&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;OK&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Printf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"got response: %d&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;res&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Status&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Printf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"%v&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"OK"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c"&gt;//change to status code from backend&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;
        &lt;span class="p"&gt;})&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nb"&gt;panic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;   
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Here, we're setting the name of the tracer to "OT-tracing-demo" and starting a root span labeled "incoming call".  We then create a child span of that labeled "backend call" and pass the context to it.  We then create a request to our backend server, whose location is defined in an env variable and inject our context into that request - we'll see how that context is used in the backend in a bit.  Finally, we make the request, get the status code, and output a confirmation message.  Pretty straightforward! &lt;/p&gt;

&lt;p&gt;A couple of things to note further:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;I am explicitly closing the child span, rather than using &lt;code&gt;defer&lt;/code&gt; for more control over exactly when the timer is stopped. &lt;/li&gt;
&lt;li&gt;I am adding events to spans for even more clear labeling.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now, let's look at our &lt;a href="https://github.com/yuriatgoogle/stack-doctor/blob/master/opentelemetry-tracing-demo/go/backend/backend.go"&gt;backend&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Backend code
&lt;/h2&gt;

&lt;p&gt;Much of the code here is very similar to the frontend - we use the same exact &lt;code&gt;main()&lt;/code&gt; and &lt;code&gt;initTracer()&lt;/code&gt; functions to run the server and initialize tracing.  &lt;/p&gt;

&lt;h3&gt;
  
  
  mainHandler
&lt;/h3&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;mainHandler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ResponseWriter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;req&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c"&gt;// start tracer&lt;/span&gt;
    &lt;span class="n"&gt;tr&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;global&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TraceProvider&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Tracer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"backend"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c"&gt;// get context from incoming request&lt;/span&gt;
    &lt;span class="n"&gt;attrs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;entries&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;spanCtx&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;httptrace&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Extract&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c"&gt;// create request using context&lt;/span&gt;
    &lt;span class="n"&gt;req&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WithContext&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;distributedcontext&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WithMap&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;distributedcontext&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewMap&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;distributedcontext&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;MapUpdate&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;MultiKV&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;entries&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;})))&lt;/span&gt;

    &lt;span class="c"&gt;// create span&lt;/span&gt;
    &lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;span&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;tr&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Start&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="s"&gt;"backend call received"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WithAttributes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;attrs&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ChildOf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;spanCtx&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AddEvent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"handling backend call"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c"&gt;// output&lt;/span&gt;
    &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Printf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"backend call received"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Printf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"OK"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c"&gt;// close span&lt;/span&gt;
    &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;End&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;mainHandler()&lt;/code&gt; function does look quite different.  Here, we extract the span context from the incoming request, create a new request object using that context, and create a new span using that request context.  We also add an event to our span for explicit labeling.  Finally, we return "OK" to the caller and close our span.  Again, I could have used &lt;code&gt;defer span.End()&lt;/code&gt; instead of doing it explicitly.  &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt; the difference between span context and request context.  This is specifically relevant when accepting incoming context and using it to create child spans.  For further exploration of these two, take a look at the relevant &lt;a href="https://opentracing.io/docs/best-practices/"&gt;documentation&lt;/a&gt; from OpenTracing.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h1&gt;
  
  
  Traces
&lt;/h1&gt;

&lt;p&gt;Now that we've seen how to implement tracing instrumentation in our code, let's take a look at what this instrumentation creates.  We can run both frontend and backend locally after setting the relevant environment variables for each and using &lt;code&gt;gcloud auth login&lt;/code&gt; to log in to Google Cloud.  Once we do that, we can hit the frontend on &lt;a href="http://localhost:8080"&gt;http://localhost:8080&lt;/a&gt; and issue a few requests.  This should immediately result in traces being written to Stackdriver:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--lmEL8m6W--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://github.com/yuriatgoogle/stack-doctor/raw/master/opentelemetry-tracing-demo/images/3-traces.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--lmEL8m6W--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://github.com/yuriatgoogle/stack-doctor/raw/master/opentelemetry-tracing-demo/images/3-traces.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can see the span names we specified in our code and the events we added for clearer labeling.  One additional thing I was pleasantly surprised by is that OpenTelemetry explicitly adds steps for the HTTP/networking stack, including DNS, connecting, and sending and receiving data.  &lt;/p&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;I greatly enjoyed attempting to reproduce the work I did in OpenCensus with OpenTelemetry and eventually found it understandable and clear, especially once I was pointed to the tracer.Start() method to create child spans.  Come back next time when I attempt to use the stats features of OpenTelemetry to create custom metrics.  Until then!&lt;/p&gt;

</description>
      <category>tracing</category>
      <category>go</category>
      <category>opentelemetry</category>
      <category>observability</category>
    </item>
    <item>
      <title>SLOs with Stackdriver Service Monitoring</title>
      <dc:creator>Yuri Grinshteyn</dc:creator>
      <pubDate>Tue, 07 Jan 2020 19:49:29 +0000</pubDate>
      <link>https://dev.to/yurigrinshteyn/slos-with-stackdriver-service-monitoring-27oo</link>
      <guid>https://dev.to/yurigrinshteyn/slos-with-stackdriver-service-monitoring-27oo</guid>
      <description>&lt;p&gt;Service Level Objectives or &lt;a href="https://landing.google.com/sre/sre-book/chapters/service-level-objectives/"&gt;SLOs&lt;/a&gt; are one of the fundamental principles of &lt;a href="https://landing.google.com/sre/"&gt;site reliability engineering&lt;/a&gt;.  We use them to precisely quantify the reliability target we want to achieve in our service. We also use their inverse, &lt;a href="https://landing.google.com/sre/sre-book/chapters/embracing-risk/#xref_risk-management_unreliability-budgets"&gt;error budgets&lt;/a&gt;, to make informed decisions about how much risk we can take on at any given time.  This lets us determine, for example, whether we can go ahead with a push to production or infrastructure upgrade.&lt;/p&gt;

&lt;p&gt;However, Stackdriver has never given us the ability to actually create, track, alert, and report on SLOs - until now.  The &lt;a href="https://cloud.google.com/service-monitoring/"&gt;Service Monitoring&lt;/a&gt; API was released to public beta at NEXT London in the fall, and I wanted to take the opportunity to try it out.  Here's what I found.&lt;/p&gt;

&lt;h1&gt;
  
  
  The service
&lt;/h1&gt;

&lt;p&gt;Before I could create a service level objective, I needed a service.  Because the initial release of the &lt;a href="https://cloud.google.com/monitoring/service-monitoring/"&gt;API&lt;/a&gt; only supports App Engine, Istio on GKE, and Cloud Endpoints, I thought I'd try the simplest option - App Engine Standard.  I created a basic Hello World app in Go using the Mux router - here's the &lt;a href="https://github.com/yuriatgoogle/stack-doctor/tree/master/service-monitoring-demo/gae"&gt;code&lt;/a&gt; for it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;package&lt;/span&gt; &lt;span class="n"&gt;main&lt;/span&gt;

&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s"&gt;"fmt"&lt;/span&gt;
    &lt;span class="s"&gt;"net/http"&lt;/span&gt;

    &lt;span class="s"&gt;"github.com/gorilla/mux"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;mux&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewRouter&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;HandleFunc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"/"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ResponseWriter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Fprintf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Hello World!"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;

    &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ListenAndServe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;":8080"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;I then created the app.yaml file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;runtime: go112 # replace with go111 for Go 1.11
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Next, I deployed the app using &lt;code&gt;gcloud app deploy&lt;/code&gt;.  At this point, the build process was failing, and I ended up having to follow these &lt;a href="https://github.com/golang/go/wiki/Modules#how-to-define-a-module"&gt;instructions&lt;/a&gt; to define a module.  I am not an expert in Go, and I'm guessing that this is a matter of local environment configuration that I just couldn't be bothered to sort out.  Nevertheless, I was able to deploy the app after following those instructions.&lt;/p&gt;

&lt;p&gt;Finally, I set up a global Uptime Check to get a steady stream of traffic flowing to my new app:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Qy8bIMVi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://github.com/yuriatgoogle/stack-doctor/raw/master/service-monitoring-demo/gae/images/1-uptimecheck.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Qy8bIMVi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://github.com/yuriatgoogle/stack-doctor/raw/master/service-monitoring-demo/gae/images/1-uptimecheck.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Defining the SLI
&lt;/h1&gt;

&lt;p&gt;Now that I had a service created, I was ready to proceed.  I found the &lt;a href="https://cloud.google.com/monitoring/service-monitoring/"&gt;documentation&lt;/a&gt; on concepts very helpful, even as someone mostly familiar with this topic.  For this exercise, I wanted to create a simple availability Service Level Indicator (SLI) to measure the percentage of "good" requests as a fraction of total.  That required three decisions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt; How to count total requests&lt;/li&gt;
&lt;li&gt; How to count "good" requests&lt;/li&gt;
&lt;li&gt; What time frame to use for my SLI&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Thankfully, App Engine exposes a useful response count metric:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--UEq4OnB0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://github.com/yuriatgoogle/stack-doctor/raw/master/service-monitoring-demo/gae/images/2-metric.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--UEq4OnB0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://github.com/yuriatgoogle/stack-doctor/raw/master/service-monitoring-demo/gae/images/2-metric.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt; that this metric is not written if the GAE application is disabled (as I learned by attempting to simulate a failure by disabling the app).  &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This metric can further be filtered by response code:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--V8mkJ2ia--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://github.com/yuriatgoogle/stack-doctor/raw/master/service-monitoring-demo/gae/images/3-filtered.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--V8mkJ2ia--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://github.com/yuriatgoogle/stack-doctor/raw/master/service-monitoring-demo/gae/images/3-filtered.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I decided to use the unfiltered metric to count total requests and filter requests with a response code of 200 to count "good" requests for the sake of simplicity.  &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt; that this is likely far too simplistic for any production use.  For example, this would count 404s as "bad" requests, when they are likely to be the result of misconfigured clients or even external scanners.  &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I then chose a 1 day rolling window as my SLO time frame.  For a lot more information on how to choose SLIs and SLOs, I highly recommend the Art of SLOs &lt;a href="https://landing.google.com/sre/resources/practicesandprocesses/art-of-slos/"&gt;workshop&lt;/a&gt;, which the Google CRE team has recently released. &lt;/p&gt;

&lt;h1&gt;
  
  
  Creating the SLI and SLO
&lt;/h1&gt;

&lt;p&gt;At this point, I was ready to use the &lt;a href="https://cloud.google.com/monitoring/service-monitoring/identifying-custom-sli"&gt;API&lt;/a&gt; to define my SLO.  As recommended in the "&lt;a href="https://cloud.google.com/monitoring/service-monitoring/identifying-custom-sli#configure-sli"&gt;Building the SLI&lt;/a&gt;" section, I used the Metrics Explorer to create a chart that showed my "total" request count:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--cRn4AIPK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://github.com/yuriatgoogle/stack-doctor/raw/master/service-monitoring-demo/gae/images/4-json.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--cRn4AIPK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://github.com/yuriatgoogle/stack-doctor/raw/master/service-monitoring-demo/gae/images/4-json.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From there, I was able to copy the JSON for the filter:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"metric.type=\"appengine.googleapis.com/http/server/response_count\" resource.type=\"gae_app\" resource.label.\"module_id\"=\"default\""
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;I then modified the chart to only count the "good" requests, filtering on response_code=200 and copied that JSON:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"metric.type=\"appengine.googleapis.com/http/server/response_count\" resource.type=\"gae_app\" resource.label.\"module_id\"=\"default\" metric.label.\"response_code\"=\"200\""
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;At this point, I was ready to build the SLI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nl"&gt;"requestBased"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"goodTotalRatio"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"totalServiceFilter"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"metric.type=&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;appengine.googleapis.com/http/server/response_count&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt; resource.type=&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;gae_app&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt; resource.label.&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;module_id&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;=&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;default&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"goodServiceFilter"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"metric.type=&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;appengine.googleapis.com/http/server/response_count&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt; resource.type=&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;gae_app&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt; resource.label.&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;module_id&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;=&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;default&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt; metric.label.&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;response_code&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;=&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;200&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;I chose the "requestBased" type of SLI, because I was looking to capture the fraction of good over total requests.  The other options &lt;a href="https://cloud.google.com/monitoring/service-monitoring/api-structures#sli-structs"&gt;include&lt;/a&gt; &lt;em&gt;basic&lt;/em&gt;, which might have been good enough for my purpose here, and "&lt;em&gt;window-based&lt;/em&gt;", which lets you count the number of periods during which the service meets a defined health threshold.  I may come back and revisit the latter in another post.   &lt;/p&gt;

&lt;p&gt;From there, I defined the SLO:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="nl"&gt;"displayName"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"GAE Hello World Availability"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="nl"&gt;"serviceLevelIndicator"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"requestBased"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"goodTotalRatio"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"totalServiceFilter"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"metric.type=&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;appengine.googleapis.com/http/server/response_count&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt; resource.type=&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;gae_app&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt; resource.label.&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;module_id&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;=&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;default&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"goodServiceFilter"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"metric.type=&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;appengine.googleapis.com/http/server/response_count&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt; resource.type=&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;gae_app&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt; resource.label.&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;module_id&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;=&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;default&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt; metric.label.&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;response_code&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;=&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;200&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
     &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="nl"&gt;"goal"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.98&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="nl"&gt;"rollingPeriod"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"86400s"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="nl"&gt;"displayName"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"98% Successful requests in a rolling day"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Finally, I submitted the request to the API using &lt;a href="https://www.getpostman.com/"&gt;Postman&lt;/a&gt; - you could do the same using the API Explorer or even curl.  The response was successful and returned the SLO name in the body:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"projects/&amp;lt;project number&amp;gt;/services/gae:&amp;lt;project ID&amp;gt;_default/serviceLevelObjectives/&amp;lt;SLO name&amp;gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"serviceLevelIndicator"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"requestBased"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"goodTotalRatio"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"goodServiceFilter"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"metric.type=&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;appengine.googleapis.com/http/server/response_count&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt; resource.type=&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;gae_app&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt; resource.label.&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;module_id&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;=&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;default&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt; metric.label.&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;response_code&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;=&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;200&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"totalServiceFilter"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"metric.type=&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;appengine.googleapis.com/http/server/response_count&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt; resource.type=&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;gae_app&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt; resource.label.&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;module_id&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;=&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;default&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"goal"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.98&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"rollingPeriod"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"86400s"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"displayName"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"98% Successful requests in a rolling day"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;h1&gt;
  
  
  Alerting on SLO
&lt;/h1&gt;

&lt;p&gt;Now that my SLO was defined, I wanted to achieve two things - create an alert for SLO violation and figure out how to get a status without tripping an alert.  I was able to use the UI to create an alerting policy using the "SLO BURN RATE" condition type:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--3-pSk172--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://github.com/yuriatgoogle/stack-doctor/raw/master/service-monitoring-demo/gae/images/5-policy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--3-pSk172--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://github.com/yuriatgoogle/stack-doctor/raw/master/service-monitoring-demo/gae/images/5-policy.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When setting up the alerting policy, I ran into two fields whose meaning was not immediately clear to me.  The first one is Lookback Duration.  I was able to find an explanation in the &lt;a href="https://cloud.google.com/monitoring/alerts/concepts-indepth#condition-types"&gt;documentation&lt;/a&gt; - because burn rate is fundamentally a rate of change condition, you have the option of specifying a custom lookback window.  For other rate of change conditions, the lookback is set to 10 minutes and cannot be changed.  From the doc for rate of change conditions:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The condition averages the values of the metric from the past 10 minutes, then compares the result with the 10-minute average that was measured just before the duration window. The 10-minute lookback window used by a metric rate of change condition is a fixed value; you can't change it. However, you do specify the duration window when you create a condition.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The second field that confused me was the threshold.  A bit more thought led me to believe that this is the threshold for the rate of error budget burn - applied to the specified lookback duration.  So, using 10 minutes for the lookback and 10 for the threshold would result in a condition that would trip if 10% of the total error budget was burned over a 10 minute period.  &lt;/p&gt;

&lt;h2&gt;
  
  
  Triggering alert
&lt;/h2&gt;

&lt;p&gt;Once my alerting policy was configured, I wanted to see what would happen if there was an availability issue.  I rewrote the service to throw an error half the time:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;mux&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewRouter&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;HandleFunc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"/"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ResponseWriter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;

        &lt;span class="n"&gt;rand&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Seed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;UnixNano&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
        &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;rand&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Intn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c"&gt;// n will be between 0 and 10&lt;/span&gt;
        &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Printf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"randon number was %d&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="m"&gt;6&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"error!"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;500&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Fprintf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Hello World!"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="p"&gt;})&lt;/span&gt;

    &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ListenAndServe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;":8080"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;and redeployed the app.  Fairly quickly, I got an incident:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--OcWKmtEF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://github.com/yuriatgoogle/stack-doctor/raw/master/service-monitoring-demo/gae/images/6-incident.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--OcWKmtEF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://github.com/yuriatgoogle/stack-doctor/raw/master/service-monitoring-demo/gae/images/6-incident.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I was satisfied with this and redeployed the app with the original code to get it working again.  In short order, the incident was resolved.&lt;/p&gt;

&lt;h1&gt;
  
  
  Retrieving SLO Status
&lt;/h1&gt;

&lt;p&gt;Alerting on error budget burn is obviously necessary, but there will be times when we'll need to know the status of our SLO long before an issue.  As such, I needed a way to query the SLO data. I followed the &lt;a href="https://cloud.google.com/monitoring/service-monitoring/timeseries-selectors"&gt;documentation&lt;/a&gt; and used the&lt;a href="https://cloud.google.com/monitoring/api/ref_v3/rest/v3/projects.timeSeries/list"&gt; timeSeries.list &lt;/a&gt;method of the monitoring API.  &lt;/p&gt;

&lt;p&gt;I thought of two primary questions I would want to be able to answer - what is the availability of my service over a given time period and what is the status of my error budget at a given point in time?  &lt;/p&gt;

&lt;h2&gt;
  
  
  SLO Status
&lt;/h2&gt;

&lt;p&gt;The first is answered using the "select_slo_health" time series.  I sent a request to &lt;a href="https://monitoring.googleapis.com/v3/projects/stack-doctor/timeSeries"&gt;https://monitoring.googleapis.com/v3/projects/stack-doctor/timeSeries&lt;/a&gt; with the following parameters:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;name:projects/&amp;lt;project ID&amp;gt;
filter:select_slo_health("projects/&amp;lt;project number&amp;gt;/services/gae:&amp;lt;project ID&amp;gt;_default/serviceLevelObjectives/&amp;lt;SLO name&amp;gt;")
interval.endTime:2020-01-06T17:17:00.0Z
interval.startTime:2020-01-05T17:17:00.0Z
aggregation.alignmentPeriod:3600s
aggregation.perSeriesAligner:ALIGN_MEAN
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;I used the SLO name that was returned when I created the SLO in the previous steps.  I could have also used a call to &lt;a href="https://monitoring.googleapis.com/v3/projects/"&gt;https://monitoring.googleapis.com/v3/projects/&lt;/a&gt;/services/gae:_default/serviceLevelObjectives to retrieve the SLOs I have configured against my default App Engine service.&lt;/p&gt;

&lt;p&gt;I specified a 24hr interval to retrieve the data with an alignment of 1 hour using the mean aligner.  If I was going to chart the data, I could have used a shorter alignment period and a more precise aligner, but this sufficed for my purposes.  The results looked like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"timeSeries"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"metric"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"select_slo_health(&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;projects/860128900282/services/gae:stack-doctor_default/serviceLevelObjectives/2IooYmjTSROak0g9f-DmpA&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;)"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"resource"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"gae_app"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"labels"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"project_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"stack-doctor"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"metricKind"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"GAUGE"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"valueType"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"DOUBLE"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"points"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"interval"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"startTime"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2020-01-06T17:17:00Z"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"endTime"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2020-01-06T17:17:00Z"&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"doubleValue"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="err"&gt;…&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;As expected, the output shows me the fractional ratio of good requests to total requests for each interval (that matches my alignmentPeriod) within the total interval (as specified by interval.startTime and endTime).  For my service, each value was 1, meaning that 100% of the requests were good for each hourly interval.&lt;/p&gt;

&lt;h2&gt;
  
  
  Error Budget Status
&lt;/h2&gt;

&lt;p&gt;The second question I wanted to answer is "how much error budget do I have left?"  The operator for that is the select_slo_budget_fraction.  The only change in the request is to change the filter:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;filter:select_slo_budget_fraction("projects/860128900282/services/gae:stack-doctor_default/serviceLevelObjectives/2IooYmjTSROak0g9f-DmpA")  
interval.endTime:2020-01-06T17:17:00.0Z  
interval.startTime:2020-01-05T17:17:00.0Z  
aggregation.alignmentPeriod:3600s  
aggregation.perSeriesAligner:ALIGN_MEAN
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;After making a request to the timeSeries.list method, I got the following return:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"timeSeries"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"metric"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"select_slo_budget_fraction(&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;projects/860128900282/services/gae:stack-doctor_default/serviceLevelObjectives/2IooYmjTSROak0g9f-DmpA&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;)"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"resource"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"gae_app"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"labels"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"project_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"stack-doctor"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"metricKind"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"GAUGE"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"valueType"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"DOUBLE"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"points"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"interval"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"startTime"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2020-01-06T17:17:00Z"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"endTime"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2020-01-06T17:17:00Z"&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"doubleValue"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="err"&gt;….&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;As before, each "value" represents the fraction of the error budget remaining at that point.  As my service is not burning error budget, the numbers stay at 1.  I could have also used the select_slo_budget operator to get the actual remaining budget - the count of errors remaining.&lt;/p&gt;

&lt;h1&gt;
  
  
  In conclusion...
&lt;/h1&gt;

&lt;p&gt;I hope you found this introduction to the Service Monitoring API useful.  Thank you for reading, and let me know if you have any feedback.  Until next time!&lt;/p&gt;

</description>
      <category>devops</category>
      <category>sre</category>
      <category>stackdriver</category>
      <category>slos</category>
    </item>
    <item>
      <title>How I learned to stop worrying and debug in production</title>
      <dc:creator>Yuri Grinshteyn</dc:creator>
      <pubDate>Wed, 04 Dec 2019 19:19:43 +0000</pubDate>
      <link>https://dev.to/yurigrinshteyn/how-i-learned-to-stop-worrying-and-debug-in-production-1d04</link>
      <guid>https://dev.to/yurigrinshteyn/how-i-learned-to-stop-worrying-and-debug-in-production-1d04</guid>
      <description>&lt;p&gt;&lt;a href="https://landing.google.com/sre/sre-book/chapters/managing-incidents/"&gt;Incident management&lt;/a&gt; is one of the core practices of Site Reliability Engineering.  As part of that, the SRE Book recommends focusing on prioritization during the incident itself.  Specifically:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Prioritize&lt;/strong&gt;. Stop the bleeding, restore service, and preserve the evidence for root-causing.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;However, there may still be times when you may need to try to debug in production - for example, you may be struggling to reproduce a problem locally or in a dev environment, and production may be the only place where it happens reliably enough.  In this situation, redeploying with additional logging enabled may not be an option, especially if there's an incident in progress.  There may be other times when an error isn't enough of a problem to impact your SLO, but you still want to fix it.  &lt;/p&gt;

&lt;p&gt;In the past, these situations required developers to add additional instrumentation to their code and wait for a new deployment for that instrumentation to show something useful. But what if that wasn't necessary? What if you could just inspect the code in production or add logging on the fly &lt;em&gt;without&lt;/em&gt; having to redeploy?  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://cloud.google.com/debugger/"&gt;Stackdriver Debugger&lt;/a&gt; is intended to enable exactly this.  I wanted to try it for myself - here's how it went.&lt;/p&gt;

&lt;h1&gt;
  
  
  Setup
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Agent
&lt;/h2&gt;

&lt;p&gt;I started by writing a simple Node Express &lt;a href="https://github.com/yuriatgoogle/stack-doctor/tree/master/debugger-demo"&gt;app&lt;/a&gt; that would return a 500 error half the time based on a random number generator.  In order to use Debugger, I included the agent at the top of my code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@google-cloud/debug-agent&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nx"&gt;start&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;projectId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;projectID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;keyFilename&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;./key.json&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;serviceContext&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;serviceName&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;serviceVersion&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="na"&gt;allowExpressions&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Some things of note here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  The agent takes the project ID and credentials as parameters.  The latter is not necessary if running &lt;a href="https://cloud.google.com/debugger/docs/setup/nodejs"&gt;somewhere&lt;/a&gt; where access to the Debugger API doesn't need explicit authentication, like GCE, GKE, or App Engine.  My plan was to simply run this locally, so I needed to add it explicitly.&lt;/li&gt;
&lt;li&gt;  The &lt;strong&gt;service&lt;/strong&gt; and &lt;strong&gt;version&lt;/strong&gt; parameters allow you to have multiple services available to debug at the same time&lt;/li&gt;
&lt;li&gt;  The &lt;strong&gt;allowExpressions&lt;/strong&gt; flag enables Debugger to, for example, view static or global variables that are not part of the local variable set.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once I ran the code, I was able to select my app and the specified version:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--kfoYTJWv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://github.com/yuriatgoogle/stack-doctor/raw/master/debugger-demo/images/1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--kfoYTJWv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://github.com/yuriatgoogle/stack-doctor/raw/master/debugger-demo/images/1.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Code
&lt;/h2&gt;

&lt;p&gt;At this point, I was presented with a screen to select my source code.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--YDKNTtea--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://github.com/yuriatgoogle/stack-doctor/raw/master/debugger-demo/images/2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--YDKNTtea--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://github.com/yuriatgoogle/stack-doctor/raw/master/debugger-demo/images/2.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Because my &lt;a href="https://github.com/yuriatgoogle/stack-doctor/tree/master/debugger-demo"&gt;code&lt;/a&gt; is in GitHub, I chose that path.  Once I selected my repo and branch, I was able to see the code!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--qOHveNOE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://github.com/yuriatgoogle/stack-doctor/raw/master/debugger-demo/images/3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--qOHveNOE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://github.com/yuriatgoogle/stack-doctor/raw/master/debugger-demo/images/3.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Debugging
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Snapshots
&lt;/h2&gt;

&lt;p&gt;At this point, I was ready to start actually using Debugger.  The first capability it provides is Snapshots - it's the ability to see the value of variables at a specific execution point - again, without actually stopping code execution.  Specifically, I wanted to see if I could get it to show me the value of my &lt;em&gt;randomInt&lt;/em&gt; variable that I was using to trigger an error.  I added a snapshot by clicking on the line number:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--F7q9WmoX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://github.com/yuriatgoogle/stack-doctor/raw/master/debugger-demo/images/4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--F7q9WmoX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://github.com/yuriatgoogle/stack-doctor/raw/master/debugger-demo/images/4.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I then refreshed the page running on my local server until I got an error and got a snapshot:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--mc_cnh6H--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://github.com/yuriatgoogle/stack-doctor/raw/master/debugger-demo/images/5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--mc_cnh6H--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://github.com/yuriatgoogle/stack-doctor/raw/master/debugger-demo/images/5.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I was curious about the fact that my randomInt variable didn't show up.  After a bit of digging, I realized that I had to add an &lt;a href="https://cloud.google.com/debugger/docs/using/snapshots#expressions_optional"&gt;expression&lt;/a&gt; to capture it:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ucsoiZ2C--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://github.com/yuriatgoogle/stack-doctor/raw/master/debugger-demo/images/6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ucsoiZ2C--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://github.com/yuriatgoogle/stack-doctor/raw/master/debugger-demo/images/6.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This time, when I loaded the page, it was captured as I expected:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--OupDr_qQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://github.com/yuriatgoogle/stack-doctor/raw/master/debugger-demo/images/7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--OupDr_qQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://github.com/yuriatgoogle/stack-doctor/raw/master/debugger-demo/images/7.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I was very happy with this - I can definitely see how useful this would be in a production debugging or troubleshooting scenario to, for example, capture the values that are being used in a calculation.  I can even add multiple snapshot points and see the value of the variable change:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--bQiBT1Tt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://github.com/yuriatgoogle/stack-doctor/raw/master/debugger-demo/images/8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--bQiBT1Tt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://github.com/yuriatgoogle/stack-doctor/raw/master/debugger-demo/images/8.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Logpoints
&lt;/h2&gt;

&lt;p&gt;The second major capability of Debugger is the ability to add logging on the fly - that is, to essentially create additional log entries that get ingested into Stackdriver Logging and persist for 24 hours.  Adding one is as easy as switching to the Logpoint tab, selecting a line of code, and specifying the message to be written:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--vnoxCFkI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://github.com/yuriatgoogle/stack-doctor/raw/master/debugger-demo/images/9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--vnoxCFkI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://github.com/yuriatgoogle/stack-doctor/raw/master/debugger-demo/images/9.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now, when I reloaded the page, I saw additional logging in my local console:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--te0ANd2F--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://github.com/yuriatgoogle/stack-doctor/raw/master/debugger-demo/images/10.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--te0ANd2F--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://github.com/yuriatgoogle/stack-doctor/raw/master/debugger-demo/images/10.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So, the log entries were added to stdout - not automatically sent to Stackdriver Logging.  So, I needed to run this somewhere in GCP.  I built a container image using Google Cloud Build and &lt;a href="https://cloud.google.com/run/docs/quickstarts/build-and-deploy"&gt;deployed&lt;/a&gt; it to Cloud Run.  Once that was done, I saw my new service in Debugger:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--l-zN1Lnq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://github.com/yuriatgoogle/stack-doctor/raw/master/debugger-demo/images/11.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--l-zN1Lnq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://github.com/yuriatgoogle/stack-doctor/raw/master/debugger-demo/images/11.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I then selected the code from GitHub again and added a logpoint.  This time, I got logs!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--djZLWCso--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://github.com/yuriatgoogle/stack-doctor/raw/master/debugger-demo/images/12.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--djZLWCso--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://github.com/yuriatgoogle/stack-doctor/raw/master/debugger-demo/images/12.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is very cool - it's such an easy way to get more debugging info - without having to redeploy the app!&lt;/p&gt;

&lt;h1&gt;
  
  
  In conclusion…
&lt;/h1&gt;

&lt;p&gt;Debugging in production CAN be done!  Debugger is obviously not a replacement for being able to step through the code in an IDE (although it does have IDE &lt;a href="https://cloud.google.com/code/docs/intellij/debugger"&gt;integration&lt;/a&gt;), but it's a great way to add more debugging information to an app on the fly - without having to redeploy.  Thanks for reading!&lt;/p&gt;

</description>
    </item>
    <item>
      <title>What's the best way to log errors (in Node.js)?</title>
      <dc:creator>Yuri Grinshteyn</dc:creator>
      <pubDate>Fri, 15 Nov 2019 19:14:18 +0000</pubDate>
      <link>https://dev.to/yurigrinshteyn/what-s-the-best-way-to-log-errors-in-node-js-dfn</link>
      <guid>https://dev.to/yurigrinshteyn/what-s-the-best-way-to-log-errors-in-node-js-dfn</guid>
      <description>&lt;p&gt;I wanted to address another one in the mostly-in-my-head series of questions with the running title of "things people often ask me".  Today's episode in the series is all about logging errors to Stackdriver.  Specifically, I've found that folks are somewhat confused about the multiple options they have for error logging and even more so when they want to understand how to log and track exceptions.  My opinion is that this is in part caused by Stackdriver providing multiple features that enable this - Error Reporting and Logging.  This is further confusing because Error Reporting is in a way a subset of Logging.  As such, I set out to explore exactly what happens when I tried to log both errors and exceptions using Logging and Error Reporting in a sample Node.js &lt;a href="https://github.com/yuriatgoogle/stack-doctor/tree/master/error-reporting-demo" rel="noopener noreferrer"&gt;app&lt;/a&gt;.  Let's see what I found!&lt;/p&gt;

&lt;h1&gt;
  
  
  Logging Errors
&lt;/h1&gt;

&lt;p&gt;I think that the confusion folks face starts with the fact that Stackdriver actually supports three different &lt;a href="https://cloud.google.com/logging/docs/setup/nodejs" rel="noopener noreferrer"&gt;options&lt;/a&gt; for logging in Node.js - Bunyan, Winston, and the API client library.  I wanted to see how the first two treat error logs. At this point, I do not believe we recommend using the client library directly (in the same way that we recommend using OpenCensus for metric telemetry, rather than calling the Monitoring API directly).  &lt;/p&gt;

&lt;h2&gt;
  
  
  Logging with Bunyan
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://cloud.google.com/logging/docs/setup/nodejs" rel="noopener noreferrer"&gt;documentation&lt;/a&gt; is pretty straightforward - setting up Bunyan logging in my app was very easy.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// *************** Bunyan logging setup *************&lt;/span&gt;
&lt;span class="c1"&gt;// Creates a Bunyan Stackdriver Logging client&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;loggingBunyan&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;LoggingBunyan&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="c1"&gt;// Create a Bunyan logger that streams to Stackdriver Logging&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;bunyanLogger&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;bunyan&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createLogger&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;serviceName&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;// this is set by an env var or as a parameter&lt;/span&gt;
  &lt;span class="na"&gt;streams&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="c1"&gt;// Log to the console at 'info' and above&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="na"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;stdout&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;level&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;info&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="c1"&gt;// And log to Stackdriver Logging, logging at 'info' and above&lt;/span&gt;
    &lt;span class="nx"&gt;loggingBunyan&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;info&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;From there, logging an error message is as simple as:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/bunyan-error&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;bunyanLogger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Bunyan error logged&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Bunyan error logged!&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When I ran my app, I saw this  logging output in the console:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{"name":"node-error-reporting","hostname":"ygrinshteyn-macbookpro1.roam.corp.google.com","pid":5539,"level":50,"msg":"Bunyan error logged","time":"2019-11-15T17:19:58.001Z","v":0}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;And this in Stackdriver Logging:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fyuriatgoogle%2Fstack-doctor%2Fblob%2Fmaster%2Ferror-reporting-demo%2Fimages%2F1.png%3Fraw%3Dtrue" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fyuriatgoogle%2Fstack-doctor%2Fblob%2Fmaster%2Ferror-reporting-demo%2Fimages%2F1.png%3Fraw%3Dtrue" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Note that the log entry is created against the "global" resource because the log entry is being sent from my local machine not running on GCP, and the logName is bunyan_log.  The output is nicely structured, and the severity is set to ERROR.  &lt;/p&gt;

&lt;h2&gt;
  
  
  Logging with Winston
&lt;/h2&gt;

&lt;p&gt;I again followed the documentation to set up the Winston client:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// ************* Winston logging setup *****************&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;loggingWinston&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;LoggingWinston&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="c1"&gt;// Create a Winston logger that streams to Stackdriver Logging&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;winstonLogger&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;winston&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createLogger&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;level&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;info&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;transports&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;winston&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;transports&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Console&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="c1"&gt;// Add Stackdriver Logging&lt;/span&gt;
    &lt;span class="nx"&gt;loggingWinston&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then I logged an error:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/winston-error&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;winstonLogger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Winston error logged&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Winston error logged!&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt; 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This time, the console output was much more concise:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{"message":"Winston error logged","level":"error"}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Here's what I saw in the Logs Viewer:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fyuriatgoogle%2Fstack-doctor%2Fblob%2Fmaster%2Ferror-reporting-demo%2Fimages%2F2.png%3Fraw%3Dtrue" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fyuriatgoogle%2Fstack-doctor%2Fblob%2Fmaster%2Ferror-reporting-demo%2Fimages%2F2.png%3Fraw%3Dtrue" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The severity was again set properly, but there's a lot less information in this entry.  For example, my hostname is not logged.  This may be a good choice for folks looking to reduce the amount of data that is logged while still retaining enough information to be useful.  &lt;/p&gt;

&lt;h2&gt;
  
  
  Error Reporting
&lt;/h2&gt;

&lt;p&gt;At this point, I had a good understanding of how logging errors works.  I next wanted to investigate whether using Error Reporting for this purpose would provide additional value.  First, I set up Error Reporting in the app:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;//************** Stackdriver Error Reporting setup ******** */&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;errors&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ErrorReporting&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;projectId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;projectID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;reportMode&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;always&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;serviceContext&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;serviceName&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;1&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I then sent an error using the client:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/report-error&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Stackdriver error reported!&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nx"&gt;errors&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;report&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Stackdriver error reported&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt; 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This time, there was no output in the console AND nothing was logged to Stackdriver Logging.  I went to Error Reporting to find my error:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fyuriatgoogle%2Fstack-doctor%2Fblob%2Fmaster%2Ferror-reporting-demo%2Fimages%2F4.png%3Fraw%3Dtrue" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fyuriatgoogle%2Fstack-doctor%2Fblob%2Fmaster%2Ferror-reporting-demo%2Fimages%2F4.png%3Fraw%3Dtrue" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When I clicked on the error, I was able to get a lot of detail:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fyuriatgoogle%2Fstack-doctor%2Fblob%2Fmaster%2Ferror-reporting-demo%2Fimages%2F5.png%3Fraw%3Dtrue" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fyuriatgoogle%2Fstack-doctor%2Fblob%2Fmaster%2Ferror-reporting-demo%2Fimages%2F5.png%3Fraw%3Dtrue" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is great because I can see when the error started happening, I get a histogram if and when it continues to happen, and I get a full stack trace showing me exactly where in my code the error is generated - this is all incredibly valuable information that I don't get from simply logging with the ERROR severity.  &lt;/p&gt;

&lt;p&gt;The tradeoff here is that this message never makes it to Stackdriver Logging.  This means that I can't use errors reported through Error Reporting to, for example, create &lt;a href="https://dev.to/yurigrinshteyn/can-you-alert-on-logs-in-stackdriver-1lp8"&gt;log based metrics&lt;/a&gt;, which may make for a great SLI and/or alerting policy condition.&lt;/p&gt;

&lt;h1&gt;
  
  
  Logging Exceptions
&lt;/h1&gt;

&lt;p&gt;Next, I wanted to investigate what would happen if my app were to throw an exception and log it - how would it show up?  I used Bunyan to log an exception:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/log-exception&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;exception&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nx"&gt;bunyanLogger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;exception logged&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The console output contained the entire exception:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{"name":"node-error-reporting","hostname":"&amp;lt;hostname&amp;gt;","pid":5539,"level":50,"err":{"message":"exception logged","name":"Error","stack":"Error: exception logged\n    at app.get (/Users/ygrinshteyn/src/error-reporting-demo/app.js:72:22)\n    at Layer.handle [as handle_request] (/Users/ygrinshteyn/src/error-reporting-demo/node_modules/express/lib/router/layer.js:95:5)\n    at next (/Users/ygrinshteyn/src/error-reporting-demo/node_modules/express/lib/router/route.js:137:13)\n    at Route.dispatch (/Users/ygrinshteyn/src/error-reporting-demo/node_modules/express/lib/router/route.js:112:3)\n    at Layer.handle [as handle_request] (/Users/ygrinshteyn/src/error-reporting-demo/node_modules/express/lib/router/layer.js:95:5)\n    at /Users/ygrinshteyn/src/error-reporting-demo/node_modules/express/lib/router/index.js:281:22\n    at Function.process_params (/Users/ygrinshteyn/src/error-reporting-demo/node_modules/express/lib/router/index.js:335:12)\n    at next (/Users/ygrinshteyn/src/error-reporting-demo/node_modules/express/lib/router/index.js:275:10)\n    at expressInit (/Users/ygrinshteyn/src/error-reporting-demo/node_modules/express/lib/middleware/init.js:40:5)\n    at Layer.handle [as handle_request] (/Users/ygrinshteyn/src/error-reporting-demo/node_modules/express/lib/router/layer.js:95:5)"},"msg":"exception logged","time":"2019-11-15T17:47:50.981Z","v":0}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The logging entry looked like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fyuriatgoogle%2Fstack-doctor%2Fblob%2Fmaster%2Ferror-reporting-demo%2Fimages%2F6.png%3Fraw%3Dtrue" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fyuriatgoogle%2Fstack-doctor%2Fblob%2Fmaster%2Ferror-reporting-demo%2Fimages%2F6.png%3Fraw%3Dtrue" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And the jsonPayload contained the exception:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fyuriatgoogle%2Fstack-doctor%2Fblob%2Fmaster%2Ferror-reporting-demo%2Fimages%2F7.png%3Fraw%3Dtrue" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fyuriatgoogle%2Fstack-doctor%2Fblob%2Fmaster%2Ferror-reporting-demo%2Fimages%2F7.png%3Fraw%3Dtrue" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is definitely a lot of useful data.  I next wanted to see if Error Reporting would work as &lt;a href="https://cloud.google.com/error-reporting/docs/setup/nodejs" rel="noopener noreferrer"&gt;advertised&lt;/a&gt; and identify this exception in the log as an error.  After carefully reviewing the documentation, I realized that this functionality works specifically on GCE, GKE, App Engine, and Cloud Functions, whereas I was just running my code on my local desktop. I tried running the code in Cloud Shell and immediately got a new entry in Error Reporting:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fyuriatgoogle%2Fstack-doctor%2Fblob%2Fmaster%2Ferror-reporting-demo%2Fimages%2F8.png%3Fraw%3Dtrue" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fyuriatgoogle%2Fstack-doctor%2Fblob%2Fmaster%2Ferror-reporting-demo%2Fimages%2F8.png%3Fraw%3Dtrue" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The full stack trace of the exception is available in the detail view:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fyuriatgoogle%2Fstack-doctor%2Fblob%2Fmaster%2Ferror-reporting-demo%2Fimages%2F9.png%3Fraw%3Dtrue" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fyuriatgoogle%2Fstack-doctor%2Fblob%2Fmaster%2Ferror-reporting-demo%2Fimages%2F9.png%3Fraw%3Dtrue" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So, logging an exception gives me the best of &lt;strong&gt;both&lt;/strong&gt; worlds - I get a logging entry that I can use for things like log based metrics, and I get an entry in Error Reporting that I can use for analysis and tracking.&lt;/p&gt;

&lt;h1&gt;
  
  
  Reporting Exceptions
&lt;/h1&gt;

&lt;p&gt;I next wanted to see what would happen if I used Error Reporting to report the same exception.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/report-exception&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;exception&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nx"&gt;errors&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;report&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;exception reported&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once again, there was no console output.  My error was immediately visible in Error Reporting:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fyuriatgoogle%2Fstack-doctor%2Fblob%2Fmaster%2Ferror-reporting-demo%2Fimages%2F10.png%3Fraw%3Dtrue" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fyuriatgoogle%2Fstack-doctor%2Fblob%2Fmaster%2Ferror-reporting-demo%2Fimages%2F10.png%3Fraw%3Dtrue" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And somewhat to my surprise, I was able to see an entry in Logging, as well:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fyuriatgoogle%2Fstack-doctor%2Fblob%2Fmaster%2Ferror-reporting-demo%2Fimages%2F11.png%3Fraw%3Dtrue" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fyuriatgoogle%2Fstack-doctor%2Fblob%2Fmaster%2Ferror-reporting-demo%2Fimages%2F11.png%3Fraw%3Dtrue" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As it turns out, exceptions are recorded in both Error Reporting AND Logging - no matter which of the two you use to send them.&lt;/p&gt;

&lt;h1&gt;
  
  
  So, what now?
&lt;/h1&gt;

&lt;p&gt;Here's what I've learned from this exercise:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Bunyan logging is more verbose than Winston, which could be a consideration if cost is an issue.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Exceptions&lt;/strong&gt; can be sent to Stackdriver through Logging or Error Reporting - they will then be available in both.&lt;/li&gt;
&lt;li&gt; Using Error Reporting to report** non-exception** errors adds a lot of value for developers, but gives up value for SREs or ops folks who need to use logs for metrics or SLIs.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Thanks for joining me - come back soon for more!&lt;/p&gt;

</description>
      <category>node</category>
      <category>errors</category>
      <category>gcp</category>
      <category>stackdriver</category>
    </item>
    <item>
      <title>Can you alert on logs in Stackdriver?</title>
      <dc:creator>Yuri Grinshteyn</dc:creator>
      <pubDate>Tue, 15 Oct 2019 17:44:37 +0000</pubDate>
      <link>https://dev.to/yurigrinshteyn/can-you-alert-on-logs-in-stackdriver-1lp8</link>
      <guid>https://dev.to/yurigrinshteyn/can-you-alert-on-logs-in-stackdriver-1lp8</guid>
      <description>&lt;h1&gt;
  
  
  Introduction
&lt;/h1&gt;

&lt;p&gt;One of the more common questions I hear from folks who are using either or both Stackdriver Logging and Monitoring is "how do I create an alert on errors in my logs?"  Generally, they ask for one of two reasons:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; They really want to be notified every time an error is logged - as you might expect, this is an opportunity for me to talk to them about reliability targets, SLOs, and good alerting practices. &lt;/li&gt;
&lt;li&gt; They use logs as a kind of SLI and really do need to be alerted when the number of messages that meet some criteria (e.g. errors) exceeds a particular threshold.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In this post, I am going to address the latter scenario.  I'll start by covering what logs and metrics are typically used for and the kinds of data they contain.  From there, I'll review creating metrics from logs in Stackdriver and using those metrics to create charts in dashboards and alerts.&lt;/p&gt;

&lt;p&gt;Note that a lot of this information is already covered in the &lt;a href="https://cloud.google.com/logging/docs/logs-based-metrics/charts-and-alerts"&gt;documentation&lt;/a&gt; and in this excellent blog &lt;a href="https://cloud.google.com/blog/products/gcp/extracting-value-from-your-logs-with-stackdriver-logs-based-metrics"&gt;post&lt;/a&gt; by Mary Koes, a product manager on the Stackdriver team.  Nevertheless, this is my attempt to create a single coherent story with practical examples of how to get started with log-based metrics in Stackdriver.&lt;/p&gt;

&lt;h1&gt;
  
  
  Logs and metrics
&lt;/h1&gt;

&lt;p&gt;Before we start, it's important to have a common understanding of exactly what we mean when we talk about logs and metrics.  Together, they are two of the three pillars (with traces being the third) of observability.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--WDefkyY0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://github.com/yuriatgoogle/stack-doctor/blob/master/log-based-metrics/images/1%2520-%2520charity%2520tweet.png%3Fraw%3Dtrue" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--WDefkyY0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://github.com/yuriatgoogle/stack-doctor/blob/master/log-based-metrics/images/1%2520-%2520charity%2520tweet.png%3Fraw%3Dtrue" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt; that I am generally referring to "observability 1.0" here.  I will readily defer to folks like &lt;a href="https://twitter.com/mipsytipsy"&gt;Charity Majors&lt;/a&gt; and &lt;a href="https://twitter.com/lizthegrey"&gt;Liz Fong-Jones&lt;/a&gt; on the &lt;strong&gt;current&lt;/strong&gt; state of the art where observability is concerned.  You can start with reading Charity's retrospective &lt;a href="https://thenewstack.io/observability-a-3-year-retrospective/"&gt;here&lt;/a&gt;, and &lt;a href="https://twitter.com/el_bhs"&gt;Ben Sigelman&lt;/a&gt;'s addressing the idea of the three pillars directly &lt;a href="https://lightstep.com/blog/three-pillars-zero-answers-towards-new-scorecard-observability/"&gt;here&lt;/a&gt;.    &lt;/p&gt;

&lt;p&gt;With that sorted - let's get back to the issue at hand.  We use &lt;strong&gt;logs&lt;/strong&gt; as data points that specifically describe an event that takes place in our system.  Logs are written by our code, by the platform our code is running on, and the infrastructure we depend on (for the purposes of this post, I'm leaving audit logs out of scope, and I will return to them directly in a separate post). Because logs in modern systems are the descendant of (and sometimes still are) text log files written to disk, I consider a log entry, analogous to a line in a log file, to be the quantum unit of logging.  An entry will generally consist of exactly two things - a timestamp that indicates either when the event took place or when it was ingested into our logging system and the text payload, either as unstructured text data or structured data, most commonly in JSON.  Logs can also carry associated metadata, especially when they're ingested into Stackdriver, like the resource that's writing them, the log name, and a severity for each entry. We use logs for two main purposes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Event logs describe specific events that take place within our system - we may use them to output messages that assure the developer that things are working well ("task succeeded") or to provide information when things are not ("received exception from server").&lt;/li&gt;
&lt;li&gt;  Transaction logs describe the details of every transaction processed by a system or component.  For example, a load balancer will log every request that it receives, whether it was successfully completed or not, and include additional information like the requested URL, HTTP response code, and possibly things like which backend was used to serve the request.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Metrics&lt;/strong&gt;, on the other hand, are not generally thought of as describing specific events (again, this is changing).  More commonly, they're used to represent the state or health of your system over time.  A metric is made up of a series of points, each of which includes the timestamp and a numeric value.  Metrics also have metadata associated with them - the series of points, commonly referred to as a timeseries, will include things like the name, description, and often labels that help determine which resource is writing the metric.  &lt;/p&gt;

&lt;p&gt;In Stackdriver specifically, metrics are the only kind of data that can be used to create alerts via alerting policies.  As such, it'll be important to understand how to use logs to create metrics - let's do that now.&lt;/p&gt;

&lt;h1&gt;
  
  
  Log-based metrics
&lt;/h1&gt;

&lt;p&gt;The term "log-based metrics" is rather specific to Stackdriver, but the idea is rather straightforward.  First, Stackdriver provides a simple mechanism to count the number of log entries per minute that match a filter - referred to as a "counter metric".  This is what we'll use if we want to, for example, use load balancer logs as our service level indicator for availability of a service - we can create a metric that will count how many errors we'll see, and we'll use an alerting policy to alert when that value exceeds a threshold we deem acceptable. The process is documented &lt;a href="https://cloud.google.com/logging/docs/logs-based-metrics/counter-metrics"&gt;here&lt;/a&gt;, but I find that a specific example is always helpful.  &lt;/p&gt;

&lt;h2&gt;
  
  
  Counter metric - errors
&lt;/h2&gt;

&lt;p&gt;Let's take a look at a simple example.  I've created a simple &lt;a href="https://github.com/yuriatgoogle/stack-doctor/blob/master/log-based-metrics/errors/errors.js"&gt;example&lt;/a&gt; that writes an error 20% of the time.  When I run the code locally after authenticating through the Google Cloud SDK, here are the log entries:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--peH8Afgu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://github.com/yuriatgoogle/stack-doctor/blob/master/log-based-metrics/images/2-%2520matching%2520error%2520entries.png%3Fraw%3Dtrue" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--peH8Afgu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://github.com/yuriatgoogle/stack-doctor/blob/master/log-based-metrics/images/2-%2520matching%2520error%2520entries.png%3Fraw%3Dtrue" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From here, I'd like to know when my error rate exceeds a particular threshold.  First, I need to create a filter for all the logs that contain the error.  An easy way is to expand the log, find the message I'd like to key on in the payload, click it, and select show matching entries.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--LDXHkWCk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://github.com/yuriatgoogle/stack-doctor/blob/master/log-based-metrics/images/3-%2520matching%2520error%2520logs.png%3Fraw%3Dtrue" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--LDXHkWCk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://github.com/yuriatgoogle/stack-doctor/blob/master/log-based-metrics/images/3-%2520matching%2520error%2520logs.png%3Fraw%3Dtrue" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This creates an Advanced Filter and shows me the failure messages:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--jnzDhwa_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://github.com/yuriatgoogle/stack-doctor/blob/master/log-based-metrics/images/4%2520-%2520failure%2520filter.png%3Fraw%3Dtrue" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--jnzDhwa_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://github.com/yuriatgoogle/stack-doctor/blob/master/log-based-metrics/images/4%2520-%2520failure%2520filter.png%3Fraw%3Dtrue" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next, I can use the Create Metric feature to create a Counter Metric:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--QFmUzD5h--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://github.com/yuriatgoogle/stack-doctor/blob/master/log-based-metrics/images/5%2520-%2520counter%2520metric%2520config.png%3Fraw%3Dtrue" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--QFmUzD5h--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://github.com/yuriatgoogle/stack-doctor/blob/master/log-based-metrics/images/5%2520-%2520counter%2520metric%2520config.png%3Fraw%3Dtrue" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once I click Create Metric, I can then go to Stackdriver Monitoring and see it there:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--GPkQOhsR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://github.com/yuriatgoogle/stack-doctor/blob/master/log-based-metrics/images/6%2520-%2520counter%2520metric%2520in%2520explorer.png%3Fraw%3Dtrue" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--GPkQOhsR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://github.com/yuriatgoogle/stack-doctor/blob/master/log-based-metrics/images/6%2520-%2520counter%2520metric%2520in%2520explorer.png%3Fraw%3Dtrue" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Distribution metric - latency
&lt;/h2&gt;

&lt;p&gt;Now that you've seen how to create a simple counter metric that will track the number of errors per minute, let's take a look at the other reason we might want to use log-based metrics - to track a specific numeric value in the log payload.  I've created a second &lt;a href="https://github.com/yuriatgoogle/stack-doctor/blob/master/log-based-metrics/latency/latency.js"&gt;example&lt;/a&gt; - this time, I'm introducing a randomly generated delay in the code and logging it as the latency.  Here's what the payload looks like:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Wgufi-Ca--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://github.com/yuriatgoogle/stack-doctor/blob/master/log-based-metrics/images/7%2520-%2520latency%2520payload.png%3Fraw%3Dtrue" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Wgufi-Ca--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://github.com/yuriatgoogle/stack-doctor/blob/master/log-based-metrics/images/7%2520-%2520latency%2520payload.png%3Fraw%3Dtrue" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I want to create a metric that will capture value in the "message" field.  To do that, I again use the "Show matching entries" feature and create a metric from the selection.  I need to use a regular expression to parse the field and extract the numeric value. &lt;strong&gt;Note&lt;/strong&gt; that I modified the selection filter to look for all the messages coming from my local machine, where the Node.js code is running, by using the logName filter. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Jm9aPN4z--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://github.com/yuriatgoogle/stack-doctor/blob/master/log-based-metrics/images/8%2520-%2520latency%2520metric%2520config.png%3Fraw%3Dtrue" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Jm9aPN4z--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://github.com/yuriatgoogle/stack-doctor/blob/master/log-based-metrics/images/8%2520-%2520latency%2520metric%2520config.png%3Fraw%3Dtrue" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As before, I create the metric and view it in Metrics Explorer:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--W0Yr4EZZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://github.com/yuriatgoogle/stack-doctor/blob/master/log-based-metrics/images/9%2520-%2520latency%2520metric%2520in%2520explorer.png%3Fraw%3Dtrue" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--W0Yr4EZZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://github.com/yuriatgoogle/stack-doctor/blob/master/log-based-metrics/images/9%2520-%2520latency%2520metric%2520in%2520explorer.png%3Fraw%3Dtrue" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Using metrics
&lt;/h1&gt;

&lt;p&gt;Now that we have our metrics created, we can use them just like any other metric in Stackdriver Monitoring.  We can create charts with them and use them in Alerting Policies.&lt;/p&gt;

&lt;h2&gt;
  
  
  Charts
&lt;/h2&gt;

&lt;p&gt;As an example, I created a chart for the latency I'm writing as a log value:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--QVbqBpYM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://github.com/yuriatgoogle/stack-doctor/blob/master/log-based-metrics/images/10%2520-%2520latency%2520in%2520dashboard.png%3Fraw%3Dtrue" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--QVbqBpYM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://github.com/yuriatgoogle/stack-doctor/blob/master/log-based-metrics/images/10%2520-%2520latency%2520in%2520dashboard.png%3Fraw%3Dtrue" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;One great thing about log-based metrics is that you can easily see the logs that feed them.  Click on the 3 dot menu for the chart and select View Logs:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--uBXpe_gF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://github.com/yuriatgoogle/stack-doctor/blob/master/log-based-metrics/images/11%2520-%2520view%2520latency%2520logs.png%3Fraw%3Dtrue" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--uBXpe_gF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://github.com/yuriatgoogle/stack-doctor/blob/master/log-based-metrics/images/11%2520-%2520view%2520latency%2520logs.png%3Fraw%3Dtrue" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The result is an advanced filter that shows you the logs that were ingested within the timeframe that the chart or dashboard were set to select:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--qx4GD_Ma--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://github.com/yuriatgoogle/stack-doctor/blob/master/log-based-metrics/images/12%2520-%2520logs%2520with%2520time%2520filter.png%3Fraw%3Dtrue" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--qx4GD_Ma--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://github.com/yuriatgoogle/stack-doctor/blob/master/log-based-metrics/images/12%2520-%2520logs%2520with%2520time%2520filter.png%3Fraw%3Dtrue" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Alerts
&lt;/h2&gt;

&lt;p&gt;To address the original question raised at the start - we can also use our metrics as the basis for alerts.  For example, if we wanted to know when our error rate exceeded a specific threshold, we can simply use our error metric in an alerting policy condition:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--cTjFbQUE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://github.com/yuriatgoogle/stack-doctor/blob/master/log-based-metrics/images/13%2520-%2520alerting%2520policy%2520config.png%3Fraw%3Dtrue" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--cTjFbQUE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://github.com/yuriatgoogle/stack-doctor/blob/master/log-based-metrics/images/13%2520-%2520alerting%2520policy%2520config.png%3Fraw%3Dtrue" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can do the same for our distribution metric that captures latency:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--aw1UDq_B--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://github.com/yuriatgoogle/stack-doctor/blob/master/log-based-metrics/images/14%2520-%2520latency%2520alerting%2520config.png%3Fraw%3Dtrue" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--aw1UDq_B--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://github.com/yuriatgoogle/stack-doctor/blob/master/log-based-metrics/images/14%2520-%2520latency%2520alerting%2520config.png%3Fraw%3Dtrue" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt; that until a week or so ago, the documentation stated that alerting is not supported for distribution metrics - this is not true, and you can alert on distribution metrics by using a percentile aligner (with thanks to &lt;a href="https://twitter.com/summitraj"&gt;Summit&lt;/a&gt; for catching the documentation error). &lt;/p&gt;

&lt;h1&gt;
  
  
  Summary and conclusion
&lt;/h1&gt;

&lt;p&gt;I hope that you now have a better understanding of how to create metrics from logs in Stackdriver and how you can use those metrics to visualize data with charts and create alerts with alerting policies.  As always, I appreciate your feedback, questions, and ideas for topics to address in the future.  &lt;/p&gt;

</description>
      <category>stackdriver</category>
      <category>gcp</category>
      <category>logs</category>
      <category>metrics</category>
    </item>
    <item>
      <title>Introduction to open source observability tools on Kubernetes</title>
      <dc:creator>Yuri Grinshteyn</dc:creator>
      <pubDate>Mon, 07 Oct 2019 14:08:04 +0000</pubDate>
      <link>https://dev.to/yurigrinshteyn/introduction-to-open-source-observability-tools-on-kubernetes-43ng</link>
      <guid>https://dev.to/yurigrinshteyn/introduction-to-open-source-observability-tools-on-kubernetes-43ng</guid>
      <description>&lt;p&gt;This is my first post here, so I'll just start by cross-posting an article I just published on opensource.com:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://opensource.com/article/19/10/open-source-observability-kubernetes"&gt;https://opensource.com/article/19/10/open-source-observability-kubernetes&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Enjoy!&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>observability</category>
      <category>sre</category>
      <category>monitoring</category>
    </item>
  </channel>
</rss>
