<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: honeycomb</title>
    <description>The latest articles on DEV Community by honeycomb (@honeycomb_io).</description>
    <link>https://dev.to/honeycomb_io</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F347046%2F89a9ef7d-b5b7-4c61-b2cc-ae7939a84d9c.jpg</url>
      <title>DEV Community: honeycomb</title>
      <link>https://dev.to/honeycomb_io</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/honeycomb_io"/>
    <language>en</language>
    <item>
      <title>A Culture of Observability Helps Engineers Hit the Spot (Instance)</title>
      <dc:creator>honeycomb</dc:creator>
      <pubDate>Mon, 11 Jan 2021 17:47:25 +0000</pubDate>
      <link>https://dev.to/honeycombio/a-culture-of-observability-helps-engineers-hit-the-spot-instance-16nf</link>
      <guid>https://dev.to/honeycombio/a-culture-of-observability-helps-engineers-hit-the-spot-instance-16nf</guid>
      <description>&lt;p&gt;At Honeycomb, we’re big fans of AWS Spot Instances. During a &lt;a href="https://www.honeycomb.io/blog/treading-in-haunted-graveyards/"&gt;recent bill reduction exercise&lt;/a&gt;, we found significant savings in running our API service on Spot, and now look to use it wherever we can. Not all services fit the mold for Spot, though. While a lot of us are comfortable running services atop on-demand EC2 instances by now, with hosts that could be terminated or fail at any time, this is relatively rare when compared with using Spot, where sudden swings in the Spot market can result in your instances going away with only two minutes of warning. When we considered running a brand-new, stateful, and time-sensitive service on Spot, it seemed like a non-starter, but the potential upside was too good to not give it a try.&lt;/p&gt;

&lt;p&gt;Our strong culture of observability and the environment it enables means we have the ability to experiment safely. With a bit of engineering thoughtfulness, modern deployment processes, and good instrumentation we were able to make the switch with confidence. Here's how we did it:&lt;/p&gt;

&lt;h2&gt;What we needed to support&lt;/h2&gt;

&lt;p&gt;We're working on a feature that enables you to do trace-aware sampling with just a few clicks in our UI. The backend service (internal name Dalmatian) is responsible for buffering trace events and making sampling decisions on the complete trace. To work correctly, it must:&lt;/p&gt;

&lt;ul&gt;
    &lt;li&gt;process a continuous, unordered stream of events from the API as quickly as possible&lt;/li&gt;
    &lt;li&gt;group and buffer related events long enough to make a decision&lt;/li&gt;
    &lt;li&gt;avoid re-processing the same traces, and make consistent decisions for the same trace ID for spans that arrive late&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To accomplish grouping, we use a deterministic mapping function to route trace IDs to the same Kafka partition, with each partition getting processed by one Dalmatian host. To keep track of traces we’ve already processed, we also need this pesky little thing called state. Specifically:&lt;/p&gt;

&lt;ul&gt;
    &lt;li&gt;the current offset in the event stream,&lt;/li&gt;
    &lt;li&gt;the offset of the oldest seen trace not yet processed,&lt;/li&gt;
    &lt;li&gt;recent history of processed trace IDs and their sampling decision.&lt;/li&gt;
&lt;/ul&gt;
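&lt;p&gt;The trace-to-partition routing described above can be sketched in a few lines. This is a hypothetical illustration (our actual mapping function isn't shown here): hash the trace ID and take the result modulo the partition count, so every span of a trace, including late arrivals, reaches the same host and its buffered state:&lt;/p&gt;

```python
import hashlib

def partition_for(trace_id: str, num_partitions: int) -> int:
    # Deterministic hash: the same trace ID always lands on the same
    # Kafka partition, and therefore on the same Dalmatian host.
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions

# Spans that arrive late still route to the partition holding the
# trace's buffered events and its prior sampling decision.
assert partition_for("trace-abc123", 12) == partition_for("trace-abc123", 12)
```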

&lt;p&gt;This state is persisted to a combination of Redis and S3, so Dalmatian doesn’t need to store anything locally between restarts. Instead, state is regularly flushed to external storage. That’s easy. Startup is more complex, however, as there are some precise steps that need to happen to resume event processing correctly. There’s also a lot of state fetching to do, which adds time and can fail. In short, once the service is running, it’s better to not rock the boat and just let it run. Introducing additional chaos with hosts that come and go more frequently was something to consider before running this service on Spot.&lt;/p&gt;

&lt;h2&gt;Chaos by default&lt;/h2&gt;

&lt;p&gt;As part of the observability-centric culture we strive for at Honeycomb, we &lt;a href="https://www.honeycomb.io/blog/working-on-hitting-a-release-cadence-ci-cd-observability-can-help-you-get-there/"&gt;embrace CI/CD&lt;/a&gt;. Deploys go out every hour. Among the many benefits of this approach is that our deploy system is regularly exercised, and our services undergo restarts all the time. Building a stateful service with a complex startup and shutdown sequence in this environment means that bugs in those sequences show themselves very quickly. By the time we considered running Dalmatian on Spot (cue the puns), we’d already made our state management stable through restarts, and most of the problems Spot might have introduced were already handled.&lt;/p&gt;

&lt;h2&gt;Hot spare Spots&lt;/h2&gt;

&lt;p&gt;There was one lingering issue with using Spot, though: having hosts regularly go away means we need to wait on new hosts to come up. Between ASG response times, Chef, and our deploy process, it averages 10 minutes for a new host to come online. It’s something we hope to improve one day, but we’re not there yet. With one processing host per partition, losing a host can result in at least a 10-minute delay in event processing. That’s a bad experience for our customers, and one we’d like to avoid. Fortunately, Spot instances are cheap, and since we’re averaging a 70% savings on instance costs, we can afford to run with extra hosts in standby mode. This is accomplished with a bit of extra Terraform code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;min_size&lt;/span&gt;                  &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;dalmatian_instance_count&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;size&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="err"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;ceil&lt;/span&gt;&lt;span class="err"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;dalmatian_instance_count&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;size&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="err"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="err"&gt;)&lt;/span&gt;
&lt;span class="nx"&gt;max_size&lt;/span&gt;                  &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;dalmatian_instance_count&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;size&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="err"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;ceil&lt;/span&gt;&lt;span class="err"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;dalmatian_instance_count&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;size&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="err"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="err"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
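&lt;p&gt;The arithmetic in that expression works out to one standby host for every five partition processors, rounded up. A quick sketch of the sizing (the 5:1 spare ratio comes straight from the Terraform above):&lt;/p&gt;

```python
import math

def fleet_size(partition_count: int) -> int:
    # Mirrors the min_size/max_size expression: the base worker count
    # plus one hot spare per five partitions, rounded up.
    return partition_count + math.ceil(partition_count / 5)

assert fleet_size(10) == 12  # 10 workers + 2 spares
assert fleet_size(12) == 15  # 12 workers + 3 spares
```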



&lt;p&gt;With a small modification in our process startup code, instances will wait for a partition to become free and immediately pick up that partition’s workload.&lt;/p&gt;
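&lt;p&gt;The standby logic can be sketched roughly like this. Everything here is hypothetical (the names and the in-memory lock store are stand-ins, and a real implementation would need an atomic claim, e.g. via a lock in Redis): each spare polls for a partition with no live owner and adopts it:&lt;/p&gt;

```python
import time

def claim_free_partition(owners: dict, host_id: str, partitions: range) -> int:
    # Hypothetical standby loop: poll until some partition has lost its
    # owner, claim it, and return it so event processing can begin.
    while True:
        for p in partitions:
            if owners.get(p) is None:  # partition has no live host
                owners[p] = host_id    # claim it (atomic in a real store)
                return p
        time.sleep(1)  # still a hot spare; check again shortly

# Partition 1's host has gone away; the spare picks it up immediately.
owners = {0: "host-a", 1: None}
assert claim_free_partition(owners, "spare-1", range(2)) == 1
```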

&lt;h2&gt;Spot how we're doing&lt;/h2&gt;

&lt;p&gt;We’ve made a significant change to how we run this service. How do we know if things are working as intended? Well, let’s define &lt;b&gt;working&lt;/b&gt;, aka our &lt;a href="https://www.honeycomb.io/blog/working-toward-service-level-objectives-slos-part-1/"&gt;Service Level Objective&lt;/a&gt;. We think that significant ingestion delays break our core product experience, so for Dalmatian and the new feature, we set an SLO that events should be processed and delivered in under five minutes, 99.9% of the time.&lt;/p&gt;

&lt;p&gt;With the SLO for context, we can restate our question as: Can we move this new functionality onto Spot Fleets and still maintain our Service Level Objective despite the extra host churn?&lt;/p&gt;
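&lt;p&gt;For a sense of scale, a 99.9% target over a 30-day window leaves an error budget of roughly 43 minutes of non-compliant events:&lt;/p&gt;

```python
# Error budget for a 99.9% SLO over a 30-day window.
window_minutes = 30 * 24 * 60           # 43,200 minutes in the window
error_budget = window_minutes * 0.001   # the 0.1% we are allowed to miss
print(round(error_budget, 1))           # prints 43.2
```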

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--KHt38xKR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2019/10/replacing-hosts.png" class="article-body-image-wrapper"&gt;&lt;img class="aligncenter size-full wp-image-5251" src="https://res.cloudinary.com/practicaldev/image/fetch/s--KHt38xKR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2019/10/replacing-hosts.png" alt="screenshot of heatmap with trailing edges showing added hosts" width="767" height="370"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the last month in our production environment, we have observed at least 10 host replacements. This is higher churn than we’ve had in the past, but also what we expected with Spot.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ENmsXp5t--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2019/10/slo-compliance.png" class="article-body-image-wrapper"&gt;&lt;img class="aligncenter size-full wp-image-5252" src="https://res.cloudinary.com/practicaldev/image/fetch/s--ENmsXp5t--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2019/10/slo-compliance.png" alt="screenshot showing 100% SLO compliance graph" width="683" height="292"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In that same month, we stayed above our threshold for SLO compliance (99.9%). From this perspective, it looks like a successful change, one we’re excited about because it saves us a significant amount in operational expenses.&lt;/p&gt;

&lt;h2&gt;New firefighting capabilities must be part of the plan&lt;/h2&gt;

&lt;p&gt;Things are working fine now, but they might break in the future. Inevitably, we will find ourselves running behind in processing events. One set of decisions we had to make for this service was “what kind of instance do we run on, and how large?” When we’re keeping up with incoming traffic, we can run on smaller instance types. If we ever fall behind, though, we need more compute to catch up as quickly as we can, but that’s expensive to run all the time when we may only need it once a year. Since we have a one-processor-per-partition model, we can’t really scale out dynamically. But we can scale up!&lt;/p&gt;

&lt;p&gt;Due to the previously mentioned 10-minute host bootstrapping time, swapping out every host with a larger instance is not ideal. We’ll fall further behind while we wait to bring the instances up, and once we decide to scale down, we’ll fall behind again. What if we stood up an identically sized fleet with larger instances and cut over to that? This Terraform spec defines what we call our “catch-up fleet”. The ASG exists only when &lt;code&gt;dalmatian_catchup_instance_count&lt;/code&gt; is greater than 0. In an emergency, a one-line diff can bring this online.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_autoscaling_group"&lt;/span&gt; &lt;span class="s2"&gt;"dalmatian_catchup_asg"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"dalmatian_catchup_${var.env}"&lt;/span&gt;
  &lt;span class="nx"&gt;count&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;dalmatian_catchup_instance_count&lt;/span&gt; &lt;span class="err"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="err"&gt;?&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;

  &lt;span class="c1"&gt;# ...&lt;/span&gt;

  &lt;span class="nx"&gt;launch_template&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;launch_template_specification&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;launch_template_id&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_launch_template&lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;dalmatian_lt&lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
      &lt;span class="nx"&gt;version&lt;/span&gt;            &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"$Latest"&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="nx"&gt;dynamic&lt;/span&gt; &lt;span class="s2"&gt;"override"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;for_each&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;dalmatian_catchup_instance_types&lt;/span&gt;
      &lt;span class="nx"&gt;content&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;instance_type&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;override&lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When the hosts are ready, we can cut over to them by stopping all processes on the smaller fleet, enabling the “hot spares” logic to kick in on the backup fleet. When we’re caught up, we can repeat the process in reverse: start the processes on the smaller fleet, and stop the larger fleet. Using Honeycomb, of course, we were able to verify that this solution works with a fire drill in one of our smaller environments.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ZACynbFF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2019/10/cutover.png" class="article-body-image-wrapper"&gt;&lt;img class="aligncenter size-full wp-image-5250" src="https://res.cloudinary.com/practicaldev/image/fetch/s--ZACynbFF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2019/10/cutover.png" alt="screenshot of the cutover timeframe, validating low latency" width="726" height="500"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;Spot the benefits of an observability-centric culture&lt;/h2&gt;

&lt;p&gt;OK, maybe that's enough Spot-related puns, but the point here is that our ability to experiment with new architectures and make significant changes to how services operate is predicated on our ability to deploy rapidly and find out how those changes impact the behavior of the system. Robust CI/CD processes, thoughtful and context-rich instrumentation, and attention to what we as software owners will need to do to support this functionality in the future make it easier and safer to build and ship software our users love and benefit from.&lt;/p&gt;




&lt;p&gt;Ready to take the sting out of shipping new software? Honeycomb helps you spot outliers and find out how your users experience your service. &lt;a href="https://ui.honeycomb.io/signup?&amp;amp;utm_source=Devto&amp;amp;utm_Devto=blog&amp;amp;utm_campaign=referral&amp;amp;utm_keyword=%7Bkeyword%7D&amp;amp;utm_content=a-culture-of-observability-helps-engineers-hit-the-spot-instance"&gt;Get started for free!&lt;/a&gt;&lt;/p&gt;

</description>
      <category>observability</category>
      <category>aws</category>
      <category>devops</category>
      <category>terraform</category>
    </item>
    <item>
      <title>Instrumenting Lambda with Traces: A Complete Example in Python</title>
      <dc:creator>honeycomb</dc:creator>
      <pubDate>Mon, 04 Jan 2021 22:02:15 +0000</pubDate>
      <link>https://dev.to/honeycombio/instrumenting-lambda-with-traces-a-complete-example-in-python-gnk</link>
      <guid>https://dev.to/honeycombio/instrumenting-lambda-with-traces-a-complete-example-in-python-gnk</guid>
      <description>&lt;p&gt;We’re big fans of AWS Lambda at Honeycomb. As you &lt;a href="https://www.honeycomb.io/blog/secondary-storage-to-just-storage/"&gt;may have read&lt;/a&gt;, we recently made some major improvements to our storage engine by leveraging Lambda to process more data in less time. Making a change to a complex system like our storage engine is daunting, but can be made less so with good instrumentation and tracing. For this project, that meant getting instrumentation out of Lambda and into Honeycomb. Lambda has some unique constraints that make this more complex than you might think, so in this post we’ll look at instrumenting an app from scratch.&lt;/p&gt;

&lt;h2&gt;Bootstrapping the app&lt;/h2&gt;

&lt;p&gt;Before we begin, you’ll want to ensure you have the &lt;a href="https://serverless.com/framework/docs/getting-started/"&gt;Serverless&lt;/a&gt; framework installed and a recent Python version (I’m using 3.6). For this example, I picked a Serverless Python TODO API template in the Serverless &lt;a href="https://github.com/serverless/examples"&gt;Examples&lt;/a&gt; repo. I like this particular template for demonstration because it sets up an external dependency (DynamoDB), which gives us something interesting to look at when we’re tracing. To install the demo app:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;sls &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--url&lt;/span&gt; https://github.com/serverless/examples/tree/master/aws-python-rest-api-with-dynamodb &lt;span class="nt"&gt;--name&lt;/span&gt; my-sls-app
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should see a project directory with some contents like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;my-sls-app &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;ls&lt;/span&gt;
&lt;span class="go"&gt;README.md    package.json    serverless.yml    todos
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There’s just a bit more to add before we get going. We want to install &lt;code&gt;honeycomb-beeline&lt;/code&gt; in our app, so we’ll need to package the Python requirements:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;#&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;install &lt;/span&gt;the serverless-python-requirements module
&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--save-dev&lt;/span&gt; serverless-python-requirements
&lt;span class="gp"&gt;#&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;install &lt;/span&gt;beeline-python &lt;span class="k"&gt;in &lt;/span&gt;a venv, &lt;span class="k"&gt;then &lt;/span&gt;&lt;span class="nb"&gt;export &lt;/span&gt;the requirements.txt
&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;virtualenv venv &lt;span class="nt"&gt;--python&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;python3
&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;source &lt;/span&gt;venv/bin/activate
&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;honeycomb-beeline
&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;pip freeze &amp;gt; requirements.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now edit &lt;code&gt;serverless.yml&lt;/code&gt; and add the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# essentially, this injects your python requirements into the package&lt;/span&gt;
&lt;span class="c1"&gt;# before deploying. This runs in docker if you're not deploying from a linux host&lt;/span&gt;
&lt;span class="na"&gt;plugins&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;serverless-python-requirements&lt;/span&gt;

&lt;span class="na"&gt;custom&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;pythonRequirements&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;dockerizePip&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;non-linux&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now we can deploy using &lt;code&gt;sls deploy&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;Serverless: Packaging service...
[...]
endpoints:
  POST - https://mtd8iptz1m.execute-api.us-east-2.amazonaws.com/dev/todos
  GET - https://mtd8iptz1m.execute-api.us-east-2.amazonaws.com/dev/todos
  GET - https://mtd8iptz1m.execute-api.us-east-2.amazonaws.com/dev/todos/{id}
  PUT - https://mtd8iptz1m.execute-api.us-east-2.amazonaws.com/dev/todos/{id}
  DELETE - https://mtd8iptz1m.execute-api.us-east-2.amazonaws.com/dev/todos/{id}
functions:
  create: my-sls-app-dev-create
  list: my-sls-app-dev-list
  get: my-sls-app-dev-get
  update: my-sls-app-dev-update
  delete: my-sls-app-dev-delete
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A few curl calls against the new endpoints confirm we’re in good shape!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;curl https://mtd8iptz1m.execute-api.us-east-2.amazonaws.com/dev/todos
&lt;span class="gp"&gt;#&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;db is empty initially
&lt;span class="go"&gt;[]
&lt;/span&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"text": "write a blog post"}'&lt;/span&gt; https://mtd8iptz1m.execute-api.us-east-2.amazonaws.com/dev/todos
&lt;span class="go"&gt;{"id": "267199de-25c0-11ea-82d7-e6f595c02494", "text": "write a blog post", "checked": false, "createdAt": "1577131779.711644", "updatedAt": "1577131779.711644"}
&lt;/span&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;curl https://mtd8iptz1m.execute-api.us-east-2.amazonaws.com/dev/todos
&lt;span class="go"&gt;[{"checked": false, "createdAt": "1577131779.711644", "text": "write a blog post", "id": "267199de-25c0-11ea-82d7-e6f595c02494", "updatedAt": "1577131779.711644"}]
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;Implementing tracing&lt;/h2&gt;

&lt;p&gt;The first step is to initialize the beeline. The &lt;code&gt;todos/__init__.py&lt;/code&gt; file is a great place to drop that init code so that it gets pulled in by all of the Lambda handlers.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;beeline&lt;/span&gt;
&lt;span class="n"&gt;beeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;init&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;writekey&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'YOURWRITEKEY'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;service_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'todo-app'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'my-sls-app'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now let’s look at that curl we ran earlier. That’s hitting the &lt;code&gt;list&lt;/code&gt; function on the API. Let’s open this up and add some instrumentation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# ...
&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;beeline&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;beeline.middleware.awslambda&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;beeline_wrapper&lt;/span&gt;

&lt;span class="c1"&gt;# The beeline_wrapper decorator wraps the Lambda handler here in a span. By default,
# this also starts a new trace. The span and trace are finished when the function exits
# (but before the response is returned)
&lt;/span&gt;&lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="n"&gt;beeline_wrapper&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;table&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dynamodb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'DYNAMODB_TABLE'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;beeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add_context_field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"table_name"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# This call to our db dependency is worth knowing about - let's wrap it in a span.
&lt;/span&gt;    &lt;span class="c1"&gt;# That's easy to do with a context manager.
&lt;/span&gt;    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;beeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tracer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"db-scan"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# fetch all todos from the database
&lt;/span&gt;        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scan&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c1"&gt;# .. capture any results we want to include from this function call
&lt;/span&gt;    &lt;span class="n"&gt;beeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add_context&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
        &lt;span class="s"&gt;'status_code'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s"&gt;'num_items'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'Items'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;

    &lt;span class="c1"&gt;# create a response
&lt;/span&gt;    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="s"&gt;"statusCode"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s"&gt;"body"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'Items'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;cls&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;decimalencoder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DecimalEncoder&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Another &lt;code&gt;sls deploy&lt;/code&gt; gets our instrumentation out there. Once deployed, I can re-run that curl. Now I have a trace in my dataset!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--aQftfGW6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/01/example-trace1.png" class="article-body-image-wrapper"&gt;&lt;img class="alignnone size-full wp-image-5765" src="https://res.cloudinary.com/practicaldev/image/fetch/s--aQftfGW6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/01/example-trace1.png" alt="screenshot of an individual trace in honeycomb" width="804" height="181"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The trace shows me a Lambda runtime of 6.7ms, but from the client, it seemed slow, so I check the Lambda execution logs and see:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;REPORT RequestId: def28374-1e35-429b-8667-3ab481b61ad4 Duration: 37.27 ms
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Why the discrepancy?&lt;/p&gt;

&lt;h2&gt;The instrumentation performance tax&lt;/h2&gt;

&lt;p&gt;The lifecycle of a Lambda container is tricky. You may already know about cold starts: your code isn’t running until it is invoked, and on that first invocation the container has to start up, which can add latency. Once your function returns a response, the container goes into a frozen state, and execution of any running threads is suspended. From there, the container may be reused (without the cold start penalty) by subsequent requests, or terminated. That termination is not graceful, so the running function isn't given a chance to do housekeeping.&lt;/p&gt;

&lt;p&gt;What does that mean if you’re trying to send telemetry?&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Typical client patterns using delayed batching, like those used in our SDKs, are unreliable in Lambda, since the batch transmission thread(s) may never get a chance to run after the function exits.&lt;/li&gt;
  &lt;li&gt;Anything that needs to be sent reliably must be sent before the function returns.&lt;/li&gt;
&lt;/ul&gt;
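The first failure mode is easy to reproduce outside of Lambda with a short, stdlib-only sketch. This is an illustration of the general problem, not Beeline code: a background batching thread never gets a chance to run if the caller finishes first.

```python
import threading
import time

sent_batches = []  # stands in for events delivered to an API


def batch_sender():
    # a typical batching loop waits a bit before flushing accumulated events
    time.sleep(0.5)
    sent_batches.append("events")


# Daemon threads are abandoned when the main thread exits, much like a
# Lambda container freezes any running threads once the handler returns.
threading.Thread(target=batch_sender, daemon=True).start()

# The "function" returns immediately -- the batch was never sent.
print(sent_batches)  # []
```

The batch thread is still asleep when the process exits, so the events are silently dropped, which is exactly why Lambda telemetry has to be flushed before the handler returns.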

&lt;p&gt;That’s why if you crack open the Beeline middleware code for Lambda, you’ll see this line:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;beeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_beeline&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;flush&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is our way of ensuring event batches are sent before the function exits. But it isn't free: it delays the response to the client by the round-trip time from your Lambda to our API. For many use cases, that's an acceptable tradeoff for getting instrumentation out of Lambda. But what if your application (or your user) is latency-sensitive?&lt;/p&gt;

&lt;h2&gt;Cloudwatch logs to the rescue&lt;/h2&gt;

&lt;p&gt;There is one way to synchronously ship data from a Lambda function without significant blocking: logging.&lt;br&gt;
If you want detailed telemetry without a performance hit, this is currently the only way to go. Each Lambda function writes to its own Cloudwatch Log stream, and AWS makes it relatively easy to subscribe to those streams with other Lambda functions. Here’s where we introduce the &lt;a href="https://github.com/honeycombio/agentless-integrations-for-aws#honeycomb-publisher-for-lambda"&gt;Honeycomb Publisher for Lambda&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The Publisher can subscribe to one or more Cloudwatch Log streams, pulling in event data in JSON format and shipping it to the correct dataset. To use it, you need to deploy it, subscribe it to your Lambda functions’ log streams, and then configure your app to emit events via logs.&lt;/p&gt;

&lt;h3&gt;Switching to log output&lt;/h3&gt;

&lt;p&gt;The Python Beeline makes it &lt;a href="https://docs.honeycomb.io/getting-data-in/python/beeline/#customizing-event-transmission"&gt;easy&lt;/a&gt; to override the event transmission mechanism. For our purpose here, we’ll use the built-in &lt;code&gt;FileTransmission&lt;/code&gt; with output to &lt;code&gt;sys.stdout&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;sys&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;beeline&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;libhoney&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;libhoney.transmission&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FileTransmission&lt;/span&gt;
&lt;span class="n"&gt;beeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;init&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;writekey&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'can be anything'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;service_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'todo-app'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'my-sls-app'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;transmission_impl&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;FileTransmission&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stdout&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That’s all we need to do to have events flow to Cloudwatch instead of the Honeycomb API. Note that you do not need to set a valid writekey when using &lt;code&gt;FileTransmission&lt;/code&gt;; the Publisher is responsible for authenticating API traffic.&lt;/p&gt;

&lt;p&gt;If you execute your app after deploying this, you should see span data in the Cloudwatch Logs for the Lambda function.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; 
    &lt;/span&gt;&lt;span class="nl"&gt;"time"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2019-12-17T16:54:20.355317Z"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; 
    &lt;/span&gt;&lt;span class="nl"&gt;"samplerate"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; 
    &lt;/span&gt;&lt;span class="nl"&gt;"dataset"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"my-sls-app"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"data"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
       &lt;/span&gt;&lt;span class="nl"&gt;"service_name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"todo-app"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
       &lt;/span&gt;&lt;span class="nl"&gt;"meta.beeline_version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2.11.2"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
       &lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"duration_ms"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;6.47&lt;/span&gt;&lt;span class="w"&gt; 
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt; 
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Also note the shorter Lambda run time, now that we aren’t blocking on an API call before exiting:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;REPORT RequestId: 30ff2be3-2c0b-4869-8da3-7af280dfc76c Duration: 9.41 ms
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;Deploying the Publisher&lt;/h3&gt;

&lt;p&gt;The Publisher is another Lambda function, so if you are familiar with integrating third-party Lambda functions in your stack, you should use the tools and methods that work best for you. We provide a helpful, generic &lt;a href="https://github.com/honeycombio/agentless-integrations-for-aws#honeycomb-publisher-for-lambda"&gt;Cloudformation Template&lt;/a&gt; to walk you through installing it and to document its AWS dependencies. Since we’re using Serverless in this tutorial, though, let’s see if we can make it a “native” part of our project!&lt;/p&gt;

&lt;p&gt;Serverless allows you to describe ancillary AWS resources to spin up alongside your stack. In our case, we want to bolt on the publisher and its dependencies whenever we deploy our app. In the &lt;code&gt;serverless.yml&lt;/code&gt;, append the following to the &lt;code&gt;resources&lt;/code&gt; block of the sample app:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;Resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;TodosDynamoDbTable&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="c1"&gt;# ...&lt;/span&gt;
    &lt;span class="na"&gt;PublisherLambdaHandler&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;AWS::Lambda::Function'&lt;/span&gt;
      &lt;span class="na"&gt;Properties&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;Code&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;S3Bucket&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;honeycomb-integrations-${opt:region, self:provider.region}&lt;/span&gt;
          &lt;span class="na"&gt;S3Key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;agentless-integrations-for-aws/LATEST/ingest-handlers.zip&lt;/span&gt;
        &lt;span class="na"&gt;Description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Lambda function for publishing asynchronous events from Lambda&lt;/span&gt;
        &lt;span class="na"&gt;Environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;Variables&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;HONEYCOMB_WRITE_KEY&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;YOURHONEYCOMBKEY'&lt;/span&gt;
            &lt;span class="na"&gt;DATASET&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;my-sls-app'&lt;/span&gt;
        &lt;span class="na"&gt;FunctionName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;PublisherLambdaHandler-${self:service}-${opt:stage, self:provider.stage}&lt;/span&gt;
        &lt;span class="na"&gt;Handler&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;publisher&lt;/span&gt;
        &lt;span class="na"&gt;MemorySize&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;128&lt;/span&gt;
        &lt;span class="na"&gt;Role&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Fn::GetAtt"&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;LambdaIAMRole&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;Arn&lt;/span&gt;
        &lt;span class="na"&gt;Runtime&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;go1.x&lt;/span&gt;
        &lt;span class="na"&gt;Timeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
    &lt;span class="na"&gt;ExecutePermission&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AWS::Lambda::Permission"&lt;/span&gt;
      &lt;span class="na"&gt;Properties&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;Action&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;lambda:InvokeFunction'&lt;/span&gt;
        &lt;span class="na"&gt;FunctionName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Fn::GetAtt"&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;PublisherLambdaHandler&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;Arn&lt;/span&gt;
        &lt;span class="na"&gt;Principal&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;logs.amazonaws.com'&lt;/span&gt;
    &lt;span class="na"&gt;LambdaIAMRole&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AWS::IAM::Role"&lt;/span&gt;
      &lt;span class="na"&gt;Properties&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;AssumeRolePolicyDocument&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;Version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2012-10-17"&lt;/span&gt;
          &lt;span class="na"&gt;Statement&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;Effect&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Allow"&lt;/span&gt;
              &lt;span class="na"&gt;Principal&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                &lt;span class="na"&gt;Service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;lambda.amazonaws.com"&lt;/span&gt;
              &lt;span class="na"&gt;Action&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sts:AssumeRole"&lt;/span&gt;
    &lt;span class="na"&gt;LambdaLogPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AWS::IAM::Policy"&lt;/span&gt;
      &lt;span class="na"&gt;Properties&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;PolicyName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;lambda-create-log"&lt;/span&gt;
        &lt;span class="na"&gt;Roles&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;Ref&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;LambdaIAMRole&lt;/span&gt;
        &lt;span class="na"&gt;PolicyDocument&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;Version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2012-10-17"&lt;/span&gt;
          &lt;span class="na"&gt;Statement&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;Effect&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Allow&lt;/span&gt;
              &lt;span class="na"&gt;Action&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;logs:CreateLogGroup&lt;/span&gt;
                &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;logs:CreateLogStream&lt;/span&gt;
                &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;logs:PutLogEvents&lt;/span&gt;
              &lt;span class="na"&gt;Resource&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;arn:aws:logs:*:*:*'&lt;/span&gt;
    &lt;span class="c1"&gt;# add one of me for each function in your app&lt;/span&gt;
    &lt;span class="na"&gt;CloudwatchSubscriptionFilterList&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AWS::Logs::SubscriptionFilter"&lt;/span&gt;
      &lt;span class="na"&gt;Properties&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;DestinationArn&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Fn::GetAtt"&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;PublisherLambdaHandler&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;Arn&lt;/span&gt;
        &lt;span class="na"&gt;LogGroupName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/aws/lambda/${self:service}-${opt:stage, self:provider.stage}-list&lt;/span&gt;
        &lt;span class="na"&gt;FilterPattern&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;'&lt;/span&gt;
      &lt;span class="na"&gt;DependsOn&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ExecutePermission&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That’s a lot of boilerplate! The main things to note: set a valid value for &lt;code&gt;HONEYCOMB_WRITE_KEY&lt;/code&gt; and add a &lt;code&gt;SubscriptionFilter&lt;/code&gt; resource for each function in your stack. Run &lt;code&gt;sls deploy&lt;/code&gt; one more time and you should now see a new function, &lt;code&gt;PublisherLambdaHandler-my-sls-app-dev&lt;/code&gt;, deployed alongside your Lambda functions. The Publisher will be subscribed to your app’s Cloudwatch Log Streams and will forward events to Honeycomb.&lt;/p&gt;

&lt;h2&gt;Leveling up with distributed tracing&lt;/h2&gt;

&lt;p&gt;Lambda functions don’t always run in a vacuum; often they are executed as part of a larger chain of events. You can link your Lambda instrumentation with your overall application instrumentation using distributed traces. To do this, you need to pass trace context from the calling app to your Lambda-backed service. Let’s look at a practical example by building on our existing app. We already have a todo API implemented in Lambda. Let’s assume we’re building a UI on top of that. We have some code that fetches the list of items from the API and then renders them.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="n"&gt;beeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;traced&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"todo_list_view"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;todo_list_view&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;beeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tracer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"requests_get"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;todo_list&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'https://mtd8iptz1m.execute-api.us-east-2.amazonaws.com/dev/todos'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;render_list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;todo_list&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="n"&gt;beeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;traced&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"render_list"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;render_list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# ...
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The HTTP request to fetch the list of todo items is done using the &lt;code&gt;requests&lt;/code&gt; lib. This &lt;code&gt;get&lt;/code&gt; call is wrapped in a trace span to measure the request time. We know that &lt;code&gt;list&lt;/code&gt; is instrumented with tracing, so it would be nice if we could link those spans to our trace!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="n"&gt;beeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;traced&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"todo_list_view"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;todo_list_view&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;beeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tracer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"requests_get"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;beeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_beeline&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;tracer_impl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;marshal_trace_context&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;todo_list&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'https://mtd8iptz1m.execute-api.us-east-2.amazonaws.com/dev/todos'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
                                &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;'X-Honeycomb-Trace'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;

    &lt;span class="n"&gt;render_list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;todo_list&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;marshal_trace_context&lt;/code&gt; builds a serialized trace context object, which can be passed to other applications via request header or payload. The Lambda middleware automatically looks for a header named &lt;code&gt;X-Honeycomb-Trace&lt;/code&gt; and extracts the context object, adopting the trace ID and caller’s span ID as its parent rather than starting an all-new trace. All we need to do is pass this context object as a request header before we call the &lt;code&gt;list&lt;/code&gt; endpoint. Adding the header is demonstrated manually here, but the beeline provides a patch for the requests lib that will do this for you. Simply import &lt;code&gt;beeline.patch.requests&lt;/code&gt; into your application after importing the &lt;code&gt;requests&lt;/code&gt; lib.&lt;/p&gt;
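To make the propagation mechanics concrete, here is a stdlib-only sketch of serializing and parsing a trace-context header in the general shape the Beeline uses (a version, a trace ID, a parent span ID, and a base64-encoded context blob). The helper names and exact layout here are illustrative assumptions, not the Beeline's actual implementation; use the library's own marshaling in real code.

```python
import base64
import json
import uuid


def marshal_trace_context(trace_id, parent_id, fields):
    # Serialize trace state into a single header value, roughly:
    #   version;trace_id=...,parent_id=...,context=<base64 JSON>
    ctx = base64.b64encode(json.dumps(fields).encode("utf-8")).decode("ascii")
    return f"1;trace_id={trace_id},parent_id={parent_id},context={ctx}"


def unmarshal_trace_context(header):
    # Split off the version, then parse the comma-separated key=value pairs.
    _version, payload = header.split(";", 1)
    parts = dict(kv.split("=", 1) for kv in payload.split(","))
    parts["context"] = json.loads(base64.b64decode(parts["context"]))
    return parts


header = marshal_trace_context(uuid.uuid4().hex, uuid.uuid4().hex, {"user": "alice"})
print(unmarshal_trace_context(header)["context"])  # {'user': 'alice'}
```

The receiving side adopts the parsed trace ID and uses the caller's span ID as its parent, which is exactly what the Lambda middleware does when it finds the header.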

&lt;p&gt;When we call &lt;code&gt;todo_list_view&lt;/code&gt; in our UI app, we now see one trace with spans from both services:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--6h_wrk8r--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/01/example-trace2.png" class="article-body-image-wrapper"&gt;&lt;img class="alignnone size-full wp-image-5766" src="https://res.cloudinary.com/practicaldev/image/fetch/s--6h_wrk8r--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/01/example-trace2.png" alt="screenshot of an individual trace in the honeycomb ui" width="792" height="214"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;Go Serverless with confidence&lt;/h2&gt;

&lt;p&gt;Building and running apps with Lambda can be complex and is not without challenges, but with distributed tracing and rich instrumentation, there’s no need to fumble around in the dark.&lt;/p&gt;

&lt;p&gt;Ready to instrument your serverless apps? Get started with &lt;a href="https://ui.honeycomb.io/signup?&amp;amp;utm_source=Devto&amp;amp;utm_Devto=blog&amp;amp;utm_campaign=referral&amp;amp;utm_keyword=%7Bkeyword%7D&amp;amp;utm_content=instrumenting-lambda-with-traces-a-complete-example-in-python"&gt;Honeycomb for free&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>serverless</category>
      <category>observability</category>
      <category>tracing</category>
      <category>aws</category>
    </item>
    <item>
      <title>Unpacking Events: All the Better to Observe</title>
      <dc:creator>honeycomb</dc:creator>
      <pubDate>Mon, 16 Nov 2020 17:28:52 +0000</pubDate>
      <link>https://dev.to/honeycombio/unpacking-events-all-the-better-to-observe-1agk</link>
      <guid>https://dev.to/honeycombio/unpacking-events-all-the-better-to-observe-1agk</guid>
      <description>&lt;p&gt;&lt;span&gt;At Honeycomb, we’ve been listening to your feedback. You want easier ways to predict usage and scale your observability spend with your business. What would it look like to meet you where you already are, using similar terms, and give you more control with a simpler experience? We think that means reimagining the customer experience into one that centers around an event-based model.&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;span&gt;But what exactly is an event? What does that mean for your team’s observability journey?&lt;/span&gt;&lt;/p&gt;

&lt;h2&gt;&lt;span&gt;The Old Way&lt;/span&gt;&lt;/h2&gt;

&lt;p&gt;&lt;span&gt;Traditionally, Honeycomb has charged customers based on two axes: ingest and retention. We figured not everyone has the same needs, so why not have you select what’s right for you? But in practice, that translated into placing more administrative burden on you.&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;span&gt;Many of you told us that you’d frequently visit the Usage page to fiddle with the slider, responding to spikes in traffic (and therefore ingest) by reducing your retention period in order to keep your bill around the same amount each month. We started to see that become a difficult (but common) tradeoff: if you want to send more data, then it wouldn’t stick around as long. Many of you ended up robbing Peter to pay Paul.&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;span&gt;In addition, we found that it’s not intuitive for many of you to estimate ingest in gigabytes. As a result, many Usage page sliders were adjusted reactively, without much sense of how that would affect Honeycomb cost as more systems were integrated or as service traffic continued to grow.&lt;/span&gt;&lt;/p&gt;

&lt;h2&gt;&lt;span&gt;Stress-Free Observability&lt;/span&gt;&lt;/h2&gt;

&lt;p&gt;&lt;span&gt;Let’s face it, that’s not a great experience. If we could take what we’ve learned from you to make it better, what would we do?&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;span&gt;You can’t always anticipate traffic spikes. A great experience would not penalize you for those.&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;span&gt;You don’t want to obsess over usage every month or have to explain variations in spend to your accounting team. A great experience would let you set your monthly or yearly spend and forget about it, knowing you have headroom for growth.&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;span&gt;You want to stop worrying about retention and capacity planning. We’ve worked with many teams who limit their retention to 24 hours, or even less! A great experience would give you an extended time frame with your data. In the beginning, you would have space to get comfortable with your new instrumentation. Over time, it would allow you to reflect back on the last two months to support your incident review and capacity planning needs.&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;span&gt;A great experience would encourage you to send us wide events with many context fields, since that’s the richness you need for observability. You shouldn’t have to worry about how much data each of those events sends.&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;span&gt;What we’ve found is that as you progress on your observability journey, your instrumentation will reach an equilibrium. Once you figure out your right level of instrumentation, your usage should be predictable and aligned with your application’s traffic patterns. You want to spend time thinking about improving your application, not optimizing fiddly usage sliders.&lt;/span&gt;&lt;/p&gt;

&lt;h2&gt;&lt;span&gt;You Already Think About Events&lt;/span&gt;&lt;/h2&gt;

&lt;p&gt;&lt;span&gt;In thinking about how to meet you where you already are, it makes a lot of sense to land on the &lt;/span&gt;&lt;b&gt;event&lt;/b&gt;&lt;span&gt; as the core unit of measure.&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;span&gt;But not all events are scoped similarly. Let’s define what an “event” would mean to Honeycomb, and how that relates to the events you care about for your service.&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;span&gt;Honeycomb defines an event as a single “unit of work” in your application code. But a “unit of work” can have a dozen meanings in a dozen contexts. It can be as small as flipping a bit or as large as a round-trip HTTP request. The simplest definition is that an event is usually either a &lt;/span&gt;&lt;b&gt;trace span&lt;/b&gt;&lt;span&gt;, or a &lt;/span&gt;&lt;b&gt;log event&lt;/b&gt;&lt;span&gt;.&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;span&gt;Let’s unpack that a bit further.&lt;/span&gt;&lt;/p&gt;

&lt;h3&gt;&lt;span&gt;Log Events&lt;/span&gt;&lt;/h3&gt;

&lt;p&gt;&lt;span&gt;For logs, enumerating events is pretty straightforward. If you use (or if you’re planning to use) &lt;/span&gt;&lt;a href="https://docs.honeycomb.io/getting-data-in/integrations/honeytail/"&gt;&lt;span&gt;honeytail&lt;/span&gt;&lt;/a&gt;&lt;span&gt; to send structured logs into Honeycomb, you likely already know how many log events you’re sending. For infrastructure teams with less authority over code changes, installing an agent like honeytail, the &lt;/span&gt;&lt;a href="https://docs.honeycomb.io/getting-data-in/integrations/aws/"&gt;&lt;span&gt;AWS integrations&lt;/span&gt;&lt;/a&gt;&lt;span&gt;, or the newly-upgraded &lt;/span&gt;&lt;a href="https://docs.honeycomb.io/getting-data-in/integrations/kubernetes/"&gt;&lt;span&gt;Kubernetes agent&lt;/span&gt;&lt;/a&gt;&lt;span&gt; is the best way to get data into Honeycomb. Each event in these agents corresponds to one event in Honeycomb.&lt;/span&gt;&lt;/p&gt;

&lt;h3&gt;&lt;span&gt;Trace Spans and Events&lt;/span&gt;&lt;/h3&gt;

&lt;p&gt;&lt;span&gt;With spans, we need to examine the definition a bit more. To start: we’ve decided that one span equals one event.&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;span&gt;If you want high-res observability into your systems, you’ve probably looked into trace-aware instrumentation like our &lt;a href="https://docs.honeycomb.io/getting-data-in/beelines/"&gt;Beelines&lt;/a&gt;. &lt;a href="https://www.honeycomb.io/search/tracing"&gt;We’ve already written a lot about the benefits of tracing.&lt;/a&gt; For this post, let’s focus on what &lt;/span&gt;&lt;b&gt;1 span == 1 event &lt;/b&gt;&lt;span&gt;would mean for adding trace-aware instrumentation to your service.&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;span&gt;As a service owner, you already have your own concept of events in your world: HTTP or API requests, background tasks, queue events, etc. If your app is in production, you probably know your traffic patterns, i.e.:&lt;/span&gt;&lt;/p&gt;

&lt;ul&gt;
    &lt;li&gt;&lt;span&gt;How many of your service’s events you experience at any particular time scale—per second, per hour, per day, per month&lt;/span&gt;&lt;/li&gt;
    &lt;li&gt;&lt;span&gt;The seasonality of those events at different times of the day, week, month, year&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;span&gt;When you’ve instrumented for tracing, one of your service’s events generates a single Honeycomb trace. A trace is made up of one or more spans. Remember that in this world, one span is one event.&lt;/span&gt;&lt;/p&gt;

&lt;h2&gt;&lt;span&gt;Estimating Honeycomb Events&lt;/span&gt;&lt;/h2&gt;

&lt;p&gt;&lt;span&gt;From here you can roughly predict your Honeycomb usage as a function of your traffic. For example, you could estimate your monthly usage like so:&lt;/span&gt;&lt;/p&gt;

&lt;pre&gt;&lt;span&gt;  (Number of your service’s events per month)
&lt;/span&gt;&lt;span&gt;× (Number of spans in each service event)
_____________________________________________
=  Honeycomb usage per month&lt;/span&gt;&lt;/pre&gt;

&lt;p&gt;&lt;span&gt;So how do you figure out the number of spans for each of your service’s events?&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;span&gt;A useful guideline: one span gets generated from each method call, down to a certain level of granularity. By “granularity,” we mean how deep you go down the call stack. Sometimes you care about the context in methods being called by other methods, and sometimes you don’t. &lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;span&gt;For example, you probably care more about what kinds of database queries your controller is making, and less about what arguments went into &lt;code&gt;Math.sum()&lt;/code&gt;. (Don’t let me tell you what’s important, though! You know your code better than I do.)&lt;/span&gt;&lt;/p&gt;
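&lt;p&gt;&lt;span&gt;To make that counting concrete, here’s a toy stand-in in Ruby (this is not the Beeline API, and the method names are made up) showing how nested calls add up to a span count per request:&lt;/span&gt;&lt;/p&gt;

```ruby
# Toy span counter, just to illustrate "one span per method call, down to a
# chosen granularity." Honeycomb's Beelines do the real version of this.
$span_count = 0

def with_span(name)
  $span_count += 1 # each wrapped call is one span, i.e., one Honeycomb event
  yield
end

# A hypothetical request handler: one root span, plus one span per step.
def handle_request
  with_span("http_request") do
    with_span("authenticate") {}
    3.times { with_span("db_query") {} }
    with_span("render") {}
  end
end

handle_request
$span_count # => 6 spans (events) for this one request
```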

&lt;p&gt;&lt;a href="https://www.honeycomb.io/wp-content/uploads/2020/05/sql.png"&gt;&lt;img class="aligncenter size-full wp-image-6589" src="https://res.cloudinary.com/practicaldev/image/fetch/s--okcsUzX4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/05/sql.png" alt="" width="1600" height="94"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;span&gt;Still, the number of spans generated from one of your service’s events depends on the kind of event it is. In this example, an HTTP request using the &lt;a href="https://docs.honeycomb.io/getting-data-in/ruby/beeline/"&gt;ruby-beeline&lt;/a&gt; Rails integration generated 18 spans. If you’re calling out to another service like Redis or S3, that will generate more spans.&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.honeycomb.io/wp-content/uploads/2020/05/collapsed.png"&gt;&lt;img class="aligncenter size-full wp-image-6588" src="https://res.cloudinary.com/practicaldev/image/fetch/s--hfmZ3kDn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/05/collapsed.png" alt="" width="1600" height="93"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here's the same HTTP request, fully expanded:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.honeycomb.io/wp-content/uploads/2020/05/expanded.png"&gt;&lt;img class="aligncenter size-full wp-image-6586" src="https://res.cloudinary.com/practicaldev/image/fetch/s--s9K0jlVw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/05/expanded.png" alt="" width="1600" height="862"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;&lt;b&gt;What to expect&lt;/b&gt;&lt;/h2&gt;

&lt;p&gt;&lt;span&gt;There's no magic number for the “proper” level of granularity in tracing. As you ramp up your use of Honeycomb, you'll make discoveries that'll guide how to further instrument your code. New Honeycomb users often discover long-hidden bugs and inefficiencies when they first instrument for tracing. Aim for higher granularity early on with the goal of learning, finding these inefficiencies, and nipping them in the bud.&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;span&gt;So when you see a high-granularity trace with many spans, ask yourself, “Is this trace valuable?” It’s entirely in the eye of the beholder!&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.honeycomb.io/wp-content/uploads/2020/05/mysql_trace.png"&gt;&lt;img class="aligncenter size-full wp-image-6587" src="https://res.cloudinary.com/practicaldev/image/fetch/s--ZGsXlEzV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/05/mysql_trace.png" alt="" width="1276" height="960"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;span&gt;After this initial period of discovery, you'll gain familiarity with what a normal trace looks like for various parts of your service. Going forward, you'll be much more interested in the abnormal and better able to tune your instrumentation.&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;span&gt;Let’s look back at our estimation formula from up above to figure out exactly what usage to expect:&lt;/span&gt;&lt;/p&gt;

&lt;pre&gt;&lt;span&gt;  (Number of your service’s events per month) &lt;/span&gt;
&lt;span&gt;× (Number of spans in each service event) &lt;/span&gt;
&lt;span&gt;_____________________________________________ 
=  Honeycomb usage per month&lt;/span&gt;&lt;/pre&gt;

&lt;p&gt;&lt;span&gt;Plugging in some numbers, let’s say your service gets around 1 million requests per day, or up to 30 million requests per month. If each request sends ~20 spans, then you’re looking at 600 million Honeycomb events per month. If each request sends ~50 spans, you’ll be sending 1.5 billion Honeycomb events per month.&lt;/span&gt;&lt;/p&gt;
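&lt;p&gt;&lt;span&gt;If you want to sanity-check that arithmetic yourself, it’s a few lines of Ruby (the request and span counts are just the example numbers above):&lt;/span&gt;&lt;/p&gt;

```ruby
# Events per month = (service events per month) x (spans per service event).
requests_per_month = 1_000_000 * 30 # ~1M requests/day

[20, 50].each do |spans_per_request|
  events = requests_per_month * spans_per_request
  puts "#{spans_per_request} spans/request: #{events} Honeycomb events/month"
end
# Prints 600000000 for 20 spans/request and 1500000000 for 50.
```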

&lt;p&gt;&lt;span&gt;Rather than worrying about storage size and retention periods, in this event-based world you’d be able to quickly ascertain what your usage needs are. In future posts, we’ll cover more about what that means going forward and how to even further optimize your usage with techniques like dynamic sampling.&lt;/span&gt;&lt;/p&gt;




&lt;p&gt;Have questions on how to get started? Want help estimating your event volume? Reach out to our team at &lt;a href="mailto:info@honeycomb.io"&gt;info@honeycomb.io&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Not using Honeycomb? &lt;a href="https://ui.honeycomb.io/signup?&amp;amp;utm_source=Devto&amp;amp;utm_Devto=blog&amp;amp;utm_campaign=referral&amp;amp;utm_keyword=%7Bkeyword%7D&amp;amp;utm_content=lets-talk-events"&gt;Get started today, for free&lt;/a&gt;!&lt;/p&gt;

</description>
      <category>tracing</category>
      <category>observability</category>
      <category>span</category>
      <category>logging</category>
    </item>
    <item>
      <title>Using Honeycomb to Investigate a Redis Connection Leak</title>
      <dc:creator>honeycomb</dc:creator>
      <pubDate>Thu, 05 Nov 2020 18:06:36 +0000</pubDate>
      <link>https://dev.to/honeycombio/using-honeycomb-to-investigate-a-redis-connection-leak-1bj3</link>
      <guid>https://dev.to/honeycombio/using-honeycomb-to-investigate-a-redis-connection-leak-1bj3</guid>
      <description>&lt;p&gt;This is a guest post by &lt;a href="https://github.com/ajvondrak" rel="noopener noreferrer"&gt;Alex Vondrak&lt;/a&gt;, Senior Platform Engineer at &lt;a href="https://www.truex.com/" rel="noopener noreferrer"&gt;true[X]&lt;/a&gt;.&lt;/p&gt;


&lt;div class="ltag__user ltag__user__id__308469"&gt;
    &lt;a href="/ajvondrak" class="ltag__user__link profile-image-link"&gt;
      &lt;div class="ltag__user__pic"&gt;
        &lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F308469%2F9ad5a268-8d44-46f6-8bdd-509c5d24698d.jpeg" alt="ajvondrak image"&gt;
      &lt;/div&gt;
    &lt;/a&gt;
  &lt;div class="ltag__user__content"&gt;
    &lt;h2&gt;
&lt;a class="ltag__user__link" href="/ajvondrak"&gt;Alex Vondrak&lt;/a&gt;Follow
&lt;/h2&gt;
    &lt;div class="ltag__user__summary"&gt;
      &lt;a class="ltag__user__link" href="/ajvondrak"&gt;Backend engineer, programming language theory enthusiast, armchair algebraist.&lt;/a&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;





&lt;p&gt;This is the story of how I used Honeycomb to troubleshoot &lt;a href="https://github.com/redis/redis-rb/issues/924" rel="noopener noreferrer"&gt;redis/redis-rb#924&lt;/a&gt; and discovered a surprising workaround.&lt;/p&gt;

&lt;h2&gt;The Problem&lt;/h2&gt;

&lt;p&gt;Emergent failures aren't fun. I've long since learned to dread those subtle issues that occur at the intersection of old code bases, eclectic dependencies, unreliable networking, third party libraries &amp;amp; services, the whims of production traffic, and so on.&lt;/p&gt;

&lt;p&gt;That's why, when &lt;a href="https://redislabs.com/" rel="noopener noreferrer"&gt;RedisLabs&lt;/a&gt; emailed us early in December 2019 about our clients spiking up to ~70,000 connections on a new Redis database, I really didn't want to deal with it. Our application was deployed across ~10 machines with 8 processes apiece - so you'd think &lt;em&gt;maybe&lt;/em&gt; around 100 connections tops, if everything was going smoothly. Yet there the stale connections were.&lt;/p&gt;

&lt;p&gt;Of course, I already knew where the problem came from. The new database was for a workaround we'd just deployed. It was a bit of over-engineering for a very specific use case, and it relied on blocking operations to force parallel HTTP requests into some serial order. The logic is replicated in &lt;a href="https://gist.github.com/ajvondrak/958962a0600cf847465a059210f266e7" rel="noopener noreferrer"&gt;this GitHub gist&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The important thing to note is that we relied on &lt;a href="https://redis.io/commands/blpop" rel="noopener noreferrer"&gt;&lt;code&gt;BLPOP&lt;/code&gt;&lt;/a&gt; to populate most of our responses. This Redis command blocks the client for a given number of seconds, waiting for data to appear on a list for it to pop off. If no data arrives in the given timeout, it returns nil.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;b&gt;Elapsed Time (sec)&lt;/b&gt;&lt;/td&gt;
&lt;td&gt;&lt;b&gt;Client 1&lt;/b&gt;&lt;/td&gt;
&lt;td&gt;&lt;b&gt;Client 2&lt;/b&gt;&lt;/td&gt;
&lt;td&gt;&lt;b&gt;Client 3&lt;/b&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;b&gt;0&lt;/b&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;BLPOP list 2&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;BLPOP list 4&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;b&gt;1&lt;/b&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;(waiting)&lt;/td&gt;
&lt;td&gt;(waiting)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;b&gt;2&lt;/b&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;RPUSH list x y z&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;nil&lt;/code&gt; (timed out)&lt;/td&gt;
&lt;td&gt;(waiting)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;b&gt;3&lt;/b&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;x&lt;/code&gt; (didn’t time out)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
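&lt;p&gt;The timeline above can be simulated in plain Ruby with a thread-safe queue (a toy in-memory stand-in for the real Redis command, with the table's timings scaled down by a factor of 10):&lt;/p&gt;

```ruby
# Toy model of BLPOP semantics using Ruby's Queue (this is not the Redis
# protocol; it just reproduces the block-until-data-or-timeout behavior).
def blpop(queue, timeout)
  deadline = Time.now + timeout
  loop do
    return queue.pop(true) # non-blocking pop; succeeds once data arrives
  rescue ThreadError       # queue is still empty
    return nil if Time.now >= deadline
    sleep 0.01
  end
end

list = Queue.new
client2 = Thread.new { blpop(list, 0.2) } # will time out before the push
client3 = Thread.new { blpop(list, 0.4) } # still waiting when data arrives

sleep 0.3 # "client 1" pushes at t = 0.3
%w[x y z].each { |v| list.push(v) }

client2.value # => nil
client3.value # => "x"
```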

&lt;p&gt;We were under stringent timeouts ourselves, so we'd just wait 1 second for this data. Our customers would time out long before Redis did anyway.&lt;/p&gt;

&lt;p&gt;To add to the stress, this legacy code base was using relatively old/unmaintained software: &lt;a href="https://github.com/postrank-labs/goliath" rel="noopener noreferrer"&gt;Goliath&lt;/a&gt;, &lt;a href="https://github.com/igrigorik/em-synchrony" rel="noopener noreferrer"&gt;em-synchrony&lt;/a&gt;, the &lt;a href="https://github.com/redis/redis-rb/issues/915" rel="noopener noreferrer"&gt;recently-deprecated&lt;/a&gt; Redis synchrony driver, and other such things built atop the often-confusing &lt;a href="https://github.com/eventmachine/eventmachine" rel="noopener noreferrer"&gt;EventMachine&lt;/a&gt;. Suffice it to say, we've had our issues in the past.&lt;/p&gt;

&lt;p&gt;RedisLabs instituted a connection cap to stop paging their ops people. With the intervening holidays (and just slow customer ramp-up on the new feature), it wouldn't be until February 2020 that we realized no requests were succeeding because RedisLabs constantly sat at max capacity - no one was able to connect.&lt;/p&gt;

&lt;h2&gt;Searching in the Dark&lt;/h2&gt;

&lt;p&gt;What could I do? The Ruby code was as straightforward as it could be. I used the &lt;a href="https://github.com/redis/redis-rb" rel="noopener noreferrer"&gt;redis gem&lt;/a&gt; to initialize a single &lt;code&gt;Redis&lt;/code&gt; instance shared across each process. Nothing but a simple call to the &lt;code&gt;Redis#blpop&lt;/code&gt; method:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;blpop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;timeout: &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Was the problem in my application? Redis? EventMachine? I spent weeks trying to read each library's code line-by-line, wishing I could just squint hard enough to understand where it could go wrong. I went through several Hail Mary attempts:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;I tried sending a &lt;a href="https://redis.io/commands/quit" rel="noopener noreferrer"&gt;&lt;code&gt;QUIT&lt;/code&gt;&lt;/a&gt; command after every request. Nothing.&lt;/li&gt;
&lt;li&gt;I investigated setting a &lt;a href="https://redis.io/topics/clients#client-timeouts" rel="noopener noreferrer"&gt;client timeout&lt;/a&gt;, but that's &lt;a href="https://support.redislabs.com/hc/en-us/articles/203290717-How-do-I-configure-the-timeout-for-my-Redis-connections-" rel="noopener noreferrer"&gt;not configurable on RedisLabs&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;I tried to monkey-patch underlying EventMachine connections to set &lt;a href="https://www.rubydoc.info/github/eventmachine/eventmachine/EventMachine/Connection#comm_inactivity_timeout-instance_method" rel="noopener noreferrer"&gt;inactivity timeouts&lt;/a&gt;. No dice.
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The only remaining suggestion was to write a script that periodically fed the output of &lt;a href="https://redis.io/commands/client-list" rel="noopener noreferrer"&gt;&lt;code&gt;CLIENT LIST&lt;/code&gt;&lt;/a&gt; into &lt;a href="https://redis.io/commands/client-kill" rel="noopener noreferrer"&gt;&lt;code&gt;CLIENT KILL&lt;/code&gt;&lt;/a&gt;. That was my reality check: &lt;em&gt;surely&lt;/em&gt; that couldn't be an appropriate fix. What was I missing?&lt;/p&gt;

&lt;h2&gt;Enter Honeycomb&lt;/h2&gt;

&lt;p&gt;At the time, we had signed up for Honeycomb, but hadn't really instrumented any of our old systems. I whinged about how much I didn't want to spend time on this decrepit code - exactly because of problems like this. So of course I didn't want to instrument it! But I was getting nowhere with this bug. I didn't know what triggered it, short of production traffic. What's more, it was likely a bug in some library, not my application.&lt;/p&gt;

&lt;p&gt;But hey, this is Ruby: I could just monkey-patch the libraries. It wouldn't be my first time. I even wrote the &lt;a href="https://github.com/honeycombio/beeline-ruby/pull/42" rel="noopener noreferrer"&gt;Redis integration for the Ruby beeline&lt;/a&gt; (full disclosure), and that works by monkey-patching the redis gem.&lt;/p&gt;

&lt;p&gt;So I spun up a canary instance pointing to some frenzied branch of my code where I iteratively added stuff like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="no"&gt;Redis&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="no"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;prepend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="no"&gt;Module&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
&lt;span class="err"&gt;  &lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;connect&lt;/span&gt;
&lt;span class="err"&gt;    &lt;/span&gt;&lt;span class="no"&gt;Honeycomb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start_span&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ss"&gt;name: &lt;/span&gt;&lt;span class="s1"&gt;'Redis::Client#connect'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
&lt;span class="err"&gt;      &lt;/span&gt;&lt;span class="k"&gt;super&lt;/span&gt;
&lt;span class="err"&gt;    &lt;/span&gt;&lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="err"&gt;  &lt;/span&gt;&lt;span class="k"&gt;end&lt;/span&gt;

&lt;span class="err"&gt;  &lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;disconnect&lt;/span&gt;
&lt;span class="err"&gt;    &lt;/span&gt;&lt;span class="no"&gt;Honeycomb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start_span&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ss"&gt;name: &lt;/span&gt;&lt;span class="s1"&gt;'Redis::Client#disconnect'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
&lt;span class="err"&gt;      &lt;/span&gt;&lt;span class="k"&gt;super&lt;/span&gt;
&lt;span class="err"&gt;    &lt;/span&gt;&lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="err"&gt;  &lt;/span&gt;&lt;span class="k"&gt;end&lt;/span&gt;

&lt;span class="err"&gt;  &lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;reconnect&lt;/span&gt;
&lt;span class="err"&gt;    &lt;/span&gt;&lt;span class="no"&gt;Honeycomb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start_span&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ss"&gt;name: &lt;/span&gt;&lt;span class="s1"&gt;'Redis::Client#reconnect'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
&lt;span class="err"&gt;      &lt;/span&gt;&lt;span class="k"&gt;super&lt;/span&gt;
&lt;span class="err"&gt;    &lt;/span&gt;&lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="err"&gt;  &lt;/span&gt;&lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It was ugly, but it did the trick. I could finally see the code in action.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.honeycomb.io/wp-content/uploads/2020/07/redis-connection-leak.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.honeycomb.io%2Fwp-content%2Fuploads%2F2020%2F07%2Fredis-connection-leak.png" alt="trace view showing Redis reconnection attempts"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It still took some flipping back &amp;amp; forth between the trace and the Redis code, but I finally noticed a pattern:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;code&gt;BLPOP&lt;/code&gt; command was being wrapped in a client-side read timeout that &lt;em&gt;added&lt;/em&gt; 5 seconds on top of the 1 second I was trying to block for. I didn't expect or even need that.&lt;/li&gt;
&lt;li&gt;By using that read timeout code, connection errors would also be retried endlessly.&lt;/li&gt;
&lt;li&gt;Due to the read timeout, each connection error took the full 6 seconds to surface.&lt;/li&gt;
&lt;li&gt;Each retry would disconnect &amp;amp; reconnect the client.&lt;/li&gt;
&lt;/ul&gt;
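&lt;p&gt;A back-of-the-envelope model of that failure mode (illustrative Ruby, not the redis gem's internals; the 5-second figure is the gem's default read timeout described above):&lt;/p&gt;

```ruby
# Illustrative model only: the client-side read timeout stacks on top of the
# BLPOP timeout, and every connection error triggers a disconnect plus
# reconnect before the endless retry loop tries again.
BLPOP_TIMEOUT = 1 # second we asked to block for
READ_TIMEOUT  = 5 # default read timeout layered on top

def cost_of_failures(attempts)
  seconds    = attempts * (BLPOP_TIMEOUT + READ_TIMEOUT) # 6s per error
  reconnects = attempts # each retry leaves another fresh connection behind
  [seconds, reconnects]
end

cost_of_failures(3) # => [18, 3]  (18 seconds spent, 3 reconnects)
```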

&lt;p&gt;One reason the connection errors were likely happening was because of the RedisLabs connection cap. They may still have been happening for mysterious reasons, as in &lt;a href="https://github.com/redis/redis-rb/issues/598" rel="noopener noreferrer"&gt;redis/redis-rb#598&lt;/a&gt;. Regardless, I could observe this endless retry loop triggering a disconnect &amp;amp; reconnect - several of them per request. So I thought: if those connections aren't actually being closed, that could be the source of my idle connections.&lt;/p&gt;

&lt;p&gt;After weeks of endless squinting, guessing, and checking, within a &lt;b&gt;day&lt;/b&gt; of using Honeycomb, I had an idea actually backed up by data. Carefully inspecting the redis gem's code, I circumvented its read timeout &amp;amp; retry logic with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight diff"&gt;&lt;code&gt;&lt;span class="gd"&gt;-redis.blpop(key, timeout: 1)
&lt;/span&gt;&lt;span class="gi"&gt;+redis.call([:blpop, key, 1])
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once it was deployed to the whole cluster, our connection count plummeted, and the feature finally worked as intended.&lt;/p&gt;

&lt;h2&gt;Takeaways&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Think about what you're actually trying to do. I could've written a &lt;code&gt;CLIENT KILL&lt;/code&gt; cron job, but that's trying to &lt;em&gt;reap&lt;/em&gt; idle connections, not stop them from being generated in the first place. When you don't have enough information about the problem you're trying to solve, the answer probably isn't to fix some other symptom.&lt;/li&gt;
&lt;li&gt;When the only thing you can do to reproduce an error is to launch production traffic at it, live tracing is the only way to see exactly the details you need.&lt;/li&gt;
&lt;li&gt;The code bases you hate the most are usually the ones that need the most instrumentation. But a little instrumentation still goes a long way.&lt;/li&gt;
&lt;li&gt;Know when to stop. I don't know why this bug happened just the way it did. But I figured out enough to fix the problem for our customers. It's a reasonably precise workaround, and further investigation can go from there.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;Download the &lt;a href="https://www.honeycomb.io/resources/guide-observability-for-developers/" rel="noopener noreferrer"&gt;Observability for Developers white paper&lt;/a&gt; to learn more about how observability can help you debug emergent failure modes in production.&lt;/p&gt;

&lt;p&gt;Experience what Honeycomb can do for your business. Check out &lt;a href="https://www.honeycomb.io/get-a-demo" rel="noopener noreferrer"&gt;our short demo&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>redis</category>
      <category>debugging</category>
      <category>observability</category>
      <category>honeycomb</category>
    </item>
    <item>
      <title>Logs and Traces: Two Houses Unalike in Dignity</title>
      <dc:creator>honeycomb</dc:creator>
      <pubDate>Fri, 23 Oct 2020 21:21:50 +0000</pubDate>
      <link>https://dev.to/honeycombio/logs-and-traces-two-houses-unalike-in-dignity-3d84</link>
      <guid>https://dev.to/honeycombio/logs-and-traces-two-houses-unalike-in-dignity-3d84</guid>
      <description>&lt;p&gt;&lt;span&gt;&lt;a href="https://www.imohealth.com/"&gt;Intelligent Medical Objects&lt;/a&gt; (IMO) and its clinical interface terminology form the foundation healthcare enterprises need, including effective management of Electronic Health Record (EMR) problem lists and accurate documentation. Over 4,500 hospitals and 500,000 physicians use IMO products on a daily basis.&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;span&gt;With Honeycomb, the engineering team at IMO was able to find hidden architectural issues that were previously obscured in their logs. By introducing tracing into their .NET application stack in AWS, they were able to generate new insights that unlocked reliability and efficiency gains.&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;span&gt;Michael Ericksen, Sr. Staff Software Engineer at IMO, contributed this guest blog detailing the journey.&lt;/span&gt;&lt;/p&gt;




&lt;p&gt;&lt;span&gt;At IMO, our 2019 engineering roadmap included moving application hosting from multiple data centers into AWS. The primary hosting pattern to migrate was a &lt;/span&gt;&lt;span&gt;.NET&lt;/span&gt;&lt;span&gt; application running on a Windows instance behind a load balancer. In AWS, we use SSM documents written in PowerShell to define the steps to install agents, configure IIS, and perform other machine configuration prior to installing application code. We compose these discrete documents into orchestration documents (more than 25 in total) that execute upon instance launch based on the tags associated with the instance.&lt;/span&gt;&lt;/p&gt;

&lt;h2&gt;&lt;b&gt;Uncovering issues with trace instrumentation&lt;/b&gt;&lt;/h2&gt;

&lt;p&gt;&lt;span&gt;Our production applications have predictable load distribution during core business hours. We don’t see traffic spikes that demand instantaneous scaling. Development teams use scheduled Auto Scaling Group actions in our pre-production accounts to toggle capacity after business hours, generating significantly more activity. After a year of reliable operation under lower load parameters, we started hearing reports from internal users that the provisioning service in these pre-prod accounts was intermittently failing.&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;span&gt;However, we didn’t have an easy toolset and vocabulary for describing our service reliability. We had logs! Unstructured ones. Lots of them.&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;a href="https://www.honeycomb.io/wp-content/uploads/2020/09/default-log-output-high-res-e1601333196170.png"&gt;&lt;img class="size-full wp-image-7774" src="https://res.cloudinary.com/practicaldev/image/fetch/s--hJkAUKEN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/09/default-log-output-high-res-e1601333196170.png" alt="Default log output generated during SSM provisioning"&gt;&lt;/a&gt; Default log output generated during SSM provisioning&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;
&lt;span&gt;Debugging issues for end-users meant searching through the copious logs generated by our processes. A dashboard aggregated the number of errors that appeared in the log files. That visibility helped us stabilize the service, but it didn’t provide critical insights to improve service operation for users.&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;span&gt;I had used the &lt;/span&gt;&lt;a href="https://github.com/honeycombio/agentless-integrations-for-aws/blob/main/terraform/cloudwatch-logs-json.tf"&gt;&lt;span&gt;Honeycomb agentless CloudWatch integration&lt;/span&gt;&lt;/a&gt;&lt;span&gt; to ingest structured logs from Lambda functions. I was optimistic I could adopt a similar pattern to provide visibility into our provisioning process. The germ of the idea was simple:&lt;/span&gt;&lt;/p&gt;

&lt;ol&gt;
    &lt;li&gt;
&lt;span&gt;Write structured logs to &lt;/span&gt;&lt;span&gt;stdout&lt;/span&gt;
&lt;/li&gt;
    &lt;li&gt;&lt;span&gt;Ingest them via CloudWatch&lt;/span&gt;&lt;/li&gt;
    &lt;li&gt;&lt;span&gt;Forward them to Honeycomb using the agentless CloudWatch integration&lt;/span&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;span&gt;One key technical difference between a Lambda function and SSM documents is that functions hold context while dispatching the next operation. For SSM documents, we pushed in-process spans to a stack saved to the local file system. As spans completed, we popped them from the stack and wrote them as structured JSON logs to CloudWatch. Since our SSM documents execute sequentially, there are minimal concerns about concurrency.&lt;/span&gt;&lt;/p&gt;
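&lt;p&gt;&lt;span&gt;As a sketch of that bookkeeping (hypothetical Ruby rather than the team’s PowerShell; the field names are assumptions, not Honeycomb’s exact schema):&lt;/span&gt;&lt;/p&gt;

```ruby
require "json"
require "securerandom"

# Minimal span stack: push a span when a step starts, pop it and emit one
# structured JSON log line when the step finishes. A real implementation
# would persist the stack to disk between SSM document executions.
$stack = []

def start_span(name)
  parent = $stack.last
  $stack.push({ "name"            => name,
                "trace.span_id"   => SecureRandom.hex(8),
                "trace.parent_id" => parent ? parent["trace.span_id"] : nil,
                "started_at"      => Time.now })
end

def finish_span
  span = $stack.pop
  elapsed = Time.now - span.delete("started_at")
  span["duration_ms"] = (elapsed * 1000).round(2)
  puts JSON.generate(span) # one completed span per line, ready for CloudWatch
end

start_span("install-agents")
start_span("download-package") # nested step: parented to install-agents
finish_span                    # the child pops (and logs) first
finish_span                    # then its parent
```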

&lt;p&gt;&lt;span&gt;For more technical implementation details, the source code for the initial version is on &lt;/span&gt;&lt;a href="https://github.com/ericksoen/aws-systems-manager-observability-demo/blob/master/terraform/ec2_instance/functions.ps1"&gt;&lt;span&gt;GitHub&lt;/span&gt;&lt;/a&gt;&lt;span&gt;. &lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.honeycomb.io/wp-content/uploads/2020/09/simple-tracing-cloudwatch-stdout-high-res-v2.png"&gt;&lt;img class="wp-image-7772 size-large" src="https://res.cloudinary.com/practicaldev/image/fetch/s--xUARmQDk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/09/simple-tracing-cloudwatch-stdout-high-res-v2-1024x154.png" alt="CloudWatch agent latency"&gt;&lt;/a&gt; &lt;em&gt;CloudWatch agent latency&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;span&gt;The first time I saw the trace from this implementation I was certain there was an instrumentation bug: &lt;/span&gt;&lt;i&gt;&lt;span&gt;Why would an SSM document composed of two steps that each write a single line to &lt;/span&gt;&lt;/i&gt;&lt;i&gt;&lt;span&gt;stdout&lt;/span&gt;&lt;/i&gt;&lt;i&gt;&lt;span&gt; take more than 25 seconds to complete?!?&lt;/span&gt;&lt;/i&gt;&lt;/p&gt;

&lt;p&gt;&lt;span&gt;The time spent on actual provisioning work is only a small fraction of the total execution time. That impact is magnified by the number of SSM documents (typically 15-25) that execute during our primary provisioning processes. With trace instrumentation, that behavior was immediately visible.&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.honeycomb.io/wp-content/uploads/2020/09/default-log-output-highlighted-high-res.png"&gt;&lt;img class="size-full wp-image-7771" src="https://res.cloudinary.com/practicaldev/image/fetch/s--cE_dkGx6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/09/default-log-output-highlighted-high-res.png" alt="Default log output with highlighting"&gt;&lt;/a&gt; &lt;em&gt;Default log output with highlighting&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;
&lt;b&gt;We’d overlooked past performance issues numerous times while debugging other problems because they were obscured by the log structure.&lt;/b&gt;&lt;span&gt; Similarly, the log structure precluded the type of open-ended curiosity we associate with observability:&lt;/span&gt;&lt;/p&gt;

&lt;ul&gt;
    &lt;li&gt;&lt;span&gt;What provisioning steps execute the most frequently?&lt;/span&gt;&lt;/li&gt;
    &lt;li&gt;&lt;span&gt;What provisioning steps fail most frequently?&lt;/span&gt;&lt;/li&gt;
    &lt;li&gt;&lt;span&gt;Does introducing a new agent version preserve the stability of the system or does it destabilize it?&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;&lt;b&gt;Debugging idle time in our provisioning service&lt;/b&gt;&lt;/h2&gt;

&lt;p&gt;&lt;span&gt;As we debugged potential causes of the issue (shoutout to Narendra from AWS technical support), one detail became clear: &lt;/span&gt;&lt;b&gt;we had two performance issues with a similar signature instead of the one we imagined&lt;/b&gt;&lt;span&gt;.&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;span&gt;The first issue presented when writing logs to &lt;/span&gt;&lt;span&gt;stdout&lt;/span&gt;&lt;span&gt; and sending them directly to CloudWatch. This was the eyebrow-raising performance issue detailed above.&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;span&gt;However, our production provisioning service uses a different strategy to ingest logs into CloudWatch. For this second strategy, there was no detectable performance penalty. Yet all of our telemetry continued to show 10-20 second gaps of time between steps. We estimated three to five minutes of idle time during end-to-end execution. Curious.&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.honeycomb.io/wp-content/uploads/2020/09/tracing-cloudwatch-agent-high-res-v2.png"&gt;&lt;img class="size-full wp-image-7773" src="https://res.cloudinary.com/practicaldev/image/fetch/s--cBWqv4WR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/09/tracing-cloudwatch-agent-high-res-v2.png" alt="IMO end-to-end provisioning"&gt;&lt;/a&gt; &lt;em&gt;IMO end-to-end provisioning&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;span&gt;We slowly unveiled a &lt;/span&gt;&lt;i&gt;&lt;span&gt;subtle&lt;/span&gt;&lt;/i&gt;&lt;span&gt; issue in the PowerShell code that executes during our provisioning process. At various points, we use functions from &lt;/span&gt;&lt;span&gt;AWSPowerShell&lt;/span&gt;&lt;span&gt; modules to retrieve configuration from a remote source, usually SSM Parameter Store. There’s a significant &lt;/span&gt;&lt;a href="https://awsinsider.net/articles/2019/08/14/powershell-revamp.aspx"&gt;&lt;span&gt;performance penalty&lt;/span&gt;&lt;/a&gt;&lt;span&gt; when first importing &lt;/span&gt;&lt;span&gt;AWSPowerShell&lt;/span&gt;&lt;span&gt; modules into your PowerShell session.&lt;/span&gt;&lt;/p&gt;
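&lt;p&gt;&lt;span&gt;That first-import cost is easy to demonstrate in any runtime with a per-session module cache. Here’s a minimal sketch in Python (Python stands in for PowerShell, and a heavyweight stdlib module stands in for AWSPowerShell; the eviction step just guarantees a genuinely cold import):&lt;/span&gt;&lt;/p&gt;

```python
import importlib
import sys
import time

MODULE = "email.mime.multipart"  # heavyweight stdlib module, stand-in for AWSPowerShell

def timed_import(name):
    """Import a module and return the elapsed wall-clock seconds."""
    t0 = time.perf_counter()
    importlib.import_module(name)
    return time.perf_counter() - t0

# Evict the package so the first import pays the full load-from-disk cost,
# just as every fresh PowerShell session re-pays the AWSPowerShell import.
for mod in [m for m in sys.modules if m.startswith("email")]:
    del sys.modules[mod]

cold = timed_import(MODULE)   # full parse-and-load
warm = timed_import(MODULE)   # near-free sys.modules cache hit
print(cold > warm)
```

&lt;p&gt;&lt;span&gt;The gap between the cold and warm timings is exactly the penalty each fresh session pays again.&lt;/span&gt;&lt;/p&gt;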

&lt;p&gt;&lt;span&gt;We also learned that the module import requirements in our strategy &lt;/span&gt;&lt;i&gt;&lt;span&gt;“… are always applied globally, and must be met, before the script can execute.”&lt;/span&gt;&lt;/i&gt;&lt;span&gt; The performance penalty for importing &lt;/span&gt;&lt;span&gt;AWSPowerShell&lt;/span&gt;&lt;span&gt; modules therefore occurs outside of &lt;/span&gt;&lt;i&gt;&lt;span&gt;any&lt;/span&gt;&lt;/i&gt;&lt;span&gt; application telemetry that attempts to measure it.&lt;/span&gt;&lt;/p&gt;
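&lt;p&gt;&lt;span&gt;To see why the penalty is invisible to in-script telemetry, consider a sketch (Python stands in for a fresh PowerShell session per provisioning document; the durations are illustrative): the child process pays its “import” cost before its own timer starts, so its measurement never includes it, while the parent’s wall clock does:&lt;/span&gt;&lt;/p&gt;

```python
import subprocess
import sys
import time

# The child simulates one provisioning document: it pays a startup cost
# (a 0.2 s sleep standing in for the module import) before any
# instrumentation begins, then times only its actual work.
child = """
import time
time.sleep(0.2)                      # stand-in for the AWSPowerShell import
t0 = time.perf_counter()
# ... the instrumented provisioning work would run here ...
print(time.perf_counter() - t0)
"""

t0 = time.perf_counter()
result = subprocess.run([sys.executable, "-c", child],
                        capture_output=True, text=True)
wall = time.perf_counter() - t0       # what the parent (the traces) sees
measured = float(result.stdout)       # what the script's own telemetry sees
print(wall - measured)                # the unexplained "gap between steps"
```

&lt;p&gt;&lt;span&gt;The difference between the two numbers is the kind of between-step gap our traces surfaced.&lt;/span&gt;&lt;/p&gt;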

&lt;p&gt;&lt;span&gt;If you want to experiment with this behavior, we’ve included the steps to reproduce in our &lt;/span&gt;&lt;a href="https://github.com/ericksoen/aws-systems-manager-observability-demo#performance-delay-due-to-awspowershell-modules"&gt;&lt;span&gt;GitHub project&lt;/span&gt;&lt;/a&gt;&lt;span&gt;.&lt;/span&gt;&lt;/p&gt;

&lt;h2&gt;&lt;b&gt;Watching code go live to fix the issue&lt;/b&gt;&lt;/h2&gt;

&lt;p&gt;&lt;span&gt;After identifying our implementation of &lt;/span&gt;&lt;span&gt;AWSPowerShell&lt;/span&gt;&lt;span&gt; as the root of our performance issues, we evaluated an array of resolution options. Caching modules across PowerShell sessions wasn’t technically feasible. Restructuring provisioning documents to execute within a single PowerShell session was a significant time commitment. Rewriting our provisioning service to use &lt;/span&gt;&lt;a href="https://aws.amazon.com/about-aws/whats-new/2019/12/introducing-ec2-image-builder/"&gt;&lt;span&gt;AWS Image Builder&lt;/span&gt;&lt;/a&gt;&lt;span&gt; was the long-term direction we agreed was most appropriate. However, rewrites never happen overnight and change frequently introduces new failures.&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;span&gt;In the interim, we added a new step to install the AWS CLI and modified function calls to &lt;/span&gt;&lt;i&gt;&lt;span&gt;prefer&lt;/span&gt;&lt;/i&gt;&lt;span&gt; the AWS CLI, falling back to &lt;/span&gt;&lt;span&gt;AWSPowerShell&lt;/span&gt;&lt;span&gt; modules only when the CLI wasn’t available.&lt;/span&gt;&lt;/p&gt;
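&lt;p&gt;&lt;span&gt;The prefer-then-fall-back selection can be sketched as follows. This is a hypothetical helper, not our production code (which is PowerShell); the SSM invocations are real AWS CLI and AWSPowerShell commands, but the wiring around them is illustrative:&lt;/span&gt;&lt;/p&gt;

```python
import shutil

def choose_ssm_command(parameter_name, cli_path=None):
    """Build the command to read an SSM parameter, preferring the AWS
    CLI and falling back to AWSPowerShell when the CLI is absent.
    cli_path defaults to whatever `aws` resolves to on PATH."""
    if cli_path is None:
        cli_path = shutil.which("aws")
    if cli_path:
        # Fast path: the AWS CLI has no per-session import penalty.
        return [cli_path, "ssm", "get-parameter",
                "--name", parameter_name,
                "--query", "Parameter.Value", "--output", "text"]
    # Fallback: pay the AWSPowerShell import penalty only when we must.
    return ["powershell", "-Command",
            f"(Get-SSMParameter -Name '{parameter_name}').Value"]
```

&lt;p&gt;&lt;span&gt;Keeping the fallback means an instance without the CLI still provisions correctly; it just pays the old penalty.&lt;/span&gt;&lt;/p&gt;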

&lt;p&gt;&lt;span&gt;Initially, we deployed the change to our lowest application environment and compared the performance delta.&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.honeycomb.io/wp-content/uploads/2020/09/romeo-and-juliet.gif"&gt;&lt;img class="aligncenter wp-image-7776 size-full" src="https://res.cloudinary.com/practicaldev/image/fetch/s--H7gAORli--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/09/romeo-and-juliet-e1601412814521.gif" alt="A young Leonardo DiCaprio in Romeo and Juliet says “But, soft! What light through yonder window breaks?”" width="500" height="200"&gt;&lt;/a&gt; &lt;em&gt;A young Leonardo DiCaprio in Romeo and Juliet says “But, soft! What light through yonder window breaks?”&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;span&gt;One of our key performance metrics is time from instance launch until application code begins installation. Our logs made this difficult to calculate for a single instance and impossible to measure at scale. Our trace data, however, made this simple: &lt;/span&gt;&lt;i&gt;&lt;span&gt;measure what matters to your users&lt;/span&gt;&lt;/i&gt;&lt;span&gt;.&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.honeycomb.io/wp-content/uploads/2020/09/performance-improvement-high-res.png"&gt;&lt;img class="size-full wp-image-7775" src="https://res.cloudinary.com/practicaldev/image/fetch/s--QpAW7IFN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/09/performance-improvement-high-res.png" alt="Graph showing performance improvement between account with CLI change and account without"&gt;&lt;/a&gt; &lt;em&gt;Graph showing performance improvement between account with CLI change and account without&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;span&gt;The graphs show a significant performance increase when using the AWS CLI: more than 2 minutes faster at the 50th percentile and more than 4 minutes faster at the 99th. In fact, the 99th percentile times when using the AWS CLI were almost as fast as the 50th percentile times when not. ¡Qué bueno!&lt;/span&gt;&lt;/p&gt;

&lt;h2&gt;&lt;b&gt;Conclusion&lt;/b&gt;&lt;/h2&gt;

&lt;p&gt;&lt;b&gt;Building observability into the foundation of systems can reveal potential architectural issues long before they reach customers.&lt;/b&gt;&lt;span&gt; Even experienced peer reviewers frequently cannot predict the way multiple complex systems converge and intersect to magnify performance issues: &lt;/span&gt;&lt;b&gt;we were paying performance penalties once per document, not once per end-to-end execution&lt;/b&gt;&lt;span&gt;.&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;span&gt;Although this issue was discoverable in our logs, it remained opaque to us for more than a year until our tracing revealed it. The trace data we generate also provides new insights into how to operate our system more reliably and efficiently.&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.honeycomb.io/wp-content/uploads/2020/09/aggregate-performance-high-res.png"&gt;&lt;img class="size-full wp-image-7770" src="https://res.cloudinary.com/practicaldev/image/fetch/s--a0icf5DC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/09/aggregate-performance-high-res.png" alt="Graph showing execution count and duration percentiles for provisioning steps."&gt;&lt;/a&gt; &lt;em&gt;Graph showing execution count and duration percentiles for provisioning steps.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;span&gt;As seen in the above graph, there’s a significant decline in execution time after our four longest-running steps. Moreover, one of those longest-running steps executes far less frequently than its peers.&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;span&gt;If we applied only the time-gap fix discussed above and then optimized the three most frequently executed, longest-running steps, we could improve performance for one of our key user metrics by more than 9 minutes, or 62% of the total execution time.&lt;/span&gt;&lt;/p&gt;
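&lt;p&gt;&lt;span&gt;A quick sanity check on that arithmetic, using only the numbers from the paragraph above:&lt;/span&gt;&lt;/p&gt;

```python
# A 9-minute saving that represents 62% of total execution implies an
# end-to-end run of roughly 14.5 minutes.
saving_minutes = 9
fraction_of_total = 0.62
total_minutes = saving_minutes / fraction_of_total
print(round(total_minutes, 1))  # → 14.5
```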

&lt;p&gt;&lt;span&gt;As Cindy Sridharan notes in her report on &lt;/span&gt;&lt;a href="https://www.oreilly.com/library/view/distributed-systems-observability/9781492033431/"&gt;&lt;span&gt;Distributed System Observability&lt;/span&gt;&lt;/a&gt;&lt;span&gt;:&lt;/span&gt;&lt;/p&gt;

&lt;blockquote&gt;&lt;span&gt;The goal … is not to collect logs, metrics, or traces. It is to build a culture of engineering based on facts and feedback, and then spread that culture within the broader organization.&lt;/span&gt;&lt;/blockquote&gt;

&lt;p&gt;&lt;span&gt;With remarkable economy of language, a familiar CTO expressed the same sentiment even more succinctly: &lt;/span&gt;&lt;a href="https://twitter.com/krisnova/status/1002508776447102976?s=20"&gt;&lt;span&gt;“I’m so done with logs.”&lt;/span&gt;&lt;/a&gt;&lt;span&gt; We’re so done with logs too, Charity.&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;Join the swarm. Get started with &lt;a href="https://ui.honeycomb.io/signup?&amp;amp;utm_source=Devto&amp;amp;utm_Devto=blog&amp;amp;utm_campaign=referral&amp;amp;utm_keyword=%7Bkeyword%7D&amp;amp;utm_content=logs-and-traces-two-houses-unalike-in-dignity/"&gt;Honeycomb for free&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>tracing</category>
      <category>logging</category>
      <category>aws</category>
      <category>observability</category>
    </item>
  </channel>
</rss>
