<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Honeycomb.io</title>
    <description>The latest articles on DEV Community by Honeycomb.io (@honeycombio).</description>
    <link>https://dev.to/honeycombio</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F2037%2F81a225aa-09c0-48ce-9977-b7308f2119a1.jpg</url>
      <title>DEV Community: Honeycomb.io</title>
      <link>https://dev.to/honeycombio</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/honeycombio"/>
    <language>en</language>
    <item>
      <title>The Future of Developer Careers</title>
      <dc:creator>Christine Yen</dc:creator>
      <pubDate>Wed, 12 May 2021 21:15:26 +0000</pubDate>
      <link>https://dev.to/honeycombio/the-future-of-developer-careers-gp3</link>
      <guid>https://dev.to/honeycombio/the-future-of-developer-careers-gp3</guid>
      <description>&lt;p&gt;While JavaScript frameworks come and go, a change has been brewing over the last several years that will permanently change what it means to be a modern developer: how our code goes from our laptops to the wild. The widespread adoption of containers, microservices and orchestration have made it easier than ever to take a small bit of software and push it live in front of users — and, in doing so, push a whole bunch of comfortable tasks (debugging, profiling) into uncomfortable territory: production.&lt;/p&gt;

&lt;p&gt;I hate to be the bearer of bad news (not really), but the reality for developers is that it’s only getting more complicated to ensure the code you write still works. Assuming operational responsibility for the code you write is becoming a larger and larger part of the developer role — even as “where it runs” gets further and further away from “where it was written.”&lt;/p&gt;

&lt;p&gt;The first wave of DevOps primarily embodied Ops folks learning how to Dev: to “automate everything” through code. The second wave, naturally, is now about Dev folks learning how to Ops: now, we own the running of our code in production. But while the two shifting waves typically come together in cross-functional DevOps teams, “understanding production” has historically carried with it a heavy Ops bias.&lt;/p&gt;

&lt;p&gt;It’s almost ironic, really — recent trends in platform abstractions have turned everything into code. (Hi, serverless! And thanks, Heroku.) What that should have meant is that understanding what’s happening in production would be easier for devs, not harder. Let’s take a look at why that hasn’t been the case, and how it should be instead.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Shoehorning Devs into Ops: What Could Go Wrong?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Leading software engineering organizations are increasingly asking developers to &lt;a href="https://increment.com/on-call/who-owns-on-call/"&gt;own their code in production&lt;/a&gt;. Software engineers are being asked to &lt;a href="https://copyconstruct.medium.com/on-call-b0bd8c5ea4e0"&gt;join on-call rotations&lt;/a&gt;, with varying levels of support.&lt;/p&gt;

&lt;p&gt;And yet, conventional “production monitoring” tools are inherently hostile to how developers think and work with their code. Traditional approaches to understanding “production” are tied to an application’s underlying infrastructure. Graphs of data like CPU utilization, network throughput, or database load are very infrastructure-centric ways to understand the world. Just because the lines continue to blur between dev and ops doesn’t mean we simply transfer over previous mental models of the world — the goal of DevOps isn’t to simply swap out responsibilities. The goal of shifting into DevOps is to get the most out of the varied skills, background and mindsets that comprise these new cross-functional teams.&lt;/p&gt;

&lt;p&gt;Traditional production monitoring tools were written long before the era of DevOps — they speak the language of Ops, not Devs. Unfortunately, that sets up an artificial barrier to entry for developers to think about production as a place they own. We’ve done nothing to help developers see their world as it exists in production. Developers often get handed a dashboard full of Cassandra read/write throughput graphs, thread counts and memtable sizes, as if that somehow inducts them into the club of production ownership.&lt;/p&gt;

&lt;p&gt;Sure, those metrics and graphs look cool — but there’s often no way to connect that information back to the code, business logic or customer needs that are the world of software development. When problems occur, there’s a big mental leap that exists between seeing that information and tying it back to “what really happened.” And even if that leap can somehow be made, there’s certainly no path at all that leads toward reproducing any observed phenomenon, much less writing the code to fix it.&lt;/p&gt;

&lt;p&gt;The cognitive leap that traditional production monitoring tools require developers to make doesn’t get a lot of attention, because that’s simply how things are done for Ops engineers. In some corners of engineering, there’s a smug satisfaction that devs now have to make that leap. Feel our pain, devs! How do you not know that when both of these lines trend down and that graph turns red, it means your application has run out of memory? Welcome to production.&lt;/p&gt;

&lt;p&gt;That cavalier attitude reinforces the hostility reflected by the approach taken by traditional monitoring tools. In practice, that approach inadvertently leads to situations where devs simply follow the breadcrumbs and do their best to replicate production debugging patterns they don’t fully understand. Culturally, it creates a moat between the approaches that Ops values and the approaches that Dev values — and reinforces the illusion that production is a hostile place for developers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Enhance Existing Dev Behaviors&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Instead, a more welcoming approach is to tap into what we Devs do naturally when debugging: allow us to quickly compare our expected outcome against the actual outcome (e.g., this code should handle 10K req/sec, but seems to only handle 100 req/sec). Devs share this part of the investigative journey with their Ops comrades. Where Ops and Dev patterns diverge, however, is in digging into why that deviation occurs.&lt;/p&gt;

&lt;p&gt;For Devs, we compare “expected” against “actual” all the time in test suites. Investigating test failures means digging into the code, walking through the logic, and questioning our assumptions. Being able to capture business logic-level metadata in production (often high cardinality, often across many dimensions) is a baseline requirement for being able to tap into Dev experience for production problems.&lt;/p&gt;

&lt;p&gt;We need a specific, replicable test case. Being able to tap into the specificity of custom attributes like userID, partitionID, etc., is what enables production to feel like an extension of development and test workflows, rather than some foreign and hostile environment.&lt;/p&gt;
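
&lt;p&gt;To make that concrete, here is an illustrative sketch in plain Python (the field names are hypothetical, not Honeycomb&#8217;s actual event schema) of the kind of wide, business-logic-level event that carries those custom attributes:&lt;/p&gt;

```python
# Illustrative sketch (hypothetical field names, not Honeycomb's actual
# schema): a single wide event carrying business-level, high-cardinality
# attributes alongside the usual timing data.
def build_event(user_id, partition_id, build_id, duration_ms):
    return {
        "name": "handle_request",
        "user_id": user_id,            # high cardinality: one value per user
        "partition_id": partition_id,
        "build_id": build_id,          # ties the signal back to a code release
        "duration_ms": duration_ms,
    }

event = build_event("user-8675309", 12, "build-2021.05.12-3", 42.7)
print(sorted(event))  # ['build_id', 'duration_ms', 'name', 'partition_id', 'user_id']
```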

&lt;p&gt;&lt;strong&gt;A Developer Approach to Production&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;With the advent of PaaS, IaaS, and serverless, our world is increasingly abstracting infrastructure away. That’s paved the way for both waves of DevOps and it has made room to redefine priorities. For software development teams that own running their code in prod, that means they’ve shifted toward aligning their definition of successful operation with what ultimately matters to the business — whether the users of that software are having a good customer experience.&lt;/p&gt;

&lt;p&gt;That shift works very well for developers who are accustomed to having functions, endpoints, customer IDs, and other business-level identifiers naturally live in their various tests. Those types of identifiers will only continue to become more critical when investigating and understanding the behavior of production systems. (In contrast, traditional monitoring systems focus on the aggregate behavior of an overall system and almost never include these types of identifiers.)&lt;/p&gt;

&lt;p&gt;All of the questions that developers should ask about production boil down to two basic forms:&lt;/p&gt;

&lt;ol&gt;
    &lt;li&gt;Is my code running in the first place?&lt;/li&gt;
    &lt;li&gt;Is my code behaving as expected in production?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;As a developer in a world with frequent deploys, the first few things I want to know about a production issue are: When did it start happening? Which build is, or was, live? Which code changes were new at that time? And is there anything special about the conditions under which my code is running?&lt;/p&gt;

&lt;p&gt;The ability to correlate some signal to a specific build or code release is table stakes for developers looking to grok production. Not coincidentally, &#8220;build ID&#8221; is precisely the sort of &#8220;unbounded source&#8221; of metadata that traditional monitoring tools warn against including. In metrics-based monitoring systems, doing so commits you to an ever-growing set of captured metrics, degrading the performance of that monitoring system, with the added &#8220;benefit&#8221; of paying your monitoring vendor substantially more for it.&lt;/p&gt;
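
&lt;p&gt;A rough back-of-the-envelope sketch (the numbers here are invented for illustration) shows why an unbounded tag like build ID blows up a metrics-based system: every distinct combination of tag values becomes its own stored time series.&lt;/p&gt;

```python
# Back-of-the-envelope sketch: in a metrics store, each distinct combination
# of tag values becomes its own time series. Numbers are illustrative.
from math import prod

cardinalities = {"endpoint": 50, "status": 5, "build_id": 1000}
series = prod(cardinalities.values())
print(series)  # 250000 series for a single metric, and build_id grows with every deploy
```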

&lt;p&gt;Feature flags — and the combinatorial explosion of possible parameters when multiple live feature flags intersect — throw additional wrenches into answering Question 1. And yet, feature flags are here to stay; so our tooling and techniques simply have to level up to support this more flexibly defined world.&lt;/p&gt;

&lt;p&gt;Question 2, on the other hand, is the same question we ask anytime we run a test suite: “Does my code’s actual execution match what I expect?” The same signals that are useful to us when digging into a failing test case are what help us understand, reproduce and resolve issues identified in production.&lt;/p&gt;

&lt;p&gt;A developer approach to debugging prod means being able to isolate the impact of the code by endpoint, by function, by payload type, by response status, or by any other arbitrary metadata used to define a test case. Developers should be able to take those pieces and understand the real-world workload handled by their systems, and then &lt;a href="https://thenewstack.io/a-next-step-beyond-test-driven-development/"&gt;adjust their code accordingly&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Way Forward: A Developer-Friendly Prod&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The future of Dev careers isn&#8217;t about having different bespoke ways of approaching debugging your production environment. DevOps is about getting the most out of your new cross-functional teams and, luckily &#8212; when it comes to using tools to get answers to the questions you care about in production &#8212; there&#8217;s an opportunity to all get on the same page. Whether your team labels itself Devs, Ops, DevOps, or SRE, you can all use tools that speak the same language.&lt;/p&gt;

&lt;p&gt;In today’s abstracted world — one full of ephemeral instances, momentary containers and serverless functions — classic infrastructure metrics are quickly fading into obsolescence. This is happening so quickly that it even calls into question the &lt;a href="https://thenewstack.io/the-future-of-ops-careers/"&gt;future of ops careers&lt;/a&gt;. A fundamentally better approach to understanding production is necessary — for everyone.&lt;/p&gt;

&lt;p&gt;A good first step is shifting focus away from metrics like CPU and memory and instead embracing &lt;a href="https://www.weave.works/blog/the-red-method-key-metrics-for-microservices-architecture/"&gt;RED metrics&lt;/a&gt; as the primary signal of service health. That can substantially lower the barrier to entry for most developers. By tagging those metrics with customer ID, API endpoint, resource type, customer action, etc., Devs can then be armed with the metadata necessary to understand the impact of any given graph. It bridges the gap between capturing metrics in prod and tying them back to code and tests.&lt;/p&gt;
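
&lt;p&gt;As an illustrative sketch (the record shape and numbers are hypothetical, not any particular vendor&#8217;s API), RED metrics can be computed from tagged request records like so:&lt;/p&gt;

```python
# Illustrative sketch of RED (Rate, Errors, Duration) computed from tagged
# request records; the record shape and numbers are hypothetical.
requests = [
    {"endpoint": "/export", "status": 200, "duration_ms": 12.0},
    {"endpoint": "/export", "status": 500, "duration_ms": 240.0},
    {"endpoint": "/export", "status": 200, "duration_ms": 18.0},
    {"endpoint": "/export", "status": 200, "duration_ms": 15.0},
]

window_seconds = 60
rate = len(requests) / window_seconds                    # Rate: requests per second
errors = sum(1 for r in requests if r["status"] >= 500)  # Errors: failed requests
durations = sorted(r["duration_ms"] for r in requests)
duration_p50 = durations[len(durations) // 2]            # Duration: median latency
print(rate, errors, duration_p50)
```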

&lt;p&gt;Going a step further is why observability has seen an explosion in popularity. Observability is &lt;a href="https://www.honeycomb.io/blog/observability-whats-in-a-name/"&gt;not a synonym for monitoring&lt;/a&gt;. Observability takes an event-based approach that still allows you to &lt;a href="https://www.honeycomb.io/blog/getting-metrics-into-honeycomb/"&gt;incorporate infrastructure metrics&lt;/a&gt; to understand the behavior of your production systems. It&#8217;s an entirely different approach to the Ops-centric world of monitoring, one that makes understanding the behavior of production systems accessible to engineers from all backgrounds.&lt;/p&gt;

&lt;p&gt;The future of dev careers shouldn&#8217;t be defined by struggling to understand the correlations between traditional monitoring tools and how they tie back to your code. By breaking away from traditional monitoring tools, the future of dev careers instead becomes one where understanding what&#8217;s happening in prod feels every bit as natural as understanding why code failed in your development or test environments.&lt;/p&gt;

&lt;p&gt;Over the last decade and change, as an industry, we’ve all gotten really good at taking code and shipping it to the user. That was Heroku’s promise, after all: simply and magically hooking a production environment up to a developer’s natural workflow. And because of this — because of how much closer we’ve brought production to the development environment — the developer skill set has to follow the same trajectory… or risk being left behind.&lt;/p&gt;

&lt;p&gt;Learn more about making production more approachable to devs and to ops in &lt;a href="https://www.honeycomb.io/developing-a-culture-of-observability/?&amp;amp;utm_source=tns&amp;amp;utm_medium=blog-links&amp;amp;utm_campaign=referral&amp;amp;utm_content=developing-a-culture-of-observability-tns"&gt;Honeycomb’s Guide to Developing a Culture of Observability&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Want to know how Honeycomb.io helps software developers? &lt;a href="https://www.honeycomb.io/get-a-demo?&amp;amp;utm_source=Devto&amp;amp;utm_Devto=blog&amp;amp;utm_campaign=referral&amp;amp;utm_keyword=%7Bkeyword%7D&amp;amp;utm_content=the-future-of-developer-careers"&gt;Schedule 1:1 time&lt;/a&gt; with our technical experts.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>A Culture of Observability Helps Engineers Hit the Spot (Instance)</title>
      <dc:creator>honeycomb</dc:creator>
      <pubDate>Mon, 11 Jan 2021 17:47:25 +0000</pubDate>
      <link>https://dev.to/honeycombio/a-culture-of-observability-helps-engineers-hit-the-spot-instance-16nf</link>
      <guid>https://dev.to/honeycombio/a-culture-of-observability-helps-engineers-hit-the-spot-instance-16nf</guid>
      <description>&lt;p&gt;At Honeycomb, we’re big fans of AWS Spot Instances. During a &lt;a href="https://www.honeycomb.io/blog/treading-in-haunted-graveyards/"&gt;recent bill reduction exercise&lt;/a&gt;, we found significant savings in running our API service on Spot, and now look to use it wherever we can. Not all services fit the mold for Spot, though. While a lot of us are comfortable running services atop on-demand EC2 instances by now, with hosts that could be terminated or fail at any time, this is relatively rare when compared with using Spot, where sudden swings in the Spot market can result in your instances going away with only two minutes of warning. When we considered running a brand-new, stateful, and time-sensitive service on Spot, it seemed like a non-starter, but the potential upside was too good to not give it a try.&lt;/p&gt;

&lt;p&gt;Our strong culture of observability and the environment it enables means we have the ability to experiment safely. With a bit of engineering thoughtfulness, modern deployment processes, and good instrumentation we were able to make the switch with confidence. Here's how we did it:&lt;/p&gt;

&lt;h2&gt;What we needed to support&lt;/h2&gt;

&lt;p&gt;We're working on a feature that enables you to do trace-aware sampling with just a few clicks in our UI. The backend service (internal name Dalmatian) is responsible for buffering trace events and making sampling decisions on the complete trace. To work correctly, it must:&lt;/p&gt;

&lt;ul&gt;
    &lt;li&gt;process a continuous, unordered stream of events from the API as quickly as possible&lt;/li&gt;
    &lt;li&gt;group and buffer related events long enough to make a decision&lt;/li&gt;
    &lt;li&gt;avoid re-processing the same traces, and make consistent decisions for the same trace ID for spans that arrive late&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To accomplish grouping, we use a deterministic mapping function to route trace IDs to the same Kafka partition, with each partition getting processed by one Dalmatian host. To keep track of traces we’ve already processed, we also need this pesky little thing called state. Specifically:&lt;/p&gt;
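
&lt;p&gt;The routing idea can be sketched as follows (illustrative Python, not Dalmatian&#8217;s actual code); the key property is that the hash is stable across processes, so late-arriving spans for a trace always land on the same partition:&lt;/p&gt;

```python
# Illustrative sketch of the deterministic mapping described above (not
# Dalmatian's actual function). A stable hash such as SHA-1 is used instead
# of Python's built-in hash(), which is randomized per process.
import hashlib

def partition_for(trace_id, num_partitions):
    digest = hashlib.sha1(trace_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions

# Every span of a trace, even one that arrives late, maps to the same partition.
p1 = partition_for("trace-abc123", 16)
p2 = partition_for("trace-abc123", 16)
print(p1 == p2)  # True
```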

&lt;ul&gt;
    &lt;li&gt;the current offset in the event stream,&lt;/li&gt;
    &lt;li&gt;the offset of the oldest seen trace not yet processed,&lt;/li&gt;
    &lt;li&gt;recent history of processed trace IDs and their sampling decision.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This state is persisted to a combination of Redis and S3, so Dalmatian doesn’t need to store anything locally between restarts. Instead, state is regularly flushed to external storage. That’s easy. Startup is more complex, however, as there are some precise steps that need to happen to resume event processing correctly. There’s also a lot of state fetching to do, which adds time and can fail. In short, once the service is running, it’s better to not rock the boat and just let it run. Introducing additional chaos with hosts that come and go more frequently was something to consider before running this service on Spot.&lt;/p&gt;

&lt;h2&gt;Chaos by default&lt;/h2&gt;

&lt;p&gt;As part of the observability-centric culture we strive for at Honeycomb, we &lt;a href="https://www.honeycomb.io/blog/working-on-hitting-a-release-cadence-ci-cd-observability-can-help-you-get-there/"&gt;embrace CI/CD&lt;/a&gt;. Deploys go out every hour. Among the many benefits of this approach is that our deploy system is regularly exercised, and our services undergo restarts all the time. Building a stateful service with a complex startup and shutdown sequence in this environment means that bugs in those sequences are going to show themselves very quickly. By the time we considered running Dalmatian on Spot (cue the puns), we&#8217;d already made our state management stable through restarts, and most of the problems Spot might have introduced were already handled.&lt;/p&gt;

&lt;h2&gt;Hot spare Spots&lt;/h2&gt;

&lt;p&gt;There was one lingering issue with using Spot though: having hosts regularly go away means we need to wait on new hosts to come up. Between ASG response times, Chef, and our deploy process, it averages 10 minutes for a new host to come online. It&#8217;s something we hope to improve one day, but we&#8217;re not there yet. With one processing host per partition, losing a host can result in at least a 10-minute delay in event processing. That&#8217;s a bad experience for our customers, and one we&#8217;d like to avoid. Fortunately, Spot Instances are cheap, and since we&#8217;re averaging a 70% savings on instance costs, we can afford to run extra hosts in standby mode. This is accomplished with a bit of extra Terraform code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;min_size&lt;/span&gt;                  &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;dalmatian_instance_count&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;size&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="err"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;ceil&lt;/span&gt;&lt;span class="err"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;dalmatian_instance_count&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;size&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="err"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="err"&gt;)&lt;/span&gt;
&lt;span class="nx"&gt;max_size&lt;/span&gt;                  &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;dalmatian_instance_count&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;size&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="err"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;ceil&lt;/span&gt;&lt;span class="err"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;dalmatian_instance_count&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;size&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="err"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="err"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
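
&lt;p&gt;Restated in plain terms (the counts here are illustrative), the expression above reserves roughly one hot spare for every five partition-owning hosts:&lt;/p&gt;

```python
# Restating the Terraform expression above in Python: the group runs the
# primary (one-per-partition) hosts plus ceil(primaries / 5) hot spares.
# Counts are illustrative.
from math import ceil

def fleet_size(primary_count):
    spares = ceil(primary_count / 5)  # roughly one standby per five primaries
    return primary_count + spares

print(fleet_size(20))  # 24 hosts: 20 partition processors plus 4 hot spares
```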



&lt;p&gt;With a small modification in our process startup code, instances will wait for a partition to become free and immediately pick up that partition’s workload.&lt;/p&gt;

&lt;h2&gt;Spot how we're doing&lt;/h2&gt;

&lt;p&gt;We’ve made a significant change to how we run this service. How do we know if things are working as intended? Well, let’s define &lt;b&gt;working&lt;/b&gt;, aka our &lt;a href="https://www.honeycomb.io/blog/working-toward-service-level-objectives-slos-part-1/"&gt;Service Level Objective&lt;/a&gt;. We think that significant ingestion delays break our core product experience, so for Dalmatian and the new feature, we set an SLO that events should be processed and delivered in under five minutes, 99.9% of the time.&lt;/p&gt;
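
&lt;p&gt;To make the check concrete, here is an illustrative sketch (the latency numbers are invented sample data) of computing compliance against that five-minute, 99.9% target:&lt;/p&gt;

```python
# Illustrative check of the SLO described above: 99.9% of events processed
# in under five minutes. Latencies (in seconds) are invented sample data.
from bisect import bisect_left

latencies = sorted([4.0] * 9990 + [600.0] * 10)  # 10 slow events out of 10,000
budget_seconds = 5 * 60
within = bisect_left(latencies, budget_seconds)  # events faster than the budget
compliance = within / len(latencies)
print(compliance)  # 0.999, right at the 99.9% target
```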

&lt;p&gt;With the SLO for context, we can restate our question as: Can we move this new functionality onto Spot Fleets and still maintain our Service Level Objective despite the extra host churn?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--KHt38xKR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2019/10/replacing-hosts.png" class="article-body-image-wrapper"&gt;&lt;img class="aligncenter size-full wp-image-5251" src="https://res.cloudinary.com/practicaldev/image/fetch/s--KHt38xKR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2019/10/replacing-hosts.png" alt="screenshot of heatmap with trailing edges showing added hosts" width="767" height="370"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the last month in our production environment, we have observed at least 10 host replacements. That&#8217;s higher churn than we&#8217;ve had in the past, but it&#8217;s also what we expected with Spot.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ENmsXp5t--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2019/10/slo-compliance.png" class="article-body-image-wrapper"&gt;&lt;img class="aligncenter size-full wp-image-5252" src="https://res.cloudinary.com/practicaldev/image/fetch/s--ENmsXp5t--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2019/10/slo-compliance.png" alt="screenshot showing 100% SLO compliance graph" width="683" height="292"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In that same month, we stayed above our threshold for SLO compliance (99.9%). From this perspective, it looks like a successful change - one we’re excited about because it saves us a significant amount in operational expenses.&lt;/p&gt;

&lt;h2&gt;New firefighting capabilities must be part of the plan&lt;/h2&gt;

&lt;p&gt;Things are working fine now, but they might break in the future. Inevitably, we will find ourselves running behind in processing events. One set of decisions we had to make for this service was &#8220;what kind of instance do we run on, and how large?&#8221; When we&#8217;re keeping up with incoming traffic, we can run on smaller instance types. If we ever fall behind, though, we need more compute to catch up as quickly as we can, but that&#8217;s expensive to run all the time when we may only need it once a year. Since we have a one-processor-per-partition model, we can&#8217;t really scale out dynamically. But we can scale up!&lt;/p&gt;

&lt;p&gt;Due to the previously mentioned 10 min host bootstrapping time, swapping out every host with a larger instance is not ideal. We’ll fall further behind while we wait to bring the instances up, and once we decide to scale down, we’ll fall behind again. What if we stood up an identically sized fleet with larger instances and cut over to that? This Terraform spec defines what we call our “catch-up fleet”. The ASG exists only when &lt;code&gt;dalmatian_catchup_instance_count&lt;/code&gt; is greater than 0. In an emergency, a one-line diff can bring this online.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_autoscaling_group"&lt;/span&gt; &lt;span class="s2"&gt;"dalmatian_catchup_asg"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"dalmatian_catchup_${var.env}"&lt;/span&gt;
  &lt;span class="nx"&gt;count&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;dalmatian_catchup_instance_count&lt;/span&gt; &lt;span class="err"&gt;&amp;amp;&lt;/span&gt;&lt;span class="nx"&gt;gt&lt;/span&gt;&lt;span class="err"&gt;;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="err"&gt;?&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;

  &lt;span class="c1"&gt;# ...&lt;/span&gt;

  &lt;span class="nx"&gt;launch_template&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;launch_template_specification&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;launch_template_id&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_launch_template&lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;dalmatian_lt&lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
      &lt;span class="nx"&gt;version&lt;/span&gt;            &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"$Latest"&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="nx"&gt;dynamic&lt;/span&gt; &lt;span class="s2"&gt;"override"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;for_each&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;dalmatian_catchup_instance_types&lt;/span&gt;
      &lt;span class="nx"&gt;content&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;instance_type&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;override&lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When the hosts are ready, we can cut over to them by stopping all processes on the smaller fleet, enabling the “hot spares” logic to kick in on the backup fleet. When we’re caught up, we can repeat the process in reverse: start the processes on the smaller fleet, and stop the larger fleet. Using Honeycomb, of course, we were able to verify that this solution works with a fire drill in one of our smaller environments.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ZACynbFF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2019/10/cutover.png" class="article-body-image-wrapper"&gt;&lt;img class="aligncenter size-full wp-image-5250" src="https://res.cloudinary.com/practicaldev/image/fetch/s--ZACynbFF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2019/10/cutover.png" alt="screenshot of the cutover timeframe, validating low latency" width="726" height="500"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;Spot the benefits of an observability-centric culture&lt;/h2&gt;

&lt;p&gt;OK, maybe that's enough Spot-related puns. The point is that our ability to experiment with new architectures and make significant changes to how services operate is predicated on our ability to deploy rapidly and find out how those changes impact the behavior of the system. Robust CI/CD processes, thoughtful and context-rich instrumentation, and attention to what we as software owners will need to support this functionality in the future make it easier and safer to build and ship software our users love and benefit from.&lt;/p&gt;




&lt;p&gt;Ready to take the sting out of shipping new software? Honeycomb helps you spot outliers and find out how your users experience your service. &lt;a href="https://ui.honeycomb.io/signup?&amp;amp;utm_source=Devto&amp;amp;utm_Devto=blog&amp;amp;utm_campaign=referral&amp;amp;utm_keyword=%7Bkeyword%7D&amp;amp;utm_content=a-culture-of-observability-helps-engineers-hit-the-spot-instance"&gt;Get started for free!&lt;/a&gt;&lt;/p&gt;

</description>
      <category>observability</category>
      <category>aws</category>
      <category>devops</category>
      <category>terraform</category>
    </item>
    <item>
      <title>Instrumenting Lambda with Traces: A Complete Example in Python</title>
      <dc:creator>honeycomb</dc:creator>
      <pubDate>Mon, 04 Jan 2021 22:02:15 +0000</pubDate>
      <link>https://dev.to/honeycombio/instrumenting-lambda-with-traces-a-complete-example-in-python-gnk</link>
      <guid>https://dev.to/honeycombio/instrumenting-lambda-with-traces-a-complete-example-in-python-gnk</guid>
      <description>&lt;p&gt;We’re big fans of AWS Lambda at Honeycomb. As you &lt;a href="https://www.honeycomb.io/blog/secondary-storage-to-just-storage/"&gt;may have read&lt;/a&gt;, we recently made some major improvements to our storage engine by leveraging Lambda to process more data in less time. Making a change to a complex system like our storage engine is daunting, but can be made less so with good instrumentation and tracing. For this project, that meant getting instrumentation out of Lambda and into Honeycomb. Lambda has some unique constraints that make this more complex than you might think, so in this post we’ll look at instrumenting an app from scratch.&lt;/p&gt;

&lt;h2&gt;Bootstrapping the app&lt;/h2&gt;

&lt;p&gt;Before we begin, you’ll want to ensure you have the &lt;a href="https://serverless.com/framework/docs/getting-started/"&gt;Serverless&lt;/a&gt; framework installed and a recent Python version (I’m using 3.6). For this example, I picked a Serverless Python TODO API template in the Serverless &lt;a href="https://github.com/serverless/examples"&gt;Examples&lt;/a&gt; repo. I like this particular template for demonstration because it sets up an external dependency (DynamoDB), which gives us something interesting to look at when we’re tracing. To install the demo app:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;sls &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--url&lt;/span&gt; https://github.com/serverless/examples/tree/master/aws-python-rest-api-with-dynamodb &lt;span class="nt"&gt;--name&lt;/span&gt; my-sls-app
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should see a project directory with some contents like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;my-sls-app &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;ls&lt;/span&gt;
&lt;span class="go"&gt;README.md    package.json    serverless.yml    todos
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There’s just a bit more to add before we get going. We want to install &lt;code&gt;honeycomb-beeline&lt;/code&gt; in our app, so we’ll need to package the Python requirements:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;#&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;install &lt;/span&gt;the serverless-python-requirements module
&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--save-dev&lt;/span&gt; serverless-python-requirements
&lt;span class="gp"&gt;#&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;install &lt;/span&gt;beeline-python &lt;span class="k"&gt;in &lt;/span&gt;a venv, &lt;span class="k"&gt;then &lt;/span&gt;&lt;span class="nb"&gt;export &lt;/span&gt;the requirements.txt
&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;virtualenv venv &lt;span class="nt"&gt;--python&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;python3
&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;source &lt;/span&gt;venv/bin/activate
&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;honeycomb-beeline
&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;pip freeze &amp;amp;gt&lt;span class="p"&gt;;&lt;/span&gt; requirements.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now edit &lt;code&gt;serverless.yml&lt;/code&gt; and add the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# essentially, this injects your python requirements into the package&lt;/span&gt;
&lt;span class="c1"&gt;# before deploying. This runs in docker if you're not deploying from a linux host&lt;/span&gt;
&lt;span class="na"&gt;plugins&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;serverless-python-requirements&lt;/span&gt;

&lt;span class="na"&gt;custom&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;pythonRequirements&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;dockerizePip&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;non-linux&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now we can deploy using &lt;code&gt;sls deploy&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;Serverless: Packaging service...
[...]
endpoints:
  POST - https://mtd8iptz1m.execute-api.us-east-2.amazonaws.com/dev/todos
  GET - https://mtd8iptz1m.execute-api.us-east-2.amazonaws.com/dev/todos
  GET - https://mtd8iptz1m.execute-api.us-east-2.amazonaws.com/dev/todos/{id}
  PUT - https://mtd8iptz1m.execute-api.us-east-2.amazonaws.com/dev/todos/{id}
  DELETE - https://mtd8iptz1m.execute-api.us-east-2.amazonaws.com/dev/todos/{id}
functions:
  create: my-sls-app-dev-create
  list: my-sls-app-dev-list
  get: my-sls-app-dev-get
  update: my-sls-app-dev-update
  delete: my-sls-app-dev-delete
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A few curl calls against the new endpoints confirm we’re in good shape!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;curl https://mtd8iptz1m.execute-api.us-east-2.amazonaws.com/dev/todos
&lt;span class="gp"&gt;#&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;db is empty initially
&lt;span class="go"&gt;[]
&lt;/span&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"text": "write a blog post"}'&lt;/span&gt; https://mtd8iptz1m.execute-api.us-east-2.amazonaws.com/dev/todos
&lt;span class="go"&gt;{"id": "267199de-25c0-11ea-82d7-e6f595c02494", "text": "write a blog post", "checked": false, "createdAt": "1577131779.711644", "updatedAt": "1577131779.711644"}
&lt;/span&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;curl https://mtd8iptz1m.execute-api.us-east-2.amazonaws.com/dev/todos
&lt;span class="go"&gt;[{"checked": false, "createdAt": "1577131779.711644", "text": "write a blog post", "id": "267199de-25c0-11ea-82d7-e6f595c02494", "updatedAt": "1577131779.711644"}]
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;Implementing tracing&lt;/h2&gt;

&lt;p&gt;The first step is to initialize the beeline. The &lt;code&gt;todos/__init__.py&lt;/code&gt; file is a great place to drop that init code so that it gets pulled in by all of the Lambda handlers.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;beeline&lt;/span&gt;
&lt;span class="n"&gt;beeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;init&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;writekey&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'YOURWRITEKEY'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;service_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'todo-app'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'my-sls-app'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now let’s look at that curl we ran earlier. That’s hitting the &lt;code&gt;list&lt;/code&gt; function on the API. Let’s open this up and add some instrumentation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# ...
&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;beeline&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;beeline.middleware.awslambda&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;beeline_wrapper&lt;/span&gt;

&lt;span class="c1"&gt;# The beeline_wrapper decorator wraps the Lambda handler here in a span. By default,
# this also starts a new trace. The span and trace are finished when the function exits
# (but before the response is returned)
&lt;/span&gt;&lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="n"&gt;beeline_wrapper&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;table&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dynamodb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'DYNAMODB_TABLE'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;beeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add_context_field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"table_name"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# This call to our db dependency is worth knowing about - let's wrap it in a span.
&lt;/span&gt;    &lt;span class="c1"&gt;# That's easy to do with a context manager.
&lt;/span&gt;    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;beeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tracer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"db-scan"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# fetch all todos from the database
&lt;/span&gt;        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scan&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c1"&gt;# .. capture any results we want to include from this function call
&lt;/span&gt;    &lt;span class="n"&gt;beeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add_context&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
        &lt;span class="s"&gt;'status_code'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s"&gt;'num_items'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'Items'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;

    &lt;span class="c1"&gt;# create a response
&lt;/span&gt;    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="s"&gt;"statusCode"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s"&gt;"body"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'Items'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;cls&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;decimalencoder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DecimalEncoder&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Another &lt;code&gt;sls deploy&lt;/code&gt; gets our instrumentation out there. Once deployed, I can re-run that curl. Now I have a trace in my dataset!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--aQftfGW6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/01/example-trace1.png" class="article-body-image-wrapper"&gt;&lt;img class="alignnone size-full wp-image-5765" src="https://res.cloudinary.com/practicaldev/image/fetch/s--aQftfGW6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/01/example-trace1.png" alt="screenshot of an individual trace in honeycomb" width="804" height="181"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The trace shows me a Lambda runtime of 6.7ms, but from the client, it seemed slow, so I check the Lambda execution logs and see:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;REPORT RequestId: def28374-1e35-429b-8667-3ab481b61ad4 Duration: 37.27 ms
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Why the discrepancy?&lt;/p&gt;

&lt;h2&gt;The instrumentation performance tax&lt;/h2&gt;

&lt;p&gt;The lifecycle of a Lambda container is tricky. You may already know about cold starts. That is, your code isn’t really running until it is invoked, and when it does run, the container has to start up and this can add latency to your initial invocation(s). Once your function returns a response, it goes into a frozen state, and execution of any running threads is suspended. From there, the container may be reused (without the cold start penalty) by subsequent requests, or terminated. This termination is not done gracefully, and thus the running function isn't given a chance to do housekeeping.&lt;/p&gt;
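This reuse behavior is easy to observe with module-level state. Here's a minimal, illustrative handler (not part of the demo app; the names are ours): module scope runs once per cold start, and globals survive across warm invocations of the same container.

```python
import time

# Module scope executes once, when the container cold-starts,
# and persists while the container is reused for warm invocations.
COLD_START_AT = time.time()
invocation_count = 0

def handler(event, context):
    # The counter climbs across warm invocations of one container,
    # resetting only when a fresh container cold-starts.
    global invocation_count
    invocation_count += 1
    return {
        "cold_start_at": COLD_START_AT,
        "invocation": invocation_count,
    }
```

Two back-to-back warm invocations share the same `COLD_START_AT` but see an increasing `invocation` count; after a freeze-then-terminate, both reset.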

&lt;p&gt;What does that mean if you’re trying to send telemetry?&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Typical client patterns using delayed batching, like those used in our SDKs, are unreliable in Lambda, since the batch transmission thread(s) may never get a chance to run after the function exits.&lt;/li&gt;
  &lt;li&gt;Anything that needs to be sent reliably must be sent before the function returns.&lt;/li&gt;
&lt;/ul&gt;
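To make the first point concrete, here is a deliberately simplified model of delayed batching. The queue and background sender below are illustrative, not the actual Beeline transmission code: events queue up, a daemon thread drains them on an interval, and because the handler returns (and the container freezes) before the interval elapses, the batch is never sent.

```python
import queue
import threading
import time

events = queue.Queue()  # pending telemetry events
sent = []               # events that actually made it out

def _background_sender():
    # In a long-lived process, this loop drains the queue every few
    # seconds. In Lambda, execution freezes the moment the handler
    # returns, so this thread may never wake up to send the last batch.
    while True:
        time.sleep(5.0)
        while not events.empty():
            sent.append(events.get())

threading.Thread(target=_background_sender, daemon=True).start()

def handler(event, context):
    events.put({"name": "my-span"})
    # Returning now: the event above is still sitting in the queue.
    return {"statusCode": 200}
```

Immediately after `handler` returns, `sent` is still empty and the span is stranded in `events` - exactly the failure mode the bullet points describe.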

&lt;p&gt;That’s why if you crack open the Beeline middleware code for Lambda, you’ll see this line:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;beeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_beeline&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;flush&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is our way of ensuring event batches are sent before the function exits. But it isn't free: the flush delays the response to the client by the round-trip time from your Lambda to our API. For many use cases, that's an acceptable tradeoff for getting instrumentation out of Lambda. But what if your application (or your user) is latency-sensitive?&lt;/p&gt;

&lt;h2&gt;Cloudwatch logs to the rescue&lt;/h2&gt;

&lt;p&gt;There is one way to synchronously ship data from a Lambda function without significant blocking: logging.&lt;br&gt;
If you want detailed telemetry without a performance hit, this is currently the only way to go. Each Lambda function writes to its own Cloudwatch Log stream, and AWS makes it relatively easy to subscribe to those streams with other Lambda functions. Here’s where we introduce the &lt;a href="https://github.com/honeycombio/agentless-integrations-for-aws#honeycomb-publisher-for-lambda"&gt;Honeycomb Publisher for Lambda&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The Publisher can subscribe to one or more Cloudwatch Log streams, pulling in event data in JSON format and shipping it to the correct dataset. To use it, you need to deploy it and subscribe it to your Lambda function(s) log streams, then configure your app to emit events via logs.&lt;/p&gt;
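The Publisher handles all of this for you, but for context: Cloudwatch Logs delivers records to a subscribed Lambda as a gzipped, base64-encoded payload of log events. A rough sketch of the unpacking involved follows; `decode_subscription_event` is our own illustrative helper, not the Publisher's actual code.

```python
import base64
import gzip
import json

def decode_subscription_event(event):
    """Unpack a Cloudwatch Logs subscription event and return the
    JSON log lines it contains (e.g., beeline spans written to stdout)."""
    # The subscription payload arrives base64-encoded and gzipped.
    raw = base64.b64decode(event["awslogs"]["data"])
    payload = json.loads(gzip.decompress(raw))

    spans = []
    for log_event in payload.get("logEvents", []):
        try:
            spans.append(json.loads(log_event["message"]))
        except ValueError:
            # Skip non-JSON lines such as Lambda's START/END/REPORT records.
            pass
    return spans
```

A forwarder like the Publisher then ships the decoded JSON events to the Honeycomb API, dropping the platform log lines along the way.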

&lt;h3&gt;Switching to log output&lt;/h3&gt;

&lt;p&gt;The Python Beeline makes it &lt;a href="https://docs.honeycomb.io/getting-data-in/python/beeline/#customizing-event-transmission"&gt;easy&lt;/a&gt; to override the event transmission mechanism. For our purpose here, we’ll use the built-in &lt;code&gt;FileTransmission&lt;/code&gt; with output to sys.stdout:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;sys&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;beeline&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;libhoney&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;libhoney.transmission&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FileTransmission&lt;/span&gt;
&lt;span class="n"&gt;beeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;init&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;writekey&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'can be anything'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;service_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'todo-app'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'my-sls-app'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;transmission_impl&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;FileTransmission&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stdout&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That’s all we need to do to have events flow to Cloudwatch instead of the Honeycomb API. Note that you do not need to set a valid writekey when using the FileTransmission. The Publisher will have responsibility for authenticating API traffic.&lt;/p&gt;

&lt;p&gt;If you execute your app after deploying this, you should see span data in the Cloudwatch Logs for the Lambda function.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; 
    &lt;/span&gt;&lt;span class="nl"&gt;"time"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2019-12-17T16:54:20.355317Z"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; 
    &lt;/span&gt;&lt;span class="nl"&gt;"samplerate"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; 
    &lt;/span&gt;&lt;span class="nl"&gt;"dataset"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"my-sls-app"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"data"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
       &lt;/span&gt;&lt;span class="nl"&gt;"service_name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"todo-app"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
       &lt;/span&gt;&lt;span class="nl"&gt;"meta.beeline_version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2.11.2"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
       &lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"duration_ms"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;6.47&lt;/span&gt;&lt;span class="w"&gt; 
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt; 
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Also note the shorter Lambda run time, now that we aren’t blocking on an API call before exiting:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;REPORT RequestId: 30ff2be3-2c0b-4869-8da3-7af280dfc76c Duration: 9.41 ms
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;Deploying the Publisher&lt;/h3&gt;

&lt;p&gt;The Publisher is another Lambda function, so if you are familiar with integrating third-party Lambda functions in your stack, you should use the tools and methods that work best for you. We provide a helpful, generic &lt;a href="https://github.com/honeycombio/agentless-integrations-for-aws#honeycomb-publisher-for-lambda"&gt;Cloudformation Template&lt;/a&gt; to walk you through its installation and to document its AWS dependencies. Since we’re using Serverless in this tutorial, though, let’s see if we can make it a “native” part of our project!&lt;/p&gt;

&lt;p&gt;Serverless allows you to describe ancillary AWS resources to spin up alongside your stack. In our case, we want to bolt on the publisher and its dependencies whenever we deploy our app. In the &lt;code&gt;serverless.yml&lt;/code&gt;, append the following to the &lt;code&gt;resources&lt;/code&gt; block of the sample app:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;Resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;TodosDynamoDbTable&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="c1"&gt;# ...&lt;/span&gt;
    &lt;span class="na"&gt;PublisherLambdaHandler&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;AWS::Lambda::Function'&lt;/span&gt;
      &lt;span class="na"&gt;Properties&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;Code&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;S3Bucket&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;honeycomb-integrations-${opt:region, self:provider.region}&lt;/span&gt;
          &lt;span class="na"&gt;S3Key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;agentless-integrations-for-aws/LATEST/ingest-handlers.zip&lt;/span&gt;
        &lt;span class="na"&gt;Description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Lambda function for publishing asynchronous events from Lambda&lt;/span&gt;
        &lt;span class="na"&gt;Environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;Variables&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;HONEYCOMB_WRITE_KEY&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;YOURHONEYCOMBKEY'&lt;/span&gt;
            &lt;span class="na"&gt;DATASET&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;my-sls-app'&lt;/span&gt;
        &lt;span class="na"&gt;FunctionName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;PublisherLambdaHandler-${self:service}-${opt:stage, self:provider.stage}&lt;/span&gt;
        &lt;span class="na"&gt;Handler&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;publisher&lt;/span&gt;
        &lt;span class="na"&gt;MemorySize&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;128&lt;/span&gt;
        &lt;span class="na"&gt;Role&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Fn::GetAtt"&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;LambdaIAMRole&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;Arn&lt;/span&gt;
        &lt;span class="na"&gt;Runtime&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;go1.x&lt;/span&gt;
        &lt;span class="na"&gt;Timeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
    &lt;span class="na"&gt;ExecutePermission&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AWS::Lambda::Permission"&lt;/span&gt;
      &lt;span class="na"&gt;Properties&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;Action&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;lambda:InvokeFunction'&lt;/span&gt;
        &lt;span class="na"&gt;FunctionName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Fn::GetAtt"&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;PublisherLambdaHandler&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;Arn&lt;/span&gt;
        &lt;span class="na"&gt;Principal&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;logs.amazonaws.com'&lt;/span&gt;
    &lt;span class="na"&gt;LambdaIAMRole&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AWS::IAM::Role"&lt;/span&gt;
      &lt;span class="na"&gt;Properties&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;AssumeRolePolicyDocument&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;Version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2012-10-17"&lt;/span&gt;
          &lt;span class="na"&gt;Statement&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;Effect&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Allow"&lt;/span&gt;
              &lt;span class="na"&gt;Principal&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                &lt;span class="na"&gt;Service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;lambda.amazonaws.com"&lt;/span&gt;
              &lt;span class="na"&gt;Action&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sts:AssumeRole"&lt;/span&gt;
    &lt;span class="na"&gt;LambdaLogPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AWS::IAM::Policy"&lt;/span&gt;
      &lt;span class="na"&gt;Properties&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;PolicyName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;lambda-create-log"&lt;/span&gt;
        &lt;span class="na"&gt;Roles&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;Ref&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;LambdaIAMRole&lt;/span&gt;
        &lt;span class="na"&gt;PolicyDocument&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;Version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2012-10-17"&lt;/span&gt;
          &lt;span class="na"&gt;Statement&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;Effect&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Allow&lt;/span&gt;
              &lt;span class="na"&gt;Action&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;logs:CreateLogGroup&lt;/span&gt;
                &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;logs:CreateLogStream&lt;/span&gt;
                &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;logs:PutLogEvents&lt;/span&gt;
              &lt;span class="na"&gt;Resource&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;arn:aws:logs:*:*:*'&lt;/span&gt;
    &lt;span class="c1"&gt;# add one of me for each function in your app&lt;/span&gt;
    &lt;span class="na"&gt;CloudwatchSubscriptionFilterList&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AWS::Logs::SubscriptionFilter"&lt;/span&gt;
      &lt;span class="na"&gt;Properties&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;DestinationArn&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Fn::GetAtt"&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;PublisherLambdaHandler&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;Arn&lt;/span&gt;
        &lt;span class="na"&gt;LogGroupName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/aws/lambda/${self:service}-${opt:stage, self:provider.stage}-list&lt;/span&gt;
        &lt;span class="na"&gt;FilterPattern&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;'&lt;/span&gt;
      &lt;span class="na"&gt;DependsOn&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ExecutePermission&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That’s a lot of boilerplate! The main things you’ll want to take note of are setting a valid value for &lt;code&gt;HONEYCOMB_WRITE_KEY&lt;/code&gt; and adding &lt;code&gt;SubscriptionFilter&lt;/code&gt; resources for each function in your stack. Run &lt;code&gt;sls deploy&lt;/code&gt; one more time and you should now see a new function, &lt;code&gt;PublisherLambdaHandler-my-sls-app-dev&lt;/code&gt;, deployed alongside your Lambda functions. The Publisher will be subscribed to your app’s Cloudwatch Log Streams and will forward events to Honeycomb.&lt;/p&gt;

&lt;h2&gt;Leveling up with distributed tracing&lt;/h2&gt;

&lt;p&gt;Lambda functions don’t always run in a vacuum; often they are executed as part of a larger chain of events. You can link your Lambda instrumentation to your overall application instrumentation using distributed traces. To do this, you need to pass trace context from the calling app to your Lambda-backed service. Let’s look at a practical example by building on our existing app. We already have a todo API implemented in Lambda. Let’s assume we’re building a UI on top of that. We have some code that fetches the list of items from the API and then renders them.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="n"&gt;beeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;traced&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"todo_list_view"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;todo_list_view&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;beeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tracer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"requests_get"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;todo_list&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'https://mtd8iptz1m.execute-api.us-east-2.amazonaws.com/dev/todos'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;render_list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;todo_list&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="n"&gt;beeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;traced&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"render_list"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;render_list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# ...
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The HTTP request to fetch the list of todo items is done using the &lt;code&gt;requests&lt;/code&gt; lib. This &lt;code&gt;get&lt;/code&gt; call is wrapped in a trace span to measure the request time. We know that &lt;code&gt;list&lt;/code&gt; is instrumented with tracing, so it would be nice if we could link those spans to our trace!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="n"&gt;beeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;traced&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"todo_list_view"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;todo_list_view&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;beeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tracer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"requests_get"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;beeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_beeline&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;tracer_impl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;marshal_trace_context&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;todo_list&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'https://mtd8iptz1m.execute-api.us-east-2.amazonaws.com/dev/todos'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
                                &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;'X-Honeycomb-Trace'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;

    &lt;span class="n"&gt;render_list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;todo_list&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;marshal_trace_context&lt;/code&gt; builds a serialized trace context object, which can be passed to other applications via a request header or payload. The Lambda middleware automatically looks for a header named &lt;code&gt;X-Honeycomb-Trace&lt;/code&gt; and extracts the context object, adopting the trace ID and using the caller’s span ID as its parent rather than starting a new trace. All we need to do is pass this context object as a request header before we call the &lt;code&gt;list&lt;/code&gt; endpoint. The header is added manually here, but the beeline provides a patch for the requests lib that will do this for you: simply import &lt;code&gt;beeline.patch.requests&lt;/code&gt; into your application after importing the &lt;code&gt;requests&lt;/code&gt; lib.&lt;/p&gt;
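&lt;p&gt;To make the propagation concrete, here is a minimal sketch of what marshalling and parsing a trace context can look like. The &lt;code&gt;1;trace_id=…,parent_id=…,context=…&lt;/code&gt; shape with a base64-encoded JSON payload mirrors the beeline’s &lt;code&gt;X-Honeycomb-Trace&lt;/code&gt; header format, but these helper functions are illustrative, not the beeline’s own API:&lt;/p&gt;

```python
import base64
import json

def marshal_trace_context(trace_id, parent_id, trace_fields):
    """Serialize trace state into a version-1 propagation header:
    '1;trace_id=...,parent_id=...,context=...' (base64-encoded JSON)."""
    ctx = base64.b64encode(json.dumps(trace_fields).encode("utf-8")).decode("ascii")
    return f"1;trace_id={trace_id},parent_id={parent_id},context={ctx}"

def unmarshal_trace_context(header):
    """Parse the header back into its parts, the way receiving
    middleware would before adopting the trace ID and parent span."""
    version, payload = header.split(";", 1)
    assert version == "1"
    parts = dict(kv.split("=", 1) for kv in payload.split(","))
    fields = json.loads(base64.b64decode(parts["context"]))
    return parts["trace_id"], parts["parent_id"], fields
```

&lt;p&gt;The receiving service continues the trace by reusing the extracted trace ID and pointing its root span’s parent at the caller’s span ID.&lt;/p&gt;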

&lt;p&gt;When we call &lt;code&gt;todo_list_view&lt;/code&gt; in our UI app, we now see one trace with spans from both services:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--6h_wrk8r--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/01/example-trace2.png" class="article-body-image-wrapper"&gt;&lt;img class="alignnone size-full wp-image-5766" src="https://res.cloudinary.com/practicaldev/image/fetch/s--6h_wrk8r--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/01/example-trace2.png" alt="screenshot of an individual trace in the honeycomb ui" width="792" height="214"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;Go Serverless with confidence&lt;/h2&gt;

&lt;p&gt;Building and running apps with Lambda can be complex and is not without challenges, but with distributed tracing and rich instrumentation, there’s no need to fumble around in the dark.&lt;/p&gt;

&lt;p&gt;Ready to instrument your serverless apps? Get started with &lt;a href="https://ui.honeycomb.io/signup?&amp;amp;utm_source=Devto&amp;amp;utm_Devto=blog&amp;amp;utm_campaign=referral&amp;amp;utm_keyword=%7Bkeyword%7D&amp;amp;utm_content=instrumenting-lambda-with-traces-a-complete-example-in-python"&gt;Honeycomb for free&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>serverless</category>
      <category>observability</category>
      <category>tracing</category>
      <category>aws</category>
    </item>
    <item>
      <title>Getting At The Good Stuff: How To Sample Traces in Honeycomb</title>
      <dc:creator>Irving Popovetsky</dc:creator>
      <pubDate>Wed, 30 Dec 2020 19:07:11 +0000</pubDate>
      <link>https://dev.to/honeycombio/getting-at-the-good-stuff-how-to-sample-traces-in-honeycomb-1ijm</link>
      <guid>https://dev.to/honeycombio/getting-at-the-good-stuff-how-to-sample-traces-in-honeycomb-1ijm</guid>
      <description>&lt;p&gt;Sampling is a must for applications at scale; it’s a technique for reducing the burden on your infrastructure and telemetry systems by only keeping data on a &lt;a href="https://en.wikipedia.org/wiki/Sample_(statistics)"&gt;statistical sample &lt;/a&gt;of requests rather than 100% of requests. Large systems may produce large volumes of similar requests which can be de-duplicated.&lt;/p&gt;

&lt;p&gt;This article is for developers and operators who have added &lt;a href="https://docs.honeycomb.io/getting-data-in/"&gt;Honeycomb instrumentation&lt;/a&gt; into their applications and wish to learn more about sampling as a technique for controlling costs and resource utilization while maximizing the value of your telemetry data. In this article, you’ll learn about the various sampling techniques and decision strategies as well as their benefits, drawbacks, and use-cases. Finally, there are some tips on where to look in case you’re running into trouble.&lt;/p&gt;

&lt;h2&gt;Instrumentation isn't free&lt;/h2&gt;

&lt;p&gt;&lt;b&gt;Instrumentation always has a cost.&lt;/b&gt; Sampling can reduce this cost but will not entirely eliminate it.&lt;/p&gt;

&lt;ul&gt;
    &lt;li&gt;There are, of course, costs associated with sending and storing the additional data (in Honeycomb or any platform).&lt;/li&gt;
    &lt;li&gt;There’s also an &lt;b&gt;infrastructure cost&lt;/b&gt; to capture instrumentation data, handle it, make decisions about what to do with it (like sampling) and transmit it somewhere.
&lt;ul&gt;
    &lt;li&gt;This cost is typically paid with an incremental amount of CPU usage, memory allocation and/or network bandwidth.&lt;/li&gt;
    &lt;li&gt;If your application and infrastructure are not CPU-bound or Network IO-bound, the impact on request latency is likely negligible.&lt;/li&gt;
&lt;/ul&gt;




&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;b&gt;These costs vary by programming language.&lt;/b&gt; All instrumentation tooling vendors and projects work hard to minimize impact, typically by processing and sending data in different threads so as to not block applications while they serve requests. Latency may be more impacted in runtimes that were not designed for native parallelism (see: the &lt;a href="https://en.wikipedia.org/wiki/Global_interpreter_lock"&gt;GIL&lt;/a&gt; in Ruby and Python; NodeJS &lt;a href="https://nodejs.org/en/docs/guides/blocking-vs-non-blocking/#concurrency-and-throughput"&gt;is essentially single threaded&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;Although you can't eliminate the overhead, &lt;strong&gt;you can move the overhead around in (and out of) your infrastructure&lt;/strong&gt;. For example, Honeycomb provides the &lt;a href="https://github.com/honeycombio/samproxy"&gt;samproxy&lt;/a&gt; (currently in alpha), which you can run in your infrastructure to buffer and send traces, or our new &lt;a href="https://docs.honeycomb.io/working-with-your-data/tracing/sampling/#honeycomb-refinery-at-ingestion-time"&gt;Refinery&lt;/a&gt; feature (currently in beta) for handling this on the collection side.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Sampling traces is trickier than sampling individual events.&lt;/b&gt; You want to avoid situations where traces aren’t completely collected (some spans thrown away) because this is frustrating for developers trying to debug their code. Not all event-sampling techniques will work since trace spans can be distributed among multiple threads, processes, or even servers and processed asynchronously. More on this below.&lt;/p&gt;

&lt;h2&gt;How sampling works&lt;/h2&gt;

&lt;p&gt;Honeycomb can decide which request traces or events to keep and which ones to drop in a number of different ways:&lt;/p&gt;

&lt;ul&gt;
    &lt;li&gt;
&lt;b&gt;Random sampling&lt;/b&gt; evaluates each event (whether it is a trace span or not) for sampling on a simple probabilistic basis: each event has an equal chance of being selected, with no selection criterion other than the probability (for instance, 1 in 10). This is not good for tracing because it doesn’t guarantee you’ll get all the spans of a trace!&lt;/li&gt;
    &lt;li&gt;
&lt;b&gt;Deterministic sampling&lt;/b&gt; is a probabilistic technique that works by taking a hash of the trace ID and converting the first 4 bytes of the hash to a numeric value, for easy and fast decision-making based on a target sample rate. Consistent sampling is ensured for all spans in the trace because the trace ID is propagated to all child spans and the sampling decision is made the same way.&lt;/li&gt;
    &lt;li&gt;
&lt;b&gt;Target rate sampling&lt;/b&gt; delivers a given rate of collected telemetry (e.g. 500 collected spans per minute), decreasing the sample probability if there’s an increase in traffic, and increasing the sample probability if traffic drops off.&lt;/li&gt;
    &lt;li&gt;
&lt;b&gt;Rule-based sampling&lt;/b&gt; is a variant of deterministic sampling, but you can make your sampling decision based on properties of the request. Think of this as the data contained in the HTTP header: endpoint (URL), user-agent, or any arbitrary header value that you know will exist. This is cool because it allows you to fine-tune sampling rates based on your needs. For instance, keeping all write-path data but sampling the higher-traffic read-path data.
&lt;ul&gt;
    &lt;li&gt;Rule-based sampling is compatible with both traces and events. However, when a rule depends on trace data that isn’t known at the start of the request (such as status codes), the entire request must be considered, which means it is only compatible with &lt;b&gt;tail-based sampling.&lt;/b&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;/li&gt;

    &lt;li&gt;
&lt;b&gt;Dynamic sampling&lt;/b&gt; combines rule-based and target rate sampling. It delivers a given sample rate, weighting rare traffic and frequent traffic (for example, by status code or endpoint) differently so as to end up with the correct average. Frequent traffic is sampled more heavily, while rarer events are kept or sampled at a lower rate. This is the strategy you want to use if you are concerned about keeping high-resolution data about unusual events while maintaining a representative sample of your application’s behavior.&lt;/li&gt;

&lt;/ul&gt;
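&lt;p&gt;The deterministic technique is compact enough to sketch in a few lines. This toy version (the choice of SHA-1 and the exact byte handling are illustrative assumptions, not Honeycomb’s implementation) shows why every service reaches the same decision for the same trace ID:&lt;/p&gt;

```python
import hashlib

MAX_UINT32 = 2 ** 32

def should_sample(trace_id: str, sample_rate: int) -> bool:
    """Deterministic sampling: hash the trace ID, turn the first 4 bytes
    of the digest into an unsigned integer, and keep the trace when that
    value lands in the lowest 1/sample_rate slice of the 32-bit space."""
    digest = hashlib.sha1(trace_id.encode("utf-8")).digest()
    value = int.from_bytes(digest[:4], "big")
    # The decision depends only on the trace ID, so every span of a
    # trace gets the same answer, on every host, with no coordination.
    return (MAX_UINT32 // sample_rate) > value
```

&lt;p&gt;Because the input is the propagated trace ID rather than a random number, child spans sampled on other hosts always agree with the root span’s decision.&lt;/p&gt;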

&lt;h3&gt;How sampling works for traces&lt;/h3&gt;

&lt;ul&gt;
    &lt;li&gt;
&lt;b&gt;Head-based sampling&lt;/b&gt;: a sampling decision is made when the root span is created, typically when the request is first seen by a tracing-aware service. At this point only metadata is known about the request (e.g. information in the HTTP header fields). The sampling decision is then propagated to child spans, typically with the addition of an HTTP header denoting that the request is to be captured. This is most compatible with &lt;b&gt;Deterministic&lt;/b&gt; and &lt;b&gt;Rule-based&lt;/b&gt; sampling where the rules are the same on all services and the fields are known in advance.&lt;/li&gt;
    &lt;li&gt;
&lt;b&gt;Tail-based sampling&lt;/b&gt;: the sampling decision is made upon completion of the entire request, after all of the spans have been collected. This is much more expensive than head-based sampling because all trace spans must be collected, buffered, and then processed for sampling decisions. This is best combined with &lt;b&gt;Dynamic sampling&lt;/b&gt; because all of the information is available to identify the “interesting” traces by type (including status code) and volume.
&lt;ul&gt;
    &lt;li&gt;We advise against performing dynamic tail-based sampling in your app (in-process) because it’s impossible to buffer downstream trace spans, and the overhead may adversely affect performance.&lt;/li&gt;
&lt;/ul&gt;




&lt;/li&gt;

&lt;/ul&gt;
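&lt;p&gt;The interplay of traffic-aware rates in dynamic sampling can be sketched as a toy in-memory sampler. The keying (say, endpoint plus status code) and the rate formula below are illustrative choices, not how Refinery or the beelines implement it:&lt;/p&gt;

```python
import random
from collections import Counter

class DynamicSampler:
    """Toy dynamic sampler: weight each traffic key (e.g. an endpoint
    plus status code pair) by how often it has been seen. Frequent keys
    earn a high sample rate (keep 1 in N); rare keys are always kept."""

    def __init__(self, target_rate: int):
        self.target_rate = target_rate
        self.counts = Counter()

    def rate_for(self, key) -> int:
        # A key seen fewer times than target_rate gets rate 1 (keep
        # everything); hot keys are sampled ever more heavily.
        return max(1, self.counts[key] // self.target_rate)

    def should_keep(self, key) -> bool:
        self.counts[key] += 1
        return random.randrange(self.rate_for(key)) == 0
```

&lt;p&gt;Because rare keys (such as an endpoint that suddenly starts returning 500s) never accumulate a high count, they are kept at full resolution while the hot read path is sampled down.&lt;/p&gt;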

&lt;h2&gt;&lt;b&gt;Troubleshooting your sampling implementation&lt;/b&gt;&lt;/h2&gt;

&lt;p&gt;&lt;b&gt;Seeing increased request latency?&lt;/b&gt;&lt;br&gt;
If your existing metrics systems have picked up a significant increase in request latency, it’s important to quantify and isolate the sources of the latency. Is the effect on latency different during periods of high load? If so, this is most likely a sign of infrastructure contention. Increasing the sampling rate (sending fewer traces) may help in some cases, as can using feature flags to introduce finer granularity and control over sample rates.&lt;/p&gt;

&lt;p&gt;If the increase in request latency is visible even when the system is idle (this is very rare), it might be an issue with the instrumentation itself or how it was integrated. Reach out to Honeycomb support and we can help.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Missing trace spans?&lt;/b&gt;&lt;br&gt;
If you’ve enabled sampling and see missing trace spans, there are several potential avenues of investigation:&lt;/p&gt;

&lt;ul&gt;
    &lt;li&gt;If you’ve customized your sampling logic, ensure you’re using a sampling decision that is compatible with your sampling technique. For example, don’t combine Head-based sampling with Random or Dynamic sampling techniques.&lt;/li&gt;
    &lt;li&gt;If trace spans are coming from different services, ensure that consistent sampling rules are being applied to all services and that there are no network issues preventing any services from sending events.&lt;/li&gt;
    &lt;li&gt;If you see the whole trace but it claims a missing root span, be sure you're not setting the &lt;code&gt;Parent ID&lt;/code&gt; field on the root span - its absence is what indicates the root span. (Note: this only applies to users of the libhoney SDK, the Honeycomb Beelines take care of this for you automatically.)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Want more information?&lt;/h2&gt;

&lt;p&gt;For more information on sampling, check out the following resources:&lt;/p&gt;

&lt;ul&gt;
    &lt;li&gt;&lt;a href="https://docs.honeycomb.io/getting-data-in/"&gt;Getting data in (Honeycomb docs)&lt;/a&gt;&lt;/li&gt;
    &lt;li&gt;&lt;a href="https://docs.honeycomb.io/working-with-your-data/tracing/sampling/"&gt;Sampling traces (Honeycomb docs)&lt;/a&gt;&lt;/li&gt;
    &lt;li&gt;&lt;a href="https://www.honeycomb.io/blog/dynamic-sampling-by-example/"&gt;Dynamic Sampling by Example&lt;/a&gt;&lt;/li&gt;
    &lt;li&gt;&lt;a href="https://www.honeycomb.io/wp-content/uploads/2019/05/the_new_rules_of_sampling_-_typeset_final.pdf"&gt;The New Rules of Sampling (whitepaper)&lt;/a&gt;&lt;/li&gt;
    &lt;li&gt;&lt;a href="https://github.com/honeycombio/samproxy"&gt;Honeycomb samproxy (Github)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;Need this kind of flexibility for your instrumentation and logs? Get started with &lt;a href="https://ui.honeycomb.io/signup?&amp;amp;utm_source=Devto&amp;amp;utm_Devto=blog&amp;amp;utm_campaign=referral&amp;amp;utm_keyword=%7Bkeyword%7D&amp;amp;utm_content=getting-at-the-good-stuff-how-to-sample-traces-in-honeycomb/"&gt;Honeycomb for free&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>tracing</category>
      <category>sampling</category>
      <category>observability</category>
      <category>devops</category>
    </item>
    <item>
      <title>They Aren’t Pillars, They’re Lenses</title>
      <dc:creator>Danyel Fisher</dc:creator>
      <pubDate>Tue, 22 Dec 2020 23:40:29 +0000</pubDate>
      <link>https://dev.to/honeycombio/they-aren-t-pillars-they-re-lenses-opf</link>
      <guid>https://dev.to/honeycombio/they-aren-t-pillars-they-re-lenses-opf</guid>
      <description>&lt;p&gt;To have Observability is to have the ability to understand your system’s internal state based on signals and externally-visible output. Honeycomb’s approach to Observability is to strive toward this: every feature of the product attempts to move closer to a unified vision of figuring out what your system did, and how it got there. Our approach is to let people smoothly move between aggregated views of their data, like heat-maps and line charts, into views that emphasize collections of events, like traces and BubbleUp, into views that emphasize single events, like raw data.&lt;/p&gt;

&lt;p&gt;In the broader marketplace, though, Observability is often promoted as “three pillars” — separating logging, monitoring, and tracing (aka logs, metrics &amp;amp; traces) as three distinct capabilities. We believe that separating these capabilities misses out on the true power of solving a problem with rich observability.&lt;/p&gt;

&lt;p&gt;The metaphor I like is to think of each feature as a lens on your data. Like a lens, they remove some wavelengths of information in exchange for emphasizing others. To debug in hi-res, you need to be able to see all the vivid colors.&lt;/p&gt;

&lt;p&gt;Let’s say, for example, that you’re tracking a service that seems to be acting up. An alert has gone off, saying that some users are having a poor experience. &lt;strong&gt;Monitoring tools&lt;/strong&gt; that track &lt;strong&gt;metrics&lt;/strong&gt;, the first pillar, will interpret data as a time series of numbers and gauges — and that’s really important, because it’s useful to know such things as how long a process takes to launch or how long a web page takes to load. Using a metrics monitoring tool (e.g. Prometheus) will help generate that alert. If the monitoring tool supports &lt;a href="https://www.vividcortex.com/blog/what-is-cardinality-in-monitoring" rel="noopener noreferrer"&gt;high cardinality&lt;/a&gt; — the ability to track hundreds or thousands of different values — you can even find out which endpoints those users are encountering and, perhaps, some information about which users.&lt;/p&gt;

&lt;p&gt;You could think of that as a magnifying glass with a blue lens on your data. It comes out looking something like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ta761hxs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/02/monet_blue.jpg" class="article-body-image-wrapper"&gt;&lt;img class=" wp-image-6003" src="https://res.cloudinary.com/practicaldev/image/fetch/s--ta761hxs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/02/monet_blue.jpg" alt="" width="220" height="264"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The second pillar is traces or &lt;a href="https://www.honeycomb.io/blog/get-deeper-insights-with-honeycomb-tracing/" rel="noopener noreferrer"&gt;tracing&lt;/a&gt;, which looks at individual calls and dives into how they are processed. From inside a tracing tool (e.g. Jaeger), you can do wonderful things — you can see which component took the longest or shortest, and you can see whether specific functions resulted in errors. In this case, for example, we might be able to use the information we found from the metrics to try to find a trace that hits the same endpoint. That trace might help us identify that the slow part of the trace was the call to the database, which is now taking much more time than before.&lt;/p&gt;

&lt;p&gt;(Of course, the process of getting from the metrics monitoring tool to the tracing tool is bumpy: the two types of tools collect different data. You need to find out how to correlate information in the metrics tool and the tracing tool. The process can be time-consuming and doesn’t always give you the accuracy you need. The fields might be called different things, and might use different encodings. Indeed, the key data might not be available in the two systems.)&lt;/p&gt;

&lt;p&gt;In our lens analogy, that’s a red lens. From this lens, the picture looks pretty different — but there’s enough in common that we can tell we’re looking at the same image. There are some parts that stand out and are much more visible; other aspects of detail entirely disappear.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--9X5Vtuaz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/02/monet_red.jpg" class="article-body-image-wrapper"&gt;&lt;img class="wp-image-6004 " src="https://res.cloudinary.com/practicaldev/image/fetch/s--9X5Vtuaz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/02/monet_red.jpg" alt="" width="220" height="264"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But why did the database calls get slow? To continue debugging, you can look through logs, which is the third pillar. Maybe scrolling around in the logs, you might find some warnings issued by the database to show that it was overloaded at the time, or logs showing that the event queue had gotten long. That helps figure out what had happened to the database — but it’s a limited view. If we want to know how often this problem had arisen, we’d need to go back to the metrics to learn the history of the database queue.&lt;/p&gt;

&lt;p&gt;Like before, the process of switching tools, from tracing to logging, requires a new set of searches, a new set of interactions and of course more time.&lt;/p&gt;

&lt;p&gt;We could think of that as a green lens.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--V5AuMaRW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/02/monet_green.jpg" class="article-body-image-wrapper"&gt;&lt;img class=" wp-image-6005" src="https://res.cloudinary.com/practicaldev/image/fetch/s--V5AuMaRW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/02/monet_green.jpg" alt="" width="220" height="264"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When companies sell the “three pillars of observability”, they lump all these visualizations together, but as &lt;em&gt;separate&lt;/em&gt; capabilities:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ta761hxs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/02/monet_blue.jpg" class="article-body-image-wrapper"&gt;&lt;img class="alignnone wp-image-6003" src="https://res.cloudinary.com/practicaldev/image/fetch/s--ta761hxs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/02/monet_blue.jpg" alt="" width="220" height="265"&gt;&lt;/a&gt;    &lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--9X5Vtuaz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/02/monet_red.jpg" class="article-body-image-wrapper"&gt;&lt;img class="alignnone wp-image-6004" src="https://res.cloudinary.com/practicaldev/image/fetch/s--9X5Vtuaz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/02/monet_red.jpg" alt="" width="220" height="264"&gt;&lt;/a&gt;    &lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--V5AuMaRW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/02/monet_green.jpg" class="article-body-image-wrapper"&gt;&lt;img class="alignnone wp-image-6005" src="https://res.cloudinary.com/practicaldev/image/fetch/s--V5AuMaRW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/02/monet_green.jpg" alt="" width="220" height="264"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That’s not a bad start. Some things are completely invisible in one view, but easy to see in others, so placing them side by side can help alleviate those gaps. Each image brings different aspects more clearly into view: the blue image shows the outline of the flowers best; the red shows the detail in the florets; and the green seems to get the shading and depth best.&lt;/p&gt;

&lt;p&gt;But these three separate lenses have limitations. True observability is not just the ability to see each piece at a time; it’s also the ability to understand the whole and to see how the pieces combine to tell you the state of the underlying system.&lt;/p&gt;

&lt;p&gt;The truth is, of course, there aren’t three different systems interacting: there is one underlying system in all its richness. If we separate out these dimensions — if we collect metrics monitoring separately from log and traces — then we lose the fact that this data reflects the single underlying system.&lt;/p&gt;

&lt;p&gt;We need to collect and preserve that richness and dimensionality. We need to move through the data smoothly, precisely, and efficiently. We need to be able to discover where a trace has a phenomenon that may be occurring over and over in other traces, and to &lt;a href="https://docs.honeycomb.io/working-with-your-data/tracing/explore-trace-data/#returning-to-the-query-builder" rel="noopener noreferrer"&gt;find out where and how often&lt;/a&gt;. We need to break down a monitoring chart into its &lt;a href="https://docs.honeycomb.io/working-with-your-data/bubbleup/" rel="noopener noreferrer"&gt;underlying components&lt;/a&gt; to understand which factors really cause a spike.&lt;/p&gt;

&lt;p&gt;One way to implement this is to maintain a single set of telemetry collection and storage that keeps rich enough data that we can view it as metrics monitoring, tracing, or logging — or in some other perspective.&lt;/p&gt;

&lt;p&gt;Honeycomb’s event store acts as a single source of truth for everything that has happened in your system. Monitoring, tracing, and logging are simply different views of the system events being stored — and it’s easy to switch quickly between those views. Tracing isn’t a separate experience of the event store: it’s a different lens that brings certain aspects into sharper focus. Any point on a heat-map or a metric line-chart connects to a trace, and any span on a trace can be turned into a query result.&lt;/p&gt;

&lt;p&gt;This single event store also enables Honeycomb to provide unique features such as BubbleUp. This is the ability to visually show a slice &lt;i&gt;across&lt;/i&gt; the data — in other words, how two sets of events differ from each other, across all their various dimensions (fields). That’s the sort of question that metrics systems simply cannot answer (because they don’t store the individual events), and, let’s face it, one that would be exhausting in a dedicated log system.&lt;/p&gt;

&lt;p&gt;What do you do when you have separate pieces of the complete picture? You need to manually connect the parts, looking for correlates. In our lens analogy, that might be like seeing that an area shows as light colored in both the green and the red lens, so it must be yellow.&lt;/p&gt;

&lt;p&gt;You COULD do that math yourself. Flip back and forth. Stare at where bits contrast.&lt;/p&gt;

&lt;p&gt;Or, you could use a tool where seeing the image isn’t a matter of skill or experience of combining those pieces in your head: it’s all laid out, so you can see it as one complete beautiful picture.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--OWEBxB99--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/02/MON1478-1000x1000-1.jpg" class="article-body-image-wrapper"&gt;&lt;img class="wp-image-6006 alignleft" src="https://res.cloudinary.com/practicaldev/image/fetch/s--OWEBxB99--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/02/MON1478-1000x1000-1.jpg" alt="" width="220" height="264"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;i&gt;Claude Monet, “&lt;a href="https://www.metmuseum.org/art/collection/search/437112" rel="noopener noreferrer"&gt;Bouquet of Sunflowers&lt;/a&gt;,” 1881&lt;/i&gt;&lt;/p&gt;

&lt;p&gt;Join the swarm. Get started with &lt;a href="https://ui.honeycomb.io/signup?&amp;amp;utm_source=Devto&amp;amp;utm_Devto=blog&amp;amp;utm_campaign=referral&amp;amp;utm_keyword=%7Bkeyword%7D&amp;amp;utm_content=they-arent-pillars-theyre-lenses"&gt;Honeycomb for free&lt;/a&gt;!&lt;/p&gt;

</description>
      <category>observability</category>
      <category>logging</category>
      <category>monitoring</category>
      <category>tracing</category>
    </item>
    <item>
      <title>The Future of Software is a Sociotechnical Problem</title>
      <dc:creator>Charity Majors</dc:creator>
      <pubDate>Fri, 18 Dec 2020 21:19:14 +0000</pubDate>
      <link>https://dev.to/honeycombio/the-future-of-software-is-a-sociotechnical-problem-10m0</link>
      <guid>https://dev.to/honeycombio/the-future-of-software-is-a-sociotechnical-problem-10m0</guid>
      <description>&lt;p&gt;&lt;a href="https://www.interaction-design.org/literature/topics/socio-technical-systems" rel="noopener noreferrer"&gt;"Sociotechnical"&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I learned this word from &lt;a href="https://twitter.com/lizthegrey" rel="noopener noreferrer"&gt;Liz Fong-Jones&lt;/a&gt; recently, and it immediately entered my daily lexicon. You know exactly what it means as soon as you hear it, and then you wonder how you ever lived without it.&lt;/p&gt;

&lt;p&gt;Our systems are sociotechnical systems. This is why technical problems are never just technical problems, and why social problems are never just social problems.&lt;/p&gt;

&lt;p&gt;I work at a company, Honeycomb, which develops next-gen &lt;a href="https://thenewstack.io/observability-a-3-year-retrospective/" rel="noopener noreferrer"&gt;observability&lt;/a&gt; tooling. But I don't spend my time trying to figure out how to get more people to use observability tools. Observability alone can't solve anything; it's just a necessary part of the solution.&lt;/p&gt;

&lt;p&gt;What I do spend my day thinking about is the future of building software. How can we convert the creative fuel of people’s labor into healthier teams and more reliable, resilient systems? We are incredibly wasteful of the creative fuel that people pour into the process, and the result is that we have unreliable, opaque hairballs of systems that nobody understands — which are then operated by stressed, burned-out humans who are afraid to touch them.&lt;/p&gt;

&lt;p&gt;What if we had:&lt;/p&gt;

&lt;ul&gt;
    &lt;li&gt;a future where your code goes live a few seconds or minutes after you commit your changes, and this is all very predictable and boring&lt;/li&gt;
    &lt;li&gt;a future where everyone owns their code in production, and you actually look forward to your own turn on call&lt;/li&gt;
    &lt;li&gt;a future where all the energy you pour into writing code and building systems genuinely moves the business forward, and you are rarely frustrated or lost or misled by those systems&lt;/li&gt;
    &lt;li&gt;a future where the debugger of last resort is not the engineer who has been there the longest, but the most curious person&lt;/li&gt;
    &lt;li&gt;a future where shipping software is not scary&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What do you think? Does this sound achievable? Easy? Or are you thinking “never gonna happen for my team in this lifetime?”&lt;/p&gt;

&lt;p&gt;This is all much more attainable than you might think.&lt;/p&gt;

&lt;h3&gt;The future is here; it is just unevenly distributed.&lt;/h3&gt;

&lt;p&gt;I have lived in the future. It's why I started this company — I got a brief glimpse of what I now think of as ODD, or observability-driven development, a world where the best engineers wrote code with half their screen taken up by their editor, half by a tool where they were constantly watching and poking at and playing with that code live in production. The code they wrote was better. The systems they built were understandable, in a way I had never seen before.&lt;/p&gt;

&lt;p&gt;Going back to a world where people write and ship blind was unthinkable. Not an option.&lt;/p&gt;

&lt;p&gt;We hear echoes of this from Honeycomb customers now: “This is incredible. I can never go back.”&lt;/p&gt;

&lt;p&gt;Because the teams who invest in these sociotechnical practices are radically more productive and happy than those who don't. They move so much faster and with more confidence; their systems are more reliable and better understood; they amass dramatically less technical debt and can do far more with radically fewer people. They attract and retain better candidates.&lt;/p&gt;

&lt;p&gt;And as a software company, this is how you win.&lt;/p&gt;

&lt;h3&gt;We are in the Middle Ages of software delivery.&lt;/h3&gt;

&lt;p&gt;The &lt;a href="https://stripe.com/reports/developer-coefficient-2018" rel="noopener noreferrer"&gt;Stripe developer report&lt;/a&gt; finds that engineers spend at least 40% of their time (self-reported) on miscellaneous technical bullshit that keeps you busy, maybe blocks you from working on what you need to work on ... but does not move the business forward. Just sit with that a sec. Forty percent. Optimistically.&lt;/p&gt;

&lt;p&gt;Or maybe you're familiar with the &lt;a href="https://cloud.google.com/blog/products/devops-sre/the-2019-accelerate-state-of-devops-elite-performance-productivity-and-scaling" rel="noopener noreferrer"&gt;DORA report&lt;/a&gt;. The Honeycomb team’s engineering stats are an order of magnitude or two better than those of the DORA Elite teams, which represent the top 20% of all teams. ("But the company is young, easy for you to say!" you may protest. Sure, we are relatively young ... a little over four years old. We are also a fast-growing platform with unpredictable, spiky traffic composed of user-generated streams of content that we have no control over.)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F2n75lbmxpposh7s6qh7a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F2n75lbmxpposh7s6qh7a.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I wish I could tell you “just buy Honeycomb and voila! Get high-performing teams!”&lt;/p&gt;

&lt;p&gt;That is not what I'm saying. &lt;b&gt;It is not that easy.&lt;/b&gt;&lt;/p&gt;

&lt;p&gt;It's a sociotechnical hole, and only a combination of technical fixes and social change will get us out of it.&lt;/p&gt;

&lt;h3&gt;A sociotechnical recipe for high-performing teams&lt;/h3&gt;

&lt;p&gt;But lots of smart, creative teams are out there working hard on this and sharing their findings. As a result, we know a LOT more about what contributes to a solution than we knew even just a year or two ago. You will be forgiven for skimming this very long list:&lt;/p&gt;

&lt;ul&gt;
    &lt;li&gt;Blameless retrospectives&lt;/li&gt;
    &lt;li&gt;Automatic deployments triggered on each commit, single commit per deploy&lt;/li&gt;
    &lt;li&gt;Removing human gates in the deploy pipeline&lt;/li&gt;
    &lt;li&gt;Good test coverage, instrumented test harness&lt;/li&gt;
    &lt;li&gt;Shared conventions around instrumentation&lt;/li&gt;
    &lt;li&gt;Training, education, collaboration&lt;/li&gt;
    &lt;li&gt;Code reviews and mentoring&lt;/li&gt;
    &lt;li&gt;Promoting people for their value as team members and force multipliers, not just raw coding ability&lt;/li&gt;
    &lt;li&gt;Interview processes that value strengths, not lack of weaknesses&lt;/li&gt;
    &lt;li&gt;Shared value systems and organizational transparency&lt;/li&gt;
    &lt;li&gt;Welcoming of diverse viewpoints and fresh eyes&lt;/li&gt;
    &lt;li&gt;Teams that value juniors and know how to train them up&lt;/li&gt;
    &lt;li&gt;Tooling that rewards curiosity&lt;/li&gt;
    &lt;li&gt;Job ladders that value communication and independent initiative&lt;/li&gt;
    &lt;li&gt;Encouraging software engineers to own their code from end to end&lt;/li&gt;
    &lt;li&gt;Encouraging SRE types to work more like product teams&lt;/li&gt;
    &lt;li&gt;Adopting SLOs, SLIs, and aligning on call pain strictly with user pain&lt;/li&gt;
    &lt;li&gt;Making sure everyone gets enough sleep and time off&lt;/li&gt;
    &lt;li&gt;Observability tooling (in the technical sense, &lt;a href="https://www.honeycomb.io/blog/so-you-want-to-build-an-observability-tool/" rel="noopener noreferrer"&gt;as I define it here&lt;/a&gt;; not in the old fashioned sense of "metrics, logs and traces")&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://thenewstack.io/observability-a-3-year-retrospective/" rel="noopener noreferrer"&gt;Observability&lt;/a&gt; is only a one piece of the solution ... but it is a necessary piece that should be actively frontloaded if your efforts are to have maximum impact.&lt;/p&gt;

&lt;h3&gt;Observability is about the unknown-unknowns&lt;/h3&gt;

&lt;p&gt;Rolling out o11y tooling is like turning on the light and putting on your glasses before you start swinging at the pinata.&lt;/p&gt;

&lt;p&gt;To get at the candy inside — the real actionable user and technical insights — you need to be able to interactively slice and dice in real time, break down by high cardinality dimensions, and ask those new questions, the ones that you couldn’t have predicted you would need to ask. This is the minimum viable technical functionality you need in order to explore exactly what is happening in production, what happened when you deployed a particular piece of code, what happened when that user reported that bug. That’s why previous iterations of monitoring were not enough.&lt;/p&gt;

&lt;p&gt;Observability in the modern technical definition is about answering the unknown-unknowns, and it is necessary. With observability, all things become easier. It is a force amplifier for all your other efforts.&lt;/p&gt;

&lt;p&gt;If you don't have observability -- if you only have metrics, logs, and/or traces -- all you can ask will be those questions that you predicted and defined in advance. You are swinging at the pinata in the dark, or where you think it was yesterday or last week. It's not completely impossible, but it's a damn sight harder, and a lot comes down to luck.&lt;/p&gt;

&lt;h3&gt;
&lt;a href="https://thenewstack.io/observability-a-3-year-retrospective/" rel="noopener noreferrer"&gt;Observability is a necessary ingredient&lt;/a&gt;. But everything matters.&lt;/h3&gt;

&lt;p&gt;People often kvetch at me "yeah, but anything's easy when you have the best engineers." They have this exactly backwards. &lt;b&gt;Observability-driven development is what makes great engineers&lt;/b&gt;. Observability is what enables you to peek under the hood of the abstractions, it grounds you in reality, forces you to think through the code all the way to how the user will use it. It tethers you to your users and lets you see the world through their eyes.&lt;/p&gt;

&lt;h3&gt;TDD → ODD&lt;/h3&gt;

&lt;p&gt;Learning to check your assumptions vs reality was the argument for TDD (test-driven development). That makes you write better code, indisputably. But tests stop at the edge of your laptop! Tests imperfectly mock a predictable subset of reality. Testing in production means replacing the artificial test sandbox with reality.&lt;/p&gt;

&lt;p&gt;If you believe TDD makes you a better developer, you should be hungry for the developer you will become using ODD.&lt;/p&gt;

&lt;p&gt;I am cautiously optimistic that the industry will embrace observability in far less time than it took to adopt TDD and metrics. Mostly because it is much, much easier to do things this way. It’s actually much harder to do things the bad old ways, what with all the hacks and workarounds.&lt;/p&gt;

&lt;p&gt;And every little bit helps. Every one of these changes will, if you embrace them, make your people happier and more productive.&lt;/p&gt;

&lt;h3&gt;Observability-driven development is what creates great software engineers.&lt;/h3&gt;

&lt;p&gt;The greatest obstacle between us and a better tomorrow is this pervasive lack of hope. (The second greatest is our perverse pride in our Rube Goldberg hacks &amp;amp; the sunk-cost fallacy.)&lt;/p&gt;

&lt;p&gt;Most people still have not experienced what it's like to build software in a radically better way. Even worse, most people don't see themselves in the better world I describe. They don't think this world is meant for them.&lt;/p&gt;

&lt;p&gt;I don’t know how to fix this yet. But if we only succeed in making life better for the elites, we will have failed.&lt;/p&gt;

&lt;p&gt;Observability is for everyone, and it is easier if you do it first. Observability makes every technical effort that comes after it sooooo much easier to achieve. Observability is what creates great engineers, not vice versa. Start at the edge, instrument some code, and work in. Rinse and repeat. You got this.&lt;/p&gt;

&lt;p&gt;Experience what Honeycomb can do for your business. Check out &lt;a href="https://www.honeycomb.io/get-a-demo?&amp;amp;utm_source=Devto&amp;amp;utm_Devto=blog&amp;amp;utm_campaign=referral&amp;amp;utm_keyword=%7Bkeyword%7D&amp;amp;utm_content=the-future-of-software-is-a-sociotechnical-problem" rel="noopener noreferrer"&gt;our short and sweet demo&lt;/a&gt;!&lt;/p&gt;

</description>
      <category>observability</category>
      <category>debugging</category>
      <category>instrumentation</category>
      <category>devops</category>
    </item>
    <item>
      <title>Making Instrumentation Extensible</title>
      <dc:creator>Liz Fong-Jones</dc:creator>
      <pubDate>Wed, 09 Dec 2020 20:06:08 +0000</pubDate>
      <link>https://dev.to/honeycombio/making-instrumentation-extensible-4ejf</link>
      <guid>https://dev.to/honeycombio/making-instrumentation-extensible-4ejf</guid>
      <description>&lt;p&gt;&lt;span&gt;Observability-driven development requires both rich query capabilities and sufficient instrumentation in order to capture the nuances of developers' intention and useful dimensions of cardinality. When our systems are running in containers, we need an equivalent to our local debugging tools that is as easy to use as Printf and as powerful as gdb. We should empower developers to write instrumentation by ensuring that it's easy to add context to our data, and requires little maintenance work to add or replace telemetry providers after the fact. Instead of thinking about individual counters or log lines in isolation, we need to consider how the telemetry we might want to transmit fits into a wider whole.&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--bEglRhaj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2019/06/bee-extension-350x250.jpg" class="article-body-image-wrapper"&gt;&lt;img class="aligncenter wp-image-4809 size-medium" src="https://res.cloudinary.com/practicaldev/image/fetch/s--bEglRhaj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2019/06/bee-extension-350x250.jpg" alt="photo of a chain of bees connecting two parts of a hive" width="300" height="214"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;span&gt;Aggregated counters, gauges, and histograms can provide us with information broken down by host or endpoint, but not necessarily the higher-cardinality fields that we need for a rich understanding of our distributed systems. Automatic instrumentation of a server framework, such as the support for Node.js Express or Go's http.Server in &lt;a href="https://www.honeycomb.io/auto-instrumentations/"&gt;Honeycomb's Beelines&lt;/a&gt;, can only provide a modest amount of context. It will capture request header fields such as URL and response durations/error codes, but not anything from the business logic or involving the logged-in user's metadata. Because observability requires the ability to understand the impact of user behavior upon our applications, we cannot just stop at collecting surface-level data. Thus, we'll need to make changes to our code to instrument it.&lt;/span&gt;&lt;/p&gt;

&lt;h2&gt;Instrumentation should be reusable&lt;/h2&gt;

&lt;p&gt;&lt;span&gt;Typically, instrumenting code involves adding a vendor's library or a standard package like OpenCensus or slf4j to one's dependencies, then calling the library directly from instrumented code. If multiple providers and kinds of telemetry (e.g. logs, metrics, traces, events…) are in use, calls to each wind up sprinkled across the codebase. But should we have to re-instrument our entire codebase every time we gain access to new methods of data aggregation/visualization or change observability providers? Of course not. This gives rise to the need to separate observability plumbing from your business logic, or domain-specific code.&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;span&gt;To address this problem of abstracting instrumentation, Tyler Treat envisions a solution involving centralized inter-process collection in "&lt;/span&gt;&lt;a href="https://bravenewgeek.com/the-observability-pipeline/"&gt;&lt;span&gt;The Observability Pipeline&lt;/span&gt;&lt;/a&gt;&lt;span&gt;", and Pete Hodgson suggests abstracting collection in-process in "&lt;/span&gt;&lt;a href="https://martinfowler.com/articles/domain-oriented-observability.html"&gt;&lt;span&gt;Domain Oriented Observability&lt;/span&gt;&lt;/a&gt;&lt;span&gt;". Tyler's article explains creating structured events and then streaming them to an out-of-process service that can aggregate them and send them onwards to a variety of instrumentation sinks. Pete's article suggests creating a separate class to handle the vendor-specific pieces of instrumentation, but still relies upon tight coupling between the instrumentation code and the domain-specific code: for example, creating a method in the instrumentation code to handle each potential property we might want to record about an event (&lt;code&gt;discountCodeApplied()&lt;/code&gt;, &lt;code&gt;discountLookup{Failed,Succeeded}()&lt;/code&gt;).&lt;/span&gt;&lt;/p&gt;

&lt;h2&gt;Why not both?&lt;/h2&gt;

&lt;p&gt;&lt;span&gt;However, there's a simpler, within-process approach that is easier for developers to understand, test, configure, maintain, and operate. It's a fusion of the approach Pete describes in "Event-Based Observability" and "Collecting Instrumentation Context" with Tyler's distributed event-buffering solution. With this improved solution, we neither need an advanced understanding of mocking functions and classes, nor do we need to operate a Kafka pipeline from day 0. Instead, we just generate and consume &lt;a href="https://www.honeycomb.io/blog/how-are-structured-logs-different-from-events/"&gt;structured events&lt;/a&gt;.&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--0QahCs55--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2019/06/bee-chain-350x250.jpg" class="article-body-image-wrapper"&gt;&lt;img class="aligncenter wp-image-4811 size-medium" src="https://res.cloudinary.com/practicaldev/image/fetch/s--0QahCs55--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2019/06/bee-chain-350x250.jpg" alt="photo of a vertical chain of bees constructing a hive" width="300" height="214"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;span&gt;Within each span of work done in the domain-specific business logic, we populate a weakly-typed context dictionary with key/value pairs added from within instrumented code, as well as the default standard contextual fields (requestId, startTime, endTime, etc.). Child units of work become separate contexts and spans, with the appropriate fields (e.g. parentId, requestId) templated (or "partially applied" in Pete's words) from the parent context/span. Adding telemetry becomes as easy as Printf for developers: it's just setting &lt;code&gt;ctx[key] = val&lt;/code&gt; for the keys and values relevant to your code. We no longer need to create one function call to the instrumentation adapter for each telemetry action. Using Pete's example, we might set &lt;code&gt;discountCode =&amp;gt; FREESHIPPING&lt;/code&gt;, &lt;code&gt;responseCode =&amp;gt; 403&lt;/code&gt;, or &lt;code&gt;discountLookupSucceeded =&amp;gt; {true,false,nil}&lt;/code&gt; within &lt;/span&gt;&lt;i&gt;&lt;span&gt;one&lt;/span&gt;&lt;/i&gt;&lt;span&gt; event instead of making the multiple function calls above, or emitting multiple distinct "Announcement" objects for only one work unit. Writing tests to validate that the generated context map is correct becomes straightforward in table-based testsuites (e.g. &lt;code&gt;go functest&lt;/code&gt;), rather than requiring mocking functions and classes.&lt;/span&gt;&lt;/p&gt;
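A minimal sketch of this pattern might look like the following (illustrative Python, not Beeline code; the function and field names such as requestId and parentId simply follow the convention above):

```python
import time
import uuid

def new_span(name, parent=None):
    """A span is just a dict: standard contextual fields plus whatever
    the instrumented code adds. Child spans inherit linking fields."""
    ctx = {
        "name": name,
        "spanId": str(uuid.uuid4()),
        "startTime": time.time(),
    }
    if parent:
        ctx["parentId"] = parent["spanId"]      # templated from the parent
        ctx["requestId"] = parent["requestId"]  # shared across the request
    else:
        ctx["requestId"] = str(uuid.uuid4())
    return ctx

root = new_span("checkout")
child = new_span("discount-lookup", parent=root)
# adding telemetry is just assignment, as easy as Printf:
child["discountCode"] = "FREESHIPPING"
child["discountLookupSucceeded"] = False
child["endTime"] = time.time()
```

Validating the resulting context map in a table-based test is then a plain dict comparison, with no mocks in sight.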

&lt;p&gt;&lt;span&gt;Once the work unit finishes, its context dictionary is sent in-process to the instrumentation adapter, where any number of listeners can interpret it. Each listener sees the context map for each received event, decides whether the event is relevant to it, and if so, translates it according to its own rules into metrics, traces/events/structured logs, or human-readable logs. We no longer need to duplicate calls to the same instrumentation provider from each kind of telemetry function call; instead we can create a single listener for each common metric (e.g. response time, response code) that acts on a wider range of events. We can then measure the correctness of listeners, ensuring that each processor is interested in only the correct set of structured events and dispatches them to the upstream structured event, log, metric, or trace provider(s)' APIs appropriately.&lt;/span&gt;&lt;/p&gt;
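Here is one way that in-process fan-out could be sketched (again purely illustrative, not Honeycomb library code; the listener predicates and metric stores are made up):

```python
class InstrumentationAdapter:
    """In-process fan-out: each finished span's context dict is offered
    to every registered listener, which decides whether it cares."""
    def __init__(self):
        self.listeners = []

    def register(self, predicate, handler):
        self.listeners.append((predicate, handler))

    def dispatch(self, ctx):
        for predicate, handler in self.listeners:
            if predicate(ctx):
                handler(ctx)

response_times = []   # stand-in for a latency metric sink
error_counts = {}     # stand-in for an error-rate metric sink

adapter = InstrumentationAdapter()
# one listener per common metric, acting on any relevant event:
adapter.register(lambda c: "durationMs" in c,
                 lambda c: response_times.append(c["durationMs"]))
adapter.register(lambda c: c.get("responseCode", 0) >= 500,
                 lambda c: error_counts.update(
                     {c["endpoint"]: error_counts.get(c["endpoint"], 0) + 1}))

adapter.dispatch({"endpoint": "/export", "durationMs": 412, "responseCode": 503})
adapter.dispatch({"endpoint": "/home", "durationMs": 23, "responseCode": 200})
```

Each listener's predicate and translation can be unit-tested in isolation, which is the testability win described above.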

&lt;h2&gt;Correspondence is more useful when it's about the outcome&lt;/h2&gt;

&lt;p&gt;&lt;span&gt;Unlike Tyler's streaming design, there need not be a 1:1 correspondence between listeners/routers and instrumentation sinks. Instead, the correspondence is between the action we'd like to coalesce or report on and the related calls we make: performing more than one metric counter increment to the same sink, or even scattering increments across many different sinks if we're transitioning between providers. This makes the code much more testable, as it's focused on the intent of "record these values from this specific kind of event, to whatever sinks are relevant," rather than a kitchen-sink catch-all of "duplicate everything we do in Sink A in Sink B." And the value of event stores such as Honeycomb quickly becomes clearer, because you don't &lt;/span&gt;&lt;i&gt;&lt;span&gt;have&lt;/span&gt;&lt;/i&gt;&lt;span&gt; to do anything different to aggregate or process each such structured event; you only pass it on to us directly. Let us worry about how to efficiently query the data when you ask a question, such as &lt;code&gt;P99(duration_ms)&lt;/code&gt; or &lt;code&gt;COUNT WHERE err exists GROUP BY customer_id ORDER BY COUNT desc&lt;/code&gt;.&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;span&gt;Decoupling event creation from event consumption, even within the same process, is a great step between instrumentation spaghetti and needing a Kafka, Kinesis or PubSub queue. Never create a distributed system unless you need to, and run as few distributed systems as possible. Same-process structured event creation and structured event consumption is super easy to work with, test, and reason about, to boot! As you grow and your needs scale, you may wind up reaching for that Kafka queue. But you'll have an easier migration path, if so.&lt;/span&gt;&lt;/p&gt;

&lt;h2&gt;Ideas for future-proofing&lt;/h2&gt;

&lt;p&gt;&lt;span&gt;How does this relate to OpenTelemetry née Open{Census,Tracing}? Despite the creation of the new consensus standard, the ongoing transition to OpenTelemetry is proof that we ought to future-proof our work by ensuring we can switch to and from instrumentation providers, including those that do not support the newest standard, without further breaking domain code. Instead of using the OpenTelemetry API directly within your domain-specific code, it may still be wise to use one context/span propagation library of your choice (which could still be OTel's), and write an InstrumentationAdapter that passes the data it receives through to OpenTelemetry's metrics &amp;amp; trace consumers, as well as to legacy and future instrumentation providers.&lt;/span&gt;&lt;/p&gt;
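Such an adapter could be as thin as this hypothetical sketch (the sink interface and class names are invented; real sinks would wrap the OpenTelemetry SDK, a legacy metrics client, and so on):

```python
class Sink:
    """Minimal provider interface; concrete implementations would wrap
    OpenTelemetry, a legacy metrics client, or a future provider."""
    def emit(self, ctx):
        raise NotImplementedError

class ListSink(Sink):
    """Stand-in for any concrete provider: just records what it receives."""
    def __init__(self):
        self.events = []

    def emit(self, ctx):
        self.events.append(ctx)

class TelemetryFanout:
    """Domain code talks only to this object; providers are wiring."""
    def __init__(self, *sinks):
        self.sinks = list(sinks)

    def emit(self, ctx):
        for sink in self.sinks:
            sink.emit(ctx)

otel_like, legacy = ListSink(), ListSink()
telemetry = TelemetryFanout(otel_like, legacy)
telemetry.emit({"requestId": "r1", "durationMs": 17})
# swapping providers means changing this wiring, never the domain code
```

The domain code's only dependency is the `emit` call, so migrating between providers (or running two in parallel during a transition) is a one-line wiring change.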

&lt;p&gt;&lt;span&gt;I hope that this article was helpful! If you're looking for more detailed examples of how Honeycomb Beelines work, check out our &lt;a href="https://github.com/honeycombio/examples"&gt;examples repo on GitHub&lt;/a&gt;, such as &lt;a href="https://github.com/honeycombio/examples/blob/master/golang-gatekeeper"&gt;this example of using our Beeline for Go alongside custom instrumentation&lt;/a&gt;. &lt;/span&gt;&lt;/p&gt;




&lt;p&gt;Looking to find out more? Get started with &lt;a href="https://ui.honeycomb.io/signup?&amp;amp;utm_source=Devto&amp;amp;utm_Devto=blog&amp;amp;utm_campaign=referral&amp;amp;utm_keyword=%7Bkeyword%7D&amp;amp;utm_content=making-instrumentation-extensible"&gt;Honeycomb for free&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>instrumentation</category>
      <category>observability</category>
      <category>sre</category>
      <category>devops</category>
    </item>
    <item>
      <title>Challenges with Implementing SLOs</title>
      <dc:creator>Danyel Fisher</dc:creator>
      <pubDate>Mon, 07 Dec 2020 20:44:21 +0000</pubDate>
      <link>https://dev.to/honeycombio/challenges-with-implementing-slos-1kp</link>
      <guid>https://dev.to/honeycombio/challenges-with-implementing-slos-1kp</guid>
      <description>&lt;p&gt;A few months ago, Honeycomb released our SLO — Service Level Objective — feature to the world. We’ve written before about &lt;a href="https://www.honeycomb.io/slo/" rel="noopener noreferrer"&gt;how to use it&lt;/a&gt; and some of the use scenarios. Today, I’d like to say a little more about how the feature has evolved, and what we did in the process of creating it. (Some of these notes are based on my talk, “Pitfalls in Measuring SLOs;” you can find the slides to that talk &lt;a href="https://www.honeycomb.io/talks/" rel="noopener noreferrer"&gt;here&lt;/a&gt;, or view the video on our &lt;a href="https://www.honeycomb.io/talks/" rel="noopener noreferrer"&gt;Honeycomb Talks page)&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Honeycomb approaches SLOs a little differently than some of the market does, and it’s interesting to step back and see how we made our decisions.&lt;/p&gt;

&lt;p&gt;If you aren’t familiar with our SLO feature, I’d encourage you to check out the &lt;a href="https://www.honeycomb.io/production-slos/" rel="noopener noreferrer"&gt;SLO webcast&lt;/a&gt; and our &lt;a href="https://www.honeycomb.io/slo/" rel="noopener noreferrer"&gt;other documentation&lt;/a&gt;. The shortest summary, though, is that an SLO is a way of expressing &lt;i&gt;how reliable&lt;/i&gt; a service is. An SLO comes in two parts: a metric (or indicator) that can measure the &lt;i&gt;quality&lt;/i&gt; of a service, and an expectation of &lt;i&gt;how often&lt;/i&gt; the service meets that metric.&lt;/p&gt;

&lt;p&gt;When &lt;a href="https://twitter.com/lizthegrey" rel="noopener noreferrer"&gt;Liz Fong-Jones&lt;/a&gt; joined Honeycomb, she came carrying the banner of SLOs. She’d had a lot of experience with them as an SRE at Google, and wanted us to support SLOs, too. Honeycomb had an interesting secret weapon, though: the fact that Honeycomb stores rich, wide events means that we can do things with SLOs that otherwise aren’t possible.&lt;/p&gt;

&lt;p&gt;The core concept of SLOs is outlined in the &lt;a href="https://landing.google.com/sre/books/" rel="noopener noreferrer"&gt;Google SRE book and workbook&lt;/a&gt;, and in Alex Hidalgo’s upcoming &lt;a href="http://shop.oreilly.com/product/0636920337867.do" rel="noopener noreferrer"&gt;Implementing Service Level Objectives&lt;/a&gt; book. In the process of implementing SLOs, though, we found that there were a number of issues that aren’t well-articulated in the Google texts; I’d like to spend a little time analyzing what we learned.&lt;/p&gt;

&lt;p&gt;I’ve put this post together because it might be fun to take a look behind the scenes — at what it takes to roll out this feature; at some of the dead ends and mistakes we made; and how we managed to spend $10,000 on AWS in one particularly embarrassing day.&lt;/p&gt;

&lt;p&gt;As background, I’m trained as a human-computer interaction researcher. That means I design against user needs, and build based on challenges that users are encountering. My toolbox includes a lot of prototyping, interviewing, and collecting early internal feedback. Fortunately, Honeycomb has a large and supportive user community — the “Pollinators” — who love to help each other, and give vocal and frequent feedback.&lt;/p&gt;

&lt;h3&gt;Expressing SLOs&lt;/h3&gt;

&lt;p&gt;In Honeycomb, you can express static &lt;a href="https://docs.honeycomb.io/working-with-your-data/triggers/" rel="noopener noreferrer"&gt;triggers&lt;/a&gt; pretty easily: simply set an aggregate operation (like COUNT or AVERAGE) and a filter. The whole experience re-uses our familiar query builder.&lt;/p&gt;

&lt;p&gt;We tried to go down the same path with SLOs. Unfortunately, it required an extra filter and — when we rolled it out with paper prototypes — more settings and screens than we really wanted.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--X6mYhQZV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/mly8h2l60nd32xe8getp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--X6mYhQZV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/mly8h2l60nd32xe8getp.png" alt="for post"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We decided to reduce our goals, at least in the short term. Users would create SLOs far less often than they would view and monitor them; our effort should be spent on the monitoring experience. Many of the users starting out with SLOs would be more advanced users, and because SLOs are aimed at enterprise users, we could ensure our customer success team was ready to help them create SLOs.&lt;/p&gt;

&lt;p&gt;In the end, we realized, an SLO was just three things: an &lt;a href="https://landing.google.com/sre/sre-book/chapters/service-level-objectives/" rel="noopener noreferrer"&gt;SLI (“Service Level Indicator”)&lt;/a&gt;, a time period, and a percentage. "Was the SLI fulfilled 99% of the time over the last 28 days?" An SLI, in turn, is a function that returns TRUE, FALSE, or N/A for every event. This turns out to be very easy to express in Honeycomb &lt;a href="https://docs.honeycomb.io/working-with-your-data/customizing-your-query/derived-columns/" rel="noopener noreferrer"&gt;Derived Columns&lt;/a&gt;. Indeed, it meant we could even create an &lt;a href="https://docs.honeycomb.io/working-with-your-data/slos/cookbook/" rel="noopener noreferrer"&gt;SLI Cookbook&lt;/a&gt; that helped express some common patterns.&lt;/p&gt;
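As an illustration of that shape (plain Python standing in for a Derived Column; the service and field names are invented), an SLI is just a predicate over events, and SLO compliance is the fraction of eligible events where it holds:

```python
def sli_fast_api_success(event):
    """SLI: API requests must succeed (non-5xx) in under 500 ms.
    Returns True/False for eligible events, None (N/A) otherwise."""
    if event.get("service") != "api":
        return None  # N/A: this event doesn't count toward the SLO
    return event["status"] in range(200, 500) and event["durationMs"] in range(500)

def slo_compliance(events, sli):
    """Fraction of eligible events for which the SLI held."""
    results = [sli(e) for e in events]
    eligible = [r for r in results if r is not None]
    return sum(eligible) / len(eligible)

events = [
    {"service": "api", "status": 200, "durationMs": 80},
    {"service": "api", "status": 503, "durationMs": 30},
    {"service": "cron", "status": 200, "durationMs": 9000},  # not eligible
    {"service": "api", "status": 200, "durationMs": 120},
]
compliance = slo_compliance(events, sli_fast_api_success)  # 2 of 3 eligible passed
```

The full SLO then reads naturally: "did `compliance` stay at or above 99% over the trailing 28 days?"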

&lt;p&gt;We might revisit this decision at some point in the future — it would be nice to make the experience friendlier as we’re beginning to learn more about how users want to use them. But it was also useful to realize that we could allow that piece of the experience to be less well-designed.&lt;/p&gt;

&lt;h3&gt;Tracking SLOs&lt;/h3&gt;

&lt;p&gt;Our goal in putting together the main SLO display was to let users see where the burndown was happening, explain why it was happening, and remediate any problems they detected.&lt;/p&gt;

&lt;p&gt;This screenshot gives a sense of the SLO screen. At the top-left, the “remaining budget” shows how the current SLO has been burned down over 30 days. Note that the current budget has 46.7% remaining. We can also see that the budget has been burning down slowly and steadily.&lt;/p&gt;

&lt;p&gt;The top-right view shows our overall compliance: for each day of the last 30, what did the previous 30 look like? We’re gradually getting better.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--bWMKKWKy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/03/image-12.png" class="article-body-image-wrapper"&gt;&lt;img class=" wp-image-6166 aligncenter" src="https://res.cloudinary.com/practicaldev/image/fetch/s--bWMKKWKy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/03/image-12.png" alt="" width="720" height="244"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Contrast this view: we’ve burned almost three times our budget (-176% remaining means we burned the first 100%, and then &lt;i&gt;another&lt;/i&gt; 176% on top of it).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--vhhRJBcn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/03/image-11.png" class="article-body-image-wrapper"&gt;&lt;img class=" wp-image-6165 aligncenter" src="https://res.cloudinary.com/practicaldev/image/fetch/s--vhhRJBcn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/03/image-11.png" alt="" width="721" height="300"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A lot of that was due to a crash a few weeks ago — but, to be honest, our steady state of burn was probably higher than we wanted it to be. Indeed, we’ve &lt;i&gt;never&lt;/i&gt; met our goal of 99.5% at &lt;i&gt;any time&lt;/i&gt; in the last 30 days.&lt;/p&gt;

&lt;h3&gt;Explaining the Burn&lt;/h3&gt;

&lt;p&gt;The top parts of the screen are familiar and appear in many tools. The bottom part of the page is, to me, more interesting, as it takes advantage of what makes Honeycomb unique. Honeycomb is all about &lt;b&gt;high-cardinality, high-dimensional&lt;/b&gt; data. We love it when users send us rich, complex events: it lets us give them tools for rich comparisons between events.&lt;/p&gt;

&lt;p&gt;The chart on this page shows a heatmap. Like other Honeycomb heatmaps, it shows the number of events at various durations (y-axis) over time (x-axis). This time, though, it marks events that failed the SLI in yellow. This image shows that the failed events are largely ones that are a little slower than we might expect, though a few are being processed quickly and still failing.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--pC-h8zIK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/03/image-5.png" class="article-body-image-wrapper"&gt;&lt;img class="size-full wp-image-6158 aligncenter" src="https://res.cloudinary.com/practicaldev/image/fetch/s--pC-h8zIK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/03/image-5.png" alt="" width="616" height="385"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Contrast this image, which shows that most of the failed events happened in a single burst of time. (The time axis on the bottom chart looks only at one day, and is currently set on one of the big crash days).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--QJXFM98N--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/03/image-6.png" class="article-body-image-wrapper"&gt;&lt;img class="size-full wp-image-6159 aligncenter" src="https://res.cloudinary.com/practicaldev/image/fetch/s--QJXFM98N--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/03/image-6.png" alt="" width="613" height="397"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Last, we can use Honeycomb’s &lt;a href="https://docs.honeycomb.io/working-with-your-data/bubbleup/" rel="noopener noreferrer"&gt;BubbleUp&lt;/a&gt; capability to contrast events that succeeded with those that failed — across every dimension in the dataset! For example, in this chart, we see (in the top left) that failing events had status codes of 400 and 500, while succeeding events had a status code of 200.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--lZjgOEIM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/03/image-7.png" class="article-body-image-wrapper"&gt;&lt;img class="size-full wp-image-6160 aligncenter" src="https://res.cloudinary.com/practicaldev/image/fetch/s--lZjgOEIM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/03/image-7.png" alt="" width="627" height="383"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can also see — zooming into the field &lt;code&gt;app.user.email&lt;/code&gt; in the second row — that this burn is actually due to a single user encountering a small number of errors.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Wwh24ZAG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/03/image-8.png" class="article-body-image-wrapper"&gt;&lt;img class="alignnone wp-image-6161" src="https://res.cloudinary.com/practicaldev/image/fetch/s--Wwh24ZAG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/03/image-8.png" alt="" width="292" height="199"&gt;&lt;/a&gt; &lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--AZpM5dXP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/03/image-9.png" class="article-body-image-wrapper"&gt;&lt;img class="alignnone wp-image-6162" src="https://res.cloudinary.com/practicaldev/image/fetch/s--AZpM5dXP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/03/image-9.png" alt="" width="324" height="191"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This sort of rich explanation also tells us how to respond to the incident. Seeing this error, we know several things: we can reach out to the affected customer to find out what the impact on them was; meanwhile, we can send the issue to the UX team to figure out what sequence of actions led to this error.&lt;/p&gt;

&lt;h2&gt;User Responses to SLOs&lt;/h2&gt;

&lt;p&gt;The early betas of SLOs drew useful and enthusiastic responses. One user who had set up their SLOs, for example, wrote: “The Bubble Up in the SLO page is really powerful at highlighting what is contributing the most to missing our SLIs, it has definitely confirmed our assumptions.”&lt;/p&gt;

&lt;p&gt;Another found SLOs were good for showing that their engineering effort was going in the right direction: “The historical SLO chart also confirms a fix for a performance issue we did that greatly contributed to the SLO compliance by showing a nice upward trend line. :)”&lt;/p&gt;

&lt;p&gt;Unfortunately, after that first burst of ebullience, enthusiasm began to wane a little. A third customer finally gave us the necessary insight: “I’d love to drive alerts off our SLOs. Right now &lt;b&gt;we don’t have anything to draw us in&lt;/b&gt; and have some alerts on the average error rate .... It would be great to get a better sense of when the budget is going down and define alerts that way.”&lt;/p&gt;

&lt;h2&gt;Designing an Alert System&lt;/h2&gt;

&lt;p&gt;We had hoped to put off alerts until after SLOs were finished. But it had become clear to us — from our experience internally as well as our user feedback — that alerts were a fundamental part of the SLO experience. Fortunately, it wasn’t hard to &lt;a href="https://docs.honeycomb.io/working-with-your-data/slos/#define-burn-alerts" rel="noopener noreferrer"&gt;design an alerting system&lt;/a&gt; that would warn you when your SLO was on track to fail in several hours, or when your budget had burned out. We could extrapolate from the last few hours what the next few would look like; after some experimentation, we settled on a 1:4 ratio of baseline to prediction: a one-hour baseline predicts four hours ahead, and a six-hour baseline predicts the next 24 hours.&lt;/p&gt;
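A minimal sketch of that extrapolation (illustrative only; the real system is more involved): the budget burned over the baseline window is projected forward at four times the baseline length, and an alert fires if the projection exhausts the budget.

```python
def projected_budget(remaining, burned_in_baseline, ratio=4):
    """Project the error budget forward using the 1:4 baseline-to-
    prediction ratio: a one-hour baseline predicts four hours out,
    a six-hour baseline predicts the next 24 hours."""
    return remaining - burned_in_baseline * ratio

def should_alert(remaining, burned_in_baseline, ratio=4):
    # Alert when the linear extrapolation exhausts the budget.
    return projected_budget(remaining, burned_in_baseline, ratio) <= 0
```

So a system with 10% of its budget left, burning 3% per baseline window, triggers the alert; one burning 2% does not (yet).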

&lt;p&gt;We built our first alert system, wired it up to check status every minute ... and promptly racked up a $10,000 day of AWS spend on data retrieval. (Before this, our most expensive query had barely cleared $0.25; this was a new and surprising cost.)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--piQ1APsY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/03/image-10.png" class="article-body-image-wrapper"&gt;&lt;img class="size-full wp-image-6163 aligncenter" src="https://res.cloudinary.com/practicaldev/image/fetch/s--piQ1APsY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/03/image-10.png" alt="" width="702" height="365"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;“Premature optimization is the root of all evil,” as the saying goes (popularized by Donald Knuth and often attributed to Tony Hoare); it turns out, however, that some optimization can come too late. In our case, running a one-minute-resolution query across sixty days of data, every minute, was asking a lot of both our query system and our storage.&lt;/p&gt;

&lt;p&gt;We paused alerts and rapidly implemented a caching system.&lt;/p&gt;

&lt;p&gt;Caching is always an interesting challenge, but perhaps the most dramatic issue we ran into is that Honeycomb is designed as a best-effort querying system that is OK with occasional incorrect answers. (It’s even one of our company values: &lt;a href="https://www.honeycomb.io/blog/honeycomb-values-2018/" rel="noopener noreferrer"&gt;“Fast and close to right is better than perfect!”&lt;/a&gt;.) Unfortunately, when you cache a close-to-right value, you keep an incorrect value in your cache for an extended period, and those occasional incorrect values had an outsize effect on SLO quality. Some investigation showed that our database could identify queries that were approximations, and that a few retries would usually produce a correct value; we ended up simply not caching approximations.&lt;/p&gt;
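The fix can be sketched as a cache wrapper that retries approximate answers and only ever stores exact ones. (A simplification of the idea, not Honeycomb's implementation; the `approximate` flag on query results is a hypothetical stand-in for the database's signal.)

```python
def make_cached_query(run_query, max_retries=3):
    """Wrap a query function whose results carry an 'approximate' flag.
    Approximate results are retried and, if still approximate, returned
    without being cached; only exact answers enter the cache."""
    cache = {}

    def cached(query_key):
        if query_key in cache:
            return cache[query_key]
        result = run_query(query_key)
        for _ in range(max_retries):
            if not result["approximate"]:
                break
            result = run_query(query_key)  # retry: usually exact next time
        if not result["approximate"]:
            cache[query_key] = result  # never cache an approximation
        return result

    return cached
```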

&lt;p&gt;(We had other challenges with caching and quirks of the database, but I think those are less relevant from a design perspective.)&lt;/p&gt;

&lt;h3&gt;Handling Flappy Alerts&lt;/h3&gt;

&lt;p&gt;One of our key design goals was to reduce the number of alerts produced by noisy systems. Within a few days of release, though, a customer started complaining that their alert had become just as noisy.&lt;/p&gt;

&lt;p&gt;We realized they’d been unlucky: their system happened to be such that their time to burndown was being estimated at &lt;i&gt;just about&lt;/i&gt; four hours. A bad event would pop in — and the estimate would drop to 3:55. A good event would show up, and it would bump back up to 4:05. This flapping turned the alert on and off, frustrating and annoying users.&lt;/p&gt;

&lt;p&gt;Fortunately, the fix was easy once we’d figured out the problem: we added a small buffer, and the problems went away.&lt;/p&gt;
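That buffer is classic hysteresis: the alert fires when the estimate drops below one threshold, but only clears once it recovers past a slightly higher one. A sketch, with illustrative threshold values:

```python
class BurnAlert:
    """Burn alert with a small buffer (hysteresis) so an estimate
    hovering right at the threshold doesn't flap on and off.
    Values are hours-until-budget-exhaustion; thresholds are
    illustrative, not Honeycomb's actual numbers."""

    def __init__(self, trigger_at=4.0, clear_at=4.5):
        self.trigger_at = trigger_at  # fire when estimate drops below this
        self.clear_at = clear_at      # clear only once it recovers past this
        self.firing = False

    def update(self, hours_until_exhaustion):
        if not self.firing and hours_until_exhaustion < self.trigger_at:
            self.firing = True
        elif self.firing and hours_until_exhaustion > self.clear_at:
            self.firing = False
        return self.firing
```

An estimate bouncing between 3:55 and 4:05 now fires once and stays fired until the system genuinely recovers.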

&lt;h2&gt;Learning from Experience&lt;/h2&gt;

&lt;p&gt;Last, I’d like to reflect just a little on what we’ve learned from the SLO experience, and some best practices for handling SLOs.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;b&gt;Volume is important.&lt;/b&gt; A very small number of events really shouldn’t be enough to exhaust a budget: if two or three failures can do it, then most likely a standard alert is the right tool. The SLO should tolerate at least dozens of failed events a day. Doing the math backward, an SLO of 99.9% needs a minimum of a few tens of thousands of events a day to be meaningful.&lt;/li&gt;
&lt;li&gt;
&lt;b&gt;Test pathways, not users:&lt;/b&gt; It’s tempting to write an SLO per customer, to find out whether any customer is having a bad experience. That turns out to be a less productive path: first, it reduces volume (because each customer now needs those tens of thousands of events); second, if a single customer is having a problem, does that say something about the system, or about the customer? Instead, writing SLOs on paths through the system and on user scenarios is a better way to identify commonalities.&lt;/li&gt;
&lt;li&gt;
&lt;b&gt;Iterating is important: &lt;/b&gt;We learned rapidly that some of our internal SLOs were off by a bit: they tested the wrong things, or encoded the wrong intuition about what it meant for something to be broken. For example, “status code &amp;gt;= 400” catches user errors (400s) as well as failures in our own system (500s). Iterating on them helped us figure out what we really wanted to measure.&lt;/li&gt;
&lt;li&gt;
&lt;b&gt;Cultural change around SLOs can be slow.&lt;/b&gt; Alerts on raw numbers in the system are familiar; SLOs are new and may seem a little unpredictable. Internally, our teams had been slow to adopt SLOs; after an &lt;a href="https://www.honeycomb.io/blog/incident-report-running-dry-on-memory-without-noticing/" rel="noopener noreferrer"&gt;incident&lt;/a&gt; hit that the SLOs caught long before alarms did, engineers started watching SLOs more carefully.&lt;/li&gt;
&lt;/ul&gt;
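The backward math in the first point above can be sketched directly: the minimum meaningful traffic is just the tolerated daily failures divided by the error budget fraction.

```python
def min_daily_traffic(slo_target, min_failures_per_day):
    """Minimum daily traffic for an SLO to be meaningful: the error
    budget fraction (1 - target) must cover at least
    min_failures_per_day events."""
    return min_failures_per_day / (1 - slo_target)
```

At 99.9%, tolerating two dozen failures a day requires roughly 24,000 events a day — the "few tens of thousands" above.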

&lt;h2&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;SLOs are an observability feature for characterizing what went wrong, how badly it went wrong, and how to prioritize repair. The pathway to implementing SLOs, however, was not as straightforward as we’d hoped. My hope in putting this post together is to help future implementors make decisions for their own paths — and to help users know a little more about what’s going on behind the scenes.&lt;/p&gt;

&lt;p&gt;Want to know more about what Honeycomb can do for your business? Check out &lt;a href="https://www.honeycomb.io/get-a-demo?&amp;amp;utm_source=Devto&amp;amp;utm_Devto=blog&amp;amp;utm_campaign=referral&amp;amp;utm_keyword=%7Bkeyword%7D&amp;amp;utm_content=challenges-with-implementing-slos/"&gt;our short demo&lt;/a&gt;!&lt;/p&gt;

</description>
      <category>observability</category>
      <category>slo</category>
      <category>operations</category>
      <category>sre</category>
    </item>
    <item>
      <title>Honeycomb SLO Now Generally Available: Success, Defined.</title>
      <dc:creator>Danyel Fisher</dc:creator>
      <pubDate>Fri, 04 Dec 2020 23:24:12 +0000</pubDate>
      <link>https://dev.to/honeycombio/honeycomb-slo-now-generally-available-success-defined-2hc3</link>
      <guid>https://dev.to/honeycombio/honeycomb-slo-now-generally-available-success-defined-2hc3</guid>
      <description>&lt;p&gt;&lt;a href="https://dev.to/honeycombio/working-toward-service-level-objectives-slos-part-1-4oed"&gt;Previously, in this series&lt;/a&gt;, we created a derived column to show how a back-end service was doing. That column categorized every incoming event as passing, failing, or irrelevant. We then counted up the column over time to see how many events passed and failed. But we had a problem: we were doing far too much math ourselves.&lt;/p&gt;

&lt;p&gt;To address that problem, Honeycomb has now released &lt;a href="https://honeycomb.io/slos"&gt;&lt;b&gt;SLO support&lt;/b&gt;&lt;/a&gt;! Unsurprisingly, it is based on precisely the principles we discussed in that post.&lt;/p&gt;

&lt;p&gt;To recall, the derived column looked something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;IF&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
   &lt;span class="k"&gt;EQUALS&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;endpoint&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;"batch"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
   &lt;span class="k"&gt;EQUALS&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;method&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;"POST"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
   &lt;span class="k"&gt;EQUALS&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
   &lt;span class="n"&gt;LT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="n"&gt;duration_ms&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;which meant, “we only count requests that hit the batch endpoint using the POST method. Of those, we say the SLI succeeded if we processed the request in under 100 ms and returned a 200; otherwise, we call it a failure.” We counted the percentage of passing requests among relevant ones as our SLI success rate. For example, we might say that over the last thirty days, we managed a 99.4% SLI success rate.&lt;/p&gt;
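For readers more comfortable with conventional code, here is a rough Python equivalent of that derived column (field names flattened for illustration; the real event schema may differ):

```python
def batch_sli(event):
    """Python equivalent of the derived column above. Returns None
    for irrelevant events (not a POST to the batch endpoint), and
    True/False for relevant ones."""
    if not (event["request.endpoint"] == "batch"
            and event["request.method"] == "POST"):
        return None  # irrelevant to this SLI
    return (event["response.status_code"] == 200
            and event["duration_ms"] < 100)
```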

&lt;h2&gt;Formalizing this structure&lt;/h2&gt;

&lt;ul&gt;
    &lt;li&gt;We’ll pick &lt;i&gt;an SLI&lt;/i&gt;. An SLI (Service Level Indicator) sorts all the events in a dataset into three groups: those that are irrelevant, those that pass, and those that fail.&lt;/li&gt;
    &lt;li&gt;Now, we’ll pick a&lt;i&gt; target level&lt;/i&gt; for this SLI. “Of the relevant events, we want &lt;b&gt;99.95% of them to pass&lt;/b&gt;.”&lt;/li&gt;
    &lt;li&gt;Last, we’ll pick a &lt;i&gt;duration&lt;/i&gt; for them: “&lt;b&gt;Over each 30 days&lt;/b&gt;, we expect our SLI to be at 99.95% passing.”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The nice thing about this is that we can &lt;i&gt;quantify how our SLI is doing&lt;/i&gt;. We can look at a dataset and see what percentage of events succeeded.&lt;/p&gt;

&lt;p&gt;This is a really useful way to think about systems that are constantly in minor states of error. Ordinary noise happens; this can lead to transient failures or occasional alerts. We can use this structure to ask how much these minor running errors are costing us.&lt;/p&gt;

&lt;p&gt;(When there’s a catastrophic failure, frankly, SLOs are less surprising: every light on every console is blinking red and the phone is buzzing. We’ll use SLOs in those cases to estimate “how bad was this incident.”)&lt;/p&gt;

&lt;h2&gt;Understanding your Error Budget&lt;/h2&gt;

&lt;p&gt;Let’s assume that we expect to see 1,000,000 relevant events in a given thirty-day period, and that 700 of them have failed over the last 27 days. Over the next three days, we can afford for another 300 events to fail and still maintain a 99.9% SLO (which allows 1,000 failures in all).&lt;/p&gt;
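That arithmetic, sketched out: a 99.9% target over 1,000,000 events gives a budget of 1,000 failures, and spending 700 of them leaves 300.

```python
def remaining_error_budget(expected_events, slo_target, failed_so_far):
    """Error budget arithmetic: the budget is the fraction of events
    allowed to fail (1 - target) times the expected event count, minus
    the failures already spent."""
    budget = expected_events * (1 - slo_target)
    return budget - failed_so_far
```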

&lt;p&gt;This gets to the concept of an &lt;b&gt;error budget. &lt;/b&gt;In Honeycomb’s implementation, error budgets are &lt;b&gt;continuously rolling&lt;/b&gt;: at any moment, old errors are slowly scrolling away into the past, no longer counting against your budget.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--m1CrLNzQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/01/SLO1-error-budget.png" class="article-body-image-wrapper"&gt;&lt;img class="size-full wp-image-5738 alignright" src="https://res.cloudinary.com/practicaldev/image/fetch/s--m1CrLNzQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/01/SLO1-error-budget.png" alt="graph showing error budget lines" width="468" height="304"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In our example, we'll assume that the world looks something like this: The grey line at the top is the total number of events a system has sent. It’s staying pretty constant. The orange line shows errors.&lt;/p&gt;

&lt;p&gt;For this chart, the Y scale on the errors is exaggerated: after all, if you’re running at 99.9%, there are 1/1000th as many errors as successes. (The orange line would be very small!)&lt;/p&gt;

&lt;p&gt;33 days ago, there was an incident which caused the number of errors to spike. Fortunately, we got that under control pretty quickly. Two weeks ago, there was a slower-burning incident, which took a little longer to straighten out.&lt;/p&gt;

&lt;h2&gt;Checking the Burn Down graph&lt;/h2&gt;

&lt;p&gt;It would be great to track &lt;i&gt;when&lt;/i&gt; we spent our error budget. Was the painful part of our last month those big spikes? Or was it the small, continuous background burn the rest of the time? How much were those background errors costing us?&lt;/p&gt;

&lt;p&gt;The &lt;b&gt;burn down graph&lt;/b&gt; shows the last month, and how much budget was burned each day. If we had looked at the graph last week, we’d have seen that our last 30 days had been burnt, pretty hard, by that first incident, and then again by the second. The rest of the time has been a slow, continuous burn: nothing too bad. That helps us make decisions: are we just barely making budget every month? Is the loss due to incidents, or is it because we are slowly burning away over time?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--uUJ3eOyn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/01/SLO2-burn-down1.png" class="article-body-image-wrapper"&gt;&lt;img class="aligncenter size-full wp-image-5739" src="https://res.cloudinary.com/practicaldev/image/fetch/s--uUJ3eOyn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/01/SLO2-burn-down1.png" alt="graph showing downward trend of burn-down" width="468" height="304"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Both of those can be totally fine! For some systems, it’s perfectly reasonable to have a slow, gentle burn of occasional errors. For others, we want to keep our powder dry to compensate for more-severe outages!&lt;/p&gt;

&lt;p&gt;The graph from six days ago was looking dismal. The first incident had burned 40% of the budget in one go; combined with the usual pace of “a few percent a day,” the budget was nearly exhausted.&lt;/p&gt;

&lt;p&gt;But if we look at the burn down graph today, things are looking better! The first incident is off the books, and now we're only making up for the errors of D2. Someday, that too will be forgotten.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ZS9zk5J9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/01/SLO3-burn-down2.png" class="article-body-image-wrapper"&gt;&lt;img class="aligncenter size-full wp-image-5740" src="https://res.cloudinary.com/practicaldev/image/fetch/s--ZS9zk5J9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/01/SLO3-burn-down2.png" alt="graph showing burn-down trend today" width="468" height="304"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We should also take a look at how we compare to the goal. For every day, we can compute the percentage of events that passed the SLI. As you can see, we’re above 95% for most 30-day periods. At the trough of the first incident, things were pretty bad — and we lost ground, again, with the second one — but now we’re maintaining a comfortably higher level.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--wz9qCxDY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/01/SLO4-overall-budget.png" class="article-body-image-wrapper"&gt;&lt;img class="aligncenter size-full wp-image-5741" src="https://res.cloudinary.com/practicaldev/image/fetch/s--wz9qCxDY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/01/SLO4-overall-budget.png" alt="graph showing error budget overall" width="468" height="304"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now, all these illustrations have shown moments when our problems were comfortably in the past. While that’s a great place to have our problems, we wouldn’t be using Honeycomb if all our problems were solved. That’s why there are two other important SLO aspects to think about:&lt;/p&gt;

&lt;h2&gt;SLO Burn Alerts&lt;/h2&gt;

&lt;p&gt;When the error rate is gradually increasing, it would be great to know when we’ll run out of budget. Honeycomb creates Burn Alerts to warn when an SLO is about to run out of budget. The green line shows the gradually shrinking budget, computed over a slightly adjusted window.&lt;/p&gt;

&lt;p&gt;Then, Honeycomb predicts forward. The orange line looks at how our &lt;b&gt;last&lt;/b&gt; hour has been, and then interpolates forward to the &lt;b&gt;next&lt;/b&gt; four hours. In this image, the four hour estimate is going to dip below zero — and so the system warns the user.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--K6h0JpOP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/01/SLO5-prediction.png" class="article-body-image-wrapper"&gt;&lt;img class="aligncenter size-full wp-image-5742" src="https://res.cloudinary.com/practicaldev/image/fetch/s--K6h0JpOP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/01/SLO5-prediction.png" alt="graph showing extrapolation of errore budget exhaustion" width="1940" height="1108"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This can let us know how long until we use up our error budget. It acts as a forewarning against slow failures.&lt;/p&gt;
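The estimate itself is simple in sketch form (illustrative only, not Honeycomb's exact algorithm): divide the remaining budget by the recent burn rate.

```python
def hours_until_exhaustion(remaining_budget, burned_last_hour):
    """Linear extrapolation from the last hour's burn rate, in
    whatever units the budget is tracked in (e.g., failed events)."""
    if burned_last_hour <= 0:
        return float("inf")  # no recent burn: budget never runs out
    return remaining_budget / burned_last_hour
```

A budget of 120 failures with 30 burned in the last hour gives four hours to exhaustion, right at the threshold described above.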

&lt;p&gt;It’s really useful to have a couple of different time ranges. A 24-hour alert can mean “you’ve got a slow degradation in your service; you’ll want to fix it, but it can wait until morning.” A four-hour alert means “it’s time to get cracking.” (At Honeycomb, we tend to send 24-hour alerts to Slack channels, but four-hour alerts to PagerDuty.)&lt;/p&gt;

&lt;h2&gt;&lt;b&gt;Find out why it's going wrong&lt;/b&gt;&lt;/h2&gt;

&lt;p&gt;This wouldn’t be Honeycomb if we didn’t provide you tools to dive into an issue. The SLO Page shows a &lt;a href="https://docs.honeycomb.io/working-with-your-data/heatmaps/"&gt;Heatmap&lt;/a&gt; and a &lt;a href="https://docs.honeycomb.io/working-with-your-data/bubbleup/"&gt;BubbleUp&lt;/a&gt; of the last 24 hours, so you can figure out what’s changed and how you want to take it on.&lt;/p&gt;

&lt;p&gt;Here’s a great example: the SLO page for a Honeycomb tool that’s looking at rendering speed. (Yep, we’ve even set an SLO on end-user experience!) This is a pretty loose SLO — really, we’re keeping it around to warn us if our pages suddenly get much worse — but we can see that we’re doing OK against our goals.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--iU445LCg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/01/SLO6-overall.png" class="article-body-image-wrapper"&gt;&lt;img class="aligncenter size-full wp-image-5743" src="https://res.cloudinary.com/practicaldev/image/fetch/s--iU445LCg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/01/SLO6-overall.png" alt="screenshot of overall SLO view page" width="1146" height="1306"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The bottom half of the page shows &lt;i&gt;where&lt;/i&gt; the problems are coming from. The BubbleUp heatmap shows the last day of events: yellow events fail the SLI; blue events comply with it. We can see that failures mostly happen when events are particularly slow.&lt;/p&gt;

&lt;p&gt;We can also look in there and see that &lt;b&gt;one particular page&lt;/b&gt; seems to be having the worst experience, and one particular user’s requests are running slow. That’s a pretty cool insight — it tells us where to look and how we might want to handle it. It also gives us a sense of what repro cases to look for, so we can figure out what unusual thing this user is doing.&lt;/p&gt;

&lt;h2&gt;Now, define your own SLOs&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://docs.honeycomb.io/working-with-your-data/slos/"&gt;Honeycomb SLOs&lt;/a&gt; are now released and are available to Enterprise/yearly contract customers. We’d love to learn more about how you think about SLOs, and what you use them for.&lt;/p&gt;




&lt;p&gt;Read the final installment in this blog series: &lt;a href="https://dev.to/honeycombio/challenges-with-implementing-slos-1kp"&gt;Challenges with Implementing SLOs&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;New to Honeycomb? &lt;a href="https://ui.honeycomb.io/signup?&amp;amp;utm_source=Devto&amp;amp;utm_Devto=blog&amp;amp;utm_campaign=referral&amp;amp;utm_keyword=%7Bkeyword%7D&amp;amp;utm_content=honeycomb-slo-now-generally-available-success-defined"&gt;Get started for free&lt;/a&gt;!&lt;/p&gt;

</description>
      <category>slo</category>
      <category>observability</category>
      <category>devops</category>
      <category>sre</category>
    </item>
    <item>
      <title>Working Toward Service Level Objectives (SLOs), Part 1</title>
      <dc:creator>Danyel Fisher</dc:creator>
      <pubDate>Fri, 04 Dec 2020 23:19:16 +0000</pubDate>
      <link>https://dev.to/honeycombio/working-toward-service-level-objectives-slos-part-1-4oed</link>
      <guid>https://dev.to/honeycombio/working-toward-service-level-objectives-slos-part-1-4oed</guid>
      <description>&lt;p&gt;In theory, Honeycomb is always up. Our servers run without hiccups, our user interface loads rapidly and is highly responsive, and our query engine is lightning fast. In practice, this isn’t always perfectly the case — and dedicated readers of this blog have learned about how we &lt;a href="https://www.honeycomb.io/search/incident+review"&gt;use those experiences to improve the product&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Now, we could spend all our time on system stability. We could polish the front-end code and look for inefficiencies; throw ever-harder test-cases at the back-end. (There are a few developers who are vocal advocates for that approach!) But we also want to make the product better — and so we keep rolling out so many &lt;a href="https://changelog.honeycomb.io/"&gt;great features&lt;/a&gt;!&lt;/p&gt;

&lt;p&gt;How do we decide when to work on improving stability, and when we get to go make fun little tweaks?&lt;/p&gt;

&lt;h2&gt;From organizational objectives to events&lt;/h2&gt;

&lt;p&gt;So let’s say we come to an agreement about “how bad” things are. When things are bad — when we’re mired in small errors, or one big error takes down the service — we can slow down on Cool Feature Development and switch over to stability work. Conversely, when things feel reasonably stable, we can trust that we have a solid infrastructure for development, and slow down on repair and maintenance work.&lt;/p&gt;

&lt;p&gt;What would that agreement look like?&lt;/p&gt;

&lt;p&gt;First, it means being able to take a good hard look at our system. Honeycomb has a &lt;a href="https://www.honeycomb.io/blog/toward-a-maturity-model-for-observability/"&gt;mature level of observability&lt;/a&gt;, so we feel pretty confident that we have the raw tools to look at how we’re doing — where users are experiencing challenges, and where bugs are appearing in our system.&lt;/p&gt;

&lt;p&gt;Second, it means coming to understand that no system is perfect. If our goal is 100% uptime at all times for all requests, then we’ll be disappointed, because some things will fail from time to time.&lt;em&gt; But we can come up with statements about quality of service&lt;/em&gt;. Honeycomb had an internal meeting where we worked to quantify this:&lt;/p&gt;

&lt;ul&gt;
    &lt;li&gt;We pretty much &lt;b&gt;never&lt;/b&gt; want to lose customer data. We live and die by storing customer data, so we want &lt;i&gt;every&lt;/i&gt; batch of customer telemetry to get back a positive response, and quickly. Let’s say that we want them to be handled in under 100 ms, without errors, for 99.95% of requests. (That means that in a full year, we could have &lt;b&gt;4 hours&lt;/b&gt; of downtime.)&lt;/li&gt;
    &lt;li&gt;We want our main service to be up pretty much every time someone clicks on honeycomb.io, and we want it to load pretty quickly. Let’s say we want to load the page without errors, within a second, for 99.9% of requests.&lt;/li&gt;
    &lt;li&gt;Sometimes, when you run a query, it takes a little longer. For that, we decided that 99.5% of data queries should come back within 10 seconds and not return an error.&lt;/li&gt;
&lt;/ul&gt;
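&lt;p&gt;As a sanity check, those downtime budgets reduce to a few lines of arithmetic. A minimal sketch in Python (the year length ignores leap years):&lt;/p&gt;

```python
# Convert an availability target into an annual downtime budget.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # ignoring leap years

def annual_downtime_hours(target_pct: float) -> float:
    """Hours per year allowed to fail while still meeting the target."""
    error_budget = 1 - target_pct / 100
    return SECONDS_PER_YEAR * error_budget / 3600

for target in (99.95, 99.9, 99.5):
    print(f"{target}% -> {annual_downtime_hours(target):.1f} hours/year")
```

&lt;p&gt;At 99.95%, the budget works out to roughly 4.4 hours a year, in line with the figure above.&lt;/p&gt;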

&lt;p&gt;These are entirely reasonable goals. The wonderful thing is, they can actually be expressed in Honeycomb’s dogfood servers as &lt;a href="https://www.honeycomb.io/blog/level-up-with-derived-columns-two-neat-tricks-that-will-improve-your-observability/"&gt;Derived Column expressions&lt;/a&gt;:&lt;/p&gt;

&lt;p&gt;For example, we would write that first one — the one about data ingest — as “We’re talking about events where &lt;code&gt;request.endpoint&lt;/code&gt; uses the &lt;code&gt;batch&lt;/code&gt; endpoint and the input is a &lt;code&gt;POST&lt;/code&gt; request. When they do, they should return a code 200, and the &lt;code&gt;duration_ms&lt;/code&gt; should be under 100.”&lt;/p&gt;

&lt;p&gt;Let’s call this a “Service Level Indicator,” because we use it to indicate how our service is doing. In our &lt;a href="https://docs.honeycomb.io/working-with-your-data/customizing-your-query/derived-columns/reference/"&gt;derived column language&lt;/a&gt;, that looks like&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;IF&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
   &lt;span class="k"&gt;EQUALS&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;endpoint&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;"batch"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
   &lt;span class="k"&gt;EQUALS&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;method&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;"POST"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
   &lt;span class="k"&gt;EQUALS&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
   &lt;span class="n"&gt;LT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="n"&gt;duration_ms&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
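&lt;p&gt;For readers who think better in ordinary code, here is a rough Python equivalent of that derived column. This is a sketch, not anything Honeycomb runs: the field names come from the expression above, and &lt;code&gt;None&lt;/code&gt; stands in for the null that &lt;code&gt;IF&lt;/code&gt; yields when an event is out of scope (so out-of-scope events don’t count against the SLI at all):&lt;/p&gt;

```python
from typing import Optional

def sli(event: dict) -> Optional[bool]:
    """Mirror of the derived-column SLI.

    Returns None when the event is out of scope (not a POST to the batch
    endpoint), otherwise True/False for pass/fail.
    """
    in_scope = (event.get("request.endpoint") == "batch"
                and event.get("request.method") == "POST")
    if not in_scope:
        return None
    # Pass: returned a 200 in under 100 ms.
    return (event.get("response.status_code") == 200
            and event.get("duration_ms", float("inf")) < 100)
```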



&lt;p&gt;We name that derived column “SLI”, and we can generate a COUNT and a HEATMAP on it.&lt;/p&gt;

&lt;p&gt;This looks pretty sane: we see that there are many more points that are true (the indicator is ok! everything is great!) than false (oh, no, they failed!); and we can see that all the points that are slowest are in the “false” group.&lt;/p&gt;

&lt;p&gt;Let’s whip out our Trusty Pocket Calculator: 35K events with “false”; 171 million with “true.” That’s about a 0.02% failure rate — we’re up at 99.98%. Sounds like we’re doing ok!&lt;/p&gt;
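&lt;p&gt;The pocket-calculator step is just a ratio. A quick Python sketch, using the approximate counts from the graph:&lt;/p&gt;

```python
false_count = 35_000       # events where the SLI came back false
true_count = 171_000_000   # events where it came back true

failure_rate = false_count / (false_count + true_count)
availability = 1 - failure_rate
print(f"failure rate: {failure_rate:.2%}")
print(f"availability: {availability:.2%}")
```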

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ToJjW0OX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2019/09/SLO-1.gif" class="article-body-image-wrapper"&gt;&lt;img class="aligncenter size-full wp-image-5232" src="https://res.cloudinary.com/practicaldev/image/fetch/s--ToJjW0OX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2019/09/SLO-1.gif" alt="animated gif of a graph with some failures" width="1112" height="673"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But there are still some failures. I’d love to know why!&lt;/p&gt;

&lt;p&gt;By clicking over to the BubbleUp tab, I can find out who is having this slow experience. I highlight all the slowest requests, and BubbleUp shows me a histogram for every dimension in the dataset. By finding those columns that are most different from everything else, I can see where these errors stand out.&lt;/p&gt;

&lt;p&gt;... and I see that it’s one particular customer, and one particular team. Not only that, but they’re using a fairly unusual API for Honeycomb (that’s the fourth entry, &lt;code&gt;request.header.user-agent&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--9GsO5xFS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2019/09/SLO-2.png" class="article-body-image-wrapper"&gt;&lt;img class="aligncenter size-full wp-image-5233" src="https://res.cloudinary.com/practicaldev/image/fetch/s--9GsO5xFS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2019/09/SLO-2.png" alt="screenshot of using BubbleUp to identify one problematic user" width="985" height="988"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is great, and highly actionable! I can reach out to the customer, and find out what’s up; I can send our integrations team to go look at that particular package, and see if we’re doing something that’s making it hard to use well.&lt;/p&gt;

&lt;h2&gt;Quantifying quality of service means you can measure it&lt;/h2&gt;

&lt;p&gt;So bringing that back to where we started: we’ve found a way to start with organizational goals, and found a way to quantify our abstract concept: “always up and fast” now has a meaning, and is measurable. We can then use that to diagnose what’s going wrong, and figure out how to make it faster.&lt;/p&gt;

&lt;p&gt;Part 2, coming soon: Wait, why did I have to pull out my pocket calculator? Don’t we have computers for that? Also, this term “SLI”, it feels familiar somehow...&lt;/p&gt;




&lt;p&gt;Read the next post in the series:&lt;br&gt;
&lt;a href="https://www.honeycomb.io/blog/honeycomb-slo-now-generally-available-success-defined/"&gt;Honeycomb SLO Now Generally Available: Success, Defined.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Excited about what the future of operational excellence looks like? Get started with &lt;a href="https://ui.honeycomb.io/signup?&amp;amp;utm_source=Devto&amp;amp;utm_Devto=blog&amp;amp;utm_campaign=referral&amp;amp;utm_keyword=%7Bkeyword%7D&amp;amp;utm_content=working-toward-service-level-objectives-slos-part-1"&gt;Honeycomb for free&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>slos</category>
      <category>observability</category>
      <category>devops</category>
      <category>sre</category>
    </item>
    <item>
      <title>Unpacking Events: All the Better to Observe</title>
      <dc:creator>honeycomb</dc:creator>
      <pubDate>Mon, 16 Nov 2020 17:28:52 +0000</pubDate>
      <link>https://dev.to/honeycombio/unpacking-events-all-the-better-to-observe-1agk</link>
      <guid>https://dev.to/honeycombio/unpacking-events-all-the-better-to-observe-1agk</guid>
      <description>&lt;p&gt;&lt;span&gt;At Honeycomb, we’ve been listening to your feedback. You want easier ways to predict usage and scale your observability spend with your business. What would it look like to meet you where you already are, using similar terms, and give you more control with a simpler experience? We think that means reimagining the customer experience into one that centers around an event-based model.&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;span&gt;But what exactly is an event? What does that mean for your team’s observability journey?&lt;/span&gt;&lt;/p&gt;

&lt;h2&gt;&lt;span&gt;The Old Way&lt;/span&gt;&lt;/h2&gt;

&lt;p&gt;&lt;span&gt;Traditionally, Honeycomb has charged customers based on two axes: ingest and retention. We figured not everyone has the same needs, so why not have you select what’s right for you? But in practice, that translated into placing more administrative burden on you.&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;span&gt;Many of you told us that you’d frequently visit the Usage page to fiddle with the slider, responding to spikes in traffic (and therefore ingest) by reducing your retention period in order to keep your bill around the same amount each month. We started to see that become a difficult (but common) tradeoff: if you want to send more data, then it wouldn’t stick around as long. Many of you ended up robbing Peter to pay Paul.&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;span&gt;In addition, we found that it’s not intuitive for many of you to estimate ingest in gigabytes. As a result, many Usage page sliders were adjusted reactively, without much sense of how that would affect Honeycomb cost as more systems were integrated or as service traffic continued to grow.&lt;/span&gt;&lt;/p&gt;

&lt;h2&gt;&lt;span&gt;Stress-Free Observability&lt;/span&gt;&lt;/h2&gt;

&lt;p&gt;&lt;span&gt;Let’s face it, that’s not a great experience. If we could take what we’ve learned from you to make it better, what would we do?&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;span&gt;You can’t always anticipate traffic spikes. A great experience would not penalize you for those.&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;span&gt;You don’t want to obsess over usage every month or have to explain variations in spend to your accounting team. A great experience would let you set your monthly or yearly spend and forget about it, knowing you have headroom for growth.&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;span&gt;You want to stop worrying about retention and capacity planning. We’ve worked with many teams who limit their retention to 24 hours, or even less! A great experience would give you an extended time frame with your data. In the beginning, you would have space to get comfortable with your new instrumentation. Over time, it would allow you to reflect back on the last two months to support your incident review and capacity planning needs.&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;span&gt;A great experience would encourage you to send us wide events with many context fields, since that’s the richness you need for observability. You shouldn’t have to worry about how much data each of those events sends.&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;span&gt;What we’ve found is that as you progress on your observability journey, your instrumentation will reach an equilibrium. Once you figure out your right level of instrumentation, your usage should be predictable and aligned with your application’s traffic patterns. You want to spend time thinking about improving your application, not optimizing fiddly usage sliders.&lt;/span&gt;&lt;/p&gt;

&lt;h2&gt;&lt;span&gt;You Already Think About Events&lt;/span&gt;&lt;/h2&gt;

&lt;p&gt;&lt;span&gt;In thinking about how to meet you where you already are, it makes a lot of sense to land on the &lt;/span&gt;&lt;b&gt;event&lt;/b&gt;&lt;span&gt; as the core unit of measure.&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;span&gt;But not all events are scoped similarly. Let’s define what an “event” would mean to Honeycomb, and how that relates to the events you care about for your service.&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;span&gt;Honeycomb defines an event as a single “unit of work” in your application code. But a “unit of work” can have a dozen meanings in a dozen contexts. It can be as small as flipping a bit or as large as a round-trip HTTP request. The simplest definition is that an event is usually either a &lt;/span&gt;&lt;b&gt;trace span&lt;/b&gt;&lt;span&gt;, or a &lt;/span&gt;&lt;b&gt;log event&lt;/b&gt;&lt;span&gt;.&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;span&gt;Let’s unpack that a bit further.&lt;/span&gt;&lt;/p&gt;

&lt;h3&gt;&lt;span&gt;Log Events&lt;/span&gt;&lt;/h3&gt;

&lt;p&gt;&lt;span&gt;For logs, enumerating events is pretty straightforward. If you use (or if you’re planning to use) &lt;/span&gt;&lt;a href="https://docs.honeycomb.io/getting-data-in/integrations/honeytail/"&gt;&lt;span&gt;honeytail&lt;/span&gt;&lt;/a&gt;&lt;span&gt; to send structured logs into Honeycomb, you likely already know how many log events you’re sending. For infrastructure teams with less authority over code changes, installing an agent like honeytail, the &lt;/span&gt;&lt;a href="https://docs.honeycomb.io/getting-data-in/integrations/aws/"&gt;&lt;span&gt;AWS integrations&lt;/span&gt;&lt;/a&gt;&lt;span&gt;, or the newly-upgraded &lt;/span&gt;&lt;a href="https://docs.honeycomb.io/getting-data-in/integrations/kubernetes/"&gt;&lt;span&gt;Kubernetes agent&lt;/span&gt;&lt;/a&gt;&lt;span&gt; is the best way to get data into Honeycomb. Each log event these agents send corresponds to one event in Honeycomb.&lt;/span&gt;&lt;/p&gt;

&lt;h3&gt;&lt;span&gt;Trace Spans and Events&lt;/span&gt;&lt;/h3&gt;

&lt;p&gt;&lt;span&gt;With spans, we need to examine the definition a bit more. To start: we’ve decided that one span equals one event.&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;span&gt;If you want high-res observability into your systems, you’ve probably looked into trace-aware instrumentation like our &lt;a href="https://docs.honeycomb.io/getting-data-in/beelines/"&gt;Beelines&lt;/a&gt;. &lt;a href="https://www.honeycomb.io/search/tracing"&gt;We’ve already written a lot about the benefits of tracing.&lt;/a&gt; For this post, let’s focus on what &lt;/span&gt;&lt;b&gt;1 span == 1 event &lt;/b&gt;&lt;span&gt;would mean for adding trace-aware instrumentation to your service.&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;span&gt;As a service owner, you already have your own concept of events in your world: HTTP or API requests, background tasks, queue events, etc. If your app is in production, you probably know your traffic patterns, i.e.:&lt;/span&gt;&lt;/p&gt;

&lt;ul&gt;
    &lt;li&gt;&lt;span&gt;How many of your service’s events you experience at any particular time scale—per second, per hour, per day, per month&lt;/span&gt;&lt;/li&gt;
    &lt;li&gt;&lt;span&gt;The seasonality of those events at different times of the day, week, month, year&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;span&gt;When you’ve instrumented for tracing, one of your service’s events generates a single Honeycomb trace. A trace is made up of one or more spans. Remember that in this world, one span is one event.&lt;/span&gt;&lt;/p&gt;

&lt;h2&gt;&lt;span&gt;Estimating Honeycomb Events&lt;/span&gt;&lt;/h2&gt;

&lt;p&gt;&lt;span&gt;From here you can roughly predict your Honeycomb usage as a function of your traffic. For example, you could estimate your monthly usage like so:&lt;/span&gt;&lt;/p&gt;

&lt;pre&gt;&lt;span&gt;  (Number of your service’s events per month)
&lt;/span&gt;&lt;span&gt;× (Number of spans in each service event)
_____________________________________________
=  Honeycomb usage per month&lt;/span&gt;&lt;/pre&gt;

&lt;p&gt;&lt;span&gt;So how do you figure out the number of spans for each of your service’s events?&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;span&gt;A useful guideline: one span gets generated from each method call, down to a certain level of granularity. By “granularity,” we mean how deep you go down the call stack. Sometimes you care about the context in methods being called by other methods, and sometimes you don’t. &lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;span&gt;For example, you probably care more about what kinds of database queries your controller is making, and less about what arguments went into &lt;code&gt;Math.sum()&lt;/code&gt;. (Don’t let me tell you what’s important, though! You know your code better than I do.)&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.honeycomb.io/wp-content/uploads/2020/05/sql.png"&gt;&lt;img class="aligncenter size-full wp-image-6589" src="https://res.cloudinary.com/practicaldev/image/fetch/s--okcsUzX4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/05/sql.png" alt="" width="1600" height="94"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;span&gt;Still, the number of spans generated from one of your service’s events depends on the kind of event it is. In this example, an HTTP request using the &lt;a href="https://docs.honeycomb.io/getting-data-in/ruby/beeline/"&gt;ruby-beeline&lt;/a&gt; Rails integration generated 18 spans. If you’re calling out to another service like Redis or S3, that will generate more spans.&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.honeycomb.io/wp-content/uploads/2020/05/collapsed.png"&gt;&lt;img class="aligncenter size-full wp-image-6588" src="https://res.cloudinary.com/practicaldev/image/fetch/s--hfmZ3kDn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/05/collapsed.png" alt="" width="1600" height="93"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here's the same HTTP request, fully expanded:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.honeycomb.io/wp-content/uploads/2020/05/expanded.png"&gt;&lt;img class="aligncenter size-full wp-image-6586" src="https://res.cloudinary.com/practicaldev/image/fetch/s--s9K0jlVw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/05/expanded.png" alt="" width="1600" height="862"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;&lt;b&gt;What to expect&lt;/b&gt;&lt;/h2&gt;

&lt;p&gt;&lt;span&gt;There's no magic number for the “proper” level of granularity in tracing. As you ramp up your use of Honeycomb, you'll make discoveries that'll guide how to further instrument your code. New Honeycomb users often discover long-hidden bugs and inefficiencies when they first instrument for tracing. Aim for higher granularity early on with the goal of learning, finding these inefficiencies, and nipping them in the bud.&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;span&gt;So when you see a high-granularity trace with many spans, ask yourself, “Is this trace valuable?” It’s entirely in the eye of the beholder!&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.honeycomb.io/wp-content/uploads/2020/05/mysql_trace.png"&gt;&lt;img class="aligncenter size-full wp-image-6587" src="https://res.cloudinary.com/practicaldev/image/fetch/s--ZGsXlEzV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.honeycomb.io/wp-content/uploads/2020/05/mysql_trace.png" alt="" width="1276" height="960"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;span&gt;After this initial period of discovery, you'll gain familiarity with what a normal trace looks like for various parts of your service. Going forward, you'll be much more interested in the abnormal and better able to tune your instrumentation.&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;span&gt;Let’s look back at our estimation formula from up above to figure out exactly what usage to expect:&lt;/span&gt;&lt;/p&gt;

&lt;pre&gt;&lt;span&gt;  (Number of your service’s events per month) &lt;/span&gt;
&lt;span&gt;× (Number of spans in each service event) &lt;/span&gt;
&lt;span&gt;_____________________________________________ 
=  Honeycomb usage per month&lt;/span&gt;&lt;/pre&gt;

&lt;p&gt;&lt;span&gt;Plugging in some numbers, let’s say your service gets around 1 million requests per day, or up to 30 million requests per month. If each request sends ~20 spans, then you’re looking at 600 million Honeycomb events per month. If each request sends ~50 spans, you’ll be sending 1.5 billion Honeycomb events per month.&lt;/span&gt;&lt;/p&gt;
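&lt;p&gt;That back-of-the-envelope math is easy to script. A sketch (the function name is ours, not part of any Honeycomb SDK):&lt;/p&gt;

```python
def monthly_events(requests_per_day: int, spans_per_request: int,
                   days_per_month: int = 30) -> int:
    """Estimate Honeycomb events/month from traffic and trace granularity."""
    return requests_per_day * days_per_month * spans_per_request

print(monthly_events(1_000_000, 20))  # 600,000,000
print(monthly_events(1_000_000, 50))  # 1,500,000,000
```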

&lt;p&gt;&lt;span&gt;Rather than worrying about storage size and retention periods, in this event-based world you’d be able to quickly ascertain what your usage needs are. In future posts, we’ll cover more about what that means going forward and how to even further optimize your usage with techniques like dynamic sampling.&lt;/span&gt;&lt;/p&gt;




&lt;p&gt;Have questions on how to get started? Want help estimating your event volume? Reach out to our team at &lt;a href="mailto:info@honeycomb.io"&gt;info@honeycomb.io&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Not using Honeycomb? &lt;a href="https://ui.honeycomb.io/signup?&amp;amp;utm_source=Devto&amp;amp;utm_Devto=blog&amp;amp;utm_campaign=referral&amp;amp;utm_keyword=%7Bkeyword%7D&amp;amp;utm_content=lets-talk-events"&gt;Get started today, for free&lt;/a&gt;!&lt;/p&gt;

</description>
      <category>tracing</category>
      <category>observability</category>
      <category>span</category>
      <category>logging</category>
    </item>
    <item>
      <title>The Future of Ops Careers — Honeycomb</title>
      <dc:creator>Charity Majors</dc:creator>
      <pubDate>Fri, 13 Nov 2020 16:36:21 +0000</pubDate>
      <link>https://dev.to/honeycombio/the-future-of-ops-careers-honeycomb-56pj</link>
      <guid>https://dev.to/honeycombio/the-future-of-ops-careers-honeycomb-56pj</guid>
      <description>&lt;p&gt;Have you seen &lt;a href="https://medium.com/@jeremydaly/lambda-a-serverless-musical-cf8ec5d522e1?" rel="noopener noreferrer"&gt;Lambda: A Serverless Musical&lt;/a&gt;?&lt;/p&gt;

&lt;p&gt;If not, you really have to. I love Hamilton, I love serverless, and I’m not trying to be a crank or a killjoy or police people’s language. BUT, unfortunately, the chorus chose to double down on one of the stupidest and most dangerous tendencies the serverless movement has had from day one: misunderstanding and trash-talking operations.&lt;/p&gt;

&lt;p&gt;&lt;i&gt;“I’m gonna reduce your… ops&lt;/i&gt;&lt;br&gt;
&lt;i&gt;I’m gonna reduce your… ops”&lt;/i&gt;&lt;/p&gt;

&lt;p&gt;Well, I hate to tell you, but…&lt;/p&gt;

&lt;p&gt;&lt;i&gt;“No, I am not throwing away my… ops.&lt;/i&gt;&lt;br&gt;
&lt;i&gt;And you’re not throwing away my… ops.”&lt;/i&gt;&lt;/p&gt;

&lt;p&gt;Or anyone else’s for that matter.&lt;/p&gt;

&lt;p&gt;Even if you don’t run any servers or have any infrastructure of your own, you’ll still have to deal with operability and operations engineering problems. I hate to be the bearer of bad news (not really), but the role of operations isn’t going away. At best, the shifts that supposedly reduce your ops are simply delegating the operability of your stack to someone that does it better. The reality for most teams is that operations engineering is more necessary than ever.&lt;/p&gt;

&lt;p&gt;Beyond Hamilton clap backs, that distinction matters because it has real career ramifications for engineers who, like me, are so operationally minded. Where are Ops careers heading?&lt;/p&gt;

&lt;h2&gt;Where Does Ops Fit, Anyway?&lt;/h2&gt;

&lt;p&gt;In some corners of engineering, “ops” is straight up used as a synonym for toil and manual labor. There is no good ops, only dead ops. The existence of ops is a technical failure: a blemish to be automated away, eradicated by adding more and more code. Code defeats toil. Dev makes ops obsolete. #NoOps!&lt;/p&gt;

&lt;p&gt;If this is such an inexorable march towards utopia, maybe someone can explain to me why the shops that flirt the hardest with #NoOps have been, without exception, such humanitarian disasters?&lt;/p&gt;

&lt;p&gt;Or, I’ll start. Operations is ridiculously important. When you denigrate it and diminish it, that’s the first sign that you aren’t doing it well. The way to do something well generally starts with adding focus and rigor, not writing it off.&lt;/p&gt;

&lt;p&gt;Consider business, development, and operations. Business is the why, development is the what, operations is the how. Operations is the constellation of your organizational memory: patterns, practices, habits, defaults, aspirations, expertise, tools, and everything else used to deliver business value to users.&lt;/p&gt;

&lt;p&gt;The value of serverless isn’t found in “less ops.” Less ops doesn’t yield better systems than more ops, any more than fewer lines of code means better software. The value of serverless is unlocked by clear and powerful abstractions that let you delegate running large portions of your infrastructure to other people who can do it better than you — yes, because of economies of scale, but more so because that’s their core business model. YOUR core business model probably has nothing to do with infrastructure.&lt;/p&gt;

&lt;p&gt;Because of that, a great sort is now happening between software engineering, infrastructure operations, and core business value.&lt;/p&gt;

&lt;h2&gt;What Is Infrastructure?&lt;/h2&gt;

&lt;p&gt;Infrastructure is software support. It’s the prerequisite thing you &lt;i&gt;have&lt;/i&gt; to do, in order to get to the stuff you &lt;i&gt;want&lt;/i&gt; to do. It’s not what you want to be doing, yet your business goals presume its existence.&lt;/p&gt;

&lt;p&gt;An important quality of infrastructure is that it typically changes less often and is more stable than the software that constitutes your core business value. The features you ship to customers are typically under constant or frequent development, and they change at the rate of pull requests and commits (in fact, the velocity of these changes can be a critical competitive advantage). Infrastructure, on the other hand, changes at a more glacial pace — at the rate of package managers, OS updates, and new machine images. It’s seconds-to-minutes versus hours-to-days.&lt;/p&gt;

&lt;p&gt;This dividing line between infrastructure and core business value even holds true for companies whose business model is building infrastructure for other companies. For example, a company providing email focuses on products that consist of email workflow features that are constantly being developed and shipped to users. There isn’t much new business value to be wrung out of modifying commodity SMTP transport layers or optimizing IMAP servers.&lt;/p&gt;

&lt;p&gt;To its credit, serverless is perhaps the first trend to have really understood and powerfully leveraged that dividing line. IaaS, PaaS, and full-service suites like Gitlab were all germinal forms of this shift. “Cloud native” was also, arguably, another lurch in that direction. But where has that taken our industry?&lt;/p&gt;

&lt;h2&gt;*-As-a-Service Is Really Just Code for “Outsourcing”&lt;/h2&gt;

&lt;p&gt;IaaS, PaaS, and even FaaS/serverless are really all just types of outsourcing. Yet we don’t call it “outsourcing” when we rely on companies like AWS to run our datacenters and provide compute or storage, or when we use Google apps for our email, documents, and spreadsheets.&lt;/p&gt;

&lt;p&gt;Historically, “outsourcing” is what we call shifting work off-premises when we aren’t yet comfortable with the arrangement, whether because the fit is awkward, the support is incomplete, or the service isn’t on par with what we could do ourselves. With infrastructure outsourcing, service quality is now creeping up the stack. More and more complex subsystems are becoming commodity components, and other companies utilize them to build their own businesses (or other infrastructure!) on top.&lt;/p&gt;

&lt;p&gt;When I started my career, I was a jack-of-all-trades systems person. I ran mail, web, db, DNS, cache, deploys, CI/CD, patched operating systems, built debs and rpms, etc, etc. Most engineers don’t do those things now, and neither do I. Why would I, when I can pay someone else to abstract those details away, so that I can spend my time focusing on delivering customer value?&lt;/p&gt;

&lt;p&gt;Increasingly, as an industry, we are outsourcing any bits that we can.&lt;/p&gt;

&lt;p&gt;As a more personal example, why would you want to run your own observability team or build your own in-house monitoring software, if that’s not your core business? Why split your focus to building a bespoke and unsustainable version of a thing when you can readily buy a world-class version? If my company has had ten or twenty full-time engineers working on that solution, how long will it be until your team of three or five can catch up?&lt;/p&gt;

&lt;p&gt;In a post-cloud world, we’ve learned that it’s usually much better and far easier to buy than it is to build those things that don’t add business value.&lt;/p&gt;

&lt;h2&gt;How to Outsource Things Well&lt;/h2&gt;

&lt;p&gt;In my personal example, buying doesn’t mean that you shouldn’t have an observability team. It means that the observability team should turn their gaze inward. That team should take a page out of the SRE or test community’s books and focus on providing value for your org’s developers whenever they interact with this outsourced solution.&lt;/p&gt;

&lt;p&gt;That team should write libraries, generate examples, and drive standardization; ushering in consistency, predictability, and usability. They should partner with internal teams to evaluate use cases. They should partner with your vendors as roadmap stakeholders. They might also write glue code and helper modules to connect disparate data sources and create cohesive visualizations. Basically, that team becomes an integration point between your organization and the outsourced work.&lt;/p&gt;

&lt;p&gt;We already &lt;a href="https://cloudplatformonline.com/rs/248-TPC-286/images/DORA-State%20of%20DevOps.pdf" rel="noopener noreferrer"&gt;know from industry research&lt;/a&gt; that the key to success when outsourcing is to embed those off-prem contributions within cross-functional teams, which manage integrating that work back into the broader organization.&lt;/p&gt;

&lt;p&gt;Monstrous amounts of engineering work create the stack that ships value to your customers. Trying to save work, some teams build complicated Rube Goldberg machines that are brutal to run, change, and debug. It’s much harder to build simple platforms with operable, intelligible components that provide a humane user experience. Bridging that gap requires quality operations engineering to streamline that outsourcing for successful user adoption.&lt;/p&gt;

&lt;p&gt;That’s why even if you run no servers and have no infrastructure of your own, you still have operability and operations problems to contend with. Getting to the point where your org successfully has no infrastructure of its own takes a lot of world-class operations expertise. Staying there is even harder. Any jerk with a credit card can just go spin up a server you’re now responsible for. Try being any sort of roadblock and see how quickly that happens.&lt;/p&gt;

&lt;h2&gt;What This Means For Operationally Minded Engineers&lt;/h2&gt;

&lt;p&gt;The reality is that jack-of-all-trades systems infrastructure jobs are slowly vanishing: the world doesn’t need thousands of people who can expertly tune postfix, SpamAssassin and ClamAV — the world has Gmail. You might find your next job by following the trail of technologies you know, like getting hired as a MySQL expert. But technologies come and go, so you should think carefully before hitching your wagon to any particular piece of software. What will this mean for your career?&lt;/p&gt;

&lt;p&gt;The industry is bifurcating along an infrastructure fault line, and the long-held conflation of infrastructure-oriented engineers and operationally-minded engineers is swiftly eroding. These are becoming two different roles and career paths at two different kinds of companies: infrastructure providers, and everyone else. Those of us who love optimizing, debugging, maintaining, and tackling weird systems problems far more than writing new greenfield code now have a choice to make: go deep and specialize in infrastructure, or go broad on operability.&lt;/p&gt;

&lt;p&gt;If the mission of your company is to solve a category of problem by providing infrastructure to the world, then operations will always be a core part of that mission: your company thrives by solving that particular operability problem better than anyone else. So you are justified in going deep, specializing in it, and figuring out how to do it better and more efficiently than anyone else in the world — so that other people don’t have to. But know that even this infrastructure-heavy backend work needs design, product management, and software engineering — just like at those non-infrastructure-focused companies!&lt;/p&gt;

&lt;p&gt;If your chosen company isn’t solving an infrastructure problem for the world, there are still loads of opportunities for ops generalists. But know that a core part of your job is critically examining the cycles your company devotes to infrastructure operations and finding effective ways to outsource that work or minimize the in-house developer cycles it consumes. Your job is &lt;i&gt;not&lt;/i&gt; to go deep if there is any alternative.&lt;/p&gt;

&lt;p&gt;I see operationally-minded engineers working cross-functionally with software development teams to help them grow in a few key areas: making outsourcing successful, speeding up time to value, and up-leveling their production chops.&lt;/p&gt;

&lt;p&gt;They’re evolving very crude “build vs. buy” industry arguments (often based on little more than whimsical notions) into sophisticated understandings of how and when to leverage abstractions that radically accelerate development. They build and maintain the bridges that make outsourcing successful.&lt;/p&gt;

&lt;p&gt;They’re evolving release engineering to fulfill the delivery part of CI/CD. Far too many teams are perfectly competent at writing software, yet perfectly remedial when it comes to shipping that software swiftly and safely.&lt;/p&gt;

&lt;p&gt;They’re also up-leveling the production operational skills of software engineers by crafting on-call rotations, counseling teams on instrumentation, and teaching observability. As teams leave behind dated metrics and logs, they start using observability to dig themselves out of the ever-deepening hole created by constantly shipping software they don’t understand to a production system they’ve never understood.&lt;/p&gt;

&lt;p&gt;Everyone needs operational skills, even teams that don’t run any of their own infrastructure. Ops is the constellation of skills necessary for shipping software; it’s not optional. If you ship software, you have operations work that needs to be done. That work isn’t going away. It’s just moving up the stack and becoming more sophisticated, and you might not recognize it.&lt;/p&gt;

&lt;p&gt;I look forward to the improved Lambda Serverless Musical chorus:&lt;/p&gt;

&lt;p&gt;&lt;i&gt;I’m going to improve your… ops.&lt;/i&gt;&lt;br&gt;
&lt;i&gt;Yes, I’m going to improve your… ops!&lt;/i&gt;&lt;/p&gt;

&lt;p&gt;Read more about &lt;a href="https://www.honeycomb.io/blog/observations-on-the-enterprise-of-hiring/?&amp;amp;utm_source=tns&amp;amp;utm_medium=blog-links&amp;amp;utm_campaign=referral&amp;amp;utm_content=observations-on-the-enterprise-of-hiring" rel="noopener noreferrer"&gt;Honeycomb’s hiring methodology&lt;/a&gt;. P.S. &lt;a href="https://jobs.lever.co/honeycomb" rel="noopener noreferrer"&gt;We’re hiring&lt;/a&gt;!&lt;/p&gt;

&lt;p&gt;Join the swarm! Get started with &lt;a href="https://ui.honeycomb.io/signup?&amp;amp;utm_source=Devto&amp;amp;utm_Devto=blog&amp;amp;utm_campaign=referral&amp;amp;utm_keyword=%7Bkeyword%7D&amp;amp;utm_content=the-future-of-ops-careers"&gt;Honeycomb for free&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>career</category>
      <category>observability</category>
      <category>operations</category>
      <category>sre</category>
    </item>
  </channel>
</rss>
