<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: incident.io</title>
    <description>The latest articles on DEV Community by incident.io (@incident_io).</description>
    <link>https://dev.to/incident_io</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F831734%2F3a501d42-0c49-47b8-aad7-3684bf2e9457.jpg</url>
      <title>DEV Community: incident.io</title>
      <link>https://dev.to/incident_io</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/incident_io"/>
    <language>en</language>
    <item>
      <title>3 common pitfalls of post-mortems</title>
      <dc:creator>incident.io</dc:creator>
      <pubDate>Wed, 27 Jul 2022 12:34:33 +0000</pubDate>
      <link>https://dev.to/incident_io/3-common-pitfalls-of-post-mortems-20m0</link>
      <guid>https://dev.to/incident_io/3-common-pitfalls-of-post-mortems-20m0</guid>
      <description>&lt;p&gt;&lt;em&gt;Small confession: we currently use the term 'post-mortem' in incident.io despite preferring the term 'incident debrief'. Unless you have particularly serious incidents, the link to death here really isn't helping anyone. However, we're optimising for familiarity, so we're sticking to the term 'post-mortem' here.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Ask any engineer and they'll tell you that a post-mortem is a positive thing (despite the scary name). Being able to reflect on an incident helps us learn from our mistakes and do better next time. Your return only increases when future engineers and decision makers are able to access the record of events.&lt;/p&gt;

&lt;p&gt;However, one does not simply follow a post-mortem guide and reap the benefits; post-mortems are all too easily executed badly.&lt;/p&gt;

&lt;h2&gt;
  
  
  They can waste time
&lt;/h2&gt;

&lt;p&gt;Most obviously, you might not need a post-mortem at all. It's common to skip post-mortems for low-severity incidents, but for one-in-a-million events or those that are out of your control (eg. a &lt;a href="https://incident.io/blog/third-party-outages"&gt;provider is down&lt;/a&gt;), you may want to apply the same rule. The energy invested in preventing the incident can outweigh the pain incurred by the incident itself.&lt;/p&gt;

&lt;p&gt;Secondly, compiling information can take a lot of human time. I've seen engineers spend hours trying to find the exact timestamps, or pore over the best way to describe what happened. There is pressure to fill in every field of a post-mortem template, as otherwise it won't be signed off as "done".&lt;/p&gt;

&lt;p&gt;With proper tooling, automation can compile the important information, freeing up responders' time to think about the important questions, like:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; What did we learn (eg. we need better observability)?&lt;/li&gt;
&lt;li&gt; What were the mitigators (eg. off-peak timing meant that impact was low)?&lt;/li&gt;
&lt;li&gt; Are there any patterns between this incident and previous incidents (eg. the same service fails easily)?&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  The discussions might not be helpful
&lt;/h2&gt;

&lt;p&gt;To get the most out of a post-mortem, it's important to establish the level on which an incident is reflected on. The classic gotcha is: contributors assume that the post-mortem is a place to work out how to fix X so that it doesn't happen again. But that would be on the lowest level of reflection - you could also look at how to prevent this &lt;em&gt;class&lt;/em&gt; of incidents, or step back and ask if your existing solution needs a whole re-think. There's no point fixing a leaky sink in a burning building. Your post-mortem should help you discover what the problem really is.&lt;/p&gt;

&lt;p&gt;Post-mortems are meant to be blameless, meaning they focus on &lt;em&gt;how&lt;/em&gt; a mistake was made rather than &lt;em&gt;who&lt;/em&gt; made it. But they can easily get too retrospective, focusing on &lt;em&gt;what could have been&lt;/em&gt; if a decision was made differently. If someone has a bee in their bonnet about how a particular service was built, and that service went down, you'll likely end up with a rant. Post-mortem discussions need steering so they are only looking forward.&lt;/p&gt;

&lt;h2&gt;
  
  
  Nothing changes afterwards
&lt;/h2&gt;

&lt;p&gt;A post-mortem normally results in action points. This is great, but where are these tasks meant to sit alongside all the other tickets for this sprint? How do we prioritise them? I've seen entire lists of "Action points" lie dormant in post-mortem documents whilst everyone tries to recover from the incident &lt;em&gt;and&lt;/em&gt; pick up everything else they were meant to get done. Just like any planning meeting, action points should be drawn up into an issue tracker like Linear/Jira, and ongoing work should be re-prioritised if needs be.&lt;/p&gt;

&lt;h2&gt;
  
  
  Some final words
&lt;/h2&gt;

&lt;p&gt;Given the negative tone of most of this, I'll have to round off by reiterating that I do find post-mortems incredibly useful. When the laborious bits are automated, the discussion is steered, and action points are seamlessly plugged into a team's roadmap, what you have left is a brainstorming session on a tricky problem. And that'll take any engineer's fancy.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Tracing Gorm queries with OpenCensus &amp; Google Cloud Tracing</title>
      <dc:creator>incident.io</dc:creator>
      <pubDate>Thu, 07 Jul 2022 14:37:00 +0000</pubDate>
      <link>https://dev.to/incident_io/tracing-gorm-queries-with-opencensus-google-cloud-tracing-3l0k</link>
      <guid>https://dev.to/incident_io/tracing-gorm-queries-with-opencensus-google-cloud-tracing-3l0k</guid>
      <description>&lt;p&gt;At &lt;a href="http://incident.io/" rel="noopener noreferrer"&gt;incident.io&lt;/a&gt; we use &lt;a href="http://gorm.io/" rel="noopener noreferrer"&gt;gorm.io&lt;/a&gt; as the ORM library for our Postgres database, it's a really powerful tool and one I'm very glad for after years of working with hand-rolled SQL in Go &amp;amp; Postgres apps. You may have seen from our &lt;a href="https://incident.io/blog/tracing" rel="noopener noreferrer"&gt;other blog posts&lt;/a&gt; that we're heavily invested in tracing, specifically with Google Cloud Tracing via &lt;a href="https://opencensus.io/" rel="noopener noreferrer"&gt;OpenCensus&lt;/a&gt; libraries. A huge amount of our application's time is spent talking to Postgres through gorm, so we wanted to get a better view of that in our tracing stack.&lt;/p&gt;

&lt;p&gt;Luckily, Gorm has the perfect hooks for pulling out tracing in its &lt;a href="https://gorm.io/docs/write_plugins.html#Callbacks" rel="noopener noreferrer"&gt;Callbacks API&lt;/a&gt;. It lets us provide Gorm with functions that are called at specific points in the query lifecycle, either to change query behaviour in a traditional middleware style or, in our case, to pull out data for observability.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;func beforeQuery(db *gorm.DB) {
    // do stuff!
}

db.Callback().
    Query().
    Before("gorm:query").
    Register("instrumentation:before_query", beforeQuery)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Our goal in this post is to introduce tracing spans to our Gorm queries, in order to do that we need to catch both the start and end events, and deal with the spans accordingly. In these examples I'm going to be using the tracing tools provided by &lt;a href="http://go.opencensus.io/trace" rel="noopener noreferrer"&gt;go.opencensus.io/trace&lt;/a&gt; which feeds into Google Cloud Tracing, but other tracing libraries should behave similarly.&lt;/p&gt;

&lt;p&gt;Now that we have a function that's called when a query starts, we need to begin our span:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;func beforeQuery(db *gorm.DB) {
    db.Statement.Context = startTrace(db.Statement.Context, db)
}

func startTrace(
  ctx context.Context,
  db *gorm.DB,
) context.Context {
    // Don't trace queries if they don't have a parent span.
    if span := trace.FromContext(ctx); span == nil {
        return ctx
    }

    ctx, _ = trace.StartSpan(ctx, "gorm.query")
    return ctx
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And then we need to end the span too:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;func afterQuery(scope *gorm.DB) { endTrace(scope) }

func endTrace(db *gorm.DB) {
    span := trace.FromContext(db.Statement.Context)
    if span == nil || !span.IsRecordingEvents() {
        return
    }

    var status trace.Status

    if db.Error != nil {
        err := db.Error
        if err == gorm.ErrRecordNotFound {
            status.Code = trace.StatusCodeNotFound
        } else {
            status.Code = trace.StatusCodeUnknown
        }

        status.Message = err.Error()
    }
    span.SetStatus(status)
    span.End()
}

db.Callback().
    Query().
    After("gorm:query").
    Register("instrumentation:after_query", afterQuery)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now we can see all of our gorm queries in our traces, nice!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fincident.io%2Fstatic%2F44b33b9cf0a5103364b2c3032213483f%2F74cfa%2Ftraces-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fincident.io%2Fstatic%2F44b33b9cf0a5103364b2c3032213483f%2F74cfa%2Ftraces-1.png" title="some gorm traces" alt="some gorm traces"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;However, they're not super clear about what we're actually doing. Let's see if we can make these spans a little more informative by adding:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  the table name &amp;amp; query fingerprint¹ in the span name&lt;/li&gt;
&lt;li&gt;  the line of code that called us&lt;/li&gt;
&lt;li&gt;  the WHERE params from select queries&lt;/li&gt;
&lt;li&gt;  the number of rows affected&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;¹ A query fingerprint is a unique identifier for a query regardless of formatting &amp;amp; variables, so you can uniquely identify queries that will behave the same in your database.&lt;/p&gt;
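&lt;p&gt;To make the idea concrete, here's a toy sketch of fingerprinting: normalise away formatting and literal values, then hash what's left. This is purely illustrative and our own invention for this post; the real &lt;code&gt;pg_query.Fingerprint&lt;/code&gt; works on the parsed query tree and is far more robust.&lt;/p&gt;

```go
package main

import (
	"crypto/sha1"
	"encoding/hex"
	"fmt"
	"regexp"
	"strings"
)

// fingerprint is a toy illustration of the concept: lowercase the query,
// replace literals with a placeholder, collapse whitespace, then hash.
// Not the pg_query algorithm, just the shape of the idea.
func fingerprint(sql string) string {
	s := strings.ToLower(sql)
	// replace quoted strings and numeric literals with a placeholder
	s = regexp.MustCompile(`'[^']*'|\b\d+\b`).ReplaceAllString(s, "?")
	// collapse all whitespace into single spaces
	s = strings.Join(strings.Fields(s), " ")
	sum := sha1.Sum([]byte(s))
	return hex.EncodeToString(sum[:8])
}

func main() {
	a := fingerprint("SELECT * FROM users WHERE id = 1")
	b := fingerprint("select *\n  from users\n  where id = 42")
	fmt.Println(a == b) // prints true: both spellings share one fingerprint
}
```

&lt;p&gt;Two differently formatted queries with different literal values end up with the same fingerprint, which is exactly what makes it useful as a span name.&lt;/p&gt;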

&lt;p&gt;Let's extend our code from earlier:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;func startTrace(ctx context.Context, db *gorm.DB) context.Context {
    // Don't trace queries if they don't have a parent span.
    if span := trace.FromContext(ctx); span == nil {
        return ctx
    }

    // start the span
    ctx, span := trace.StartSpan(ctx, fmt.Sprintf("gorm.query.%s", db.Statement.Table))

    // set the caller of the gorm query, so we know where in the codebase the
    // query originated.
    //
    // walk up the call stack looking for the line of code that called us. but
    // give up if it's more than 20 steps, and skip the first 5 as they're all
    // gorm anyway
    var (
        file string
        line int
    )
    for n := 5; n &amp;lt; 20; n++ {
        _, file, line, _ = runtime.Caller(n)
        if strings.Contains(file, "/gorm.io/") {
            // skip any helper code and go further up the call stack
            continue
        }
        break
    }
    span.AddAttributes(trace.StringAttribute("caller", fmt.Sprintf("%s:%v", file, line)))

    // add the primary table to the span metadata
    span.AddAttributes(trace.StringAttribute("gorm.table", db.Statement.Table))
    return ctx
}

func endTrace(db *gorm.DB) {
    // get the span from the context
    span := trace.FromContext(db.Statement.Context)
    if span == nil || !span.IsRecordingEvents() {
        return
    }

    // set the span status, so we know if the query was successful
    var status trace.Status
    if db.Error != nil {
        err := db.Error
        if err == gorm.ErrRecordNotFound {
            status.Code = trace.StatusCodeNotFound
        } else {
            status.Code = trace.StatusCodeUnknown
        }

        status.Message = err.Error()
    }
    span.SetStatus(status)

    // add the number of affected rows &amp;amp; query string to the span metadata
    span.AddAttributes(
        trace.Int64Attribute("gorm.rows_affected", db.Statement.RowsAffected),
        trace.StringAttribute("gorm.query", db.Statement.SQL.String()),
    )
    // Query fingerprint provided by github.com/pganalyze/pg_query_go
    fingerprint, err := pg_query.Fingerprint(db.Statement.SQL.String())
    if err != nil {
        fingerprint = "unknown"
    }

    // Rename the span with the fingerprint, as the DB handle
    // doesn't have SQL to fingerprint before being executed
    span.SetName(fmt.Sprintf("gorm.query.%s.%s", db.Statement.Table, fingerprint))

    // finally end the span
    span.End()
}

func afterQuery(scope *gorm.DB) {
    // now in afterQuery we can add query vars to the span metadata
    // we do this in afterQuery rather than the trace functions so we
    // can re-use the traces for non-select cases where we wouldn't want
    // to record the vars as they may contain sensitive data

    // first we extract the vars from the query &amp;amp; map them into a
    // human-readable format
    fieldStrings := []string{}
    if scope.Statement != nil {
        fieldStrings = lo.Map(scope.Statement.Vars, func(v any, i int) string {
            return fmt.Sprintf("($%v = %v)", i+1, v)
        })
    }
    // then add the vars to the span metadata
    span := trace.FromContext(scope.Statement.Context)
    if span != nil &amp;amp;&amp;amp; span.IsRecordingEvents() {
        span.AddAttributes(
            trace.StringAttribute("gorm.query.vars", strings.Join(fieldStrings, ", ")),
        )
    }
    endTrace(scope)
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And now we end up with super-rich, easy-to-scan spans, making it much easier to understand where our app is spending its time. Yay!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fincident.io%2Fstatic%2Ff4652a51aa4d408f9dea2775856c2ed8%2F0a47e%2Ftraces-2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fincident.io%2Fstatic%2Ff4652a51aa4d408f9dea2775856c2ed8%2F0a47e%2Ftraces-2.png" title="some nicer traces!" alt="some nicer traces!"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Gorm offers callbacks for a bunch of different parts of the query lifecycle, and you can add specific behaviour for each of them. We currently trace create, delete, update, and query separately, but if you want to go further, check out the &lt;a href="https://gorm.io/docs/write_plugins.html" rel="noopener noreferrer"&gt;gorm docs&lt;/a&gt;! You can find all of the code from this post &lt;a href="https://gist.github.com/arussellsaw/bbedfdefee119b4600ce085b773da4b9" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Please remember: you could end up tracing sensitive data if you're not careful, so make sure to sanitise your query vars where applicable. One good practice is to only record vars for SELECT queries, since these are typically parameterised by IDs rather than sensitive information.&lt;/p&gt;
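&lt;p&gt;As a sketch of that practice, here's a helper that gates var recording on the statement type. This is our illustration for this post, not a drop-in from the code above:&lt;/p&gt;

```go
package main

import (
	"fmt"
	"strings"
)

// shouldRecordVars gates whether query vars get attached to a span:
// only SELECT statements, whose bound vars are typically IDs rather
// than sensitive payloads. A sketch, not a complete sanitiser.
func shouldRecordVars(sql string) bool {
	return strings.HasPrefix(strings.ToUpper(strings.TrimSpace(sql)), "SELECT")
}

func main() {
	fmt.Println(shouldRecordVars("SELECT * FROM users WHERE id = $1"))     // true
	fmt.Println(shouldRecordVars("INSERT INTO users (email) VALUES ($1)")) // false
}
```

&lt;p&gt;You'd call something like this in &lt;code&gt;afterQuery&lt;/code&gt; before adding the &lt;code&gt;gorm.query.vars&lt;/code&gt; attribute.&lt;/p&gt;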

</description>
      <category>database</category>
      <category>go</category>
      <category>googlecloud</category>
      <category>postgres</category>
    </item>
    <item>
      <title>Handling third-party provider outages</title>
      <dc:creator>incident.io</dc:creator>
      <pubDate>Thu, 30 Jun 2022 13:10:16 +0000</pubDate>
      <link>https://dev.to/incident_io/handling-third-party-provider-outages-7a0</link>
      <guid>https://dev.to/incident_io/handling-third-party-provider-outages-7a0</guid>
      <description>&lt;p&gt;There are a handful of providers that large parts of the internet rely on: Google, AWS, Fastly, Cloudflare. While these providers can boast five or even six nines of availability, they're not perfect and - like everyone - they occasionally go down.&lt;/p&gt;

&lt;h2&gt;
  
  
  Thinking about availability
&lt;/h2&gt;

&lt;p&gt;For customers to get value from your product or service, it has to be &lt;em&gt;available&lt;/em&gt;. That means that all the systems required to deliver the service are working, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Your code or any systems that you control or run&lt;/li&gt;
&lt;li&gt;  Anything that your systems depend on (i.e. third party providers)&lt;/li&gt;
&lt;li&gt;  Whatever your customer needs to access your service (e.g. an internet connection, or mobile)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There's a great paper, &lt;a href="https://queue.acm.org/detail.cfm?id=3096459"&gt;The Calculus of Service Availability&lt;/a&gt;, that applies maths to how people interpret availability. It points out that for a system to provide a certain availability, any third parties it depends on need to have an order of magnitude higher availability (e.g. for a system to provide 99.99%, its dependencies need ~99.999%).&lt;/p&gt;
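&lt;p&gt;A quick back-of-the-envelope sketch makes the point: if every component must be up for your service to work (and failures are independent), availabilities multiply, so each dependency eats into your budget. The numbers below are illustrative.&lt;/p&gt;

```go
package main

import "fmt"

// serialAvailability multiplies the availabilities of components that
// must all be up for the service to work. This assumes independent
// failures and no redundancy - a simplification, but enough to show
// how dependencies consume an availability budget.
func serialAvailability(components ...float64) float64 {
	total := 1.0
	for _, a := range components {
		total *= a
	}
	return total
}

func main() {
	// A four-nines system with two three-nines dependencies can only
	// offer roughly 99.79% overall - the dependencies dominate.
	fmt.Printf("%.4f\n", serialAvailability(0.9999, 0.999, 0.999))
}
```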

&lt;p&gt;In practice, this means that some services need significantly higher availability than others.&lt;/p&gt;

&lt;p&gt;For a consumer-grade service (e.g. an e-commerce site), 99.99% availability is likely sufficient. Above this, the consumer's own dependencies (over which you have no control), such as their internet connection or device, are collectively less reliable. This means that investment to push availability significantly beyond this point isn't particularly valuable.&lt;/p&gt;

&lt;p&gt;By contrast, if you're a cloud provider, your customers are relying on you having a significantly higher availability guarantee so that they can serve their customers while building on top of your platform.&lt;/p&gt;

&lt;h2&gt;
  
  
  How much is availability worth to you?
&lt;/h2&gt;

&lt;p&gt;In general, most consumer systems can afford a small amount of unexpected downtime without world-ending consequences: in fact, most customers won't notice, as their connection and device is likely less reliable. Given achieving more reliability is extremely expensive, it's important you know when to stop, as the time you save can be invested in delivering product features that your customers will genuinely value.&lt;/p&gt;

&lt;p&gt;Multi-cloud is a great example. Multi-cloud is shorthand for building a platform that runs on multiple cloud providers (e.g. AWS, GCP, Azure). This is the only way to be resilient to a full cloud provider outage: you need a whole second cloud provider that you can lean on instead.&lt;/p&gt;

&lt;p&gt;This is an incredibly expensive thing to do. It increases the complexity of your system, meaning that engineers have to understand multiple platforms whenever they're thinking about infrastructure. You become limited to the feature set shared by both cloud providers, meaning you get the full benefits of neither.&lt;/p&gt;

&lt;p&gt;You've also introduced a new component - whatever is doing the routing / load balancing between the two cloud providers. To improve your availability using multi-cloud, this new component has to have significantly higher availability than the underlying cloud providers: otherwise you're simply replacing one problem with another.&lt;/p&gt;

&lt;p&gt;Unless you have very specific needs, you'll do better purchasing high availability products from the best-in-class providers than building your own.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s---sTIVLSM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/nipbogxaz9buh94ai53y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---sTIVLSM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/nipbogxaz9buh94ai53y.png" alt="Image description" width="880" height="786"&gt;&lt;/a&gt;&lt;br&gt;
See the original Tweet &lt;a href="https://twitter.com/mipsytipsy/status/1439430232927182848?ref_src=twsrc%5Etfw%7Ctwcamp%5Etweetembed%7Ctwterm%5E1439430232927182848%7Ctwgr%5E%7Ctwcon%5Es1_&amp;amp;ref_url=https%3A%2F%2Fincident.io%2Fblog%2Fthird-party-outages"&gt;here&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you're interested in reading more, there's a great &lt;a href="https://www.lastweekinaws.com/blog/multi-cloud-is-the-worst-practice/"&gt;write-up&lt;/a&gt; from Corey Quinn on the trade-offs on multi-cloud.&lt;/p&gt;

&lt;h2&gt;
  
  
  Big provider outages are stressful
&lt;/h2&gt;

&lt;p&gt;Being on the receiving end of a big provider outage is stressful: you can be completely down with very limited recovery options apart from 'wait until the provider fixes it'.&lt;/p&gt;

&lt;p&gt;In addition, it's likely that some of your tools are also down as they share dependencies on the third party. When Cloudflare goes down, it takes a large percentage of the internet with it. AWS is the same. That can increase panic and further complicate your response.&lt;/p&gt;

&lt;p&gt;So how should we think about these kinds of incidents, and how do we manage them well?&lt;/p&gt;

&lt;h2&gt;
  
  
  Prioritise the things you can control
&lt;/h2&gt;

&lt;p&gt;Your site is down. Instead of desperately trying to fix things to bring your site back up, you are ... waiting. What should you be doing?&lt;/p&gt;

&lt;h3&gt;
  
  
  Choose high availability products
&lt;/h3&gt;

&lt;p&gt;As we discussed above, availability is something that cloud providers are really very good at. The easiest thing you can usually do to improve availability is to use the products that cloud providers build for exactly this reason.&lt;/p&gt;

&lt;p&gt;Most cloud providers offer multi-zone or multi-region features which you can opt into (for a price) and vastly decrease the likelihood of these outages.&lt;/p&gt;

&lt;h3&gt;
  
  
  Understand the impact
&lt;/h3&gt;

&lt;p&gt;As with all incidents, it's important to understand the impact of the outage on your customers. Take the time to figure out what is and isn't working - perhaps it's not a full outage, but a service degradation. Or there are some parts of your product which aren't impacted.&lt;/p&gt;

&lt;h3&gt;
  
  
  Talk to your customers
&lt;/h3&gt;

&lt;p&gt;If you can, find a way to tell your customers what's going on. Ideally via your usual channels, but if those are down then find another way: social media or even old-fashioned emails.&lt;/p&gt;

&lt;p&gt;Translate the impact into something your customers can easily understand: what can they do, and what can't they do? Where can they follow along (maybe the third party's status page) to find out more?&lt;/p&gt;

&lt;h3&gt;
  
  
  Can you mitigate the impact?
&lt;/h3&gt;

&lt;p&gt;Can you change anything about your infrastructure to bypass the broken component? Provide a temporary gateway for someone to access a particular critical service? Ask someone to email you a CSV file which you can manually process?&lt;/p&gt;

&lt;p&gt;This is your chance to think outside the box: it's likely to be for a short time period so you can do things that won't scale.&lt;/p&gt;

&lt;h3&gt;
  
  
  What's going to happen next?
&lt;/h3&gt;

&lt;p&gt;What's going to happen when the third party outage ends: is it business as usual? Have you got a backlog of async work that you need to get through, which might need to be rate limited? Are you going to have data inconsistencies that need to be reconciled?&lt;/p&gt;

&lt;p&gt;Ideally, you'd have some tried and tested methods for disaster recovery that the team is already familiar with and rehearses frequently (see &lt;a href="https://incident.io/guide/learn-and-improve/practice-or-it-doesnt-count/"&gt;Practice or it doesn't count&lt;/a&gt; for more details).&lt;/p&gt;

&lt;p&gt;In the absence of that, try to forecast as much as you can, and take steps to mitigate the impact. Maybe scale up queues ready for the thundering herd, or apply more aggressive rate limiting. Keep communicating, giving your customers all the information they need to make good decisions.&lt;/p&gt;




&lt;p&gt;After the incident is over, what can we learn?&lt;/p&gt;

&lt;h2&gt;
  
  
  The fallacy of control
&lt;/h2&gt;

&lt;p&gt;Writing a debrief document after a third party outage doesn't feel good:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;What happened? Cloudflare went down&lt;br&gt;
What was the impact? No-one could visit our website&lt;br&gt;
What did we learn? It's bad when Cloudflare goes down 🤷‍♀️&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Incidents that you can control often &lt;em&gt;feel&lt;/em&gt; better than third party incidents where you can't control the outcome. After the incident you can write a post-mortem, learn from it, and get a warm fuzzy feeling that you've improved your product along the way.&lt;/p&gt;

&lt;p&gt;However, in the cold light of day, the numbers are unlikely to support this theory. Unless you have the best SRE team in the world, you aren't going to ship infrastructure products with better availability than a cloud provider.&lt;/p&gt;

&lt;p&gt;Instead, we should again focus on the things that are within our control.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Learn more about post-mortems and incident debriefs in our &lt;a href="https://incident.io/guide/learn-and-improve/post-mortems-and-debriefs"&gt;Incident Management Guide&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  How can we best prepare for these kinds of outages?
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Understand your dependencies
&lt;/h3&gt;

&lt;p&gt;It's pretty stressful to be trying to figure out what is impacted by a third party outage in the middle of an incident. To avoid that, you need to understand the various dependency chains in advance.&lt;/p&gt;

&lt;p&gt;This is tricky to do as a pen-and-paper exercise: often the most reliable way is to spin up a second environment (that customers aren't using) and start turning bits of the system off.&lt;/p&gt;

&lt;p&gt;Once you've got an understanding of the dependencies, when an incident does happen, you'll be able to focus your attention on the relevant parts of your system.&lt;/p&gt;

&lt;p&gt;As part of this, you can also run &lt;a href="https://incident.io/guide/learn-and-improve/practice-or-it-doesnt-count/#game-day-(1-4pm)"&gt;Game days&lt;/a&gt; to help train responders in disaster recovery. These are the exercises which can produce the disaster recovery plans (and familiarity) which can be so valuable when bringing your systems back online.&lt;/p&gt;

&lt;h3&gt;
  
  
  Avoid unnecessary components
&lt;/h3&gt;

&lt;p&gt;Sometimes, often for historic reasons, you'll end up relying on multiple third parties where one would really do the job. Whenever you add a dependency, you reduce your availability. If you can consolidate on fewer, appropriately reliable dependencies, it will significantly improve your overall availability.&lt;/p&gt;

&lt;p&gt;We can also consider blast radius here: are there ways to make some of your product work while a certain provider is down? This doesn't necessarily mean using another provider, but perhaps you could boot service [x] even if service [y] is unavailable.&lt;/p&gt;

&lt;p&gt;Reducing the number of components is likely to reduce your exposure to these kinds of outages.&lt;/p&gt;

&lt;h3&gt;
  
  
  Be honest about availability
&lt;/h3&gt;

&lt;p&gt;Your availability is always, at best, the availability of all your critical providers, combined. Be honest with yourselves and your customers about what a realistic availability target is within those constraints, and make sure your contracts reflect that.&lt;/p&gt;

</description>
      <category>cloud</category>
      <category>sre</category>
      <category>devops</category>
    </item>
    <item>
      <title>Uncovering the mysteries of on-call</title>
      <dc:creator>incident.io</dc:creator>
      <pubDate>Tue, 28 Jun 2022 12:43:44 +0000</pubDate>
      <link>https://dev.to/incident_io/uncovering-the-mysteries-of-on-call-2d59</link>
      <guid>https://dev.to/incident_io/uncovering-the-mysteries-of-on-call-2d59</guid>
      <description>&lt;p&gt;For the vast majority of organisations, some form of round-the-clock cover is critical to successful business operations. On-call is an essential part of an effective incident response process, yet there is no commonly accepted playbook on how to most effectively structure and &lt;a href="https://incident.io/blog/fair-on-call-compensation"&gt;compensate on-callers&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;We ran a survey to uncover the mysteries of how on-call works in organisations of different shapes and sizes around the world.&lt;/p&gt;

&lt;p&gt;You can download the full report below, or read on to get a few of the headlines.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://incident.io/content/uncovering-the-mysteries-of-on-call"&gt;Download full report&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  TL;DR
&lt;/h1&gt;

&lt;p&gt;We had over 200 responses, ranging from globally recognised tech leaders including Google, Amazon and Airbnb, to small start-ups with fewer than 50 employees. Here are the highlights:&lt;/p&gt;

&lt;p&gt;In nearly 70% of organisations, each team is responsible for their own on-call rota, rather than having a single or multiple centralised on-call teams.&lt;/p&gt;

&lt;p&gt;Over 40% of participants were not compensated for on-call. Interestingly, this was more common in larger organisations (+5,000 people) than in small to mid-sized organisations that participated in the survey.&lt;/p&gt;

&lt;p&gt;Where companies did provide compensation, most paid a fixed amount for time spent on-call (e.g. $X per hour, day or week). But the actual dollar amount paid ranged significantly, from $5 to $1,000 per week, with the average weekly rate at $540.&lt;/p&gt;

&lt;p&gt;The most commonly cited on-call challenges were:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Disrupted personal life (30%)&lt;/li&gt;
&lt;li&gt; Dealing with issues without sufficient context or knowledge (24%)&lt;/li&gt;
&lt;li&gt; Lack of sleep (12%)&lt;/li&gt;
&lt;li&gt; False alerts (10%)&lt;/li&gt;
&lt;/ol&gt;

&lt;h1&gt;
  
  
  Need help calculating your on-call compensation?
&lt;/h1&gt;

&lt;p&gt;Our report recommends paying a fixed rate for time spent on-call, calculated down to the minute, regardless of whether or not someone is paged. This helps compensate for the inconvenience and disruption of needing to be available 24/7. From bitter experience, we know that calculating time spent on-call accurately can be tricky, especially when you want to account for weekend rates, holidays and multiple schedules.&lt;/p&gt;

&lt;p&gt;That's why we've just launched an on-call compensation calculator. Just connect your PagerDuty account to incident.io, tell us the rules you use to calculate on-call pay and we'll do the rest.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--sw4N3HPr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://miro.medium.com/max/1200/0%2AmZ7PxvnY3md8dicA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--sw4N3HPr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://miro.medium.com/max/1200/0%2AmZ7PxvnY3md8dicA.png" alt="" width="600" height="397"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You'll be able to automatically generate a report detailing how much on-call pay each member of the team is owed, based on the hours spent holding the pager. Your responders can also see a breakdown of what they're being paid for each shift, making it super transparent for everyone.&lt;/p&gt;

&lt;p&gt;Et voila --- on-call compensation, made easy.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Practical guide to managing incidents</title>
      <dc:creator>incident.io</dc:creator>
      <pubDate>Thu, 16 Jun 2022 14:36:09 +0000</pubDate>
      <link>https://dev.to/incident_io/practical-guide-to-managing-incidents-3o3m</link>
      <guid>https://dev.to/incident_io/practical-guide-to-managing-incidents-3o3m</guid>
      <description>&lt;p&gt;Every company needs a plan for when things go wrong. I've written these plans many times now, and every time I've wished for a reference that reflects the way companies actually work today.&lt;/p&gt;

&lt;p&gt;So here it is --- our many years of collective knowledge and experience distilled into a &lt;a href="https://incident.io/guide/"&gt;&lt;em&gt;practical guide to incident management&lt;/em&gt;&lt;/a&gt; for your whole organisation.&lt;/p&gt;




&lt;p&gt;If you're looking for a quick entry point, or a round-up of the key points in the guide, read on. See &lt;a href="https://incident.io/guide/"&gt;here&lt;/a&gt; for the full guide.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;a href="https://incident.io/guide/on-call/"&gt;On-call&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;On-call isn't just for engineers:&lt;/strong&gt; consider who else you might need in an emergency; they should be on-call too.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Invest in your training process:&lt;/strong&gt; onboarding new on-callers well is critical. Each on-call rota should define a clear path to becoming ready to 'be on call', including learning domain-specific content as well as your incident response process.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pay anyone that's on-call:&lt;/strong&gt; compensate them for the inconvenience of being on-call. We recommend paying per hour spent on-call, and adjusting your compensation based on expectations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Be compassionate and understanding&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Allow on-call teams to define their own schedules that best suit them&lt;/li&gt;
&lt;li&gt;  Use overrides to give on-callers flexibility, and relieve pressure when things get tough&lt;/li&gt;
&lt;li&gt;  Look out for anyone taking too much of the burden&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;a href="https://incident.io/guide/foundations/"&gt;Foundations&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Create a shared understanding of an incident:&lt;/strong&gt; an incident is anything that takes you away from planned work.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Declare more incidents&lt;/strong&gt;: using your incident process frequently means that, when things go really wrong, your processes will run like a well-oiled machine.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use 3--5 human-named severities&lt;/strong&gt;: plain-English words such as minor, major and critical are easier for everyone to understand.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Every incident should have a lead&lt;/strong&gt;: whether there's one responder or 30, someone has to play the lead role and drive the incident to a resolution.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Only use the roles you really need&lt;/strong&gt;: you can often lean on actions (and your incident lead) to understand who's doing what.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;a href="https://incident.io/guide/response/"&gt;Response&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;When you declare an incident:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Create a fresh space which you can use to co-ordinate your response.&lt;/li&gt;
&lt;li&gt;  Announce the incident in a shared space so everyone's in the loop&lt;/li&gt;
&lt;li&gt;  Assemble the team that you need to start investigating&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;As you respond to an incident:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Identify what's broken &amp;amp; understand the impact&lt;/li&gt;
&lt;li&gt;  Mitigate the immediate impact&lt;/li&gt;
&lt;li&gt;  &lt;em&gt;Take a pause&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;  Resolve the issue&lt;/li&gt;
&lt;li&gt;  Close everything off, and assign follow-up actions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Send regular, easy to digest, internal updates:&lt;/strong&gt; using a predictable format helps busy stakeholders get the context that they need. Long gaps between updates can cause confusion or stress.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Show your working:&lt;/strong&gt; document your response in an incident channel, even if you're the only one there. It'll help you avoid bad assumptions or mistakes, and helps your team learn from what you've done.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Keep your customers in the loop:&lt;/strong&gt; clear and frequent communication builds trust, and can turn a negative into a positive. Use simple language, tell everyone what you're doing and what they should do in the meantime.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Structure your thinking&lt;/strong&gt;: use questions and theories to methodically work through a problem, being clear about any assumptions you make along the way.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Calm is contagious&lt;/strong&gt;: take breaks and keep everyone well fed so your incident response can stay on track, even on the bad days.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When you're remote, over-communicate&lt;/strong&gt;: to avoid a fragmented response, make sure everything is in one place (the incident channel) and it's really clear who's doing what.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;a href="https://incident.io/guide/learn-and-improve/"&gt;Learn and Improve&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Hold a debrief when there's value&lt;/strong&gt;: the responders for an incident should have a good idea whether a debrief will be valuable. If debriefs become mandatory 'red tape', they'll become a useless checkbox exercise.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Make debriefs truly blameless&lt;/strong&gt;: start with the assumption that everyone came to work to do their best, and don't hold individuals accountable for systematic failures.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Value the conversation over the artifact&lt;/strong&gt;: having a document is a useful way to share knowledge asynchronously, but the most valuable part of a debrief is usually the conversation that precedes it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use incidents to level up your team&lt;/strong&gt;: they broaden your horizons and teach you how to build resilient systems. Bring junior members into incidents, so that your teams get the full value from them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Be transparent&lt;/strong&gt;: building a transparent culture means that stakeholders and customers will trust you and give you space to fix what's broken.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Practice your incident process&lt;/strong&gt;: just like any other skill, practice makes perfect. Dry-run your incident process regularly to get everyone up-to-speed and find the rough edges while the stakes are low.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Declare early, declare often: why you shouldn’t hesitate to raise an incident</title>
      <dc:creator>incident.io</dc:creator>
      <pubDate>Mon, 06 Jun 2022 16:02:52 +0000</pubDate>
      <link>https://dev.to/incident_io/declare-early-declare-often-why-you-shouldnt-hesitate-to-raise-an-incident-44kp</link>
      <guid>https://dev.to/incident_io/declare-early-declare-often-why-you-shouldnt-hesitate-to-raise-an-incident-44kp</guid>
      <description>&lt;p&gt;My first &lt;a href="http://incident.io/"&gt;incident.io&lt;/a&gt;-incident happened in my second week here, when I screwed up the process for requesting extra Slack permissions, which made it impossible to install our app for a few minutes. This was a bit embarrassing, but also simple to resolve for someone more familiar with the process, and declaring an incident meant we got there in just a few minutes.&lt;/p&gt;

&lt;p&gt;Declaring the first incident when you start a new job can be intimidating, but it really shouldn't be. Let's look at some common fears, and work out how to address them.&lt;/p&gt;

&lt;h2&gt;
  
  
  "Won't it be loads of work?"
&lt;/h2&gt;

&lt;p&gt;Most organisations have some kind of &lt;a href="https://incident.io/blog/three-pillars-of-great-incident-response"&gt;incident response&lt;/a&gt; procedure that will have a few things you have to do, like working out how many customers this affected, deciding how to make things right with them, and putting in proactive measures to avoid this happening again.&lt;/p&gt;

&lt;p&gt;That can seem like a daunting prospect. Is this issue &lt;em&gt;really worth all that&lt;/em&gt;?&lt;/p&gt;

&lt;p&gt;A safe default should be "yes". If it turns out this issue wasn't so bad after all you should be able to shut down the response pretty quickly. But if it escalates further, you'll be glad you already have the process rolling.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;You can find out more about how to &lt;a href="https://incident.io/blog/workflows-your-process-automated"&gt;automate your incident process&lt;/a&gt; in our previous article.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If it does end up being a lot of work, that isn't necessarily a bad thing. As an individual, you're likely to &lt;a href="https://incident.io/blog/incidents-made-me-a-better-engineer"&gt;learn a lot from tackling your first big incident&lt;/a&gt;. As a team or company you've addressed a serious issue proactively.&lt;/p&gt;

&lt;h2&gt;
  
  
  "Will this get me or my team in trouble?"
&lt;/h2&gt;

&lt;p&gt;If the answer is "yes" you should probably be looking for another job.&lt;/p&gt;

&lt;p&gt;As Chris discussed before, &lt;a href="https://incident.io/blog/why-more-incidents-is-no-bad-thing"&gt;having more incidents is not a bad thing&lt;/a&gt; --- it's not in the long-term interests of any organisation to brush small incidents under the rug, because you can never tell which might turn into huge problems later on.&lt;/p&gt;

&lt;p&gt;The same logic applies here too! Most teams that don't have any incidents are either not taking any risks and slowing down delivery, or hiding their problems. Neither of those is a sign of a healthy team.&lt;/p&gt;

&lt;p&gt;That's not to say that managers should set a target number of incidents per team per quarter, but it does mean that managers should be looking at outlier teams that have very &lt;em&gt;few&lt;/em&gt; incidents, as well as those that have more than their fair share. Are they afraid of getting blamed? Are they spending so long making sure everything they deliver is perfectly robust that they forget about their customers? Did they struggle to respond to an earlier incident effectively and need extra help learning how it's done well?&lt;/p&gt;

&lt;h2&gt;
  
  
  "I've never done this before"
&lt;/h2&gt;

&lt;p&gt;There will always be a first time, and it's probably better if the first incident you run isn't a critical "everything is down" one. A major incident is stressful enough without having to learn about your organisation's response processes at the same time.&lt;/p&gt;

&lt;p&gt;Game days, where you run a pretend incident in a non-production environment, are great for learning your incident response process (and honing your debugging skills!), but you have to apply that knowledge to something real sooner or later.&lt;/p&gt;




&lt;p&gt;Even with &lt;a href="https://incident.io/blog/postmortems-with-gergely-orosz"&gt;blameless post-mortems&lt;/a&gt; and very well-run incident response processes, I've seen teams who decide that downtime on a key product isn't an incident because it hadn't been down for &lt;em&gt;that long&lt;/em&gt; and probably no one had noticed.&lt;/p&gt;

&lt;p&gt;When joining a team, especially as a more experienced engineer, part of the experience you bring is about the different cultures you've worked in, including how you respond when things go wrong. This might feel uncomfortable (shifting the culture of a team always does), but the payoff is absolutely worth it.&lt;/p&gt;

</description>
      <category>sre</category>
      <category>devops</category>
    </item>
    <item>
      <title>Now you see me, now you don't: feature-flagging with LaunchDarkly at incident.io</title>
      <dc:creator>incident.io</dc:creator>
      <pubDate>Thu, 02 Jun 2022 15:23:28 +0000</pubDate>
      <link>https://dev.to/incident_io/now-you-see-me-now-you-dont-feature-flagging-with-launchdarkly-at-incidentio-4i0</link>
      <guid>https://dev.to/incident_io/now-you-see-me-now-you-dont-feature-flagging-with-launchdarkly-at-incidentio-4i0</guid>
      <description>&lt;p&gt;At &lt;a href="http://incident.io/"&gt;incident.io&lt;/a&gt;, we ship &lt;em&gt;fast&lt;/em&gt;. We're talking multiple times a day, every day (yes, including Fridays). Once I merge a pull request (PR), my changes rocket their way into production without me lifting a finger. 💅 It's when we tackle larger projects that this becomes a bit more complicated.&lt;/p&gt;

&lt;p&gt;We recently launched &lt;a href="https://incident.io/changelog/2021-11-02-announcing-announcements"&gt;Announcement Rules&lt;/a&gt;, which let you configure which channels incident announcements are posted in depending on criteria you define. That piece of work took us a few weeks, split over multiple PRs, and we couldn't put a half-finished feature in front of customers.&lt;/p&gt;

&lt;p&gt;It's not a question of putting everything in one PR; the larger a PR gets, the riskier it becomes, as it's more difficult for someone to review confidently and a lot more work to roll back if something goes wrong. We also don't want to work on separate branches for too long, as we need to make sure everything we're building works together seamlessly, and working on a branch for a long time will leave us with a nightmare of merge conflicts when we do finally merge the branch back in.&lt;/p&gt;

&lt;p&gt;We need to be able to deploy tiny bits of a feature consistently, building up to the finished product, and be able to hide those bits from our customers until the feature is ready for its debut. And feature-flagging - the process of enabling and disabling features programmatically - allows us to do exactly that.&lt;/p&gt;

&lt;h2&gt;
  
  
  Anatomy of a feature flag
&lt;/h2&gt;

&lt;p&gt;In a feature-flagging system, we can control visibility of specific features (in our case, Announcement Rules) by switching them on or off.&lt;/p&gt;

&lt;p&gt;The most common kind of flag is a boolean flag, which can either be enabled or disabled. In its simplest implementation we might have a boolean variable &lt;code&gt;announcementRulesEnabled&lt;/code&gt; which we initialise to &lt;code&gt;false&lt;/code&gt; in production, but set to &lt;code&gt;true&lt;/code&gt; in development so that only we can see it while we work on it. Our app code would check the value of this variable and only display the new feature if it was &lt;code&gt;true&lt;/code&gt;. Once the feature's ready for release, we can set the variable's value to &lt;code&gt;true&lt;/code&gt; in production, thereby enabling the flag.&lt;/p&gt;
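&lt;p&gt;In code, the simplest version of that flag is nothing more than a lookup with a safe default. This sketch is purely illustrative, not how our real system works:&lt;/p&gt;

```go
package main

import "fmt"

// flags maps flag names to on/off state. In production this would be
// backed by config or a flag service; a map keeps the idea visible.
var flags = map[string]bool{
	"announcement-rules": false, // hidden until launch day
}

// announcementRulesEnabled gates the new feature. Unknown flags fall
// back to the map's zero value, false, so unfinished work stays hidden.
func announcementRulesEnabled() bool {
	return flags["announcement-rules"]
}

func main() {
	fmt.Println(announcementRulesEnabled()) // false: feature hidden

	// Launch day: enable the flag and the feature appears.
	flags["announcement-rules"] = true
	fmt.Println(announcementRulesEnabled()) // true: feature visible
}
```

&lt;p&gt;Note the default: a missing or misspelled flag name reads as disabled, which is the safe failure mode for half-finished features.&lt;/p&gt;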


&lt;p&gt;There are a few ways of approaching feature-flagging, including building your own system, or using a third-party one. Having used a few custom-built feature flag systems over the years, I find it's almost always better to buy than to build - I strongly believe you should build the systems that differentiate you from your competition, and buy the ones that don't.&lt;/p&gt;

&lt;p&gt;Feature-flagging is not a differentiator for us, and having another system to maintain would end up costing us more in the long term - especially as you add more dimensions to the flags, such as percentage-based or attribute-based rollouts ("enable this for all users who have this app version"). That's why we opted to use &lt;a href="https://launchdarkly.com/"&gt;LaunchDarkly&lt;/a&gt;, a feature management platform-as-a-service.&lt;/p&gt;

&lt;h2&gt;
  
  
  Enter LaunchDarkly
&lt;/h2&gt;

&lt;p&gt;Now, I've had plenty of experience integrating third-party libraries into apps, and fighting with poor documentation and strange library behaviour. But we managed to get LaunchDarkly implemented across the backend and frontend in a &lt;em&gt;single day&lt;/em&gt;. It completely surpassed my expectations, and has made the entire feature-flagging process totally seamless. (I promise they're not paying me to say this. It's so good.)&lt;/p&gt;

&lt;p&gt;Helpfully, LaunchDarkly provide SDKs for loads of different platforms and languages, so we were able to wrap their &lt;a href="https://docs.launchdarkly.com/sdk/server-side/go"&gt;Go SDK&lt;/a&gt; in our own &lt;code&gt;featureflags&lt;/code&gt; package which enables us to quickly and easily query a user's feature flags. We have three environments set up on the LaunchDarkly dashboard - development (local), staging and production - and when you create a flag in one, it automatically gets populated across all of them.&lt;/p&gt;

&lt;p&gt;Using a wrapper for the SDK allows us to reduce repetition and keep the list of flag names in one place (no passing in typo-riddled strings, thanks). To make this even more robust, we plan to add a check in CircleCI to compare the list of flags we get back from LaunchDarkly with the constants in the &lt;code&gt;featureflags&lt;/code&gt; library, to make sure they actually exist in the dashboard.&lt;/p&gt;

&lt;p&gt;When the backend starts up we initialise the LaunchDarkly client as a singleton. Our code can then pass in the currently authenticated user's user ID and organisation ID to the library to get back a scoped client which we can query directly for specific flags.&lt;/p&gt;
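&lt;p&gt;A stripped-down version of that wrapper might look like the following. The flag name, types and in-memory backend here are hypothetical stand-ins for our real &lt;code&gt;featureflags&lt;/code&gt; package and the LaunchDarkly SDK:&lt;/p&gt;

```go
package main

import "fmt"

// Flag names live in one place, so call sites can't pass
// typo-riddled strings. The name below is illustrative only.
const FlagAnnouncementRules = "announcement-rules"

// provider abstracts the flag backend (LaunchDarkly in our case),
// which also lets tests inject a fake.
type provider interface {
	BoolFlag(flag, userID, orgID string) bool
}

// Client is initialised once at startup and shared as a singleton.
type Client struct{ p provider }

func NewClient(p provider) Client { return Client{p: p} }

// Scoped binds the client to the authenticated user and organisation,
// so handlers can query flags without re-passing identity every time.
func (c Client) Scoped(userID, orgID string) Scoped {
	return Scoped{p: c.p, userID: userID, orgID: orgID}
}

type Scoped struct {
	p             provider
	userID, orgID string
}

func (s Scoped) Enabled(flag string) bool {
	return s.p.BoolFlag(flag, s.userID, s.orgID)
}

// mapProvider is a toy in-memory backend enabling flags per org.
type mapProvider map[string]map[string]bool // flag -> org -> enabled

func (m mapProvider) BoolFlag(flag, userID, orgID string) bool {
	return m[flag][orgID] // missing entries default to false
}

func main() {
	client := NewClient(mapProvider{
		FlagAnnouncementRules: {"org_beta": true},
	})
	fmt.Println(client.Scoped("u1", "org_beta").Enabled(FlagAnnouncementRules))  // true
	fmt.Println(client.Scoped("u2", "org_other").Enabled(FlagAnnouncementRules)) // false
}
```

&lt;p&gt;Because the backend sits behind an interface, exercising both the enabled and disabled code paths in tests is a one-line swap of the provider.&lt;/p&gt;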

&lt;p&gt;This wrapper allows us to inject a feature flag client, making it easy to inject a mock client for testing. Whenever you're building features behind a flag, it's really important to test both the enabled and disabled versions of the code, to make sure both paths work as expected. I can think of many times in the past where we've written extensive tests for a new feature, but not covered the case where it's turned off (and subsequently realised that path is completely broken, and we had no idea because we were all using our local environments with the flag enabled).&lt;/p&gt;

&lt;p&gt;In our web UI, we use the &lt;a href="https://docs.launchdarkly.com/sdk/client-side/react"&gt;LaunchDarkly React SDK&lt;/a&gt;. Once a user logs in, we initialise the LaunchDarkly client with that user's ID, organisation ID, and any other attributes we want to be able to feature-flag against. The LaunchDarkly SDK gives us a handy React hook to easily access a user's flags, so we can make sure anyone who isn't feature-flagged won't see the Announcements link in the settings menu.&lt;/p&gt;

&lt;h2&gt;
  
  
  Reaping the benefits
&lt;/h2&gt;

&lt;p&gt;Implementing feature flagging has enabled us to continue moving quickly while still staying in control of what we put in front of customers. Most of what we do is on the organisation level rather than targeting individual users, so we're much more likely to turn features on for specific organisations. It means we can ask a subset of our customer organisations to beta test a feature, and get fast feedback before we launch to everyone.&lt;/p&gt;

&lt;p&gt;Right now we're relying on boolean flags, but as the complexity of our organisation and customer base grows, so will the complexity of our flags. It's reassuring to know we won't need to make changes to our implementation to accommodate that. LaunchDarkly gives us the flexibility to add custom attributes for users when we initialise the client; in the future we could feature-flag against anything we wanted, and use string-based or even JSON-based flags if we needed to.&lt;/p&gt;

&lt;p&gt;I suppose you could say we're looking forward to hiding even more new and exciting features from our customers... not least for the intensely satisfying experience of ceremonially flipping a switch to release something new!&lt;/p&gt;

</description>
      <category>devops</category>
      <category>tooling</category>
      <category>webdev</category>
    </item>
    <item>
      <title>How to empower your team to own incident response</title>
      <dc:creator>incident.io</dc:creator>
      <pubDate>Mon, 23 May 2022 13:57:33 +0000</pubDate>
      <link>https://dev.to/incident_io/how-to-empower-your-team-to-own-incident-response-341i</link>
      <guid>https://dev.to/incident_io/how-to-empower-your-team-to-own-incident-response-341i</guid>
      <description>&lt;p&gt;Responding to and managing incidents feels fairly straightforward when you're in a small team. As your team grows, it becomes harder to figure out the ownership of your services, especially during critical times. In those moments, you need everyone to know exactly what their role is in order to recover fast.&lt;/p&gt;

&lt;p&gt;Moving to &lt;a href="http://incident.io/"&gt;incident.io&lt;/a&gt; as the 7th engineer, from a scaleup of around 70 engineers, has given me a new perspective on what it means to own your code. Switching from somewhere with a centralised platform team who hold the pager and coordinate response, to being part of the on-call rotation has been really eye-opening.&lt;/p&gt;

&lt;p&gt;Previously I worked in a team that was working towards a &lt;em&gt;you build it, you run it&lt;/em&gt; approach. The intention was to give delivery teams autonomy and control over their own services -- from creation all the way through to monitoring. As delivery teams grew, incident response became complicated. When the first responders didn't have context on the code it could be really hard to identify the root of a problem. This is magnified as systems become more varied and complex. Incident channels could have 20 people in before anyone relevant was identified and looped in. That's a lot of people interrupted from their flow without adding much value.&lt;/p&gt;

&lt;p&gt;Personally, having a disconnect between the code I wrote and what was going on in production gave me a lack of confidence. I felt like I didn't understand the platform or how it worked since I didn't look at my code's journey beyond clicking "merge". It's easy to not fully understand the importance of quality and reliability when your bugs aren't waking you up at night.&lt;/p&gt;

&lt;p&gt;It's &lt;em&gt;really&lt;/em&gt; hard to get engineers excited about ownership when it's been someone else's job for a long time. Teams feel uncomfortable about the idea of their 9-5 job description changing, evening and weekend pagers are scary and companies understandably do not want to come across as adding extra responsibilities without clear communication of why this will actually save time.&lt;/p&gt;

&lt;p&gt;It's hard but worthwhile to push past this. The core sense of ownership is vital to maintaining an empowered and skilled workforce of engineers that care about what they build. These are some suggestions for what can work to get engineers back to being excited about ownership.&lt;/p&gt;

&lt;h2&gt;
  
  
  Start small
&lt;/h2&gt;

&lt;p&gt;The idea of 24/7 support is scary. When you're used to being shielded it's easy to feel like you can't deal with incidents independently, especially at 4am.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Find the path (or team) of least resistance.
&lt;/h3&gt;

&lt;p&gt;Is there a team that has a good monitoring approach already? Or a smaller product area with a clear sense of ownership? Even a team with just a few engineers who feel excited about the idea. It's easy to get the momentum going when there's a head start.&lt;/p&gt;

&lt;p&gt;It's vital to encourage this publicly and loudly to start a culture change where other teams follow the example.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Don't drop people in the deep end.
&lt;/h3&gt;

&lt;p&gt;Handing over a pager 9-5 is a great way to test the sensitivity of alerts without consequences. It's nice to have the security of dealing with things when others are around, to give you the confidence to take it alone in the future.&lt;/p&gt;

&lt;p&gt;Define clear expectations that teams should handle their code within working hours. Document what support is available, and where people can turn if things go wrong. Setting up pairs of engineers to collaborate for the first couple of rotations can go a long way in making things feel less scary.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Keep it simple
&lt;/h3&gt;

&lt;p&gt;Declaring and managing incidents should be easy. If your team uses Slack, make sure they don't have to leave it in order to coordinate response. Using an incident management tool like &lt;a href="https://incident.io/"&gt;incident.io&lt;/a&gt; means people can raise incidents, deal with them, and escalate directly from Slack.&lt;/p&gt;

&lt;p&gt;Configuring an expected incident flow through a tool like this goes a long way in supporting engineers to make the right decisions under pressure, removing the &lt;em&gt;"am I doing this right"&lt;/em&gt; worry. Maintaining a clear timeline of what decisions were made in previous incidents and when is a great resource to show people scenarios they might encounter as they onboard into incident response. Keep note of a few past incidents, they're great learning resources for newer engineers to follow.&lt;/p&gt;

&lt;h2&gt;
  
  
  Remove the noise
&lt;/h2&gt;

&lt;p&gt;When pagers aren't involved, it's easy to let alerting get messy. I've missed urgent issues before because of a lack of clarity and trust in what is pinging us.&lt;/p&gt;

&lt;p&gt;Invest in streamlining your alerts to &lt;em&gt;only&lt;/em&gt; include things that are "&lt;em&gt;drop everything&lt;/em&gt;" moments.&lt;/p&gt;

&lt;p&gt;Make sure your alerting process is easy to understand and modify. I've been in, and contributed to, situations where monitoring gets messier and messier because people just aren't empowered to fix it themselves.&lt;/p&gt;

&lt;p&gt;Encourage teams to get together and decide what they really care about, and give them time to implement those thoughts in their paging process.&lt;/p&gt;

&lt;p&gt;Google's &lt;a href="https://sre.google/sre-book/monitoring-distributed-systems/#xref_monitoring_golden-signals"&gt;4 golden signals&lt;/a&gt; are a great place to start with knowing what to monitor.&lt;/p&gt;

&lt;h2&gt;
  
  
  Empower everyone
&lt;/h2&gt;

&lt;p&gt;There are naturally some people in each team that are the firefighters. Perhaps they have the most legacy knowledge, understand the platform, and are the first to turn to when things go wrong. You need to be actively working away from relying on these roles and knowledge silos for code ownership to really work. The responsibility to take the pager and triage incidents coming in &lt;em&gt;must&lt;/em&gt; be baked into the culture of the team. Focus on empowering the more hesitant members. The aim should be that anyone and everyone in a team can make sensible decisions about how to deal with incidents on the fly.&lt;/p&gt;

&lt;p&gt;A good support network and process to follow can really help here. Who can someone turn to if they're stuck? What should the default flow be? How can they get others involved?&lt;/p&gt;

&lt;p&gt;These are all questions I'd want well documented before being responsible for my team's code out of hours.&lt;/p&gt;

&lt;h2&gt;
  
  
  In conclusion
&lt;/h2&gt;

&lt;p&gt;Reaching an effective culture of end-to-end code ownership is difficult. Big cultural change can be uncomfortable. It's crucial to be empathetic and understand that technical solutions will only go so far. Keeping people at the heart of the approach is essential to getting empowered supporters behind you and making a real difference.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>sre</category>
      <category>startup</category>
    </item>
    <item>
      <title>Incident postmortem pitfalls</title>
      <dc:creator>incident.io</dc:creator>
      <pubDate>Mon, 09 May 2022 15:25:50 +0000</pubDate>
      <link>https://dev.to/incident_io/incident-postmortem-pitfalls-55el</link>
      <guid>https://dev.to/incident_io/incident-postmortem-pitfalls-55el</guid>
      <description>&lt;p&gt;We spent some time talking to &lt;a href="https://twitter.com/GergelyOrosz"&gt;Gergely Orosz&lt;/a&gt; about our thoughts on what happens when an incident is over, and you're looking back on how things went.&lt;/p&gt;

&lt;p&gt;If you haven't read it already, grab a coffee, get comfortable, and read Gergely's full article &lt;a href="https://blog.pragmaticengineer.com/postmortem-best-practices/"&gt;Postmortem Best Practices&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;But before you do that, here's some bonus material on some of our points.&lt;/p&gt;

&lt;h1&gt;
  
  
  When blameless postmortems actually aren't
&lt;/h1&gt;

&lt;p&gt;I'm sure we can all recall a time when we were sitting in an incident debrief, walking through the timeline, and we've reached the critical point where 'someone' pushed the button that triggered the cascade of events that led to the incident. We all talk in hypotheticals about how 'someone' should have read the docs or consulted someone else, and how if that had happened we wouldn't all be here at all.&lt;/p&gt;


&lt;p&gt;This is unequivocally not how good debriefs happen. Being blameless doesn't mean we need to tiptoe around the problem and avoid using names. It's more nuanced than that.&lt;/p&gt;

&lt;p&gt;It's about the starting point for an investigation being based on the premise that everyone arrived at work on that day to do all the right things.&lt;/p&gt;

&lt;p&gt;It's about starting from an assumption of good intent and sharing personal accounts of the events that unfolded. They might well have been the person that pushed the button, but why did that make sense to them in this instance? How many times has the button been pushed where everything worked exactly as intended?&lt;/p&gt;

&lt;p&gt;If we understand the specific motivations of the folks who were there when this was happening, we stand to learn the most about the situation, and ultimately turn that into actionable follow-ups or knowledge that can be shared.&lt;/p&gt;

&lt;h1&gt;
  
  
  Incidents are always going to happen again
&lt;/h1&gt;

&lt;p&gt;If you've spent time in incident debriefs, especially big ones with senior leaders, you'll likely be familiar with questions like "how are we going to prevent incidents like this from happening in future?". Cue a room full of engineers rolling their eyes.&lt;/p&gt;


&lt;p&gt;There is a class of incident where we &lt;em&gt;can&lt;/em&gt; reasonably expect the likelihood of recurrence to be almost zero. If a disk on a server fills and brings down our service, we can add both technical controls to prevent this happening, and detective alerts that'll warn us if we're close to having a similar issue. It's going to be hard (though not impossible!) for that same incident to happen again, and everyone walks away from the debrief happy.&lt;/p&gt;

&lt;p&gt;Now take the scenario where a feature of a system you didn't know about behaved in a way you didn't expect, and put you in a situation you couldn't foresee. How do you prevent &lt;em&gt;that scenario&lt;/em&gt; from happening again? By virtue of fixing the issue during the incident, we learned something we didn't know, and we can put some controls in place to reduce the likelihood of that specific thing happening again. But what about the hundred other features of that system we don't know about? Do we prioritise a deep dive on the system to understand everything? And once we've done that, how many other systems do we need to do the same on?&lt;/p&gt;

&lt;p&gt;The point here isn't that we should throw our hands in the air and give up. Everyone wants to drive towards better service, fewer incidents and happier customers, but you need to get comfortable with the fact that you can't prevent everything. Trying to do so will likely tie you in knots on low-value work, with little to no guarantee that it'll actually pay off.&lt;/p&gt;

&lt;p&gt;Ultimately, by fixing the issue (ideally using &lt;a href="https://incident.io/"&gt;incident.io&lt;/a&gt;, and out in the open) you've already done the best thing you can to stop this happening again; you've learned something.&lt;/p&gt;

&lt;h1&gt;
  
  
  Take time before you commit to all the actions
&lt;/h1&gt;

&lt;p&gt;It's easy to get carried away in a debrief and generate 37 action items to tackle that 5 minutes of downtime you experienced. Incidents shine a light on a particular problem and combined with recency bias (i.e. this is most important because it's fresh in my memory), it's easy to get lured into prioritising a bunch of work that really shouldn't be done.&lt;/p&gt;

&lt;p&gt;The sad reality is that there's always more that can be done in pretty much every corner of everything we build. But it's important we approach things with perspective and avoid letting the pressure and spotlight on this incident drive us to commit to arguably low-value work.&lt;/p&gt;

&lt;p&gt;The best solution we've found is to introduce a mandatory time gap --- "soak time" --- to let those ideas percolate, and the more rational part of your brain figure out whether they really are the best use of your time.&lt;/p&gt;

&lt;h1&gt;
  
  
  Incidents as a process, not an artefact
&lt;/h1&gt;

&lt;p&gt;Perhaps one of my biggest gripes in the post-incident flow is organisations that value the written incident artefact over all else. As the famous Eisenhower quote goes, "Plans are nothing; planning is everything", and the same is &lt;em&gt;mostly&lt;/em&gt; true of incidents.&lt;/p&gt;

&lt;p&gt;Plans are nothing; planning is everything.&lt;/p&gt;

&lt;p&gt;The postmortem/debrief artefact isn't quite 'nothing', but in our experience, these reports are typically not written to be read or to convey knowledge; instead, they're there to tick a box. The folks in Risk and Compliance need to know that we've asked the five whys and written down the exact times that everything happened, because that's how the risks are controlled.&lt;/p&gt;

&lt;p&gt;Personal experiences aside (😅), this is actually pretty common, and if you find yourself here, it's useful to remember that --- documents aside --- the process of running debriefs is itself a perfectly effective way to get your money's worth out of incidents.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>sre</category>
    </item>
    <item>
      <title>The three pillars of great incident response</title>
      <dc:creator>incident.io</dc:creator>
      <pubDate>Wed, 04 May 2022 14:52:48 +0000</pubDate>
      <link>https://dev.to/incident_io/the-three-pillars-of-great-incident-response-e8a</link>
      <guid>https://dev.to/incident_io/the-three-pillars-of-great-incident-response-e8a</guid>
      <description>&lt;p&gt;There's no one-size-fits-all incident response process. Depending on your organisation's shape and size, you'll have different requirements and priorities.&lt;/p&gt;

&lt;p&gt;But the same three pillars form the core of any good process, whether it's for the largest e-commerce giant or a scrappy SaaS startup.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;For transparency: we are a Slack-native incident response and management tool that helps hypergrowth companies to automate incident processes, focus on fixing the issue, and learn from incident insights to improve site reliability and fix vulnerabilities. Learn more and see how it works on &lt;/em&gt;&lt;a href="https://incident.io/"&gt;&lt;em&gt;incident.io&lt;/em&gt;&lt;/a&gt;&lt;em&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  1. Clarity
&lt;/h1&gt;

&lt;p&gt;You can't fix a problem you don't understand. An incident response team needs to have a clear and shared understanding of what the problem is, and the steps they are taking to resolve it.&lt;/p&gt;

&lt;p&gt;To achieve clarity during an incident, it's important to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Have clear roles so everyone knows who is responsible for what. Context switching between resolving the issue and communicating with stakeholders is tiring and challenging: let everyone do their one job, and do it well.&lt;/li&gt;
&lt;li&gt;  Stay focussed on the problem at hand: don't get distracted by unrelated issues that you discover (ticket them up to look at later).&lt;/li&gt;
&lt;li&gt;  Use actions so everyone knows what is in-flight: this avoids duplication of effort and helps people provide useful context at the right time.&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  2. Transparency
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Default to transparency
&lt;/h2&gt;

&lt;p&gt;Incidents should be public by default and only made private when absolutely necessary.&lt;/p&gt;

&lt;p&gt;Being transparent means that everyone knows what's happening, as it is happening. That unlocks access to all the context and skills in the whole of your organisation, rather than relying on the incident reporter to pull in all the people that they need. If someone sees something that might be related to the issue, they can let the incident team know instead of wasting valuable time investigating themselves.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Transparency builds trust, both with internal stakeholders and customers.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Transparency shatters the illusion of perfection, replacing it with a much more useful '&lt;em&gt;things go wrong here, and we work hard to fix and learn from them&lt;/em&gt;' attitude.&lt;/p&gt;

&lt;p&gt;This can make you feel vulnerable, but that's the only way to build trust with stakeholders. They'll start to feel confident that you'll inform and involve them where needed, and they have more faith in your team's ability to handle difficult situations.&lt;/p&gt;

&lt;h2&gt;
  
  
  Blameless culture
&lt;/h2&gt;

&lt;p&gt;Transparency goes hand-in-hand with a blameless culture where mistakes are learning opportunities, not fireable offences.&lt;/p&gt;

&lt;p&gt;Without this, people will hide their mistakes and won't pull in the more experienced folks needed to help resolve the issue safely: in the worst case, small hiccups can turn into full-blown outages.&lt;/p&gt;

&lt;p&gt;A blameless culture is enabled by humility. This kind of culture is often driven by senior people --- if senior people never make mistakes, junior people will be afraid to admit theirs. Similarly, it's important that people ask for help if they feel out of their depth and need support.&lt;/p&gt;

&lt;h2&gt;
  
  
  Intentionally share information
&lt;/h2&gt;

&lt;p&gt;Transparency isn't just about making it possible for everyone to see the info they need, it's about making it easy for them to see it.&lt;/p&gt;

&lt;p&gt;Think actively about who needs to know: both inside and outside the organisation. Communicate clearly and frequently, providing relevant context to keep people informed.&lt;/p&gt;

&lt;p&gt;Keeping the context in one place makes it easier for people to understand what's happening during the incident, and learn from it afterwards. Lessons from an incident on one team are often useful to many people across the organisation.&lt;/p&gt;

&lt;h1&gt;
  
  
  3. Calm
&lt;/h1&gt;

&lt;p&gt;When responding to incidents, we're only human. Unfortunately, flooding your body with adrenaline doesn't help you make good decisions, or collaborate well with others. Take a breath, grab a glass of water, and situate yourself.&lt;/p&gt;

&lt;h2&gt;
  
  
  Calm comes with trust
&lt;/h2&gt;

&lt;p&gt;Many problems within organisations are rooted in a lack of trust. If you don't trust the people you're working with, you'll be stressed that they might do the wrong thing, or even be doing nothing at all.&lt;/p&gt;

&lt;p&gt;The same applies in reverse: if other people trust you and your team, then they're more likely to give you space to do what needs doing without interrupting or asking for reassurance.&lt;/p&gt;

&lt;h2&gt;
  
  
  Calm comes with good tools
&lt;/h2&gt;

&lt;p&gt;If you can easily gather information that you can verify and share, it's easier to collaborate and problem-solve. Having a good &lt;a href="https://incident.io/blog/tracing"&gt;observability&lt;/a&gt; setup, as well as easy access to &lt;a href="https://incident.io/blog/data-stack"&gt;data&lt;/a&gt;, is key.&lt;/p&gt;

&lt;p&gt;Ideally, your whole team will be familiar with these tools and use them day-to-day, so the learning curve, while something is going wrong, isn't too steep.&lt;/p&gt;

&lt;h2&gt;
  
  
  Calm comes with energy
&lt;/h2&gt;

&lt;p&gt;Overworked and tired people make bad incident responders. Incident response needs to be distributed across the team; &lt;a href="https://incident.io/blog/no-capes"&gt;share the load&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Everyone should have an extra 10% contingency energy they can use when unexpected things happen, not already be in their overdraft. If someone's involved in a tough incident, try taking other things off their plate to give them time to recover.&lt;/p&gt;

&lt;h2&gt;
  
  
  Calm comes with experience
&lt;/h2&gt;

&lt;p&gt;Having someone who's taken down production before, who's had the 'oh shit' moment, is incredibly valuable.&lt;/p&gt;

&lt;p&gt;Of course, it helps that they're more likely to be able to diagnose and fix the issue. But the real value is that they know that it isn't world-ending.&lt;/p&gt;

&lt;p&gt;That handling incidents well can make customers trust you more, not less, and turn a negative into a positive. That you can admit mistakes, and you won't lose your job.&lt;/p&gt;

&lt;p&gt;Finally, always remember that calm is contagious, just like panic. If a team's leaders are calm, it'll percolate down to everyone around them.&lt;/p&gt;

</description>
      <category>sre</category>
      <category>devops</category>
      <category>security</category>
    </item>
    <item>
      <title>Don't count your incidents, make your incidents count</title>
      <dc:creator>incident.io</dc:creator>
      <pubDate>Tue, 03 May 2022 12:10:37 +0000</pubDate>
      <link>https://dev.to/incident_io/dont-count-your-incidents-make-your-incidents-count-868</link>
      <guid>https://dev.to/incident_io/dont-count-your-incidents-make-your-incidents-count-868</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;We can't have more than two major incidents per quarter.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It happens all the time: senior folks at your company feel like things are out of control, and they attempt to improve the situation by counting how many incidents you're having.&lt;/p&gt;

&lt;p&gt;And it's not an unreasonable approach --- on the surface, the number of incidents seems like a great measure for how well things are going.&lt;/p&gt;

&lt;p&gt;Whilst setting targets might work in some organisations, it's worth considering whether they provide the signal you expect and whether the implications of doing so have been properly considered. We've had this conversation more times than we can count, so here are a few tips on how to navigate the situation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Fewer incidents doesn't mean things are better
&lt;/h3&gt;

&lt;p&gt;The absence of incidents doesn't mean your systems are reliable or things are safe. I've worked in teams where we've had months of smooth sailing, followed by intense periods of seemingly everything being on fire. Nothing materially changed between the two periods. A deeper analysis showed that many contributing factors were present throughout. We just got lucky, and the perfect storm of latent errors and enabling conditions didn't occur in the first instance.&lt;/p&gt;

&lt;h3&gt;
  
  
  More incidents is no bad thing
&lt;/h3&gt;

&lt;p&gt;Incidents aren't an evil we need to stamp out. In many cases, they're the cost of doing business. We shouldn't encourage failure, but despite our best efforts to maintain high levels of service, surprises will catch us out. When done right, a healthy culture of declaring incidents can be a superpower. I want my teams to feel comfortable sharing when things may be going wrong, be excellent at responding when they do, and democratise knowledge and expertise after the fact --- this is exactly why we build &lt;a href="https://incident.io/"&gt;incident.io&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Targets can drive the wrong behaviour
&lt;/h3&gt;

&lt;p&gt;I've seen people arguing why something is or isn't an incident because they don't want to reset the "days since incident" counter. Equally, I've seen engineers waste time in an incident trying to justify a minor severity rating, rather than major, because they don't want to trigger the company target.&lt;/p&gt;

&lt;p&gt;As stated in Goodhart's Law, &lt;em&gt;"when a measure becomes a target, it ceases to be a good measure"&lt;/em&gt;. If you set a low target with severe consequences, you'll probably meet it, whether that means suppressing reporting, arguing over labels, or some other counterproductive measure.&lt;/p&gt;

&lt;h3&gt;
  
  
  Targeted or not, you're not in control
&lt;/h3&gt;

&lt;p&gt;The vast majority of incidents are outside of our control. At best, a "no incident" goal is un-actionable and ignored. At worst, it can alter behaviour to the detriment of the organisation.&lt;/p&gt;

&lt;p&gt;If you were set a target of not spilling a drink for a year, what would you do differently? Nobody sets out to spill a drink, and when it happens it's not because you're careless, it's just random chance sprinkled with misfortune. Pick a better target, like suggesting I don't run with drinks.&lt;/p&gt;

&lt;h3&gt;
  
  
  There are better alternatives to counting incidents
&lt;/h3&gt;

&lt;p&gt;So you've convinced your leadership team it might be a bad idea, but to seal the deal they're after an alternative. What can you offer in return?&lt;/p&gt;

&lt;p&gt;The best advice is to understand their motivations for the goal. For example, is there a lack of trust between leadership and engineering? Is that fuelled by them seeing incidents, but not seeing the analysis and follow-up that happens afterwards? Perhaps a target around the number of incidents which didn't have a debrief would help.&lt;/p&gt;

&lt;p&gt;Whatever the motivation, here are a few options you might want to consider.&lt;/p&gt;

&lt;h3&gt;
  
  
  Measure what you actually care about
&lt;/h3&gt;

&lt;p&gt;You don't &lt;em&gt;really&lt;/em&gt; care about the number of incidents. You care about what that means: whether it's lost revenue, customer satisfaction, or the service you provide --- incidents are just a useful proxy.&lt;/p&gt;

&lt;p&gt;Instead, measure the thing you actually care about like service uptime, the number of times PII data was shared, or the number of failed payments. These are tangible measures that can be targeted and improved.&lt;/p&gt;

&lt;h3&gt;
  
  
  Measure the value you get from incidents
&lt;/h3&gt;

&lt;p&gt;If you can accept that incidents are unavoidable surprises, why not measure how well your org is using them to improve?&lt;/p&gt;

&lt;p&gt;We suggest writing debrief documents that are used to educate, holding sessions to discuss them, and ensuring you're seeing follow-up actions through to completion. If you do all of the above, you're likely getting your money's worth. (Pro tip: with incident.io you can generate incident timelines, post-mortem documents and follow-up actions in one click, directly in Slack.)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--b49POlDR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8vmqb7j1qc3uj9io3q08.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--b49POlDR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8vmqb7j1qc3uj9io3q08.png" alt="Image description" width="880" height="689"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Give them the metrics they want, with the context they need
&lt;/h3&gt;

&lt;p&gt;If you can't convince people not to target the number of incidents, why not provide the metrics they &lt;em&gt;want&lt;/em&gt; but with the context they &lt;em&gt;need&lt;/em&gt; to understand the full picture?&lt;/p&gt;

&lt;p&gt;Rather than "&lt;em&gt;we had 5 major incidents&lt;/em&gt;", share the contributing factors and risks, the commonalities and differences, and what's being done to improve. It's relatively easy to take the heat out of a number by providing some qualitative context. As it happens, there's a great post from the Learning from Incidents blog about this &lt;a href="https://www.learningfromincidents.io/blog/looking-beyond-the-metrics"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If you've got any pro tips of your own, we'd love to hear them! Send us an email at &lt;a href="mailto:hello@incident.io"&gt;hello@incident.io&lt;/a&gt;, or find us on Twitter at &lt;a href="https://twitter.com/incident_io"&gt;@incident_io&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>sre</category>
      <category>productivity</category>
      <category>performance</category>
    </item>
    <item>
      <title>Build custom API integrations with incident.io</title>
      <dc:creator>incident.io</dc:creator>
      <pubDate>Wed, 27 Apr 2022 14:46:35 +0000</pubDate>
      <link>https://dev.to/incident_io/build-custom-api-integrations-with-incidentio-nem</link>
      <guid>https://dev.to/incident_io/build-custom-api-integrations-with-incidentio-nem</guid>
      <description>&lt;p&gt;We're building &lt;a href="http://incident.io/"&gt;incident.io&lt;/a&gt; as the single place you turn to when things go wrong. When an issue is disrupting your business-as-usual, the last thing you want is to start opening ten different tools to diagnose and fix it!&lt;/p&gt;

&lt;p&gt;As your central incident hub, we need to give you two powers:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Replicating (and possibly automating) your existing processes in incident.io; and&lt;/li&gt;
&lt;li&gt; Keeping &lt;a href="http://incident.io/"&gt;incident.io&lt;/a&gt; in sync with your existing tool stack.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://incident.io/blog/workflows-your-process-automated"&gt;Workflows&lt;/a&gt; cover the former. Workflows are like a mini incident.io Zapier. You can use them to tell us when to page teammates, &lt;a href="https://help.incident.io/en/articles/5948198-keeping-your-customers-in-the-loop"&gt;send updates to your affected customers&lt;/a&gt; and other stakeholders, assign roles and to-dos, and infinitely more.&lt;/p&gt;

&lt;p&gt;What about the latter? We already offer a catalogue of &lt;a href="https://help.incident.io/en/collections/3324010-api-integrations"&gt;native integrations&lt;/a&gt; with on-call solutions (PagerDuty, OpsGenie), issue trackers (e.g. Jira, Linear, GitHub) and communications tools (Statuspage, Zoom, Google Meet). We'll continue adding more as we grow: from Datadog or Zendesk to let you declare incidents from those tools; to Backstage or Terraform to sync your service catalog into incident.io. (You can &lt;a href="https://portal.productboard.com/parbp273zuqdg5ixdcaaymrx/tabs/4-the-future"&gt;upvote and request yours&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;Today though, we are stepping up our connectedness game by launching the incident.io API!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--JLMzSXuF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://incident.io/static/65cd9f1ffcd33c0310895e7d26ebac76/0a47e/declare.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--JLMzSXuF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://incident.io/static/65cd9f1ffcd33c0310895e7d26ebac76/0a47e/declare.png" alt="curl request for declaring an incident" title="curl request for declaring an incident" width="600" height="486"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Check out the &lt;a href="https://api-docs.incident.io/"&gt;API reference&lt;/a&gt;, or read on to see how our early-access customers have used it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why would I need the API?
&lt;/h2&gt;

&lt;p&gt;Native integrations are a beautiful thing. They are easy to set up, even for non-technical teammates. Sales, Customer Support, BizOps - anyone can plug in their suite of tools without a line of code. They are faster to implement and reduce the need for maintenance on our customers' side.&lt;/p&gt;

&lt;p&gt;However, integrations also have their downfalls. Firstly, they must be built separately for every tool. We are known to ship at the speed of light, but there are &lt;em&gt;a lot&lt;/em&gt; of tools that our customers use - in their engineering teams, and beyond. This means you might have to wait a few months before we integrate natively with that one tool you need.&lt;/p&gt;

&lt;p&gt;Moreover, native integrations have a smaller surface area. They might cover 80% of use cases, but fall short on the remaining 20%. You might be looking for functionality that's very specific to your team or company. Or that we simply haven't got round to building yet.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://api-docs.incident.io/"&gt;The API&lt;/a&gt; is the answer to both those issues.&lt;/p&gt;

&lt;h2&gt;
  
  
  What does the incident.io API do?
&lt;/h2&gt;

&lt;p&gt;At the highest level, &lt;a href="https://api-docs.incident.io/"&gt;our API&lt;/a&gt; lets you connect incident.io to any tool in your stack (or even to your own application), and give us instructions via that connection. No more waiting around for that one integration you need, and no more constraints on what you can do: the API world is your oyster!&lt;/p&gt;

&lt;p&gt;It's worth noting at this point that the first version of our API focuses on the three major use cases, namely:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Automatically creating an incident from another system, such as a monitoring tool like Datadog or a ticketing system like Zendesk (see our step-by-step example below).&lt;/li&gt;
&lt;li&gt; Exporting incidents and their to-dos (i.e. Actions and Follow-Ups) into a data warehouse or BI tool (Looker, Tableau etc.) to analyse.&lt;/li&gt;
&lt;li&gt; Configuring roles, severities, and custom fields from an external data source like Terraform or your service catalog.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We picked these three because they were the most requested by our existing users, and we wanted to stay focussed on &lt;a href="https://incident.io/blog/breaking-down-complex-projects"&gt;laying solid foundations while building quickly&lt;/a&gt;.&lt;/p&gt;
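&lt;p&gt;As a rough illustration of the second use case, here's a minimal Python sketch that pulls incidents and flattens them into rows for a warehouse load. It assumes a Bearer-authenticated GET on the incidents endpoint, as described in the &lt;a href="https://api-docs.incident.io/"&gt;API reference&lt;/a&gt;; the placeholder key and the handful of flattened fields are illustrative, not a complete schema.&lt;/p&gt;

```python
import json
import urllib.request

API_KEY = "inc_api_key_..."  # placeholder - use a key with the right permissions


def fetch_incidents(base_url="https://api.incident.io/v1"):
    """Fetch incidents from the incident.io API, assuming Bearer auth
    as described in the API reference."""
    req = urllib.request.Request(
        base_url + "/incidents",
        headers={"Authorization": "Bearer " + API_KEY},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["incidents"]


def to_rows(incidents):
    """Flatten incidents into rows for a warehouse or BI tool.
    The chosen fields are illustrative, not a complete schema."""
    return [
        {"id": i["id"], "name": i["name"], "status": i["status"]}
        for i in incidents
    ]
```

&lt;p&gt;From there, loading the rows into Looker, Tableau or your warehouse of choice is a standard ETL step.&lt;/p&gt;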

&lt;p&gt;Needless to say though, we'll be extending this over time, so we'd &lt;a href="https://incident.io/community"&gt;love to hear&lt;/a&gt; what you'd like to use our API for!&lt;/p&gt;

&lt;h2&gt;
  
  
  Step-by-step example: declaring an incident from Zendesk
&lt;/h2&gt;

&lt;p&gt;Most of our users have a ticketing system like Zendesk to manage their customer support. Inbound tickets are triaged by customer support teammates, and escalated to the engineering teams based on specific criteria (e.g. a certain severity, or a particular incident type such as data breaches).&lt;/p&gt;

&lt;p&gt;In a pre-API world, executing that escalation policy could get a little painful. Support team members would have to exit Zendesk, then head to Slack to manually declare an incident (and remember to &lt;a href="https://incident.io/changelog/2022-03-09"&gt;append the Zendesk ticket link&lt;/a&gt;). We'd work our magic at the point of declaration, and downstream of it.&lt;/p&gt;

&lt;p&gt;In a post-API world, we can add value upstream of the declaration point too: a support agent can declare an incident with one click, straight from within Zendesk. We'll take care of the rest, from declaring the incident in incident.io to pulling in the right teammates, notifying the relevant internal and external stakeholders, spinning up your public Statuspage and much more.&lt;/p&gt;

&lt;p&gt;Here are the few key steps to bringing this flow to life.&lt;/p&gt;

&lt;h3&gt;
  
  
  1️⃣ &lt;a href="https://app.incident.io/settings/api-keys"&gt;Generate an API Key&lt;/a&gt; in incident.io
&lt;/h3&gt;

&lt;p&gt;You'll want to generate a key with &lt;code&gt;Create incidents&lt;/code&gt; enabled. Keep this safe - we'll need it in a minute.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ekJmFK0Q--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://incident.io/static/f7354da99e630c0eca5132f5a05bb829/0a47e/create-key.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ekJmFK0Q--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://incident.io/static/f7354da99e630c0eca5132f5a05bb829/0a47e/create-key.png" alt="Screenshot of the create API key modal" title="Screenshot of the create API key modal" width="600" height="476"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  2️⃣ &lt;a href="https://support.zendesk.com/hc/en-us/articles/4408886797466"&gt;Configure a Zendesk Support trigger&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;This lets you choose when Zendesk should escalate a ticket. In this case, we might use a checkbox custom field, which will declare an incident when it's ticked.&lt;/p&gt;

&lt;h3&gt;
  
  
  3️⃣ Add a &lt;a href="https://support.zendesk.com/hc/en-us/articles/4408839108378#topic_dlc_lsz_2pb"&gt;Zendesk webhook&lt;/a&gt; as the action to take when the trigger fires
&lt;/h3&gt;

&lt;p&gt;Configure it to make an HTTP POST request to &lt;code&gt;https://api.incident.io/v1/incidents&lt;/code&gt;. The API Key we generated earlier is used for Bearer token authentication. The request body needs to be JSON that looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "name": "{{ ticket.title }}",
  "idempotency_key": "{{ ticket.id }}",
  "severity_id": "01FCQSP07Z74QMMYPDDGQB9FTG",
  "summary": "From Zendesk: {{ ticket.description }}",
  "visibility": "public",
  "custom_field_entries": [
        {
            "custom_field_id": "01FCNDV6P870EA6S7TK1DSYDG0",
            "values": [
                {
                    "value_link": "{{ ticket.link }}"
                }
            ]
        }
  ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There are a couple of special IDs in there you'll need:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; The &lt;code&gt;custom_field_id&lt;/code&gt; references a "Zendesk Ticket Link" custom field we've configured. You can find the IDs of your custom fields using the &lt;a href="https://api-docs.incident.io/#operation/Custom%20Fields_List"&gt;List Custom Fields API&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt; The &lt;code&gt;severity_id&lt;/code&gt; references our "Minor" severity. You can find the IDs of your severities using the &lt;a href="https://api-docs.incident.io/#operation/Severities_List"&gt;List Severities API&lt;/a&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That's it! 🎉 Whenever that trigger fires, you'll get a new incident declared in incident.io. Nice.&lt;/p&gt;
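&lt;p&gt;If you'd like to test the same request from a script before wiring up the Zendesk trigger, here's a Python sketch of the webhook call. It reuses the example &lt;code&gt;severity_id&lt;/code&gt; and &lt;code&gt;custom_field_id&lt;/code&gt; from the JSON above (swap in your own), and the ticket values are illustrative placeholders.&lt;/p&gt;

```python
import json
import urllib.request

API_URL = "https://api.incident.io/v1/incidents"
API_KEY = "inc_api_key_..."  # the key generated in step 1 (placeholder)


def build_payload(ticket_id, title, description, link):
    """Mirror the Zendesk webhook body. The severity and custom field
    IDs below are the example IDs from this post - use your own."""
    return {
        "name": title,
        "idempotency_key": ticket_id,
        "severity_id": "01FCQSP07Z74QMMYPDDGQB9FTG",
        "summary": "From Zendesk: " + description,
        "visibility": "public",
        "custom_field_entries": [
            {
                "custom_field_id": "01FCNDV6P870EA6S7TK1DSYDG0",
                "values": [{"value_link": link}],
            }
        ],
    }


def declare_incident(payload):
    """POST the payload with Bearer token authentication."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": "Bearer " + API_KEY,
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

&lt;p&gt;Because the &lt;code&gt;idempotency_key&lt;/code&gt; is the ticket ID, re-running the script for the same ticket won't open a duplicate incident.&lt;/p&gt;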

&lt;p&gt;As a bonus: you can configure workflows that only run on your API-generated incidents. For example, you could build a workflow that automatically adds the support agent that declared the incident in Zendesk to the incident's Slack channel.&lt;/p&gt;

&lt;p&gt;To do that, add a Condition based on the API Key that reported the incident. This would look like:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--a-EcXqgK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://incident.io/static/22da90f2d060e274e14bc0ed8aaeaf31/0a47e/workflow.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--a-EcXqgK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://incident.io/static/22da90f2d060e274e14bc0ed8aaeaf31/0a47e/workflow.png" alt="Screenshot of how to configure a workflow to run only on API-originated incidents" title="Screenshot of how to configure a workflow to run only on API-originated incidents" width="600" height="434"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How we built the API
&lt;/h2&gt;

&lt;p&gt;The API for declaring an incident is as permissive as possible: if something's going wrong, it's better that an incident is declared with some missing context than not declared at all. This means that required custom fields won't be enforced, for example.&lt;/p&gt;

&lt;p&gt;We also wanted to avoid accidental spam. If things are going wrong in the middle of the night, the last thing you need is 5 incidents with reminders pinging around. To create an incident with the API, you must specify a unique key which we can deduplicate on. That might be the ID of an alert firing, or the reference for a support ticket. We'll only create one incident for each key, so you don't have to worry about waking up to more incident channels than necessary.&lt;/p&gt;
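&lt;p&gt;To make that deduplication behaviour concrete, here's a toy model of it in Python. This is a conceptual sketch, not incident.io's actual implementation: one incident per idempotency key, however many times the create request is retried.&lt;/p&gt;

```python
class IncidentStore:
    """Toy model of create-incident deduplication: repeated requests
    with the same idempotency key map to a single incident."""

    def __init__(self):
        self._by_key = {}

    def create(self, idempotency_key, name):
        # A repeated key returns the existing incident instead of
        # opening a duplicate channel.
        if idempotency_key in self._by_key:
            return self._by_key[idempotency_key]
        incident = {"id": len(self._by_key) + 1, "name": name}
        self._by_key[idempotency_key] = incident
        return incident
```

&lt;p&gt;So a flapping alert that fires five times with the same alert ID still yields a single incident channel, while a different alert ID creates a fresh one.&lt;/p&gt;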

&lt;h2&gt;
  
  
  Over to you!
&lt;/h2&gt;

&lt;p&gt;We built our API around these three use-cases for now, but we'll keep expanding it over the coming weeks.&lt;/p&gt;

&lt;p&gt;We'd absolutely love to hear what you build with it, and how you'd like us to extend it.&lt;/p&gt;

&lt;p&gt;There's an &lt;code&gt;#api&lt;/code&gt; channel in the &lt;a href="https://incident.io/community"&gt;incident.io Community&lt;/a&gt;. See you there 👀&lt;/p&gt;

</description>
      <category>devops</category>
      <category>sre</category>
      <category>tutorial</category>
    </item>
  </channel>
</rss>
