Ben Halpern

Posted on Feb 17, 2020

How does your team handle critical production errors?

#discuss #sre #devops

What's the process once a major regression is discovered?

Top comments (5)

Dan Silcox • Feb 17 '20

“Asking for a friend” 😂

Ben Halpern • Feb 17 '20

What we've been up to

Molly Struve 🦄

@molly_struve

How to write a solid, full coverage test suite:

1) Break your app in production
2) Fix your app
3) Add tests for what you broke

Repeat 1-3 until you stop breaking things in production

17:00 PM - 17 Feb 2020


Niko Heikkilä • Feb 17 '20

We're using Sentry to catch live errors. It has a handy integration to JIRA so the person who discovered the error can open a ticket directly from there.

Once reported, the team plans, implements, and deploys the fix within the same day.

Nico S___ • Feb 17 '20

We get notified of issues in production in several ways: Sentry, Pingdom, and our Customer Success Team.

When an issue occurs in production we have a predefined process we go through:

Assign a production incident marshal to drive the effort; this is the Customer Success Team lead
Work with a product team member to investigate the issue
Recruit help from others when needed
Work towards a resolution
Create a Production Incident Report
Review the report in a Production Incident Retrospective
Schedule actions that came up from the retrospective

Works incredibly well.
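As a rough illustration only, the process above could be tracked as an ordered checklist. This is a minimal sketch: the `Incident` class, its methods, and the stage wording are hypothetical names that paraphrase the listed steps, not the team's actual tooling.

```python
from dataclasses import dataclass, field

# Stage names paraphrase the steps in the comment above (assumption).
STAGES = [
    "assign incident marshal",
    "investigate with a product team member",
    "recruit help if needed",
    "work towards a resolution",
    "create production incident report",
    "review report in retrospective",
    "schedule follow-up actions",
]


@dataclass
class Incident:
    """Hypothetical record of one production incident."""
    title: str
    marshal: str
    completed: list = field(default_factory=list)

    def next_stage(self):
        """Return the first stage not yet completed, or None when all are done."""
        if len(self.completed) < len(STAGES):
            return STAGES[len(self.completed)]
        return None

    def complete_next(self):
        """Mark the next pending stage as completed and return its name."""
        stage = self.next_stage()
        if stage is None:
            raise ValueError("all stages already completed")
        self.completed.append(stage)
        return stage
```

Keeping the stages in a fixed order makes it hard to skip the report and retrospective once the fix is out.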

Yaser Al-Najjar • Feb 25 '20 • Edited

We use Sentry to make sure we get notified directly if something goes wrong.

What to Do?

If there is NO migration made (no db changes) in the latest deployed release:

1) Run a container from our previous Docker image.
2) Let Nginx's load balancer do its magic and shift the traffic from the bad container to the good one.
3) Fix the issue and make sure things work fine before deploying 😁
4) Repeat steps 1 and 2, but with the new (hopefully bug-free) container.

If there is a migration made (db changes made to existing data... not happening so often):

Realize that you fu**ed up this time (cuz you can't use an old container)!
Fix it fast.
Deploy again.
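The two branches above boil down to one decision: roll back if the release made no db changes, otherwise fix forward. Here is a minimal sketch of that decision, assuming a hypothetical `recovery_plan` helper, container name, and image tag; it only returns the steps as text and does not run anything.

```python
def recovery_plan(has_migration, previous_image, container_name="app-rollback"):
    """Return the recovery steps as a list of strings.

    has_migration: True if the latest release changed the database.
    previous_image: tag of the last known-good Docker image (assumed to exist).
    container_name: hypothetical name for the rollback container.
    """
    if has_migration:
        # The old container cannot be trusted against the migrated data,
        # so the only option is to fix forward.
        return ["fix the issue fast", "deploy the fixed release"]
    return [
        # `docker run -d --name NAME IMAGE` starts a detached, named container.
        f"docker run -d --name {container_name} {previous_image}",
        "let the Nginx load balancer shift traffic to the rollback container",
        "fix the issue and verify it before deploying",
        "repeat the first two steps with the fixed image",
    ]
```

For example, `recovery_plan(False, "myapp:1.4.2")` yields the four rollback steps, while `recovery_plan(True, "myapp:1.4.2")` yields only the fix-and-redeploy pair.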
