DEV Community

Ben Halpern

How does your team handle critical production errors?

What's the process once a major regression is discovered?

Top comments (5)

Dan Silcox

“Asking for a friend” 😂

Ben Halpern

What we've been up to

Niko Heikkilä

We're using Sentry to catch live errors. It has a handy integration with Jira, so the person who discovered the error can open a ticket directly from Sentry.

Once reported, the team plans, implements, and deploys the fix within the same day.

Nico S___

We get notified of issues in production in several ways: Sentry, Pingdom, and our Customer Success Team.

When an issue occurs in production, we go through a predefined process:

  • Assign a production incident marshal to drive the effort; this is the customer success team lead
  • Work with a product team member to investigate the issue
  • Recruit help from others when needed
  • Work towards a resolution
  • Create a Production Incident Report
  • Review the report in a Production Incident Retrospective
  • Schedule actions that came up from the retrospective

Works incredibly well.

Yaser Al-Najjar • Edited

We use Sentry to make sure we get notified directly if something goes wrong.

What to Do?

If NO migration was made (no db changes) in the latest deployed release:

  1. We run a container from our previous docker image.
  2. Let Nginx's load balancer do its magic and shift the traffic from the bad container to the good one.
  3. Fix the issue and make sure things work fine before deploying 😁
  4. Repeat steps 1 and 2, but with the new, hopefully bug-free container.
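
For anyone who wants to picture steps 1 and 2, here is a rough Python sketch. It only builds the commands and the Nginx upstream block; it does not run anything. The image tag, container name, ports, and upstream name are all made up for illustration, and the sketch assumes traffic is switched by marking the bad server `down` in an upstream block and reloading Nginx, which may not be how the setup above actually does it.

```python
from typing import List


def rollback_commands(previous_image: str, container_name: str,
                      host_port: int, container_port: int = 8000) -> List[List[str]]:
    """Build (but do not run) the commands for steps 1 and 2.

    All names and ports are hypothetical placeholders.
    """
    return [
        # Step 1: start a container from the previous, known-good image.
        ["docker", "run", "-d", "--name", container_name,
         "-p", f"{host_port}:{container_port}", previous_image],
        # Step 2: reload Nginx so it picks up the updated upstream config.
        ["nginx", "-s", "reload"],
    ]


def render_upstream(name: str, good: str, bad: str) -> str:
    """Render an Nginx upstream block that sends traffic only to `good`."""
    return (
        f"upstream {name} {{\n"
        f"    server {good};\n"
        f"    server {bad} down;  # bad release taken out of rotation\n"
        f"}}\n"
    )


if __name__ == "__main__":
    for cmd in rollback_commands("myapp:previous", "myapp-prev", 8001):
        print(" ".join(cmd))
    print(render_upstream("app", "127.0.0.1:8001", "127.0.0.1:8002"))
```

Keeping the commands as data makes it easy to review or log them before handing them to something like `subprocess.run`.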

If a migration was made (db changes to existing data, which doesn't happen often):

  1. Realize that you fu**ed up this time (cuz you can't use an old container)!
  2. Fix it fast.
  3. Deploy again.
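
The whole decision boils down to one check, which could be sketched like this in Python. The `Release` class and its `has_migration` flag are hypothetical; how you actually detect a migration depends on your framework and deploy tooling.

```python
from dataclasses import dataclass


@dataclass
class Release:
    tag: str             # hypothetical release identifier, e.g. an image tag
    has_migration: bool  # True if the release changed existing db data


def recovery_plan(bad: Release, previous: Release) -> str:
    """Pick between rolling back and fixing forward for a bad release."""
    if bad.has_migration:
        # The old container can't be trusted against the migrated data.
        return f"fix forward: patch {bad.tag} and deploy again"
    # No db changes, so the previous image is safe to run again.
    return f"roll back: run {previous.tag} and shift traffic to it"


if __name__ == "__main__":
    old = Release(tag="v1", has_migration=False)
    print(recovery_plan(Release(tag="v2", has_migration=False), old))
    print(recovery_plan(Release(tag="v2", has_migration=True), old))
```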