DEV Community

Discussion on: Production-only bugs

Chris Raser • Edited

Been there. Tough situation.

There are generally three phases to fixing something like this.

Get the Fire Out

The first priority is to get back to a fully-running production system. Two common ways to do this:

  • Roll back to the previous "known good" release.
  • Disable the feature that's causing trouble.
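Disabling a feature is fastest when the code already has a kill switch. Here is a minimal sketch, assuming the switch is read from an environment variable; the `FEATURE_` prefix, `feature_enabled`, `checkout`, and the pricing logic are all made up for illustration, not anything from a particular framework.

```python
import os

# Values that count as "switched off" (an assumption for this sketch).
OFF_VALUES = {"0", "false", "off", "no"}


def feature_enabled(name: str, default: bool = True) -> bool:
    """Return False when the env var FEATURE_<NAME> is set to an 'off' value."""
    raw = os.environ.get(f"FEATURE_{name.upper()}")
    if raw is None:
        return default
    return raw.strip().lower() not in OFF_VALUES


def checkout(cart_total: float) -> str:
    # Hypothetical feature: a new pricing path that misbehaves in production.
    if feature_enabled("new_pricing"):
        return f"new pricing: {cart_total * 0.9:.2f}"
    # Known-good fallback path.
    return f"old pricing: {cart_total:.2f}"
```

With something like this in place, setting `FEATURE_NEW_PRICING=off` and restarting the process gets you back on the old code path without a redeploy.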

The key is that you get back to a situation where your production systems are up and working as normal, and you can investigate the issue without impacting the business/customers.

Get a Real Resolution in Place

At this point, you may not know enough to reproduce the bug in your test environment. So, focus your efforts on acquiring more information.

  • Add better logging to the feature that's causing trouble.
  • Add code to detect the problem and dump full context information to server logs.
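As a rough sketch of the second bullet: check for a state that should never happen, and when it does, log everything you have on hand. The `apply_discount` function, the order fields, and the "negative total" condition are hypothetical stand-ins for whatever your real bug looks like.

```python
import json
import logging

logger = logging.getLogger("orders")


def context_snapshot(**context) -> str:
    """Serialize context to JSON; default=repr keeps odd values from raising."""
    return json.dumps(context, default=repr, sort_keys=True)


def apply_discount(order: dict) -> float:
    total = order["subtotal"] - order.get("discount", 0.0)
    if total < 0:
        # The "impossible" state: dump the full context so it can be
        # replayed later in staging or in the IDE.
        logger.error(
            "negative order total detected: %s",
            context_snapshot(order=order, total=total),
        )
    return total
```

The point is that one log line should carry enough to rebuild the failing input, so you aren't stitching it together from five different log files.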

It's sometimes possible to pull a production server out of production. If, for example, you have four load-balanced servers, roll back servers 1-3 to the latest stable release, and remove 4 from the load balancer, but leave it on the latest (problem) release.

Then connect to that server yourself and try to reproduce the bug. If that works, gather as much information as possible. I once did this, and connected a remote debugging session so that I could hit a production server and debug in my IDE. I had the issue fixed in about 20 minutes. If you're going to try this, put a time limit on it: if you're not well on your way to a fix after 30 minutes, focus instead on gathering enough information to reproduce the problem in staging.

Most of your tactics really should be focused on getting to a point where you can reproduce the bug in your IDE. That's where you can debug and test the fastest, and those edit/run/test iterations dominate how long it'll take you to reach a resolution. The faster you can try new fixes, or gather new information, the faster you're going to have a permanent solution.

Address the Process Issue

Ideally, your staging/QA environment should be identical to your production environment. Same OS version, server software versions, etc.

That isn't always practical. If your production database holds more data than is reasonable to duplicate in dev/QA, for example, then the team needs some way of retrieving a representative snapshot. That means one that covers every customer type, payment method, state/country, etc.
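One way to think about "representative": keep at least one row for every combination of the attributes you care about. A minimal in-memory sketch, assuming rows arrive as dicts; the column names and `representative_sample` are hypothetical, and a real version would most likely do this in SQL against a replica, with sensitive data scrubbed.

```python
from typing import Iterable


def representative_sample(
    rows: Iterable[dict], keys: tuple, per_group: int = 1
) -> list:
    """Keep the first `per_group` rows for each distinct combination of `keys`."""
    seen = {}    # combination of key values -> rows kept so far
    sample = []
    for row in rows:
        group = tuple(row.get(k) for k in keys)
        if seen.get(group, 0) < per_group:
            seen[group] = seen.get(group, 0) + 1
            sample.append(row)
    return sample


rows = [
    {"id": 1, "customer_type": "retail", "payment": "card", "country": "US"},
    {"id": 2, "customer_type": "retail", "payment": "card", "country": "US"},
    {"id": 3, "customer_type": "wholesale", "payment": "invoice", "country": "DE"},
    {"id": 4, "customer_type": "retail", "payment": "paypal", "country": "US"},
]
sample = representative_sample(rows, ("customer_type", "payment", "country"))
```

Here row 2 is dropped because it duplicates the combination already covered by row 1; the other three each cover something new.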

And your test/QA servers should run the same server configs as production. If they don't/can't, you have a blind spot, and you will have more issues like this.
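A cheap way to find those blind spots is to diff the two configs on a regular basis. A minimal sketch, assuming each environment's settings have already been loaded into a flat dict; `config_drift`, the `"<missing>"` marker, and the example keys are all invented for illustration.

```python
MISSING = "<missing>"  # marker for a key that exists in only one environment


def config_drift(prod: dict, staging: dict) -> dict:
    """Return {key: (prod_value, staging_value)} for every setting that differs."""
    drift = {}
    for key in sorted(prod.keys() | staging.keys()):
        p = prod.get(key, MISSING)
        s = staging.get(key, MISSING)
        if p != s:
            drift[key] = (p, s)
    return drift


prod = {"os": "ubuntu-22.04", "db_pool": 50, "cache": "redis"}
staging = {"os": "ubuntu-20.04", "db_pool": 50}
drift = config_drift(prod, staging)
```

Anything that shows up in `drift` is either a difference you've accepted on purpose or a production-only bug waiting to happen.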

Good luck!