DEV Community

Renan Lourençoni Nobile


Production-only bugs

Hello everyone. This is a discussion about what you do when a bug appears only in production and can't be reproduced in other environments, either because the data is missing or because the configuration simply doesn't allow it to be caught outside production.

Top comments (9)

rhymes

If it's about configuration, the solution is to have a staging environment that's identical to production, except maybe for the amount of data, and reproduce the bug there.

If it's about read-only data you could, in theory, create a read-only replica of your production database, wait for it to catch up, and then point staging at it and see if you can reproduce the bug.

The worst possible scenario is to debug production live but that's tricky and depends on the bug...

Peter Kim Frank • Edited

Hey Renan, thanks for starting this discussion, I'm excited to follow along.

Hello fellas

Please remember that our wonderful DEV Community has members of all genders, ages, backgrounds, etc. "Fellas" has a connotation of referring to a group of men, so please try and use language that is default inclusionary and not gender-specific — such as "Hello everyone."

After all, there are plenty of non-male members here who might have the insights you are looking for!

Thanks again for starting the conversation! :)

Renan Lourençoni Nobile

Thanks Peter, I'll edit the post. I'm sorry for the gender-specific wording; it's definitely not what I intended.

Peter Kim Frank

Thanks, Renan!

Scott Hannen

That's a really difficult spot to be in. It's easy to go on and on about how to try to prevent that situation (which isn't 100% possible) but that doesn't do any good once you're there.

If you have logging you can look at, that's one way to go. Depending on the code, sometimes half the challenge is figuring out what gets logged and how to access it.

It might help to try to set up similar data in a development environment. Or start with what you know about how the production data may differ, and read the code with that in mind. Neither is ideal, but sometimes there's nothing else we can do.

In these scenarios it's easier if the problem is code. The most frustrating cases I've experienced are when it's something like a firewall rule on a production server and I have no visibility whatsoever. Then I have to get someone else to look at it, but I can only tell them what I think they might be looking for.

Grumpy and • Edited

Been there. Tough situation.

There are generally three phases to fixing something like this.

Get the Fire Out

The first priority is to get back to a fully-running production system. There are several ways to do this:

  • Roll back to the previous "known good" release.
  • Disable the feature that's causing trouble.
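
Disabling a feature is quickest when the new code path already sits behind a switch. Here is a minimal sketch of that idea, assuming the switch is an environment variable; the `FEATURE_` prefix, the `new_checkout` name, and both checkout functions are hypothetical and only there for illustration.

```python
import os


def feature_enabled(name, default=True):
    """Return False when the environment explicitly switches the feature off."""
    raw = os.environ.get("FEATURE_" + name.upper())
    if raw is None:
        return default
    return raw.strip().lower() not in {"0", "false", "off", "no"}


def legacy_checkout(cart):
    """Known-good code path from the previous release (hypothetical)."""
    return ("legacy", round(sum(cart), 2))


def new_checkout(cart):
    """Stand-in for the new code path that misbehaves in production."""
    return ("new", round(sum(cart), 2))


def checkout(cart):
    # Flipping FEATURE_NEW_CHECKOUT=off routes traffic back to the old path
    # without shipping a new build.
    if feature_enabled("new_checkout"):
        return new_checkout(cart)
    return legacy_checkout(cart)
```

Whether an environment variable is the right switch depends on how your servers are deployed; a config file or a database flag would follow the same shape.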

The key is that you get back to a situation where your production systems are up and working as normal, and you can investigate the issue without impacting the business/customers.

Get a Real Resolution in Place

At this point, you may not know enough to reproduce the bug in your test environment. So, focus your efforts on acquiring more information.

  • Add better logging to the feature that's causing trouble.
  • Add code to detect the problem and dump full context information to server logs.
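
As a rough illustration of the second bullet, the sketch below checks for an impossible state and writes the full context to the log as one JSON line. It uses only Python's standard `logging` and `json` modules; the `orders` logger name, the order fields, and the "negative total" condition are assumptions made up for the example.

```python
import json
import logging

logger = logging.getLogger("orders")


def context_snapshot(**context):
    """Serialize the context to one JSON line; repr() anything JSON can't encode."""
    return json.dumps(context, default=repr, sort_keys=True)


def apply_discount(order):
    """Compute the order total, logging full context if the result is impossible."""
    total = order["subtotal"] - order.get("discount", 0.0)
    if total < 0:
        # Detect the bad state and dump everything needed to reproduce it later.
        logger.error(
            "negative order total detected: %s",
            context_snapshot(order=order, computed_total=total),
        )
        total = 0.0
    return total
```

A single structured line per incident tends to be easier to search for and to replay in a test environment than several free-form messages.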

It's sometimes possible to pull a production server out of production. If, for example, you have four load-balanced servers, roll back servers 1-3 to the latest stable release, and remove 4 from the load balancer, but leave it on the latest (problem) release.

Then connect to that server yourself and try to reproduce the bug. If that works, gather as much information as possible. I once did this, and connected a remote debugging session so that I could hit a production server & debug in my IDE. I had the issue fixed in 20 min. or so. If you're going to try this, put a time limit on it: if you're not well on your way to a fix after 30 minutes, focus instead on gathering enough information to reproduce the problem in staging.

Most of your tactics really should be focused on getting to a point where you can reproduce the bug in your IDE. That's where you can debug and test the fastest, and those edit/run/test iterations dominate how long it'll take you to reach a resolution. The faster you can try new fixes, or gather new information, the faster you're going to have a permanent solution.

Address the Process Issue

Ideally, your staging/QA environment should be identical to your production environment. Same OS version, server software versions, etc.

But, for example, if your production database has more data than is reasonable to duplicate in dev/QA, then the team needs some way of retrieving a representative snapshot. That means every customer type, payment method, state/country, etc.
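
One way to think about a representative snapshot is "a few rows for every combination that exists in production". The sketch below is a simplified in-memory version of that idea, assuming rows are plain dictionaries; the column names in the usage example are hypothetical, and a real snapshot would also need to handle related tables and sensitive data.

```python
from collections import defaultdict


def representative_sample(rows, keys, per_group=2):
    """Keep up to per_group rows for every distinct combination of keys."""
    groups = defaultdict(list)
    for row in rows:
        combo = tuple(row[k] for k in keys)
        if len(groups[combo]) < per_group:
            groups[combo].append(row)
    # Flatten the buckets back into one list of rows.
    return [row for bucket in groups.values() for row in bucket]


production_rows = [
    {"id": 1, "customer_type": "retail", "payment": "card", "country": "BR"},
    {"id": 2, "customer_type": "retail", "payment": "card", "country": "BR"},
    {"id": 3, "customer_type": "retail", "payment": "card", "country": "BR"},
    {"id": 4, "customer_type": "business", "payment": "invoice", "country": "US"},
]
snapshot = representative_sample(
    production_rows, keys=("customer_type", "payment", "country"), per_group=1
)
```

Here `snapshot` keeps one row per combination, so the rare business/invoice/US case survives even though retail rows dominate the data.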

And your test/QA servers should run the same server configs as production. If they don't/can't, you have a blind spot, and you will have more issues like this.
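
A cheap way to find that blind spot is to diff the two configurations and read the result. This is only a sketch, assuming each environment's settings have already been loaded into a flat dictionary with string keys; the setting names in the usage example are invented.

```python
def config_drift(production, staging):
    """Map each differing key to its (production, staging) pair of values."""
    missing = object()  # sentinel so a key set to None is not confused with absent
    drift = {}
    for key in sorted(production.keys() | staging.keys()):
        prod_value = production.get(key, missing)
        stage_value = staging.get(key, missing)
        if prod_value != stage_value:
            drift[key] = (
                "<unset>" if prod_value is missing else prod_value,
                "<unset>" if stage_value is missing else stage_value,
            )
    return drift


prod_cfg = {"os": "ubuntu-22.04", "db_pool": 50, "tls": True}
stage_cfg = {"os": "ubuntu-22.04", "db_pool": 5}
drift = config_drift(prod_cfg, stage_cfg)
```

In this example `drift` reports the different `db_pool` sizes and the `tls` setting that staging lacks, which is the kind of gap worth closing or at least writing down.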

Good luck!

Ben Halpern

Göran Sandahl

This is an interesting topic. I believe production failures are hard to reproduce because the state of "production" is unknown and can't be replicated. Failures in otherwise resilient systems don't have a single root cause; they are the result of multiple latent issues - aka "dark debt" - that act together to form a fatal error. Reproducing that exact domino effect is impossible.

I wrote a piece related to this, and what we can do to address it:

unomaly.com/blog/zero-bugs-in-prod...

Franco Traversaro

Try to think about EVERY layer of your technology stack: could it be a networking error? Some local cache? Some race condition that happens only on a fast CPU? Concurrent queries to the database from different parts of the software, made by different users?
As much as we might think the server is under some demonic possession, computer technology is still completely deterministic, and bugs are just a sign of our ignorance of the finer details.