
Production data has some issues:

  • legal or regulatory requirements mandate anonymizing PII, patient data, financials, and so on, which requires extra effort
  • email addresses, phone numbers, and the like have to be "disarmed" to ensure integration tests can't accidentally reach users
  • data is changing all the time, so it's more difficult to write stable assertions
  • paucity of representative data for certain states: if some process requires multiple steps but in practice 99% of users complete it in one go, prod data is insufficient for testing the intermediary stages
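As an illustration of the "disarming" step in the second bullet, here is a minimal sketch. The record shape and field names are hypothetical; the key ideas are a deterministic mapping (so the same user gets the same stand-in across dumps) and targets that can never reach a real person (the reserved `.invalid` TLD and the fictional US 555-01xx number range):

```python
import hashlib
import re

SAFE_DOMAIN = "example.invalid"  # reserved TLD: mail here can never be delivered

def disarm_email(email: str) -> str:
    """Replace a real address with a stable, undeliverable stand-in."""
    # Hashing the original keeps the mapping deterministic across table dumps.
    digest = hashlib.sha256(email.lower().encode()).hexdigest()[:12]
    return f"user-{digest}@{SAFE_DOMAIN}"

def disarm_phone(phone: str) -> str:
    """Map any phone number onto a reserved fictional range (US 555-0100..0199)."""
    digits = re.sub(r"\D", "", phone)
    last_two = digits[-2:] if len(digits) >= 2 else "00"
    return f"+1-202-555-01{last_two}"

# Hypothetical record shape, for illustration only:
record = {"email": "jane.doe@gmail.com", "phone": "(415) 867-5309"}
record["email"] = disarm_email(record["email"])
record["phone"] = disarm_phone(record["phone"])
```

An integration test that "accidentally" sends mail or SMS against data disarmed this way simply cannot reach anyone.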

Testing against production data is still important -- but it comes second. Functional coverage matters more, and you can only be confident you're covering the widest range of possibilities if you generate fixture data. If you're working in a smaller team, testing against live data sets will likely be entirely manual.

Fixtures are tricky to do right, and the obvious solution of a monolithic test dataset is a dead end for reasons best explained by Jorge Luis Borges. I wrote something a while ago about a more flexible modular approach based on the post-structuralist idea of rhizomes, and published a drop-in JavaScript implementation; the PHP O/RM Doctrine does something similar as well.
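To picture the modular alternative (this is a sketch of the general idea, not the implementation from the linked post): instead of one monolithic dataset, each fixture starts from a minimal valid record and small "trait" functions layer on exactly the facts a given test cares about. All names here are hypothetical:

```python
import itertools

_ids = itertools.count(1)

def base_user(**overrides):
    """Minimal valid user; every fixture starts from this."""
    user = {"id": next(_ids), "name": "Test User", "status": "active", "orders": []}
    user.update(overrides)
    return user

# Each trait is a tiny function that adds one fact to a user.
def suspended(user):
    user["status"] = "suspended"
    return user

def with_order(user, total=10.0):
    user["orders"].append({"total": total})
    return user

def compose(user, *traits):
    """Apply traits left to right, so tests read as a list of facts."""
    for trait in traits:
        user = trait(user)
    return user

# A test asks only for the traits it needs:
u = compose(base_user(), suspended, lambda user: with_order(user, total=99.0))
```

Because traits compose freely, rare states (that multi-step process only 1% of users hit) are as cheap to construct as common ones.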

 

We have tremendous fun with 100+ suppliers (API vendors of various sorts), each of which brings some sort of test interface, frequently with synthetic data, none of them compatible with each other... what to do?

  • we mock them and use a consistent local synthetic data set for local testing (of our combinatorial logic and other internal behaviour).
  • we use their synthetic data for point testing such as connectivity/credentials checks (in UAT/Stage/Live, all of which are production environments).
  • we use production data, ideally with test markers to avoid side effects (like marking someone's credit history), for end-to-end tests across multiple providers.
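The first bullet might look like the sketch below: code against one interface per supplier category, with a local mock serving deterministic synthetic data. The supplier domain, class names, and scores are all invented for illustration:

```python
from typing import Protocol

class CreditCheckSupplier(Protocol):
    """The shape our code depends on; each real vendor gets an adapter to this."""
    def score(self, customer_id: str) -> int: ...

class MockCreditCheck:
    """Local stand-in: deterministic synthetic data, no network, no side effects."""
    SYNTHETIC_SCORES = {"cust-good": 780, "cust-thin-file": 0, "cust-poor": 510}

    def score(self, customer_id: str) -> int:
        # Unknown customers get a fixed mid-range score so tests stay stable.
        return self.SYNTHETIC_SCORES.get(customer_id, 650)

def approve_loan(supplier: CreditCheckSupplier, customer_id: str) -> bool:
    """Combinatorial logic under test, independent of any real vendor."""
    return supplier.score(customer_id) >= 600
```

The real vendors' incompatible test interfaces then only matter at the adapter boundary, not in every test.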
 

I would always use production data unless there are security/privacy reasons why you want to limit its use for testing.

Another blocker is differences between the production and non-production systems that make data formats and processing behaviour inconsistent; concentrating on reducing the gap between the two via a high deployment cadence will go a long way towards mitigating this risk.

That said, you can/should use "production data" without "production volume" and "production behaviour", although the latter two are also obviously useful for certain types of testing.

It's well worth the effort to be able to repeatedly replay production data into a test environment at the same rate at which it arrived in production.
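A minimal sketch of such a replayer, assuming you have captured `(timestamp, payload)` pairs from production (the capture mechanism and `send` callback are yours to supply):

```python
import time

def replay(events, send, speedup=1.0):
    """Replay (timestamp, payload) pairs, preserving inter-arrival gaps.

    `events` must be sorted by timestamp; `speedup` > 1 compresses time
    so a day of traffic can be replayed in minutes.
    """
    previous_ts = None
    for ts, payload in events:
        if previous_ts is not None:
            time.sleep((ts - previous_ts) / speedup)
        previous_ts = ts
        send(payload)

# Usage: captured production events, replayed 60x faster into a test sink.
captured = [(0.0, {"order": 1}), (0.5, {"order": 2}), (2.0, {"order": 3})]
received = []
replay(captured, received.append, speedup=60)
```

Preserving the gaps (rather than firing events back-to-back) is what lets the replay exercise timing-dependent behaviour such as batching, timeouts, and rate limits.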

 

This is super hard. I don't quite understand how people can abstract away the complexity of data and state. We seem to manage it for configuration and for systems, but when it comes to "a user has this property at this time with this value", everything goes out the window. I'm still not sure what the best approach is; maybe it's tractable if the system is small enough, but once you cross system boundaries it all falls apart again. An alternative approach to test data might be capturing the state of a user at a given time and reproducing it in the staging system, or freezing changes for that user in the production system, in order to reproduce an issue. This feels like one of the last properties software teams think about during development.
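The capture-and-reproduce idea can be sketched in miniature. Here plain dicts stand in for the production and staging databases, and the table and field names are hypothetical; a real version would serialise every row related to the user across the relevant stores:

```python
import copy
import json

def snapshot_user(db: dict, user_id: str) -> str:
    """Serialise everything we hold about one user at this moment."""
    state = {table: copy.deepcopy(rows[user_id])
             for table, rows in db.items() if user_id in rows}
    return json.dumps({"user_id": user_id, "state": state})

def restore_user(db: dict, snapshot: str) -> None:
    """Load a captured snapshot into another environment (e.g. staging)."""
    data = json.loads(snapshot)
    for table, row in data["state"].items():
        db.setdefault(table, {})[data["user_id"]] = row

# Hypothetical in-memory "databases" standing in for prod and staging:
prod = {"users": {"u1": {"plan": "pro"}}, "invoices": {"u1": [{"total": 42}]}}
staging = {}
restore_user(staging, snapshot_user(prod, "u1"))
```

The hard part this glosses over is exactly the cross-system problem described above: state that lives in several services has no single place to snapshot from.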

 

Synthetic data is often generated to represent production data.

It is normally used to protect the privacy and confidentiality of production data, e.g. when testing and building many different types of systems, such as fraud detection and churn prediction systems.
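A toy illustration of the idea (far simpler than what real generators do): sample synthetic records that preserve each column's observed distribution in a hypothetical production table, rather than copying any real row:

```python
import random

random.seed(7)  # reproducible synthetic output

production_rows = [
    {"age": 34, "churned": False},
    {"age": 51, "churned": True},
    {"age": 29, "churned": False},
    {"age": 45, "churned": False},
]

def synthesize(rows, n):
    """Sample each column independently from its observed values.

    This keeps marginal distributions but deliberately severs the link
    back to any real individual. It also severs correlations between
    columns, which serious generators model as well.
    """
    columns = {key: [row[key] for row in rows] for key in rows[0]}
    return [{key: random.choice(values) for key, values in columns.items()}
            for _ in range(n)]

fake = synthesize(production_rows, 100)
```

Production-grade tools (such as the one linked below) go much further, modelling joint distributions so the synthetic data remains useful for training fraud-detection or churn models.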

There are a number of approaches to generating synthetic data, described by the folks from Synthesized (synthesized.io/) in this blog post:

blog.synthesized.io/2018/11/28/thr...
