Production vs Synthetic Data for Testing

Julia Torrejón · 2018-12-18T16:35:02Z

#discuss #testdata #softwaretesting

When should we use one over the other?
Which is the best approach?
What are the criteria used to guide the decision?

Top comments (4)

Dian Fay • Dec 19 '18 • Edited

Production data has some issues:

- Legal or regulatory requirements mandate anonymizing PII, patient data, financials, and so on, which requires extra effort.
- Email addresses, phone numbers, and the like have to be "disarmed" to ensure integration tests can't accidentally reach users.
- Data is changing all the time, so it's more difficult to write stable assertions.
- Representative data for certain states is scarce: if some process requires multiple steps but in practice 99% of users complete it in one go, prod data is insufficient for testing the intermediary stages.
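
To make the "disarming" point concrete, here is a minimal sketch in Python. The function names and the record layout (`email` and `phone` keys) are assumptions made for illustration, not something from a particular tool. It rewrites addresses onto the reserved `.invalid` top-level domain, which never resolves, and zeroes out phone digits. Note that hashing the address is pseudonymization at best, so this does not by itself satisfy the anonymization requirements in the first point.

```python
import hashlib
import re


def disarm_email(email: str, sink_domain: str = "example.invalid") -> str:
    """Replace an address with a stable, undeliverable stand-in.

    The same input always maps to the same output, so joins and
    uniqueness constraints in the copied data keep working.
    """
    normalized = email.strip().lower()
    digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:12]
    return f"user-{digest}@{sink_domain}"


def disarm_phone(phone: str) -> str:
    """Keep the formatting but replace every digit with 0."""
    return re.sub(r"\d", "0", phone)


def disarm_record(record: dict) -> dict:
    """Return a copy of a record with contact fields disarmed.

    The 'email' and 'phone' keys are hypothetical; adapt them to the schema.
    """
    cleaned = dict(record)
    if cleaned.get("email"):
        cleaned["email"] = disarm_email(cleaned["email"])
    if cleaned.get("phone"):
        cleaned["phone"] = disarm_phone(cleaned["phone"])
    return cleaned
```

Running something like this as part of the copy step, before the data ever lands in a test environment, means an integration test that sends mail has nowhere real to send it.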

It's still important to test against production data, but it comes second. Functional coverage is more important, and you can only be sure of covering the most possibilities if you generate fixture data. If you're working in a smaller team, testing against live data sets will likely be all manual.
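
As a rough illustration of why generated fixtures help with coverage, the sketch below builds one record for every combination of step and plan in an invented multi-step signup process. The step names, plan names, and the `SignupFixture` type are all hypothetical. The point is only that generated data can include the intermediate stages that production data rarely contains.

```python
from dataclasses import dataclass
from itertools import product
from typing import List

# Hypothetical states of a multi-step process and a second dimension to cross with.
STEPS = ("started", "details_entered", "payment_pending", "completed")
PLANS = ("free", "paid")


@dataclass(frozen=True)
class SignupFixture:
    user_id: int
    step: str
    plan: str


def build_fixtures() -> List[SignupFixture]:
    """Create one fixture per (step, plan) pair so no state is left untested."""
    return [
        SignupFixture(user_id=i, step=step, plan=plan)
        for i, (step, plan) in enumerate(product(STEPS, PLANS), start=1)
    ]
```

Because the data is fixed and enumerated, assertions written against it stay stable, unlike assertions against a moving production snapshot.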

Fixtures are tricky to do right, and the obvious solution of a monolithic test dataset is a dead end for reasons best explained by Jorge Luis Borges. I wrote something a while ago about a more flexible modular approach based on the post-structuralist idea of rhizomes, and published a drop-in JavaScript implementation; the PHP O/RM Doctrine does something similar as well.

Alan Barr • Dec 18 '18

This is super hard. I do not quite understand how people can abstract away the complexity of data and state. We seem to manage it for configuration and for systems, but when it comes to "a user has this property at this time with this value", everything goes out the window. I am still not sure what the best approach is. Maybe it works if the system is small enough, but once you cross system boundaries it feels like it all falls apart. An alternative approach to test data might be to capture the state of a user at a given time and reproduce it in the staging system, or to disable changes for that user in the production system while reproducing an issue. This feels like one of the last properties software teams think about in development.
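
A very rough sketch of the capture-and-reproduce idea, assuming the user's state fits in a JSON-serializable structure and that both environments can be treated as simple key-value stores. The plain dicts below stand in for real databases, and both function names are made up for the example.

```python
import json
from datetime import datetime, timezone


def capture_user_state(store: dict, user_id: str) -> str:
    """Serialize one user's state, with a timestamp, into a portable snapshot."""
    snapshot = {
        "user_id": user_id,
        "captured_at": datetime.now(timezone.utc).isoformat(),
        "state": store[user_id],
    }
    # Serializing immediately freezes the state; later changes in the
    # source store do not affect the snapshot.
    return json.dumps(snapshot, sort_keys=True)


def restore_user_state(store: dict, snapshot_json: str) -> None:
    """Load a snapshot into another store (e.g. staging), overwriting that user."""
    snapshot = json.loads(snapshot_json)
    store[snapshot["user_id"]] = snapshot["state"]
```

In a real system the hard part is exactly the one described above: the state usually spans several services, so a single snapshot like this only works when everything about the user lives in one place.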

Synthesized • Oct 6 '19

Synthetic data is often generated to represent the production data.

It is normally used to protect the privacy and confidentiality of production data, e.g. when testing and building many different types of systems, such as fraud detection and churn prediction systems.
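
For a sense of what "generated to represent the production data" can mean in the simplest case, here is a toy sketch. It is not the method used by any particular product; it is just one naive technique: fit each column independently (mean and standard deviation for numeric columns, value frequencies for everything else) and sample new rows from those summaries. The function names are invented for the example. It ignores correlations between columns and offers no formal privacy guarantee.

```python
import random
import statistics
from collections import Counter


def fit_column(values):
    """Summarize one column as either a normal distribution or a frequency table."""
    if all(isinstance(v, (int, float)) for v in values):
        return ("numeric", statistics.mean(values), statistics.pstdev(values))
    counts = Counter(values)
    return ("categorical", list(counts.keys()), list(counts.values()))


def synthesize(rows, n, seed=0):
    """Sample n synthetic rows, drawing each column independently of the others."""
    rng = random.Random(seed)  # seeded so test data is reproducible
    models = {col: fit_column([row[col] for row in rows]) for col in rows[0]}
    synthetic = []
    for _ in range(n):
        new_row = {}
        for col, model in models.items():
            if model[0] == "numeric":
                new_row[col] = rng.gauss(model[1], model[2])
            else:
                new_row[col] = rng.choices(model[1], weights=model[2], k=1)[0]
        synthetic.append(new_row)
    return synthetic
```

No synthetic row is copied from a real one, but rare categorical values can still leak through the frequency table, so a sketch like this is a starting point for test data rather than a privacy solution.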

There are a number of approaches to generating synthetic data, described by the folks from Synthesized (synthesized.io/) in this blog post:

blog.synthesized.io/2018/11/28/thr...