Discussion on: Production vs Synthetic Data for Testing

Production data has some issues:

legal or regulatory requirements mandate anonymizing PII, patient data, financials, and so on, which requires extra effort
email addresses, phone numbers, and the like have to be "disarmed" to ensure integration tests can't accidentally reach users
data is changing all the time, so it's more difficult to write stable assertions
paucity of representative data for certain states: if some process takes multiple steps but in practice 99% of users complete it in one go, prod data has too few records sitting in the intermediate stages to test them properly
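
To make the "disarming" point concrete, here's a minimal sketch of what I mean, assuming records are plain dicts with hypothetical `email` and `phone` keys. The function names are made up for illustration. It maps each address onto the reserved `.invalid` TLD and each phone number onto the 555-01xx block that, as far as I know, is set aside for fictional use in North America, so nothing can reach a real person. Hashing keeps the mapping stable, so the same user gets the same fake address on every run.

```python
import hashlib
import re


def disarm_email(address):
    """Map a real address to a stable, undeliverable one (hypothetical helper)."""
    normalized = address.strip().lower()
    digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:12]
    # .invalid is a reserved TLD, so mail to it can never be delivered.
    return f"user-{digest}@example.invalid"


def disarm_phone(number):
    """Map a real number into the 555-0100..555-0199 range (hypothetical helper)."""
    digits = re.sub(r"\D", "", number)
    digest = hashlib.sha256(digits.encode("utf-8")).digest()
    suffix = int.from_bytes(digest[:2], "big") % 100
    return f"+1-202-555-01{suffix:02d}"


def disarm_record(record):
    """Return a copy of the record with contact fields made harmless."""
    cleaned = dict(record)  # never mutate the source row
    if cleaned.get("email"):
        cleaned["email"] = disarm_email(cleaned["email"])
    if cleaned.get("phone"):
        cleaned["phone"] = disarm_phone(cleaned["phone"])
    return cleaned
```

Note that this only covers reachability. It is not anonymization in the legal sense, and with only 100 possible phone values different users will collide.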

It's still important to test against production data, but it comes second. Functional coverage matters more, and the only way to be sure you exercise as many cases as possible is to generate fixture data. If you're working in a smaller team, testing against live data sets will likely be all manual.
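
As a rough illustration of why generated data wins on coverage, here's a sketch that builds one fixture per stage of a multi-step process. The signup flow, its `Step` names, and the record fields are all invented for the example; the point is only that every intermediate state gets a record, however rare it is in production.

```python
from enum import Enum


class Step(Enum):
    """Stages of a hypothetical multi-step signup flow."""
    STARTED = 1
    EMAIL_VERIFIED = 2
    PROFILE_FILLED = 3
    COMPLETED = 4


def make_signup(step, user_id=1):
    """Build a signup record frozen at the given step."""
    return {
        "user_id": user_id,
        "step": step.name,
        "email_verified": step.value >= Step.EMAIL_VERIFIED.value,
        "profile": {"name": "Test User"}
        if step.value >= Step.PROFILE_FILLED.value
        else None,
        "completed": step is Step.COMPLETED,
    }


def all_signup_fixtures():
    """One record per step, so no stage goes untested."""
    return [make_signup(step, user_id=i) for i, step in enumerate(Step, start=1)]
```

Feeding `all_signup_fixtures()` into a parametrized test exercises each stage once, which prod data would almost never give you if nearly everyone finishes in one go.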

Fixtures are tricky to do right, and the obvious solution of a monolithic test dataset is a dead end for reasons best explained by Jorge Luis Borges. I wrote something a while ago about a more flexible modular approach based on the post-structuralist idea of rhizomes, and published a drop-in JavaScript implementation; the PHP O/RM Doctrine does something similar as well.
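
For a feel of what "modular" can look like, here's a small sketch in Python. To be clear, this is neither my JavaScript implementation nor Doctrine's API, just one way to express the idea under my own assumptions: fixtures are small named builders that declare what they depend on, and a test loads only the ones it asks for plus their dependencies. The `fixture` decorator, `load` function, and the user/cart/order entities are all hypothetical.

```python
_REGISTRY = {}  # fixture name -> (dependency names, builder function)


def fixture(name, depends_on=()):
    """Register a builder under a name, along with the fixtures it needs."""
    def register(fn):
        _REGISTRY[name] = (tuple(depends_on), fn)
        return fn
    return register


def load(*names):
    """Build the requested fixtures and their dependencies, nothing else."""
    loaded = {}

    def visit(name, path):
        if name in loaded:
            return
        if name in path:
            chain = " -> ".join(path + (name,))
            raise ValueError(f"circular fixture dependency: {chain}")
        deps, fn = _REGISTRY[name]
        for dep in deps:
            visit(dep, path + (name,))
        # Dependencies are built first, so the builder can read them.
        loaded[name] = fn(loaded)

    for name in names:
        visit(name, ())
    return loaded


@fixture("user")
def user(_loaded):
    return {"id": 1, "email": "user@example.invalid"}


@fixture("cart", depends_on=("user",))
def cart(loaded):
    return {"id": 10, "user_id": loaded["user"]["id"], "items": []}


@fixture("order", depends_on=("cart",))
def order(loaded):
    return {"id": 100, "cart_id": loaded["cart"]["id"], "status": "pending"}
```

A test that only cares about carts calls `load("cart")` and gets a user and a cart, with no order in sight. Compare that with a monolithic dataset, where every test drags in everything and any change risks breaking unrelated assertions.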