DEV Community

Discussion on: Production vs Synthetic Data for Testing

Dian Fay • Edited

Production data has some issues:

  • legal or regulatory requirements mandate anonymizing PII, patient data, financials, and so on, which requires extra effort
  • email addresses, phone numbers, and the like have to be "disarmed" to ensure integration tests can't accidentally reach users
  • data is changing all the time, so it's more difficult to write stable assertions
  • paucity of representative data for certain states: if some process requires multiple steps but in practice 99% of users complete it in one go, prod data is insufficient for testing the intermediary stages
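As a rough illustration of the "disarming" point above, here is a minimal Python sketch. The function names and the `SAFE_DOMAIN` constant are made up for this example. It leans on two reservations I'm confident about: the `.invalid` top-level domain never resolves, and 555-0100 through 555-0199 are set aside for fictional use in North American numbering. Note that this only disarms the values; a bare hash is not anonymization in any legal sense.

```python
import hashlib

# Hypothetical constant: ".invalid" is a reserved TLD that never resolves,
# so mail sent to these addresses cannot reach a real user.
SAFE_DOMAIN = "example.invalid"


def disarm_email(email: str) -> str:
    """Map a real address to a stable, undeliverable stand-in."""
    normalized = email.strip().lower()
    digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:12]
    return f"user-{digest}@{SAFE_DOMAIN}"


def disarm_phone(phone: str) -> str:
    """Map a real number into the fictional 555-0100..555-0199 range."""
    digest = hashlib.sha256(phone.encode("utf-8")).digest()
    suffix = int.from_bytes(digest[:2], "big") % 100
    return f"555-01{suffix:02d}"
```

Because the mapping is deterministic, the same production value always becomes the same stand-in, which keeps joins and uniqueness constraints mostly intact (the phone range only has 100 slots, so collisions there are expected).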

It's still important to test against production data, but it comes second. Functional coverage matters more, and the only way to be sure you cover the widest range of cases is to generate fixture data. If you're working on a smaller team, testing against live data sets will likely be entirely manual.
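To make the coverage argument concrete, here is a small Python sketch that generates one fixture per stage of a multi-step process, including the intermediate stages that prod data rarely contains. The three-step signup flow, `STEPS`, and `SignupFixture` are all invented for illustration.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical multi-step process; real step names would come from your app.
STEPS = ["account_created", "email_verified", "payment_added"]


@dataclass
class SignupFixture:
    user_id: int
    completed_steps: List[str] = field(default_factory=list)


def fixtures_for_every_stage() -> List[SignupFixture]:
    """Build one record per stage: nothing done, one step done, ..., all done."""
    return [
        SignupFixture(user_id=i + 1, completed_steps=STEPS[:i])
        for i in range(len(STEPS) + 1)
    ]
```

Each partially completed record gets exactly one representative here, regardless of how rarely real users stop at that stage.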

Fixtures are tricky to do right, and the obvious solution of a monolithic test dataset is a dead end for reasons best explained by Jorge Luis Borges. I wrote something a while ago about a more flexible modular approach based on the post-structuralist idea of rhizomes, and published a drop-in JavaScript implementation; the PHP O/RM Doctrine does something similar as well.
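For readers who want a feel for what "modular" could mean in practice, here is one possible reading in Python. It is not the JavaScript implementation mentioned above and not how Doctrine does it; the registry, the `fixture` decorator, and `load` are hypothetical names for this sketch. The idea is that each fixture declares what it depends on, and a test loads only the pieces it needs.

```python
from typing import Callable, Dict, Optional, Tuple

# Hypothetical registry: fixture name -> (dependency names, builder function).
REGISTRY: Dict[str, Tuple[Tuple[str, ...], Callable[..., dict]]] = {}


def fixture(name: str, *deps: str):
    """Register a builder under `name`, declaring which fixtures it needs."""
    def register(builder):
        REGISTRY[name] = (deps, builder)
        return builder
    return register


@fixture("user")
def make_user():
    return {"id": 1, "name": "test user"}


@fixture("order", "user")
def make_order(user):
    return {"id": 10, "user_id": user["id"]}


def load(name: str, cache: Optional[dict] = None) -> dict:
    """Build `name` and its dependencies, each at most once per cache."""
    cache = {} if cache is None else cache
    if name not in cache:
        deps, builder = REGISTRY[name]
        cache[name] = builder(*(load(dep, cache) for dep in deps))
    return cache[name]
```

A test that only cares about orders calls `load("order")` and gets a user built along the way, without pulling in anything unrelated. This sketch does no cycle detection, so mutually dependent fixtures would recurse forever.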