This is such an important conversation. I have had the same dilemma, and over the years I ended up with two ways of solving it for two different use cases.
For development, regression testing and performance testing, I created a product simulator that calls the same product REST APIs called by the UI. That generates synthetic data, emulating real user usage patterns. This not only creates data for development, but it also tests all the major back-end functions of the product and it is able to recreate the data from scratch, even if there are major changes to the schemes and data structures (which is problematic with snapshots). Additionally, this is also at the base for a performance testing framework, meaning that I can measure the performance of various areas of the back-end product as the data is being generated.
For testing releasing of new builds and data migrations, I run a staging environment with a copy of the production environment, but only a subset of the real data, where all the PII has been obfuscated. This is important because real data has all sorts of corner cases and variants that is difficult to completely emulate synthetically.
We're a place where coders share, stay up-to-date and grow their careers.
We strive for transparency and don't collect excess data.