For small projects, it's very easy -- if not always good practice -- to grab a copy of the production database and save it locally for use in development.
However, as a project grows, this becomes increasingly time-consuming and increasingly unsafe (individual team members should probably not be walking around with tons of actual user data on their local machines).
Further, for organizations that are open source, it certainly doesn't make sense to freely distribute copies of production data to anyone who wants it.
So, my question to you is this: how do you populate your development databases?
Latest comments (19)
Quite late for the party, but while researching this subject again, I bumped into this topic and will recommend a pretty nice library for Rails projects: github.com/IFTTT/polo
This IFTTT lib helps to "travel through your database and creates sample snapshots so you can work with real world data in development."
It supports obfuscation and can handle associations (manually or automatically), and it outputs SQL that you can use in any environment without having production data.
Also, found this awesome tool for PostgreSQL: github.com/mla/pg_sample
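To make the snapshot idea concrete, here is a minimal sketch in Python using only the standard library. It is not the API of polo or pg_sample. The two-table schema, the `snapshot` and `obfuscate_email` helpers, and the sample rows are all hypothetical, made up for illustration. It picks a few users, follows one association to their orders, masks the emails, and emits plain INSERT statements.

```python
import hashlib
import sqlite3

# Hypothetical two-table schema, used only for this illustration.
SCHEMA = """
CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT NOT NULL);
CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    user_id INTEGER NOT NULL REFERENCES users(id),
    total REAL NOT NULL
);
"""


def obfuscate_email(email):
    """Replace a real address with a stable, non-identifying one."""
    digest = hashlib.sha256(email.encode("utf-8")).hexdigest()[:10]
    return f"user_{digest}@example.com"


def sql_literal(value):
    """Render a value as a SQL literal (handles NULL, numbers and text only)."""
    if value is None:
        return "NULL"
    if isinstance(value, (int, float)):
        return str(value)
    return "'" + str(value).replace("'", "''") + "'"


def snapshot(conn, user_ids):
    """Return INSERT statements for the chosen users and their orders."""
    marks = ",".join("?" for _ in user_ids)
    statements = []
    users = conn.execute(
        f"SELECT id, email FROM users WHERE id IN ({marks}) ORDER BY id",
        user_ids,
    )
    for user_id, email in users:
        masked = obfuscate_email(email)
        statements.append(
            "INSERT INTO users (id, email) VALUES "
            f"({sql_literal(user_id)}, {sql_literal(masked)});"
        )
    # Follow the users -> orders association so the sample stays consistent.
    orders = conn.execute(
        f"SELECT id, user_id, total FROM orders "
        f"WHERE user_id IN ({marks}) ORDER BY id",
        user_ids,
    )
    for order_id, user_id, total in orders:
        statements.append(
            "INSERT INTO orders (id, user_id, total) VALUES "
            f"({sql_literal(order_id)}, {sql_literal(user_id)}, "
            f"{sql_literal(total)});"
        )
    return statements


def build_demo_source():
    """Create an in-memory stand-in for the 'production' database."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO users (id, email) VALUES (?, ?)",
        [(1, "alice@corp.example"), (2, "bob@corp.example")],
    )
    conn.executemany(
        "INSERT INTO orders (id, user_id, total) VALUES (?, ?, ?)",
        [(10, 1, 19.99), (11, 1, 5.0), (12, 2, 42.5)],
    )
    conn.commit()
    return conn


if __name__ == "__main__":
    for statement in snapshot(build_demo_source(), [1]):
        print(statement)
```

Hashing the address keeps the masked value stable between runs, which helps when several tables refer to the same email. A real tool would also have to walk every association and mask every sensitive column, which this sketch does not attempt.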
We do a ton of development against Oracle databases. In any application that has sensitive data, Oracle Data Masking and Subsetting Pack (an option to Oracle Enterprise Edition) has the ability to take all the production data and massage it so that it can't be used to identify production information. All referential integrity is maintained throughout this process.
Currently, we are using Liquibase to solve this. As a developer, you know the feature you are implementing, and you have already tested it with different combinations of data (if you are a good developer!). With Liquibase, we added one more step to our process: each developer writes a Liquibase changeset to insert/update/delete the data they used for their tests. We have a profile set up so that these changesets only run in the dev context.
Benefits:
Problem:
In the past, we've relied on Liquibase to generate dummy records for the primary database tables. This was originally done to allow us to quickly and (somewhat) efficiently perform API tests (with a focus on regression testing).
More recently, our API test infrastructure has increased to the point where we can just write brief (JUnit) API tests which can be run against our test environments to easily insert data (via the proper API pathways). Because of that, we've been moving away from predefined test data. (Using 'production' data isn't really something that applies to our specific situation)
I usually have api tests so I just run the whole suite to generate data.
Agree with those who say don't use production data in development. It could possibly be useful when hunting down a nasty bug, but unless absolutely necessary I'd stay out of there. If you need to test something specific, especially new features, you probably end up adding your own data anyway, so why not do it right from start.
I can understand the attraction though, especially for projects that don't have proper tests. It's much easier to find by accident those pages that start loading slow or have tables that become unusable when populated with too much data and other similar issues. Especially those issues that may require adding a lot of fake data to find.
I like to use the faker.js library to populate my development databases. It's easy to use and simple to integrate into test suites or custom bootstrapping scripts.
It has a wide API that is well documented and covers almost anything you need. It supports many locales (i18n), which can be useful for seeding certain types of applications. Faker.js has been in development for seven years and currently has 124 contributors and 11k stars on GitHub.
It's a very nifty piece of software. We never use production data in testing or development environments. Everything is always generated with faker bootstrapping scripts customized to the application. Works very well. The localization part can also help with testing UTF-8 encodings.
Disclaimer: I am the author of faker.js
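For readers who want to see the shape of such a bootstrapping script, here is a minimal sketch in Python using only the standard library. It does not use faker.js or reproduce its API. The name lists, the `users` table, and the `fake_user` and `seed_users` helpers are assumptions made up for illustration. A fixed seed keeps the generated data identical on every machine, and the non-ASCII names give the UTF-8 handling something to chew on.

```python
import random
import sqlite3

# Tiny hand-written name pools (hypothetical); a real generator library
# ships far larger, locale-specific lists.
FIRST_NAMES = ["Ana", "Björn", "Chloé", "Dmitri", "Fatima", "Zoë", "Łukasz"]
LAST_NAMES = ["García", "Müller", "Nguyễn", "O'Brien", "Søndergaard", "Tanaka"]


def fake_user(rng, user_id):
    """Build one fake user row as (id, name, email)."""
    name = f"{rng.choice(FIRST_NAMES)} {rng.choice(LAST_NAMES)}"
    # example.com is reserved for documentation, so no real mailbox exists.
    email = f"user{user_id}@example.com"
    return (user_id, name, email)


def seed_users(conn, count, seed=42):
    """Create the users table if needed and insert `count` fake users."""
    rng = random.Random(seed)  # same seed, same data, on every machine
    conn.execute(
        "CREATE TABLE IF NOT EXISTS users ("
        "id INTEGER PRIMARY KEY, name TEXT NOT NULL, email TEXT NOT NULL UNIQUE)"
    )
    rows = [fake_user(rng, user_id) for user_id in range(1, count + 1)]
    # Parameterized inserts handle quotes such as the one in O'Brien.
    conn.executemany(
        "INSERT INTO users (id, name, email) VALUES (?, ?, ?)", rows
    )
    conn.commit()
    return rows


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    for row in seed_users(connection, 5):
        print(row)
```

Raising `count` is also a cheap way to find the slow pages and unusable tables that only show up with a lot of data.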
I use FactoryGirl to seed development data. Since I work at a financial company, I don't use production data, even sanitized production data for development. The factories share as much production code as possible, so that a seeded model is exactly the same as a manually created model, but it's just much faster.
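The point about factories sharing production code can be sketched roughly like this, in Python and not with FactoryGirl's actual API. `Account`, `create_account`, and `account_factory` are hypothetical names: the first two stand in for a production model and its creation path, and the factory only supplies defaults before calling into that same path.

```python
import itertools
from dataclasses import dataclass


@dataclass
class Account:
    owner: str
    email: str
    balance_cents: int


def create_account(owner, email, balance_cents=0):
    """Stand-in for the production creation path, validation included."""
    if "@" not in email:
        raise ValueError(f"invalid email: {email!r}")
    if balance_cents < 0:
        raise ValueError("balance cannot be negative")
    return Account(
        owner=owner.strip(),
        email=email.lower(),
        balance_cents=balance_cents,
    )


# Running counter so every seeded account gets distinct default values.
_sequence = itertools.count(1)


def account_factory(**overrides):
    """Fill in sensible defaults, then go through the production path."""
    n = next(_sequence)
    attrs = {
        "owner": f"Test Owner {n}",
        "email": f"owner{n}@example.com",
        "balance_cents": 10_000,
    }
    attrs.update(overrides)
    return create_account(**attrs)


if __name__ == "__main__":
    print(account_factory())
    print(account_factory(owner="  Overdrawn Olga ", balance_cents=0))
```

Because the factory never builds an `Account` directly, seeded records pass through the same normalization and validation as manually created ones, and invalid seed data fails loudly.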
I just published an article yesterday talking about putting a project to run fast after you get the code, here: dev.to/taq/driver-driven-developme...
As I'm using Rails for my web apps, I usually fill the seeds file with everything I need to run the project in my development environment. This gives me:
A way to run my project fast, from zero, with no extra configuration needed. Every developer who has access to the code will be able to do that.
Samples of the data needed to run the project, and even to build my fixtures/factories.
A way to reset the development database and build it again, fast, if needed.
Even with bigger projects this is working for me. When there is a massive amount of data to insert in the seeds (for example, loading all the states and cities), I put the data in external files and load them from the seeds.
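The same seed-from-external-files approach can be sketched outside Rails too. This is a minimal Python version using only the standard library, not a Rails seeds file. The `states` table, the `reseed` and `load_states` helpers, and the three sample rows are assumptions for illustration, and the CSV text is inlined here only to keep the example self-contained where a real project would read a file from disk.

```python
import csv
import io
import sqlite3

# Stand-in for an external data file such as a states CSV kept in the repo
# (hypothetical sample rows).
STATES_CSV = """code,name
CA,California
NY,New York
TX,Texas
"""


def load_states(source):
    """Read (code, name) pairs from any file-like CSV source."""
    return [(row["code"], row["name"]) for row in csv.DictReader(source)]


def reseed(conn, states_source):
    """Drop and rebuild the seeded table so a reset always starts clean."""
    conn.executescript(
        """
        DROP TABLE IF EXISTS states;
        CREATE TABLE states (code TEXT PRIMARY KEY, name TEXT NOT NULL);
        """
    )
    rows = load_states(states_source)
    conn.executemany("INSERT INTO states (code, name) VALUES (?, ?)", rows)
    conn.commit()
    return len(rows)


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    # With a real file: reseed(connection, open("states.csv", newline=""))
    inserted = reseed(connection, io.StringIO(STATES_CSV))
    print(f"seeded {inserted} states")
```

Dropping and recreating the table makes the script safe to run repeatedly, which is what allows the fast reset described above.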
Ah, so, a little story, that won't answer your question in any way other than please don't do it this way.
Way back when, I worked for a lovely company making, amongst other things, telco billing systems. We wrote them largely in server-side Javascript, which was a wild idea back in ~1999 that would surely never go mainstream. Oh, we also did Agile before it was Agile (anyone remember Extreme Programming?) and a bunch of other stuff that's now rather more mundane.
Anyway, in order to actually Get Stuff Done, we needed to populate the dev systems with test data, so we could test stuff and things. Sorry, that was obvious. Some of the test accounts would need to be large to test overflow issues, others would need outstanding payments to generate those "red reminders", and so on.
To begin with, us sober-minded developers would enter in data such as our own names and addresses. Soon, though, a frantic competition to come up with the most amusing test data emerged. Saddam Hussein, the late dictator of Iraq, ended up in the generic dev database backup. Then, so did one "Liz Windsor", who lived in Buckingham Palace. When we wanted to spin up a system, we'd just restore from this backup. Simples, right? What could possibly go wrong?
So, the great day arrived when we would go live. To ensure that nothing went wrong, we installed the live system in the tried and tested way we'd installed the development and test systems. That was, of course, the least risky way of doing things. Simples, right? What could possibly go... Oh.
Because the next month, several developers got a letter on their doorstep, from the customer, asking them to pay an often vast telephone bill. For many developers, this clearly listed all the premium-rate porn chat lines (it was the '90s, and yes, these things existed). We spent no small time carefully explaining the situation to the customer, and getting those bills cleared from the system. It wasn't pretty.
Then someone remembered that Her Majesty The Queen Elizabeth II, Queen of The United Kingdom, Fid Def and all that jazz, would also have had a letter carried to her by a highly trained footman. We all thought that the nice red writing demanding immediate payment would, no doubt, contrast nicely with the gold platter it would be borne on.