Jared Silver

Posted on Oct 9, 2017

How do you populate your development databases?

#discuss #productivity #database

For small projects, it's very easy -- if not always good practice -- to grab a copy of the production database and save it locally for use in development.

However, as a project grows, this becomes increasingly time-consuming and increasingly unsafe (individual team members should probably not be walking around with tons of actual user data on their local machines).

Further, for organizations that are open source, it certainly doesn't make sense to freely distribute copies of production data to anyone who wants it.

So, my question to you is this: how do you populate your development databases?

Latest comments (19)

Tiago Padrela Amaro • May 29 '18

Quite late to the party, but while researching this subject again I bumped into this topic, so I'll recommend a pretty nice library for Rails projects: github.com/IFTTT/polo

This IFTTT lib helps to "travel through your database and creates sample snapshots so you can work with real world data in development."

It supports obfuscation and can handle associations (manually or automatically), and it outputs SQL that you can run in any environment without needing production data there.

Tiago Padrela Amaro • Jun 7 '18

Also, found this awesome tool for PostgreSQL: github.com/mla/pg_sample

Rich Soule • Nov 17 '17

We do a ton of development against Oracle databases. In any application that has sensitive data, Oracle Data Masking and Subsetting Pack (an option to Oracle Enterprise Edition) has the ability to take all the production data and massage it so that it can't be used to identify production information. All referential integrity is maintained throughout this process.
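
To illustrate the general idea of masking while keeping referential integrity (this is not Oracle's product or its API, just a minimal Python sketch), one option is a keyed hash: the same real value always maps to the same pseudonym, so a masked foreign key still matches its masked parent row. The key, table shapes, and field names below are hypothetical.

```python
import hashlib
import hmac

# Hypothetical secret; in practice it would be kept out of the dev environment.
MASKING_KEY = b"example-masking-key"


def mask(value, prefix):
    """Map a real value to a stable pseudonym: same input, same output."""
    digest = hmac.new(MASKING_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"{prefix}_{digest[:12]}"


def mask_email(email):
    """Mask an email address while keeping it shaped like an email."""
    return mask(email, "user") + "@example.invalid"


# Hypothetical production rows: orders reference customers by email.
customers = [{"email": "alice@example.com", "name": "Alice Smith"}]
orders = [{"customer_email": "alice@example.com", "total": 42}]

masked_customers = [
    {"email": mask_email(c["email"]), "name": mask(c["name"], "name")}
    for c in customers
]
masked_orders = [
    {"customer_email": mask_email(o["customer_email"]), "total": o["total"]}
    for o in orders
]
```

Because the mapping is deterministic, joins between the masked tables still line up, while the original values do not appear in the output.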

Harshit rathod • Nov 6 '17 • Edited

Currently, we are using Liquibase to resolve this issue. As a developer, you know the feature you are implementing, and you have already tested it with different combinations of data (if you are a good developer!). With Liquibase, we added one more step to our process: each developer writes a Liquibase changeset to insert/update/delete the data they used for their tests. We have a profile set up so that these changesets only run in the dev context.

Benefits:

This data is available on every developer's machine as soon as they start the application
We can resolve most migration problems during the dev cycle, as we have all combinations of data
Your QA team already has some data when they start testing (the QA team should test on a production-clone application, but here we have two CI setups, one with the dev profile and one with prod)

Problem:

Sometimes this extra work is a burden on the developer
You cannot skip this step under a tight deadline: if you stop adding changesets for a while, the time needed to add them afterward increases exponentially, and many bugs may end up being reported from production
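
The dev-only changeset idea described in this comment can be sketched roughly as follows. This is not Liquibase's changelog format or API, only a small Python illustration of the concept under assumed names: each changeset has an id and a context, runs at most once per database, and is skipped when its context does not match the active one. The changeset contents and the `applied_changesets` table are hypothetical.

```python
import sqlite3

# Hypothetical changesets; "context" marks where each one may run.
CHANGESETS = [
    {"id": "001-create-users", "context": "all",
     "sql": "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)"},
    {"id": "002-dev-sample-users", "context": "dev",
     "sql": "INSERT INTO users (name) VALUES ('Test User')"},
]


def apply_changesets(conn, active_context):
    """Run each pending changeset once, skipping those meant for other contexts."""
    conn.execute("CREATE TABLE IF NOT EXISTS applied_changesets (id TEXT PRIMARY KEY)")
    applied = {row[0] for row in conn.execute("SELECT id FROM applied_changesets")}
    ran = []
    for change in CHANGESETS:
        if change["id"] in applied:
            continue  # already recorded as applied to this database
        if change["context"] not in ("all", active_context):
            continue  # e.g. dev-only data never reaches prod
        conn.execute(change["sql"])
        conn.execute("INSERT INTO applied_changesets (id) VALUES (?)", (change["id"],))
        ran.append(change["id"])
    conn.commit()
    return ran


dev_db = sqlite3.connect(":memory:")
prod_db = sqlite3.connect(":memory:")
dev_ran = apply_changesets(dev_db, "dev")
prod_ran = apply_changesets(prod_db, "prod")
```

In this sketch the dev database receives the sample user while the prod database only gets the schema, and running `apply_changesets` again on either one applies nothing new.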

Alain Van Hout • Oct 24 '17

In the past, we've relied on Liquibase to generate dummy records for the primary database tables. This was originally done to allow us to quickly and (somewhat) efficiently perform API tests (with a focus on regression testing).

More recently, our API test infrastructure has increased to the point where we can just write brief (JUnit) API tests which can be run against our test environments to easily insert data (via the proper API pathways). Because of that, we've been moving away from predefined test data. (Using 'production' data isn't really something that applies to our specific situation)

Jonathan Boudreau • Oct 14 '17

I usually have API tests, so I just run the whole suite to generate data.

Antero Karki • Oct 12 '17

Agree with those who say don't use production data in development. It could possibly be useful when hunting down a nasty bug, but unless absolutely necessary I'd stay out of there. If you need to test something specific, especially new features, you'll probably end up adding your own data anyway, so why not do it right from the start?

I can understand the attraction though, especially for projects that don't have proper tests. With production data it's much easier to stumble upon pages that start loading slowly, tables that become unusable when populated with too much data, and other similar issues, especially the ones that would otherwise require adding a lot of fake data to find.

Marak • Oct 12 '17 • Edited

I like to use the faker.js library to populate my development databases. It's easy to use and simple to integrate into test suites or custom bootstrapping scripts.

It has a wide API that is well documented and covers almost anything you need. It supports many locales (i18n), which can be useful for seeding certain types of applications. Faker.js has been in development for seven years and currently has 124 contributors and 11k stars on GitHub.

It's a very nifty piece of software. We never use production data in testing or development environments. Everything is always generated with faker bootstrapping scripts customized to the application. Works very well. The localization part can also help with testing UTF-8 encodings.
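
As a rough illustration of what a generated-data bootstrapping script can look like, here is a minimal Python sketch. It does not use faker.js or reproduce its API; it only uses Python's standard `random` module as a stand-in, and the name lists, field names, and `fake_users` function are hypothetical. A fixed seed makes the generated data reproducible across machines.

```python
import random

# Hypothetical sample values; a real fake-data library offers far more variety.
FIRST_NAMES = ["Ada", "Grace", "Linus", "Margaret", "Dennis"]
LAST_NAMES = ["Lovelace", "Hopper", "Torvalds", "Hamilton", "Ritchie"]


def fake_users(count, seed=0):
    """Generate `count` fake user records; the same seed yields the same records."""
    rng = random.Random(seed)
    users = []
    for i in range(count):
        first = rng.choice(FIRST_NAMES)
        last = rng.choice(LAST_NAMES)
        users.append({
            "id": i + 1,
            "name": f"{first} {last}",
            # The id keeps emails unique even when a name repeats.
            "email": f"{first}.{last}.{i + 1}@example.com".lower(),
        })
    return users


users = fake_users(10, seed=42)
```

The resulting records could then be inserted into a development database by whatever bootstrapping script the application uses.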

Disclaimer: I am the author of faker.js

Rebecca G • Oct 11 '17

I use FactoryGirl to seed development data. Since I work at a financial company, I don't use production data, even sanitized production data, for development. The factories share as much production code as possible, so a seeded model is exactly the same as a manually created one; seeding is just much faster.

Eustáquio Rangel • Oct 11 '17

I just published an article yesterday about getting a project up and running quickly once you have the code, here: dev.to/taq/driver-driven-developme...

As I'm using Rails for my web apps, I usually fill the seeds file with everything I need to run the project in my development environment. This gives me:

A way to run my project fast, from zero, with no extra configuration needed. Every developer who has access to the code can do that.
Samples of the data needed to run the project, and even to build my fixtures/factories.
A way to reset the development database and build it again, fast, if needed.

This is working for me even with bigger projects. When there is a massive amount of data to insert from the seeds, for example when all the states and cities need to be loaded, I put the data in external files and load those files from the seeds.
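
The comment above is about the Rails seeds file; as a language-neutral illustration of the same pattern (keep bulk data in an external file, rebuild the table from it on demand), here is a minimal Python sketch using `sqlite3` and `csv`. The table layout, the CSV contents, and the `seed_states` function are hypothetical, and an in-memory string stands in for the external file.

```python
import csv
import io
import sqlite3

# Stand-in for an external data file such as a hypothetical seeds/states.csv.
STATES_CSV = "code,name\nCA,California\nNY,New York\n"


def seed_states(conn, csv_source):
    """Drop and rebuild the states table, then load its rows from a CSV source."""
    conn.execute("DROP TABLE IF EXISTS states")
    conn.execute("CREATE TABLE states (code TEXT PRIMARY KEY, name TEXT NOT NULL)")
    rows = [(r["code"], r["name"]) for r in csv.DictReader(csv_source)]
    conn.executemany("INSERT INTO states (code, name) VALUES (?, ?)", rows)
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")
loaded = seed_states(conn, io.StringIO(STATES_CSV))
```

Since the function drops and recreates the table each time, it can be rerun whenever the development database needs to be reset.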

Dave Cridland • Oct 11 '17

Ah, so, a little story, that won't answer your question in any way other than please don't do it this way.

Way back when, I worked for a lovely company making, amongst other things, telco billing systems. We wrote them largely in server-side JavaScript, which was a wild idea back in ~1999 that would surely never go mainstream. Oh, we also did Agile before it was Agile (anyone remember Extreme Programming?) and a bunch of other stuff that's now rather more mundane.

Anyway, in order to actually Get Stuff Done, we needed to populate the dev systems with test data, so we could test stuff and things. Sorry, that was obvious. Some of the test accounts would need to be large to test overflow issues, others would need outstanding payments to generate those "red reminders", and so on.

To begin with, us sober-minded developers would enter in data such as our own names and addresses. Soon, though, a frantic competition to come up with the most amusing test data emerged. Saddam Hussein, the late dictator of Iraq, ended up in the generic dev database backup. Then, so did one "Liz Windsor", who lived in Buckingham Palace. When we wanted to spin up a system, we'd just restore from this backup. Simples, right? What could possibly go wrong?

So, the great day arrived when we would go live. To ensure that nothing went wrong, we installed the live system in the tried and tested way we'd installed the development and test systems. That was, of course, the least risky way of doing things. Simples, right? What could possibly go... Oh.

Because the next month, several developers got a letter on their doorstep, from the customer, asking them to pay an often vast telephone bill. For many developers, this was clearly listing all the premium-rate porn chat lines (it was the '90s, and yes, these things existed). We spent no small time carefully explaining the situation to the customer, and getting those bills cleared from the system. It wasn't pretty.

Then someone remembered that Her Majesty The Queen Elizabeth II, Queen of The United Kingdom, Fid Def and all that jazz, would also have had a letter carried to her by a highly trained footman. We all thought that the nice red writing demanding immediate payment would, no doubt, contrast nicely with the gold platter it would be borne on.
