How do you populate your development databases?

For small projects, it's very easy -- if not always good practice -- to grab a copy of the production database and save it locally for use in development.

However, as a project grows, this becomes increasingly time-consuming and increasingly unsafe (individual team members probably shouldn't be walking around with tons of actual user data on their local machines).

Further, for organizations that are open source, it certainly doesn't make sense to freely distribute copies of production data to anyone who wants it.

So, my question to you is this: how do you populate your development databases?

DISCUSSION (21)

Wow, I was just talking about this with @maestromac . We seed our database with dummy data as our default development process. Even though we could, we don't grab copies of the production DB unless we also have some safety constraints in place. I just wouldn't want to do something bad by accident in that regard.

But our process is inefficient and not an ideal development experience. I feel like it might be better if we actively maintained a central dummy database, manicured by reseeding it through a well-thought-out process of creating dummy data or preprocessing production data. We'd store that in the cloud, and every time someone wants to reset their dev environment, they'd download the ready-to-go DB, primed for a beautiful, fresh dev setup.

Does this make sense or am I way off-base?

Yup, this makes a lot of sense -- thanks, @ben !

Though, I'm curious about what your data structure looks like that you're able to generate dummy data as part of your development process.

Prod has the benefit of giving us virtually every imaginable permutation of data, whereas manually generating it will likely result in blind spots that make it tougher to catch things during the QA process.

I definitely agree with respect to the security concerns around grabbing prod, and I really like the idea of trying to maintain a central dummy database!

Yeah, our dummy data system has its flaws and only works because it's worked so far. It's far from a universal, forever solution, but we have a small team and are gradually building out concrete solutions as our team and data-scaling needs necessitate them.

The centrally maintained dummy DB could be based off production, but I think it's key for it to be its own project with dedicated maintenance (from few or many), setting the team up for success.

I'd really love to hear from others, especially those in large, hairy, long-lived projects.

I've worked at a couple major online retailers with long-lived databases and they've been grotty. Basically, after a decade or more, the only way you end up with a prod database you could synthetically dummy up is if all the members that have ever worked on the team (including management) have kept technical debt near zero. Not. Gunna. Happen. Deadlines, deferred maintenance, and shifting business goals impose compromises that have long-lived effects on how other features are built, etc. The ripples go on and on.

It becomes insane to try to maintain a synthetic dev/test database that duplicates all the things that users do, or have done, possibly using features that no longer exist, or by exploiting bugs. At some point, trying to dummy up "production-adjacent" data is as much work as just using a copy of production.

The best solution I've seen used a nightly snapshot of production together with Docker so that all devs did their daily work with a full copy of the production DB. (With user passwords, etc. stripped.) We could let our dev DB get paved over every night, or flip a switch so that the replication process would leave it alone.

This is such an important conversation. I have had the same dilemma, and over the years I ended up with two ways of solving it for two different use cases.

  • For development, regression testing, and performance testing, I created a product simulator that calls the same product REST APIs called by the UI. It generates synthetic data, emulating real user usage patterns. This not only creates data for development, it also exercises all the major back-end functions of the product, and it can recreate the data from scratch even after major changes to the schemas and data structures (which is problematic with snapshots). Additionally, it forms the base of a performance testing framework, meaning I can measure the performance of various areas of the back end as the data is being generated.
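The core of such a simulator can be a reproducible generator of API-shaped payloads. A minimal sketch (endpoint names, fields, and plan values are all hypothetical assumptions, not taken from the comment above):

```python
import random

def synthetic_signup(rng: random.Random) -> dict:
    """Build one fake signup payload shaped like what the UI would send."""
    n = rng.randrange(100_000)
    return {
        "action": "signup",
        "username": f"user{n}",
        "email": f"user{n}@example.com",
        "plan": rng.choice(["free", "pro", "enterprise"]),
    }

def simulate(seed: int, count: int) -> list[dict]:
    """Generate a reproducible stream of fake user actions.

    A real simulator would POST each payload to the same REST
    endpoints the UI calls (e.g. POST /api/signup), so seeding
    the database also exercises the back end along the way.
    """
    rng = random.Random(seed)
    return [synthetic_signup(rng) for _ in range(count)]
```

Seeding through a fixed `seed` makes runs reproducible, which is what lets the same tool double as a regression and performance harness.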

  • For testing releases of new builds and data migrations, I run a staging environment with a copy of the production environment, but only a subset of the real data, with all the PII obfuscated. This is important because real data has all sorts of corner cases and variants that are difficult to emulate completely synthetically.
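One common trick for obfuscating PII in such a subset is deterministic hashing rather than random replacement, so the same input always maps to the same pseudonym and joins on the value still line up. A sketch (field names and the salt are illustrative assumptions):

```python
import hashlib

def obfuscate_email(email: str, salt: str = "dev-dump") -> str:
    """Replace a real address with a stable pseudonym.

    Hashing (rather than random substitution) keeps the mapping
    consistent across tables, so referential integrity survives.
    """
    digest = hashlib.sha256((salt + email).encode()).hexdigest()[:12]
    return f"user-{digest}@example.invalid"

row = {"id": 7, "email": "alice@real-company.com", "plan": "pro"}
masked = {**row, "email": obfuscate_email(row["email"])}
```

The `.invalid` TLD is reserved, so obfuscated addresses can never accidentally reach a real mailbox.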

I just published an article yesterday about getting a project running fast after you get the code, here: dev.to/taq/driver-driven-developme...

As I'm using Rails for my web apps, I fill the seeds file with everything I need to run the project in my development environment. This gives me:

  • A way to run my project fast, from zero, with no extra configuration needed. Every developer who has access to the code can do that.

  • Samples of the data needed to run the project, and even to build my fixtures/factories.

  • A way to reset the development database and build it again, fast, if needed.

Even with bigger projects this works for me. When there's massive data to insert in seeds, for example when I need to load all the states and cities, I put the data in external files and load them from the seeds.
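The external-file approach is framework-agnostic. A minimal sketch of the same idea outside Rails, using an in-memory SQLite database and inline CSV for self-containment (in practice the CSV would live in a file such as `db/data/cities.csv`, a path assumed here):

```python
import csv
import io
import sqlite3

# Inline for brevity; in a real project this comes from a data file.
CITIES_CSV = """state,city
CA,Los Angeles
CA,San Francisco
NY,New York
"""

def seed_cities(conn: sqlite3.Connection, csv_text: str) -> int:
    """Create the table if needed and bulk-insert rows from CSV."""
    conn.execute("CREATE TABLE IF NOT EXISTS cities (state TEXT, city TEXT)")
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    conn.executemany(
        "INSERT INTO cities (state, city) VALUES (:state, :city)", rows
    )
    return len(rows)

conn = sqlite3.connect(":memory:")
count = seed_cities(conn, CITIES_CSV)
```

Keeping bulk data out of the seed script itself makes the seeds readable and lets non-developers maintain the data files.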

I use FactoryGirl to seed development data. Since I work at a financial company, I don't use production data, even sanitized production data for development. The factories share as much production code as possible, so that a seeded model is exactly the same as a manually created model, but it's just much faster.
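The key property described here, that a factory goes through the real model code so seeded records pass the same validations as manually created ones, can be sketched framework-free (the `Account` model, its fields, and the factory defaults are all hypothetical):

```python
import itertools
from dataclasses import dataclass

@dataclass
class Account:
    """Stands in for the real production model class."""
    email: str
    balance_cents: int

    def __post_init__(self):
        # Production validation runs for seeded data too.
        if self.balance_cents < 0:
            raise ValueError("balance cannot be negative")

_seq = itertools.count(1)

def account_factory(**overrides) -> Account:
    """Build an Account through the real constructor, with
    sequence-based defaults that each test can override."""
    n = next(_seq)
    defaults = {"email": f"seed{n}@example.com", "balance_cents": 1000}
    return Account(**{**defaults, **overrides})
```

Because the factory calls the real constructor, any validation added to the model later automatically applies to all seeded data as well.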

For a simple data structure, I see how it might be possible to just take a horizontal slice of production data and then anonymize it. (This is especially true for NoSQL databases.)

However, for more complicated SQL databases with lots of joins, I'm not sure this would be an option because so much of the data is apt to be reliant on so much other data outside of the 'slice' you'd want to take.

We use a slimmed down version of the production database dump which we call the "nocustomer" dump.

It's quite involved to make this all work (for my taste) but we've yet to find a better solution.

Of course, the "obvious" solution is to simply generate fake data.

But:

  • our database is constantly changing: we add/remove columns, change how things work, and regularly write migration scripts. On top of all that, you'd have to maintain the seeding infrastructure, too
  • the state variety is just too much, likely "infinite": 99% of the data in the database is generated externally (retrieved via 3rd-party APIs). With a lot of sources there are a lot of different states, and it feels impossible to cover them all programmatically

How do we create it? The production DB is PostgreSQL, so we:

  • create a full dump with pg_dump
  • upload it to S3 (private, of course, and only a very limited number of users can access it)
  • …and now hold your breath…
  • spin up an EC2 instance with LOTS of RAM and LOTS of fast disk
  • download the dump from S3
  • import the whole thing (with 64 parallel jobs, yay)
  • use a PostgreSQL feature to create a dedicated schema, and use the same dump to import only the DDL into this schema
  • run a series of INSERT INTO <newschema>.<table> SELECT * FROM <oldschema>.<table> WHERE … is not production data 😀 …
  • dump only this schema
  • upload it to another S3 bucket (yes, private; all developers can access this one)
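For a rough feel of the middle steps, a sketch along these lines (every bucket, database, table, and column name here is hypothetical, and the schema rewrite via `sed` is a crude simplification of whatever the real scripts do):

```shell
# Assumes the full dump was already restored into scrub_db as schema
# "public" (pg_restore with 64 parallel jobs, on the big EC2 box).
psql scrub_db -c 'CREATE SCHEMA nocustomer'

# Replay the DDL only, rewritten to target the new schema.
pg_dump --schema-only --schema=public scrub_db \
  | sed 's/public\./nocustomer./g' \
  | psql scrub_db

# Copy over only the non-customer rows, table by table.
psql scrub_db -c \
  'INSERT INTO nocustomer.accounts
   SELECT * FROM public.accounts WHERE NOT is_customer'

# Dump just the slimmed schema and publish it for developers.
pg_dump --schema=nocustomer -Fc -f nocustomer.dump scrub_db
aws s3 cp nocustomer.dump s3://acme-dev-dumps/nocustomer.dump
```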

Within the developer's VM, a special command can then download this dump, import it, rebuild Elasticsearch, etc.

Oh, and this is only the really high-level stuff. There are so many details in between to make this work seamlessly.

The process takes a couple of hours (4-6 currently), but at least it's fully automated (except for triggering it).

The instance is quite expensive (again, for my taste) but when there's a need for this dump, you usually want it as quick as possible.

While writing this down and re-reading it, I almost can't believe it, but yes, this works. It occasionally fails (like once or twice a year) but is otherwise rock solid.

Ah, so, a little story, that won't answer your question in any way other than please don't do it this way.

Way back when, I worked for a lovely company making, amongst other things, telco billing systems. We wrote them largely in server-side Javascript, which was a wild idea back in ~1999 that would surely never go mainstream. Oh, we also did Agile before it was Agile (anyone remember Extreme Programming?) and a bunch of other stuff that's now rather more mundane.

Anyway, in order to actually Get Stuff Done, we needed to populate the dev systems with test data, so we could test stuff and things. Sorry, that was obvious. Some of the test accounts would need to be large to test overflow issues, others would need outstanding payments to generate those "red reminders", and so on.

To begin with, us sober-minded developers would enter in data such as our own names and addresses. Soon, though, a frantic competition to come up with the most amusing test data emerged. Saddam Hussein, the late dictator of Iraq, ended up in the generic dev database backup. Then, so did one "Liz Windsor", who lived in Buckingham Palace. When we wanted to spin up a system, we'd just restore from this backup. Simples, right? What could possibly go wrong?

So, the great day arrived when we would go live. To ensure that nothing went wrong, we installed the live system in the tried and tested way we'd installed the development and test systems. That was, of course, the least risky way of doing things. Simples, right? What could possibly go... Oh.

Because the next month, several developers got a letter on their doorstep, from the customer, asking them to pay an often vast telephone bill. For many developers, this was clearly listing all the premium-rate porn chat lines (it was the '90's, and yes, these things existed). We spent no small time carefully explaining the situation to the customer, and getting those bills cleared from the system. It wasn't pretty.

Then someone remembered that Her Majesty The Queen Elizabeth II, Queen of The United Kingdom, Fid Def and all that jazz, would also have had a letter carried to her by a highly trained footman. We all thought that the nice red writing demanding immediate payment would, no doubt, contrast nicely with the gold platter it would be borne on.

For any project, perhaps especially small ones: never, ever, ever, ever use production data in test. Never. Period!

Make routines to generate data early on. This will only get harder if you wait, so start while the system is small.

With GDPR, there are now laws in place: if you store sensitive data unnecessarily, or use it without the user's consent in, say, test, then a huge fine will be waiting for you on the other side. So have good routines in place.

I like to use the faker.js library to populate my development databases. It's easy to use and simple to integrate into test suites or custom bootstrapping scripts.

It has a wide, well-documented API that covers almost anything you need. It supports many locales (i18n), which can be useful for seeding certain types of applications. Faker.js has been in development for seven years and currently has 124 contributors and 11k stars on GitHub.

It's a very nifty piece of software. We never use production data in testing or development environments. Everything is always generated with faker bootstrapping scripts customized to the application. Works very well. The localization part can also help with testing UTF-8 encodings.

Disclaimer: I am the author of faker.js

Currently, we are using Liquibase to resolve this issue. As a developer, you know the feature you are implementing, and you've already tested it with different combinations of data (if you are a good developer!). With Liquibase we added one more step to our process: each developer writes a Liquibase changeset to insert/update/delete the data they used to test. We have a profile set up so this changeset will only run in the dev context.

Benefits:

  • This data is available on every developer's machine as soon as they start the application
  • We can resolve most migration problems in the dev cycle, as we have all combinations of data
  • Your QA team already has some data when they start testing (the QA team should test on a production-clone application, but here we have two CIs: one with the dev profile and one with prod)

Problem:

  • Sometimes it's a burden on the developer to do this extra work
  • You cannot skip this step under a tight deadline: if you stop adding changesets for a while, the time to add them afterward increases exponentially, and many bugs will likely be reported from production
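For illustration, a dev-only changeset along these lines (the table and column names are hypothetical; `context="dev"` is the standard Liquibase mechanism for restricting a changeset to runs invoked with the `dev` context):

```xml
<!-- Runs only when Liquibase is invoked with contexts=dev -->
<changeSet id="seed-dev-users-001" author="dev-team" context="dev">
    <insert tableName="users">
        <column name="email" value="tester@example.com"/>
        <column name="plan" value="pro"/>
    </insert>
</changeSet>
```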

In the past, we've relied on Liquibase to generate dummy records for the primary database tables. This was originally done to allow us to quickly and (somewhat) efficiently perform API test (with a focus on regression testing).

More recently, our API test infrastructure has matured to the point where we can just write brief (JUnit) API tests which can be run against our test environments to easily insert data (via the proper API pathways). Because of that, we've been moving away from predefined test data. (Using 'production' data isn't really something that applies to our specific situation.)

Agree with those who say don't use production data in development. It could possibly be useful when hunting down a nasty bug, but unless absolutely necessary I'd stay out of there. If you need to test something specific, especially new features, you'll probably end up adding your own data anyway, so why not do it right from the start.

I can understand the attraction though, especially for projects that don't have proper tests. It's much easier to stumble by accident on pages that start loading slowly, tables that become unusable when populated with too much data, and other similar issues, especially those that may require adding a lot of fake data to find.

One other idea: for organizations that use frameworks like Rails where the associations between tables are explicitly expressed through models, perhaps it would be possible to somehow take a slice of the production database (anonymizing it of course) by taking a slice from key tables and then traversing the associations to gather just the right amount of data.
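A toy sketch of that traversal idea, over an in-memory stand-in for the database (the table names, the `<table>_id` foreign-key convention, and the crude singularisation are all assumptions for illustration):

```python
from collections import deque

# Toy "database": table -> {id: row}; rows reference parents via
# "<table>_id" keys, mimicking Rails-style belongs_to associations.
DB = {
    "users":    {1: {"id": 1, "name": "Ada"}},
    "orders":   {10: {"id": 10, "user_id": 1},
                 11: {"id": 11, "user_id": 1}},
    "payments": {100: {"id": 100, "order_id": 10}},
}

# Which tables reference which (normally read from the model layer).
CHILDREN = {"users": ["orders"], "orders": ["payments"], "payments": []}

def slice_from(table: str, row_id: int) -> dict:
    """Collect a seed row plus every row reachable via associations."""
    picked = {t: set() for t in DB}
    queue = deque([(table, row_id)])
    while queue:
        t, rid = queue.popleft()
        if rid in picked[t]:
            continue
        picked[t].add(rid)
        fk = f"{t[:-1]}_id"  # crude singularisation: users -> user_id
        for child in CHILDREN[t]:
            for cid, row in DB[child].items():
                if row.get(fk) == rid:
                    queue.append((child, cid))
    return {t: sorted(ids) for t, ids in picked.items() if ids}
```

The anonymization pass would then run over just the picked rows before they're exported, keeping the slice small and PII-free.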

We do a ton of development against Oracle databases. In any application that has sensitive data, Oracle Data Masking and Subsetting Pack (an option to Oracle Enterprise Edition) has the ability to take all the production data and massage it so that it can't be used to identify production information. All referential integrity is maintained throughout this process.

Quite late to the party, but while researching this subject again I bumped into this topic, and I'd recommend a pretty nice library for Rails projects: github.com/IFTTT/polo

This IFTTT lib helps to "travel through your database and creates sample snapshots so you can work with real world data in development."

It supports obfuscation and can handle associations (manually or automatically), and it outputs SQL for you to use in any environment without having production data.

Most of my projects are small enough that I insert random text by hand into the databases. For bigger projects, I use the Laravel framework, which allows seeding databases easily.

I usually have api tests so I just run the whole suite to generate data.

Jared Silver
I build software to improve education. // Current: DataCamp // Past: Codecademy, TEDx, Quill.org, EdSurge, Higher Learning Technologies, Oppia.org
