We use a slimmed down version of the production database dump which we call the "nocustomer" dump.
It's quite involved to make this all work (for my taste) but we've yet to find a better solution.
Of course, the "obvious" solution is to simply generate fake data.
But:
our database is constantly changing
We add/remove columns, change how things work, and regularly write migration scripts => on top of that, you have to maintain the seeding infrastructure, too
the state variety is just too much, as in likely "infinite"
99% of the data in the database is "generated externally" (retrieved via 3rd-party APIs). With a lot of sources, there are a lot of different states. It feels impossible to cover them all programmatically
How do we create it? The production DB is PostgreSQL, so we:
create a full dump with pg_dump
upload to S3 (of course private and only a very limited amount of users can access it)
…and now hold your breath…
spin up an EC2 instance with LOTS of RAM and LOTS of fast disk
download the dump from S3
import the whole thing (with 64 parallel jobs, yay)
use PostgreSQL's schema feature to create a dedicated schema and use the same dump to import only the DDL into this schema
run a series of INSERT INTO <newschema>.<table> SELECT * FROM <oldschema>.<table> WHERE … is not production data 😀 … statements
dump only this schema
upload it to another S3 bucket (yes, private, all developers can access this one)
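The steps above can be sketched roughly as a script. To be clear, this is a hedged sketch, not our actual tooling: the bucket names, database names, the `nocustomer` schema name, the `sed`-based schema rewrite, and the `is_internal` filter are all placeholders I'm inventing for illustration.

```shell
#!/usr/bin/env bash
set -euo pipefail

DUMP_DIR=full.dump   # placeholder paths/names throughout

# 1. Full dump in directory format, which allows parallel dump/restore.
pg_dump --format=directory --jobs=8 --file="$DUMP_DIR" production

# 2. Ship it to the restricted bucket (private, tightly scoped access).
aws s3 cp --recursive "$DUMP_DIR" s3://internal-full-dumps/latest/

# --- on the big EC2 instance ---
aws s3 cp --recursive s3://internal-full-dumps/latest/ "$DUMP_DIR"

# 3. Import the whole thing with 64 parallel jobs.
createdb scratch
pg_restore --jobs=64 --dbname=scratch "$DUMP_DIR"

# 4. Re-create only the DDL inside a dedicated schema. Rewriting the
#    schema name with sed is a crude stand-in; the real mechanics of
#    pointing the DDL at the new schema are more involved.
pg_restore --schema-only --file=ddl.sql "$DUMP_DIR"
sed 's/\bpublic\./nocustomer./g' ddl.sql | psql scratch

# 5. Copy over only the non-customer rows, table by table, e.g.:
psql scratch <<'SQL'
INSERT INTO nocustomer.accounts
SELECT * FROM public.accounts
WHERE is_internal;  -- placeholder for "is not production data"
SQL

# 6. Dump only the new schema and publish it for developers.
pg_dump --format=custom --schema=nocustomer \
        --file=nocustomer.dump scratch
aws s3 cp nocustomer.dump s3://developer-dumps/latest/
```

The directory format in step 1 is what makes the `--jobs=64` restore in step 3 possible; a plain SQL dump can't be restored in parallel.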
Within the developer's VM, a special command can then download this dump, import it, rebuild Elasticsearch, etc.
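Conceptually, that developer-side command boils down to something like the following. Again, the bucket, dump file, and database names are hypothetical; the real command is baked into the VM tooling:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Fetch the latest sanitized dump (hypothetical bucket/file names).
aws s3 cp s3://developer-dumps/latest/nocustomer.dump /tmp/nocustomer.dump

# Replace the local dev database contents with it.
pg_restore --clean --if-exists --jobs=4 \
           --dbname=app_dev /tmp/nocustomer.dump

# Afterwards, rebuild derived stores, e.g. reindex Elasticsearch.
```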
Oh, and this is only the really high-level stuff. There are so many details in between to make this work seamlessly.
The process takes a couple of hours (4-6 currently) but at least it's fully automated (except triggering it).
The instance is quite expensive (again, for my taste) but when there's a need for this dump, you usually want it as quick as possible.
While writing this down and re-reading it, I almost can't believe it, but yes, this works. It occasionally fails (like once or twice a year) but is otherwise rock solid.