DataPop: Generate Realistic Synthetic Datasets in Python
Ever need realistic test data for your app? DataPop generates statistically realistic multi-table synthetic datasets with a single YAML config.
The Problem
You need test data. Realistic, multi-table, relational test data. But:
- Copying production data is a compliance nightmare
- Mocking by hand produces flat, obviously fake datasets
- Existing generators are schema-limited
Solution: DataPop
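from datapop import Dataset

# Load the table and column definitions from the YAML config
schema = Dataset.from_yaml("schema.yml")
# Generate the data, then export it to a SQLite database file
schema.generate(rows=50_000)
schema.to_sqlite("synthetic.db")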
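To sanity-check an export like the one above, Python's standard-library sqlite3 module can list each table and its row count. This is a minimal sketch that assumes the SQLite export contains one table per schema entry; `table_row_counts` is a hypothetical helper written for this post, not part of DataPop.

```python
import sqlite3


def table_row_counts(conn):
    """Return {table_name: row_count} for every user table in the database."""
    names = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
        if not row[0].startswith("sqlite_")  # skip SQLite's internal tables
    ]
    # Names come straight from sqlite_master, so double-quoting them is enough.
    return {
        name: conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
        for name in names
    }
```

Calling `table_row_counts(sqlite3.connect("synthetic.db"))` after a run should then show whether each table received the number of rows you expected.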
Features
- Distribution-aware: Normal, exponential, log-normal, uniform, zipfian
- Relational integrity: Foreign keys, unique constraints
- Multi-format: CSV, Parquet, SQLite, PostgreSQL, DuckDB
- Schema validation: Catch constraint violations before generation
- Seedable: Reproducible output
- CLI: datapop generate schema.yml --rows 100k --format sqlite
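To make "distribution-aware" and "seedable" concrete, here is a rough sketch of the idea using only Python's standard-library random module. It is not DataPop's implementation. The function `sample_column` and the parameter names `rate`, `mu`, `sigma`, `low`, `high`, `values`, and `exponent` are assumptions made up for illustration; only `mean` and `std` appear in the example schema.

```python
import random


def sample_column(distribution, n, seed, **params):
    """Draw n values from a named distribution using a private, seeded RNG."""
    rng = random.Random(seed)  # local RNG, so no shared global state
    if distribution == "normal":
        return [rng.gauss(params["mean"], params["std"]) for _ in range(n)]
    if distribution == "exponential":
        return [rng.expovariate(params["rate"]) for _ in range(n)]
    if distribution == "log-normal":
        return [rng.lognormvariate(params["mu"], params["sigma"]) for _ in range(n)]
    if distribution == "uniform":
        return [rng.uniform(params["low"], params["high"]) for _ in range(n)]
    if distribution == "zipfian":
        # The value at rank k (1-based) is drawn with weight 1 / k**exponent.
        values = params["values"]
        weights = [1 / (k ** params["exponent"]) for k in range(1, len(values) + 1)]
        return rng.choices(values, weights=weights, k=n)
    raise ValueError(f"unknown distribution: {distribution}")


ages_a = sample_column("normal", 5, seed=42, mean=32, std=8)
ages_b = sample_column("normal", 5, seed=42, mean=32, std=8)
print(ages_a == ages_b)  # same seed, same values
```

Giving each column its own seeded `random.Random` instance is one simple way to keep output reproducible regardless of the order in which columns are generated.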
Example Schema (YAML)
name: ecommerce
tables:
  users:
    rows: 10000
    columns:
      user_id:
        distribution: sequence
        start: 1
      email:
        distribution: email
      country:
        distribution: choice
        values: [US, UK, DE, FR, IN, CA, AU]
      age:
        distribution: normal
        mean: 32
        std: 8
      created_at:
        distribution: date
        start: "2023-01-01"
        freq: "1d"
      plan:
        distribution: choice
        values: [free, pro, enterprise]
  products:
    rows: 500
    columns:
      product_id:
        distribution: sequence
        start: 1
      name:
        distribution: choice
        values: [Widget Pro, MegaTool, CloudSync, DataPipe, SecureVault, Analytics+, AI Studio]
      category:
        distribution: choice
        values: [SaaS, Infrastructure, Security, Analytics, AI]
      price:
        distribution: normal
        mean: 49
        std: 30
      active:
        distribution: choice
        values: [true, false]
  orders:
    rows: 50000
    columns:
      order_id:
        distribution: sequence
        start: 1
      user_id:
        distribution: foreign_key
        table: users
        column: user_id
      product_id:
        distribution: foreign_key
        table: products
        column: product_id
      amount:
        distribution: normal
        mean: 79
        std: 40
      status:
        distribution: choice
        values: [pending, completed, refunded, failed]
      created_at:
        distribution: date
        start: "2024-01-01"
        freq: "30s"
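The `sequence` and `foreign_key` columns in this schema are what keep the three tables consistent. The sketch below shows one plausible way such columns could be filled, using only the standard library. It is an illustration under that assumption, not DataPop's code, and the helper functions `sequence` and `foreign_key` are hypothetical names borrowed from the schema keywords.

```python
import random


def sequence(start, rows):
    """Consecutive integer keys, like the `sequence` columns in the schema."""
    return list(range(start, start + rows))


def foreign_key(parent_keys, rows, rng):
    """Pick each child value from the parent's keys so no reference dangles."""
    return [rng.choice(parent_keys) for _ in range(rows)]


rng = random.Random(7)  # fixed seed for reproducible output

# Parent tables come first: their keys must exist before orders can point at them.
user_ids = sequence(start=1, rows=10_000)
product_ids = sequence(start=1, rows=500)

orders = [
    {"order_id": oid, "user_id": uid, "product_id": pid}
    for oid, uid, pid in zip(
        sequence(start=1, rows=50_000),
        foreign_key(user_ids, 50_000, rng),
        foreign_key(product_ids, 50_000, rng),
    )
]
```

Because every `user_id` and `product_id` in `orders` is drawn from keys that already exist, joins against `users` and `products` never hit a missing row. Uniform sampling is the simplest choice here; a skewed draw would model a few heavy buyers more realistically.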
Installation
pip install datapop
# or from source
pip install .
CLI Usage
# Generate from a schema file
datapop generate examples/shop.yml --rows 10000 --format csv
# Validate a schema
datapop validate schema.yml
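What `datapop validate` checks in detail is not covered in this post. As one example of a constraint that can be caught before generation, the sketch below scans a schema (written as a plain dict that mirrors the YAML layout) for `foreign_key` columns pointing at a table or column that does not exist. `find_broken_references` is a hypothetical helper, not a DataPop function.

```python
def find_broken_references(schema):
    """Return a message for every foreign_key column whose target is missing."""
    tables = schema.get("tables", {})
    errors = []
    for table_name, table in tables.items():
        for column_name, spec in table.get("columns", {}).items():
            if spec.get("distribution") != "foreign_key":
                continue
            target_table = spec.get("table")
            target_column = spec.get("column")
            where = f"{table_name}.{column_name}"
            if target_table not in tables:
                errors.append(f"{where}: unknown table {target_table!r}")
            elif target_column not in tables[target_table].get("columns", {}):
                errors.append(f"{where}: unknown column {target_table}.{target_column}")
    return errors


# A deliberately broken schema: orders.product_id points at a missing table.
schema = {
    "tables": {
        "users": {"columns": {"user_id": {"distribution": "sequence", "start": 1}}},
        "orders": {
            "columns": {
                "user_id": {
                    "distribution": "foreign_key",
                    "table": "users",
                    "column": "user_id",
                },
                "product_id": {
                    "distribution": "foreign_key",
                    "table": "products",
                    "column": "product_id",
                },
            }
        },
    }
}
print(find_broken_references(schema))
```

Running a check like this before generation means a typo in a table name fails in milliseconds instead of after tens of thousands of rows have been produced.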
Architecture
datapop/
├── core/ # Schema, Dataset, Column definitions
├── generators/ # Distribution generators (Normal, Uniform, Zipf, etc.)
├── exporters/ # Output formatters (CSV, Parquet, SQLite, DuckDB)
└── validators/ # Schema and constraint validators
Use Cases
- Test data for development — No more mocking by hand
- Machine learning training — Generate large synthetic datasets for ML experiments
- Database demos — Showcase multi-table relational schemas
- Privacy-preserving data sharing — Share datasets without exposing real user data
Great for developers who need realistic test data without the compliance risk of using production data.