Aman Sachan
DataPop: Generate Realistic Synthetic Datasets in Python

Ever need realistic test data for your app? DataPop generates statistically realistic, multi-table synthetic datasets from a single YAML config.

The Problem

You need test data. Realistic, multi-table, relational test data. But:

  • Copying production data is a compliance nightmare
  • Mocking by hand produces flat, obviously fake datasets
  • Existing generators are schema-limited

Solution: DataPop

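from datapop import Dataset

# Load the table and column definitions from the YAML schema
schema = Dataset.from_yaml("schema.yml")
# Generate the rows, then write the result to a SQLite file
schema.generate(rows=50_000)
schema.to_sqlite("synthetic.db")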

Features

  • Distribution-aware: Normal, exponential, log-normal, uniform, zipfian
  • Relational integrity: Foreign keys, unique constraints
  • Multi-format: CSV, Parquet, SQLite, PostgreSQL, DuckDB
  • Schema validation: Catch constraint violations before generation
  • Seedable: Reproducible output
  • CLI: datapop generate schema.yml --rows 100k --format sqlite
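
To make "distribution-aware" and "seedable" concrete, here is a minimal sketch of the idea using only Python's standard `random` module. It is not DataPop's implementation: the `sample_column` helper and the parameter names `mu`, `sigma`, `low`, `high`, and `exponent` are hypothetical, chosen only for illustration.

```python
import random
from collections import Counter


def sample_column(rng, distribution, n, **params):
    """Draw n values from a named distribution (hypothetical helper, not DataPop's API)."""
    if distribution == "normal":
        return [rng.gauss(params["mean"], params["std"]) for _ in range(n)]
    if distribution == "exponential":
        # expovariate takes the rate, which is 1 / mean
        return [rng.expovariate(1.0 / params["mean"]) for _ in range(n)]
    if distribution == "lognormal":
        return [rng.lognormvariate(params["mu"], params["sigma"]) for _ in range(n)]
    if distribution == "uniform":
        return [rng.uniform(params["low"], params["high"]) for _ in range(n)]
    if distribution == "zipfian":
        # Rank-based weights: the k-th value is picked with weight 1 / k**exponent
        values = params["values"]
        exponent = params.get("exponent", 1.0)
        weights = [1.0 / rank**exponent for rank in range(1, len(values) + 1)]
        return rng.choices(values, weights=weights, k=n)
    raise ValueError(f"unknown distribution: {distribution}")


# A dedicated, seeded generator makes every run produce the same data
ages_a = sample_column(random.Random(42), "normal", 5, mean=32, std=8)
ages_b = sample_column(random.Random(42), "normal", 5, mean=32, std=8)

countries = sample_column(
    random.Random(42), "zipfian", 1000, values=["US", "UK", "DE", "FR"]
)
country_counts = Counter(countries)
```

With the same seed, `ages_a` and `ages_b` are identical, and `country_counts` is skewed toward the first values in the list, which is the kind of shape a flat hand-written mock rarely has.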

Example Schema (YAML)

name: ecommerce
tables:
  users:
    rows: 10000
    columns:
      user_id:
        distribution: sequence
        start: 1
      email:
        distribution: email
      country:
        distribution: choice
        values: [US, UK, DE, FR, IN, CA, AU]
      age:
        distribution: normal
        mean: 32
        std: 8
      created_at:
        distribution: date
        start: "2023-01-01"
        freq: "1d"
      plan:
        distribution: choice
        values: [free, pro, enterprise]
  products:
    rows: 500
    columns:
      product_id:
        distribution: sequence
        start: 1
      name:
        distribution: choice
        values: [Widget Pro, MegaTool, CloudSync, DataPipe, SecureVault, Analytics+, AI Studio]
      category:
        distribution: choice
        values: [SaaS, Infrastructure, Security, Analytics, AI]
      price:
        distribution: normal
        mean: 49
        std: 30
      active:
        distribution: choice
        values: [true, false]
  orders:
    rows: 50000
    columns:
      order_id:
        distribution: sequence
        start: 1
      user_id:
        distribution: foreign_key
        table: users
        column: user_id
      product_id:
        distribution: foreign_key
        table: products
        column: product_id
      amount:
        distribution: normal
        mean: 79
        std: 40
      status:
        distribution: choice
        values: [pending, completed, refunded, failed]
      created_at:
        distribution: date
        start: "2024-01-01"
        freq: "30s"
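
To show what the `sequence`, `foreign_key`, and `date` column types in the schema above could mean in practice, here is a rough sketch in plain Python. It rests on assumptions the post does not confirm: that `freq` is the step between consecutive rows, that foreign keys are sampled uniformly from the parent column, and that a `m` (minutes) or `h` (hours) unit exists alongside the `s` and `d` seen in the schema. All function names are hypothetical.

```python
import random
import re
from datetime import datetime, timedelta

# Assumed unit suffixes; only "s" and "d" appear in the example schema
_UNITS = {"s": "seconds", "m": "minutes", "h": "hours", "d": "days"}


def parse_freq(freq):
    """Turn a string such as "30s" or "1d" into a timedelta."""
    match = re.fullmatch(r"(\d+)([smhd])", freq)
    if match is None:
        raise ValueError(f"unsupported freq: {freq!r}")
    amount, unit = match.groups()
    return timedelta(**{_UNITS[unit]: int(amount)})


def sequence(start, n):
    """Consecutive integers: start, start + 1, ..."""
    return list(range(start, start + n))


def date_column(start, freq, n):
    """Timestamps spaced freq apart, beginning at start (assumed semantics)."""
    first = datetime.fromisoformat(start)
    step = parse_freq(freq)
    return [first + i * step for i in range(n)]


def foreign_key(rng, parent_keys, n):
    """Pick each value from the parent column, so every reference resolves."""
    return [rng.choice(parent_keys) for _ in range(n)]


rng = random.Random(7)
user_ids = sequence(1, 10_000)                      # users.user_id
order_user_ids = foreign_key(rng, user_ids, 50_000)  # orders.user_id
order_dates = date_column("2024-01-01", "30s", 50_000)  # orders.created_at
```

Under these assumptions, 50,000 orders spaced 30 seconds apart cover a little over 17 days, so `order_dates[-1]` is 2024-01-18 08:39:30.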

Installation

pip install datapop
# or from source
pip install .

CLI Usage

# Generate from a schema file
datapop generate examples/shop.yml --rows 10000 --format csv

# Validate a schema
datapop validate schema.yml
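
After generating SQLite output, it can be worth confirming that the foreign keys really resolve. The sketch below uses only the standard `sqlite3` module and assumes the output tables and columns are named after the schema keys (`users`, `orders`, `user_id`), which the post does not state. It runs against a tiny in-memory database so it is self-contained; pointing `sqlite3.connect` at a generated file such as `synthetic.db` would be the real-world equivalent.

```python
import sqlite3


def count_orphans(conn, child, fk_col, parent, pk_col):
    """Count child rows whose foreign key has no matching parent row.

    Identifiers are interpolated into the SQL, so pass trusted names only.
    """
    query = (
        f'SELECT COUNT(*) FROM "{child}" AS c '
        f'LEFT JOIN "{parent}" AS p ON c."{fk_col}" = p."{pk_col}" '
        f'WHERE p."{pk_col}" IS NULL'
    )
    return conn.execute(query).fetchone()[0]


# In-memory stand-in for a generated database, with one deliberately broken row
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE users (user_id INTEGER PRIMARY KEY);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, user_id INTEGER);
    INSERT INTO users VALUES (1), (2);
    INSERT INTO orders VALUES (1, 1), (2, 2), (3, 99);
    """
)

orphans = count_orphans(conn, "orders", "user_id", "users", "user_id")
print(orphans)  # 1: order 3 points at user 99, which does not exist
```

A generator that maintains relational integrity should always report zero orphans for every foreign key in the schema.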

Architecture

datapop/
├── core/          # Schema, Dataset, Column definitions
├── generators/    # Distribution generators (Normal, Uniform, Zipf, etc.)
├── exporters/     # Output formatters (CSV, Parquet, SQLite, DuckDB)
└── validators/    # Schema and constraint validators
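
The post does not describe what the validators check, so the following is only a guess at one plausible pre-generation check: every `foreign_key` column must point at a table and column that exist, and parent tables must be generated before their children. The schema is written as a plain dict mirroring the YAML layout, and the function names are hypothetical. It uses `graphlib` from the standard library (Python 3.9+).

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def foreign_keys(tables):
    """Yield (table, column, parent_table, parent_column) for each foreign_key column."""
    for tname, table in tables.items():
        for cname, spec in table["columns"].items():
            if spec.get("distribution") == "foreign_key":
                yield tname, cname, spec.get("table"), spec.get("column")


def validate_foreign_keys(tables):
    """Return a list of human-readable errors; an empty list means the schema passed."""
    errors = []
    for tname, cname, parent, pcol in foreign_keys(tables):
        if parent not in tables:
            errors.append(f"{tname}.{cname}: unknown table {parent!r}")
        elif pcol not in tables[parent]["columns"]:
            errors.append(f"{tname}.{cname}: unknown column {parent}.{pcol}")
    return errors


def generation_order(tables):
    """Order tables so that every parent comes before the tables that reference it."""
    graph = {tname: set() for tname in tables}
    for tname, _, parent, _ in foreign_keys(tables):
        graph[tname].add(parent)
    # static_order raises graphlib.CycleError if tables reference each other in a loop
    return list(TopologicalSorter(graph).static_order())


tables = {
    "users": {"columns": {"user_id": {"distribution": "sequence", "start": 1}}},
    "products": {"columns": {"product_id": {"distribution": "sequence", "start": 1}}},
    "orders": {
        "columns": {
            "order_id": {"distribution": "sequence", "start": 1},
            "user_id": {"distribution": "foreign_key", "table": "users", "column": "user_id"},
            "product_id": {"distribution": "foreign_key", "table": "products", "column": "product_id"},
        }
    },
}

errors = validate_foreign_keys(tables)
order = generation_order(tables)
```

For the example schema, `errors` is empty and `orders` comes after both `users` and `products` in `order`.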

Use Cases

  1. Test data for development — No more mocking by hand
  2. Machine learning training — Generate large synthetic datasets for ML experiments
  3. Database demos — Showcase multi-table relational schemas
  4. Privacy-preserving data sharing — Share datasets without exposing real user data

DataPop is a good fit for developers who need realistic test data without the compliance risk of using production data.
