Misata: Generate Realistic Synthetic Datasets From Plain English Descriptions
Generating synthetic data in Python used to mean one of three things: writing random.uniform() loops by hand, using Faker for fake names and emails, or spending a week configuring SDV on top of real data you might not even have. LLMs are the newer fourth option, but maintaining the business logic and referential integrity is still a nightmare.
Misata is none of those things.
One sentence in. Multiple related tables out. Distributions calibrated to real-world statistics. Foreign key integrity guaranteed. Monthly revenue targets hit to the cent.
pip install misata
import misata
tables = misata.generate(
"A SaaS company with 2000 users. "
"MRR rises from 80k in January to 320k in June, "
"drops to 180k in August due to churn, "
"then recovers to 400k in December.",
seed=42,
)
That generates two linked tables with 21,000+ rows. Here is what the monthly MRR looks like when you sum the rows:
Jan $80,000 ✓
Feb $128,000 ✓
Mar $176,000 ✓
Apr $224,000 ✓
May $272,000 ✓
Jun $320,000 ✓
Jul $250,000 ✓
Aug $180,000 ✓ <- churn dip, as described
Sep $235,000 ✓
Oct $290,000 ✓
Nov $345,000 ✓
Dec $400,000 ✓
Every target exact. Not approximate. The individual rows still follow a log-normal distribution (median MRR $126, mean $150, p90 $291) because that is what real SaaS revenue looks like. But the monthly totals are pinned to whatever story you gave it.
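One way to reconcile row-level realism with exact aggregates (this is a conceptual sketch, not Misata's internal code) is to draw log-normal rows and then rescale each month so its total lands on the target. The targets, row counts, and distribution parameters below are illustrative:

```python
import numpy as np
import pandas as pd

# Sketch: keep individual rows log-normal, then rescale each month
# so its total hits the stated target. Parameters are illustrative.
rng = np.random.default_rng(42)
targets = {"Jan": 80_000.0, "Jun": 320_000.0, "Dec": 400_000.0}

frames = []
for month, target in targets.items():
    # Realistic per-subscription MRR values with a log-normal tail.
    draws = rng.lognormal(mean=np.log(126), sigma=0.75, size=800)
    draws *= target / draws.sum()                 # pin the monthly aggregate
    cents = np.round(draws, 2)
    cents[-1] += round(target - cents.sum(), 2)   # absorb the rounding residual
    frames.append(pd.DataFrame({"month": month, "mrr": cents}))

df = pd.concat(frames, ignore_index=True)
print(df.groupby("month", sort=False)["mrr"].sum())
```

The rescaling preserves the shape of the distribution (every row is multiplied by the same constant), which is why the individual rows can stay log-normal while the monthly sums are pinned.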
The core problem: why most synthetic data is useless
There's a gap between what synthetic data generators produce and what you actually need to build, test, or demo a data system.
Uniform distributions lie. Real revenue data is log-normal. Real fraud rates hover around 2%, not 50%. Real product category distributions follow Zipf's law - one category dominates, the others trail off. When your fake data looks nothing like the real thing, your model trains on lies, your dashboards tell wrong stories, and your tests pass cases that would fail in production.
Referential integrity breaks things. If you're testing a JOIN across customers and transactions, orphan foreign keys will silently ruin your results. Most data generators either skip relational structure entirely or produce it inconsistently.
Business targets get ignored. You don't just want data that looks roughly right. You want a dataset where Q3 revenue dips 22% due to a simulated product recall, or where churn spikes in August because your description says so. No general-purpose generator can do this.
Misata was built specifically to close this gap.
Why distributions matter more than people think
Most fake data generators produce values that are uniformly distributed. When you plot them, everything looks flat. Real business data is never flat.
Misata ships calibrated distribution priors for seven domains. Here is what that means in practice.
Fintech: fraud rates, credit scores, and account balances
tables = misata.generate(
"A fintech company with 2000 customers and banking transactions.",
seed=42,
)
transactions = tables["transactions"]
print(f"Fraud rate: {transactions['is_fraud'].mean() * 100:.2f}%")
Fraud rate: 2.00%
400 fraudulent transactions out of 20,000. The calibrated real-world baseline for card fraud is around 2%. That is what you get. Not a random number. A calibrated one.
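The fraud flag behaves like a Bernoulli draw at that baseline. A minimal sketch, with the 2% rate set by hand for illustration:

```python
import numpy as np

# Sketch: an is_fraud flag drawn as a Bernoulli variable at the
# ~2% card-fraud baseline described above (rate set by hand here).
rng = np.random.default_rng(42)
is_fraud = rng.random(20_000) < 0.02

rate = is_fraud.mean()
print(f"Fraud rate: {rate * 100:.2f}%")
```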
Credit scores follow the actual US distribution:
mean: 679 (real US average: 680-720)
std: 80 (real range: 70-90)
min: 328
max: 850
Account balances follow log-normal because real bank balances do:
median $1,976
mean $6,128
p90 $14,260
p99 $62,565
Most customers have under two thousand dollars. A few have tens of thousands. The tail is real. This matters enormously if you're building fraud detection models, credit scoring pipelines, or stress-testing payment infrastructure — a flat distribution would make every one of those models overfit to a distribution that doesn't exist in production.
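The shape of that tail is easy to see in a few lines. This sketch uses illustrative mu/sigma values, not Misata's calibrated priors:

```python
import numpy as np

# Sketch of the log-normal shape described above.
rng = np.random.default_rng(42)
balances = rng.lognormal(mean=np.log(2000), sigma=1.2, size=100_000)

median = np.median(balances)        # the "typical" customer
mean = balances.mean()              # dragged upward by the tail
p90 = np.percentile(balances, 90)

# A heavy right tail pulls the mean well above the median.
assert median < mean < p90
print(f"median ${median:,.0f}  mean ${mean:,.0f}  p90 ${p90:,.0f}")
```

A uniform draw over the same range would put the mean and median on top of each other, which is exactly the giveaway that data is fake.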
Healthcare: blood type frequencies, age distributions, and appointment patterns
tables = misata.generate("A hospital with 500 patients and doctors.", seed=42)
patients = tables["patients"]
Blood type Generated Real-world
O+ 37.9% 38.0% ✓
A+ 33.9% 34.0% ✓
B+ 9.6% 9.0% ✓
AB+ 3.0% 3.0% ✓
O- 6.5% 7.0% ✓
A- 6.1% 6.0% ✓
B- 2.0% 2.0% ✓
AB- 0.9% 1.0% ✓
All eight blood types within 0.6% of the actual ABO/Rh frequency distribution. Patient ages center on 45 with a standard deviation of 18, matching a chronic-care hospital population. Nobody configured any of this. It is what the healthcare domain prior knows.
This level of epidemiological accuracy is essential when you're training triage models, testing EHR systems, or building health analytics pipelines that will eventually run on real patient data.
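A domain prior of this kind boils down to categorical sampling against published frequencies. This sketch uses the US ABO/Rh figures directly; Misata's healthcare prior works in the same spirit:

```python
import numpy as np

# Sketch: categorical sampling against published US ABO/Rh frequencies.
blood_types = ["O+", "A+", "B+", "AB+", "O-", "A-", "B-", "AB-"]
freqs = [0.38, 0.34, 0.09, 0.03, 0.07, 0.06, 0.02, 0.01]

rng = np.random.default_rng(42)
sample = rng.choice(blood_types, size=50_000, p=freqs)

# Empirical shares converge on the target frequencies.
for bt, p in zip(blood_types, freqs):
    print(f"{bt:>3} {(sample == bt).mean():6.1%} (target {p:.1%})")
```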
Ecommerce: Zipf categories, seasonal peaks, and return rates
schema = misata.parse(
"An ecommerce store with 5000 customers and orders. "
"Revenue grows from 100k in January to 300k in November "
"then 350k in December.",
rows=5000,
)
tables = misata.generate_from_schema(schema)
Product categories follow Zipf's law because that is how real shopping behavior works:
electronics 47.1%
clothing 20.0%
home & garden 12.3%
sports 8.7%
books 6.5%
beauty 5.5%
One category dominates. The rest trail off. Uniform would give you ~17% each. Real shopping does not look like that.
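A Zipf weighting is simple to sketch: weight each category by the inverse of its rank and normalize. The exponent (s = 1) and category list below are illustrative, not Misata's calibrated priors:

```python
import numpy as np

# Sketch of Zipf-weighted category sampling: weight ∝ 1/rank.
categories = ["electronics", "clothing", "home & garden",
              "sports", "books", "beauty"]
ranks = np.arange(1, len(categories) + 1)
probs = (1.0 / ranks) / (1.0 / ranks).sum()

rng = np.random.default_rng(42)
orders = rng.choice(categories, size=20_000, p=probs)

# Rank 1 dominates; shares decay monotonically down the rank order.
shares = [(orders == c).mean() for c in categories]
for c, s in zip(categories, shares):
    print(f"{c:<14} {s:6.1%}")
```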
Order statuses come with realistic rates:
completed 71.5%
shipped 12.4%
pending 8.2%
returned 5.0%
cancelled 3.0%
Real e-commerce return-plus-cancellation rates run around 8–10%, and the generated returned (5.0%) and cancelled (3.0%) shares land at exactly 8%. If you're building a returns processing pipeline, this means your test data will actually stress the right code paths.
Referential integrity across all tables
Every child table samples foreign key values from the actual parent pool. This means zero orphan rows by construction, not by luck.
tables = misata.generate(
"A fintech company with 2000 customers and banking transactions.",
seed=42,
)
customers = tables["customers"] # 2,000 rows
accounts = tables["accounts"] # 2,600 rows
transactions = tables["transactions"] # 20,000 rows
# Both FK edges hold
orphan_accounts = (~accounts["customer_id"].isin(customers["customer_id"])).sum()
orphan_txns = (~transactions["account_id"].isin(accounts["account_id"])).sum()
print(orphan_accounts) # 0
print(orphan_txns) # 0
Tables are generated in topological dependency order. Parents first. Children sample from the completed parent pool. It cannot produce orphans.
This matters for any workflow that involves JOINs. Referential integrity errors in test data produce false negatives — your pipeline looks like it works until it meets real data.
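The construction is easy to sketch outside the library: build parents first, then let every child sample its foreign keys from the finished parent pool. The table names and row counts below mirror the example above but are otherwise illustrative:

```python
import numpy as np
import pandas as pd

# Sketch of generation in dependency order: parents first, children
# sample FKs from the completed parent pool, so orphans cannot occur.
rng = np.random.default_rng(42)

customers = pd.DataFrame({"customer_id": np.arange(1, 2001)})
accounts = pd.DataFrame({
    "account_id": np.arange(1, 2601),
    # FK drawn only from IDs that already exist in the parent table.
    "customer_id": rng.choice(customers["customer_id"], size=2600),
})
transactions = pd.DataFrame({
    "txn_id": np.arange(1, 20_001),
    "account_id": rng.choice(accounts["account_id"], size=20_000),
})

orphan_accounts = (~accounts["customer_id"].isin(customers["customer_id"])).sum()
orphan_txns = (~transactions["account_id"].isin(accounts["account_id"])).sum()
print(orphan_accounts, orphan_txns)
```

Because the child's FK column is sampled from the parent's key column, a violating row is unrepresentable, which is a stronger property than validating after the fact.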
The two-step flow for more control
When you want to inspect or modify the schema before committing to generation:
schema = misata.parse("A hospital with 500 patients and doctors.")
print(schema.summary())
Schema: Healthcare Dataset
Domain: healthcare
Tables (3)
doctors 25 rows [doctor_id, first_name, last_name, specialty, years_experience]
patients 500 rows [patient_id, name, age, gender, blood_type, registered_at]
appointments 1500 rows [appointment_id, patient_id, doctor_id, type, duration_minutes]
Relationships (2)
patients.patient_id -> appointments.patient_id
doctors.doctor_id -> appointments.doctor_id
Adjust the seed, add columns, change row counts. Then generate. The two-step flow is useful for teams where a data engineer defines the schema and a developer generates data against it — the schema becomes a shared artifact you can version control.
Real-world use cases
Use case 1: Training ML models without access to production data
Privacy regulations — GDPR, HIPAA, CCPA — make it difficult or impossible to use real user data for model training in many industries. The usual workaround is anonymization, but anonymized data often loses the statistical properties that make it useful for training.
Misata generates statistically calibrated data with no PII at all. A fraud detection team can produce 500,000 transactions with a realistic 2% fraud rate, a plausible credit score distribution, and calibrated account balance tails — without touching a single real customer record.
tables = misata.generate(
"A fintech company with 50000 customers and banking transactions. "
"Fraud rate is 2%. High-value accounts above 50k balance are 3x more likely to be targeted.",
seed=42,
)
The model trains on data that behaves like production data, and because no real records are involved, there is nothing to re-identify.
Use case 2: Seeding development and staging databases
Every new developer joining a product team hits the same wall: the development database is empty or has three test rows from 2019. You can't build features that depend on realistic data patterns without realistic data.
Misata can seed a full development database in seconds:
import misata
from misata import seed_database
tables = misata.generate("A SaaS company with 1000 users.", seed=42)
report = seed_database(tables, "postgresql://user:pass@localhost/mydb", create=True)
print(report.total_rows) # 12,400
Or from the CLI, which makes it easy to add to a Makefile or docker-compose setup:
misata generate \
--story "A SaaS company with 1000 users" \
--db-url postgresql://user:pass@localhost/mydb \
--db-create --db-truncate
SQLite works too for local-only development:
misata generate \
--story "A SaaS company with 1000 users" \
--db-url sqlite:///./dev.db \
--db-create --db-truncate
A new developer can run make seed-db and have a working dataset in their environment in under 10 seconds.
Use case 3: Building product demos without real customer data
Sales engineering teams routinely need to demo analytics dashboards, CRM systems, and data products to prospects. Using real customer data for demos is a legal and ethical non-starter. Using hand-crafted fake data means someone spends two days building a CSV in Excel.
Misata lets you generate a compelling, internally consistent demo dataset for any domain:
tables = misata.generate(
"A B2B SaaS company with 800 enterprise customers. "
"ARR grows from 2M in Q1 to 5M in Q4. "
"Average contract value is 6000. Churn rate is 8%.",
seed=42,
)
The result is a dataset where every KPI in the demo dashboard reflects a plausible business trajectory — not a random scatter of numbers.
Use case 4: Testing data pipelines and ETL systems
Data pipeline tests are only as good as the data they run on. Edge cases like NULL foreign keys, skewed distributions, and outlier values are exactly what break pipelines in production — and exactly what hand-crafted test data tends to miss.
Misata's calibrated distributions naturally produce the tail values that stress-test pipelines:
tables = misata.generate(
"A logistics company with 10000 shipments. "
"Include delayed deliveries at a 12% rate. "
"International shipments are 30% of total volume.",
seed=42,
)
The p99 values in account balances, the occasional NULL in optional fields, the rare blood type AB- at 1% frequency — these are the values that reveal pipeline brittleness.
Use case 5: Generating benchmark datasets for academic and research use
Researchers publishing papers on data systems, query optimizers, or ML benchmarks need datasets that are reproducible, realistic, and free of privacy concerns. Misata's seed parameter makes generation fully deterministic:
tables = misata.generate(
"A marketplace with 5000 buyers and sellers, orders, and product listings.",
seed=42, # Anyone running this gets the exact same dataset
)
Share the seed and description in your paper. Readers can reproduce your exact dataset with a single Python call.
Use case 6: Prototyping data products and BI dashboards
Before you connect a BI tool to production data, you need something to build against. Misata gives you a structurally correct, statistically plausible dataset to prototype on — so you can validate your data model, build your first dashboard, and demo your schema to stakeholders before a single production row exists.
LLM-powered generation for custom domains
The rule-based parser covers SaaS, ecommerce, fintech, healthcare, marketplace, logistics, and pharma. For anything outside those domains, the LLM backend handles arbitrary schema generation:
from misata import LLMSchemaGenerator
gen = LLMSchemaGenerator(provider="groq") # or openai, ollama
schema = gen.generate_from_story(
"A B2B marketplace with vendor tiers, SLA contracts, and quarterly invoices"
)
tables = misata.generate_from_schema(schema)
This works with any LLM provider that supports the OpenAI-compatible API format. Requires GROQ_API_KEY or OPENAI_API_KEY. Retries automatically on rate limits.
The LLM path is useful for:
- Industry-specific schemas with unusual entities (clinical trials, commodity trading, fleet management)
- Multi-tenant SaaS with complex permission hierarchies
- Any domain where the rule-based parser doesn't have calibrated priors
The LLM infers a reasonable schema, column types, and row count ratios from your description. You get back the same DataFrames as the rule-based path — just with the schema derived from a language model instead of hard-coded priors.
How it compares to Faker and SDV
Faker generates individual fake values. One row at a time. It has no concept of tables that reference each other and no domain-specific distributions. Wiring foreign keys and getting log-normal amounts is your job.
SDV (Synthetic Data Vault) learns patterns from real data and generates synthetic copies. It requires actual training data, pulls in heavy ML dependencies, and cannot pin specific business targets like "fraud rate must be 2%." If you don't have real data to train on, SDV is a dead end.
Misata generates from a description. No real data required. No ML training. Distributions are calibrated to domain knowledge. Business targets are exact.
| Capability | Faker | SDV | Misata |
|---|---|---|---|
| Multi-table FK integrity | No | Partial | Yes |
| No real data needed | Yes | No | Yes |
| Calibrated domain distributions | No | Learned | Yes |
| Exact monthly aggregate targets | No | No | Yes |
| Plain-English story input | No | No | Yes |
| Database seeding | Manual | No | Yes |
| LLM-powered custom domains | No | No | Yes |
| Reproducible with seed | Yes | No | Yes |
The key distinction: SDV is a synthetic data replication tool. Misata is a synthetic data generation tool. They solve different problems. SDV needs real data to learn from. Misata generates from scratch.
Installation and quick start
pip install misata pandas numpy
All of these produce full verified output in under 3 seconds:
python examples/saas_revenue_curve.py
python examples/fintech_fraud_detection.py
python examples/healthcare_multi_table.py
python examples/ecommerce_seasonal.py
Or open the Colab notebook and run it without installing anything. No signup, no API key, no configuration.
Design principles
A few constraints Misata holds to that are worth understanding:
Determinism over randomness. Given the same description and seed, you always get the same dataset. This is non-negotiable for reproducible research and CI pipelines where test data needs to be stable across runs.
Statistical realism over convenience. It would be simpler to generate uniformly distributed values. Misata does not do this because uniform distributions produce data that behaves nothing like real data. The extra calibration work is the point.
Aggregate targets are constraints, not approximations. When you say MRR should be $320k in June, the generated data will sum to exactly $320k in June. Not $318k. Not $322k. The individual rows remain statistically realistic while the aggregates are treated as hard constraints.
Referential integrity is structural, not checked. Misata does not generate data and then validate foreign keys. It generates in dependency order so invalid keys cannot occur. This is a stronger guarantee than post-hoc validation.
Frequently asked questions
Can I add custom columns to a generated schema?
Yes. The two-step parse → generate_from_schema flow lets you inspect and modify the schema object before generating. You can add columns, change data types, adjust row counts, and modify relationship cardinality.
How large can generated datasets be?
Misata is DataFrame-based, so the practical limit is your available RAM. For datasets larger than a few million rows, you can generate in chunks and write directly to a database using seed_database. Benchmarks on a standard laptop show ~500k rows/second for most schemas.
Does it support databases other than PostgreSQL and SQLite?
seed_database accepts any SQLAlchemy connection string, which covers PostgreSQL, MySQL, SQLite, MS SQL Server, Oracle, and others. If SQLAlchemy can connect to it, Misata can seed it.
Is there a way to generate time-series data?
Temporal columns are supported. The registered_at, transaction_date, and similar timestamp fields follow realistic distributions relative to one another — a customer's first transaction always comes after their account creation date, for example. You can specify date ranges in your description: "transactions between January 2023 and December 2024."
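That ordering constraint can also be enforced by construction rather than checked afterwards. A minimal sketch, with hypothetical column names, that offsets every transaction forward from its account's creation date:

```python
import numpy as np
import pandas as pd

# Sketch of temporally consistent timestamps: each transaction date
# is offset forward from its account's creation date, so the ordering
# constraint holds by construction. Column names are illustrative.
rng = np.random.default_rng(42)
start = pd.Timestamp("2023-01-01")

accounts = pd.DataFrame({
    "account_id": np.arange(100),
    "created_at": start + pd.to_timedelta(rng.integers(0, 365, size=100), unit="D"),
})

# Each transaction row references one account and inherits its creation date.
txns = accounts.sample(1_000, replace=True, random_state=42).reset_index(drop=True)
txns["transaction_date"] = txns["created_at"] + pd.to_timedelta(
    rng.integers(1, 400, size=len(txns)), unit="D"
)

# No transaction precedes its account's creation date.
print((txns["transaction_date"] > txns["created_at"]).all())
```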
What if I need data that follows my company's specific distribution?
The LLM path lets you describe distribution constraints in natural language: "30% of accounts are enterprise tier with balances above $50k." For highly specific requirements, the schema object exposes column-level distribution parameters you can override directly.
Misata is open source, MIT licensed, and available now.
GitHub: github.com/rasinmuhammed/misata
PyPI: pypi.org/project/misata
Docs: QUICKSTART.md
Colab: Run the quickstart notebook