Misata: Generate Realistic Synthetic Datasets From Plain English Descriptions
Generating synthetic data in Python used to mean one of three things: writing random.uniform() loops by hand, using Faker for fake names and emails, or spending a week configuring SDV on top of real data you might not even have. LLMs are the newer fourth option, but maintaining the business logic and referential integrity is still a nightmare.
Misata is none of those things.
One sentence in. Multiple related tables out. Distributions calibrated to real-world statistics. Foreign key integrity guaranteed. Monthly revenue targets hit to the cent.
pip install misata
import misata
tables = misata.generate(
"A SaaS company with 2000 users. "
"MRR rises from 80k in January to 320k in June, "
"drops to 180k in August due to churn, "
"then recovers to 400k in December.",
seed=42,
)
That generates two linked tables with 21,000+ rows. Here is what the monthly MRR looks like when you sum the rows:
Jan $80,000 ✓
Feb $128,000 ✓
Mar $176,000 ✓
Apr $224,000 ✓
May $272,000 ✓
Jun $320,000 ✓
Jul $250,000 ✓
Aug $180,000 ✓ <- churn dip, as described
Sep $235,000 ✓
Oct $290,000 ✓
Nov $345,000 ✓
Dec $400,000 ✓
Every target exact. Not approximate. The individual rows still follow a log-normal distribution (median MRR $126, mean $150, p90 $291) because that is what real SaaS revenue looks like. But the monthly totals are pinned to whatever story you gave it.
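One way to reconcile row-level realism with exact aggregates (this is a conceptual sketch, not Misata's internal code) is to draw log-normal rows and then rescale each month so its total lands on the target. The targets, row counts, and distribution parameters below are illustrative:

```python
import numpy as np
import pandas as pd

# Sketch: keep individual rows log-normal, then rescale each month
# so its total hits the stated target. Parameters are illustrative.
rng = np.random.default_rng(42)
targets = {"Jan": 80_000.0, "Jun": 320_000.0, "Dec": 400_000.0}

frames = []
for month, target in targets.items():
    # Realistic per-subscription MRR values with a log-normal tail.
    draws = rng.lognormal(mean=np.log(126), sigma=0.75, size=800)
    draws *= target / draws.sum()                 # pin the monthly aggregate
    cents = np.round(draws, 2)
    cents[-1] += round(target - cents.sum(), 2)   # absorb the rounding residual
    frames.append(pd.DataFrame({"month": month, "mrr": cents}))

df = pd.concat(frames, ignore_index=True)
print(df.groupby("month", sort=False)["mrr"].sum())
```

The rescaling preserves the shape of the distribution (every row is multiplied by the same constant), which is why the individual rows can stay log-normal while the monthly sums are pinned.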
The core problem: why most synthetic data is useless
There's a gap between what synthetic data generators produce and what you actually need to build, test, or demo a data system.
Uniform distributions lie. Real revenue data is log-normal. Real fraud rates hover around 2%, not 50%. Real product category distributions follow Zipf's law - one category dominates, the others trail off. When your fake data looks nothing like the real thing, your model trains on lies, your dashboards tell wrong stories, and your tests pass cases that would fail in production.
Referential integrity breaks things. If you're testing a JOIN across customers and transactions, orphan foreign keys will silently ruin your results. Most data generators either skip relational structure entirely or produce it inconsistently.
Business targets get ignored. You don't just want data that looks roughly right. You want a dataset where Q3 revenue dips 22% due to a simulated product recall, or where churn spikes in August because your description says so. No general-purpose generator can do this.
Misata was built specifically to close this gap.
Why distributions matter more than people think
Most fake data generators produce values that are uniformly distributed. When you plot them, everything looks flat. Real business data is never flat.
Misata ships calibrated distribution priors for seven domains. Here is what that means in practice.
Fintech: fraud rates, credit scores, and account balances
tables = misata.generate(
"A fintech company with 2000 customers and banking transactions.",
seed=42,
)
transactions = tables["transactions"]
print(f"Fraud rate: {transactions['is_fraud'].mean() * 100:.2f}%")
Fraud rate: 2.00%
400 fraudulent transactions out of 20,000. The calibrated real-world baseline for card fraud is around 2%. That is what you get. Not a random number. A calibrated one.
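The fraud flag behaves like a Bernoulli draw at that baseline. A minimal sketch, with the 2% rate set by hand for illustration:

```python
import numpy as np

# Sketch: an is_fraud flag drawn as a Bernoulli variable at the
# ~2% card-fraud baseline described above (rate set by hand here).
rng = np.random.default_rng(42)
is_fraud = rng.random(20_000) < 0.02

rate = is_fraud.mean()
print(f"Fraud rate: {rate * 100:.2f}%")
```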
Credit scores follow the actual US distribution:
mean: 679 (real US average: 680-720)
std: 80 (real range: 70-90)
min: 328
max: 850
Account balances follow log-normal because real bank balances do:
median $1,976
mean $6,128
p90 $14,260
p99 $62,565
Most customers have under two thousand dollars. A few have tens of thousands. The tail is real. This matters enormously if you're building fraud detection models, credit scoring pipelines, or stress-testing payment infrastructure — a flat distribution would make every one of those models overfit to a distribution that doesn't exist in production.
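The shape of that tail is easy to see in a few lines. This sketch uses illustrative mu/sigma values, not Misata's calibrated priors:

```python
import numpy as np

# Sketch of the log-normal shape described above.
rng = np.random.default_rng(42)
balances = rng.lognormal(mean=np.log(2000), sigma=1.2, size=100_000)

median = np.median(balances)        # the "typical" customer
mean = balances.mean()              # dragged upward by the tail
p90 = np.percentile(balances, 90)

# A heavy right tail pulls the mean well above the median.
assert median < mean < p90
print(f"median ${median:,.0f}  mean ${mean:,.0f}  p90 ${p90:,.0f}")
```

A uniform draw over the same range would put the mean and median on top of each other, which is exactly the giveaway that data is fake.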
Healthcare: blood type frequencies, age distributions, and appointment patterns
tables = misata.generate("A hospital with 500 patients and doctors.", seed=42)
patients = tables["patients"]
Blood type Generated Real-world
O+ 37.9% 38.0% ✓
A+ 33.9% 34.0% ✓
B+ 9.6% 9.0% ✓
AB+ 3.0% 3.0% ✓
O- 6.5% 7.0% ✓
A- 6.1% 6.0% ✓
B- 2.0% 2.0% ✓
AB- 0.9% 1.0% ✓
All eight blood types within 0.6% of the actual ABO/Rh frequency distribution. Patient ages center on 45 with a standard deviation of 18, matching a chronic-care hospital population. Nobody configured any of this. It is what the healthcare domain prior knows.
This level of epidemiological accuracy is essential when you're training triage models, testing EHR systems, or building health analytics pipelines that will eventually run on real patient data.
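A domain prior of this kind boils down to categorical sampling against published frequencies. This sketch uses the US ABO/Rh figures directly; Misata's healthcare prior works in the same spirit:

```python
import numpy as np

# Sketch: categorical sampling against published US ABO/Rh frequencies.
blood_types = ["O+", "A+", "B+", "AB+", "O-", "A-", "B-", "AB-"]
freqs = [0.38, 0.34, 0.09, 0.03, 0.07, 0.06, 0.02, 0.01]

rng = np.random.default_rng(42)
sample = rng.choice(blood_types, size=50_000, p=freqs)

# Empirical shares converge on the target frequencies.
for bt, p in zip(blood_types, freqs):
    print(f"{bt:>3} {(sample == bt).mean():6.1%} (target {p:.1%})")
```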
Ecommerce: Zipf categories, seasonal peaks, and return rates
schema = misata.parse(
"An ecommerce store with 5000 customers and orders. "
"Revenue grows from 100k in January to 300k in November "
"then 350k in December.",
rows=5000,
)
tables = misata.generate_from_schema(schema)
Product categories follow Zipf's law because that is how real shopping behavior works:
electronics 47.1%
clothing 20.0%
home & garden 12.3%
sports 8.7%
books 6.5%
beauty 5.5%
One category dominates. The rest trail off. Uniform would give you ~17% each. Real shopping does not look like that.
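A Zipf weighting is simple to sketch: weight each category by the inverse of its rank and normalize. The exponent (s = 1) and category list below are illustrative, not Misata's calibrated priors:

```python
import numpy as np

# Sketch of Zipf-weighted category sampling: weight ∝ 1/rank.
categories = ["electronics", "clothing", "home & garden",
              "sports", "books", "beauty"]
ranks = np.arange(1, len(categories) + 1)
probs = (1.0 / ranks) / (1.0 / ranks).sum()

rng = np.random.default_rng(42)
orders = rng.choice(categories, size=20_000, p=probs)

# Rank 1 dominates; shares decay monotonically down the rank order.
shares = [(orders == c).mean() for c in categories]
for c, s in zip(categories, shares):
    print(f"{c:<14} {s:6.1%}")
```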
Order statuses come with realistic rates:
completed 71.5%
shipped 12.4%
pending 8.2%
returned 5.0%
cancelled 3.0%
Real e-commerce return-plus-cancellation rates run around 8–10%, and the generated returned (5.0%) and cancelled (3.0%) shares land at exactly 8%. If you're building a returns processing pipeline, this means your test data will actually stress the right code paths.
Referential integrity across all tables
Every child table samples foreign key values from the actual parent pool. This means zero orphan rows by construction, not by luck.
tables = misata.generate(
"A fintech company with 2000 customers and banking transactions.",
seed=42,
)
customers = tables["customers"] # 2,000 rows
accounts = tables["accounts"] # 2,600 rows
transactions = tables["transactions"] # 20,000 rows
# Both FK edges hold
orphan_accounts = (~accounts["customer_id"].isin(customers["customer_id"])).sum()
orphan_txns = (~transactions["account_id"].isin(accounts["account_id"])).sum()
print(orphan_accounts) # 0
print(orphan_txns) # 0
Tables are generated in topological dependency order. Parents first. Children sample from the completed parent pool. It cannot produce orphans.
This matters for any workflow that involves JOINs. Referential integrity errors in test data produce false negatives — your pipeline looks like it works until it meets real data.
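The construction is easy to sketch outside the library: build parents first, then let every child sample its foreign keys from the finished parent pool. The table names and row counts below mirror the example above but are otherwise illustrative:

```python
import numpy as np
import pandas as pd

# Sketch of generation in dependency order: parents first, children
# sample FKs from the completed parent pool, so orphans cannot occur.
rng = np.random.default_rng(42)

customers = pd.DataFrame({"customer_id": np.arange(1, 2001)})
accounts = pd.DataFrame({
    "account_id": np.arange(1, 2601),
    # FK drawn only from IDs that already exist in the parent table.
    "customer_id": rng.choice(customers["customer_id"], size=2600),
})
transactions = pd.DataFrame({
    "txn_id": np.arange(1, 20_001),
    "account_id": rng.choice(accounts["account_id"], size=20_000),
})

orphan_accounts = (~accounts["customer_id"].isin(customers["customer_id"])).sum()
orphan_txns = (~transactions["account_id"].isin(accounts["account_id"])).sum()
print(orphan_accounts, orphan_txns)
```

Because the child's FK column is sampled from the parent's key column, a violating row is unrepresentable, which is a stronger property than validating after the fact.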
The two-step flow for more control
When you want to inspect or modify the schema before committing to generation:
schema = misata.parse("A hospital with 500 patients and doctors.")
print(schema.summary())
Schema: Healthcare Dataset
Domain: healthcare
Tables (3)
doctors 25 rows [doctor_id, first_name, last_name, specialty, years_experience]
patients 500 rows [patient_id, name, age, gender, blood_type, registered_at]
appointments 1500 rows [appointment_id, patient_id, doctor_id, type, duration_minutes]
Relationships (2)
patients.patient_id -> appointments.patient_id
doctors.doctor_id -> appointments.doctor_id
Adjust the seed, add columns, change row counts. Then generate. The two-step flow is useful for teams where a data engineer defines the schema and a developer generates data against it — the schema becomes a shared artifact you can version control.
Real-world use cases
Use case 1: Training ML models without access to production data
Privacy regulations — GDPR, HIPAA, CCPA — make it difficult or impossible to use real user data for model training in many industries. The usual workaround is anonymization, but anonymized data often loses the statistical properties that make it useful for training.
Misata generates statistically calibrated data with no PII at all. A fraud detection team can produce 500,000 transactions with a realistic 2% fraud rate, a plausible credit score distribution, and calibrated account balance tails — without touching a single real customer record.
tables = misata.generate(
"A fintech company with 50000 customers and banking transactions. "
"Fraud rate is 2%. High-value accounts above 50k balance are 3x more likely to be targeted.",
seed=42,
)
The model trains on data that behaves like production data, and because no real records are involved, there is nothing to re-identify.
Use case 2: Seeding development and staging databases
Every new developer joining a product team hits the same wall: the development database is empty or has three test rows from 2019. You can't build features that depend on realistic data patterns without realistic data.
Misata can seed a full development database in seconds:
import misata
from misata import seed_database
tables = misata.generate("A SaaS company with 1000 users.", seed=42)
report = seed_database(tables, "postgresql://user:pass@localhost/mydb", create=True)
print(report.total_rows) # 12,400
Or from the CLI, which makes it easy to add to a Makefile or docker-compose setup:
misata generate \
--story "A SaaS company with 1000 users" \
--db-url postgresql://user:pass@localhost/mydb \
--db-create --db-truncate
SQLite works too for local-only development:
misata generate \
--story "A SaaS company with 1000 users" \
--db-url sqlite:///./dev.db \
--db-create --db-truncate
A new developer can run make seed-db and have a working dataset in their environment in under 10 seconds.
Use case 3: Building product demos without real customer data
Sales engineering teams routinely need to demo analytics dashboards, CRM systems, and data products to prospects. Using real customer data for demos is a legal and ethical non-starter. Using hand-crafted fake data means someone spends two days building a CSV in Excel.
Misata lets you generate a compelling, internally consistent demo dataset for any domain:
tables = misata.generate(
"A B2B SaaS company with 800 enterprise customers. "
"ARR grows from 2M in Q1 to 5M in Q4. "
"Average contract value is 6000. Churn rate is 8%.",
seed=42,
)
The result is a dataset where every KPI in the demo dashboard reflects a plausible business trajectory — not a random scatter of numbers.
Use case 4: Testing data pipelines and ETL systems
Data pipeline tests are only as good as the data they run on. Edge cases like NULL foreign keys, skewed distributions, and outlier values are exactly what break pipelines in production — and exactly what hand-crafted test data tends to miss.
Misata's calibrated distributions naturally produce the tail values that stress-test pipelines:
tables = misata.generate(
"A logistics company with 10000 shipments. "
"Include delayed deliveries at a 12% rate. "
"International shipments are 30% of total volume.",
seed=42,
)
The p99 values in account balances, the occasional NULL in optional fields, the rare blood type AB- at 1% frequency — these are the values that reveal pipeline brittleness.
Use case 5: Generating benchmark datasets for academic and research use
Researchers publishing papers on data systems, query optimizers, or ML benchmarks need datasets that are reproducible, realistic, and free of privacy concerns. Misata's seed parameter makes generation fully deterministic:
tables = misata.generate(
"A marketplace with 5000 buyers and sellers, orders, and product listings.",
seed=42, # Anyone running this gets the exact same dataset
)
Share the seed and description in your paper. Readers can reproduce your exact dataset with a single Python call.
Use case 6: Prototyping data products and BI dashboards
Before you connect a BI tool to production data, you need something to build against. Misata gives you a structurally correct, statistically plausible dataset to prototype on — so you can validate your data model, build your first dashboard, and demo your schema to stakeholders before a single production row exists.
LLM-powered generation for custom domains
The rule-based parser covers SaaS, ecommerce, fintech, healthcare, marketplace, logistics, and pharma. For anything outside those domains, the LLM backend handles arbitrary schema generation:
from misata import LLMSchemaGenerator
gen = LLMSchemaGenerator(provider="groq") # or openai, ollama
schema = gen.generate_from_story(
"A B2B marketplace with vendor tiers, SLA contracts, and quarterly invoices"
)
tables = misata.generate_from_schema(schema)
This works with any LLM provider that supports the OpenAI-compatible API format. Requires GROQ_API_KEY or OPENAI_API_KEY. Retries automatically on rate limits.
The LLM path is useful for:
- Industry-specific schemas with unusual entities (clinical trials, commodity trading, fleet management)
- Multi-tenant SaaS with complex permission hierarchies
- Any domain where the rule-based parser doesn't have calibrated priors
The LLM infers a reasonable schema, column types, and row count ratios from your description. You get back the same DataFrames as the rule-based path — just with the schema derived from a language model instead of hard-coded priors.
How it compares to Faker and SDV
Faker generates individual fake values. One row at a time. It has no concept of tables that reference each other and no domain-specific distributions. Wiring foreign keys and getting log-normal amounts is your job.
SDV (Synthetic Data Vault) learns patterns from real data and generates synthetic copies. It requires actual training data, pulls in heavy ML dependencies, and cannot pin specific business targets like "fraud rate must be 2%." If you don't have real data to train on, SDV is a dead end.
Misata generates from a description. No real data required. No ML training. Distributions are calibrated to domain knowledge. Business targets are exact.
| Capability | Faker | SDV | Misata |
|---|---|---|---|
| Multi-table FK integrity | No | Partial | Yes |
| No real data needed | Yes | No | Yes |
| Calibrated domain distributions | No | Learned | Yes |
| Exact monthly aggregate targets | No | No | Yes |
| Plain-English story input | No | No | Yes |
| Database seeding | Manual | No | Yes |
| LLM-powered custom domains | No | No | Yes |
| Reproducible with seed | Yes | No | Yes |
The key distinction: SDV is a synthetic data replication tool. Misata is a synthetic data generation tool. They solve different problems. SDV needs real data to learn from. Misata generates from scratch.
Installation and quick start
pip install misata pandas numpy
All of these produce full verified output in under 3 seconds:
python examples/saas_revenue_curve.py
python examples/fintech_fraud_detection.py
python examples/healthcare_multi_table.py
python examples/ecommerce_seasonal.py
Or open the Colab notebook and run it without installing anything. No signup, no API key, no configuration.
Design principles
A few constraints Misata holds to that are worth understanding:
Determinism over randomness. Given the same description and seed, you always get the same dataset. This is non-negotiable for reproducible research and CI pipelines where test data needs to be stable across runs.
Statistical realism over convenience. It would be simpler to generate uniformly distributed values. Misata does not do this because uniform distributions produce data that behaves nothing like real data. The extra calibration work is the point.
Aggregate targets are constraints, not approximations. When you say MRR should be $320k in June, the generated data will sum to exactly $320k in June. Not $318k. Not $322k. The individual rows remain statistically realistic while the aggregates are treated as hard constraints.
Referential integrity is structural, not checked. Misata does not generate data and then validate foreign keys. It generates in dependency order so invalid keys cannot occur. This is a stronger guarantee than post-hoc validation.
Frequently asked questions
Can I add custom columns to a generated schema?
Yes. The two-step parse → generate_from_schema flow lets you inspect and modify the schema object before generating. You can add columns, change data types, adjust row counts, and modify relationship cardinality.
How large can generated datasets be?
Misata is DataFrame-based, so the practical limit is your available RAM. For datasets larger than a few million rows, you can generate in chunks and write directly to a database using seed_database. Benchmarks on a standard laptop show ~500k rows/second for most schemas.
Does it support databases other than PostgreSQL and SQLite?
seed_database accepts any SQLAlchemy connection string, which covers PostgreSQL, MySQL, SQLite, MS SQL Server, Oracle, and others. If SQLAlchemy can connect to it, Misata can seed it.
Is there a way to generate time-series data?
Temporal columns are supported. The registered_at, transaction_date, and similar timestamp fields follow realistic distributions relative to one another — a customer's first transaction always comes after their account creation date, for example. You can specify date ranges in your description: "transactions between January 2023 and December 2024."
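That ordering constraint can also be enforced by construction rather than checked afterwards. A minimal sketch, with hypothetical column names, that offsets every transaction forward from its account's creation date:

```python
import numpy as np
import pandas as pd

# Sketch of temporally consistent timestamps: each transaction date
# is offset forward from its account's creation date, so the ordering
# constraint holds by construction. Column names are illustrative.
rng = np.random.default_rng(42)
start = pd.Timestamp("2023-01-01")

accounts = pd.DataFrame({
    "account_id": np.arange(100),
    "created_at": start + pd.to_timedelta(rng.integers(0, 365, size=100), unit="D"),
})

# Each transaction row references one account and inherits its creation date.
txns = accounts.sample(1_000, replace=True, random_state=42).reset_index(drop=True)
txns["transaction_date"] = txns["created_at"] + pd.to_timedelta(
    rng.integers(1, 400, size=len(txns)), unit="D"
)

# No transaction precedes its account's creation date.
print((txns["transaction_date"] > txns["created_at"]).all())
```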
What if I need data that follows my company's specific distribution?
The LLM path lets you describe distribution constraints in natural language: "30% of accounts are enterprise tier with balances above $50k." For highly specific requirements, the schema object exposes column-level distribution parameters you can override directly.
Misata is open source, MIT licensed, and available now.
GitHub: github.com/rasinmuhammed/misata
PyPI: pypi.org/project/misata
Docs: QUICKSTART.md
Colab: Run the quickstart notebook