The "2 Months of Pain" Origin Story
A year ago, I was working as a Data Science Engineer at a consultancy firm. We needed to build a Tableau dashboard to demonstrate a new business model. The consultants didn't want "random" data; they wanted a specific story:
"Show a _____ failing in Phase 2, causing a 40% revenue dip in Q3, followed by a recovery in Q4 due to a new ____ launch."
At first I tried standard libraries like Faker and Mimesis. They are fantastic for generating random names and emails, but they failed hard on business logic. Then I fell back on plain Python scripts, generating the data with for loops and every other kind of loop.
I ended up with:
- Time Travel Bugs: Timesheets dated before an employee's hire date.
- Orphaned Rows: Orders linked to non-existent Users.
- Flat Curves: Revenue that looked like static noise, not a "Q3 Dip."
I spent 2 months manually hacking Python scripts, hard-coding probabilities, and stitching CSVs together to make the demo look real. It was a nightmare.
I realized: We don't need more random data generators. We need Narrative Data Engines.
So, I built Misata.
What is Misata?
Misata is an open-source Python engine that turns a natural language story into a multi-table, referentially intact dataset.
Instead of writing 500 lines of schema config, you just type:
misata generate --story "A SaaS platform with 50k users, 20% churn in Q3, and usage-based billing" --use-llm
And it generates SQL-ready CSVs where the math actually works.
Under the Hood: The Architecture
Misata isn't just a wrapper around Faker. It uses a Neuro-Symbolic approach to solve the consistency problem.
- The Brain (LLM Parser)
First, it uses an LLM (I optimized it for Llama 3.3 via Groq) to parse your story into a strict JSON schema. It extracts:
Entities: Users, Subscriptions, Invoices.
Distributions: "20% churn" becomes a probability weight.
Relationships: "Invoices belong to Subscriptions."
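To make the parsing step concrete, here is a minimal sketch of what validating the LLM's reply could look like. The JSON layout, the field names (tables, relationships, distributions) and the helper parse_llm_output are assumptions made up for illustration, not Misata's actual schema or API.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Relationship:
    parent_table: str
    child_table: str


@dataclass
class StoryConfig:
    # Hypothetical config shape; Misata's real schema may differ.
    tables: list[str]
    relationships: list[Relationship] = field(default_factory=list)
    distributions: dict[str, float] = field(default_factory=dict)


def parse_llm_output(raw: str) -> StoryConfig:
    """Validate the LLM's JSON reply before any data is generated."""
    data = json.loads(raw)
    tables = list(data["tables"])
    rels = [Relationship(**r) for r in data.get("relationships", [])]
    for r in rels:
        if r.parent_table not in tables or r.child_table not in tables:
            raise ValueError(f"relationship references unknown table: {r}")
    dists = {k: float(v) for k, v in data.get("distributions", {}).items()}
    for name, p in dists.items():
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"{name} must be a probability in [0, 1], got {p}")
    return StoryConfig(tables, rels, dists)


# Example reply for the story "20% churn in Q3" (hand-written, not real LLM output).
raw_reply = """
{
  "tables": ["users", "subscriptions", "invoices"],
  "relationships": [
    {"parent_table": "users", "child_table": "subscriptions"},
    {"parent_table": "subscriptions", "child_table": "invoices"}
  ],
  "distributions": {"q3_churn_rate": 0.20}
}
"""
config = parse_llm_output(raw_reply)
```

Rejecting a bad reply at this stage is cheaper than discovering a dangling relationship after millions of rows have been written.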
- The Logic (Topological Sort)
To prevent "Orphaned Rows," Misata builds a Directed Acyclic Graph (DAG) of your tables. It uses Topological Sorting to ensure parent tables (e.g., Users) are generated before child tables (e.g., Orders).
Python
# Simplified logic from misata/simulator.py
from collections import defaultdict

def topological_sort(self):
    # Adjacency list: parent table -> list of child tables
    graph = defaultdict(list)
    # Number of parents each table is still waiting on
    in_degree = {table.name: 0 for table in self.config.tables}
    for rel in self.config.relationships:
        graph[rel.parent_table].append(rel.child_table)
        in_degree[rel.child_table] += 1
    # Standard Kahn's Algorithm...
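For readers who haven't met Kahn's algorithm, here is a standalone sketch of the full sort. It is not the code from misata/simulator.py: the function name topological_order and the plain (parent, child) tuples are assumptions chosen to keep the example self-contained.

```python
from collections import defaultdict, deque


def topological_order(tables, relationships):
    """Return tables ordered so every parent precedes its children.

    `relationships` is a list of (parent, child) pairs.
    """
    graph = defaultdict(list)
    in_degree = {name: 0 for name in tables}
    for parent, child in relationships:
        graph[parent].append(child)
        in_degree[child] += 1

    # Start with tables that have no parents at all.
    queue = deque(name for name in tables if in_degree[name] == 0)
    order = []
    while queue:
        current = queue.popleft()
        order.append(current)
        # "Remove" the current table and release children with no parents left.
        for child in graph[current]:
            in_degree[child] -= 1
            if in_degree[child] == 0:
                queue.append(child)

    # Any table left unvisited sits on a cycle, so no valid order exists.
    if len(order) != len(tables):
        raise ValueError("cycle detected between tables")
    return order


order = topological_order(
    ["invoices", "users", "subscriptions"],
    [("users", "subscriptions"), ("subscriptions", "invoices")],
)
print(order)  # ['users', 'subscriptions', 'invoices']
```

The length check at the end matters: a circular relationship would otherwise silently drop tables from the generation plan.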
- The Muscle (Vectorized NumPy)
The biggest bottleneck with Python data generation is looping. Generating 10 million rows one at a time in a Python loop is too slow.
Misata uses Vectorized Operations (via NumPy and Pandas) to generate data in blocks. This allows it to hit speeds of ~250,000 rows/second on a standard laptop.
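As a rough illustration of the vectorized style (not Misata's internals), the sketch below builds an entire users table with a handful of array operations instead of a per-row loop. The column names, plan weights, and the generate_users helper are all assumptions for the example.

```python
import numpy as np
import pandas as pd


def generate_users(n_rows: int, churn_rate: float, seed: int = 0) -> pd.DataFrame:
    """Build the whole table column by column; no Python-level row loop."""
    rng = np.random.default_rng(seed)
    # Days after 2024-01-01 on which each user signed up.
    signup_offsets = rng.integers(0, 365, size=n_rows)
    return pd.DataFrame({
        "user_id": np.arange(1, n_rows + 1),
        "signup_date": pd.Timestamp("2024-01-01")
        + pd.to_timedelta(signup_offsets, unit="D"),
        # Weighted categorical draw for the plan column.
        "plan": rng.choice(["free", "pro", "enterprise"], size=n_rows, p=[0.7, 0.25, 0.05]),
        # "20% churn" becomes a Bernoulli draw per user.
        "churned": rng.random(n_rows) < churn_rate,
    })


users = generate_users(50_000, churn_rate=0.20)
```

Each column is produced by a single NumPy call, so the per-row work runs in compiled code rather than in the Python interpreter.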
Features for Data Engineers
I built this to solve the specific pains I faced in consulting:
- Relational Integrity: It automatically maps Primary Keys to Foreign Keys. No more broken joins in SQL/Tableau.
- No "Time Travel": Child tables (like Timesheets) automatically look up their parent's Start Date to ensure events happen chronologically.
- Business Constraints: You can define rules like "Employees cannot log > 8 hours/day."
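Here is one way the three guarantees above could be enforced with pandas and NumPy. Treat it as a sketch under assumed table and column names (employees, timesheets, hire_date, work_date, hours), not as a description of how Misata implements them.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Parent table: 100 employees hired at some point during 2023.
employees = pd.DataFrame({
    "employee_id": np.arange(1, 101),
    "hire_date": pd.Timestamp("2023-01-01")
    + pd.to_timedelta(rng.integers(0, 365, size=100), unit="D"),
})

n_rows = 5_000
# Relational integrity: foreign keys are sampled only from existing primary keys.
timesheets = pd.DataFrame({
    "timesheet_id": np.arange(1, n_rows + 1),
    "employee_id": rng.choice(employees["employee_id"].to_numpy(), size=n_rows),
})

# No "time travel": look up each parent's hire date, then add a non-negative offset.
hire_dates = timesheets["employee_id"].map(
    employees.set_index("employee_id")["hire_date"]
)
timesheets["work_date"] = hire_dates + pd.to_timedelta(
    rng.integers(0, 180, size=n_rows), unit="D"
)

# Business constraint: draw hours from a normal distribution, then cap at 8 per row.
timesheets["hours"] = np.clip(rng.normal(7.0, 1.5, size=n_rows), 0.5, 8.0).round(2)
```

The design choice worth noting is that each rule is satisfied by construction (sample from valid keys, offset from the parent date, clip to the limit) rather than by generating freely and filtering out bad rows afterwards.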
Try it out
It's open source and available on PyPI.
pip install misata
Generate a dataset:
# Needs GROQ_API_KEY (free tier works great)
misata generate --story "E-commerce store with seasonal spikes in December" --use-llm
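Once you have output, it is worth sanity-checking it for the "Orphaned Rows" problem yourself. The helper below is a generic pandas check, not part of Misata; find_orphans and the users/orders columns are hypothetical, and in practice you would load your generated CSVs with pd.read_csv and substitute their real column names.

```python
import pandas as pd


def find_orphans(child: pd.DataFrame, parent: pd.DataFrame, fk: str, pk: str) -> pd.DataFrame:
    """Return child rows whose foreign key has no matching parent key."""
    return child[~child[fk].isin(parent[pk])]


# In-memory stand-ins for two generated tables.
users = pd.DataFrame({"id": [1, 2, 3]})
orders = pd.DataFrame({"order_id": [10, 11, 12], "user_id": [1, 3, 99]})

orphans = find_orphans(orders, users, fk="user_id", pk="id")
print(orphans)  # only order 12 (user_id 99) has no matching user
```

An empty result means every child row joins cleanly to a parent.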
Why I Open Sourced It
I know there are enterprise tools out there that cost $10k+/year. But for individual consultants, students, and indie hackers, there was no good "middle ground" between Faker and Enterprise Privacy tools.
I want Misata to be that middle ground.
I'm currently working on adding Curve Fitting (so you can draw a chart and get data that matches it). If you're into Data Engineering or Python optimization, I'd love your feedback on the architecture!
Repo: github.com/rasinmuhammed/misata
P.S. If you are a consultant stuck in "Demo Data Hell" right now and need a specific scenario generated, drop a comment or DM me. I'm looking for complex edge cases to stress-test the engine.