
Muhammed Rasin O M

How I Built a "Story-to-Data" Engine in Python (Because Faker Wasn't Enough)

The "2 Months of Pain" Origin Story

A year ago, I was working as a Data Science Engineer at a consultancy firm. We needed to build a Tableau dashboard to demonstrate a new business model. The consultants didn't want "random" data; they wanted a specific story:

"Show a _____ failing in Phase 2, causing a 40% revenue dip in Q3, followed by a recovery in Q4 due to a new ____ launch."

At first, I reached for standard libraries like Faker and Mimesis. They're fantastic for generating random names and emails, but they fall apart on business logic. So I fell back to plain Python scripting, generating the data with nested for loops.

I ended up with:

  • Time Travel Bugs: Timesheets dated before an employee's hire date.
  • Orphaned Rows: Orders linked to non-existent Users.
  • Flat Curves: Revenue that looked like static noise, not a "Q3 Dip."

I spent 2 months manually hacking Python scripts, hard-coding probabilities, and stitching CSVs together to make the demo look real. It was a nightmare.

I realized: We don't need more random data generators. We need Narrative Data Engines.

So, I built Misata.

What is Misata?
Misata is an open-source Python engine that turns a natural language story into a multi-table, referentially intact dataset.

Instead of writing 500 lines of schema config, you just type:

misata generate --story "A SaaS platform with 50k users, 20% churn in Q3, and usage-based billing" --use-llm

And it generates SQL-ready CSVs where the math actually works.
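For example, because child tables draw their foreign keys from the parent rows that were actually generated, a join sanity check on the output passes. Here's a minimal sketch, assuming the run above produced users.csv and subscriptions.csv with a user_id column (the file and column names are illustrative; yours depend on your story):

Python

import pandas as pd

# File and column names here are illustrative -- they depend on your story's entities.
users = pd.read_csv("users.csv")
subs = pd.read_csv("subscriptions.csv")

# Every subscription should reference a real user: no orphaned rows, no broken joins.
orphans = ~subs["user_id"].isin(users["user_id"])
assert orphans.sum() == 0, f"{orphans.sum()} subscriptions point at missing users"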

Under the Hood: The Architecture

Misata isn't just a wrapper around Faker. It uses a Neuro-Symbolic approach to solve the consistency problem.

  1. The Brain (LLM Parser)

First, it uses an LLM (I optimized it for Llama 3.3 via Groq) to parse your story into a strict JSON schema. It extracts:

Entities: Users, Subscriptions, Invoices.

Distributions: "20% churn" becomes a probability weight.

Relationships: "Invoices belong to Subscriptions."
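Conceptually, the parsed story boils down to something like this (shown here as a Python dict; the field names are my illustration, not Misata's exact internal format):

Python

# Illustrative only -- the real schema produced by the parser is richer than this.
schema = {
    "tables": [
        {"name": "users", "rows": 50_000},
        {"name": "subscriptions", "rows": 50_000},
        {"name": "invoices", "rows": 600_000},
    ],
    "distributions": {
        # "20% churn" becomes a probability weight on a status column
        "subscriptions.status": {"active": 0.80, "churned": 0.20},
    },
    "relationships": [
        {"parent_table": "users", "child_table": "subscriptions"},
        {"parent_table": "subscriptions", "child_table": "invoices"},
    ],
}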

  2. The Logic (Topological Sort)

To prevent "Orphaned Rows," Misata builds a Directed Acyclic Graph (DAG) of your tables. It uses Topological Sorting to ensure parent tables (e.g., Users) are generated before child tables (e.g., Orders).

Python

# Simplified logic from misata/simulator.py
from collections import defaultdict, deque

def topological_sort(self):
    graph = defaultdict(list)
    in_degree = {table.name: 0 for table in self.config.tables}
    for rel in self.config.relationships:
        graph[rel.parent_table].append(rel.child_table)
        in_degree[rel.child_table] += 1

    # Standard Kahn's algorithm: a table is emitted only after all its parents
    queue = deque(t for t, d in in_degree.items() if d == 0)
    order = []
    while queue:
        table = queue.popleft()
        order.append(table)
        for child in graph[table]:
            in_degree[child] -= 1
            if in_degree[child] == 0:
                queue.append(child)
    return order
  3. The Muscle (Vectorized NumPy)

The biggest bottleneck in Python data generation is row-by-row looping: generating 10 million rows one at a time is painfully slow.

Misata uses Vectorized Operations (via NumPy and Pandas) to generate data in blocks. This allows it to hit speeds of ~250,000 rows/second on a standard laptop.
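As a rough illustration of the difference, this is the style of block generation the engine relies on: one NumPy call per column instead of a Python loop per row (a simplified sketch of the technique, not Misata's actual code):

Python

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1_000_000  # one block of rows

# One vectorized call per column -- no Python-level loop over individual rows.
invoices = pd.DataFrame({
    "user_id": rng.integers(1, 50_001, size=n),
    "amount": rng.gamma(shape=2.0, scale=30.0, size=n).round(2),
    "status": rng.choice(["paid", "failed"], size=n, p=[0.95, 0.05]),
})

The row-by-row equivalent of the same logic is typically orders of magnitude slower, which is the difference between a demo dataset that generates in seconds and one that runs overnight.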

Features for Data Engineers

I built this to solve the specific pains I faced in consulting:

Relational Integrity: It automatically maps Primary Keys to Foreign Keys. No more broken joins in SQL/Tableau.

No "Time Travel": Child tables (like Timesheets) automatically look up their parent's Start Date to ensure events happen chronologically.

Business Constraints: You can define rules like "Employees cannot log > 8 hours/day."
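To make the last two guarantees concrete, here's a minimal sketch of the idea (my own simplified pandas illustration, not Misata's internal code; the table and column names are assumed):

Python

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

employees = pd.DataFrame({
    "employee_id": np.arange(1, 101),
    "hire_date": pd.to_datetime("2024-01-01")
                 + pd.to_timedelta(rng.integers(0, 180, size=100), unit="D"),
})

# Child rows draw their foreign keys from real parent IDs (no orphaned rows)...
timesheets = pd.DataFrame({
    "employee_id": rng.choice(employees["employee_id"].to_numpy(), size=1_000),
})

# ...and look up the parent's hire date so no entry predates it ("no time travel").
hire = timesheets["employee_id"].map(employees.set_index("employee_id")["hire_date"])
random_days = pd.to_timedelta(rng.integers(0, 365, size=1_000), unit="D")
timesheets["work_date"] = hire + random_days

# Business constraint: employees cannot log more than 8 hours per day.
timesheets["hours"] = np.clip(rng.normal(7, 2, size=1_000), 0.5, 8.0).round(1)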

Try it out
It's open source and available on PyPI.

pip install misata

Generate a dataset:

# Needs GROQ_API_KEY (free tier works great)
misata generate --story "E-commerce store with seasonal spikes in December" --use-llm

Why I Open Sourced It
I know there are enterprise tools out there that cost $10k+/year. But for individual consultants, students, and indie hackers, there was no good "middle ground" between Faker and Enterprise Privacy tools.

I want Misata to be that middle ground.

I'm currently working on adding Curve Fitting (so you can draw a chart and get data that matches it). If you're into Data Engineering or Python optimization, I'd love your feedback on the architecture!

Repo: github.com/rasinmuhammed/misata

P.S. If you are a consultant stuck in "Demo Data Hell" right now and need a specific scenario generated, drop a comment or DM me. I'm looking for complex edge cases to stress-test the engine.
