DEV Community

Mindweave Technologies

AdventureWorks Is Dead. Here's a 42-Table Business Dataset That Actually Balances.

If you've ever needed realistic business data for testing, demos, or development, you've probably used one of these:

  • AdventureWorks: last updated 2014, SQL Server only, no real accounting
  • Northwind: last updated ~2000, 13 tables, no financial integrity
  • Faker/Mockaroo: random flat data with no relationships between tables

They all have the same problem: they don't reflect how a real business actually works.

A real business has sales that generate invoices, invoices that trigger payments, payments that hit the bank, and bank transactions that flow into double-entry journal entries. None of the above give you that.

So I built one that does.

What is sme-sim?

It's a day-by-day business simulator. You spin up a fake Australian retail company and let it operate for 2 financial years. Each simulated day, the company:

  • Receives and fulfils customer orders
  • Processes payments (some early, some late, some partial)
  • Runs fortnightly payroll with real tax calculations
  • Reorders inventory when stock drops below reorder points
  • Generates double-entry journal entries for every financial event
  • Lodges quarterly BAS (tax returns) with the ATO

After 2 years, you get 42 interconnected tables with 83,000+ rows and 44 foreign key relationships.
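
To make the day-by-day idea concrete, here is a minimal sketch of a simulation calendar. It is not the sme-sim engine: the function name, the event labels, and the start date of 1 July 2023 are assumptions made for illustration, and BAS is triggered on the last day of each quarter for simplicity, whereas real lodgement happens after the quarter closes.

```python
from datetime import date, timedelta

# Quarter-end dates in the Australian financial year (1 July to 30 June).
QUARTER_ENDS = {(9, 30), (12, 31), (3, 31), (6, 30)}


def simulate_calendar(start, end, payroll_every_days=14):
    """Yield (day, events) for every day from start to end inclusive.

    Event names are hypothetical labels, not sme-sim identifiers.
    """
    day = start
    while day <= end:
        # Things that happen every simulated day.
        events = ["orders", "payments", "inventory_check"]
        # Payroll runs on the last day of each 14-day block.
        if (day - start).days % payroll_every_days == payroll_every_days - 1:
            events.append("payroll")
        # BAS is prepared when a quarter closes (simplification).
        if (day.month, day.day) in QUARTER_ENDS:
            events.append("bas")
        yield day, events
        day += timedelta(days=1)


# Two financial years, assuming the simulation starts on 1 July 2023.
calendar = list(simulate_calendar(date(2023, 7, 1), date(2025, 6, 30)))
payroll_runs = sum("payroll" in events for _, events in calendar)
bas_periods = sum("bas" in events for _, events in calendar)
print(len(calendar), payroll_runs, bas_periods)
```

Each event label would map to a handler that writes rows into the relevant tables, which is what keeps the tables consistent with one another over time.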

What makes it different

1. End-to-end traceability

Every sale traces all the way through:

Customer → Sales Order → Sales Order Lines → Invoice → Payment
    → Bank Transaction → Journal Entry → Journal Entry Lines

You can pick any transaction and follow it across 8 tables. This is what real business data looks like.
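
As a rough illustration of what following one transaction looks like, the sketch below builds a trimmed-down version of the chain in an in-memory SQLite database and joins it end to end. The table and column names (`customer_id`, `sales_order_id`, `payment_id`, and so on) are assumptions, not the dataset's actual schema, and the two line-level tables are left out to keep it short.

```python
import sqlite3

# Hypothetical, simplified schema. Real sme-sim column names may differ.
SCHEMA = """
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE sales_orders (
    id INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customers(id));
CREATE TABLE invoices (
    id INTEGER PRIMARY KEY,
    sales_order_id INTEGER REFERENCES sales_orders(id),
    total REAL);
CREATE TABLE payments (
    id INTEGER PRIMARY KEY,
    invoice_id INTEGER REFERENCES invoices(id),
    amount REAL);
CREATE TABLE bank_transactions (
    id INTEGER PRIMARY KEY,
    payment_id INTEGER REFERENCES payments(id));
CREATE TABLE journal_entries (
    id INTEGER PRIMARY KEY,
    bank_transaction_id INTEGER REFERENCES bank_transactions(id));
"""

# One join per hop in the chain: customer -> order -> invoice -> payment
# -> bank transaction -> journal entry.
TRACE_SQL = """
SELECT c.name, so.id, i.id, p.id, bt.id, je.id
FROM customers c
JOIN sales_orders so ON so.customer_id = c.id
JOIN invoices i ON i.sales_order_id = so.id
JOIN payments p ON p.invoice_id = i.id
JOIN bank_transactions bt ON bt.payment_id = p.id
JOIN journal_entries je ON je.bank_transaction_id = bt.id
WHERE so.id = ?
"""


def trace_order(conn, order_id):
    """Return one row per complete path from customer to journal entry."""
    return conn.execute(TRACE_SQL, (order_id,)).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
# Toy data: a single sale that flows through every table.
conn.executescript("""
INSERT INTO customers VALUES (1, 'Example Customer');
INSERT INTO sales_orders VALUES (10, 1);
INSERT INTO invoices VALUES (100, 10, 110.0);
INSERT INTO payments VALUES (1000, 100, 110.0);
INSERT INTO bank_transactions VALUES (5000, 1000);
INSERT INTO journal_entries VALUES (9000, 5000);
""")
print(trace_order(conn, 10))
```

Because these are inner joins, an order with a missing link anywhere in the chain returns no rows, which makes the same query usable as a completeness check.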

2. Double-entry accounting that actually balances

Every financial event generates balanced journal entries. Debits always equal credits. Across 7,400+ entries, not a single one is unbalanced.

This matters because if you're testing accounting software, you need data where the books actually work. Random generators produce each value independently, so nothing forces the debits in an entry to match its credits.
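
A grand total can hide two errors that cancel out, so it is worth checking the balance entry by entry. The sketch below does that with SQLite. The `journal_lines` table and its `debit` and `credit` columns match the quick start further down; the `journal_entry_id` column name is an assumption, and the rows are toy data with one entry broken on purpose.

```python
import sqlite3


def unbalanced_entries(conn, tolerance=0.005):
    """Return (entry_id, difference) for entries whose debits != credits.

    The tolerance absorbs floating-point noise below half a cent.
    """
    return conn.execute(
        """
        SELECT journal_entry_id,
               ROUND(SUM(debit) - SUM(credit), 2) AS difference
        FROM journal_lines
        GROUP BY journal_entry_id
        HAVING ABS(SUM(debit) - SUM(credit)) > ?
        ORDER BY journal_entry_id
        """,
        (tolerance,),
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE journal_lines ("
    "journal_entry_id INTEGER, account TEXT, debit REAL, credit REAL)"
)
conn.executemany(
    "INSERT INTO journal_lines VALUES (?, ?, ?, ?)",
    [
        # Entry 1: a balanced sale of 100 plus 10 GST.
        (1, "Accounts Receivable", 110.0, 0.0),
        (1, "Sales Revenue", 0.0, 100.0),
        (1, "GST Collected", 0.0, 10.0),
        # Entry 2: deliberately unbalanced to show what the check catches.
        (2, "Bank", 50.0, 0.0),
        (2, "Accounts Receivable", 0.0, 49.0),
    ],
)
print(unbalanced_entries(conn))
```

Run against a correctly balanced ledger, the function should return an empty list.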

3. Real tax compliance

The dataset uses real ATO (Australian Taxation Office) 2024-25 rules:

  • PAYG withholding — actual tax brackets, not made-up percentages
  • Medicare levy — 2% on taxable income
  • Superannuation — 11.5% employer contribution
  • GST — 10% on all sales and purchases
  • Quarterly BAS — Business Activity Statements derived from the GL

Every payslip satisfies: Gross = Net + Tax. Every BAS return reconciles to the general ledger.
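
To show how those rates combine on a payslip, here is a simplified sketch. It annualises the salary, applies the 2024-25 resident brackets plus the 2% Medicare levy, and divides by 26 fortnights. This is an assumption about method, not how sme-sim or the ATO withholding schedules do it: the ATO publishes per-period withholding formulas, and the Medicare levy has low-income thresholds that are ignored here. All function names are hypothetical.

```python
from decimal import Decimal, ROUND_HALF_UP

# ATO resident rates for 2024-25 as (bracket lower bound, marginal rate),
# listed from the highest bracket down.
BRACKETS = [
    (Decimal("190000"), Decimal("0.45")),
    (Decimal("135000"), Decimal("0.37")),
    (Decimal("45000"), Decimal("0.30")),
    (Decimal("18200"), Decimal("0.16")),
]
MEDICARE_LEVY = Decimal("0.02")
SUPER_RATE = Decimal("0.115")  # employer contribution, paid on top of gross
PAY_PERIODS = 26               # fortnightly payroll
CENT = Decimal("0.01")


def annual_income_tax(income):
    """Marginal income tax on an annual income, excluding Medicare levy."""
    tax = Decimal("0")
    remaining = income
    for lower, rate in BRACKETS:
        if remaining > lower:
            tax += (remaining - lower) * rate
            remaining = lower
    return tax


def fortnightly_payslip(annual_salary):
    """Simplified payslip: annual amounts spread evenly over 26 periods."""
    gross = (annual_salary / PAY_PERIODS).quantize(CENT, rounding=ROUND_HALF_UP)
    annual_tax = annual_income_tax(annual_salary) + annual_salary * MEDICARE_LEVY
    tax = (annual_tax / PAY_PERIODS).quantize(CENT, rounding=ROUND_HALF_UP)
    # Net is derived from gross and tax, so Gross = Net + Tax holds exactly.
    net = gross - tax
    super_contribution = (gross * SUPER_RATE).quantize(CENT, rounding=ROUND_HALF_UP)
    return {"gross": gross, "tax": tax, "net": net, "super": super_contribution}


slip = fortnightly_payslip(Decimal("90000"))
print(slip)
```

Using `Decimal` and deriving net from gross minus tax is a deliberate choice: it means the payslip identity holds to the cent instead of drifting through floating-point rounding.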

4. Temporal realism

The simulation creates patterns you'd see in a real business:

  • Seasonal sales — camping equipment sells more in spring/summer
  • Staff turnover — employees get hired, promoted, and terminated
  • Late payments — some customers always pay late, others pay early
  • Inventory cycles — stock levels fluctuate with demand and lead times
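
One simple way to get a seasonal pattern like this is to scale a baseline order rate by a monthly multiplier and add some noise. The sketch below does that for a Southern Hemisphere retailer, where spring and summer run from September to February. The multipliers, the baseline of 10 orders, and the 20% noise are invented for illustration and are not sme-sim's actual parameters.

```python
import random
from datetime import date, timedelta

# Hypothetical demand multipliers by month (1 = January).
# Peak in the Australian spring/summer, trough in winter.
SEASONAL_MULTIPLIER = {
    1: 1.4, 2: 1.2, 3: 1.0, 4: 0.9, 5: 0.8, 6: 0.7,
    7: 0.7, 8: 0.9, 9: 1.2, 10: 1.3, 11: 1.4, 12: 1.6,
}


def orders_for_day(day, rng, base_orders=10.0):
    """Draw a non-negative order count around the seasonal expectation."""
    expected = base_orders * SEASONAL_MULTIPLIER[day.month]
    # Gaussian noise with a standard deviation of 20% of the expectation.
    return max(0, round(rng.gauss(expected, expected * 0.2)))


def monthly_total(year, month, rng):
    """Sum the simulated daily orders for one calendar month."""
    day = date(year, month, 1)
    total = 0
    while day.month == month:
        total += orders_for_day(day, rng)
        day += timedelta(days=1)
    return total


rng = random.Random(42)  # seeded, so the run is repeatable
december = monthly_total(2024, 12, rng)
july = monthly_total(2024, 7, rng)
print(december, july)
```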

Comparison

| Feature | AdventureWorks | Northwind | Faker | sme-sim |
|---|---|---|---|---|
| Tables | 71 | 13 | N/A | 42 |
| Cross-domain traceability | Partial | No | No | Full |
| Double-entry accounting | No | No | No | Yes |
| Tax compliance | US-only | None | None | AU + US |
| Temporal realism | Static | Static | Random | Simulated |
| FK relationships | Good | Basic | None | 44 enforced |
| Last updated | 2014 | ~2000 | N/A | 2025 |
| Deterministic | No | N/A | No | Yes (seeded RNG) |

Who is this for?

  • Developers building ERP, accounting, CRM, or HR software
  • QA teams testing complex workflows that span multiple modules
  • Consultants who need realistic demo data without exposing client data
  • Data engineers building ETL pipelines or data warehouses
  • Students studying business systems, accounting, or databases
  • AI/ML teams who need realistic training data for business intelligence models

Get the data

Browse all datasets → mindweave.tech/datasets

Free sample (~2,800 rows, 26 tables):

  • GitHub — clone and explore
  • Kaggle — download or use in notebooks

Full datasets: listed at mindweave.tech/datasets

Quick start

git clone https://github.com/MindweaveTech/sme-sim-sample.git
cd sme-sim-sample

# Load into SQLite
sqlite3 :memory: <<'SQL'
.mode csv
.import sales_orders_sample.csv sales_orders
.import journal_entry_lines_sample.csv journal_lines
.mode line
SELECT count(*) as total_orders FROM sales_orders;
SELECT 
  sum(debit) as total_debits, 
  sum(credit) as total_credits,
  round(sum(debit) - sum(credit), 2) as difference
FROM journal_lines;
SQL

Output:

total_orders = 200
total_debits = 1847234.56
total_credits = 1847234.56
difference = 0.0

Debits equal credits. Every time.

Technical details

  • Engine: Python 3.14, SQLAlchemy 2.x, Click CLI
  • Output formats: CSV, SQL (PostgreSQL), Parquet, SQLite
  • Deterministic: Same seed = identical output. Seed 42 always produces "Outback Outdoor Supplies Pty Ltd"
  • 12 domain modules: Company, Accounting, HR, Payroll, CRM, Sales, Purchasing, Inventory, Banking, Tax, Assets, Projects
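
If you want to confirm determinism yourself, one approach is to generate the same output twice and compare hashes. The sketch below shows the pattern with a toy generator. `generate_rows` is a hypothetical stand-in, not the sme-sim engine, so it says nothing about what seed 42 produces in the real tool.

```python
import csv
import hashlib
import io
import random


def generate_rows(seed, n=5):
    """Toy generator: (row id, customer id, amount) from a private RNG."""
    # A dedicated Random instance avoids depending on global RNG state.
    rng = random.Random(seed)
    return [
        (i, rng.randint(1, 500), round(rng.uniform(10, 1000), 2))
        for i in range(1, n + 1)
    ]


def csv_digest(rows):
    """Serialise rows to CSV in memory and return the SHA-256 hex digest."""
    buffer = io.StringIO()
    csv.writer(buffer).writerows(rows)
    return hashlib.sha256(buffer.getvalue().encode("utf-8")).hexdigest()


first_run = csv_digest(generate_rows(42))
second_run = csv_digest(generate_rows(42))
print(first_run == second_run)
```

The same comparison works on exported files: hash each CSV from two runs with the same seed and the digests should match byte for byte.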

Now available: US variant

Since launching the AU version, I've built a US compliance variant with:

  • IRS 2024 federal tax brackets + $14,600 standard deduction
  • FICA (Social Security 6.2% + Medicare 1.45%)
  • State sales tax (~7.5%)
  • Calendar-year fiscal year, LLC with EIN
  • US Chart of Accounts (GAAP-style)

Same 42-table structure, same referential integrity — just US-flavoured. Available as US Complete ($49) and US Multi-Company ($99).
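
For a feel of how the US numbers fit together, here is a simplified sketch of federal income tax and FICA on annual wages. It assumes a single filer (the $14,600 standard deduction is the 2024 single-filer amount) and the 2024 Social Security wage base of $168,600. It ignores the Additional Medicare Tax, state taxes, and the IRS per-period withholding tables, and the function names are hypothetical, not part of the dataset.

```python
from decimal import Decimal

STANDARD_DEDUCTION = Decimal("14600")  # 2024, single filer (assumed status)

# 2024 federal brackets for a single filer as (lower bound, marginal rate),
# listed from the highest bracket down.
BRACKETS_2024_SINGLE = [
    (Decimal("609350"), Decimal("0.37")),
    (Decimal("243725"), Decimal("0.35")),
    (Decimal("191950"), Decimal("0.32")),
    (Decimal("100525"), Decimal("0.24")),
    (Decimal("47150"), Decimal("0.22")),
    (Decimal("11600"), Decimal("0.12")),
    (Decimal("0"), Decimal("0.10")),
]

SOCIAL_SECURITY_RATE = Decimal("0.062")
SOCIAL_SECURITY_WAGE_BASE = Decimal("168600")  # 2024 cap on taxable wages
MEDICARE_RATE = Decimal("0.0145")


def federal_income_tax(wages):
    """Marginal federal income tax after the standard deduction."""
    remaining = max(Decimal("0"), wages - STANDARD_DEDUCTION)
    tax = Decimal("0")
    for lower, rate in BRACKETS_2024_SINGLE:
        if remaining > lower:
            tax += (remaining - lower) * rate
            remaining = lower
    return tax


def fica(wages):
    """Employee-side (Social Security, Medicare) amounts on annual wages."""
    social_security = min(wages, SOCIAL_SECURITY_WAGE_BASE) * SOCIAL_SECURITY_RATE
    medicare = wages * MEDICARE_RATE
    return social_security, medicare


wages = Decimal("60000")
print(federal_income_tax(wages), fica(wages))
```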

Formats

All datasets ship in 4 formats: CSV, SQL (PostgreSQL), Parquet, and SQLite. Load into whatever tool you use — pandas, DuckDB, dbt, Power BI, raw SQL.

What's next

  • UK variant (HMRC, PAYE, VAT, GBP)
  • More industry presets (restaurant, consulting, e-commerce)
  • Open-sourcing the simulation engine

Built by Mindweave Technologies. Browse all datasets → mindweave.tech/datasets

Feedback welcome: what domains or formats would be most useful for your workflow?
