Sovereign Forger

Posted on Apr 3 • Originally published at sovereignforger.com

Why I Built a 3,200-Line Python Pipeline to Generate Synthetic Financial Data From Math -- Not AI

#machinelearning #datascience #python #fintech

Every synthetic data tool I evaluated had the same architecture: take real data as input, learn the distributions, generate new records that look similar.

That works fine -- until you don't have real data to feed it.

I was building synthetic UHNWI (ultra-high-net-worth individual) profiles for compliance testing. The problem: nobody gives you access to real UHNWI client data. Not banks, not wealth managers, not family offices. That's the whole point of privacy regulations.

So every platform that requires real data as input -- Mostly AI, Gretel, Tonic, Synthesized -- was structurally useless for my use case.

I had to build something different.

The Architecture: Math First, AI Second

The pipeline is ~3,200 lines of Python. It has three sequential stages, and the order matters:

Stage 1: Math First (Zero AI)

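Extreme wealth is well approximated by a Pareto distribution. Not because I like the math -- because that's how extreme wealth actually distributes. The top 1% holds more than the bottom 50%. Pareto captures this with a single shape parameter (alpha) on top of a minimum-wealth floor: the smaller the alpha, the heavier the tail.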
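
To make the Pareto idea concrete, here is a minimal sketch of inverse-transform sampling. The function name, the alpha of 1.5, and the $30M floor are assumptions chosen for illustration; they are not the pipeline's actual settings.

```python
import random


def sample_pareto_net_worth(alpha: float, floor_usd: float, rng: random.Random) -> float:
    """Draw one net worth from a Pareto distribution by inverse-transform sampling.

    If U is uniform on [0, 1), then floor_usd * (1 - U) ** (-1 / alpha) is
    Pareto distributed with shape alpha and minimum value floor_usd.
    """
    u = rng.random()  # uniform on [0.0, 1.0), so 1 - u is never zero
    return floor_usd * (1.0 - u) ** (-1.0 / alpha)


rng = random.Random(42)  # fixed seed so the run is repeatable
# alpha=1.5 and the $30M floor are illustrative values only
draws = [sample_pareto_net_worth(1.5, 30_000_000, rng) for _ in range(10_000)]

print(min(draws) >= 30_000_000)  # every draw sits at or above the floor
print(max(draws) / min(draws))   # heavy tail: the largest draw dwarfs the smallest
```

Lowering alpha stretches the tail further, which is the knob that separates "merely rich" from "dynastic" in a sketch like this.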

But here's the key constraint:

Assets - Liabilities = Net Worth

This is enforced algebraically. Not approximated. Not post-hoc adjusted. Every single record satisfies this equation by construction.

The same applies to asset decomposition:

Property + Core Equity + Cash Liquidity = Total Assets

Again: enforced, not fitted. 1,332,000 records generated. Zero balance-sheet errors.

Why does this matter? Because most synthetic data pipelines generate fields independently. Net worth in one column, assets in another, no constraint between them. The result looks plausible in isolation but is mathematically incoherent. Train a financial model on that, and you're training on impossible scenarios.
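
Here is one way "enforced by construction" could look in code. This is a sketch, not the pipeline's implementation: the field names for the three asset buckets, the leverage range, and the allocation ranges are all assumptions. The point is that total assets and the last asset bucket are computed rather than sampled, and amounts are integer cents so there is no float rounding.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class BalanceSheet:
    net_worth: int          # all amounts in whole cents to avoid float rounding
    total_liabilities: int
    total_assets: int
    property_value: int
    core_equity: int
    cash_liquidity: int


def build_balance_sheet(net_worth: int, rng: random.Random) -> BalanceSheet:
    """Derive every dependent field from net worth so both identities hold exactly."""
    # Hypothetical leverage rule: liabilities are 0-40% of net worth.
    total_liabilities = int(net_worth * rng.uniform(0.0, 0.4))
    # Assets are computed, never sampled, so assets - liabilities == net worth.
    total_assets = net_worth + total_liabilities

    # Hypothetical allocation ranges; the last bucket takes the remainder,
    # so the three parts always sum to total_assets to the cent.
    property_value = int(total_assets * rng.uniform(0.2, 0.4))
    core_equity = int(total_assets * rng.uniform(0.3, 0.5))
    cash_liquidity = total_assets - property_value - core_equity

    return BalanceSheet(net_worth, total_liabilities, total_assets,
                        property_value, core_equity, cash_liquidity)


rng = random.Random(7)
sheet = build_balance_sheet(50_000_000_000, rng)  # $500M expressed in cents
print(sheet.total_assets - sheet.total_liabilities == sheet.net_worth)
print(sheet.property_value + sheet.core_equity + sheet.cash_liquidity == sheet.total_assets)
```

Because no dependent field is drawn independently, there is nothing to reconcile afterwards.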

Stage 2: AI Enrichment (Local, Offline)

A local LLM (running entirely offline -- no API calls, no cloud) generates narrative fields: biography, profession, philanthropic focus, asset composition descriptions.

The critical constraint: the AI never touches the numbers. Financial values come from Stage 1, period. The AI adds the human texture that makes the profiles feel realistic -- but it operates within boundaries set by math.

Why offline? Because the entire selling proposition is "your data never leaves local hardware." If I'm generating profiles for compliance teams worried about data security, sending those profiles to an API endpoint defeats the purpose.
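
The "AI never touches the numbers" rule can be sketched as a whitelist merge. Everything here is hypothetical: `fake_local_llm` is a stand-in for whatever offline model call is used, and the field names are assumptions. The stub deliberately tries to overwrite `net_worth` to show that the merge ignores it.

```python
# Only these keys may be written by the narrative stage (names are assumptions).
NARRATIVE_FIELDS = frozenset({"biography", "profession", "philanthropic_focus"})


def fake_local_llm(prompt: str) -> dict:
    """Stand-in for an offline model call; it even tries to overwrite a number."""
    return {
        "biography": "Founder of a logistics software company.",
        "profession": "Technology entrepreneur",
        "net_worth": 1,  # misbehaving model output that must be ignored
    }


def enrich(profile: dict, generate=fake_local_llm) -> dict:
    """Return a new profile with narrative fields added and numbers untouched."""
    prompt = f"Write a short biography for someone with net worth {profile['net_worth']}."
    generated = generate(prompt)
    # Whitelist merge: only narrative keys are copied from the model output.
    narrative = {k: v for k, v in generated.items() if k in NARRATIVE_FIELDS}
    return {**profile, **narrative}


base = {"net_worth": 500, "total_assets": 600, "total_liabilities": 100}
enriched = enrich(base)
print(enriched["net_worth"])   # still the Stage 1 value
print(enriched["profession"])
```

The model can read the numbers to stay consistent with them, but its output is filtered before it is merged back.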

Stage 3: FORGE Mode (Zero AI)

For customers who want maximum auditability, there's a mode that bypasses AI entirely. Every field is deterministically derived from the profile UUID using SHA-256 hashing. Same UUID, same profile, every time. Fully reproducible, fully auditable.
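
A minimal sketch of hash-based derivation, assuming each field is keyed by the UUID plus the field name. The helper names, the profession list, and the birth-year range are made up for the example; only the use of SHA-256 over the profile UUID comes from the description above.

```python
import hashlib
import uuid

PROFESSIONS = ["Founder", "Investor", "Shipping executive", "Private banker"]  # hypothetical


def derive_int(profile_id: uuid.UUID, field: str, modulus: int) -> int:
    """Map (UUID, field name) to a stable integer in [0, modulus)."""
    digest = hashlib.sha256(f"{profile_id}:{field}".encode("utf-8")).digest()
    # Interpret the first 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:8], "big") % modulus


def forge_profile(profile_id: uuid.UUID) -> dict:
    """Build a profile whose fields depend only on the UUID."""
    return {
        "profile_id": str(profile_id),
        "profession": PROFESSIONS[derive_int(profile_id, "profession", len(PROFESSIONS))],
        "birth_year": 1940 + derive_int(profile_id, "birth_year", 50),
    }


pid = uuid.UUID("12345678-1234-5678-1234-567812345678")
print(forge_profile(pid) == forge_profile(pid))  # same UUID, same profile
```

Including the field name in the hashed string keeps fields from moving in lockstep with each other. The modulo step introduces a tiny bias for moduli that do not divide 2**64, which is negligible at these sizes.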

The Balance Sheet Test

Here's a simple quality check you can run on any synthetic financial dataset:

For every record, compute total_assets - total_liabilities
Compare with the stated net_worth
Check if they match
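
The three steps above fit in a few lines. This sketch assumes each record is a dict with `total_assets`, `total_liabilities`, and `net_worth` keys; the `tolerance` parameter is an addition for datasets that store rounded floats.

```python
def balance_sheet_failures(records, tolerance=0):
    """Return the records where total_assets - total_liabilities != net_worth."""
    failures = []
    for record in records:
        implied = record["total_assets"] - record["total_liabilities"]
        if abs(implied - record["net_worth"]) > tolerance:
            failures.append(record)
    return failures


records = [
    {"total_assets": 600, "total_liabilities": 100, "net_worth": 500},  # consistent
    {"total_assets": 600, "total_liabilities": 100, "net_worth": 550},  # inconsistent
]
failures = balance_sheet_failures(records)
print(f"{len(failures)} of {len(records)} records fail")
```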

Most datasets fail this test. Not on every record -- but on enough records to matter. A 2% error rate on 10,000 records means 200 financially impossible profiles. Would you trust a model trained on 200 impossible scenarios?

Our pass rate across 1.33 million records: 100%.

This isn't a flex -- it's an architectural consequence. When you enforce constraints during generation (not after), errors are structurally impossible.

Why 31 Archetypes, Not 5 Personas

Generic synthetic data uses personas: "young professional," "retiree," "small business owner." That works for retail banking.

UHNWI wealth doesn't work that way. A Silicon Valley founder with $500M in pre-IPO equity has a completely different wealth structure than an Old Money European with $500M in real estate and art. Different jurisdictions, different offshore structures, different KYC risk profiles, different philanthropy patterns.

The pipeline uses 31 wealth archetypes across 6 geographic niches:

Silicon Valley -- Tech founders, VCs, serial entrepreneurs
Old Money Europe -- Dynastic wealth, private banking families
Middle East -- Sovereign families, merchant houses
LatAm Barons -- Agribusiness, infrastructure, mining
Pacific Rim -- Semiconductor, shipping dynasties
Swiss-Singapore -- Offshore wealth, multi-family offices

Each archetype drives different Pareto parameters, different asset allocations, different offshore jurisdiction probabilities. The result: profiles that compliance officers recognize as structurally realistic, not just numerically plausible.
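
One way to picture "each archetype drives different parameters" is a small config table. This is a sketch with two entries; the keys, class layout, and every number in it are invented for illustration and do not come from the real pipeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Archetype:
    niche: str
    name: str
    pareto_alpha: float          # tail heaviness of the wealth distribution
    allocation: dict             # target share of total assets per bucket
    offshore_probability: float  # chance the profile uses an offshore structure


# Two illustrative entries; every number here is made up for the example.
ARCHETYPES = {
    "sv_tech_founder": Archetype(
        "Silicon Valley", "Tech founder", 1.3,
        {"property": 0.15, "core_equity": 0.75, "cash_liquidity": 0.10}, 0.20),
    "eu_dynastic": Archetype(
        "Old Money Europe", "Dynastic wealth", 1.8,
        {"property": 0.55, "core_equity": 0.30, "cash_liquidity": 0.15}, 0.45),
}


def validate(archetype: Archetype) -> None:
    """Reject configs whose shares do not sum to 1 or whose probability is out of range."""
    total = sum(archetype.allocation.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError(f"{archetype.name}: allocation sums to {total}")
    if not 0.0 <= archetype.offshore_probability <= 1.0:
        raise ValueError(f"{archetype.name}: offshore_probability out of range")


for archetype in ARCHETYPES.values():
    validate(archetype)
print(f"{len(ARCHETYPES)} archetypes validated")
```

Keeping the parameters in data rather than in branching code makes it easy to review an archetype with a domain expert.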

The KYC Layer: 29 Fields

The enhanced dataset adds 10 KYC/AML fields on top of the 19 base fields. Among them:

kyc_risk_rating -- Derived from archetype + jurisdiction (not random)
pep_status -- None, domestic, foreign, international organization
pep_position and pep_jurisdiction -- Correlated with the profile's geographic niche
sanctions_screening_result -- Clear, potential match, or confirmed match
sanctions_match_confidence -- 0-100 score
adverse_media_flag -- Boolean
source_of_wealth_verified -- Boolean with verification method
high_risk_jurisdiction_flag -- Derived from offshore structure

These fields aren't random. They're deterministically derived from the profile's attributes. A Middle East sovereign family member with BVI structures gets a different risk profile than a Swiss private banker. That's the point -- the KYC layer reflects real-world assessment patterns.
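
A rule-based derivation might look like the sketch below. The input key `offshore_jurisdiction`, the scoring rule, and the jurisdiction set are all assumptions for illustration; the set is a placeholder, not a regulatory list.

```python
# Placeholder set for the example only; not an official high-risk list.
HIGH_RISK_OFFSHORE = {"BVI", "Cayman Islands"}


def derive_kyc(profile: dict) -> dict:
    """Derive KYC flags from profile attributes with fixed rules, no randomness."""
    high_risk_jurisdiction = profile["offshore_jurisdiction"] in HIGH_RISK_OFFSHORE
    is_pep = profile["pep_status"] != "none"

    # Hypothetical scoring: each risk factor adds one point.
    score = int(high_risk_jurisdiction) + int(is_pep)
    rating = ["low", "medium", "high"][score]

    return {
        "high_risk_jurisdiction_flag": high_risk_jurisdiction,
        "kyc_risk_rating": rating,
    }


sovereign = {"offshore_jurisdiction": "BVI", "pep_status": "foreign"}
banker = {"offshore_jurisdiction": "Switzerland", "pep_status": "none"}
print(derive_kyc(sovereign)["kyc_risk_rating"])
print(derive_kyc(banker)["kyc_risk_rating"])
```

The same inputs always produce the same rating, so any flag on a profile can be traced back to the attributes that caused it.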

The Compliance Angle

Why does any of this matter beyond data science?

GDPR Article 25 requires data protection by design and by default. That extends to test environments. If your compliance team tests AML screening with pseudonymized or weakly anonymized client data, the test environment itself still processes personal data.

EU AI Act Article 10 (enforcement: August 2026) mandates governance over training data for high-risk AI systems. Annex III classifies some financial services uses, such as creditworthiness assessment, as high-risk.

Born-Synthetic data -- generated entirely from math, with zero lineage to real individuals -- sidesteps the personal-data problem behind both requirements. Not by anonymizing real data better. By never starting with real data in the first place.

What I Learned

Building this pipeline taught me a few things:

Constraints beat correlations. If you know the algebraic relationship between fields, enforce it. Don't hope your model learns it.
Domain knowledge is the moat. Any team can train a GAN on tabular data. Building a pipeline that understands UHNWI wealth structures, offshore jurisdictions, and KYC risk patterns -- that requires domain knowledge that no generic tool embeds.
Local-first is a feature. For compliance-sensitive data, "our model runs in your VPC" is a weaker guarantee than "the model runs on a machine with no internet connection."
Test your own data ruthlessly. The Balance Sheet Test is trivial to implement. The fact that most datasets fail it tells you something about how synthetic data quality is typically evaluated (it isn't).

Try It

There's a free sample -- 100 UHNWI profiles with all 29 fields, including the KYC/AML layer. No registration, no email gate:

KYC/AML sample: sovereignforger.com/sample-kyc/
UHNWI sample (19 fields): sovereignforger.com/sample/

Run the Balance Sheet Test on it. If a single record fails, I want to know.

The full datasets (1K to 100K records per niche) are at sovereignforger.com.

I'm building Sovereign Forger as a solo founder. If you work in compliance, fintech, or data science and have feedback on what's useful (or what's missing), I'd genuinely like to hear it.
