Why your synthetic fintech data fails code review (and how mixture models fix it)

#datascience #fintech #testing #python

Every fintech developer has done this: you need test data, you reach for Faker, you generate ten thousand transactions, and your demo works. Then a data scientist on the buying side opens your dataset, runs one df.describe(), and the deal-killing question arrives: "Why are your transaction amounts uniformly distributed?"

Real financial data has a shape. Synthetic data that ignores that shape is instantly recognizable — and in testing, ML training, or sales demos, instantly discrediting. I spent nine years running a savings app in Latin America (30,000+ users, 2015–2024), and when it wound down I kept something most synthetic data generators never had: 506,311 real records to measure that shape against. This post is about the three statistical properties that separate believable synthetic financial data from Faker output, with the actual numbers.

Property 1: Amounts are multimodal, not lognormal

The standard "sophisticated" approach is to sample amounts from a lognormal distribution. It's better than uniform, and it still fails. When I fitted a single lognormal to 261,070 real deposits, the body of the distribution looked fine (fitted percentiles deviated from the real ones by 7–10% between p25 and p90), but the tail fell apart: 35–45% deviation at p95–p99.
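To make that check concrete, here is a minimal sketch of how a single-lognormal fit can be compared against data percentile by percentile. It is not the exact procedure behind the numbers above: the relative-deviation formula, the helper name percentile_deviation, and the three-population stand-in data are all assumptions for illustration, since the real records are not reproduced here.

```python
import numpy as np


def percentile_deviation(real, synthetic, percentiles=(25, 50, 75, 90, 95, 99)):
    """Relative gap |synthetic - real| / real at each percentile (assumed metric)."""
    real_q = np.percentile(real, percentiles)
    synth_q = np.percentile(synthetic, percentiles)
    return dict(zip(percentiles, np.abs(synth_q - real_q) / real_q))


rng = np.random.default_rng(0)

# Hypothetical stand-in for real deposits: three made-up lognormal populations.
# The centers and spreads are illustrative, not fitted to any real data.
amounts = np.concatenate([
    rng.lognormal(mean=np.log(5), sigma=0.6, size=4000),     # micro-deposits
    rng.lognormal(mean=np.log(300), sigma=0.5, size=5000),   # typical deposits
    rng.lognormal(mean=np.log(8000), sigma=0.4, size=1000),  # large transfers
])

# The maximum-likelihood fit of a single lognormal is just the mean and
# standard deviation of the log-amounts.
log_amounts = np.log(amounts)
mu, sigma = log_amounts.mean(), log_amounts.std()
single_fit = rng.lognormal(mean=mu, sigma=sigma, size=amounts.size)

deviation = percentile_deviation(amounts, single_fit)
for p, d in deviation.items():
    print(f"p{p}: {d:.1%} deviation")
```

Swapping the stand-in array for your own amounts column is the only change needed to run the same comparison on real data.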

The reason is that "deposit amount" isn't one population. It's at least three: micro-deposits (the $1–$20 spare-change crowd), typical deposits ($100–$800), and large transfers ($6,000+). Each has its own location and spread. A single lognormal averages across them and gets all of them wrong.

The fix is a mixture of lognormals. Fit GaussianMixture from scikit-learn on the log-amounts, select the number of components K, sample from the mixture, and exponentiate the samples back into amounts. One non-obvious lesson from doing this on real data: don't select K with BIC. Financial amounts have heavy atoms at round values (more on that below), and BIC reacts to those atoms by under-fitting the number of components. Selecting K by minimizing the Kolmogorov–Smirnov statistic against a held-out sample worked far better: a 6-component mixture brought deposits from KS=0.068 down to KS=0.032, and p99 deviation from ~45% to under 5%.
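The recipe above can be sketched as follows. This is one plausible reading rather than the exact pipeline: it assumes a two-sample KS test (scipy's ks_2samp) between mixture samples and a 20% holdout, and the function names, the K range, and the stand-in data are hypothetical.

```python
import numpy as np
from scipy.stats import ks_2samp
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import train_test_split


def fit_lognormal_mixture(amounts, k_values=range(1, 7), seed=0):
    """Fit a GMM on log-amounts for each K; keep the K with the lowest holdout KS."""
    log_amounts = np.log(amounts).reshape(-1, 1)
    train, holdout = train_test_split(log_amounts, test_size=0.2, random_state=seed)
    best = None
    for k in k_values:
        gmm = GaussianMixture(n_components=k, random_state=seed).fit(train)
        samples, _ = gmm.sample(len(holdout))
        # Two-sample KS statistic between mixture samples and held-out data.
        ks = ks_2samp(samples.ravel(), holdout.ravel()).statistic
        if best is None or ks < best[0]:
            best = (ks, k, gmm)
    return best


def sample_amounts(gmm, n):
    """Draw n log-amounts from the mixture and map them back to amounts."""
    log_samples, _ = gmm.sample(n)
    return np.exp(log_samples.ravel())


rng = np.random.default_rng(0)

# Hypothetical stand-in data: three made-up populations, not real deposits.
amounts = np.concatenate([
    rng.lognormal(mean=np.log(5), sigma=0.6, size=2000),
    rng.lognormal(mean=np.log(300), sigma=0.5, size=2500),
    rng.lognormal(mean=np.log(8000), sigma=0.4, size=500),
])

best_ks, best_k, gmm = fit_lognormal_mixture(amounts)
synthetic = sample_amounts(gmm, 10_000)
print(f"selected K={best_k}, holdout KS={best_ks:.3f}")
```

Because the KS statistic here is computed from a finite sample of the mixture, it is itself noisy; averaging over several draws, or comparing the holdout against the mixture's CDF directly, would give a steadier selection of K.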

