<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Joel Mendoza</title>
    <description>The latest articles on DEV Community by Joel Mendoza (@joel_mendoza_8a2623998b93).</description>
    <link>https://dev.to/joel_mendoza_8a2623998b93</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3981946%2F70143611-00c6-4c77-9611-4b21a31946ba.png</url>
      <title>DEV Community: Joel Mendoza</title>
      <link>https://dev.to/joel_mendoza_8a2623998b93</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/joel_mendoza_8a2623998b93"/>
    <language>en</language>
    <item>
      <title>Why your synthetic fintech data fails code review (and how mixture models fix it)</title>
      <dc:creator>Joel Mendoza</dc:creator>
      <pubDate>Fri, 12 Jun 2026 22:01:57 +0000</pubDate>
      <link>https://dev.to/joel_mendoza_8a2623998b93/why-your-synthetic-fintech-data-fails-code-review-and-how-mixture-models-fix-it-fm9</link>
      <guid>https://dev.to/joel_mendoza_8a2623998b93/why-your-synthetic-fintech-data-fails-code-review-and-how-mixture-models-fix-it-fm9</guid>
      <description>&lt;p&gt;Every fintech developer has done this: you need test data, you reach for Faker, you generate ten thousand transactions, and your demo works. Then a data scientist on the buying side opens your dataset, runs one &lt;code&gt;df.describe()&lt;/code&gt;, and the deal-killing question arrives: &lt;em&gt;"Why are your transaction amounts uniformly distributed?"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Real financial data has a shape. Synthetic data that ignores that shape is instantly recognizable — and in testing, ML training, or sales demos, instantly discrediting. I spent nine years running a savings app in Latin America (30,000+ users, 2015–2024), and when it wound down I kept something most synthetic data generators never had: 506,311 real records to measure that shape against. This post is about the three statistical properties that separate believable synthetic financial data from Faker output, with the actual numbers.&lt;/p&gt;

&lt;h2&gt;Property 1: Amounts are multimodal, not lognormal&lt;/h2&gt;

&lt;p&gt;The standard "sophisticated" approach is to sample amounts from a lognormal distribution. It's better than uniform, and it still fails. When I fitted a single lognormal to 261,070 real deposits, the body of the distribution looked fine (fitted percentiles within 7–10% of the real ones between p25 and p90), but the tail fell apart: &lt;strong&gt;35–45% deviation at p95–p99&lt;/strong&gt;.&lt;/p&gt;
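
&lt;p&gt;One way to run that check yourself is sketched below, using only the standard library. It fits a single lognormal from the mean and standard deviation of the log-amounts, then reports the relative gap between fitted and empirical percentiles. The function name &lt;code&gt;percentile_deviation&lt;/code&gt; and the three-population stand-in sample are assumptions for illustration, not the real dataset or the exact procedure behind the numbers above.&lt;/p&gt;

```python
import math
import random
import statistics


def percentile_deviation(amounts, percentiles=(25, 50, 75, 90, 95, 99)):
    """Fit one lognormal to the amounts and return, per percentile,
    the relative gap between the fitted and the empirical quantile."""
    logs = [math.log(a) for a in amounts]
    # A lognormal is a normal distribution on the log scale.
    fitted = statistics.NormalDist(statistics.fmean(logs), statistics.stdev(logs))
    # 99 cut points: cuts[p - 1] is the empirical p-th percentile.
    cuts = statistics.quantiles(amounts, n=100, method="inclusive")
    report = {}
    for p in percentiles:
        empirical = cuts[p - 1]
        model = math.exp(fitted.inv_cdf(p / 100))
        report[p] = abs(model - empirical) / empirical
    return report


# Hypothetical stand-in data: three populations with made-up parameters.
rng = random.Random(0)
sample = (
    [rng.lognormvariate(math.log(5), 0.6) for _ in range(3000)]
    + [rng.lognormvariate(math.log(300), 0.5) for _ in range(5000)]
    + [rng.lognormvariate(math.log(8000), 0.4) for _ in range(2000)]
)

for p, gap in percentile_deviation(sample).items():
    print(f"p{p}: fitted quantile is {gap:.1%} away from the empirical one")
```

&lt;p&gt;Running the same function on your own generator's output and on a reference sample gives a quick, comparable readout of where the fit breaks down.&lt;/p&gt;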

&lt;p&gt;The reason is that "deposit amount" isn't one population. It's at least three: micro-deposits (the $1–$20 spare-change crowd), typical deposits ($100–$800), and large transfers ($6,000+). Each has its own location and spread. A single lognormal averages across them and gets all of them wrong.&lt;/p&gt;
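
&lt;p&gt;To make the "three populations" idea concrete, here is a minimal sampling sketch. Only the rough dollar ranges come from the description above; the weights, medians, and sigmas in &lt;code&gt;COMPONENTS&lt;/code&gt; are hypothetical placeholders, and &lt;code&gt;sample_mixture&lt;/code&gt; is an illustrative helper, not a fitted model.&lt;/p&gt;

```python
import math
import random

# Hypothetical components: (weight, median amount in dollars, sigma of log-amount).
COMPONENTS = [
    (0.35, 5.0, 0.6),     # micro-deposits, roughly the $1-$20 range
    (0.55, 300.0, 0.5),   # typical deposits, roughly $100-$800
    (0.10, 8000.0, 0.4),  # large transfers, $6,000 and up
]


def sample_mixture(n, components=COMPONENTS, seed=0):
    """Draw n amounts: pick a component by weight, then draw a lognormal from it."""
    rng = random.Random(seed)
    weights = [weight for weight, _, _ in components]
    picks = rng.choices(components, weights=weights, k=n)
    # lognormvariate takes the mean of the underlying normal, i.e. log(median).
    return [
        round(rng.lognormvariate(math.log(median), sigma), 2)
        for _, median, sigma in picks
    ]


amounts = sample_mixture(10_000)
```

&lt;p&gt;A histogram of &lt;code&gt;log(amount)&lt;/code&gt; for this sample shows three separate humps, which is exactly what a single lognormal cannot reproduce.&lt;/p&gt;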

&lt;p&gt;The fix is a &lt;strong&gt;mixture of lognormals&lt;/strong&gt;. Fit scikit-learn's &lt;code&gt;GaussianMixture&lt;/code&gt; on the log-amounts, select the number of components K, then sample from the mixture and exponentiate back to amounts. One non-obvious lesson from doing this on real data: &lt;strong&gt;don't select K with BIC&lt;/strong&gt;. Financial amounts have heavy atoms at round values (more on that below), and BIC reacts to those atoms by under-fitting the number of components. Selecting K by minimizing the Kolmogorov–Smirnov statistic against a held-out sample worked far better: a 6-component mixture brought deposits from KS=0.068 down to &lt;strong&gt;KS=0.032&lt;/strong&gt;, and p99 deviation from ~45% to under 5%.&lt;/p&gt;
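
&lt;p&gt;The selection loop described above might look roughly like the sketch below. It assumes NumPy and scikit-learn are installed; &lt;code&gt;ks_statistic&lt;/code&gt; and &lt;code&gt;select_k_by_ks&lt;/code&gt; are hypothetical helper names, the two-sample KS statistic is computed by hand from empirical CDFs, and the data is a made-up three-population sample rather than the real deposits.&lt;/p&gt;

```python
import numpy as np
from sklearn.mixture import GaussianMixture


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    a, b = np.sort(a), np.sort(b)
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / a.size
    cdf_b = np.searchsorted(b, grid, side="right") / b.size
    return float(np.max(np.abs(cdf_a - cdf_b)))


def select_k_by_ks(log_train, log_holdout, k_values, n_draws=20000, seed=0):
    """Fit one mixture per K on log-amounts and keep the K whose samples
    are closest (lowest KS) to the held-out log-amounts."""
    scores = {}
    for k in k_values:
        gmm = GaussianMixture(n_components=k, random_state=seed)
        gmm.fit(log_train.reshape(-1, 1))
        draws, _ = gmm.sample(n_draws)  # returns (samples, component labels)
        scores[k] = ks_statistic(draws.ravel(), log_holdout)
    best_k = min(scores, key=scores.get)
    return best_k, scores


# Hypothetical stand-in data: three populations with made-up parameters.
rng = np.random.default_rng(42)
amounts = np.concatenate([
    rng.lognormal(np.log(5), 0.6, 3000),
    rng.lognormal(np.log(300), 0.5, 5000),
    rng.lognormal(np.log(8000), 0.4, 2000),
])
rng.shuffle(amounts)

log_amounts = np.log(amounts)
log_train, log_holdout = log_amounts[:7000], log_amounts[7000:]

best_k, ks_scores = select_k_by_ks(log_train, log_holdout, k_values=range(1, 7))
print("selected K:", best_k)
```

&lt;p&gt;To generate synthetic amounts from the chosen model, refit with &lt;code&gt;n_components=best_k&lt;/code&gt; and apply &lt;code&gt;np.exp&lt;/code&gt; to the output of &lt;code&gt;gmm.sample&lt;/code&gt;. Scoring against a held-out split, not the training data, keeps the KS number from rewarding a mixture that merely memorizes the sample it was fitted on.&lt;/p&gt;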




</description>
      <category>datascience</category>
      <category>fintech</category>
      <category>testing</category>
      <category>python</category>
    </item>
  </channel>
</rss>
