
Mohamed Hussain S

When Synthetic Data Lies: A Hidden Correlation Problem I Didn’t Expect

While working on a small analytics setup using ClickHouse and Superset, I generated some synthetic data to test queries and dashboards.

Initially, everything looked fine. The distributions seemed reasonable, and the dashboards behaved as expected.

But as I increased the dataset size, a few patterns started to look off.

Revenue seemed to concentrate in a single country.
In some cases, certain countries had no purchases at all.

At first, it looked like a simple distribution issue. But the patterns were too consistent to ignore.


Checking the Usual Suspects

The first assumption was that something was wrong with the queries or aggregations.

So I checked:

  • query logic
  • filters
  • materialized views
  • dashboard configurations

Everything seemed correct.

Which pointed to a different possibility:

The issue wasn’t in how the data was queried - it was in how the data was generated.


Looking at the Data More Closely

Instead of relying on dashboards, I went back to the raw data.

A simple aggregation made things clearer:

  • one country dominating purchases
  • another missing entirely


(Chart: overall event distribution looks reasonable - the issue isn't obvious here.)

At this point, it was clear that the data itself wasn’t behaving as expected.


The First Issue: Randomness That Wasn’t Quite Random

The initial data generation logic used rand() like this:

multiIf(
    rand() % 100 < 40, 'India',
    rand() % 100 < 65, 'US',
    rand() % 100 < 80, 'UK',
    rand() % 100 < 90, 'Germany',
    'UAE'
)

At a glance, this looks reasonable.

But each rand() call is evaluated independently.

So instead of generating a single random value and assigning a category, the logic evaluates a new random value at each step.

This leads to unintended distributions and subtle bias in the data.
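The cascade can be mirrored outside the database to see how far it drifts from the split the thresholds suggest (40/25/15/10/10). This is a small Python sketch of the same branch logic - an illustration of the math, not ClickHouse itself:

```python
from fractions import Fraction

# Thresholds from the multiIf above. Read in sequence, they *look* like
# they should produce India 40%, US 25%, UK 15%, Germany 10%, UAE 10%.
branches = [("India", 40), ("US", 65), ("UK", 80), ("Germany", 90)]

# But with an independent rand() per branch, the probability of reaching
# a branch is the product of failing every earlier check.
actual = {}
p_reach = Fraction(1)
for country, t in branches:
    p_hit = Fraction(t, 100)        # P(rand() % 100 < t) for this branch
    actual[country] = p_reach * p_hit
    p_reach *= 1 - p_hit
actual["UAE"] = p_reach             # the fall-through branch

for country, p in actual.items():
    print(f"{country:8s} {float(p):.4f}")
# India 0.4000, US 0.3900, UK 0.1680, Germany 0.0378, UAE 0.0042
```

Instead of the intended split, the US ends up almost tied with India, and the UAE all but vanishes at 0.42% of rows - exactly the kind of concentration the dashboards were showing.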


Fixing That… and Introducing Another Problem

To make the data more stable, I switched to a deterministic approach using number:

(number * 17) % 100 AS event_rand
(number * 29) % 100 AS country_rand

This made the distributions predictable and easier to reason about.
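The appeal of this approach is easy to verify: because 17 is coprime to 100, (number * 17) % 100 cycles through every value from 0 to 99 exactly once in each block of 100 consecutive rows, so the distribution is perfectly flat. A quick Python check of the same arithmetic:

```python
# (number * 17) % 100 over consecutive row numbers: since gcd(17, 100) = 1,
# multiplying by 17 permutes the residues mod 100, so every value in 0..99
# appears exactly once per 100 rows.
values = [(n * 17) % 100 for n in range(100)]
assert sorted(values) == list(range(100))  # flat: each residue exactly once
```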


(Chart: after fixing randomness, a different issue appears - some countries have zero purchases.)

But it introduced a different issue.

Both event_type and country were now derived from the same base value: number.


The Real Issue: Hidden Correlation

Even with different multipliers, these values were not independent.

They were mathematically related.

This meant:

  • certain rows could only produce certain combinations
  • some combinations would never occur
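The degree of coupling is easy to quantify: both expressions depend only on number % 100, so no matter how many rows are generated, at most 100 distinct (event_rand, country_rand) pairs can ever appear - out of the 10,000 that two genuinely independent values could form. A Python check of the same arithmetic:

```python
# Reproduce the two ClickHouse expressions over a million row numbers.
pairs = {((n * 17) % 100, (n * 29) % 100) for n in range(1_000_000)}

# Two independent uniform values on 0..99 could form 100 * 100 = 10,000
# combinations; here only 100 ever occur, one per residue of n mod 100.
print(len(pairs))  # 100
```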

In this case, the rows that produced "purchase" never aligned with the rows that produced "UAE".

Which resulted in:

  • UAE having zero purchases
  • other countries showing skewed distributions

What Was Actually Wrong

The issue wasn’t randomness.

It was lack of independence.

The variables in the synthetic dataset were not independent of each other.

And that’s enough to produce misleading analytics.


Fixing the Data Generation

To resolve this, I changed how the values were generated:

  • used different transformations
  • added offsets
  • ensured each variable had its own distribution

For example:

(number * 17) % 100 AS event_rand
(number * 31 + 13) % 100 AS country_rand
(number * 47 + 23) % 100 AS device_rand

This breaks the alignment between variables. Strictly speaking, each column is still a deterministic function of number rather than truly independent - but with distinct multipliers and offsets, the value ranges no longer lock together, so combinations that were previously impossible (like purchases from the UAE) can occur again.


After this change, the distributions behaved as expected.
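For stronger guarantees than a realignment that happens to work, one option - not part of the original setup, just a possible refinement - is to derive each column from a salted hash of number (in ClickHouse, something along the lines of cityHash64(number, 'country') % 100). Here is the same idea sketched in Python with a standard-library hash; the salts and helper name are illustrative:

```python
import hashlib


def col_rand(n: int, salt: str) -> int:
    """Derive a pseudo-random value in 0..99 for row n, decorrelated per salt.

    Mimics hashing the row number with a per-column salt; the concrete
    ClickHouse equivalent mentioned above is an assumption, not from the post.
    """
    digest = hashlib.sha256(f"{salt}:{n}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % 100


# The two columns are no longer locked to the same residue of n, so
# essentially all 10,000 value combinations show up in a large sample.
pairs = {(col_rand(n, "event"), col_rand(n, "country")) for n in range(100_000)}
print(len(pairs))  # nearly all of the 10,000 possible combinations
```

Deterministic hashing keeps the data reproducible (the same row number always yields the same values) while removing the arithmetic relationship between columns.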


Why This Matters

At smaller scales, the issue wasn’t obvious.

The data looked fine, and the dashboards seemed reasonable.

But as the dataset grew:

  • patterns became more consistent
  • biases became more visible
  • incorrect assumptions started to look like real insights

Key Takeaway

Synthetic data can look correct while still producing misleading results.

The problem wasn’t query performance.

It was data correctness.

Scaling the data didn’t create the issue.

It revealed it.


Final Thoughts

This was a good reminder that:

  • data generation deserves as much attention as querying
  • small assumptions can lead to large inconsistencies
  • “reasonable-looking” data isn’t always reliable

When working with analytics systems, it’s easy to trust what the data shows.

But sometimes, it’s worth questioning how that data was created in the first place.


Related

This came up while building an analytics setup using ClickHouse and Superset, where I was comparing raw tables and materialized views.

If you're interested in that setup, you can read about it here:
👉 link to Blog 1

