While working on a small analytics setup using ClickHouse and Superset, I generated some synthetic data to test queries and dashboards.
Initially, everything looked fine. The distributions seemed reasonable, and the dashboards behaved as expected.
But as I increased the dataset size, a few patterns started to look off.
Revenue seemed to concentrate in a single country.
In some cases, certain countries had no purchases at all.
At first, it looked like a simple distribution issue. But the patterns were too consistent to ignore.
Checking the Usual Suspects
The first assumption was that something was wrong with the queries or aggregations.
So I checked:
- query logic
- filters
- materialized views
- dashboard configurations
Everything seemed correct.
Which pointed to a different possibility:
The issue wasn’t in how the data was queried - it was in how the data was generated.
Looking at the Data More Closely
Instead of relying on dashboards, I went back to the raw data.
A simple aggregation made things clearer:
- one country dominating purchases
- another missing entirely

[Chart: Overall event distribution looks reasonable - the issue isn’t obvious here]
At this point, it was clear that the data itself wasn’t behaving as expected.
The First Issue: Randomness That Wasn’t Quite Random
The initial data generation logic used rand() like this:
multiIf(
    rand() % 100 < 40, 'India',
    rand() % 100 < 65, 'US',
    rand() % 100 < 80, 'UK',
    rand() % 100 < 90, 'Germany',
    'UAE'
)
At a glance, this looks reasonable.
But each rand() call is evaluated independently.
Instead of drawing one random value and assigning a category from it, the expression draws a fresh value at every branch, so a later category is only reached after every earlier condition has missed.
The effective probability of each category becomes a product of those misses rather than the intended share, which quietly skews the distribution toward the first branches.
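To see how far this drifts, here is a small Python sketch of my own (not from the original setup) that computes the effective probabilities of the fresh-draw-per-branch logic against the intended 40/25/15/10/10 split:

```python
# Category probabilities when every multiIf branch draws a *fresh*
# random value, versus the intended single-draw split of [0, 100).
cuts = [("India", 0.40), ("US", 0.65), ("UK", 0.80), ("Germany", 0.90)]

intended = {}
effective = {}
prev_cut = 0.0
remaining = 1.0  # probability mass that has missed all earlier branches
for name, cut in cuts:
    intended[name] = cut - prev_cut    # single draw: one slice of [0, 1)
    effective[name] = remaining * cut  # fresh draw: P(reach branch) * P(hit)
    prev_cut = cut
    remaining *= 1.0 - cut
intended["UAE"] = 1.0 - prev_cut
effective["UAE"] = remaining

print({k: round(v, 4) for k, v in intended.items()})
print({k: round(v, 4) for k, v in effective.items()})
# 'US' lands near 0.39 instead of 0.25, and 'UAE' collapses to
# roughly 0.004 instead of the intended 0.10.
```

The last category barely survives, which matches the kind of skew the dashboards showed.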
Fixing That… and Introducing Another Problem
To make the data more stable, I switched to a deterministic approach using number:
(number * 17) % 100 AS event_rand,
(number * 29) % 100 AS country_rand
This made the distributions predictable and easier to reason about.
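As a quick sanity check (my own sketch, not part of the original setup), Python confirms why this is predictable: because the multipliers are coprime to 100, number * k % 100 cycles through every value 0..99 exactly once per 100 rows, so any bucketing on top of it is perfectly uniform:

```python
# (number * 17) % 100 visits each value 0..99 exactly once per 100 rows,
# because gcd(17, 100) == 1 -- so buckets built on it are exactly uniform.
event_rand = [(n * 17) % 100 for n in range(100)]
assert sorted(event_rand) == list(range(100))

# The same holds for the country multiplier.
country_rand = [(n * 29) % 100 for n in range(100)]
assert sorted(country_rand) == list(range(100))
```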

[Chart: After fixing randomness - a different issue appears (some countries have zero purchases)]
But it introduced a different issue.
Both event_type and country were now derived from the same base value: number.
The Real Issue: Hidden Correlation
Even with different multipliers, these values were not independent.
Both are deterministic functions of the same counter: knowing number % 100 pins down event_rand and country_rand exactly.
This meant:
- certain rows could only produce certain combinations
- some combinations would never occur
In this case, the rows that produced "purchase" never aligned with the rows that produced "UAE".
Which resulted in:
- UAE having zero purchases
- other countries showing skewed distributions
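The correlation is easy to quantify with a short Python check (an illustration of mine, with assumed bucket thresholds). Although there are 100 × 100 conceivable (event_rand, country_rand) pairs, only 100 can ever occur, because both values are determined by number % 100. And if, for illustration, "purchase" maps to the tail bucket event_rand >= 90 and "UAE" to country_rand >= 90 (the fallback bucket, as in the multiIf above), the two tails never coincide:

```python
# Both pseudo-random columns are determined by number % 100, so out of
# 100 * 100 conceivable value pairs, only 100 ever occur.
pairs = {((n * 17) % 100, (n * 29) % 100) for n in range(10_000)}
print(len(pairs))  # 100

# Assumed tail buckets: purchase = event_rand >= 90, UAE = country_rand >= 90.
# No residue of number % 100 lands in both tails at once.
uae_purchases = [k for k in range(100)
                 if (k * 17) % 100 >= 90 and (k * 29) % 100 >= 90]
print(uae_purchases)  # [] -- the buckets never align
```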
What Was Actually Wrong
The issue wasn’t randomness.
It was lack of independence.
The variables in the synthetic dataset were not independent of each other.
And that’s enough to produce misleading analytics.
Fixing the Data Generation
To resolve this, I changed how the values were generated:
- used different transformations
- added offsets
- ensured each variable had its own distribution
For example:
(number * 17) % 100 AS event_rand,
(number * 31 + 13) % 100 AS country_rand,
(number * 47 + 23) % 100 AS device_rand
This breaks the alignment between the variables. Strictly speaking they are still deterministic functions of number rather than truly independent, but the new multipliers and offsets shift which combinations occur, so every event/country pairing can now appear instead of entire combinations dropping out.
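Under the same assumed tail buckets as before (purchase = event_rand >= 90, UAE = country_rand >= 90), a short Python check of mine shows what the new multiplier and offset change:

```python
def tail_overlap(mult, offset):
    """Count residues of number % 100 landing in both tail buckets:
    event_rand >= 90 (assumed 'purchase') and country_rand >= 90 ('UAE')."""
    return sum(1 for k in range(100)
               if (k * 17) % 100 >= 90 and (k * mult + offset) % 100 >= 90)

print(tail_overlap(29, 0))   # old generator: 0 -> UAE never gets a purchase
print(tail_overlap(31, 13))  # new generator: 1 -> 1% of rows, matching the
                             # 10% * 10% an independent pair would produce
```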

After this change, the distributions behaved as expected.
Why This Matters
At smaller scales, the issue wasn’t obvious.
The data looked fine, and the dashboards seemed reasonable.
But as the dataset grew:
- patterns became more consistent
- biases became more visible
- incorrect assumptions started to look like real insights
Key Takeaway
Synthetic data can look correct while still producing misleading results.
The problem wasn’t query performance.
It was data correctness.
Scaling the data didn’t create the issue.
It revealed it.
Final Thoughts
This was a good reminder that:
- data generation deserves as much attention as querying
- small assumptions can lead to large inconsistencies
- and “reasonable-looking” data isn’t always reliable
When working with analytics systems, it’s easy to trust what the data shows.
But sometimes, it’s worth questioning how that data was created in the first place.
Related
This came up while building an analytics setup using ClickHouse and Superset, where I was comparing raw tables and materialized views.
If you're interested in that setup, you can read about it here:
👉 link to Blog 1
