Ibim
Synthetic Data Is Not About Replacing Reality. It Is About Questioning It.

There is a quiet moment many of us reach when working with data and machine learning.

The model performs well.
The metrics look reassuring.
The pipeline feels complete.

And yet, something does not sit right. Not because the system is broken. But because we are not convinced it is fair.

This is where synthetic data becomes more than a technical tool.
It becomes a way to ask uncomfortable questions.

The Hidden Problem With Real World Data

We often talk about real world data as if it is neutral. It is not.

Hiring data reflects decades of unequal access to education, employment, and opportunity. Healthcare data reflects who was diagnosed, who was believed, and who was ignored. Behavioural datasets reflect cultural norms and economic pressures.

When AI systems are trained purely on historical data, they do not learn fairness. They learn patterns. And many of those patterns are shaped by inequality. This is not a philosophical argument. It is a statistical one.

What Synthetic Data Actually Is

Synthetic data is artificially generated data that mimics the structure and statistical properties of real datasets, without representing real individuals. It is not created for humans to read. It is created for systems to learn from, or to be tested against. This distinction matters.

Synthetic CVs are not meant to apply for jobs.
Synthetic patient records are not meant to describe real people.
Synthetic handwriting samples are not meant to replace human writing.

They exist to allow experimentation without harm.

Synthetic Data as a Controlled Lens

One of the most powerful properties of synthetic data is control. In the real world, you cannot ethically do the following:

  • Take a job applicant.
  • Change only their name.
  • Or their age.
  • Or a single line mentioning a disability.

Then re-run the application. With synthetic data, you can.

Research on synthetic CV generation for fairness testing shows how artificial applicant profiles can be created where all variables are held constant except one. This allows researchers and practitioners to observe how automated hiring systems respond to specific demographic changes without involving real candidates or breaching privacy obligations (Saldivar, Gatzioura, Castillo, 2025).

When outcomes change under these controlled conditions, bias becomes visible. Not as an accusation. But as behaviour.
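The counterfactual probe described above can be sketched in a few lines of Python. Everything here is hypothetical: `score_cv` stands in for whatever hiring model is under test, and the bias it exhibits is planted deliberately so the probe has something to find.

```python
def score_cv(cv: dict) -> float:
    """Stand-in for a real hiring model. The name-based bump is a
    deliberately planted bias, purely for illustration."""
    score = cv["years_experience"] * 0.1
    if cv["name"].startswith("A"):
        score += 0.05
    return score

def counterfactual_gap(cv: dict, field: str, alt_value) -> float:
    """Score the same CV twice, changing only one field, and return
    the difference. A nonzero gap means the field alone moved the score."""
    variant = dict(cv, **{field: alt_value})
    return score_cv(variant) - score_cv(cv)

base = {"name": "Alex", "years_experience": 5, "skills": ["python"]}
gap = counterfactual_gap(base, "name", "Zara")
print(f"Score shift from changing only the name: {gap:+.3f}")
```

The harness itself is trivial; the value lies in holding everything constant except one field, which real-world data never allows.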

Lessons From Healthcare and Rare Disease Research

Some of the most mature work on synthetic data comes from healthcare. In rare disease research, data is scarce, sensitive, and heavily regulated. Sharing real patient records is often impossible.

Research into privacy preserving synthetic data generation shows how generative models can create realistic patient profiles that allow analysis, model training, and collaboration without exposing personal information (Mendes, Barbar, Refaie, 2025).

But these studies also highlight something important.

Synthetic data reflects the quality of the data it is generated from. If the original dataset is biased or incomplete, the synthetic data will inherit those weaknesses. This lesson transfers directly to hiring systems. Synthetic data is not automatically fair. It must be designed with intent.
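A minimal illustration of this inheritance: a naive generator that merely resamples from a skewed dataset reproduces the skew almost exactly. The group labels and proportions below are invented for the example.

```python
import random

random.seed(0)
# A biased "real" dataset: group A heavily over-represented.
real = ["A"] * 90 + ["B"] * 10

# A naive generator: resample from the empirical distribution.
synthetic = [random.choice(real) for _ in range(1000)]

share_b = synthetic.count("B") / len(synthetic)
print(f"Group B share in synthetic data: {share_b:.2%}")  # stays near 10%
```

Nothing in the generation step corrects the imbalance, because nothing was asked to.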

Why Representation Matters More Than Volume

Another important insight comes from handwriting recognition research.

Some languages and writing styles are poorly represented in public datasets. As a result, models perform well for some populations and poorly for others.

Research on handwriting recognition shows that large scale synthetic datasets are often required to capture enough variation for models to generalise properly, especially when real data is limited (Pham Thach Thanh Truc et al., 2025). The takeaway is simple.

If certain groups are missing from the data, the system will struggle with them. This applies to CVs, medical records, and any system that interacts with human diversity.
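Before generating or training on a dataset, a simple coverage check can surface the gap. This sketch uses invented record fields, and the 30% threshold is an arbitrary illustration, not a standard.

```python
from collections import Counter

def coverage_report(samples: list, key: str) -> dict:
    """Return each group's share of the dataset under the given key."""
    counts = Counter(s[key] for s in samples)
    total = sum(counts.values())
    return {group: n / total for group, n in counts.items()}

records = [
    {"script": "latin"}, {"script": "latin"}, {"script": "latin"},
    {"script": "arabic"},
]
report = coverage_report(records, "script")
# Flag groups below an (arbitrary) representation threshold.
low = [g for g, share in report.items() if share < 0.3]
print(low)  # → ['arabic']
```

A check this simple will not fix under-representation, but it makes the gap visible before the model hides it.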

What Robotics Teaches Us About Synthetic Worlds

Robotics offers a useful warning.

In robotic learning, simulation is widely used because collecting real world data is expensive and slow. However, research on robotic bin packing shows that systems trained only in idealised synthetic environments often fail when deployed in real conditions (Wang et al., 2025).

Why?

Because reality is messy.

Objects behave unpredictably.
Lighting changes.
Constraints shift.

The same principle applies to synthetic data used for fairness testing. If synthetic CVs are too clean, too linear, or too idealised, fairness evaluations become misleading. Real careers are rarely neat.

People change paths.
Take breaks.
Move countries.
Care for others.

Synthetic data must reflect this complexity if it is to reveal meaningful bias.
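One way to keep synthetic CVs from being too clean is to inject these irregularities explicitly. The noise rates below are invented placeholders; in practice they would come from real labour-market statistics.

```python
import random

def add_career_noise(cv: dict, rng: random.Random) -> dict:
    """Perturb an idealised synthetic CV with realistic irregularities.
    The 30% and 20% rates are assumptions for illustration only."""
    cv = dict(cv)
    if rng.random() < 0.3:   # assumed base rate of career breaks
        cv["career_break_years"] = rng.choice([1, 2, 3])
    if rng.random() < 0.2:   # assumed rate of international moves
        cv["relocations"] = rng.randint(1, 3)
    return cv

rng = random.Random(42)
clean = {"name": "Sam", "years_experience": 8}
noisy = [add_career_noise(clean, rng) for _ in range(1000)]
with_breaks = sum("career_break_years" in cv for cv in noisy)
print(f"{with_breaks} of 1000 synthetic CVs include a career break")
```

The point is that the messiness is generated on purpose, at rates you can defend, rather than hoped for.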

Synthetic Data Does Not Eliminate Bias Automatically

This point is worth stating clearly. Synthetic data does not fix bias on its own. Generative models learn patterns. They do not understand ethics or social context. If historical data encodes inequality, a naive synthetic generator will reproduce it.

This is why recent research emphasises constraints, validation, and domain knowledge when generating synthetic datasets, particularly in sensitive domains such as healthcare and employment (Mendes et al., 2025).
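In practice this often takes the form of explicit validation rules derived from domain knowledge. The constraints below are invented examples of the kind of checks a synthetic patient record might need to pass.

```python
def validate_record(rec: dict) -> list:
    """Run assumed domain-knowledge checks on a synthetic patient record
    and return a list of violations (empty means the record passes)."""
    errors = []
    if not (0 <= rec.get("age", -1) <= 120):
        errors.append("age out of range")
    if rec.get("diagnosis_year", 0) < rec.get("birth_year", 0):
        errors.append("diagnosed before birth")
    return errors

bad = {"age": 150, "birth_year": 1980, "diagnosis_year": 1975}
print(validate_record(bad))  # → ['age out of range', 'diagnosed before birth']
```

Generators produce plausible-looking records by default; it is the validation layer that encodes what "plausible" actually means in the domain.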

Synthetic data is a tool.
Fairness depends on how it is used.

Why Synthetic Data Forces Honesty

There is something quietly powerful about synthetic data.

It removes excuses.

When systems can be tested under controlled conditions, bias can no longer hide behind noise or complexity.

If a hiring model behaves unfairly when only one variable is changed, the issue is structural.

Synthetic data does not accuse.
It reveals.

And that is precisely why it matters.

Looking Ahead

Synthetic data is often described as artificial.

But its impact is real.

It shapes how we test AI systems.
How we protect privacy.
How we detect bias.
How we imagine fairer alternatives.

Used carelessly, it can reinforce historical inequality.
Used thoughtfully, it can help us challenge it.

Synthetic data is not about replacing reality.

It is about questioning the systems we build from it.

References

Saldivar, J., Gatzioura, A., & Castillo, C. (2025). Synthetic CVs to Build and Test Fairness-Aware Hiring Tools. ACM Transactions on Intelligent Systems and Technology.

Mendes, M., Barbar, F., & Refaie, A. (2025). Synthetic Data Generation: A Privacy-Preserving Approach to Accelerate Rare Disease Research. Frontiers in Digital Health.

Pham Thach Thanh Truc et al. (2025). HTR-ConvText: Leveraging Convolution and Textual Information for Handwritten Text Recognition. arXiv preprint.

Wang, Z. et al. (2025). RoboBPP: Benchmarking Robotic Online Bin Packing with Physics-Based Simulation. arXiv preprint.

MIT Technology Review. What synthetic data is and why it matters for AI. https://www.technologyreview.com

Nature News and Comment. How artificial data could help address bias in AI. https://www.nature.com

OECD AI Policy Observatory. Fairness, transparency, and accountability in AI. https://oecd.ai
