Vu Hung Nguyen (Hưng)
Create Synthetic Data - A Comprehensive Guideline

Overview
This document guides you through creating synthetic data with Python and Cursor, solving the problem of being "hungry for data" when real data is not available.

What is Synthetic Data?
Synthetic data is artificially generated data that mimics the statistical properties of real-world data. It is often used in scenarios where real data is scarce, sensitive, or expensive to obtain. Synthetic data can be used for testing, training machine learning models, and validating algorithms without compromising privacy or security.

Why Use Synthetic Data?
Privacy: Synthetic data can be generated without using any real personal information, making it a safer alternative for testing and development.
Cost-Effective: Generating synthetic data can be more cost-effective than collecting and maintaining real datasets.
Flexibility: Synthetic data can be tailored to specific requirements, allowing for the creation of diverse datasets that cover various scenarios.
Scalability: Synthetic data can be generated in large volumes, making it suitable for big data analytics.
Methodology
The Python faker library is used to generate synthetic data (see the sketch after this list).
Probabilistic approaches are employed to ensure the synthetic data closely resembles real-world data distributions.
Machine learning techniques can be applied to refine the synthetic data generation process.
Neural networks, such as GANs (Generative Adversarial Networks), can be utilized to create more complex and realistic synthetic datasets.
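A minimal sketch of the faker-based approach, with a light probabilistic touch; the record fields and the age distribution here are illustrative assumptions, not a fixed schema:

```python
# Minimal faker-based generator; fields and the age distribution are assumptions.
import random
from faker import Faker

fake = Faker()
Faker.seed(2025)   # reproducible fake values
random.seed(2025)

def generate_users(n: int) -> list[dict]:
    """Generate n synthetic user records."""
    return [
        {
            "name": fake.name(),
            "email": fake.email(),
            "city": fake.city(),
            # Probabilistic component: ages drawn from a rough adult distribution
            "age": max(18, int(random.gauss(mu=40, sigma=12))),
        }
        for _ in range(n)
    ]

print(generate_users(3))
```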
Notes:

Translating a dataset from one language to another can be a good option.
Expand or extend an existing dataset by generating synthetic samples based on the original data distribution.
For example, IoT sensor datasets can be extended by (see the sketch after this list):

Generating more time series data points following observed patterns
Adding noise and anomalies for robustness testing
Simulating different environmental conditions
Creating multi-sensor correlation scenarios
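A hedged sketch of the first two extension ideas, assuming a simple daily temperature cycle as the observed pattern; the noise level and anomaly rate are illustrative:

```python
# Extend an IoT temperature series with noise and injected anomalies.
# The daily sine pattern, noise level, and anomaly rate are assumptions.
import numpy as np
import pandas as pd

rng = np.random.default_rng(2025)

def extend_sensor_series(hours: int = 24 * 7, anomaly_rate: float = 0.01) -> pd.DataFrame:
    timestamps = pd.date_range("2025-01-01", periods=hours, freq="h")
    base = 22 + 3 * np.sin(2 * np.pi * np.arange(hours) / 24)  # daily cycle around 22 °C
    values = base + rng.normal(0, 0.3, hours)                  # sensor noise
    anomalies = rng.random(hours) < anomaly_rate               # spike anomalies
    values[anomalies] += rng.choice([-8.0, 8.0], size=int(anomalies.sum()))
    return pd.DataFrame({"timestamp": timestamps,
                         "temperature": values.round(2),
                         "is_anomaly": anomalies})

print(extend_sensor_series().head())
```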
Example Synthetic Data Generation Project
Manually define the features and structure of the synthetic datasets to be generated, and save them to FEATURES.md.

Example Prompt
Help me create a Python script that generates synthetic data for stock prices.

Refer to FEATURES.md for the fields to include: date, open, high, low, close, adjusted close, and volume.
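One possible script the prompt might produce is sketched below; the geometric-random-walk drift and volatility are assumptions, not calibrated to any real market:

```python
# Synthetic OHLCV stock prices via a geometric random walk.
# Drift, volatility, and intraday ranges are illustrative assumptions.
import numpy as np
import pandas as pd

def generate_stock_prices(days: int = 250, start_price: float = 0.55,
                          seed: int = 2025) -> pd.DataFrame:
    rng = np.random.default_rng(seed)
    dates = pd.bdate_range("2015-03-31", periods=days)  # business days only
    returns = rng.normal(loc=0.0005, scale=0.02, size=days)
    close = start_price * np.exp(np.cumsum(returns))
    open_ = np.roll(close, 1) * (1 + rng.normal(0, 0.005, days))
    open_[0] = start_price
    high = np.maximum(open_, close) * (1 + rng.uniform(0, 0.01, days))
    low = np.minimum(open_, close) * (1 - rng.uniform(0, 0.01, days))
    volume = rng.integers(1_000_000, 6_000_000, size=days)
    return pd.DataFrame({"Date": dates, "Open": open_.round(3),
                         "High": high.round(3), "Low": low.round(3),
                         "Close": close.round(3), "Adj Close": close.round(3),
                         "Volume": volume})

print(generate_stock_prices().head())
```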

Folder Structure
The folder structure is as follows:

synthetic-data/
├── datasets/
│   ├── small/
│   ├── medium/
│   ├── large/
│   ├── README.md
│   ├── FEATURES.md
│   └── .gitignore
├── scripts/
│   ├── generate_datasets.py
│   ├── compress_datasets.py
│   ├── README.md
│   └── .gitignore
├── requirements.txt
├── README.md
├── Makefile
└── .gitignore
datasets/: where the generated synthetic data is stored, split into small/, medium/, and large/ subfolders by dataset size
scripts/: contains the Python scripts that generate and compress the synthetic data
README.md: how to install and run the scripts using uv with a virtual environment named .venv
FEATURES.md: documents the features of the synthetic data generation process
requirements.txt: list of Python dependencies
Makefile: convenience targets for common workflows
README.md Structure
Overview & Purpose
Prerequisites (Python 3.9+, required libraries)
Installation Instructions (using uv with .venv)
Quick Start Guide
Basic usage examples
Common workflows
Configuration Options
Conclusion
Objectives of the Synthetic Datasets
Provide ready-to-use datasets to demonstrate ML workflows
Cover supervised, unsupervised, and semi-supervised learning (and suggest more options if any)
Support task types:
classification,
regression,
clustering
time-series forecasting: in this project, we focus on generating stock price data
anomaly detection
recommendation systems
graph analysis
sentiment analysis
Note: Implement all of these tasks if possible; they will be needed for comprehensive ML demonstrations. A few scikit-learn generators are sketched below.
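Hedged sketches of scikit-learn generators for several of the task types above; the parameter values are illustrative assumptions:

```python
# scikit-learn provides ready-made generators for several task types.
from sklearn.datasets import make_blobs, make_classification, make_regression

# Classification: 3 classes, 10 features
X_cls, y_cls = make_classification(n_samples=5_000, n_features=10,
                                   n_informative=5, n_classes=3,
                                   random_state=2025)
# Regression: continuous target with mild noise
X_reg, y_reg = make_regression(n_samples=5_000, n_features=10,
                               noise=0.1, random_state=2025)
# Clustering: 4 well-separated blobs
X_clu, y_clu = make_blobs(n_samples=5_000, centers=4, random_state=2025)
```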

Support a range of dataset sizes and feature counts
--size Options (Number of Samples)
small: 1,000 – 10,000
medium: 10,000 – 100,000
large: 100,000 – 1,000,000
extra large: 1,000,000 – 10,000,000
Feature targets
Features: 5 – 50: Number of features/columns in the dataset to generate. Do not set if FEATURES.md is present
Classes: 2 – 10 (classification)
Clusters: 2 – 10 (clustering)
Dataset format
CSV: Comma-separated values for easy import into various tools
Optionally, support for compressed formats like .csv.gz for large datasets
Encoding: UTF-8 to ensure compatibility
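For instance, pandas can write both formats directly; the file names below are placeholders following the folder structure shown earlier:

```python
# Write a dataset as CSV and as gzip-compressed CSV (UTF-8 in both cases).
import pandas as pd

df = pd.DataFrame({"Date": ["2015-03-31"], "Close": [0.565], "Volume": [4816294]})
df.to_csv("datasets/small/stock.csv", index=False, encoding="utf-8")
df.to_csv("datasets/large/stock.csv.gz", index=False, encoding="utf-8",
          compression="gzip")  # explicit, though pandas infers it from .csv.gz
```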
Example Run Command
python scripts/generate_datasets.py \
  --task classification \
  --size small \
  --num-samples 5000 \
  --num-classes 3 \
  --random-state 2025

scripts/generate_datasets.py Parameters
-t, --task: Type of machine learning task (classification, regression, clustering, etc.)
-s, --size: Size of the dataset to generate (small, medium, large, extra large)
-n, --num-samples: Number of samples/rows in the dataset
-f, --num-features: Number of features/columns in the dataset (if not using FEATURES.md)
-c, --num-classes: Number of classes (for classification tasks)
-k, --num-clusters: Number of clusters (for clustering tasks)
--random-state: Seed for random number generation to ensure reproducibility
--output-format: Format of the output dataset (CSV, CSV.GZ). Default is CSV
--output-dir: Directory to save the generated datasets. Default is datasets/
Note: When using --size, the script automatically determines the number of samples within the specified range. Use --num-samples to override with an exact number.
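One way the script might wire up these flags with argparse; this is a sketch of the interface only, not the actual implementation:

```python
# Sketch of the CLI described above (interface only; generation logic omitted).
import argparse

SIZE_RANGES = {"small": (1_000, 10_000), "medium": (10_000, 100_000),
               "large": (100_000, 1_000_000), "extra-large": (1_000_000, 10_000_000)}

parser = argparse.ArgumentParser(description="Generate synthetic datasets")
parser.add_argument("-t", "--task", required=True,
                    choices=["classification", "regression", "clustering",
                             "time-series", "anomaly-detection"])
parser.add_argument("-s", "--size", choices=list(SIZE_RANGES), default="small")
parser.add_argument("-n", "--num-samples", type=int,
                    help="exact sample count; overrides --size")
parser.add_argument("-f", "--num-features", type=int, default=10,
                    help="ignored when FEATURES.md is present")
parser.add_argument("-c", "--num-classes", type=int, default=2)
parser.add_argument("-k", "--num-clusters", type=int, default=3)
parser.add_argument("--random-state", type=int, default=2025)
parser.add_argument("--output-format", choices=["csv", "csv.gz"], default="csv")
parser.add_argument("--output-dir", default="datasets/")
args = parser.parse_args()
print(args)
```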

Makefile targets
make help # List available targets
make create-all # Generate representative datasets across tasks and sizes
make compress-all # Compress all CSV datasets (creates .csv.gz)
make clean # Delete all CSV files in datasets/
make clean-gzip # Delete all .csv.gz files in datasets/
make test # Run the test suite
make sample # Generate small sample datasets for testing
make visualize # Create visualizations of dataset distributions
Required libraries
faker: For generating fake data such as names, addresses, emails, etc.
pandas: For data manipulation and analysis
scikit-learn: For generating datasets with specific characteristics
tensorflow or pytorch: For advanced synthetic data generation using neural networks
You can decide what libraries to use based on your specific needs and the complexity of the synthetic data you want to generate.

FEATURES.md Example
Features of Synthetic Stock Price Data Generation

Synthetic Data Features
The synthetic data features include:

| Feature Name | Description | Data Type | Example Values |
| --- | --- | --- | --- |
| Date | Date of the stock price record | Date | 2015-03-31 |
| Open | Opening price of the stock | Float | 0.555 |
| High | Highest price of the stock | Float | 0.595 |
| Low | Lowest price of the stock | Float | 0.53 |
| Close | Closing price of the stock | Float | 0.565 |
| Adj Close | Adjusted closing price of the stock | Float | 0.565 |
| Volume | Trading volume of the stock | Integer | 4816294 |
Sample data
Date,Open,High,Low,Close,Adj Close,Volume
2015-03-31,0.555,0.595,0.53,0.565,0.565,4816294
2015-04-01,0.575,0.58,0.555,0.565,0.565,4376660
2015-04-02,0.56,0.565,0.535,0.555,0.555,2779640
Final Notes
R is a strong alternative to Python for synthetic data generation.
Combining synthetic data generation with LLMs to create more context-aware data is a promising direction; tailor your prompts accordingly.
faker can be replaced by LLM API calls to generate more realistic and diverse synthetic data samples, as sketched below.
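A sketch of that last idea, assuming the openai package is installed and OPENAI_API_KEY is set in the environment; the model name is only an example:

```python
# Replace faker with an LLM call for more context-aware samples.
# Assumes the openai package and an OPENAI_API_KEY environment variable.
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o-mini",  # example model; substitute your own
    messages=[{
        "role": "user",
        "content": "Generate 3 realistic but fictional customer records as "
                   "CSV with the columns name,email,city. Output only CSV.",
    }],
)
print(response.choices[0].message.content)
```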
