Oliver Samuel

Posted on Nov 10

Synthetic Data Generator

#privacy #testing #tooling

By Oliver | November 7, 2025

The Problem: Why We Need Fake Data That Feels Real

Imagine you're building a new mobile app for a bank. Before launching it to real customers, you need to test it thoroughly. But here's the catch: you can't use real customer data for testing - that would be a privacy nightmare and potentially illegal. You also can't just make up random numbers and names because your app needs to handle realistic scenarios.

This is where synthetic data comes in. It's like having a movie set instead of a real location. Everything looks authentic, but it's all carefully constructed and completely safe to use.

That's exactly what I built: DataGen - a Python library that creates realistic synthetic datasets at the click of a button.

What is DataGen?

Think of DataGen as a digital factory for fake-but-realistic data. Just as a toy factory can turn out thousands of toys, DataGen can generate thousands of realistic user profiles, salary records, regional records, and vehicle listings, all completely synthetic but statistically plausible.

Here's another analogy: If you've ever used a flight simulator to practice flying without risking a real plane, DataGen does the same thing for data. It gives you realistic practice data without any privacy concerns or legal complications.

The Four Data Generators: My Digital Assembly Lines

DataGen consists of four specialized "assembly lines", each producing a different type of data:

1. Profile Generator: Creating Digital People

The Profile Generator creates realistic user profiles - complete with names, emails, addresses, and even geographic coordinates.
It's like having a character generator for a video game, but instead of fantasy characters, you get realistic Kenyan citizens.

What it generates:

Full names (first and last)
Email addresses and usernames
Phone numbers
Complete addresses (street, city, postal code)
Age and date of birth
Gender identity
Geographic coordinates (latitude and longitude)

Real-world use case: A fintech startup testing their loan application system can generate 10,000 realistic customer profiles in seconds, ensuring their system handles Kenyan names, addresses, and phone formats correctly.
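To make the idea of internally consistent profiles concrete, here is a minimal sketch that uses only Python's standard library. It is not DataGen's actual implementation: `generate_profile`, `age_on`, the tiny name lists, and the rough bounding box for Kenya are all assumptions made for illustration. The point it shows is that the age is derived from the date of birth rather than drawn separately, so the two fields can never disagree.

```python
import random
from datetime import date, timedelta

# Rough bounding box for Kenya (approximate values, for illustration only).
KENYA_LAT = (-4.7, 5.0)
KENYA_LON = (33.9, 41.9)

# Tiny sample lists; a real generator would draw from far larger ones.
FIRST_NAMES = ["Sharon", "Kennedy", "Wanjiru", "Otieno"]
LAST_NAMES = ["Mohamed", "Atieno", "Kamau", "Mutua"]


def age_on(day, date_of_birth):
    """Return the age in whole years on the given day."""
    had_birthday = (day.month, day.day) >= (date_of_birth.month, date_of_birth.day)
    return day.year - date_of_birth.year - (0 if had_birthday else 1)


def generate_profile(rng, today=date(2025, 11, 7)):
    """Build one profile whose age, birth date, and email agree with each other."""
    first, last = rng.choice(FIRST_NAMES), rng.choice(LAST_NAMES)
    # Birth years are limited so that every age falls between 18 and 80.
    oldest = date(today.year - 80, 1, 1)
    youngest = date(today.year - 19, 12, 31)
    offset = rng.randint(0, (youngest - oldest).days)
    date_of_birth = oldest + timedelta(days=offset)
    return {
        "name": f"{first} {last}",
        "email": f"{first}.{last}@example.com".lower(),
        "date_of_birth": date_of_birth.isoformat(),
        "age": age_on(today, date_of_birth),
        "latitude": round(rng.uniform(*KENYA_LAT), 6),
        "longitude": round(rng.uniform(*KENYA_LON), 6),
    }


rng = random.Random(42)
profiles = [generate_profile(rng) for _ in range(3)]
for profile in profiles:
    print(profile)
```

A uniform draw inside a bounding box can land outside the actual border, so a production generator would more likely sample around known city coordinates.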

Profile Generation Output - Show a table of generated profiles with names, emails, cities, and ages

2. Salary Generator: Modeling Compensation Data

The Salary Generator creates realistic employment and compensation records across different industries and experience levels. Think of it as a salary survey simulator that understands how compensation works in the real world.

What it generates:

Job titles across 8 departments (Engineering, Product, Data, Marketing, Sales, Operations, Finance, HR)
Experience levels (from Junior to C-Level executives)
Base salary, bonuses, and total compensation
Years of experience aligned with job level
Currency support (Kenyan Shillings and US Dollars)

The intelligence behind it: The generator knows that a Senior Software Engineer should earn more than a Junior one, and that C-Level executives typically have 20+ years of experience. It's not just random numbers - it's statistically realistic.
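One way to get this kind of level-aware behaviour is to tie both experience and pay to a per-level band. The sketch below is a simplified illustration, not DataGen's real logic: `LEVEL_BANDS`, `generate_salary_record`, and every number in the table are hypothetical values chosen only to show the pattern.

```python
import random

# Hypothetical bands: (min_years, max_years, min_base_kes, max_base_kes).
# These figures are illustrative assumptions, not DataGen's real tables.
LEVEL_BANDS = {
    "Junior": (0, 2, 60_000, 120_000),
    "Mid": (2, 5, 120_000, 250_000),
    "Senior": (5, 10, 250_000, 450_000),
    "C-Level": (20, 35, 900_000, 2_000_000),
}


def generate_salary_record(level, rng):
    """Draw one record whose experience and pay both depend on the level."""
    min_years, max_years, min_base, max_base = LEVEL_BANDS[level]
    years = rng.randint(min_years, max_years)
    base = rng.randint(min_base, max_base)
    bonus = round(base * rng.uniform(0.05, 0.20))  # bonus as 5-20% of base
    return {
        "level": level,
        "years_experience": years,
        "base_salary": base,
        "bonus": bonus,
        "total_compensation": base + bonus,
    }


rng = random.Random(42)
records = [generate_salary_record(level, rng) for level in LEVEL_BANDS]
for record in records:
    print(record)
```

Because each value is drawn from within its level's band, a Senior record in this sketch can never earn less than a Junior one, and a C-Level record always has at least 20 years of experience.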

Real-world use case: An HR analytics platform can test their salary benchmarking features with realistic compensation data across different industries and experience levels.

Salary Analysis - Show salary distribution by department or level with statistics

3. Region Generator: Mapping the World

The Region Generator creates global organizational data - perfect for companies with international operations. It's like having a world atlas combined with an organizational chart.

What it generates:

Six major global regions (North America, South America, Europe, Middle East, Africa, Asia Pacific)
Countries within each region
Time zones
Regional headquarters locations
Regional managers with contact information

Real-world use case: A multinational company testing their global CRM system can simulate operations across all continents with realistic regional structures.

Region Data Table - Show all regions with their headquarters and country counts

4. Car Generator: Building a Virtual Showroom

The Car Generator creates vehicle inventory data focused on the Kenyan automotive market. It's like having a digital car dealership that understands local market preferences.

What it generates:

Popular makes and models in Kenya (Toyota, Nissan, Mazda, etc.)
Manufacturing years (2008-2025)
Colors, transmission types, and fuel types
Realistic pricing in Kenyan Shillings
Dealer locations across major Kenyan cities
Age-based depreciation modeling

The smart part: The generator knows that a 2025 Toyota Corolla should cost more than a 2010 model, and it applies realistic depreciation curves.
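As a rough illustration of age-based depreciation, the sketch below applies a declining-balance curve with a price floor. This is an assumed model, not necessarily the one DataGen uses: the function name `depreciated_price`, the 12% annual rate, the 15% floor, and the showroom price are all hypothetical.

```python
def depreciated_price(new_price_kes, model_year, current_year=2025,
                      annual_rate=0.12, floor_fraction=0.15):
    """Estimate a used-car price with declining-balance depreciation.

    annual_rate and floor_fraction are illustrative assumptions.
    """
    age = max(current_year - model_year, 0)
    value = new_price_kes * (1 - annual_rate) ** age
    floor = new_price_kes * floor_fraction  # the price never drops below this
    return round(max(value, floor))


NEW_PRICE = 3_500_000  # hypothetical showroom price in KES

for year in (2025, 2018, 2010):
    print(year, depreciated_price(NEW_PRICE, year))
```

Each extra year of age multiplies the value by the same factor, so the price falls quickly at first and then flattens out, and the floor keeps very old cars from being priced near zero.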

Real-world use case: An automotive marketplace app can test its search, filtering, and pricing features with thousands of realistic vehicle listings.

Car Inventory - Show a sample of generated cars with makes, models, years, and prices

The Magic Ingredient: Reproducibility

Here's something crucial that makes DataGen special: reproducibility.

Imagine you're baking cookies. If you follow the exact same recipe with the exact same measurements, you'll get identical cookies every time. DataGen works the same way through something called a "seed".

When you set a seed (let's say, seed=42), DataGen will generate the exact same data every single time. This is incredibly important for:

- Testing: Developers can reproduce bugs by using the same seed
- Collaboration: Team members can work with identical datasets
- Validation: You can verify that your system produces consistent results

Analogy: Think of the seed as a recipe number. Recipe #42 always makes chocolate chip cookies, Recipe #106 always makes oatmeal cookies. The same recipe number = same cookies, every time.
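The seed idea can be demonstrated with nothing but Python's standard `random` module. This is a minimal sketch of the general technique, not DataGen's internals; `make_ages` is a hypothetical helper introduced only for this example.

```python
import random


def make_ages(n, seed=None):
    """Return n pseudo-random ages between 18 and 65 (hypothetical helper)."""
    rng = random.Random(seed)  # a private generator, so global state is untouched
    return [rng.randint(18, 65) for _ in range(n)]


first_run = make_ages(5, seed=42)
second_run = make_ages(5, seed=42)
other_seed = make_ages(5, seed=106)

print(first_run == second_run)  # same seed, so the two lists match
print(first_run, other_seed)    # compare the output of two different seeds
```

Creating a dedicated `random.Random(seed)` instance, rather than calling `random.seed()` globally, keeps the generator's output stable even when other code in the same program also uses random numbers.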

From Code to Package: The Publishing Journey

Creating the generators was just the first step. To make DataGen useful to the world, I had to package it and publish it to PyPI (the Python Package Index) - think of it as the App Store for Python libraries.

Now anyone in the world can install DataGen with a single pip command. The steps below also set up a clean virtual environment first and finish with a quick test run:

# Go to a new folder (like /Applications)
cd Applications

# Create a brand new, clean environment
python3 -m venv datagen_venv

# Activate the virtual environment
source datagen_venv/bin/activate

# Run the standard install command
pip install sami-datagen

# A sample try
python -c "from datagen import generate_profiles
profiles = generate_profiles(n=10, seed=42)
print(profiles)"

It's like making your homemade recipe available in every grocery store worldwide.

Real-world Impact: Who Benefits?

1. Software Developers

Testing applications without risking real user data. It's like having crash test dummies instead of real people for car safety tests.

2. Data Scientists

Training machine learning models on synthetic data before deploying to production. Think of it as practicing surgery on cadavers before operating on real patients.

3. Business Analysts

Creating demo dashboards and presentations without exposing sensitive company data. Like using a model home to show buyers what their houses could look like.

4. Students and Educators

Learning data analysis and database design with realistic datasets. It's like using a flight simulator in pilot training - safe, repeatable, and realistic.

5. Startups

Building and demonstrating MVPs (Minimum Viable Products) without collecting real user data. Like creating a movie trailer before filming the entire movie.

The Technical Foundation

For those curious about how it works under the hood:

DataGen uses:

- Faker library: Generates realistic names, addresses, and contact information
- Pandas: Organizes data into structured tables (like Excel spreadsheets)
- Statistical modeling: Ensures salary ranges, age distributions, and pricing follow realistic patterns
- Localization: Understands Kenyan naming conventions, cities, and market preferences

Analogy: If DataGen were a restaurant, Faker would be the ingredient supplier, Pandas would be the kitchen organization system, and statistical modeling would be the chef's knowledge of how flavors work together.

Practical Examples: See It In Action

Example 1: Generate 100 user profiles

from datagen import generate_profiles

profiles = generate_profiles(n=100, seed=42)
print(profiles.head())

Output: A table with 100 realistic Kenyan user profiles, complete with names like “Sharon Mohamed” from Nairobi, “Kennedy Atieno” from Mombasa, each with unique emails, addresses, and coordinates.

Code Example Output - Show the actual output from running this code

Example 2: Analyze Salary Distribution

from datagen import generate_salaries

salaries = generate_salaries(n=1000)
avg_by_dept = salaries.groupby('department')['total_compensation'].mean()
print(avg_by_dept)

Output: Average compensation by department, showing that Engineering and Data departments typically have higher compensation than Operations or HR.

Salary Analysis Results - Show the grouped statistics

Beyond the Code: Docker Support

For those who want to use DataGen without installing anything on their computer, I included Docker support.

What's Docker? Think of it as a portable computer inside your computer. It's like having a fully equipped kitchen (with all the tools and ingredients) that you can set up anywhere in seconds.

With Docker you can:

Download the DataGen container
Start it with one command
Generate data immediately - no installation, no configuration

Docker Setup - Show the docker-compose command and container running

The Documentation Journey

Creating the library was only three-quarters of the battle. Making it usable required comprehensive documentation:

README.md: A guide covering installation, usage, and examples
Example Scripts: Five Python scripts demonstrating each generator
Inline Documentation: Every function has detailed explanations
API Reference: Complete parameter descriptions and return types

Analogy: It's like buying furniture from IKEA - the product is great, but without clear instructions (with pictures), it's just a pile of wood and screws.

Challenges and Solutions

Challenge 1: Making Data Feel "Real"

Solution: Instead of purely random generation, I implemented statistical models. For example, Senior Engineers have 5-10 years of experience, not 2 years or 30 years.

Challenge 2: Kenyan Localization

Solution: I researched and included actual Kenyan cities, realistic coordinate boundaries, and local naming patterns. The data doesn't just look real - it looks Kenyan real.

Challenge 3: Reproducibility

Solution: Implemented seed-based generation, ensuring that seed=42 always produces identical results, making debugging and testing possible.

The Results: By The Numbers

4 specialized generators covering different data types
60+ job titles across 8 departments
10 experience levels from Junior to C-Level
6 global regions covering 36 countries
10 popular car makes with realistic pricing
100% reproducibility with seed control
Published on PyPI - accessible worldwide
Docker support for zero-installation usage

Complete Demo Output - Show the final output from running complete_demo.py with all statistics

What's Next?

DataGen is just the beginning. Future enhancements could include:

- More data types: Transaction records, event logs, social media posts
- Relationship modeling: Connecting profiles to their salaries and purchases
- Time-series data: Stock prices, sensor readings, website traffic
- Custom templates: Industry-specific data patterns
- Web interface: Generate data without writing code

Try it Yourself

Want to explore DataGen? Here's how:

For technical users:

pip install sami-datagen

For everyone else: Visit the GitHub repository, where you'll find:

Complete installation instructions
Step-by-step tutorials

Conclusion

Building DataGen taught me that great tools aren't just about functionality - they're about accessibility. The best technology is technology that anyone can use, understand, and benefit from.

Whether you're a developer testing an app, a student learning data science, or a business professional creating a demo, DataGen provides the realistic data you need, when you need it, without compromise.

The code is open source, the documentation is comprehensive, and the possibilities are endless.

About the Author

Oliver is a data engineer who's passionate about building tools that make technology more accessible. This project was completed as part of the LuxDevHQ Data Engineering Internship program.

Connect:

GitHub: @25thOliver
LinkedIn: Samwel Oliver
