By Oliver | November 7, 2025
The Problem: Why We Need Fake Data That Feels Real
Imagine you're building a new mobile app for a bank. Before launching it to real customers, you need to test it thoroughly. But here's the catch: you can't use real customer data for testing - that would be a privacy nightmare and potentially illegal. You also can't just make up random numbers and names because your app needs to handle realistic scenarios.
This is where synthetic data comes in. It's like having a movie set instead of a real location. Everything looks authentic, but it's all carefully constructed and completely safe to use.
That's exactly what I built: DataGen - a Python library that creates realistic synthetic datasets at the click of a button.
What is DataGen?
Think of DataGen as a digital factory for fake-but-realistic data. Just as a toy factory can turn out thousands of toys, DataGen can generate thousands of realistic user profiles, salary records, regional records, and vehicle listings, all completely synthetic but statistically plausible.
Here's another analogy: If you've ever used a flight simulator to practice flying without risking a real plane, DataGen does the same thing for data. It gives you realistic practice data without any privacy concerns or legal complications.
The Four Data Generators: My Digital Assembly Lines
DataGen consists of four specialized "assembly lines", each producing a different type of data:
1. Profile Generator: Creating Digital People
The Profile Generator creates realistic user profiles - complete with names, emails, addresses, and even geographic coordinates.
It's like having a character generator for a video game, but instead of fantasy characters, you get realistic Kenyan citizens.
What it generates:
- Full names (first and last)
- Email addresses and usernames
- Phone numbers
- Complete addresses (street, city, postal code)
- Age and date of birth
- Gender identity
- Geographic coordinates (latitude and longitude)
Real-world use case: A fintech startup testing their loan application system can generate 10,000 realistic customer profiles in seconds, ensuring their system handles Kenyan names, addresses, and phone formats correctly.

[Figure: Profile generation output - a table of generated profiles with names, emails, cities, and ages]
2. Salary Generator: Modeling Compensation Data
The Salary Generator creates realistic employment and compensation records across different industries and experience levels. Think of it as a salary survey simulator that understands how compensation works in the real world.
What it generates:
- Job titles across 8 departments (Engineering, Product, Data, Marketing, Sales, Operations, Finance, HR)
- Experience levels (from Junior to C-Level executives)
- Base salary, bonuses, and total compensation
- Years of experience aligned with job level
- Currency support (Kenyan Shillings and US Dollars)
The intelligence behind it: The generator knows that a Senior Software Engineer should earn more than a Junior one, and that C-Level executives typically have 20+ years of experience. It's not just random numbers - the values are drawn from ranges that depend on department and level.
Real-world use case: An HR analytics platform can test their salary benchmarking features with realistic compensation data across different industries and experience levels.
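To make the level-aware idea concrete, here is a minimal sketch using only Python's standard library. It is not DataGen's actual implementation: the function name sketch_salary_record, the LEVELS table, and every salary figure in it are assumptions made up for illustration. Only the experience bands for Senior (5-10 years) and C-Level (20+ years) come from this article.

```python
import random

# Hypothetical level table: (min_years, max_years, min_base, max_base).
# Salary figures are illustrative placeholders in KES, not DataGen's real ranges.
LEVELS = {
    "Junior": (0, 2, 60_000, 120_000),
    "Mid": (2, 5, 120_000, 250_000),
    "Senior": (5, 10, 250_000, 450_000),
    "C-Level": (20, 35, 900_000, 2_000_000),
}


def sketch_salary_record(level, rng):
    """Draw one record whose experience and pay both depend on the level."""
    min_years, max_years, min_base, max_base = LEVELS[level]
    base = rng.randint(min_base, max_base)
    # Assumed bonus rule: between 5% and 20% of base salary.
    bonus = round(base * rng.uniform(0.05, 0.20))
    return {
        "level": level,
        "years_experience": rng.randint(min_years, max_years),
        "base_salary": base,
        "bonus": bonus,
        "total_compensation": base + bonus,
    }


rng = random.Random(42)  # seeded so the sketch prints the same rows each run
records = [sketch_salary_record(level, rng) for level in LEVELS for _ in range(3)]
for record in records:
    print(record)
```

Because the salary bands in this sketch do not overlap, a Senior record always out-earns a Junior one on base salary. A real generator would likely use overlapping, department-specific distributions instead of hard bands.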

[Figure: Salary analysis - salary distribution by department or level, with statistics]
3. Region Generator: Mapping the World
The Region Generator creates global organizational data - perfect for companies with international operations. It's like having a world atlas combined with an organizational chart.
What it generates:
- Six major global regions (North America, South America, Europe, Middle East, Africa, Asia Pacific)
- Countries within each region
- Time zones
- Regional headquarters locations
- Regional managers with contact information
Real-world use case: A multinational company testing their global CRM system can simulate operations across all continents with realistic regional structures.

[Figure: Region data table - all regions with their headquarters and country counts]
4. Car Generator: Building a Virtual Showroom
The Car Generator creates vehicle inventory data focused on the Kenyan automotive market. It's like having a digital car dealership that understands local market preferences.
What it generates:
- Popular makes and models in Kenya (Toyota, Nissan, Mazda, etc.)
- Manufacturing years (2008-2025)
- Colors, transmission types, and fuel types
- Realistic pricing in Kenyan Shillings
- Dealer locations across major Kenyan cities
- Age-based depreciation modeling
The smart part: The generator knows that a 2025 Toyota Corolla should cost more than a 2010 model, and it applies realistic depreciation curves.
Real-world use case: An automotive marketplace app can test its search, filtering, and pricing features with thousands of realistic vehicle listings.
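The age-based depreciation mentioned above can be sketched in a few lines. This is one common way to model it (a fixed percentage lost per year, with a floor), not necessarily the curve DataGen uses. The function name, the 10% rate, the 15% floor, and the example price are all assumptions for illustration.

```python
CURRENT_YEAR = 2025
ANNUAL_DEPRECIATION = 0.10  # assumed: 10% of remaining value lost per year
RESIDUAL_FLOOR = 0.15       # assumed: a car never drops below 15% of its new price


def sketch_depreciated_price(new_price_kes, model_year, current_year=CURRENT_YEAR):
    """Apply compound yearly depreciation, bounded below by a residual floor."""
    age = max(current_year - model_year, 0)
    factor = max((1 - ANNUAL_DEPRECIATION) ** age, RESIDUAL_FLOOR)
    return round(new_price_kes * factor)


new_price = 3_500_000  # hypothetical new-car price in KES
for year in (2025, 2020, 2015, 2010):
    print(year, sketch_depreciated_price(new_price, year))
```

With these assumed parameters, a 2025 model keeps its full price and each older model year is cheaper than the one after it, which is the ordering the generator is described as respecting.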

[Figure: Car inventory - a sample of generated cars with makes, models, years, and prices]
The Magic Ingredient: Reproducibility
Here's something crucial that makes DataGen special: reproducibility.
Imagine you're baking cookies. If you follow the exact same recipe with the exact same measurements, you'll get identical cookies every time. DataGen works the same way through something called a "seed".
When you set a seed (let's say, seed=42), DataGen will generate the exact same data every single time. This is incredibly important for:
- Testing: Developers can reproduce bugs by using the same seed
- Collaboration: Team members can work with identical datasets
- Validation: You can verify that your system produces consistent results
Analogy: Think of the seed as a recipe number. Recipe #42 always makes chocolate chip cookies, Recipe #106 always makes oatmeal cookies. The same recipe number = same cookies, every time.
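For readers who want to see the mechanism, here is a minimal standard-library sketch of seed-based generation. It shows the general technique (a private random number generator initialized from the seed), and it is an assumption that DataGen works along similar lines. The function sketch_profiles and the short name and city lists are hypothetical stand-ins.

```python
import random

# Tiny placeholder pools; a real generator would draw from much larger ones.
FIRST_NAMES = ["Sharon", "Kennedy", "Amina", "Brian"]
CITIES = ["Nairobi", "Mombasa", "Kisumu", "Nakuru"]


def sketch_profiles(n, seed=None):
    """Return n toy profiles; the same seed always yields the same list."""
    rng = random.Random(seed)  # private generator, leaves global random state alone
    return [
        {
            "first_name": rng.choice(FIRST_NAMES),
            "city": rng.choice(CITIES),
            "age": rng.randint(18, 65),
        }
        for _ in range(n)
    ]


run_a = sketch_profiles(5, seed=42)
run_b = sketch_profiles(5, seed=42)
run_c = sketch_profiles(5, seed=106)

print(run_a == run_b)  # same seed, same data
print(run_a == run_c)  # a different seed gives a different sequence of draws
```

Using a private random.Random instance rather than the module-level functions means other code that also uses random cannot disturb the sequence, which is what makes the result repeatable across runs and machines.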
From Code to Package: The Publishing Journey
Creating the generators was just the first step. To make DataGen useful to the world, I had to package it and publish it to PyPI (the Python Package Index) - think of it as the App Store for Python libraries.
Now anyone in the world can install DataGen with a single pip command. The steps below wrap that command in a clean virtual environment and finish with a quick test:
# Go to the folder where you want the environment to live (for example, Applications)
cd Applications
# Create a brand new, clean environment
python3 -m venv datagen_venv
# Activate the virtual environment
source datagen_venv/bin/activate
# Run the standard install command
pip install sami-datagen
# Quick test: generate 10 profiles with a fixed seed
python -c "from datagen import generate_profiles
profiles = generate_profiles(n=10, seed=42)
print(profiles)"
It's like making your homemade recipe available in every grocery store worldwide.
Real-world Impact: Who Benefits?
1. Software Developers
Testing applications without risking real user data. It's like having crash test dummies instead of real people for car safety tests.
2. Data Scientists
Training machine learning models on synthetic data before deploying to production. Think of it as practicing surgery on cadavers before operating on real patients.
3. Business Analysts
Creating demo dashboards and presentations without exposing sensitive company data. Like using a model home to show buyers what their houses could look like.
4. Students and Educators
Learning data analysis and database design with realistic datasets. It's like using a flight simulator in pilot training - safe, repeatable, and realistic.
5. Startups
Building and demonstrating MVPs (Minimum Viable Products) without collecting real user data. Like creating a movie trailer before filming the entire movie.
The Technical Foundation
For those curious about how it works under the hood:
DataGen uses:
- Faker library: Generates realistic names, addresses, and contact information
- Pandas: Organizes data into structured tables (like Excel spreadsheets)
- Statistical modeling: Ensures salary ranges, age distributions, and pricing follow realistic patterns
- Localization: Understands Kenyan naming conventions, cities, and market preferences
Analogy: If DataGen were a restaurant, Faker would be the ingredient supplier, Pandas would be the kitchen organization system, and statistical modeling would be the chef's knowledge of how flavors work together.
Practical Examples: See It In Action
Example 1: Generate 100 user profiles
from datagen import generate_profiles
profiles = generate_profiles(n=100, seed=42)
print(profiles.head())
Output: profiles is a table of 100 realistic Kenyan user profiles, and head() prints its first five rows. Expect names like “Sharon Mohamed” from Nairobi and “Kennedy Atieno” from Mombasa, each with unique emails, addresses, and coordinates.

[Figure: Code example output - the actual output from running this code]
Example 2: Analyze Salary Distribution
from datagen import generate_salaries
salaries = generate_salaries(n=1000)
avg_by_dept = salaries.groupby('department')['total_compensation'].mean()
print(avg_by_dept)
Output: Average compensation by department, showing that Engineering and Data departments typically have higher compensation than Operations or HR.

[Figure: Salary analysis results - the grouped statistics]
Beyond the Code: Docker Support
For those who want to use DataGen without installing anything on their computer, I included Docker support.
What's Docker? Think of it as a portable computer inside your computer. It's like having a fully equipped kitchen (with all tools and ingredients) that you can set up anywhere in seconds.
With Docker you can:
- Download the DataGen container
- Start it with one command
- Generate data immediately - no installation, no configuration

[Figure: Docker setup - the docker-compose command and the running container]
The Documentation Journey
Creating the library was only three-quarters of the battle. Making it usable required comprehensive documentation:
- README.md: A guide covering installation, usage, and examples
- Example Scripts: Five Python scripts demonstrating each generator
- Inline Documentation: Every function has detailed explanations
- API Reference: Complete parameter descriptions and return types
Analogy: It's like buying furniture from IKEA - the product is great, but without clear instructions (with pictures), it's just a pile of wood and screws.
Challenges and Solutions
Challenge 1: Making Data Feel "Real"
Solution: Instead of purely random generation, I implemented statistical models. For example, Senior Engineers have 5-10 years of experience, not 2 years or 30 years.
Challenge 2: Kenyan Localization
Solution: Researched and included actual Kenyan cities, realistic coordinate boundaries, and local naming patterns. The data doesn't just look real - it looks Kenyan real.
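As an illustration of what "realistic coordinate boundaries" can mean, here is a minimal sketch that samples points inside a rough bounding box around Kenya. The box values are rounded approximations and the function name is hypothetical. DataGen's real boundaries and method may differ.

```python
import random

# Approximate, rounded bounding box for Kenya (assumed values for illustration).
KENYA_BOUNDS = {"lat": (-4.7, 5.0), "lon": (33.9, 41.9)}


def sketch_coordinates(rng):
    """Return a (latitude, longitude) pair inside the approximate bounding box."""
    lat = round(rng.uniform(*KENYA_BOUNDS["lat"]), 6)
    lon = round(rng.uniform(*KENYA_BOUNDS["lon"]), 6)
    return lat, lon


rng = random.Random(42)
points = [sketch_coordinates(rng) for _ in range(5)]
for lat, lon in points:
    print(lat, lon)
```

A plain box is only a first approximation: some points will land in neighboring countries or in the ocean. Sampling near known city centers is one way to tighten this.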
Challenge 3: Reproducibility
Solution: Implemented seed-based generation, ensuring that seed=42 always produces identical results, making debugging and testing possible.
The Results: By The Numbers
- 4 specialized generators covering different data types
- 60+ job titles across 8 departments
- 10 experience levels from Junior to C-Level
- 6 global regions covering 36 countries
- 10 popular car makes with realistic pricing
- 100% reproducibility with seed control
- Published on PyPI - accessible worldwide
- Docker support for zero-installation usage

[Figure: Complete demo output - the final output from running complete_demo.py, with all statistics]
What's Next?
DataGen is just the beginning. Future enhancements could include:
- More data types: Transaction records, event logs, social media posts
- Relationship modeling: Connecting profiles to their salaries and purchases
- Time-series data: Stock prices, sensor readings, website traffic
- Custom templates: Industry-specific data patterns
- Web interface: Generate data without writing code
Try it Yourself
Want to explore DataGen? Here's how:
For technical users:
pip install sami-datagen
For everyone else: Visit the GitHub repository, where you'll find:
- Complete installation instructions
- Step-by-step tutorials
Conclusion
Building DataGen taught me that great tools aren't just about functionality - they're about accessibility. The best technology is technology that anyone can use, understand, and benefit from.
Whether you're a developer testing an app, a student learning data science, or a business professional creating a demo, DataGen provides the realistic data you need, when you need it, without compromise.
The code is open source, the documentation is comprehensive, and the possibilities are endless.
About the Author
Oliver is a data engineer who's passionate about building tools that make technology more accessible. This project was completed as part of the LuxDevHQ Data Engineering Internship program.
Connect:
- GitHub: @25thOliver
- LinkedIn: Samwel Oliver


