I Built a Python Library for Synthetic Dataset Generation and Missing Value Simulation

#python #showdev #datascience #machinelearning

As a student interested in data science and machine learning, I kept running into the same problem:

I needed datasets to test ideas, algorithms, and projects, but finding the right dataset wasn't always easy.

Sometimes I needed:

A dataset with specific correlations
A dataset generated from a formula
Missing values following MCAR, MAR, or MNAR patterns
Time-series data for experimentation
Multiple datasets that could be merged and compared

Most existing libraries cover only one of these needs.

So I decided to build my own Python package:

Introducing GOSEIDATASET

GOSEIDATASET is a Python library designed for:

✅ Synthetic Dataset Generation

✅ Missing Value Simulation

✅ Time Series Generation

✅ Dataset Merging

✅ Supervised Learning Utilities

Installation

pip install goseidataset

Random Dataset Generation

Generating a dataset is straightforward:

from goseidataset import DatasetGenerator

dg = DatasetGenerator()

df = dg.generate_random(
    n_rows=100,
    constraints={
        "sleep": [4, 10],
        "revision": [0, 8],
        "session": ["Morning", "Evening"]
    }
)

print(df.head())
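For intuition, here is a minimal sketch of what a constraint-driven generator could do, written with NumPy and pandas only. It is not GOSEIDATASET's implementation: the helper name `sketch_generate_random` and the `seed` parameter are hypothetical, and the sketch assumes that a two-number list is a [low, high] range sampled uniformly and that a list of strings is a set of categories.

```python
import numpy as np
import pandas as pd


def sketch_generate_random(n_rows, constraints, seed=None):
    """Hypothetical helper: build a DataFrame from simple column constraints."""
    rng = np.random.default_rng(seed)
    columns = {}
    for name, spec in constraints.items():
        if all(isinstance(value, str) for value in spec):
            # Assumption: a list of strings is a set of categories to sample from.
            columns[name] = rng.choice(spec, size=n_rows)
        else:
            # Assumption: a two-number list is a [low, high] range.
            low, high = spec
            columns[name] = rng.uniform(low, high, size=n_rows)
    return pd.DataFrame(columns)


sample = sketch_generate_random(
    100,
    {"sleep": [4, 10], "revision": [0, 8], "session": ["Morning", "Evening"]},
    seed=0,
)
print(sample.head())
```

Whether the library returns integers or floats for numeric ranges is not stated here, so the sketch simply uses floats.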

Generate Correlated Data

Need a dataset where features have a predefined relationship with a target? Pass the target column and the correlation each feature should have with it:

df = dg.generate_correlated(
    n_rows=1000,
    target="marks",
    correlations={
        "hours": 0.8,
        "stress": -0.5
    },
    constraints={
        "marks": [0, 100],
        "hours": [0, 12],
        "stress": [0, 100]
    }
)
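One common way to get a feature with a chosen Pearson correlation to a target is to mix the target with independent noise. The sketch below illustrates that idea with NumPy and pandas; it is an assumption about how such data can be produced, not a description of the library's internals, and the names `rescale` and `sketch_generate_correlated` are hypothetical.

```python
import numpy as np
import pandas as pd


def rescale(values, low, high):
    """Min-max scale to [low, high]; a positive linear map keeps Pearson r unchanged."""
    span = values.max() - values.min()
    return low + (values - values.min()) / span * (high - low)


def sketch_generate_correlated(n_rows, target, correlations, constraints, seed=None):
    """Hypothetical helper: features correlated with a target at roughly the given r."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(n_rows)  # latent values for the target column
    data = {target: rescale(z, *constraints[target])}
    for name, r in correlations.items():
        noise = rng.standard_normal(n_rows)
        # Mix the target with independent noise so corr(feature, target) is about r.
        feature = r * z + np.sqrt(1.0 - r**2) * noise
        data[name] = rescale(feature, *constraints[name])
    return pd.DataFrame(data)


corr_df = sketch_generate_correlated(
    n_rows=1000,
    target="marks",
    correlations={"hours": 0.8, "stress": -0.5},
    constraints={"marks": [0, 100], "hours": [0, 12], "stress": [0, 100]},
    seed=0,
)
print(corr_df.corr().round(2))
```

The sample correlations only approximate the requested values, and because both features share the same latent target, they also end up correlated with each other.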

Formula-Based Dataset Generation

Generate a target column from a mathematical formula over other columns:

df = dg.generate_formula(
    n_rows=500,
    formula="hours*10 + revision*5",
    constraints={
        "hours": [1, 10],
        "revision": [0, 5],
        "marks": [0, 125]
    },
    target="marks"
)
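The same idea can be sketched in plain pandas, which helps when checking what a formula should produce. This assumes the input columns are drawn uniformly from their ranges and that the target is clipped to its constraint; neither is confirmed as the library's behavior.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Assumption: inputs are sampled uniformly within their [low, high] ranges.
formula_df = pd.DataFrame({
    "hours": rng.uniform(1, 10, size=500),
    "revision": rng.uniform(0, 5, size=500),
})

# DataFrame.eval evaluates the expression column-wise and returns a Series.
formula_df["marks"] = formula_df.eval("hours*10 + revision*5").clip(0, 125)

print(formula_df.describe().round(2))
```

With these ranges the formula can never exceed 10*10 + 5*5 = 125, so the upper bound on marks matches the largest value the formula can produce.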

Missing Value Simulation

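One of the main goals of the package is to help test imputation techniques: remove values from a complete dataset in a controlled way, then check how well an imputation method recovers them.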

Supported methods include:

MCAR (Missing Completely at Random)
MAR (Missing at Random)
MNAR (Missing Not at Random)
Random Missing
Consecutive Missing
Block Missing
Correlation-Based Missing

Example:

from goseidataset import MissingValueGenerator

mv = MissingValueGenerator(df)

result = mv.mcar(
    column="marks",
    percentage=20
)
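The three classic mechanisms differ only in what the probability of a value going missing depends on. The following sketch shows that difference with NumPy and pandas; it is a generic illustration with made-up probabilities and thresholds, not the package's code.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 10_000
data = pd.DataFrame({
    "hours": rng.uniform(0, 12, size=n),
    "marks": rng.uniform(0, 100, size=n),
})

# MCAR: every row has the same 20% chance of losing "marks".
mcar = data.copy()
mcar.loc[rng.random(n) < 0.20, "marks"] = np.nan

# MAR: the chance depends on an observed column ("hours"), not on "marks" itself.
mar = data.copy()
p_mar = np.where(data["hours"] < 6, 0.35, 0.05)
mar.loc[rng.random(n) < p_mar, "marks"] = np.nan

# MNAR: the chance depends on the very value that goes missing.
mnar = data.copy()
p_mnar = np.where(data["marks"] < 50, 0.35, 0.05)
mnar.loc[rng.random(n) < p_mnar, "marks"] = np.nan

print(mcar["marks"].isna().mean(), mar["marks"].isna().mean(), mnar["marks"].isna().mean())
```

All three end up with roughly 20% missing values, but only in the MNAR case is the mean of the remaining marks biased upward, which is why the mechanism matters when judging an imputation method.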

Time Series Generation

Generate timestamp-based features easily:

from goseidataset import TimeSeriesGenerator

ts = TimeSeriesGenerator(df)

result = ts.timestamp_series()
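What `timestamp_series()` returns is not shown above, so as a rough point of comparison, here is how a timestamp index and a couple of calendar features can be attached to an existing DataFrame with pandas alone. The start date and hourly spacing are arbitrary assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
frame = pd.DataFrame({"marks": rng.uniform(0, 100, size=48)})

# Assumption: one row per hour, starting from an arbitrary date.
frame["timestamp"] = pd.date_range(
    start="2024-01-01", periods=len(frame), freq=pd.Timedelta(hours=1)
)
frame = frame.set_index("timestamp")

# Derive simple calendar features from the DatetimeIndex.
frame["hour"] = frame.index.hour
frame["day_of_week"] = frame.index.dayofweek  # Monday is 0

print(frame.head())
```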

Supervised Learning Utilities

The package also includes utilities for:

Dataset comparison
Weighted ensemble learning
Dataset merging
Missing value imputation

Example:

from goseidataset import Supervised_learning

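# dataset_a, dataset_b, and model are placeholders defined elsewhere:
# two datasets that contain the "Retention" target column, and the model to evaluate.
sl = Supervised_learning(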
    dataset_a,
    dataset_b,
    target="Retention"
)

result = sl.compare_models(model)
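As a loose illustration of the merging and comparison ideas, the sketch below stacks two small datasets with pandas, labels each row with its source, and compares per-source averages. The toy values and the `source` column are invented for the example, and this is only one plausible reading of "dataset merging", not the behavior of `Supervised_learning`.

```python
import pandas as pd

# Toy stand-ins for dataset_a and dataset_b (values are made up).
dataset_a = pd.DataFrame({"hours": [2, 5, 8, 3], "Retention": [0, 1, 1, 0]})
dataset_b = pd.DataFrame({"hours": [1, 6, 7, 9], "Retention": [0, 1, 1, 1]})

# Stack the rows and record which dataset each row came from.
merged = pd.concat(
    [dataset_a.assign(source="A"), dataset_b.assign(source="B")],
    ignore_index=True,
)

# Compare the feature mean and the target rate per source.
summary = merged.groupby("source")[["hours", "Retention"]].mean()
print(summary)
```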

What I Learned Building This Project

Building this package taught me much more than writing Python code.

I learned about:

Package structure
API design
Documentation
Testing
PyPI packaging
Dependency management
Versioning
Real-world debugging

One of the biggest lessons was that writing the code is only part of the work. Making a package easy for others to install, understand, and use is equally important.