Harihara Suthan S

I Built a Python Library for Synthetic Dataset Generation and Missing Value Simulation

As a student interested in Data Science and Machine Learning, I often faced the same problem:

I needed datasets to test ideas, algorithms, and projects, but finding the right dataset wasn't always easy.

Sometimes I needed:

  • A dataset with specific correlations
  • A dataset generated from a formula
  • Missing values following MCAR, MAR, or MNAR patterns
  • Time-series data for experimentation
  • Multiple datasets that could be merged and compared

Most existing libraries I tried solved only one part of this problem.

So I decided to build my own Python package:

Introducing GOSEIDATASET

GOSEIDATASET is a Python library designed for:

✅ Synthetic Dataset Generation

✅ Missing Value Simulation

✅ Time Series Generation

✅ Dataset Merging

✅ Supervised Learning Utilities


Installation

pip install goseidataset

Random Dataset Generation

Generating a dataset is straightforward:

from goseidataset import DatasetGenerator

dg = DatasetGenerator()

df = dg.generate_random(
    n_rows=100,
    constraints={
        "sleep": [4, 10],
        "revision": [0, 8],
        "session": ["Morning", "Evening"]
    }
)

print(df.head())
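
Judging from the call above, a two-number list reads as a numeric range and a list of strings reads as a set of categories. The following is a minimal sketch of that interpretation using only the standard library. The helper `sample_rows` and the uniform sampling are my assumptions for illustration, not GOSEIDATASET's actual implementation.

```python
import random

def sample_rows(n_rows, constraints, seed=None):
    """Draw n_rows records from a constraints dict (hypothetical helper)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_rows):
        row = {}
        for column, spec in constraints.items():
            is_numeric_range = (
                len(spec) == 2
                and all(isinstance(v, (int, float)) for v in spec)
            )
            if is_numeric_range:
                low, high = spec
                # Assumption: numeric ranges are sampled uniformly
                row[column] = rng.uniform(low, high)
            else:
                # Assumption: anything else is a list of categories
                row[column] = rng.choice(spec)
        rows.append(row)
    return rows

rows = sample_rows(
    100,
    {"sleep": [4, 10], "revision": [0, 8], "session": ["Morning", "Evening"]},
    seed=0,
)
print(rows[0])
```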

Generate Correlated Data

Need a dataset where features have predefined relationships?

# Reuses the dg instance created in the previous example
df = dg.generate_correlated(
    n_rows=1000,
    target="marks",
    correlations={
        "hours": 0.8,
        "stress": -0.5
    },
    constraints={
        "marks": [0, 100],
        "hours": [0, 12],
        "stress": [0, 100]
    }
)
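
For readers curious how a target correlation can be produced at all, here is a minimal sketch of one textbook approach: mix the standardized target with independent noise. It is not taken from GOSEIDATASET's source, and the helpers `correlated_feature`, `rescale`, and `pearson` are hypothetical names of my own.

```python
import math
import random
import statistics

def correlated_feature(target_z, r, rng):
    """Build a feature whose correlation with a standard-normal target is about r."""
    noise_weight = math.sqrt(1 - r * r)
    return [r * t + noise_weight * rng.gauss(0, 1) for t in target_z]

def rescale(values, low, high):
    """Min-max scale values into [low, high]; a positive linear map keeps the correlation."""
    v_min, v_max = min(values), max(values)
    return [(v - v_min) / (v_max - v_min) * (high - low) + low for v in values]

def pearson(xs, ys):
    """Sample Pearson correlation coefficient."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

rng = random.Random(42)
marks_z = [rng.gauss(0, 1) for _ in range(10_000)]   # standardized target
hours_z = correlated_feature(marks_z, 0.8, rng)      # positively related
stress_z = correlated_feature(marks_z, -0.5, rng)    # negatively related

marks = rescale(marks_z, 0, 100)
hours = rescale(hours_z, 0, 12)

print(round(pearson(marks, hours), 2))
print(round(pearson(marks_z, stress_z), 2))
```

Rescaling with a min-max map is used here instead of clipping because clipping values at the range boundaries would distort the correlation.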

Formula-Based Dataset Generation

Generate data from mathematical relationships:

# Reuses the dg instance created earlier
df = dg.generate_formula(
    n_rows=500,
    formula="hours*10 + revision*5",
    constraints={
        "hours": [1, 10],
        "revision": [0, 5],
        "marks": [0, 125]
    },
    target="marks"
)
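
The bounds in this example can be checked by hand: with hours in [1, 10] and revision in [0, 5], the formula yields values between 1*10 + 0*5 = 10 and 10*10 + 5*5 = 125, so the [0, 125] range for marks is never exceeded. The sketch below reproduces the same idea with the standard library only. Uniform sampling of the inputs is my assumption, not something confirmed about the package.

```python
import random

rng = random.Random(0)
rows = []
for _ in range(500):
    # Assumption: inputs are drawn uniformly from their constraint ranges
    hours = rng.uniform(1, 10)
    revision = rng.uniform(0, 5)
    # Same relationship as the formula string "hours*10 + revision*5"
    marks = hours * 10 + revision * 5
    rows.append({"hours": hours, "revision": revision, "marks": marks})

lowest = min(row["marks"] for row in rows)
highest = max(row["marks"] for row in rows)
print(lowest >= 10, highest <= 125)
```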

Missing Value Simulation

One of the main goals of the package was to help test imputation techniques.

Supported methods include:

  • MCAR
  • MAR
  • MNAR
  • Random Missing
  • Consecutive Missing
  • Block Missing
  • Correlation-Based Missing

Example:

from goseidataset import MissingValueGenerator

# df is the dataset produced by one of the generator calls above
mv = MissingValueGenerator(df)

result = mv.mcar(
    column="marks",
    percentage=20
)
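
The first three mechanisms differ in what drives the missingness: under MCAR (missing completely at random) every value has the same chance of being removed, under MAR (missing at random) the chance depends on another observed column, and under MNAR (missing not at random) it depends on the hidden value itself. Here is a minimal standard-library sketch of that distinction. The function names, thresholds, and parameters are hypothetical and do not mirror the `MissingValueGenerator` API.

```python
import random

def mcar(rows, column, fraction, rng):
    """MCAR: every value has the same chance of being removed."""
    return [
        {**row, column: None} if rng.random() < fraction else dict(row)
        for row in rows
    ]

def mar(rows, column, driver, threshold, fraction, rng):
    """MAR: only rows whose observed driver column is high can lose the value."""
    out = []
    for row in rows:
        hide = row[driver] > threshold and rng.random() < fraction
        out.append({**row, column: None} if hide else dict(row))
    return out

def mnar(rows, column, threshold, fraction, rng):
    """MNAR: only rows whose own value is low can lose that value."""
    out = []
    for row in rows:
        hide = row[column] < threshold and rng.random() < fraction
        out.append({**row, column: None} if hide else dict(row))
    return out

rng = random.Random(1)
data = [
    {"hours": rng.uniform(0, 12), "marks": rng.uniform(0, 100)}
    for _ in range(2000)
]

mcar_rows = mcar(data, "marks", 0.2, rng)
mar_rows = mar(data, "marks", "hours", 8, 0.6, rng)
mnar_rows = mnar(data, "marks", 40, 0.5, rng)

missing_share = sum(r["marks"] is None for r in mcar_rows) / len(mcar_rows)
print(round(missing_share, 2))
```

Each function returns new rows and leaves the original data untouched, which makes it easy to compare an imputation result against the true values afterwards.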

Time Series Generation

Generate timestamp-based features easily:

from goseidataset import TimeSeriesGenerator

ts = TimeSeriesGenerator(df)

result = ts.timestamp_series()
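
As a rough picture of what timestamp-based data looks like, the sketch below attaches evenly spaced timestamps to a handful of rows using only the standard library. The helper `add_timestamps`, the start date, and the hourly step are assumptions for illustration. The actual output of `timestamp_series()` may differ.

```python
from datetime import datetime, timedelta

def add_timestamps(rows, start, step):
    """Attach evenly spaced timestamps to a list of row dicts (hypothetical helper)."""
    return [
        {"timestamp": start + i * step, **row}
        for i, row in enumerate(rows)
    ]

readings = [{"marks": 61.0}, {"marks": 64.5}, {"marks": 59.2}]
series = add_timestamps(readings, datetime(2024, 1, 1), timedelta(hours=1))

for row in series:
    print(row["timestamp"].isoformat(), row["marks"])
```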

Supervised Learning Utilities

The package also includes utilities for:

  • Dataset comparison
  • Weighted ensemble learning
  • Dataset merging
  • Missing value imputation

Example:

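from goseidataset import Supervised_learning

# dataset_a, dataset_b, and model are placeholders: define two datasets
# that contain a "Retention" column and a model before running this snippet.
sl = Supervised_learning(
    dataset_a,
    dataset_b,
    target="Retention"
)

result = sl.compare_models(model)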
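
Weighted ensemble learning, one of the utilities listed above, generally means combining predictions from several models with a weight per model. Below is a minimal sketch of that general idea as a weighted average. The function `weighted_ensemble`, the sample predictions, and the weights are all hypothetical, and the package may implement it differently.

```python
def weighted_ensemble(predictions, weights):
    """Combine per-model prediction lists into one weighted average."""
    total = sum(weights)
    n = len(predictions[0])
    return [
        sum(w * preds[i] for w, preds in zip(weights, predictions)) / total
        for i in range(n)
    ]

preds_a = [0.9, 0.2, 0.6]  # hypothetical model trained on dataset A
preds_b = [0.7, 0.4, 0.8]  # hypothetical model trained on dataset B

# Trust the first model three times as much as the second
combined = weighted_ensemble([preds_a, preds_b], weights=[3, 1])
print([round(p, 2) for p in combined])
```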

What I Learned Building This Project

Building this package taught me much more than writing Python code.

I learned about:

  • Package structure
  • API design
  • Documentation
  • Testing
  • PyPI packaging
  • Dependency management
  • Versioning
  • Real-world debugging

One of the biggest lessons was that writing the code is only part of the work. Making a package easy for others to install, understand, and use is equally important.


Future Improvements

Planned features include:

  • Classification dataset generators
  • Advanced time-series simulation
  • More missing-value mechanisms
  • Better visualization utilities
  • Additional machine learning helpers

Feedback Welcome

This is my first published Python package, and I'd love to hear feedback from the community.

PyPI:
https://pypi.org/project/goseidataset/

GitHub:
https://github.com/GITTY5678/Gosei-dataset

Thanks for reading!
