As a student interested in Data Science and Machine Learning, I often faced the same problem:
I needed datasets to test ideas, algorithms, and projects, but finding the right dataset wasn't always easy.
Sometimes I needed:
- A dataset with specific correlations
- A dataset generated from a formula
- Missing values following MCAR (missing completely at random), MAR (missing at random), or MNAR (missing not at random) patterns
- Time-series data for experimentation
- Multiple datasets that could be merged and compared
Most existing libraries solved only one part of the problem.
So I decided to build my own Python package:
Introducing GOSEIDATASET
GOSEIDATASET is a Python library designed for:
✅ Synthetic Dataset Generation
✅ Missing Value Simulation
✅ Time Series Generation
✅ Dataset Merging
✅ Supervised Learning Utilities
Installation
pip install goseidataset
Random Dataset Generation
Generating a dataset is straightforward:
from goseidataset import DatasetGenerator

dg = DatasetGenerator()
df = dg.generate_random(
    n_rows=100,
    constraints={
        "sleep": [4, 10],
        "revision": [0, 8],
        "session": ["Morning", "Evening"]
    }
)
print(df.head())
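To make the `constraints` idea concrete, here is a minimal sketch of one way such a generator could work, written with NumPy and pandas. It is not GOSEIDATASET's implementation: `sketch_random` is a hypothetical helper, and the rule it applies (a list of strings means categories, a numeric pair means a range sampled uniformly) is an assumption based only on the example above.

```python
import numpy as np
import pandas as pd


def sketch_random(n_rows, constraints, seed=0):
    """Hypothetical stand-in, not goseidataset's implementation."""
    rng = np.random.default_rng(seed)
    columns = {}
    for name, spec in constraints.items():
        if all(isinstance(v, str) for v in spec):
            # Assumption: a list of strings is a set of categories.
            columns[name] = rng.choice(spec, size=n_rows)
        else:
            # Assumption: a numeric pair is a (low, high) range,
            # sampled uniformly from [low, high).
            low, high = spec
            columns[name] = rng.uniform(low, high, size=n_rows)
    return pd.DataFrame(columns)


df_sketch = sketch_random(
    100,
    {"sleep": [4, 10], "revision": [0, 8], "session": ["Morning", "Evening"]},
)
print(df_sketch.head())
```

Fixing the seed keeps the output reproducible, which is handy when the dataset feeds a test or a tutorial.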
Generate Correlated Data
Need a dataset where each feature has a chosen correlation with a target column?
df = dg.generate_correlated(
    n_rows=1000,
    target="marks",
    correlations={
        "hours": 0.8,
        "stress": -0.5
    },
    constraints={
        "marks": [0, 100],
        "hours": [0, 12],
        "stress": [0, 100]
    }
)
Formula-Based Dataset Generation
Compute the target column from a formula over the other columns:
df = dg.generate_formula(
    n_rows=500,
    formula="hours*10 + revision*5",
    constraints={
        "hours": [1, 10],
        "revision": [0, 5],
        "marks": [0, 125]
    },
    target="marks"
)
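The same idea can be reproduced by hand with pandas, which is useful for checking results. The sketch below is an assumption about the behaviour, not a description of the package: it samples the inputs uniformly, evaluates the formula string with `DataFrame.eval`, and clips the target to its range.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df_formula = pd.DataFrame({
    "hours": rng.uniform(1, 10, size=500),
    "revision": rng.uniform(0, 5, size=500),
})

# DataFrame.eval evaluates the expression column-wise and returns a Series.
# Clipping to [0, 125] is an assumed way of enforcing the target constraint.
df_formula["marks"] = df_formula.eval("hours*10 + revision*5").clip(0, 125)
print(df_formula.describe().round(2))
```

With these ranges the formula already stays within 0 to 125, so the clip only matters if the input ranges are widened.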
Missing Value Simulation
One of the main goals of the package was to help test imputation techniques.
Supported methods include:
- MCAR
- MAR
- MNAR
- Random Missing
- Consecutive Missing
- Block Missing
- Correlation-Based Missing
Example:
from goseidataset import MissingValueGenerator

mv = MissingValueGenerator(df)
result = mv.mcar(
    column="marks",
    percentage=20
)
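The difference between MCAR, MAR, and MNAR is easiest to see in code. The sketch below is a simplified illustration in NumPy and pandas, not how GOSEIDATASET implements these mechanisms. The helper names (`mcar_mask`, `mar_mask`, `mnar_mask`) are hypothetical, as are the specific rules: MAR is driven by the rank of another column, and MNAR hides the largest values.

```python
import numpy as np
import pandas as pd


def mcar_mask(n, fraction, rng):
    """MCAR: every row has the same chance of being missing."""
    return rng.random(n) < fraction


def mar_mask(observed, fraction, rng):
    """MAR: the chance of missingness depends on another, fully observed column."""
    ranks = observed.rank(pct=True).to_numpy()
    # Higher values of the observed column are more likely to go missing;
    # the factor 2 keeps the average rate near `fraction` (for fraction <= 0.5).
    prob = np.clip(2 * fraction * ranks, 0, 1)
    return rng.random(len(observed)) < prob


def mnar_mask(values, fraction):
    """MNAR: missingness depends on the value itself (here, the largest values)."""
    threshold = values.quantile(1 - fraction)
    return (values > threshold).to_numpy()


rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "hours": rng.uniform(0, 12, size=1000),
    "marks": rng.uniform(0, 100, size=1000),
})

# Series.mask replaces entries with NaN wherever the mask is True.
mcar = demo["marks"].mask(mcar_mask(len(demo), 0.2, rng))
mar = demo["marks"].mask(mar_mask(demo["hours"], 0.2, rng))
mnar = demo["marks"].mask(mnar_mask(demo["marks"], 0.2))
print(mcar.isna().mean(), mar.isna().mean(), mnar.isna().mean())
```

All three columns lose roughly 20% of their values, but only the MCAR version can be ignored safely by a simple mean imputer, which is what makes these patterns useful for testing imputation methods.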
Time Series Generation
Generate timestamp-based features easily:
from goseidataset import TimeSeriesGenerator
ts = TimeSeriesGenerator(df)
result = ts.timestamp_series()
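As a rough picture of what timestamp-based data can look like, here is a small pandas sketch that builds a daily series from a trend, a weekly cycle, and noise. It does not reproduce `timestamp_series()`, whose output I am not describing here. The start date, the column names, and the components are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 84  # twelve weeks of daily observations

timestamps = pd.date_range("2024-01-01", periods=n, freq="D")
t = np.arange(n)
trend = 0.5 * t                          # slow upward drift
weekly = 10 * np.sin(2 * np.pi * t / 7)  # cycle that repeats every 7 days
noise = rng.normal(0, 2, size=n)

ts_sketch = pd.DataFrame({
    "timestamp": timestamps,
    "value": 50 + trend + weekly + noise,
})
# Calendar features derived from the timestamp column (Monday is 0).
ts_sketch["day_of_week"] = ts_sketch["timestamp"].dt.dayofweek
print(ts_sketch.head())
```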
Supervised Learning Utilities
The package also includes utilities for:
- Dataset comparison
- Weighted ensemble learning
- Dataset merging
- Missing value imputation
Example:
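from goseidataset import Supervised_learning

# dataset_a, dataset_b, and model are placeholders: define your two
# datasets and the model you want to evaluate before running this.
sl = Supervised_learning(
    dataset_a,
    dataset_b,
    target="Retention"
)
result = sl.compare_models(model)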
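To show what merging and comparing two datasets can mean in practice, here is a plain NumPy and pandas sketch. It is not what `compare_models` does internally. The synthetic data, the `hours` feature, and the helpers `make_dataset`, `fit_line`, and `rmse` are all hypothetical. It merges two datasets with a source label, fits a straight line on one, and scores it on the other.

```python
import numpy as np
import pandas as pd


def make_dataset(n, slope, seed):
    """Synthetic data for illustration only: Retention is linear in hours plus noise."""
    rng = np.random.default_rng(seed)
    hours = rng.uniform(0, 12, size=n)
    retention = slope * hours + rng.normal(0, 3, size=n)
    return pd.DataFrame({"hours": hours, "Retention": retention})


def fit_line(df, target):
    """Least-squares line through (hours, target); returns (slope, intercept)."""
    slope, intercept = np.polyfit(df["hours"], df[target], deg=1)
    return slope, intercept


def rmse(params, df, target):
    """Root mean squared error of the fitted line on df."""
    slope, intercept = params
    pred = slope * df["hours"] + intercept
    return float(np.sqrt(((df[target] - pred) ** 2).mean()))


dataset_a = make_dataset(200, slope=5.0, seed=0)
dataset_b = make_dataset(200, slope=5.0, seed=1)

# Merge both datasets and keep track of where each row came from.
merged = pd.concat(
    [dataset_a, dataset_b], keys=["a", "b"], names=["source", "row"]
).reset_index(level="source")

model_a = fit_line(dataset_a, "Retention")
print("Trained on A, tested on B, RMSE:", rmse(model_a, dataset_b, "Retention"))
print(merged.groupby("source")["Retention"].mean())
```

If the cross-dataset error is close to the noise level, the two datasets behave alike for this model; a much larger error suggests they differ in ways worth inspecting before merging.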
What I Learned Building This Project
Building this package taught me much more than writing Python code.
I learned about:
- Package structure
- API design
- Documentation
- Testing
- PyPI packaging
- Dependency management
- Versioning
- Real-world debugging
One of the biggest lessons was that writing the code is only part of the work. Making a package easy for others to install, understand, and use is equally important.
Future Improvements
Planned features include:
- Classification dataset generators
- Advanced time-series simulation
- More missing-value mechanisms
- Better visualization utilities
- Additional machine learning helpers
Feedback Welcome
This is my first published Python package, and I'd love to hear feedback from the community.
PyPI:
https://pypi.org/project/goseidataset/
GitHub:
https://github.com/GITTY5678/Gosei-dataset
Thanks for reading!