Chinthaparthi Sridhar
I Built a Python Package That Automates EDA in One Line

After writing the same pandas code for every new dataset, I decided to automate it and published it on PyPI.

The Problem

Every data scientist knows this pain. You get a new dataset and start typing:

```python
df.head()
df.tail()
df.info()
df.describe()
df.isnull().sum()
df.duplicated().sum()
# ... 50 more lines
```

Same code. Every. Single. Time.
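You can see the shape of the problem by bundling those first-look checks into a single helper. This is just a minimal sketch of the manual baseline; `quick_eda` is a hypothetical name, not part of smarteda:

```python
import io

import pandas as pd


def quick_eda(df: pd.DataFrame) -> dict:
    """Hypothetical one-call bundle of the usual first-look checks."""
    buf = io.StringIO()
    df.info(buf=buf)  # df.info() prints to stdout; capture it as text instead
    return {
        "head": df.head(),
        "describe": df.describe(include="all"),
        "info": buf.getvalue(),
        "missing": df.isnull().sum(),
        "duplicates": int(df.duplicated().sum()),
    }
```

Every new helper like this is another reason to package the whole routine once and reuse it everywhere.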

So I built smarteda — a Python package that runs your entire EDA automatically.

Installation

```shell
pip install smarteda
```

Quick Start

```python
import pandas as pd
import smarteda

df = pd.read_csv("your_data.csv")

# Run everything at once
smarteda.analyze(df)

# Or pick what you need
smarteda.basic_eda(df)       # head, tail, info, describe, shape
smarteda.overview(df)        # shape, memory, dtypes, constant columns
smarteda.missing(df)         # missing values + fill suggestions
smarteda.outliers(df)        # IQR + Z-score + Isolation Forest
smarteda.correlations(df)    # multicollinearity warnings + heatmap
smarteda.suggestions(df)     # smart recommendations + ML score

# Clean your data
clean_df = smarteda.clean(df)           # returns new cleaned df
smarteda.clean(df, inplace=True)        # modifies df directly

# Generate full HTML report
smarteda.report(df, output_file="report.html")
```

The Output

```text
=== Smart Suggestions ===

⚠️  Column 'salary' has 16.7% missing → fill missing values
⚠️  Column 'salary' is highly skewed (skew=1.53) → apply log transformation
⚠️  '10_yop' and '12_yop' are 100% correlated → drop one to avoid multicollinearity
⚠️  Column 'Name' has high cardinality (20000 unique) → use target encoding
✅  No duplicates found

💡 ML Readiness Score: 74 / 100
```
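To give a sense of how a suggestion like the skew warning can be generated, here is a minimal sketch of a moment-based skewness check. The function names and the 1.0 threshold are assumptions for illustration, not smarteda's actual implementation:

```python
def moment_skewness(values):
    # Population (moment-based) skewness: third central moment over
    # the second central moment raised to 3/2.
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    if m2 == 0:
        return 0.0  # constant column: no skew to report
    return m3 / m2 ** 1.5


def skew_suggestion(name, values, threshold=1.0):
    # Emit a human-readable warning only when skew exceeds the threshold.
    s = moment_skewness(values)
    if abs(s) > threshold:
        return f"Column '{name}' is highly skewed (skew={s:.2f}) -> apply log transformation"
    return None
```

The real package layers many such per-column checks and rolls them up into the ML Readiness Score.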

All 15 Functions

| Function | Description |
| --- | --- |
| `basic_eda(df)` | Head, tail, sample, shape, size, info, describe |
| `overview(df)` | Shape, memory, dtypes, constant columns, wrong-type detection |
| `missing(df)` | Missing counts, percentages, heatmap, fill strategy suggestions |
| `duplicates(df)` | Count and display duplicate rows |
| `duplicates(df, drop=True)` | Drop duplicates and return a clean DataFrame |
| `outliers(df)` | IQR + Z-score + Isolation Forest outlier detection |
| `distributions(df)` | Skewness, kurtosis, transformation suggestions, KDE plots |
| `correlations(df)` | Pearson/Spearman/Kendall, multicollinearity warnings, heatmap |
| `categorical(df)` | Value counts, high cardinality, encoding suggestions |
| `timeseries(df)` | Auto datetime detection, trends, seasonality, gap detection |
| `suggestions(df)` | Smart recommendations + ML Readiness Score out of 100 |
| `clean(df)` | Auto clean: drop dupes, fill nulls, fix types, cap outliers |
| `visualize(df)` | Auto charts for every column |
| `analyze(df)` | 🚀 Runs ALL functions in one call |
| `report(df)` | 📄 Generates a full standalone HTML report |

What I Learned Building This

1. Publishing to PyPI is easier than you think
Build with `python -m build`, upload with `twine upload dist/*`. Done.

2. Jupyter has quirks
When a function returns a dictionary, Jupyter auto-displays it, giving double output. Fixed it with a custom `SilentDict` class that overrides `__repr__` to return an empty string.
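A minimal sketch of that pattern, assuming `SilentDict` simply subclasses `dict` (the class in the package may do more):

```python
class SilentDict(dict):
    """A dict that Jupyter's auto-display renders as nothing.

    Jupyter echoes the repr of the last expression in a cell, so a
    function that both prints its results and returns them shows up
    twice. Returning an empty string from __repr__ suppresses the
    echoed copy while keeping normal dict access intact.
    """

    def __repr__(self):
        return ""
```

Callers can still index and iterate the result as usual; only the notebook echo disappears.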

3. Deprecation warnings matter
Pandas is actively deprecating `infer_datetime_format` and `select_dtypes(include='object')`. Wrapping these calls in `warnings.catch_warnings()` keeps the package's output clean.
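The suppression pattern looks roughly like this; `suppress_deprecations` is a hypothetical helper for illustration, not the package's actual code:

```python
import warnings


def suppress_deprecations(func, *args, **kwargs):
    # Run `func` while silencing deprecation noise. catch_warnings()
    # saves the current filter state and restores it when the `with`
    # block exits, so the rest of the program is unaffected.
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", DeprecationWarning)
        warnings.simplefilter("ignore", FutureWarning)
        return func(*args, **kwargs)
```

The context manager scopes the filter change, which is safer than calling `warnings.filterwarnings("ignore")` globally inside library code.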

4. The gap between "it works" and "ready to publish" is large
The code worked on day one. But edge cases, warnings, Jupyter quirks, docstrings, README, PyPI config — that's where the real work was.

Links

If this helped you or gave you an idea — drop a comment. Always happy to talk Python and data science! 🚀
