Chinthaparthi Sridhar
I Built a Python Package That Automates EDA in One Line

After writing the same pandas code for every new dataset, I decided to automate it and published it on PyPI.

The Problem

Every data scientist knows this pain. You get a new dataset and start typing:

```python
df.head()
df.tail()
df.info()
df.describe()
df.isnull().sum()
df.duplicated().sum()
# ... 50 more lines
```

Same code. Every. Single. Time.
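You can see the shape of the problem by bundling those first-look checks into a single helper. This is just a minimal sketch of the manual baseline; `quick_eda` is a hypothetical name, not part of smarteda:

```python
import io

import pandas as pd


def quick_eda(df: pd.DataFrame) -> dict:
    """Hypothetical one-call bundle of the usual first-look checks."""
    buf = io.StringIO()
    df.info(buf=buf)  # df.info() prints to stdout; capture it as text instead
    return {
        "head": df.head(),
        "describe": df.describe(include="all"),
        "info": buf.getvalue(),
        "missing": df.isnull().sum(),
        "duplicates": int(df.duplicated().sum()),
    }
```

Every new helper like this is another reason to package the whole routine once and reuse it everywhere.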

So I built smarteda — a Python package that runs your entire EDA automatically.

Installation

```shell
pip install smarteda
```

Quick Start

```python
import pandas as pd
import smarteda

df = pd.read_csv("your_data.csv")

# Run everything at once
smarteda.analyze(df)

# Or pick what you need
smarteda.basic_eda(df)       # head, tail, info, describe, shape
smarteda.overview(df)        # shape, memory, dtypes, constant columns
smarteda.missing(df)         # missing values + fill suggestions
smarteda.outliers(df)        # IQR + Z-score + Isolation Forest
smarteda.correlations(df)    # multicollinearity warnings + heatmap
smarteda.suggestions(df)     # smart recommendations + ML score

# Clean your data
clean_df = smarteda.clean(df)           # returns new cleaned df
smarteda.clean(df, inplace=True)        # modifies df directly

# Generate full HTML report
smarteda.report(df, output_file="report.html")
```

The Output

```text
=== Smart Suggestions ===

⚠️  Column 'salary' has 16.7% missing → fill missing values
⚠️  Column 'salary' is highly skewed (skew=1.53) → apply log transformation
⚠️  '10_yop' and '12_yop' are 100% correlated → drop one to avoid multicollinearity
⚠️  Column 'Name' has high cardinality (20000 unique) → use target encoding
✅  No duplicates found

💡 ML Readiness Score: 74 / 100
```
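To give a sense of how a suggestion like the skew warning can be generated, here is a minimal sketch of a moment-based skewness check. The function names and the 1.0 threshold are assumptions for illustration, not smarteda's actual implementation:

```python
def moment_skewness(values):
    # Population (moment-based) skewness: third central moment over
    # the second central moment raised to 3/2.
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    if m2 == 0:
        return 0.0  # constant column: no skew to report
    return m3 / m2 ** 1.5


def skew_suggestion(name, values, threshold=1.0):
    # Emit a human-readable warning only when skew exceeds the threshold.
    s = moment_skewness(values)
    if abs(s) > threshold:
        return f"Column '{name}' is highly skewed (skew={s:.2f}) -> apply log transformation"
    return None
```

The real package layers many such per-column checks and rolls them up into the ML Readiness Score.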

All 15 Functions

| Function | Description |
| --- | --- |
| `basic_eda(df)` | Head, tail, sample, shape, size, info, describe |
| `overview(df)` | Shape, memory, dtypes, constant columns, wrong-type detection |
| `missing(df)` | Missing counts, percentages, heatmap, fill strategy suggestions |
| `duplicates(df)` | Count and display duplicate rows |
| `duplicates(df, drop=True)` | Drop duplicates and return a clean DataFrame |
| `outliers(df)` | IQR + Z-score + Isolation Forest outlier detection |
| `distributions(df)` | Skewness, kurtosis, transformation suggestions, KDE plots |
| `correlations(df)` | Pearson/Spearman/Kendall, multicollinearity warnings, heatmap |
| `categorical(df)` | Value counts, high cardinality, encoding suggestions |
| `timeseries(df)` | Auto datetime detection, trends, seasonality, gap detection |
| `suggestions(df)` | Smart recommendations + ML Readiness Score out of 100 |
| `clean(df)` | Auto clean: drop dupes, fill nulls, fix types, cap outliers |
| `visualize(df)` | Auto charts for every column |
| `analyze(df)` | 🚀 Runs ALL functions in one call |
| `report(df)` | 📄 Generates a full standalone HTML report |

What I Learned Building This

1. Publishing to PyPI is easier than you think
Build with `python -m build`, upload with `twine upload dist/*`. Done.

2. Jupyter has quirks
When a function returns a dictionary, Jupyter auto-displays it, giving double output. Fixed it with a custom `SilentDict` class that overrides `__repr__` to return an empty string.
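A minimal sketch of that pattern, assuming `SilentDict` simply subclasses `dict` (the class in the package may do more):

```python
class SilentDict(dict):
    """A dict that Jupyter's auto-display renders as nothing.

    Jupyter echoes the repr of the last expression in a cell, so a
    function that both prints its results and returns them shows up
    twice. Returning an empty string from __repr__ suppresses the
    echoed copy while keeping normal dict access intact.
    """

    def __repr__(self):
        return ""
```

Callers can still index and iterate the result as usual; only the notebook echo disappears.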

3. Deprecation warnings matter
Pandas is actively deprecating `infer_datetime_format` and `select_dtypes(include='object')`. Wrapping these calls in `warnings.catch_warnings()` keeps the package's output clean.
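The suppression pattern looks roughly like this; `suppress_deprecations` is a hypothetical helper for illustration, not the package's actual code:

```python
import warnings


def suppress_deprecations(func, *args, **kwargs):
    # Run `func` while silencing deprecation noise. catch_warnings()
    # saves the current filter state and restores it when the `with`
    # block exits, so the rest of the program is unaffected.
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", DeprecationWarning)
        warnings.simplefilter("ignore", FutureWarning)
        return func(*args, **kwargs)
```

The context manager scopes the filter change, which is safer than calling `warnings.filterwarnings("ignore")` globally inside library code.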

4. The gap between "it works" and "ready to publish" is large
The code worked on day one. But edge cases, warnings, Jupyter quirks, docstrings, README, PyPI config — that's where the real work was.

Links

If this helped you or gave you an idea — drop a comment. Always happy to talk Python and data science! 🚀
