How I built dissectml, the missing middle layer between EDA and AutoML.
Every data science project starts the same way.
You load your dataset. You run df.describe(). You open YData Profiling for a quick report. Then you switch to PyCaret or LazyPredict to screen a bunch of models. Then you pull in SHAP for explainability. Then matplotlib for custom comparison plots. By the time you actually understand your data and your models, you've imported five libraries, written 200 lines of glue code, and it's been three hours.
I kept asking myself: why isn't there one library that does the full journey?
So I researched every tool in the space. Thoroughly. And then I built the one that was missing.
The Research That Started Everything
I spent weeks doing deep market research on two categories: Auto-EDA tools (libraries that explore your data) and AutoML/model comparison tools (libraries that train and compare models).
Auto-EDA landscape (10+ libraries):
- YData Profiling (13K+ GitHub stars) — the king of one-line profiling reports. Great for stats and correlations, but no model insights.
- DataPrep — Dask-powered, roughly 10x faster than pandas-based profilers. But it stops at data profiling.
- SweetViz — beautiful HTML reports with target analysis. But static and shallow.
- D-Tale — Flask+React interactive GUI. Impressive, but no ML integration.
- AutoViz, Lux, klib, Missingno — each does one thing well but nothing end-to-end.
AutoML landscape (16+ frameworks):
- PyCaret (9K+ stars) — low-code model comparison with compare_models(). But no deep EDA, no statistical significance tests between models, no cross-model error analysis.
- LazyPredict — trains 30 models in 2 lines. But zero depth: no plots, no tuning, no explanations.
- AutoGluon (AWS) — wins competitions via stacking. But it's a black box focused on prediction, not understanding.
- MLJAR — per-model SHAP reports. But reports are per-model, not comparative.
- FLAML (Microsoft), H2O, TPOT, EvalML — all focused on finding the best model, not understanding why.
The gap I found:
| Capability | YData | PyCaret | LazyPredict | Nobody |
|---|---|---|---|---|
| Deep EDA with statistical tests | ✅ | ❌ | ❌ | — |
| Train 20+ models in one call | ❌ | ✅ | ✅ | — |
| Cross-model error analysis | ❌ | ❌ | ❌ | ❌ |
| Statistical significance between models | ❌ | ❌ | ❌ | ❌ |
| Target leakage detection | ❌ | ❌ | ❌ | ❌ |
| Data readiness score | ❌ | ❌ | ❌ | ❌ |
| EDA insights informing model selection | ❌ | ❌ | ❌ | ❌ |
| End-to-end: EDA → Models → Report | ❌ | ❌ | ❌ | ❌ |
For the last six capabilities, the entire row is empty: not a single library bridges the full journey from "What is my data?" to "Which model is best, and why?"
That's not an AutoML gap. It's an Auto-Analysis gap.
What I Built: dissectml
dissectml is a Python library that unifies deep EDA with comparative model analysis in a single, coherent pipeline. It has five stages:
Deep EDA — auto-detect types, distributions, correlations (Pearson + Spearman + Cramér's V), missing data patterns (MCAR/MAR/MNAR), outlier detection (IQR + Z-score + Isolation Forest), statistical tests (Shapiro-Wilk, chi-square, ANOVA), cluster discovery, feature interactions.
Pre-Model Intelligence — target leakage detection, multicollinearity (VIF), data readiness score (0-100 with letter grade), algorithm recommendations based on data characteristics.
Model Battle — parallel cross-validation across 19 classifiers or 19 regressors. Supports XGBoost, LightGBM, CatBoost as optional extras.
Comparative Analysis — side-by-side metrics, ROC/PR curves, confusion matrices, cross-model error analysis, McNemar/corrected paired t-tests for statistical significance, accuracy vs speed Pareto front.
HTML Report — self-contained interactive report with Plotly charts, collapsible sections, and narrative summaries.
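The statistical-significance step in the comparative stage deserves a closer look. dissectml's internals aren't shown here, but comparing two models' cross-validation scores with a plain paired t-test overstates significance, because CV folds share training data. The standard fix is the Nadeau-Bengio corrected resampled t-test; a minimal sketch (function name and toy scores are my own, not dissectml's API):

```python
import numpy as np
from scipy import stats

def corrected_paired_ttest(scores_a, scores_b, n_train, n_test):
    """Nadeau-Bengio corrected resampled t-test on per-fold CV scores.

    The (1/k + n_test/n_train) term inflates the variance estimate to
    account for the overlap between the folds' training sets.
    """
    d = np.asarray(scores_a) - np.asarray(scores_b)
    k = len(d)
    var_d = d.var(ddof=1)
    t = d.mean() / np.sqrt(var_d * (1.0 / k + n_test / n_train))
    p = 2 * stats.t.sf(abs(t), df=k - 1)
    return t, p

# toy example: 10-fold scores for two classifiers on 1000 rows
a = [0.81, 0.83, 0.80, 0.82, 0.84, 0.81, 0.83, 0.82, 0.80, 0.82]
b = [0.79, 0.80, 0.78, 0.81, 0.80, 0.79, 0.80, 0.79, 0.78, 0.80]
t, p = corrected_paired_ttest(a, b, n_train=900, n_test=100)
```

A small, consistent gap across all ten folds yields a low p-value here; the same gap on noisier folds would not, which is exactly why the correction matters.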
The core API is three calls:
```python
import dissectml as dml

df = dml.load_titanic()
report = dml.analyze(df, target="survived")
report.export("report.html")
```
That's it. Five stages. One function call. One interactive report.
Or use any stage independently:
```python
# Just EDA
eda = dml.explore(df, target="survived")
eda.correlations.heatmap()
eda.missing.patterns()
eda.outliers.plot()
eda.tests.normality()

# Just model comparison
models = dml.battle(df, target="survived")
models.leaderboard()
```
The Architecture Decisions
A few choices I'm proud of:
Lazy evaluation everywhere. dml.explore() returns instantly. Computation only happens when you access a sub-module like eda.correlations. This means you never wait for analysis you don't need.
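The lazy pattern described above is easy to picture with Python's functools.cached_property: construction stores the data, and each analysis runs once, on first access. A minimal sketch (class and column names are illustrative, not dissectml's actual internals):

```python
import pandas as pd
from functools import cached_property

class LazyEDA:
    """Construction is instant; analyses run (once) on first access."""

    def __init__(self, df):
        self.df = df  # stored, not analyzed

    @cached_property
    def correlations(self):
        # expensive work deferred until eda.correlations is touched,
        # then cached so repeated access is free
        return self.df.corr(numeric_only=True)

df = pd.DataFrame({"x": [1, 2, 3], "y": [2, 4, 6]})
eda = LazyEDA(df)        # returns instantly: nothing computed yet
corr = eda.correlations  # correlation matrix computed here, then cached
```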
EDA informs model training. The intelligence stage detects your data characteristics (non-linearity, sparsity, cardinality) and feeds that into the battle stage's preprocessing. Tree-based models skip scaling. High-cardinality categoricals get target encoding instead of one-hot. The pipeline adapts to your data.
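As a sketch of what "the pipeline adapts to your data" can mean in code (function name, parameters, and thresholds here are assumptions for illustration, not dissectml's real internals):

```python
import pandas as pd

def choose_preprocessing(df, model_family, max_onehot_card=10):
    """Derive a preprocessing plan from data characteristics."""
    plan = {}
    # tree ensembles are invariant to monotone feature scaling,
    # so scaling is skipped for them
    plan["scale"] = model_family not in {"tree", "boosting"}
    for col in df.select_dtypes(include="object").columns:
        # low-cardinality categoricals get one-hot encoding,
        # high-cardinality ones get target encoding
        plan[col] = "onehot" if df[col].nunique() <= max_onehot_card else "target"
    return plan

df = pd.DataFrame({"city": ["a", "b", "c"], "tier": ["x", "x", "y"]})
plan = choose_preprocessing(df, model_family="tree", max_onehot_card=2)
```

The point of the design is that these decisions come from the intelligence stage's measurements rather than from hard-coded defaults.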
Optional dependencies done right. Core package needs only sklearn + plotly. XGBoost/LightGBM/CatBoost install with pip install dissectml[boost]. SHAP with [explain]. If an optional model isn't installed, it's silently skipped — no crashes.
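The "silently skipped" behaviour is the standard try-import pattern. A generic sketch of it (the function and the candidate dict are mine; dissectml's registry details are not shown here):

```python
import importlib

def optional_models(candidates):
    """Return only the candidates whose backing module imports cleanly."""
    available = {}
    for name, module_name in candidates.items():
        try:
            available[name] = importlib.import_module(module_name)
        except ImportError:
            pass  # missing optional dependency: skip, never crash
    return available

# stdlib "json" is always importable; the second entry never is
models = optional_models({"json_backend": "json",
                          "missing_backend": "no_such_pkg_xyz"})
```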
Modular plugin architecture. Each EDA sub-module, each model entry, each comparison method is a self-contained unit. Want to add a custom model? Register it with the model registry. Want to add a custom EDA analysis? Extend the base class.
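A decorator-based registry is the usual way to make a model catalog pluggable; a minimal sketch under that assumption (names are illustrative, not dissectml's actual registration API):

```python
MODEL_REGISTRY = {}

def register_model(name):
    """Class decorator that adds an entry to the shared catalog."""
    def wrap(cls):
        MODEL_REGISTRY[name] = cls
        return cls
    return wrap

@register_model("my_custom_model")
class MyModel:
    """Toy model implementing the minimal fit/predict contract."""

    def fit(self, X, y):
        return self

    def predict(self, X):
        return [0] * len(X)
```

With this shape, "want to add a custom model? Register it" is literally one decorator line on the user's class.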
The Numbers
- 11,000+ lines of source code across 67 files
- 600+ tests, all passing, 82% coverage
- 0 lint issues (ruff-clean)
- 19 classifiers + 19 regressors in the model catalog
- 10 EDA sub-modules: overview, univariate, bivariate, correlations, missing, outliers, statistical tests, clusters, interactions, target analysis
- 148KB wheel on PyPI
Try It Now
```shell
pip install dissectml
```

```python
import dissectml as dml

# Load the built-in Titanic dataset
df = dml.load_titanic()

# Full pipeline: EDA → Intelligence → Battle → Compare → Report
report = dml.analyze(df, target="survived")
print(report.summary())
report.export("report.html")
```
GitHub: github.com/rupeshbharambe24/InsightML
PyPI: pypi.org/project/dissectml
If you find this useful, a ⭐ on GitHub means a lot — it's what helps open-source projects get discovered.
What's Next
- v0.2: Polars backend for 10x EDA speed on large datasets
- v0.3: Deep learning models (PyTorch MLP, TabNet)
- v0.4: PDF export and branded report templates
- v0.5: LLM-powered narrative insights (natural language summaries of findings)
I built this because I was tired of stitching together five libraries every time I started a new ML project. If you feel the same way, give dissectml a try and let me know what you think.
🚀 Try it now:
👉 Run in Google Colab — full demo, runs in your browser in 60 seconds
👉 Kaggle Notebook — with rendered outputs
👉 pip install dissectml — install locally
Rupesh Bharambe — AI/ML Engineer & Open Source Developer
Find me on GitHub