A hands-on tutorial with dissectml — the library that combines deep EDA with model comparison.
Let me show you something. This is how most data scientists start a project:
import pandas as pd
from ydata_profiling import ProfileReport
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns
import shap
# ... 150 more lines of boilerplate
And this is the same thing with dissectml:
import dissectml as dml
report = dml.analyze(df, target="survived")
report.export("report.html")
Same output. Same depth. Three lines. Let me walk you through what happens under the hood.
Setup
pip install dissectml
For this tutorial, we'll use the built-in Titanic dataset:
import dissectml as dml
df = dml.load_titanic()
print(f"Dataset: {df.shape[0]} rows × {df.shape[1]} columns")
# Dataset: 891 rows × 8 columns
Stage 1: Deep EDA
eda = dml.explore(df, target="survived")
This returns instantly — dissectml uses lazy evaluation, so nothing computes until you ask for it. Now let's explore:
Overview
eda.overview.show()
This auto-detects column types (numeric, categorical, boolean, datetime, high-cardinality, constant), shows memory usage, and generates a type distribution chart.
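Type detection like this can be approximated with plain pandas. Here's a minimal sketch of the idea (my own simplification, not dissectml's actual logic; the `high_card_threshold` cutoff is an assumed parameter):

```python
import pandas as pd

def detect_column_types(df: pd.DataFrame, high_card_threshold: int = 50) -> dict:
    """Rough column-type detection in the spirit of an EDA overview."""
    types = {}
    for col in df.columns:
        s = df[col]
        if s.nunique(dropna=True) <= 1:
            types[col] = "constant"
        elif pd.api.types.is_bool_dtype(s):
            types[col] = "boolean"
        elif pd.api.types.is_datetime64_any_dtype(s):
            types[col] = "datetime"
        elif pd.api.types.is_numeric_dtype(s):
            types[col] = "numeric"
        elif s.nunique(dropna=True) > high_card_threshold:
            types[col] = "high-cardinality"
        else:
            types[col] = "categorical"
    return types

df = pd.DataFrame({
    "age": [22, 38, 26],
    "sex": ["male", "female", "female"],
    "alive": [False, True, True],
    "ship": ["Titanic"] * 3,
})
print(detect_column_types(df))
# {'age': 'numeric', 'sex': 'categorical', 'alive': 'boolean', 'ship': 'constant'}
```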
Correlations
eda.correlations.heatmap()
Unlike basic df.corr(), this computes a unified correlation matrix that handles mixed types: Pearson for numeric-numeric, Cramér's V for categorical-categorical, and correlation ratio (eta) for numeric-categorical pairs. All in one heatmap.
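If you're curious what those two extra measures compute, here's a rough sketch with scipy and pandas (my own illustration of the standard formulas, not dissectml's internals):

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(x: pd.Series, y: pd.Series) -> float:
    """Cramér's V for two categorical series (0 = independent, 1 = perfect association)."""
    table = pd.crosstab(x, y)
    chi2 = chi2_contingency(table, correction=False)[0]
    n = table.to_numpy().sum()
    r, k = table.shape
    return float(np.sqrt(chi2 / (n * (min(r, k) - 1))))

def correlation_ratio(categories: pd.Series, values: pd.Series) -> float:
    """Correlation ratio (eta) between a categorical and a numeric series."""
    grand_mean = values.mean()
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2
                     for _, g in values.groupby(categories))
    ss_total = ((values - grand_mean) ** 2).sum()
    return float(np.sqrt(ss_between / ss_total))

sex = pd.Series(["a", "a", "b", "b"])
port = pd.Series(["u", "u", "v", "v"])
print(cramers_v(sex, port))  # 1.0: each variable fully determines the other

group = pd.Series(["a", "a", "b", "b"])
fare = pd.Series([1.0, 2.0, 10.0, 11.0])
print(round(correlation_ratio(group, fare), 3))  # 0.994
```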
Missing Data Intelligence
eda.missing.patterns()
This goes beyond "column X has 20% missing." It analyzes the pattern of missingness — is it Missing Completely At Random (MCAR), Missing At Random (MAR), or Missing Not At Random (MNAR)? This determines which imputation strategy you should use.
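Formally distinguishing MCAR from MAR needs a dedicated test (e.g. Little's MCAR test), but a quick heuristic is to check whether a column's missing-indicator depends on other columns. A sketch on synthetic data (the passenger-class scenario below is invented for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
pclass = rng.integers(1, 4, n)  # classes 1, 2, 3
age = rng.normal(30, 10, n)
# Simulate MAR: age is missing far more often for third-class passengers
mask = (pclass == 3) & (rng.random(n) < 0.5)
df = pd.DataFrame({"pclass": pclass, "age": np.where(mask, np.nan, age)})

# If the missing-indicator of a column depends on another column,
# the data is not MCAR (here it is MAR by construction)
is_missing = df["age"].isna().astype(int)
rate_by_class = is_missing.groupby(df["pclass"]).mean()
print(rate_by_class)
```

Classes 1 and 2 show a missing rate of zero while class 3 sits near 50%, which is exactly the dependence an MCAR assumption would rule out.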
Outlier Detection
eda.outliers.plot()
Runs three methods simultaneously — IQR, Z-score, and Isolation Forest — and shows a consensus view. Points flagged by all three methods are the most confident outliers.
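The consensus idea is straightforward to sketch with numpy and sklearn (my own illustration of the three methods, not dissectml's implementation):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
x = np.concatenate([rng.normal(0, 1, 200), [8.0, -9.0]])  # two planted outliers

# Method 1: IQR fences
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
iqr_flag = (x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)

# Method 2: z-score beyond 3 standard deviations
z_flag = np.abs((x - x.mean()) / x.std()) > 3

# Method 3: Isolation Forest
iso_flag = IsolationForest(random_state=0).fit_predict(x.reshape(-1, 1)) == -1

# Consensus: a point is a confident outlier only if all three agree
consensus = iqr_flag & z_flag & iso_flag
print("consensus outlier indices:", np.where(consensus)[0])
```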
Statistical Tests
eda.tests.normality()
eda.tests.independence()
Automated Shapiro-Wilk normality tests for all numeric columns, chi-square independence tests for categorical pairs, and ANOVA/Kruskal-Wallis for group comparisons against the target.
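Both test families live in scipy, so you can see what's being automated here. A minimal sketch on synthetic data (the contingency-table counts are invented for illustration):

```python
import numpy as np
from scipy.stats import shapiro, chi2_contingency

rng = np.random.default_rng(1)
normal_col = rng.normal(50, 5, 300)
skewed_col = rng.exponential(2.0, 300)

# Shapiro-Wilk: the null hypothesis is that the sample is normal,
# so a small p-value means "reject normality"
for name, col in [("normal_col", normal_col), ("skewed_col", skewed_col)]:
    stat, p = shapiro(col)
    print(f"{name}: W={stat:.3f}, p={p:.4f}")

# Chi-square independence test on a 2x2 contingency table of counts
table = np.array([[60, 40],    # e.g. group A: negative / positive
                  [20, 80]])   # e.g. group B: negative / positive
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.2f}, p={p:.2g}")
```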
Cluster Discovery
eda.clusters.scatter_2d()
Automatically runs K-Means and DBSCAN, finds the optimal number of clusters, and visualizes them with PCA projection. Reveals hidden structure in your data before you even start modeling.
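One common way to "find the optimal number of clusters" is to scan k and keep the best silhouette score; here's a sketch of that recipe with sklearn (an assumption about the approach, not dissectml's actual selection rule):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(7)
# Three well-separated blobs in 4 dimensions
X = np.vstack([rng.normal(c, 0.3, size=(50, 4)) for c in (0.0, 3.0, 6.0)])

# Pick k by silhouette score (higher = tighter, better-separated clusters)
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
best_k = max(scores, key=scores.get)
print("best k:", best_k)

# Project to 2-D with PCA for plotting
coords = PCA(n_components=2).fit_transform(X)
print(coords.shape)  # (150, 2)
```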
Stage 2: Pre-Model Intelligence
intel = dml.analyze_intelligence(df, target="survived", task="classification")
Data Readiness Score
intel.readiness.show()
# Data Readiness: 96/100 (Grade A)
A composite score from 0-100 based on missing values, class imbalance, multicollinearity, outlier prevalence, and feature quality. No other library does this.
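To make "composite score" concrete, here's a toy version that penalizes three of those factors. The weights and cutoffs below are invented for illustration; dissectml's actual scoring is richer than this:

```python
import numpy as np
import pandas as pd

def readiness_score(df: pd.DataFrame, target: str) -> int:
    """Toy 0-100 readiness score: penalize missingness, imbalance, collinearity."""
    score = 100.0
    features = df.drop(columns=[target])

    # Penalty 1: overall fraction of missing cells (up to 30 points)
    missing_ratio = features.isna().mean().mean()
    score -= 30 * min(missing_ratio / 0.5, 1.0)

    # Penalty 2: class imbalance in the target (up to 30 points)
    counts = df[target].value_counts(normalize=True)
    imbalance = counts.max() - 1.0 / len(counts)
    score -= 30 * min(imbalance / 0.5, 1.0)

    # Penalty 3: strongest pairwise correlation among numeric features (up to 20 points)
    num = features.select_dtypes("number")
    if num.shape[1] >= 2:
        corr = num.corr().abs().to_numpy()
        upper = corr[np.triu_indices_from(corr, k=1)]
        score -= 20 * min(float(upper.max()), 1.0)

    return int(round(max(score, 0)))

df = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6],
    "b": [6, 1, 5, 2, 4, 3],
    "y": [0, 1, 0, 1, 0, 1],
})
print(readiness_score(df, "y"))  # 95: clean and balanced, mild a/b correlation
```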
Target Leakage Detection
intel.leakage
Four-pronged leakage scan: suspiciously high correlations, look-ahead bias in temporal features, near-perfect predictors, and data contamination patterns. Catches issues that silently inflate your metrics.
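The "near-perfect predictor" prong is easy to illustrate: a single feature that predicts the target almost perfectly on its own is usually leakage, not signal. A sketch with sklearn (the `refund_issued` feature name and the scan itself are my own invention, not dissectml's code):

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(3)
n = 400
y = rng.integers(0, 2, n)
df = pd.DataFrame({
    "age": rng.normal(30, 10, n),                 # legitimate feature
    "refund_issued": y ^ (rng.random(n) < 0.02),  # leaks the target ~98% of the time
})

# Flag features whose single-feature AUC is suspiciously close to 1
for col in df.columns:
    auc = roc_auc_score(y, df[col])
    auc = max(auc, 1 - auc)  # direction-agnostic
    if auc > 0.95:
        print(f"possible leakage: {col} (single-feature AUC = {auc:.3f})")
```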
Algorithm Recommendations
intel.recommendations.show()
Based on your data characteristics (size, non-linearity, cardinality, sparsity), recommends which algorithm families to prioritize. Small dataset with non-linear relationships? Trees and ensembles rank high, neural nets rank low.
Stage 3: Model Battle
models = dml.battle(df, target="survived")
models.leaderboard()
This trains 19 classifiers in parallel with cross-validation and returns a sorted leaderboard:
```text
   model                       accuracy  f1_weighted  train_time_s
0  GradientBoostingClassifier    0.8260       0.8245          5.01
1  RandomForestClassifier        0.8080       0.8062          3.90
2  LogisticRegression            0.7970       0.7958          0.84
...
```
Each model is automatically paired with appropriate preprocessing — tree-based models skip scaling, linear models get StandardScaler, categorical features get encoded based on cardinality.
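In plain sklearn, that pairing looks something like this (a sketch of the idea, assuming typical Titanic-style column names; not dissectml's internal wiring):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

numeric = ["age", "fare"]
categorical = ["sex", "embarked"]

# Linear model: scale numerics, one-hot encode categoricals
linear_pipe = Pipeline([
    ("prep", ColumnTransformer([
        ("num", StandardScaler(), numeric),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
    ])),
    ("model", LogisticRegression(max_iter=1000)),
])

# Tree model: scaling is unnecessary, so numerics pass through unchanged
tree_pipe = Pipeline([
    ("prep", ColumnTransformer([
        ("num", "passthrough", numeric),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
    ])),
    ("model", RandomForestClassifier(n_estimators=200, random_state=0)),
])

X = pd.DataFrame({"age": [22, 38, 26, 35], "fare": [7.25, 71.3, 7.9, 53.1],
                  "sex": ["male", "female", "female", "male"],
                  "embarked": ["S", "C", "S", "S"]})
y = [0, 1, 1, 1]
linear_pipe.fit(X, y)
tree_pipe.fit(X, y)
print(linear_pipe.predict(X))
```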
Want only specific models?
```python
# Filter by family
models = dml.battle(df, target="survived", families=["tree", "linear"])

# Or pick specific models
models = dml.battle(df, target="survived",
                    models=["RandomForestClassifier", "LogisticRegression", "XGBClassifier"])
```
Stage 4: Full Pipeline
Now let's run everything together:
report = dml.analyze(df, target="survived")
This chains all stages: EDA → Intelligence → Battle → Compare. The returned report object gives you access to everything:
# Text summary
print(report.summary())
# === DissectML Analysis Report ===
# Task: classification | Target: survived
# Dataset: 891 samples × 7 features
# Data Readiness: 96/100 (Grade A)
# Best Model: GradientBoostingClassifier (accuracy=0.8260)
# Access any sub-result
report.eda.correlations.heatmap()
report.models.leaderboard()
report.intelligence.readiness.show()
# Export interactive HTML report
report.export("report.html")
The HTML report is a single self-contained file with interactive Plotly charts, collapsible sections, a sidebar table of contents, and narrative summaries. Open it in any browser, share it with stakeholders, attach it to an email.
Configuration
```python
# View current settings
dml.get_config()

# Customize for this session
with dml.config_context(cv_folds=10, random_state=123, n_jobs=-1):
    report = dml.analyze(df, target="survived")
```
Installation Options
# Core (sklearn + plotly only)
pip install dissectml
# With XGBoost, LightGBM, CatBoost
pip install dissectml[boost]
# With SHAP explainability
pip install dissectml[explain]
# Everything
pip install dissectml[full]
What Makes This Different
I've used PyCaret, LazyPredict, and YData Profiling extensively. They're great tools. But each one covers only part of the workflow:
| What You Need | Old Way | dissectml |
|---|---|---|
| Understand your data | YData Profiling | dml.explore(df) |
| Check for leakage/issues | Manual code | dml.analyze_intelligence(df) |
| Compare models | PyCaret/LazyPredict | dml.battle(df) |
| Explain why models differ | SHAP + matplotlib | report.compare |
| Share findings | Copy-paste into slides | report.export("report.html") |
| All of the above | 5 libraries, 200 lines | 3 lines |
The key insight: these stages shouldn't be independent tools. Your EDA findings should inform your model preprocessing. Your model comparison should include statistical significance tests. Your report should contain both data insights and model insights in one place.
Links
- Install: pip install dissectml
- GitHub: github.com/rupeshbharambe24/dissectML
- PyPI: pypi.org/project/dissectml
If this saves you time, drop a ⭐ on GitHub — it genuinely helps with discoverability.
Rupesh Bharambe — ML Engineer & Open Source Developer