Rupesh Bharambe

From Raw CSV to Model Comparison in 3 Lines of Python

A hands-on tutorial with dissectml — the library that combines deep EDA with model comparison.


Let me show you something. This is how most data scientists start a project:

import pandas as pd
from ydata_profiling import ProfileReport
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns
import shap
# ... 150 more lines of boilerplate

And this is the same thing with dissectml:

import dissectml as dml

report = dml.analyze(df, target="survived")
report.export("report.html")

Same output. Same depth. Three lines. Let me walk you through what happens under the hood.


Setup

pip install dissectml

For this tutorial, we'll use the built-in Titanic dataset:

import dissectml as dml

df = dml.load_titanic()
print(f"Dataset: {df.shape[0]} rows × {df.shape[1]} columns")
# Dataset: 891 rows × 8 columns

Stage 1: Deep EDA

eda = dml.explore(df, target="survived")

This returns instantly — dissectml uses lazy evaluation, so nothing computes until you ask for it. Now let's explore:

Overview

eda.overview.show()

This auto-detects column types (numeric, categorical, boolean, datetime, high-cardinality, constant), shows memory usage, and generates a type distribution chart.
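
If you're curious what that kind of type inference involves, here's a rough sketch in plain pandas. This is illustrative only, not dissectml's actual logic, and the 50-unique-values cutoff for "high-cardinality" is an arbitrary choice:

import pandas as pd

def infer_kind(s: pd.Series) -> str:
    # Order matters: check the most specific kinds first.
    if s.nunique(dropna=True) <= 1:
        return "constant"
    if pd.api.types.is_bool_dtype(s):
        return "boolean"
    if pd.api.types.is_datetime64_any_dtype(s):
        return "datetime"
    if pd.api.types.is_numeric_dtype(s):
        return "numeric"
    return "high-cardinality" if s.nunique() > 50 else "categorical"

print({col: infer_kind(df[col]) for col in df.columns})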

Correlations

eda.correlations.heatmap()

Unlike basic df.corr(), this computes a unified correlation matrix that handles mixed types: Pearson for numeric-numeric, Cramér's V for categorical-categorical, and correlation ratio (eta) for numeric-categorical pairs. All in one heatmap.
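
Those two extra statistics are standard; here's roughly how they are computed with pandas and scipy. This is a sketch of the math, not dissectml's own code:

import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(x: pd.Series, y: pd.Series) -> float:
    """Association between two categorical columns (0 = none, 1 = perfect)."""
    table = pd.crosstab(x, y)
    chi2 = chi2_contingency(table)[0]
    n = table.to_numpy().sum()
    r, k = table.shape
    return np.sqrt((chi2 / n) / min(r - 1, k - 1))

def correlation_ratio(categories: pd.Series, values: pd.Series) -> float:
    """Eta: how much of a numeric column's variance the category explains."""
    overall_mean = values.mean()
    between = sum(
        len(group) * (group.mean() - overall_mean) ** 2
        for _, group in values.groupby(categories)
    )
    total = ((values - overall_mean) ** 2).sum()
    return float(np.sqrt(between / total)) if total > 0 else 0.0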

Missing Data Intelligence

eda.missing.patterns()

This goes beyond "column X has 20% missing." It analyzes the pattern of missingness — is it Missing Completely At Random (MCAR), Missing At Random (MAR), or Missing Not At Random (MNAR)? This determines which imputation strategy you should use.
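
The intuition is easy to check by hand: if the rows where a column is missing look systematically different on the other columns, the missingness is not completely at random. A rough sketch, assuming the Titanic frame has an "age" column with gaps (this is not dissectml's actual test):

from scipy.stats import mannwhitneyu

missing_mask = df["age"].isna()
for col in df.select_dtypes("number").columns:
    if col == "age":
        continue
    present = df.loc[~missing_mask, col].dropna()
    absent = df.loc[missing_mask, col].dropna()
    if len(absent) > 10:
        stat, p = mannwhitneyu(present, absent)
        flag = "related to missingness" if p < 0.05 else "no evidence"
        print(f"{col:>10}: p={p:.3f} ({flag})")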

Outlier Detection

eda.outliers.plot()

Runs three methods simultaneously — IQR, Z-score, and Isolation Forest — and shows a consensus view. Points flagged by all three methods are the most confident outliers.
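
For reference, the three methods themselves are short one-liners with scipy and sklearn. A minimal sketch of the consensus idea, using conventional thresholds that may differ from dissectml's defaults:

import numpy as np
from scipy import stats
from sklearn.ensemble import IsolationForest

X = df.select_dtypes("number").dropna()

# 1. IQR rule: outside 1.5 * IQR on any numeric column
q1, q3 = X.quantile(0.25), X.quantile(0.75)
iqr_flag = ((X < q1 - 1.5 * (q3 - q1)) | (X > q3 + 1.5 * (q3 - q1))).any(axis=1)

# 2. Z-score rule: any column more than 3 standard deviations out
z_flag = (np.abs(stats.zscore(X)) > 3).any(axis=1)

# 3. Isolation Forest on all numeric columns together
iso_flag = IsolationForest(random_state=0).fit_predict(X) == -1

consensus = iqr_flag.to_numpy() & z_flag & iso_flag
print(f"Flagged by all three methods: {consensus.sum()} rows")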

Statistical Tests

eda.tests.normality()
eda.tests.independence()

Automated Shapiro-Wilk normality tests for all numeric columns, chi-square independence tests for categorical pairs, and ANOVA/Kruskal-Wallis for group comparisons against the target.
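
The underlying tests all live in scipy. A sketch of the manual equivalents, assuming the frame has "fare" and "sex" columns:

import pandas as pd
from scipy.stats import shapiro, chi2_contingency, kruskal

# Normality of a numeric column
stat, p = shapiro(df["fare"].dropna())
print(f"Shapiro-Wilk on fare: p={p:.4f}")

# Independence of two categoricals
table = pd.crosstab(df["sex"], df["survived"])
chi2, p, dof, _ = chi2_contingency(table)
print(f"Chi-square sex vs survived: p={p:.4f}")

# Group comparison against the target (non-parametric)
groups = [g["fare"].dropna() for _, g in df.groupby("survived")]
stat, p = kruskal(*groups)
print(f"Kruskal-Wallis fare by survived: p={p:.4f}")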

Cluster Discovery

eda.clusters.scatter_2d()

Automatically runs K-Means and DBSCAN, finds the optimal number of clusters, and visualizes them with PCA projection. Reveals hidden structure in your data before you even start modeling.
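
Done by hand, that pass looks roughly like this with sklearn (showing only the K-Means half). An illustrative sketch; dissectml may pick k differently:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(df.select_dtypes("number").dropna())

# Pick k by silhouette score
scores = {
    k: silhouette_score(X, KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X))
    for k in range(2, 7)
}
best_k = max(scores, key=scores.get)
labels = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit_predict(X)

# Visualize on the first two principal components
coords = PCA(n_components=2).fit_transform(X)
plt.scatter(coords[:, 0], coords[:, 1], c=labels, s=10)
plt.title(f"K-Means (k={best_k}) on PCA projection")
plt.show()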


Stage 2: Pre-Model Intelligence

intel = dml.analyze_intelligence(df, target="survived", task="classification")

Data Readiness Score

intel.readiness.show()
# Data Readiness: 96/100 (Grade A)

A composite score from 0-100 based on missing values, class imbalance, multicollinearity, outlier prevalence, and feature quality. No other library does this.
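
The post doesn't spell out the formula, but a composite like this can be as simple as starting from 100 and deducting per issue. A purely hypothetical sketch to make the idea concrete; the checks and weights below are invented for illustration, not dissectml's:

import numpy as np

def readiness_sketch(df, target):
    score = 100.0
    # Missing values: penalize the overall missing fraction
    score -= 30 * df.isna().mean().mean()
    # Class imbalance: penalize deviation from a uniform target distribution
    freqs = df[target].value_counts(normalize=True)
    score -= 20 * (freqs.max() - 1 / len(freqs))
    # Multicollinearity: penalize the share of numeric pairs with |r| > 0.9
    corr = df.select_dtypes("number").corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    score -= 10 * (upper > 0.9).sum().sum() / max(upper.count().sum(), 1)
    return round(score, 1)

print(readiness_sketch(df, "survived"))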

Target Leakage Detection

intel.leakage

Four-pronged leakage scan: suspiciously high correlations, look-ahead bias in temporal features, near-perfect predictors, and data contamination patterns. Catches issues that silently inflate your metrics.
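
To make one of those checks concrete, here's the "near-perfect predictor" idea done manually: fit a tiny model on each feature alone and flag suspiciously high cross-validated scores. A sketch of the concept, not dissectml's implementation:

from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

y = df["survived"]
for col in df.select_dtypes("number").columns.drop("survived", errors="ignore"):
    X_col = df[[col]].fillna(df[col].median())
    auc = cross_val_score(DecisionTreeClassifier(max_depth=3), X_col, y,
                          cv=5, scoring="roc_auc").mean()
    if auc > 0.95:
        print(f"Possible leakage: '{col}' alone reaches AUC {auc:.3f}")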

Algorithm Recommendations

intel.recommendations.show()

Based on your data characteristics (size, non-linearity, cardinality, sparsity), recommends which algorithm families to prioritize. Small dataset with non-linear relationships? Trees and ensembles rank high, neural nets rank low.


Stage 3: Model Battle

models = dml.battle(df, target="survived")
models.leaderboard()

This trains 19 classifiers in parallel with cross-validation and returns a sorted leaderboard:

                     model         accuracy    f1_weighted    train_time_s
0   GradientBoostingClassifier     0.8260       0.8245         5.01
1   RandomForestClassifier         0.8080       0.8062         3.90
2   LogisticRegression             0.7970       0.7958         0.84
...

Each model is automatically paired with appropriate preprocessing — tree-based models skip scaling, linear models get StandardScaler, categorical features get encoded based on cardinality.
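
Spelled out with plain sklearn, that pairing looks roughly like this. A sketch of the idea, not the generated pipelines themselves; high-cardinality columns would get a different encoder than one-hot:

from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric = df.select_dtypes("number").columns.drop("survived", errors="ignore")
categorical = df.select_dtypes(exclude="number").columns

def make_pipeline(model, scale_numeric):
    num_steps = [("impute", SimpleImputer(strategy="median"))]
    if scale_numeric:  # linear models care about feature scale, trees don't
        num_steps.append(("scale", StandardScaler()))
    pre = ColumnTransformer([
        ("num", Pipeline(num_steps), numeric),
        ("cat", Pipeline([
            ("impute", SimpleImputer(strategy="most_frequent")),
            ("onehot", OneHotEncoder(handle_unknown="ignore")),
        ]), categorical),
    ])
    return Pipeline([("pre", pre), ("model", model)])

linear_pipe = make_pipeline(LogisticRegression(max_iter=1000), scale_numeric=True)
tree_pipe = make_pipeline(RandomForestClassifier(), scale_numeric=False)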

Want only specific models?

# Filter by family
models = dml.battle(df, target="survived", families=["tree", "linear"])

# Or pick specific models
models = dml.battle(df, target="survived", 
                    models=["RandomForestClassifier", "LogisticRegression", "XGBClassifier"])

Stage 4: Full Pipeline

Now let's run everything together:

report = dml.analyze(df, target="survived")

This chains all stages: EDA → Intelligence → Battle → Compare. The returned report object gives you access to everything:

# Text summary
print(report.summary())
# === DissectML Analysis Report ===
# Task: classification  |  Target: survived
# Dataset: 891 samples × 7 features
# Data Readiness: 96/100 (Grade A)
# Best Model: GradientBoostingClassifier (accuracy=0.8260)

# Access any sub-result
report.eda.correlations.heatmap()
report.models.leaderboard()
report.intelligence.readiness.show()

# Export interactive HTML report
report.export("report.html")

The HTML report is a single self-contained file with interactive Plotly charts, collapsible sections, a sidebar table of contents, and narrative summaries. Open it in any browser, share it with stakeholders, attach it to an email.


Configuration

# View current settings
dml.get_config()

# Customize for this session
with dml.config_context(cv_folds=10, random_state=123, n_jobs=-1):
    report = dml.analyze(df, target="survived")

Installation Options

# Core (sklearn + plotly only)
pip install dissectml

# With XGBoost, LightGBM, CatBoost
pip install dissectml[boost]

# With SHAP explainability
pip install dissectml[explain]

# Everything
pip install dissectml[full]

What Makes This Different

I've used PyCaret, LazyPredict, and YData Profiling extensively. They're great tools. But each one covers only part of the workflow:

What You Need                 Old Way                   dissectml
Understand your data          YData Profiling           dml.explore(df)
Check for leakage/issues      Manual code               dml.analyze_intelligence(df)
Compare models                PyCaret/LazyPredict       dml.battle(df)
Explain why models differ     SHAP + matplotlib         report.compare
Share findings                Copy-paste into slides    report.export("report.html")
All of the above              5 libraries, 200 lines    3 lines

The key insight: these stages shouldn't be independent tools. Your EDA findings should inform your model preprocessing. Your model comparison should include statistical significance tests. Your report should contain both data insights and model insights in one place.
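
For instance, "statistical significance" between two models can be as simple as a paired test on their fold-by-fold CV scores. A minimal sketch with scipy and sklearn, using numeric columns only to keep it short:

from scipy.stats import wilcoxon
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X = df.select_dtypes("number").drop(columns="survived").fillna(0)
y = df["survived"]

# Same deterministic folds for both models, so the scores are paired
scores_a = cross_val_score(GradientBoostingClassifier(), X, y, cv=10)
scores_b = cross_val_score(RandomForestClassifier(), X, y, cv=10)
stat, p = wilcoxon(scores_a, scores_b)
print(f"Paired Wilcoxon over 10 folds: p = {p:.3f}")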


Links

If this saves you time, drop a ⭐ on GitHub — it genuinely helps with discoverability.


Rupesh Bharambe — ML Engineer & Open Source Developer
