Rupesh Bharambe

From Raw CSV to Model Comparison in 3 Lines of Python

A hands-on tutorial with dissectml — the library that combines deep EDA with model comparison.


Let me show you something. This is how most data scientists start a project:

import pandas as pd
from ydata_profiling import ProfileReport
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns
import shap
# ... 150 more lines of boilerplate

And this is the same thing with dissectml:

import dissectml as dml

report = dml.analyze(df, target="survived")
report.export("report.html")

Same output. Same depth. Three lines. Let me walk you through what happens under the hood.


Setup

pip install dissectml

For this tutorial, we'll use the built-in Titanic dataset:

import dissectml as dml

df = dml.load_titanic()
print(f"Dataset: {df.shape[0]} rows × {df.shape[1]} columns")
# Dataset: 891 rows × 8 columns

Stage 1: Deep EDA

eda = dml.explore(df, target="survived")

This returns instantly — dissectml uses lazy evaluation, so nothing computes until you ask for it. Now let's explore:

Overview

eda.overview.show()

This auto-detects column types (numeric, categorical, boolean, datetime, high-cardinality, constant), shows memory usage, and generates a type distribution chart.
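
If you're curious what that kind of type inference involves, here's a rough sketch in plain pandas. This is illustrative only, not dissectml's actual logic, and the 50-unique-values cutoff for "high-cardinality" is an arbitrary choice:

import pandas as pd

def infer_kind(s: pd.Series) -> str:
    # Order matters: check the most specific kinds first.
    if s.nunique(dropna=True) <= 1:
        return "constant"
    if pd.api.types.is_bool_dtype(s):
        return "boolean"
    if pd.api.types.is_datetime64_any_dtype(s):
        return "datetime"
    if pd.api.types.is_numeric_dtype(s):
        return "numeric"
    return "high-cardinality" if s.nunique() > 50 else "categorical"

print({col: infer_kind(df[col]) for col in df.columns})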

Correlations

eda.correlations.heatmap()

Unlike basic df.corr(), this computes a unified correlation matrix that handles mixed types: Pearson for numeric-numeric, Cramér's V for categorical-categorical, and correlation ratio (eta) for numeric-categorical pairs. All in one heatmap.
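
Those two extra statistics are standard; here's roughly how they are computed with pandas and scipy. This is a sketch of the math, not dissectml's own code:

import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(x: pd.Series, y: pd.Series) -> float:
    """Association between two categorical columns (0 = none, 1 = perfect)."""
    table = pd.crosstab(x, y)
    chi2 = chi2_contingency(table)[0]
    n = table.to_numpy().sum()
    r, k = table.shape
    return np.sqrt((chi2 / n) / min(r - 1, k - 1))

def correlation_ratio(categories: pd.Series, values: pd.Series) -> float:
    """Eta: how much of a numeric column's variance the category explains."""
    overall_mean = values.mean()
    between = sum(
        len(group) * (group.mean() - overall_mean) ** 2
        for _, group in values.groupby(categories)
    )
    total = ((values - overall_mean) ** 2).sum()
    return float(np.sqrt(between / total)) if total > 0 else 0.0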

Missing Data Intelligence

eda.missing.patterns()

This goes beyond "column X has 20% missing." It analyzes the pattern of missingness — is it Missing Completely At Random (MCAR), Missing At Random (MAR), or Missing Not At Random (MNAR)? This determines which imputation strategy you should use.
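
The intuition is easy to check by hand: if the rows where a column is missing look systematically different on the other columns, the missingness is not completely at random. A rough sketch, assuming the Titanic frame has an "age" column with gaps (this is not dissectml's actual test):

from scipy.stats import mannwhitneyu

missing_mask = df["age"].isna()
for col in df.select_dtypes("number").columns:
    if col == "age":
        continue
    present = df.loc[~missing_mask, col].dropna()
    absent = df.loc[missing_mask, col].dropna()
    if len(absent) > 10:
        stat, p = mannwhitneyu(present, absent)
        flag = "related to missingness" if p < 0.05 else "no evidence"
        print(f"{col:>10}: p={p:.3f} ({flag})")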

Outlier Detection

eda.outliers.plot()

Runs three methods simultaneously — IQR, Z-score, and Isolation Forest — and shows a consensus view. Points flagged by all three methods are the most confident outliers.
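
For reference, the three methods themselves are short one-liners with scipy and sklearn. A minimal sketch of the consensus idea, using conventional thresholds that may differ from dissectml's defaults:

import numpy as np
from scipy import stats
from sklearn.ensemble import IsolationForest

X = df.select_dtypes("number").dropna()

# 1. IQR rule: outside 1.5 * IQR on any numeric column
q1, q3 = X.quantile(0.25), X.quantile(0.75)
iqr_flag = ((X < q1 - 1.5 * (q3 - q1)) | (X > q3 + 1.5 * (q3 - q1))).any(axis=1)

# 2. Z-score rule: any column more than 3 standard deviations out
z_flag = (np.abs(stats.zscore(X)) > 3).any(axis=1)

# 3. Isolation Forest on all numeric columns together
iso_flag = IsolationForest(random_state=0).fit_predict(X) == -1

consensus = iqr_flag.to_numpy() & z_flag & iso_flag
print(f"Flagged by all three methods: {consensus.sum()} rows")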

Statistical Tests

eda.tests.normality()
eda.tests.independence()

Automated Shapiro-Wilk normality tests for all numeric columns, chi-square independence tests for categorical pairs, and ANOVA/Kruskal-Wallis for group comparisons against the target.
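
The underlying tests all live in scipy. A sketch of the manual equivalents, assuming the frame has "fare" and "sex" columns:

import pandas as pd
from scipy.stats import shapiro, chi2_contingency, kruskal

# Normality of a numeric column
stat, p = shapiro(df["fare"].dropna())
print(f"Shapiro-Wilk on fare: p={p:.4f}")

# Independence of two categoricals
table = pd.crosstab(df["sex"], df["survived"])
chi2, p, dof, _ = chi2_contingency(table)
print(f"Chi-square sex vs survived: p={p:.4f}")

# Group comparison against the target (non-parametric)
groups = [g["fare"].dropna() for _, g in df.groupby("survived")]
stat, p = kruskal(*groups)
print(f"Kruskal-Wallis fare by survived: p={p:.4f}")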

Cluster Discovery

eda.clusters.scatter_2d()

Automatically runs K-Means and DBSCAN, finds the optimal number of clusters, and visualizes them with PCA projection. Reveals hidden structure in your data before you even start modeling.
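
Done by hand, that pass looks roughly like this with sklearn (showing only the K-Means half). An illustrative sketch; dissectml may pick k differently:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(df.select_dtypes("number").dropna())

# Pick k by silhouette score
scores = {
    k: silhouette_score(X, KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X))
    for k in range(2, 7)
}
best_k = max(scores, key=scores.get)
labels = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit_predict(X)

# Visualize on the first two principal components
coords = PCA(n_components=2).fit_transform(X)
plt.scatter(coords[:, 0], coords[:, 1], c=labels, s=10)
plt.title(f"K-Means (k={best_k}) on PCA projection")
plt.show()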


Stage 2: Pre-Model Intelligence

intel = dml.analyze_intelligence(df, target="survived", task="classification")

Data Readiness Score

intel.readiness.show()
# Data Readiness: 96/100 (Grade A)

A composite score from 0-100 based on missing values, class imbalance, multicollinearity, outlier prevalence, and feature quality. No other library does this.
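
The post doesn't spell out the formula, but a composite like this can be as simple as starting from 100 and deducting per issue. A purely hypothetical sketch to make the idea concrete; the checks and weights below are invented for illustration, not dissectml's:

import numpy as np

def readiness_sketch(df, target):
    score = 100.0
    # Missing values: penalize the overall missing fraction
    score -= 30 * df.isna().mean().mean()
    # Class imbalance: penalize deviation from a uniform target distribution
    freqs = df[target].value_counts(normalize=True)
    score -= 20 * (freqs.max() - 1 / len(freqs))
    # Multicollinearity: penalize the share of numeric pairs with |r| > 0.9
    corr = df.select_dtypes("number").corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    score -= 10 * (upper > 0.9).sum().sum() / max(upper.count().sum(), 1)
    return round(score, 1)

print(readiness_sketch(df, "survived"))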

Target Leakage Detection

intel.leakage

Four-pronged leakage scan: suspiciously high correlations, look-ahead bias in temporal features, near-perfect predictors, and data contamination patterns. Catches issues that silently inflate your metrics.
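
To make one of those checks concrete, here's the "near-perfect predictor" idea done manually: fit a tiny model on each feature alone and flag suspiciously high cross-validated scores. A sketch of the concept, not dissectml's implementation:

from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

y = df["survived"]
for col in df.select_dtypes("number").columns.drop("survived", errors="ignore"):
    X_col = df[[col]].fillna(df[col].median())
    auc = cross_val_score(DecisionTreeClassifier(max_depth=3), X_col, y,
                          cv=5, scoring="roc_auc").mean()
    if auc > 0.95:
        print(f"Possible leakage: '{col}' alone reaches AUC {auc:.3f}")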

Algorithm Recommendations

intel.recommendations.show()

Based on your data characteristics (size, non-linearity, cardinality, sparsity), recommends which algorithm families to prioritize. Small dataset with non-linear relationships? Trees and ensembles rank high, neural nets rank low.


Stage 3: Model Battle

models = dml.battle(df, target="survived")
models.leaderboard()

This trains 19 classifiers in parallel with cross-validation and returns a sorted leaderboard:

                     model         accuracy    f1_weighted    train_time_s
0   GradientBoostingClassifier     0.8260       0.8245         5.01
1   RandomForestClassifier         0.8080       0.8062         3.90
2   LogisticRegression             0.7970       0.7958         0.84
...

Each model is automatically paired with appropriate preprocessing — tree-based models skip scaling, linear models get StandardScaler, categorical features get encoded based on cardinality.
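
Spelled out with plain sklearn, that pairing looks roughly like this. A sketch of the idea, not the generated pipelines themselves; high-cardinality columns would get a different encoder than one-hot:

from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric = df.select_dtypes("number").columns.drop("survived", errors="ignore")
categorical = df.select_dtypes(exclude="number").columns

def make_pipeline(model, scale_numeric):
    num_steps = [("impute", SimpleImputer(strategy="median"))]
    if scale_numeric:  # linear models care about feature scale, trees don't
        num_steps.append(("scale", StandardScaler()))
    pre = ColumnTransformer([
        ("num", Pipeline(num_steps), numeric),
        ("cat", Pipeline([
            ("impute", SimpleImputer(strategy="most_frequent")),
            ("onehot", OneHotEncoder(handle_unknown="ignore")),
        ]), categorical),
    ])
    return Pipeline([("pre", pre), ("model", model)])

linear_pipe = make_pipeline(LogisticRegression(max_iter=1000), scale_numeric=True)
tree_pipe = make_pipeline(RandomForestClassifier(), scale_numeric=False)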

Want only specific models?

# Filter by family
models = dml.battle(df, target="survived", families=["tree", "linear"])

# Or pick specific models
models = dml.battle(df, target="survived", 
                    models=["RandomForestClassifier", "LogisticRegression", "XGBClassifier"])

Stage 4: Full Pipeline

Now let's run everything together:

report = dml.analyze(df, target="survived")

This chains all stages: EDA → Intelligence → Battle → Compare. The returned report object gives you access to everything:

# Text summary
print(report.summary())
# === DissectML Analysis Report ===
# Task: classification  |  Target: survived
# Dataset: 891 samples × 7 features
# Data Readiness: 96/100 (Grade A)
# Best Model: GradientBoostingClassifier (accuracy=0.8260)

# Access any sub-result
report.eda.correlations.heatmap()
report.models.leaderboard()
report.intelligence.readiness.show()

# Export interactive HTML report
report.export("report.html")

The HTML report is a single self-contained file with interactive Plotly charts, collapsible sections, a sidebar table of contents, and narrative summaries. Open it in any browser, share it with stakeholders, attach it to an email.


Configuration

# View current settings
dml.get_config()

# Customize for this session
with dml.config_context(cv_folds=10, random_state=123, n_jobs=-1):
    report = dml.analyze(df, target="survived")

Installation Options

# Core (sklearn + plotly only)
pip install dissectml

# With XGBoost, LightGBM, CatBoost
pip install dissectml[boost]

# With SHAP explainability
pip install dissectml[explain]

# Everything
pip install dissectml[full]

What Makes This Different

I've used PyCaret, LazyPredict, and YData Profiling extensively. They're great tools. But each one covers only part of the workflow:

What You Need                 Old Way                   dissectml
Understand your data          YData Profiling           dml.explore(df)
Check for leakage/issues      Manual code               dml.analyze_intelligence(df)
Compare models                PyCaret/LazyPredict       dml.battle(df)
Explain why models differ     SHAP + matplotlib         report.compare
Share findings                Copy-paste into slides    report.export("report.html")
All of the above              5 libraries, 200 lines    3 lines

The key insight: these stages shouldn't be independent tools. Your EDA findings should inform your model preprocessing. Your model comparison should include statistical significance tests. Your report should contain both data insights and model insights in one place.
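
For instance, "statistical significance" between two models can be as simple as a paired test on their fold-by-fold CV scores. A minimal sketch with scipy and sklearn, using numeric columns only to keep it short:

from scipy.stats import wilcoxon
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X = df.select_dtypes("number").drop(columns="survived").fillna(0)
y = df["survived"]

# Same deterministic folds for both models, so the scores are paired
scores_a = cross_val_score(GradientBoostingClassifier(), X, y, cv=10)
scores_b = cross_val_score(RandomForestClassifier(), X, y, cv=10)
stat, p = wilcoxon(scores_a, scores_b)
print(f"Paired Wilcoxon over 10 folds: p = {p:.3f}")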


Links

If this saves you time, drop a ⭐ on GitHub — it genuinely helps with discoverability.


Rupesh Bharambe — ML Engineer & Open Source Developer
