Thesius Code

Posted on • Originally published at datanest-stores.pages.dev

Python Data Analysis Toolkit

Production-ready Python templates for data analysis using pandas, NumPy, and the standard library. Includes EDA workflows, statistical testing, data profiling, visualization recipes, and reusable utility functions. All code uses type hints and follows Google-style docstrings.

Key Features

  • EDA Notebook Templates — structured exploratory analysis with profiling, distributions, and correlations
  • Statistical Testing Suite — t-tests, chi-squared, ANOVA, Mann-Whitney with effect size calculations
  • Data Profiling Engine — automated completeness, uniqueness, and distribution reports
  • Visualization Recipes — 30+ chart patterns for Matplotlib/Seaborn with publication-quality defaults
  • Time Series Utilities — decomposition, rolling statistics, seasonality detection
  • Reusable Utility Functions — date parsing, outlier detection, binning, sampling
  • Pipeline Pattern — chainable data transformation functions with logging
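The pipeline pattern above can be sketched with nothing but the standard library. This is an illustrative example, not the kit's actual implementation: `pipeline`, `drop_nulls`, and `add_margin` are hypothetical names chosen for the demo.

```python
import logging
from typing import Callable

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("pipeline")

def pipeline(data: list[dict], *steps: Callable[[list[dict]], list[dict]]) -> list[dict]:
    """Apply transformation steps in order, logging row counts between steps."""
    for step in steps:
        data = step(data)
        logger.info("%s -> %d rows", step.__name__, len(data))
    return data

def drop_nulls(rows: list[dict]) -> list[dict]:
    # Keep only rows with no missing values
    return [r for r in rows if all(v is not None for v in r.values())]

def add_margin(rows: list[dict]) -> list[dict]:
    # Derive a new column from existing ones
    return [{**r, "margin": r["revenue"] - r["cost"]} for r in rows]

rows = [
    {"revenue": 100.0, "cost": 60.0},
    {"revenue": 80.0, "cost": None},
]
result = pipeline(rows, drop_nulls, add_margin)
# result == [{"revenue": 100.0, "cost": 60.0, "margin": 40.0}]
```

Each step takes and returns the same shape of data, so steps compose freely and the log records how row counts change through the chain.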

Quick Start

from src.python_data_analysis_toolkit.core import DataProfiler, EDARunner
from src.python_data_analysis_toolkit.utils import detect_outliers, bin_column

# 1. Profile your dataset
profiler = DataProfiler(df)
report = profiler.run()
print(report)
# Columns: 15 | Rows: 45,231 | Missing: 3.2% | Duplicates: 847

# 2. Run automated EDA
eda = EDARunner(df, target_column="revenue")
eda.univariate_analysis()
eda.correlation_matrix()
eda.target_vs_features()
eda.save_report("output/eda_report.html")

# 3. Use utility functions
outliers = detect_outliers(df["revenue"], method="iqr", threshold=1.5)
df["revenue_bin"] = bin_column(df["revenue"], bins=5, strategy="quantile")
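The kit's `detect_outliers` implementation isn't shown in this post; a minimal stdlib sketch of the IQR method it names (with the same `threshold=1.5` default) might look like this:

```python
import statistics

def detect_outliers_iqr(values: list[float], threshold: float = 1.5) -> list[int]:
    """Return indices of values outside [Q1 - t*IQR, Q3 + t*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartiles
    iqr = q3 - q1
    lower, upper = q1 - threshold * iqr, q3 + threshold * iqr
    return [i for i, v in enumerate(values) if v < lower or v > upper]

data = [10.0, 12.0, 11.0, 13.0, 12.5, 11.5, 95.0]
print(detect_outliers_iqr(data))  # [6] — the 95.0 reading
```

Note that `statistics.quantiles` uses the exclusive method by default, so quartile values can differ slightly from pandas' defaults on small samples.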

Usage Examples

Data Profiling

from collections import Counter
import statistics

def profile_column(values: list) -> dict:
    """Generate a statistical profile for a single column.

    Args:
        values: List of values from a column (may contain None).

    Returns:
        Dictionary with completeness, uniqueness, and distribution stats.
    """
    non_null = [v for v in values if v is not None]
    total = len(values)
    n = len(non_null)

    profile = {
        "total_count": total,
        "non_null_count": n,
        "null_count": total - n,
        "completeness": round(n / total, 4) if total > 0 else 0,
        "unique_count": len(set(non_null)),
        "uniqueness": round(len(set(non_null)) / n, 4) if n > 0 else 0,
    }

    # Numeric stats
    numeric = [v for v in non_null if isinstance(v, (int, float))]
    if numeric:
        profile.update({
            "min": min(numeric),
            "max": max(numeric),
            "mean": round(statistics.mean(numeric), 4),
            "median": round(statistics.median(numeric), 4),
            "stdev": round(statistics.stdev(numeric), 4) if len(numeric) > 1 else 0,
        })

    # Top values for categoricals
    if not numeric and non_null:
        top_3 = Counter(non_null).most_common(3)
        profile["top_values"] = top_3

    return profile
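To show what a table-level report built on per-column profiles could look like, here is a small self-contained sketch (the column data and the `profile_table` helper are invented for illustration; the real `DataProfiler` may differ):

```python
import statistics

def profile_table(columns: dict[str, list]) -> dict[str, dict]:
    """Compute completeness and basic stats for each column of a table."""
    report = {}
    for name, values in columns.items():
        non_null = [v for v in values if v is not None]
        profile = {
            "completeness": round(len(non_null) / len(values), 4) if values else 0,
            "unique_count": len(set(non_null)),
        }
        numeric = [v for v in non_null if isinstance(v, (int, float))]
        if numeric:
            profile["mean"] = round(statistics.mean(numeric), 4)
        report[name] = profile
    return report

table = {
    "revenue": [45.2, None, 51.3, 45.2, 60.0],
    "region": ["north", "south", "north", None, "east"],
}
report = profile_table(table)
print(report["revenue"]["completeness"])  # 0.8
print(report["region"]["unique_count"])   # 3
```

Flagging every column whose completeness falls below a threshold (e.g. the `completeness_threshold` from the config section below) is then a one-line comprehension over the report.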

Statistical Testing

import math
import statistics

def two_sample_ttest(sample_a: list[float], sample_b: list[float]) -> dict:
    """Welch's t-test for two independent samples.

    Does not assume equal variances. Returns the t-statistic, the
    Welch-Satterthwaite degrees of freedom, and Cohen's d effect size.
    """
    n_a, n_b = len(sample_a), len(sample_b)
    mean_a = statistics.mean(sample_a)
    mean_b = statistics.mean(sample_b)
    var_a = statistics.variance(sample_a)
    var_b = statistics.variance(sample_b)

    # Welch's t-statistic
    se = math.sqrt(var_a / n_a + var_b / n_b)
    t_stat = (mean_a - mean_b) / se

    # Welch-Satterthwaite degrees of freedom
    num = (var_a / n_a + var_b / n_b) ** 2
    denom = (var_a / n_a) ** 2 / (n_a - 1) + (var_b / n_b) ** 2 / (n_b - 1)
    df = num / denom

    # Cohen's d effect size
    pooled_std = math.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b)
                           / (n_a + n_b - 2))
    cohens_d = (mean_a - mean_b) / pooled_std if pooled_std > 0 else 0

    return {
        "t_statistic": round(t_stat, 4),
        "degrees_of_freedom": round(df, 2),
        "cohens_d": round(cohens_d, 4),
        "mean_a": round(mean_a, 4),
        "mean_b": round(mean_b, 4),
        "effect_interpretation": (
            "negligible" if abs(cohens_d) < 0.2
            else "small" if abs(cohens_d) < 0.5
            else "medium" if abs(cohens_d) < 0.8
            else "large"
        ),
    }

# Example: compare revenue between two segments
control = [45.2, 52.1, 48.7, 51.3, 49.8, 53.2, 47.6, 50.1]
treatment = [55.3, 58.7, 52.1, 57.8, 54.6, 59.2, 56.1, 53.9]
result = two_sample_ttest(control, treatment)
print(result)
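The standard library has no t-distribution CDF, so the function above reports no p-value. If one is needed without third-party dependencies, a normal approximation via `math.erf` is a common workaround when the degrees of freedom are moderate (say, above ~30); this sketch is an illustration, not part of the kit's API:

```python
import math

def approx_two_sided_p(t_stat: float) -> float:
    """Two-sided p-value using the standard normal approximation to the
    t-distribution (reasonable for df > ~30; anti-conservative for small df)."""
    # Standard normal CDF via the error function
    cdf = 0.5 * (1.0 + math.erf(abs(t_stat) / math.sqrt(2.0)))
    return 2.0 * (1.0 - cdf)

print(round(approx_two_sided_p(1.96), 3))  # 0.05
```

For small samples, prefer `scipy.stats.ttest_ind(a, b, equal_var=False)`, which uses the exact t-distribution.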

Configuration

# config.example.yaml
profiling:
  sample_size: null             # null = profile all rows
  completeness_threshold: 0.95  # Flag columns below this

outlier_detection:
  default_method: "iqr"         # iqr | zscore
  iqr_multiplier: 1.5
  zscore_threshold: 3.0
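The `zscore` method referenced in the config can be sketched in a few stdlib lines. As with the IQR example earlier, this is a standalone illustration (the kit exposes it as `detect_outliers(..., method="zscore")`):

```python
import statistics

def detect_outliers_zscore(values: list[float], threshold: float = 3.0) -> list[int]:
    """Return indices of values whose absolute z-score exceeds the threshold."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)  # sample standard deviation
    if stdev == 0:
        return []  # constant column: no outliers by this definition
    return [i for i, v in enumerate(values) if abs((v - mean) / stdev) > threshold]

data = [10.0, 11.0, 10.5, 9.8, 10.2, 50.0]
print(detect_outliers_zscore(data, threshold=2.0))  # [5]
```

One caveat worth knowing: extreme values inflate both the mean and the standard deviation, so the z-score method can mask outliers in small samples; the IQR method is more robust there.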

Best Practices

  1. Profile before analyzing — always understand data quality before drawing conclusions
  2. Use type hints everywhere — catches errors early and documents expected inputs
  3. Log, don't print — use the logging module for reproducible analysis trails
  4. Test with small data first — validate logic on 1,000 rows before running on 10M
  5. Separate data loading from analysis — keep IO and computation in different functions
  6. Version your analysis configs — track parameter changes alongside code changes
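For practice 3, a minimal logging setup is enough to get a timestamped, reproducible trail. This sketch logs to an in-memory stream for demonstration; in a real analysis you would swap in a `FileHandler`:

```python
import io
import logging

# In-memory sink for the demo; use logging.FileHandler("analysis.log") in practice
log_stream = io.StringIO()
handler = logging.StreamHandler(log_stream)
handler.setFormatter(logging.Formatter("%(levelname)s %(name)s: %(message)s"))

logger = logging.getLogger("analysis")
logger.setLevel(logging.INFO)
logger.addHandler(handler)

logger.info("Loaded %d rows from %s", 45231, "sales.csv")
logger.warning("Column %s is %.1f%% complete", "region", 91.4)

print(log_stream.getvalue())
```

Unlike `print`, lazy `%`-formatting skips the string work when a level is filtered out, and the handler/formatter split lets the same analysis write to console, file, or both without touching the analysis code.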

Troubleshooting

| Issue | Cause | Fix |
| --- | --- | --- |
| `MemoryError` on large datasets | Loading the entire file into memory | Use chunked reading or sample first |
| Incorrect statistics | Mixed types in a column (e.g., "N/A" strings) | Clean and type-cast before calculating |
| Slow profiling | Profiling all columns, including IDs | Exclude high-cardinality ID columns from profiling |
| Visualization looks wrong | Matplotlib backend issue | Set `matplotlib.use('Agg')` for non-interactive environments |
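The chunked-reading fix for the `MemoryError` row can be done with the stdlib `csv` module alone; `read_in_chunks` is an illustrative helper, not a function from the kit:

```python
import csv
from typing import Iterator

def read_in_chunks(path: str, chunk_size: int = 10_000) -> Iterator[list[dict]]:
    """Yield lists of row dicts so only one chunk is in memory at a time."""
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        chunk: list[dict] = []
        for row in reader:
            chunk.append(row)
            if len(chunk) >= chunk_size:
                yield chunk
                chunk = []
        if chunk:  # trailing partial chunk
            yield chunk

# Aggregate without ever loading the whole file:
# total = sum(float(r["revenue"]) for chunk in read_in_chunks("sales.csv")
#             for r in chunk)
```

With pandas installed, `pd.read_csv(path, chunksize=10_000)` gives the same streaming behavior with typed columns.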

Requirements

  • Python 3.10+
  • Standard library only for core modules (math, statistics, collections, csv)
  • Optional: pandas, numpy, matplotlib for notebook templates

This is 1 of 11 resources in the Data Analyst Toolkit. Get the complete [Python Data Analysis Toolkit] with all files, templates, and documentation for $39.

Get the Full Kit →

Or grab the entire Data Analyst Toolkit bundle (11 products) for $129 — save 30%.

Get the Complete Bundle →

