Thesius Code

Posted on • Originally published at datanest-stores.pages.dev

Python Data Analysis Toolkit

Production-ready Python templates for data analysis using pandas, NumPy, and the standard library. Includes EDA workflows, statistical testing, data profiling, visualization recipes, and reusable utility functions. All code uses type hints and follows Google-style docstrings.

Key Features

  • EDA Notebook Templates — structured exploratory analysis with profiling, distributions, and correlations
  • Statistical Testing Suite — t-tests, chi-squared, ANOVA, Mann-Whitney with effect size calculations
  • Data Profiling Engine — automated completeness, uniqueness, and distribution reports
  • Visualization Recipes — 30+ chart patterns for Matplotlib/Seaborn with publication-quality defaults
  • Time Series Utilities — decomposition, rolling statistics, seasonality detection
  • Reusable Utility Functions — date parsing, outlier detection, binning, sampling
  • Pipeline Pattern — chainable data transformation functions with logging
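The pipeline pattern above can be sketched with nothing but the standard library. This is an illustrative example, not the kit's actual implementation: `pipeline`, `drop_nulls`, and `add_margin` are hypothetical names chosen for the demo.

```python
import logging
from typing import Callable

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("pipeline")

def pipeline(data: list[dict], *steps: Callable[[list[dict]], list[dict]]) -> list[dict]:
    """Apply transformation steps in order, logging row counts between steps."""
    for step in steps:
        data = step(data)
        logger.info("%s -> %d rows", step.__name__, len(data))
    return data

def drop_nulls(rows: list[dict]) -> list[dict]:
    # Keep only rows with no missing values
    return [r for r in rows if all(v is not None for v in r.values())]

def add_margin(rows: list[dict]) -> list[dict]:
    # Derive a new column from existing ones
    return [{**r, "margin": r["revenue"] - r["cost"]} for r in rows]

rows = [
    {"revenue": 100.0, "cost": 60.0},
    {"revenue": 80.0, "cost": None},
]
result = pipeline(rows, drop_nulls, add_margin)
# result == [{"revenue": 100.0, "cost": 60.0, "margin": 40.0}]
```

Each step takes and returns the same shape of data, so steps compose freely and the log records how row counts change through the chain.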

Quick Start

from src.python_data_analysis_toolkit.core import DataProfiler, EDARunner
from src.python_data_analysis_toolkit.utils import detect_outliers, bin_column

# 1. Profile your dataset
profiler = DataProfiler(df)
report = profiler.run()
print(report)
# Columns: 15 | Rows: 45,231 | Missing: 3.2% | Duplicates: 847

# 2. Run automated EDA
eda = EDARunner(df, target_column="revenue")
eda.univariate_analysis()
eda.correlation_matrix()
eda.target_vs_features()
eda.save_report("output/eda_report.html")

# 3. Use utility functions
outliers = detect_outliers(df["revenue"], method="iqr", threshold=1.5)
df["revenue_bin"] = bin_column(df["revenue"], bins=5, strategy="quantile")
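The kit's `detect_outliers` implementation isn't shown in this post; a minimal stdlib sketch of the IQR method it names (with the same `threshold=1.5` default) might look like this:

```python
import statistics

def detect_outliers_iqr(values: list[float], threshold: float = 1.5) -> list[int]:
    """Return indices of values outside [Q1 - t*IQR, Q3 + t*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartiles
    iqr = q3 - q1
    lower, upper = q1 - threshold * iqr, q3 + threshold * iqr
    return [i for i, v in enumerate(values) if v < lower or v > upper]

data = [10.0, 12.0, 11.0, 13.0, 12.5, 11.5, 95.0]
print(detect_outliers_iqr(data))  # [6] — the 95.0 reading
```

Note that `statistics.quantiles` uses the exclusive method by default, so quartile values can differ slightly from pandas' defaults on small samples.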

Usage Examples

Data Profiling

from collections import Counter
import statistics

def profile_column(values: list) -> dict:
    """Generate a statistical profile for a single column.

    Args:
        values: List of values from a column (may contain None).

    Returns:
        Dictionary with completeness, uniqueness, and distribution stats.
    """
    non_null = [v for v in values if v is not None]
    total = len(values)
    n = len(non_null)

    profile = {
        "total_count": total,
        "non_null_count": n,
        "null_count": total - n,
        "completeness": round(n / total, 4) if total > 0 else 0,
        "unique_count": len(set(non_null)),
        "uniqueness": round(len(set(non_null)) / n, 4) if n > 0 else 0,
    }

    # Numeric stats
    numeric = [v for v in non_null if isinstance(v, (int, float))]
    if numeric:
        profile.update({
            "min": min(numeric),
            "max": max(numeric),
            "mean": round(statistics.mean(numeric), 4),
            "median": round(statistics.median(numeric), 4),
            "stdev": round(statistics.stdev(numeric), 4) if len(numeric) > 1 else 0,
        })

    # Top values for categoricals
    if not numeric and non_null:
        top_3 = Counter(non_null).most_common(3)
        profile["top_values"] = top_3

    return profile
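To show what a table-level report built on per-column profiles could look like, here is a small self-contained sketch (the column data and the `profile_table` helper are invented for illustration; the real `DataProfiler` may differ):

```python
import statistics

def profile_table(columns: dict[str, list]) -> dict[str, dict]:
    """Compute completeness and basic stats for each column of a table."""
    report = {}
    for name, values in columns.items():
        non_null = [v for v in values if v is not None]
        profile = {
            "completeness": round(len(non_null) / len(values), 4) if values else 0,
            "unique_count": len(set(non_null)),
        }
        numeric = [v for v in non_null if isinstance(v, (int, float))]
        if numeric:
            profile["mean"] = round(statistics.mean(numeric), 4)
        report[name] = profile
    return report

table = {
    "revenue": [45.2, None, 51.3, 45.2, 60.0],
    "region": ["north", "south", "north", None, "east"],
}
report = profile_table(table)
print(report["revenue"]["completeness"])  # 0.8
print(report["region"]["unique_count"])   # 3
```

Flagging every column whose completeness falls below a threshold (e.g. the `completeness_threshold` from the config section below) is then a one-line comprehension over the report.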

Statistical Testing

import math
import statistics

def two_sample_ttest(sample_a: list[float], sample_b: list[float]) -> dict:
    """Welch's t-test for two independent samples.

    Does not assume equal variances. Returns the t-statistic, the
    Welch-Satterthwaite degrees of freedom, and Cohen's d effect size.
    """
    n_a, n_b = len(sample_a), len(sample_b)
    mean_a = statistics.mean(sample_a)
    mean_b = statistics.mean(sample_b)
    var_a = statistics.variance(sample_a)
    var_b = statistics.variance(sample_b)

    # Welch's t-statistic
    se = math.sqrt(var_a / n_a + var_b / n_b)
    t_stat = (mean_a - mean_b) / se

    # Welch-Satterthwaite degrees of freedom
    num = (var_a / n_a + var_b / n_b) ** 2
    denom = (var_a / n_a) ** 2 / (n_a - 1) + (var_b / n_b) ** 2 / (n_b - 1)
    df = num / denom

    # Cohen's d effect size
    pooled_std = math.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b)
                           / (n_a + n_b - 2))
    cohens_d = (mean_a - mean_b) / pooled_std if pooled_std > 0 else 0

    return {
        "t_statistic": round(t_stat, 4),
        "degrees_of_freedom": round(df, 2),
        "cohens_d": round(cohens_d, 4),
        "mean_a": round(mean_a, 4),
        "mean_b": round(mean_b, 4),
        "effect_interpretation": (
            "negligible" if abs(cohens_d) < 0.2
            else "small" if abs(cohens_d) < 0.5
            else "medium" if abs(cohens_d) < 0.8
            else "large"
        ),
    }

# Example: compare revenue between two segments
control = [45.2, 52.1, 48.7, 51.3, 49.8, 53.2, 47.6, 50.1]
treatment = [55.3, 58.7, 52.1, 57.8, 54.6, 59.2, 56.1, 53.9]
result = two_sample_ttest(control, treatment)
print(result)
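The standard library has no t-distribution CDF, so the function above reports no p-value. If one is needed without third-party dependencies, a normal approximation via `math.erf` is a common workaround when the degrees of freedom are moderate (say, above ~30); this sketch is an illustration, not part of the kit's API:

```python
import math

def approx_two_sided_p(t_stat: float) -> float:
    """Two-sided p-value using the standard normal approximation to the
    t-distribution (reasonable for df > ~30; anti-conservative for small df)."""
    # Standard normal CDF via the error function
    cdf = 0.5 * (1.0 + math.erf(abs(t_stat) / math.sqrt(2.0)))
    return 2.0 * (1.0 - cdf)

print(round(approx_two_sided_p(1.96), 3))  # 0.05
```

For small samples, prefer `scipy.stats.ttest_ind(a, b, equal_var=False)`, which uses the exact t-distribution.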

Configuration

# config.example.yaml
profiling:
  sample_size: null             # null = profile all rows
  completeness_threshold: 0.95  # Flag columns below this

outlier_detection:
  default_method: "iqr"         # iqr | zscore
  iqr_multiplier: 1.5
  zscore_threshold: 3.0
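The `zscore` method referenced in the config can be sketched in a few stdlib lines. As with the IQR example earlier, this is a standalone illustration (the kit exposes it as `detect_outliers(..., method="zscore")`):

```python
import statistics

def detect_outliers_zscore(values: list[float], threshold: float = 3.0) -> list[int]:
    """Return indices of values whose absolute z-score exceeds the threshold."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)  # sample standard deviation
    if stdev == 0:
        return []  # constant column: no outliers by this definition
    return [i for i, v in enumerate(values) if abs((v - mean) / stdev) > threshold]

data = [10.0, 11.0, 10.5, 9.8, 10.2, 50.0]
print(detect_outliers_zscore(data, threshold=2.0))  # [5]
```

One caveat worth knowing: extreme values inflate both the mean and the standard deviation, so the z-score method can mask outliers in small samples; the IQR method is more robust there.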

Best Practices

  1. Profile before analyzing — always understand data quality before drawing conclusions
  2. Use type hints everywhere — catches errors early and documents expected inputs
  3. Log, don't print — use the logging module for reproducible analysis trails
  4. Test with small data first — validate logic on 1,000 rows before running on 10M
  5. Separate data loading from analysis — keep IO and computation in different functions
  6. Version your analysis configs — track parameter changes alongside code changes
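For practice 3, a minimal logging setup is enough to get a timestamped, reproducible trail. This sketch logs to an in-memory stream for demonstration; in a real analysis you would swap in a `FileHandler`:

```python
import io
import logging

# In-memory sink for the demo; use logging.FileHandler("analysis.log") in practice
log_stream = io.StringIO()
handler = logging.StreamHandler(log_stream)
handler.setFormatter(logging.Formatter("%(levelname)s %(name)s: %(message)s"))

logger = logging.getLogger("analysis")
logger.setLevel(logging.INFO)
logger.addHandler(handler)

logger.info("Loaded %d rows from %s", 45231, "sales.csv")
logger.warning("Column %s is %.1f%% complete", "region", 91.4)

print(log_stream.getvalue())
```

Unlike `print`, lazy `%`-formatting skips the string work when a level is filtered out, and the handler/formatter split lets the same analysis write to console, file, or both without touching the analysis code.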

Troubleshooting

| Issue | Cause | Fix |
| --- | --- | --- |
| `MemoryError` on large datasets | Loading the entire file into memory | Use chunked reading or sample first |
| Incorrect statistics | Mixed types in a column (e.g., "N/A" strings) | Clean and type-cast before calculating |
| Slow profiling | Profiling all columns, including IDs | Exclude high-cardinality ID columns from profiling |
| Visualization looks wrong | Matplotlib backend issue | Set `matplotlib.use('Agg')` for non-interactive environments |
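The chunked-reading fix for the `MemoryError` row can be done with the stdlib `csv` module alone; `read_in_chunks` is an illustrative helper, not a function from the kit:

```python
import csv
from typing import Iterator

def read_in_chunks(path: str, chunk_size: int = 10_000) -> Iterator[list[dict]]:
    """Yield lists of row dicts so only one chunk is in memory at a time."""
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        chunk: list[dict] = []
        for row in reader:
            chunk.append(row)
            if len(chunk) >= chunk_size:
                yield chunk
                chunk = []
        if chunk:  # trailing partial chunk
            yield chunk

# Aggregate without ever loading the whole file:
# total = sum(float(r["revenue"]) for chunk in read_in_chunks("sales.csv")
#             for r in chunk)
```

With pandas installed, `pd.read_csv(path, chunksize=10_000)` gives the same streaming behavior with typed columns.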

Requirements

  • Python 3.10+
  • Standard library only for core modules (math, statistics, collections, csv)
  • Optional: pandas, numpy, matplotlib for notebook templates

This is 1 of 11 resources in the Data Analyst Toolkit. Get the complete [Python Data Analysis Toolkit] with all files, templates, and documentation for $39.

Get the Full Kit →

Or grab the entire Data Analyst Toolkit bundle (11 products) for $129 — save 30%.

Get the Complete Bundle →

