# Experiment Tracking Setup
Stop losing track of which hyperparameters produced your best model. This toolkit provides production-ready configurations for MLflow and Weights & Biases, complete with experiment comparison utilities, a model registry workflow, and artifact management patterns. You get a structured approach to tracking every training run — from quick prototypes to full hyperparameter sweeps — with configs that work locally, on a shared server, or in the cloud.
## Key Features
- MLflow Server Configuration — Docker Compose setup with PostgreSQL backend store and S3-compatible artifact storage, ready for team use.
- W&B Integration Layer — Wrapper classes that standardize logging across MLflow and W&B, so you can switch backends without changing training code.
- Experiment Comparison — Scripts that pull runs from your tracking server, rank by metric, and generate comparison reports with statistical significance tests.
- Model Registry Workflow — Promote models through Staging → Production with approval gates, A/B test metadata, and rollback support.
- Artifact Management — Versioned storage for datasets, model checkpoints, and evaluation artifacts with automatic cleanup of old runs.
- Hyperparameter Sweep Templates — Grid search, random search, and Bayesian optimization configs with parallelism control.
- Reproducibility Snapshots — Capture git commit hash, pip freeze output, system info, and config files alongside every run.
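The reproducibility snapshot described above can be approximated with standard tooling. This is a minimal sketch, not the toolkit's own implementation; the `capture_snapshot` helper is illustrative, and each capture degrades gracefully to `None` when the underlying tool is unavailable:

```python
import json
import platform
import subprocess
import sys

def capture_snapshot() -> dict:
    """Collect a minimal reproducibility snapshot: git commit,
    installed packages, and basic system info."""
    def run(cmd):
        # Return the command's output, or None if the tool is unavailable
        try:
            return subprocess.check_output(
                cmd, stderr=subprocess.DEVNULL, text=True
            ).strip()
        except (OSError, subprocess.CalledProcessError):
            return None

    return {
        "git_commit": run(["git", "rev-parse", "HEAD"]),
        "pip_freeze": run([sys.executable, "-m", "pip", "freeze"]),
        "python_version": platform.python_version(),
        "platform": platform.platform(),
    }

snapshot = capture_snapshot()
# Print a truncated view; the full snapshot would be logged as a run artifact
print(json.dumps({k: (v or "")[:40] for k, v in snapshot.items()}, indent=2))
```

Storing the returned dict as a run artifact alongside the config file is usually enough to reconstruct the environment later.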
## Quick Start

```bash
unzip experiment-tracking-setup.zip && cd experiment-tracking-setup
pip install -r requirements.txt

# Option 1: Start a local MLflow server
docker compose up -d   # starts MLflow + PostgreSQL + MinIO

# Option 2: Use W&B (cloud)
export WANDB_API_KEY=YOUR_API_KEY_HERE
```
```yaml
# configs/development.yaml
tracking:
  backend: mlflow            # mlflow | wandb
  mlflow:
    tracking_uri: http://localhost:5000
    artifact_location: s3://mlflow-artifacts/
    experiment_name: my_project
  wandb:
    project: my_project
    entity: my_team

logging:
  log_every_n_steps: 10
  log_model_checkpoints: true
  log_system_metrics: true
  log_code: true
  capture_git_hash: true
  capture_pip_freeze: true

registry:
  auto_register_best: true
  promotion_metric: val_f1
  promotion_threshold: 0.85
  stages: [None, Staging, Production, Archived]

sweeps:
  strategy: bayesian         # grid | random | bayesian
  max_runs: 50
  parallel_workers: 4
  early_stopping:
    metric: val_loss
    patience: 10
```
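To make the sweep settings concrete, here is a minimal sketch of how the `grid` and `random` strategies enumerate trials. The search space below is hypothetical, and the `bayesian` strategy is omitted since it requires an optimization library such as Optuna:

```python
import itertools
import random

# Hypothetical search space; a real sweep config would define this in YAML
space = {
    "lr": [1e-4, 3e-4, 1e-3],
    "batch_size": [16, 32, 64],
    "optimizer": ["adamw", "sgd"],
}

def grid(space):
    """Yield every combination in the space (exhaustive grid search)."""
    keys = list(space)
    for values in itertools.product(*space.values()):
        yield dict(zip(keys, values))

def random_search(space, max_runs, seed=0):
    """Yield max_runs independently sampled configurations."""
    rng = random.Random(seed)
    for _ in range(max_runs):
        yield {k: rng.choice(v) for k, v in space.items()}

print(len(list(grid(space))))   # 3 * 3 * 2 = 18 grid points
trials = list(random_search(space, max_runs=5))
```

The `max_runs` cap in the config matters most for random and Bayesian search, since grid search is bounded by the space itself.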
## Architecture

```
┌──────────────┐     ┌───────────────────┐     ┌────────────────┐
│   Training   │────>│ Tracking Wrapper  │────>│ MLflow Server  │
│    Script    │     │   (Unified API)   │     │  or W&B Cloud  │
└──────────────┘     └───────────────────┘     └───────┬────────┘
                                                       │
┌──────────────┐     ┌───────────────────┐     ┌───────▼────────┐
│  Comparison  │<────│  Model Registry   │<────│ Artifact Store │
│   Reports    │     │  (Stage Gates)    │     │ (S3/GCS/Local) │
└──────────────┘     └───────────────────┘     └────────────────┘
```
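The "Tracking Wrapper (Unified API)" box is essentially a facade over interchangeable backends: training code calls one interface, and the backend (MLflow, W&B) is chosen by config. This sketch substitutes an in-memory stub for a real backend; all class names here are illustrative, not the toolkit's actual API:

```python
from typing import Protocol

class Backend(Protocol):
    """Interface every tracking backend must satisfy."""
    def log_params(self, params: dict) -> None: ...
    def log_metrics(self, metrics: dict, step: int) -> None: ...

class InMemoryBackend:
    """Stand-in backend; a real adapter would forward to
    mlflow.log_params / wandb.log behind this same interface."""
    def __init__(self):
        self.params, self.metrics = {}, []

    def log_params(self, params):
        self.params.update(params)

    def log_metrics(self, metrics, step):
        self.metrics.append((step, metrics))

class TrackerFacade:
    """Unified API: training code talks to this, never to a backend."""
    def __init__(self, backend):
        self._backend = backend

    def log_params(self, params):
        self._backend.log_params(params)

    def log_metrics(self, metrics, step=0):
        self._backend.log_metrics(metrics, step)

backend = InMemoryBackend()
tracker = TrackerFacade(backend)
tracker.log_params({"lr": 0.001})
tracker.log_metrics({"val_f1": 0.91}, step=1)
```

Swapping backends then means constructing a different adapter, with no changes to the training loop.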
## Usage Examples

### Log a Training Run

```python
from experiment_tracking_setup.core import ExperimentTracker

tracker = ExperimentTracker.from_config("configs/development.yaml")

with tracker.start_run(run_name="resnet50_lr001") as run:
    # Log hyperparameters
    run.log_params({
        "model": "resnet50",
        "lr": 0.001,
        "batch_size": 32,
        "optimizer": "adamw",
    })

    for epoch in range(50):
        train_loss, val_loss, val_f1 = train_one_epoch(model, epoch)
        run.log_metrics({
            "train_loss": train_loss,
            "val_loss": val_loss,
            "val_f1": val_f1,
        }, step=epoch)

    # Log model artifacts
    run.log_model(model, artifact_name="resnet50_classifier")
    run.log_artifact("./configs/training_config.yaml")
```
### Compare Experiments

```python
from experiment_tracking_setup.core import ExperimentComparator

comparator = ExperimentComparator(tracking_uri="http://localhost:5000")

# Get the top 5 runs by validation F1
top_runs = comparator.get_top_runs(
    experiment_name="my_project",
    metric="val_f1",
    top_k=5,
)

# Generate a comparison report
report = comparator.compare(top_runs, metrics=["val_f1", "val_loss", "train_time"])
comparator.save_report(report, "comparison_report.html")

# Statistical significance test between the top 2 runs
sig_test = comparator.significance_test(top_runs[0], top_runs[1], metric="val_f1")
print(f"p-value: {sig_test['p_value']:.4f}")
```
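A backend-agnostic way to compute such a p-value, assuming each run has several scores (e.g. one per random seed), is a permutation test on the difference of means. This sketch is one plausible implementation, not necessarily what `significance_test` does internally, and the scores below are made up:

```python
import random

def permutation_test(a, b, n_permutations=10_000, seed=0):
    """Two-sided permutation test on the difference of means.
    Returns an approximate p-value."""
    rng = random.Random(seed)
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_permutations):
        # Randomly reassign scores to the two groups
        rng.shuffle(pooled)
        perm_a, perm_b = pooled[:len(a)], pooled[len(a):]
        diff = abs(sum(perm_a) / len(perm_a) - sum(perm_b) / len(perm_b))
        if diff >= observed:
            hits += 1
    return hits / n_permutations

# Hypothetical per-seed val_f1 scores for two runs
run_a = [0.912, 0.905, 0.918, 0.909, 0.915]
run_b = [0.881, 0.874, 0.890, 0.879, 0.885]
p = permutation_test(run_a, run_b)
```

With only a single score per run, no significance test is meaningful, which is one more reason to train with multiple seeds before promoting a model.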
## Configuration Reference

| Parameter | Type | Default | Description |
|---|---|---|---|
| `tracking.backend` | str | `mlflow` | Tracking backend: `mlflow` or `wandb` |
| `logging.log_every_n_steps` | int | `10` | Metric logging frequency |
| `logging.capture_git_hash` | bool | `true` | Capture the git commit for reproducibility |
| `registry.promotion_metric` | str | `val_f1` | Metric used for auto-promotion decisions |
| `sweeps.strategy` | str | `bayesian` | Hyperparameter search strategy |
| `sweeps.parallel_workers` | int | `4` | Number of parallel sweep workers |
## Best Practices

- **Name experiments by project, not by date** — Use `fraud_detection_v2`, not `experiment_2026_03_23`. Dates are metadata; names should be searchable.
- **Log hyperparameters BEFORE training starts** — If training crashes, you still know what was attempted.
- **Use tags for filtering, not run names** — Tags like `{"backbone": "resnet50", "dataset": "v3"}` are queryable; clever run names are not.
- **Set up artifact cleanup** — Old checkpoints accumulate fast. Schedule a weekly cleanup of runs older than 30 days that aren't registered models.
- **Always log the full config file** — Individual params are useful for search, but the complete config is essential for exact reproduction.
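The cleanup practice above reduces to a small retention policy: delete runs past a cutoff age unless they back a registered model. This sketch operates on plain tuples rather than a real tracking client (fetching and deleting runs would go through MLflow's or W&B's own APIs), and the run names are hypothetical:

```python
from datetime import datetime, timedelta, timezone

def select_runs_to_delete(runs, registered_run_ids, max_age_days=30, now=None):
    """Return IDs of runs older than max_age_days that never produced
    a registered model. `runs` is a list of (run_id, end_time) pairs."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=max_age_days)
    return [
        run_id
        for run_id, end_time in runs
        if end_time < cutoff and run_id not in registered_run_ids
    ]

now = datetime(2026, 3, 23, tzinfo=timezone.utc)
runs = [
    ("old_unregistered", datetime(2026, 1, 1, tzinfo=timezone.utc)),
    ("old_registered", datetime(2026, 1, 1, tzinfo=timezone.utc)),
    ("recent", datetime(2026, 3, 20, tzinfo=timezone.utc)),
]
doomed = select_runs_to_delete(runs, registered_run_ids={"old_registered"}, now=now)
# doomed == ["old_unregistered"]
```

Keeping the policy pure like this makes it easy to unit-test before pointing it at a live tracking server.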
## Troubleshooting

| Issue | Cause | Fix |
|---|---|---|
| MLflow UI shows no runs | Tracking URI mismatch | Verify the `MLFLOW_TRACKING_URI` env var matches your server address |
| Artifact upload fails | S3 credentials not configured | Set `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`, or configure MinIO credentials |
| W&B run hangs on init | Network/auth issue | Run `wandb login` to verify your API key, and check firewall rules |
| Model registry version conflict | Concurrent promotions | Call `transition_stage()` with `archive_existing=True` to auto-archive the previous Production model |
This is 1 of 11 resources in the ML Engineer Toolkit. Get the complete Experiment Tracking Setup, with all files, templates, and documentation, for $29. Or grab the entire ML Engineer Toolkit bundle (11 products) for $149 and save 30%.