Thesius Code

Posted on • Originally published at datanest-stores.pages.dev

Experiment Tracking Setup

Stop losing track of which hyperparameters produced your best model. This toolkit provides production-ready configurations for MLflow and Weights & Biases, complete with experiment comparison utilities, a model registry workflow, and artifact management patterns. You get a structured approach to tracking every training run — from quick prototypes to full hyperparameter sweeps — with configs that work locally, on a shared server, or in the cloud.

Key Features

  • MLflow Server Configuration — Docker Compose setup with PostgreSQL backend store and S3-compatible artifact storage, ready for team use.
  • W&B Integration Layer — Wrapper classes that standardize logging across MLflow and W&B, so you can switch backends without changing training code.
  • Experiment Comparison — Scripts that pull runs from your tracking server, rank by metric, and generate comparison reports with statistical significance tests.
  • Model Registry Workflow — Promote models through Staging → Production with approval gates, A/B test metadata, and rollback support.
  • Artifact Management — Versioned storage for datasets, model checkpoints, and evaluation artifacts with automatic cleanup of old runs.
  • Hyperparameter Sweep Templates — Grid search, random search, and Bayesian optimization configs with parallelism control.
  • Reproducibility Snapshots — Capture git commit hash, pip freeze output, system info, and config files alongside every run.
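The reproducibility snapshot idea is simple enough to sketch in a few lines. The helper below is illustrative (the function name `capture_snapshot` and its keys are not the toolkit's actual API) and falls back to `None` when git or pip are unavailable:

```python
# Hypothetical sketch of a reproducibility snapshot; names are
# illustrative, not the toolkit's actual API.
import platform
import subprocess
import sys

def capture_snapshot() -> dict:
    """Collect git hash, installed packages, and system info for a run."""
    def run(cmd):
        try:
            return subprocess.check_output(cmd, text=True).strip()
        except Exception:
            return None  # e.g. not a git repo, or pip unavailable

    return {
        "git_commit": run(["git", "rev-parse", "HEAD"]),
        "pip_freeze": run([sys.executable, "-m", "pip", "freeze"]),
        "python_version": platform.python_version(),
        "platform": platform.platform(),
    }

snapshot = capture_snapshot()
```

A dict like this can be logged as a run artifact alongside the config file, so every run carries its own environment record.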

Quick Start

unzip experiment-tracking-setup.zip && cd experiment-tracking-setup
pip install -r requirements.txt

# Option 1: Start local MLflow server
docker compose up -d  # Starts MLflow + PostgreSQL + MinIO

# Option 2: Use W&B (cloud)
export WANDB_API_KEY=YOUR_API_KEY_HERE
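Before pointing training jobs at the local server, it can be worth a quick reachability check (the URL matches the `tracking_uri` in the config below; adjust if yours differs). This is a generic sketch, not part of the toolkit:

```python
# Sanity check that the MLflow server started by docker compose is
# reachable. Generic helper, not part of the toolkit itself.
import urllib.request

def server_up(uri="http://localhost:5000", timeout=3):
    try:
        with urllib.request.urlopen(uri, timeout=timeout):
            return True
    except Exception:
        return False
```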
# configs/development.yaml
tracking:
  backend: mlflow  # mlflow | wandb
  mlflow:
    tracking_uri: http://localhost:5000
    artifact_location: s3://mlflow-artifacts/
    experiment_name: my_project
  wandb:
    project: my_project
    entity: my_team

logging:
  log_every_n_steps: 10
  log_model_checkpoints: true
  log_system_metrics: true
  log_code: true
  capture_git_hash: true
  capture_pip_freeze: true

registry:
  auto_register_best: true
  promotion_metric: val_f1
  promotion_threshold: 0.85
  stages: [None, Staging, Production, Archived]

sweeps:
  strategy: bayesian  # grid | random | bayesian
  max_runs: 50
  parallel_workers: 4
  early_stopping:
    metric: val_loss
    patience: 10
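To make the `sweeps:` section concrete, here is a minimal random-search loop with the same early-stopping semantics (`max_runs`, `patience` on a non-improving metric). The names `sample_config` and `run_trial` are illustrative; the toolkit's own sweep runner may be structured differently:

```python
# Minimal random-search sweep mirroring the sweeps: config above.
# sample_config / run_trial are illustrative names, not toolkit API.
import random

SEARCH_SPACE = {
    "lr": [1e-4, 3e-4, 1e-3, 3e-3],
    "batch_size": [16, 32, 64],
}

def sample_config(rng):
    return {k: rng.choice(v) for k, v in SEARCH_SPACE.items()}

def run_sweep(run_trial, max_runs=50, patience=10, seed=0):
    """Stop early when the trial metric hasn't improved for `patience` runs."""
    rng = random.Random(seed)
    best, since_best, history = float("inf"), 0, []
    for _ in range(max_runs):
        cfg = sample_config(rng)
        val_loss = run_trial(cfg)  # train with cfg, return metric to minimize
        history.append((cfg, val_loss))
        if val_loss < best:
            best, since_best = val_loss, 0
        else:
            since_best += 1
            if since_best >= patience:
                break
    return best, history
```

Grid and Bayesian strategies slot into the same loop by swapping out `sample_config`.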

Architecture

┌──────────────┐     ┌───────────────────┐     ┌────────────────┐
│ Training     │────>│ Tracking Wrapper  │────>│ MLflow Server  │
│ Script       │     │ (Unified API)     │     │ or W&B Cloud   │
└──────────────┘     └───────────────────┘     └───────┬────────┘
                                                        │
┌──────────────┐     ┌───────────────────┐     ┌───────▼────────┐
│ Comparison   │<────│  Model Registry   │<────│ Artifact Store │
│ Reports      │     │  (Stage Gates)    │     │ (S3/GCS/Local) │
└──────────────┘     └───────────────────┘     └────────────────┘
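The "Tracking Wrapper (Unified API)" box is typically a small interface that both backends implement, so training code never touches `mlflow.*` or `wandb.*` directly. A hypothetical sketch of that shape (not the toolkit's actual classes):

```python
# Sketch of a unified tracking interface; class names are illustrative.
from abc import ABC, abstractmethod

class TrackerBackend(ABC):
    @abstractmethod
    def log_params(self, params: dict): ...

    @abstractmethod
    def log_metrics(self, metrics: dict, step: int): ...

class InMemoryBackend(TrackerBackend):
    """Stand-in backend; a real one would delegate to mlflow or wandb."""
    def __init__(self):
        self.params, self.metrics = {}, []

    def log_params(self, params):
        self.params.update(params)

    def log_metrics(self, metrics, step):
        self.metrics.append((step, metrics))
```

Because the training script only sees `TrackerBackend`, switching from MLflow to W&B is a config change, not a code change.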

Usage Examples

Log a Training Run

from experiment_tracking_setup.core import ExperimentTracker

tracker = ExperimentTracker.from_config("configs/development.yaml")

with tracker.start_run(run_name="resnet50_lr001") as run:
    # Log hyperparameters
    run.log_params({
        "model": "resnet50",
        "lr": 0.001,
        "batch_size": 32,
        "optimizer": "adamw",
    })

    for epoch in range(50):
        train_loss, val_loss, val_f1 = train_one_epoch(model, epoch)
        run.log_metrics({
            "train_loss": train_loss,
            "val_loss": val_loss,
            "val_f1": val_f1,
        }, step=epoch)

    # Log model artifact
    run.log_model(model, artifact_name="resnet50_classifier")
    run.log_artifact("./configs/training_config.yaml")

Compare Experiments

from experiment_tracking_setup.core import ExperimentComparator

comparator = ExperimentComparator(tracking_uri="http://localhost:5000")

# Get top 5 runs by validation F1
top_runs = comparator.get_top_runs(
    experiment_name="my_project",
    metric="val_f1",
    top_k=5,
)

# Generate comparison report
report = comparator.compare(top_runs, metrics=["val_f1", "val_loss", "train_time"])
comparator.save_report(report, "comparison_report.html")

# Statistical significance test between top 2
sig_test = comparator.significance_test(top_runs[0], top_runs[1], metric="val_f1")
print(f"p-value: {sig_test['p_value']:.4f}")
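A meaningful significance test needs per-seed or per-fold scores for each run, not single point estimates. One common approach is a permutation test on those score lists; this is an assumed sketch, not necessarily what `significance_test()` does internally:

```python
# Permutation test on per-seed/per-fold scores from two runs.
# Assumed sketch; the toolkit's significance_test() may differ.
import random

def permutation_test(scores_a, scores_b, n_perm=10_000, seed=0):
    """Two-sided p-value for the difference in means."""
    rng = random.Random(seed)
    mean = lambda xs: sum(xs) / len(xs)
    observed = abs(mean(scores_a) - mean(scores_b))
    pooled = scores_a + scores_b
    n = len(scores_a)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n]) - mean(pooled[n:]))
        if diff >= observed:
            hits += 1
    return hits / n_perm
```

With only one score per run, no test can distinguish signal from seed noise, so logging `val_f1` across several seeds per config is worth the extra compute.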

Configuration Reference

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `tracking.backend` | str | `mlflow` | Tracking backend: `mlflow` or `wandb` |
| `logging.log_every_n_steps` | int | `10` | Metric logging frequency |
| `logging.capture_git_hash` | bool | `true` | Capture git commit for reproducibility |
| `registry.promotion_metric` | str | `val_f1` | Metric used for auto-promotion decisions |
| `sweeps.strategy` | str | `bayesian` | Hyperparameter search strategy |
| `sweeps.parallel_workers` | int | `4` | Parallel sweep workers |

Best Practices

  1. Name experiments by project, not by date — Use fraud_detection_v2, not experiment_2026_03_23. Dates are metadata; names should be searchable.
  2. Log hyperparameters BEFORE training starts — If training crashes, you still know what was attempted.
  3. Use tags for filtering, not run names — Tags like {"backbone": "resnet50", "dataset": "v3"} are queryable; clever run names are not.
  4. Set up artifact cleanup — Old checkpoints accumulate fast. Schedule weekly cleanup of runs older than 30 days that aren't registered models.
  5. Always log the full config file — Individual params are useful for search, but the complete config is essential for exact reproduction.
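Practice #4 (artifact cleanup) amounts to one filter: old runs that never made it into the registry. A sketch of that selection logic, assuming hypothetical run records; a real script would query the tracking server (e.g. via `mlflow.search_runs`) before deleting anything:

```python
# Select runs eligible for cleanup: older than 30 days and not
# registered. Run records here are hypothetical stand-ins for what a
# tracking-server query would return.
import time

THIRTY_DAYS = 30 * 24 * 3600

def runs_to_delete(runs, now=None):
    """runs: dicts with 'end_time' (epoch seconds) and 'registered' flag."""
    now = time.time() if now is None else now
    return [r for r in runs
            if not r["registered"] and now - r["end_time"] > THIRTY_DAYS]
```

Keeping the "registered models are never deleted" guard in one place makes the cleanup job safe to run on a schedule.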

Troubleshooting

| Issue | Cause | Fix |
| --- | --- | --- |
| MLflow UI shows no runs | Tracking URI mismatch | Verify the `MLFLOW_TRACKING_URI` env var matches your server address |
| Artifact upload fails | S3 credentials not configured | Set `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`, or configure MinIO credentials |
| W&B run hangs on init | Network/auth issue | Run `wandb login`, verify the API key, and check firewall rules |
| Model registry version conflict | Concurrent promotions | Use `transition_stage()` with `archive_existing=True` to auto-archive the previous Production model |

This is 1 of 11 resources in the ML Engineer Toolkit. Get the complete Experiment Tracking Setup with all files, templates, and documentation for $29.

Get the Full Kit →

Or grab the entire ML Engineer Toolkit bundle (11 products) for $149 — save 30%.

Get the Complete Bundle →
