# Experiment Tracking Setup
Stop losing track of which hyperparameters produced your best model. This toolkit provides production-ready configurations for MLflow and Weights & Biases, complete with experiment comparison utilities, a model registry workflow, and artifact management patterns. You get a structured approach to tracking every training run — from quick prototypes to full hyperparameter sweeps — with configs that work locally, on a shared server, or in the cloud.
## Key Features
- MLflow Server Configuration — Docker Compose setup with PostgreSQL backend store and S3-compatible artifact storage, ready for team use.
- W&B Integration Layer — Wrapper classes that standardize logging across MLflow and W&B, so you can switch backends without changing training code.
- Experiment Comparison — Scripts that pull runs from your tracking server, rank by metric, and generate comparison reports with statistical significance tests.
- Model Registry Workflow — Promote models through Staging → Production with approval gates, A/B test metadata, and rollback support.
- Artifact Management — Versioned storage for datasets, model checkpoints, and evaluation artifacts with automatic cleanup of old runs.
- Hyperparameter Sweep Templates — Grid search, random search, and Bayesian optimization configs with parallelism control.
- Reproducibility Snapshots — Capture git commit hash, pip freeze output, system info, and config files alongside every run.
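The reproducibility snapshot described above can be approximated with standard tooling. This is a minimal sketch, not the toolkit's own implementation; the `capture_snapshot` helper is illustrative, and each capture degrades gracefully to `None` when the underlying tool is unavailable:

```python
import json
import platform
import subprocess
import sys

def capture_snapshot() -> dict:
    """Collect a minimal reproducibility snapshot: git commit,
    installed packages, and basic system info."""
    def run(cmd):
        # Return the command's output, or None if the tool is unavailable
        try:
            return subprocess.check_output(
                cmd, stderr=subprocess.DEVNULL, text=True
            ).strip()
        except (OSError, subprocess.CalledProcessError):
            return None

    return {
        "git_commit": run(["git", "rev-parse", "HEAD"]),
        "pip_freeze": run([sys.executable, "-m", "pip", "freeze"]),
        "python_version": platform.python_version(),
        "platform": platform.platform(),
    }

snapshot = capture_snapshot()
# Print a truncated view; the full snapshot would be logged as a run artifact
print(json.dumps({k: (v or "")[:40] for k, v in snapshot.items()}, indent=2))
```

Storing the returned dict as a run artifact alongside the config file is usually enough to reconstruct the environment later.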
## Quick Start

```bash
unzip experiment-tracking-setup.zip && cd experiment-tracking-setup
pip install -r requirements.txt

# Option 1: Start a local MLflow server
docker compose up -d   # starts MLflow + PostgreSQL + MinIO

# Option 2: Use W&B (cloud)
export WANDB_API_KEY=YOUR_API_KEY_HERE
```
```yaml
# configs/development.yaml
tracking:
  backend: mlflow            # mlflow | wandb
  mlflow:
    tracking_uri: http://localhost:5000
    artifact_location: s3://mlflow-artifacts/
    experiment_name: my_project
  wandb:
    project: my_project
    entity: my_team

logging:
  log_every_n_steps: 10
  log_model_checkpoints: true
  log_system_metrics: true
  log_code: true
  capture_git_hash: true
  capture_pip_freeze: true

registry:
  auto_register_best: true
  promotion_metric: val_f1
  promotion_threshold: 0.85
  stages: [None, Staging, Production, Archived]

sweeps:
  strategy: bayesian         # grid | random | bayesian
  max_runs: 50
  parallel_workers: 4
  early_stopping:
    metric: val_loss
    patience: 10
```
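To make the sweep settings concrete, here is a minimal sketch of how the `grid` and `random` strategies enumerate trials. The search space below is hypothetical, and the `bayesian` strategy is omitted since it requires an optimization library such as Optuna:

```python
import itertools
import random

# Hypothetical search space; a real sweep config would define this in YAML
space = {
    "lr": [1e-4, 3e-4, 1e-3],
    "batch_size": [16, 32, 64],
    "optimizer": ["adamw", "sgd"],
}

def grid(space):
    """Yield every combination in the space (exhaustive grid search)."""
    keys = list(space)
    for values in itertools.product(*space.values()):
        yield dict(zip(keys, values))

def random_search(space, max_runs, seed=0):
    """Yield max_runs independently sampled configurations."""
    rng = random.Random(seed)
    for _ in range(max_runs):
        yield {k: rng.choice(v) for k, v in space.items()}

print(len(list(grid(space))))   # 3 * 3 * 2 = 18 grid points
trials = list(random_search(space, max_runs=5))
```

The `max_runs` cap in the config matters most for random and Bayesian search, since grid search is bounded by the space itself.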
## Architecture

```
┌──────────────┐     ┌───────────────────┐     ┌────────────────┐
│   Training   │────>│ Tracking Wrapper  │────>│ MLflow Server  │
│    Script    │     │   (Unified API)   │     │  or W&B Cloud  │
└──────────────┘     └───────────────────┘     └───────┬────────┘
                                                       │
┌──────────────┐     ┌───────────────────┐     ┌───────▼────────┐
│  Comparison  │<────│  Model Registry   │<────│ Artifact Store │
│   Reports    │     │  (Stage Gates)    │     │ (S3/GCS/Local) │
└──────────────┘     └───────────────────┘     └────────────────┘
```
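The "Tracking Wrapper (Unified API)" box is essentially a facade over interchangeable backends: training code calls one interface, and the backend (MLflow, W&B) is chosen by config. This sketch substitutes an in-memory stub for a real backend; all class names here are illustrative, not the toolkit's actual API:

```python
from typing import Protocol

class Backend(Protocol):
    """Interface every tracking backend must satisfy."""
    def log_params(self, params: dict) -> None: ...
    def log_metrics(self, metrics: dict, step: int) -> None: ...

class InMemoryBackend:
    """Stand-in backend; a real adapter would forward to
    mlflow.log_params / wandb.log behind this same interface."""
    def __init__(self):
        self.params, self.metrics = {}, []

    def log_params(self, params):
        self.params.update(params)

    def log_metrics(self, metrics, step):
        self.metrics.append((step, metrics))

class TrackerFacade:
    """Unified API: training code talks to this, never to a backend."""
    def __init__(self, backend):
        self._backend = backend

    def log_params(self, params):
        self._backend.log_params(params)

    def log_metrics(self, metrics, step=0):
        self._backend.log_metrics(metrics, step)

backend = InMemoryBackend()
tracker = TrackerFacade(backend)
tracker.log_params({"lr": 0.001})
tracker.log_metrics({"val_f1": 0.91}, step=1)
```

Swapping backends then means constructing a different adapter, with no changes to the training loop.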
## Usage Examples

### Log a Training Run

```python
from experiment_tracking_setup.core import ExperimentTracker

tracker = ExperimentTracker.from_config("configs/development.yaml")

with tracker.start_run(run_name="resnet50_lr001") as run:
    # Log hyperparameters
    run.log_params({
        "model": "resnet50",
        "lr": 0.001,
        "batch_size": 32,
        "optimizer": "adamw",
    })

    for epoch in range(50):
        train_loss, val_loss, val_f1 = train_one_epoch(model, epoch)
        run.log_metrics({
            "train_loss": train_loss,
            "val_loss": val_loss,
            "val_f1": val_f1,
        }, step=epoch)

    # Log model artifacts
    run.log_model(model, artifact_name="resnet50_classifier")
    run.log_artifact("./configs/training_config.yaml")
```
### Compare Experiments

```python
from experiment_tracking_setup.core import ExperimentComparator

comparator = ExperimentComparator(tracking_uri="http://localhost:5000")

# Get the top 5 runs by validation F1
top_runs = comparator.get_top_runs(
    experiment_name="my_project",
    metric="val_f1",
    top_k=5,
)

# Generate a comparison report
report = comparator.compare(top_runs, metrics=["val_f1", "val_loss", "train_time"])
comparator.save_report(report, "comparison_report.html")

# Statistical significance test between the top 2 runs
sig_test = comparator.significance_test(top_runs[0], top_runs[1], metric="val_f1")
print(f"p-value: {sig_test['p_value']:.4f}")
```
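A backend-agnostic way to compute such a p-value, assuming each run has several scores (e.g. one per random seed), is a permutation test on the difference of means. This sketch is one plausible implementation, not necessarily what `significance_test` does internally, and the scores below are made up:

```python
import random

def permutation_test(a, b, n_permutations=10_000, seed=0):
    """Two-sided permutation test on the difference of means.
    Returns an approximate p-value."""
    rng = random.Random(seed)
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_permutations):
        # Randomly reassign scores to the two groups
        rng.shuffle(pooled)
        perm_a, perm_b = pooled[:len(a)], pooled[len(a):]
        diff = abs(sum(perm_a) / len(perm_a) - sum(perm_b) / len(perm_b))
        if diff >= observed:
            hits += 1
    return hits / n_permutations

# Hypothetical per-seed val_f1 scores for two runs
run_a = [0.912, 0.905, 0.918, 0.909, 0.915]
run_b = [0.881, 0.874, 0.890, 0.879, 0.885]
p = permutation_test(run_a, run_b)
```

With only a single score per run, no significance test is meaningful, which is one more reason to train with multiple seeds before promoting a model.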
## Configuration Reference

| Parameter | Type | Default | Description |
|---|---|---|---|
| `tracking.backend` | str | `mlflow` | Tracking backend: `mlflow` or `wandb` |
| `logging.log_every_n_steps` | int | `10` | Metric logging frequency |
| `logging.capture_git_hash` | bool | `true` | Capture the git commit for reproducibility |
| `registry.promotion_metric` | str | `val_f1` | Metric used for auto-promotion decisions |
| `sweeps.strategy` | str | `bayesian` | Hyperparameter search strategy |
| `sweeps.parallel_workers` | int | `4` | Number of parallel sweep workers |
## Best Practices

- **Name experiments by project, not by date** — Use `fraud_detection_v2`, not `experiment_2026_03_23`. Dates are metadata; names should be searchable.
- **Log hyperparameters BEFORE training starts** — If training crashes, you still know what was attempted.
- **Use tags for filtering, not run names** — Tags like `{"backbone": "resnet50", "dataset": "v3"}` are queryable; clever run names are not.
- **Set up artifact cleanup** — Old checkpoints accumulate fast. Schedule a weekly cleanup of runs older than 30 days that aren't registered models.
- **Always log the full config file** — Individual params are useful for search, but the complete config is essential for exact reproduction.
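The cleanup practice above reduces to a small retention policy: delete runs past a cutoff age unless they back a registered model. This sketch operates on plain tuples rather than a real tracking client (fetching and deleting runs would go through MLflow's or W&B's own APIs), and the run names are hypothetical:

```python
from datetime import datetime, timedelta, timezone

def select_runs_to_delete(runs, registered_run_ids, max_age_days=30, now=None):
    """Return IDs of runs older than max_age_days that never produced
    a registered model. `runs` is a list of (run_id, end_time) pairs."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=max_age_days)
    return [
        run_id
        for run_id, end_time in runs
        if end_time < cutoff and run_id not in registered_run_ids
    ]

now = datetime(2026, 3, 23, tzinfo=timezone.utc)
runs = [
    ("old_unregistered", datetime(2026, 1, 1, tzinfo=timezone.utc)),
    ("old_registered", datetime(2026, 1, 1, tzinfo=timezone.utc)),
    ("recent", datetime(2026, 3, 20, tzinfo=timezone.utc)),
]
doomed = select_runs_to_delete(runs, registered_run_ids={"old_registered"}, now=now)
# doomed == ["old_unregistered"]
```

Keeping the policy pure like this makes it easy to unit-test before pointing it at a live tracking server.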
## Troubleshooting

| Issue | Cause | Fix |
|---|---|---|
| MLflow UI shows no runs | Tracking URI mismatch | Verify the `MLFLOW_TRACKING_URI` env var matches your server address |
| Artifact upload fails | S3 credentials not configured | Set `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`, or configure MinIO credentials |
| W&B run hangs on init | Network/auth issue | Run `wandb login` to verify your API key, and check firewall rules |
| Model registry version conflict | Concurrent promotions | Call `transition_stage()` with `archive_existing=True` to auto-archive the previous Production model |
This is 1 of 11 resources in the ML Engineer Toolkit. Get the complete Experiment Tracking Setup, with all files, templates, and documentation, for $29. Or grab the entire ML Engineer Toolkit bundle (11 products) for $149 and save 30%.