Ayub Shah

Posted on • Originally published at mlopslab.org

MLflow Tutorial: How to Track ML Experiments Like a Pro (2026)

Originally published at mlopslab.org/mlflow-tutorial — updated weekly. 0 sponsors, 0 affiliate links.


⚡ Quick answer: MLflow is an open-source platform that tracks everything about your ML experiments — parameters, metrics, model artifacts, and code versions — so you can reproduce any result and never lose a winning configuration again. You'll have your first experiment tracked in under 20 minutes.


Table of Contents

  1. What is MLflow?
  2. Before you start
  3. Step 1 — Install MLflow
  4. Step 2 — Start the tracking server
  5. Step 3 — Write your first tracking script
  6. Step 4 — View results in the UI
  7. Step 5 — Compare multiple runs
  8. What to learn next
  9. FAQ

1. What is MLflow?

MLflow is an open-source platform that tracks everything about your ML experiments — parameters, metrics, model artifacts, and code versions — so you can reproduce any result and never lose a winning configuration again.

Without experiment tracking, most ML engineers waste hours rerunning experiments they've already done — or ship models they can't reproduce. MLflow eliminates both problems permanently.

At its core, MLflow gives you four things:

  • Tracking — log parameters, metrics, and artifacts for every run
  • Projects — package code so it's reproducible on any machine
  • Models — a standard format to package models for deployment
  • Registry — a central hub to manage model lifecycle (staging → production)

This tutorial covers the Tracking component, which is where 90% of the day-to-day value lives.

💡 Note: MLflow is model-framework agnostic. It works with scikit-learn, PyTorch, TensorFlow, XGBoost, Keras, LightGBM — anything you're already using.


2. Before you start

You need three things:

  • Python 3.8+ — run python --version to check
  • pip installed — comes with Python 3.4+
  • Basic ML knowledge — you should know what "training a model" and "accuracy" mean

That's it. No Docker, no AWS account, no paid tier.


3. Step 1 — Install MLflow

2 minutes

MLflow is a single pip install. It includes the tracking server, the UI, and the full Python API.

pip install mlflow scikit-learn

Verify the install:

mlflow --version
# mlflow, version 2.x.x

Using a virtual environment? Run python -m venv .venv && source .venv/bin/activate before installing — recommended, since it keeps MLflow and its dependencies out of your global Python.


4. Step 2 — Start the tracking server

1 minute

In a terminal, run:

mlflow ui

You'll see:

[2026-04-15 10:23:01 +0000] [INFO] Starting gunicorn 21.2.0
[2026-04-15 10:23:01 +0000] [INFO] Listening at: http://127.0.0.1:5000

Open http://localhost:5000 in your browser — you'll see an empty MLflow dashboard. Leave this terminal running.

⚠️ Port conflict? If port 5000 is taken (common on macOS), run mlflow ui --port 5001 and visit http://localhost:5001 instead.


5. Step 3 — Write your first tracking script

10 minutes

Create a file called train.py and paste this:

import mlflow
import mlflow.sklearn
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score

# Configuration — change these to experiment
N_ESTIMATORS = 100
MAX_DEPTH = 5
RANDOM_STATE = 42

# Load data
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=RANDOM_STATE
)

# Name your experiment (MLflow creates it if it doesn't exist)
mlflow.set_experiment("iris-classifier")

with mlflow.start_run():
    # Train model
    model = RandomForestClassifier(
        n_estimators=N_ESTIMATORS,
        max_depth=MAX_DEPTH,
        random_state=RANDOM_STATE
    )
    model.fit(X_train, y_train)

    # Evaluate
    predictions = model.predict(X_test)
    accuracy = accuracy_score(y_test, predictions)
    f1 = f1_score(y_test, predictions, average="weighted")

    # Log everything to MLflow
    mlflow.log_param("n_estimators", N_ESTIMATORS)
    mlflow.log_param("max_depth", MAX_DEPTH)
    mlflow.log_metric("accuracy", accuracy)
    mlflow.log_metric("f1_score", f1)
    mlflow.sklearn.log_model(model, "random-forest-model")

    print(f"Accuracy: {accuracy:.4f} | F1: {f1:.4f}")
    print(f"Run ID: {mlflow.active_run().info.run_id}")

Run it:

python train.py
# Accuracy: 0.9667 | F1: 0.9667
# Run ID: a1b2c3d4e5f6...

MLflow created an mlruns/ folder in your working directory. That's where everything is stored locally.

What each MLflow call does

  • mlflow.set_experiment() — groups runs under a named experiment (e.g. "iris-classifier")
  • mlflow.log_param() — logs a single key-value config value (e.g. n_estimators=100)
  • mlflow.log_metric() — logs a numeric result, optionally stepped over time (e.g. accuracy=0.967)
  • mlflow.sklearn.log_model() — logs the trained model artifact plus its signature (the serialized RandomForest)

It worked! Every run gets a unique run ID, timestamp, and its own folder under mlruns/. Nothing overwrites anything.


6. Step 4 — View results in the MLflow UI

2 minutes

Go back to http://localhost:5000. You'll now see your iris-classifier experiment with one run logged.

Click the run to see:

  • Parameters tab — n_estimators, max_depth, random_state
  • Metrics tab — accuracy, f1_score with a time-series chart
  • Artifacts tab — the serialized model, ready to load

Figure 1: MLflow tracking UI — parameters and metrics are visualized automatically per run


7. Step 5 — Compare multiple runs

5 minutes

This is where MLflow pays off. Run train.py a few more times with different parameters:

# Edit N_ESTIMATORS and MAX_DEPTH in train.py between runs, then:
python train.py  # run 2: n_estimators=50, max_depth=3
python train.py  # run 3: n_estimators=200, max_depth=10
python train.py  # run 4: n_estimators=10, max_depth=2

In the MLflow UI, check the checkboxes next to multiple runs and click "Compare". You'll get a side-by-side table of every parameter and metric across all runs.

Figure 2: Compare runs side-by-side — MLflow shows exactly which parameters produced the best results

You can now answer: "Which configuration gave us the best result, and can we reproduce it?" — with a single click, using the run ID.

🏆 Pro tip: In the UI, click any metric column header to sort runs by that metric. The best run floats to the top instantly.


8. What to learn next

Once you have basic tracking working, these are the natural next steps in order of complexity:

Model Registry — promote your best run from "Experiment" to "Staging" to "Production" with one click. Gives you a version-controlled model store with transition history.

Log more metrics — use mlflow.log_metric("loss", loss, step=epoch) inside your training loop to track metrics over time, not just at the end. The UI plots them automatically.

Serve your model — run mlflow models serve -m runs:/<RUN_ID>/random-forest-model --port 8080 to expose your logged model as a REST API endpoint. No extra code needed.
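The serving endpoint accepts JSON. A standard-library sketch of building the request — the port and the assumption that the serve command above is running are both from the example; the actual send is left commented out:

```python
import json
import urllib.request

# MLflow 2.x scoring servers accept a "dataframe_split" JSON payload.
# One iris flower, matching the four training features:
payload = {
    "dataframe_split": {
        "columns": [
            "sepal length (cm)", "sepal width (cm)",
            "petal length (cm)", "petal width (cm)",
        ],
        "data": [[5.1, 3.5, 1.4, 0.2]],
    }
}

request = urllib.request.Request(
    "http://127.0.0.1:8080/invocations",  # assumes `mlflow models serve` is running
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# With the server running, this returns the model's prediction:
# response = urllib.request.urlopen(request)
# print(response.read().decode())
```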

Remote tracking server — instead of mlflow ui on localhost, point your team at one shared PostgreSQL-backed server: mlflow server --backend-store-uri postgresql://.... Every engineer's runs go to the same place.


9. FAQ

What's the difference between MLflow and Weights & Biases?

MLflow is fully open-source and self-hostable — your data never leaves your infrastructure. W&B is cloud-first with a better UI and more advanced features (sweeps, reports), but costs money at scale. For teams that need data sovereignty or are cost-sensitive, MLflow wins. See the full MLflow vs W&B comparison for a detailed breakdown.

Can MLflow track deep learning training loops?

Yes. Use mlflow.log_metric("loss", loss, step=epoch) inside your epoch loop and MLflow plots the full training curve. It also has autologging support for PyTorch Lightning, Keras, and Hugging Face — one line enables automatic logging of all metrics, params, and the final model.

What happens to my runs if I delete mlruns/?

They're gone. For anything beyond local experimentation, set up a proper backend store (SQLite at minimum, PostgreSQL for teams) and an artifact store (S3, GCS, or Azure Blob). Then your runs survive machine restarts and are shareable.

Does MLflow work with open-source models like Llama or Mistral?

Yes — MLflow has a mlflow.transformers flavor for Hugging Face models and supports custom Python function flavors for anything else. You can log any model as long as you can serialize it.

How does MLflow compare to ClearML?

Both are strong open-source options. ClearML has a richer built-in UI and experiment orchestration features out of the box. MLflow has a larger ecosystem and better framework integrations. See the MLflow vs ClearML breakdown for a production-focused comparison.


Conclusion

MLflow experiment tracking isn't optional once you're running more than a handful of experiments. The "I'll remember which config worked best" approach breaks fast.

The minimum viable setup:

  • pip install mlflow → mlflow ui → mlflow.log_param() + mlflow.log_metric()

That combination gives you full reproducibility with maybe 30 minutes of implementation work.

Don't set up the perfect MLflow infrastructure before you ship. Start local, log everything, move to a shared server when you have a team. The habit of logging compounds.

🔗 Next step: Run the train.py above → check your first run in the UI at localhost:5000. That's the first 15 minutes. Everything else follows from having that first run visible.




Written by Ayub Shah — ML Engineering student, MLOps enthusiast. Testing every tool so you don't have to. No sponsors, no affiliate links.

→ More at mlopslab.org
