Originally published at mlopslab.org/mlflow-tutorial — updated weekly. 0 sponsors, 0 affiliate links.
⚡ Quick answer: MLflow is an open-source platform that tracks everything about your ML experiments — parameters, metrics, model artifacts, and code versions — so you can reproduce any result and never lose a winning configuration again. You'll have your first experiment tracked in under 20 minutes.
Table of Contents
- What is MLflow?
- Before you start
- Step 1 — Install MLflow
- Step 2 — Start the tracking server
- Step 3 — Write your first tracking script
- Step 4 — View results in the UI
- Step 5 — Compare multiple runs
- What to learn next
- FAQ
1. What is MLflow?
MLflow is an open-source platform that tracks everything about your ML experiments — parameters, metrics, model artifacts, and code versions — so you can reproduce any result and never lose a winning configuration again.
Without experiment tracking, most ML engineers waste hours rerunning experiments they've already done — or ship models they can't reproduce. MLflow eliminates both problems permanently.
At its core, MLflow gives you four things:
- Tracking — log parameters, metrics, and artifacts for every run
- Projects — package code so it's reproducible on any machine
- Models — a standard format to package models for deployment
- Registry — a central hub to manage model lifecycle (staging → production)
This tutorial covers the Tracking component, which is where 90% of the day-to-day value lives.
💡 Note: MLflow is model-framework agnostic. It works with scikit-learn, PyTorch, TensorFlow, XGBoost, Keras, LightGBM — anything you're already using.
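A taste of that flexibility: autologging can capture an entire training call with one extra line. A minimal sketch using scikit-learn (which we install in Step 1):

```python
import mlflow
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# One line: MLflow now intercepts fit() calls from supported frameworks
# and logs params, metrics, and the model automatically.
mlflow.autolog()

X, y = load_iris(return_X_y=True)
LogisticRegression(max_iter=200).fit(X, y)  # captured as an MLflow run
```

In this tutorial we'll log explicitly with `log_param()` and `log_metric()` instead, because manual logging makes it clearer what MLflow actually stores.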
2. Before you start
You need three things:
- Python 3.8+ — run `python --version` to check
- pip — comes with Python 3.4+
- Basic ML knowledge — you should know what "training a model" and "accuracy" mean
That's it. No Docker, no AWS account, no paid tier.
3. Step 1 — Install MLflow
⏱ 2 minutes
MLflow is a single pip install. It includes the tracking server, the UI, and the full Python API.
```bash
pip install mlflow scikit-learn
```
Verify the install:
```bash
mlflow --version
# mlflow, version 2.x.x
```
✅ Using a virtual environment? Run `python -m venv .venv && source .venv/bin/activate` before installing — recommended to keep your environment clean.
4. Step 2 — Start the tracking server
⏱ 1 minute
In a terminal, run:
```bash
mlflow ui
```
You'll see:
```
[2026-04-15 10:23:01 +0000] [INFO] Starting gunicorn 21.2.0
[2026-04-15 10:23:01 +0000] [INFO] Listening at: http://127.0.0.1:5000
```
Open http://localhost:5000 in your browser — you'll see an empty MLflow dashboard. Leave this terminal running.
⚠️ Port conflict? If port 5000 is taken (common on macOS), run `mlflow ui --port 5001` and visit http://localhost:5001 instead.
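One thing worth knowing before the next step: by default your script writes runs to a local `mlruns/` folder and `mlflow ui` reads from that same folder, so run both from the same directory. If you'd rather send runs to the server over HTTP, point MLflow at it explicitly — a minimal sketch, assuming the default port:

```python
import mlflow

# Log over HTTP to the running server instead of writing ./mlruns directly.
# Equivalent to: export MLFLOW_TRACKING_URI=http://localhost:5000
mlflow.set_tracking_uri("http://localhost:5000")
```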
5. Step 3 — Write your first tracking script
⏱ 10 minutes
Create a file called train.py and paste this:
```python
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score

# Configuration — change these to experiment
N_ESTIMATORS = 100
MAX_DEPTH = 5
RANDOM_STATE = 42

# Load data
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=RANDOM_STATE
)

# Name your experiment (MLflow creates it if it doesn't exist)
mlflow.set_experiment("iris-classifier")

with mlflow.start_run():
    # Train model
    model = RandomForestClassifier(
        n_estimators=N_ESTIMATORS,
        max_depth=MAX_DEPTH,
        random_state=RANDOM_STATE,
    )
    model.fit(X_train, y_train)

    # Evaluate
    predictions = model.predict(X_test)
    accuracy = accuracy_score(y_test, predictions)
    f1 = f1_score(y_test, predictions, average="weighted")

    # Log everything to MLflow
    mlflow.log_param("n_estimators", N_ESTIMATORS)
    mlflow.log_param("max_depth", MAX_DEPTH)
    mlflow.log_metric("accuracy", accuracy)
    mlflow.log_metric("f1_score", f1)
    mlflow.sklearn.log_model(model, "random-forest-model")

    print(f"Accuracy: {accuracy:.4f} | F1: {f1:.4f}")
    print(f"Run ID: {mlflow.active_run().info.run_id}")
```
Run it:
```bash
python train.py
# Accuracy: 0.9667 | F1: 0.9667
# Run ID: a1b2c3d4e5f6...
```
MLflow created an mlruns/ folder in your working directory. That's where everything is stored locally.
What each MLflow call does
| Call | What it logs | Example |
|---|---|---|
| `mlflow.set_experiment()` | Groups runs under a named experiment | `"iris-classifier"` |
| `mlflow.log_param()` | A single key-value config value | `n_estimators=100` |
| `mlflow.log_metric()` | A numeric result (can be stepped over time) | `accuracy=0.967` |
| `mlflow.sklearn.log_model()` | The trained model artifact + signature | Serialized RandomForest |
✅ It worked! Every run gets a unique run ID, timestamp, and its own folder under `mlruns/`. Nothing overwrites anything.
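That run ID is also your handle for loading the model back. A minimal sketch — swap in the Run ID your own script printed (the one below is a placeholder):

```python
import mlflow.sklearn

RUN_ID = "a1b2c3d4e5f6"  # placeholder — use the Run ID from your output

# Load the exact model logged by that run and score one iris sample
model = mlflow.sklearn.load_model(f"runs:/{RUN_ID}/random-forest-model")
print(model.predict([[5.1, 3.5, 1.4, 0.2]]))
```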
6. Step 4 — View results in the MLflow UI
⏱ 2 minutes
Go back to http://localhost:5000. You'll now see your iris-classifier experiment with one run logged.
Click the run to see:
- Parameters tab — `n_estimators`, `max_depth`, `random_state`
- Metrics tab — `accuracy`, `f1_score` with a time-series chart
- Artifacts tab — the serialized model, ready to load

Figure 1: MLflow tracking UI — parameters and metrics are visualized automatically per run
7. Step 5 — Compare multiple runs
⏱ 5 minutes
This is where MLflow pays off. Run train.py a few more times with different parameters:
```bash
# Edit N_ESTIMATORS and MAX_DEPTH in train.py between runs, then:
python train.py  # run 2: n_estimators=50, max_depth=3
python train.py  # run 3: n_estimators=200, max_depth=10
python train.py  # run 4: n_estimators=10, max_depth=2
```
In the MLflow UI, check the checkboxes next to multiple runs and click "Compare". You'll get a side-by-side table of every parameter and metric across all runs.

Figure 2: Compare runs side-by-side — MLflow shows exactly which parameters produced the best results
You can now answer: "Which configuration gave us the best result, and can we reproduce it?" — with a single click, using the run ID.
🏆 Pro tip: In the UI, click any metric column header to sort runs by that metric. The best run floats to the top instantly.
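If you'd rather compare runs in code, `mlflow.search_runs` returns them as a pandas DataFrame — a minimal sketch, sorted best-first:

```python
import mlflow

# Every run in the experiment, ordered by accuracy (best first)
runs = mlflow.search_runs(
    experiment_names=["iris-classifier"],
    order_by=["metrics.accuracy DESC"],
)
print(runs[["run_id", "params.n_estimators", "params.max_depth", "metrics.accuracy"]])
```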
8. What to learn next
Once you have basic tracking working, these are the natural next steps in order of complexity:
Model Registry — promote your best run from "Experiment" to "Staging" to "Production" with one click. Gives you a version-controlled model store with transition history.
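Registration also works from code — a sketch using `mlflow.register_model`, where the Run ID is a placeholder and the registry assumes a database-backed store (see the FAQ below):

```python
import mlflow

RUN_ID = "a1b2c3d4e5f6"  # placeholder — a real Run ID from your experiment

# Creates a registered model named "iris-classifier" (or adds a new version)
mlflow.register_model(f"runs:/{RUN_ID}/random-forest-model", "iris-classifier")
```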
Log more metrics — use mlflow.log_metric("loss", loss, step=epoch) inside your training loop to track metrics over time, not just at the end. The UI plots them automatically.
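A minimal sketch of what that looks like inside a loop — the loss here is a stand-in for your real training loss:

```python
import mlflow

with mlflow.start_run():
    for epoch in range(10):
        loss = 1.0 / (epoch + 1)  # stand-in for a real training loss
        mlflow.log_metric("loss", loss, step=epoch)  # one point per epoch
```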
Serve your model — run mlflow models serve -m runs:/<RUN_ID>/random-forest-model --port 8080 to expose your logged model as a REST API endpoint. No extra code needed.
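Once it's serving, scoring is a plain HTTP call. A sketch of a request under MLflow 2.x's JSON scoring protocol, using the port above and one iris sample:

```bash
curl -X POST http://localhost:8080/invocations \
  -H "Content-Type: application/json" \
  -d '{"inputs": [[5.1, 3.5, 1.4, 0.2]]}'
```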
Remote tracking server — instead of mlflow ui on localhost, point your team at one shared PostgreSQL-backed server: mlflow server --backend-store-uri postgresql://.... Every engineer's runs go to the same place.
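Spelled out, that might look like this — the connection string, bucket, and host are placeholders:

```bash
mlflow server \
  --backend-store-uri postgresql://user:pass@db-host:5432/mlflow \
  --default-artifact-root s3://my-mlflow-artifacts \
  --host 0.0.0.0 --port 5000
```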
9. FAQ
What's the difference between MLflow and Weights & Biases?
MLflow is fully open-source and self-hostable — your data never leaves your infrastructure. W&B is cloud-first with a better UI and more advanced features (sweeps, reports), but costs money at scale. For teams that need data sovereignty or are cost-sensitive, MLflow wins. See the full MLflow vs W&B comparison for a detailed breakdown.
Can MLflow track deep learning training loops?
Yes. Use mlflow.log_metric("loss", loss, step=epoch) inside your epoch loop and MLflow plots the full training curve. It also has autologging support for PyTorch Lightning, Keras, and Hugging Face — one line enables automatic logging of all metrics, params, and the final model.
What happens to my runs if I delete mlruns/?
They're gone. For anything beyond local experimentation, set up a proper backend store (SQLite at minimum, PostgreSQL for teams) and an artifact store (S3, GCS, or Azure Blob). Then your runs survive machine restarts and are shareable.
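A minimal durable setup might look like this — the file paths are illustrative:

```bash
# SQLite for run metadata, a local folder for artifacts
mlflow server \
  --backend-store-uri sqlite:///mlflow.db \
  --default-artifact-root ./mlartifacts \
  --port 5000
```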
Does MLflow work with open-source models like Llama or Mistral?
Yes — MLflow has an `mlflow.transformers` flavor for Hugging Face models and supports custom Python function (pyfunc) flavors for anything else. You can log any model as long as you can serialize it.
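For the custom case, a toy pyfunc sketch — the class and its logic are invented purely for illustration:

```python
import mlflow
import mlflow.pyfunc

class Reverser(mlflow.pyfunc.PythonModel):
    """Toy custom model: reverses each input string."""
    def predict(self, context, model_input):
        # At serving time MLflow passes a pandas DataFrame;
        # this toy assumes a plain list of strings for brevity.
        return [s[::-1] for s in model_input]

with mlflow.start_run():
    mlflow.pyfunc.log_model(artifact_path="reverser", python_model=Reverser())
```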
How does MLflow compare to ClearML?
Both are strong open-source options. ClearML has a richer built-in UI and experiment orchestration features out of the box. MLflow has a larger ecosystem and better framework integrations. See the MLflow vs ClearML breakdown for a production-focused comparison.
Conclusion
MLflow experiment tracking isn't optional once you're running more than a handful of experiments. The "I'll remember which config worked best" approach breaks fast.
The minimum viable setup:
- `pip install mlflow` → `mlflow ui` → `mlflow.log_param()` + `mlflow.log_metric()`
That combination gives you full reproducibility with maybe 30 minutes of implementation work.
Don't set up the perfect MLflow infrastructure before you ship. Start local, log everything, move to a shared server when you have a team. The habit of logging compounds.
🔗 Next step: Run the `train.py` above → check your first run in the UI at `localhost:5000`. That's the first 15 minutes. Everything else follows from having that first run visible.
Related articles on MLOpsLab
- MLflow vs Weights & Biases: Which Actually Saves Engineering Time?
- MLflow vs ClearML: Which Open Source MLOps Tool Actually Wins (2026)?
- How to Deploy a Machine Learning Model with Docker & MLflow (2026)
- LLM Observability: The ML Engineer's Practical Guide (2026)
Written by Ayub Shah — ML Engineering student, MLOps enthusiast. Testing every tool so you don't have to. No sponsors, no affiliate links.
→ More at mlopslab.org