# ML Data Versioning
DVC-based data versioning that brings Git-like version control to your datasets and ML pipelines. Track every dataset change, reproduce any experiment, and share data across teams without copying files around.
## Key Features

- **Dataset version control** — track large files and directories with DVC alongside your Git repo
- **Pipeline versioning** — define reproducible ML pipelines as DAGs with `dvc.yaml`
- **Remote storage backends** — push/pull data from S3, GCS, Azure Blob, SSH, and HDFS
- **Experiment tracking** — compare metrics across branches and commits with `dvc metrics`
- **Data lineage** — trace any model prediction back to the exact training data version
- **CI/CD integration** — validate data pipelines in pull requests with automated checks
- **Lightweight switching** — change dataset versions as fast as `git checkout`
## Quick Start

```bash
# 1. Initialize DVC in your Git repo
cd your-ml-project
dvc init

# 2. Configure remote storage
dvc remote add -d storage s3://your-bucket/dvc-store
dvc remote modify storage region us-east-1

# 3. Start tracking data
dvc add data/training_set.csv
git add data/training_set.csv.dvc data/.gitignore
git commit -m "Track training set v1"

# 4. Push data to remote
dvc push
```
"""Load a specific version of data programmatically."""
import subprocess
import pandas as pd
def load_versioned_data(git_rev: str, data_path: str) -> pd.DataFrame:
"""Checkout and load data from a specific Git revision."""
subprocess.run(["dvc", "checkout", data_path, "--rev", git_rev], check=True)
return pd.read_csv(data_path)
# Load training data from the v1.2 release
train_df = load_versioned_data("v1.2", "data/training_set.csv")
print(f"Loaded {len(train_df)} rows from v1.2")
## Architecture

```text
ml-data-versioning/
├── config.example.yaml          # DVC remote and pipeline configuration
├── templates/
│   ├── dvc_setup/
│   │   ├── .dvc/config          # DVC configuration template
│   │   ├── .dvcignore           # Files to exclude from DVC tracking
│   │   └── dvc.yaml             # Pipeline DAG definition
│   ├── pipelines/
│   │   ├── preprocess.py        # Data preprocessing stage
│   │   ├── train.py             # Model training stage
│   │   ├── evaluate.py          # Evaluation and metrics stage
│   │   └── params.yaml          # Pipeline parameters
│   └── ci/
│       ├── github_actions.yaml  # CI pipeline validation workflow
│       └── validate_data.py     # Data schema checks for PRs
├── docs/
│   └── overview.md
└── examples/
    ├── basic_tracking.sh        # Track files and push to remote
    └── pipeline_example.sh      # Run a full DVC pipeline
```
The DVC pipeline DAG defines stages (preprocess → train → evaluate) with explicit dependencies. Running `dvc repro` only re-executes stages whose inputs changed.
## Usage Examples

### Define a Reproducible Pipeline

```yaml
# dvc.yaml
stages:
  preprocess:
    cmd: python templates/pipelines/preprocess.py
    deps:
      - data/raw/
      - templates/pipelines/preprocess.py
    params:
      - preprocess.split_ratio
      - preprocess.random_seed
    outs:
      - data/processed/train.csv
      - data/processed/test.csv
  train:
    cmd: python templates/pipelines/train.py
    deps:
      - data/processed/train.csv
      - templates/pipelines/train.py
    params:
      - train.learning_rate
      - train.epochs
    outs:
      - models/model.pkl
    metrics:
      - metrics/train_metrics.json:
          cache: false
  evaluate:
    cmd: python templates/pipelines/evaluate.py
    deps:
      - models/model.pkl
      - data/processed/test.csv
    metrics:
      - metrics/eval_metrics.json:
          cache: false
    plots:
      - metrics/confusion_matrix.csv
```
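Each stage reads its own section of `params.yaml`, which is how DVC links parameter changes to stage re-execution. The kit's actual `preprocess.py` is not reproduced here; the following is a minimal sketch of what the preprocess stage could look like, assuming the `split_ratio` and `random_seed` keys declared above and a hypothetical raw file name:

```python
"""Minimal sketch of a preprocess stage (hypothetical; the kit's
templates/pipelines/preprocess.py may differ)."""
from pathlib import Path

import pandas as pd
import yaml


def preprocess(raw_path: str = "data/raw/dataset.csv",
               out_dir: str = "data/processed",
               params_path: str = "params.yaml") -> tuple[int, int]:
    """Split raw data into train/test sets using params.yaml settings."""
    # DVC passes no CLI arguments: the stage reads the keys it declared
    # under `params:` in dvc.yaml, so editing params.yaml marks it stale.
    with open(params_path) as f:
        params = yaml.safe_load(f)["preprocess"]

    df = pd.read_csv(raw_path)
    # Deterministic split: the seed makes `dvc repro` reproducible.
    test = df.sample(frac=params["split_ratio"],
                     random_state=params["random_seed"])
    train = df.drop(test.index)

    Path(out_dir).mkdir(parents=True, exist_ok=True)
    train.to_csv(f"{out_dir}/train.csv", index=False)
    test.to_csv(f"{out_dir}/test.csv", index=False)
    return len(train), len(test)


if __name__ == "__main__":
    n_train, n_test = preprocess()
    print(f"Wrote {n_train} train rows and {n_test} test rows")
```

Because `data/processed/train.csv` and `test.csv` are declared as `outs`, DVC caches and version-controls them automatically after the stage runs.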
### Compare Experiments Across Branches

```bash
# Show metrics diff between current branch and main
dvc metrics diff main

# Compare parameters across experiments
dvc params diff main

# Show full experiment table
dvc exp show --sort-by metrics/eval_metrics.json:accuracy
```
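The same comparison can be scripted. A sketch, assuming `metrics/eval_metrics.json` exists on both revisions (the `metrics_diff` helper is hypothetical; it mirrors what `dvc metrics diff` prints):

```python
"""Sketch: compare evaluation metrics between two Git revisions."""
import json


def metrics_diff(old: dict, new: dict) -> dict:
    """Per-metric deltas for metrics present in both revisions."""
    return {k: new[k] - old[k] for k in old.keys() & new.keys()}


def load_metrics(rev: str, path: str = "metrics/eval_metrics.json") -> dict:
    # dvc.api.read returns the file's contents at any rev, whether the
    # file is tracked by Git (cache: false metrics) or by DVC.
    import dvc.api
    return json.loads(dvc.api.read(path, rev=rev))


if __name__ == "__main__":
    diff = metrics_diff(load_metrics("main"), load_metrics("HEAD"))
    for name, delta in sorted(diff.items()):
        print(f"{name}: {delta:+.4f}")
```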
## Configuration

```yaml
# config.example.yaml
remote:
  name: "storage"
  url: "s3://your-bucket/dvc-store"  # s3:// | gs:// | azure:// | ssh://
cache:
  type: "hardlink"  # hardlink | symlink | copy

preprocess:
  split_ratio: 0.2
  random_seed: 42
train:
  learning_rate: 0.001
  epochs: 50
```
## Best Practices

- **Never `git add` large files directly** — use `dvc add` for anything over 10 MB; Git stores only the `.dvc` pointer file
- **Tag data releases** — use `git tag v1.0-data` after significant dataset updates so you can always retrieve that version
- **Use `dvc repro`, not manual script runs** — the pipeline DAG skips unchanged stages automatically, saving compute
- **Store params in `params.yaml`** — DVC tracks parameter changes and links them to metrics for experiment comparison
- **Set up CI data validation** — run the included `validate_data.py` in PRs to catch schema drift before it hits training
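The kit's `validate_data.py` is not reproduced here; the following is a minimal sketch of the kind of schema check a PR workflow could run, assuming column names and pandas dtypes are the contract (the `EXPECTED_SCHEMA` values are placeholders):

```python
"""Minimal sketch of a CI data-validation check (hypothetical; the kit's
templates/ci/validate_data.py may implement this differently)."""
import sys

import pandas as pd

# The "contract": expected columns and pandas dtypes for the training set.
EXPECTED_SCHEMA = {"feature_a": "float64", "feature_b": "int64", "label": "int64"}


def validate(csv_path: str, schema: dict = EXPECTED_SCHEMA) -> list[str]:
    """Return a list of schema violations (empty list means the data is valid)."""
    df = pd.read_csv(csv_path)
    errors = []
    for col, dtype in schema.items():
        if col not in df.columns:
            errors.append(f"missing column: {col}")
        elif str(df[col].dtype) != dtype:
            errors.append(f"{col}: expected {dtype}, got {df[col].dtype}")
    return errors


if __name__ == "__main__":
    problems = validate(sys.argv[1])
    for p in problems:
        print(f"SCHEMA ERROR: {p}")
    sys.exit(1 if problems else 0)  # non-zero exit fails the PR check
```

A non-zero exit code is all CI needs: the GitHub Actions workflow in `templates/ci/` can run this against the PR's data and block the merge on failure.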
## Troubleshooting

| Problem | Cause | Fix |
|---|---|---|
| `dvc push` hangs or fails | Misconfigured remote credentials | Verify with `aws s3 ls s3://your-bucket/` (or equivalent); check `dvc remote list` |
| `dvc repro` re-runs all stages | Lock file deleted or corrupted | Run `dvc repro` once fully; ensure `dvc.lock` is committed to Git |
| Cache filling up disk | Large datasets with many versions | Run `dvc gc --workspace` to clean unused cache entries |
| File conflicts after `git merge` | `.dvc` pointer files diverged | Run `dvc checkout` after merge to sync data with the correct pointers |
This is 1 of 10 resources in the ML Starter Kit toolkit. Get the complete ML Data Versioning resource with all files, templates, and documentation for $29.

Or grab the entire ML Starter Kit bundle (10 products) for $149 — save 30%.