# ML Data Versioning
DVC-based data versioning that brings Git-like version control to your datasets and ML pipelines. Track every dataset change, reproduce any experiment, and share data across teams without copying files around.
## Key Features

- **Dataset version control** — track large files and directories with DVC alongside your Git repo
- **Pipeline versioning** — define reproducible ML pipelines as DAGs with `dvc.yaml`
- **Remote storage backends** — push/pull data from S3, GCS, Azure Blob, SSH, and HDFS
- **Experiment tracking** — compare metrics across branches and commits with `dvc metrics`
- **Data lineage** — trace any model prediction back to the exact training data version
- **CI/CD integration** — validate data pipelines in pull requests with automated checks
- **Lightweight switching** — change dataset versions as fast as `git checkout`
## Quick Start

```bash
# 1. Initialize DVC in your Git repo
cd your-ml-project
dvc init

# 2. Configure remote storage
dvc remote add -d storage s3://your-bucket/dvc-store
dvc remote modify storage region us-east-1

# 3. Start tracking data
dvc add data/training_set.csv
git add data/training_set.csv.dvc data/.gitignore
git commit -m "Track training set v1"

# 4. Push data to remote
dvc push
```
"""Load a specific version of data programmatically."""
import subprocess
import pandas as pd
def load_versioned_data(git_rev: str, data_path: str) -> pd.DataFrame:
"""Checkout and load data from a specific Git revision."""
subprocess.run(["dvc", "checkout", data_path, "--rev", git_rev], check=True)
return pd.read_csv(data_path)
# Load training data from the v1.2 release
train_df = load_versioned_data("v1.2", "data/training_set.csv")
print(f"Loaded {len(train_df)} rows from v1.2")
## Architecture

```text
ml-data-versioning/
├── config.example.yaml          # DVC remote and pipeline configuration
├── templates/
│   ├── dvc_setup/
│   │   ├── .dvc/config          # DVC configuration template
│   │   ├── .dvcignore           # Files to exclude from DVC tracking
│   │   └── dvc.yaml             # Pipeline DAG definition
│   ├── pipelines/
│   │   ├── preprocess.py        # Data preprocessing stage
│   │   ├── train.py             # Model training stage
│   │   ├── evaluate.py          # Evaluation and metrics stage
│   │   └── params.yaml          # Pipeline parameters
│   └── ci/
│       ├── github_actions.yaml  # CI pipeline validation workflow
│       └── validate_data.py     # Data schema checks for PRs
├── docs/
│   └── overview.md
└── examples/
    ├── basic_tracking.sh        # Track files and push to remote
    └── pipeline_example.sh      # Run a full DVC pipeline
```
The DVC pipeline DAG defines stages (preprocess → train → evaluate) with explicit dependencies. Running `dvc repro` only re-executes stages whose inputs changed.
## Usage Examples

### Define a Reproducible Pipeline

```yaml
# dvc.yaml
stages:
  preprocess:
    cmd: python templates/pipelines/preprocess.py
    deps:
      - data/raw/
      - templates/pipelines/preprocess.py
    params:
      - preprocess.split_ratio
      - preprocess.random_seed
    outs:
      - data/processed/train.csv
      - data/processed/test.csv
  train:
    cmd: python templates/pipelines/train.py
    deps:
      - data/processed/train.csv
      - templates/pipelines/train.py
    params:
      - train.learning_rate
      - train.epochs
    outs:
      - models/model.pkl
    metrics:
      - metrics/train_metrics.json:
          cache: false
  evaluate:
    cmd: python templates/pipelines/evaluate.py
    deps:
      - models/model.pkl
      - data/processed/test.csv
    metrics:
      - metrics/eval_metrics.json:
          cache: false
    plots:
      - metrics/confusion_matrix.csv
```
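Each stage reads its own section of `params.yaml`, which is how DVC links parameter changes to stage re-execution. The kit's actual `preprocess.py` is not reproduced here; the following is a minimal sketch of what the preprocess stage could look like, assuming the `split_ratio` and `random_seed` keys declared above and a hypothetical raw file name:

```python
"""Minimal sketch of a preprocess stage (hypothetical; the kit's
templates/pipelines/preprocess.py may differ)."""
from pathlib import Path

import pandas as pd
import yaml


def preprocess(raw_path: str = "data/raw/dataset.csv",
               out_dir: str = "data/processed",
               params_path: str = "params.yaml") -> tuple[int, int]:
    """Split raw data into train/test sets using params.yaml settings."""
    # DVC passes no CLI arguments: the stage reads the keys it declared
    # under `params:` in dvc.yaml, so editing params.yaml marks it stale.
    with open(params_path) as f:
        params = yaml.safe_load(f)["preprocess"]

    df = pd.read_csv(raw_path)
    # Deterministic split: the seed makes `dvc repro` reproducible.
    test = df.sample(frac=params["split_ratio"],
                     random_state=params["random_seed"])
    train = df.drop(test.index)

    Path(out_dir).mkdir(parents=True, exist_ok=True)
    train.to_csv(f"{out_dir}/train.csv", index=False)
    test.to_csv(f"{out_dir}/test.csv", index=False)
    return len(train), len(test)


if __name__ == "__main__":
    n_train, n_test = preprocess()
    print(f"Wrote {n_train} train rows and {n_test} test rows")
```

Because `data/processed/train.csv` and `test.csv` are declared as `outs`, DVC caches and version-controls them automatically after the stage runs.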
### Compare Experiments Across Branches

```bash
# Show metrics diff between current branch and main
dvc metrics diff main

# Compare parameters across experiments
dvc params diff main

# Show full experiment table
dvc exp show --sort-by metrics/eval_metrics.json:accuracy
```
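The same comparison can be scripted. A sketch, assuming `metrics/eval_metrics.json` exists on both revisions (the `metrics_diff` helper is hypothetical; it mirrors what `dvc metrics diff` prints):

```python
"""Sketch: compare evaluation metrics between two Git revisions."""
import json


def metrics_diff(old: dict, new: dict) -> dict:
    """Per-metric deltas for metrics present in both revisions."""
    return {k: new[k] - old[k] for k in old.keys() & new.keys()}


def load_metrics(rev: str, path: str = "metrics/eval_metrics.json") -> dict:
    # dvc.api.read returns the file's contents at any rev, whether the
    # file is tracked by Git (cache: false metrics) or by DVC.
    import dvc.api
    return json.loads(dvc.api.read(path, rev=rev))


if __name__ == "__main__":
    diff = metrics_diff(load_metrics("main"), load_metrics("HEAD"))
    for name, delta in sorted(diff.items()):
        print(f"{name}: {delta:+.4f}")
```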
## Configuration

```yaml
# config.example.yaml
remote:
  name: "storage"
  url: "s3://your-bucket/dvc-store"  # s3:// | gs:// | azure:// | ssh://
cache:
  type: "hardlink"  # hardlink | symlink | copy

preprocess:
  split_ratio: 0.2
  random_seed: 42
train:
  learning_rate: 0.001
  epochs: 50
```
## Best Practices

- **Never `git add` large files directly** — use `dvc add` for anything over 10 MB; Git stores only the `.dvc` pointer file
- **Tag data releases** — use `git tag v1.0-data` after significant dataset updates so you can always retrieve that version
- **Use `dvc repro`, not manual script runs** — the pipeline DAG skips unchanged stages automatically, saving compute
- **Store params in `params.yaml`** — DVC tracks parameter changes and links them to metrics for experiment comparison
- **Set up CI data validation** — run the included `validate_data.py` in PRs to catch schema drift before it hits training
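The kit's `validate_data.py` is not reproduced here; the following is a minimal sketch of the kind of schema check a PR workflow could run, assuming column names and pandas dtypes are the contract (the `EXPECTED_SCHEMA` values are placeholders):

```python
"""Minimal sketch of a CI data-validation check (hypothetical; the kit's
templates/ci/validate_data.py may implement this differently)."""
import sys

import pandas as pd

# The "contract": expected columns and pandas dtypes for the training set.
EXPECTED_SCHEMA = {"feature_a": "float64", "feature_b": "int64", "label": "int64"}


def validate(csv_path: str, schema: dict = EXPECTED_SCHEMA) -> list[str]:
    """Return a list of schema violations (empty list means the data is valid)."""
    df = pd.read_csv(csv_path)
    errors = []
    for col, dtype in schema.items():
        if col not in df.columns:
            errors.append(f"missing column: {col}")
        elif str(df[col].dtype) != dtype:
            errors.append(f"{col}: expected {dtype}, got {df[col].dtype}")
    return errors


if __name__ == "__main__":
    problems = validate(sys.argv[1])
    for p in problems:
        print(f"SCHEMA ERROR: {p}")
    sys.exit(1 if problems else 0)  # non-zero exit fails the PR check
```

A non-zero exit code is all CI needs: the GitHub Actions workflow in `templates/ci/` can run this against the PR's data and block the merge on failure.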
## Troubleshooting

| Problem | Cause | Fix |
|---|---|---|
| `dvc push` hangs or fails | Misconfigured remote credentials | Verify with `aws s3 ls s3://your-bucket/` (or equivalent); check `dvc remote list` |
| `dvc repro` re-runs all stages | Lock file deleted or corrupted | Run `dvc repro` once fully; ensure `dvc.lock` is committed to Git |
| Cache filling up disk | Large datasets with many versions | Run `dvc gc --workspace` to clean unused cache entries |
| File conflicts after `git merge` | `.dvc` pointer files diverged | Run `dvc checkout` after merge to sync data with the correct pointers |
This is 1 of 10 resources in the ML Starter Kit toolkit. Get the complete ML Data Versioning resource with all files, templates, and documentation for $29.

Or grab the entire ML Starter Kit bundle (10 products) for $149 — save 30%.