DEV Community

Rizwan Saleem

Building a Robust Git Workflow for Data-Driven Projects

In many teams, version control becomes the glue between data, code, and deployment. When a project revolves around data processing, experiments, and model artifacts, a Git workflow must accommodate large binary files, frequent data refreshes, and reproducible environments. This tutorial lays out a practical, end-to-end Git workflow tailored for data-driven projects, covering branch strategies, data-versioning conventions, CI considerations, and reproducibility hooks. It includes concrete commands you can adapt to your stack.

1) Start with a clear repository structure

A well-organized repo helps teams reason about data, code, and models.

  • src/ or notebooks/: your analysis scripts or notebooks
  • data/: small, version-controlled datasets or pointers to external sources
  • data-raw/: raw data (usually not committed; record its provenance instead)
  • data-processed/: outputs from your pipeline
  • models/: trained artifacts (consider Git LFS or external artifact storage)
  • pipelines/: data processing and model training pipelines (e.g., Snakemake, Airflow, or custom scripts)
  • configs/: experiment and pipeline configurations
  • tests/: validation tests for data quality and experiments
  • docs/: documentation including reproduction steps

Guiding principle: keep large artifacts out of Git when possible. Use pointers, hashes, or artifact stores.
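One way to enforce this principle is a small check that flags oversized files before they are committed. The sketch below is only an illustration: the function name find_large_files and the 5 MiB threshold are assumptions, not part of any Git tooling.

```python
from pathlib import Path

# Hypothetical threshold; tune it to what your team considers "large".
MAX_BYTES = 5 * 1024 * 1024  # 5 MiB


def find_large_files(root: Path, max_bytes: int = MAX_BYTES) -> list[Path]:
    """Return files under `root` larger than `max_bytes`, skipping .git/."""
    large = []
    for path in sorted(root.rglob("*")):
        # Ignore Git's own object store; it is not part of the working tree.
        if ".git" in path.relative_to(root).parts:
            continue
        if path.is_file() and path.stat().st_size > max_bytes:
            large.append(path)
    return large
```

Running find_large_files(Path(".")) from the repository root lists candidates to move to Git LFS, DVC, or an artifact store.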

2) Choose a branch model that aligns with experimentation

For data-heavy projects, you want to balance exploration with stability.

  • main (or master): production-ready baseline and validated experiments
  • develop: integration of ongoing experiments; unstable but closer to release
  • feature/experiment-xyz: isolated experiments
  • release/vX.Y.Z: staging for a release candidate
  • hotfix/issue-123: quick fixes in production
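A naming convention like this is easier to keep when it is checked automatically, for example in CI or a local hook. The sketch below assumes lowercase branch names and semantic-version release branches; the regular expressions are one possible reading of the list above, not a standard.

```python
import re

# Patterns mirror the branch list above; the exact regexes are an assumption.
BRANCH_PATTERNS = [
    r"main|master|develop",
    r"feature/[a-z0-9][a-z0-9._-]*",
    r"release/v\d+\.\d+\.\d+",
    r"hotfix/[a-z0-9][a-z0-9._-]*",
]


def is_valid_branch(name: str) -> bool:
    """Return True if `name` matches one of the agreed branch patterns."""
    return any(re.fullmatch(pattern, name) for pattern in BRANCH_PATTERNS)
```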

Tips:

  • Create a feature branch for each data experiment or pipeline change.
  • Merge into develop only after passing data validation and reproducibility checks.
  • Use a dedicated tag for reproducible runs (see reproducibility hooks).

3) Version data and artifacts properly

Git is not optimal for large data or binary artifacts. Use a hybrid approach:

  • Small data and deltas: store in Git with clear versioning (binary-safe practices).
  • Large files or models: use Git LFS or an external artifact store (e.g., DVC, MLflow, or an S3-compatible store).
  • Data provenance: track data sources with immutable references (hashes, URLs, and timestamps).
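An immutable reference can be as simple as a content hash stored next to the source URL and a timestamp. The following is a minimal sketch using only the standard library; the function names and record keys are illustrative assumptions.

```python
import hashlib
from datetime import datetime, timezone
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large datasets do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def provenance_record(path: Path, source_url: str) -> dict:
    """Build a record tying a local file to its source, hash, and timestamp."""
    return {
        "path": str(path),
        "source": source_url,
        "sha256": sha256_of_file(path),
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }
```

If the hash of a re-downloaded file differs from the recorded one, the upstream data has changed and the experiment is no longer comparable.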

Concrete setup options:

  • Git LFS: for large binary files
    • Enable: git lfs install
    • Track: git lfs track "*.pt" "*.csv" "*.bin"
    • Commit as usual; LFS stores large objects outside Git
  • DVC (data version control): manage data, models, and pipelines with reproducible stages
    • Initialize: dvc init
    • Add data: dvc add data/large-dataset.csv
    • Push data: dvc push (requires a configured remote storage location)
    • Reproduce: dvc repro

Choose one or combine: LFS for simple large files, DVC for end-to-end data lineage and pipeline reproducibility.

4) Reproducibility as a first-class concern

Reproducible runs require explicit environment, data versions, and pipeline steps.

  • Capture environment:
    • Poetry or pip-tools for Python dependencies
    • Conda environment.yml
    • Dockerfile or docker-compose.yml for consistent execution
  • Capture configurations:
    • YAML/JSON configs per experiment in configs/
    • Include a timestamp, data version, and seed
  • Add a reproducibility script:
    • A single script like scripts/reproduce.py that fetches data, installs dependencies, runs the pipeline, and reports results
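A minimal sketch of what scripts/reproduce.py could look like is shown here. It assumes a JSON config with data_version and seed keys (both key names are assumptions) and uses a placeholder run_pipeline in place of the real pipeline. Data fetching and dependency installation are left out because they depend on your tooling.

```python
import argparse
import json
import random
from pathlib import Path


def load_config(path: Path) -> dict:
    """Load a JSON experiment config and check the fields a rerun needs."""
    config = json.loads(path.read_text())
    missing = {"data_version", "seed"} - config.keys()
    if missing:
        raise ValueError(f"config is missing required keys: {sorted(missing)}")
    return config


def run_pipeline(config: dict) -> dict:
    """Placeholder for the real pipeline; returns a seed-determined value."""
    rng = random.Random(config["seed"])  # local RNG, no hidden global state
    return {"data_version": config["data_version"], "sample": rng.random()}


def main(argv=None) -> dict:
    """Parse --config, run the pipeline, and print the result as JSON."""
    parser = argparse.ArgumentParser(description="Reproduce one experiment run")
    parser.add_argument("--config", type=Path, required=True)
    args = parser.parse_args(argv)
    result = run_pipeline(load_config(args.config))
    print(json.dumps(result, indent=2))
    return result
```

When saved as a script, add an `if __name__ == "__main__": main()` guard at the bottom. Two runs with the same config should then print identical results.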

Example: Python-based reproduce script outline

  • fetch data version: data/versions/latest.json
  • load config: configs/exp-2026-06-01.yaml
  • run: python -m pipelines.run --config configs/exp-2026-06-01.yaml

5) CI/CD that respects data constraints

Set up CI to validate code, data quality, and lightweight reproducibility checks.

  • Steps in CI:
    • Checkout code
    • Install dependencies (without downloading huge data)
    • Run unit tests for data processing functions
    • Validate data schema and basic integrity checks
    • Optionally run the pipeline on a small data sample to confirm it works end to end
  • Artifacts and caches:
    • Cache Python wheels, conda envs
    • Use DVC or LFS in CI if you need to pull data for validation
  • Gatekeepers:
    • Require passing data-quality checks before merging into develop
    • Protect main with required status checks
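The schema and integrity step can be a plain function that CI runs against a small sample file. The sketch below uses only the standard library; the SCHEMA mapping and its column names are hypothetical and would be replaced by your dataset's actual columns.

```python
import csv
import io

# Hypothetical schema: column name -> callable that parses a cell or raises.
SCHEMA = {"id": int, "value": float, "label": str}


def validate_csv(text: str, schema: dict = SCHEMA) -> list[str]:
    """Return a list of readable problems; an empty list means the sample passed."""
    reader = csv.DictReader(io.StringIO(text))
    header = reader.fieldnames or []
    missing = [column for column in schema if column not in header]
    if missing:
        return [f"missing columns: {missing}"]
    problems = []
    for line_no, row in enumerate(reader, start=2):  # header is line 1
        for column, parse in schema.items():
            cell = row[column]
            if cell is None or cell == "":
                problems.append(f"line {line_no}: empty {column}")
                continue
            try:
                parse(cell)
            except ValueError:
                problems.append(f"line {line_no}: bad {column}={cell!r}")
    return problems
```

A test can then assert that validate_csv returns an empty list for the checked-in sample, which makes the gate a required status check like any other.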

Sample GitHub Actions skeleton (high level):

    name: CI
    on: [push, pull_request]
    jobs:
      build:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4
          - name: Set up Python
            uses: actions/setup-python@v4
            with: {python-version: '3.11'}
          - name: Install deps
            run: |
              python -m pip install -U pip
              pip install -r requirements.txt
          - name: Run tests
            run: |
              pytest -q
          - name: Lint
            run: |
              flake8

If you use DVC, add a step that runs dvc pull to fetch the data the tests need, and make sure a lightweight test dataset is available so CI does not download the full dataset.

6) Commit hygiene and review practices

Keep commits small, meaningful, and deterministic.

  • Atomic commits: one logical change per commit
  • Commit messages:
    • feat: add new experiment
    • fix: bug in data preprocessing
    • chore: update docs
    • ci: adjust workflow
  • PR checklist:
    • Data validation checks pass
    • Reproducibility script runs
    • Dependencies pinned and verified
    • Data provenance and storage are documented
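The commit message convention can be checked mechanically, for instance from a commit-msg hook. This is a sketch under stated assumptions: the allowed types come from the list above, while the optional "(scope)" part follows the Conventional Commits style and is an assumption about your convention.

```python
import re

# Allowed types come from the list above; the optional "(scope)" is an assumption.
COMMIT_RE = re.compile(r"^(feat|fix|chore|ci)(\([a-z0-9_-]+\))?: \S.*$")


def is_valid_commit_subject(subject: str) -> bool:
    """Return True if the first line of a commit message follows the convention."""
    return COMMIT_RE.match(subject) is not None
```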

Use pull requests to surface data-focused changes for review, not just code diffs.

7) Provenance and traceability

When collaborating on experiments, you want to trace decisions back to sources.

  • Record data sources: include a data_sources.json describing source, version, and checksum
  • Tie experiments to configs: name configs with date and git SHA
  • Tag reproducible results: tag a commit with a run-id, data-version, and a summary
    • git tag run-2026-06-01-v1
    • git push origin --tags
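Generating the run record from code helps keep these conventions consistent across experiments. The sketch below writes one JSON file per run into a runs/ directory. The field names follow the run metadata example in this section, while the function names are illustrative assumptions, and current_commit only works when called inside a Git repository.

```python
import json
import subprocess
from pathlib import Path


def current_commit() -> str:
    """Return the checked-out commit SHA (must be run inside a Git repository)."""
    completed = subprocess.run(
        ["git", "rev-parse", "HEAD"], check=True, capture_output=True, text=True
    )
    return completed.stdout.strip()


def write_run_record(runs_dir: Path, run_id: str, commit: str, data_version: str,
                     config: str, seed: int, metrics: dict, notes: str = "") -> Path:
    """Write a run metadata file and return its path."""
    record = {
        "run_id": run_id,
        "commit": commit,
        "data_version": data_version,
        "config": config,
        "seed": seed,
        "metrics": metrics,
        "notes": notes,
    }
    runs_dir.mkdir(parents=True, exist_ok=True)
    out = runs_dir / f"{run_id}.json"
    out.write_text(json.dumps(record, indent=2, sort_keys=True) + "\n")
    return out
```

Passing commit=current_commit() at call time ties the record to the exact code that produced the metrics.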

An example of a concise run metadata file (runs/2026-06-01-expA.json):

  • run_id: expA-2026-06-01
  • commit: abc123...
  • data_version: data-v1.2.3
  • config: configs/exp-2026-06-01.yaml
  • seed: 42
  • metrics: {accuracy: 0.923, f1: 0.911}
  • notes: baseline with feature engineering

8) Example workflow: end-to-end from idea to reproducible result

1) Create a new experiment branch

  • git checkout -b feature/experiment-ensemble

2) Add data and/or models via appropriate tooling

  • If using DVC: dvc add data/ensemble.csv
  • If using Git LFS: git lfs track "*.pt" && git add .gitattributes

3) Implement pipeline changes

  • Update scripts/pipeline.py to incorporate ensemble method
  • Update configs/exp-ensemble.yaml with parameters and random_seed

4) Run local checks

  • python -m pytest tests/
  • python scripts/reproduce.py --config configs/exp-ensemble.yaml

5) Commit with a focused message

  • git add .
  • git commit -m "feat(pipeline): implement ensemble method and config for exp-ensemble"

6) Open a PR and request reviews

  • Ensure reviewers verify data validity and reproducibility steps

7) Merge into develop after checks pass, then create a release tag after validation

8) Push to a production-like environment and record the run

  • Use a reproducibility script to reproduce the final result and store metrics

9) Practical tips and common pitfalls

  • Pitfall: Large data sneaks into Git history
    • Solution: Use DVC or Git LFS; prune history if needed (git filter-repo)
  • Pitfall: Environment drift breaks reproducibility
    • Solution: Lock dependencies; store environment specs; use containerization
  • Pitfall: Data versioning lags behind experiments
    • Solution: Bake data version into run metadata; automate data version bumps on experiments
  • Tip: Automate audit trails
    • Keep a CHANGELOG and per-experiment run notes stored in runs/ with a brief narrative

10) Quick-start checklist

  • Set up repository structure and a clear branching model
  • Decide on a data-versioning approach (DVC, Git LFS, or both)
  • Add environment and configuration management (requirements.txt, environment.yml, Dockerfile)
  • Implement a lightweight reproducibility script
  • Configure CI to run data-quality checks
  • Establish provenance norms (data sources, config naming, run tags)

Rizwan Saleem | https://rizwansaleem.co
