DEV Community

Rizwan Saleem

A Practical Git Workflow for Data-Heavy Projects: Submodules, Shallow Clones, and Exact Reproducibility

Git isn’t just for code. When a project blends data, models, notebooks, and artifacts, you quickly hit friction: large files, non-code dependencies, and the need for reproducible environments. This guide walks through a concrete workflow for data-heavy projects, covering repository structure, submodules, shallow clones, reproducible environments, and practical branching strategies. The goal is a workflow that keeps teams aligned, reduces accidental data drift, and makes every step from data to model auditable.

Table of contents

  • Why this workflow
  • Repository layout
  • Branching model
  • Data management with git: submodules and large files
  • Handling dependencies and environments
  • Reproducibility: provenance and environments
  • Typical commands you’ll run
  • CI considerations and testing
  • Common pitfalls and how to avoid them
  • Example end-to-end scenario

Why this workflow

  • You need stable references to data and models across developers and CI.
  • Data files are large; you don’t want every clone to fetch every dataset.
  • You want reproducible experiments with explicit environment specs and data provenance.
  • You want a lightweight way to ship smaller artifacts (code, configs) while keeping data tied to specific commits.

Repository layout

  • root/
    • data/ (large datasets; typically git-ignored or managed via submodules)
    • data-submodules/ (git submodules pointing to external data repos)
    • models/ (trained models or model artifacts, tracked as lightweight pointers)
    • notebooks/ (Jupyter/Colab notebooks)
    • src/ (source code)
    • configs/ (experiment configs, hyperparameters)
    • environment/ (environment specs, e.g., conda.yaml, pip-tools.txt)
    • pipelines/ (orchestration scripts)
    • reports/ (results, plots, dashboards)
    • .gitignore
    • README.md

Branching model

  • main (or master): production-ready state; minimal, well-tested data pipelines and models
  • dev: integration work; merging features here after review
  • feature/*: individual experiments or data processing features
  • data-reviewed/*: points of data validation or dataset versioning milestones
  • hotfix/*: urgent fixes to production pipelines
  • For experiments and notebooks, use short-lived branches and tag important data/model milestones
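
The branch names above lend themselves to a simple automated check, for example in a pre-push hook or a CI job. The following is a minimal Python sketch; the function name and the exact regular expression are illustrative assumptions, not part of the workflow itself.

```python
import re

# Patterns follow the branching model listed above. The allowed characters
# after the slash are an assumption; adjust them to your team's convention.
ALLOWED_BRANCH = re.compile(
    r"(main|master|dev|(feature|data-reviewed|hotfix)/[A-Za-z0-9._-]+)"
)


def is_allowed_branch(name: str) -> bool:
    """Return True if the branch name fits the branching model."""
    return ALLOWED_BRANCH.fullmatch(name) is not None
```

A CI job could call is_allowed_branch() with the current branch name and fail early when it returns False.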

Data management with git: submodules and large files

  • Submodules for datasets:
    • Create a separate data repository (data-repo) that contains data assets.
    • In the main repo, add it as a submodule:
    • git submodule add https://example.com/org/data-repo.git data-submodules/dataset-A
    • Commit the submodule pointer:
    • git commit -m "Add dataset-A as submodule"
  • Benefits: you keep data separate, avoid bloating the main repo, and can pin exact data revisions.
  • Work with shallow clones for speed when you don’t need full history:
    • git clone --depth 1 --recurse-submodules <repo-url>
    • For existing repos: git fetch --depth 1 --recurse-submodules
  • If you must store large files in the repo:
    • Prefer Git LFS (Large File Storage) for datasets and models:
    • git lfs install
    • git lfs track "*.parquet"
    • git add .gitattributes
    • git add data/subset.parquet
    • git commit -m "Track large data with LFS"
    • Note: LFS requires a storage backend; coordinate with your CI/CD.
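
To see which data revision each submodule is pinned to, you can parse the output of git submodule status. The sketch below assumes the standard output format (an optional one-character flag, the commit SHA, the path, and an optional description in parentheses) and assumes paths contain no spaces. The names SubmoduleState and parse_submodule_status are hypothetical.

```python
from typing import NamedTuple


class SubmoduleState(NamedTuple):
    path: str
    sha: str
    flag: str  # " " in sync, "-" not initialized, "+" differs from pointer, "U" conflict


def parse_submodule_status(text: str) -> list[SubmoduleState]:
    """Parse the text printed by `git submodule status`."""
    states = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # The first character is a status flag; hex SHAs never start with
        # one of these characters, so a missing flag is treated as " ".
        if line[0] in " -+U":
            flag, rest = line[0], line[1:]
        else:
            flag, rest = " ", line
        parts = rest.split()
        states.append(SubmoduleState(path=parts[1], sha=parts[0], flag=flag))
    return states
```

Any entry whose flag is not a space indicates a submodule that is uninitialized, out of sync with the recorded pointer, or in conflict.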

Handling dependencies and environments

  • Pin exact environments per experiment:
    • environment/conda.yaml (or environment.yml)
    • environment/pip-tools.txt (or requirements.txt)
  • Use reproducible environments:
    • Conda (environment/conda.yaml):
        name: data-env
        dependencies:
          - python=3.11
          - numpy=1.25
          - pandas=2.0
          - scikit-learn=1.1
    • Pip-tools for Python package pinning:
    • pip-compile (part of pip-tools) generates a fully pinned requirements.txt from a requirements.in
  • Capture environment every run:
    • In pipelines/, create a script that prints or writes the environment spec and git commit hash of data refs:
    • python -V
    • conda env export > environment/conda.yaml
    • git submodule status > environment/submodule-status.txt
  • Use deterministic container builds:
    • If you use Docker, provide a Dockerfile that installs exact pins from environment files.
    • Store a hash of the environment, the data commit, and the code commit in the provenance file.
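
One way to derive the environment hash mentioned above is to hash the pinned environment files themselves. This is a minimal sketch, assuming the hash is a SHA-256 digest over file names and contents; the function name env_hash and the choice of SHA-256 are assumptions, not something the workflow prescribes.

```python
import hashlib
from pathlib import Path


def env_hash(paths: list[str]) -> str:
    """Return a SHA-256 hex digest over the given environment files.

    Files are processed in sorted order so the result does not depend on
    the order in which the paths are passed.
    """
    digest = hashlib.sha256()
    for path in sorted(paths):
        digest.update(Path(path).name.encode("utf-8"))
        digest.update(b"\0")  # separator between file name and contents
        digest.update(Path(path).read_bytes())
        digest.update(b"\0")  # separator between files
    return digest.hexdigest()
```

For example, env_hash(["environment/conda.yaml", "environment/pip-tools.txt"]) yields one value that changes whenever either file changes.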

Reproducibility: provenance and environments

  • Provenance file per run:
    • pipelines/run_provenance.md or JSON with:
    • code_commit:
    • data_commits: { "dataset-A": "", "dataset-B": "" }
    • env_hash:
    • notebook_versions:
    • model_versions:
  • Notebook execution trace:
    • Use nbconvert to run notebooks non-interactively and log outputs.
    • Save the executed notebook as notes/experiment-2026-06-xx.ipynb with a metadata field for the run_id and provenance.
  • Data versioning policy:
    • Do not modify datasets in place. Create a new submodule commit or dataset tag for changes.
    • Document dataset changes in a changelog and link to the run provenance.
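
The provenance fields listed above can be written as a small JSON file per run. The sketch below is one possible shape, not a fixed schema: the function name write_provenance, the file name provenance.json, and the run_id key are assumptions chosen for illustration.

```python
import json
from pathlib import Path


def write_provenance(run_dir, *, code_commit, data_commits, env_hash,
                     run_id, notebook_versions=None, model_versions=None):
    """Write a provenance record to <run_dir>/provenance.json and return its path."""
    record = {
        "run_id": run_id,
        "code_commit": code_commit,
        # Sort dataset names so the file is stable across runs.
        "data_commits": dict(sorted(data_commits.items())),
        "env_hash": env_hash,
        "notebook_versions": notebook_versions or {},
        "model_versions": model_versions or {},
    }
    out = Path(run_dir) / "provenance.json"
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(record, indent=2) + "\n", encoding="utf-8")
    return out
```

Keeping the record as plain JSON makes it easy to diff between runs and to attach as a CI artifact.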

Typical commands you’ll run

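  • Initialize a new data submodule and commit:
    • git submodule add https://example.com/org/data-repo.git data-submodules/dataset-A
    • git commit -m "Add dataset-A as submodule"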
  • Update a dataset submodule to a new commit:
    • cd data-submodules/dataset-A
    • git fetch --all
    • git checkout <new-commit>
    • cd ../..
    • git add data-submodules/dataset-A
    • git commit -m "Bump dataset-A to <new-commit>"
  • Work on a feature with isolation:
    • git checkout -b feature/data-cleanup
    • Make changes in src/, notebooks/, and configs/
    • git add -A
    • git commit -m "data-cleanup: fix NaN handling in preprocessing"
    • git push origin feature/data-cleanup
  • Create a reproducible run:
    • mkdir -p runs/2026-06-04
    • python pipelines/run.py --config configs/experiment-01.yaml --output runs/2026-06-04
    • Save provenance:
    • echo '{"code_commit": "'$(git rev-parse HEAD)'", "data_commits": {...}, "env_hash": ""}' > runs/2026-06-04/provenance.json
  • Reproduce locally:
    • git checkout main
    • git submodule update --init --recursive
    • conda env create -f environment/conda.yaml
    • conda activate data-env
    • python pipelines/run.py --config configs/experiment-01.yaml --output runs/2026-06-04
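
Before trusting a local reproduction, it helps to confirm that the current checkout matches what the original run recorded. The sketch below compares two provenance dictionaries with the code_commit, data_commits, and env_hash keys used in this guide; the function name provenance_mismatches is hypothetical.

```python
def provenance_mismatches(recorded: dict, current: dict) -> list[str]:
    """Return human-readable differences between two provenance records."""
    problems = []
    for key in ("code_commit", "env_hash"):
        if recorded.get(key) != current.get(key):
            problems.append(
                f"{key}: recorded {recorded.get(key)!r}, current {current.get(key)!r}"
            )
    rec_data = recorded.get("data_commits", {})
    cur_data = current.get("data_commits", {})
    # Check every dataset that appears in either record.
    for name in sorted(set(rec_data) | set(cur_data)):
        if rec_data.get(name) != cur_data.get(name):
            problems.append(
                f"data_commits[{name}]: recorded {rec_data.get(name)!r}, "
                f"current {cur_data.get(name)!r}"
            )
    return problems
```

An empty list means code, data, and environment all match the recorded run; anything else names the field that drifted.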

CI considerations and testing

  • CI should validate:
    • Submodule integrity: ensure submodule SHAs exist and are accessible
    • Environment reproducibility: compare env hash against pinned files
    • Data availability: verify dataset commits exist or submodule can fetch
    • Notebooks: ensure cells run without errors for critical paths
  • Suggested CI steps:
    • Checkout, init submodules
    • Create fresh environment from environment files
    • Run a small smoke test: a tiny dataset and a quick inference
    • Run end-to-end pipeline with a reduced dataset (subset)
    • Generate provenance record and store in artifacts
  • Test strategy:
    • Unit tests for data processing steps
    • Integration tests for pipeline stages (data extraction, transformation, loading)
    • Regression checks on key metrics with fixed seeds
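
The regression check on key metrics can be as simple as comparing a run's metrics against a stored baseline within a tolerance. This is a minimal sketch under that assumption; the function name metric_regressions and the default tolerance are illustrative, and the metric names in the usage note are made up.

```python
import math


def metric_regressions(current: dict, baseline: dict, rel_tol: float = 1e-6) -> list[str]:
    """Return messages for metrics that are missing or differ from the baseline."""
    failures = []
    for name, expected in sorted(baseline.items()):
        if name not in current:
            failures.append(f"{name}: missing from current run")
        elif not math.isclose(current[name], expected, rel_tol=rel_tol):
            failures.append(f"{name}: expected {expected}, got {current[name]}")
    return failures
```

With fixed seeds, a call such as metric_regressions({"auc": 0.91}, {"auc": 0.91}) should return an empty list, and CI can fail the build when it does not.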

Common pitfalls and how to avoid them

  • Pitfall: Submodules get out of sync
    • Avoid editing submodule content directly in the main repo. Always commit inside the submodule, then update the pointer in the superproject.
  • Pitfall: Large files bloating the repo
    • Prefer LFS or a separate data store; don’t commit raw large files in the main repo.
  • Pitfall: Non-reproducible environments
    • Pin exact versions and avoid relying on implicit latest packages; include a lockfile or pinned environment spec.
  • Pitfall: Not capturing provenance
    • Automate provenance capture in every run; store it with run outputs.
  • Pitfall: Notebook drift
    • Regularly convert notebooks to scripts where possible; store executed notebook as part of the run with outputs.
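
To guard against the non-reproducible environment pitfall, a small check can flag requirements that are not pinned with ==. The sketch below is deliberately simple and is an assumption about what "pinned" means here: it skips blank lines, comments, and pip option lines, and it does not handle URL or VCS requirements. The function name unpinned_requirements is hypothetical.

```python
import re

# A requirement counts as pinned if it looks like "name==version",
# optionally with extras such as "name[extra]==version".
PINNED = re.compile(r"^[A-Za-z0-9][A-Za-z0-9._-]*(\[[^\]]+\])?\s*==\s*[^\s;]+")


def unpinned_requirements(text: str) -> list[str]:
    """Return requirement lines that are not pinned to an exact version."""
    loose = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if not line or line.startswith("-"):
            continue  # blank lines, comments, and options such as -r or --hash
        if not PINNED.match(line):
            loose.append(line)
    return loose
```

Running this over a requirements file in CI and failing on a non-empty result keeps loosely specified packages from slipping in.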

Example end-to-end scenario

  • You’re building a data pipeline that trains a simple anomaly detector.
  • Steps:
    • Create a feature branch feature/anomaly-pipeline
    • Add a dataset as a submodule dataset-A containing the raw sensor data
    • Write preprocessing in src/preprocess.py, train in src/train.py, and evaluate in src/evaluate.py
    • Pin environment in environment/conda.yaml and environment/pip-tools.txt
    • Create a run notebook that documents data checks, feature engineering, and results
    • Run the pipeline on a subset of data to validate the flow
    • Record provenance: code commit, dataset commits, environment hash, and run_id
    • Merge to dev after passing tests; promote to main after validation in CI
    • Tag the final, reproducible run with a semantic version: v0.1.0

Illustrative quick-start snippet

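  • Add and pin a dataset submodule
    • git submodule add https://example.com/org/data-repo.git data-submodules/dataset-A
    • git commit -m "Add dataset-A as submodule"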
  • Prepare a reproducible environment
    • printf 'name: data-env\ndependencies:\n  - python=3.11\n' > environment/conda.yaml
    • printf 'pandas==2.0.1\nnumpy==1.25.0\n' > environment/pinned-requirements.txt
    • git add environment/conda.yaml environment/pinned-requirements.txt
    • git commit -m "Pin environment for reproducible runs"
  • Run a quick smoke test in CI
    • conda env create -f environment/conda.yaml
    • conda activate data-env
    • python pipelines/run.py --config configs/smoke.yaml --output runs/2026-06-04-smoke
    • store provenance in runs/2026-06-04-smoke/provenance.json


Rizwan Saleem | https://rizwansaleem.co