A Practical Git Workflow for Data-Heavy Projects: Submodules, Shallow Clones, and Exact Reproducibility
Git isn’t just for code. When a project mixes data, models, notebooks, and artifacts, you quickly hit friction: large files, non-code dependencies, and the need for reproducible environments. This guide walks through a concrete workflow for data-heavy projects, covering repository structure, submodules, shallow clones, reproducible environments, and branching strategy. The goal is a workflow that keeps teams aligned, reduces accidental data drift, and makes every step from data to model auditable.
Table of contents
- Why this workflow
- Repository layout
- Branching model
- Data management with git: submodules and large files
- Handling dependencies and environments
- Reproducibility: provenance and environments
- Typical commands you’ll run
- CI considerations and testing
- Common pitfalls and how to avoid them
- Example end-to-end scenario
- Illustrative quick-start snippet
Why this workflow
- You need stable references to data and models across developers and CI.
- Data files are large; you don’t want every clone to fetch every dataset.
- You want reproducible experiments with explicit environment specs and data provenance.
- You want a lightweight way to ship smaller artifacts (code, configs) while keeping data tied to specific commits.
Repository layout
- root/
- data/ (large datasets; typically git-ignored or managed via submodules)
- data-submodules/ (git submodules pointing to external data repos)
- models/ (trained models or model artifacts, tracked as lightweight pointers)
- notebooks/ (Jupyter/Colab notebooks)
- src/ (source code)
- configs/ (experiment configs, hyperparameters)
- environment/ (environment specs, e.g., conda.yaml, pip-tools.txt)
- pipelines/ (orchestration scripts)
- reports/ (results, plots, dashboards)
- runs/ (per-run outputs and provenance records, one directory per run)
- .gitignore
- README.md
Branching model
- main (or master): production-ready state; minimal, well-tested data pipelines and models
- dev: integration work; merge features here after review
- feature/*: individual experiments or data processing features
- data-reviewed/*: checkpoints where a dataset version has been validated
- hotfix/*: urgent fixes to production pipelines
- For experiments and notebooks, use short-lived branches and tag important data/model milestones
Data management with git: submodules and large files
- Submodules for datasets:
- Create a separate data repository (data-repo) that contains data assets.
- In the main repo, add it as a submodule:
- git submodule add https://example.com/org/data-repo.git data-submodules/dataset-A
- Commit the submodule pointer:
- git commit -m "Add dataset-A as submodule"
- Benefits: you keep data separate, avoid bloating the main repo, and can pin exact data revisions.
- Work with shallow clones for speed when you don’t need full history:
- git clone --depth 1 --recurse-submodules <repo-url>
- For existing repos: git fetch --depth 1 --recurse-submodules
- If you must store large files in the repo:
- Prefer Git LFS (Large File Storage) for datasets and models:
- git lfs install
- git lfs track "*.parquet"
- git add .gitattributes
- git add data/subset.parquet
- git commit -m "Track large data with LFS"
- Note: Git LFS needs a remote that supports it, and CI runners need git-lfs installed; without it, a checkout contains small pointer files instead of the real data.
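One way to catch the "pointer file instead of data" problem early is a small check that runs before the pipeline starts. The sketch below assumes the Git LFS v1 pointer format, where every pointer file begins with a fixed `version` line; the function names and the `*.parquet` default are illustrative, not part of any library.

```python
from pathlib import Path

# First line of a Git LFS pointer file (pointer spec v1).
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path: Path) -> bool:
    """Return True if `path` starts with the LFS pointer header."""
    # Only the first few bytes are read, so this is cheap even on huge files.
    with open(path, "rb") as fh:
        head = fh.read(len(LFS_POINTER_PREFIX))
    return head == LFS_POINTER_PREFIX


def find_unresolved_pointers(root: Path, pattern: str = "*.parquet") -> list[Path]:
    """List files under `root` matching `pattern` that are still LFS pointers."""
    return sorted(
        p for p in root.rglob(pattern) if p.is_file() and is_lfs_pointer(p)
    )
```

If `find_unresolved_pointers(Path("data"))` returns anything, the checkout most likely skipped the LFS download step, and running `git lfs pull` is the usual fix.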
Handling dependencies and environments
- Pin exact environments per experiment:
- environment/conda.yaml (or environment.yml)
- environment/pip-tools.txt (or requirements.txt)
- Use reproducible environments:
- Conda:
- name: data-env
- dependencies:
- python=3.11
- numpy=1.25
- pandas=2.0
- scikit-learn=1.1
- Pip-tools for Python package pinning:
- pip-tools generates a fully pinned requirements.txt from a requirements.in (pip-compile requirements.in)
- Capture environment every run:
- In pipelines/, create a script that records the interpreter version, the resolved environment, and the commit each data submodule is pinned to:
- python -V
- conda env export > environment/conda.yaml
- git submodule status > environment/submodule-status.txt
- Use deterministic container builds:
- If you use Docker, provide a Dockerfile that installs exact pins from environment files.
- Store a hash of the environment, the data commit, and the code commit in a provenance file.
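The environment hash mentioned above can be computed in several ways; a minimal sketch, assuming the hash should cover the pinned spec files themselves (rather than the installed packages), might look like this. The function name `hash_environment` is hypothetical.

```python
import hashlib
from pathlib import Path


def hash_environment(files: list[Path]) -> str:
    """Return a SHA-256 hex digest over the given environment spec files."""
    digest = hashlib.sha256()
    # Sort the paths so the digest does not depend on argument order.
    for path in sorted(files):
        # Include the file name so renaming a spec also changes the hash.
        digest.update(path.name.encode("utf-8"))
        digest.update(b"\0")
        digest.update(path.read_bytes())
        digest.update(b"\0")
    return digest.hexdigest()
```

For example, `hash_environment([Path("environment/conda.yaml"), Path("environment/pip-tools.txt")])` yields a single string that changes whenever either file changes. Hashing the spec files is a design choice: it detects edits to the pins, but not drift in an environment that was built from loose pins.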
Reproducibility: provenance and environments
- Provenance file per run:
- pipelines/run_provenance.md or JSON with:
- code_commit:
- data_commits: { "dataset-A": "", "dataset-B": "" }
- env_hash:
- notebook_versions:
- model_versions:
- Notebook execution trace:
- Use nbconvert to run notebooks non-interactively and log outputs.
- Save the executed notebook as notebooks/experiment-2026-06-xx.ipynb with a metadata field for the run_id and provenance.
- Data versioning policy:
- Do not modify datasets in place. Create a new submodule commit or dataset tag for changes.
- Document dataset changes in a changelog and link to the run provenance.
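A provenance record along these lines can be assembled with a few lines of Python. The sketch below assumes the JSON variant and covers only `run_id`, `code_commit`, `data_commits`, and `env_hash`; both function names are hypothetical, and the parser assumes submodule paths without spaces.

```python
import json
from pathlib import Path


def parse_submodule_status(text: str) -> dict[str, str]:
    """Map submodule path -> commit SHA from `git submodule status` output."""
    commits = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        # Each line starts with a state flag (' ', '-', '+' or 'U') followed
        # by the SHA and the path. SHAs are lowercase hex, so stripping these
        # characters from the left never removes part of the SHA.
        fields = line.lstrip(" -+U").split()
        sha, path = fields[0], fields[1]
        commits[path] = sha
    return commits


def write_provenance(run_dir: Path, run_id: str, code_commit: str,
                     data_commits: dict[str, str], env_hash: str) -> Path:
    """Write provenance.json into `run_dir` and return its path."""
    record = {
        "run_id": run_id,
        "code_commit": code_commit,
        "data_commits": data_commits,
        "env_hash": env_hash,
    }
    run_dir.mkdir(parents=True, exist_ok=True)
    out = run_dir / "provenance.json"
    # Sorted keys keep the file stable across runs, which makes diffs readable.
    out.write_text(json.dumps(record, indent=2, sort_keys=True) + "\n")
    return out
```

In a real run, the status text would come from `git submodule status` and the code commit from `git rev-parse HEAD`, for example via `subprocess.run(..., capture_output=True, text=True)`.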
Typical commands you’ll run
- Initialize a new data submodule and commit:
- git submodule add https://example.com/org/data-repo.git data-submodules/dataset-A
- git submodule update --init --recursive
- git commit -m "Add dataset-A as submodule"
- Update a dataset submodule to a new commit:
- cd data-submodules/dataset-A
- git fetch --all
- git checkout <commit-or-tag>
- cd ../..
- git add data-submodules/dataset-A
- git commit -m "Bump dataset-A to <commit-or-tag>"
- Work on a feature with isolation:
- git checkout -b feature/data-cleanup
- (make changes in src/, notebooks/, and configs/)
- git add -A
- git commit -m "data-cleanup: fix NaN handling in preprocessing"
- git push origin feature/data-cleanup
- Create a reproducible run:
- mkdir -p runs/2026-06-04
- python pipelines/run.py --config configs/experiment-01.yaml --output runs/2026-06-04
- Save provenance:
- echo '{"code_commit": "'$(git rev-parse HEAD)'", "data_commits": {...}, "env_hash": ""}' > runs/2026-06-04/provenance.json
- Reproduce locally:
- git checkout main (or the code_commit recorded in provenance.json to reproduce a specific run)
- git submodule update --init --recursive
- conda env create -f environment/conda.yaml
- conda activate data-env
- python pipelines/run.py --config configs/experiment-01.yaml --output runs/2026-06-04
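The commands above call `pipelines/run.py` with a config file and an output directory. The article does not show that script, so the following is only a hypothetical skeleton of such an entry point, using `argparse` from the standard library; copying the config into the run directory is an added assumption that keeps each run self-describing.

```python
import argparse
from pathlib import Path


def parse_args(argv=None) -> argparse.Namespace:
    """Parse the two options used throughout this guide."""
    parser = argparse.ArgumentParser(description="Run one experiment.")
    parser.add_argument("--config", type=Path, required=True,
                        help="experiment config file")
    parser.add_argument("--output", type=Path, required=True,
                        help="directory for run outputs")
    return parser.parse_args(argv)


def main(argv=None) -> int:
    args = parse_args(argv)
    if not args.config.is_file():
        raise SystemExit(f"config not found: {args.config}")
    args.output.mkdir(parents=True, exist_ok=True)
    # Keep a byte-for-byte copy of the config next to the outputs.
    (args.output / "config-used.yaml").write_bytes(args.config.read_bytes())
    # Placeholder: the real pipeline stages (preprocess, train, evaluate)
    # would be called here.
    return 0
```

In an actual script the file would end by calling `main()` under the usual `if __name__ == "__main__":` guard.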
CI considerations and testing
- CI should validate:
- Submodule integrity: ensure submodule SHAs exist and are accessible
- Environment reproducibility: recompute the env hash from the pinned files and compare it with the recorded value
- Data availability: verify dataset commits exist or submodule can fetch
- Notebooks: ensure cells run without errors for critical paths
- Suggested CI steps:
- Checkout, init submodules
- Create fresh environment from environment files
- Run a small smoke test: a tiny dataset and a quick inference
- Run end-to-end pipeline with a reduced dataset (subset)
- Generate provenance record and store in artifacts
- Test strategy:
- Unit tests for data processing steps
- Integration tests for pipeline stages (data extraction, transformation, loading)
- Regression checks on key metrics with fixed seeds
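The regression check on key metrics can be as simple as comparing a run's metrics against a stored baseline within a tolerance. The sketch below is one possible shape, not a prescribed implementation: `check_regression` and `toy_metric` are hypothetical names, and `toy_metric` merely stands in for a seeded training run.

```python
import math
import random


def toy_metric(seed: int) -> float:
    """Stand-in for a training run: the same seed gives the same metric."""
    rng = random.Random(seed)
    return sum(rng.random() for _ in range(100)) / 100


def check_regression(baseline: dict[str, float], current: dict[str, float],
                     rel_tol: float = 1e-6, abs_tol: float = 1e-9) -> list[str]:
    """Return a list of failures; an empty list means the check passed."""
    failures = []
    for name, expected in sorted(baseline.items()):
        if name not in current:
            failures.append(f"{name}: missing from current run")
        elif not math.isclose(current[name], expected,
                              rel_tol=rel_tol, abs_tol=abs_tol):
            failures.append(f"{name}: expected {expected}, got {current[name]}")
    return failures
```

A CI job could load the baseline from a committed JSON file, run the reduced-dataset pipeline with a fixed seed, and fail the build if the returned list is non-empty.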
Common pitfalls and how to avoid them
- Pitfall: Submodules get out of sync
- Avoid editing submodule content directly in the main repo. Commit and push inside the submodule first, then update and commit the pointer in the superproject; otherwise collaborators and CI get a pointer to a commit they cannot fetch.
- Pitfall: Large files bloating the repo
- Prefer LFS or a separate data store; don’t commit raw large files in the main repo.
- Pitfall: Non-reproducible environments
- Pin exact versions and avoid relying on implicit latest packages; include a lockfile or pinned environment spec.
- Pitfall: Not capturing provenance
- Automate provenance capture in every run; store it with run outputs.
- Pitfall: Notebook drift
- Regularly convert notebooks to scripts where possible; store the executed notebook, with its outputs, as part of the run.
Example end-to-end scenario
- You’re building a data pipeline that trains a simple anomaly detector.
- Steps:
- Create a feature branch feature/anomaly-pipeline
- Add a dataset as a submodule dataset-A containing the raw sensor data
- Write preprocessing in src/preprocess.py, train in src/train.py, and evaluate in src/evaluate.py
- Pin environment in environment/conda.yaml and environment/pip-tools.txt
- Create a run notebook that documents data checks, feature engineering, and results
- Run the pipeline on a subset of data to validate the flow
- Record provenance: code commit, dataset commits, environment hash, and run_id
- Merge to dev after passing tests; promote to main after validation in CI
- Tag the final, reproducible run with a semantic version: v0.1.0
Illustrative quick-start snippet
- Add and pin a dataset submodule
- git submodule add https://example.com/org/data-repo-dataset-A.git data-submodules/dataset-A
- git submodule update --init --recursive
- git commit -m "Add dataset-A as submodule for reproducible experiments"
- Prepare a reproducible environment
- printf 'name: data-env\ndependencies:\n  - python=3.11\n' > environment/conda.yaml
- printf 'pandas==2.0.1\nnumpy==1.25.0\n' > environment/pinned-requirements.txt
- git add environment/conda.yaml environment/pinned-requirements.txt
- git commit -m "Pin environment for reproducible runs"
- Run a quick smoke test in CI
- conda env create -f environment/conda.yaml
- conda activate data-env
- python pipelines/run.py --config configs/smoke.yaml --output runs/2026-06-04-smoke
- store provenance in runs/2026-06-04-smoke/provenance.json
Rizwan Saleem | https://rizwansaleem.co