#frontend #webdev

A Practical Git Workflow for Data-Heavy Projects: Submodules, Shallow Clones, and Exact Reproducibility

Git isn’t just for code. When your project blends data, models, notebooks, and artifacts, you quickly hit friction: large files, non-code dependencies, and the need for reproducible environments. This guide walks through a concrete workflow you can adopt for data-heavy projects. It covers repository structure, submodules, shallow clones, reproducible environments, and practical branching strategies. You’ll come away with a workflow that keeps teams aligned, reduces accidental data drift, and makes it easy to audit every step from data to model.

Table of contents

Why this workflow
Repository layout
Branching model
Data management with git: submodules and large files
Handling dependencies and environments
Reproducibility: provenance and environments
Typical commands you’ll run
CI considerations and testing
Common pitfalls and how to avoid them
Example end-to-end scenario

Why this workflow

You need stable references to data and models across developers and CI.
Data files are large; you don’t want every clone to fetch every dataset.
You want reproducible experiments with explicit environment specs and data provenance.
You want a lightweight way to ship smaller artifacts (code, configs) while keeping data tied to specific commits.

Repository layout

root/
- data/ (large datasets; typically git-ignored or managed via submodules)
- data-submodules/ (git submodules pointing to external data repos)
- models/ (trained models or model artifacts, tracked as lightweight pointers)
- notebooks/ (Jupyter/Colab notebooks)
- src/ (source code)
- configs/ (experiment configs, hyperparameters)
- environment/ (environment specs, e.g., conda.yaml, pip-tools.txt)
- pipelines/ (orchestration scripts)
- reports/ (results, plots, dashboards)
- .gitignore
- README.md

Branching model

main (or master): production-ready state; minimal, well-tested data pipelines and models
dev: integration work; merging features here after review
feature/*: individual experiments or data processing features
data-reviewed/*: points of data validation or dataset versioning milestones
hotfix/*: urgent fixes to production pipelines
For experiments and notebooks, use short-lived branches and tag important data/model milestones

Data management with git: submodules and large files

Submodules for datasets:
- Create a separate data repository (data-repo) that contains data assets.
- In the main repo, add it as a submodule:
- git submodule add https://example.com/org/data-repo.git data-submodules/dataset-A
- Commit the submodule pointer:
- git commit -m "Add dataset-A as submodule"
Benefits: you keep data separate, avoid bloating the main repo, and can pin exact data revisions.
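Work with shallow clones for speed when you don’t need full history:
- git clone --depth 1 --recurse-submodules <repo-url>
- Add --shallow-submodules to the clone if the submodules should be shallow too; --depth on its own only truncates the superproject history.
- For existing repos: git fetch --depth 1 --recurse-submodules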
If you must store large files in the repo:
- Prefer Git LFS (Large File Storage) for datasets and models:
- git lfs install
- git lfs track "*.parquet"
- git add .gitattributes
- git add data/subset.parquet
- git commit -m "Track large data with LFS"
- Note: LFS requires a storage backend; coordinate with your CI/CD.
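
To make "pin exact data revisions" checkable, the text output of git submodule status can be turned into a small lookup table. The following is a minimal sketch, not part of the original workflow: the function names and state labels are made up for illustration, and it assumes the documented output shape (a one-character flag, the SHA, the path, and an optional describe suffix in parentheses).

```python
import re

# One line of `git submodule status`: state flag, commit SHA, submodule path,
# and an optional "(git describe output)" suffix.
_STATUS_LINE = re.compile(
    r"^(?P<flag>[ +\-U])(?P<sha>[0-9a-f]{40,64}) (?P<path>.+?)(?: \((?P<desc>[^()]*)\))?$"
)

# Flag meanings as documented by git; the labels on the right are our own.
_STATES = {
    " ": "in-sync",             # checked-out commit matches the superproject pin
    "-": "uninitialized",       # submodule has not been initialised yet
    "+": "differs-from-pin",    # checked-out commit differs from the recorded pin
    "U": "conflict",            # submodule has merge conflicts
}


def parse_submodule_status(text):
    """Map each submodule path to the SHA git reported and its state label."""
    pins = {}
    for line in text.splitlines():
        match = _STATUS_LINE.match(line)
        if match is None:
            continue  # skip blank or unrecognised lines
        pins[match["path"]] = {
            "sha": match["sha"],
            "state": _STATES[match["flag"]],
        }
    return pins


def out_of_sync(pins):
    """Return the sorted paths of submodules that are not cleanly at their pin."""
    return sorted(path for path, info in pins.items() if info["state"] != "in-sync")
```

Feeding it the contents of environment/submodule-status.txt (or the live command output) gives a dictionary you can store with a run, and out_of_sync offers a quick way to refuse a run when a dataset is not at its pinned commit.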

Handling dependencies and environments

Pin exact environments per experiment:
- environment/conda.yaml (or environment.yml)
- environment/pip-tools.txt (or requirements.txt)
Use reproducible environments:
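- Conda (environment/conda.yaml):
    name: data-env
    dependencies:
      - python=3.11
      - numpy=1.25
      - pandas=2.0
      - scikit-learn=1.1
- A single = in a conda spec matches any patch release (numpy=1.25 accepts 1.25.x); write the full version, or commit the exported spec, when you need an exact pin.
- Pip-tools for Python package pinning:
- pip-compile (part of pip-tools) generates a fully pinned requirements.txt from a requirements.in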
Capture environment every run:
- In pipelines/, create a script that prints or writes the environment spec and git commit hash of data refs:
- python -V
- conda env export > environment/conda.yaml
- git submodule status > environment/submodule-status.txt
Use deterministic container builds:
- If you use Docker, provide a Dockerfile that installs exact pins from environment files.
- Store a hash of the environment, the data commit, and the code commit in a provenance file.
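
One way to obtain the environment hash mentioned above is to digest the pinned spec files themselves. This is a sketch under assumptions: the function name env_hash and the choice of SHA-256 are illustrative, and which files you pass in (for example environment/conda.yaml and a pinned requirements file) is up to your project.

```python
import hashlib
from pathlib import Path


def env_hash(paths):
    """Return one SHA-256 hex digest covering all given environment files."""
    digest = hashlib.sha256()
    # Sort so the result does not depend on the order the files are listed in.
    for path in sorted(Path(p) for p in paths):
        # Mix in the file name so that swapping the contents of two files
        # produces a different hash.
        digest.update(path.name.encode("utf-8") + b"\0")
        digest.update(path.read_bytes() + b"\0")
    return digest.hexdigest()
```

Because the digest depends only on file names and bytes, two checkouts with identical spec files produce the same value, which is what makes it usable as a comparison key in CI.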

Reproducibility: provenance and environments

Provenance file per run:
- pipelines/run_provenance.md or JSON with:
- code_commit:
- data_commits: { "dataset-A": "", "dataset-B": "" }
- env_hash:
- notebook_versions:
- model_versions:
Notebook execution trace:
- Use nbconvert to run notebooks non-interactively and log outputs (for example, jupyter nbconvert --to notebook --execute <notebook>.ipynb).
- Save the executed notebook as notes/experiment-2026-06-xx.ipynb with a metadata field for the run_id and provenance.
Data versioning policy:
- Do not modify datasets in place. Create a new submodule commit or dataset tag for changes.
- Document dataset changes in a changelog and link to the run provenance.
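
The provenance fields listed above could be assembled and written with a few lines of Python. Treat this as a sketch rather than the definitive format: the helper names, the run_id field, and the provenance.json file name are assumptions, and git_head simply shells out to git rev-parse HEAD, so it needs git on the PATH.

```python
import json
import subprocess
from pathlib import Path


def git_head(repo="."):
    """Return the commit SHA checked out in `repo` (requires git on PATH)."""
    result = subprocess.run(
        ["git", "-C", str(repo), "rev-parse", "HEAD"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def build_provenance(run_id, code_commit, data_commits, env_hash,
                     notebook_versions=None, model_versions=None):
    """Collect the provenance fields for one run into a plain dictionary."""
    return {
        "run_id": run_id,
        "code_commit": code_commit,
        "data_commits": dict(data_commits),
        "env_hash": env_hash,
        "notebook_versions": dict(notebook_versions or {}),
        "model_versions": dict(model_versions or {}),
    }


def write_provenance(record, run_dir):
    """Write the record as provenance.json inside run_dir and return its path."""
    run_dir = Path(run_dir)
    run_dir.mkdir(parents=True, exist_ok=True)
    target = run_dir / "provenance.json"
    # Sorted keys and a fixed indent keep diffs between runs readable.
    target.write_text(json.dumps(record, indent=2, sort_keys=True) + "\n",
                      encoding="utf-8")
    return target
```

In a real run you would pass git_head() for the code commit and git_head("data-submodules/dataset-A") for each dataset, so the record always reflects what was actually checked out.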

Typical commands you’ll run

Initialize a new data submodule and commit:
- git submodule add https://example.com/org/data-repo.git data-submodules/dataset-A
- git submodule update --init --recursive
- git commit -m "Add dataset-A as submodule"
Update a dataset submodule to a new commit:
- cd data-submodules/dataset-A
- git fetch --all
- git checkout <commit-sha>
- cd ../..
- git add data-submodules/dataset-A
- git commit -m "Bump dataset-A to <commit-sha>"
Work on a feature with isolation:
- git checkout -b feature/data-cleanup
- make changes in src/ notebooks/ configs/
- git add -A
- git commit -m "data-cleanup: fix NaN handling in preprocessing"
- git push origin feature/data-cleanup
Create a reproducible run:
- mkdir runs/2026-06-04
- python pipelines/run.py --config configs/experiment-01.yaml --output runs/2026-06-04
- Save provenance:
- echo '{"code_commit": "'$(git rev-parse HEAD)'", "data_commits": {...}, "env_hash": ""}' > runs/2026-06-04/provenance.json
Reproduce locally:
- git checkout main (or the code_commit recorded in the run’s provenance.json, to reproduce an older run exactly)
- git submodule update --init --recursive
- conda env create -f environment/conda.yaml
- conda activate data-env
- python pipelines/run.py --config configs/experiment-01.yaml --output runs/2026-06-04
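
The run commands above assume an entry point that takes a config file and an output directory. A minimal sketch of such a pipelines/run.py could look like the following; the --config and --output flag names and everything inside main are assumptions for illustration, since the original does not show the script itself.

```python
import argparse
from pathlib import Path


def parse_args(argv=None):
    """Parse the two arguments the run commands pass to the script."""
    parser = argparse.ArgumentParser(description="Run one experiment.")
    parser.add_argument("--config", type=Path, required=True,
                        help="experiment config file, e.g. configs/experiment-01.yaml")
    parser.add_argument("--output", type=Path, required=True,
                        help="directory that receives the run outputs")
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)
    if not args.config.is_file():
        raise FileNotFoundError(f"config not found: {args.config}")
    # Create the run directory up front so later stages can write into it.
    args.output.mkdir(parents=True, exist_ok=True)
    # The real pipeline stages (preprocess, train, evaluate) would run here.
    print(f"config={args.config} output={args.output}")
    return 0
```

Calling main() from the script's entry point keeps the parsing logic importable, which makes it easy to exercise the same code path from tests and from CI.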

CI considerations and testing

CI should validate:
- Submodule integrity: ensure submodule SHAs exist and are accessible
- Environment reproducibility: compare env hash against pinned files
- Data availability: verify dataset commits exist or submodule can fetch
- Notebooks: ensure cells run without errors for critical paths
Suggested CI steps:
- Checkout, init submodules
- Create fresh environment from environment files
- Run a small smoke test: a tiny dataset and a quick inference
- Run end-to-end pipeline with a reduced dataset (subset)
- Generate provenance record and store in artifacts
Test strategy:
- Unit tests for data processing steps
- Integration tests for pipeline stages (data extraction, transformation, loading)
- Regression checks on key metrics with fixed seeds
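
A regression check on key metrics with fixed seeds can be as small as the sketch below. The metric here is a toy stand-in (the mean of seeded pseudo-random draws), and the function names and default tolerance are assumptions; in practice the current values would come from your pipeline and the baseline from a committed file.

```python
import math
import random


def toy_metric(seed, n=1000):
    """Stand-in for a pipeline metric: mean of n seeded pseudo-random draws."""
    rng = random.Random(seed)  # local generator, so global random state is untouched
    return sum(rng.random() for _ in range(n)) / n


def check_regression(current, baseline, rel_tol=1e-6):
    """Return the names of baseline metrics that are missing or have drifted."""
    failures = []
    for name, expected in baseline.items():
        actual = current.get(name)
        if actual is None or not math.isclose(actual, expected, rel_tol=rel_tol):
            failures.append(name)
    return failures
```

With a fixed seed the toy metric is identical on every call, so any non-empty result from check_regression points at a change in code, data, or environment rather than at random noise.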

Common pitfalls and how to avoid them

Pitfall: Submodules get out of sync
- Avoid editing submodule content directly in the main repo. Always commit inside the submodule, then update the pointer in the superproject.
Pitfall: Large files bloating the repo
- Prefer LFS or a separate data store; don’t commit raw large files in the main repo.
Pitfall: Non-reproducible environments
- Pin exact versions and avoid relying on implicit latest packages; include a lockfile or pinned environment spec.
Pitfall: Not capturing provenance
- Automate provenance capture in every run; store it with run outputs.
Pitfall: Notebook drift
- Regularly convert notebooks to scripts where possible; store executed notebook as part of the run with outputs.

Example end-to-end scenario

You’re building a data pipeline that trains a simple anomaly detector.
Steps:
- Create a feature branch feature/anomaly-pipeline
- Add a dataset as a submodule dataset-A containing the raw sensor data
- Write preprocessing in src/preprocess.py, train in src/train.py, and evaluate in src/evaluate.py
- Pin environment in environment/conda.yaml and environment/pip-tools.txt
- Create a run notebook that documents data checks, feature engineering, and results
- Run the pipeline on a subset of data to validate the flow
- Record provenance: code commit, dataset commits, environment hash, and run_id
- Merge to dev after passing tests; promote to main after validation in CI
- Tag the final, reproducible run with a semantic version: v0.1.0

Illustrative quick-start snippet

Add and pin a dataset submodule
- git submodule add https://example.com/org/data-repo-dataset-A.git data-submodules/dataset-A
- git submodule update --init --recursive
- git commit -m "Add dataset-A as submodule for reproducible experiments"
Prepare a reproducible environment
- printf 'name: data-env\ndependencies:\n  - python=3.11\n' > environment/conda.yaml
- printf 'pandas==2.0.1\nnumpy==1.25.0\n' > environment/pinned-requirements.txt
- git add environment/conda.yaml environment/pinned-requirements.txt
- git commit -m "Pin environment for reproducible runs"
Run a quick smoke test in CI
- conda env create -f environment/conda.yaml
- conda activate data-env
- python pipelines/run.py --config configs/smoke.yaml --output runs/2026-06-04-smoke
- store provenance in runs/2026-06-04-smoke/provenance.json

Rizwan Saleem | https://rizwansaleem.co