A practical, beginner-friendly Git workflow for collaborative data science projects
Working with data science projects often means juggling notebooks, datasets, experiment tracking, and model artifacts alongside code. A solid Git workflow helps teams stay synchronized, reproduce results, and avoid spaghetti histories. This guide walks you through a complete, approachable workflow tailored for data science teams, with concrete commands, branching strategies, and tips for common pain points.
### Why a dedicated workflow for data science
- Notebooks, datasets, and large artifacts can bloat repos and complicate versioning.
- Reproducibility requires tracking experiments, dependencies, and environment changes.
- Collaboration often involves data preprocessing steps, model training, evaluation, and deployment pipelines.
A disciplined workflow keeps code, data, and experiments aligned, while remaining practical for researchers, analysts, and engineers.
### Core concepts you’ll use
- Git basics: commit, branch, merge, pull request (PR).
- Branching model: main (or master), develop, feature branches, and experiment branches.
- Versioning data: data versioning practices, lightweight data references, and optional use of DVC or Git LFS for large files.
- Environment and reproducibility: environment.yml/requirements.txt, Pipenv/Poetry, and lightweight experiment logging.
- Notebooks: using nbstripout to strip outputs, and Jupytext to keep notebooks in a text-friendly format.
- Experiment tracking: simple naming conventions or a lightweight tracker (e.g., MLflow, Weights & Biases) to record runs.
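To make the "simple naming conventions" idea concrete, here is a minimal sketch of one possible scheme. The name format (`project_date_sequence_description`), the function names, and the JSON Lines log file are all assumptions made for illustration; none of them comes from MLflow or Weights & Biases.

```python
import json
import re
from datetime import date
from pathlib import Path


def make_run_name(project: str, description: str, run_date: date, seq: int) -> str:
    """Build a run name such as 'churn_2024-05-01_003_baseline-logreg'.

    The format is a hypothetical convention, not a standard.
    """
    # Lowercase the description and collapse anything non-alphanumeric to '-'.
    slug = re.sub(r"[^a-z0-9]+", "-", description.lower()).strip("-")
    return f"{project}_{run_date.isoformat()}_{seq:03d}_{slug}"


def log_run(log_path: Path, run_name: str, params: dict, metrics: dict) -> None:
    """Append one run to a JSON Lines file (one JSON object per line)."""
    log_path.parent.mkdir(parents=True, exist_ok=True)
    record = {"run": run_name, "params": params, "metrics": metrics}
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, sort_keys=True) + "\n")
```

A plain text log like this is easy to review in a PR. Once the team needs run comparison or artifact storage, a dedicated tracker is probably the better fit.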
### 1) Choose a branching model
- main: always deployable; contains production-ready code and minimal dataset references.
- develop: integration branch for features; not production-ready until merged to main.
- feature/*: for new features, experiments, or data processing steps.
- experiment/*: for active data experiments that may not be stable enough to merge.
- hotfix/*: for urgent fixes in production.
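If the team wants to enforce these prefixes (for example in a pre-push hook or a CI step), a small check along these lines could work. This is only a sketch: the rule for what may follow the prefix (lowercase letters, digits, dots, underscores, hyphens) is an assumption, not something Git itself requires.

```python
import re

# Long-lived branches from the model above.
LONG_LIVED = {"main", "develop"}

# Short-lived branches: an allowed prefix, a slash, then a slug.
# The slug rule is an assumed team convention, not a Git restriction.
PREFIXED = re.compile(r"(feature|experiment|hotfix)/[a-z0-9][a-z0-9._-]*")


def is_valid_branch(name: str) -> bool:
    """Return True if the branch name fits the branching model."""
    return name in LONG_LIVED or PREFIXED.fullmatch(name) is not None
```

With this rule, `feature/data-cleaning-v1` passes, while `bugfix/typo` and `Feature/Cleaning` are rejected.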
Recommended approach:
- Work on feature branches for substantial changes.
- When an experiment branch yields solid results, merge into develop with clear PR notes.
- After QA, merge develop into main for release.

### 2) Set up repository structure
A lightweight structure to keep data-related work organized:
- data/
  - raw/ # original datasets
  - processed/ # cleaned/feature-engineered datasets (often small or references)
  - external/ # public datasets (with provenance)
  - references/ # checksums, data dictionaries
- notebooks/
  - 01-explore.ipynb
  - 02-preprocess.ipynb
- src/
  - data_pipeline/
  - modeling/
  - evaluation/
- models/
  - (optional) serialized models or references
- env/
  - environment.yml or pyproject.toml
- tests/
  - unit and integration tests
- docs/
  - README, guidance, experiments log
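The layout above can be created in one go with a short script. The sketch below assumes you want a `.gitkeep` placeholder in each directory; that file name is a common convention rather than a Git feature, and it is there because Git does not track empty directories.

```python
from pathlib import Path

# Directories from the structure above, relative to the repository root.
LAYOUT = [
    "data/raw",
    "data/processed",
    "data/external",
    "data/references",
    "notebooks",
    "src/data_pipeline",
    "src/modeling",
    "src/evaluation",
    "models",
    "env",
    "tests",
    "docs",
]


def scaffold(root: Path) -> list[Path]:
    """Create the project directories under root and return them."""
    created = []
    for relative in LAYOUT:
        directory = root / relative
        directory.mkdir(parents=True, exist_ok=True)
        # Placeholder so the otherwise empty directory can be committed.
        (directory / ".gitkeep").touch()
        created.append(directory)
    return created
```

Calling `scaffold(Path("."))` from the repository root is safe to repeat, since existing directories are left untouched.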
Tips:
- Keep heavy datasets outside the repo when possible; store references or use DVC/Git LFS if you must track files.
- Add a DATA_REQUIREMENTS.md that documents dataset provenance, versioning rules, and usage.

### 3) Versioning datasets and large artifacts
Handling large files in Git can slow everything down. Choose one of these approaches:
- Lightweight references:
  - Store datasets externally (S3, GCS, or a shared drive) and keep a manifest file data/manifest.csv describing dataset versions, checksums, and download URLs.
- DVC (Data Version Control):
  - Tracks data, models, and experiments while keeping only lightweight links in Git.
  - Pros: reproducibility, clean commands, easy experiments.
  - Cons: adds a tool to learn; some teams mitigate by using DVC for critical artifacts only.
- Git LFS (Large File Storage):
  - Stores large files in a separate storage, with pointer files in the repo.
  - Pros: works well with Git hosting services; simple to adopt.
  - Cons: bandwidth and storage quotas can be a constraint.
Recommended starter: use a manifest for most datasets; consider DVC if your team runs frequent model training with sizable data.
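For the manifest approach, a small verification script helps teammates confirm that their local copies match the recorded versions. The sketch below assumes data/manifest.csv has the columns `path`, `version`, `sha256`, and `url`; those column names are an illustration, so adapt them to whatever your manifest actually contains.

```python
import csv
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large datasets do not have to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(manifest: Path, data_root: Path) -> list[str]:
    """Return a list of problems; an empty list means every file matches."""
    problems = []
    with manifest.open(newline="", encoding="utf-8") as fh:
        # Assumed columns: path, version, sha256, url
        for row in csv.DictReader(fh):
            local = data_root / row["path"]
            if not local.exists():
                problems.append(f"missing: {row['path']} (download from {row['url']})")
            elif sha256_of(local) != row["sha256"]:
                problems.append(
                    f"checksum mismatch: {row['path']} (expected version {row['version']})"
                )
    return problems
```

Running `verify_manifest(Path("data/manifest.csv"), Path("data"))` before training is a cheap way to catch a stale or partially downloaded dataset.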
### 4) Notebook hygiene and collaboration
- Use nbstripout or Jupytext to keep notebooks clean and portable.
  - nbstripout removes outputs on commit, reducing merge conflicts.
  - Jupytext enables pairing notebooks with executable Python scripts (.py) for easier diffing.
- Version control notebooks as part of the project, but keep outputs out of Git when possible.
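To see why stripping outputs helps, it is worth looking at what is actually being removed. An .ipynb file is JSON, and in the nbformat 4 layout each code cell carries an `outputs` list and an `execution_count`. The sketch below clears both; it only illustrates the idea and is not how nbstripout is implemented (nbstripout handles more cases and runs as a Git filter).

```python
import json
from pathlib import Path


def strip_outputs(notebook: dict) -> dict:
    """Clear outputs and execution counts from code cells (nbformat 4 layout)."""
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return notebook


def strip_file(path: Path) -> None:
    """Rewrite a notebook file in place without outputs."""
    notebook = json.loads(path.read_text(encoding="utf-8"))
    stripped = strip_outputs(notebook)
    path.write_text(json.dumps(stripped, indent=1) + "\n", encoding="utf-8")
```

With outputs gone, a diff of the notebook shows only changes to code and Markdown cells, which is what reviewers usually care about.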
Commands:
- Install nbstripout and enable it for the current repository:
  - pip install nbstripout
  - nbstripout --install
- If using Jupytext:
  - pip install jupytext
  - Configure notebooks as paired: in a notebook, choose File > Jupytext > Pair Notebook with Py Script.

### 5) Environment and reproducibility
- Commit environment specs:
  - Python: use poetry or pipenv, or a requirements.txt.
  - Conda: environment.yml
- Pin dependencies to versions to ensure reproducibility.
- Add a quick validation script (e.g., python -m pytest) to verify the environment can run end-to-end.
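Alongside the test run, a tiny script can confirm that the installed packages match the pinned versions. The sketch below uses the standard library's `importlib.metadata`; the function name and the idea of passing pins as a dictionary are assumptions made for illustration, and in practice you would load the pins from your lock file or requirements.txt.

```python
from importlib import metadata


def check_pins(pins: dict[str, str]) -> list[str]:
    """Compare installed package versions against exact pins.

    Returns a list of problems; an empty list means the environment matches.
    """
    problems = []
    for package, wanted in pins.items():
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            problems.append(f"{package}: not installed (wanted {wanted})")
            continue
        if installed != wanted:
            problems.append(f"{package}: installed {installed}, wanted {wanted}")
    return problems
```

For example, `check_pins({"numpy": "1.26.4"})` returns an empty list only if exactly that version is installed; the version shown is a placeholder, not a recommendation.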
Example files:
- environment.yml
  - name: ds-env
  - dependencies:
    - python=3.11
    - numpy
    - pandas
    - scikit-learn
    - jupyter
- pyproject.toml (Poetry)
  - [tool.poetry.dependencies]
  - python = "^3.11"
  - numpy = "^1.26"
  - pandas = "^1.5"
Tip:
- Create a simple run.sh or Makefile target to set up the environment and run a quick data sanity check.

### 6) A practical Git workflow (step-by-step)
1) Create a feature branch
- git checkout develop
- git pull origin develop
- git checkout -b feature/data-cleaning-v1
2) Work on your changes
- Make changes in src/, notebooks/, and docs as needed.
- Periodically run tests and a quick data sanity check.
3) Stage and commit meaningfully
- git add .
- git commit -m "feat(data): add initial cleaning pipeline and unit tests"
- Include a small, focused changelog entry in docs/CHANGES.md.
4) Rebase or merge frequently
- To keep a clean history, rebase your feature branch onto develop:
- git fetch origin
- git rebase origin/develop
- Resolve conflicts if any, then run git rebase --continue
- Alternatively, merge develop into your feature branch to stay up-to-date:
- git merge origin/develop
5) Push and open a PR
- git push -u origin feature/data-cleaning-v1
- Open a PR against develop with:
- Title: feat(data): initial cleaning pipeline
- Description: outline what changed, dataset references, tests added, and how to reproduce.
6) PR review and CI
- Ensure tests pass (unit, integration, and data sanity checks).
- Add reviewers and respond to feedback.
- If needed, adjust code and push new commits to the same PR.
7) Merge to develop
- After approvals and checks, merge to develop.
- Update CHANGES.md with a concise summary.
8) Release to main
- When develop is stable and ready for release:
  - git fetch origin
  - git checkout main
  - git pull origin main
  - git merge origin/develop
  - git push origin main

### 7) Practical commands cheat-sheet
- Create a new branch:
  - git checkout -b feature/data-cleaning-v1
- Update from develop:
  - git fetch origin
  - git rebase origin/develop
- Resolve conflicts (example):
  - edit conflicted files
  - git add <resolved-files>
  - git rebase --continue
- Run tests (example with pytest):
  - pytest tests/
- Tag a release:
  - git tag -a v0.1.0 -m "First data-cleaning release"
  - git push origin v0.1.0
- View log cleanly:
  - git log --oneline --decorate --graph --all

### 8) Example workflow narrative
Alice's team is working on a project to preprocess customer data and build a simple churn model.
They agree to store data references in data/manifest.csv and keep heavy datasets in a shared data lake.
Alice creates a feature/data-prep branch, implements a data cleaning pipeline in src/data_pipeline/clean.py, and adds unit tests in tests/test_clean.py.
She uses nbstripout to keep notebooks tidy and writes a small notebook that documents exploratory steps without heavy outputs.
After local validation, she opens a PR against develop with clear notes: dataset versions, environment changes, and tests.
The team reviews, runs tests, and approves. The PR is merged into develop.
When the project is ready for a release, develop is merged into main, and a release banner is added to docs/RELEASE_NOTES.md.
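To give a feel for what Alice's cleaning code and its test might contain, here is a purely hypothetical sketch. The function name, the `customer_id` and `email` fields, and the cleaning rules are invented for illustration. It uses plain dictionaries so it runs without extra dependencies, though a real pipeline would more likely work on pandas DataFrames.

```python
# Hypothetical contents of src/data_pipeline/clean.py
def clean_customers(rows: list[dict]) -> list[dict]:
    """Drop rows without a customer_id, trim whitespace, and de-duplicate by id.

    The first row seen for each customer_id wins.
    """
    seen = set()
    cleaned = []
    for row in rows:
        customer_id = (row.get("customer_id") or "").strip()
        if not customer_id or customer_id in seen:
            continue
        seen.add(customer_id)
        email = (row.get("email") or "").strip().lower()
        cleaned.append({**row, "customer_id": customer_id, "email": email})
    return cleaned


# Hypothetical contents of tests/test_clean.py (pytest collects test_* functions)
def test_clean_customers_drops_blanks_and_duplicates():
    rows = [
        {"customer_id": " 42 ", "email": " A@Example.com "},
        {"customer_id": "", "email": "x@example.com"},
        {"customer_id": "42", "email": "dup@example.com"},
    ]
    assert clean_customers(rows) == [{"customer_id": "42", "email": "a@example.com"}]
```

Keeping the cleaning logic in a plain function like this, rather than in a notebook cell, is what makes it easy to test and review in a PR.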
### 9) Common pitfalls and how to avoid them
- Pitfall: Large files in Git slow down work.
  - Solution: store large artifacts outside Git; use a manifest or DVC for critical assets.
- Pitfall: Notebook merge conflicts.
  - Solution: use nbstripout or Jupytext to minimize textual conflicts; commit notebooks less frequently, focusing on scripts and data pipeline code.
- Pitfall: Environment drift across teammates.
  - Solution: pin dependencies, provide a reproducible setup script (Makefile or run.sh), and document environment steps in a CONTRIBUTING guide.
- Pitfall: Inconsistent experiment tracking.
  - Solution: adopt a simple experiment naming convention and a lightweight tracker (e.g., MLflow) for reproducibility.

### 10) Quick starter checklist
[ ] Create main, develop, and feature branches following the model above.
[ ] Add a data handling plan: where data lives, how to reference versions, and how to reproduce.
[ ] Set up environment specs (requirements.txt, environment.yml, or Poetry).
[ ] Enable notebook hygiene tooling (nbstripout, Jupytext).
[ ] Implement a minimal CI to run tests and a quick data sanity check on PRs.
[ ] Document the workflow in a CONTRIBUTING.md.
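For the "quick data sanity check" item, something as small as the sketch below is often enough to start with. It assumes the data is a CSV file and checks only three things: required columns exist, required values are not empty, and there is a minimum number of rows. The function name, parameters, and the checks themselves are illustrative choices, not a standard.

```python
import csv
from pathlib import Path


def sanity_check(csv_path: Path, required: list[str], min_rows: int = 1) -> list[str]:
    """Return a list of problems found in a CSV file; empty means it passed."""
    problems = []
    rows = 0
    with csv_path.open(newline="", encoding="utf-8") as fh:
        reader = csv.DictReader(fh)
        missing = [col for col in required if col not in (reader.fieldnames or [])]
        if missing:
            return [f"missing columns: {', '.join(missing)}"]
        for row in reader:
            rows += 1
            for col in required:
                # Short rows yield None for absent fields, so guard before strip().
                if not (row[col] or "").strip():
                    problems.append(f"line {reader.line_num}: empty {col}")
    if rows < min_rows:
        problems.append(f"expected at least {min_rows} rows, found {rows}")
    return problems
```

In CI, a wrapper script can call `sanity_check` on a small sample file and exit with a non-zero status if the returned list is not empty, which fails the PR check.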
Rizwan Saleem | https://rizwansaleem.co