Rizwan Saleem

A practical, beginner-friendly Git workflow for collaborative data science projects

Working with data science projects often means juggling notebooks, datasets, experiment tracking, and model artifacts alongside code. A solid Git workflow helps teams stay synchronized, reproduce results, and avoid spaghetti histories. This guide walks you through a complete, approachable workflow tailored for data science teams, with concrete commands, branching strategies, and tips for common pain points.

Why a dedicated workflow for data science

  • Notebooks, datasets, and large artifacts can bloat repos and complicate versioning.
  • Reproducibility requires tracking experiments, dependencies, and environment changes.
  • Collaboration often involves data preprocessing steps, model training, evaluation, and deployment pipelines.

A disciplined workflow keeps code, data, and experiments aligned, while remaining practical for researchers, analysts, and engineers.

Core concepts you’ll use

  • Git basics: commit, branch, merge, pull request (PR).
  • Branching model: main (or master), develop, feature branches, and experiment branches.
  • Versioning data: data/versioning practices, lightweight data references, and optional use of DVC or Git LFS for large files.
  • Environment and reproducibility: environment.yml/requirements.txt, Pipenv/Poetry, and lightweight experiment logging.
  • Notebooks: using nbstripout to strip outputs, and Jupytext to keep notebooks in a text-friendly format.
  • Experiment tracking: simple naming conventions or a lightweight tracker (e.g., MLflow, Weights & Biases) to record runs.

1) Choose a branching model

  • main: always deployable; contains production-ready code and minimal dataset references.

  • develop: integration branch for features; not production-ready until merged to main.

  • feature/*: for new features, experiments, or data processing steps.

  • experiment/*: for active data experiments that may not be stable enough to merge.

  • hotfix/*: for urgent fixes in production.
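One way to keep these prefixes consistent is a small check that a pre-push hook or CI job could run. The sketch below is only an illustration: classify_branch is a hypothetical helper, and the slug rule (lowercase letters, digits, dots, underscores, hyphens) is an assumption for this example, not something Git enforces.

```python
import re
from typing import Optional

# Branch kinds come from the model above; the slug pattern is an assumption.
LONG_LIVED = {"main", "develop"}
PREFIXES = ("feature", "experiment", "hotfix")
_SLUG = re.compile(r"^[a-z0-9][a-z0-9._-]*$")


def classify_branch(name: str) -> Optional[str]:
    """Return the branch kind, or None if the name does not fit the model."""
    if name in LONG_LIVED:
        return name
    prefix, sep, slug = name.partition("/")
    if sep and prefix in PREFIXES and _SLUG.match(slug):
        return prefix
    return None


if __name__ == "__main__":
    for branch in ["main", "feature/data-cleaning-v1", "experiment/xgb-tuning", "cleanup"]:
        print(f"{branch!r:32} -> {classify_branch(branch)}")
```

A name like cleanup returns None here, which a hook could treat as a prompt to rename the branch before pushing.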

Recommended approach:

  • Work on feature branches for substantial changes.
  • When an experiment branch yields solid results, merge into develop with clear PR notes.
  • After QA, merge develop into main for release.

2) Set up repository structure

A lightweight structure to keep data-related work organized:

  • data/
    • raw/ # original datasets
    • processed/ # cleaned/feature-engineered datasets (often small or references)
    • external/ # public datasets (with provenance)
    • references/ # checksums, data dictionaries
  • notebooks/
    • 01-explore.ipynb
    • 02-preprocess.ipynb
  • src/
    • data_pipeline/
    • modeling/
    • evaluation/
  • models/
    • (optional) serialized models or references
  • env/
    • environment.yml or pyproject.toml
  • tests/
    • unit and integration tests
  • docs/
    • README, guidance, experiments log

Tips:

  • Keep heavy datasets outside the repo when possible; store references or use DVC/Git LFS if you must track files.
  • Add a DATA_REQUIREMENTS.md that documents dataset provenance, versioning rules, and usage.

3) Versioning datasets and large artifacts

Handling large files in Git can slow everything down. Choose one of these approaches:

  • Lightweight references:
    • Store datasets externally (S3, GCS, or a shared drive) and keep a manifest file data/manifest.csv describing dataset versions, checksums, and download URLs.
  • DVC (Data Version Control):
    • Tracks data, models, and experiments while keeping only lightweight links in Git.
    • Pros: reproducibility, clean commands, easy experiments.
    • Cons: adds a tool to learn; some teams mitigate by using DVC for critical artifacts only.
  • Git LFS (Large File Storage):
    • Stores large files in a separate storage, with pointer files in the repo.
    • Pros: works well with Git hosting services; simple to adopt.
    • Cons: bandwidth and storage quotas can be a constraint.

Recommended starter: use a manifest for most datasets; consider DVC if your team runs frequent model training with sizable data.
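To make the manifest approach concrete, here is a minimal sketch of a checker that compares local files against data/manifest.csv. The column names (name, version, sha256, url) and the function names are assumptions for this example; the guide itself only says the manifest records versions, checksums, and download URLs.

```python
import csv
import hashlib
from pathlib import Path
from typing import List


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large datasets do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(manifest: Path, data_dir: Path) -> List[str]:
    """Return a list of problems; an empty list means every file matches."""
    problems = []
    with manifest.open(newline="") as handle:
        # Assumed columns: name, version, sha256, url
        for row in csv.DictReader(handle):
            local = data_dir / row["name"]
            if not local.exists():
                problems.append(f"{row['name']}: missing (download from {row['url']})")
            elif sha256_of(local) != row["sha256"]:
                problems.append(f"{row['name']}: checksum mismatch for version {row['version']}")
    return problems


if __name__ == "__main__":
    for problem in verify_manifest(Path("data/manifest.csv"), Path("data/raw")):
        print(problem)
```

Running a check like this before training gives a cheap signal that everyone is working from the same dataset version, without putting the data itself in Git.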

4) Notebook hygiene and collaboration

  • Use nbstripout or jupytext to keep notebooks clean and portable.
    • nbstripout removes outputs on commit, reducing merge conflicts.
    • jupytext enables pairing notebooks with executable Python scripts (.py) for easier diffing.
  • Version control notebooks as part of the project, but keep outputs out of Git when possible.
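To see why stripping outputs helps, remember that an .ipynb file is JSON in which each code cell carries its outputs and an execution counter. The sketch below illustrates the idea on a plain dictionary. It is not nbstripout's actual implementation, which handles more cases (such as notebook metadata), and strip_outputs is a hypothetical name.

```python
import json


def strip_outputs(notebook: dict) -> dict:
    """Return a copy of a notebook dict with code-cell outputs cleared."""
    cleaned = json.loads(json.dumps(notebook))  # deep copy of JSON-compatible data
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return cleaned


if __name__ == "__main__":
    nb = {
        "nbformat": 4,
        "nbformat_minor": 5,
        "metadata": {},
        "cells": [
            {"cell_type": "markdown", "metadata": {}, "source": ["# Explore"]},
            {
                "cell_type": "code",
                "metadata": {},
                "execution_count": 7,
                "source": ["1 + 1"],
                "outputs": [
                    {
                        "output_type": "execute_result",
                        "data": {"text/plain": ["2"]},
                        "metadata": {},
                        "execution_count": 7,
                    }
                ],
            },
        ],
    }
    print(json.dumps(strip_outputs(nb)["cells"][1], indent=2))
```

With outputs and counters gone, two people who ran the same notebook produce identical files, so diffs show only real changes to the source cells.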

Commands:

  • Install nbstripout:
    • pip install nbstripout
    • nbstripout --install
  • If using Jupytext:

    • pip install jupytext
    • Configure notebooks as paired: in a notebook, choose File > Jupytext > Pair Notebook with Py Script.

5) Environment and reproducibility

  • Commit environment specs:

    • Python: use poetry or pipenv, or a requirements.txt.
    • Conda: environment.yml
  • Pin dependencies to versions to ensure reproducibility.

  • Add a quick validation script (e.g., python -m pytest) to verify the environment can run end-to-end.
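As one possible check for the pinning advice, the sketch below flags lines in a requirements.txt that are not pinned with ==. It is a simplified illustration, not a full parser for pip's requirements format: unpinned_requirements is a hypothetical helper, and lines with environment markers or direct URLs would be reported as unpinned.

```python
import re
from typing import List

# Matches "package==1.2.3" with optional extras; anything else counts as unpinned.
_PINNED = re.compile(r"^[A-Za-z0-9][A-Za-z0-9._-]*(\[[^\]]+\])?==[^=\s]+$")


def unpinned_requirements(text: str) -> List[str]:
    """Return requirement lines that are not pinned with '=='."""
    loose = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or line.startswith("-"):  # skip blanks and pip options
            continue
        if not _PINNED.match(line):
            loose.append(line)
    return loose


if __name__ == "__main__":
    sample = "numpy==1.26.4\npandas>=1.5  # loose\nscikit-learn\n"
    for line in unpinned_requirements(sample):
        print(f"not pinned: {line}")
```

A check like this could run in CI next to the test suite, so a loosened pin is caught in review and not on a teammate's machine.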

Example files:

  • environment.yml
    • name: ds-env
    • dependencies:
      • python=3.11
      • numpy
      • pandas
      • scikit-learn
      • jupyter
  • pyproject.toml (Poetry)
    • [tool.poetry.dependencies]
    • python = "^3.11"
    • numpy = "^1.26"
    • pandas = "^1.5"

Tip:

  • Create a simple run.sh or Makefile target to set up the environment and run a quick data sanity check.

6) A practical Git workflow (step-by-step)

1) Create a feature branch

  • git checkout develop
  • git pull origin develop
  • git checkout -b feature/data-cleaning-v1

2) Work on your changes

  • Make changes in src/, notebooks/, and docs as needed.
  • Periodically run tests and a quick data sanity check.
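A "quick data sanity check" can be as small as confirming that the dataset has rows and that required columns are filled in. Here is a minimal sketch using only the standard library. The function name and the column names (customer_id, churned) are hypothetical and chosen only for illustration.

```python
import csv
import io
from typing import Iterable, List


def sanity_check(rows: Iterable[dict], required: List[str]) -> List[str]:
    """Return human-readable problems found in the rows; empty means OK."""
    problems = []
    count = 0
    for count, row in enumerate(rows, start=1):
        for column in required:
            if column not in row:
                problems.append(f"row {count}: column {column!r} is absent")
            elif row[column] is None or not str(row[column]).strip():
                problems.append(f"row {count}: column {column!r} is empty")
    if count == 0:
        problems.append("dataset has no rows")
    return problems


if __name__ == "__main__":
    # In-memory sample; a real check would open a file listed in the manifest.
    sample = io.StringIO("customer_id,churned\n1,0\n2,\n")
    for problem in sanity_check(csv.DictReader(sample), ["customer_id", "churned"]):
        print(problem)
```

Because the function returns a list and does not raise, the same code can back a pytest assertion or a Makefile target that fails when the list is non-empty.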

3) Stage and commit meaningfully

  • git add .
  • git commit -m "feat(data): add initial cleaning pipeline and unit tests"
  • Include a small, focused changelog entry in docs/CHANGES.md.

4) Rebase or merge frequently

  • To keep a clean history, rebase your feature branch onto develop:
    • git fetch origin
    • git rebase origin/develop
    • Resolve conflicts if any, then git rebase --continue
  • Alternatively, merge develop into your feature branch to stay up-to-date:
    • git merge origin/develop

5) Push and open a PR

  • git push -u origin feature/data-cleaning-v1
  • Open a PR against develop with:
    • Title: feat(data): initial cleaning pipeline
    • Description: outline what changed, dataset references, tests added, and how to reproduce.

6) PR review and CI

  • Ensure tests pass (unit, integration, and data sanity checks).
  • Add reviewers and respond to feedback.
  • If needed, adjust code and push new commits to the same PR.

7) Merge to develop

  • After approvals and checks, merge to develop.
  • Update CHANGES.md with a concise summary.

8) Release to main

  • When develop is stable and ready for release:

    • git checkout main
    • git pull origin main
    • git merge origin/develop
    • git push origin main

7) Practical commands cheat-sheet

  • Create a new branch:

    • git checkout -b feature/data-cleaning-v1
  • Update from develop:

    • git fetch origin
    • git rebase origin/develop
  • Resolve conflicts (example):

    • edit conflicted files
    • git add <resolved-files>
    • git rebase --continue
  • Run tests (example with pytest):

    • pytest tests/
  • Tag a release:

    • git tag -a v0.1.0 -m "First data-cleaning release"
    • git push origin v0.1.0
  • View log cleanly:

    • git log --oneline --decorate --graph --all

8) Example workflow narrative

  • Alice and her team work on a project to preprocess customer data and build a simple churn model.

  • They agree to store data references in data/manifest.csv and keep heavy datasets in a shared data lake.

  • Alice creates feature/data-prep branch, implements a data cleaning pipeline in src/data_pipeline/clean.py, and adds unit tests in tests/test_clean.py.

  • She uses nbstripout to keep notebooks tidy and writes a small notebook that documents exploratory steps without heavy outputs.

  • After local validation, she opens a PR against develop with clear notes: dataset versions, environment changes, and tests.

  • The team reviews, runs tests, and approves. The PR is merged into develop.

  • When the project is ready for a release, develop is merged into main, and a release banner is added to docs/RELEASE_NOTES.md.

9) Common pitfalls and how to avoid them

  • Pitfall: Large files in Git slow down work.

    • Solution: store large artifacts outside Git; use a manifest or DVC for critical assets.
  • Pitfall: Notebook merge conflicts.

    • Solution: use nbstripout or Jupytext to minimize textual conflicts; commit notebooks less frequently, focusing on scripts and data pipeline code.
  • Pitfall: Environment drift across teammates.

    • Solution: pin dependencies, provide a reproducible setup script (Makefile or run.sh), and document environment steps in a CONTRIBUTING guide.
  • Pitfall: Inconsistent experiment tracking.

    • Solution: adopt a simple experiment naming convention and a lightweight tracker (e.g., MLflow) for reproducibility.

10) Quick starter checklist

  • [ ] Create main, develop, and feature branches following the model above.

  • [ ] Add a data handling plan: where data lives, how to reference versions, and how to reproduce.

  • [ ] Set up environment specs (requirements.txt, environment.yml, or Poetry).

  • [ ] Enable notebook hygiene tooling (nbstripout, Jupytext).

  • [ ] Implement a minimal CI to run tests and a quick data sanity check on PRs.

  • [ ] Document the workflow in a CONTRIBUTING.md.


Rizwan Saleem | https://rizwansaleem.co