The Problem
You added a 2 GB dataset to a repo. Now git clone takes 10 minutes, CI downloads the full history on every run, and GitHub is billing you per GB transferred. You switch to Git LFS - and now you need a server, a token, and a storage plan. You try DVC - and now you need Python, a pipeline config, and lock files that conflict on every PR.
None of this is the actual problem. The actual problem is: large files don't belong in Git objects. Everything else is overhead.
Meet git-sfs
SFS stands for Symbolic File Storage. The name is deliberate - it's Git LFS with the L swapped for S. Git LFS replaces large files with opaque pointer files and routes bytes through a proprietary server protocol. git-sfs replaces large files with plain symlinks that Git already understands natively, and routes bytes through rclone to any remote you already have.
No new protocol. No server. No pointer file format to decode. Just symlinks.
How It Works
The model is three sentences:
- Git tracks symlinks.
- git-sfs stores file bytes.
- rclone moves files.
When you run `git-sfs add data/train-000.tar.zst`, here's what happens:
- The file is hashed with SHA-256
- The bytes are stored in `<cache>/files/sha256/ab/<hash>` - write-once, read-only
- The original path becomes a relative Git symlink pointing into the cache: `data/train-000.tar.zst -> ../.git-sfs/cache/files/sha256/ab/<hash>`
data/train-000.tar.zst opens normally. Git commits a 70-byte symlink. Your repo stays fast.
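There's nothing exotic under the hood. Here's a rough shell equivalent of what `git-sfs add` does, following the cache layout above - a sketch, not the tool's actual implementation:

```bash
# Rough shell equivalent of `git-sfs add` (illustrative, not the real code)
f=data/train-000.tar.zst
hash=$(sha256sum "$f" | cut -d' ' -f1)       # content address
dir=.git-sfs/cache/files/sha256/${hash:0:2}  # two-char shard ("ab" above)
mkdir -p "$dir"
cp "$f" "$dir/$hash"
chmod a-w "$dir/$hash"                       # write-once, read-only
ln -sf "../$dir/$hash" "$f"                  # relative link; ../ because data/ is one level deep
```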
Why Not Git LFS?
Git LFS solves the storage problem but adds a server problem. You need an LFS endpoint, you pay per-GB transfer fees on GitHub, and the pointer-file format is an opaque internal detail.
git-sfs remotes are plain rclone destinations - S3, GCS, Azure Blob, Backblaze B2, SFTP, a local path, anything rclone supports. You can rclone ls your remote and see exactly what's there. No magic.
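That transparency is easy to check with the rclone you already have - for instance (remote name and path below are placeholders):

```bash
# List what git-sfs has uploaded, using stock rclone - no special client needed
rclone ls s3:my-bucket/git-sfs
```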
One Binary, No Runtime
git-sfs is written in Go. It ships as a single static binary - no Python environment, no runtime, no version conflicts. Drop it on any machine and it runs.
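The install script below fetches a prebuilt release binary. If you already have a Go toolchain, an install along these lines may also work - assuming the main package sits at the module root, which I haven't verified:

```bash
# Unverified alternative to the install script; assumes main package at repo root
go install github.com/Red-Eyed/git-sfs@latest
```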
Quick Start
```bash
# Install (macOS/Linux, arm64 and x86_64)
curl -LsSf https://github.com/Red-Eyed/git-sfs/releases/latest/download/install.sh | sh

# Init and configure
git-sfs init    # creates .git-sfs/config.toml
# edit config.toml: set remote backend, path, rclone config
git-sfs setup   # bind local cache

# Add files
git-sfs add data/
git add .git-sfs/config.toml data/
git commit -m "track datasets"

# Sync to remote
git-sfs push
```
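The only manual step above is editing `.git-sfs/config.toml`. A hypothetical sketch of what it might contain - every key under `[remote]` is a guess on my part, and only `[settings].n_jobs` (shown later in this post) is confirmed, so check the file `git-sfs init` generates for the real names:

```toml
# Illustrative config.toml - [remote] key names are guesses, not documented
[remote]
backend = "s3"               # any rclone backend: gcs, azureblob, b2, sftp, ...
path = "my-bucket/datasets"  # where the file bytes land

[settings]
n_jobs = 8                   # worker pool size (see the concurrency section)
```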
On another machine:
```bash
git clone <repo> && cd <repo>
git-sfs setup
git-sfs pull    # download only what you need
```
You can also pull a subset - useful when one machine only needs validation data:
```bash
git-sfs pull data/validation/
```
Safety by Design
Data loss in a dataset management tool is unacceptable, so git-sfs has a few hard rules:
- Hash-verify at every boundary - after hashing, after download, after copy. A corrupted file is rejected, not silently accepted.
- Atomic writes - temp file + rename everywhere; see the sketch after this list. An interrupted push or pull never leaves a partial file.
- Cache files are immutable - write-once, then write-protected. Accidental overwrites are impossible.
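The last two rules combine into the classic temp-file + rename pattern. A minimal shell sketch of the idea - illustrative only, not the actual git-sfs source:

```bash
# rename() within a single filesystem is atomic: readers see either the old
# state or the complete new file, never a partial write.
src=/tmp/incoming-object                     # hypothetical downloaded blob
dst=.git-sfs/cache/files/sha256/ab/deadbeef  # placeholder cache path
mkdir -p "$(dirname "$dst")"
tmp=$(mktemp "$dst.XXXXXX")                  # temp file in the same directory
cp "$src" "$tmp"
mv "$tmp" "$dst"                             # atomic publish
chmod a-w "$dst"                             # write-once: later overwrites fail
```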
The verify command is designed for CI:
```bash
git-sfs verify                         # presence check (fast)
git-sfs verify --with-integrity data/  # rehash cached + remote files (thorough)
```
And doctor checks your entire setup - config, cache, rclone binary, remote connectivity - in one shot:
```bash
git-sfs doctor
```
What's Different from DVC, git-annex, etc.?
| Tool | Requires | Remote | PR-friendly? |
|---|---|---|---|
| Git LFS | LFS server | proprietary protocol | ✅ |
| DVC | Python, pipelines | S3/GCS/etc via SDK | ❌ |
| git-annex | Haskell runtime | many backends | ❌ |
| git-sfs | rclone | anything rclone supports | ✅ |
DVC stores large-file metadata in `.dvc` sidecar files and a `dvc.lock` file that encodes pipeline state. When two branches touch the same dataset, merging those files creates conflicts with no meaningful resolution in a pull request - reviewers see `.dvc` diffs, not data diffs, and the merge problem is fundamentally DVC's, not Git's.
git-annex stores its metadata in a separate orphan branch (git-annex). That branch never appears in a normal git log or PR diff, so reviewers have no visibility into what large files changed or why. Merging the annex branch is a separate out-of-band operation that GitHub PRs don't surface at all.
git-sfs tracks only plain relative symlinks. A PR diff shows exactly which files were added, removed, or renamed - the same way any other file change looks in Git. Reviewers can approve or reject dataset changes with full context, and there are no sidecar files or hidden branches to reconcile.
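Concretely, adding one tracked file shows up in a diff as a one-line symlink carrying Git's 120000 (symlink) file mode - roughly like this (output shape simplified, hash elided):

```bash
git diff --staged -- data/train-000.tar.zst
# new file mode 120000
# +../.git-sfs/cache/files/sha256/ab/<hash>
```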
git-sfs has no pipelines, no Python, no manifests, no committed lock files. The Git tree is the file list. The cache is a plain directory. The remote is whatever rclone can reach.
Concurrency and Partial Pulls
Large dataset workflows often mean hundreds of files, if not millions. git-sfs runs `add`, `push`, and `pull` with a configurable worker pool:
```toml
# .git-sfs/config.toml
[settings]
n_jobs = 8
```
And partial pulls let teams share a repo where different machines only materialize what they actually use:
```bash
git-sfs pull data/train/         # only training split
git-sfs pull data/checkpoints/   # only model weights
```
Source
The project is open source under MIT: github.com/Red-Eyed/git-sfs
Feedback and issues welcome. If you're managing large files in Git and the LFS server tax is getting old, give it a try.