
Juan Torchia

Posted on • Originally published at juanchi.dev

lode: Reimplementing DVC's core in Go without breaking the format


There's a type of open source project that earns my immediate respect: one that clearly defines what it doesn't do. lode is one of those.

When I first read the README, the sentence that stopped me was: "lode never invents a format; your repo stays a DVC repo." In an ecosystem where every new tool wants to be the center of gravity, that level of intentional restraint is rare. And it's exactly the technical decision I want to dissect here.

My thesis: format compatibility is not a marketing feature. It's operational risk management. In ML teams where DVC is already baked into pipelines, CI scripts, and audit flows, adopting a tool that invents its own artifact format requires a migration with a freeze window. lode eliminates that cost entirely, and it has a price: pipelines and dvc repro are out of scope. The trade-off is honest.


The problem lode attacked

DVC is the de facto standard for versioning datasets and models in ML projects. The problem isn't conceptual: it's runtime. When you have a directory with 20,000 files and run dvc add big/, DVC hashes sequentially in Python, with all the friction of an interpreter. The repo's README shows a concrete measurement on the same repo:

$ time dvc add big/      # 20,000 files
real    0m5.79s

$ time lode add big/     # same repo, identical result byte for byte
real    0m0.44s

That's roughly a 13× difference in that case. I won't universalize that number as a performance guarantee: it depends on hardware, filesystem, individual file sizes, and how many files are already in the state DB. What is reproducible is the mechanism: Go compiles to a native binary without interpreter overhead, hashing runs in parallel with NumCPU goroutines, and the state DB (bbolt, under internal/hashfile) stores (inode, mtime, size) → md5 to skip files that haven't changed. That combination makes technical sense independent of the exact number.

The friction of the hot path matters more than it seems in ML workflows. A slow dvc status makes data scientists avoid it, which leads to commits without updated pointer files and, from there, to broken reproducibility. Accelerating the happy path has real impact on team discipline.


The invariant that's non-negotiable

What interested me most about the repo was reading docs/ARCHITECTURE.md and finding this written as a cardinal principle:

Byte-compatibility with DVC. Anything that changes a serialized artifact (.dvc, .dir, cache/remote layout) must keep the oracle test (tests/oracle/, which runs the real dvc and compares bytes) green.

This isn't a throwaway comment in the README. It's a design invariant that runs through the entire architecture. The internal/dvcfile package reads and writes .dvc files byte-exact with DVC 3.x. The internal/hashfile package reimplements .dir manifest serialization to match exactly with Python's json.dumps (which has a specific key order). The internal/lock package implements DVC-compatible locking so both tools can coexist in the same repo without corruption.
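To illustrate why matching json.dumps is fiddly, here is a sketch of a serializer that imitates Python's default output: ", " and ": " separators, alphabetical keys, and \uXXXX escapes for anything outside printable ASCII. The manifest shape (a JSON array of objects with md5 and relpath keys, sorted by relpath) is my assumption about the .dir layout, and the names entry, pyQuote, and dirManifest are hypothetical. None of this is taken from lode's internal/hashfile.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"unicode/utf16"
)

// entry is a hypothetical in-memory form of one manifest record.
type entry struct {
	MD5     string
	RelPath string
}

// pyQuote imitates how Python's json.dumps quotes a string when
// ensure_ascii is left at its default (True): printable ASCII passes
// through, everything else becomes a lowercase \uXXXX escape.
// Invalid UTF-8 is not handled here (Go decodes it as U+FFFD).
func pyQuote(s string) string {
	var b strings.Builder
	b.WriteByte('"')
	for _, r := range s {
		switch {
		case r == '"':
			b.WriteString(`\"`)
		case r == '\\':
			b.WriteString(`\\`)
		case r == '\n':
			b.WriteString(`\n`)
		case r == '\r':
			b.WriteString(`\r`)
		case r == '\t':
			b.WriteString(`\t`)
		case r == '\b':
			b.WriteString(`\b`)
		case r == '\f':
			b.WriteString(`\f`)
		case r >= 0x20 && r <= 0x7e:
			b.WriteRune(r)
		case r > 0xffff:
			// Characters outside the BMP are written as a surrogate pair.
			hi, lo := utf16.EncodeRune(r)
			fmt.Fprintf(&b, `\u%04x\u%04x`, hi, lo)
		default:
			fmt.Fprintf(&b, `\u%04x`, r)
		}
	}
	b.WriteByte('"')
	return b.String()
}

// dirManifest serializes entries sorted by relpath, with keys in
// alphabetical order and Python's default ", " and ": " separators.
func dirManifest(entries []entry) string {
	sorted := append([]entry(nil), entries...)
	sort.Slice(sorted, func(i, j int) bool {
		return sorted[i].RelPath < sorted[j].RelPath
	})
	var b strings.Builder
	b.WriteByte('[')
	for i, e := range sorted {
		if i > 0 {
			b.WriteString(", ")
		}
		b.WriteString(`{"md5": ` + pyQuote(e.MD5) +
			`, "relpath": ` + pyQuote(e.RelPath) + `}`)
	}
	b.WriteByte(']')
	return b.String()
}

func main() {
	fmt.Println(dirManifest([]entry{
		{MD5: "b", RelPath: "z.txt"},
		{MD5: "a", RelPath: "a.txt"},
	}))
}
```

Go's encoding/json would not produce these bytes by default: it emits no spaces after separators, escapes <, >, and & as \u003c-style sequences, and leaves non-ASCII characters unescaped. Since the manifest's MD5 depends on every byte, a hand-written encoder like this is the kind of thing an oracle test has to pin down.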

The architecture is organized so format risk is concentrated in specific places:

internal/
├── dvcfile/   # Read/write .dvc — byte-exact compatibility with DVC 3.x
├── hashfile/  # Parallel MD5 + .dir serialization (the trickiest compat detail)
├── cache/     # Content-addressed object store: files/md5/<2>/<rest>
├── remote/    # S3-compatible backend via minio-go
├── transfer/  # Push/fetch with integrity verification
├── checkout/  # Materialization: reflink → hardlink/symlink → copy
└── lock/      # DVC-compatible locking (global flock + JSON rwlock)

Each package has a single responsibility and the highest-risk format code lives in internal/dvcfile and internal/hashfile/tree.go. That makes it easier to reason about where compatibility can break if DVC changes its format in a future version.

CI has an oracle job that installs real DVC (via pipx install "dvc[s3]") and runs go test ./tests/oracle/... to compare bytes. If the invariant breaks, the pipeline fails. No ambiguity.


The honest trade-off: what you accelerate and what stays out

lode implements the data layer: add, status, push, pull, fetch, checkout, gc, remote, doctor, verify. That covers the daily hot path for a team versioning datasets.

What's not in scope: dvc repro, dvc run, pipelines, transformation DAGs. The authors didn't pretend those were straightforward to reimplement with byte-identical compatibility. They chose to define a clear perimeter and execute it well, instead of building a partial clone of all of DVC.

Look at the README: "For ML pipelines (dvc repro), keep using DVC — lode accelerates the data layer and coexists with it." That sentence isn't an apology. It's a design decision. The two tools coexist because they share the same lock (internal/lock uses global flock + JSON rwlock compatible with DVC) and the same artifact format. You can run lode add and then dvc repro without any additional synchronization layer.

The main risk I see with any format reimplementation is drift: if DVC 4.x changes the .dvc file schema or the .dir JSON key order, lode has to update in parallel or compatibility silently breaks. The oracle test mitigates this, but only for the version of DVC installed in CI. That's not a flaw in lode's design; it's the structural cost of being compatible with a format you don't control. A team adopting it should plan for that maintenance.


The state DB: optimization with graceful degradation

The mechanism I liked most about the design is how they think about the state DB. The architecture spells it out explicitly:

The state DB (inode, mtime, size) -> md5 is an optimization, never a source of truth. It can produce a false "up to date" only if a file's content changes while all three keys stay identical (e.g. NFS quirks, restored backups that reset mtimes, recycled inodes). For those cases --rehash (and a corrupt/unreadable state DB) degrade to a full re-hash — the always-correct path.

That's a clear contract about the optimization's limits. Corrupt state or an NFS edge case doesn't break correctness: it degrades to the slow but always-correct path. The --rehash flag exists exactly for this. On network filesystems or CI environments where inodes can be recycled, it's something to keep in mind.
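The contract can be expressed as a small function: consult the cache unless a re-hash is forced or the cache is unavailable, and otherwise fall through to the full hash. This is a sketch under assumptions, not lode's implementation. It uses an in-memory map where lode uses bbolt, and the names stateKey, stateDB, and lookupOrHash are hypothetical.

```go
package main

import "fmt"

// stateKey mirrors the (inode, mtime, size) triple from the quoted contract.
type stateKey struct {
	Inode   uint64
	MtimeNs int64
	Size    int64
}

// stateDB stands in for the on-disk state DB. A nil map models a
// corrupt or unreadable DB.
type stateDB map[stateKey]string

// lookupOrHash returns the cached digest when allowed and present.
// With rehash set, or with no usable DB, it always calls hash, which
// is the slow but always-correct path. The bool reports a cache hit.
func lookupOrHash(db stateDB, key stateKey, rehash bool,
	hash func() (string, error)) (string, bool, error) {
	if !rehash && db != nil {
		if sum, ok := db[key]; ok {
			return sum, true, nil
		}
	}
	sum, err := hash()
	if err != nil {
		return "", false, err
	}
	if db != nil {
		db[key] = sum // refresh the cache for the next run
	}
	return sum, false, nil
}

func main() {
	db := stateDB{}
	key := stateKey{Inode: 42, MtimeNs: 1700000000000000000, Size: 5}
	calls := 0
	hash := func() (string, error) {
		calls++
		return "5d41402abc4b2a76b9719d911017c592", nil
	}

	_, hit, _ := lookupOrHash(db, key, false, hash)
	fmt.Println("first lookup, cache hit:", hit, "hash calls:", calls)
	_, hit, _ = lookupOrHash(db, key, false, hash)
	fmt.Println("second lookup, cache hit:", hit, "hash calls:", calls)
	_, hit, _ = lookupOrHash(db, key, true, hash)
	fmt.Println("forced rehash, cache hit:", hit, "hash calls:", calls)
}
```

The false "up to date" case from the quote is visible in the key: if a file's content changes while inode, mtime, and size all stay the same, the lookup returns the old digest. Forcing the re-hash path is the only way around it.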

What looks like good technical maturity to me is that this limit is documented in the architecture, not buried in a GitHub issue. A team adopting it knows exactly when lode status can lie (and how to force the correct path).


The static binary as an operational argument

CGO_ENABLED=0 in the build means a binary with no dynamic dependencies. That has practical implications in MLOps:

make build       # single binary with no CGO, no external runtime
make test-short  # unit + oracle, no external services
make test        # full suite — needs MinIO and real dvc

In a training Docker image, installing Python + DVC + S3 dependencies adds layers that can total hundreds of MB and minutes of build time. A static binary is COPY lode /usr/local/bin/lode and done. The release pipeline uses goreleaser with SBOM (via syft), keyless signing with cosign (OIDC), and build provenance attestation (SLSA). For a freshly built project, that level of rigor in the supply chain is a positive signal about how they think about long-term maintenance.


My position

I don't buy the claim of "drop-in compatible" in absolute terms: lode is drop-in compatible for the data layer. If your team's workflow depends on dvc repro, part of your flow stays in DVC. That's not a problem, but you have to name it honestly to avoid mismatched expectations.

What I do accept without reservation: the coexistence approach is technically correct. The alternative of inventing a custom format would shift the performance cost to a migration and lock-in cost. In ML teams where data artifacts are also audit evidence (experiment reproducibility, model traceability), changing that artifact format has a cost beyond engineering time.

The trade-off that feels honest to me: lode solves the hot-path performance problem with a constraint that in most cases is tolerable. The risk is format drift when DVC updates its spec. The oracle test in CI is the detection mechanism, but it requires active maintenance discipline.

If you manage DVC repos with large datasets and dvc add or dvc push time is a real bottleneck, lode deserves an evaluation. The fact that lode verify and dvc status can run on the same artifacts and give the same result is the contract that makes the evaluation reversible at no cost.

What would you do if DVC's format changes in a minor version and silently breaks compatibility in production? Do you have an oracle test to catch it, or do you discover it on the next dvc repro?


Repo analyzed: getlode/lode @ commit b6e6d34


This article was originally published on juanchi.dev
