How to use native git as a key-value store

#git #database #phabricator #branches

Most engineers are familiar with creating branches and making commits in Git. The tool is notoriously unintuitive but has become universal in software engineering. But did you know that you can store more than just snapshots of code?

There is already a long history of tools hacking small amounts of metadata into Git. For example, the open-source code review tool Gerrit ingests pull requests through git push by allowing the user to encode data in the name of the remote ref. The command git push gerrit HEAD:refs/for/master would open a pull request against master on Gerrit rather than writing a new ref.

Another example of metadata hacking is the practice of adding unique IDs to commits. Maintaining association between a proposed code change and a specific commit can be hard because a git commit ID can change between revisions. In response, both Gerrit and another code review tool, Phabricator, leverage the commit message as a metadata store. Using a commit hook or alternative source control CLI, they add a unique ID to each commit message which proves stable across rebases and amendments.

The open-source CLI I work on, Graphite, has a different form of metadata it needs to track. To create stacks of branches, the tool needs to map branches to their parents. Storing a reference to the name of a parent branch in a commit message wouldn't work because no one commit is stable over the life of a branch.
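To make the branch-to-parent mapping concrete, here is a minimal sketch of how a stack could be resolved once each branch's parent is known. The `resolveStack` function, the `Map` shape, and the `parentBranchName` field are hypothetical names chosen for illustration, not taken from Graphite's source:

```javascript
// Walks parent pointers from `branchName` down to the trunk.
// `metadataByBranch` is a hypothetical Map of branch name -> { parentBranchName }.
function resolveStack(branchName, metadataByBranch) {
  const stack = [branchName];
  const seen = new Set(stack);
  let parent = metadataByBranch.get(branchName)?.parentBranchName;
  while (parent !== undefined) {
    // A branch that is (indirectly) its own parent would loop forever.
    if (seen.has(parent)) {
      throw new Error(`Parent cycle detected at "${parent}"`);
    }
    seen.add(parent);
    stack.push(parent);
    parent = metadataByBranch.get(parent)?.parentBranchName;
  }
  return stack;
}

// Example data (assumed): three stacked branches on top of main.
const exampleMetadata = new Map([
  ["feature-c", { parentBranchName: "feature-b" }],
  ["feature-b", { parentBranchName: "feature-a" }],
  ["feature-a", { parentBranchName: "main" }],
]);
console.log(resolveStack("feature-c", exampleMetadata).join(" -> "));
// prints "feature-c -> feature-b -> feature-a -> main"
```

Because the mapping is keyed on branch names rather than commit IDs, it survives rebases and amendments of the commits inside each branch.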

After investigating various mechanisms, we landed on using Git's object database directly to store branch metadata. The command [git hash-object](https://git-scm.com/docs/git-hash-object), run with the -w flag, writes any string to Git's object database and returns an ID. A second command, [git update-ref](https://git-scm.com/docs/git-update-ref), allows you to create or update any ref to point to the stored object by its ID. Used together, they gave us a dead simple mechanism for storing JSON blobs in Git's native database:

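const { execSync } = require("child_process");

// Write the JSON blob into Git's object database; git prints the new object's ID.
const objectId = execSync(`git hash-object -w --stdin`, {
  input: JSON.stringify(metadata),
})
  .toString()
  .trim(); // drop the trailing newline from git's output

// Point a ref named after the branch at the stored blob.
execSync(`git update-ref refs/branch-metadata/${branchName} ${objectId}`, {
  stdio: "ignore",
});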
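The write path above interpolates the branch name into a shell command. A minimal sketch of the same two steps without a shell, assuming Node's `execFileSync`; the helper names `metadataRef` and `writeBranchMetadata`, the `cwd` parameter, and the `parentBranchName` field in the usage note are hypothetical and not taken from Graphite's source:

```javascript
const { execFileSync } = require("child_process");

// Hypothetical helper: builds the ref name that acts as the "key".
function metadataRef(branchName) {
  return `refs/branch-metadata/${branchName}`;
}

// Stores `metadata` as a JSON blob and points the branch's metadata ref at it.
// `cwd` (an assumption of this sketch) selects the repository to operate on.
function writeBranchMetadata(branchName, metadata, cwd = process.cwd()) {
  // Arguments are passed as an array, so no shell ever parses the branch name.
  const objectId = execFileSync("git", ["hash-object", "-w", "--stdin"], {
    cwd,
    input: JSON.stringify(metadata),
  })
    .toString()
    .trim(); // git prints the ID followed by a newline

  execFileSync("git", ["update-ref", metadataRef(branchName), objectId], {
    cwd,
    stdio: "ignore",
  });
  return objectId;
}
```

Calling `writeBranchMetadata("feature/a", { parentBranchName: "main" })` inside a repository would return the blob's object ID and leave `refs/branch-metadata/feature/a` pointing at it.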

Storing data is of no use if we can't read it back. Luckily, the read operation is even easier using [git cat-file](https://git-scm.com/docs/git-cat-file):

// Print the blob that the branch's metadata ref points to (the stored JSON string).
const metadata = execSync(
  `git cat-file -p refs/branch-metadata/${branchName} 2> /dev/null`
).toString();
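One detail worth handling: `execSync` throws when the command exits with a non-zero status, which is what happens when no metadata ref exists for the branch. A minimal sketch of a read that treats that case as a missing key; the function name `readBranchMetadata` and the `cwd` parameter are hypothetical:

```javascript
const { execFileSync } = require("child_process");

// Returns the parsed metadata for `branchName`, or undefined if nothing is stored.
// `cwd` (an assumption of this sketch) selects the repository to operate on.
function readBranchMetadata(branchName, cwd = process.cwd()) {
  try {
    const raw = execFileSync(
      "git",
      ["cat-file", "-p", `refs/branch-metadata/${branchName}`],
      { cwd, stdio: ["ignore", "pipe", "ignore"] } // discard git's stderr
    ).toString();
    return JSON.parse(raw);
  } catch (err) {
    // Either git failed (for example, the ref does not exist)
    // or the blob was not valid JSON; both count as "no metadata" here.
    return undefined;
  }
}
```

Swallowing every error keeps the sketch short; a real tool would probably want to distinguish a missing ref from a corrupted blob.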

With these two code blocks, we have everything necessary to write any data to Git's object database and read it back. The advantages of this approach are plentiful:

- The metadata refs are plainly visible to users by running ls .git/refs/branch-metadata.
- The data can be inspected, modified, and removed using native Git commands.
- The refs can be pushed to and pulled from remote repositories, allowing easy syncing.
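The native commands behind these advantages can be wrapped the same way as the read and write steps. A sketch under a few assumptions: the function names are hypothetical, and the remote name passed to `metadataPushArgs` (for example `origin`) is whatever the repository happens to use. `git for-each-ref` is used for listing because it also sees refs that Git has packed into a single file, which a plain directory listing would miss:

```javascript
const { execFileSync } = require("child_process");

const PREFIX = "refs/branch-metadata/";

// Lists every stored key (branch name) by asking git for refs under the prefix.
function listMetadataKeys(cwd = process.cwd()) {
  const out = execFileSync(
    "git",
    ["for-each-ref", "--format=%(refname)", "refs/branch-metadata"],
    { cwd }
  ).toString();
  return out
    .split("\n")
    .filter((line) => line.startsWith(PREFIX))
    .map((line) => line.slice(PREFIX.length));
}

// Removes a key by deleting its ref; the blob itself is left to git's garbage collection.
function deleteMetadata(branchName, cwd = process.cwd()) {
  execFileSync("git", ["update-ref", "-d", PREFIX + branchName], { cwd });
}

// Builds the arguments for pushing all metadata refs to `remote` with a wildcard refspec.
function metadataPushArgs(remote) {
  return ["push", remote, `${PREFIX}*:${PREFIX}*`];
}
```

Passing the result of `metadataPushArgs("origin")` to `execFileSync("git", ...)` would push every metadata ref; the same refspec with `fetch` in place of `push` pulls them down.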

Graphite simply stores small JSON blobs keyed on branch names, but this approach could be used to store any data under any key that forms a valid ref name, while remaining accessible to a tool as common as Git. For example, Graphite has already started caching open PR statuses through git hash-object. By asynchronously fetching and storing PR information, Graphite is able to print elegant log outputs like:

․  ◯ gf--fix_cycles_disallow_meta_parent_cycl PR #238 (Approved)
․  │ fix(cycles): disallow meta parent cycles
․  │ 68 minutes ago
․  │ * f0e3e7 - fix(cycles): disallow meta parent cycles
․  │
◌──┘
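Branch names are already valid ref names, so Graphite's keys need no special treatment. For other kinds of keys, Git's ref naming rules get in the way: ref names cannot contain spaces, `~`, `^`, `:`, or `..`, among other restrictions. One possible workaround, sketched here with a hypothetical `refs/kv` root and hypothetical function names, is to hex-encode the key so that any non-empty string maps to a legal ref:

```javascript
// Hypothetical namespace root for generic key-value data; not a path Graphite uses.
const KV_ROOT = "refs/kv";

// Maps an arbitrary non-empty string key to a ref name made of safe characters only.
function keyToRef(namespace, key) {
  if (!/^[a-z0-9][a-z0-9-]*$/.test(namespace)) {
    throw new Error("namespace must match [a-z0-9][a-z0-9-]*");
  }
  if (key.length === 0) {
    throw new Error("key must not be empty");
  }
  // Hex output contains only 0-9 and a-f, which are always legal in a ref name.
  const encoded = Buffer.from(key, "utf8").toString("hex");
  return `${KV_ROOT}/${namespace}/${encoded}`;
}

// Recovers the original key from a ref produced by keyToRef.
function refToKey(ref) {
  const encoded = ref.slice(ref.lastIndexOf("/") + 1);
  return Buffer.from(encoded, "hex").toString("utf8");
}
```

The trade-off is readability: hex-encoded refs are no longer self-explanatory when listed, so this only makes sense when keys cannot be constrained to ref-safe names in the first place.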

You can read Graphite's full implementation of metadata handling here.
