Uthman Oladele

Posted on Nov 9

Building Git from Scratch in Go: What I Learned About Version Control Internals

#cli #git #go #programming

I didn't understand Git until I broke it open and rebuilt it from scratch. No libraries, no shortcuts - just SHA-256 hashing, tree structures, and commit graphs.

I built a Git implementation in Go without using any Git libraries. No magic, just content-addressable storage and the object model. It works, it taught me more about Git in a week than years of using it, and here's what I learned.

How I Learned This

This wasn't from a single source. I pieced it together from:

CodeCrafters - Their "Build Your Own Git" challenge gave me the structure and pushed me to actually implement things
Random YouTube videos - Honestly, just searching "how git works internally" and watching whatever came up. Some were helpful, most weren't
"Building Git" book - Wasn't really helpful for what I needed, but it did clarify some object format details

Most of the learning came from trial and error. Breaking things, reading error messages, and debugging for hours.

Why Build This?

I use Git every day but had no idea how it actually works. Just git add, git commit, and hope nothing breaks. I wanted to understand what's really happening under the hood.

The goal wasn't to make something production-ready. I just wanted answers:

Why are commits so cheap?
How does Git deduplicate files automatically?
What the hell is a "tree object"?
Why is branching fast?

Turns out the best way to understand something is to build it yourself.

What It Does

Go-Git implements the core stuff:

go-git init                    # Initialize repository
go-git config                  # Set user identity
go-git add <files...>          # Stage files
go-git commit -m "message"     # Create commit
go-git log                     # View history

It handles content-addressable storage, the staging area, tree objects, commit history, and zlib compression. Basically the core of what Git does to snapshot your code, minus branches, merges, remotes, and the other gaps listed under What's Missing.

The Three Main Ideas

1. Content-Addressable Storage

Every object (file, directory, commit) gets stored by its SHA-256 hash. (Real Git defaults to SHA-1 and offers SHA-256 as an opt-in object format; I went with SHA-256.)

.git/objects/ab/c123def456...
            ↑↑  ↑↑↑↑↑↑↑
            │   └─ Rest of hash (filename)
            └───── First 2 chars (subdirectory)

This is actually genius. Same content = same hash = automatic deduplication. You could store the same README.md across 100 commits and it only takes up disk space once.

Your file's hash IS its address. No need for a separate indexing system.
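A minimal sketch of the idea in Go, not the project's actual code. The names `hashObject` and `objectPath` are made up for illustration, and the `<type> <size>\0<content>` framing is the object format described under How Objects Work:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"path/filepath"
)

// hashObject returns the hex SHA-256 of "<type> <size>\0<content>".
// (Illustrative helper; real Git defaults to SHA-1.)
func hashObject(objType string, content []byte) string {
	header := fmt.Sprintf("%s %d\x00", objType, len(content))
	sum := sha256.Sum256(append([]byte(header), content...))
	return hex.EncodeToString(sum[:])
}

// objectPath maps a hash to its file: the first 2 hex chars name the
// subdirectory, the rest is the filename.
func objectPath(gitDir, hash string) string {
	return filepath.Join(gitDir, "objects", hash[:2], hash[2:])
}

func main() {
	a := hashObject("blob", []byte("# My Project\n"))
	b := hashObject("blob", []byte("# My Project\n"))
	c := hashObject("blob", []byte("# My Project!\n"))

	fmt.Println(a == b) // true: same content, same address, stored once
	fmt.Println(a == c) // false: one extra byte gives a different address
	fmt.Println(objectPath(".git", a))
}
```

Because the path is derived purely from the content, writing the same file twice lands on the same path, which is where the deduplication comes from.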

2. The Three Trees

Git tracks files through three layers:

Working Directory  →  Staging Area  →  Repository
   (your files)        (.git/index)     (.git/objects)

go-git add moves files from working directory → staging area
go-git commit snapshots staging area → repository

In my implementation, the staging area is literally just a text file mapping paths to hashes (real Git uses a binary format):

100644 abc123... README.md
100644 def456... src/main.go

When you commit, Git turns these entries into tree objects, one per directory, and the commit points at the root tree.
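Reading that text index back is a few lines of Go. This is a sketch under the assumption that each line is `<mode> <hash> <path>` as shown above; `IndexEntry` and `parseIndex` are hypothetical names, not taken from the repository:

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// IndexEntry is one staged file (illustrative type).
type IndexEntry struct {
	Mode string
	Hash string
	Path string
}

// parseIndex reads "<mode> <hash> <path>" lines. Splitting into at most
// three parts keeps paths that contain spaces intact.
func parseIndex(text string) ([]IndexEntry, error) {
	var entries []IndexEntry
	sc := bufio.NewScanner(strings.NewReader(text))
	for sc.Scan() {
		line := sc.Text()
		if line == "" {
			continue
		}
		parts := strings.SplitN(line, " ", 3)
		if len(parts) != 3 {
			return nil, fmt.Errorf("malformed index line: %q", line)
		}
		entries = append(entries, IndexEntry{Mode: parts[0], Hash: parts[1], Path: parts[2]})
	}
	return entries, sc.Err()
}

func main() {
	// Shortened placeholder hashes, as in the example above.
	index := "100644 abc123 README.md\n100644 def456 src/main.go\n"
	entries, err := parseIndex(index)
	if err != nil {
		panic(err)
	}
	for _, e := range entries {
		fmt.Printf("%s -> %s\n", e.Path, e.Hash)
	}
}
```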

3. Tree Objects (The Hard Part)

Files get stored as blobs. Directories get stored as trees.

Simple project structure:

project/
  README.md
  src/
    main.go
    lib/
      helper.go

Git stores it like this:

Commit (c0ffee)
    ↓
Root Tree (def456)
├─ blob: README.md (hash: abc123)
└─ tree: src/ (hash: 789abc)
      ├─ blob: main.go (hash: ghi789)
      └─ tree: lib/ (hash: jkl012)
            └─ blob: helper.go (hash: mno345)

Here's what finally clicked for me: Trees don't contain their children - they just reference them by hash. That's why Git is fast. If a directory doesn't change, same hash, just reuse the tree. No need to re-store anything.

The tricky part: You have to build trees bottom-up (deepest first) because parent trees need their children's hashes.

You can't hash src/ until you know the hash of src/lib/. Can't hash the root tree until you know the hash of src/. The order matters, period.

This took me over 5 hours to get right.

What Was Hard

Tree Building Order

First try: Build trees top-down, starting from root. Immediately failed - you don't have the child tree hashes yet.

Fix: Sort directories by depth, build the deepest ones first:

sort.Slice(dirs, func(i, j int) bool {
    return strings.Count(dirs[i], "/") > strings.Count(dirs[j], "/")
})

Then for each directory, check if any trees you've already built are its children and add them.
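Put together, the deepest-first pass might look like the sketch below. It is not the repository's code: `entry`, `hashTree`, `depth`, and `buildTrees` are illustrative names, paths are assumed to be relative and slash-separated, every file gets mode 100644, and entries are sorted by plain name (real Git has an extra ordering rule for directories).

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"path"
	"sort"
	"strings"
)

// entry is one line of a tree: a file (blob) or a subdirectory (tree).
type entry struct {
	mode, name, hash string // hash is hex-encoded
}

// hashTree serializes entries in name order and returns the tree's hex hash.
func hashTree(entries []entry) (string, error) {
	sort.Slice(entries, func(i, j int) bool { return entries[i].name < entries[j].name })
	var body bytes.Buffer
	for _, e := range entries {
		raw, err := hex.DecodeString(e.hash)
		if err != nil {
			return "", fmt.Errorf("entry %q: %w", e.name, err)
		}
		fmt.Fprintf(&body, "%s %s\x00", e.mode, e.name)
		body.Write(raw)
	}
	header := fmt.Sprintf("tree %d\x00", body.Len())
	sum := sha256.Sum256(append([]byte(header), body.Bytes()...))
	return hex.EncodeToString(sum[:]), nil
}

// depth is 0 for the root ("."), 1 for "src", 2 for "src/lib", and so on.
func depth(dir string) int {
	if dir == "." {
		return 0
	}
	return strings.Count(dir, "/") + 1
}

// buildTrees maps file path -> blob hash to directory -> tree hash.
// The root directory is keyed as ".".
func buildTrees(files map[string]string) (map[string]string, error) {
	children := map[string][]entry{".": nil}
	for p, h := range files {
		dir := path.Dir(p)
		children[dir] = append(children[dir], entry{"100644", path.Base(p), h})
		// Register every ancestor so directories holding only
		// subdirectories still get a tree.
		for d := dir; d != "." && d != "/"; d = path.Dir(d) {
			children[d] = children[d]
		}
	}
	dirs := make([]string, 0, len(children))
	for d := range children {
		dirs = append(dirs, d)
	}
	// Deepest first, so a child's hash exists before its parent needs it.
	sort.Slice(dirs, func(i, j int) bool { return depth(dirs[i]) > depth(dirs[j]) })

	trees := make(map[string]string, len(dirs))
	for _, d := range dirs {
		h, err := hashTree(children[d])
		if err != nil {
			return nil, err
		}
		trees[d] = h
		if d != "." {
			parent := path.Dir(d)
			children[parent] = append(children[parent], entry{"040000", path.Base(d), h})
		}
	}
	return trees, nil
}

func main() {
	blob := func(s string) string { // stand-in for real blob hashes
		sum := sha256.Sum256([]byte(s))
		return hex.EncodeToString(sum[:])
	}
	trees, err := buildTrees(map[string]string{
		"README.md":         blob("readme"),
		"src/main.go":       blob("main"),
		"src/lib/helper.go": blob("helper"),
	})
	if err != nil {
		panic(err)
	}
	for _, d := range []string{"src/lib", "src", "."} {
		fmt.Printf("%-8s %s\n", d, trees[d][:12])
	}
}
```

Changing only README.md changes the root tree's hash but leaves the `src` and `src/lib` hashes untouched, which is the reuse described earlier.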

Binary vs Hex Encoding

Tree objects store hashes as 32 binary bytes, not 64-character hex strings.

My bug:

content += entry.BlobHash  // Wrong! This is a 64-char hex string

The fix:

hashBytes, _ := hex.DecodeString(entry.BlobHash)
content = append(content, hashBytes...)  // 32 binary bytes

This made the hash part of every tree entry twice as big as it should've been (64 bytes instead of 32). I spent an hour debugging because it was subtle: objects were still readable, but tree traversal was completely broken.
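To make the size difference concrete, here is a small sketch that encodes one tree entry the fixed way and compares it with the buggy hex version. `encodeEntry` is an illustrative helper, not the project's function:

```go
package main

import (
	"bytes"
	"encoding/hex"
	"fmt"
	"strings"
)

// encodeEntry returns "<mode> <name>\0" followed by the raw hash bytes.
func encodeEntry(mode, name, hexHash string) ([]byte, error) {
	raw, err := hex.DecodeString(hexHash) // 64 hex chars -> 32 bytes
	if err != nil {
		return nil, fmt.Errorf("entry %q: bad hash: %w", name, err)
	}
	var buf bytes.Buffer
	buf.WriteString(mode)
	buf.WriteByte(' ')
	buf.WriteString(name)
	buf.WriteByte(0)
	buf.Write(raw)
	return buf.Bytes(), nil
}

func main() {
	hexHash := strings.Repeat("ab", 32) // placeholder 64-char hex hash

	good, err := encodeEntry("100644", "README.md", hexHash)
	if err != nil {
		panic(err)
	}
	// The buggy version appended the hex string itself.
	buggy := []byte("100644 README.md\x00" + hexHash)

	fmt.Println(len(good))  // 49 = 17-byte "mode name\0" prefix + 32 hash bytes
	fmt.Println(len(buggy)) // 81 = 17-byte prefix + 64 hex characters
}
```

A parser that expects 32 bytes after each NUL will read the first half of the hex string as the hash and then treat the second half as the start of the next entry, which is why traversal falls apart.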

Excluding .git/ When Staging

When you do go-git add ., you don't want to accidentally add .git/objects/ to the index.

First attempt: Check if the path contains .git. Problem: this also excluded files like my.git.file.

The right way:

if d.Name() == ".git" && d.IsDir() {
    return filepath.SkipDir
}

filepath.SkipDir during directory traversal is the way to go.
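For context, that check lives inside a `filepath.WalkDir` callback. A self-contained sketch of the surrounding walk, where `listFiles` is a hypothetical helper rather than the project's own function:

```go
package main

import (
	"fmt"
	"io/fs"
	"path/filepath"
)

// listFiles returns every non-directory entry under root, relative to
// root, skipping any directory named exactly ".git".
func listFiles(root string) ([]string, error) {
	var files []string
	err := filepath.WalkDir(root, func(p string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if d.IsDir() {
			if d.Name() == ".git" {
				return filepath.SkipDir // don't descend into the repository data
			}
			return nil
		}
		rel, err := filepath.Rel(root, p)
		if err != nil {
			return err
		}
		files = append(files, filepath.ToSlash(rel))
		return nil
	})
	return files, err
}

func main() {
	files, err := listFiles(".")
	if err != nil {
		panic(err)
	}
	for _, f := range files {
		fmt.Println(f)
	}
}
```

Comparing `d.Name()` against the exact string means a file like my.git.file is still listed, because only a directory whose name is precisely `.git` is skipped.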

How Objects Work

Every object follows this format:

<type> <size>\0<content>

Each object is compressed with zlib and stored at .git/objects/<hash[:2]>/<hash[2:]>.

Blob:

blob 13\0Hello, World!

Hash it → a0b1c2d3... → Store at .git/objects/a0/b1c2d3...
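A round trip through that format can be sketched as follows. The helper names `encodeObject` and `decodeObject` are assumptions for illustration, and the hash is taken over the uncompressed header plus content:

```go
package main

import (
	"bytes"
	"compress/zlib"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
)

// encodeObject builds "<type> <size>\0<content>", hashes it, and returns
// the hex hash together with the zlib-compressed bytes to write to disk.
func encodeObject(objType string, content []byte) (string, []byte, error) {
	raw := append([]byte(fmt.Sprintf("%s %d\x00", objType, len(content))), content...)
	sum := sha256.Sum256(raw)

	var buf bytes.Buffer
	zw := zlib.NewWriter(&buf)
	if _, err := zw.Write(raw); err != nil {
		return "", nil, err
	}
	if err := zw.Close(); err != nil { // Close flushes the remaining data
		return "", nil, err
	}
	return hex.EncodeToString(sum[:]), buf.Bytes(), nil
}

// decodeObject inflates an object and splits the header from the content.
func decodeObject(compressed []byte) (string, []byte, error) {
	zr, err := zlib.NewReader(bytes.NewReader(compressed))
	if err != nil {
		return "", nil, err
	}
	defer zr.Close()
	raw, err := io.ReadAll(zr)
	if err != nil {
		return "", nil, err
	}
	nul := bytes.IndexByte(raw, 0)
	if nul < 0 {
		return "", nil, fmt.Errorf("object has no NUL after its header")
	}
	var objType string
	var size int
	if _, err := fmt.Sscanf(string(raw[:nul]), "%s %d", &objType, &size); err != nil {
		return "", nil, fmt.Errorf("bad header: %w", err)
	}
	content := raw[nul+1:]
	if size != len(content) {
		return "", nil, fmt.Errorf("size mismatch: header says %d, got %d", size, len(content))
	}
	return objType, content, nil
}

func main() {
	hash, compressed, err := encodeObject("blob", []byte("Hello, World!"))
	if err != nil {
		panic(err)
	}
	fmt.Println(hash[:2], hash[2:10], len(compressed))

	objType, content, err := decodeObject(compressed)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%s %q\n", objType, content) // blob "Hello, World!"
}
```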

Tree:

tree 92\0100644 README.md\0<32-byte-hash>040000 src\0<32-byte-hash>

Modes:

100644 = regular file
040000 = directory (tree object); real Git writes this mode as 40000, without the leading zero
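Reading a tree back means walking that byte layout entry by entry. A rough sketch, assuming 32-byte SHA-256 hashes as in this project (real Git's SHA-1 trees use 20); `TreeEntry` and `parseTree` are illustrative names:

```go
package main

import (
	"bytes"
	"encoding/hex"
	"fmt"
)

const hashLen = 32 // raw SHA-256 size in bytes

// TreeEntry is one decoded tree line (illustrative type).
type TreeEntry struct {
	Mode string
	Name string
	Hash string // hex-encoded for display
}

// parseTree decodes a tree body (the bytes after "tree <size>\0"):
// repeated "<mode> <name>\0" records, each followed by hashLen raw bytes.
func parseTree(body []byte) ([]TreeEntry, error) {
	var entries []TreeEntry
	for len(body) > 0 {
		sp := bytes.IndexByte(body, ' ')
		if sp < 0 {
			return nil, fmt.Errorf("entry has no space after mode")
		}
		nul := bytes.IndexByte(body, 0)
		if nul < sp {
			return nil, fmt.Errorf("entry has no NUL after name")
		}
		end := nul + 1 + hashLen
		if end > len(body) {
			return nil, fmt.Errorf("entry %q: truncated hash", body[sp+1:nul])
		}
		entries = append(entries, TreeEntry{
			Mode: string(body[:sp]),
			Name: string(body[sp+1 : nul]),
			Hash: hex.EncodeToString(body[nul+1 : end]),
		})
		body = body[end:] // advance past the binary hash to the next entry
	}
	return entries, nil
}

func main() {
	placeholder := bytes.Repeat([]byte{0xab}, hashLen) // fake hash bytes
	var body bytes.Buffer
	body.WriteString("100644 README.md\x00")
	body.Write(placeholder)
	body.WriteString("040000 src\x00")
	body.Write(placeholder)

	entries, err := parseTree(body.Bytes())
	if err != nil {
		panic(err)
	}
	for _, e := range entries {
		fmt.Printf("%s %s %s...\n", e.Mode, e.Name, e.Hash[:8])
	}
}
```

Because the hash is a fixed number of raw bytes rather than delimited text, the parser has to skip it by length; searching for the next space or NUL inside hash bytes would give wrong results.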

Commit:

commit <size>\0
tree abc123...
parent 789xyz...
author Uthman <email> timestamp
committer Uthman <email> timestamp

Initial commit

Commits form a directed acyclic graph. go-git log just walks backwards through this chain from HEAD.
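The walk itself is short. In this sketch an in-memory map stands in for reading and inflating files under .git/objects, the hashes are fake, and only a single parent per commit is followed (which matches a tool without merges). `Commit`, `parseCommit`, and `walkLog` are hypothetical names:

```go
package main

import (
	"fmt"
	"strings"
)

// Commit holds the fields the log needs (illustrative type).
type Commit struct {
	Hash, Tree, Parent, Author, Message string
}

// parseCommit reads the "key value" header lines, then the message that
// follows the first blank line.
func parseCommit(hash, body string) Commit {
	head, msg, _ := strings.Cut(body, "\n\n")
	c := Commit{Hash: hash, Message: strings.TrimSpace(msg)}
	for _, line := range strings.Split(head, "\n") {
		key, val, _ := strings.Cut(line, " ")
		switch key {
		case "tree":
			c.Tree = val
		case "parent":
			c.Parent = val
		case "author":
			c.Author = val
		}
	}
	return c
}

// walkLog follows parent links from head until it reaches a commit with
// no parent (the initial commit).
func walkLog(store map[string]string, head string) ([]Commit, error) {
	var log []Commit
	for hash := head; hash != ""; {
		body, ok := store[hash]
		if !ok {
			return nil, fmt.Errorf("missing commit object %s", hash)
		}
		c := parseCommit(hash, body)
		log = append(log, c)
		hash = c.Parent
	}
	return log, nil
}

func main() {
	// Fake hashes and bodies, keyed the way an object store would be.
	store := map[string]string{
		"c1": "tree t1\nauthor Uthman <email> 0\n\nInitial commit\n",
		"c2": "tree t2\nparent c1\nauthor Uthman <email> 1\n\nAdd helper\n",
	}
	log, err := walkLog(store, "c2")
	if err != nil {
		panic(err)
	}
	for _, c := range log {
		fmt.Printf("%s %s\n", c.Hash, c.Message)
	}
}
```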

What I Actually Learned

Content-addressable storage is elegant. Hash = address. This one idea enables deduplication, integrity checking, and efficient storage.

Trees are graphs, not nested structures. A tree object doesn't "contain" subtrees - it references them by hash. This indirection is what makes Git efficient. Multiple commits can share the same tree if a directory didn't change.

Building bottom-up is necessary. You can't hash a parent without knowing its children's hashes. The order matters fundamentally.

Compression matters. Without zlib, .git/objects/ would be roughly 3-4x larger for typical text files. Git uses compression everywhere for a reason.

Binary formats are tricky. Working with \0 null bytes and binary hash data requires careful handling. Text formats would be easier but less efficient.

What's Missing

This is a learning project, not production software:

No branches (only main exists)
No merge operations
No diff/status commands
No remote operations (push/pull/fetch)
No .gitignore support
No packed objects (each object is a separate file)
Plain text index (real Git uses binary format)

These limitations exist because this project focuses on Git's core: the object model, staging, and commits. Adding branches/merging/remotes would be another 2-3x the code and shift focus from fundamentals to features.

Try It Yourself

git clone https://github.com/codetesla51/go-git.git
cd go-git
./install.sh

# Or build manually:
go build -buildvcs=false -o go-git
ln -s "$(pwd)/go-git" ~/.local/bin/go-git

Then:

mkdir my-project
cd my-project
go-git init
go-git config

echo "Hello World" > README.md
go-git add README.md
go-git commit -m "Initial commit"
go-git log

Want to really learn? Clone it, break something on purpose (change the hash function to MD5, remove zlib compression), and see what fails. Watch how Git's assumptions about immutability and content-addressing cascade through the system. That's how you really understand it.

Final Thoughts

Building this taught me more about Git in a week than years of using it did. I now understand why commits are cheap (just pointers to trees), how deduplication works (content-addressable storage), and why branching is fast (just moving a pointer).

If you want to truly understand a tool, build it yourself.

Built with Go, no Git libraries used. All hashing, compression, and object storage is a custom implementation.

Check out more of my work at devuthman.vercel.app | GitHub
