Nayan Pahuja

Posted on Sep 16, 2023

Version Control Systems and Git Part-2

#git #codenewbie #discuss #learning

In the last post we talked briefly about how Git takes snapshots of the directories we are working in, and how the commits we make are stored as a few types of data objects rather than in a strictly linear fashion. We captured the essence of what Git does, but we were left with little to no information about commits, the SHA-1 hashing that Git uses (it is a hash function, not encryption), and Git commands. So let's get right back on track. In case you didn't read the first post, I would recommend reading it first, as this one is a continuation.

.git folder

Attempting to create a good distributed version control system within a mere week is an impossible task. However, what one can successfully design in that time is a fundamental data model.

So, what precisely does Git's data model entail? To explore this topic, we must look inside the enigmatic .git folder.

Your .git folder contains a folder called objects, which holds Git's objects. Each object is either a commit, a tree, a blob, or a tag. They're compressed with zlib, but we can extract and examine them easily.
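To make the zlib point concrete, here is a minimal sketch of unpacking one loose object. It assumes the usual loose-object layout, where the decompressed bytes are a header of the form "type size", a NUL byte, and then the content. The function name parse_loose_object is made up for this example, and the demo compresses some bytes in memory instead of reading a real file from .git/objects.

```python
import zlib

def parse_loose_object(compressed: bytes) -> tuple[str, bytes]:
    """Decompress a loose object and split it into (type, content)."""
    raw = zlib.decompress(compressed)
    # Header and content are separated by the first NUL byte
    header, _, content = raw.partition(b"\x00")
    obj_type, size = header.decode("ascii").split(" ")
    if int(size) != len(content):
        raise ValueError("size in header does not match content length")
    return obj_type, content

# Stand-in for the bytes you would read from a file under .git/objects/
demo = zlib.compress(b"blob 15\x00This Side Nayan")
print(parse_loose_object(demo))
```

In a real repository you would read the compressed bytes from the object's file instead of building them by hand. Objects that Git has moved into packfiles are stored differently and are not covered by this sketch.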

Git Data Model as pseudo code

Last time we mentioned terms such as blob (binary large object), commit, snapshot, tree and more.
Let's expand a little more on how Git models this data internally.

// a blob (binary large object) is a file, i.e. just a bunch of bytes
type blob = array<byte>

// a tree (folder/directory) maps names to files and other directories
type tree = map<string, tree | blob>

// a commit holds its parents (the history), its metadata (author and message),
// and the top-level tree, i.e. the current snapshot
type commit = struct {
    parents: array<commit>
    author: string
    message: string
    snapshot: tree
}

This is basically how git's data is modelled.
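If you prefer something you can run, the same model can be sketched with Python dataclasses. This is only a translation of the pseudo code above, not how Git is implemented. The class names and the sample author and messages are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union

Blob = bytes  # a file is just a sequence of bytes

@dataclass
class Tree:
    # maps a file or directory name to its contents
    entries: Dict[str, Union["Tree", Blob]] = field(default_factory=dict)

@dataclass
class Commit:
    parents: List["Commit"]
    author: str
    message: str
    snapshot: Tree

# A snapshot with one file and one (empty) sub-directory
root = Tree({"index.txt": b"This Side Nayan", "blogExample": Tree()})
first = Commit(parents=[], author="Nayan", message="initial commit", snapshot=root)
second = Commit(parents=[first], author="Nayan", message="second commit", snapshot=root)

print(second.parents[0].message)  # initial commit
```

Note that both commits share the same Tree here. That mirrors how an unchanged directory does not need to be stored twice.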

Git Objects and content-definitions

At some point this model actually has to be turned into data on disk.

Git defines an object as any one of the types above: a blob, a tree, or a commit.

type object = blob | tree | commit

But what Git actually maintains on disk is not just the objects themselves: it is a map from each object's hash to the object.

In Git, all data objects are content-addressed by their SHA-1 hash.

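import hashlib

objects = {}  # maps a SHA-1 id (string) to the object's bytes

def store(obj):
    # the id is derived from the content itself
    obj_id = hashlib.sha1(obj).hexdigest()
    objects[obj_id] = obj
    return obj_id

def load(obj_id):
    return objects[obj_id]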

In case you don't know what SHA-1 or hashing is, you can think of it as taking an object's content, however large, and deriving a short fixed-length string from it, which then serves as the object's address. This lets us refer to large amounts of data by small strings.
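One detail the simplified model skips: Git does not hash the bare file content. For a blob it hashes a short header ("blob", the content length, and a NUL byte) followed by the content. Here is a small sketch of that. The function name git_blob_id is made up for this example, and the result should match what git hash-object prints for the same bytes.

```python
import hashlib

def git_blob_id(content: bytes) -> str:
    """Compute the id Git assigns to a blob with the given content."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\x00"
    return hashlib.sha1(header + content).hexdigest()

# The id of the empty blob is the same in every repository
print(git_blob_id(b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

Because the id depends only on the content, two files with the same bytes always end up as the same blob, no matter what they are called.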

In practice, commits don't actually contain the objects or snapshots they refer to; they hold pointers to those objects, in the form of hashes.

Blobs, trees, and commits are unified in this way: they are all objects. When they reference other objects, they don’t actually contain them in their on-disk representation, but have a reference to them by their hash.
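Here is a rough sketch of what "reference by hash" means. A tree stores the id of a blob, a commit stores the id of a tree, and reading a file means following those ids through the object store. JSON is used as the serialization purely for illustration (an assumption of this example). Git has its own formats.

```python
import hashlib
import json

objects = {}  # id -> serialized bytes

def store(kind: str, payload) -> str:
    # JSON is used here only for illustration; Git has its own binary formats
    data = json.dumps({"kind": kind, "payload": payload}, sort_keys=True).encode()
    obj_id = hashlib.sha1(data).hexdigest()
    objects[obj_id] = data
    return obj_id

def load(obj_id: str):
    return json.loads(objects[obj_id])

blob_id = store("blob", "This Side Nayan")
tree_id = store("tree", {"index.txt": blob_id})  # holds the blob's id, not its bytes
commit_id = store("commit", {"snapshot": tree_id, "parents": [], "message": "first"})

# Walk from the commit down to the file by following hashes
tree = load(load(commit_id)["payload"]["snapshot"])
print(load(tree["payload"]["index.txt"])["payload"])  # This Side Nayan
```

Storing the same content twice gives the same id, so nothing is duplicated in the store.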

In the realm of cryptography, SHA-1 (Secure Hash Algorithm 1) serves as a hash function. It operates by taking an input and generating a 160-bit (equivalent to 20 bytes) hash value, commonly represented as a 40-character string composed of hexadecimal digits.
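You can check those numbers yourself with Python's standard hashlib module. The input string below is arbitrary.

```python
import hashlib

digest = hashlib.sha1(b"any input at all")

print(digest.digest_size)        # 20 (bytes)
print(digest.digest_size * 8)    # 160 (bits)
print(len(digest.hexdigest()))   # 40 (hex characters)
```

The output size stays the same whether the input is one byte or a whole file.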

Fun Fact:

Accidentally generating two identical SHA-1 hashes for different content is an exceptionally rare occurrence, close to impossible in practice. To put this into perspective, imagine having to generate more hash codes than there are grains of sand on all the beaches across the entire Earth before the likelihood of encountering two identical hashes even becomes a consideration. (Deliberately constructed collisions are a different matter: researchers demonstrated one in 2017.)
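To get a feel for the numbers, here is a back-of-the-envelope sketch using the standard birthday approximation. It assumes hashes behave like uniformly random 160-bit values, which is an idealization, and it says nothing about deliberately crafted collisions.

```python
import math

BITS = 160           # size of a SHA-1 digest
space = 2 ** BITS    # number of possible hash values

def collision_probability(n: int) -> float:
    """Approximate chance that n random hashes contain at least one duplicate."""
    return -math.expm1(-n * (n - 1) / (2 * space))

# Number of random hashes needed for roughly a 50% chance of a collision
n_half = math.sqrt(2 * math.log(2) * space)

print(f"{n_half:.2e}")
print(collision_probability(10 ** 12))
```

Under these assumptions you need on the order of 10**24 hashes before an accidental collision becomes about as likely as a coin flip. Even a trillion objects leaves the probability vanishingly small.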

We can also print the contents of any object with the command git cat-file -p <hashId>, for example:

 git cat-file -p 4448adbf7ecd394f42ae135bbeed9676e894af85

The tree itself contains pointers to its contents, index.txt (a blob) and blogExample (a tree). If we look at the contents addressed by the hash corresponding to index.txt with git cat-file -p <hashId>, we get the following:

This Side Nayan
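For the curious, here is a sketch of how such a tree could be decoded by hand. It assumes the raw tree body is a sequence of entries of the form "mode name", a NUL byte, then the 20 raw bytes of the entry's id. The ids in the demo are stand-ins generated on the spot, not real objects from my repository, and parse_tree is a name made up for this example.

```python
import hashlib

def parse_tree(body: bytes) -> list[tuple[str, str, str]]:
    """Parse a raw tree body into (mode, name, hex id) entries."""
    entries = []
    pos = 0
    while pos < len(body):
        space = body.index(b" ", pos)
        nul = body.index(b"\x00", space)
        mode = body[pos:space].decode("ascii")
        name = body[space + 1:nul].decode("utf-8")
        raw_id = body[nul + 1:nul + 21]   # 20 raw bytes, not hex text
        entries.append((mode, name, raw_id.hex()))
        pos = nul + 21
    return entries

# Build a demo tree body by hand; the ids are stand-ins, not real objects
blob_id = hashlib.sha1(b"demo blob").digest()
subtree_id = hashlib.sha1(b"demo tree").digest()
body = (b"40000 blogExample\x00" + subtree_id +
        b"100644 index.txt\x00" + blob_id)

for mode, name, hex_id in parse_tree(body):
    kind = "tree" if mode == "40000" else "blob"
    print(mode, kind, hex_id, name)
```

This is the same information git cat-file -p shows for a tree: a mode, a type, an id and a name per entry.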

References

Now, no normal human can remember a random 40-character sequence for every object they want to point to, so how do we refer to them?

Well, Git maintains a set of references, which give human-readable names to those SHA-1 strings so that the end user has addressable pointers.

reference_map = {}  # maps a human-readable name (string) to a hash (string)

def store_reference(name, identifier):
    reference_map[name] = identifier

def retrieve_reference(name):
    return reference_map.get(name)

def load_reference(name_or_id):
    if name_or_id in reference_map:
        return load(reference_map[name_or_id])
    else:
        return load(name_or_id)

Unlike objects, which are immutable, references are mutable (can be updated to point to a new commit). For example, the master reference usually points to the latest commit in the main branch of development.

With the help of these references, Git can maintain human-readable names like master, main and so on.
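On disk, a reference is typically just a small text file. Here is a minimal sketch of resolving one, assuming the common layout where .git/HEAD contains a line like "ref: refs/heads/master" and the branch file contains the hash. The demo builds a stand-in directory with a placeholder id so it runs anywhere. The function name resolve_ref is made up for this example.

```python
import tempfile
from pathlib import Path

def resolve_ref(git_dir: Path, name: str = "HEAD") -> str:
    """Follow symbolic refs ("ref: ...") until a hash is reached."""
    value = (git_dir / name).read_text().strip()
    while value.startswith("ref: "):
        value = (git_dir / value[len("ref: "):]).read_text().strip()
    return value

# Build a stand-in .git directory so the sketch runs anywhere
with tempfile.TemporaryDirectory() as tmp:
    git_dir = Path(tmp) / ".git"
    (git_dir / "refs" / "heads").mkdir(parents=True)
    (git_dir / "HEAD").write_text("ref: refs/heads/master\n")
    # Placeholder id, not a real commit
    (git_dir / "refs" / "heads" / "master").write_text(
        "0123456789abcdef0123456789abcdef01234567\n")
    resolved = resolve_ref(git_dir)
    print(resolved)
```

Real repositories may also keep references in a packed-refs file, which this sketch ignores.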

Repositories

In essence, a Git repository can be roughly defined as a collection of data objects and references.

When we examine the repository on disk, we find that it comprises only two fundamental elements: objects and references. This straightforward concept encapsulates Git's entire data model. Essentially, all Git commands can be mapped to actions that manipulate the commit Directed Acyclic Graph (DAG) by either adding objects or updating references.

The next time you type a Git command, it's worth thinking about how that command changes the underlying graph data structure. Conversely, if you're trying to make a particular change to the commit DAG, for instance "discard uncommitted changes and set the 'master' reference to point to commit 4d123f6", you'll likely find a command tailored to that purpose. In this particular case, the commands would be git checkout master followed by git reset --hard 4d123f6.
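Here is a toy sketch of that idea: a repository as nothing more than an object map and a reference map, where committing adds an object and moves a reference, and a reset only moves the reference. The ToyRepo class is made up for this example. It ignores the working directory and everything else a real git reset --hard touches.

```python
import hashlib

class ToyRepo:
    def __init__(self):
        self.objects = {}   # commit id -> (parent id or None, message)
        self.refs = {}      # branch name -> commit id

    def commit(self, branch: str, message: str) -> str:
        parent = self.refs.get(branch)
        data = f"{parent}\n{message}".encode()
        commit_id = hashlib.sha1(data).hexdigest()
        self.objects[commit_id] = (parent, message)   # add an object
        self.refs[branch] = commit_id                 # update a reference
        return commit_id

    def reset(self, branch: str, commit_id: str) -> None:
        if commit_id not in self.objects:
            raise KeyError(f"unknown commit {commit_id}")
        self.refs[branch] = commit_id                 # only the reference moves

repo = ToyRepo()
first = repo.commit("master", "first")
second = repo.commit("master", "second")
repo.reset("master", first)

print(repo.refs["master"] == first)   # True
print(second in repo.objects)         # True: the object is still stored
```

Notice that the second commit is not deleted by the reset. It simply is no longer reachable from master.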

The Staging Area

The staging area is quite a brilliant concept once you understand it. It is separate from Git's data model, but it is an integral part of how commits are created.

Instead of having a single "create snapshot" command that captures the current state of everything in the working directory, Git takes a different approach. It provides a way to choose precisely which changes we want to include in the next snapshot. This selection is done through something called the "staging area."

The staging area lets us pick and choose which modifications should go into the next snapshot. This flexibility is useful when we have multiple changes in progress or when we need to create a commit that includes some changes while excluding others.
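A toy sketch of the idea: the next snapshot is built from what has been staged, not from whatever happens to be in the working directory. All names here (working_dir, index, add, commit_snapshot) are made up for this example. Git's real index is a binary file with more information in it.

```python
import hashlib

# Two files with changes in progress
working_dir = {
    "index.txt": b"This Side Nayan",
    "notes.txt": b"half-finished idea",
}
index = {}   # staging area: path -> blob id

def blob_id(content: bytes) -> str:
    return hashlib.sha1(content).hexdigest()

def add(path: str) -> None:
    """Stage the current content of one path (like `git add <path>`)."""
    index[path] = blob_id(working_dir[path])

def commit_snapshot() -> dict:
    """The next snapshot is built from the index, not from the working dir."""
    return dict(index)

add("index.txt")               # stage only one of the two changes
snapshot = commit_snapshot()
print(sorted(snapshot))        # ['index.txt']
```

The half-finished notes.txt stays out of the snapshot until we explicitly add it.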

Since this post has already gone on long enough, I won't be covering the Git commands in depth. Below, I link the best resources for reading more about Git and how its commands work.

Now that we have a little understanding of how Git works, the commands should no longer feel like magical incantations but like fairly simple CLI commands.

In practice, Git uses delta compression and much more to be what it is today. This is just an outline of the technology behind it.

Conclusion:

In conclusion, through this two-part series I have tried to provide some valuable insights into Git's fundamental data model, the role of the .git folder, and how Git objects are addressed by their SHA-1 hashes. We've learned how references give human-friendly names to commits and explored how the staging area gives precise control over what goes into each snapshot. Understanding these core aspects of Git's inner workings is crucial for efficient version control. To further enhance your Git proficiency, you can refer to the recommended resources listed below.

Resources:

Pro Git: I highly recommend reading the first few chapters of the book and then looking into the more advanced material.
GitHub: Git is not GitHub. GitHub has a specific way of contributing code to other projects, called pull requests.
Other Git providers: GitHub is not special: there are many Git repository hosts, like GitLab and BitBucket.
Git For Computer Scientists
Missing Lectures VCS
Git is Simpler than you think

Sources:

This series was heavily inspired by the Missing Semester lecture on YouTube, which is listed in the resources above, and by a few very good articles. I have tried to incorporate the parts I thought were the most important and the best from each.
