Poorshad Shaddel

Posted on Oct 30, 2023 • Originally published at levelup.gitconnected.com on Oct 29, 2023

Implement a Simple Version Control with JavaScript to Understand Git Better!

#javascript #git #versioncontrol

Learn Git Internals by trying to implement a simplified version of it!

What is Git, or version control in general?

Simply put, it is something that helps us track our project over time. One good example is that we can easily go back in time in our source code and see what the code looked like at a specific point in the past.

Why do we need to understand it?

First of all, because when something goes wrong, blindly knowing some commands will not help. Second, if we do not understand how the tool we work with every day actually works, then where is the fun?

Now we will go through the concepts one by one while implementing our super simplified version control called Gitj.

Implementation

Keep in mind that git does many more advanced things, such as compression and a more elaborate way of storing objects, which I might cover in separate articles.

Implement the first command: Init!

As you probably know, we start a project with git init, and git creates the .git folder and stores its data in it. We are going to implement this.

Two important folders that git creates are refs and objects. Objects are the building blocks of git and come in three types (actually four, if we count tags): commit, tree and blob. We are going to look at each of these types in detail. The refs folder has a subfolder called heads that contains our branches and the latest commit of each branch (as the name suggests, it stores the head of each branch). We also have an important file called HEAD, which keeps the current branch or commit (sometimes we want to check out a commit and not a branch).

So let’s create these folders when someone calls our init function.

const fs = require("fs");

function init() {
    // creates a folder called .gitj and creates a subfolder called .gitj/objects and .gitj/refs and .gitj/refs/heads
    fs.mkdirSync(".gitj");
    fs.mkdirSync(".gitj/objects");
    fs.mkdirSync(".gitj/refs");
    fs.mkdirSync(".gitj/refs/heads");
    // creates a file called .gitj/refs/heads/master
    fs.writeFileSync(".gitj/refs/heads/master", "");
    // creates a file called .gitj/HEAD
    fs.writeFileSync(".gitj/HEAD", "ref: refs/heads/master");
}

init();
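To make the role of the HEAD file more concrete, here is a minimal sketch of how its content could be interpreted. parseHead is a hypothetical helper that is not part of the original Gitj code; it only assumes the two formats described above: a line starting with "ref: " for a branch, or a bare commit hash.

```javascript
// Hypothetical helper (not from the original Gitj code):
// interpret the text stored in .gitj/HEAD.
function parseHead(content) {
    const text = content.trim();
    if (text.startsWith('ref: ')) {
        // HEAD points to a branch, e.g. "ref: refs/heads/master"
        const ref = text.slice('ref: '.length);
        return { type: 'branch', ref, branch: ref.replace(/^refs\/heads\//, '') };
    }
    // otherwise we assume HEAD holds a commit hash directly
    return { type: 'detached', hash: text };
}

console.log(parseHead('ref: refs/heads/master'));
```

Right after init, this reports that we are on the branch master, even though that branch does not point to any commit yet.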

Git Add! Put files into Staging

In git files can have three different stages:

Git Files Stages

Working Directory: the normal directory you work in, where you change files, folders and structure.

Staging Directory: a snapshot of your working directory at a given moment. When you use the git add command, it actually copies your files into the .git folder. Keep in mind that staged files are a kind of final draft that you want to commit to your repository later.

Repository: by running the commit command you create a new snapshot in your repository. Now you can point to this snapshot by using the SHA hash of the commit (you could not point to staged files, since they were only a draft).

Now let’s implement git add command.

The steps we should do in git add:

Read the file content
Hash the content of the file
Use Hash as file name and store it in the objects folder
If the file is already there, do nothing. If you have 10 files with identical contents (even with different file names and folder locations), git does not copy them 10 times; it reuses the same blob.

Also, git uses the first two characters of the hash as a folder name. For example, if the hash is 4f9be057f0ea5d2ba72fd2c810e8d7b9aa98b469, git stores it in the folder 4f and names the file after the remaining characters: 9be057f0ea5d2ba72fd2c810e8d7b9aa98b469. The reason is that a single folder containing a huge number of files can become slow to access, and spreading the objects over subfolders keeps each folder small.

const fs = require("fs");
const crypto = require("crypto");
function add(filename) {
    // file exists?
    if (!fs.existsSync(filename)) {
        console.log(`File ${filename} does not exist.`);
        process.exit(1);
    }
    // read the file
    const content = fs.readFileSync(filename);
    // hash the file
    const hash = crypto.createHash("sha1");
    hash.update(content);
    const sha = hash.digest("hex");
    const folder = `.gitj/objects/${sha.slice(0, 2)}`;
    const blobPath = `${folder}/${sha.slice(2)}`;
    // create a folder with the first two characters of the hash if it doesn't exist
    fs.mkdirSync(folder, { recursive: true });
    // write the blob unless a blob with the same content already exists
    if (!fs.existsSync(blobPath)) {
        fs.writeFileSync(blobPath, content);
    }
    return sha;
}

add('./sample/src/readme.md')
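For comparison, real Git does not hash the raw content the way Gitj does. It first prepends a small header of the form "blob <size in bytes>" followed by a zero byte (and it also compresses the stored object, which we skip entirely). The sketch below puts both variants side by side; gitjBlobHash, gitBlobHash and objectPath are names made up for this illustration.

```javascript
const crypto = require('crypto');

// Gitj (this article): hash the raw file content.
function gitjBlobHash(content) {
    return crypto.createHash('sha1').update(content).digest('hex');
}

// Real Git: hash the header "blob <byte length>\0" followed by the content.
function gitBlobHash(content) {
    const body = Buffer.from(content);
    const header = Buffer.from(`blob ${body.length}\0`);
    return crypto.createHash('sha1').update(Buffer.concat([header, body])).digest('hex');
}

// Split a hash into the folder/file layout used inside the objects folder.
function objectPath(sha, gitDir = '.gitj') {
    return `${gitDir}/objects/${sha.slice(0, 2)}/${sha.slice(2)}`;
}

console.log(gitBlobHash('hello world\n'));
console.log(objectPath(gitjBlobHash('hello world\n')));
```

The first value should match what git hash-object prints for a file with the same content, while the Gitj hash is different because the header is missing. This is fine for learning purposes, but it means Gitj objects are not interchangeable with real Git objects.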

After running add.js and adding a file from your source code, you should see a new file in your objects folder like this.

What your Gitj folder should look like

Commit the Changes!

A commit is itself an object type and, as you have probably guessed, it creates more files in our objects folder.

The purpose of a commit is to record the current state of our files and folders so that we can refer to it and come back to this state in the future.

A commit object in Git contains this information:

Author: The person who made the changes
Committer: The person who committed the changes (sometimes you receive a patch from someone else and you commit it on their behalf)
Date of commit
Commit Message
Tree (which is itself another object that keeps the shape of the working directory at the time of creation)
Parent (if one exists)

Let’s see what a Git commit looks like

If we use the command git log we can see the list of commit hashes, and if we copy one of them, we can use the command git show --pretty=raw commitHash. Here is an example of the result of this command (the date is shown as a timestamp after the committer and author names):

Result of Git Show Commit

In this example we have a parent, but if the commit is the first commit ever, it does not have one.

The parent is stored inside the commit object so that commits are chained together. Since the parent hash is part of the content that gets hashed, changing a commit in the past changes its hash, and then the hashes of all the commits after it have to be recalculated as well. This is what happens when we use commands like rebase.
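The chaining effect can be demonstrated with a few lines of code. This is only a sketch with a heavily simplified commit format (a tree, an optional parent and a message); commitHash and buildChain are hypothetical names, and the tree hash is a fixed placeholder so that only the messages and parents vary.

```javascript
const crypto = require('crypto');

function sha1(text) {
    return crypto.createHash('sha1').update(text).digest('hex');
}

// Simplified commit: the parent hash is part of the hashed content.
function commitHash(tree, parent, message) {
    const lines = [`tree ${tree}`];
    if (parent) lines.push(`parent ${parent}`);
    lines.push('', message);
    return sha1(lines.join('\n'));
}

// Build a chain of commits, each one pointing to the previous one.
function buildChain(messages) {
    const tree = sha1('same tree for every commit');
    const hashes = [];
    let parent = null;
    for (const message of messages) {
        parent = commitHash(tree, parent, message);
        hashes.push(parent);
    }
    return hashes;
}

const original = buildChain(['first', 'second', 'third']);
const rewritten = buildChain(['first (edited)', 'second', 'third']);
// Only the first commit was edited, yet every hash in the chain is different.
console.log(original.every((hash, i) => hash !== rewritten[i])); // prints true
```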

What is Tree object type in Git?

It is a type of object that keeps the shape of folders and files. For example, we can take the tree hash from the commit I showed above and look at what is inside. To see the tree we use git ls-tree treeHash. Here is an example:

Git LS Tree Command

This is the base tree. It contains the files and folders of my working directory. As we can see it has two different types: blob which are files, and tree which points to another tree object that represents a sub folder in this case.

In the result, the second column is the type of the object, the third one is the SHA and the last one is the file or folder name (we already mentioned that keeping the file name out of the blob makes it possible to reuse the exact same content over and over). The only thing we do not know yet is the first column, which is the file mode. The file mode specifies the kind of entry (e.g., blob or tree) and its permissions. The leading numbers like 040000 or 100644 are the file mode in octal notation. The most common modes are:

100644: Indicates a normal file (blob) with read-write permissions.
100755: Indicates an executable file (blob) with read-write-execute permissions.
040000: Indicates a directory (tree).
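Real Git is stricter than the permissions on disk might suggest: for regular files it only records 100644 or 100755, depending on whether the file is executable, so a file with permissions 664 still ends up as 100644. A minimal sketch of that normalization could look like this; gitFileMode is a hypothetical name, and checking only the owner's execute bit is an assumption of this sketch.

```javascript
// Hypothetical helper: map the mode reported by fs.stat() to the mode
// that would be recorded in a tree entry.
function gitFileMode(statMode, isDirectory) {
    if (isDirectory) return '040000';
    // 0o100 is the "owner can execute" permission bit
    const ownerCanExecute = (statMode & 0o100) !== 0;
    return ownerCanExecute ? '100755' : '100644';
}

console.log(gitFileMode(0o100664, false)); // prints 100644
console.log(gitFileMode(0o100775, false)); // prints 100755
```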

Actions needed to implement commit functionality:

Create a Tree of the current working directory
Create the commit object
Get the parent commit (the current head of the branch), if there is one, and update the head of the branch (master) to point to the new commit.

We are going to create a simple tree that generates the structure of files and folders in the same way that git does.

Implement Tree generator function!

Small function for getting the file mode of each file:

async function getTreeFileMode(fileType, fileOrFolder) {
    const { mode } = await fs.stat(fileOrFolder);
    return fileType === 'tree' ? '040000' : '100' + ((mode & parseInt("777", 8)).toString(8));
}

A function to get hash of a file

async function getHashOfFile(path) {
    const content = await fs.readFile(path);
    const hash = crypto.createHash("sha1");
    hash.update(content);
    const sha = hash.digest("hex");
    return sha;
};

Now the main function

async function createTreeObjectsFromPaths(folderPath) {
    let treeFileContent = '';
    let treeHash = ''
    // we want to create a tree object similar to git ls-tree
    const listOfFilesAndFolders = await fs.readdir(folderPath, { withFileTypes: true });
    // if it is a file we want to store the hash of the file, if it is a directory we want to call this function recursively
    for (const fileOrFolder of listOfFilesAndFolders) {
        const fileType = fileOrFolder.isDirectory() ? 'tree' : 'blob';
        const fileName = fileOrFolder.name;
        let fileHash = '';
        if (fileType === 'tree') {
            // a subfolder is stored as its own tree object; its hash becomes the hash of this entry
            fileHash = await createTreeObjectsFromPaths(`${folderPath}/${fileName}`);
        } else {
            // here we need to calculate the hash of the file
            fileHash = await getHashOfFile(`${folderPath}/${fileName}`);
        }
        const fileMode = await getTreeFileMode(fileType, `${folderPath}/${fileName}`);
        const fileModeAndName = `${fileMode} ${fileType} ${fileHash} ${fileName}`;
        treeFileContent += fileModeAndName + '\n';
    }
    const hash = crypto.createHash("sha1");
    hash.update(treeFileContent);
    treeHash = hash.digest("hex");
    // write the tree object to the objects folder
    if (!await folderOrFileExist(`.gitj/objects/${treeHash.slice(0, 2)}`)) {
        await fs.mkdir(`.gitj/objects/${treeHash.slice(0, 2)}`, { recursive: true });
    }
    if (await folderOrFileExist(`.gitj/objects/${treeHash.slice(0, 2)}/${treeHash.slice(2)}`)) {
        // a tree with the same content already exists
        console.log(`.gitj/objects/${treeHash.slice(0, 2)}/${treeHash.slice(2)}`);
        return treeHash;
    }
    // write the file to the objects folder
    await fs.writeFile(`.gitj/objects/${treeHash.slice(0, 2)}/${treeHash.slice(2)}`, treeFileContent);
    // console.log(`.gitj/objects/${treeHash.slice(0, 2)}/${treeHash.slice(2)} \n`, treeFileContent);
    return treeHash;
}

Explanation of this function:

1- We iterate over the result of fs.readdir.

2- If the entry is a directory, then its type is tree and we call this function recursively. Otherwise it is a file (in other words, a blob) and we need to calculate its hash.

3- We get the file mode (040000, 100644, …) because we need it for creating our tree object.

4- We now have the content of the tree object, so we can create its hash.

5- If the object (the hash of the current tree) already exists, we do not need to do anything. Otherwise we create the object and store it in our objects folder.
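createTreeObjectsFromPaths calls a helper named folderOrFileExist that is not listed in this article. A minimal sketch of what it might look like, assuming it only needs to tell whether a path exists, is:

```javascript
const fs = require('fs').promises;

// Assumed implementation of the helper used above:
// resolves to true if the path exists and false if it does not.
async function folderOrFileExist(path) {
    try {
        await fs.access(path);
        return true;
    } catch (error) {
        // ENOENT means "no such file or directory"
        if (error.code === 'ENOENT') return false;
        // anything else (e.g. a permission problem) is a real error
        throw error;
    }
}
```

fs.access is used here instead of fs.existsSync so that the helper can be awaited like the other calls in the function.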

Let’s run this function and see if it is working correctly:

The folder structure looks like this.

Folder Structure

Now we run createTreeObjectsFromPaths('.') . As a result two new objects are created in our Gitj folder:

Two new objects after creating our tree object

We expect the content of one of these objects to have package.json which was in the root folder as a type of blob and we expect another tree object which points to src folder:

Root Folder Tree Object

The hash of this tree entry refers to another object, which stores the structure of the src folder:

src Folder Tree Object
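One caveat when running the function on '.': fs.readdir also returns the .gitj folder itself, and it does not promise a particular order of entries, so the same directory could produce different tree hashes. A possible refinement, sketched below with the hypothetical names IGNORED and listTrackableEntries, is to skip such folders and sort the entries by name before building the tree content. The list of ignored names is an assumption of this sketch.

```javascript
const fs = require('fs').promises;

// Names that should never become part of a snapshot (an assumption of this sketch).
const IGNORED = new Set(['.gitj', '.git', 'node_modules']);

// Hypothetical replacement for the plain fs.readdir call:
// returns the directory entries without ignored names, sorted by name.
async function listTrackableEntries(folderPath) {
    const entries = await fs.readdir(folderPath, { withFileTypes: true });
    return entries
        .filter((entry) => !IGNORED.has(entry.name))
        .sort((a, b) => (a.name < b.name ? -1 : a.name > b.name ? 1 : 0));
}
```

With the entries sorted, the tree content, and therefore the tree hash, depends only on what is in the folder and not on the order in which the file system lists it.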

Now it’s time to implement Commit function.

const fs = require('fs').promises;
const crypto = require('crypto');
const { createTreeObjectsFromPaths, folderOrFileExist } = require('./tree');

async function commit(commitMessage) {
    const treeHash = await createTreeObjectsFromPaths('./sample');
    const parentHash = await getLatestCommitHash();
    const author = 'test';
    const committer = 'test';
    const commitDate = Date.now();
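    // the very first commit has no parent, so we leave the parent line out in that case
    const parentLine = parentHash ? `parent ${parentHash}\n` : '';
    const commitContent = `tree ${treeHash}\n${parentLine}author ${author}\ncommitter ${committer}\ncommit date ${commitDate}\n${commitMessage}`;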
    const hash = crypto.createHash("sha1");
    hash.update(commitContent);
    const commitHash = hash.digest("hex");
    // write the commit object to the objects folder
    if (!await folderOrFileExist(`.gitj/objects/${commitHash.slice(0, 2)}`)) {
        await fs.mkdir(`.gitj/objects/${commitHash.slice(0, 2)}`, { recursive: true });
    }
    if (await folderOrFileExist(`.gitj/objects/${commitHash.slice(0, 2)}/${commitHash.slice(2)}`)) {
        // a commit with the same content already exists
        console.log(`.gitj/objects/${commitHash.slice(0, 2)}/${commitHash.slice(2)}`);
        return commitHash;
    }
    // write the file to the objects folder
    await fs.writeFile(`.gitj/objects/${commitHash.slice(0, 2)}/${commitHash.slice(2)}`, commitContent);
    // set the head of current branch to the commit hash
    await fs.writeFile('.gitj/refs/heads/master', commitHash);
    return commitHash;
}

The hard part was the tree object. Now we have all the data, so we put it together and create a new object for the commit. We also need to update the head of the branch so that it points to this new snapshot (by storing the commit hash).
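The commit function relies on getLatestCommitHash, which is not shown above. A minimal sketch, assuming it simply reads the hash stored in the branch file and returns an empty string when there is no commit yet, could be:

```javascript
const fs = require('fs').promises;

// Assumed implementation: read the latest commit hash of a branch.
// Returns '' if the branch file is empty or does not exist yet.
async function getLatestCommitHash(branchFile = '.gitj/refs/heads/master') {
    try {
        const content = await fs.readFile(branchFile, 'utf-8');
        return content.trim();
    } catch (error) {
        if (error.code === 'ENOENT') return '';
        throw error;
    }
}
```

The branchFile parameter is an addition for this sketch; the original code always commits to master.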

Git Checkout

What does branch Master or Main or other branches mean?

Branches are simply references, or bookmarks, to a commit. In the implementation of commit you saw that we updated the content of the file refs/heads/master, and that content is just a commit hash. This commit has a parent (unless it is the first commit ever), so we can walk back through the history until there are no more commits. So, by using the branch name we can access the latest commit (the head!). Simply put, being on a different branch means that you are pointing to a different head.

Also remember that git does not store the file name with the blob. The advantage is that git can reuse the blob even if the file names are different.
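Since a branch is only a file that holds a commit hash, creating or switching branches can be sketched in a few lines. createBranch and switchBranch are hypothetical helpers that are not part of the original Gitj code, and the gitDir parameter is added only to keep the sketch easy to try out.

```javascript
const fs = require('fs').promises;
const path = require('path');

// Hypothetical helper: a new branch is just a new file holding a commit hash.
async function createBranch(name, commitHash, gitDir = '.gitj') {
    const branchFile = path.join(gitDir, 'refs', 'heads', name);
    await fs.mkdir(path.dirname(branchFile), { recursive: true });
    await fs.writeFile(branchFile, commitHash);
    return branchFile;
}

// Hypothetical helper: being "on" a branch only means HEAD refers to it.
async function switchBranch(name, gitDir = '.gitj') {
    await fs.writeFile(path.join(gitDir, 'HEAD'), `ref: refs/heads/${name}`);
}
```

Note that switchBranch only moves the pointer. Updating the files in the working directory is a separate step.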

How to implement Git Checkout!

Since we already implemented the commit functionality, we know that a commit gives us access to the tree object (and, recursively, to all the folders and files), and that the files are stored as blobs in .git (for us, .gitj). So when someone checks out another commit (or a branch, since a branch head points to a commit hash), we should clear the working directory and then rebuild the whole directory from that commit. But first of all we need to store the commit hash or the branch name in the HEAD file.

Before that we want to implement a small function to get the tree object of that commit:

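async function getTreeHashFromCommit(commitHash) {
    const commitContent = await fs.readFile(`.gitj/objects/${commitHash.slice(0, 2)}/${commitHash.slice(2)}`, 'utf-8');
    // every line of the commit becomes an array of words, e.g. ['tree', '<hash>']
    const array = commitContent.split('\n').map(e => e.split(' '));
    // the tree line was written as "tree <hash>", so the keyword is the first word
    const elem = array.find(e => e[0] === 'tree');
    return elem[1];
};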

Now by having the tree we have to rebuild the whole folder:

// folderPath defaults to './sample', the folder that commit() takes its snapshot from
async function checkout(commitHash, folderPath = './sample') {
    // store the commit hash in the HEAD file
    await fs.writeFile('.gitj/HEAD', commitHash);
    const treeHash = await getTreeHashFromCommit(commitHash);
    // flatten the tree into a list of files with their full paths
    const baseTree = await convertTreeObject(treeHash);
    // clear the folder
    await removeAllFilesAndFolders(folderPath);
    // re-create the files and folders from the stored blobs
    await createFilesAndFolders(baseTree, folderPath);
}

First we write this commit hash to the HEAD file to remember that the head is not on master anymore. Then we have to recursively walk the tree and its blobs to re-create the whole folder. Here is the implementation:

async function convertTreeObject(treeHash, folderPrefix = '', files = []) {
    const treeObject = await fs.readFile(`.gitj/objects/${treeHash.slice(0, 2)}/${treeHash.slice(2)}`, 'utf-8');
    const array = treeObject.split('\n').map(e=> e.split(' '))
    for (const file of array) {
        if (!file || file.length < 2) continue;
        const [mode, type, hash, name] = file;
        if (type === 'tree') {
            await convertTreeObject(hash, folderPrefix + name + '/', files);
        } else {
            files.push({
                mode: mode,
                type: type,
                hash: hash,
                name: folderPrefix + name
            })
        }
    }
    return files;
}

If the entry is a file, we add its name and its blob hash (which identifies the content of the file) to an array. If the object type is tree, it is a folder and we call this function recursively, passing the parent folder along so that we can build the correct path to each file.
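The two remaining helpers used by checkout, removeAllFilesAndFolders and createFilesAndFolders, are not listed in this article. The sketch below shows one way they might work. It assumes that every blob referenced by the tree has been stored in the objects folder beforehand (for example with add), because the tree function above only hashes files and does not store their content. The gitDir parameter is an addition for this sketch.

```javascript
const fs = require('fs').promises;
const path = require('path');

// Assumed implementation: empty the folder but never delete the repository data.
async function removeAllFilesAndFolders(folderPath) {
    const entries = await fs.readdir(folderPath);
    for (const entry of entries) {
        if (entry === '.gitj') continue;
        await fs.rm(path.join(folderPath, entry), { recursive: true, force: true });
    }
}

// Assumed implementation: write every file from the list produced by convertTreeObject.
async function createFilesAndFolders(files, folderPath, gitDir = '.gitj') {
    for (const file of files) {
        const target = path.join(folderPath, file.name);
        // create the parent folders of the file if needed
        await fs.mkdir(path.dirname(target), { recursive: true });
        // read the blob stored under objects/<first two chars>/<rest of the hash>
        const blobPath = path.join(gitDir, 'objects', file.hash.slice(0, 2), file.hash.slice(2));
        const content = await fs.readFile(blobPath);
        await fs.writeFile(target, content);
        // restore the permission bits kept in the last three digits of the mode
        await fs.chmod(target, parseInt(file.mode.slice(-3), 8));
    }
}
```

Be careful when trying this out: removeAllFilesAndFolders really deletes files, so point it at a sample folder and not at a project you care about.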

Some more functionalities that I would like to implement in future

Git Status
Git Diff

Conclusion

We have learned how git uses hashes, and the chaining of commit hashes, to give us the ability to track history. Personally, I find this kind of deep dive the best way to learn something really well, and I hope it was useful. If you are interested in implementing more of git's functionality, check out this GitHub Repo.

References

https://www.youtube.com/watch?v=lG90LZotrpo&ab_channel=CS50

https://www.youtube.com/watch?v=P6jD966jzlk&pp=ygUgR2l0IGludGVybmFscyBob3cgaXQgc3RvcmVzIGRhdGE%3D

https://www.youtube.com/watch?v=dBSHLb1B8sw&ab_channel=GOTOConferences

https://www.youtube.com/watch?v=52MFjdGH20o&ab_channel=Brief
