Ethan Arrowood

Posted on May 15, 2020

Building Git with Node.js and TypeScript - Part 1

#git #node #typescript #javascript

Read the introduction to this series here: Building Git with Node.js and TypeScript - Part 0.

In this post, I'll be sharing my work from chapter 3 section 2, implementing the commit command. Follow along with the code available here.

Throughout this article I try my best to highlight certain terms using code highlight, boldface, and italic. Any code highlight text references actual pieces of code such as commands, properties, variables, etc. Any boldface text refers to file and directory names. And any italic text references higher level data structures. Most classes will be referred to using italics, but may sometimes appear as code highlights when referring to a type assignment. Keep in mind that some terms may be italicized before they are defined.

Imports are omitted from code examples. For this article, assume all imports refer to other local files or Node.js core modules. Furthermore, all code blocks will have their respective file name commented at the top of the block.

Overview

In the previous post I implemented the init command, which created a .git directory in the current working directory and initialized two inner directories, objects and refs. This section covers a simplified commit command. It adds all files in the current working directory to the git database as blobs, creates a tree with all of the entries, and then finally creates a commit with a message. Additionally, it tracks the commit author from data stored in environment variables, and the commit message is read from stdin rather than passed in as a command line argument.

Adding the commit command

Inside jit.ts add a new 'commit' case to the switch statement. Then derive the database path to the objects directory created by the init command.

// jit.ts
async function jit() {
    const command = process.argv[2]

    switch (command) {
        case 'init': {
            // ...
        }
        case 'commit': {
            const rootPath = process.cwd() // get the current working directory
            const gitPath = path.join(rootPath, '.git')
            const dbPath = path.join(gitPath, 'objects')
        }
    }
}

With these paths, create Workspace and Database class instances.

// jit.ts
// inside of the `case 'commit': { }` block
const workspace = new Workspace(rootPath)
const database = new Database(dbPath)

Workspace

The workspace class contains one private property, ignore, one public property, pathname, and two public methods, listFiles and readFile. The ignore property is a list of things to ignore when scanning the current working directory. This emulates the commonly used .gitignore file. The pathname property is the absolute path to the current working directory and any files within will be included in the list. Thus, the listFiles method returns all files in the directory resolved from pathname, and filters out anything in the ignore list. Currently, this method is not recursive and will not list files contained within directories. Finally, the readFile method takes a file path, joins it with the absolute path pathname, and then reads its contents as a buffer.

It is intentional that the readFile method returns a buffer and not an encoded string version of the file contents. This method will be used when storing the entities in the Git database and we must store the binary representation of the data; not the encoded version.

// workspace.ts
import fs from 'fs'
import path from 'path'

export default class Workspace {
    private ignore = ['.', '..', '.git']

    public pathname: string

    constructor (pathname: string) {
        this.pathname = pathname
    }

    public async listFiles () {
        const dirFiles = await fs.promises.readdir(this.pathname)
        return dirFiles.filter(x => this.ignore.indexOf(x) === -1)
    }

    public async readFile (filePath: string) {
        return await fs.promises.readFile(path.join(this.pathname, filePath))
    }
}
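
To make the buffer-versus-string point concrete, here is a minimal sketch (not part of the original code) that round-trips two bytes through a UTF-8 string. The byte values are arbitrary examples I picked because they are not valid UTF-8; nothing else about them is special.

```typescript
import { Buffer } from 'buffer'

// Illustration only: why file contents are kept as a Buffer.
// The two bytes below are arbitrary values that are not valid UTF-8.
const original = Buffer.from([0xff, 0xfe])

// Decoding replaces each invalid byte with U+FFFD (the replacement character)...
const asString = original.toString('utf8')

// ...so encoding the string again does not give back the original bytes.
const roundTripped = Buffer.from(asString, 'utf8')

console.log(original)                      // <Buffer ff fe>
console.log(roundTripped)                  // <Buffer ef bf bd ef bf bd>
console.log(original.equals(roundTripped)) // false
```

Hashing or storing the decoded string would therefore describe different data than what is actually on disk, which is why readFile is called without an encoding.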

Database

The database class is verbose, but is rightfully so as it is the basis for the entire application. It has a single public property pathname, one public method store, and two private methods writeObject and generateTempName. Start by defining the property, constructor, and methods with arguments.

// database.ts
export default class Database {
    public pathname: string

    constructor (pathname: string) {
        this.pathname = pathname
    }

    public async store(obj: Entity) {}

    private async writeObject(oid: string, content: Buffer) {}

    private generateTempName() {}
}

Starting with the store method there is already something new, the Entity class. Before continuing with the store method, let's define this class as it has some important details for the rest of the implementation.

Entity

This class is the crux for all items storable by the database. Anything that will be stored in the database (blobs, commits, trees) needs to extend this class. It has one private property, data, which is a buffer of the contents of the entity, and two public properties, type and oid (object id). While data and type are set by the constructor, the oid property is generated by a private method, setOid. This method uses the type and data properties to create a hash of a custom binary string. The code below contains comments detailing each step of this method. Lastly, the class overrides the toString method to return the underlying data buffer; this is not the best practice, as toString should generally return a string, but buffers can be implicitly turned into strings with their own toString method, so this is (sorta) okay.

What is a "binary string"? This refers to a Node.js buffer, but I distinctly used this wording because I need to highlight something special about the implementation. As previously mentioned, the database needs to store the binary representation of the data and NOT the encoded string version; thus, buffers are used throughout the code samples. This detail is covered in the Building Git book, and while working on this solution, I received a wonderfully detailed stackoverflow answer on the difference between raw binary strings and encoded strings in JavaScript. If you're interested in learning more please utilize those resources.

// entity.ts

export default class Entity {
    private data: Buffer

    public type: string
    public oid: string

    constructor(type: string, data: Buffer) {
        this.type = type
        this.data = data
        this.oid = this.setOid()
    }

    private setOid () {
        // define the binary string
        const str = this.data
        // create a buffer from the type, binary string length, and a null byte
        const header = Buffer.from(`${this.type} ${str.length}\0`)
        // create the hash content by concatenating the header and the binary string
        const content = Buffer.concat([header, str], header.length + str.length)
        // create a hash generator using the 'sha1' algorithm
        const shasum = crypto.createHash('sha1')
        // update the hash generator with the content and use a hexadecimal digest to create the object id
        const oid = shasum.update(content).digest('hex')

        return oid
    }

    public toString () {
        return this.data
    }
}
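
One way to sanity-check the hashing steps is to compare against an id that Git itself produces. The sketch below repeats the steps of setOid in a standalone function; computeOid is a hypothetical name used only for this illustration and is not part of the article's code.

```typescript
import { Buffer } from 'buffer'
import { createHash } from 'crypto'

// Hypothetical helper (not in the article's code) that repeats the steps of
// Entity's setOid as a standalone function.
function computeOid (type: string, data: Buffer): string {
    // header: the type, the byte length of the data, and a null byte
    const header = Buffer.from(`${type} ${data.length}\0`)
    // hash the header followed by the raw data
    return createHash('sha1').update(Buffer.concat([header, data])).digest('hex')
}

// An empty blob hashes to the id that `git hash-object` reports for an empty file.
console.log(computeOid('blob', Buffer.alloc(0)))
// e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

If the header were built from a string length instead of the buffer length, the ids would drift from Git's as soon as a file contained multi-byte characters.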

Back to Database

Continuing with the database store implementation, it needs to recreate the content that was used to generate the oid property, and use that plus the oid to write the object to the database itself. Yes, the content is being generated twice (once in the Entity class and once here); I purposely did not optimize this as I didn't want to stray too far from the Ruby code. It is noted and may change in future implementations.

// database.ts

class Database {
    // ...
    async store (obj: Entity) {
        const str = obj.toString() // remember this returns the data buffer
        const header = Buffer.from(`${obj.type} ${str.length}\0`)
        const content = Buffer.concat([header, str], header.length + str.length)
        await this.writeObject(obj.oid, content)
    }
}

Next are the writeObject and generateTempName methods. Derived from the store method, writeObject has two arguments: oid and content. The binary string content will be written to a file path derived from oid. In a Git database, objects are stored in subdirectories named after the first two characters of their oid; hence the substrings in the objectPath variable. The internal getFileDescriptor function tries to safely generate these directories on the fly. Unfortunately, it is not perfect and can sometimes still throw an error due to how the store method is called from jit.ts (more on this soon). Again, this is purposefully not fixed or optimized, but it is noted for future improvements. Another trick this method uses to prevent errors is to write each file under a temporary name and rename it afterwards. Finally, the content of each file is compressed using zlib deflate at the Z_BEST_SPEED level before it is written.

// database.ts

class Database {
    // ...
    private async writeObject(oid: string, content: Buffer) {
        const objectPath = path.join(this.pathname, oid.substring(0, 2), oid.substring(2))
        const dirName = path.dirname(objectPath)
        const tempPath = path.join(dirName, this.generateTempName())

        const flags = fs.constants.O_RDWR | fs.constants.O_CREAT | fs.constants.O_EXCL

        const getFileDescriptor = async () => {
            try {
                return await fs.promises.open(tempPath, flags)
            } catch (err) {
                if (err.code === 'ENOENT') {
                    await fs.promises.mkdir(dirName)
                    return await fs.promises.open(tempPath, flags)
                } else if (err.code === 'EEXIST') {
                    return await fs.promises.open(tempPath, flags)
                } else {
                    throw err
                }
            }
        }

        const file = await getFileDescriptor()

        const deflate: any = util.promisify(zlib.deflate)
        const compressed = await deflate(content, { level: zlib.constants.Z_BEST_SPEED })

        await file.write(compressed)
        await file.close()

        await fs.promises.rename(tempPath, objectPath)
    }

    private generateTempName () {
        // hex ensures we only get characters 0-9 and a-f
        return `tmp_obj_${crypto.randomBytes(8).toString('hex').slice(0, 8)}`
    }
}
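
To check what writeObject produces, it can help to go the other way: inflate a stored object and split it back into header and body. The following is a rough sketch under the assumption that the file holds exactly the header-plus-data content built in store; parseObject is a hypothetical helper, and the deflated buffer stands in for a file read from .git/objects.

```typescript
import { Buffer } from 'buffer'
import { deflateSync, inflateSync } from 'zlib'

// Hypothetical helper: inflate a stored object and split it into header and body.
function parseObject (compressed: Buffer) {
    const content = inflateSync(compressed)
    // the header ends at the first null byte
    const nullIndex = content.indexOf(0)
    const [type, size] = content.subarray(0, nullIndex).toString().split(' ')
    // everything after the null byte is the raw entity data
    const body = content.subarray(nullIndex + 1)
    return { type, size: Number(size), body }
}

// Stand-in for a file read from .git/objects, using the same framing as store
const stored = deflateSync(Buffer.from('blob 5\0hello'))
const parsed = parseObject(stored)
console.log(parsed.type, parsed.size, parsed.body.toString()) // blob 5 hello
```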

Back to the commit command

Continuing the commit block now that workspace and database are implemented, we list the files in the workspace and then iterate over the list, creating blobs and storing them in the database. Additionally, each object will be tracked as an entry which is used in the tree structure. Notice how both the blob and tree are stored in the database through the same store method. These objects are similar enough that they can both be based on the Entity class defined above.

// jit.ts
// inside of the `case 'commit': { }` block
const workspaceFiles = await workspace.listFiles()

const entries = await Promise.all(workspaceFiles.map(async path => {
    const data = await workspace.readFile(path)
    const blob = new Blob(data)

    database.store(blob)
    return new Entry(path, blob.oid)
}))

const tree = new Tree(entries)
database.store(tree)

Blob

Blobs are one of the simplest data structures in this application. They extend from Entity and set their type as 'blob'.

// blob.ts

export default class Blob extends Entity {
    constructor(data: Buffer) {
        super('blob', data)
    }
}

Entry

Another simple data structure, entry, has two public properties name and oid and both are of type string. This structure could be represented as just an object literal, but defining it as a class allows for better extensibility later on if it is needed.

// entry.ts

export default class Entry {
    public oid: string
    public name: string

    constructor (name: string, oid: string) {
        this.name = name
        this.oid = oid
    }
}

Tree

The Tree class is a bit more complicated compared to the Blob class, but it still extends from the Entity class. In the constructor, the class calls a private, static method generateData to create the data buffer passed to the parent Entity constructor. The Tree class also keeps a local, public copy of the entries list.

// tree.ts

export default class Tree extends Entity {
    public entries: Entry[]

    constructor(entries: Entry[]) {
        super('tree', Tree.generateData(entries, '100644'))
        this.entries = entries
    }

    private static generateData (input: Entry[], mode: string) {
        let totalLength = 0 // this is necessary for the final concatenation
        const entries = input
            .sort((a, b) => a.name.localeCompare(b.name)) // sort by file name
            .map(entry => {
                // encode as normal string and append a null byte
                let b1 = Buffer.from(`${mode} ${entry.name}\0`) 
                // encodes a string as hex. for example '00ce' is a string of 4 bytes; 
                // this is encoded to Buffer<00, ce>, a buffer of 2 hex bytes
                let b2 = Buffer.from(entry.oid, 'hex')
                totalLength += b1.length + b2.length
                return Buffer.concat([b1, b2], b1.length + b2.length)
            })
        // concat all of the entries into one buffer and return
        return Buffer.concat(entries, totalLength)
    }
}
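
One detail worth keeping in mind about the sort: Git orders tree entries by the bytes of their names, while localeCompare follows locale rules, so the two can disagree for mixed-case or non-ASCII names. For the lowercase ASCII file names in this post they agree. A minimal sketch of a byte-wise comparator, where byteOrder is a hypothetical name and not part of the article's code:

```typescript
import { Buffer } from 'buffer'

// Hypothetical comparator: order names by their UTF-8 bytes instead of by locale.
function byteOrder (a: string, b: string): number {
    return Buffer.compare(Buffer.from(a), Buffer.from(b))
}

// 'B' is 0x42, '_' is 0x5f, and 'a' is 0x61, so uppercase sorts first
const names = ['a.ts', 'B.ts', '_x.ts']
console.log([...names].sort(byteOrder)) // [ 'B.ts', '_x.ts', 'a.ts' ]
```

Copying the array before sorting also avoids reordering the caller's list, since sort works in place.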

The generateData function is one of my personal favorites. I think the best way to understand what it does is to first look at what it outputs. This function creates the data for the tree entry in the database. Unlike the blobs, the tree best resembles a list of all the blobs contained in the commit. Running git ls-tree <tree-hash> outputs this list:

Keep in mind this output is only possible with the rest of this section's code (i.e. the commit entity), and that my commit hashes will be different from yours if you were to reproduce this yourself.

$ git ls-tree e42fafc6ea09f9b9633adc97218288b2861dd03f

100644 blob 1d15619c8d23447eac2924b07896b3be9530a42e    author.ts
100644 blob c8c1a93bf381f385bb70bcb95359ff056ee4a273    blob.ts
100644 blob fad23e45b228db3f33501691410541819e08a1e6    commit.ts
100644 blob 0355a9b19376a39700c3f44be73cb84d2398a219    database.ts
100644 blob c9a547e93c3101b3607f58469db26882645a120d    entity.ts
100644 blob c061d02df8007226fb6b4092a40f44678f533599    entry.ts
100644 blob 7a9f17b4ee76e13b062676fa74cb509aa423ee88    jit.ts
100644 blob 1adec84945be1564c70e9cdaf5b6a9c1d9326bd0    readStdin.ts
100644 blob aeafb5efdcd5e64897385341b92a33590517adae    timestamp.ts
100644 blob 377c1945ebb9aaf9f991656b7c232f7b02a55e78    tree.ts
100644 blob a331e9df15d9546f9d7dd1f28322bf1e24c2db00    workspace.ts

The ls-tree command derives this information from the contents of the tree entry itself. The entry is hard for a human to read, but by using an inflate command and the hexdump tool we can get an output we can make sense of:

$ alias inflate="node -e 'process.stdin.pipe(zlib.createInflate()).pipe(process.stdout)'"
$ cat .git/objects/e4/2fafc6ea09f9b9633adc97218288b2861dd03f | inflate | hexdump -C

00000000  74 72 65 65 20 34 31 30  00 31 30 30 36 34 34 20  |tree 410.100644 |
00000010  61 75 74 68 6f 72 2e 74  73 00 1d 15 61 9c 8d 23  |author.ts...a..#|
00000020  44 7e ac 29 24 b0 78 96  b3 be 95 30 a4 2e 31 30  |D~.)$.x....0..10|
00000030  30 36 34 34 20 62 6c 6f  62 2e 74 73 00 c8 c1 a9  |0644 blob.ts....|
00000040  3b f3 81 f3 85 bb 70 bc  b9 53 59 ff 05 6e e4 a2  |;.....p..SY..n..|
00000050  73 31 30 30 36 34 34 20  63 6f 6d 6d 69 74 2e 74  |s100644 commit.t|
00000060  73 00 fa d2 3e 45 b2 28  db 3f 33 50 16 91 41 05  |s...>E.(.?3P..A.|
00000070  41 81 9e 08 a1 e6 31 30  30 36 34 34 20 64 61 74  |A.....100644 dat|
00000080  61 62 61 73 65 2e 74 73  00 03 55 a9 b1 93 76 a3  |abase.ts..U...v.|
00000090  97 00 c3 f4 4b e7 3c b8  4d 23 98 a2 19 31 30 30  |....K.<.M#...100|
000000a0  36 34 34 20 65 6e 74 69  74 79 2e 74 73 00 c9 a5  |644 entity.ts...|
000000b0  47 e9 3c 31 01 b3 60 7f  58 46 9d b2 68 82 64 5a  |G.<1..`.XF..h.dZ|
000000c0  12 0d 31 30 30 36 34 34  20 65 6e 74 72 79 2e 74  |..100644 entry.t|
000000d0  73 00 c0 61 d0 2d f8 00  72 26 fb 6b 40 92 a4 0f  |s..a.-..r&.k@...|
000000e0  44 67 8f 53 35 99 31 30  30 36 34 34 20 6a 69 74  |Dg.S5.100644 jit|
000000f0  2e 74 73 00 7a 9f 17 b4  ee 76 e1 3b 06 26 76 fa  |.ts.z....v.;.&v.|
00000100  74 cb 50 9a a4 23 ee 88  31 30 30 36 34 34 20 72  |t.P..#..100644 r|
00000110  65 61 64 53 74 64 69 6e  2e 74 73 00 1a de c8 49  |eadStdin.ts....I|
00000120  45 be 15 64 c7 0e 9c da  f5 b6 a9 c1 d9 32 6b d0  |E..d.........2k.|
00000130  31 30 30 36 34 34 20 74  69 6d 65 73 74 61 6d 70  |100644 timestamp|
00000140  2e 74 73 00 ae af b5 ef  dc d5 e6 48 97 38 53 41  |.ts........H.8SA|
00000150  b9 2a 33 59 05 17 ad ae  31 30 30 36 34 34 20 74  |.*3Y....100644 t|
00000160  72 65 65 2e 74 73 00 37  7c 19 45 eb b9 aa f9 f9  |ree.ts.7|.E.....|
00000170  91 65 6b 7c 23 2f 7b 02  a5 5e 78 31 30 30 36 34  |.ek|#/{..^x10064|
00000180  34 20 77 6f 72 6b 73 70  61 63 65 2e 74 73 00 a3  |4 workspace.ts..|
00000190  31 e9 df 15 d9 54 6f 9d  7d d1 f2 83 22 bf 1e 24  |1....To.}..."..$|
000001a0  c2 db 00                                          |...|
000001a3

Look closely at the table on the right of the hexdump: the mode "100644" is repeated, as are all of the file names in the tree. Following each file name is seemingly a bunch of gibberish. However, look back at the output of ls-tree and note the oid of the first entry, author.ts:

1d15619c8d23447eac2924b07896b3be9530a42e

Now, take a look at the first couple of lines of the hexdump; these correspond to the author.ts entry. What do you see (I've highlighted it below)?

00000000                                                    |tree 410.100644 |
00000010                                 1d 15 61 9c 8d 23  |author.ts...a..#|
00000020  44 7e ac 29 24 b0 78 96  b3 be 95 30 a4 2e        |D~.)$.x....0..10|

It is the author.ts oid in literal hex bytes! Thus, you can directly see how the generateData function transforms entries for the tree content.
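
The same layout can be read back programmatically. Below is a minimal sketch of the inverse of generateData, assuming the buffer holds only the tree data (without the "tree 410" header and its null byte). parseTree is a hypothetical helper that is not part of the article's code.

```typescript
import { Buffer } from 'buffer'

// Hypothetical reader for the tree data: the inverse of generateData.
function parseTree (data: Buffer) {
    const entries: { mode: string, name: string, oid: string }[] = []
    let offset = 0
    while (offset < data.length) {
        const space = data.indexOf(0x20, offset)   // the mode ends at the first space
        const nullByte = data.indexOf(0x00, space) // the file name ends at the null byte
        const mode = data.toString('utf8', offset, space)
        const name = data.toString('utf8', space + 1, nullByte)
        // the next 20 bytes are the raw SHA-1; print them as 40 hex characters
        const oid = data.toString('hex', nullByte + 1, nullByte + 21)
        entries.push({ mode, name, oid })
        offset = nullByte + 21
    }
    return entries
}

// Rebuild the author.ts entry from the ls-tree output and parse it back
const sample = Buffer.concat([
    Buffer.from('100644 author.ts\0'),
    Buffer.from('1d15619c8d23447eac2924b07896b3be9530a42e', 'hex')
])
console.log(parseTree(sample))
```

The parsed entry carries the same mode, name, and oid that ls-tree printed for author.ts.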

Back to the commit command

Now that blob, entry, and tree have all been defined, we can return to the commit code block and finally create a commit! First, read the name and email from environment variables. There are multiple ways to set these; one of the easiest is to set them in the shell profile. Then create an author instance with the name, email, and the current time. Next, read the commit message from process.stdin (the readStdin section will cover this in more detail). Create a new commit from the tree oid, the author, and the message, and then write it to the database. Finally, write the commit oid to the HEAD file and the commit function is done!

// jit.ts
// inside of the `case 'commit': { }` block
const name = process.env['GIT_AUTHOR_NAME'] || ''
const email = process.env['GIT_AUTHOR_EMAIL'] || ''
const author = new Author(name, email, new Date())
const message = await readStdin()
const commit = new Commit(tree.oid, author, message)
database.store(commit)

const fd = await fs.promises.open(path.join(gitPath, 'HEAD'), fs.constants.O_WRONLY | fs.constants.O_CREAT)
await fd.write(`${commit.oid}\n`)
await fd.close()

// print the first line of the message, even when it has no trailing newline
console.log(`[(root-commit) ${commit.oid}] ${message.split("\n")[0]}`)

Author

Much like Blob and Entry, the Author class is a small data structure. It also implements a toString method based on its properties.

// author.ts

export default class Author {
    public name: string
    public email: string
    public time: Date

    constructor(name: string, email: string, time: Date) {
        this.name = name
        this.email = email
        this.time = time
    }

    toString() {
        return `${this.name} <${this.email}> ${timestamp(this.time)}`
    }
}

This class makes use of a custom timestamp function that derives the timezone offset string from a Date object:

// timestamp.ts

export default function timestamp (date: Date) {
    const seconds = Math.round(date.getTime() / 1000)
    const timezoneOffsetNum = date.getTimezoneOffset()
    const timezoneOffsetStr = timezoneOffsetNum >= 0
        ? `+${timezoneOffsetNum.toString().padStart(4, '0')}`
        : `-${(timezoneOffsetNum * -1).toString().padStart(4, '0')}`
    return `${seconds} ${timezoneOffsetStr}`
}
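
A note on the offset: Date.prototype.getTimezoneOffset() returns a number of minutes and is positive when local time is behind UTC, so the function above prints the raw minute count (for example +0240 for a machine four hours behind UTC). Git writes the offset as a sign followed by hours and minutes. If you want output in that shape, a possible variant looks like the sketch below; formatOffset is a hypothetical helper and not part of the article's code.

```typescript
// Hypothetical variant of the offset formatting. getTimezoneOffset() returns
// minutes and is positive when local time is behind UTC, so the sign is flipped.
function formatOffset (timezoneOffsetMinutes: number): string {
    const sign = timezoneOffsetMinutes > 0 ? '-' : '+'
    const abs = Math.abs(timezoneOffsetMinutes)
    // split the minute count into zero-padded hours and minutes
    const hours = Math.floor(abs / 60).toString().padStart(2, '0')
    const minutes = (abs % 60).toString().padStart(2, '0')
    return `${sign}${hours}${minutes}`
}

console.log(formatOffset(240))  // -0400 (four hours behind UTC)
console.log(formatOffset(-330)) // +0530
console.log(formatOffset(new Date().getTimezoneOffset())) // depends on the local timezone
```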

readStdin

The readStdin method is another utility method that helps simplify the process of reading data from process.stdin. Using async iterators, it collects chunks of the readable stream and then returns the complete string in a promise.

// readStdin.ts

export default async function () {
    let res = ''
    for await (const chunk of process.stdin) {
        res += chunk
    }
    return res
}
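
One subtlety, which probably rarely matters for short commit messages: when no encoding is set on process.stdin, each chunk is a buffer, and `res += chunk` decodes every chunk on its own. A multi-byte character that happens to be split across two chunks would then be garbled. The sketch below shows the difference with hand-made chunks; decodeChunks is a hypothetical helper, and the chunk boundary is contrived for illustration.

```typescript
import { Buffer } from 'buffer'

// Hypothetical helper: join the raw chunks first, then decode once.
function decodeChunks (chunks: Buffer[]): string {
    return Buffer.concat(chunks).toString('utf8')
}

// '€' is three bytes in UTF-8 (e2 82 ac); split it across two chunks
const chunks = [Buffer.from([0xe2]), Buffer.from([0x82, 0xac])]

// decoding once over the joined bytes recovers the character
console.log(decodeChunks(chunks)) // €

// decoding chunk by chunk yields replacement characters instead
const perChunk = chunks.map(c => c.toString('utf8')).join('')
console.log(perChunk === '€') // false
```

Inside readStdin this would mean pushing each chunk onto an array and decoding after the loop.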

Commit

Finally, the last piece of the implementation is the Commit class. It extends from Entity, and thus needs to pass a type as well as data to the parent constructor. The generateData function for the Commit class joins multiple strings using the newline character and then transforms that into a buffer for the Entity data.

// commit.ts

export default class Commit extends Entity {
    public treeOid: string
    public author: Author
    public message: string

    constructor(treeOid: string, author: Author, message: string) {
        super('commit', Commit.generateData(treeOid, author, message))
        this.treeOid = treeOid
        this.author = author
        this.message = message
    }

    private static generateData(treeOid: string, author: Author, message: string) {
        const lines = [
            `tree ${treeOid}`,
            `author ${author.toString()}`,
            `committer ${author.toString()}`,
            "",
            message
        ].join("\n")

        return Buffer.from(lines)
    }
}

Running the commit command

I've posted all of this code as a gist so you can clone and run it locally faster. Check it out here: Building Git with Node.js and TypeScript

Clone the sample repo:

git clone git@github.com:Ethan-Arrowood/building-git-with-nodejs-and-typescript.git

Fetch and checkout the part-1 branch

git fetch origin part-1
git checkout part-1

Install dependencies, build src, and link the executable

npm i
npm run build
npm link

Set the current working directory to src and run the commands

cd src
jit init
export GIT_AUTHOR_NAME="name" GIT_AUTHOR_EMAIL="email" && cat ../COMMIT_EDITMSG | jit commit

Now you should have a .git directory in the src directory that contains all of the blobs, the tree, and the commit.

To inspect the contents of the local .git directory, start by retrieving the commit hash from HEAD

cat .git/HEAD

Create an inflate command (I've added mine to my bash profile)

alias inflate="node -e 'process.stdin.pipe(zlib.createInflate()).pipe(process.stdout)'"

Then inflate the contents of the root commit

cat .git/objects/<first two characters of HEAD>/<remaining characters of HEAD> | inflate

If everything works as expected the output should be:

commit 705tree <tree-oid>
author name <email> 1589553119 +0240
committer name <email> 1589553119 +0240

Initial revision of "jit", the information manager from Boston

This commit records a minimal set of functionality necessary for the code to store itself as a valid Git commit. This includes writing the following object types to the database:

- Blobs of ASCII text
- Trees containing a flat list of regular files
- Commits that contain a tree pointer, author info and message

These objects are written to `.git/objects`, compressed using zlib.

At this stage, there is no index and no `add` command; the `commit` command simply writes everything in the working tree to the database and commits it.

With the <tree-oid> you can then use git ls-tree to see the contents of the tree entry:

git ls-tree <tree-oid>

Conclusion

That is all for now! I intend to make the following sections shorter so these posts are easier to read. I encourage you to ask questions and continue the discussion in the comments; I'll do my best to respond to everyone! If you enjoyed this post, make sure to follow me on Twitter (@ArrowoodTech). And don't forget to check out the book, Building Git.

Happy coding 🚀
