Merge-trees are a model that my supervisor and I came up with to show how a commit is merged into the master branch of the repository.
The directed acyclic graph (DAG) is the "beautiful", graph-theory based, model used internally by git to store commits and merges in the repository, and it's actually really good at this job. It's somewhat reasonable for visualizing repositories. We're used to seeing a visualization of the DAG on Github, under the network tab or the big middle pane with all the pretty lines on GitKraken, or if you are terminally inclined,
git log --graph.
Since all of these tools use the DAG for visualization, it must be great, right? Nope. Once the repository reaches a certain size with a given level of complexity, these tools start to break apart. My main focus was on the Linux repository. Before the shenanigans that Microsoft went through to switch to git, Linux was probably the most complex git repositories in existence (that I know of anyway), making use of nearly every feature in git at some point or another. With the complexity and size of git, our friends over at Github and GitKraken can no longer provide us with our pretty pictures. Github doesn't even try, just showing a message saying that the repository is too big. GitKraken just sits there, trying to load it for hours. It might eventually get somewhere too, if you had enough memory and patience, but it would still take hours to perform any operations once it's loaded.
But... we still can have our pretty pictures. Both Gitk, which is a GUI application shipped with git, and
git log --graph on the terminal, are able to produce a visualization.
Hmm, which one of those lines is the master branch? What's actually going on here? Who knows? It turns out, that line that goes from red, to bright green to red again, is the master branch. How do I know this, I've looked at this repository more than what could be considered healthy, specifically this commit.
So why do people use the DAG for visualizations? It's exactly representative of the repository. Proponents might argue that it shows exactly how the repository looks. Realists might argue that, "it's easy". A simple pass through the repository will render the DAG, while using a different model (I don't actually know of anyone trying to design a different model) is quite difficult and will require some pre-processing steps. Further complicating matters, it turns out that git has no internal notion of a "master" branch. The idea of a main branch is something carried over from the versioning systems of yore (SVN, CVS). There is a somewhat agreed upon convention though. Every commit and merge has a parent list. Commits only have one parent, while merges may have many (two if you are a normal person, as many as you want if you are of the octopus-merge group). The first parent in the list is the previous commit or merge toward the initial commit, that is on the same branch as the commit being made. The second (through nth) are the next commit or merge that you are merging, in the order that you reference the merges.
So if I'm on the master branch and I do
git merge --no-ff branch1 branch2
The merge commit will have a parent list containing three elements (in this order);
- reference to the previous commit on the master branch
- reference to the last commit at the time of the merge on branch1
- reference to the last commit at the time of the merge on branch2
If something happens, say someone rebases incorrectly, or merges something funny, or does a
git pull incorrectly (this is why you get those people saying use
git fetch followed by
git rebase instead of
git pull), you can switch the order of those parents, which will make the master branch disappear. This is called a foxtrot.
In this case, we swapped branch2 and master, which makes it looks like E and D are both on the master branch, even though we know that the master branch goes directly from commit B to merge A.
Another issue is fast-foward merging. Many people are afraid of merge commits, and without a good visualization merge commits just complicate things, so they do fast-forward merges. Yeah, it gets rid of the merge commits, but since branches are usually separate ideas, the merge commits are like the separation of paragraphs. Yeah, you could write a book without paragraphs, but it will suck. Don't do that to your repositories, or they will suck too.
Okay, so now that we've thoroughly discussed the DAG, let's talk about the Merge-Tree. A Merge-Tree is a tree-based structure showing how groups of commits are merged into the master branch. The root is the merge into the master branch, the leaves are the individual commits, and the inner nodes are any merges that the commits must pass through to hit the master branch.
This is a merge-tree visualization for the same commit that is highlighted up above in the picture of the DAG for Gitk. The bright orange node is the commit. The model removes commits that are not related to the commit we are interested in, which makes the visualization a lot less confusing. Furthermore, because we know which commits belong to a merge, we are able to provide a full summarization of the commits and merges within the root. This can be the files modified and the authors.
Constructing merge-trees relies on knowledge of the master branch to determine where the roots should be.
We built a small tool called Linvis to test the concept of using merge-trees. A user can search for a commit by the hash, author, filename, and keywords from the log message. The search engine uses full-text searching for commits and merges that match your query.
Searching for 'net-next' for instance will return a whole bunch of trees with commits and merges related to the networking module of Linux. The results are grouped by the merge-tree they came from, with the link to the root at the top, and the links to the individual commits in the table.
The trees are ordered based on the mean of the search-ranking, so more relevant trees are at the front, and others float to the back. we provide various summaries, like the files modified
and the authors
Furthermore, we have three different visualizations of trees, which offer different benefits. The Reingold-Tilford tree being the classic tree, while the pack-tree was designed for file systems, which, like repositories, generally have a very wide but shallow structure.
The merge-tree is an interesting concept with lots of potential. The utility of merge-trees is dependent on the structure of the repository; if the repository uses fast-forward merges, and has foxtrots, merge-trees will either not be useful or correct, as they rely on knowledge of the master branch. The Caml repository, for instance, uses a CVS repository structure. All changes are made into the master branch of the repository. Branching is only used for separating releases. Modifications to the release go in the release branch, but the release branches are never merged back into the master branch. Merge-Trees would not help at all in this situation; every commit would be a separate merge-tree.
Some git workflows are centered around merging into the master branch at a release. These would likely benefit the most from merge-trees, as managers can easily summarize who wrote what. Furthermore, people can see exactly how every commit made it's way into the project.
The Linux repository doesn't use either of these. Linus Torvalds holds all the power, being the only one who can push to the master branch. He simply merges large submodules of the kernel at a time, then leaves tags at release points. This structure benefits from the merge-tree structure too.
This has been a long post, so I'll end it here. Let me know what you think.