DEV Community: Rafsun Masud

How to amend changes to your pull request?

Rafsun Masud — Thu, 23 Mar 2023 21:52:53 +0000

So you have just received feedback on your PR. You forgot a semi-colon at the end of a line. After you fixed it, your commit history looks like this:

5 (HEAD -> feature-branch) fix: add semi-colon (my bad)
4 add docs
3 write tests
2 implement feature
1 (master) base commit

I know commit IDs are not integers. For the purpose of this article, you can assume they are. The base commit of your branch is 1, and the IDs increase sequentially from there.

Also, notice that the commit history is listed from the most recent commit down to the oldest commit, which is how git log prints it by default.

It is a best practice to maintain a neat commit history. But yours is cluttered by a small fix commit now.

To remove this clutter, you should fold the 5th commit into the 4th. There are multiple ways of doing it. Rebasing the branch is one of them.

WARNING: The dark side of rebasing

Before you go off and rebase a branch without reading the whole article, I must stop you. Rebasing modifies the commit history. When you force-push the rebased branch, your modified history (i.e. the local copy) replaces the branch's current history in the upstream repository (i.e. GitHub). On a branch that other people also use, this is something you ABSOLUTELY don't want to do, because the next time anyone else pushes to or pulls from the upstream repository, git will find a mismatch in history and cause unexpected issues.

A general rule of thumb: do not rewrite the history of a branch if more than one person is working on it. Ideally, your pull request comes from a branch in a forked repo owned by you, and in most cases you are the only person working on that branch. In this situation, rebasing that branch is not a problem.

What rebasing does under the hood

When you rebase a branch, git replays the branch's history from the beginning (right after the base commit). It takes all commits after 1, which are 2 to 5, off the branch's history (they are not deleted from disk right away). Then, it re-applies the commits from 2 to 5 sequentially. Before it starts re-applying, it gives you an opportunity to specify any modification you would like to make to the history. In your case, you want to specify that the 5th commit should be squashed into the 4th commit.

Let's start the rebasing process. If you are afraid of playing with rebasing in your work repository, you may init a sample test repository and recreate the commit history we are dealing with now.

How to perform rebasing

Run the following command.

git rebase -i HEAD~2

Why HEAD~2? Simply put, the command replays only the last two commits instead of the entire branch, because you know you are modifying only the last two commits in this case. If you really want to know what HEAD~2 means, it is a reference to the 3rd last commit (which is 3). The command is saying 'replay all commits after the 3rd one on top of the 3rd one'.
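To make the HEAD~2 arithmetic concrete, here is a minimal Python sketch of the article's five-commit history. It is a toy model, not how git is implemented: the integer IDs follow the article's convention, and the PARENT table and both helper functions are made up for illustration.

```python
# Toy model of the article's history: each commit ID maps to its parent.
# Integer IDs follow the article's convention; real git IDs are hashes.
PARENT = {5: 4, 4: 3, 3: 2, 2: 1, 1: None}
HEAD = 5


def ancestor(commit, n):
    """Follow the parent link n times, like commit~n."""
    for _ in range(n):
        commit = PARENT[commit]
        if commit is None:
            raise ValueError("walked past the first commit")
    return commit


def commits_to_replay(head, n):
    """Commits after head~n, oldest first (the order of the todo list)."""
    stop = ancestor(head, n)
    replay = []
    commit = head
    while commit != stop:
        replay.append(commit)
        commit = PARENT[commit]
    return list(reversed(replay))


print(ancestor(HEAD, 2))           # 3
print(commits_to_replay(HEAD, 2))  # [4, 5]
```

HEAD~2 lands on commit 3, so only commits 4 and 5 are replayed.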

After you run the command, git will open a text editor with the following content to give you a chance to specify the changes you want to make to the history of the last two commits.

pick 4 add docs
pick 5 fix: add semi-colon (my bad)

Notice that this time git shows the history from oldest to most recent.

Modify the content by replacing pick with squash for the 5th commit. Save and close the text editor.

pick 4 add docs
squash 5 fix: add semi-colon (my bad)

You will be given another prompt to specify the new commit message for the combined commit. It will show you the commit messages of both the 4th and the 5th commit. Simply comment out the commit message of the 5th commit. Save and close the editor.

Git processes these commands starting from line 1. It re-applies the 4th commit. When it is time to re-apply the 5th commit, git sees the squash command, so it does not create a separate 5th commit. Instead, it amends the contents of the 5th commit into the 4th. Basically, the 5th commit is squashed, but its contents are retained by amending the previous commit.
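The way the todo list is processed can be sketched in a few lines of Python. This is a simplified illustration and not git's actual code: it supports only pick and squash, the run_todo function is invented for this example, and the file names docs.md and feature.js are hypothetical stand-ins for whatever each commit changed.

```python
def run_todo(todo, commits):
    """Replay a rebase todo list; commits maps ID -> (message, changes)."""
    result = []  # rebuilt commits, oldest first
    for line in todo:
        action, commit_id = line.split()[:2]
        message, changes = commits[commit_id]
        if action == "pick":
            result.append({"messages": [message], "changes": list(changes)})
        elif action == "squash":
            if not result:
                raise ValueError("squash needs a previous commit")
            # Fold into the previous commit instead of creating a new one.
            result[-1]["messages"].append(message)
            result[-1]["changes"].extend(changes)
        else:
            raise ValueError(f"unsupported action: {action}")
    return result


# Hypothetical contents for the two commits being replayed.
commits = {
    "4": ("add docs", ["docs.md"]),
    "5": ("fix: add semi-colon (my bad)", ["feature.js"]),
}
todo = ["pick 4 add docs", "squash 5 fix: add semi-colon (my bad)"]
rebuilt = run_todo(todo, commits)
print(len(rebuilt))           # 1
print(rebuilt[0]["changes"])  # ['docs.md', 'feature.js']
```

Two todo lines go in and one commit comes out, carrying the changes of both. Both messages are kept as well, which is why git asks you to edit the final message.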

Your job here is done!

Effects of rebasing

If you now look at the commit history, you will see that every rewritten commit received a new commit ID (here, 4 became 4').

4' (HEAD -> feature-branch) add docs
3  write tests
2  implement feature
1  (master) base commit

This change in commit ID is what causes the mismatch with the upstream commit history.
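Why does the ID change at all? A real git commit ID is a hash that covers, among other things, the commit's content, its message, and its parent's ID. The Python sketch below imitates that with a much simpler formula. The commit_id and build_history functions and the content strings v0 to v4 are assumptions made for this illustration, not git's object format.

```python
import hashlib


def commit_id(parent_id, message, content):
    """Toy commit ID: a hash over the parent ID, message, and content."""
    data = f"{parent_id}\n{message}\n{content}".encode()
    return hashlib.sha1(data).hexdigest()


def build_history(commits):
    """Chain (message, content) pairs, oldest first, into a list of IDs."""
    ids, parent = [], ""
    for message, content in commits:
        parent = commit_id(parent, message, content)
        ids.append(parent)
    return ids


before = build_history([
    ("base commit", "v0"),
    ("implement feature", "v1"),
    ("write tests", "v2"),
    ("add docs", "v3"),
    ("fix: add semi-colon (my bad)", "v4"),
])
after = build_history([
    ("base commit", "v0"),
    ("implement feature", "v1"),
    ("write tests", "v2"),
    ("add docs", "v4"),  # squashed: docs plus the semi-colon fix
])

print(before[:3] == after[:3])  # True: commits 1 to 3 keep their IDs
print(before[3] != after[3])    # True: 4 became 4'
```

Because each ID depends on the parent's ID, rewriting one commit would also change the ID of every commit that comes after it.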

Pushing the updated history

Now in your local commit history, the 'fix' commit is gone. And, it is time to update the upstream branch.

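Because your history is modified, a plain push will be rejected, so you need to push with the --force-with-lease option. Unlike a plain --force, it refuses to overwrite the upstream branch if someone else has pushed to it since you last fetched.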

git push --force-with-lease

As we have already discussed, don't force push if your branch has multiple people working on it.

If you now go to the pull request page, you will see that the updated history is reflected there. You will also see that any references to the rebased commits made in the comments have become invalid. But they are not very important.

I intentionally left out some details in this article. If you want me to expand, let me know in the comments.

How Apache AGE turns a Relational DBMS into a Graph DBMS

Rafsun Masud — Fri, 24 Feb 2023 23:44:10 +0000

You may be wondering how it is possible to use tables to store and query a graph efficiently. Well, AGE has found its way of doing it.

In case you have not heard of AGE yet, it is an open source graph database system built as an extension of Postgres. Check out their GitHub page and official website.

As someone who contributed to this project, I will explain how AGE does it and share some key insights on performance.

Little background on Graph DBMS

You will see the two most common objects in an AGE graph: vertices and edges. Generally, they are called entities or nodes.

Each entity is made of a label and some properties. Properties are simply a collection of key-value pairs (similar to a JSON object). But, what is a label? Why is label a separate thing? Why can't it be part of the properties?

A label is meant to provide a category (like students or professors), while properties are for details (like id, name, role, etc.). Although you could still put category information inside properties, using labels to categorise gives you a performance advantage. In fact, labels influence AGE's internal data structure, and you will see how.

Graphs are schemas

AGE creates and uses a separate schema (a schema is basically a namespace in Postgres) for each graph. Let's see by examples.

Run Postgres and load AGE (you'll find how to do that in the GitHub link above).

First, create a graph:

SELECT ag_catalog.create_graph('mygraph');

Next, run the command \dn to print all schemas:

test=# \dn
    List of schemas
    Name    |  Owner
------------+----------
 ag_catalog | rafsun42
 mygraph    | rafsun42
 public     | rafsun42
(3 rows)

You can see two schemas that belong to AGE: ag_catalog and mygraph.

Besides using a schema for each graph, AGE uses one catalog schema, the ag_catalog, to hold references (by means of schema name) to all AGE graphs created within the database.

In the rest of the article, we will explore the tables within the mygraph schema to understand AGE's graph data structure and performance.

Labels are tables

Building the graph with Cypher

Let's first build the graph by adding vertices and edges (AGE implements the openCypher specification as its query language):

SELECT * FROM ag_catalog.cypher('mygraph',
$$
    CREATE 
        (prof: Professor {degree: 'Biology'} ),
        (class {code: 'BIOL 1001'}),
        (prof)-[teach: Teaches {day: 'Mon'}]->(class)
    RETURN prof, teach, class
$$)
as (prof agtype, teach agtype, class agtype);

Here, we created two vertices and a directed edge between them. Together, they represent a professor teaching a biology class.

Notice how Cypher works. All three entities have properties (which look like JSON objects) and variables (to be used in the RETURN clause). But only the prof vertex and the teach edge have a label (right after the colon).

Tables and their schemas

Now, let's list the tables within the schema by running the command \dt mygraph.*:

test=# \dt mygraph.*
               List of relations
 Schema  |       Name       | Type  |  Owner
---------+------------------+-------+----------
 mygraph | Professor        | table | rafsun42
 mygraph | Teaches          | table | rafsun42
 mygraph | _ag_label_edge   | table | rafsun42
 mygraph | _ag_label_vertex | table | rafsun42
(4 rows)

As you can see, every label has its own table. Plus, there are two extra tables, called defaults, for entities with no label. Each entity is stored as a row in its label's table.

I encourage you to look at the schema of the Professor and Teaches tables by running the following two commands:

test=# \d mygraph."Professor" 
                Table "mygraph.Professor"
   Column   |  Type   | Collation | Nullable |     Default                                                       
------------+---------+-----------+----------+------------------
 id         | graphid |           | not null | ...
 properties | agtype  |           | not null | ...
Inherits: mygraph._ag_label_vertex


test=# \d mygraph."Teaches" 
                Table "mygraph.Teaches"
   Column   |  Type   | Collation | Nullable |      Default                                                     
------------+---------+-----------+----------+-------------------
 id         | graphid |           | not null | ...
 start_id   | graphid |           | not null | 
 end_id     | graphid |           | not null | 
 properties | agtype  |           | not null | ...
Inherits: mygraph._ag_label_edge

Why do these two tables have different schemas? Because one represents a vertex label, and the other represents an edge label. Edge entities require two additional attributes, start_id and end_id, to store the endpoints.

Performance

I hope by now you understand how AGE stores a graph and its entities in tables. Let's discuss why each label has a separate table and how that affects performance.

With this design, AGE takes advantage of partitioning with Postgres' inheritance feature. If you are not familiar with either partitioning or inheritance in Postgres, I recommend checking out my articles on these topics first.

AGE puts all vertices in a single table at the logical level and partitions them by label at the physical level. Notice the output of the \d commands again. All vertex label tables inherit from the default _ag_label_vertex table, and all edge label tables inherit from the default _ag_label_edge table.

When I mentioned before that _ag_label_vertex contains only vertices with no label, it was true only at the physical level. Each label's table has its own physical file on disk. However, at the logical level, a default table like _ag_label_vertex contains all vertices, because the other vertex label tables are its children. If you scan the _ag_label_vertex table, Postgres will fetch all vertices unless you specify not to (with the ONLY keyword).
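The difference between the physical and the logical view can be imitated with a small Python toy model. To be clear, this is not how Postgres or AGE implement inheritance. The Table class and its only flag are invented here to mirror a scan of a parent table with and without ONLY.

```python
class Table:
    """Toy table that can inherit from a parent table."""

    def __init__(self, name, parent=None):
        self.name = name
        self.rows = []      # rows physically stored in this table
        self.children = []
        if parent is not None:
            parent.children.append(self)

    def scan(self, only=False):
        """Return own rows; include descendants unless only=True."""
        found = list(self.rows)
        if not only:
            for child in self.children:
                found.extend(child.scan())
        return found


default = Table("_ag_label_vertex")
professor = Table("Professor", parent=default)

professor.rows.append({"degree": "Biology"})
default.rows.append({"code": "BIOL 1001"})  # the vertex with no label

print(len(default.scan()))           # 2: every vertex, labelled or not
print(len(default.scan(only=True)))  # 1: only the unlabelled vertex
print(len(professor.scan()))         # 1: only professors
```

Each row lives in exactly one table, yet a scan of the parent still sees all of them.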

Let's see how that helps in AGE's use cases. In a typical query, you would like to look up some entities. Think about the two ways of looking up an entity in a graph (using Cypher):

// Example-1: By label
// Returns all professors
MATCH (x: Professor) RETURN x

// Example-2: By a property
// Returns every vertex with a Biology
// degree, whatever its label
MATCH (x {degree: 'Biology'}) RETURN x

The performance of a lookup query greatly depends on how closely the relevant data are stored on the physical disk. Since the rows of a table are stored together in that table's own file, with this design all nodes with the same label are kept in the same place. When executing the MATCH in the first example, all professor nodes can be fetched from disk into main memory with a minimum of I/O calls.

For the second example, we don't know the label, so we don't know which table to scan. Because of inheritance, we don't have to scan each vertex label table manually. Instead, we can scan just _ag_label_vertex to find vertices regardless of their label.

Overall, this approach makes matching by label faster, while keeping matching by properties simple.
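A rough way to see the trade-off is to count how many rows each kind of lookup has to examine. The Python sketch below uses hypothetical sample data (the Student label, the Physics degree, and the names are made up) and ignores indexes and everything else a real query planner does, so treat it as an illustration of the idea only.

```python
# Hypothetical sample data: label -> rows stored in that label's table.
# The "" key stands in for the default table of unlabelled vertices.
TABLES = {
    "": [{"code": "BIOL 1001"}],
    "Professor": [{"degree": "Biology"}, {"degree": "Physics"}],
    "Student": [{"degree": "Biology"}, {"name": "Sam"}, {"name": "Kim"}],
}


def match_by_label(label):
    """Read only the table that belongs to the label."""
    rows = TABLES.get(label, [])
    return list(rows), len(rows)


def match_by_property(key, value):
    """Read every table, as a scan of the parent table would."""
    matches, examined = [], 0
    for rows in TABLES.values():
        for row in rows:
            examined += 1
            if row.get(key) == value:
                matches.append(row)
    return matches, examined


professors, examined = match_by_label("Professor")
print(len(professors), examined)  # 2 2

biologists, examined = match_by_property("degree", "Biology")
print(len(biologists), examined)  # 2 6
```

The label lookup touches only the Professor rows, while the property lookup has to walk through every table to find its matches.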

Summary

To summarize, each graph is a schema. Each label in a graph has its own table. For no-label entities, there are two default tables. All vertices and edges are stored in their respective tables.

As for performance, entities are put in the default tables at the logical level. But at the physical level, they are partitioned by their label. As a result of this partitioning, lookups by label become faster.