DEV Community

Cover image for Label, Vertex, Edge redesign: Ideas
Bhaskar Sharma
Bhaskar Sharma

Posted on

Label, Vertex, Edge redesign: Ideas

INTRODUCTION AND THE PROBLEM

Labels in graph databases can be understood as a way of categorizing data. Labels help us to identify one kind of vertices/edges as one type, while other kind of vertices/edges as another. The issue arises when a vertex/edge fits in multiple types, in such cases the graph database should simply assign multiple labels to one vertex/edge, however, as of right now, AGE does not support the feature of assigning more than one label to a vertex/edge.
(From here onward we shall refer to "vertex/edge" as "v/e").
Read about Apache AGE here: https://age.apache.org/
Github here: https://github.com/apache/age

THE CAUSE

Apache AGE creates a different table for every label and puts the v/e in the table that their label belong to.
An analogy can be putting balls (vertices) of certain colors (labels) in boxes (tables) that are defined by their colors.
Like putting all red balls in red box, and putting all blue balls in blue box.
Vertex ID is generated using Label ID, which makes sure 2 vertices in different labels will never have the same ID, no matter how similar they may be.

On the surface, it seems fine because it helps us categorize the data, however, it does not suggest a way of handling a ball that is colored both blue and red? Where does it go? and What goes in the other box?

BEFORE WE GET TO POSSIBLE SOLUTIONS

Before we can suggest and evaluate solutions, we need to know exactly what tasks we may be performing with multiple labels, and how will they be impacted based on our design choices.

  • Fetching a vertex (with one label).
  • Fetching a vertex (with multiple labels).
  • Inserting a vertex.
  • Deleting a vertex.
  • Ease of implementation.

Solution 0 (Spoilers: It doesn't work)

On a very surface level, labels are no different than properties. They are only special because labels are generally used to categorize data, and hence, have been given a data structure which makes it faster to perform operations that are constrained by label names. Whatever purpose label serve, can be served if there were no labels at all, and label were just a property of type array that stored label names.
So, our solution 0 will be to get rid of labels all together, store all the data in a single table, and labels will be just properties associated with each node.
Pros:

  • Very straightforward and basic solution. Cons:
  • There will be no categorization in data at all.
  • Will make even simple operations (fetching, inserting, deleting) very inefficient.
  • Is probably a move backwards.

Conclusion: The introduction was a good place to start understanding why labels are given such importance. However, this is not a good solution.

Solution 1 (Spoilers: It doesn't work either)

One way you can think of is storing the same vertex in multiple labels in case a vertex is initialized with multiple labels.
Going back to the balls in boxes example, this is equivalent of making a copy of a ball with 2 colors (red and blue), and putting one copy in each of the respective boxes.
Pros (assuming that somehow different instances of the vertex can have same id, or related id): -

  • Easy to implement (probably).
  • Easy to fetch vertices. Cons: -
  • A vertex with multiple labels will take much more space.
  • If edited, will need to be changed at multiple locations.
  • Each instance of the vertex will have a different ID because the IDs are initialized according to the Label ID (unless a workaround is thought of).
  • Hard to create and delete a vertex.

Solution 2 (Spoilers: Maybe hard to implement and can make more sense)

Another way can be storing a vertex with multiple labels in only one of the label tables. And storing a reference to that vertex in other in which the vertex is supposed to belong.
Analogically, this can be thought of as putting the ball with 2 colors in one of the boxes, attaching a string to it, that goes into the other box. The string represents the reference which is available in other box, however, the data is only stored at one place.
Pros: -

  • Data only needs to be stored at one place.
  • Other tables having a reference to the original vertex means that data only needs to be edited at one place. Cons: -
  • Might be hard to implement given that any table would be containing both actual vertices, and reference to other vertices.
  • Makes less sense because how do you make sense of the data of a reference?
  • Which table gets to keep the actual data, and which one gets to keep the reference?
  • Fetching, inserting, deleting will take a reasonable amount of time.

Solution 3 (Makes sense to me, feedback open)

How about storing all the vertices in a single table and having references to the vertices stored in other label tables?
For example, we can have one vertex table that stores all the vertices along with their IDs, properties, and the data, and the table can be sorted and indexed by the vertex IDs.
Then we can have tables for individual labels in which we only store the vertex IDs.
Let the vertex IDs be absolute and unchangeable once the vertex has been created, so there is absolutely no scope of one vertex ID referring to some other vertex.
Pros: -

  • Insertion of vertices will be quick (insert in the vertex table, and IDs in the label tables).
  • Fetching the vertices should be quick if the table is sorted and indexed.
  • Deleting the vertices should not be an issue.
  • Editing the data should not be an issue (because it is only stored at one place).
  • Deleting a label only needs to delete the vertex ID in the label table.
  • Unchangeable vertex IDs can ensure that one vertex ID does not accidentally refer to another vertex.
  • Hopefully easier to implement than the previous solution. Cons: -
  • Storing all vertices in a single table may result in some inefficiencies.

Feedback open!

Top comments (0)