lisahui

Posted on Feb 24, 2022

What I learned working on Nebula Graph, an open source and distributed graph database

#opensource #database #deveoper

This article is based on a talk given by Dr. Min Wu, a senior expert at vesoft. Dr. Wu talked about the status quo of the global graph database market, the design and features of Nebula Graph as a distributed graph database, as well as Nebula Graph’s open-source community.

The Global Graph Database Market

Let’s start with some numbers. Markets and Markets anticipates the graph database market will reach $2.4 billion by 2023 from $821.8 million in 2018.

Graph database is still in a rising trend and it is one of Gartner’s top 10 data and analytics trends in 2021. This is because graph databases can be used in many more areas than traditional databases, including computing, processing, deep learning, and machine models.

Graph databases gained the most in popularity in the past 10 years, according to data compiled by DB-Engines. The data is based on social media mentions, Stack Overflow questions, and search trends.

Advantages of Graph Databases

One of the most significant advantages of graph databases is that they are intuitive. If you want to express the character relationships of Game of Thrones, you can use both traditional tabular databases and graph databases. But as it is shown below, using graph data is much more intuitive, though they both express the same data model.

(Tabular data)

(Graph data)

In another example, we can compare how to search in SQL databases and graph databases. For example, here is how to find out how many posts and comments were created in a given time frame and rank the results in SQL databases and graph databases.

SQL:

--PostgreSQL 

WITH RECURSIVE post_all (psa_threadid

                      , psa_thread_creatorid, psa_messageid

                      , psa_creationdate, psa_messagetype

                       ) AS (

    SELECT m_messageid AS psa_threadid

         , m_creatorid AS psa_thread_creatorid

         , m_messageid AS psa_messageid

         , m_creationdate, 'Post'

      FROM message

     WHERE 1=1 AND m_c_replyof IS NULL -- post, not comment

       AND m_creationdate BETWEEN :startDate AND :endDate

  UNION ALL

    SELECT psa.psa_threadid AS psa_threadid

         , psa.psa_thread_creatorid AS psa_thread_creatorid

         , m_messageid, m_creationdate, 'Comment'

      FROM message p, post_all psa

     WHERE 1=1 AND p.m_c_replyof = psa.psa_messageid

     AND m_creationdate BETWEEN :startDate AND :endDate

)

SELECT p.p_personid AS "person.id"

     , p.p_firstname AS "person.firstName"

     , p.p_lastname AS "person.lastName"

     , count(DISTINCT psa.psa_threadid) AS threadCount

END) AS messageCount

     , count(DISTINCT psa.psa_messageid) AS messageCount

  FROM person p left join post_all psa on (

       1=1   AND p.p_personid = psa.psa_thread_creatorid

   AND psa_creationdate BETWEEN :startDate AND :endDate

   )

 GROUP BY p.p_personid, p.p_firstname, p.p_lastname

 ORDER BY messageCount DESC, p.p_personid

 LIMIT 100;
Here is how to realize the same query using the Cypher graph query language:

--Cypher
MATCH (person:Person)<-[:HAS_CREATOR]-(post:Post)<-[:REPLY_OF*0..]-(reply:Message)
WHERE  post.creationDate >= $startDate   AND  post.creationDate <= $endDate
  AND reply.creationDate >= $startDate   AND reply.creationDate <= $endDate
person. RETURN
id,   person.firstName,   person.lastName,   count(DISTINCT post) AS threadCount,
  count(DISTINCT reply) AS messageCount
ORDER BY
  messageCount DESC,  person.id ASC
LIMIT 100

In addition, the graph ecosystem is diversified. The following is the graph technology landscape in 2020, and we can expect more graph related technologies coming along in 2021.

Nebula Graph

Now, let’s take Nebula Graph, a distributed graph database, as an example to talk about the evolution of graph technology. I will also share the challenges the team had faced when developing Nebula Graph and how we had solved them.

When we started designing the blueprint of Nebula Graph in late 2018, the team had set four goals for the database. They are scalability, production-ready, OLTP(Online Transaction Processing), and open source. The four goals are still influencing the roadmapping of Nebula Graph until today.

Scalability

Scalability is the No.1 design principle of Nebula Graph. This is because we believe the data that businesses will process in the future must be massive and there will be no way that single machines can handle them. That’s why we designed Nebula Graph in a way that it is capable of handling graph data with trillions of vertices and edges.

Production-ready

Nebula Graph is also designed to be production-ready on the first day, including the design of its query language, visualization, programmability, and DevOps.

OLTP

OLTP (online transactional processing) enables the real-time execution of large numbers of database transactions by large numbers of users, typically over the internet. One of the priorities of Nebula Graph’s design is OLTP. This makes Nebula Graph an online, high-concurrency, and low-latency graph database.

Open source

Nebula Graph is also devoted to building an open-source community and integrating with the big data world, supporting graph computing and training frameworks like Tencent Plato and Spark GraphX.

The Nebula Core

The above figure shows the ecosystem built around Nebula Graph. The section with red background is the Nebula Graph Core, which consists of three parts called Meta, Graph, and Storage.

Nebula Graph’s query language is our in-house nGQL, which is also compatible with openCypher. We have also developed clients in languages including Java, C++, Python, and Go. Then on the top we have a number of SDKs that can work with frameworks like Spark, Flink, GraphX, Tencent Plato.

Let’s dive into the Nebula core, which, as I mentioned above, consists of the meta, graph, and storage services. The meta service deals with metadata, while the storage service stores the data, and the graph service is in charge of querying. The three modules run on their independent processes, ensuring the separation of compute and storage.

The meta service manages the schema. Nebula Graph is not a schema-free database, and it requires the properties of vertices and edges to be pre-configured. The meta service also manages storage spaces, long-duration tasks, and data cleaning.

Nebula Graph can handle data with trillions of vertices and edges. This means the system must segment the data in storage and handling. Nebula Graph uses the segmentation of edges and stores vertices in partitions. Each partition may have a few replicas and run on different machines. The query engine is stateless, meaning that all query data should either be retrieved from the meta service or the storage service and there is no communication between query services.

The above is about Nebula Graph’s separation of compute and storage. Now let’s talk about data characteristics. We have mentioned that Nebula Graph is not a schema-free database and that all the data stored is pre-defined by Data Definition Languages (DDLs). We call types of vertices Tag and types of edges EdgeType. Vertices are defined using a 2-tuple consisting of the vid and the tag. Edges are defined using a 4-tuple consisting of the endpoints, EdgeType, and rank.

Nebula Graph supports primitive data types like boolean, int, and double as well as composite types like list, set, or graph data types like path and subgraph. If there is long string data stored in the database, it is usually indexed by Elasticsearch.

Here is some additional information about the storage engine. For the query engine graphd, the external interface exposed by the storage engine is a distributed graph service, but it can also be used as a distributed key-value (KV) service if necessary. In the storage engine, partitions adopt the Raft consensus protocol. Nebula Graph stores vertices and edges in separated partitions. The following figure is about how KV is implemented using the storage partitioning.

In Nebula Graph, each edge is stored as two pieces of data. As mentioned above, the storage layer relies on VID and guarantees strong consistency using the Raft protocol.

Nebula Graph uses ElasticSearch for full-text index. From Nebula Graph v2.x, our R&D team has optimized the write performance of Nebula’s indexing capability. Since v2.5.0, Nebula Graph has started to support the combination of data expiration TTL and indexing. And from v2.6.0, Nebula Graph started to support the TOSS (Transaction on storage side) function to achieve the eventual consistency of edges. That is to say, edges are either successfully written or failed at the same time when they are inserted or modified.

Nebula Graph uses its in-house nGQL as its query language. In June 2021, the International Organization for Standardization has drafted the standard for the syntax and semantics of GQL and there is a consensus between major graph database vendors.

From Nebula Graph v2.0, nGQL started to be compatible with openCypher, which was an open-source version of Neo4j’s Cypher query language. Now, nGQL supports the Doctrine Query Language (DQL) of openCypher and has developed its own vanilla nGQL syntax style in Data Manipulation Language (DML) and Data Definition Language (DDL).

We also mentioned that Nebula Graph is born to be production-ready. So it supports a wide range of operation features like data isolation, user permission and authentication, and replica configuration. Also, Nebula Graph supports clustering. Nebula Operator, which was released in April 2021, started to support Kubernetes.

As for the performance of Nebula Graph, most performance tests are carried out by users in the community, such as engineers from tech companies like Meituan, WeChat, 360 DigiTech, and WeBank, etc. The figures below are performance reports compiled by users.

As for the performance of Nebula Graph, most performance tests are carried out by users in the community, such as Meituan, WeChat, 360 DigiTech, WeBank. The figures below are performance reports compiled by users.

We have mentioned that one of the priorities of Nebula Graph’s design is OLTP. But that doesn’t mean it neglects analytical processing (AP). Nebula Graph has integrated AP frameworks like Apache Spark’s GraphX and Plato developed by Tencent.

Generally speaking, Nebula Graph performance has improved significantly when the deep traversal is conducted.

Nebula Graph meets users’ AP (Access Point) requirements of OLTP (Online Transactional Processing ) by docking with Spark’s Graph X and supports Plato, the graph computing engine of Tencent’s WeChat team. Plato docking is actually the data connection between the two engines, which needs to change the internal data format of Nebula Graph to be that of Plato and then Partition to map them one by one.

The Nebula Graph Community

Nebula Graph became open source in May 2019. Its v1.0 GA was released in June 2020, even though some companies had applied Nebula Graph in production before that. Nebula Graph first entered DB-Engines’ graph database management system ranking two years ago and now it ranks 15th on the list.

NOTE: The screenshot is the DB-Engine ranking in Apr. 2021 when this talk was given.

Nebula Graph is also one of the top open source players in China. This following report, which was published by the X-Lab of East China Normal University, ranks the companies according to the community popularity of their open source products. Vesoft Inc., the maker of Nebula Graph, ranks the eighth, before TikTok parent Bytedance, and just one position after Huawei.

Here are some of my thoughts about open source graph databases. Open source is very common in the graph database industry because it is a relatively new area and only got traction in recent years. This is why Nebula Graph chose to be open source from day one. Open source software can also attract more developers to use and gain valuable feedback from adopters.

If you encounter any problems in the process of using Nebula Graph, please refer to Nebula Graph Database Manual to troubleshoot the problem. It records in detail the knowledge points and specific usage of the graph database and the graph database Nebula Graph.