<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: NebulaGraph</title>
    <description>The latest articles on DEV Community by NebulaGraph (@nebulagraph).</description>
    <link>https://dev.to/nebulagraph</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F320523%2Fe3f823c0-1357-4ec9-b977-099118cb5202.png</url>
      <title>DEV Community: NebulaGraph</title>
      <link>https://dev.to/nebulagraph</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/nebulagraph"/>
    <language>en</language>
    <item>
      <title>Combine ChatGPT with NebulaGraph Database to Predict Game Winner</title>
      <dc:creator>NebulaGraph</dc:creator>
      <pubDate>Fri, 10 Mar 2023 07:50:21 +0000</pubDate>
      <link>https://dev.to/nebulagraph/combine-chatgpt-with-nebulagraph-database-to-predict-game-winner-47cm</link>
      <guid>https://dev.to/nebulagraph/combine-chatgpt-with-nebulagraph-database-to-predict-game-winner-47cm</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;An attempt to use ChatGPT to generate code for a data scraper to predict sports events with the help of the NebulaGraph graph database and graph algorithms.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h1&gt;
  
  
  The Hype of ChatGPT
&lt;/h1&gt;

&lt;p&gt;In the hype leading up to the FIFA 2022 World Cup, I saw &lt;a href="https://cambridge-intelligence.com/fifa-world-cup-2022-prediction/"&gt;a blog post from Cambridge Intelligence&lt;/a&gt;, in which they leveraged limited information and the correlations among players, teams, and clubs to predict the winning team. As a graph technology enthusiast, I had long wanted to try something similar with NebulaGraph, to show the community how graph algorithms can extract hidden information from the overall connections in a graph.&lt;/p&gt;

&lt;p&gt;My initial plan was to finish it in about two hours, but I realized the dataset had to be parsed carefully from &lt;a href="https://en.wikipedia.org/wiki/2022_FIFA_World_Cup_squads"&gt;Wikipedia&lt;/a&gt;, a job I happen to be not very good at, so I put the idea on hold for a couple of days.&lt;/p&gt;

&lt;p&gt;In the meantime, another craze arrived: OpenAI's ChatGPT. As I was already a DALL-E 2 user (I use it to generate feature images for my blog posts), I gave it a try very quickly. I also watched others (via Twitter, blogs, and Hacker News) convince ChatGPT to do so many things that it was hard to believe it could do them all:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Implement a piece of code on demand&lt;/li&gt;
&lt;li&gt;Simulate any prompt interface: a shell, a Python REPL, a virtual machine, or even a language you invent&lt;/li&gt;
&lt;li&gt;Act out almost any given persona, and chat with you&lt;/li&gt;
&lt;li&gt;Write poetry, rap, prose&lt;/li&gt;
&lt;li&gt;Find a bug in a piece of code&lt;/li&gt;
&lt;li&gt;Explain the meaning of a complex regular expression or openCypher query&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;ChatGPT's ability to understand context has never been matched before, so much so that everyone is talking about a new way of working: mastering how to ask, convince, and prompt machines to help us do our jobs better and faster.&lt;/p&gt;

&lt;p&gt;So, after getting ChatGPT to help me write complex graph database query statements, explain the meaning of complex graph query statements, and explain a large chunk of Bison code, all of which it did WELL, I realized: why not let ChatGPT write the data-extraction code for me?&lt;/p&gt;

&lt;h1&gt;
  
  
  Grabbing data
&lt;/h1&gt;

&lt;p&gt;I really tried it and the result is... good enough.&lt;/p&gt;

&lt;p&gt;The whole process was basically me acting like a coding interviewer, or a product manager: I presented my requirements, and ChatGPT gave me a code implementation. I would then run the code, find the parts that did not make sense, point them out with suggestions, and ChatGPT would quickly understand and make the appropriate corrections, like:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--2T-ngXtI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www-cdn.nebula-graph.io/nebula-website-5.0/images/blogs/OpenGPT%2520FIFA/chatGPT-correction-process.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--2T-ngXtI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www-cdn.nebula-graph.io/nebula-website-5.0/images/blogs/OpenGPT%2520FIFA/chatGPT-correction-process.png" alt="" width="880" height="1115"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I won't reproduce the whole process here, but I have shared the generated code and the full discussion &lt;a href="https://gist.github.com/wey-gu/78cb28bee130966e7d6e9d573b51deff"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The final generated data is a CSV file.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Raw version &lt;a href="https://github.com/siwei-io/talks/files/10152775/world_cup_squads.csv"&gt;world_cup_squads.csv&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Manually modified version, with separate columns for birthday and age: &lt;a href="https://github.com/siwei-io/talks/files/10152923/world_cup_squads.csv"&gt;world_cup_squads_v0.csv&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It contains columns for team, group, squad number, position, player name, birthday, age, number of international caps, number of goals scored, and current club.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Team,Group,No.,Pos.,Player,DOB,Age,Caps,Goals,Club
Ecuador,A,1,1GK,Hernán Galíndez,(1987-03-30)30 March 1987,35,12,0,Aucas
Ecuador,A,2,2DF,Félix Torres,(1997-01-11)11 January 1997,25,17,2,Santos Laguna
Ecuador,A,3,2DF,Piero Hincapié,(2002-01-09)9 January 2002,20,21,1,Bayer Leverkusen
Ecuador,A,4,2DF,Robert Arboleda,(1991-10-22)22 October 1991,31,33,2,São Paulo
Ecuador,A,5,3MF,José Cifuentes,(1999-03-12)12 March 1999,23,11,0,Los Angeles FC
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Final version with header removed &lt;a href="https://github.com/siwei-io/talks/files/10152974/world_cup_squads_no_headers.csv"&gt;world_cup_squads_no_headers.csv&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
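&lt;p&gt;For reference, the manual birthday/age split described above can be sketched in a few lines of Python. This is a rough sketch of the idea, not the code ChatGPT generated, and the kickoff date used for the age calculation is my assumption:&lt;/p&gt;

```python
import csv
import io
import re
from datetime import date

# Two sample rows in the raw Wikipedia-derived format (see world_cup_squads.csv).
RAW = """Team,Group,No.,Pos.,Player,DOB,Age,Caps,Goals,Club
Ecuador,A,1,1GK,Hernán Galíndez,(1987-03-30)30 March 1987,35,12,0,Aucas
Ecuador,A,2,2DF,Félix Torres,(1997-01-11)11 January 1997,25,17,2,Santos Laguna
"""

def split_dob(dob, on):
    """Pull the ISO birthday out of the '(YYYY-MM-DD)...' prefix and compute the age."""
    y, m, d = (int(g) for g in re.match(r"\((\d{4})-(\d{2})-(\d{2})\)", dob).groups())
    born = date(y, m, d)
    # Subtract one year if the birthday has not yet occurred by the reference date.
    age = on.year - born.year - int((born.month, born.day) > (on.month, on.day))
    return born.isoformat(), age

for row in csv.DictReader(io.StringIO(RAW)):
    birthday, age = split_dob(row["DOB"], on=date(2022, 11, 20))  # assumed kickoff date
    print(row["Player"], birthday, age)
```

&lt;p&gt;Run against the sample rows above, each player comes out with an ISO birthday and an age that matches the original Age column.&lt;/p&gt;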

&lt;h1&gt;
  
  
  Graph algorithm to predict the FIFA World Cup 2022
&lt;/h1&gt;

&lt;p&gt;With the help of ChatGPT, I could finally try to predict the winner with graph magic. Before that, though, I needed to map the data into a graph model.&lt;/p&gt;

&lt;p&gt;If you don't care about the process, go to the &lt;a href="https://www.nebula-graph.io/posts/predict-fifa-world-cup-with-chatgpt-and-nebulagraph#post-result"&gt;predicted result&lt;/a&gt; directly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Graph modeling
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;Prerequisites: This article uses &lt;a href="https://github.com/vesoft-inc/nebula"&gt;NebulaGraph&lt;/a&gt;(Open-Source) and &lt;a href="https://docs.nebula-graph.io/3.3.0/nebula-explorer/about-explorer/ex-ug-what-is-explorer/"&gt;NebulaGraph Explorer&lt;/a&gt;(Proprietary), which you can request a free trial of on &lt;a href="https://go.aws/3VZay2I"&gt;AWS&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Graph modeling is the abstraction and representation of real-world information in the form of a "vertex -&amp;gt; edge" graph. In our case, we will project the information parsed from Wikipedia as:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Vertices&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;player&lt;/li&gt;
&lt;li&gt;team&lt;/li&gt;
&lt;li&gt;group&lt;/li&gt;
&lt;li&gt;club&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Edges&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;groupedin (which group a team belongs to)&lt;/li&gt;
&lt;li&gt;belongto (which national team a player belongs to)&lt;/li&gt;
&lt;li&gt;serve (which club a player serves in)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The age of the players, the number of international caps, and the number of goals scored naturally fit as properties on the player tag (the vertex type).&lt;/p&gt;

&lt;p&gt;The following is a screenshot of this schema in &lt;strong&gt;NebulaGraph Explorer&lt;/strong&gt; (I will just call it Explorer from now on).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--gQe7a6Fe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www-cdn.nebula-graph.io/nebula-website-5.0/images/blogs/OpenGPT%2520FIFA/schema_fifa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--gQe7a6Fe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www-cdn.nebula-graph.io/nebula-website-5.0/images/blogs/OpenGPT%2520FIFA/schema_fifa.png" alt="schema_fifa" width="880" height="505"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then, we can click the save icon in the upper right corner and the &lt;code&gt;Apply to Space&lt;/code&gt; button to create a graph space with the &lt;a href="https://docs.nebula-graph.io/3.3.0/nebula-explorer/db-management/draft/"&gt;defined schema&lt;/a&gt;:&lt;/p&gt;
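&lt;p&gt;Under the hood, applying a schema boils down to issuing nGQL DDL statements. As a rough illustration only, the statements for this model look like the ones this sketch emits; the property names and types here are my assumptions, not the exact DDL Explorer generates:&lt;/p&gt;

```python
# Hypothetical sketch of the kind of nGQL DDL that creating this schema issues.
# Property lists are assumptions based on the columns described earlier.
TAGS = {
    "player": "name string, age int, caps int, goals int",
    "team": "name string",
    "group": "name string",
    "club": "name string",
}
EDGES = {
    "groupedin": "",
    "belongto": "",
    "serve": "caps int, goals int",  # goals/caps are also kept on serve for convenience
}

def schema_ddl():
    stmts = ["CREATE TAG IF NOT EXISTS `{}`({});".format(t, p) for t, p in TAGS.items()]
    stmts += ["CREATE EDGE IF NOT EXISTS `{}`({});".format(e, p) for e, p in EDGES.items()]
    return "\n".join(stmts)

print(schema_ddl())
```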

&lt;h2&gt;
  
  
  Ingesting into NebulaGraph
&lt;/h2&gt;

&lt;p&gt;With the graph model in place, we can upload the &lt;a href="https://github.com/siwei-io/talks/files/10152974/world_cup_squads_no_headers.csv"&gt;CSV file&lt;/a&gt; (the no-header version) into Explorer, then point and click to map the different columns to the VIDs and properties of the vertices and edges.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--dyOOL97P--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www-cdn.nebula-graph.io/nebula-website-5.0/images/blogs/OpenGPT%2520FIFA/importer_config_mapping.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--dyOOL97P--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www-cdn.nebula-graph.io/nebula-website-5.0/images/blogs/OpenGPT%2520FIFA/importer_config_mapping.png" alt="" width="880" height="505"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Clicking Import loads the whole graph into NebulaGraph. After it succeeds, we can also download the complete CSV --&amp;gt; NebulaGraph Importer configuration file: &lt;a href="https://github.com/siwei-io/talks/files/10164014/config_fifa.yml.txt"&gt;nebula_importer_config_fifa.yml&lt;/a&gt;, so that we can reuse it whenever we need to re-import the same data, or share it with others.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--tqu4q1k0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www-cdn.nebula-graph.io/nebula-website-5.0/images/blogs/OpenGPT%2520FIFA/importer_log.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--tqu4q1k0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www-cdn.nebula-graph.io/nebula-website-5.0/images/blogs/OpenGPT%2520FIFA/importer_log.png" alt="" width="880" height="505"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Note: Refer to the &lt;a href="https://docs.nebula-graph.io/3.3.0/nebula-explorer/db-management/11.import-data/"&gt;Import Data Document&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;After importing, we can view the &lt;a href="https://docs.nebula-graph.io/3.3.0/nebula-explorer/db-management/10.create-schema/#view_statistics"&gt;statistics on the schema view&lt;/a&gt; page, showing us that 831 players participated in the 2022 Qatar World Cup, serving in 295 different clubs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Os7uCiLE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www-cdn.nebula-graph.io/nebula-website-5.0/images/blogs/OpenGPT%2520FIFA/data_stats.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Os7uCiLE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www-cdn.nebula-graph.io/nebula-website-5.0/images/blogs/OpenGPT%2520FIFA/data_stats.png" alt="" width="880" height="505"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Explore the graph
&lt;/h2&gt;

&lt;p&gt;Let's see what insights we can get from this information/knowledge now that it is in the form of a graph.&lt;/p&gt;

&lt;h3&gt;
  
  
  Querying the data
&lt;/h3&gt;

&lt;p&gt;Let's start by showing all the data and seeing what we get.&lt;/p&gt;

&lt;p&gt;First, with the help of NebulaGraph Explorer, I simply dragged and dropped to draw a query pattern matching any vertex type (TAG) and any type of edge between vertex types. Since we know that every vertex here is connected to others, no isolated vertices will be missed by this query pattern:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--dgjgwM12--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www-cdn.nebula-graph.io/nebula-website-5.0/images/blogs/OpenGPT%2520FIFA/query-builder-0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--dgjgwM12--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www-cdn.nebula-graph.io/nebula-website-5.0/images/blogs/OpenGPT%2520FIFA/query-builder-0.png" alt="" width="880" height="569"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then I let Explorer generate the query statement for me. It defaults to &lt;code&gt;LIMIT 100&lt;/code&gt;, so let's change that to something larger (&lt;code&gt;LIMIT 10000&lt;/code&gt;) and execute it in the Console.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--EAggW-aJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www-cdn.nebula-graph.io/nebula-website-5.0/images/blogs/OpenGPT%2520FIFA/query-builder-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--EAggW-aJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www-cdn.nebula-graph.io/nebula-website-5.0/images/blogs/OpenGPT%2520FIFA/query-builder-1.png" alt="" width="880" height="569"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Initial observation
&lt;/h3&gt;

&lt;p&gt;The result renders like this, and you can see that it naturally forms clusters.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--WuSwg0BW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www-cdn.nebula-graph.io/nebula-website-5.0/images/blogs/OpenGPT%2520FIFA/bird_view.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--WuSwg0BW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www-cdn.nebula-graph.io/nebula-website-5.0/images/blogs/OpenGPT%2520FIFA/bird_view.png" alt="" width="880" height="505"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These peripheral clusters are mostly made up of players from clubs that are not traditionally strong (though, as we have since learned, they could still win, who knows!). Many of those clubs have only one or two players, each serving a single national team or region, so they are somewhat isolated from the other clusters.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--GVjTzD3V--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www-cdn.nebula-graph.io/nebula-website-5.0/images/blogs/OpenGPT%2520FIFA/edge_teams.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--GVjTzD3V--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www-cdn.nebula-graph.io/nebula-website-5.0/images/blogs/OpenGPT%2520FIFA/edge_teams.png" alt="" width="880" height="708"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Graph algorithm-based analysis
&lt;/h3&gt;

&lt;p&gt;After I clicked two buttons in Explorer, Sized by Degrees and Colored by Louvain Algorithm (refer to the &lt;a href="https://docs.nebula-graph.com.cn/3.3.0/nebula-explorer/graph-explorer/graph-algorithm/"&gt;document&lt;/a&gt; for details), the entire graph in the browser became something like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--z1SSCree--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www-cdn.nebula-graph.io/nebula-website-5.0/images/blogs/OpenGPT%2520FIFA/Barcelona.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--z1SSCree--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www-cdn.nebula-graph.io/nebula-website-5.0/images/blogs/OpenGPT%2520FIFA/Barcelona.png" alt="" width="880" height="505"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Two graph algorithms are used to surface insights here:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Sizing the vertices by their degree, to highlight their importance&lt;/li&gt;
&lt;li&gt;Coloring the vertices by community, as detected by the Louvain algorithm&lt;/li&gt;
&lt;/ol&gt;
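&lt;p&gt;The first of the two, sizing by degree, is simple enough to sketch by hand (Louvain community detection is considerably more involved, so it is omitted here). This toy sketch only mirrors what Explorer does conceptually; the vertex names and scaling factors are made up:&lt;/p&gt;

```python
from collections import Counter

# Toy (player)-serve->(club) edges mirroring the shape of the FIFA dataset.
edges = [
    ("Neymar", "Paris Saint-Germain"),
    ("Kylian Mbappé", "Paris Saint-Germain"),
    ("Richarlison", "Tottenham Hotspur"),
    ("Harry Kane", "Tottenham Hotspur"),
    ("Son Heung-min", "Tottenham Hotspur"),
]

# A vertex's degree is simply the number of edges touching it.
degree = Counter()
for player, club in edges:
    degree[player] += 1
    degree[club] += 1

# "Sized by Degrees": scale each vertex's display size with its degree.
sizes = {v: 10 + 5 * d for v, d in degree.items()}
print(sizes)  # the best-connected club gets the biggest circle
```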

&lt;p&gt;You can see that the big red circle is the famous Barcelona, and its players are marked in red, too.&lt;/p&gt;

&lt;h2&gt;
  
  
  Winner Prediction Algorithm
&lt;/h2&gt;

&lt;p&gt;To make full use of graph magic (the conditions and information implied by the graph), my idea (&lt;a href="https://cambridge-intelligence.com/fifa-world-cup-2022-prediction/"&gt;stolen from/inspired by this post&lt;/a&gt;) is to choose a graph algorithm that considers edges when analyzing node importance, find the vertices with higher importance, rank them globally, and thus obtain the top team rankings.&lt;/p&gt;

&lt;p&gt;This approach reflects the fact that excellent players bring both community and connectivity. Meanwhile, to better differentiate the traditionally strong teams, I also take into account the information on appearances and goals scored.&lt;/p&gt;

&lt;p&gt;Ultimately, my algorithm is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Take all the &lt;code&gt;(player)-serve-&amp;gt;(club)&lt;/code&gt; relationships and filter out players with too few goals or too few goals per game (to balance out the disproportionate impact of older players from some weaker teams)&lt;/li&gt;
&lt;li&gt;Expand outwards from all the remaining players to get their national teams&lt;/li&gt;
&lt;li&gt;Run the &lt;a href="https://en.wikipedia.org/wiki/Betweenness_centrality"&gt;Betweenness Centrality&lt;/a&gt; algorithm on the above subgraph to calculate the node importance scores&lt;/li&gt;
&lt;/ul&gt;
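&lt;p&gt;For intuition, here is a minimal, self-contained sketch of Betweenness Centrality (Brandes' algorithm for unweighted, undirected graphs) on a toy graph. What the prediction actually uses is Explorer's built-in implementation, which may differ in normalization details:&lt;/p&gt;

```python
from collections import deque, defaultdict

def betweenness(adj):
    """Brandes' algorithm for unweighted, undirected betweenness centrality."""
    bc = dict.fromkeys(adj, 0.0)
    for s in adj:
        stack = []
        preds = defaultdict(list)      # predecessors on shortest paths from s
        sigma = dict.fromkeys(adj, 0)  # number of shortest paths from s
        dist = dict.fromkeys(adj, -1)
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:                   # BFS from s
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if dist[w] == -1:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = dict.fromkeys(adj, 0.0)
        while stack:                   # accumulate dependencies in reverse BFS order
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # every undirected path was counted from both endpoints, so halve the scores
    return {v: score / 2 for v, score in bc.items()}

# Tiny path graph a-b-c: all shortest paths between a and c pass through b.
print(betweenness({"a": ["b"], "b": ["a", "c"], "c": ["b"]}))
```

&lt;p&gt;On the path graph a-b-c, all shortest paths between a and c pass through b, so b collects all the centrality. That "broker" role is exactly what we want the national-team vertices to be scored on.&lt;/p&gt;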

&lt;h2&gt;
  
  
  Process of the Prediction
&lt;/h2&gt;

&lt;p&gt;First, we take out the subgraph in the pattern of &lt;code&gt;(player)-serve-&amp;gt;(club)&lt;/code&gt; for those who have scored more than ten goals and have an average of more than 0.2 goals per game.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cypher"&gt;&lt;code&gt;&lt;span class="k"&gt;MATCH&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="ss"&gt;()&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="ss"&gt;[&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="ss"&gt;]&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="ss"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt; &lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;10000&lt;/span&gt;
&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;e.goals&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt; &lt;span class="ow"&gt;AND&lt;/span&gt; &lt;span class="nf"&gt;toFloat&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e.goals&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;&lt;span class="err"&gt;/&lt;/span&gt;&lt;span class="n"&gt;e.caps&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mf"&gt;0.2&lt;/span&gt;
&lt;span class="k"&gt;RETURN&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;Note: For convenience, I have included the number of goals and caps as properties in the serve edge, too.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--xlZEtn_S--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www-cdn.nebula-graph.io/nebula-website-5.0/images/blogs/OpenGPT%2520FIFA/query_step0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--xlZEtn_S--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www-cdn.nebula-graph.io/nebula-website-5.0/images/blogs/OpenGPT%2520FIFA/query_step0.png" alt="" width="880" height="505"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then, we select all the vertices on the graph in the left toolbar, select the &lt;code&gt;belongto&lt;/code&gt; edge of the outgoing direction, expand the graph outwards (traverse), and select the icon that marks the newly expanded vertices as flags.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--HIw3_rS9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www-cdn.nebula-graph.io/nebula-website-5.0/images/blogs/OpenGPT%2520FIFA/treversal_step1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--HIw3_rS9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www-cdn.nebula-graph.io/nebula-website-5.0/images/blogs/OpenGPT%2520FIFA/treversal_step1.png" alt="treversal_step1" width="880" height="505"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now that we have the final subgraph, we use the in-browser graph algorithm feature to run BNC (Betweenness Centrality):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--N3Cl7pbm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www-cdn.nebula-graph.io/nebula-website-5.0/images/blogs/OpenGPT%2520FIFA/bnc_step2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--N3Cl7pbm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www-cdn.nebula-graph.io/nebula-website-5.0/images/blogs/OpenGPT%2520FIFA/bnc_step2.png" alt="bnc_step2" width="880" height="505"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The graph canvas then looks like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--uQC6bllD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www-cdn.nebula-graph.io/nebula-website-5.0/images/blogs/OpenGPT%2520FIFA/bnc_predict.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--uQC6bllD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www-cdn.nebula-graph.io/nebula-website-5.0/images/blogs/OpenGPT%2520FIFA/bnc_predict.png" alt="bnc_predict" width="880" height="505"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Result
&lt;/h1&gt;

&lt;p&gt;Ultimately, we sort by Betweenness Centrality value to get the predicted winning team: Brazil! 🇧🇷, followed by Belgium, Germany, England, France, and Argentina. Let's come back in two weeks and see whether the prediction turns out to be accurate :D.&lt;/p&gt;

&lt;p&gt;The sorted data is as follows:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Vertex&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Betweenness Centrality&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Brazil&lt;/strong&gt;🇧🇷&lt;/td&gt;
&lt;td&gt;3499&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Paris Saint-Germain&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;3073.33&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Neymar&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;3000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Tottenham Hotspur&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;2740&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Belgium&lt;/strong&gt;🇧🇪&lt;/td&gt;
&lt;td&gt;2587.83&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Richarlison&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;2541&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Kevin De Bruyne&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;2184&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Manchester City&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;2125&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;İlkay Gündoğan&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;2064&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Germany&lt;/strong&gt;🇩🇪&lt;/td&gt;
&lt;td&gt;2046&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Harry Kane&lt;/strong&gt; (captain)&lt;/td&gt;
&lt;td&gt;1869&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;England&lt;/strong&gt;🏴󠁧󠁢󠁥󠁮󠁧󠁿&lt;/td&gt;
&lt;td&gt;1864&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;France&lt;/strong&gt;🇫🇷&lt;/td&gt;
&lt;td&gt;1858.67&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Argentina&lt;/strong&gt;🇦🇷&lt;/td&gt;
&lt;td&gt;1834.67&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Bayern Munich&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1567&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Kylian Mbappé&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1535.33&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Lionel Messi&lt;/strong&gt; (captain)&lt;/td&gt;
&lt;td&gt;1535.33&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Gabriel Jesus&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1344&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;Feature Image Credit: The image was also generated with OpenAI, through the DALL-E 2 model &amp;amp; DALL-E 2 Outpainting. See the &lt;a href="https://user-images.githubusercontent.com/1651790/205881462-ff007725-e270-4b1e-9062-7702f01021c1.png"&gt;original image&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>graphdatabase</category>
      <category>nebulagraph</category>
      <category>nosql</category>
      <category>chatgpt</category>
    </item>
    <item>
      <title>What is a NoSQL Graph Database?</title>
      <dc:creator>NebulaGraph</dc:creator>
      <pubDate>Tue, 10 Jan 2023 06:10:59 +0000</pubDate>
      <link>https://dev.to/nebulagraph/what-is-a-nosql-graph-database-5f0i</link>
      <guid>https://dev.to/nebulagraph/what-is-a-nosql-graph-database-5f0i</guid>
      <description>&lt;p&gt;SQL databases have been dominant in the market for a while now, helping companies organize, manage, and leverage their data to meet various goals. However, as the data landscape continues to evolve, more dynamic solutions are needed to tackle challenges that the traditional databases are unable to address. That's why NoSQL databases are gaining traction and the biggest businesses are already utilizing them to great success. The global NoSQL database market size passed US$ &lt;a href="https://www.imarcgroup.com/nosql-market"&gt;7 billion&lt;/a&gt; in 2022 and is projected to grow to about US$ 35 billion in the next 5-6 years.&lt;/p&gt;

&lt;p&gt;Graph databases, a type of NoSQL database, have become particularly sought after. The market size for graph databases is projected to grow by about 22% and pass US$ &lt;a href="https://www.globenewswire.com/en/news-release/2022/07/14/2479487/28124/en/Insights-on-the-Graph-Database-Global-Market-to-2028-Rising-Demand-for-Solutions-With-the-Ability-to-Process-Low-Latency-Queries-is-Driving-Growth.html"&gt;8 billion&lt;/a&gt; by 2028. So let's look at what exactly a NoSQL graph database is, how it works, and what benefits it offers.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is a NoSQL graph database?
&lt;/h2&gt;

&lt;p&gt;A NoSQL &lt;a href="https://nebula-graph.io"&gt;graph database&lt;/a&gt; is a type of non-relational, distributed database that employs a graph model. NoSQL stands for “Not only SQL” and refers to a new breed of databases that differ from traditional relational databases in their data model and performance characteristics. Graph databases are especially useful for data built on relationships: everything from friendships on social networks to equipment supply chains or business processes. They can quickly traverse vast amounts of linked data points to discover insights and hidden connections between entities, making them ideal for network analysis, such as financial &lt;a href="https://www.nebula-graph.io/posts/graph-database-vs-relational-database"&gt;fraud detection&lt;/a&gt;, &lt;a href="https://www.nebula-graph.io/posts/use-cases-of-graph-databases-in-real-time-recommendation"&gt;recommendation engines&lt;/a&gt;, and many other use cases, all while performing at scale. &lt;/p&gt;

&lt;p&gt;Essentially, the natural structure of a NoSQL graph database encourages the creation of deep models that help businesses to uncover complex relationships across datasets; identify small changes in large amounts of related information; forecast outcomes; and create real-time dashboards with dynamic workflow automation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Also Read&lt;/strong&gt;: &lt;a href="https://www.nebula-graph.io/posts/graph-database-vs-relational-database"&gt;Difference between graph database and relational database&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Essential components of a NoSQL graph database
&lt;/h3&gt;

&lt;p&gt;A typical NoSQL graph database is made up of four essential components: nodes, edges, properties, and labels.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Nodes are the basic unit of data in a graph database. They can represent anything—a person, an organization, a product, or even a transaction. &lt;/li&gt;
&lt;li&gt;Edges are the lines that connect nodes, and they represent the graph relationships between them. &lt;/li&gt;
&lt;li&gt;Properties are the attributes of nodes and edges.&lt;/li&gt;
&lt;li&gt;Labels are used to categorize nodes and edges.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This &lt;a href="https://www.nebula-graph.io/posts/visualization-in-nebula-graph"&gt;visual structure&lt;/a&gt; of data makes it easy to see how different pieces of data are related to each other, which is one of the reasons why NoSQL graph databases are very efficient at handling relationships between data. They don't have to rely on complex join operations, which can be slow and inefficient. Instead, they can simply follow the relationships between nodes to find the data they're looking for. This makes NoSQL graph databases perfect for applications that need to handle a lot of data relationships.&lt;/p&gt;

&lt;h3&gt;
  
  
  Illustrating how a NoSQL graph database works
&lt;/h3&gt;

&lt;p&gt;Think of a NoSQL graph database as a network of data points, all connected to one another. This makes it perfect for storing and managing complex data sets, such as social media networks or customer relationship data.&lt;/p&gt;

&lt;p&gt;To illustrate this, let’s say you have a database of people and the relationships between them. In this database, each person would be represented by a node, and the relationships between them would be represented by edges.&lt;/p&gt;

&lt;p&gt;So, if John is friends with Jane, there would be an edge between John’s node and Jane’s node. And if John is also friends with Bill, there would be an edge between John’s node and Bill’s node.&lt;/p&gt;

&lt;p&gt;This might sound confusing, but it’s actually a very powerful way to store data. Because the relationships between nodes are represented by edges, it’s very easy to query the database to find out things like “who are all of the friends of John’s friends?” or “who are all of the friends of Jane’s friends?”&lt;/p&gt;

&lt;p&gt;It’s also easy to add new data to a NoSQL graph database. For example, if John makes a new friend, you can just add a new edge between John’s node and the new friend’s node. There’s no need to restructure the entire database.&lt;/p&gt;
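&lt;p&gt;The friendship example above can be sketched with a plain adjacency map. This is a toy illustration of the idea, not how a real graph database stores data; the name Alice is added for the example:&lt;/p&gt;

```python
# Each person is a node; each friendship is an edge between two nodes.
from collections import defaultdict

friends = defaultdict(set)  # node -> set of adjacent nodes

def add_friendship(a, b):
    """Adding an edge never requires restructuring existing data."""
    friends[a].add(b)
    friends[b].add(a)

def friends_of_friends(person):
    """Follow edges twice: 'who are all of the friends of John's friends?'"""
    result = set()
    for friend in friends[person]:
        result |= friends[friend]
    result.discard(person)           # exclude the starting person
    return result - friends[person]  # exclude direct friends

add_friendship("John", "Jane")
add_friendship("John", "Bill")
add_friendship("Jane", "Alice")

print(friends_of_friends("John"))  # {'Alice'}
```

&lt;p&gt;Note how adding a new friendship is a single edge insertion; nothing about the existing data has to change.&lt;/p&gt;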

&lt;h2&gt;
  
  
  Advantages of NoSQL graph databases
&lt;/h2&gt;

&lt;p&gt;NoSQL graph databases are growing in popularity, and it's not hard to see why. They offer a number of advantages over traditional relational databases.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Improved performance: Because relationships are stored directly, graph-based solutions greatly reduce query complexity. This dramatically reduces query times, allowing faster access to valuable data points. &lt;/li&gt;
&lt;li&gt;Flexibility: Instead of rigidly adhering to a predefined schema, NoSQL graph databases enable “schema-less” data models that can accommodate multiple use cases and types of business objects. &lt;/li&gt;
&lt;li&gt;No redundancy: Relationships are stored once rather than duplicated across join tables, which makes it easier for data administrators to ensure accuracy and reduces space consumption on storage media. &lt;/li&gt;
&lt;li&gt;Complex analysis: Due to their interconnected structure, graph databases provide invaluable insights by revealing hidden relationships among various elements that would otherwise go unnoticed. &lt;/li&gt;
&lt;li&gt;Migration: There is an inherent level of interoperability between various formats in this type of database structure; this simplifies migration processes and makes it easy for users to move large datasets without having to start from scratch.&lt;/li&gt;
&lt;li&gt;Large-scale data management: NoSQL graph databases can scale to store and manage billions of data points, while relational databases tend to struggle as datasets and their interconnections grow.&lt;/li&gt;
&lt;li&gt;Speed: NoSQL graph databases can be very fast, often returning traversal query results in milliseconds.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Check out this article for comprehensive &lt;a href="https://www.nebula-graph.io/posts/why-use-graph-databases"&gt;benefits of graph databases&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Examples of NoSQL graph database applications
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Social media networks: Social media networks like Facebook and Twitter use graph databases to store information about users and their relationships. This allows them to recommend friends and content to users based on their interests and connections to other users.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;eCommerce platforms: Ecommerce platforms like Amazon and eBay use graph databases to store information about products, sellers, and buyers. This allows them to recommend products to buyers based on their purchase history and connections to other products.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Location-based services: Location-based services like Foursquare and Yelp use graph databases to store information about businesses, users, and reviews. This allows them to recommend businesses to users based on their location and connections to other businesses.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A good example of a NoSQL graph database is NebulaGraph right here, a lightning-fast open-source graph database that can handle billions of nodes and trillions of edges without performance issues. And because it's based on a distributed architecture, it's highly fault-tolerant. It also offers an equally open-source graph editing &lt;a href="https://www.nebula-graph.io/posts/nebulagraph-veditor-introduction"&gt;front-end library&lt;/a&gt;. Other examples include Neo4j, OrientDB, and ArangoDB.&lt;/p&gt;

&lt;h2&gt;
  
  
  Choosing the right NoSQL graph database for your needs
&lt;/h2&gt;

&lt;p&gt;Your decision should be based on the kinds of graph shapes and connections your project requires. Consider whether you need directed or undirected graphs, reachability between graph elements, graph manipulation methods such as splitting or merging orphan nodes, and so on. Make sure to explore which graph operations the database in question supports, as not all databases support every graph operation, and some perform better than others in certain scenarios. &lt;/p&gt;

&lt;p&gt;Equally think about your budget. NoSQL graph databases can range in price from free and open source options to expensive enterprise-level options. Choose the one that best fits your needs and budget.&lt;/p&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;NoSQL graph databases are quickly becoming an integral part of managing the massive amounts of data generated by today’s tech-heavy, agile environment. With graph databases, data is organized as a graph of nodes and edges, allowing different types of information to scale flexibly and quickly without strict predefined schemas. &lt;/p&gt;

&lt;p&gt;The architecture of a NoSQL database also provides the opportunity to have more sophisticated networking and navigation, allowing for easy retrieval of entire related data sets. This makes them a great choice for Agile settings where rapid processing and the ability to manage vast amounts of data with complex relationships are highly desired features.&lt;/p&gt;

&lt;h2&gt;
  
  
  You may also like:
&lt;/h2&gt;

&lt;p&gt;Has this article sparked your interest in graph databases? If so, don't hesitate to dive in. Explore the concepts and practical applications of graph databases in more depth in this blog: &lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.nebula-graph.io/posts/what-is-a-graph-database"&gt;What is a graph database and what are its use cases - Definition, examples &amp;amp; trends&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.nebula-graph.io/posts/why-use-graph-databases"&gt;Why use a graph database? Benefits of graph databases&lt;/a&gt;&lt;/p&gt;

</description>
      <category>database</category>
      <category>nosql</category>
      <category>programming</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Graph Database vs Relational Database: What to Choose?</title>
      <dc:creator>NebulaGraph</dc:creator>
      <pubDate>Fri, 06 Jan 2023 03:23:44 +0000</pubDate>
      <link>https://dev.to/nebulagraph/graph-database-vs-relational-database-what-to-choose-1nip</link>
      <guid>https://dev.to/nebulagraph/graph-database-vs-relational-database-what-to-choose-1nip</guid>
      <description>&lt;p&gt;The place of data in driving business growth is evident in the way that many organizations are aggressively collecting and utilizing data. As data becomes more accessible and easier to analyze, it is playing an increasingly important role in helping organizations to identify trends, understand customer behavior, optimize products and maximize profits. In many ways, data has become the "new oil" - a valuable commodity that can be used to drive growth and competitive advantage. Prudent data organization, manipulation and management are at the heart of commoditizing this new oil. It would not have been possible to realize the kind of ground-breaking data-driven applications we are witnessing today without database tools.  From social networking sites to recommendation engines across e-Commerce marketplaces, it’s all about creating value out of data and databases are central here. In fact, the &lt;a href="https://www.researchandmarkets.com/reports/5310748/database-management-systems-dbms-global" rel="noopener noreferrer"&gt;database management systems global market&lt;/a&gt; is expected to grow at a CAGR of 12.2 percent and reach 142.7 trillion US dollars by 2027.&lt;/p&gt;

&lt;p&gt;Depending on what you want to achieve with your data, you might often end up at a crossroad where you’ll have to choose between databases. Graph database vs. relational database is one such crossroad that many may be struggling with. Let’s distinguish these two and help you choose the right one.    &lt;/p&gt;

&lt;h1&gt;
  
  
  What is a graph database?
&lt;/h1&gt;

&lt;p&gt;A graph database is a NoSQL database where data is stored as a network graph. A typical graph database contains edges, nodes, and properties that represent and store data. Relationships are first-class citizens in a graph database and are identified with unique keys. By making relationships first-class citizens, graph databases can more efficiently store, retrieve, and traverse data. Graph databases excel at handling data with complex relationships and are therefore useful for applications such as social networking, &lt;a href="https://nebula-graph.io/posts/use-cases-of-graph-databases-in-real-time-recommendation" rel="noopener noreferrer"&gt;recommendation engines&lt;/a&gt;, and fraud detection among many more. &lt;/p&gt;

&lt;p&gt;An &lt;a href="https://nebula-graph.io" rel="noopener noreferrer"&gt;open source graph database&lt;/a&gt; is always the best place to start as they come with a supportive community that ultimately creates the perfect ecosystem.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fixr445840mb3zeltzdct.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fixr445840mb3zeltzdct.png" alt="NebulaGraph Database Ecosystem" width="800" height="262"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  How does a graph database work?
&lt;/h1&gt;

&lt;p&gt;Graph databases work by using two essential elements: nodes and edges. The nodes in a graph represent entities, and the edges represent relationships between those entities. This allows for more flexible and efficient querying. For example, in a social network, each user would be represented by a node, and the edges would represent the relationships between users. A graph database could then easily answer questions such as "Who are the friends of my friends?" or "What path connects me to another user?"&lt;/p&gt;

&lt;p&gt;They perform traversal queries and apply algorithms to determine patterns, influencers, paths, points of failure, and communities. Understanding these aspects allows users to effectively analyze massive amounts of data.  &lt;/p&gt;
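&lt;p&gt;The question “What path connects me to another user?” is a traversal query. Here is a minimal sketch using breadth-first search over a small hand-built friendship graph (all the names are made up for illustration):&lt;/p&gt;

```python
from collections import deque

# A small undirected social graph as a node -> neighbors map.
graph = {
    "me":   ["ana", "bob"],
    "ana":  ["me", "carl"],
    "bob":  ["me"],
    "carl": ["ana", "dana"],
    "dana": ["carl"],
}

def shortest_path(start, goal):
    """Breadth-first traversal: the first time we reach `goal`,
    the path we followed is a shortest connecting path."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # no connection exists

print(shortest_path("me", "dana"))  # ['me', 'ana', 'carl', 'dana']
```

&lt;p&gt;A real graph database executes this kind of traversal natively instead of simulating it with joins.&lt;/p&gt;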

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fobcah980pcrzqqrj9u8m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fobcah980pcrzqqrj9u8m.png" alt="traversal queries" width="491" height="212"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Types of graph databases
&lt;/h1&gt;

&lt;p&gt;Graph databases can be categorized into two main groups based on data model and storage. &lt;/p&gt;

&lt;h2&gt;
  
  
  Data model-based graph databases
&lt;/h2&gt;

&lt;p&gt;There are three major types under this category, namely the property graph, the RDF graph, and the hypergraph.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Property graph&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Property graphs organize data in relationships, nodes, and properties and store the data on relationships or nodes. &lt;/p&gt;

&lt;p&gt;Nodes are entities and can hold multiple attributes called properties. You can tag nodes with labels that represent their roles in a domain. In addition, labels can attach metadata, like constraint or index information, to specific nodes. &lt;/p&gt;

&lt;p&gt;Relationships provide relevant connections between different nodes. A relationship is directed, named, and has a start and end node. Furthermore, relationships can have quantitative properties such as distances, costs, ratings, weights, strengths, and time intervals. Relationships are stored efficiently, so nodes can share relationships without compromising performance. Although each relationship is stored in one direction, it can be navigated in either direction. &lt;/p&gt;
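&lt;p&gt;The property graph model described above can be sketched in a few lines. The labels, property names, and values below are illustrative assumptions, not a real schema:&lt;/p&gt;

```python
from dataclasses import dataclass, field

# Nodes carry labels and properties; relationships are directed, named,
# and can hold quantitative properties such as weights or strengths.
@dataclass
class Node:
    labels: set
    props: dict = field(default_factory=dict)

@dataclass
class Relationship:
    start: str
    end: str
    name: str
    props: dict = field(default_factory=dict)

nodes = {
    "p1": Node({"Person"}, {"name": "Ada"}),
    "p2": Node({"Person"}, {"name": "Bo"}),
}
rels = [
    Relationship("p1", "p2", "KNOWS", {"strength": 0.9}),
]

# Each relationship is stored with one direction,
# but it can be navigated from either end:
incoming = [r for r in rels if r.end == "p2"]
print(incoming[0].name)  # KNOWS
```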

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyjk5wnvomdd0oasiiz3o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyjk5wnvomdd0oasiiz3o.png" alt="graph-database" width="722" height="532"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Hypergraph&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Hypergraphs are data models that allow relationships to connect to multiple nodes by allowing several nodes at either end. This will enable users to analyze and store data with numerous many-to-many relationships. The relationships in a hypergraph are referred to as hyperedges. &lt;/p&gt;

&lt;p&gt;In contrast to property graphs, hypergraphs have multidimensional edges, making them more generalized. The two models are isomorphic in the sense that any hypergraph can be represented as an equivalent property graph, although the property graph version typically needs additional nodes or relationships to express the same information. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Resource description framework (RDF)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;An RDF store, or triple store, stores data as three-part statements with a subject-predicate-object structure. Any additional information is represented with a separate node. RDF graph models consist of nodes and arcs: a statement is drawn as two nodes, one for the subject and one for the object, connected by an arc that represents the predicate. &lt;/p&gt;
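&lt;p&gt;The triple format can be sketched as a set of three-part tuples with a simple pattern matcher. The facts below are made up for illustration:&lt;/p&gt;

```python
# Triples take the form (subject, predicate, object).
triples = {
    ("Ada", "knows", "Bo"),
    ("Ada", "livesIn", "London"),
    ("Bo", "livesIn", "Paris"),
}

def match(s=None, p=None, o=None):
    """Return every triple matching the pattern; None is a wildcard."""
    return {
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    }

# Who lives where?
for subj, _, place in sorted(match(p="livesIn")):
    print(subj, "->", place)
# Ada -> London
# Bo -> Paris
```

&lt;p&gt;Note that each fact is an independent element: answering a multi-hop question means matching several patterns and stitching the results together, which is the latency cost mentioned below.&lt;/p&gt;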

&lt;p&gt;Triple stores are categorized as data model-graph databases because they process logically linked data. However, their storage engines are not optimized for property graphs, nor do they support index-free adjacency; thus, they aren’t native graph databases. &lt;/p&gt;

&lt;p&gt;RDF stores can scale horizontally but can’t rapidly traverse relationships because they store triples as individual elements. They must assemble connections from independent facts at query time, which adds latency to graph queries. Because of these shortcomings in latency and scale, triple stores are mostly used for offline analytics. &lt;/p&gt;

&lt;h2&gt;
  
  
  Storage-based graph databases
&lt;/h2&gt;

&lt;p&gt;This category contains three major types as well, namely native graph storage, relational storage, and key-value storage:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Native graph storage&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Native graph storage uses edge and vertex units to store and manage graph databases. It is best suited for multiple-hop or deep-link graph analytics. Native graph storage is designed to maximize traversal speed in arbitrary graph algorithms. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Relational storage&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This type of graph database stores the edge and vertex tables using a relational model. At runtime, a relational “JOIN” links the two tables. The relational model represents data in tables in an intuitive, easy-to-understand way: each row is a record with a unique key as its ID, each column holds an attribute of the data, and each record holds the attribute’s value. This makes it easy to identify the relationships between data points. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Key-value store&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A key-value store is a non-relational (NoSQL) database that stores data in key-value pairs, where the key is a unique identifier. Keys and values range from simple objects to complex compound objects. &lt;br&gt;
Perhaps the greatest benefit of key-value storage is that it is highly partitionable. As a result, it allows a degree of horizontal scaling that most other database types cannot achieve. &lt;/p&gt;
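&lt;p&gt;The idea of laying a graph out on top of a key-value store can be sketched as follows. This is a generic illustration, not the actual key encoding used by NebulaGraph or any other database:&lt;/p&gt;

```python
# A graph stored in a plain key-value map: vertex keys hold vertex
# properties, edge keys hold edge properties.
store = {}

def put_vertex(vid, props):
    store[("v", vid)] = props

def put_edge(src, name, dst, props):
    store[("e", src, name, dst)] = props

put_vertex("p1", {"name": "Ada"})
put_vertex("p2", {"name": "Bo"})
put_edge("p1", "knows", "p2", {"since": 2020})

def shard(key, n_shards=4):
    """Keys partition naturally: hash the vertex id to pick a shard."""
    return hash(key[1]) % n_shards

# Scan for all outgoing edges of vertex p1.
out_edges = [k for k in store if k[0] == "e" and k[1] == "p1"]
print(out_edges)  # [('e', 'p1', 'knows', 'p2')]
```

&lt;p&gt;Because each vertex or edge lives under its own key, the store can be partitioned by hashing the vertex ID, which is what makes horizontal scaling straightforward.&lt;/p&gt;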

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fot6zk11sc7pe93rn26cx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fot6zk11sc7pe93rn26cx.png" alt="vertex key" width="746" height="107"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjjydvzmdk6ak5p59mjh9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjjydvzmdk6ak5p59mjh9.png" alt="edge key" width="800" height="68"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  The graph data model vs the relational data model
&lt;/h1&gt;

&lt;p&gt;Though both graph and relational data models focus on data relationships, they handle them in very different ways.&lt;/p&gt;

&lt;p&gt;In a graph data model, you decide which entities become nodes, which become links, and which to discard. This results in a blueprint that you can use to create visualization models for charts.&lt;/p&gt;

&lt;p&gt;A relational data model groups database entries by their relationships using a table of values. Each row in a table represents corresponding data values and denotes a real-world relationship. Table and column names help in interpreting the values in each row. &lt;/p&gt;

&lt;p&gt;A graph data model stores relationships at an individual record level. On the other hand, a relational data model uses predefined structures called table definitions. However, you can convert a relational database to a graph database. &lt;/p&gt;

&lt;h1&gt;
  
  
  Key differences between graph databases and relational databases
&lt;/h1&gt;

&lt;p&gt;The fundamental difference between graph databases and relational databases lies in how relationships are handled. In a graph database, relationships are stored directly as first-class data. In a relational database, relationships are implied by key columns shared across tables.&lt;/p&gt;

&lt;p&gt;Here are the differences in detail:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Storage format:&lt;/strong&gt; A graph database stores entities as nodes and relationships as edges. A relational database stores data in tables with rows and columns and uses “JOIN” operations to connect related rows at query time. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Dataset size:&lt;/strong&gt; Graph traversal performance stays relatively stable as datasets grow, while relational join performance tends to degrade with dataset size.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Index:&lt;/strong&gt; Graph databases typically use index-free adjacency, meaning each node stores direct references to its adjacent nodes, while relational databases rely on indexes and key lookups to connect related data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Transactions:&lt;/strong&gt; Many graph databases offer weaker transactional guarantees, while relational databases have mature ACID transaction support suited to workloads such as accounting.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Processing power:&lt;/strong&gt; Graph databases can require more processing power and storage space for the same data.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
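&lt;p&gt;The storage-format and index differences above can be illustrated by answering the same question both ways. This is a toy comparison over made-up data, not a benchmark:&lt;/p&gt;

```python
# "Who are Ada's friends?" answered relationally and via adjacency.
people = [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Bo"}]
friendships = [{"a": 1, "b": 2}]

def friends_relational(name):
    """Relational style: match rows across tables on key columns."""
    pid = next(p["id"] for p in people if p["name"] == name)
    ids = ({f["b"] for f in friendships if f["a"] == pid}
           | {f["a"] for f in friendships if f["b"] == pid})
    return {p["name"] for p in people if p["id"] in ids}

# Graph style with index-free adjacency: each node already holds
# direct references to its neighbors, so no key matching is needed.
ada = {"name": "Ada"}
bo = {"name": "Bo"}
ada["friends"] = [bo]
bo["friends"] = [ada]

def friends_graph(node):
    return {n["name"] for n in node["friends"]}

print(friends_relational("Ada"))  # {'Bo'}
print(friends_graph(ada))         # {'Bo'}
```

&lt;p&gt;Both return the same answer, but the relational version must re-match keys on every hop, while the graph version simply follows stored references.&lt;/p&gt;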

&lt;h1&gt;
  
  
  Advantages of graph database over relational databases
&lt;/h1&gt;

&lt;p&gt;There are good reasons to choose a graph database over a relational database, and they come down to the advantages below. Here are the key advantages of a graph database vs a relational database:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Graph databases are better at handling relationships&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When it comes to handling relationships, graph databases are king because of the way they're built. They can easily represent complex relationships between data points.&lt;/p&gt;

&lt;p&gt;This is a huge advantage when you're trying to model data that's naturally hierarchical, like social networks or business relationships. With a graph database, you can easily create entities and track the relationships between them. This makes finding and understanding data a lot easier, which is essential when you're trying to make quick decisions based on insights from your data.&lt;/p&gt;

&lt;p&gt;In a relational database, you have to define the relationships between tables ahead of time. With a graph database, you can define relationships on the fly, which makes it much easier to handle complex data models.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Graph databases are more scalable&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This means that graph databases can handle large amounts of data without running into issues. And as data volumes continue to grow, this is an increasingly important factor to consider.&lt;/p&gt;

&lt;p&gt;You can easily divide the database across multiple servers while keeping intact vital aspects of compliance such as privacy requirements. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Graph databases can be more performant&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When it comes to graph database vs relational database performance, there are two factors at play. First, data is stored in a graph format, which makes it easier to navigate and find connections between data points. Second, the way data is structured in a graph database makes it possible to index data more effectively. This means queries can be executed more quickly, which leads to better performance and user experience.&lt;/p&gt;

&lt;p&gt;You can also keep adding relationships without compromising performance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Graph databases offer more flexible data modeling&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;With a graph database, you can model your data however you want, which means you're not limited to the rigid structures of a relational database. By representing data as a series of interconnected nodes, graph databases can more accurately capture the intricate web of relationships. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Graph databases are easier to understand&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Things are a lot simpler to understand in a graph database. In fact, you can think of a graph database as a collection of nodes and relationships between nodes. And that's it!&lt;/p&gt;

&lt;p&gt;There's no need to worry about tables, columns, or foreign keys. Just create your nodes and relationships and you're good to go. This makes data management a breeze and makes it easy for other people to understand what you're trying to do.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;6. Graph databases can be easier to trust&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Graph databases can be easier to trust because they are based on a single unified model: the graph itself holds all the information about entities and the relationships between them.&lt;/p&gt;

&lt;p&gt;This makes it much easier to find information because there is no need to join tables or run complex queries. All you need to do is find the node that you're interested in and look at the relationships associated with it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;7. Graph databases can better handle data consistency&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Let's say you're a retailer and you have a customer database. A relational database would be fine for storing information about your customers, but it wouldn't be the best tool for handling customer-product interactions.&lt;/p&gt;

&lt;p&gt;Why? Because a relational database is based on a normalized model, which means that data is divided into tables and columns. And this can lead to inconsistencies when you're trying to query data. For example, if you want to get a list of all your customers for a particular product, you might have to query two or three different tables.&lt;/p&gt;

&lt;p&gt;A graph database, on the other hand, is based on a non-normalized model, which means that data is stored in one place. This makes it much easier to query data, because everything is in one place.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;8. Graph databases can be more extensible&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;What this simply means is that they can handle more data combinations. For example, let's say you want to track the relationships between people, organizations, and things. A graph database would be perfect for this task, whereas a relational database would quickly become overloaded.&lt;/p&gt;

&lt;p&gt;Graph databases are also great for managing data that changes rapidly, such as social media data or sensor data. This is because they can quickly adapt to changes in the data model, without having to perform a full database rebuild.&lt;/p&gt;

&lt;h1&gt;
  
  
  When should you use the graph database?
&lt;/h1&gt;

&lt;p&gt;Graph databases are well suited for applications that require the storage and retrieval of data that can be represented as a graph, such as social networks, maps, and networks. They are also well suited for applications that require the analysis of data that is connected in complex ways, such as &lt;a href="https://nebula-graph.io/posts/fraud-detection-using-knowledge-and-graph-database" rel="noopener noreferrer"&gt;fraud detection&lt;/a&gt; and recommendation engines.&lt;/p&gt;

&lt;p&gt;For example, in a social network, a user's friends are also friends with each other. A graph database can quickly find all the friends of a user's friends. In contrast, a relational database would need to perform multiple joins to find the same information. By prioritizing relationships, graph databases can provide greater insight into data. In general, any application that would benefit from being able to represent data as a network of interconnected nodes would be a good candidate for a graph database.&lt;/p&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;Understanding graph database vs relational database is the first step to building effective data models that will provide valuable insights into connected data. It is also important to note that the two are not alternatives, but each serves a different purpose. &lt;/p&gt;

&lt;p&gt;The most important point that you’ll always need to bear in mind in graph vs relational databases is that graph databases are better suited for applications that require multiple relationships between data points, while relational databases are better for applications with less complex data structures.&lt;/p&gt;

&lt;h1&gt;
  
  
  FAQ Section
&lt;/h1&gt;

&lt;h2&gt;
  
  
  How does a graph database differ from a relational database?
&lt;/h2&gt;

&lt;p&gt;A graph database uses graph structures for semantic queries with nodes, edges, and properties to represent and store data. A relational database, on the other hand, uses tables and relations between them to store data. Graph databases are used to query complex relationships, whereas relational databases are used for more simple relationship structures. In addition, graph databases typically require fewer joins than relational databases. As a result, graph databases can be faster and more efficient when it comes to handling complex data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Are graph databases better than relational databases?
&lt;/h2&gt;

&lt;p&gt;To a large extent, yes. Graph databases are more flexible than relational databases and can handle more complex data relationships. Relational databases are based on a table structure that is difficult to change once the data is in the database. Graph databases, by contrast, are based on a graph structure that is easy to evolve. Considering that data is becoming increasingly complex, graph databases are a far better fit for most use cases where complex data manipulation is a priority. &lt;/p&gt;

&lt;h2&gt;
  
  
  What is a graph database not good for?
&lt;/h2&gt;

&lt;p&gt;Graph databases are well suited for storing data with complex relationships, such as social networks or financial data. However, graph databases are not well suited for storing data that can be easily represented in a tabular format, such as product catalogs or customer orders. &lt;/p&gt;

&lt;h1&gt;
  
  
  You May Also Like:
&lt;/h1&gt;

&lt;p&gt;Has this article sparked your interest in &lt;strong&gt;graph databases&lt;/strong&gt;? &lt;/p&gt;

&lt;p&gt;Explore more in-depth details of the concepts and practical applications of graph databases in this blog: &lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.nebula-graph.io/posts/what-is-a-graph-databas3" rel="noopener noreferrer"&gt;What is a graph database and what are its use cases - Definition, examples &amp;amp; trends&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  How NebulaGraph Works
&lt;/h1&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://www.nebula-graph.io/posts/nebula-graph-architecture-overview" rel="noopener noreferrer"&gt;NebulaGraph Architecture — A Bird’s Eye View&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.nebula-graph.io/posts/benchmarking-mainstraim-graph-databases-dgraph-nebula-graph-janusgraph" rel="noopener noreferrer"&gt;Benchmark: NebulaGraph vs Dgraph vs JanusGraph&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.nebula-graph.io/posts/how-to-deploy-nebula-graph-in-kubernetes" rel="noopener noreferrer"&gt;Deploy the Graph Database on Kubernetes&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>api</category>
      <category>automation</category>
      <category>learning</category>
    </item>
    <item>
      <title>Using NebulaGraph Importer to Import Data into NebulaGraph Database</title>
      <dc:creator>NebulaGraph</dc:creator>
      <pubDate>Thu, 29 Dec 2022 09:16:17 +0000</pubDate>
      <link>https://dev.to/nebulagraph/using-nebulagraph-importer-to-import-data-into-nebulagraph-database-4hnk</link>
      <guid>https://dev.to/nebulagraph/using-nebulagraph-importer-to-import-data-into-nebulagraph-database-4hnk</guid>
      <description>&lt;p&gt;NebulaGraph is now a mature product with many ecosystem tools. It offers a wide range of options in terms of data import. There is the large and comprehensive Nebula Exchange, the small and compact Nebula Importer, and the Nebula Spark Connector and Nebula Flink Connector for Spark and Flink integrations.&lt;/p&gt;

&lt;p&gt;But which of the many import methods is more convenient?&lt;/p&gt;

&lt;p&gt;Here are my takes:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Nebula Exchange&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  If you need to import streaming data from Kafka and Pulsar into the NebulaGraph database&lt;/li&gt;
&lt;li&gt;  If you need to read batch data from relational databases (e.g. MySQL) or distributed file systems (e.g. HDFS)&lt;/li&gt;
&lt;li&gt;  If you need to generate SST files recognized by NebulaGraph from large batches of data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Nebula Importer&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Nebula Importer is best for importing local CSV files into NebulaGraph&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Nebula Spark Connector:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Migrate data between different NebulaGraph clusters&lt;/li&gt;
&lt;li&gt;  Migrate data between different graph spaces within the same NebulaGraph cluster&lt;/li&gt;
&lt;li&gt;  Migrate data between NebulaGraph and other data sources&lt;/li&gt;
&lt;li&gt;  Combining Nebula Algorithm for graph computation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For more options for importing data from Spark, read: &lt;a href="https://nebula-graph.io/posts/work-with-nebula-graph-apache-spark"&gt;4 different ways to work with NebulaGraph in Apache Spark&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Nebula Flink Connector&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Migrate data between different NebulaGraph clusters&lt;/li&gt;
&lt;li&gt;  Migrate data between different graph spaces within the same NebulaGraph cluster&lt;/li&gt;
&lt;li&gt;  Migrate data between NebulaGraph and other data sources&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Overall, Nebula Exchange is large and comprehensive: it can import data from most storage engines into Nebula, but it requires a deployed Spark environment.&lt;/p&gt;

&lt;p&gt;Nebula Importer is simple to use and has few dependencies, but you need to generate the data files yourself in advance and configure the schema up front. It does not support resuming an interrupted import, so it is best suited to medium data volumes.&lt;/p&gt;

&lt;p&gt;The Spark / Flink Connectors are meant to be used together with existing Spark or Flink streaming and batch pipelines.&lt;/p&gt;

&lt;p&gt;Choose the tool that fits the scenario. For newcomers to Nebula, Nebula Importer is recommended because it is easy to use and quick to get started with.&lt;/p&gt;

&lt;h2&gt;
  
  
  Using Nebula Importer
&lt;/h2&gt;

&lt;p&gt;When we first adopted NebulaGraph, the ecosystem was not yet mature and only some of our businesses had migrated, so we used to import &lt;a href="https://nebula-graph.io"&gt;NebulaGraph &lt;/a&gt;data, whether full or incremental, by pushing Hive tables to Kafka and consuming Kafka to write to NebulaGraph in batches. Later, as more data and businesses switched to NebulaGraph, import efficiency became an increasingly serious problem: the growing import time made it unacceptable to still be running full imports during peak business hours.&lt;/p&gt;

&lt;p&gt;To address this, after trying Nebula Spark Connector and Nebula Importer, we settled on &lt;em&gt;Hive table → CSV → Nebula Server → Nebula Importer&lt;/em&gt; for full imports, for ease of maintenance and migration; the overall time spent dropped significantly.&lt;/p&gt;
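&lt;p&gt;As a rough illustration of the Hive table → CSV step (a hypothetical sketch using only the Python standard library; the function name and escaping rules are our own assumptions, not the pipeline's actual code), a tab-separated Hive-style dump can be rewritten into the comma-separated files Nebula Importer consumes:&lt;/p&gt;

```python
import csv
import io


def tsv_to_csv(tsv_text):
    """Convert a tab-separated Hive-style dump into CSV text for a
    Nebula Importer file entry.

    Illustrative only: a real export also needs NULL handling and
    escaping rules agreed with Hive.
    """
    out = io.StringIO()
    writer = csv.writer(out)  # default "excel" dialect, \r\n line endings
    for line in tsv_text.splitlines():
        writer.writerow(line.split("\t"))
    return out.getvalue()
```

&lt;p&gt;For example, &lt;code&gt;tsv_to_csv("player100\tTim\t42\n")&lt;/code&gt; yields a CSV row ready to be listed in the importer's &lt;code&gt;files&lt;/code&gt; section.&lt;/p&gt;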

&lt;h2&gt;
  
  
  Configuring Nebula Importer
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;System environment&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[root@nebula-server-prod-05 importer]# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                16
On-line CPU(s) list:   0-15
Thread(s) per core:    2
Core(s) per socket:    8
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 85
Model name:            Intel(R) Xeon(R) Platinum 8269CY CPU @ 2.50GHz
Stepping:              7
CPU MHz:               2499.998
BogoMIPS:              4999.99
Hypervisor vendor:     KVM
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              1024K
L3 cache:              36608K
NUMA node0 CPU(s):     0-15
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Disk：SSD
Memory: 128G
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Cluster Environment&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;NebulaGraph Version: v2.6.1
Deployment Method: RPM
Cluster size: 3 replicas, 6 nodes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Data Size&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+---------+--------------------------+-----------+
| "Space" | "vertices"               | 559191827 |
+---------+--------------------------+-----------+
| "Space" | "edges"                  | 722490436 |
+---------+--------------------------+-----------+

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Nebula Importer configuration&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Graph version, set to v2 when connecting 2.x.
version: v2
description: Relation Space import data
# Whether to remove temporarily generated log and error data files.
removeTempFiles: false
clientSettings:
  # The number of retries for failed nGQL statement execution.
  retry: 3
  # Number of concurrency for NebulaGraph clients.
  concurrency: 5
  # The size of the cache queue for each NebulaGraph client.
  channelBufferSize: 1024
  # The NebulaGraph graph space to import data into.
  space: Relation
  # Connection information.
  connection:
    user: root
    password: ******
    address: 10.0.XXX.XXX:9669,10.0.XXX.XXX:9669
  postStart:
    # Configure some actions to be performed before inserting data after connecting to the NebulaGraph server.
    commands: |
    # The interval between the execution of the above commands and the execution of the insert data command.
    afterPeriod: 1s
  preStop:
    # Configure some actions to be performed before disconnecting from the NebulaGraph server.
    commands: |
# The path to the file where log messages such as errors will be output.    
logPath: /mnt/csv_file/prod_relation/err/test.log
....
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
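&lt;p&gt;The &lt;code&gt;clientSettings&lt;/code&gt; above describe a pool of &lt;code&gt;concurrency&lt;/code&gt; clients fed by a bounded channel of size &lt;code&gt;channelBufferSize&lt;/code&gt;. A minimal model of that pattern (hypothetical Python, not the importer's actual Go implementation) looks like this:&lt;/p&gt;

```python
import queue
import threading


def import_rows(rows, concurrency=5, channel_buffer_size=1024, batch_size=4):
    """Toy model of the importer's client pool: a bounded channel feeds
    `concurrency` workers, each writing one batch per request."""
    channel = queue.Queue(maxsize=channel_buffer_size)
    written = []
    lock = threading.Lock()

    def worker():
        while True:
            batch = channel.get()
            if batch is None:      # poison pill: shut this worker down
                return
            with lock:             # stand-in for one INSERT request
                written.extend(batch)

    workers = [threading.Thread(target=worker) for _ in range(concurrency)]
    for w in workers:
        w.start()
    # Producer: slice the input into batches; put() blocks when the
    # channel is full, which is exactly the back-pressure the real
    # channelBufferSize setting provides.
    for i in range(0, len(rows), batch_size):
        channel.put(rows[i:i + batch_size])
    for _ in workers:
        channel.put(None)
    for w in workers:
        w.join()
    return written
```

&lt;p&gt;Raising &lt;code&gt;concurrency&lt;/code&gt; adds parallel writers; raising &lt;code&gt;channelBufferSize&lt;/code&gt; lets the reader run further ahead of the writers before blocking.&lt;/p&gt;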



&lt;p&gt;We set up a crontab job: Hive generates the tables and transfers them to the NebulaGraph server, and the Nebula Importer task runs at night when traffic is low:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;50 03 15 * * /mnt/csv_file/importer/nebula-importer -config /mnt/csv_file/importer/rel.yaml &amp;gt;&amp;gt; /root/rel.log
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The full import took 2 hours in total, completing by 6 am.&lt;/p&gt;

&lt;p&gt;Some of the logs are shown below; the import speed peaks at about 200,000 rows/s.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2022/05/15 03:50:11 [INFO] statsmgr.go:62: Tick: Time(10.00s), Finished(1952500), Failed(0), Read Failed(0), Latency AVG(4232us), Batches Req AVG(4582us), Rows AVG(195248.59/s)
2022/05/15 03:50:16 [INFO] statsmgr.go:62: Tick: Time(15.00s), Finished(2925600), Failed(0), Read Failed(0), Latency AVG(4421us), Batches Req AVG(4761us), Rows AVG(195039.12/s)
2022/05/15 03:50:21 [INFO] statsmgr.go:62: Tick: Time(20.00s), Finished(3927400), Failed(0), Read Failed(0), Latency AVG(4486us), Batches Req AVG(4818us), Rows AVG(196367.10/s)
2022/05/15 03:50:26 [INFO] statsmgr.go:62: Tick: Time(25.00s), Finished(5140500), Failed(0), Read Failed(0), Latency AVG(4327us), Batches Req AVG(4653us), Rows AVG(205619.44/s)
2022/05/15 03:50:31 [INFO] statsmgr.go:62: Tick: Time(30.00s), Finished(6080800), Failed(0), Read Failed(0), Latency AVG(4431us), Batches Req AVG(4755us), Rows AVG(202693.39/s)
2022/05/15 03:50:36 [INFO] statsmgr.go:62: Tick: Time(35.00s), Finished(7087200), Failed(0), Read Failed(0), Latency AVG(4461us), Batches Req AVG(4784us), Rows AVG(202489.00/s)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then at 7:00 am, Kafka is re-consumed based on timestamps to import the incremental data generated between midnight and 7:00 am, preventing the t+1 full import from overwriting the day's incremental data.&lt;/p&gt;

&lt;p&gt;The incremental consumption takes about 10-15 min.&lt;/p&gt;
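&lt;p&gt;The timestamp-based replay can be modeled as follows (a hypothetical sketch over an in-memory message list; against real Kafka one would seek to the starting offset by message timestamp instead of scanning):&lt;/p&gt;

```python
def replay_increment(messages, start_ts, end_ts):
    """Select messages whose timestamp falls in [start_ts, end_ts).

    `messages` is an iterable of (timestamp, payload) pairs standing in
    for a Kafka topic; re-consuming midnight..7:00 re-applies the day's
    incremental writes on top of the t+1 full import.
    """
    return [payload for ts, payload in messages if start_ts <= ts < end_ts]
```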

&lt;h2&gt;
  
  
  Real-time
&lt;/h2&gt;

&lt;p&gt;The incremental data obtained from the MD5 comparison is imported into Kafka, and the Kafka data is consumed in real time to keep the data delay under 1 minute.&lt;/p&gt;
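&lt;p&gt;The MD5 comparison mentioned above can be sketched like this (hypothetical Python; the column serialization and key name are illustrative assumptions): hash each row's serialized properties and emit only the rows whose hash differs from the previous snapshot:&lt;/p&gt;

```python
import hashlib


def row_digest(row):
    """Stable MD5 over a row's properties (row: dict of column -> value)."""
    serialized = "\x01".join(f"{k}={row[k]}" for k in sorted(row))
    return hashlib.md5(serialized.encode("utf-8")).hexdigest()


def diff_rows(previous_digests, rows, key="id"):
    """Return (changed_rows, new_digests).

    Rows whose digest differs from the previous snapshot are the
    increment to push to Kafka; the new digest map becomes the next
    snapshot.
    """
    changed, digests = [], {}
    for row in rows:
        d = row_digest(row)
        digests[row[key]] = d
        if previous_digests.get(row[key]) != d:
            changed.append(row)
    return changed, digests
```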

&lt;p&gt;In addition, there may be unanticipated data issues that go undetected in real time for a long time, so a full import is run every 30 days using the Nebula Importer process described above. A TTL of 35 days is also set on the vertices and edges of the space, so that any data not updated in time is filtered out and subsequently recycled.&lt;/p&gt;

&lt;h2&gt;
  
  
  About the author
&lt;/h2&gt;

&lt;p&gt;Reid is an engineer at Qichacha, China’s biggest corporate information platform.&lt;/p&gt;

</description>
      <category>database</category>
      <category>opensource</category>
      <category>graphdatabase</category>
      <category>sql</category>
    </item>
    <item>
      <title>Introducing an Open Source Graph Editing Library: VEditor by NebulaGraph Database</title>
      <dc:creator>NebulaGraph</dc:creator>
      <pubDate>Fri, 09 Dec 2022 02:39:46 +0000</pubDate>
      <link>https://dev.to/nebulagraph/introducing-an-open-source-graph-editing-library-nebulagraph-veditor-3p4b</link>
      <guid>https://dev.to/nebulagraph/introducing-an-open-source-graph-editing-library-nebulagraph-veditor-3p4b</guid>
      <description>&lt;p&gt;NebulaGraph VEditor is a high-performance, highly customizable WYSIWYG visualization editing front-end library. NebulaGraph VEditor is based on SVG drawing, and it is easy to develop and customize the drawing by reasonably abstracting the code structure. It is extremely suitable for WYSIWYG editing and preview scenarios such as approval flow, workflow, kinship, ETL processing, and graph query. You can use VEditor to visually query, edit, and model graphs. &lt;/p&gt;

&lt;p&gt;After continuous iteration and polishing, VEditor has become fairly mature, and its code is now open source. In this post, I'd like to share some of the ideas behind its design.&lt;/p&gt;

&lt;h2&gt;
  
  
  Basic Features
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Customizable node and edge styles.&lt;/li&gt;
&lt;li&gt;Flat, simple, and clear code structure.&lt;/li&gt;
&lt;li&gt;Mini-map and magnetic alignment lines.&lt;/li&gt;
&lt;li&gt;Shortcuts for common operations.&lt;/li&gt;
&lt;li&gt;History management.&lt;/li&gt;
&lt;li&gt;Lightweight, only 160 KB in size before compression.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Design Ideas
&lt;/h2&gt;

&lt;p&gt;When I first went looking for a graph editor library, I just wanted one I could customize and animate. After trying many flowchart libraries, I found that most have bloated interfaces and complex class hierarchies. As a front-end developer, that went against my philosophy of writing simple, streamlined, low-coupling code. So I decided to write a lightweight library myself to meet my needs.&lt;/p&gt;

&lt;p&gt;The design philosophy of VEditor is to keep things light for developers: fewer APIs to learn and less reliance on other libraries, without sacrificing customizability and comprehensibility.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvmagbymdyz5ffdk29whz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvmagbymdyz5ffdk29whz.png" width="800" height="494"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The overall architecture mainly uses events for dependency management between entities, and it is likewise recommended to track state changes of the whole flowchart through events.&lt;/p&gt;

&lt;p&gt;Among them, the rendering process is semi-automatic: after changing the flowchart data, rendering must be triggered manually. Other state operations on the canvas trigger the user-defined shape-rendering functions to render nodes and lines. &lt;/p&gt;

&lt;h2&gt;
  
  
  Rendering Process
&lt;/h2&gt;

&lt;p&gt;VEditor uses SVG to render the canvas. The declarative use of SVG makes the internal structure visible externally, which is convenient for developers to customize the rendering. Users can directly rewrite the relevant SVG styles outside and can directly use the SVG DOM (Document Object Model) to manipulate mouse events and animate nodes.&lt;/p&gt;

&lt;p&gt;For shape rendering, user-defined rendering functions are registered through the exposed Shape interface. In this sense, VEditor can render with any technique, as long as the rendering interface returns an SVG DOM node, which can be an SVGElement or a ForeignObject. I highly recommend managing SVG rendering through a DOM framework such as React or Vue. In some cases, you can even wrap a Canvas to render WebGL nodes, which greatly extends customizability.&lt;/p&gt;

&lt;p&gt;Besides nodes, node anchors and lines also support custom rendering once the corresponding interface is registered as a Shape. In real business scenarios, we use this feature to implement dynamic addition, deletion, and modification of algorithm-parameter anchors, the input and output anchors of OLTP queries (figure 1), filtering of edges in graph visualization, and step rendering (figure 2). You are welcome to apply for an Explorer free trial. Just click &lt;a href="https://nebula-graph.com.cn/products/explorer/" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fai003bqt5zofj43jkb3s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fai003bqt5zofj43jkb3s.png" width="800" height="412"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjlivh97wx2nl2l8nlzi1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjlivh97wx2nl2l8nlzi1.png" width="300" height="296"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Data Structure Design
&lt;/h2&gt;

&lt;p&gt;The data structure of VEditor is similar to that of most libraries, but it does not destroy the user's object references: any data the user mounts on a node or line object is kept as-is, which makes it convenient to attach business data to vertices and edges once they are configured. Accordingly, history operations such as Redo and Undo store the user's data as snapshots.&lt;/p&gt;

&lt;h2&gt;
  
  
  Performance Design
&lt;/h2&gt;

&lt;p&gt;It is well known that SVG performs much worse than Canvas when rendering large numbers of elements, a disadvantage that comes with its greater ease of use. This is especially obvious when initializing many complex or animated nodes. To address this, the data-rendering part of VEditor is asynchronous, deferring the rendering of anchors to the next event loop to avoid the forced browser reflows caused by fetching a large number of bounding boxes synchronously. After drawing, the corresponding nodes are cached to avoid repeated fetching.&lt;/p&gt;

&lt;p&gt;When adding nodes or lines, SVG's DOM nature lets the browser handle dirty-region rendering automatically, so incremental rendering performance is not much different from Canvas; it is mainly interaction and animation that are slower, because they cause a lot of DOM redrawing. The current performance target is smooth rendering of 1,000 nodes with complex shapes, which is easy to achieve in flowchart-editing scenarios.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa63huhhw8engs1adexma.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa63huhhw8engs1adexma.jpeg" width="800" height="506"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Interaction Design
&lt;/h2&gt;

&lt;p&gt;VEditor provides a Dagre-based directed-graph layout by default, automatically centering all nodes after the Dagre layout runs. It also provides a fit-to-view function. Unlike other libraries, it resets the coordinates of the current nodes to the fitted positions, so the fitted layout can be restored directly after the user saves the current data.&lt;/p&gt;

&lt;p&gt;VEditor's mini-map is rendered with canvg, which converts the SVG directly to Canvas; this keeps the mini-map accurate while reducing performance loss. For interaction, VEditor lets you zoom the canvas and drag and drop elements.&lt;/p&gt;

&lt;h2&gt;
  
  
  Future Plans
&lt;/h2&gt;

&lt;p&gt;VEditor is expected to become a data editor and renderer for any field as its usage scenarios expand. We will also keep improving its performance and user experience while adding the following features. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Box selection and multi-select operations.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Undirected graphs and double arrows.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Further performance optimizations.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;👉 GitHub open source address: &lt;a href="https://github.com/vesoft-inc/nebulagraph-veditor" rel="noopener noreferrer"&gt;https://github.com/vesoft-inc/nebulagraph-veditor&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  You Might Also Like
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.nebula-graph.io/products/explorer" rel="noopener noreferrer"&gt;Graph Database Visualization: NebulaGraph Explorer&lt;/a&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>career</category>
    </item>
    <item>
      <title>Exploring Geospatial Data with NebulaGraph Database</title>
      <dc:creator>NebulaGraph</dc:creator>
      <pubDate>Fri, 02 Dec 2022 08:54:37 +0000</pubDate>
      <link>https://dev.to/nebulagraph/exploring-geospatial-data-with-nebulagraph-database-1ki1</link>
      <guid>https://dev.to/nebulagraph/exploring-geospatial-data-with-nebulagraph-database-1ki1</guid>
      <description>&lt;h1&gt;
  
  
  What is geospatial data?
&lt;/h1&gt;

&lt;p&gt;Geospatial data is information related to geospatial entities, such as points, lines, and polygons.&lt;/p&gt;

&lt;p&gt;NebulaGraph 2.6 supports geospatial data: you can store, compute on, and retrieve it in NebulaGraph. Geography is the data type NebulaGraph uses to represent geospatial data, expressed as longitude and latitude coordinates.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to use geospatial data in NebulaGraph?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Create Schema&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The following example shows how to create tags. You can create edge types in the same way.&lt;/p&gt;

&lt;p&gt;NebulaGraph currently supports three types of geospatial data: Point, LineString, and Polygon. The following shows how to create geography types and how to insert geospatial data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TAG any_shape(geo geography);
CREATE TAG only_point(geo geography(point));
CREATE TAG only_linestring(geo geography(linestring));
CREATE TAG only_polygon(geo geography(polygon));
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When no geography type is specified, data of any geography type can be stored; when a type is specified, only geospatial data of that type can be stored. For example, &lt;code&gt;geography(point)&lt;/code&gt; means that only point data can be stored.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Insert Data&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Insert data in the &lt;code&gt;geo&lt;/code&gt; column of the &lt;code&gt;any_shape&lt;/code&gt; tag.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;INSERT VERTEX any_shape(geo) VALUES "101":(ST_GeogFromText("POINT(120.12 30.16)"));
INSERT VERTEX any_shape(geo) VALUES "102":(ST_GeogFromText("LINESTRING(3 8, 4.7 73.23)"));
INSERT VERTEX any_shape(geo) VALUES "103":(ST_GeogFromText("POLYGON((75.3 45.4, 112.5 53.6, 122.7 25.5, 93.9 28.6, 75.3 45.4))"));
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Insert data in the &lt;code&gt;geo&lt;/code&gt; column of the &lt;code&gt;only_point&lt;/code&gt; tag.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;INSERT VERTEX only_point(geo) VALUES "201":(ST_Point(120.12，30.16)"));;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Insert data in the &lt;code&gt;geo&lt;/code&gt; column of the &lt;code&gt;only_linestring&lt;/code&gt; tag.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;INSERT VERTEX only_linestring(geo) VALUES "302":(ST_GeogFromText("LINESTRING(3 8, 4.7 73.23)"));
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Insert data in the &lt;code&gt;geo&lt;/code&gt; column of the &lt;code&gt;only_polygon&lt;/code&gt; tag.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;INSERT VERTEX only_polygon(geo) VALUES "403":(ST_GeogFromText("POLYGON((75.3 45.4, 112.5 53.6, 122.7 25.5, 93.9 28.6, 75.3 45.4))"));
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When the data inserted does not meet the requirements of the specified type, the data insertion fails.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(root@nebula) [geo]&amp;gt; INSERT VERTEX only_polygon(geo) VALUES "404":(ST_GeogFromText("POINT((75.3 45.4))"));
[ERROR (-1005)]: Wrong value type: ST_GeogFromText("POINT((75.3 45.4))")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can see that inserting geospatial data is rather distinctive and differs markedly from inserting basic types such as &lt;code&gt;int&lt;/code&gt;, &lt;code&gt;string&lt;/code&gt;, and &lt;code&gt;bool&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Let's take &lt;code&gt;ST_GeogFromText("POINT(120.12 30.16)")&lt;/code&gt; as an example, &lt;code&gt;ST_GeogFromText&lt;/code&gt; is a geographic location information parsing function, which accepts a string type of geographic location data in WKT (Well-Known Text) standard format.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;POINT(120.12 30.16)&lt;/code&gt; represents a geographic point at longitude 120.12°E and latitude 30.16°N. The &lt;code&gt;ST_GeogFromText&lt;/code&gt; function parses the WKT argument and constructs a geography data object from it, and the &lt;code&gt;INSERT&lt;/code&gt; statement then stores it in NebulaGraph in the WKB (Well-Known Binary) format.&lt;/p&gt;
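&lt;p&gt;To make the WKT-parsing step concrete, here is a minimal toy parser (plain Python; this is an illustrative sketch, not NebulaGraph's actual &lt;code&gt;ST_GeogFromText&lt;/code&gt; implementation) that extracts the longitude/latitude pair from a WKT &lt;code&gt;POINT&lt;/code&gt; string:&lt;/p&gt;

```python
import re


def parse_wkt_point(wkt):
    """Parse a WKT string like 'POINT(120.12 30.16)' into (lon, lat).

    Toy parser for illustration: handles only the POINT type, and only
    checks that coordinates are within longitude/latitude ranges.
    """
    m = re.fullmatch(
        r"\s*POINT\s*\(\s*(-?\d+(?:\.\d+)?)\s+(-?\d+(?:\.\d+)?)\s*\)\s*", wkt
    )
    if m is None:
        raise ValueError(f"not a WKT POINT: {wkt!r}")
    lon, lat = float(m.group(1)), float(m.group(2))
    if not (-180.0 <= lon <= 180.0 and -90.0 <= lat <= 90.0):
        raise ValueError(f"coordinates out of range: {lon}, {lat}")
    return lon, lat
```

&lt;p&gt;Note that a malformed string such as &lt;code&gt;POINT((75.3 45.4))&lt;/code&gt; is rejected, mirroring the type error NebulaGraph itself raises in the earlier example.&lt;/p&gt;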

&lt;h2&gt;
  
  
  Geospatial functions
&lt;/h2&gt;

&lt;p&gt;The geospatial functions supported by NebulaGraph can be divided into the following main categories:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Constructing functions&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;ST_Point(longitude, latitude)&lt;/code&gt;: Constructs a &lt;code&gt;geography point&lt;/code&gt; object based on a latitude and longitude pair.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;Parsing functions&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;ST_GeogFromText(wkt_string)&lt;/code&gt;: Parses &lt;code&gt;geography&lt;/code&gt; objects from the WKT text.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ST_GeogFromWKB(wkb_string)&lt;/code&gt;: Parses &lt;code&gt;geography&lt;/code&gt; objects from WKB data. (Not yet supported, because NebulaGraph does not yet support binary strings.)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;Format setting functions&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;ST_AsText(geography)&lt;/code&gt;: Outputs the &lt;code&gt;geography&lt;/code&gt; object in the WKT text format.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ST_AsBinary(geography)&lt;/code&gt;: Outputs the &lt;code&gt;geography&lt;/code&gt; object in the WKB binary format. (Not yet supported, because NebulaGraph does not yet support binary strings.)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;Conversion functions&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;ST_Centroid(geography)&lt;/code&gt;: Calculates the center of gravity of a &lt;code&gt;geography&lt;/code&gt; object, which is a &lt;code&gt;geography point&lt;/code&gt; object.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;The predicate function&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;ST_Intersects(geography_1, geography_2)&lt;/code&gt;: Determines whether two &lt;code&gt;geography&lt;/code&gt; objects intersect.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ST_Covers(geography_1, geography_2)&lt;/code&gt;: Determines if the first &lt;code&gt;geography&lt;/code&gt; object completely covers the second.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ST_CoveredBy(geography_1, geography_2)&lt;/code&gt;: The converse of &lt;code&gt;ST_Covers&lt;/code&gt;: determines if the first &lt;code&gt;geography&lt;/code&gt; object is completely covered by the second.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ST_DWithin(geography_1, geography_2, distance_in_meters)&lt;/code&gt;: Determines if the shortest distance between two &lt;code&gt;geography&lt;/code&gt; objects is less than the given distance.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;The metric function&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;ST_Distance(geography_1, geography_2)&lt;/code&gt;: Calculates the distance between two &lt;code&gt;geography&lt;/code&gt; objects.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;These function interfaces follow the OpenGIS Simple Feature Access and ISO SQL/MM standards. For details, see &lt;a href="https://docs.nebula-graph.com.cn/2.6.1/3.ngql-guide/6.functions-and-expressions/14.geo/" rel="noopener noreferrer"&gt;NebulaGraph doc&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Geospatial Index
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What is a geospatial index?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Geospatial indexes are indexes that can quickly filter data for the predicate functions &lt;code&gt;ST_Intersects&lt;/code&gt; and &lt;code&gt;ST_Covers&lt;/code&gt;. &lt;/p&gt;

&lt;p&gt;NebulaGraph uses the Google S2 library as the geospatial index. &lt;/p&gt;

&lt;p&gt;The S2 library projects the Earth's surface onto the six faces of a surrounding cube, recursively subdivides each face into four smaller squares n times, and uses a space-filling curve, the Hilbert curve, to connect the centers of these small square cells.&lt;/p&gt;

&lt;p&gt;When n is infinitely large, this Hilbert curve almost fills the square.&lt;/p&gt;

&lt;p&gt;The S2 library uses a Hilbert curve of order 30.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsxlu9lpjsnp4wcntm2hz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsxlu9lpjsnp4wcntm2hz.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The following figure shows that the Earth is filled with Hilbert curves.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxe5knh5fza77x5s1lew2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxe5knh5fza77x5s1lew2.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It can be seen that the Earth's surface is divided into cells by these Hilbert curves. For any geographic shape on the earth's surface, such as a city, a river, or a person's location, we can use several of these cells to completely cover the geographic shape.&lt;/p&gt;

&lt;p&gt;Each cell is identified by a unique int64 CellID. Thus, the spatial index of a geographic object is the set of S2 cells that are constructed to completely cover the geographic shape.&lt;/p&gt;

&lt;p&gt;When constructing an index of a geospatial object, a collection of different S2 cells that completely cover the indexed object is constructed. The indexing query based on spatial predicate functions quickly filters out a large number of irrelevant geographic objects by finding the intersection between the set of S2 cells that cover the queried object and the S2 cells that cover the indexed object.&lt;/p&gt;
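&lt;p&gt;The locality property that makes the Hilbert curve useful for indexing, namely that cells adjacent on the curve are adjacent on the surface, can be illustrated with the classic 2-D Hilbert index mapping. This is a self-contained toy sketch, not the S2 library itself:&lt;/p&gt;

```python
def rot(n, x, y, rx, ry):
    """Rotate/flip a quadrant so sub-curves connect end to end."""
    if ry == 0:
        if rx == 1:
            x = n - 1 - x
            y = n - 1 - y
        x, y = y, x
    return x, y


def d2xy(n, d):
    """Hilbert index d -> (x, y) cell on an n*n grid (n a power of two)."""
    x = y = 0
    s = 1
    while s < n:
        rx = 1 & (d // 2)
        ry = 1 & (d ^ rx)
        x, y = rot(s, x, y, rx, ry)
        x += s * rx
        y += s * ry
        d //= 4
        s *= 2
    return x, y


def xy2d(n, x, y):
    """(x, y) cell -> Hilbert index d, the inverse of d2xy."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        x, y = rot(n, x, y, rx, ry)
        s //= 2
    return d
```

&lt;p&gt;On an 8×8 grid, every pair of consecutive Hilbert indices lands on neighboring cells, which is why a range of cell IDs corresponds to a compact region, the property the index query relies on.&lt;/p&gt;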

&lt;p&gt;&lt;strong&gt;Create a geography index&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TAG any_shape_geo_index on any_shape(geo)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Geospatial data of type &lt;code&gt;point&lt;/code&gt; can be represented by a single level-30 S2 cell, so a point corresponds to one index entry; geospatial data of type &lt;code&gt;linestring&lt;/code&gt; or &lt;code&gt;polygon&lt;/code&gt; is covered by multiple S2 cells of different levels, so it corresponds to multiple index entries.&lt;/p&gt;

&lt;p&gt;Spatial indexing is used to speed up the lookup of all geo predicates, for example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;LOOKUP ON any_shape WHERE ST_Intersects(any_shape.geo, ST_GeogFromText("LINESTRING(3 8, 4.7 73.23)"));
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When there is no spatial index on the &lt;code&gt;geo&lt;/code&gt; column of &lt;code&gt;any_shape&lt;/code&gt;, this statement first reads all the data of &lt;code&gt;any_shape&lt;/code&gt; into memory and then computes, for each value, whether it intersects the given linestring, which is generally expensive. When the amount of data in &lt;code&gt;any_shape&lt;/code&gt; is large, the computation overhead becomes unacceptable.&lt;/p&gt;

&lt;p&gt;When the &lt;code&gt;geo&lt;/code&gt; column of &lt;code&gt;any_shape&lt;/code&gt; has a spatial index, the statement first uses the index to discard most of the data that cannot intersect the line; the remaining candidates, which may or may not intersect, are read into memory for one more exact check. In this way, the spatial index filters out most of the irrelevant data at a small cost, leaving only a small fraction to verify and greatly reducing the computational overhead.&lt;/p&gt;
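&lt;p&gt;This two-phase pattern, a cheap index filter followed by an exact geometric check, can be sketched generically (hypothetical Python using a coarse flat grid in place of S2 cells, and a Euclidean distance check in place of a real geodesic predicate):&lt;/p&gt;

```python
import math


def cell_of(point, cell_size=1.0):
    """Map a (lon, lat) point to a coarse grid cell, a stand-in for an S2 cell."""
    lon, lat = point
    return (math.floor(lon / cell_size), math.floor(lat / cell_size))


def build_index(points, cell_size=1.0):
    """Index: cell id -> list of points covered by that cell."""
    index = {}
    for p in points:
        index.setdefault(cell_of(p, cell_size), []).append(p)
    return index


def points_within(index, center, radius, cell_size=1.0):
    """Phase 1: collect candidates from cells overlapping the query box.
    Phase 2: exact (Euclidean, for simplicity) distance check."""
    cx, cy = center
    reach = int(math.ceil(radius / cell_size))
    qx, qy = cell_of(center, cell_size)
    candidates = []
    for dx in range(-reach, reach + 1):
        for dy in range(-reach, reach + 1):
            candidates.extend(index.get((qx + dx, qy + dy), []))
    return [p for p in candidates
            if math.hypot(p[0] - cx, p[1] - cy) <= radius]
```

&lt;p&gt;Only the points in nearby cells ever reach the exact distance check; everything else is discarded by dictionary lookups, which is the same cost structure the S2-backed &lt;code&gt;LOOKUP&lt;/code&gt; achieves.&lt;/p&gt;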

</description>
      <category>nosql</category>
      <category>database</category>
      <category>datascience</category>
      <category>graphdatabase</category>
    </item>
    <item>
      <title>Open Source NebulaGraph Database Raises Tens of Millions of Dollars in Series A Funding</title>
      <dc:creator>NebulaGraph</dc:creator>
      <pubDate>Fri, 16 Sep 2022 05:24:47 +0000</pubDate>
      <link>https://dev.to/nebulagraph/open-source-nebulagraph-database-raises-tens-of-millions-of-dollars-in-series-a-funding-m8e</link>
      <guid>https://dev.to/nebulagraph/open-source-nebulagraph-database-raises-tens-of-millions-of-dollars-in-series-a-funding-m8e</guid>
      <description>&lt;p&gt;&lt;a href="https://nebula-graph.io"&gt;NebulaGraph&lt;/a&gt;, a leading open source graph database, announced it raised tens of millions of US dollars in Series A funding. Investors in the round are led by Jeneration Capital, with participation from the previous investors - Matrix Partner China, Redpoint China Ventures, and Source Code Capital. China Renaissance served as the exclusive financial advisor in this financing round.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Qn6XGQuk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kwqzkx0g8elqzuas8h0k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Qn6XGQuk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kwqzkx0g8elqzuas8h0k.png" alt="Image description" width="880" height="289"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;NebulaGraph plans to continue building its established business strategy by accelerating the pace of its expansion to the global market, while also strengthening R&amp;amp;D to enhance product performance and differentiation with cutting-edge graph technology.&lt;/p&gt;

&lt;p&gt;Earlier this year, NebulaGraph joined Linked Data Benchmark Council (LDBC) as an organizational member to participate in developing graph industry standards and specifications, especially the creation and popularization of Graph Query Language (GQL), which is a forthcoming International Standard Language for property graph querying.&lt;/p&gt;

&lt;p&gt;Designed for scalability and fast recovery, NebulaGraph boasts a distributed, high-availability architecture, making it a highly reliable graph database for production environments. Its capability to handle ultra-large datasets with trillions of edges, coupled with instant event data stream processing, has allowed NebulaGraph to excel at transactional integrity and operational availability.&lt;/p&gt;

&lt;p&gt;Hundreds of well-established enterprises, including Tencent, Meituan, JD Digits, and Kuaishou, are leveraging NebulaGraph Database to boost their graph data processing capabilities.&lt;/p&gt;

&lt;p&gt;Sherman Ye, the founder and CEO of NebulaGraph, said, “Thanks to the in-depth understanding of industrial scenarios, the transformative value of our products, and the surging demand in graph technology, NebulaGraph is well-positioned to capture future growth opportunities.”&lt;/p&gt;

&lt;p&gt;About NebulaGraph&lt;/p&gt;

&lt;p&gt;NebulaGraph is an open-source &lt;a href="https://nebula-graph.io"&gt;graph database&lt;/a&gt; developed by vesoft Inc. It can store and process billions of vertices and trillions of edges with milliseconds of latency. With a distributed and scalable architecture, its shared-nothing deployment eliminates any single failure point and allows fast recoveries to secure business continuity.&lt;/p&gt;

&lt;p&gt;For more information, please visit &lt;a href="https://nebula-graph.io"&gt;https://nebula-graph.io&lt;/a&gt; or follow NebulaGraph on Twitter.&lt;/p&gt;

</description>
      <category>opensource</category>
      <category>database</category>
      <category>funding</category>
      <category>nosql</category>
    </item>
    <item>
      <title>The Azure Marketplace series: How to list your cloud offer successfully?</title>
      <dc:creator>NebulaGraph</dc:creator>
      <pubDate>Fri, 22 Apr 2022 10:01:53 +0000</pubDate>
      <link>https://dev.to/nebulagraph/the-azure-marketplace-series-how-to-list-your-cloud-offer-successfully-53i6</link>
      <guid>https://dev.to/nebulagraph/the-azure-marketplace-series-how-to-list-your-cloud-offer-successfully-53i6</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--4Kpy42Ce--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/0%2AtyPfUNiQ23USipre.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--4Kpy42Ce--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/0%2AtyPfUNiQ23USipre.png" alt="" width="880" height="374"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This post is part of a series that is focused on how to successfully launch your software solution on the Azure Marketplace, Microsoft’s online market for buying and selling cloud solutions certified to run on Azure.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I work for Nebula Graph, an open-source graph database management system. We’ve recently launched Nebula Graph on the Azure Marketplace to provide Azure users with a cloud-based database as a service (DBaaS). This is the result of six months’ effort from me and my team members. During the whole process, we encountered numerous challenges and failed miserably many times. Now I’m sharing my story and experience, and hopefully this series of articles will help you navigate Azure’s complex review system and resolve some of the problems you may encounter.&lt;/p&gt;

&lt;h3&gt;
  
  
  The What — What is Azure Marketplace
&lt;/h3&gt;

&lt;p&gt;Azure Marketplace is an online market that allows software providers to offer their cloud solutions to a global audience of businesses. Azure will handle account and billing systems so that you don’t have to. The marketplace is a great channel where independent software vendors (ISVs) like Nebula Graph can sell their products. All you need to do in order to list your product in the Marketplace is to register as a Microsoft partner. Once you have a listing in place, customers can purchase it directly through the marketplace using their Microsoft accounts and billing information. It allows your solution to be easily discovered by customers. The marketplace can be used as a valuable marketing tool to acquire new customers.&lt;/p&gt;

&lt;p&gt;The products you can offer in the Azure Marketplace mainly fall into three categories:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Software, which includes virtual machine images, web applications, and developer services&lt;/li&gt;
&lt;li&gt;Data services, which include business intelligence, analytics, and networking services&lt;/li&gt;
&lt;li&gt;Consulting services&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In the article, we will mainly focus on how to publish a software offer, but the process is pretty much the same for the other two types of products.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Why — Azure Marketplace benefits for software solutions
&lt;/h3&gt;

&lt;p&gt;After lots of research and comparison, we decided to list our product as a SaaS offer in the Azure Marketplace. One of the most obvious reasons is that Azure is huge — it is &lt;a href="https://www.gartner.com/en/newsroom/press-releases/2021-06-28-gartner-says-worldwide-iaas-public-cloud-services-market-grew-40-7-percent-in-2020"&gt;one of the world’s largest public cloud service providers&lt;/a&gt; with 19.7% of the global market. This means that our listing will be visible to millions of &lt;a href="https://azure.microsoft.com/en-gb/case-studies/"&gt;enterprise customers&lt;/a&gt; like 3M Informatics and Verizon.&lt;/p&gt;

&lt;p&gt;But as a team leader of a cloud database product, I’m a big fan of the Azure Marketplace also for the following reasons:&lt;/p&gt;

&lt;h3&gt;
  
  
  Easy onboarding
&lt;/h3&gt;

&lt;p&gt;The Azure Marketplace provides a great opportunity for us to onboard new customers that are already using Microsoft Azure. Microsoft customers can easily find our cloud service in the Azure Marketplace, try it with a few clicks, and quickly get started with a free trial or by purchasing through their Azure subscription.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--tcywJ69c--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/0%2A9UaIO-ZdF_fE38Yq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--tcywJ69c--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/0%2A9UaIO-ZdF_fE38Yq.png" alt="" width="880" height="523"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  We can leverage Azure’s account and billing systems directly
&lt;/h3&gt;

&lt;p&gt;Azure Marketplace provides a neat solution for startups like Nebula Graph. Customers can purchase free or paid offers on-demand using Azure’s account and billing systems directly. Microsoft’s account and billing systems will help us gain more trust from potential customers — this is critically important since we are still new to the market. Plus, it will largely simplify the billing logic for us, reducing a lot of redundant work.&lt;/p&gt;

&lt;p&gt;Azure also offers an easy and helpful admin system for you to configure your SaaS offer on demand.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--qSmLvSP5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/0%2A83bn3zWnZxd7n7VQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--qSmLvSP5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/0%2A83bn3zWnZxd7n7VQ.png" alt="" width="880" height="469"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The How — How to create a listing, correctly
&lt;/h3&gt;

&lt;p&gt;While Azure already provides detailed &lt;a href="https://docs.microsoft.com/en-us/azure/marketplace/plan-saas-offer"&gt;instructions&lt;/a&gt; on how to list a SaaS offer in the marketplace, we still encountered numerous difficulties during the process. Here, I’d like to introduce the steps from my own experience. I hope this will help you navigate the process and reduce unnecessary back and forth.&lt;/p&gt;

&lt;p&gt;First, you need to sign up for an account at the &lt;a href="https://docs.microsoft.com/en-us/azure/marketplace/create-account"&gt;Microsoft Partner Center&lt;/a&gt;. This will allow you to create and configure your SaaS offer.&lt;/p&gt;

&lt;p&gt;After you have created your commercial marketplace account, you can start to create a product. You can choose from many types of offers, including SaaS, managed services, and even consulting services. In this example, we will choose the SaaS offer type.&lt;/p&gt;

&lt;h3&gt;
  
  
  Create a demo listing
&lt;/h3&gt;

&lt;p&gt;The first step toward a successful SaaS listing is creating a demo listing. In the SaaS Create Demo setup page, you will need to fill in all the configurations listed on the left sidebar, including the offer overview, properties, availability, and so on. You can refer to Azure’s &lt;a href="https://docs.microsoft.com/en-us/azure/marketplace/plan-saas-dev-test-offer"&gt;Plan a test and development SaaS offer&lt;/a&gt; for more details on how to fill in this information.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--eV8QM3cp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/0%2A-njljCllUIOpK1L3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--eV8QM3cp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/0%2A-njljCllUIOpK1L3.png" alt="" width="880" height="406"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once you have finished the basic setup and published it to preview, you will be prompted with a preview link where you can visit the SaaS offer you have created. On the preview page, you can create a test subscription to see if it works.&lt;/p&gt;

&lt;p&gt;Here is where you can find your preview link.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--8z-ZbtkP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/0%2AlGDkdbe3A5OHrUgY.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--8z-ZbtkP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/0%2AlGDkdbe3A5OHrUgY.png" alt="" width="880" height="358"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here is an example of the preview of your SaaS offer in the Azure Marketplace.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--eg7cwjMe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/0%2AyQXJZd6GucBDDc6b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--eg7cwjMe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/0%2AyQXJZd6GucBDDc6b.png" alt="" width="880" height="690"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You may wonder how the Azure marketplace communicates with your SaaS application or solution. In the Technical configuration tab, you can find configurations that help Azure talk to your application.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Landing page URL — A SaaS website URL that customers will land on after subscribing to your offer from the Azure Marketplace.&lt;/li&gt;
&lt;li&gt;Connection webhook — Microsoft will send all asynchronous events, including subscription creation, cancellation, and upgrade, to you through the webhook.&lt;/li&gt;
&lt;li&gt;Azure Active Directory tenant ID and Azure Active Directory application ID — You can find your tenant ID for your Azure Active Directory and your application ID in the App registrations blade in Azure Active Directory.&lt;/li&gt;
&lt;/ul&gt;
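&lt;p&gt;To give a feel for how such a webhook might be handled, here is a minimal sketch of an event dispatcher. The payload field names (&lt;code&gt;action&lt;/code&gt;, &lt;code&gt;subscriptionId&lt;/code&gt;) and the action values are assumptions for illustration; check them against Microsoft’s current webhook documentation:&lt;/p&gt;

```python
import json

# Minimal sketch of a connection-webhook dispatcher. Field names and
# action values are illustrative assumptions, not the authoritative
# marketplace payload schema.
def handle_marketplace_event(raw_body: str) -> str:
    event = json.loads(raw_body)
    action = event.get("action")
    sub_id = event.get("subscriptionId")
    if action == "Unsubscribe":
        # Deactivate the customer's resources on our side here.
        return f"deactivated {sub_id}"
    if action == "ChangePlan":
        # Apply the new plan on our side here.
        return f"plan changed for {sub_id}"
    # Unknown or informational events: still acknowledge them,
    # otherwise Azure keeps retrying the notification.
    return f"acknowledged {sub_id}"
```

&lt;p&gt;In production you would also verify the caller and confirm receipt back through the marketplace API, since unacknowledged notifications are retried.&lt;/p&gt;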

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--JqYE_R8M--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/0%2AU59RNyvEHi5y-ptY.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--JqYE_R8M--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/0%2AU59RNyvEHi5y-ptY.png" alt="" width="880" height="651"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Set up billing
&lt;/h3&gt;

&lt;p&gt;You can create SaaS offers that are billed according to non-standard units you have defined upfront, such as bandwidth, CPU, or storage. Customers then pay according to their consumption of these units. You will also need to inform Azure about these billable events via the commercial marketplace metering service API as they occur. For more details about how to set up the billing system, refer to Azure’s &lt;a href="https://docs.microsoft.com/en-us/azure/marketplace/partner-center-portal/saas-metered-billing"&gt;documentation&lt;/a&gt; here.&lt;/p&gt;

&lt;h3&gt;
  
  
  Go live
&lt;/h3&gt;

&lt;p&gt;Once you have completed the basic settings for your SaaS application, you can click the “Go live” button to publish your offer for real users to access.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--EJbQVXlb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/0%2AqSORoCj1jin1Qb3J.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--EJbQVXlb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/0%2AqSORoCj1jin1Qb3J.png" alt="" width="880" height="378"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Billing is a big challenge
&lt;/h3&gt;

&lt;p&gt;Billing is one of the biggest challenges we faced during the listing process. While Azure has provided a solid billing solution, there are still quite a few problems you have to watch out for.&lt;/p&gt;

&lt;p&gt;The first problem is how to keep users’ usage data on our own servers in sync with Azure’s billing system. Azure provides an API through which we can upload users’ usage to the billing system. To keep the two sources of data consistent, you have to follow these rules:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One resource metric can only be updated once within one hour. You may encounter a duplicate error if you try to update twice within an hour.&lt;/li&gt;
&lt;li&gt;You have to update users’ usage data to Azure’s billing system within 24 hours. I think this rule helps avoid billing users more than what they should pay.&lt;/li&gt;
&lt;li&gt;Azure will notify you via the webhook you have configured if a user cancels their subscription. You have to confirm via the webhook that you have received the message. Otherwise, the notification will repeat forever.&lt;/li&gt;
&lt;/ul&gt;
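&lt;p&gt;A simple way to respect the first two rules is to aggregate raw usage into hourly buckets before uploading. The sketch below is our own bookkeeping logic, not Azure’s API; the record field names are only loosely modeled on the metering service:&lt;/p&gt;

```python
from collections import defaultdict
from datetime import datetime

# Sketch: collapse raw usage events into at most one record per
# (resource, metric, hour) so the once-per-hour duplicate error can
# never be triggered. The actual API call is omitted.
def batch_usage(events):
    """events: iterable of (resource_id, metric, quantity, iso_timestamp)."""
    buckets = defaultdict(float)
    for resource_id, metric, qty, ts in events:
        hour = datetime.fromisoformat(ts).replace(minute=0, second=0, microsecond=0)
        buckets[(resource_id, metric, hour)] += qty
    return [
        {"resourceId": r, "dimension": m, "quantity": q,
         "effectiveStartTime": h.isoformat()}
        for (r, m, h), q in sorted(buckets.items())
    ]
```

&lt;p&gt;Each (resource, metric, hour) bucket yields at most one record, and flushing the buckets at least daily keeps you inside the 24-hour reporting window.&lt;/p&gt;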

&lt;p&gt;The second problem to watch out for is that when configuring your SaaS offer, you can define at most 30 billing metrics, the non-standard units mentioned above. Metrics may include traffic usage, disk space, and emails sent. Remember that you cannot delete or update any of these metrics once you have published your SaaS offer. My advice is to think thoroughly about which billing metrics you want and make sure you won’t need to change them in the near future.&lt;/p&gt;

&lt;p&gt;The last challenge related to billing is that Azure only provides a relatively fixed billing logic, meaning that sometimes you cannot configure your pricing and plans the way you want. For example, with Azure’s vanilla billing system, you can’t offer a pricing plan that gives new users a 30% discount for one year, or change the price for an existing subscriber when you need to.&lt;/p&gt;

&lt;p&gt;Our solution to Azure’s inflexible billing system is adding a standalone billing metric called “service credit” on our end. Unlike the usage-based metrics described above, it solely serves as a buffer that lets us control users’ charges more flexibly. For example, if we want to give new users a 30% discount for one year, we grant them service credits equivalent to 30% of their billing amount, so they essentially pay 70% of their bills.&lt;/p&gt;
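&lt;p&gt;In code, the credit idea is just a drawdown applied before any real money is charged. This is a simplified sketch with illustrative names and numbers, not Azure billing code:&lt;/p&gt;

```python
# "Service credit" buffer sketch: credits are granted up front and
# consumed before the customer is charged. Names and the 30% figure
# are illustrative.

def grant_discount_credits(bill_amount: float, discount: float = 0.30) -> float:
    """Credits equivalent to the discounted share of a bill."""
    return bill_amount * discount

def settle(bill: float, credits: float):
    """Return (amount_actually_charged, remaining_credits)."""
    used = min(bill, credits)
    return bill - used, credits - used
```

&lt;p&gt;Granting credits worth 30% of each bill means a 100-unit bill settles to a 70-unit charge, matching the discount example above.&lt;/p&gt;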

&lt;p&gt;I will talk more about billing problems in a follow-up article. Please stay tuned.&lt;/p&gt;

&lt;h3&gt;
  
  
  Brief Summary
&lt;/h3&gt;

&lt;p&gt;This article is about the main steps of listing a SaaS product in the Azure Marketplace. I mainly want to give you an overview of the whole process from a developer’s perspective. In the rest of this series about how to deliver a SaaS product in the Azure Marketplace, I will talk about the following topics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Problems and limits of Azure’s billing system and how to fix them.&lt;/li&gt;
&lt;li&gt;The difference between Azure’s managed service, SaaS offer, and Azure Lighthouse.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Finally, Nebula Graph Cloud is currently available in the Azure Marketplace and it is in a beta period that offers a generous 70% off for beta users. Sign up here to &lt;a href="https://nebula-graph.io/nebula-graph-cloud/"&gt;get the offer&lt;/a&gt; if you are interested.&lt;/p&gt;

&lt;h3&gt;
  
  
  About the author
&lt;/h3&gt;

&lt;p&gt;Jerry Liang is a technical lead at Nebula Graph’s Cloud division. He and his team are behind a series of Nebula Graph’s visualization tools such as &lt;a href="https://nebula-graph.io/products/dashboard/"&gt;Nebula Dashboard&lt;/a&gt;, &lt;a href="https://nebula-graph.io/products/explorer/"&gt;Nebula Explorer&lt;/a&gt;, and &lt;a href="https://nebula-graph.io/products/studio/"&gt;Nebula Studio&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Originally published at&lt;/em&gt; &lt;a href="https://nebula-graph.io/posts/how-to-list-cloud-offer-azure-marketplace/"&gt;&lt;em&gt;https://nebula-graph.io&lt;/em&gt;&lt;/a&gt; &lt;em&gt;on April 22, 2022.&lt;/em&gt;&lt;/p&gt;




</description>
      <category>graphdatabase</category>
      <category>azuremarketplace</category>
      <category>dbaas</category>
      <category>azure</category>
    </item>
    <item>
      <title>How I cracked Chinese Wordle using knowledge graph</title>
      <dc:creator>NebulaGraph</dc:creator>
      <pubDate>Thu, 21 Apr 2022 02:12:58 +0000</pubDate>
      <link>https://dev.to/nebulagraph/how-i-cracked-chinese-wordle-using-knowledge-graph-2c37</link>
      <guid>https://dev.to/nebulagraph/how-i-cracked-chinese-wordle-using-knowledge-graph-2c37</guid>
      <description>&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2AUtFFE4UtFH92unjO.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2AUtFFE4UtFH92unjO.jpg"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Wordle is going viral these days on social media. The &lt;a href="https://www.nytimes.com/games/wordle/index.html" rel="noopener noreferrer"&gt;game&lt;/a&gt; made by &lt;em&gt;Josh Wardle&lt;/em&gt; allows players to try six times to guess a five-letter word, with feedback given for each guess in the form of colored tiles indicating when letters match or occupy the correct position.&lt;/p&gt;

&lt;p&gt;We have seen many Wordle variants for languages that use the Latin script, such as the &lt;a href="https://wordle.danielfrg.com/" rel="noopener noreferrer"&gt;Spanish Wordle&lt;/a&gt;, &lt;a href="https://wordle.louan.me/" rel="noopener noreferrer"&gt;French Wordle&lt;/a&gt;, and &lt;a href="https://wordle.at/" rel="noopener noreferrer"&gt;German Wordle&lt;/a&gt;. However, for non-alphabetic languages like Chinese, a simple adaptation of the English Wordle’s rules just won’t work.&lt;/p&gt;

&lt;p&gt;In China, where most people are more familiar with Chinese characters, or &lt;em&gt;hanzi&lt;/em&gt;, Wordle fans have invented a localized version of Wordle with a very clever name: Handle. (And you guessed it right, it is a combination of &lt;em&gt;hanzi&lt;/em&gt; and Wordle.)&lt;/p&gt;

&lt;p&gt;Like Wordle, Handle allows players to try 10 times to guess a four-character Chinese idiom, or &lt;a href="https://en.wikipedia.org/wiki/Chengyu" rel="noopener noreferrer"&gt;&lt;em&gt;chengyu&lt;/em&gt;&lt;/a&gt;. While English Wordle uses letters and their positions to indicate whether players have made the correct guess, Handle uses &lt;a href="https://en.wikipedia.org/wiki/Pinyin" rel="noopener noreferrer"&gt;&lt;em&gt;pinyin&lt;/em&gt;&lt;/a&gt;, a romanization system for Simplified Chinese in mainland China, to give players feedback.&lt;/p&gt;

&lt;p&gt;After every guess in Handle, each &lt;em&gt;hanzi&lt;/em&gt; and their &lt;em&gt;pinyin&lt;/em&gt; is marked as either cyan, orange, or gray: cyan indicates the &lt;em&gt;hanzi&lt;/em&gt; or the &lt;em&gt;pinyin&lt;/em&gt; partial (the initial or the final) is correct and in the correct position, orange means that it is in the answer but not in the right position, while gray indicates it is not in the answer at all. Of course, each game is given a hint to indicate one of the character’s &lt;em&gt;pinyin&lt;/em&gt; or &lt;em&gt;hanzi&lt;/em&gt; (depending on how hard you want the game to be).&lt;/p&gt;

&lt;p&gt;Here is an example of how Handle is successfully played.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2AOBrUGtIbtXmeGrVD.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2AOBrUGtIbtXmeGrVD.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The Handle helper: Your second brain
&lt;/h3&gt;

&lt;p&gt;Of all magics that were used to solve Wordle, I was most impressed by Grant Sanderson (or 3Blue1Brown as he is known on YouTube), who provided &lt;a href="https://www.3blue1brown.com/lessons/wordle" rel="noopener noreferrer"&gt;an elegant and delightful way&lt;/a&gt; to solve Wordle using information theory.&lt;/p&gt;

&lt;p&gt;Today, I’d like to write about how I solved Handle (the Chinese Wordle) using a different approach — knowledge graph. The idea behind the solution is that I believe a knowledge graph is the best way to mimic how we search for the final answer in our minds in games like Wordle and Handle.&lt;/p&gt;

&lt;p&gt;Imagine a knowledge graph of five-letter words and letters, connected by edges representing their relations (contains, or is made up of). Each try in the game will give you some clues (hopefully!) about how to approach the hidden answer in this network of words and letters.&lt;/p&gt;

&lt;p&gt;The same theory applies to Handle. We can also create a knowledge graph of &lt;em&gt;chengyu&lt;/em&gt; (four-character Chinese idiom) and Chinese characters that are connected using edges to represent their relations.&lt;/p&gt;

&lt;p&gt;Before we dive into how to solve Handle using a knowledge graph, I’d like to go through how to play Handle without the help of computers.&lt;/p&gt;

&lt;p&gt;I have mentioned that Handle uses &lt;em&gt;pinyin&lt;/em&gt; and &lt;em&gt;hanzi&lt;/em&gt; to give players feedback. But &lt;em&gt;pinyin&lt;/em&gt; is complicated: it consists of an initial (&lt;em&gt;声母; shēngmǔ&lt;/em&gt;), a final (&lt;em&gt;韵母; yùnmǔ&lt;/em&gt;), and one of four tones. For example, the &lt;em&gt;pinyin&lt;/em&gt; of the &lt;em&gt;hanzi&lt;/em&gt; &lt;em&gt;“声” (sheng1)&lt;/em&gt; is made up of the initial &lt;em&gt;“sh”&lt;/em&gt; and the final &lt;em&gt;“eng,”&lt;/em&gt; and its tone is the first tone (tone 1).&lt;/p&gt;

&lt;p&gt;For each guess, a character may have the correct initial but the wrong final. Sometimes, even if you get the initial and final right, the tone might still be wrong. For example, the &lt;em&gt;pinyin&lt;/em&gt; &lt;em&gt;“sheng1”&lt;/em&gt; could be &lt;em&gt;“声”&lt;/em&gt; (sound), while &lt;em&gt;“sheng4”&lt;/em&gt; could be &lt;em&gt;“圣”&lt;/em&gt; (sacred).&lt;/p&gt;

&lt;p&gt;Let’s see what it is like to play Handle:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Players are given 10 tries to guess the correct four-character &lt;em&gt;chengyu&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;Characters are the most basic element to be considered:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For example, in the first line in the screenshot below, the character &lt;em&gt;“门”&lt;/em&gt; colored in cyan in position 2 is the correct &lt;em&gt;hanzi&lt;/em&gt; and in the correct position.&lt;/p&gt;

&lt;p&gt;In the second line, the character &lt;em&gt;“仓”&lt;/em&gt; colored in orange is the correct &lt;em&gt;hanzi&lt;/em&gt; but not in the correct position.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;pinyin&lt;/em&gt; of the character provides further information. However, we should also know that sometimes more than one character may share one exact &lt;em&gt;pinyin&lt;/em&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In the third line of the picture, the &lt;em&gt;pinyin “qiao”&lt;/em&gt; colored in cyan means that the first character of the idiom is pronounced &lt;em&gt;“qiao”&lt;/em&gt;, but the tone should not be the third tone as the guess implies.&lt;/p&gt;

&lt;p&gt;Also in the third line, the final &lt;em&gt;“uo”&lt;/em&gt; colored in orange means that there is at least one character in the idiom that has the final &lt;em&gt;“uo”&lt;/em&gt; in its &lt;em&gt;pinyin&lt;/em&gt;, but that character is not in position 2.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2A1LfljgX2Na4yWtmb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2A1LfljgX2Na4yWtmb.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The Handle knowledge graph
&lt;/h3&gt;

&lt;p&gt;I’m not going to create a fully automated Handle solver. That will just kill the fun of the game. Instead, I’m going to make a Handle helper, which I call a second brain, that will help people reach the hidden four-character idiom.&lt;/p&gt;

&lt;p&gt;When playing the English Wordle, people can search for five-letter words based on clues they already have. For example, they can search for &lt;a href="https://nerdschalk.com/wordle-5-letter-words-with-the-most-vowels-three-and-four-vowels-words/" rel="noopener noreferrer"&gt;five-letter words with the most vowels, five-letter words starting with “sau,”&lt;/a&gt; or &lt;a href="https://www.wordhippo.com/what-is/ending-with/5-letter-words-e.html" rel="noopener noreferrer"&gt;five-letter words ending with “e”&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In Handle, however, it’s almost impossible to search based on hints like tones and initials of &lt;em&gt;pinyin&lt;/em&gt; in search engines, because most Chinese webpages are simply made of &lt;em&gt;hanzi&lt;/em&gt;, not their &lt;em&gt;pinyin&lt;/em&gt; or tones.&lt;/p&gt;

&lt;p&gt;As I mentioned, the key idea of this helper is that it should work as a second brain to help people locate the answer in the sea of Chinese idioms, which are estimated to &lt;a href="https://studycli.org/zh-CN/learn-chinese/chinese-idioms/#:~:text=%E6%A0%B9%E6%8D%AE%E6%82%A8%E5%92%A8%E8%AF%A2%E7%9A%84%E6%9D%A5%E6%BA%90,%E4%B8%AD%E4%BB%8D%E7%84%B6%E8%A2%AB%E5%A4%A7%E9%87%8F%E4%BD%BF%E7%94%A8%E3%80%82" rel="noopener noreferrer"&gt;number up to 20,000&lt;/a&gt;. Then the question is: how does our brain work while handling the knowledge of Handle (and yes, pun intended 😏)?&lt;/p&gt;

&lt;p&gt;Thus, why not do it in a graph/neural-network way? And here we go, let’s create a knowledge graph of Chinese idioms and see how it goes with the Handle game.&lt;/p&gt;
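&lt;p&gt;Before touching the database, the helper’s core operation can be sketched in a few lines: every clue becomes a constraint that prunes the candidate set. This toy version only handles character-level clues; the graph handles the same idea for &lt;em&gt;pinyin&lt;/em&gt;, initials, finals, and tones as well:&lt;/p&gt;

```python
# Toy clue-based pruning of the candidate idiom set. Each Handle
# guess yields constraints of three kinds, mirroring the tile colors:
#   placed  -> exact-position matches (cyan)
#   present -> characters in the idiom, wrong position (orange)
#   absent  -> characters not in the idiom at all (gray)
def filter_candidates(idioms, present=(), absent=(), placed=()):
    """placed: (position, char) pairs; present/absent: characters."""
    out = []
    for idiom in idioms:
        if any(c not in idiom for c in present):
            continue
        if any(c in idiom for c in absent):
            continue
        if any(idiom[i] != c for i, c in placed):
            continue
        out.append(idiom)
    return out
```

&lt;p&gt;Applied repeatedly, a handful of guesses shrinks ~20,000 idioms down to a shortlist a human can finish from.&lt;/p&gt;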

&lt;h3&gt;
  
  
  What is a knowledge graph?
&lt;/h3&gt;

&lt;p&gt;Simply put, a knowledge graph is a network of relationships connecting entities. It was originally &lt;a href="https://blog.google/products/search/introducing-knowledge-graph-things-not/" rel="noopener noreferrer"&gt;proposed by Google&lt;/a&gt; to answer search queries that can only be answered through knowledge-based reasoning, rather than the inverted indexing of web pages. For example: “How many championships have the Houston Rockets won?” and “Who was married to Elvis Presley?”&lt;/p&gt;

&lt;h3&gt;
  
  
  How to build a knowledge graph for Handle?
&lt;/h3&gt;

&lt;p&gt;A knowledge graph is composed of entities (vertices) and relationships (edges), and a graph database management system can be easily used to index, query, and explore the knowledge graph.&lt;/p&gt;

&lt;p&gt;In this article, I will use the open-source graph database &lt;a href="https://nebula-graph.io/" rel="noopener noreferrer"&gt;NebulaGraph&lt;/a&gt; to build the knowledge graph for solving Handle. Let’s start with the modeling of the Handle knowledge graph using NebulaGraph.&lt;/p&gt;

&lt;h3&gt;
  
  
  Setting up the Handle knowledge graph
&lt;/h3&gt;

&lt;h3&gt;
  
  
  Modeling the knowledge graph
&lt;/h3&gt;

&lt;p&gt;The modeling of a Handle knowledge graph is actually quite straightforward: I only need to index entities in the game as vertices and connect them using their relationships.&lt;/p&gt;

&lt;p&gt;Oftentimes, you will have to come back and optimize the schema after playing with the knowledge graph for a while. But the main principle is not to over-design it: just model it in an intuitive way.&lt;/p&gt;

&lt;p&gt;For my practice, the Handle knowledge graph has the following types of vertices:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;idiom (four-character Chinese words)&lt;/li&gt;
&lt;li&gt;character&lt;/li&gt;
&lt;li&gt;&lt;em&gt;pinyin&lt;/em&gt;, with a tone property (1, 2, 3, 4)&lt;/li&gt;
&lt;li&gt;&lt;em&gt;pinyin&lt;/em&gt;_part, with a type property (initial or final)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There are three types of edges:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;with_char&lt;/li&gt;
&lt;li&gt;with_pinyin&lt;/li&gt;
&lt;li&gt;with_pinyin_part&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Of course, each type of vertex and edge will have its own properties. For example, the vertex “idiom” will have a VID (a unique identifier in NebulaGraph) and a &lt;em&gt;pinyin&lt;/em&gt; property represented by initials, finals, and tone numbers (like the &lt;em&gt;pinyin “sheng4”&lt;/em&gt; mentioned above).&lt;/p&gt;

&lt;p&gt;The following sketch is a rough representation of the schema of the Handle knowledge graph.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2ATIWu7Bn0iq2BgzUT.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2ATIWu7Bn0iq2BgzUT.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After modeling, what we need to do is to collect, clean, and index the data.&lt;/p&gt;

&lt;p&gt;I extracted the universal set of idioms used in Handle from the game’s &lt;a href="https://github.com/antfu/handle" rel="noopener noreferrer"&gt;Github repo&lt;/a&gt;. I used &lt;a href="https://github.com/mozillazg/python-pinyin" rel="noopener noreferrer"&gt;PyPinyin&lt;/a&gt;, an open-source Python library, to convert idioms into their &lt;em&gt;pinyin&lt;/em&gt;. PyPinyin can also be used to split &lt;em&gt;pinyin&lt;/em&gt; entities into their initials and finals.&lt;/p&gt;
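As an illustration of the splitting step, here is a minimal pure-Python sketch of breaking a tone3-style pinyin syllable into its initial, final, and tone (a toy stand-in for what PyPinyin produces; the helper name and the initials table below are illustrative, not the project's actual code):

```python
# Toy sketch: split a tone3-style pinyin syllable (e.g. "sheng4")
# into (initial, final, tone). Illustrative only -- the project uses
# PyPinyin for this.

INITIALS = (  # two-letter initials first, so "sh" wins over "s"
    "zh", "ch", "sh",
    "b", "p", "m", "f", "d", "t", "n", "l",
    "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w",
)

def split_pinyin(syllable: str):
    """Return (initial, final, tone) for a pinyin like 'sheng4'."""
    tone = 0
    if syllable and syllable[-1].isdigit():
        tone = int(syllable[-1])
        syllable = syllable[:-1]
    for ini in INITIALS:
        if syllable.startswith(ini):
            return ini, syllable[len(ini):], tone
    return "", syllable, tone  # syllables like "ai" have no initial

print(split_pinyin("sheng4"))  # ('sh', 'eng', 4)
print(split_pinyin("ai4"))     # ('', 'ai', 4)
```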

&lt;p&gt;Here is the Github repo for the project: &lt;a href="https://github.com/wey-gu/chinese-graph" rel="noopener noreferrer"&gt;https://github.com/wey-gu/chinese-graph&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Deploy NebulaGraph
&lt;/h3&gt;

&lt;p&gt;You can use &lt;a href="https://github.com/wey-gu/nebula-up/" rel="noopener noreferrer"&gt;Nebula-UP&lt;/a&gt; to deploy NebulaGraph with a single command. Then clone the project repository:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git clone https://github.com/wey-gu/chinese-graph.git &amp;amp;&amp;amp; cd chinese-graph
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Load the data
&lt;/h3&gt;

&lt;h3&gt;
  
  
  Play Handle with knowledge graph
&lt;/h3&gt;

&lt;p&gt;With all the setup ready, let’s start playing the game with our second brain.&lt;/p&gt;

&lt;p&gt;Now let’s visit the &lt;a href="https://handle.antfu.me/" rel="noopener noreferrer"&gt;Handle&lt;/a&gt; game. Say we use the idiom &lt;em&gt;“爱憎分明”&lt;/em&gt; as our first guess. After typing the idiom, we get our first batch of hints:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F722%2F0%2AxbXgohB7yJSo3_cN.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F722%2F0%2AxbXgohB7yJSo3_cN.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Not bad, now we have a few informative hints:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;There is one character with the final of &lt;em&gt;“ai”&lt;/em&gt; in tone 4, but it is not in position 1 and it is not &lt;em&gt;“爱”&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;There is one character in tone 1 but it is not in position 2.&lt;/li&gt;
&lt;li&gt;There is one character with the final &lt;em&gt;“ing”&lt;/em&gt;, but the character is not in position 4.&lt;/li&gt;
&lt;li&gt;The 4th character is in tone 2.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let’s translate these hints into NebulaGraph’s &lt;a href="https://docs.nebula-graph.io/3.0.2/3.ngql-guide/1.nGQL-overview/1.overview/" rel="noopener noreferrer"&gt;nGQL&lt;/a&gt; graph query language:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# There is one character with the final of *"ai"* in tone 4, but it is not in position 1 (starting from 0 in the query) and it is not *"爱"*.
MATCH (char0:character)&amp;lt;-[with_char_0:with_character]-(x:idiom)-[with_pinyin_0:with_pinyin]-&amp;gt;(pinyin_0:character_pinyin)-[:with_pinyin_part]-&amp;gt;(final_part_0:pinyin_part{part_type: "final"})
WHERE id(final_part_0) == "ai" AND pinyin_0.character_pinyin.tone == 4 AND with_pinyin_0.position != 0 AND with_char_0.position != 0 AND id(char0) != "爱"

# There is one character in tone 1 but it is not in position 2.
MATCH (x:idiom) -[with_pinyin_1:with_pinyin]-&amp;gt;(pinyin_1:character_pinyin)
WHERE pinyin_1.character_pinyin.tone == 1 AND with_pinyin_1.position != 1

# There is one character with the final “ing”, but the character is not in position 4.
MATCH (x:idiom) -[with_pinyin_2:with_pinyin]-&amp;gt;(:character_pinyin)-[:with_pinyin_part]-&amp;gt;(final_part_2:pinyin_part{part_type: "final"})
WHERE id(final_part_2) == "ing" AND with_pinyin_2.position != 3

# The 4th character is in tone 2.
MATCH (x:idiom) -[with_pinyin_3:with_pinyin]-&amp;gt;(pinyin_3:character_pinyin)
WHERE pinyin_3.character_pinyin.tone == 2 AND with_pinyin_3.position == 3

RETURN x, count(x) as c ORDER BY c DESC
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After running those queries against the NebulaGraph instance hosting our Handle knowledge graph, we get seven candidate idioms for the second guess!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;("惊愚骇俗" :idiom{pinyin: "['jing1', 'yu2', 'hai4', 'su2']"}) 
("惊世骇俗" :idiom{pinyin: "['jing1', 'shi4', 'hai4', 'su2']"}) 
("惊见骇闻" :idiom{pinyin: "['jing1', 'jian4', 'hai4', 'wen2']"}) 
("沽名卖直" :idiom{pinyin: "['gu1', 'ming2', 'mai4', 'zhi2']"}) 
("惊心骇神" :idiom{pinyin: "['jing1', 'xin1', 'hai4', 'shen2']"}) 
("荆棘载途" :idiom{pinyin: "['jing1', 'ji2', 'zai4', 'tu2']"}) 
("出卖灵魂" :idiom{pinyin: "['chu1', 'mai4', 'ling2', 'hun2']"})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
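The same four hints can also be checked without a database, as a quick plain-Python sanity test (a toy illustration only: it simplifies hint 1 by dropping the “not 爱” character check and approximates finals with a suffix test; the two sample idioms are taken from this article):

```python
# Toy sketch of the four hints as Python predicates over an idiom's
# pinyin list (tone3 style, positions indexed from 0).

def tone(p):  return int(p[-1])
def body(p):  return p[:-1]

def matches(pinyins):
    # hint 1: some char has final "ai" in tone 4, but not at position 0
    h1 = any(body(p).endswith("ai") and tone(p) == 4 and i != 0
             for i, p in enumerate(pinyins))
    # hint 2: some char is in tone 1, but not at position 1
    h2 = any(tone(p) == 1 and i != 1 for i, p in enumerate(pinyins))
    # hint 3: some char has final "ing", but not at position 3
    h3 = any(body(p).endswith("ing") and i != 3
             for i, p in enumerate(pinyins))
    # hint 4: the 4th char is in tone 2
    h4 = tone(pinyins[3]) == 2
    return h1 and h2 and h3 and h4

candidates = {
    "惊世骇俗": ["jing1", "shi4", "hai4", "su2"],
    "爱憎分明": ["ai4", "zeng1", "fen1", "ming2"],
}
print([w for w, p in candidates.items() if matches(p)])  # ['惊世骇俗']
```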



&lt;p&gt;Let’s give the idiom &lt;em&gt;“惊世骇俗”&lt;/em&gt; a try.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F784%2F0%2AOp5ICRgG1AeqHw95.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F784%2F0%2AOp5ICRgG1AeqHw95.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And here we go, we got the final answer. It is &lt;em&gt;“惊世骇俗”&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Let’s try again with another day’s Handle (Mar 1).&lt;/p&gt;

&lt;p&gt;My first guess was &lt;em&gt;“一言为定”&lt;/em&gt;. And we got the following feedback:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F764%2F0%2AQE6NJorgZfPWMv6R.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F764%2F0%2AQE6NJorgZfPWMv6R.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This can be translated into the following nGQL statements:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# There is one character that is not in the first position whose pinyin final is "i" in the first tone, but its pinyin is not "yi"
MATCH (x:idiom) -[with_pinyin_0:with_pinyin]-&amp;gt;(char_pinyin_0:character_pinyin)-[:with_pinyin_part]-&amp;gt;(final_part_0:pinyin_part{part_type: "final"})
WHERE id(final_part_0) == "i" AND char_pinyin_0.character_pinyin.tone == 1 AND with_pinyin_0.position != 0 AND id(char_pinyin_0) != "yi1"

# There is one character whose pinyin initial is "d," but the character is not in the 4th position
MATCH (x:idiom) -[with_pinyin_1:with_pinyin]-&amp;gt;(:character_pinyin)-[:with_pinyin_part]-&amp;gt;(initial_part_1:pinyin_part{part_type: "initial"})
WHERE id(initial_part_1) == "d" AND with_pinyin_1.position != 3

# The third character is in tone 2, but its pinyin is not "wei"
MATCH (x:idiom) -[with_pinyin_2:with_pinyin]-&amp;gt;(char_pinyin_2:character_pinyin)
WHERE char_pinyin_2.character_pinyin.tone == 2 AND id(char_pinyin_2) != "wei2" AND with_pinyin_2.position == 2

RETURN x
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here is what we get:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;("堆积如山" :idiom{pinyin: "['dui1', 'ji1', 'ru2', 'shan1']"}) 
("丹漆随梦" :idiom{pinyin: "['dan1', 'qi1', 'sui2', 'meng4']"}) 
("植党营私" :idiom{pinyin: "['zhi2', 'dang3', 'ying2', 'si1']"}) 
("结党营私" :idiom{pinyin: "['jie2', 'dang3', 'ying2', 'si1']"}) 
("堆案盈几" :idiom{pinyin: "['dui1', 'an4', 'ying2', 'ji1']"}) 
("涓滴成河" :idiom{pinyin: "['juan1', 'di1', 'cheng2', 'he2']"}) 
("当之无愧" :idiom{pinyin: "['dang1', 'zhi1', 'wu2', 'kui4']"}) 
("荡析离居" :idiom{pinyin: "['dang4', 'xi1', 'li2', 'ju1']"}) 
("路断人稀" :idiom{pinyin: "['lu4', 'duan4', 'ren2', 'xi1']"}) 
("地广人稀" :idiom{pinyin: "['di4', 'guang3', 'ren2', 'xi1']"}) 
("地广人希" :idiom{pinyin: "['di4', 'guang3', 'ren2', 'xi1']"}) 
("地旷人稀" :idiom{pinyin: "['di4', 'kuang4', 'ren2', 'xi1']"}) 
("大失人望" :idiom{pinyin: "['da4', 'shi1', 'ren2', 'wang4']"}) 
("得不酬失" :idiom{pinyin: "['de2', 'bu4', 'chou2', 'shi1']"}) 
("得失荣枯" :idiom{pinyin: "['de2', 'shi1', 'rong2', 'ku1']"}) 
("独木难支" :idiom{pinyin: "['du2', 'mu4', 'nan2', 'zhi1']"}) 
("不得而知" :idiom{pinyin: "['bu4', 'de2', 'er2', 'zhi1']"}) 
("班师得胜" :idiom{pinyin: "['ban1', 'shi1', 'de2', 'sheng4']"}) 
("是非得失" :idiom{pinyin: "['shi4', 'fei1', 'de2', 'shi1']"}) 
("鸡虫得失" :idiom{pinyin: "['ji1', 'chong2', 'de2', 'shi1']"}) 
("锋镝余生" :idiom{pinyin: "['feng1', 'di1', 'yu2', 'sheng1']"}) 
("心到神知" :idiom{pinyin: "['xin1', 'dao4', 'shen2', 'zhi1']"}) 
("小大由之" :idiom{pinyin: "['xiao3', 'da4', 'you2', 'zhi1']"}) 
("水滴石穿" :idiom{pinyin: "['shui3', 'di1', 'shi2', 'chuan1']"}) 
("天打雷劈" :idiom{pinyin: "['tian1', 'da3', 'lei2', 'pi1']"})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let’s try the idiom &lt;em&gt;“首当其冲”&lt;/em&gt; this time.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F724%2F0%2A7-0NZf_wX2d63vyp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F724%2F0%2A7-0NZf_wX2d63vyp.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Again, let’s try to translate the clues into nGQL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# There is one character whose pinyin initial is "ch"
MATCH (x:idiom) -[:with_pinyin]-&amp;gt;(:character_pinyin)-[:with_pinyin_part]-&amp;gt;(initial_part_0:pinyin_part{part_type: "initial"})
WHERE id(initial_part_0) == "ch"

# There is one character whose pinyin initial is "d"
MATCH (x:idiom) -[:with_pinyin]-&amp;gt;(:character_pinyin)-[:with_pinyin_part]-&amp;gt;(initial_part_1:pinyin_part{part_type: "initial"})
WHERE id(initial_part_1) == "d"

# There is one character whose *pinyin* initial is "sh"
MATCH (x:idiom) -[:with_pinyin]-&amp;gt;(:character_pinyin)-[:with_pinyin_part]-&amp;gt;(initial_part_2:pinyin_part{part_type: "initial"})
WHERE id(initial_part_2) == "sh"

# The third character is in tone 2
MATCH (x:idiom) -[with_pinyin3:with_pinyin]-&amp;gt;(char_pinyin3:character_pinyin)
WHERE char_pinyin3.character_pinyin.tone == 2 AND with_pinyin3.position == 2

# The fourth character is in tone 1
MATCH (x:idiom) -[with_pinyin4:with_pinyin]-&amp;gt;(char_pinyin4:character_pinyin)
WHERE char_pinyin4.character_pinyin.tone == 1 AND with_pinyin4.position == 3

# There is one character whose Pinyin final is "ang"
MATCH (x:idiom) -[:with_pinyin]-&amp;gt;(:character_pinyin)-[:with_pinyin_part]-&amp;gt;(final_part_5:pinyin_part{part_type: "final"})
WHERE id(final_part_5) == "ang"

RETURN x
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We got three possible results:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;("适当其冲" :idiom{pinyin: "['shi4', 'dang1', 'qi2', 'chong1']"}) 
("得不偿失" :idiom{pinyin: "['de2', 'bu4', 'chang2', 'shi1']"}) 
("首当其冲" :idiom{pinyin: "['shou3', 'dang1', 'qi2', 'chong1']"})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let’s try the idiom &lt;em&gt;“首当其冲”&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;And bingo! We’ve got it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2A6mdq4DTv-jICL-ih.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2A6mdq4DTv-jICL-ih.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  What’s Next
&lt;/h3&gt;

&lt;p&gt;If you happen to be interested in graph databases, you can check out the &lt;a href="https://github.com/vesoft-inc/nebula" rel="noopener noreferrer"&gt;Nebula Graph&lt;/a&gt; project on Github.&lt;/p&gt;

&lt;p&gt;Nebula Graph will soon roll out a Visual Builder to enable users to generate nGQL queries in a drag and drop interface. With the no-code tool, you can explore the Handle knowledge graph more easily if you aren’t already familiar with graph query languages like nGQL. If you are interested in the new feature, please join our &lt;a href="https://join.slack.com/t/nebulagraph/shared_invite/zt-7ybejuqa-NCZBroh~PCh66d9kOQj45g" rel="noopener noreferrer"&gt;Slack channel&lt;/a&gt; to get alerted when it’s ready.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2A8PtwIomGYqyRLh1x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2A8PtwIomGYqyRLh1x.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I will also share more visualized ways to play Handle in a follow-up article, please stay tuned.&lt;/p&gt;

&lt;p&gt;Also, an easier way to try NebulaGraph is its fully managed cloud service, which is now available on Microsoft’s Azure Marketplace. The service is currently in beta and offers a generous 70% discount for beta users. Sign up to &lt;a href="https://nebula-graph.io/nebula-graph-cloud/" rel="noopener noreferrer"&gt;get the offer&lt;/a&gt; if you are interested!&lt;/p&gt;

&lt;p&gt;Happy graphing!&lt;/p&gt;

&lt;h3&gt;
  
  
  About the Author
&lt;/h3&gt;

&lt;p&gt;Wey Gu is the Developer Advocate of Nebula Graph. He is passionate about spreading the graph technology to the developer community and trying his best to make distributed graph database more accessible. Follow him on &lt;a href="https://twitter.com/wey_gu" rel="noopener noreferrer"&gt;Twitter&lt;/a&gt; or visit his &lt;a href="https://www.siwei.io/en/" rel="noopener noreferrer"&gt;blog&lt;/a&gt; for more fun stuff.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Originally published at&lt;/em&gt; &lt;a href="https://nebula-graph.io/posts/crack-chinese-wordle-using-knowledge-graph/" rel="noopener noreferrer"&gt;&lt;em&gt;https://nebula-graph.io&lt;/em&gt;&lt;/a&gt; &lt;em&gt;on April 15, 2022.&lt;/em&gt;&lt;/p&gt;




</description>
      <category>nebulagraph</category>
      <category>graphdatabase</category>
      <category>wordle</category>
      <category>handle</category>
    </item>
    <item>
      <title>An Introduction to NebulaGraph 2.0 Query Engine</title>
      <dc:creator>NebulaGraph</dc:creator>
      <pubDate>Wed, 17 Mar 2021 00:43:42 +0000</pubDate>
      <link>https://dev.to/nebulagraph/an-introduction-to-nebula-graph-20-query-engine-4eof</link>
      <guid>https://dev.to/nebulagraph/an-introduction-to-nebula-graph-20-query-engine-4eof</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--F1ch4osb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/0%2A6aP9Qy4ZethpkLxf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--F1ch4osb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/0%2A6aP9Qy4ZethpkLxf.png" alt="" width="880" height="374"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Introduction
&lt;/h3&gt;

&lt;p&gt;NebulaGraph, a distributed graph database, has changed significantly in V2.0 compared with V1.0. One of the most obvious changes is that in NebulaGraph 1.0, the code of the Query, Storage, and Meta modules was placed in one repository, while from NebulaGraph 2.0 onwards, these modules are placed in three repositories:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;nebula-graph: Mainly contains the code of the Query module.&lt;/li&gt;
&lt;li&gt;nebula-common: Mainly contains expression definitions, function definitions, and some public interfaces.&lt;/li&gt;
&lt;li&gt;nebula-storage: Mainly contains the code of the Storage and Meta modules.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This article introduces the overall structure of the Query layer and uses an nGQL statement to describe how it is processed in the four main modules of the Query layer.&lt;/p&gt;

&lt;h3&gt;
  
  
  Architecture
&lt;/h3&gt;

&lt;p&gt;This figure shows the architecture of the Query layer. It has four submodules:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Parser: Performs lexical analysis and syntax analysis.&lt;/li&gt;
&lt;li&gt;Validator: Validates the statements.&lt;/li&gt;
&lt;li&gt;Planner: Generates and optimizes the execution plans.&lt;/li&gt;
&lt;li&gt;Executor: Executes the operators.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--tWbinwXE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/0%2Axsuiy7kjOeCBOjdH.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--tWbinwXE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/0%2Axsuiy7kjOeCBOjdH.png" alt="" width="880" height="189"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Source Code Hierarchy
&lt;/h3&gt;

&lt;p&gt;Now, let’s look at the source code hierarchy under the nebula-graph repository.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;|--src 
|--context // contexts for validation and execution 
|--daemons 
|--executor // execution operators 
|--mock 
|--optimizer // optimization rules 
|--parser // lexical analysis and syntax analysis 
|--planner // structure of the execution plans 
|--scheduler // scheduler 
|--service 
|--util // basic components 
|--validator // validation of the statements 
|--visitor
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  How a Query is Executed
&lt;/h3&gt;

&lt;p&gt;Starting from the release of NebulaGraph 2.0-alpha, nGQL supports vertex IDs of the String type. Support for vertex IDs of the Int type is still in progress in NebulaGraph 2.0, to stay compatible with NebulaGraph 1.0.&lt;/p&gt;

&lt;p&gt;To introduce how the Query layer works, let’s take the nGQL statement &lt;code&gt;GO FROM "Tim" OVER like WHERE like.likeness &amp;gt; 8.0 YIELD like._dst&lt;/code&gt; as an example. Its data flow through the Query layer is shown in the following figure.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--nJ3UHCvz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/0%2A6U_EEtc65On3EbYG.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--nJ3UHCvz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/0%2A6U_EEtc65On3EbYG.png" alt="" width="880" height="303"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When the statement is input, it is processed as follows.&lt;/p&gt;

&lt;h3&gt;
  
  
  Stage 1: Generating an AST
&lt;/h3&gt;

&lt;p&gt;In the first stage, the statement is parsed by the Parser (composed of Flex and Bison) and its corresponding AST is generated. The structure of the AST is shown in the following figure.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--LdCDZsU2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/786/0%2A4zVNCqksV-AaLdfd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--LdCDZsU2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/786/0%2A4zVNCqksV-AaLdfd.png" alt="" width="786" height="272"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this stage, the Parser intercepts statements that do not conform to the syntax rules. For example, a statement like &lt;code&gt;GO "Tim" FROM OVER like YIELD like._dst&lt;/code&gt; will be intercepted directly in this stage because of its invalid syntax.&lt;/p&gt;

&lt;h3&gt;
  
  
  Stage 2: Validation
&lt;/h3&gt;

&lt;p&gt;In the second stage, the Validator performs a series of validations on the AST. It mainly works on these tasks:&lt;/p&gt;

&lt;p&gt;When parsing the &lt;code&gt;OVER&lt;/code&gt;, &lt;code&gt;WHERE&lt;/code&gt;, and &lt;code&gt;YIELD&lt;/code&gt; clauses, the Validator looks up the Schema and verifies that the referenced edge types and tags exist. For an INSERT statement, the Validator verifies whether the types of the inserted data match the ones defined in the Schema.&lt;/p&gt;

&lt;p&gt;For multiple statements, like &lt;code&gt;$var = GO FROM "Tim" OVER like YIELD like._dst AS ID; GO FROM $var.ID OVER serve YIELD serve._dst&lt;/code&gt;, the Validator verifies &lt;code&gt;$var.ID&lt;/code&gt;: first to see whether &lt;code&gt;var&lt;/code&gt; was defined, and then to check whether the ID property is attached to the &lt;code&gt;var&lt;/code&gt; variable. If &lt;code&gt;$var.ID&lt;/code&gt; is replaced with &lt;code&gt;$var1.ID&lt;/code&gt; or &lt;code&gt;$var.IID&lt;/code&gt;, the validation fails.&lt;/p&gt;

&lt;p&gt;The Validator infers what type the result of an expression is and verifies the type against the specified clause. For example, the WHERE clause requires the result to be a Boolean value, a NULL value, or empty.&lt;/p&gt;

&lt;p&gt;Take a statement like &lt;code&gt;GO FROM "Tim" OVER * YIELD like._dst, like.likeness, serve._dst&lt;/code&gt; as an example. When verifying the &lt;code&gt;OVER&lt;/code&gt; clause, the Validator needs to query the Schema and replace &lt;code&gt;*&lt;/code&gt; with all the edge types defined in the Schema. For example, if only the like and serve edges are defined, the statement becomes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;GO FROM "Tim" OVER serve, like YIELD like._dst, like.likeness, serve._dst
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For an nGQL statement with the PIPE operator, such as &lt;code&gt;GO FROM "Tim" OVER like YIELD like._dst AS ID | GO FROM $-.ID OVER serve YIELD serve._dst&lt;/code&gt;, the Validator verifies &lt;code&gt;$-.ID&lt;/code&gt;. In the example statement, the ID property in the second clause was already defined in the previous clause, so the second clause is verified as legal. If &lt;code&gt;$-.ID&lt;/code&gt; is changed to &lt;code&gt;$-.a&lt;/code&gt;, where &lt;code&gt;a&lt;/code&gt; was not defined, the second clause is illegal.&lt;/p&gt;

&lt;h3&gt;
  
  
  Stage 3: Generating an Execution Plan
&lt;/h3&gt;

&lt;p&gt;When the validation succeeds, an execution plan is generated in Stage 3. Its data structure is stored in the src/planner directory, with the following logical structure.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--NtzY9n-3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/350/0%2AJilN0HfXLQbtv1C8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--NtzY9n-3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/350/0%2AJilN0HfXLQbtv1C8.png" alt="" width="350" height="476"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Query Execution Flow
&lt;/h4&gt;

&lt;p&gt;The generated execution plan is a directed acyclic graph where the dependencies between nodes are determined in the &lt;a href="https://github.com/vesoft-inc/nebula-graph/blob/master/src/validator/PipeValidator.cpp"&gt;toPlan()&lt;/a&gt; function of each module in the Validator. As shown in the preceding figure, the Project node depends on the Filter node, the Filter node depends on the GetNeighbor node, and so on, up to the Start node.&lt;/p&gt;

&lt;p&gt;During the execution stage, the Executor generates an operator for each node and starts scheduling from the root node (e.g., the Project node in this example). If the root node is found to be dependent on other nodes, the Executor &lt;strong&gt;recursively calls&lt;/strong&gt; the nodes that the root node depends on until it finds a node that is not dependent on any node (e.g., the Start node in this example), and then starts execution. After the execution, the Executor continues executing the node that depends on the executed node (e.g., the GetNeighbor node in this example), and so on, until the root node is reached.&lt;/p&gt;

&lt;h4&gt;
  
  
  Query Data Flow
&lt;/h4&gt;

&lt;p&gt;The input and output of each node are also determined in &lt;a href="https://github.com/vesoft-inc/nebula-graph/blob/master/src/validator/PipeValidator.cpp"&gt;toPlan()&lt;/a&gt;. Although the execution is performed in the order defined in the execution plan, the input of each node can be customized and does not completely depend on the previous node, because the inputs and outputs of all nodes are actually stored in a hash table, where the keys are names customized during the creation of each node. For example, if the hash table is named ResultMap, when creating the Filter node, you can specify that the node takes data from ResultMap["GN1"] and puts its result into ResultMap["Filter2"], and so on. These entries work as the input and output of each node. The hash table is defined in src/context/ExecutionContext.cpp under the nebula-graph repository.&lt;/p&gt;

&lt;p&gt;At this point, the execution plan has not actually been executed, so the value of each key in the hash table is empty (except for the starting node, whose input variables hold the starting data); the values are computed and filled in during the Executor stage.&lt;/p&gt;
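To make the idea concrete, here is a schematic Python sketch of that dependency-driven scheduling and named-variable data flow (the node names follow the example above, but the classes, rows, and numbers are made up for illustration; the real Executor is written in C++):

```python
# Schematic of the Executor: each plan node reads its input variable
# from a shared result map and writes its output under its own key.
# Variable names like "GN1" follow the text; the data is invented.

result_map = {"__start__": [("Tim", "Tom", 9.0), ("Tim", "Ann", 7.5)]}

class Node:
    def __init__(self, name, dep, in_var, fn):
        self.name, self.dep, self.in_var, self.fn = name, dep, in_var, fn

    def execute(self):
        if self.dep is not None:      # recursively run the dependency first
            self.dep.execute()
        result_map[self.name] = self.fn(result_map[self.in_var])

start   = Node("Start", None, "__start__", lambda rows: rows)
get_nbr = Node("GN1", start, "Start", lambda rows: rows)
filt    = Node("Filter2", get_nbr, "GN1",
               lambda rows: [r for r in rows if r[2] > 8.0])
proj    = Node("Project3", filt, "Filter2",
               lambda rows: [r[1] for r in rows])  # YIELD like._dst

proj.execute()                  # scheduling starts from the root node
print(result_map["Project3"])   # ['Tom']
```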

&lt;p&gt;This is a simple example. A more complex one will be shown at the end for you to better understand how an execution plan works.&lt;/p&gt;

&lt;h3&gt;
  
  
  Stage 4: Optimizing the Execution Plan
&lt;/h3&gt;

&lt;p&gt;An execution plan generated in the preceding stage can be optimized optionally. In the etc/nebula-graphd.conf file, when enable_optimizer is set to true, the execution plans will be optimized. In this example, when optimization is enabled, the data flow is as follows.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--4vb7dfDS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/0%2AiRpFLNGVZZgEdXwU.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--4vb7dfDS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/0%2AiRpFLNGVZZgEdXwU.png" alt="" width="880" height="288"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As shown in the preceding figure, the Filter node is integrated into the GetNeighbor node. Therefore, when the GetNeighbor operator calls interfaces of the Storage layer to get the neighboring edges of a vertex during the execution stage, the Storage layer will directly filter out the unqualified edges internally. Such optimization greatly reduces the amount of data transfer, which is commonly known as filter push down.&lt;/p&gt;

&lt;p&gt;In the execution plan, each node directly depends on other nodes. To explore equivalent transformations and to reuse the same parts of an execution plan, such a direct dependency between nodes is converted into an OptGroupNode to OptGroup dependency. Each OptGroup contains an equivalent set of OptGroupNodes, and each OptGroupNode contains one node in the execution plan. An OptGroupNode depends on an OptGroup but not on other OptGroupNodes. Therefore, from one OptGroupNode, many equivalent execution plans can be explored because the OptGroup it depends on contains different OptGroupNodes, which saves storage space.&lt;/p&gt;

&lt;p&gt;All the optimization rules we have implemented so far are considered as RBO (Rule-Based Optimization), which means the plans optimized based on the rules must be better than their original versions. The CBO (Cost-Based Optimization) feature is under development. The optimization process is a “bottom-up” exploration process: For each rule, the root node of the execution plan (in this case, the Project node) is the entry point, along the node dependencies, the bottom node is found, and then from the bottom node, the OptGroupNode in each OptGroup is explored to see if it matches the rule, until the entire execution plan can no longer apply the rule. Then, the next rule is explored.&lt;/p&gt;
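The rule-matching step can be sketched as a tiny bottom-up rewrite in Python (a schematic of filter push-down only; the real optimizer works on OptGroup/OptGroupNode structures, and the tuple-based plan below is an invented stand-in):

```python
# Schematic of one RBO rule: if a Filter node's child is GetNeighbors,
# fold the filter condition into GetNeighbors (filter push-down).
# Plan nodes are (kind, child, payload) tuples; names are illustrative.

def push_down_filter(node):
    if node is None:
        return node
    kind, child, payload = node
    child = push_down_filter(child)          # rewrite bottom-up
    if kind == "Filter" and child and child[0] == "GetNeighbors":
        # fold the predicate into the storage-side scan
        return ("GetNeighbors", child[1], {"filter": payload})
    return (kind, child, payload)

plan = ("Project",
        ("Filter", ("GetNeighbors", ("Start", None, None), None),
         "like.likeness > 8.0"),
        "like._dst")
optimized = push_down_filter(plan)
print(optimized)
# ('Project', ('GetNeighbors', ('Start', None, None),
#              {'filter': 'like.likeness > 8.0'}), 'like._dst')
```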

&lt;p&gt;This figure shows the optimization process of this example.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Fwxtsx5F--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/0%2AB568tfohCBfQuNkS.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Fwxtsx5F--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/0%2AB568tfohCBfQuNkS.png" alt="" width="880" height="480"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As shown in the preceding figure, when the Filter node is explored, it is found that its children node is GetNeighbors, which matches successfully with the pre-defined pattern in the rule, so a transformation is initiated to integrate the Filter node into the GetNeighbors node, the Filter node is removed, and then the process continues to the next rule.&lt;/p&gt;

&lt;p&gt;The optimized code is in the src/optimizer/ directory under the nebula-graph repository.&lt;/p&gt;

&lt;h3&gt;
  
  
  Stage 5: Execution
&lt;/h3&gt;

&lt;p&gt;In the fifth stage, the Scheduler generates the corresponding execution operators against the execution plan, starting from the leaf nodes and ending at the root node. The structure is as follows.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--U5bI9As4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/0%2ATFNgq2e1nhNpvo4z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--U5bI9As4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/0%2ATFNgq2e1nhNpvo4z.png" alt="" width="880" height="418"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Each node of the execution plan has one execution operator node, whose input and output have been determined in the execution plan. Each operator only needs to get the values for the input variables, compute them, and finally put the results into the corresponding output variables. Therefore, it is only necessary to execute step by step from the starting node, and the result of the last operator is returned to the user as the final result.&lt;/p&gt;

&lt;h3&gt;
  
  
  An Example Query
&lt;/h3&gt;

&lt;p&gt;Now we will get acquainted with the structure of an execution plan by executing a FIND SHORTEST PATH statement.&lt;/p&gt;

&lt;p&gt;Open nebula-console and enter the following statement.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;FIND SHORTEST PATH FROM "YAO MING" TO "Tim Duncan" OVER like, serve UPTO 5 STEPS
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To get the details of the execution plan of this statement, precede it with the EXPLAIN keyword.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--TDqN50tT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/0%2AaOVbHOSel6h44S5t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--TDqN50tT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/0%2AaOVbHOSel6h44S5t.png" alt="" width="880" height="973"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The preceding figure shows, from left to right, the unique ID of each node in the execution plan, their names, the IDs of their dependent nodes, the profiling data (information about the execution of the PROFILE command), and the details of each node (including the names of the input and output variables, the column names of the output results, and the parameters of the node).&lt;/p&gt;

&lt;p&gt;To visualize the information, you can follow these steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Use EXPLAIN format="dot" instead of EXPLAIN. nebula-console can generate the data in the DOT format.&lt;/li&gt;
&lt;li&gt;Open the &lt;a href="https://dreampuf.github.io/GraphvizOnline/"&gt;Graphviz Online&lt;/a&gt; website and paste the generated DOT data. You can see the structure as shown in the following figure. This structure corresponds to the execution flow of the operators in the execution stage.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s---iXrHcL---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/0%2AxSopOSOlLve_bqFM.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---iXrHcL---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/0%2AxSopOSOlLve_bqFM.png" alt="" width="880" height="746"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The shortest path implementation uses a two-way breadth-first search, expanding from both “YAO MING” and “Tim Duncan”. Therefore there are two instances each of the GetNeighbors, BFSShortest, Project, and Dedup operators. Their inputs are connected by the PassThrough operator, and the paths are stitched together by the ConjunctPath operator. The Loop operator then controls the number of steps expanded outward, and the input of the DataCollect operator is actually taken from the output variable of the ConjunctPath operator.&lt;/p&gt;
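&lt;p&gt;To make the two-way expansion concrete, here is a small pure-Python sketch (toy data, not the real operators): two frontiers expand alternately from the source and the destination, and the search stops at the step where they intersect, which is where ConjunctPath would stitch the two half-paths:&lt;/p&gt;

```python
from collections import defaultdict

# Assumed toy edges standing in for the nba data set; treated as undirected
# here so that the backward frontier can expand as well.
edges = [("Yao Ming", "Tracy McGrady"), ("Tracy McGrady", "Kobe Bryant"),
         ("Kobe Bryant", "Tim Duncan"), ("Tim Duncan", "Tony Parker")]
adj = defaultdict(set)
for u, v in edges:
    adj[u].add(v)
    adj[v].add(u)

def bidirectional_shortest(src, dst, max_steps=5):
    """Expand BFS frontiers from both ends; return the shortest path length."""
    if src == dst:
        return 0
    dist_f, dist_b = {src: 0}, {dst: 0}
    front_f, front_b = {src}, {dst}
    for _ in range(max_steps):
        # Alternate: expand the forward frontier, then the backward one.
        for dist, front, other in ((dist_f, front_f, dist_b),
                                   (dist_b, front_b, dist_f)):
            grown = set()
            for u in set(front):
                for v in adj[u]:
                    if v in other:  # frontiers meet: stitch the two halves
                        return dist[u] + 1 + other[v]
                    if v not in dist:
                        dist[v] = dist[u] + 1
                        grown.add(v)
            front.clear()
            front.update(grown)
    return None  # no path found within max_steps
```

&lt;p&gt;With the toy edges above, bidirectional_shortest("Yao Ming", "Tim Duncan") returns 3, matching the three-hop path through Tracy McGrady and Kobe Bryant.&lt;/p&gt;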

&lt;p&gt;The code for each operator is in the src/executor directory under the nebula-graph repository. Feel free to share your ideas with us on GitHub: &lt;a href="https://github.com/vesoft-inc/nebula"&gt;https://github.com/vesoft-inc/nebula&lt;/a&gt; or the official forum: &lt;a href="https://discuss.nebula-graph.io/"&gt;https://discuss.nebula-graph.io/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Originally published at&lt;/em&gt; &lt;a href="https://nebula-graph.io/posts/nebula-query-engine-introduction/"&gt;&lt;em&gt;https://nebula-graph.io&lt;/em&gt;&lt;/a&gt; &lt;em&gt;on March 16, 2021.&lt;/em&gt;&lt;/p&gt;




</description>
      <category>graphdatabase</category>
      <category>architecture</category>
      <category>query</category>
      <category>nebulagraph</category>
    </item>
    <item>
      <title>NebulaGraph Operator: Automating NebulaGraph Database Deployment and Maintenance on K8s</title>
      <dc:creator>NebulaGraph</dc:creator>
      <pubDate>Wed, 25 Nov 2020 07:32:41 +0000</pubDate>
      <link>https://dev.to/nebulagraph/nebula-operator-automated-the-nebula-graph-cluster-deployment-and-maintenance-on-k8s-inm</link>
      <guid>https://dev.to/nebulagraph/nebula-operator-automated-the-nebula-graph-cluster-deployment-and-maintenance-on-k8s-inm</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ihG3EWGW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/0%2AvfuwVLcWpI4vdx9A.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ihG3EWGW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/0%2AvfuwVLcWpI4vdx9A.png" alt="" width="880" height="586"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;NebulaGraph Operator is a plug-in to deploy, operate, and maintain NebulaGraph automatically on K8s. Building on the scalability mechanisms of K8s, it introduces NebulaGraph's operation and maintenance knowledge into the K8s system in the CRD + Controller format, making NebulaGraph a truly cloud-native graph database.&lt;/p&gt;

&lt;p&gt;NebulaGraph is a high-performance distributed open source graph database. From the architecture chart below, we can see that a complete NebulaGraph cluster is composed of three types of services, namely the Meta Service, Query Service (Computation Layer) and Storage Service (Storage Layer).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--x0aw5-sN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/0%2A38-mG1Ffc6clkrz9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--x0aw5-sN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/0%2A38-mG1Ffc6clkrz9.png" alt="" width="880" height="468"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Each type of service is itself a cluster composed of multiple replicas. In NebulaGraph Operator, we name these three types of components Metad, Graphd, and Storaged.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Metad: Responsible for providing and storing the metadata of the graph database. It also acts as the scheduler for the cluster, directing operation and maintenance such as storage expansion, data migration and leader change.&lt;/li&gt;
&lt;li&gt;Graphd: Responsible for processing NebulaGraph's query language (nGQL) statements. Each Graphd runs a stateless query computation engine. These engines do not communicate with each other; they only read metadata from the Metad cluster and communicate with the Storaged cluster. They are also responsible for access and interaction with different clients.&lt;/li&gt;
&lt;li&gt;Storaged: Responsible for the graph data storage. The graph data is divided into many partitions. Partitions with the same ID form a Raft Group to achieve multi-replica consistency. The default storage engine of NebulaGraph is the Key-Value storage of RocksDB.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now that we have a brief understanding of the core components of NebulaGraph, we can draw the following conclusions:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;NebulaGraph adopts a disaggregated storage and compute architecture. With a clear stratification of components and responsibilities, you can scale any component in or out independently according to your business needs. This makes NebulaGraph friendly to deploy on container orchestration systems like K8s, where you can take full advantage of its flexibility.&lt;/li&gt;
&lt;li&gt;NebulaGraph is a complex distributed system. Deploying, operating, and maintaining it all require specialized expertise, which steepens the learning curve and increases the workload. Even if you deploy NebulaGraph on K8s, the native K8s controllers are not good enough at state management and exception handling, so a NebulaGraph cluster cannot reach its full potential.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Based on the above considerations, we developed Nebula Operator to make full use of NebulaGraph's native scalability and failover capabilities, and to lower the threshold for operating and maintaining NebulaGraph clusters.&lt;/p&gt;

&lt;p&gt;In order to better understand the working principle of Nebula Operator, let us first review what an Operator is.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is Nebula Operator
&lt;/h3&gt;

&lt;p&gt;Operator is not a new concept. As early as 2017, CoreOS launched their Etcd Operator. The aim of an Operator is to strengthen K8s's management of stateful applications. Etcd Operator builds on two core K8s concepts: the declarative API and the control loop.&lt;/p&gt;

&lt;p&gt;Let’s describe this process by the following pseudocode snippet.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Declare the desired state of object X in the cluster and create X for { 
  current_state := Get the current state of object X in the cluster       

  desired_state := Get the desired state of object X in the cluster 

  if current_state == desired_state { 
     nothing is done 
  } else { 
    Do the pre - defined choreography actions and move the current state to the desired state 

  } 
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the K8s system, there is a specific control loop running in each built-in resource object. The control loop gradually moves the current state to the desired state through the pre-defined orchestration actions.&lt;/p&gt;
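&lt;p&gt;The pseudocode above can be condensed into a tiny Python sketch (assumed state shape; the real controllers are written in Go against the K8s client libraries): one reconcile pass compares the current state with the desired state and performs an orchestration action only when they differ:&lt;/p&gt;

```python
# Minimal control-loop sketch: the only "state" here is a replica count, and
# the only orchestration action is scaling. All names are illustrative.
def reconcile(current_state, desired_state):
    """One pass of the control loop: converge current toward desired."""
    if current_state == desired_state:
        return current_state  # nothing to do
    # Pre-defined orchestration action: scale replicas to the declared number.
    return dict(current_state, replicas=desired_state["replicas"])

cluster = {"replicas": 3}
spec = {"replicas": 10}  # e.g. the user changed the declared replica count

cluster = reconcile(cluster, spec)  # the loop would repeat this until stable
```

&lt;p&gt;A real controller runs this pass repeatedly, so even if one action only moves the state part of the way, the loop gradually converges on the declared state.&lt;/p&gt;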

&lt;p&gt;For resource types that do not exist in K8s, you can register them by adding customized API objects. The common way is to use the CustomResourceDefinition (CRD) and the Aggregation ApiServer (AA). For example, Nebula Operator uses the CRD to register a “Nebula Cluster” resource and an “Advanced Statefulset” resource.&lt;/p&gt;

&lt;p&gt;When you are done with the above-mentioned custom resources, you can write a custom controller to watch their state changes and to operate and maintain NebulaGraph automatically according to your own strategy. This is how Nebula Operator simplifies operations and maintenance.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: nebula.com/v1alpha1
kind: NebulaCluster
metadata:
  name: nebulaclusters
  namespace: default
spec:
  graphd:
    replicas: 1
    baseImage: vesoft/nebula-graphd
    imageVersion: v2-preview-nightly
    service:
      type: NodePort
      externalTrafficPolicy: Cluster
    storageClaim:
      storageClassName: fast-disks
  metad:
    replicas: 3
    baseImage: vesoft/nebula-metad
    imageVersion: v2-preview-nightly
    storageClaim:
      storageClassName: fast-disks
  storaged:
    replicas: 3
    baseImage: vesoft/nebula-storaged
    imageVersion: v2-preview-nightly
    storageClaim:
      storageClassName: fast-disks
  schedulerName: nebula-scheduler
  imagePullPolicy: Always
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Above is a simple NebulaCluster instance. You only need to modify the sizes in the spec, and Nebula Operator takes care of deployment and teardown through its control loop. For example, to expand the number of Storaged replicas to 10, you only need to change the .spec.storaged.replicas parameter to 10.&lt;/p&gt;

&lt;p&gt;Now that you have a basic understanding of NebulaGraph and Nebula Operator, let's see what features Nebula Operator provides.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Deploy/Uninstall:&lt;/strong&gt; We describe the whole NebulaGraph cluster with a CRD and register it in the ApiServer. Users only need to provide the corresponding CR file, and the Operator can quickly deploy or uninstall a NebulaGraph cluster, simplifying cluster creation and removal.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scalability:&lt;/strong&gt; By calling the native scaling interfaces provided by NebulaGraph, Nebula Operator encapsulates scaling operations and ensures data stability. Users only need to modify the size in the YAML spec to scale in or out.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Upgrade:&lt;/strong&gt; Based on the StatefulSet natively provided by K8s, we extended its ability to replace images in place. Upgrading your cluster with Nebula Operator not only cuts Pod scheduling time but also improves cluster stability and predictability, because no Pod is rescheduled and no resources change during the upgrade.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Failover:&lt;/strong&gt; Nebula Operator observes the services dynamically by calling the cluster interfaces provided by NebulaGraph. Once an exception is detected, Nebula Operator automatically repairs the failure and activates the corresponding fault-tolerance mechanism according to the exception type.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;WebHook:&lt;/strong&gt; A standard NebulaGraph cluster needs at least three Metad replicas, and incorrect modification of the Metad parameters can bring down the cluster. We validate required parameters through WebHook-based admission control, and forcibly correct some invalid declarations to keep the cluster running stably.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Hope you find Nebula Operator interesting. If you have any question, feel free to leave comments below!&lt;/p&gt;

&lt;h3&gt;
  
  
  You might also like:
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://nebula-graph.io/posts/github-action-automating-project-process/"&gt;Automating Your Project Processes with Github Actions&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://nebula-graph.io/posts/practice-jepsen-test-framework-in-nebula-graph/"&gt;Practice Jepsen Test Framework in Nebula Graph&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://nebula-graph.io/posts/integrate-codecov-test-coverage-with-nebula-graph/"&gt;Integrating Codecov Test Coverage With Nebula Graph&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Like what we do ? Star us on GitHub.&lt;/em&gt; &lt;a href="https://github.com/vesoft-inc/nebula"&gt;&lt;em&gt;https://github.com/vesoft-inc/nebula&lt;/em&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;em&gt;Originally published at&lt;/em&gt; &lt;a href="https://nebula-graph.io/posts/nebula-operator-automated-deployment-cloud/"&gt;&lt;em&gt;https://nebula-graph.io&lt;/em&gt;&lt;/a&gt; &lt;em&gt;on November 25, 2020.&lt;/em&gt;&lt;/p&gt;




</description>
      <category>nebulagraph</category>
      <category>cloud</category>
      <category>deployment</category>
      <category>k8s</category>
    </item>
    <item>
      <title>Practicing Graph Computation with GraphX in NebulaGraph Database</title>
      <dc:creator>NebulaGraph</dc:creator>
      <pubDate>Thu, 19 Nov 2020 07:14:15 +0000</pubDate>
      <link>https://dev.to/nebulagraph/practicing-graph-computation-with-graphx-in-nebula-graph-50f4</link>
      <guid>https://dev.to/nebulagraph/practicing-graph-computation-with-graphx-in-nebula-graph-50f4</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--EOUzbFVS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/0%2AHqi2koKVnm9WeYVB.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--EOUzbFVS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/0%2AHqi2koKVnm9WeYVB.png" alt="" width="880" height="374"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With the rapid development of network information technology, data is becoming increasingly multi-source and heterogeneous, and countless intricate relations lie inside it. These relations, together with the network structure, are essential for data analysis. Unfortunately, when it comes to large-scale data analysis, traditional relational databases are poor at detecting and expressing associations. Graph data has therefore attracted great attention for its expressive power. Graph computing expresses and solves problems with a graph model, and graphs can integrate multi-source data types. Beyond displaying the static basic features of data, graph computing can also reveal the graph structure and relationships hidden in the data. Graphs have thus become an important analysis tool in social networks, recommendation systems, knowledge graphs, financial risk control, network security, and text retrieval.&lt;/p&gt;

&lt;h3&gt;
  
  
  Practical Algorithms
&lt;/h3&gt;

&lt;p&gt;In order to support the business needs of large-scale graph computing, &lt;a href="https://github.com/vesoft-inc/nebula"&gt;NebulaGraph&lt;/a&gt; provides the &lt;a href="https://en.wikipedia.org/wiki/PageRank"&gt;PageRank&lt;/a&gt; and &lt;a href="https://en.wikipedia.org/wiki/Louvain_method"&gt;Louvain&lt;/a&gt; community detection graph algorithms based on &lt;a href="https://github.com/apache/spark/tree/master/graphx"&gt;GraphX&lt;/a&gt;. Users can execute these algorithms by submitting Spark tasks. In addition, users can write Spark programs with Spark Connector to call the other graph algorithms built into GraphX, such as LabelPropagation and ConnectedComponents.&lt;/p&gt;

&lt;h3&gt;
  
  
  PageRank
&lt;/h3&gt;

&lt;p&gt;PageRank is an algorithm used by Google to rank web pages in its search engine results. PageRank is a way of measuring the importance of website pages.&lt;/p&gt;

&lt;h4&gt;
  
  
  Introduction to PageRank
&lt;/h4&gt;

&lt;p&gt;PageRank was named after Larry Page, who developed it with Sergey Brin at Stanford in the United States. PageRank works by counting the number and quality of links to a page to determine a rough estimate of how important the website is. The underlying assumption is that more important websites are likely to receive more links from other websites.&lt;/p&gt;

&lt;h4&gt;
  
  
  Applications of PageRank
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Content recommendation based on similarity&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;With the help of the PageRank, you can recommend similar content to users based on their browse history and view time when analyzing social applications such as Twitter and Facebook.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Analyze the social influence of users&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You can analyze the social influence of users according to their PageRank values in social network analysis.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Importance research for papers&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Judge a paper's quality according to its PageRank value. In fact, the PageRank algorithm itself grew out of the idea of ranking papers by their citations. PageRank is also used in data analysis and mining.&lt;/p&gt;

&lt;h4&gt;
  
  
  Implementing PageRank
&lt;/h4&gt;

&lt;p&gt;PageRank in GraphX is implemented based on the Pregel computing model. The algorithm contains three procedures:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Set a same initial PageRank value for every vertex (web page) in the graph;&lt;/li&gt;
&lt;li&gt;The first iteration: Send messages along the edges. Each vertex receives the rank contributions of its neighbors along its related edges and computes a new PageRank value;&lt;/li&gt;
&lt;li&gt;Subsequent iterations: Put the PageRank values obtained in the previous iteration into the formula of the chosen algorithm model to get new PageRank values, and repeat until the values stabilize.&lt;/li&gt;
&lt;/ol&gt;
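&lt;p&gt;The three steps above can be sketched as a plain power iteration in Python (toy three-page graph; GraphX's Pregel-based implementation distributes the same message passing across the cluster). The 0.85 damping factor corresponds to a reset probability of 0.15, the resetProb default used by nebula-algorithm:&lt;/p&gt;

```python
# Power-iteration PageRank on a toy directed graph (pure Python illustration,
# not the GraphX implementation). Graph and iteration count are assumed.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}  # page: pages it links to
damping = 0.85  # 1 - resetProb, with resetProb = 0.15

n = len(graph)
rank = {v: 1.0 / n for v in graph}  # step 1: same initial value everywhere

for _ in range(50):  # steps 2-3: iterate message passing until roughly stable
    contrib = {v: 0.0 for v in graph}
    for src, outs in graph.items():
        share = rank[src] / len(outs)
        for dst in outs:  # each vertex sends its rank share along its edges
            contrib[dst] += share
    rank = {v: (1 - damping) / n + damping * contrib[v] for v in graph}
```

&lt;p&gt;After iterating, page C ends up with the highest rank because both A and B link to it, and the ranks still sum to 1.&lt;/p&gt;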

&lt;h3&gt;
  
  
  Louvain method
&lt;/h3&gt;

&lt;p&gt;The Louvain method for community detection is a method to extract communities from large networks. The method is an aggregation algorithm for graphs.&lt;/p&gt;

&lt;h4&gt;
  
  
  Introduction to Louvain
&lt;/h4&gt;

&lt;p&gt;The core idea of Louvain is the optimization of modularity as the algorithm progresses. If moving a node into a certain community increases the modularity more than moving it into any other community, the node joins that community. If joining any other community does not increase the modularity, the node stays in its current community.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Modularity&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Modularity formula&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The modularity Q measures the density of links inside communities compared to links between communities. Modularity is defined as:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--vFBiaGbH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AnZL6yE26BKod91gRRwP1xQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--vFBiaGbH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AnZL6yE26BKod91gRRwP1xQ.png" alt="" width="880" height="116"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;where&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--FYBa8HpD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AFNLUa10TZ_7zSMbVbK2pqg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--FYBa8HpD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AFNLUa10TZ_7zSMbVbK2pqg.png" alt="" width="880" height="182"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Deformation for the modularity formula&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The formula is only meaningful when node i and node j belong to the same community, so it measures the closeness within a community. A simplified deformation of the formula is as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--e-9KGLWY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AbtyG9RPlzQZHBbJcGxR7vw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--e-9KGLWY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AbtyG9RPlzQZHBbJcGxR7vw.png" alt="" width="880" height="73"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;where&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--FGiFGBD2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2A7Md8WpbnfM8t7DvZWkIIMw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--FGiFGBD2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2A7Md8WpbnfM8t7DvZWkIIMw.png" alt="" width="880" height="241"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Calculate the modularity change&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In the Louvain method, there is no need to calculate the specific modularity for each community. You only need to compare the modularity changes after adding a certain node to the community. That is to say you only need to calculate △Q.&lt;/p&gt;

&lt;p&gt;When inserting node i to a certain community, the modularity change is:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--CeGRRCN2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AIDt2HffQuxkinZBMZPqEQQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--CeGRRCN2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AIDt2HffQuxkinZBMZPqEQQ.png" alt="" width="880" height="88"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;where&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--UynXLgH7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AY3pU19Ukxb2Gcz0ErsKQEg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--UynXLgH7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AY3pU19Ukxb2Gcz0ErsKQEg.png" alt="" width="880" height="183"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Applications of Louvain
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;Financial risk control&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In financial risk control scenarios, you can use the Louvain method to identify fraud rings based on users' behaviors.&lt;/p&gt;

&lt;ol start="2"&gt;
&lt;li&gt;Social network&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Louvain divides a social network based on the breadth and strength of node associations in the graph. It can also be used to analyze complex networks and the closeness among groups of people.&lt;/p&gt;

&lt;ol start="3"&gt;
&lt;li&gt;Recommendation system&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Communities can be detected based on users' interests, and more accurate and effective customized recommendations can then be made by combining the detected communities with a collaborative filtering algorithm.&lt;/p&gt;

&lt;h4&gt;
  
  
  Implement Louvain
&lt;/h4&gt;

&lt;p&gt;The Louvain method contains two stages. The procedure is actually the iteration of the two stages.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stage 1:&lt;/strong&gt; Continuously traverse the nodes in the graph, and compare the modularity changes introduced by nodes in each neighbor community. Then add a single node to the community that can maximize the modularity. (For example, node v is added to communities A, B, and C respectively. The modularity increments of the three communities are -1, 1, 2. Then node v should be added to community C.)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stage 2:&lt;/strong&gt; Process based on the first stage. The nodes belonging to the same community are merged into a super node to reconstruct the graph. That is, the community is regarded as a new node in the graph. At this time, the weight of the edges between the two super nodes are the sum of the weight of the edges attached to the original nodes in the two super nodes. That is, the sum of the weight of the edges in the two communities.&lt;/p&gt;

&lt;p&gt;Following are details for the two stages.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--bBhXBqgG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/757/0%2AVG_1BAZIJwLuCcet.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--bBhXBqgG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/757/0%2AVG_1BAZIJwLuCcet.png" alt="" width="757" height="203"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the first stage, the nodes are traversed and then added to the communities they belong to. In this way, we get the middle graph and the four communities.&lt;/p&gt;

&lt;p&gt;In the second stage, the nodes in the community are merged into a super node. The community nodes have self-connected edges whose weight is twice the sum of the weights of the edges connected between all nodes in the community. The edge weight between the communities is the sum of the weights of the edges connecting the two communities. For example, the red community and the light green community are connected by (8,11), (10,11), (10,13). So the weight of the edges between the two communities is 3.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Note: The weight inside the community is twice the weight of the edges between all internal nodes, because the concept of Kin is the sum of the edges of all nodes in the community and node i. When calculating the Kin for a certain community, each edge is actually counted twice.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The Louvain method continuously iterates stage 1 and stage 2 until the algorithm is stable (the modularity for the graph does not change) or the iterations reach the maximum.&lt;/p&gt;
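&lt;p&gt;Stage 2 can be sketched in a few lines of Python (the edges and communities below are assumed for illustration): internal edges fold into a self-loop with twice their weight, as in the note on Kin above, and cross-community edges sum into a single super edge:&lt;/p&gt;

```python
# Stage-2 sketch: collapse each community into a super node. Internal weight
# is doubled into a self-loop (each internal edge is counted twice);
# cross-community weights are summed into one super edge.
def aggregate(edges, community):
    """edges maps (u, v) to weight; community maps node to community id."""
    super_edges = {}
    for (u, v), w in edges.items():
        cu, cv = community[u], community[v]
        if cu == cv:
            key, w = (cu, cu), 2 * w  # self-loop: internal weight counts twice
        else:
            key = tuple(sorted((cu, cv)))
        super_edges[key] = super_edges.get(key, 0) + w
    return super_edges

# Assumed fragment mirroring the figure: three unit-weight edges connect the
# "red" and "green" communities, plus one internal edge inside each community.
edges = {(8, 11): 1, (10, 11): 1, (10, 13): 1, (8, 10): 1, (11, 13): 1}
community = {8: "red", 10: "red", 11: "green", 13: "green"}
agg = aggregate(edges, community)
```

&lt;p&gt;Here agg[("green", "red")] is 3, matching the figure: edges (8,11), (10,11), and (10,13) contribute a combined weight of 3 between the two communities.&lt;/p&gt;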

&lt;h3&gt;
  
  
  Practicing the preceding algorithms
&lt;/h3&gt;

&lt;h3&gt;
  
  
  Demo environment
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Three virtual machines:&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;CPU: Intel(R) Xeon(R) Platinum 8260M CPU @ 2.30GHz&lt;/li&gt;
&lt;li&gt;Processors: 32&lt;/li&gt;
&lt;li&gt;CPU cores: 16&lt;/li&gt;
&lt;li&gt;Memory size: 128 GB&lt;/li&gt;
&lt;/ul&gt;

&lt;ol start="2"&gt;
&lt;li&gt;Software environment:&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;Spark: spark-2.4.6-bin-hadoop2.7, a cluster with three nodes&lt;/li&gt;
&lt;li&gt;YARN v2.10.0: a cluster with three nodes&lt;/li&gt;
&lt;li&gt;NebulaGraph v1.1.0: distributed deployment with the default configurations&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Dataset for testing
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Create graph space
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE SPACE algoTest(partition_num=100, replica_factor=1);

CREATE TAG PERSON() CREATE EDGE FRIEND(likeness double);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="2"&gt;
&lt;li&gt;Create schema
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TAG PERSON()
CREATE EDGE FRIEND(likeness double);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="3"&gt;
&lt;li&gt;Import data&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Use Spark Writer to import data offline into NebulaGraph.&lt;/p&gt;

&lt;ol start="4"&gt;
&lt;li&gt;Test results&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The resource allocation for the Spark job is --driver-memory=20G --executor-memory=100G --executor-cores=3.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;PageRank algorithm execution time on a dataset with a hundred million vertices: 21 minutes.&lt;/li&gt;
&lt;li&gt;Louvain algorithm execution time on a dataset with a hundred million vertices: 1.3 hours.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  How to use nebula-algorithm
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Download and package the nebula-algorithm project into a JAR file.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ git clone git@github.com:vesoft-inc/nebula-java.git $ cd nebula-java/tools/nebula-algorithm $ mvn package -DskipTests
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="2"&gt;
&lt;li&gt;Configure the file src/main/resources/application.conf.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{ 
# Spark relation config 
spark: { 
  app: { 
    # not required, default name is the algorithm that you are going to execute. 
    name: PageRank 

    # not required partitionNum: 12 
 } 

 master: local 

 # not required 
 conf: { 
    driver-memory: 8g 
    executor-memory: 8g 
    executor-cores: 1g 
    cores-max:6 
  } 
 } 

# NebulaGraph relation config 
nebula: { 
  # metadata server address 
  addresses: "127.0.0.1:45500" 
  user: root 
  pswd: nebula 
  space: algoTest 
  # partition specified while creating nebula space, if you didn't specified the partition, then it's 100. 
  partitionNumber: 100 
  # nebula edge type 
  labels: ["FRIEND"] 

 hasWeight: true 
 # if hasWeight is true，then weightCols is required， and weghtCols' order must be corresponding with labels. 
 # Noted: the graph algorithm only supports isomorphic graphs, 
 # so the data type of each col in weightCols must be consistent and all numeric types. 
 weightCols: ["likeness"] 
} 

algorithm: { 
 # the algorithm that you are going to execute，pick one from [pagerank, louvain] 
 executeAlgo: louvain 
 # algorithm result path 
 path: /tmp 

 # pagerank parameter 
 pagerank: { 
   maxIter: 20 
   resetProb: 0.15 # default 0.15 

} 

 # louvain parameter 
 louvain: { 
    maxIter: 20 
    internalIter: 10 
    tol: 0.5 
  } 
 } 
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
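&lt;p&gt;For example, to run PageRank instead of Louvain, only the &lt;code&gt;executeAlgo&lt;/code&gt; line in the &lt;code&gt;algorithm&lt;/code&gt; block needs to change. A minimal fragment of the same file:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;algorithm: {
  executeAlgo: pagerank
  path: /tmp

  pagerank: {
    maxIter: 20
    resetProb: 0.15
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;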



&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Make sure that you have installed and started the Spark service on your machine.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Submit the nebula-algorithm application:&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;spark-submit --master xxx --class com.vesoft.nebula.tools.algorithm.Main /your-jar-path/nebula-algorithm-1.0.1.jar -p /your-application.conf-path/application.conf
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
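&lt;p&gt;Once the job finishes, the result is written under the path configured above (&lt;code&gt;/tmp&lt;/code&gt;). As a rough sketch, assuming the result is saved as CSV with a header and a score column named &lt;code&gt;pagerank&lt;/code&gt; (the exact file layout and column names may differ between nebula-algorithm versions, so check the output files first), you could inspect the top-ranked vertices in &lt;code&gt;spark-shell&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// spark-shell sketch; the column name "pagerank" is an assumption,
// check the actual output files for the exact schema.
import org.apache.spark.sql.functions.desc

val result = spark.read.option("header", "true").csv("/tmp")

// Show the 10 vertices with the highest PageRank score.
result.orderBy(desc("pagerank")).show(10)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;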



&lt;p&gt;If you are interested, you are welcome to give nebula-algorithm a try.&lt;/p&gt;

&lt;h3&gt;
  
  
  References
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;NebulaGraph: &lt;a href="https://github.com/vesoft-inc/nebula"&gt;https://github.com/vesoft-inc/nebula&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;GraphX: &lt;a href="https://github.com/apache/spark/tree/master/graphx"&gt;https://github.com/apache/spark/tree/master/graphx&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;nebula-algorithm: &lt;a href="https://github.com/vesoft-inc/nebula-java/tree/master/tools/nebula-algorithm"&gt;https://github.com/vesoft-inc/nebula-java/tree/master/tools/nebula-algorithm&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  You Might Also Like
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://nebula-graph.io/posts/stock-interrelation-analysis-nebula-graph-machine-learning/"&gt;Exploring the S&amp;amp;P 100 Index Stocks Using Graph Machine Learning&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://nebula-graph.io/posts/game-of-thrones-relationship-networkx-gephi-nebula-graph-part-two/"&gt;Analyzing Relationships in Game of Thrones With NetworkX, Gephi, and NebulaGraph (Part Two)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Like what we do? Star us on GitHub.&lt;/em&gt; &lt;a href="https://github.com/vesoft-inc/nebula"&gt;&lt;em&gt;https://github.com/vesoft-inc/nebula&lt;/em&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;em&gt;Originally published at&lt;/em&gt; &lt;a href="https://nebula-graph.io/posts/practice-graphx-nebula-graph-algorithm/"&gt;&lt;em&gt;https://nebula-graph.io&lt;/em&gt;&lt;/a&gt; &lt;em&gt;on November 19, 2020.&lt;/em&gt;&lt;/p&gt;




</description>
      <category>nebulagraph</category>
      <category>pagerank</category>
      <category>graphx</category>
      <category>graphdatabase</category>
    </item>
  </channel>
</rss>
