<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Antonio Filipovic</title>
    <description>The latest articles on DEV Community by Antonio Filipovic (@antoniofilipovic).</description>
    <link>https://dev.to/antoniofilipovic</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F985447%2F24d35c2b-3ea9-42d1-aee4-f65f300d45ca.jpeg</url>
      <title>DEV Community: Antonio Filipovic</title>
      <link>https://dev.to/antoniofilipovic</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/antoniofilipovic"/>
    <language>en</language>
    <item>
      <title>Link Prediction With node2vec in Physics Collaboration Network</title>
      <dc:creator>Antonio Filipovic</dc:creator>
      <pubDate>Fri, 16 Jun 2023 14:14:42 +0000</pubDate>
      <link>https://dev.to/memgraph/link-prediction-with-node2vec-in-physics-collaboration-network-19dh</link>
      <guid>https://dev.to/memgraph/link-prediction-with-node2vec-in-physics-collaboration-network-19dh</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;After you have successfully created a &lt;strong&gt;&lt;a href="https://memgraph.com/blog/online-node2vec-recommendation-system"&gt;dynamic recommendation system&lt;/a&gt;&lt;/strong&gt;, this time &lt;strong&gt;MAGE&lt;/strong&gt; will teach you how to generate &lt;strong&gt;&lt;a href="https://en.wikipedia.org/wiki/Link_prediction"&gt;link predictions&lt;/a&gt;&lt;/strong&gt; using a new spell called &lt;strong&gt;node2vec&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;If you don't know what &lt;strong&gt;node2vec&lt;/strong&gt; is or what &lt;strong&gt;node embeddings&lt;/strong&gt; are, we've got you covered with &lt;strong&gt;two&lt;/strong&gt; blog posts for a deeper understanding:&lt;/p&gt;

&lt;p&gt;1) &lt;a href="https://memgraph.com/blog/introduction-to-node-embedding"&gt;&lt;em&gt;Introduction to node embedding&lt;/em&gt;&lt;/a&gt; - In this article, you can check out what node embeddings are, where we use them, why we use them, and how we can get embeddings from a graph.&lt;br&gt;
2) &lt;a href="https://memgraph.com/blog/how-node2vec-works"&gt;&lt;em&gt;How node2vec works&lt;/em&gt;&lt;/a&gt; - After the first blog post, you should have an idea of how &lt;strong&gt;node2vec&lt;/strong&gt; works. But if you want to fully understand the algorithm and its benefits, and see how it works on a few examples, take a look at this node2vec blog post, which covers everything mentioned.&lt;/p&gt;

&lt;p&gt;As already mentioned, &lt;strong&gt;link prediction&lt;/strong&gt; refers to the task of predicting missing links or links that are likely to occur in the future. In this tutorial, we will make use of the &lt;strong&gt;MAGE&lt;/strong&gt; spell called &lt;a href="https://memgraph.com/docs/mage/query-modules/python/node2vec"&gt;&lt;strong&gt;node2vec&lt;/strong&gt;&lt;/a&gt;. We will also use &lt;strong&gt;Memgraph&lt;/strong&gt; to store the data and &lt;a href="https://github.com/memgraph/gqlalchemy"&gt;&lt;strong&gt;gqlalchemy&lt;/strong&gt;&lt;/a&gt; to connect to it from a Python application. The dataset will be similar to the one used in this paper: &lt;strong&gt;&lt;a href="https://arxiv.org/pdf/1705.02801.pdf"&gt;Graph Embedding Techniques, Applications, and Performance: A Survey&lt;/a&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Don't worry, you are in safe hands: &lt;strong&gt;MAGE&lt;/strong&gt; will guide you through dataset parsing, the creation of the queries used to import the data into Memgraph, &lt;strong&gt;embedding calculation&lt;/strong&gt; with the &lt;strong&gt;node2vec&lt;/strong&gt; algorithm in MAGE, and the metrics report.&lt;/p&gt;

&lt;p&gt;Now let's get to the fun part.&lt;/p&gt;
&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;For this to work, you will need:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The &lt;strong&gt;&lt;a href="https://memgraph.com/docs/mage/installation"&gt;MAGE graph library&lt;/a&gt;&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://memgraph.com/product/lab"&gt;Memgraph Lab&lt;/a&gt;&lt;/strong&gt; - the graph explorer for querying Memgraph and visualizing graphs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/memgraph/gqlalchemy"&gt;gqlalchemy&lt;/a&gt;&lt;/strong&gt; - a Python driver and object graph mapper (OGM)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;You can also try out &lt;strong&gt;MAGE&lt;/strong&gt; on &lt;a href="https://playground.memgraph.com/lesson/game-of-thrones-deaths-introductions-3-mage?step=intro"&gt;Memgraph Playground&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  Contents
&lt;/h2&gt;

&lt;p&gt;This is how we will set up our tutorial:&lt;br&gt;
1) Dataset and query import&lt;br&gt;
2) Splitting edges into test and train sets&lt;br&gt;
3) Run node2vec on the train set to generate node embeddings&lt;br&gt;
4) Get potential edges from embeddings&lt;br&gt;
5) Rank potential edges to get top K predictions&lt;br&gt;
6) Compare predicted edges with the test set&lt;/p&gt;
&lt;h2&gt;
  
  
  1. Dataset and query import
&lt;/h2&gt;

&lt;p&gt;We will work on the &lt;a href="http://snap.stanford.edu/data/ca-HepPh.html"&gt;High Energy Physics Collaboration Network&lt;/a&gt;. The dataset contains &lt;strong&gt;12008&lt;/strong&gt; nodes and &lt;strong&gt;118521&lt;/strong&gt; edges. &lt;strong&gt;MAGE&lt;/strong&gt; has prepared a script that will help you parse the dataset and import it into &lt;strong&gt;Memgraph&lt;/strong&gt;. &lt;/p&gt;

&lt;p&gt;After you have downloaded the dataset from the link above, you should see the following contents:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Directed graph (each unordered pair of nodes is saved once): CA-HepPh.txt 
# Collaboration network of Arxiv High Energy Physics category (there is an edge if authors co-authored at least one paper)
# Nodes: 12008 Edges: 237010
# FromNodeId    ToNodeId
17010   1943
17010   2489
17010   3426
17010   4049
17010   16961
17010   17897
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The dataset description says it's a directed graph containing 237010 edges, while earlier we said it contains 118521 edges. Actually, both are true; it depends on your point of view. &lt;/p&gt;

&lt;p&gt;The graph in question is &lt;strong&gt;directed&lt;/strong&gt;, but it contains edges in both directions: from node &lt;em&gt;u&lt;/em&gt; to node &lt;em&gt;v&lt;/em&gt; and from node &lt;em&gt;v&lt;/em&gt; to node &lt;em&gt;u&lt;/em&gt;, i.e., &lt;em&gt;u⟶v&lt;/em&gt; and &lt;em&gt;u⟵v&lt;/em&gt;. The direction means that author &lt;em&gt;u&lt;/em&gt; co-authored at least one paper with author &lt;em&gt;v&lt;/em&gt;. Since co-authoring goes both ways, we can act as if the graph is undirected with only one edge, &lt;em&gt;u - v&lt;/em&gt;. The script below will create exactly 118521 undirected edges. So all is good. Phew. &lt;/p&gt;

&lt;p&gt;We will import these &lt;strong&gt;118521&lt;/strong&gt; edges and act as if they are undirected. &lt;a href="https://github.com/memgraph/mage/blob/main/python/node2vec.py"&gt;The &lt;strong&gt;Node2Vec&lt;/strong&gt; algorithm in MAGE&lt;/a&gt; accepts a parameter that determines whether to treat the graph stored in &lt;strong&gt;Memgraph&lt;/strong&gt; as directed or undirected. &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Note: &lt;strong&gt;Memgraph&lt;/strong&gt; only stores directed edges, but the &lt;strong&gt;Node2Vec&lt;/strong&gt; algorithm saves the day for us in this case.&lt;/p&gt;
&lt;/blockquote&gt;
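&lt;p&gt;To make the "undirected" treatment concrete, here is a minimal Python sketch (an illustration only, not MAGE's actual implementation) of what treating stored directed edges as undirected amounts to: every stored edge becomes walkable in both directions.&lt;/p&gt;

```python
from collections import defaultdict

def undirected_adjacency(edges):
    """Build an adjacency list that ignores edge direction:
    every stored edge (u, v) is walkable both ways."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)  # the stored direction u -> v
        adj[v].add(u)  # and the reverse direction v -> u
    return adj

adj = undirected_adjacency([(17010, 1943), (17010, 2489)])
print(adj[1943])  # 1943 reaches 17010 even though only 17010 -> 1943 is stored
```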

&lt;p&gt;Here is the function that parses the edges from the file. It returns a &lt;code&gt;List&lt;/code&gt; of &lt;code&gt;int Tuples&lt;/code&gt;, each representing an &lt;strong&gt;undirected&lt;/strong&gt; edge.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;FILENAME&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"CA-HepPh.txt"&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;parse_edges_dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;FILENAME&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Tuple&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;]]:&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nb"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;lines&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;readlines&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;edges&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Tuple&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;lines&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;startswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"#"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;continue&lt;/span&gt;
        &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;line_parts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\t&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;edge&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;line_parts&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;line_parts&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;edge&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;edge&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;edges&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;continue&lt;/span&gt;
        &lt;span class="n"&gt;edges&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;edge&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;edges&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keys&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We need to create &lt;a href="https://memgraph.com/docs/cypher-manual/"&gt;Cypher&lt;/a&gt; queries from the given &lt;strong&gt;undirected&lt;/strong&gt; edges. If you don't know anything about Cypher, here is a short &lt;a href="https://memgraph.com/docs/cypher-manual/reading-existing-data"&gt;getting started&lt;/a&gt; guide. You can also learn a lot about graph algorithms and Cypher queries on &lt;a href="https://playground.memgraph.com/"&gt;Memgraph Playground&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;We need to &lt;strong&gt;create queries&lt;/strong&gt; from the edges so that we can run each query and import the data into &lt;strong&gt;Memgraph&lt;/strong&gt;. Let's use the &lt;code&gt;MERGE&lt;/code&gt; clause, which ensures that the pattern we are looking for exists only once in the database after the query runs. That means that if the pattern (a node or an edge) is not found, it will be created.&lt;/p&gt;
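&lt;p&gt;For example, substituting a single edge into a &lt;code&gt;MERGE&lt;/code&gt;-based template produces a query like the one below (a simplified sketch with a single label parameter; the full script parameterizes both endpoints separately):&lt;/p&gt;

```python
from string import Template

# Simplified sketch of the per-edge query template (one $node_name
# parameter here; the actual script uses one per endpoint).
edge_template = Template(
    'MERGE (a:$node_name {id: $id_a}) '
    'MERGE (b:$node_name {id: $id_b}) '
    'CREATE (a)-[:$edge_name]->(b);'
)

query = edge_template.substitute(
    node_name="Collaborator", edge_name="COLLABORATED_WITH",
    id_a=17010, id_b=1943,
)
print(query)
```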

&lt;p&gt;Now, let's create the queries:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;NODE_NAME&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"Collaborator"&lt;/span&gt;
&lt;span class="n"&gt;EDGE_NAME&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"COLLABORATED_WITH"&lt;/span&gt;

&lt;span class="n"&gt;edge_template&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Template&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s"&gt;'MERGE (a:$node_name_a {id: $id_a}) MERGE (b:$node_name_b {id: $id_b}) CREATE (a)-[:$edge_name]-&amp;gt;(b);'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;



&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;create_queries&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;edges&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Tuple&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;]]):&lt;/span&gt;
    &lt;span class="n"&gt;queries&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"CREATE INDEX ON :{node_name}(id);"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;node_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;NODE_NAME&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;source&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;edges&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;queries&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;edge_template&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;substitute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id_a&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;source&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                &lt;span class="n"&gt;id_b&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                &lt;span class="n"&gt;node_name_a&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;NODE_NAME&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                &lt;span class="n"&gt;node_name_b&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;NODE_NAME&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                &lt;span class="n"&gt;edge_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;EDGE_NAME&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;queries&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;edges&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;parse_edges_dataset&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;queries&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;create_queries&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;edges&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="nb"&gt;file&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;OUTPUT_FILE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'w'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;queries&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;


&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s"&gt;'__main__'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;create_queries()&lt;/code&gt; function returns a list of strings, each representing a query we can run against our database.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: you can import datasets through one of the querying tools. We have developed our drivers using the &lt;a href="https://memgraph.com/blog/memgraph-1-2-release-implementing-the-bolt-protocol-v4"&gt;Bolt protocol&lt;/a&gt; to deliver better performance. You can use &lt;a href="https://memgraph.com/product/lab"&gt;&lt;strong&gt;Memgraph Lab&lt;/strong&gt;&lt;/a&gt;, &lt;a href="https://github.com/memgraph/mgconsole"&gt;&lt;strong&gt;mgconsole&lt;/strong&gt;&lt;/a&gt; or even one of our drivers, like the &lt;strong&gt;Python driver&lt;/strong&gt; used in this tutorial.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;We recommend you use &lt;a href="https://memgraph.com/product/lab"&gt;&lt;strong&gt;Memgraph Lab&lt;/strong&gt;&lt;/a&gt; due to the simple visualization, ease of use, export and import features, and memory usage. But here, we will use a Python driver in the form of &lt;strong&gt;gqlalchemy&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Ic6w8Bni--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://public-assets.memgraph.com/link-prediction-with-node2vec-in-physics-collaboration-network/memgraph-tutorial-memgraph-lab-interface.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Ic6w8Bni--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://public-assets.memgraph.com/link-prediction-with-node2vec-in-physics-collaboration-network/memgraph-tutorial-memgraph-lab-interface.png" alt="memgraph-tutorial-memgraph-lab-interface" width="800" height="510"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;center&gt; Image 1. &lt;a href="https://memgraph.com/product/lab"&gt;Memgraph Lab&lt;/a&gt; interface &lt;/center&gt;



&lt;h2&gt;
  
  
  2. Splitting edges into test and train sets
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Theory
&lt;/h3&gt;

&lt;p&gt;First, we need to split our edges into a testing (&lt;strong&gt;test&lt;/strong&gt;) and training (&lt;strong&gt;train&lt;/strong&gt;) set. Let's explain why.&lt;/p&gt;

&lt;p&gt;Our goal is to perform &lt;strong&gt;link prediction&lt;/strong&gt;, which means we need to correctly predict new edges that might appear based on the existing ones. Since this is a static dataset, no new edges will actually appear. To test the algorithm, we remove a portion of the existing edges and make predictions based on the remaining ones. A correct prediction recreates the edges we removed; in the best-case scenario, we would recover the original dataset.&lt;/p&gt;

&lt;p&gt;We will randomly &lt;strong&gt;remove 20%&lt;/strong&gt; of the edges. These will represent our &lt;strong&gt;test set&lt;/strong&gt;. We will leave &lt;strong&gt;all&lt;/strong&gt; the nodes in the graph, even though some of them may end up completely disconnected. Next, we will run &lt;strong&gt;node2vec&lt;/strong&gt; on the remaining edges (&lt;strong&gt;80%&lt;/strong&gt; of them; in our case, roughly 94000 edges) to get the node embeddings. We will use these node embeddings to &lt;strong&gt;predict new edges&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;You can imagine this case as a &lt;strong&gt;Twitter web&lt;/strong&gt;, where new connections (follows) appear every second, and we want to be able to predict new connections from connections we already have. &lt;/p&gt;

&lt;p&gt;How exactly we will predict which edges will appear is still left to explain, but we hope that you understand the &lt;strong&gt;WHY&lt;/strong&gt; part of removing 20% of the edges. 🤞&lt;/p&gt;
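&lt;p&gt;The split itself is simple. The tutorial uses scikit-learn's &lt;code&gt;train_test_split&lt;/code&gt; for it; a dependency-free sketch of the same idea looks like this:&lt;/p&gt;

```python
import random

def split_edges(edges, test_size=0.2, seed=0):
    """Shuffle the edges and hold out test_size of them,
    mirroring what scikit-learn's train_test_split does."""
    rng = random.Random(seed)
    shuffled = list(edges)      # keep the original list intact
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_size))
    return shuffled[:cut], shuffled[cut:]

train, test = split_edges([(i, i + 1) for i in range(100)])
print(len(train), len(test))  # 80 20
```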

&lt;h3&gt;
  
  
  Practical
&lt;/h3&gt;

&lt;p&gt;First, we need a connection to Memgraph so we can fetch the edges and split them into two parts (a train set and a test set). For the edge splitting, we will use &lt;a href="https://scikit-learn.org/"&gt;&lt;strong&gt;scikit-learn&lt;/strong&gt;&lt;/a&gt;, and to connect to &lt;strong&gt;Memgraph&lt;/strong&gt; we will use &lt;a href="https://github.com/memgraph/gqlalchemy"&gt;gqlalchemy&lt;/a&gt;. &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;From &lt;strong&gt;&lt;a href="https://github.com/memgraph/gqlalchemy"&gt;GitHub description of gqlalchemy&lt;/a&gt;&lt;/strong&gt;:&lt;br&gt;
"GQLAlchemy is a library developed to assist in writing and running queries on Memgraph. GQLAlchemy supports high-level connection to Memgraph as well as modular query builder."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;After we create a connection to Memgraph, we will use the two functions below to run queries. A query can be anything from fetching edges and removing edges to running the &lt;strong&gt;node2vec&lt;/strong&gt; procedure.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;memgraph&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;gqlalchemy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Memgraph&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"127.0.0.1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;7687&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;call_a_query_and_fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Iterator&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]]:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;memgraph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;execute_and_fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;call_a_query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;memgraph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Okay, so to get the edges we need to run a query. With the connection in place, we will fetch the edges, split them into two sets, and then create queries (&lt;strong&gt;plural&lt;/strong&gt;) to remove each test-set edge from the graph.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;edge_remove_template&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Template&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s"&gt;'MATCH (a:$node_a_name{id: $node_a_id})-[edge]-(b:$node_b_name{id: $node_b_id}) DELETE edge;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_all_edges&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Tuple&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;gqlalchemy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Node&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;gqlalchemy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Node&lt;/span&gt;&lt;span class="p"&gt;]]:&lt;/span&gt;
    &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Match&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; \
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dataset_parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NODE_NAME&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;variable&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"node_a"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;to&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dataset_parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;EDGE_NAME&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;variable&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"edge"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dataset_parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NODE_NAME&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;variable&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"node_b"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"node_a"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"node_b"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;remove_edges&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;edges&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Tuple&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;gqlalchemy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Node&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;gqlalchemy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Node&lt;/span&gt;&lt;span class="p"&gt;]])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;queries&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;edge_remove_template&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;substitute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;node_a_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;dataset_parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NODE_NAME&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                               &lt;span class="n"&gt;node_a_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;edge&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;properties&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                                               &lt;span class="n"&gt;node_b_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;dataset_parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NODE_NAME&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                               &lt;span class="n"&gt;node_b_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;edge&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;properties&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;edge&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;edges&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;queries&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;call_a_query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;split_edges_train_test&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;edges&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Tuple&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;gqlalchemy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Node&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;gqlalchemy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Node&lt;/span&gt;&lt;span class="p"&gt;]],&lt;/span&gt; &lt;span class="n"&gt;test_size&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Tuple&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;gqlalchemy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Node&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;gqlalchemy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Node&lt;/span&gt;&lt;span class="p"&gt;]],&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Tuple&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;gqlalchemy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Node&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;gqlalchemy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Node&lt;/span&gt;&lt;span class="p"&gt;]]):&lt;/span&gt;
    &lt;span class="n"&gt;edges_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;edges_test&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;train_test_split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;edges&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;test_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;test_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()))&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;edges_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;edges_test&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will be the "main" part of our program. There are a few things worth noticing here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;When getting all edges with a query, instead of an edge object we get two nodes (&lt;code&gt;gqlalchemy.Node&lt;/code&gt; objects): one represents the head and the other the tail of the edge, but we will treat the graph as &lt;strong&gt;undirected&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;split_edges_train_test()&lt;/code&gt; function accepts these edges and splits them into a train and test set.&lt;/li&gt;
&lt;li&gt;We received node objects, but it is easier to work with the &lt;code&gt;id&lt;/code&gt; property of each node, so we map our list of edges to a list of integer tuples, where each pair represents one undirected &lt;strong&gt;edge&lt;/strong&gt;.
&lt;/li&gt;
&lt;/ul&gt;
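The id-tuple mapping from the last bullet can be sketched like this (a minimal sketch that uses a hypothetical `FakeNode` stand-in for `gqlalchemy.Node`, since no live Memgraph connection is assumed here):

```python
from typing import List, Tuple


class FakeNode:
    """Minimal stand-in for gqlalchemy.Node, exposing only `properties`."""

    def __init__(self, node_id: int):
        self.properties = {"id": node_id}


def edges_to_id_pairs(edges: List[Tuple[FakeNode, FakeNode]]) -> List[Tuple[int, int]]:
    # Keep only the `id` property of each endpoint; one pair = one undirected edge.
    return [(a.properties["id"], b.properties["id"]) for a, b in edges]


edges = [(FakeNode(1), FakeNode(2)), (FakeNode(2), FakeNode(3))]
print(edges_to_id_pairs(edges))  # [(1, 2), (2, 3)]
```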

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Getting all edges..."&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;edges&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;get_all_edges&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Current number of edges is {}"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;edges&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;

    &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Splitting edges in train, test group..."&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;edges_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;edges_test&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;split_edges_train_test&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;edges&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;edges&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;test_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Splitting edges done."&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Removing edges from graph."&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;remove_edges&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;edges_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Edges removed."&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;train_edges_dict&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{(&lt;/span&gt;&lt;span class="n"&gt;node_from&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;properties&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;node_to&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;properties&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;]):&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;node_from&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;node_to&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;edges_train&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;test_edges_dict&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{(&lt;/span&gt;&lt;span class="n"&gt;node_from&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;properties&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;node_to&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;properties&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;]):&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;node_from&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;node_to&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;edges_test&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  3. Run node2vec on the train set to generate node embeddings
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Theory
&lt;/h3&gt;

&lt;p&gt;After we have removed the test edges, we need to run the &lt;strong&gt;node2vec&lt;/strong&gt; algorithm. Node embeddings will be calculated from the train set of edges only. &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Repeat&lt;/strong&gt;: we will get embeddings for every node, but to compute them we will only use a certain amount of edges (80%) from the original graph. If a new node were to appear in the graph, we couldn't predict anything for it, since we don't know it exists yet. We can only make predictions for the nodes already in the graph. &lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Practical
&lt;/h3&gt;

&lt;p&gt;Here we will call the &lt;strong&gt;node2vec &lt;a href="https://memgraph.com/docs/memgraph/reference-guide/query-modules"&gt;query module&lt;/a&gt;&lt;/strong&gt; to calculate node embeddings. There is a procedure called &lt;code&gt;set_embeddings()&lt;/code&gt; in the &lt;code&gt;node2vec&lt;/code&gt; module, which we will use to store the embeddings in the graph as node properties. That way the embeddings are persisted together with the rest of the graph data, so we can recover them even after a restart, although &lt;strong&gt;Memgraph&lt;/strong&gt; is primarily an in-memory database.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Node2Vec&lt;/strong&gt; has some crucial hyperparameters like &lt;code&gt;num_walks&lt;/code&gt; and &lt;code&gt;walk_length&lt;/code&gt;. Setting them to higher values makes the algorithm run longer, but should yield &lt;strong&gt;better predictions&lt;/strong&gt;, as long as the embeddings don't overfit the data we have.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--gkFvPe0y--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://public-assets.memgraph.com/link-prediction-with-node2vec-in-physics-collaboration-network/memgraph-tutorial-hyperparameters.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--gkFvPe0y--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://public-assets.memgraph.com/link-prediction-with-node2vec-in-physics-collaboration-network/memgraph-tutorial-hyperparameters.jpg" alt="memgraph-tutorial-hyperparameters" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;center&gt; Image 2. The algorithm's results are dependent on how we set our &lt;a href="https://en.wikipedia.org/wiki/Hyperparameter_(machine_learning)"&gt;hyperparameters&lt;/a&gt;
&lt;/center&gt;



&lt;p&gt;Another problem we need to handle is setting proper &lt;code&gt;p&lt;/code&gt; and &lt;code&gt;q&lt;/code&gt; parameters. Since we are dealing with a collaboration network, we will try to predict connections inside natural clusters. We can obtain such clusters by sampling walks in a more DFS-like manner. If these terms sound confusing, we suggest checking out the blog post on &lt;strong&gt;&lt;a href="https://memgraph.com/blog/how-node2vec-works"&gt;node2vec&lt;/a&gt;&lt;/strong&gt; where we have explained them. 💪 &lt;/p&gt;

&lt;p&gt;If we were to set the &lt;strong&gt;node2vec&lt;/strong&gt; params in a more &lt;strong&gt;BFS&lt;/strong&gt;-like manner, so that hyperparameter &lt;code&gt;p&lt;/code&gt; is smaller than hyperparameter &lt;code&gt;q&lt;/code&gt;, we would be looking for &lt;strong&gt;hubs&lt;/strong&gt;, which isn't our intention.&lt;br&gt;
&lt;/p&gt;
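To make the `p`/`q` intuition concrete, here is a small sketch of the unnormalized transition bias from the node2vec paper (a hypothetical `transition_weight` helper, not part of the tutorial's code). With the tutorial's values, p = 1 and q = 1/256, a move to a node at distance 2 from the previous node gets weight 256, which is exactly what makes the walks DFS-like:

```python
def transition_weight(prev, cur, nxt, graph, p, q):
    """Unnormalized node2vec bias for stepping from `cur` to `nxt`,
    given that the walk arrived at `cur` from `prev`.
    `graph` is an adjacency dict mapping each node to a set of neighbors."""
    if nxt == prev:
        return 1.0 / p  # return parameter: revisit the previous node
    if nxt in graph[prev]:
        return 1.0      # nxt is at distance 1 from prev
    return 1.0 / q      # distance 2 from prev: small q makes this DFS-like


graph = {0: {1}, 1: {0, 2, 3}, 2: {1, 3}, 3: {1, 2}}
print(transition_weight(0, 1, 0, graph, p=1, q=1 / 256))  # 1.0 (back to prev)
print(transition_weight(0, 1, 2, graph, p=1, q=1 / 256))  # 256.0 (move outward)
```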

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# NODE2VEC PARAMS
&lt;/span&gt;&lt;span class="n"&gt;is_directed&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;
&lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;  &lt;span class="c1"&gt;# return parameter
&lt;/span&gt;&lt;span class="n"&gt;q&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;256&lt;/span&gt;  &lt;span class="c1"&gt;# in-out parameter
&lt;/span&gt;&lt;span class="n"&gt;num_walks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;
&lt;span class="n"&gt;walk_length&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;80&lt;/span&gt;
&lt;span class="n"&gt;vector_size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;
&lt;span class="n"&gt;alpha&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.02&lt;/span&gt;
&lt;span class="n"&gt;window&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;
&lt;span class="n"&gt;min_count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="n"&gt;seed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="n"&gt;workers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;
&lt;span class="n"&gt;min_alpha&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.0001&lt;/span&gt;
&lt;span class="n"&gt;sg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="n"&gt;hs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;span class="n"&gt;negative&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;
&lt;span class="n"&gt;epochs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;set_node_embeddings&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;call_a_query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"""CALL node2vec.set_embeddings({is_directed},{p}, {q}, {num_walks}, {walk_length}, {vector_size}, 
    {alpha}, {window}, {min_count}, {seed}, {workers}, {min_alpha}, {sg}, {hs}, {negative}) YIELD *"""&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;is_directed&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;is_directed&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num_walks&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;num_walks&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;walk_length&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;walk_length&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vector_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;vector_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;window&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;window&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;min_count&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;min_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;seed&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;seed&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;workers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;workers&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;min_alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;min_alpha&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sg&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;sg&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;hs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;hs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;negative&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;negative&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_embeddings_as_properties&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;embeddings&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
    &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Match&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; \
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dataset_parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NODE_NAME&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;variable&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"node"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;gqlalchemy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Node&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"node"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="s"&gt;"embedding"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;properties&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;continue&lt;/span&gt;
        &lt;span class="n"&gt;embeddings&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;properties&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;properties&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"embedding"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;embeddings&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And this is our main part. After the &lt;strong&gt;node2vec query module&lt;/strong&gt; finishes its calculations, we can get those embeddings directly from the graph, which is awesome.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;test_edges_dict&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{(&lt;/span&gt;&lt;span class="n"&gt;edge&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;properties&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;edge&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;properties&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;]):&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;edge&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;edges_test&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;# Calculate and get node embeddings
&lt;/span&gt;    &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Setting node embeddings as graph property..."&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;set_node_embeddings&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Embedding for every node set."&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;node_emeddings&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;get_embeddings_as_properties&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  4. Get potential edges from embeddings
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Theory
&lt;/h3&gt;

&lt;p&gt;And now to the most important section ⟶ &lt;strong&gt;edge prediction&lt;/strong&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;How do we predict edges exactly? What is the idea behind it?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;We expect nodes that have similar embeddings and still don't have an edge between them to form a new edge in the future. It's as simple as that. &lt;/p&gt;

&lt;p&gt;We just need to find a good measure to be able to check whether two nodes have similar embeddings. One such measure is &lt;strong&gt;cosine similarity&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--CXrUvWxO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://public-assets.memgraph.com/link-prediction-with-node2vec-in-physics-collaboration-network/memgraph-tutorial-cosine-similarity.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--CXrUvWxO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://public-assets.memgraph.com/link-prediction-with-node2vec-in-physics-collaboration-network/memgraph-tutorial-cosine-similarity.png" alt="memgraph-tutorial-cosine-similarity" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;center&gt; Image 3. Cosine similarity between two vectors A and B&lt;/center&gt;



&lt;p&gt;Image 3 above contains an explanation of cosine similarity, the measure that will tell us how similar two vectors are. It is essentially the &lt;strong&gt;cosine of the angle&lt;/strong&gt; between the two vectors. Notice that node embeddings are also vectors in &lt;strong&gt;multi-dimensional space&lt;/strong&gt;.&lt;/p&gt;
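As a quick sanity check, the formula from Image 3 can be computed with NumPy in a few lines (a hypothetical `cosine_similarity` helper, shown only to illustrate the measure):

```python
import numpy as np


def cosine_similarity(a, b) -> float:
    # cos(angle) = dot(A, B) / (norm(A) * norm(B))
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal vectors)
print(round(cosine_similarity([1.0, 2.0], [2.0, 4.0]), 6))  # 1.0 (same direction)
```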

&lt;h3&gt;
  
  
  Practical
&lt;/h3&gt;

&lt;p&gt;So for every pair of node embeddings, we will calculate the cosine similarity to check how similar the two embeddings are. The problem is that with 12,000 nodes there are around 72 million pairs (72,000,000), which means that an average computer with 16GB of RAM would die at some point (open up a Chrome tab if you dare). To fix that, we will only hold a maximum of 2 million pairs in memory at any given point in time, and periodically run a sorting step to keep only the top &lt;em&gt;K&lt;/em&gt; pairs. &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;What would be this top K number?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;We will answer this question shortly and it's related to the &lt;strong&gt;&lt;em&gt;precision@K&lt;/em&gt;&lt;/strong&gt; measurement method.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;calculate_adjacency_matrix&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embeddings&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;]],&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Tuple&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_edge_weight&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embeddings&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;embeddings&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

    &lt;span class="n"&gt;nodes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embeddings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keys&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="n"&gt;nodes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nodes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;adj_mtx_r&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
    &lt;span class="n"&gt;cnt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;pair&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;itertools&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;combinations&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nodes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;cnt&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="mi"&gt;1000000&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;adj_mtx_r&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;adj_mtx_r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])}&lt;/span&gt;
            &lt;span class="n"&gt;adj_mtx_r&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;adj_mtx_r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;adj_mtx_r&lt;/span&gt;&lt;span class="p"&gt;)[:&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;PRECISION_AT_K_CONST&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;
            &lt;span class="n"&gt;gc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;collect&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;cnt&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="mi"&gt;10000&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cnt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;weight&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;get_edge_weight&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pair&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;pair&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;weight&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;continue&lt;/span&gt;
        &lt;span class="n"&gt;cnt&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
        &lt;span class="n"&gt;adj_mtx_r&lt;/span&gt;&lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="n"&gt;pair&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;pair&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;get_edge_weight&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pair&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;pair&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;adj_mtx_r&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  5. Rank potential edges to get top K predictions
&lt;/h2&gt;

&lt;p&gt;To calculate the accuracy of our implementation, we will use a well-known evaluation metric called &lt;a href="https://stackoverflow.com/questions/55748792/understanding-precisionk-apk-mapk"&gt;&lt;strong&gt;&lt;em&gt;precision@K&lt;/em&gt;&lt;/strong&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Some nodes (or rather, their embeddings) will be more similar than others, meaning their &lt;strong&gt;cosine similarity&lt;/strong&gt; will be larger. Now, say your manager arrives and asks for the top 10 predictions. Would you hand over the pairs with lower or higher similarities? The best ones, of course.&lt;/p&gt;
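&lt;p&gt;As a small illustration of ranking by similarity, here is a minimal, self-contained sketch. The embeddings below are toy values, not real &lt;strong&gt;node2vec&lt;/strong&gt; output:&lt;/p&gt;

```python
import math

# Toy embeddings for three hypothetical nodes; real ones come from node2vec.
emb = {1: [0.9, 0.1], 2: [0.8, 0.2], 3: [-0.5, 0.9]}

def cosine_similarity(a, b):
    # dot(a, b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Rank candidate node pairs so the most similar (best) predictions come first.
pairs = [(1, 2), (1, 3), (2, 3)]
ranked = sorted(pairs, key=lambda p: -cosine_similarity(emb[p[0]], emb[p[1]]))
print(ranked[0])  # the most similar pair is ranked first
```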

&lt;p&gt;The same principle applies here. We will take the top K predictions and evaluate our model: at every position, we count how many correct guesses we have made so far and divide that number by the number of attempts so far.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--G-AQR00i--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://public-assets.memgraph.com/link-prediction-with-node2vec-in-physics-collaboration-network/memgraph-tutorial-precision-k-method.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--G-AQR00i--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://public-assets.memgraph.com/link-prediction-with-node2vec-in-physics-collaboration-network/memgraph-tutorial-precision-k-method.png" alt="memgraph-tutorial-precision-k-method" width="800" height="187"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;center&gt;Image 4. Example of &lt;b&gt;precision@K&lt;/b&gt; method &lt;/center&gt;



&lt;p&gt;This is how it would work for &lt;strong&gt;&lt;em&gt;precision@6&lt;/em&gt;&lt;/strong&gt;:&lt;br&gt;
The first position is easy: 1 correct guess / 1 attempt. At the second position we have 1 correct guess / 2 attempts, and so on.&lt;br&gt;
&lt;/p&gt;
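&lt;p&gt;The same running calculation can be sketched for a hypothetical sequence of guesses, where 1 marks a correct prediction and 0 a wrong one:&lt;/p&gt;

```python
# Hypothetical outcomes of the top 6 predictions (1 = correct, 0 = wrong).
hits = [1, 0, 1, 1, 0, 1]

precision_at_k = []
correct = 0
for k, hit in enumerate(hits, start=1):
    correct += hit
    # (number of correct guesses so far) / (number of attempts so far)
    precision_at_k.append(correct / k)

print(precision_at_k)  # approximately [1.0, 0.5, 0.67, 0.75, 0.6, 0.67]
```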

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;compute_precision_at_k&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;predicted_edges&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Tuple&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                           &lt;span class="n"&gt;test_edges&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Tuple&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;max_k&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;precision_scores&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;  &lt;span class="c1"&gt;# precision at k
&lt;/span&gt;    &lt;span class="n"&gt;delta_factors&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="n"&gt;correct_edge&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="n"&gt;count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;edge&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;predicted_edges&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;max_k&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;break&lt;/span&gt;

        &lt;span class="c1"&gt;# if our guessed edge is really in graph
&lt;/span&gt;        &lt;span class="c1"&gt;# this is due representation problem: (2,1) edge in undirected graph is saved in memory as (2,1)
&lt;/span&gt;        &lt;span class="c1"&gt;# but in adj matrix it is calculated as (1,2)
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;edge&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;test_edges&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;edge&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;edge&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;test_edges&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;correct_edge&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
            &lt;span class="n"&gt;delta_factors&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;delta_factors&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;precision_scores&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;correct_edge&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;  &lt;span class="c1"&gt;# (number of correct guesses) / (number of attempts)
&lt;/span&gt;        &lt;span class="n"&gt;count&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;precision_scores&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;delta_factors&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here is the main part of the code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;    &lt;span class="c1"&gt;# Calculate adjacency matrix
&lt;/span&gt;    &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Calculating adjacency matrix from embeddings."&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;adj_matrix&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;calculate_adjacency_matrix&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embeddings&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;node_emeddings&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Adjacency matrix calculated"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# print(adj_matrix)
&lt;/span&gt;
    &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Getting predicted edges..."&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;predicted_edge_list&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;adj_matrix&lt;/span&gt;
    &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Predicted edge list is of length:"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;predicted_edge_list&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Sorting predicted edge list"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# We need to sort predicted edges so that ones that are most likely to appear are first in list
&lt;/span&gt;    &lt;span class="n"&gt;sorted_predicted_edges&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;predicted_edge_list&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])}&lt;/span&gt;
    &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Predicted edges sorted..."&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Filtering predicted edges that are not in train list..."&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# taking only edges that we are predicting to appear, not ones that are already in the graph
&lt;/span&gt;    &lt;span class="n"&gt;sorted_predicted_edges&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;sorted_predicted_edges&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;train_edges_dict&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="c1"&gt;# print(sorted_predicted_edges)
&lt;/span&gt;
    &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Calculating precision@k..."&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;precision_scores&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;delta_factors&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;compute_precision_at_k&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;predicted_edges&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;sorted_predicted_edges&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                             &lt;span class="n"&gt;test_edges&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;test_edges_dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                             &lt;span class="n"&gt;max_k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;PRECISION_AT_K_CONST&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"precision score"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;precision_scores&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nb"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"../results.txt"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'a+'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;fh&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;fh&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;" "&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;precision&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;precision&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;precision_scores&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="n"&gt;fh&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
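&lt;p&gt;The sort-then-filter pattern used above can be shown on toy data: rank the predicted edges by score in descending order, then drop the ones already present in the training set. The dictionaries below are made up for illustration:&lt;/p&gt;

```python
# Toy predicted edges with similarity scores, and a toy training edge set.
predicted = {(1, 2): 0.9, (2, 3): 0.4, (1, 3): 0.7}
train_edges = {(1, 2): 1}

# Sort by score, descending, so the most likely edges come first.
ranked = {k: v for k, v in sorted(predicted.items(), key=lambda item: -item[1])}

# Keep only edges that are not already in the graph (training set).
candidates = {k: v for k, v in ranked.items() if k not in train_edges}

print(list(candidates))  # [(1, 3), (2, 3)]
```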



&lt;h2&gt;
  
  
  6. Compare predicted edges with the test set
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--3sTkKh5D--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://public-assets.memgraph.com/link-prediction-with-node2vec-in-physics-collaboration-network/memgraph-tutorial-precision-k-graph.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--3sTkKh5D--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://public-assets.memgraph.com/link-prediction-with-node2vec-in-physics-collaboration-network/memgraph-tutorial-precision-k-graph.png" alt="memgraph-tutorial-precision-k-graph" width="800" height="593"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;center&gt;Image 5. The graph of &lt;b&gt;precision@k&lt;/b&gt; in our example&lt;/center&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;matplotlib&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pyplot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;

&lt;span class="c1"&gt;#tribute to https://stackoverflow.com/questions/12957582/plot-yerr-xerr-as-shaded-region-rather-than-error-bars
&lt;/span&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nb"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'../results.txt'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'r'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;fh&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;lines&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;fh&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;readlines&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;


&lt;span class="n"&gt;parsed_results&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;lines&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;values&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;" "&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;parsed_list&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;parsed_results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parsed_list&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="n"&gt;stddev&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parsed_results&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;mean&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parsed_results&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parsed_results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;arange&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parsed_results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mean&lt;/span&gt;
&lt;span class="n"&gt;error&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;stddev&lt;/span&gt;


&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'k-'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fill_between&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;error&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="n"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After running our code a couple of times, we can plot the results. Since we only worked with the graph structure and didn't take any node features into account when doing link prediction, the results are good: there is room for improvement, but for the top 16 predicted edges we get a precision of around 70%. &lt;strong&gt;MAGE&lt;/strong&gt; is satisfied for the moment.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;So that's it for the real-time link prediction tutorial. We hope you learned something and that we got you even more interested in graph analytics. If you got lost at any point during the tutorial, here is a link to the &lt;strong&gt;&lt;a href="https://github.com/memgraph/link-prediction-node-embeddings"&gt;GitHub repository for link prediction with MAGE&lt;/a&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Our team of engineers is currently tackling the problem of graph analytics algorithms on &lt;strong&gt;real-time data&lt;/strong&gt;. If you want to discuss how to apply &lt;strong&gt;online/streaming algorithms&lt;/strong&gt; on connected data, feel free to join our &lt;strong&gt;&lt;a href="https://memgr.ph/join-discord"&gt;Discord server&lt;/a&gt;&lt;/strong&gt; and message us.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MAGE&lt;/strong&gt; shares his wisdom on a &lt;a href="https://twitter.com/intent/follow?screen_name=memgraphmage"&gt;&lt;strong&gt;Twitter&lt;/strong&gt; account&lt;/a&gt;. Get to know him better by following him 🐦&lt;/p&gt;

&lt;p&gt;&lt;a href="https://twitter.com/intent/follow?screen_name=memgraphmage"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--c_PN5axd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://public-assets.memgraph.com/link-prediction-with-node2vec-in-physics-collaboration-network/memgraph-tutorial-link-prediction-twitter.jpg" alt="memgraph-tutorial-link-prediction-twitter" width="800" height="207"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Last but not least, check out &lt;a href="https://github.com/memgraph/mage"&gt;&lt;strong&gt;MAGE&lt;/strong&gt;&lt;/a&gt; and don’t hesitate to give it a star ⭐ or contribute new ideas.&lt;/p&gt;

</description>
      <category>node2vec</category>
      <category>algorithms</category>
      <category>graphdatabase</category>
      <category>memgraph</category>
    </item>
    <item>
      <title>Recommendation System Using Online Node2Vec With Memgraph MAGE</title>
      <dc:creator>Antonio Filipovic</dc:creator>
      <pubDate>Thu, 01 Jun 2023 11:22:25 +0000</pubDate>
      <link>https://dev.to/memgraph/recommendation-system-using-online-node2vec-with-memgraph-mage-7fi</link>
      <guid>https://dev.to/memgraph/recommendation-system-using-online-node2vec-with-memgraph-mage-7fi</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;The online node2vec algorithm learns and updates temporal node embeddings on the fly for tracking and measuring node similarity over time in graph streams. &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Our little magician Memgraph MAGE has recently received one more spell - the &lt;a href="https://memgraph.com/docs/mage/query-modules/python/node2vec-online" rel="noopener noreferrer"&gt;&lt;strong&gt;Online Node2Vec algorithm&lt;/strong&gt;&lt;/a&gt;. Since he is still too scared to use it, you, as a brave spirit, will step up and use it on a real challenge to show MAGE how it's done. This challenge includes building an &lt;strong&gt;Online Recommendation System&lt;/strong&gt; using k-means clustering and the newborn spell - Online Node2Vec algorithm.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;To complete this tutorial, you will need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;An installation of &lt;a href="https://memgraph.com/mage" rel="noopener noreferrer"&gt;Memgraph Advanced Graph Extensions (MAGE)&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;An installation of &lt;a href="https://memgraph.com/product/lab" rel="noopener noreferrer"&gt;Memgraph Lab&lt;/a&gt;  or usage of Memgraph's command-line tool, &lt;a href="https://docs.memgraph.com/memgraph/connect-to-memgraph/methods/mgconsole/" rel="noopener noreferrer"&gt;mgconsole&lt;/a&gt;, which is installed together with Memgraph.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Note: In short, you could have one of the following setups: &lt;br&gt;
1) Memgraph installed (not the Docker version), and Memgraph MAGE built from source&lt;br&gt;
2) The Memgraph MAGE Docker image&lt;br&gt;
3) The Memgraph Docker image, but you have to additionally copy the MAGE directory inside the container, run &lt;code&gt;python build&lt;/code&gt; and copy the created &lt;code&gt;mage/dist&lt;/code&gt; to &lt;code&gt;/usr/lib/memgraph/query_modules&lt;/code&gt; so Memgraph can access it.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;To check that you have everything ready for the next steps, use the following command in &lt;code&gt;mgconsole&lt;/code&gt; or &lt;code&gt;Memgraph Lab&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cypher"&gt;&lt;code&gt;&lt;span class="k"&gt;CALL&lt;/span&gt; &lt;span class="n"&gt;node2vec_online.help&lt;/span&gt;&lt;span class="ss"&gt;()&lt;/span&gt; &lt;span class="k"&gt;YIELD&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="ss"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Graph
&lt;/h2&gt;

&lt;p&gt;You will use the spell on the &lt;a href="https://snap.stanford.edu/data/cit-HepPh.html" rel="noopener noreferrer"&gt;High-energy physics citation network&lt;/a&gt;. The already processed &lt;a href="https://download.memgraph.com/datasets/physics-citation-network/physics-citation-network.cypherl.gz" rel="noopener noreferrer"&gt;dataset&lt;/a&gt; contains 395 papers (nodes) and 1106 citations (edges). If a paper &lt;code&gt;i&lt;/code&gt; cites paper &lt;code&gt;j&lt;/code&gt;, the graph contains a directed edge from &lt;code&gt;i&lt;/code&gt; to &lt;code&gt;j&lt;/code&gt;. &lt;/p&gt;

&lt;p&gt;Below is the graph schema:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fpublic-assets.memgraph.com%2Fonline-node2vec-recommendation-system%2Fmemgraph-tutorial-schema-online-node2vec-recommendation-system.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fpublic-assets.memgraph.com%2Fonline-node2vec-recommendation-system%2Fmemgraph-tutorial-schema-online-node2vec-recommendation-system.jpg" alt="memgraph-tutorial-schema-online-node2vec-recommendation-system"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There is only one type of node in our graph schema. MAGE is happy to hear this kind of news since he believes that this way, his spell will give you the best results.&lt;/p&gt;

&lt;p&gt;Before importing the dataset, MAGE wants you to read the instructions about the spell to learn how to use it properly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Online Node2Vec
&lt;/h2&gt;

&lt;p&gt;In the &lt;a href="https://appliednetsci.springeropen.com/articles/10.1007/s41109-019-0169-5" rel="noopener noreferrer"&gt;MAGE instructions&lt;/a&gt;, there is a note that researchers have shown how the &lt;code&gt;Node2Vec Online&lt;/code&gt; spell creates similar embeddings for two nodes (e.g. &lt;code&gt;v&lt;/code&gt; and &lt;code&gt;u&lt;/code&gt;) if one can be reached from the other across recently appeared edges. In other words, the embedding of node &lt;code&gt;v&lt;/code&gt; should be more similar to the embedding of node &lt;code&gt;u&lt;/code&gt; if we can reach &lt;code&gt;u&lt;/code&gt; by taking steps backward from node &lt;code&gt;v&lt;/code&gt; across edges, each of which appeared earlier than the previous one. These backward steps from one node to the other form a temporal walk. It is temporal because it depends on when each edge appeared in the graph.&lt;/p&gt;

&lt;p&gt;To make two nodes more similar and to create these temporal walks, the &lt;code&gt;Node2Vec Online&lt;/code&gt; spell uses the &lt;code&gt;StreamWalk updater&lt;/code&gt; and &lt;code&gt;Word2Vec learner&lt;/code&gt;. &lt;/p&gt;

&lt;p&gt;&lt;code&gt;StreamWalk updater&lt;/code&gt; is a machine for sampling temporal walks. Walk sampling is done backward, because we look only at the incoming edges of a node. Since a node can have multiple incoming edges, the &lt;code&gt;StreamWalk updater&lt;/code&gt; uses probabilities to decide which incoming edge to take next, thereby leading to a new node. These probabilities are computed after an edge arrives and before temporal walk sampling. Each probability represents a sum over all temporal walks &lt;code&gt;z&lt;/code&gt; ending in node &lt;code&gt;v&lt;/code&gt; that use edges arriving no later than the latest edge already sampled in the walk. When the algorithm decides which edge to take next for temporal walk creation, it uses these precomputed weights (probabilities). Every time a new edge appears in the graph, the probabilities are updated only for the two nodes of that edge.&lt;/p&gt;
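&lt;p&gt;One backward step of this sampling can be sketched as follows. The predecessor nodes and weights below are hypothetical; the real updater maintains such weights incrementally as edges arrive:&lt;/p&gt;

```python
import random

# Hypothetical precomputed weights for the incoming edges of the current node:
# predecessor node -> weight (probability mass of temporal walks through it).
incoming = {"u1": 0.6, "u2": 0.3, "u3": 0.1}

def sample_backward_step(incoming_weights, rng=random):
    # Pick the previous node with probability proportional to its weight.
    nodes = list(incoming_weights)
    weights = [incoming_weights[n] for n in nodes]
    return rng.choices(nodes, weights=weights, k=1)[0]

prev_node = sample_backward_step(incoming)
```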

&lt;p&gt;After walk sampling, the &lt;code&gt;Word2Vec learner&lt;/code&gt; uses the prepared temporal walks to make node embeddings more similar with the &lt;code&gt;gensim Word2Vec&lt;/code&gt; module. The sampled walks are given as sentences to the &lt;code&gt;gensim Word2Vec&lt;/code&gt; module, which then optimizes the similarity of node embeddings within a walk by stochastic gradient descent, using either a skip-gram or a continuous bag-of-words (CBOW) model.&lt;/p&gt;
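&lt;p&gt;To give a feel for the skip-gram side, here is a minimal sketch of how a sampled walk can be turned into (center, context) training pairs with a context window. This illustrates the idea only; it is not the &lt;code&gt;gensim&lt;/code&gt; internals:&lt;/p&gt;

```python
# Turn one walk (a "sentence" of node IDs) into skip-gram training pairs:
# each node is paired with every neighbor inside the context window.
def skipgram_pairs(walk, window=2):
    pairs = []
    for i, center in enumerate(walk):
        lo, hi = max(0, i - window), min(len(walk), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, walk[j]))
    return pairs

walk = ["a", "b", "c", "d"]  # a hypothetical sampled temporal walk
print(skipgram_pairs(walk, window=1))
# [('a', 'b'), ('b', 'a'), ('b', 'c'), ('c', 'b'), ('c', 'd'), ('d', 'c')]
```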

&lt;blockquote&gt;
&lt;p&gt;Note: &lt;a href="https://github.com/memgraph/mage/tree/main/python/mage/node2vec_online_module" rel="noopener noreferrer"&gt;this implementation&lt;/a&gt; contains &lt;a href="https://github.com/ferencberes/online-node2vec" rel="noopener noreferrer"&gt;the code&lt;/a&gt; related to the research of Ferenc Béres, Róbert Pálovics, Domokos Miklós Kelen and András A. Benczúr.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Node2Vec Online setup
&lt;/h2&gt;

&lt;p&gt;MAGE's instructions advise using &lt;a href="https://docs.memgraph.com/memgraph/database-functionalities/triggers/" rel="noopener noreferrer"&gt;Memgraph Triggers&lt;/a&gt; for this spell to work. You can create a trigger that fires on edge creation: every time a new edge is added to the graph, the trigger fires and calls the &lt;strong&gt;Node2Vec Online&lt;/strong&gt; algorithm to update the node embeddings.&lt;/p&gt;

&lt;p&gt;To create the trigger, you can use the following command in &lt;code&gt;Memgraph Lab&lt;/code&gt; or &lt;code&gt;mgconsole&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cypher"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;TRIGGER&lt;/span&gt; &lt;span class="n"&gt;trigger&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="o"&gt;--&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;BEFORE&lt;/span&gt; &lt;span class="k"&gt;COMMIT&lt;/span&gt;
&lt;span class="n"&gt;EXECUTE&lt;/span&gt; &lt;span class="k"&gt;CALL&lt;/span&gt; &lt;span class="n"&gt;node2vec_online.update&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;createdEdges&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt; &lt;span class="k"&gt;YIELD&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="ss"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Before you start importing the dataset, MAGE's instructions carry a big reminder not to forget to set the parameters for the &lt;strong&gt;StreamWalk updater&lt;/strong&gt; and the &lt;strong&gt;Word2Vec learner&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;StreamWalk updater&lt;/strong&gt; uses the following parameters:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;half_life&lt;/code&gt;: half-life [seconds], used in the temporal walk probability calculation&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;max_length&lt;/code&gt;: maximum length of the sampled temporal random walks&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;beta&lt;/code&gt;: damping factor for long paths&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;cutoff&lt;/code&gt;: temporal cutoff in seconds to exclude the very distant past&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;sampled_walks&lt;/code&gt;: number of sampled walks for each edge update&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;full_walks&lt;/code&gt;: return every node of the sampled walk (True) or only the endpoints of the walk (False)&lt;/li&gt;
&lt;/ul&gt;
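&lt;p&gt;To get a feeling for what &lt;code&gt;half_life&lt;/code&gt;, &lt;code&gt;cutoff&lt;/code&gt; and &lt;code&gt;beta&lt;/code&gt; control, here is a sketch of standard half-life decay with a cutoff and a damping factor for longer walks. The exact formula MAGE uses may differ; the function names and numbers below are illustrative.&lt;/p&gt;

```python
def edge_weight(delta_seconds, half_life, cutoff):
    """Exponential time decay: an edge that appeared `delta_seconds`
    ago keeps half of its weight after every `half_life` seconds,
    and is ignored entirely beyond `cutoff` seconds."""
    if delta_seconds > cutoff:
        return 0.0
    return 0.5 ** (delta_seconds / half_life)

def walk_weight(edge_deltas, half_life, cutoff, beta):
    """Weight of a whole walk: every extra hop is damped by `beta`,
    so longer paths contribute less."""
    w = 1.0
    for delta in edge_deltas:
        w *= beta * edge_weight(delta, half_life, cutoff)
    return w

# With half_life=7200 (two hours), an edge loses half its weight
# every two hours and is dropped after cutoff=604800 (one week).
print(edge_weight(7200, half_life=7200, cutoff=604800))    # 0.5
print(edge_weight(700000, half_life=7200, cutoff=604800))  # 0.0
```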

&lt;p&gt;All of these parameters control temporal walk sampling. You can now set them with the following command in &lt;code&gt;Memgraph Lab&lt;/code&gt; or &lt;code&gt;mgconsole&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cypher"&gt;&lt;code&gt;&lt;span class="k"&gt;CALL&lt;/span&gt; &lt;span class="n"&gt;node2vec_online.set_streamwalk_updater&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;7200&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.9&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;604800&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt; &lt;span class="k"&gt;True&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt; 
&lt;span class="k"&gt;YIELD&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="ss"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Word2Vec learner&lt;/strong&gt; uses the following parameters:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;embedding_dimension&lt;/code&gt;: number of dimensions in the representation of the embedding vector&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;learning_rate&lt;/code&gt;: learning rate&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;skip_gram&lt;/code&gt;: use skip-gram model&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;negative_rate&lt;/code&gt;: negative rate for skip-gram model&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;threads&lt;/code&gt;: maximum number of threads for parallelization&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These parameters are mostly used in the &lt;a href="http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/" rel="noopener noreferrer"&gt;skip-gram model&lt;/a&gt;.&lt;br&gt;
You can set the parameters with the following command in &lt;code&gt;Memgraph Lab&lt;/code&gt; or &lt;code&gt;mgconsole&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cypher"&gt;&lt;code&gt;&lt;span class="k"&gt;CALL&lt;/span&gt; &lt;span class="n"&gt;node2vec_online.set_word2vec_learner&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;0.01&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt;&lt;span class="k"&gt;True&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt; &lt;span class="k"&gt;YIELD&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="ss"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Loading a dataset
&lt;/h2&gt;

&lt;p&gt;Now, Memgraph and Node2Vec online are ready, and you can download and import the &lt;a href="https://download.memgraph.com/datasets/physics-citation-network/physics-citation-network.cypherl.gz" rel="noopener noreferrer"&gt;prepared dataset&lt;/a&gt; through your terminal:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cypher"&gt;&lt;code&gt;&lt;span class="n"&gt;mgconsole&lt;/span&gt; &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;use&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;ssl&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;false&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;graph.cypherl&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After you execute this command, the trigger will fire upon every new edge added to the graph. Calculating the embeddings will take around a minute. From now on, whenever you add new edges, the &lt;code&gt;node2vec_online&lt;/code&gt; query module will update the embeddings, and they will be ready for use.&lt;/p&gt;

&lt;h2&gt;
  
  
  Recommendation system
&lt;/h2&gt;

&lt;p&gt;Here you will build a recommendation system and show MAGE how it is done.&lt;/p&gt;

&lt;p&gt;Before you continue, call the following command to check that the embeddings are ready. Again, execute the query in &lt;code&gt;Memgraph Lab&lt;/code&gt; or &lt;code&gt;mgconsole&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cypher"&gt;&lt;code&gt;&lt;span class="k"&gt;CALL&lt;/span&gt; &lt;span class="n"&gt;node2vec_online.get&lt;/span&gt;&lt;span class="ss"&gt;()&lt;/span&gt; &lt;span class="k"&gt;YIELD&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt; &lt;span class="n"&gt;embedding&lt;/span&gt;
&lt;span class="k"&gt;RETURN&lt;/span&gt; &lt;span class="nf"&gt;count&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embedding&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;embeddings_count&lt;/span&gt;&lt;span class="ss"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To create the recommendation engine, GitHub is offering you help. He heard that you want to impress MAGE and has some code already prepared &lt;a href="https://github.com/memgraph/physics-papers-recommender" rel="noopener noreferrer"&gt;here&lt;/a&gt;. You can trust him since he is also part of Memgraph. Download the code and get ready to spin it up.&lt;/p&gt;

&lt;p&gt;The recommendation engine is based on &lt;a href="https://scikit-learn.org/stable/modules/clustering.html#k-means" rel="noopener noreferrer"&gt;k-means from the scikit-learn package&lt;/a&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The k-means algorithm clusters data by trying to separate samples in &lt;code&gt;n&lt;/code&gt; groups of equal variance, minimizing a criterion known as the inertia or within-cluster sum-of-squares. This algorithm requires the number of clusters to be specified. It scales well to a large number of samples and has been used across a large range of application areas in many different fields.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In our procedure, we first get the embeddings from the &lt;code&gt;node2vec_online&lt;/code&gt; module. After that, using the &lt;a href="https://www.scikit-yb.org/en/latest/api/cluster/elbow.html" rel="noopener noreferrer"&gt;elbow method&lt;/a&gt; on the k-means inertia gives you the best &lt;code&gt;k&lt;/code&gt; value, which represents the number of clusters. GitHub, who offered you help, recommends using 5 clusters, but you can try out different numbers.&lt;/p&gt;
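&lt;p&gt;The quantity the elbow method plots against &lt;code&gt;k&lt;/code&gt; is the k-means inertia, i.e. the within-cluster sum of squared distances. Here is a minimal sketch of that computation on toy points; scikit-learn exposes the same value on a fitted model as the &lt;code&gt;inertia_&lt;/code&gt; attribute.&lt;/p&gt;

```python
def inertia(points, labels, centroids):
    """Within-cluster sum of squared distances: for each point, the
    squared distance to the centroid of its assigned cluster."""
    total = 0.0
    for point, label in zip(points, labels):
        center = centroids[label]
        total += sum((p - c) ** 2 for p, c in zip(point, center))
    return total

# Three toy 2-d points, two clusters (illustrative values).
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0)]
labels = [0, 0, 1]
centroids = [(0.0, 1.0), (10.0, 0.0)]
print(inertia(points, labels, centroids))  # 2.0
```

&lt;p&gt;As &lt;code&gt;k&lt;/code&gt; grows, inertia always shrinks; the elbow is the point where adding more clusters stops paying off.&lt;/p&gt;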

&lt;p&gt;You can visualize k-means inertia using this command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python3 recommender.py visualize
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We are using k-means here because we want to find which embeddings are close to each other in vector space, which should correspond to papers (nodes) that cover similar physics topics.&lt;/p&gt;
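&lt;p&gt;Similarity scores like the ones below are typically computed as cosine similarity between embedding vectors. A self-contained sketch with toy 4-dimensional vectors (the real embeddings from the module have the dimension you configured earlier):&lt;/p&gt;

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors:
    1.0 means identical direction, 0.0 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

emb_u = [0.9, 0.1, 0.4, 0.2]  # toy embeddings, not real paper vectors
emb_v = [0.8, 0.2, 0.5, 0.1]
print(round(cosine_similarity(emb_u, emb_v), 4))
```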

&lt;p&gt;After finding groups of similar papers, we are ready to get papers that are most similar by running the command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python3 recommender.py similarities &lt;span class="nt"&gt;--n_clusters&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;5 &lt;span class="nt"&gt;--top_n_sim&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;5
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Among the results, there is one example with 99.7% similarity.&lt;/p&gt;

&lt;p&gt;This is somewhat exaggerated, since we can't be sure from the graph alone that these two papers are this similar. But from the graph structure presented later, high similarity between these two nodes is expected, and from their descriptions we can see that the papers do cover similar topics. 99.7% is still a lot, but the algorithm works well! MAGE is impressed.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

id: 9606040
title: Mirror Symmetry is T-Duality

It is argued that every Calabi-Yau manifold X with a mirror Y admits a
family of supersymmetric toroidal 3-cycles. Moreover the moduli space of
such cycles together with their flat connections is precisely the space Y.
The mirror transformation is equivalent to T-duality on the 3-cycles. The
geometry of moduli space is addressed in a general framework. Several
examples are discussed.

id: 9610195
title: Unification of M- and F- Theory Calabi-Yau Fourfold Vacua

We consider splitting type phase transitions between Calabi-Yau fourfolds.
These transitions generalize previously known types of conifold transitions
between threefolds. Similar to conifold configurations the singular
varieties mediating the transitions between fourfolds connect moduli spaces
of different dimensions, describing ground states in M- and F-theory with
different numbers of massless modes as well as different numbers of cycles
to wrap various p-branes around. The web of Calabi-Yau fourfolds obtained
in this way contains the class of all complete intersection manifolds
embedded in products of ordinary projective spaces, but extends also to
weighted configurations. It follows from this that for some of the fourfold
transitions vacua with vanishing superpotential are connected to ground
states with nonzero superpotential.

STATS
similarity: 0.9973
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Using Memgraph Lab, we can now visualize some of the data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cypher"&gt;&lt;code&gt;&lt;span class="k"&gt;MATCH&lt;/span&gt; &lt;span class="n"&gt;connection&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="py"&gt;a:&lt;/span&gt;&lt;span class="n"&gt;Paper&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="ss"&gt;[&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;bfs..10&lt;/span&gt; &lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="s1"&gt;'Paper'&lt;/span&gt; &lt;span class="ow"&gt;IN&lt;/span&gt; &lt;span class="nf"&gt;labels&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="ss"&gt;))]&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="py"&gt;b:&lt;/span&gt;&lt;span class="n"&gt;Paper&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a.id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"9610195"&lt;/span&gt; &lt;span class="ow"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;b.id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"9606040"&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt; 
    &lt;span class="n"&gt;OR&lt;/span&gt;&lt;span class="w"&gt; 
    &lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a.id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"9606040"&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;b.id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"9610195"&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;RETURN&lt;/span&gt; &lt;span class="n"&gt;connection&lt;/span&gt;&lt;span class="ss"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fpublic-assets.memgraph.com%2Fonline-node2vec-recommendation-system%2Fmemgraph-tutorial-lab-online-node2vec-recommendation-system.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fpublic-assets.memgraph.com%2Fonline-node2vec-recommendation-system%2Fmemgraph-tutorial-lab-online-node2vec-recommendation-system.png" alt="memgraph-tutorial-lab-online-node2vec-recommendation-system"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Finding similar papers from the graph structure alone is a complicated task, but the results are very promising. You can take similar steps in different domains. The main advantage of this algorithm is that it works completely online, which is a great advantage in today's world, where more and more data is event-driven.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://memgraph.com/docs/mage" rel="noopener noreferrer"&gt;MAGE&lt;/a&gt; is a versatile open-source library containing standard graph algorithms that can help you analyze graph networks. While many graph libraries out there are great for performing graph computations, using MAGE and Memgraph provides you with additional benefits like persistent data storage and many other graph analytics capabilities.&lt;/p&gt;

&lt;p&gt;If you found this tutorial beneficial, you should try out the other algorithms included in the MAGE library.&lt;/p&gt;

&lt;p&gt;Finally, if you are working on your query module and would like to share it with other developers, take a look at the &lt;a href="https://github.com/memgraph/mage/blob/main/CONTRIBUTING.md" rel="noopener noreferrer"&gt;contributing guidelines&lt;/a&gt;. We would be more than happy to provide feedback and add the module to the &lt;a href="https://github.com/memgraph/mage" rel="noopener noreferrer"&gt;MAGE repository&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  References:
&lt;/h2&gt;

&lt;p&gt;[1] F. Béres, D. M. Kelen, R. Pálovics and A. A. Benczúr. Node embeddings in dynamic graphs. 2019. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://memgraph.com/blog?topics=Graph+Algorithms&amp;amp;utm_source=devto&amp;amp;utm_medium=referral&amp;amp;utm_campaign=blog_repost&amp;amp;utm_content=banner#list" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw0azgpsgm3wp9w5sd5wu.png" alt="Read more about Graph Algorithms on memgraph.com"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Temporal Graph Neural Networks With Pytorch - How to Create a Simple Recommendation Engine on an Amazon Dataset</title>
      <dc:creator>Antonio Filipovic</dc:creator>
      <pubDate>Fri, 20 Jan 2023 13:37:10 +0000</pubDate>
      <link>https://dev.to/memgraph/temporal-graph-neural-networks-with-pytorch-how-to-create-a-simple-recommendation-engine-on-an-amazon-dataset-5g42</link>
      <guid>https://dev.to/memgraph/temporal-graph-neural-networks-with-pytorch-how-to-create-a-simple-recommendation-engine-on-an-amazon-dataset-5g42</guid>
      <description>&lt;h2&gt;
  
  
  PYTORCH x MEMGRAPH x GNN  = 💟
&lt;/h2&gt;

&lt;p&gt;Over the course of the last few months, we at &lt;strong&gt;Memgraph&lt;/strong&gt; have been working on something that we believe could be helpful with classical graph prediction tasks. With our latest newborn query module, you will have the option of performing both &lt;strong&gt;label classification&lt;/strong&gt; and &lt;strong&gt;link prediction&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;But, how come a query module can do both label classification and link prediction? It's all thanks to &lt;strong&gt;graph neural networks&lt;/strong&gt;, for short &lt;strong&gt;GNNs&lt;/strong&gt;. ❤️ &lt;/p&gt;

&lt;h2&gt;
  
  
  Graph neural networks
&lt;/h2&gt;

&lt;p&gt;Whether you are a software engineer or a deep learning enthusiast, there is a high chance you heard of &lt;strong&gt;&lt;a href="https://en.wikipedia.org/wiki/Graph_neural_network" rel="noopener noreferrer"&gt;graph neural networks&lt;/a&gt;&lt;/strong&gt; as a rising ⭐. Maybe you even deep-dived into this topic and are now ready for a &lt;strong&gt;&lt;a href="https://github.com/memgraph/mage" rel="noopener noreferrer"&gt;new MAGE spell&lt;/a&gt;&lt;/strong&gt;. But even if you haven't, don't worry, I will try to give you a quick overview so you can catch up and follow along.&lt;/p&gt;

&lt;p&gt;You probably already know that a graph consists of &lt;strong&gt;nodes (vertices)&lt;/strong&gt; and &lt;strong&gt;edges (relationships)&lt;/strong&gt;.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxg80q9wn8qqyjrbtxp8a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxg80q9wn8qqyjrbtxp8a.png" alt="amazon-user-item-recommender-with-tgn-and-memgraph" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Every node can have its own &lt;strong&gt;feature vector&lt;/strong&gt;, which essentially describes the node with a vector of numbers. We can look at this &lt;strong&gt;feature vector&lt;/strong&gt; as the &lt;strong&gt;representation vector&lt;/strong&gt; of the node, also called the &lt;strong&gt;embedding&lt;/strong&gt; of the node.&lt;br&gt;
Without getting lost in technical details: &lt;strong&gt;graph neural networks&lt;/strong&gt; work as a &lt;strong&gt;message passing&lt;sup&gt;[2]&lt;/sup&gt;&lt;/strong&gt; system, where each node aggregates the feature representations of its 1-hop neighbors. To be more precise, nodes don’t aggregate the feature representations directly, but feature vectors obtained by dimensionality reduction using the &lt;em&gt;W&lt;/em&gt; matrix (you can look at it as a fully connected linear layer). These matrix-projected feature vectors are called &lt;strong&gt;messages&lt;/strong&gt;, and they give graph neural networks their expressive power.&lt;/p&gt;
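&lt;p&gt;A toy sketch of one message passing round, assuming plain Python lists instead of tensors: each node averages the &lt;em&gt;W&lt;/em&gt;-projected feature vectors (the messages) of its 1-hop neighbors. Real GNN layers also add nonlinearities and the node's own features; this shows only the aggregation step.&lt;/p&gt;

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def message_passing_round(features, neighbors, W):
    """Each node aggregates the W-projected feature vectors
    (the 'messages') of its 1-hop neighbors by averaging them."""
    new_features = {}
    for node, nbrs in neighbors.items():
        messages = [matvec(W, features[n]) for n in nbrs]
        dim = len(W)
        new_features[node] = [
            sum(m[i] for m in messages) / len(messages) for i in range(dim)
        ]
    return new_features

# Toy graph: two nodes that are each other's neighbor,
# 2-d features and an illustrative 2x2 projection matrix.
features = {"a": [1.0, 0.0], "b": [0.0, 1.0]}
neighbors = {"a": ["b"], "b": ["a"]}
W = [[2.0, 0.0], [0.0, 2.0]]
print(message_passing_round(features, neighbors, W))  # {'a': [0.0, 2.0], 'b': [2.0, 0.0]}
```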

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F04u6l3ym72i653udjxqt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F04u6l3ym72i653udjxqt.png" alt="amazon-user-item-recommender-with-tgn-and-memgraph" width="800" height="258"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This idea originates from the field of &lt;strong&gt;graph signal processing&lt;/strong&gt;. We don't have time here to explain how we got from &lt;strong&gt;signals&lt;/strong&gt; to &lt;strong&gt;message passing&lt;/strong&gt;, but it's all math.&lt;br&gt;
Feel free to drop us a message on &lt;a href="https://discord.gg/memgraph" rel="noopener noreferrer"&gt;Discord&lt;/a&gt; and we will make sure to create a blog post introducing &lt;strong&gt;graph neural networks&lt;/strong&gt; - not a simplified version, but one explaining everything from the beginning, starting somewhere around the &lt;a href="https://arxiv.org/pdf/0711.0189.pdf" rel="noopener noreferrer"&gt;Tutorial on Spectral Clustering&lt;/a&gt; from 2007.&lt;/p&gt;

&lt;p&gt;If you would like to get a better understanding of &lt;strong&gt;graph neural networks&lt;/strong&gt; before continuing, I suggest you check out:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;this &lt;strong&gt;&lt;a href="https://distill.pub/2021/gnn-intro/" rel="noopener noreferrer"&gt;blog post&lt;/a&gt;&lt;/strong&gt; provides a gentle introduction to the topic&lt;/li&gt;
&lt;li&gt;you can also check &lt;strong&gt;&lt;a href="https://www.youtube.com/watch?v=6g9vtxUmfwM&amp;amp;list=PLoROMvodv4rPLKxIpqhjhPgdQy7imNkDn&amp;amp;index=14&amp;amp;ab_channel=StanfordOnline" rel="noopener noreferrer"&gt;the video explanation&lt;/a&gt;&lt;/strong&gt; by the Stanford professor Jure Leskovec about graph neural networks - I would honestly suggest to binge-watch the whole series, but if you don't have that much time, just watch the lectures called &lt;strong&gt;&lt;a href="https://www.youtube.com/watch?v=6g9vtxUmfwM&amp;amp;list=PLoROMvodv4rPLKxIpqhjhPgdQy7imNkDn&amp;amp;index=15&amp;amp;ab_channel=StanfordOnline" rel="noopener noreferrer"&gt;Message passing and Node Classification&lt;/a&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;a href="https://www.youtube.com/watch?v=F3PgltDzllc&amp;amp;list=PLoROMvodv4rPLKxIpqhjhPgdQy7imNkDn&amp;amp;index=17&amp;amp;ab_channel=StanfordOnline" rel="noopener noreferrer"&gt;Introduction to Graph Neural Networks&lt;/a&gt;&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;and if you want to deep-dive, which I suggest, I will leave &lt;a href="https://gordicaleksa.medium.com/how-to-get-started-with-graph-machine-learning-afa53f6f963a" rel="noopener noreferrer"&gt;the following blog post&lt;/a&gt;, it will be more than enough.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The reason why we added GNNs to &lt;strong&gt;&lt;a href="https://github.com/memgraph/mage" rel="noopener noreferrer"&gt;MAGE&lt;/a&gt;&lt;/strong&gt; is that &lt;strong&gt;GNNs&lt;/strong&gt; are to &lt;strong&gt;graphs&lt;/strong&gt; what &lt;strong&gt;&lt;a href="https://towardsdatascience.com/a-comprehensive-guide-to-convolutional-neural-networks-the-eli5-way-3bd2b1164a53" rel="noopener noreferrer"&gt;CNNs&lt;/a&gt;&lt;/strong&gt; are to &lt;strong&gt;images&lt;/strong&gt;. &lt;strong&gt;GNNs&lt;/strong&gt; can inductively learn about your dataset, which means that after training is complete you can apply their knowledge to a similar use case, which is very cool since you don't need to retrain the whole algorithm. With other representation learning methods like &lt;strong&gt;&lt;a href="https://arxiv.org/abs/1403.6652" rel="noopener noreferrer"&gt;DeepWalk&lt;/a&gt;&lt;/strong&gt;, &lt;strong&gt;&lt;a href="https://arxiv.org/abs/1607.00653" rel="noopener noreferrer"&gt;Node2Vec&lt;/a&gt;&lt;/strong&gt;, &lt;strong&gt;&lt;a href="https://arxiv.org/abs/1603.08861" rel="noopener noreferrer"&gt;Planetoid&lt;/a&gt;&lt;/strong&gt;, we haven't been able to do that until, well, &lt;strong&gt;GNNs&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Now, why temporal graph neural networks?&lt;/strong&gt;&lt;br&gt;
Imagine you are in charge of a product where users interact with items every minute of every day; they like some and hate others. You would like to present them with more items they like - and not just that, you would love it if they bought those new items. This way you have a &lt;strong&gt;stream&lt;/strong&gt; of data. Interactions appear across time, so you are dealing with a &lt;strong&gt;temporal&lt;/strong&gt; dataset. Classical &lt;strong&gt;GNNs&lt;/strong&gt; are not designed to work with streams, although they work very well on unseen data. But having a hammer doesn't make every problem a nail - some methods work better on streams, others on static data. &lt;/p&gt;
&lt;h2&gt;
  
  
  Temporal graph networks
&lt;/h2&gt;

&lt;p&gt;As you already know, we in &lt;a href="https://memgraph.com?utm_source=devto&amp;amp;utm_medium=referral&amp;amp;utm_campaign=blog_repost" rel="noopener noreferrer"&gt;Memgraph&lt;/a&gt; are all about streams.&lt;/p&gt;

&lt;p&gt;The research team at &lt;a href="https://twitter.com/twitterresearch" rel="noopener noreferrer"&gt;Twitter&lt;/a&gt; developed a &lt;strong&gt;GNN&lt;/strong&gt; that works on &lt;strong&gt;temporal&lt;/strong&gt; graphs. This way &lt;strong&gt;GNNs&lt;/strong&gt; can deal with &lt;strong&gt;continuous-time dynamic graphs&lt;/strong&gt;. In the image below you can see a schematic view of &lt;strong&gt;temporal graph networks&lt;/strong&gt;. It is a lot to take in, but the process, once explained, is not that complicated.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://public-assets.memgraph.com/amazon-user-item-recommender-with-tgn-and-memgraph/memgraph-temporal-graph-networks.jpg" rel="noopener noreferrer"&gt;&lt;br&gt;
&lt;img alt="TGN" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fpublic-assets.memgraph.com%2Famazon-user-item-recommender-with-tgn-and-memgraph%2Fmemgraph-temporal-graph-networks.jpg" width="800" height="525"&gt;&lt;br&gt;
&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Firstly, in &lt;strong&gt;continuous-time dynamic graphs&lt;/strong&gt;, you can model changes on graphs that include &lt;strong&gt;edge or node addition&lt;/strong&gt;, &lt;strong&gt;edge or node feature transformation (update)&lt;/strong&gt;, &lt;strong&gt;edge or node deletion&lt;/strong&gt; as time-listed events.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://arxiv.org/pdf/2006.10637.pdf" rel="noopener noreferrer"&gt;Temporal graph networks&lt;sup&gt;[1]&lt;/sup&gt;&lt;/a&gt;&lt;/strong&gt;, shortened &lt;strong&gt;TGNs&lt;/strong&gt;, work as follows: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;node embedding calculations work on the concept of message passing, which I hope you are familiar with at this point&lt;/li&gt;
&lt;li&gt;TGNs use &lt;strong&gt;events&lt;/strong&gt;, and whenever a new edge appears, it represents an &lt;strong&gt;interaction event&lt;/strong&gt; between two nodes involved&lt;/li&gt;
&lt;li&gt;from every &lt;strong&gt;event&lt;/strong&gt;, we create a &lt;strong&gt;message&lt;/strong&gt; and use a &lt;strong&gt;message aggregator&lt;/strong&gt; for all messages of the same node to get the &lt;strong&gt;aggregated message&lt;/strong&gt; of every node&lt;/li&gt;
&lt;li&gt;every node has its own &lt;strong&gt;memory&lt;/strong&gt; which represents an &lt;strong&gt;accumulated&lt;/strong&gt; state, updated with the &lt;strong&gt;aggregated message&lt;/strong&gt; by an &lt;strong&gt;&lt;a href="https://towardsdatascience.com/illustrated-guide-to-lstms-and-gru-s-a-step-by-step-explanation-44e9eb85bf21" rel="noopener noreferrer"&gt;LSTM or GRU&lt;/a&gt;&lt;/strong&gt; cell
&lt;/li&gt;
&lt;li&gt;Lastly, the embedding module is used to generate the temporal embedding&lt;/li&gt;
&lt;/ul&gt;
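&lt;p&gt;The steps above can be sketched in a heavily simplified form. The real TGN uses learned message functions and a GRU/LSTM cell; here the message is just the raw event feature, aggregation is a mean, and the memory update is an exponential moving average - purely to show the data flow, not the actual model.&lt;/p&gt;

```python
def tgn_step(memory, events, alpha=0.5):
    """One batch step: turn interaction events into per-node messages,
    aggregate them (mean), and blend each node's memory with its
    aggregated message (a stand-in for the GRU/LSTM update)."""
    inbox = {}
    for src, dst, feature in events:
        # Each interaction event produces a message for both endpoints.
        inbox.setdefault(src, []).append(feature)
        inbox.setdefault(dst, []).append(feature)
    new_memory = dict(memory)
    for node, msgs in inbox.items():
        aggregated = sum(msgs) / len(msgs)   # message aggregator
        old = memory.get(node, 0.0)
        new_memory[node] = (1 - alpha) * old + alpha * aggregated
    return new_memory

# Toy events: user "u" interacts with item "i" twice in one batch,
# with scalar event features (real TGN uses feature vectors).
memory = {"u": 0.0, "i": 0.0}
events = [("u", "i", 1.0), ("u", "i", 3.0)]
print(tgn_step(memory, events))  # {'u': 1.0, 'i': 1.0}
```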

&lt;p&gt;There are two embedding module types we integrated into our &lt;strong&gt;TGN&lt;/strong&gt; implementation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Graph attention layer&lt;/strong&gt;: it is a similar concept as in &lt;strong&gt;&lt;a href="https://arxiv.org/abs/1710.10903" rel="noopener noreferrer"&gt;Graph attention networks&lt;/a&gt;&lt;/strong&gt;, but here they use the original idea from Vaswani et al. &lt;strong&gt;&lt;a href="https://arxiv.org/abs/1706.03762" rel="noopener noreferrer"&gt;Attention is all you need&lt;/a&gt;&lt;/strong&gt; which includes queries, keys and values and everything else is the same. I suggest you look at the &lt;strong&gt;TGN&lt;/strong&gt; paper to check the exact embedding calculation details.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Graph sum layer&lt;/strong&gt;: this mechanism closely resembles the standard &lt;strong&gt;message passing&lt;/strong&gt; scheme&lt;/li&gt;
&lt;/ul&gt;
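&lt;p&gt;To make the graph sum idea concrete, here is a minimal scalar version of one message-passing step. This is a sketch with made-up weights: the actual layer applies learned weight matrices to feature vectors, not scalars.&lt;/p&gt;

```python
# Toy graph-sum layer over scalar node features:
# h'_i = w_self * h_i + w_nbr * sum of neighbor features.
def graph_sum_layer(features, adjacency, w_self=1.0, w_nbr=0.5):
    return {
        node: w_self * features[node]
        + w_nbr * sum(features[nbr] for nbr in adjacency[node])
        for node in features
    }

# A tiny undirected graph: "a" is connected to "b" and "c".
features = {"a": 1.0, "b": 2.0, "c": 3.0}
adjacency = {"a": ["b", "c"], "b": ["a"], "c": ["a"]}
out = graph_sum_layer(features, adjacency)
# node "a": 1.0 + 0.5 * (2.0 + 3.0) = 3.5
```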

&lt;p&gt;There is a subtle problem when dealing with &lt;strong&gt;embedding updates&lt;/strong&gt;. We don't update embeddings for every node, only for the ones that appear in a batch. Without going too deep into implementation details, the problem is this: nodes in the batch appear at different points in time, so when computing a node's embedding we must take its update time into account and only use neighbors that appeared in the graph before it. This matters because we first update the whole graph representation with the batch information and only then do the calculation, which is what we did. As a result, every node gets its own computation graph. You can see what it looks like in the image below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk5k3zc678kex7et0op0d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk5k3zc678kex7et0op0d.png" alt="memgraph-temporal-computation-graph" width="800" height="539"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Amazon data example
&lt;/h2&gt;

&lt;p&gt;To try out how this works, we have prepared a Jupyter Notebook on our &lt;strong&gt;&lt;a href="https://github.com/memgraph/jupyter-memgraph-tutorials" rel="noopener noreferrer"&gt;GitHub repository&lt;/a&gt;&lt;/strong&gt;. It is based on the &lt;a href="http://snap.stanford.edu/data/amazon/productGraph/" rel="noopener noreferrer"&gt;Amazon user-item reviews&lt;/a&gt; dataset. In the following example, you will see how to do &lt;strong&gt;link prediction&lt;/strong&gt; with &lt;strong&gt;TGN&lt;/strong&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  Exploring an Amazon data network in Memgraph
&lt;/h2&gt;

&lt;p&gt;Through this short tutorial, you will learn how to install Memgraph, connect to it from a Jupyter Notebook and perform data analysis on an Amazon dataset using a &lt;strong&gt;graph neural network&lt;/strong&gt; called &lt;strong&gt;Temporal graph networks&lt;/strong&gt;.&lt;/p&gt;
&lt;h3&gt;
  
  
  1. Prerequisites
&lt;/h3&gt;

&lt;p&gt;For this tutorial, you will need to install:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://jupyter.org/install" rel="noopener noreferrer"&gt;Jupyter&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.docker.com/get-docker/" rel="noopener noreferrer"&gt;Docker&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://pypi.org/project/gqlalchemy/" rel="noopener noreferrer"&gt;GQLAlchemy&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Docker is used because Memgraph is a native Linux application and cannot be installed natively on Windows or macOS.&lt;/p&gt;
&lt;h3&gt;
  
  
  2. Installation using Docker
&lt;/h3&gt;

&lt;p&gt;After installing Docker, you can set up Memgraph by running:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker run -it -p 7687:7687 -p 3000:3000 -p 7444:7444 memgraph/memgraph-platform
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This command will download the image and, once the download finishes, start the Memgraph container.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Connecting to Memgraph with GQLAlchemy
&lt;/h3&gt;

&lt;p&gt;We will be using the &lt;strong&gt;GQLAlchemy&lt;/strong&gt; object graph mapper (OGM) to connect to Memgraph and execute &lt;strong&gt;Cypher&lt;/strong&gt; queries easily. GQLAlchemy also serves as a Python driver/client for Memgraph. You can install it using:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install gqlalchemy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Hint&lt;/strong&gt;: You may need to install &lt;a href="https://cmake.org/download/" rel="noopener noreferrer"&gt;CMake&lt;/a&gt; before installing GQLAlchemy.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Maybe you got confused when I mentioned Cypher. You can think of Cypher as SQL for graph databases. It contains many of the same language constructs like &lt;code&gt;CREATE&lt;/code&gt;, &lt;code&gt;UPDATE&lt;/code&gt;, &lt;code&gt;DELETE&lt;/code&gt;... and it's used to query the database.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;gqlalchemy&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Memgraph&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;memgraph&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Memgraph&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;127.0.0.1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;7687&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's make sure that Memgraph is empty before we start with anything else.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;memgraph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;drop_database&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The following command should output &lt;code&gt;{number_of_nodes: 0}&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;memgraph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute_and_fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    MATCH (n) RETURN count(n) AS number_of_nodes ;
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;next&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  4. Data analysis on an Amazon product dataset
&lt;/h3&gt;

&lt;p&gt;You will load the &lt;strong&gt;Amazon product dataset&lt;/strong&gt; as a list of Cypher queries. This is what it looks like:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://public-assets.memgraph.com/amazon-user-item-recommender-with-tgn-and-memgraph/memgraph-amazon-product-dataset.png" rel="noopener noreferrer"&gt;&lt;br&gt;
&lt;img alt="dataset" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fpublic-assets.memgraph.com%2Famazon-user-item-recommender-with-tgn-and-memgraph%2Fmemgraph-amazon-product-dataset.png" width="800" height="612"&gt;&lt;br&gt;
&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;An example of the aforementioned queries is the following one:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cypher"&gt;&lt;code&gt;&lt;span class="k"&gt;MERGE&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="py"&gt;a:&lt;/span&gt;&lt;span class="n"&gt;User&lt;/span&gt; &lt;span class="ss"&gt;{&lt;/span&gt;&lt;span class="py"&gt;id:&lt;/span&gt; &lt;span class="s1"&gt;'A1BHUGKLYW6H7V'&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt; &lt;span class="py"&gt;profile_name:&lt;/span&gt;&lt;span class="s1"&gt;'P. Lecuyer'&lt;/span&gt;&lt;span class="ss"&gt;})&lt;/span&gt;
&lt;span class="k"&gt;MERGE&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="py"&gt;b:&lt;/span&gt;&lt;span class="n"&gt;Item&lt;/span&gt; &lt;span class="ss"&gt;{&lt;/span&gt;&lt;span class="py"&gt;id:&lt;/span&gt; &lt;span class="s1"&gt;'B0007MCVQ2'&lt;/span&gt;&lt;span class="ss"&gt;})&lt;/span&gt;
&lt;span class="k"&gt;MERGE&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="ss"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;:REVIEWED&lt;/span&gt; &lt;span class="ss"&gt;{&lt;/span&gt;&lt;span class="py"&gt;review_text:&lt;/span&gt;&lt;span class="s1"&gt;'Like all Clarks, these guys didnt disappoint. They fit great and look even better. For the price, I dont think a better deal exists out there for casual shoes.'&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt;
  &lt;span class="py"&gt;feature:&lt;/span&gt; &lt;span class="ss"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;161.0&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;133.0&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.782608695652174&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.031055900621118012&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.17391304347826086&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.043478260869565216&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;36.0&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;36.0&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;3.6944444444444446&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;3.0&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;12.0&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.055&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.519&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.427&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.9238&lt;/span&gt;&lt;span class="ss"&gt;],&lt;/span&gt;
  &lt;span class="nl"&gt;review_time&lt;/span&gt;&lt;span class="dl"&gt;:&lt;/span&gt;&lt;span class="m"&gt;1127088000&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt; &lt;span class="nl"&gt;review_score&lt;/span&gt;&lt;span class="dl"&gt;:&lt;/span&gt;&lt;span class="m"&gt;5.0&lt;/span&gt;&lt;span class="ss"&gt;}]&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="ss"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As you can see, our graph schema has &lt;code&gt;User&lt;/code&gt; nodes and &lt;code&gt;Item&lt;/code&gt; nodes. Every user has left a very &lt;strong&gt;positive&lt;/strong&gt; review for an &lt;strong&gt;Item&lt;/strong&gt;. This wasn't the case for all the reviews in the original dataset, but we processed it and removed the negative ones (all reviews with &lt;code&gt;review_score&lt;/code&gt; &amp;lt;= 3.0).&lt;br&gt;
Every &lt;code&gt;User&lt;/code&gt; has an &lt;code&gt;id&lt;/code&gt;, and so does every reviewed &lt;code&gt;Item&lt;/code&gt;. In this single query, we match the &lt;code&gt;User&lt;/code&gt; and the &lt;code&gt;Item&lt;/code&gt; with the given ids, or create them if they are missing from the database. We then create an interaction event between them in the form of an &lt;code&gt;edge&lt;/code&gt; carrying a list of &lt;strong&gt;20&lt;/strong&gt; edge features. We created these edge features from the user reviews:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. Number of characters
2. Number of characters without counting white space
3. Fraction of alphabetical characters
4. Fraction of digits
5. Fraction of uppercase characters
6. Fraction of white spaces
7. Fraction of special characters, such as comma, exclamation mark, etc.
8. Number of words
9. Number of unique words
10. Number of long words (at least 6 characters)
11. Average word length
12. Number of unique stopwords
13. Fraction of stopwords
14. Number of sentences
15. Number of long sentences (at least 10 words)
16. Average number of words per sentence
17. Positive sentiment calculated by VADER
18. Negative sentiment calculated by VADER
19. Neutral sentiment calculated by VADER
20. Compound sentiment calculated by VADER
# VADER - Valence Aware Dictionary and sEntiment Reasoner,
# a lexicon and rule-based sentiment analysis tool
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
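&lt;p&gt;Most of these features can be computed with plain Python string operations. The sketch below covers a handful of them; the VADER sentiment scores (features 17-20) require an external sentiment analysis package and are omitted.&lt;/p&gt;

```python
# Sketch: compute a few of the review-text features listed above
# using only the standard library. Sentiment features are omitted.
def review_features(text):
    chars = len(text)
    words = text.split()
    return {
        "num_chars": chars,
        "num_chars_no_whitespace": len(text.replace(" ", "")),
        "frac_alpha": sum(c.isalpha() for c in text) / chars,
        "frac_digits": sum(c.isdigit() for c in text) / chars,
        "frac_upper": sum(c.isupper() for c in text) / chars,
        "num_words": len(words),
        "num_unique_words": len(set(w.lower() for w in words)),
        "num_long_words": sum(len(w) >= 6 for w in words),
        "avg_word_length": sum(len(w) for w in words) / len(words),
    }

feats = review_features("Great shoes, they fit great")
```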



&lt;p&gt;We could also have prepared node features for each &lt;code&gt;User&lt;/code&gt; and &lt;code&gt;Item&lt;/code&gt;, but the edge features seemed sufficient for our example.&lt;/p&gt;

&lt;p&gt;One more &lt;strong&gt;note&lt;/strong&gt;: the dataset of queries we prepared for you contains one query that switches the "working mode" of our &lt;strong&gt;temporal graph networks&lt;/strong&gt; module to &lt;strong&gt;evaluation (eval)&lt;/strong&gt; mode. Once the mode changes, the &lt;strong&gt;tgn&lt;/strong&gt; module stops &lt;strong&gt;training&lt;/strong&gt; the model and starts &lt;strong&gt;evaluating&lt;/strong&gt; it.&lt;br&gt;
If you look inside the file, you should find the following query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cypher"&gt;&lt;code&gt;&lt;span class="k"&gt;CALL&lt;/span&gt; &lt;span class="n"&gt;tgn.set_mode&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"eval"&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt; &lt;span class="k"&gt;YIELD&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="ss"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Trigger creation
&lt;/h4&gt;

&lt;p&gt;In order to process the dataset, we need to create a trigger on the &lt;strong&gt;edge create&lt;/strong&gt; event, but only if a trigger with that name doesn't already exist.&lt;/p&gt;

&lt;p&gt;This check is handy if you are running a &lt;strong&gt;local Memgraph&lt;/strong&gt; instance rather than &lt;strong&gt;Docker&lt;/strong&gt;: it lets you rerun the Jupyter Notebook without dropping the database first.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;memgraph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute_and_fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SHOW TRIGGERS;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;trigger_exists&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;trigger name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;create_embeddings&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Trigger already exists&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;trigger_exists&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
            &lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;


&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;trigger_exists&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;memgraph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
        CREATE TRIGGER create_embeddings ON --&amp;gt; CREATE BEFORE COMMIT
        EXECUTE CALL tgn.update(createdEdges) RETURN 1;
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Index creation for dataset
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Memgraph&lt;/strong&gt; works best with &lt;strong&gt;indexes&lt;/strong&gt; defined for nodes. In our case, we will create indexes for &lt;strong&gt;User&lt;/strong&gt; and &lt;strong&gt;Item&lt;/strong&gt; nodes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;index_queries&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CREATE INDEX ON :User(id);&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CREATE INDEX ON :Item(id);&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;index_queries&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;memgraph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute_and_fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;continue&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Training and evaluating Temporal Graph Networks
&lt;/h4&gt;

&lt;p&gt;In order to train a &lt;strong&gt;Temporal graph network&lt;/strong&gt; on an &lt;strong&gt;Amazon dataset&lt;/strong&gt;, we will split the dataset into &lt;strong&gt;train&lt;/strong&gt; and &lt;strong&gt;eval&lt;/strong&gt; queries. Let's first load our &lt;strong&gt;raw queries&lt;/strong&gt;. Each query creates an &lt;strong&gt;edge&lt;/strong&gt; between &lt;strong&gt;User&lt;/strong&gt; and &lt;strong&gt;Item&lt;/strong&gt; thus representing a &lt;strong&gt;positive&lt;/strong&gt; review of a certain &lt;strong&gt;Item&lt;/strong&gt; by a &lt;strong&gt;User&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;

&lt;span class="n"&gt;dir_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getcwd&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;dir_path&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/data/queries.cypherl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;fh&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;raw_queries&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;fh&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;readlines&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;train_eval_split_ratio&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.8&lt;/span&gt;
&lt;span class="n"&gt;queries_index_split&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw_queries&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;train_eval_split_ratio&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;train_queries&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;raw_queries&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="n"&gt;queries_index_split&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;eval_queries&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;raw_queries&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;queries_index_split&lt;/span&gt;&lt;span class="p"&gt;:]&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Num of train queries &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_queries&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Num of eval queries &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;eval_queries&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Before we start importing &lt;strong&gt;train&lt;/strong&gt; queries, first we need to set parameters for &lt;strong&gt;temporal graph networks&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# since we are doing link prediction, we use self_supervised mode
&lt;/span&gt;&lt;span class="n"&gt;learning_type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;self_supervised&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;batch_size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt; &lt;span class="c1"&gt;#optimal size as defined in paper
&lt;/span&gt;&lt;span class="n"&gt;num_of_layers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="c1"&gt;# GNNs don't need multiple layers, contrary to CNNs.
&lt;/span&gt;&lt;span class="n"&gt;layer_type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;graph_attn&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="c1"&gt;# choose between graph_attn or graph_sum
&lt;/span&gt;&lt;span class="n"&gt;edge_message_function_type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;identity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="c1"&gt;# choose between identity or mlp
&lt;/span&gt;&lt;span class="n"&gt;message_aggregator_type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;last&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="c1"&gt;# choose between last or mean
&lt;/span&gt;&lt;span class="n"&gt;memory_updater_type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gru&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="c1"&gt;# choose between gru or rnn
&lt;/span&gt;&lt;span class="n"&gt;attention_heads&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="n"&gt;memory_dimension&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;
&lt;span class="n"&gt;time_dimension&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;
&lt;span class="n"&gt;num_edge_features&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;
&lt;span class="n"&gt;num_node_features&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;
&lt;span class="c1"&gt;# number of sampled neighbors
&lt;/span&gt;&lt;span class="n"&gt;num_neighbors&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;15&lt;/span&gt;
&lt;span class="c1"&gt;# message dimension must be defined in the case we use MLP, 
# because then we define dimension of **projection**
&lt;/span&gt;&lt;span class="n"&gt;message_dimension&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time_dimension&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;num_node_features&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;num_edge_features&lt;/span&gt;

&lt;span class="n"&gt;tgn_param_query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CALL tgn.set_params({{learning_type:&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;learning_type&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;, 
batch_size: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;, num_of_layers:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;num_of_layers&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;, 
layer_type:&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;layer_type&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;, memory_dimension:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;memory_dimension&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;, 
time_dimension:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;time_dimension&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;, num_edge_features:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;num_edge_features&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;, 
num_node_features:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;num_node_features&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;, message_dimension:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;message_dimension&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;,
num_neighbors:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;num_neighbors&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;, 
edge_message_function_type:&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;edge_message_function_type&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;,
message_aggregator_type:&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;message_aggregator_type&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;,
memory_updater_type:&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;memory_updater_type&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;, 
attention_heads:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;attention_heads&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;}}) 
YIELD *;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;TGN param query: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;tgn_param_query&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;memgraph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute_and_fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tgn_param_query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now it is time to execute queries and perform the &lt;strong&gt;first&lt;/strong&gt; epoch of &lt;strong&gt;training&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;train_queries&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;memgraph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute_and_fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;continue&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now we need to switch &lt;strong&gt;TGN&lt;/strong&gt; to &lt;strong&gt;eval&lt;/strong&gt; mode and start running our &lt;strong&gt;evaluation&lt;/strong&gt; queries.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;memgraph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute_and_fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CALL tgn.set_eval() YIELD *;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;eval_queries&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;memgraph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute_and_fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;continue&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After our &lt;strong&gt;stream&lt;/strong&gt; is done, we should run a few more rounds of training and evaluation so the model ends up properly trained. We can do so with the following query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;num_of_epochs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;

&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;memgraph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute_and_fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    CALL tgn.train_and_eval(&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;num_of_epochs&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;) YIELD *
    RETURN epoch_num, batch_num, precision, batch_process_time, batch_type 
    ORDER BY epoch_num, batch_num;
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;continue&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, let's get the results and then do some &lt;strong&gt;plotting&lt;/strong&gt; to check whether the &lt;strong&gt;precision&lt;/strong&gt; increases between epochs.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;collections&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;defaultdict&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;

&lt;span class="n"&gt;results_train_dict&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;defaultdict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;results_eval_dict&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;defaultdict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;memgraph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute_and_fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
        CALL tgn.get_results() 
        YIELD  epoch_num, batch_num, precision, batch_process_time, batch_type
        RETURN epoch_num, batch_num, precision, batch_process_time, batch_type 
        ORDER BY epoch_num, batch_num;
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;batch_type&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Train&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;results_train_dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;epoch_num&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]].&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;precision&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;results_eval_dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;epoch_num&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]].&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;precision&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now that we have collected the results, let's plot the average &lt;code&gt;precision&lt;/code&gt; of the &lt;code&gt;train&lt;/code&gt; batches and the average &lt;code&gt;precision&lt;/code&gt; of the &lt;code&gt;eval&lt;/code&gt; batches within each epoch. Averaging over batches is valid because every batch is the same size (&lt;strong&gt;NB&lt;/strong&gt;: &lt;code&gt;TGN&lt;/code&gt; uses a predefined batch size).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;X_train&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;span class="n"&gt;Y_train&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;epoch&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;batches_precision&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;results_train_dict&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;Y_train&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;batches_precision&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;epoch&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;X_eval&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;span class="n"&gt;Y_eval&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;epoch&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;batches_precision&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;results_eval_dict&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;Y_eval&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;batches_precision&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;X_eval&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;epoch&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="c1"&gt;#scatter plot
&lt;/span&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;b&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;train&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_eval&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Y_eval&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;eval&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;#add title
&lt;/span&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;epoch - average batch precision&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;#add x and y labels
&lt;/span&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;xlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;epoch&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ylabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;precision&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;legend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;upper left&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;#show plot
&lt;/span&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxnhb6hrvnznpeir3kyym.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxnhb6hrvnznpeir3kyym.png" alt="memgraph-average-batch-precision" width="392" height="278"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can see that the &lt;strong&gt;average precision&lt;/strong&gt; increases, which is really good. Now we can start creating some recommendations. Let's find &lt;code&gt;Users&lt;/code&gt; who reviewed one &lt;code&gt;Item&lt;/code&gt; positively and those who reviewed multiple &lt;code&gt;Items&lt;/code&gt; positively. The module will return a prediction score for each &lt;code&gt;Item&lt;/code&gt; they haven't reviewed yet.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;memgraph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute_and_fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
        MATCH (n:User)
        WITH n
        LIMIT 15
        MATCH (m:Item)
        OPTIONAL MATCH (n)-[r]-&amp;gt;(m)
        WHERE r is null
        CALL tgn.predict_link_score(n,m) YIELD prediction
        WITH n,m, prediction
        ORDER BY prediction DESC
        LIMIT 10
        MERGE (n)-[:PREDICTED_REVIEW {likelihood:prediction}]-&amp;gt;(m);

    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now we can run the following query in Memgraph Lab:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cypher"&gt;&lt;code&gt;&lt;span class="k"&gt;MATCH&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="py"&gt;u:&lt;/span&gt;&lt;span class="n"&gt;User&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="ss"&gt;[&lt;/span&gt;&lt;span class="py"&gt;pr:&lt;/span&gt;&lt;span class="n"&gt;PREDICTED_REVIEW&lt;/span&gt;&lt;span class="ss"&gt;]&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="py"&gt;i:&lt;/span&gt;&lt;span class="n"&gt;Item&lt;/span&gt;&lt;span class="ss"&gt;),&lt;/span&gt; &lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="ss"&gt;[&lt;/span&gt;&lt;span class="py"&gt;r:&lt;/span&gt;&lt;span class="n"&gt;REVIEWED&lt;/span&gt;&lt;span class="ss"&gt;]&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="py"&gt;oi:&lt;/span&gt;&lt;span class="n"&gt;Item&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;RETURN&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="ss"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And after &lt;a href="https://github.com/memgraph/jupyter-memgraph-tutorials/tree/main/pytorch_amazon_network_analysis" rel="noopener noreferrer"&gt;applying a style&lt;/a&gt;, we get the following visualization. From the image below, we can see that most predictions are oriented towards one of the most popular items. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://public-assets.memgraph.com/amazon-user-item-recommender-with-tgn-and-memgraph/memgraph-amazon-product-link-prediction.png" rel="noopener noreferrer"&gt;&lt;br&gt;
&lt;img alt="dataset" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fpublic-assets.memgraph.com%2Famazon-user-item-recommender-with-tgn-and-memgraph%2Fmemgraph-amazon-product-link-prediction.png" width="800" height="391"&gt;&lt;br&gt;
&lt;/a&gt;&lt;/p&gt;
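&lt;p&gt;To quantify that observation, we could aggregate the freshly created &lt;code&gt;PREDICTED_REVIEW&lt;/code&gt; relationships and rank items by how many predictions point at them. A minimal sketch, assuming the relationships created above:&lt;/p&gt;

```cypher
MATCH (:User)-[p:PREDICTED_REVIEW]->(i:Item)
RETURN i, count(p) AS num_predictions
ORDER BY num_predictions DESC;
```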

&lt;h2&gt;
  
  
  Where to next?
&lt;/h2&gt;

&lt;p&gt;Well, I hope this was fun and that you have learned something. You can check out everything else that we implemented in the &lt;strong&gt;&lt;a href="https://memgraph.com/blog/mage-1-2-release" rel="noopener noreferrer"&gt;MAGE 1.2 release&lt;/a&gt;&lt;/strong&gt;. If you loved our implementation, don't hesitate to give us a star on &lt;strong&gt;&lt;a href="https://github.com/memgraph/mage" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;&lt;/strong&gt; ⭐. If you have any comments or suggestions, you can contact us on &lt;strong&gt;&lt;a href="https://discord.gg/memgraph" rel="noopener noreferrer"&gt;Discord&lt;/a&gt;&lt;/strong&gt;. And lastly, if you wish to continue reading posts about graph analytics, check out &lt;strong&gt;&lt;a href="https://memgraph.com/blog" rel="noopener noreferrer"&gt;our blog&lt;/a&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;p&gt;[1] E. Rossi, B. Chamberlain, F. Frasca, D. Eynard, F. Monti, M. Bronstein (2020). Temporal Graph Networks for Deep Learning on Dynamic Graphs&lt;/p&gt;

&lt;p&gt;[2] W.L. Hamilton, R. Ying, and J. Leskovec. Inductive representation learning on large&lt;br&gt;
graphs. In Advances in Neural Information Processing Systems 30, 2017.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://memgraph.com/blog?topics=Graph+Algorithms&amp;amp;utm_source=devto&amp;amp;utm_medium=referral&amp;amp;utm_campaign=blog_repost&amp;amp;utm_content=banner#list" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw0azgpsgm3wp9w5sd5wu.png" alt="Read more about graph algorithms on memgraph.com" width="800" height="169"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>softwaredevelopment</category>
      <category>development</category>
    </item>
    <item>
      <title>3 Powerful Queries to Find Patterns in Your Knowledge Graph You Haven’t Noticed Before</title>
      <dc:creator>Antonio Filipovic</dc:creator>
      <pubDate>Fri, 13 Jan 2023 14:07:42 +0000</pubDate>
      <link>https://dev.to/memgraph/3-powerful-queries-to-find-patterns-in-your-knowledge-graph-you-havent-noticed-before-4ba7</link>
      <guid>https://dev.to/memgraph/3-powerful-queries-to-find-patterns-in-your-knowledge-graph-you-havent-noticed-before-4ba7</guid>
      <description>&lt;p&gt;Today, not a lot of companies worry about a lack of data. Everything is logged and stored across different databases and technologies. The current issue is that companies can’t conclude anything from all that data, which is especially disastrous if the data indicates that it’s time for the company to change how it does business.&lt;/p&gt;

&lt;p&gt;Creating a large web of interconnected data as a graph is the crucial first step for companies to get a complete picture of their business and understand its direction. &lt;/p&gt;

&lt;p&gt;First, data is gathered in one place. Then a graph is created with a semantics layer placed on top, containing information about the data, thus &lt;a href="https://memgraph.com/blog/inferring-knowledge-from-unused-siloed-stores-using-graphs?utm_source=devto&amp;amp;utm_medium=referral&amp;amp;utm_campaign=blog_repost" rel="noopener noreferrer"&gt;creating a knowledge graph&lt;/a&gt;. Analyzing the knowledge graph uncovers new information in the data. &lt;/p&gt;

&lt;p&gt;To create a knowledge graph, you must be careful about which toolset you choose. If you need to combine several different solutions, it becomes impossible to gather the data completely, and thus impossible to analyze it. The ultimate result is slow decision-making. Graph databases are an excellent tool because they are &lt;a href="https://memgraph.com/blog/4-reasons-why-graph-tech-is-great-for-knowledge-graphs?utm_source=devto&amp;amp;utm_medium=referral&amp;amp;utm_campaign=blog_repost" rel="noopener noreferrer"&gt;designed to explore relationships and hop through data&lt;/a&gt;. With tools that cannot explore relationships this way, even the first few steps of the analysis require a lot of coding.&lt;/p&gt;

&lt;p&gt;As a graph database, Memgraph fits perfectly into the knowledge graph use case. It also offers &lt;a href="https://memgraph.com/mage/?utm_source=devto&amp;amp;utm_medium=referral&amp;amp;utm_campaign=blog_repost" rel="noopener noreferrer"&gt;free and open-source graph analytics algorithms&lt;/a&gt;. Since every business deals with its own unique problems, no single algorithm solves all of them equally well. That’s why the library offers an extensive range of analytics algorithms. The best part is that your team doesn’t have to worry about writing a single line of code. They can focus on creating knowledge graphs and making sense of the data.&lt;/p&gt;

&lt;p&gt;Graph analytics is just one of many big pluses. Memgraph is also an in-memory graph database, which means you won’t have to wait a whole day for graph analytics algorithms to spit out results, nor settle for obsolete ones. And being in-memory doesn’t mean your data can be lost, since backups and data persistence are handled through disk storage.&lt;/p&gt;

&lt;p&gt;Modeling data using property graphs, as you would with Memgraph, makes sense since you can see all the nodes and how they relate to each other by examining their relationships. After dealing with relational databases for years, it’s hard even to imagine what complex questions graph databases can answer. Relational databases cannot provide such answers, or the execution takes so much time you don’t even try it. &lt;/p&gt;

&lt;p&gt;In the rest of this blog post, you will see what algorithms you can use to ask specific questions and query the graph database, which will help you uncover knowledge hidden within your data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pattern-matching questions
&lt;/h2&gt;

&lt;p&gt;The easiest way to discover new knowledge in graphs is pattern matching. Pattern matching is a basic exploration of data in which the database searches for nodes with a specific label connected with a relationship of a specific type to another node. In other words, it searches for the shape of the data you defined and retrieves results.&lt;/p&gt;
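&lt;p&gt;Conceptually, pattern matching is a filter over the shape of the data. A hand-rolled Python sketch over a toy edge list (the labels, relationship types, and the &lt;code&gt;match&lt;/code&gt; helper here are purely illustrative, not Memgraph's API) makes the idea concrete:&lt;/p&gt;

```python
# Toy graph stored as (src_label, src, rel_type, dst_label, dst) tuples.
edges = [
    ("Account", "acc1", "TRANSFERRED_TO", "Account", "acc2"),
    ("Account", "acc2", "FLAGGED_AS", "FraudulentActivity", "f1"),
    ("User", "u1", "REVIEWED", "Item", "i1"),
]

def match(edges, src_label, rel_type, dst_label):
    """Return (src, dst) pairs whose shape is (src_label)-[rel_type]->(dst_label)."""
    return [
        (src, dst)
        for sl, src, rel, dl, dst in edges
        if sl == src_label and rel == rel_type and dl == dst_label
    ]

print(match(edges, "Account", "FLAGGED_AS", "FraudulentActivity"))  # [('acc2', 'f1')]
```

&lt;p&gt;A graph database performs exactly this kind of shape search, but index-backed and over billions of relationships instead of a Python list.&lt;/p&gt;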

&lt;h3&gt;
  
  
  Content and connectedness
&lt;/h3&gt;

&lt;p&gt;In finance, companies often question whether an account is connected to another account known for fraudulent activity. If one account is connected to fraudulent activity, all the intermediary accounts might also be connected to it, like a neverending string of handkerchiefs. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx04d4zgo7oh78vyb1meu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx04d4zgo7oh78vyb1meu.png" alt="3-powerful-queries-to-find-patterns-in-your-knowledge-graph-you-havent-noticed-before" width="800" height="895"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With Memgraph and basic pattern matching, you can uncover such connections with the following query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cypher"&gt;&lt;code&gt;&lt;span class="k"&gt;MATCH&lt;/span&gt; &lt;span class="n"&gt;p1&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="py"&gt;n:&lt;/span&gt;&lt;span class="n"&gt;Node1&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="ss"&gt;[&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="ss"&gt;]&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="py"&gt;m:&lt;/span&gt;&lt;span class="n"&gt;Node2&lt;/span&gt;&lt;span class="ss"&gt;),&lt;/span&gt; &lt;span class="n"&gt;p2&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="ss"&gt;[&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="ss"&gt;]&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="ss"&gt;),&lt;/span&gt; &lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="ss"&gt;[&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="ss"&gt;]&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="py"&gt;f:&lt;/span&gt;&lt;span class="n"&gt;FraudulantActivity&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;p1&lt;/span&gt;&lt;span class="err"&gt;!&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;p2&lt;/span&gt;
&lt;span class="k"&gt;RETURN&lt;/span&gt; &lt;span class="nf"&gt;nodes&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p1&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="nf"&gt;nodes&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p2&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The query above looks for two different paths, &lt;code&gt;p1&lt;/code&gt; and &lt;code&gt;p2&lt;/code&gt;, between the node named &lt;code&gt;n&lt;/code&gt; and the node named &lt;code&gt;m&lt;/code&gt;, and returns the nodes on those paths. As mentioned above, such nodes could be part of some fraudulent activity.&lt;/p&gt;

&lt;h3&gt;
  
  
  Commonality
&lt;/h3&gt;

&lt;p&gt;Graph analytics makes it easier to reason about cause and effect. Knowledge graphs in particular make it possible to use a priori knowledge about business processes to infer new knowledge.&lt;/p&gt;

&lt;p&gt;In finance, when the same person controls certain companies, it can be a hint of illegal activities. It’s quite hard to find a common denominator using a relational database since you can’t know in advance how deep into company ownership you need to dive, and the number of owners may vary.&lt;/p&gt;

&lt;p&gt;Graph databases can search for a common ancestor or successor of a certain entity, like in the picture below, regardless of the depth or number of companies the database needs to search through:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F96piukmdoj45tv7djv4n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F96piukmdoj45tv7djv4n.png" alt="3-powerful-queries-to-find-patterns-in-your-knowledge-graph-you-havent-noticed-before/memgraph-commonality" width="800" height="613"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A query written in Cypher would be as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cypher"&gt;&lt;code&gt;&lt;span class="k"&gt;MATCH&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="py"&gt;n1:&lt;/span&gt;&lt;span class="n"&gt;Node1&lt;/span&gt; &lt;span class="ss"&gt;{&lt;/span&gt;&lt;span class="py"&gt;prop:&lt;/span&gt;&lt;span class="s2"&gt;"a"&lt;/span&gt;&lt;span class="ss"&gt;})&lt;/span&gt;
&lt;span class="k"&gt;CALL&lt;/span&gt; &lt;span class="n"&gt;graph_util.descendants&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n1&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;  &lt;span class="k"&gt;YIELD&lt;/span&gt; &lt;span class="n"&gt;descendants&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;descendants_n1&lt;/span&gt;
&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;descendants_n1&lt;/span&gt;
&lt;span class="k"&gt;MATCH&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="py"&gt;n2:&lt;/span&gt; &lt;span class="n"&gt;Node2&lt;/span&gt; &lt;span class="ss"&gt;{&lt;/span&gt;&lt;span class="py"&gt;prop:&lt;/span&gt;&lt;span class="s2"&gt;"b"&lt;/span&gt;&lt;span class="ss"&gt;})&lt;/span&gt;
&lt;span class="k"&gt;CALL&lt;/span&gt; &lt;span class="n"&gt;graph_util.descendants&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n2&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt; &lt;span class="k"&gt;YIELD&lt;/span&gt; &lt;span class="n"&gt;descendants&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;descendants_n2&lt;/span&gt;
&lt;span class="k"&gt;UNWIND&lt;/span&gt; &lt;span class="n"&gt;descendants_n1&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;dn1&lt;/span&gt;
&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;dn1&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dn1&lt;/span&gt; &lt;span class="ow"&gt;IN&lt;/span&gt; &lt;span class="n"&gt;descendants_n2&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;true&lt;/span&gt;
&lt;span class="k"&gt;RETURN&lt;/span&gt; &lt;span class="n"&gt;dn1&lt;/span&gt;&lt;span class="ss"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The query above first finds all the descendants of node &lt;code&gt;n1&lt;/code&gt; and all the descendants of node &lt;code&gt;n2&lt;/code&gt;. If any descendant of node &lt;code&gt;n1&lt;/code&gt; is also a descendant of node &lt;code&gt;n2&lt;/code&gt;, we have found a common descendant, which is exactly what commonality looks for.&lt;/p&gt;
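The same commonality check can be sketched outside the database. Assuming the graph is held as a plain adjacency list (a hypothetical toy graph, not Memgraph's internal representation), finding common descendants reduces to intersecting two BFS-reachable sets:

```python
from collections import deque

def descendants(graph, start):
    """BFS over outgoing edges; returns every node reachable from start."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

# Hypothetical toy graph: edges point from each key to its listed nodes.
graph = {"a": ["x", "y"], "b": ["y", "z"], "y": ["w"]}
common = descendants(graph, "a") & descendants(graph, "b")
print(sorted(common))  # ['w', 'y']
```

The Cypher query delegates the reachability part to `graph_util.descendants`; the sketch just makes the set intersection explicit.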

&lt;h3&gt;
  
  
  Alternative action
&lt;/h3&gt;

&lt;p&gt;Mistakes happen all the time. Once they do, the most important question is how to soften the blow. In finance, once a certain branch is deactivated, we need to find another way to get money transferred.&lt;/p&gt;

&lt;p&gt;To do so, we need to find content equivalence, which identifies similar paths between two nodes. It helps protect the business from future failures at certain points in the finance chain by finding alternative paths between those nodes. This involves node hopping and pattern matching, two operations that graph databases, Memgraph in particular, are optimized for. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fepdg3j43ve839hl61cuv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fepdg3j43ve839hl61cuv.png" alt="3-powerful-queries-to-find-patterns-in-your-knowledge-graph-you-havent-noticed-before/memgraph-alternative-action" width="800" height="613"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In Memgraph, you would ask such a question with the following query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cypher"&gt;&lt;code&gt;&lt;span class="k"&gt;MATCH&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="py"&gt;n:&lt;/span&gt;&lt;span class="n"&gt;Node3&lt;/span&gt; &lt;span class="ss"&gt;{&lt;/span&gt;&lt;span class="py"&gt;prop:&lt;/span&gt;&lt;span class="s2"&gt;"c"&lt;/span&gt;&lt;span class="ss"&gt;})&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="ss"&gt;[&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;wShortest&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt; &lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="ss"&gt;]&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="py"&gt;t:&lt;/span&gt;&lt;span class="n"&gt;Target&lt;/span&gt;&lt;span class="ss"&gt;),&lt;/span&gt; &lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="py"&gt;n2:&lt;/span&gt;&lt;span class="n"&gt;Node2&lt;/span&gt; &lt;span class="ss"&gt;{&lt;/span&gt;&lt;span class="py"&gt;prop:&lt;/span&gt;&lt;span class="s2"&gt;"b"&lt;/span&gt;&lt;span class="ss"&gt;})&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;n2&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;nodes&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;RETURN&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;sum&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="ss"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The query above finds the shortest paths between the starting node &lt;code&gt;n&lt;/code&gt; and the end node &lt;code&gt;t&lt;/code&gt; while avoiding the deactivated node &lt;code&gt;n2&lt;/code&gt;. It returns paths starting with the shortest one, so you can see the most effective way to substitute your current solution in case something fails along the way.&lt;/p&gt;
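Outside Cypher, the same "alternative action" question, the shortest route that avoids a failed node, can be sketched with a plain BFS on an unweighted graph. This is a simplified stand-in for the query above (the graph, node names, and the failed node are all made up for illustration; Memgraph's `wShortest` with weight 1 per hop behaves equivalently on unweighted data):

```python
from collections import deque

def shortest_path_avoiding(graph, start, goal, banned):
    """BFS for the shortest hop-count path from start to goal
    that never visits the deactivated node `banned`."""
    queue = deque([[start]])
    seen = {start, banned}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in graph.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # no alternative route exists

# Hypothetical finance chain: "c" reaches "t" normally via "b", which failed.
graph = {"c": ["b", "d"], "b": ["t"], "d": ["e"], "e": ["t"]}
print(shortest_path_avoiding(graph, "c", "t", banned="b"))  # ['c', 'd', 'e', 't']
```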

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Pattern matching doesn’t sound like a flashy analysis tool, but combined with graph analytics algorithms, it becomes a powerful one that can analyze almost any real-world graph, regardless of how complex it is. So, if you are struggling with highly interconnected data scattered all over the place, don’t hesitate to use a graph database and employ graph analytics and pattern matching to discover new knowledge - the kind relational databases can’t even start thinking about, let alone provide. Use a knowledge graph to uncover fraudulent activities or find alternative actions that can help avoid risks.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://memgraph.com/blog?topics=Knowledge+Graphs&amp;amp;utm_source=devto&amp;amp;utm_medium=referral&amp;amp;utm_campaign=blog_repost&amp;amp;utm_content=banner#list" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw0azgpsgm3wp9w5sd5wu.png" alt="Read more knowledge graphs on memgraph.com" width="800" height="169"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>mentorship</category>
      <category>career</category>
      <category>developers</category>
    </item>
    <item>
      <title>Understanding How Dynamic node2vec Works on Streaming Data</title>
      <dc:creator>Antonio Filipovic</dc:creator>
      <pubDate>Fri, 23 Dec 2022 14:08:48 +0000</pubDate>
      <link>https://dev.to/memgraph/understanding-how-dynamic-node2vec-works-on-streaming-data-9l9</link>
      <guid>https://dev.to/memgraph/understanding-how-dynamic-node2vec-works-on-streaming-data-9l9</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;In this article, we will try to explain how node embeddings can be updated and &lt;strong&gt;calculated dynamically&lt;/strong&gt;, which basically means as new edges arrive in the graph. If you don't know anything about node embeddings yet, be sure to check out our blog post on the topic of &lt;a href="https://memgraph.com/blog/introduction-to-node-embedding?utm_source=devto&amp;amp;utm_medium=referral&amp;amp;utm_campaign=blog_repost" rel="noopener noreferrer"&gt;node embeddings&lt;/a&gt;. 📖&lt;/p&gt;

&lt;p&gt;There, we have explained what node embeddings are, where they can be applied, and why they perform so well. Even if you are familiar with everything mentioned, you can still &lt;a href="https://memgraph.com/blog/introduction-to-node-embedding?utm_source=devto&amp;amp;utm_medium=referral&amp;amp;utm_campaign=blog_repost" rel="noopener noreferrer"&gt;refresh your memory&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Dynamic networks
&lt;/h2&gt;

&lt;p&gt;Many methods, like &lt;a href="https://arxiv.org/abs/1607.00653" rel="noopener noreferrer"&gt;&lt;strong&gt;node2vec&lt;/strong&gt;&lt;/a&gt; and &lt;a href="https://arxiv.org/abs/1403.6652" rel="noopener noreferrer"&gt;&lt;strong&gt;deepwalk&lt;/strong&gt;&lt;/a&gt;, focus on computing embeddings for static graphs, which works well but has a big drawback. Networks in practical applications are &lt;strong&gt;dynamic&lt;/strong&gt; and &lt;strong&gt;evolve constantly&lt;/strong&gt; over time. New links are formed, and old ones can disappear. Moreover, new nodes can be introduced into the graph (e.g., users can join the social network) and create new links toward existing nodes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How could one deal with such networks?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;One idea could be to create a &lt;strong&gt;snapshot of a graph&lt;/strong&gt; when a new edge is created [&lt;a href="https://dl.acm.org/doi/abs/10.1145/1217299.1217301" rel="noopener noreferrer"&gt;Leskovec et al., 2007&lt;/a&gt;].  Naively applying static embedding algorithms to each snapshot leads to unsatisfactory performance due to the following challenges:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Stability&lt;/strong&gt;: the embeddings of graphs at consecutive time steps can differ substantially even though the graphs do not change much.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Growing graphs&lt;/strong&gt;: all existing approaches assume a fixed number of nodes in learning graph embeddings and thus cannot handle growing graphs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scalability&lt;/strong&gt;: learning embeddings independently for each snapshot leads to running time linear in the number of snapshots. As learning a single embedding is computationally expensive, the naive approach does not scale to dynamic networks with many snapshots.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Dynamic Node2vec
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Dynamic node2vec&lt;/strong&gt; is a random-walk based method that creates embeddings for every new node added to the graph. For every new edge, the probabilities (weights) used in walk sampling are recalculated. The goal of the method is to make the embedding of node &lt;code&gt;v&lt;/code&gt; similar to the embeddings of nodes that can reach node &lt;code&gt;v&lt;/code&gt; across edges that appeared one after another. Don’t worry if this sounds confusing now, just remember that we have probability updates and walk sampling.&lt;/p&gt;

&lt;p&gt;Take a look at &lt;em&gt;Image 1&lt;/em&gt;. We sampled a walk as mentioned before. By doing so, we created a list of nodes, also known as a &lt;strong&gt;temporal walk&lt;/strong&gt;. The temporal part will be explained shortly. And of course, the embedding of a node should be similar to the embeddings of nodes in its temporal neighborhood. The algorithm itself consists of the following three parts, in this order:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;probabilities (weights) update&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;walk sampling&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;word2vec update&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fpublic-assets.memgraph.com%2Fdynamic-node2vec-on-streaming-data%2Fmemgraph-tutorial-temporal-walk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fpublic-assets.memgraph.com%2Fdynamic-node2vec-on-streaming-data%2Fmemgraph-tutorial-temporal-walk.png" alt="memgraph-tutorial-temporal-walk"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;center&gt; Image 1. Representation of a 
&lt;a href="https://appliednetsci.springeropen.com/articles/10.1007/s41109-019-0169-5" rel="noopener noreferrer"&gt;temporal walk&lt;/a&gt;
&lt;/center&gt;



&lt;p&gt;A few notes about the terms used in the rest of the text. When a new directed edge is added to the graph, it has a &lt;strong&gt;&lt;code&gt;source&lt;/code&gt;&lt;/strong&gt; node and a &lt;strong&gt;&lt;code&gt;target&lt;/code&gt;&lt;/strong&gt; node. &lt;strong&gt;Walk sampling&lt;/strong&gt; means creating a walk through a graph. Every node we visit during walk sampling is memorized in order of visit. A walk can be constructed in a forward fashion, meaning we choose one of the outgoing edges of the current node, or in a backward fashion, meaning we choose one of the incoming edges of the current node. In backward walk sampling, by choosing one of the incoming edges of the current node, we move to the &lt;strong&gt;&lt;code&gt;source&lt;/code&gt;&lt;/strong&gt; node of that edge, and we repeat the step. The process of walk sampling in a backward fashion can be seen in &lt;em&gt;Image 2&lt;/em&gt;. We start from node 9, and the sampled walk looks as follows: &lt;strong&gt;9,8,5,2,1&lt;/strong&gt;.&lt;br&gt;
For example, when we were on the node with id 5, we could have chosen a different edge, which would take us to node 3 or node 4. &lt;strong&gt;Important:&lt;/strong&gt; we are not yet looking at the &lt;strong&gt;edge appearance timestamp&lt;/strong&gt;, that is, when the edge appeared in the graph.&lt;/p&gt;
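The backward walk described above can be sketched as repeatedly stepping to the source node of a randomly chosen incoming edge. Timestamps are deliberately ignored here, as in the text; the incoming-edge lists are a hypothetical reading of Image 2:

```python
import random

def backward_walk(in_edges, start, walk_length):
    """Sample a walk by repeatedly moving to the source node of a
    randomly chosen incoming edge of the current node."""
    walk = [start]
    for _ in range(walk_length - 1):  # at most walk_length nodes in total
        sources = in_edges.get(walk[-1], [])
        if not sources:               # no incoming edges; the walk ends early
            break
        walk.append(random.choice(sources))
    return walk

# Hypothetical incoming-edge lists: node 9 is reachable from 8,
# node 8 from 5, node 5 from 2, 3 and 4, node 2 from 1.
in_edges = {9: [8], 8: [5], 5: [2, 3, 4], 2: [1]}
random.seed(0)
print(backward_walk(in_edges, start=9, walk_length=5))
```

Depending on the coin flips at node 5, the walk either continues down to node 1 (as in the 9,8,5,2,1 example) or stops early at node 3 or 4, which have no incoming edges.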

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fpublic-assets.memgraph.com%2Fdynamic-node2vec-on-streaming-data%2Fmemgraph-tutorial-walk-sampling.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fpublic-assets.memgraph.com%2Fdynamic-node2vec-on-streaming-data%2Fmemgraph-tutorial-walk-sampling.png" alt="memgraph-tutorial-walk-sampling"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;center&gt; Image 2. Illustration of walk sampling on a directed graph&lt;/center&gt;



&lt;p&gt;We will first try to explain &lt;strong&gt;walk sampling&lt;/strong&gt;, and then &lt;strong&gt;weight update&lt;/strong&gt;, although in the real process they are reversed. We first do weight update, then walk sampling.&lt;/p&gt;

&lt;h3&gt;
  
  
  Walk sampling
&lt;/h3&gt;

&lt;p&gt;In &lt;strong&gt;dynamic node2vec&lt;/strong&gt;, walk sampling is done in a &lt;strong&gt;time-dependent backward form&lt;/strong&gt;. Backward sampling was explained before, so it's important you have a basic understanding of it for the next section.&lt;/p&gt;

&lt;p&gt;"Temporal" sampling means that from all possible incoming edges we only consider those that appeared before the last edge in the walk. Take a look at &lt;em&gt;Image 3&lt;/em&gt; and assume that we are on node &lt;strong&gt;5&lt;/strong&gt;. Since the last edge we visited was between nodes 5 and 8 (the edge is from 5 to 8, but we went from 8 to 5) appeared in the graph before edges 3⟶5 and 4⟶5, we can't even consider taking them as the next step. The only option is edge 2⟶5.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fpublic-assets.memgraph.com%2Fdynamic-node2vec-on-streaming-data%2Fmemgraph-tutorial-time-dependent-walk-sampling.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fpublic-assets.memgraph.com%2Fdynamic-node2vec-on-streaming-data%2Fmemgraph-tutorial-time-dependent-walk-sampling.png" alt="memgraph-tutorial-time-dependent-walk-sampling"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;center&gt; Image 3. Time-dependent walk sampling &lt;/center&gt;



&lt;p&gt;Since one node can have multiple incoming edges, when sampling a walk we use &lt;strong&gt;probabilities&lt;/strong&gt; (weights) to determine which incoming edge of the node we will take next. It is like a biased coin flip. Take a look at &lt;em&gt;Image 1&lt;/em&gt;. From node &lt;code&gt;u&lt;/code&gt; you can take edge &lt;code&gt;t4&lt;/code&gt; or &lt;code&gt;t5&lt;/code&gt;. When creating a walk, we want to visit nodes that were more recently connected by a new edge. They carry &lt;strong&gt;more information&lt;/strong&gt; and therefore have more importance to the graph than the old edges.&lt;/p&gt;

&lt;p&gt;These probabilities are computed &lt;strong&gt;after&lt;/strong&gt; the edge arrives and &lt;strong&gt;before&lt;/strong&gt; temporal walk sampling. Walk sampling can be stopped at any point, but at the latest when we sample &lt;strong&gt;walk_length&lt;/strong&gt; nodes in the walk. &lt;strong&gt;walk_length&lt;/strong&gt; is a &lt;a href="https://en.wikipedia.org/wiki/Hyperparameter_(machine_learning)" rel="noopener noreferrer"&gt;hyperparameter&lt;/a&gt;: it is set before the algorithm starts and determines the maximum size of &lt;strong&gt;every&lt;/strong&gt; walk in the algorithm. Whether walk sampling stops is determined by a biased coin flip that depends on &lt;strong&gt;weight_of_current_node&lt;/strong&gt;. For more details, feel free to check the &lt;a href="https://appliednetsci.springeropen.com/track/pdf/10.1007/s41109-019-0169-5.pdf" rel="noopener noreferrer"&gt;paper by Ferenc Béres&lt;/a&gt;.&lt;/p&gt;
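The biased coin flip over candidate incoming edges can be sketched with Python's `random.choices`. The candidate nodes and weights below are made up, and the stochastic stopping rule from the paper is omitted; this only shows the weighted edge choice:

```python
import random

def sample_step(candidates, weights):
    """Biased coin flip over candidate edges: an edge is picked with
    probability proportional to its weight."""
    return random.choices(candidates, weights=weights, k=1)[0]

random.seed(42)
candidates = [2, 3, 4]       # possible previous nodes in the walk
weights = [0.7, 0.2, 0.1]    # fresher edges carry larger weight
counts = {c: 0 for c in candidates}
for _ in range(10_000):
    counts[sample_step(candidates, weights)] += 1
print(counts)  # roughly 7000 / 2000 / 1000
```

Over many draws, the node behind the freshest (heaviest) edge is chosen most often, which is exactly the bias the algorithm wants.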

&lt;h3&gt;
  
  
  Probabilities (weights) update
&lt;/h3&gt;

&lt;p&gt;The weight of a node and the probability used in walk sampling are two sides of the same coin. The weight represents a sum over all temporal walks &lt;code&gt;z&lt;/code&gt; ending in node &lt;code&gt;v&lt;/code&gt; that use edges arriving no later than the latest edge already sampled in the temporal walk. When the algorithm decides which edge to take next during temporal walk creation, it uses these computed weights (probabilities). Every time a new edge appears in the graph, these probabilities are updated for just &lt;strong&gt;two nodes&lt;/strong&gt;: the &lt;code&gt;source&lt;/code&gt; and &lt;code&gt;target&lt;/code&gt; of the &lt;strong&gt;new edge&lt;/strong&gt;. In &lt;em&gt;Image 4&lt;/em&gt; you can see how the probability is updated. For node &lt;code&gt;u&lt;/code&gt;, we need to check whether some walks were already sampled for it. If there were, we update the weight of the node (the sum of walks ending in that node) by multiplying it with the time-decayed factor &lt;em&gt;exp(−c · (t(uv) − tᵤ))&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Here &lt;em&gt;tᵤ&lt;/em&gt; represents the last time an edge pointing to node &lt;em&gt;u&lt;/em&gt; appeared in the graph, and &lt;em&gt;t(uv)&lt;/em&gt; represents the time of the current edge &lt;em&gt;u⟶v&lt;/em&gt;. Afterwards, for node &lt;em&gt;v&lt;/em&gt;, make sure to sum up the walks from node &lt;em&gt;u&lt;/em&gt; and the walks ending in &lt;em&gt;v&lt;/em&gt; multiplied with the time-decayed factor &lt;em&gt;exp(−c · (t(uv) − tᵥ))&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;This &lt;strong&gt;time-decayed factor&lt;/strong&gt; is here to give more weight to nodes with &lt;strong&gt;fresher edges&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fpublic-assets.memgraph.com%2Fdynamic-node2vec-on-streaming-data%2Fmemgraph-tutorial-walk-sampling-math.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fpublic-assets.memgraph.com%2Fdynamic-node2vec-on-streaming-data%2Fmemgraph-tutorial-walk-sampling-math.png" alt="memgraph-tutorial-walk-sampling-math"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;center&gt; Image 4. Walk sampling by author from &lt;a href="https://appliednetsci.springeropen.com/articles/10.1007/s41109-019-0169-5" rel="noopener noreferrer"&gt;paper&lt;/a&gt;
&lt;/center&gt;
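The bookkeeping from Image 4 can be sketched numerically. This is a loose sketch, not the paper's exact recursion: `c` is the decay hyperparameter, the base weight of 1 for a node with no sampled walks is an assumption, and the decay factor exp(−c · Δt) follows the description above:

```python
import math

def update_weights(weight, last_time, u, v, t_uv, c=0.1):
    """Update walk-count weights when a new edge u-to-v arrives.
    weight[x] approximates the (decayed) number of temporal walks
    ending in node x; last_time[x] is the newest edge time seen at x."""
    # Decay u's existing walk count by the time since its last update.
    decay_u = math.exp(-c * (t_uv - last_time.get(u, t_uv)))
    weight[u] = weight.get(u, 1.0) * decay_u
    # Walks ending in v: its own decayed walks plus the walks
    # that can now be extended over the new edge from u.
    decay_v = math.exp(-c * (t_uv - last_time.get(v, t_uv)))
    weight[v] = weight.get(v, 1.0) * decay_v + weight[u]
    last_time[u] = last_time[v] = t_uv
    return weight

weight, last_time = {}, {}
update_weights(weight, last_time, u=1, v=2, t_uv=10)
update_weights(weight, last_time, u=2, v=3, t_uv=12)
print(weight)
```

After the second edge, node 2's weight has decayed by exp(−0.1 · 2) and node 3 has inherited it, so fresher activity keeps dominating the weights.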

&lt;h3&gt;
  
  
  Word2Vec update
&lt;/h3&gt;

&lt;p&gt;This is the part where we optimize our node embeddings to be as similar as possible. We also make use of the &lt;a href="https://en.wikipedia.org/wiki/Word2vec" rel="noopener noreferrer"&gt;&lt;strong&gt;word2vec&lt;/strong&gt;&lt;/a&gt; method mentioned earlier.&lt;/p&gt;

&lt;p&gt;After walk sampling, we use these prepared temporal walks to make node embeddings more similar to those of the nodes in their temporal neighborhood. What does this mean? Let's say that our maximum walk length &lt;code&gt;walk_length&lt;/code&gt; is set to 4, and the number of walks &lt;code&gt;walk_num&lt;/code&gt; is set to 3. These hyperparameters can be found in our implementation of &lt;a href="https://github.com/memgraph/mage/blob/main/python/node2vec_online.py" rel="noopener noreferrer"&gt;Dynamic Node2Vec on GitHub&lt;/a&gt;. Let's imagine we sampled the following temporal walks for node &lt;code&gt;9&lt;/code&gt; in the graph on &lt;em&gt;Image 3&lt;/em&gt;: &lt;code&gt;[1,2,6,9], [1,2,5,9], [5,7,9]&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Don't forget that this &lt;code&gt;walk_length&lt;/code&gt; is the maximum length - walks can also be shorter. Now we need to make our embedding of node &lt;code&gt;9&lt;/code&gt; as similar as possible to the embeddings of nodes that appeared in those temporal walks. In math terms, this means the following: we seek to optimize the &lt;strong&gt;objective function&lt;/strong&gt; which maximizes the logarithmic probability of &lt;strong&gt;observing a network neighborhood&lt;/strong&gt; &lt;em&gt;Nₛ(u)&lt;/em&gt; for node &lt;em&gt;u&lt;/em&gt; based on its feature representation (representation in embedded space). We also make some assumptions so the problem is easier to reason about. This is the formula:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fpublic-assets.memgraph.com%2Fdynamic-node2vec-on-streaming-data%2Fmemgraph-tutorial-formula-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fpublic-assets.memgraph.com%2Fdynamic-node2vec-on-streaming-data%2Fmemgraph-tutorial-formula-1.png" alt="memgraph-tutorial-formula-1"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here &lt;em&gt;Pr(Nₛ(u) | f(u))&lt;/em&gt; is the probability of observing the neighborhood nodes of node &lt;em&gt;u&lt;/em&gt; given that we are currently at the position of node &lt;em&gt;u&lt;/em&gt; in embedded space. For example, if the embedding of node &lt;em&gt;u&lt;/em&gt; is the vector &lt;em&gt;[0.5, 0.6]&lt;/em&gt;, and you imagine standing at that point, what is the &lt;strong&gt;likelihood&lt;/strong&gt; of observing the neighborhood nodes of node &lt;em&gt;u&lt;/em&gt;? For each node (that's why there is a summation), we want to make this probability as high as possible.&lt;/p&gt;

&lt;p&gt;It sounds complicated, but with the following assumptions, it will get easier to comprehend:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;The &lt;strong&gt;first&lt;/strong&gt; one is that observing a neighborhood node is &lt;strong&gt;independent&lt;/strong&gt; of observing any other neighborhood node given its feature representation. For example, in &lt;em&gt;Image 3&lt;/em&gt;, there is a probability for us to observe nodes 6, 7, and 8 if we are on node 9. This is just a relaxation for optimization purposes used often in &lt;a href="https://towardsdatascience.com/maximum-likelihood-estimation-984af2dcfcac" rel="noopener noreferrer"&gt;machine learning&lt;/a&gt; which makes it easier for us to compute the above mentioned probability &lt;em&gt;Pr(Nₛ(u) | f(u))&lt;/em&gt;. Our probabilities (weights) from the chapter of &lt;strong&gt;Probabilities (weights) update&lt;/strong&gt; are already incorporated in walk sampling. This relaxation is here just for probability calculation.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The &lt;strong&gt;second&lt;/strong&gt; one is that observing the &lt;strong&gt;source node&lt;/strong&gt; and a &lt;strong&gt;neighborhood node&lt;/strong&gt; in the &lt;strong&gt;graph&lt;/strong&gt; is the same as observing the &lt;strong&gt;feature representation (embedding) of the source node&lt;/strong&gt; and the &lt;strong&gt;feature representation of the neighborhood node&lt;/strong&gt; in &lt;strong&gt;feature space&lt;/strong&gt;. Note that in the probability calculation &lt;em&gt;Pr(Nₛ(u)|f(u))&lt;/em&gt; we mixed the feature representation of a node &lt;em&gt;f(u)&lt;/em&gt; with regular nodes, which sounds like comparing apples and oranges, but with the following relaxation it becomes the same thing. So for some neighborhood node &lt;em&gt;nᵢ&lt;/em&gt; in the neighborhood &lt;em&gt;Nₛ&lt;/em&gt; of node &lt;em&gt;u&lt;/em&gt; we have the following:&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fpublic-assets.memgraph.com%2Fdynamic-node2vec-on-streaming-data%2Fmemgraph-tutorial-formula-2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fpublic-assets.memgraph.com%2Fdynamic-node2vec-on-streaming-data%2Fmemgraph-tutorial-formula-2.png" alt="memgraph-tutorial-formula-2"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The exponential term ensures that the sum of probabilities of every neighborhood node of node &lt;em&gt;u&lt;/em&gt; is 1. This is also called a &lt;a href="https://towardsdatascience.com/softmax-function-simplified-714068bf8156" rel="noopener noreferrer"&gt;softmax function&lt;/a&gt;.&lt;/p&gt;
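Under the second assumption, the conditional probability is a softmax over dot products of embeddings. A minimal sketch with made-up 2-dimensional embeddings (the node names and vectors are purely illustrative):

```python
import math

def softmax_probability(f_u, f_ni, all_embeddings):
    """Pr(n_i | f(u)) = exp(f(n_i) . f(u)) / sum over v of exp(f(v) . f(u))."""
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    numerator = math.exp(dot(f_ni, f_u))
    denominator = sum(math.exp(dot(f_v, f_u)) for f_v in all_embeddings)
    return numerator / denominator

# Hypothetical embeddings for nodes u, a, b.
embeddings = {"u": [0.5, 0.6], "a": [0.4, 0.7], "b": [-0.6, 0.1]}
p_a = softmax_probability(embeddings["u"], embeddings["a"], embeddings.values())
p_b = softmax_probability(embeddings["u"], embeddings["b"], embeddings.values())
print(round(p_a, 3), round(p_b, 3))  # node a, closer in direction to u, gets the higher probability
```

The denominator is what makes the probabilities over all candidate nodes sum to 1, which is the normalization the paragraph above refers to.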

&lt;p&gt;This is our optimization problem. Now, we hope you have an idea of what our goal is. &lt;br&gt;
Luckily for us, this is already implemented in a Python module called &lt;a href="https://radimrehurek.com/gensim/" rel="noopener noreferrer"&gt;&lt;strong&gt;gensim&lt;/strong&gt;&lt;/a&gt;. Yes, these folks are brilliant at natural language processing, and we will make use of their work. 🤝&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Dynamic node2vec&lt;/strong&gt; offers a good solution for growing networks. We learned where embeddings can be applied, we mentioned the drawbacks of static graph embedding algorithms and learned about the benefits of &lt;strong&gt;dynamic node2vec&lt;/strong&gt;. All that's left is to try it out yourself.&lt;/p&gt;

&lt;p&gt;Our team of engineers is currently tackling the problem of graph analytics algorithms on &lt;strong&gt;real-time data&lt;/strong&gt;. If you want to discuss how to apply &lt;strong&gt;online/streaming algorithms&lt;/strong&gt; on connected data, feel free to join our &lt;strong&gt;&lt;a href="https://memgr.ph/join-discord" rel="noopener noreferrer"&gt;Discord server&lt;/a&gt;&lt;/strong&gt; and message us.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MAGE&lt;/strong&gt; shares his wisdom on a &lt;a href="https://twitter.com/intent/follow?screen_name=memgraphmage" rel="noopener noreferrer"&gt;&lt;strong&gt;Twitter&lt;/strong&gt; account&lt;/a&gt;. Get to know him better by clicking the follow button! 🐦&lt;/p&gt;

&lt;p&gt;&lt;a href="https://twitter.com/intent/follow?screen_name=memgraphmage" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fpublic-assets.memgraph.com%2Fdynamic-node2vec-on-streaming-data%2Fmemgraph-mage-twitter.jpg" alt="memgraph-labelrankt-tutorial-mage-twitter"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Last but not least, check out &lt;a href="https://github.com/memgraph/mage" rel="noopener noreferrer"&gt;&lt;strong&gt;MAGE&lt;/strong&gt;&lt;/a&gt; and don’t hesitate to give a star ⭐ or contribute with new ideas.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://memgraph.com/blog?topics=Graph+Algorithms&amp;amp;utm_source=devto&amp;amp;utm_medium=referral&amp;amp;utm_campaign=blog_repost&amp;amp;utm_content=banner#list" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw0azgpsgm3wp9w5sd5wu.png" alt="Read more about graph algorithms on memgraph.com"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Introduction to Node Embedding</title>
      <dc:creator>Antonio Filipovic</dc:creator>
      <pubDate>Wed, 14 Dec 2022 12:24:14 +0000</pubDate>
      <link>https://dev.to/memgraph/introduction-to-node-embedding-2ccg</link>
      <guid>https://dev.to/memgraph/introduction-to-node-embedding-2ccg</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;In this article, we will try to answer the following questions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What are &lt;strong&gt;node embeddings&lt;/strong&gt;?&lt;/li&gt;
&lt;li&gt;How to generate node embeddings?&lt;/li&gt;
&lt;li&gt;When can we even use node embeddings? &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is a lot to cover in one article, but let's give it our best. Prior knowledge of graphs or &lt;strong&gt;&lt;a href="https://searchenterpriseai.techtarget.com/definition/machine-learning-ML"&gt;machine learning&lt;/a&gt;&lt;/strong&gt; is not necessary, just a bonus.&lt;br&gt;
First, let's start with graphs. 🚀&lt;/p&gt;

&lt;h2&gt;
  
  
  Graphs
&lt;/h2&gt;

&lt;p&gt;Graphs consist of &lt;strong&gt;nodes&lt;/strong&gt; and &lt;strong&gt;edges&lt;/strong&gt; - connections between the nodes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--X0tcIRs6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://public-assets.memgraph.com/node-embeddings/memgraph-tutorial-graph-sketch.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--X0tcIRs6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://public-assets.memgraph.com/node-embeddings/memgraph-tutorial-graph-sketch.png" alt="memgraph-tutorial-graph-sketch" width="773" height="408"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;
&lt;center&gt; Node and edge on a graph &lt;/center&gt;

&lt;p&gt;In social networks, nodes could represent users, and links between them could represent friendships. &lt;/p&gt;

&lt;p&gt;One interesting thing you can do with graphs is to predict which tweets on Twitter are from bots, and which are from organic users. How would we achieve that? Well, stick around and you will get an idea of how it can be done.&lt;/p&gt;

&lt;h2&gt;
  
  
  What are node embeddings?
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;So what does embedding mean, and why is it useful?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;To &lt;strong&gt;embed&lt;/strong&gt;, per the English dictionary, means to fix something in a substance or solid object. With graphs, it means mapping the whole graph into &lt;strong&gt;N-dimensional space&lt;/strong&gt;. Take a look at the example below, where we mapped all the nodes into 2-dimensional space. Now it should be obvious that we have two clusters (or communities) in the graph. For us humans, it is easier to identify clusters in 2-dimensional space. In this example, it is also easy to spot the clusters just from the graph layout, but imagine the graph having 1000 nodes - things aren't as straightforward anymore. &lt;/p&gt;

&lt;p&gt;Furthermore, it is easier for a computer to work with node embeddings (&lt;strong&gt;vectors of numbers&lt;/strong&gt;), because calculating how similar (close in space) two nodes are is easier from embeddings in N-dimensional space than from the graph alone. There is no proper way to calculate the closeness of two nodes just from the graph. You could use something like the &lt;a href="https://memgraph.com/docs/memgraph/concepts/graph-algorithms/?utm_source=devto&amp;amp;utm_medium=referral&amp;amp;utm_campaign=blog_repost#breadth-first-search"&gt;&lt;strong&gt;shortest path algorithm&lt;/strong&gt;&lt;/a&gt;, but that by itself is not representative enough. With vectors, it's easier. The most often used metric is called &lt;a href="https://www.geeksforgeeks.org/cosine-similarity/"&gt;&lt;strong&gt;cosine similarity&lt;/strong&gt;&lt;/a&gt;.&lt;/p&gt;
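Cosine similarity between two embedding vectors can be computed in a few lines; the vectors below are toy values, purely for illustration:

```python
import math

def cosine_similarity(a, b):
    """cos(theta) = (a . b) / (|a| * |b|); 1 means the same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

# Two nodes from the same cluster vs. one from the other cluster.
print(cosine_similarity([0.9, 0.8], [0.8, 0.9]))   # close to 1
print(cosine_similarity([0.9, 0.8], [-0.7, 0.1]))  # much lower
```

Nodes whose embeddings point in similar directions score near 1, which is what "close in space" means in practice.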

&lt;p&gt;Now we have something a computer can work with:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--xoK9M_wb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://public-assets.memgraph.com/node-embeddings/memgraph-tutorial-example-embedding.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--xoK9M_wb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://public-assets.memgraph.com/node-embeddings/memgraph-tutorial-example-embedding.png" alt="memgraph-tutorial-example-embedding" width="880" height="355"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;
&lt;center&gt;Example of embedding&lt;/center&gt;

&lt;blockquote&gt;
&lt;p&gt;Now we know what embeddings are, but what do we use node embeddings for?&lt;/p&gt;


&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://en.wikipedia.org/wiki/Supervised_learning"&gt;&lt;strong&gt;Supervised machine learning&lt;/strong&gt;&lt;/a&gt; is a subset of machine learning where algorithms try to learn from data. Data is represented by &lt;strong&gt;input-output pairs&lt;/strong&gt;, i.e. [2] -&amp;gt; 2, [1] -&amp;gt; 1. Our model tries to learn from data in such a way that it maps inputs to the correct outputs. In our example ([2] -&amp;gt; 2, [1] -&amp;gt; 1) model would try to learn function y=x. Here, it would be pretty easy for the model to learn input-output mapping, but imagine a problem where a lot of different points from input space &lt;strong&gt;map to same output value&lt;/strong&gt;. That's why we can't directly apply a machine learning algorithm to our input-output pairs, but we first need to find a set of &lt;em&gt;informative&lt;/em&gt;, &lt;em&gt;discriminating&lt;/em&gt;, and &lt;em&gt;independent&lt;/em&gt; features amongst input data points. Finding such features is an often difficult task.&lt;/p&gt;

&lt;p&gt;In &lt;strong&gt;prediction&lt;/strong&gt; problems on &lt;strong&gt;networks&lt;/strong&gt;, we would need to do the same for the nodes and edges. A typical solution involves hand-engineering domain-specific features based on expert knowledge. Even if one discounts the tedious effort required for &lt;strong&gt;feature engineering&lt;/strong&gt;, such features are usually designed for specific tasks and do not generalize across different prediction tasks.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;This sounds like bad news. 👎&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;We want our algorithm to be independent of the downstream prediction task, with representations that can be learned in a purely &lt;a href="https://en.wikipedia.org/wiki/Unsupervised_learning"&gt;&lt;strong&gt;unsupervised&lt;/strong&gt;&lt;/a&gt; way. This is where &lt;strong&gt;node embeddings&lt;/strong&gt; come into play.&lt;/p&gt;

&lt;p&gt;We will make our algorithm learn &lt;strong&gt;embeddings&lt;/strong&gt;, and after that, we can apply those embeddings in any of the following applications, one of which is Twitter bot detection. Let's dig in.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to generate node embeddings?
&lt;/h2&gt;

&lt;p&gt;Researchers have divided these methods into three broad categories: &lt;br&gt;
1) &lt;strong&gt;Factorization based&lt;/strong&gt;&lt;br&gt;
2) &lt;strong&gt;Random Walk based&lt;/strong&gt;&lt;br&gt;
3) &lt;strong&gt;Deep Learning based&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Factorization based
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Factorization based&lt;/strong&gt; algorithms represent the connections between nodes in the form of a matrix and factorize this matrix to obtain the embedding. In one such method called &lt;a href="https://towardsdatascience.com/lle-locally-linear-embedding-a-nifty-way-to-reduce-dimensionality-in-python-ab5c38336107"&gt;Locally Linear Embedding&lt;/a&gt;, there is the assumption that &lt;strong&gt;every node is a linear combination of its neighbors&lt;/strong&gt;, so the algorithm tries to represent the embedding of every node as a linear combination of its &lt;strong&gt;neighbors' embeddings&lt;/strong&gt;. It is like the example from high school where you need to represent one vector as a &lt;a href="https://www.mathbootcamps.com/linear-combinations-vectors/"&gt;linear combination of two other vectors&lt;/a&gt;. Only here, you can have multiple vectors, and they are much more complex. &lt;/p&gt;
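&lt;p&gt;The "linear combination" part can be sketched with plain 2-dimensional vectors: given the embeddings of two neighbors, solve for the weights that reproduce the node's own embedding. This is only a toy illustration with invented numbers, not the actual LLE solver (which additionally constrains the weights):&lt;/p&gt;

```python
def solve_2x2(v1, v2, target):
    """Solve w1*v1 + w2*v2 = target for (w1, w2) using Cramer's rule."""
    det = v1[0] * v2[1] - v2[0] * v1[1]
    w1 = (target[0] * v2[1] - v2[0] * target[1]) / det
    w2 = (v1[0] * target[1] - target[0] * v1[1]) / det
    return w1, w2

# A node's embedding expressed through its two neighbors' embeddings
neighbor_a = [1.0, 0.0]
neighbor_b = [0.0, 1.0]
node = [0.4, 0.6]
w1, w2 = solve_2x2(neighbor_a, neighbor_b, node)
print(w1, w2)  # 0.4 0.6: node = 0.4 * neighbor_a + 0.6 * neighbor_b
```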

&lt;h3&gt;
  
  
  2. Random walks
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Random walk&lt;/strong&gt; based methods use a walk approach to generate (sample) network neighborhoods for nodes. For every node, we generate its network neighborhood by choosing the next node of the walk in some way (depending on the method, the choice can be uniformly random or biased by probabilities). Take a look at the picture below to see what such a walk looks like. &lt;/p&gt;

&lt;p&gt;The maximum walk length is determined before this process of &lt;strong&gt;walk sampling&lt;/strong&gt;, and for every node, we generate &lt;strong&gt;N&lt;/strong&gt; random walks. By doing so, we have created a &lt;strong&gt;network neighborhood&lt;/strong&gt; of a node. And now our goal would be to make a node &lt;strong&gt;as similar as possible to nodes in its network neighborhood&lt;/strong&gt;.&lt;/p&gt;
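&lt;p&gt;A minimal sketch of that sampling step (the toy graph, walk length, and number of walks per node are made up; methods like node2vec additionally bias the choice of the next node instead of picking uniformly):&lt;/p&gt;

```python
import random

def sample_walks(adj, walk_length, walks_per_node, seed=0):
    """Sample `walks_per_node` uniform random walks of at most `walk_length` nodes per start node."""
    rng = random.Random(seed)
    walks = []
    for start in adj:
        for _ in range(walks_per_node):
            walk = [start]
            for _ in range(walk_length - 1):
                neighbors = adj[walk[-1]]
                if not neighbors:  # dead end: stop this walk early
                    break
                walk.append(rng.choice(neighbors))
            walks.append(walk)
    return walks

# Toy graph: two triangles joined by the bridge edge 3-4
adj = {1: [2, 3], 2: [1, 3], 3: [1, 2, 4], 4: [3, 5, 6], 5: [4, 6], 6: [4, 5]}
walks = sample_walks(adj, walk_length=4, walks_per_node=2)
print(len(walks))  # 12: two walks for each of the six nodes
```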

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--SrUkicAP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://public-assets.memgraph.com/node-embeddings/memgraph-tutorial-random-walk.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--SrUkicAP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://public-assets.memgraph.com/node-embeddings/memgraph-tutorial-random-walk.jpg" alt="memgraph-tutorial-random-walk" width="750" height="750"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;
&lt;center&gt;Example of a graph walk with three steps&lt;/center&gt;

&lt;blockquote&gt;
&lt;p&gt;Again with this boring question, but why do this? &lt;/p&gt;


&lt;/blockquote&gt;

&lt;p&gt;It turns out this process has proven very effective in another area, &lt;a href="https://en.wikipedia.org/wiki/Natural_language_processing"&gt;&lt;strong&gt;natural language processing&lt;/strong&gt;&lt;/a&gt;, which deals with words and documents where you want to find similar words. For example, the words &lt;em&gt;"intelligent"&lt;/em&gt; and "&lt;em&gt;smart&lt;/em&gt;" should be similar. This method in natural language processing is called &lt;a href="https://towardsdatascience.com/introduction-to-word-embedding-and-word2vec-652d0c2060fa"&gt;&lt;strong&gt;word2vec&lt;/strong&gt;&lt;/a&gt;. Words that appear in a similar context (the words before or after them) should be similar. Thankfully, the same applies to nodes: nodes that appear in a similar context (sampled walks) should be similar. &lt;strong&gt;Our process of walk sampling is used to create a dataset on which we will try to make node embeddings as similar as possible.&lt;/strong&gt; 🤯 And that is it. The dataset in &lt;strong&gt;word2vec&lt;/strong&gt; methods is every sentence of a document, and analogously for us, it is every sampled graph random walk.&lt;/p&gt;
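&lt;p&gt;The analogy can be made concrete: each sampled walk is treated like a sentence, and pairs of nodes that co-occur within a context window become the training examples, exactly as word2vec does with words. A small sketch of extracting such (center, context) pairs (the walk and the window size are invented):&lt;/p&gt;

```python
def skipgram_pairs(walk, window):
    """Yield (center, context) training pairs from one walk, word2vec style."""
    pairs = []
    for i, center in enumerate(walk):
        # Every node at distance at most `window` from position i is context
        for j in range(max(0, i - window), min(len(walk), i + window + 1)):
            if j != i:
                pairs.append((center, walk[j]))
    return pairs

walk = [1, 2, 3, 4]  # one sampled random walk
print(skipgram_pairs(walk, window=1))
# [(1, 2), (2, 1), (2, 3), (3, 2), (3, 4), (4, 3)]
```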

&lt;h3&gt;
  
  
  3. Deep learning
&lt;/h3&gt;

&lt;p&gt;The growing research on &lt;strong&gt;deep learning&lt;/strong&gt; has led to the usage of deep neural network-based methods applied to graphs. With deep learning, it is easier to model non-linear structures, so &lt;strong&gt;deep autoencoders&lt;/strong&gt; have been used for dimensionality reduction. A few popular methods from this area are called &lt;a href="https://paperswithcode.com/paper/structural-deep-network-embedding"&gt;&lt;strong&gt;Structural Deep Network Embedding (SDNE)&lt;/strong&gt;&lt;/a&gt; and &lt;a href="https://paperswithcode.com/paper/deep-neural-networks-for-learning-graph"&gt;&lt;strong&gt;Deep Neural Networks for Learning Graph Representations (DNGR)&lt;/strong&gt;&lt;/a&gt; so feel free to check them out.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where can node embeddings be applied?
&lt;/h2&gt;

&lt;p&gt;We know that &lt;strong&gt;graphs&lt;/strong&gt; occur naturally in various real-world scenarios such as social networks (social sciences), word co-occurrence networks (linguistics), interaction networks (i.e. &lt;a href="https://memgraph.com/blog/identifying-essential-proteins?utm_source=devto&amp;amp;utm_medium=referral&amp;amp;utm_campaign=blog_repost"&gt;Protein-Protein interactions in biology&lt;/a&gt;), and so on. Modeling the interactions between entities as graphs has enabled researchers to understand the various networks in a systematic manner. For example, social networks have been used for applications like friendship or content recommendations, as well as for advertisement. &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;But how are researchers modeling such interactions?&lt;br&gt;
You may already have the answer.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;By embedding a large graph in low-dimensional space (a.k.a. &lt;strong&gt;node embeddings&lt;/strong&gt;). Embeddings have recently attracted significant interest due to their wide applications in areas such as graph visualization, link prediction, clustering, and node classification. It has been demonstrated that &lt;strong&gt;graph embedding is superior to alternatives&lt;/strong&gt; in many supervised learning tasks, such as node classification, link prediction, and graph reconstruction. Here is a chronological list of research papers you can check out: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://dl.acm.org/doi/abs/10.1145/2488388.2488393"&gt;Distributed large-scale natural graph factorization, Ahmed et al., 2013&lt;/a&gt;; &lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dl.acm.org/doi/abs/10.1145/2623330.2623732"&gt;DeepWalk: online learning of social representations, Perozzi et al., 2014&lt;/a&gt;; &lt;/li&gt;
&lt;li&gt;
&lt;a href="https://ojs.aaai.org/index.php/AAAI/article/view/10179"&gt;Deep Neural Networks for Learning Graph Representations, Cao, et al., 2015&lt;/a&gt;; &lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dl.acm.org/doi/abs/10.1145/2736277.2741093"&gt;LINE: Large-scale Information Network Embedding, Tang, et al., 2015&lt;/a&gt;;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dl.acm.org/doi/abs/10.1145/2939672.2939754"&gt;node2vec: Scalable Feature Learning for Networks, Grover and Leskovec, 2016&lt;/a&gt;;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dl.acm.org/doi/abs/10.1145/2939672.2939751"&gt;Asymmetric Transitivity Preserving Graph Embedding, Ou et al., 2016&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Node classification
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Node classification&lt;/strong&gt; aims to determine the label of nodes (a.k.a. vertices) based on other labeled nodes and the topology of the network. Often in networks, only a fraction of nodes are labeled. In &lt;strong&gt;social networks&lt;/strong&gt;, labels may indicate interests, beliefs, or demographics, whereas the labels of entities in &lt;strong&gt;biology networks&lt;/strong&gt; may be based on functionality. For example, researchers have painstakingly worked out the functional roles of specific proteins in their system of interest and characterized their interaction partners and the pathways in which they function, but many proteins still haven't been worked out completely. With embeddings, we could try to &lt;strong&gt;predict the missing labels&lt;/strong&gt; with high precision.&lt;/p&gt;
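&lt;p&gt;A toy sketch of the idea (the protein embeddings and labels are invented; in practice you would train a proper classifier on the embedding vectors rather than just take the nearest labeled node):&lt;/p&gt;

```python
import math

def predict_label(embeddings, labels, node):
    """Assign the label of the labeled node closest to `node` in embedding space."""
    nearest = min(labels, key=lambda other: math.dist(embeddings[node], embeddings[other]))
    return labels[nearest]

# p2's function is unknown; p1 and p3 have been characterized experimentally
embeddings = {"p1": [0.1, 0.2], "p2": [0.15, 0.25], "p3": [0.9, 0.8]}
labels = {"p1": "kinase", "p3": "transporter"}
print(predict_label(embeddings, labels, "p2"))  # kinase
```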

&lt;h3&gt;
  
  
  Link prediction
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Link prediction&lt;/strong&gt; refers to the task of predicting missing links or links that are likely to occur in the future. For example, in a &lt;strong&gt;Protein-Protein network&lt;/strong&gt;, where verifying the existence of links between protein nodes requires costly experimental tests, link prediction can save a lot of money by letting you test only the pairs where you have a higher chance of guessing correctly. &lt;/p&gt;
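&lt;p&gt;One common recipe can be sketched as follows (the embeddings and known interactions are invented, and cosine similarity is just one possible scoring scheme): rank the non-existing pairs by the similarity of their endpoint embeddings and test only the top of the list:&lt;/p&gt;

```python
import math
from itertools import combinations

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Hypothetical protein embeddings; A, B, C form one cluster, D sits apart
embeddings = {"A": [1.0, 0.0], "B": [0.9, 0.1], "C": [0.95, 0.05], "D": [0.0, 1.0]}
known_edges = {frozenset({"A", "B"}), frozenset({"B", "C"})}

# Rank the missing pairs; the most similar endpoints are the best candidates to test
candidates = [p for p in combinations(embeddings, 2) if frozenset(p) not in known_edges]
candidates.sort(key=lambda p: cosine(embeddings[p[0]], embeddings[p[1]]), reverse=True)
print(candidates[0])  # ('A', 'C') -- the most promising missing link
```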

&lt;h3&gt;
  
  
  Clustering
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Clustering&lt;/strong&gt; is used to find subsets of similar nodes and group them; finally, &lt;strong&gt;visualization&lt;/strong&gt; helps in providing insights into the structure of the network.&lt;/p&gt;

&lt;p&gt;So back to our &lt;strong&gt;bot case&lt;/strong&gt;. One assumption could be that bots have a small number of links to real users (because who would want to be friends with them?), but a lot of links among themselves, so that they appear as real users. Graph clustering or community detection comes into play here: we want to find those clusters and remove the bot users. This can be done with node embeddings, especially &lt;strong&gt;&lt;a href="https://memgraph.com/blog/online-node2vec-recommendation-system?utm_source=devto&amp;amp;utm_medium=referral&amp;amp;utm_campaign=blog_repost"&gt;dynamic node embeddings&lt;/a&gt;&lt;/strong&gt;, where interactions happen every second. &lt;/p&gt;
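&lt;p&gt;As a toy illustration of that idea (invented 2-dimensional user embeddings and a bare-bones k-means, not Memgraph's implementation): cluster the embeddings and inspect the small, dense group:&lt;/p&gt;

```python
import math

def kmeans(points, k, iters=20):
    """Bare-bones k-means: assign points to the nearest centroid, then recompute centroids."""
    centroids = points[:k]  # naive initialization: the first k points
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        centroids = [
            [sum(dim) / len(cluster) for dim in zip(*cluster)] if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return clusters

# Real users sit around (0, 0); the suspected bot accounts group around (5, 5)
users = [[0.1, 0.2], [0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9]]
clusters = kmeans(users, k=2)
print(sorted(len(c) for c in clusters))  # [2, 3]: the small cluster is the bot group
```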

&lt;p&gt;There is a cool article &lt;a href="https://dl.acm.org/doi/pdf/10.1145/2818717"&gt;The Rise of Social Bots&lt;/a&gt; in which you can read how &lt;strong&gt;bots&lt;/strong&gt; are used to affect and possibly manipulate the online debate about &lt;strong&gt;vaccination policy&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Phew. There was a lot to cover, but we succeeded somehow! Good job if you stayed with me until here! ❤️. Now, after we have covered the theory, you can check out some implementations of &lt;strong&gt;node embedding&lt;/strong&gt; algorithms, for &lt;a href="https://memgraph.com/docs/mage/query-modules/python/node2vec?utm_source=devto&amp;amp;utm_medium=referral&amp;amp;utm_campaign=blog_repost"&gt;static&lt;/a&gt; and &lt;a href="https://memgraph.com/docs/mage/query-modules/python/node2vec-online?utm_source=devto&amp;amp;utm_medium=referral&amp;amp;utm_campaign=blog_repost"&gt;dynamic graphs&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;Our team of engineers is currently tackling the problem of graph analytics algorithms on &lt;strong&gt;real-time data&lt;/strong&gt;. If you want to discuss how to apply &lt;strong&gt;online/streaming algorithms&lt;/strong&gt; on connected data, feel free to join our &lt;strong&gt;&lt;a href="https://memgr.ph/join-discord"&gt;Discord server&lt;/a&gt;&lt;/strong&gt; and message us.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MAGE&lt;/strong&gt; shares his wisdom on a &lt;a href="https://twitter.com/intent/follow?screen_name=memgraphmage"&gt;&lt;strong&gt;Twitter&lt;/strong&gt; channel&lt;/a&gt;. Get to know him better by following him 🐦&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--9ot-Llaq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://public-assets.memgraph.com/node-embeddings/memgraph-tutorial-intro-to-node-embedding-twitter.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--9ot-Llaq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://public-assets.memgraph.com/node-embeddings/memgraph-tutorial-intro-to-node-embedding-twitter.png" alt="memgraph-tutorial-intro-to-node-embedding-twitter" width="750" height="180"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Last but not least, check out &lt;a href="https://github.com/memgraph/mage"&gt;&lt;strong&gt;MAGE&lt;/strong&gt;&lt;/a&gt; and don't hesitate to give a star ⭐ or contribute with new ideas.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://memgraph.com/blog?topics=Graph+Algorithms&amp;amp;utm_source=devto&amp;amp;utm_medium=referral&amp;amp;utm_campaign=blog_repost&amp;amp;utm_content=banner#list"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--NWKqzkwo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://public-assets.memgraph.com/external/memgraph-read-more-gradient-1200.png" alt="Read more about real-time analytics on memgraph.com" width="880" height="186"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Run Link Prediction or Node Classification Algorithms and Write Custom Procedures in C++ With Mage 1.4</title>
      <dc:creator>Antonio Filipovic</dc:creator>
      <pubDate>Wed, 07 Dec 2022 06:33:23 +0000</pubDate>
      <link>https://dev.to/memgraph/run-link-prediction-or-node-classification-algorithms-and-write-custom-procedures-in-c-with-mage-14-5903</link>
      <guid>https://dev.to/memgraph/run-link-prediction-or-node-classification-algorithms-and-write-custom-procedures-in-c-with-mage-14-5903</guid>
      <description>&lt;p&gt;In the new release of &lt;a href="https://memgraph.com/?utm_source=devto&amp;amp;utm_medium=referral&amp;amp;utm_campaign=blog_repost" rel="noopener noreferrer"&gt;Memgraph’s&lt;/a&gt; open-source graph extension library &lt;a href="https://memgraph.com/mage/?utm_source=devto&amp;amp;utm_medium=referral&amp;amp;utm_campaign=blog_repost" rel="noopener noreferrer"&gt;MAGE&lt;/a&gt;, we focused on supporting graph machine learning. MAGE 1.4 now enables you to classify graph nodes and predict new relationships using the &lt;strong&gt;node classification&lt;/strong&gt; and &lt;strong&gt;link prediction&lt;/strong&gt; algorithms. &lt;/p&gt;

&lt;p&gt;We also wanted to extend &lt;a href="https://github.com/memgraph/mage" rel="noopener noreferrer"&gt;MAGE&lt;/a&gt; towards the C++ community even more and created the C++ API towards Memgraph database. Writing graph algorithms in C++ now comes close to working in Python since you don’t need to worry about handling memory and working with unnecessary interfaces.&lt;/p&gt;

&lt;p&gt;If you are also familiar with the igraph library, you'll be happy to hear that we integrated it into MAGE, and the newly integrated k-means algorithm will help you cluster your data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Link prediction
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://memgraph.com/docs/mage/query-modules/python/link-prediction-with-gnn/?utm_source=devto&amp;amp;utm_medium=referral&amp;amp;utm_campaign=blog_repost" rel="noopener noreferrer"&gt;Link prediction&lt;/a&gt; tries to predict new relationships by &lt;strong&gt;generalizing on unseen nodes&lt;/strong&gt; at inference time. Inside the module, you can choose to work on link prediction using GraphSAGE or GAT. The module was integrated using &lt;a href="https://www.dgl.ai/" rel="noopener noreferrer"&gt;DGL&lt;/a&gt; implementation, and it supports a lot of different logging metrics, as well as storing models after a certain number of epochs.&lt;/p&gt;

&lt;p&gt;One example of what you can do with the link prediction algorithm is to recommend new services for customers by using a query similar to this one:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cypher"&gt;&lt;code&gt;&lt;span class="k"&gt;MATCH&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="py"&gt;n:&lt;/span&gt;&lt;span class="n"&gt;Customer&lt;/span&gt; &lt;span class="ss"&gt;{&lt;/span&gt;&lt;span class="py"&gt;id:&lt;/span&gt; &lt;span class="s2"&gt;"1658"&lt;/span&gt;&lt;span class="ss"&gt;})&lt;/span&gt;
&lt;span class="k"&gt;MATCH&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="py"&gt;s:&lt;/span&gt;&lt;span class="n"&gt;Service&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="nf"&gt;collect&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;services&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;
&lt;span class="k"&gt;CALL&lt;/span&gt; &lt;span class="n"&gt;link_prediction.recommended_vertex&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt; &lt;span class="n"&gt;services&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt; &lt;span class="k"&gt;YIELD&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;RETURN&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt; &lt;span class="n"&gt;recommendation&lt;/span&gt;&lt;span class="ss"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Node classification
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://memgraph.com/docs/mage/query-modules/python/node-classification-with-gnn/?utm_source=devto&amp;amp;utm_medium=referral&amp;amp;utm_campaign=blog_repost" rel="noopener noreferrer"&gt;Node classification&lt;/a&gt; determines the labeling of samples (represented as nodes) by looking at the labels of their neighbors. It is motivated by &lt;strong&gt;homophily&lt;/strong&gt;, which means &lt;strong&gt;"love of sameness”&lt;/strong&gt; based on the sociological theory that similar things will group. The following module supports different layer types, loading and storing models, and much more. &lt;/p&gt;

&lt;p&gt;With node classification, you can work on fraud prediction by using a query like the one below to determine if a certain user is a fraudster or not:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cypher"&gt;&lt;code&gt;&lt;span class="k"&gt;MATCH&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="ss"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;id&lt;/span&gt;&lt;span class="dl"&gt;:&lt;/span&gt;&lt;span class="m"&gt;1658&lt;/span&gt;&lt;span class="ss"&gt;})&lt;/span&gt; 
&lt;span class="k"&gt;CALL&lt;/span&gt; &lt;span class="n"&gt;node_classification.predict&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt; &lt;span class="k"&gt;YIELD&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; 
&lt;span class="k"&gt;RETURN&lt;/span&gt; &lt;span class="n"&gt;predicted_class&lt;/span&gt;&lt;span class="ss"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  C++ API designed for humans
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://memgraph.com/docs/memgraph/reference-guide/query-modules/api/cpp-api/?utm_source=devto&amp;amp;utm_medium=referral&amp;amp;utm_campaign=blog_repost" rel="noopener noreferrer"&gt;The new C++ API&lt;/a&gt; is designed for humans, not robots. We followed best practices to reduce unnecessary cognitive load: the components have simple and consistent interfaces, common use cases require fewer user actions, and the API comes with developer guides and extensive documentation.&lt;/p&gt;

&lt;p&gt;Memory management is probably the main pain point in C++ development. The new C++ API automatically manages the memory used by graph data, saving you time that would otherwise be spent debugging and writing repetitive code.&lt;/p&gt;

&lt;h2&gt;
  
  
  igraph support is here
&lt;/h2&gt;

&lt;p&gt;Furthermore, the &lt;a href="https://memgraph.com/docs/mage/query-modules/python/igraphalg/?utm_source=devto&amp;amp;utm_medium=referral&amp;amp;utm_campaign=blog_repost" rel="noopener noreferrer"&gt;igraphalg&lt;/a&gt; module provides a comprehensive set of thin wrappers around some of the algorithms present in the &lt;a href="https://igraph.org/" rel="noopener noreferrer"&gt;igraph&lt;/a&gt; package. The wrapper functions can create an igraph-compatible graph-like object that can stream the native database graph directly, significantly lowering memory usage. &lt;/p&gt;

&lt;p&gt;From this version, MAGE supports &lt;a href="https://memgraph.com/docs/mage/query-modules/python/nxalg/?utm_source=devto&amp;amp;utm_medium=referral&amp;amp;utm_campaign=blog_repost" rel="noopener noreferrer"&gt;NetworkX&lt;/a&gt; integration, &lt;a href="https://memgraph.com/docs/mage/query-modules/cuda/cugraph/?utm_source=devto&amp;amp;utm_medium=referral&amp;amp;utm_campaign=blog_repost" rel="noopener noreferrer"&gt;cuGraph&lt;/a&gt; to support graph algorithms on CUDA devices, and now igraph. If you need something else, feel free to drop us a comment on &lt;a href="https://github.com/memgraph/mage" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  k-means clustering to group examples
&lt;/h2&gt;

&lt;p&gt;And last but not least, the k-means algorithm clusters the given data by trying to separate samples into n groups of equal variance, minimizing the criterion known as the within-cluster sum-of-squares. Find out more about this algorithm in the &lt;a href="https://memgraph.com/docs/mage/query-modules/python/kmeans/?utm_source=devto&amp;amp;utm_medium=referral&amp;amp;utm_campaign=blog_repost" rel="noopener noreferrer"&gt;documentation&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;You can use this algorithm when you already have embeddings, but clustering is missing. For example, feel free to combine it with &lt;a href="https://memgraph.com/docs/mage/query-modules/python/node2vec/?utm_source=devto&amp;amp;utm_medium=referral&amp;amp;utm_campaign=blog_repost" rel="noopener noreferrer"&gt;Node2Vec&lt;/a&gt; or &lt;a href="https://memgraph.com/docs/mage/query-modules/python/node2vec-online/?utm_source=devto&amp;amp;utm_medium=referral&amp;amp;utm_campaign=blog_repost" rel="noopener noreferrer"&gt;Node2Vec online&lt;/a&gt; version.&lt;/p&gt;

&lt;h2&gt;
  
  
  What next?
&lt;/h2&gt;

&lt;p&gt;If any of the new features will make your use case easier, &lt;a href="https://memgraph.com/download#memgraph-platform/?utm_source=devto&amp;amp;utm_medium=referral&amp;amp;utm_campaign=blog_repost" rel="noopener noreferrer"&gt;update MAGE&lt;/a&gt; to version &lt;strong&gt;1.4&lt;/strong&gt;. Feel free to leave a comment, report an issue, or give us a star to support our work on &lt;a href="https://github.com/memgraph/mage" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;br&gt;
Also, we are always open to discussion and advice, so drop by our &lt;a href="https://discord.com/invite/memgraph" rel="noopener noreferrer"&gt;Discord Server&lt;/a&gt; and stay informed on everything graph-algorithm-related!&lt;/p&gt;

</description>
      <category>watercooler</category>
    </item>
  </channel>
</rss>
