DEV Community: Ante Pušić

Analyze Infrastructure Networks With Dynamic Betweenness Centrality

Ante Pušić — Wed, 22 Feb 2023 17:58:45 +0000

Remember when undersea internet cables came under attack by... sharks?!

In this blog post, you will explore how the loss of a connection can affect the global submarine internet cable network, and learn how to use dynamic betweenness centrality to efficiently analyze streamed data in Memgraph.

The loss of a cable is a textbook example of how a single change can immediately disrupt the entire network. To enable rapid response in such situations, our MAGE 🧙‍♂️ team has been adding online graph algorithms (Node2Vec, PageRank & community detection), whose magic ✨ is in updating previous outputs instead of computing everything anew.


Prerequisites

In this tutorial, you will use Memgraph with Docker, GQLAlchemy, and a Jupyter notebook.

Data

The dataset used in this blog post represents the global network of submarine internet cables as a graph whose nodes stand for landing points and whose relationships stand for the cables connecting them.

Landing points and cables have unique identifiers (id), and the landing points also have names (name).

A giant thank you to TeleGeography for sharing the dataset for their submarine cable map. ❤️

Exploration

Our Setup

With Docker installed, we will start Memgraph through it using the following command:

docker run -it -p 7687:7687 -p 3000:3000 memgraph/memgraph-platform

We will be working with Memgraph from a Jupyter notebook. To interact with Memgraph from there, we use GQLAlchemy.

Betweenness Centrality

Before we start exploring our graph, let’s quickly recall that the betweenness centrality of a node is the fraction of shortest paths between all pairs of nodes in the graph that pass through it:

$$BC(n) = \sum_{i \neq n \neq j} \frac{\sigma_{ij}(n)}{\sigma_{ij}}$$

In the above expression, n is the node of interest, i, j are any two distinct nodes other than n, σij is the number of shortest paths from i to j, and σij(n) is the number of those paths that go through n.

The analysis of (internet) traffic flows, like what we are doing here, is an established use case for this metric.

Jupyter notebook

The Jupyter notebook is here – we can now go for a deep dive 🤿 in the data!

Preliminaries

First, let’s connect to our instance of Memgraph with GQLAlchemy and load the dataset.

from gqlalchemy import Memgraph

def load_dataset(path: str):
    with open(path, mode='r') as dataset:
        for statement in dataset:
            memgraph.execute(statement)

memgraph = Memgraph("127.0.0.1", 7687)    # connect to running instance
memgraph.drop_database()                  # make sure it’s empty
load_dataset('data/input.cyp')            # load dataset

Example

With everything set up, calling the betweenness_centrality_online module is a matter of a single Cypher query.
As we are analyzing how changes affect the undersea internet cable network, we save the computed betweenness centrality scores for later.

memgraph.execute(
    """
    CALL betweenness_centrality_online.set()
    YIELD node, betweenness_centrality
    SET node.bc = betweenness_centrality;
    """
)

Let’s see which landing points have the highest betweenness centrality score in the network:

most_central = memgraph.execute_and_fetch(
    """
    MATCH (n: Node)
    RETURN n.id AS id, n.name AS name, n.bc AS bc_score
    ORDER BY bc_score DESC, name ASC
    LIMIT 5;
    """
)
for node in most_central:
    print(node)

{'id': 15, 'name': 'Tuas, Singapore', 'bc_score': 0.29099145440251273}
{'id': 16, 'name': 'Fortaleza, Brazil', 'bc_score': 0.13807572163430684}
{'id': 467, 'name': 'Toucheng, Taiwan', 'bc_score': 0.13361801370831092}
{'id': 62, 'name': 'Manado, Indonesia', 'bc_score': 0.12915295031722301}
{'id': 123, 'name': 'Balboa, Panama', 'bc_score': 0.12783714460527598}

Two of the above results, Singapore and Panama, have become international trade hubs owing to their favorable geographic position. They are highly influential nodes in other networks as well.

Dynamic Operation

This part brings us to MAGE’s newest algorithm – iCentral dynamic betweenness centrality by Fuad Jamour and others^[1].
After each graph update, iCentral can be run to update previously computed values without having to process the entire graph, going hand in hand with Memgraph’s graph streaming capabilities.

You can set this up in Memgraph with triggers – pieces of Cypher code that run in response to database transactions.

memgraph.execute(
    """
    CREATE TRIGGER update_bc 
    BEFORE COMMIT EXECUTE 
        CALL betweenness_centrality_online.update(createdVertices, createdEdges, deletedVertices, deletedEdges) 
        YIELD *;
    """
)

Let’s now see what happens when a shark (or something else) cuts off a submarine internet cable between Tuas in Singapore and Jeddah in Saudi Arabia.

memgraph.execute("""MATCH (n {name: "Tuas, Singapore"})-[e]-(m {name: "Jeddah, Saudi Arabia"}) DELETE e;""")

The above transaction activates the update_bc trigger, and previously computed betweenness centrality scores are updated using iCentral.

With the cable out of function, internet data must be transmitted over different routes. Some nodes in the network are bound to experience increased strain and internet speed might thus deteriorate. These nodes likely saw their betweenness centrality score increase. To inspect that, we’ll retrieve the new scores with betweenness_centrality_online.get() and compare them with the previously saved ones.

highest_deltas = memgraph.execute_and_fetch(
    """
    CALL betweenness_centrality_online.get()
    YIELD node, betweenness_centrality
    RETURN 
        node.id AS id,
        node.name AS name, 
        node.bc AS old_bc,
        betweenness_centrality AS bc,
        betweenness_centrality - node.bc AS delta
    ORDER BY abs(delta) DESC, name ASC
    LIMIT 5;
    """
)
for result in highest_deltas:
    print(result)

memgraph.execute("DROP TRIGGER update_bc;")

{'id': 111, 'name': 'Jeddah, Saudi Arabia', 'old_bc': 0.061933737931979434, 'bc': 0.004773934386713466, 'delta': -0.057159803545265966}
{'id': 352, 'name': 'Songkhla, Thailand', 'old_bc': 0.05259842296405675, 'bc': 0.07514804741735281, 'delta': 0.022549624453296065}
{'id': 15, 'name': 'Tuas, Singapore', 'old_bc': 0.29099145440251273, 'bc': 0.2730690696075149, 'delta': -0.017922384794997803}
{'id': 175, 'name': 'Yanbu, Saudi Arabia', 'old_bc': 0.0648358824682235, 'bc': 0.07561992914231867, 'delta': 0.010784046674095174}
{'id': 210, 'name': 'Dakar, Senegal', 'old_bc': 0.08708567541545133, 'bc': 0.09412362761485257, 'delta': 0.007037952199401246}

As seen above, the landing point in Songkhla, Thailand had its score increase by 42.87% after the update. Conversely, other landing points became less connected to the rest of the network: the centrality of the Jeddah node in Saudi Arabia dropped almost to zero.
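
The percentages can be double-checked with a few lines of Python. This is only a sketch of the arithmetic: the scores are copied from the query output above, and the helper name relative_change is ours, not part of any library.

```python
# Scores copied from the query output above.
songkhla_old = 0.05259842296405675
songkhla_new = 0.07514804741735281
jeddah_old = 0.061933737931979434
jeddah_new = 0.004773934386713466


def relative_change(old: float, new: float) -> float:
    """Return the change from old to new as a percentage of old."""
    return (new - old) / old * 100


songkhla_change = relative_change(songkhla_old, songkhla_new)
jeddah_change = relative_change(jeddah_old, jeddah_new)

print(f"Songkhla: {songkhla_change:+.2f}%")
print(f"Jeddah:   {jeddah_change:+.2f}%")
```

The relative change is computed against the old score, which is why a small absolute delta can still be a large percentage for a node that started out with a low score.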

Performance

iCentral builds upon the Brandes algorithm^[2] and adds the following improvements in order to increase performance:

Process only the nodes whose betweenness centrality values change: after an update, betweenness centrality scores stay the same outside the affected biconnected component.
Avoid repeating shortest-path calculations: use prior output if it’s possible to tell it’s still valid; if new shortest paths are needed, update the prior ones instead of recomputing.
- Breadth-first search for computing graph dependencies does not need to start from nodes that are equidistant from both endpoints of the updated relationship.
- The BFS tree used for computing new graph dependencies (the contributions of a node to other nodes’ scores) can be determined from the tree obtained while computing old graph dependencies.

bcc_partition = memgraph.execute_and_fetch(
    """
    CALL biconnected_components.get()
    YIELD bcc_id, node_from, node_to
    RETURN
        bcc_id,
        node_from.id AS from_id,
        node_from.name AS from_name,
        node_to.id AS to_id,
        node_to.name AS to_name
    LIMIT 15;
    """
)
for relationship in bcc_partition:
    print(relationship)

Graphs of infrastructural networks, such as this one, fairly often consist of a number of smaller biconnected components (BCCs). As iCentral recognizes that betweenness centrality scores are unchanged outside the affected BCC, this can result in a significant speedup.

Algorithms: Online vs. Offline

An important property of algorithms is whether they are online or offline. Online algorithms can update their output as more data becomes available, whereas offline algorithms have to redo the entire computation.

The gold-standard offline algorithm for betweenness centrality is the one by Ulrik Brandes^[2]: it works by building a shortest path tree from each node of the graph and efficiently counting the shortest paths through dynamic programming.
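
To make the idea concrete, here is a minimal pure-Python sketch of the Brandes approach for unweighted, undirected graphs. It is an illustration under our own assumptions (adjacency stored as a dict of lists, scores left unnormalized), not the implementation used in MAGE; the function name and the toy graph are hypothetical.

```python
from collections import deque


def brandes_betweenness(adjacency: dict) -> dict:
    """Unnormalized betweenness centrality of an unweighted, undirected graph."""
    scores = {node: 0.0 for node in adjacency}
    for source in adjacency:
        # Breadth-first search: count shortest paths and record predecessors.
        order = []  # nodes in the order they leave the queue
        predecessors = {node: [] for node in adjacency}
        path_count = {node: 0 for node in adjacency}
        distance = {node: -1 for node in adjacency}
        path_count[source] = 1
        distance[source] = 0
        queue = deque([source])
        while queue:
            node = queue.popleft()
            order.append(node)
            for neighbor in adjacency[node]:
                if distance[neighbor] < 0:
                    distance[neighbor] = distance[node] + 1
                    queue.append(neighbor)
                if distance[neighbor] == distance[node] + 1:
                    path_count[neighbor] += path_count[node]
                    predecessors[neighbor].append(node)
        # Dynamic programming: accumulate dependencies from the farthest nodes back.
        dependency = {node: 0.0 for node in adjacency}
        for node in reversed(order):
            for pred in predecessors[node]:
                share = path_count[pred] / path_count[node]
                dependency[pred] += share * (1 + dependency[node])
            if node != source:
                scores[node] += dependency[node]
    # Every undirected pair was visited from both of its endpoints.
    return {node: score / 2 for node, score in scores.items()}


# Hypothetical toy network: a path a - b - c - d.
toy_graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
toy_scores = brandes_betweenness(toy_graph)
print(toy_scores)  # {'a': 0.0, 'b': 2.0, 'c': 2.0, 'd': 0.0}
```

Each source node costs one breadth-first search plus one backward pass, which is where the O(|V||E|) running time comes from.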

In How to Identify Essential Proteins using Betweenness Centrality we built a web app to visualize protein-protein interaction networks with the help of betweenness centrality.

However, updates often change only a tiny piece of the whole graph. To scale, one needs to take advantage of this by cutting down on repeated work. To this end, we employed the fastest algorithm so far: iCentral.
Let’s see how it stacks up against the Brandes algorithm in complexity.

Brandes: runs in O(|V||E|) time and uses O(|V| + |E|) space on a graph with |V| nodes and |E| relationships,
iCentral: runs in O(|Q||E_BC|) time and uses O(|V_BC| + |E_BC|) space. |V_BC| and |E_BC| are the counts of nodes and relationships in the affected portion of the graph (the affected biconnected component); Q is the set of nodes in that portion from which breadth-first search still has to be run, so |Q| ≤ |V_BC| (see the Performance section). NB: iCentral also saves time by avoiding repeated shortest-path calculations where possible; this varies by graph.

Another key trait of iCentral is that it can be run fully in parallel, just like the Brandes algorithm. With N parallel instances, the algorithm can run up to N times faster, at the expense of requiring N times more space (each thread keeps a copy of the data structures).

Takeaways

Betweenness centrality is a very common graph analytics tool, but it is nevertheless challenging to scale up to dynamic graphs. To solve this, Memgraph has implemented the fastest online algorithm for it to date – iCentral; it joins our growing suite of streaming graph analytics.

Our R&D team is working hard on streaming graph ML and analytics. We’re happy to discuss it with you – ping us on Discord!

What’s Next?

It’s time to build with Memgraph! 🔨

Check out our MAGE open-source graph analytics suite and don’t hesitate to give a star ⭐ or contribute with new ideas. If you have any questions or you want to share your work with the rest of the community, join our Discord server.

References

[1] Jamour, F., Skiadopoulos, S., & Kalnis, P. (2017). Parallel algorithm for incremental betweenness centrality on large graphs. IEEE Transactions on Parallel and Distributed Systems, 29(3), 659-672.

[2] Brandes, U. (2001). A faster algorithm for betweenness centrality. Journal of Mathematical Sociology, 25(2), 163-177.

Monitoring a Dynamic Contact Network With Online Community Detection

Ante Pušić — Mon, 06 Feb 2023 15:50:39 +0000

This tutorial is a sequel to LabelRankT – Community Detection in Dynamic Environment.

Introduction

The newest spell in MAGE’s book is Online Community Detection – an efficient, high-performance algorithm for detecting communities in networks that change with time.

As MAGE wants to use his knowledge to help people, in this tutorial you will learn with him how to build a utility that monitors a dynamic contact network. The utility will a) use the detected communities to show rumor-spreading clusters and b) track the average cluster size.

How did our magician learn about all of this, though?

It was again winter – too soon, MAGE sighed – and something odd was happening with his sage 🧙 friends. They were arguing with one another, and between fights, they were wondering what was up.
Soon they found out that rumors had somehow spread among them. Losing no time, MAGE set off to find a way to make sense of the situation with what he knows best – graphs. Having built a network that tracked their close contacts through time, he pored over his books until he found a suitable algorithm for analyzing it. [1]

Prerequisites

To complete this tutorial, you will need:

MAGE – Memgraph’s very own graph analytics library
gqlalchemy – a Python driver and object graph mapper (OGM)

Graph

It is paramount to protect the privacy of personal contact records. Keeping this in mind, we generated a dynamic dataset (250 nodes and fewer than 1,700 relationships) of a network of close contacts using the Barabási–Albert graph generation model. This model is suitable for such networks because it makes graphs with two traits that one commonly sees in real-life networks:

small world property: few degrees of separation between any two nodes in the graph
power-law degree distribution: the presence of hubs (highly connected nodes) in the graph

Our graph consists of Person nodes with a contactID attached. If two Persons have been in close contact, there exists between them a relationship CONTACTED.

Image 1. Graph schema

Contacts older than the rumor’s “expiration date” do not play a role in rumor spreading and are thus dropped. Consequently, communities of close contacts are not static; they change with time and need to be recomputed after each update.

Algorithm

We will do our network analysis with the new LabelRankT method described in the previous post. LabelRankT is an online algorithm that partitions the network into communities and returns nodes with their community labels.

In a nutshell, the algorithm does two main operations:

label propagation: assign each node a label and pass them along edges for several iterations
update: find the changed nodes and run label propagation on them only

Now, why did MAGE use this algorithm?

Let’s remember that we defined communities as sets of densely interconnected nodes. These are what our algorithm detects – as labels flow through the network, each densely connected set quickly converges on a single label.

As for our task, this idea allows us to extend the notion of “close contact” to people who haven’t had direct contact with a rumor spreader if they are sufficiently connected to the people who have. In other words, rumors can quickly spread through the network, much like a chain reaction would.
In effect, communities are essentially rumor-spreading clusters. This way, we can cast a wider net and inform more people about possible exposure to misinformation.

Loading the dataset

The dataset is a file whose entries are structured as follows:

+|-,contactID_1,contactID_2

If the initial operator is +, the entry adds a contact between contactID_1 and contactID_2 to the network; - does the opposite.
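
As a quick illustration of this format, the sketch below parses entries and applies them to an in-memory set of contacts. It assumes that contact IDs are integers and that each contact can be stored as an ordered pair; the function names and the sample entries are hypothetical.

```python
def parse_entry(entry: str) -> tuple:
    """Split an entry such as '+,120,954' into ('+', 120, 954)."""
    operation, contact_1, contact_2 = entry.strip().split(",")
    if operation not in ("+", "-"):
        raise ValueError(f"unknown operator: {operation!r}")
    return operation, int(contact_1), int(contact_2)


def apply_entry(contacts: set, entry: str) -> None:
    """Add or remove a contact, stored as an (ID, ID) pair, in place."""
    operation, contact_1, contact_2 = parse_entry(entry)
    pair = (contact_1, contact_2)
    if operation == "+":
        contacts.add(pair)
    else:
        contacts.discard(pair)


contacts = set()
for entry in ["+,120,954", "+,120,7", "-,120,954"]:  # hypothetical sample entries
    apply_entry(contacts, entry)

print(contacts)  # {(120, 7)}
```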

Memgraph supports graph streaming. When working with a dynamic network, one would usually create a stream and connect it to Memgraph.
However, we’re focused on rumor spreading and working with streams is a bit beside the point. Instead, we’ll simulate a stream with this little snippet of code:

class Stream:
    i = 0
    data = [['+',120,954], ...]   # ... stands for the remaining entries

    # doubled braces are literal braces for str.format(); each {} takes a contact ID
    CREATE_EDGE = ('MERGE (a: Person {{contactID: {}}}) '
                   'MERGE (b: Person {{contactID: {}}}) '
                   'CREATE (a)-[r: CONTACTED]->(b);')
    DELETE_EDGE = ('MATCH (a: Person {{contactID: {}}})-[r: CONTACTED]->(b: Person {{contactID: {}}}) '
                   'DELETE r;')

    def get_next(self):
        # return the next change as a Cypher query, or None once the data runs out
        if self.i >= len(self.data):
            return None
        operation, node_1, node_2 = self.data[self.i]
        self.i += 1
        if operation == '+':
            return self.CREATE_EDGE.format(node_1, node_2)
        elif operation == '-':
            return self.DELETE_EDGE.format(node_1, node_2)

Setting up detection

We will need two procedures: set() initializes the community detection algorithm and update() gets the new communities after each graph change.

Initialization with set() should look like the following:

CALL dynamic_community_detection.set()
YIELD node, community_id
RETURN node.contactID AS contact_id, community_id
ORDER BY contact_id;

This method lets the user set the parameters of the algorithm, as detailed in the documentation. Note that we are using the default values in this article.

Triggers are a Memgraph functionality that lets users set openCypher statements to run in response to graph updates. We will handle community updating with a trigger that activates after every graph change:

CREATE TRIGGER test_edges_changed
BEFORE COMMIT EXECUTE
CALL dynamic_community_detection.update(
    createdVertices,
    createdEdges,
    updatedVertices,
    updatedEdges,
    deletedVertices,
    deletedEdges) YIELD *;

The arguments passed to update() are predefined within the trigger. For a comprehensive list of trigger events and their predefined variables, take a look here in the documentation.

Putting it all together

Finally, we’re going to put this all together. 🚀

The main loop of our code holds the bulk of our utility’s logic. It handles the tasks of updating the graph, community detection, and returning rumor-spreading clusters and their mean sizes.


import random

from gqlalchemy import Memgraph

# INIT_ALGORITHM, UPDATE_TRIGGER and GET_RESULTS hold the Cypher queries for
# set(), the trigger above and fetching the current communities;
# GET_RESULTS is assumed to return (node, community_id) rows.
memgraph = Memgraph("127.0.0.1", 7687)

example_stream = Stream()
memgraph.execute(query=INIT_ALGORITHM)
memgraph.execute(query=UPDATE_TRIGGER)

while True:
    # update graph and run community detection

    next_change = example_stream.get_next()
    if not next_change:
        break
    memgraph.execute(query=next_change)
    # execute_and_fetch() returns a one-pass iterator, so store the rows in a list
    results = [
        tuple(result.values())
        for result in memgraph.execute_and_fetch(query=GET_RESULTS)
    ]

    # for a random node, find its cluster

    random_node, random_community = random.choice(results)
    cluster = [node for node, community in results if community == random_community]

    # calculate mean cluster size

    sizes = {}
    for _, community in results:
        sizes[community] = sizes.get(community, 0) + 1
    mean_size = sum(sizes.values()) / len(sizes)

To wrap up this demo, here’s a plot of mean cluster/community size by epoch:

Image 2. Cluster size increases as the graph fills out; larger sizes may imply quicker spread and vice versa.

Conclusion

This article aimed to present the efficiency and output quality of MAGE’s new online community detection algorithm on a dynamic network, stressing the insights that one can glean from the communities. With the ongoing rise in data streaming, the demand for algorithms that can handle large volumes of data and produce useful results is rising, and our algorithm is one of them.

Our team of engineers is currently tackling the problem of graph analytics algorithms on real-time data. If you want to discuss how to apply online/streaming algorithms on connected data, feel free to join our Discord server and message us.

MAGE shares his wisdom on a Twitter account. Get to know him better by following him 🐦

Last but not least, check out MAGE and don’t hesitate to give a star ⭐ or contribute with new ideas.

References

[1] Jierui Xie, Mingming Chen, Boleslaw K. Szymanski: “LabelRankT: Incremental Community Detection in Dynamic Networks via Label Propagation”, 2013, Proc. DyNetMM 2013 at SIGMOD/PODS 2013, New York, NY, 2013; arXiv:1305.2006.

LabelRankT – Community Detection in Dynamic Environment

Ante Pušić — Thu, 02 Feb 2023 14:20:24 +0000

Introduction

Community detection helps us uncover hidden relations among nodes in a graph and find sets of nodes with some characteristics in common. Real-life networks often show community structure, and by figuring it out, one can tackle a wide range of problems.

If you’re doing graph analytics, chances are that you have run community detection on your dataset. Algorithms take more time to run on large graphs, and handling the volume of work that comes along with a large and frequently updated dataset is an engineering problem. It makes sense to wonder if it’s possible to leverage the small size of an average update to deliver a performance boost.

We at Memgraph recognize your challenges. In this article, you will learn about the merits of online community detection methods and get acquainted with the LabelRankT algorithm by Xie et al., now available in MAGE 1.1.

Image 1. LabelRank network illustration

The case for online algorithms

Picture a family or a group of close friends – they feel a strong sense of community with one another, and their mutual relationships run deeper than those with people outside the bunch.
Now, let’s say that the family got a new arrival or that a new friend joined the crew. These communities changed, but this doesn’t imply that others did so. In other words, the change was local in scope.

Modern tech operations often work with big data, which is big in more than one way: individual datasets both reach large sizes and receive frequent updates. To scale, one needs to cut down on repetition and exploit the fact that updates often change just a tiny piece of the data. This calls for online algorithms – methods that process data as it comes.

Community detection lends itself well to online solutions for two reasons:

Complexity: many methods have high time complexity that scales with the number of nodes in the network
Locality: community changes tend to be local in scope after partial updates.

Graph streaming platforms such as Memgraph are natively dynamic environments, stressing the need for specialized algorithms on top of the usual performance considerations.
To equip Memgraph with a suitable algorithm, the MAGE team peered into ~~spellbooks~~ academic research and implemented LabelRankT (Xie et al.).

The LabelRankT algorithm

LabelRankT is an un-/semi-supervised machine learning algorithm made for online community detection on networks. It takes a graph as input and returns an assignment of nodes to the communities it detected; nodes belong to a single community.
Offline and online modes of operation are both supported; the latter takes advantage of the communities found in the previous iteration of the graph, whereas the former detects them ab initio.
The algorithm supports weighted and directed graphs by design.

Concepts

In network science, we define a community as a set of nodes that are densely connected internally.

Label propagation algorithms initialize every node with a unique label, and then these diffuse through the network. Consequently, densely interconnected groups – communities – quickly reach a common label.

During detection, nodes do not wholly belong to a single community under LabelRankT. Instead, every node has an associated label distribution vector composed of label probabilities. The vector is scaled so that the probabilities sum to 1.

A self-loop is added to each node to stabilize the detection results across iterations.

Detection

The algorithm starts by assigning a unique label to every node in the graph.
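Label distribution vectors are initialized as follows:

$$P_i(j) = \frac{w_{ji}}{\sum_{k \in N_i} w_{ki}}$$

In other words, the probability P_i(j) that node i belongs to community j is the ratio of wⱼᵢ (the weight of the edge from j to i) and the total weight of all edges that lead to i; N_i is the set of nodes with an edge to i, including i itself through its self-loop.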
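
A minimal sketch of this initialization is shown below. It assumes that incoming edge weights are stored as nested dicts with the self-loops already added; the function name and the toy graph are hypothetical and not taken from MAGE.

```python
def init_label_distributions(in_weights: dict) -> dict:
    """in_weights[i][j] is the weight of the edge j -> i (self-loop i -> i included)."""
    distributions = {}
    for node, weights in in_weights.items():
        total = sum(weights.values())
        # Initially, every in-neighbor j contributes its own unique label j.
        distributions[node] = {label: weight / total for label, weight in weights.items()}
    return distributions


# Hypothetical toy graph with unit weights: edges 1 -> 2, 3 -> 2 and 2 -> 3,
# plus one self-loop per node.
in_weights = {
    1: {1: 1.0},
    2: {1: 1.0, 2: 1.0, 3: 1.0},
    3: {2: 1.0, 3: 1.0},
}
initial = init_label_distributions(in_weights)
print(initial[3])  # {2: 0.5, 3: 0.5}
```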

Label propagation is an iterative process with four steps per iteration. Below follows an abridged description of the steps:

Node selection: nodes belonging to the same community as at least k% of their neighbors skip the next three steps
Label propagation: for an individual node, its new label distribution is a weighted sum of its in-neighbors’ distributions (incl. its own), with N_i denoting the in-neighbors of node i together with i itself:

$$P'_i(c) = \frac{\sum_{k \in N_i} w_{ki} P_k(c)}{\sum_{k \in N_i} w_{ki}}$$

Inflation: label distribution vectors are raised elementwise to the n-th power and then scaled to sum to 1
Cutoff: label probabilities under a set threshold are pruned.

Image 2. Label propagation. Node 4 is not taken into account because the connecting edge 1→4 faces away from node 1.

After several iterations, the algorithm converges on a solution and returns the most probable labels for each node.

Node selection passes over nodes whose labels already agree with most of their neighbors’. In those cases, label propagation is unlikely to change anything, and skipping it makes sense because the step is expensive.
The inflation and cutoff steps help optimize the algorithm by cutting down the number of labels stored in the nodes’ label distribution vectors.
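
The propagation, inflation and cutoff steps can be sketched in a few lines of Python. This is an illustration only: node selection is left out, the exponent (2) and the threshold (0.1) are arbitrary example values, and the data layout (nested dicts of incoming weights and label probabilities) is our assumption rather than MAGE’s internal representation.

```python
def propagate(distributions: dict, in_weights: dict) -> dict:
    """New distribution of a node: weighted average of its in-neighbors' distributions."""
    updated = {}
    for node, weights in in_weights.items():
        total = sum(weights.values())
        merged = {}
        for neighbor, weight in weights.items():
            for label, probability in distributions[neighbor].items():
                merged[label] = merged.get(label, 0.0) + weight * probability / total
        updated[node] = merged
    return updated


def inflate(distribution: dict, exponent: float) -> dict:
    """Raise probabilities to a power and rescale them to sum to 1."""
    powered = {label: p ** exponent for label, p in distribution.items()}
    total = sum(powered.values())
    return {label: p / total for label, p in powered.items()}


def cutoff(distribution: dict, threshold: float) -> dict:
    """Drop labels whose probability is under the threshold."""
    return {label: p for label, p in distribution.items() if p >= threshold}


# Hypothetical toy graph: edges 1 -> 2, 3 -> 2 and 2 -> 3, unit weights, self-loops added.
in_weights = {1: {1: 1.0}, 2: {1: 1.0, 2: 1.0, 3: 1.0}, 3: {2: 1.0, 3: 1.0}}
distributions = {
    1: {1: 1.0},
    2: {1: 1 / 3, 2: 1 / 3, 3: 1 / 3},
    3: {2: 0.5, 3: 0.5},
}

propagated = propagate(distributions, in_weights)
node_3 = cutoff(inflate(propagated[3], exponent=2), threshold=0.1)
print(node_3)  # labels 2 and 3 survive; the weak label 1 is pruned
```

Inflation widens the gap between strong and weak labels, so the cutoff that follows can prune the weak ones and keep the vectors short.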

Online operation

Being an online algorithm, LabelRankT builds its solution incrementally as it adapts to changes in the input. Changed nodes are defined as all nodes that have been added, edited, or deleted, as well as their neighbors. If an edge has been added, edited, or deleted, its endpoint nodes are also considered to be changed.

Compared to the mathematical operations described in the previous section, the logic of online operation is simple:

1) Find out which nodes are changed
2) Preserve the community labels of unchanged nodes
3) Re-run community detection on changed nodes only
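
The first two steps above can be sketched as follows. The function name, the adjacency layout (a dict of neighbor sets taken from the graph before the update) and the example labels are hypothetical; re-running detection on the changed nodes is left out.

```python
def find_changed_nodes(adjacency: dict, touched_nodes, touched_edges) -> set:
    """Nodes affected by an update: touched nodes, their neighbors, and edge endpoints."""
    changed = set()
    for node in touched_nodes:
        changed.add(node)
        changed.update(adjacency.get(node, ()))
    for source, target in touched_edges:
        changed.update((source, target))
    return changed


# Hypothetical path graph 1 - 2 - 3 - 4 - 5 from which node 3 is deleted.
adjacency = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3, 5}, 5: {4}}
labels = {1: "A", 2: "A", 3: "A", 4: "B", 5: "B"}

changed = find_changed_nodes(adjacency, touched_nodes=[3], touched_edges=[])
# Unchanged nodes keep their previous community labels.
preserved = {node: label for node, label in labels.items() if node not in changed}

print(sorted(changed))  # [2, 3, 4]
print(preserved)        # {1: 'A', 5: 'B'}
```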

Image 3. After a node is deleted, its neighbors become changed nodes (red).

Performance

LabelRankT runs in linear O(m) time and takes O(mn) space on a graph with n nodes and m edges. Total execution time also depends on the number of iterations set for label propagation.

This section assesses the performance of LabelRankT on two dimensions: we compare online community detection methods against offline ones in general and contrast LabelRankT’s properties with those of other community detection algorithms.

Online methods

➕ much faster on large, dynamic graphs than offline methods

➕ adaptive to changes in the input data

➖ assumption that the changes only have local effects

➖ need to store past state(s) in memory

LabelRankT

➕ result quality matching offline algorithms

➕ low computational cost

➕ scalable to very large graphs

➕ deterministic and replicable results (unlike other label propagation methods)

➕ compatible with directed and weighted graphs

➕ customizable parameters

➖ less grounded in theory than statistical and modularity-maximizing algorithms for community detection

Applications

Graphs that describe real-life networks often show community structure.

This insight applies to diverse use cases such as customer segmentation, contact tracing, medical diagnostics, and quantification of environmental hazards in public health.
As communities often correspond to the functional units of systems, additional use cases include the detection of cycles or pathways in metabolic networks and recommender systems where content forms communities sorted by topic.
Furthermore, tracking the evolution of communities across time provides a way to monitor entities such as viruses or rumors in real-time as they spread. With the COVID-19 pandemic being a top global concern, this problem is in search of a solution.

In all cases, online methods are well-suited for frequently updated graphs as they save time by re-running the detection only on changed nodes.

A key property of communities is that they often differ markedly from the rest of the network. If one knows the division of the graph into communities, it is possible to isolate them for closer study.
Conversely, this can allow one to treat entire communities as single meta-nodes, effectively reducing the size of the graph and allowing analysis with complex methods that otherwise wouldn’t scale – big data turned small!
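
As a rough sketch of the meta-node idea, the snippet below collapses every community into one node and keeps a count of the relationships that run between communities. The edge list, the community assignment and the function name are hypothetical; edges are treated as undirected for simplicity.

```python
from collections import Counter


def collapse_communities(edges, community_of: dict) -> dict:
    """Map each community to one meta-node; meta-edge weight = number of original edges."""
    meta_edges = Counter()
    for source, target in edges:
        first, second = community_of[source], community_of[target]
        if first != second:  # edges inside a community disappear into the meta-node
            meta_edges[tuple(sorted((first, second)))] += 1
    return dict(meta_edges)


# Hypothetical graph: two triangles joined by the single edge 3 - 4.
edges = [(1, 2), (2, 3), (3, 1), (3, 4), (4, 5), (5, 6), (6, 4)]
community_of = {1: "A", 2: "A", 3: "A", 4: "B", 5: "B", 6: "B"}

meta_graph = collapse_communities(edges, community_of)
print(meta_graph)  # {('A', 'B'): 1}
```

Seven relationships shrink to a single weighted one here, which is the kind of reduction that lets heavier methods run on the meta-graph.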

Communities detected by LabelRankT have a quality that matches traditional offline algorithms. Being deterministic, the algorithm is also suitable for scientific applications that call for replicability.

Conclusion

Community detection is a common task in graph analytics owing to its wide variety of applications, but in big data, it faces concerns on the grounds of scalability in dynamic environments. Online methods such as LabelRankT help solve this by reusing previously calculated results in unchanged graph regions.
