DEV Community: André Vermeij

Innovation Graph Analytics Powered by Embeddings and LLMs

André Vermeij — Wed, 27 Nov 2024 10:52:44 +0000

Guest Author: André Vermeij, Founder of Kenedict Innovation Analytics & Developer of Kenelyze

Intro & Recap: Innovation Graphs

The first post in our series on Innovation Graphs introduced the usage of graphs in the analysis of innovation and its output, such as patents, scientific publications and research grants.

Innovation graphs focus on mapping the connections between technologies, organisations and people and can provide new insights into the actual underpinnings of innovative activity within topics or organisations of interest. They can be constructed based on all kinds of metadata and often focus on visually mapping three complementary perspectives:

Graphs of documents to gain deeper insight into technology/topic clusters.
Graphs of organisations and institutions to focus on sector-wide collaboration patterns.
Graphs of people/experts to get a better understanding of team-level collaboration and key players in a field of expertise.

In this second post on Innovation Graphs, we’ll focus on the creation and LLM-powered analysis of the first type of graph mentioned above—graphs detailing clusters of technologies and topics within a specific sector of interest. Specifically, we’ll dive into how we can use text embeddings to construct document similarity graphs, and how we can automatically analyse the content and label the graph’s clusters using locally running Large Language Models.

Text Embeddings & Graph Creation

Mapping clusters of technology and the connections between them is a key part of most innovation analytics projects. A common way to create the related document similarity graphs is to collect the unstructured text related to documents in a dataset (for example, abstracts for scientific publications or summaries of R&D project reports), convert the text into vectors/embeddings, and then calculate pairwise similarities to get similarity scores for each pair of documents to construct the final graph.

The Classical Way: TF-IDF

Converting unstructured text into ready-to-analyse vectors can be done in various ways. A classical way to approach this is to use a variant of Term Frequency-Inverse Document Frequency (TF-IDF). Here, all unstructured text is initially pre-processed using common techniques in Natural Language Processing (tokenization, lemmatization, stop-word removal, etc.), after which each token in a document is assigned a TF-IDF score. This score is based on how often the token appears in the document itself (TF) and on the inverse of how often it appears across all documents in the dataset (IDF). For each document, a vector with a length equalling the total number of unique tokens across all documents is then created, holding the TF-IDF scores for all tokens in the document and zeroes for any tokens that do not occur in the document.
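As a rough illustration of the vectorisation step described above, here is a minimal sketch in plain Python. It assumes the documents have already been tokenised and cleaned, and it uses a smoothed IDF, which is only one of several common TF-IDF variants. The function name `tfidf_vectors` and the sample tokens are made up for this example.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, vectors) for a list of already-tokenised documents."""
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    # Document frequency: in how many documents does each token occur?
    df = Counter(tok for doc in docs for tok in set(doc))
    # Smoothed IDF (an assumption; other variants exist).
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc) or 1  # guard against empty documents
        # One element per vocabulary token; absent tokens get a zero.
        vectors.append([counts[tok] / total * idf[tok] for tok in vocab])
    return vocab, vectors

# Illustrative, pre-processed documents.
docs = [
    ["battery", "cell", "cooling"],
    ["battery", "charging", "station"],
    ["lidar", "sensor", "calibration"],
]
vocab, vectors = tfidf_vectors(docs)
```

Every vector has one element per unique token in the corpus, which is why these vectors grow long and sparse as the dataset gets larger.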

Although this is a pretty intuitive way of converting text into vectors, it comes with several challenges. The main drawback is that semantic similarity is mostly overlooked in this approach, since the scores are simply based on term counts within and across documents. Also, the vectors resulting from TF-IDF are generally very sparse and can easily consist of thousands of elements per vector, depending on the size of the overall text corpus.

The Modern Way: Embedding Models

The rise of Large Language Models and Generative Artificial Intelligence has also resulted in the availability of a wide variety of embedding models and APIs to convert unstructured text into fixed-length vectors. For example, Nomic, Mixedbread, Jina, and OpenAI all offer APIs to get embeddings based on input of unstructured text of your choice. Some key use cases for these embedding models are query and document embedding for Retrieval Augmented Generation, but they also serve as an excellent basis for the large-scale embedding of datasets to create document similarity graphs.

The main benefits of these embedding models are that they capture semantic similarity between concepts and produce dense vectors of a fixed length (the dimensionality, often 768 or 1024 elements). A challenge is that users need to pick the model parameters carefully, since these can significantly impact the overall outcome when converting the vectors into document similarity graphs.

Graph Creation Based on Embeddings

We can construct a document similarity graph based on all pairwise similarities between the document vectors as soon as embeddings are generated for all documents in our dataset. The nodes in the graph are simply the original documents from our dataset, with weighted links drawn between nodes when they have a certain degree of similarity. A commonly used similarity metric is cosine similarity. Its scores range from -1 to 1 in general, although text embeddings typically yield values between 0 and 1 in practice; a score of 1 denotes vectors pointing in exactly the same direction, as is the case for identical texts. Links between nodes can be determined by setting a threshold similarity value.
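The construction described above can be sketched in a few lines of plain Python. The three-element vectors, the document ids and the threshold of 0.7 are purely illustrative assumptions; real embeddings would come from an embedding model and be much longer.

```python
import math
from itertools import combinations

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)

def similarity_edges(embeddings, threshold):
    """embeddings: dict doc_id -> vector. Returns (id_a, id_b, weight) links."""
    edges = []
    for (id_a, vec_a), (id_b, vec_b) in combinations(embeddings.items(), 2):
        score = cosine_similarity(vec_a, vec_b)
        if score >= threshold:
            edges.append((id_a, id_b, score))
    return edges

# Toy embeddings (hypothetical values).
embeddings = {
    "doc1": [0.9, 0.1, 0.0],
    "doc2": [0.8, 0.2, 0.1],
    "doc3": [0.0, 0.1, 0.9],
}
edges = similarity_edges(embeddings, threshold=0.7)
```

Comparing all pairs scales quadratically with the number of documents, so this brute-force loop is only meant to show the idea on small datasets.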

The exact value used here can have a significant impact on the readability of the graph: setting the threshold too low will often lead to a hairball/spaghetti bowl visualization (too many links between nodes), while setting it too high will show many disparate clusters with no connections between them. When constructing a graph, it is therefore important to give this some thought and also relate it to the actual size of the text fragments you are dealing with – shorter strings (titles) usually go well with higher threshold values, while longer strings (abstracts, summaries) usually combine well with lower threshold values.
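A quick way to get a feel for this trade-off is to count links and unconnected nodes at a few candidate thresholds before drawing anything. The sketch below assumes the pairwise scores are already available; the scores, node names and the helper `graph_stats` are invented for illustration.

```python
def graph_stats(pair_scores, n_nodes, threshold):
    """Return (number of links, number of isolated nodes) for one threshold."""
    kept = [pair for pair, score in pair_scores.items() if score >= threshold]
    connected = {node for pair in kept for node in pair}
    return len(kept), n_nodes - len(connected)

# Made-up similarity scores for four documents.
pair_scores = {
    ("a", "b"): 0.91, ("a", "c"): 0.62, ("a", "d"): 0.35,
    ("b", "c"): 0.58, ("b", "d"): 0.30, ("c", "d"): 0.74,
}
for threshold in (0.3, 0.6, 0.9):
    links, isolated = graph_stats(pair_scores, n_nodes=4, threshold=threshold)
    print(f"threshold={threshold}: {links} links, {isolated} isolated nodes")
```

On this toy data the lowest threshold keeps every possible link, while the highest leaves half of the nodes without any connection.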

Community Detection for Technology Cluster Identification

As soon as the nodes and links in the graph have been constructed based on the embedding similarities and the threshold set, we can start analysing the graph of documents to uncover clusters of related content. In practice, this is a very important step to make the graph more readable and understandable. In innovation analytics, gaining insight into which technology clusters are present in a dataset and how they connect and evolve is often key to a project’s success.

An excellent way to uncover these clusters is by using the Leiden community detection algorithm, now available in Memgraph. Based on the structure of the graph, this algorithm detects densely connected subsets of nodes and iteratively assigns them to the same communities. In the end, when colouring nodes based on the communities they are assigned, we have an excellent basis to start labelling and annotating the graph to make sense of its contents.
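To make "densely connected" a bit more concrete: Leiden searches for a partition that scores well on a quality function, with modularity being a common choice. The sketch below only evaluates that score for a given partition. It is not the Leiden algorithm itself and not Memgraph's implementation; the function name and the toy graph are assumptions made for illustration.

```python
from collections import defaultdict

def modularity(edges, community_of):
    """Weighted modularity of a partition of an undirected graph.

    edges: list of (node_a, node_b, weight) tuples.
    community_of: dict mapping every node to a community id.
    """
    total_weight = sum(w for _, _, w in edges)
    internal = defaultdict(float)  # link weight inside each community
    degree = defaultdict(float)    # summed weighted degree per community
    for a, b, w in edges:
        degree[community_of[a]] += w
        degree[community_of[b]] += w
        if community_of[a] == community_of[b]:
            internal[community_of[a]] += w
    return sum(
        internal[c] / total_weight - (degree[c] / (2 * total_weight)) ** 2
        for c in degree
    )

# Two triangles joined by a single bridge link (toy data).
edges = [
    (0, 1, 1.0), (1, 2, 1.0), (0, 2, 1.0),
    (3, 4, 1.0), (4, 5, 1.0), (3, 5, 1.0),
    (2, 3, 1.0),
]
two_groups = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
one_group = {node: "A" for node in range(6)}
score_split = modularity(edges, two_groups)
score_merged = modularity(edges, one_group)
```

Splitting the toy graph into its two triangles scores higher than lumping all nodes together, which matches the intuition that each triangle is a community of its own.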

LLM-Powered Innovation Cluster Labelling

In the analysis of technology and topic graphs, providing clear labelling and annotation of the resulting graph visualizations is key to gaining insights by stakeholders in an innovation analytics project. Annotated visuals are often used to provide initial high-level overviews of a graph’s contents in presentations, and often serve as a basis for further deep dives into specific clusters of interest.

A classical approach to initial cluster labelling is to treat each cluster's contents as a separate corpus of documents and then run a version of TF-IDF to extract the top-5 highest-scoring tokens or phrases for each cluster. The resulting labels often provide a decent first indication of a cluster’s contents, but they do require subsequent manual analysis and editing to improve their readability.
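One possible reading of this classical approach is sketched below: the tokens of all documents in a cluster are pooled into a single pseudo-document, and TF-IDF is then computed across clusters so that tokens shared by many clusters are down-weighted. This pooling choice, the smoothed IDF, the function name and the sample tokens are all assumptions for illustration rather than the exact variant used in practice.

```python
import math
from collections import Counter

def tfidf_cluster_labels(cluster_tokens, top_n=5):
    """cluster_tokens: dict cluster_id -> list of tokens pooled from its documents."""
    n_clusters = len(cluster_tokens)
    # In how many clusters does each token occur?
    df = Counter(tok for toks in cluster_tokens.values() for tok in set(toks))
    labels = {}
    for cluster_id, toks in cluster_tokens.items():
        counts = Counter(toks)
        scores = {
            tok: (count / len(toks))
            * (math.log((1 + n_clusters) / (1 + df[tok])) + 1)
            for tok, count in counts.items()
        }
        # Highest score first; ties broken alphabetically for stable output.
        ranked = sorted(scores, key=lambda tok: (-scores[tok], tok))
        labels[cluster_id] = ranked[:top_n]
    return labels

# Toy clusters with already pre-processed tokens.
cluster_tokens = {
    0: ["lidar", "sensor", "lidar", "vehicle", "fusion"],
    1: ["battery", "charging", "battery", "vehicle", "cell"],
}
labels = tfidf_cluster_labels(cluster_tokens, top_n=3)
```

The token "vehicle" occurs in both toy clusters and therefore ranks below the cluster-specific tokens.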

An exciting alternative way to label clusters is to use a Large Language Model to summarize cluster contents. In our case, we utilize locally running models such as Llama 3.1 in Ollama or LM Studio based on the following high-level process:

For each detected community, we first gather relevant unstructured text from the attributes of the nodes in the cluster. In most cases, we have found that sending over collections of document titles per cluster works very well for cluster labelling. This collection of texts is then added to a prompt that specifies exactly how the LLM should respond in its summarization: based on the texts provided, return a short summary/label consisting of a maximum of 5 words with an indication of the high-level topic. Many LLMs are prone to adding a lot of introductory (“Absolutely! Here is a summary of…”) and concluding text to answers, so the prompt also specifies that it should never do this and purely focus on returning the labels.

As soon as the LLM finishes providing the labels for all communities, we replace the nodes’ initial community attribute with the newly created label. Of course, these labels do require manual checks to see whether they make sense and sometimes require slight adjustments because they are too high-level. The quality of the labels is also dependent on the LLM itself: we’ve found that larger models such as Llama 3.1 (8B parameters) generally provide better labels than smaller models such as Llama 3.2 (3B parameters).
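The labelling process above could be sketched roughly as follows. The prompt wording is illustrative and not the actual prompt used in practice; the function names are hypothetical. The request assumes an Ollama server running locally on its default port with a model pulled under the tag `llama3.1`, and uses Ollama's `/api/generate` endpoint in non-streaming mode.

```python
import json
import urllib.request

def build_label_prompt(titles, max_words=5):
    """Assemble a labelling prompt from the document titles of one cluster."""
    listing = "\n".join(f"- {title}" for title in titles)
    return (
        "Below are titles of documents that belong to one cluster.\n"
        f"{listing}\n\n"
        f"Return a single label of at most {max_words} words describing the "
        "high-level topic. Do not add any introduction, explanation or "
        "closing remarks. Respond with the label only."
    )

def label_cluster(titles, model="llama3.1",
                  url="http://localhost:11434/api/generate"):
    """Send the prompt to a local Ollama server (assumed) and return the label."""
    payload = json.dumps(
        {"model": model, "prompt": build_label_prompt(titles), "stream": False}
    ).encode("utf-8")
    request = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"].strip()

def apply_labels(node_communities, community_labels):
    """Swap each node's community id for its label, keeping the id as fallback."""
    return {
        node: community_labels.get(community, community)
        for node, community in node_communities.items()
    }

# Example prompt for a toy cluster (titles are made up).
prompt = build_label_prompt(["Lidar calibration for urban driving",
                             "Sensor fusion in autonomous vehicles"])
```

Calling `label_cluster(titles)` once per community and passing the results to `apply_labels` covers the full loop; the returned labels still need the manual review mentioned above.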

Cluster Summarization Using LLMs

Another valuable way to use LLMs in document similarity graph analysis is to further enhance users’ understanding of clusters by providing longer, point-and-click summaries of what the documents in a cluster are about. The approach here is similar to the LLM-based labelling described above, with the prompt sent to the model focusing on providing an overall summary consisting of 3 to 5 phrases instead.

Practically speaking, users of a graph visualization select nodes of their interest using a free-form selection tool and point out which unstructured text attribute should be used for the analysis, after which the LLM returns a summary based on the collection of texts sent to it. The summary is then printed in a window right on top of the visual, as in the example below:

Up Next: Step-By-Step Real-Life Examples and Visuals

This post provided an overview of how text embeddings can be used to construct document similarity graphs for innovation analysis, and how Large Language Models can aid in the labeling and summarization of the resulting graphs. Our next post in this series will show examples of this in practice using Kenelyze, based on a real-life dataset of the innovation output of a major high-tech company. It will also discuss the importance of local LLMs when working with sensitive data, and highlight some technical considerations when picking and configuring a local LLM.

Innovation as a Graph: Improved Insight into Technology Clusters, Collaboration and Knowledge Networks

André Vermeij — Wed, 18 Sep 2024 12:42:47 +0000

Guest Author: André Vermeij, Founder of Kenedict Innovation Analytics & Developer of Kenelyze

Organisations focused on innovation come in many forms, including corporations with large Research & Development (R&D) departments, universities, research institutions active in advancing science, and startups working on what is potentially the next big thing. Innovation-related data has become increasingly important for each of these organisations to inform decision-making and stay ahead of market developments. For example, an R&D-intensive corporation could use data to benchmark its own technology portfolio against its direct competitors, while a startup might be analysing data to assess previous activity and potential market entry in a sector of interest.

Traditional Innovation Analysis

The traditional way to look at innovation-related data is to report on output within a topic or organisation of interest based on counts and sums of variables of interest. When analysing its competition, a business may for example gather information on a competitor’s recent output and report on the number of documents in each technology domain, produce a list of the companies the competitor has worked with, or generate an overview of the most active inventors or researchers in a field of interest. Although all these analyses can be valuable in their own right, they’re missing out on a key aspect of an innovation ecosystem: the connections between technologies, organisations, and people.

A Graph of Innovation

Viewing innovation and its output as a graph of interconnected data points allows us to get a much deeper understanding of the technology and knowledge structures in a context of interest. Using the metadata in a wide array of innovation-related data sources, which will be discussed more in the following section, it is possible to create graphs of connected documents, organisations and people and gain new insights into the actual underpinnings of innovative activity.

For example, innovation graphs allow us to answer questions, such as:

Which clusters of activity can we distinguish within a topic or organisation of interest, and how has this evolved?
What do the organisational collaboration networks in an area of interest look like, and who are the key players in network connectivity?
How are teams of individual experts in a specific field composed, and who are the leading experts in a given topic?

Open Data Sources for Innovation Analytics

Until just a few years ago, quality innovation data was quite hard to come by without a subscription to an expensive database hosting patent information or scientific publications. Luckily, in recent years, there has been a move towards more openly available data, which can serve as an excellent basis for setting up a wide variety of innovation graphs.

Here’s a quick overview of common data sources:

Patents: organisations apply for patents to protect their inventions against commercialisation by third parties. Patent applications and grants are published online by national patent offices around the world, with databases gathering data from all jurisdictions and providing a wide array of metadata. A great open data source is the European Patent Office's Open Patent Services (OPS) API, or the EPO's search platform Espacenet.
Scientific publications: journal publications, conference proceedings, book chapters and various other types of scientific output are gathered in databases which bring together output from many sources. Paid databases such as Scopus are still used often by large organisations – great open alternatives include OpenAlex and Semantic Scholar.
Subsidies & funding programmes: governmental subsidies to stimulate innovation and R&D in specific areas are often structured in openly available data sources. A good example is the European Union’s CORDIS data for the Horizon Europe programme. Many national enterprise agencies also publish their granted subsidies and projects online.
Internal data: the above data sources are often augmented with internal, unpublished data (e.g., internal project reports, unfiled patent applications, scientific output in the review stage) to get a view on very recent activity within an organisation. This is especially valuable when creating knowledge graphs within organisations or carrying out an innovation portfolio analysis for a specific client.

In a typical Innovation Analytics project, combining data from multiple of the above data sources is often key to gaining the best insights. For example, organisations applying for patents often also have scientific output related to the same theme and may also apply for governmental funding. To get a picture of innovative activity that is as complete as possible, it is therefore important to look at activity from multiple data sources and graph perspectives.

Graphs of Documents: Insight into Technology and Knowledge Clusters

The analysis and visualisation of innovation graphs often starts with looking at the relationships between documents based on a shared characteristic.

Depending on the goals of the analysis, there are various ways to link documents together:

Text similarity: unstructured text data in the form of document titles, abstracts and summaries can be used to connect documents when there is a high similarity between their contents. This relies on vectorisation of the text of interest and subsequent calculation of pairwise cosine similarities, where a link is then drawn between documents based on a minimum similarity score.
Knowledge flows / shared authors: another way to generate clusters of connected documents is to link them when the same people have worked on them. The authorship data on documents can be used to accomplish this. The key assumption here is that documents are part of the same “knowledge cluster” when persons with specific expertise have (co-) written them.
Citations: numerous citations to other documents can be found in both scientific publications and patent applications. We can use these citations to create various types of graphs:
- Shared references: connect documents when they cite the same sources, often with a minimum number of shared citations set as the weight for the links.
- Shared citing documents: connect documents when they have been cited by the same other documents, again often with a minimum weight set.
- Direct citations: creation of citation graphs where links are drawn between documents when they cite each other.
Technology classifications: patent documents are categorised using classification codes designating the technology areas which they fall into. These can be used to connect documents when they share one or multiple codes, essentially creating clusters of documents based on technological overlap.
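As an example of one of the linking strategies above, the shared-references approach can be sketched as follows. The document ids, reference ids and the function name are invented for illustration; the same pattern applies to shared authors or shared classification codes by swapping in a different set per document.

```python
from itertools import combinations

def shared_reference_links(references, min_shared=2):
    """references: dict doc_id -> set of cited source ids.

    Returns (doc_a, doc_b, weight) links, where the weight is the number
    of sources cited by both documents.
    """
    links = []
    for (doc_a, refs_a), (doc_b, refs_b) in combinations(references.items(), 2):
        shared = len(refs_a & refs_b)
        if shared >= min_shared:
            links.append((doc_a, doc_b, shared))
    return links

# Toy citation data.
references = {
    "paper1": {"r1", "r2", "r3"},
    "paper2": {"r2", "r3", "r4"},
    "paper3": {"r4", "r5"},
}
links = shared_reference_links(references, min_shared=2)
```

Lowering `min_shared` to 1 would also connect "paper2" and "paper3", which illustrates how the minimum weight controls the density of the resulting graph.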

The following graph is an example of a text similarity approach, where scientific publications in the area of autonomous vehicles are connected when they share significant textual content. Colors depict clusters of activity based on the outcomes of a community detection algorithm, and nodes are sized based on the number of times they were cited by other papers:

Figure 1: Graph of scientific publications linked based on text similarity approach

Graphs of Organisations: Insight into Collaboration Ecosystems

Another graph perspective, which is very common in innovation analysis, focuses on mapping the connections between organisations (businesses, universities, research institutions, public bodies, hospitals, etc.).

Many of the data sources above hold extensive metadata on the organisations responsible for the documents—scientific authors are affiliated with their employers, patents are applied for by the parties seeking protection of their invention and governmental subsidies are often received by consortia of collaborating organisations.

It is common to attach weights to the links based on the number of collaborations between two organisations. Using these weights, it is then possible to filter the graph to focus only on the strongest / most frequently occurring collaborations.
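A minimal sketch of this weighting and filtering step, assuming each document simply comes with a list of the organisations involved. The organisation names and the function name are hypothetical.

```python
from collections import Counter
from itertools import combinations

def collaboration_links(documents, min_weight=1):
    """documents: list of lists, each naming the organisations on one document.

    Returns a dict mapping organisation pairs to their number of joint documents.
    """
    weights = Counter()
    for organisations in documents:
        # Sort so that (A, B) and (B, A) count as the same undirected link.
        for pair in combinations(sorted(set(organisations)), 2):
            weights[pair] += 1
    return {pair: w for pair, w in weights.items() if w >= min_weight}

# Toy data: three documents with their participating organisations.
documents = [
    ["Uni A", "Hospital B"],
    ["Hospital B", "Uni A", "Company C"],
    ["Company C", "Uni A"],
]
all_links = collaboration_links(documents)
strong_links = collaboration_links(documents, min_weight=2)
```

With `min_weight=2`, only the pairs that collaborated on at least two documents remain in the graph.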

The graph below shows an example of collaboration in radiotherapy innovation, where colors are based on the type of organisation (e.g. blue = universities, green = hospitals and medical centers) and node sizes are based on betweenness centrality scores:

Figure 2: Collaboration in radiotherapy

Graphs of People: Insight into Expertise and Knowledge Networks

This is a graph perspective that often follows after mapping organizational collaboration networks, focusing on the actual person-to-person collaborations taking place to produce the analysed output.

Using the author/inventor metadata on documents, we draw links between people when they have co-authored a document. Similar to the organisational networks, we can also attach weights to the links, which correspond to the number of documents which have been worked on jointly by two authors. This perspective can provide a deep understanding of the actual team structures and knowledge networks within and outside of organisations.

Here’s an example of the (relatively large!) network of inventors who have worked on Apple patents. Nodes are sized based on their betweenness centralities, and colors are based on clusters detected by a community detection algorithm:

Figure 3: Apple's inventor network

Graph Metrics & Innovation Insights

The above examples show various ways to convert innovation data into actionable graph visualisations. In the actual analysis and interpretation of these graphs, it is important to make good use of the many metrics available in graph analytics. These metrics can help us understand which clusters are present in a network, and can aid in determining the importance of nodes based on centrality measures.

The following metrics are valuable for analysing the overall graph structure in innovation analysis:

Component analysis: determining the components (interconnected subsets of nodes) in the graph to be able to see how far the graph is interconnected (how many nodes can reach each other directly or indirectly) and to determine the impact of the largest connected components versus smaller components.
K-Cores: to determine highly connected subsets of nodes in graphs, k-cores can be used to highlight subgraphs in which every node has a degree of at least k within that subgraph. This can be used to focus quickly on tightly knit groups of nodes (a looser notion than a clique, in which every node is linked to every other node) and is especially valuable when analysing collaboration and knowledge networks.
Community detection: using an algorithm such as the Leiden community detection algorithm to determine which clusters we can distinguish within the components. These clusters then serve as the basis for graph annotation, where clusters are labeled based on their actual contents (see the labels in the autonomous driving graph above).
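The first two structural metrics above are simple enough to sketch in plain Python. The example assumes an unweighted, undirected graph given as a list of node pairs; the function names and the toy graph are made up for illustration, and a graph library or database would normally handle this at scale.

```python
from collections import defaultdict, deque

def build_adjacency(edges):
    """Turn (node_a, node_b) pairs into a dict node -> set of neighbours."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    return adjacency

def connected_components(adjacency):
    """Return the components as sets of nodes, largest first."""
    seen, components = set(), []
    for start in adjacency:
        if start in seen:
            continue
        component, queue = set(), deque([start])
        seen.add(start)
        while queue:  # breadth-first search from the start node
            node = queue.popleft()
            component.add(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        components.append(component)
    return sorted(components, key=len, reverse=True)

def k_core(adjacency, k):
    """Repeatedly drop nodes that have fewer than k remaining neighbours."""
    remaining = {node: set(neigh) for node, neigh in adjacency.items()}
    changed = True
    while changed:
        changed = False
        for node in list(remaining):
            if len(remaining[node]) < k:
                for neighbour in remaining.pop(node):
                    remaining[neighbour].discard(node)
                changed = True
    return set(remaining)

# Toy graph: a triangle with one pendant node, plus a separate pair.
adjacency = build_adjacency(
    [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"), ("e", "f")]
)
components = connected_components(adjacency)
core = k_core(adjacency, k=2)
```

On the toy graph, the largest component holds four nodes, while the 2-core keeps only the triangle because the pendant node and the separate pair are pruned away.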

On the individual node level, degree and betweenness centrality measures can be used to determine the importance of nodes in innovation graphs:

Degree Centrality: determining simple connection counts per node to quickly see which actors are most important in terms of the number of other nodes they are connected to. Since most innovation graphs are weighted (links have weights associated with them), weighted degree centrality is also used regularly.
Betweenness Centrality: this is a frequently used metric to determine who holds key positions in a graph in terms of hub positions – which organizations/people are the “key connectors” between clusters/teams? It is calculated by determining how often each node appears on the shortest paths between all other nodes in the network.
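Both node-level measures can be sketched in plain Python. The betweenness function below follows Brandes' algorithm for unweighted, undirected graphs and returns unnormalised scores; link weights are ignored there, which is a simplification. Function names and the toy data are assumptions for illustration.

```python
from collections import defaultdict, deque

def weighted_degree(edges):
    """edges: list of (node_a, node_b, weight). Returns summed weights per node."""
    degree = defaultdict(float)
    for a, b, w in edges:
        degree[a] += w
        degree[b] += w
    return dict(degree)

def betweenness_centrality(adjacency):
    """adjacency: dict node -> set of neighbours (every node must be a key)."""
    betweenness = dict.fromkeys(adjacency, 0.0)
    for source in adjacency:
        # Breadth-first search that counts shortest paths from the source.
        order = []
        predecessors = {node: [] for node in adjacency}
        n_paths = dict.fromkeys(adjacency, 0)
        distance = dict.fromkeys(adjacency, -1)
        n_paths[source], distance[source] = 1, 0
        queue = deque([source])
        while queue:
            node = queue.popleft()
            order.append(node)
            for neighbour in adjacency[node]:
                if distance[neighbour] < 0:
                    distance[neighbour] = distance[node] + 1
                    queue.append(neighbour)
                if distance[neighbour] == distance[node] + 1:
                    n_paths[neighbour] += n_paths[node]
                    predecessors[neighbour].append(node)
        # Walk back from the farthest nodes and accumulate path shares.
        dependency = dict.fromkeys(adjacency, 0.0)
        for node in reversed(order):
            for pred in predecessors[node]:
                share = n_paths[pred] / n_paths[node]
                dependency[pred] += share * (1 + dependency[node])
            if node != source:
                betweenness[node] += dependency[node]
    # Each undirected pair was counted once from either end.
    return {node: score / 2 for node, score in betweenness.items()}

# Toy chain a - b - c - d: the two middle nodes act as connectors.
adjacency = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
scores = betweenness_centrality(adjacency)
degrees = weighted_degree([("a", "b", 2.0), ("b", "c", 1.0), ("c", "d", 1.0)])
```

In the toy chain, the end nodes score zero while the middle nodes each sit on the shortest paths of two node pairs, which is exactly the "key connector" role described above.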

Up Next: Use Cases

Now that you have an initial understanding of the main ideas behind innovation graphs, we will showcase practical use cases, real-world client examples and common challenges in innovation graph analysis in the next blog post. Stay tuned!