Innovation Graph Analytics Powered by Embeddings and LLMs

Guest Author: André Vermeij, Founder of Kenedict Innovation Analytics & Developer of Kenelyze

Intro & Recap: Innovation Graphs

The first post in our series on Innovation Graphs introduced the use of graphs in the analysis of innovation and its outputs, such as patents, scientific publications and research grants.

Innovation graphs focus on mapping the connections between technologies, organisations and people and can provide new insights into the actual underpinnings of innovative activity within topics or organisations of interest. They can be constructed based on all kinds of metadata and often focus on visually mapping three complementary perspectives:

  • Graphs of documents to gain deeper insight into technology/topic clusters.
  • Graphs of organisations and institutions to focus on sector-wide collaboration patterns.
  • Graphs of people/experts to get a better understanding of team-level collaboration and key players in a field of expertise.

In this second post on Innovation Graphs, we’ll focus on the creation and LLM-powered analysis of the first type of graph mentioned above—graphs detailing clusters of technologies and topics within a specific sector of interest. Specifically, we’ll dive into how we can use text embeddings to construct document similarity graphs, and how we can automatically analyse the content and label the graph’s clusters using locally running Large Language Models.


Text Embeddings & Graph Creation

Mapping clusters of technology and the connections between them is a key part of most innovation analytics projects. A common way to create the related document similarity graphs is to collect the unstructured text attached to the documents in a dataset (for example, abstracts of scientific publications or summaries of R&D project reports), convert the text into vectors/embeddings, and then calculate pairwise similarity scores between documents to construct the final graph.

The Classical Way: TF-IDF

Converting unstructured text into ready-to-analyse vectors can be done in various ways. A classical way to approach this is to use a variant of Term Frequency-Inverse Document Frequency (TF-IDF). Here, all unstructured text is initially pre-processed using common techniques in Natural Language Processing (tokenization, lemmatization, stop-word removal, etc.), after which each token in a document is assigned a TF-IDF score. This score is based on how often the token appears in the document itself (TF) and on the inverse of how often it appears across all documents in the dataset (IDF). For each document, a vector with a length equalling the total number of unique tokens across all documents is then created, holding the TF-IDF scores for all tokens in the document and zeroes for any tokens that do not occur in the document.

Simplified version of TF-IDF vector construction
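To make this concrete, here is a minimal sketch of TF-IDF vectorisation using scikit-learn; the example abstracts are invented for illustration.

```python
# A minimal sketch of TF-IDF vectorisation with scikit-learn.
# The example abstracts are made up for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer

abstracts = [
    "A method for thermal management of lithium-ion battery packs.",
    "Cooling strategies for battery packs in electric vehicles.",
    "Graph neural networks for molecular property prediction.",
]

# TfidfVectorizer handles tokenization, lowercasing and stop-word
# removal; lemmatization would need an extra NLP step (e.g. spaCy).
vectorizer = TfidfVectorizer(stop_words="english")
tfidf_matrix = vectorizer.fit_transform(abstracts)

# One sparse row per document, one column per unique token.
print(tfidf_matrix.shape)                      # (3, n_unique_tokens)
print(vectorizer.get_feature_names_out()[:10])
```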

Although this is a pretty intuitive way of converting text into vectors, it comes with several challenges. The main drawback is that semantic similarity is mostly overlooked in this approach, since the scores are simply based on term counts within and across documents. Also, the vectors resulting from TF-IDF are generally very sparse and can easily consist of thousands of elements per vector, depending on the size of the overall text corpus.

The Modern Way: Embedding Models

The rise of Large Language Models and Generative Artificial Intelligence has also resulted in the availability of a wide variety of embedding models and APIs to convert unstructured text into fixed-length vectors. For example, Nomic, Mixedbread, Jina, and OpenAI all offer APIs to get embeddings based on input of unstructured text of your choice. Some key use cases for these embedding models are query and document embedding for Retrieval Augmented Generation, but they also serve as an excellent basis for the large-scale embedding of datasets to create document similarity graphs.

The main benefits of these embedding models are that they also capture semantic similarity between concepts and produce dense vectors of a fixed size (often 768 or 1024 elements, referred to as the model’s dimensionality). A challenge is that users need to pick model and similarity parameters carefully, since these choices can significantly impact the outcome when converting the vectors into document similarity graphs.
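As an illustration, embeddings can also be generated fully locally via the sentence-transformers library; the model name below is just one example of a 1024-dimensional model, and any compatible embedding model can be swapped in.

```python
# A sketch of local embedding generation with sentence-transformers.
# The model name is one example (a 1024-dimensional model); any
# compatible embedding model can be swapped in.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")

abstracts = [
    "A method for thermal management of lithium-ion battery packs.",
    "Graph neural networks for molecular property prediction.",
]

# Normalising the vectors makes cosine similarity a simple dot product.
embeddings = model.encode(abstracts, normalize_embeddings=True)
print(embeddings.shape)  # (2, 1024)
```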

Graph Creation Based on Embeddings

As soon as embeddings have been generated for all documents in our dataset, we can construct a document similarity graph based on the pairwise similarities between the document vectors. The nodes in the graph are simply the original documents from our dataset, with weighted links drawn between nodes when they have a certain degree of similarity. A commonly used similarity metric is cosine similarity; for typical text embeddings, scores range from 0 to 1, with 1 denoting identical texts/vectors. Links between nodes can then be determined by setting a threshold similarity value.

The exact value used here can have a significant impact on the readability of the graph: setting the threshold too low will often lead to a hairball/spaghetti bowl visualization (too many links between nodes), while setting it too high will show many disparate clusters with no connections between them. When constructing a graph, it is therefore important to give this some thought and also relate it to the actual size of the text fragments you are dealing with – shorter strings (titles) usually go well with higher threshold values, while longer strings (abstracts, summaries) usually combine well with lower threshold values.
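A minimal sketch of this graph construction step using numpy and networkx is shown below; low-dimensional random vectors stand in for real embeddings so the demo produces some links, and the 0.6 threshold is purely illustrative, not a recommendation.

```python
# Building a thresholded document similarity graph from embeddings.
# Low-dimensional random vectors stand in for real embeddings here so
# the demo produces some links; the 0.6 threshold is illustrative.
import numpy as np
import networkx as nx

rng = np.random.default_rng(42)
embeddings = rng.normal(size=(100, 8))
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)

# For L2-normalised vectors, cosine similarity is the dot product.
similarities = embeddings @ embeddings.T

THRESHOLD = 0.6
graph = nx.Graph()
graph.add_nodes_from(range(len(embeddings)))

# Only the upper triangle, to avoid self-links and duplicate pairs.
rows, cols = np.where(np.triu(similarities, k=1) >= THRESHOLD)
for i, j in zip(rows, cols):
    graph.add_edge(int(i), int(j), weight=float(similarities[i, j]))

print(graph.number_of_nodes(), "nodes,", graph.number_of_edges(), "links")
```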

Community Detection for Technology Cluster Identification

As soon as the nodes and links in the graph have been constructed based on the embedding similarities and the chosen threshold, we can start analysing the graph of documents to uncover clusters of related content. In practice, this is a very important step to make the graph more readable and understandable. In innovation analytics, gaining insight into which technology clusters are present in a dataset and how they connect and evolve is often key to a project’s success.

An excellent way to uncover these clusters is the Leiden community detection algorithm, now available in Memgraph. Based on the structure of the graph, this algorithm detects densely connected subsets of nodes and iteratively assigns them to communities. When we then colour nodes based on their assigned communities, we have an excellent basis for labelling and annotating the graph to make sense of its contents.

Uncoloured graph vs the same graph coloured based on communities
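Memgraph exposes Leiden through its MAGE library; as a standalone illustration of what the algorithm produces, here is a sketch using the leidenalg package on the networkx graph built in the previous sketch.

```python
# Standalone Leiden sketch with the leidenalg package (the post itself
# uses Memgraph's built-in Leiden; this illustrates the same algorithm
# on the `graph` object from the previous sketch).
import igraph as ig
import leidenalg as la

g = ig.Graph.from_networkx(graph)
partition = la.find_partition(
    g,
    la.ModularityVertexPartition,
    weights="weight",  # use the similarity scores as link weights
)

# Map each original node to its community id, e.g. for colouring.
communities = dict(zip(g.vs["_nx_name"], partition.membership))
print(len(set(partition.membership)), "communities detected")
```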

LLM-Powered Innovation Cluster Labelling

In the analysis of technology and topic graphs, clear labelling and annotation of the resulting graph visualizations is key to helping stakeholders in an innovation analytics project gain insights. Annotated visuals are often used to provide initial high-level overviews of a graph’s contents in presentations, and often serve as a basis for further deep dives into specific clusters of interest.

A classical approach to initial cluster labelling is to treat each cluster's contents as a separate corpus of documents and then run a version of TF-IDF to extract the top-5 highest-scoring tokens or phrases for each cluster. The resulting labels often provide a decent first indication of a cluster’s contents, but they usually require subsequent manual refinement to improve their readability.
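For illustration, here is a minimal sketch of one common variant of this classical approach, which treats each cluster as a single pseudo-document; the cluster contents are invented.

```python
# Classical cluster labelling: treat each cluster as one
# pseudo-document and keep its five highest-scoring TF-IDF terms.
# Cluster contents here are invented for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer

clusters = {
    0: ["Battery thermal management systems", "Cooling of EV battery packs"],
    1: ["Graph neural networks", "Molecular property prediction with GNNs"],
}

# Concatenating each cluster's documents means the IDF part
# downweights terms that occur across many clusters.
corpus = [" ".join(titles) for titles in clusters.values()]
vectorizer = TfidfVectorizer(stop_words="english")
scores = vectorizer.fit_transform(corpus).toarray()
terms = vectorizer.get_feature_names_out()

for cluster_id, row in zip(clusters, scores):
    top = [terms[i] for i in row.argsort()[::-1][:5] if row[i] > 0]
    print(cluster_id, top)
```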

An exciting alternative way to label clusters is to use a Large Language Model to summarize cluster contents. In our case, we utilize locally running models such as Llama 3.1 in Ollama or LM Studio based on the following high-level process:

High-level process for LLM-based cluster labelling

For each detected community, we first gather relevant unstructured text from the attributes of the nodes in the cluster. In most cases, we have found that sending over collections of document titles per cluster works very well for cluster labelling. This collection of texts is then added to a prompt that specifies exactly how the LLM should respond in its summarization: based on the texts provided, return a short summary/label consisting of a maximum of 5 words with an indication of the high-level topic. Many LLMs are prone to adding a lot of introductory (“Absolutely! Here is a summary of…”) and concluding text to their answers, so the prompt also specifies that the model should never do this and purely focus on returning the labels.
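A sketch of this labelling step using the ollama Python client, assuming a local Ollama server with the llama3.1 model pulled; the prompt wording is illustrative rather than the exact prompt used here.

```python
# LLM cluster labelling via the ollama Python client, assuming a local
# Ollama server with the llama3.1 model pulled. The prompt wording is
# illustrative rather than the exact prompt used in the post.
import ollama

def label_cluster(titles: list[str]) -> str:
    prompt = (
        "Below is a list of document titles from one cluster of a "
        "document similarity graph. Return ONLY a high-level topic "
        "label of at most 5 words, without any introductory or "
        "concluding text.\n\n" + "\n".join(titles)
    )
    response = ollama.chat(
        model="llama3.1",
        messages=[{"role": "user", "content": prompt}],
    )
    return response["message"]["content"].strip()

print(label_cluster([
    "Battery thermal management systems",
    "Cooling strategies for EV battery packs",
]))
```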

As soon as the LLM finishes providing the labels for all communities, we replace the nodes’ initial community attribute with the newly created label. Of course, these labels do require manual checks to see whether they make sense, and sometimes need slight adjustments because they are too high-level. The quality of the labels also depends on the LLM itself: we’ve found that larger models such as Llama 3.1 (8B parameters) generally provide better labels than smaller models such as Llama 3.2 (3B parameters).

Cluster Summarization Using LLMs

Another valuable way to use LLMs in document similarity graph analysis is to further enhance users’ understanding of clusters by providing on-demand, point-and-click summaries of what the documents in a cluster are about. The approach here is similar to the LLM-based labelling described above, with the prompt sent to the model focusing on providing an overall summary consisting of 3 to 5 phrases instead.
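Relative to the labelling sketch above, the only real change is the instruction in the prompt, along these lines (wording again illustrative):

```python
# Summarisation variant of the earlier labelling sketch: only the
# instruction in the prompt changes (wording again illustrative).
import ollama

def summarize_cluster(texts: list[str]) -> str:
    prompt = (
        "Summarize the overall topic of the following documents in "
        "3 to 5 phrases. Return only the summary itself, without any "
        "introductory or concluding remarks.\n\n" + "\n".join(texts)
    )
    response = ollama.chat(
        model="llama3.1",
        messages=[{"role": "user", "content": prompt}],
    )
    return response["message"]["content"].strip()
```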

Practically speaking, users of a graph visualization select nodes of interest using a free-form selection tool and indicate which unstructured text attribute should be used for the analysis, after which the LLM returns a summary based on the collection of texts sent to it. The summary is then printed in a window right on top of the visual, as in the example below:

AI summary of a selection of nodes

Up Next: Step-By-Step Real-Life Examples and Visuals

This post provided an overview of how text embeddings can be used to construct document similarity graphs for innovation analysis, and how Large Language Models can aid in the labelling and summarization of the resulting graphs. Our next post in this series will show examples of this in practice using Kenelyze, based on a real-life dataset of the innovation output of a major high-tech company. It will also discuss the importance of local LLMs when working with sensitive data, and highlight some technical considerations when picking and configuring a local LLM.
