DEV Community: Memgraph

When Should You Use Query-Focused Summarization in GraphRAG?

Sabika Tasneem — Wed, 17 Jun 2026 11:14:17 +0000

A product lead asks your AI assistant what customers keep complaining about across thousands of reviews.

A Text2Cypher query cannot answer that directly. Local graph search may only explain one product or one user. The answer needs synthesis across a broader corpus. That is where query-focused summarization fits.

In this post, we'll look at when GraphRAG needs this global retrieval pattern, how it differs from Text2Cypher and local graph search, and why keeping the pipeline close to the graph matters.

Why Global Questions Need a Different Approach

Not every GraphRAG question has the same shape. Some questions are analytical:

How many issues are labeled as bugs?

Others are contextual:

Which issues are related to this pull request?

And some are global:

What are the recurring complaints across this product category?

The first question is best answered with Text2Cypher. The second is best answered with local graph search. The third is different.

The answer does not live in a single node, relationship, or graph neighborhood. It emerges from patterns spread across many connected records. Global questions often ask for:

recurring themes
blind spots
missing coverage
underrepresented topics
patterns across many connected records
signals that only become clear after grouping related parts of the graph

These questions require synthesis rather than lookup or neighborhood exploration.

That is where query-focused summarization becomes useful.

This distinction aligns with findings from Microsoft's GraphRAG research, which showed that traditional retrieval approaches often struggle with questions that require reasoning across an entire corpus rather than retrieving a handful of relevant passages. Their paper, From Local to Global: A Graph RAG Approach to Query-Focused Summarization, introduced a global retrieval workflow specifically designed for these broader questions.

What Query-Focused Summarization Does

Query-focused summarization, or QFS, creates a summary based on the user's question instead of producing a generic summary of the whole dataset.

That distinction matters.

A generic summary says:

Here is what this dataset is broadly about.

A query-focused summary says:

Here is what matters for this specific question.

In GraphRAG, QFS usually works by processing a broader slice of the graph, grouping related entities or communities, generating smaller summaries, and then reducing those summaries into a final answer.

The basic flow looks like this:

global question
      ↓
load a broader graph slice
      ↓
group related nodes or communities
      ↓
summarize each group against the question
      ↓
combine the partial summaries
      ↓
return a focused answer

The goal is not to inspect one neighborhood or execute one graph query.

The goal is to use graph structure to organize information at a larger scale and produce an answer that reflects broader patterns.

This approach is closely related to techniques used in large-scale information retrieval and multi-document summarization, where systems must aggregate evidence from many sources before generating an answer. The challenge becomes even more important as datasets grow beyond what can fit into a single LLM context window.
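
The flow above can be sketched as a small map-reduce in Python. This is a minimal sketch under stated assumptions, not Memgraph's implementation: the `COMMUNITIES` data is made up, and `keyword_summarize` is a hypothetical stand-in for an LLM call that simply keeps the texts sharing a word with the question.

```python
# Hypothetical input: community name -> texts of the records grouped into it.
# In a real pipeline these groups would come from community detection on the graph.
COMMUNITIES = {
    "shipping": ["Package arrived late again.", "Late delivery and a damaged box."],
    "battery": ["Battery drains within two hours.", "Great screen overall."],
}


def keywords(text):
    """Lowercased words with surrounding punctuation stripped."""
    return {word.strip("?.,!").lower() for word in text.split()}


def keyword_summarize(question, texts):
    """Stand-in for an LLM call: keep only texts that share a word with the question."""
    wanted = keywords(question)
    return " ".join(text for text in texts if wanted & keywords(text))


def query_focused_summary(question, communities, summarize=keyword_summarize):
    # Map: summarize each community against the question.
    partials = {name: summarize(question, texts) for name, texts in communities.items()}
    # Keep the partial summaries so they can be inspected; drop the empty ones.
    relevant = [f"[{name}] {summary}" for name, summary in partials.items() if summary]
    # Reduce: a real pipeline would make one more LLM call here; this sketch just joins.
    return "\n".join(relevant)


print(query_focused_summary("What late delivery problems keep showing up?", COMMUNITIES))
```

Keeping the partial summaries as a separate, named step is the useful part of the design: it gives you something to inspect when the final answer looks weak.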

A GitHub Issues Example: Finding Product Blind Spots

Imagine a knowledge graph built over GitHub issues.

Issues are connected through labels, extracted entities, related issue links, community groupings, and summaries over those communities.

Now someone asks:

Where are the blind spots?

This question asks for a higher-level view of what the issue graph reveals. Which areas keep showing up? Which problems appear under-discussed? Which communities point to recurring product gaps?

A local graph search workflow can help explain relationships around a specific issue or entity. However, it is not designed to summarize patterns across hundreds or thousands of related issues.

Query-focused summarization can work across the broader issue graph, summarize different communities, and turn those partial summaries into a focused answer about product blind spots.

That is the global retrieval pattern. The value is not that the system finds a matching issue. The value is that it can surface a pattern across many related issues.

If you're interested in how graph communities are identified before summarization, Memgraph's GraphRAG workflows can combine graph algorithms and community detection techniques to organize related information before it reaches the LLM.

Why Atomic GraphRAG Helps With Global Retrieval

Global retrieval has more moving parts than either Text2Cypher or local graph search. Query-focused summarization requires additional orchestration.

The pipeline may need to select a broader graph slice, group related nodes, apply graph algorithms, summarize communities, rank partial summaries, and assemble the final answer.

You can split those steps across scripts, services, prompt chains, and post-processing code. It may work, but it gets painful to debug.

If the answer is weak, where did the failure happen? Was the graph slice too broad? Were the communities wrong? Did the summaries ignore the query? Did the final reduction step drop useful context?

This is where the Atomic GraphRAG pattern becomes useful. The benefit is that more of the retrieval plan can stay close to the graph, where the data, relationships, and grouping logic already live.

For global questions, that matters because the answer depends on how the system moves through the graph before it ever reaches the LLM.

A good QFS pipeline should make that path easier to inspect, test, and adjust.

Many teams implement these workflows through Atomic GraphRAG pipelines, where retrieval patterns such as Text2Cypher, local graph search, and query-focused summarization can be composed while keeping graph operations close to the data.

When Query-Focused Summarization Is Too Much

QFS is powerful, but it is not the right choice for every GraphRAG question.

If the user wants an exact answer, use a query-shaped pattern such as Text2Cypher.

Examples:

How many issues are labeled as bugs?
Does this user ID exist?
Which products have more than 100 reviews?

If the user wants context around one entity, use local graph search.

Examples:

Which issues are related to this pull request?
Which accounts are connected to this suspicious transaction?
Which reviews and products are closest to this user?

QFS is for broader questions.

Examples:

What themes keep showing up across negative reviews?
Where are the blind spots in this issue graph?
What recurring risks appear across incident reports?
Which areas of this research corpus are undercovered?

A simple way to choose:

If the User Needs...	Use...
An exact value, count, table, or lookup	Text2Cypher
Context around one entity	Local graph search
Themes, gaps, or patterns across a corpus	Query-focused summarization

This mirrors a broader principle in retrieval-augmented generation: different retrieval strategies solve different classes of problems. Research from organizations such as Microsoft, Stanford, and Meta consistently shows that retrieval quality depends heavily on matching the retrieval method to the user's intent rather than relying on a single retrieval approach for every query.
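
The decision table above can be sketched as a small routing function. This is only an illustration: the hint phrases are assumptions chosen to fit the example questions in this post, and a production router would more likely use an LLM or a trained classifier than substring matching.

```python
# Hypothetical hint phrases; tune or replace them for your own question set.
GLOBAL_HINTS = ("themes", "recurring", "blind spots", "gaps", "patterns", "across")
EXACT_HINTS = ("how many", "does ", "more than")


def choose_pattern(question):
    """Pick a retrieval pattern from the shape of the question."""
    q = question.lower()
    if any(hint in q for hint in GLOBAL_HINTS):
        return "query-focused summarization"
    if any(hint in q for hint in EXACT_HINTS):
        return "Text2Cypher"
    # Default: the question points at one entity and needs its surroundings.
    return "local graph search"


for example in (
    "How many issues are labeled as bugs?",
    "Which issues are related to this pull request?",
    "What themes keep showing up across negative reviews?",
):
    print(example, "->", choose_pattern(example))
```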

Wrapping Up

Query-focused summarization is the GraphRAG pattern for global questions.

Use it when the answer does not live in a single query result or a single graph neighborhood. Use it when the reader needs a focused synthesis across a larger graph or corpus.

That makes QFS useful for questions about themes, blind spots, gaps, recurring complaints, and broad patterns.

The next step is to test this pattern on a dataset where the same signal appears across many related records. Start with one broad question, define the graph slice worth summarizing, group related entities, and inspect the partial summaries before trusting the final answer.

For a deeper walkthrough, read Memgraph's guide on Query-Focused Summarization in Atomic GraphRAG or explore the GraphRAG pipeline docs.

When Does GraphRAG Need Local Graph Search?

Sabika Tasneem — Tue, 16 Jun 2026 12:23:33 +0000

A fraud analyst asks your AI assistant why an account looks suspicious. A plain lookup gives the account record. A broad summary pulls in too much noise. The useful answer sits nearby: connected devices, recent transactions, shared addresses, and linked accounts.

That is where local graph search helps. It starts from a relevant entity, expands through nearby relationships, and gives the LLM a focused slice of connected context.

In this post, we’ll look at when to use local graph search, how pivot-based retrieval works, and how to keep the neighborhood small enough to be useful.

What Is Local Graph Search?

Local graph search starts with a pivot.

A pivot is the node or set of nodes the rest of the retrieval depends on. It is the anchor for the local context.

In a GitHub Issues graph, the pivot might be an issue. In a product review graph, it might be a user or product. In a cybersecurity graph, it might be an alert, asset, or account.

The user might give the pivot directly:

Show me the related context around user ID AGLOOCISSVGEGUCSSSSNHWZHOM60.

Once the pivot is found, the graph does what flat retrieval cannot do well: it follows relationships.

user question
   ↓
find the pivot
   ↓
expand nearby relationships
   ↓
rank and trim the neighborhood
   ↓
return compact context

Search gets you to the right starting point. Traversal gives you the surrounding evidence.

The Answer Lives Around the Node

Local graph search is useful when the answer is not stored as one property on one node.

It lives around the node.

For example:

A fraud alert needs connected accounts, devices, merchants, and recent transactions.
A reopened GitHub issue needs related issues, pull requests, labels, and users.
A product page needs reviews, related products, parent products, and user behavior.
A security alert needs users, permissions, services, events, and assets.
A research paper needs authors, citations, methods, datasets, and follow-up papers.

These are not whole-corpus questions. They are also not clean lookup questions. They sit in the middle.

That is why local graph search is such a useful GraphRAG pattern. It handles the messy class of questions where the user points at one thing, but the answer depends on the surrounding structure.

A Local Graph Search Example

Imagine a pull request fixes a serialization bug in an open-source project. The PR may solve the immediate problem, but the surrounding issue graph can still contain related issues that were never linked to the PR, never updated, or never closed.

That is a local graph search problem.

In a GitHub Issues knowledge graph, issues can connect through labels and through RelatedTo edges derived from entity extraction over issue titles and descriptions. So the question is not only whether one PR fixed one issue. The better question is:

Which related GitHub issues should be updated or closed, and which community members should be informed?

A local graph search flow can start by embedding the phrase that describes the fixed problem:

CALL embeddings.text(
  ['serialization errors during concurrent edge writes on supernodes']
)
YIELD embeddings, success

Then vector search can find the most relevant starting points in the graph:

CALL vector_search.search('vs_index', 10, embeddings[0])

The system can also inspect the graph schema before running the next retrieval steps:

SHOW SCHEMA INFO;

From there, regular Cypher queries can expand from the matched issue or PR into related issues, labels, authors, commenters, and extracted relationships.

The value is not that the system finds one similar issue. Basic search can already do that.

The value is that local graph search recovers the surrounding context needed to decide what should happen next. Which issues are still open? Which ones describe the same underlying bug? Which community members were involved in earlier reports or discussions?

That neighborhood is the answer.

For the full walkthrough, watch Memgraph’s community call on Atomic GraphRAG.

How to Make Local Graph Search Easier to Debug

This is also where the Atomic GraphRAG pipeline becomes useful.

The GitHub Issues example is not a simple lookup. The pipeline has to embed the problem phrase, find a relevant starting point, inspect the schema, expand through related issues, and return the context that helps decide what should be updated or closed.

You can wire those steps together across separate scripts and services, but that makes the retrieval path harder to debug. If the answer is wrong, you need to check the embedding call, vector search result, traversal logic, filters, and final prompt assembly separately.

Atomic GraphRAG keeps more of that retrieval logic close to the graph. The benefit is that the path from question to context becomes easier to inspect, test, and change.

For local graph search, that matters because the quality of the answer depends on the path the system took through the graph.

Do Not Let the Neighborhood Become the Whole City

Local graph search can go wrong when the traversal expands too far.

One hop may give useful context. Two hops may reveal the pattern. Five hops can turn into graph confetti if you do not control it.

The job is not to return the biggest neighborhood. The job is to return the smallest neighborhood that still helps answer the question.

That means the pipeline needs guardrails:

Limit the number of hops.
Choose which relationship types matter.
Rank nearby nodes by relevance.
Cut low-value properties before prompt assembly.
Return samples instead of dumping every connected node.

This is where local graph search differs from “just traverse everything.”

Traversal without ranking is noise. Traversal with constraints is retrieval.
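
The guardrails above can be sketched in plain Python. This is a minimal sketch, assuming a toy in-memory edge list rather than a real graph database; the node names and the `VIEWED_BY` relationship type are hypothetical, and ranking by hop distance is a simple stand-in for a real relevance score.

```python
from collections import deque

# Hypothetical toy graph: (source, relationship type, target).
EDGES = [
    ("issue-1", "RELATED_TO", "issue-2"),
    ("issue-2", "RELATED_TO", "issue-3"),
    ("issue-3", "RELATED_TO", "issue-4"),
    ("issue-1", "HAS_LABEL", "bug"),
    ("issue-1", "VIEWED_BY", "user-9"),
]


def local_neighborhood(pivot, edges, allowed_types, max_hops, limit):
    """Breadth-first expansion from the pivot with type, hop, and size limits."""
    # Guardrail: only follow the relationship types that matter.
    adjacency = {}
    for source, rel_type, target in edges:
        if rel_type in allowed_types:
            adjacency.setdefault(source, []).append(target)
            adjacency.setdefault(target, []).append(source)

    hops = {pivot: 0}
    queue = deque([pivot])
    while queue:
        node = queue.popleft()
        if hops[node] == max_hops:
            continue  # Guardrail: do not expand past the hop limit.
        for neighbor in adjacency.get(node, []):
            if neighbor not in hops:
                hops[neighbor] = hops[node] + 1
                queue.append(neighbor)

    # Guardrail: rank (closest first, then by name) and return a trimmed sample.
    ranked = sorted((n for n in hops if n != pivot), key=lambda n: (hops[n], n))
    return [(n, hops[n]) for n in ranked[:limit]]


print(local_neighborhood("issue-1", EDGES, {"RELATED_TO", "HAS_LABEL"}, max_hops=2, limit=10))
```

With `max_hops=2`, `issue-4` stays out of the result, and `user-9` never appears because `VIEWED_BY` is not in the allowed set.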

Use a Different Pattern When the Question Shape Changes

Local graph search is the right fit when the user asks about context around a specific entity.

But it is the wrong fit when the question shape changes.

If the user asks for an exact value, use a query-shaped approach such as Text2Cypher.

Examples:

How many open issues are labeled as bugs?
Does this user ID exist?
Which products have more than 100 reviews?

If the user asks for themes across a dataset, use a global retrieval pattern such as query-focused summarization.

Examples:

What are the main themes across negative reviews?
What gaps appear across this research corpus?
What are the biggest product complaints across all categories?

A simple rule:

If the User Needs...	Use...
An exact value or table	Text2Cypher
Context around one entity	Local graph search
Themes across the corpus	Query-focused summarization

The retrieval pattern should follow the question shape. If it does not, the pipeline becomes noisy or shallow.

Wrapping Up

Local graph search is useful when the question starts from a specific entity but the answer depends on what surrounds it.

A fraud analyst does not only need the account record. A support engineer does not only need the issue title. A product assistant does not only need the product description. In each case, the useful context lives in nearby relationships.

That is the core value of local graph search. It helps a GraphRAG system retrieve a focused neighborhood instead of pulling in either too little context or too much noise.

The next step is to test this pattern on a graph where relationships already matter. Start with one entity type, define the relationships worth traversing, limit the number of hops, and inspect what context reaches the LLM.

For a deeper walkthrough of this pattern, Memgraph’s guide on local graph search for GraphRAG breaks down GitHub Issues and Amazon Reviews examples. The related community call on building an Amazon-scale knowledge graph for GraphRAG is also useful if you want to see the pattern applied to a large connected dataset.

When Should You Use Text2Cypher in a GraphRAG Pipeline

Sabika Tasneem — Fri, 22 May 2026 10:53:33 +0000

Not every GraphRAG question needs the same retrieval pattern.

Some questions need the neighborhood around an entity. Some need a summary across a large part of the graph. Some just need an exact answer from structured data. That last group is where Text2Cypher fits.

It turns a natural language question into a Cypher query, so the system can return a precise graph result instead of a broad summary.

What Is Text2Cypher?

Text2Cypher is the graph version of a pattern developers already know from text-to-SQL systems: take a natural language question and generate a database query that can answer it.

The difference is the target query language.

Instead of generating SQL for tables, Text2Cypher generates Cypher for graph data. Cypher is a declarative query language for property graphs, where data is modeled as nodes, relationships, labels, and properties.

The LLM’s job is not to invent the answer. Its job is to generate the right query; the system then runs that query against the graph and returns the result. That distinction matters.

What Text2Cypher Does in GraphRAG

In a GraphRAG pipeline, Text2Cypher is useful when the user’s question maps cleanly to the graph schema.

For example:

Does user 31254 exist in this dataset?
Which suppliers provide components used in Product A?
How many orders are delayed by more than 7 days?
Which customers have more than 3 unresolved support tickets?

These questions are not asking the model to read a pile of text and summarize it. They are asking for a structured answer from structured data.

A practical Text2Cypher flow usually looks like this:

Inspect the graph schema.
Pass the relevant schema context to the LLM.
Generate the Cypher query.
Run the query.
Return the result.

Schema is the part people underestimate.

If the LLM does not know what labels, relationship types, and properties exist, it can generate a query that looks reasonable but does not match the actual graph. For example, it may generate (:Customer)-[:PURCHASED]->(:Product) when the real graph uses (:User)-[:BOUGHT]->(:Item).

That query is syntactically fine. It is just wrong for your data.

In Memgraph, SHOW SCHEMA INFO can expose labels, relationship types, and properties, giving the model real schema context before it generates the query.
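
One way to catch that kind of mismatch before running anything is to compare the generated query against the known schema. The Python sketch below is an illustration under stated assumptions: `SCHEMA` is a hand-written stand-in for what you might derive from SHOW SCHEMA INFO output, and the regular expressions only cover simple patterns such as `(u:User)` and `[:BOUGHT]`, not the full Cypher grammar.

```python
import re

# Hypothetical schema, as it might be derived from SHOW SCHEMA INFO output.
SCHEMA = {"labels": {"User", "Item"}, "relationship_types": {"BOUGHT"}}


def unknown_schema_terms(cypher, schema):
    """Return labels and relationship types used in the query but absent from the schema."""
    # Node labels appear right after an opening parenthesis: (:User) or (u:User).
    labels = set(re.findall(r"\(\w*:(\w+)", cypher))
    # Relationship types appear right after an opening bracket: [:BOUGHT] or [r:BOUGHT].
    rel_types = set(re.findall(r"\[\w*:(\w+)", cypher))
    return {
        "labels": sorted(labels - schema["labels"]),
        "relationship_types": sorted(rel_types - schema["relationship_types"]),
    }


generated = "MATCH (:Customer)-[:PURCHASED]->(:Product) RETURN count(*)"
print(unknown_schema_terms(generated, SCHEMA))
```

A non-empty result is a cheap signal to regenerate the query with better schema context instead of executing it.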

Why Text2Cypher Is the Best Fit for Analytical GraphRAG Questions

Analytical GraphRAG questions ask for something concrete.

Usually, the answer is one of these:

A count
A boolean answer
A list of matching nodes
A filtered table
A grouped result
A ranked result based on a property or aggregate

For example, in a GitHub Issues knowledge graph, a user might ask:

How many feature requests does Memgraph have?

That question does not need the model to retrieve five chunks about issue tracking and reason from prose.

It needs a query over the graph:

SHOW SCHEMA INFO;

MATCH (i:Issue)
RETURN i.issue_type AS issue_type,
       count(*) AS count
ORDER BY count DESC;

That answer comes back as a table-shaped result.

No long context window. No vague summary. No pretending that a generative answer is better than a database result.

That is why Text2Cypher is a strong fit for analytical GraphRAG. The question has a query-shaped answer.

When Text2Cypher Is the Wrong Tool

Text2Cypher gets weaker when the question is open-ended, exploratory, or depends on broader context that does not live in a single clean query result.

Bad fits include questions like:

Why are users unhappy with this product?
What themes appear across negative reviews?
Which related issues should an engineer investigate first?
What is missing from this research corpus?

These questions need more than a count or table.

They may need local graph search, where the system starts from a relevant node and expands into its surrounding neighborhood. Or they may need query-focused summarization, where the system synthesizes patterns across a larger part of the graph.

Trying to force Text2Cypher onto those questions gives you shallow answers.

A query can return rows. It does not automatically explain themes, tradeoffs, causes, or missing context.

A useful rule is simple:

If the Answer Should Look Like...	Use...
A number, table, filtered list, or direct lookup	Text2Cypher
Connected context around one entity	Local graph search
Themes or patterns across a corpus	Query-focused summarization

The retrieval path should match the question.

Keep the Pipeline Inspectable

Text2Cypher has one major advantage for developers: you can inspect it.

You can read the generated query and you can run it again. That matters in GraphRAG because retrieval bugs are easy to hide behind fluent language.

If the answer is wrong, you need to know where the failure happened. Was the schema context incomplete? Did the model generate the wrong query? Did the graph lack the right data? Did the final LLM response overstate what the query returned?

For analytical retrieval, the cleanest pipeline is often the most boring one: inspect the schema, generate the query, execute it, and return the result.

That is also what makes Text2Cypher easier to evaluate than a retrieval flow hidden behind several prompts and orchestration steps. The generated query gives you something concrete to inspect before the final answer reaches the user.

For a deeper walkthrough of this pattern, Memgraph has a full guide on Text2Cypher for GraphRAG analytical questions.

Text2Cypher is not the whole GraphRAG story. It is the pattern you use when the question has a query-shaped answer.

When Should You Use GraphRAG Instead of RAG?

Sabika Tasneem — Thu, 21 May 2026 10:36:08 +0000

Most teams building LLM applications start with RAG for a good reason. It is practical, easy to understand, and usually good enough for a simple AI use case.

But once users stop asking simple lookup questions and start asking relationship-heavy questions, standard RAG can get shallow fast.

The issue is not that RAG is bad. The issue is that many real questions are not just about finding a relevant paragraph. They are about following connections across people, products, systems, documents, events, or dependencies.

That is the gap GraphRAG tries to fill.

RAG vs GraphRAG

RAG made LLM applications more useful because it gave models access to external information.

Instead of asking a model to answer from training data alone, a RAG pipeline retrieves relevant content from your docs, tickets, wikis, PDFs, or databases, adds that content to the prompt, and asks the model to answer from it.

That works well for a lot of use cases.

If the question is:

What is our refund policy for annual subscriptions?

A standard RAG pipeline can search the documentation, find the right policy section, and give the model the relevant text.

The problem starts when the question is not just about finding the right text. It starts when the answer depends on relationships.

For example:

Which suppliers could be causing delivery delays for products affected by a specific component shortage?

That question is not just asking for a matching paragraph. It needs the system to connect suppliers, components, products, shipments, delays, and dependencies.

This is where GraphRAG becomes useful.

RAG is good at finding text that sounds relevant. GraphRAG is better when the answer depends on how things are connected.

What RAG Does Well

Retrieval-augmented generation, usually shortened to RAG, combines a language model with an external retrieval system. The original paper described this as combining a parametric model (the LLM itself) with non-parametric memory (external knowledge), usually retrieved from an external corpus.

In most modern implementations, that retrieval step uses embeddings. The basic flow looks like this:

Split documents into chunks.
Convert each chunk into an embedding.
Store those embeddings in a vector index.
Convert the user question into an embedding.
Retrieve the most similar chunks.
Add those chunks to the LLM prompt.
Generate the answer.

This is useful when the answer is likely to be contained in one or a few text chunks. Good RAG use cases include:

Documentation search
FAQ assistants
Internal knowledge base search
Customer support answer generation
Summarization over a small set of relevant documents

For many teams, this is the right starting point. It is simpler than building a knowledge graph, and it can deliver useful results quickly.

The issue is that similarity is not the same as understanding.

A vector search system can find chunks that sound close to the query. It does not automatically know whether one entity owns another, depends on another, contradicts another, or affects another through a multi-step chain.

That difference matters once your questions become relational.
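
The embedding-based flow described earlier in this section can be sketched in a few lines of Python. This is a toy under stated assumptions: the `CHUNKS` text is made up, and `embed` uses bag-of-words counts as a stand-in for a learned embedding model, so it ranks purely by word overlap.

```python
import math
import re
from collections import Counter

# Hypothetical document chunks (made-up policy text for illustration).
CHUNKS = [
    "Annual subscriptions can be refunded within 30 days of purchase.",
    "Monthly plans renew automatically on the billing date.",
    "Support is available by email on weekdays.",
]


def embed(text):
    """Toy 'embedding': word counts. Real systems use a learned embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    dot = sum(a[term] * b[term] for term in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question, chunks, k=1):
    index = [(chunk, embed(chunk)) for chunk in chunks]  # embed and store each chunk
    query_vector = embed(question)                        # embed the question
    ranked = sorted(index, key=lambda item: cosine(query_vector, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]             # most similar chunks


def build_prompt(question, chunks):
    context = "\n".join(f"- {chunk}" for chunk in chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


question = "What is our refund policy for annual subscriptions?"
print(build_prompt(question, retrieve(question, CHUNKS)))
```

Nothing in this flow knows how one chunk relates to another. It only knows which chunks look similar to the question.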

Where RAG Gets Shallow

RAG usually retrieves isolated text chunks. That creates a few common problems.

First, chunking can break context. A policy, customer, transaction, or technical decision might make sense only when you see how it connects to other facts. Splitting documents into chunks can hide that structure.

Second, semantic similarity can over-retrieve. A chunk may sound relevant without being useful for the actual answer.

Third, RAG does not inherently reason across relationships. It may retrieve text about a supplier, text about a product, and text about a shipment delay, but it does not automatically know how those things connect.

Think about this question:

Which customers are affected by the delayed shipment from Supplier A?

A standard RAG pipeline might retrieve documents that mention Supplier A, delayed shipments, and customers. That is helpful, but still incomplete.

The actual answer may require a path like this:

Supplier A -> supplies -> Component X -> used in -> Product Y -> included in -> Shipment Z -> assigned to -> Customer C

That path is not just text similarity. It is structure.

If your application needs to answer questions like this, treating your knowledge base as flat chunks is a weak model of the problem.

What GraphRAG Adds

GraphRAG keeps the useful part of RAG: retrieval. But it adds a graph layer, where information is represented as entities and relationships. Microsoft’s paper on GraphRAG for query-focused summarization helped popularize this pattern for using graph structure to answer questions that need broader connected context.

Instead of only storing chunks like:

Supplier A provides Component X. Component X is used in Product Y. Product Y is part of Shipment Z.

A graph represents the structure directly:

(Supplier A)-[:SUPPLIES]->(Component X)
(Component X)-[:USED_IN]->(Product Y)
(Product Y)-[:INCLUDED_IN]->(Shipment Z)
(Shipment Z)-[:ASSIGNED_TO]->(Customer C)

Now the system can retrieve context by following relationships, not just by matching similar text.
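
To make that concrete, here is a minimal Python sketch that stores the four relationships above as triples and follows them with a breadth-first search. The in-memory list and the `find_path` helper are illustrative assumptions; in a graph database the same step would be a traversal query.

```python
from collections import deque

# The relationships from the example above, as (source, type, target) triples.
TRIPLES = [
    ("Supplier A", "SUPPLIES", "Component X"),
    ("Component X", "USED_IN", "Product Y"),
    ("Product Y", "INCLUDED_IN", "Shipment Z"),
    ("Shipment Z", "ASSIGNED_TO", "Customer C"),
]


def find_path(start, goal, triples):
    """Breadth-first search along directed edges; returns the path as triples, or None."""
    outgoing = {}
    for source, rel_type, target in triples:
        outgoing.setdefault(source, []).append((rel_type, target))

    queue = deque([(start, [])])
    visited = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for rel_type, target in outgoing.get(node, []):
            if target not in visited:
                visited.add(target)
                queue.append((target, path + [(node, rel_type, target)]))
    return None  # no directed path exists


for source, rel_type, target in find_path("Supplier A", "Customer C", TRIPLES):
    print(f"({source})-[:{rel_type}]->({target})")
```

The returned path doubles as an explanation: each hop names the relationship that connects the supplier to the affected customer.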

A GraphRAG pipeline might work like this:

Use semantic search, keyword search, or another method to find a starting point.
Identify the relevant node or set of nodes in the graph.
Traverse connected relationships.
Rank, filter, and compress the connected context.
Send the final context to the LLM.

The key difference is that search finds where to start, while graph traversal finds what is connected.

That is why GraphRAG is useful for relationship-heavy use cases, such as:

Supply chain analysis where the system needs to trace products, components, suppliers, and delayed shipments
Fraud detection where suspicious behavior appears across shared accounts, devices, transactions, or addresses
Cybersecurity investigation where alerts need to be connected to users, assets, permissions, and attack paths
Healthcare or life sciences research where answers depend on relationships between diseases, genes, drugs, and clinical evidence
Customer 360 applications where support tickets, purchases, product usage, and account history need to be connected

These are not just document lookup problems. They are relationship problems.

RAG and GraphRAG Are Not Enemies

The lazy version of this topic is: RAG bad, GraphRAG good.

That is wrong. RAG is still useful. If your data is mostly unstructured text and your questions are direct, a standard RAG pipeline may be enough. GraphRAG becomes useful when the shape of the answer depends on connected facts. A better way to think about it:

Use RAG When	Use GraphRAG When
The answer is likely inside a small number of text chunks.	The answer depends on relationships across entities.
You need fast document Q&A.	You need multi-hop reasoning.
Your data does not have strong entity relationships.	Your data has dependencies, hierarchies, ownership, or causality.
You are building a first version quickly.	You need more explainable and structured retrieval.

In practice, many good systems use both. Vector search can find semantically relevant entry points. Graph traversal can expand from those entry points into connected context.

That combination is often more useful than either approach alone.

Keep the Retrieval Logic Close to the Data

GraphRAG gets harder to maintain when every retrieval step lives in a different place.

One service finds similar chunks. Another stores the graph. Another expands relationships. Another ranks results. Another builds the final prompt.

That can work, but it gives you more moving parts to debug when the answer is wrong.

A cleaner pattern is to keep as much of the retrieval logic as possible close to the graph itself. Search can find the starting point. Traversal can expand the context. Ranking and filtering can reduce the result before it ever reaches the prompt.

That is the idea behind Atomic GraphRAG in Memgraph. It expresses the retrieval path as a single execution layer where possible, instead of spreading it across a pile of orchestration code.

The broader lesson is not tool specific. If your GraphRAG pipeline is hard to inspect, it will be hard to trust. The retrieval path should be visible, testable, and easy to change.

The Practical Rule

Use RAG when you need to retrieve relevant text. Use GraphRAG when you need to retrieve connected context. That is the real distinction.

If your question can be answered by finding the right paragraph, RAG is probably enough. If your question requires following relationships between people, products, systems, documents, events, risks, or dependencies, you are no longer just doing text retrieval. You are doing graph retrieval.

The point is not to bolt GraphRAG on as an extra layer. The point is to use it where it is the right retrieval model for the problem.

MCP for Agents: The Security Gap Most Teams Miss

Sabika Tasneem — Mon, 16 Feb 2026 12:31:41 +0000

MCP is exciting because it turns an LLM into something that can execute actions through tool calls. One protocol, many tools. Your agent can pull data, update tickets, call APIs, and trigger workflows. That is exactly why teams are rushing to ship MCP-based agents.

That speed comes with a tradeoff. Once an LLM can touch live systems, mistakes stop being “bad answers” and start becoming real actions. The point of this post is not to criticize MCP. It is to help you ship agents that stay useful without unintentionally expanding your blast radius.

Let’s dive in!

What MCP Gives You (And What it Does Not)

MCP standardizes how tools and context are exposed to a model, which is great for developer velocity. What it does not do is decide what is safe or appropriate in your environment. You still own boundaries and behavior.

In production, the gaps show up fast:

Which tool should be used for this request
What data is allowed for this user or team
Which actions should be blocked or require approval
How you can audit tool use after something goes wrong

If you want the spec-level overview, start with Anthropic’s MCP introduction.

Building Agents with MCP: 3 Problems You Will Hit First

The first failure is rarely a headline breach. It usually looks like a normal product bug, except now the bug can trigger emails, update records, or touch production data. For instance:

The Agent Does the “Helpful” Thing You Did Not Ask For

A user asks, “Can you check which customers are impacted?” The agent decides that notifying customers is helpful and drafts a mass email. Nothing was hacked. The model was just optimizing for task completion, and you gave it a tool that made the wrong idea easy.

A Demo Tool Becomes a Production Hazard

Most teams start with a broad tool set because it makes the demo work. Later, the agent gets a slightly different question and reaches for the most powerful tool available. If that tool can write, delete, or trigger workflows, you now have an outsized failure scope. That is the blast radius.

The Agent Guesses and Guesses Wrong

If your agent can query a database, it will try. If it does not have the right context about what is allowed and what the data means, it will guess. Sometimes the guess is harmless. Sometimes it pulls data it should not have pulled, or it produces results that look right but are based on the wrong assumptions.

Why Prompt Rules Are Not Enforcement

The common response is to add more instructions: “Read-only,” “confirm before sending,” “never delete.” Those rules help, but they do not enforce anything.

There is a simple reason. Prompts influence the model’s behavior. They do not change the system’s capabilities. If a write tool is exposed, the model can still call it, even if you told it not to. If a broad SQL tool is exposed, the model can still retrieve more data than you intended, even if you asked it to be careful.

This is why prompt-only safety tends to decay over time. As you add tools, edge cases, and new workflows, the instruction layer becomes a long list of exceptions. The agent still has the same tool surface, but now it is operating under a growing set of text rules that are easy to miss, that conflict with each other, or that get misapplied.

The fix is capability control. Reduce what the agent can do, scope what it can see, and require explicit approvals for actions that have a real blast radius.

The Practical Fix: Shrink the Tool Surface at Runtime

Do not rely on the model to always choose correctly. Make wrong choices harder. The simplest way to do that is to reduce what the agent can do by default, then expand capabilities only when you have a clear reason.

Start with these guardrails:

Expose fewer tools by default
Only expose tools that match the current task
Separate read tools from write tools
Require approvals for irreversible actions
Log tool calls so you can trace what happened

This is least-privilege design applied to agent tool access, enforced at runtime.
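To make the guardrails above concrete, here is a minimal Python sketch of a runtime gate that sits between the agent and its tools. It is an illustration under stated assumptions, not part of MCP or any SDK: the names Tool, ToolGate, and the task labels are hypothetical, and a real deployment would plug into your own approval and logging systems.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set


@dataclass
class Tool:
    """Hypothetical tool description; not an MCP type."""
    name: str
    func: Callable[..., object]
    writes: bool = False                          # True if the tool changes state
    tasks: Set[str] = field(default_factory=set)  # tasks this tool is relevant to


class ToolGate:
    """Decides at runtime which tools an agent may see and call."""

    def __init__(self, tools: List[Tool]):
        self._tools: Dict[str, Tool] = {t.name: t for t in tools}
        self.audit_log: List[dict] = []

    def visible_tools(self, task: str) -> List[str]:
        # Only expose tools that match the current task.
        return sorted(n for n, t in self._tools.items() if task in t.tasks)

    def call(self, user: str, task: str, name: str, approved: bool = False, **kwargs):
        tool = self._tools.get(name)
        allowed = tool is not None and task in tool.tasks
        if allowed and tool.writes and not approved:
            allowed = False  # write tools require an explicit approval flag
        # Log every attempt, including the blocked ones.
        self.audit_log.append(
            {"user": user, "task": task, "tool": name, "allowed": allowed}
        )
        if not allowed:
            raise PermissionError(f"tool {name!r} is not allowed for task {task!r}")
        return tool.func(**kwargs)
```

With a gate like this, an agent working on a "triage" task never sees a mass-email tool at all, and even a task that does expose it cannot run it without approval. The enforcement lives in code, not in the prompt.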

Where GraphRAG Fits in an MCP Tooling Stack

Most RAG stacks start with vectors. Vectors are great at finding semantically similar text, but they are not built to represent relationships like who owns which data, which rule is current, or which tool is allowed for this workflow.

Graphs are good at that because they model relationships directly. When you add a graph-based context layer, you can give the model a smaller, cleaner slice of context tied to the user and the task.

For example, you can make use of label-based access controls that determine which node labels and edge types a given user or workflow can touch. That reduces overload and lowers the chance your agent reaches for the wrong tool.

A Checklist You Can Actually Use

If you are shipping MCP-powered agents, do not treat guardrails as a final polish step. Treat them as part of the build. The fastest way to end up in trouble is to bolt safety on after you have already exposed a wide tool surface to an LLM.

Start with a simple baseline and improve it as you learn. The point is not to predict every edge case up front. The point is to make tool behavior observable, reversible where possible, and scoped to what the agent should be doing right now.

A baseline checklist to start from:

List your tools and label them read or write
Turn off anything irreversible by default
Add a human approval step for high impact actions
Keep tool descriptions short and specific
Log every tool call with who requested it and what tool ran
Review misfires weekly and treat them as product bugs

This checklist is not about paranoia. It is about making MCP workflows predictable enough to ship. If your plan is “we will fix it in the prompt,” you are in for some trouble.

What Memgraph Adds to an MCP Agent Stack

At some point in production, most enterprise teams realize they need a real context layer. Memgraph is an in-memory graph database used as a real-time context engine, which makes it a good fit when your agent needs fast traversal, connected context, and governance that changes as your systems change.

In practice, you can use Memgraph to store and query the relationships your agent depends on, then apply GraphRAG patterns to retrieve a connected context slice instead of stuffing everything into a prompt.

This is also where Memgraph’s Atomic GraphRAG comes in. Instead of stitching together multiple retrieval steps in your application code, Atomic GraphRAG aims to generate context in a single query so it is simpler, faster, and easier to review and tweak.

For you, that means fewer moving parts, clearer failure modes, and a smaller surface area for accidental tool misuse.

If you are exploring MCP specifically, Memgraph provides an MCP Server to expose graph context to agents, and an MCP Client inside Memgraph Lab to compose workflows across MCP servers.

Wrapping Up

MCP is a doorway to useful agents. It also makes mistakes expensive. If you want to ship responsibly, focus on runtime guardrails: shrink the tool surface, keep context clean, and log everything.

If you want to explore a graph-based context layer for MCP, start here. And remember, tool access is part of your attack surface, so review it alongside your production code.

Innovation Graph Analytics Powered by Embeddings and LLM’s

André Vermeij — Wed, 27 Nov 2024 10:52:44 +0000

Guest Author: André Vermeij, Founder of Kenedict Innovation Analytics & Developer of Kenelyze

Intro & Recap: Innovation Graphs

The first post in our series on Innovation Graphs introduced the usage of graphs in the analysis of innovation and its output, such as patents, scientific publications and research grants.

Innovation graphs focus on mapping the connections between technologies, organisations and people and can provide new insights into the actual underpinnings of innovative activity within topics or organisations of interest. They can be constructed based on all kinds of metadata and often focus on visually mapping three complementary perspectives:

Graphs of documents to gain deeper insight into technology/topic clusters.
Graphs of organisations and institutions to focus on sector-wide collaboration patterns.
Graphs of people/experts to get a better understanding of team-level collaboration and key players in a field of expertise.

In this second post on Innovation Graphs, we’ll focus on the creation and LLM-powered analysis of the first type of graph mentioned above—graphs detailing clusters of technologies and topics within a specific sector of interest. Specifically, we’ll dive into how we can use text embeddings to construct document similarity graphs, and how we can automatically analyse the content and label the graph’s clusters using locally running Large Language Models.

Text Embeddings & Graph Creation

Mapping clusters of technology and the connections between them is a key part of most innovation analytics projects. A common way to create the related document similarity graphs is to collect the unstructured text related to documents in a dataset (for example, abstracts for scientific publications or summaries of R&D project reports), convert the text into vectors/embeddings, and then calculate pairwise similarities to get similarity scores for each pair of documents to construct the final graph.

The Classical Way: TF-IDF

Converting unstructured text into ready-to-analyse vectors can be done in various ways. A classical way to approach this is to use a variant of Term Frequency-Inverse Document Frequency (TF-IDF). Here, all unstructured text is initially pre-processed using common techniques in Natural Language Processing (tokenization, lemmatization, stop-word removal, etc.), after which each token in a document is assigned a TF-IDF score. This score is based on how often the token appears in the document itself (TF) and on the inverse of how often it appears across all documents in the dataset (IDF). For each document, a vector with a length equalling the total number of unique tokens across all documents is then created, holding the TF-IDF scores for all tokens in the document and zeroes for any tokens that do not occur in the document.
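A minimal sketch of the scoring described above, assuming the documents have already been tokenized and pre-processed. It uses the plain formula (term frequency times the log of inverse document frequency); libraries commonly apply smoothing or normalization on top, so treat the exact formula and the function name tfidf_vectors as assumptions.

```python
import math
from collections import Counter
from typing import List, Tuple


def tfidf_vectors(docs: List[List[str]]) -> Tuple[List[str], List[List[float]]]:
    """Return the vocabulary and one TF-IDF vector per tokenized document."""
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    # Document frequency: in how many documents does each token occur?
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    idf = {tok: math.log(n_docs / doc_freq[tok]) for tok in vocab}

    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        # One element per vocabulary token; tokens absent from the document get 0.
        vectors.append(
            [(counts[tok] / total) * idf[tok] if total else 0.0 for tok in vocab]
        )
    return vocab, vectors
```

Note how a token that occurs in every document ends up with a score of zero, and how the vector length grows with the vocabulary. That growth is where the sparsity discussed next comes from.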

Although this is a pretty intuitive way of converting text into vectors, it comes with several challenges. The main drawback is that semantic similarity is mostly overlooked in this approach, since the scores are simply based on term counts within and across documents. Also, the vectors resulting from TF-IDF are generally very sparse and can easily consist of thousands of elements per vector, depending on the size of the overall text corpus.

The Modern Way: Embedding Models

The rise of Large Language Models and Generative Artificial Intelligence has also resulted in the availability of a wide variety of embedding models and APIs to convert unstructured text into fixed-length vectors. For example, Nomic, Mixedbread, Jina, and OpenAI all offer APIs to get embeddings based on input of unstructured text of your choice. Some key use cases for these embedding models are query and document embedding for Retrieval Augmented Generation, but they also serve as an excellent basis for the large-scale embedding of datasets to create document similarity graphs.

The main benefits of these embedding models are that they also consider semantic similarity between concepts and that they produce dense vectors of a fixed size (typically 768 or 1024 elements, a size referred to as the dimensionality). A challenge is that users need to carefully pick the parameters when using these models since these can significantly impact the overall outcome when converting the vectors into document similarity graphs.

Graph Creation Based on Embeddings

We can construct a document similarity graph based on all pairwise similarities between the document vectors as soon as embeddings are generated for all documents in our dataset. The nodes in the graph are simply the original documents from our dataset, with weighted links drawn between nodes when they have a certain degree of similarity. A commonly used similarity metric is cosine similarity. It ranges from -1 to 1 in general, although scores for text embeddings usually fall between 0 and 1, where 1 denotes vectors pointing in exactly the same direction (as identical texts would). Links between nodes can be determined by setting a threshold similarity value.

The exact value used here can have a significant impact on the readability of the graph: setting the threshold too low will often lead to a hairball/spaghetti bowl visualization (too many links between nodes), while setting it too high will show many disparate clusters with no connections between them. When constructing a graph, it is therefore important to give this some thought and also relate it to the actual size of the text fragments you are dealing with – shorter strings (titles) usually go well with higher threshold values, while longer strings (abstracts, summaries) usually combine well with lower threshold values.
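The thresholding step can be sketched in a few lines of Python. This is an illustration only: the function names are made up for this example, the embeddings are assumed to be plain lists of floats keyed by document id, and the all-pairs loop is quadratic, so larger datasets would need a vector index or batching instead.

```python
import math
from itertools import combinations
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def similarity_edges(
    embeddings: Dict[str, List[float]], threshold: float
) -> List[Tuple[str, str, float]]:
    """Return weighted links for every document pair at or above the threshold."""
    edges = []
    for (id_a, vec_a), (id_b, vec_b) in combinations(sorted(embeddings.items()), 2):
        score = cosine_similarity(vec_a, vec_b)
        if score >= threshold:
            edges.append((id_a, id_b, score))
    return edges
```

Running the same embeddings through similarity_edges with a few different thresholds and comparing the edge counts is a quick way to see where the graph tips from a hairball into disconnected islands.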

Community Detection for Technology Cluster Identification

As soon as the nodes and links in the graph have been constructed based on the embedding similarities and the threshold set, we can start analysing the graph of documents to uncover clusters of related content. In practice, this is a very important step to make the graph more readable and understandable. In innovation analytics, gaining insight into which technology clusters are present in a dataset and how they connect and evolve is often key to a project’s success.

An excellent way to uncover these clusters is by using the Leiden community detection algorithm, now available in Memgraph. Based on the structure of the graph, this algorithm detects densely connected subsets of nodes and iteratively assigns them to the same communities. In the end, when colouring nodes based on the communities they are assigned, we have an excellent basis to start labelling and annotating the graph to make sense of its contents.

LLM-Powered Innovation Cluster Labelling

In the analysis of technology and topic graphs, providing clear labelling and annotation of the resulting graph visualizations is key to gaining insights by stakeholders in an innovation analytics project. Annotated visuals are often used to provide initial high-level overviews of a graph’s contents in presentations, and often serve as a basis for further deep dives into specific clusters of interest.

A classical approach to initial cluster labelling is to treat each cluster's contents as a separate corpus of documents and then run a version of TF-IDF to extract the top-5 highest-scoring tokens or phrases for each cluster. The resulting labels often provide a decent first indication of a cluster’s contents, but they do require subsequent manual analysis and editing to improve their readability.

An exciting alternative way to label clusters is to use a Large Language Model to summarize cluster contents. In our case, we utilize locally running models such as Llama 3.1 in Ollama or LM Studio based on the following high-level process:

For each detected community, we first gather relevant unstructured text from the attributes of the nodes in the cluster. In most cases, we have found that sending over collections of document titles per cluster works very well for cluster labelling. This collection of texts is then added to a prompt that specifies exactly how the LLM should respond in its summarization: based on the texts provided, return a short summary/label consisting of a maximum of 5 words with an indication of the high-level topic. Many LLMs are prone to adding a lot of introductory (“Absolutely! Here is a summary of…”) and concluding text to answers, so the prompt also specifies that it should never do this and purely focus on returning the labels.
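The prompt assembly step might look roughly like the Python sketch below. The prompt wording, the function names, and the input dictionaries are all assumptions made for illustration; the actual prompt used in Kenelyze is not shown in this post, and the call to the locally running model is deliberately left out.

```python
from typing import Dict, List


def build_label_prompt(titles: List[str], max_words: int = 5) -> str:
    """Assemble a labelling prompt for one cluster (hypothetical wording)."""
    listing = "\n".join(f"- {title}" for title in titles)
    return (
        "The following document titles belong to one cluster:\n"
        f"{listing}\n\n"
        f"Return a label of at most {max_words} words that indicates "
        "the high-level topic. Return only the label, with no introduction "
        "and no concluding text."
    )


def prompts_per_community(
    node_titles: Dict[str, str], node_community: Dict[str, int]
) -> Dict[int, str]:
    """Group titles by detected community and build one prompt per community."""
    grouped: Dict[int, List[str]] = {}
    for node, community in node_community.items():
        grouped.setdefault(community, []).append(node_titles[node])
    return {
        community: build_label_prompt(sorted(titles))
        for community, titles in grouped.items()
    }
```

Each resulting prompt would then be sent to the local model, and the returned label written back to the nodes of that community.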

As soon as the LLM finishes providing the labels for all communities, we replace the nodes’ initial community attribute with the newly created label. Of course, these labels do require manual checks to see whether they make sense and sometimes require slight adjustments because they are too high-level. The quality of the labels is also dependent on the LLM itself: we’ve found that larger models such as Llama 3.1 (8B parameters) generally provide better labels than smaller models such as Llama 3.2 (3B parameters).

Cluster Summarization Using LLMs

Another valuable way to use LLMs in document similarity graph analysis is to further enhance users’ understanding of clusters by providing longer, point-and-click summaries of what the documents in a cluster are about. The approach here is similar to the LLM-based labelling described above, with the prompt sent to the model focusing on providing an overall summary consisting of 3 to 5 phrases instead.

Practically speaking, users of a graph visualization select nodes of their interest using a free-form selection tool and point out which unstructured text attribute should be used for the analysis, after which the LLM returns a summary based on the collection of texts sent to it. The summary is then printed in a window right on top of the visual.

Up Next: Step-By-Step Real-Life Examples and Visuals

This post provided an overview of how text embeddings can be used to construct document similarity graphs for innovation analysis, and how Large Language Models can aid in the labelling and summarization of the resulting graphs. Our next post in this series will show examples of this in practice using Kenelyze, based on a real-life dataset of the innovation output of a major high-tech company. It will also discuss the importance of local LLMs when working with sensitive data, and highlight some technical considerations when picking and configuring a local LLM.

Innovation as a Graph: Improved Insight into Technology Clusters, Collaboration and Knowledge Networks

André Vermeij — Wed, 18 Sep 2024 12:42:47 +0000

Guest Author: André Vermeij, Founder of Kenedict Innovation Analytics & Developer of Kenelyze

Organisations focused on innovation come in many forms, including corporations with large Research & Development (R&D) departments, universities, research institutions active in advancing science, and startups working on what is potentially the next big thing. Innovation-related data has become increasingly important for each of these organisations to inform decision-making and stay ahead of market developments. For example, an R&D-intensive corporation could use data to benchmark its own technology portfolio against its direct competitors, while a startup might be analysing data to assess previous activity and potential market entry in a sector of interest.

Traditional Innovation Analysis

The traditional way to look at innovation-related data is to report on output within a topic or organisation of interest based on counts and sums of variables of interest. When analysing its competition, a business may for example gather information on a competitor’s recent output and report on the number of documents in each technology domain, produce a list of the companies the competitor has worked with, or generate an overview of the most active inventors or researchers in a field of interest. Although all these analyses can be valuable in their own right, they’re missing out on a key aspect of an innovation ecosystem: the connections between technologies, organisations, and people.

A Graph of Innovation

Viewing innovation and its output as a graph of interconnected data points allows us to get a much deeper understanding of the technology and knowledge structures in a context of interest. Using the metadata in a wide array of innovation-related data sources, which will be discussed more in the following section, it is possible to create graphs of connected documents, organisations and people and gain new insights into the actual underpinnings of innovative activity.

For example, innovation graphs allow us to answer questions such as:

Which clusters of activity can we distinguish within a topic or organisation of interest, and how has this evolved?
What do the organisational collaboration networks in an area of interest look like, and who are the key players in network connectivity?
How are teams of individual experts in a specific field composed, and who are the leading experts in a given topic?

Open Data Sources for Innovation Analytics

Until just a few years ago, quality innovation data was quite hard to come by without a subscription to an expensive database hosting patent information or scientific publications. Luckily, in recent years, there has been a move towards more openly available data, which can serve as an excellent basis for setting up a wide variety of innovation graphs.

Here’s a quick overview of common data sources:

Patents: organisations apply for patents to protect their inventions against commercialisation by third parties. Patent applications and grants are published online by national patent offices around the world, with databases gathering data from all jurisdictions and providing a wide array of metadata. A great open data source is the European Patent Office's Open Patent Services (OPS) API, or the EPO's search platform Espacenet.
Scientific publications: journal publications, conference proceedings, book chapters and various other types of scientific output are gathered in databases which bring together output from many sources. Paid databases such as Scopus are still used often by large organisations – great open alternatives include OpenAlex and Semantic Scholar.
Subsidies & funding programmes: governmental subsidies to stimulate innovation and R&D in specific areas are often structured in openly available data sources. A good example is the European Union’s CORDIS data for the Horizon Europe programme. Many national enterprise agencies also publish their granted subsidies and projects online.
Internal data: the above data sources are often augmented with internal, unpublished data (e.g., internal project reports, unfiled patent applications, scientific output in the review stage) to get a view on very recent activity within an organisation. This is especially valuable when creating knowledge graphs within organisations or carrying out an innovation portfolio analysis for a specific client.

In a typical Innovation Analytics project, combining data from multiple of the above data sources is often key to gaining the best insights. For example, organisations applying for patents often also have scientific output related to the same theme and may also apply for governmental funding. To get a picture of innovative activity that is as complete as possible, it is therefore important to look at activity from multiple data sources and graph perspectives.

Graphs of Documents: Insight into Technology and Knowledge Clusters

The analysis and visualisation of innovation graphs often starts with looking at the relationships between documents based on a shared characteristic.

Depending on the goals of the analysis, there are various ways to link documents together:

Text similarity: unstructured text data in the form of document titles, abstracts and summaries can be used to connect documents when there is a high similarity between their contents. This relies on vectorisation of the text of interest and subsequent calculation of pairwise cosine similarities, where a link is then drawn between documents based on a minimum similarity score.
Knowledge flows / shared authors: another way to generate clusters of connected documents is to link them when the same people have worked on them. The authorship data on documents can be used to accomplish this. The key assumption here is that documents are part of the same “knowledge cluster” when persons with specific expertise have (co-) written them.
Citations: numerous citations to other documents can be found in both scientific publications and patent applications. We can use these citations to create various types of graphs:
- Shared references: connect documents when they cite the same sources, often with a minimum number of shared citations set as the weight for the links.
- Shared citing documents: connect documents when they have been cited by the same other documents, again often with a minimum weight set.
- Direct citations: creation of citation graphs where links are drawn between documents when they cite each other.
Technology classifications: patent documents are categorised using classification codes designating the technology areas which they fall into. These can be used to connect documents when they share one or multiple codes, essentially creating clusters of documents based on technological overlap.
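As an illustration of one of the linking options above, here is a short Python sketch of the shared references approach. It assumes each document comes with a list of the identifiers it cites; the function name and the min_shared default are made up for this example.

```python
from itertools import combinations
from typing import Dict, List, Tuple


def shared_reference_edges(
    references: Dict[str, List[str]], min_shared: int = 2
) -> List[Tuple[str, str, int]]:
    """Link two documents when they cite at least min_shared of the same sources."""
    edges = []
    for doc_a, doc_b in combinations(sorted(references), 2):
        shared = len(set(references[doc_a]) & set(references[doc_b]))
        if shared >= min_shared:
            # The number of shared citations becomes the link weight.
            edges.append((doc_a, doc_b, shared))
    return edges
```

The same pattern carries over to shared authors or shared classification codes: swap the list of cited sources for the list of authors or codes per document.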

The following graph is an example of a text similarity approach, where scientific publications in the area of autonomous vehicles are connected when they share significant textual content. Colors depict clusters of activity based on the outcomes of a community detection algorithm, and nodes are sized based on the number of times they were cited by other papers:

Figure 1: Graph of scientific publications linked based on text similarity approach

Graphs of Organisations: Insight into Collaboration Ecosystems

Another graph perspective, which is very common in innovation analysis, focuses on mapping the connections between organisations (businesses, universities, research institutions, public bodies, hospitals, etc.).

Many of the data sources above hold extensive metadata on the organisations responsible for the documents—scientific authors are affiliated with their employers, patents are applied for by the parties seeking protection of their invention and governmental subsidies are often received by consortia of collaborating organisations.

It is common to attach weights to the links based on the number of collaborations between two organisations. Using these weights, it is then possible to filter the graph to focus only on the strongest / most frequently occurring collaborations.

The graph below shows an example of collaboration in radiotherapy innovation, where colors are based on the type of organisation (e.g. blue = universities, green = hospitals and medical centers) and node sizes are based on betweenness centrality scores:

Figure 2: Collaboration in radiotherapy

Graphs of People: Insight into Expertise and Knowledge Networks

This is a graph perspective that often follows after mapping organizational collaboration networks, focusing on the actual person-to-person collaborations taking place to produce the analysed output.

Using the author/inventor metadata on documents, we draw links between people when they have co-authored a document. Similar to the organisational networks, we can also attach weights to the links, which correspond to the number of documents which have been worked on jointly by two authors. This perspective can provide a deep understanding of the actual team structures and knowledge networks within and outside of organisations.
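The weighting described above can be sketched as follows. This is an illustrative Python example with hypothetical function names; it assumes author names have already been disambiguated, which in real datasets is often the harder part.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, List, Tuple


def coauthor_weights(documents: Dict[str, List[str]]) -> Counter:
    """Count, for every pair of people, how many documents they worked on jointly."""
    weights: Counter = Counter()
    for authors in documents.values():
        # Sort so that each pair is always stored in the same order.
        for a, b in combinations(sorted(set(authors)), 2):
            weights[(a, b)] += 1
    return weights


def strongest_links(weights: Counter, min_weight: int) -> Dict[Tuple[str, str], int]:
    """Keep only the links at or above a minimum number of joint documents."""
    return {pair: w for pair, w in weights.items() if w >= min_weight}
```

The same counting works for organisational collaboration graphs if the author lists are replaced by the organisations attached to each document.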

Here’s an example of the (relatively large!) network of inventors who have worked on Apple patents. Nodes are sized based on their betweenness centralities, and colors are based on clusters detected by a community detection algorithm:

Figure 3: Apple's inventor network

Graph Metrics & Innovation Insights

The above examples show various ways to convert innovation data into actionable graph visualisations. In the actual analysis and interpretation of these graphs, it is important to make good use of the many metrics available in graph analytics. These metrics can help us understand which clusters are present in a network, and can aid in determining the importance of nodes based on centrality measures.

The following metrics are valuable for analysing the overall graph structure in innovation analysis:

Component analysis: determining the components (interconnected subsets of nodes) in the graph to be able to see how far the graph is interconnected (how many nodes can reach each other directly or indirectly) and to determine the impact of the largest connected components versus smaller components.
K-Cores: to determine highly connected subsets of nodes in graphs, k-Cores can be used to highlight subgraphs in which every node has a degree of at least k within that subgraph. This can be used to quickly focus on tightly knit groups of nodes (a looser notion than a clique, in which every node is connected to every other node) and is especially valuable when analysing collaboration and knowledge networks.
Community detection: using an algorithm such as the Leiden community detection algorithm to determine which clusters we can distinguish within the components. These clusters then serve as the basis for graph annotation, where clusters are labeled based on their actual contents (see the labels in the autonomous driving graph above).
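To give a feel for one of these metrics, here is a small Python sketch of k-core extraction by repeatedly peeling off low-degree nodes. It assumes an undirected graph stored as a dictionary of neighbour sets in which every neighbour also appears as a key; the function name is chosen for this example and is not a Memgraph API.

```python
from typing import Dict, Set


def k_core(adjacency: Dict[str, Set[str]], k: int) -> Set[str]:
    """Return the nodes of the k-core of an undirected graph."""
    # Work on a copy and ignore self-loops.
    remaining = {node: set(neigh) - {node} for node, neigh in adjacency.items()}
    changed = True
    while changed:
        changed = False
        for node in list(remaining):
            if len(remaining[node]) < k:
                # Removing a node lowers the degree of each of its neighbours,
                # which may push them below k in a later pass.
                for neigh in remaining[node]:
                    remaining[neigh].discard(node)
                del remaining[node]
                changed = True
    return set(remaining)
```

On a triangle with one extra node hanging off it, the 2-core keeps the triangle and drops the dangling node.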

On the individual node level, degree and betweenness centrality measures can be used to determine the importance of nodes in innovation graphs:

Degree Centrality: determining simple connection counts per node to quickly see which actors are most important in terms of the number of other nodes they are connected to. Since most innovation graphs are weighted (links have weights associated with them), weighted degree centrality is also used regularly.
Betweenness Centrality: this is a frequently used metric to determine who holds key positions in a graph in terms of hub positions – which organizations/people are the “key connectors” between clusters/teams? It is calculated by determining how often each node appears on the shortest paths between all other nodes in the network.
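Both measures can be sketched in plain Python. The betweenness function below follows Brandes' algorithm for unweighted, undirected graphs and returns raw (unnormalised) scores; it ignores link weights, which is a simplification compared to what a weighted innovation graph would need. The graph format (a dictionary of neighbour sets) and the function names are assumptions for this example.

```python
from collections import deque
from typing import Dict, Set


def degree_centrality(adjacency: Dict[str, Set[str]]) -> Dict[str, int]:
    """Simple connection count per node."""
    return {node: len(neigh) for node, neigh in adjacency.items()}


def betweenness_centrality(adjacency: Dict[str, Set[str]]) -> Dict[str, float]:
    """Unnormalised betweenness for an unweighted, undirected graph."""
    scores = {node: 0.0 for node in adjacency}
    for source in adjacency:
        # Breadth-first search that counts shortest paths from the source.
        order = []
        preds = {node: [] for node in adjacency}
        sigma = {node: 0 for node in adjacency}   # number of shortest paths
        dist = {node: -1 for node in adjacency}
        sigma[source], dist[source] = 1, 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adjacency[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes and accumulate dependencies.
        delta = {node: 0.0 for node in adjacency}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != source:
                scores[w] += delta[w]
    # Every unordered pair was counted once from each endpoint.
    return {node: score / 2 for node, score in scores.items()}
```

In a simple chain of three nodes, only the middle node lies on a shortest path between two others, so it is the only one with a non-zero betweenness score.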

Up Next: Use Cases

Now that you have an initial understanding of the main ideas behind innovation graphs, we will showcase practical use cases, real-world client examples and common challenges in innovation graph analysis in the next blog post. Stay tuned!

In-memory vs. disk-based databases: Why do you need a larger than memory architecture?

Memgraph — Tue, 05 Sep 2023 14:41:54 +0000

Memgraph is an in-memory graph database that recently added support for working with data that cannot fit into memory. This allows users with smaller budgets to still load large graphs to Memgraph without paying for (more) expensive RAM. However, expanding the main-memory graph database to support disk storage is, by all means, a complex engineering endeavor. Let’s break this process down into pieces.

On-disk databases

Disk-based databases have been, for a long time, a de facto standard in the database development world. Their huge advantage lies in their ability to store a vast amount of data relatively cheaply on disk. However, the development can be very complex due to the interaction with low-level OS primitives. Fetching data from disk is something that everyone strives to avoid since it takes at least an order of magnitude more time than reading it from main memory. Neo4j is an example of an on-disk graph database that uses disk as its main storage medium while trying to cache as much data as possible in main memory so it can be reused afterward.

In-memory databases

In-memory databases avoid the fundamental cost of accessing data from disk by simply storing all their data in main memory. Such an architecture also significantly simplifies the development of the storage part of the database since there is no need for a buffer pool. However, the biggest issue with in-memory databases arises when the data cannot fit into random access memory, since the only way out is to move the data to a larger and, consequently, more expensive machine.

In-memory database users rely on mechanisms such as transaction logging and snapshots to ensure durability, so that data loss does not occur.

Larger-than-memory architecture

Main memory computation

Larger-than-memory architecture describes a database architecture in which the majority of computations still happen within main memory, but the database offers the ability to store a vast amount of data on disk, too, without the complexity of interacting with buffer pools.

Identify hot & cold data

The larger-than-memory architecture utilizes the fact that a database always has hot and cold parts in terms of how often they are accessed. The goal is then to identify cold data and move it to disk so that transactions still have fast access to hot data. Cold data identification can be done either by directly tracking transactions’ access patterns (online) or by offline computation in which a background thread analyzes data.

The second very important feature of the larger-than-memory architecture is the process of evicting cold data. This can be done in two ways:

The database tracks memory usage and starts evicting data as soon as it reaches a predefined threshold.
The database evicts data only when space is needed for new data.
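A toy Python sketch of the first option, combined with online tracking of access patterns, could look like this. It is not how Memgraph implements eviction: the class name is hypothetical, a plain dictionary stands in for the disk, and the number of items stands in for actual memory usage.

```python
from collections import OrderedDict


class HotColdStore:
    """Keeps recently used items in memory and evicts the coldest ones."""

    def __init__(self, max_hot_items: int):
        self.max_hot_items = max_hot_items  # stand-in for a memory threshold
        self.hot = OrderedDict()            # in memory, least recently used first
        self.cold = {}                      # stand-in for on-disk storage

    def put(self, key, value):
        self.cold.pop(key, None)
        self.hot[key] = value
        self.hot.move_to_end(key)           # mark as most recently used
        self._evict_if_needed()

    def get(self, key):
        if key in self.hot:
            self.hot.move_to_end(key)       # online tracking of access patterns
            return self.hot[key]
        value = self.cold.pop(key)          # raises KeyError for unknown keys
        self.hot[key] = value               # fetched data becomes hot again
        self._evict_if_needed()
        return value

    def _evict_if_needed(self):
        # Evict as soon as the threshold is exceeded, coldest items first.
        while len(self.hot) > self.max_hot_items:
            cold_key, cold_value = self.hot.popitem(last=False)
            self.cold[cold_key] = cold_value
```

Reading a key that has been evicted pulls it back into memory and may push another key out, which is exactly the situation the transaction management options below have to deal with.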

Transaction management

Different systems also behave differently regarding transaction management. If the transaction needs data that is currently stored on the disk, it can:

Abort the transaction, fetch data stored on the disk, and restart the transaction.
Stall the transaction by synchronously fetching data from the disk.

Transaction must fit into memory

The question is, what happens when the transaction data cannot fit into random access memory? In Memgraph, we decided to start with an approach in which all transaction data must fit into memory. This means that some analytical queries cannot be executed on a large dataset, but this is the tradeoff we were willing to accept in the first iteration.

Benefits of larger-than-memory databases

Memgraph uses RocksDB as a key-value store to extend the capabilities of the in-memory database. Without going into too much detail, RocksDB is based on a data structure called the Log-Structured Merge-Tree (LSMT) rather than the B-Tree, which is typically the default option in databases. LSM trees are persisted on disk and, by design, come with much smaller write amplification than B-Trees.

The in-memory version of Memgraph uses Delta storage to support multi-version concurrency control (MVCC). However, for larger-than-memory storage, we decided to use the Optimistic Concurrency Control Protocol (OCC) since we assumed conflicts would rarely happen, and we could make use of RocksDB’s transactions without dealing with the custom layer of complexity like in the case of Delta storage.

We’ve implemented OCC so that every transaction has its own private workspace and potential conflicts are detected at commit time. One of our primary requirements before adding disk-based storage was not to ruin the performance of the main memory-based storage. Although we all knew there is no such thing as a zero-cost abstraction, we managed to keep performance within 10% of the original version. We chose snapshot isolation as the isolation level since we believed it would be the right default for the large majority of Memgraph users.
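
To make the idea of a private workspace with commit-time conflict detection more concrete, here is a minimal sketch. It is not Memgraph's or RocksDB's implementation: Store, Transaction, and ConflictError are hypothetical names, and conflicts are detected with a simple per-key version counter. The sketch validates every key a transaction has touched, which is stricter than what snapshot isolation strictly requires.

```python
class ConflictError(Exception):
    """Raised when a key the transaction touched changed before it committed."""


class Store:
    def __init__(self):
        self.data = {}      # key -> committed value
        self.versions = {}  # key -> number of committed writes to that key


class Transaction:
    def __init__(self, store):
        self.store = store
        self.seen_versions = {}  # versions observed by this transaction
        self.workspace = {}      # private, uncommitted writes

    def read(self, key):
        if key in self.workspace:
            return self.workspace[key]
        self.seen_versions.setdefault(key, self.store.versions.get(key, 0))
        return self.store.data.get(key)

    def write(self, key, value):
        # Remember the version being overwritten, then buffer the write privately
        self.seen_versions.setdefault(key, self.store.versions.get(key, 0))
        self.workspace[key] = value

    def commit(self):
        # Validation phase: fail if any touched key has changed since we saw it
        for key, seen in self.seen_versions.items():
            if self.store.versions.get(key, 0) != seen:
                raise ConflictError(f"conflict on {key!r}")
        # Write phase: publish the private workspace
        for key, value in self.workspace.items():
            self.store.data[key] = value
            self.store.versions[key] = self.store.versions.get(key, 0) + 1


store = Store()
t1, t2 = Transaction(store), Transaction(store)
t1.write("balance", 100)
t2.write("balance", 200)
t1.commit()  # the first committer wins
try:
    t2.commit()
    outcome = "committed"
except ConflictError:
    outcome = "aborted"
print(store.data["balance"], outcome)  # 100 aborted
```

Neither transaction takes a lock while it works. The cost of a conflict is paid only at commit, which is a good trade when conflicts are rare.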

Disadvantages of larger-than-memory databases

As always, not everything is sunshine and flowers, especially when introducing such a significant feature to an existing database, so there are still improvements to be made. First, the requirement that a single transaction must fit into memory makes it impossible to use large analytical queries.

It also makes our LOAD CSV command for importing CSV files practically unusable since the command is executed as a single transaction. Although RocksDB is really good, fits really well into our codebase, and has proved to be very efficient in its caching mechanisms, maintaining an external library is always hard.

In retrospect

Despite the significant engineering effort involved, the larger-than-memory architecture is a highly valuable asset to Memgraph users, since it lets them store large amounts of data cheaply on disk without sacrificing the performance of in-memory computation. We are actively working on resolving issues introduced with the new storage mode, so feel free to ask questions, open an issue, or submit a pull request. We will be more than happy to help. Until next time 🫡

Exciting News: LangChain Now Supports Memgraph!

Memgraph — Fri, 25 Aug 2023 07:06:32 +0000

We're thrilled to announce a powerful integration between LangChain and Memgraph, bringing you an unparalleled natural language interface to your Memgraph database. Say goodbye to complex queries and welcome a seamless and intuitive way to interact with your data.

Memgraph QA chain tutorial

If you've ever wanted to effortlessly query your Memgraph database using natural language, this tutorial is for you. This step-by-step guide will walk you through the process, ensuring you have all the tools you need to get started.

Prerequisites

Before you dive in, make sure you have Docker and Python 3.x installed on your system.

Get started

Launch a Memgraph Instance: With a few simple commands, you can have your Memgraph instance up and running using Docker. Just follow our script to set it up.

Install dependencies: We've got you covered with the required packages. Use pip to install langchain, openai, neo4j, and gqlalchemy. Don't forget the --user flag to ensure smooth permissions.

Code playtime: Whether you prefer working within this notebook or want to use a separate Python file, the tutorial offers code snippets to guide you through the process.

What's inside

Explore the rich features and functionalities that LangChain and Memgraph offer together:

API reference: We provide an overview of the key components you'll be working with, such as ChatOpenAI, GraphCypherQAChain, and MemgraphGraph.

Populating the database: Learn how to populate your Memgraph database effortlessly using the Cypher query language. We guide you through the process of seeding data that serves as the foundation for your work.

Refresh graph schema: Familiarize yourself with refreshing the graph schema, a crucial step in setting up the Memgraph-LangChain graph for Cypher queries.

Querying the database: Discover how to interact with the OpenAI API and configure your API key. We'll show you how to utilize the GraphCypherQAChain to ask questions and receive informative responses.

Chain modifiers: Customize your chain's behavior with modifiers like return_direct, return_intermediate_steps, and top_k. Tailor the experience to your preferences.

Advanced querying: Delve into advanced querying techniques and uncover tips for refining your prompts to improve query accuracy.

Ready to take your data interaction to the next level? Join us in exploring the seamless synergy between LangChain and Memgraph. No more wrangling with queries – just natural language and meaningful insights. Simplify complexity, elevate your insights, and share your projects in our community.

What is a Graph Database?

Ani Ghazaryan — Wed, 23 Aug 2023 15:22:01 +0000

While relational databases have been the go-to choice for data storage, they fall short when it comes to handling complex relationships and traversing interconnected data, which puts graph databases in a special spotlight. A graph database is a specialized database system designed to store, manage, and query highly connected data using graph theory principles. As data volumes continue to explode, companies need efficient and scalable solutions to handle the complexities of their data.

Specialized database systems like graph databases offer a more natural and efficient way to model, query, and store data, leading to improved performance and better data insights.

Understanding graphs

In simple terms, at the core of graph databases lies the concept of a graph. In mathematics and computer science, a graph is a collection of nodes (also known as vertices) connected by edges. Nodes represent entities or objects, while edges depict the relationships or connections between them. This straightforward yet effective structure forms the foundation of graph databases.

Components of graphs

It's your time to shine! Let's recap what we've learned so far. Graphs consist of two fundamental components: nodes and edges.

Nodes represent entities or objects and can have various attributes associated with them.
Edges, on the other hand, depict the relationships or connections between nodes and can also carry properties.

Together, nodes and edges create a rich network of connected data.

Graph theory basics

Another term you will often hear alongside graphs and graph databases is graph theory. It is a branch of mathematics that provides the theoretical underpinning for understanding and analyzing graphs. It defines vertices as the fundamental building blocks of a graph and edges as the connections between vertices. Relationships in a graph can be represented by directed or undirected edges, capturing the nature and direction of the connections.
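
As a rough illustration of these building blocks, the sketch below models nodes and edges with properties and shows how the same edge is seen when direction is respected or ignored. The Node, Edge, and PropertyGraph names and their fields are hypothetical and chosen only for this example; they do not correspond to any particular database's API.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    id: int
    labels: set = field(default_factory=set)
    properties: dict = field(default_factory=dict)


@dataclass
class Edge:
    source: int
    target: int
    edge_type: str
    properties: dict = field(default_factory=dict)


class PropertyGraph:
    def __init__(self):
        self.nodes = {}  # node id -> Node
        self.edges = []  # list of Edge

    def add_node(self, node_id, labels=(), **properties):
        self.nodes[node_id] = Node(node_id, set(labels), properties)

    def add_edge(self, source, target, edge_type, **properties):
        self.edges.append(Edge(source, target, edge_type, properties))

    def neighbors(self, node_id, directed=True):
        """Return ids one edge away; ignore edge direction if directed=False."""
        result = [e.target for e in self.edges if e.source == node_id]
        if not directed:
            result += [e.source for e in self.edges if e.target == node_id]
        return result


g = PropertyGraph()
g.add_node(1, labels=["Person"], name="Ana")
g.add_node(2, labels=["Person"], name="Ben")
g.add_edge(1, 2, "FOLLOWS", since=2023)
print(g.neighbors(2))                  # [] because node 2 has no outgoing edges
print(g.neighbors(2, directed=False))  # [1] once direction is ignored
```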

Relational databases vs. graph databases

Opinions are split when it comes to choosing a database, and the debate around relational vs. graph databases is still hot. Relational databases have long been the dominant database model, organizing data into structured tables with predefined schemas. They excel at handling structured data and transactions but face challenges when dealing with complex relationships and traversing connected data. This is largely due to their rigid tabular structure.

Joining multiple tables and navigating through numerous relationships can lead to performance bottlenecks and complex query formulations. This limits their effectiveness in scenarios where relationships play a crucial role.

Graph databases excel in modeling and querying relationships. They store connections explicitly, allowing for efficient traversals between nodes and enabling complex relationship queries with ease. And of course, graph databases provide flexibility, scalability, and performance advantages over a relational database when it comes to handling interconnected data.

Characteristics of graph databases

So far, you've been introduced to a few qualities that are typical to graphs, so let's put the learnings into structure and build off of what you've grasped.

Schema-less nature

Unlike relational databases, graph databases are schema-less, meaning they do not require a predefined structure or schema for data. This flexibility allows for the dynamic addition of new node types, properties, and relationships, making graph databases highly adaptable to evolving data models.

Native graph processing

Graph databases are purpose-built for processing graph data. They employ optimized algorithms and data structures to efficiently traverse and manipulate the graph structure, resulting in faster query response times and improved performance compared to non-native graph databases.

Graph traversal and pattern matching

One of the key strengths of graph databases is their ability to traverse and explore relationships between nodes. Graph traversal algorithms can efficiently navigate the graph to discover patterns, uncover hidden connections, and retrieve data based on specific criteria. This capability is particularly valuable in applications such as recommendation engines, fraud detection, and knowledge graphs, which we will explore in the sections to come.

Use cases of graph databases

Unlike a traditional relational database that relies on tabular data, a graph database utilizes a flexible and intuitive data model, allowing for the representation of intricate relationships between entities. With its ability to efficiently capture and traverse vast networks of data, a graph database has emerged as an advanced tool for diverse domains, including:

Social networks and recommendation engines

Graph databases have revolutionized social networking platforms and recommendation engines. They enable personalized recommendations, friend suggestions, and social network analysis by leveraging the rich network of connections between users, interests, and entities.

Fraud detection and network analysis

Graphs also excel in fraud detection and network analysis. By representing complex networks of relationships, they can identify suspicious patterns, detect fraudulent activities, and uncover hidden connections that might indicate illicit behavior, making them an invaluable tool for cybersecurity.

Knowledge graphs and semantic networks

Last but not least, graph databases serve as a foundation for building knowledge graphs and semantic networks. By representing data as nodes and relationships, they capture the semantics and context of information, enabling sophisticated knowledge discovery, semantic search, and data integration across disparate sources.

Sneak peek into graph algorithms

Surely enough, graph algorithms play a crucial role in leveraging the power of graph databases and unlocking valuable insights from connected data. In this section, we provide a sneak peek into some fundamental graph algorithms that form the backbone of graph database operations.

Breadth-First Search (BFS): Breadth-First Search is a fundamental algorithm used to explore and traverse a graph in a breadth-first manner. Starting from a given source node, BFS systematically explores all the neighboring nodes before moving deeper into the graph. This algorithm is commonly used to find the shortest path between two nodes in an unweighted graph, identify connected components, and perform level-based analysis.
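
A minimal sketch of BFS used for shortest-path search, assuming an unweighted graph stored as a plain adjacency dict. The function name and the sample graph are made up for illustration.

```python
from collections import deque


def bfs_shortest_path(adjacency, source, target):
    """Return a path with the fewest edges from source to target, or None."""
    parents = {source: None}  # doubles as the visited set
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            # Walk the parent links back to the source
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbor in adjacency.get(node, []):
            if neighbor not in parents:
                parents[neighbor] = node
                queue.append(neighbor)
    return None


graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["E"], "E": []}
print(bfs_shortest_path(graph, "A", "E"))  # ['A', 'B', 'D', 'E']
```

Because nodes are visited level by level, the first time the target is dequeued it has been reached over the smallest possible number of edges.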

Depth-First Search (DFS): Depth-First Search is another crucial graph algorithm that explores a graph with a depth-first principle. DFS starts from a given source node and traverses as far as possible along each branch before backtracking. The algorithm is useful for identifying cycles in a graph, performing topological sorting, and searching for specific nodes or patterns.
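
As one example of these uses, the following sketch detects a cycle in a directed graph with a recursive DFS. The adjacency dict format and the has_cycle name are assumptions for the example, and the recursion would need to be replaced with an explicit stack for very deep graphs.

```python
def has_cycle(adjacency):
    """Detect a cycle in a directed graph with depth-first search."""
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    color = {node: WHITE for node in adjacency}

    def visit(node):
        color[node] = GRAY
        for neighbor in adjacency.get(node, []):
            state = color.get(neighbor, WHITE)
            if state == GRAY:
                return True  # back edge to a node on the current path
            if state == WHITE and visit(neighbor):
                return True
        color[node] = BLACK  # backtrack: everything below is cycle-free
        return False

    return any(color[node] == WHITE and visit(node) for node in adjacency)


acyclic = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
cyclic = {"A": ["B"], "B": ["C"], "C": ["A"]}
print(has_cycle(acyclic), has_cycle(cyclic))  # False True
```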

PageRank algorithm: Developed by Google's founders, PageRank is a graph algorithm used to measure the importance or relevance of nodes in a graph, particularly in web graphs. PageRank assigns each node a numerical value based on the number and quality of incoming links, and plays a vital role in search engine ranking, recommendation systems, and social network analysis.
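
The idea can be sketched with a simple power iteration. This is a simplified textbook version, not the implementation any particular graph database ships. The damping factor of 0.85 and the fixed iteration count are conventional but arbitrary choices, and the sketch assumes every link target also appears as a key in the adjacency dict.

```python
def pagerank(adjacency, damping=0.85, iterations=50):
    """Approximate PageRank scores with a fixed number of power iterations."""
    nodes = list(adjacency)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node receives a small baseline share of rank
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = adjacency[node]
            if targets:
                # Split this node's rank evenly among the nodes it links to
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling node: spread its rank evenly over all nodes
                for other in nodes:
                    new_rank[other] += damping * rank[node] / n
        rank = new_rank
    return rank


web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(web)
print(max(ranks, key=ranks.get))  # C, the node with the most incoming links
```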

These are just a few examples of the numerous graph algorithms available. Graph databases employ a wide range of algorithms to perform tasks such as community detection, centrality analysis, graph clustering, and more.

Sum up

In this article, we explored the world of graph databases and their significance in modern data management. We defined graph databases and highlighted their importance in handling complex relationships and interconnected data. If you're curious and want to learn more about the fascinating world of graphs, make sure to check out our blog and give us a shout in our community.

Security Analysis with JupiterOne’s Starbase and Memgraph

Matea Pesic — Tue, 22 Aug 2023 16:36:38 +0000

Starbase is an open-source graph-based security analysis tool that unifies all of JupiterOne’s integrations into one. It collects assets and relationships from services and systems, including cloud infrastructure, SaaS applications, security controls, and more, into an intuitive graph visualization. With over 115 open-source graph integrations, Starbase collaborates with your existing toolkit enabling easy and insightful cyber security analysis.

In this article, we’ll dig into Starbase, guiding you through the setup of two example integrations and enabling Starbase to work with Memgraph for easy ingestion and visualization of your graph data.

Prerequisites

Installed Yarn package manager.
Installed Node.js.
A running Memgraph instance—visit Memgraph’s docs for instructions on how to install and connect to Memgraph.

Setting up Starbase

To kick-start your Starbase setup, clone the JupiterOne/Starbase repo into a local directory and make sure you have Yarn and Node.js installed.
Once the repository is cloned and the prerequisites are installed, open a terminal in the cloned directory and run the yarn command. It installs all of the necessary project dependencies.
The next step is to configure your integration of choice. You can find a list of all integrations on JupiterOne’s GitHub repo. In this article, we explore two of them: Zoom and GitHub.

Setting up integrations

In order to set up an integration, you need to register an account in the system the integration targets for ingestion and obtain the necessary API credentials. Starbase leverages credentials from external services to authenticate and collect data. When Starbase is started, it reads configuration data from a single configuration file named config.yaml at the root of the project.

Zoom integration

In order to configure the Zoom integration, we need to create a Zoom app to retrieve the needed credentials:

Go to the Zoom App Marketplace and sign into your Zoom account.
In the top right corner, go to the Develop dropdown menu and select Build App.
Choose to create an OAuth type of app.
Take note of your Account ID, Client ID, and Client secret which we’ll need for the configuration file later on.
In the Scopes section, add group:read:admin, role:read:admin, user:read:admin, and account:read:admin.

After you’ve successfully created your Zoom App, open up the starbase repo in your editor of choice and create your config.yaml file. This is an example of a config.yaml file for Zoom integration:

integrations:
  -
    name: graph-zoom
    instanceId: testInstanceId
    directory: ./.integrations/graph-zoom
    gitRemoteUrl: https://github.com/JupiterOne/graph-zoom.git
    config:
      ACCOUNT_ID: <ACCOUNT_ID>
      CLIENT_ID: <CLIENT_ID>
      CLIENT_SECRET: <CLIENT_SECRET>
      SCOPES: 'group:read:admin role:read:admin user:read:admin account:read:admin'

GitHub integration

In order to configure GitHub integration, we need to create a GitHub app to retrieve the needed credentials:

Go to the GitHub Apps and select to create a new GitHub App
Name your app, enter a homepage URL (in this case, you can use the JupiterOne’s Starbase repo URL), uncheck the webhook, and adjust the permissions. The following permissions need to be set to read-only:
Repository permissions: Actions, Environments, Issues, Pull Requests, and Secrets
Organization permissions: Administration, Members, and Secrets
The rest of the permissions are No access by default.

Note: read-only access to Secrets doesn’t expose the actual secret values. It only exposes metadata showing that the secrets exist.

Select Any account and create your GitHub App.

After you’ve successfully created your GitHub App, open up the cloned Starbase repository in your editor of choice and create your config.yaml file. Generate your private key and retrieve other needed credentials from the GitHub App you previously created. Below is an example of a config.yaml file for a GitHub integration:

integrations:
  -
    name: graph-github
    instanceId: testInstanceId
    directory: ./.integrations/graph-github
    gitRemoteUrl: https://github.com/JupiterOne/graph-github.git
    config:
      GITHUB_APP_ID: <GITHUB_APP_ID>
      GITHUB_APP_LOCAL_PRIVATE_KEY_PATH: '{YOURPATH}/{YOURFILENAME}.private-key.pem'
      INSTALLATION_ID: <INSTALLATION_ID>
      GITHUB_API_BASE_URL: https://api.github.com

Use Starbase with Memgraph

After you’ve successfully created your config.yaml file, the last step is to adjust your queries to work with Memgraph. In order to do that, run the following steps:

First, open a terminal in the folder where you cloned the Starbase repo and run the yarn starbase setup command. It clones or updates all integrations listed in the config.yaml file and installs the dependencies for each integration.
Run your Memgraph instance. Follow the instructions from Memgraph’s docs on how to connect to Memgraph, or if you are using Docker, simply run the following command:
docker run -it -p 3000:3000 -p 7444:7444 -p 7687:7687 memgraph/memgraph-platform
By modifying just a single line of code, you are ready to use Starbase with Memgraph. Inside the neo4jGraphStore.js file, locate the addEntities() function. To enable compatibility with Memgraph, simply update the following line:

await this.runCypherCommand(`CREATE INDEX index_${entity._type} IF NOT EXISTS FOR (n:${entity._type}) ON (n._key, n._integrationInstanceID);`);

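With the following two commands, since Memgraph creates label-property indexes with CREATE INDEX ON :Label(property), one property per statement:

await this.runCypherCommand(`CREATE INDEX ON :${entity._type}(_key);`);
await this.runCypherCommand(`CREATE INDEX ON :${entity._type}(_integrationInstanceID);`);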

You are all set to use Starbase with Memgraph. The instance is listening on port 7687, as defined in the code.

The final step is to run the yarn starbase run command. Afterward, launch your browser and go to localhost:3000 to access Memgraph Lab or open your desktop version to explore and visualize your graph data.

Explore your dataset

Below, we’ve provided a few query examples that demonstrate how you can dig into your dataset and extract valuable insights. The following examples assume the use of GitHub integration.

The following query returns up to three of the GitHub users extracted from the organization:

MATCH (n:github_user) RETURN n LIMIT 3;

If you also want to determine which code owners of organization repositories grant access to outside contributors, execute the following query:

MATCH (account:github_account)-[e:OWNS]->(repo:github_repo)-[f:ALLOWS]->(user:github_user {role: 'OUTSIDE'}) RETURN account, repo, user, e, f;

Takeaways

Starbase is a powerful tool that simplifies security analysis by unifying integrations into a user-friendly graph view, enhancing cybersecurity insights. Incorporating Memgraph for data ingestion adds another dimension by enhancing its capabilities and visualizing your data. If you are curious about graphs and would like to learn more, make sure to check out our blog and join our community on Discord.

Memgraph vs. TigerGraph

Vlasta Pavicic — Fri, 18 Aug 2023 06:43:44 +0000

In today's data-driven world, the necessity to process and interpret complex relationships within massive datasets is making organizations continually search for the go-to graph database, leaving the traditional relational database options behind. After the initial DB-Engines consultations, two names commonly arise in conversations: TigerGraph and Memgraph.

Background on both solutions

Founded in 2012 by Dr. Yu Xu, TigerGraph's core objective is to provide a scalable and efficient graph database platform that enables organizations to leverage the power of interconnected data, supporting applications ranging from fraud detection to AI and machine learning.

Memgraph is an in-memory, open-source graph database with roots in the UK and Croatia. Founded by Marko Budiselic and Dominik Tomicevic in 2016, and backed by American investors, Memgraph prioritizes high performance and developer accessibility. With a robust community edition, the platform offers a blend of ease of use and practical functionality, all presented through clear and uncomplicated licensing, making it the backbone of many cybersecurity solutions.

Memgraph vs. TigerGraph differences

Although both TigerGraph and Memgraph have been developed in C++ and aim to provide performant solutions for real-time data analytics, there exist some important differences between the two platforms that set them apart. Let’s check what those are.

Query language

The choice of query language plays a significant role in the overall user experience.

GSQL, TigerGraph's proprietary query language, does offer an expressive, Turing-complete language tailored for graph pattern-matching and analytics functions but might present a steeper learning curve for those new to graphs. It has been specifically designed for TigerGraph, and the skillset may not be easily transferred from or to other graph database platforms.

In contrast, the Cypher query language is an open-source, declarative language known for its user-friendly syntax. Cypher's human-readable style has propelled it into a standard for querying graph databases. It was developed by Neo4j but is used by various systems, including Memgraph. Thanks to its simplicity and broad community support, it is the preferred choice for many developers, who know their applications will need minimal changes if they ever switch to another database vendor.

Data storage

TigerGraph and Memgraph offer distinctive approaches to handling data in their graph databases, each reflecting a unique strategy to balance performance, scalability, and flexibility.

TigerGraph employs a hybrid memory-disk approach, leveraging RAM for storing frequently accessed data and disk storage for large graphs that may exceed available memory. This hybrid model allows TigerGraph to achieve real-time analytics, where active datasets are immediately available, while also scaling to handle massive datasets without being constrained by RAM.

In contrast, Memgraph's architecture has been built natively for in-memory data analysis and storage, focusing on lightning-fast data processing. Being ACID compliant, it ensures consistency and reliability in its core design. However, Memgraph also offers flexibility. An analytical storage mode that bypasses ACID compliance is available, accelerating analytics and data import operations when absolute consistency is not required. Additionally, an on-disk storage option allows users to weigh performance against budget constraints, thus achieving a balance tailored to specific needs.

While TigerGraph's hybrid approach offers a comprehensive solution for both speed and scalability, Memgraph's focus on in-memory processing with adaptable options reflects a commitment to performance with versatility to suit various requirements. The distinction between these two models shows the innovation in graph database technology, catering to diverse needs in data management, analysis, and storage.

Data isolation level

TigerGraph employs a read-committed isolation level, meaning that a transaction can see data committed both before and during its own execution. For example, two identical read queries inside one transaction can return different results if another transaction commits between them.

Memgraph, on the other hand, uses snapshot isolation by default, with the option to change the isolation level. Under snapshot isolation, each transaction operates on a consistent snapshot of the data taken when the transaction starts, so repeated reads inside one transaction return the same result even if other transactions commit in the meantime. This gives more predictable query results and a smoother transaction experience, which is why snapshot isolation is generally considered the more reliable approach in many scenarios.
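
The difference between the two isolation levels can be shown with a toy model. The classes below are hypothetical and do not reflect how either database implements isolation: a read-committed transaction always reads the latest committed state, while a snapshot transaction reads from a copy taken when it started.

```python
import copy


class Database:
    def __init__(self, data):
        self.committed = dict(data)

    def commit(self, changes):
        self.committed.update(changes)


class ReadCommittedTx:
    """Every read sees whatever is committed at the moment of the read."""

    def __init__(self, db):
        self.db = db

    def read(self, key):
        return self.db.committed[key]


class SnapshotTx:
    """Every read sees the data as it was when the transaction started."""

    def __init__(self, db):
        self.snapshot = copy.deepcopy(db.committed)

    def read(self, key):
        return self.snapshot[key]


db = Database({"stock": 10})
rc, si = ReadCommittedTx(db), SnapshotTx(db)
first = (rc.read("stock"), si.read("stock"))
db.commit({"stock": 7})  # another transaction commits in between
second = (rc.read("stock"), si.read("stock"))
print(first, second)  # (10, 10) (7, 10)
```

The read-committed transaction sees the value change between its two reads, while the snapshot transaction keeps seeing 10. Real systems avoid copying the whole dataset and typically keep multiple versions of each record instead.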

Pricing models and support

TigerGraph has a free version that allows users to work with up to 50GB of data, making it suitable for small projects or initial exploration. Memgraph offers something different with its Community Edition, which is not only free but also open-source and packed with features.

For example, both TigerGraph and Memgraph offer high availability features to ensure that data is consistently accessible and resistant to failures, but Memgraph's replication is available even in the Community Edition of the product. This means that the Community Edition is not "crippleware" but a fully functional version that allows users to "kick the tires" on the product and properly test it to ensure it meets requirements before deploying it in a production environment.

By offering this, and a plethora of other features in the Community Edition, Memgraph not only shows a commitment to performance and reliability but also to accessibility, empowering users to explore and validate the capabilities of the software without barriers.

Due to the lack of complicated layers of management that some larger companies might have, in Memgraph you can talk directly to engineers if you have questions or need help. It's a more hands-on, direct way of working that puts you closer to the people who built the product, and it can make working with Memgraph a more pleasant and efficient experience.

Overview of features

Takeaways on both graph databases

Memgraph and TigerGraph both offer graph database solutions, but Memgraph's native in-memory design sets it apart. Built for speed without losing stability or ACID compliance, Memgraph provides efficient real-time querying and analytics. Although TigerGraph claims to be "The World’s Fastest Graph Analytics Platform for the Enterprise", clients have reported increased performance after switching to Memgraph. If speed, reliability, direct interaction, and support from engineers are key priorities, Memgraph may be the more appealing choice.

Check the performance of Memgraph on your own dataset using Benchgraph, a graph database performance benchmark, and feel free to contact us about making the switch.
