Exercise based on AstraDB’s notebook “Code Generation with GraphRAG”
Introduction
DataStax (now part of IBM) is a company that provides a real-time data platform built on the open-source Apache Cassandra database. 🏢 They focus on providing highly scalable, resilient, and cloud-native solutions for businesses that need to handle large volumes of data with low latency. Their offerings are designed to help developers build and run data-intensive applications, and they have been a key player in the NoSQL and distributed database space for years.
AstraDB is DataStax’s serverless, cloud-native database service that is built on Apache Cassandra. Its serverless architecture makes it particularly useful for generative AI applications because it automatically scales to handle fluctuating workloads without requiring manual intervention. This is crucial for AI models that might have unpredictable usage patterns. It also provides powerful vector search capabilities, which are essential for RAG (Retrieval-Augmented Generation) applications. This feature allows the generative AI model to efficiently search through massive datasets of proprietary information to retrieve the most relevant context, leading to more accurate and grounded responses.
What is GraphRAG?
Graph RAG, or Graph-based Retrieval-Augmented Generation, is an advanced technique used in generative AI to improve the accuracy and relevance of a large language model’s (LLM) output. While traditional RAG relies on simple vector search to retrieve relevant text snippets, Graph RAG goes a step further by leveraging a knowledge graph to understand the relationships and connections between different pieces of information. Instead of just finding similar text, it can follow logical paths in the data — for example, linking a function to its parent class, its parameters, and related concepts in the documentation. This structured retrieval allows the LLM to access a more complete and contextually rich set of information, which is crucial for complex tasks like generating functional code or providing detailed, multi-step answers. By navigating the graph, the system can retrieve not just a document, but the entire “contextual neighborhood” of that document, leading to more coherent and accurate outputs.
We can also refer to the ‘Graph RAG’ description provided on DataStax’s GitHub repository:
“Retrievers providing both unstructured (similarity-search on vectors) and structured (traversal of metadata properties).”
The project’s README expands on this: “it provides retrievers combining vector-search (for unstructured similarity) and traversal (for structured relationships in metadata). These retrievers are implemented using the metadata search functionality of existing vector stores, allowing you to traverse your existing vector store!”
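To make this concrete, below is a minimal sketch (with hypothetical documents and metadata, not taken from the repository) of the kind of link such a retriever can follow: vector search surfaces one document, and traversal follows its metadata to related documents.
from langchain_core.documents import Document
# Two hypothetical documentation entries: the child's metadata points at the
# parent's id, forming an edge a graph retriever can follow.
parent = Document(
    id="astrapy.database.Database",
    page_content="A Database object for working with Data API collections.",
)
child = Document(
    id="astrapy.database.Database.get_collection",
    page_content="Return a Collection object for an existing collection.",
    metadata={"parent": "astrapy.database.Database"},
)
# Vector search might surface only `child`; traversal can then follow
# child.metadata["parent"] -> parent.id to pull in the parent's context too.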
Among the collection of examples in the repository, one in particular truly stood out: “Code Generation with GraphRAG.” The title alone was enough to pique my curiosity, as it addressed a common developer pain point — the challenge of generating accurate and functional code from a vast and complex body of documentation. It wasn’t just another example; it was a promise to solve a tangible problem.
I was curious to run the code on my own. The code excerpt below is the original notebook as provided on GitHub.
# ruff: noqa: T201
# Code Generation with GraphRAG
## Introduction
In this notebook, we demonstrate that **GraphRAG significantly outperforms standard vector-based retrieval** for generating working code from documentation. While traditional vector search retrieves relevant snippets, it often lacks the structured understanding needed to produce executable results. In contrast, **GraphRAG enables the LLM to follow logical relationships within documentation, leading to functional code generation**.
We achieve this by leveraging a custom traversal strategy, selecting nodes that contain both **code examples and descriptive text**, allowing the LLM to assemble more complete responses.
## Getting Started
Below we will experiment with the AstraPy documentation to evaluate how well GraphRAG can generate working code.
Using AstraDB as the vector store, we compare GraphRAG’s structured retrieval with standard vector search to solve a specific coding task.
The query we will be sending to the LLM is the following:
query = """
Generate a function for connecting to an AstraDB cluster using the AstraPy library,
and retrieve some rows from a collection. The number of rows to return should be a
parameter on the method. Use Token Authentication. Assume the cluster is hosted on
AstraDB. Include the necessary imports and any other necessary setup. The following
environment variables are available for your use:
- `ASTRA_DB_API_ENDPOINT`: The Astra DB API endpoint.
- `ASTRA_DB_APPLICATION_TOKEN`: The Astra DB Application token.
- `ASTRA_DB_KEYSPACE`: The Astra DB keyspace.
- `ASTRA_DB_COLLECTION`: The Astra DB collection.
"""
The following block will configure the environment from the Colab Secrets.
To run it, you should have the following Colab Secrets defined and accessible to this notebook:
- `OPENAI_API_KEY`: The OpenAI key.
- `ASTRA_DB_API_ENDPOINT`: The Astra DB API endpoint.
- `ASTRA_DB_APPLICATION_TOKEN`: The Astra DB Application token.
- `LANGCHAIN_API_KEY`: Optional. If defined, will enable LangSmith tracing.
- `ASTRA_DB_KEYSPACE`: Optional. If defined, will specify the Astra DB keyspace. If not defined, will use the default.
If you don't yet have access to an AstraDB database, or need to check your credentials, see the help [here](https://python.langchain.com/docs/integrations/vectorstores/astradb/#credentials).
# Install modules.
%pip install \
langchain-core \
langchain-astradb \
langchain-openai \
langchain-graph-retriever \
graph-rag-example-helpers
The last package -- `graph-rag-example-helpers` -- includes the helpers and example documents that we will use in this notebook.
# Configure import paths.
import os
import sys
from langchain_core.documents import Document
sys.path.append("../../")
# Initialize environment variables.
from graph_rag_example_helpers.env import Environment, initialize_environment
initialize_environment(Environment.ASTRAPY)
os.environ["LANGCHAIN_PROJECT"] = "code-generation"
os.environ["ASTRA_DB_COLLECTION"] = "code_generation"
def print_doc_ids(docs: list[Document]):
[print(f"`{doc.id}` has example: {'example' in doc.metadata}") for doc in docs]
## Part 1: Loading Data
First, we'll demonstrate how to load the example AstraPy documentation into `AstraDBVectorStore`. We will be creating a LangChain Document for every module, class, attribute, and function in the package.
We will use the pydoc description field for the `page_content` field in the document. Note that not every item in the package has a description. Because of this, there will be many documents that have no page content.
Besides the description, we will also include a bunch of extra information related to the item in the `metadata` field. This info can include the item's name, kind, parameters, return type, base class, etc.
The item's `id` will be the item's path in the package.
Below are two example documents: one with page content and one without.
#### Example doc with page content
id: astrapy.client.DataAPIClient
page_content: |
A client for using the Data API. This is the main entry point and sits
at the top of the conceptual "client -> database -> collection" hierarchy.
A client is created first, optionally passing it a suitable Access Token.
Starting from the client, then:
- databases (Database and AsyncDatabase) are created for working with data
- AstraDBAdmin objects can be created for admin-level work
metadata:
name: DataAPIClient
kind: class
path: astrapy.client.DataAPIClient
parameters:
token: |
str | TokenProvider | None = None
an Access Token to the database. Example: `"AstraCS:xyz..."`.
This can be either a literal token string or a subclass of
`astrapy.authentication.TokenProvider`.
environment: |
str | None = None
a string representing the target Data API environment.
It can be left unspecified for the default value of `Environment.PROD`;
other values include `Environment.OTHER`, `Environment.DSE`.
callers: |
Sequence[CallerType] = []
a list of caller identities, i.e. applications, or frameworks,
on behalf of which Data API and DevOps API calls are performed.
These end up in the request user-agent.
Each caller identity is a ("caller_name", "caller_version") pair.
example: |
>>> from astrapy import DataAPIClient
>>> my_client = DataAPIClient("AstraCS:...")
>>> my_db0 = my_client.get_database(
... "https://01234567-....apps.astra.datastax.com"
... )
>>> my_coll = my_db0.create_collection("movies", dimension=2)
>>> my_coll.insert_one({"title": "The Title", "$vector": [0.1, 0.3]})
>>> my_db1 = my_client.get_database("01234567-...")
>>> my_db2 = my_client.get_database("01234567-...", region="us-east1")
>>> my_adm0 = my_client.get_admin()
>>> my_adm1 = my_client.get_admin(token=more_powerful_token_override)
>>> database_list = my_adm0.list_databases()
references:
astrapy.client.DataAPIClient
gathered_types:
astrapy.constants.CallerType
astrapy.authentication.TokenProvider
This is the documentation for [`astrapy.client.DataAPIClient`](https://github.com/datastax/astrapy/blob/v1.5.2/astrapy/client.py#L50) class. The `page_content` field contains the description of the class, and the `metadata` field contains the rest of the details, including example code of how to use the class.
The `references` metadata field contains the list of related items used in the example code block. The `gathered_types` field contains the list of types from the parameters section. In GraphRAG, we can use these fields to link to other documents.
#### Example doc without page content
id: astrapy.admin.AstraDBAdmin.callers
page_content: ""
metadata:
name: callers
path: astrapy.admin.AstraDBAdmin.callers
kind: attribute
This is the documentation for `astrapy.admin.AstraDBAdmin.callers`. The `page_content` field is empty, and the `metadata` field contains the details.
Despite having no page content, this document can still be useful for Graph RAG. We'll add a `parent` field to the metadata at vector store insertion time to link it to the parent document: `astrapy.admin.AstraDBAdmin`, and we can use this for traversal.
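For intuition, the parent id is simply the item's dotted path with its last segment removed; a minimal sketch of that idea (the real logic lives in the `ParentTransformer` used in the loading step below):
def derive_parent_id(doc_id: str, path_delimiter: str = ".") -> str | None:
    # e.g. "astrapy.admin.AstraDBAdmin.callers" -> "astrapy.admin.AstraDBAdmin"
    head, sep, _ = doc_id.rpartition(path_delimiter)
    return head if sep else None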
### Create the AstraDBVectorStore
Next, we'll create the Vector Store we're going to load these documents into.
In our case, we'll use DataStax Astra DB with OpenAI embeddings.
from langchain_astradb import AstraDBVectorStore
from langchain_openai import OpenAIEmbeddings
store = AstraDBVectorStore(
embedding=OpenAIEmbeddings(),
collection_name=os.getenv("ASTRA_DB_COLLECTION"),
)
### Loading Data
Now it's time to load the data into our Vector Store. We'll use a helper method to download already prepared documents from the `graph-rag-example-helpers` package. If you want to see how these documents were created from the AstraPy package, see details in the Appendix.
We will use the [`ParentTransformer`](../../guide/transformers/#parenttransformer) to add a parent field to the metadata document field. This will allow us to traverse the graph from a child to its parent.
from graph_rag_example_helpers.datasets.astrapy import fetch_documents
from langchain_graph_retriever.transformers import ParentTransformer
transformer = ParentTransformer(path_delimiter=".")
doc_ids = store.add_documents(transformer.transform_documents(fetch_documents()))
We can retrieve a sample document to check if the parent field was added correctly:
from graph_rag_example_helpers.examples.code_generation import format_document
print(
format_document(
store.get_by_document_id("astrapy.admin.AstraDBAdmin.callers"), debug=True
)
)
At this point, we've created a Vector Store with all the documents from the AstraPy documentation. Each document contains metadata about the module, class, attribute, or function, and the page content contains the description of the item.
In the next section we'll see how to build relationships from the metadata in order to traverse through the documentation in a similar way to how a human would.
## Part 2: Graph Traversal
The GraphRAG library allows us to traverse through the documents in the Vector Store. By changing the [`Strategy`](../../guide/strategies/), we can control how the traversal is performed.
### Basic Traversal
We'll start with the default [`Eager`](../../guide/strategies/#eager) strategy, which will traverse the graph in a breadth-first manner. In order to do this we need to set up the relationships between the documents. This is done by defining the "edges" between the documents.
In our case we will connect the "references", "gathered_types", "parent", "implemented_by", and "bases" fields in the metadata to the "id" field of the document they reference.
edges = [
("gathered_types", "$id"),
("references", "$id"),
("parent", "$id"),
("implemented_by", "$id"),
("bases", "$id"),
]
Note that edges are directional, and indicate metadata fields by default. The magic string `$id` is used to indicate the document's id.
In the above `edges` list, any document id found in `gathered_types` will be connected to documents with the corresponding id. The other edges will work in a similar way.
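As a concrete illustration, using the `DataAPIClient` example document shown earlier (a hedged reading of the edge semantics, not code from the notebook):
edge = ("gathered_types", "$id")
# The DataAPIClient document's metadata["gathered_types"] includes
# "astrapy.authentication.TokenProvider", so this edge gives it an outgoing
# link to the document whose id equals that string, and traversal can hop
# from the client's docs to the TokenProvider's docs.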
Let's use these edges to create a LangChain retriever and retrieve documents for our query.
from langchain_graph_retriever import GraphRetriever
default_retriever = GraphRetriever(store=store, edges=edges)
print_doc_ids(default_retriever.invoke(query, select_k=6, start_k=3, max_depth=2))
Notes on the extra keyword args:
- `select_k` in GraphRAG is equivalent to `k` in LangChain. It specifies the number of nodes to select during retrieval.
- `start_k` indicates the number of nodes to select using standard vector retrieval before moving onto graph traversal.
- `max_depth` is the maximum depth to traverse in the graph.
With this configuration, we were only able to find 2 documents with example code.
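As an aside not in the original notebook, one way to isolate how much traversal contributes is to disable it: with `max_depth=0` no edges should be followed, which ought to reduce the retriever to plain vector search over the seed documents.
# Hypothetical comparison: traversal disabled, seed nodes only.
print_doc_ids(default_retriever.invoke(query, select_k=6, start_k=6, max_depth=0))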
### Custom Strategy
Now we will create a custom strategy that will traverse a larger portion of the graph and return the documents that contain code examples or descriptive text.
To do this, we need to implement a class that inherits from the base [`Strategy`](../../reference/graph_retriever/strategies/#graph_retriever.strategies.Strategy) class and overrides [`iteration`](../../reference/graph_retriever/strategies/#graph_retriever.strategies.Strategy.iteration) method:
import dataclasses
from collections.abc import Iterable
from graph_retriever.strategies import NodeTracker, Strategy
from graph_retriever.types import Node
@dataclasses.dataclass
class CodeExamples(Strategy):
# internal dictionary to store all nodes found during the traversal
_nodes: dict[str, Node] = dataclasses.field(default_factory=dict)
def iteration(self, *, nodes: Iterable[Node], tracker: NodeTracker) -> None:
# save all newly found nodes to the internal node dictionary for later use
self._nodes.update({n.id: n for n in nodes})
# traverse the newly found nodes
new_count = tracker.traverse(nodes=nodes)
# if no new nodes were found, we have reached the end of the traversal
if new_count == 0:
example_nodes = []
description_nodes = []
# iterate over all nodes and separate nodes with examples from nodes with
# descriptions
for node in self._nodes.values():
if "example" in node.metadata:
example_nodes.append(node)
elif node.content != "":
description_nodes.append(node)
# select the nodes with examples first and descriptions second
# note: the base `finalize_nodes` method will truncate the list to the
# `select_k` number of nodes
tracker.select(example_nodes)
tracker.select(description_nodes)
As described in the comments above, this custom strategy will first try to select documents that contain code examples, and then will use documents that contain descriptive text.
We can now use this custom strategy to build a custom retriever, and ask the query again:
custom_retriever = GraphRetriever(store=store, edges=edges, strategy=CodeExamples())
print_doc_ids(custom_retriever.invoke(query, select_k=6, start_k=3, max_depth=2))
Now we have found 6 documents with code examples! That is a significant improvement over the default strategy.
## Part 3: Using GraphRAG to Generate Code
We now use the `CodeExamples` strategy inside a LangChain pipeline to generate code snippets.
We will also use a custom document formatter, which will format the document in a way that makes it look like standard documentation. In particular, it will format all the extra details stored in the metadata in a way that is easy to read. This will help the LLM use the information in the documents to generate code.
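The real `format_docs` ships in the `graph-rag-example-helpers` package and is imported just below; as a rough sketch of what such a formatter might do, using metadata fields from the example documents above (the rendering is my own approximation, not the helper's actual code):
from langchain_core.documents import Document
def sketch_format_docs(docs: list[Document]) -> str:
    # Approximate formatter: render each document like a documentation page.
    pages = []
    for doc in docs:
        lines = [f"{doc.metadata.get('kind', 'item')}: {doc.id}", doc.page_content]
        for name, desc in doc.metadata.get("parameters", {}).items():
            lines.append(f"  param {name}: {desc}")
        if "example" in doc.metadata:
            lines.append("Example:\n" + doc.metadata["example"])
        pages.append("\n".join(lines))
    # Pages separated by three dashes, matching the prompt's convention.
    return "\n---\n".join(pages)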
from graph_rag_example_helpers.examples.code_generation import format_docs
from langchain.chat_models import init_chat_model
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
llm = init_chat_model("gpt-4o-mini", model_provider="openai")
prompt = ChatPromptTemplate.from_template(
"""Generate a block of runnable python code using the following documentation as
guidance. Return only the code. Don't include any example usage.
Each documentation page is separated by three dashes (---) on its own line.
If certain pages of the provided documentation aren't useful for answering the
question, feel free to ignore them.
Question: {question}
Related Documentation:
{context}
"""
)
graph_chain = (
{"context": custom_retriever | format_docs, "question": RunnablePassthrough()}
| prompt
| llm
| StrOutputParser()
)
print(graph_chain.invoke(query))
We can try running this generated code to see if it works:
import os
from astrapy.client import DataAPIClient
from astrapy.collection import Collection
def connect_and_retrieve_rows(num_rows):
api_endpoint = os.getenv("ASTRA_DB_API_ENDPOINT")
application_token = os.getenv("ASTRA_DB_APPLICATION_TOKEN")
keyspace = os.getenv("ASTRA_DB_KEYSPACE")
collection_name = os.getenv("ASTRA_DB_COLLECTION")
client = DataAPIClient(token=application_token)
database = client.get_database(api_endpoint)
collection = Collection(database=database, name=collection_name, keyspace=keyspace)
rows = collection.find(limit=num_rows)
return list(rows)
for row in connect_and_retrieve_rows(5):
print(row)
## Conclusion
The results clearly demonstrate that **GraphRAG leads to functional code generation, while standard vector-based retrieval fails**.
Attempts using **only an LLM** or **standard vector-based RAG** produced **incomplete or non-functional outputs**; the appendix includes examples illustrating these limitations.
By structuring document relationships effectively, **GraphRAG improves retrieval quality, enabling more reliable LLM-assisted code generation**.
## Appendix
### LLM Alone
Here we show how to use the LLM alone to generate code for the query. We will use the same query as before, but modify the prompt to not include any context.
llm_only_prompt = ChatPromptTemplate.from_template(
"""Generate a block of runnable python code. Return only the code.
Don't include any example usage.
Question: {question}
"""
)
llm_only_chain = (
{"question": RunnablePassthrough()} | llm_only_prompt | llm | StrOutputParser()
)
print(llm_only_chain.invoke(query))
This code is not functional. The package `astra` and the class `AstraClient` do not exist.
### Standard RAG
Here we show how to use the LLM with standard RAG to generate code for the query. We will use the same query and prompt as we did with GraphRAG.
rag_chain = (
{
"context": store.as_retriever(k=6) | format_docs,
"question": RunnablePassthrough(),
}
| prompt
| llm
| StrOutputParser()
)
print(rag_chain.invoke(query))
This code is also not functional.
### Converting AstraPy Documentation
The AstraPy documentation was converted into a JSONL format via some custom code that is not included in this notebook. However, the code is available in the `graph-rag-example-helpers` package [here](https://github.com/datastax/graph-rag/blob/main/packages/graph-rag-example-helpers/src/graph_rag_example_helpers/examples/code_generation/converter.py).
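The converter itself is linked above; as a minimal sketch of the general approach (my own approximation using Python's `inspect` module, not the helper's actual code):
import inspect
def sketch_record(obj, path: str) -> dict:
    # Build one JSONL-style record (id, page_content, metadata) for a package item.
    return {
        "id": path,
        "page_content": inspect.getdoc(obj) or "",
        "metadata": {
            "name": path.rsplit(".", 1)[-1],
            "kind": "class" if inspect.isclass(obj)
            else "function" if inspect.isroutine(obj)
            else "attribute",
            "path": path,
        },
    }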
Eager to put this promising technique to the test, I decided to replicate the experiment on my own machine. For this, I ventured into uncharted territory by opting for a local setup powered by Ollama and the gpt-oss model—a combination I had never worked with before. This step was crucial not only to validate the repository's code but also to explore the feasibility of running such a complex GraphRAG pipeline in a self-contained, offline environment.
Implementation with Ollama and gpt-oss
The following is my implementation of the original code so it would run using Ollama and gpt-oss on my laptop.
- Environment preparation:
python3 -m venv venv
source venv/bin/activate
pip install --upgrade pip
- Installation of the required packages. Put the following list in a requirements.txt file:
langchain-core
langchain-astradb
langchain-community
langchain-graph-retriever
graph-rag-example-helpers
sentence-transformers
ollama
then install them:
pip install -r requirements.txt
- Download “gpt-oss” from the Ollama site (the model downloaded here is the 14 GB one):
ollama run gpt-oss
- Export the required environment variables:
export ASTRA_DB_APPLICATION_TOKEN="AstraCSXYZ"
export ASTRA_DB_API_ENDPOINT="https://xxx.yyy.zzz.apps.astra.datastax.com"
export ASTRA_DB_COLLECTION_NAME="your_collection"
export OPENAI_API_KEY="dummy_key"
- If you want to validate your AstraDB remote connection, you can use the code below; this step is not mandatory!
# db_validator.py
import os
import sys
from astrapy import DataAPIClient
import astrapy.exceptions as astra_exc
def validate_astra_connection():
"""
Validates the connection to AstraDB using environment variables and the DataAPIClient.
"""
required_vars = ["ASTRA_DB_APPLICATION_TOKEN", "ASTRA_DB_API_ENDPOINT", "ASTRA_DB_COLLECTION_NAME"]
for var in required_vars:
if not os.getenv(var):
print(f"Error: Environment variable '{var}' is not set.")
print("Please set all required variables and try again.")
sys.exit(1)
astra_db_token = os.getenv("ASTRA_DB_APPLICATION_TOKEN")
api_endpoint = os.getenv("ASTRA_DB_API_ENDPOINT")
collection_name = os.getenv("ASTRA_DB_COLLECTION_NAME")
print("Attempting to validate AstraDB connection...")
print("-" * 40)
try:
# Test the connection to the API endpoint and validate the token.
print(f"Connecting to API endpoint: {api_endpoint}...")
my_client = DataAPIClient(token=astra_db_token)
my_database = my_client.get_database(api_endpoint)
print("✅ Successfully connected to AstraDB.")
# Collection exists and the token has permissions.
print(f"Checking for collection: '{collection_name}'...")
# This will return a list of all collections, allowing us to verify existence.
collections_list = my_database.list_collection_names()
if collection_name not in collections_list:
print(f"❌ Collection '{collection_name}' not found.")
print("Please ensure the collection name is correct and exists in your database.")
sys.exit(1)
print(f"✅ Found collection '{collection_name}'.")
print("\nConnection validation successful! 🎉")
print("Your application token, API endpoint, and collection name are all valid.")
except astra_exc.ConnectionException as e:
print(f"❌ Connection error: {e}")
print("Please check your API endpoint URL and network connectivity.")
sys.exit(1)
except astra_exc.APIException as e:
print(f"❌ API error: {e}")
print("Please check your ASTRA_DB_APPLICATION_TOKEN for correctness and permissions.")
sys.exit(1)
except Exception as e:
print(f"❌ An unexpected error occurred: {e}")
print("Please review your environment variables and the code for any typos.")
sys.exit(1)
if __name__ == "__main__":
validate_astra_connection()
- Last but not least, the main code implementation (a slightly adapted version of the original notebook) ⬇️
import os
import sys
import dataclasses
from collections.abc import Iterable
# Ensure the project helpers are on the path
# This is required for fetching the example data and environment setup
sys.path.append("../../")
from langchain_core.documents import Document
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
from langchain_community.llms import Ollama
from langchain_astradb import AstraDBVectorStore
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_graph_retriever import GraphRetriever
from graph_rag_example_helpers.env import Environment, initialize_environment
from graph_rag_example_helpers.datasets.astrapy import fetch_documents
from graph_rag_example_helpers.examples.code_generation import format_document
from langchain_graph_retriever.transformers import ParentTransformer
from graph_retriever.strategies import NodeTracker, Strategy
from graph_retriever.types import Node
# --- User Configuration (Update these values) ---
# Replace with your actual Astra DB details or set as environment variables
# ASTRA_DB_API_ENDPOINT = os.getenv("ASTRA_DB_API_ENDPOINT")
# ASTRA_DB_APPLICATION_TOKEN = os.getenv("ASTRA_DB_APPLICATION_TOKEN")
# ASTRA_DB_KEYSPACE = os.getenv("ASTRA_DB_KEYSPACE", "default_keyspace")
# --- Main Application Logic ---
def main():
"""
Runs the full GraphRAG code generation pipeline.
"""
print("Initializing environment...")
# Initialize environment variables for AstraPy setup
initialize_environment(Environment.ASTRAPY)
os.environ["LANGCHAIN_PROJECT"] = "code-generation"
os.environ["ASTRA_DB_COLLECTION"] = "code_generation"
# Define the core query for code generation
query = """
Generate a function for connecting to an AstraDB cluster using the AstraPy library,
and retrieve some rows from a collection. The number of rows to return should be a
parameter on the method. Use Token Authentication. Assume the cluster is hosted on
AstraDB. Include the necessary imports and any other necessary setup. The following
environment variables are available for your use:
- `ASTRA_DB_API_ENDPOINT`: The Astra DB API endpoint.
- `ASTRA_DB_APPLICATION_TOKEN`: The Astra DB Application token.
- `ASTRA_DB_KEYSPACE`: The Astra DB keyspace.
- `ASTRA_DB_COLLECTION`: The Astra DB collection.
"""
# Create the AstraDB Vector Store with a local embedding model
print("Creating vector store...")
store = AstraDBVectorStore(
embedding=HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2"),
collection_name=os.getenv("ASTRA_DB_COLLECTION"),
)
# Load documentation data into the vector store
print("Loading data into the vector store...")
# The ParentTransformer adds a 'parent' field to metadata for traversal
transformer = ParentTransformer(path_delimiter=".")
doc_ids = store.add_documents(transformer.transform_documents(fetch_documents()))
print(f"Loaded {len(doc_ids)} documents.")
# Custom strategy for graph traversal
@dataclasses.dataclass
class CodeExamples(Strategy):
_nodes: dict[str, Node] = dataclasses.field(default_factory=dict)
def iteration(self, *, nodes: Iterable[Node], tracker: NodeTracker) -> None:
self._nodes.update({n.id: n for n in nodes})
new_count = tracker.traverse(nodes=nodes)
if new_count == 0:
example_nodes = []
description_nodes = []
for node in self._nodes.values():
if "example" in node.metadata:
example_nodes.append(node)
elif node.content != "":
description_nodes.append(node)
tracker.select(example_nodes)
tracker.select(description_nodes)
# Define graph edges for traversal
edges = [
("gathered_types", "$id"),
("references", "$id"),
("parent", "$id"),
("implemented_by", "$id"),
("bases", "$id"),
]
# Create the GraphRAG retriever with the custom strategy
print("Creating custom graph retriever...")
custom_retriever = GraphRetriever(store=store, edges=edges, strategy=CodeExamples())
# Verify the documents retrieved by the custom strategy
retrieved_docs = custom_retriever.invoke(query, select_k=6, start_k=3, max_depth=2)
print("\n--- Retrieved Document IDs ---")
[print(f"`{doc.id}` has example: {'example' in doc.metadata}") for doc in retrieved_docs]
print("------------------------------")
# Define the LangChain Expression Language (LCEL) chain
print("\nSetting up the generation chain...")
llm = Ollama(model="gpt-oss")
prompt = ChatPromptTemplate.from_template("""
You are a programming expert. Your job is to create a Python function based on a query.
You will be given the query and some related documentation to help you.
The function you generate should be runnable and complete.
Do not include any text outside of the code block.
Query: {question}
Relevant Documentation:
{context}
""")
def format_docs(docs: list[Document]) -> str:
"""Helper function to format documents for the prompt."""
return "\n\n--\n\n".join(doc.page_content for doc in docs)
rag_chain = (
{
"context": custom_retriever | format_docs,
"question": RunnablePassthrough(),
}
| prompt
| llm
| StrOutputParser()
)
# Invoke the chain to generate the code
print("\nGenerating code...")
generated_code = rag_chain.invoke(query)
print("\n--- Generated Code ---")
print(generated_code)
print("----------------------")
if __name__ == "__main__":
main()
- And the output, which works just as well as the sample notebook's results! 💪
Initializing environment...
Creating vector store...
/Users/alainairom/Devs/datastax-graphrag/astra-graphrag.py:61: LangChainDeprecationWarning: The class `HuggingFaceEmbeddings` was deprecated in LangChain 0.2.2 and will be removed in 1.0. An updated version of the class exists in the :class:`~langchain-huggingface package and should be used instead. To use it run `pip install -U :class:`~langchain-huggingface` and import as `from :class:`~langchain_huggingface import HuggingFaceEmbeddings``.
embedding=HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2"),
(first run only: sentence-transformers downloads the all-MiniLM-L6-v2 model files: modules.json, config_sentence_transformers.json, README.md, sentence_bert_config.json, config.json, model.safetensors (90.9 MB), tokenizer_config.json, vocab.txt, tokenizer.json, special_tokens_map.json)
Loading data into the vector store...
Loaded 1081 documents.
Creating custom graph retriever...
--- Retrieved Document IDs ---
`astrapy.admin.AstraDBDatabaseAdmin.from_api_endpoint` has example: True
`astrapy.admin.AstraDBDatabaseAdmin.from_astra_db_admin` has example: True
`astrapy.client.DataAPIClient` has example: True
`astrapy.admin.AstraDBDatabaseAdmin` has example: True
`astrapy.admin.AstraDBAdmin` has example: True
`astrapy.database.AsyncDatabase` has example: True
------------------------------
Setting up the generation chain...
/Users/alainairom/Devs/datastax-graphrag/astra-graphrag.py:115: LangChainDeprecationWarning: The class `Ollama` was deprecated in LangChain 0.3.1 and will be removed in 1.0.0. An updated version of the class exists in the :class:`~langchain-ollama package and should be used instead. To use it run `pip install -U :class:`~langchain-ollama` and import as `from :class:`~langchain_ollama import OllamaLLM``.
llm = Ollama(model="gpt-oss")
Generating code...
--- Generated Code ---
import os
from typing import List, Dict, Any
# The AstraPy library provides a convenient client for the Data API.
# If it is not already installed, you can install it via:
# pip install astrapy
from astrapy import DataAPIClient
def fetch_rows(num_rows: int) -> List[Dict[str, Any]]:
"""
Connect to an AstraDB cluster using Token authentication and return a list of
rows from the configured collection.
Parameters
----------
num_rows : int
The maximum number of rows to retrieve from the collection.
Returns
-------
List[Dict[str, Any]]
A list of dictionaries representing the rows fetched from the collection.
"""
# Grab the configuration from environment variables
api_endpoint = os.getenv("ASTRA_DB_API_ENDPOINT")
app_token = os.getenv("ASTRA_DB_APPLICATION_TOKEN")
keyspace = os.getenv("ASTRA_DB_KEYSPACE")
collection_name = os.getenv("ASTRA_DB_COLLECTION")
if not all([api_endpoint, app_token, keyspace, collection_name]):
raise ValueError(
"One or more required environment variables are missing "
"(ASTRA_DB_API_ENDPOINT, ASTRA_DB_APPLICATION_TOKEN, "
"ASTRA_DB_KEYSPACE, ASTRA_DB_COLLECTION)."
)
# Create a Data API client using token authentication
client = DataAPIClient(api_endpoint=api_endpoint, token=app_token)
# Retrieve the keyspace (database) and collection objects
database = client.get_database(keyspace)
collection = database.get_collection(collection_name)
# Query the collection for the requested number of rows.
# The AstraPy collection API exposes a `fetch_rows` method that accepts
# a `limit` argument. It returns an iterable of dictionaries.
rows = list(collection.fetch_rows(limit=num_rows))
return rows
----------------------
Conclusion
This entire process, from a simple curiosity to a fully functional application, was an enlightening journey. What began as a deep dive into an intriguing repository example quickly evolved into a hands-on technical challenge. I first had to adapt the original notebook to a completely local environment, a decision that led me to explore and integrate new tools like Ollama and the gpt-oss model for the very first time. This transition wasn't without its hurdles; a tricky AssertionError rooted in a hidden dependency required careful debugging and a deeper understanding of the project's underlying structure. After resolving this issue, I refactored the multi-cell notebook into a single, cohesive Python application, a task that demanded a thorough understanding of the logic and a clean, executable implementation. To ensure reproducibility and shareability, I then compiled a requirements.txt file, explicitly listing every dependency needed for a seamless setup. Ultimately, the successful execution of the final script was a testament to the value of persistence in debugging, the flexibility of open-source tools, and the remarkable potential of Graph RAG to solve real-world problems. The finished product is more than just a piece of code: it's a complete, portable, and highly educational example of the power of structured data retrieval for generative AI.
Thanks for reading.👍
Links
- DataStax GitHub repository: https://github.com/datastax
- Graph-rag repository: https://github.com/datastax/graph-rag
- Ollama/gpt-oss model: https://ollama.com/library/gpt-oss