Gao Dalie (Ilyass)

LangExtract + Knowledge Graph— Google’s New Library for NLP Tasks

In this story, I give you a super quick tutorial showing how to combine a knowledge graph with LangExtract to build a powerful chatbot for your business or personal use.

In today’s data-driven world, much valuable information is hidden in unstructured text — for example, clinical records, lengthy legal contracts, or user feedback threads. Extracting meaningful and traceable information from these documents has always been a dual challenge, both technically and practically.

On July 30, 2025, Google released LangExtract, an open-source AI library. This tool accurately extracts only the necessary information from the types of text we read every day, such as emails, reports, and medical records, and organises it into a format that is easy for computers to process.

While AI is very useful, it also has weaknesses, such as generating fabricated content (hallucinations), providing imprecise information, retaining only a limited amount of information at one time, and sometimes giving different answers to the same input.

LangExtract was created as a “smart bridge” to compensate for these weaknesses of AI and transform AI’s ability to understand text into the ability to extract reliable information.

So, let me give you a quick demo of a live chatbot to show you what I mean.

Check the video

I will give the chatbot the following input: “Apple Inc. was founded by Steve Jobs and Steve Wozniak in 1976. The company is headquartered in Cupertino, California. Steve Jobs served as CEO until he died in 2011.”

If you take a look at how the agent generates the output, you’ll see that it extracts entities using the document_extractor_tool, which leverages LangExtract with dynamic few-shot learning examples that automatically select an appropriate extraction template based on query keywords. When the system detects keywords like “financial,” “revenue,” or “company,” it applies business-focused examples that properly classify entities as company names, people, locations, and dates rather than generic categories.

The entity extraction process runs in parallel with relationship extraction, where the system identifies connections between entities such as “founded by,” “headquartered in,” and “competes with” relationships by analyzing the contextual information within each document.

Once both entities and relationships are extracted, the build_graph_data function constructs a graph structure, creating nodes for each unique entity and edges for each discovered relationship, with a robust fallback mechanism that ensures connectivity by creating "related_to" edges between all entities when explicit relationships aren't found.

The final visualization layer uses Streamlit Agraph to render an interactive knowledge graph where users can explore the connections between companies, founders, locations, and other business entities. The entire system operates in memory without file operations and provides real-time debugging information showing the number of entities and relationships discovered, ultimately enabling users to query the knowledge graph and receive filtered results based on their specific questions about the technology companies and their interconnections.

What is LangExtract?

LangExtract is Google’s latest publicly available open-source library, and it might finally bring sanity back to developers and data teams.

This tool doesn’t just “use AI to extract information.” It ties each extraction back to the original text. LangExtract acts as a “special mechanism” built on top of an LLM to maximise its capabilities by addressing the challenges AI faces in information extraction, such as hallucination, imprecision, limited context windows, and nondeterminism.

What’s special about LangExtract?

The core strength of LangExtract lies in its “programmatic extraction” capability: it not only identifies the required information precisely but also links each extracted result to the exact character position (offset) in the original text. This traceability allows users to interactively highlight and verify results, significantly improving data reliability.

LangExtract comes with a range of powerful features: it can process long documents with millions of tokens efficiently through chunking, parallel computation, and multi-pass extraction to ensure high recall. It produces structured outputs directly, eliminating the need for traditional RAG workflows such as chunking and embeddings.

It is also compatible with both cloud-based models (like Gemini) and local open-source large models, making it highly adaptable. In addition, it supports custom prompt templates, allowing easy adaptation to different domains.
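
Before building the full app, here is a minimal sketch of a standalone extraction call so you can see this traceability in action. It uses the same lx.extract call and lx.data example classes that appear later in the tutorial; the char_interval attribute I read at the end is an assumption on my part, so check the attribute name in your installed version of the library.

import os
import langextract as lx

# One few-shot example showing the model the shape of an extraction
example = lx.data.ExampleData(
    text="Marie Curie won the Nobel Prize in 1903.",
    extractions=[
        lx.data.Extraction(extraction_class="person", extraction_text="Marie Curie", attributes={"name": "Marie Curie"}),
        lx.data.Extraction(extraction_class="year", extraction_text="1903", attributes={"value": 1903}),
    ],
)

result = lx.extract(
    text_or_documents="Alan Turing published his famous paper in 1936.",
    prompt_description="Extract people and years mentioned in the text.",
    examples=[example],
    api_key=os.getenv("GOOGLE_API_KEY"),
)

for e in result.extractions:
    # char_interval (assumed attribute name) is what ties each extraction
    # back to an exact character offset in the source text
    print(e.extraction_class, e.extraction_text, getattr(e, "char_interval", None))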

Let’s start coding

Let us now explore, step by step, how to create a knowledge graph with the LangExtract chatbot. First, we will install the libraries that support the model by installing the project requirements with pip.

pip install -r requirements.txt
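The requirements file itself isn’t shown in the post, but based on the imports used below, a minimal requirements.txt would look roughly like this (pin versions as you see fit):

langextract
streamlit
streamlit-agraph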

The next step is the usual one: We will import the relevant libraries, the significance of which will become evident as we proceed.

langextract is a Python library that uses LLMs to extract structured information from unstructured text documents based on user-defined instructions.

streamlit_agraph is a custom component for the Streamlit framework, designed specifically for creating interactive graphs.

import os
import textwrap
import langextract as lx
import logging
import streamlit as st
from streamlit_agraph import Config, Edge, Node, agraph
from typing import List, Dict, Any, Optional
import json

Let’s create the function document_extractor_tool, which takes two strings: the unstructured_text and the user_query. The function returns a Python dictionary, making it easy to convert into JSON later. Inside, we first build a clean prompt with textwrap.dedent(...), where we tell the model its role (an expert extractor), the task (pull out relevant info), and the specific query to focus on.

Next, we prepare “few-shot” examples to guide the extractor. Based on the query, we check for keywords: if it’s financial, we provide a company/revenue example; if it’s legal, a contract example; if it’s social or restaurant-related, a feedback example; otherwise, a generic Romeo and Juliet example. These short examples demonstrate how the model should format its extractions and make the expected output structure clear.

Finally, we call lx.extract(...), passing the text, prompt, examples, and an API key stored safely in an environment variable. We log the results for debugging, then normalise the output so each extraction is a plain dictionary with "text", "class", and "attributes".

The function returns a single dictionary containing all extracted data in a clean, structured format, ready to be saved, printed, or sent to another system.

def document_extractor_tool(unstructured_text: str, user_query: str) -> dict:
    """
    Extracts structured information from a given unstructured text based on a user's query.
    """
    prompt = textwrap.dedent(f"""
    You are an expert at extracting specific information from documents.
    Based on the user's query, extract the relevant information from the provided text.
    The user's query is: "{user_query}"
    Provide the output in a structured JSON format.
    """)

    # Dynamic Few-Shot Example Selection
    examples = []
    query_lower = user_query.lower()
    if any(keyword in query_lower for keyword in ["financial", "revenue", "company", "fiscal"]):
        financial_example = lx.data.ExampleData(
            text="In Q1 2023, Innovate Inc. reported a revenue of $15 million.",
            extractions=[
                lx.data.Extraction(
                    extraction_class="company_name",
                    extraction_text="Innovate Inc.",
                    attributes={"name": "Innovate Inc."},
                ),
                lx.data.Extraction(
                    extraction_class="revenue",
                    extraction_text="$15 million",
                    attributes={"value": 15000000, "currency": "USD"},
                ),
                lx.data.Extraction(
                    extraction_class="fiscal_period",
                    extraction_text="Q1 2023",
                    attributes={"period": "Q1 2023"},
                ),
            ]
        )
        examples.append(financial_example)
    elif any(keyword in query_lower for keyword in ["legal", "agreement", "parties", "effective date"]):
        legal_example = lx.data.ExampleData(
            text="This agreement is between John Doe and Jane Smith, effective 2024-01-01.",
            extractions=[
                lx.data.Extraction(
                    extraction_class="party",
                    extraction_text="John Doe",
                    attributes={"name": "John Doe"},
                ),
                lx.data.Extraction(
                    extraction_class="party",
                    extraction_text="Jane Smith",
                    attributes={"name": "Jane Smith"},
                ),
                lx.data.Extraction(
                    extraction_class="effective_date",
                    extraction_text="2024-01-01",
                    attributes={"date": "2024-01-01"},
                ),
            ]
        )
        examples.append(legal_example)
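    # The keywords 菜式 ("dish") and 評價 ("review") let Chinese-language queries match this branch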
    elif any(keyword in query_lower for keyword in ["social", "post", "feedback", "restaurant", "菜式", "評價"]):
        social_media_example = lx.data.ExampleData(
            text="I tried the new 'Taste Lover' restaurant in TST today. The black truffle risotto was amazing, but the Tiramisu was just average.",
            extractions=[
                lx.data.Extraction(
                    extraction_class="restaurant_name",
                    extraction_text="Taste Lover",
                    attributes={"name": "Taste Lover"},
                ),
                lx.data.Extraction(
                    extraction_class="dish",
                    extraction_text="black truffle risotto",
                    attributes={"name": "black truffle risotto", "sentiment": "positive"},
                ),
                lx.data.Extraction(
                    extraction_class="dish",
                    extraction_text="Tiramisu",
                    attributes={"name": "Tiramisu", "sentiment": "neutral"},
                ),
            ]
        )
        examples.append(social_media_example)
    else:
        # Default generic example if no specific keywords match
        generic_example = lx.data.ExampleData(
            text="Juliet looked at Romeo with a sense of longing.",
            extractions=[
                lx.data.Extraction(
                    extraction_class="character", extraction_text="Juliet", attributes={"name": "Juliet"}
                ),
                lx.data.Extraction(
                    extraction_class="character", extraction_text="Romeo", attributes={"name": "Romeo"}
                ),
                lx.data.Extraction(
                    extraction_class="emotion", extraction_text="longing", attributes={"type": "longing"}
                ),
            ]
        )
        examples.append(generic_example)

    logging.info(f"Selected {len(examples)} few-shot example(s).")

    result = lx.extract(
        text_or_documents=unstructured_text,
        prompt_description=prompt,
        examples=examples,
        api_key=os.getenv("GOOGLE_API_KEY")
    )

    logging.info(f"Extraction result: {result}")

    # Convert the result to a JSON-serializable format
    extractions = [
        {"text": e.extraction_text, "class": e.extraction_class, "attributes": e.attributes}
        for e in result.extractions
    ]

    return {
        "extracted_data": extractions
    }
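As a quick sanity check, you can call the tool directly outside Streamlit (with GOOGLE_API_KEY set in your environment). The classes and attributes in the output depend entirely on what the model returns, so the commented result below is only illustrative:

sample_text = "In Q1 2023, Innovate Inc. reported a revenue of $15 million."
result = document_extractor_tool(sample_text, "What revenue did the company report?")
print(json.dumps(result, indent=2))
# Illustrative output (actual classes and attributes vary by model run):
# {
#   "extracted_data": [
#     {"text": "Innovate Inc.", "class": "company_name", "attributes": {"name": "Innovate Inc."}},
#     {"text": "$15 million", "class": "revenue", "attributes": {"value": 15000000, "currency": "USD"}},
#     {"text": "Q1 2023", "class": "fiscal_period", "attributes": {"period": "Q1 2023"}}
#   ]
# }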

Next, we define the function load_gemini_key(), which returns a tuple with two things: the key itself (str) and a flag (bool) that tells you whether the key is available. At the start, it sets key to an empty string and is_key_provided to False.

Then it checks if a file called .streamlit/secrets.toml exists and if it contains "GOOGLE_API_KEY". If yes, it pulls the key from there, shows a green success message in the sidebar saying it’s using the secrets file, and sets the flag to True.

If the key isn’t found in the secrets file, it falls back to asking the user directly. In the sidebar, it shows a password-style text input box where the user can paste their Gemini API key.

If the user enters something, it displays another green success message and sets the flag to True. If they leave it empty, it shows a red error message saying there’s no key.

# Streamlit utility functions
def load_gemini_key() -> tuple[str, bool]:
    """Load the Gemini API key from the Streamlit secrets file or from user input."""
    key = ""
    is_key_provided = False
    secrets_file = os.path.join(".streamlit", "secrets.toml")
    if os.path.exists(secrets_file) and "GOOGLE_API_KEY" in st.secrets.keys():
        key = st.secrets["GOOGLE_API_KEY"]
        st.sidebar.success('Using Gemini Key from secrets.toml')
        is_key_provided = True
    else:
        key = st.sidebar.text_input(
            'Add Gemini API key and press \'Enter\'', type="password")
        if len(key) > 0:
            st.sidebar.success('Using the provided Gemini Key')
            is_key_provided = True
        else:
            st.sidebar.error('No Gemini Key')
    return key, is_key_provided

Next, we create format_output_agraph(output), which takes a dictionary with "nodes" and "edges" and converts each node into a Node object (with id, label, size, and shape) and each edge into an Edge object (with source, target, label, colour, and arrows), returning two lists ready for visualisation.

Then we create display_agraph(nodes, edges), which sets up the graph’s appearance and behaviour with a Config object, controlling width, height, directed layout, physics simulation, hierarchical layout, highlight colour, collapsibility, and which property to use as the node label.

Finally, it calls agraph() with the nodes, edges, and config to render the graph in the Streamlit app, providing a simple pipeline from raw graph data to an interactive, styled visualisation.

def format_output_agraph(output):
    nodes = []
    edges = []
    for node in output["nodes"]:
        nodes.append(
            Node(id=node["id"], label=node["label"], size=8, shape="diamond"))
    for edge in output["edges"]:
        edges.append(Edge(source=edge["source"], label=edge["relation"],
                     target=edge["target"], color="#4CAF50", arrows="to"))
    return nodes, edges

def display_agraph(nodes, edges):
    config = Config(width=950,
                    height=950,
                    directed=True,
                    physics=True,
                    hierarchical=True,
                    nodeHighlightBehavior=False,
                    highlightColor="#F7A7A6",
                    collapsible=False,
                    node={'labelProperty': 'label'},
                    )
    return agraph(nodes=nodes, edges=edges, config=config)

After that, we develop the extract_entities(documents) function, which loops through each document and calls document_extractor_tool with a query to extract financial entities like company names, revenue figures, and fiscal periods, collecting all results into a single list.

Similarly, extract_relationships(documents) processes each document to extract connections and relationships between these entities, such as revenue links between companies and fiscal periods, again aggregating all results into a list.

Together, they convert raw text documents into structured entity and relationship data that can later be used to build a graph or knowledge network.

# Core GraphRAG functions
def extract_entities(documents: List[str]) -> List[Dict[str, Any]]:
    """Extract entities from documents"""
    all_entities = []

    for doc in documents:
        result = document_extractor_tool(
            doc, 
            "Extract financial entities including company names, revenue figures, and fiscal periods from business documents"
        )
        all_entities.extend(result["extracted_data"])

    return all_entities

def extract_relationships(documents: List[str]) -> List[Dict[str, Any]]:
    """Extract relationships between entities"""
    all_relationships = []

    for doc in documents:
        result = document_extractor_tool(
            doc,
            "Extract financial relationships and revenue connections between companies and fiscal periods"
        )
        all_relationships.extend(result["extracted_data"])

    return all_relationships
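Running extract_entities over one of the predefined documents returns a flat list of entity dictionaries. The exact classes the model chooses will vary from run to run, but the shape looks roughly like this:

docs = ["Apple Inc. was founded by Steve Jobs and Steve Wozniak in 1976."]
entities = extract_entities(docs)
# Illustrative result (classes and attributes vary by model run):
# [
#   {"text": "Apple Inc.", "class": "company_name", "attributes": {"name": "Apple Inc."}},
#   {"text": "Steve Jobs", "class": "person", "attributes": {"name": "Steve Jobs"}},
#   {"text": "Steve Wozniak", "class": "person", "attributes": {"name": "Steve Wozniak"}},
#   {"text": "1976", "class": "date", "attributes": {"year": 1976}}
# ]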

Next, build_graph_data(entities, relationships) first converts each entity into a graph node, assigning it a unique ID, label, and type, while storing a mapping from entity text to node ID.

It then processes relationships: for each relationship, it searches for mentioned entities and creates edges connecting them with the relationship type as the label. If no explicit relationships are found, it falls back to generating simple co-occurrence edges between all entities to ensure the graph is connected.

Then the answer_query(entities, relationships, query) function lets you search the extracted data. It splits the query into words and finds entities whose text or attributes match any of those words, doing the same for relationships. It returns a dictionary containing the query, lists of relevant entities and relationships, and counts of each.

def build_graph_data(entities: List[Dict[str, Any]], relationships: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Build graph data for visualization"""
    nodes = []
    edges = []

    # Create nodes from entities
    entity_map = {}
    for i, entity in enumerate(entities):
        node_id = str(i)
        nodes.append({
            "id": node_id,
            "label": entity["text"],
            "type": entity["class"]
        })
        entity_map[entity["text"].lower()] = node_id

    # Create edges from relationships and simple co-occurrence
    for rel in relationships:
        rel_text = rel["text"].lower()
        found_entities = []

        # Find entities mentioned in this relationship
        for entity_text, entity_id in entity_map.items():
            if entity_text in rel_text:
                found_entities.append(entity_id)

        # Create edges between found entities
        for i in range(len(found_entities)):
            for j in range(i + 1, len(found_entities)):
                edges.append({
                    "source": found_entities[i],
                    "target": found_entities[j],
                    "relation": rel["class"]
                })

    # If no relationships found, create simple co-occurrence edges
    if not edges:
        st.write("No relationship edges found, creating fallback edges...")
        for i, entity1 in enumerate(entities):
            for j, entity2 in enumerate(entities):
                if i < j:
                    # Create edges between all entities
                    edges.append({
                        "source": str(i),
                        "target": str(j),
                        "relation": "related_to"
                    })

    return {"nodes": nodes, "edges": edges}

def answer_query(entities: List[Dict[str, Any]], relationships: List[Dict[str, Any]], query: str) -> Dict[str, Any]:
    """Answer query using extracted entities and relationships"""
    if not query:
        return None

    # Find relevant entities
    relevant_entities = [
        e for e in entities 
        if any(word.lower() in e["text"].lower() or word.lower() in str(e["attributes"]).lower() 
               for word in query.split())
    ]

    # Find relevant relationships
    relevant_relationships = [
        r for r in relationships
        if any(word.lower() in r["text"].lower() or word.lower() in str(r["attributes"]).lower()
               for word in query.split())
    ]

    return {
        "query": query,
        "relevant_entities": relevant_entities,
        "relevant_relationships": relevant_relationships,
        "entity_count": len(relevant_entities),
        "relationship_count": len(relevant_relationships)
    }
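To see the fallback behaviour concretely, here is a tiny worked example with two hand-written entities and no extracted relationships; build_graph_data connects them with a "related_to" edge, and answer_query then filters by the words in the query. The entity values are made up for illustration:

entities = [
    {"text": "Apple Inc.", "class": "company_name", "attributes": {}},
    {"text": "Cupertino", "class": "location", "attributes": {}},
]
relationships = []  # nothing extracted, so the co-occurrence fallback kicks in

graph = build_graph_data(entities, relationships)
# graph["nodes"] -> two nodes labelled "Apple Inc." and "Cupertino"
# graph["edges"] -> [{"source": "0", "target": "1", "relation": "related_to"}]

answer = answer_query(entities, relationships, "Where is Apple based?")
# answer["relevant_entities"] contains "Apple Inc." because the word "apple" appears in its text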

Then we create the process_documents function, the main pipeline that ties everything together. It takes a list of text documents and an optional query. First, it calls extract_entities and extract_relationships to pull structured financial entities and their connections from the documents, then prints debug info showing how many entities and relationships were found. Next, it passes these to build_graph_data, which creates nodes and edges for visualization, and prints debug info about the graph size.

Finally, if a query is provided, it calls answer_query to find relevant entities and relationships matching the query. The function returns a dictionary containing all extracted entities, relationships, the graph data, and any query results, giving a complete structured view of the documents and making it easy to visualise or analyse further.

def process_documents(documents: List[str], query: str = None) -> Dict[str, Any]:
    """Process documents and optionally answer a query"""
    # Extract entities and relationships
    entities = extract_entities(documents)
    relationships = extract_relationships(documents)

    # Debug info
    st.write(f"Debug: Found {len(entities)} entities, {len(relationships)} relationships")

    # Build graph data
    graph_data = build_graph_data(entities, relationships)

    # Debug graph data
    st.write(f"Debug: Graph has {len(graph_data['nodes'])} nodes, {len(graph_data['edges'])} edges")

    # Answer query if provided
    results = answer_query(entities, relationships, query) if query else None

    return {
        "entities": entities,
        "relationships": relationships,
        "graph_data": graph_data,
        "results": results
    }

Finally, in main(), we set the page title and layout, then display a header. Next, it loads the Gemini API key using load_gemini_key(); if no key is provided, it warns the user and stops execution. If a key is available, it sets it as an environment variable so the extractor functions can use it.

The app uses a set of predefined documents about tech companies and displays a success message indicating how many documents will be processed. Users can optionally enter a query in a text input. When the “Process Documents” button is clicked, process_documents is called with the documents and the optional query. This returns entities, relationships, graph data, and query results.

The results are displayed in four tabs: Graph Visualisation, Entities, Relationships, and Query Results. In the graph tab, format_output_agraph and display_agraph render an interactive knowledge graph. The entities and relationships tabs show extracted items with expandable JSON details for each. The query tab displays relevant results if a query was provided. Altogether, this function ties the full pipeline into an interactive, user-friendly Streamlit interface.

# Streamlit app
def main():
    st.set_page_config(page_title="GraphRAG with LangExtract", layout="wide")
    st.title("GraphRAG with LangExtract")

    # Load API key
    api_key, is_key_provided = load_gemini_key()

    if not is_key_provided:
        st.warning("Please provide an API key to continue")
        return

    # Set environment variable
    os.environ["GOOGLE_API_KEY"] = api_key

    # Predefined documents
    documents = [
        "Apple Inc. was founded by Steve Jobs and Steve Wozniak in 1976. The company is headquartered in Cupertino, California. Steve Jobs served as CEO until his death in 2011.",
        "Microsoft Corporation was founded by Bill Gates and Paul Allen in 1975. It's based in Redmond, Washington. Bill Gates was the CEO for many years.",
        "Both Apple and Microsoft are major technology companies that compete in various markets including operating systems and productivity software. They have a long history of rivalry.",
        "Google was founded by Larry Page and Sergey Brin in 1998. The company started as a search engine but has expanded into many areas including cloud computing and artificial intelligence."
    ]

    st.success(f"Using {len(documents)} predefined documents about tech companies")

    # Query input
    query = st.text_input("Enter your query (optional):")

    if st.button("Process Documents"):
        with st.spinner("Processing documents..."):
            result = process_documents(documents, query if query else None)

            # Display results in tabs
            tab1, tab2, tab3, tab4 = st.tabs(["Graph Visualization", "Entities", "Relationships", "Query Results"])

            with tab1:
                if result["graph_data"]:
                    st.subheader("Knowledge Graph")

                    nodes, edges = format_output_agraph(result["graph_data"])
                    if nodes:
                        display_agraph(nodes, edges)
                    else:
                        st.info("No graph data to display")

            with tab2:
                st.subheader("Extracted Entities")
                if result["entities"]:
                    for i, entity in enumerate(result["entities"]):
                        with st.expander(f"{entity['text']} ({entity['class']})"):
                            st.json(entity["attributes"])
                else:
                    st.info("No entities extracted")

            with tab3:
                st.subheader("Extracted Relationships")
                if result["relationships"]:
                    for i, rel in enumerate(result["relationships"]):
                        with st.expander(f"{rel['text']} ({rel['class']})"):
                            st.json(rel["attributes"])
                else:
                    st.info("No relationships extracted")

            with tab4:
                if query and result["results"]:
                    st.subheader("Query Results")
                    st.json(result["results"])
                else:
                    st.info("No query provided or no results")

if __name__ == "__main__":
    main()

Conclusion

LangExtract alone cannot solve everything, but new AI tools need to keep being developed and released. Using various AI tools together highlights the problems with each tool, which leads to further improvements. AI has made remarkable progress in recent years, and behind this progress is feedback from many people. There is no failure in using AI; it might be a good idea to try these tools out first and build with them ourselves.

I would highly appreciate it if you:

❣ Join my Patreon: https://www.patreon.com/GaoDalie_AI

Book an Appointment with me: https://topmate.io/gaodalie_ai
Support the Content (every dollar goes back into the video): https://buymeacoffee.com/gaodalie98d
Subscribe to the Newsletter for free: https://substack.com/@gaodalie
