The Problem: Your Meeting Notes Are Wasted
Every day, organizations hold 62-80 million meetings in the US alone. Those meetings generate decisions, action items, and task assignments—but most of that intelligence dies in Google Docs.
Want to know "Who was in all the budget meetings?" or "What tasks did Alex get assigned this month?" Good luck searching through thousands of Markdown files.
The real killer? Meeting notes are living documents. People fix names, reassign tasks, update decisions. Without incremental processing, you're stuck choosing between:
- 💸 Massive LLM bills from reprocessing everything
- 📉 A stale, outdated knowledge graph
I solved this by building a self-updating Neo4j knowledge graph that only processes changed documents—cutting LLM costs by 99%.
What We're Building
A pipeline that turns messy meeting notes into a queryable graph database:
Google Drive → Detect Changes → Split Meetings → LLM Extract → Neo4j
The result? Three node types (Meeting, Person, Task) and three relationships (ATTENDED, DECIDED, ASSIGNED_TO) that let you query:
- "Which meetings did Sarah attend?"
- "Where was this task decided?"
- "Who owns all Q4 tasks?"
The Secret Sauce: Incremental Processing
Here's what makes this actually work at scale:
1. Only Process What Changed
The Google Drive source tracks last-modified timestamps. When you have 100,000 meeting notes and only 1% change daily, you process 1,000 files—not 100,000.
import datetime
import os

import cocoindex

@cocoindex.flow_def(name="MeetingNotesGraph")
def meeting_notes_graph_flow(
    flow_builder: cocoindex.FlowBuilder,
    data_scope: cocoindex.DataScope,
) -> None:
    credential_path = os.environ["GOOGLE_SERVICE_ACCOUNT_CREDENTIAL"]
    root_folder_ids = os.environ["GOOGLE_DRIVE_ROOT_FOLDER_IDS"].split(",")
    data_scope["documents"] = flow_builder.add_source(
        cocoindex.sources.GoogleDrive(
            service_account_credential_path=credential_path,
            root_folder_ids=root_folder_ids,
            recent_changes_poll_interval=datetime.timedelta(seconds=10),
        ),
        refresh_interval=datetime.timedelta(minutes=1),
    )
Impact: 99% reduction in LLM API costs for typical 1% daily churn.
2. Smart Document Splitting
Meeting files often contain multiple sessions. Split them intelligently:
with data_scope["documents"].row() as document:
    document["meetings"] = document["content"].transform(
        cocoindex.functions.SplitBySeparators(
            separators_regex=[r"\n\n##?\ "],
            keep_separator="RIGHT",
        )
    )
Keeping the header (## Meeting Title) with each section preserves context for the LLM.
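To see what that behaves like, here's a minimal standalone sketch using plain re (an approximation of the same idea, not CocoIndex's implementation), splitting before each header so the header travels with its own section:
import re

note = """# 2024-03-04 Notes

## Budget Review
Sarah presented the Q2 numbers.

## Hiring Sync
Alex will draft the job description.
"""

# Split *before* each "# " / "## " header so the header stays with
# the section that follows it (the keep_separator="RIGHT" behavior).
for section in re.split(r"\n\n(?=##? )", note):
    print("---")
    print(section.strip())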
3. Structured LLM Extraction
Instead of asking for "some JSON," give the LLM a concrete schema:
import datetime
from dataclasses import dataclass

@dataclass
class Person:
    name: str

@dataclass
class Task:
    description: str
    assigned_to: list[Person]

@dataclass
class Meeting:
    time: datetime.date
    note: str
    organizer: Person
    participants: list[Person]
    tasks: list[Task]
Then extract with caching:
with document["meetings"].row() as meeting:
    # Keep a local handle so later collection steps can reference the parsed fields.
    parsed = meeting["parsed"] = meeting["text"].transform(
        cocoindex.functions.ExtractByLlm(
            llm_spec=cocoindex.LlmSpec(
                api_type=cocoindex.LlmApiType.OPENAI,
                model="gpt-4",
            ),
            output_type=Meeting,
        )
    )
The magic: CocoIndex caches extraction results. Same input + same model = reuse cached output. No redundant LLM calls.
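The idea is the same as any content-keyed cache: key the expensive call on the input text plus the model, and reuse the stored result when the key is unchanged. A rough sketch of the concept (illustration only, not CocoIndex's actual implementation, which it manages for you):
import hashlib

_cache: dict[str, Meeting] = {}

def extract_with_cache(text: str, model: str, extract_fn) -> Meeting:
    # Same input + same model => same key => no new LLM call.
    key = hashlib.sha256(f"{model}\n{text}".encode()).hexdigest()
    if key not in _cache:
        _cache[key] = extract_fn(text, model)  # the expensive part
    return _cache[key]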
Building the Graph
Collect Nodes and Relationships
Use collectors to accumulate graph data:
# One collector per node / relationship type.
meeting_nodes = data_scope.add_collector()
attended_rels = data_scope.add_collector()
decided_tasks_rels = data_scope.add_collector()
assigned_rels = data_scope.add_collector()

# Inside the per-meeting block from the previous step:
meeting_key = {"note_file": document["filename"], "time": parsed["time"]}
meeting_nodes.collect(**meeting_key, note=parsed["note"])

# The organizer attended (and organized) the meeting.
attended_rels.collect(
    id=cocoindex.GeneratedField.UUID,
    **meeting_key,
    person=parsed["organizer"]["name"],
    is_organizer=True,
)

# So did every participant.
with parsed["participants"].row() as participant:
    attended_rels.collect(
        id=cocoindex.GeneratedField.UUID,
        **meeting_key,
        person=participant["name"],
    )
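The remaining two collectors follow the same pattern, looping over the extracted tasks. A sketch of what that looks like (field names match the dataclasses above; the full example may differ in details):
with parsed["tasks"].row() as task:
    # Meeting --DECIDED--> Task
    decided_tasks_rels.collect(
        id=cocoindex.GeneratedField.UUID,
        **meeting_key,
        description=task["description"],
    )
    # Person --ASSIGNED_TO--> Task
    with task["assigned_to"].row() as assignee:
        assigned_rels.collect(
            id=cocoindex.GeneratedField.UUID,
            description=task["description"],
            person=assignee["name"],
        )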
Export to Neo4j with Upsert Logic
Map meetings to Neo4j nodes:
meeting_nodes.export(
    "meeting_nodes",
    cocoindex.targets.Neo4j(
        connection=conn_spec,
        mapping=cocoindex.targets.Nodes(label="Meeting"),
    ),
    primary_key_fields=["note_file", "time"],
)
The primary_key_fields setting turns the export into an upsert: when a meeting note changes, the existing node is updated in place instead of a duplicate being created.
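That behavior is roughly equivalent to this Cypher (illustration only; CocoIndex generates its own statements):
// Upsert keyed on the composite primary key
MERGE (m:Meeting {note_file: $note_file, time: $time})
SET m.note = $note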
Declare Person and Task nodes:
flow_builder.declare(
    cocoindex.targets.Neo4jDeclaration(
        connection=conn_spec,
        nodes_label="Person",
        primary_key_fields=["name"],
    )
)
flow_builder.declare(
    cocoindex.targets.Neo4jDeclaration(
        connection=conn_spec,
        nodes_label="Task",
        primary_key_fields=["description"],
    )
)
Export relationships:
attended_rels.export(
    "attended_rels",
    cocoindex.targets.Neo4j(
        connection=conn_spec,
        mapping=cocoindex.targets.Relationships(
            rel_type="ATTENDED",
            source=cocoindex.targets.NodeFromFields(
                label="Person",
                fields=[cocoindex.targets.TargetFieldMapping(
                    source="person", target="name"
                )],
            ),
            target=cocoindex.targets.NodeFromFields(
                label="Meeting",
                fields=[
                    cocoindex.targets.TargetFieldMapping("note_file"),
                    cocoindex.targets.TargetFieldMapping("time"),
                ],
            ),
        ),
    ),
    primary_key_fields=["id"],
)
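DECIDED and ASSIGNED_TO are exported the same way, only with different source/target mappings. For example, DECIDED runs from Meeting to Task (a sketch following the same pattern as above):
decided_tasks_rels.export(
    "decided_tasks_rels",
    cocoindex.targets.Neo4j(
        connection=conn_spec,
        mapping=cocoindex.targets.Relationships(
            rel_type="DECIDED",
            source=cocoindex.targets.NodeFromFields(
                label="Meeting",
                fields=[
                    cocoindex.targets.TargetFieldMapping("note_file"),
                    cocoindex.targets.TargetFieldMapping("time"),
                ],
            ),
            target=cocoindex.targets.NodeFromFields(
                label="Task",
                fields=[cocoindex.targets.TargetFieldMapping("description")],
            ),
        ),
    ),
    primary_key_fields=["id"],
)
assigned_rels maps Person to Task the same way, using the person and description fields collected earlier.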
Running the Pipeline
Setup:
export OPENAI_API_KEY=sk-...
export GOOGLE_SERVICE_ACCOUNT_CREDENTIAL=/path/to/service_account.json
export GOOGLE_DRIVE_ROOT_FOLDER_IDS=folderId1,folderId2
pip install cocoindex
Build the graph:
cocoindex update main
Query in Neo4j Browser (http://localhost:7474):
// Who attended which meetings?
MATCH (p:Person)-[:ATTENDED]->(m:Meeting)
RETURN p, m
// Tasks decided in meetings
MATCH (m:Meeting)-[:DECIDED]->(t:Task)
RETURN m, t
// Task assignments by person
MATCH (p:Person)-[:ASSIGNED_TO]->(t:Task)
RETURN p, t
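And the kind of cross-cutting question that motivates the graph in the first place (the "budget" filter is just an example value):
// Who attended meetings where a budget-related task was decided?
MATCH (p:Person)-[:ATTENDED]->(m:Meeting)-[:DECIDED]->(t:Task)
WHERE t.description CONTAINS "budget"
RETURN p.name, m.note, t.description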
Why This Matters
1. Cost Savings at Scale
In an enterprise with 1% daily document churn:
- Traditional approach: Reprocess 100,000 docs = 100,000 LLM calls
- Incremental approach: Process 1,000 changed docs = 1,000 LLM calls
99% cost reduction.
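The arithmetic is straightforward; plug in your own per-extraction price (the $0.0033 below is a made-up illustrative number):
docs = 100_000
daily_churn = 0.01           # 1% of notes change per day
cost_per_call = 0.0033       # hypothetical $ per LLM extraction
days = 30

full_monthly = docs * days * cost_per_call                       # reprocess everything daily
incremental_monthly = docs * daily_churn * days * cost_per_call  # only changed docs
print(f"full: ${full_monthly:,.0f}/mo  incremental: ${incremental_monthly:,.0f}/mo")
# full: $9,900/mo  incremental: $99/mo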
2. Real-Time Updates
Switch to live mode and the graph updates automatically when meeting notes change:
refresh_interval=datetime.timedelta(minutes=1)
3. Data Lineage
CocoIndex tracks every transformation. You can trace any Neo4j node back through LLM extraction to the source document.
Beyond Meeting Notes
This pattern works for any text-heavy domain:
- 📄 Research papers - Concepts, citations, authors
- 🎫 Support tickets - Issues, solutions, customers
- 📧 Email threads - Communication patterns, decisions
- 📋 Compliance docs - Requirements, policies, audits
The template is always: source → detect changes → split → extract → collect → export.
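For example, the support-ticket variant only swaps the schema and labels; the flow skeleton is unchanged (hypothetical Ticket/Customer dataclasses, same CocoIndex calls shown earlier):
@dataclass
class Customer:
    name: str

@dataclass
class Ticket:
    summary: str
    customer: Customer
    resolution: str

# Same skeleton: source -> detect changes -> split -> extract -> collect -> export.
with data_scope["documents"].row() as document:
    document["parsed"] = document["content"].transform(
        cocoindex.functions.ExtractByLlm(
            llm_spec=cocoindex.LlmSpec(
                api_type=cocoindex.LlmApiType.OPENAI, model="gpt-4"
            ),
            output_type=Ticket,
        )
    )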
The Tech Stack
- CocoIndex - Incremental processing framework (Rust + Python)
- OpenAI GPT-4 - Structured extraction
- Neo4j - Graph database
- Google Drive API - Document source
Try It Yourself
Full source code: CocoIndex Meeting Notes Graph Example
Prerequisites:
- Neo4j running locally (user: neo4j, password: cocoindex)
- OpenAI API key
- Google Cloud service account with Drive access
Key Takeaways
Incremental processing isn't optional at scale: it's the difference between $100/month and $10,000/month in LLM costs.
Structured schemas > free-form prompts: dataclasses give LLMs clear targets and make downstream processing trivial.
Caching is critical: don't recompute expensive LLM calls when inputs haven't changed.
Knowledge graphs unlock new queries: "Who attended meetings where we decided X?" is impossible with full-text search.
The pattern is reusable: source → detect → split → extract → collect → export works for any text-heavy domain.
What Would You Build?
This example processes meeting notes, but the same incremental graph pipeline could extract:
- Customer relationships from support tickets
- Citation networks from research papers
- Compliance chains from policy documents
What text-heavy problem would you solve with a self-updating knowledge graph? Drop a comment below.
⭐ If you found this useful, star CocoIndex on GitHub to support the project!