Linghua Jin

How I Built a Self-Updating Neo4j Knowledge Graph from Meeting Notes (That Saves 99% on LLM Costs)

The Problem: Your Meeting Notes Are Wasted

By some estimates, 62-80 million meetings take place in the US every day. Those meetings generate decisions, action items, and task assignments, but most of that intelligence dies in Google Docs.

Want to know "Who was in all the budget meetings?" or "What tasks did Alex get assigned this month?" Good luck searching through thousands of Markdown files.

The real killer? Meeting notes are living documents. People fix names, reassign tasks, update decisions. Without incremental processing, you're stuck choosing between:

  • 💸 Massive LLM bills from reprocessing everything
  • 📉 A stale, outdated knowledge graph

I solved this by building a self-updating Neo4j knowledge graph that only processes changed documents—cutting LLM costs by 99%.

What We're Building

A pipeline that turns messy meeting notes into a queryable graph database:

Google Drive → Detect Changes → Split Meetings → LLM Extract → Neo4j

The result? Three node types (Meeting, Person, Task) and three relationships (ATTENDED, DECIDED, ASSIGNED_TO) that let you query:

  • "Which meetings did Sarah attend?"
  • "Where was this task decided?"
  • "Who owns all Q4 tasks?"

The Secret Sauce: Incremental Processing

Here's what makes this actually work at scale:

1. Only Process What Changed

The Google Drive source tracks last-modified timestamps. When you have 100,000 meeting notes and only 1% change daily, you process 1,000 files—not 100,000.

import datetime
import os

import cocoindex

@cocoindex.flow_def(name="MeetingNotesGraph")
def meeting_notes_graph_flow(
    flow_builder: cocoindex.FlowBuilder,
    data_scope: cocoindex.DataScope
) -> None:
    credential_path = os.environ["GOOGLE_SERVICE_ACCOUNT_CREDENTIAL"]
    root_folder_ids = os.environ["GOOGLE_DRIVE_ROOT_FOLDER_IDS"].split(",")

    data_scope["documents"] = flow_builder.add_source(
        cocoindex.sources.GoogleDrive(
            service_account_credential_path=credential_path,
            root_folder_ids=root_folder_ids,
            recent_changes_poll_interval=datetime.timedelta(seconds=10),
        ),
        refresh_interval=datetime.timedelta(minutes=1),
    )

Impact: 99% reduction in LLM API costs for typical 1% daily churn.
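The change-detection idea itself is simple to sketch. Here's a minimal, framework-free illustration (not CocoIndex's actual implementation): compare each file's current modified time against what was recorded on the previous poll, and only keep the files that are new or newer.

```python
import datetime

def detect_changed(
    current: dict[str, datetime.datetime],
    last_seen: dict[str, datetime.datetime],
) -> list[str]:
    """Return IDs of files that are new or modified since the previous poll."""
    return [
        file_id
        for file_id, mtime in current.items()
        if file_id not in last_seen or mtime > last_seen[file_id]
    ]

# Only "notes-2.md" changed since the last poll, so only it gets reprocessed.
t0 = datetime.datetime(2024, 1, 1)
t1 = datetime.datetime(2024, 1, 2)
changed = detect_changed(
    {"notes-1.md": t0, "notes-2.md": t1},
    {"notes-1.md": t0, "notes-2.md": t0},
)
# changed == ["notes-2.md"]
```

At scale the same comparison runs against the Drive API's reported modified times instead of an in-memory dict.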

2. Smart Document Splitting

Meeting files often contain multiple sessions. Split them intelligently:

with data_scope["documents"].row() as document:
    document["meetings"] = document["content"].transform(
        cocoindex.functions.SplitBySeparators(
            separators_regex=[r"\n\n##?\ "],
            keep_separator="RIGHT",
        )
    )

Keeping the header (## Meeting Title) with each section preserves context for the LLM.
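If you're curious what keep_separator="RIGHT" means in practice, here's a rough plain-Python equivalent (an illustrative sketch, not the library's code): split on the regex, then glue each matched header onto the chunk that follows it.

```python
import re

def split_keep_right(text: str, pattern: str) -> list[str]:
    """Split at each regex match, attaching the separator to the chunk on its right."""
    pieces = re.split(f"({pattern})", text)  # capturing group keeps the separators
    chunks = [pieces[0]] if pieces[0].strip() else []
    for sep, body in zip(pieces[1::2], pieces[2::2]):
        chunks.append(sep + body)  # header stays with its own section
    return chunks

notes = "Preamble\n\n## Standup\nAlice: demo done\n\n## Planning\nScope Q4 tasks"
sections = split_keep_right(notes, r"\n\n##? ")
# sections[1] starts with "## Standup", so the LLM sees each title with its body
```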

3. Structured LLM Extraction

Instead of asking for "some JSON," give the LLM a concrete schema:

from dataclasses import dataclass
import datetime

@dataclass
class Person:
    name: str

@dataclass
class Task:
    description: str
    assigned_to: list[Person]

@dataclass
class Meeting:
    time: datetime.date
    note: str
    organizer: Person
    participants: list[Person]
    tasks: list[Task]

Then extract with caching:

with document["meetings"].row() as meeting:
    parsed = meeting["parsed"] = meeting["text"].transform(
        cocoindex.functions.ExtractByLlm(
            llm_spec=cocoindex.LlmSpec(
                api_type=cocoindex.LlmApiType.OPENAI,
                model="gpt-4",
            ),
            output_type=Meeting,
        )
    )

The magic: CocoIndex caches extraction results. Same input + same model = reuse cached output. No redundant LLM calls.
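Content-addressed caching of this kind is easy to picture. A toy sketch of the idea, assuming an in-memory dict standing in for persistent cache storage and fake_llm standing in for the real API call:

```python
import hashlib

_cache: dict[str, object] = {}  # stands in for persistent cache storage

def extract_cached(text: str, model: str, llm_call):
    """Call the LLM only when this exact (text, model) pair hasn't been seen."""
    key = hashlib.sha256(f"{model}\x00{text}".encode()).hexdigest()
    if key not in _cache:
        _cache[key] = llm_call(text)  # cache miss: pay for one real call
    return _cache[key]

calls = 0
def fake_llm(text):
    global calls
    calls += 1
    return {"note": text.upper()}

first = extract_cached("standup notes", "gpt-4", fake_llm)
second = extract_cached("standup notes", "gpt-4", fake_llm)
# calls == 1: the second extraction is served from cache
```

Keying on both the input text and the model identifier means a model upgrade correctly invalidates old results.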

Building the Graph

Collect Nodes and Relationships

Use collectors to accumulate graph data:

meeting_nodes = data_scope.add_collector()
attended_rels = data_scope.add_collector()
decided_tasks_rels = data_scope.add_collector()
assigned_rels = data_scope.add_collector()

meeting_key = {"note_file": document["filename"], "time": parsed["time"]}

meeting_nodes.collect(**meeting_key, note=parsed["note"])
attended_rels.collect(
    id=cocoindex.GeneratedField.UUID,
    **meeting_key,
    person=parsed["organizer"]["name"],
    is_organizer=True,
)

with parsed["participants"].row() as participant:
    attended_rels.collect(
        id=cocoindex.GeneratedField.UUID,
        **meeting_key,
        person=participant["name"],
    )

Export to Neo4j with Upsert Logic

Map meetings to Neo4j nodes:

meeting_nodes.export(
    "meeting_nodes",
    cocoindex.targets.Neo4j(
        connection=conn_spec,
        mapping=cocoindex.targets.Nodes(label="Meeting"),
    ),
    primary_key_fields=["note_file", "time"],
)

Setting primary_key_fields makes the export an upsert: updates modify existing nodes instead of creating duplicates.
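Conceptually, the upsert behaves like Neo4j's MERGE keyed on those fields. A dictionary-based sketch of the semantics (illustrative only):

```python
def upsert(store: dict, primary_key_fields: list[str], row: dict) -> None:
    """Create the node if the key is new; otherwise update its properties in place."""
    key = tuple(row[f] for f in primary_key_fields)
    store[key] = {**store.get(key, {}), **row}

nodes: dict = {}
upsert(nodes, ["note_file", "time"],
       {"note_file": "q4.md", "time": "2024-10-01", "note": "draft"})
upsert(nodes, ["note_file", "time"],
       {"note_file": "q4.md", "time": "2024-10-01", "note": "final"})
# len(nodes) == 1: the second write updated the node instead of duplicating it
```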

Declare Person and Task nodes:

flow_builder.declare(
    cocoindex.targets.Neo4jDeclaration(
        connection=conn_spec,
        nodes_label="Person",
        primary_key_fields=["name"],
    )
)

flow_builder.declare(
    cocoindex.targets.Neo4jDeclaration(
        connection=conn_spec,
        nodes_label="Task",
        primary_key_fields=["description"],
    )
)

Export relationships:

attended_rels.export(
    "attended_rels",
    cocoindex.targets.Neo4j(
        connection=conn_spec,
        mapping=cocoindex.targets.Relationships(
            rel_type="ATTENDED",
            source=cocoindex.targets.NodeFromFields(
                label="Person",
                fields=[cocoindex.targets.TargetFieldMapping(
                    source="person", target="name"
                )],
            ),
            target=cocoindex.targets.NodeFromFields(
                label="Meeting",
                fields=[
                    cocoindex.targets.TargetFieldMapping("note_file"),
                    cocoindex.targets.TargetFieldMapping("time"),
                ],
            ),
        ),
    ),
    primary_key_fields=["id"],
)

Running the Pipeline

Setup:

export OPENAI_API_KEY=sk-...
export GOOGLE_SERVICE_ACCOUNT_CREDENTIAL=/path/to/service_account.json
export GOOGLE_DRIVE_ROOT_FOLDER_IDS=folderId1,folderId2

pip install cocoindex

Build the graph:

cocoindex update main

Query in Neo4j Browser (http://localhost:7474):

// Who attended which meetings?
MATCH (p:Person)-[:ATTENDED]->(m:Meeting)
RETURN p, m

// Tasks decided in meetings
MATCH (m:Meeting)-[:DECIDED]->(t:Task)
RETURN m, t

// Task assignments by person
MATCH (p:Person)-[:ASSIGNED_TO]->(t:Task)
RETURN p, t

Why This Matters

1. Cost Savings at Scale

In an enterprise with 1% daily document churn:

  • Traditional approach: Reprocess 100,000 docs = 100,000 LLM calls
  • Incremental approach: Process 1,000 changed docs = 1,000 LLM calls

99% cost reduction.
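The arithmetic is straightforward. With the hypothetical numbers above (100,000 docs, 1% daily churn), the ratio holds regardless of the per-call price:

```python
def monthly_calls(total_docs: int, daily_churn: float,
                  incremental: bool, days: int = 30) -> int:
    """LLM calls per month: all docs daily, or only the changed fraction."""
    per_day = total_docs * daily_churn if incremental else total_docs
    return int(per_day * days)

full = monthly_calls(100_000, 0.01, incremental=False)  # reprocess everything
inc = monthly_calls(100_000, 0.01, incremental=True)    # changed docs only
# inc / full == 0.01, i.e. a 99% reduction
```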

2. Real-Time Updates

Switch to live mode and the graph updates automatically when meeting notes change:

refresh_interval=datetime.timedelta(minutes=1)

3. Data Lineage

CocoIndex tracks every transformation. You can trace any Neo4j node back through LLM extraction to the source document.

Beyond Meeting Notes

This pattern works for any text-heavy domain:

  • 📄 Research papers - Concepts, citations, authors
  • 🎫 Support tickets - Issues, solutions, customers
  • 📧 Email threads - Communication patterns, decisions
  • 📋 Compliance docs - Requirements, policies, audits

The template is always: source → detect changes → split → extract → collect → export.
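Stripped of any framework, that template is just a composition of five stage functions. A minimal sketch (all stage names here are illustrative):

```python
def run_pipeline(fetch_changed, split, extract, collect, export) -> list:
    """source -> detect changes -> split -> extract -> collect -> export."""
    rows = []
    for doc in fetch_changed():      # only documents that changed
        for chunk in split(doc):     # one logical unit per LLM call
            rows.append(collect(extract(chunk)))
    export(rows)
    return rows

# Toy run: two changed docs, naive paragraph splitting, uppercase "extraction".
exported = []
rows = run_pipeline(
    fetch_changed=lambda: ["a\n\nb", "c"],
    split=lambda doc: doc.split("\n\n"),
    extract=str.upper,
    collect=lambda item: {"value": item},
    export=exported.extend,
)
# rows == [{"value": "A"}, {"value": "B"}, {"value": "C"}]
```

Swap in a different source, splitter, or schema and the shape stays the same.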

The Tech Stack

  • CocoIndex - Incremental processing framework (Rust + Python)
  • OpenAI GPT-4 - Structured extraction
  • Neo4j - Graph database
  • Google Drive API - Document source

Try It Yourself

Full source code: CocoIndex Meeting Notes Graph Example

Prerequisites:

  • Neo4j running locally (user: neo4j, password: cocoindex)
  • OpenAI API key
  • Google Cloud service account with Drive access

Key Takeaways

  1. Incremental processing isn't optional at scale - It's the difference between $100/month and $10,000/month in LLM costs.

  2. Structured schemas > free-form prompts - Dataclasses give LLMs clear targets and make downstream processing trivial.

  3. Caching is critical - Don't recompute expensive LLM calls when inputs haven't changed.

  4. Knowledge graphs unlock new queries - "Who attended meetings where we decided X?" is impossible with full-text search.

  5. The pattern is reusable - Source → detect → split → extract → collect → export works for any text-heavy domain.

What Would You Build?

This example processes meeting notes, but the same incremental graph pipeline could extract:

  • Customer relationships from support tickets
  • Citation networks from research papers
  • Compliance chains from policy documents

What text-heavy problem would you solve with a self-updating knowledge graph? Drop a comment below.


⭐ If you found this useful, star CocoIndex on GitHub to support the project!
