The Problem: Your Meeting Notes Are Wasted
Every day, organizations hold 62-80 million meetings in the US alone. Those meetings generate decisions, action items, and task assignments—but most of that intelligence dies in Google Docs.
Want to know "Who was in all the budget meetings?" or "What tasks did Alex get assigned this month?" Good luck searching through thousands of Markdown files.
The real killer? Meeting notes are living documents. People fix names, reassign tasks, update decisions. Without incremental processing, you're stuck choosing between:
- 💸 Massive LLM bills from reprocessing everything
- 📉 A stale, outdated knowledge graph
I solved this by building a self-updating Neo4j knowledge graph that only processes changed documents—cutting LLM costs by 99%.
What We're Building
A pipeline that turns messy meeting notes into a queryable graph database:
Google Drive → Detect Changes → Split Meetings → LLM Extract → Neo4j
The result? Three node types (Meeting, Person, Task) and three relationships (ATTENDED, DECIDED, ASSIGNED_TO) that let you query:
- "Which meetings did Sarah attend?"
- "Where was this task decided?"
- "Who owns all Q4 tasks?"
The Secret Sauce: Incremental Processing
Here's what makes this actually work at scale:
1. Only Process What Changed
The Google Drive source tracks last-modified timestamps. When you have 100,000 meeting notes and only 1% change daily, you process 1,000 files—not 100,000.
import datetime
import os

import cocoindex

@cocoindex.flow_def(name="MeetingNotesGraph")
def meeting_notes_graph_flow(
    flow_builder: cocoindex.FlowBuilder,
    data_scope: cocoindex.DataScope,
) -> None:
    credential_path = os.environ["GOOGLE_SERVICE_ACCOUNT_CREDENTIAL"]
    root_folder_ids = os.environ["GOOGLE_DRIVE_ROOT_FOLDER_IDS"].split(",")
    data_scope["documents"] = flow_builder.add_source(
        cocoindex.sources.GoogleDrive(
            service_account_credential_path=credential_path,
            root_folder_ids=root_folder_ids,
            recent_changes_poll_interval=datetime.timedelta(seconds=10),
        ),
        refresh_interval=datetime.timedelta(minutes=1),
    )
Impact: 99% reduction in LLM API costs for typical 1% daily churn.
2. Smart Document Splitting
Meeting files often contain multiple sessions. Split them intelligently:
with data_scope["documents"].row() as document:
    document["meetings"] = document["content"].transform(
        cocoindex.functions.SplitBySeparators(
            separators_regex=[r"\n\n##?\ "],
            keep_separator="RIGHT",
        )
    )
Keeping the header (## Meeting Title) with each section preserves context for the LLM.
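To see what that behaves like, here's a minimal standalone sketch using plain re (an approximation of the same idea, not CocoIndex's implementation), splitting before each header so the header travels with its own section:
import re

note = """# 2024-03-04 Notes

## Budget Review
Sarah presented the Q2 numbers.

## Hiring Sync
Alex will draft the job description.
"""

# Split *before* each "# " / "## " header so the header stays with
# the section that follows it (the keep_separator="RIGHT" behavior).
for section in re.split(r"\n\n(?=##? )", note):
    print("---")
    print(section.strip())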
3. Structured LLM Extraction
Instead of asking for "some JSON," give the LLM a concrete schema:
import datetime
from dataclasses import dataclass

@dataclass
class Person:
    name: str

@dataclass
class Task:
    description: str
    assigned_to: list[Person]

@dataclass
class Meeting:
    time: datetime.date
    note: str
    organizer: Person
    participants: list[Person]
    tasks: list[Task]
Then extract with caching:
with document["meetings"].row() as meeting:
    # Keep a local handle so later collection steps can reference the parsed fields.
    parsed = meeting["parsed"] = meeting["text"].transform(
        cocoindex.functions.ExtractByLlm(
            llm_spec=cocoindex.LlmSpec(
                api_type=cocoindex.LlmApiType.OPENAI,
                model="gpt-4",
            ),
            output_type=Meeting,
        )
    )
The magic: CocoIndex caches extraction results. Same input + same model = reuse cached output. No redundant LLM calls.
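The idea is the same as any content-keyed cache: key the expensive call on the input text plus the model, and reuse the stored result when the key is unchanged. A rough sketch of the concept (illustration only, not CocoIndex's actual implementation, which it manages for you):
import hashlib

_cache: dict[str, Meeting] = {}

def extract_with_cache(text: str, model: str, extract_fn) -> Meeting:
    # Same input + same model => same key => no new LLM call.
    key = hashlib.sha256(f"{model}\n{text}".encode()).hexdigest()
    if key not in _cache:
        _cache[key] = extract_fn(text, model)  # the expensive part
    return _cache[key]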
Building the Graph
Collect Nodes and Relationships
Use collectors to accumulate graph data:
# One collector per node / relationship type.
meeting_nodes = data_scope.add_collector()
attended_rels = data_scope.add_collector()
decided_tasks_rels = data_scope.add_collector()
assigned_rels = data_scope.add_collector()

# Inside the per-meeting block from the previous step:
meeting_key = {"note_file": document["filename"], "time": parsed["time"]}
meeting_nodes.collect(**meeting_key, note=parsed["note"])

# The organizer attended (and organized) the meeting.
attended_rels.collect(
    id=cocoindex.GeneratedField.UUID,
    **meeting_key,
    person=parsed["organizer"]["name"],
    is_organizer=True,
)

# So did every participant.
with parsed["participants"].row() as participant:
    attended_rels.collect(
        id=cocoindex.GeneratedField.UUID,
        **meeting_key,
        person=participant["name"],
    )
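The remaining two collectors follow the same pattern, looping over the extracted tasks. A sketch of what that looks like (field names match the dataclasses above; the full example may differ in details):
with parsed["tasks"].row() as task:
    # Meeting --DECIDED--> Task
    decided_tasks_rels.collect(
        id=cocoindex.GeneratedField.UUID,
        **meeting_key,
        description=task["description"],
    )
    # Person --ASSIGNED_TO--> Task
    with task["assigned_to"].row() as assignee:
        assigned_rels.collect(
            id=cocoindex.GeneratedField.UUID,
            description=task["description"],
            person=assignee["name"],
        )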
Export to Neo4j with Upsert Logic
Map meetings to Neo4j nodes:
meeting_nodes.export(
    "meeting_nodes",
    cocoindex.targets.Neo4j(
        connection=conn_spec,
        mapping=cocoindex.targets.Nodes(label="Meeting"),
    ),
    primary_key_fields=["note_file", "time"],
)
The primary_key_fields setting turns the export into an upsert: when a meeting note changes, the existing node is updated in place instead of a duplicate being created.
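That behavior is roughly equivalent to this Cypher (illustration only; CocoIndex generates its own statements):
// Upsert keyed on the composite primary key
MERGE (m:Meeting {note_file: $note_file, time: $time})
SET m.note = $note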
Declare Person and Task nodes:
flow_builder.declare(
    cocoindex.targets.Neo4jDeclaration(
        connection=conn_spec,
        nodes_label="Person",
        primary_key_fields=["name"],
    )
)
flow_builder.declare(
    cocoindex.targets.Neo4jDeclaration(
        connection=conn_spec,
        nodes_label="Task",
        primary_key_fields=["description"],
    )
)
Export relationships:
attended_rels.export(
    "attended_rels",
    cocoindex.targets.Neo4j(
        connection=conn_spec,
        mapping=cocoindex.targets.Relationships(
            rel_type="ATTENDED",
            source=cocoindex.targets.NodeFromFields(
                label="Person",
                fields=[cocoindex.targets.TargetFieldMapping(
                    source="person", target="name"
                )],
            ),
            target=cocoindex.targets.NodeFromFields(
                label="Meeting",
                fields=[
                    cocoindex.targets.TargetFieldMapping("note_file"),
                    cocoindex.targets.TargetFieldMapping("time"),
                ],
            ),
        ),
    ),
    primary_key_fields=["id"],
)
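DECIDED and ASSIGNED_TO are exported the same way, only with different source/target mappings. For example, DECIDED runs from Meeting to Task (a sketch following the same pattern as above):
decided_tasks_rels.export(
    "decided_tasks_rels",
    cocoindex.targets.Neo4j(
        connection=conn_spec,
        mapping=cocoindex.targets.Relationships(
            rel_type="DECIDED",
            source=cocoindex.targets.NodeFromFields(
                label="Meeting",
                fields=[
                    cocoindex.targets.TargetFieldMapping("note_file"),
                    cocoindex.targets.TargetFieldMapping("time"),
                ],
            ),
            target=cocoindex.targets.NodeFromFields(
                label="Task",
                fields=[cocoindex.targets.TargetFieldMapping("description")],
            ),
        ),
    ),
    primary_key_fields=["id"],
)
assigned_rels maps Person to Task the same way, using the person and description fields collected earlier.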
Running the Pipeline
Setup:
export OPENAI_API_KEY=sk-...
export GOOGLE_SERVICE_ACCOUNT_CREDENTIAL=/path/to/service_account.json
export GOOGLE_DRIVE_ROOT_FOLDER_IDS=folderId1,folderId2
pip install cocoindex
Build the graph:
cocoindex update main
Query in Neo4j Browser (http://localhost:7474):
// Who attended which meetings?
MATCH (p:Person)-[:ATTENDED]->(m:Meeting)
RETURN p, m
// Tasks decided in meetings
MATCH (m:Meeting)-[:DECIDED]->(t:Task)
RETURN m, t
// Task assignments by person
MATCH (p:Person)-[:ASSIGNED_TO]->(t:Task)
RETURN p, t
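And the kind of cross-cutting question that motivates the graph in the first place (the "budget" filter is just an example value):
// Who attended meetings where a budget-related task was decided?
MATCH (p:Person)-[:ATTENDED]->(m:Meeting)-[:DECIDED]->(t:Task)
WHERE t.description CONTAINS "budget"
RETURN p.name, m.note, t.description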
Why This Matters
1. Cost Savings at Scale
In an enterprise with 1% daily document churn:
- Traditional approach: Reprocess 100,000 docs = 100,000 LLM calls
- Incremental approach: Process 1,000 changed docs = 1,000 LLM calls
99% cost reduction.
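The arithmetic is straightforward; plug in your own per-extraction price (the $0.0033 below is a made-up illustrative number):
docs = 100_000
daily_churn = 0.01           # 1% of notes change per day
cost_per_call = 0.0033       # hypothetical $ per LLM extraction
days = 30

full_monthly = docs * days * cost_per_call                       # reprocess everything daily
incremental_monthly = docs * daily_churn * days * cost_per_call  # only changed docs
print(f"full: ${full_monthly:,.0f}/mo  incremental: ${incremental_monthly:,.0f}/mo")
# full: $9,900/mo  incremental: $99/mo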
2. Real-Time Updates
Switch to live mode and the graph updates automatically when meeting notes change:
refresh_interval=datetime.timedelta(minutes=1)
3. Data Lineage
CocoIndex tracks every transformation. You can trace any Neo4j node back through LLM extraction to the source document.
Beyond Meeting Notes
This pattern works for any text-heavy domain:
- 📄 Research papers - Concepts, citations, authors
- 🎫 Support tickets - Issues, solutions, customers
- 📧 Email threads - Communication patterns, decisions
- 📋 Compliance docs - Requirements, policies, audits
The template is always: source → detect changes → split → extract → collect → export.
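For example, the support-ticket variant only swaps the schema and labels; the flow skeleton is unchanged (hypothetical Ticket/Customer dataclasses, same CocoIndex calls shown earlier):
@dataclass
class Customer:
    name: str

@dataclass
class Ticket:
    summary: str
    customer: Customer
    resolution: str

# Same skeleton: source -> detect changes -> split -> extract -> collect -> export.
with data_scope["documents"].row() as document:
    document["parsed"] = document["content"].transform(
        cocoindex.functions.ExtractByLlm(
            llm_spec=cocoindex.LlmSpec(
                api_type=cocoindex.LlmApiType.OPENAI, model="gpt-4"
            ),
            output_type=Ticket,
        )
    )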
The Tech Stack
- CocoIndex - Incremental processing framework (Rust + Python)
- OpenAI GPT-4 - Structured extraction
- Neo4j - Graph database
- Google Drive API - Document source
Try It Yourself
Full source code: CocoIndex Meeting Notes Graph Example
Prerequisites:
- Neo4j running locally (user: neo4j, password: cocoindex)
- OpenAI API key
- Google Cloud service account with Drive access
Key Takeaways
Incremental processing isn't optional at scale: it's the difference between $100/month and $10,000/month in LLM costs.
Structured schemas > free-form prompts: dataclasses give LLMs clear targets and make downstream processing trivial.
Caching is critical: don't recompute expensive LLM calls when inputs haven't changed.
Knowledge graphs unlock new queries: "Who attended meetings where we decided X?" is impossible with full-text search.
The pattern is reusable: source → detect → split → extract → collect → export works for any text-heavy domain.
What Would You Build?
This example processes meeting notes, but the same incremental graph pipeline could extract:
- Customer relationships from support tickets
- Citation networks from research papers
- Compliance chains from policy documents
What text-heavy problem would you solve with a self-updating knowledge graph? Drop a comment below.
⭐ If you found this useful, star CocoIndex on GitHub to support the project!