Every engineering team I've worked with has the same problem: meetings happen, people say things, and three sprints later nobody can agree on who promised what. Meeting notes are scattered. Slack threads are buried. Jira tickets exist but lack context. You spend the first 10 minutes of every standup reconstructing conversations that already happened.
I got tired of it. So I built an AI meeting assistant that doesn't just transcribe — it learns your team's patterns across sessions and holds everyone accountable with actual evidence.
Here's how it works, what I learned about agent memory, and why the hard part wasn't transcription at all.
The Problem: Meetings Are Write-Only
We treat meetings like write-only storage. Information goes in, but it doesn't come back out in any useful form. The real cost isn't the hour spent in the room — it's the 20 minutes next week figuring out if Sarah actually agreed to refactor the auth middleware or if that was just "something we should look into."
I wanted three things:
- Real-time transcription with speaker context
- Auto-extraction of real commitments — not vague ideas, but "Sarah will do X by Thursday"
- Cross-session memory — so the assistant knows by Meeting 10 that your team always underestimates infrastructure tasks by 2x
The Stack
- Next.js 14 — web app and dashboard
- FastAPI — backend API and webhook handlers
- MongoDB — raw transcripts, task state, audit logs
- Redis — staging buffer for human review, chat history (TTL 7 days)
- Hindsight Cloud — the persistent memory layer that makes the agent actually learn
- Electron desktop client — captures system audio via virtual loopback (VB-CABLE/BlackHole/PipeWire) and runs faster-whisper locally
The architecture is straightforward: audio in → transcript out → LLM extraction → human review → memory retain → copilot recall.
The Hard Part: Teaching an Agent to Remember (Without Poisoning Itself)
Here's where most AI projects die: they try to remember everything. Raw transcripts, every LLM response, every status update. That's not memory — that's a data dump. And it's expensive to retrieve from.
I used Hindsight, which forces you to think in four distinct memory networks. This structure turned out to be the most important design decision.
The discipline this forces: Raw transcripts stay in MongoDB. Current task state stays in MongoDB. Chat history expires in Redis. Only validated, synthesized insight goes into Hindsight.
This means your agent isn't searching through 10 hours of transcript every time it answers a question. It's recalling structured experiences: "In Meeting 4, Sarah committed to the auth refactor. In Meeting 6, she raised a blocker about OAuth scope. In Meeting 8, it was marked done."
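Here's a minimal sketch of that routing discipline, not the project's actual code. The record kind names (`raw_transcript`, `validated_insight`, and so on) are hypothetical labels I made up for illustration; the point is that anything unclassified gets refused rather than silently landing in long-term memory.

```python
from enum import Enum


class Store(Enum):
    MONGODB = "mongodb"      # durable raw data and current state
    REDIS = "redis"          # short-lived data with a TTL
    HINDSIGHT = "hindsight"  # validated, synthesized insight only


# Hypothetical record kinds; the names are illustrative assumptions.
ROUTES = {
    "raw_transcript": Store.MONGODB,
    "task_state": Store.MONGODB,
    "audit_log": Store.MONGODB,
    "chat_message": Store.REDIS,
    "staged_extraction": Store.REDIS,
    "validated_insight": Store.HINDSIGHT,
}


def destination(kind: str) -> Store:
    """Return the store for a record kind, refusing anything unclassified."""
    try:
        return ROUTES[kind]
    except KeyError:
        raise ValueError(f"unclassified record kind: {kind!r}") from None
```

Failing loudly on an unknown kind is a deliberate choice in this sketch: a new data type should force a decision about where it belongs before any write happens.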
The Staging Pattern: Don't Let Hallucinations Into Long-Term Memory
The fastest way to ruin an AI assistant is to let it confidently remember things that never happened. If the LLM hallucinates a task assignment and you write that to memory, every future response is contaminated.
I implemented a staging buffer in Redis — a 5-minute review window before anything enters Hindsight:
The flow:
- Confidence > 0.9 → Auto-promoted to memory
- Confidence ≤ 0.9 → Sits in review queue for team lead approval
- Rejected → Logged to audit trail, never touches memory
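The flow above can be sketched roughly like this. It's an in-memory stand-in, so plain lists play the role of the Redis buffer, the review queue, and the audit log; the class and field names are my own assumptions, and sending a score of exactly 0.9 to review is the conservative reading of the thresholds.

```python
from dataclasses import dataclass, field

AUTO_PROMOTE_THRESHOLD = 0.9  # threshold taken from the flow above


@dataclass
class Extraction:
    task: str
    owner: str
    confidence: float


@dataclass
class Staging:
    """In-memory stand-in for the staging buffer, review queue, and audit log."""
    promoted: list = field(default_factory=list)      # would be retained to memory
    review_queue: list = field(default_factory=list)  # waits for a team lead
    audit_log: list = field(default_factory=list)     # rejected, never reaches memory

    def stage(self, item: Extraction) -> str:
        """Route a fresh extraction by confidence."""
        if item.confidence > AUTO_PROMOTE_THRESHOLD:
            self.promoted.append(item)
            return "promoted"
        self.review_queue.append(item)
        return "review"

    def resolve(self, item: Extraction, approved: bool) -> str:
        """Apply a lead's decision to a queued item."""
        self.review_queue.remove(item)
        if approved:
            self.promoted.append(item)
            return "promoted"
        self.audit_log.append(item)
        return "rejected"
```

In a real deployment the `promoted` list would be replaced by the call that retains the item to memory, and the queue would live in Redis with an expiry.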
The extraction schema requires a verbatim evidence field — an exact transcript quote. No quote, no task. This sounds like overhead, but it's actually a hallucination filter. If the LLM can't find the exact words "Sarah will handle the auth refactor by Thursday," it can't create the task.
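A minimal sketch of that evidence check, assuming the transcript is available as plain text. The function names are hypothetical, and normalizing case and whitespace before matching is my own assumption (it keeps line wraps from causing false rejections); a stricter version could compare the raw strings. The sample transcript is made up around the quote above.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so line wraps don't break matching."""
    return re.sub(r"\s+", " ", text).strip().lower()


def has_verbatim_evidence(evidence: str, transcript: str) -> bool:
    """True only if the quoted evidence appears word for word in the transcript."""
    quote = normalize(evidence)
    return bool(quote) and quote in normalize(transcript)


# Hypothetical transcript snippet for illustration.
transcript = "Lead: Sarah will handle the auth refactor\nby Thursday.\nSarah: Yep."

print(has_verbatim_evidence("Sarah will handle the auth refactor by Thursday", transcript))  # True
print(has_verbatim_evidence("Sarah will rewrite the auth layer", transcript))  # False
```

An empty evidence string is rejected outright, which is the "no quote, no task" rule in code form.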
How the Agent Actually Learns
The learning isn't magic — it's two feedback loops.
Loop 1: Experience accumulation
Meeting 1: The copilot answers based only on uploaded documents. It knows nothing about your team.
Meeting 5: Every post-meeting retain has added structured context. When the copilot recalls from Hindsight, it gets back confidence-weighted experiences:
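To make "confidence-weighted" concrete, here's a sketch of turning recalled experiences into prompt context. The `Experience` shape is a hypothetical stand-in, not the structure Hindsight actually returns, and the confidence values are made-up numbers attached to the auth refactor example from earlier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Experience:
    meeting: int
    summary: str
    confidence: float  # illustrative values, not real output


def build_context(experiences, min_confidence=0.5, limit=5):
    """Keep confident experiences, strongest first, and render them as prompt lines."""
    kept = [e for e in experiences if e.confidence >= min_confidence]
    kept.sort(key=lambda e: (-e.confidence, e.meeting))
    return [
        f"[Meeting {e.meeting}, confidence {e.confidence:.2f}] {e.summary}"
        for e in kept[:limit]
    ]


# Hypothetical recall result for illustration.
recalled = [
    Experience(4, "Sarah committed to the auth refactor.", 0.95),
    Experience(6, "Sarah raised a blocker about OAuth scope.", 0.80),
    Experience(8, "The auth refactor was marked done.", 0.90),
]

for line in build_context(recalled):
    print(line)
```

Keeping the meeting number in each line is what lets the copilot cite where a claim came from instead of stating it as bare fact.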
Loop 2: Opinion correction
When a lead rejects an extracted task or a user thumbs-downs a response, that signal updates the Opinion network. The agent starts recognizing your team's jargon, adjusting its confidence thresholds, and learning which discussion patterns lead to real commitments vs. idle speculation.
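One simple way a correction signal could move a threshold is sketched below. To be clear, this is not how Hindsight updates its Opinion network; it's a hypothetical illustration, and the step size and bounds are arbitrary assumptions. Each rejection nudges the auto-promote bar up, each approval of a queued item nudges it down, and the result stays within fixed limits.

```python
def adjust_threshold(threshold, approved, step=0.01, floor=0.8, ceiling=0.99):
    """Nudge the auto-promote threshold after one human review decision.

    A rejection suggests the extractor is overconfident, so the bar rises.
    An approval suggests the bar may be too strict, so it drops slightly.
    """
    delta = -step if approved else step
    return round(min(ceiling, max(floor, threshold + delta)), 4)


threshold = 0.9
for approved in [False, False, True]:  # two rejections, then one approval
    threshold = adjust_threshold(threshold, approved)
print(threshold)
```

The bounds matter more than the step size: without a floor, a run of approvals could quietly turn off human review altogether.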
The result by Meeting 10:
"Based on 3 previous discussions, the API performance issue has been raised but not formally assigned. Historically, your team has a 3-week lag between blocker mention and task creation for infrastructure work."
That's not a generic LLM answer. It's drawn from specific retained experiences and your team's observed behavior.
Closing the Loop: From Promises to Proof
A meeting assistant that only tracks what people say is halfway useful. The other half is knowing whether it actually happened.
Every Kanban task gets a unique key (MM-A1B2C3D4). Developers include it in PR titles or commit messages. A GitHub App webhook listens for merges:
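Here's a framework-free sketch of the webhook side, under a few assumptions. GitHub signs deliveries with an HMAC-SHA256 of the raw body in the `X-Hub-Signature-256` header, and a merged PR arrives as a `pull_request` event with `action` set to `closed` and `merged` set to true. The key pattern (eight uppercase letters or digits after `MM-`) is my guess from the example key, and the function names are hypothetical. This version only scans the PR title and body; commit messages aren't part of this payload and would need a separate lookup.

```python
import hashlib
import hmac
import re

# Assumption: keys are "MM-" plus eight uppercase letters or digits, as in MM-A1B2C3D4.
TASK_KEY = re.compile(r"\bMM-[A-Z0-9]{8}\b")


def signature_is_valid(secret: bytes, body: bytes, header: str) -> bool:
    """Check the X-Hub-Signature-256 header against the raw request body."""
    expected = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected.encode(), header.encode())


def completed_task_keys(event: str, payload: dict) -> list[str]:
    """Return task keys referenced by a merged pull request, else an empty list."""
    if event != "pull_request" or payload.get("action") != "closed":
        return []
    pr = payload.get("pull_request") or {}
    if not pr.get("merged"):
        return []  # closed without merging
    text = f"{pr.get('title') or ''}\n{pr.get('body') or ''}"
    return sorted(set(TASK_KEY.findall(text)))
```

In the FastAPI handler, you'd verify the signature against the raw bytes before parsing JSON, then mark each returned key's task as done and attach the PR as proof.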
Now every task card shows both sides:
- The promise: "Sarah will handle the auth refactor by Thursday — Meeting 4, 14:32"
- The proof: PR #47, merged by @sarah-dev, CI passed
This cross-reference is what makes the system trustworthy for post-mortems and accountability.
What I Learned (The Short Version)
- Structure your memory before you store anything. "Store everything" is a trap. Decide what each piece of information means before you write a single retain call.
- Staging is not optional. Human review for low-confidence extractions keeps your memory layer clean. Auto-promote only what you're sure about.
- Evidence requirements filter hallucinations. Forcing verbatim quotes sounds bureaucratic. It's actually the best guardrail you can build.
- Learning needs feedback loops, not just more context. Longer prompts don't make an agent smarter. Correction signals that update behavior do.
- Human review is a feature. I wanted full automation. In practice, the staging queue is what team leads actually trust. Better to flag ambiguity for human eyes than guess wrong silently.
The Result
After 10 meetings, the copilot cites past decisions by meeting number. It flags recurring topics that never get formally assigned. It adjusts its confidence language based on your team's history.
More practically: we stopped having the "wait, who was supposed to do this?" conversation. Every auto-generated task has transcript evidence. Every completed task has a GitHub PR. And the agent knows your team's patterns well enough to surface the right context without being asked.
If you're building persistent agent memory — not just chat history, but cross-session learning that improves behavior — start with the staging pattern. The architecture decisions you make in week one are very hard to undo once your memory layer has ten meetings of data in it.
Resources:
- Hindsight on GitHub: https://github.com/vectorize-io/hindsight
- Hindsight docs: https://hindsight.vectorize.io/
- What is agent memory?: https://vectorize.io/what-is-agent-memory





