Last night I wondered whether an agent could turn chaos into ownership without a human PM; by morning, ours had extracted action items from our meeting transcript and auto-created tasks with owners and due dates.
What I actually built
I built a Telegram bot that behaves like a stubborn, slightly opinionated project manager for student teams. Not in the “talks nicely in chat” sense. In the “stores role updates, tracks open and done tasks, captures meeting transcripts, and sends deadline reminders whether you like it or not” sense.
The repo is small enough to reason about in one sitting, which I like. The runtime is centered in bot.py. State that must be deterministic lives in SQLite via storage.py. Long-term team memory and reasoning live behind a wrapper in hindsight_service.py. Config is boring by design in config.py. That split is the whole architecture: deterministic workflow state in SQL, fuzzy semantic memory in Hindsight.
If you want context on the memory layer itself: I used the open-source Hindsight memory engine on GitHub, leaned on the Hindsight technical documentation for retain, recall, and reflect, and specifically followed patterns from its agent memory architecture guide.
The through-line: from meeting chatter to enforceable ownership
The core story in this codebase is not “I added an AI assistant.” It is: I turned unstructured team chat into structured obligations without writing custom NLP pipelines.
My first instinct was classic backend thinking: parse commands, write rows, call it done. That handles explicit actions like /task 2026-03-30 | write auth middleware. It fails immediately on real meetings, where people say things like:
“I can probably finish login by Thursday.”
“Can someone own test coverage?”
“We decided to drop OCR and ship basic upload.”
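For contrast, the explicit /task path really is just classic parsing. A minimal sketch of that half, using the `/task <due> | <title>` format from the example above (the helper name is mine, not the real code in bot.py):

```python
from datetime import date

def parse_task_command(text: str) -> tuple[date, str]:
    """Parse '/task 2026-03-30 | write auth middleware' into (due_date, title)."""
    body = text.removeprefix("/task").strip()
    due_part, _, title = body.partition("|")
    return date.fromisoformat(due_part.strip()), title.strip()
```

Ten lines, fully deterministic, and completely useless for "I can probably finish login by Thursday."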
That is where the memory layer had to do more than retrieval. It had to preserve evolving context and then produce actionable outputs under constraints.
I ended up with this pattern:
Capture meeting text incrementally in SQLite, keyed by chat + session.
Upsert the full transcript into one stable memory document for that meeting session.
At summary or end-of-meeting time, run reflect over team + session tags.
Extract structured tasks and commit them back into deterministic SQL task rows.
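The four steps above can be sketched as two handlers. Everything here is hypothetical glue: the in-memory dict stands in for the Hindsight document store, and the trivial `TODO:` extraction stands in for reflect, so the shape of the data flow is the point, not the names:

```python
import sqlite3

# Stand-ins for storage.py (SQLite) and hindsight_service.py (memory docs).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE transcript (chat_id TEXT, session_id TEXT, line TEXT)")
db.execute("CREATE TABLE tasks (title TEXT, owner TEXT, due TEXT)")
memory_docs: dict[str, str] = {}  # document_id -> full transcript (upsert target)

def on_message(chat_id: str, session_id: str, line: str) -> None:
    # 1) Capture incrementally in SQLite, keyed by chat + session.
    db.execute("INSERT INTO transcript VALUES (?, ?, ?)", (chat_id, session_id, line))
    # 2) Upsert the full transcript into ONE stable memory document.
    doc_id = f"meeting:{chat_id}:{session_id}"
    rows = db.execute(
        "SELECT line FROM transcript WHERE chat_id=? AND session_id=?",
        (chat_id, session_id),
    ).fetchall()
    memory_docs[doc_id] = "\n".join(r[0] for r in rows)

def on_meeting_end(chat_id: str, session_id: str) -> list[dict]:
    # 3) Reflect over the session document (stubbed as trivial extraction here).
    doc = memory_docs.get(f"meeting:{chat_id}:{session_id}", "")
    tasks = [{"title": line.split(":", 1)[1].strip(), "owner": None, "due": None}
             for line in doc.splitlines() if line.startswith("TODO:")]
    # 4) Commit extracted tasks back into deterministic SQL rows.
    for t in tasks:
        db.execute("INSERT INTO tasks VALUES (?, ?, ?)",
                   (t["title"], t["owner"], t["due"]))
    return tasks
```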
That sounds obvious now. It was not obvious while building it.
Why this shape made sense
I wanted two properties that are usually at odds:
Memory should be semantically rich and update continuously.
Task state should be deterministic, auditable, and easy to query.
If I made Hindsight the source of truth for tasks, I’d lose deterministic updates and straightforward reminders. If I made SQL the only source, I’d lose meeting context and cross-session reasoning. So I stopped trying to choose one system and gave each system one job.
The design choices that mattered (and the ones that hurt)
1) Bank-per-group isolation with explicit behavior tuning
Each Telegram group maps to one Hindsight bank. I liked this because cross-team memory leakage is a silent failure mode and hard to detect until someone notices wrong recommendations.
The non-obvious part here is disposition settings. I pushed skepticism and literalism up on purpose. In project coordination, a polite hallucination is worse than an “unknown.” I wanted the system to under-claim when data is missing.
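A sketch of how that tuning surfaces in my wrapper. The knob names below are illustrative stand-ins, not Hindsight's real config schema; the one-bank-per-group mapping is the actual design:

```python
# Illustrative disposition knobs; names are stand-ins, not Hindsight's schema.
BANK_DISPOSITION = {
    "skepticism": 0.8,   # prefer "unknown" over plausible guesses
    "literalism": 0.8,   # stick to what was actually said in the meeting
}

def bank_id_for_chat(chat_id: int) -> str:
    """One memory bank per Telegram group, so cross-team memory never mixes."""
    return f"team-bank-{chat_id}"
```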
What surprised me: client compatibility mattered. The code has backward-compatibility handling when async config methods are unavailable. I didn’t plan to spend time there, but integration reality is never as clean as architecture diagrams.
2) Transcript upsert with stable meeting identity
Early version mistake: storing each message as a separate memory document and expecting good summaries later. Retrieval got noisy fast.
The better shape was one evolving memory document per meeting session. SQLite appends lines; Hindsight receives the full transcript using a stable document ID.
Using a stable document_id was the turning point. Instead of accumulating scattered fragments, I got one coherent meeting artifact that could be updated over time and reasoned about at meeting end.
3) Structured extraction before touching SQL
I don’t let free-form model text write directly into task tables. The system asks for structured output and then validates before creating tasks.
The ingestion step in the bot then takes those validated proposals and writes actual task rows into SQLite.
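A minimal sketch of that ingestion boundary, with a simplified table schema of my own; the real storage.py schema differs, but the refusal logic is the point:

```python
import sqlite3

def ingest_tasks(conn: sqlite3.Connection, proposed: list[dict]) -> int:
    """Write only well-formed proposals into the deterministic task table."""
    conn.execute("""CREATE TABLE IF NOT EXISTS tasks (
        id INTEGER PRIMARY KEY, title TEXT NOT NULL, owner TEXT, due TEXT)""")
    created = 0
    for raw in proposed:
        title = (raw.get("title") or "").strip()
        if not title:
            continue  # the memory system proposed garbage; the SQL layer refuses it
        conn.execute("INSERT INTO tasks (title, owner, due) VALUES (?, ?, ?)",
                     (title, raw.get("owner"), raw.get("due")))
        created += 1
    conn.commit()
    return created
```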
This split gave me a practical contract:
Memory system proposes.
SQL layer disposes.
It’s not glamorous, but it’s the difference between “nice demo” and “can I trust this in a real group chat.”
4) Deterministic state where it matters
I kept roles, tasks, active sessions, and transcript buffers in SQLite. Boring and perfect.
Because of that, reminders are straightforward: query due tasks and send grouped messages. No fancy memory retrieval needed for cron-like behavior.
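The reminder query really is just SQL. A sketch assuming a simplified `tasks` table with `status` and ISO-formatted `due` columns (the real schema in storage.py may differ):

```python
import sqlite3
from datetime import date

def due_reminders(conn: sqlite3.Connection, today: date) -> list[tuple]:
    """Return (owner, title, due) for open tasks that are due today or overdue."""
    return conn.execute(
        "SELECT owner, title, due FROM tasks "
        "WHERE status = 'open' AND due <= ? ORDER BY owner, due",
        (today.isoformat(),),
    ).fetchall()
```

No embeddings, no recall calls: cron-like behavior stays in the deterministic layer.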
What broke in practice
The biggest practical issue was not model quality. It was Telegram bot privacy mode. If privacy mode is on, you mostly see commands, not conversation messages. No transcript means no meaningful summary, no extraction, and users think the AI is dumb. In reality, the input was missing.
The code now handles this explicitly with a targeted message at meeting end when transcript content is empty, pointing people to BotFather privacy settings. This is one of those product-adjacent engineering lessons: diagnosis UX is part of reliability.
Second issue: webhook operational fragility. The bot supports polling and webhook mode. In webhook mode, tunnel URLs can rotate (for example with ngrok), and stale WEBHOOK_PUBLIC_URL silently breaks delivery. The code enforces a required public URL in webhook mode, but operationally you still need to keep that value fresh. This is not a glamorous problem, but it’s where real uptime goes to die.
Third issue: owner resolution from extracted tasks is approximate. The code maps suggested owners by username where possible. If extraction returns display names or ambiguous references, assignment can degrade to unassigned. Better than wrong assignment, but still friction.
Before and after: one concrete flow
Before this architecture, our meetings ended with “okay, who’s doing what?” and five contradictory messages in chat.
Now the flow is operationally simple:
Run /meeting_start.
Chat normally; voice notes are transcribed if configured.
During meeting, /summary gives an in-progress recap.
Run /meeting_end.
Bot outputs decisions, next actions, and creates concrete tasks from extracted action items.
Daily reminder job pings overdue or due tasks.
The specific “chaos to ownership” moment for me was watching a meeting where people discussed a feature split, deferred one risky item, and casually volunteered owners. At /meeting_end, the bot produced tasks with due dates for the explicit commitments and left ambiguous items as unknown instead of inventing details. That behavior came directly from the skeptical/literal memory config and schema-constrained extraction.
Is it perfect? No. If your team never sets roles, ownership recommendations are weaker. If people use nicknames inconsistently, mapping is messy. If the meeting is all jokes and no decisions, extraction should return little, and it does.
I’ll take that over confident nonsense every day.
What I learned building this
Treat memory and state as separate systems with separate responsibilities.
Semantic memory is great for synthesis; SQL is great for invariants. Don’t force one tool to do both.
Stable identities beat clever retrieval tricks.
A single evolving meeting document keyed by session ID worked better than many tiny memory fragments.
Constrained outputs are not optional if you automate side effects.
Asking for schema-structured extraction before writing tasks reduced garbage and made downstream logic predictable.
Skepticism is a feature, not a personality setting.
In planning workflows, a model that says “unknown” is often more useful than one that fills gaps with plausible fiction.
Most production pain is integration pain.
Privacy mode, webhook drift, and dependency/version compatibility caused more incidents than prompt wording ever did.
If I were doing the next iteration
I’d keep the same core architecture, but I’d add three things immediately.
First, confidence-aware task creation. Right now extraction can produce tasks with low certainty; I’d gate auto-creation behind confidence thresholds and route uncertain items into a review queue.
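A sketch of that gate, assuming extraction starts returning a `confidence` field (it doesn't today; the threshold value is a placeholder that would need tuning):

```python
AUTO_CREATE_THRESHOLD = 0.8  # illustrative cutoff, not a tuned value

def route_proposal(task: dict) -> str:
    """Gate side effects on extraction confidence; uncertain items go to review."""
    if task.get("confidence", 0.0) >= AUTO_CREATE_THRESHOLD:
        return "create"
    return "review_queue"
```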
Second, better identity resolution. I’d persist a richer alias map per user (username, display name variants, mention forms) so owner mapping survives nickname chaos.
Third, observability around memory calls. The bot logs failures, but I’d add explicit metrics around retain and reflect latency/error rates and extraction yield quality. When something feels “the agent got worse,” you need numbers, not vibes.
The main takeaway is this: I didn’t build a magical PM. I built a narrow system that converts conversational entropy into accountable work items by combining deterministic storage with a memory layer that is configured to be conservative and practical. That combination is what made it useful.