There's a thing PMs do that nobody puts on a job description. Every couple of weeks, after a retro, you flag three or four items in your head as things to watch. Then a sprint goes by. Then another. Three months later you're in a leadership review, somebody asks "is X getting better or worse?", and you answer from gut feeling because there's no time to read 12 retros before the meeting.
The retros are sitting there. The data is fine. The labour of reading them isn't.
I work on Kollabe, so take this with the appropriate grain of salt. The patterns generalise; if your retro tool exposes an API or an MCP server, the workflow below works against whatever you've got. I'll keep the prompts vendor-shaped so they're easy to swap.
What I changed
For the last two months I've been driving most of my retro-adjacent work through Claude with three MCP servers connected: Kollabe (for the retro and action-item surface), Atlassian (for Jira), and GitHub. The Kollabe MCP exposes about 50 tools that map 1:1 to the public REST API, so anything I can do in the UI, Claude can do as me. That includes reading every retro and action item in the spaces I have access to.
The Monday-morning workflow is one prompt. It does four things in order, then stops and waits for me.
- Reads the last 26 weeks of retros across the spaces I own.
- Pulls every open action item, including ones quietly older than the people who created them.
- Reads the last sprint's standups for additional signal.
- Writes a structured brief: what's getting better, what's getting worse, what action items are stale, and a one-line "here's what I'd ask the team about this week."
Then it shuts up and lets me think.
The action item half: the part that quietly fails
Here's the embarrassing thing nobody says out loud about retros: most action items don't close because they fall out of working memory. The retro happens, somebody writes "investigate flaky CI", it gets assigned, two sprints pass, the person who owned it changed teams, and the item is technically open in the tool forever. Read a clean dashboard of "open action items" on most teams and you'll find the median age is something like 47 days.
The fix isn't a better tracker. The fix is somebody reading every open action item once a week and asking three questions: is it still relevant, did it actually get done quietly, who really owns it now.
I have Claude do that. The prompt is short:
For each space I own, use `action_item_list` (status = PENDING).
For every item older than 21 days:
1. Look for activity on the item's keywords over the last 30 days: use `search` (Kollabe MCP) for retros and standups, and the Atlassian MCP for Jira.
2. If the item appears resolved or superseded, propose marking it COMPLETED with a one-line note explaining what closed it.
3. If the assignee changed spaces or hasn't been active in standups for >14 days, flag it for reassignment.
4. Otherwise, propose a one-sentence nudge comment from me, via `action_item_create_comment`.
Show me a table. Wait for me to approve each row before any writes.
It produces a table. Most weeks, I approve about 80% of it and reject the rest. The rejected ones are the most interesting. They're items the AI thought were resolved but I know aren't, which usually means we have an undocumented decision that should become a documented one.
Two months in, the median action-item age in our spaces is 14 days, not 47. Nobody had to nag anyone. Nobody had to build a dashboard.
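The triage rules in that prompt are simple enough to write down as a decision function. Here's a minimal Python sketch, assuming a hypothetical `ActionItem` shape (the real payload from `action_item_list` may look different) and assuming that an assignee with no standup activity at all counts as inactive. The 21-day and 14-day thresholds come from the prompt; the sample items are made up.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional, Tuple

STALE_AFTER_DAYS = 21     # from the prompt: only triage items older than this
INACTIVE_AFTER_DAYS = 14  # from the prompt: assignee silent in standups for longer than this


@dataclass
class ActionItem:
    # Hypothetical shape for illustration; not the real API payload.
    title: str
    created: date
    assignee_last_standup: Optional[date]    # None = no standup activity found
    assignee_left_space: bool = False
    resolved_evidence: Optional[str] = None  # one-line note if the search found a fix


def propose(item: ActionItem, today: date) -> Optional[Tuple[str, str]]:
    """Return (proposal, note) for a stale item, or None if it is too young to triage."""
    if (today - item.created).days <= STALE_AFTER_DAYS:
        return None
    if item.resolved_evidence:
        return ("COMPLETE", item.resolved_evidence)
    inactive = (
        item.assignee_last_standup is None
        or (today - item.assignee_last_standup).days > INACTIVE_AFTER_DAYS
    )
    if item.assignee_left_space or inactive:
        return ("REASSIGN", "assignee moved or inactive in standups")
    return ("NUDGE", "ask the owner whether this is still relevant")


today = date(2025, 6, 2)
items = [
    ActionItem("investigate flaky CI", date(2025, 4, 1), date(2025, 5, 30)),
    ActionItem("rotate deploy keys", date(2025, 5, 25), date(2025, 5, 30)),
    ActionItem("document on-call handoff", date(2025, 3, 10), date(2025, 5, 1)),
    ActionItem("speed up deploys", date(2025, 2, 1), date(2025, 5, 30),
               resolved_evidence="a later retro says the deploy flow was fixed"),
]
proposals = [(item.title, propose(item, today)) for item in items]
for title, proposal in proposals:
    print(title, "->", proposal)
```

The function only proposes. Nothing here writes anything, which is the point: the output is the table I approve row by row.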
The six-month delta: the part that's new
This is the bit I underestimated when I wired it up. The Kollabe MCP has a search tool that runs semantic search across every retro, standup, action item, and round in your spaces, backed by pgvector embeddings. Faster-than-scrolling isn't really the point. The point is the questions you'd never have asked because the labour wasn't worth it.
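If "semantic search" sounds like magic, the core idea is small: every note is stored as a vector, the query becomes a vector too, and results are ranked by how close the vectors are. Cosine distance is one of the measures pgvector supports. Below is a toy sketch in plain Python, assuming hand-written three-dimensional vectors in place of real embeddings. It is not how Kollabe implements search, just the ranking idea.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Toy "embeddings", invented for illustration. Real ones have hundreds of dimensions
# and come from an embedding model, not from a person typing numbers.
NOTES = {
    "CI is flaky again": [0.9, 0.1, 0.0],
    "tests fail randomly on main": [0.8, 0.2, 0.1],
    "too many meetings on Tuesdays": [0.0, 0.1, 0.9],
}


def search(query_vec, notes, top_k=2):
    """Return the top_k note texts, most similar to the query first."""
    ranked = sorted(
        notes,
        key=lambda text: cosine_similarity(query_vec, notes[text]),
        reverse=True,
    )
    return ranked[:top_k]


# Pretend this vector is the embedding of the query "unreliable builds".
results = search([0.9, 0.05, 0.0], NOTES)
print(results)
```

Note that the two build-related notes share no keywords, yet both rank above the meetings note. That is what lets a question like "what stopped being a complaint?" work across differently worded retro items.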
What I ask, every Monday, in the same prompt:
Using the Kollabe MCP `search` and `retro_list` + `retro_get`, scan the last 26 weeks of retros in my spaces. Produce:
1. The five themes that appeared in the first 3 months but have NOT appeared in the last 6 weeks. (Things that quietly got fixed.)
2. The five themes that have appeared in 3+ retros over the last 6 weeks and weren't a problem 6 months ago. (Things that quietly got worse.)
3. Any theme that appeared, was resolved, and has come back.
4. Two questions worth asking the team this week given (1) - (3).
The first time I ran it, the AI surfaced three things that genuinely surprised me. One was a fix I'd forgotten the team had quietly made. The deploy flow had been a chronic complaint for the first quarter and had simply stopped being a complaint, which meant somebody's January work had paid off and I hadn't thanked them for it. One was a slow drift in code review wait times, too quiet to feel like a problem in any single retro, obviously a problem when you saw three retros in a row mention it without escalation. The third was a recurring frustration about meeting overload that had been resolved once and was creeping back.
It was uncomfortable. It was also the most useful single data point I'd gotten about my team in a year. I sent the dev who fixed deploys a public kudos that afternoon.
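Once themes have labels, the first three parts of that prompt are set arithmetic over time windows. Here's a minimal Python sketch, assuming the theme labelling has already been done upstream (by the model, in my case) and that each retro is reduced to a `(weeks_ago, themes)` pair. The window sizes come from the prompt; the reading of "resolved and came back" as "present early, absent in the middle, present recently" is my assumption, and the sample data is invented to mirror the three findings above.

```python
from collections import Counter


def theme_delta(retros, total_weeks=26, early_weeks=13, recent_weeks=6, min_recent=3):
    """Classify themes as fixed, worse, or resurfaced.

    retros: iterable of (weeks_ago, themes), where themes is a collection of labels.
    """
    early, middle, recent = set(), set(), Counter()
    for weeks_ago, themes in retros:
        if weeks_ago < recent_weeks:
            recent.update(set(themes))       # count each theme once per retro
        elif weeks_ago >= total_weeks - early_weeks:
            early |= set(themes)
        else:
            middle |= set(themes)
    return {
        # Seen in the first three months, gone from the last six weeks.
        "fixed": sorted(early - set(recent)),
        # In 3+ recent retros and not present in the early window.
        "worse": sorted(t for t, n in recent.items() if n >= min_recent and t not in early),
        # Present early, absent in the middle window, present again recently.
        "resurfaced": sorted(t for t in recent if t in early and t not in middle),
    }


# Invented sample data; weeks_ago=0 is the most recent retro.
retros = [
    (25, {"deploy flow"}),
    (22, {"deploy flow", "meeting overload"}),
    (18, {"deploy flow"}),
    (15, {"meeting overload"}),
    (10, {"flaky CI"}),
    (5, {"code review wait"}),
    (3, {"code review wait", "meeting overload"}),
    (1, {"code review wait"}),
]
delta = theme_delta(retros)
print(delta)
```

The hard part is the labelling, not the arithmetic. This only shows why the question becomes cheap once something else has done the reading.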
Why MCP and not "I'll just build a script"
I work on this stuff for a living and I have, in fact, built the script version. It works. It is also rigid: every time I want to ask a slightly different question, I edit code. The MCP version lets me write the question in English on Monday morning and have it run against the same API surface my Python script would have used.
The thing that makes this real, not a parlour trick, is that the public API and the MCP server are the same surface. Every MCP tool is registered against a /api/v1/* handler, with the same Zod schemas and the same access checks. So when I prototype something in chat that turns out to be useful weekly, I can lift the same calls into a scheduled Worker, run it Friday at 4pm, and email myself the brief. The prompt and the script share an interface.
That matters for technical PMs specifically. You're going to want to graduate the things that worked into something that runs without you. With an MCP that mirrors a real REST API, you can — without translating between two surfaces or waiting for a vendor to ship Zapier support.
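Graduating a prompt into a script mostly means turning a tool call into an HTTP request. Here's a rough Python sketch using only the standard library. The host, the `action-items` path, the `status` query parameter, and Bearer-token auth are all assumptions for illustration; check the API docs for the real endpoint names and auth scheme before using any of it.

```python
import json
import urllib.parse
import urllib.request


def build_request(base_url, token, path, params):
    """Build a GET request against a /api/v1/* endpoint (path and params are hypothetical)."""
    query = urllib.parse.urlencode(params)
    url = f"{base_url.rstrip('/')}/api/v1/{path.lstrip('/')}?{query}"
    headers = {
        "Authorization": f"Bearer {token}",  # assumed auth scheme
        "Accept": "application/json",
    }
    return urllib.request.Request(url, headers=headers)


def fetch_json(request, timeout=10):
    """Send the request and decode the JSON body. Not called in this sketch."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Placeholder host and token; nothing is sent until fetch_json(req) is called.
req = build_request("https://example.com", "TOKEN", "action-items", {"status": "PENDING"})
print(req.get_method(), req.full_url)
```

Splitting "build" from "fetch" keeps the request shape inspectable, which is handy when you're checking a script against what the chat version did.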
The honest caveats
A few:
- It works to the depth of your retro content. If your team writes "deploy bad" as a retro item, the search and summary surfaces are not going to mine wisdom out of that. The first thing it convinced me of was that we needed better retro write-ups. We added a templated "what happened, who was affected, what changed" body to two columns. The signal jumped a month later.
- Semantic clusters are smart, not psychic. The AI will sometimes group two superficially-similar items that are actually about different things. A CI complaint and a release-process complaint, say. I read the cluster headers, not just the takeaway.
- The action-item triage prompt can write to the tool, so it always runs behind an approval step. I've never let it write without per-row approval. The day I trust it to update without me is the day it'll mark an unresolved compliance item as fixed because somebody mentioned the word "fixed" in a tangent. The prompt stays in the "propose then wait" shape.
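If you ever move the triage out of chat and into a script, the "propose then wait" shape is worth keeping in the code itself. Here's a minimal Python sketch, assuming a hypothetical `Proposal` record and caller-supplied `approve` and `write` functions. None of these names come from a real API.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Proposal:
    # Hypothetical record for one row of the triage table.
    item_id: str
    action: str  # e.g. "COMPLETE", "REASSIGN", "NUDGE"
    note: str


def apply_approved(
    proposals: Iterable[Proposal],
    approve: Callable[[Proposal], bool],
    write: Callable[[Proposal], None],
) -> List[Proposal]:
    """Call write() only for proposals that approve() accepts; return what was written."""
    applied = []
    for proposal in proposals:
        if approve(proposal):  # the human decision, one row at a time
            write(proposal)
            applied.append(proposal)
    return applied


proposals = [
    Proposal("a1", "COMPLETE", "deploy flow fixed"),
    Proposal("a2", "COMPLETE", "someone said 'fixed' in a tangent"),
    Proposal("a3", "NUDGE", "still relevant?"),
]
approved_ids = {"a1", "a3"}  # stand-in for my per-row yes/no
written = []                 # stand-in for the real write call
applied = apply_approved(proposals, lambda p: p.item_id in approved_ids, written.append)
print([p.item_id for p in applied])
```

The write function is only reachable through the approval check, so there's no code path where a rejected row gets applied by accident.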
A rule of thumb you can steal
If you're a PM who feels like the data exists but the labour doesn't, here's the question to ask before you build anything:
What's the question I'd ask my team if I'd just spent two hours reading the last six months of retros?
Write that question down. That's your weekly prompt. If your retro tool has an MCP server or a public API with semantic search, you can have an AI do the reading and bring back the answer, with citations. If it doesn't, that's a procurement question. The next generation of agile tooling is going to assume an AI is reading the history, not just a human.
I run my version on Monday at 9am. Whatever your version is, time saved isn't really the win. The win is questions asked that weren't being asked.
If you want to try this against Kollabe's MCP, the connection takes about a minute and is included on Premium plus all trials. The MCP page has the OAuth flow and a one-click setup. If you want the script version, every tool I mentioned is also a documented REST endpoint, same auth, same shape.



Top comments (14)
We conduct quarterly retros with all of our engineering teams. This sounds like it would be super useful to compare the whole year!
Agreed
labour problem is real and underdiagnosed. tried something similar with sprint planning notes - Claude was better at finding what teams kept NOT saying across months than what they kept saying. what was the third thing?
Pretty interesting! I hadn’t considered trying to connect Claude to our retro tool.
The semantic search across 26 weeks of retros is a smart application of pgvector. Most teams just let that data sit untouched.
The propose then wait workflow for action item writes is the right call. I have seen AI triage make messy decisions when it auto-closes items based on keyword matches. Having a human approval step on each row keeps the trust high.
What is the false positive rate like on the semantic clustering? You mentioned it sometimes groups items that are superficially similar but actually different. Curious how often you have to manually split those clusters.
This is a brilliant use case for LLMs in product management. Retrospectives hold so much value, but the insights are almost always lost in the noise of the next sprint. Having an AI aggregate six months of qualitative data to spot slow-moving negative trends removes the recency bias perfectly. Definitely trying this with our Jira exports!
the drop from 47 to 14 days at the median is the part nobody else is benchmarking. action item decay is a metrics problem dressed up as a process problem.
ran a similar weekly pass over six months of SEO content feedback. claude found two recurring reader complaints we'd quietly stopped addressing because the original ticket owner left. surfacing "items that came back" was way more useful than "items that closed".
the bit that broke for us was side channel evidence — half our work happens in slack, not the tracker, so the AI marks things resolved that aren't. how do you bias the prompt toward "needs human context" when the only evidence is the description itself?
The "code review wait times creeping upward across three retros" finding is the one that resonates most. That's the failure mode of human-driven retrospectives in a nutshell — each instance is "just slightly worse than last time" so nothing crosses the urgency threshold to be raised, but the integral over six months is significant.
We've found similar value running LLM passes over long-horizon data: individual data points feel non-noteworthy, but trend detection across dozens of weekly artifacts surfaces stuff humans systematically miss because of recency bias and meeting-by-meeting framing.
One thing I'd push on: how do you handle the inverse — false positives where Claude flags a "trend" that's actually noise? With three MCP servers and ~50 tools on Kollabe alone, the surface area for plausible-but-wrong patterns is large. Do you have a confidence/evidence threshold the model has to clear before something makes it into your Monday digest, or is the human-in-the-loop review the only filter?
This is one of the genuinely underrated uses of LLMs - pattern-finding across a corpus too big and too boring for a human to re-read. Six months of retros is exactly the kind of data where the signal is real but buried under repetition, and a model surfacing three things you'd missed is a clean win.
The reason it works here and not in "write my whole app" is the task shape: synthesis over text you provide, with you as the judge of what's actionable. Low stakes if it's wrong (you just discard a bad suggestion), high value when it's right. That's the sweet spot for these tools - augmenting human judgment on dense data, not replacing it. Curious whether the three findings were things you'd have eventually spotted, or genuinely non-obvious ones the cross-retro view exposed? Great practical example.
The insight about action items falling out of working memory really resonates. Most teams struggle with this exact problem where items get assigned and then quietly age without anyone noticing. Reducing the median action item age from 47 days to 14 days without nagging is a practical win worth replicating. The approach of having AI propose changes while keeping human approval for writes strikes the right balance between automation and oversight. This workflow demonstrates how AI can handle the labor of reading historical data so PMs can focus on asking better questions.
This resonates deeply — the gap between "data exists" and "labour of reading it" is exactly where most team knowledge goes to die.
What strikes me most is the action item aging problem. The median 47-day open item isn't a tracker failure, it's a working memory failure. You've essentially built an external memory layer that does the weekly "is this still relevant?" pass that humans reliably skip. That's the same pattern I've been exploring in AI agent design: the value isn't in the AI doing the thinking, it's in the AI doing the remembering so the human can do the thinking.
One thing I'd be curious about: did Claude surface any patterns that felt wrong to you — things it flagged as "getting worse" that you knew from context were actually fine? That false-positive rate seems like the real trust calibration challenge for this kind of longitudinal analysis.