DEV Community

I gave Claude six months of our retros. It found three things I'd missed.

Matt on May 18, 2026

There's a thing PMs do that nobody puts on a job description. Every couple of weeks, after a retro, you flag three or four items in your head as th...

Arlow • May 18

We conduct quarterly retros with all of our engineering teams. This sounds like it would be super useful to compare the whole year!

Ben Halpern • May 18

Agreed

Mykola Kondratiuk • May 20

labour problem is real and underdiagnosed. tried something similar with sprint planning notes - Claude was better at finding what teams kept NOT saying across months than what they kept saying. what was the third thing?

AgileScrump • May 18

Pretty interesting! I hadn’t considered trying to connect Claude to our retro tool.

S M Tahosin • May 24

This is a brilliant use case for LLMs in product management. Retrospectives hold so much value, but the insights are almost always lost in the noise of the next sprint. Having an AI aggregate six months of qualitative data to spot slow-moving negative trends removes the recency bias perfectly. Definitely trying this with our Jira exports!

Gilder Miller • May 18

The semantic search across 26 weeks of retros is a smart application of pgvector. Most teams just let that data sit untouched.
The propose-then-wait workflow for action item writes is the right call. I have seen AI triage make messy decisions when it auto-closes items based on keyword matches. Having a human approval step on each row keeps the trust high.
What is the false positive rate like on the semantic clustering? You mentioned it sometimes groups items that are superficially similar but actually different. Curious how often you have to manually split those clusters.
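
For readers who want to picture the propose-then-wait pattern described in this comment, here is a minimal sketch. It is not the post author's implementation: `ProposalQueue`, `Proposal`, and the `applied` list are hypothetical names, and a real setup would write to a tracker instead of an in-memory list. The only point it illustrates is that nothing is written until a human reviews that specific row.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    REJECTED = "rejected"


@dataclass
class Proposal:
    item_id: str
    change: str                 # e.g. "close" or "reassign" (illustrative values)
    status: Status = Status.PENDING


class ProposalQueue:
    """Holds model proposals; applies a change only after human approval."""

    def __init__(self) -> None:
        self._proposals: dict[str, Proposal] = {}
        # Stand-in for the real write path (hypothetical).
        self.applied: list[tuple[str, str]] = []

    def propose(self, item_id: str, change: str) -> Proposal:
        # Proposing never writes anything; it only records the suggestion.
        proposal = Proposal(item_id, change)
        self._proposals[item_id] = proposal
        return proposal

    def review(self, item_id: str, approve: bool) -> Proposal:
        proposal = self._proposals[item_id]
        if proposal.status is not Status.PENDING:
            raise ValueError(f"{item_id} was already reviewed")
        proposal.status = Status.APPROVED if approve else Status.REJECTED
        if approve:
            self.applied.append((proposal.item_id, proposal.change))
        return proposal


queue = ProposalQueue()
queue.propose("AI-7", "close")
queue.propose("AI-8", "close")
queue.review("AI-7", approve=True)    # human says yes: change is applied
queue.review("AI-8", approve=False)   # human says no: nothing is written
```

Keeping the approval per row, not per batch, is what prevents one keyword-matched mistake from riding along with a set of good suggestions.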

Max Quimby • May 24

The "code review wait times creeping upward across three retros" finding is the one that resonates most. That's the failure mode of human-driven retrospectives in a nutshell — each instance is "just slightly worse than last time" so nothing crosses the urgency threshold to be raised, but the integral over six months is significant.

We've found similar value running LLM passes over long-horizon data: individual data points feel non-noteworthy, but trend detection across dozens of weekly artifacts surfaces stuff humans systematically miss because of recency bias and meeting-by-meeting framing.

One thing I'd push on: how do you handle the inverse — false positives where Claude flags a "trend" that's actually noise? With three MCP servers and ~50 tools on Kollabe alone, the surface area for plausible-but-wrong patterns is large. Do you have a confidence/evidence threshold the model has to clear before something makes it into your Monday digest, or is the human-in-the-loop review the only filter?
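
To make the slow-drift idea and the evidence-threshold question concrete, here is a rough sketch under stated assumptions. The function name `slow_drift`, both thresholds, the `min_points` floor, and the sample wait times are all made up for illustration; none of them come from the post. It flags a series only when no single step is large enough to be noticed on its own, the total change is large, and there are enough data points to call it a trend at all.

```python
def slow_drift(values, step_threshold, total_threshold, min_points=3):
    """Return True when a metric drifts a lot overall without any one big jump.

    values          -- one measurement per retro, oldest first
    step_threshold  -- change between two retros that a human would notice
    total_threshold -- first-to-last change worth flagging
    min_points      -- minimum number of retros before anything is flagged
    """
    if len(values) < min_points:
        # Too little evidence: treat it as noise and stay quiet.
        return False
    steps = [later - earlier for earlier, later in zip(values, values[1:])]
    every_step_is_quiet = all(abs(step) < step_threshold for step in steps)
    total_change = values[-1] - values[0]
    return every_step_is_quiet and abs(total_change) >= total_threshold


# Hypothetical code review wait times in hours, one per retro.
wait_hours = [4.0, 4.5, 5.0, 5.5, 6.0, 6.5]
flagged = slow_drift(wait_hours, step_threshold=1.0, total_threshold=2.0)
```

With these sample numbers each step is 0.5 hours, below the 1.0 hour step threshold, while the total change of 2.5 hours clears the 2.0 hour total threshold, so `flagged` is `True`. The `min_points` floor is the crude evidence threshold: two data points never count as a trend.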

Harjot Singh • May 30

This is one of the genuinely underrated uses of LLMs - pattern-finding across a corpus too big and too boring for a human to re-read. Six months of retros is exactly the kind of data where the signal is real but buried under repetition, and a model surfacing three things you'd missed is a clean win.

The reason it works here and not in "write my whole app" is the task shape: synthesis over text you provide, with you as the judge of what's actionable. Low stakes if it's wrong (you just discard a bad suggestion), high value when it's right. That's the sweet spot for these tools - augmenting human judgment on dense data, not replacing it. Curious whether the three findings were things you'd have eventually spotted, or genuinely non-obvious ones the cross-retro view exposed? Great practical example.

Manuel Bruña • Jun 15

Retros are a good fit for AI because the value is pattern detection, not final authority. The part I’d protect is traceability: every insight should point back to the source retro notes, otherwise the summary can become a confident story nobody can audit.

Siyu • May 18

The insight about action items falling out of working memory really resonates. Most teams struggle with this exact problem where items get assigned and then quietly age without anyone noticing. Reducing the median action item age from 47 days to 14 days without nagging is a practical win worth replicating. The approach of having AI propose changes while keeping human approval for writes strikes the right balance between automation and oversight. This workflow demonstrates how AI can handle the labor of reading historical data so PMs can focus on asking better questions.

Cophy Origin • May 19

This resonates deeply — the gap between "data exists" and "labour of reading it" is exactly where most team knowledge goes to die.

What strikes me most is the action item aging problem. The median 47-day open item isn't a tracker failure, it's a working memory failure. You've essentially built an external memory layer that does the weekly "is this still relevant?" pass that humans reliably skip. That's the same pattern I've been exploring in AI agent design: the value isn't in the AI doing the thinking, it's in the AI doing the remembering so the human can do the thinking.

One thing I'd be curious about: did Claude surface any patterns that felt wrong to you — things it flagged as "getting worse" that you knew from context were actually fine? That false-positive rate seems like the real trust calibration challenge for this kind of longitudinal analysis.

Andrii Krugliak • May 19

The "shuts up and waits for me" line is the part I'd underline. Most retro-automation tries to act. What I actually want is a read pass clean enough that I can decide in 30 seconds and close the tab.

One pushback on the action-item half. An 80% approve rate sounds healthy, but the 20% you reject is probably where the highest-leverage stuff hides. Items where the model called something "silently resolved" because someone answered the question on Slack instead of in the tracker. I ran the same workflow against a 24-month queue and the failure mode was almost always missing side-channel evidence, not bad reasoning.

What worked for me: have Claude include a confidence score per row, tied to the specific evidence sources it pulled from. If the only evidence is the action-item description itself (no Jira hit, no standup mention, nothing in Git), flag it as needs-human-context regardless of confidence. That catches the rows where the model is basically asking permission to guess.
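
A minimal sketch of that rule, assuming hypothetical names throughout (`ProposedRow`, `triage`, the `"description"` source label, and the 0.7 cutoff are illustrative, not taken from the post or from any tool's API). The key behavior is that a row whose only evidence is its own description gets routed to a human no matter how confident the model says it is.

```python
from dataclasses import dataclass, field

# Hypothetical label for evidence that is just the action item's own text.
SELF_EVIDENCE = "description"


@dataclass
class ProposedRow:
    item_id: str
    proposal: str                      # e.g. "close" or "keep open"
    confidence: float                  # model-reported score in [0, 1]
    evidence_sources: list[str] = field(default_factory=list)


def triage(row: ProposedRow, min_confidence: float = 0.7) -> str:
    """Return a review label for one proposed row."""
    external = [s for s in row.evidence_sources if s != SELF_EVIDENCE]
    if not external:
        # No Jira hit, standup mention, or Git reference: the model is
        # guessing, so the confidence score is ignored on purpose.
        return "needs-human-context"
    if row.confidence < min_confidence:
        return "low-confidence"
    return "ready-for-review"


rows = [
    ProposedRow("AI-1", "close", 0.95, ["description"]),
    ProposedRow("AI-2", "close", 0.80, ["description", "jira"]),
    ProposedRow("AI-3", "close", 0.40, ["git"]),
]
labels = [triage(row) for row in rows]
```

Here `AI-1` is labeled `needs-human-context` despite its 0.95 score, `AI-2` is `ready-for-review`, and `AI-3` is `low-confidence`. Checking evidence before confidence is the design choice that matters: it stops a high score from masking a row with no side-channel support.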

Theo Valmis • May 20

The action item staleness detection is the part that actually matters. Retros produce action items that die quietly — nobody tracks whether they were resolved or abandoned, so the same issues surface again three months later and get written down again. Running analysis across 26 weeks catches patterns that would require someone to manually read every retro note and correlate across sprints — work that's technically possible but that nobody does because it doesn't fit into any meeting slot.

The "three things I'd missed" framing is honest. The analysis doesn't need to find something novel to be useful; it just needs to surface what's already there that human attention costs too much to retrieve consistently. Most team knowledge problems aren't about missing information — they're about the labor cost of accessing information that already exists.

Mudassir Khan • May 24

the drop from 47 to 14 days at the median is the part nobody else is benchmarking. action item decay is a metrics problem dressed up as a process problem.

ran a similar weekly pass over six months of SEO content feedback. claude found two recurring reader complaints we'd quietly stopped addressing because the original ticket owner left. surfacing "items that came back" was way more useful than "items that closed".

the bit that broke for us was side channel evidence — half our work happens in slack, not the tracker, so the AI marks things resolved that aren't. how do you bias the prompt toward "needs human context" when the only evidence is the description itself?

Shahzaib • May 20

The part about "things that quietly got fixed" really hit me. We've all had that moment where we realize a problem just disappeared, and we forgot to thank the person who quietly solved it.

The 26-week scan is a clever way to surface patterns that would never show up in a single retro. You'd need to manually read months of notes to see a slow drift in code review wait times.

The "shuts up and waits" approach is also refreshing. Most retro automation tries to take action, but what I really want is someone (or something) to do the reading and give me a clean summary so I can decide.

Did you ever hit a false positive where Claude flagged something as "getting worse" that you knew from context was actually fine? I'm curious how much trust calibration was needed.