We ran the same five meetings through Granola, Fireflies, and Otter to find out which AI notes tool actually captures what was said, and whether the summary it produces afterward matches the conversation. Here is the methodology, the failure modes we saw, and the honest answer to "which one should you use?"
How we ran the benchmark
We picked five recordings that stress different parts of an ASR pipeline:
- A four-person engineering standup (15 min, accented speakers, technical jargon)
- A two-person sales call (30 min, clean audio, one speaker on a headset, one on speakerphone)
- A founder podcast clip (20 min, studio audio, single speaker)
- A noisy coffee-shop pair conversation (10 min)
- A six-person product review with heavily overlapping speech (45 min)
For each recording we measured three things:
- Word Error Rate (WER) against a human-corrected reference transcript
- Speaker attribution accuracy — the fraction of utterances tagged to the correct speaker
- Summary fidelity — whether the AI recap preserved the actual decisions and action items, judged blind by two reviewers against the human transcript
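For readers who want to reproduce the first two measurements, here is a minimal sketch of how they can be computed. It assumes plain-text transcripts and per-utterance speaker labels that are already aligned one-to-one; the function names and the tokenization rule (lowercase, punctuation dropped) are our illustrative choices, not the exact scoring script behind the figures in this post.

```python
import re


def tokenize(text: str) -> list[str]:
    # Assumption: lowercase and drop punctuation so "Flag." matches "flag".
    return re.findall(r"[a-z0-9']+", text.lower())


def word_error_rate(reference: str, hypothesis: str) -> float:
    """(substitutions + deletions + insertions) / number of reference words."""
    ref = tokenize(reference)
    hyp = tokenize(hypothesis)
    if not ref:
        raise ValueError("reference transcript is empty")

    # Word-level Levenshtein distance, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    # Can exceed 1.0 when the hypothesis contains many extra words.
    return prev[-1] / len(ref)


def speaker_attribution_accuracy(
    reference_speakers: list[str], predicted_speakers: list[str]
) -> float:
    """Fraction of utterances tagged with the correct speaker."""
    if len(reference_speakers) != len(predicted_speakers):
        raise ValueError("utterance lists must be aligned one-to-one")
    if not reference_speakers:
        raise ValueError("no utterances to score")
    correct = sum(
        ref == pred for ref, pred in zip(reference_speakers, predicted_speakers)
    )
    return correct / len(reference_speakers)
```

For example, `word_error_rate("ship behind a flag", "ship behind the flag")` counts one substitution against four reference words. Aligning utterances between a tool's output and the human reference is the hard part in practice, and this sketch deliberately leaves it out.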
We deliberately avoided round-number marketing claims. The figures below come from this five-recording sample. A larger sample will move the absolute numbers; what matters is the relative ordering and the shape of each tool's failure mode.
WER and summary fidelity measure different things. A tool can have a low WER but still hallucinate action items, and a tool with a higher WER can still produce a useful summary if its model knows when to abstain. Weigh both measures before deciding which one to deploy across your team.
What we observed
Granola
Granola records audio locally on your machine and pairs the transcript with notes you type during the meeting. It does not try to do speaker diarization — there is no bot in the call and your laptop hears a single mixed stream. That choice is the whole point of the product: you skip Zoom invitations and you do not surprise anyone with a third-party recorder.
Summary quality on small meetings (two to four people) was the strongest of the three in our sample, because Granola anchors its output to the bullets you typed during the call. When you write "decision: ship behind a flag," that line ends up in the recap with surrounding context attached. The trade-off is that it relies on you doing the typing — Granola without your live notes drifted toward generic recaps that were technically accurate but operationally useless.
Fireflies
Fireflies sends a bot to the call (Zoom, Meet, Teams), which means it gets clean, separate audio streams when the platform exposes them. Speaker attribution accuracy in our sample landed in the 82–88% range: solid, but not perfect on overlapping speech, and noticeably worse on the noisy coffee-shop recording.
Where Fireflies wobbled was summary over-generation. It is eager to extract action items, and roughly one in five "action items" in our sample was something that had been discussed and rejected, or said hypothetically. If you read the recap without scrolling back to the transcript, you can ship the wrong work. The transcript itself was reliable; the interpretation layer on top of it was where errors crept in.
Otter
Otter's transcription engine has been around the longest of the three, and it shows in the WER. On clean audio it was the most consistent. Speaker attribution was also the better of the two bot-based tools, including in the six-person product review, where overlapping speech makes most tools lose track.
The summary, however, tended to merge multiple points into one bullet. In that same product review, three distinct decisions about onboarding became one line that read "team aligned on onboarding changes" — true in the trivial sense, useless in practice. You lose specificity exactly when the recap matters most.
None of the three tools should be your only record of a meeting where a decision is contested. Always preserve the transcript, not just the AI summary, if you expect to argue later about what was agreed.
Which one you should pick
The honest framing is not "which is best" but "which failure mode can you tolerate."
Pick Granola if your meetings are small, you take notes during them, and you do not want a bot joining the call. The summaries lean on your own thinking, which is both their strength and their ceiling.
Pick Fireflies if you need a bot, you want CRM hooks (it integrates with most of the popular ones), and you have the discipline to read the transcript before trusting the action items it pulled out.
Pick Otter if transcription accuracy is your primary need — interviews, podcasts, anything legally adjacent — and you treat the summary as a rough index into the transcript rather than a decision log.
If you are going to take any AI summary and drop it into a knowledge base, the destination matters as much as the source. We have seen teams pipe these recaps into structured databases with explicit properties for decision, owner, and due date. That schema forces the AI's output through a shape that catches roughly half the hallucinations on the way in.
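A minimal sketch of that kind of gate, assuming the recap has already been parsed into dictionaries with `decision`, `owner`, and `due_date` keys. The class and function names are hypothetical, and the ISO date format is our assumption rather than anything a particular tool emits.

```python
from dataclasses import dataclass
from datetime import date

REQUIRED_FIELDS = ("decision", "owner", "due_date")


@dataclass(frozen=True)
class ActionItem:
    decision: str
    owner: str
    due_date: date


def parse_action_item(raw: dict) -> ActionItem:
    """Build an ActionItem, or raise ValueError if a field is missing or malformed."""
    for field in REQUIRED_FIELDS:
        value = raw.get(field)
        if not isinstance(value, str) or not value.strip():
            raise ValueError(f"missing or empty field: {field}")
    return ActionItem(
        decision=raw["decision"].strip(),
        owner=raw["owner"].strip(),
        # Assumption: dates arrive as ISO strings such as "2025-03-01".
        due_date=date.fromisoformat(raw["due_date"].strip()),
    )


def split_valid(items: list[dict]) -> tuple[list[ActionItem], list[tuple[dict, str]]]:
    """Separate items that fit the schema from items a human should review."""
    accepted: list[ActionItem] = []
    rejected: list[tuple[dict, str]] = []
    for raw in items:
        try:
            accepted.append(parse_action_item(raw))
        except ValueError as exc:
            rejected.append((raw, str(exc)))
    return accepted, rejected
```

Rejected items go to a review queue instead of the knowledge base. A check like this only catches items that lack a concrete owner or date; a confidently worded hallucination with all three fields filled in will still pass, so it complements reading the transcript and does not replace it.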
Originally published at pickuma.com. Subscribe to the RSS or follow @pickuma.bsky.social for new reviews.