Two opposite designs for AI meeting notes: transcribe everything vs enhance what you typed

#ai #productivity #machinelearning #ux

I ran the same meeting through two AI notetakers, Otter and Granola, expecting to compare accuracy. The accuracy was close. What actually separated them was something more interesting: they aren't built to do the same job. They rest on opposite models of what a meeting "note" even is, and once you see the two designs, the "which is better" question dissolves into "which model fits your workflow."

Two models

Strip away the branding and you get two functions. This is conceptual — I'm describing the designs, not either company's internal code:

# Model A — transcribe everything, then summarize  (Otter)
notes = summarize(transcribe(audio))

# Model B — enhance what you typed  (Granola)
notes = enhance(user_bullets, transcribe(audio))

Model A captures the whole call verbatim, labels who spoke, and then runs an extractive summary over that complete record. The transcript is the primary artifact; the summary is derived from it. You do nothing during the meeting; the tool records everything.

Model B inverts the inputs. You jot a few rough bullets during the call, and the tool merges your notes with what it heard — using your fragments as a seed and filling in the specifics from the audio. The finished summary is the primary artifact; the raw transcript is secondary (Granola even deletes the audio once it has the text).

The difference isn't cosmetic. The two models take different inputs, so they produce structurally different outputs and fail in different ways.
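
To make the difference in inputs concrete, here is a minimal Python sketch of the two shapes. Everything in it is hypothetical: `transcribe`, `summarize`, and `enhance` are toy stand-ins I made up, not Otter's or Granola's code, and the "audio" is just pre-segmented text so the example runs without a speech model.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    speaker: str
    text: str


def transcribe(audio: list[tuple[str, str]]) -> list[Utterance]:
    """Toy stand-in for speech-to-text: 'audio' is already (speaker, text) pairs."""
    return [Utterance(speaker, text) for speaker, text in audio]


def summarize(transcript: list[Utterance], max_lines: int = 2) -> list[str]:
    """Model A: sees only the transcript, so it has no idea what mattered.

    This toy version simply keeps the first few utterances.
    """
    return [f"{u.speaker}: {u.text}" for u in transcript[:max_lines]]


def enhance(user_bullets: list[str], transcript: list[Utterance]) -> list[str]:
    """Model B: each typed bullet is a seed; attach utterances sharing a word with it."""
    notes = []
    for bullet in user_bullets:
        seed_words = set(bullet.lower().split())
        details = [u.text for u in transcript
                   if seed_words & set(u.text.lower().split())]
        notes.append(f"{bullet} -> {' '.join(details)}" if details else bullet)
    return notes


audio = [
    ("Ana", "welcome everyone"),
    ("Ben", "thanks for joining"),
    ("Ana", "latency spiked last week"),
]
notes_a = summarize(transcribe(audio))             # Model A: one input
notes_b = enhance(["latency"], transcribe(audio))  # Model B: two inputs
```

Even in this toy, the cold summary keeps the greetings because nothing tells it otherwise, while the seeded version goes straight to the latency line. Note that `enhance` returns nothing at all when given no bullets; a real tool would presumably fall back to something closer to `summarize`.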

What that looked like on one meeting

I fed both an 80-second, two-speaker product meeting (synthetic voices, so I knew the exact right answer) and, for Granola, typed six sloppy bullets while it listened — fragments like "get API latency under 200ms, file P1."

Otter (Model A) produced the more literal, verbatim transcript and labeled both speakers cleanly. It was a complete, attributed record — though it dropped the quarter off one figure, turning "Q3" into "Q." If the word-for-word record is what you need, this is it.
Granola (Model B) garbled a line in its raw transcript, but that's not what it's for. Its enhanced summary — my six bullets fused with what it heard — was the most complete and usable write-up either tool produced. It pulled in specifics my notes left out (an activation jump, a flagged latency spike, an owner to loop in) and organized them around the things I'd already signaled mattered by typing them.

Same meeting, opposite artifacts: one a faithful transcript, the other a send-ready recap.

Why the input model decides the output

Here's the part worth internalizing if you build with LLMs at all. Model B tends to produce more useful notes not because it runs a stronger language model, but because it has a better input: your relevance signal. When you type "file P1," you've told the system this mattered, a piece of human judgment that an extractive summarizer over a raw transcript simply doesn't have. The augmentation model gets to condition on what a participant already decided was important.

That also predicts its failure mode exactly: type nothing and Model B collapses toward Model A. Granola's whole edge assumes you take notes; give it no bullets and you lose the augmentation that makes it special. Model A has no such dependency — it records everything whether you participate or not, which is precisely why it's the safer default for a meeting you can't also take notes in.
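
A toy illustration of that dependency, in Python. This is a sketch under assumptions, not how Granola works: `pick_lines` is a hypothetical selector that ranks transcript lines by word overlap with the typed bullets. With no bullets it has nothing to rank on, so it falls back to the cold, position-based choice a transcript-only summarizer might make.

```python
def tokens(text: str) -> set[str]:
    """Lowercased words with surrounding punctuation stripped."""
    return {w.strip(".,!?\"'").lower() for w in text.split()} - {""}


def pick_lines(transcript: list[str], bullets: list[str], k: int = 2) -> list[str]:
    """Return up to k transcript lines, in meeting order.

    With bullets: rank lines by how many words they share with the bullets.
    Without bullets: no relevance signal, so fall back to the first k lines.
    """
    signal = set().union(*(tokens(b) for b in bullets)) if bullets else set()
    if not signal:
        return transcript[:k]
    scored = [(len(tokens(line) & signal), i) for i, line in enumerate(transcript)]
    # Highest overlap first; ties broken by earlier position.
    top = sorted(scored, key=lambda s: (-s[0], s[1]))[:k]
    # Drop zero-overlap lines and restore meeting order.
    keep = sorted(i for score, i in top if score > 0)
    return [transcript[i] for i in keep]


transcript = [
    "Thanks everyone for joining.",
    "Let's start with the roadmap.",
    "API latency spiked on Tuesday.",
    "We should file a P1 for that.",
]
with_signal = pick_lines(transcript, ["get API latency under 200ms, file P1"])
no_signal = pick_lines(transcript, [])
```

With the bullet, the selector lands on the latency and P1 lines; with an empty list it returns the two opening pleasantries. Same function, same transcript, and the only thing that changed is whether a human said what mattered.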

So the trade is structural, not a quality ranking:

Model A (transcribe-everything) gives you a verbatim, speaker-labeled, searchable archive, and keeps the audio for playback. The transcript itself is a deliverable. Best when the record matters — exact quotes, who-said-what, a meeting you need to reconstruct later.
Model B (enhance-your-notes) gives you a finished recap seeded by your own judgment, plus an ask-anything memory across your notes — but it deletes the audio and assumes you're an active participant who jots. Best when the output matters and you're in the room thinking.

(Audio capture — whether a bot joins the call or it records your system audio — is a separate axis from note generation; I'm only talking about the latter here.)

The decision, and the build lesson

For choosing a tool, the framing that actually helps isn't "which is more accurate." Both are fine on clean audio. It's "do I want a record or a recap?" A record you can search and quote → the transcribe-everything model. A finished write-up you'll forward without editing → the enhance-your-notes model. I put the full head-to-head — pricing, privacy postures, and which one wins for which kind of meeting — in this tested comparison of Otter and Granola, but the model question is the one to settle first, because it's upstream of every feature.

And the lesson for anyone building AI summarization: the cheapest way to make a summary feel smarter is often to capture the user's relevance signal, not to upgrade the model. A few human-typed bullets as a conditioning input can beat a larger model summarizing cold, because the human already solved the hardest part: deciding what mattered. Design for that input and your "dumb" summarizer punches well above its weight.
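
One way to design for that input, sketched in Python. The function name, prompt wording, and section labels are all my own assumptions for illustration. No particular LLM API is implied; the function only builds the text you would send to whatever model you use.

```python
def build_summary_prompt(transcript: str, user_bullets: list[str]) -> str:
    """Assemble a summarization prompt that puts the user's bullets first."""
    # Ignore blank bullets so stray whitespace doesn't count as signal.
    bullets = [b.strip() for b in user_bullets if b.strip()]
    parts = []
    if bullets:
        parts.append("The participant flagged these points as important:")
        parts.extend(f"- {b}" for b in bullets)
        parts.append("Organize the summary around them and fill in specifics "
                     "from the transcript.")
    else:
        # No relevance signal: fall back to a generic, cold instruction.
        parts.append("Summarize the key decisions and action items.")
    parts.append("Transcript:")
    parts.append(transcript.strip())
    return "\n".join(parts)


prompt = build_summary_prompt(
    "Ana: API latency spiked on Tuesday.\nBen: Let's file a P1.",
    ["get API latency under 200ms, file P1", "  "],
)
```

The design choice is simply ordering and framing: the bullets go ahead of the transcript and the instruction tells the model to organize around them, so the human's judgment steers the summary instead of competing with every other line of the call.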

If you've built a summarizer that captures relevance signal some other way — reactions, highlights, edits-as-feedback — I'd like to hear how in the comments.