Yana Li

Posted on Jun 19

AI summaries need receipts: how I built evidence-bound reports from comments

#ai #webdev #productivity #data

Trust as the primary technical challenge

A mistake I keep running into with AI feedback tools is treating the summary as the product.

Getting a model to write a confident paragraph is no longer the hard part.

The hard part is making every useful claim traceable back to the messy source rows that produced it.

I ran into this while building a tool around YouTube comments. Before that, I spent a lot of time reading YouTube comments manually as a creator, which probably shaped how I think about this problem.

A creator, founder, or marketer does not just need "people liked the video" or "viewers want more tutorials." They need to know which comments support that claim, whether the signal came from one loud comment or a real pattern, and whether the model invented a clean story that the comments do not actually justify.

While testing the report flow, the trust question mattered more than the model question.

Not "which model writes the best report?"

More like:

Why should I trust an AI report about messy comments?

That is the technical problem this post is about.

The common mistake: summarize first, source later

The simplest AI report pipeline looks like this:

comments
-> prompt
-> summary
-> display

That can be useful for quick reading. If the goal is a private note, a rough digest, or a first-pass brainstorm, a loose summary may be enough.

But it breaks down when the output is supposed to guide action.

For example, imagine three comments:

c1: "Can you make a beginner version? I got lost halfway through."
c2: "The advanced part was useful, but I need a slower setup walkthrough."
c3: "Please share the template you used."

A reasonable summary might say:

Viewers want more beginner-friendly setup material.

That is fine.

But now imagine the generated report says:

Viewers are asking for a paid course and a downloadable starter kit.

Maybe that is a good business idea. Maybe it is not. The important part is that the comments above do not actually say it.

The report moved from evidence to interpretation without showing the bridge.

What plain summaries are good at

I do not think every AI summary needs a citation system.

Plain summaries are good when:

the reader only needs a rough orientation
the source set is small enough to inspect manually
the output is not used for a customer-facing or business decision
the model is helping with brainstorming, not evidence

The stricter requirement starts when the summary becomes a decision surface.

If a report suggests a reply idea, a content idea, a positioning change, a risk review, or a product decision, then the user should be able to ask:

Show me the comments behind this.

If the system cannot answer that, the report may still be useful, but the reader has no way to inspect it.

A better unit: the evidence-bound claim

The shape I prefer is not "summary first."

It is closer to:

source rows
-> candidate claims
-> evidence binding
-> validation
-> report sections

At the data level, the basic object is boring:

type EvidenceBoundClaim = {
  title: string;
  summary: string;
  evidence_comment_ids: string[];
};

That one small field, evidence_comment_ids, changes the product contract.

The claim is not just text. It is text plus a list of source comments that the user can inspect.
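As a minimal sketch of what "text plus inspectable sources" can look like, here is the earlier beginner-setup summary expressed as an EvidenceBoundClaim. The claim title, the savedComments map, and the evidenceFor helper are hypothetical names used only for illustration.

```typescript
type EvidenceBoundClaim = {
  title: string;
  summary: string;
  evidence_comment_ids: string[];
};

// Saved source rows keyed by comment ID (the three example comments from earlier).
const savedComments = new Map<string, string>([
  ["c1", "Can you make a beginner version? I got lost halfway through."],
  ["c2", "The advanced part was useful, but I need a slower setup walkthrough."],
  ["c3", "Please share the template you used."],
]);

// Hypothetical claim: the title is made up, the summary is the one from the example.
const claim: EvidenceBoundClaim = {
  title: "Beginner-friendly setup",
  summary: "Viewers want more beginner-friendly setup material.",
  evidence_comment_ids: ["c1", "c2"],
};

// Return the saved text behind each cited ID; IDs missing from the saved rows are skipped.
function evidenceFor(c: EvidenceBoundClaim, rows: Map<string, string>): string[] {
  const texts: string[] = [];
  for (const id of c.evidence_comment_ids) {
    const text = rows.get(id);
    if (text !== undefined) texts.push(text);
  }
  return texts;
}

console.log(evidenceFor(claim, savedComments));
```

Running this prints the text of c1 and c2. A claim like "viewers are asking for a paid course" would have no IDs to cite from these rows, so there would be nothing to show the reader.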

In a comment report, the same pattern can apply to:

repeated questions
demand signals
objections
praise
confusion
risk signals
content ideas
reply ideas

The report can still be written in normal language. It just cannot float away from the comments.

Why messy feedback needs stricter binding

YouTube comments are not clean survey answers.

They include jokes, sarcasm, spam, repeated questions, one-word reactions, language mixing, replies to replies, creator-specific context, and comments that are useful only because of where they appear in a thread.

That creates several failure modes.

One comment becomes a pattern

A model sees one strong complaint and writes it as if the audience broadly agrees.

Evidence binding does not solve this by itself, but it makes the weakness visible. If a "major concern" has one evidence row, the user can judge it differently from a concern backed by twenty comments.
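One way to surface that difference is to label each claim by how many distinct comments back it. This is only a sketch: the SupportLevel names and the default threshold of 5 are assumptions, not values from a real product.

```typescript
// Hypothetical labels for how much evidence sits behind a claim.
type SupportLevel = "none" | "single_comment" | "few_comments" | "repeated_pattern";

// Classify a claim by the number of distinct cited comment IDs.
// patternThreshold is an assumed cutoff; tune it for the size of the source set.
function supportLevel(evidenceCommentIds: string[], patternThreshold = 5): SupportLevel {
  const distinct = new Set(evidenceCommentIds).size; // duplicates count once
  if (distinct === 0) return "none";
  if (distinct === 1) return "single_comment";
  return distinct >= patternThreshold ? "repeated_pattern" : "few_comments";
}
```

The label does not decide whether a claim is true. It only lets the report wording and the UI distinguish one loud comment from a repeated pattern.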

A pattern loses its source

The model correctly detects that many people are confused, but the report does not show which comments created that impression.

That makes the report hard to use. The creator cannot quote the comments, answer the right thread, or decide whether the confusion is about the video, the product, the title, or the viewer's prior knowledge.

Multi-source reports mix context

If the input includes multiple videos, a playlist, a channel, or a URL list, the model can accidentally blend sources.

That is why source metadata matters. A compact shape like this is enough:

type CommentForAnalysis = {
  comment_id: string;
  text: string;
  source_key?: string;
};

Then source context can be sent once, while each comment carries the source key it belongs to.

The guardrail is simple:

Do not claim source-level differences unless the cited evidence IDs belong to the source_key being described.

Without that rule, a report can say "Video A has more pricing objections than Video B" when the cited comments do not actually support the comparison.
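A sketch of how that guardrail could be checked after the model responds, assuming each saved row carries its source_key. The helper names are hypothetical, and requiring at least one cited row per named source is a deliberately conservative assumption, not the only reasonable rule.

```typescript
type CommentForAnalysis = {
  comment_id: string;
  text: string;
  source_key?: string;
};

// Count the distinct cited comments per source_key, using only saved rows.
function evidenceCountBySource(
  evidenceIds: string[],
  rows: CommentForAnalysis[],
): Map<string, number> {
  const sourceById = new Map<string, string>();
  for (const row of rows) {
    if (row.source_key !== undefined) sourceById.set(row.comment_id, row.source_key);
  }
  const counts = new Map<string, number>();
  const seen = new Set<string>();
  for (const id of evidenceIds) {
    if (seen.has(id)) continue; // count each cited comment once
    seen.add(id);
    const key = sourceById.get(id);
    if (key === undefined) continue; // unknown ID, or a row with no source_key
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  return counts;
}

// List the sources named in a comparison that have no cited evidence at all.
function sourcesWithoutEvidence(
  sourceKeys: string[],
  evidenceIds: string[],
  rows: CommentForAnalysis[],
): string[] {
  const counts = evidenceCountBySource(evidenceIds, rows);
  return sourceKeys.filter((key) => !counts.has(key));
}
```

If a claim compares two videos and sourcesWithoutEvidence returns one of them, the comparison can be dropped or downgraded to a statement about the single source that does have cited rows.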

The pipeline I trust more

The pipeline I want for this kind of product looks like this:

public comment rows
-> stable comment IDs
-> optional source map
-> AI analysis
-> deterministic semantic snapshot
-> evidence ID validation
-> report trust gate
-> cited report, export, or share page

In my implementation, the report is generated from a saved comment snapshot, not from whatever YouTube happens to return later. Once comments are saved, the analysis pass works against those saved source rows, and the report stores a deterministic semantic_snapshot with evidence_comment_ids on the claims that need support.

Before a claim becomes visible evidence, those IDs are resolved back against the saved snapshot. If an ID does not resolve, it cannot become one of the evidence examples the reader can inspect.
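The resolution step can be sketched roughly like this. It assumes the saved snapshot is an array of rows with a comment_id and text; SavedComment, ResolvedEvidence, and resolveEvidence are illustrative names, not the actual implementation.

```typescript
type SavedComment = { comment_id: string; text: string };

type ResolvedEvidence = {
  resolved: SavedComment[]; // rows that can be shown as evidence examples
  unresolved_ids: string[]; // IDs the model returned that match no saved row
};

// Resolve model-provided evidence IDs against the saved snapshot.
function resolveEvidence(evidenceIds: string[], snapshot: SavedComment[]): ResolvedEvidence {
  const byId = new Map<string, SavedComment>();
  for (const row of snapshot) byId.set(row.comment_id, row);

  const resolved: SavedComment[] = [];
  const unresolved_ids: string[] = [];
  const seen = new Set<string>();
  for (const id of evidenceIds) {
    if (seen.has(id)) continue; // ignore repeated IDs
    seen.add(id);
    const row = byId.get(id);
    if (row !== undefined) resolved.push(row);
    else unresolved_ids.push(id);
  }
  return { resolved, unresolved_ids };
}
```

Only the resolved rows are eligible for display. A non-empty unresolved_ids list is a useful signal in its own right, since it means the model cited something that is not in the snapshot.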

For multi-video inputs, each row can carry a compact source_key. The analysis prompt explicitly tells the model not to claim source-level differences unless the evidence IDs support that key.

The important product decision is where to be strict.

The system can let the model help with language, grouping, and interpretation.

But it should be strict about the things the model is not allowed to invent:

comment IDs
source keys
exact source excerpts
sentiment totals
analyzed row counts
whether a claim has enough support
whether a report is ready to export or share

In other words, the model can propose the story.

The system should verify the receipts.
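For the numeric items in that list, one option is to never ask the model for totals at all. The sketch below assumes the analysis step yields a per-comment sentiment label and lets plain code do the counting; LabeledComment and sentimentTotals are hypothetical names.

```typescript
type Sentiment = "positive" | "neutral" | "negative";

// Assumed shape: one sentiment label per saved comment.
type LabeledComment = { comment_id: string; sentiment: Sentiment };

// Derive row counts and sentiment totals from the labeled rows themselves,
// so the totals are always consistent with the rows that were analyzed.
function sentimentTotals(rows: LabeledComment[]): {
  comments_analyzed: number;
  sentiment: Record<Sentiment, number>;
} {
  const sentiment: Record<Sentiment, number> = { positive: 0, neutral: 0, negative: 0 };
  for (const row of rows) sentiment[row.sentiment] += 1;
  return { comments_analyzed: rows.length, sentiment };
}
```

Because the totals are computed from the rows, they sum to comments_analyzed by construction instead of by the model's arithmetic.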

Validation before calling a report ready

For feedback reports, I would want checks like these before the output is treated as ready:

comments_analyzed > 0
sentiment counts sum to comments_analyzed
every evidence_comment_id resolves against saved source rows
quoted examples are checked against the saved source snapshot
source-level comparisons are backed by source_key evidence
recommended actions include evidence IDs
export/share paths are blocked until the report trust gate passes
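Most of these checks can be sketched as one function that returns a list of failures. The Report and Claim shapes below are assumptions made for the example (only evidence_comment_ids and comments_analyzed appear in this post), and the source_key comparison check is left out to keep it short.

```typescript
type SavedComment = { comment_id: string; text: string };

// Hypothetical report shape for illustration.
type Claim = {
  title: string;
  evidence_comment_ids: string[];
  quotes?: { comment_id: string; excerpt: string }[];
};

type Report = {
  comments_analyzed: number;
  sentiment: { positive: number; neutral: number; negative: number };
  claims: Claim[];
  recommended_actions: Claim[];
};

// Return every failed check; an empty list means the trust gate passes.
function trustGateFailures(report: Report, snapshot: SavedComment[]): string[] {
  const failures: string[] = [];
  const textById = new Map<string, string>();
  for (const row of snapshot) textById.set(row.comment_id, row.text);

  if (report.comments_analyzed <= 0) failures.push("no comments analyzed");

  const s = report.sentiment;
  if (s.positive + s.neutral + s.negative !== report.comments_analyzed) {
    failures.push("sentiment counts do not sum to comments_analyzed");
  }

  for (const claim of report.claims.concat(report.recommended_actions)) {
    // Every cited ID must resolve against the saved rows.
    for (const id of claim.evidence_comment_ids) {
      if (!textById.has(id)) failures.push(`"${claim.title}" cites unknown comment ${id}`);
    }
    // Quoted excerpts must appear verbatim in the saved comment text.
    for (const quote of claim.quotes ?? []) {
      const text = textById.get(quote.comment_id);
      if (text === undefined || text.indexOf(quote.excerpt) === -1) {
        failures.push(`"${claim.title}" has a quote not found in ${quote.comment_id}`);
      }
    }
  }

  for (const action of report.recommended_actions) {
    if (action.evidence_comment_ids.length === 0) {
      failures.push(`action "${action.title}" has no evidence IDs`);
    }
  }
  return failures;
}

// Export and share paths stay blocked until there are no failures.
function canExport(report: Report, snapshot: SavedComment[]): boolean {
  return trustGateFailures(report, snapshot).length === 0;
}
```

Returning the failures instead of a bare boolean makes it easier to log why a report was held back and to choose what to show the user instead.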

Some of these checks are easy. Some are annoying. All of them make the product less magical in a useful way.

The goal is not to make the report sound more confident.

The goal is to prevent unsupported confidence from reaching the user.

What to show users when evidence is imperfect

This is where product design matters as much as backend validation.

If evidence is thin, I do not want the user-facing report to say:

Low confidence, but here is a polished recommendation anyway.

That teaches people to ignore the warning.

I prefer one of three outcomes:

Keep the report in a verifying or processing state.
Generate a conservative fallback report with smaller claims.
Show a stable source or account blocker if the data is not usable.
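These outcomes can be modeled as explicit states so the UI never has to pair a warning with a polished recommendation. The sketch below is an assumption about how that might look; the state names, the GateInput fields, and the added "ready" state for the normal case are all hypothetical.

```typescript
// The three outcomes above, plus "ready" for a report that passed verification.
type ReportOutcome =
  | { state: "verifying" }
  | { state: "fallback"; note: string }
  | { state: "blocked"; reason: string }
  | { state: "ready" };

// Assumed inputs to the decision.
type GateInput = {
  sourceUsable: boolean; // false if the source or account cannot provide usable data
  analysisFinished: boolean; // false while analysis or verification is still running
  evidenceThin: boolean; // true if claims lack enough resolved evidence
};

function decideOutcome(input: GateInput): ReportOutcome {
  if (!input.sourceUsable) {
    return { state: "blocked", reason: "source data is not usable" };
  }
  if (!input.analysisFinished) {
    return { state: "verifying" };
  }
  if (input.evidenceThin) {
    return { state: "fallback", note: "conservative report with smaller claims" };
  }
  return { state: "ready" };
}
```

There is deliberately no state for "low confidence but published anyway": thin evidence leads to smaller claims, not to a caveat attached to strong ones.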

For a completed report, the copy should describe what is actually verified:

saved comments
analyzed sample
thread boundary
evidence rows
selected limits

That is different from promising complete coverage of every comment that ever existed.

Deleted, hidden, private, rejected, edited, unavailable, or API-limited comments can still be outside the boundary. A good report should explain its data boundary instead of pretending the boundary does not exist.

When this approach is not right

Evidence-bound reporting is not always worth the extra structure.

Use a looser summary when:

the output is only a private reading aid
the source set is small
the user will inspect every source row anyway
the goal is brainstorming, not decision support

Use evidence-bound reports when:

the output recommends an action
multiple stakeholders will read the report
the report may be exported, shared, or used later
the source data is messy enough that hallucinated certainty is dangerous
users need to audit why the system reached a conclusion

The boundary keeps the tool honest.

How I'm applying this

I am applying this to public YouTube comments in an AudienceCue sample report.

The narrow product idea is:

paste a public YouTube link
-> download comments
-> generate an audience report
-> inspect the comments behind the claims

It is read-only. It does not reply to YouTube comments, moderate a channel, delete anything, pin anything, or take action on behalf of the creator.

That read-only boundary is intentional. For now, I would rather make the evidence layer trustworthy than rush into automation.

Checklist for builders

If you are building AI tools that summarize messy feedback, these are the questions I would ask:

Does every important claim point back to source rows?
Can the system detect invented or missing evidence IDs?
Are quoted examples checked against the saved source snapshot, or are they model paraphrases?
Can users tell the difference between one loud comment and a repeated pattern?
Does the report preserve source context when there are multiple videos, files, or accounts?
Are export and share actions blocked until the evidence gate passes?
When evidence is weak, does the product reduce claim strength instead of hiding the weakness behind confident copy?

The last one matters most to me.

AI summaries are easy to make impressive. Evidence-bound summaries are harder, but they are easier to trust.

I am curious how other people handle this in production systems: do you use strict citations, approximate references, or human review when AI summarizes messy user feedback?

Top comments (12)

Mykola Kondratiuk • Jun 28

this is the gap that kills AI-generated PM reports. the summary sounds authoritative but you can't drill down when a stakeholder pushes back. requiring source rows inline doubles the output length but makes it defensible.

Cophy Origin • Jun 20

This maps precisely to something I've been wrestling with in building Cophy — a persistent AI agent with long-term memory. The "summarize first, source later" trap shows up in memory systems too: it's easy to store a confident-sounding summary of past events, but when the agent later cites that summary as fact, the original evidence is gone. We ended up requiring that every memory write either links back to the source conversation/file or is explicitly tagged as "model-inferred, unverified." Your EvidenceBoundClaim structure is basically what we want for the memory layer — not just a narrative, but a claim with traceable evidence_ids. The moment you separate "what happened" from "what we concluded from what happened," the whole system becomes more auditable. Thanks for articulating this so clearly — I'm going to borrow the framing.

Yana Li • Jun 20

Thanks, this really means a lot.
And yes, this is very close to what we’re trying to do with AudienceCue: not just generate a polished summary, but make it clear what the original comments actually support and where the model is interpreting.
I’m really glad the framing was useful for your memory work. The “source-backed vs model-inferred” boundary is exactly the kind of line I think more AI products need.

Kartik N V J K • Jun 19

The shift from "which model writes the best report" to "why should I trust this claim" is the one I wish more teams made earlier. The detail I'd push on is separating coverage from faithfulness: a claim can cite a real comment and still misrepresent the overall distribution, so I check whether the cited rows actually support the strength of the claim, not just that a citation exists. Have you tried scoring how often a "pattern" is really one loud comment dressed up as consensus?

Yana Li • Jun 20

Yes, this is exactly the distinction I’m trying to make.

In AudienceCue, I don’t treat a citation as enough by itself. The report also keeps track of how many comments were saved or analyzed, how many source rows support a section, and whether the report is working from the full saved set or a plan-sized sample.

So one strong comment can be useful as an example, but it should not automatically become “users are saying.” The wording has to match the strength of the evidence.

What I still want to improve is making that pattern strength more explicit: not just “this claim has evidence,” but “this claim is supported by enough repeated evidence to deserve stronger language.” That’s the next layer I care about.

Mudassir Khan • Jun 22

"whether the model invented a clean story that the comments don't actually justify" is the real failure mode, and it's harder to catch than hallucinations because the invented story is plausible.

we hit this building a RAG reporting layer. model was technically citing sources — it just had a habit of merging two different user complaints into one "pattern" that neither comment actually said. citations were real, the synthesis was fiction.

ended up adding a mandatory diff step: model generates the claim, then separately lists the exact quotes supporting it. if the quotes don't contain the words used in the claim, flag for review.

does your citation pointer solve the trust problem for most users, or do you still see them asking "but does this comment really mean that?"

meow.hair • Jun 19

Dear Yana,

I read your article, and I genuinely learned something new.

The idea of binding every claim to evidence IDs is not just clever — it's the missing piece in most AI tools I've seen. I've always struggled with making AI summaries trustworthy, and your post offers a clear path forward.

You didn't just point out a problem. You offered a solution that can actually be implemented.

And that is rare.

Thank you for building this — and for sharing it with us.

I wish you continued progress and success.
May you remain creative and successful.
🌊🧊🏔️🍃

Yana Li • Jun 20

Thank you so much, this is very kind.

I’m really glad the idea was useful. That was exactly what I hoped to share: not just “AI can summarize,” but how we can make the summary easier to trust when people need to make real decisions from it.

I appreciate you taking the time to read it.

mubashiriqbal1162-a11y • Jun 20

AI summaries should cite original comments as evidence. I built a system that links every claim to source snippets, timestamps, and authors, producing traceable reports. This prevents hallucination, improves trust, and allows readers to verify conclusions directly from discussions

Yana Li • Jun 20

Yes, that traceability is the core idea for me too.
I think the useful part is not just adding citations after the summary, but designing the report so the claim, source snippet, and original context stay connected from the beginning.

Ryan Callahan • Jun 19

traceable to real evidence is very useful nowadays

Yana Li • Jun 20

Exactly. That traceability is the part I care about most: if a summary is going to influence a decision, people should be able to check the evidence behind it.
Thanks for reading.
