I've been playing with a small browser-side prototype that tries to answer a boring but increasingly important question:
Where did this piece of web content come from?
Not “is this AI-generated?” I think that question is already getting messy. Detectors can be useful sometimes, but I would not build trust on top of them alone.
The simpler questions are: When was this page first seen? Who saw it?
Did it change? Can we still replay what was captured?
While building this, I kept running into the same problem. There does not seem to be a simple, standard object that explains the evidence around a page in a way normal developers can inspect. (If there is one, please let me know.)
So I started thinking about it as a provenance card.
Not a magic truth label. Just a boring evidence card.
First rule: don't claim truth
The card should never say:
This is true.
That is not what provenance means.
A page can be real and still be wrong. A quote can be old and still be used in a dishonest way. A photo can be authentic and still be shown in the wrong context.
My focus is more on authenticity: is it human-made content, and where did it come from? For example, I often use AI for text correction and proofreading, but the content and the thinking are mine.
Also, I am from the Netherlands, so English is not my native language, even though I have been married to my American wife for 15 years and we speak English at home with our kids. I also travel regularly between California and Belgium, so my writing naturally sits somewhere between different languages and places.
So I try to split the problem into four parts.
First: did this thing appear somewhere?
Second: when was it seen or captured?
Third: is the captured version still the same?
Fourth: is the actual claim correct?
A provenance card can help with the first three. The fourth one is a different job. Mixing them together is how you end up with a green “verified” badge on something that still misleads people.
A minimal version
The first version could be something simple like this:
{
  "url": "https://example.com/article",
  "canonical_url": "https://example.com/article",
  "first_seen": "2024-03-12T10:22:00Z",
  "last_seen": "2026-07-01T18:04:00Z",
  "live_status": "available",
  "archive_witnesses": ["wayback", "common_crawl"],
  "capture_integrity": "hash_available",
  "replay_status": "partial",
  "confidence": "medium",
  "warnings": [
    "dynamic_content_detected",
    "single_independent_archive_witness"
  ]
}
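As a rough sketch, the same card could be loaded into a typed structure that rejects obviously contradictory values. The class name `ProvenanceCard`, the helper `load_card`, and the allowed value sets are assumptions for illustration; only the field names come from the JSON above, and none of this is a published schema.

```python
import json
from dataclasses import dataclass, field
from datetime import datetime
from typing import List

# Assumed vocabularies: only "available" and "medium" appear in the example card;
# the other values are placeholders, not part of any published schema.
LIVE_STATUSES = {"available", "unavailable", "unknown"}
CONFIDENCE_LEVELS = {"low", "medium", "high"}


def parse_utc(value: str) -> datetime:
    """Parse an ISO 8601 timestamp ending in 'Z' into an aware UTC datetime."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


@dataclass
class ProvenanceCard:
    url: str
    canonical_url: str
    first_seen: datetime
    last_seen: datetime
    live_status: str
    archive_witnesses: List[str]
    capture_integrity: str
    replay_status: str
    confidence: str
    warnings: List[str] = field(default_factory=list)

    def validate(self) -> None:
        """Raise ValueError if the card contradicts itself."""
        if self.first_seen > self.last_seen:
            raise ValueError("first_seen is later than last_seen")
        if self.live_status not in LIVE_STATUSES:
            raise ValueError(f"unknown live_status: {self.live_status}")
        if self.confidence not in CONFIDENCE_LEVELS:
            raise ValueError(f"unknown confidence: {self.confidence}")


def load_card(text: str) -> ProvenanceCard:
    """Build a validated card from its JSON form."""
    raw = json.loads(text)
    raw["first_seen"] = parse_utc(raw["first_seen"])
    raw["last_seen"] = parse_utc(raw["last_seen"])
    card = ProvenanceCard(**raw)  # unknown or missing fields raise TypeError
    card.validate()
    return card
```

Note that there is deliberately no "verified" value in the confidence vocabulary, so a card claiming it fails validation.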
This is not court evidence. But it is already more useful than a normal link preview, because at least now the user can see why the system thinks something has provenance.
One witness is not enough
My first version leaned heavily on the Wayback Machine CDX API.
That was useful, but it also made the weakness obvious.
Sometimes there is no capture. Sometimes the capture is incomplete. Sometimes the site blocked crawlers. Sometimes replay looks nothing like the original page. Sometimes you only find one lonely snapshot and you have no idea how much weight to give it.
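For context, here is a minimal sketch of what a CDX lookup can look like. It builds a query URL for the public Wayback CDX endpoint and parses the JSON rows it returns, where the first row is the column header. The function names `build_cdx_query` and `parse_cdx_rows` are made up for this example, and the field selection is just one reasonable choice, not what the prototype necessarily uses.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_query(page_url: str, limit: int = 50) -> str:
    """Build a CDX query URL asking for JSON output and a few fields."""
    params = {
        "url": page_url,
        "output": "json",
        "fl": "timestamp,original,statuscode,digest",
        "limit": str(limit),
    }
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


def parse_cdx_rows(rows):
    """Turn CDX JSON rows into dicts sorted by capture time.

    With output=json the first row is the header; an empty list
    means no capture was found.
    """
    if not rows:
        return []
    header, *data = rows
    captures = []
    for row in data:
        record = dict(zip(header, row))
        # CDX timestamps are 14 digits: YYYYMMDDhhmmss, in UTC.
        record["captured_at"] = datetime.strptime(
            record["timestamp"], "%Y%m%d%H%M%S"
        ).replace(tzinfo=timezone.utc)
        captures.append(record)
    return sorted(captures, key=lambda r: r["captured_at"])
```

The response body fetched from `build_cdx_query(...)` can be decoded with `json.loads` and passed to `parse_cdx_rows`. An empty result is a real answer too: it says "no witness here", not "this page never existed".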
So I think the card should treat evidence as a witness stack:
A live page tells you what exists now.
A public archive tells you whether someone captured it before.
A crawl index tells you whether another crawler saw it.
A local capture tells you what you captured yourself.
A screenshot gives a visual reference.
A hash helps show whether the file changed.
A replay package tells you whether the page can be inspected later.
The real question is not just:
Was this archived?
It is:
Do different witnesses agree?
That matters a lot. Two sources that are basically copying the same thing should not count as two independent witnesses.
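One way to sketch this: give every witness an `origin` (who actually performed the capture) and count origins, not witnesses. The `Witness` type, the `origin` field, and the rule that agreement means identical content hashes are all assumptions for illustration. In practice, hashes from different crawlers often differ for dynamic pages, so a real comparison would probably need to be looser than this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Witness:
    name: str          # label shown on the card, e.g. "wayback"
    origin: str        # hypothetical: who actually performed the capture
    content_hash: str  # hash of the captured payload


def independent_origins(witnesses):
    """Witnesses that share an origin count as one independent witness."""
    return {w.origin for w in witnesses}


def witnesses_agree(witnesses) -> bool:
    """True only if at least two independent origins report the same hash."""
    if len(independent_origins(witnesses)) < 2:
        return False
    return len({w.content_hash for w in witnesses}) == 1
```

With this rule, a mirror that republishes another archive's capture adds a row to the card but does not raise the independent count.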
Confidence should not be a fake badge
I do not like simple “verified” labels for this.
They look nice in a UI, but they hide too much.
I would rather show smaller signals.
How strong is the evidence that this thing existed publicly?
How strong is the date?
Are the witnesses actually independent?
Are there hashes or signatures?
Does the archived version replay properly?
Are there multiple versions over time?
Then the card can say something like:
Provenance confidence: Medium. Two archive witnesses found, but replay is incomplete and there is no signed capture.
That feels more honest.
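A small sketch of how such a sentence could be generated from the smaller signals, so the label never appears without its reasons. The `Signals` fields, the point values, and the thresholds are arbitrary placeholders, not a calibrated scoring model.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    independent_witnesses: int
    has_hash: bool
    has_signature: bool
    replay_status: str  # e.g. "clean", "partial", "screenshot only"


def summarize(signals: Signals) -> str:
    """Return a confidence label together with the reasons behind it."""
    score = 0
    reasons = []

    n = signals.independent_witnesses
    if n >= 2:
        score += 2
        reasons.append(f"{n} independent archive witnesses found")
    elif n == 1:
        score += 1
        reasons.append("only one independent archive witness found")
    else:
        reasons.append("no archive witness found")

    if signals.has_hash:
        score += 1
        reasons.append("capture hash available")
    else:
        reasons.append("no capture hash")

    if signals.replay_status == "clean":
        score += 1
        reasons.append("replay is clean")
    else:
        reasons.append(f"replay is {signals.replay_status}")

    if signals.has_signature:
        score += 1
        reasons.append("capture is signed")
    else:
        reasons.append("there is no signed capture")

    # Placeholder thresholds on a 0-5 scale; tune against real data.
    label = "High" if score >= 4 else "Medium" if score >= 2 else "Low"
    text = "; ".join(reasons)
    return f"Provenance confidence: {label}. {text[0].upper()}{text[1:]}."
```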
It also teaches the user what the evidence actually is, instead of asking them to trust a black box.
Replay is more fragile than I expected
This was the part that annoyed me most.
A screenshot feels like proof until you start looking closer.
Modern pages are full of scripts, third-party assets, lazy-loaded images, embeds, ads, tracking code, and content that changes depending on time, location, login state, or browser.
Old captures often replay badly. You can get missing images, broken layouts, or even old HTML pulling in newer assets.
At that point, what are you looking at?
Not exactly the original page. More like a reconstruction.
So I think replay status deserves its own field.
*Clean replay.
*Partial replay.
*Screenshot only.
*Text only.
*Capture failed.
*Dynamic content warning.
It looks like a small detail, but it prevents a lot of false confidence.
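A minimal sketch of that field as an enum, with a rule that replay quality puts a ceiling on the overall confidence. The dynamic content warning is modeled as a separate flag here, matching the `warnings` array in the example card. The ceiling values are hypothetical and only show the mechanism.

```python
from enum import Enum


class ReplayStatus(Enum):
    CLEAN = "clean"
    PARTIAL = "partial"
    SCREENSHOT_ONLY = "screenshot_only"
    TEXT_ONLY = "text_only"
    CAPTURE_FAILED = "capture_failed"


LEVELS = ["low", "medium", "high"]

# Hypothetical ceilings: the best confidence each replay status allows.
CONFIDENCE_CEILING = {
    ReplayStatus.CLEAN: "high",
    ReplayStatus.PARTIAL: "medium",
    ReplayStatus.SCREENSHOT_ONLY: "medium",
    ReplayStatus.TEXT_ONLY: "medium",
    ReplayStatus.CAPTURE_FAILED: "low",
}


def cap_confidence(confidence: str, status: ReplayStatus,
                   dynamic_content: bool = False) -> str:
    """Lower the confidence if replay quality does not support it."""
    ceiling = CONFIDENCE_CEILING[status]
    if dynamic_content and ceiling == "high":
        ceiling = "medium"
    # Pick whichever of the two levels ranks lower.
    return min(confidence, ceiling, key=LEVELS.index)
```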
Keep the public version simple
For a public tool, I would not start with something huge.
I would start with this: Paste a URL.
Return the live status, earliest known witness, basic archive timeline, capture information, confidence labels, warnings, a shareable card, and JSON output.
That is already enough for a lot of people: developers, journalists, researchers, students, bloggers, and anyone who has ever had to ask:
Where did this page come from?
The heavier version is a different story.
That is where you would want signed WARC/WACZ packages, SHA-256 manifests, replay verification, DOM diffs, text diffs, batch processing, monitoring, API access, and evidence exports.
Open formats matter here. I do not think this should become another proprietary “trust us” system.
WARC makes sense as the preservation layer. WACZ makes sense for portable replay and signing.
The package should be something another person can inspect later.
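To make "inspect later" concrete, here is a minimal sketch of a SHA-256 manifest over a capture directory. It is a plain path-to-hash mapping, not the WACZ packaging format itself, and the function names are made up for this example.

```python
import hashlib
from pathlib import Path


def sha256_file(path, chunk_size: int = 65536) -> str:
    """Hash a file in chunks so large captures do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(package_dir) -> dict:
    """Map each file's relative path to its SHA-256 hex digest."""
    root = Path(package_dir)
    return {
        p.relative_to(root).as_posix(): sha256_file(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_manifest(package_dir, manifest: dict) -> list:
    """Return paths that were changed, added, or removed since the manifest."""
    current = build_manifest(package_dir)
    return sorted(
        path
        for path in set(current) | set(manifest)
        if current.get(path) != manifest.get(path)
    )
```

An empty list from `verify_manifest` only says the files match the manifest. It says nothing about who produced the manifest, which is where signing would come in.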
Questions I still have
These are the parts I have not settled yet:
*Should the verifier be open source?
*How should different archive witnesses be weighted?
*How should conflicting witnesses be shown?
*What should stay free forever?
*How do you stop abuse without making the tool annoying?
*Should there be a public JSON schema for provenance cards?
*Should replay quality be scored automatically, manually, or both?
*How much uncertainty can you show before normal users give up?
My impression is that the web does not need more tools that scream “verified.”
It needs tools that show their work and admit what they do not know.
I am exploring this as part of a broader research direction around content provenance and human-origin signals.
Curious how others would design this. So what fields would you add?
What would you remove? And how would you handle two witnesses that disagree?