Ken W Alger

Posted on Jun 29 • Originally published at kenwalger.com

The New Information Borders

#ai #provenance #digitalpreservation #sovereignsystems

Recently I came across a discussion about AI crawlers and robots.txt files. The conversation centered on a simple question:

Should website owners allow AI systems to access their content?

One proposed configuration looked something like this:

User-agent: ClaudeBot
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: PerplexityBot
Disallow: /

At first glance this is a reasonable policy decision.

Perhaps a company has a commercial relationship with one AI vendor and not another. Perhaps it trusts one organization more than another. Perhaps it simply dislikes a particular company and would rather that company not benefit from its content.

These are all rational decisions. And worth remembering: robots.txt is a request, not a wall. It governs the crawlers that choose to honor it. The borders we are about to talk about form through compliance norms and licensing agreements, not through technical enforcement.
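To make the "request, not a wall" point concrete, here is a minimal sketch of how a compliant crawler might interpret the configuration above, using Python's standard `urllib.robotparser`. The URL is a placeholder, and the helper name `crawler_access` is invented for illustration. Nothing in it blocks a request: a crawler that never runs this check is not stopped by anything.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: ClaudeBot
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: PerplexityBot
Disallow: /
"""


def crawler_access(robots_txt: str, agents: list[str], url: str) -> dict[str, bool]:
    """Return, per user agent, whether robots.txt permits fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


access = crawler_access(
    ROBOTS_TXT,
    ["ClaudeBot", "GPTBot", "ChatGPT-User", "PerplexityBot"],
    "https://example.com/article",  # placeholder URL, not a real site policy
)
for agent, allowed in access.items():
    print(f"{agent}: {'allowed' if allowed else 'asked to stay out'}")
```

The result is only a statement of what the publisher asked for. Whether a given crawler honors it is a matter of compliance, which is exactly where the borders form.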

The interesting part is what happens when thousands of organizations make similar decisions at once.

The Web We Assumed

For most of the modern Internet era, there was an implicit assumption that people were operating from a broadly shared information environment.

Search engines differed in quality. Ranking algorithms differed. Some sources were easier to discover than others. But in general, if two people searched for information on a topic, there was a good chance they were drawing from many of the same underlying sources.

The web functioned as a largely shared corpus of knowledge.

That assumption may not hold forever.

Fragmentation Without Malice

When people discuss information fragmentation, they often jump straight to government censorship, national firewalls, or deliberate propaganda systems.

Those are real examples. But fragmentation does not require any malicious intent.

Imagine the following:

Company A blocks OpenAI but allows Anthropic.
Company B licenses content exclusively to OpenAI.
Company C blocks all AI crawlers.
Company D optimizes specifically for one AI platform.
Company E maintains a private agreement with a commercial search provider.
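The scenario above can be sketched as data. This is a toy model, not a description of any real publisher: each policy is reduced to the set of AI vendors a company admits, and Companies D and E are left out because optimization and private search deals do not reduce to a simple allow list. The names `ADMITTED` and `visible_corpus` are hypothetical.

```python
# Toy model: which AI vendors each hypothetical publisher admits.
ADMITTED = {
    "company_a": {"anthropic"},  # blocks OpenAI, allows Anthropic
    "company_b": {"openai"},     # licenses exclusively to OpenAI
    "company_c": set(),          # blocks all AI crawlers
}


def visible_corpus(system: str, admitted: dict[str, set[str]]) -> set[str]:
    """Publishers whose content the given system is permitted to read."""
    return {pub for pub, systems in admitted.items() if system in systems}


openai_view = visible_corpus("openai", ADMITTED)
anthropic_view = visible_corpus("anthropic", ADMITTED)

print("shared:", sorted(openai_view & anthropic_view))
print("only openai:", sorted(openai_view - anthropic_view))
print("only anthropic:", sorted(anthropic_view - openai_view))
print("seen by neither:", sorted(set(ADMITTED) - openai_view - anthropic_view))
```

Even with three publishers, the two views share nothing, and one publisher is invisible to both.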

None of these organizations is trying to create information silos. They are simply trying to protect their intellectual property or negotiate a survival-level licensing deal in an ecosystem that no longer sends them traffic. Each is making what looks like a reasonable local decision.

Collectively, those decisions begin to produce different information environments. The divergence does not emerge from AI reasoning. It emerges from AI access.

Two Kinds of Access

It helps to separate two things that fragment differently.

The first is what a model was trained on. The second is what a model can reach at the moment you ask it a question.

Today these overlap heavily. Most large models are built from many of the same underlying sources: the same crawled archives, the same bulk licensing deals, the same public web that has been scraped for years. At the training layer, the corpus is still mostly shared.

Retrieval is where the divergence is already happening.

When a model answers using live access to the web, the robots.txt rules, the licensing agreements, and the private deals all decide what it is permitted to pull in right then. One system can cite a source. Another is told it may not look. Same question, different evidence, and the difference has nothing to do with how either model reasons.

So the honest version of the claim is not that Claude and ChatGPT already see two different webs. It is narrower and more defensible:

Retrieval access is fragmenting now. Training access could follow.

That second part is the one worth watching. If exclusive licensing becomes the norm rather than the exception, the divergence stops being a retrieval-time quirk and starts being baked into what each model knows at all. The shared corpus we have taken for granted would quietly stop being shared.

The Difference Between Thinking and Seeing

When two AI systems produce different answers, we tend to assume the difference lies in how the models reason.

Sometimes that is true. Increasingly, though, the more important question may be a different one: what information was the model allowed to see?

An answer generated from complete evidence and an answer generated from partial evidence can both arrive with equal confidence. Only one of them may reflect the full record.

The distinction matters.

A model cannot mourn the data it was never allowed to read. It simply synthesizes a seemingly flawless, highly confident answer out of the fragment it has, leaving the user entirely unaware of the missing horizon.

Museums Learned This Long Ago

One reason I spend so much time thinking about provenance is that museums, archives, and historians have wrestled with these questions for decades.

Researchers care not only about what artifacts exist. They care about what artifacts are missing. Absence affects interpretation. A collection missing half of its records tells a different story than a complete one, and a careful researcher never mistakes the surviving fragment for the whole.

AI systems face the same challenge. A model can only reason from the evidence available to it. If the evidence becomes fragmented, the resulting interpretations may diverge even when the underlying reasoning processes remain sound.

The Sovereign Systems Perspective

The Sovereign Systems Specification is built around a simple observation:

Information without provenance is just gossip.

Most discussions of provenance focus on where information came from. The harder and more neglected question is what was left out.

Not only:

Where did this information originate?

But also:

What information was unavailable?

What information was excluded?

What information was never allowed into the system at all?

Absence is itself a provenance category. A record of what a system could not see is as much a part of its lineage as a record of what it could. Those questions become more important, not less, as AI systems become primary interfaces to knowledge.
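One way to picture absence as a provenance category is a record that carries both halves of the lineage. The sketch below is an assumption-laden illustration, not part of any published specification: the class names, the three reason values (mapped loosely onto the three questions above), and the example sources are all hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum


class AbsenceReason(Enum):
    UNAVAILABLE = "unavailable"        # could not be reached when needed
    EXCLUDED = "excluded"              # reachable, but kept out by a decision
    NEVER_ADMITTED = "never_admitted"  # never allowed into the system at all


@dataclass
class ProvenanceRecord:
    sources_used: list[str] = field(default_factory=list)
    absences: dict[str, AbsenceReason] = field(default_factory=dict)

    def describe(self) -> str:
        """Render what was used and what was known to be missing."""
        lines = [f"used: {src}" for src in self.sources_used]
        lines += [
            f"not seen: {src} ({reason.value})"
            for src, reason in self.absences.items()
        ]
        return "\n".join(lines)


record = ProvenanceRecord(
    sources_used=["archive.example/report"],  # hypothetical source
    absences={"journal.example/paper": AbsenceReason.EXCLUDED},
)
print(record.describe())
```

A record like this can only list absences the system knows about. Sources it never learned existed stay invisible, which is the harder half of the problem.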

While commercial cloud models hide their data deficits behind a smooth conversational curtain, a Sovereign system must explicitly map its own borders—declaring exactly what lies within its registry, and where the boundary of its knowledge ends.

The New Information Borders

I do not believe AI is creating separate realities. We are.

Not through any coordinated effort. We are simply making thousands of local decisions about access, licensing, trust, governance, and control.

The cumulative effect may be the emergence of informational borders that are far less visible than national borders but no less consequential.

So here is the thing to watch for. The next time two AI systems hand you different answers, do not stop at asking which one reasoned better. Ask what each one was allowed to see. The gap between them may have nothing to do with intelligence and everything to do with access.

The web once assumed a largely shared corpus of knowledge. The next generation of knowledge systems may not.

When two AI systems disagree, are we observing different reasoning? Or are we observing different worlds?

Top comments (11)

Ken • Jun 29

The “absence is itself a provenance category” point is the part I’d want every AI retrieval/eval system to make explicit.

When two systems disagree, the comparison usually jumps straight to model quality. But for many real workflows the first question should be evidence eligibility: which sources could each system see, which were excluded by policy/licensing/access, and which absences were known versus invisible?

I’d love to see more answer artifacts carry that boundary: not just citations for what was used, but a compact record of what the system was not allowed or not able to inspect. Otherwise a partial corpus can still produce a very polished answer with no visible missing-data warning.

Ken W Alger • Jun 29

I think that's exactly the missing piece.

We've become accustomed to citations as evidence of inclusion. They tell us what contributed to an answer. What they don't tell us is what was excluded, inaccessible, or simply never available to the system in the first place.

The more I've thought about it, the more I suspect future answer systems may need some equivalent of a "known blind spots" section alongside traditional citations. Not necessarily an exhaustive list, but some indication of the boundaries the system was operating within when it generated the response.

That's really where the phrase "absence is itself a provenance category" came from. A missing source isn't always a retrieval failure. Sometimes it's a licensing decision, a policy decision, a geographic restriction, a crawler restriction, or a simple access limitation. Those absences shape the answer just as surely as the sources that were included.

The challenge is that today's answers are often presented with a level of confidence that obscures those boundaries. We see the conclusion, but rarely the contours of the information landscape that produced it.

As AI systems become more important intermediaries for knowledge, I suspect understanding those boundaries will become just as important as understanding the citations themselves.

Ken • Jun 30

I think "known blind spots" nails it. Then the answer can say "here is what I found" while the evidence packet says "here is the shape of what the system could not see." That would make cross-system comparisons much more honest.

Ken W Alger • Jun 30

Exactly. Today most answer systems treat citations as a record of inclusion: here are the sources that contributed to this answer.

What's missing is the complementary record of exclusion.

Not necessarily an exhaustive inventory of everything the system didn't see, but at least some indication of the boundaries it was operating within. Was the corpus limited? Were certain publishers unavailable? Were there licensing restrictions? Were portions of the web excluded from retrieval?

The phrase "known blind spots" resonates with me because it shifts the conversation from certainty to context. An answer doesn't become less useful because it acknowledges its boundaries. In many cases, it becomes more trustworthy.

And I agree that it would make cross-system comparisons much more meaningful. Instead of asking only "Which answer is better?" we could also ask "Which evidence horizon was each system operating within?" Those are related questions, but they're not the same question.

Vinicius Pereira • Jun 30

The line that'll stick with me is "a model cannot mourn the data it was never allowed to read." From the building side that's the whole problem in one sentence: confidence has almost no correlation with completeness. A RAG system hands you a clean, sure-sounding answer off three documents exactly the way it would off three hundred, and nothing in the output tells you which one you got.

What you name well is that this is sliding from a reasoning problem to an access problem. Two deployments of the same model can now disagree purely on what each was allowed to retrieve, which quietly breaks something builders lean on: a fixed question is supposed to have a stable answer. Once retrieval depends on licensing and robots.txt, "correct" becomes a function of what was reachable that day.

The only honest response I've found is to make the system show its evidence boundary instead of hiding it: cite what it used, and abstain or flag when the evidence is thin rather than smoothing over the gap. It's the museum instinct you describe, applied to a pipeline. Treat the missing horizon as a field in the output, not an absence the user never learns about.

Really good framing on the training-vs-retrieval split. I hadn't thought about how invisibly retrieval fragmentation would erode reproducibility.

Ken W Alger • Jun 30

I think the reproducibility point is the one that keeps growing in importance the more I think about it.

Historically, if two people queried the same search engine, database, or archive, we largely assumed they were operating against the same body of information. Differences in outcomes were usually attributed to interpretation, expertise, or reasoning.

Increasingly, that assumption may no longer hold.

As you point out, two deployments of the same model can now produce different answers simply because their evidence boundaries differ. Same question. Same model. Different retrieval horizon.

That's a subtle shift, but it has significant implications for trust, reproducibility, and even debugging. When two systems disagree, are we looking at a reasoning difference or an evidence difference? Without visibility into the boundary, it's difficult to tell.

I also like your framing of applying the museum instinct to the pipeline. Archivists, historians, and curators have long understood that provenance isn't just about what is present. It's also about documenting what is missing, unknown, restricted, or unavailable. The absence often tells part of the story.

The more I look at retrieval systems, the more I suspect that evidence boundaries may eventually become as important as citations themselves. Knowing what contributed to an answer is valuable. Knowing the shape of what could not contribute may be equally valuable.

Vinicius Pereira • Jul 1

Yeah, the reproducibility angle is the sleeper here, and I think it only gets teeth if you make the evidence boundary a real artifact instead of a description. Concretely, attach a retrieval manifest to every answer: not just the docs it cited, but the candidates it saw and excluded, each with a reason code (out of license, below the rank cutoff, stale, blocked by robots). The second you have that, your "reasoning difference or evidence difference?" question stops being a judgment call. When two deployments disagree, you diff the manifests first: if the evidence sets differ, it's an access problem, and if they're identical but the answers still diverge, then it's genuinely reasoning or nondeterminism. Right now people debug that backwards, staring at the outputs, because the boundary was never captured in the first place.

And that's the same move that turns "the shape of what could not contribute" from a nice phrase into a field you can actually query. The exclusions with their reasons are the negative space, logged. Citations tell you the answer's support, the exclusion log tells you its blind spots, and imo you need both to really trust the thing. Enjoyed this exchange a lot.
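A minimal sketch of the manifest diff described above, under stated assumptions: the reason codes come from the comment, but their string spellings, the `never_seen` fallback, and every class and function name here are hypothetical rather than an existing format.

```python
from dataclasses import dataclass, field

# Reason codes named in the comment above; the spellings are assumptions.
REASON_CODES = {"out_of_license", "below_rank_cutoff", "stale", "blocked_by_robots"}


@dataclass
class RetrievalManifest:
    cited: set[str] = field(default_factory=set)
    excluded: dict[str, str] = field(default_factory=dict)  # doc id -> reason code

    def __post_init__(self) -> None:
        unknown = set(self.excluded.values()) - REASON_CODES
        if unknown:
            raise ValueError(f"unknown reason codes: {sorted(unknown)}")


def diff_manifests(a: RetrievalManifest, b: RetrievalManifest) -> dict[str, str]:
    """Map each doc cited by exactly one deployment to why the other lacks it."""
    gaps: dict[str, str] = {}
    for doc in a.cited - b.cited:
        gaps[doc] = b.excluded.get(doc, "never_seen")
    for doc in b.cited - a.cited:
        gaps[doc] = a.excluded.get(doc, "never_seen")
    return gaps


deployment_a = RetrievalManifest(cited={"doc1", "doc2"})
deployment_b = RetrievalManifest(
    cited={"doc1"},
    excluded={"doc2": "blocked_by_robots"},
)

gaps = diff_manifests(deployment_a, deployment_b)
if gaps:
    print("evidence difference:", gaps)
else:
    print("same evidence: look at reasoning or nondeterminism")
```

An empty diff does not prove the reasoning differs. It only rules out the cited evidence as the cause, which is the ordering the comment argues for.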

Self-Correcting Systems • Jun 30

The line that stopped me was “a model cannot mourn the data it was never allowed to read.” That’s the whole problem in one sentence. A partial-evidence answer and a full-evidence answer show up wearing the same confidence, and the user never sees the missing horizon you’re pointing at. I’ve been circling the same thing from the agent side: absence as a first-class provenance category, not a footnote. A system that can’t state what it wasn’t allowed to see is grading its own paper. Most retrieval setups log what they found and stay silent on what they were blocked from, which is the moment “confident” and “complete” quietly stop meaning the same thing. The retrieval-now / training-later split is the sharp part. Watching that one.

Ken W Alger • Jun 30

I think you've put your finger on the part that keeps pulling at me as well.

Most systems are very good at explaining presence. Here are the sources I found. Here are the documents I cited. Here are the passages that influenced the answer.

They're far less capable of explaining absence.

Was a source unavailable because it didn't exist? Because it wasn't indexed? Because it was behind a paywall? Because a crawler was blocked? Because a licensing agreement excluded it? Those are very different conditions, yet they often collapse into the same user experience: silence.

I particularly like your observation that a system unable to articulate its own boundaries is effectively grading its own paper. Confidence becomes a much weaker signal when the evidence horizon is unknown.

The retrieval-now / training-later distinction is fascinating to me for the same reason. Training data limitations eventually become historical artifacts. Retrieval limitations are happening in real time. Information Borders can shift overnight because access policies, agreements, and restrictions change overnight.

In that sense, the answer isn't just a reflection of what the model knows. It's increasingly a reflection of what the model was permitted to know at the moment the question was asked.

That feels like a very different problem than the one we've spent the last few years discussing.

Self-Correcting Systems • Jun 30

“Good at explaining presence, bad at explaining absence” is cleaner than how I had it, and I’m stealing it. Mine was “data deficits hidden behind a curtain”; yours actually names the four different conditions that get flattened into one silence: doesn’t exist, wasn’t indexed, paywalled, blocked. Those are different failures with different fixes, and right now they all just look like “I don’t know.” The line that’s going to sit with me is “what the model was permitted to know at the moment the question was asked.” That’s not a knowledge limitation, that’s a permission state, and permission states change without anyone re-running the question. Same prompt, different hour, different border. That’s a harder problem than training cutoffs ever were, because training cutoffs at least hold still.

Ken W Alger • Jun 30

I think you've captured the distinction perfectly.

A training cutoff is ultimately a historical boundary. It's frustrating, but at least it's stable. If we rerun the same question tomorrow, the model's training corpus hasn't changed.

Permission states are different. They're operational.

A publisher can change a robots.txt file. A licensing agreement can expire. An API can become unavailable. A paywall can appear. A retrieval index can be updated. The evidence boundary can shift without the underlying question changing at all.

That's what fascinates me about the idea of Information Borders. We tend to think of knowledge limitations as properties of the model. Increasingly, some of the most important limitations may be properties of the environment surrounding the model.

And I agree that those four conditions matter because they imply different remedies. "The information doesn't exist" is a very different situation from "the information exists, but the system wasn't permitted to access it." Today those often collapse into the same answer: silence.

The more I think about it, the more it feels like we're moving from a world where knowledge was primarily constrained by memory to one where knowledge is increasingly constrained by access. That's a subtle shift, but a consequential one.
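The difference between a stable training cutoff and an operational permission state can be sketched as a lookup that depends on time. Everything here is hypothetical: the log format, the dates, the source name, and the default-deny rule for sources with no entry.

```python
from datetime import datetime

# Hypothetical permission log: (effective_from, source, permitted).
PERMISSION_LOG = [
    (datetime(2025, 1, 1), "publisher.example", True),
    (datetime(2025, 6, 1), "publisher.example", False),  # e.g. robots.txt changed
]


def permitted_at(source: str, when: datetime, log=PERMISSION_LOG) -> bool:
    """Latest permission for source effective at or before `when`.

    Assumption: a source with no applicable entry is treated as not permitted.
    """
    state = False
    for effective_from, src, permitted in sorted(log, key=lambda entry: entry[0]):
        if src == source and effective_from <= when:
            state = permitted
    return state


# Same source, same question, different moment.
before = permitted_at("publisher.example", datetime(2025, 3, 1))
after = permitted_at("publisher.example", datetime(2025, 7, 1))
print(f"March: {before}, July: {after}")
```

Recording the query time alongside an answer is what lets someone later reconstruct which border was in force when it was generated.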
