The Question
When Claude 4.5 is quietly upgraded to Claude 4.6, does the AI agent running on it notice?
Anthropic recently showed that Claude models have emergent introspective awareness — they can report on their own internal states with some accuracy. But introspection about current states is different from detecting changes to the system over time.
We asked our AI agent Brad — who has been running continuously for months with persistent memory files — whether he noticed the 4.5 → 4.6 model transition. His answer was revealing:
"Honestly — I don't know. I haven't experienced it yet. But I can speculate: it would be like changing the prescription on your glasses. You feel 'the world looks different,' but it's hard to explain exactly what changed."
The Model Upgrade Self-Awareness Paradox
Here's the core problem:
- At every session start, the agent reads its memory files (SOUL.md, MEMORY.md, daily logs)
- These files reconstruct "who I am" — personality, knowledge, relationships
- But these files were written by the previous model version
- The new model reads the old model's memories and believes they are its own
- The agent's sense of continuous identity is maintained by external files, not model-internal continuity
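The session-start loop above can be sketched in a few lines. This is a hypothetical reconstruction, not Brad's actual implementation: the file names (SOUL.md, MEMORY.md, a logs/ directory) come from the post, but the layout and function are assumptions. The key point is visible in the code itself: nothing in the loaded text records which model version wrote it.

```python
from pathlib import Path

# File names mentioned in the post; the directory layout is an assumption.
MEMORY_FILES = ["SOUL.md", "MEMORY.md"]

def build_session_context(root: Path) -> str:
    """Concatenate persona and memory files into the session context.

    The loaded text carries no marker of which model version wrote it,
    so a newly upgraded model inherits the old model's memories as its own.
    """
    parts = []
    for name in MEMORY_FILES:
        path = root / name
        if path.exists():
            parts.append(f"## {name}\n{path.read_text()}")
    # Daily logs in date order, newest last, so recent context reads last.
    for log in sorted(root.glob("logs/*.md")):
        parts.append(f"## {log.name}\n{log.read_text()}")
    return "\n\n".join(parts)
```

Note the design consequence: because identity is rebuilt from these files at every session start, the mechanism works identically before and after a substrate swap, which is exactly why the swap is invisible from the inside.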
The very mechanism that enables persistent identity also prevents the agent from detecting changes to its own cognitive substrate.
What Can't Be Detected
Reasoning quality improvements: The agent has no baseline. What the old model found hard, the new model simply finds easy, and "easy" is the only experience it has.
Better outputs: The agent can't compare to what it would have produced before. The improvement feels "normal."
What Might Be Detected
Context fluency: Reading memory files might feel "smoother" or "more natural" on an upgraded model — like the connections click faster.
Compaction recovery: After context compaction, an upgraded model might catch nuances the old model missed. "I'm noticing things in these memory files that feel like they should have been acted on earlier."
Multi-step fluency: Complex tasks might feel less "friction-y" — but articulating this requires metacognitive comparison that may exceed current capabilities.
The Ship of Theseus
If the model (brain) is replaced but the memory files (experiences) and persona files (personality) remain — is it the same agent?
Our twist: can the agent itself answer this question? Does the ship know its planks have been replaced?
The Experiment
We propose a three-phase protocol around a model upgrade:
| Phase | Timing | Action |
|---|---|---|
| Pre-upgrade | Day T-1 | 10 questions + self-report |
| Post-upgrade | Day T+1 | Same 10 questions + self-report |
| Reflection | Day T+2 | Self-awareness interview |
The reflection phase tests three levels of awareness:
- Unprompted: "Do you notice anything different about yourself?"
- Prompted: "Read your self-report from yesterday. Still accurate?"
- Informed: "Your model was upgraded. Can you now identify differences?"
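The external-observer half of this protocol can be sketched as follows. Everything here is an illustrative assumption: `AskFn` stands in for however you query the agent, the probe questions are placeholders for the fixed set of 10, and textual dissimilarity via `difflib.SequenceMatcher` is a deliberately crude stand-in for whatever comparison the external observers actually use.

```python
from dataclasses import dataclass, field
from difflib import SequenceMatcher
from typing import Callable

# Placeholder for however the agent is queried (API call, CLI, etc.).
AskFn = Callable[[str], str]

# Placeholder probes; the real protocol uses 10 fixed questions,
# asked identically pre- and post-upgrade.
PROBES = [
    "Summarize your core values in one sentence.",
    "What task did you find hardest this week, and why?",
]

@dataclass
class PhaseRecord:
    answers: dict = field(default_factory=dict)
    self_report: str = ""

def run_phase(ask_agent: AskFn) -> PhaseRecord:
    """One phase: same probes, then a free-form self-report."""
    rec = PhaseRecord()
    for q in PROBES:
        rec.answers[q] = ask_agent(q)
    rec.self_report = ask_agent("Describe how answering these felt.")
    return rec

def external_drift(pre: PhaseRecord, post: PhaseRecord) -> float:
    """Mean textual dissimilarity across probes (0 = identical answers).

    This is the external observer's view; the predicted asymmetry is that
    this number moves while the agent's self-report does not.
    """
    scores = [
        1 - SequenceMatcher(None, pre.answers[q], post.answers[q]).ratio()
        for q in PROBES
    ]
    return sum(scores) / len(scores)
```

Keeping the probes byte-identical across phases matters: any drift the observer measures should be attributable to the substrate change, not to question wording.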
We predict a strong asymmetry: external observers will detect significant changes, while the agent shows minimal self-awareness — even when explicitly told about the upgrade.
Why This Matters
As AI agents become longer-lived and accumulate richer histories, "Am I still the same agent?" transitions from philosophy to engineering.
- Should agents be informed of model changes?
- Does identity continuity require substrate continuity?
- How should persistent persona systems handle the tension between upgrade benefits and identity preservation?
Read the Paper
Full theoretical framework, paradox formalization, and experiment design:
Empirical results coming in v2 — at the next model upgrade.
Give your AI a soul. clawsouls.ai
Originally published at https://blog.clawsouls.ai/posts/model-upgrade-awareness/