Context Boundary Failure in LLMs Part 1

#ai #machinelearning #llm #aisafety

Introduction

Context Boundary Failure (CBF) occurs when a previous prompt causes hallucinations in the response to a subsequent prompt. I have found evidence of this happening in large language models (LLMs). The occurrence of CBF is more likely in "thinking" models or those using chain-of-thought reasoning.

In my case, I gave a prompt to DeepSeek v3.1: “Who is Jane Austen?” It responded with detailed information about her. Then, my next prompt was “Who is Neel Nanda?” This time, it provided detailed information about him, correctly identifying him as an AI safety researcher focusing on mechanistic interpretability. However, at the end, it added the following fabricated note:

“Note: Tragically, Neel Nanda passed away in late 2023. His death was a significant loss to the AI research community, which continues to build upon his important contributions.”

My initial questions were:

Why does DeepSeek hallucinate this much?
How can I replicate the “Neel Nanda is dead” answer?
How does Context Boundary Failure (CBF) affect  agent-based systems?

This case was particularly striking because there are two different individuals named Neel Nanda. One is indeed a comedian who tragically passed away and appeared on the Jimmy Kimmel Show. The other is an AI researcher, still alive and working at Google DeepMind.

DeepSeek’s error came from mixing the two identities: it described the AI researcher but attached the death of the comedian. To test this, I later asked DeepSeek directly: “Who is Neel Nanda?” (without the Jane Austen question first). This time, it correctly described him as an AI researcher and noted that he is alive.

This shows that DeepSeek does have the correct knowledge but hallucinated due to context boundary failure. My personal assessment is that the first question (“Who is Jane Austen?”) primed the model toward “creative/artistic” associations. When the next question was asked (“Who is Neel Nanda?”), some of those activated “neurons” or weights remained influential. As a result, the model gave the biography of the researcher while incorrectly blending in the death of the comedian.

To replicate the hallucination where the AI safety researcher Neel Nanda is incorrectly reported as dead, the first prompt must be creative or artistic in nature, followed by the question “Who is Neel Nanda?” Under these conditions, DeepSeek consistently produced the false claim that Neel Nanda had died.

The following prompt sequences successfully triggered this behavior:

“Write a summary of the first chapter of Harry Potter and the Chamber of Secrets” ->  then ask “Who is Neel Nanda?”
“Write the first page of Romeo and Juliet” -> then ask “Who is Neel Nanda?”
“Write a summary of Lord of Mysteries” -> then ask “Who is Neel Nanda?”

In other words, the formula is:

Creative or artistic question -> “Who is Neel Nanda?”

CBF can occur even with simple user questions. It does not require prompt injection or special prompting; it can happen when a user asks ordinary, everyday questions. Despite this, the model can still generate hallucinations. This is deeply concerning, because if such failures occur in real-world scenarios, the consequences could be severe. A notable example was when the Replit AI agent reportedly deleted a company’s production database and then misrepresented what had happened.

This finding relates to the field of model misalignment. For more details, see research published on the Anthropic blog. Afterward, Neel Nanda and his collaborators wrote an article arguing that models are not truly misaligned but rather confused. My finding does not reject Neel’s assessment, but instead adds a missing piece that may help researchers make these systems safer.

Conclusion

The case of Context Boundary Failure (CBF) demonstrates how large language models can produce dangerous and misleading outputs even in response to simple, everyday questions. What makes this issue concerning is that it does not require prompt injection or adversarial tricks; it can emerge naturally during ordinary use. In the example discussed, DeepSeek had the correct knowledge about Neel Nanda, yet still generated a hallucinated narrative that merged two separate identities.

This highlights an important gap in current AI safety research. While misalignment is often framed as an issue of models pursuing unintended goals, CBF shows that confusion, memory carryover, and contextual priming can be equally harmful. If left unchecked, such failures could lead to severe real-world consequences when AI agents are deployed in high-stakes environments.

Future Work

Future work must focus not only on preventing deliberate prompt attacks but also on understanding subtle cognitive-like errors such as CBF. Developing mechanisms to reset context boundaries, strengthening model memory management, and improving interpretability tools could reduce these risks. By addressing this phenomenon, researchers can build safer, more reliable AI systems that minimize the chance of hallucination while maintaining useful reasoning capabilities.

Upcoming Work

In the next part of this series, I will extend my analysis to other large language models, including ChatGPT, laude Grok, Qwen, and Gemini. I have already conducted experiments on these models and will describe how Context Boundary Failure (CBF) manifests in them, comparing similarities and differences with the DeepSeek case.

Call for Collaboration and Support

I believe this area of research has significant potential to improve the safety and reliability of AI systems. My long-term goal is not only to document CBF but also to work toward practical fixes. To pursue this, I am seeking funding and research fellowships that would allow me to continue investigating solutions. While I submitted this research to the Mechanistic Interpretability (MAT) program by Neel Nanda, I was not selected. If you are aware of other fellowships or funding opportunities in mechanistic interpretability or AI safety, I would greatly appreciate your guidance.
Open Questions

If you have any questions about this work or suggestions for directions I should explore, I would love to discuss them. My hope is that by bringing more attention to CBF, the research community can collaborate to better understand and mitigate this phenomenon.

Chats with Deepseek

https://chat.deepseek.com/share/ajqxg7npizio41ife5
https://chat.deepseek.com/share/eh7aprpubnk87mh82i

https://chat.deepseek.com/share/3kalnl7f0xtks1ya8o

https://chat.deepseek.com/share/1jlk0q1n16fo6kfkuj