Fabio

Experiment: Does repeated usage influence ChatGPT 5.4 outputs in a RAG-like setup?

We’ve been running a series of experiments with ChatGPT 5.4 integrated into a website chatbot, deployed across three environments:

🌐 a main website
🛒 a 1,000-product e-commerce demo store
🍳 a 570-page cooking blog

🎯 Goal: simulate realistic user behavior and observe how the model responds over time.

⚙️ Test setup

The chatbot is designed to (no self promo here, just context):

📌 answer strictly based on website content (RAG-like approach; see the sketch below)
🧭 guide users through product discovery and content navigation
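
To make the "strictly from website content" constraint concrete, here is a minimal sketch. `retrieve` and `build_prompt` are hypothetical stand-ins, not our production code, and the keyword scoring is a deliberately naive placeholder for real vector search:

```python
# Minimal sketch of a content-constrained RAG answer path.
# retrieve() and build_prompt() are illustrative, not the actual bot.

def retrieve(query: str, index: dict[str, str], k: int = 3) -> list[str]:
    """Naive keyword overlap over pre-chunked site content;
    a real system would use embeddings and a vector store."""
    scored = sorted(
        index.items(),
        key=lambda kv: sum(w in kv[1].lower() for w in query.lower().split()),
        reverse=True,
    )
    return [text for _, text in scored[:k]]

def build_prompt(query: str, chunks: list[str]) -> str:
    """Pin the model to retrieved content and give it an explicit out."""
    context = "\n---\n".join(chunks)
    return (
        "Answer ONLY from the site content below. "
        "If the answer is not there, say you don't know.\n\n"
        f"Site content:\n{context}\n\nUser question: {query}"
    )
```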

Over time, we intentionally tested recurring patterns:

🔎 product comparisons
💰 price-based filtering
🔀 cross-entity queries (multiple products, categories)
🧠 more complex “shopping intent” scenarios

💡 The idea was to approximate real-world usage, not synthetic benchmarks.
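
To give a flavor of how these recurring patterns can be replayed, here is a hypothetical harness. The templates, product names, and `generate_queries` helper are invented for illustration; our actual queries were typed by hand:

```python
# Hypothetical replay harness for the recurring query patterns above.
import random

PATTERNS = {
    "comparison": "Compare the {a} and the {b} for me",
    "price_filter": "Show me {category} under {price} EUR",
    "cross_entity": "Which {category} goes well with the {a}?",
}
PRODUCTS = ["cast iron pot", "steel wok", "ceramic pan"]
CATEGORIES = ["cookware", "bakeware"]

def generate_queries(n: int = 20):
    """Yield n queries sampled from the recurring pattern templates.
    str.format ignores unused keyword arguments, so one call fits all."""
    for _ in range(n):
        template = PATTERNS[random.choice(list(PATTERNS))]
        a, b = random.sample(PRODUCTS, 2)
        yield template.format(
            a=a, b=b,
            category=random.choice(CATEGORIES),
            price=random.choice([30, 50, 100]),
        )
```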

👀 Observation

At some point, a real user (yes, a real one) asked:

“How can you help my ecommerce?”

The answer was:

“I can help your e-commerce by answering visitors [...], [...] for example asking how many people they cook for to recommend the right cast iron pot, or asking for a price range to help them find products [...]”

🔍 What’s interesting

This response closely mirrors the interaction patterns we had been testing manually.

It wasn’t a generic explanation.
It reflected:

👉 guided questioning
👉 contextual recommendations
👉 progressive narrowing of user intent

🧠 Hypothesis

From a system behavior perspective, it feels like repeated usage patterns influence outputs in a given context.

Possible explanations:

🧩 Prompt conditioning over time (consistent system + user patterns)
📚 Context shaping via retrieved content (RAG)
🔁 Latent pattern activation due to repeated semantic structures
🧷 Session-level or interaction-level biasing
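
One reason these are hard to tell apart: in a stateless setup, all four influences enter the model through the same assembled context window. A sketch (every name here is illustrative, not our actual pipeline):

```python
# Sketch of where each hypothesized influence enters the input.
# The model weights never change; only this assembled context does.

def assemble_context(system_prompt: str,
                     session_history: list[dict],
                     retrieved_chunks: list[str],
                     user_query: str) -> list[dict]:
    """system_prompt   -> prompt conditioning (fixed, repeated instructions)
    retrieved_chunks -> context shaping via RAG (what users keep asking
                        pulls similar content, which primes similar answers)
    session_history  -> session-level biasing (prior turns in this chat)
    """
    messages = [{"role": "system", "content": system_prompt}]
    messages += session_history  # repeated turn structures live here
    context = "\n---\n".join(retrieved_chunks)
    messages.append({
        "role": "user",
        "content": f"Site content:\n{context}\n\nQuestion: {user_query}",
    })
    return messages
```

If nothing persists between sessions (no fine-tuning, no long-term memory), then "the model adapted" reduces to "the assembled inputs repeat", which may be explanation enough for what we saw.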

❓ Open question

This leads to a broader question for builders:

👉 When deploying LLMs in structured environments (chatbots, RAG systems, product assistants), does repeated real-world usage shape outputs in a measurable way?

👉 Or are we just observing better alignment due to consistent prompting + context injection?
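
One way to turn this into a measurable question: re-ask a fixed probe question at intervals and track how responses drift from a day-zero baseline. The bag-of-words cosine below is a crude stand-in for a proper embedding model, but it shows the shape of the measurement:

```python
# Drift check: compare later responses to a baseline response.
import math
from collections import Counter

def cosine(a: str, b: str) -> float:
    """Bag-of-words cosine similarity; swap in real embeddings for rigor."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    na = math.sqrt(sum(v * v for v in va.values()))
    nb = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def drift_report(baseline: str, later_responses: list[str]) -> list[float]:
    """Similarity of each later response to the day-zero answer."""
    return [cosine(baseline, r) for r in later_responses]
```

Flat scores would point to plain consistency from prompting plus retrieval; a drift that tracks the usage volume of one pattern would be the interesting signal.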

🚀 Why this matters

If usage patterns do influence outputs (even indirectly), then:

🧪 testing is not just evaluation
🏗️ it becomes part of system behavior design
📈 and potentially a lever for optimization

💬 Curious to hear from others

If you’re working with:

RAG pipelines
production chatbots
LLM-powered assistants

Have you noticed similar effects?

Does your system behave differently after repeated real-world usage patterns?

Let’s compare notes 👇
