Vaibhav Malviya
Tested a Campus AI That Remembered Everything — Here's What Broke First

I thought testing an AI chatbot would be straightforward. Ask it some questions, check the answers, file a few bugs, call it a day. That's usually how it goes.

Then I asked the Smart Campus AI about an event I'd "attended" three sessions ago — something I'd mentioned only once, in passing — and it not only remembered it, but used that detail to recommend something I genuinely would have wanted. I sat back and thought: okay, this is different.


What the Project Actually Does

The Smart Campus AI Assistant is a conversational agent built for university students. On the surface, it answers questions: what events are happening this week, where's the robotics club meeting, when is the financial aid deadline.

But the part that makes it worth writing about is what's underneath: a memory layer powered by Hindsight by Vectorize. Instead of treating every conversation as a blank slate, the agent retains what it learns about each student — clubs they've joined, events they've attended, deadlines they've asked about, interests they've mentioned. Over time, it builds a profile. And it uses that profile.

The stack is roughly:

  • A frontend chat interface (React)
  • A Claude-powered conversational layer for understanding and responding
  • Hindsight handling persistent memory — storing, indexing, and recalling student context across sessions
  • A campus data layer for events, clubs, maps, and deadlines

The magic is in how those last two interact. Hindsight doesn't just store a log of past messages. It extracts semantically meaningful facts — "this student is interested in machine learning," "attended the IEEE workshop on Oct 3rd," "asked about internship deadlines twice" — and surfaces them at retrieval time so the agent can reason with them.
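To make that concrete, here is the shape such extracted facts might take. This is an illustrative sketch only; the field names are my assumptions, not Hindsight's actual schema.

```python
# Illustrative fact records, loosely modeled on the examples above.
# Field names are assumptions, not Hindsight's real schema.
facts = [
    {"type": "interest", "value": "machine learning"},
    {"type": "event_attended", "value": "IEEE workshop", "date": "Oct 3"},
    {"type": "deadline_asked", "value": "internship deadlines", "count": 2},
]

def facts_of_type(facts: list[dict], fact_type: str) -> list[dict]:
    """Return only the facts tagged with the given type."""
    return [f for f in facts if f["type"] == fact_type]
```

Storing facts in this tagged shape is what lets retrieval surface "interests" separately from "deadlines asked" later on.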


What I Was Testing

My job was to break it. Specifically, I was probing three things:

  1. Memory accuracy — Does it recall the right things at the right time, or does it hallucinate past interactions?
  2. Recommendation quality — Do the personalized suggestions actually get better over time, or does it just pretend they do?
  3. Edge cases — What happens when a student gives contradictory information? What if they change their interests?

I ran structured test sessions where I played the role of different student personas — a first-year CS student interested in hackathons, a biology major who only cared about research opportunities, a graduating senior chasing scholarship deadlines. Each persona had a scripted history of interactions I fed in, then I'd probe the agent to see what it had retained and how it used it.
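That persona loop can be sketched as a tiny harness. The `EchoAgent` below is a stand-in so the sketch runs on its own; in the real sessions the agent was the full Claude-plus-Hindsight stack.

```python
# Minimal persona-driven test loop: seed a scripted history, then probe.
PERSONAS = {
    "cs_freshman": ["I'm a first-year CS student", "I love hackathons"],
    "bio_major": ["I'm a biology major", "I only care about research openings"],
}

def run_persona(agent, history: list[str], probe: str) -> str:
    for message in history:   # feed the scripted backstory
        agent.chat(message)
    return agent.chat(probe)  # then check what was retained

class EchoAgent:
    """Stand-in agent that trivially 'remembers' everything it was told."""
    def __init__(self):
        self.seen = []

    def chat(self, msg: str) -> str:
        self.seen.append(msg)
        return " | ".join(self.seen)
```

The useful part of the pattern is that the probe is separated from the seeding, so you can swap personas and probes independently.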


The Before: Talking to a Goldfish

To understand why the memory layer matters, you need to experience the version without it.

I tested an early build where Hindsight wasn't yet wired in. Every conversation started cold. You'd mention you were in the Photography Club, the agent would acknowledge it, you'd end the session — and next time, it had no idea who you were.

A real exchange from my notes:

Me: Are there any events this weekend related to my interests?
Agent: I'd be happy to help! What are your interests?
Me: I told you last time — photography and UI design.
Agent: Of course! Here are some general events happening this weekend...

Useless. Worse than useless, actually, because it sounded helpful while giving you nothing personalized. This is the standard chatbot experience and it's why most students stop using them after two sessions.


The After: When Memory Actually Works

Once Hindsight was integrated, the same flow looked like this:

Me: Are there any events this weekend related to my interests?
Agent: Based on what you've told me before, you're interested in photography 
and UI design. There's a Design Systems workshop Saturday at 2pm in the 
Innovation Hub — you also attended the UX session last month, so this seems 
like a natural follow-up. Want me to add it to your reminders?

The agent had retained two facts from a previous session, connected them to current event data, and produced a recommendation that actually made sense. That's the Hindsight agent memory doing real work — not just storing text, but enabling reasoning across time.

Under the hood, when a student mentions an interest or activity, Hindsight's retain call logs it:

```python
from datetime import datetime, timezone

# Log one structured fact about the student, tagged for later recall
hindsight.retain(
    user_id=student_id,
    content=f"Student mentioned interest in: {extracted_interest}",
    metadata={"type": "interest", "source": "conversation",
              "timestamp": datetime.now(timezone.utc).isoformat()},
)
```

And at query time, recall pulls relevant facts back into context:

```python
# Pull the top-5 most relevant stored facts for this student and query
memories = hindsight.recall(
    user_id=student_id,
    query=user_message,
    top_k=5,
)
# Merge recalled facts with live campus data, then let the LLM respond
context = build_context(memories, current_events)
response = llm.complete(system_prompt + context + user_message)
```

It's a clean pattern. What surprised me is how much the quality of what gets retained matters. Early on, the system was storing too much — entire message transcripts rather than extracted facts. Recall was returning noise. The improvement came from being more deliberate: only retain structured, meaningful facts, not raw conversation.
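As a sketch of that "facts, not transcripts" shift, distillation before retention might look like the following. The keyword table is purely illustrative; the real system presumably uses the LLM itself to do the extraction.

```python
# Distill a raw message into short structured statements before retaining.
# A real system would use the LLM for this; keywords keep the sketch simple.
INTERESTS = ("machine learning", "photography", "robotics", "ui design")

def extract_facts(message: str) -> list[str]:
    text = message.lower()
    return [f"Student mentioned interest in: {kw}"
            for kw in INTERESTS if kw in text]
```

Retaining only these short statements is what keeps recall results clean instead of drowning in transcript noise.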


What I Found While Testing

The good:

The memory was genuinely accurate. I couldn't catch it confabulating past interactions. When I asked about something I'd never mentioned, it said it didn't have that information. When I had mentioned something, it surfaced it correctly. That's not trivial — a lot of memory-augmented systems hallucinate recalled facts, which is worse than no memory at all.

Recommendation quality improved measurably across a simulated 5-session arc. By session 3, the agent was surfacing things I'd actually want; by session 5, it was anticipating a deadline question before I asked it.

The edge cases:

Contradictory information was handled gracefully: if I said I was in the robotics club in session 1 and then said I'd left it in session 3, the agent updated its model and stopped recommending robotics events. Hindsight's memory isn't an append-only log that buries old facts; newer context wins.
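A sketch of that "newer context wins" behavior, with illustrative session numbers standing in for timestamps:

```python
# Resolve contradictory facts by letting the most recent statement win.
def resolve(facts: list[tuple[int, str, str]]) -> dict[str, str]:
    """facts: (session, subject, value) tuples; later sessions override."""
    latest = {}
    for _, subject, value in sorted(facts):
        latest[subject] = value
    return latest

history = [(1, "robotics_club", "member"), (3, "robotics_club", "left")]
```

However the real memory layer implements it, this is the observable contract my tests checked: recall reflects the latest statement, not the first one.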

The cold-start problem is real and unsolved. A new student with zero history gets generic responses. The agent needs at least 2–3 interactions before personalization kicks in meaningfully. This isn't a Hindsight limitation — it's just the nature of learning from scratch — but it's something to manage UX-wise.

Over-reliance on memory was my biggest flag. There were moments where the agent would lean so hard on past context that it missed what the student was actually asking right now. A student who attended a coding bootcamp two months ago doesn't necessarily want every future recommendation to be coding-adjacent. Recency weighting matters.
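One way to express that recency weighting is exponential decay on the relevance score. The half-life here is an illustrative knob, not a Hindsight setting:

```python
# Decay a memory's relevance by age so stale facts stop dominating recall.
def recency_score(relevance: float, age_days: float,
                  half_life_days: float = 30.0) -> float:
    return relevance * 0.5 ** (age_days / half_life_days)
```

With a 30-day half-life, that two-month-old bootcamp mention carries a quarter of its original weight, which is roughly the behavior I wanted to see.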


What the Hindsight Documentation Gets Right

Having worked with other memory approaches — stuffing everything into a long context window, manual session summaries, basic key-value stores — Hindsight's design is notably cleaner for this use case.

The separation between retain (write a fact) and recall (retrieve relevant facts) is intuitive and keeps the agent code simple. You're not managing embeddings, you're not writing retrieval logic, you're not tuning similarity thresholds. You just tell it what's worth remembering, and ask for what's relevant when you need it.

The metadata support is also underrated. Being able to tag memories by type (interest, event_attended, deadline_asked) means you can do selective recall — "only pull back deadline-related memories for this query" — without filtering in application code.
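I won't guess at the exact parameter Hindsight exposes for that, but even done client-side the pattern is small. A fallback sketch that filters recalled memories by their metadata tag:

```python
# Client-side fallback for type-scoped recall: keep only memories whose
# metadata "type" tag matches. A native filter parameter in recall would
# make this unnecessary.
def only_type(memories: list[dict], fact_type: str) -> list[dict]:
    return [m for m in memories
            if m.get("metadata", {}).get("type") == fact_type]

recalled = [
    {"content": "asked about FAFSA deadline",
     "metadata": {"type": "deadline_asked"}},
    {"content": "interested in photography",
     "metadata": {"type": "interest"}},
]
```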


Lessons from the Testing Side

1. Memory quality > memory quantity. The instinct is to retain everything. The reality is that precise, structured facts beat verbose transcripts every time. Design your retain calls deliberately.

2. Test memory decay explicitly. Ask the agent about things it should have "forgotten" (contradicted, outdated) and verify it doesn't serve stale context. This is easy to miss if you only test the happy path.

3. The cold-start UX problem needs a design solution, not just a technical one. Consider an explicit onboarding flow that seeds initial preferences — it dramatically improves early-session quality.

4. Watch for over-personalization. A recommendation engine that only ever serves what it thinks you want is a filter bubble. Some randomness or novelty injection keeps the experience useful long-term.

5. Separation of concerns matters. Keeping the memory layer (Hindsight) separate from the LLM layer (Claude) and the data layer (campus events) made the system far easier to test in isolation. I could verify memory behavior without involving the LLM at all.


Final Thought

I came into this expecting to poke holes in a demo. What I found was a system that actually behaves differently over time — not in a marketing-copy way, but in the specific, testable sense that matters: it recalls the right things, uses them appropriately, and updates when they stop being true.

The Hindsight memory system is what makes that possible. Without a proper memory layer, the "personalized campus assistant" is just a FAQ bot with a friendly tone. With it, it's something a student might actually open on the second day.

That's a meaningful difference. And it's what made testing this project genuinely interesting.
