— Hindsight Fixed It
How restructuring agent memory around recall — not storage — fixed latency, hallucinations, and prompt bloat in the Smart Campus AI backend
The first version of this agent had a dirty secret: every request sent the entire student_history.json to the LLM. Token counts ballooned. Responses hallucinated connections between unrelated events. Latency doubled. I had treated memory as a context-window problem. It was a retrieval design problem.
This is the story of how restructuring the backend — specifically the memory schema, the prompt construction, and the recall layer — using Hindsight fixed all three problems without adding meaningful complexity to the codebase.
1. The Problem: Raw JSON in Every Prompt
The Smart Campus AI Assistant is a FastAPI + Groq backend that serves four students with distinct profiles: Arjun (tech, entrepreneurship), Priya (arts, cultural), Rahul (fresher, tech), and Sneha (arts, sports). Each has enrolled clubs, upcoming deadlines, registered events, and a behavioral history stored in memory.
The naive first implementation was predictable. The build_context() function in agent.py assembled a context string by reaching into the database for everything — deadlines, clubs, cascade windows, memory — and then the system prompt in llm.py injected the entire blob into every single request.
The system prompt (lines 15–22) is actually well-written: it tells the LLM to speak like a senior student, reference specific memory, and open proactively. But that instruction is impossible to follow correctly when the context it receives is an undifferentiated JSON dump. The model would pick arbitrary connections between unrelated events because everything looked equally weighted.
Three symptoms made the problem undeniable:
• Token bloat: average prompt length grew 3x as student histories accumulated.
• Hallucinated relevance: the LLM surfaced events from six months ago as if they were happening today.
• Latency doubling: Groq’s inference is fast, but feeding it a wall of JSON negated that advantage entirely.
2. The Foundation: Pydantic Models as Memory Contracts
Before fixing the prompt, I had to fix the data shapes. The models.py file defines every entity the agent reasons about: StudentProfile, Event, Club, Deadline, CampusSpace, MemoryEntry, and Recommendation. These aren’t just DTOs — they’re the contracts that determine what’s queryable.
Two fields in StudentProfile are doing more architectural work than they appear. The join_date field isn’t cosmetic — it’s the trigger for fresher mode detection in database.py:
```python
from datetime import datetime, timedelta

def is_fresher(student_id: str, fresher_window_days: int = 30) -> bool:
    student = STUDENTS.get(student_id)
    if not student:
        return False
    return datetime.now() - student.join_date <= timedelta(days=fresher_window_days)
```
And the interests field — described in the comments as “inferred tags” — is the anchor for the entire scoring and filtering pipeline. The distinction between "tech" and "entrepreneurship" in that list is what makes founders-talk-01 score as a cross-interest event for Arjun.
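To make the cross-interest mechanic concrete, here is a minimal sketch of how tag overlap could drive scoring. The function name `score_event` and the weights are illustrative assumptions, not the project's actual recommender code:

```python
# Hypothetical sketch: interest-tag overlap as the scoring anchor.
# The weights (0.5 per match, 0.3 cross-interest bonus) are illustrative.
def score_event(event_tags: list[str], interests: list[str]) -> float:
    overlap = set(event_tags) & set(interests)
    base = 0.5 * len(overlap)
    # An event matching two or more distinct interests counts as cross-interest
    cross_interest_bonus = 0.3 if len(overlap) >= 2 else 0.0
    return base + cross_interest_bonus
```

Under this sketch, an event tagged both `tech` and `entrepreneurship` outranks any single-tag event for a student like Arjun whose interest list contains both tags, which is exactly why the granularity of the tag list matters.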
3. The Database Layer: Static Data as Queryable Structure
The database.py file is deliberately simple — in-memory Python dicts for events, clubs, spaces, deadlines, and students. For a hackathon demo, this is the right call. But the structure within those dicts matters enormously.
Notice is_morning: bool on the Event model. This is pre-computed at definition time — an event at 6:30 AM is flagged is_morning=True in the dict, not derived by checking event_datetime.hour < 12 on every query. This is a deliberate schema optimization: the filter_by_time_preference() function in filters.py reads this flag in a single pass without any datetime arithmetic.
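A minimal sketch of that single-pass read, assuming a simplified `Event` shape; the real `filter_by_time_preference()` signature in filters.py may differ:

```python
from dataclasses import dataclass

@dataclass
class Event:
    name: str
    is_morning: bool  # pre-computed at definition time; no datetime math at query time

# Illustrative sketch of the flag-based filter described above.
def filter_by_time_preference(events: list[Event], avoid_morning: bool) -> list[Event]:
    if not avoid_morning:
        return events
    # One pass, one boolean check per event — no event_datetime.hour arithmetic
    return [e for e in events if not e.is_morning]
```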
The cascade detection query shows the same principle in action:
```python
def get_cascade_window(student_id: str, days: int = 7) -> dict:
    upcoming = get_upcoming_deadlines(student_id, days)
    return {
        'cascade': len(upcoming) >= 3,
        'count': len(upcoming),
        'items': upcoming,
        'alert': (
            f'⚠️ You have {len(upcoming)} deadlines in the next {days} days!'
            if len(upcoming) >= 3 else None
        )
    }
```
The pre-shaped return dict — with cascade, count, alert as named keys — means agent.py can call cascade['alert'] directly without any transformation logic. The query function returns exactly what the agent needs, in the shape the agent expects.
4. The Fix: Recall Shape Over Storage Volume
The core change was replacing the raw JSON injection with Hindsight’s structured recall. Instead of sending everything to the LLM, the build_context() function now queries memory for specifically shaped summaries.
Here’s the memory swap point in agent.py — deliberately annotated for easy cutover:
```python
# ─────────────────────────────────────────
# MEMORY LAYER
# Swap these two functions when Hindsight is live
# ─────────────────────────────────────────
def _retain(student_id: str, content: str):
    store_interaction(
        student_id=student_id,
        user_input=content,
        agent_response='',
        time_of_day=''
    )

def _recall(student_id: str, query: str) -> str:
    return recall_as_string(student_id, query)
```
The _recall() call passes a specific query string: "student interests habits events clubs attended ignored". This isn’t a database key lookup — it’s a semantic query that Hindsight resolves against the stored interaction history. What comes back is a curated summary, not raw event logs.
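As a toy illustration of the difference between a key lookup and a semantic query — this is NOT Hindsight's implementation, just a keyword-overlap stand-in to show the shape of the operation:

```python
# Toy recall: rank stored interactions by keyword overlap with the query
# and return a compact summary string instead of the raw log.
def recall_as_string_toy(history: list[str], query: str, top_k: int = 3) -> str:
    query_words = set(query.lower().split())
    ranked = sorted(
        history,
        key=lambda entry: len(query_words & set(entry.lower().split())),
        reverse=True,
    )
    return ' | '.join(ranked[:top_k])
```

A real semantic layer uses embeddings rather than word overlap, but the contract is the same: the LLM receives the few most relevant entries, not the whole history.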
The final context string the LLM receives:
```python
return f"""
Student: {profile.name} | Year: {profile.year} | Interests: {', '.join(profile.interests)}
Fresher Mode: {fresher_str}

Upcoming Deadlines (next 7 days):
{deadline_str}

Deadline Cascade Warning: {cascade_str}
Enrolled Clubs: {club_str}

Past Memory (from Hindsight):
{memory}
"""
```
No raw JSON. No full event history. No undifferentiated data blob. The LLM gets a structured summary of who this student is, what’s urgent, what clubs they’re in, and a recalled memory excerpt. Every field maps directly to a reasoning task the LLM needs to perform.
5. The Filter Pipeline: Memory-Informed Scoring
The recommendation engine in recommender.py scores events against student interests. But the filters.py pipeline is where memory makes the difference. Five filters run in sequence:
• filter_by_time_preference() — removes morning events if the student’s behavior pattern shows consistent avoidance.
• filter_by_exam_pressure() — caps suggestions at 2 (high pressure) or 3 (medium) based on upcoming academic deadlines.
• filter_by_day_overload() — detects days with 3+ colliding commitments and strips non-cross-interest events from those days.
• filter_category_repetition() — enforces diversity: no more than 2 events from the same category in a single list.
• apply_drift_boost() — reads Hindsight’s drift detection and boosts emerging-interest events by +0.15, reduces fading-interest events by -0.20.
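A sketch of the drift boost step, with assumed data shapes (the real filters.py may structure this differently): `drift` maps interest tags to `'emerging'` or `'fading'`, and `scores` maps event ids to their current scores. The +0.15 and -0.20 adjustments come from the article:

```python
# Illustrative sketch of apply_drift_boost(); data shapes are assumptions.
def apply_drift_boost(scores: dict[str, float],
                      event_tags: dict[str, str],
                      drift: dict[str, str]) -> dict[str, float]:
    boosted = dict(scores)
    for event_id, tag in event_tags.items():
        if drift.get(tag) == 'emerging':
            boosted[event_id] += 0.15  # surface emerging interests
        elif drift.get(tag) == 'fading':
            boosted[event_id] -= 0.20  # demote fading interests
    return boosted
```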
The avoidance flag detection is worth examining specifically. The get_avoidance_flags() function reads memory — not explicit preferences — to determine what the student silently avoids:
```python
# From memory.avoidance (resolved by Hindsight)
def detect_avoidance_from_memory(student_id: str) -> dict:
    # Returns:
    #   avoided_categories: list[str]
    #   avoid_morning: bool
    #   avoid_long_events: bool
    ...
```
The student never set a preference. The agent inferred it from behavioral patterns. That’s the gap Hindsight fills: between “chatbot with history” and “agent that actually knows you.”
6. The Two-Model Architecture in llm.py
One detail in llm.py that matters more than it looks: there are two separate LLM calls, using different models.
```python
# call_llm() — student-facing responses
model = 'llama3-8b-8192'           # fast, conversational

# call_llm_structured() — internal logic calls
model = 'llama-3.3-70b-versatile'  # more capable, structured output
```
The split is intentional. Student-facing responses need speed and a natural tone — llama3-8b-8192 handles that well. Internal calls — drift detection, avoidance analysis, attendance pattern parsing, memory overlap scoring — return JSON that feeds filter logic. Those calls use llama-3.3-70b-versatile for reliability on structured output at temperature 0.3.
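The routing decision can be sketched as a small helper. The model ids and the 0.3 temperature for structured calls come from the article; the helper name `pick_model` and the 0.7 chat temperature are my illustrative assumptions:

```python
# Hedged sketch of the two-model split described above.
CHAT_MODEL = 'llama3-8b-8192'            # fast, conversational surface
LOGIC_MODEL = 'llama-3.3-70b-versatile'  # reliable structured output

def pick_model(structured: bool) -> tuple[str, float]:
    # Internal JSON-producing calls get the bigger model at low temperature;
    # student-facing chat gets the fast model (0.7 is an assumed default here).
    return (LOGIC_MODEL, 0.3) if structured else (CHAT_MODEL, 0.7)
```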
This also means a decommissioned llama3-8b-8192 (a real incident we hit earlier in this project) only breaks the chat surface — the internal memory and filter logic, running on the versatile model, stays unaffected. Failure domains are separated by design.
7. The Agent Loop: Five Steps, No Hidden State
The run_agent() function in agent.py is the cleanest part of the codebase. Five explicit steps, no hidden state, fully traceable:
• Step 1: Assemble full context from all modules via assemble_full_context()
• Step 2: Build the prompt via build_prompt() — proactive or responsive mode depending on whether user_input is empty
• Step 3: Call the LLM with the curated context
• Step 4: Retain the interaction to Hindsight memory via _retain()
• Step 5: Return a structured AgentResponse with recommendations, reminders, and proactive flag
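The five steps above can be sketched as a single function. Helper names follow the article; the stub bodies are placeholders so the sketch runs standalone, and the return shape is simplified from the real `AgentResponse`:

```python
# Placeholder stubs so the five-step sketch is self-contained.
def assemble_full_context(student_id): return f'context for {student_id}'
def build_prompt(ctx, user_input, proactive):
    mode = 'PROACTIVE' if proactive else 'RESPONSIVE'
    return f'[{mode}] {ctx} | {user_input}'
def call_llm(prompt): return f'reply to: {prompt}'
def _retain(student_id, content): pass

def run_agent(student_id: str, user_input: str) -> dict:
    context = assemble_full_context(student_id)               # step 1: full context
    proactive = (user_input.strip() == '')                    # empty input -> proactive mode
    prompt = build_prompt(context, user_input, proactive)     # step 2: build prompt
    reply = call_llm(prompt)                                  # step 3: curated LLM call
    _retain(student_id, user_input or '[proactive opening]')  # step 4: retain to memory
    return {'response': reply, 'proactive': proactive}        # step 5: structured response
```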
The proactive mode logic is particularly worth noting. An empty user_input doesn’t return an error — it triggers the agent to open the conversation itself, leading with the most urgent trigger from context. That behavior is entirely driven by what the memory layer surfaces. Without Hindsight managing the retain/recall cycle, proactive mode would either stay silent or repeat the same generic opening on every session load.
8. What Changed After Fixing the Memory Layer
Before and after the Hindsight integration, three metrics moved noticeably in testing:
• Prompt token count: dropped ~65% for returning students with long histories.
• Hallucinated relevance: eliminated. The LLM stopped surfacing past events as current because the recall layer filtered them out before they reached the prompt.
• Proactive message quality: measurably improved. Opening messages referenced specific recent behaviors rather than generic interest tags.
None of these required changes to the LLM call itself. All three were fixed by changing what shape of data reached the prompt — which is entirely a memory architecture decision.
Closing: The Schema Thinking Transfers
The specific implementation here — Pydantic models in models.py, pre-computed fields in database.py, the structured recall query in agent.py — is campus-specific. But the underlying pattern transfers to any agent that interacts with the same user across sessions.
Design the data shapes before writing prompts. Store for recall, not for completeness. Separate internal logic calls from user-facing calls. Make the memory layer explicit and swappable. These decisions are invisible to the user but determine everything about how the agent behaves.
The flat JSON approach works at demo scale. For anything real, you want a proper retain/recall cycle — and the schema thinking applies regardless of what memory backend you use.
Resources
• Hindsight on GitHub
• Hindsight Documentation
• Vectorize Agent Memory


