Sanjana S

Hindsight improved consistency in career advice

“This skill wasn’t even in the resume.”
That’s what made me stop and check the logs.

The agent had flagged a skill the user never explicitly mentioned in their latest input. At first, it looked like a hallucination. But when I traced it back through the Hindsight logs, it turned out the system had picked it up from a project the user added earlier.

That’s when I realized the system wasn’t guessing — it was remembering.

What I built

I built a simple AI career advisor to help students track their skills, projects, and internship applications. The idea wasn’t new. Plenty of tools already generate resume feedback or suggest roles.

The difference was small but intentional:
I didn’t want each interaction to be isolated.

Instead of treating every input as a fresh start, I wanted the system to build context over time and use that context when giving advice.

The stack is minimal:

  • Streamlit for the UI
  • An LLM API for generating responses
  • A memory layer using Hindsight

The first two parts are standard. The third one is where things changed.

The problem with skills

In the initial version, I relied directly on user input.

```python
user_input = {
    "skills": ["Python", "Machine Learning"],
    "projects": ["Sentiment analysis app"]
}
```

Based on this, the agent would generate recommendations like:

  • “Apply for machine learning internships”
  • “Strengthen your deep learning knowledge”

It looked fine. But it wasn’t reliable.

The issue showed up quickly when I tried edge cases. If a user added a skill without meaningful experience, the system would fully trust it. A single keyword could completely change the direction of the advice.

Someone could write “Embedded Systems” after blinking an LED once, and suddenly the agent would start recommending firmware roles.

That’s when the flaw became obvious:

Skills shouldn’t come from what users say — they should come from what they actually do.

Logging actions with Hindsight

To move away from static input, I started logging user actions as events using Hindsight.

For example:

```python
hindsight.log_event(
    user_id=user_id,
    event_type="project_added",
    payload={
        "title": "ESP32 LED Blink",
        "tech": ["ESP32", "C"]
    }
)
```

And for applications:

```python
hindsight.log_event(
    user_id=user_id,
    event_type="internship_applied",
    payload={
        "role": "Frontend Intern",
        "result": "rejected"
    }
)
```

Instead of a snapshot, I now had a timeline.

If you look at the Hindsight documentation, this approach is closer to event-based memory than traditional chat history. You store structured actions and reconstruct context when needed.

Deriving skills instead of trusting them

Once events were in place, I stopped using the skills field directly. Instead, I derived skills from user activity.

```python
events = hindsight.get_events(user_id)

skills = set()
for event in events:
    if event.type == "project_added":
        skills.update(event.payload.get("tech", []))
```

This change was small in code but significant in behavior.

Now:

  • Skills are based on projects and actions
  • Repeated usage reinforces certain skills
  • Random additions don’t immediately affect output

It also made the system stricter. Users couldn’t just add a keyword and expect instant changes.
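The "repeated usage reinforces skills" part can be expressed as a simple frequency threshold. This is my own sketch, not Hindsight's API: `derive_skills` and the dict-shaped events are illustrations of the idea.

```python
from collections import Counter

def derive_skills(events, min_occurrences=2):
    """Count how often each technology appears across project events,
    and only treat a skill as established once it recurs."""
    counts = Counter()
    for event in events:
        if event["type"] == "project_added":
            counts.update(event["payload"].get("tech", []))
    return {skill for skill, n in counts.items() if n >= min_occurrences}

events = [
    {"type": "project_added", "payload": {"tech": ["Python", "Flask"]}},
    {"type": "project_added", "payload": {"tech": ["Python", "SQL"]}},
    # A one-off embedded project stays below the threshold.
    {"type": "project_added", "payload": {"tech": ["Embedded Systems"]}},
]
```

With `min_occurrences=2`, only the repeated Python work counts as a skill; the single embedded project does not flip the advice.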

Using memory in prompts

The next challenge was deciding how much memory to include.

Passing all events into the prompt didn’t work. It introduced noise and made responses inconsistent.

So I limited retrieval:

```python
relevant_events = hindsight.query(
    user_id=user_id,
    limit=10
)

context = format_events(relevant_events)
```

Then:

```python
response = llm(context + user_query)
```

This approach keeps things manageable:

  • Only recent or relevant events are included
  • The prompt stays within limits
  • The model gets enough context to adjust responses
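For completeness, `format_events` is nothing fancy. A minimal sketch, assuming events arrive as plain dicts (the shape here is my assumption, not Hindsight's exact return type):

```python
def format_events(events):
    """Render recent events as compact, model-readable lines."""
    lines = []
    for event in events:
        payload = ", ".join(f"{k}={v}" for k, v in event["payload"].items())
        lines.append(f"- {event['type']}: {payload}")
    return "Recent activity:\n" + "\n".join(lines) + "\n\n"
```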

For a broader view of how this fits into agent systems, the Vectorize agent memory page gives a good overview.

What actually changed

Before adding memory:

  • The system reacted only to current input
  • Skills were static and user-defined
  • Same input produced the same output

After adding Hindsight:

  • The system considers past behavior
  • Skills evolve over time
  • Same input produces different output depending on history

Example:

Before:
“Apply for machine learning internships.”

After:
“You’ve listed ML, but your projects don’t reflect it yet.”

The difference is subtle, but important. The second response is grounded in history, not just input.

Where it got interesting

The most interesting behavior showed up when inputs conflicted.

For example:

  • A user builds backend-heavy projects
  • Applies to frontend roles and gets rejected
  • Adds “React” as a skill

A stateless system would immediately switch to frontend recommendations.

With Hindsight, the response became more cautious:

“You’ve recently added React, but your past work is still backend-focused. Consider building a frontend project before applying again.”

This wasn’t hardcoded. It emerged from combining:

  • Event history
  • Derived skills
  • Prompt logic
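The prompt logic behind that caution can be sketched as a comparison between declared and derived skills. This `build_advice_prompt` helper is a hypothetical illustration of the pattern, not the exact code:

```python
def build_advice_prompt(declared_skills, derived_skills, user_query):
    """Flag declared skills the project history doesn't back up,
    so the model hedges instead of pivoting immediately."""
    unproven = set(declared_skills) - set(derived_skills)
    parts = [f"Skills backed by projects: {sorted(derived_skills)}"]
    if unproven:
        parts.append(
            f"Declared but not yet demonstrated: {sorted(unproven)}. "
            "Be cautious about recommendations that rely on these."
        )
    parts.append(f"User question: {user_query}")
    return "\n".join(parts)
```

Given derived skills of `{"Python"}` and a declared `"React"`, the prompt explicitly marks React as unproven, which is what nudges the model toward the "build a frontend project first" style of answer.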

A subtle bug memory exposed

Before memory, resume feedback was simple:

```python
def generate_resume_feedback(resume_text):
    return llm(resume_text)
```

After integrating Hindsight:

```python
def generate_resume_feedback(resume_text, user_id):
    events = hindsight.get_events(user_id)
    context = summarize(events)
    return llm(context + resume_text)
```

The same resume started getting slightly different feedback over time.

At first, I thought it was randomness. But it turned out the system was incorporating past outcomes:

  • Rejections → more critical feedback
  • Improved projects → more positive suggestions

This made responses more contextual, but also harder to debug.
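The `summarize` helper is where past outcomes leak into feedback. A rough sketch of what such a summary could look like, again using assumed dict-shaped events rather than Hindsight's actual objects:

```python
def summarize(events):
    """Condense outcomes into a short context string, so rejections
    and new projects tilt the tone of resume feedback."""
    rejections = sum(
        1 for e in events
        if e["type"] == "internship_applied"
        and e["payload"].get("result") == "rejected"
    )
    projects = sum(1 for e in events if e["type"] == "project_added")
    return (
        f"The user has {projects} logged project(s) and "
        f"{rejections} rejected application(s). "
        "Calibrate feedback accordingly.\n\n"
    )
```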

Tradeoffs

Adding memory improved consistency, but introduced friction.

Debugging got harder
You can no longer reproduce behavior from the input alone; you need the event history too.

Data structure matters
If events aren’t consistent, everything breaks.

Prompt balancing is tricky
Too much context adds noise. Too little removes value.

Behavior needs explanation
If responses change, users need to know why.

What I learned

  • Don’t rely on user-declared skills
  • Store structured events early
  • Derive state instead of trusting input
  • Retrieve selectively, not everything
  • Make reasoning visible in responses

What I’d change next

If I extend this further, I’d focus on:

  • Adding recency weighting
  • Tracking richer outcomes (interviews, shortlists)
  • Introducing a skill confidence score

Right now, skills are binary. In reality, they should evolve gradually.
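Recency weighting and a confidence score combine naturally. An exponential-decay sketch of what this could look like (the half-life, the formula, and the `timestamp` field are assumptions, not something I've shipped):

```python
import math
import time

def skill_confidence(events, skill, half_life_days=90, now=None):
    """Each project using a skill adds weight; older projects count
    less. Maps total weight into [0, 1) via 1 - exp(-weight)."""
    now = now or time.time()
    total = 0.0
    for event in events:
        if event["type"] != "project_added":
            continue
        if skill not in event["payload"].get("tech", []):
            continue
        age_days = (now - event["timestamp"]) / 86400
        total += 0.5 ** (age_days / half_life_days)
    return 1 - math.exp(-total)
```

A skill used in a project yesterday scores higher than one last touched six months ago, and a skill with no projects at all scores zero, which is exactly the gradual evolution the binary version lacks.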

Closing

The biggest improvement didn’t come from better prompts or a better model.

It came from changing how the system thinks about memory.

Once it stopped reacting to single inputs and started using patterns over time, the advice became more consistent.

Not perfect — but grounded enough to be useful.

And that turned out to matter more than anything else.
