DEV Community

Theerthashree J R

I stopped guessing skills after Hindsight logs

“This skill wasn’t even in the resume.” The agent flagged it anyway, and digging into Hindsight logs showed it came from an earlier project entry.

I built a simple AI career advisor to help students track their skills, projects, and internship applications. The idea wasn’t new—there are already tools that give resume feedback or suggest roles—but most of them treat each interaction as isolated. You paste your resume, get suggestions, and that’s it.

I wanted something slightly different: a system that remembers what you’ve actually done over time and uses that to guide advice.

The stack is simple. A Streamlit frontend, an LLM API for generating responses, and a memory layer using Hindsight. The only part that really changed how the system behaves is the last one.

The problem with “skills”

The first version of the system was straightforward. Take user input, pass it to the model, return suggestions.

user_input = {
    "skills": ["Python", "Machine Learning"],
    "projects": ["Sentiment analysis app"]
}

From this, the agent would generate things like:

“Apply for ML internships”
“Strengthen deep learning knowledge”
“Your profile is strong in AI”

Nothing technically wrong—but also not very reliable.

The issue became obvious when testing edge cases. If a user added a skill without meaningful experience, the system fully trusted it. Someone could write “Embedded Systems” after blinking an LED once, and the agent would start recommending firmware roles.

That’s when I realized the core flaw: skills shouldn’t come from what users say—they should come from what they consistently do.

Adding memory with Hindsight

Instead of treating each session independently, I started logging user actions as events.

For example:

hindsight.log_event(
    user_id=user_id,
    event_type="project_added",
    payload={
        "title": "ESP32 LED Blink",
        "tech": ["ESP32", "C"]
    }
)

And for applications:

hindsight.log_event(
    user_id=user_id,
    event_type="internship_applied",
    payload={
        "role": "Frontend Intern",
        "result": "rejected"
    }
)

This builds a timeline of user activity instead of a single snapshot.

If you haven’t come across it, the Hindsight GitHub repository and its documentation are useful for understanding how this kind of event-based memory works. It’s closer to event logging + retrieval than traditional chat history.
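Hindsight's internals aren't shown here, but the pattern itself is easy to sketch. Below is a minimal, hypothetical stand-in (not Hindsight's actual implementation) that mirrors the same call shapes — an append-only list of events per user, read back oldest first:

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class Event:
    event_type: str
    payload: dict[str, Any]

@dataclass
class EventLog:
    """Append-only event log keyed by user — a toy version of the pattern."""
    _events: dict[str, list[Event]] = field(default_factory=dict)

    def log_event(self, user_id: str, event_type: str, payload: dict) -> None:
        # Events are only ever appended, never edited — that's the point.
        self._events.setdefault(user_id, []).append(Event(event_type, payload))

    def get_events(self, user_id: str) -> list[Event]:
        # Return the full timeline, oldest first.
        return list(self._events.get(user_id, []))

log = EventLog()
log.log_event("u1", "project_added", {"tech": ["ESP32", "C"]})
log.log_event("u1", "internship_applied", {"role": "Frontend Intern", "result": "rejected"})
print(len(log.get_events("u1")))  # 2
```

The append-only constraint is what makes the later "derive, don't trust" step possible: state is always recomputed from history, never stored directly.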

Deriving skills instead of trusting them

Once I had events, I stopped using the skills field directly. Instead, I derived skills from what users had actually built.

events = hindsight.get_events(user_id)

skills = set()
for event in events:
    if event.type == "project_added":
        skills.update(event.payload.get("tech", []))

This small change had a noticeable impact:

Skills are based on evidence (projects), not claims
Repeated usage naturally reinforces certain skills
Irrelevant or unused skills stop influencing recommendations

It also made the system slightly stricter. Users couldn’t just add a keyword and expect different suggestions immediately.
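The "repeated usage reinforces skills" point can be made concrete by counting mentions instead of collecting a flat set. A sketch, assuming events arrive as simple (type, payload) tuples — the skill_claimed event type is hypothetical, shown only to illustrate that claims don't count:

```python
from collections import Counter

def derive_skill_counts(events):
    """Count how often each technology appears across project events."""
    counts = Counter()
    for event_type, payload in events:
        if event_type == "project_added":
            counts.update(payload.get("tech", []))
    return counts

events = [
    ("project_added", {"tech": ["Python", "Flask"]}),
    ("project_added", {"tech": ["Python", "SQL"]}),
    ("skill_claimed", {"name": "React"}),  # a bare claim contributes nothing
]
print(derive_skill_counts(events)["Python"])  # 2
```

A count instead of a set also gives you a natural ranking: skills used across many projects outweigh one-off mentions.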

Using memory in prompts

The next problem was deciding how much memory to include in the prompt.

Dumping all events didn’t work—it quickly became noisy and inconsistent. Instead, I limited retrieval:

relevant_events = hindsight.query(
    user_id=user_id,
    limit=10
)

context = format_events(relevant_events)

Then passed it to the model:

response = llm(context + user_query)

This keeps things manageable:

Only recent or relevant events are included
The prompt stays within reasonable size
The model gets enough context to adjust its response
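The format_events helper used above isn't shown in the article; here is one possible sketch of it, assuming events are dicts with type and payload keys. It renders each event as one compact line, which keeps the prompt dense without dumping raw JSON at the model:

```python
def format_events(events):
    """Render structured events as compact, model-readable lines."""
    lines = []
    for event in events:
        etype = event["type"]
        payload = event["payload"]
        if etype == "project_added":
            lines.append(f"- Built a project using {', '.join(payload.get('tech', []))}")
        elif etype == "internship_applied":
            lines.append(f"- Applied for {payload.get('role')} ({payload.get('result', 'pending')})")
    return "\n".join(lines)

events = [
    {"type": "project_added", "payload": {"tech": ["ESP32", "C"]}},
    {"type": "internship_applied", "payload": {"role": "Frontend Intern", "result": "rejected"}},
]
print(format_events(events))
```

Unknown event types are silently skipped here, which doubles as a cheap filter against the noise problem mentioned above.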

For a broader explanation of how this fits into agent systems, the Vectorize agent memory page gives a good overview.

What actually changed

Before adding memory:

The system reacted only to current input
Skills were static and user-defined
Same input → same output

After adding Hindsight:

The system considers past behavior
Skills evolve over time
Same input → different output (depending on history)

Example:

Before:
“Apply for machine learning internships.”

After:
“You’ve listed ML, but your projects don’t reflect it yet.”

This difference is small in wording but important in behavior. The second response is grounded in history, not just input.

Where it got interesting

One unexpected effect was how the system handled contradictions.

For example:

A user builds backend projects
Applies to frontend roles and gets rejected
Adds “React” as a skill

A stateless system would immediately switch to frontend recommendations.

With Hindsight, the response became more cautious:

“You’ve recently added React, but your project history is still backend-focused. Consider building a frontend project before applying again.”

This wasn’t explicitly programmed as a rule. It emerged from combining event history with prompt logic.
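The caution emerged from the prompt, but the underlying signal — a declared skill with no project evidence behind it — is cheap to compute explicitly too. A sketch (this gap check is my addition, not part of the original system):

```python
def unsupported_skills(declared, events):
    """Return declared skills that no project event backs up."""
    built_with = set()
    for event_type, payload in events:
        if event_type == "project_added":
            built_with.update(payload.get("tech", []))
    # Skills the user claims but has never actually used in a project
    return sorted(set(declared) - built_with)

events = [
    ("project_added", {"tech": ["Node.js", "PostgreSQL"]}),
    ("internship_applied", {"role": "Frontend Intern", "result": "rejected"}),
]
print(unsupported_skills(["React", "Node.js"], events))  # ['React']
```

Surfacing this set directly in the prompt ("the user claims React but has never built with it") makes the cautious behavior deterministic rather than emergent.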

A subtle bug memory exposed

Before adding memory, resume feedback was simple:

def generate_resume_feedback(resume_text):
    return llm(resume_text)

After integrating Hindsight:

def generate_resume_feedback(resume_text, user_id):
    events = hindsight.get_events(user_id)
    context = summarize(events)
    return llm(context + resume_text)

The same resume started getting slightly different feedback over time.

At first, I thought it was randomness from the model. But it turned out the system was incorporating past outcomes:

If previous applications were rejected → feedback became more critical
If projects improved → feedback became more positive

This made responses feel more contextual, but also harder to debug.

Tradeoffs

Adding memory improved relevance, but introduced some friction.

  1. Harder debugging
    You can’t reproduce outputs with just input anymore—you need the full event history.

  2. Data structure matters
    If events aren’t consistent (e.g., different formats for projects), retrieval breaks.

  3. Prompt balancing is tricky
    Too much history → noisy responses
    Too little → no improvement over stateless version

  4. Behavior feels inconsistent without explanation
    Users don’t always understand why responses change unless you explicitly reference past actions.
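One way to soften the debugging problem is to snapshot everything that produced a given output — the query plus the exact events that went into the prompt — so a response can be reproduced later even after the live history has grown. A sketch:

```python
import json

def snapshot(user_query, events):
    """Serialize everything needed to reproduce one response."""
    record = {"query": user_query, "events": events}
    # sort_keys makes snapshots byte-stable, so they can be diffed
    return json.dumps(record, sort_keys=True)

def replay(blob):
    """Rebuild the exact prompt inputs from a snapshot."""
    record = json.loads(blob)
    return record["query"], record["events"]

blob = snapshot("What should I apply to?", [["project_added", {"tech": ["C"]}]])
query, events = replay(blob)
print(query)  # What should I apply to?
```

With snapshots written to disk per response, "why did it say that?" becomes a replay instead of an archaeology session through a moving event log.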

What I learned

  1. User input is unreliable as ground truth
    Treat it as a hint, not a fact.

  2. Event-based memory is more useful than long prompts
    Structured logs are easier to query and reason about.

  3. You don’t need complex models for better behavior
    Memory + simple logic already improves output quality.

  4. Keep event types minimal
    A few consistent types (project_added, internship_applied) work better than many vague ones.

  5. Explain the reasoning to users
    Referencing past actions makes the system feel more consistent.

What I’d change next

If I extend this further, I’d focus on:

Adding weights (recent activity > older activity)
Tracking outcomes better (shortlisted, interviews, not just rejected)
Introducing a skill confidence score instead of a flat list

Right now, skills are either present or not. In reality, they should exist on a spectrum.
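The confidence-score idea can be sketched with exponential decay: each project mention of a skill contributes weight that shrinks with age. The decay formula and the 90-day half-life are my assumptions, not from the original system:

```python
from datetime import datetime, timedelta

def skill_confidence(events, now, half_life_days=90):
    """Score skills by recency-weighted project mentions."""
    scores = {}
    for timestamp, event_type, payload in events:
        if event_type != "project_added":
            continue
        age_days = (now - timestamp).days
        weight = 0.5 ** (age_days / half_life_days)  # halves every 90 days
        for tech in payload.get("tech", []):
            scores[tech] = scores.get(tech, 0.0) + weight
    return scores

now = datetime(2024, 6, 1)
events = [
    (now - timedelta(days=10), "project_added", {"tech": ["Python"]}),
    (now - timedelta(days=365), "project_added", {"tech": ["Python", "C"]}),
]
scores = skill_confidence(events, now)
print(scores["Python"] > scores["C"])  # True
```

This gives the spectrum the article asks for: a skill used last week scores near 1.0 per mention, one untouched for a year fades toward zero instead of sitting in a flat list forever.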

Closing

The biggest improvement didn’t come from better prompts or a different model. It came from adding memory and using it properly.

Once the system stopped trusting a single input and started looking at patterns over time, the advice became more grounded. Not perfect—but more consistent than before.

That was enough to make it useful.
