DEV Community

P AASHISH

How I structured logs around Hindsight

“Why did it reject a perfect resume?” I dug into the logs and realized Hindsight had quietly rewritten the agent’s scoring logic based on one bad feedback loop.

What I actually built

This project is a job matching agent that reads resumes, scores candidates, and ranks them against job descriptions. Nothing fancy on the surface: parse resume → extract features → score → return top candidates.

The interesting part is that the scoring logic isn’t fixed.

I wired it up with Hindsight (via its GitHub repository) so the agent could learn from feedback, with signals like:

“This candidate should have been ranked higher”

“This profile is irrelevant despite keyword match”

Instead of retraining a model, I let the agent adapt its behavior by replaying past decisions and corrections.

How the system is structured

At a high level, the code splits into three parts:

  1. Resume ingestion + parsing

  2. Scoring pipeline

  3. Hindsight-backed memory + feedback loop

The scoring pipeline looks roughly like this:

```python
def score_candidate(resume, job_description):
    features = extract_features(resume, job_description)
    base_score = weighted_score(features)
    adjustments = hindsight_adjustments(resume, job_description)
    return base_score + adjustments
```

The key is that hindsight_adjustments isn’t static. It’s derived from past feedback stored and replayed through Hindsight.

Feedback events are stored with context:

```python
event = {
    "resume_id": resume.id,
    "job_id": job.id,
    "original_score": score,
    "feedback": "should_rank_higher",
    "timestamp": now(),
}
```

These get indexed and replayed later when similar candidates show up.

If you’ve read the Hindsight documentation, this is basically using event replay as a lightweight learning layer instead of retraining.
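
To make the store-then-replay idea concrete, here is a minimal in-memory stand-in. Hindsight's actual API differs; the `FeedbackStore` class, its methods, and the id-based similarity check are all my own simplifications of the pattern described above.

```python
import time

# Hypothetical in-memory stand-in for an event store like Hindsight's.
# The real library's API differs; this only sketches store-then-replay.
class FeedbackStore:
    def __init__(self):
        self.events = []

    def record(self, resume_id, job_id, original_score, feedback):
        # Feedback events keep their full context, mirroring the dict above.
        self.events.append({
            "resume_id": resume_id,
            "job_id": job_id,
            "original_score": original_score,
            "feedback": feedback,
            "timestamp": time.time(),
        })

    def similar(self, job_id):
        # Toy "similarity": same job id. Real retrieval would compare
        # resume features, which is exactly where things went wrong later.
        return [e for e in self.events if e["job_id"] == job_id]

store = FeedbackStore()
store.record("r1", "j9", 0.72, "should_rank_higher")
print(len(store.similar("j9")))  # 1 matching event
```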

The bug that made this interesting

Everything seemed fine until I noticed something weird:

A strong candidate—clean experience, perfect keyword match—was consistently ranked low.

At first I thought:

parsing bug?

feature extraction issue?

bad weights?

Nope.

It was Hindsight.

What actually happened

One recruiter had marked a similar resume as “not relevant” earlier. That feedback got stored and replayed.

But the similarity match was too broad.

So now:

New candidate comes in

Hindsight finds a “similar” past event

Applies a negative adjustment

Score drops silently

No logs screamed “this is wrong.” It just looked like the system “decided” differently.

Debugging the feedback loop

I had to explicitly trace how Hindsight was influencing decisions.

I added logging like this:

```python
def hindsight_adjustments(resume, job):
    events = hindsight.retrieve_similar(resume, job)

    for e in events:
        print("Replaying event:", e)

    return aggregate_adjustments(events)
```

That’s when it clicked:

The system wasn’t wrong

It was too eager to generalize

The feedback loop had effectively created a soft rule:

“Candidates like this are bad”

…based on a single data point.

Fixing it without killing learning

I didn’t want to remove Hindsight—it was the whole point.

Instead, I constrained it.

  1. Tightened similarity matching

Instead of loose matching, I added stricter filters:

```python
if similarity_score < 0.85:
    continue
```

This alone reduced a lot of bad carryover.
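
Pulled out of the loop, the gate is just a filter over replayed events. This is a sketch: the `similarity` field and 0.85 cutoff mirror the snippet above, but the function name and event shape are my own.

```python
# Gate replayed feedback by similarity before it can touch a score.
# `events` carry a precomputed similarity to the new candidate; the
# threshold 0.85 is the cutoff mentioned in the text.
SIM_THRESHOLD = 0.85

def filter_events(events, threshold=SIM_THRESHOLD):
    """Drop past feedback that is only loosely similar to the new candidate."""
    return [e for e in events if e["similarity"] >= threshold]

events = [
    {"feedback": "not_relevant", "similarity": 0.62},        # loose match: dropped
    {"feedback": "should_rank_higher", "similarity": 0.91},  # tight match: kept
]
print(filter_events(events))  # only the 0.91 event survives
```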

  2. Weighted feedback by frequency

One-off feedback shouldn’t dominate:

```python
adjustment = feedback_weight * log(1 + occurrence_count)
```

Now repeated signals matter more than isolated ones.
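
Runnable version of that formula (the weight value is illustrative, not from the real system):

```python
import math

def feedback_adjustment(feedback_weight, occurrence_count):
    # log(1 + n) grows slowly: a one-off correction stays small, while
    # the same signal repeated many times compounds into a real shift.
    return feedback_weight * math.log(1 + occurrence_count)

# One isolated signal vs. the same signal seen ten times:
print(feedback_adjustment(-0.1, 1))   # ~ -0.069
print(feedback_adjustment(-0.1, 10))  # ~ -0.240
```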

  3. Scoped feedback by job context

A candidate rejected for one role shouldn’t be penalized globally.

So I started indexing feedback like:

```python
key = (job_role, skill_cluster)
```

instead of just resume similarity.
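
A scoped index can be as simple as a dict keyed by that tuple. This is a hypothetical sketch; the function names and event shape are mine, not Hindsight's.

```python
from collections import defaultdict

# Feedback keyed by (job_role, skill_cluster), so a rejection for one
# role never bleeds into unrelated roles.
feedback_index = defaultdict(list)

def record_feedback(job_role, skill_cluster, event):
    feedback_index[(job_role, skill_cluster)].append(event)

def relevant_feedback(job_role, skill_cluster):
    # Only feedback from the exact same context is ever replayed.
    return feedback_index[(job_role, skill_cluster)]

record_feedback("backend_engineer", "python", {"feedback": "not_relevant"})
print(relevant_feedback("data_scientist", "python"))    # [] -- no cross-role carryover
print(relevant_feedback("backend_engineer", "python"))  # the one stored event
```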

Before vs after

Before:

One bad feedback → affects many future candidates

Silent score shifts

Hard to debug

After:

Feedback only applies in tight contexts

Multiple signals required to shift behavior

Logs clearly show why scores change

Now when a candidate is penalized, I can point to:

specific past events

similarity threshold

adjustment weight
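
In practice that means each adjustment carries its evidence. A minimal sketch of such an explanation record (all field names are my own, not Hindsight's):

```python
def explain_adjustment(events, threshold, weight):
    """Bundle the evidence behind a score change so logs answer 'why'."""
    return {
        "past_event_ids": [e["id"] for e in events],
        "similarity_threshold": threshold,
        "adjustment_weight": weight,
        "total_adjustment": weight * len(events),
    }

explanation = explain_adjustment(
    [{"id": "evt-42"}, {"id": "evt-57"}], threshold=0.85, weight=-0.05
)
print(explanation["past_event_ids"])  # the specific events that caused the penalty
```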

What Hindsight actually gave me

The biggest shift wasn’t accuracy—it was behavior.

Without Hindsight:

The agent is static

Bugs are code bugs

With Hindsight:

The agent evolves

Bugs become behavioral drift

That’s a very different debugging problem.

If you’re curious, the concept is explained well in this agent memory overview on Vectorize.

A concrete example

A recruiter gives feedback:

“This candidate looks good on paper but lacks real project depth.”

That gets stored.

Later, a similar resume comes in:

same keywords

similar experience level

The system:

retrieves past feedback

applies a small negative adjustment

slightly lowers rank

After multiple similar feedback events, the agent starts implicitly learning:

“Keyword match isn’t enough—depth matters.”

No retraining. Just accumulated corrections.
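
Using the log-weighting formula from the fix above, that gradual drift can be simulated. The base score and per-signal weight here are illustrative numbers, not values from the real system:

```python
import math

BASE_SCORE = 0.80
FEEDBACK_WEIGHT = -0.05  # illustrative per-signal weight

def adjusted_score(base, occurrence_count):
    # Same log(1 + n) weighting as before: each repeat of the
    # "lacks project depth" signal lowers the score a little more,
    # but no single event can crater it.
    return base + FEEDBACK_WEIGHT * math.log(1 + occurrence_count)

for n in [0, 1, 3, 8]:
    print(n, round(adjusted_score(BASE_SCORE, n), 3))
```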

What I learned

  1. Feedback loops are brittle by default
    One bad signal can poison future decisions if you don’t gate it carefully.

  2. Similarity is everything
    If your retrieval is loose, your learning is noisy. Tightening similarity improved behavior more than anything else.

  3. Logging matters more than modeling
    I didn’t change the scoring model much. I just made Hindsight visible.

  4. Local context beats global memory
    Scoping feedback to job role + skill cluster made the system far more stable.

  5. “Learning” is just controlled bias accumulation
    Hindsight doesn’t magically learn—it just accumulates past decisions. Your job is to control how that bias spreads.

Would I do this again?

Yes—but with guardrails from day one.

Hindsight is powerful, but it will happily amplify your mistakes if you let it.

If you treat it like:

a suggestion system (not ground truth)

a contextual memory (not global truth)

…it becomes a practical way to make agents adapt without retraining.

Otherwise, you’ll end up debugging why your system rejected a perfect resume—and realizing it was your own feedback loop all along.
