Job Sense AI
Last night I wondered if an agent could learn what not to recommend in a job-matching pipeline; by morning, ours was blacklisting patterns it had only seen fail once—and getting better every run.
What I built here isn’t another resume–job similarity tool. It’s a loop: ingest resumes and job descriptions, generate matches, evaluate those matches, and then feed the failures back into the system so it stops making the same mistake twice.
At a high level, the repo is pretty simple:
main.py wires the pipeline together
matcher/ handles embedding + similarity scoring
agent/ wraps the LLM logic (ranking, reasoning, critique)
memory/ is where Hindsight comes in
evaluation/ defines what “bad recommendation” actually means
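The wiring above can be sketched roughly like this. The class and method names here are hypothetical stand-ins for the real modules, not the repo's actual API:

```python
# Rough sketch of how main.py might wire the stages together.
# All names here are illustrative placeholders, not the repo's real API.

class Pipeline:
    def __init__(self, matcher, agent, memory, evaluator):
        self.matcher = matcher      # matcher/: embedding + similarity
        self.agent = agent          # agent/: LLM ranking, reasoning, critique
        self.memory = memory        # memory/: Hindsight-backed failure store
        self.evaluator = evaluator  # evaluation/: defines "bad recommendation"

    def run(self, job, resumes):
        # 1. Score raw similarity.
        scored = self.matcher.score(job, resumes)
        # 2. Rank with the LLM, informed by past failures in memory.
        ranked = self.agent.rank(job, scored, self.memory)
        # 3. Evaluate the output and feed failures back into memory.
        for failure in self.evaluator.find_failures(job, ranked):
            self.memory.store(failure)
        return ranked
```

The point of the sketch is the last step: evaluation writes back into memory, which the agent reads on the next run.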
The interesting part isn’t matching. It’s what happens after a bad match.
The thing I got wrong about “agent memory”
My initial assumption was embarrassingly common: memory = more context.
So I started with something like this:
```python
def rank_candidates(job, resumes):
    context = build_context(job, resumes)
    return llm.generate_rankings(context)
```
It worked fine for obvious matches. It completely fell apart on edge cases:
Overweighting keyword overlap (“Python” everywhere)
Ignoring disqualifiers buried in experience
Recommending overqualified or irrelevant candidates
My first instinct was to “improve prompts” and “add more examples.” That helped, but it didn’t stick. The same class of mistake kept coming back.
What I actually needed was not better context—but persistent negative feedback.
Turning mistakes into data (with Hindsight)
The shift happened when I integrated Hindsight (linked at the end of this post).
Instead of trying to prevent bad outputs upfront, I let the system fail—and then recorded why it failed.
The core idea: every bad recommendation becomes a structured memory.
```python
def record_failure(job_id, resume_id, reason):
    memory.store({
        "type": "negative_match",
        "job_id": job_id,
        "resume_id": resume_id,
        "reason": reason,
        "timestamp": now()
    })
```
This isn’t just logging. These records are indexed and retrieved later during ranking.
If you haven’t seen it, the Hindsight documentation explains the retrieval model pretty well—but the real insight is what you choose to store.
I don’t store full conversations. I store compressed lessons:
“Rejected: frontend-heavy profile for backend-only role”
“Mismatch: required 5+ years, candidate has 1.5”
“False positive due to keyword overlap (React vs React Native backend tooling)”
That compression step matters more than anything else.
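A minimal sketch of that compression step, assuming a simple template over a structured failure record (in practice you might ask the LLM to do the summarizing; the field names here are assumptions, not the repo's schema):

```python
def compress_failure(failure):
    """Reduce a structured failure record to a one-line lesson.

    Naive template-based sketch. `kind` and `detail` are hypothetical
    field names; a real version might have the LLM write the summary.
    """
    kind = failure.get("kind", "mismatch")
    detail = failure.get("detail", "")
    # Hard length cap: short lessons keep retrieval sharp and context lean.
    return f"{kind.capitalize()}: {detail}"[:120]
```

The cap is the important design choice: a lesson that doesn't fit in one line probably hasn't been compressed enough to be retrievable.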
Injecting “don’t do this again” into the agent
Once failures are stored, the next step is using them during ranking.
Here’s the shape I ended up with:
```python
def rank_with_memory(job, resumes):
    past_failures = memory.retrieve(
        query=job.description,
        filter={"type": "negative_match"},
        top_k=5
    )
    context = build_context(job, resumes, past_failures)
    return llm.generate_rankings(context)
```
The important part is that past_failures are semantically retrieved. I’m not just filtering by job ID—I’m asking:
“What past mistakes look similar to this job?”
This is where something like the Vectorize agent memory layer becomes useful. You’re not building a database—you’re building a memory system that can generalize.
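To make "what past mistakes look similar to this job?" concrete, here is a self-contained sketch of semantic retrieval. It uses a toy bag-of-words embedding and cosine similarity so it runs without a model; a real memory layer would swap in proper embeddings:

```python
import math
from collections import Counter

def embed(text):
    # Toy bag-of-words "embedding" so the sketch runs without a model.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_failures(query, memories, top_k=3):
    # Rank stored failure reasons by similarity to the job description,
    # not by job ID -- so lessons generalize across similar jobs.
    q = embed(query)
    scored = [(cosine(q, embed(m["reason"])), m) for m in memories]
    scored.sort(key=lambda s: s[0], reverse=True)
    return [m for score, m in scored[:top_k] if score > 0]
```

The filter `score > 0` matters: an unrelated lesson injected into context is worse than no lesson at all.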
What surprised me: the agent started arguing with itself
I didn’t expect this, but once failures were injected into context, the agent started doing something interesting:
It began preemptively rejecting candidates before ranking them.
I made that explicit by adding a critique step:
```python
def critique_candidate(job, resume, failures):
    return llm.generate({
        "job": job,
        "resume": resume,
        "past_failures": failures,
        "task": "Should this candidate be rejected? Why?"
    })
```
Then the pipeline became:
Generate candidate scores
Run critique step
Adjust ranking or drop candidates
This effectively turned the agent into a two-pass system:
Pass 1: “Who looks good?”
Pass 2: “Why might this be wrong?”
That second pass is where most improvements came from.
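The two-pass structure can be sketched as a single function. The `score_fn` and `critique_fn` hooks are hypothetical; in this project they would wrap the similarity scorer and the critique prompt above:

```python
def two_pass_rank(candidates, score_fn, critique_fn, penalty=0.5):
    """Pass 1: score everyone. Pass 2: critique, then penalize or drop.

    critique_fn returns (should_drop, should_penalize) for a candidate;
    both hooks are illustrative stand-ins for the real scorer and LLM critique.
    """
    scored = [(score_fn(c), c) for c in candidates]
    adjusted = []
    for score, c in scored:
        drop, flag = critique_fn(c)
        if drop:
            continue  # hard rejection from the critique pass
        adjusted.append((score * penalty if flag else score, c))
    adjusted.sort(key=lambda s: s[0], reverse=True)
    return [c for _, c in adjusted]
```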
The blacklist isn’t static—and that’s the whole point
At some point, I considered just building a rule engine:
If experience < required → reject
If domain mismatch → penalize
If keywords mismatch → drop
But that quickly turns into a brittle mess.
Instead, I let the blacklist emerge dynamically from failures.
A typical memory entry looks like:
```json
{
  "type": "negative_match",
  "embedding": "...",
  "reason": "Candidate has strong frontend experience but no backend systems exposure",
  "job_features": ["backend", "distributed systems"],
  "timestamp": 1710000000
}
```
Over time, the system builds a soft blacklist:
Not explicit rules
Not hard filters
But patterns the agent learns to avoid
This is the part that actually feels like “learning,” even though it’s just retrieval + prompting.
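One way to picture the "soft" part: retrieved failures apply a bounded score penalty rather than a hard filter. This sketch is my illustration of the idea, not the repo's implementation; the feature-overlap heuristic and weights are assumptions:

```python
def soft_penalty(candidate_features, failures, weight=0.15):
    """Penalty derived from retrieved failures, never a hard rejection.

    Each past failure whose job_features overlap this candidate's profile
    shaves a bit off the score. The cap ensures memory can nudge a ranking
    but never zero out a candidate on its own.
    """
    overlap = sum(
        1 for f in failures
        if set(f.get("job_features", [])) & set(candidate_features)
    )
    return min(overlap * weight, 0.6)
```

The cap is what distinguishes a soft blacklist from a rule engine: the agent still sees the candidate, just with a warning attached.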
A concrete before/after
Before adding Hindsight:
Job: Backend Engineer (Go, distributed systems)
Candidate: React-heavy frontend dev with some Node.js
The system ranked this candidate #2.
Why? Keyword overlap: “JavaScript”, “APIs”, “microservices”.
After adding failure memory:
A previous failure stored:
“Frontend-heavy profile incorrectly matched to backend system role”
Now the same candidate:
Gets flagged in critique step
Drops to #7 or removed entirely
Nothing in the model changed. Just the memory.
Where the design broke (and what I changed)
1. Storing too much detail
At first, I stored entire LLM outputs as memory.
Bad idea.
Retrieval got noisy
Context bloated
Signal got diluted
Fix: store one-line reasons, not transcripts.
2. Over-retrieval
I tried feeding 10–15 past failures into context.
That made the agent overly conservative—it started rejecting everything.
```python
past_failures = memory.retrieve(..., top_k=15)  # too much
```
Fix:
```python
past_failures = memory.retrieve(..., top_k=3)
```
Less context, sharper signal.
3. No feedback loop validation
Initially, every failure was treated equally.
But some “failures” were actually debatable (e.g., borderline candidates).
Fix: I added a lightweight scoring layer:
```python
def validate_failure(reason):
    return llm.generate({
        "task": "Is this a valid rejection reason?",
        "reason": reason
    })
```
Only high-confidence failures get stored.
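A sketch of that gate, assuming the judge (e.g. an LLM wrapper like `validate_failure` above) can be reduced to a confidence score in [0, 1]. The threshold and field names are assumptions:

```python
def store_if_confident(memory, failure, judge, threshold=0.8):
    """Persist a failure only if the judge scores it above threshold.

    `judge` is a hypothetical callable mapping a rejection reason to a
    confidence in [0, 1]; `memory` here is just a list for illustration.
    """
    confidence = judge(failure["reason"])
    if confidence >= threshold:
        memory.append({**failure, "confidence": confidence})
        return True
    return False  # debatable failures never pollute the memory
```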
What this system actually feels like to use
It doesn’t feel like a smarter model.
It feels like a model that remembers being wrong.
That’s a subtle but important difference.
It still makes mistakes
But it rarely repeats the same mistake
And when it does, it’s usually because retrieval missed something
The behavior is closer to a junior engineer who keeps notes on what went wrong last time.
Lessons I’d carry forward
Memory is about compression, not accumulation
Storing everything is useless. Storing the right abstraction of failure is what matters.
Negative examples are more valuable than positive ones
“Don’t do this again” shaped behavior more than “do this well.”
Retrieval quality > model quality
A slightly worse model with good failure retrieval outperformed a better model with none.
Two-pass systems are underrated
Generate → critique is dramatically more stable than single-pass ranking.
Don’t build rules when you can build feedback loops
Static rules age poorly. Feedback systems adapt.
If I were to rebuild this
I’d double down on the memory layer earlier.
Specifically:
Better schema for failure types
Explicit clustering of similar mistakes
Decay or pruning of outdated memories
Right now it works—but it’s still naive.
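Of those three, decay is the easiest to sketch. This is a hypothetical pruning pass over the memory store, using the `timestamp` field from the entries shown earlier; the 90-day window is an arbitrary assumption:

```python
import time

def prune_memories(memories, max_age_days=90, now=None):
    """Drop failure records older than max_age_days.

    A stand-in for the decay layer described above; clustering of
    similar mistakes would happen before or alongside this step.
    """
    now = now or time.time()
    cutoff = now - max_age_days * 86400
    return [m for m in memories if m.get("timestamp", 0) >= cutoff]
```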
Closing thought
If you’re building anything that ranks, recommends, or decides—don’t just ask:
“How do I make it better?”
Ask:
“How do I make it remember being wrong?”
That one shift changed this project from a brittle matcher into something that actually improves over time.
And if you want to go deeper into how this kind of memory layer works, it’s worth exploring:
The Hindsight GitHub repository
The Hindsight documentation
The agent memory approach from Vectorize
The implementation details matter—but the bigger idea is simple:
Don’t just build systems that predict.
Build systems that regret—and remember why.