Why I got frustrated with AI job search tools and built my own

Prateek Puri — Tue, 26 May 2026 19:14:40 +0000

A few months ago I built an AI resume tool to help with my own job search.
The first version worked — it scraped jobs, scored them, and could
generate a tailored resume per posting in about thirty seconds. I shipped
it. I used it.

After two weeks I noticed something: I was rewriting the same paragraphs
over and over. Every resume I generated for a research role would have the
same generic, third-person, marketing-flavored paragraph. I'd rewrite it to
be active and voice-led, the way I actually talk about it. Then I'd do it
again on the next resume. And the next. The system had no idea I'd already
fixed this exact paragraph fifteen times.

I'd built a memory layer for this. It wasn't working.

This post is about why, and the architecture I ended up with after
deleting most of v1.

The original architecture

The system had three things:

A single-shot resume generator — one big LLM call that produced the entire resume given the user's profile, the target job, and a strategic plan.
A writing_memory table that stored "writing rules" extracted from edit diffs by another LLM call.
A confidence-scored injection layer that pulled rules above 0.6 confidence into future generation prompts.

The flow looked roughly like this:

User edits → background task feeds (before, after) to a classifier
LLM → classifier returns {rule_text: "use 'built' instead of 'utilized'", category: "word_choice", confidence: 0.5} → rule sits
at confidence 0.5 in the database → after 3 reinforcements the rule
crosses 0.6 and starts appearing in future prompts.

This is a perfectly reasonable design. It just doesn't address the
actual problem.
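
A minimal sketch of the confidence gate described above. The 0.5 starting value and the 0.6 threshold come from the flow; the per-reinforcement step of 0.04 and the names `WritingRule`, `reinforce`, and `rules_for_prompt` are assumptions, chosen only so that the third reinforcement is the one that crosses the threshold.

```python
from dataclasses import dataclass

INJECTION_THRESHOLD = 0.6   # from the post: rules above this reach the prompt
INITIAL_CONFIDENCE = 0.5    # from the post: what the classifier returned
REINFORCEMENT_STEP = 0.04   # ASSUMPTION: picked so reinforcement #3 crosses 0.6


@dataclass
class WritingRule:
    rule_text: str
    category: str
    confidence: float = INITIAL_CONFIDENCE

    def reinforce(self) -> None:
        """Bump confidence when the same rule is extracted again."""
        self.confidence = min(1.0, self.confidence + REINFORCEMENT_STEP)


def rules_for_prompt(rules: list[WritingRule]) -> list[str]:
    """Return only the rules confident enough to be injected."""
    return [r.rule_text for r in rules if r.confidence > INJECTION_THRESHOLD]
```

Under these assumptions a rule corrected once or twice is stored but never reaches a prompt, which is the "captured but forgetful" behavior the next section describes.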

Four reasons it underfit

1. Style rules need repetition to surface.

The 0.6 confidence threshold required ~3 reinforcements before a rule
fired. A user who corrected "leveraged" → "used" once and didn't get a
chance to do it twice more saw the rule sit at 0.5 and never appear in
prompts. The system felt forgetful even when it had captured the
right signal.

I could have lowered the threshold, but lowering it makes the
classifier's noise dominate.

2. Style rules are the wrong abstraction.

Even when a rule fired correctly ("avoid passive constructions"), it
couldn't capture the user's preferred phrasing of a specific
accomplishment. My hand-tuned MUSE description from the Cohere
application — concrete, voice-matched, exactly the phrasing I wanted
— was thrown away and re-generated from scratch every time I
generated a new resume. The LLM had no way to know that text existed.

This is the difference between "the user prefers active verbs" (a
style rule) and "here's the exact paragraph the user wrote about MUSE
last time, and the time before that" (per-entity content). The second
one is what I actually wanted.

3. Single-shot generation papered over redundancy.

The pre-rewrite generation prompt told the LLM:

"If an accomplishment appears in both selected_research and an
experience bullet, the bullet MUST say something completely
different."

This is an instruction. LLMs follow instructions inconsistently. Every
few generations, the bullet would just paraphrase the research
description for the same accomplishment, and I'd have to fix it
manually. Not a bug — the LLM was doing the best it could with a
soft constraint.

4. The classifier was noisy.

The extraction LLM happily generated rules from any edit, including
one-off content rewrites that weren't generalizable. Rules like
"Replace 'qualitative research at RAND' with 'manual qualitative
research workflow'" would show up — that's a content edit, not a style
rule, but the classifier didn't know the difference.

The mental model that emerged

After staring at the failure modes for a while, I realized two memory
needs were collapsing into one layer:

| Layer | Question it answers | Granularity | When it helps |
| --- | --- | --- | --- |
| Style | "What's the user's general voice?" | Whole-resume | Cold start: no prior version of this entity exists yet |
| Content | "What's the user's preferred phrasing for this accomplishment / employer / skill bucket?" | Per-entity | Warm: user has touched this entity before |

writing_memory only addressed the first. There was no system for the
second — and that's where most of the frustration lived.

Two-tier memory

I added a new table, content_memory, that stores the user's
hand-tuned final text per entity, keyed on the underlying domain
object:

| Entity type | Key | What's stored |
| --- | --- | --- |
| `research_description` | `accomplishment_id` | The user's prose for that research entry |
| `experience_bullets_set` | `employer_key` (e.g. `rand_corporation`) | The full final array of bullets for that employer |
| `skill_bucket` | `ai_systems` / `data_science` / `engineering` | The user's curated comma-separated list |
| `summary` / `tagline` | `__scalar__` | The user's text |

The unique constraint is (entity_type, entity_key, source_doc_id) —
that combination matters. Within one resume, every edit on the same
entity overwrites the same row. So a session of three FINRA-bullet edits
collapses to a single final-state record, not three diff records.
Across resumes, the same entity (say, rand-muse research description)
accumulates a row per resume.
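
The overwrite-within-a-resume, accumulate-across-resumes behavior falls out of an upsert on that unique constraint. Here is a sketch using SQLite so it runs standalone; the column names `final_text` and `updated_at` and the helpers `remember` and `history` are assumptions, not the project's actual schema or API.

```python
import sqlite3

# ASSUMPTION: column names are illustrative; only the unique key is from the post.
SCHEMA = """
CREATE TABLE content_memory (
    entity_type   TEXT NOT NULL,
    entity_key    TEXT NOT NULL,
    source_doc_id TEXT NOT NULL,
    final_text    TEXT NOT NULL,
    updated_at    INTEGER NOT NULL,
    UNIQUE (entity_type, entity_key, source_doc_id)
)
"""

UPSERT = """
INSERT INTO content_memory
    (entity_type, entity_key, source_doc_id, final_text, updated_at)
VALUES (?, ?, ?, ?, ?)
ON CONFLICT (entity_type, entity_key, source_doc_id)
DO UPDATE SET final_text = excluded.final_text,
              updated_at = excluded.updated_at
"""


def remember(conn, entity_type, entity_key, source_doc_id, final_text, ts):
    """Store the final state of an entity; later edits in the same doc overwrite."""
    conn.execute(UPSERT, (entity_type, entity_key, source_doc_id, final_text, ts))


def history(conn, entity_type, entity_key):
    """Exact lookup by domain key, newest first: one row per source resume."""
    rows = conn.execute(
        "SELECT final_text FROM content_memory "
        "WHERE entity_type = ? AND entity_key = ? "
        "ORDER BY updated_at DESC",
        (entity_type, entity_key),
    )
    return [row[0] for row in rows]
```

Calling `remember` three times for the same entity and resume leaves one row holding the last text; calling it once more with a different `source_doc_id` adds a second row, and `history` returns both, newest first.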

That accumulated history is the corpus the agent grounds on at
generation time. After ingesting twelve past resumes, my MUSE
accomplishment had eleven different hand-tuned versions in
content_memory, each tagged with the role I wrote it for:

Anthropic Research Engineer: "Designed and built an internal human-AI research platform…"
Cohere Lead Data Scientist: "Replaced a manual qualitative research workflow with an AI-assisted system…"
Cohere MTS Data Analysis: "Designed and developed a platform to replace largely manual…"

These don't get copied verbatim into new resumes. They go into the
prompt as soft grounding:

## Your past hand-tuned versions for this content
Use these as grounding for tone, phrasing, and emphasis. Do NOT copy
verbatim — adapt to the current job context.

### Most recent (written for: Senior ML Engineer @ Anthropic)
Designed and built an internal human-AI research platform...

### Earlier (written for: Lead Data Scientist @ Cohere)
Replaced a manual qualitative research workflow...

The agent now sees, for any given entity, the user's actual voice
across multiple roles, with the role context attached. New
generations adapt content to the new role but inherit voice from the
corpus. That single change made the second generation of any given
resume feel dramatically more like me.
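
A sketch of how a grounding block like the one above could be rendered from stored versions. `PastVersion`, `build_grounding_block`, and the `limit` parameter are hypothetical names, and the header wording is lightly abridged from the block quoted above.

```python
from dataclasses import dataclass


@dataclass
class PastVersion:
    text: str
    role: str      # e.g. "Lead Data Scientist"
    company: str   # e.g. "Cohere"


HEADER = (
    "## Your past hand-tuned versions for this content\n"
    "Use these as grounding for tone, phrasing, and emphasis. Do NOT copy\n"
    "verbatim; adapt to the current job context."
)


def build_grounding_block(versions: list[PastVersion], limit: int = 3) -> str:
    """Render past versions (assumed newest first) as a soft-grounding section.

    Returns an empty string when there is no history, so cold-start entities
    get no block at all. ASSUMPTION: `limit` caps how many versions are shown.
    """
    if not versions:
        return ""
    parts = [HEADER]
    for i, v in enumerate(versions[:limit]):
        label = "Most recent" if i == 0 else "Earlier"
        parts.append(f"### {label} (written for: {v.role} @ {v.company})\n{v.text}")
    return "\n\n".join(parts)
```

Keeping the role and company in each heading is what lets the model see which emphasis was chosen for which kind of job, rather than a flat pile of paraphrases.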

writing_memory stayed — but as a fallback signal that handles
genuinely abstract style preferences (banned words, sentence-form
rules) for entities that don't have content_memory rows yet. Its
extraction was narrowed to only fire on summary, tagline, and skills
edits, since those are the paths where abstract style is actually the
right granularity.

Staged generation, with structural cross-section dedup

Memory was half the problem. The other half was that single-shot
generation made the LLM coordinate five sections in one go — and the
"same accomplishment can't appear in both research and bullets"
constraint was an instruction it would honor maybe 80% of the time.

I split the single shot into a pipeline:

1. Strategic plan
2. Selection (which accomplishments / employers go into the resume)
3. Parallel: research entries + skill buckets + publications selection
4. Parallel: bullets per employer — receives the FINALIZED research
   entries as part of its input
5. Critic — audits the assembled draft and flags specific issues
   per entity
6. Refiner — re-runs ONLY flagged entities with critic notes
7. Summary + tagline — synthesizes the post-refiner draft

Stage 4 is the key change. The bullet generator literally receives the
finalized research entries as part of its prompt, with explicit
instruction: "For any bullet whose accomplishment_id appears above,
take a different angle." The cross-section dedup constraint became
data-flow-driven instead of instruction-following — which means it
actually holds.
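
The data-flow dependency between stages 3 and 4 can be sketched as follows. The `LLM` callable type, the prompt wording for stage 3, and the function names are assumptions for illustration; the real pipeline lives in resume_pipeline.py and may be structured differently.

```python
import asyncio
from typing import Awaitable, Callable

LLM = Callable[[str], Awaitable[str]]  # hypothetical: prompt in, text out


async def generate_research(llm: LLM, accomplishment_ids: list[str]) -> dict[str, str]:
    """Stage 3: one call per selected accomplishment, run concurrently."""
    texts = await asyncio.gather(
        *(llm(f"Write the research entry for {aid}.") for aid in accomplishment_ids)
    )
    return dict(zip(accomplishment_ids, texts))


def bullets_prompt(employer_key: str, research: dict[str, str]) -> str:
    """Stage 4 prompt: the finalized research entries are part of the input."""
    finalized = "\n".join(f"- [{aid}] {text}" for aid, text in research.items())
    return (
        f"Finalized research entries:\n{finalized}\n\n"
        f"Write bullets for employer {employer_key}. For any bullet whose "
        "accomplishment_id appears above, take a different angle."
    )


async def generate_bullets(
    llm: LLM, employer_keys: list[str], research: dict[str, str]
) -> dict[str, str]:
    """Stage 4: can only start once stage 3 has returned its final text."""
    texts = await asyncio.gather(
        *(llm(bullets_prompt(key, research)) for key in employer_keys)
    )
    return dict(zip(employer_keys, texts))
```

Because `generate_bullets` takes `research` as an argument, the bullet prompt cannot be built without the finalized entries; the ordering is enforced by the function signature, not by a sentence in a prompt.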

Stage 6 (refiner) was equally important. An earlier design had a
critic that rewrote the whole draft. The problem: my hand-tuned
phrasings (now memorized in content_memory and used as grounding)
would get silently rewritten by the critic, undoing the memory work.
Targeted refinement preserves phrasings the critic doesn't flag.
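
A sketch of what "targeted" means here, assuming the draft is a mapping from entity key to text and the critic returns notes only for the entities it flags. `refine` and the `rewrite` callback are hypothetical names.

```python
from typing import Callable


def refine(
    draft: dict[str, str],
    critic_notes: dict[str, str],
    rewrite: Callable[[str, str], str],
) -> dict[str, str]:
    """Re-run only flagged entities; copy everything else through untouched.

    `rewrite(text, note)` stands in for the LLM call that regenerates one
    entity using the critic's note.
    """
    return {
        key: rewrite(text, critic_notes[key]) if key in critic_notes else text
        for key, text in draft.items()
    }
```

Unflagged text is never passed to the model at all, so a grounded phrasing cannot drift unless the critic explicitly names that entity.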

Every leaf LLM call's full prompt + response gets dumped to
output/traces/{trace_id}/{stage}.txt. When something comes out
weird, the trace files are the audit trail.
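
A sketch of a trace writer that follows that path layout. The function name `write_trace` and the section markers inside the file are assumptions; only the output/traces/{trace_id}/{stage}.txt path comes from the post.

```python
from pathlib import Path

TRACE_ROOT = Path("output/traces")


def write_trace(
    trace_id: str,
    stage: str,
    prompt: str,
    response: str,
    root: Path = TRACE_ROOT,
) -> Path:
    """Dump one LLM call's full prompt and response to {root}/{trace_id}/{stage}.txt."""
    path = root / trace_id / f"{stage}.txt"
    path.parent.mkdir(parents=True, exist_ok=True)
    # ASSUMPTION: the in-file layout is illustrative, not the project's format.
    path.write_text(
        f"=== PROMPT ===\n{prompt}\n\n=== RESPONSE ===\n{response}\n",
        encoding="utf-8",
    )
    return path
```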

The unexpected fight: voice mirroring

After the pipeline split landed, I generated a fresh resume against
the same Surge job. The research description came back as:

"Qualitative research at my firm moved from a mostly manual workflow
to a blended human-AI system that can support structured coding,
thematic synthesis, and policy analysis at production scale."

Passive. Subject-the-work, not subject-the-actor. Awkward.

But the grounding block had shown the LLM eleven previous versions,
all leading with active verbs. What was happening?

The system prompt had this rule:

"Sentence 1: the transformation — what's different now because
this work exists."

The model had resolved the conflict between "transformation-led"
and the grounding examples by going passive ("research moved from"),
which technically honors transformation framing. The grounding was
losing to the explicit rule.

The fix was a single paragraph added to the leaf prompt:

"If a 'Your past hand-tuned versions' block appears in the user
message, those are the candidate's own prior phrasings of THIS SAME
accomplishment. Treat them as the source of TRUTH for voice and
style. MIRROR the opening verb structure. Do NOT switch to passive
transformation framings — that is not the candidate's voice and
will be rejected."

After that, the same prompt produced:

"Designed and built a blended human-AI research platform for
structured coding, thematic synthesis, and policy analysis that
turned messy qualitative inputs into reliable, reviewable outputs..."

That sounds like me writing about my work. The grounding finally won.

This is the kind of thing that's hard to anticipate before you ship.
You build the architecture, you wire up the data, and then a soft
prompt rule and a hard data signal fight, and the soft rule wins
because LLMs really do try to follow instructions.

Catching this in CI: semantic eval

I didn't want the next prompt change I made to silently break voice
mirroring again. That meant building an eval suite — but not the
typical "did the function return the expected value" kind. Voice
quality isn't unit-testable.

I wrote a multi-turn eval that simulates a user clicking on a section,
asking for a rewrite, then iterating with feedback. Three scenarios:

Tighten a bullet → tighten more → end with the scale figure.
Rewrite a research description with explicit "lead with 'Replaced'" feedback, then drop the duplicate last sentence.
Clean up a skills bucket with cross-bucket dedup feedback.

Each turn is graded by a separate LLM call acting as judge,
returning structured pass/fail per axis:

respects_instruction — did the rewrite actually do what the user asked?
no_fabrication — are all facts in the new value present in the underlying accomplishment data?
differs_from_prior — is the rewrite materially different from the prior turn? (Trivial whitespace tweaks count as fail.)
voice_matches_grounding — when past versions are shown, does the new value mirror their opening-verb pattern?
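
A sketch of turning the judge's structured reply into per-axis results. The JSON shape (axis name mapped to "pass" or "fail") and the names `TurnVerdict` and `parse_verdict` are assumptions; the four axis names are the ones listed above. Axes the judge omits are treated as not applicable, since voice_matches_grounding only applies when past versions are shown.

```python
import json
from dataclasses import dataclass

AXES = (
    "respects_instruction",
    "no_fabrication",
    "differs_from_prior",
    "voice_matches_grounding",
)


@dataclass
class TurnVerdict:
    results: dict[str, bool]  # axis name -> passed?

    @property
    def failed_axes(self) -> list[str]:
        return [axis for axis, ok in self.results.items() if not ok]


def parse_verdict(raw: str) -> TurnVerdict:
    """Parse one judge reply. Anything other than the string "pass" is a fail."""
    data = json.loads(raw)
    return TurnVerdict(
        {axis: data[axis] == "pass" for axis in AXES if axis in data}
    )
```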

The eval is in CI. PRs that touch prompts, the agent, or the memory
layer trigger a workflow that boots a Postgres service container,
seeds the DB from a fictional sample profile, runs the eval, and
posts the per-turn judge output as a PR comment. The job fails if
the aggregate pass rate drops below 80%.
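
The gate itself is small. A sketch, assuming every graded check has already been reduced to a boolean; `pass_rate` and `gate` are hypothetical names, and the 80% floor is the one stated above.

```python
PASS_RATE_FLOOR = 0.80  # from the post: the job fails below this


def pass_rate(checks: list[bool]) -> float:
    """Fraction of graded checks that passed."""
    if not checks:
        raise ValueError("no graded checks to aggregate")
    return sum(checks) / len(checks)


def gate(checks: list[bool], floor: float = PASS_RATE_FLOOR) -> int:
    """Return a process exit code: 0 if the pass rate meets the floor, else 1."""
    rate = pass_rate(checks)
    print(f"pass rate: {sum(checks)}/{len(checks)} = {rate:.1%}")
    return 0 if rate >= floor else 1
```

A CI script would end with something like `sys.exit(gate(checks))`, so a single failed check still shows up in the PR comment without failing the job, while a broad regression does fail it.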

Current pass rate on the three scenarios: 26 / 27 graded checks.
The one failure is a real finding the eval is supposed to catch — on a
skill-bucket edit the agent silently added "Claude Code" (not in the
user's whitelist). The judge correctly flagged "no_fabrication: fail."

This is the kind of regression that would otherwise ship invisibly.
Unit tests alone can't catch it. The eval is what lets me push prompt
changes without holding my breath.

What this is and isn't

It's tempting to call this "AI memory" and reach for vector embeddings.
I didn't. Every entity here has a stable ID — accomplishment_id,
employer_key, bucket name — and the right key is the domain object,
not a similarity search. Embeddings would be the natural reach if
we were doing fuzzy retrieval. We're not. We're doing exact lookup.

It's also tempting to make the past versions a verbatim cache and
just splice them in. I didn't do that either — past versions inform
voice; they don't replace text. The split between soft grounding
(in prompts) and explicit "Past versions ▼" UI swap (deterministic
text replacement) matters. The user can adapt content to a new role
while inheriting voice; they can also choose to slot in a past
version verbatim if that's what they want. Two different mechanisms
for two different needs.

And it's not provider-locked. The whole stack runs against OpenAI by
default and against Anthropic / Ollama via their OpenAI-compat
endpoints. The factory is ~50 lines. Most of the design carries
over to any chat-completion API.
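
A sketch of what such a factory might look like, returning only the connection settings an OpenAI-style client needs. `ProviderConfig` and `provider_config` are hypothetical names, and the base URLs and environment variable names are assumptions based on each provider's commonly documented defaults; check them against current provider documentation before relying on them.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class ProviderConfig:
    base_url: str
    api_key: str
    model: str


# ASSUMPTION: commonly documented OpenAI-compatible base URLs; verify before use.
_BASE_URLS = {
    "openai": "https://api.openai.com/v1",
    "anthropic": "https://api.anthropic.com/v1/",
    "ollama": "http://localhost:11434/v1",
}

# ASSUMPTION: conventional environment variable names for hosted providers.
_KEY_ENV = {"openai": "OPENAI_API_KEY", "anthropic": "ANTHROPIC_API_KEY"}


def provider_config(provider: str, model: str) -> ProviderConfig:
    """Resolve base URL and API key for a provider; the model name is passed in."""
    if provider not in _BASE_URLS:
        raise ValueError(f"unknown provider: {provider!r}")
    env_var = _KEY_ENV.get(provider)
    # A local Ollama server needs no real key, but OpenAI-style clients
    # generally expect a non-empty placeholder.
    api_key = os.environ.get(env_var, "") if env_var else "ollama"
    return ProviderConfig(_BASE_URLS[provider], api_key, model)
```

The returned `base_url` and `api_key` can then be handed to whichever OpenAI-compatible client the rest of the stack uses, which is what keeps the pipeline code itself provider-agnostic.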

What I'd do differently

A few things I'd revisit if I were starting over:

Bullet-set memory at the bullet level. Today an entire employer's bullet array is one row, keyed on employer_key. Adding one new bullet overwrites the whole row. Synthetic stable bullet_ids would let unchanged bullets persist across edits.
Per-role-family scoping. The user might want different memorized phrasings for ML-research vs. applied-AI roles. Today the agent just sees the job_context JSON inline and pattern-matches.
Semantic eval earlier. I built the eval suite after I'd shipped the staged pipeline. Several iterations of prompt-tuning could have been validated mechanically instead of by re-reading generations and squinting.

Code

If you want to read the implementation:

Memory schema: backend/app/models/content_memory.py
Capture path: _learn_from_inline_edit
Generation pipeline: backend/app/ai/resume_pipeline.py
Eval suite: backend/scripts/eval/eval_focused_edit.py
Full design doc: docs/MEMORY_DESIGN.md

Mirror, the tool this post describes, is MIT-licensed. It's designed for local single-user deployment
— no auth, no hosted service. If you have a different application
where you'd want to remember a user's voice across sessions, the
architecture transfers cleanly: stable entity IDs, soft grounding, a
critic that flags rather than rewrites, and an eval suite that catches
regressions that unit tests can't see.

— Prateek

DEV Community: Prateek Puri