DEV Community: Santiago Fernández de Valderrama Aparicio

I Built a Multi-Agent Job Search System with Claude Code — 631 Evaluations, 12 Modes

Santiago Fernández de Valderrama Aparicio — Tue, 17 Mar 2026 10:07:23 +0000

I sold my business after 16 years and went all-in on AI. Week one of the job search: read job descriptions (JDs), map skills, customize the CV, fill in forms. Everything manual, everything repetitive.

By week two I stopped applying. I was building the system that would do it for me.

631 evaluations later, Career-Ops makes more filtering decisions than I do.

What I Built

A multi-agent system with 12 operational modes, each a Claude Code skill file with its own context and rules. Not a script — an agent that reasons about the problem domain.

The key architectural choice: modes over one long prompt.

career-ops/
├── modes/
│   ├── _shared.md          # North Star archetypes, proof points
│   ├── auto-pipeline.md    # Full pipeline: JD → eval → PDF → tracker
│   ├── oferta.md           # Single-offer evaluation (A-F)
│   ├── batch.md            # Parallel processing with workers
│   ├── pdf.md              # ATS-optimized CV per offer
│   ├── scan.md             # Portal discovery
│   ├── apply.md            # Playwright form-filling
│   └── ... (12 total)
├── reports/                # 631 evaluation files
├── output/                 # Generated PDFs
├── applications.md         # Central tracker
└── scan-history.tsv        # 680 deduplicated URLs

Why modes? Each one loads only the context it needs. auto-pipeline skips contact rules. apply skips scoring logic. Less context = better decisions from the LLM.
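A minimal sketch of what "each mode loads only the context it needs" could look like in plain Python. The `build_context` helper, and the idea of concatenating `_shared.md` with exactly one mode file, are assumptions for illustration; how Claude Code actually loads skill files is not shown in this post.

```python
from pathlib import Path


def build_context(modes_dir: Path, mode: str) -> str:
    """Return the shared context plus exactly one mode file (hypothetical loader)."""
    shared = modes_dir / "_shared.md"
    mode_file = modes_dir / f"{mode}.md"
    if not mode_file.is_file():
        raise FileNotFoundError(f"unknown mode: {mode}")
    parts = []
    if shared.is_file():
        parts.append(shared.read_text(encoding="utf-8"))
    # Only the requested mode is read; every other mode file stays out of the prompt.
    parts.append(mode_file.read_text(encoding="utf-8"))
    return "\n\n".join(parts)
```

Calling `build_context(Path("modes"), "apply")` would return the shared archetypes plus the form-filling rules, and nothing from `oferta.md`.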

The 10-Dimension Scoring

Every offer runs through a weighted evaluation framework:

Dimension	What It Measures	Weight
Role Match	Alignment with CV proof points	Gate-pass
Skills Alignment	Tech stack overlap	Gate-pass
Seniority	Stretch level	High
Compensation	Market rate vs target	High
Geographic	Remote/hybrid feasibility	Medium
Company Stage	Startup/growth/enterprise fit	Medium
Product-Market Fit	Problem domain resonance	Medium
Growth Trajectory	Career ladder visibility	Medium
Interview Likelihood	Callback probability	High
Timeline	Hiring urgency	Low

Role Match and Skills Alignment are gate-pass — if they fail, the final score drops regardless of everything else. 74% of evaluated offers scored below 4.0.
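One way to express the gate-pass idea in code, as a sketch under stated assumptions. The dimension names and weight labels come from the table above; the 0-5 scale, the numeric weights (High=3, Medium=2, Low=1), the gate threshold, and the cap are all assumed values, not the ones Career-Ops uses.

```python
WEIGHTS = {"High": 3.0, "Medium": 2.0, "Low": 1.0}  # assumed numeric weights

# Dimension -> weight label, copied from the table above.
DIMENSIONS = {
    "Role Match": "Gate-pass",
    "Skills Alignment": "Gate-pass",
    "Seniority": "High",
    "Compensation": "High",
    "Geographic": "Medium",
    "Company Stage": "Medium",
    "Product-Market Fit": "Medium",
    "Growth Trajectory": "Medium",
    "Interview Likelihood": "High",
    "Timeline": "Low",
}

GATE_MIN = 3.0  # assumed: a gate dimension scoring below this fails
GATE_CAP = 2.5  # assumed: ceiling on the final score when any gate fails


def final_score(scores: dict[str, float]) -> float:
    """Weighted mean of the non-gate dimensions, capped if any gate fails."""
    total = 0.0
    weight_sum = 0.0
    gate_failed = False
    for name, label in DIMENSIONS.items():
        value = scores[name]
        if label == "Gate-pass":
            gate_failed = gate_failed or value < GATE_MIN
        else:
            total += WEIGHTS[label] * value
            weight_sum += WEIGHTS[label]
    weighted = total / weight_sum
    return round(min(weighted, GATE_CAP) if gate_failed else weighted, 2)
```

In this sketch the gates do not contribute to the weighted mean at all. They only decide whether the result is capped, which is what makes a failed gate override strong scores everywhere else.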

The Pipeline

auto-pipeline is the flagship mode. A URL goes in and six steps run:

1. Extract JD: Playwright navigates to the URL and extracts structured content
2. Evaluate 10D: Claude reads JD + CV + portfolio and generates the scoring
3. Generate report: Markdown with 6 blocks (summary, CV match, level, comp, personalization, interview probability)
4. Generate PDF: HTML template + keyword injection + Puppeteer render
5. Register tracker: TSV auto-merge via Node.js script
6. Dedup: checks the 680 URLs in scan-history.tsv. Zero re-evaluations
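The dedup check can be sketched as a set lookup over normalized URLs. This assumes the URL sits in the first column of `scan-history.tsv` and that stripping `utm_*` parameters, fragments, and trailing slashes is a good enough normalization rule; neither detail is stated in the post, and the function names are hypothetical.

```python
import csv
from pathlib import Path
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Lowercase scheme and host, drop utm_* params, fragment and trailing slash."""
    parts = urlsplit(url.strip())
    query = urlencode(
        [(k, v) for k, v in parse_qsl(parts.query) if not k.lower().startswith("utm_")]
    )
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path.rstrip("/"), query, "")
    )


def load_seen(history: Path) -> set[str]:
    """Read normalized URLs from the first column of a TSV file (assumed layout)."""
    if not history.exists():
        return set()
    with history.open(newline="", encoding="utf-8") as fh:
        return {normalize_url(row[0]) for row in csv.reader(fh, delimiter="\t") if row}


def is_new(url: str, seen: set[str]) -> bool:
    """True if this URL has not been scanned before."""
    return normalize_url(url) not in seen
```

Non-tracking query parameters are kept on purpose, since some job boards put the posting ID in the query string.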

Batch Processing

For high volume, batch mode launches a conductor that orchestrates parallel workers:

# conductor spawns N workers, each an independent Claude Code process
./batch-runner.sh --input batch/batch-input.tsv --workers 4

# Each worker:
# 1. Claims a URL from the queue (lock file prevents doubles)
# 2. Runs auto-pipeline
# 3. Writes result to batch-state.tsv
# 4. Picks next URL

122 URLs processed in parallel. Fault-tolerant: a worker failure never blocks the rest. Resumable — reads state and skips completed items.
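The "lock file prevents doubles" step relies on file creation being atomic. Here is a sketch of that technique in Python, using `os.open` with `O_CREAT | O_EXCL`; the `claim` and `pending` helpers and the one-lock-file-per-URL layout are assumptions, not the contents of `batch-runner.sh`.

```python
import hashlib
import os
from pathlib import Path


def claim(url: str, lock_dir: Path) -> bool:
    """Try to claim a URL; returns True only for the first worker to ask."""
    lock_dir.mkdir(parents=True, exist_ok=True)
    name = hashlib.sha256(url.encode("utf-8")).hexdigest() + ".lock"
    try:
        # O_CREAT | O_EXCL fails atomically if the lock file already exists.
        fd = os.open(lock_dir / name, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    os.close(fd)
    return True


def pending(urls: list[str], done: set[str]) -> list[str]:
    """Resume support: keep only URLs not yet recorded as completed."""
    return [u for u in urls if u not in done]
```

A worker that crashes after claiming leaves its lock behind, so a real runner would also need some way to expire stale locks. That part is left out of this sketch.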

The AI Resume Builder

A generic PDF loses. Career-Ops generates a different ATS-optimized CV for each offer:

1. Extract 15-20 keywords from the JD
2. Detect language (English JD → English CV)
3. Detect region (US → Letter, Europe → A4)
4. Detect archetype (6 predefined: AI Platform, Agentic, PM, SA, FDE, Transformation)
5. Select top 3-4 projects by relevance
6. Reorder bullets: most relevant experience moves up
7. Render PDF: Puppeteer, self-hosted fonts, single-column ATS-safe

Same CV. 6 different framings. All real — keywords get reformulated, never fabricated.
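Two of the steps above, region detection and bullet reordering, are simple enough to sketch. The function names are hypothetical, the A4 default for any region other than the US is an assumption (the post only names US and Europe), and matching is limited to single-word keywords for brevity.

```python
import re


def paper_format(region: str) -> str:
    """US -> Letter; anything else falls back to A4 (assumed default)."""
    return "Letter" if region.strip().upper() == "US" else "A4"


def rank_bullets(bullets: list[str], keywords: list[str]) -> list[str]:
    """Stable sort: bullets that mention more JD keywords move up."""
    wanted = {k.lower() for k in keywords}

    def hits(bullet: str) -> int:
        words = set(re.findall(r"[a-z0-9+#]+", bullet.lower()))
        return len(wanted & words)

    # sorted() is stable, so bullets with equal hit counts keep their original order.
    return sorted(bullets, key=hits, reverse=True)
```

Nothing is rewritten here. The bullets are only reordered, which matches the "reformulated, never fabricated" constraint.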

Results

2 months in production. Real numbers, not demos.

631 reports generated
68 applications sent
354 PDFs generated
680 URLs deduplicated
0 re-evaluations

What I Learned

Automate analysis, not decisions. Career-Ops evaluated 631 offers; I decide which ones get my time. Human-in-the-loop (HITL) is not a limitation; it is the design.

Modes beat a long prompt. 12 modes with precise context outperform a 10,000-token system prompt. My biggest early mistake was starting with one massive prompt: the quality was terrible.

Dedup is more valuable than scoring. 680 deduplicated URLs mean 680 evaluations I never had to repeat. Boring infrastructure, highest ROI.

A CV is an argument, not a document. A generic PDF convinces nobody. A CV that reorganizes proof points by relevance and adapts framing to the archetype — that converts.

The system IS the portfolio. Building a multi-agent system to search for multi-agent roles is the most direct proof of competence.

Stack

Claude Code — LLM agent: reasoning, evaluation, content generation
Playwright — Browser automation: portal scanning and form-filling
Puppeteer — PDF rendering from HTML templates
Node.js — Utility scripts: merge-tracker, cv-sync-check
tmux — Parallel sessions: conductor + workers in batch

Full case study: santifer.io/career-ops-system

Has anyone else built tooling for their job search? Curious about different approaches — especially around evaluation frameworks and dedup strategies.

How I Built an AI Agent That Handled 90% of Customer Requests Without Human Intervention

Santiago Fernández de Valderrama Aparicio — Tue, 03 Feb 2026 17:31:45 +0000

In early 2024, I had a problem. My phone repair shop was processing thousands of customer inquiries a month across WhatsApp, phone calls, and walk-ins. My team was drowning in repetitive questions: "Is my phone ready?", "Do you have the screen for iPhone 14?", "Can I book for tomorrow at 5pm?"

Twelve months later, an AI agent named Jacobo was handling ~90% of those interactions autonomously. Customers got instant answers. My team focused on actual repairs. And when I sold the business in early 2025, the agent was a key part of what made it sellable.

Here's how I built it.

The Problem: Three Channels, One Bottleneck

Santifer iRepair had been running for 16 years when I started this project. We'd already automated the back-office with Airtable — 12 connected databases handling repairs, inventory, invoicing, the works. But customer communication was still manual.

The pain points:

WhatsApp: Customers expected instant replies. We couldn't deliver.
Phone calls: Staff interrupted mid-repair to answer "what's my repair status?"
Booking: Back-and-forth messages to find a slot that worked.

I needed something that could talk to customers across channels, understand what they wanted, and actually do things — not just generate text.

Architecture: A Router With Specialized Sub-Agents

The breakthrough came when I stopped thinking "chatbot" and started thinking "agent orchestration."

                    ┌─────────────────────┐
                    │   INCOMING REQUEST  │
                    │  (Voice/WhatsApp)   │
                    └──────────┬──────────┘
                               │
                    ┌──────────▼──────────┐
                    │    MAIN ROUTER      │
                    │  (Intent Classifier)│
                    └──────────┬──────────┘
                               │
        ┌──────────────────────┼──────────────────────┐
        │                      │                      │
┌───────▼───────┐    ┌─────────▼────────┐   ┌────────▼────────┐
│ APPOINTMENTS  │    │    DISCOUNTS     │   │     ORDERS      │
│  Sub-Agent    │    │    Sub-Agent     │   │    Sub-Agent    │
└───────┬───────┘    └─────────┬────────┘   └────────┬────────┘
        │                      │                      │
        └──────────────────────┼──────────────────────┘
                               │
                    ┌──────────▼──────────┐
                    │   HITL HANDOFF      │
                    │ (When confidence    │
                    │  is low or          │
                    │  escalation needed) │
                    └─────────────────────┘

Main Router: Every incoming message hits the router first. It classifies intent and delegates to the right sub-agent via tool calling. No giant monolithic prompt trying to do everything.

Sub-Agents: Each one is laser-focused on a single domain:

Appointments: Queries available slots from Airtable, handles booking logic, sends confirmation via WhatsApp
Discounts: Pulls customer history, calculates applicable promos, explains the discount
Orders: Validates stock against inventory DB, creates the order, sends ETA notification

HITL (human-in-the-loop) Handoff: When confidence drops below a threshold or the customer explicitly asks for a human, Jacobo escalates, but passes the full conversation context so nobody starts from zero.
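The router, sub-agent, and handoff flow can be sketched in a few lines. Jacobo ran on n8n workflows, so this Python version is an illustration only: the `route` function, the `Intent` type, the `"human"` intent label, and the 0.7 threshold are all assumptions.

```python
from dataclasses import dataclass
from typing import Callable

CONFIDENCE_THRESHOLD = 0.7  # assumed value; the post does not state one


@dataclass
class Intent:
    name: str
    confidence: float


def route(
    message: str,
    history: list[str],
    classify: Callable[[str], Intent],
    agents: dict[str, Callable[[str], str]],
) -> dict:
    """Delegate to a sub-agent, or hand off with the full conversation."""
    intent = classify(message)
    agent = agents.get(intent.name)
    wants_human = intent.name == "human"
    if wants_human or agent is None or intent.confidence < CONFIDENCE_THRESHOLD:
        # Escalate with context so staff do not start from zero.
        return {"handled_by": "human", "context": history + [message]}
    return {"handled_by": intent.name, "reply": agent(message)}
```

The router holds no domain logic. It classifies, checks confidence, and delegates, so each failure can be traced to either the classifier or one sub-agent.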

The Stack

Component	Tool	Why
LLM	Claude API	Best balance of reasoning + tool use at the time
Orchestration	n8n	Visual workflows, easy to debug, self-hosted
WhatsApp	WATI	Clean WhatsApp Business API wrapper
Voice	ElevenLabs	Natural-sounding Spanish TTS
Phone	Aircall	Cloud PBX with good API
Backend/DB	Airtable	Already our source of truth for everything

The key insight: Airtable wasn't just storage — it was the agent's brain. Every sub-agent queried Airtable directly. Customer history, inventory levels, appointment slots — all live data, no sync issues.

Key Technical Decisions (And Why)

1. Tool Calling Over Prompt Stuffing

Early versions tried to cram everything into the system prompt. "Here's how to check inventory, here's how to book appointments, here's our discount rules..."

It was brittle. The model would hallucinate discounts or book non-existent slots.

Tool calling changed everything. Each sub-agent has explicit tools:

check_available_slots(date, service_type) → returns actual slots
create_booking(customer_id, slot_id) → books or fails with reason
calculate_discount(customer_id, service) → returns applicable promo

The model reasons about what to do. The tools handle how. Clean separation.
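To make the split between "what" and "how" concrete, here is a sketch of two of those tools. The `name` / `description` / `input_schema` shape follows the Claude API tool format; the descriptions, the in-memory `SLOTS` store, and the `run_tool` dispatcher are hypothetical stand-ins for the real Airtable-backed workflows.

```python
# Tool definitions the model sees (descriptions are illustrative).
TOOLS = [
    {
        "name": "check_available_slots",
        "description": "Return open appointment slots for a date and service type.",
        "input_schema": {
            "type": "object",
            "properties": {
                "date": {"type": "string", "description": "ISO date"},
                "service_type": {"type": "string"},
            },
            "required": ["date", "service_type"],
        },
    },
    {
        "name": "create_booking",
        "description": "Book a slot for a customer, or fail with a reason.",
        "input_schema": {
            "type": "object",
            "properties": {
                "customer_id": {"type": "string"},
                "slot_id": {"type": "string"},
            },
            "required": ["customer_id", "slot_id"],
        },
    },
]

# Hypothetical in-memory store: slot_id -> customer_id, or None if free.
SLOTS = {"slot-1": None, "slot-2": "cust-9"}


def check_available_slots(date: str, service_type: str) -> list[str]:
    """Return only slots that exist and are free (date filtering omitted)."""
    return [slot for slot, owner in SLOTS.items() if owner is None]


def create_booking(customer_id: str, slot_id: str) -> dict:
    """Book the slot, or fail with an explicit reason."""
    if slot_id not in SLOTS:
        return {"ok": False, "reason": "unknown slot"}
    if SLOTS[slot_id] is not None:
        return {"ok": False, "reason": "slot already taken"}
    SLOTS[slot_id] = customer_id
    return {"ok": True, "slot_id": slot_id}


HANDLERS = {
    "check_available_slots": check_available_slots,
    "create_booking": create_booking,
}


def run_tool(name: str, arguments: dict):
    """Execute the tool the model asked for; unknown names are rejected."""
    if name not in HANDLERS:
        return {"ok": False, "reason": f"unknown tool: {name}"}
    return HANDLERS[name](**arguments)
```

Because `create_booking` checks the store itself, the model cannot book a non-existent slot even if it asks for one. It just gets a failure reason it can relay to the customer.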

2. Sub-Agent Specialization Over One Big Agent

A single agent handling appointments, discounts, orders, and general FAQs? That's a recipe for confusion.

Each sub-agent has:

Its own system prompt (focused, ~200 tokens)
Its own tool set (only what it needs)
Its own failure modes (easier to debug)

The router is dumb on purpose. It just classifies and delegates. Complexity lives at the edges.

3. Graceful HITL, Not Graceful Degradation

Some AI systems try to "degrade gracefully" — giving worse answers when uncertain. I took a different approach: escalate early, escalate with context.

When Jacobo wasn't confident:

Customer got a message: "Let me connect you with the team"
Staff got a Slack notification with full conversation history
Average human response time: under 2 minutes

The 10% that needed humans got better service than before, because staff had full context.

Lessons Learned

Start with the most repetitive task. Appointment booking was 40% of all inquiries. Automating that alone bought us massive breathing room.

Your database is your agent's memory. Don't build a separate "AI database." Query what you already have. Airtable's API was fast enough for real-time lookups.

Tool calling > RAG for transactional tasks. RAG is great for knowledge retrieval. But when you need to do things — book, order, check status — tool calling is the architecture.

Measure deflection rate, not just accuracy. "Did the agent answer correctly?" matters less than "Did the customer get what they needed without human help?" We tracked both.
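The difference between the two metrics is easier to see in code. This sketch assumes each conversation is labeled with three booleans; the `Conversation` fields are hypothetical and do not reflect how the shop actually logged its data.

```python
from dataclasses import dataclass


@dataclass
class Conversation:
    answered_correctly: bool  # the agent's answers were factually right
    resolved: bool            # the customer got what they needed
    escalated: bool           # a human had to step in


def deflection_rate(convs: list[Conversation]) -> float:
    """Share of conversations resolved with no human involved."""
    if not convs:
        return 0.0
    deflected = sum(1 for c in convs if c.resolved and not c.escalated)
    return deflected / len(convs)


def accuracy(convs: list[Conversation]) -> float:
    """Share of conversations where the agent's answers were correct."""
    if not convs:
        return 0.0
    return sum(1 for c in convs if c.answered_correctly) / len(convs)
```

A conversation can be accurate and still count against deflection, for example when the agent answers correctly but the customer needs a human anyway.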

The Outcome

After 12 months in production:

~90% of customer interactions handled without human intervention
Staff spent 70% more time on actual repairs
Customer satisfaction stayed flat (no degradation — that was the goal)
The system became a selling point when I exited the business

What I'd Do Differently

Voice was harder than expected. ElevenLabs sounds great, but latency in the voice → transcription → LLM → TTS loop was noticeable. I'd explore tighter integrations if rebuilding today.

More observability earlier. I added proper logging and trace monitoring late in the project. Should've been day one.

Simpler discount logic. The discount sub-agent had too many edge cases baked into the prompt. Should've moved more logic into deterministic code and kept the LLM for natural language understanding only.
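A sketch of what "deterministic code" for discounts might look like. The loyalty tiers below are placeholder rules invented for illustration, not the shop's real promos, and the `Customer` type is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Customer:
    previous_repairs: int


# Placeholder rules, NOT the shop's real promos: (min previous repairs, percent off).
# Ordered from the most generous tier down, so the first match wins.
LOYALTY_TIERS = [(5, 15), (2, 10)]


def calculate_discount(customer: Customer, price_cents: int) -> dict:
    """Pure function: the same inputs always produce the same promo."""
    percent = next(
        (pct for min_repairs, pct in LOYALTY_TIERS
         if customer.previous_repairs >= min_repairs),
        0,
    )
    amount = price_cents * percent // 100
    return {
        "percent": percent,
        "amount_cents": amount,
        "total_cents": price_cents - amount,
    }
```

With the rules in a table like this, the LLM only has to identify the customer and the service, then explain the result. Edge cases become unit tests instead of prompt paragraphs.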

Building Jacobo taught me that AI agents aren't magic — they're systems engineering with an LLM in the middle. The LLM handles the messy human language part. Everything else is APIs, databases, and good old-fashioned software architecture.

The 90% automation wasn't because the AI was brilliant. It was because we picked the right problems, built the right tools, and knew when to hand off to humans.

I'm currently open to AI Product Manager and Forward Deployed Engineer roles. Check my portfolio at santifer.io.