<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Santiago Fernández de Valderrama Aparicio</title>
    <description>The latest articles on DEV Community by Santiago Fernández de Valderrama Aparicio (@santifer).</description>
    <link>https://dev.to/santifer</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3751115%2F3316341b-9865-4b26-a480-ceb0816c05c2.jpeg</url>
      <title>DEV Community: Santiago Fernández de Valderrama Aparicio</title>
      <link>https://dev.to/santifer</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/santifer"/>
    <language>en</language>
    <item>
      <title>I Built a Multi-Agent Job Search System with Claude Code — 631 Evaluations, 12 Modes</title>
      <dc:creator>Santiago Fernández de Valderrama Aparicio</dc:creator>
      <pubDate>Tue, 17 Mar 2026 10:07:23 +0000</pubDate>
      <link>https://dev.to/santifer/i-built-a-multi-agent-job-search-system-with-claude-code-631-evaluations-12-modes-2cd0</link>
      <guid>https://dev.to/santifer/i-built-a-multi-agent-job-search-system-with-claude-code-631-evaluations-12-modes-2cd0</guid>
      <description>&lt;p&gt;I sold my business after 16 years and went all-in on AI. Week one of the job search: read JDs, map skills, customize CV, fill forms. Everything manual, everything repetitive.&lt;/p&gt;

&lt;p&gt;By week two I stopped applying. I was building the system that would do it for me.&lt;/p&gt;

&lt;p&gt;631 evaluations later, Career-Ops makes more filtering decisions than I do.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Built
&lt;/h2&gt;

&lt;p&gt;A multi-agent system with 12 operational modes, each a Claude Code skill file with its own context and rules. Not a script — an agent that reasons about the problem domain.&lt;/p&gt;

&lt;p&gt;The key architectural choice: &lt;strong&gt;modes over one long prompt&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;career-ops/
├── modes/
│   ├── _shared.md          # North Star archetypes, proof points
│   ├── auto-pipeline.md    # Full pipeline: JD → eval → PDF → tracker
│   ├── oferta.md           # Single-offer evaluation (A-F)
│   ├── batch.md            # Parallel processing with workers
│   ├── pdf.md              # ATS-optimized CV per offer
│   ├── scan.md             # Portal discovery
│   ├── apply.md            # Playwright form-filling
│   └── ... (12 total)
├── reports/                # 631 evaluation files
├── output/                 # Generated PDFs
├── applications.md         # Central tracker
└── scan-history.tsv        # 680 deduplicated URLs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Why modes? Each one loads only the context it needs. &lt;code&gt;auto-pipeline&lt;/code&gt; skips contact rules. &lt;code&gt;apply&lt;/code&gt; skips scoring logic. Less context = better decisions from the LLM.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 10-Dimension Scoring
&lt;/h2&gt;

&lt;p&gt;Every offer runs through a weighted evaluation framework:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;What It Measures&lt;/th&gt;
&lt;th&gt;Weight&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Role Match&lt;/td&gt;
&lt;td&gt;Alignment with CV proof points&lt;/td&gt;
&lt;td&gt;Gate-pass&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Skills Alignment&lt;/td&gt;
&lt;td&gt;Tech stack overlap&lt;/td&gt;
&lt;td&gt;Gate-pass&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Seniority&lt;/td&gt;
&lt;td&gt;Stretch level&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Compensation&lt;/td&gt;
&lt;td&gt;Market rate vs target&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Geographic&lt;/td&gt;
&lt;td&gt;Remote/hybrid feasibility&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Company Stage&lt;/td&gt;
&lt;td&gt;Startup/growth/enterprise fit&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Product-Market Fit&lt;/td&gt;
&lt;td&gt;Problem domain resonance&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Growth Trajectory&lt;/td&gt;
&lt;td&gt;Career ladder visibility&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Interview Likelihood&lt;/td&gt;
&lt;td&gt;Callback probability&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Timeline&lt;/td&gt;
&lt;td&gt;Hiring urgency&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Role Match and Skills Alignment are gate-pass — if they fail, the final score drops regardless of everything else. 74% of evaluated offers scored below 4.0.&lt;/p&gt;
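&lt;p&gt;A minimal Python sketch of how a gate-pass rule can sit on top of a weighted average. The numeric weights, the 1-5 scale, the pass mark, and the cap are all assumptions made for illustration, as is the choice that failing either gate triggers the cap. The article only names the High/Medium/Low tiers and does not say how Career-Ops combines them.&lt;/p&gt;

```python
# Hypothetical numeric weights; the article only names High / Medium / Low tiers.
WEIGHTS = {"High": 3.0, "Medium": 2.0, "Low": 1.0}

# Dimension to weight tier, copied from the table above (gate-pass rows excluded).
TIERS = {
    "seniority": "High",
    "compensation": "High",
    "geographic": "Medium",
    "company_stage": "Medium",
    "product_market_fit": "Medium",
    "growth_trajectory": "Medium",
    "interview_likelihood": "High",
    "timeline": "Low",
}
GATES = ("role_match", "skills_alignment")
GATE_MIN = 3.0  # assumed pass mark on an assumed 1-5 scale
GATE_CAP = 2.5  # assumed ceiling applied when a gate fails


def final_score(scores):
    """Weighted mean of the non-gate dimensions, capped if any gate fails."""
    total_weight = sum(WEIGHTS[tier] for tier in TIERS.values())
    weighted = sum(scores[dim] * WEIGHTS[tier] for dim, tier in TIERS.items())
    base = weighted / total_weight
    gates_pass = all(scores[gate] >= GATE_MIN for gate in GATES)
    return round(base if gates_pass else min(base, GATE_CAP), 2)
```

&lt;p&gt;With this shape, an offer that scores well on compensation and seniority still cannot climb past the cap when Role Match fails, which is the behavior the gate-pass rows describe.&lt;/p&gt;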

&lt;h2&gt;
  
  
  The Pipeline
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;auto-pipeline&lt;/code&gt; is the flagship mode. A URL goes in and runs through six steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Extract JD&lt;/strong&gt; — Playwright navigates to the URL, extracts structured content&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Evaluate 10D&lt;/strong&gt; — Claude reads JD + CV + portfolio, generates scoring&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Generate report&lt;/strong&gt; — Markdown with 6 blocks: summary, CV match, level, comp, personalization, interview probability&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Generate PDF&lt;/strong&gt; — HTML template + keyword injection + Puppeteer render&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Register tracker&lt;/strong&gt; — TSV auto-merge via Node.js script&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dedup&lt;/strong&gt; — Checks 680 URLs in scan-history.tsv. Zero re-evaluations&lt;/li&gt;
&lt;/ol&gt;
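&lt;p&gt;The dedup step can be sketched in a few lines of Python. This is an illustration, not the Career-Ops code: it assumes the URL is the first column of &lt;code&gt;scan-history.tsv&lt;/code&gt;, and the normalization rule (lowercase host, drop query string and trailing slash) is a simplification that would need adjusting for portals that carry the job ID in the query string.&lt;/p&gt;

```python
import csv
from urllib.parse import urlsplit, urlunsplit


def normalize(url):
    """Reduce a URL to a comparison key: lowercase host, no query, fragment, or trailing slash."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/")
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, "", ""))


def load_seen(tsv_file):
    """Read the first column of a TSV history file into a set of normalized URLs."""
    return {normalize(row[0]) for row in csv.reader(tsv_file, delimiter="\t") if row}


def is_new(url, seen):
    """Return True and record the URL if it has not been evaluated yet."""
    key = normalize(url)
    if key in seen:
        return False
    seen.add(key)
    return True
```

&lt;p&gt;Checking &lt;code&gt;is_new&lt;/code&gt; before the extraction step means a repeated URL costs one set lookup instead of a full evaluation.&lt;/p&gt;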

&lt;h2&gt;
  
  
  Batch Processing
&lt;/h2&gt;

&lt;p&gt;For high volume, batch mode launches a conductor that orchestrates parallel workers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# conductor spawns N workers, each an independent Claude Code process&lt;/span&gt;
./batch-runner.sh &lt;span class="nt"&gt;--input&lt;/span&gt; batch/batch-input.tsv &lt;span class="nt"&gt;--workers&lt;/span&gt; 4

&lt;span class="c"&gt;# Each worker:&lt;/span&gt;
&lt;span class="c"&gt;# 1. Claims a URL from the queue (lock file prevents doubles)&lt;/span&gt;
&lt;span class="c"&gt;# 2. Runs auto-pipeline&lt;/span&gt;
&lt;span class="c"&gt;# 3. Writes result to batch-state.tsv&lt;/span&gt;
&lt;span class="c"&gt;# 4. Picks next URL&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;122 URLs processed by parallel workers. Fault-tolerant: a worker failure never blocks the rest. Resumable: the runner reads state and skips completed items.&lt;/p&gt;
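&lt;p&gt;One common way to get the "lock file prevents doubles" behavior is atomic file creation. The Python sketch below shows that technique under assumed names (&lt;code&gt;claim&lt;/code&gt;, &lt;code&gt;pending&lt;/code&gt;, one lock file per item); the real &lt;code&gt;batch-runner.sh&lt;/code&gt; may work differently.&lt;/p&gt;

```python
import os


def claim(lock_dir, item_id):
    """Atomically claim a work item by creating its lock file; False if already claimed."""
    path = os.path.join(lock_dir, f"{item_id}.lock")
    try:
        # O_CREAT | O_EXCL fails if the file already exists, so only one worker wins.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    os.close(fd)
    return True


def pending(queue, done):
    """Resume support: keep only the items not yet recorded as completed."""
    return [item for item in queue if item not in done]
```

&lt;p&gt;A worker would call &lt;code&gt;claim&lt;/code&gt; before starting an item and move on when it returns &lt;code&gt;False&lt;/code&gt;. On restart, &lt;code&gt;pending&lt;/code&gt; filters the input queue against the IDs already present in the state file.&lt;/p&gt;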

&lt;h2&gt;
  
  
  The AI Resume Builder
&lt;/h2&gt;

&lt;p&gt;A generic PDF loses. Career-Ops generates a different ATS-optimized CV for each offer:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Extract 15-20 keywords from the JD&lt;/li&gt;
&lt;li&gt;Detect language (English JD → English CV)&lt;/li&gt;
&lt;li&gt;Detect region (US → Letter, Europe → A4)&lt;/li&gt;
&lt;li&gt;Detect archetype (6 predefined: AI Platform, Agentic, PM, SA, FDE, Transformation)&lt;/li&gt;
&lt;li&gt;Select top 3-4 projects by relevance&lt;/li&gt;
&lt;li&gt;Reorder bullets — most relevant experience moves up&lt;/li&gt;
&lt;li&gt;Render PDF — Puppeteer, self-hosted fonts, single-column ATS-safe&lt;/li&gt;
&lt;/ol&gt;
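&lt;p&gt;Steps 3 and 6 are the most mechanical and can be sketched in Python. The keyword matching below is a deliberately simple assumption (case-insensitive, single-word keywords only); the article does not describe how Career-Ops ranks relevance, and in the real system that judgment presumably comes from the LLM.&lt;/p&gt;

```python
import re

PAPER_BY_REGION = {"US": "Letter", "Europe": "A4"}  # mapping taken from step 3


def keyword_hits(text, keywords):
    """Count how many JD keywords appear in the text as whole words, ignoring case."""
    words = set(re.findall(r"[a-z0-9+#]+", text.lower()))
    return sum(1 for kw in keywords if kw.lower() in words)


def reorder_bullets(bullets, keywords):
    """Stable sort: bullets matching more keywords move up, ties keep their original order."""
    return sorted(bullets, key=lambda bullet: -keyword_hits(bullet, keywords))
```

&lt;p&gt;Because the sort only reorders existing bullets, nothing is added to the CV, which matches the "reformulated, never fabricated" rule.&lt;/p&gt;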

&lt;p&gt;Same CV. 6 different framings. All real — keywords get reformulated, never fabricated.&lt;/p&gt;

&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;p&gt;2 months in production. Real numbers, not demos.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;631&lt;/strong&gt; reports generated&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;68&lt;/strong&gt; applications sent&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;354&lt;/strong&gt; PDFs generated&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;680&lt;/strong&gt; URLs deduplicated&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;0&lt;/strong&gt; re-evaluations&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What I Learned
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Automate analysis, not decisions.&lt;/strong&gt; Career-Ops evaluated 631 offers. I decide which ones get my time. Human-in-the-loop (HITL) is not a limitation; it is the design.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Modes beat a long prompt.&lt;/strong&gt; 12 modes with precise context outperform a 10,000-token system prompt. Starting with one massive prompt was my biggest early mistake: the quality was terrible.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dedup is more valuable than scoring.&lt;/strong&gt; 680 deduplicated URLs mean 680 evaluations I never had to repeat. Boring infrastructure, highest ROI.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A CV is an argument, not a document.&lt;/strong&gt; A generic PDF convinces nobody. A CV that reorganizes proof points by relevance and adapts framing to the archetype — that converts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The system IS the portfolio.&lt;/strong&gt; Building a multi-agent system to search for multi-agent roles is the most direct proof of competence.&lt;/p&gt;

&lt;h2&gt;
  
  
  Stack
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Claude Code&lt;/strong&gt; — LLM agent: reasoning, evaluation, content generation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Playwright&lt;/strong&gt; — Browser automation: portal scanning and form-filling&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Puppeteer&lt;/strong&gt; — PDF rendering from HTML templates&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Node.js&lt;/strong&gt; — Utility scripts: merge-tracker, cv-sync-check&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;tmux&lt;/strong&gt; — Parallel sessions: conductor + workers in batch&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Full case study: &lt;a href="https://santifer.io/career-ops-system" rel="noopener noreferrer"&gt;santifer.io/career-ops-system&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;Has anyone else built tooling for their job search? Curious about different approaches — especially around evaluation frameworks and dedup strategies.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>productivity</category>
      <category>career</category>
      <category>claudecode</category>
    </item>
    <item>
      <title>How I Built an AI Agent That Handled 90% of Customer Requests Without Human Intervention</title>
      <dc:creator>Santiago Fernández de Valderrama Aparicio</dc:creator>
      <pubDate>Tue, 03 Feb 2026 17:31:45 +0000</pubDate>
      <link>https://dev.to/santifer/how-i-built-an-ai-agent-that-handled-90-of-customer-requests-without-human-intervention-4pci</link>
      <guid>https://dev.to/santifer/how-i-built-an-ai-agent-that-handled-90-of-customer-requests-without-human-intervention-4pci</guid>
      <description>&lt;p&gt;In early 2024, I had a problem. My phone repair shop was processing thousands of customer inquiries a month across WhatsApp, phone calls, and walk-ins. My team was drowning in repetitive questions: "Is my phone ready?", "Do you have the screen for iPhone 14?", "Can I book for tomorrow at 5pm?"&lt;/p&gt;

&lt;p&gt;Twelve months later, an AI agent named Jacobo was handling ~90% of those interactions autonomously. Customers got instant answers. My team focused on actual repairs. And when I sold the business in early 2025, the agent was a key part of what made it sellable.&lt;/p&gt;

&lt;p&gt;Here's how I built it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem: Three Channels, One Bottleneck
&lt;/h2&gt;

&lt;p&gt;Santifer iRepair had been running for 16 years when I started this project. We'd already automated the back-office with Airtable — 12 connected databases handling repairs, inventory, invoicing, the works. But customer communication was still manual.&lt;/p&gt;

&lt;p&gt;The pain points:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;WhatsApp&lt;/strong&gt;: Customers expected instant replies. We couldn't deliver.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Phone calls&lt;/strong&gt;: Staff interrupted mid-repair to answer "what's my repair status?"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Booking&lt;/strong&gt;: Back-and-forth messages to find a slot that worked.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I needed something that could talk to customers across channels, understand what they wanted, and actually &lt;em&gt;do things&lt;/em&gt; — not just generate text.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture: A Router With Specialized Sub-Agents
&lt;/h2&gt;

&lt;p&gt;The breakthrough came when I stopped thinking "chatbot" and started thinking "agent orchestration."&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                    ┌─────────────────────┐
                    │   INCOMING REQUEST  │
                    │  (Voice/WhatsApp)   │
                    └──────────┬──────────┘
                               │
                    ┌──────────▼──────────┐
                    │    MAIN ROUTER      │
                    │  (Intent Classifier)│
                    └──────────┬──────────┘
                               │
        ┌──────────────────────┼──────────────────────┐
        │                      │                      │
┌───────▼───────┐    ┌─────────▼────────┐   ┌────────▼────────┐
│ APPOINTMENTS  │    │    DISCOUNTS     │   │     ORDERS      │
│  Sub-Agent    │    │    Sub-Agent     │   │    Sub-Agent    │
└───────┬───────┘    └─────────┬────────┘   └────────┬────────┘
        │                      │                      │
        └──────────────────────┼──────────────────────┘
                               │
                    ┌──────────▼──────────┐
                    │   HITL HANDOFF      │
                    │ (When confidence    │
                    │  is low or          │
                    │  escalation needed) │
                    └─────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Main Router&lt;/strong&gt;: Every incoming message hits the router first. It classifies intent and delegates to the right sub-agent via tool calling. No giant monolithic prompt trying to do everything.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sub-Agents&lt;/strong&gt;: Each one is laser-focused on a single domain:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Appointments&lt;/strong&gt;: Queries available slots from Airtable, handles booking logic, sends confirmation via WhatsApp&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Discounts&lt;/strong&gt;: Pulls customer history, calculates applicable promos, explains the discount&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Orders&lt;/strong&gt;: Validates stock against inventory DB, creates the order, sends ETA notification&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;HITL Handoff&lt;/strong&gt;: When confidence drops below a threshold or the customer explicitly asks for a human, Jacobo escalates, but passes the full conversation context so nobody starts from zero.&lt;/p&gt;
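&lt;p&gt;The routing and handoff logic can be sketched in Python. In Jacobo the intent and confidence come from the LLM and the flow runs in n8n, so everything here is a stand-in: the threshold value, the function names, and the stub sub-agents are assumptions made for illustration.&lt;/p&gt;

```python
CONFIDENCE_THRESHOLD = 0.7  # assumed value; the article does not state the threshold


def appointments(message):
    return "appointments: " + message


def discounts(message):
    return "discounts: " + message


def orders(message):
    return "orders: " + message


# Stub sub-agents keyed by the intent label the classifier is assumed to return.
SUB_AGENTS = {"appointments": appointments, "discounts": discounts, "orders": orders}


def route(message, intent, confidence, history):
    """Delegate to a sub-agent, or hand off to a human with the full conversation."""
    wants_human = intent == "human"
    if wants_human or intent not in SUB_AGENTS or CONFIDENCE_THRESHOLD > confidence:
        return {"handler": "human", "context": history + [message]}
    return {"handler": intent, "reply": SUB_AGENTS[intent](message)}
```

&lt;p&gt;The router holds no domain logic of its own: it either picks a sub-agent or escalates, and the escalation payload always carries the conversation so far.&lt;/p&gt;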

&lt;h2&gt;
  
  
  The Stack
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;LLM&lt;/td&gt;
&lt;td&gt;Claude API&lt;/td&gt;
&lt;td&gt;Best balance of reasoning + tool use at the time&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Orchestration&lt;/td&gt;
&lt;td&gt;n8n&lt;/td&gt;
&lt;td&gt;Visual workflows, easy to debug, self-hosted&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;WhatsApp&lt;/td&gt;
&lt;td&gt;WATI&lt;/td&gt;
&lt;td&gt;Clean WhatsApp Business API wrapper&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Voice&lt;/td&gt;
&lt;td&gt;ElevenLabs&lt;/td&gt;
&lt;td&gt;Natural-sounding Spanish TTS&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Phone&lt;/td&gt;
&lt;td&gt;Aircall&lt;/td&gt;
&lt;td&gt;Cloud PBX with good API&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Backend/DB&lt;/td&gt;
&lt;td&gt;Airtable&lt;/td&gt;
&lt;td&gt;Already our source of truth for everything&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The key insight: &lt;strong&gt;Airtable wasn't just storage — it was the agent's brain.&lt;/strong&gt; Every sub-agent queried Airtable directly. Customer history, inventory levels, appointment slots — all live data, no sync issues.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Technical Decisions (And Why)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Tool Calling Over Prompt Stuffing
&lt;/h3&gt;

&lt;p&gt;Early versions tried to cram everything into the system prompt. "Here's how to check inventory, here's how to book appointments, here are our discount rules..."&lt;/p&gt;

&lt;p&gt;It was brittle. The model would hallucinate discounts or book non-existent slots.&lt;/p&gt;

&lt;p&gt;Tool calling changed everything. Each sub-agent has explicit tools:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;check_available_slots(date, service_type) → returns actual slots
create_booking(customer_id, slot_id) → books or fails with reason
calculate_discount(customer_id, service) → returns applicable promo
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The model reasons about &lt;em&gt;what&lt;/em&gt; to do. The tools handle &lt;em&gt;how&lt;/em&gt;. Clean separation.&lt;/p&gt;
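&lt;p&gt;To make that separation concrete, here is a rough Python sketch of one tool: a JSON-schema description the model sees, plus the function that actually runs. The definition follows the &lt;code&gt;name&lt;/code&gt; / &lt;code&gt;description&lt;/code&gt; / &lt;code&gt;input_schema&lt;/code&gt; layout used for Claude API tools as I understand it. The calendar data, &lt;code&gt;FAKE_CALENDAR&lt;/code&gt;, and &lt;code&gt;run_tool_call&lt;/code&gt; are invented for the example and stand in for the Airtable lookup.&lt;/p&gt;

```python
# What the model sees: a description of the tool and the arguments it accepts.
CHECK_SLOTS_TOOL = {
    "name": "check_available_slots",
    "description": "Return the bookable slots for a date and service type.",
    "input_schema": {
        "type": "object",
        "properties": {
            "date": {"type": "string", "description": "ISO date, e.g. 2025-01-15"},
            "service_type": {"type": "string"},
        },
        "required": ["date", "service_type"],
    },
}

# Stand-in for the Airtable lookup; the data below is invented for the example.
FAKE_CALENDAR = {("2025-01-15", "screen_repair"): ["10:00", "17:00"]}


def check_available_slots(date, service_type):
    """Return real slots from the data source, or an empty list if none exist."""
    return FAKE_CALENDAR.get((date, service_type), [])


TOOLS = {"check_available_slots": check_available_slots}


def run_tool_call(name, arguments):
    """Execute a tool the model asked for; unknown tools fail loudly instead of guessing."""
    if name not in TOOLS:
        raise KeyError(f"unknown tool: {name}")
    return TOOLS[name](**arguments)
```

&lt;p&gt;The model can only offer slots that &lt;code&gt;check_available_slots&lt;/code&gt; returned, which is what removes the "book a non-existent slot" failure mode.&lt;/p&gt;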

&lt;h3&gt;
  
  
  2. Sub-Agent Specialization Over One Big Agent
&lt;/h3&gt;

&lt;p&gt;A single agent handling appointments, discounts, orders, and general FAQs? That's a recipe for confusion.&lt;/p&gt;

&lt;p&gt;Each sub-agent has:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Its own system prompt (focused, ~200 tokens)&lt;/li&gt;
&lt;li&gt;Its own tool set (only what it needs)&lt;/li&gt;
&lt;li&gt;Its own failure modes (easier to debug)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The router is dumb on purpose. It just classifies and delegates. Complexity lives at the edges.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Graceful HITL, Not Graceful Degradation
&lt;/h3&gt;

&lt;p&gt;Some AI systems try to "degrade gracefully" — giving worse answers when uncertain. I took a different approach: &lt;strong&gt;escalate early, escalate with context.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When Jacobo wasn't confident:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Customer got a message: "Let me connect you with the team"&lt;/li&gt;
&lt;li&gt;Staff got a Slack notification with full conversation history&lt;/li&gt;
&lt;li&gt;Average human response time: under 2 minutes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The 10% that needed humans got &lt;em&gt;better&lt;/em&gt; service than before, because staff had full context.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lessons Learned
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Start with the most repetitive task.&lt;/strong&gt; Appointment booking was 40% of all inquiries. Automating that alone bought us massive breathing room.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Your database is your agent's memory.&lt;/strong&gt; Don't build a separate "AI database." Query what you already have. Airtable's API was fast enough for real-time lookups.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tool calling &amp;gt; RAG for transactional tasks.&lt;/strong&gt; Retrieval-augmented generation (RAG) is great for knowledge retrieval. But when you need to &lt;em&gt;do things&lt;/em&gt; (book, order, check status), tool calling is the architecture.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Measure deflection rate, not just accuracy.&lt;/strong&gt; "Did the agent answer correctly?" matters less than "Did the customer get what they needed without human help?" We tracked both.&lt;/p&gt;
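&lt;p&gt;The two metrics are easy to compute side by side. A small Python sketch, assuming each conversation is logged with three hypothetical boolean fields (&lt;code&gt;resolved&lt;/code&gt;, &lt;code&gt;escalated&lt;/code&gt;, &lt;code&gt;answer_correct&lt;/code&gt;); the article does not describe the actual logging schema.&lt;/p&gt;

```python
def rates(conversations):
    """Return (deflection_rate, accuracy) over a list of conversation records."""
    total = len(conversations)
    if total == 0:
        return 0.0, 0.0
    # Deflected: the customer got what they needed and no human stepped in.
    deflected = sum(1 for c in conversations if c["resolved"] and not c["escalated"])
    correct = sum(1 for c in conversations if c["answer_correct"])
    return deflected / total, correct / total
```

&lt;p&gt;A conversation can be answered correctly and still end in escalation, which is why the two numbers diverge and both are worth tracking.&lt;/p&gt;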

&lt;h2&gt;
  
  
  The Outcome
&lt;/h2&gt;

&lt;p&gt;After 12 months in production:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;~90% of customer interactions handled without human intervention&lt;/li&gt;
&lt;li&gt;Staff spent 70% more time on actual repairs&lt;/li&gt;
&lt;li&gt;Customer satisfaction stayed flat (no degradation — that was the goal)&lt;/li&gt;
&lt;li&gt;The system became a selling point when I exited the business&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What I'd Do Differently
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Voice was harder than expected.&lt;/strong&gt; ElevenLabs sounds great, but latency in the voice → transcription → LLM → TTS loop was noticeable. I'd explore tighter integrations if rebuilding today.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;More observability earlier.&lt;/strong&gt; I added proper logging and trace monitoring late in the project. Should've been day one.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Simpler discount logic.&lt;/strong&gt; The discount sub-agent had too many edge cases baked into the prompt. Should've moved more logic into deterministic code and kept the LLM for natural language understanding only.&lt;/p&gt;




&lt;p&gt;Building Jacobo taught me that AI agents aren't magic — they're systems engineering with an LLM in the middle. The LLM handles the messy human language part. Everything else is APIs, databases, and good old-fashioned software architecture.&lt;/p&gt;

&lt;p&gt;The 90% automation wasn't because the AI was brilliant. It was because we picked the right problems, built the right tools, and knew when to hand off to humans.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I'm currently open to AI Product Manager and Forward Deployed Engineer roles. Check my portfolio at &lt;a href="https://santifer.io" rel="noopener noreferrer"&gt;santifer.io&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>automation</category>
      <category>productivity</category>
    </item>
  </channel>
</rss>
