DEV Community: Akram Bakhouche

Prompt caching in production: the 4 patterns that cut my Anthropic bill (and when not to bother)

Akram Bakhouche — Thu, 28 May 2026 16:14:58 +0000

The first month I ran Career-OS in production, the Anthropic bill was bigger
than my coffee budget. After I wired prompt caching properly into the scorer,
the drafter, and the digest, it dropped under it. Same calls. Same model.
Same outputs. Roughly an 80% cost reduction in one afternoon.

Prompt caching is the single highest-leverage knob in the Claude SDK. It's
also the one I see misconfigured most often in client code — usually because
people read the docs, slap cache_control on something, and assume they're
caching when they're not.

Here are the 4 patterns I ship in production, with the cost math, and the 4
cases where caching genuinely does not help so you don't waste a day on it.

What prompt caching actually does

The mechanics, in three lines, because you need to know this to use it right:

A cached block (added with "cache_control": { "type": "ephemeral" }) is stored on Anthropic's side after the first call. Subsequent calls with an identical cached block hit the cache instead of re-processing.
First call to a cache block costs 1.25× the base input price (cache write). Every subsequent call within the TTL costs 0.1× the base price (cache read). The break-even is at the second call.
Default TTL is 5 minutes, refreshed each time the cache is read. A 1-hour TTL is available, with writes billed at 2× the base input price instead of 1.25×. Plan for the TTL: it shapes the entire pattern.

If your workload calls the same prompt twice within 5 minutes, caching pays
off. If you call it once an hour with no warmup, you're paying the write
penalty for nothing.

Pattern 1 — Cache the system block

The pattern everyone reaches for first, and the one that gives the biggest
win in 90% of cases.

// app/api/agent/route.ts
import Anthropic from "@anthropic-ai/sdk";

const claude = new Anthropic();

export async function POST(req: Request) {
  const { question } = await req.json();

  const reply = await claude.messages.create({
    model: "claude-sonnet-4-6",
    max_tokens: 1024,
    system: [
      {
        type: "text",
        text: SYSTEM_PROMPT,                    // 2,400 tokens of context
        cache_control: { type: "ephemeral" },   // ← the magic
      },
    ],
    messages: [{ role: "user", content: question }],
  });

  return Response.json({ answer: reply.content });
}

The math, for a 2,400-token system prompt called 100 times in 5 minutes (the
realistic shape of a busy support endpoint):

Without caching: 100 × 2,400 × $3/M input = $0.72
With caching: 1 × 2,400 × $3.75/M (write) + 99 × 2,400 × $0.30/M (read) = $0.08
Savings: ~89%.

The break-even is between the 1st and 2nd call. After call 2 you're already
ahead. After call 100 you've cut roughly 89% of the input cost for that
prompt.
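
The arithmetic above generalizes into a small helper. This is a sketch under the prices quoted in this post ($3/M base input, 1.25× write, 0.1× read); the function name and parameters are illustrative, and it assumes every call after the first lands inside the TTL.

```python
def caching_costs(calls: int, prefix_tokens: int, base_per_mtok: float = 3.00,
                  write_mult: float = 1.25, read_mult: float = 0.10) -> dict:
    """Compare input cost for a shared prefix with and without caching.

    Assumes one cache write followed by (calls - 1) warm reads, calls >= 1.
    """
    per_token = base_per_mtok / 1_000_000
    uncached = calls * prefix_tokens * per_token
    cached = prefix_tokens * per_token * (write_mult + (calls - 1) * read_mult)
    return {
        "uncached": uncached,
        "cached": cached,
        "savings": 1 - cached / uncached,
    }


result = caching_costs(calls=100, prefix_tokens=2_400)
print(f"uncached ${result['uncached']:.2f}, cached ${result['cached']:.2f}, "
      f"savings {result['savings']:.0%}")
```

With the defaults this reproduces the $0.72, $0.08, and 89% figures. Calling it with `calls=1` shows the other side of the trade: a single call costs 1.25× the uncached price.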

Cache hits are silent. The API returns cache_creation_input_tokens and
cache_read_input_tokens in the usage block. Log them. If you're not seeing
reads, you're not caching:

console.log({
  cache_write: reply.usage.cache_creation_input_tokens,
  cache_read:  reply.usage.cache_read_input_tokens,
  uncached:    reply.usage.input_tokens,
});

A single dashboard tile showing cache_read / (cache_read + uncached) tells
you whether your caching is working. Mine sits at 94% for the Career-OS
scorer during morning crawl runs.
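
For the dashboard tile, a minimal sketch of the ratio computed over a batch of logged usage blocks. The field names are the ones from the log line above; the sample numbers are made up, and the function name is hypothetical. Like the formula in the text, it leaves cache-write tokens out of the denominator.

```python
def cache_hit_rate(usages: list[dict]) -> float:
    """Share of input tokens served from cache: cache_read / (cache_read + uncached)."""
    read = sum(u.get("cache_read_input_tokens") or 0 for u in usages)
    uncached = sum(u.get("input_tokens") or 0 for u in usages)
    total = read + uncached
    return read / total if total else 0.0


# Hypothetical usage blocks: one cold call (cache write), then two warm calls.
sample = [
    {"input_tokens": 40, "cache_creation_input_tokens": 2400, "cache_read_input_tokens": 0},
    {"input_tokens": 55, "cache_creation_input_tokens": 0, "cache_read_input_tokens": 2400},
    {"input_tokens": 35, "cache_creation_input_tokens": 0, "cache_read_input_tokens": 2400},
]
print(f"{cache_hit_rate(sample):.1%}")
```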

Pattern 2 — Cache long documents (the RAG-adjacent pattern)

The pattern that actually changes which architectures are economically viable.

Say you have a 30,000-token product manual, customer policy document, or
codebase. Without caching, every customer question costs you ~$0.09 in input
tokens alone. With caching, the first question in a cache window costs you
~$0.11 (the write), and every follow-up while the cache is warm costs ~$0.01.

# document_qa.py

reply = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=600,
    system=[
        {
            "type": "text",
            "text": LIGHT_INSTRUCTIONS,         # 200 tokens; precedes the breakpoint, so it caches as part of the prefix
        },
        {
            "type": "text",
            "text": SHOP_POLICY_DOCUMENT,       # 30,000 tokens, CACHED
            "cache_control": {"type": "ephemeral"},
        },
    ],
    messages=[{"role": "user", "content": user_question}],
)

What this kills: most of the use cases people built RAG for. If your
"retrieval over a fixed corpus" use case fits inside Claude's 200K context,
caching the full document is often cheaper than embedding-based retrieval and
sidesteps its retrieval misses. No chunking. No top-k tuning. No vector DB
operational burden.

The catch: the corpus has to be relatively stable. If your "document" is
yesterday's database dump, you're paying the cache write fee every single
day. Use cache for things that change weekly, not hourly.

Pattern 3 — Cache tool definitions

Tool use blocks are tokens. They count. And they're identical across every
call to the same agent.

TOOLS = [
    {"name": "search_orders", "description": "...", "input_schema": {...}},
    {"name": "issue_refund",  "description": "...", "input_schema": {...}},
    {"name": "lookup_user",   "description": "...", "input_schema": {...}},
    # … 12 tools in total, ~3,500 tokens of schema
]

reply = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,  # required by the Messages API
    tools=TOOLS,
    system=[
        {
            "type": "text",
            "text": SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"},
        },
    ],
    messages=[...],
)

The cached prefix is built in the order tools, then system, then messages.
A breakpoint on the system block therefore also covers the tool definitions
sent in the same call, and you don't need a separate cache_control on the
tools array. The flip side: changing any tool definition invalidates the
cached system block too.

This is a 3,500-token win you get for free when you're already caching the
system block. Most of the time it's already happening and you don't realize
it. Worth confirming with the cache_creation_input_tokens log line.

Pattern 4 — Conversation prefix caching for multi-turn agents

The pattern that makes long-running agentic loops affordable.

Multi-turn agents — the ones that loop through assistant → tool_use → tool_result → assistant → tool_use → … — re-send the entire conversation
history on every call. By turn 8, you're sending 12,000+ tokens of history,
most of which is unchanged from turn 7.

Cache the prefix.

import copy

MAX_TURNS = 10

def agent_loop(initial_message: str) -> str:
    messages = [{"role": "user", "content": initial_message}]

    for _ in range(MAX_TURNS):
        # Put the cache breakpoint on the newest message so the whole prefix caches
        cached_messages = mark_last_message_cached(messages)

        reply = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=1024,  # required by the Messages API
            tools=TOOLS,
            system=[{
                "type": "text", "text": SYSTEM_PROMPT,
                "cache_control": {"type": "ephemeral"}
            }],
            messages=cached_messages,
        )

        if reply.stop_reason == "end_turn":
            return reply.content[0].text

        messages.append({"role": "assistant", "content": reply.content})
        messages.append({"role": "user", "content": run_tools(reply)})

    raise RuntimeError(f"agent did not finish within {MAX_TURNS} turns")

def mark_last_message_cached(messages: list) -> list:
    """Return a copy with cache_control on the last message's final block.

    The last message is deep-copied so the marker never leaks into `messages`.
    Otherwise every turn would leave a breakpoint behind, and the request would
    soon exceed the API limit of 4 cache breakpoints.
    """
    out = list(messages)
    if out:
        last = copy.deepcopy(out[-1])
        if isinstance(last["content"], str):
            last["content"] = [{"type": "text", "text": last["content"]}]
        last["content"][-1]["cache_control"] = {"type": "ephemeral"}
        out[-1] = last
    return out

Each new turn extends the cached prefix by the previous turn's content. By
turn 10, ~95% of your input tokens hit cache reads. An agent loop that would
cost $0.40 to run uncached costs $0.05 with this pattern.
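
To see how the read share grows turn by turn, here is a rough token-accounting model. It is a sketch, not a billing calculator: the token sizes are invented, every call is assumed to land inside the TTL, and the breakpoint is assumed to sit on the newest message so the whole input is cacheable.

```python
def simulate_prefix_caching(system_tokens: int, first_message_tokens: int,
                            tokens_per_turn: int, turns: int) -> list[tuple[int, int, int]]:
    """Return (turn, cache_read_tokens, cache_write_tokens) for each call."""
    rows = []
    cached = 0                                    # tokens cached by the previous call
    prefix = system_tokens + first_message_tokens  # tokens sent on the current call
    for turn in range(1, turns + 1):
        rows.append((turn, cached, prefix - cached))
        cached = prefix
        prefix += tokens_per_turn                 # assistant reply + tool results appended
    return rows


# Assumed sizes: 5,900 tokens of tools + system, a 100-token first message,
# and 1,200 tokens added per turn.
rows = simulate_prefix_caching(5_900, 100, 1_200, 10)
for turn, read, write in rows:
    print(f"turn {turn:2d}: read {read:6d}  write {write:5d}  "
          f"read share {read / (read + write):.0%}")
```

With these assumed sizes, turn 10 reads 15,600 of 16,800 input tokens from cache, about 93%. The exact share depends on how much each turn adds relative to the fixed prefix.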

The 4 cases where caching does NOT help

This is where I see clients waste afternoons. Be honest about whether your
workload fits.

1. Your prompts vary too much. If each call has a different system
prompt (you're concatenating user-specific data into it, or A/B-testing
prompt variants), there's no shared cache prefix to hit. Either restructure
to push the variation into the messages block (keeping the system stable),
or accept that caching isn't your lever.

2. Your volume is low. If you call the model 5 times an hour spread
evenly, the 5-minute TTL means you almost never hit a warm cache. The
1-hour TTL helps but raises the write cost from 1.25× to 2× the base price.
At extremely low volumes the math sometimes works out to "uncached is
cheaper."

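3. Your prompts are short. Below the model's minimum cacheable length
(1,024 tokens on many Claude models, higher on some), caching just doesn't
activate. The request runs as a normal uncached call: no cache is created,
no write premium is charged, and nothing warns you. Check the usage block;
both cache fields will read 0.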

4. Your content is per-user and short-lived. If the cached content is
specific to one user and they only make one or two calls, you're paying the
write penalty without ever hitting the cache. Aggregation across users or
sessions doesn't apply.
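
The low-volume case is easy to check with arithmetic. The sketch below expresses the cost of the shared prefix in units of one uncached call. It assumes evenly spaced calls and a TTL that is refreshed on every read; the function name and parameters are illustrative.

```python
def relative_input_cost(calls: int, interval_min: float, ttl_min: float,
                        write_mult: float, read_mult: float = 0.10) -> float:
    """Cost of the shared prefix, where 1.0 is the price of one uncached call.

    With evenly spaced calls, a call is a cache read only when it arrives
    less than ttl_min after the previous one; otherwise it is a fresh write.
    """
    if calls <= 0:
        return 0.0
    warm = interval_min < ttl_min
    reads = calls - 1 if warm else 0
    writes = calls - reads
    return writes * write_mult + reads * read_mult


# 5 calls an hour, spread evenly (one every 12 minutes). Uncached cost: 5.0.
print(relative_input_cost(5, interval_min=12, ttl_min=5, write_mult=1.25))
print(relative_input_cost(5, interval_min=12, ttl_min=60, write_mult=2.0))
# One call every two hours: even the 1-hour TTL never gets a warm read.
print(relative_input_cost(5, interval_min=120, ttl_min=60, write_mult=2.0))
```

Under these assumptions the 5-minute TTL costs more than no caching at all (five writes), the 1-hour TTL is clearly ahead at 12-minute spacing, and at two-hour spacing caching only adds cost.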

Operational hygiene

The three things to wire up before you ship cached calls:

Log cache_creation_input_tokens and cache_read_input_tokens for every call. Without this, you have no idea if caching is working.
Alert on cache hit rate dropping. If your dashboard shows 90% hits on Monday and 12% on Tuesday, something changed in your prompt structure. Tuesday's bill will reflect it.
Don't put PII or per-user secrets in the cached block. Cached content is reused. Anything you put in there is shared across every call that hits the same cache key. Put per-user context in the user messages block where it belongs.
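
One cheap way to catch a prompt-structure change before the hit rate drops is to fingerprint the cacheable prefix locally and log it with each deploy. This is a local illustration only: it is not how Anthropic computes cache keys, and `prefix_fingerprint` is a hypothetical helper.

```python
import hashlib
import json


def prefix_fingerprint(tools: list[dict], system_blocks: list[dict]) -> str:
    """Hash the request parts that form the cached prefix (tools, then system).

    Keys are deliberately not sorted: a reordered payload is reported as a
    change, which errs on the side of a false alarm.
    """
    payload = json.dumps({"tools": tools, "system": system_blocks}, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


tools = [{"name": "search_orders", "description": "Find orders", "input_schema": {"type": "object"}}]
system_v1 = [{"type": "text", "text": "You are a support agent."}]
system_v2 = [{"type": "text", "text": "You are a friendly support agent."}]

before = prefix_fingerprint(tools, system_v1)
after = prefix_fingerprint(tools, system_v2)
print(before, after, "changed" if before != after else "unchanged")
```

If the fingerprint changes in a release where nobody meant to touch the prompt, that is the Tuesday bill announcing itself early.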

What this is worth, dollars and time

For Career-OS, the four patterns above collapsed the morning crawl-and-score
run from "noticeable on the bill" to "rounding error." Setup time: one
afternoon. Ongoing maintenance: the three log lines + one dashboard tile.

For an inbound support agent handling 20,000 queries a month: easily
$200–$400/month saved versus uncached, every month, forever, with the same
quality of output.

For a documentation-QA endpoint over a stable corpus: the difference between
"too expensive to ship to all users" and "an obvious feature." I've watched
this single decision unblock entire roadmap items.

When to call

If you have a Claude-powered feature in production today and you do not have
a dashboard tile showing cache hit rate, that's the bug. Cache misses are
silent and your bill is paying for them.

This is a 1–3 day scoped audit + fix that I take on:
the shape is on the hire-me page.

For the full context where these patterns ship, see the
Career-OS architecture walkthrough.
For the upstream patterns — where to bolt the Claude call onto your stack
in the first place — see the
5 places to bolt AI onto Laravel
and the
PrestaShop 5-file pattern.
And before any of this ships to production, the
eval harness post
is the discipline that catches the regressions caching alone can't.

Originally published on bak-dev.com. Find more build-in-public posts at bak-dev.com/blog.

How I evaluate Claude SDK features before shipping them to production

Akram Bakhouche — Thu, 28 May 2026 13:34:22 +0000

The riskiest line of code in your codebase is not the database transaction or
the third-party API call. It's the LLM prompt — because when it silently
regresses, nothing visible breaks. The endpoint still returns 200. The
JSON still parses. Test suite still green. The model just started giving
worse answers, and you find out from a customer.

I have shipped enough Claude SDK features in production to be afraid of this.
Here is the eval harness pattern I now ship with every LLM-powered feature.
It takes about an hour to build per feature and has saved me from at least
three subtle regressions in Career-OS alone.

The risk you can't see

A normal regression looks like this: you change a function, a test fails,
you fix it.

An LLM regression looks like this: you tweak the system prompt to fix a tone
issue on Monday. The fix works. Three weeks later, your scores drift down by
12% on a class of inputs you didn't think about. No test failed. No alert
fired. You only notice because a downstream metric (apply-rate, conversion,
support-ticket close time) is off, and that's if you're watching.

The four common silent regressions I have personally hit:

Tweaking the system prompt broke a tone you weren't testing. You added "be friendly" and lost "be precise about prices."
A new Anthropic model version (e.g. Sonnet 4.5 → 4.6) calibrated differently. Same prompt, same input, different score by 8 points.
A library upgrade silently dropped a parameter (max_tokens, temperature). Defaults kicked in. Outputs got longer and looser.
Cache invalidation drift. You added a field to the system prompt; the cache hash changed; cost per call jumped 6× before you noticed.

All four are invisible to unit tests. All four are caught by a single eval
harness that takes a few hours to build.

The pattern

You need three things:

A fixtures file — frozen inputs with expected outputs.
A scorer — measures how far each actual output drifted from expected.
A runner — replays every fixture through the live model, fails the build if any fixture drifted past tolerance.

That's it. No fancy framework. Anything more is YAGNI.

Step 1 — Fixtures

Pick 8–15 hand-labeled cases. Cover the edge cases you actually care about,
not just the happy path. For a job-fit scorer, mine looks like:

// tests/fixtures/scored_jobs.jsonl

{"id":"f001","input":{...},"expected":{"fit":88,"angle":"strong fullstack + AI"},"tolerance":{"fit":5}}
{"id":"f002","input":{...},"expected":{"fit":35,"angle":"junior role, decline"},"tolerance":{"fit":5}}
{"id":"f003","input":{...},"expected":{"fit":72,"angle":"good freelance fit"},"tolerance":{"fit":7}}
{"id":"f004","input":{...},"expected":{"fit":15,"angle":"crypto, disqualify"},"tolerance":{"fit":3}}
// …

Per-fixture tolerance is the move I most often see skipped. Different inputs
have different acceptable variance. A clear "strong fit" might tolerate ±5
points. A clear "disqualify" should be tight — ±3 — because we want certainty
on rejection. A borderline case might tolerate ±10 because it's genuinely
ambiguous.

Label them yourself. Don't generate fixtures with another LLM. The whole
point is that you, the human, know what right looks like.

Step 2 — Scorer

For structured-output features, the scorer is plain arithmetic.

# tests/eval/score_fit.py

def score(actual: dict, expected: dict, tolerance: dict) -> EvalResult:
    fit_delta = abs(actual["fit"] - expected["fit"])
    fit_ok = fit_delta <= tolerance.get("fit", 5)

    angle_overlap = jaccard(
        tokenize(actual["angle"]),
        tokenize(expected["angle"]),
    )
    angle_ok = angle_overlap >= 0.3

    return EvalResult(
        passed=fit_ok and angle_ok,
        fit_delta=fit_delta,
        angle_overlap=angle_overlap,
    )
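
The scorer leans on three names the snippet doesn't define. A minimal sketch of what they could look like follows; these are assumptions for illustration, not the Career-OS implementations. The optional fields on `EvalResult` let the same class serve scorers that only report pass/fail.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class EvalResult:
    passed: bool
    fit_delta: Optional[int] = None
    angle_overlap: Optional[float] = None


def tokenize(text: str) -> set[str]:
    """Lowercase the text and keep alphanumeric runs as a set of tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: set[str], b: set[str]) -> float:
    """Size of the intersection over size of the union; 1.0 for two empty sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


overlap = jaccard(tokenize("strong fullstack + AI"), tokenize("Strong AI fit, fullstack"))
print(overlap)  # 3 shared tokens out of 4 distinct ones → 0.75
```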

For free-form outputs (cover letters, descriptions), don't try to match the
text exactly. Score for constraints:

Did it use the requested language?
Did it stay under the word count?
Did it mention every required fact?
Did it avoid the forbidden phrases?

These are deterministic checks against the output text. They're easier to
write than people fear. For Career-OS's outreach drafter:

def score_outreach(actual: str, expected: dict) -> EvalResult:
    return EvalResult(
        passed=all([
            language_matches(actual, expected["language"]),
            word_count(actual) <= expected["max_words"],
            all(fact in actual for fact in expected["required_facts"]),
            not any(phrase in actual.lower() for phrase in expected["forbidden_phrases"]),
        ]),
    )
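
When a fixture fails, it helps to know which constraint broke. The variant below returns the failed constraints instead of a single boolean. It is a sketch with hypothetical names; the language check is left out because it needs a detection library, and fact matching is made case-insensitive here, which differs slightly from the snippet above.

```python
import re


def failed_constraints(text: str, expected: dict) -> list[str]:
    """Return the names of the constraints the text violates (empty list = pass)."""
    lowered = text.lower()
    failures = []
    if len(re.findall(r"\S+", text)) > expected["max_words"]:
        failures.append("max_words")
    for fact in expected["required_facts"]:
        if fact.lower() not in lowered:
            failures.append(f"missing_fact:{fact}")
    for phrase in expected["forbidden_phrases"]:
        if phrase.lower() in lowered:
            failures.append(f"forbidden_phrase:{phrase}")
    return failures


# Made-up fixture values for illustration.
expected = {"max_words": 50, "required_facts": ["Career-OS"], "forbidden_phrases": ["synergy"]}
print(failed_constraints("I built Career-OS in six weeks.", expected))
print(failed_constraints("I am passionate about synergy.", expected))
```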

You can also do an "LLM-as-judge" pattern for fuzzy criteria — but only as a
last resort. It's slower, more expensive, and adds another silent-regression
surface (the judge model can drift too). Use deterministic checks whenever
the criterion is decidable.

Step 3 — Runner

A 30-line script that loops fixtures, calls the feature live, scores each one,
prints a summary table.

# tests/eval/run.py

import json
from pathlib import Path
from rich.table import Table
from rich.console import Console

from career_os.scorer import score_job
from tests.eval.score_fit import score as eval_one

console = Console()
fixtures = [json.loads(line) for line in Path("tests/fixtures/scored_jobs.jsonl").read_text().splitlines() if line.strip()]

table = Table(title="Eval results")
table.add_column("id"); table.add_column("expected_fit"); table.add_column("actual_fit"); table.add_column("delta"); table.add_column("status")

failed = 0
for fx in fixtures:
    actual = score_job(fx["input"])           # the live call
    res = eval_one(actual.model_dump(), fx["expected"], fx["tolerance"])  # pydantic v2; .dict() on v1
    status = "[green]PASS" if res.passed else "[red]FAIL"
    if not res.passed: failed += 1
    table.add_row(fx["id"], str(fx["expected"]["fit"]), str(actual.fit), str(res.fit_delta), status)

console.print(table)
console.print(f"\n[red]{failed}[/red] failed of {len(fixtures)}" if failed else f"[green]all {len(fixtures)} passed[/green]")
exit(1 if failed else 0)

The exit(1 if failed else 0) is what makes this an actual test, not just a
debugging tool. Wire it into CI.
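
One practical wrinkle: if the fixtures file carries a path comment or a blank line, a bare json.loads per line will crash the runner. A small loader that skips those lines and reports the offending line number is sketched below; `load_fixtures` is a hypothetical helper and the demo data is made up.

```python
import json
import tempfile
from pathlib import Path
from typing import Union


def load_fixtures(path: Union[str, Path]) -> list[dict]:
    """Read a JSONL fixtures file, skipping blank lines and // or # comment lines."""
    fixtures = []
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    for number, line in enumerate(lines, start=1):
        stripped = line.strip()
        if not stripped or stripped.startswith(("//", "#")):
            continue
        try:
            fixtures.append(json.loads(stripped))
        except json.JSONDecodeError as exc:
            raise ValueError(f"{path}:{number}: invalid fixture line: {exc}") from exc
    return fixtures


# Demo with a throwaway file.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "scored_jobs.jsonl"
    demo_path.write_text(
        "// tests/fixtures/scored_jobs.jsonl\n"
        '{"id": "f001", "expected": {"fit": 88}, "tolerance": {"fit": 5}}\n'
        "\n"
        '{"id": "f002", "expected": {"fit": 35}, "tolerance": {"fit": 5}}\n',
        encoding="utf-8",
    )
    demo_fixtures = load_fixtures(demo_path)

print([fx["id"] for fx in demo_fixtures])  # ['f001', 'f002']
```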

Wiring it into CI

# .github/workflows/eval.yml

name: eval
on:
  pull_request:
    paths: ['src/career_os/scorer/**', 'tests/fixtures/**']
  workflow_dispatch:

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: '3.12' }
      - run: pip install -e ".[dev]"
      - run: python tests/eval/run.py
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}

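This runs the eval on every PR that touches the scorer or the fixtures. It
costs a few cents per run. Model bumps and dependency upgrades that land
outside those paths need a manual workflow_dispatch run (or extra paths
entries) to get the same coverage.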

When to extend the harness

You will hit cases the eval can't catch. That's fine — every catch is a new
fixture. The discipline:

Bug reported in production? Add a fixture that reproduces it. Fix. Eval now catches the regression next time.
You hand-tuned the prompt for 3 hours? Save 5 of those tuning cases as fixtures with the now-correct expected outputs.
Anthropic ships a new model? Run the eval against the new model before bumping it in production. The delta tells you whether the new model is a free upgrade or a calibration redo.

The fixtures grow with the system. After a year, you have a regression suite
that encodes most of what you know about how the feature should behave —
and any future engineer (or future-you, three months from now) can change
the prompt with confidence because the eval will catch them if they break it.

What this is worth to a client

Most teams shipping LLM features today do not have this. They ship a Claude
call, watch metrics, hope. When the system slowly degrades, they can't tell
whether it was the prompt change, the model bump, the user-input distribution
shift, or something else entirely. Triage time on a silent LLM regression in
production is measured in days.

A 4-hour investment in an eval harness collapses that to minutes. If you
ship a Claude feature without one, you are building technical debt you can't
see and can't measure.

If you have a Laravel / PrestaShop / Python app shipping an AI feature and
nobody has built the eval harness for it yet, that is a 1-week scoped
engagement I take on. The shape is on the hire-me page.

For the full architecture context where this pattern lives — including the
fixture file format and the eval runner — see the
Career-OS architecture walkthrough. The same
discipline applies to the Laravel patterns in
5 places to bolt AI and the
PrestaShop module in
the 5-file pattern — every Claude
call in those posts is a place where the eval harness pattern earns its
keep.

How to add a Claude agent to a PrestaShop store: a 5-file pattern

Akram Bakhouche — Thu, 28 May 2026 13:21:46 +0000

Career-OS architecture — how I built a job-search agent with Claude SDK

Akram Bakhouche — Thu, 28 May 2026 12:58:20 +0000

I built Career-OS to solve a real problem for myself: spending two hours a day
scrolling job boards is a terrible way to find your next role, and existing
"AI job applier" tools are either spam machines or vapor.

The system is open-source on
github.com/akrambak/career-os.
This post is the technical walkthrough: what it does, why each layer exists,
and the design decisions that made it actually ship instead of dying as a
weekend prototype.

What it does, in one paragraph

Every morning, Career-OS crawls five sources (RemoteOK, two WeWorkRemotely
feeds, Remotive, and two Hacker News monthly threads). It scores each posting
against my profile using Claude with structured-output JSON, drafts a tailored
outreach email when fit ≥ 70, and emails me a digest with the top picks. I
review the digest at breakfast, decide which to send, and run
career-os apply from the terminal to track the pipeline through to offer or
rejection.

The architecture

┌──────────────┐   ┌──────────────┐   ┌──────────────┐
│ 5 scrapers   │──▶│ SQLite store │──▶│ Claude       │
│ (RemoteOK,   │   │ (jobs,       │   │ scorer       │
│  WWR, HN…)   │   │  scores)     │   │ (fit 0-100)  │
└──────────────┘   └──────────────┘   └──────────────┘
                          │                   │
                          ▼                   ▼
                   ┌──────────────┐   ┌──────────────┐
                   │ CLI          │   │ Drafter      │
                   │ (fetch/score │   │ (outreach    │
                   │  /draft/…)   │   │  per fit)    │
                   └──────────────┘   └──────────────┘
                          │                   │
                          ▼                   ▼
                   ┌──────────────────────────────────┐
                   │ Daily email digest (Resend /     │
                   │ Postmark / SMTP)                 │
                   └──────────────────────────────────┘

Five layers, each replaceable: scrapers, store, scorer, drafter, digest. No
framework. Just Python modules and one Anthropic SDK dependency.

Layer 1 — Scrapers

Each scraper is a 50-line file. The shape is intentionally boring:

# src/career_os/scrapers/remoteok.py

class RemoteOKScraper(BaseScraper):
    source_name = "remoteok"

    async def fetch(self) -> list[RawJob]:
        async with httpx.AsyncClient() as client:
            resp = await client.get("https://remoteok.com/api")
            resp.raise_for_status()
            data = resp.json()

        return [self.normalize(item) for item in data if self.is_job(item)]

    def normalize(self, item: dict) -> RawJob:
        return RawJob(
            external_id=str(item["id"]),
            title=item["position"],
            company=item["company"],
            description=item.get("description", ""),
            url=item["url"],
            posted_at=parse_date(item["date"]),
            source=self.source_name,
        )

The design rule: a scraper does fetching + normalization, nothing else. No
filtering, no scoring, no business logic. Adding a new source — say, a French
freelance board — is a 50-line file that the registry picks up automatically.
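
The post doesn't show how the registry discovers scrapers. One common way to get that behavior in Python is `__init_subclass__`, sketched below. This is an assumption about the mechanism, not necessarily what Career-OS does, and `FrenchFreelanceScraper` is a hypothetical source.

```python
class BaseScraper:
    # Maps source_name -> scraper class; filled in as subclasses are defined.
    registry: dict[str, type["BaseScraper"]] = {}
    source_name: str = ""

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        if not cls.source_name:
            raise TypeError(f"{cls.__name__} must set source_name")
        if cls.source_name in BaseScraper.registry:
            raise ValueError(f"duplicate source_name: {cls.source_name}")
        BaseScraper.registry[cls.source_name] = cls


class RemoteOKScraper(BaseScraper):
    source_name = "remoteok"


class FrenchFreelanceScraper(BaseScraper):  # hypothetical new source
    source_name = "french_freelance"


print(sorted(BaseScraper.registry))  # ['french_freelance', 'remoteok']
```

A class only registers once its module has been imported, so a setup like this still needs something (a package `__init__`, or a loop over the scrapers directory) that imports each file.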

The HN "Who is hiring?" and "Seeking freelancer?" scrapers use the public
Algolia API rather than scraping HTML, which would be rude. Algolia returns
structured fields (stack, budget, location, contact email) that would be
brittle to parse from prose.

Layer 2 — Store

SQLite with a Postgres-shaped schema. Four tables: jobs, scores,
applications, drafts.

The Postgres-shaped part matters. When this becomes a multi-tenant SaaS, the
swap to Postgres is a connection-string change and a pip install of a
Postgres driver, with small type adjustments (for example, DATETIME becomes
TIMESTAMP) and no rewritten queries. SQLite is fast enough for one user
and gives me file-based ergonomics (scp career_os.db to back up, sqlite3 career_os.db to inspect).

-- src/career_os/db/schema.sql

CREATE TABLE jobs (
    id           INTEGER PRIMARY KEY,
    external_id  TEXT NOT NULL,
    source       TEXT NOT NULL,
    title        TEXT NOT NULL,
    company      TEXT NOT NULL,
    description  TEXT,
    url          TEXT NOT NULL,
    posted_at    DATETIME NOT NULL,
    seen_at      DATETIME DEFAULT CURRENT_TIMESTAMP,
    UNIQUE(external_id, source)
);

CREATE TABLE scores (
    id            INTEGER PRIMARY KEY,
    job_id        INTEGER REFERENCES jobs(id),
    fit           INTEGER NOT NULL,
    pros          JSON,
    cons          JSON,
    angle         TEXT,
    scored_at     DATETIME DEFAULT CURRENT_TIMESTAMP,
    UNIQUE(job_id)
);

(external_id, source) as a composite key dedupes across re-runs. The same
RemoteOK job won't get scored twice.
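
A quick way to convince yourself the constraint does the deduping is an in-memory SQLite session. This is a sketch with a trimmed-down jobs table and a hypothetical `save_job` helper; the ON CONFLICT ... DO NOTHING form is used because it also exists in Postgres (SQLite supports it from version 3.24).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE jobs (
        id          INTEGER PRIMARY KEY,
        external_id TEXT NOT NULL,
        source      TEXT NOT NULL,
        title       TEXT NOT NULL,
        UNIQUE(external_id, source)
    )
""")


def save_job(external_id: str, source: str, title: str) -> bool:
    """Insert a job; return False when (external_id, source) is already stored."""
    cur = conn.execute(
        "INSERT INTO jobs (external_id, source, title) VALUES (?, ?, ?) "
        "ON CONFLICT (external_id, source) DO NOTHING",
        (external_id, source, title),
    )
    return cur.rowcount == 1


first = save_job("123", "remoteok", "Senior Python Engineer")   # first sighting
repeat = save_job("123", "remoteok", "Senior Python Engineer")  # same job on a re-run
other = save_job("123", "remotive", "Different job, same id")   # same id, other source
print(first, repeat, other)
```

The second call is skipped, while the third is stored: the same external id from a different source counts as a different job.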

Layer 3 — The Claude scorer

The expensive layer, in both money and engineering effort. The scorer
decides which jobs deserve a draft, which deserve a glance, and which to
skip. Mis-calibrate it and you either drown in noise or miss good fits.

Two design decisions made this work:

Decision 1: prompt caching on the system block.

The system prompt is ~2,400 tokens — my profile, the scoring rubric, examples
of what "fit 80" vs "fit 50" looks like. It's identical for every job. Putting
it inside a cached block means I pay the 1.25× write price for the first job
of a run and ~10% of the input price on those tokens for every subsequent one.

# src/career_os/scorer/claude_scorer.py

def score_job(job: Job) -> JobScore:
    resp = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=512,
        system=[
            {
                "type": "text",
                "text": SCORER_SYSTEM_PROMPT,
                "cache_control": {"type": "ephemeral"},
            }
        ],
        messages=[
            {"role": "user", "content": render_job(job)},
        ],
    )

    return parse_score(resp.content[0].text)

Decision 2: structured JSON output, not free-form.

I want {"fit": 87, "pros": [...], "cons": [...], "angle": "..."} — not a
paragraph of prose I have to regex. The system prompt asks for valid JSON
matching a schema, with a few-shot example. I use pydantic to parse the
response and fail loudly if the shape's wrong. Failure rate after the first
week of tuning: under 0.5%.

class JobScore(BaseModel):
    fit: int = Field(ge=0, le=100)
    pros: list[str] = Field(min_length=1, max_length=5)
    cons: list[str] = Field(min_length=0, max_length=5)
    angle: str  # the one-line pitch to lead the outreach with
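
The body of parse_score isn't shown in the post. To illustrate the "fail loudly" path without any dependency, here is a standard-library stand-in that enforces the same constraints as the JobScore model. It is a sketch, not the Career-OS code, which uses pydantic for this.

```python
import json


def _is_str_list(value) -> bool:
    return isinstance(value, list) and all(isinstance(item, str) for item in value)


def parse_score(raw: str) -> dict:
    """Parse the model's reply; raise ValueError on any shape violation."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("reply must be a JSON object")

    fit = data.get("fit")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(fit, bool) or not isinstance(fit, int) or not 0 <= fit <= 100:
        raise ValueError(f"fit must be an integer in 0..100, got {fit!r}")
    pros, cons, angle = data.get("pros"), data.get("cons"), data.get("angle")
    if not _is_str_list(pros) or not 1 <= len(pros) <= 5:
        raise ValueError("pros must be a list of 1 to 5 strings")
    if not _is_str_list(cons) or len(cons) > 5:
        raise ValueError("cons must be a list of at most 5 strings")
    if not isinstance(angle, str):
        raise ValueError("angle must be a string")
    return {"fit": fit, "pros": pros, "cons": cons, "angle": angle}


good = parse_score('{"fit": 87, "pros": ["Python"], "cons": [], "angle": "fullstack + AI"}')
print(good["fit"], good["angle"])
```

A reply with "fit": 140, an empty pros list, or prose wrapped around the JSON raises ValueError instead of flowing into the digest.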

Layer 4 — The drafter

When fit ≥ 70, the drafter writes a tailored outreach email. It takes the
job, the score, and a draft type (FT-cover or freelance-pitch) and produces
something I'd actually send.

The interesting bit: two completely separate system prompts. A cover
letter for a senior FT role and a freelance pitch are different jobs with
different tone requirements. I tried one prompt that "knew" which to write.
It produced muddled outputs. Splitting them — different system prompts loaded
based on the draft_type — fixed it overnight.

Hard rules baked into both prompts:

Never invent metrics. If the candidate (me) hasn't shipped X, don't claim X.
Freelance: no engagements under 2 weeks.
Freelance: no hourly rates under €60/hr equivalent.
FT: no roles requiring on-site days outside Europe.
No emoji in the body. (Personal taste. It holds up against prompt-injection attempts because it lives in the system prompt, not in the untrusted job text.)

def draft_outreach(job: Job, score: JobScore, draft_type: str) -> str:
    # draft_type (not `type`) avoids shadowing the Python builtin
    system_prompt = (
        FT_COVER_PROMPT if draft_type == "ft_cover"
        else FREELANCE_PITCH_PROMPT
    )
    resp = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=600,
        system=[{"type": "text", "text": system_prompt,
                 "cache_control": {"type": "ephemeral"}}],
        messages=[{"role": "user", "content": render_brief(job, score)}],
    )
    return resp.content[0].text

Layer 5 — The eval harness

The part of the system most "AI agent" projects skip. The scorer is the
center of the system — if it silently regresses (because I tweaked the system
prompt, because Anthropic shipped a new model version, because the
temperature setting drifted), I'd ship bad drafts for weeks before noticing.

So: ten hand-labeled job fixtures with expected fit scores ±5. The eval
command (career-os eval) replays them through the scorer and reports any
that fell outside tolerance.

# tests/fixtures/scored_jobs.jsonl
{"job_id": "fixture_001", "expected_fit": 88, "tolerance": 5, ...}
{"job_id": "fixture_002", "expected_fit": 35, "tolerance": 5, ...}
...

Run before every commit to the scorer or any prompt change. Took about an hour
to build and has saved me from at least three subtle regressions.

This is non-negotiable for any LLM-in-production work. If you can't measure
quality, you can't iterate. Don't ship a Claude-powered feature without an
eval harness if the cost of being wrong exceeds the cost of building the
harness.

What it cost to build

Six weeks of evening + weekend work to get to the MVP I'm running today.
The Anthropic bill: under $40/month including the eval re-runs. Less than my
coffee budget.

What's next

Three things on the roadmap:

A LinkedIn / X cross-poster that publishes commit milestones with one command — career-os post --milestone "5 scrapers live".
An applications tracker UI beyond the current CLI. Probably a Next.js dashboard pulling from the same SQLite store.
Multi-tenant mode — same architecture, Postgres swap, an auth layer, and shipping as a SaaS.

If you want to play with it locally, pip install -e ".[dev]" and follow the
README. If you want a system like this built for your team — sales-lead
enrichment, internal ops agents, RAG-over-docs — that's exactly the work I
take on. The shape is on the hire-me page.

And if you want the LLM-in-production patterns from this post applied to your
existing Laravel or PrestaShop stack, the
5-places-to-bolt-AI post covers
that ground.

5 places to bolt an AI agent onto a PHP/Laravel app without rebuilding

Akram Bakhouche — Thu, 28 May 2026 12:58:19 +0000

If you run a Laravel or PrestaShop app, "add AI to it" arrived on your roadmap
sometime in the last twelve months. Probably from a CEO who saw a Claude demo,
or a customer-success person watching support tickets pile up.

The wrong answer is to rewrite. The right answer is to find the 5 places in
your existing stack where an LLM call collapses 20 lines of brittle code into
2 lines of Claude SDK — and start there.

These are the 5 I keep returning to in client work, ranked roughly by how
fast they break even.

1. Support-inbox triage

The use case. Every incoming support email goes through Claude before a
human sees it. The model assigns a category, severity, and language; drafts a
suggested reply; and routes it to the right inbox. Your team opens 50 emails
and sees 50 pre-categorized rows with draft answers, not 50 unparsed strings.

The failure mode if you do it badly. You let Claude auto-send replies.
Three months in, the model hallucinates a refund policy that doesn't exist and
costs you a five-figure chargeback. The fix is a hard rule: draft yes,
auto-send no. Humans approve the draft. The savings come from time-to-first-
draft, not from removing the human.

The integration point. A queued job per inbound email.

// app/Jobs/TriageSupportEmail.php

public function handle(Anthropic $claude): void
{
    $analysis = $claude->messages()->create([
        'model'    => 'claude-haiku-4-5',
        'system'   => $this->triagePrompt(),  // ~600 tokens; caches only if tools + system reach the model's minimum cacheable length
        'tools'    => $this->triageTools(),   // assign_category, set_severity
        'messages' => [
            ['role' => 'user', 'content' => $this->email->raw_body],
        ],
    ]);

    $this->email->update([
        'category'        => $analysis->tool('assign_category')->args['name'],
        'severity'        => $analysis->tool('set_severity')->args['level'],
        'draft_reply'     => $analysis->content_text(),
        'triaged_at'      => now(),
    ]);
}

Cost: roughly $0.0003/email on Haiku 4.5 with prompt caching. A team
processing 5,000 support emails a month spends $1.50 to give every one of them
a pre-categorized draft.
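
The arithmetic behind that figure, as a small sketch. The per-email cost is the rough number quoted above, not a measured price, and `monthlyTriageCost` is a hypothetical helper name.

```php
<?php
declare(strict_types=1);

// Monthly spend = emails per month x rough cost per email (USD).
// The 0.0003 default is the article's estimate; substitute your own measured figure.
function monthlyTriageCost(int $emailsPerMonth, float $costPerEmailUsd = 0.0003): float
{
    return round($emailsPerMonth * $costPerEmailUsd, 2);
}

echo monthlyTriageCost(5000), PHP_EOL; // prints 1.5
```

Plug in your real per-email cost from the usage fields on a few hundred responses before you quote a number to anyone.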

2. Product description generation (e-commerce)

The use case. Admin panel gets a "Generate description" button next to
the product editor. The button calls Claude with the product specs, the
store's tone-of-voice guide, and any existing description. Claude returns 3
variants. Editor picks one, edits if needed, saves.

This is the lowest-hanging fruit I see for PrestaShop and Shopify stores.
Catalog managers spend hours a week rewriting descriptions. This collapses to
minutes.

The failure mode if you do it badly. You let Claude invent specs ("100%
organic cotton" when you sold synthetic). The fix: pass the specs as a
structured fact list in the user message, and use the system prompt to
instruct Claude never to invent attributes outside that list. It is a
production-tested constraint that actually works.
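
One way to build such a closed fact list, as a minimal sketch. `buildFactSheet` is a hypothetical helper and the wording of the constraint line is an assumption; adapt both to your own prompt.

```php
<?php
declare(strict_types=1);

// Turns product attributes into an explicit, closed list of facts.
// The model is told the list is exhaustive, so anything missing must not be claimed.
function buildFactSheet(string $name, array $attributes): string
{
    $lines = [
        "Product: {$name}",
        'Facts (use ONLY these; do not add attributes that are not listed):',
    ];

    foreach ($attributes as $key => $value) {
        $lines[] = "- {$key}: {$value}";
    }

    return implode("\n", $lines);
}

echo buildFactSheet('Crew-neck tee', [
    'material' => '100% polyester',
    'fit'      => 'regular',
]), PHP_EOL;
```

A flat key/value list like this is also easy to diff against the generated copy if you later want an automated check for invented attributes.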

The integration point. A single controller action, idempotent by product
ID + tone.

// app/Http/Controllers/Admin/DescriptionGeneratorController.php

public function generate(Product $product, Request $request): JsonResponse
{
    $variants = $this->claude->messages()->create([
        'model'    => 'claude-sonnet-4-6',
        'system'   => $this->copywriterPrompt($request->tone),
        'messages' => [[
            'role'    => 'user',
            'content' => $product->fact_sheet(),  // structured: name, attrs, materials
        ]],
        'max_tokens' => 600,
    ]);

    return response()->json([
        'variants' => $this->parseVariants($variants->content_text()),
    ]);
}

The tone is the parameter that matters. Premium · friendly · technical ·
minimalist · playful. Same product, five wildly different angles, picker UI.
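
Because the tone arrives from the request and ends up inside a prompt, it is worth restricting it to the five known values. This is a minimal sketch: the `friendly` fallback and the cache-key format are assumptions, and both function names are hypothetical.

```php
<?php
declare(strict_types=1);

// The five tones named above; anything else is rejected.
const ALLOWED_TONES = ['premium', 'friendly', 'technical', 'minimalist', 'playful'];

// Normalizes user input and falls back to a safe default for unknown tones.
function normalizeTone(string $tone, string $fallback = 'friendly'): string
{
    $tone = strtolower(trim($tone));

    return in_array($tone, ALLOWED_TONES, true) ? $tone : $fallback;
}

// Key for idempotency by product ID + tone: the same pair always maps to the same key.
function descriptionCacheKey(int $productId, string $tone): string
{
    return sprintf('description:%d:%s', $productId, normalizeTone($tone));
}
```

With a key like this, a second click on "Generate description" for the same product and tone can return the stored variants without calling the model again.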

3. Semantic-search fallback

The use case. Your existing keyword search is fine for 70% of queries.
For the other 30% — typos, synonyms, intent-based searches — it returns
nothing and your bounce rate goes up. You add an embedding-based fallback:
when keyword search returns fewer than N results, the query gets embedded,
matched against pre-computed product embeddings, and the top results are
appended.

You don't replace keyword search. You augment it.
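
The matching step of the fallback can be sketched as a cosine-similarity ranking over pre-computed vectors. This is an in-memory illustration only: `cosineSimilarity` and `topMatches` are hypothetical helpers, and a real catalog would normally delegate this to whatever vector index you already run.

```php
<?php
declare(strict_types=1);

// Cosine similarity between two equal-length numeric vectors.
function cosineSimilarity(array $a, array $b): float
{
    $dot = 0.0;
    $normA = 0.0;
    $normB = 0.0;

    foreach ($a as $i => $value) {
        $dot   += $value * $b[$i];
        $normA += $value ** 2;
        $normB += $b[$i] ** 2;
    }

    if ($normA == 0.0 || $normB == 0.0) {
        return 0.0; // a zero vector has no direction to compare
    }

    return $dot / (sqrt($normA) * sqrt($normB));
}

// Returns the IDs of the $take products most similar to the query vector.
// $productVectors maps product ID => embedding vector.
function topMatches(array $queryVector, array $productVectors, int $take): array
{
    $scores = [];
    foreach ($productVectors as $productId => $vector) {
        $scores[$productId] = cosineSimilarity($queryVector, $vector);
    }

    arsort($scores); // highest similarity first, keys preserved

    return array_slice(array_keys($scores), 0, $take);
}
```

The linear scan is fine for a few thousand products; beyond that, an approximate-nearest-neighbour index is the usual choice.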

The failure mode if you do it badly. You replace keyword search wholesale.
Now exact-SKU lookups stop working because embedding similarity ranks
"BLACK-M-FRONT-9382" lower than "black t-shirt". The fix: augment, never
replace. Keyword is fast, exact, and deterministic. Embeddings are slow,
fuzzy, and probabilistic. They're complements.

The integration point. A decorator around your Laravel Scout driver.

// app/Search/HybridSearch.php

public function search(string $query, int $threshold = 5): Collection
{
    $keyword = Product::search($query)->take(20)->get();

    if ($keyword->count() >= $threshold) {
        return $keyword;  // keyword had enough, fast path
    }

    $embedding = $this->embed($query);  // cached, batched
    $semantic  = $this->vectorIndex->similar($embedding, take: 20 - $keyword->count());

    return $keyword->merge($semantic)->unique('id');
}

Cheap to run: embeddings cost about $0.00002/query. A million queries a
month is twenty bucks.

4. Daily admin summary

The use case. Every morning at 7am, the admin panel home screen loads
with a 3-paragraph summary of the previous day's activity: orders, refunds,
support tickets, unusual signals. Written by Claude, in your brand voice,
referencing real numbers.

This is the highest perceived value per dollar of any pattern on this list.
The owner opens the admin, sees a coherent paragraph saying "Yesterday: 47
orders (+8% vs Tuesday avg), 2 refund requests both for size issues on
SKU-9382, support volume normal, one anomaly: 4 abandoned carts at the
checkout step in the last 3 hours — worth checking the payment provider", and
the AI bill paid for itself before they finish their coffee.

The failure mode if you do it badly. You feed Claude raw database dumps
and ask it to "find something interesting". It invents trends from noise. The
fix: you do the aggregation, Claude does the narrative. Pre-compute the
stats. Pass them as a structured digest. Ask Claude to turn them into
English, not to discover them.
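
"You do the aggregation" can be as simple as computing the comparison yourself and handing the model finished numbers. This is a minimal sketch: `percentChange` and `buildOrderDigest` are hypothetical helpers, and the history values are made up for illustration.

```php
<?php
declare(strict_types=1);

// Percentage change against a baseline, rounded to a whole percent.
// Returns null when there is no baseline to compare with.
function percentChange(float $value, float $baseline): ?int
{
    if ($baseline == 0.0) {
        return null;
    }

    return (int) round(($value - $baseline) / $baseline * 100);
}

// Builds the order section of the digest from yesterday's count and
// the counts for the same weekday in previous weeks.
function buildOrderDigest(int $ordersYesterday, array $sameWeekdayHistory): array
{
    $average = $sameWeekdayHistory === []
        ? 0.0
        : array_sum($sameWeekdayHistory) / count($sameWeekdayHistory);

    return [
        'orders'      => $ordersYesterday,
        'weekday_avg' => round($average, 1),
        'change_pct'  => percentChange((float) $ordersYesterday, $average),
    ];
}

// Illustrative numbers: 47 orders against a 43.5 average is a +8% change.
echo json_encode(buildOrderDigest(47, [42, 44, 45, 43])), PHP_EOL;
```

The model then only has to phrase "47 orders, +8% vs the weekday average"; it never gets the chance to compute, or invent, the percentage.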

The integration point. A scheduled task, run before 7am.

// app/Console/Commands/DailyAdminDigest.php

public function handle(): void
{
    $stats = [
        'orders'           => $this->dailyOrders(),
        'refunds'          => $this->dailyRefunds(),
        'support_volume'   => $this->dailySupportVolume(),
        'unusual_signals'  => $this->detectAnomalies(),
    ];

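    $narrative = $this->claude->messages()->create([
        'model'      => 'claude-haiku-4-5',
        'max_tokens' => 600,  // required by the Messages API; 600 is an illustrative cap for three short paragraphs
        'system'     => $this->editorPrompt(),
        'messages'   => [['role' => 'user', 'content' => json_encode($stats)]],
    ]);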

    AdminDashboard::cacheTodayDigest($narrative->content_text());
}

5. Outbound email personalization

The use case. Your abandoned-cart and re-engagement emails currently use
templated copy: "Hi {{first_name}}, you left items in your cart". Open rates
are mid-single-digits.

You add a Claude step: the model gets the customer's last 5 orders, the items
they abandoned, their average order value, the season. It rewrites the email
body to lean into the angle most likely to convert this customer.

Same template structure, same call-to-action button. Just the body paragraph
is dynamic per recipient.

The failure mode if you do it badly. You let Claude generate the subject
line too — and your spam complaint rate triples because the model used
clickbait phrases your sender reputation can't sustain. The fix: keep the
subject line and the CTA pinned. Let Claude work on the body only.
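
Pinning the subject and CTA is simplest when the model's output can only ever land in one slot. This is a minimal sketch: `assembleEmail` is a hypothetical helper, and the subject and button strings are placeholder copy.

```php
<?php
declare(strict_types=1);

// Assembles the email from fixed parts plus an optional model-written body.
// If the personalized body is missing or empty, the templated copy is used.
function assembleEmail(string $firstName, ?string $personalizedBody): array
{
    $fallbackBody = "Hi {$firstName}, you left items in your cart.";
    $body = trim((string) $personalizedBody);

    return [
        'subject' => 'You left items in your cart', // pinned: never model-generated
        'body'    => $body !== '' ? $body : $fallbackBody,
        'cta'     => 'Return to your cart',         // pinned: never model-generated
    ];
}
```

The fallback also covers the case where the personalizer job fails or times out: the customer still receives the baseline email.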

The integration point. A queued personalizer running before the
deliverability tool sends the email.

// app/Mail/Personalizers/AbandonedCartPersonalizer.php

public function personalize(Cart $cart, User $user): string
{
    $context = [
        'abandoned_items'     => $cart->items->toFactSheet(),
        'recent_orders'       => $user->recentOrders(5)->toFactSheet(),
        'avg_order_value_eur' => $user->avgOrderValue(),
    ];

    return $this->claude->messages()->create([
        'model'      => 'claude-haiku-4-5',
        'system'     => $this->copywriterPrompt(template: 'abandoned_cart'),
        'messages'   => [['role' => 'user', 'content' => json_encode($context)]],
        'max_tokens' => 240,
    ])->content_text();
}

A/B test it against the template baseline before rolling out fully. I have
seen lift between 18% and 40% in open-to-click conversion on abandoned-cart
sequences, but your results depend entirely on the quality of your customer
data.
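
For the A/B split, a deterministic assignment keeps each customer in the same arm for the whole sequence. This is a minimal sketch using a CRC32 hash of the user ID; `abVariant` and the experiment name are hypothetical, and the 50/50 split is an assumption.

```php
<?php
declare(strict_types=1);

// Assigns a user to 'template' or 'personalized' based on a stable hash,
// so the same user always gets the same variant for a given experiment.
function abVariant(int $userId, string $experiment = 'abandoned_cart_body'): string
{
    $bucket = crc32($experiment . ':' . $userId) % 2;

    return $bucket === 0 ? 'template' : 'personalized';
}
```

Including the experiment name in the hash input means a later test on a different email does not reuse the same split of customers.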

What unites all five patterns

Look at the code. None of them needed a new framework, a dedicated vector
database, an "AI platform", or a rewrite. Each is a single class with a single
model call, dropped into the existing Laravel app like any other Service.

That's the actual playbook for adding AI to a production app:

Find the place where the existing code is brittle, slow, or expensive.
Wrap the brittle bit in a queued job or controller action.
Replace its body with one Claude SDK call.
Add structured constraints in the system prompt so the model can't invent things outside the facts you pass it.
Keep the human in the loop anywhere a hallucination would cost more than a draft.

If you have a Laravel or PrestaShop app shipping revenue and you want to add
one of these patterns without burning down what works,
email me a brief.
I do exactly this for a living: scope, build, ship, inside your codebase.
2–4 weeks fixed-scope. The shape is documented on the
hire-me page.

Or, if you want to see what the same patterns look like across the full
agent-system stack instead of one feature at a time, the source for
Career-OS — the AI-agent dashboard
I run my own job search through — is MIT-licensed on GitHub.

Originally published on bak-dev.com. Find more build-in-public posts at bak-dev.com/blog.

[FLUTTER] Need help on flutter!

Akram Bakhouche — Fri, 30 Jul 2021 15:21:20 +0000

Hello

I'm looking for a solution for a voice-call app.
I need a way to launch my app and show a screen with two buttons (Accept / Reject).

The app must launch like Skype/Messenger/Signal do when it receives an incoming call.

Best regards