DEV Community

Ayussh Jhunjhunwala
6 Patterns That Changed How I Use Claude Code

What I learned building a production system with AI-assisted development


I built a search pipeline with Claude Code. Four manual tests passed — bearing lookups, part numbers, the works. Looked great. Then I ran an automated eval on 2,134 queries. Combined recall: 23.2%. Target was 90%.

That moment changed how I work with AI.

What I'm building

A WhatsApp bot for manufacturing store personnel to search 37,000 MRO inventory items by text or voice. Three-layer search (tag matching, vector similarity, fuzzy trigram), multi-language support, the whole stack. Built over the last week with Claude Code as my primary development partner.

The patterns below aren't configuration tips. They're hard-won lessons from shipping real code where "close enough" isn't good enough.


Pattern 1: The Mistake Log

Problem: Claude makes the same mistake across sessions. It has no persistent memory. Prose instructions like "be careful with database constraints" don't work — they're too vague to trigger the right behavior.

What I do: A structured table in CLAUDE.md with concrete columns:

```markdown
### Mistake Log (Don't Repeat These)

| Date | Mistake | What Should Have Happened | File Reference |
|------|---------|---------------------------|----------------|
```

Real entries from my log:

| Date | Mistake | What Should Have Happened |
|------|---------|---------------------------|
| 2026-02-06 | `onConflictDoUpdate` targeted only `organisationId` but the DB constraint was `UNIQUE(org_id, operation)` | Always check the migration file for the actual constraint before writing `onConflict` — Drizzle doesn't validate at compile time |
| 2026-02-06 | Passed a JS array directly in a Drizzle sql template (`${arr}::text[]`) — postgres.js can't cast it | Build SQL arrays explicitly: `ARRAY[${sql.join(...)}]::text[]` |
| 2026-02-06 | Declared search pipeline "working" after 4 cherry-picked queries. Audit revealed 23.2% recall | Automated recall evaluation on the FULL dataset is a GATE, not optional |

Each entry is specific — the exact function, the exact file, what actually happened versus what should have happened. Not "be more careful" but "check migration SQL before writing onConflict targets."

Why it works: LLMs don't learn across sessions, but your instructions can. Every mistake becomes a permanent guardrail. The log now has 11 entries, and each one has prevented a repeat.


Pattern 2: Evaluate Before Scaling (The 23.2% Story)

Problem: I tested my three-layer search pipeline manually. Four queries, four correct results. Bearing 6005, NU 326, 22216, 32938 — all code-based queries in well-tagged categories. I was ready to scale from 430 test items to 37,000 production items.

What happened: I built an automated eval harness that generates test queries from every item's tags — 2,134 queries across four query types. Then ran it against the full 430-item subset.

Combined recall@10: 23.2%. Not a typo. Nearly 77% of queries couldn't find their target item in the top 10 results.

The breakdown:

| Query Type | Queries | Recall@10 |
|------------|---------|-----------|
| bare_code (e.g., "6205") | 526 | 18.8% |
| description (e.g., "ball bearing") | 428 | 46.5% |
| category_code (e.g., "bearing 6205") | 575 | 23.3% |
| qualifier (e.g., "oil seal") | 605 | 10.4% |

The root cause: weighted score combination (tag×0.5 + vector×0.3 + fuzzy×0.2) with a MIN_CONFIDENCE=0.4 threshold. Items found by only one search layer got crushed by zero contributions from the other two. Fuzzy search — the best-performing layer at 62.9% standalone recall — was weighted at 0.2, meaning its best score of 0.5 became 0.10 after weighting. Filtered out by the 0.4 threshold.
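The arithmetic of that failure is easy to reproduce. A minimal sketch using the weights and threshold described above (the function name is mine, not the pipeline's actual code):

```typescript
// Weights and threshold as described in the post (names are illustrative).
const WEIGHTS = { tag: 0.5, vector: 0.3, fuzzy: 0.2 };
const MIN_CONFIDENCE = 0.4;

function weightedScore(tag: number, vector: number, fuzzy: number): number {
  return WEIGHTS.tag * tag + WEIGHTS.vector * vector + WEIGHTS.fuzzy * fuzzy;
}

// An item found ONLY by fuzzy search, at fuzzy's best observed score of 0.5:
const score = weightedScore(0, 0, 0.5); // 0.2 * 0.5 = 0.10
const survives = score >= MIN_CONFIDENCE; // false — filtered out

console.log(score, survives); // 0.1 false
```

No single layer can clear a 0.4 threshold on its own unless it's the 0.5-weighted tag layer — which is exactly why single-layer hits vanished.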

The fix: Replaced weighted scoring with Reciprocal Rank Fusion (RRF). After:

| Query Type | Before (Weighted) | After (RRF) |
|------------|-------------------|-------------|
| bare_code | 18.8% | 79.7% |
| description | 46.5% | 72.2% |
| category_code | 23.3% | 63.0% |
| qualifier | 10.4% | 50.1% |
| **Combined** | **23.2%** | **65.3%** |
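RRF itself is only a few lines. This is a generic sketch with the conventional k = 60, not the pipeline's actual implementation: each layer contributes 1/(k + rank) per item, so an item only needs to rank well in one layer to surface.

```typescript
// Reciprocal Rank Fusion: combine ranked ID lists from several search
// layers using rank positions only — raw score scales never interact.
function rrfFuse(rankedLists: string[][], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const list of rankedLists) {
    list.forEach((id, index) => {
      const rank = index + 1; // ranks are 1-based
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}

// "b" ranks well in every layer, so it wins — without any layer's raw
// score ever being compared against another's.
const fused = rrfFuse([
  ["a", "b", "c"], // tag layer
  ["b", "c", "a"], // vector layer
  ["c", "b"],      // fuzzy layer
]);
console.log(fused); // [ 'b', 'c', 'a' ]
```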

This became Command Discipline Rule 7 in my CLAUDE.md:

```
Evaluate before scaling: For ANY data pipeline, run automated recall/accuracy
evaluation on the FULL dataset before declaring success or scaling. Manual
spot-checks (even 10-20 queries) are NOT sufficient — they validate UX,
not correctness. The eval harness is a GATE.
```

Takeaway: Manual spot-checks validate UX, not correctness. Write the eval harness BEFORE declaring success. Four passing tests on cherry-picked queries is not validation — it's confirmation bias.
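The gate itself doesn't need to be elaborate. A minimal recall@k harness looks something like this — `searchFn` stands in for whatever pipeline you're evaluating, and all the names here are mine, not the actual harness:

```typescript
interface EvalCase {
  query: string;
  expectedId: string; // the item this query should retrieve
}

// Fraction of queries whose expected item appears in the top-k results.
function recallAtK(
  searchFn: (query: string) => string[],
  cases: EvalCase[],
  k = 10,
): number {
  let hits = 0;
  for (const { query, expectedId } of cases) {
    if (searchFn(query).slice(0, k).includes(expectedId)) hits++;
  }
  return hits / cases.length;
}

// Toy stand-in pipeline: finds "6205" but misses everything else.
const toySearch = (q: string) => (q.includes("6205") ? ["item-6205"] : []);
const cases: EvalCase[] = [
  { query: "6205", expectedId: "item-6205" },
  { query: "ball bearing", expectedId: "item-6205" },
  { query: "oil seal", expectedId: "item-9001" },
  { query: "bearing 6205", expectedId: "item-6205" },
];
console.log(recallAtK(toySearch, cases)); // 0.5
```

Run it over the full generated query set, not a hand-picked subset — that's the entire point of the gate.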


Pattern 3: Known LLM Failure Modes Checklist

Problem: Working with Claude daily, I kept discovering the same failure modes — sycophancy, code bloat, premature victory declarations. Each time I'd think "I should remember this." I never did.

What I do: An explicit checklist in CLAUDE.md, visible to Claude every turn:

```markdown
## Known LLM Failure Modes (Watch For These)

- **Assumption running**: Surface assumptions explicitly before proceeding
- **Hiding confusion**: Say "I'm not sure about X" instead of guessing
- **Code bloat**: Ask "Is there a simpler way?" before writing
- **Abstraction creep**: YAGNI — only abstract when forced
- **Dead code residue**: Clean up as you go
- **Side-effect edits**: Stay focused on the task
- **Sycophancy**: Push back if something seems wrong
- **Premature victory**: Spot-checking ≠ validation. See Rule 7.
```

Each failure mode has a concrete countermeasure — not "avoid sycophancy" but "push back if something seems wrong." Not "don't write too much code" but "ask 'is there a simpler way?' before writing."

The last one — premature victory — links directly to the eval gate from Pattern 2. That's the feedback loop: a real mistake became a mistake log entry, which became a failure mode with a countermeasure, which now fires every session.

Takeaway: Name the failure modes. Put them where the AI can see them every turn. Vague warnings don't work — specific countermeasures do.


Pattern 4: Cross-Session Memory (MEMORY.md)

Problem: Every new Claude Code session starts cold. I'd spend the first 10 minutes re-explaining what I'd built, what was done, what was next. Context that existed in my head but not in the system.

What I do: Claude Code recently added an auto-memory directory (~/.claude/projects/.../memory/) with a MEMORY.md that gets loaded into every session's system prompt automatically. The feature exists out of the box — but the default content is unstructured notes. The pattern is giving it deliberate structure.

My structure:

```markdown
## Project State (Last Updated: 2026-02-07)

### Phase 3 Progress (LLM-Primary Tagging)
| Task | Status |
|------|--------|
| 3.0 Stratified 430-item subset | Done |
| 3.A LLM tags via Opus | Done |
| ...
| Further recall improvement | **NEXT** — 65.3% still below 90% target |
| 3.F Full 37K tags | BLOCKED — recall must be >90% first |

### Search Pipeline Status
Combined recall@10: **65.3%** (target >90%)
| Query Type | Queries | Combined@10 | Pre-RRF |
|------------|---------|-------------|---------|
| bare_code  | 526     | 79.7%       | 18.8%   |
| ...

## Technical Learnings
- RRF > Weighted Scoring for incompatible score distributions
- All tags MUST be in `category:identifier` format
- Automated recall evaluation is a GATE

## Commands Reference
pnpm check            # typecheck + lint + test
pnpm audit:search     # Search quality audit (24min)
```

Every session, Claude knows: what phase I'm in, what metrics look like, what's blocked and why, and what command to run. No re-explaining.

I keep it under 200 lines (it loads into the system prompt, so bloat has a cost). Status tables, not prose. Numbers, not adjectives.

Takeaway: Most people start cold every session. A structured memory file means Claude picks up where you left off — with the right numbers, the right blockers, the right next step.


Pattern 5: Teaching Mode (What / Why / Takeaway)

Problem: Claude does things for you. Fast. You ship code you don't understand. Next time the same problem comes up, you're just as dependent.

What I do: A permanent instruction in CLAUDE.md:

```markdown
## Teaching Mode (Always On)

For significant actions, briefly explain:
- **What** we're doing
- **Why** this approach
- **Takeaway** you can apply elsewhere
```

Real example: When Claude switched my search pipeline from weighted scoring to RRF, the teaching mode explanation looked like:

**What:** Replacing weighted score combination with Reciprocal Rank Fusion (RRF).
**Why:** Weighted combination requires all layers to produce compatible score scales — our tag layer outputs binary 0/1, vector outputs cosine similarity 0-1, and fuzzy outputs trigram scores 0-1. Multiplying by weights and summing produces meaningless numbers. RRF uses rank positions instead, which are immune to score scale differences.
**Takeaway:** When combining ranked results from systems with incompatible score distributions, use rank-based fusion, not score-based combination.

That explanation taught me a concept — rank-based fusion is immune to score scale differences — that I now apply when thinking about any multi-source ranking problem. Without teaching mode, I'd have gotten a working fix but no understanding.

Three lines of explanation per significant action. That's the cheapest education you'll get.

Takeaway: AI should make you better, not just faster. Teaching mode turns every coding session into a learning session.


Pattern 6: Session Bundling (/wrap)

Problem: End-of-session knowledge capture is aspirational. You mean to update the implementation plan, capture learnings, create a handoff for next time. But after a long session, you just close the terminal. Things get lost between sessions.

What I do: A single /wrap command that bundles three sequential actions:

  1. Update implementation plan — Mark tasks complete, update metrics, note blockers
  2. Capture learnings — Log mistakes to the mistake table, add session insights
  3. Create handoff — Update MEMORY.md, display a summary with the exact next action

The key detail: strict sequencing with validation gates. Step 2 depends on step 1's output (you can't capture what changed until you've reviewed what changed). Step 3 depends on step 2 (the handoff needs to include new learnings).

Before /wrap, I ran these as three separate actions — and frequently skipped one or two. Now it's a single command that takes about 60 seconds and produces everything the next session needs to start immediately.
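Claude Code lets you define custom slash commands as markdown prompt files under `.claude/commands/`. A minimal version of such a command might look like this — the filename and wording here are an illustrative sketch, not my actual command:

```markdown
<!-- .claude/commands/wrap.md (hypothetical sketch) -->
Wrap up this session. Execute these steps IN ORDER — do not start a step
until the previous one is complete:

1. Review the diff of this session's changes, then update the
   implementation plan: mark completed tasks, update metrics, note blockers.
2. Based on step 1's review, append any new mistakes to the Mistake Log
   in CLAUDE.md and record session insights.
3. Update MEMORY.md with the current state, then print a "## Continue With"
   summary containing the exact next action.
```

The numbered steps encode the sequencing; the "do not start a step until the previous one is complete" line is what enforces the validation gates.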

```markdown
## Continue With
Run `pnpm audit:search` to verify current recall numbers, then investigate
the 18 categories below 70% combined recall.
```

That "Continue With" line is the first thing I see next session. No re-orientation needed.

Takeaway: If knowledge capture requires discipline, it won't happen. Make it one command. The best system is the one that runs when you're tired.


The Meta-Insight

The common thread across all six patterns: build feedback loops that compound across sessions.

The mistake log turns errors into permanent guardrails. The eval gate turns gut-feel into measured reality. The failure modes checklist turns experience into systematic checks. Memory turns session context into persistent state. Teaching mode turns code generation into education. Session bundling turns aspiration into automation.

None of these are about configuring Claude Code. They're about building a system around it — where every session makes the next session better.

My setup is strong on the output side — what gets captured and preserved. The next frontier is input optimization — what Claude sees each turn, how to keep context focused, how to avoid the 200-line MEMORY.md from growing to 2,000. But that's a post for another day.

If even one of these patterns saves you a day of debugging, the debt is repaid.
