DEV Community: Lars

Claude Code Best Practices for Vibe Coders: Ship More, Burn Fewer Tokens

Lars — Sat, 30 May 2026 00:00:00 +0000

Vibe coding with Claude Code can be very fast. Costs increase quickly when token usage is not controlled.

After one year of daily use as a software architect, I observed two recurring issues. Agents often produce code with high confidence that does not solve the right problem. The context window grows too large and becomes inefficient, which leads to forgotten instructions and avoidable mistakes.

The habits below address both. They come from two Anthropic talks, the official best-practices guide, and daily work building agent-based workflows. Here is what stuck.

How it works, and the one constraint

Claude Code is the coworker who does everything in the terminal. Under the hood it's a plain agent loop: a model, system instructions, and tools, running until the task is done. There's no orchestration framework. Instead of pre-indexing your repo with RAG, it explores like an engineer, with glob, grep, and find. That trade has a real cost: an unscoped grep dumps raw matches into the window and burns tokens fast, and on a large monorepo a code-aware semantic index returns tighter, ranked results for the same question. I still lean on agentic search for most work, because a fresh grep reads the current state while an index drifts the moment someone merges, but I keep the search scoped and pair it with a semantic retrieval tool once the repo is big enough that raw matches stop fitting. Code Graph is one open-source option: it pre-indexes the codebase into a queryable graph, Claude learns to call its CLI automatically, and you trade the freshness guarantee for a dramatically smaller retrieval footprint per query.

The one constraint behind every habit below: the context window holds the whole conversation, every message, file read, and command output, and the model gets worse as it fills. A single debugging session burns tens of thousands of tokens, and a crowded window means forgotten instructions and avoidable mistakes. The skill is deciding what gets into context and what stays out.

When an agent is worth it

The failure mode I see most is pointing an agent at work that didn't need one. Three conditions:

Complex with an unknown path. You know the destination, like a merged PR from a design doc, but not the route.
Recoverable on error. A wrong web search costs nothing to redo; if a mistake is irreversible, keep a human in the loop.
High value, because the loop burns compute. Save it for coding, deep research, heavy analysis.

If the task is a known sequence of steps, write the script. An agent earns its cost when the route is genuinely uncertain.

Explore, then plan, then code

Letting Claude jump straight to code is how you get a clean solution to the wrong problem. Explore in plan mode first, ask for a written plan, edit it, then switch to normal mode and let it implement against that plan. Skip the plan when the diff fits in one sentence; reach for it when the approach is uncertain or spans files. For a bigger feature, have Claude interview you about edge cases and tradeoffs, then write the answers into a written spec you review before any code gets written, so the spec becomes the handoff a human signs off on.

Be specific, and give it a check it can run

Claude infers intent well and reads minds badly. "Add tests for foo.py" gets generic tests; "write a test for foo.py covering the logged-out case, no mocks" gets the one you wanted. Name the file, the scenario, the constraint, and point it at the pattern to copy. Feed it real input: reference files with @, paste screenshots, pipe data in with cat error.log | claude.

Then give it something that returns pass or fail: a test suite, a build exit code, a linter, a screenshot compared to a design. That's the difference between a session you watch and one you can leave running, because Claude writes the code, runs the check, reads the result, and iterates until it passes. Make it show the evidence, the actual output, not a claim that it's done.

Prompt it like an intern, not a chatbot

A long-running loop needs guardrails a chat prompt doesn't. Tell it when to stop ("when you find the answer, you can stop") and give it a budget ("under five tool calls for a simple query"), or it'll burn the window hunting for a better source. Ask it to plan out loud before calling tools and reflect on the results after. When the window fills, /compact writes a dense summary and refreshes it; for big jobs, let sub-agents do the research and hand back a short markdown summary. Two reflexes worth building: hit Escape the instant it goes down the wrong path, and reach for a CLI tool over an MCP server when both exist, since the model parses terminal output well.

Write the prompt the way Anthropic writes theirs

The habits above are about steering the loop. This one is about the prompt itself, and it matters most for the text you reuse: a SPEC.md, a CLAUDE.md, or a one-shot prompt you run headless in CI. Hannah and Christian's prompting walkthrough took one task from a wrong answer to a usable one without touching the model, just by restructuring the prompt. The same moves work for Claude Code.

Order the prompt the way a careful handoff reads. Role and task first, so the model knows what it's doing before it sees the data. Then the content. Then step-by-step instructions for how to work through it. Then any examples. Then a short reminder of the critical rules at the very end, because the last thing it reads is the freshest. In their demo, swapping a vague intro for "you help a claims adjuster review Swedish car-accident forms" moved Claude off a guess about a skiing accident and onto the actual task.

A few moves carry most of the weight:

Structure with delimiters. Claude was tuned on XML tags, so wrapping sections in tags like <form_layout> or <examples> tells it exactly what each block is and lets it refer back later. Markdown headings help too.
Show examples for the cases that trip it up. When a tricky input has a right answer your intuition knows, bake the input and the worked reasoning into the prompt. A handful of labeled examples steers the model harder than another paragraph of instructions.
Put stable context up front and keep it fixed. The parts that never change, a form layout, your coding conventions, the project's directory map, belong at the top where they read the same every run. That's also exactly what prompt caching rewards.
Prefill the answer to shape the output. If you need JSON, start the response with {; if you need a verdict in tags, open with <final_verdict>. The model continues from where you started it, so you get the structured output without parsing past the preamble.
Order the analysis on purpose. Tell it what to read first. In the accident demo, reading the form before the sketch was the difference between a confident verdict and a shrug, because the sketch only makes sense once you know the boxes that were checked.
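
One way to keep that ordering honest for a reusable prompt is to assemble it in code. This is a minimal sketch, not Anthropic's template: the `build_prompt` helper and the `instructions` and `reminders` tag names are assumptions made for illustration, while `form_layout` and `examples` come from the list above.

```python
def tag(name, body):
    """Wrap a block in an XML-style tag so the model can refer back to it."""
    return f"<{name}>\n{body.strip()}\n</{name}>"


def build_prompt(role, content, steps, examples, reminders):
    """Assemble sections in the order: role, content, steps, examples, reminders."""
    numbered = "\n".join(f"{i}. {s}" for i, s in enumerate(steps, start=1))
    parts = [
        role.strip(),                             # role and task first
        tag("form_layout", content),              # stable content near the top
        tag("instructions", numbered),            # how to work through it
        tag("examples", "\n\n".join(examples)),   # worked cases that trip it up
        tag("reminders", "\n".join(f"- {r}" for r in reminders)),  # critical rules last
    ]
    return "\n\n".join(parts)


prompt = build_prompt(
    role="You help a claims adjuster review Swedish car-accident forms.",
    content="(form field descriptions go here)",
    steps=["Read the form first.", "Then read the sketch.", "Give a verdict."],
    examples=["(a tricky input with its worked reasoning)"],
    reminders=["Only give a verdict when the form supports it."],
)
print(prompt)
```

Because the stable sections are built by the same function every run, the top of the prompt stays byte-identical, which is what prompt caching needs.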

One more, specific to Claude 4: let it think between tool calls. Extended thinking is a readable scratchpad, so you can watch where the reasoning goes wrong and fold that correction back into the prompt instead of guessing.

Your CLAUDE.md can be a bottleneck

Claude Code loads CLAUDE.md at session start, and the file rides along in context on every prompt after. Mine bloated for months. Restructuring cut it by 60 to 75 percent with the same coverage and the same guardrails.

Two changes did it. First, scoping. The top-level file keeps only what's truly global, and the docs put a number on it: target under 200 lines, because longer files eat context and adherence drops. Everything that only matters in one part of the codebase moved into .claude/rules/, one file per topic, scoped to the paths it covers with a paths glob in the frontmatter:

---
paths:
  - "src/api/**/*.ts"
---

# API rules
- Every endpoint validates input
- Use the standard error response shape

A path-scoped rule loads only when Claude reads a file that matches, so the design-system rules stay out of context until it opens a component and the API rules stay out until it touches a handler (path-specific rules). Rules without a paths field load every session at the same priority as the root file, so keep those few. Nested CLAUDE.md files behave the same lazy way if you'd rather keep instructions sitting next to the code.

Second, hooks for the deterministic rules. Instead of "don't import axios" in prose, I wrote .claude/hooks/forbidden-imports.sh; the hook intercepts the write before the file changes, so the import never lands. Same for version pinning and toolchain checks. Instructions are advisory and drift as the file grows; a hook costs zero tokens per turn and either passes or fails. The test on every line: would removing this cause a mistake? If not, cut it. If your CLAUDE.md is longer than a screen, you're paying for it on every prompt.
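
For a sense of what such a check involves, here is a sketch of the decision logic in Python rather than shell. It is not the contents of forbidden-imports.sh: the `decide` and `find_forbidden_imports` helpers are hypothetical, and the assumption that the Write tool sends its text as `content` and Edit as `new_string` should be checked against the hooks reference before relying on it.

```python
import re

FORBIDDEN = ("axios",)  # hypothetical policy list
QUOTE = "['\"]"


def find_forbidden_imports(source):
    """Return forbidden package names that `source` imports or requires."""
    hits = []
    for name in FORBIDDEN:
        pattern = (
            r"(?:from\s+|import\s+|require\(\s*)"
            + QUOTE + re.escape(name) + r"(?:/[^'\"]*)?" + QUOTE
        )
        if re.search(pattern, source):
            hits.append(name)
    return hits


def decide(event):
    """Return (exit_code, message) for one hook event."""
    tool_input = event.get("tool_input", {})
    # Assumption: Write carries the new text in `content`, Edit in `new_string`.
    text = tool_input.get("content") or tool_input.get("new_string") or ""
    hits = find_forbidden_imports(text)
    if hits:
        return 2, "Blocked: forbidden import(s): " + ", ".join(hits)
    return 0, ""


# Wired up as a real hook, the script would end with something like:
#   import json, sys
#   code, message = decide(json.load(sys.stdin))
#   print(message, file=sys.stderr); sys.exit(code)
sample = {
    "tool_name": "Write",
    "tool_input": {"file_path": "src/api.ts", "content": 'import axios from "axios";'},
}
code, message = decide(sample)
print(code, message)
```

Keeping the decision in a pure function means the rule can be unit-tested without starting an agent session.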

One caveat worth knowing. Context files make agents more rigorous, more testing, more search, more careful file work, because the file reads as permission to do the thorough thing. That rigor costs extra steps and extra tokens, which is the argument for surgical, human-written rules over a CLAUDE.md you asked the model to generate. Auto-generated docs tend to enumerate every directory in the repo and bloat the window without earning it.

Hooks: the guardrail Claude can't skip

A Claude Code hook is a shell command that runs automatically at a point in the agent's lifecycle, like before a file read or after an edit. A CLAUDE.md line such as "always run the linter" is advisory, so the model can forget it; a hook runs every time regardless of what the model decides. That makes hooks the right tool for anything that must never be skipped: secrets protection, formatting, type checks.

The two events that cover most guardrails are PreToolUse and PostToolUse, and the difference is whether they run before or after the tool call:

	PreToolUse	PostToolUse
Runs	Before the operation	After it completes
Can block?	Yes, exit code 2 stops it	No, the operation already happened
Good for	Security, validation, protection	Formatting, type checks, feedback
Example	Block a read of `.env`	Auto-format the file Claude just wrote

You register a hook in JSON at one of three scopes: ~/.claude/settings.json for every project of yours, .claude/settings.json committed for the whole team, or .claude/settings.local.json gitignored for just your machine. A matcher field picks which tools fire it, and the pipe means "or", so Read|Grep watches both read paths and Edit|Write watches both write paths. The script reads a JSON event on stdin, so you pull the file path out with jq.

First guardrail: keep Claude out of your secrets. A .env with an API key is something the model should never read into context, and both the Read and Grep tools can pull it in. A PreToolUse hook on Read|Grep blocks it with exit code 2, and the stderr message goes back to Claude as the reason. Match .env only and you've covered one of a dozen places secrets live, so widen the pattern to credentials files, keys, and secrets/ directories:

#!/bin/bash
INPUT=$(cat)
FILE_PATH=$(echo "$INPUT" | jq -r '.tool_input.file_path // .tool_input.path // empty')
if echo "$FILE_PATH" | grep -qE '(^|/)\.env|/\.aws/credentials|credentials\.json|\.npmrc$|\.pem$|/secrets/'; then
  echo "Security policy: cannot read this file, it may hold secrets" >&2
  exit 2
fi
exit 0

Treat that hook as a second layer, not the lock. The hard control is settings-level permissions.deny (for example Read(./.env) and Read(./secrets/**)), which the client enforces before the tool ever runs, regardless of what the model decides; the hook only reports back after it fires. Use the deny rules for the files you know about, the hook as a wider net for the ones you forget.
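
Before trusting a path pattern like this, it helps to run it against paths you care about. The sketch below reuses the same alternation from the grep -E line in Python; the sample paths are made up for illustration. Two edge cases show up: a relative `secrets/` path slips through because the pattern expects a leading slash, and any name that merely starts with `.env` is blocked.

```python
import re

# Same alternation as the grep -E pattern in the shell hook.
SECRET_PATTERN = re.compile(
    r"(^|/)\.env|/\.aws/credentials|credentials\.json|\.npmrc$|\.pem$|/secrets/"
)


def is_blocked(path):
    """True when the hook's pattern would refuse to read this path."""
    return SECRET_PATTERN.search(path) is not None


# Hypothetical paths, chosen to cover hits, misses, and the two edge cases.
cases = [
    "/repo/.env",
    "/repo/.env.local",
    "/repo/config/credentials.json",
    "/repo/certs/server.pem",
    "/repo/secrets/db.yaml",
    "secrets/db.yaml",          # relative path: no leading slash to match
    "/repo/.environment.md",    # blocked only because it starts with .env
    "/repo/src/app.ts",
]
for path in cases:
    print(f"{'BLOCK' if is_blocked(path) else 'allow'}  {path}")
```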

Second: format after every edit so you stop reminding the model. This one is PostToolUse on Edit|Write, so it can't block, it just runs Prettier on the file that changed and reports back:

#!/bin/bash
INPUT=$(cat)
FILE_PATH=$(echo "$INPUT" | jq -r '.tool_input.file_path // empty')
if [[ "$FILE_PATH" =~ \.(ts|tsx|js|jsx|css|json|md)$ ]]; then
  # Report success only when Prettier actually rewrote the file
  npx prettier --write "$FILE_PATH" 2>/dev/null && echo "Formatted: $FILE_PATH"
fi
exit 0

One trap to know: reformatting a file right after the model wrote it can desync the model's in-context view of that file. Claude's edits match on the exact old text, so if Prettier reflows whitespace behind its back, the next queued edit can miss. If you hit that, move formatting to a Stop hook that runs once when the agent finishes its batch instead of after every Edit|Write.

Third, the one that earns its keep: a type-check feedback loop. When Claude changes a function signature it often misses a call site, so a PostToolUse hook runs tsc --noEmit and writes any errors to stderr. It always exits 0, because the edit already landed and blocking won't undo it; the errors come back as feedback, Claude fixes the call sites, the hook runs again, and the loop closes when the codebase is type-safe:

#!/bin/bash
INPUT=$(cat)
FILE_PATH=$(echo "$INPUT" | jq -r '.tool_input.file_path // empty')
[[ ! "$FILE_PATH" =~ \.(ts|tsx)$ ]] && exit 0
OUTPUT=$(npx tsc --noEmit 2>&1)
if [ $? -ne 0 ]; then
  echo "Type errors after editing $FILE_PATH:" >&2
  echo "$OUTPUT" >&2
fi
exit 0

Watch the cost here. A full tsc --noEmit takes seconds to minutes on a medium-to-large repo, and this version runs it synchronously after every edit, so a five-file change pays the compile five times and the loop crawls. Two fixes: run it on a Stop hook so it fires once when the agent finishes its batch, or keep it per-edit but use tsc --incremental (or scope it to the changed package) so you're recompiling the delta, not the world. The per-edit version above is the version to learn the pattern on, not the one to point at a large codebase.

All three register in one settings file under their event name:

{
  "hooks": {
    "PreToolUse": [
      { "matcher": "Read|Grep", "hooks": [{ "type": "command", "command": ".claude/hooks/protect-env.sh", "timeout": 10 }] }
    ],
    "PostToolUse": [
      { "matcher": "Edit|Write", "hooks": [
        { "type": "command", "command": ".claude/hooks/auto-format.sh", "timeout": 30 },
        { "type": "command", "command": ".claude/hooks/typecheck.sh", "timeout": 60 }
      ] }
    ]
  }
}

The type-check pattern works in any typed language: swap tsc --noEmit for mypy, go vet, or cargo check, and for an untyped language run the linter or the test suite instead. There are more events than these two, like UserPromptSubmit, SessionStart, and Stop, but PreToolUse and PostToolUse cover most of what you'd want a guardrail for.

Set up once, then keep the session clean

A few minutes of setup pays off everywhere. Tame permissions so you're not approving every command: auto mode lets a classifier block only the risky ones, or allowlist safe repetitive commands like npm run lint. Install the CLIs for services you use, like gh and aws, connect MCP servers for what a CLI can't reach, and put sometimes-relevant knowledge in skills so it loads on demand instead of living in every conversation.

In the session itself, correct early; Escape preserves context so you can redirect, and /clear between unrelated tasks stops old context from distracting the model. If you've corrected the same thing twice, the context is polluted, so clear it and write a sharper prompt with what you learned. For research that would read a hundred files, send a subagent so the digging stays out of your main window.

Three more levers worth knowing. First, /context shows you exactly what is eating your token budget right now; an oversized markdown doc or an MCP loading silently in the background can drain thousands of tokens per turn without you noticing. Second, /model lets you switch to a smaller model mid-session: Haiku handles web navigation and repetitive scripting for a fraction of the cost, so save Sonnet or Opus for the decisions that need it. Third, if your agentic loop reads noisy CLI output (long test runners, server logs), RTK compresses that output before it lands in the context window; 43 test-passed lines become one. RTK is lossy, so turn it off when you need Claude to debug from the raw output, but for green-path automation it cuts token burn significantly.
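
To make the lossy-compression idea concrete, here is a toy filter that collapses runs of passing lines. It is not RTK's implementation or its output format; the `compress` helper and the `PASS`/`ok` line prefixes are assumptions chosen for the example.

```python
import re

# Assumption: passing lines start with PASS or ok, as many test runners print.
PASS_LINE = re.compile(r"^\s*(?:PASS|ok)\b", re.IGNORECASE)


def compress(lines):
    """Collapse each run of consecutive passing lines into one summary line."""
    out, run = [], 0
    for line in lines:
        if PASS_LINE.match(line):
            run += 1
            continue
        if run:
            out.append(f"[{run} passing lines collapsed]")
            run = 0
        out.append(line)  # failures and everything else pass through untouched
    if run:
        out.append(f"[{run} passing lines collapsed]")
    return out


raw = [f"PASS tests/test_{i}.py" for i in range(43)]
raw.append("FAIL tests/test_cart.py: expected 2, got 3")
print("\n".join(compress(raw)))
```

The failing line survives intact while the 43 passing lines shrink to one, which is also why a filter like this has to come off when the detail in the passing output matters.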

Evaluate by checking the end state

Grading an agent is harder than grading text. Start small and manual: five to ten realistic tasks, transcripts read by hand to see where it gets stuck, then an LLM as a judge with a strict rubric for the fuzzy parts. The test I anchor on is final-state validation: stop reading what the agent says it did and check the environment. If the goal was to update a booking, query the row and confirm the date changed. The model is optimistic about its own work, so the only claim I trust is the one the environment confirms.
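
A minimal sketch of final-state validation for the booking example, using an in-memory SQLite database as a stand-in for the real environment. The `bookings` table, its columns, and the dates are hypothetical; the UPDATE statement plays the part of the agent's work.

```python
import sqlite3


def booking_date(conn, booking_id):
    """Read the date straight from the table, ignoring what the agent reported."""
    row = conn.execute(
        "SELECT date FROM bookings WHERE id = ?", (booking_id,)
    ).fetchone()
    return row[0] if row else None


def check_final_state(conn, booking_id, expected):
    """Return (passed, actual) by comparing the stored date with the goal."""
    actual = booking_date(conn, booking_id)
    return actual == expected, actual


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bookings (id INTEGER PRIMARY KEY, date TEXT)")
conn.execute("INSERT INTO bookings VALUES (1, '2026-06-01')")

# Stand-in for the agent run; in a real eval the agent performs this step.
conn.execute("UPDATE bookings SET date = '2026-06-08' WHERE id = 1")

passed, actual = check_final_state(conn, 1, "2026-06-08")
print(f"passed={passed} actual={actual}")
```

The grader never parses the transcript, so an agent that claims success without writing the row fails the check.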

A few failure patterns repeat: the kitchen-sink session (unrelated tasks in one context, fix with /clear), the correction spiral (clear and rewrite after two failed tries), the over-stuffed CLAUDE.md (prune, move rules to hooks), the trust-then-verify gap (no runnable check, so edge cases slip), and the infinite exploration (an unscoped "investigate" that reads a hundred files, so scope it or hand it to a subagent).

Two things to keep in mind

This moves fast enough that today's failure is next week's default. If something doesn't work with the current model or tool, park it and try again in a few weeks. A hard "no" is usually a "not yet."

The other one doesn't go away: the model hallucinates, and you own the output. A confident wrong answer is still your commit, your PR, your 2am incident. Guardrails, rules, and hooks catch the predictable mistakes, and feeding the model your project's real conventions stops it from inventing its own. Don't over-tune it, though. A 600-line rulebook the model can't hold is worse than ten rules it actually follows.

What hooks or context trims are you running in your own setup? Tell me on LinkedIn.

Sources: Cal's Claude Code talk (YouTube), Hannah and Christian on prompting (YouTube), Anthropic's Best practices for Claude Code guide, the Claude Code hooks guide, and John Kim's breakdown of cutting Claude Code token usage by up to 90% for the Code Graph and RTK strategies. These are my notes and opinions, not Anthropic's. If I misread a detail, ping me and I'll fix it.

turbovec: Local RAG Without the 60 GB Tax

Lars — Mon, 25 May 2026 16:43:12 +0000

A 1536-dimensional float32 embedding is 6 KB. A corpus of 10 million documents is roughly 60 GB of raw vectors before any index overhead. That doesn't fit in laptop RAM, and even on a machine with 64 GB you've left yourself no headroom for anything else.

FAISS was my default. It works, but I kept hitting two friction points: training requires a representative sample of your corpus upfront, and compression quality depends on how well that sample matches the real distribution. If your data distribution shifts, you're rebuilding.

turbovec solves both, and the TurboQuant paper (arXiv April 2025, Google Research + NYU) explains the math behind why it can skip the training step entirely.

What TurboQuant actually does

The core idea is a mathematical trick: apply a random rotation to your vectors before compressing them.

After rotation, each coordinate follows a scaled Beta distribution that converges to Gaussian N(0, 1/d) in high dimensions. The coordinates also become nearly independent. That combination is what makes training-free quantization possible: you can precompute optimal bucket boundaries from pure math, with no data required upfront.
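
The effect of the rotation is easy to see on a worst-case input. The sketch below is pure Python and uses Gram-Schmidt on Gaussian rows to build a random orthogonal matrix; that construction is an assumption for illustration, not necessarily how turbovec generates its rotation. It rotates a one-hot vector and shows the energy spreading across coordinates while the norm stays at 1.

```python
import math
import random


def random_rotation(d, seed=0):
    """Random orthogonal d x d matrix via Gram-Schmidt on Gaussian rows.

    Strictly this may include a reflection; for spreading energy across
    coordinates that makes no difference.
    """
    rng = random.Random(seed)
    rows = []
    while len(rows) < d:
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        for u in rows:  # remove components along rows already accepted
            dot = sum(a * b for a, b in zip(v, u))
            v = [a - dot * b for a, b in zip(v, u)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm > 1e-9:
            rows.append([a / norm for a in v])
    return rows


def rotate(matrix, x):
    """Matrix-vector product: one dot product per row."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in matrix]


d = 64
R = random_rotation(d)
x = [1.0] + [0.0] * (d - 1)  # all energy in one coordinate
y = rotate(R, x)

print(f"norm after rotation: {math.sqrt(sum(v * v for v in y)):.6f}")
print(f"largest |coordinate| before: {max(abs(v) for v in x):.3f}")
print(f"largest |coordinate| after:  {max(abs(v) for v in y):.3f}")
print(f"mean squared coordinate: {sum(v * v for v in y) / d:.5f} (1/d = {1 / d:.5f})")
```

Since the norm is preserved, the mean squared coordinate is pinned at 1/d, which matches the variance of the Gaussian limit described above.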

The algorithm in four steps:

Normalize each vector to unit length; store the norm as a float
Apply a fixed random rotation matrix (same matrix for the whole index, computed once at setup)
Quantize each rotated coordinate against precomputed bucket boundaries; at 4-bit that's 16 buckets per coordinate
Pack the integers: a 1536-dim vector goes from 6,144 bytes (float32) to 768 bytes at 4 bits per coordinate

A 10M-doc corpus: ~60 GB float32 becomes ~7.5 GB at 4-bit, an 8x reduction. The paper proves the MSE distortion lands within a factor of √3·π/2 ≈ 2.7 of the information-theoretic lower bound at any bit-width, which is tight for a training-free method. At 4-bit specifically, MSE is approximately 0.009.
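
The size arithmetic can be checked by packing stand-in codes two per byte. This is a sketch of the packing step only: the nibble layout is an assumption, not turbovec's on-disk format, and the per-vector float that stores the norm is left out of the totals.

```python
def pack_4bit(codes):
    """Pack integers in 0..15 two per byte, low nibble first."""
    codes = list(codes)
    if len(codes) % 2:
        codes.append(0)  # pad odd lengths with a zero nibble
    return bytes(lo | (hi << 4) for lo, hi in zip(codes[0::2], codes[1::2]))


def unpack_4bit(packed, n):
    """Recover the first n codes from the packed bytes."""
    out = []
    for b in packed:
        out.extend((b & 0x0F, b >> 4))
    return out[:n]


dim = 1536
docs = 10_000_000
codes = [i % 16 for i in range(dim)]  # stand-in for quantized bucket indices
packed = pack_4bit(codes)

float32_bytes = dim * 4
print(f"float32: {float32_bytes} B/vector, {float32_bytes * docs / 1e9:.1f} GB total")
print(f"4-bit:   {len(packed)} B/vector, {len(packed) * docs / 1e9:.1f} GB total")
print(f"ratio:   {float32_bytes / len(packed):.0f}x")
```

Going from 32 bits to 4 bits per coordinate is the whole 8x; nothing else in the pipeline changes the storage size.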

Search doesn't decompress vectors. It rotates the query once into the same domain and scores against codebook centroids using SIMD kernels (NEON on ARM, AVX-512 on x86). Per turbovec's own benchmarks, on ARM it beats FAISS IndexPQFastScan by 12-20%.

The part I initially glossed over: MSE and inner product are different problems

For RAG, what matters is preserving similarity scores, and MSE-optimized quantizers don't do that.

When you search a vector index, you're finding stored vectors with the highest dot product against your query. The TurboQuant paper proves that quantizers optimized purely for reconstruction accuracy introduce bias into inner product estimates. The compressed vectors rebuild accurately, but their similarity scores with a query vector are systematically off. You get wrong nearest neighbors.

TurboQuant fixes this with a two-stage approach. Stage one applies MSE quantization at one fewer bit than your target budget (so 3 bits if you want 4-bit total), which minimizes reconstruction error and shrinks the residual as much as possible. Stage two takes that residual and applies a 1-bit random projection transform called QJL (Quantized Johnson-Lindenstrauss). QJL is an optimal 1-bit inner product quantizer: it reduces the residual to a single bit per dimension using sign(random_matrix · vector), and the paper proves this makes the combined estimator unbiased.
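
The unbiasedness of the sign-bit stage can be checked numerically. The sketch below applies the 1-bit estimator directly to a whole vector rather than to a quantization residual, and it uses far more projection rows than the one-bit-per-dimension budget described above, purely so the averaging is visible. The function names are hypothetical.

```python
import math
import random


def unit(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def qjl_encode(k, S):
    """Keep one sign bit per projection row, plus the vector's norm."""
    signs = [1 if sum(s * x for s, x in zip(row, k)) >= 0 else -1 for row in S]
    return signs, math.sqrt(sum(x * x for x in k))


def qjl_inner_product(q, signs, k_norm, S):
    """Estimate <q, k> from the sign bits; the query stays unquantized."""
    total = sum(
        sg * sum(s * x for s, x in zip(row, q)) for sg, row in zip(signs, S)
    )
    # sqrt(pi/2) undoes E|Z| = sqrt(2/pi) for a standard Gaussian Z.
    return math.sqrt(math.pi / 2) * k_norm * total / len(S)


rng = random.Random(0)
d, m = 16, 4000  # m projection rows; more rows means lower variance
S = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(m)]
k = unit([rng.gauss(0, 1) for _ in range(d)])
q = unit([rng.gauss(0, 1) for _ in range(d)])

signs, k_norm = qjl_encode(k, S)
exact = sum(a * b for a, b in zip(q, k))
estimate = qjl_inner_product(q, signs, k_norm, S)
print(f"exact={exact:+.4f} estimate={estimate:+.4f}")
```

The estimate scatters around the exact value instead of being pulled toward zero, which is the property that matters for ranking by dot product. The scatter itself is the variance that a small bit budget cannot remove.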

The whole thing is data-oblivious. It works on the first vector you add to the index. The result is near-optimal reconstruction accuracy and unbiased similarity scores at your target bit-width.

For KV cache compression in long-context LLMs (storing attention keys and values), the paper tests Llama-3.1-8B on LongBench-E: 3.5 bits per channel matches unquantized quality, 2.5 bits shows only marginal degradation, while compressing the cache by more than 5x. The inner product unbiasedness property is what makes it work for attention computation.

The practical part: one import swap

turbovec ships drop-in replacements for the in-memory vector stores in LangChain, LlamaIndex, Haystack, and Agno. For LangChain:

pip install turbovec[langchain]

# Before
from langchain_core.vectorstores import InMemoryVectorStore

# After — same API, smaller footprint, faster search
from turbovec.integrations.langchain import TurboVecVectorStore as InMemoryVectorStore

Everything else in the pipeline stays the same. I swapped this into an existing LangChain project in a few minutes. Memory dropped by roughly 8x and retrieval got a bit faster.

For IdMapIndex (when you need stable IDs that survive deletes):

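from turbovec import IdMapIndex
import numpy as np

# Placeholder data: three float32 vectors and one query; swap in real embeddings
vectors = np.random.randn(3, 1536).astype(np.float32)
query = np.random.randn(1536).astype(np.float32)

index = IdMapIndex(dim=1536, bit_width=4)
index.add_with_ids(vectors, np.array([1001, 1002, 1003], dtype=np.uint64))

scores, ids = index.search(query, k=10)
index.remove(1002)  # O(1) by id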

What the pgvector benchmarks actually show

I have been exploring TurboQuant for use with pgvector. To evaluate it, I ran the RAG benchmarks created by Johann-Peter Hartmann.

The storage and index scan wins are real. At 4-bit, your vector column shrinks by around 8x, and index scans run faster because you're moving far less data through memory. On a large corpus, that gap is meaningful.

The retrieval quality story is less clean. Quantizing inside pgvector degrades recall measurably compared to full float32 search. You can lose real top candidates from your top-k window. The TurboQuant unbiasedness proof is mathematically correct, but unbiased inner product estimates still carry variance at 4 bits, and in dense retrieval that variance pushes results around. The second-best document in float32 might not appear in your top-10 at 4-bit.

Two cases where the trade-off still makes sense: storage-constrained deployments where approximate retrieval is acceptable, or pipelines that rerank with a cross-encoder anyway (the reranker recovers from retrieval noise). If you're running semantic search where missing the true top result matters, measure your recall on a held-out set before committing.
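
Measuring recall takes only a few lines once you have two ranked lists per query, one from exact float32 search and one from the quantized index. This is a minimal sketch with made-up document ids; how you obtain the two lists depends on your stack.

```python
def recall_at_k(exact_ids, approx_ids, k):
    """Fraction of the exact top-k that the approximate search also returned."""
    truth = set(exact_ids[:k])
    if not truth:
        return 0.0
    return len(truth & set(approx_ids[:k])) / len(truth)


def mean_recall(exact_runs, approx_runs, k):
    """Average recall@k over a set of held-out queries."""
    scores = [recall_at_k(e, a, k) for e, a in zip(exact_runs, approx_runs)]
    return sum(scores) / len(scores)


# Hypothetical results for three held-out queries (document ids, best first).
exact = [[11, 12, 13], [21, 22, 23], [31, 32, 33]]
approx = [[11, 13, 12], [21, 29, 22], [35, 31, 36]]

print(f"recall@3 = {mean_recall(exact, approx, 3):.2f}")
```

Recall@k ignores order inside the top-k, so a reranking pipeline can tolerate a lower score than a pipeline that shows the first hit directly.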

If you want to run this comparison yourself against your own corpus, here's the benchmark setup I used:

import numpy as np
import time
from turbovec import IdMapIndex

dim = 1536
num_vectors = 1_000_000

embeddings = np.random.randn(num_vectors, dim).astype("float32")
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)
ids = np.arange(1, num_vectors + 1, dtype=np.uint64)

queries = np.random.randn(100, dim).astype("float32")
queries /= np.linalg.norm(queries, axis=1, keepdims=True)

t0 = time.perf_counter()
index = IdMapIndex(dim=dim, bit_width=4)
index.add_with_ids(embeddings, ids)
print(f"build: {time.perf_counter() - t0:.2f}s")

latencies = []
for q in queries:
    t0 = time.perf_counter()
    index.search(q, k=10)
    latencies.append(time.perf_counter() - t0)

avg_ms = np.mean(latencies) * 1000
p99_ms = np.percentile(latencies, 99) * 1000
print(f"avg: {avg_ms:.2f}ms  p99: {p99_ms:.2f}ms")

No training pass, no codebook warmup. The index is ready to search after the first add_with_ids call. Swap in your real embeddings and IDs, then run the same timing loop against FAISS IndexPQFastScan at the same bit-width to get a direct comparison.

When FAISS is still the right tool

turbovec is an in-memory flat index: it searches all vectors on every query. For a few million vectors on a single machine, that's fine. At hundreds of millions you need IVF partitioning to reduce the search scope, and FAISS handles that.

The ARM picture is clean: turbovec beats FAISS IndexPQFastScan by 12-20% across typical configurations. x86 is more conditional. At 4-bit, turbovec wins by 1-6% due to tighter cache lines and faster bit-unpacking. At 2-bit single-threaded, they run within 1% of each other. At 2-bit multi-threaded on AVX-512 hardware, FAISS pulls ahead by 2-4%; it exploits AVX-512 VBMI for bit manipulation during concurrent sweeps, an instruction path turbovec doesn't yet use. On enterprise x86 with high thread counts at 2-bit, that edge is real.

At high dimensions (d=1536, d=3072), turbovec matches or beats FAISS at R@1; both converge to 1.0 recall by k=4-8. At d=200 (GloVe territory), turbovec trails at R@1 because the near-Gaussian approximation from the random rotation weakens at low dimensions.

The rule: turbovec for local RAG with modern embedding dimensions, FAISS for very large corpora, GPU-accelerated search, or multi-threaded 2-bit lookups on AVX-512 servers.

What I'm using it for

I'm running turbovec in ThoughtForge for per-space semantic search. The nomic-embed-text-v1.5 model produces 768-dimensional embeddings; at 4-bit compression the full index is small enough that loading at app startup takes under a second. Local embeddings, local index, no data leaves the machine.

If you're building local RAG and hitting the float32 memory wall, this is the first thing I'd try.

Modern React Performance Without the Overhead

Lars — Mon, 25 May 2026 16:40:20 +0000

A product manager says "we'll use React, performance won't be a problem." Two sprints later, the Lighthouse INP score is 800ms and the main thread is blocked on a 400 KB vendor bundle nobody audited.

React itself isn't the problem. The flexibility is.

INP replaced FID, and most teams haven't adjusted

Google's Core Web Vitals now measure three things:

LCP (Largest Contentful Paint): loading speed for elements inside the viewport. Anything below the fold doesn't count.
CLS (Cumulative Layout Shift): visual stability. React 19's native document metadata handling solves a category of CLS issues caused by stylesheets loading after first paint.
INP (Interaction to Next Paint): the one most teams are failing now. It replaced First Input Delay in March 2024 and measures every interaction across the page lifetime, not just the first. A single slow click handler tanks your score.

FID was easy to pass because it only measured the first interaction. INP measures all of them, which means large synchronous bundles and heavy render trees anywhere on the page will show up.

Three browser profiles, not one

Your daily browser lies to you. Extensions inject scripts, manipulate the DOM, and run background tasks. Lighthouse scores measured under those conditions aren't representative of what users see.

Set up three profiles in Chrome or Firefox:

Dev profile: all your extensions (React DevTools, Apollo, Redux, etc.). Use this for development.
Profiling profile: zero extensions, CPU throttling enabled. Use this for Lighthouse and manual profiling.
Normal profile: whatever you use day-to-day.

Always profile a production build. Development mode adds diagnostic overhead that inflates render times and bundle sizes in ways that have nothing to do with what ships. Run npm run build, serve it locally, then open Lighthouse in the clean profile. The Lighthouse Tree Map shows you which dependencies are inflating which chunks; look there before opening the bundle analyzer.

The React Compiler doesn't fix bad state placement

Before React 19, avoiding cascading re-renders meant wrapping things in useMemo and useCallback. It worked when done carefully, failed silently when done wrong, and cluttered codebases either way.

The React Compiler statically analyzes your component tree and applies automatic memoization. You write straightforward JavaScript; the compiler handles the caching. This eliminates the category of "forgot to memoize this callback" bugs.

What it doesn't fix: state placed too high in the component tree.

If you have a text input and the user sees lag under CPU throttling, check where the state lives. The runtime evaluates every component between the state and the consumer on each keystroke. Move state as close to the consuming component as possible. The compiler can't restructure your component hierarchy; that decision belongs to you.

Bundle size and layout shift

Heavy components loaded before they're needed block the main thread. React.lazy and Suspense fix this by fetching a component only when it's required:

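import React, { Suspense } from "react";
import ChartSkeleton from "./ChartSkeleton"; // hypothetical path for the fallback component

// The chunk for HeavyChart is fetched only when Dashboard first renders it
const HeavyChart = React.lazy(() => import("./HeavyChart"));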

function Dashboard() {
  return (
    <Suspense fallback={<ChartSkeleton />}>
      <HeavyChart />
    </Suspense>
  );
}

Check packages on Bundlephobia before importing. A date-picker that pulls in 80 KB gzipped when you need one function is a problem you choose.

React 19 adds native support for resource hoisting. You can prefetch DNS and preload assets from components without managing <head> manually, and <title>, <meta>, and <link> tags rendered anywhere in the component tree are automatically hoisted to <head>. That removes the need for react-helmet or equivalent libraries for the common case:

import { prefetchDNS, preload } from "react-dom";

prefetchDNS("https://fonts.googleapis.com");
preload("/hero.webp", { as: "image" });

For CLS: give images explicit width and height attributes. The browser reserves the correct space before the asset arrives; without them, the layout shifts when the image loads and you lose CLS points for every user on a slow connection. Use loading skeletons rather than spinners; a skeleton anchors the layout geometry while data fetches.

Actions and optimistic UI

Form handling in React pre-19 involved isLoading state, disabled buttons, and a UI that froze until the server responded. That pattern hurts INP because the interaction latency is the full round-trip time.

React 19 introduces Actions and useOptimistic. The interaction updates the UI immediately; the server call runs in the background:

"use client";
import { useOptimistic, useActionState } from "react";
import { addToCart } from "./actions";

export default function BuyButton({ product }) {
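  // Assumes addToCart resolves to { ok, qty }, where qty is the confirmed cart count.
  const [state, action, pending] = useActionState(addToCart, { ok: false, qty: 0 });
  // Base the optimistic value on confirmed state so it doesn't snap back to 0 after the action.
  const [optimisticQty, addOptimistic] = useOptimistic(state.qty, (q, delta) => q + delta);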

  return (
    <form action={(formData) => {
      addOptimistic(1);
      action(formData);
    }}>
      <input type="hidden" name="productId" value={product.id} />
      <button disabled={pending}>
        {pending ? "Adding…" : "Add to cart"}
      </button>
      <p>In cart: {optimisticQty}</p>
    </form>
  );
}

The click registers immediately. INP measures the time between the interaction and the next paint; with useOptimistic, that paint no longer waits on the server, so it lands in milliseconds rather than after the full round trip.

React 19 also introduces use(), a new primitive that reads Promises or Context directly inside the render phase. Unlike hooks, it can be called conditionally, so you can suspend a component mid-render while a Promise resolves rather than managing that state manually:

import { use, Suspense } from "react";

function UserName({ userPromise }) {
  const user = use(userPromise);
  return <span>{user.name}</span>;
}

export default function Profile({ userPromise }) {
  return (
    <Suspense fallback={<span>Loading…</span>}>
      <UserName userPromise={userPromise} />
    </Suspense>
  );
}

The Suspense boundary catches the suspension; the parent doesn't need to know a Promise is involved at all.

What the RSC boundary actually costs

In my work with React Server Components on high-traffic sites, performance gets lost in the payload crossing the network boundary, not in client rendering.

The mistake: passing full database objects as props to Client Components. Every field on that object gets serialized into the RSC payload even if the component uses three of them. On a product page with a 40-field database record, that's a lot of JSON the browser decodes and discards.

Pass exactly the fields the client needs, nothing more. Push "use client" as far down the tree as possible so the boundary is small. Stream static layout immediately and wrap slow data fetches in <Suspense>:

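import { Suspense } from "react";

// Async Server Component: the await lives inside the Suspense boundary,
// so only this subtree waits on the database.
async function ProductDetails({ id }) {
  const product = await db.products.findById(id);

  // Pass only the fields the Client Component renders.
  return (
    <ProductClient
      name={product.name}
      price={product.price}
      inStock={product.inStock}
    />
  );
}

// No await here, so the static layout can stream immediately.
function ProductPage({ id }) {
  return (
    <div>
      <StaticLayout />
      <Suspense fallback={<ProductSkeleton />}>
        <ProductDetails id={id} />
      </Suspense>
    </div>
  );
}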
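
To make the cost of oversized props concrete, here is a minimal sketch in plain JavaScript that compares the serialized size of a full record against the three fields the client renders. The pick helper and the 40-field record are invented for illustration, and real RSC payloads use React's own wire format, so treat the JSON length as a rough proxy rather than a measurement:

```javascript
// Hypothetical helper: copy only the named fields from a record.
function pick(record, fields) {
  const out = {};
  for (const field of fields) {
    if (field in record) out[field] = record[field];
  }
  return out;
}

// Hypothetical 40-field database record (contents invented for the example).
const record = { name: "Desk lamp", price: 39.9, inStock: true };
for (let i = 0; i < 37; i++) {
  record[`internalField${i}`] = `value-${i}`;
}

// Only these three fields cross the boundary.
const clientProps = pick(record, ["name", "price", "inStock"]);

// JSON length is only a proxy for the RSC payload size.
const fullSize = JSON.stringify(record).length;
const trimmedSize = JSON.stringify(clientProps).length;
console.log({ fullSize, trimmedSize });
```

Picking fields on the server keeps the decision next to the query, which is also where you know what the record contains.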

I came up from backend work before moving heavily into frontend and edge delivery, so I've seen both sides of this equation. The bottleneck is almost never the framework itself; it's the payload crossing the network boundary and what you decided to put in it.

The tooling in React 19 is genuinely better. The structural rule is the same as it's always been: don't ship what the user doesn't immediately need.

AWS Summit Hamburg 2026: The Year Agentic AI Went from Hype to Production

Lars — Mon, 25 May 2026 16:34:38 +0000

Hamburg 2026 was my first AWS Summit in person; I'd only followed it through recordings before. The shift in the AI sessions was immediate. Last year the talks were mostly roadmaps: architecture diagrams for systems still being built, phrases like "we're exploring" and "we're excited about the potential." This year, major companies across automotive manufacturing, energy, e-commerce, and food delivery opened with results.

The infrastructure for agentic AI is real now. The questions still open are about safety, not capability.

Strands Agents: letting the model decide

I started the day with the Strands Agents talk. We're already using Strands at work, so I expected mostly familiar ground. What was new was how far they've pushed the model into making orchestration decisions that developers used to make manually.

The shift sounds small but changes everything: you don't define which agent to call when. The orchestrator figures it out. The model reads the task, decides what capability it needs, and routes itself. That means less brittle wiring in your agent graph and more flexibility when edge cases appear at runtime.

Three capabilities stood out:

Self-extending: Agents write their own Python tools on the fly. Because the framework can reload tools at runtime, an agent can recognize a missing capability, write the code for it, save it, and execute that new tool immediately, without a restart.

Self-directing: Agents update their own system prompts based on interactions and store those updates in persistent memory. When deployed with Amazon Bedrock AgentCore, they use Amazon Bedrock Knowledge Bases to retrieve past sessions, so context accumulates across conversations.

Meta-agents: A primary orchestrator spawns sub-agents based on the task. Three patterns: Swarm (sub-agents work on parts of the problem in parallel, sharing a context space), Graph (directed hand-offs between specialized agents in sequence), and Think (the agent recursively spawns instances of itself to deepen reasoning before answering).

Similar talk from AWS re:Invent 2025: Strands Agents deep-dive.

When an enterprise HR platform runs itself

A global energy company's HR team had a legacy data distribution system. Expensive, rigid, built for a world where data formats and sources didn't change often. They replaced it with a serverless platform on AWS, powered by Amazon Bedrock Agents and Amazon Bedrock AgentCore.

What made it interesting wasn't the migration. It was what they built on top: the platform is self-documenting, self-modifying, and self-healing. When a new data source appears, the system builds the ingestion Lambda and Step Functions itself. When a transformation fails, it diagnoses the failure, writes a fix, and deploys it. Engineers can ask the platform questions in natural language and get answers about its own internals.

The result: $800K in annual license savings, and a data delivery platform that largely operates without engineers in the loop for routine changes.

That's Level 3 maturity, which I'll explain below. Most teams aren't there yet. But seeing it running in production at a company of this size makes it harder to argue it's years away.

Supply chain planning with agents

A global automotive manufacturer has been running ML-based forecasting and planning on AWS for years. The agentic layer they've added isn't a replacement for that infrastructure; it's a reasoning layer on top of it, one that can work through problems humans currently have to resolve manually.

The concrete example that stuck with me: "where is my tire and when will it arrive?" Sounds simple. Answering it requires querying across production schedules, capacity constraints, distribution paths, and carrier networks simultaneously, then reconciling those answers into something actionable. Before agentic AI, that took hours. Now it takes five minutes.

The architecture uses specialized agents that investigate production blocks and capacity constraints in parallel, then synthesize their findings into business-friendly explanations. That last part matters: the team specifically called out the move from black-box optimization to transparent decision-making. Supply chain planners needed to understand why the system made a recommendation, not just what it recommended. Agents that can explain their reasoning are the only version that gets adopted.

Privacy-first, then ship

One talk stood out for its honesty about what the path to production actually looks like. The team behind a European e-commerce platform didn't start with a multi-agent architecture. They started with a single Bedrock API call for intent classification and predefined response templates, built data privacy compliance in from day one, then iterated from there: prompt engineering, Bedrock Agents, multi-agent orchestration, and finally Amazon Bedrock AgentCore. Each phase had a ceiling where the simpler approach stopped being good enough.

The result of that iteration: over 50% of first-level support tickets are now handled by AI, and resolution time is down 60%.

Privacy was the first design constraint throughout, not an afterthought. Before any ticket content reaches an LLM, it goes through an anonymization layer that strips PII using pattern matching and rules. Only anonymized content goes forward. The gate pipeline:

flowchart LR
    A[Ticket arrives] --> B[Anonymize PII]
    B --> C[Route to team]
    C --> D[Confidence check]
    D --> E[Add context]
    E --> F[Lifecycle check]
    F --> G[AI response]
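
The anonymization step at the front of that pipeline can be sketched with plain pattern matching. This is my own minimal illustration, not the team's rule set: the two patterns and the placeholder labels are assumptions, and a production layer would need far broader, locale-specific coverage.

```javascript
// Illustrative patterns only; real rules need names, addresses, IDs, and more.
const PII_PATTERNS = [
  { label: "EMAIL", regex: /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g },
  { label: "PHONE", regex: /\+?\d[\d\s().-]{7,}\d/g },
];

// Replace every match with a placeholder before the text goes anywhere near an LLM.
function anonymize(text) {
  let result = text;
  for (const { label, regex } of PII_PATTERNS) {
    result = result.replace(regex, `[${label}]`);
  }
  return result;
}

console.log(anonymize("Contact me at jane.doe@example.com or +49 40 1234567 about order 4711."));
```

Short numbers such as the order ID pass through untouched because the phone pattern requires a longer digit run.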

One detail I hadn't considered before: they tried sending the AI response five seconds after ticket submission. Customer satisfaction dropped. People felt the speed was inhuman and didn't trust the answer. They now add an artificial delay. The content didn't change, the timing did, and satisfaction went back up.

The system splits B2C and B2B into separate agent chains, because B2B tickets carry more company-specific context and require more specialized handling. Observability is built in from the start: every response is traced, retrieval success rates and model usage are tracked, and quality scores feed back into the system. That observability also changed how the team debugs. The question shifted from "what did the model answer?" to "why did the workflow behave this way?", which means tracing which tools were called, what was retrieved and how relevant it was, where latency came from, and what CSAT scores and ticket reopen rates are telling you about specific workflow paths.

flowchart LR
    A[Normalized Support Request] --> B[Router Agent]

    B --> C[Consumer Support Agent]
    B --> D[B2B Support Agent]
    B --> E[Human Handoff Agent]

    C --> F[Ticket Update Composer]
    D --> F

    F --> G[Ticket API]

Multi-agent does not mean everyone talks to everyone.

Bounded specialization reduces accidental cross-talk and makes handoff behavior easier to reason about. A few architectural decisions that make this work in practice:

The Router Agent decides which specialized agent handles the request; nothing else does routing.
Specialized agents are isolated by responsibility and don't call each other directly.
Only the Ticket Update Composer writes back to external systems such as the ticket system.
Human escalation is an explicit path, not a fallback you discover at 2am.
Tool permissions and handoff contracts are defined up front, not negotiated at runtime.
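
Those rules can be sketched in a few lines. Everything here is hypothetical: the agent functions, the 0.7 confidence threshold, and the array standing in for the ticket API are my assumptions, not the team's implementation.

```javascript
// Hypothetical specialized agents: each returns a draft and never calls another agent.
const agents = {
  consumer: (request) => ({ agent: "consumer", draft: `Consumer reply to: ${request.text}` }),
  b2b: (request) => ({ agent: "b2b", draft: `B2B reply to: ${request.text}` }),
};

// Only the router decides who handles a request. The rules are invented.
function route(request) {
  if (request.confidence < 0.7) return "human";
  return request.customerType === "business" ? "b2b" : "consumer";
}

// Only the composer writes to the external ticket system (a stub array here).
const ticketApiCalls = [];
function composeTicketUpdate(ticketId, result) {
  ticketApiCalls.push({ ticketId, body: result.draft, author: result.agent });
}

function handle(request) {
  const target = route(request);
  if (target === "human") {
    // Explicit escalation path: no AI-written ticket update.
    return { ticketId: request.ticketId, handledBy: "human" };
  }
  const result = agents[target](request);
  composeTicketUpdate(request.ticketId, result);
  return { ticketId: request.ticketId, handledBy: target };
}
```

Because agents only return drafts, adding a new specialist means adding one entry and one routing rule; nothing else gains write access.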

The architecture is only part of what makes this production-grade. Agents introduced operational concerns that became part of the product itself:

IaC and reproducibility: agents, collaborators, roles, Lambda functions, API Gateway, and knowledge base configuration all need reproducible deployment. Drift between environments is a real failure mode.
Aliases and versioning: promote tested versions explicitly; don't rely on draft agent behavior in production.
Latency budgets: multi-tool workflows can exceed webhook timeouts. Latency is a design constraint, not a monitoring afterthought.
Structured traces: log intent, retrieval, tool inputs, API errors, and response payloads. Debugging an agentic workflow without traces is guesswork.
Human QA sampling: review low CSAT scores, ticket reopen rates, and escalation reasons on a regular cadence.
Cost guardrails: cap agent steps, retries, token budgets, and retrieval depth. Unbounded agents are a billing incident waiting to happen.
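
As a minimal sketch of the cost guardrail idea, the loop below caps steps and tokens and returns an explicit status when a limit is hit. The limits and the runStep callback are hypothetical, and the loop is synchronous to keep the example short; a real agent loop would be async.

```javascript
// Hypothetical limits; tune per workflow.
const LIMITS = { maxSteps: 5, maxTokens: 4000 };

// runStep stands in for one model or tool call and returns
// { done: boolean, tokensUsed: number, output?: string }.
function runWithGuardrails(runStep, limits = LIMITS) {
  let tokens = 0;
  for (let step = 1; step <= limits.maxSteps; step++) {
    const result = runStep(step);
    tokens += result.tokensUsed;
    if (tokens > limits.maxTokens) {
      // Stop as soon as the budget is exceeded, even mid-task.
      return { status: "token_budget_exceeded", steps: step, tokens };
    }
    if (result.done) {
      return { status: "completed", steps: step, tokens, output: result.output };
    }
  }
  return { status: "step_limit_reached", steps: limits.maxSteps, tokens };
}
```

Returning a status instead of throwing lets the caller route limit hits to the human escalation path and count them in traces.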

The customer sees the answer. Engineering owns the routing, policy, traces, and feedback loop.

Choosing the right level of agent complexity is its own decision. Not everything needs a full multi-agent setup:

Use case	Direct Bedrock call	Bedrock Agent	AgentCore
One decision, known outputs	Best fit	Overkill	Overkill
Needs RAG + tools	Possible but manual	Best fit	Good fit
Multiple topics / generated answer	Limited	Best fit	Good fit
Cross-channel middleware	Possible	Best fit	Good fit
Longer-term memory / operations	Limited	Partial	Best fit

Don't optimize for "most agentic." Optimize for the minimum autonomy that solves the customer problem with acceptable operational risk.

Their lesson for anyone starting this: get your data security and privacy team involved before you write the first line of agent code. The design decisions they'll ask for are the ones you can't retrofit later.

AgentCore: building multi-tenant AI as a service

I also caught the expert-level session on building multi-tenant SaaS agents with Amazon Bedrock AgentCore. If you're building a platform where multiple customers each get their own AI agents, the isolation problem is harder than it looks.

Tenant isolation in agentic systems runs across five dimensions: identity (each tenant's agents act with scoped credentials), memory (one tenant's conversation history can't leak into another's context), gateway (routing and rate-limiting per tenant), observability (tenant-scoped traces so you can debug without seeing another customer's data), and runtime (compute isolation so a runaway agent in one tenant doesn't affect others). The session walked through working examples for each.
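
To illustrate the memory dimension only, here is a minimal sketch of a store that cannot be read without a tenant ID. The class is hypothetical and in-memory; it is not the AgentCore memory API, and a real system would derive tenantId from verified credentials rather than accept it as an argument from the caller.

```javascript
// Hypothetical tenant-scoped conversation memory.
class TenantMemory {
  constructor() {
    // tenantId -> (sessionId -> messages). Nested maps avoid key collisions
    // that string concatenation like "tenant:session" could produce.
    this.byTenant = new Map();
  }

  append(tenantId, sessionId, message) {
    if (!this.byTenant.has(tenantId)) this.byTenant.set(tenantId, new Map());
    const sessions = this.byTenant.get(tenantId);
    const history = sessions.get(sessionId) ?? [];
    history.push(message);
    sessions.set(sessionId, history);
  }

  history(tenantId, sessionId) {
    const sessions = this.byTenant.get(tenantId);
    // Return a copy so callers cannot mutate stored history.
    return [...(sessions?.get(sessionId) ?? [])];
  }
}
```

The same session ID under two tenants resolves to two separate histories, which is the property the other four dimensions also have to preserve.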

The framing they used, "intelligence as a service," is worth keeping. If you're building AI capabilities that other teams or customers consume, the SaaS constructs of onboarding, isolation, and identity propagation apply just as much as they do to any other service you'd build. The AgentCore primitives give you the building blocks; you still have to wire them together intentionally.

Similar session recording: Building Multi-Tenant SaaS Agents with Amazon Bedrock AgentCore.

One published example that shows AgentCore working on a real regulated domain: an open-source medical content review application that cross-checks pharmaceutical marketing claims against clinical references, PubMed, OpenFDA, and ClinicalTrials.gov. A few architectural decisions in it are worth studying regardless of your domain. First, the reviewer sub-agents persist their findings to S3 and return only an S3 URI to the orchestrator; hundreds of findings from a 30-page document never flow through the orchestrator's context window, which would cause it to summarize and drop findings. Second, user identity is extracted server-side from the JWT sub claim, never from the request payload, which closes the impersonation-via-prompt-injection vector directly. Both patterns are reusable. Full writeup and open-source repo: Accelerate Medical Content Review with Amazon Bedrock AgentCore.
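
The first pattern, handing back a pointer instead of the findings, can be sketched like this. A Map stands in for S3, and the bucket name, key layout, and function names are invented for illustration; the open-source repo will differ.

```javascript
// Stand-in for S3: a Map keyed by URI.
const objectStore = new Map();

function reviewerSubAgent(documentId, findings) {
  const uri = `s3://example-review-bucket/findings/${documentId}.json`;
  objectStore.set(uri, JSON.stringify(findings));
  // Return a pointer and a count, never the findings themselves.
  return { uri, count: findings.length };
}

function orchestrator(documentId, findings) {
  const ref = reviewerSubAgent(documentId, findings);
  // The orchestrator's context only ever holds this small reference.
  const contextEntry = JSON.stringify(ref);
  return { ref, contextChars: contextEntry.length };
}

function finalEditingPass(ref) {
  // Load the full set once, at the point it is actually needed.
  return JSON.parse(objectStore.get(ref.uri));
}
```

The orchestrator's context grows by a fixed few dozen characters per reviewer regardless of how many findings the document produces.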

The maturity ladder

Across several talks, a rough framework emerged for how companies are thinking about agentic AI maturity.

Level 1 is rules-based AI. The system follows defined policies, humans define every decision path, and the AI fills in specific gaps.

Level 2 is autonomous task AI. The system handles entire workflows: self-documentation, quality monitoring, task routing. Humans oversee outcomes, not individual steps.

Level 3 is self-monitoring systems. The system diagnoses its own failures and builds capabilities it didn't have before. Human oversight is exception-based, not routine.

Most of the companies presenting were somewhere in the Level 1 to Level 2 transition. The energy company's HR platform was the one example of Level 3 running in production. That gap is worth knowing about before you start planning, because Level 2 requires different architecture decisions than Level 1, and Level 3 requires different trust decisions than Level 2.

What's still open

The results are real: $800K saved, hours to five minutes, resolution time down 60%. These aren't demos.

But three problems came up in nearly every talk, and nobody presented a clean solution.

Hallucination in production has operational consequences now. When your agent is writing transformation plugins and deploying them, a confidently wrong answer triggers real failures. Teams are managing this with human-in-the-loop gates at specific checkpoints.

Prompt injection and prompt protection got less stage time than they deserved. As agents act on data from external systems, the attack surface grows. One concrete mitigation that came up in the AgentCore examples: extract user identity from the JWT server-side, never from the request payload, so attackers can't impersonate users through crafted input. That closes one specific vector; the broader problem of agents acting on adversarial content from external sources is still largely unsolved in practice.
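
A minimal sketch of that mitigation, assuming the token's signature, issuer, and expiry were already verified upstream by a gateway or an auth library (not shown). The function names are hypothetical.

```javascript
// verifiedClaims must come from an already-verified JWT. This sketch does
// not verify anything itself.
function resolveUserId(verifiedClaims) {
  if (!verifiedClaims || typeof verifiedClaims.sub !== "string" || verifiedClaims.sub === "") {
    throw new Error("Missing verified sub claim");
  }
  return verifiedClaims.sub;
}

function buildAgentInput(verifiedClaims, requestBody) {
  return {
    // Identity comes from the token only; any userId in the payload is ignored.
    userId: resolveUserId(verifiedClaims),
    prompt: String(requestBody.prompt ?? ""),
  };
}
```

A payload claiming to be an admin still reaches the agent as text, but the identity the tools act under never changes.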

Data privacy at scale is hard even with PII anonymization. Cross-border data flows, multi-system agents, and context accumulating in memory all create compliance complexity that rules-based anonymization doesn't fully address.

These are reasons to build carefully, with privacy by default and observability from day one. They're not reasons to wait.

Context engineering is the next skill

Prompt engineering was the first visible handle: write better prompts, get better outputs. The feedback loop was tight. The harder problem in agentic systems is what surrounds the prompt: what context an agent has access to, when it gets loaded, how much fits in the window, and what happens when it doesn't.

The AgentCore medical content review example makes this concrete. Reviewer sub-agents persist findings to S3 and hand the orchestrator a URI instead of the content; the full set is loaded once, at the final editing pass. That's a context engineering decision: controlling which information exists in which agent's window at which moment, specifically to prevent the model from silently dropping findings it can't fit.

The same pattern shows up in the Strands memory model (Bedrock Knowledge Bases retrieve relevant past sessions, not all of them) and in the e-commerce platform's lifecycle gates (explicit stages controlling what context is available at each step). Every multi-agent architecture I saw at the summit shared one thread: explicit decisions about what each agent knows and when.

Context windows are a real production constraint, not a theoretical one. As agents chain together and produce intermediate output, what you carry forward and what you discard is an architectural choice. Getting it wrong makes the system quietly incorrect in ways that are hard to debug, because the model won't tell you it dropped something.
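
One way to keep that choice visible is to make the drop explicit in code instead of leaving it to the model. The sketch below is my own illustration: the priority field is hypothetical, and the four-characters-per-token estimate is a rough heuristic, not a tokenizer.

```javascript
// Rough token estimate: about four characters per token.
function estimateTokens(text) {
  return Math.ceil(text.length / 4);
}

// Keep the highest-priority items that fit and report what was left out.
function buildContext(items, budgetTokens) {
  const sorted = [...items].sort((a, b) => b.priority - a.priority);
  const kept = [];
  const dropped = [];
  let used = 0;
  for (const item of sorted) {
    const cost = estimateTokens(item.text);
    if (used + cost <= budgetTokens) {
      kept.push(item.id);
      used += cost;
    } else {
      // Record the drop so traces show what the agent never saw.
      dropped.push(item.id);
    }
  }
  return { kept, dropped, used };
}
```

Logging the dropped list alongside each agent call turns a silent omission into something you can find in a trace.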

Treat context design with the same seriousness as a data model.

What I took away

The energy was different from the recordings I'd watched in previous years. Companies were comparing notes, sharing what broke and what they'd do differently.

The part worth repeating to anyone building with AI right now: share the positive stories. The results from these companies are real and worth talking about with your team. And while you're doing that, keep hallucination and data protection as first-class design constraints, not late-stage reviews. Include the people who care about those things early; they make the system better, not slower.

These are my personal impressions from the conference. The views here are my own and don't represent my employer or any of the companies mentioned. If I got something wrong or misunderstood a detail, ping me on LinkedIn and I'll correct it.