<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Lars</title>
    <description>The latest articles on DEV Community by Lars (@larsroettig_ai).</description>
    <link>https://dev.to/larsroettig_ai</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2158169%2F41fc8577-f012-4f27-9ed6-51cfd4cb9185.jpg</url>
      <title>DEV Community: Lars</title>
      <link>https://dev.to/larsroettig_ai</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/larsroettig_ai"/>
    <language>en</language>
    <item>
      <title>turbovec: Local RAG Without the 60 GB Tax</title>
      <dc:creator>Lars</dc:creator>
      <pubDate>Mon, 25 May 2026 16:43:12 +0000</pubDate>
      <link>https://dev.to/larsroettig_ai/turbovec-local-rag-without-the-60-gb-tax-1odh</link>
      <guid>https://dev.to/larsroettig_ai/turbovec-local-rag-without-the-60-gb-tax-1odh</guid>
      <description>&lt;p&gt;A 1536-dimensional float32 embedding is 6 KB. A corpus of 10 million documents is roughly 60 GB of raw vectors before any index overhead. That doesn't fit in laptop RAM, and even on a machine with 64 GB you've left yourself no headroom for anything else.&lt;/p&gt;

&lt;p&gt;I kept reaching for FAISS. It works, but I kept hitting two friction points: training requires a representative sample of your corpus upfront, and compression quality depends on how well that sample matches the real distribution. If your data distribution shifts, you're rebuilding.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/RyanCodrai/turbovec" rel="noopener noreferrer"&gt;turbovec&lt;/a&gt; solves both, and the &lt;a href="https://arxiv.org/abs/2504.19874" rel="noopener noreferrer"&gt;TurboQuant paper&lt;/a&gt; (arXiv April 2025, Google Research + NYU) explains the math behind why it can skip the training step entirely.&lt;/p&gt;

&lt;h2&gt;
  
  
  What TurboQuant actually does
&lt;/h2&gt;

&lt;p&gt;The core idea is a mathematical trick: apply a random rotation to your vectors before compressing them.&lt;/p&gt;

&lt;p&gt;After rotation, each coordinate follows a scaled Beta distribution that converges to Gaussian N(0, 1/d) in high dimensions. The coordinates also become nearly independent. That combination is what makes training-free quantization possible: you can precompute optimal bucket boundaries from pure math, with no data required upfront.&lt;/p&gt;

&lt;p&gt;The algorithm in four steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Normalize each vector to unit length; store the norm as a float&lt;/li&gt;
&lt;li&gt;Apply a fixed random rotation matrix (same matrix for the whole index, computed once at setup)&lt;/li&gt;
&lt;li&gt;Quantize each rotated coordinate against precomputed bucket boundaries; at 4-bit that's 16 buckets per coordinate&lt;/li&gt;
&lt;li&gt;Pack the integers: a 1536-dim vector goes from 6,144 bytes (float32) to 384 bytes&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;A 10M-doc corpus: ~60 GB float32 becomes ~7.5 GB at 4-bit, an 8x reduction. The paper proves the MSE distortion lands within a factor of √3·π/2 ≈ 2.7 of the information-theoretic lower bound at any bit-width, which is tight for a training-free method. At 4-bit specifically, MSE is approximately 0.009.&lt;/p&gt;

&lt;p&gt;Search doesn't decompress vectors. It rotates the query once into the same domain and scores against codebook centroids using SIMD kernels (NEON on ARM, AVX-512 on x86). Per turbovec's own benchmarks, on ARM it beats FAISS IndexPQFastScan by 12-20%.&lt;/p&gt;

&lt;h2&gt;
  
  
  The part I initially glossed over: MSE and inner product are different problems
&lt;/h2&gt;

&lt;p&gt;For RAG, what matters is preserving similarity scores, and MSE-optimized quantizers don't do that.&lt;/p&gt;

&lt;p&gt;When you search a vector index, you're finding stored vectors with the highest dot product against your query. The TurboQuant paper proves that quantizers optimized purely for reconstruction accuracy introduce bias into inner product estimates. The compressed vectors rebuild accurately, but their similarity scores with a query vector are systematically off. You get wrong nearest neighbors.&lt;/p&gt;

&lt;p&gt;TurboQuant fixes this with a two-stage approach. Stage one applies MSE quantization at one fewer bit than your target budget (so 3 bits if you want 4-bit total), which minimizes reconstruction error and shrinks the residual as much as possible. Stage two takes that residual and applies a 1-bit random projection transform called QJL (Quantized Johnson-Lindenstrauss). QJL is an optimal 1-bit inner product quantizer: it reduces the residual to a single bit per dimension using &lt;code&gt;sign(random_matrix · vector)&lt;/code&gt;, and the paper proves this makes the combined estimator unbiased.&lt;/p&gt;

&lt;p&gt;The whole thing is data-oblivious. It works on the first vector you add to the index. The result is near-optimal reconstruction accuracy and unbiased similarity scores at your target bit-width.&lt;/p&gt;

&lt;p&gt;For KV cache compression in long-context LLMs (storing attention keys and values), the paper tests Llama-3.1-8B on LongBench-E: 3.5 bits per channel matches unquantized quality, 2.5 bits shows only marginal degradation, while compressing the cache by more than 5x. The inner product unbiasedness property is what makes it work for attention computation.&lt;/p&gt;

&lt;h2&gt;
  
  
  The practical part: one import swap
&lt;/h2&gt;

&lt;p&gt;turbovec ships drop-in replacements for the in-memory vector stores in LangChain, LlamaIndex, Haystack, and Agno. For LangChain:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;turbovec[langchain]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Before
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_core.vectorstores&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;InMemoryVectorStore&lt;/span&gt;

&lt;span class="c1"&gt;# After — same API, smaller footprint, faster search
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;turbovec.integrations.langchain&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;TurboVecVectorStore&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;InMemoryVectorStore&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Everything else in the pipeline stays the same. I swapped this into an existing LangChain project in a few minutes. Memory dropped by roughly 8x and retrieval got a bit faster.&lt;/p&gt;

&lt;p&gt;For &lt;code&gt;IdMapIndex&lt;/code&gt; (when you need stable IDs that survive deletes):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;turbovec&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;IdMapIndex&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;

&lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;IdMapIndex&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dim&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1536&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bit_width&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_with_ids&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vectors&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;array&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="mi"&gt;1001&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1002&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1003&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;uint64&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ids&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;remove&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1002&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# O(1) by id
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  What the pgvector benchmarks actually show
&lt;/h2&gt;

&lt;p&gt;I have been exploring turboquant for use with pgvector. To evaluate its performance, I ran the RAG benchmarks created by Johann-Peter Hartmann.&lt;/p&gt;

&lt;p&gt;The storage and index scan wins are real. At 4-bit, your vector column shrinks by around 8x, and index scans run faster because you're moving far less data through memory. On a large corpus, that gap is meaningful.&lt;/p&gt;

&lt;p&gt;The retrieval quality story is less clean. Quantizing inside pgvector degrades recall measurably compared to full float32 search. You can lose real top candidates from your top-k window. The TurboQuant unbiasedness proof is mathematically correct, but unbiased inner product estimates still carry variance at 4 bits, and in dense retrieval that variance pushes results around. The second-best document in float32 might not appear in your top-10 at 4-bit.&lt;/p&gt;

&lt;p&gt;Two cases where the trade-off still makes sense: storage-constrained deployments where approximate retrieval is acceptable, or pipelines that rerank with a cross-encoder anyway (the reranker recovers from retrieval noise). If you're running semantic search where missing the true top result matters, measure your recall on a held-out set before committing.&lt;/p&gt;

&lt;p&gt;If you want to run this comparison yourself against your own corpus, here's the benchmark setup I used:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;turbovec&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;IdMapIndex&lt;/span&gt;

&lt;span class="n"&gt;dim&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1536&lt;/span&gt;
&lt;span class="n"&gt;num_vectors&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1_000_000&lt;/span&gt;

&lt;span class="n"&gt;embeddings&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;num_vectors&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dim&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;float32&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;embeddings&lt;/span&gt; &lt;span class="o"&gt;/=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linalg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;norm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embeddings&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;keepdims&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ids&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;arange&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num_vectors&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;uint64&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;queries&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dim&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;float32&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;queries&lt;/span&gt; &lt;span class="o"&gt;/=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linalg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;norm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;queries&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;keepdims&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;t0&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;perf_counter&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;IdMapIndex&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dim&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;dim&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bit_width&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_with_ids&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embeddings&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ids&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;build: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;perf_counter&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;t0&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;latencies&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;queries&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;t0&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;perf_counter&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;latencies&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;perf_counter&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;t0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;avg_ms&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;latencies&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;
&lt;span class="n"&gt;p99_ms&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;percentile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;latencies&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;99&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;avg: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;avg_ms&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;ms  p99: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;p99_ms&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;ms&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No training pass, no codebook warmup. The index is ready to search after the first &lt;code&gt;add_with_ids&lt;/code&gt; call. Swap in your real embeddings and IDs, then run the same timing loop against FAISS &lt;code&gt;IndexPQFastScan&lt;/code&gt; at the same bit-width to get a direct comparison.&lt;/p&gt;

&lt;h2&gt;
  
  
  When FAISS is still the right tool
&lt;/h2&gt;

&lt;p&gt;turbovec is an in-memory flat index: it searches all vectors on every query. For a few million vectors on a single machine, that's fine. At hundreds of millions you need IVF partitioning to reduce the search scope, and FAISS handles that.&lt;/p&gt;

&lt;p&gt;The ARM picture is clean: turbovec beats FAISS IndexPQFastScan by 12-20% across typical configurations. x86 is more conditional. At 4-bit, turbovec wins by 1-6% due to tighter cache lines and faster bit-unpacking. At 2-bit single-threaded, they run within 1% of each other. At 2-bit multi-threaded on AVX-512 hardware, FAISS pulls ahead by 2-4%; it exploits AVX-512 VBMI for bit manipulation during concurrent sweeps, an instruction path turbovec doesn't yet use. On enterprise x86 with high thread counts at 2-bit, that edge is real.&lt;/p&gt;

&lt;p&gt;At high dimensions (d=1536, d=3072), turbovec matches or beats FAISS at R@1; both converge to 1.0 recall by k=4-8. At d=200 (GloVe territory), turbovec trails at R@1 because the near-Gaussian approximation from the random rotation weakens at low dimensions.&lt;/p&gt;

&lt;p&gt;The rule: turbovec for local RAG with modern embedding dimensions, FAISS for very large corpora, GPU-accelerated search, or multi-threaded 2-bit lookups on AVX-512 servers.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'm using it for
&lt;/h2&gt;

&lt;p&gt;I'm running turbovec in &lt;a href="https://github.com/larsroettig/thoughtforge" rel="noopener noreferrer"&gt;ThoughtForge&lt;/a&gt; for per-space semantic search. The nomic-embed-text-v1.5 model produces 768-dimensional embeddings; at 4-bit compression the full index is small enough that loading at app startup takes under a second. Local embeddings, local index, no data leaves the machine.&lt;/p&gt;

&lt;p&gt;If you're building local RAG and hitting the float32 memory wall, this is the first thing I'd try.&lt;/p&gt;

</description>
      <category>ai</category>
    </item>
    <item>
      <title>Modern React Performance Without the Overhead</title>
      <dc:creator>Lars</dc:creator>
      <pubDate>Mon, 25 May 2026 16:40:20 +0000</pubDate>
      <link>https://dev.to/larsroettig_ai/modern-react-performance-without-the-overhead-4oak</link>
      <guid>https://dev.to/larsroettig_ai/modern-react-performance-without-the-overhead-4oak</guid>
      <description>&lt;p&gt;A product manager says "we'll use React, performance won't be a problem." Two sprints later, the Lighthouse INP score is 800ms and the main thread is blocked on a 400 KB vendor bundle nobody audited.&lt;/p&gt;

&lt;p&gt;React itself isn't the problem. The flexibility is.&lt;/p&gt;

&lt;h2&gt;
  
  
  INP replaced FID, and most teams haven't adjusted
&lt;/h2&gt;

&lt;p&gt;Google's Core Web Vitals now measure three things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;LCP (Largest Contentful Paint):&lt;/strong&gt; loading speed for elements &lt;em&gt;inside the viewport&lt;/em&gt;. Anything below the fold doesn't count.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CLS (Cumulative Layout Shift):&lt;/strong&gt; visual stability. React 19's native document metadata handling solves a category of CLS issues caused by stylesheets loading after first paint.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;INP (Interaction to Next Paint):&lt;/strong&gt; the one most teams are failing now. It replaced First Input Delay in March 2024 and measures every interaction across the page lifetime, not just the first. A single slow click handler tanks your score.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;FID was easy to pass because it only measured the first interaction. INP measures all of them, which means large synchronous bundles and heavy render trees anywhere on the page will show up.&lt;/p&gt;

&lt;h2&gt;
  
  
  Three browser profiles, not one
&lt;/h2&gt;

&lt;p&gt;Your daily browser lies to you. Extensions inject scripts, manipulate the DOM, and run background tasks. Lighthouse scores measured under those conditions aren't representative of what users see.&lt;/p&gt;

&lt;p&gt;Set up three profiles in Chrome or Firefox:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Dev profile:&lt;/strong&gt; all your extensions (React DevTools, Apollo, Redux, etc.). Use this for development.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Profiling profile:&lt;/strong&gt; zero extensions, CPU throttling enabled. Use this for Lighthouse and manual profiling.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Normal profile:&lt;/strong&gt; whatever you use day-to-day.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Always profile a production build. Development mode adds diagnostic overhead that inflates render times and bundle sizes in ways that have nothing to do with what ships. Run &lt;code&gt;npm run build&lt;/code&gt;, serve it locally, then open Lighthouse in the clean profile. The Lighthouse Tree Map shows you which dependencies are inflating which chunks; look there before opening the bundle analyzer.&lt;/p&gt;

&lt;h2&gt;
  
  
  The React Compiler doesn't fix bad state placement
&lt;/h2&gt;

&lt;p&gt;Before React 19, avoiding cascading re-renders meant wrapping things in &lt;code&gt;useMemo&lt;/code&gt; and &lt;code&gt;useCallback&lt;/code&gt;. It worked when done carefully, failed silently when done wrong, and cluttered codebases either way.&lt;/p&gt;

&lt;p&gt;The React Compiler statically analyzes your component tree and applies automatic memoization. You write straightforward JavaScript; the compiler handles the caching. This eliminates the category of "forgot to memoize this callback" bugs.&lt;/p&gt;

&lt;p&gt;What it doesn't fix: state placed too high in the component tree.&lt;/p&gt;

&lt;p&gt;If you have a text input and the user sees lag under CPU throttling, check where the state lives. The runtime evaluates every component between the state and the consumer on each keystroke. Move state as close to the consuming component as possible. The compiler can't restructure your component hierarchy; that decision belongs to you.&lt;/p&gt;

&lt;h2&gt;
  
  
  Bundle size and layout shift
&lt;/h2&gt;

&lt;p&gt;Heavy components loaded before they're needed block the main thread. &lt;code&gt;React.lazy&lt;/code&gt; and &lt;code&gt;Suspense&lt;/code&gt; fix this by fetching a component only when it's required:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight tsx"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;HeavyChart&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;React&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lazy&lt;/span&gt;&lt;span class="p"&gt;(()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;import&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;./HeavyChart&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;

&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;Dashboard&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;return &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Suspense&lt;/span&gt; &lt;span class="na"&gt;fallback&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;ChartSkeleton&lt;/span&gt; &lt;span class="p"&gt;/&amp;gt;&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
      &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;HeavyChart&lt;/span&gt; &lt;span class="p"&gt;/&amp;gt;&lt;/span&gt;
    &lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nc"&gt;Suspense&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check packages on &lt;a href="https://bundlephobia.com" rel="noopener noreferrer"&gt;Bundlephobia&lt;/a&gt; before importing. A date-picker that pulls in 80 KB gzipped when you need one function is a problem you choose.&lt;/p&gt;

&lt;p&gt;React 19 adds native support for resource hoisting. You can prefetch DNS and preload assets from components without managing &lt;code&gt;&amp;lt;head&amp;gt;&lt;/code&gt; manually, and &lt;code&gt;&amp;lt;title&amp;gt;&lt;/code&gt;, &lt;code&gt;&amp;lt;meta&amp;gt;&lt;/code&gt;, and &lt;code&gt;&amp;lt;link&amp;gt;&lt;/code&gt; tags rendered anywhere in the component tree are automatically hoisted to &lt;code&gt;&amp;lt;head&amp;gt;&lt;/code&gt;. That removes the need for &lt;code&gt;react-helmet&lt;/code&gt; or equivalent libraries for the common case:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight tsx"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;prefetchDNS&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;preload&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;react-dom&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="nf"&gt;prefetchDNS&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;https://fonts.googleapis.com&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="nf"&gt;preload&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;/hero.webp&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;as&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;image&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For CLS: give images explicit &lt;code&gt;width&lt;/code&gt; and &lt;code&gt;height&lt;/code&gt; attributes. The browser reserves the correct space before the asset arrives; without them, the layout shifts when the image loads and you lose CLS points for every user on a slow connection. Use loading skeletons rather than spinners; a skeleton anchors the layout geometry while data fetches.&lt;/p&gt;

&lt;h2&gt;
  
  
  Actions and optimistic UI
&lt;/h2&gt;

&lt;p&gt;Form handling in React pre-19 involved &lt;code&gt;isLoading&lt;/code&gt; state, disabled buttons, and a UI that froze until the server responded. That pattern hurts INP because the interaction latency is the full round-trip time.&lt;/p&gt;

&lt;p&gt;React 19 introduces Actions and &lt;code&gt;useOptimistic&lt;/code&gt;. The interaction updates the UI immediately; the server call runs in the background:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;use client&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;useOptimistic&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;useActionState&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;react&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;addToCart&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;./actions&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="k"&gt;default&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;BuyButton&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;product&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;optimisticQty&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;addOptimistic&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;useOptimistic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;q&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;delta&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;q&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;delta&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;state&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;action&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;pending&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;useActionState&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;addToCart&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;ok&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="k"&gt;return &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;form&lt;/span&gt; &lt;span class="nx"&gt;action&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{(&lt;/span&gt;&lt;span class="nx"&gt;formData&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nf"&gt;addOptimistic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="nf"&gt;action&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;formData&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}}&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
      &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;input&lt;/span&gt; &lt;span class="nx"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;hidden&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;productId&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="nx"&gt;value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;product&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="sr"&gt;/&lt;/span&gt;&lt;span class="err"&gt;&amp;gt;
&lt;/span&gt;      &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;button&lt;/span&gt; &lt;span class="nx"&gt;disabled&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;pending&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;pending&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Adding…&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Add to cart&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
      &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="sr"&gt;/button&lt;/span&gt;&lt;span class="err"&gt;&amp;gt;
&lt;/span&gt;      &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;p&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="nx"&gt;In&lt;/span&gt; &lt;span class="nx"&gt;cart&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;optimisticQty&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="sr"&gt;/p&lt;/span&gt;&lt;span class="err"&gt;&amp;gt;
&lt;/span&gt;    &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="sr"&gt;/form&lt;/span&gt;&lt;span class="err"&gt;&amp;gt;
&lt;/span&gt;  &lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The click registers immediately. INP measures the time between the interaction and the next paint; with &lt;code&gt;useOptimistic&lt;/code&gt;, that's milliseconds rather than the full server latency.&lt;/p&gt;

&lt;p&gt;React 19 also introduces &lt;code&gt;use()&lt;/code&gt;, a new primitive that reads Promises or Context directly inside the render phase. Unlike hooks, it can be called conditionally, so you can suspend a component mid-render while a Promise resolves rather than managing that state manually:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight tsx"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;use&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;Suspense&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;react&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;UserName&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;userPromise&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;user&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;use&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;userPromise&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;span&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;user&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;span&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="k"&gt;default&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;Profile&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;userPromise&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;return &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Suspense&lt;/span&gt; &lt;span class="na"&gt;fallback&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;span&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;Loading…&lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;span&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
      &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;UserName&lt;/span&gt; &lt;span class="na"&gt;userPromise&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;userPromise&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt; &lt;span class="p"&gt;/&amp;gt;&lt;/span&gt;
    &lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nc"&gt;Suspense&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Suspense boundary catches the suspension; the parent doesn't need to know a Promise is involved at all.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the RSC boundary actually costs
&lt;/h2&gt;

&lt;p&gt;Working with React Server Components on high-traffic sites, the payload crossing the network boundary is where performance gets lost, not the client rendering.&lt;/p&gt;

&lt;p&gt;The mistake: passing full database objects as props to Client Components. Every field on that object gets serialized into the RSC payload even if the component uses three of them. On a product page with a 40-field database record, that's a lot of JSON the browser decodes and discards.&lt;/p&gt;

&lt;p&gt;Pass exactly the fields the client needs, nothing more. Push &lt;code&gt;"use client"&lt;/code&gt; as far down the tree as possible so the boundary is small. Stream static layout immediately and wrap slow data fetches in &lt;code&gt;&amp;lt;Suspense&amp;gt;&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight tsx"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;ProductPage&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;id&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;product&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;products&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;findById&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="k"&gt;return &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;div&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
      &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;StaticLayout&lt;/span&gt; &lt;span class="p"&gt;/&amp;gt;&lt;/span&gt;
      &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Suspense&lt;/span&gt; &lt;span class="na"&gt;fallback&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;ProductSkeleton&lt;/span&gt; &lt;span class="p"&gt;/&amp;gt;&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
        &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;ProductClient&lt;/span&gt;
          &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;product&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;
          &lt;span class="na"&gt;price&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;product&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;price&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;
          &lt;span class="na"&gt;inStock&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;product&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;inStock&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;/&amp;gt;&lt;/span&gt;
      &lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nc"&gt;Suspense&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;div&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I came up from backend work before moving heavily into frontend and edge delivery, so I've seen both sides of this equation. The bottleneck is almost never the framework itself; it's the payload crossing the network boundary and what you decided to put in it.&lt;/p&gt;

&lt;p&gt;The tooling in React 19 is genuinely better. The structural rule is the same as it's always been: don't ship what the user doesn't immediately need.&lt;/p&gt;

</description>
      <category>react</category>
      <category>performance</category>
      <category>webdev</category>
      <category>webperf</category>
    </item>
    <item>
      <title>AWS Summit Hamburg 2026: The Year Agentic AI Went from Hype to Production</title>
      <dc:creator>Lars</dc:creator>
      <pubDate>Mon, 25 May 2026 16:34:38 +0000</pubDate>
      <link>https://dev.to/larsroettig_ai/aws-summit-hamburg-2026-the-year-agentic-ai-went-from-hype-to-production-2cj7</link>
      <guid>https://dev.to/larsroettig_ai/aws-summit-hamburg-2026-the-year-agentic-ai-went-from-hype-to-production-2cj7</guid>
      <description>&lt;p&gt;Hamburg 2026 was my first AWS Summit in person; I'd only followed it through recordings before. The shift in the AI sessions was immediate. Last year the talks were mostly roadmaps: architecture diagrams for systems still being built, phrases like "we're exploring" and "we're excited about the potential." This year, several major companies, from automotive manufacturing to energy, e-commerce, and food delivery, each opened with results.&lt;/p&gt;

&lt;p&gt;The infrastructure for agentic AI is real now. The questions still open are about safety, not capability.&lt;/p&gt;

&lt;h2&gt;
  
  
  Strands Agents: letting the model decide
&lt;/h2&gt;

&lt;p&gt;I started the day with the Strands Agents talk. We're already using Strands at work, so I expected mostly familiar ground. What was actually new was how far they've pushed the model taking over orchestration decisions that developers used to make manually.&lt;/p&gt;

&lt;p&gt;The shift sounds small but changes everything: you don't define which agent to call when. The orchestrator figures it out. The model reads the task, decides what capability it needs, and routes itself. That means less brittle wiring in your agent graph and more flexibility when edge cases appear at runtime.&lt;/p&gt;

&lt;p&gt;Three capabilities stood out:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Self-extending:&lt;/strong&gt; Agents write their own Python tools on the fly. Because the framework can reload tools at runtime, an agent can recognize a missing capability, write the code for it, save it, and execute that new tool immediately, without a restart.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Self-directing:&lt;/strong&gt; Agents update their own system prompts based on interactions and store those updates in persistent memory. When deployed with Amazon Bedrock AgentCore, they use Amazon Bedrock Knowledge Bases to retrieve past sessions, so context accumulates across conversations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Meta-agents:&lt;/strong&gt; A primary orchestrator spawns sub-agents based on the task. Three patterns: Swarm (sub-agents work on parts of the problem in parallel, sharing a context space), Graph (directed hand-offs between specialized agents in sequence), and Think (the agent recursively spawns instances of itself to deepen reasoning before answering).&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Similar talk from AWS re:Invent 2025: &lt;a href="https://www.youtube.com/watch?v=RQfW7eQsXqk&amp;amp;t=3130s" rel="noopener noreferrer"&gt;Strands Agents deep-dive&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  When an enterprise HR platform runs itself
&lt;/h2&gt;

&lt;p&gt;A global energy company's HR team had a legacy data distribution system. Expensive, rigid, built for a world where data formats and sources didn't change often. They replaced it with a serverless platform on AWS, powered by Amazon Bedrock Agents and Amazon Bedrock AgentCore.&lt;/p&gt;

&lt;p&gt;What made it interesting wasn't the migration. It was what they built on top: the platform is self-documenting, self-modifying, and self-healing. When a new data source appears, the system builds the ingestion Lambda and Step Functions itself. When a transformation fails, it diagnoses the failure, writes a fix, and deploys it. Engineers can ask the platform questions in natural language and get answers about its own internals.&lt;/p&gt;

&lt;p&gt;The result: $800K in annual license savings, and a data delivery platform that largely operates without engineers in the loop for routine changes.&lt;/p&gt;

&lt;p&gt;That's Level 3 maturity, which I'll explain below. Most teams aren't there yet. But seeing it running in production at a company of this size makes it harder to argue it's years away.&lt;/p&gt;

&lt;h2&gt;
  
  
  Supply chain planning with agents
&lt;/h2&gt;

&lt;p&gt;A global automotive manufacturer has been running ML-based forecasting and planning on AWS for years. The agentic layer they've added isn't a replacement for that infrastructure; it's a reasoning layer on top of it, one that can work through problems humans currently have to resolve manually.&lt;/p&gt;

&lt;p&gt;The concrete example that stuck with me: "where is my tire and when will it arrive?" Sounds simple. Answering it requires querying across production schedules, capacity constraints, distribution paths, and carrier networks simultaneously, then reconciling those answers into something actionable. Before agentic AI, that took hours. Now it takes five minutes.&lt;/p&gt;

&lt;p&gt;The architecture uses specialized agents that investigate production blocks and capacity constraints in parallel, then synthesize their findings into business-friendly explanations. That last part matters: the team specifically called out the move from black-box optimization to transparent decision-making. Supply chain planners needed to understand why the system made a recommendation, not just what it recommended. Agents that can explain their reasoning are the only version that gets adopted.&lt;/p&gt;

&lt;h2&gt;
  
  
  Privacy-first, then ship
&lt;/h2&gt;

&lt;p&gt;One talk stood out for its honesty about what the path to production actually looks like. The team behind a European e-commerce platform didn't start with a multi-agent architecture. They started with a single Bedrock API call for intent classification and predefined response templates, built data privacy compliance in from day one, then iterated from there: prompt engineering, Bedrock Agents, multi-agent orchestration, and finally Amazon Bedrock AgentCore. Each phase had a ceiling where the simpler approach stopped being good enough.&lt;/p&gt;

&lt;p&gt;The result of that iteration: over 50% of first-level support tickets are now handled by AI, and resolution time is down 60%.&lt;/p&gt;

&lt;p&gt;Privacy was the first design constraint throughout, not an afterthought. Before any ticket content reaches an LLM, it goes through an anonymization layer that strips PII using pattern matching and rules. Only anonymized content goes forward. The gate pipeline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;flowchart LR
    A[Ticket arrives] --&amp;gt; B[Anonymize PII]
    B --&amp;gt; C[Route to team]
    C --&amp;gt; D[Confidence check]
    D --&amp;gt; E[Add context]
    E --&amp;gt; F[Lifecycle check]
    F --&amp;gt; G[AI response]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One detail I hadn't considered before: they tried sending the AI response five seconds after ticket submission. Customer satisfaction dropped. People felt the speed was inhuman and didn't trust the answer. They now add an artificial delay. The content didn't change, the timing did, and satisfaction went back up.&lt;/p&gt;

&lt;p&gt;The system splits B2C and B2B into separate agent chains, because B2B tickets carry more company-specific context and require more specialized handling. Observability is built in from the start: every response is traced, retrieval success rates and model usage are tracked, and quality scores feed back into the system. That observability also changed how the team debugs. The question shifted from "what did the model answer?" to "why did the workflow behave this way?", which means tracing which tools were called, what was retrieved and how relevant it was, where latency came from, and what CSAT scores and ticket reopen rates are telling you about specific workflow paths.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;flowchart LR
    A[Normalized Support Request] --&amp;gt; B[Router Agent]

    B --&amp;gt; C[Consumer Support Agent]
    B --&amp;gt; D[B2B Support Agent]
    B --&amp;gt; E[Human Handoff Agent]

    C --&amp;gt; F[Ticket Update Composer]
    D --&amp;gt; F

    F --&amp;gt; G[Ticket API]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;Multi-agent does not mean everyone talks to everyone.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Bounded specialization reduces accidental cross-talk and makes handoff behavior easier to reason about. A few architectural decisions that make this work in practice:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;strong&gt;Router Agent&lt;/strong&gt; decides which specialized agent handles the request; nothing else does routing.&lt;/li&gt;
&lt;li&gt;Specialized agents are isolated by responsibility and don't call each other directly.&lt;/li&gt;
&lt;li&gt;Only the &lt;strong&gt;Ticket Update Composer&lt;/strong&gt; writes back to external systems (Ticket Systems).&lt;/li&gt;
&lt;li&gt;Human escalation is an explicit path, not a fallback you discover at 2am.&lt;/li&gt;
&lt;li&gt;Tool permissions and handoff contracts are defined up front, not negotiated at runtime.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The architecture is only part of what makes this production-grade. Agents introduced operational concerns that became part of the product itself:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;IaC and reproducibility&lt;/strong&gt;: agents, collaborators, roles, Lambda functions, API Gateway, and knowledge base configuration all need reproducible deployment. Drift between environments is a real failure mode.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Aliases and versioning&lt;/strong&gt;: promote tested versions explicitly; don't rely on draft agent behavior in production.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Latency budgets&lt;/strong&gt;: multi-tool workflows can exceed webhook timeouts. Latency is a design constraint, not a monitoring afterthought.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Structured traces&lt;/strong&gt;: log intent, retrieval, tool inputs, API errors, and response payloads. Debugging an agentic workflow without traces is guesswork.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Human QA sampling&lt;/strong&gt;: review low CSAT scores, ticket reopen rates, and escalation reasons on a regular cadence.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost guardrails&lt;/strong&gt;: cap agent steps, retries, token budgets, and retrieval depth. Unbounded agents are a billing incident waiting to happen.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The customer sees the answer. Engineering owns the routing, policy, traces, and feedback loop.&lt;/p&gt;

&lt;p&gt;Choosing the right level of agent complexity is its own decision. Not everything needs a full multi-agent setup:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Use case&lt;/th&gt;
&lt;th&gt;Direct Bedrock call&lt;/th&gt;
&lt;th&gt;Bedrock Agent&lt;/th&gt;
&lt;th&gt;AgentCore&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;One decision, known outputs&lt;/td&gt;
&lt;td&gt;Best fit&lt;/td&gt;
&lt;td&gt;Overkill&lt;/td&gt;
&lt;td&gt;Overkill&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Needs RAG + tools&lt;/td&gt;
&lt;td&gt;Possible but manual&lt;/td&gt;
&lt;td&gt;Best fit&lt;/td&gt;
&lt;td&gt;Good fit&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multiple topics / generated answer&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;td&gt;Best fit&lt;/td&gt;
&lt;td&gt;Good fit&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cross-channel middleware&lt;/td&gt;
&lt;td&gt;Possible&lt;/td&gt;
&lt;td&gt;Best fit&lt;/td&gt;
&lt;td&gt;Good fit&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Longer-term memory / operations&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;td&gt;Partial&lt;/td&gt;
&lt;td&gt;Best fit&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Don't optimize for "most agentic." Optimize for the minimum autonomy that solves the customer problem with acceptable operational risk.&lt;/p&gt;

&lt;p&gt;Their lesson for anyone starting this: get your data security and privacy team involved before you write the first line of agent code. The design decisions they'll ask for are the ones you can't retrofit later.&lt;/p&gt;

&lt;h2&gt;
  
  
  AgentCore: building multi-tenant AI as a service
&lt;/h2&gt;

&lt;p&gt;I also caught the expert-level session on building multi-tenant SaaS agents with Amazon Bedrock AgentCore. If you're building a platform where multiple customers each get their own AI agents, the isolation problem is harder than it looks.&lt;/p&gt;

&lt;p&gt;Tenant isolation in agentic systems runs across five dimensions: identity (each tenant's agents act with scoped credentials), memory (one tenant's conversation history can't leak into another's context), gateway (routing and rate-limiting per tenant), observability (tenant-scoped traces so you can debug without seeing another customer's data), and runtime (compute isolation so a runaway agent in one tenant doesn't affect others). The session walked through working examples for each.&lt;/p&gt;

&lt;p&gt;The framing they used, "intelligence as a service," is worth keeping. If you're building AI capabilities that other teams or customers consume, the SaaS constructs of onboarding, isolation, and identity propagation apply just as much as they do to any other service you'd build. The AgentCore primitives give you the building blocks; you still have to wire them together intentionally.&lt;/p&gt;

&lt;p&gt;Similar session recording: &lt;a href="https://www.youtube.com/watch?v=uwXrtyXXuy8" rel="noopener noreferrer"&gt;Building Multi-Tenant SaaS Agents with Amazon Bedrock AgentCore&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;One published example that shows AgentCore working on a real regulated domain: an open-source medical content review application that cross-checks pharmaceutical marketing claims against clinical references, PubMed, OpenFDA, and ClinicalTrials.gov. A few architectural decisions in it are worth studying regardless of your domain. First, the reviewer sub-agents persist their findings to S3 and return only an S3 URI to the orchestrator; hundreds of findings from a 30-page document never flow through the orchestrator's context window, which would cause it to summarize and drop findings. Second, user identity is extracted server-side from the JWT &lt;code&gt;sub&lt;/code&gt; claim, never from the request payload, which closes the impersonation-via-prompt-injection vector directly. Both patterns are reusable. Full writeup and open-source repo: &lt;a href="https://builder.aws.com/content/37phdmvQL1KmluO9s6xx0TJMod2/accelerate-medical-content-review-with-amazon-bedrock-agentcore" rel="noopener noreferrer"&gt;Accelerate Medical Content Review with Amazon Bedrock AgentCore&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The maturity ladder
&lt;/h2&gt;

&lt;p&gt;Across several talks, a rough framework emerged for how companies are thinking about agentic AI maturity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Level 1&lt;/strong&gt; is rules-based AI. The system follows defined policies, humans define every decision path, and the AI fills in specific gaps.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Level 2&lt;/strong&gt; is autonomous task AI. The system handles entire workflows: self-documentation, quality monitoring, task routing. Humans oversee outcomes, not individual steps.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Level 3&lt;/strong&gt; is self-monitoring systems. The system diagnoses its own failures and builds capabilities it didn't have before. Human oversight is exception-based, not routine.&lt;/p&gt;

&lt;p&gt;Most of the companies presenting were somewhere in the Level 1 to Level 2 transition. The energy company's HR platform was the one example of Level 3 running in production. That gap is worth knowing about before you start planning, because Level 2 requires different architecture decisions than Level 1, and Level 3 requires different trust decisions than Level 2.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's still open
&lt;/h2&gt;

&lt;p&gt;The results are real: $800K saved, hours to five minutes, resolution time down 60%. These aren't demos.&lt;/p&gt;

&lt;p&gt;But three problems came up in nearly every talk, and nobody presented a clean solution.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hallucination in production&lt;/strong&gt; has operational consequences now. When your agent is writing transformation plugins and deploying them, a confidently wrong answer triggers real failures. Teams are managing this with human-in-the-loop gates at specific checkpoints.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prompt injection and prompt protection&lt;/strong&gt; got less stage time than they deserved. As agents act on data from external systems, the attack surface grows. One concrete mitigation that came up in the AgentCore examples: extract user identity from the JWT server-side, never from the request payload, so attackers can't impersonate users through crafted input. That closes one specific vector; the broader problem of agents acting on adversarial content from external sources is still largely unsolved in practice.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data privacy at scale&lt;/strong&gt; is hard even with PII anonymization. Cross-border data flows, multi-system agents, and context accumulating in memory all create compliance complexity that rules-based anonymization doesn't fully address.&lt;/p&gt;

&lt;p&gt;These are reasons to build carefully, with privacy by default and observability from day one. They're not reasons to wait.&lt;/p&gt;

&lt;h2&gt;
  
  
  Context engineering is the next skill
&lt;/h2&gt;

&lt;p&gt;Prompt engineering was the first visible handle: write better prompts, get better outputs. The feedback loop was tight. The harder problem in agentic systems is what surrounds the prompt: what context an agent has access to, when it gets loaded, how much fits in the window, and what happens when it doesn't.&lt;/p&gt;

&lt;p&gt;The AgentCore medical content review example makes this concrete. Reviewer sub-agents persist findings to S3 and hand the orchestrator a URI instead of the content; the full set is loaded once, at the final editing pass. That's a context engineering decision: controlling which information exists in which agent's window at which moment, specifically to prevent the model from silently dropping findings it can't fit.&lt;/p&gt;

&lt;p&gt;The same pattern shows up in the Strands memory model (Bedrock Knowledge Bases retrieve relevant past sessions, not all of them) and in the e-commerce platform's lifecycle gates (explicit stages controlling what context is available at each step). Every multi-agent architecture at the summit made the same choice. The common thread is explicit decisions about what each agent knows and when.&lt;/p&gt;

&lt;p&gt;Context windows are a real production constraint, not a theoretical one. As agents chain together and produce intermediate output, what you carry forward and what you discard is an architectural choice. Getting it wrong makes the system quietly incorrect in ways that are hard to debug, because the model won't tell you it dropped something.&lt;/p&gt;

&lt;p&gt;Treat context design with the same seriousness as a data model.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I took away
&lt;/h2&gt;

&lt;p&gt;The energy was different from the recordings I'd watched in previous years. Companies were comparing notes, sharing what broke and what they'd do differently.&lt;/p&gt;

&lt;p&gt;The part worth repeating to anyone building with AI right now: share the positive stories. The results from these companies are real and worth talking about with your team. And while you're doing that, keep hallucination and data protection as first-class design constraints, not late-stage reviews. Include the people who care about those things early; they make the system better, not slower.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;These are my personal impressions from the conference. The views here are my own and don't represent my employer or any of the companies mentioned. If I got something wrong or misunderstood a detail, ping me on &lt;a href="https://www.linkedin.com/in/larsroettig/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; and I'll correct it.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>aws</category>
    </item>
  </channel>
</rss>
