<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: seah-js</title>
    <description>The latest articles on DEV Community by seah-js (@seahjs).</description>
    <link>https://dev.to/seahjs</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3750615%2Fe5cdc7e4-9d32-42da-99e1-243fcf4ed66e.png</url>
      <title>DEV Community: seah-js</title>
      <link>https://dev.to/seahjs</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/seahjs"/>
    <language>en</language>
    <item>
      <title>Attention Is All You Need — Full Paper Breakdown</title>
      <dc:creator>seah-js</dc:creator>
      <pubDate>Sat, 07 Mar 2026 22:57:00 +0000</pubDate>
      <link>https://dev.to/seahjs/attention-is-all-you-need-full-paper-breakdown-2kf3</link>
      <guid>https://dev.to/seahjs/attention-is-all-you-need-full-paper-breakdown-2kf3</guid>
      <description>&lt;p&gt;The 2017 paper "Attention Is All You Need" by Vaswani et al. introduced the Transformer — the architecture behind GPT, Claude, Gemini, and every major LLM today. It replaced recurrent models entirely with attention mechanisms, and the field has never looked back.&lt;/p&gt;

&lt;p&gt;This post walks through the key ideas.&lt;/p&gt;

&lt;p&gt;

  &lt;iframe src="https://www.youtube.com/embed/C7YiPaUYo1k"&gt;
  &lt;/iframe&gt;


&lt;/p&gt;




&lt;h2&gt;
  
  
  The problem with RNNs
&lt;/h2&gt;

&lt;p&gt;Before Transformers, sequence modeling meant RNNs and LSTMs. These process tokens one at a time, left to right. That sequential dependency creates two problems:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;No parallelization&lt;/strong&gt; — each step depends on the previous hidden state, so you can't process tokens simultaneously during training&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Long-range dependencies decay&lt;/strong&gt; — by the time an RNN reaches token 500, the signal from token 1 has been compressed through hundreds of hidden states&lt;/li&gt;
&lt;/ol&gt;
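
&lt;p&gt;The contrast is easy to see in code. Below is a minimal NumPy sketch (toy dimensions and random weights, not a real model): the RNN update must loop token by token, while attention scores come from a single matrix product:&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)
seq = rng.normal(size=(500, 64))       # 500 tokens, 64-dim embeddings
Wh = rng.normal(size=(64, 64)) * 0.05  # toy recurrent weight matrix

# RNN-style: each hidden state depends on the previous one, so this
# loop cannot be parallelized across time steps
h = np.zeros(64)
for x in seq:
    h = np.tanh(x + h @ Wh)

# Attention-style: all pairwise token scores in one matrix product,
# computed at once on parallel hardware
scores = seq @ seq.T   # shape (500, 500)
print(scores.shape)
```
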

&lt;p&gt;Attention mechanisms existed before this paper (Bahdanau attention, 2014), but they were bolted onto RNNs. The radical idea here: &lt;strong&gt;what if attention is all you need?&lt;/strong&gt; Drop the recurrence entirely.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Encoder-Decoder architecture
&lt;/h2&gt;

&lt;p&gt;The Transformer follows the classic encoder-decoder structure used in machine translation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Encoder&lt;/strong&gt; (left side): Maps the input sequence to a rich contextual representation, using a stack of 6 identical layers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Decoder&lt;/strong&gt; (right side): Combines the encoder's output with the previously generated tokens to predict the next token, also through a stack of 6 layers.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each layer in both stacks has the same building blocks: multi-head attention, feed-forward networks, residual connections, and layer normalization.&lt;/p&gt;

&lt;h2&gt;
  
  
  Self-attention: the core mechanism
&lt;/h2&gt;

&lt;p&gt;Self-attention lets every token in a sequence look at every other token and decide how much to "pay attention" to it.&lt;/p&gt;

&lt;p&gt;For each token, the model computes three vectors:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Query (Q)&lt;/strong&gt; — "what am I looking for?"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Key (K)&lt;/strong&gt; — "what do I contain?"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Value (V)&lt;/strong&gt; — "what information do I provide?"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are produced by multiplying the input embeddings by learned weight matrices: 

&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;WQ,WK,WVW_Q, W_K, W_V &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;W&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;Q&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mpunct"&gt;,&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;W&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;K&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mpunct"&gt;,&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;W&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;V&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span 
class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
.&lt;/p&gt;
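
&lt;p&gt;In code these projections are plain matrix multiplications. A minimal NumPy sketch with toy random weights (the shapes follow the paper: d_model = 512, d_k = 64; real weights are learned during training):&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_k = 512, 64

X = rng.normal(size=(10, d_model))            # embeddings for 10 tokens
W_Q = rng.normal(size=(d_model, d_k)) * 0.02  # toy stand-ins for the
W_K = rng.normal(size=(d_model, d_k)) * 0.02  # learned weight matrices
W_V = rng.normal(size=(d_model, d_k)) * 0.02

# every token gets its own query, key, and value vector
Q, K, V = X @ W_Q, X @ W_K, X @ W_V           # each (10, 64)
print(Q.shape, K.shape, V.shape)
```
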

&lt;p&gt;The attention score between two tokens is the dot product of the query of one with the key of the other. High dot product = high relevance. The formula:&lt;/p&gt;


&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;Attention(Q,K,V)=softmax(QKTdk)V
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;Attention&lt;/span&gt;&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord mathnormal"&gt;Q&lt;/span&gt;&lt;span class="mpunct"&gt;,&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;K&lt;/span&gt;&lt;span class="mpunct"&gt;,&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;V&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;softmax&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="minner"&gt;&lt;span class="mopen delimcenter"&gt;&lt;span class="delimsizing size3"&gt;(&lt;/span&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mopen nulldelimiter"&gt;&lt;/span&gt;&lt;span class="mfrac"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord sqrt"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span class="svg-align"&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;d&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;k&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span 
class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="hide-tail"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="frac-line"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;Q&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;K&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;T&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose nulldelimiter"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose delimcenter"&gt;&lt;span class="delimsizing size3"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;V&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;


&lt;p&gt;The 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;dk\sqrt{d_k} &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord sqrt"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span class="svg-align"&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;d&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;k&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="hide-tail"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 scaling factor prevents the dot products from growing too large as dimensionality increases — without it, the softmax would produce extremely peaked distributions, effectively killing gradients.&lt;/p&gt;
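
&lt;p&gt;Putting the formula together, here is a minimal NumPy sketch of scaled dot-product attention (toy random inputs; a real implementation would also handle batching and masking):&lt;/p&gt;

```python
import numpy as np

def softmax(x, axis=-1):
    # subtract the row max for numerical stability
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    # pairwise query-key dot products, scaled by sqrt(d_k)
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V                  # weighted mix of value vectors

rng = np.random.default_rng(0)
Q = rng.normal(size=(5, 64))
K = rng.normal(size=(5, 64))
V = rng.normal(size=(5, 64))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (5, 64)
```
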
&lt;h2&gt;
  
  
  Multi-head attention
&lt;/h2&gt;

&lt;p&gt;Instead of computing attention once with the full dimensionality, the model splits Q, K, and V into multiple &lt;strong&gt;heads&lt;/strong&gt; (8 in the original paper). Each head operates on a smaller subspace (
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;dmodel/h=512/8=64d_{\text{model}} / h = 512 / 8 = 64 &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;d&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord text mtight"&gt;&lt;span class="mord mtight"&gt;model&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mord"&gt;/&lt;/span&gt;&lt;span class="mord mathnormal"&gt;h&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;512/8&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;64&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 dimensions per head).&lt;/p&gt;

&lt;p&gt;Why? Different heads can learn different types of relationships:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One head might focus on syntactic structure (subject-verb agreement)&lt;/li&gt;
&lt;li&gt;Another might capture positional proximity&lt;/li&gt;
&lt;li&gt;Another might track semantic similarity&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The outputs of all heads are concatenated and projected back to the full dimension.&lt;/p&gt;
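
&lt;p&gt;A minimal NumPy sketch of that split-and-merge bookkeeping (toy data; the per-head attention computation and the learned projection matrices are elided):&lt;/p&gt;

```python
import numpy as np

d_model, h = 512, 8
d_k = d_model // h                     # 64 dims per head
seq_len = 10

x = np.random.default_rng(1).normal(size=(seq_len, d_model))

# split the model dimension into h heads of d_k dims each
heads = x.reshape(seq_len, h, d_k).transpose(1, 0, 2)  # (h, seq_len, d_k)

# ... each head would run scaled dot-product attention here ...

# concatenate the heads back to the full model dimension
merged = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
print(merged.shape)  # (10, 512)
```

&lt;p&gt;In the full model each head also applies its own learned Q/K/V projections before attending, and the concatenated result passes through a final output projection.&lt;/p&gt;
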
&lt;h2&gt;
  
  
  Three types of attention in the Transformer
&lt;/h2&gt;

&lt;p&gt;The paper uses multi-head attention in three distinct ways:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Encoder self-attention&lt;/strong&gt; — every input token attends to every other input token&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Masked decoder self-attention&lt;/strong&gt; — each output token attends only to previous output tokens (the mask prevents looking ahead, preserving autoregressive generation)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross-attention&lt;/strong&gt; — decoder tokens attend to encoder outputs, connecting the input representation to the output generation&lt;/li&gt;
&lt;/ol&gt;
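
&lt;p&gt;The decoder mask in (2) is typically implemented by adding negative infinity to the forbidden positions before the softmax, so their attention weights become exactly zero. A minimal NumPy sketch with uniform toy scores:&lt;/p&gt;

```python
import numpy as np

seq_len = 5
# strict upper triangle is -inf: position i may only attend to
# positions j at or before i
mask = np.triu(np.full((seq_len, seq_len), -np.inf), k=1)

scores = np.zeros((seq_len, seq_len)) + mask   # toy: all raw scores equal
e = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = e / e.sum(axis=-1, keepdims=True)

# each row is uniform over its visible prefix; future positions get 0
print(np.round(weights, 2))
```
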
&lt;h2&gt;
  
  
  Positional encoding
&lt;/h2&gt;

&lt;p&gt;Self-attention has no inherent notion of order — it's a set operation. "The cat sat on the mat" and "mat the on sat cat the" would produce identical attention patterns without positional information.&lt;/p&gt;

&lt;p&gt;The paper adds &lt;strong&gt;positional encodings&lt;/strong&gt; using sine and cosine functions of different frequencies:&lt;/p&gt;


&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;PE(pos,2i)=sin⁡(pos100002i/dmodel)
PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;P&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;E&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mopen mtight"&gt;(&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;p&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;os&lt;/span&gt;&lt;span class="mpunct mtight"&gt;,&lt;/span&gt;&lt;span class="mord mtight"&gt;2&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;i&lt;/span&gt;&lt;span class="mclose mtight"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mop"&gt;sin&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="minner"&gt;&lt;span class="mopen delimcenter"&gt;&lt;span class="delimsizing size2"&gt;(&lt;/span&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mopen nulldelimiter"&gt;&lt;/span&gt;&lt;span class="mfrac"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;1000&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;0&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span 
class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mtight"&gt;2&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;i&lt;/span&gt;&lt;span class="mord mtight"&gt;/&lt;/span&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;d&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size3 size1 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord text mtight"&gt;&lt;span class="mord mtight"&gt;model&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="frac-line"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;p&lt;/span&gt;&lt;span class="mord mathnormal"&gt;os&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose nulldelimiter"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose delimcenter"&gt;&lt;span class="delimsizing size2"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;



&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;PE(pos,2i+1)=cos⁡(pos100002i/dmodel)
PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;P&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;E&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mopen mtight"&gt;(&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;p&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;os&lt;/span&gt;&lt;span class="mpunct mtight"&gt;,&lt;/span&gt;&lt;span class="mord mtight"&gt;2&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;i&lt;/span&gt;&lt;span class="mbin mtight"&gt;+&lt;/span&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;span class="mclose mtight"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mop"&gt;cos&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="minner"&gt;&lt;span class="mopen delimcenter"&gt;&lt;span class="delimsizing size2"&gt;(&lt;/span&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mopen nulldelimiter"&gt;&lt;/span&gt;&lt;span class="mfrac"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;1000&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;0&lt;/span&gt;&lt;span 
class="msupsub"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mtight"&gt;2&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;i&lt;/span&gt;&lt;span class="mord mtight"&gt;/&lt;/span&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;d&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size3 size1 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord text mtight"&gt;&lt;span class="mord mtight"&gt;model&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="frac-line"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;p&lt;/span&gt;&lt;span class="mord mathnormal"&gt;os&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose nulldelimiter"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose delimcenter"&gt;&lt;span class="delimsizing size2"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;


&lt;p&gt;These are added (not concatenated) to the input embeddings. The sinusoidal approach was chosen because it allows the model to generalize to sequence lengths longer than those seen during training — any relative position can be expressed as a linear function of the encodings.&lt;/p&gt;
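
&lt;p&gt;The two formulas are straightforward to implement. A minimal NumPy sketch (assuming the paper's d_model = 512):&lt;/p&gt;

```python
import numpy as np

def positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]        # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]    # even dimension indices
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions get sine
    pe[:, 1::2] = np.cos(angles)  # odd dimensions get cosine
    return pe

pe = positional_encoding(100, 512)
print(pe.shape)  # (100, 512)
```
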

&lt;h2&gt;
  
  
  Position-wise feed-forward networks
&lt;/h2&gt;

&lt;p&gt;Each attention sub-layer is followed by a feed-forward network applied independently to each position:&lt;/p&gt;


&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;FFN(x)=max⁡(0,xW1+b1)W2+b2
\text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;FFN&lt;/span&gt;&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord mathnormal"&gt;x&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mop"&gt;max&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;0&lt;/span&gt;&lt;span class="mpunct"&gt;,&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;x&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;W&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;b&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span 
class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;W&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;2&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;b&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;2&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;


&lt;p&gt;This is two linear transformations with a ReLU in between. The inner dimension expands to 2048 (4× the model dimension of 512), then projects back down. Think of it as each token individually "processing" the information it gathered from attention.&lt;/p&gt;
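
&lt;p&gt;As a minimal NumPy sketch (toy random weights; the real W1, b1, W2, b2 are learned):&lt;/p&gt;

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    # expand to d_ff, apply ReLU, project back down to d_model
    hidden = np.maximum(0.0, x @ W1 + b1)   # ReLU
    return hidden @ W2 + b2

rng = np.random.default_rng(0)
d_model, d_ff = 512, 2048
W1 = rng.normal(size=(d_model, d_ff)) * 0.02
b1 = np.zeros(d_ff)
W2 = rng.normal(size=(d_ff, d_model)) * 0.02
b2 = np.zeros(d_model)

x = rng.normal(size=(10, d_model))   # 10 token positions
print(ffn(x, W1, b1, W2, b2).shape)  # (10, 512)
```
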

&lt;h2&gt;
  
  
  Residual connections and layer norm
&lt;/h2&gt;

&lt;p&gt;Every sub-layer (attention or FFN) is wrapped with:&lt;/p&gt;


&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;LayerNorm(x+SubLayer(x))
\text{LayerNorm}(x + \text{SubLayer}(x))
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;LayerNorm&lt;/span&gt;&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord mathnormal"&gt;x&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;SubLayer&lt;/span&gt;&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord mathnormal"&gt;x&lt;/span&gt;&lt;span class="mclose"&gt;))&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;


&lt;p&gt;The residual connection 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;x+SubLayer(x)x + \text{SubLayer}(x) &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;x&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;SubLayer&lt;/span&gt;&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord mathnormal"&gt;x&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 ensures gradients flow easily through deep stacks — without them, training a 6-layer stack would be much harder. Layer normalization stabilizes the activations.&lt;/p&gt;
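
&lt;p&gt;A minimal NumPy sketch of the wrapper (the learnable gain and bias of layer normalization are omitted for brevity, and the sub-layer is a stand-in):&lt;/p&gt;

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # normalize each position's vector to zero mean and unit scale
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def sublayer_wrap(x, sublayer):
    # residual connection, then layer normalization
    return layer_norm(x + sublayer(x))

x = np.random.default_rng(0).normal(size=(10, 512))
out = sublayer_wrap(x, lambda t: t * 0.5)  # stand-in for attention or FFN
print(out.shape)  # (10, 512)
```
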

&lt;h2&gt;
  
  
  Training details
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Optimizer:&lt;/strong&gt; Adam with 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;β1=0.9,β2=0.98\beta_1 = 0.9, \beta_2 = 0.98 &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;β&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;0.9&lt;/span&gt;&lt;span class="mpunct"&gt;,&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;β&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;2&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;0.98&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Learning rate schedule:&lt;/strong&gt; Warmup + decay. LR increases linearly for 4000 steps, then decays proportionally to the inverse square root of the step number&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Regularization:&lt;/strong&gt; Dropout (0.1) on attention weights and after each sub-layer, plus label smoothing (0.1)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Training data:&lt;/strong&gt; WMT English-German (4.5M sentence pairs) and English-French (36M pairs)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hardware:&lt;/strong&gt; 8 NVIDIA P100 GPUs, 3.5 days for the big model&lt;/li&gt;
&lt;/ul&gt;
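&lt;p&gt;The warmup-then-decay schedule fits in a few lines. A hedged sketch using the paper's base-model &lt;code&gt;d_model = 512&lt;/code&gt; (the function name is mine):&lt;/p&gt;

```python
def transformer_lr(step, d_model=512, warmup=4000):
    """The paper's schedule: linear warmup for `warmup` steps, then
    decay proportional to the inverse square root of the step number."""
    step = max(step, 1)  # the formula is undefined at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)
```

&lt;p&gt;With &lt;code&gt;d_model = 512&lt;/code&gt; this peaks around &lt;code&gt;7e-4&lt;/code&gt; at step 4000, then decays.&lt;/p&gt;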

&lt;h2&gt;
  
  
  The results
&lt;/h2&gt;

&lt;p&gt;The Transformer achieved state-of-the-art on English-to-German and English-to-French translation, beating all previous models including deep ensembles — while training significantly faster due to full parallelization.&lt;/p&gt;

&lt;p&gt;But translation was just the beginning. The architecture turned out to be the foundation for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;BERT&lt;/strong&gt; (encoder-only) — bidirectional pretraining&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GPT&lt;/strong&gt; (decoder-only) — autoregressive language modeling&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vision Transformers&lt;/strong&gt; — applying the same architecture to images&lt;/li&gt;
&lt;li&gt;Basically everything in modern AI&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Key takeaway
&lt;/h2&gt;

&lt;p&gt;The paper's core insight is elegant: you don't need recurrence or convolutions for sequence modeling. Attention alone — properly scaled, split into multiple heads, and stacked with residual connections — is sufficient. And because attention computes all pairwise relationships in parallel, it's dramatically faster to train.&lt;/p&gt;
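&lt;p&gt;The core operation is small enough to sketch in NumPy. A toy single-head version with made-up shapes, no masking and no learned projections:&lt;/p&gt;

```python
import numpy as np

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V -- scaled dot-product attention."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # all pairwise scores in one matmul
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V, weights

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(5, 64)) for _ in range(3))
out, w = attention(Q, K, V)
```

&lt;p&gt;Each row of &lt;code&gt;w&lt;/code&gt; is a probability distribution over all positions, computed for every token at once — that single matrix multiply is the parallelism the paper exploits.&lt;/p&gt;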

&lt;p&gt;That's why, nine years later, every frontier model is still a Transformer at its core.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>transformers</category>
      <category>deeplearning</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Agent Loops &amp; State Management</title>
      <dc:creator>seah-js</dc:creator>
      <pubDate>Mon, 09 Feb 2026 11:15:37 +0000</pubDate>
      <link>https://dev.to/seahjs/agent-loops-state-management-1hhm</link>
      <guid>https://dev.to/seahjs/agent-loops-state-management-1hhm</guid>
      <description>&lt;p&gt;This is a learning session between me and &lt;a href="https://github.com/openclaw/openclaw" rel="noopener noreferrer"&gt;Klover&lt;/a&gt;, my AI assistant. New topic — building on ReAct and function calling, now looking at the loop itself.&lt;/p&gt;




&lt;h2&gt;
  
  
  How does an agent remember?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; LLMs are stateless. So how does the agent maintain context across multiple iterations of the loop?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; Something tracks the conversation. After each tool call, the observation is appended to the tracked conversation and passed as input to the next LLM call.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Exactly. The conversation history grows with each step — every Thought, Action, and Observation gets appended, and the entire thing is fed back to the LLM on each iteration. ✅&lt;/p&gt;
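&lt;p&gt;A minimal sketch of that append-and-resend loop. The &lt;code&gt;call_llm&lt;/code&gt; and &lt;code&gt;run_tool&lt;/code&gt; callables are hypothetical stand-ins, not a real API:&lt;/p&gt;

```python
def agent_step(history, call_llm, run_tool):
    """One loop iteration: the *entire* history goes in, new entries are appended."""
    thought, action = call_llm(history)   # stateless LLM re-reads everything each call
    history.append({"role": "assistant", "thought": thought, "action": action})
    if action is not None:                # None would mean: final answer, exit the loop
        observation = run_tool(action)
        history.append({"role": "tool", "observation": observation})
    return history
```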

&lt;h2&gt;
  
  
  The growing context problem
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; The history keeps growing. What's the problem?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; Hitting the context window is an issue. We can summarise the conversation. But the summarisation can't be too lossy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Good — context window limits are the main constraint. Other approaches beyond summarization:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Sliding windows&lt;/strong&gt; — drop oldest messages&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Selective retention&lt;/strong&gt; — keep only relevant steps&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Retrieval-based&lt;/strong&gt; — store older steps externally, retrieve what's relevant (RAG for agent history)&lt;/li&gt;
&lt;/ul&gt;
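&lt;p&gt;The sliding window is the simplest of these to sketch. Illustrative only — real implementations usually count tokens rather than messages:&lt;/p&gt;

```python
def sliding_window(history, max_messages=20):
    """Drop the oldest messages, but always keep the system prompt at index 0."""
    if len(history) <= max_messages:
        return history
    return history[:1] + history[-(max_messages - 1):]
```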

&lt;h2&gt;
  
  
  When does the loop stop?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; What are the termination conditions for an agent loop?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; When the agent has enough to give a conclusive answer. Also a maximum number of tool calls.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Those are 2 of 5. The full list:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Natural completion&lt;/strong&gt; — LLM decides it has the answer ✅&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Max iterations&lt;/strong&gt; — hard cap on loop steps ✅&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Timeout&lt;/strong&gt; — wall clock time limit ❌&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Token/cost budget&lt;/strong&gt; — spending cap per task ❌&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Error threshold&lt;/strong&gt; — consecutive failures = stop ❌&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Key insight: &lt;strong&gt;never trust the LLM alone to stop.&lt;/strong&gt; Always have orchestrator-level guardrails.&lt;/p&gt;
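&lt;p&gt;All five conditions can live in one orchestrator loop. A sketch, with &lt;code&gt;step_fn&lt;/code&gt; as a hypothetical stand-in for one Thought-Action-Observation iteration:&lt;/p&gt;

```python
import time

def run_agent(step_fn, max_iters=10, timeout_s=60.0, budget_tokens=50_000, max_errors=3):
    """Orchestrator-level guardrails: never rely on the model alone to stop."""
    start, spent, errors = time.monotonic(), 0, 0
    for i in range(max_iters):                      # 2. max iterations (hard cap)
        if time.monotonic() - start > timeout_s:    # 3. wall-clock timeout
            return "stopped: timeout"
        if spent > budget_tokens:                   # 4. token/cost budget
            return "stopped: budget exceeded"
        if errors >= max_errors:                    # 5. consecutive-error threshold
            return "stopped: too many errors"
        done, tokens, ok = step_fn(i)
        spent += tokens
        errors = 0 if ok else errors + 1
        if done:                                    # 1. natural completion
            return "done"
    return "stopped: max iterations"
```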

&lt;h2&gt;
  
  
  State beyond conversation history
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; What other kinds of state might an agent need beyond chat messages?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; &lt;em&gt;(needed prompting)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Think of a &lt;strong&gt;scratchpad or working memory&lt;/strong&gt; — a separate structured data store:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"selected_flight"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"SQ321"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"passenger"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Jay Seah"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"payment_status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"pending"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Different from conversation history:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Structured&lt;/strong&gt; — not buried in natural language&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mutable&lt;/strong&gt; — agent updates specific fields&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Accessible&lt;/strong&gt; — no parsing through messages&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Production agents need both conversation history AND structured state management.&lt;/p&gt;
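&lt;p&gt;The distinction in code, using the illustrative values from the example above:&lt;/p&gt;

```python
# Two kinds of state: append-only conversation vs a mutable, structured scratchpad.
history = []                     # grows every step, may get summarized away
scratchpad = {                   # lives separately, updated in place
    "selected_flight": None,
    "passenger": "Jay Seah",
    "payment_status": "pending",
}

history.append({"role": "tool", "observation": "SQ321 has seats available"})
scratchpad["selected_flight"] = "SQ321"   # no parsing through messages needed
```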

&lt;h2&gt;
  
  
  Scratchpad solves the summarization problem
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Agent is 25 steps deep, context almost full. You summarize steps 1-20. What's the risk?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; The risk is loss of important information during summarisation. The structured file we talked about could help.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Nailed it. The scratchpad survives summarization because it's separate. Critical data lives in structured state, so even if conversation gets compressed, essential info is preserved. ✅&lt;/p&gt;




&lt;h2&gt;
  
  
  Session verdict: Exposure
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What I got:&lt;/strong&gt; Conversation history as state, context limits + summarization, natural completion + max iterations, scratchpad solving summarization risk.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What I missed:&lt;/strong&gt; 3 of 5 termination conditions (timeout, token/cost budget, error threshold). Structured scratchpad needed significant prompting.&lt;/p&gt;

&lt;p&gt;All 5 termination conditions on the test next review. No excuses.&lt;/p&gt;

&lt;p&gt;Next review: February 10.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>statemanagement</category>
      <category>learning</category>
    </item>
    <item>
      <title>Function Calling &amp; Tool Schemas — Review</title>
      <dc:creator>seah-js</dc:creator>
      <pubDate>Mon, 09 Feb 2026 11:15:23 +0000</pubDate>
      <link>https://dev.to/seahjs/function-calling-tool-schemas-review-d4j</link>
      <guid>https://dev.to/seahjs/function-calling-tool-schemas-review-d4j</guid>
      <description>&lt;p&gt;This is a learning session between me and &lt;a href="https://github.com/openclaw/openclaw" rel="noopener noreferrer"&gt;Klover&lt;/a&gt;, my AI assistant. Spaced repetition review — testing what stuck from the first session.&lt;/p&gt;




&lt;h2&gt;
  
  
  What does the LLM actually output?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; When an LLM "calls a function," what does it actually output, and what does it NOT do?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; It outputs a JSON table with all the relevant parameters which will be passed to an orchestrator which will parse the JSON and call the actual tool.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Core is right — the LLM outputs structured JSON, it does NOT execute anything itself. The orchestrator handles actual execution. Small note: "JSON table" — it's more accurately a JSON object (key-value pairs), not a table. Minor terminology thing but worth being precise.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where do schemas come from?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Where do tool schemas come from? Are they baked into the model during training, or something else? And what does a schema typically contain?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; It is external to the model. It is passed like a system prompt to the model. A schema usually contains the description of the tool, the required and optional parameters.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Solid. Schemas are injected at runtime, not baked in during training. They contain the tool name, description, and typed parameters (required + optional). ✅&lt;/p&gt;

&lt;h2&gt;
  
  
  Why schema quality matters
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Why does the quality of your schema descriptions matter so much?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; It might cause the LLM to use the tool wrongly which might result in useless or non-relevant outputs from the tool called.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Right — poor descriptions lead to wrong tool selection or bad parameter values. Schema quality is basically &lt;strong&gt;prompt engineering for tools&lt;/strong&gt;. Garbage in, garbage out. ✅&lt;/p&gt;

&lt;h2&gt;
  
  
  The orchestrator's role
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Beyond just executing the tool call, what else does the orchestrator do?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; It acts as a security layer. It ensures that the tool called is allowed for this model. Also, it prevents prompt injection from directly calling a tool.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Security is a big one. But the complete list:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Validation&lt;/strong&gt; — are the params correct/allowed?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rate limiting&lt;/strong&gt; — preventing abuse&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security/filtering&lt;/strong&gt; — blocking unauthorized calls&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Logging&lt;/strong&gt; — audit trail&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I got 2 out of 4. Missed rate limiting (prevents runaway agents hammering APIs) and logging (critical for debugging agent behavior in long chains).&lt;/p&gt;
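&lt;p&gt;The two I missed are easy to sketch. A hypothetical &lt;code&gt;ToolGate&lt;/code&gt; wrapper (not any real library's API):&lt;/p&gt;

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("orchestrator")

class ToolGate:
    """Rate limiting plus an audit trail around every tool execution."""
    def __init__(self, max_calls_per_task=20):
        self.max_calls = max_calls_per_task
        self.calls = 0

    def execute(self, tool_name, params, tools):
        self.calls += 1
        if self.calls > self.max_calls:      # stop runaway agents hammering APIs
            log.warning("rate limit hit on %s", tool_name)
            raise RuntimeError("tool-call budget exhausted")
        log.info("call #%d: %s(%r)", self.calls, tool_name, params)  # audit trail
        return tools[tool_name](**params)
```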

&lt;h2&gt;
  
  
  Connecting to ReAct
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; How does function calling relate to the ReAct pattern?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; Tool calls fit in the Action part. The Observation is the output from the tool call.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Spot on. Action = LLM outputs tool call JSON, orchestrator executes. Observation = raw result fed back for next Thought. ✅&lt;/p&gt;




&lt;h2&gt;
  
  
  Review verdict: Pass → Developing
&lt;/h2&gt;

&lt;p&gt;Core concepts solid. Missed rate limiting and logging from orchestrator responsibilities. Next review: February 12.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>functioncalling</category>
      <category>learning</category>
    </item>
    <item>
      <title>Function Calling &amp; Tool Schemas</title>
      <dc:creator>seah-js</dc:creator>
      <pubDate>Sun, 08 Feb 2026 13:28:15 +0000</pubDate>
      <link>https://dev.to/seahjs/function-calling-tool-schemas-265a</link>
      <guid>https://dev.to/seahjs/function-calling-tool-schemas-265a</guid>
      <description>&lt;p&gt;This is a learning session between me and &lt;a href="https://github.com/openclaw/openclaw" rel="noopener noreferrer"&gt;Klover&lt;/a&gt;, my AI assistant. The idea is simple — Klover teaches me a concept through back-and-forth dialogue, I explain things in my own words, and we go deeper together. This format captures the actual conversation, including mistakes and corrections.&lt;/p&gt;

&lt;p&gt;This time: Function calling and tool schemas — how agents actually interact with external tools.&lt;/p&gt;




&lt;h2&gt;
  
  
  What is a tool schema?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; You know the ReAct loop: Thought → Action → Observation. But when the agent picks an Action, how does it know what tools exist and how to call them? What do you think a tool schema is?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; I assume it details all the available tools the LLM can call. It exposes a "list" of tools and their purposes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Good intuition but there's more. A tool schema doesn't just list what tools exist — it defines &lt;strong&gt;how to call them&lt;/strong&gt;. Like a function signature in code. For each tool: a name, a description (natural language for the LLM), and typed parameters with required/optional flags.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;name: get_weather
description: "Get current weather for a location"
parameters:
  location (string, required): City name
  units (string, optional): "celsius" or "fahrenheit"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Where do schemas live?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Where does this schema live? Part of the model's training, or somewhere else?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; It's not part of training. The schema lives externally — almost like additional prompts fed to the LLM during every invoke.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Exactly. Schemas are injected into the system prompt or a dedicated tools section at runtime. The model was &lt;em&gt;trained&lt;/em&gt; to understand and follow the format, but the specific tools change per application. That's why you can give Claude a custom tool it's never seen and it works — it learned the format, not the specific tools.&lt;/p&gt;

&lt;h2&gt;
  
  
  The full round trip
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; When the LLM decides to call a tool, what does that actually look like? Walk me through the full round trip.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; The model decides to call the weather tool → invokes it via the output → weather tool returns an observation → model thinks about the observation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Right flow, but let's zoom in. The LLM is a text generator — it can't literally call an API. What does it actually produce?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; I guess it outputs the actual function call with the parameters, which then calls a function.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Specifically, it outputs &lt;strong&gt;structured JSON&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"tool"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"get_weather"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"parameters"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"location"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Singapore"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"units"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"celsius"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But the critical part: &lt;strong&gt;the LLM doesn't execute anything&lt;/strong&gt;. It produces this JSON and stops. Then your &lt;strong&gt;application code&lt;/strong&gt; (the orchestrator) takes over:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;LLM outputs a tool call → stops generating&lt;/li&gt;
&lt;li&gt;Your code parses the JSON&lt;/li&gt;
&lt;li&gt;Your code actually calls the weather API&lt;/li&gt;
&lt;li&gt;Your code feeds the result back as the Observation&lt;/li&gt;
&lt;li&gt;LLM continues with the next Thought&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The LLM is the brain. Your application is the hands.&lt;/p&gt;
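&lt;p&gt;Steps 2 through 4 can be sketched in a few lines (the weather tool here is a stub, not a real API):&lt;/p&gt;

```python
import json

# Stubbed tool registry: in a real app these call actual APIs.
TOOLS = {"get_weather": lambda location, units="celsius": f"26 degrees {units} in {location}"}

def round_trip(llm_output):
    """Steps 2-4: parse the model's JSON, execute it, return the Observation."""
    call = json.loads(llm_output)                      # 2. parse the JSON
    fn = TOOLS[call["tool"]]
    observation = fn(**call["parameters"])             # 3. actually make the call
    return {"role": "tool", "content": observation}    # 4. fed back for the next Thought

obs = round_trip('{"tool": "get_weather", "parameters": {"location": "Singapore"}}')
```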

&lt;h2&gt;
  
  
  Why the orchestrator matters
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Why is it important that the orchestrator sits between the LLM and the tool?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; So you can add safeguards — limiting total tool calls, ensuring tools are called correctly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Spot on. The orchestrator is your control layer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Validate&lt;/strong&gt; — check parameters before executing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rate limit&lt;/strong&gt; — cap tool calls to prevent infinite loops&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Filter&lt;/strong&gt; — block certain tools based on context or permissions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Log&lt;/strong&gt; — record every call for debugging&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sanitize&lt;/strong&gt; — clean output before feeding it back&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without that middle layer, the LLM has direct access to your systems. Imagine a prompt injection tricking the model into calling &lt;code&gt;delete_database&lt;/code&gt; — the orchestrator is where you catch that.&lt;/p&gt;
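&lt;p&gt;A sketch of that validate-and-filter layer. The allowlist and schema shape are illustrative, not any particular framework's API:&lt;/p&gt;

```python
ALLOWED_TOOLS = {"get_weather", "search_flights"}

def guarded_execute(call, tools, schema):
    """Validate and filter before anything runs: where delete_database gets caught."""
    if call["tool"] not in ALLOWED_TOOLS:                     # filter
        return {"error": f"tool '{call['tool']}' is not permitted"}
    params = call.get("parameters", {})
    missing = [p for p in schema[call["tool"]]["required"] if p not in params]
    if missing:                                               # validate
        return {"error": f"missing required parameters: {missing}"}
    return {"result": tools[call["tool"]](**params)}
```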

&lt;h2&gt;
  
  
  Schema quality matters
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; What happens if the schema is poorly written? Vague description, wrong parameter requirements?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; The tool gets called when it shouldn't be, or called wrongly — resulting in bad or failed results.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Exactly. Bad schemas → bad tool calls:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Vague description&lt;/strong&gt; → model calls the wrong tool or skips the right one&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Wrong types/requirements&lt;/strong&gt; → malformed requests, crashes, garbage output&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Missing parameter descriptions&lt;/strong&gt; → model guesses what "q" or "id" means&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Tool schema design is basically &lt;strong&gt;prompt engineering for tools&lt;/strong&gt;. Clear names, precise descriptions, correct types — it's the difference between a reliable agent and a flaky one.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;First session: Feb 8, 2026. Status: Exposure. Good intuition throughout — connected naturally to ReAct concepts. Review tomorrow.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>functioncalling</category>
      <category>learninpublic</category>
    </item>
    <item>
      <title>ReAct Pattern — Review</title>
      <dc:creator>seah-js</dc:creator>
      <pubDate>Sun, 08 Feb 2026 13:14:13 +0000</pubDate>
      <link>https://dev.to/seahjs/react-pattern-review-3cki</link>
      <guid>https://dev.to/seahjs/react-pattern-review-3cki</guid>
      <description>&lt;p&gt;This is a review session between me and &lt;a href="https://github.com/openclaw/openclaw" rel="noopener noreferrer"&gt;Klover&lt;/a&gt;, my AI assistant. The idea is simple — Klover quizzes me on something I've learned before, I explain it in my own words, and we go deeper on the parts that stick (or don't). This format captures the actual back-and-forth, including the mistakes.&lt;/p&gt;

&lt;p&gt;This time: ReAct pattern review, testing whether the core concepts have stuck since the first session.&lt;/p&gt;




&lt;h2&gt;
  
  
  Empty results — what happens next?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; An agent calls a search tool but gets back an empty result. Walk me through what happens next in the ReAct loop — what does the Thought step look like, and what are two ways it could go wrong?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; The Thought step would analyse what to do next. One failure: it could call the tool again with the same query, get nothing again, and end up in an endless loop.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; That's the first one — infinite retry loop. What's the second?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; I wasn't sure about the second one.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; The second is &lt;strong&gt;hallucination&lt;/strong&gt;. The Thought step sees no results and instead of trying a different approach, the agent makes up an answer. It "reasons" that it probably knows enough and generates a confident but fabricated response. Dangerous because the user has no idea the info wasn't grounded in actual tool output.&lt;/p&gt;

&lt;h2&gt;
  
  
  What should the agent actually do?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; So what should a well-designed ReAct agent do when the search returns nothing?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; The next reasoning step should recognise that something went wrong — maybe the query was too specific or used wrong terms — and try something else. Or if retrying doesn't work, report back to the user that it couldn't find the information.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Exactly. The Thought should explicitly acknowledge the failure and pivot. The key insight — the Thought step is where the agent decides between &lt;strong&gt;adapting&lt;/strong&gt; and &lt;strong&gt;giving up gracefully&lt;/strong&gt;. Both are valid. Hallucinating is never valid.&lt;/p&gt;

&lt;h2&gt;
  
  
  The T-A-O loop
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Quick check: what's the correct order of the loop, what does each letter stand for, and where does the agent decide whether to give a final answer or keep going?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; T — Think, A — Act, O — Observe. In that order. The Think step is where it decides.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Spot on. At each Thought step, the agent either decides to act (picks a tool, executes it, observes the result) or decides it's done (outputs the final answer and exits the loop). That's why Thought is the most critical step.&lt;/p&gt;

&lt;h2&gt;
  
  
  Who generates the Observation?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; What's the difference between the Action and the Observation? Specifically — who or what generates each one?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; Action is an external tool call or API. Observation is done by the LLM on the output of that tool call.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Close but not quite. The Action — yes, the LLM decides the tool call, and the tool executes externally. But the &lt;strong&gt;Observation is the raw output returned by the tool/environment&lt;/strong&gt;. It's data coming &lt;em&gt;in&lt;/em&gt;, not the LLM thinking about it. The LLM's analysis of that output happens in the &lt;strong&gt;next Thought step&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;So the full flow:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Thought&lt;/strong&gt; → LLM reasons&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Action&lt;/strong&gt; → LLM picks a tool, environment executes it&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Observation&lt;/strong&gt; → Environment returns the result (LLM just receives it)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Thought&lt;/strong&gt; → LLM analyses the observation, decides next move&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The Observation is the only part the LLM doesn't generate. It's the real world talking back.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; Yeah, that makes sense. I was conflating the observation with the reasoning about it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Failure modes recall
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Name the failure modes we've covered and give a one-line mitigation for each.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; Context window overflow — mitigate by summarising. Forever loop — limit the total number of tool calls. Attention dilution — same thing, summarise.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Three for three on unprompted recall. Plus you know hallucination from earlier. And you correctly identified the tradeoff — summarisation helps with both context bloat and attention dilution, but it's lossy compression. You might throw away something that matters later.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Review session: Feb 8, 2026. Stays at developing. Recalled 3 of 4 failure modes (missed hallucination — was prompted). Corrected on Observation being environment output, not LLM reasoning. Next review targets unprompted recall of all 4 failure modes.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>react</category>
      <category>learninpublic</category>
    </item>
    <item>
      <title>KV Cache Optimization — Why Inference Memory Explodes and How to Fix It</title>
      <dc:creator>seah-js</dc:creator>
      <pubDate>Fri, 06 Feb 2026 13:21:28 +0000</pubDate>
      <link>https://dev.to/seahjs/kv-cache-optimization-why-inference-memory-explodes-and-how-to-fix-it-54l3</link>
      <guid>https://dev.to/seahjs/kv-cache-optimization-why-inference-memory-explodes-and-how-to-fix-it-54l3</guid>
      <description>&lt;p&gt;Learning session with &lt;a href="https://github.com/openclaw/openclaw" rel="noopener noreferrer"&gt;Klover&lt;/a&gt;. Today: why the KV cache is the biggest memory bottleneck in LLM inference, and three ways to optimize it.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Does the KV Cache Exist?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; You already know from MLA that the KV cache is a bottleneck. Quick refresh — why does it exist in the first place?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; KV cache exists because it's used for inference to generate the next token. It speeds up the process because the key and value vectors don't have to be recalculated every time a new token is generated.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Spot on. Without it, every new token would recompute attention over the entire sequence — quadratic cost. With it, you just compute K and V for the new token and append. Linear.&lt;/p&gt;




&lt;h2&gt;
  
  
  How Much Memory Does It Use?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; What factors make the KV cache bigger?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; If I'm not wrong, a single number is 4 bits? If your vector is 128 numbers and you have both keys and values, each token would have 4×128×2, and then it scales linearly with tokens.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Close but a few corrections:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;FP16 (standard for inference) is &lt;strong&gt;2 bytes (16 bits)&lt;/strong&gt;, not 4 bits&lt;/li&gt;
&lt;li&gt;You got dimension and K+V (×2) right&lt;/li&gt;
&lt;li&gt;But you're missing &lt;strong&gt;number of layers and number of attention heads&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Full formula:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2 × num_layers × num_heads × head_dim × seq_length × 2 bytes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For Llama 70B (80 layers, 64 heads, head_dim 128) at 4K sequence length — the KV cache alone is &lt;strong&gt;~10GB per single request&lt;/strong&gt;. 100 concurrent users = 1TB just for KV cache.&lt;/p&gt;
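&lt;p&gt;The arithmetic checks out. (Note this uses the 64 heads quoted above; the real Llama 2 70B uses grouped-query attention with far fewer KV heads, so its actual cache is smaller.)&lt;/p&gt;

```python
def kv_cache_bytes(num_layers, num_heads, head_dim, seq_len, bytes_per_value=2):
    """2 (K and V) x layers x heads x head_dim x tokens x bytes (FP16 = 2)."""
    return 2 * num_layers * num_heads * head_dim * seq_len * bytes_per_value

per_request = kv_cache_bytes(num_layers=80, num_heads=64, head_dim=128, seq_len=4096)
# -> 10,737,418,240 bytes: exactly 10 GiB per request
```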




&lt;h2&gt;
  
  
  Optimization #1: PagedAttention
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; The name comes from how operating systems manage RAM. What do you think the connection might be?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; I don't even know how operating systems manage RAM.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Fair enough. Quick detour:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The OS problem:&lt;/strong&gt; Programs ask for different amounts of memory. If you give each one a big contiguous block "just in case," you waste memory sitting unused. This is &lt;strong&gt;internal fragmentation&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The OS solution — paging:&lt;/strong&gt; Split memory into small fixed-size &lt;strong&gt;pages&lt;/strong&gt;. Allocate on demand. Programs think they have contiguous memory, but the OS maps pages to wherever there's free space.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Same problem with KV cache:&lt;/strong&gt; Each request pre-allocates for the maximum sequence length. A 500-token response with 4096 reserved = 87% wasted.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PagedAttention (from vLLM):&lt;/strong&gt; Splits the KV cache into small fixed-size blocks, allocated on demand as the sequence grows. Result: ~60-80% better memory utilization.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; What's the core insight that both OS paging and PagedAttention share?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; Rationing memory to the services that need it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Yes — allocate on demand, not upfront. Don't reserve what you might not use.&lt;/p&gt;
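&lt;p&gt;The waste difference from the 500-token example, in code. (A block size of 16 tokens is, if I recall correctly, vLLM's default.)&lt;/p&gt;

```python
def prealloc_waste(used_tokens, reserved=4096):
    """Contiguous pre-allocation: every reserved slot past the real response is wasted."""
    return reserved - used_tokens

def paged_waste(used_tokens, block=16):
    """Paged allocation: only the unfilled tail of the last block is wasted."""
    return (-used_tokens) % block
```

&lt;p&gt;A 500-token response wastes 3,596 slots under pre-allocation but only 12 under paging.&lt;/p&gt;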




&lt;h2&gt;
  
  
  Optimization #2: Prefix Caching
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Think about a chatbot with a system prompt that every conversation starts with. Every request computes the exact same K and V for those tokens. What do you think prefix caching does?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; Cache the KV for the system prompt even before any user prompt is given.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Exactly. Compute once, reuse across all requests sharing that prefix.&lt;/p&gt;

&lt;p&gt;This applies beyond system prompts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Few-shot examples&lt;/strong&gt; — same examples prepended to every request&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-turn chat&lt;/strong&gt; — prior conversation already cached, only compute for the new message&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Parallel requests&lt;/strong&gt; — multiple requests sharing the same document&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;2000-token system prompt across 100 users = 200K tokens of computation eliminated.&lt;/p&gt;
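&lt;p&gt;A minimal sketch of the lookup — &lt;code&gt;compute_kv&lt;/code&gt; is a hypothetical stand-in for the real attention computation:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import hashlib

prefix_cache = {}  # prefix hash to precomputed K/V

def compute_kv(tokens):
    return f"KV for {len(tokens)} tokens"  # stand-in for real K/V tensors

def kv_for_request(prefix_tokens, user_tokens):
    key = hashlib.sha256(repr(prefix_tokens).encode()).hexdigest()
    if key not in prefix_cache:                # computed once per unique prefix...
        prefix_cache[key] = compute_kv(prefix_tokens)
    prefix_kv = prefix_cache[key]              # ...reused by every later request
    return prefix_kv, compute_kv(user_tokens)  # only the new tokens cost compute
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;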




&lt;h2&gt;
  
  
  Optimization #3: MLA
&lt;/h2&gt;

&lt;p&gt;MLA (Multi-head Latent Attention) was already covered in a previous session. It compresses K and V into a smaller latent space using learned projections, so less memory is stored per token.&lt;/p&gt;
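&lt;p&gt;The memory saving is easy to see with back-of-envelope numbers (illustrative dimensions, not any particular model's):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;n_heads, head_dim, latent_dim = 32, 128, 512

standard = 2 * n_heads * head_dim  # K and V across all heads: 8192 values/token
latent = latent_dim                # one compressed latent vector: 512 values/token
print(f"{standard // latent}x smaller cache per token")  # 16x
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;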




&lt;h2&gt;
  
  
  Putting It All Together
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; A company is serving an LLM to 1000 concurrent users. Same 1000-token system prompt, but response lengths vary wildly (50 to 4000 tokens). Which optimization helps with which problem?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; Same 1000-token prompt can use prefix cache across users. Varying response lengths can use PagedAttention. MLA can be used across all LLM settings if you want.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Nailed it. All three correct. And you're right that MLA is a model-level decision baked into the architecture, while PagedAttention and prefix caching are &lt;strong&gt;serving-layer&lt;/strong&gt; optimizations you can apply to any model.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Optimization&lt;/th&gt;
&lt;th&gt;Level&lt;/th&gt;
&lt;th&gt;When Applied&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;MLA&lt;/td&gt;
&lt;td&gt;Model architecture&lt;/td&gt;
&lt;td&gt;Training time&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PagedAttention&lt;/td&gt;
&lt;td&gt;Serving layer&lt;/td&gt;
&lt;td&gt;Any model&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Prefix caching&lt;/td&gt;
&lt;td&gt;Serving layer&lt;/td&gt;
&lt;td&gt;Any model&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Self-Assessment: 3.8/5
&lt;/h2&gt;

&lt;p&gt;Core concepts make sense. Need another pass to solidify the memory formula and deepen understanding of PagedAttention internals.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>inference</category>
      <category>optimization</category>
    </item>
    <item>
      <title>AWS ECS Deployment — Review</title>
      <dc:creator>seah-js</dc:creator>
      <pubDate>Fri, 06 Feb 2026 01:28:13 +0000</pubDate>
      <link>https://dev.to/seahjs/aws-ecs-deployment-review-53ci</link>
      <guid>https://dev.to/seahjs/aws-ecs-deployment-review-53ci</guid>
      <description>&lt;p&gt;Review session with &lt;a href="https://github.com/openclaw/openclaw" rel="noopener noreferrer"&gt;Klover&lt;/a&gt;. Testing recall on AWS ECS deployment concepts from yesterday's session.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Recall Test
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Walk me through the flow from &lt;code&gt;git push&lt;/code&gt; to a user hitting your API. What are the key AWS services involved?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; After git push, if you set up CI/CD, the image will be pushed to AWS ECR, then deployed via ECS. After that I don't remember well. I know there's Fargate where you configure system requirements, ALB which routes containers to speak to one another, and some kind of task system.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Good start — you've got the big picture. But let's untangle the ECS side because you're mixing a few things.&lt;/p&gt;




&lt;h2&gt;
  
  
  Task Definition vs Task
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; You mentioned "task system or something." What do you think a Task Definition actually is? Think blueprint vs running thing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; Task Definition is the blueprint for the actual running of the containers? What settings for this service and everything like that.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Yes. Task Definition = blueprint. It defines which Docker image, how much CPU/memory, environment variables, port mappings, secrets. The actual running instance of that blueprint is called a &lt;strong&gt;Task&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; So is it accurate to say the Task is the running of the container, with configurations defined in the Task Definition?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Exactly. Same relationship as a Docker image vs a running Docker container.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Service — What I Forgot
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; What does an ECS Service do?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; I don't know.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; What happens if your Task crashes? It's just dead. Nobody restarts it. That's what a &lt;strong&gt;Service&lt;/strong&gt; does — it says "I want 3 Tasks running at all times." If one dies, the Service launches a replacement. It also handles rolling deployments.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The hierarchy:&lt;/strong&gt; Task Definition (blueprint) → Service (manager) → Tasks (running containers)&lt;/p&gt;




&lt;h2&gt;
  
  
  Crash Loop — My Wrong Assumption
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; If you push a broken image that crashes on startup, what happens?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; It will retry, and if it doesn't work, it will revert to the previous successful task definition.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Close but not quite — it's actually dumber than that. The Service sees the crash, launches a new one from the &lt;strong&gt;same broken definition&lt;/strong&gt;. That crashes too. It tries again. And again. This is a &lt;strong&gt;crash loop&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;It does NOT auto-rollback by default. The Service only knows "maintain desired count." Unless you've enabled ECS's optional deployment circuit breaker (which can detect a failing deployment and roll it back), you have to intervene manually.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mitigation:&lt;/strong&gt; Rolling deployments with health checks. New tasks must pass health checks before old ones are killed. Doesn't "revert" — just stops the rollout.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key takeaway:&lt;/strong&gt; The Service is persistent but not smart.&lt;/p&gt;




&lt;h2&gt;
  
  
  ALB — What I Got Wrong
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; ALB routes containers to speak to one another.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Not quite. ALB sits between &lt;strong&gt;users and your containers&lt;/strong&gt;. Users hit the ALB, ALB routes traffic to healthy Tasks. Two reasons you need this middleman:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Tasks are ephemeral — they get replaced, scale up/down, IPs change&lt;/li&gt;
&lt;li&gt;ALB gives you a &lt;strong&gt;stable endpoint&lt;/strong&gt; (one URL) that always works&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;There's a &lt;strong&gt;Target Group&lt;/strong&gt; between ALB and Tasks — a registry of currently healthy tasks that ALB routes to.&lt;/p&gt;




&lt;h2&gt;
  
  
  Serverless ≠ No Servers
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; Regarding Fargate, I don't understand what "serverless" means. Isn't EC2 also using AWS infra?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; "Serverless" is confusing because obviously there ARE servers. The difference is who manages them:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;EC2:&lt;/strong&gt; AWS gives you a VM. You manage capacity, patches, disk space, scaling.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fargate:&lt;/strong&gt; You never see a machine. Just say "I need 0.5 vCPU and 1GB RAM." AWS figures out the rest.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Think of it like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;EC2 = renting an apartment (you fix the plumbing)&lt;/li&gt;
&lt;li&gt;Fargate = hotel room (they handle everything)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Both use AWS infra. "Serverless" means you don't manage servers — not that servers don't exist.&lt;/p&gt;




&lt;h2&gt;
  
  
  Self-Assessment: 2.7/5
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Solid on:&lt;/strong&gt; ECR → ECS pipeline, Task Definition = blueprint, Fargate = serverless.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fuzzy on:&lt;/strong&gt; ALB/Target Groups, Services, crash loop behavior, serverless meaning.&lt;/p&gt;

&lt;p&gt;Staying at exposure. Review again tomorrow.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>ecs</category>
      <category>devops</category>
      <category>docker</category>
    </item>
    <item>
      <title>AWS ECS Deployment — From Git Push to Running Containers</title>
      <dc:creator>seah-js</dc:creator>
      <pubDate>Thu, 05 Feb 2026 01:52:56 +0000</pubDate>
      <link>https://dev.to/seahjs/aws-ecs-deployment-from-git-push-to-running-containers-3of</link>
      <guid>https://dev.to/seahjs/aws-ecs-deployment-from-git-push-to-running-containers-3of</guid>
      <description>&lt;p&gt;Learning session with &lt;a href="https://github.com/openclaw/openclaw" rel="noopener noreferrer"&gt;Klover&lt;/a&gt;. Today: deploying FastAPI + Streamlit to AWS ECS. The goal is understanding the full flow from &lt;code&gt;git push&lt;/code&gt; to running containers.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Big Picture
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git push
   ↓
GitHub Actions (CI/CD)
   ↓
ECR (store images)
   ↓
ECS + Fargate (run containers)
   ↓
ALB (route traffic)
   ↓
Users
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two services in my case: FastAPI backend + Streamlit frontend. Each gets its own container.&lt;/p&gt;




&lt;h2&gt;
  
  
  ECS vs Fargate — What Actually Runs Containers?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; ECS is an &lt;em&gt;orchestrator&lt;/em&gt; — it doesn't run containers itself. What actually runs your containers?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; Maybe EC2?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; That's one option. The other is Fargate.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;EC2 mode:&lt;/strong&gt; You manage the instances, patching, scaling, capacity&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fargate mode:&lt;/strong&gt; Serverless. Just define CPU/memory, AWS handles the rest.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For someone who hates managing infrastructure, Fargate is the obvious choice.&lt;/p&gt;




&lt;h2&gt;
  
  
  ECR — Where Images Live
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Why not just use Docker Hub? Why does AWS have its own registry?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; Docker Hub is external to AWS. ECS runs within AWS, so having ECR inside AWS makes sense.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Exactly. Being inside AWS means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Faster pulls&lt;/strong&gt; — same network, low latency&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;IAM integration&lt;/strong&gt; — no separate credentials&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No egress costs&lt;/strong&gt; — pulling from Docker Hub = paying for data transfer&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;ECR = AWS's private Docker Hub.&lt;/p&gt;




&lt;h2&gt;
  
  
  Task Definitions — The Blueprint
&lt;/h2&gt;

&lt;p&gt;A Task Definition tells ECS everything it needs to run your container:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Image&lt;/strong&gt; — which ECR image to pull&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CPU/Memory&lt;/strong&gt; — resources allocated&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Port mappings&lt;/strong&gt; — which ports to expose&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Environment variables&lt;/strong&gt; — config, secrets&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Logging&lt;/strong&gt; — where to send logs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;IAM role&lt;/strong&gt; — AWS permissions for the container&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Think of it like a docker-compose file but AWS-flavored.&lt;/p&gt;
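&lt;p&gt;A trimmed-down Task Definition looks roughly like this (account ID, names, and region are placeholders, and many optional fields are omitted):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "family": "fastapi-backend",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "512",
  "memory": "1024",
  "containerDefinitions": [{
    "name": "api",
    "image": "&lt;account&gt;.dkr.ecr.us-east-1.amazonaws.com/fastapi:latest",
    "portMappings": [{"containerPort": 8000}],
    "environment": [{"name": "ENV", "value": "prod"}],
    "logConfiguration": {"logDriver": "awslogs"}
  }]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;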




&lt;h2&gt;
  
  
  Services — Keeping Things Alive
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; A Task Definition is just a blueprint. To actually run it, you create a Service. Why the separation?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; So if one crashes, the Service can spin up a replacement automatically?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Exactly. Service maintains desired state. You say "keep 2 copies running" — if one dies, Service notices and replaces it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Service gives you:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Desired count (keep N copies running)&lt;/li&gt;
&lt;li&gt;Auto-restart on failure&lt;/li&gt;
&lt;li&gt;Connection to load balancer&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Hierarchy:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Task Definition (blueprint)
       ↓
Service (keeps N copies alive)
       ↓
Tasks (actual running containers)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  ALB + Target Groups — Routing Traffic
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; If you have 2 FastAPI tasks running, how does traffic get routed?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; Some kind of routing — the ALB?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Yes. ALB (Application Load Balancer) sits in front and distributes requests.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User request
     ↓
    ALB
   /   \
Task1  Task2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;ALB does:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Load balancing&lt;/strong&gt; — spreads traffic across healthy tasks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Health checks&lt;/strong&gt; — pings &lt;code&gt;/health&lt;/code&gt;, stops sending traffic to dead tasks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Path routing&lt;/strong&gt; — &lt;code&gt;/api/*&lt;/code&gt; → FastAPI, &lt;code&gt;/&lt;/code&gt; → Streamlit&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Why Target Groups?
&lt;/h3&gt;

&lt;p&gt;Tasks are ephemeral. Their IPs change constantly. ALB can't track "send traffic to 10.0.1.45" when that IP might be gone in 5 minutes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Target Group = stable reference to an ever-changing set of tasks.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;ECS auto-registers new tasks, auto-deregisters dead ones. ALB just points at the target group.&lt;/p&gt;




&lt;h2&gt;
  
  
  The CI/CD Flow
&lt;/h2&gt;

&lt;p&gt;When I push code:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;GitHub Actions&lt;/strong&gt; detects the push&lt;/li&gt;
&lt;li&gt;Workflow builds Docker image&lt;/li&gt;
&lt;li&gt;Pushes image to &lt;strong&gt;ECR&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Tells ECS to update the &lt;strong&gt;Service&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Service pulls new image, spins up new &lt;strong&gt;Tasks&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;New tasks register in &lt;strong&gt;Target Group&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ALB&lt;/strong&gt; routes traffic to healthy tasks&lt;/li&gt;
&lt;/ol&gt;
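&lt;p&gt;In GitHub Actions terms, the workflow looks roughly like this — a sketch using the official AWS actions, with service, cluster, and region names as placeholders:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;on:
  push:
    branches: [main]
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-east-1
      - id: ecr
        uses: aws-actions/amazon-ecr-login@v2
      - run: |
          docker build -t ${{ steps.ecr.outputs.registry }}/fastapi:${{ github.sha }} .
          docker push ${{ steps.ecr.outputs.registry }}/fastapi:${{ github.sha }}
      - uses: aws-actions/amazon-ecs-deploy-task-definition@v2
        with:
          task-definition: task-definition.json
          service: fastapi-service
          cluster: my-cluster
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;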

&lt;h3&gt;
  
  
  Why "update service"?
&lt;/h3&gt;

&lt;p&gt;Pushing a new image to ECR just updates storage. Running tasks don't know — they're already running the old image.&lt;/p&gt;

&lt;p&gt;"Update ECS service" triggers a rolling deployment:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Pull latest image&lt;/li&gt;
&lt;li&gt;Spin up new tasks&lt;/li&gt;
&lt;li&gt;Wait for health checks&lt;/li&gt;
&lt;li&gt;Drain traffic from old tasks&lt;/li&gt;
&lt;li&gt;Kill old tasks&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;No downtime.&lt;/p&gt;




&lt;h2&gt;
  
  
  Secrets Management
&lt;/h2&gt;

&lt;p&gt;Two different places for secrets:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub Secrets&lt;/strong&gt; — for CI/CD&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AWS credentials so GitHub Actions can push to ECR&lt;/li&gt;
&lt;li&gt;Used &lt;em&gt;during&lt;/em&gt; build/deploy&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;AWS Secrets Manager / SSM&lt;/strong&gt; — for runtime&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;DB passwords, API keys the app actually uses&lt;/li&gt;
&lt;li&gt;Injected as environment variables when container starts&lt;/li&gt;
&lt;li&gt;Never in image, never in git&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Quick Reference
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Concept&lt;/th&gt;
&lt;th&gt;What it does&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;ECR&lt;/td&gt;
&lt;td&gt;Stores Docker images&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ECS&lt;/td&gt;
&lt;td&gt;Orchestrates containers&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fargate&lt;/td&gt;
&lt;td&gt;Serverless compute for containers&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Task Definition&lt;/td&gt;
&lt;td&gt;Blueprint (image, CPU, memory, ports, secrets)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Service&lt;/td&gt;
&lt;td&gt;Keeps N tasks running, auto-restart&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Task&lt;/td&gt;
&lt;td&gt;Actual running container&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ALB&lt;/td&gt;
&lt;td&gt;Routes traffic, health checks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Target Group&lt;/td&gt;
&lt;td&gt;Stable reference to dynamic tasks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GitHub Actions&lt;/td&gt;
&lt;td&gt;CI/CD automation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Secrets Manager/SSM&lt;/td&gt;
&lt;td&gt;Runtime secrets&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  The Gotcha I Hit
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Klover's quiz:&lt;/strong&gt; Fill in the blanks for the CI/CD flow.&lt;/p&gt;

&lt;p&gt;I said the workflow pushes images to &lt;em&gt;ECS&lt;/em&gt;. Wrong — it's &lt;em&gt;ECR&lt;/em&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;EC&lt;strong&gt;R&lt;/strong&gt; = &lt;strong&gt;R&lt;/strong&gt;egistry (storage)&lt;/li&gt;
&lt;li&gt;EC&lt;strong&gt;S&lt;/strong&gt; = &lt;strong&gt;S&lt;/strong&gt;ervice (orchestration)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Easy to mix up. Now I won't.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>docker</category>
      <category>cicd</category>
      <category>devops</category>
    </item>
    <item>
      <title>Race Conditions, Asyncio Locks &amp; Concurrency Patterns</title>
      <dc:creator>seah-js</dc:creator>
      <pubDate>Wed, 04 Feb 2026 13:57:17 +0000</pubDate>
      <link>https://dev.to/seahjs/race-conditions-asyncio-locks-concurrency-patterns-4fmo</link>
      <guid>https://dev.to/seahjs/race-conditions-asyncio-locks-concurrency-patterns-4fmo</guid>
      <description>&lt;p&gt;This is a Socratic-style learning session between me and &lt;a href="https://github.com/openclaw/openclaw" rel="noopener noreferrer"&gt;Klover&lt;/a&gt;, my AI assistant. Instead of reading docs or watching tutorials, I learn by explaining concepts in my own words and getting challenged on the gaps. The conversation format is the point — it captures the actual "aha" moments as they happen.&lt;/p&gt;

&lt;p&gt;In this session, I hit a real bug in an AWS pipeline and used it to understand race conditions, asyncio locks, thread pools, and what "concurrent" actually means in single-threaded async code.&lt;/p&gt;




&lt;h2&gt;
  
  
  The bug
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; I've got a pipeline on AWS that processes invoices. There's a shared manifest file in S3 that tracks processing status. Multiple async LLM API calls run at once, and each one updates the manifest after it finishes. Problem: if two LLM calls return at nearly the same time, they both read the old manifest, both write their update, and one overwrites the other. Lost data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Classic &lt;strong&gt;read-modify-write race condition&lt;/strong&gt;. Same fundamental pattern as why Python needs the GIL for reference counting — two things reading the same value, modifying independently, then writing back. One update gets stomped.&lt;/p&gt;

&lt;p&gt;The key issue: S3 doesn't lock files for you. There's no built-in mechanism that says "someone else is editing this, wait your turn." It's just object storage — you read bytes, you write bytes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; But this is single-threaded async. I thought race conditions were a multithreading thing?&lt;/p&gt;

&lt;h2&gt;
  
  
  Why single-thread async still has races
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Common misconception. The shared state here isn't in-memory — it's external, sitting in S3. Single thread protects in-memory variables (only one line of Python runs at a time), but your event loop switches tasks at every &lt;code&gt;await&lt;/code&gt; point.&lt;/p&gt;

&lt;p&gt;Here's what happens:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Task A finishes its LLM call, reads manifest from S3 (&lt;code&gt;await s3_read&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;At that &lt;code&gt;await&lt;/code&gt;, the event loop can switch. Task B also finishes, reads the &lt;strong&gt;same old manifest&lt;/strong&gt; from S3&lt;/li&gt;
&lt;li&gt;Task A writes its update to S3&lt;/li&gt;
&lt;li&gt;Task B writes its update to S3 — &lt;strong&gt;overwriting A's changes&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Single thread doesn't help because the "shared state" isn't protected by the thread. It's a file on a remote server.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; So even though only one line of code runs at a time, the damage happens because both tasks saw a stale version of the manifest before either wrote back.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Exactly. The race window is between the read and the write. Any &lt;code&gt;await&lt;/code&gt; in that window is an opportunity for another task to slip in.&lt;/p&gt;
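&lt;p&gt;The lost update is easy to reproduce in miniature — the &lt;code&gt;fake_s3_*&lt;/code&gt; helpers below just simulate a remote object with an &lt;code&gt;await&lt;/code&gt; inside:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import asyncio

remote_manifest = {"done": []}  # stands in for the S3 object

async def fake_s3_read():
    await asyncio.sleep(0)      # any await lets other tasks run
    return dict(remote_manifest)

async def fake_s3_write(data):
    await asyncio.sleep(0)
    remote_manifest.clear()
    remote_manifest.update(data)

async def finish_invoice(invoice_id):
    m = await fake_s3_read()             # both tasks read the same stale copy
    m["done"] = m["done"] + [invoice_id]
    await fake_s3_write(m)               # the later write stomps the earlier one

async def main():
    await asyncio.gather(finish_invoice("a"), finish_invoice("b"))

asyncio.run(main())
print(remote_manifest["done"])  # only one id survives — the other update is lost
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Run it a few times: one update is consistently lost.&lt;/p&gt;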

&lt;h2&gt;
  
  
  The naive fix vs the right fix
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; Would making the S3 reads and writes synchronous fix it?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Technically yes — if everything is sync, there's no &lt;code&gt;await&lt;/code&gt; for the event loop to switch at. But you'd kill all your concurrency. Your LLM calls, which are pure I/O waiting, would run one at a time. For a batch of invoices, that's brutal.&lt;/p&gt;

&lt;p&gt;Better approach: &lt;strong&gt;&lt;code&gt;asyncio.Lock()&lt;/code&gt;&lt;/strong&gt;. Keep the LLM calls concurrent, but serialize just the manifest updates.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; What is asyncio.Lock exactly?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Same concept as any lock — only one task can hold it at a time. The difference from &lt;code&gt;threading.Lock&lt;/code&gt; is that it's async-friendly. When a task is waiting to acquire the lock, it &lt;strong&gt;yields back to the event loop&lt;/strong&gt; instead of blocking the thread. Other tasks keep running while you wait.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;manifest_lock&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Lock&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;update_manifest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;invoice_result&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;manifest_lock&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# only one task in here at a time
&lt;/span&gt;        &lt;span class="n"&gt;manifest&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;read_manifest_from_s3&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;manifest&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;invoice_result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;invoice_result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;write_manifest_to_s3&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;manifest&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; So all the S3 read/writes for the manifest queue up one at a time?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Yes. Task A grabs the lock → reads → updates → writes → releases. Task B was waiting → grabs the lock → reads the manifest &lt;strong&gt;with A's update already in it&lt;/strong&gt; → updates → writes. No lost updates. Meanwhile, all your LLM API calls are still running concurrently outside the lock.&lt;/p&gt;

&lt;h2&gt;
  
  
  Single-thread async for I/O pipelines
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; Does single-thread async actually work well for an I/O-heavy pipeline like this? Feels like threading would be more powerful.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Single-thread async is &lt;em&gt;ideal&lt;/em&gt; for I/O-heavy work. You're mostly waiting on network responses — LLM APIs, S3 reads/writes. One thread juggling thousands of I/O waits uses almost no memory and has zero thread-switching overhead.&lt;/p&gt;

&lt;p&gt;It only breaks down when you have CPU-heavy steps in the pipeline — like parsing giant PDFs or running local models. The event loop can't switch during CPU work since there's no &lt;code&gt;await&lt;/code&gt; to pause at.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; So what if my pipeline has a mix — some I/O calls and some heavy parsing?&lt;/p&gt;

&lt;h2&gt;
  
  
  Threading for CPU-bound work
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; For mixed workloads, you offload the CPU parts to a &lt;strong&gt;thread pool&lt;/strong&gt; using &lt;code&gt;run_in_executor&lt;/code&gt;. The async event loop stays responsive for I/O, and CPU work runs on separate threads.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; What's a thread pool?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Pre-created threads sitting idle, waiting for work. Think of a taxi stand — the taxis are already there, engines running. When a job comes in, a taxi picks it up immediately instead of having to spawn a brand new car from scratch.&lt;/p&gt;

&lt;p&gt;Python gives you a default one:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;loop&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run_in_executor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cpu_heavy_fn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; Break that down for me.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;loop&lt;/code&gt; — the event loop (your async scheduler)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;run_in_executor&lt;/code&gt; — hand this work off to a thread pool&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;None&lt;/code&gt; — use Python's default thread pool&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;cpu_heavy_fn&lt;/code&gt; — the function to run on a thread&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;data&lt;/code&gt; — argument passed to that function&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;await&lt;/code&gt; — pause here, let the event loop do other stuff, resume when the thread finishes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You stay in async-land. The CPU work happens on a thread without blocking your event loop. (On Python 3.9+, &lt;code&gt;asyncio.to_thread(cpu_heavy_fn, data)&lt;/code&gt; is a convenient shorthand for the same thing.)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; Got it. But for my invoice pipeline where it's all API calls and S3 — just async with the lock is enough. No need for threads.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Correct. Don't add threading complexity when you don't need it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Sequential initiation vs concurrent waiting
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; Let me make sure I understand the full picture. For batch invoice processing: the event loop fires off all the LLM calls, they wait concurrently, and when each returns, it hits the asyncio lock to update the manifest one at a time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; That's the right mental model. One nuance — &lt;code&gt;asyncio.gather&lt;/code&gt; vs a &lt;code&gt;for&lt;/code&gt; loop with &lt;code&gt;await&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# concurrent waiting — all tasks run at the same time
&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;gather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nf"&gt;process_invoice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inv1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nf"&gt;process_invoice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inv2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nf"&gt;process_invoice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inv3&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# sequential — each one finishes before the next starts
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;inv&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;invoices&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;process_invoice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inv&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With &lt;code&gt;gather&lt;/code&gt;, all three tasks are kicked off and wait concurrently. With &lt;code&gt;for + await&lt;/code&gt;, you wait for each one to finish before starting the next.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; Wait — even with &lt;code&gt;gather&lt;/code&gt;, the event loop is single-threaded. So it still &lt;em&gt;initiates&lt;/em&gt; the tasks one by one, right? It's the &lt;strong&gt;waiting&lt;/strong&gt; that's concurrent, not the initiation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; That's a sharp catch, and you're right. I should be more precise. The event loop is single-threaded, so it does start each coroutine sequentially. But initiation is near-instant — it's just setting up the task and sending the network request. The heavy part is the waiting, and that's fully concurrent. So practically:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Initiation&lt;/strong&gt; — sequential, but microseconds per task&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Waiting&lt;/strong&gt; — truly concurrent, all tasks in-flight at the same time&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Manifest updates&lt;/strong&gt; — serialized by the lock&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For 100 invoices, you're not waiting 100× the LLM latency. You're waiting roughly 1× (plus lock overhead for manifest writes).&lt;/p&gt;
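You can see that "roughly 1×" claim directly by timing both patterns. `fake_invoice_call` below is a stand-in for the real LLM call, using `asyncio.sleep` to simulate the I/O wait:

```python
import asyncio
import time

async def fake_invoice_call(inv):
    # stand-in for a slow I/O call (e.g. an LLM request)
    await asyncio.sleep(0.1)
    return f"processed {inv}"

async def main():
    invoices = ["inv1", "inv2", "inv3"]

    # concurrent: all three waits overlap
    start = time.perf_counter()
    await asyncio.gather(*(fake_invoice_call(i) for i in invoices))
    concurrent = time.perf_counter() - start

    # sequential: each wait finishes before the next starts
    start = time.perf_counter()
    for inv in invoices:
        await fake_invoice_call(inv)
    sequential = time.perf_counter() - start

    # concurrent ≈ one sleep's worth (~0.1s); sequential ≈ three (~0.3s)
    print(f"gather: {concurrent:.2f}s, loop: {sequential:.2f}s")

asyncio.run(main())
```

Three tasks, but the `gather` version takes roughly as long as one.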

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; That's the bit that makes async powerful for this use case. Cool — I think I've got it.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Impromptu session: Feb 4, 2026. Practical async patterns for real-world pipelines — race conditions, locks, and concurrency.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>async</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Async &amp; Sync — Review</title>
      <dc:creator>seah-js</dc:creator>
      <pubDate>Wed, 04 Feb 2026 11:36:43 +0000</pubDate>
      <link>https://dev.to/seahjs/async-sync-review-4fim</link>
      <guid>https://dev.to/seahjs/async-sync-review-4fim</guid>
      <description>&lt;p&gt;This is a review session between me and &lt;a href="https://github.com/openclaw/openclaw" rel="noopener noreferrer"&gt;Klover&lt;/a&gt;, my AI assistant. Yesterday we covered async vs sync from scratch. Today Klover tests what I actually retained — then pushes deeper where the gaps show up.&lt;/p&gt;




&lt;h2&gt;
  
  
  Recall: I/O-bound vs CPU-bound
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Let's start simple. What's the difference between I/O-bound and CPU-bound work, and why does async only help one of them?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; I/O-bound is when you're calling external services like APIs. CPU-bound is heavy computation. Async helps I/O-bound because the thread can move on to other work while waiting for the external thing to come back.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Good — you've got the core idea. Small addition though: I/O isn't just APIs. It's anything where the CPU is idle and waiting — disk reads, database queries, network calls, file writes. All I/O. Now explain the event loop. And what happens if you put CPU-heavy code inside one?&lt;/p&gt;

&lt;h2&gt;
  
  
  Event loop + blocking
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; The event loop delegates work to different tasks. But I'm not sure what you mean by CPU-heavy code inside an event loop?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Remember the chef analogy? One chef, single kitchen. If one dish requires 10 minutes of non-stop chopping — what happens to every other dish?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; They're on hold. Nothing else gets worked on.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Exactly. Everything blocks. The event loop is single-threaded — one task hogging the CPU means every other task just sits in the queue waiting. That's why CPU-bound work inside an event loop is dangerous. Now, different topic — what's Python's GIL?&lt;/p&gt;
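Here's a minimal demonstration of that blocking. The heartbeat task should tick every 50ms, but once `crunch()` starts, the loop's only thread is occupied and the remaining ticks are frozen until it returns:

```python
import asyncio

def crunch():
    # CPU-bound: holds the loop's only thread, never hits an await
    return sum(range(10_000_000))

async def heartbeat():
    for _ in range(3):
        print("tick")
        await asyncio.sleep(0.05)

async def main():
    hb = asyncio.create_task(heartbeat())
    await asyncio.sleep(0)   # yield once so the heartbeat gets to start
    crunch()                 # every other task is stuck until this returns
    await hb                 # remaining ticks only happen after crunch() is done

asyncio.run(main())
```

The standard escape hatches are `asyncio.to_thread` for blocking I/O calls, and a `ProcessPoolExecutor` for CPU-bound Python, since a thread alone won't buy true parallelism (more on that next).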

&lt;h2&gt;
  
  
  The GIL
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; The GIL only allows a single thread to execute Python code at a time. The workaround is multiprocessing — each process gets its own GIL, so you can actually run things in parallel.&lt;/p&gt;


&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Good recall. But &lt;em&gt;why&lt;/em&gt; does the GIL exist? What problem is it solving?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; It prevents memory leakage. And it stops the same thing from being worked on by multiple threads at the same time, which would cause conflicts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Close, but it's not memory leakage exactly. Python uses &lt;strong&gt;reference counting&lt;/strong&gt; to manage memory — every object has a counter tracking how many things point to it. When the count hits zero, the object gets cleaned up. The GIL exists because if two threads modify that reference count at the same time, you get a &lt;strong&gt;race condition&lt;/strong&gt;. The count gets corrupted.&lt;/p&gt;

&lt;h2&gt;
  
  
  Race conditions — step by step
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; I get that two threads shouldn't touch the same thing, but how does the count actually get corrupted?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Walk through it. Say an object's reference count is 3.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Thread A&lt;/strong&gt; reads the count: sees &lt;strong&gt;3&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Thread B&lt;/strong&gt; also reads the count: sees &lt;strong&gt;3&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Thread A&lt;/strong&gt; decrements and writes: &lt;strong&gt;2&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Thread B&lt;/strong&gt; decrements and writes: &lt;strong&gt;2&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Final count: &lt;strong&gt;2&lt;/strong&gt;. But two references were removed — it should be &lt;strong&gt;1&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Both threads read the old value before either wrote. Thread B's write just overwrites Thread A's work. Now scale that up:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If the count never reaches 0 → the object never gets cleaned up → &lt;strong&gt;memory leak&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;If the count hits 0 too early → the object gets freed while something still points to it → &lt;strong&gt;crash&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The GIL prevents this by making sure only one thread touches Python objects at a time. Brute force, but it works.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; Okay that makes sense now. Both threads reading the same stale value before either one writes — that's the race.&lt;/p&gt;

&lt;h2&gt;
  
  
  The async trap
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Last one. What's wrong with this?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_data&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.example.com/data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; &lt;code&gt;requests&lt;/code&gt; is synchronous. Even though the function is &lt;code&gt;async&lt;/code&gt;, the thread has to sit there and wait for the response. It's blocking.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Exactly. It never yields control back to the event loop. There's no pause point, so the event loop can't switch to other tasks. Your "async" function is secretly sync. The fix: use an async HTTP client like &lt;code&gt;httpx&lt;/code&gt; with &lt;code&gt;await&lt;/code&gt; — that &lt;code&gt;await&lt;/code&gt; is the pause/resume point that lets the event loop do its job.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_data&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;AsyncClient&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.example.com/data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;p&gt;&lt;em&gt;Review session: Feb 4, 2026. Exposure → Developing. Core concepts retained. Deeper understanding of GIL race conditions gained during review.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>async</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Multi-head Latent Attention (MLA) — Review</title>
      <dc:creator>seah-js</dc:creator>
      <pubDate>Wed, 04 Feb 2026 11:13:03 +0000</pubDate>
      <link>https://dev.to/seahjs/multi-head-latent-attention-mla-review-4jbm</link>
      <guid>https://dev.to/seahjs/multi-head-latent-attention-mla-review-4jbm</guid>
      <description>&lt;p&gt;This is a review session between me and &lt;a href="https://github.com/openclaw/openclaw" rel="noopener noreferrer"&gt;Klover&lt;/a&gt;, my AI assistant. The idea is simple — Klover quizzes me on something I've learned before, I explain it in my own words, and we go deeper on the parts that stick (or don't). This format captures the actual back-and-forth, including the mistakes.&lt;/p&gt;

&lt;p&gt;This time: Multi-head Latent Attention (MLA), the mechanism behind DeepSeek's efficient inference.&lt;/p&gt;

&lt;h2&gt;
  
  
  What does MLA solve?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Let's start with the basics. What problem does MLA solve?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; It saves on KV cache. Instead of storing the full key and value vectors for every token, MLA stores a smaller latent vector. That latent vector gets decoded back into keys and values during inference when you actually need them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Good — you've got the core idea. Two follow-ups. First: how does that compression actually work, mechanically? And second: if the goal is to reduce memory, why not just use fewer attention heads?&lt;/p&gt;

&lt;h2&gt;
  
  
  Compression mechanism and why not fewer heads
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; The compression is matrix multiplication. You encode the full KV into a smaller latent space, then decode it back when needed. As for fewer heads — different heads attend to different aspects of the input. If you just cut heads, you lose those different perspectives entirely. MLA preserves the multi-head relationships but stores everything compactly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Right. The key detail: MLA uses &lt;strong&gt;learned linear projections&lt;/strong&gt;. There's a down-projection matrix (W_c) that compresses the KV into the latent vector, and up-projection matrices (W_uk, W_uv) that reconstruct keys and values per head during attention.&lt;/p&gt;

&lt;p&gt;The distinction that matters — this is a &lt;strong&gt;learned&lt;/strong&gt; compression, not a post-hoc one. The model learns &lt;em&gt;how&lt;/em&gt; to compress during training, so the latent vector captures what actually matters for attention. That's fundamentally different from just truncating or dropping heads after the fact.&lt;/p&gt;
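The shapes make this concrete. Below is a sketch with random (untrained) matrices and toy dimensions chosen for illustration, not DeepSeek's real sizes — the point is what gets cached versus what gets reconstructed:

```python
import numpy as np

# Toy dimensions (assumptions for illustration only)
d_model, d_latent, n_heads, d_head = 512, 64, 8, 64

rng = np.random.default_rng(0)
W_c  = rng.standard_normal((d_model, d_latent)) * 0.02            # down-projection
W_uk = rng.standard_normal((d_latent, n_heads * d_head)) * 0.02   # up-projection (keys)
W_uv = rng.standard_normal((d_latent, n_heads * d_head)) * 0.02   # up-projection (values)

h = rng.standard_normal((1, d_model))   # one token's hidden state

c = h @ W_c        # this is what gets cached: 64 floats per token
k = c @ W_uk       # keys reconstructed (all heads) at attention time
v = c @ W_uv       # values likewise

full_kv = 2 * n_heads * d_head          # 1024 floats/token without MLA
print(c.size, full_kv)                  # 64 vs 1024 → 16x smaller cache
```

In a real model W_c, W_uk, W_uv are learned during training, which is exactly why the latent keeps what attention needs.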

&lt;h2&gt;
  
  
  Memory vs compute
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Next question: does MLA save compute during training?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; No, it's actually more compute. You have extra encode and decode steps — the compression and decompression aren't free.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Correct. The savings are about &lt;strong&gt;memory&lt;/strong&gt;, not compute. During inference, the KV cache is the main bottleneck — it grows linearly with sequence length and batch size, and it's what limits how many tokens you can process or how many users you can serve. MLA shrinks that cache dramatically.&lt;/p&gt;
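A back-of-envelope calculation shows why the cache dominates. All hyperparameters below are assumed round numbers, not any particular model's:

```python
# Assumed hyperparameters (fp16, 2 bytes per value)
layers, heads, d_head = 32, 32, 128
seq_len, batch, bytes_per = 8192, 8, 2

# Full KV cache: 2 vectors (K and V) per layer per head per token
full_kv = 2 * layers * heads * d_head * seq_len * batch * bytes_per

# MLA: one latent vector per layer per token (latent dim assumed = 512)
d_latent = 512
mla_cache = layers * d_latent * seq_len * batch * bytes_per

print(f"full KV: {full_kv / 1e9:.1f} GB")   # ~34.4 GB
print(f"MLA:     {mla_cache / 1e9:.1f} GB") # ~2.1 GB
```

Tens of gigabytes just for the cache, before the model weights are even counted — and it scales linearly with both `seq_len` and `batch`.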

&lt;h2&gt;
  
  
  Pushback on training memory
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; Wait — isn't the KV cache during training a win too? You still have KV stored in memory during the forward pass.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Good catch, actually. During training, the activations from the forward pass need to be stored for the backward pass. If you store the latent vectors instead of the full KV, that's less activation memory — similar in spirit to gradient checkpointing. So yes, there is a training memory benefit.&lt;/p&gt;

&lt;p&gt;But the relative win is much smaller than inference. During training, you have a fixed batch and sequence length, and the activation memory is just one part of the total memory budget (model parameters, optimizer states, gradients all compete for space). During inference, the KV cache is often &lt;em&gt;the&lt;/em&gt; dominant memory cost, especially for long sequences. That's where MLA really shines.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's the risk?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Last one. What's the risk of this approach?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; It's lossy compression. You're squeezing KV into a lower-dimensional space, so some information gets lost. Attention quality could drop.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Exactly. The latent dimension is a tuning knob — smaller means more compression and less memory, but more information loss. Too aggressive and your attention patterns degrade. The art is finding the sweet spot where you get major memory savings without meaningful quality loss.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Review session: Feb 4, 2026. Exposure → Developing. All core concepts retained, good pushback on training memory savings.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>architecture</category>
      <category>deeplearning</category>
      <category>beginners</category>
    </item>
    <item>
      <title>ReAct Pattern</title>
      <dc:creator>seah-js</dc:creator>
      <pubDate>Wed, 04 Feb 2026 02:35:56 +0000</pubDate>
      <link>https://dev.to/seahjs/react-pattern-38d1</link>
      <guid>https://dev.to/seahjs/react-pattern-38d1</guid>
      <description>&lt;h2&gt;
  
  
  What is ReAct?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; ReAct stands for &lt;strong&gt;Reasoning + Acting&lt;/strong&gt; — explicitly interleaved. The idea is simple: instead of just having an LLM answer a question in one shot, you let it think out loud, take actions (like calling tools), observe the results, and then think again. It's a loop.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; so it's like chain-of-thought but with actual tool use mixed in?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Exactly. The loop looks like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Thought&lt;/strong&gt; — the model reasons about what to do next&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Action&lt;/strong&gt; — the model calls a tool (search, calculator, API, etc.)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Observation&lt;/strong&gt; — the system injects the real result from that tool&lt;/li&gt;
&lt;li&gt;Repeat until the model decides it has enough info → &lt;strong&gt;Final Answer&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The key word is "interleaved." It's not "think everything through then act" — it's think a little, act a little, observe, adjust.&lt;/p&gt;
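The loop fits in a few lines. Everything below is a stand-in: `SCRIPT` replaces a real LLM, `TOOLS` is a hypothetical registry, and the regex format is just one possible convention — but the shape (generate, intercept, execute, inject, repeat) is the real pattern:

```python
import re

# Stand-in "model": scripted outputs instead of real LLM calls
SCRIPT = [
    'Thought: I need the weather.\nAction: weather["Singapore"]',
    'Thought: Sunny and hot. I have enough.\nFinal Answer: Go early, it is sunny.',
]

TOOLS = {"weather": lambda q: "32°C, sunny"}   # hypothetical tool registry

def call_model(history, step):
    return SCRIPT[step]   # a real system would send `history` to an LLM here

def react_loop(question, max_steps=5):        # max_steps guards against loops
    history = f"Question: {question}"
    for step in range(max_steps):
        out = call_model(history, step)
        history += "\n" + out
        match = re.search(r'Action: (\w+)\["(.+)"\]', out)
        if match:  # intercept the tool call, run it, inject the result
            name, arg = match.groups()
            history += f"\nObservation: {TOOLS[name](arg)}"
        elif "Final Answer:" in out:
            return out.split("Final Answer:")[1].strip()
    return "Stopped at max_steps."   # forced answer: the infinite-loop guard

print(react_loop("Should I go to the beach?"))
```

Note the `max_steps` cutoff — that detail matters later.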

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; how does the system know when the model wants to call a tool vs just talking?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Stop sequences. The model is trained (or prompted) to output in a specific format — like &lt;code&gt;Action: search["weather in Singapore"]&lt;/code&gt;. The system watches for that pattern, intercepts it, actually runs the tool, then injects the result back as &lt;code&gt;Observation: 32°C, sunny&lt;/code&gt;. The model never actually executes anything. It just writes text that looks like a tool call, and the system does the rest.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; so the LLM doesn't know it's calling real tools? it just thinks it's generating text?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Right. From the model's perspective, it's just doing next-token prediction. The magic is in the scaffolding around it — the system that intercepts, executes, and injects.&lt;/p&gt;

&lt;h2&gt;
  
  
  A practical example
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Say you ask a ReAct agent: "Should I go to the beach this Saturday?"&lt;/p&gt;

&lt;p&gt;The agent might do:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Thought:&lt;/strong&gt; I need to check the weather for Saturday.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Action:&lt;/strong&gt; &lt;code&gt;weather_forecast("Saturday")&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Observation:&lt;/strong&gt; Saturday: 34°C, sunny, UV index 11, no rain.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Thought:&lt;/strong&gt; Weather looks great, but UV is very high. Let me check tide times too.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Action:&lt;/strong&gt; &lt;code&gt;tide_info("Saturday")&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Observation:&lt;/strong&gt; High tide at 2pm, low tide at 8am.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Thought:&lt;/strong&gt; Good beach weather, but go early to avoid peak UV and catch low tide. I have enough info now.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Final Answer:&lt;/strong&gt; Yes! Saturday looks perfect — sunny, 34°C. Go in the morning around 8-9am when the tide is low and UV isn't brutal yet.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Multiple steps, each one informed by the last. The model decides when it has enough.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; got it. so it's basically an LLM in a loop.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; That's the most concise way to put it. An LLM in a loop, with tool access and the ability to decide when to stop. That's what an "agent" is at its core.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; wait — are you a ReAct agent right now?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Pretty much. I reason about what you're asking, I can call tools (search, read files, run code), I observe the results, and I decide what to say next. You're watching the pattern in action.&lt;/p&gt;

&lt;h2&gt;
  
  
  Failure modes
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; what goes wrong with this?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Two big failure modes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;First: infinite loops.&lt;/strong&gt; The model might keep calling tools without converging on an answer. It searches for something, doesn't find what it wants, rephrases, searches again, rephrases again… forever. The fix is simple — set a &lt;strong&gt;max iteration limit&lt;/strong&gt;. After N steps, force it to give a final answer with whatever it has.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Second: context overflow.&lt;/strong&gt; Every Thought, Action, and Observation gets appended to the conversation. After 10-15 steps, that's a LOT of tokens. Eventually you hit the context window limit and the model just breaks. But the real problem starts before you even hit the limit.&lt;/p&gt;

&lt;h2&gt;
  
  
  Context degradation
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; what do you mean "before the limit"?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; The context window might be 128K tokens, but performance degrades long before you fill it. Think about how attention works — the softmax function spreads attention weight across all tokens. More tokens = each one gets a thinner slice of attention.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; like a pie with more slices.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Exactly. A pie cut into 10 slices vs 1000 slices. Each individual slice becomes almost meaningless. The model can still "see" all the tokens technically, but its ability to actually focus on any one of them gets diluted.&lt;/p&gt;

&lt;p&gt;And there's another effect on top of this — &lt;strong&gt;"lost in the middle."&lt;/strong&gt; Research shows models disproportionately attend to the beginning and end of context, and neglect the middle. So if your key reasoning happened in step 5 of a 15-step chain, it might as well not exist.&lt;/p&gt;
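The pie effect is easy to put in numbers. Real attention logits aren't uniform, but since the weights must sum to 1, the uniform case shows the best any "average" token can hope for:

```python
import math

def softmax(logits):
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# With equal logits, attention splits evenly — each token gets 1/n
for n in (10, 1000, 100_000):
    weight = softmax([0.0] * n)[0]
    print(f"{n:>6} tokens -> {weight:.6f} per token")
```

At 100K context, an average token's slice is five orders of magnitude thinner than at 10 tokens.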

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; so long ReAct chains get hit twice — diluted attention AND lost-in-the-middle?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Exactly. Double hit. Which is why real systems don't just naively append everything.&lt;/p&gt;

&lt;h2&gt;
  
  
  The fix: summarize between steps
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; The practical fix is to &lt;strong&gt;compress context between steps&lt;/strong&gt;. Instead of keeping the raw Thought/Action/Observation history, you periodically summarize it.&lt;/p&gt;

&lt;p&gt;Think of it as a sliding window — you keep the last few exchanges verbatim for detail, and everything before that gets compressed into a summary. LangChain has &lt;code&gt;ConversationSummaryBufferMemory&lt;/code&gt; that does exactly this: recent messages stay raw, older ones get summarized by an LLM.&lt;/p&gt;
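A minimal sketch of that sliding window, with `summarize` as a stand-in (a real system would call an LLM to write the summary):

```python
def summarize(summary, dropped):
    # stand-in summarizer: a real system would ask an LLM to compress `dropped`
    return summary + " | " + "; ".join(s[:30] for s in dropped)

class SummaryBuffer:
    def __init__(self, keep_last=3):
        self.keep_last = keep_last     # how many recent steps stay verbatim
        self.summary = "Summary:"
        self.recent = []

    def add(self, step):
        self.recent.append(step)
        if len(self.recent) > self.keep_last:
            dropped = self.recent[: -self.keep_last]       # oldest steps...
            self.recent = self.recent[-self.keep_last :]
            self.summary = summarize(self.summary, dropped)  # ...get compressed

    def context(self):
        # what the model actually sees: one summary line + raw recent steps
        return "\n".join([self.summary, *self.recent])

buf = SummaryBuffer(keep_last=2)
for i in range(5):
    buf.add(f"Thought/Action/Observation #{i}")
print(buf.context())
```

The context stays bounded no matter how long the chain runs; the cost is whatever the summarizer chose to throw away.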

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; but the summary is lossy right? the summarizer might drop something important.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Yes. That's the tradeoff. It's lossy compression. The summarizer might decide a detail was unimportant and drop it, but then 10 steps later that detail turns out to be critical. There's no perfect solution — it's a tradeoff between context freshness and information retention.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why the Thought step matters — CoT as computation
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; why does the explicit Thought step matter so much? couldn't the model just go straight to Action?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; This is one of the deeper insights. Chain-of-thought isn't just a prompting trick — it's &lt;strong&gt;extra computation&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Without CoT, the model gets one forward pass to produce an answer. That's it — the entire computation happens in the fixed depth of the transformer network. With CoT, every token the model generates becomes new context for the next token. Each generated token is the product of a full forward pass, and it feeds into the next one. You're effectively chaining multiple forward passes together.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; but isn't each forward pass independent? the only difference is it has new context from the previous tokens. it's not like the model "remembers" — it just sees more text.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; You're right about the mechanism — yes, each pass is independent, and the only connection is through the context. But think about the effect: you've turned a fixed-depth computation (one pass through N transformer layers) into a &lt;strong&gt;variable-depth computation&lt;/strong&gt; (multiple passes, each informed by the output of the last). The mechanism is context. The effect is more computation. You're chaining fixed-depth into variable-depth.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; so the extra computation is encapsulated in the generated tokens themselves.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Yes — each token is a compressed artifact of a full forward pass. All those layers of attention and feedforward processing get distilled into a single token, which then becomes context for the next pass. The output tokens are essentially &lt;strong&gt;working memory&lt;/strong&gt; — scratch space where the model does its intermediate computation. Without CoT, there's no scratch space. One shot, one pass.&lt;/p&gt;

&lt;h2&gt;
  
  
  ReAct vs other patterns
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; are there alternatives to ReAct?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Two main ones worth comparing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pure tool calling&lt;/strong&gt; — no explicit Thought step. The model just emits tool calls directly. Faster, fewer tokens, but the model doesn't reason about what to do before doing it. Works fine for simple tasks, falls apart for complex multi-step reasoning.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Plan-first&lt;/strong&gt; — the model creates a full plan upfront (step 1, step 2, step 3…), then executes it. Advantage: it can see dependencies and think ahead. "I need X before I can do Y." Disadvantage: plans are brittle. They don't survive contact with reality. What if step 2 returns unexpected data that invalidates step 3?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; so ReAct is more adaptive but short-sighted, and plan-first thinks ahead but breaks easily?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; That's a clean summary. ReAct is &lt;strong&gt;adaptive but myopic&lt;/strong&gt; — it only thinks one step ahead. Plan-first is &lt;strong&gt;strategic but brittle&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The practical sweet spot is a &lt;strong&gt;hybrid: plan and replan&lt;/strong&gt;. Make a rough plan, start executing, but be willing to replan when observations don't match expectations. This is essentially what frameworks like LangGraph enable — you define a graph of steps with conditional edges, so the agent can branch and loop based on what actually happens.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; so nobody actually uses pure ReAct in production?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Not in its textbook form, no. Real systems take the core insight — interleave reasoning with acting — and add structure around it. Planning, memory management, guardrails, fallbacks. But the ReAct loop is the foundation all of it builds on.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Session: Feb 4, 2026. First exposure — covered full ReAct pattern including failure modes, CoT mechanics, and agent pattern comparisons.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>machinelearning</category>
      <category>beginners</category>
    </item>
  </channel>
</rss>
