<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: seah-js</title>
    <description>The latest articles on DEV Community by seah-js (@seahjs).</description>
    <link>https://dev.to/seahjs</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3750615%2Fe5cdc7e4-9d32-42da-99e1-243fcf4ed66e.png</url>
      <title>DEV Community: seah-js</title>
      <link>https://dev.to/seahjs</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/seahjs"/>
    <language>en</language>
    <item>
      <title>Attention Is All You Need — Full Paper Breakdown</title>
      <dc:creator>seah-js</dc:creator>
      <pubDate>Sat, 07 Mar 2026 22:57:00 +0000</pubDate>
      <link>https://dev.to/seahjs/attention-is-all-you-need-full-paper-breakdown-2kf3</link>
      <guid>https://dev.to/seahjs/attention-is-all-you-need-full-paper-breakdown-2kf3</guid>
      <description>&lt;p&gt;The 2017 paper "Attention Is All You Need" by Vaswani et al. introduced the Transformer — the architecture behind GPT, Claude, Gemini, and every major LLM today. It replaced recurrent models entirely with attention mechanisms, and the field has never looked back.&lt;/p&gt;

&lt;p&gt;This post walks through the key ideas.&lt;/p&gt;

&lt;p&gt;

  &lt;iframe src="https://www.youtube.com/embed/C7YiPaUYo1k"&gt;
  &lt;/iframe&gt;


&lt;/p&gt;




&lt;h2&gt;
  
  
  The problem with RNNs
&lt;/h2&gt;

&lt;p&gt;Before Transformers, sequence modeling meant RNNs and LSTMs. These process tokens one at a time, left to right. That sequential dependency creates two problems:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;No parallelization&lt;/strong&gt; — each step depends on the previous hidden state, so you can't process tokens simultaneously during training&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Long-range dependencies decay&lt;/strong&gt; — by the time an RNN reaches token 500, the signal from token 1 has been compressed through hundreds of hidden states&lt;/li&gt;
&lt;/ol&gt;
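
&lt;p&gt;The contrast is easy to see in code. Below is a minimal NumPy sketch (toy dimensions and random weights, not a real model): the RNN update must loop token by token, while attention scores come from a single matrix product:&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)
seq = rng.normal(size=(500, 64))       # 500 tokens, 64-dim embeddings
Wh = rng.normal(size=(64, 64)) * 0.05  # toy recurrent weight matrix

# RNN-style: each hidden state depends on the previous one, so this
# loop cannot be parallelized across time steps
h = np.zeros(64)
for x in seq:
    h = np.tanh(x + h @ Wh)

# Attention-style: all pairwise token scores in one matrix product,
# computed at once on parallel hardware
scores = seq @ seq.T   # shape (500, 500)
print(scores.shape)
```
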

&lt;p&gt;Attention mechanisms existed before this paper (Bahdanau attention, 2014), but they were bolted onto RNNs. The radical idea here: &lt;strong&gt;what if attention is all you need?&lt;/strong&gt; Drop the recurrence entirely.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Encoder-Decoder architecture
&lt;/h2&gt;

&lt;p&gt;The Transformer follows the classic encoder-decoder structure used in machine translation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Encoder&lt;/strong&gt; (left side): Maps the input sequence to a rich contextual representation, using a stack of 6 identical layers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Decoder&lt;/strong&gt; (right side): Combines the encoder's output with the previously generated tokens to predict the next token, also through a stack of 6 layers.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each layer in both stacks has the same building blocks: multi-head attention, feed-forward networks, residual connections, and layer normalization.&lt;/p&gt;

&lt;h2&gt;
  
  
  Self-attention: the core mechanism
&lt;/h2&gt;

&lt;p&gt;Self-attention lets every token in a sequence look at every other token and decide how much to "pay attention" to it.&lt;/p&gt;

&lt;p&gt;For each token, the model computes three vectors:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Query (Q)&lt;/strong&gt; — "what am I looking for?"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Key (K)&lt;/strong&gt; — "what do I contain?"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Value (V)&lt;/strong&gt; — "what information do I provide?"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are produced by multiplying the input embeddings by learned weight matrices: 

&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;WQ,WK,WVW_Q, W_K, W_V &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;W&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;Q&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mpunct"&gt;,&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;W&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;K&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mpunct"&gt;,&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;W&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;V&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span 
class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
.&lt;/p&gt;
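
&lt;p&gt;In code these projections are plain matrix multiplications. A minimal NumPy sketch with toy random weights (the shapes follow the paper: d_model = 512, d_k = 64; real weights are learned during training):&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_k = 512, 64

X = rng.normal(size=(10, d_model))            # embeddings for 10 tokens
W_Q = rng.normal(size=(d_model, d_k)) * 0.02  # toy stand-ins for the
W_K = rng.normal(size=(d_model, d_k)) * 0.02  # learned weight matrices
W_V = rng.normal(size=(d_model, d_k)) * 0.02

# every token gets its own query, key, and value vector
Q, K, V = X @ W_Q, X @ W_K, X @ W_V           # each (10, 64)
print(Q.shape, K.shape, V.shape)
```
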

&lt;p&gt;The attention score between two tokens is the dot product of the query of one with the key of the other. High dot product = high relevance. The formula:&lt;/p&gt;


&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;Attention(Q,K,V)=softmax(QKTdk)V
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;Attention&lt;/span&gt;&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord mathnormal"&gt;Q&lt;/span&gt;&lt;span class="mpunct"&gt;,&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;K&lt;/span&gt;&lt;span class="mpunct"&gt;,&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;V&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;softmax&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="minner"&gt;&lt;span class="mopen delimcenter"&gt;&lt;span class="delimsizing size3"&gt;(&lt;/span&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mopen nulldelimiter"&gt;&lt;/span&gt;&lt;span class="mfrac"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord sqrt"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span class="svg-align"&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;d&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;k&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span 
class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="hide-tail"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="frac-line"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;Q&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;K&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;T&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose nulldelimiter"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose delimcenter"&gt;&lt;span class="delimsizing size3"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;V&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;


&lt;p&gt;The 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;dk\sqrt{d_k} &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord sqrt"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span class="svg-align"&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;d&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;k&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="hide-tail"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 scaling factor prevents the dot products from growing too large as dimensionality increases — without it, the softmax would produce extremely peaked distributions, effectively killing gradients.&lt;/p&gt;
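
&lt;p&gt;Putting the formula together, here is a minimal NumPy sketch of scaled dot-product attention (toy random inputs; a real implementation would also handle batching and masking):&lt;/p&gt;

```python
import numpy as np

def softmax(x, axis=-1):
    # subtract the row max for numerical stability
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    # pairwise query-key dot products, scaled by sqrt(d_k)
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V                  # weighted mix of value vectors

rng = np.random.default_rng(0)
Q = rng.normal(size=(5, 64))
K = rng.normal(size=(5, 64))
V = rng.normal(size=(5, 64))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (5, 64)
```
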
&lt;h2&gt;
  
  
  Multi-head attention
&lt;/h2&gt;

&lt;p&gt;Instead of computing attention once with the full dimensionality, the model splits Q, K, and V into multiple &lt;strong&gt;heads&lt;/strong&gt; (8 in the original paper). Each head operates on a smaller subspace (
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;dmodel/h=512/8=64d_{\text{model}} / h = 512 / 8 = 64 &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;d&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord text mtight"&gt;&lt;span class="mord mtight"&gt;model&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mord"&gt;/&lt;/span&gt;&lt;span class="mord mathnormal"&gt;h&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;512/8&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;64&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 dimensions per head).&lt;/p&gt;

&lt;p&gt;Why? Different heads can learn different types of relationships:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One head might focus on syntactic structure (subject-verb agreement)&lt;/li&gt;
&lt;li&gt;Another might capture positional proximity&lt;/li&gt;
&lt;li&gt;Another might track semantic similarity&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The outputs of all heads are concatenated and projected back to the full dimension.&lt;/p&gt;
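
&lt;p&gt;A minimal NumPy sketch of that split-and-merge bookkeeping (toy data; the per-head attention computation and the learned projection matrices are elided):&lt;/p&gt;

```python
import numpy as np

d_model, h = 512, 8
d_k = d_model // h                     # 64 dims per head
seq_len = 10

x = np.random.default_rng(1).normal(size=(seq_len, d_model))

# split the model dimension into h heads of d_k dims each
heads = x.reshape(seq_len, h, d_k).transpose(1, 0, 2)  # (h, seq_len, d_k)

# ... each head would run scaled dot-product attention here ...

# concatenate the heads back to the full model dimension
merged = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
print(merged.shape)  # (10, 512)
```

&lt;p&gt;In the full model each head also applies its own learned Q/K/V projections before attending, and the concatenated result passes through a final output projection.&lt;/p&gt;
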
&lt;h2&gt;
  
  
  Three types of attention in the Transformer
&lt;/h2&gt;

&lt;p&gt;The paper uses multi-head attention in three distinct ways:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Encoder self-attention&lt;/strong&gt; — every input token attends to every other input token&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Masked decoder self-attention&lt;/strong&gt; — each output token attends only to previous output tokens (the mask prevents looking ahead, preserving autoregressive generation)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross-attention&lt;/strong&gt; — decoder tokens attend to encoder outputs, connecting the input representation to the output generation&lt;/li&gt;
&lt;/ol&gt;
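
&lt;p&gt;The decoder mask in (2) is typically implemented by adding negative infinity to the forbidden positions before the softmax, so their attention weights become exactly zero. A minimal NumPy sketch with uniform toy scores:&lt;/p&gt;

```python
import numpy as np

seq_len = 5
# strict upper triangle is -inf: position i may only attend to
# positions j at or before i
mask = np.triu(np.full((seq_len, seq_len), -np.inf), k=1)

scores = np.zeros((seq_len, seq_len)) + mask   # toy: all raw scores equal
e = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = e / e.sum(axis=-1, keepdims=True)

# each row is uniform over its visible prefix; future positions get 0
print(np.round(weights, 2))
```
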
&lt;h2&gt;
  
  
  Positional encoding
&lt;/h2&gt;

&lt;p&gt;Self-attention has no inherent notion of order — it's a set operation. "The cat sat on the mat" and "mat the on sat cat the" would produce identical attention patterns without positional information.&lt;/p&gt;

&lt;p&gt;The paper adds &lt;strong&gt;positional encodings&lt;/strong&gt; using sine and cosine functions of different frequencies:&lt;/p&gt;


&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;PE(pos,2i)=sin⁡(pos100002i/dmodel)
PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;P&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;E&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mopen mtight"&gt;(&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;p&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;os&lt;/span&gt;&lt;span class="mpunct mtight"&gt;,&lt;/span&gt;&lt;span class="mord mtight"&gt;2&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;i&lt;/span&gt;&lt;span class="mclose mtight"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mop"&gt;sin&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="minner"&gt;&lt;span class="mopen delimcenter"&gt;&lt;span class="delimsizing size2"&gt;(&lt;/span&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mopen nulldelimiter"&gt;&lt;/span&gt;&lt;span class="mfrac"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;1000&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;0&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span 
class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mtight"&gt;2&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;i&lt;/span&gt;&lt;span class="mord mtight"&gt;/&lt;/span&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;d&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size3 size1 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord text mtight"&gt;&lt;span class="mord mtight"&gt;model&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="frac-line"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;p&lt;/span&gt;&lt;span class="mord mathnormal"&gt;os&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose nulldelimiter"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose delimcenter"&gt;&lt;span class="delimsizing size2"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;



&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;PE(pos,2i+1)=cos⁡(pos100002i/dmodel)
PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;P&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;E&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mopen mtight"&gt;(&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;p&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;os&lt;/span&gt;&lt;span class="mpunct mtight"&gt;,&lt;/span&gt;&lt;span class="mord mtight"&gt;2&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;i&lt;/span&gt;&lt;span class="mbin mtight"&gt;+&lt;/span&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;span class="mclose mtight"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mop"&gt;cos&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="minner"&gt;&lt;span class="mopen delimcenter"&gt;&lt;span class="delimsizing size2"&gt;(&lt;/span&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mopen nulldelimiter"&gt;&lt;/span&gt;&lt;span class="mfrac"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;1000&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;0&lt;/span&gt;&lt;span 
class="msupsub"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mtight"&gt;2&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;i&lt;/span&gt;&lt;span class="mord mtight"&gt;/&lt;/span&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;d&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size3 size1 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord text mtight"&gt;&lt;span class="mord mtight"&gt;model&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="frac-line"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;p&lt;/span&gt;&lt;span class="mord mathnormal"&gt;os&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose nulldelimiter"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose delimcenter"&gt;&lt;span class="delimsizing size2"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;


&lt;p&gt;These are added (not concatenated) to the input embeddings. The sinusoidal approach was chosen because it allows the model to generalize to sequence lengths longer than those seen during training — any relative position can be expressed as a linear function of the encodings.&lt;/p&gt;
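
&lt;p&gt;The two formulas are straightforward to implement. A minimal NumPy sketch (assuming the paper's d_model = 512):&lt;/p&gt;

```python
import numpy as np

def positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]        # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]    # even dimension indices
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions get sine
    pe[:, 1::2] = np.cos(angles)  # odd dimensions get cosine
    return pe

pe = positional_encoding(100, 512)
print(pe.shape)  # (100, 512)
```
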

&lt;h2&gt;
  
  
  Position-wise feed-forward networks
&lt;/h2&gt;

&lt;p&gt;Each attention sub-layer is followed by a feed-forward network applied independently to each position:&lt;/p&gt;


&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;FFN(x)=max⁡(0,xW1+b1)W2+b2
\text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;FFN&lt;/span&gt;&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord mathnormal"&gt;x&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mop"&gt;max&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;0&lt;/span&gt;&lt;span class="mpunct"&gt;,&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;x&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;W&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;b&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span 
class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;W&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;2&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;b&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;2&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;


&lt;p&gt;This is two linear transformations with a ReLU in between. The inner dimension expands to 2048 (4× the model dimension of 512), then projects back down. Think of it as each token individually "processing" the information it gathered from attention.&lt;/p&gt;
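
&lt;p&gt;As a minimal NumPy sketch (toy random weights; the real W1, b1, W2, b2 are learned):&lt;/p&gt;

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    # expand to d_ff, apply ReLU, project back down to d_model
    hidden = np.maximum(0.0, x @ W1 + b1)   # ReLU
    return hidden @ W2 + b2

rng = np.random.default_rng(0)
d_model, d_ff = 512, 2048
W1 = rng.normal(size=(d_model, d_ff)) * 0.02
b1 = np.zeros(d_ff)
W2 = rng.normal(size=(d_ff, d_model)) * 0.02
b2 = np.zeros(d_model)

x = rng.normal(size=(10, d_model))   # 10 token positions
print(ffn(x, W1, b1, W2, b2).shape)  # (10, 512)
```
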

&lt;h2&gt;
  
  
  Residual connections and layer norm
&lt;/h2&gt;

&lt;p&gt;Every sub-layer (attention or FFN) is wrapped with:&lt;/p&gt;


&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;LayerNorm(x+SubLayer(x))
\text{LayerNorm}(x + \text{SubLayer}(x))
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;LayerNorm&lt;/span&gt;&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord mathnormal"&gt;x&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;SubLayer&lt;/span&gt;&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord mathnormal"&gt;x&lt;/span&gt;&lt;span class="mclose"&gt;))&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;


&lt;p&gt;The residual connection 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;x+SubLayer(x)x + \text{SubLayer}(x) &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;x&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;SubLayer&lt;/span&gt;&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord mathnormal"&gt;x&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 ensures gradients flow easily through deep stacks — without them, training a 6-layer stack would be much harder. Layer normalization stabilizes the activations.&lt;/p&gt;
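
&lt;p&gt;A minimal NumPy sketch of the wrapper (the learnable gain and bias of layer normalization are omitted for brevity, and the sub-layer is a stand-in):&lt;/p&gt;

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # normalize each position's vector to zero mean and unit scale
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def sublayer_wrap(x, sublayer):
    # residual connection, then layer normalization
    return layer_norm(x + sublayer(x))

x = np.random.default_rng(0).normal(size=(10, 512))
out = sublayer_wrap(x, lambda t: t * 0.5)  # stand-in for attention or FFN
print(out.shape)  # (10, 512)
```
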

&lt;h2&gt;
  
  
  Training details
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Optimizer:&lt;/strong&gt; Adam with 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;β1=0.9,β2=0.98\beta_1 = 0.9, \beta_2 = 0.98 &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;β&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;0.9&lt;/span&gt;&lt;span class="mpunct"&gt;,&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;β&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;2&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;0.98&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Learning rate schedule:&lt;/strong&gt; Warmup + decay. LR increases linearly for 4000 steps, then decays proportionally to the inverse square root of the step number&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Regularization:&lt;/strong&gt; Dropout (0.1) on attention weights and after each sub-layer, plus label smoothing (0.1)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Training data:&lt;/strong&gt; WMT English-German (4.5M sentence pairs) and English-French (36M pairs)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hardware:&lt;/strong&gt; 8 NVIDIA P100 GPUs, 3.5 days for the big model&lt;/li&gt;
&lt;/ul&gt;
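&lt;p&gt;The warmup-then-decay schedule fits in a few lines. A hedged sketch using the paper's base-model &lt;code&gt;d_model = 512&lt;/code&gt; (the function name is mine):&lt;/p&gt;

```python
def transformer_lr(step, d_model=512, warmup=4000):
    """The paper's schedule: linear warmup for `warmup` steps, then
    decay proportional to the inverse square root of the step number."""
    step = max(step, 1)  # the formula is undefined at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)
```

&lt;p&gt;With &lt;code&gt;d_model = 512&lt;/code&gt; this peaks around &lt;code&gt;7e-4&lt;/code&gt; at step 4000, then decays.&lt;/p&gt;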

&lt;h2&gt;
  
  
  The results
&lt;/h2&gt;

&lt;p&gt;The Transformer achieved state-of-the-art on English-to-German and English-to-French translation, beating all previous models including deep ensembles — while training significantly faster due to full parallelization.&lt;/p&gt;

&lt;p&gt;But translation was just the beginning. The architecture turned out to be the foundation for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;BERT&lt;/strong&gt; (encoder-only) — bidirectional pretraining&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GPT&lt;/strong&gt; (decoder-only) — autoregressive language modeling&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vision Transformers&lt;/strong&gt; — applying the same architecture to images&lt;/li&gt;
&lt;li&gt;Basically everything in modern AI&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Key takeaway
&lt;/h2&gt;

&lt;p&gt;The paper's core insight is elegant: you don't need recurrence or convolutions for sequence modeling. Attention alone — properly scaled, split into multiple heads, and stacked with residual connections — is sufficient. And because attention computes all pairwise relationships in parallel, it's dramatically faster to train.&lt;/p&gt;
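&lt;p&gt;The core operation is small enough to sketch in NumPy. A toy single-head version with made-up shapes, no masking and no learned projections:&lt;/p&gt;

```python
import numpy as np

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V -- scaled dot-product attention."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # all pairwise scores in one matmul
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V, weights

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(5, 64)) for _ in range(3))
out, w = attention(Q, K, V)
```

&lt;p&gt;Each row of &lt;code&gt;w&lt;/code&gt; is a probability distribution over all positions, computed for every token at once — that single matrix multiply is the parallelism the paper exploits.&lt;/p&gt;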

&lt;p&gt;That's why, nine years later, every frontier model is still a Transformer at its core.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>transformers</category>
      <category>deeplearning</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Agent Loops &amp; State Management</title>
      <dc:creator>seah-js</dc:creator>
      <pubDate>Mon, 09 Feb 2026 11:15:37 +0000</pubDate>
      <link>https://dev.to/seahjs/agent-loops-state-management-1hhm</link>
      <guid>https://dev.to/seahjs/agent-loops-state-management-1hhm</guid>
      <description>&lt;p&gt;This is a learning session between me and &lt;a href="https://github.com/openclaw/openclaw" rel="noopener noreferrer"&gt;Klover&lt;/a&gt;, my AI assistant. New topic — building on ReAct and function calling, now looking at the loop itself.&lt;/p&gt;




&lt;h2&gt;
  
  
  How does an agent remember?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; LLMs are stateless. So how does the agent maintain context across multiple iterations of the loop?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; Something tracks the conversation. After each tool call, the observation is appended to the tracked conversation and passed as input to the next LLM call.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Exactly. The conversation history grows with each step — every Thought, Action, and Observation gets appended, and the entire thing is fed back to the LLM on each iteration. ✅&lt;/p&gt;
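&lt;p&gt;A minimal sketch of that append-and-resend loop. The &lt;code&gt;call_llm&lt;/code&gt; and &lt;code&gt;run_tool&lt;/code&gt; callables are hypothetical stand-ins, not a real API:&lt;/p&gt;

```python
def agent_step(history, call_llm, run_tool):
    """One loop iteration: the *entire* history goes in, new entries are appended."""
    thought, action = call_llm(history)   # stateless LLM re-reads everything each call
    history.append({"role": "assistant", "thought": thought, "action": action})
    if action is not None:                # None would mean: final answer, exit the loop
        observation = run_tool(action)
        history.append({"role": "tool", "observation": observation})
    return history
```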

&lt;h2&gt;
  
  
  The growing context problem
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; The history keeps growing. What's the problem?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; Hitting the context window is an issue. We can summarise the conversation. But the summarisation can't be too lossy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Good — context window limits are the main constraint. Other approaches beyond summarization:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Sliding windows&lt;/strong&gt; — drop oldest messages&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Selective retention&lt;/strong&gt; — keep only relevant steps&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Retrieval-based&lt;/strong&gt; — store older steps externally, retrieve what's relevant (RAG for agent history)&lt;/li&gt;
&lt;/ul&gt;
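&lt;p&gt;The sliding window is the simplest of these to sketch. Illustrative only — real implementations usually count tokens rather than messages:&lt;/p&gt;

```python
def sliding_window(history, max_messages=20):
    """Drop the oldest messages, but always keep the system prompt at index 0."""
    if len(history) <= max_messages:
        return history
    return history[:1] + history[-(max_messages - 1):]
```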

&lt;h2&gt;
  
  
  When does the loop stop?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; What are the termination conditions for an agent loop?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; When the agent has enough to give a conclusive answer. Also a maximum number of tool calls.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Those are 2 of 5. The full list:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Natural completion&lt;/strong&gt; — LLM decides it has the answer ✅&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Max iterations&lt;/strong&gt; — hard cap on loop steps ✅&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Timeout&lt;/strong&gt; — wall clock time limit ❌&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Token/cost budget&lt;/strong&gt; — spending cap per task ❌&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Error threshold&lt;/strong&gt; — consecutive failures = stop ❌&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Key insight: &lt;strong&gt;never trust the LLM alone to stop.&lt;/strong&gt; Always have orchestrator-level guardrails.&lt;/p&gt;
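&lt;p&gt;All five conditions can live in one orchestrator loop. A sketch, with &lt;code&gt;step_fn&lt;/code&gt; as a hypothetical stand-in for one Thought-Action-Observation iteration:&lt;/p&gt;

```python
import time

def run_agent(step_fn, max_iters=10, timeout_s=60.0, budget_tokens=50_000, max_errors=3):
    """Orchestrator-level guardrails: never rely on the model alone to stop."""
    start, spent, errors = time.monotonic(), 0, 0
    for i in range(max_iters):                      # 2. max iterations (hard cap)
        if time.monotonic() - start > timeout_s:    # 3. wall-clock timeout
            return "stopped: timeout"
        if spent > budget_tokens:                   # 4. token/cost budget
            return "stopped: budget exceeded"
        if errors >= max_errors:                    # 5. consecutive-error threshold
            return "stopped: too many errors"
        done, tokens, ok = step_fn(i)
        spent += tokens
        errors = 0 if ok else errors + 1
        if done:                                    # 1. natural completion
            return "done"
    return "stopped: max iterations"
```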

&lt;h2&gt;
  
  
  State beyond conversation history
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; What other kinds of state might an agent need beyond chat messages?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; &lt;em&gt;(needed prompting)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Think of a &lt;strong&gt;scratchpad or working memory&lt;/strong&gt; — a separate structured data store:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"selected_flight"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"SQ321"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"passenger"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Jay Seah"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"payment_status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"pending"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Different from conversation history:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Structured&lt;/strong&gt; — not buried in natural language&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mutable&lt;/strong&gt; — agent updates specific fields&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Accessible&lt;/strong&gt; — no parsing through messages&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Production agents need both conversation history AND structured state management.&lt;/p&gt;
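&lt;p&gt;The distinction in code, using the illustrative values from the example above:&lt;/p&gt;

```python
# Two kinds of state: append-only conversation vs a mutable, structured scratchpad.
history = []                     # grows every step, may get summarized away
scratchpad = {                   # lives separately, updated in place
    "selected_flight": None,
    "passenger": "Jay Seah",
    "payment_status": "pending",
}

history.append({"role": "tool", "observation": "SQ321 has seats available"})
scratchpad["selected_flight"] = "SQ321"   # no parsing through messages needed
```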

&lt;h2&gt;
  
  
  Scratchpad solves the summarization problem
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Agent is 25 steps deep, context almost full. You summarize steps 1-20. What's the risk?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; The risk is loss of important information during summarisation. The structured file we talked about could help.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Nailed it. The scratchpad survives summarization because it's separate. Critical data lives in structured state, so even if conversation gets compressed, essential info is preserved. ✅&lt;/p&gt;




&lt;h2&gt;
  
  
  Session verdict: Exposure
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What I got:&lt;/strong&gt; Conversation history as state, context limits + summarization, natural completion + max iterations, scratchpad solving summarization risk.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What I missed:&lt;/strong&gt; 3 of 5 termination conditions (timeout, token/cost budget, error threshold). Structured scratchpad needed significant prompting.&lt;/p&gt;

&lt;p&gt;All 5 termination conditions on the test next review. No excuses.&lt;/p&gt;

&lt;p&gt;Next review: February 10.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>statemanagement</category>
      <category>learning</category>
    </item>
    <item>
      <title>Function Calling &amp; Tool Schemas — Review</title>
      <dc:creator>seah-js</dc:creator>
      <pubDate>Mon, 09 Feb 2026 11:15:23 +0000</pubDate>
      <link>https://dev.to/seahjs/function-calling-tool-schemas-review-d4j</link>
      <guid>https://dev.to/seahjs/function-calling-tool-schemas-review-d4j</guid>
      <description>&lt;p&gt;This is a learning session between me and &lt;a href="https://github.com/openclaw/openclaw" rel="noopener noreferrer"&gt;Klover&lt;/a&gt;, my AI assistant. Spaced repetition review — testing what stuck from the first session.&lt;/p&gt;




&lt;h2&gt;
  
  
  What does the LLM actually output?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; When an LLM "calls a function," what does it actually output, and what does it NOT do?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; It outputs a JSON table with all the relevant parameters which will be passed to an orchestrator which will parse the JSON and call the actual tool.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Core is right — the LLM outputs structured JSON, it does NOT execute anything itself. The orchestrator handles actual execution. Small note: "JSON table" — it's more accurately a JSON object (key-value pairs), not a table. Minor terminology thing but worth being precise.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where do schemas come from?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Where do tool schemas come from? Are they baked into the model during training, or something else? And what does a schema typically contain?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; It is external to the model. It is passed like a system prompt to the model. A schema usually contains the description of the tool, the required and optional parameters.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Solid. Schemas are injected at runtime, not baked in during training. They contain the tool name, description, and typed parameters (required + optional). ✅&lt;/p&gt;

&lt;h2&gt;
  
  
  Why schema quality matters
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Why does the quality of your schema descriptions matter so much?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; It might cause the LLM to use the tool wrongly which might result in useless or non-relevant outputs from the tool called.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Right — poor descriptions lead to wrong tool selection or bad parameter values. Schema quality is basically &lt;strong&gt;prompt engineering for tools&lt;/strong&gt;. Garbage in, garbage out. ✅&lt;/p&gt;

&lt;h2&gt;
  
  
  The orchestrator's role
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Beyond just executing the tool call, what else does the orchestrator do?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; It acts as a security layer. It ensures that the tool called is allowed for this model. Also, it prevents prompt injection from directly calling a tool.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Security is a big one. But the complete list:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Validation&lt;/strong&gt; — are the params correct/allowed?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rate limiting&lt;/strong&gt; — preventing abuse&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security/filtering&lt;/strong&gt; — blocking unauthorized calls&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Logging&lt;/strong&gt; — audit trail&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I got 2 out of 4. Missed rate limiting (prevents runaway agents hammering APIs) and logging (critical for debugging agent behavior in long chains).&lt;/p&gt;
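&lt;p&gt;The two I missed are easy to sketch. A hypothetical &lt;code&gt;ToolGate&lt;/code&gt; wrapper (not any real library's API):&lt;/p&gt;

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("orchestrator")

class ToolGate:
    """Rate limiting plus an audit trail around every tool execution."""
    def __init__(self, max_calls_per_task=20):
        self.max_calls = max_calls_per_task
        self.calls = 0

    def execute(self, tool_name, params, tools):
        self.calls += 1
        if self.calls > self.max_calls:      # stop runaway agents hammering APIs
            log.warning("rate limit hit on %s", tool_name)
            raise RuntimeError("tool-call budget exhausted")
        log.info("call #%d: %s(%r)", self.calls, tool_name, params)  # audit trail
        return tools[tool_name](**params)
```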

&lt;h2&gt;
  
  
  Connecting to ReAct
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; How does function calling relate to the ReAct pattern?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; Tool calls fit in the Action part. The Observation is the output from the tool call.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Spot on. Action = LLM outputs tool call JSON, orchestrator executes. Observation = raw result fed back for next Thought. ✅&lt;/p&gt;




&lt;h2&gt;
  
  
  Review verdict: Pass → Developing
&lt;/h2&gt;

&lt;p&gt;Core concepts solid. Missed rate limiting and logging from orchestrator responsibilities. Next review: February 12.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>functioncalling</category>
      <category>learning</category>
    </item>
    <item>
      <title>Function Calling &amp; Tool Schemas</title>
      <dc:creator>seah-js</dc:creator>
      <pubDate>Sun, 08 Feb 2026 13:28:15 +0000</pubDate>
      <link>https://dev.to/seahjs/function-calling-tool-schemas-265a</link>
      <guid>https://dev.to/seahjs/function-calling-tool-schemas-265a</guid>
      <description>&lt;p&gt;This is a learning session between me and &lt;a href="https://github.com/openclaw/openclaw" rel="noopener noreferrer"&gt;Klover&lt;/a&gt;, my AI assistant. The idea is simple — Klover teaches me a concept through back-and-forth dialogue, I explain things in my own words, and we go deeper together. This format captures the actual conversation, including mistakes and corrections.&lt;/p&gt;

&lt;p&gt;This time: Function calling and tool schemas — how agents actually interact with external tools.&lt;/p&gt;




&lt;h2&gt;
  
  
  What is a tool schema?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; You know the ReAct loop: Thought → Action → Observation. But when the agent picks an Action, how does it know what tools exist and how to call them? What do you think a tool schema is?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; I assume it details all the available tools the LLM can call. It exposes a "list" of tools and their purposes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Good intuition but there's more. A tool schema doesn't just list what tools exist — it defines &lt;strong&gt;how to call them&lt;/strong&gt;. Like a function signature in code. For each tool: a name, a description (natural language for the LLM), and typed parameters with required/optional flags.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;name: get_weather
description: "Get current weather for a location"
parameters:
  location (string, required): City name
  units (string, optional): "celsius" or "fahrenheit"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Where do schemas live?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Where does this schema live? Part of the model's training, or somewhere else?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; It's not part of training. The schema lives externally — almost like additional prompts fed to the LLM during every invoke.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Exactly. Schemas are injected into the system prompt or a dedicated tools section at runtime. The model was &lt;em&gt;trained&lt;/em&gt; to understand and follow the format, but the specific tools change per application. That's why you can give Claude a custom tool it's never seen and it works — it learned the format, not the specific tools.&lt;/p&gt;

&lt;h2&gt;
  
  
  The full round trip
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; When the LLM decides to call a tool, what does that actually look like? Walk me through the full round trip.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; The model decides to call the weather tool → invokes it via the output → weather tool returns an observation → model thinks about the observation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Right flow, but let's zoom in. The LLM is a text generator — it can't literally call an API. What does it actually produce?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; I guess it outputs the actual function call with the parameters, which then calls a function.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Specifically, it outputs &lt;strong&gt;structured JSON&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"tool"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"get_weather"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"parameters"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"location"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Singapore"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"units"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"celsius"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But the critical part: &lt;strong&gt;the LLM doesn't execute anything&lt;/strong&gt;. It produces this JSON and stops. Then your &lt;strong&gt;application code&lt;/strong&gt; (the orchestrator) takes over:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;LLM outputs a tool call → stops generating&lt;/li&gt;
&lt;li&gt;Your code parses the JSON&lt;/li&gt;
&lt;li&gt;Your code actually calls the weather API&lt;/li&gt;
&lt;li&gt;Your code feeds the result back as the Observation&lt;/li&gt;
&lt;li&gt;LLM continues with the next Thought&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The LLM is the brain. Your application is the hands.&lt;/p&gt;
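&lt;p&gt;Steps 2 through 4 can be sketched in a few lines (the weather tool here is a stub, not a real API):&lt;/p&gt;

```python
import json

# Stubbed tool registry: in a real app these call actual APIs.
TOOLS = {"get_weather": lambda location, units="celsius": f"26 degrees {units} in {location}"}

def round_trip(llm_output):
    """Steps 2-4: parse the model's JSON, execute it, return the Observation."""
    call = json.loads(llm_output)                      # 2. parse the JSON
    fn = TOOLS[call["tool"]]
    observation = fn(**call["parameters"])             # 3. actually make the call
    return {"role": "tool", "content": observation}    # 4. fed back for the next Thought

obs = round_trip('{"tool": "get_weather", "parameters": {"location": "Singapore"}}')
```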

&lt;h2&gt;
  
  
  Why the orchestrator matters
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Why is it important that the orchestrator sits between the LLM and the tool?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; So you can add safeguards — limiting total tool calls, ensuring tools are called correctly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Spot on. The orchestrator is your control layer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Validate&lt;/strong&gt; — check parameters before executing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rate limit&lt;/strong&gt; — cap tool calls to prevent infinite loops&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Filter&lt;/strong&gt; — block certain tools based on context or permissions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Log&lt;/strong&gt; — record every call for debugging&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sanitize&lt;/strong&gt; — clean output before feeding it back&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without that middle layer, the LLM has direct access to your systems. Imagine a prompt injection tricking the model into calling &lt;code&gt;delete_database&lt;/code&gt; — the orchestrator is where you catch that.&lt;/p&gt;
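&lt;p&gt;A sketch of that validate-and-filter layer. The allowlist and schema shape are illustrative, not any particular framework's API:&lt;/p&gt;

```python
ALLOWED_TOOLS = {"get_weather", "search_flights"}

def guarded_execute(call, tools, schema):
    """Validate and filter before anything runs: where delete_database gets caught."""
    if call["tool"] not in ALLOWED_TOOLS:                     # filter
        return {"error": f"tool '{call['tool']}' is not permitted"}
    params = call.get("parameters", {})
    missing = [p for p in schema[call["tool"]]["required"] if p not in params]
    if missing:                                               # validate
        return {"error": f"missing required parameters: {missing}"}
    return {"result": tools[call["tool"]](**params)}
```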

&lt;h2&gt;
  
  
  Schema quality matters
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; What happens if the schema is poorly written? Vague description, wrong parameter requirements?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; The tool gets called when it shouldn't be, or called wrongly — resulting in bad or failed results.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Exactly. Bad schemas → bad tool calls:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Vague description&lt;/strong&gt; → model calls the wrong tool or skips the right one&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Wrong types/requirements&lt;/strong&gt; → malformed requests, crashes, garbage output&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Missing parameter descriptions&lt;/strong&gt; → model guesses what "q" or "id" means&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Tool schema design is basically &lt;strong&gt;prompt engineering for tools&lt;/strong&gt;. Clear names, precise descriptions, correct types — it's the difference between a reliable agent and a flaky one.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;First session: Feb 8, 2026. Status: Exposure. Good intuition throughout — connected naturally to ReAct concepts. Review tomorrow.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>functioncalling</category>
      <category>learninpublic</category>
    </item>
    <item>
      <title>ReAct Pattern — Review</title>
      <dc:creator>seah-js</dc:creator>
      <pubDate>Sun, 08 Feb 2026 13:14:13 +0000</pubDate>
      <link>https://dev.to/seahjs/react-pattern-review-3cki</link>
      <guid>https://dev.to/seahjs/react-pattern-review-3cki</guid>
      <description>&lt;p&gt;This is a review session between me and &lt;a href="https://github.com/openclaw/openclaw" rel="noopener noreferrer"&gt;Klover&lt;/a&gt;, my AI assistant. The idea is simple — Klover quizzes me on something I've learned before, I explain it in my own words, and we go deeper on the parts that stick (or don't). This format captures the actual back-and-forth, including the mistakes.&lt;/p&gt;

&lt;p&gt;This time: ReAct pattern review, testing whether the core concepts have stuck since the first session.&lt;/p&gt;




&lt;h2&gt;
  
  
  Empty results — what happens next?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; An agent calls a search tool but gets back an empty result. Walk me through what happens next in the ReAct loop — what does the Thought step look like, and what are two ways it could go wrong?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; The Thought step would analyse what to do next. One failure: it could call the tool again with the same query, get nothing again, and end up in an endless loop.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; That's the first one — infinite retry loop. What's the second?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; I wasn't sure about the second one.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; The second is &lt;strong&gt;hallucination&lt;/strong&gt;. The Thought step sees no results and instead of trying a different approach, the agent makes up an answer. It "reasons" that it probably knows enough and generates a confident but fabricated response. Dangerous because the user has no idea the info wasn't grounded in actual tool output.&lt;/p&gt;

&lt;h2&gt;
  
  
  What should the agent actually do?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; So what should a well-designed ReAct agent do when the search returns nothing?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; The next reasoning step should recognise that something went wrong — maybe the query was too specific or used wrong terms — and try something else. Or if retrying doesn't work, report back to the user that it couldn't find the information.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Exactly. The Thought should explicitly acknowledge the failure and pivot. The key insight — the Thought step is where the agent decides between &lt;strong&gt;adapting&lt;/strong&gt; and &lt;strong&gt;giving up gracefully&lt;/strong&gt;. Both are valid. Hallucinating is never valid.&lt;/p&gt;

&lt;h2&gt;
  
  
  The T-A-O loop
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Quick check: what's the correct order of the loop, what does each letter stand for, and where does the agent decide whether to give a final answer or keep going?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; T — Think, A — Act, O — Observe. In that order. The Think step is where it decides.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Spot on. At each Thought step, the agent either decides to act (picks a tool, executes it, observes the result) or decides it's done (outputs the final answer and exits the loop). That's why Thought is the most critical step.&lt;/p&gt;

&lt;h2&gt;
  
  
  Who generates the Observation?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; What's the difference between the Action and the Observation? Specifically — who or what generates each one?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; Action is an external tool call or API. Observation is done by the LLM on the output of that tool call.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Close but not quite. The Action — yes, the LLM decides the tool call, and the tool executes externally. But the &lt;strong&gt;Observation is the raw output returned by the tool/environment&lt;/strong&gt;. It's data coming &lt;em&gt;in&lt;/em&gt;, not the LLM thinking about it. The LLM's analysis of that output happens in the &lt;strong&gt;next Thought step&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;So the full flow:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Thought&lt;/strong&gt; → LLM reasons&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Action&lt;/strong&gt; → LLM picks a tool, environment executes it&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Observation&lt;/strong&gt; → Environment returns the result (LLM just receives it)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Thought&lt;/strong&gt; → LLM analyses the observation, decides next move&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The Observation is the only part the LLM doesn't generate. It's the real world talking back.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; Yeah, that makes sense. I was conflating the observation with the reasoning about it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Failure modes recall
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Name the failure modes we've covered and give a one-line mitigation for each.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; Context window overflow — mitigate by summarising. Forever loop — limit the total number of tool calls. Attention dilution — same thing, summarise.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Three for three on unprompted recall. Plus you know hallucination from earlier. And you correctly identified the tradeoff — summarisation helps with both context bloat and attention dilution, but it's lossy compression. You might throw away something that matters later.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Review session: Feb 8, 2026. Stays at developing. Recalled 3 of 4 failure modes (missed hallucination — was prompted). Corrected on Observation being environment output, not LLM reasoning. Next review targets unprompted recall of all 4 failure modes.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>react</category>
      <category>learninpublic</category>
    </item>
    <item>
      <title>KV Cache Optimization — Why Inference Memory Explodes and How to Fix It</title>
      <dc:creator>seah-js</dc:creator>
      <pubDate>Fri, 06 Feb 2026 13:21:28 +0000</pubDate>
      <link>https://dev.to/seahjs/kv-cache-optimization-why-inference-memory-explodes-and-how-to-fix-it-54l3</link>
      <guid>https://dev.to/seahjs/kv-cache-optimization-why-inference-memory-explodes-and-how-to-fix-it-54l3</guid>
      <description>&lt;p&gt;Learning session with &lt;a href="https://github.com/openclaw/openclaw" rel="noopener noreferrer"&gt;Klover&lt;/a&gt;. Today: why the KV cache is the biggest memory bottleneck in LLM inference, and three ways to optimize it.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Does the KV Cache Exist?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; You already know from MLA that the KV cache is a bottleneck. Quick refresh — why does it exist in the first place?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; KV cache exists because it's used for inference to generate the next token. It speeds up the process because the key and value vectors don't have to be recalculated every time a new token is generated.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Spot on. Without it, every new token would recompute attention over the entire sequence — quadratic cost. With it, you just compute K and V for the new token and append. Linear.&lt;/p&gt;




&lt;h2&gt;
  
  
  How Much Memory Does It Use?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; What factors make the KV cache bigger?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; If I'm not wrong, a single number is 4 bits? If your vector is 128 numbers and you have both keys and values, each token would have 4×128×2, and then it scales linearly with tokens.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Close but a few corrections:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;FP16 (standard for inference) is &lt;strong&gt;2 bytes (16 bits)&lt;/strong&gt;, not 4 bits&lt;/li&gt;
&lt;li&gt;You got dimension and K+V (×2) right&lt;/li&gt;
&lt;li&gt;But you're missing &lt;strong&gt;number of layers and number of attention heads&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Full formula:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2 × num_layers × num_heads × head_dim × seq_length × 2 bytes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For Llama 70B (80 layers, 64 heads, head_dim 128) at 4K sequence length — the KV cache alone is &lt;strong&gt;~10GB per single request&lt;/strong&gt;. 100 concurrent users = 1TB just for KV cache.&lt;/p&gt;
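&lt;p&gt;The arithmetic checks out. (Note this uses the 64 heads quoted above; the real Llama 2 70B uses grouped-query attention with far fewer KV heads, so its actual cache is smaller.)&lt;/p&gt;

```python
def kv_cache_bytes(num_layers, num_heads, head_dim, seq_len, bytes_per_value=2):
    """2 (K and V) x layers x heads x head_dim x tokens x bytes (FP16 = 2)."""
    return 2 * num_layers * num_heads * head_dim * seq_len * bytes_per_value

per_request = kv_cache_bytes(num_layers=80, num_heads=64, head_dim=128, seq_len=4096)
# -> 10,737,418,240 bytes: exactly 10 GiB per request
```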




&lt;h2&gt;
  
  
  Optimization #1: PagedAttention
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; The name comes from how operating systems manage RAM. What do you think the connection might be?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; I don't even know how operating systems manage RAM.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Fair enough. Quick detour:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The OS problem:&lt;/strong&gt; Programs ask for different amounts of memory. If you give each one a big contiguous block "just in case," you waste memory sitting unused. This is &lt;strong&gt;internal fragmentation&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The OS solution — paging:&lt;/strong&gt; Split memory into small fixed-size &lt;strong&gt;pages&lt;/strong&gt;. Allocate on demand. Programs think they have contiguous memory, but the OS maps pages to wherever there's free space.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Same problem with KV cache:&lt;/strong&gt; Each request pre-allocates for the maximum sequence length. A 500-token response with 4096 reserved = 87% wasted.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PagedAttention (from vLLM):&lt;/strong&gt; Splits the KV cache into small fixed-size blocks, allocated on demand as the sequence grows. Result: ~60-80% better memory utilization.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; What's the core insight that both OS paging and PagedAttention share?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; Rationing memory to the services that need it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Yes — allocate on demand, not upfront. Don't reserve what you might not use.&lt;/p&gt;
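&lt;p&gt;The waste difference from the 500-token example, in code. (A block size of 16 tokens is, if I recall correctly, vLLM's default.)&lt;/p&gt;

```python
def prealloc_waste(used_tokens, reserved=4096):
    """Contiguous pre-allocation: every reserved slot past the real response is wasted."""
    return reserved - used_tokens

def paged_waste(used_tokens, block=16):
    """Paged allocation: only the unfilled tail of the last block is wasted."""
    return (-used_tokens) % block
```

&lt;p&gt;A 500-token response wastes 3,596 slots under pre-allocation but only 12 under paging.&lt;/p&gt;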




&lt;h2&gt;
  
  
  Optimization #2: Prefix Caching
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Think about a chatbot with a system prompt that every conversation starts with. Every request computes the exact same K and V for those tokens. What do you think prefix caching does?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; Cache the KV for the system prompt even before any user prompt is given.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Exactly. Compute once, reuse across all requests sharing that prefix.&lt;/p&gt;

&lt;p&gt;This applies beyond system prompts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Few-shot examples&lt;/strong&gt; — same examples prepended to every request&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-turn chat&lt;/strong&gt; — prior conversation already cached, only compute for the new message&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Parallel requests&lt;/strong&gt; — multiple requests sharing the same document&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;2000-token system prompt across 100 users = 200K tokens of computation eliminated.&lt;/p&gt;
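&lt;p&gt;A minimal sketch of the lookup — &lt;code&gt;compute_kv&lt;/code&gt; is a hypothetical stand-in for the real attention computation:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import hashlib

prefix_cache = {}  # prefix hash to precomputed K/V

def compute_kv(tokens):
    return f"KV for {len(tokens)} tokens"  # stand-in for real K/V tensors

def kv_for_request(prefix_tokens, user_tokens):
    key = hashlib.sha256(repr(prefix_tokens).encode()).hexdigest()
    if key not in prefix_cache:                # computed once per unique prefix...
        prefix_cache[key] = compute_kv(prefix_tokens)
    prefix_kv = prefix_cache[key]              # ...reused by every later request
    return prefix_kv, compute_kv(user_tokens)  # only the new tokens cost compute
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;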




&lt;h2&gt;
  
  
  Optimization #3: MLA
&lt;/h2&gt;

&lt;p&gt;MLA (Multi-head Latent Attention) was already covered in a previous session. It compresses K and V into a smaller latent space using learned projections, so less memory is stored per token.&lt;/p&gt;
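&lt;p&gt;The memory saving is easy to see with back-of-envelope numbers (illustrative dimensions, not any particular model's):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;n_heads, head_dim, latent_dim = 32, 128, 512

standard = 2 * n_heads * head_dim  # K and V across all heads: 8192 values/token
latent = latent_dim                # one compressed latent vector: 512 values/token
print(f"{standard // latent}x smaller cache per token")  # 16x
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;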




&lt;h2&gt;
  
  
  Putting It All Together
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; A company is serving an LLM to 1000 concurrent users. Same 1000-token system prompt, but response lengths vary wildly (50 to 4000 tokens). Which optimization helps with which problem?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; Same 1000-token prompt can use prefix cache across users. Varying response lengths can use PagedAttention. MLA can be used across all LLM settings if you want.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Nailed it. All three correct. And you're right that MLA is a model-level decision baked into the architecture, while PagedAttention and prefix caching are &lt;strong&gt;serving-layer&lt;/strong&gt; optimizations you can apply to any model.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Optimization&lt;/th&gt;
&lt;th&gt;Level&lt;/th&gt;
&lt;th&gt;When Applied&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;MLA&lt;/td&gt;
&lt;td&gt;Model architecture&lt;/td&gt;
&lt;td&gt;Training time&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PagedAttention&lt;/td&gt;
&lt;td&gt;Serving layer&lt;/td&gt;
&lt;td&gt;Any model&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Prefix caching&lt;/td&gt;
&lt;td&gt;Serving layer&lt;/td&gt;
&lt;td&gt;Any model&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Self-Assessment: 3.8/5
&lt;/h2&gt;

&lt;p&gt;Core concepts make sense. Need another pass to solidify the memory formula and deepen understanding of PagedAttention internals.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>inference</category>
      <category>optimization</category>
    </item>
    <item>
      <title>AWS ECS Deployment — Review</title>
      <dc:creator>seah-js</dc:creator>
      <pubDate>Fri, 06 Feb 2026 01:28:13 +0000</pubDate>
      <link>https://dev.to/seahjs/aws-ecs-deployment-review-53ci</link>
      <guid>https://dev.to/seahjs/aws-ecs-deployment-review-53ci</guid>
      <description>&lt;p&gt;Review session with &lt;a href="https://github.com/openclaw/openclaw" rel="noopener noreferrer"&gt;Klover&lt;/a&gt;. Testing recall on AWS ECS deployment concepts from yesterday's session.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Recall Test
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Walk me through the flow from &lt;code&gt;git push&lt;/code&gt; to a user hitting your API. What are the key AWS services involved?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; After git push, if you set up CI/CD, the image will be pushed to AWS ECR, then deployed via ECS. After that I don't remember well. I know there's Fargate where you configure system requirements, ALB which routes containers to speak to one another, and some kind of task system.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Good start — you've got the big picture. But let's untangle the ECS side because you're mixing a few things.&lt;/p&gt;




&lt;h2&gt;
  
  
  Task Definition vs Task
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; You mentioned "task system or something." What do you think a Task Definition actually is? Think blueprint vs running thing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; Task Definition is the blueprint for the actual running of the containers? What settings for this service and everything like that.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Yes. Task Definition = blueprint. It defines which Docker image, how much CPU/memory, environment variables, port mappings, secrets. The actual running instance of that blueprint is called a &lt;strong&gt;Task&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; So is it accurate to say the Task is the running of the container, with configurations defined in the Task Definition?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Exactly. Same relationship as a Docker image vs a running Docker container.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Service — What I Forgot
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; What does an ECS Service do?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; I don't know.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; What happens if your Task crashes? It's just dead. Nobody restarts it. That's what a &lt;strong&gt;Service&lt;/strong&gt; does — it says "I want 3 Tasks running at all times." If one dies, the Service launches a replacement. It also handles rolling deployments.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The hierarchy:&lt;/strong&gt; Task Definition (blueprint) → Service (manager) → Tasks (running containers)&lt;/p&gt;




&lt;h2&gt;
  
  
  Crash Loop — My Wrong Assumption
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; If you push a broken image that crashes on startup, what happens?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; It will retry, and if it doesn't work, it will revert to the previous successful task definition.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Close but not quite — it's actually dumber than that. The Service sees the crash, launches a new one from the &lt;strong&gt;same broken definition&lt;/strong&gt;. That crashes too. It tries again. And again. This is a &lt;strong&gt;crash loop&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;It does NOT auto-rollback by default. The Service only knows "maintain desired count." Unless you've enabled ECS's optional deployment circuit breaker (which can detect a failing deployment and roll it back), you have to intervene manually.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mitigation:&lt;/strong&gt; Rolling deployments with health checks. New tasks must pass health checks before old ones are killed. Doesn't "revert" — just stops the rollout.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key takeaway:&lt;/strong&gt; The Service is persistent but not smart.&lt;/p&gt;




&lt;h2&gt;
  
  
  ALB — What I Got Wrong
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; ALB routes containers to speak to one another.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Not quite. ALB sits between &lt;strong&gt;users and your containers&lt;/strong&gt;. Users hit the ALB, ALB routes traffic to healthy Tasks. Two reasons you need this middleman:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Tasks are ephemeral — they get replaced, scale up/down, IPs change&lt;/li&gt;
&lt;li&gt;ALB gives you a &lt;strong&gt;stable endpoint&lt;/strong&gt; (one URL) that always works&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;There's a &lt;strong&gt;Target Group&lt;/strong&gt; between ALB and Tasks — a registry of currently healthy tasks that ALB routes to.&lt;/p&gt;




&lt;h2&gt;
  
  
  Serverless ≠ No Servers
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; Regarding Fargate, I don't understand what "serverless" means. Isn't EC2 also using AWS infra?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; "Serverless" is confusing because obviously there ARE servers. The difference is who manages them:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;EC2:&lt;/strong&gt; AWS gives you a VM. You manage capacity, patches, disk space, scaling.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fargate:&lt;/strong&gt; You never see a machine. Just say "I need 0.5 vCPU and 1GB RAM." AWS figures out the rest.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Think of it like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;EC2 = renting an apartment (you fix the plumbing)&lt;/li&gt;
&lt;li&gt;Fargate = hotel room (they handle everything)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Both use AWS infra. "Serverless" means you don't manage servers — not that servers don't exist.&lt;/p&gt;




&lt;h2&gt;
  
  
  Self-Assessment: 2.7/5
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Solid on:&lt;/strong&gt; ECR → ECS pipeline, Task Definition = blueprint, Fargate = serverless.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fuzzy on:&lt;/strong&gt; ALB/Target Groups, Services, crash loop behavior, serverless meaning.&lt;/p&gt;

&lt;p&gt;Staying at exposure. Review again tomorrow.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>ecs</category>
      <category>devops</category>
      <category>docker</category>
    </item>
    <item>
      <title>AWS ECS Deployment — From Git Push to Running Containers</title>
      <dc:creator>seah-js</dc:creator>
      <pubDate>Thu, 05 Feb 2026 01:52:56 +0000</pubDate>
      <link>https://dev.to/seahjs/aws-ecs-deployment-from-git-push-to-running-containers-3of</link>
      <guid>https://dev.to/seahjs/aws-ecs-deployment-from-git-push-to-running-containers-3of</guid>
      <description>&lt;p&gt;Learning session with &lt;a href="https://github.com/openclaw/openclaw" rel="noopener noreferrer"&gt;Klover&lt;/a&gt;. Today: deploying FastAPI + Streamlit to AWS ECS. The goal is understanding the full flow from &lt;code&gt;git push&lt;/code&gt; to running containers.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Big Picture
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git push
   ↓
GitHub Actions (CI/CD)
   ↓
ECR (store images)
   ↓
ECS + Fargate (run containers)
   ↓
ALB (route traffic)
   ↓
Users
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two services in my case: FastAPI backend + Streamlit frontend. Each gets its own container.&lt;/p&gt;




&lt;h2&gt;
  
  
  ECS vs Fargate — What Actually Runs Containers?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; ECS is an &lt;em&gt;orchestrator&lt;/em&gt; — it doesn't run containers itself. What actually runs your containers?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; Maybe EC2?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; That's one option. The other is Fargate.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;EC2 mode:&lt;/strong&gt; You manage the instances, patching, scaling, capacity&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fargate mode:&lt;/strong&gt; Serverless. Just define CPU/memory, AWS handles the rest.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For someone who hates managing infrastructure, Fargate is the obvious choice.&lt;/p&gt;




&lt;h2&gt;
  
  
  ECR — Where Images Live
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Why not just use Docker Hub? Why does AWS have its own registry?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; Docker Hub is external to AWS. ECS runs within AWS, so having ECR inside AWS makes sense.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Exactly. Being inside AWS means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Faster pulls&lt;/strong&gt; — same network, low latency&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;IAM integration&lt;/strong&gt; — no separate credentials&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No egress costs&lt;/strong&gt; — pulling from Docker Hub = paying for data transfer&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;ECR = AWS's private Docker Hub.&lt;/p&gt;




&lt;h2&gt;
  
  
  Task Definitions — The Blueprint
&lt;/h2&gt;

&lt;p&gt;A Task Definition tells ECS everything it needs to run your container:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Image&lt;/strong&gt; — which ECR image to pull&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CPU/Memory&lt;/strong&gt; — resources allocated&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Port mappings&lt;/strong&gt; — which ports to expose&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Environment variables&lt;/strong&gt; — config, secrets&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Logging&lt;/strong&gt; — where to send logs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;IAM role&lt;/strong&gt; — AWS permissions for the container&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Think of it like a docker-compose file but AWS-flavored.&lt;/p&gt;
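&lt;p&gt;A trimmed-down Task Definition looks roughly like this (account ID, names, and region are placeholders, and many optional fields are omitted):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "family": "fastapi-backend",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "512",
  "memory": "1024",
  "containerDefinitions": [{
    "name": "api",
    "image": "&lt;account&gt;.dkr.ecr.us-east-1.amazonaws.com/fastapi:latest",
    "portMappings": [{"containerPort": 8000}],
    "environment": [{"name": "ENV", "value": "prod"}],
    "logConfiguration": {"logDriver": "awslogs"}
  }]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;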




&lt;h2&gt;
  
  
  Services — Keeping Things Alive
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; A Task Definition is just a blueprint. To actually run it, you create a Service. Why the separation?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; So if one crashes, the Service can spin up a replacement automatically?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Exactly. Service maintains desired state. You say "keep 2 copies running" — if one dies, Service notices and replaces it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Service gives you:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Desired count (keep N copies running)&lt;/li&gt;
&lt;li&gt;Auto-restart on failure&lt;/li&gt;
&lt;li&gt;Connection to load balancer&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Hierarchy:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Task Definition (blueprint)
       ↓
Service (keeps N copies alive)
       ↓
Tasks (actual running containers)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  ALB + Target Groups — Routing Traffic
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; If you have 2 FastAPI tasks running, how does traffic get routed?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; Some kind of routing — the ALB?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Yes. ALB (Application Load Balancer) sits in front and distributes requests.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User request
     ↓
    ALB
   /   \
Task1  Task2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;ALB does:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Load balancing&lt;/strong&gt; — spreads traffic across healthy tasks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Health checks&lt;/strong&gt; — pings &lt;code&gt;/health&lt;/code&gt;, stops sending traffic to dead tasks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Path routing&lt;/strong&gt; — &lt;code&gt;/api/*&lt;/code&gt; → FastAPI, &lt;code&gt;/&lt;/code&gt; → Streamlit&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Why Target Groups?
&lt;/h3&gt;

&lt;p&gt;Tasks are ephemeral. Their IPs change constantly. ALB can't track "send traffic to 10.0.1.45" when that IP might be gone in 5 minutes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Target Group = stable reference to an ever-changing set of tasks.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;ECS auto-registers new tasks, auto-deregisters dead ones. ALB just points at the target group.&lt;/p&gt;




&lt;h2&gt;
  
  
  The CI/CD Flow
&lt;/h2&gt;

&lt;p&gt;When I push code:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;GitHub Actions&lt;/strong&gt; detects the push&lt;/li&gt;
&lt;li&gt;Workflow builds Docker image&lt;/li&gt;
&lt;li&gt;Pushes image to &lt;strong&gt;ECR&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Tells ECS to update the &lt;strong&gt;Service&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Service pulls new image, spins up new &lt;strong&gt;Tasks&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;New tasks register in &lt;strong&gt;Target Group&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ALB&lt;/strong&gt; routes traffic to healthy tasks&lt;/li&gt;
&lt;/ol&gt;
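&lt;p&gt;In GitHub Actions terms, the workflow looks roughly like this — a sketch using the official AWS actions, with service, cluster, and region names as placeholders:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;on:
  push:
    branches: [main]
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-east-1
      - id: ecr
        uses: aws-actions/amazon-ecr-login@v2
      - run: |
          docker build -t ${{ steps.ecr.outputs.registry }}/fastapi:${{ github.sha }} .
          docker push ${{ steps.ecr.outputs.registry }}/fastapi:${{ github.sha }}
      - uses: aws-actions/amazon-ecs-deploy-task-definition@v2
        with:
          task-definition: task-definition.json
          service: fastapi-service
          cluster: my-cluster
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;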

&lt;h3&gt;
  
  
  Why "update service"?
&lt;/h3&gt;

&lt;p&gt;Pushing a new image to ECR just updates storage. Running tasks don't know — they're already running the old image.&lt;/p&gt;

&lt;p&gt;"Update ECS service" triggers a rolling deployment:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Pull latest image&lt;/li&gt;
&lt;li&gt;Spin up new tasks&lt;/li&gt;
&lt;li&gt;Wait for health checks&lt;/li&gt;
&lt;li&gt;Drain traffic from old tasks&lt;/li&gt;
&lt;li&gt;Kill old tasks&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;No downtime.&lt;/p&gt;




&lt;h2&gt;
  
  
  Secrets Management
&lt;/h2&gt;

&lt;p&gt;Two different places for secrets:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub Secrets&lt;/strong&gt; — for CI/CD&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AWS credentials so GitHub Actions can push to ECR&lt;/li&gt;
&lt;li&gt;Used &lt;em&gt;during&lt;/em&gt; build/deploy&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;AWS Secrets Manager / SSM&lt;/strong&gt; — for runtime&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;DB passwords, API keys the app actually uses&lt;/li&gt;
&lt;li&gt;Injected as environment variables when container starts&lt;/li&gt;
&lt;li&gt;Never in image, never in git&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Quick Reference
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Concept&lt;/th&gt;
&lt;th&gt;What it does&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;ECR&lt;/td&gt;
&lt;td&gt;Stores Docker images&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ECS&lt;/td&gt;
&lt;td&gt;Orchestrates containers&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fargate&lt;/td&gt;
&lt;td&gt;Serverless compute for containers&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Task Definition&lt;/td&gt;
&lt;td&gt;Blueprint (image, CPU, memory, ports, secrets)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Service&lt;/td&gt;
&lt;td&gt;Keeps N tasks running, auto-restart&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Task&lt;/td&gt;
&lt;td&gt;Actual running container&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ALB&lt;/td&gt;
&lt;td&gt;Routes traffic, health checks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Target Group&lt;/td&gt;
&lt;td&gt;Stable reference to dynamic tasks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GitHub Actions&lt;/td&gt;
&lt;td&gt;CI/CD automation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Secrets Manager/SSM&lt;/td&gt;
&lt;td&gt;Runtime secrets&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  The Gotcha I Hit
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Klover's quiz:&lt;/strong&gt; Fill in the blanks for the CI/CD flow.&lt;/p&gt;

&lt;p&gt;I said the workflow pushes images to &lt;em&gt;ECS&lt;/em&gt;. Wrong — it's &lt;em&gt;ECR&lt;/em&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;EC&lt;strong&gt;R&lt;/strong&gt; = &lt;strong&gt;R&lt;/strong&gt;egistry (storage)&lt;/li&gt;
&lt;li&gt;EC&lt;strong&gt;S&lt;/strong&gt; = &lt;strong&gt;S&lt;/strong&gt;ervice (orchestration)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Easy to mix up. Now I won't.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>docker</category>
      <category>cicd</category>
      <category>devops</category>
    </item>
    <item>
      <title>Race Conditions, Asyncio Locks &amp; Concurrency Patterns</title>
      <dc:creator>seah-js</dc:creator>
      <pubDate>Wed, 04 Feb 2026 13:57:17 +0000</pubDate>
      <link>https://dev.to/seahjs/race-conditions-asyncio-locks-concurrency-patterns-4fmo</link>
      <guid>https://dev.to/seahjs/race-conditions-asyncio-locks-concurrency-patterns-4fmo</guid>
      <description>&lt;p&gt;This is a Socratic-style learning session between me and &lt;a href="https://github.com/openclaw/openclaw" rel="noopener noreferrer"&gt;Klover&lt;/a&gt;, my AI assistant. Instead of reading docs or watching tutorials, I learn by explaining concepts in my own words and getting challenged on the gaps. The conversation format is the point — it captures the actual "aha" moments as they happen.&lt;/p&gt;

&lt;p&gt;In this session, I hit a real bug in an AWS pipeline and used it to understand race conditions, asyncio locks, thread pools, and what "concurrent" actually means in single-threaded async code.&lt;/p&gt;




&lt;h2&gt;
  
  
  The bug
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; I've got a pipeline on AWS that processes invoices. There's a shared manifest file in S3 that tracks processing status. Multiple async LLM API calls run at once, and each one updates the manifest after it finishes. Problem: if two LLM calls return at nearly the same time, they both read the old manifest, both write their update, and one overwrites the other. Lost data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Classic &lt;strong&gt;read-modify-write race condition&lt;/strong&gt;. Same fundamental pattern as why Python needs the GIL for reference counting — two things reading the same value, modifying independently, then writing back. One update gets stomped.&lt;/p&gt;

&lt;p&gt;The key issue: S3 doesn't lock files for you. There's no built-in mechanism that says "someone else is editing this, wait your turn." It's just object storage — you read bytes, you write bytes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; But this is single-threaded async. I thought race conditions were a multithreading thing?&lt;/p&gt;

&lt;h2&gt;
  
  
  Why single-thread async still has races
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Common misconception. The shared state here isn't in-memory — it's external, sitting in S3. Single thread protects in-memory variables (only one line of Python runs at a time), but your event loop switches tasks at every &lt;code&gt;await&lt;/code&gt; point.&lt;/p&gt;

&lt;p&gt;Here's what happens:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Task A finishes its LLM call, reads manifest from S3 (&lt;code&gt;await s3_read&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;At that &lt;code&gt;await&lt;/code&gt;, the event loop can switch. Task B also finishes, reads the &lt;strong&gt;same old manifest&lt;/strong&gt; from S3&lt;/li&gt;
&lt;li&gt;Task A writes its update to S3&lt;/li&gt;
&lt;li&gt;Task B writes its update to S3 — &lt;strong&gt;overwriting A's changes&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Single thread doesn't help because the "shared state" isn't protected by the thread. It's a file on a remote server.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; So even though only one line of code runs at a time, the damage happens because both tasks saw a stale version of the manifest before either wrote back.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Exactly. The race window is between the read and the write. Any &lt;code&gt;await&lt;/code&gt; in that window is an opportunity for another task to slip in.&lt;/p&gt;
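&lt;p&gt;The lost update is easy to reproduce in miniature — the &lt;code&gt;fake_s3_*&lt;/code&gt; helpers below just simulate a remote object with an &lt;code&gt;await&lt;/code&gt; inside:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import asyncio

remote_manifest = {"done": []}  # stands in for the S3 object

async def fake_s3_read():
    await asyncio.sleep(0)      # any await lets other tasks run
    return dict(remote_manifest)

async def fake_s3_write(data):
    await asyncio.sleep(0)
    remote_manifest.clear()
    remote_manifest.update(data)

async def finish_invoice(invoice_id):
    m = await fake_s3_read()             # both tasks read the same stale copy
    m["done"] = m["done"] + [invoice_id]
    await fake_s3_write(m)               # the later write stomps the earlier one

async def main():
    await asyncio.gather(finish_invoice("a"), finish_invoice("b"))

asyncio.run(main())
print(remote_manifest["done"])  # only one id survives — the other update is lost
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Run it a few times: one update is consistently lost.&lt;/p&gt;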

&lt;h2&gt;
  
  
  The naive fix vs the right fix
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; Would making the S3 reads and writes synchronous fix it?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Technically yes — if everything is sync, there's no &lt;code&gt;await&lt;/code&gt; for the event loop to switch at. But you'd kill all your concurrency. Your LLM calls, which are pure I/O waiting, would run one at a time. For a batch of invoices, that's brutal.&lt;/p&gt;

&lt;p&gt;Better approach: &lt;strong&gt;&lt;code&gt;asyncio.Lock()&lt;/code&gt;&lt;/strong&gt;. Keep the LLM calls concurrent, but serialize just the manifest updates.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; What is asyncio.Lock exactly?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Same concept as any lock — only one task can hold it at a time. The difference from &lt;code&gt;threading.Lock&lt;/code&gt; is that it's async-friendly. When a task is waiting to acquire the lock, it &lt;strong&gt;yields back to the event loop&lt;/strong&gt; instead of blocking the thread. Other tasks keep running while you wait.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;manifest_lock&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Lock&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;update_manifest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;invoice_result&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;manifest_lock&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# only one task in here at a time
&lt;/span&gt;        &lt;span class="n"&gt;manifest&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;read_manifest_from_s3&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;manifest&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;invoice_result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;invoice_result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;write_manifest_to_s3&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;manifest&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; So all the S3 read/writes for the manifest queue up one at a time?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Yes. Task A grabs the lock → reads → updates → writes → releases. Task B was waiting → grabs the lock → reads the manifest &lt;strong&gt;with A's update already in it&lt;/strong&gt; → updates → writes. No lost updates. Meanwhile, all your LLM API calls are still running concurrently outside the lock.&lt;/p&gt;

&lt;h2&gt;
  
  
  Single-thread async for I/O pipelines
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; Does single-thread async actually work well for an I/O-heavy pipeline like this? Feels like threading would be more powerful.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Single-thread async is &lt;em&gt;ideal&lt;/em&gt; for I/O-heavy work. You're mostly waiting on network responses — LLM APIs, S3 reads/writes. One thread juggling thousands of I/O waits uses almost no memory and has zero thread-switching overhead.&lt;/p&gt;

&lt;p&gt;It only breaks down when you have CPU-heavy steps in the pipeline — like parsing giant PDFs or running local models. The event loop can't switch during CPU work since there's no &lt;code&gt;await&lt;/code&gt; to pause at.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; So what if my pipeline has a mix — some I/O calls and some heavy parsing?&lt;/p&gt;

&lt;h2&gt;
  
  
  Threading for CPU-bound work
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; For mixed workloads, you offload the CPU parts to a &lt;strong&gt;thread pool&lt;/strong&gt; using &lt;code&gt;run_in_executor&lt;/code&gt;. The async event loop stays responsive for I/O, and CPU work runs on separate threads.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; What's a thread pool?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Pre-created threads sitting idle, waiting for work. Think of a taxi stand — the taxis are already there, engines running. When a job comes in, a taxi picks it up immediately instead of having to spawn a brand new car from scratch.&lt;/p&gt;

&lt;p&gt;Python gives you a default one:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;loop&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run_in_executor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cpu_heavy_fn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; Break that down for me.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;loop&lt;/code&gt; — the event loop (your async scheduler)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;run_in_executor&lt;/code&gt; — hand this work off to a thread pool&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;None&lt;/code&gt; — use Python's default thread pool&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;cpu_heavy_fn&lt;/code&gt; — the function to run on a thread&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;data&lt;/code&gt; — argument passed to that function&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;await&lt;/code&gt; — pause here, let the event loop do other stuff, resume when the thread finishes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You stay in async-land. The CPU work happens on a thread without blocking your event loop. (On Python 3.9+, &lt;code&gt;asyncio.to_thread(cpu_heavy_fn, data)&lt;/code&gt; is a convenient shorthand for the same thing.)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; Got it. But for my invoice pipeline where it's all API calls and S3 — just async with the lock is enough. No need for threads.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Correct. Don't add threading complexity when you don't need it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Sequential initiation vs concurrent waiting
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; Let me make sure I understand the full picture. For batch invoice processing: the event loop fires off all the LLM calls, they wait concurrently, and when each returns, it hits the asyncio lock to update the manifest one at a time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; That's the right mental model. One nuance — &lt;code&gt;asyncio.gather&lt;/code&gt; vs a &lt;code&gt;for&lt;/code&gt; loop with &lt;code&gt;await&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# concurrent waiting — all tasks run at the same time
&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;gather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nf"&gt;process_invoice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inv1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nf"&gt;process_invoice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inv2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nf"&gt;process_invoice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inv3&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# sequential — each one finishes before the next starts
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;inv&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;invoices&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;process_invoice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inv&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With &lt;code&gt;gather&lt;/code&gt;, all three tasks are kicked off and wait concurrently. With &lt;code&gt;for + await&lt;/code&gt;, you wait for each one to finish before starting the next.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; Wait — even with &lt;code&gt;gather&lt;/code&gt;, the event loop is single-threaded. So it still &lt;em&gt;initiates&lt;/em&gt; the tasks one by one, right? It's the &lt;strong&gt;waiting&lt;/strong&gt; that's concurrent, not the initiation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; That's a sharp catch, and you're right. I should be more precise. The event loop is single-threaded, so it does start each coroutine sequentially. But initiation is near-instant — it's just setting up the task and sending the network request. The heavy part is the waiting, and that's fully concurrent. So practically:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Initiation&lt;/strong&gt; — sequential, but microseconds per task&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Waiting&lt;/strong&gt; — truly concurrent, all tasks in-flight at the same time&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Manifest updates&lt;/strong&gt; — serialized by the lock&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For 100 invoices, you're not waiting 100× the LLM latency. You're waiting roughly 1× (plus lock overhead for manifest writes).&lt;/p&gt;
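You can see that "roughly 1×" claim directly by timing both patterns. `fake_invoice_call` below is a stand-in for the real LLM call, using `asyncio.sleep` to simulate the I/O wait:

```python
import asyncio
import time

async def fake_invoice_call(inv):
    # stand-in for a slow I/O call (e.g. an LLM request)
    await asyncio.sleep(0.1)
    return f"processed {inv}"

async def main():
    invoices = ["inv1", "inv2", "inv3"]

    # concurrent: all three waits overlap
    start = time.perf_counter()
    await asyncio.gather(*(fake_invoice_call(i) for i in invoices))
    concurrent = time.perf_counter() - start

    # sequential: each wait finishes before the next starts
    start = time.perf_counter()
    for inv in invoices:
        await fake_invoice_call(inv)
    sequential = time.perf_counter() - start

    # concurrent ≈ one sleep's worth (~0.1s); sequential ≈ three (~0.3s)
    print(f"gather: {concurrent:.2f}s, loop: {sequential:.2f}s")

asyncio.run(main())
```

Three tasks, but the `gather` version takes roughly as long as one.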

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; That's the bit that makes async powerful for this use case. Cool — I think I've got it.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Impromptu session: Feb 4, 2026. Practical async patterns for real-world pipelines — race conditions, locks, and concurrency.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>async</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Async &amp; Sync — Review</title>
      <dc:creator>seah-js</dc:creator>
      <pubDate>Wed, 04 Feb 2026 11:36:43 +0000</pubDate>
      <link>https://dev.to/seahjs/async-sync-review-4fim</link>
      <guid>https://dev.to/seahjs/async-sync-review-4fim</guid>
      <description>&lt;p&gt;This is a review session between me and &lt;a href="https://github.com/openclaw/openclaw" rel="noopener noreferrer"&gt;Klover&lt;/a&gt;, my AI assistant. Yesterday we covered async vs sync from scratch. Today Klover tests what I actually retained — then pushes deeper where the gaps show up.&lt;/p&gt;




&lt;h2&gt;
  
  
  Recall: I/O-bound vs CPU-bound
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Let's start simple. What's the difference between I/O-bound and CPU-bound work, and why does async only help one of them?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; I/O-bound is when you're calling external services like APIs. CPU-bound is heavy computation. Async helps I/O-bound because the thread can move on to other work while waiting for the external thing to come back.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Good — you've got the core idea. Small addition though: I/O isn't just APIs. It's anything where the CPU is idle and waiting — disk reads, database queries, network calls, file writes. All I/O. Now explain the event loop. And what happens if you put CPU-heavy code inside one?&lt;/p&gt;

&lt;h2&gt;
  
  
  Event loop + blocking
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; The event loop delegates work to different tasks. But I'm not sure what you mean by CPU-heavy code inside an event loop?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Remember the chef analogy? One chef, single kitchen. If one dish requires 10 minutes of non-stop chopping — what happens to every other dish?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; They're on hold. Nothing else gets worked on.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Exactly. Everything blocks. The event loop is single-threaded — one task hogging the CPU means every other task just sits in the queue waiting. That's why CPU-bound work inside an event loop is dangerous. Now, different topic — what's Python's GIL?&lt;/p&gt;
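Here's a minimal demonstration of that blocking. The heartbeat task should tick every 50ms, but once `crunch()` starts, the loop's only thread is occupied and the remaining ticks are frozen until it returns:

```python
import asyncio

def crunch():
    # CPU-bound: holds the loop's only thread, never hits an await
    return sum(range(10_000_000))

async def heartbeat():
    for _ in range(3):
        print("tick")
        await asyncio.sleep(0.05)

async def main():
    hb = asyncio.create_task(heartbeat())
    await asyncio.sleep(0)   # yield once so the heartbeat gets to start
    crunch()                 # every other task is stuck until this returns
    await hb                 # remaining ticks only happen after crunch() is done

asyncio.run(main())
```

The standard escape hatches are `asyncio.to_thread` for blocking I/O calls, and a `ProcessPoolExecutor` for CPU-bound Python, since a thread alone won't buy true parallelism (more on that next).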

&lt;h2&gt;
  
  
  The GIL
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; The GIL only allows a single thread to execute Python code at a time. The workaround is multiprocessing — each process gets its own GIL, so you can actually run things in parallel.&lt;/p&gt;


&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Good recall. But &lt;em&gt;why&lt;/em&gt; does the GIL exist? What problem is it solving?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; It prevents memory leakage. And it stops the same thing from being worked on by multiple threads at the same time, which would cause conflicts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Close, but it's not memory leakage exactly. Python uses &lt;strong&gt;reference counting&lt;/strong&gt; to manage memory — every object has a counter tracking how many things point to it. When the count hits zero, the object gets cleaned up. The GIL exists because if two threads modify that reference count at the same time, you get a &lt;strong&gt;race condition&lt;/strong&gt;. The count gets corrupted.&lt;/p&gt;

&lt;h2&gt;
  
  
  Race conditions — step by step
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; I get that two threads shouldn't touch the same thing, but how does the count actually get corrupted?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Walk through it. Say an object's reference count is 3.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Thread A&lt;/strong&gt; reads the count: sees &lt;strong&gt;3&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Thread B&lt;/strong&gt; also reads the count: sees &lt;strong&gt;3&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Thread A&lt;/strong&gt; decrements and writes: &lt;strong&gt;2&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Thread B&lt;/strong&gt; decrements and writes: &lt;strong&gt;2&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Final count: &lt;strong&gt;2&lt;/strong&gt;. But two references were removed — it should be &lt;strong&gt;1&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Both threads read the old value before either wrote. Thread B's write just overwrites Thread A's work. Now scale that up:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If the count never reaches 0 → the object never gets cleaned up → &lt;strong&gt;memory leak&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;If the count hits 0 too early → the object gets freed while something still points to it → &lt;strong&gt;crash&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The GIL prevents this by making sure only one thread touches Python objects at a time. Brute force, but it works.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; Okay that makes sense now. Both threads reading the same stale value before either one writes — that's the race.&lt;/p&gt;

&lt;h2&gt;
  
  
  The async trap
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Last one. What's wrong with this?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_data&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.example.com/data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; &lt;code&gt;requests&lt;/code&gt; is synchronous. Even though the function is &lt;code&gt;async&lt;/code&gt;, the thread has to sit there and wait for the response. It's blocking.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Exactly. It never yields control back to the event loop. There's no pause point, so the event loop can't switch to other tasks. Your "async" function is secretly sync. The fix: use an async HTTP client like &lt;code&gt;httpx&lt;/code&gt; with &lt;code&gt;await&lt;/code&gt; — that &lt;code&gt;await&lt;/code&gt; is the pause/resume point that lets the event loop do its job.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_data&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;AsyncClient&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.example.com/data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;p&gt;&lt;em&gt;Review session: Feb 4, 2026. Exposure → Developing. Core concepts retained. Deeper understanding of GIL race conditions gained during review.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>async</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Multi-head Latent Attention (MLA) — Review</title>
      <dc:creator>seah-js</dc:creator>
      <pubDate>Wed, 04 Feb 2026 11:13:03 +0000</pubDate>
      <link>https://dev.to/seahjs/multi-head-latent-attention-mla-review-4jbm</link>
      <guid>https://dev.to/seahjs/multi-head-latent-attention-mla-review-4jbm</guid>
      <description>&lt;p&gt;This is a review session between me and &lt;a href="https://github.com/openclaw/openclaw" rel="noopener noreferrer"&gt;Klover&lt;/a&gt;, my AI assistant. The idea is simple — Klover quizzes me on something I've learned before, I explain it in my own words, and we go deeper on the parts that stick (or don't). This format captures the actual back-and-forth, including the mistakes.&lt;/p&gt;

&lt;p&gt;This time: Multi-head Latent Attention (MLA), the mechanism behind DeepSeek's efficient inference.&lt;/p&gt;

&lt;h2&gt;
  
  
  What does MLA solve?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Let's start with the basics. What problem does MLA solve?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; It saves on KV cache. Instead of storing the full key and value vectors for every token, MLA stores a smaller latent vector. That latent vector gets decoded back into keys and values during inference when you actually need them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Good — you've got the core idea. Two follow-ups. First: how does that compression actually work, mechanically? And second: if the goal is to reduce memory, why not just use fewer attention heads?&lt;/p&gt;

&lt;h2&gt;
  
  
  Compression mechanism and why not fewer heads
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; The compression is matrix multiplication. You encode the full KV into a smaller latent space, then decode it back when needed. As for fewer heads — different heads attend to different aspects of the input. If you just cut heads, you lose those different perspectives entirely. MLA preserves the multi-head relationships but stores everything compactly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Right. The key detail: MLA uses &lt;strong&gt;learned linear projections&lt;/strong&gt;. There's a down-projection matrix (W_c) that compresses the KV into the latent vector, and up-projection matrices (W_uk, W_uv) that reconstruct keys and values per head during attention.&lt;/p&gt;

&lt;p&gt;The distinction that matters — this is a &lt;strong&gt;learned&lt;/strong&gt; compression, not a post-hoc one. The model learns &lt;em&gt;how&lt;/em&gt; to compress during training, so the latent vector captures what actually matters for attention. That's fundamentally different from just truncating or dropping heads after the fact.&lt;/p&gt;
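The shapes make this concrete. Below is a sketch with random (untrained) matrices and toy dimensions chosen for illustration, not DeepSeek's real sizes — the point is what gets cached versus what gets reconstructed:

```python
import numpy as np

# Toy dimensions (assumptions for illustration only)
d_model, d_latent, n_heads, d_head = 512, 64, 8, 64

rng = np.random.default_rng(0)
W_c  = rng.standard_normal((d_model, d_latent)) * 0.02            # down-projection
W_uk = rng.standard_normal((d_latent, n_heads * d_head)) * 0.02   # up-projection (keys)
W_uv = rng.standard_normal((d_latent, n_heads * d_head)) * 0.02   # up-projection (values)

h = rng.standard_normal((1, d_model))   # one token's hidden state

c = h @ W_c        # this is what gets cached: 64 floats per token
k = c @ W_uk       # keys reconstructed (all heads) at attention time
v = c @ W_uv       # values likewise

full_kv = 2 * n_heads * d_head          # 1024 floats/token without MLA
print(c.size, full_kv)                  # 64 vs 1024 → 16x smaller cache
```

In a real model W_c, W_uk, W_uv are learned during training, which is exactly why the latent keeps what attention needs.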

&lt;h2&gt;
  
  
  Memory vs compute
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Next question: does MLA save compute during training?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; No, it's actually more compute. You have extra encode and decode steps — the compression and decompression aren't free.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Correct. The savings are about &lt;strong&gt;memory&lt;/strong&gt;, not compute. During inference, the KV cache is the main bottleneck — it grows linearly with sequence length and batch size, and it's what limits how many tokens you can process or how many users you can serve. MLA shrinks that cache dramatically.&lt;/p&gt;
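A back-of-envelope calculation shows why the cache dominates. All hyperparameters below are assumed round numbers, not any particular model's:

```python
# Assumed hyperparameters (fp16, 2 bytes per value)
layers, heads, d_head = 32, 32, 128
seq_len, batch, bytes_per = 8192, 8, 2

# Full KV cache: 2 vectors (K and V) per layer per head per token
full_kv = 2 * layers * heads * d_head * seq_len * batch * bytes_per

# MLA: one latent vector per layer per token (latent dim assumed = 512)
d_latent = 512
mla_cache = layers * d_latent * seq_len * batch * bytes_per

print(f"full KV: {full_kv / 1e9:.1f} GB")   # ~34.4 GB
print(f"MLA:     {mla_cache / 1e9:.1f} GB") # ~2.1 GB
```

Tens of gigabytes just for the cache, before the model weights are even counted — and it scales linearly with both `seq_len` and `batch`.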

&lt;h2&gt;
  
  
  Pushback on training memory
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; Wait — isn't the KV cache during training a win too? You still have KV stored in memory during the forward pass.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Good catch, actually. During training, the activations from the forward pass need to be stored for the backward pass. If you store the latent vectors instead of the full KV, that's less activation memory — similar in spirit to gradient checkpointing. So yes, there is a training memory benefit.&lt;/p&gt;

&lt;p&gt;But the relative win is much smaller than inference. During training, you have a fixed batch and sequence length, and the activation memory is just one part of the total memory budget (model parameters, optimizer states, gradients all compete for space). During inference, the KV cache is often &lt;em&gt;the&lt;/em&gt; dominant memory cost, especially for long sequences. That's where MLA really shines.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's the risk?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Last one. What's the risk of this approach?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; It's lossy compression. You're squeezing KV into a lower-dimensional space, so some information gets lost. Attention quality could drop.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Exactly. The latent dimension is a tuning knob — smaller means more compression and less memory, but more information loss. Too aggressive and your attention patterns degrade. The art is finding the sweet spot where you get major memory savings without meaningful quality loss.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Review session: Feb 4, 2026. Exposure → Developing. All core concepts retained, good pushback on training memory savings.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>architecture</category>
      <category>deeplearning</category>
      <category>beginners</category>
    </item>
    <item>
      <title>ReAct Pattern</title>
      <dc:creator>seah-js</dc:creator>
      <pubDate>Wed, 04 Feb 2026 02:35:56 +0000</pubDate>
      <link>https://dev.to/seahjs/react-pattern-38d1</link>
      <guid>https://dev.to/seahjs/react-pattern-38d1</guid>
      <description>&lt;h2&gt;
  
  
  What is ReAct?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; ReAct stands for &lt;strong&gt;Reasoning + Acting&lt;/strong&gt; — explicitly interleaved. The idea is simple: instead of just having an LLM answer a question in one shot, you let it think out loud, take actions (like calling tools), observe the results, and then think again. It's a loop.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; so it's like chain-of-thought but with actual tool use mixed in?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Exactly. The loop looks like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Thought&lt;/strong&gt; — the model reasons about what to do next&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Action&lt;/strong&gt; — the model calls a tool (search, calculator, API, etc.)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Observation&lt;/strong&gt; — the system injects the real result from that tool&lt;/li&gt;
&lt;li&gt;Repeat until the model decides it has enough info → &lt;strong&gt;Final Answer&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The key word is "interleaved." It's not "think everything through then act" — it's think a little, act a little, observe, adjust.&lt;/p&gt;
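The loop fits in a few lines. Everything below is a stand-in: `SCRIPT` replaces a real LLM, `TOOLS` is a hypothetical registry, and the regex format is just one possible convention — but the shape (generate, intercept, execute, inject, repeat) is the real pattern:

```python
import re

# Stand-in "model": scripted outputs instead of real LLM calls
SCRIPT = [
    'Thought: I need the weather.\nAction: weather["Singapore"]',
    'Thought: Sunny and hot. I have enough.\nFinal Answer: Go early, it is sunny.',
]

TOOLS = {"weather": lambda q: "32°C, sunny"}   # hypothetical tool registry

def call_model(history, step):
    return SCRIPT[step]   # a real system would send `history` to an LLM here

def react_loop(question, max_steps=5):        # max_steps guards against loops
    history = f"Question: {question}"
    for step in range(max_steps):
        out = call_model(history, step)
        history += "\n" + out
        match = re.search(r'Action: (\w+)\["(.+)"\]', out)
        if match:  # intercept the tool call, run it, inject the result
            name, arg = match.groups()
            history += f"\nObservation: {TOOLS[name](arg)}"
        elif "Final Answer:" in out:
            return out.split("Final Answer:")[1].strip()
    return "Stopped at max_steps."   # forced answer: the infinite-loop guard

print(react_loop("Should I go to the beach?"))
```

Note the `max_steps` cutoff — that detail matters later.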

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; how does the system know when the model wants to call a tool vs just talking?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Stop sequences. The model is trained (or prompted) to output in a specific format — like &lt;code&gt;Action: search["weather in Singapore"]&lt;/code&gt;. The system watches for that pattern, intercepts it, actually runs the tool, then injects the result back as &lt;code&gt;Observation: 32°C, sunny&lt;/code&gt;. The model never actually executes anything. It just writes text that looks like a tool call, and the system does the rest.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; so the LLM doesn't know it's calling real tools? it just thinks it's generating text?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Right. From the model's perspective, it's just doing next-token prediction. The magic is in the scaffolding around it — the system that intercepts, executes, and injects.&lt;/p&gt;

&lt;h2&gt;
  
  
  A practical example
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Say you ask a ReAct agent: "Should I go to the beach this Saturday?"&lt;/p&gt;

&lt;p&gt;The agent might do:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Thought:&lt;/strong&gt; I need to check the weather for Saturday.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Action:&lt;/strong&gt; &lt;code&gt;weather_forecast("Saturday")&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Observation:&lt;/strong&gt; Saturday: 34°C, sunny, UV index 11, no rain.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Thought:&lt;/strong&gt; Weather looks great, but UV is very high. Let me check tide times too.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Action:&lt;/strong&gt; &lt;code&gt;tide_info("Saturday")&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Observation:&lt;/strong&gt; High tide at 2pm, low tide at 8am.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Thought:&lt;/strong&gt; Good beach weather, but go early to avoid peak UV and catch low tide. I have enough info now.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Final Answer:&lt;/strong&gt; Yes! Saturday looks perfect — sunny, 34°C. Go in the morning around 8-9am when the tide is low and UV isn't brutal yet.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Multiple steps, each one informed by the last. The model decides when it has enough.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; got it. so it's basically an LLM in a loop.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; That's the most concise way to put it. An LLM in a loop, with tool access and the ability to decide when to stop. That's what an "agent" is at its core.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; wait — are you a ReAct agent right now?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Pretty much. I reason about what you're asking, I can call tools (search, read files, run code), I observe the results, and I decide what to say next. You're watching the pattern in action.&lt;/p&gt;

&lt;h2&gt;
  
  
  Failure modes
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; what goes wrong with this?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Two big failure modes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;First: infinite loops.&lt;/strong&gt; The model might keep calling tools without converging on an answer. It searches for something, doesn't find what it wants, rephrases, searches again, rephrases again… forever. The fix is simple — set a &lt;strong&gt;max iteration limit&lt;/strong&gt;. After N steps, force it to give a final answer with whatever it has.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Second: context overflow.&lt;/strong&gt; Every Thought, Action, and Observation gets appended to the conversation. After 10-15 steps, that's a LOT of tokens. Eventually you hit the context window limit and the model just breaks. But the real problem starts before you even hit the limit.&lt;/p&gt;

&lt;h2&gt;
  
  
  Context degradation
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; what do you mean "before the limit"?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; The context window might be 128K tokens, but performance degrades long before you fill it. Think about how attention works — the softmax function spreads attention weight across all tokens. More tokens = each one gets a thinner slice of attention.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; like a pie with more slices.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Exactly. A pie cut into 10 slices vs 1000 slices. Each individual slice becomes almost meaningless. The model can still "see" all the tokens technically, but its ability to actually focus on any one of them gets diluted.&lt;/p&gt;

&lt;p&gt;And there's another effect on top of this — &lt;strong&gt;"lost in the middle."&lt;/strong&gt; Research shows models disproportionately attend to the beginning and end of context, and neglect the middle. So if your key reasoning happened in step 5 of a 15-step chain, it might as well not exist.&lt;/p&gt;
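The pie effect is easy to put in numbers. Real attention logits aren't uniform, but since the weights must sum to 1, the uniform case shows the best any "average" token can hope for:

```python
import math

def softmax(logits):
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# With equal logits, attention splits evenly — each token gets 1/n
for n in (10, 1000, 100_000):
    weight = softmax([0.0] * n)[0]
    print(f"{n:>6} tokens -> {weight:.6f} per token")
```

At 100K context, an average token's slice is five orders of magnitude thinner than at 10 tokens.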

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; so long ReAct chains get hit twice — diluted attention AND lost-in-the-middle?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Exactly. Double hit. Which is why real systems don't just naively append everything.&lt;/p&gt;

&lt;h2&gt;
  
  
  The fix: summarize between steps
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; The practical fix is to &lt;strong&gt;compress context between steps&lt;/strong&gt;. Instead of keeping the raw Thought/Action/Observation history, you periodically summarize it.&lt;/p&gt;

&lt;p&gt;Think of it as a sliding window — you keep the last few exchanges verbatim for detail, and everything before that gets compressed into a summary. LangChain has &lt;code&gt;ConversationSummaryBufferMemory&lt;/code&gt; that does exactly this: recent messages stay raw, older ones get summarized by an LLM.&lt;/p&gt;
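A minimal sketch of that sliding window, with `summarize` as a stand-in (a real system would call an LLM to write the summary):

```python
def summarize(summary, dropped):
    # stand-in summarizer: a real system would ask an LLM to compress `dropped`
    return summary + " | " + "; ".join(s[:30] for s in dropped)

class SummaryBuffer:
    def __init__(self, keep_last=3):
        self.keep_last = keep_last     # how many recent steps stay verbatim
        self.summary = "Summary:"
        self.recent = []

    def add(self, step):
        self.recent.append(step)
        if len(self.recent) > self.keep_last:
            dropped = self.recent[: -self.keep_last]       # oldest steps...
            self.recent = self.recent[-self.keep_last :]
            self.summary = summarize(self.summary, dropped)  # ...get compressed

    def context(self):
        # what the model actually sees: one summary line + raw recent steps
        return "\n".join([self.summary, *self.recent])

buf = SummaryBuffer(keep_last=2)
for i in range(5):
    buf.add(f"Thought/Action/Observation #{i}")
print(buf.context())
```

The context stays bounded no matter how long the chain runs; the cost is whatever the summarizer chose to throw away.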

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; but the summary is lossy right? the summarizer might drop something important.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Yes. That's the tradeoff. It's lossy compression. The summarizer might decide a detail was unimportant and drop it, but then 10 steps later that detail turns out to be critical. There's no perfect solution — it's a tradeoff between context freshness and information retention.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why the Thought step matters — CoT as computation
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; why does the explicit Thought step matter so much? couldn't the model just go straight to Action?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; This is one of the deeper insights. Chain-of-thought isn't just a prompting trick — it's &lt;strong&gt;extra computation&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Without CoT, the model gets one forward pass to produce an answer. That's it — the entire computation happens in the fixed depth of the transformer network. With CoT, every token the model generates becomes new context for the next token. Each generated token is the product of a full forward pass, and it feeds into the next one. You're effectively chaining multiple forward passes together.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; but isn't each forward pass independent? the only difference is it has new context from the previous tokens. it's not like the model "remembers" — it just sees more text.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; You're right about the mechanism — yes, each pass is independent, and the only connection is through the context. But think about the effect: you've turned a fixed-depth computation (one pass through N transformer layers) into a &lt;strong&gt;variable-depth computation&lt;/strong&gt; (multiple passes, each informed by the output of the last). The mechanism is context. The effect is more computation. You're chaining fixed-depth into variable-depth.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; so the extra computation is encapsulated in the generated tokens themselves.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Yes — each token is a compressed artifact of a full forward pass. All those layers of attention and feedforward processing get distilled into a single token, which then becomes context for the next pass. The output tokens are essentially &lt;strong&gt;working memory&lt;/strong&gt; — scratch space where the model does its intermediate computation. Without CoT, there's no scratch space. One shot, one pass.&lt;/p&gt;

&lt;h2&gt;
  
  
  ReAct vs other patterns
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; are there alternatives to ReAct?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Two main ones worth comparing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pure tool calling&lt;/strong&gt; — no explicit Thought step. The model just emits tool calls directly. Faster, fewer tokens, but the model doesn't reason about what to do before doing it. Works fine for simple tasks, falls apart for complex multi-step reasoning.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Plan-first&lt;/strong&gt; — the model creates a full plan upfront (step 1, step 2, step 3…), then executes it. Advantage: it can see dependencies and think ahead. "I need X before I can do Y." Disadvantage: plans are brittle. They don't survive contact with reality. What if step 2 returns unexpected data that invalidates step 3?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; so ReAct is more adaptive but short-sighted, and plan-first thinks ahead but breaks easily?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; That's a clean summary. ReAct is &lt;strong&gt;adaptive but myopic&lt;/strong&gt; — it only thinks one step ahead. Plan-first is &lt;strong&gt;strategic but brittle&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The practical sweet spot is a &lt;strong&gt;hybrid: plan and replan&lt;/strong&gt;. Make a rough plan, start executing, but be willing to replan when observations don't match expectations. This is essentially what frameworks like LangGraph enable — you define a graph of steps with conditional edges, so the agent can branch and loop based on what actually happens.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; so nobody actually uses pure ReAct in production?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Klover:&lt;/strong&gt; Not in its textbook form, no. Real systems take the core insight — interleave reasoning with acting — and add structure around it. Planning, memory management, guardrails, fallbacks. But the ReAct loop is the foundation all of it builds on.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Session: Feb 4, 2026. First exposure — covered full ReAct pattern including failure modes, CoT mechanics, and agent pattern comparisons.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>machinelearning</category>
      <category>beginners</category>
    </item>
  </channel>
</rss>
