Srijan Shukla

Posted on Jun 8

AI Builder Notes - Week of June 8, 2026

#agents #ai #automation #testing

AI-assisted notes from my liked-tweets feed, organized around agent loops, cloud agent infrastructure, skill security, memory, and runtime context. Treat this as a source of information, not as a finished essay.

Practical takeaways

Put validation inside the agent loop. Backpressure forces the agent to fix code before a human sees it. The system runs typechecks, lint, tests, builds, and browser checks, then pushes failures straight back to the agent. [1] [2]
Dynamic workflows are disposable verification harnesses. Claude Code can write a temporary script to extract every technical claim from a draft and test it against the repo before publishing. [3] [4]
Cloud agents are infrastructure products. The hard parts are pod lifecycles, stream rewinds, state isolation, and hiding stale output during retries. [5] [6]
Treat skills as a supply chain. Agents are loading skills from APIs and repos, so skill PRs need scanners to catch shadow commands and context leaks. [7] [8]
Replace generic prompts with runtime context. Give the agent the failing curl, log excerpt, trace, or database row. [9]
Work memory is shared state. It tracks what is current, what already failed, and what another agent can trust. [10] [11]

Agent loops

Without backpressure, the agent writes code and hands it to a human. The human spots a missing import or broken test and tells the agent to retry.

Backpressure moves the harness in front of the human. The system runs the checks (typecheck, lint, tests, build, logs, and browser checks) and sends any failure straight back to the agent. The human reviews only intent. [1]
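
A minimal Python sketch of that loop, under stated assumptions: the check commands are whatever your project already runs, and `ask_agent_to_fix` is a hypothetical callback standing in for however you hand failures back to your agent. None of these names come from the linked posts.

```python
import subprocess
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class CheckResult:
    name: str
    passed: bool
    output: str


def run_check(name: str, command: Sequence[str]) -> CheckResult:
    """Run one check command and capture its combined stdout and stderr."""
    proc = subprocess.run(command, capture_output=True, text=True)
    return CheckResult(name, proc.returncode == 0, (proc.stdout + proc.stderr).strip())


def backpressure_loop(
    checks: dict[str, Sequence[str]],
    ask_agent_to_fix: Callable[[list[CheckResult]], None],  # hypothetical agent hook
    max_rounds: int = 3,
) -> bool:
    """Return True once every check passes, False if the fix budget runs out."""
    for round_number in range(max_rounds + 1):
        results = [run_check(name, command) for name, command in checks.items()]
        failures = [result for result in results if not result.passed]
        if not failures:
            return True  # Only now does the change reach a human reviewer.
        if round_number == max_rounds:
            break  # Budget spent: escalate instead of looping forever.
        ask_agent_to_fix(failures)  # Failures go to the agent, not the human.
    return False
```

The round budget is a design choice for the sketch: without it, an agent that cannot satisfy a check would retry indefinitely.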

May's notes covered running multiple agents. The newer pattern is to generate a disposable workflow for a single strict task. Claude Code can write a JavaScript harness to verify a blog post: extract every technical claim, map claims to files, run checks, and output contradictions. [3]
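
To make the shape of such a harness concrete, here is a rough sketch in Python rather than JavaScript. It makes a strong simplifying assumption: a "technical claim" is any backticked span in the draft, and a claim holds if it matches a file path or appears somewhere in the repo text. A real harness would extract claims far more carefully. The function names are illustrative only.

```python
import re
from pathlib import Path

# Assumption: backticked spans are the only claims worth checking.
CLAIM_PATTERN = re.compile(r"`([^`]+)`")


def extract_claims(draft: str) -> list[str]:
    """Collect each distinct backticked span from the draft, sorted."""
    return sorted(set(CLAIM_PATTERN.findall(draft)))


def find_contradictions(draft: str, repo_root: Path) -> list[str]:
    """Return claims that match neither a file path nor any file content."""
    files = [path for path in repo_root.rglob("*") if path.is_file()]
    relative_paths = {path.relative_to(repo_root).as_posix() for path in files}
    contents = [path.read_text(errors="ignore") for path in files]

    unsupported = []
    for claim in extract_claims(draft):
        if claim in relative_paths:
            continue  # The claim names a file that exists.
        if any(claim in text for text in contents):
            continue  # The claim appears verbatim in the repo.
        unsupported.append(claim)
    return unsupported
```

The script is meant to be thrown away after the post ships, which is why it skips caching, configuration, and anything else a long-lived tool would need.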

A workflow is a team with three roles: plan, fleet, and breaker. Dynamic workflows work best when a task needs separate planning, execution, and adversarial review. [12]

If the verification procedure is less precise than a human running three shell commands, just run the commands.

Cloud agents

Peter Pang's post explains why simply moving a desktop agent onto a server is not enough: it ignores the actual operating layer. [5]

Once the loop leaves the laptop, the hard problems are distributed systems: who owns machine state, how pods recover, and how retries interact with streamed output. If retries and streaming are not handled carefully, the user experience breaks when clients see stale partial code. Cursor uses Temporal to decouple the agent loop from the VM and manages pod lifecycles separately.
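
One way to picture the stale-output problem is to tag every streamed chunk with the attempt that produced it and let the client rewind when a newer attempt starts. The sketch below is an illustration of that idea only. It is not how Cursor or Temporal implement it, and `Chunk` and `StreamView` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Chunk:
    attempt: int  # Which retry produced this piece of output.
    text: str


@dataclass
class StreamView:
    """Client-side buffer that only ever shows output from the newest attempt."""

    current_attempt: int = 0
    parts: list[str] = field(default_factory=list)

    def apply(self, chunk: Chunk) -> None:
        if chunk.attempt < self.current_attempt:
            return  # Late chunk from a superseded attempt: hide it.
        if chunk.attempt > self.current_attempt:
            self.current_attempt = chunk.attempt
            self.parts.clear()  # Rewind: drop partial output from the old attempt.
        self.parts.append(chunk.text)

    def render(self) -> str:
        return "".join(self.parts)
```

With this in place, a chunk from attempt 0 that arrives after attempt 1 has started is dropped instead of being spliced into the newer output.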

Skills

Hiten Shah suggested capturing how your best people work and making those patterns reusable. [13]

Vercel's skills.sh API puts this into practice: over 600,000 searchable skills and project-scoped OIDC auth. [7] [14]

If skills act like packages, they need security reviews. The risk comes from autonomous agents acting on hijacked instructions, not just bad markdown existing in a repo. NVIDIA's SkillSpector scans agent skills for hidden instructions, context leakage, and shadow command triggers. [8] [15]
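
As a rough idea of what a first-pass check on a skill PR could look like, here is a small pattern-based scanner. It is not SkillSpector and makes no claim about how that tool works. The rule names and regexes are my own assumptions, and pattern matching like this will miss anything phrased creatively.

```python
import re

# Hypothetical rule set; a real scanner would need far broader coverage.
RULES = {
    "hidden-html-comment": re.compile(r"<!--.*?-->", re.DOTALL),
    "instruction-override": re.compile(
        r"ignore (all )?(previous|prior) instructions", re.IGNORECASE
    ),
    "pipe-to-shell": re.compile(r"\b(curl|wget)\b[^\n|]*\|\s*(sh|bash)\b"),
    "zero-width-character": re.compile("[\u200b\u200c\u200d\u2060]"),
}


def scan_skill(markdown: str) -> list[str]:
    """Return the names of every rule that matches the skill text."""
    return [name for name, pattern in RULES.items() if pattern.search(markdown)]
```

A non-empty result is a reason to block the PR for human review, not proof of an attack.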

Runtime context

Agents fail when they read source code and invent a theory. Provide evidence: a failing test, a trace, a request payload, or exact command output. [9]
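
A small sketch of packaging that evidence, assuming the failing behavior can be reproduced with a shell command. `capture_evidence` and `build_task_prompt` are hypothetical helpers, and the prompt wording is only an example.

```python
import shlex
import subprocess
from typing import Sequence


def capture_evidence(command: Sequence[str], max_chars: int = 4000) -> str:
    """Run a command and record exactly what it printed and how it exited."""
    proc = subprocess.run(command, capture_output=True, text=True)
    output = (proc.stdout + proc.stderr).strip()
    return (
        f"$ {shlex.join(command)}\n"
        f"exit code: {proc.returncode}\n"
        f"{output[-max_chars:]}"  # Keep the end, where the failure usually is.
    )


def build_task_prompt(goal: str, evidence: list[str]) -> str:
    """Put the verbatim evidence next to the goal instead of a generic prompt."""
    sections = [f"Goal: {goal}", "Evidence (verbatim command output):"]
    sections.extend(evidence)
    return "\n\n".join(sections)
```

The point of recording the exact command and exit code is that the agent can rerun it to confirm a fix, rather than reasoning from the source alone.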

PostHog Autoresearch worked because the scope was narrow. They gave an agent slow production queries and the query-engine source, let it run overnight, and got a fix for a 3-year-old bug that improved performance by 11%. That is the right shape for an agent task: real production artifact, narrow source context, fixed time budget, and a measurable result. [16]

Memory

May's links treated memory as a personal archive. This week's links treat memory as shared work state.

Agents need to compress work into state. [10] Mem0 positions memory inside the harness alongside tools and coordination. [11] [17]

Quarq hit 98.2% on LongMemEval for continual learning. [18] GBrain builds an agent-native knowledge graph over markdown with a nightly synthesis cycle. [19]

A personal archive answers what was saved. Work memory answers what is safe to act on. If two agents retrieve conflicting versions of a plan, you have drift.
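
Drift of this kind is easy to detect if every write to shared memory bumps a version number and each agent remembers which version it read. The sketch below assumes an in-process store. `WorkMemory` is a made-up class, not the API of Mem0, GBrain, or any other tool mentioned here.

```python
class WorkMemory:
    """Toy shared store: each key keeps its full history of values."""

    def __init__(self) -> None:
        self._history: dict[str, list[str]] = {}

    def write(self, key: str, value: str) -> int:
        """Append a new value and return its 1-based version number."""
        versions = self._history.setdefault(key, [])
        versions.append(value)
        return len(versions)

    def read(self, key: str) -> tuple[int, str]:
        """Return the latest version number and value for a key."""
        versions = self._history[key]
        return len(versions), versions[-1]

    def stale_readers(self, key: str, held: dict[str, int]) -> list[str]:
        """List agents whose remembered version is not the latest one."""
        latest, _ = self.read(key)
        return sorted(agent for agent, version in held.items() if version != latest)
```

If `stale_readers` returns anyone, those agents should re-read the plan before acting on it.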

Browser and agent infra

These tools sit below the browser-skill layer, dealing with page maps, runtime cost, command-output compaction, local model access, and human interruption channels.

Hyperbrowser /web creates a web.md map of a site for agents. [20] [21]

Browser Use is running custom runtimes to drop cold starts and browser-hour costs. [22] [23]

RTK filters and truncates shell output before the model sees it. AVB reported 2.5M tokens saved across coding agents in two weeks. [26] [27]
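
The general idea of filtering and truncating can be sketched in a few lines. This is not RTK's implementation. It is a generic illustration that drops blank lines, collapses consecutive duplicates, and keeps only the head and tail of long output. The function name and defaults are assumptions.

```python
def compact_output(text: str, head: int = 5, tail: int = 10) -> str:
    """Shrink shell output before it is handed to a model."""
    lines: list[str] = []
    for line in text.splitlines():
        stripped = line.rstrip()
        if not stripped:
            continue  # Blank lines carry no signal.
        if lines and lines[-1] == stripped:
            continue  # Collapse consecutive repeats such as retry spam.
        lines.append(stripped)

    if len(lines) <= head + tail:
        return "\n".join(lines)

    omitted = len(lines) - head - tail
    marker = f"... {omitted} lines omitted ..."
    # Slice from an explicit index so tail=0 yields an empty tail.
    return "\n".join(lines[:head] + [marker] + lines[len(lines) - tail:])
```

Keeping the tail matters more than keeping the head for most build and test tools, since the failure summary is usually printed last.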

API for Cursor exposes Cursor Composer models to other coding agents via a local API. [24] [25]

Razorpay shipped a CLI + MCP combo. Humans get dashboards, agents get CLIs. [28] [29]

Peter Steinberger's sag lets an agent interrupt a human when blocked by a 1Password prompt or release gate. [30] [31]

Models and evals

NVIDIA Nemotron 3 Ultra claims 550B total parameters, 55B active, hybrid Mamba-Transformer MoE, and a 1M context window. [32] [33]

MiniMax M3 claims high SWE-Bench Pro and Terminal Bench numbers. [34]

Liquid LFM2.5-VL Extract returns structured JSON from images. [35] [36]

Nemotron 3.5 ASR Streaming runs 40 languages with controllable 80ms to 1s latency for voice agents. [37]

Anthropic warned that remote MCP servers can change behavior after approval, and persistent context increases blast radius. [38]
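
One possible mitigation, offered as my own sketch rather than anything from the linked warning, is to fingerprint each tool definition at approval time and compare again before every session. The code treats tool definitions as plain dictionaries and does not use any MCP SDK. All function names are hypothetical.

```python
import hashlib
import json


def fingerprint(tool: dict) -> str:
    """Hash a tool definition in a key-order-independent way."""
    canonical = json.dumps(tool, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def approve(tools: list[dict]) -> dict[str, str]:
    """Record the fingerprint of every tool the user reviewed."""
    return {tool["name"]: fingerprint(tool) for tool in tools}


def changed_since_approval(approved: dict[str, str], tools: list[dict]) -> list[str]:
    """Return tools that are new or whose definition no longer matches."""
    changed = [
        tool["name"]
        for tool in tools
        if approved.get(tool["name"]) != fingerprint(tool)
    ]
    return sorted(changed)
```

This only catches changes to what the server declares. A server that keeps the same declaration but behaves differently at call time would pass the check.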

Agent Arena evaluates live sessions. Static prompts hide failures in loops, tools, permissions, and steering. [39] [40]

Source range: 248 liked tweets from June 1, 2026 through June 7, 2026, collected from my authenticated X likes on June 8, 2026.
