DEV Community: Srijan Shukla

AI Builder Notes - Week of June 14, 2026

Srijan Shukla — Mon, 15 Jun 2026 03:45:18 +0000

My thoughts, and my Twitter feed’s thoughts.

This week was all about the ‘loop’ and Fable.

The Loop

The best way I can describe it: design the flowchart. Think through the deterministic flowchart of how you want your agents to work.

Aim to have:

more deterministic bits - this keeps things more predictable
more verification bits - this is agent feedback
more useful tool calls - tests, logs, screenshots, repo inspection etc. - this gives the agent feedback.

The ‘loop’ is essentially:

goal -> agent acts -> verifier checks -> state/memory updates -> policy decides next action -> repeat/stop/escalate

The specific implementation will differ based on what you’re working on.
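Here is a minimal sketch of that loop in Python. Everything in it is illustrative: `LoopState`, `run_loop`, and the `act` / `verify` callables are names I made up, and the policy is the simplest one I could think of (stop on success, escalate after a fixed number of attempts, otherwise repeat).

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class LoopState:
    goal: str
    attempts: int = 0
    history: List[str] = field(default_factory=list)  # verifier feedback so far


def run_loop(goal, act, verify, max_attempts=3):
    """goal -> act -> verify -> update state -> decide: repeat, stop, or escalate."""
    state = LoopState(goal=goal)
    while True:
        output = act(state)             # agent acts (hypothetical callable)
        ok, feedback = verify(output)   # verifier checks (hypothetical callable)
        state.attempts += 1             # state/memory updates
        state.history.append(feedback)
        if ok:                          # policy decides the next action
            return "stop", output
        if state.attempts >= max_attempts:
            return "escalate", output
        # Otherwise repeat; the agent can read the feedback from state.history.
```

In a real setup `act` would call a model and `verify` would run tests or inspect logs. The shape stays the same.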

If you notice you are repeating a certain workflow manually, it is time to DAG it up.
The Claude Code dynamic workflows feature lets the model write that DAG for you. That is fine for exploratory, reversible work.
For production software, the DAG is the product: you should write the stages, checks, stop conditions, retries, and review gates yourself.
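A hand-written DAG does not need a framework. The sketch below uses Python's standard-library `graphlib` to order stages; the stage names, the `runners` mapping, and the retry policy are all assumptions made up for illustration.

```python
from graphlib import TopologicalSorter

# Hypothetical stages; each maps to the set of stages it depends on.
PIPELINE = {
    "implement": set(),
    "typecheck": {"implement"},
    "tests": {"implement"},
    "security_review": {"typecheck", "tests"},
    "human_gate": {"security_review"},
}


def run_pipeline(stages, runners, max_retries=1):
    """Run stages in dependency order; retry a failing stage, then stop.

    runners maps stage name -> zero-argument callable returning True on success.
    Returns (completed_stages, failed_stage_or_None).
    """
    completed = []
    for stage in TopologicalSorter(stages).static_order():
        for _attempt in range(max_retries + 1):
            if runners[stage]():
                completed.append(stage)
                break
        else:
            # Stop condition: the stage used up its retries.
            return completed, stage
    return completed, None
```

The point of writing it yourself is that the stop condition and the review gate are visible in code you own, not inferred by the model on each run.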

Fable

Fable’s capabilities are absolutely insane. I tried it myself, and it is entirely worth spending 2 minutes looking at this.

There are a few projects I hand to every new model to see what it does.
One project I wanted to build was a way to teach and demonstrate ‘spin’ in table tennis. Every frontier model before Fable fumbled hard, but Fable outshone them with ease: https://srijanshukla.com/artifacts/spin-lab/

If you personally did not experience a big shift in capability, you are probably not giving it a complex or ambitious enough task.

Fable came, and Fable was taken away. A jailbreak was reported to the United States Government (USG), one that Anthropic considers not significant. The USG banned Fable anyway, just a few days after release. Big drama.

Fable was very pricey ($$$$), so people developed some working patterns during the few golden days it was available:
- use Fable as planner/architect/taste/spatial/front-end judge.
- use GPT-5.5/DeepSeek/Kimi as executor/worker.

Other things

OpenRouter released their Fusion feature as a model on their platform, accessible via API. Fusion is basically the council-of-LLMs pattern, and they claim its results can rival the frontier Fable 5 running solo.
Google Open Knowledge Format - https://github.com/GoogleCloudPlatform/knowledge-catalog/blob/main/okf/SPEC.md - the next iteration of LLMWiki, I think.
This is “curated reusable context”.

I seem to have forgotten where I saved this from, but a great way to think about how much trust can be given out to your friendly neighbourhood model,

AI Builder Notes - Week of June 8, 2026

Srijan Shukla — Mon, 08 Jun 2026 02:14:52 +0000

AI-assisted notes from my liked-tweets feed, organized around agent loops, cloud agent infrastructure, skill security, memory, and runtime context. Treat this as a source of information, not as a finished essay.

Practical takeaways

Put validation inside the agent loop. Backpressure forces the agent to fix code before a human sees it. The system runs typechecks, lint, tests, builds, and browser checks, then pushes failures straight back to the agent. [1] [2]
Dynamic workflows are disposable verification harnesses. Claude Code can write a temporary script to extract every technical claim from a draft and test it against the repo before publishing. [3] [4]
Cloud agents are infrastructure products. The hard parts are pod lifecycles, stream rewinds, state isolation, and hiding stale output during retries. [5] [6]
Treat skills as a supply chain. Agents are loading skills from APIs and repos, so skill PRs need scanners to catch shadow commands and context leaks. [7] [8]
Replace generic prompts with runtime context. Give the agent the failing curl, log excerpt, trace, or database row. [9]
Work memory is shared state. It tracks what is current, what already failed, and what another agent can trust. [10] [11]

Agent loops

Without backpressure, the agent writes code and hands it to a human. The human spots a missing import or broken test and tells the agent to retry.

Backpressure moves the harness in front of the human. The system runs checks: typecheck, lint, tests, build, logs, and browser checks. The failure goes to the agent. The human only reviews intent. [1]
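A rough sketch of that harness step, assuming checks are plain shell commands. The function names and the 2000-character tail are my own choices, and the `run` parameter exists only so the runner can be swapped out in tests.

```python
import subprocess


def run_checks(checks, run=subprocess.run):
    """Run each (name, argv) check and collect failure reports for the agent."""
    failures = []
    for name, argv in checks:
        result = run(argv, capture_output=True, text=True)
        if result.returncode != 0:
            # Keep only the tail of the output; errors usually land at the end.
            tail = (result.stdout + result.stderr).strip()[-2000:]
            failures.append(f"[{name}] exit {result.returncode}\n{tail}")
    return failures


def ready_for_human(failures):
    """The human only sees the change once every automated check passes."""
    return not failures
```

If `run_checks` returns anything, the text goes back into the agent's next turn instead of into a review request.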

May's notes covered running multiple agents. The newer version is generating a disposable workflow for a single strict task. Claude Code can write a JavaScript harness to verify a blog post: extract every technical claim, map claims to files, run checks, and output contradictions. [3]

A workflow is a team: plan, fleet, breaker. Dynamic workflows work best when a task needs separate planning, execution, and adversarial review. [12]

If the verification procedure is less precise than a human running three shell commands, just run the commands.

Cloud agents

Peter Pang's post explains why moving a desktop agent to a server ignores the actual operating layer. [5]

Once the loop leaves the laptop, the hard problems are distributed systems: who owns machine state, how pods recover, and how retries interact with streamed output. If retries and streaming are not handled carefully, the user experience breaks when clients see stale partial code. Cursor uses Temporal to decouple the agent loop from the VM and manages pod lifecycles separately.
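To make the stale-output problem concrete, here is a toy buffer that tags every streamed chunk with an attempt number and drops chunks from superseded attempts. This is my own illustration of the idea, not how Cursor or Temporal handle it; `AttemptStream` and its methods are hypothetical names.

```python
class AttemptStream:
    """Buffers streamed chunks and only exposes output from the latest attempt."""

    def __init__(self):
        self.current_attempt = 0
        self.chunks = []

    def start_retry(self):
        # A retry invalidates everything streamed by the previous attempt.
        self.current_attempt += 1
        self.chunks = []
        return self.current_attempt

    def push(self, attempt, chunk):
        # Chunks from an older attempt can still arrive late; drop them.
        if attempt != self.current_attempt:
            return False
        self.chunks.append(chunk)
        return True

    def visible_text(self):
        return "".join(self.chunks)
```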

Skills

Hiten Shah suggested capturing how your best people work and making those patterns reusable. [13]

Vercel's skills.sh API puts this into practice: over 600,000 searchable skills and project-scoped OIDC auth. [7] [14]

If skills act like packages, they need security reviews. The risk comes from autonomous agents acting on hijacked instructions, not just bad markdown existing in a repo. NVIDIA's SkillSpector scans agent skills for hidden instructions, context leakage, and shadow command triggers. [8] [15]
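For a feel of what a skill scanner checks, here is a deliberately naive regex pass over a skill file. It is not SkillSpector and does not reflect its rules; the three patterns are assumptions I picked to mirror the three risk categories named above, and a real scanner would need far more than this.

```python
import re

# Illustrative patterns only; a real scanner needs a much richer rule set.
SUSPICIOUS_PATTERNS = {
    "hidden_instruction": re.compile(r"ignore (all )?(previous|prior) instructions", re.I),
    "context_leak": re.compile(r"(send|post|upload).{0,40}(env|secret|token|api key)", re.I),
    "shadow_command": re.compile(r"curl\s+[^|\n]*\|\s*(ba)?sh", re.I),
}


def scan_skill(text):
    """Return (line_number, rule_name) for each suspicious line in a skill file."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for rule, pattern in SUSPICIOUS_PATTERNS.items():
            if pattern.search(line):
                findings.append((lineno, rule))
    return findings
```

Run something like this in CI on every skill PR, and treat a hit as a reason for a human to look, not as proof of an attack.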

Runtime context

Agents fail when they read source code and invent a theory. Provide evidence: a failing test, a trace, a request payload, or exact command output. [9]

PostHog Autoresearch worked because the scope was narrow. They gave an agent slow production queries and the query-engine source, let it run overnight, and got a fix for a 3-year-old bug that improved performance by 11%. That is the right shape for an agent task: real production artifact, narrow source context, fixed time budget, and a measurable result. [16]

Memory

May's links treated memory as a personal archive. This week's links treat memory as shared work state.

Agents need to compress work into state. [10] Mem0 positions memory inside the harness alongside tools and coordination. [11] [17]

Quarq hit 98.2% on LongMemEval for continual learning. [18] GBrain builds an agent-native knowledge graph over markdown with a nightly synthesis cycle. [19]

A personal archive answers what was saved. Work memory answers what is safe to act on. If two agents retrieve conflicting versions of a plan, you have drift.
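A tiny sketch of a drift check, assuming each agent's retrieved plan carries an id and a version number. `PlanRecord` and `detect_drift` are hypothetical names, and real systems would compare more than a version integer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlanRecord:
    plan_id: str
    version: int
    body: str


def detect_drift(views):
    """views maps agent name -> the PlanRecord that agent retrieved.

    Returns the plan_ids for which agents hold different versions.
    """
    seen = {}
    drifted = set()
    for record in views.values():
        first_version = seen.setdefault(record.plan_id, record.version)
        if first_version != record.version:
            drifted.add(record.plan_id)
    return sorted(drifted)
```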

Browser and agent infra

These tools sit below the browser-skill layer, dealing with page maps, runtime cost, command-output compaction, local model access, and human interruption channels.

Hyperbrowser /web creates a web.md map of a site for agents. [20] [21]

Browser Use is running custom runtimes to drop cold starts and browser-hour costs. [22] [23]

RTK filters and truncates shell output before the model sees it. AVB reported 2.5M tokens saved across coding agents in two weeks. [26] [27]
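The general idea of output compaction fits in a few lines. This is not RTK's implementation, only a head-and-tail truncation sketch with parameter names and defaults I chose myself.

```python
def compact_output(text, head=5, tail=10, max_line_len=200):
    """Keep the first and last lines of command output and trim long lines."""
    lines = [line[:max_line_len] for line in text.splitlines()]
    if len(lines) <= head + tail:
        return "\n".join(lines)
    omitted = len(lines) - head - tail
    marker = f"... [{omitted} lines omitted] ..."
    return "\n".join(lines[:head] + [marker] + lines[len(lines) - tail:])
```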

API for Cursor exposes Cursor Composer models to other coding agents via a local API. [24] [25]

Razorpay shipped a CLI + MCP combo. Humans get dashboards, agents get CLIs. [28] [29]

Peter Steinberger's sag lets an agent interrupt a human when blocked by a 1Password prompt or release gate. [30] [31]

Models and evals

NVIDIA Nemotron 3 Ultra claims 550B total parameters, 55B active, hybrid Mamba-Transformer MoE, and a 1M context window. [32] [33]

MiniMax M3 claims high SWE-Bench Pro and Terminal Bench numbers. [34]

Liquid LFM2.5-VL Extract returns structured JSON from images. [35] [36]

Nemotron 3.5 ASR Streaming runs 40 languages with controllable 80ms to 1s latency for voice agents. [37]

Anthropic warned that remote MCP servers can change behavior after approval, and persistent context increases blast radius. [38]

Agent Arena evaluates live sessions. Static prompts hide failures in loops, tools, permissions, and steering. [39] [40]

Source range: 248 liked tweets from June 1, 2026 through June 7, 2026, collected from my authenticated X likes on June 8, 2026.

AI Builder Notes - May 2026

Srijan Shukla — Mon, 01 Jun 2026 21:21:44 +0000

AI-assisted notes from my liked-tweets feed, organized around agent workflows, browser traces, model loops, and guardrails.

Practical takeaways

Start with the workflow, not the agent. A useful agent task has a source of truth, a narrow action, a verifier, and a stop condition. “Review this repo” is vague. “Find auth bugs in these routes, cite file lines, run the relevant tests, and stop after the first credible exploit path” is a workflow.
Use dynamic workflows in Claude Code for the vibe bits of thinking through a workflow. You describe, in natural language, an entire workflow with multiple agents at various steps: I want the docs updated, tests passed, a security review done, and Playwright tests run. Dynamic workflows figures out which parts can run in parallel and which must run sequentially, draws the flowchart, and writes JS code for it. The result is a JS script that can execute subagents at scale and deterministically. [1]
Planner/executor split is the way to go. Spend the expensive model on taste, decomposition, and risk discovery. Use cheaper or narrower models for repeatable implementation once the task has tests, rubrics, logs, or examples. [2]
Do not judge an agent workflow by the model name alone. If the loop has repo access, a rubric, a way to inspect tool calls, and a verifier, a less fashionable model can still do useful work. The Letta Code / GLM 5.1 review-bot example is interesting for that reason, not because “someone used X instead of Y” is interesting by itself. [3]
Prefer small interfaces to giant tool menus. MCP tool call definitions are rotting your context! The monday.com GraphQL example was the clearest cost warning: one task used 15k tokens through SDK/code-mode and 158k tokens through a real MCP server. MCP is useful, but a menu of tools is not automatically an efficient interface. [4] [5]
For browser work, save the trace. Run the workflow once, inspect wasted actions, replace repeated clicking with direct reads or JavaScript where safe, then save the better path as a skill. That is how browser agents become cheaper instead of just more automated. [6]
Security has to be designed into the harness. Stop rules, restart paths, permission gates, package-age delays, secret proxies, branch gates, logs, and human approval are the system. “Tell the model to be careful” is not a system.

Agent workflows

The useful version of “dynamic workflows” is mechanical. Give Claude Code a high-level task and say “workflow”. It writes an orchestration script. That script creates smaller work units, starts coordinated subagents, gives each one a bounded target, and then pulls their outputs back into one final answer or patch. [1]

That is useful when the task has real shape: inspect five services, compare three implementations, test each candidate fix, collect account-specific data from a logged-in browser, or review a large diff from multiple angles. It is a bad fit for questions where one careful answer is enough.

The same pattern showed up in smaller forms. One thread framed GPT-5.5 xhigh as the planner and Composer 2.5 subagents as implementers: the stronger model investigates, writes the plan, and delegates branches, worktrees, and PRs. [2] Cursor review skills running for 30 minutes are the same idea with time budget added: deeper search, more files read, more call paths followed, fewer drive-by comments than a quick /simplify. [7]

The “100 tool calls before answering” Codex prompt names the behavior missing from a lot of agent runs: do not stop after the first plausible answer. Read more. Falsify more. Show the trail. [8]

Tight coupling between the model and the harness:

Claude Code and Codex fail differently, so the harness needs stop conditions, escape routes, and restart logic. [9] The model can plan the work, but the harness has to notice loops, stale branches, broken assumptions, tool spam, and cases where the agent should ask for help.
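One cheap harness-side check is to watch the tool-call log for repetition. The sketch below is an assumption about what such a check could look like; the thresholds and the `(tool_name, args)` log shape are made up.

```python
from collections import Counter


def should_interrupt(tool_calls, max_repeats=3, max_calls=50):
    """tool_calls is a list of (tool_name, args_repr) tuples in call order.

    Returns a reason string when the harness should step in, else None.
    """
    if len(tool_calls) > max_calls:
        return "tool spam: call budget exceeded"
    counts = Counter(tool_calls)
    call, n = counts.most_common(1)[0] if counts else (None, 0)
    if n >= max_repeats:
        return f"loop: {call[0]} repeated {n} times with identical arguments"
    return None
```

A non-None result is where the stop condition, restart path, or "ask a human" branch hooks in.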

Model vs loop

The review-bot case with Letta Code and GLM 5.1 raises a useful question: what did the loop provide that made a cheaper model viable? Repo context, a review objective, expected output shape, examples of good comments, and a way to reject junk comments can matter more than the logo on the model. [3]

The Ramp spreadsheet retrieval case is the same lesson from a different direction. A specialist RL-trained model reportedly beat Opus on a narrow spreadsheet retrieval task. [10] That does not mean every team needs custom RL. It means narrow, verifiable work can reward narrow training, narrow evals, and narrow interfaces.

If you know what you want your model to do and you want to scale it, aim narrow with the loop/harness. You can get away with a much cheaper bill.

Command Code repairing tens of thousands of tool calls is another version of this. Tool use fails in repeatable ways: malformed JSON, wrong argument shape, missing state, wrong sequence, bad retry. If those errors can be repaired or caught automatically, the model gets a better workbench. [11]
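A sketch of what catching and repairing two of those failure modes could look like: JSON wrapped in a code fence, and a number sent as a string. This is not Command Code's approach; `repair_tool_call` and the schema format (argument name mapped to a Python type) are assumptions for illustration.

```python
import json

FENCE = "`" * 3  # Markdown code-fence marker, built here to avoid a literal fence


def repair_tool_call(raw, schema):
    """Turn a model's raw tool-call string into arguments.

    schema maps argument name -> expected Python type.
    Returns (args_or_None, list_of_error_strings).
    """
    text = raw.strip()
    # Common failure: the JSON arrives wrapped in a Markdown code fence.
    if text.startswith(FENCE):
        text = text.strip("`").strip()
        if text.startswith("json"):
            text = text[len("json"):]
    try:
        args = json.loads(text)
    except json.JSONDecodeError as exc:
        return None, [f"malformed JSON: {exc.msg}"]
    if not isinstance(args, dict):
        return None, ["arguments must be a JSON object"]
    errors = []
    for name, expected in schema.items():
        if name not in args:
            errors.append(f"missing argument: {name}")
        elif expected is int and isinstance(args[name], str) and args[name].lstrip("-").isdigit():
            args[name] = int(args[name])  # repair: numeric string -> int
        elif not isinstance(args[name], expected):
            errors.append(f"wrong type for {name}: expected {expected.__name__}")
    return args, errors
```

Whatever cannot be repaired comes back as a short error list, which is a much better retry prompt than a stack trace.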

The Cloudflare Code Mode / MCP comparison is another reminder to keep MCP lean and limit context rot. Better still, only use MCP when you are accessing a remote service, and prefer CLI over MCP by default.

Why: A GraphQL API task took 1 step and 15k tokens through SDK/code-mode, versus 4 steps and 158k tokens through a real MCP server. [4] [5] An agent interface is part of the product. Give the model a small, typed, task-shaped API when you can. Do not assume a broad tool menu is better because it feels more general.
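The arithmetic on those numbers is worth doing once. The token and step counts come from the example above; the price per million tokens and the monthly run count are invented to show how the gap scales.

```python
sdk_tokens, mcp_tokens = 15_000, 158_000
sdk_steps, mcp_steps = 1, 4

token_ratio = mcp_tokens / sdk_tokens    # roughly 10.5x more tokens via MCP
extra_tokens = mcp_tokens - sdk_tokens   # 143,000 extra tokens per task


def cost_usd(tokens, usd_per_million):
    return tokens / 1_000_000 * usd_per_million


# Hypothetical price and volume, only to show how the gap scales.
PRICE = 3.0     # USD per million input tokens (assumed)
RUNS = 1_000    # task runs per month (assumed)
monthly_gap = (cost_usd(mcp_tokens, PRICE) - cost_usd(sdk_tokens, PRICE)) * RUNS
```

Under those assumed numbers the difference is about $429 a month for a single task type, before counting the three extra steps of latency.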

Browser skills

The most concrete browser-agent example here is Hermes Agent / Autobrowse. A Hacker News workflow went from 102 seconds to 35 seconds, 23 turns to 8 turns, and $1.46 to $0.28 after the trace was simplified and saved as a skill. [12] [6]
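Put as percentages, using only the before/after numbers quoted above:

```python
def reduction(before, after):
    """Fractional reduction going from before to after."""
    return (before - after) / before


runs = {
    "seconds": (102, 35),
    "turns": (23, 8),
    "usd": (1.46, 0.28),
}
for metric, (before, after) in runs.items():
    print(f"{metric}: {reduction(before, after):.1%} lower")
```

That works out to roughly two thirds fewer seconds and turns, and about 81% less cost per run.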

The trick was not magic browser control. The trick was noticing the repeated slow path. If the agent clicks through the same UI every time, inspect the page, read state directly where possible, remove wasted navigation, and save the shorter path. That is a real skill: the agent gets faster because the workflow gets smaller.

The adjacent tools worth tracking: the OpenAI Chrome plugin, BrowserCode, Autobrowse, browser-harness, Pi browser extensions, and Hermes browser skills. [13] [14] [6] [15] [16] [12] The category is logged-in browser work: support queues, internal tools, research, scraping, QA, admin ops, and anything where the useful data sits behind a session.

Memory and retrieval

Birdclaw is interesting because it gives agents access to a Twitter archive. [17] GBrain points at a personal recall layer around OpenClaw / Hermes-style workflows. [18] PageIndex is a useful reminder that simple retrieval, even BM25-only retrieval, still has a place. [19] The “RAG comeback in about 8 months” take lands because the archive problem is still unsolved in practice. [20]

A giant archive is not memory. Memory is knowing when to search, what to retrieve, how much to inject, and how to preserve provenance. A liked-tweets feed becomes useful only if the distillation keeps links, dates, claims, and enough source texture to make the note auditable later.
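BM25-only retrieval with provenance is small enough to sketch. This is a plain Okapi BM25 scorer over whitespace tokens, not PageIndex's code; the note fields (`text`, `url`, `date`) are an assumed shape, and it expects a non-empty list of notes.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs
    doc_freq = Counter(term for d in tokenized for term in set(d))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def search(query, notes, top_k=2):
    """Return (score, url, date) for the best matches, so provenance survives."""
    scores = bm25_scores(query, [n["text"] for n in notes])
    ranked = sorted(zip(scores, notes), key=lambda pair: pair[0], reverse=True)
    return [(round(s, 3), n["url"], n["date"]) for s, n in ranked[:top_k] if s > 0]
```

Returning the link and date with every hit is what keeps the note auditable later.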

Security and guardrails

Cloudflare tested Anthropic Mythos against fifty repositories. [21] Another thread said Claude Mythos Preview helped Firefox fix more security bugs in April than in the previous 15 months combined. [22] Read neither as “AI fixes security now”. Read them as scoped security work becoming agent-shaped: known repo, known bug class, patch candidates, review loop, and humans still responsible for merging.

The most useful boring guardrail here is package-age delay. pnpm and npm both have settings that can avoid installing packages published too recently. [23] [24] This matters more with agents because agents will happily install dependencies at machine speed. A small delay catches some supply-chain attacks before they enter the workflow.

Two defaults worth setting:

pnpm config set minimumReleaseAge 2880

npm config set min-release-age=2d
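The rule behind those settings is a timestamp comparison. Here is a sketch of the same gate as a harness-side check, assuming the pnpm value is in minutes (2880 minutes is two days); `is_old_enough` is a hypothetical helper, not part of pnpm or npm.

```python
from datetime import datetime, timedelta, timezone

MIN_RELEASE_AGE_MINUTES = 2880  # two days, assuming the pnpm setting counts minutes


def is_old_enough(published_at, now=None, min_age_minutes=MIN_RELEASE_AGE_MINUTES):
    """True when a package version has been public for at least the minimum age."""
    now = now or datetime.now(timezone.utc)
    return now - published_at >= timedelta(minutes=min_age_minutes)
```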

Clawvisor belongs in the same bucket: approve agent access without handing raw credentials to the model. [25] These dull permission layers are more interesting than another demo where an agent clicks around a dashboard with full access.

Tools worth opening

Harness engineering learning site: useful if you want names for the parts around the model - evals, stop rules, retries, logs, and verification.
LiteParse v2: Rust PDF parsing for agent/RAG workflows where PDFs are the bottleneck. The useful question is not “is it fast?” but “does it preserve the parts your downstream model needs?”
Patter: voice AI in a few lines, with multiple providers. Useful if you want to prototype voice workflows without first committing to one stack. [27]
Minions: mission-control style UI for Hermes Agent tasks. Worth opening if you are running multiple local agents and need a control plane. [28]
OpenRouter Pareto Code: route to the cheapest code-capable model above a score threshold. This is the right kind of boring optimization for agent loops that run often. [29]
OpenRouter Response Caching: useful for tests, retries, and repeated agent prefixes. Caching is not glamorous, but repeated context is where agent bills quietly grow. [30]
Flue: TypeScript sandboxed-agent framework with runtimes and a secret proxy. Useful shape: run the agent in a controlled runtime instead of giving it everything. [31]
Zero: programming language for agents with explicit capabilities, JSON diagnostics, and typed safe fixes. Worth saving because explicit capabilities are a cleaner interface than vibes and instructions. [32]
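The Pareto Code idea from the list above (cheapest model above a score threshold) reduces to a filter and a min. This is a sketch of the selection rule only, not OpenRouter's API; the model names, scores, and prices in the usage lines are invented.

```python
def pick_model(models, min_score):
    """Return the cheapest model whose score meets the threshold, or None."""
    eligible = [m for m in models if m["score"] >= min_score]
    return min(eligible, key=lambda m: m["usd_per_mtok"], default=None)


# Hypothetical catalog, for illustration only.
CATALOG = [
    {"name": "big-model", "score": 0.82, "usd_per_mtok": 15.0},
    {"name": "mid-model", "score": 0.74, "usd_per_mtok": 3.0},
    {"name": "small-model", "score": 0.55, "usd_per_mtok": 0.4},
]
choice = pick_model(CATALOG, min_score=0.70)
```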

From Grep to ast-grep: Building XRAY MCP for Code-Aware AI

Srijan Shukla — Sat, 16 Aug 2025 13:21:52 +0000

I enjoy coding with AI assistants, but they failed the moment I asked a structural question:

“Show me everything that calls authenticate().”

They guessed—because under the hood they were using plain text search.

After experiments with grep scripts (too noisy), tree-sitter bindings (too fragile), and LSPs (too heavy), I found ast-grep. It offers syntax-aware search without the infrastructure weight.

XRAY MCP is my wrapper around ast-grep: a tiny server exposing map, find, and impact endpoints. An assistant can now map the repo, locate a symbol, and see its references before suggesting changes.
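To show the difference between text search and syntax-aware search without any dependencies, here is a sketch using Python's built-in `ast` module. It is not ast-grep and not XRAY's code; it only handles Python source, and `find_calls` and `SAMPLE` are names made up for the example.

```python
import ast

SAMPLE = '''
# authenticate() is mentioned here but not called
def login(user):
    return authenticate(user)

def api(req):
    msg = "call authenticate() later"
    return req.session.authenticate(req.user)
'''


def find_calls(source, func_name):
    """Return (line, col) for every call to func_name, including method calls."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        target = node.func
        # Plain calls use ast.Name; method calls use ast.Attribute.
        name = target.id if isinstance(target, ast.Name) else getattr(target, "attr", None)
        if name == func_name:
            hits.append((node.lineno, node.col_offset))
    return sorted(hits)


calls = find_calls(SAMPLE, "authenticate")
grep_like = [i for i, line in enumerate(SAMPLE.splitlines(), 1) if "authenticate(" in line]
```

On this sample the text match flags four lines, including the comment and the string, while the AST walk finds only the two real calls.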

The code is open source if you want to try the approach.

https://github.com/srijanshukla18/xray