<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Srijan Shukla</title>
    <description>The latest articles on DEV Community by Srijan Shukla (@srijanshukla18).</description>
    <link>https://dev.to/srijanshukla18</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1576269%2F488a7c50-2f20-4488-b99c-e71fb09a3e31.jpeg</url>
      <title>DEV Community: Srijan Shukla</title>
      <link>https://dev.to/srijanshukla18</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/srijanshukla18"/>
    <language>en</language>
    <item>
      <title>AI Builder Notes - Week of June 14, 2026</title>
      <dc:creator>Srijan Shukla</dc:creator>
      <pubDate>Mon, 15 Jun 2026 03:45:18 +0000</pubDate>
      <link>https://dev.to/srijanshukla18/ai-builder-notes-week-of-june-14-2026-2a6g</link>
      <guid>https://dev.to/srijanshukla18/ai-builder-notes-week-of-june-14-2026-2a6g</guid>
      <description>&lt;h1&gt;
  
  
  AI Builder Notes - Week of June 14, 2026
&lt;/h1&gt;

&lt;p&gt;My thoughts, and my Twitter feed’s thoughts.&lt;/p&gt;

&lt;p&gt;This week was all about the ‘loop’ and &lt;a href="https://platform.claude.com/docs/en/about-claude/models/introducing-claude-fable-5-and-claude-mythos-5" rel="noopener noreferrer"&gt;Fable&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Loop
&lt;/h2&gt;

&lt;p&gt;The best way I can describe it is: design the flowchart. Think of a deterministic flowchart of how you want your agents to work.&lt;/p&gt;

&lt;p&gt;Aim to have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;more deterministic bits - this keeps things more predictable&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;more verification bits - checks whose results go straight back to the agent&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;more useful tool calls - tests, logs, screenshots, repo inspection etc. - these give the agent evidence to act on&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The ‘loop’ is essentially:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;goal -&amp;gt; agent acts -&amp;gt; verifier checks -&amp;gt; state/memory updates -&amp;gt; policy decides next action -&amp;gt; repeat/stop/escalate&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The specific implementation will differ based on what you’re working on.&lt;/p&gt;
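
&lt;p&gt;A minimal Python sketch of that loop, under my own assumptions: &lt;code&gt;run_loop&lt;/code&gt;, the &lt;code&gt;toy_*&lt;/code&gt; stand-ins, and the step budget are hypothetical names for illustration, not part of any framework. The point is only that the control flow is deterministic while the agent call sits inside it.&lt;/p&gt;

```python
def run_loop(goal, act, verify, decide, max_steps=5):
    """Drive one goal through act, verify, update, decide until stop or escalate."""
    state = {"goal": goal, "history": []}
    for step in range(max_steps):
        result = act(state)                 # agent acts
        ok, feedback = verify(result)       # verifier checks
        state["history"].append(            # state/memory updates
            {"step": step, "result": result, "ok": ok, "feedback": feedback}
        )
        action = decide(state)              # policy decides next action
        if action == "stop":
            return "done", state
        if action == "escalate":
            return "escalated", state
    return "escalated", state               # step budget exhausted: hand off to a human


# Toy stand-ins (hypothetical) so the sketch runs end to end.
def toy_act(state):
    return len(state["history"]) + 1        # the attempt number stands in for real output


def toy_verify(result):
    if result == 3:
        return True, "passed"
    return False, "attempt %d failed" % result


def toy_decide(state):
    return "stop" if state["history"][-1]["ok"] else "repeat"


status, final_state = run_loop("make the check pass", toy_act, toy_verify, toy_decide)
print(status, len(final_state["history"]))
```

&lt;p&gt;Swap the toy functions for a real agent call, a real test run, and a real policy; the shape of the loop stays the same.&lt;/p&gt;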

&lt;p&gt;If you notice you are repeating a certain workflow manually - time to DAG it up.&lt;br&gt;
The Claude Code dynamic workflows feature lets the model write that DAG for you. That is fine for exploratory, reversible work.&lt;br&gt;
For production software, the DAG is the product: you should write the stages, checks, stop conditions, retries, and review gates yourself.&lt;/p&gt;

&lt;h2&gt;
  
  
  Fable
&lt;/h2&gt;

&lt;p&gt;Fable’s capabilities are absolutely insane. I tried it myself, and it is entirely worth spending 2 minutes looking at this.&lt;/p&gt;

&lt;p&gt;There are a few projects I throw at every new model to see what it’s gonna do.&lt;br&gt;
One project I wanted to build was a way to teach and demonstrate ‘spin’ in table tennis. Every frontier model before Fable fumbled hard, but Fable outshone them with ease: &lt;a href="https://srijanshukla.com/artifacts/spin-lab/" rel="noopener noreferrer"&gt;https://srijanshukla.com/artifacts/spin-lab/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you personally did not experience a big shift in capability, you are probably not giving it a complex or ambitious enough task.&lt;/p&gt;

&lt;p&gt;Fable came, and Fable was taken away. A &lt;code&gt;jailbreak&lt;/code&gt; was reported to the United States Government (USG), one that Anthropic does not consider significant. The USG banned Fable anyway, just a few days after release. Big drama.&lt;/p&gt;

&lt;p&gt;Fable was very pricey $$$$&lt;br&gt;
Hence, people developed some working patterns during the few golden days Fable was available:&lt;br&gt;
- use Fable as planner/architect/taste/spatial/front-end judge.&lt;br&gt;
- use GPT-5.5/DeepSeek/Kimi as executor/worker.&lt;/p&gt;

&lt;h2&gt;
  
  
  Other things
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;OpenRouter released their &lt;code&gt;Fusion&lt;/code&gt; feature as a model on their platform, accessible via API. Fusion is basically the council-of-LLMs pattern, producing results that they claim can rival the frontier Fable 5 running solo.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Google Open Knowledge Format - &lt;a href="https://github.com/GoogleCloudPlatform/knowledge-catalog/blob/main/okf/SPEC.md" rel="noopener noreferrer"&gt;https://github.com/GoogleCloudPlatform/knowledge-catalog/blob/main/okf/SPEC.md&lt;/a&gt; - the next iteration of LLMWiki, I think.&lt;br&gt;
This is &lt;strong&gt;“curated reusable context”&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I seem to have forgotten where I saved this from, but it is a great way to think about how much trust to give your friendly neighbourhood model:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyrvzd57tklxqx6hjdn39.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyrvzd57tklxqx6hjdn39.png" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>claude</category>
      <category>llm</category>
    </item>
    <item>
      <title>AI Builder Notes - Week of June 8, 2026</title>
      <dc:creator>Srijan Shukla</dc:creator>
      <pubDate>Mon, 08 Jun 2026 02:14:52 +0000</pubDate>
      <link>https://dev.to/srijanshukla18/ai-builder-notes-week-of-june-8-2026-5eh9</link>
      <guid>https://dev.to/srijanshukla18/ai-builder-notes-week-of-june-8-2026-5eh9</guid>
      <description>&lt;p&gt;AI-assisted notes from my liked-tweets feed, organized around agent loops, cloud agent infrastructure, skill security, memory, and runtime context. Treat this as a source of information, not as a finished essay.&lt;/p&gt;

&lt;h2&gt;
  
  
  Practical takeaways
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Put validation inside the agent loop. Backpressure forces the agent to fix code before a human sees it. The system runs typechecks, lint, tests, builds, and browser checks, then pushes failures straight back to the agent. &lt;a href="https://generativeprogrammer.com/p/stop-babysitting-your-coding-agent" rel="noopener noreferrer"&gt;[1]&lt;/a&gt; &lt;a href="https://x.com/bibryam/status/2063349737089331531" rel="noopener noreferrer"&gt;[2]&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Dynamic workflows are disposable verification harnesses. Claude Code can write a temporary script to extract every technical claim from a draft and test it against the repo before publishing. &lt;a href="https://claude.com/blog/a-harness-for-every-task-dynamic-workflows-in-claude-code" rel="noopener noreferrer"&gt;[3]&lt;/a&gt; &lt;a href="https://x.com/trq212/status/2061907337154367865" rel="noopener noreferrer"&gt;[4]&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Cloud agents are infrastructure products. The hard parts are pod lifecycles, stream rewinds, state isolation, and hiding stale output during retries. &lt;a href="https://cursor.com/blog/cloud-agent-lessons" rel="noopener noreferrer"&gt;[5]&lt;/a&gt; &lt;a href="https://x.com/intuitiveml/status/2062699747224568212" rel="noopener noreferrer"&gt;[6]&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Treat skills as a supply chain. Agents are loading skills from APIs and repos, so skill PRs need scanners to catch shadow commands and context leaks. &lt;a href="https://vercel.com/changelog/the-skills-sh-api-is-now-available" rel="noopener noreferrer"&gt;[7]&lt;/a&gt; &lt;a href="https://github.com/NVIDIA/skillspector" rel="noopener noreferrer"&gt;[8]&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Replace generic prompts with runtime context. Give the agent the failing curl, log excerpt, trace, or database row. &lt;a href="https://x.com/ericzakariasson/status/2062199026544787576" rel="noopener noreferrer"&gt;[9]&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Work memory is shared state. It tracks what is current, what already failed, and what another agent can trust. &lt;a href="https://x.com/ashwingop/status/2061836996541083912" rel="noopener noreferrer"&gt;[10]&lt;/a&gt; &lt;a href="https://x.com/mem0ai/status/2061822612398014782" rel="noopener noreferrer"&gt;[11]&lt;/a&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Agent loops
&lt;/h2&gt;

&lt;p&gt;Without backpressure, the agent writes code and hands it to a human. The human spots a missing import or broken test and tells the agent to retry.&lt;/p&gt;

&lt;p&gt;Backpressure moves the harness in front of the human. The system runs checks: typecheck, lint, tests, build, logs, and browser checks. The failure goes to the agent. The human only reviews intent. &lt;a href="https://generativeprogrammer.com/p/stop-babysitting-your-coding-agent" rel="noopener noreferrer"&gt;[1]&lt;/a&gt;&lt;/p&gt;
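
&lt;p&gt;A minimal Python sketch of backpressure, assuming each check is a plain function that returns an error string or &lt;code&gt;None&lt;/code&gt;. The names &lt;code&gt;run_checks&lt;/code&gt;, &lt;code&gt;backpressure&lt;/code&gt;, and &lt;code&gt;toy_agent_fix&lt;/code&gt; are hypothetical and stand in for real typecheck, lint, and test runners and a real agent call.&lt;/p&gt;

```python
def run_checks(code, checks):
    """Run every named check and collect failure messages."""
    failures = []
    for name, check in checks:
        error = check(code)
        if error:
            failures.append("%s: %s" % (name, error))
    return failures


def backpressure(agent_fix, code, checks, max_rounds=3):
    """Feed check failures back to the agent until clean or out of rounds."""
    for round_number in range(max_rounds + 1):
        failures = run_checks(code, checks)
        if not failures:
            return code, round_number       # clean: a human now reviews intent only
        if round_number == max_rounds:
            break
        code = agent_fix(code, failures)    # failures go to the agent, not the human
    return None, max_rounds                 # still failing: escalate instead of shipping


# Toy checks and a toy agent (hypothetical) so the sketch runs.
def has_import(code):
    return None if "import json" in code else "missing 'import json'"


def has_test(code):
    return None if "def test_" in code else "no test function found"


def toy_agent_fix(code, failures):
    # Patches only the first reported failure each round.
    if failures[0].startswith("lint"):
        return "import json\n" + code
    return code + "\ndef test_roundtrip():\n    assert json.loads('1') == 1\n"


checks = [("lint", has_import), ("tests", has_test)]
draft = "def load(s):\n    return json.loads(s)\n"
final_code, rounds = backpressure(toy_agent_fix, draft, checks)
print(rounds)
```

&lt;p&gt;The round limit matters as much as the checks: an agent that cannot get to green should hand off, not loop forever.&lt;/p&gt;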

&lt;p&gt;May's notes covered running multiple agents. The newer version is generating a disposable workflow for a single strict task. Claude Code can write a JavaScript harness to verify a blog post: extract every technical claim, map claims to files, run checks, and output contradictions. &lt;a href="https://claude.com/blog/a-harness-for-every-task-dynamic-workflows-in-claude-code" rel="noopener noreferrer"&gt;[3]&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A workflow is a team: plan, fleet, breaker. Dynamic workflows work best when a task needs separate planning, execution, and adversarial review. &lt;a href="https://x.com/code_rams/status/2062577777279168842" rel="noopener noreferrer"&gt;[12]&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If the verification procedure is less precise than a human running three shell commands, just run the commands.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cloud agents
&lt;/h2&gt;

&lt;p&gt;Peter Pang's post explains why moving a desktop agent to a server ignores the actual operating layer. &lt;a href="https://cursor.com/blog/cloud-agent-lessons" rel="noopener noreferrer"&gt;[5]&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once the loop leaves the laptop, the hard problems are distributed systems: who owns machine state, how pods recover, and how retries interact with streamed output. If retries and streaming are not handled carefully, the user experience breaks when clients see stale partial code. Cursor uses Temporal to decouple the agent loop from the VM and manages pod lifecycles separately.&lt;/p&gt;

&lt;h2&gt;
  
  
  Skills
&lt;/h2&gt;

&lt;p&gt;Hiten Shah suggested capturing how your best people work and making those patterns reusable. &lt;a href="https://x.com/hnshah/status/2062647149582750101" rel="noopener noreferrer"&gt;[13]&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Vercel's skills.sh API puts this into practice: over 600,000 searchable skills and project-scoped OIDC auth. &lt;a href="https://vercel.com/changelog/the-skills-sh-api-is-now-available" rel="noopener noreferrer"&gt;[7]&lt;/a&gt; &lt;a href="https://x.com/rauchg/status/2062951924677128455" rel="noopener noreferrer"&gt;[14]&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If skills act like packages, they need security reviews. The risk comes from autonomous agents acting on hijacked instructions, not just bad markdown existing in a repo. NVIDIA's SkillSpector scans agent skills for hidden instructions, context leakage, and shadow command triggers. &lt;a href="https://github.com/NVIDIA/skillspector" rel="noopener noreferrer"&gt;[8]&lt;/a&gt; &lt;a href="https://x.com/dani_avila7/status/2063336153630011728" rel="noopener noreferrer"&gt;[15]&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Runtime context
&lt;/h2&gt;

&lt;p&gt;Agents fail when they read source code and invent a theory. Provide evidence: a failing test, a trace, a request payload, or exact command output. &lt;a href="https://x.com/ericzakariasson/status/2062199026544787576" rel="noopener noreferrer"&gt;[9]&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;PostHog Autoresearch worked because the scope was narrow. They gave an agent slow production queries and the query-engine source, let it run overnight, and got a fix for a 3-year-old bug that improved performance by 11%. That is the right shape for an agent task: real production artifact, narrow source context, fixed time budget, and a measurable result. &lt;a href="https://x.com/posthog/status/2062595534381326421" rel="noopener noreferrer"&gt;[16]&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Memory
&lt;/h2&gt;

&lt;p&gt;May's links treated memory as a personal archive. This week's links treat memory as shared work state.&lt;/p&gt;

&lt;p&gt;Agents need to compress work into state. &lt;a href="https://x.com/ashwingop/status/2061836996541083912" rel="noopener noreferrer"&gt;[10]&lt;/a&gt; Mem0 positions memory inside the harness alongside tools and coordination. &lt;a href="https://x.com/mem0ai/status/2061822612398014782" rel="noopener noreferrer"&gt;[11]&lt;/a&gt; &lt;a href="https://github.com/mem0ai/mem0" rel="noopener noreferrer"&gt;[17]&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Quarq hit 98.2% on LongMemEval for continual learning. &lt;a href="https://x.com/quarqlabs/status/2061571757488972153" rel="noopener noreferrer"&gt;[18]&lt;/a&gt; GBrain builds an agent-native knowledge graph over markdown with a nightly synthesis cycle. &lt;a href="https://x.com/PSkinnerTech/status/2061262192171700366" rel="noopener noreferrer"&gt;[19]&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A personal archive answers what was saved. Work memory answers what is safe to act on. If two agents retrieve conflicting versions of a plan, you have drift.&lt;/p&gt;
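
&lt;p&gt;One way to make drift detectable, sketched in Python under my own assumptions: give every entry a version and reject writes based on a stale read. &lt;code&gt;WorkMemory&lt;/code&gt; and its methods are hypothetical names; none of the linked tools is claimed to work this way.&lt;/p&gt;

```python
class WorkMemory:
    """Shared work state with a version per key, so stale reads are detectable."""

    def __init__(self):
        self._entries = {}

    def read(self, key):
        # Returns (value, version); version 0 means the key was never written.
        return self._entries.get(key, (None, 0))

    def write(self, key, value, expected_version):
        # Reject the write if another agent updated the key since our read.
        _, current_version = self.read(key)
        if expected_version != current_version:
            return False
        self._entries[key] = (value, current_version + 1)
        return True


memory = WorkMemory()
_, version_a = memory.read("plan")      # agent A reads version 0
_, version_b = memory.read("plan")      # agent B reads version 0
a_ok = memory.write("plan", "migrate table first", version_a)
b_ok = memory.write("plan", "deploy service first", version_b)  # stale read: rejected
print(a_ok, b_ok, memory.read("plan"))
```

&lt;p&gt;Agent B has to re-read the plan before writing, which is the moment it learns that the plan changed.&lt;/p&gt;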

&lt;h2&gt;
  
  
  Browser and agent infra
&lt;/h2&gt;

&lt;p&gt;These tools sit below the browser-skill layer, dealing with page maps, runtime cost, command-output compaction, local model access, and human interruption channels.&lt;/p&gt;

&lt;p&gt;Hyperbrowser &lt;code&gt;/web&lt;/code&gt; creates a &lt;code&gt;web.md&lt;/code&gt; map of a site for agents. &lt;a href="https://www.hyperbrowser.ai/" rel="noopener noreferrer"&gt;[20]&lt;/a&gt; &lt;a href="https://x.com/hyperbrowser/status/2062246808282439867" rel="noopener noreferrer"&gt;[21]&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Browser Use is running custom runtimes to drop cold starts and browser-hour costs. &lt;a href="https://docs.browser-use.com/" rel="noopener noreferrer"&gt;[22]&lt;/a&gt; &lt;a href="https://x.com/larsencc/status/2061524507437707384" rel="noopener noreferrer"&gt;[23]&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;RTK filters and truncates shell output before the model sees it. AVB reported 2.5M tokens saved across coding agents in two weeks. &lt;a href="https://github.com/rtk-ai/rtk" rel="noopener noreferrer"&gt;[26]&lt;/a&gt; &lt;a href="https://x.com/neural_avb/status/2061345960060707238" rel="noopener noreferrer"&gt;[27]&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;API for Cursor exposes Cursor Composer models to other coding agents via a local API. &lt;a href="https://api-for-composer.standardagents.ai/" rel="noopener noreferrer"&gt;[24]&lt;/a&gt; &lt;a href="https://x.com/jpschroeder/status/2061484426387677268" rel="noopener noreferrer"&gt;[25]&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Razorpay shipped a CLI + MCP combo. Humans get dashboards, agents get CLIs. &lt;a href="https://razorpay.com/cli/" rel="noopener noreferrer"&gt;[28]&lt;/a&gt; &lt;a href="https://x.com/harshilmathur/status/2061699649837449259" rel="noopener noreferrer"&gt;[29]&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Peter Steinberger's &lt;code&gt;sag&lt;/code&gt; lets an agent interrupt a human when blocked by a 1Password prompt or release gate. &lt;a href="https://github.com/steipete/sag" rel="noopener noreferrer"&gt;[30]&lt;/a&gt; &lt;a href="https://x.com/steipete/status/2061574752574283858" rel="noopener noreferrer"&gt;[31]&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Models and evals
&lt;/h2&gt;

&lt;p&gt;NVIDIA Nemotron 3 Ultra claims 550B total parameters, 55B active, hybrid Mamba-Transformer MoE, and a 1M context window. &lt;a href="https://docs.nvidia.com/nemotron/nightly/usage-cookbook/Nemotron-3-Ultra-Base/README.html" rel="noopener noreferrer"&gt;[32]&lt;/a&gt; &lt;a href="https://x.com/victormustar/status/2063017894221591008" rel="noopener noreferrer"&gt;[33]&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;MiniMax M3 claims high SWE-Bench Pro and Terminal Bench numbers. &lt;a href="https://x.com/AndrewCurran_/status/2061281239907406257" rel="noopener noreferrer"&gt;[34]&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Liquid LFM2.5-VL Extract returns structured JSON from images. &lt;a href="https://huggingface.co/LiquidAI/LFM2.5-VL-450M-Extract" rel="noopener noreferrer"&gt;[35]&lt;/a&gt; &lt;a href="https://x.com/liquidai/status/2062686748291846307" rel="noopener noreferrer"&gt;[36]&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Nemotron 3.5 ASR Streaming runs 40 languages with controllable 80ms to 1s latency for voice agents. &lt;a href="https://x.com/PiotrZelasko/status/2062538923776290909" rel="noopener noreferrer"&gt;[37]&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Anthropic warned that remote MCP servers can change behavior after approval, and persistent context increases blast radius. &lt;a href="https://www.anthropic.com/engineering/how-we-contain-claude" rel="noopener noreferrer"&gt;[38]&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Agent Arena evaluates live sessions. Static prompts hide failures in loops, tools, permissions, and steering. &lt;a href="https://arena.ai/blog/agent-arena-methodology/" rel="noopener noreferrer"&gt;[39]&lt;/a&gt; &lt;a href="https://arena.ai/leaderboard/agent" rel="noopener noreferrer"&gt;[40]&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Source range: 248 liked tweets from June 1, 2026 through June 7, 2026, collected from my authenticated X likes on June 8, 2026.&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>automation</category>
      <category>testing</category>
    </item>
    <item>
      <title>AI Builder Notes - May 2026</title>
      <dc:creator>Srijan Shukla</dc:creator>
      <pubDate>Mon, 01 Jun 2026 21:21:44 +0000</pubDate>
      <link>https://dev.to/srijanshukla18/ai-builder-notes-may-2026-450g</link>
      <guid>https://dev.to/srijanshukla18/ai-builder-notes-may-2026-450g</guid>
      <description>&lt;h1&gt;
  
  
  &lt;strong&gt;AI Builder Notes - May 2026&lt;/strong&gt;
&lt;/h1&gt;

&lt;h3&gt;
  
  
  &lt;em&gt;AI-assisted notes from my liked-tweets feed, organized around agent workflows, browser traces, model loops, and guardrails.&lt;/em&gt;
&lt;/h3&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Practical takeaways&lt;/strong&gt;
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Start with the workflow, not the agent. A useful agent task has a source of truth, a narrow action, a verifier, and a stop condition. “Review this repo” is vague. “Find auth bugs in these routes, cite file lines, run the relevant tests, and stop after the first credible exploit path” is a workflow.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Use dynamic workflows in Claude Code to do the vibe bits of thinking through a workflow. You describe, in natural language, an entire workflow with multiple agents at various steps: I want the docs updated, tests passing, a security review done, and Playwright tests run. Dynamic workflows figures out which parts can run in parallel and which must run sequentially, creates a flowchart, and writes JS code for it. It's a JS script that can execute subagents at scale and deterministically. &lt;a href="https://x.com/ClaudeDevs/status/2060044853279617150" rel="noopener noreferrer"&gt;[1]&lt;/a&gt;
&lt;/li&gt;
&lt;/ol&gt;


&lt;/li&gt;

&lt;li&gt;&lt;p&gt;Planner/executor split is the way to go. Spend the expensive model on taste, decomposition, and risk discovery. Use cheaper or narrower models for repeatable implementation once the task has tests, rubrics, logs, or examples. &lt;a href="https://x.com/fitchmultz/status/2058582687124967461" rel="noopener noreferrer"&gt;[2]&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;Do not judge an agent workflow by the model name alone. If the loop has repo access, a rubric, a way to inspect tool calls, and a verifier, a less fashionable model can still do useful work. The Letta Code / GLM 5.1 review-bot example is interesting for that reason, not because “someone used X instead of Y” is interesting by itself. &lt;a href="https://x.com/sarahwooders/status/2054652698142925117" rel="noopener noreferrer"&gt;[3]&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;Prefer small interfaces to giant tool menus. MCP tool call definitions are rotting your context! The &lt;a href="http://monday.com" rel="noopener noreferrer"&gt;monday.com&lt;/a&gt; GraphQL example was the clearest cost warning: one task used 15k tokens through SDK/code-mode and 158k tokens through a real MCP server. MCP is useful, but a menu of tools is not automatically an efficient interface. &lt;a href="https://blog.cloudflare.com/code-mode/" rel="noopener noreferrer"&gt;[4]&lt;/a&gt; &lt;a href="https://x.com/YoniBraslaver/status/2055260079700791544" rel="noopener noreferrer"&gt;[5]&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;For browser work, save the trace. Run the workflow once, inspect wasted actions, replace repeated clicking with direct reads or JavaScript where safe, then save the better path as a skill. That is how browser agents become cheaper instead of just more automated. &lt;a href="https://x.com/kylejeong/status/2052497318017208470" rel="noopener noreferrer"&gt;[6]&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;Security has to be designed into the harness. Stop rules, restart paths, permission gates, package-age delays, secret proxies, branch gates, logs, and human approval are the system. “Tell the model to be careful” is not a system.&lt;/p&gt;&lt;/li&gt;

&lt;/ol&gt;
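
&lt;p&gt;The planner/executor split from the list above can be sketched as a routing rule. This is a minimal Python illustration under my own assumptions: &lt;code&gt;pick_role&lt;/code&gt;, the task fields, and the file name are hypothetical, and real routing would weigh cost and risk too.&lt;/p&gt;

```python
def pick_role(task):
    """Send a task to the cheap executor only when its output can be checked."""
    verifiable = any(task.get(key) for key in ("tests", "rubric", "logs", "examples"))
    return "executor" if verifiable else "planner"


# Hypothetical tasks: the first has nothing to verify against, the second has tests.
tasks = [
    {"name": "decompose the auth refactor"},
    {"name": "implement the token parser", "tests": ["test_token_parser.py"]},
]
print([pick_role(task) for task in tasks])
```

&lt;p&gt;The rule encodes the takeaway directly: work with no verifier stays with the expensive model, and work with tests, rubrics, logs, or examples can move to a cheaper one.&lt;/p&gt;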

&lt;h2&gt;
  
  
  &lt;strong&gt;Agent workflows&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The useful version of “dynamic workflows” is mechanical. Give Claude Code a high-level task and say “workflow”. It writes an orchestration script. That script creates smaller work units, starts coordinated subagents, gives each one a bounded target, and then pulls their outputs back into one final answer or patch. &lt;a href="https://x.com/ClaudeDevs/status/2060044853279617150" rel="noopener noreferrer"&gt;[1]&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That is useful when the task has real shape: inspect five services, compare three implementations, test each candidate fix, collect account-specific data from a logged-in browser, or review a large diff from multiple angles. It is a bad fit for questions where one careful answer is enough.&lt;/p&gt;

&lt;p&gt;The same pattern showed up in smaller forms. One thread framed GPT-5.5 xhigh as the planner and Composer 2.5 subagents as implementers: the stronger model investigates, writes the plan, and delegates branches, worktrees, and PRs. &lt;a href="https://x.com/fitchmultz/status/2058582687124967461" rel="noopener noreferrer"&gt;[2]&lt;/a&gt; Cursor review skills running for 30 minutes are the same idea with time budget added: deeper search, more files read, more call paths followed, fewer drive-by comments than a quick &lt;code&gt;/simplify&lt;/code&gt;. &lt;a href="https://x.com/ParthJadhav8/status/2057788210949030351" rel="noopener noreferrer"&gt;[7]&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The “100 tool calls before answering” Codex prompt names the behavior missing from a lot of agent runs: do not stop after the first plausible answer. Read more. Falsify more. Show the trail. &lt;a href="https://x.com/StijnSmits/status/2057801732051042430" rel="noopener noreferrer"&gt;[8]&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Tight coupling between the model and the harness:&lt;/p&gt;

&lt;p&gt;Claude Code and Codex fail differently, so the harness needs stop conditions, escape routes, and restart logic. &lt;a href="https://x.com/PrimeIntellect/status/2055056385273123229" rel="noopener noreferrer"&gt;[9]&lt;/a&gt; The model can plan the work, but the harness has to notice loops, stale branches, broken assumptions, tool spam, and cases where the agent should ask for help.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Model vs loop&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The Letta Code / GLM 5.1 review-bot case raises a useful question: what did the loop provide that made a cheaper model viable? Repo context, a review objective, expected output shape, examples of good comments, and a way to reject junk comments can matter more than the logo on the model. &lt;a href="https://x.com/sarahwooders/status/2054652698142925117" rel="noopener noreferrer"&gt;[3]&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Ramp spreadsheet retrieval case is the same lesson from a different direction. A specialist RL-trained model reportedly beat Opus on a narrow spreadsheet retrieval task. &lt;a href="https://www.primeintellect.ai/case-study/ramp" rel="noopener noreferrer"&gt;[10]&lt;/a&gt; That does not mean every team needs custom RL. It means narrow, verifiable work can reward narrow training, narrow evals, and narrow interfaces.&lt;/p&gt;

&lt;p&gt;If you know what you want your model to do and you want to scale it, aim narrow with the loop/harness. You can get away with a much cheaper bill.&lt;/p&gt;

&lt;p&gt;Command Code repairing tens of thousands of tool calls is another version of this. Tool use fails in repeatable ways: malformed JSON, wrong argument shape, missing state, wrong sequence, bad retry. If those errors can be repaired or caught automatically, the model gets a better workbench. &lt;a href="https://x.com/MrAhmadAwais/status/2056183311211602405" rel="noopener noreferrer"&gt;[11]&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Cloudflare Code Mode / MCP comparison is another reminder to keep MCP lean: fewer tool definitions, less context rot. Or rather, only use MCP when you are accessing a remote service. Prefer CLI over MCP by default.&lt;/p&gt;

&lt;p&gt;Why: A GraphQL API task took 1 step and 15k tokens through SDK/code-mode, versus 4 steps and 158k tokens through a real MCP server. &lt;a href="https://blog.cloudflare.com/code-mode/" rel="noopener noreferrer"&gt;[4]&lt;/a&gt; &lt;a href="https://x.com/YoniBraslaver/status/2055260079700791544" rel="noopener noreferrer"&gt;[5]&lt;/a&gt; An agent interface is part of the product. Give the model a small, typed, task-shaped API when you can. Do not assume a broad tool menu is better because it feels more general.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Browser skills&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The most concrete browser-agent example here is Hermes Agent / Autobrowse. A Hacker News workflow went from 102 seconds to 35 seconds, 23 turns to 8 turns, and $1.46 to $0.28 after the trace was simplified and saved as a skill. &lt;a href="https://github.com/NousResearch/hermes-agent" rel="noopener noreferrer"&gt;[12]&lt;/a&gt; &lt;a href="https://x.com/kylejeong/status/2052497318017208470" rel="noopener noreferrer"&gt;[6]&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The trick was not magic browser control. The trick was noticing the repeated slow path. If the agent clicks through the same UI every time, inspect the page, read state directly where possible, remove wasted navigation, and save the shorter path. That is a real skill: the agent gets faster because the workflow gets smaller.&lt;/p&gt;

&lt;p&gt;The adjacent tools worth tracking: the OpenAI Chrome plugin, BrowserCode, Autobrowse, browser-harness, Pi browser extensions, and Hermes browser skills. &lt;a href="https://developers.openai.com/codex/app/chrome-extension" rel="noopener noreferrer"&gt;[13]&lt;/a&gt; &lt;a href="https://github.com/browser-use/browsercode" rel="noopener noreferrer"&gt;[14]&lt;/a&gt; &lt;a href="https://x.com/kylejeong/status/2052497318017208470" rel="noopener noreferrer"&gt;[6]&lt;/a&gt; &lt;a href="https://github.com/browser-use/browser-harness" rel="noopener noreferrer"&gt;[15]&lt;/a&gt; &lt;a href="https://pi.dev/" rel="noopener noreferrer"&gt;[16]&lt;/a&gt; &lt;a href="https://github.com/NousResearch/hermes-agent" rel="noopener noreferrer"&gt;[12]&lt;/a&gt; The category is logged-in browser work: support queues, internal tools, research, scraping, QA, admin ops, and anything where the useful data sits behind a session.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Memory and retrieval&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Birdclaw is interesting because it gives agents access to a Twitter archive. &lt;a href="https://birdclaw.sh/" rel="noopener noreferrer"&gt;[17]&lt;/a&gt; GBrain points at a personal recall layer around OpenClaw / Hermes-style workflows. &lt;a href="https://github.com/garrytan/gbrain" rel="noopener noreferrer"&gt;[18]&lt;/a&gt; PageIndex is a useful reminder that simple retrieval, even BM25-only retrieval, still has a place. &lt;a href="https://github.com/VectifyAI/PageIndex" rel="noopener noreferrer"&gt;[19]&lt;/a&gt; The “RAG comeback in about 8 months” take lands because the archive problem is still unsolved in practice. &lt;a href="https://x.com/jxnlco/status/2053900167645159709" rel="noopener noreferrer"&gt;[20]&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A giant archive is not memory. Memory is knowing when to search, what to retrieve, how much to inject, and how to preserve provenance. A liked-tweets feed becomes useful only if the distillation keeps links, dates, claims, and enough source texture to make the note auditable later.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Security and guardrails&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Cloudflare tested Anthropic Mythos against fifty repositories. &lt;a href="https://x.com/Cloudflare/status/2056360412510060748" rel="noopener noreferrer"&gt;[21]&lt;/a&gt; Another thread said Claude Mythos Preview helped Firefox fix more security bugs in April than in the previous 15 months combined. &lt;a href="https://x.com/alexalbert__/status/2052468573516513762" rel="noopener noreferrer"&gt;[22]&lt;/a&gt; Read neither as “AI fixes security now”. Read them as scoped security work becoming agent-shaped: known repo, known bug class, patch candidates, review loop, and humans still responsible for merging.&lt;/p&gt;

&lt;p&gt;The most useful boring guardrail here is package-age delay. pnpm and npm both have settings that can avoid installing packages published too recently. &lt;a href="https://pnpm.io/settings#minimumreleaseage" rel="noopener noreferrer"&gt;[23]&lt;/a&gt; &lt;a href="https://docs.npmjs.com/cli/using-npm/config/#min-release-age" rel="noopener noreferrer"&gt;[24]&lt;/a&gt; This matters more with agents because agents will happily install dependencies at machine speed. A small delay catches some supply-chain attacks before they enter the workflow.&lt;/p&gt;

&lt;p&gt;Two defaults worth setting:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pnpm config &lt;span class="nb"&gt;set &lt;/span&gt;minimumReleaseAge 2880
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm config &lt;span class="nb"&gt;set &lt;/span&gt;min-release-age&lt;span class="o"&gt;=&lt;/span&gt;2d
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
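
&lt;p&gt;What the delay amounts to, as a minimal Python sketch: compare a release's publish time against a minimum age. &lt;code&gt;is_old_enough&lt;/code&gt; and the example timestamps are hypothetical; the default of 2880 mirrors the pnpm value above, which pnpm reads as minutes (two days). This is not how either package manager implements the setting.&lt;/p&gt;

```python
from datetime import datetime, timedelta, timezone


def is_old_enough(published_at, now, minimum_age_minutes=2880):
    """True when a release has been public for at least the minimum age."""
    return now - published_at >= timedelta(minutes=minimum_age_minutes)


# Hypothetical timestamps for illustration.
now = datetime(2026, 6, 1, 12, 0, tzinfo=timezone.utc)
fresh = datetime(2026, 6, 1, 6, 0, tzinfo=timezone.utc)      # published 6 hours ago
settled = datetime(2026, 5, 29, 12, 0, tzinfo=timezone.utc)  # published 3 days ago
print(is_old_enough(fresh, now), is_old_enough(settled, now))
```

&lt;p&gt;A release published six hours ago is skipped; one that has been public for three days is allowed through.&lt;/p&gt;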



&lt;p&gt;Clawvisor belongs in the same bucket: approve agent access without handing raw credentials to the model. &lt;a href="https://clawvisor.com/" rel="noopener noreferrer"&gt;[25]&lt;/a&gt; These dull permission layers are more interesting than another demo where an agent clicks around a dashboard with full access.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Tools worth opening&lt;/strong&gt;
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://walkinglabs.github.io/learn-harness-engineering/en/" rel="noopener noreferrer"&gt;Harness engineering learning site&lt;/a&gt;: useful if you want names for the parts around the model - evals, stop rules, retries, logs, and verification.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://github.com/run-llama/liteparse" rel="noopener noreferrer"&gt;LiteParse v2&lt;/a&gt;: Rust PDF parsing for agent/RAG workflows where PDFs are the bottleneck. The useful question is not “is it fast?” but “does it preserve the parts your downstream model needs?”&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Patter: voice AI in a few lines, with multiple providers. Useful if you want to prototype voice workflows without first committing to one stack. &lt;a href="https://www.getpatter.com/" rel="noopener noreferrer"&gt;[27]&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Minions: mission-control-style UI for Hermes Agent tasks. Worth opening if you are running multiple local agents and need a control plane. &lt;a href="https://www.producthunt.com/products/minions" rel="noopener noreferrer"&gt;[28]&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;OpenRouter Pareto Code: route to the cheapest code-capable model above a score threshold. This is the right kind of boring optimization for agent loops that run often. &lt;a href="https://openrouter.ai/docs/guides/routing/routers/pareto-router" rel="noopener noreferrer"&gt;[29]&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;OpenRouter Response Caching: useful for tests, retries, and repeated agent prefixes. Caching is not glamorous, but repeated context is where agent bills quietly grow. &lt;a href="https://openrouter.ai/announcements/response-caching" rel="noopener noreferrer"&gt;[30]&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Flue: TypeScript sandboxed-agent framework with runtimes and a secret proxy. Useful shape: run the agent in a controlled runtime instead of giving it everything. &lt;a href="https://flueframework.com/" rel="noopener noreferrer"&gt;[31]&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Zero: programming language for agents with explicit capabilities, JSON diagnostics, and typed safe fixes. Worth saving because explicit capabilities are a cleaner interface than vibes and instructions. &lt;a href="https://zerolang.ai/" rel="noopener noreferrer"&gt;[32]&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>claude</category>
      <category>llm</category>
    </item>
    <item>
      <title>From Grep to ast-grep: Building XRAY MCP for Code-Aware AI</title>
      <dc:creator>Srijan Shukla</dc:creator>
      <pubDate>Sat, 16 Aug 2025 13:21:52 +0000</pubDate>
      <link>https://dev.to/srijanshukla18/from-grep-to-ast-grep-building-xray-mcp-for-code-aware-ai-31m3</link>
      <guid>https://dev.to/srijanshukla18/from-grep-to-ast-grep-building-xray-mcp-for-code-aware-ai-31m3</guid>
      <description>&lt;p&gt;I enjoy coding with AI assistants, but they failed the moment I asked a structural question:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Show me everything that calls &lt;code&gt;authenticate()&lt;/code&gt;.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;They guessed, because under the hood they were using plain-text search.&lt;/p&gt;

&lt;p&gt;After experiments with grep scripts (too noisy), tree-sitter bindings (too fragile), and LSPs (too heavy), I found &lt;strong&gt;ast-grep&lt;/strong&gt;. It offers syntax-aware search without the infrastructure weight.&lt;/p&gt;
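
&lt;p&gt;A minimal sketch of the difference, separate from XRAY itself. The sample file and temp directory are made up for illustration, and the ast-grep step only runs if the CLI is installed. &lt;code&gt;$$$ARGS&lt;/code&gt; is ast-grep's multi-node metavariable, so the pattern is meant to match calls with any argument list.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Hypothetical sample file in a temp directory, for illustration only.
# tee writes the file and also echoes it so you can see the input.
demo_dir=$(mktemp -d)
printf '%s\n' \
  'def authenticate(user):' \
  '    return user is not None' \
  '' \
  '# authenticate() is called once below' \
  'token = authenticate("alice")' \
  'print("authenticate finished")' | tee "$demo_dir/app.py"

# Text search: counts every line that mentions the name, including the
# definition, the comment, and the string literal.
text_hits=$(grep -c 'authenticate' "$demo_dir/app.py")
echo "text matches: $text_hits"

# Syntax-aware search: the pattern describes a call expression.
# Skipped when ast-grep is not on the PATH.
if command -v ast-grep; then
  ast-grep run --pattern 'authenticate($$$ARGS)' --lang python "$demo_dir/app.py"
fi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The text search reports four matching lines for this file. The ast-grep pattern targets call expressions, so the definition, the comment, and the string literal should drop out of its results.&lt;/p&gt;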

&lt;p&gt;&lt;strong&gt;XRAY MCP&lt;/strong&gt; is my wrapper around ast-grep: a tiny server exposing &lt;code&gt;map&lt;/code&gt;, &lt;code&gt;find&lt;/code&gt;, and &lt;code&gt;impact&lt;/code&gt; tools. An assistant can now map the repo, locate a symbol, and see its references before suggesting changes.&lt;/p&gt;

&lt;p&gt;The code is open source if you want to try the approach.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/srijanshukla18/xray" rel="noopener noreferrer"&gt;https://github.com/srijanshukla18/xray&lt;/a&gt;&lt;/p&gt;

</description>
      <category>mcp</category>
      <category>ai</category>
      <category>programming</category>
      <category>vibecoding</category>
    </item>
  </channel>
</rss>
