AI Daily Digest: May 25, 2026 — Grok Build CLI, Cursor Composer 2.5, Qwen 3.7, X-Humanoid Wise KaiWu & More

#ai #machinelearning #programming #agents

5-min read · Curated daily by an AI Systems Architect
Focus: Agentic Workflows · AI Coding Tools · Embodied Intelligence

1. Grok Build CLI — 8 Parallel Subagents, 2M Context Window

xAI launched Grok Build CLI (May 14) in early beta for SuperGrok Heavy subscribers. Powered by Grok 4.3 beta with a 2-million-token context window, it offers 8 parallel subagents, headless mode, ACP support, terminal-based planning, and clean git diffs with worktree management.

【Technical Core】
Grok Build CLI packs more agentic features into one tool than anything else shipped in May 2026. The 2M-token context window means many large codebases can fit in a single agent run, while the parallel subagents allow concurrent task execution, which makes the architecture closer to a multi-agent system than to a single coding copilot. It is available on macOS and Linux; Windows users need WSL2.
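The fan-out pattern behind parallel subagents can be sketched in a few lines. This is a generic illustration, not Grok Build's implementation: `run_subagent` is a hypothetical stand-in for one subagent, and the cap of 8 simply mirrors the figure quoted above.

```python
import asyncio

MAX_PARALLEL = 8  # mirrors the subagent count quoted above


async def run_subagent(task: str) -> str:
    """Hypothetical stand-in for one subagent; a real one would call a model."""
    await asyncio.sleep(0.01)  # simulate I/O-bound work
    return f"done: {task}"


async def run_all(tasks: list[str], limit: int = MAX_PARALLEL) -> list[str]:
    """Run every task, with at most `limit` of them in flight at once."""
    gate = asyncio.Semaphore(limit)

    async def guarded(task: str) -> str:
        async with gate:
            return await run_subagent(task)

    # gather returns results in the same order as the input tasks
    return await asyncio.gather(*(guarded(t) for t in tasks))


if __name__ == "__main__":
    results = asyncio.run(run_all([f"task-{i}" for i in range(20)]))
    print(len(results), results[0])
```

The semaphore is the design choice that matters here: tasks queue up freely, but only eight run at a time, so adding more tasks never raises the concurrency level.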

【Why It Matters】
This is xAI's direct strike at Claude Code and OpenAI Codex CLI. With 8 parallel subagents at $299/mo, it is positioned as the power-user tool for teams running complex multi-step agent pipelines. The 2M window matters most for legacy codebase refactors and large monorepo operations, where an agent otherwise has to work from partial views of the code.
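Whether a given repository actually fits in a 2M-token window can be checked roughly before committing to a single-run refactor. A minimal sketch, assuming about 4 characters per token (a common rule of thumb, not Grok's actual tokenizer) and an illustrative set of source-file suffixes; all function names here are hypothetical:

```python
from pathlib import Path

CONTEXT_WINDOW = 2_000_000  # tokens, the window size quoted above
CHARS_PER_TOKEN = 4         # assumption: rough rule of thumb, not a real tokenizer
SOURCE_SUFFIXES = {".py", ".ts", ".go", ".rs", ".java"}  # illustrative only


def estimate_tokens(num_chars: int) -> int:
    """Convert a character count to an approximate token count (rounded up)."""
    return -(-num_chars // CHARS_PER_TOKEN)


def repo_token_estimate(root: str) -> int:
    """Sum the approximate token count of every matching source file under root."""
    total_chars = 0
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix in SOURCE_SUFFIXES:
            total_chars += len(path.read_text(encoding="utf-8", errors="ignore"))
    return estimate_tokens(total_chars)


def fits_in_window(token_count: int, reserve: float = 0.25) -> bool:
    """Keep `reserve` of the window free for instructions, tool output, and replies."""
    return token_count <= CONTEXT_WINDOW * (1 - reserve)


if __name__ == "__main__":
    tokens = repo_token_estimate(".")
    print(f"~{tokens:,} tokens; fits: {fits_in_window(tokens)}")
```

The 25% reserve is an arbitrary default. An agent run also spends context on its own plans and tool output, so a repository that only just fits on paper will not fit in practice.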

🔗 Grok Build CLI announcement

2. Cursor Composer 2.5 — 79.8% SWE-Bench Multilingual, Parity with Opus 4.7

Composer 2.5, Cursor's in-house coding model, reached GA on May 18. It scores 79.8% on SWE-Bench Multilingual and 63.2% on CursorBench v3.1, claiming parity with Claude Opus 4.7 and GPT-5.5 on coding tasks. Pricing starts at $0.50 input / $2.50 output per 1M tokens.
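To turn the per-token pricing into a budget figure, the arithmetic is straightforward. A minimal sketch, assuming the first quoted figure is the input price and the second the output price; the workload numbers are invented for illustration:

```python
INPUT_USD_PER_M = 0.50   # quoted price, assumed to be per 1M input tokens
OUTPUT_USD_PER_M = 2.50  # quoted price, assumed to be per 1M output tokens


def usage_cost(input_tokens: int, output_tokens: int,
               input_per_m: float = INPUT_USD_PER_M,
               output_per_m: float = OUTPUT_USD_PER_M) -> float:
    """Return the USD cost of a workload at the given per-million-token prices."""
    total = input_tokens * input_per_m + output_tokens * output_per_m
    return total / 1_000_000


if __name__ == "__main__":
    # Hypothetical month: 40M input tokens and 5M output tokens.
    print(f"${usage_cost(40_000_000, 5_000_000):.2f}")  # prints $32.50
```

Agent workloads tend to be input-heavy because context is re-sent on each step, so the input price usually dominates the bill even when the output price is several times higher.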

【Technical Core】
Built on top of Cursor 3.4's new team-configurable agent environment infrastructure, Composer 2.5 is tightly integrated with in-app PR review (create → merge inside the editor). SWE-Bench Multilingual covers multiple programming languages rather than natural languages, so the score signals strong performance beyond Python-centric benchmarks, a first for Cursor's proprietary models.

【Why It Matters】
Cursor is no longer just a UI wrapper around third-party models. With a proprietary model competitive with frontier labs on code benchmarks, Cursor is building a defensible vertical position. For teams already on Cursor 3.4, Composer 2.5 is available now without any config change — just select it from the model picker.

🔗 Cursor Composer 2.5

3. Qwen 3.7-Max-Preview — 1M Context, 1,000+ Tool Calls in 35-Hour Autonomous Run

Alibaba's Qwen 3.7-Max-Preview (May 20) introduces a 1-million-token context window with extended-thinking mode. In demos, it completed a 35-hour autonomous agent run chaining 1,000+ tool calls without degradation. Its LM Arena Elo of 1,475 makes it currently the highest-ranked Chinese model.
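A long-horizon run of this kind is, structurally, a loop that chains tool calls until the task completes or a budget runs out. The sketch below is a generic illustration, not Qwen's harness: `next_action` and `execute` are hypothetical callables, and the default budgets simply reuse the demo's figures.

```python
import time
from typing import Callable, Optional


def run_agent(next_action: Callable[[list[str]], Optional[str]],
              execute: Callable[[str], str],
              max_calls: int = 1000,
              max_seconds: float = 35 * 3600) -> list[str]:
    """Chain tool calls until the policy stops or a budget is exhausted."""
    history: list[str] = []
    deadline = time.monotonic() + max_seconds
    while len(history) < max_calls and time.monotonic() < deadline:
        action = next_action(history)
        if action is None:  # the policy signals that the task is finished
            break
        history.append(execute(action))
    return history


if __name__ == "__main__":
    # Toy policy: take five steps, then stop.
    steps = run_agent(
        next_action=lambda h: f"step-{len(h)}" if len(h) < 5 else None,
        execute=lambda action: action.upper(),
    )
    print(steps)
```

For scale, 35 hours spread over 1,000 calls averages about 2.1 minutes per call, so each step in such a run includes substantial model reasoning or tool latency.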

【Technical Core】
The 35-hour, 1,000+ tool-call run is the most credible long-horizon agentic demonstration to emerge from any Chinese lab, though it is a vendor demo rather than an independent benchmark. Priced at $2.50/$7.50 per 1M tokens on OpenRouter, it is competitively positioned against Western frontier models. Meanwhile, DeepSeek V4-Pro's permanent 75% discount ($0.435 input) makes China's open-weight ecosystem dramatically more accessible.

【Why It Matters】
The combination of Qwen 3.7's long-horizon capability and DeepSeek V4-Pro's cost structure signals that the frontier for agentic workloads is no longer exclusively held by OpenAI or Anthropic. Teams building high-volume agent pipelines should seriously evaluate Chinese models for cost optimization without sacrificing capability.

🔗 Qwen 3.7-Max-Preview on OpenRouter

4. Anthropic Billing Split — Separate Agent SDK Credit Pool from June 15

Anthropic announced (May 14, effective June 15) a billing split: Claude's credits will be divided into two pools — chat/first-party tools (existing Pro/Max) and a new Agent SDK credit pool for Claude Code, claude -p, GitHub Actions, and third-party frameworks.

【Technical Core】
Agent SDK allowances are $20/mo (Pro), $100/mo (Max 5x), and $200/mo (Max 20x). Usage above these caps is billed at full API rates. The split separates usage patterns: a Claude Code power user now has a dedicated agent budget that does not draw down conversational credits.
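For a quick audit, the allowance split reduces to simple arithmetic. A minimal sketch, assuming agent usage is metered in dollars and that anything above the cap is billed separately at API rates; the function name is hypothetical and the metering model is a guess at how the announcement will work in practice:

```python
# Monthly Agent SDK allowances in USD, as quoted above.
ALLOWANCE_USD = {"Pro": 20.0, "Max 5x": 100.0, "Max 20x": 200.0}


def split_usage(plan: str, usage_usd: float) -> tuple[float, float]:
    """Return (covered by allowance, billed at API rates) for one month."""
    cap = ALLOWANCE_USD[plan]
    covered = min(usage_usd, cap)
    overage = max(usage_usd - cap, 0.0)
    return covered, overage


if __name__ == "__main__":
    # Hypothetical month: $35 of agent usage on a Pro plan.
    print(split_usage("Pro", 35.0))  # prints (20.0, 15.0)
```

Running this against last month's agent usage per seat shows which users would cross their cap and by how much.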

【Why It Matters】
This is a signal that agentic usage has grown to the point where Anthropic needs to manage it as a distinct billing category. For teams running CI/CD pipelines with Claude Code or using the Agent SDK in production, this billing split is actionable immediately — audit your current usage before June 15 to avoid surprise charges.

🔗 Anthropic billing announcement

5. Gemini 3.5 Flash GA — Agent-First Positioning from Google I/O 2026

Gemini 3.5 Flash shipped GA on May 19, explicitly framed as an agent-first model. It outperforms Gemini 3.1 Pro on coding, agentic tasks, and multimodal reasoning, and is available across the Gemini API, AI Studio, Android Studio, Google Antigravity, and the consumer Gemini app.

【Technical Core】
Google's "agent-first" framing is substantive: the model is designed around long-horizon tool use rather than chat optimization. Its day-one availability in Google Antigravity (Google's VS Code competitor) positions it as the default engine for Google's developer ecosystem. Gemini 3.5 Pro is expected in June.

【Why It Matters】
With Gemini 3.5 Flash embedded across Google's developer stack, from Android Studio to Antigravity to the API, Google is effectively making Gemini 3.5 the invisible infrastructure of modern development. Teams already using Google Cloud or Firebase should expect agent capabilities to surface across their existing toolchain.

🔗 Google I/O 2026 Gemini

6. X-Humanoid Wise KaiWu Agent — Dynamic Spatial Memory for Humanoid Robots

Beijing's X-Humanoid unveiled the Wise KaiWu Agent (May 10), which it bills as the industry's first global scene perception and dynamic spatial memory system for humanoid robots. It claims four key breakthroughs: spatial memory, personalized user interaction at scale, one-build/multi-robot deployment, and multimodal force control.

【Technical Core】
The Wise KaiWu Agent introduces a persistent user memory system: robots can recognize individual users after a single interaction and remember behavioral preferences long-term. Combined with visual and tactile sensing for adaptive grasp force control, this narrows the gap between LLM reasoning and real-world physical manipulation. The "one-time development, multi-robot deployment" capability reduces deployment friction, since one build can be rolled out across multiple robots.
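To make the idea of spatial and user memory concrete, here is a toy data structure. It is purely illustrative and says nothing about how X-Humanoid's system is built: every class, field, and default below is hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional

Position = tuple[float, float, float]  # metres, in a shared map frame


@dataclass
class Sighting:
    position: Position
    seen_at: float  # seconds on the robot's clock


@dataclass
class RobotMemory:
    objects: dict[str, Sighting] = field(default_factory=dict)
    preferences: dict[str, dict[str, str]] = field(default_factory=dict)

    def observe(self, label: str, position: Position, now: float) -> None:
        """Record the latest known position of an object, replacing older ones."""
        self.objects[label] = Sighting(position, now)

    def locate(self, label: str, now: float, max_age: float = 600.0) -> Optional[Position]:
        """Return the remembered position, or None if unknown or too stale."""
        sighting = self.objects.get(label)
        if sighting is None or now - sighting.seen_at > max_age:
            return None
        return sighting.position

    def remember(self, user: str, key: str, value: str) -> None:
        """Store one preference for a recognized user."""
        self.preferences.setdefault(user, {})[key] = value

    def preference(self, user: str, key: str) -> Optional[str]:
        """Look up a stored preference, or None if nothing is remembered."""
        return self.preferences.get(user, {}).get(key)
```

The staleness check in `locate` reflects the "dynamic" part of the problem: a remembered position is only useful if the robot can tell when it has probably stopped being true.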

【Why It Matters】
Spatial memory in humanoid robots is the missing layer between impressive lab demos and real-world service deployment. When a robot can remember where objects are, who the user is, and how much force to apply — without being retrained — it becomes operationally viable for eldercare, logistics, and light manufacturing. X-Humanoid's Wise KaiWu Agent is a credible step toward persistent, personalized robotic service.

🔗 X-Humanoid Wise KaiWu Agent

7. Claude Mythos (Restricted Preview) — Rumored Autonomous Vulnerability Discovery

Anthropic's Claude Mythos is in restricted preview with ~50 partner organizations. Rumored capabilities include major leaps in reasoning, coding, and agentic execution, and notably the automated discovery of previously unknown software vulnerabilities.

【Technical Core】
Autonomous vulnerability discovery, where an agent independently finds zero-day vulnerabilities in production software, would represent a qualitative leap beyond current coding agents. While details are restricted, the roughly 50-partner preview structure suggests Anthropic is managing the capability rollout carefully before a broader release.

【Why It Matters】
If Claude Mythos can reliably discover novel vulnerabilities, it fundamentally changes the economics of both offensive and defensive security. Organizations running legacy codebases should watch this closely — it may arrive as a specialized product before becoming a general capability. The careful restricted release also signals Anthropic is treating this as a dual-use risk.

🔗 Kersai AI May 2026 Report