DEV Community

Cover image for All Agent Harnesses: The Live Comparison
Hector Flores
Hector Flores

Posted on • Edited on • Originally published at htek.dev

All Agent Harnesses: The Live Comparison

{/* LAST_UPDATED: 2026-06-10T23:36:00Z */}

🔴 LIVING ARTICLE — This page is continuously maintained and updated as platforms ship new features. Bookmark it. Come back often.

Last updated: June 10, 2026 (Evening) — Claude Code "A harness for every task" blog: Anthropic names the three failure modes dynamic workflows solve (agentic laziness, self-preferential bias, goal drift); JFrog + Anthropic launch enterprise supply chain governance plugin for Claude Code + Cursor + Copilot


Why This Page Exists

There are over a dozen platforms claiming to be the best way to build, run, and manage AI agents. Some are IDEs, some are cloud services, some are open-source libraries, and some are full autonomous coding environments. The terminology is a mess. Marketing pages all say "agent framework" but the products underneath are fundamentally different things.

I've been building multi-agent systems in production — 50+ agents running autonomously on cron schedules, managing everything from content pipelines to household logistics. That experience taught me something the comparison posts miss: the harness matters more than the model. The right control plane turns a chatbot into a production system. The wrong one turns your codebase into a liability.

This is my attempt to give you the definitive bird's-eye view. Every major agent harness, every feature set, head-to-head — with honest pros and cons for each. No ranking where my favorite conveniently wins. Just the facts, organized so you can make the right call for your situation.


What Is an Agent Harness?

Agent Platform Taxonomy — six distinct categories showing Harness, Framework, SDK, Tool, IDE Agent, and Autonomous Agent classifications
The six categories of agent platforms — not all "agent frameworks" are the same thing.

Before comparing anything, we need to define what we're actually comparing. The industry uses "agent framework," "agent SDK," and "agent harness" interchangeably — but they're different things. Anthropic's engineering team nailed the distinction: the harness is the runtime container that wraps around an agent's execution.

{/* TAXONOMY_TABLE_START */}

Category What It Does Who Controls the Loop Examples
Agent Harness Runtime container — lifecycle, governance, tool access, policy enforcement The platform GitHub Copilot, Bedrock Agents, Vertex AI Agent Builder
Agent Framework Programmable building blocks for composing agents in code The developer LangChain/LangGraph, CrewAI, AutoGen, Semantic Kernel
Agent SDK Thin client library binding your code to a vendor's harness The vendor's runtime OpenAI Agents SDK, Google ADK
Agent Tool / Sandbox Infrastructure component agents call into (increasingly integrated into SDKs) N/A — it's a tool E2B, Daytona, Modal, GKE Agent Sandbox, Cloudflare Workers
Agent Orchestrator Control plane running multiple harnesses side by side The orchestration layer Warp Oz
IDE Agent AI assistant embedded in a code editor with agent capabilities The IDE vendor Cursor, Devin Desktop, Google Antigravity, JetBrains AI
SLM Agent Harness Lightweight harness optimized for small language models (4B–27B) on-device The framework + local hardware Microsoft MagenticLite
Autonomous Agent Fully self-directed agent with its own cloud environment The agent itself Devin Cloud

{/* TAXONOMY_TABLE_END */}

The key distinction: a harness owns the loop. It decides whether a tool call executes, enforces budgets, manages context, and provides observability. A framework gives you the building blocks to construct that loop yourself. An SDK connects you to someone else's loop. As Analytics Vidhya's taxonomy puts it: frameworks provide building blocks, runtimes execute workflows, harnesses enforce control.

A May 2026 study by MBZUAI analyzing Claude Code's source (1,884 files, ~512k lines) quantified this: ~98.4% of a production agent is harness infrastructure (permissions, context management, sandboxing, tool routing, recovery) — only ~1.6% is AI decision logic. Four independently-built agents (Claude Code, Codex CLI, Aider, OpenClaw) all converged on the same harness patterns, suggesting this architecture is a fundamental constraint of the problem, not a design choice.

Why does this matter? Because if you're evaluating "agent platforms" without understanding these categories, you'll compare LangChain (a library you embed) against Bedrock Agents (a managed service you configure) and wonder why the feature lists look nothing alike. They're solving different problems at different layers.


Head-to-Head Comparison Tables

Harnesses, IDE Agents & Autonomous Agents

{/* HARNESS_COMPARISON_TABLE_START */}

Feature GitHub Copilot (Extensions + CLI) OpenAI Agents SDK Anthropic Claude Code Amazon Bedrock Agents Google Vertex AI Agent Builder Cursor Devin Desktop (fmr. Windsurf) Devin Cloud JetBrains AI
Tool Use Extensions API + MCP + function calling Function calling + hosted tools MCP protocol + Bash/file tools Action groups → Lambda/Step Functions Fulfillments + Vertex Extensions Built-in code/terminal tools Code search + editing tools Full dev environment tools IDE-native tools
Memory Copilot instructions + repo context + conversation Thread-level + vector stores Project indexing + conversation Knowledge bases (OpenSearch/S3) + sessions Vertex AI Search + flow state Codebase index + session Codebase index + session Codebase index + persistent sessions Project index + conversation
Multi-Agent Multi-agent via CLI (task tool, background agents) Handoffs between agents, swarm patterns Sub-agents via tool use Orchestration via Step Functions Sub-agent routing via flows Agents Window — up to 8 parallel agents in isolated worktrees ACP-compatible multi-agent (Agent Command Center) Parallel Devins Single agent
Sandboxing Docker containers, Codespaces Native harness/compute separation — 7 providers (E2B, Modal, Cloudflare, Vercel, Daytona, Blaxel, Runloop) Bash sandbox, permission prompts Lambda/VPC isolation Cloud Functions/Cloud Run Local or remote containers Local environment Cloud VM per session Local or remote
Governance Pre/post tool hooks (hooks.json), extension allowlists, org policies, MXC OS-level sandboxing Guardrails API, content filters Permission prompts, .claude files IAM + CloudTrail + CloudWatch IAM + Cloud Audit Logs User approval prompts User controls Admin controls Enterprise controls
Extensibility Extensions + custom agents + skills Plugin system + tool definitions MCP servers (open protocol) Lambda action groups Webhooks + Extensions Limited plugin API Limited API integrations Plugin marketplace
IDE Integration VS Code, Visual Studio, JetBrains, Xcode, CLI None (API-first) VS Code extension, terminal None (API/console) None (console/API) Native (Cursor IDE) Native (Devin Desktop IDE) Cloud IDE (VSCode-based) Native (JetBrains IDEs)
CLI Support ✅ Full CLI agent ✅ Claude Code CLI Slack/API
Cloud vs Local Both (local CLI + Codespaces + cloud agent) Cloud (OpenAI servers) Local-first + cloud Cloud (AWS) Cloud (GCP) Local + remote Local + remote Cloud only Local + remote
Pricing Free tier → $10/mo → $39/mo → Enterprise Pay-per-token + storage Free (Claude Code) + API costs Pay-per-token + AWS services Pay-per-token + GCP services Free → $20/mo → $40/mo → Enterprise Free → $15/mo → $60/mo → Enterprise $20/mo + $2.25/ACU → $500/mo teams Bundled with JetBrains subscription
Open Source Extensions spec open, CLI proprietary SDK open source (MIT), runtime proprietary CLI open source, MCP open protocol Proprietary Proprietary Proprietary Proprietary Proprietary Proprietary

{/* HARNESS_COMPARISON_TABLE_END */}

Agent Frameworks

{/* FRAMEWORK_COMPARISON_TABLE_START */}

Feature LangChain / LangGraph CrewAI Microsoft Agent Framework (AutoGen + Semantic Kernel) Google ADK Mastra
Tool Use Decorators + schemas + any callable Tool decorators with role binding Skills/functions (semantic + native) + agent tooling Tools with schema definitions TypeScript-first tool definitions
Memory Programmable (buffer, summary, vector, entity, graph) Shared crew memory + agent memory Vector store connectors + key-value + conversation history Session state + Google Search grounding Explicit read/write memory with observability
Multi-Agent Graph-based (nodes = agents, edges = flow) Crews with role-based orchestration Conversational groups + composable kernels (unified orchestration) Multi-agent with AgentTool delegation Multi-agent message flows
Sandboxing Developer-managed (any environment) Developer-managed Developer-managed (Azure containers available) Developer-managed (GCP available) Developer-managed
Governance Callbacks, LangSmith tracing Callbacks, logging hooks Azure IAM/RBAC + ACS (Agent Control Standard) + Foundry tracing Google Cloud IAM + logging Built-in observability, metrics, logs
Extensibility Very high — model-agnostic, 700+ integrations Moderate — growing ecosystem High — multi-language (C#, Java, Python, JS) + Microsoft ecosystem Moderate — Google ecosystem High — TypeScript ecosystem
Deployment Self-hosted (any infra) + LangSmith cloud Self-hosted (Python apps) Self-hosted + Azure + Foundry Agent Service (hosted agents) Self-hosted + GCP integration Self-hosted (Node.js)
Pricing Free (OSS) + LangSmith SaaS optional Free (OSS) + CrewAI Enterprise optional Free (OSS) + Foundry hosting optional Free (OSS) Free (OSS)
License MIT MIT MIT Apache 2.0 MIT

{/* FRAMEWORK_COMPARISON_TABLE_END */}


Every Harness, In Depth

{/* HARNESS_SECTION: github-copilot */}

GitHub Copilot (Extensions + CLI + Cloud Agent)

GitHub Copilot isn't just autocomplete anymore — it's a full agent harness with extensions, hooks for governance, and a CLI that runs autonomous agents in your terminal. The extensions system lets third-party services register as tools, and the hooks.json governance layer gives organizations pre/post-tool interception that no other IDE agent offers.

The cloud coding agent can autonomously research a repository, create implementation plans, and submit pull requests — triggered directly from GitHub Issues. It runs in a secure cloud sandbox with full access to the repo context.

May 2026: GitHub released a technical preview of the Copilot App — a standalone desktop client that moves Copilot from IDE extension to an agentic desktop workflow. Each task runs in its own session via git worktrees, enabling parallel work without conflicts. The app includes a cross-repo inbox, integrated terminal, and browser for live previews, guiding code changes from planning to merged PR. On May 18, GitHub made remote control for Copilot CLI sessions generally available — you can now start a CLI agent session and monitor, steer, approve, or stop it remotely from GitHub Mobile, github.com, VS Code, or JetBrains. This multi-surface capability means you can kick off a complex agent task at your desk and manage it from your phone while walking the dog.

Also in May 2026: Microsoft released WinUI agent skills — a modular plugin shipping 8 composable skills (dev-workflow, design, code-review, UI testing, packaging, WPF migration) that work with both GitHub Copilot and Claude Code. This cross-platform skill architecture demonstrates how agent skills can be portable across different harnesses, strengthening the ecosystem's shift toward standardized, composable agent capabilities.

May 21, 2026: Microsoft introduced the Plan agent for Visual Studio — a dedicated agent mode that asks clarifying questions, drafts an implementation plan, and lets you review and edit it before a single line of code changes. Plans are saved as .copilot/plans/plan-{title}.md, version-controlled alongside your code, and shareable with your team. Once approved, the Plan agent hands off directly to Agent mode for implementation. This closes the gap between "intent" and "code" that has caused frustration with autonomous agents jumping straight into implementation.

May 21–22, 2026: GitHub shipped the Copilot Agent Tasks REST API — a POST /agents/repos/{owner}/{repo}/tasks endpoint that lets you trigger the cloud coding agent from any script, portal, or CI pipeline without touching the web UI. The agent runs in a GitHub Actions environment, opens a PR when done, and supports mid-task clarification (waiting_for_user state). Economy model options (Claude Haiku 4.5 and GPT-5.4-mini at 0.33× cost multiplier, added May 18) make high-volume automation economical. A companion GET endpoint returns a repository's full agent configuration for security audits. Available on Copilot Business and Enterprise at launch; expanded to Copilot Pro, Pro+, and Max on June 4, 2026 — enabling individual developers to fan out refactors across repos, automate releases, and integrate cloud agent tasks into personal pipelines via PAT or OAuth tokens.

May 27, 2026: GitHub Copilot CLI launched a plugin and marketplace system — installable packages that can bundle custom agents, skills, hooks, MCP servers, and LSP integrations into a single distributable unit. Plugins are hosted via GitHub-backed marketplaces (copilot-plugins and awesome-copilot) or bundled directly from repositories. This transforms Copilot CLI from a single-agent terminal tool into an extensible platform where the community can ship reusable agent components.

May 29, 2026: Microsoft is building a Copilot "super app" that unifies GitHub Copilot, Copilot Chat, Copilot Cowork, and a new agentic workflow capability internally named "Autopilot" into a single destination. Led by Jacob Andreou (head of unified Copilot), the project aims for end-of-summer 2026 launch. A toggle lets users switch between personal and enterprise Microsoft 365 Copilots. GitHub Copilot has 4.7 million paid subscribers. The consolidation signals Microsoft's intent to make Copilot the single surface for all AI-assisted work — coding, chat, workflow automation — eliminating the fragmentation that confused customers across separate Copilot products.

May 27, 2026 — enterprise cost validation: Microsoft shifted engineers from Claude Code to GitHub Copilot CLI across its Experiences and Devices division (Windows, Microsoft 365, Teams, Surface), with a June 30 cutoff. After opening Claude Code to thousands of engineers in late 2025, per-engineer costs reached $500–$2,000/month under token-based pricing — an 8–12% surcharge on top of existing headcount costs. Uber burned through its entire 2026 AI coding budget in four months with 84% engineer adoption. The shift validates Copilot CLI's economic model: seat-based access plus stronger governance beats opaque token sprawl. Claude models still work inside Copilot CLI, and Microsoft's broader Anthropic investment ($5B) is unaffected — this is a pricing model decision, not a product quality decision.

June 1, 2026 — GitHub AI Credits: Starting tomorrow, GitHub moves Copilot to AI Credits billing while keeping plan prices unchanged — Pro $10/month, Pro+ $39/month, Business $19/user/month, Enterprise $39/user/month. One AI Credit equals $0.01, usage is metered by tokens at per-model rates, and code completions plus Next Edit Suggestions remain unlimited. Business customers get promotional $30/user/month credits for June–August and Enterprise gets $70/user/month; credits pool across the org, and admins can set spend caps at the enterprise, cost-center, and user level. GitHub also swapped the default Business/Enterprise base model from GPT-4.1 to GPT-5.3-Codex in May. The important framing: this is Copilot maturing into a more transparent enterprise platform — real budget controls and pooled credits without sacrificing the broad, multi-surface experience that made Copilot easy to operationalize.

June 2, 2026 — Copilot Max and governance updates: GitHub launched Copilot Max — a power-user tier for existing Pro/Pro+ subscribers with the highest included AI Credits and spending limits. Code review now consumes Actions minutes (in addition to AI Credits), and user-level budget controls are GA — admins can set universal or per-user spend caps with proactive email notifications as users approach thresholds. New sign-ups for Student, Pro, Pro+, and Max remain paused while GitHub scales the infrastructure. The cumulative picture: Copilot's billing is now a full enterprise governance surface — pooled credits, tiered spending, granular admin controls, and transparent per-model cost tracking.

June 2, 2026 — Build 2026: GitHub Copilot App, SDK GA, and Cloud Automations: Microsoft Build 2026 unveiled the GitHub Copilot App — an agent-native desktop experience in technical preview for Pro, Pro+, Business, and Enterprise users. The headline feature: Canvases — bidirectional work surfaces where agents and humans collaborate in real time. Canvases display plans, PRs, browser sessions, terminals, deployments, dashboards, or workflow states, with agents updating and developers editing/reordering freely.

The GitHub Copilot SDK is now generally available for Node.js/TypeScript, Python, Go, .NET, Rust, and Java — one runtime to build internal tools on the same agentic infrastructure that powers Copilot itself. Cloud automations let agents run on schedules, respond to events, open issues, and post comments, with a default prompt-permission model and autopilot option after trust is established.

Memory++ and /chronicle provide cross-surface continuity — your entire Copilot session history now syncs across the app, CLI, VS Code, JetBrains, and GitHub.com. Chronicle delivers standup summaries, personalized tips, and custom instructions surfaced from past work. Sessions are private by default, shareable as view-only via CLI (/share gist) or on github.com. Local sessions sync to your GitHub account automatically.

The Copilot CLI was also refreshed at Build 2026 with Rubber Duck (a conversational thinking-partner agent now GA — helps you work through architectural decisions and debugging puzzles without triggering any code changes, named after the classic rubber duck debugging technique), voice input (GA — narrate your session hands-free), a redesigned terminal interface with tabs for Issues, Pull Requests, and Gists (experimental via /experimental), and prompt scheduling (experimental) — extending the agent's autonomy beyond interactive sessions into always-on background work.

June 2, 2026 — Gemini 3.1 Pro + Gemini 3.5 Flash across Copilot surfaces: GitHub expanded model choice significantly with two Google Gemini models now available in Copilot CLI, Copilot cloud agent, the GitHub Copilot app (technical preview), and the Copilot SDK. Gemini 3.1 Pro (Preview) is available for Student, Pro, Pro+, Business, and Enterprise subscribers; Gemini 3.5 Flash for Pro, Pro+, Business, and Enterprise. Business and Enterprise admins must opt in via Copilot model policy settings. This makes Copilot one of the only developer tools offering simultaneous access to models from OpenAI, Microsoft (MAI), Anthropic, and Google across a single consistent interface.

June 2, 2026 — GitHub Agent Apps: AI agents installable from the Marketplace: GitHub launched Agent Apps — AI agents from GitHub partners that install from the GitHub Marketplace like any GitHub App and integrate directly into your GitHub workflows. Three entry points: assign an issue to the agent, @mention it in a pull request comment, or select it in the Agents UI with a custom prompt. The first wave includes partners like SonarQube (code quality and security analysis that gets access to the full PR context) with more partners and internal tooling support coming soon. The significance: GitHub is becoming a marketplace for specialized task agents — not just Copilot as a monolithic assistant, but a composable ecosystem where teams install the specific agents they need for code review, incident management, security analysis, and more. This is the agent equivalent of the GitHub Apps marketplace — a distribution layer that turns GitHub into a platform for third-party autonomous workers.

June 4, 2026 — 1M-Token Context Windows + Configurable Reasoning Levels: GitHub Copilot now supports one-million-token context windows — enabling deep work across large codebases, multi-file projects, and long documents without losing context. Available in VS Code, Copilot CLI, and the Copilot app today, expanding to more surfaces soon. Alongside this, configurable reasoning levels let developers dial in the speed/depth tradeoff and unlock extended thinking for architectural and debugging challenges. Both capabilities consume more AI credits at higher settings — GitHub recommends defaults for everyday tasks and extended options for complex, multi-file problems.

June 2, 2026 — MAI-Code-1-Flash: Microsoft launched MAI-Code-1-Flash, a 5B-parameter coding model built end-to-end by Microsoft and integrated directly into GitHub Copilot in VS Code. Designed for fast, efficient assistance in everyday developer workflows, it's trained with Copilot harnesses from production workflows to improve tool interaction. Key claims: solves harder problems with up to 60% fewer tokens, adaptive thinking that adjusts reasoning depth by request type, and strong instruction-following for single and multi-turn tasks. Rolling out to VS Code individual users via the Auto picker or model picker — no extra setup required. Part of Microsoft's seven new MAI models spanning image, voice, transcription, coding, and reasoning, with Frontier Tuning enabling organizations to train custom MAI models on their own workflows.

June 4, 2026 — Fix with Copilot for failing Actions (Pro/Pro+/Max): GitHub expanded Fix with Copilot for failing Actions to all individual-tier subscribers — Pro, Pro+, and Max — previously limited to Business and Enterprise plans. When a GitHub Actions workflow fails, click the Fix with Copilot button on the workflow run logs page and the cloud agent investigates the failure, pushes a fix branch, and tags you for review when done — running in its own cloud sandbox. Individual developers can now hand off CI firefighting to Copilot without needing an enterprise plan.

June 5, 2026 — Enterprise-managed plugins in VS Code (public preview): GitHub Copilot's enterprise-managed plugin distribution expands from Copilot CLI to VS Code (version 1.122+). Enterprise admins define plugin configurations — custom agents, skills, hooks, and MCP server references — in a settings.json file at .github-private/.github/copilot/settings.json. Both VS Code and the Copilot CLI automatically pull and apply these settings for licensed Copilot Business and Enterprise users, with configured plugins auto-installed on first authentication. This gives enterprise teams a single, version-controlled governance surface for distributing standardized agent tooling across the entire developer fleet.

June 5, 2026 — GPT-5.2 and GPT-5.2-Codex retired: GitHub deprecated GPT-5.2 and GPT-5.2-Codex across all Copilot experiences on June 5, 2026 — Chat, inline edits, ask and agent modes, and code completions — with the exception that GPT-5.2 remains available in Copilot Code Review. The suggested alternative is GPT-5.5. GPT-4.1 was also deprecated on June 1 (alternative: GPT-5.5). With both models retired, Copilot's model roster continues consolidating around newer-generation options: GPT-5.5, GPT-5.3-Codex, MAI-Code-1-Flash, Gemini 3.1 Pro, and Gemini 3.5 Flash. Enterprise admins may need to update model policies to enable access to the replacement models.

✅ Pros:

  • Deepest integration — VS Code, Visual Studio, JetBrains, Xcode, Eclipse, standalone CLI, and now a dedicated desktop app with Canvases and cloud automations
  • Copilot SDK GA — build internal tools on the same agentic runtime (Node.js/TS, Python, Go, .NET, Rust, Java)
  • Remote control for CLI sessions (GA) — monitor and steer agents from mobile, web, or any IDE
  • Extension system lets any service become an agent tool — unique in the IDE space
  • hooks.json governance — pre/post tool call interception for enterprise policy enforcement
  • CLI agent supports multi-agent patterns (background agents, task delegation, agent steering)
  • Enterprise trust — SSO, audit logs, content exclusions, org-level policy, IP indemnity
  • GitHub ecosystem integration — Actions, Issues, PRs, Codespaces, Security
  • MCP support for extensible tool discovery
  • Free tier available, competitive pricing at every tier
  • AI Credits add transparent budget controls, pooled usage, and admin spend caps without giving up Copilot's broad enterprise footprint
  • MAI-Code-1-Flash — first-party Microsoft coding model with 60% fewer tokens on hard problems

❌ Cons:

  • Extension ecosystem is growing but younger than VS Code's plugin marketplace
  • CLI agent requires local setup (though Codespaces solves this)
  • Multi-agent patterns in CLI are powerful but require context engineering knowledge
  • Cloud agent is newer and still maturing compared to the IDE and CLI experience

🎯 Best for: Teams already in the GitHub ecosystem who want IDE + CLI + cloud agent coverage with enterprise governance. If you need agents that integrate with your entire DevOps workflow — from issue to PR to deployment — nothing else touches the integration depth.

{/* HARNESS_SECTION_END: github-copilot */}


{/* HARNESS_SECTION: openai-agents-sdk */}

OpenAI Agents SDK

The OpenAI Agents SDK (which evolved from the Swarm research project) is OpenAI's production-grade framework for building multi-agent workflows. It's MIT-licensed and has undergone a major architecture overhaul in May 2026 — transforming from a lightweight chat SDK into a full agent infrastructure platform.

May 2026 — Architecture Overhaul (GPT-5.4): OpenAI rewrote the Agents SDK from the ground up, splitting into a two-layer architecture: harness (control flow, model calls, tool routing, pause/resume) and compute (isolated sandbox for file I/O, dependency installation, code execution). The two layers are fully decoupled — API keys and credentials never enter the execution sandbox.

Seven sandbox providers are officially supported: Blaxel, Cloudflare, Daytona, E2B, Modal, Runloop, and Vercel. A new Manifest configuration layer describes the agent workspace — mounted files, cloud storage sources (AWS S3, GCS, Azure Blob, Cloudflare R2), and artifact outputs. Switch sandbox providers by changing one config line.

The SDK now includes Codex-inspired tooling: configurable memory, file system tools, patch/apply editing, skills-based progressive information disclosure, AGENTS.md custom instructions, MCP tool access, and shell execution. Snapshot-based checkpoint recovery enables long-running agents to survive container failures, and multi-sandbox parallel execution provides sub-agent isolation.

OpenAI is also consolidating ChatGPT, Codex, and the API under Greg Brockman into a single agentic platform. The positioning is clear: OpenAI aims to own the foundational infrastructure layer for production agents, pushing third-party frameworks (LangChain, CrewAI, AutoGen) toward higher-level orchestration or more specialized tooling.

May 22, 2026: Codex shipped a Goals feature for long-running autonomous tasks. Type /goal to define a persistent objective — e.g., "migrate this JavaScript codebase to TypeScript in strict mode" — and Codex works toward it continuously, potentially for hours or days. Goals can be paused, resumed, and edited mid-flight. The agent logs every file change, command, and generated test as it works, and a single goal has been demonstrated handling over 100 hours of sustained coding work. Available in both the Codex app and CLI.

May 30, 2026: OpenAI rolled out "Computer Use" screen control to the Codex Windows app — transforming it from a text-only coding assistant into an autonomous engineering workstation. The agent can now view the user's screen, move the cursor, click buttons, and type inside external applications (e.g., extract Figma wireframes and translate them into React components). Available via the Microsoft Store for ChatGPT Plus/Pro/Business/Enterprise. Simultaneously, OpenAI updated the ChatGPT iOS/Android app for full remote control of Windows host machines — letting engineers trigger agent threads, review progress, and approve/reject code changes from their phones.

May 18–19, 2026: OpenAI began testing Codex for Mac remote control via the ChatGPT iPhone app — remotely control your Mac (files, apps, browser) from your phone using a new Secure Relay Layer that avoids public internet exposure. And Dell and OpenAI announced a partnership to deploy Codex on-premise via Dell AI Factory, bringing the agent closer to enterprise data with policy controls and approval gates (5,000+ Dell AI Factory customers already using the stack).

May 28, 2026: In an OpenAI Build Hour session, the team showcased three major additions to the Agents SDK: (1) a Skills API that lets developers package reusable, versioned workflows into skills that agents can mount via a manifest file; (2) a Hosted Shell tool running CLI-grade tasks inside isolated containers via the Responses API — useful for build/test steps and linting; and (3) network-enabled containers with explicit outbound access control. TypeScript support for sandbox agents is now available. The session demonstrated end-to-end task automation using the model-native harness pattern — harness handles control flow while sandboxes handle compute, with no shared credentials between layers.

✅ Pros:

  • Native harness + sandbox architecture — production-grade isolation out of the box, no DIY sandboxing
  • 7 sandbox providers with Manifest-based portability — switch providers without rewriting code
  • Codex-level tooling (file system, patch/apply, shell, memory) included in the SDK
  • Checkpoint recovery and multi-sandbox parallelism for long-running agents
  • Native access to OpenAI's latest models (GPT-5.4, o3, etc.) with minimal latency
  • Built-in tracing and observability via the OpenAI dashboard
  • Guardrails API for input/output validation
  • Handoffs pattern makes multi-agent delegation intuitive
  • Active development with 26,000+ GitHub stars
  • Companies like Ramp report 50%+ of PRs created by agents on this stack

❌ Cons:

  • Tightly coupled to OpenAI models — limited multi-provider support
  • No IDE integration — purely API/code-first
  • Python-first — TypeScript support now available for sandbox agents but still catching up
  • SDK remains at 0.Y.Z versioning (pre-1.0 stability guarantees)
  • Enterprise governance is limited to OpenAI's platform controls (no org-level hook interception like GitHub Copilot)
  • Positions OpenAI as infrastructure gatekeeper — third-party framework ecosystem may narrow

🎯 Best for: Teams building production AI agents on OpenAI's platform who need out-of-the-box sandboxing, multi-cloud storage integration, and Codex-level tooling without assembling their own infrastructure stack. The new architecture makes this the strongest "batteries-included" SDK for OpenAI-native development.

{/* HARNESS_SECTION_END: openai-agents-sdk */}


{/* HARNESS_SECTION: anthropic-claude-code */}

Anthropic Claude Code

Claude Code is Anthropic's agentic coding tool — a CLI-first agent that reads your codebase, runs commands, and edits files. It's powered by Claude and uses the Model Context Protocol (MCP) for extensible tool access. The CLI itself is open source.

May 2026: At Code With Claude 2026, Anthropic unveiled major updates: managed agents with an advisor/executor pattern (smaller models handle routine tasks, larger models tackle hard cases), an internal "Rubber Duck" critic for post-planning review, auto mode with a safety classifier to limit destructive actions, worktree-based branch isolation, and routines for scheduled/webhook-triggered workflows.

Claude Managed Agents Platform (May 6): Anthropic launched three new primitives that transform managed agents from stateless tools into persistent, memory-aware infrastructure:

  • Dreaming (research preview): A background process where agents autonomously review prior sessions, clean memory duplicates/contradictions, and extract reusable patterns — improving future performance without changing model weights. Harvey (legal AI) reported ~6× higher task completion rates purely from agents learning from their own history.
  • Outcomes (public beta): A formal grading system where a separate Claude evaluator agent scores work against a developer-defined rubric checklist. If the work doesn't pass, the grader provides specific feedback and the agent iterates automatically — built-in self-critique with retry loops.
  • Multi-agent Orchestration (public beta): A lead agent breaks tasks into sub-tasks, assigns them to specialist agents (each with different models, prompts, and tools), runs them in parallel with shared memory and artifacts, and coordinates until completion. First-class agent team composition without custom orchestration code.
  • Webhooks (public beta): Built-in webhook support for agent workflows to notify external systems (Slack, email, custom apps) on job completion or milestone events.

Agent View & /goal (v2.1.139+, May 2026): Claude Code shipped Agent View — a live session dashboard (claude agents) showing all running, blocked, and completed sessions with real-time metrics. The /goal command creates autonomous loops: set a completion condition and Claude works across unlimited turns — writing code, running tests, fixing failures — until the condition is met. Combined with /bg for background execution, Claude Code now functions as a persistent autonomous worker that only requests human input when genuinely stuck. Subsequent releases (v2.1.143–v2.1.144) added worktree isolation controls, model/effort persistence across session restarts, and /resume for recovering backgrounded sessions.

Also in May 2026: Anthropic launched MCP Tunnels and Self-Hosted Sandboxes for Claude Managed Agents. MCP Tunnels lets agents reach internal MCP servers (databases, APIs, knowledge bases) without exposing them publicly — a single encrypted outbound connection replaces inbound firewall rules. Self-Hosted Sandboxes (public beta) split the architecture: the agent loop stays on Anthropic's infra while tool execution moves to customer infrastructure via launch partners Cloudflare, Daytona, Modal, and Vercel. Both features target enterprise security and compliance teams where data exfiltration gates previously blocked agent PoCs.

May 28, 2026 — billing restructure: Anthropic announced a billing split effective June 15, 2026 — programmatic Agent SDK usage moves to a separate monthly credit pool. Affected: Agent SDK calls, claude -p headless mode, Claude Code in GitHub Actions, and third-party harnesses (OpenClaw, Hermes, etc.). Credits are tied to subscription plans with separate quotas from interactive Claude.ai usage. This follows enterprise cost pressure — Microsoft canceled most internal Claude Code licenses (per-engineer costs of $500–$2,000/month) and Uber exhausted its 2026 AI coding budget in four months at 84% developer adoption. The billing split signals Anthropic is searching for a sustainable pricing model as agentic usage patterns generate far more tokens per session than interactive chat.

May 28, 2026 — Claude Opus 4.8 + Dynamic Workflows: Anthropic released Claude Opus 4.8 alongside a $65 billion fundraise at a $965B valuation. The model achieves record coding benchmark scores at the same pricing as Opus 4.7 ($5/$25 per 1M input/output tokens). A new fast mode runs 2.5× faster at ~3× cheaper. The marquee feature: Dynamic Workflows (research preview for Enterprise/Team/Max plans in Claude Code). Dynamic Workflows decompose large tasks into hundreds of parallel subagents — each handling a slice of work (reading, testing, bug-finding, validation), then adversarially verifying results before synthesis. Demonstrated for large-scale code migrations where a single session orchestrates hundreds of parallel workers within one conversation. Under the hood, Anthropic's official launch post and workflow docs describe JavaScript orchestration scripts built around primitives like agent(), parallel(), pipeline(), and phase(), with runtime caps of up to 16 concurrent agents (or cpu_cores - 2, whichever is lower) and 1,000 total agents per run. The feature requires Claude Code v2.1.154+ and now spans the CLI, Desktop, VS Code extension, Claude API, Amazon Bedrock, Vertex AI, and Microsoft Foundry. New effort controls let users trade reasoning depth for speed. Also in this release: self-healing sessions (auto-detection and bypass of fatal exceptions to keep sessions alive), a full-screen TUI renderer, real-time streaming of thinking steps, enhanced MCP connection reliability, and a "feedback" feature for long-term adaptive learning.

✅ Pros:

  • CLI-first design — excellent for terminal-native developers
  • MCP protocol is open and vendor-neutral — any MCP server works as a tool
  • Strong project understanding via codebase indexing
  • .claude files for project-level instructions and rules
  • Sub-agent delegation via the Task tool for parallel work
  • Managed agents with advisor/executor pattern and internal critic for reliability
  • Dreaming (research preview) — agents autonomously learn from prior sessions between runs
  • Outcomes system — formal grading rubric with auto-retry ensures quality
  • Multi-agent orchestration (public beta) — lead + specialist agents with shared workspace
  • Dynamic Workflows (research preview) — decompose tasks into hundreds of parallel subagents with adversarial verification
  • Agent View dashboard (claude agents) for monitoring all sessions in real time
  • /goal command for autonomous goal-driven loops across unlimited turns
  • Auto mode with safety classifier + worktrees for isolated branch work
  • Self-healing sessions — auto-recovery from fatal exceptions keeps agents alive
  • Routines + webhooks for cron/event-triggered agent workflows
  • Open source CLI with transparent tool execution
  • Scheduled tasks for automated maintenance
  • MCP Tunnels for private internal system access + Self-Hosted Sandboxes for on-prem tool execution

❌ Cons:

  • Anthropic-model-only — can't use GPT-4o or Gemini through it
  • No visual IDE (VS Code extension exists but it's CLI-in-editor)
  • API costs can escalate quickly with heavy agentic usage (long context windows) — Microsoft and Uber both hit budget limits at scale
  • Enterprise governance features are newer (MCP Tunnels still in research preview)
  • Permission system relies on user approval prompts — no org-level policy hooks
  • Billing restructure (June 15) adds complexity — separate credit pools for interactive vs Agent SDK usage

🎯 Best for: Developers who live in the terminal and want a powerful, extensible coding agent with open protocols. MCP's vendor-neutral tool ecosystem is a genuine differentiator for teams building cross-platform integrations.

{/* HARNESS_SECTION_END: anthropic-claude-code */}


{/* HARNESS_SECTION: langchain-langgraph */}

LangChain / LangGraph

LangChain is the most widely adopted agent framework, with LangGraph adding stateful, graph-based orchestration for complex multi-agent workflows. Together they offer 700+ integrations covering every major model, vector store, and tool.

June 1, 2026: LangGraph 1.2.3 shipped the v3 streaming protocol — a significant architecture upgrade. RemoteGraph now supports v3 streaming natively, the SDK adds WebSocket transports alongside SSE, and tool-dispatched subagents get named identifiers via lc_agent_name for dramatically better observability in multi-agent systems. New stream decoders with interleave_projections let you multiplex messages and tool call projections from multiple agents into a single stream. The SDK (0.4.1) also distinguishes between user-initiated and system cancellations — useful for building resilient retry logic.

✅ Pros:

  • Largest ecosystem — 700+ integrations, massive community, extensive documentation
  • LangGraph's graph-based orchestration is genuinely powerful for complex workflows
  • Model-agnostic — swap between OpenAI, Anthropic, Google, open-source models freely
  • LangSmith provides production-grade tracing, evaluation, and monitoring
  • Checkpointed workflows for long-running agents with state persistence
  • Python and JavaScript SDKs

❌ Cons:

  • Steep learning curve — abstraction layers can feel over-engineered for simple use cases
  • No built-in sandboxing or execution isolation (BYO infrastructure)
  • No governance hooks at the platform level — you build your own policy layer
  • Frequent breaking changes between major versions
  • Enterprise adoption often requires significant custom engineering on top of the framework

🎯 Best for: Teams building custom multi-agent applications that need maximum flexibility and model portability. If you're willing to invest in infrastructure, LangGraph's graph-based orchestration is best-in-class for complex stateful workflows.

{/* HARNESS_SECTION_END: langchain-langgraph */}


{/* HARNESS_SECTION: crewai */}

CrewAI

CrewAI takes a role-based approach to multi-agent systems. You define "crews" of agents with specific roles, goals, and backstories, then orchestrate them through sequential or hierarchical task execution.

May 2026 update: CrewAI v1.14.5 shipped May 18 with A2A (Agent-to-Agent) protocol support, enabling inter-crew communication through Google's open standard. The release also deprecates CrewAgentExecutor in favor of the new AgentExecutor pattern, signaling a maturing internal architecture.

May 28, 2026: CrewAI 1.14.6 graduated from pre-release to stable, shipping the Agent Control Plane (ACP) Beta — a managed orchestration layer for multi-crew coordination. ACP introduces centralized agent registry, deployment management, and inter-crew communication through a hosted control surface. The release also hardens state management: checkpoint serialization now handles BaseModel fields as JSON schema, drops unroundtrippable callbacks, and supports full AgentExecutor restore from checkpoint state. Security-wise, the StdioTransport was enhanced to prevent environment variable leakage. The Skills Repository moved behind a CREWAI_EXPERIMENTAL gate — signaling CrewAI is consolidating its core before expanding its plugin surface.

✅ Pros:

  • Intuitive role-based abstraction — easy to conceptualize multi-agent collaboration
  • Quick to prototype — get a working multi-agent system in minutes
  • Growing ecosystem with pre-built tools and templates
  • Good documentation and active community
  • CrewAI Enterprise adds deployment, monitoring, and team management

❌ Cons:

  • Less flexible than LangGraph for complex orchestration patterns
  • Smaller integration ecosystem than LangChain
  • Production hardening requires significant custom work
  • No built-in sandboxing, governance, or policy enforcement
  • Role/backstory abstraction can feel artificial for non-conversational use cases

🎯 Best for: Teams prototyping multi-agent systems who want an intuitive, role-based API. Great for research, content generation, and analysis workflows where agents play distinct specialist roles.

{/* HARNESS_SECTION_END: crewai */}


{/* HARNESS_SECTION: microsoft-autogen */}

Microsoft AutoGen

AutoGen is Microsoft's framework for building scalable multi-agent conversational applications. It excels at patterns where agents debate, critique, and collaborate through structured conversations.

⚠️ Superseded (April 2026): Microsoft launched Microsoft Agent Framework 1.0 as the unified successor to AutoGen and Semantic Kernel. AutoGen remains open-source with critical fixes, but new features and development effort are moving to Agent Framework. Migration guide available.

✅ Pros:

  • Rich multi-agent conversation patterns — critic, coder, planner, executor roles
  • Deep Azure ecosystem integration (Azure OpenAI, Cognitive Search, Container Apps)
  • Strong research foundation (from Microsoft Research)
  • Code execution capabilities with Docker-based isolation
  • Active community and growing sample library

❌ Cons:

  • API has undergone significant redesigns (AutoGen 0.4 → AgentChat) — migration friction
  • Heavier abstraction than OpenAI Agents SDK for simple use cases
  • Primarily Python — limited multi-language support
  • Conversation-centric design doesn't fit all agent patterns
  • Enterprise governance still requires custom Azure integration work

🎯 Best for: Research teams and enterprises in the Microsoft ecosystem building multi-agent conversational systems — code review agents, planning committees, or collaborative debugging workflows.

{/* HARNESS_SECTION_END: microsoft-autogen */}


{/* HARNESS_SECTION: microsoft-agent-framework */}

Microsoft Agent Framework

Microsoft Agent Framework reached Release Candidate in February 2026 and General Availability on April 8, 2026. This is Microsoft consolidating its agent story into one open-source SDK: the enterprise-ready plugin and identity patterns from Semantic Kernel, the orchestration research from AutoGen, and a clear opinionated model for where new work should go. Python and .NET are first-class, pip install agent-framework and dotnet add package Microsoft.Agents.AI get you started, and the framework bakes in MCP, graph workflows, checkpointing, human-in-the-loop, and multi-model support across Azure OpenAI, Microsoft Foundry, OpenAI, Anthropic, Ollama, and more.

May 28, 2026 (python-1.7.0): Microsoft added HarnessAgent support plus A2AAgentSession with referenced task IDs and input-required flows, making the framework more credible for production cross-agent coordination. The same release also introduced experimental Foundry prompt-agent conversion and deployment APIs — a sign Microsoft wants Agent Framework to span local development and hosted agent deployment without forcing a framework switch.

Also in late May 2026: the emerging create_harness_agent pattern is worth watching because it packages eight subsystems in one call: function invocation, history, context compaction, todo planning, plan/execute mode, durable memory, skill loading, and OpenTelemetry instrumentation. Microsoft is also shipping FIDES (Flow Integrity Deterministic Enforcement System) middleware for prompt-injection defense — deterministic flow labeling instead of heuristic best-effort filtering. That moves MAF closer to a reusable harness runtime, not just a bag of framework primitives.

June 1, 2026 — Build 2026 Preview: Microsoft previewed Agent Framework sessions for Build 2026 (starting June 2). Key sessions include "Claw and agent harness in Microsoft Foundry" (deep dive on multi-agent systems, Claw patterns, hosted agents, triggers, state management), "From prototype to production" (lifecycle for production-grade agents with Foundry Agent Service), and "Govern open-source AI agents, any framework, any scale." A demo session builds an autonomous "Agentic Startup Content Factory" across three frameworks — LangGraph, .NET Microsoft Agent Framework, and GitHub Copilot SDK — deployed to Azure Container Apps with Microsoft Foundry observability. Microsoft also announced new security capabilities designed to stop prompt injection from hijacking agents.

✅ Pros:

  • Unified successor to AutoGen + Semantic Kernel — finally one Microsoft framework instead of two overlapping bets
  • Strong multi-agent story — graph-based workflows, type-safe routing, handoffs, group chat, checkpointing, and pause/resume
  • Standards-forward — built-in MCP support plus A2A interoperability for cross-runtime collaboration
  • Python + .NET first-class from day one, with strong Azure and Foundry integration
  • Better architecture for real production systems — agent sessions, context providers, middleware, tracing, and explicit human approval paths
  • Open source with migration guidance for both Semantic Kernel and AutoGen teams

❌ Cons:

  • New name, new package surface, and a live migration story — the ecosystem is still catching up to the consolidation
  • JavaScript/Java developers don't get the same first-class story as Python and .NET today
  • Still a framework, not a managed harness — you own deployment, runtime isolation, and governance unless you pair it with Azure/Foundry
  • Microsoft's previous two-framework history means some teams will wait before fully committing

🎯 Best for: Teams that want Microsoft's clearest forward path for agent systems — especially Python or .NET shops building on Azure, Foundry, or Microsoft 365-adjacent infrastructure. If you're starting fresh in Microsoft's ecosystem in 2026, this is the framework to evaluate first.

{/* HARNESS_SECTION_END: microsoft-agent-framework */}


{/* HARNESS_SECTION: microsoft-semantic-kernel */}

Microsoft Semantic Kernel

Semantic Kernel is Microsoft's orchestration framework for building AI copilots and agents in enterprise applications. It bridges LLM capabilities with traditional application code through a plugin architecture.

⚠️ Roadmap Superseded (April 2026): Microsoft Agent Framework 1.0 is the recommended path forward for new agent projects. Semantic Kernel remains supported as a GA SDK, but its roadmap is now superseded by Agent Framework. Microsoft recommends existing SK users migrate to Agent Framework for future feature development.

⚠️ Critical Security Update (May 22, 2026): Microsoft disclosed two critical vulnerabilities affecting Semantic Kernel. CVE-2026-25592 (CVSS 10.0, .NET SDK) — an accidentally exposed [KernelFunction] annotation on DownloadFileAsync in SessionsPythonPlugin enables full remote code execution via prompt injection. Fix: Microsoft.SemanticKernel.Core >= 1.71.0. CVE-2026-26030 (CVSS 9.8, Python SDK) — InMemoryVectorStore runs attacker-controlled filter expressions through eval(), allowing arbitrary Python execution from a poisoned RAG corpus. Fix: pip install "semantic-kernel>=1.39.4". Microsoft's guidance for both: disable auto-invocation on any agent with access to disk, shell, or production data. The same week saw similar vulnerabilities in PraisonAI and OpenClaw, confirming this is a systemic pattern across agent frameworks. Upgrade immediately if you are running Semantic Kernel in production.

✅ Pros:

  • Multi-language — C#, Java, Python, JavaScript support
  • Tight Azure and Microsoft 365 integration (RBAC, managed identities, Entra ID)
  • Plugin architecture makes it natural for enterprise "copilot" experiences
  • Strong typing and enterprise patterns (.NET-first design)
  • Good fit for building custom internal copilots on Microsoft stack

❌ Cons:

  • Multi-agent support is manual — less opinionated than AutoGen or CrewAI
  • Not designed primarily as an agent framework — more of an orchestrator
  • Smaller community than LangChain
  • .NET-first design can feel awkward in Python-dominant AI ecosystem
  • Less third-party model support compared to LangChain

🎯 Best for: Enterprise .NET/Java teams building internal copilots on Azure. If your stack is C# + Azure + Microsoft 365, Semantic Kernel is the natural choice for AI-augmented applications.

{/* HARNESS_SECTION_END: microsoft-semantic-kernel */}


{/* HARNESS_SECTION: amazon-bedrock-agents */}

Amazon Bedrock Agents

Amazon Bedrock Agents is AWS's fully managed agent harness. You configure agents declaratively — pick a model, define action groups (Lambda functions), attach knowledge bases (OpenSearch/S3), and Bedrock handles the runtime.

✅ Pros:

  • True managed harness — no loop code to write, configure and deploy
  • Strongest infrastructure isolation — Lambda/VPC/IAM per tool
  • Deep AWS service integration (S3, DynamoDB, Step Functions, CloudWatch)
  • Enterprise-grade governance — IAM, CloudTrail, service control policies, VPC endpoints
  • Knowledge bases with automated RAG patterns
  • Multi-model support (Claude, Llama, Titan, Mistral via Bedrock)

❌ Cons:

  • AWS lock-in — tools must be Lambda/AWS services
  • Declarative configuration limits flexibility for novel agent patterns
  • Multi-agent orchestration is indirect (via Step Functions, not native)
  • No IDE integration — API/console only
  • Cost can be opaque (token costs + Lambda + storage + data transfer)
  • Less community tooling compared to open-source frameworks

🎯 Best for: AWS-native enterprises that want a managed, governed agent runtime with minimal custom code. If your infrastructure is already on AWS and compliance requirements are strict, Bedrock Agents' built-in governance is a major advantage.

{/* HARNESS_SECTION_END: amazon-bedrock-agents */}


{/* HARNESS_SECTION: google-vertex-ai-adk */}

Google Vertex AI Agent Builder + ADK 2.0

Vertex AI Agent Builder is Google Cloud's managed harness, building on Dialogflow CX. The Agent Development Kit (ADK) 2.0 — released stable on May 19, 2026 — is the open-source companion framework featuring a new graph-based workflow execution engine and structured Task API for multi-agent orchestration.

✅ Pros:

  • Managed harness with dialog management roots (Dialogflow CX) — great for conversational flows
  • ADK 2.0 is open source (Apache 2.0) with graph-based Workflow Runtimes — deterministic execution independent of LLM decisions
  • Structured Task API for explicit multi-agent delegation with A2A protocol support for cross-framework agent communication
  • Google Search grounding for real-time information access
  • Vertex AI Search integration for enterprise RAG
  • GCP governance — IAM, VPC Service Controls, Cloud Audit Logs
  • Multi-model support via Vertex AI (Gemini, Claude, Llama, Mistral)
  • Native Cloud Run and Vertex AI integration gives GCP teams a LangGraph alternative with built-in infrastructure

❌ Cons:

  • GCP lock-in for the managed harness (ADK is open-source, but best experience requires GCP)
  • Agent Builder's dialog-management heritage can feel constraining for code-centric agents
  • 2.0 introduces breaking changes from 1.x (entire execution model shifted from LLM-driven to graph-based)
  • Ecosystem still smaller than LangChain/LangGraph outside Google Cloud
  • Pricing complexity similar to AWS (token costs + GCP services)
  • LiteLLM security concern: ADK 2.0 stable excludes versions 1.82.7–1.82.8 (compromised dependency)

🎯 Best for: GCP-native enterprises building conversational or multi-agent systems, or teams wanting an open-source graph-based orchestration framework (ADK 2.0) with optional managed deployment. Direct competitor to LangGraph for Python agent orchestration — choose ADK if you're already deep in Google Cloud.

{/* HARNESS_SECTION_END: google-vertex-ai-adk */}


{/* HARNESS_SECTION: google-antigravity */}

Google Antigravity 2.0

Google Antigravity 2.0 is Google's agentic coding platform, announced at Google I/O 2026 (May 19) as the direct competitor to Cursor and GitHub Copilot's desktop workflows. It includes a desktop app, CLI tool (replacing the Gemini CLI), and an SDK for custom agent workflows — all powered by the new Gemini 3.5 Flash model.

May 2026 (Google I/O): Major platform launch featuring multi-agent orchestration in a desktop app, dynamic subagent workflows, scheduled background tasks, voice commands, and integrations across Google AI Studio, Android, and Firebase. Google is also using Antigravity's coding capabilities in consumer Search — generating real-time custom UI as part of search answers. The new Android CLI 1.0 provides a standardized interface that ANY AI agent (including GitHub Copilot, Claude Code, and OpenAI Codex) can use to access Android Studio capabilities — representing a "platform-as-tool" strategy where Google provides specialized tooling for the entire ecosystem.

Additionally, Google launched Gemini Spark — a 24/7 agentic personal assistant built on the Antigravity harness that runs on dedicated Google Cloud VMs. Spark integrates deeply with Google Workspace (Gmail, Docs, Sheets, Slides), has its own Gmail address, interacts with the web via Chrome, supports MCP for third-party integrations, and tracks agent progress on mobile via Android Halo.

Managed Agents API (May 19–20): Google also launched Managed Agents in the Gemini API — a single API call provisions an Antigravity agent in an isolated Linux sandbox (Ubuntu with Python 3.12 and Node.js 22). The agent can reason, execute code, manage files, browse the web, and use Google Search — all in an ephemeral sandboxed environment. Developers extend agents via AGENTS.md and SKILL.md markdown files, version them, and invoke by ID. VentureBeat's analysis notes this is the lowest-friction agent deployment any major platform has shipped — it collapses weeks of sandbox provisioning into one function call. Pricing is pay-as-you-go (100K–3M tokens per interaction at Gemini 3.5 Flash rates); environment compute is free during preview.

✅ Pros:

  • Full agentic desktop app with multi-agent orchestration and parallel task execution
  • CLI tool for terminal-first developers (replacing Gemini CLI)
  • Antigravity SDK for building custom agents on Google's platform
  • Native voice command support
  • Deep ecosystem integration — AI Studio, Android, Firebase, Google Cloud
  • Android CLI 1.0 provides unique mobile development tooling accessible to ANY agent
  • Gemini Spark extends the harness into personal productivity (Gmail, Workspace)
  • MCP support for third-party tool integrations

❌ Cons:

  • Heavy Google ecosystem lock-in (AI Studio, GCP, Workspace)
  • Pricing is premium — AI Ultra at $100/mo (5x limits) or $200/mo (20x limits)
  • Desktop app is new and still maturing vs established competitors
  • Gemini CLI users must migrate to the new Antigravity CLI
  • Spark is initially limited to AI Ultra subscribers (premium tier required)
  • Less developer tooling depth than GitHub Copilot's extensions + hooks governance system

🎯 Best for: Teams already deep in the Google ecosystem (GCP, Workspace, Android) who want a unified agentic development platform with strong mobile development tooling. The Android CLI is genuinely unique — no other platform provides standardized CLI access to Android Studio capabilities for AI agents.

Pricing (May 2026):

  • AI Ultra: $100/month (5x higher AI limits than Pro)
  • Top AI Ultra: $200/month (20x higher limits, reduced from $250)
  • Gemini Spark: included with AI Ultra subscription

{/* HARNESS_SECTION_END: google-antigravity */}


{/* HARNESS_SECTION: warp-oz */}

Warp Oz — Multi-Harness Control Plane

Warp Oz is a cloud agent orchestration platform from Warp, launched in February 2026 and significantly updated in May 2026. It's the first control plane that runs Claude Code, OpenAI Codex, and Warp Agent side by side — addressing the "multi-harness problem" that enterprises face when they don't want to commit to a single agent.

May 2026: Major update adds multi-harness support (run any combination of Claude Code, Codex, and Warp Agent through one interface), automatic multi-agent orchestration for parallel subagent coordination, cross-harness persistent memory (research preview), and expanded enterprise controls (per-team billing, individual credit caps, least-privilege permissions per agent).

✅ Pros:

  • Only platform running multiple agent harnesses (Claude Code, Codex, Warp Agent) side by side
  • Compare harness effectiveness and assign the right one per task — true harness-agnostic orchestration
  • Cross-harness persistent memory — agents build on organizational knowledge across sessions
  • Enterprise self-hosting: Kubernetes, Docker, or direct execution
  • Built-in orchestration layer with task lifecycle tracking (created → running → completed/failed)
  • First-party integrations (Slack, GitHub PRs, CI failures) trigger agent work automatically
  • REST API and TypeScript/Python SDKs for programmatic control
  • BYOLLM (Bring Your Own LLM) on Enterprise plan

❌ Cons:

  • Enterprise pricing required for self-hosted execution — annual contracts via sales
  • Cloud agent billing is non-deterministic (no per-run cost cap for individual users yet)
  • Newer platform — less battle-tested than standalone harnesses it orchestrates
  • Cross-harness memory is still in research preview
  • Adds an orchestration layer on top of existing harnesses — more infrastructure to manage
  • Limited to supported harnesses (Claude Code, Codex, Warp Agent currently)

🎯 Best for: Engineering teams deploying multiple coding agents at scale who need a single governance plane across harnesses. If you're already running Claude Code AND Codex and want consistent access controls, audit logs, and cost tracking across both — Oz is uniquely positioned as the orchestration layer above individual harnesses.

{/* HARNESS_SECTION_END: warp-oz */}


{/* HARNESS_SECTION: cursor */}

Cursor

Cursor is an AI-native code editor (VS Code fork) with a built-in agent mode that can autonomously plan, write, and test code within your project.

April 2, 2026: Cursor 3.0 shipped the Agents Window — a ground-up rebuild replacing the old Composer pane with a full-screen tiled workspace for parallel AI agent execution. Up to 8 agents run simultaneously in isolated git worktrees (local, SSH, or cloud), preventing file edit collisions. Commands like /worktree to create, /apply-worktree to merge, and /delete-worktree to clean up enable multi-branch workflows. This positions Cursor as a multi-agent orchestration surface rather than a single-agent editor.

May 21, 2026: Cursor released Composer 2.5 — scoring 62 on the Artificial Analysis Coding Agent Index, third place overall behind only Claude Opus 4.7 (max) in Claude Code (66) and GPT-5.5 (xhigh) in Codex (65), which cost $4.10 and $4.82 per task respectively. Composer 2.5 standard runs at $0.07/task — 10–60× cheaper than those top-two slots. A "Fast" variant at $0.44/task executes 30% faster. Built on Kimi K2.5 base with ~85% of compute from Cursor's own additional training. On SWE-Bench-Pro-Hard-AA the model scored 47%, matching Claude Opus 4.7 (max) at a fraction of the cost. Not available outside Cursor (no external API).

May 28, 2026: Cursor announced a built-in Canvas feature that integrates interface design directly inside the IDE via MagicPath integration. Cursor can now create and manage design files, reference open files and components, and collaborate on visual UI tasks without leaving the editor — positioning it as a potential competitor to Figma for in-IDE interface design workflows.

May 29, 2026: Cursor 3.6 shipped Auto-review — a new run mode designed to let the agent work longer with fewer interruptions. Tool calls now flow through a three-stage filter: allowlisted calls auto-run, sandboxable calls execute in isolation, and everything else gets routed to a classifier subagent that decides whether to allow, retry differently, or ask for approval. It's a meaningful safety/usability upgrade for long autonomous runs, even if Cursor still frames the classifier as convenience rather than a hard security boundary.

Also in late May 2026: Cursor shipped Thermos — a branch audit tool that runs deep security and harsh code quality reviews in parallel, then synthesizes the output into a single prioritized findings list. That's a notable step toward review-first agent workflows inside the IDE.

June 4, 2026 — Cursor SDK: custom tools, nested subagents, and auto-review: Cursor's TypeScript and Python SDKs gained major new capabilities for programmatic agent use: custom tools can now be passed as function definitions via local.customTools (exposed through a built-in MCP server so every subagent inherits them automatically), subagents can be nested to any depth, and auto-review gates tool calls before execution by default. This makes Cursor's local and cloud SDK agents significantly more capable for production scripts, CI pipelines, and custom integrations.

June 5, 2026 — Cursor 3.7: Design Mode in the browser: Cursor's 3.7 release ships Design Mode in the Cursor browser — a new interaction layer where developers can click, draw, or describe UI changes by voice directly over the rendered page. Agents receive the selected elements, their code, and the surrounding visual layout as context, enabling precise "make this match that" UI edits and group component adjustments. Voice input stays active while an agent is mid-run, so the next change can be queued before the current one finishes. This moves Cursor toward a visual-first agent experience where the browser becomes a design surface rather than just a preview pane.

✅ Pros:

  • Seamless agent-in-editor experience — no context switching
  • Strong codebase understanding via semantic indexing
  • Agent mode handles multi-step tasks (implement feature → write tests → debug)
  • Agents Window enables parallel multi-agent workflows in isolated worktrees
  • Active development with rapid feature iteration
  • Growing user base and community
  • Competitive free tier

❌ Cons:

  • Proprietary — limited extensibility beyond what Cursor provides
  • No governance hooks for enterprise policy enforcement
  • Agent is a black box — limited observability into decisions
  • Fork dependency on VS Code means extension compatibility lags
  • No CLI agent capability

🎯 Best for: Individual developers who want the smoothest AI-in-editor experience and are comfortable with a curated, opinionated tool. Less suitable for enterprises needing governance and policy control.

{/* HARNESS_SECTION_END: cursor */}


{/* HARNESS_SECTION: google-antigravity */}

Google Antigravity (formerly Windsurf / Codeium)

Google Antigravity is Google's agent-first development platform, born from the $2.4 billion acquisition of Codeium/Windsurf in mid-2025. Antigravity 2.0, launched at Google I/O 2026, is a standalone desktop application with native multi-agent orchestration — agents coordinate in parallel while you focus on the big picture.

✅ Pros:

  • Native multi-agent orchestration — run parallel agents (one codes, another generates assets)
  • Backed by Google's Gemini 3.5 Flash model with deep integration
  • Unified platform: desktop app + CLI + SDK in one experience
  • Antigravity CLI inherits and improves on Gemini CLI (migration guide available)
  • Strong codebase-wide context understanding (inherited from Windsurf's Cascade)
  • Enterprise deployment options via Google Cloud

❌ Cons:

  • Gemini-first — model choice exists but Gemini gets priority treatment
  • Gemini CLI shutdown (June 18, 2026) forces migration to Antigravity CLI
  • Still establishing governance/policy framework for enterprise
  • Ecosystem lock-in with Google services
  • Community still transitioning from Windsurf branding (further complicated by Cognition rebranding the Windsurf IDE as Devin Desktop in June 2026 — Google retained Codeium's AI technology while Cognition acquired the editor)
  • Multi-agent orchestration details still emerging

🎯 Best for: Developers wanting a Google-native agent-first IDE with multi-agent orchestration and deep Gemini integration. Teams already in the Google Cloud ecosystem get the most seamless experience.

{/* HARNESS_SECTION_END: google-antigravity */}


{/* HARNESS_SECTION: devin */}

Devin

Devin by Cognition is a fully autonomous AI software engineer that operates in its own cloud environment. It can plan, code, debug, and deploy with minimal human intervention.

June 3, 2026 — Devin Desktop Launch (Windsurf Rebrand): Cognition launched Devin Desktop — the rebranded Windsurf IDE, now positioned as an "Agent Command Center" for managing local and cloud AI agents from a single unified surface. Devin is now a four-surface platform: Devin Desktop (IDE + agent manager), Devin Cloud (autonomous agent), Devin CLI, and Devin Review. Desktop supports the Agent Client Protocol (ACP), enabling third-party agents (including Claude Code) to run alongside Devin's own agents. Existing Windsurf users received the update over-the-air — plans, settings, and extensions carry over. The strategic move: Cognition is shifting from "autonomous agent" to "agent platform" — owning the IDE surface where developers coordinate all their AI agents, not just Cognition's.

June 1, 2026: Cognition raised $1 billion at a $26 billion valuation — more than doubling its value in 8 months. CEO Scott Wu stated Devin now writes 89% of Cognition's internal code. The latest version is reportedly 4× faster and 2× more efficient than earlier releases. A new MultiDevin feature lets one AI agent coordinate several coding agents simultaneously — creating something closer to a small automated engineering team. Major enterprise customers reportedly include Goldman Sachs, Microsoft, Dell, Cisco, and Palantir.

✅ Pros:

  • Most autonomous agent — handles end-to-end tasks from plan to PR
  • Own cloud environment with full dev tools (browser, terminal, IDE)
  • Parallel Devins for concurrent work on multiple tasks
  • NEW: Devin Desktop (fmr. Windsurf) — full IDE + Agent Command Center for managing local/cloud agents
  • NEW: ACP support — third-party agents (Claude Code, etc.) run alongside Devin agents
  • Interactive planning for collaborative task scoping
  • Devin Search and Wiki for codebase exploration and documentation
  • Slack integration for conversational task delegation

❌ Cons:

🎯 Best for: Teams wanting a unified surface to manage multiple AI agents (local + cloud). Devin Desktop gives you an IDE with agent orchestration built in; Devin Cloud handles fully autonomous end-to-end tasks. The ACP support makes it a viable "control center" even if you use non-Cognition agents.

{/* HARNESS_SECTION_END: devin */}


{/* HARNESS_SECTION: grok-build */}

Grok Build (xAI)

Grok Build is xAI's entry into the coding agent space, launched in early beta on May 15, 2026. It uses natural language as an "agentic command line interface" for software engineering tasks — planning, reviewing, and implementing code changes.

May 28, 2026: Grok Build shipped v0.2.3 with a persistent memory system. The /remember command creates notes that survive across sessions — with rich side-by-side previews, fullscreen editing, and a # shortcut for quick access. This addresses one of the biggest gaps in early agent tools: context that doesn't vanish when a session ends.

May 28, 2026 — Grok Build 0.1 API + Kilo Code integration: xAI released Grok Build 0.1 as a public beta API — a specialized coding model at $1/M tokens in, $2/M tokens out, running at 100+ tokens/second. Available via OpenRouter and Vercel AI Gateway. Simultaneously, xAI shipped Grok integration into Kilo Code (May 27) — bringing Grok as an agent tool inside VS Code, JetBrains, and the terminal for SuperGrok/X Premium+ subscribers via the Model Context Protocol. This makes xAI the fourth vendor to ship an agent in existing developer surfaces, joining Claude Code, Codex, and Antigravity. Kilo Code remains open-source, and the integration removes the separate API key requirement for eligible subscribers.

The API also supports full Agent Client Protocol (ACP), meaning orchestration platforms can call Grok Build as a primitive — the same way they call Claude Code or Codex CLI — making it interoperable with multi-harness control planes like Warp Oz.

Kilo Code keeps expanding: Kilo says it has crossed 3M+ downloads / 40T+ tokens processed while stretching across VS Code, JetBrains, CLI, Cloud Agents, and Slack. Third-party reporting adds 1.5M+ users, BYOK with zero markup, KiloClaw at $49/month, and Teams at $15/user/month, plus a $45M Series B led by Andreessen Horowitz with Sequoia, Accel, and Microsoft M12, 3,200 active Slack workspaces, and a 78% commit-acceptance rate. The strategic distinction is Kilo's review-first posture: every agent run is supposed to end in a human-reviewable artifact instead of a blind auto-merge.

✅ Pros:

  • Natural language CLI approach — conversational coding workflow
  • Backed by xAI's Grok models with strong reasoning capabilities
  • Plan → review → implement workflow mirrors professional development practices
  • Persistent memory via /remember — session context that survives across sessions
  • Public API at competitive pricing ($1/$2 per M tokens) with 100+ tok/sec throughput
  • MCP integration into Kilo Code — agent runs natively in VS Code, JetBrains, and terminal
  • Early-mover advantage in xAI's growing ecosystem

❌ Cons:

  • Full CLI requires SuperGrok Heavy subscription ($300/month) with no standalone pricing
  • Early beta — feature set and reliability are still maturing
  • Model-locked to Grok/xAI — no bring-your-own-model flexibility
  • API launched but ecosystem is nascent — limited community tooling beyond Kilo Code
  • Late to market compared to established alternatives (Claude Code, Copilot CLI, Cursor)

🎯 Best for: Developers already invested in xAI's SuperGrok ecosystem who want a coding agent without switching platforms. The $300/month price point makes it hard to justify unless you're already paying for SuperGrok Heavy for other reasons.

{/* HARNESS_SECTION_END: grok-build */}


{/* HARNESS_SECTION: holo-hcompany */}

Holo3.1 / HCompany (Desktop Agent Harness)

Holo3.1 is a computer-use model family from HCompany, released June 1, 2026. Unlike coding agents that operate through terminals, Holo3.1 is designed to act directly on GUIs — clicking buttons, navigating applications, and filling forms that have no programmatic API. The model runs locally on consumer hardware or on NVIDIA DGX Spark with agent-harness optimizations cutting step time from 6.8s to 3.3s.

June 1, 2026: HCompany released Holo3.1 with open-source checkpoints (Q4 GGUF for local Mac/Windows deployment) and announced HoloDesktop — an open-source desktop agent harness designed to plug into existing coding agents as a sub-agent. When a task requires stepping out of the terminal and into a real application, your coding agent (Claude Code, Codex, Cursor) delegates to Holo. The result is a computer-use agent that runs privately on your machine or in the cloud via HCompany's Models API. NVIDIA collaborated on agent-harness optimizations delivering 2× end-to-end speedup on DGX Spark with NVFP4 quantization.

✅ Pros:

  • Computer-use agent that bridges the gap between terminal-only agents and GUI-based workflows
  • Open-source model checkpoints (Q4 GGUF) for fully private local deployment
  • Designed as a sub-agent — integrates with existing coding agent workflows rather than replacing them
  • NVIDIA DGX Spark optimizations for enterprise-grade throughput
  • Runs on consumer hardware (Apple Silicon benchmarks provided)

❌ Cons:

  • Pre-release (HoloDesktop "coming soon") — not yet available for production use
  • Limited to computer-use tasks — not a general coding agent
  • Single-vendor model (HCompany) — no bring-your-own-model
  • New entrant with limited community and documentation
  • Requires either local GPU resources or DGX Spark for optimal performance

🎯 Best for: Teams whose workflows include GUI-heavy tasks (testing, data entry, design tool interaction) that current terminal-only agents can't handle. Wait for HoloDesktop release before evaluating for production.

{/* HARNESS_SECTION_END: holo-hcompany */}


{/* HARNESS_SECTION: jetbrains-ai */}

JetBrains AI Assistant

JetBrains AI is integrated into IntelliJ, PyCharm, WebStorm, and the full JetBrains IDE family, with an agent mode called Junie for autonomous multi-step coding tasks.

✅ Pros:

  • Native integration in the full JetBrains IDE family
  • Junie agent mode for autonomous multi-step tasks
  • Leverages JetBrains' deep code analysis (inspections, refactoring, type inference)
  • On-prem inference options for sensitive environments
  • Multi-model support (OpenAI, Anthropic, Google, local models)
  • Bundled with JetBrains All Products Pack

❌ Cons:

  • JetBrains IDEs only — no VS Code, no CLI
  • Agent capabilities are newer and less mature than Cursor or Copilot
  • Limited extensibility for custom agent behaviors
  • No governance/hooks framework comparable to Copilot's hooks.json
  • Smaller AI-focused community compared to VS Code ecosystem

🎯 Best for: JetBrains users who don't want to switch editors but want AI agent capabilities. The deep IDE integration (inspections, refactoring) gives it advantages in languages where JetBrains excels (Java, Kotlin, Python).

{/* HARNESS_SECTION_END: jetbrains-ai */}


{/* HARNESS_SECTION: mastra */}

Mastra

Mastra is a TypeScript-first agent framework focused on observability and developer experience. It's designed for building multi-agent systems in Node.js applications with built-in visibility into agent behavior.

✅ Pros:

  • TypeScript-native — first-class experience for Node.js/Next.js teams
  • Built-in observability (metrics, logs, visualization of agent flows)
  • Explicit memory model — developers see how and when memory is read/written
  • Multi-agent message flows with clear debugging
  • Growing ecosystem with modern developer ergonomics

❌ Cons:

  • TypeScript/Node.js only — no Python, C#, or Java support
  • Newer and smaller community than LangChain or CrewAI
  • No built-in sandboxing or governance
  • Less battle-tested in production than established frameworks
  • Limited model provider integrations compared to LangChain

🎯 Best for: TypeScript teams building multi-agent applications who prioritize observability and debuggability. If your stack is Next.js/Node.js and you want to see exactly what your agents are doing, Mastra's visibility is a differentiator.

{/* HARNESS_SECTION_END: mastra */}


{/* HARNESS_SECTION: statewright */}

Statewright — State Machine Guardrails

Statewright is a Rust-based state machine engine that constrains which tools an AI agent can use in each phase of work. Instead of giving a model 40+ tools and hoping for the best, Statewright defines workflow phases (planning → implementing → testing) with per-state tool restrictions. The agent sees 5 tools instead of 30, reducing flailing and improving task completion.

Architecture: The core is a deterministic Rust engine — no LLM in the loop for enforcement. A plugin layer integrates with coding agents via MCP. When a workflow activates, hooks enforce tool restrictions per state. Supports Claude Code and Codex (hard enforcement via hooks), Cursor (advisory via MCP), plus opencode and Pi. Guardrails include per-state tool allowlists, bash command discernment (blocks echo > file, rm -rf, scripting), edit guards (max lines/files per state), conditional transitions, approval gates, and environment variable scoping.

Research results: In a 5-task SWE-bench subset, two local models went from 2/10 passing to 10/10 with Statewright constraints — same tasks, same hardware. The structural win on larger models is breaking read-loop death spirals and keeping the tool space focused.

Pricing: Free tier (3 workflows, 200 transitions/mo) → $29/mo Pro → $99/mo Team → Enterprise. Self-hostable via Docker Compose (Apache 2.0 engine, FSL-1.1-ALv2 gateway converting to Apache 2.0 in 2029).

Source: GitHub — statewright/statewright (373 stars, v1.0, May 2026)

{/* HARNESS_SECTION_END: statewright */}


The Governance Gap

{/* GOVERNANCE_SECTION_START */}

The Governance Gap — contrasting hope-based control with architectural governance showing policy gates and audit trails
What separates a production harness from a prototyping tool — architectural control vs hoping agents behave.

Here's what surprised me most when building this comparison: most agent platforms have no governance story at all. Cursor, Windsurf, CrewAI, Devin — they all have "user clicks approve" and that's it. There's no programmatic policy layer, no pre-tool-call interception, no audit trail that an enterprise compliance team would accept.

Only three platforms offer real governance primitives:

  1. GitHub Copilothooks.json with pre/post tool call interception + extension allowlists + org-level policies
  2. Amazon Bedrock Agents — IAM + CloudTrail + service control policies + VPC endpoints
  3. Google Vertex AI Agent Builder — IAM + Cloud Audit Logs + VPC Service Controls

Emerging governance entrants (May 2026):

  • Microsoft Agent Governance Toolkit (AGT) — Microsoft released a public preview of the open-source AGT (May 28) — a runtime policy engine that evaluates agent actions against declarative policies before execution. AGT supports OWASP Agent Security standards, prevents unwanted operations, reduces token/logging risk, and works with any agent harness. This is the first vendor-neutral, open-source governance toolkit specifically designed for AI agents — filling a critical gap between harness-specific governance (Copilot hooks, Bedrock IAM) and no governance at all.
  • Warp Ozmulti-harness control plane with consistent access controls, audit logs, per-team billing, credit caps, and least-privilege permissions applied uniformly across Claude Code, Codex, and Warp Agent. Self-hosted in K8s or Docker for data sovereignty.
  • Kore.ai ArtemisAI-native agent platform (May 21) with Agent Blueprint Language (ABL) — a compiled, declarative language that standardizes how agents are defined, validated, and governed. Six built-in orchestration patterns (supervisor, delegation, handoff, fan-out, escalation, agent-to-agent federation) plus a Dual-Brain Architecture combining agentic reasoning and deterministic flows in parallel. Every decision is logged, traced, and analyzed in real-time. Launches on Microsoft Azure with broader cloud support forthcoming.
  • Versa Zero Trust MCPzero-trust architecture for Model Context Protocol that validates every AI agent action before execution. Human-in-the-loop governance via Versa Verbo. Available now with VersaONE Universal SASE Platform Release 23.1.1.
  • Neurapre-action governance layer that converts agent actions into Action Cards, routes them through a Relay for approval, and returns a Decision Receipt with trace and ledger context before execution.
  • Redis Context Engine — while primarily a memory layer, Redis's new Context Retriever uses the Model Context Protocol to auto-generate structured tools that give agents semantic access to business data.

The frameworks (LangChain, AutoGen, etc.) give you hooks to build governance, but you're writing that layer yourself. That's fine for startups but a non-starter for regulated enterprises. If governance is a requirement — and in 2026, it should be — your shortlist gets very short very fast.

I wrote about this gap in depth in my three layers your AI agent is missing article, and built @htekdev/agent-harness specifically to address it.

{/* GOVERNANCE_SECTION_END */}


How to Choose

{/* DECISION_FRAMEWORK_START */}

How to Choose Your Agent Platform — decision flowchart matching situations to the right platform category
Match your situation to the right tool category — start with what you're building, not which platform is "best."

Don't start with "which platform is best?" Start with "what am I building?"

If you're building... Start here Why
A custom AI application (chatbot, RAG app, copilot) LangChain/LangGraph or Semantic Kernel Maximum flexibility and model portability
AI coding assistance in your editor GitHub Copilot Broadest IDE + CLI + cloud coverage with governance
A quick AI coding setup, single-editor focus Cursor Most polished single-editor experience
Managed, governed agents on AWS Amazon Bedrock Agents Enterprise governance out of the box
Managed, governed agents on GCP Vertex AI Agent Builder Enterprise governance out of the box
A CLI-first agentic coding workflow Copilot CLI or Claude Code Extensions/hooks vs MCP extensibility
Multi-agent prototypes with roles CrewAI Fastest time-to-prototype for role-based systems
Multi-agent conversational systems AutoGen Rich debate/critique/collaborate patterns
Multi-agent graph-based orchestration LangGraph Best-in-class for stateful graph workflows
Full autonomous task delegation Devin Highest autonomy level (with supervision)
Internal copilots on Microsoft stack Semantic Kernel Native .NET/Azure/M365 integration
TypeScript-first agent apps Mastra Best observability for Node.js agents
Minimal multi-agent SDK OpenAI Agents SDK Production-grade harness/sandbox with 7 providers — strongest batteries-included SDK
Orchestrating multiple harnesses at scale Warp Oz Only multi-harness control plane with unified governance
One-call managed agent deployment on Google Google Managed Agents API Lowest-friction agent deployment (sandbox + tools in one call)
Kubernetes-native agent execution at scale GKE Agent Sandbox Sub-200ms provisioning, gVisor isolation, millions of agents
On-device browser agents with small models Microsoft MagenticLite Purpose-built SLM harness (4B–27B) with sandboxed browser execution
Ultra-long autonomous agent runs (hours/days) Alibaba Qwen3.7-Max (model) + any harness 35-hour continuous autonomous execution with 1,000+ tool calls
Enterprise multi-agent governance with declarative blueprints Kore.ai Artemis ABL compiled language, 6 orchestration patterns, AI-native lifecycle

{/* DECISION_FRAMEWORK_END */}


Where Copilot Stands — Honest Assessment

{/* COPILOT_ASSESSMENT_START */}

I use Copilot every day — it runs 50+ agents managing my home, my content pipeline, and my development workflow. So let me be direct about where it leads and where it doesn't.

Where Copilot genuinely leads:

  • Ecosystem breadth — Copilot now spans IDE (all major editors), CLI, cloud agent, dedicated desktop app, and API. The May 2026 Copilot App adds a fifth surface — a standalone agentic desktop — extending GitHub's multi-surface workflow story
  • Governance — hooks.json is unique. No other IDE agent gives you programmatic pre/post tool-call interception. For enterprises, this is a dealbreaker in Copilot's favor.
  • Extensions — the ability to turn any service into an agent tool via the extensions API is unique among IDE agents. Cursor and Windsurf are closed ecosystems.
  • Enterprise trust — IP indemnity, content exclusions, SSO, audit logs, org-level policy. GitHub spent years earning enterprise trust, and it shows.
  • GitHub integration — Issues → cloud agent → PR → Actions → deploy. The full software lifecycle, automated.

Where others have edges:

  • Claude Code's MCP protocol is more open and portable than Copilot's extensions API. MCP works across vendors; Copilot extensions are GitHub-specific.
  • Cursor's in-editor UX is more polished for pure coding tasks. The diff/apply flow feels snappier.
  • LangGraph's orchestration is more flexible than Copilot CLI's multi-agent patterns for complex stateful workflows.
  • Bedrock and Vertex offer stronger cloud-native governance for non-GitHub-centric enterprises.
  • Devin's autonomy level exceeds what any IDE agent currently attempts.

This isn't a contest where one tool wins everything. It's a landscape where your constraints determine the right choice.

{/* COPILOT_ASSESSMENT_END */}


{/* HARNESS_SECTION: notable-new-may-2026 */}

Notable New Entrants — May 2026

Microsoft Webwright — A terminal-native web agent framework from Microsoft Research (open-sourced May 24, 2026). Instead of predicting one browser action at a time, Webwright agents write and run Playwright code in an iterative loop (~1,000 lines of harness code across 3 modules: Runner, Model Endpoint, Environment). Scored 60.1% on the Odysseys long-horizon browsing benchmark (up from base GPT-5.4's 33.5%) and 86.7% on Online-Mind2Web. Scripts are reusable as CLI tools. Supports OpenAI, Anthropic, and OpenRouter backends. (Source)

NVIDIA AI-Q — An open-source deep research skill (May 20, 2026) designed to plug INTO existing agent harnesses — Claude Code, Codex, and LangChain Deep Agents. Instead of each harness rebuilding retrieval and synthesis logic, AI-Q provides a SKILL.md + helper script pattern: delegate a research question to a running AI-Q server, get back a structured, citation-backed report. Secure MCP integration connects to authenticated enterprise data sources. Built on NVIDIA's NeMo Agent Toolkit; deployable via Docker Compose or Helm on developer machines, on-prem clusters, or data centers. Dell AI Factory validation for regulated industries. The significance: agent harnesses are becoming composable — standardized skill interfaces mean capabilities can be shared across harnesses without reimplementation.

Google Agent Sandbox on GKE (GA) + Agent Substrate — Google Cloud's secure, cloud-native execution environment for AI agent workloads on Kubernetes reached general availability (May 20, 2026). Customers including LangChain and Lovable are deploying millions of agents in production on the platform. Alongside the GA announcement, Google open-sourced Agent Substrate — a lightweight control plane that enables sub-second agent startup by pre-provisioning ready compute capacity and moving agents onto/off it in real time. Agent Substrate builds on the Agent Sandbox runtime and targets ultra-scale agent density for orchestrators that need to spawn thousands of concurrent agents. This is infrastructure-as-commodity: the execution layer for agents is becoming as standardized as container runtimes.

xAI Grok Build — xAI's new AI coding agent launched May 24, 2026, competing directly with Claude Code and Codex. Runs on Grok 4.3 (~500B parameters) with parallel agent support for multi-task workflows. Pricing starts at $99/month (introductory) scaling to $300/month — positioning it as a premium enterprise tool rather than a developer mass-market play. Connected to a partnership with Cursor (SpaceX reportedly holds options valuing Cursor at ~$60B). Grok V9 (~1.5T parameters) with major coding upgrades reportedly arriving within weeks. (Source)

AWS MCP Server GA — AWS's managed MCP server reached General Availability in May 2026, now part of the Agent Toolkit for AWS. Provides AI coding agents with full AWS API coverage, IAM-based governance (CloudTrail logging, CloudWatch metrics), sandboxed Python execution for multi-step tasks, and up-to-date documentation access. Free to use (pay only for consumed resources). Works with Claude Code, Kiro, Cursor, and Codex via MCP protocol. Currently available in us-east-1 and eu-central-1. (Source)

{/* HARNESS_SECTION_END: notable-new-may-2026 */}


{/* HARNESS_SECTION: notable-new-late-may-2026 */}

Notable New Entrants — Late May 2026

Google AX (Agent eXecutor) — Google open-sourced AX v0.1.0 (May 28, 2026), a Go-based runtime layer for long-running AI agents. AX solves the "4-hour crash" problem: when an orchestration process dies, agents die with it. AX provides kernel-style durable execution with sub-second agent startup and automatic recovery. The architecture separates the agent's logic from its lifecycle — even if the host process crashes, AX preserves agent state and resumes from the last checkpoint. Works with any LLM provider. Apache 2.0 licensed. This complements Google's Agent Sandbox (execution isolation) with a missing piece: execution durability. (Source)

Microsoft Agent Governance Toolkit (AGT) — Microsoft released AGT in public preview (May 28, 2026), an open-source runtime policy engine for AI agents. AGT evaluates agent actions against declarative policies before execution — preventing unwanted operations without modifying agent code. Supports OWASP Agent Security standards, plugs into any agent harness (not just Microsoft products), and targets the governance gap this article has highlighted. This is significant: governance is becoming a separate, composable layer rather than something each harness must build from scratch. (Source)

Pydantic AI Harness — The Pydantic team launched an official capability library (May 28, 2026) for Pydantic AI agents. It provides standalone building blocks — tools, hooks, instructions, and model settings — to compose agents from reusable capability modules. Each module is independently testable and version-controllable. The significance: this is the first framework to separate agent capabilities from agent definitions, enabling a marketplace-style approach to agent composition.

{/* HARNESS_SECTION_END: notable-new-late-may-2026 */}


{/* HARNESS_SECTION: notable-new-may-30-2026 */}

Notable Developments — May 30, 2026

Replit + Visa Trusted Agent Protocol — Replit announced a strategic partnership with Visa (May 30, 2026) to embed native payment infrastructure and a cryptographic identity layer directly into AI agent workflows. The Visa Trusted Agent Protocol enables agents to undergo onboarding and certification for real-time identity verification and safer machine-to-machine (M2M) transactions — with guardrails including user consent, authentication, spending controls, and defined transaction limits. Over 1,000 Visa employees already use Replit for prototyping. Replit also launched self-serve Enterprise access (contracts up to $200k) with enhanced governance (SSO, SCIM, RBAC, audit logs, SOC-2) and a new Solution Partner Program with Accenture, Slalom, and Hexaware. The significance: agent commerce is becoming a first-class infrastructure layer, not an afterthought. The Trusted Agent Protocol introduces a model where agents earn cryptographic credentials to transact autonomously — a pattern other platforms will likely adopt.

Hexo Labs SIA — Self-Improving AI — Hexo Labs open-sourced SIA (Self-Improving AI) under MIT license (May 28, 2026). SIA introduces a dual-lever self-improvement loop: after each run, a Feedback-Agent can rewrite the scaffold (harness) OR trigger a LoRA weight update, or both. Architecture splits into three LLM-driven roles — Meta-Agent (initial scaffold), Task-Specific Agent (execution + logging), and Feedback-Agent (evaluation + change decisions). Claims 350× acceleration over baselines on OpenAI's MLE-Bench. First known framework to edit both scaffold and model weights in a single improvement loop. (Source)

{/* HARNESS_SECTION_END: notable-new-may-30-2026 */}


{/* HARNESS_SECTION: notable-new-may-31-2026 */}

Notable Developments — May 31, 2026

GitHub Copilot AI Credits go live June 1 — Starting tomorrow, GitHub switches Copilot to token-priced AI Credits while keeping the familiar seat tiers in place. One AI Credit equals $0.01, code completions and Next Edit Suggestions stay unlimited, Business gets promotional $30/user/month credits for three months and Enterprise gets $70/user/month, and admins can manage spend at the enterprise, cost-center, and user level. The significance: Copilot is turning pricing into a governable enterprise feature — more transparent, more flexible, and easier to budget than pretending long-running agents can stay flat-rate forever.

A2A Protocol v1.2 reaches real production scale — A2A is now running at 150+ organizations in production, not just pilots, with cryptographically signed agent cards for domain verification under the Linux Foundation's Agentic AI Foundation. Microsoft, AWS, Salesforce, SAP, and ServiceNow are already running it. The key takeaway is architectural clarity: MCP handles tool/data access while A2A handles cross-agent coordination.

Lovable subagents — Lovable's primary build agent can now spawn parallel Researcher, Reviewer, and Synthesizer subagents, each with its own activity-log thread for traceability. This is a direct answer to the fix-loop problem in vibe coding: research the existing codebase before patching, review the diff before it lands, and keep auxiliary summarization work out of the main agent's context window.

Kilo Code keeps compounding — Kilo reports 3M+ downloads and 40T+ processed tokens across VS Code, JetBrains, CLI, Cloud Agents, and Slack. Supplemental coverage adds 1.5M+ users plus Teams and KiloClaw pricing and a $45M Series B with 3,200 active Slack workspaces and a 78% commit-acceptance rate. The bigger point: open-source, review-first coding agents are no longer side projects — they're becoming serious multi-surface platforms.

MCP adoption keeps accelerating — Model Context Protocol now sits at 97M monthly SDK downloads, 9,400+ published servers across registries, and 78% of enterprise AI teams reporting at least one MCP-backed agent in production. Additional coverage reinforces the same pattern: MCP is no longer experimental glue code, it's baseline infrastructure.

Replit Canvas expands into a multimodal design workspace — Replit added GPT-Image 2 image generation, Seedance video generation, animated SVGs, and multi-edit controls for layout and typography inside Canvas. The significance: design surfaces are becoming first-class agent workspaces instead of chat sidecars.

{/* HARNESS_SECTION_END: notable-new-may-31-2026 */}


{/* HARNESS_SECTION: notable-new-june-1-2026 */}

Notable Developments — June 1, 2026

NVIDIA NemoClaw Agent Framework — NVIDIA launched NemoClaw at GTC Taipei 2026, an open-source agent framework with templates for planning, reasoning, execution, and delegation. Part of the NVIDIA Agent Toolkit, NemoClaw connects with popular harnesses (LangChain, CrewAI, Semantic Kernel) and pairs with the OpenShell secure runtime for containerized, policy-driven agent governance. CUDA-X libraries are exposed as agent skills. Built for enterprise-scale autonomous agents that act as "digital coworkers." Nemotron 3 Ultra (550B parameters) ships alongside as the recommended model for long-running agent workloads. Enterprise partners include SAP, ServiceNow, Accenture, and Dell. The significance: NVIDIA is entering the agent orchestration layer, not just providing models and GPUs — NemoClaw is a full framework that competes with LangGraph and CrewAI while leveraging NVIDIA's hardware ecosystem for on-device execution via RTX Spark. (Source)

GitHub Copilot AI Credits are now live — The usage-based billing model officially launched June 1. One AI Credit = $0.01. Code completions and Next Edit Suggestions remain unlimited. Business seats get promotional $30/user/month credits (3-month intro), Enterprise gets $70/user/month. Admins have granular spend controls at enterprise, cost-center, and user levels. The transition preserves Copilot's position as the most enterprise-governable AI coding platform — usage transparency is a feature, not a limitation.

Cursor Cloud Agents + Jira Integration — Cursor's Cloud Agents can now be triggered directly from Jira tickets, using the work item title, description, comments, and repository settings to scope the task, then posting completion updates and PR links back to Jira. The integration supports Atlassian MCP authentication for full bidirectional read/write access (read issues, edit descriptions, create linked tickets). Early community feedback shows the auth propagation still maturing, but the workflow pattern is compelling: ticket becomes scoped input → agent executes → PR + Jira update as output. Teams using .cursor/rules/*.mdc for per-area governance report cleaner diffs and shorter reviews.

MiniMax M3 — MiniMax released M3, a frontier model with a one-million-token context window and native multimodal input (image/video), specifically designed for coding agents and long-running automation workflows. Paired with MiniMax Code for multi-stage producer-verifier pipelines. Aimed at developers building complex agent workflows that need extended context over many files. The significance: Chinese AI labs are now building models specifically optimized for agent harnesses — not just general chat — increasing the model options available to BYOM frameworks like LangGraph and CrewAI.

OpenCode — Open-Source Terminal Agent — OpenCode is an emerging open-source, model-agnostic terminal coding agent that competes directly with Claude Code. Supports Claude, GPT, Gemini, and local models. Features include a polished TUI (terminal UI), multi-file editing, and the flexibility to switch between any LLM provider. Key advantage over Claude Code: no vendor lock-in. Key disadvantage: less polished instruction-following and speed compared to Claude's managed experience. Positioned for developers who want Claude Code's workflow without Anthropic dependency.

Kore.ai Artemis Edition — Kore.ai launched Artemis, a new-generation Agent Platform for building, governing, and operating enterprise multi-agent AI systems. Features Agent Blueprint Language (ABL) for declarative agent definition, built-in governance and observability, and production-grade architecture for multi-agent orchestration at enterprise scale. Targets regulated industries needing auditable agent behavior. The significance: enterprise agent governance platforms are multiplying — joining Microsoft's AGT and NVIDIA's OpenShell in the "agent control plane" space.

JetBrains Mellum2 — Open-Source 12B MoE Agent Model — JetBrains open-sourced Mellum2, a 12B-parameter Mixture-of-Experts model where only 2.5B parameters are active per token — delivering 2× faster inference than comparable models while remaining competitive on code generation, reasoning, and math benchmarks. Released under Apache 2.0, Mellum2 is purpose-built for the intermediate steps in agent workflows: routing, RAG, summarization, sub-agent orchestration, and private deployments. Available on Hugging Face with a full technical report. The significance: open-source models purpose-built for agent infrastructure are arriving — not to replace frontier models at the outer loop, but to handle the high-throughput, latency-sensitive inner operations (routing, validation, retrieval) that make agents affordable at scale. GitHub Copilot and other harnesses that support bring-your-own-model can leverage Mellum2 for these efficiency-critical sub-tasks.

SkipLabs Skipper — Closed-Loop Autonomous Coding Agent — SkipLabs launched Skipper, a closed-loop coding agent that takes a single prompt and returns a running, validated service — with zero developer review in the loop. Created by Julien Verlaguet (creator of Facebook's Hack programming language) alongside engineers from Facebook, Microsoft, Microsoft Research, and Meta, Skipper positions itself as the architectural substrate beneath foundation models. Rather than competing with Claude, GPT, or Gemini, it routes tasks to the best-suited model, autonomously decomposes work, generates and validates code, and delivers production-ready software. SkipLabs argues that "building correct software has always been an architecture problem disguised as a coding problem" — AI didn't change that, it made it more urgent. The significance: a new category of "developer-optional" agent is emerging — Skipper represents the logical extreme of autonomous coding where the human provides intent and the agent handles everything through to production deployment. Whether this proves viable at enterprise scale or is limited to greenfield services remains to be validated.

{/* HARNESS_SECTION_END: notable-new-june-1-2026 */}


{/* HARNESS_SECTION: notable-new-june-2-2026 */}

Notable Developments — June 2, 2026

GitHub Copilot Max + budget controls GA — GitHub launched Copilot Max as an upgrade path for existing Student, Pro, and Pro+ subscribers, with the highest included AI Credits usage and spending limits for power users. User-level budget controls are now GA for organizations and enterprises, and Copilot code review consumes GitHub Actions minutes. The significance: Copilot's pricing and governance model is becoming a complete enterprise control surface, not just a seat license.

LangGraph 1.2.3 ships v3 streaming — LangGraph's June 1 release adds v3 streaming support to RemoteGraph, WebSocket transports in the SDK, named tool-dispatched subagents via lc_agent_name, and multiplexed message/tool projections through interleave_projections. The significance: LangGraph is maturing from a graph orchestration library into a more observable, production-grade runtime for distributed multi-agent systems.

CrewAI 1.14.6 stabilizes ACP Beta — CrewAI promoted 1.14.6 from pre-release to stable with Agent Control Plane (ACP) Beta, a managed orchestration layer for multi-crew coordination. The release also hardens checkpoint restore, improves StdioTransport security, and moves the Skills Repository behind CREWAI_EXPERIMENTAL. The significance: CrewAI is shifting from lightweight role-play orchestration toward a managed control plane with stronger runtime hygiene.

Mistral Vibe — Le Chat becomes a unified Work + Code agent platform — Mistral rebranded Le Chat to Vibe on May 28, 2026, splitting the product into Vibe for Work and Vibe for Code. Work mode connects to Google Workspace, Outlook, SharePoint, Slack, and GitHub to scan inboxes, pull spreadsheet data, build reports, and route outputs into systems like Notion or SharePoint, with users reviewing the task plan before execution per The Decoder's coverage. Code mode runs agents in isolated cloud sandboxes via the code.mistral.ai web app, a new VS Code extension, and the CLI, where /teleport moves live sessions between local and cloud; jobs can run in parallel, survive a closed laptop, fix bugs, and open pull requests automatically, with Slack-launched jobs planned for June. The platform is powered by Mistral Medium 3.5 and priced at Free, Pro (€14.99/month), Team (€24.99/user/month or €19.99 annual), and Enterprise custom, with students getting 50% off Pro. The significance: Mistral has now entered the agentic IDE/workspace category — not as a narrow code assistant, but as a unified Work+Code platform where the same agent shares connectors, context, and identity across productivity and software tasks.

Google Antigravity 2.0's architecture is now much clearer — Follow-up reporting from ByteIota and AwesomeAgents confirms Antigravity 2.0 is really five products sharing one runtime: a desktop app, the Go-based agy CLI, the Python google.antigravity SDK, a Managed Agents API with serverless per-run billing through Gemini API, and an enterprise platform with SLAs. Google is also pairing MCP for tool access with native A2A support for agent-to-agent delegation across 150+ organizations, plus a built-in browser agent for UI testing and visual regressions, native voice control, and a multi-model story optimized for Gemini 3.5 Flash but extending to Claude Sonnet 4.5 and GPT-OSS. The caveat matters: terminal sandboxing currently relies on Apple Seatbelt on macOS only, leaving Linux and Windows without the same guardrail. The significance: Google's story is no longer "desktop coding app" — it's a full multi-surface agent platform with a much clearer split between local tooling, hosted agents, and enterprise deployment.

Canonical turns NVIDIA OpenShell into a one-command Ubuntu install — At COMPUTEX on June 1, Canonical announced an openshell snap for Ubuntu: sudo snap install openshell, then openshell sandbox create. The package runs each agent inside an isolated sandbox with corporate policy enforcement, and Canonical says NVIDIA is working with Microsoft on the Windows agent experience while Red Hat is also integrating OpenShell. The significance: secure agent runtimes are getting distro-level distribution — OpenShell is moving from niche runtime to standard install path for enterprise Linux fleets.

Microsoft Foundry Agent Service at Build 2026 — Microsoft Foundry announced a comprehensive agent framework and hosting service for scalable AI agents. The Agent Framework now supports skills, memory, and middleware with first-class integration into GitHub Copilot SDK and Claude Agent SDK — meaning developers can build agents that leverage multiple harnesses while deploying through Foundry's managed infrastructure. Toolboxes (public preview) provide a single managed endpoint for all tool types with auto-auth, lifecycle management, and governance. Skills are cataloged, project-scoped, and discoverable as MCP resources. Tracing and evaluation enter GA in late June 2026, enabling end-to-end production tracing, regression scoring, and actionable improvements via the Foundry Control Plane. The positioning is strategic: the Agent Framework acts as a flex point, not lock-in — investments in LangGraph, Copilot SDK, or Claude Agent SDK carry forward.

Agent Control Standard (ACS) — Open Runtime Governance — Also at Build 2026, Microsoft announced ASSERT (Adaptive Spec-driven Scoring for Evaluation and Regression Testing) and the Agent Control Standard — a portable runtime control specification placing deterministic safety and security controls at five lifecycle checkpoints (input, LLM, state, tool execution, output). Policies are defined in standard YAML, enabling portability, versioning, and auditing across any framework. The ACS is open-source under MIT license and designed for broad ecosystem adoption. The significance: enterprise agent governance is getting a cross-framework standard — write your control policies once, enforce them across LangGraph, Copilot, Foundry, or custom harnesses.

OpenAI Codex — Sites, Annotations, and Role-Specific Plugins — OpenAI updated Codex with enterprise-focused agent workspaces. Sites enable rapid, hosted web workspaces for interactive, live-updating enterprise apps built by agents. Annotations provide in-place, localized context editing. Six Role-Specific Plugin bundles cover 62 apps (Snowflake, Figma, Salesforce, etc.) and 110 automated skills, tying Codex to common business workflows. The move positions Codex not just as a developer tool but as a general operating environment for white-collar knowledge work. Available in preview for Business/Enterprise via CLI and desktop app.

Claude Code Dynamic Workflows — Parallel Agent Coordination — Anthropic shipped Dynamic Workflows in research preview for Claude Code — a capability that dynamically creates and manages orchestration workflows across many AI subagents. It breaks complex tasks into subtasks, runs them in parallel, validates results, and iterates until convergence. Use cases include wide-scale bug investigations, large migrations, security audits, and architectural analyses. Available on Max/Team/eligible Enterprise plans, via the Claude API, and through partner platforms (Amazon Bedrock, Google Vertex AI, Microsoft Foundry). Higher token usage than typical Claude Code runs. The significance: Anthropic is shifting from single-agent optimization to multi-agent coordination — a direct challenge to LangGraph and CrewAI's orchestration positioning.

LangGraph 1.2.4 + LangChain 1.3.4 maintenance patches — LangGraph 1.2.4 (June 2) adds factory-graph integration tests and a backward-compat fix for _on_started overrides. LangChain 1.3.4 (June 2) improves HITL (human-in-the-loop) rejection guidance. Both are maintenance releases — no new features, confirming the ecosystem is in a stabilization phase after the v3 streaming additions in 1.2.3.

LangChain publishes "Model Neutrality" positioning (June 4, 2026) — LangChain VP Neil Dahlke made the formal case for open, model-neutral harnesses as a structural answer to lab-owned orchestration lock-in, drawing a direct parallel to how Terraform responded to cloud lock-in. The core argument: model labs are racing to capture the orchestration layer because token differentiation is eroding, and business logic trapped in a lab's harness stays captive to their token pricing. The response: an open, multi-model, profile-aware harness — LangChain's explicit positioning. This matters for the comparison because it reframes LangChain/LangGraph not just as a feature choice but as a strategic answer to a governance problem. Teams evaluating lab-native harnesses (OpenAI Agents SDK, Claude Code SDK, Vertex Agent Builder) should read this framing.

Microsoft Agent Framework consolidates AutoGen + Semantic Kernel — Confirmed across multiple sources: Microsoft merged AutoGen and Semantic Kernel into the Microsoft Agent Framework v1.0 (GA April 2026), intended as the default for .NET and Azure-native teams. AutoGen is moving to maintenance mode for new projects, with its community continuation (AG2) gaining streaming and event-driven features but without formal commercial support. The framework consolidates multi-agent abstractions with enterprise tooling. Teams choosing between frameworks in 2026: .NET/Azure → Microsoft Agent Framework; GPT-centric → OpenAI Agents SDK; stateful workflows → LangGraph; rapid prototyping → CrewAI.

Microsoft Execution Containers (MXC) — OS-Level Agent Sandbox — At Build 2026, Microsoft introduced MXC — a policy-driven execution layer built into Windows and WSL that lets developers and IT administrators declare exactly what an AI agent can and cannot access, with boundaries enforced at runtime by the OS kernel. MXC is not a product — it's an SDK and policy model providing a "composable sandbox spectrum" from lightweight process isolation (already used by GitHub Copilot CLI) up to micro-VMs, Linux containers, and full cloud instances on Windows 365. OpenAI and NVIDIA are already on board. It integrates with Agent 365, Entra, and Intune for enterprise-grade identity, containment, and auditability. The significance: agent sandboxing is becoming an OS primitive, not an app feature — MXC lets any agent framework (LangGraph, Copilot, Claude Code) inherit enterprise-grade containment without reimplementing it.

Copilot Code Review: Agent Skills, MCP Support + Medium Analysis Tier — Two public previews shipped for Copilot code review: (1) Agent skills and MCP support that bring your organization's tools and standards into every review — custom skills invoke internal tools during analysis, MCP server connections pull context from issue trackers, documentation, service catalogs, and incident tooling. (2) A new Medium analysis tier that routes complex PRs to a higher-reasoning model for deeper analysis of security-sensitive code and cross-service changes (Low remains default for straightforward work). Platform teams configure once and get consistent behavior across both code review and the cloud coding agent. Separately, Copilot code review for Azure Repos entered technical preview — bringing on-demand PR reviews directly into Azure DevOps with no GitHub Copilot license required (billed via AI Credits). The significance: Copilot's code review is evolving from AI annotation into a context-aware agent — pulling from your entire organizational context, scaling reasoning to match complexity, and expanding beyond GitHub into Azure DevOps.

Copilot CLI at Build 2026: Rubber Duck GA, Voice GA, and New Terminal UI — GitHub Copilot CLI's largest UX overhaul yet. The Rubber Duck agent is now generally available — a conversational thinking-partner that helps developers work through architectural decisions, debugging puzzles, and complex problems without triggering any code changes (named after the classic rubber duck debugging technique). Voice input is also GA — narrate your session and work hands-free. A redesigned terminal interface with tabs for Issues, Pull Requests, and Gists is available via /experimental, alongside theme-aware semantic colors and responsive layouts that adapt to narrow terminals. Prompt scheduling (/experimental) lets you queue tasks to run later. The significance: Copilot CLI is expanding beyond code execution into a complete developer workflow surface — thinking, talking, and scheduling, not just running agents.

Copilot in JetBrains: Agent Picker, Slash Commands, and Agent Debug Panel — The June 2 JetBrains update delivers multiple new Copilot capabilities for IntelliJ IDEA and related IDEs: Agent picker support lets developers select which agent handles their session; new slash commands expand in-session control; and an agent debug panel (public preview) provides visibility into what the agent is doing during a session — a major step toward the observability gap that has limited enterprise adoption of IDE agents. This update also marks the beginning of a phased transition from legacy Copilot mode to Copilot CLI agent as the default in JetBrains IDEs. The significance: JetBrains is becoming a first-class Copilot surface — not just an extension port, but a fully capable agent environment with its own debug tooling.

GitHub Copilot in Eclipse: BYOK, Skills, and Chat Refresh — A major Eclipse plugin update ships at Build 2026: a refreshed chat view with a new combo picker for chat mode and model selection; Bring Your Own Key (BYOK) for Business and Enterprise plans; skills and prompt file support — matching VS Code behavior so enterprise teams can distribute standardized agent tooling across their full IDE fleet (Eclipse included); improved ABAP support for SAP developers; and better context visibility into what's in the agent's session window. The significance: Eclipse joins VS Code as an enterprise-managed Copilot surface — BYOK and skills distribution close the gap that previously made Eclipse a second-class citizen in enterprise Copilot deployments.

{/* HARNESS_SECTION_END: notable-new-june-2-2026 */}


{/* HARNESS_SECTION: notable-new-june-3-2026 */}

Notable Developments — June 3, 2026

Hermes Desktop — Open-Source Agent Gets a Native GUI — Nous Research released Hermes Desktop in public preview — a native cross-platform GUI (macOS, Windows, Linux) for Hermes Agent v0.15.2, released under MIT license. The desktop app shares the same agent core, configuration, API keys, sessions, skills, and memory as the CLI and gateway (not a fork). Key capabilities: streaming responses with live tool activity, a right-hand preview pane for web pages/files/tool outputs, file browser, voice I/O, and cross-surface session continuity (start a conversation in Desktop, resume in CLI, or vice versa). Hermes supports sub-agent delegation with individual terminals and Python scripts, five sandbox backends (local, Docker, SSH, Singularity, Modal), and 300+ models via the Nous Portal. Multi-platform messaging integration spans Telegram, Discord, Slack, WhatsApp, Signal, and email. The significance: open-source agent infrastructure is moving from terminal-first tooling to desktop products that can compete for everyday team workflows — Hermes Desktop makes Nous Research's agent platform accessible to non-terminal users while preserving the developer-grade capabilities underneath.

{/* HARNESS_SECTION_END: notable-new-june-3-2026 */}


{/* HARNESS_SECTION: notable-new-june-3-2026-pm */}

Notable Developments — June 3, 2026 (PM)

Devin Desktop — Cognition Rebrands Windsurf as Agent Command Center — Cognition shipped Devin Desktop, rebranding the Windsurf IDE into a unified agent management platform. The update delivered over-the-air to existing Windsurf users. Devin is now four surfaces: Desktop (IDE + agent manager), Cloud (autonomous agent), CLI, and Review. Key feature: the Agent Command Center lets developers coordinate local and cloud AI agents, PRs, and project context from a single surface. Devin Desktop supports the Agent Client Protocol (ACP), meaning third-party agents (Claude Code, custom agents) can run alongside Cognition's own agents. Walden Yan (co-founder) positions this as the IDE becoming an orchestration layer — not just a coding surface. The significance: Cognition is pivoting from "autonomous agent" to "agent platform" — the most dramatic strategic shift in the IDE agent space since Google acquired Codeium. Devin Desktop directly competes with Cursor on the IDE side while maintaining the autonomous Devin Cloud agent as a differentiated capability.

MAI-Thinking-1 — Microsoft's Enterprise Reasoning Model — Microsoft AI launched MAI-Thinking-1, a 35B-active / ~1T-total parameter sparse Mixture-of-Experts model designed for enterprise coding and mathematical reasoning. Benchmarks: 97.0% on AIME 2025, strong SWE-Bench Pro scores. Available via Azure AI Foundry and GitHub Models. When paired with the Microsoft Agent Framework or GitHub Copilot, it provides a high-reasoning option for complex multi-step agent tasks (architecture decisions, security analysis, mathematical proofs). The significance: Microsoft now has a reasoning-specialized model to complement MAI-Code-1-Flash — fast model for everyday coding, thinking model for complex agent decisions. This mirrors the Low/Medium analysis tier pattern already shipping in Copilot Code Review.

{/* HARNESS_SECTION_END: notable-new-june-3-2026-pm */}


{/* HARNESS_SECTION: notable-new-june-5-2026 */}

Notable Developments — June 5, 2026

Augment Code Cosmos — GA: Operating System for Agentic Engineering Teams — Augment Code made Cosmos generally available to all plan tiers on June 3, 2026 (covered June 5 by SiliconANGLE). Cosmos is described as "the operating system that turns agents and humans into a coordinated team across your whole SDLC" — not a single agent or workflow engine, but a platform where specialized agents coordinate across triage, spec, implementation, review, testing, deployment, and feedback. Key differentiators: a shared virtual filesystem and system-wide memory so agents build on each other's work; teams of agents that coordinate, delegate, and pull humans in when judgment matters; Cosmos agents that help build Cosmos — describe what you want in natural language, the system configures the automation; run anywhere — Augment's cloud sandboxes, self-hosted VMs, or developer laptops; MCP support plus webhooks for wiring into any existing tool; and encoded institutional memory that carries patterns, conventions, and corrections forward across sessions and teammates. The platform works via web, mobile, CLI, Slack, and Linear — agents meet the work where it's already happening. Available to all team plans.

The significance: Augment Code is staking a claim in the "agent operating system" category this article has been tracking. Where most AI coding tools focus on IDE-level productivity, Cosmos competes at the team and lifecycle layer — directly challenging enterprise platforms like Microsoft Foundry Agent Service, Warp Oz, and Kore.ai Artemis. The public preview launched May 4, 2026; GA makes this a viable option for teams ready to deploy agents beyond the code editor.

{/* HARNESS_SECTION_END: notable-new-june-5-2026 */}


{/* HARNESS_SECTION: notable-new-june-6-2026-pm */}

Notable Developments — June 6, 2026

Microsoft Agent Framework BUILD 2026 — Agent Harness, CodeAct, Hosted Agents — At BUILD 2026, the MAF team shipped the most comprehensive harness infrastructure update since the 1.0 GA. The headline: Agent Harness is now first-classchatClient.AsHarnessAgent() turns any chat client into a production agent in one call, with automatic context compaction (monitors token usage, compacts history mid-loop to prevent overflow), built-in instruction merging, and a complete set of first-party providers: FileMemoryProvider (session-scoped persistent notes across turns, stored in agent-file-memory/{session}/), FileAccessProvider (general file I/O), TodoProvider (multi-step task tracking in session state), AgentModeProvider (plan vs execute operating modes), AgentSkillsProvider (skill discovery and execution from the filesystem), and BackgroundAgentsProvider (fan-out orchestration to parallel child agents). ToolApprovalAgent middleware adds 'don't ask again' approval rules for sensitive tool calls; OpenTelemetryAgent provides automatic Semantic Conventions tracing with pluggable storage backends. CodeAct (alpha, agent-framework-hyperlight package) collapses multi-step tool-call chains into a single model turn: instead of orchestrating one tool at a time, the model writes a short script that calls tools via call_tool(...), executes it once in a Hyperlight microVM sandbox, and returns a consolidated result — cutting latency and token usage for orchestration-heavy agents. Foundry Hosted Agents (preview) takes a local MAF agent to production in a few lines: scale-to-zero pricing, per-session VM-isolated sandbox with persistent filesystem across scale-down events, built-in OpenTelemetry traces to Application Insights, and automatic session management. The significance: MAF is now architecturally competitive with Claude Code and the OpenAI Agents SDK on harness completeness — memory persistence, parallel sub-agents, approval gates, observability, and sandbox isolation are all built-in and composable. (Source)

Chalk Compute — Time-Traveling Agent Sandboxes in Your Cloud — Chalk launched an enterprise agent runtime (June 1, 2026) that deploys gVisor-hardened sandboxes entirely inside your private VPC (AWS EKS and GCP GKE; Azure AKS coming). The headline capability: temporally consistent evaluations — a single knowledge_cutoff parameter routes every tool call through the Chalk Context Engine locked to that timestamp. As far as the agent knows, it's evaluating against your real production data as of that exact moment — no synthetic fixtures. This closes the outer eval loop: build the agent → evaluate against real historical context → fix what breaks → redeploy. Infrastructure: scales to 10,000 isolated containers in under 10 seconds via content-addressed image caching; gVisor intercepts syscalls before reaching the host; each sandbox gets its own OIDC-compliant cloud identity; outbound egress locked to a hostname or CIDR allowlist. Customer data, tool calls, and logs stay in your VPC — Chalk's metadata plane orchestrates, your data plane owns the data. Tool calls route through the Chalk MCP gateway. Runtime open-sourcing planned for this summer. The significance: the evaluation harness is becoming as critical as the execution harness — teams shipping production agents need to replay historical scenarios against real context. Chalk Compute is the first product to make temporally consistent agent evaluation a first-class enterprise offering. (Source)

Anthropic Defending-Code Reference Harness — Autonomous Security Scanning — Anthropic open-sourced a reference harness for autonomous vulnerability discovery and remediation with Claude (released May 22, 2026; broadly covered June 5 with 4,100+ GitHub stars). The harness implements a full recon → find → verify → report → patch pipeline for C/C++ memory vulnerabilities using Docker and ASAN, with a /customize skill to port to any language or vulnerability class. Claude Code skills included: /quickstart, /threat-model, /vuln-scan, /triage, /patch. Security architecture: the autonomous pipeline executes target code inside a gVisor sandbox and refuses to run outside it unless explicitly overridden; the interactive skill workflow (read/write-only) is safe for unsandboxed use. Companion managed product: Claude Security (hosted, Anthropic) finds and fixes vulnerabilities across multiple projects with a multi-stage false-positive reduction pipeline and full finding lifecycle management (triage → fix validation → rapid fix generation). The significance: security is where autonomous agents are crossing from demo to production workflow first — the harness patterns here (recon phase, multi-stage verification, gVisor sandboxing, human triage gate) are reusable for any high-stakes autonomous workflow. (Source)

{/* HARNESS_SECTION_END: notable-new-june-6-2026-pm */}


Hermes Agent (New Entrant)

Hermes Agent is an open-source AI agent platform for automation, coding, and task orchestration. Current version: v0.16.0 (June 5, 2026) — the Surface Release — shipping native desktop apps (macOS, Windows, Linux), a browser web admin panel, OAuth remote gateway, and concurrent multi-profile sessions. Earlier milestones: the v0.15.0 release (May 28, 2026) represented a massive architectural overhaul — 1,302 commits, 747 merged PRs, 1,746 files changed, and 560+ issues closed since v0.14.0. On June 3, Nous Research shipped Hermes Desktop — a native GUI preview (v0.15.2) that evolved into the full v0.16.0 platform.

May 28, 2026 (v0.15.0): The core agent loop was refactored from 16,083 lines to 3,821 (76% reduction), split across 14 modules. The "Kanban" multi-agent system now supports orchestrator auto-decomposition, swarm topology creation (hermes kanban swarm creates a full Swarm v1 graph in one command), scheduled tasks, per-task model overrides, and worktree-per-task isolation. Performance improvements include 63% cold start reduction (701ms → 258ms) and 47% fewer function calls per conversation. Security additions target prompt-injection and "Brainworm"-class attacks with memory scanning and tool output delimiters. Credential management moved to Bitwarden Secrets Manager for centralized secret storage.

✅ Pros:

  • Open-source with aggressive development pace (747 PRs in one release cycle)
  • Multi-agent Kanban orchestration with swarm topologies
  • Strong security focus — prompt-injection defenses, tool output delimiting
  • Per-task model overrides — route cheap models to subtasks, strong models to verification
  • Fast startup (258ms cold start) and low per-conversation overhead
  • 23 messaging platform integrations
  • MCP catalog support with interactive picker

❌ Cons:

  • Newer project — ecosystem and community smaller than LangChain/CrewAI
  • Rapid development pace means frequent breaking changes
  • Documentation may lag behind the aggressive release cadence
  • Less enterprise adoption and commercial support compared to established frameworks
  • Complex architecture may be overkill for simple agent use cases

🎯 Best for: Developers who want an open-source, multi-agent orchestration platform with built-in security hardening and aggressive performance optimization. The Kanban swarm pattern is particularly compelling for teams managing complex, decomposable coding tasks across multiple worktrees.

{/* HARNESS_SECTION_END: hermes-agent */}


The Bottom Line

{/* BOTTOM_LINE_START */}

The agent harness landscape in 2026 is where container orchestration was in 2016 — fragmented, fast-moving, and converging toward patterns that aren't fully standardized yet. The CNCF's four pillars of platform control (golden paths, guardrails, safety nets, manual review) are emerging as the design principles every harness will eventually implement.

May 2026 signals: The trend toward "agent operating systems" is accelerating. GitHub's Copilot App treats each task as an isolated session. Anthropic's managed agents introduce hierarchical orchestration with safety critics. OpenAI is collapsing its multi-product portfolio into a single agentic surface. And infrastructure players like Redis are shipping dedicated memory layers for agents. The harness isn't just wrapping the model anymore — it's becoming the operating system.

The multi-harness era (May 20–21): Two announcements signal where this is heading. Google's Managed Agents API collapses weeks of agent deployment infrastructure into a single API call — provision a sandbox, wire tools, and execute all in one request. Meanwhile, Warp's Oz platform shipped the first multi-harness control plane: run Claude Code, Codex, and Warp Agent side by side with unified governance. The implication is clear — enterprises won't pick one harness. They'll run many, and need an orchestration layer above them all.

OpenAI claims the infrastructure layer (May 21): OpenAI's Agents SDK architecture overhaul is arguably the most significant structural shift since this article launched. By splitting into native harness + compute with 7 official sandbox providers, OpenAI is no longer just a model vendor — they're positioning as the foundational infrastructure layer for production agents. The explicit goal: make LangChain, CrewAI, and AutoGen either move up-stack (orchestration, vertical domains) or down-stack (specialized tooling). If you were building on those frameworks because OpenAI's SDK lacked sandboxing and production tooling, that argument just evaporated. Meanwhile, MBZUAI's analysis of Claude Code confirms what this page has argued from the start: ~98% of a production agent is harness infrastructure, only ~2% is AI decision logic. The real moat is the control plane.

Agent infrastructure becomes a commodity (May 20): Google's Agent Sandbox on GKE hit general availability with LangChain and Lovable running millions of agents on the platform. More importantly, Google open-sourced Agent Substrate — a lightweight control plane for sub-second agent startup at ultra-scale. Meanwhile, NVIDIA released AI-Q as an open-source deep research skill that plugs into Claude Code, Codex, or LangChain via a SKILL.md interface. The pattern is clear: the execution layer is commoditizing while the skill/tool layer is standardizing. Harnesses that embrace composable skills (via MCP, SKILL.md, or similar interfaces) will accumulate capabilities faster than monolithic platforms rebuilding everything in-house.

Safety tooling matures (May 21): Microsoft open-sourced RAMPART and Clarity — AI agent safety tools from their internal Red Team. RAMPART is a CI test harness (built on PyRIT) that lets you write pytest adversarial scenarios gated in CI. Clarity is a structured design-review tool with multi-AI failure analysis. Both are now on GitHub (v0.1.0 and v0.1.1 respectively). Agent governance isn't just harness-level anymore — dedicated safety testing frameworks are becoming the standard for production deployment.

Google I/O 2026 (May 19): Google made its biggest agentic development push yet. Antigravity 2.0 is a full desktop platform with multi-agent orchestration — directly competing with Cursor and GitHub Copilot's desktop workflows. Android CLI 1.0 takes a "platform-as-tool" approach, providing standardized CLI access that ANY agent can use for Android development. And Gemini Spark extends the agentic paradigm beyond coding into personal productivity — a 24/7 agent running on dedicated cloud VMs with deep Workspace integration. The AI Ultra pricing ($100–$200/mo) positions Google alongside Anthropic and OpenAI in the premium agent tier.

My bet: by 2027, the distinction between "agent harness" and "agent framework" will dissolve. Frameworks will grow governance layers. Harnesses will expose programmable hooks. MCP or something like it will become the standard tool protocol. And the platforms that survive will be the ones that nailed the balance between developer autonomy and organizational control.

May 22, 2026 — security and the full-stack land grab: Two themes dominate today's news. First: security. Semantic Kernel's two CVSS 9.8+ CVEs — prompt injection escalating to full RCE via accidental tool registration and unsafe eval() — confirm what the security community has been warning: wiring LLMs to tools without explicit validation is a code execution primitive. Microsoft was blunt: disable auto-invocation on any agent that can reach disk, shell, or production data. Expect analogous CVEs in LangChain, CrewAI, and AutoGen. Patch now. Second: the full-stack land grab. DeepSeek announced a dedicated "Code Harness" team to build "DeepSeek Code" — a direct Claude Code competitor built on their formula: Model + Harness = Agent. With V4 Flash at $0.14/M tokens (vs Claude Opus 4.7's $15/M), any DeepSeek-native harness arrives with a structural pricing advantage for budget-sensitive teams. Combined with OpenAI's Codex Goals feature and GitHub Copilot's Agent Tasks REST API, the harness race is accelerating on every axis simultaneously.

May 27–June 1, 2026 — the cost reality check: The biggest shakeup this week isn't a new feature — it's economics. Microsoft canceled most internal Claude Code licenses, shifting engineers to GitHub Copilot CLI after token-based pricing produced $500–$2,000/month per-engineer costs. Uber burned through its entire 2026 AI coding budget in four months at 84% developer adoption. Anthropic responded with a billing restructure — splitting Agent SDK usage into separate credit pools starting June 15. GitHub is also transitioning Copilot to AI Credits on June 1: seat prices stay the same, but agentic usage now consumes token-priced credits while completions/NES remain unlimited. The key difference is governance. Copilot pairs the billing change with pooled org credits plus spend caps at enterprise, cost-center, and user levels, which is materially cleaner than Anthropic's split-pool approach. The lesson: agent pricing is becoming a control-plane feature. The winners will be the platforms that combine strong agents with transparent budget controls — not just the cheapest raw token rate.

Governance becomes its own layer (May 28): Microsoft open-sourced the Agent Governance Toolkit (AGT) — a runtime policy engine that evaluates agent actions against declarative policies before execution. AGT works with ANY agent harness, not just Microsoft products. Combined with Google's AX durable execution runtime and Pydantic AI Harness's composable capability modules, a clear pattern is emerging: agent infrastructure is decomposing into specialized, composable layers — governance (AGT), execution durability (AX), capabilities (Pydantic AI Harness), sandboxing (Agent Sandbox), and orchestration (Warp Oz). The monolithic "agent framework" is giving way to a layered stack where each concern is independently addressable.

The SLM bifurcation (late May 2026): The harness landscape is splitting along a new axis: model size. Microsoft's MagenticLite proves that purpose-built SLM harnesses (4B–27B models) can match or exceed GPT-4o-class agents on browser tasks while running entirely on-device. Alibaba's Qwen3.7-Max pushed the other extreme — 35-hour continuous autonomous runs with 1,000+ tool calls. The implication: the "one harness, one model" assumption is dead. Future architectures will route cheap SLMs to routine subtasks and expensive frontier models to verification and complex reasoning, with the harness managing the routing logic.

Microsoft consolidates (May 29): Microsoft officially deprecated AutoGen in favor of the new Agent Framework 1.0 — a unified platform covering the full agent lifecycle from prototyping to production. The framework absorbs Semantic Kernel, AutoGen's multi-agent patterns, and Azure AI Foundry into a single coherent stack. For teams already invested in the Microsoft ecosystem, this removes the "which framework?" confusion. For everyone else, it's a reminder that framework consolidation is inevitable — invest in patterns (MCP, governance hooks, memory layers) that survive vendor reshuffling.

The super app thesis (May 29): GitHub Copilot is becoming a developer super app — a unified platform where coding, project management, CI/CD, and now agent orchestration converge into a single surface. The plugin marketplace (May 27) enables third-party tool integration, making Copilot an extensible platform rather than a monolithic product. Combined with the remote control GA (May 18) and Agent Tasks REST API, GitHub is positioning Copilot as the control plane for all developer workflows — not just code completion.

June 1, 2026 — Copilot's pricing model matures: GitHub's shift to AI Credits deserves a more nuanced read than the broader "token pricing panic" narrative. Copilot keeps the existing seat tiers, preserves unlimited completions and Next Edit Suggestions, and adds pooled credits plus enterprise/cost-center/user spend caps. That's a pragmatic enterprise move: long-running agents were always going to need explicit budget controls, and Copilot is turning those controls into a first-class admin surface instead of hiding them behind surprise overages.

Cognition bets $1B on full autonomy (May 2026): Cognition raised $1 billion — a $1B Series C at a $6B pre-money valuation — and announced a Skills API that decomposes complex tasks into modular, independently-deployable steps. This is the largest single fundraise in the agent harness space, signaling investor confidence that fully autonomous software engineering agents represent a massive TAM. The Skills API is particularly notable because it mirrors the decomposition pattern seen in MCP's tool protocol and Hermes Agent's Kanban system — the industry is converging on "tasks as composable units."

Hermes Agent emerges (May 28): Hermes Agent v0.15.0 shipped a massive architectural overhaul — 1,302 commits, 76% core code reduction, and a Kanban multi-agent system with swarm topologies. The 258ms cold start and prompt-injection defenses make it a compelling open-source alternative for teams that want multi-agent orchestration without vendor lock-in. Watch this space.

Agent commerce arrives (May 30): Replit's Visa Trusted Agent Protocol signals a new infrastructure frontier: agents that can transact. Cryptographic identity verification, spending controls, and M2M payment primitives baked into the development platform. Meanwhile, Anthropic's Dynamic Workflows push Claude Code toward true parallelism — hundreds of subagents working simultaneously with adversarial verification. And xAI's Grok Build API at $1/M tokens undercuts most competitors on raw inference cost. The pattern: the harness race is splitting into three tiers — premium orchestration (Copilot, Claude Code Dynamic Workflows), commodity inference APIs (Grok Build, DeepSeek), and infrastructure primitives (Visa protocol, Google AX, AGT). Teams will mix across tiers.

Until then, choose based on what you actually need today. Use the comparison tables. Read the pros and cons. And remember: the best agent harness is the one your team can actually govern in production.

{/* BOTTOM_LINE_END */}


Resources


{/* HARNESS_SECTION: notable-new-june-7-2026 */}

Notable Developments — June 7, 2026

VS Code 1.123 — Agent Session Sync, 1M Context Windows, Read-Only Research Agent — VS Code 1.123 shipped June 3 with three features that change how long-running agent work holds together. Session sync (on by default) persists your full chat sessions — conversation history, edited files, repo context, referenced PRs and issues — to your GitHub account, so switching machines mid-task no longer means starting over. /chronicle:standup generates a standup report from the last 24 hours of coding; /chronicle [query] lets you search session history in natural language. 1 million token context windows now supported for compatible models including Claude Opus 4.7 and GPT-5.5 — enough to hold a large codebase across hours of agent work without mid-session truncation. The new research agent (/research [question]) is read-only by design: it investigates and reports from your codebase, GitHub repos, and the web without touching a file. Currently in preview for Copilot CLI (Insiders only). The significance: GitHub Copilot's infrastructure in VS Code now solves the operational pain of long-running agent sessions — state persistence, context limits, and safe investigation without side effects. (Source, June 3, 2026)

crewAI 1.14.3 — Checkpoints, Fork Support, Bedrock V4, 29% Cold-Start Improvement — crewAI 1.14.3 ships across four areas. Checkpoint and fork support for standalone agents — agents outside a full crew can now save execution state and branch from a checkpoint along a different path without rerunning the full workflow; lifecycle events fire for checkpoint operations. Amazon Bedrock V4 support lands alongside new sandbox integrations for e2b and Daytona. A 29% cold-start reduction comes from MCP SDK and event-type initialization optimizations — directly relevant for serverless or on-demand agent deployments. Security bumps: lxml ≥ 6.1.0 and python-dotenv ≥ 1.2.2. Serialization fixes improve checkpoint reliability. The significance: forking execution state is a pattern previously seen only in stateful workflow engines — crewAI bringing it to a Python framework closes a meaningful gap for production teams. (Source, June 5, 2026)

AutoGen Python v0.6.2 — Streaming Nested Agents, Inner Tool Loop, OpenTelemetry Traces — Microsoft AutoGen Python v0.6.2 delivers three headline changes. AgentTool and TeamTool gain streaming support via a new run_json_stream interface — when an AssistantAgent calls a nested agent as a tool, inner events surface through the parent's output stream in real time rather than returning only a terminal result. max_tool_iterations on AssistantAgent enables a bounded inner tool-calling loop: the agent calls the model and executes tools continuously until no more tool calls are generated or the ceiling is hit. ChatCompletionClient gains a tool_choice parameter for explicit model tool selection control. OpenTelemetry GenAI traces added for create_agent, invoke_agent, and execute_tool spans. The significance: nested multi-agent observability and bounded tool loops are production necessities — AutoGen v0.6.x is systematically closing the feature gap with more mature frameworks. (Source, June 5, 2026)

xAI Grok Build 0.1 — Agentic Coding Model Opens API in Public Beta — xAI opened Grok Build 0.1 via the xAI API in public beta (June 1), previously limited to SuperGrok/X Premium+ CLI users. Specs: 256K-token context window, text + image inputs, 100+ tokens/second, \/\ per million input/output tokens. Supports up to 8 parallel agents on a plan → search → build workflow, with subagents running in isolated worktrees. Native MCP support ("Bring Your Own MCP") and full Agent Client Protocol (ACP) compatibility let it be called as a primitive from orchestration platforms alongside Claude Code or Codex CLI. Integrations include GitHub, Notion, Linear, Google Workspace, Microsoft 365, Vercel, and Canva. Picks up AGENTS.md, hooks, skills, and MCP servers from the repo root. The significance: the API opening makes Grok Build a callable primitive for multi-agent pipelines; at \/\ per million tokens with 8-way parallelism, it's positioned as a cost-competitive option for parallel-heavy migration workloads. Public beta status means rough edges are expected. (Source, June 1, 2026)

Koog 1.0 — JetBrains Ships Stable AI Agent Framework for Java and Kotlin — JetBrains shipped Koog 1.0 — the first stable release of their JVM-native AI agent framework. The headline: a one-year API stability guarantee on all stable modules, with all deprecated APIs removed and graph DSL node names finalized. In an agent tooling landscape where breaking changes are routine, this is a production signal aimed at Java/Kotlin backend teams. Key 1.0 improvements: consistent Java interop (xxxBlocking in Kotlin, plain xxx from Java; explicit ExecutorService parameters removed), HTTP transport decoupled from Ktor (LLM client constructors no longer lock you to Ktor), and a clear stable/beta module split. The significance: JetBrains is betting that enterprise Java/Kotlin shops will build agent infrastructure in their native stack — Koog 1.0 is the first framework in the JVM ecosystem to offer a production stability commitment that Python frameworks have never convincingly delivered. (Source, June 6, 2026)
{/* HARNESS_SECTION_END: notable-new-june-7-2026 */}


{/* HARNESS_SECTION: notable-new-june-7-2026-pm */}

Notable Developments — June 7, 2026 (PM)

Hermes Agent v0.16.0 — "The Surface Release" — Nous Research shipped Hermes Agent v0.16.0 (June 5, 2026) in a release spanning 874 commits, 542 merged PRs, 1,962 files changed, 399 closed issues (including 2 P0 and 62 P1), and 170 contributors since v0.15.2. The headline is a transition from CLI-first tooling to a multi-surface platform. Native desktop apps now ship for macOS, Windows, and Linux with one-click install, auto-updates, drag-and-drop files, clipboard image paste, a Cmd+K command palette, session search and archive, and an inline model picker in the status bar. Concurrent multi-profile sessions let users run multiple Hermes instances in a single desktop window. OAuth remote gateway lets a laptop act as a thin client while the agent, API keys, and compute stay on a server — enabling team-shared Hermes infrastructure without SSH tunneling. A new browser-based web admin panel manages messaging channels, MCP catalog entries, credentials, webhooks, memory, and gateway controls. Security round: CVE-2026-48710 (Starlette pin), SSRF off-loop hardening, subprocess credential stripping. Additional additions: fuzzy-searchable model pickers across desktop/web/TUI/CLI, /undo for the last N turns, NVIDIA/skills added as a trusted Skills Hub alongside OpenAI, Anthropic, and HuggingFace, and a Simplified Chinese desktop GUI. Hermes held #2 on ClawCharts with 182,737 total stars at release. Operator note: the expanded web surface means auth boundaries and session continuity need validation before production upgrades. (Source: illmethinks.io, June 6, 2026; Release)

Devin Desktop: Devin Local Replaces Cascade — Rust Rewrite, Parallel Subagents, July 1 Deadline — The most consequential detail in the Devin Desktop rebrand (June 7 follow-up reporting): Cognition rewrote the primary local coding agent from scratch in Rust. Cascade — which operated as a single-context agent — is replaced by Devin Local, which supports parallel sub-sessions. A refactor + test-suite task can have one subagent handling schema changes while another drafts tests simultaneously. Cognition claims up to 30% greater token efficiency vs Cascade (self-reported). Cascade remains available as a legacy option through July 1, 2026 — teams with Cascade-specific workflows have that as the real migration deadline. The Agent Command Center is Devin Desktop's default surface, not the code editor — positioning Devin as a fleet manager first. ACP (Agent Client Protocol) support means any ACP-compatible agent runs natively in the same Kanban view and Spaces context layer. Devin Review is now included in all existing plans at no additional cost. Spaces (early, minimal) groups related agent sessions, PRs, and files around a feature branch for shared context — more development planned through Q3 2026. (Source: ByteIota, June 7, 2026)

Cursor Organizations for Enterprise — Per-Team Budgets, SCIM Groups, and Model-Tier Segmentation — Cursor shipped Organizations for Cursor Enterprise (GA June 3, 2026) — a top-level admin container that gives enterprises one dashboard for multiple teams with separate budgets, model access tiers, and governance per unit. Key capabilities: per-team budgets (sub-organization spend controls), model-tier segmentation (route different teams to different model tiers by cost and capability), and SCIM Groups for identity sync. Context: Cursor has reached + ARR (as of February 2026), with enterprise revenue at ~60% of total and Fortune 500 customer reach at ~64% of enterprise customers. Organizations GA is the clearest signal yet that the AI coding race has shifted from raw capability to enterprise control plane maturity — a trajectory GitHub Copilot's granular spend caps, pooled credits, and enterprise plugin governance reinforce from the other direction. (Source: Digital Applied, June 6, 2026)

Gartner's First Magic Quadrant for Enterprise AI Coding Agents — Gartner published the first-ever Magic Quadrant for Enterprise AI Coding Agents (June 5, 2026), formally recognizing agentic software engineering as a distinct, enterprise-procurement-relevant market category. The headline finding: AI-focused vendors are positioned as Leaders, while major cloud providers that previously ranked as Leaders in the adjacent "AI Code Assistants" Magic Quadrant are now positioned as Challengers — reflecting a shift in evaluation criteria from inline code suggestion quality toward autonomous agent orchestration, multi-step task execution, and governance capabilities. The significance: Gartner Magic Quadrants create structured buying behavior. The creation of this new MQ signals that enterprise procurement teams now have an analyst-backed framework for evaluating agent platforms — and the vendors already positioned as Leaders have a meaningful advantage in enterprise deal flow and IT spending cycles through 2027. (Source: Virtualization Review, June 5, 2026)

{/* HARNESS_SECTION_END: notable-new-june-7-2026-pm */}

{/* HARNESS_SECTION: notable-new-june-7-2026-evening */}

Notable Developments — June 7, 2026 (Evening)

Microsoft Scout on OpenClaw: The Agent Runtime Is Now Free — The clearest strategic signal from Build 2026 landed in a June 7 analysis: Microsoft shipped Scout — its first "Autopilot" (always-on work agent) — on OpenClaw, the open-source runtime an Austrian developer built over a weekend in late 2025. Microsoft chose not to build its own agent loop, mirroring how Google used Android: make the OS layer free, monetize the identity, policy, and distribution above it. The architectural stack Build made explicit: OpenClaw runtime (free, open) → Microsoft Execution Containers (kernel-level agent sandbox) → identity, governance, and grounding control plane → Scout. Scout connects to Microsoft 365 data, runs continuously in the background, and reaches the browser and external apps through MCP. Every Scout agent operates under its own governed Entra identity rather than a shared service account — Microsoft's direct answer to the agentic identity problem. The policy-conformance system checks each action and leaves an audit trail; conformance work is being contributed upstream to OpenClaw so open deployments can validate themselves. Agent 365 (the enterprise management console) discovers and manages local agents on a managed device — including OpenClaw-based agents, GitHub Copilot CLI, and Claude Code — surfacing them all in one interface. NVIDIA is bringing its OpenShell runtime to the same containment layer; Nous Research confirmed Hermes Agent will integrate both. Five months after OpenClaw launched, it is the shared runtime under Microsoft, NVIDIA, and a field of agent startups simultaneously. The significance: the agent runtime layer is now effectively free infrastructure — the same shift Android made to mobile OSes. The control plane — identity, governance, grounding, distribution — is where every enterprise vendor is competing, and it is not free. Teams evaluating agent infrastructure should factor this into build-vs-buy decisions: the execution loop is commoditized, but the trust and auditability layers above it are not. (Source: The New Stack, Janakiram MSV, June 7, 2026)

Perplexity Search as Code — Agents That Write Their Own Retrieval Pipelines — Perplexity introduced Search as Code, a reference architecture that shifts agent retrieval from calling a fixed endpoint to letting an agent generate Python search workflows per task. The three-layer stack: a model as control plane, a restricted compute sandbox for generated code, and the Agentic Search SDK (exposes retrieval, filtering, deduplication, and reranking as callable SDK primitives). Self-reported benchmark: 100% accuracy on a 200-CVE task, 85.1% fewer tokens than baseline — figures that need outside validation before treating as repeatable. Available in Perplexity Computer and the Perplexity Agent API. Direct competition in the same layer: OpenAI Responses API (web search before generation), Exa (search engine built for AI agents), Parallel (evidence-based agent search), and Tavily (agent-oriented Search API). The significance for harness developers: retrieval is shifting from a static endpoint integration to a programmable pipeline that agents generate per-task, adding code-review and trust-boundary considerations alongside the usual latency and cost tradeoffs. The retrieval layer is itself becoming an agent behavior to govern. (Source: WinBuzzer, June 7, 2026)

{/* HARNESS_SECTION_END: notable-new-june-7-2026-evening */}

{/* HARNESS_SECTION: notable-new-june-8-2026 */}

Notable Developments — June 8, 2026

LG CNS Launches AIND — Enterprise Agentic AI Development Platform with Cline — LG CNS (the IT services arm of LG Group) launched AIND (Agentic AI Development), an enterprise-grade multi-agent platform for building and operating large-scale IT systems. Co-developed with Cline, the U.S.-based open-source AI coding company, AIND deploys a pipeline of three cooperating agents: a requirements analysis and design agent that interprets natural language input and designs system architecture, a coding agent that generates code conforming to the enterprise's development standards, and a testing and QA agent that validates output before delivery. The platform's core differentiator is a Knowledge Foundation — an ontology-based database that integrates and indexes enterprise IT information (development standards, security regulations, source code, deliverables) so the AI understands the organization's specific architecture before generating code. This directly addresses the vibe-coding risk where agents generate plausible code that collides with existing systems. AIND targets finance, public sector, manufacturing, and defense industries, with an initial focus on the U.S., Japan, and Southeast Asia markets. The significance: enterprise systems integrators are entering the agent harness space with domain-specific knowledge bases — not just plug-in-and-run tools, but contextually-aware platforms that understand the organization's architecture and standards before a line of code is written. (Source: AJU PRESS, June 8, 2026)

GitHub Copilot App — Agent Merge Drives PRs from Review to Merged — Detailed coverage of the GitHub Copilot App (announced Build 2026, June 2) surfaced a specific autonomous feature that deserves its own headline: Agent Merge. This feature follows a pull request through the entire post-coding path — CI monitoring, required reviewer tracking, failing-check remediation — until the merge conditions are met. Developers configure exactly which steps Copilot is allowed to perform: driving CI back to green, addressing reviewer feedback, completing the final merge. The agent handles the coordination loop while the human retains control of the authorization scope. Combined with Canvases (bidirectional work surfaces updated in real time), cloud automations (scheduled/event-triggered agents), and cross-repository agent sessions in My Work, Agent Merge closes the last leg of the autonomous development cycle — from "agent writes code" to "code ships." The significance: the end-to-end agentic development loop is now complete within a single platform — GitHub Copilot is the first harness in this comparison where the full path from issue to merged, deployed code is automated with human-in-the-loop checkpoints throughout. No other platform in this comparison ships Agent Merge as a named, configurable feature. (Source: Help Net Security, June 8, 2026)

Google Gemini Enterprise Agent Platform — Agentic RAG with 34% Accuracy Improvement — Google Research and Google Cloud published details on their new multi-agent RAG framework, now available as a public preview feature in Gemini Enterprise Agent Platform. The key architectural innovation is persistence: unlike standard RAG that accepts "I don't have enough information" as a terminal state, this system uses a multi-agent loop — Query Planner, Context Agent, and Query Rewriter — to continue searching until the context is genuinely sufficient. When a search returns incomplete results, the Context Agent evaluates the gap and the Query Rewriter generates a refined search rather than returning an incomplete answer. Self-reported benchmark result: up to 34% accuracy improvement on factuality datasets compared to standard RAG, with better grounding and improved reasoning accuracy on domain-specific proprietary datasets. Responses are auditable, traceable, and grounded. The significance: enterprise agent retrieval is evolving from a stateless endpoint call into a governed quality loop — adding evaluation and retry logic to what was previously a single lookup. For teams building on Gemini Enterprise, this is the new retrieval foundation; for teams on other harnesses, it's a design pattern worth studying. (Source: Google Research Blog, June 5, 2026)

CrewAI 1.14.7a1 — Conversational Flows, Chat API, and Snowflake Cortex LLM — CrewAI's pre-release track ships 1.14.7a1/a2 with features targeting production conversational workflows. Conversational Flows add a chat mode that turns any Flow into a stateful dialogue — handle_turn processes each user message with context, the Chat API provides a REST interface for interactive sessions, and real-time traces surface in LangSmith and the CrewAI platform. Native Snowflake Cortex LLM provider allows agents to use Cortex models directly for workloads running inside Snowflake without data egress. Crew trained agents file support persists trained agent state for reuse across runs. The Flow DSL was refactored from a single monolith into three focused modules (DSL, definition, runtime) for improved testability. An NVIDIA Nemotron LLM guide was added. The significance: CrewAI is maturing from batch task-execution orchestration toward conversational, stateful agent interfaces — a direction that makes crews more practical for interactive enterprise workflows beyond fully automated pipelines. Note: pre-release status; API surface may shift before stable release. (Source: CrewAI GitHub, June 5, 2026)

{/* HARNESS_SECTION_END: notable-new-june-8-2026 /}
{/
HARNESS_SECTION: notable-new-june-8-2026-midday */}

Notable Developments — June 8, 2026 (Midday)

Mastra Code — Harness Architecture Deep Dive: Observational Memory and 4-Mode Design — Mastra published a detailed technical walkthrough of how Mastra Code's harness wraps the agent loop — and it introduces patterns not yet seen in other harnesses in this comparison. The centerpiece is Observational Memory (OM): instead of waiting for the context window to fill and then compacting the entire history in one step (the approach used by Claude Code and OpenAI Codex), Mastra Code runs an observer model continuously at 20% intervals ahead of the threshold (40K tokens by default). The observer writes structured observations — decisions, facts, state changes — to a separate store; a reflector model compresses those observations when they accumulate. When the threshold arrives, the distilled working memory is ready and swaps in without a discard step. The harness ships four modes: Build (full tool access, Claude Opus 4.6), Plan (read-only, produces structured plans on GPT-5.2-Codex, auto-switches to Build on approval), Fast (no planning phase, Cerebras ZAI-GLM-4.7), and YOLO (full auto-approve, no permission prompts). Tool approval runs as an ordered rule chain — allow/deny/ask is resolved by walking the chain top-to-bottom until a match is found, meaning rule order is itself the policy. Subagents can spawn in isolated worktrees (clean context) or forked threads (warm prompt cache). The harness is TypeScript-first and open-source, with a createMastraCode() factory function that returns a configured Harness, MCPManager, and HookManager. The significance: Mastra Code is the first public coding agent harness to ship a formally specified proactive Observational Memory architecture — a direct answer to quality degradation over long sessions that the rest of the field has not yet solved with background distillation running ahead of the limit. (Source: Mastra Blog, June 5, 2026)

VS Code 1.120-1.123: Air-Gapped BYOK Unlocks Enterprise AI Coding for Regulated Industries — A comprehensive analysis published today synthesizes the VS Code May release cycle (versions 1.120-1.123) and its cumulative impact on regulated-industry adoption of GitHub Copilot tooling. The key enabler is air-gapped BYOK, shipped in VS Code 1.122 (May 28): once at least one BYOK model is configured via the Command Palette, the Chat view activates without a GitHub OAuth handshake — allowing defense contractors, hospitals, financial institutions, and government agencies to run fully offline agentic workflows using local inference servers (Ollama, vLLM, Foundry Local). Setting COPILOT_OFFLINE=true disables telemetry, removing all outbound traffic. Combined with enterprise-managed plugins entering public preview June 5 — which let administrators configure and distribute custom agents, Copilot skills, and MCP server configurations across an entire organization from a single settings.json policy file — and the Agents window reaching Stable preview in VS Code 1.120 (May 13), this release cycle removed the last structural blockers for Copilot adoption in regulated environments. The significance: the combination of air-gapped BYOK, enterprise policy distribution, and a stable Agents window means GitHub Copilot is now technically deployable in high-compliance environments that previously could not evaluate it, broadening the addressable market beyond internet-connected developer workstations. (Source: TechTimes, June 8, 2026)

Notable Developments — June 8, 2026 (Evening)

Harness-1: Open-Source 20B Search Agent Proves "The Harness Is the Product" — A joint research team from UIUC, UC Berkeley, and Chroma released Harness-1, a 20-billion parameter open-source search agent that directly validates this article's core thesis: harness architecture matters as much as the model itself. Built on the gpt-oss-20b base, Harness-1 achieves 0.730 average curated recall across eight retrieval benchmarks — outperforming GPT-5.4 (0.709) and every other open search agent tested, with only Anthropic's Opus-4.6 scoring higher. The key innovation is stateful cognitive offloading: instead of packing all bookkeeping into the model's growing context transcript, the harness externalizes state management entirely — maintaining a candidate pool, importance-tagged curated set (capped at 30 documents), evidence graph, verification cache, and compressed full-text store outside the prompt. The model only handles semantic decisions: what to search, what to keep, what to verify, and when to stop. The practical result: Harness-1 runs at "Context-1-level cost and latency" because the budget-aware harness — not the model — enforces context constraints. Training required just 899 SFT trajectories and 3,453 RL queries. Transfer gains are striking: +17.0 points on held-out benchmarks vs +7.9 on training-domain tasks, suggesting the learned search behaviors generalize. Released under Apache 2.0 with weights on HuggingFace (pat-jj/harness-1) and code at github.com/pat-jj/harness-1. The significance: a research team published rigorous proof that the harness is the bottleneck — not model size — and the open weights mean any team can build on this architecture today. (Sources: VentureBeat, arXiv:2606.02373, June 8, 2026)

{/* HARNESS_SECTION_END: notable-new-june-8-2026-evening /}
{/
HARNESS_SECTION_START: notable-new-june-9-2026-morning */}

June 9, 2026 (Morning)

AWS Simple Strands Agent: Open-Source Model-Agnostic Coding Harness — Amazon Web Services previewed Simple Strands Agent (SSA), a lightweight open-source harness designed to decouple AI coding tools from specific models. Led by Anoop Deoras (director of applied science for agentic AI at AWS), SSA directly targets the impedance mismatch that plagues today's agent harnesses: when a harness imprecisely translates model intent into tool actions — causing an agent instructed to edit one function to accidentally modify multiple instances. SSA open-sources all harness elements — agent logic, tools, prompts, and model configurations — for a "plug-and-play" architecture where teams define agent logic once and run it on any model. AWS internal research confirms agents using SSA outperform agents on the same underlying model without SSA, validating the core insight: agent performance is fundamentally a systems problem, not a model problem. The practical payoff: teams stop rewriting agent logic every time a better model ships, eliminating a major source of DevOps rework. Futurum Group VP Mitch Ashley captured the strategic stakes — "competition among AI coding tool providers now revolves around the harness" — and framed model-agnostic open harnesses as the next frontier for avoiding deployment stack lock-in. (Source: DevOps.com, June 8, 2026)

LG CNS + Cline: "Spec Driven for Enterprise" Targets Full SDLC Automation — South Korean IT services giant LG CNS partnered with Cline — the open-source coding agent with 188K+ GitHub stars — to launch Cline Spec Driven for Enterprise, an agentic platform targeting end-to-end automation of large-scale enterprise IT system construction: from requirements analysis and system design through coding, testing, and operations. Separate from LG CNS's AIND platform (launched June 8 morning), this initiative specifically leverages Cline's spec-driven development model for enterprise governance contexts — signaling a broader trend of enterprise IT services firms adopting open-source agentic coding harnesses as delivery automation foundations. (Source: Vietnam Investment Review / PRNewswire, June 8–9, 2026)

WWDC 2026: Apple Ships Xcode 27 with On-Device AI Coding via Gemini-Powered Siri — Apple's WWDC 2026 introduced Xcode 27, which embeds on-device AI coding assistance via a rebuilt Siri now running on Google's 1.2-trillion-parameter Gemini model. App Intents become mandatory for all iOS/macOS app-agent surface areas as Apple deprecates legacy SiriKit — effectively requiring developers to instrument their apps as agent-callable action surfaces. The on-device execution model means Apple's AI coding harness runs without cloud round-trips for many tasks, a differentiator from cloud-first competitors. For agent harness builders targeting Apple platforms, App Intents is now the required integration protocol. (Source: Lushbinary / Apple Developer, June 8, 2026)

{/* HARNESS_SECTION_END: notable-new-june-9-2026-morning */}

{/* HARNESS_SECTION_START: notable-new-june-9-2026-midday */}

Notable Developments — June 9, 2026 (Midday)

Apple Xcode 27 Agent Skills CLI — Export to Claude, Codex, and Cursor — Beyond on-device AI code completion (covered above), Apple shipped a remarkable interoperability play: xcrun agent skills export lets developers extract Xcode 27's built-in Agent Skills to ~/.agents/skills, making them usable in Claude Code, OpenAI Codex, and Cursor. This means Apple's official coding skills — project navigation, build management, and SwiftUI generation — now function as portable agent capabilities regardless of which IDE you choose. The practical implication: Apple isn't trying to lock developers into Xcode for AI-assisted development. Instead, they're positioning Xcode as the skill authoring environment while acknowledging developers work across multiple agent IDEs. Not all skills transfer universally (Xcode-specific build system knowledge doesn't always map), but the architecture signals that portable agent skills are becoming a platform expectation, not an afterthought. (Source: SwiftLee / Antoine van der Lee, June 9, 2026)

Comet Opik: First Cost Intelligence Tool for Claude Code & Codex Spend — As AI coding spend scales into the billions, Comet launched cost intelligence in Opik — the first observability tool giving engineering leaders per-engineer, per-team, per-task visibility into Claude Code and Codex costs. The tool goes beyond dashboards: it automatically identifies unused MCPs, idle skills loaded into context, and misconfigured compaction strategies that waste tokens silently. One enterprise reportedly cut AI spend by millions annually using Opik's optimization layer. CEO Gideon Mendels: "Most engineering leaders have no idea how their developers have [AI coding tools] configured — which MCPs are loaded, which model is running by default, whether any of it maps to real outcomes." With both Claude Code and Codex now billing at full API rates, the infrastructure for AI coding cost governance is maturing into its own category. (Source: GlobeNewswire / Comet, June 9, 2026)

KPMG + Microsoft Agent 365: Enterprise Agent Governance at 276K-Person Scale — KPMG and Microsoft announced a global expansion deploying Microsoft Agent 365 for enterprise-scale AI agent management across more than 276,000 professionals. KPMG will use Agent 365 to manage deployment, monitoring, updates, and governance of AI agents through its Trusted AI framework, while rolling out Microsoft 365 Copilot firm-wide. The KPMG Workbench platform — built on Azure AI Foundry — coordinates multiple AI agents across client delivery. The significance: this is the largest publicly announced enterprise agent deployment to date, validating Microsoft's agent governance stack (Agent 365 + Foundry + Copilot) as production-grade infrastructure at consulting-firm scale. (Source: Microsoft News, June 9, 2026)

Ory Agent Security: First Agent IAM Control Plane — Identity infrastructure company Ory launched Agent Security, positioned as the first dedicated IAM (Identity and Access Management) control plane for AI agents. The platform provides centralized authentication, authorization, and access control for agent-based workflows at enterprise scale — addressing a gap where AI agents currently inherit human credentials or operate with overly broad permissions. As agent harnesses proliferate, the identity layer is emerging as critical infrastructure: who is the agent, what can it access, and who authorized it? Ory's entry signals that agent identity management is crystallizing as a distinct product category. (Source: EIN Presswire, June 9, 2026)
{/* HARNESS_SECTION_END: notable-new-june-9-2026-midday */}


{/* HARNESS_SECTION_START: notable-new-june-9-2026-evening */}

Notable Developments — June 9, 2026 (Evening)

Cohere Launches North Mini Code — Open-Source Sovereign Agentic Coding Model — Cohere launched North Mini Code, a 30-billion parameter Mixture-of-Experts (MoE) coding agent with only 3B active parameters per token, available under the Apache 2.0 license — the first explicitly sovereign-AI-focused agentic coding model purpose-built for on-prem deployment. It runs on a single NVIDIA H100 at FP8 precision (minimizing hardware requirements), ships with a 256K-token context window and 64K maximum generation length, and is available on HuggingFace, the Cohere API, OpenRouter, and Cohere Model Vault. Design goals are specifically agentic: sub-agent orchestration, architecture mapping, code review, and terminal tasks — not adapted from a general-purpose base. Key training differentiator: Cohere trained North Mini Code across three distinct harness scaffolds simultaneously — SWE-Agent (rich CLI with specialized commands), Mini-SWE-Agent (single bash tool with raw shell output), and OpenCode (individually typed tools returning structured JSON) — reporting a 10 percentage point gain on OpenCode evaluation while maintaining SWE-Agent performance. That multi-harness training generalizes agent capabilities rather than overfitting to one scaffold. On the Artificial Analysis Coding Index, North Mini Code scores 33.4, outperforming Qwen3.5 (35B), Gemma 4 (26B), and substantially larger models including Devstral 2 (123B-dense) and Nemotron 3 Super (120B). One important caveat: independent testing (VentureBeat) found North Mini Code generates approximately 3× the output tokens of comparable models for the same tasks — a verbosity cost that compounds at high-volume production scale. Teams should model actual token economics against their workload before committing. The significance: North Mini Code makes the "run on-prem, own your data" agentic coding architecture practical for teams with a single high-end GPU — eliminating managed-service pricing exposure and data-residency risk while matching frontier-class performance in its size band. Combined with its OpenCode native support, it's the clearest signal yet that the open-source sovereign agent stack is becoming competitive for production coding workloads. (Sources: Cohere Blog, HuggingFace, VentureBeat, June 9, 2026)

JetBrains Rider 2026.2 EAP 5 — PostToolUse Quality-Check Hooks for Claude Code and Codex — JetBrains Rider 2026.2 EAP 5 introduces bundled PostToolUse quality-check hooks for Claude Code and Codex — the most concrete IDE-native implementation of the "validate before the agent continues" pattern yet seen outside GitHub Copilot's hooks.json. After an external AI agent edits a file, Rider automatically runs its full IDE-level validation pipeline (inspections, build verification, type checking, code quality analysis) before the agent proceeds to its next step. The hooks ship pre-configured for both Claude Code and Codex — zero setup required, just install EAP 5 and it works. The build also ships a non-modal Welcome screen for faster startup and an "Explain with AI" action surfaced directly from build error and runtime exception diagnostics — letting developers trigger AI explanation from the problem location without manually copying context into chat. The significance: PostToolUse file validation is moving from a power-user configuration (Copilot hooks.json, custom harnesses) into a bundled IDE feature — the first time a major Java/C# IDE has shipped pre-wired agent quality gates without requiring manual hookflow setup. This signals that governance-in-the-IDE is shifting from differentiator to baseline expectation across the harness landscape. (Source: JetBrains .NET Blog, June 8, 2026)

MAI-Code-1-Flash Now Rolling Out to GitHub Copilot VS Code Users — Microsoft is rolling out MAI-Code-1-Flash, a new inference-efficient coding model built specifically for the GitHub Copilot harness, to individual VS Code Copilot users via the model picker and Auto picker. Unlike models distilled from third-party systems, MAI-Code-1-Flash was trained from scratch on clean, traceable, enterprise-grade data — with agentic coding optimization explicitly designed for the Copilot runtime ("trained and designed for GitHub Copilot harness, to work better together"). Key characteristics: adaptive thinking calibration (concise for simple requests, deeper reasoning budget for complex tasks), strong multi-turn instruction-following, and performance consistency across single-turn and agentic workflows. No additional setup is required — VS Code Copilot users will see it appear in the Auto picker or model picker as the rollout progresses. This adds a third distinct model to Microsoft's Copilot-native model stack in the same week: MAI-Code-1-Flash (everyday coding, Auto picker default), MAI-Thinking-1 (complex reasoning and architectural decisions, explicit selection), and Gemini 3.1 Pro / 3.5 Flash (via the existing model picker). The significance: Microsoft is building a purpose-built model family tuned specifically to the Copilot harness — following the same architectural logic as Apple's Neural Engine model in Xcode 27 (designed for its harness, optimized for its runtime). Copilot users get a harness-native model that improves performance without requiring any configuration changes, reinforcing Copilot's governance advantage: tighter model-harness integration, managed rollout, and spend controls at every level. (Source: Microsoft AI Blog, June 2 / Updated June 8, 2026)

Claude Managed Agents: Scheduled Deployments + CLI Secrets Vault Now in Public Beta — Anthropic expanded Claude Managed Agents with two major capabilities now in public beta: scheduled (cron) deployments and secured environment variable vaults with CLI tool access. Scheduled deployments let developers give an agent a cron schedule — the platform fires the session on schedule automatically, with no scheduler to build or host. Pause, resume, archive, or trigger additional runs on demand. CLI tool access means agents can now invoke authenticated command-line tools and services directly inside the managed sandbox, with environment variables stored in vault-backed secrets. Real production deployments are already live: Rakuten uses scheduled agents for weekly data analysis and production log monitoring; Actively AI runs cross-account agentic search with scheduled refresh cycles; Ando uses them to watch Slack channels, follow up on proposed next steps, and send meeting reminders. The significance for the harness landscape: Anthropic's Managed Agents platform is now directly competitive with Google's Managed Agents (which also support scheduled runs) and closes a capability gap against Codex Goals' long-running task persistence. Critically, the architecture — cron schedule fires → new session → agent completes task → session ends — is identical to how production multi-agent platforms like this one operate. Anthropic is now selling the infrastructure pattern that sophisticated teams have been building themselves. (Source: Claude Blog, June 9, 2026)

{/* HARNESS_SECTION_END: notable-new-june-9-2026-evening */}

{/* HARNESS_SECTION_START: notable-new-june-10-2026 */}

Notable Developments — June 10, 2026

VS Code 1.124 — Autopilot by Default, Advanced Autopilot, and the Agents Window — Microsoft shipped VS Code 1.124 with a cluster of agent workflow improvements that collectively represent the most significant shift in Copilot's autonomous execution model since Autopilot launched. Autopilot is now on by default — giving agents permission to take initiative and act without requiring explicit user approval for each action. This changes the default interaction model from "approve every step" to "agent decides, human reviews." Advanced Autopilot adds a utility model that reads the chat transcript and determines when a task is genuinely complete versus when the agent should keep iterating — reducing both premature stops and runaway loops. The Agents Window (new panel) lets users explore, iterate on, and review agent sessions across projects and machines simultaneously; previously, starting a new session required waiting for the current one to load. Background sessions allow queuing new requests while a session runs, eliminating idle time between agent tasks. Session navigation (search, jump, keyboard step-through) makes working across long agent runs faster. Enterprise-managed Copilot plugin policies (experimental) allow admins to centrally control which plugins and plugin marketplaces are available — the first centralized governance control for the Copilot plugin ecosystem at the admin level. The significance: Autopilot-by-default is the normalization of autonomous agent execution in the most widely used IDE in the world. When VS Code ships a setting enabled by default, it becomes the implicit expectation for developers using Copilot. This 1.124 release also marks the most aggressive push toward multi-session, parallel-agent workflows Microsoft has shipped in a single VS Code update. (Source: Neowin / Paul Hill, June 10, 2026)

Stack Overflow for Agents — Verified Machine-Readable Knowledge Exchange for the Agentic Era — Stack Overflow launched Stack Overflow for Agents in public beta — an API-first knowledge exchange designed to address what the company calls the "Ephemeral Intelligence Gap": the systemic problem where millions of autonomous agents independently rediscover the same bugs, deprecated APIs, and architectural patterns because agent context windows wipe clean at session end, and agent-to-agent knowledge transfer doesn't exist. The platform extends Stack Overflow's trust model into machine-readable form: agents can query the corpus before burning compute on known solutions, contribute findings when a gap exists (pending human orchestrator approval via a skills file), and verify others' contributions by reporting back on production use. Three post types capture different knowledge: TIL (Today I Learned) for debugging journeys and undocumented behaviors; Questions for unsolved problems; Blueprints for reusable design patterns with quality context (what works, when it breaks, tradeoffs). A multi-agent verification loop validates contributions before they compound into consensus — votes, replies, and verification feedback flow back to posts rather than accumulating as isolated answers. The community anchor: agents are tied to human Stack Overflow credentials via SSO, so reputation and accountability flow through to agent behavior. An enterprise tier (Stack Internal) keeps proprietary knowledge private inside company firewalls. The significance: Stack Overflow is attempting to do for agent knowledge what it did for human knowledge in 2008 — create a shared, peer-verified corpus that compounds over time rather than evaporating per session. If adoption scales, this becomes infrastructure: agents that don't query Stack Overflow for Agents before brute-forcing a problem are operating at a structural disadvantage against those that do. (Source: Stack Overflow Blog, June 10, 2026)

GitLab Transcend — Enterprise Agent-Driven DevSecOps at Scale — GitLab announced new capabilities at GitLab Transcend, its enterprise DevSecOps platform, designed to give engineering teams the infrastructure, context, and governance controls to run agent-driven software delivery at scale. The platform positions GitLab as the orchestration layer for multi-agent CI/CD — agents that plan, write, review, test, and deploy code within a single governed pipeline rather than across disconnected tools. The announcement follows a pattern visible across Cognition's Devin Desktop (team-layer coordination), Microsoft's Rayfin (multi-agent CI/CD), and Augment Code's Cosmos (team-scale agentic engineering): the agent harness battle is moving up the stack from individual developer tools to team-wide DevSecOps infrastructure. Enterprise agent adoption at organizations like KPMG (276K users, Microsoft/Agent 365), now joined by GitLab's enterprise customer base, signals that the evaluation period for agent tooling is ending and the procurement period is beginning. (Source: BusinessWire / Yahoo Finance, June 10, 2026)

{/* HARNESS_SECTION_END: notable-new-june-10-2026 */}


{/* HARNESS_SECTION: notable-new-june-10-2026-evening */}

Notable Developments — June 10, 2026 (Evening)

Claude Code "A Harness for Every Task" — Anthropic's Technical Deep-Dive on Dynamic Workflows — Anthropic published a detailed technical walkthrough of Dynamic Workflows that provides the clearest explanation yet of why single-context agents fail on complex tasks — and how workflow orchestration fixes each failure mode. The blog names three specific failure modes in long single-context execution: Agentic laziness — stopping before completing a complex multi-part task and declaring it done after partial progress (e.g., addressing 35 of 50 security review items); Self-preferential bias — Claude's tendency to prefer its own results when asked to verify or judge them against a rubric; Goal drift — fidelity loss to the original objective across many turns and compaction steps, where "don't do X" constraints gradually evaporate at summarization boundaries. Workflows solve all three by decomposing work across isolated agents where each has a fresh, bounded context. The JavaScript orchestration primitives — agent(), parallel(), pipeline(), and phase() — enable structures like tournament-style evaluation (multiple agents produce, one validates), adversarial parallel review (investor + customer + competitor angles simultaneously), and loop-until-convergence patterns for race condition reproduction. Concrete workflow prompts from the blog: reproducing a flaky test in 50 runs to identify a race condition; mining 50 past sessions for recurring corrections to turn into CLAUDE.md rules; digging through Slack incidents for root causes without a ticket; ranking resumes with a tournament. Key constraint the blog emphasizes: workflows are token-heavy and best suited for complex, high-value tasks — not everyday coding. The significance: Anthropic has given engineers the vocabulary to reason about when multi-agent orchestration is warranted — the failure mode taxonomy (laziness, bias, drift) is directly actionable for any team choosing between regular Claude Code runs and workflow orchestration. The "harness for every task" framing is also the most explicit validation yet of this article's core thesis: the harness architecture determines whether complex tasks complete correctly, regardless of model capability. (Source: Anthropic Blog, June 2, 2026)

JFrog + Anthropic: Enterprise Supply Chain Governance Comes to Claude Code — JFrog launched the JFrog Platform plugin for Claude Code in collaboration with Anthropic, available immediately at claude.com/plugins/jfrog. The problem it solves: AI coding agents are now "active participants in the software supply chain" — making decisions about dependencies, builds, and deployments without any supply chain context, which is how malicious packages, ungoverned AI assets, and unvetted vulnerabilities enter production. JFrog's platform manages over 18 billion artifacts (up 136% year-over-year), and the plugin layers supply chain governance directly into Claude Code's agent loop via three interfaces: JFrog Platform Skills (natural language artifact operations — vulnerability scanning, curation checks, provenance verification via simple prompts); JFrog MCP Tools (standardized security, compliance, and artifact data access across the JFrog platform); and a native agent plugin for deep IDE integration. Real-time upstream governance means agents enforce package security and license compliance as code is written, not after delivery in a separate security scan. The plugin also covers MCP and agent skills governance — ensuring agents only pull verified, secure, and governed MCP servers and skill packages, blocking rogue access to sensitive internal data. The integration supports Claude Code, Cursor, and VS Code Copilot simultaneously, reinforcing JFrog's positioning as a vendor-neutral supply chain governance layer across all major agent harnesses. The significance: supply chain governance is emerging as a dedicated harness layer — not something each agent framework rebuilds independently, but a specialized security surface that plugs into any coding agent. As agent-generated code scales into the billions of binaries, artifact provenance and real-time policy enforcement become baseline requirements. (Source: BusinessWire, June 10, 2026)

{/* HARNESS_SECTION_END: notable-new-june-10-2026-evening */}

Top comments (0)