DEV Community: Dr Hernani Costa

Agent Interoperability: When A2A Solves Real Problems vs. Protocol Debt

Dr Hernani Costa — Fri, 29 May 2026 06:58:00 +0000

When agent coordination becomes a business liability instead of a capability multiplier, technical leaders need a decision framework—not just architectural enthusiasm.

Agent-to-agent (A2A) interoperability sounds like the inevitable next step in AI infrastructure. But for most EU SMEs and technical teams in 2026, standardizing A2A too early creates operational debt faster than it solves coordination problems. The real question isn't whether A2A matters. It's whether your organization has solved the simpler governance and workflow problems that must come first.

When Agent-to-Agent Interoperability Helps and When It Just Adds Complexity

TL;DR: A practical guide to when A2A helps, when it adds complexity, and how technical leaders should decide whether to standardize interoperability now.

A2A becomes valuable when independent agents really need to collaborate across boundaries. It becomes expensive when teams use it to postpone simpler workflow and governance decisions.

A lot of technical leaders are hearing a more ambitious pitch: not just better agents, but interoperable agents. Agents that can discover each other, delegate tasks, collaborate securely, and work across platforms.

That sounds like the next logical step. Sometimes it is. But sometimes, it's just a more sophisticated way to add complexity too early.

Google and the A2A project describe Agent2Agent as an open protocol for communication and interoperability between independent agentic systems. The protocol is designed so agents can discover capabilities, negotiate interaction modalities, and collaborate on long-running tasks without exposing internal state, memory, or tools. While Google Cloud documents how to host A2A agents on Cloud Run and Gemini Enterprise allows admins to register them, the Gemini feature is still in Preview (Google Cloud Documentation).

This makes A2A important, but not automatically urgent.

The practical question in 2026 is not "Should we support agent interoperability?" The better question is: "Do we have a real coordination problem between independent agent systems that justifies another protocol layer, another security surface, and another operating model?" This matters even more because the Model Context Protocol (MCP) is also maturing quickly, with a clear roadmap focused on standardizing tool and context access. Many teams are still solving a context problem, not an interoperability problem—and those are not the same thing (OpenAI GitHub).

A2A and MCP solve different problems

This is the first thing technical leaders need to get clear.

MCP is about standardizing how applications provide tools and context to models. OpenAI's current Agents SDK supports hosted MCP tools, Streamable HTTP MCP servers, and stdio MCP servers, and it explicitly says SSE is deprecated for new integrations. In other words, MCP is becoming the standard context and tool-access layer (OpenAI GitHub).

A2A is different. Its goal is not to expose tools to one model. Its goal is to let separate agents communicate and collaborate as peers, even when they are built on different frameworks, by different vendors, or on separate servers. Google Cloud's A2A overview and the A2A project documentation both make that clear (Google Cloud Documentation).

That distinction matters because many teams hear "interoperability" and assume they need A2A now.

Often they do not.

If the problem is still "how does this agent access tools, data, or systems," MCP is usually closer to the right answer. If the problem is "how do these separate agents coordinate with each other across system boundaries," then A2A starts to make sense (OpenAI GitHub).

When A2A genuinely helps

1. When independent agents need to coordinate across real boundaries

A2A is useful when you already have multiple independent agents or agentic applications that need to collaborate without collapsing into one monolithic orchestrator. The A2A project describes this clearly: the protocol exists to let opaque agentic applications communicate and collaborate without exposing their internal state, memory, or tools. That is a real need when systems are owned by different teams, vendors, or runtime environments (GitHub).

This is especially relevant when:

Different business units own different agents
Different vendors or frameworks are already in production
One agent needs to delegate a job to another agent rather than call a simple tool
The systems should remain separate for governance or organizational reasons

That is a real interoperability problem, not just a nicer integration story (GitHub).

2. When long-running, multi-step collaboration is the real workload

A2A is stronger when the work is not a one-shot tool call. The protocol is specifically described around collaborative tasks, long-running jobs, and negotiated modalities. That means it is better suited to agent-to-agent coordination patterns than to simple "fetch this document" or "run this command" cases (GitHub).

If your environment has one agent that gathers requirements, another that checks policy, and another that executes a specialized downstream step, interoperability can become more valuable than adding one more tool to one agent. That is where A2A starts to move from interesting to useful (GitHub).

3. When organizational separation matters as much as technical separation

A2A helps when the architecture needs to preserve boundaries. Google Cloud's A2A documentation emphasizes that agents can work together as peers without exposing their internal logic. That is not just a technical feature. It is an operating model choice. It allows one team or vendor to maintain ownership of an agent while still letting another system collaborate with it (Google Cloud Documentation).

This can matter when:

Procurement boundaries separate systems
Internal platform teams need to preserve ownership
Partner ecosystems matter
Regulated or sensitive workflows require separation of responsibility

In those cases, interoperability can be cleaner than forcing all logic into one platform (Google Cloud Documentation).

4. When you already know a single control plane is not enough

If your team has already reached the point where one orchestration layer cannot realistically own all the work, A2A becomes more compelling. Google's A2A positioning is explicitly about moving from isolated agents to interconnected ecosystems. That is not a day-one architecture. It is what becomes relevant after agent systems start to specialize (Google Cloud).

In other words, A2A helps after specialization becomes real. Not before.

When A2A just adds complexity

1. When the real problem is still tool access, not agent collaboration

This is the biggest source of confusion.

If your team is still figuring out how one agent accesses repos, tickets, documentation, databases, or internal APIs, that is usually an MCP or workflow-design problem, not an A2A problem. OpenAI's MCP documentation is already rich enough to show how much can be solved through tool access, approval flow, filtering, and transport choice before agent-to-agent coordination becomes necessary (OpenAI GitHub).

A2A adds a coordination layer. If the simpler problem is not solved yet, adding that layer usually makes the architecture more impressive without making it more effective (OpenAI GitHub).

2. When teams have not standardized one governed workflow yet

If your team cannot clearly explain:

What the agent is allowed to do
What requires approval
How review happens
What context is exposed
Who owns the workflow

then it is not ready to standardize interoperability.

This is an inference, but it is strongly grounded in the current product landscape. MCP itself is prioritizing governance maturation and enterprise readiness. Gemini Enterprise A2A registration is still in Preview. These are signals that the ecosystem is still working through the operational discipline required for broader production use (Model Context Protocol).

3. When preview-stage enterprise support is being mistaken for operational maturity

This one matters.

Gemini Enterprise lets admins register A2A agents, but the documentation clearly marks the feature as Preview and states that Model Armor does not protect conversations with registered A2A agents in the Gemini Enterprise web app. That does not make A2A unusable. It does mean technical leaders should not confuse ecosystem momentum with finished enterprise readiness (Google Cloud).

If your rollout depends on protections or governance assumptions that the preview surface does not yet guarantee, standardizing too early can create future rework (Google Cloud).

4. When the architecture is trying to solve politics with protocols

This is a subtle but common failure mode.

Sometimes teams reach for interoperability because different groups cannot agree on one platform, one workflow, or one owner. A2A can help with genuine boundary-preserving collaboration. It cannot fix unclear ownership, weak standards, or missing review design. If those problems are still unresolved, interoperability often becomes a protocol-shaped workaround for a management problem (GitHub).

The real decision is about coordination maturity

The best question to ask is not "Is A2A important?"

It is.

The better question is "What level of coordination maturity are we at?"

You are probably not ready to standardize A2A yet if:

You are still choosing the primary control plane
You have not standardized review and approval
Your context layer is still immature
MCP would solve most of the actual problem
Interoperability demand is hypothetical, not real

You may be ready to evaluate A2A seriously if:

Multiple independent agents already exist
They are owned by different teams, vendors, or systems
Long-running collaboration across boundaries is a real use case
One orchestrator is no longer an accurate model of the work
Governance and review are already stronger than the protocol layer itself

That is the line between architectural fit and premature complexity (GitHub).
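The two checklists above can be turned into a small self-assessment. The sketch below is only illustrative: the class name, the field names, and the rule that every signal must hold before evaluation are assumptions made for the example, not part of A2A or any vendor tooling.

```python
from dataclasses import dataclass, fields


@dataclass
class CoordinationSignals:
    """Hypothetical yes/no answers taken from the readiness checklists."""
    multiple_independent_agents: bool
    separate_ownership: bool
    long_running_cross_boundary_work: bool
    single_orchestrator_insufficient: bool
    governance_standardized: bool


def a2a_readiness(signals: CoordinationSignals) -> str:
    """Recommend evaluation only when every signal holds; otherwise name the gaps."""
    missing = [f.name for f in fields(signals) if not getattr(signals, f.name)]
    if not missing:
        return "evaluate A2A"
    return "not yet: " + ", ".join(missing)


# Example: everything is in place except governance.
print(a2a_readiness(CoordinationSignals(True, True, True, True, False)))
```

Returning the names of the missing signals, rather than a bare yes or no, keeps the conversation on what has to be fixed first.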

A practical decision lens for technical leaders

Here is the framework I would use.

Step 1: classify the real problem

Is this about:

Tool access
Context sharing
Workflow review
Agent coordination
Cross-boundary delegation

If it is the first three, A2A is probably too early. If it is the last two, it may be worth evaluating (OpenAI GitHub).

Step 2: ask whether the agents are truly independent

If one team owns everything and one orchestrator could reasonably manage it, interoperability may be unnecessary. If the systems are truly separate and should remain separate, A2A becomes more plausible (GitHub).

Step 3: check governance before protocol

Do not standardize interoperability before you standardize:

Review
Approval
Context boundaries
Ownership
Escalation paths

Preview-stage platform support and evolving roadmap signals make this even more important in 2026 (Google Cloud).

Step 4: prefer the smallest working architecture

If MCP plus one orchestrator solves the real problem, do that first. Only add A2A when the architecture genuinely needs peer-to-peer agent collaboration across boundaries (OpenAI GitHub).
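Read in order, the four steps form a short decision function. This is a sketch under stated assumptions: the `Problem` enum, the `decision_lens` name, and the boolean inputs are hypothetical, and the returned strings only paraphrase the steps above.

```python
from enum import Enum


class Problem(Enum):
    TOOL_ACCESS = "tool access"
    CONTEXT_SHARING = "context sharing"
    WORKFLOW_REVIEW = "workflow review"
    AGENT_COORDINATION = "agent coordination"
    CROSS_BOUNDARY_DELEGATION = "cross-boundary delegation"


# Step 1: only the last two problem types make A2A worth evaluating.
A2A_CANDIDATES = {Problem.AGENT_COORDINATION, Problem.CROSS_BOUNDARY_DELEGATION}


def decision_lens(problem: Problem,
                  agents_independent: bool,
                  governance_ready: bool,
                  mcp_plus_orchestrator_suffices: bool) -> str:
    if problem not in A2A_CANDIDATES:            # Step 1: classify the problem
        return "A2A is probably too early: fix tool access, context, or review first"
    if not agents_independent:                   # Step 2: true independence
        return "one orchestrator may be enough"
    if not governance_ready:                     # Step 3: governance before protocol
        return "standardize governance first"
    if mcp_plus_orchestrator_suffices:           # Step 4: smallest working architecture
        return "use MCP plus one orchestrator"
    return "evaluate A2A"


print(decision_lens(Problem.CROSS_BOUNDARY_DELEGATION, True, True, False))
```

The early returns mirror the ordering of the framework: a later step is only considered once the earlier ones have been passed.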

My take

Agent-to-agent interoperability is real.

It is also very easy to romanticize.

The strongest case for A2A is not "the future is multi-agent." That is too vague. The strongest case is much more practical: independent agents, owned in different places, need to collaborate on long-running work without collapsing into one brittle control plane. That is when interoperability earns its keep (GitHub).

For most teams in 2026, though, the more urgent work is still closer to home:

Define the workflow
Standardize review
Control context access
Design the primary lane
Decide whether MCP belongs in the stack

A2A becomes more useful after those questions are answered, not before (OpenAI GitHub).

Key takeaways

A2A helps when independent agent systems really need to collaborate across organizational, platform, or runtime boundaries, especially for long-running work where preserving separation matters. Google Cloud's A2A documentation and the A2A project both make that role clear (Google Cloud Documentation).

A2A adds complexity when teams are still solving simpler problems like tool access, workflow design, review logic, and context boundaries. In those cases, MCP or a clearer internal operating model is usually the better next move. Preview-stage enterprise support and explicit protection gaps in Gemini Enterprise make the timing question even more important (OpenAI GitHub).

Written by Dr. Hernani Costa | Powered by Core Ventures

Originally published at First AI Movers

A2A Protocol Standardization: The $2M Governance Gap CTOs Miss

Dr Hernani Costa — Thu, 28 May 2026 06:57:45 +0000

Most teams standardize A2A too early—and pay for it in coordination debt.

Agent-to-agent interoperability is entering production. Google Cloud now documents A2A deployment on Cloud Run, Gemini Enterprise lets admins register A2A agents, and the open protocol hit version 0.3 with gRPC support and signed security cards. But here's the signal technical leaders need to read correctly in 2026: meaningful momentum does not equal universal maturity. The real question is not "Is A2A important?" It is "What should we watch before we standardize it?"

A2A in 2026: What Technical Leaders Should Watch Before Standardizing It

TL;DR: A practical guide for CTOs on what to monitor before standardizing A2A in 2026, from preview risk and governance to enterprise readiness.

Agent-to-agent interoperability is getting more real. That does not mean your team should standardize it yet.

A2A is entering the part of the market where technical leaders can no longer dismiss it as a lab experiment.

Google Cloud now documents how to build and deploy A2A agents on Cloud Run, and Gemini Enterprise lets admins register A2A agents in the web app. At the same time, Google still marks that Gemini Enterprise capability as Preview, and the documentation explicitly says Model Armor does not protect conversations with registered A2A agents in the Gemini Enterprise web app. That is exactly the kind of mixed signal technical leaders need to read correctly in 2026: meaningful momentum, but not universal maturity.

Overview

The right question is not "Is A2A important?"

It is.

The better question is "What should we watch before we standardize it?" Google's own materials show real progress: A2A is positioned as an open protocol for communication between independent agentic systems, the project has an official open-source specification and SDKs, and Google announced version 0.3 with capabilities such as gRPC support and signed security cards. But those same official surfaces also show that enterprise product support is uneven, deployment still requires real infrastructure work, and at least some user-facing integrations remain Pre-GA. That means the practical decision in 2026 is not adoption versus rejection. It is whether your team has enough operational reason and governance discipline to move from watching to standardizing.

First, watch whether you have a real interoperability problem

This is the most important signal, and the easiest one to fake.

A2A makes sense when you already have independent agent systems that need to collaborate across real boundaries. The official A2A project describes the protocol as a way for agents built on different frameworks, by different vendors, and on separate servers to communicate and collaborate as agents, not just as tools. If your environment still looks like one orchestrator plus a few internal tools, you probably do not have an A2A problem yet. You have a workflow or context-access problem.

Second, watch protocol maturity rather than protocol enthusiasm

A lot of protocol narratives get ahead of production reality.

What matters more is whether the spec and implementation story are becoming stable enough to build against. Google's July 2025 update is important here because it announced A2A protocol version 0.3 as a more stable interface for enterprise adoption, with gRPC support, signed security cards, and broader SDK support. That is a real maturity signal. It does not mean the protocol is "finished." It does mean the project is moving beyond conceptual demos toward repeatable implementation.

The practical takeaway is simple: do not standardize on a protocol because the idea is elegant. Standardize when the specification, SDKs, and deployment paths are stable enough that your team is not becoming the maturity program for the protocol itself.

Third, watch the difference between protocol support and enterprise readiness

This is where technical leaders need to stay disciplined.

Google Cloud documents A2A agent deployment on Cloud Run, and Gemini Enterprise lets admins register A2A agents. But the Gemini Enterprise A2A feature is still explicitly labeled Preview, subject to Pre-GA terms, and the docs warn that Model Armor does not protect conversations with registered A2A agents. The same product family also requires admin roles, Discovery Engine API enablement, agent card JSON, and hosting/maintenance responsibility on the customer side. Those are all signs that interoperability is becoming real, but the enterprise convenience layer is not yet frictionless.

A mature buyer should read that as follows:

the direction is real
the deployment burden is real
the governance burden is still yours
the safety envelope is not fully abstracted away yet.
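To make the "agent card JSON" requirement mentioned above more concrete, here is a rough sketch of what such a card might contain and how a team could sanity-check it before registration. The field names reflect my reading of the public A2A specification and should be verified against the protocol version you target; the agent, URL, and skill are invented for illustration, and Gemini Enterprise may expect additional fields.

```python
import json

# Field names assumed from the public A2A specification; verify before use.
ASSUMED_REQUIRED_FIELDS = (
    "name", "description", "url", "version",
    "capabilities", "defaultInputModes", "defaultOutputModes", "skills",
)

# Hypothetical agent used only for illustration.
agent_card = {
    "name": "Policy Checker",
    "description": "Checks a proposed change against internal policy.",
    "url": "https://agents.example.com/policy-checker",  # placeholder endpoint
    "version": "0.1.0",
    "capabilities": {"streaming": False},
    "defaultInputModes": ["text/plain"],
    "defaultOutputModes": ["text/plain"],
    "skills": [
        {
            "id": "check-policy",
            "name": "Check policy",
            "description": "Returns pass or fail with a short reason.",
            "tags": ["governance"],
        }
    ],
}


def missing_fields(card: dict) -> list:
    """List the assumed-required fields that the card does not define."""
    return [name for name in ASSUMED_REQUIRED_FIELDS if name not in card]


card_json = json.dumps(agent_card, indent=2)
print(missing_fields(agent_card))
```

Even a card this small implies ownership decisions: someone has to host the endpoint behind `url`, version the skills, and keep the description accurate.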

Fourth, watch whether your governance model is stronger than the protocol layer

This is the hidden gate.

If your team has not yet standardized:

what agents are allowed to do
how review works
what context they can access
who owns each workflow
when one system is allowed to delegate to another

then A2A is probably too early.

This is not because A2A is bad. It is because interoperability multiplies coordination surfaces. The A2A project is about agent discovery, modality negotiation, long-running tasks, and peer collaboration. That is powerful. It also means more places where ownership, approval, escalation, and trust can become ambiguous if your operating model is still weak.

Fifth, watch whether MCP is still the more urgent standardization problem

Many teams are not ready for A2A because they are still solving a simpler layer.

OpenAI's current Agents SDK makes MCP practical in several modes: hosted MCP tools, Streamable HTTP MCP servers, and stdio MCP servers. The SDK also treats approval flow and tool filtering as normal parts of the implementation. In other words, MCP is already the more concrete answer when the real problem is how one agent reaches tools, systems, or documents safely. If you have not yet standardized that context layer, A2A may be the wrong layer to focus on first.

The clean rule is this:

if the problem is tool and context access, watch MCP first
if the problem is independent agent collaboration across boundaries, then A2A deserves serious attention.
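The tool filtering and approval flow described above can be illustrated without any SDK. The sketch below is library-independent and deliberately not the OpenAI Agents SDK API: `ToolPolicy`, `check_tool_call`, and the tool names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolPolicy:
    """Hypothetical policy: which tools an agent may call, and which need sign-off."""
    allowed: frozenset
    needs_approval: frozenset = frozenset()


def check_tool_call(policy: ToolPolicy, tool_name: str, approved: bool = False) -> str:
    """Return 'blocked', 'pending approval', or 'allowed' for one requested call."""
    if tool_name not in policy.allowed:
        return "blocked"                 # filtered out: the agent never sees this tool
    if tool_name in policy.needs_approval and not approved:
        return "pending approval"        # a human must approve before execution
    return "allowed"


policy = ToolPolicy(
    allowed=frozenset({"search_docs", "read_ticket", "merge_pull_request"}),
    needs_approval=frozenset({"merge_pull_request"}),
)
print(check_tool_call(policy, "merge_pull_request"))
```

If a team cannot yet write down a policy like this for a single agent, that is a sign the context layer needs attention before the interoperability layer.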

Sixth, watch deployment fit, not just protocol support

Google's A2A materials are useful because they show the deployment story clearly.

Cloud Run is already documented for A2A hosting. Google also describes Cloud Run, GKE, and Agent Engine as deployment paths in its broader A2A update. That matters because the real operational question is not whether A2A exists. It is whether your organization wants to host, monitor, secure, debug, and scale agent endpoints as part of its actual operating model.

That is a much harder question than "does the protocol have momentum?"

Seventh, watch whether vendor support is getting deeper or just louder

The protocol is clearly getting louder.

Google's official blog said in July 2025 that A2A had support from more than 150 organizations and highlighted expanding deployment, evaluation, marketplace, and partner paths. That is a meaningful ecosystem signal. But for a technical buyer, the better question is not partner count. It is support depth:

real SDK maturity
real deployment guides
real enterprise controls
real evaluation tooling
real security and governance features.

That is why "watching A2A" in 2026 should mean tracking capability depth, not just conference momentum.

What I would tell a CTO to monitor over the next quarter

If I were advising a technical leader right now, I would track five watchpoints.

1. Stable specification and SDK trajectory: Has the protocol stabilized enough that your team can build without constant adaptation? Version 0.3 and multi-language SDK signals are good signs, but you should still monitor change velocity and release notes.
2. Enterprise product hardening: Do A2A surfaces move from Preview toward stronger GA-like controls? Watch Gemini Enterprise documentation closely here.
3. Governance gap closure: Do the platform docs reduce current caveats, especially around protection layers such as Model Armor and around admin and hosting burden?
4. Real customer patterns: Google's official blog is already citing customer and partner examples such as Tyson, Gordon Food Service, Adobe, Box, ServiceNow, and Twilio. That is useful, but you should watch for patterns that resemble your own architecture, not just big-name logos.
5. Internal coordination maturity: Can your own team already govern one agent lane well? If not, do not standardize a protocol for coordinating many of them. This last point is an inference, but it is strongly supported by the gap between A2A's peer-collaboration ambitions and the still-preview state of some enterprise surfaces.

My take

A2A is worth watching seriously in 2026.

But most teams should still treat it as a watchlist architecture decision, not a default standard.

The strongest reason to standardize A2A is not that the protocol is fashionable. It is that your organization already has independent agent systems that genuinely need to collaborate across boundaries, and your governance model is strong enough to support that. Until those conditions are true, A2A usually adds another abstraction layer faster than it creates operational value.

Key takeaways

A2A is maturing. Google Cloud documents deployment and registration paths, the open-source protocol has a public specification and SDKs, and Google's own 2025 update signaled stronger enterprise-oriented progress with version 0.3, gRPC support, signed security cards, and a growing ecosystem.

That still does not mean most teams should standardize it now. The practical test is whether your problem is truly agent-to-agent coordination across boundaries, whether your governance is already stronger than the protocol layer, and whether preview-stage enterprise support is mature enough for your risk tolerance. If not, keep watching, strengthen the stack underneath, and let interoperability wait until it is actually deserved.

EU AI Act Compliance: 11 Questions CTOs Must Answer Before 2026

Dr Hernani Costa — Wed, 27 May 2026 06:57:51 +0000

Scaling agentic workflows without EU AI Act readiness is a $500k+ governance debt waiting to happen. Technical leaders who postpone compliance architecture until August 2026 will face forced refactoring, audit friction, and operational risk that could have been designed out in Q2 2025.

EU AI Act Questions Technical Leaders Should Answer Before Scaling Agentic Workflows

TL;DR: A practical guide for CTOs and technical leaders on the EU AI Act questions to answer before scaling agentic workflows in 2026.

The AI Act does not ask whether your team uses "agents." It asks what the system does, who controls it, what risks it creates, and whether your operating model is strong enough to govern it.

A lot of teams are about to make a timing mistake. They assume the EU AI Act is either already fully "live" for everything or still too far away to matter for engineering workflows. Neither is right.

The AI Act entered into force on August 1, 2024. Prohibited practices and AI literacy obligations have applied since February 2, 2025. GPAI obligations have applied since August 2, 2025. The Act becomes broadly applicable on August 2, 2026, with some high-risk rules for AI embedded in regulated products applying on August 2, 2027. The Commission's own FAQ also notes that a November 2025 Digital Omnibus proposal is under consideration to adjust the timing for some high-risk rules because standards are delayed.

So the practical question for technical leaders in April 2026 is not whether to care. It is what must be clarified before you scale.
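The application dates listed above are easy to misremember, so it can help to keep them in one small lookup. This is a minimal sketch using only those dates; the function names are hypothetical, and the table would need updating if the Digital Omnibus proposal changes any timing.

```python
from datetime import date

# Milestones as stated above, in chronological order.
MILESTONES = [
    (date(2024, 8, 1), "entry into force"),
    (date(2025, 2, 2), "prohibited practices and AI literacy obligations"),
    (date(2025, 8, 2), "GPAI obligations"),
    (date(2026, 8, 2), "broad applicability"),
    (date(2027, 8, 2), "high-risk rules for AI embedded in regulated products"),
]


def applicable_on(day: date) -> list:
    """Return the milestones that have already taken effect on the given day."""
    return [label for start, label in MILESTONES if day >= start]


def next_milestone(day: date):
    """Return the next (date, label) pair after the given day, or None."""
    upcoming = [(start, label) for start, label in MILESTONES if start > day]
    return upcoming[0] if upcoming else None


print(applicable_on(date(2026, 4, 15)))
print(next_milestone(date(2026, 4, 15)))
```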

The AI Act does not create a special legal bucket called "agentic workflows." It classifies AI systems by intended purpose and risk. That means a coding agent, a workflow agent, or a multi-agent setup may fall into very different compliance positions depending on what it actually does. If the workflow stays in low-risk internal engineering assistance, the compliance burden may be relatively light. If the same workflow is used in employment, access to essential services, insurance, credit, public services, or other Annex III areas, the burden changes materially.

The right leadership question is not "Are agents compliant?" It is "Which use cases are we scaling, what role are we playing, and what obligations follow from that?"

1. What is the intended purpose of this workflow?

This is the first question because the AI Act's classification logic starts with intended purpose. The Commission's FAQ says high-risk classification depends on the function performed by the AI system and the specific purpose and modalities for which it is used. The same model or workflow can be low-risk in one context and high-risk in another. An internal engineering assistant is a very different legal object from a system used to filter job applicants, assess creditworthiness, or support access to healthcare.

For technical leaders, that means architecture reviews should begin with a use-case inventory, not a model inventory.
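A use-case inventory can start as a very small data structure. The sketch below is an assumption-heavy illustration, not a legal classification tool: the `UseCase` fields are hypothetical, and the sensitive areas are only the examples named in this article, not the full Annex III list. A flag here means "send to legal review," nothing more.

```python
from dataclasses import dataclass

# Example areas taken from this article; NOT the complete Annex III list.
SENSITIVE_AREAS = {
    "employment", "essential services", "insurance", "credit", "public services",
}


@dataclass
class UseCase:
    """Hypothetical inventory entry, keyed by what the workflow is for."""
    name: str
    intended_purpose: str
    areas: set  # business domains the workflow touches


def needs_classification_review(use_case: UseCase) -> bool:
    """Flag a use case for legal review when it touches any listed area."""
    return bool(use_case.areas & SENSITIVE_AREAS)


inventory = [
    UseCase("code-review-agent", "suggest fixes on internal pull requests", {"engineering"}),
    UseCase("cv-screening-agent", "rank job applicants", {"employment"}),
]
flagged = [u.name for u in inventory if needs_classification_review(u)]
print(flagged)
```

The inventory is keyed by intended purpose, not by model, which matches the classification logic described above: the same model can appear in both a flagged and an unflagged entry.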

2. Are we acting as provider, deployer, or both?

This sounds legal, but it is operational. The Commission's AI Act materials distinguish obligations for providers of high-risk systems, obligations for deployers of high-risk systems, and obligations for providers of GPAI models. Providers of high-risk systems must handle requirements such as risk management, documentation, traceability, transparency, human oversight, robustness, and conformity assessment. Deployers of high-risk systems must use systems according to instructions, assign human oversight, monitor operation, and act on risks or serious incidents.

That means a technical leader needs to know whether the organization is merely using a vendor system, materially modifying it, or effectively creating and putting its own system into service.

3. Does any workflow fall into a prohibited or clearly sensitive category?

This question matters before scale, not after. The Commission published prohibited-practices guidance in February 2025 and says the AI Act classifies certain uses as unacceptable, while others are high-risk or subject to transparency rules. The prohibition guidance specifically points to harmful manipulation, social scoring, and certain biometric practices among the unacceptable categories.

For most engineering teams, the practical implication is simple: do not assume "internal" means irrelevant. If any agentic workflow moves into sensitive decision support or high-risk domain use, the classification needs to be reviewed early.

4. If the workflow is high-risk, do we have the basics the Act expects?

The Commission's overview of high-risk requirements is unusually practical. High-risk AI systems need risk management, high-quality datasets where relevant, logging for traceability, technical documentation, sufficient transparency for deployers, human oversight, and appropriate levels of robustness, cybersecurity, and accuracy. Providers must also conduct conformity assessment and maintain lifecycle responsibility.

For technical leaders, this maps directly into system design:

Logging architecture
Review design
Documentation standards
Testing and evaluation
Security controls
Human override paths

This is why compliance is not just a legal workstream. It is architecture.

5. Do we have a real human oversight model, or just a human somewhere near the workflow?

Article 14 and the Commission FAQ both make clear that human oversight is not symbolic. Oversight must be designed so natural persons can effectively oversee the system during use, and deployers of high-risk systems must assign people with the necessary competence, training, authority, and support.

That means technical leaders should be able to answer:

Who reviews outputs?
Who can stop or override the workflow?
Who is accountable for exceptions?
Does the oversight point happen before action, before merge, or after deployment?

If the answer is "someone will probably look at it," the workflow is not ready.
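One way to test whether the oversight model is real is to require that every workflow record the answers to the four questions above. The following sketch assumes a hypothetical `OversightModel` record and three oversight points taken from the last question; it illustrates the check and is not a compliance artifact.

```python
from dataclasses import dataclass
from typing import Optional

# Oversight points named in the questions above.
OVERSIGHT_POINTS = {"before_action", "before_merge", "after_deployment"}


@dataclass
class OversightModel:
    """Hypothetical per-workflow record of who oversees what, and when."""
    reviewer: Optional[str]
    override_owner: Optional[str]
    exception_owner: Optional[str]
    oversight_point: Optional[str]


def oversight_gaps(model: OversightModel) -> list:
    """Return unanswered questions; an empty list means all four are answered."""
    gaps = []
    if not model.reviewer:
        gaps.append("no named reviewer")
    if not model.override_owner:
        gaps.append("nobody can stop or override")
    if not model.exception_owner:
        gaps.append("no owner for exceptions")
    if model.oversight_point not in OVERSIGHT_POINTS:
        gaps.append("oversight point not defined")
    return gaps


print(oversight_gaps(OversightModel("alice", "bob", None, "sometime")))
```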

6. Are we collecting the logs and documentation we would need later?

The Act's high-risk logic repeatedly points to traceability, logging, technical documentation, and instructions for use. The Commission's summary of high-risk requirements and the text of Articles 12 to 14 both reinforce that logs, deployer information, and human-oversight support are part of the system requirements, not optional extras.

Translated into engineering practice, that means you should know:

What the agent did
What inputs and outputs mattered
Which tools or systems it touched
What approvals occurred
How a reviewer could reconstruct the decision path
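The five items above translate fairly directly into a structured log record. The sketch below is one possible shape, assuming JSON Lines output; the `AgentAuditRecord` fields, including `run_id` and `step` for reconstructing the order of events, are hypothetical and not prescribed by the Act.

```python
import json
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone


@dataclass
class AgentAuditRecord:
    """Hypothetical record of one agent step, written as a single JSON line."""
    run_id: str          # groups all steps of one workflow run
    step: int            # ordering within the run, for reconstructing the path
    agent_id: str
    action: str          # what the agent did
    inputs: dict         # inputs that mattered
    outputs: dict        # outputs that mattered
    tools_touched: list  # tools or systems it touched
    approvals: list      # who approved, if anyone
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def to_log_line(record: AgentAuditRecord) -> str:
    """Serialize a record to one line of JSON with stable key order."""
    return json.dumps(asdict(record), sort_keys=True)


record = AgentAuditRecord(
    run_id="run-1", step=1, agent_id="policy-checker", action="check_policy",
    inputs={"ticket": "T-1"}, outputs={"result": "pass"},
    tools_touched=["search_docs"], approvals=["alice"],
)
print(to_log_line(record))
```

Sorting records by `run_id` and `step` is what lets a reviewer replay the decision path later, which is the practical meaning of traceability here.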

This is also why the best AI dev stack starts with review design, not model choice.

7. Are our staff and operators AI-literate enough for the workflows we are scaling?

This is the most underestimated obligation because it already applies. The Commission's AI literacy FAQ states that Article 4 requires providers and deployers of AI systems to ensure a sufficient level of AI literacy for staff and other people dealing with AI systems on their behalf, taking into account technical knowledge, experience, education, training, and the context of use. This has applied since February 2, 2025.

That means a technical leader should ask:

Who is actually operating or supervising these workflows?
Do they understand the system's limits?
Do reviewers know what to look for?
Do managers know what they are approving?

You cannot outsource that requirement to the vendor.

8. If we rely on GPAI models, what do we need from vendors now?

The AI Act's GPAI obligations have already applied since August 2, 2025. The Commission says providers of GPAI models must prepare technical documentation, implement a copyright policy, and publish a summary of training content, with extra obligations for GPAI models with systemic risk such as risk mitigation, incident reporting, and cybersecurity. The Commission also recognizes the GPAI Code of Practice as an adequate voluntary tool for providers that choose to sign it.

For technical buyers, that means vendor due diligence should now include:

What documentation the vendor provides
Whether the provider follows the GPAI code or equivalent
What copyright and training-data disclosures exist
How incidents and systemic-risk issues are handled

This is not abstract policy. It is procurement hygiene.

9. Do transparency obligations affect our workflow design?

Yes, and the timing matters. The Commission's AI Act FAQ says Article 50 transparency obligations apply to certain interactive and generative systems, including chatbots and deepfakes, and become applicable on August 2, 2026. Providers of AI systems that directly interact with people must inform them they are interacting with AI unless that is obvious from the context. Providers of generative AI systems must mark outputs in machine-readable form. Deployers of deepfake systems and certain public-interest text-generation uses also have disclosure obligations, subject to exceptions.

For technical leaders, that means if agentic workflows produce public-facing content, customer-facing interactions, or manipulated media, disclosure and labeling need to be part of product and workflow design now, not added later.
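
One way to make labeling part of workflow design is to have the workflow return a labeled object instead of a bare string. The sketch below assumes a hypothetical `LabeledOutput` wrapper with made-up field names and an illustrative notice text. It shows the design idea only. It does not define what Article 50 requires.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class LabeledOutput:
    text: str
    ai_generated: bool   # machine-readable flag (assumed field name)
    generator: str       # hypothetical workflow identifier
    notice: str          # human-readable disclosure shown with the content


def label_output(text: str, generator: str) -> LabeledOutput:
    """Wrap generated text so the label travels with it downstream."""
    return LabeledOutput(
        text=text,
        ai_generated=True,
        generator=generator,
        notice="This content was generated with AI.",
    )


def to_json(item: LabeledOutput) -> str:
    """Serialize the content and its label together for publishing systems."""
    return json.dumps(asdict(item), sort_keys=True)


reply = label_output("Your order ships on Monday.", generator="support-agent-v1")
print(reply.notice)
print(to_json(reply))
```

The design choice is that unlabeled text never leaves the workflow, so disclosure does not depend on someone remembering to add it later.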

10. If we are a public body or in a sensitive use case, do we owe a fundamental rights impact assessment?

Sometimes yes. The Commission's FAQ says deployers that are bodies governed by public law or private operators providing public services, as well as operators using certain high-risk systems for creditworthiness or life and health insurance pricing/risk assessment, must perform a fundamental rights impact assessment before first use. The FAQ also notes that this may need to be aligned with a data protection impact assessment.

This matters because many technical leaders still think impact assessment is purely a privacy-team activity. Under the AI Act, it can become part of deployment readiness.

11. Are we waiting for standards, or do we already know enough to act?

This is where many teams hesitate. The Commission's AI Act materials note that harmonized standards are still under development and that delays have prompted the November 2025 Digital Omnibus proposal to consider linking some high-risk application timing to support measures such as standards or guidelines. But the same official materials already give enough direction on classification, human oversight, documentation, logging, transparency, deployer obligations, GPAI duties, and AI literacy to justify internal preparation now.

So the right move in April 2026 is not to freeze. It is to tighten readiness.

A Practical Framework for Technical Leaders

Before scaling agentic workflows, I would want written answers to these:

What is the intended purpose of each workflow?
Is any use case plausibly high-risk or prohibited?
Are we provider, deployer, or both for this system?
What review and human oversight model exists today?
What logs and documentation can we produce if challenged?
Who is trained enough to operate and supervise this?
What do we require from GPAI vendors contractually and operationally?
Will any transparency obligations apply by August 2, 2026?
Do any deployments trigger a fundamental rights impact assessment?
Are we scaling faster than our governance model?

Those are not legal trivia. They are system-design questions with legal consequences.

My Take

Most technical teams do not need a legal memo first. They need a compliance-shaped architecture conversation. First AI Movers helps EU SMEs map AI governance into operational workflows—turning regulatory risk into architectural discipline.

The AI Act is forcing a discipline many teams should have had anyway: clearer use-case boundaries, stronger oversight, better logs, tighter documentation, better vendor due diligence, and a more explicit distinction between experimentation and scale. By April 2026, enough of the Act is already in force, and enough of the August 2, 2026 obligations are clear, that waiting passively is the wrong move.

Key Takeaways

The AI Act does not regulate "agents" as a special class. It regulates AI systems based on intended purpose, role, and risk. That means technical leaders need to classify workflows properly, identify whether they are providers or deployers, and understand which obligations are already in force now versus which ones become broadly applicable on August 2, 2026.

The practical work before scale is not abstract legal interpretation. It is architecture, review design, logging, training, transparency planning, vendor due diligence, and governance maturity. Teams that answer those questions early will move faster and more safely than teams that postpone them until rollout is already underway.

Written by Dr Hernani Costa | Powered by Core Ventures

Originally published at First AI Movers.

Technology is easy. Mapping it to P&L is hard. At First AI Movers, we don't just write code; we build the 'Executive Nervous System' for EU SMEs.

Is your architecture creating technical debt or business equity?


Our AI Readiness Assessment helps CTOs and VPs of Engineering answer these 11 questions before governance becomes a blocker. We map EU AI Act compliance into your development operations, so scaling doesn't mean rearchitecting.

Environment Readiness Decides AI Delivery, Not Agent Quality

Dr Hernani Costa — Tue, 26 May 2026 06:57:46 +0000

The $2M mistake most CTOs make: blaming the agent when the environment is broken.

In 2026, the difference between an impressive demo and a working AI delivery system is rarely the agent. It is the environment the agent has to operate in.

A lot of teams are still diagnosing the wrong problem. The agent misses a step, writes weak code, fails a task, or gets stuck in a loop, and the immediate reaction is predictable: maybe the model is not strong enough, maybe the tool is overhyped, maybe we picked the wrong vendor.

Sometimes that is true. More often, it is not.

Factory's Agent Readiness framing is blunt about this: teams often blame the model, switch agents, and get the same weak results because "the agent is not broken. The environment is." Their framework measures repositories across technical pillars like style and validation, build systems, testing, documentation, dev environment, code quality, observability, and security and governance. That is a much more useful way to think about AI delivery in 2026.

The market is quietly admitting that environment quality now decides outcomes

One of the clearest signals in 2026 is that vendors are shipping more controls around behavior, not just more intelligence.

OpenAI is not just selling "smarter code." Codex is positioned as a command center for agents, with shared skills and parallel work. GitHub is not just selling generation. Copilot coding agent is built around reviewable pull requests and outcome measurement. Anthropic is not just selling a terminal agent. Claude Code now exposes a settings hierarchy with enterprise-managed policy, team-shared settings, user settings, and explicit allow, ask, and deny rules for tool use. That product direction tells you where the real battle is: not only model quality, but whether teams can create repeatable, governable environments for AI work.

Why great agents still fail in bad environments

A strong agent still performs poorly when the surrounding system is weak.

If build steps depend on tribal knowledge, the agent wastes cycles guessing. If tests are slow or missing, the feedback loop collapses. If docs are stale, the agent pulls the wrong assumptions into the task. If permissions are loose, the agent can do too much in the wrong place. If review is informal, weak output slips through or good output becomes expensive to validate.

Factory's readiness model is useful precisely because it treats these as environment failures, not agent failures. It organizes readiness around practical pillars that determine whether autonomous or semi-autonomous work is even feasible. The point is not that agents are useless. The point is that environments can make useful agents look broken.

Old engineering truths still decide agent performance

This is where the industry keeps overcomplicating the message.

AI delivery in 2026 still depends on old engineering fundamentals:

Measure before optimizing
Keep structures simple
Standardize what good looks like
Make the build reproducible
Keep review explicit
Make the runtime observable
Treat data and context structure as first-class

That is exactly why readiness frameworks feel so grounded. Factory's maturity model moves from functional to documented to standardized to optimized to autonomous. In other words, autonomy does not arrive because you bought an agent. It arrives because the environment became legible enough to support it.

What environment readiness actually means

For most teams, environment readiness has six concrete parts.

1. Fast feedback loops

Agents need tight feedback. Linters, type checkers, test suites, and pre-commit checks reduce wasted cycles and help the agent converge faster. Factory explicitly treats style and validation, build systems, and testing as foundational pillars because without them, agents keep failing on issues that should be caught in seconds.
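
A minimal sketch of a fail-fast check runner, assuming nothing about Factory's tooling: it runs the cheapest checks first and stops at the first failure, so the agent gets a signal in seconds. The check names and the stand-in commands are placeholders. In a real repository they would be your own linter, type checker, and test commands.

```python
import subprocess
import sys


def run_checks(checks: list[tuple[str, list[str]]]) -> tuple[bool, str]:
    """Run checks in order and stop at the first failure.

    Returns (True, "") if everything passed, else (False, failing_check_name).
    """
    for name, command in checks:
        result = subprocess.run(command, capture_output=True, text=True)
        if result.returncode != 0:
            return False, name
    return True, ""


# Stand-in commands so the sketch runs anywhere; replace with real tools.
PY = sys.executable
checks = [
    ("lint", [PY, "-c", "print('lint ok')"]),
    ("types", [PY, "-c", "import sys; sys.exit(1)"]),   # simulated failure
    ("tests", [PY, "-c", "print('tests ok')"]),
]
print(run_checks(checks))
```

Ordering matters here: the slow test suite never runs when a fast check has already failed.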

2. Written instructions instead of hidden tribal knowledge

A readable environment beats a "smart" agent every time.

GitHub now supports repository-wide Copilot instructions and AGENTS.md for agent workflows. Claude Code uses CLAUDE.md and shared project settings. Factory also treats documentation as one of the core readiness pillars and publishes guidance for AGENTS.md structure. These are all variations of the same lesson: the environment gets stronger when expectations are encoded, not remembered.

3. Explicit review design

A team is not environment-ready if AI review is still vague.

GitHub says Copilot-created pull requests should be reviewed thoroughly before merge. Copilot code review itself is configurable and can automatically review pull requests. OpenAI's Codex app is built around reviewing diffs and supervising long-running work. Strong environments design the review path in advance. Weak environments hope someone catches issues later.

4. Permissions and boundaries

Claude Code's settings make this especially clear. Teams can define allow, ask, and deny rules, block access to secrets and environment files, and enforce enterprise-managed policy that users cannot override. That is environment readiness in practice: the agent is powerful, but the environment sets the boundaries.
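
The allow, ask, and deny idea can be sketched as a small rule evaluator. This is not Claude Code's configuration format or its actual matching logic. The rule strings, the deny-then-ask-then-allow precedence, and the default of asking for anything unmatched are assumptions chosen for illustration.

```python
from fnmatch import fnmatchcase

# Hypothetical rules in "verb:target" form.
RULES = {
    "deny": ["read:.env", "read:secrets/*", "run:rm *"],
    "ask": ["run:git push*", "write:*"],
    "allow": ["read:*", "run:pytest*"],
}


def decide(request: str, rules: dict[str, list[str]] = RULES) -> str:
    """Return 'deny', 'ask', or 'allow' for a request such as 'read:src/app.py'.

    The strictest matching verdict wins; unmatched requests fall back to 'ask'.
    """
    for verdict in ("deny", "ask", "allow"):
        patterns = rules.get(verdict, [])
        if any(fnmatchcase(request, pattern) for pattern in patterns):
            return verdict
    return "ask"


for request in ["read:.env", "read:src/app.py", "write:README.md",
                "run:pytest -q", "run:curl example.com"]:
    print(request, "->", decide(request))
```

Checking deny rules first means a broad allow such as `read:*` cannot accidentally expose a secrets file.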

5. Observability and measurement

This is where most teams still underinvest.

Factory treats observability as a core readiness pillar, and GitHub now includes guidance on measuring pull-request outcomes for coding-agent use. That matters because teams that do not measure rework, review burden, and exception rates often mistake output volume for progress.
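
A minimal sketch of those measurements, assuming a hypothetical `AgentPR` record per agent-created pull request. The field names and the metric definitions are my own illustration, not GitHub's or Factory's.

```python
from dataclasses import dataclass


@dataclass
class AgentPR:
    merged: bool
    review_minutes: int    # human time spent reviewing
    followup_fixes: int    # later commits needed to repair the change
    escalated: bool        # a human had to take the task over


def summarize(prs: list[AgentPR]) -> dict[str, float]:
    """Compute outcome metrics instead of counting raw output volume."""
    total = len(prs)
    if total == 0:
        return {}
    merged = [p for p in prs if p.merged]
    reworked = sum(1 for p in merged if p.followup_fixes > 0)
    return {
        "merge_rate": len(merged) / total,
        "rework_rate": reworked / len(merged) if merged else 0.0,
        "avg_review_minutes": sum(p.review_minutes for p in prs) / total,
        "exception_rate": sum(1 for p in prs if p.escalated) / total,
    }


# Illustrative data only.
sample = [
    AgentPR(True, 20, 0, False),
    AgentPR(True, 45, 2, False),
    AgentPR(False, 30, 0, True),
    AgentPR(True, 25, 1, False),
]
print(summarize(sample))
```

Four pull requests look like progress by volume. In this sample, two of the three merged ones needed rework, which is the number worth watching.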

6. Security and governance

Readiness is not complete until the environment can prevent the wrong work from becoming normal work.

Factory includes security and governance as a core pillar. GitHub exposes org and enterprise controls for Copilot. Claude Code supports managed policy. The pattern is clear: agent performance is now inseparable from governance quality.

The easiest mistake to make

The easiest mistake is to keep treating agent performance like an isolated tooling problem.

That produces the wrong behavior:

Switch the tool
Try another model
Buy another seat
Add another lane
Keep the environment the same

Then the team is surprised when the same class of problems returns.

That is one reason "tool sprawl" has become so expensive. If the environment remains weak, every new tool just introduces another surface for the same underlying failure. This is why your stack decision and your readiness decision are now tightly connected. A weak environment turns optionality into noise. A strong environment turns even modest agent capability into leverage.

What CTOs should fix first

If I were advising a technical leader right now, I would focus on this order:

Build and test clarity: Make sure the agent can actually build, validate, and check its own work.
Instruction quality: Write down how the repo works, what standards matter, and what should never happen.
Review model: Define what gets reviewed, by whom, and where the approval checkpoint lives.
Permission boundaries: Constrain what the agent can read, run, and change.
Observability: Measure whether the workflow is getting better or just getting busier.

That sequence is more valuable than chasing one more model upgrade because it improves the environment every future agent will inherit. Factory's maturity framing supports this directly: most teams should aim at a "standardized" environment before dreaming about full autonomy.

My take

Often enough, the agent is not the broken part, so technical leaders should check for environment failure first.

That does not mean the model never matters. It means the faster commercial win usually comes from strengthening the environment: better validation, better docs, better review, better permissions, better observability, better shared instructions.

That is also why the consulting opportunity is changing. Teams do not just need recommendations on which tool to buy. They need help making their environments agent-ready through AI readiness assessment and workflow automation design. The teams that understand this early will get more value from the same generation of tools than teams that keep buying more capability into weak systems.

Key takeaways

The most important shift in AI delivery is not just stronger agents. It is that environment quality now decides whether those agents can produce repeatable business value. Factory's readiness model makes that explicit, and the current product direction across OpenAI, GitHub, and Anthropic supports it through shared skills, repository instructions, review workflows, managed settings, and permission boundaries.

That means the next question for technical leaders is not only "Which agent should we use?" It is "What kind of environment are we giving that agent to work in?" Teams that answer that well will outperform teams still trapped in vendor-switching mode.

Metacognition: The Self-Correction Layer AI Rollouts Miss

Dr Hernani Costa — Mon, 25 May 2026 06:57:43 +0000

When AI adoption stalls, the problem is rarely the model—it's your organization's inability to inspect its own thinking.

The teams adapting fastest to AI are not just using better tools. They are inspecting, correcting, and updating their own decisions faster than everyone else.

A lot of AI rollouts fail for a surprisingly human reason: the organization cannot see its own thinking clearly enough to improve it.

Cognitive science uses the term metacognition for monitoring and evaluating one's own thinking, including confidence, uncertainty, and decision adjustment. Neuroscience research links metacognitive processing to prefrontal systems, including anterior prefrontal regions. That does not make metacognition mystical or a mark of rare genius. It makes it practical: it is the capacity to inspect your own judgment instead of blindly defending it.

That matters more in AI rollouts than many leaders realize.

Because the teams that scale AI well are not just better at prompting. They are better at noticing weak assumptions, catching bad rollout habits, questioning the wrong metrics, and updating how they work before the damage compounds.

Most AI adoption problems are not caused by a total lack of capability. They come from weak organizational self-correction. NIST's AI Risk Management Framework is built around governance, mapping, measurement, and management because trustworthy AI use depends on evaluation and iterative risk handling, not just access to models. Factory's "Agent Readiness" work makes the same point in engineering terms: teams often blame the model, but the real issue is the environment around it.

This is where metacognition becomes commercially useful. Not as pop psychology, but as an operating capability.

Metacognition, Translated for Technical Leaders

In research terms, metacognition is "cognition about cognition." It shows up when a person monitors uncertainty, evaluates confidence, and revises a decision instead of simply executing the first response.

For a technical organization, the parallel is straightforward:

Noticing that the rollout metric is wrong
Realizing the agent is failing because the environment is weak
Seeing that review is too informal for the level of autonomy being introduced
Admitting that the team is scaling tool access faster than workflow discipline
Revising the operating model instead of defending the original plan

That is organizational metacognition.

I am using that as an operational analogy, not as a literal neuroscience claim. But it is a useful one, because it explains why some teams learn faster than others from the same AI tools.

Why This Matters More Now

The current product surface is already pushing teams toward more autonomy, more delegation, and more complexity.

OpenAI positions Codex as a command center for multiple agents, shared skills, worktrees, and automations. GitHub Copilot works in the background and then asks for human review. Claude Code supports managed policy, shared settings, and explicit permission rules. Factory's readiness framework says clearly that autonomous development depends on the state of the codebase and surrounding environment, not just the agent.

That means the organizations that win are not the ones with the most raw AI access. They are the ones that can inspect and update their own rollout logic faster.

The Missing Layer in Most AI Rollouts

Most teams do at least one of these:

1. They confuse activity with progress

They count generated pull requests, tool usage, or visible agent output and assume the rollout is working.

But stronger evaluation frameworks emphasize measurement, review burden, and risk management, not just output. NIST's AI RMF exists precisely because capability without disciplined evaluation is not enough.

A metacognitive team asks:

What got better?
What got noisier?
What created rework?
What looked fast but reduced trust?

2. They blame the model before checking the environment

Factory's wording is valuable here: "The agent is not broken. The environment is." Their examples are painfully familiar: missing pre-commit hooks, undocumented environment variables, tribal-knowledge build steps, and weak feedback loops.

A metacognitive team asks:

Is the agent weak, or is the system around it unreadable?
Are we switching vendors to avoid fixing engineering hygiene?
Are we buying capability into an environment that cannot support it?

3. They scale before they standardize

Factory's five-level readiness model is useful because it implies a sequence. "Functional" is not the same as "Autonomous." Their own framing says most teams should aim for "Level 3: Standardized" first.

A metacognitive team asks:

What should become a standard before we scale further?
Which behaviors are still personal hacks?
Which parts of the workflow are stable enough to repeat?

4. They defend the rollout instead of updating it

This is the most expensive failure mode.

Once a team announces an AI initiative, it becomes emotionally harder to say:

The review model is wrong
The lane split is wrong
The metrics are wrong
The change management is weak
The environment is not ready

But that is exactly where strong metacognition shows up. The better team is not the one that avoids mistakes. It is the one that updates faster when mistakes become visible.

What Metacognition Looks Like in Practice

This is not abstract. In a strong AI rollout, metacognition shows up in very operational places:

Review Design

A team notices that "human in the loop" is too vague and redesigns the review path before scaling more autonomy.

Postmortems

A team treats rollout failures as design signals, not as embarrassment to be hidden.

Measurement

A team tracks rework, review burden, and environment readiness instead of just generation volume.

Governance

A team realizes permissions, approvals, and context boundaries need to mature before more agent capability is added.

Documentation

A team turns tacit knowledge into explicit instructions because private cleverness does not scale.

Those are not soft traits. They are organizational self-correction mechanisms.

Why This Is a Leadership Problem First

The reason this matters commercially is that metacognition does not emerge from tools alone. It has to be designed into the organization.

NIST's AI RMF is voluntary and practical, meant to support design, development, deployment, and use of AI through structured risk management. That is essentially a leadership decision: will the organization create routines that encourage inspection, correction, and updating, or will it default to momentum and wishful thinking?

This is also why AI rollouts often need outside help. Not because the team is unintelligent, but because self-correction is hardest when you are already inside the system you need to question.

A Practical Decision Lens

If I were advising a technical leadership team on AI governance and risk, I would ask these five questions:

1. What assumption are we making about this rollout that we have not yet tested?

If the answer is unclear, the team is probably moving faster than its learning system.

2. What evidence would convince us our current rollout approach is wrong?

If there is no answer, the team is defending a plan, not managing one.

3. Where does weak self-correction show up today?

Usually in review, measurement, documentation, or permissions.

4. What are we blaming on the agent that is really an environment problem?

This is often the highest-leverage question. Factory's framework exists because the answer is "a lot."

5. What should become a standard before we add more capability?

If the answer is "nothing," the organization is probably scaling noise.

My Take

Metacognition is the missing layer in most AI rollouts because most teams still treat AI adoption as a tooling problem.

It is not.

At the point where agentic systems, review flows, permissions, and environment quality all start interacting, the real differentiator becomes the organization's ability to inspect and update its own thinking.

That is why the best AI teams often look less like hype-driven adopters and more like disciplined learning systems.

They catch themselves faster. They revise faster. They standardize better. They defend less and improve more.

Key Takeaways

Metacognition as an Operating Capability: The ability to monitor and evaluate your organization's own thinking is a practical skill, not a psychological theory. It's the core of effective AI adoption and business process optimization.
Self-Correction Over Speed: The best teams aren't just faster; they have better self-correction loops. They question metrics, check their environment before blaming the model, and standardize workflows before scaling.
Leadership's Role: Building this capability requires deliberate design. It shows up in review processes, postmortems, and governance—all areas driven by leadership.

AI Hiring Broken: Why Companies Need Operators Not Enthusiasts

Dr Hernani Costa — Sun, 24 May 2026 06:57:45 +0000

When CTOs hire for AI enthusiasm instead of operational judgment, pilots stall, costs rise, and trust collapses. Here's what to actually look for.

AI hiring feels broken for a reason.

Most companies are trying to hire "AI talent" as if it were a single job category. It is not.

What they usually need is much more specific: someone who can turn messy business intent into a defined task, reliable workflow, measurable output, controlled risk posture, and sustainable operating cost.

If you are a CTO, VP Engineering, technical founder, or COO with delivery responsibility, the problem is not only that AI skills are hard to find. The problem is that many organizations are hiring against the wrong definition of value.

Recent surveys confirm that AI skills have become the hardest skills for employers to find globally. The World Economic Forum reports that AI and big data are among the fastest-growing skills, while skills gaps remain one of the biggest barriers to business transformation. LinkedIn's recruiting data adds another important layer: companies increasingly care about quality of hire and skills-based evaluation, but many are still not confident in how to measure either.

That combination creates a predictable failure pattern. Companies write broad AI job descriptions, run shallow interviews, overvalue enthusiasm, undervalue operational judgment, and then wonder why pilots stall, outputs drift, costs rise, and trust collapses.

The issue is not that there are no good people in the market.

The issue is that many companies are not hiring for the work that actually needs to get done.

The Real AI Job Is Operational

A lot of leaders still imagine AI work as model knowledge, tool familiarity, or prompt cleverness.

That is incomplete.

In practice, the hard part of AI delivery is operational. It starts with defining what the system is supposed to do, where it can fail, what context it needs, how outputs will be evaluated, which actions require human review, how data will be protected, and what the ongoing token or tooling cost will be.

That is operator work.

The strongest AI operators are not just excited about models. They can make ambiguity smaller. They can convert goals into decision trees, workflows, test cases, exception paths, and measurable business outcomes.

This is exactly why AI hiring feels so confusing. Many job descriptions still search for a general "AI expert," while the actual delivery environment needs a hybrid of product thinker, systems designer, evaluator, workflow architect, and risk-aware implementer.

Why Vague AI Hiring Creates Expensive Mistakes

Weak role design creates downstream waste.

You see it when a company hires someone to "bring AI into the business" without clarifying whether the real need is internal copilots, workflow automation, coding agents, retrieval systems, evaluation infrastructure, or governance.

You see it when the interview loop rewards tool talk but never tests decomposition, edge-case handling, or security judgment. This leads to the kind of stalled delivery common in many failed AI coding rollouts.

You see it when the person hired can generate demos, but cannot build a repeatable system that other teams can trust.

This is one reason the market feels broken from both sides. Employers say they cannot find the right people. Candidates say they cannot land the role. Often, both are reacting to the same problem: the specification is too vague to match supply with real demand.

The Seven Capabilities Companies Should Actually Hire For

If you want better AI hiring outcomes, stop starting with "years of AI experience" and start with operator capabilities.

1. Specification Precision

Can this person translate a vague business request into a precise task definition? That means defining inputs, outputs, success criteria, failure thresholds, escalation rules, and ownership boundaries. Without this, teams burn time on impressive-looking prototypes that do not survive contact with production reality.

2. Task Decomposition

Can this person break a complex workflow into smaller, testable steps? Strong operators do not ask one giant model call to do everything. They separate retrieval, reasoning, classification, generation, validation, and action. They know where determinism matters and where model flexibility is useful.
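
A minimal sketch of that separation, using a made-up support-ticket workflow. Every function here is a stand-in: `generate` is where a model call would go, and `KNOWLEDGE`, the keyword classifier, and the escalation string are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class Ticket:
    text: str


# Hypothetical knowledge base.
KNOWLEDGE = {
    "refund": "Refunds are processed within 14 days.",
    "login": "Reset your password from the sign-in page.",
}


def classify(ticket: Ticket) -> str:
    """Deterministic step: a keyword rule, no model needed."""
    lowered = ticket.text.lower()
    for topic in KNOWLEDGE:
        if topic in lowered:
            return topic
    return "unknown"


def retrieve(topic: str) -> str:
    """Deterministic step: look up grounding context."""
    return KNOWLEDGE.get(topic, "")


def generate(context: str) -> str:
    """Flexible step: placeholder for a model call."""
    return f"Thanks for reaching out. {context}"


def validate(draft: str, context: str) -> bool:
    """Deterministic step: the draft must be grounded in retrieved context."""
    return bool(context) and context in draft


def handle(ticket: Ticket) -> str:
    """Action step: send the draft only if validation passes."""
    context = retrieve(classify(ticket))
    draft = generate(context)
    return draft if validate(draft, context) else "escalate_to_human"


print(handle(Ticket("I want a refund")))
print(handle(Ticket("Something else entirely")))
```

Because each step is a separate function, each can be tested, measured, and replaced on its own, which one giant model call does not allow.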

3. Evaluation Design

Can this person define what "good" looks like before rollout? Quality of hire is rising in importance, but confidence in measuring it remains low. The same pattern shows up in AI delivery. Companies want results, but many have weak evaluation habits. Good operators build scorecards, human review loops, test sets, and approval criteria early.
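
Defining "good" before rollout can be as simple as a test set plus an approval threshold. The sketch below is an assumption-heavy illustration: `normalize` stands in for the AI workflow under test, and the test cases and the 0.9 threshold are invented.

```python
from typing import Callable


def evaluate(system: Callable[[str], str],
             test_set: list[dict],
             threshold: float = 0.9) -> dict:
    """Run every test case and compare the pass rate to the approval bar."""
    if not test_set:
        raise ValueError("test_set must contain at least one case")
    failures = [
        case["name"] for case in test_set
        if not case["check"](system(case["input"]))
    ]
    pass_rate = 1 - len(failures) / len(test_set)
    return {
        "pass_rate": pass_rate,
        "failures": failures,
        "approved": pass_rate >= threshold,
    }


def normalize(text: str) -> str:
    """Stand-in for the AI workflow under test."""
    return text.strip().lower()


# Hypothetical test set: each case names its own pass condition.
TEST_SET = [
    {"name": "trims", "input": "  Hello ", "check": lambda out: out == "hello"},
    {"name": "lowercases", "input": "WORLD", "check": lambda out: out == "world"},
    {"name": "no_spaces", "input": "a b", "check": lambda out: " " not in out},
]

report = evaluate(normalize, TEST_SET)
print(report["failures"], report["approved"])
```

The threshold is written down before anyone sees the results, which is what separates evaluation from justification after the fact.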

4. Failure Pattern Recognition

Can this person spot recurring breakdowns before they become organizational mistrust? Real AI systems fail in patterns: missing context, brittle prompts, weak grounding, permission errors, poor fallback logic, bad exception handling, hidden latency, and silent cost creep. Operators learn to see these patterns early.

5. Trust and Security Design

Can this person make sensible decisions about data exposure, permissions, logging, review, and model boundaries? AI use at work is already widespread, and many workers bring their own AI tools to work, especially in small and mid-sized companies. That makes operator judgment around data handling and approved workflows even more important.

6. Context Architecture

Can this person decide what the model should know, when it should know it, and how that context should be structured? This is where many teams lose reliability. Prompt quality matters, but context architecture matters more. Operators understand document quality, retrieval structure, metadata, system instructions, state handling, and tool access. They know that good context architecture usually beats generic model swapping.

7. Token Economics and Workflow Economics

Can this person balance quality, speed, and cost? The best operator is not the person who always chooses the smartest model. It is the person who can design a workflow where the expensive model is used only when it creates enough business value to justify the spend.

That is how AI becomes a delivery system instead of a novelty expense.
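
A minimal sketch of that trade-off as a routing rule. The tier names, the prices, and the 5% cost-share ceiling are invented for illustration. Real prices and value estimates would come from your own vendor contracts and business case.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelTier:
    name: str
    cost_per_1k_tokens: float   # hypothetical price


CHEAP = ModelTier("small", 0.001)
EXPENSIVE = ModelTier("large", 0.03)


def estimated_cost(tier: ModelTier, tokens: int) -> float:
    """Estimate the spend for one call at the tier's per-1k-token price."""
    return tier.cost_per_1k_tokens * tokens / 1000


def choose_tier(tokens: int, task_value: float, needs_deep_reasoning: bool,
                max_cost_share: float = 0.05) -> ModelTier:
    """Use the expensive tier only when the task needs it AND the spend
    stays under a fixed share of the value the task creates."""
    affordable = estimated_cost(EXPENSIVE, tokens) <= max_cost_share * task_value
    if needs_deep_reasoning and affordable:
        return EXPENSIVE
    return CHEAP


print(choose_tier(4000, task_value=10.0, needs_deep_reasoning=True).name)
print(choose_tier(4000, task_value=1.0, needs_deep_reasoning=True).name)
print(choose_tier(4000, task_value=10.0, needs_deep_reasoning=False).name)
```

The rule makes the spend decision explicit and reviewable instead of leaving it to whichever model a developer happened to pick.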

Why Most AI Interviews Miss These Skills

Most interview loops are still built for conventional hiring signals.

They check pedigree.
They check vocabulary.
They check whether someone has touched the latest tools.

That is not enough.

A better AI interview loop should test:

How the candidate clarifies an ambiguous task
How they decompose the workflow
How they define success and failure
How they handle data sensitivity
How they think about fallback paths
How they control cost and complexity

In other words, the interview should simulate the actual work.

If you only ask what tools someone has used, you are likely to hire for enthusiasm, not operational leverage.

What CTOs and COOs Should Do Instead

Here is the practical shift.

Do not ask, "How do we hire an AI person?"

Ask, "What operating capability do we need to build first?"

In many companies, the right first move is one of these:

Option 1. Hire an internal AI operator

This is the right move when AI work is already frequent, the workflows are business-critical, and you need day-to-day ownership close to product, engineering, or operations.

Option 2. Upskill an existing operator

This works when you already have strong product or engineering people with systems judgment, domain context, and credibility across the team. Many employers are responding by hiring for potential and building AI literacy across the workforce.

Option 3. Bring in an external partner to define the operating model

This is often the best move when the organization is still unclear on use cases, governance, what to standardize in the tool stack, role design, and rollout sequencing. External support helps compress the learning cycle and avoid expensive false starts.

A Simple Decision Lens for Technical Leaders

Before opening a new AI role, ask these seven questions:

What business workflow are we trying to improve?
Where does human review still need to stay in the loop?
What failures would make the system unacceptable?
What context does the system need to perform reliably?
How will we evaluate outputs before broad rollout?
What are the security, privacy, and permission boundaries?
What cost structure is acceptable at scale?

If you cannot answer those questions, the hiring problem is not yet a recruiting problem.

It is an AI readiness problem.

And readiness problems should be solved before headcount is used to paper over them.

The Strategic Takeaway

The companies that win with AI are not the ones that hire the most excited people first.

They are the ones that define the work correctly.

The market does have real scarcity. AI skills are in short supply, and demand is rising fast. But many hiring failures come from a more fixable issue: companies are still searching for AI enthusiasm when what they really need is operational judgment.

That is good news for technical leaders.

Because once you stop treating AI as a vague talent category and start treating it as an operating system design problem, your hiring decisions get sharper, your interviews get better, your rollouts get safer, and your investment gets easier to justify.

Practical Framework: Hire or Build Around This Operator Scorecard

Use this simple scorecard before you open a role or approve a consulting engagement.

Score each area from 1 to 5:

Problem definition
Workflow decomposition
Evaluation discipline
Failure analysis
Security and trust judgment
Context design
Cost awareness

If your team scores low across multiple areas, do not rush into another generic AI hire.

Start with a readiness assessment. Identify which capabilities should be built internally, which should be standardized, and which should be supported externally.
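As a rough illustration, the scorecard can be kept as a small script so the result is recorded rather than debated from memory. This is a minimal sketch, not a validated instrument: the thresholds for "low" (a score of 2 or below) and "multiple areas" (at least two) are my assumptions, and the function and field names are hypothetical.

```python
# The seven scorecard areas from the list above.
AREAS = [
    "problem_definition",
    "workflow_decomposition",
    "evaluation_discipline",
    "failure_analysis",
    "security_and_trust_judgment",
    "context_design",
    "cost_awareness",
]

LOW_SCORE = 2      # assumption: 2 or below counts as a low score
MIN_LOW_AREAS = 2  # assumption: "multiple areas" means at least two


def assess(scores: dict[str, int]) -> dict:
    """Return the low-scoring areas and a suggested next step."""
    missing = [area for area in AREAS if area not in scores]
    if missing:
        raise ValueError(f"missing scores for: {missing}")
    for area in AREAS:
        if not 1 <= scores[area] <= 5:
            raise ValueError(f"{area} must be scored from 1 to 5")

    low_areas = [area for area in AREAS if scores[area] <= LOW_SCORE]
    if len(low_areas) >= MIN_LOW_AREAS:
        next_step = "readiness_assessment"
    else:
        next_step = "define_the_role"
    return {"low_areas": low_areas, "next_step": next_step}
```

Adjust the two thresholds to match how strictly your leadership team wants to gate new headcount.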

That is how you stop hiring into confusion.

That is how you start building delivery capacity.

Key Takeaways

AI hiring feels broken because many companies are hiring for a vague category instead of a defined operating need.
The highest-value AI capability is often not model enthusiasm. It is operational judgment.
Strong AI operators define tasks clearly, decompose workflows, design evaluations, recognize failure patterns, manage trust boundaries, structure context, and control cost.
Better interview loops test real delivery work, not just tool familiarity.
If your use cases, governance, and evaluation model are still unclear, your problem is readiness before it is recruiting.

Written by Dr. Hernani Costa | Powered by Core Ventures

Originally published at First AI Movers.

Technology is easy. Mapping it to P&L is hard. At First AI Movers, we don't just write code; we build the 'Executive Nervous System' for EU SMEs.

Is your AI hiring creating technical debt or business equity?


Our AI readiness assessment for EU SMEs helps you define operator capabilities before headcount. We combine AI strategy consulting with workflow automation design to ensure your hiring decisions map directly to measurable business outcomes.

AI Agent Skills: From Prompts to Workflow Infrastructure

Dr Hernani Costa — Sat, 23 May 2026 06:57:51 +0000

The shift from fragile prompting to durable workflow infrastructure is happening now—and it's reshaping how technical leaders should architect AI operations.

Why Skills Are Becoming the Operating Layer for AI Agents

TL;DR: Skills are becoming reusable workflow infrastructure for AI agents. See what changed since October and how technical leaders should design them.

Since October, skills have moved from personal prompt helpers to reusable, versioned workflow infrastructure for teams, agents, and real business operations.

The market has spent a lot of time talking about agents.

That makes sense. Agents are visible. They demo well. They feel like the headline.

But the more durable shift is happening one layer lower.

Skills are quietly becoming the reusable operating layer that makes agents more accurate, more predictable, and more useful in real work.

Overview

When Anthropic introduced Agent Skills on October 16, 2025, the idea looked simple: package instructions, scripts, and resources into a folder so Claude could load them when relevant. By December 18, Anthropic had already added organization-wide management, a skills directory, and support for an open Agent Skills standard. Its current docs now position Skills across Claude.ai, Claude Code, and the API, with built-in document skills for PowerPoint, Excel, Word, and PDF plus custom skills for organizational knowledge. OpenAI now documents SKILL.md-based Skills in its API and uses repo-local skills with Codex for repeatable engineering workflows. Microsoft's Agent Skills docs describe the same pattern as portable, open-spec packages for domain expertise and reusable workflows.

That is the real update.

Skills are no longer just a clever way to save prompts. They are increasingly the way organizations package workflow knowledge for both humans and agents.

Skills are not just a Claude feature anymore

This is the first thing technical leaders need to update in their mental model.

Anthropic's own release notes say skills now come with organization-wide management and an open standard so they can work across AI platforms. OpenAI's current API cookbook uses the same SKILL.md manifest concept and describes skills as reusable bundles of instructions, scripts, and assets. Microsoft's Agent Skills docs also point to the open specification and describe skills as portable packages of instructions, scripts, and resources.

That does not mean every vendor surface works identically.

It does mean the pattern is escaping the lab.

For technical buyers, that matters more than any single release. Once multiple vendors converge on the same packaging idea, you stop thinking of it as a feature and start treating it as infrastructure.

Why this matters for business systems

Prompts are useful, but they do not compound very well.

They get copied into docs, chats, notebooks, and internal wikis. They drift. They fork. They become hard to test. They become hard to govern. They disappear into chat history.

Skills solve a different problem.

OpenAI's current guidance is the clearest way to say it: skills sit between prompts and tools. Prompts define always-on behavior. Tools do something in the world. Skills package repeatable procedures that should only load when needed. Anthropic describes the same progressive-disclosure model: Claude sees skill metadata first, reads the full SKILL.md when relevant, and only loads deeper references or scripts as needed.

That has real business implications:

less prompt sprawl
more consistent workflow execution
clearer ownership of methodology
better reuse across teams
cleaner handoffs between people and agents
a more testable path to agent reliability

This is why I do not think of skills as a niche developer artifact.

I think of them as workflow capital.

The shift is from personal configuration to organizational memory

In the early framing, a skill looked like something an individual user might create for personal productivity.

That is still true.

But Anthropic now lets Team and Enterprise owners provision skills organization-wide, and its help docs say shared skills can appear automatically for all users. Anthropic also makes built-in document skills available across paid and free plans, which expands the concept beyond coding into everyday knowledge work like spreadsheets, documents, presentations, and PDFs. Microsoft's documentation pushes in the same direction by describing agent skills for expense policies, legal workflows, and data analysis pipelines.

That is the bigger story.

Skills are becoming a way to take high-value, repeatable know-how out of individual heads and put it into a reusable layer the organization can route, test, and improve.

For most companies, that is a much more important story than whether an agent can perform a flashy one-off task.

Agent-first design changes how you should write skills

Once agents become the main caller, your design priorities change.

This is where many teams are still behind.

Anthropic's best-practices guide says the description field is critical for skill selection and that Claude may choose among 100 or more available skills based on that description. OpenAI makes a similar point: names and descriptions drive discovery and routing, and good skills include clear guidance about when to use them, when not to use them, expected outputs, and edge cases.

That leads to three practical conclusions.

1. The description is a routing signal

Do not treat the description as a label.

Treat it as the moment where the model decides whether this skill belongs in the workflow at all.

Vague descriptions like "helps with research" or "does analysis" are weak routing signals. Specific descriptions tied to artifacts, triggers, and outcomes are far more useful.
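A lightweight review script can catch weak descriptions before a skill is shared. The sketch below is a heuristic of my own, not a vendor tool: the vague-phrase list, the trigger markers, and the 12-word minimum are all assumptions you should tune to your own library.

```python
# Hypothetical heuristics; none of these lists come from vendor documentation.
VAGUE_PHRASES = ("helps with", "does analysis", "assists with", "general purpose")
TRIGGER_MARKERS = ("use when", "use this when", "when the user", "do not use")
MIN_WORDS = 12  # assumption: shorter descriptions rarely name artifact, trigger, and outcome


def review_description(description: str) -> list[str]:
    """Return a list of routing problems found in a skill description."""
    text = description.lower()
    problems = []
    if len(text.split()) < MIN_WORDS:
        problems.append("too short to name an artifact, trigger, and outcome")
    for phrase in VAGUE_PHRASES:
        if phrase in text:
            problems.append(f"vague phrase: '{phrase}'")
    if not any(marker in text for marker in TRIGGER_MARKERS):
        problems.append("no explicit trigger (for example 'Use when ...')")
    return problems
```

An empty list does not prove the description routes well. It only means the obvious weaknesses are gone, so real usage testing still matters.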

2. The output should behave like a contract

This is my inference from the current vendor guidance, not a vendor quote.

If an agent is going to hand the result of one skill into the next step, the output has to be legible, predictable, and structured enough to support downstream work. OpenAI explicitly recommends documenting expected outputs and designing skills like tiny CLIs. Anthropic stresses clear workflows, feedback loops, and executable code where determinism matters.

That is contract thinking.

The skill should tell the caller what it will produce, what format to expect, and where the boundaries are.
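One way to make that contract concrete is to declare the expected output fields next to the skill and check every result against them before the next step runs. This is a minimal sketch under my own assumptions: `OutputContract` and its field names are hypothetical, and it assumes the skill's output has already been parsed into a dictionary.

```python
from dataclasses import dataclass


@dataclass
class OutputContract:
    """Declares which fields a skill's output must contain, and their types."""
    required_fields: dict[str, type]  # field name -> expected Python type

    def violations(self, output: dict) -> list[str]:
        """Return human-readable contract violations; empty means the output conforms."""
        problems = []
        for name, expected in self.required_fields.items():
            if name not in output:
                problems.append(f"missing field: {name}")
            elif not isinstance(output[name], expected):
                actual = type(output[name]).__name__
                problems.append(f"{name}: expected {expected.__name__}, got {actual}")
        return problems


# Example: a hypothetical risk-review skill promises these three fields.
contract = OutputContract({"summary": str, "risks": list, "confidence": float})
```

The downstream step can then refuse to run when `contract.violations(result)` is not empty, which keeps a malformed handoff from silently propagating.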

3. Composability matters more than cleverness

Anthropic's launch post describes skills as composable. That matters because the goal is not to create one giant magic file that solves everything. The goal is to create specialist units that can be combined without bloating context or confusing routing.

The best skills are usually narrow, reusable, and easy to hand off to the next step.

How to build skills that actually work

This is where most teams need discipline.

Anthropic's guidance is straightforward: good skills are concise, well structured, and tested with real usage. Its docs recommend specific descriptions, progressive disclosure, clear workflows, and at least three evaluations with testing across the models you plan to use. OpenAI adds practical advice on routing guidance, negative examples, zip-based packaging, version pinning, and explicit verification steps.

A practical checklist looks like this:

Start with one repeatable workflow

Choose something that happens often enough to matter and predictably enough to standardize.

Write for discovery first

Be precise about what the skill does, when to use it, and what outputs it should produce.

Keep the core file lean

Anthropic warns that context is a shared resource. Put only the highest-value instructions in the core file and move examples or references into supporting files when needed.
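To see why a lean core file pays off, it helps to picture the three loading levels as three separate reads. The sketch below is an illustration of progressive disclosure, not how any vendor implements it. It assumes a SKILL.md that opens with a simple `key: value` frontmatter block between `---` lines, and the function names are hypothetical.

```python
from pathlib import Path


def read_metadata(skill_dir: Path) -> dict[str, str]:
    """Level 1: read only the frontmatter (for example name and description)."""
    metadata: dict[str, str] = {}
    lines = (skill_dir / "SKILL.md").read_text(encoding="utf-8").splitlines()
    if not lines or lines[0].strip() != "---":
        return metadata
    for line in lines[1:]:
        if line.strip() == "---":
            break
        key, separator, value = line.partition(":")
        if separator:
            metadata[key.strip()] = value.strip()
    return metadata


def read_body(skill_dir: Path) -> str:
    """Level 2: read the instructions after the frontmatter, once the skill is selected."""
    text = (skill_dir / "SKILL.md").read_text(encoding="utf-8")
    parts = text.split("---", 2)
    if text.startswith("---") and len(parts) == 3:
        return parts[2].strip()
    return text.strip()


def read_reference(skill_dir: Path, relative_path: str) -> str:
    """Level 3: read a supporting file only when the instructions call for it."""
    return (skill_dir / relative_path).read_text(encoding="utf-8")
```

Everything in the body is paid for each time the skill is selected, while supporting files cost nothing until they are opened. That is the practical argument for moving long examples out of the core file.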

Use scripts for deterministic parts

Anthropic explicitly says skills can include executable code when traditional programming is more reliable than token generation. That is an important boundary. Do not force natural-language instructions to do the job of a script when accuracy and repeatability matter.

Build evals before you trust the skill

If the skill matters enough to hand to an agent, it matters enough to test. Anthropic recommends real usage testing and multiple evaluations. OpenAI recommends version pinning for reproducibility.
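A small harness is enough to start. The sketch below assumes you supply `run_skill`, a hypothetical callable that sends a prompt to your model with one specific skill version loaded and returns the text output; how you call the vendor API is deliberately left out. The minimum of three cases follows the "at least three evaluations" guidance mentioned earlier.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    name: str
    prompt: str
    check: Callable[[str], bool]  # returns True when the output is acceptable


def run_evals(
    run_skill: Callable[[str], str],  # hypothetical: invokes the model with the pinned skill
    cases: list[EvalCase],
    skill_version: str,
    min_cases: int = 3,
) -> dict:
    """Run every case against one pinned skill version and report pass or fail."""
    if len(cases) < min_cases:
        raise ValueError(f"need at least {min_cases} eval cases, got {len(cases)}")
    results = {case.name: bool(case.check(run_skill(case.prompt))) for case in cases}
    return {
        "skill_version": skill_version,
        "results": results,
        "passed": all(results.values()),
    }
```

Recording the version in the report matters. Without it, a passing run cannot be tied to the exact skill that will be used in production.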

A three-tier model for teams

This is the framework I would use with technical leaders.

Tier 1: Standard skills

These encode organization-wide rules and common assets.

Think brand voice, formatting rules, approved templates, common review procedures, and document-generation standards.

Tier 2: Methodology skills

These encode the craft knowledge that makes your strongest practitioners effective.

Think competitive analysis frameworks, deal memo review, product requirement decomposition, incident triage, or research synthesis.

This is often the highest-leverage tier because it turns tribal knowledge into reusable capability.

Tier 3: Personal workflow skills

These help an individual move faster in their day-to-day work.

They matter, but they should not stay trapped on one laptop forever. If a personal workflow proves durable and valuable, promote it upward.

That is how organizations start building a real skills library instead of a scattered prompt graveyard.

What technical leaders should do next

If you are serious about agent reliability, do not start by building fifty skills.

Start by picking one workflow where:

the task repeats
the output matters
the current process is inconsistent
a human can still review quality early on

Then do five things:

define the workflow clearly
package it into a skill with a sharp description and explicit outputs
test it against real scenarios
pin the version for production use
assign ownership so someone improves it over time

That is the path from prompting to operating.

The strategic takeaway

The companies that win with agents will not just have better models.

They will have better reusable workflow memory.

That is what skills are becoming.

Not a prompt trick. Not just a Claude feature. Not just a developer convenience.

A portable, testable, shareable layer that sits between global instructions and tool execution, and helps organizations turn fragile prompting into repeatable work. That is the direction now visible across Anthropic, OpenAI, and Microsoft documentation.

If your team is building agents without a plan for reusable skills, versioning, evaluation, and ownership, you are probably underinvesting in the layer that will decide whether your workflows stay reliable once the demos end.

Practical framework

Use this decision lens before you invest in a new agent workflow:

Is the task repeatable enough to deserve a skill?
Can we describe when it should and should not trigger?
What exact output should it produce?
Which parts should stay deterministic through scripts?
How will we evaluate quality before broader rollout?
Who owns versioning and maintenance?
Should this live at the personal, team, or organization tier?

Key takeaways

Skills are moving from personal configuration to organizational infrastructure.
The pattern is no longer vendor-isolated. Anthropic, OpenAI, and Microsoft now all document forms of portable, reusable skill packages or skill-compatible agent workflows.
Prompts are still useful, but they are not enough for durable, governed, repeatable operations.
Agent-first skill design requires strong routing descriptions, explicit outputs, composable boundaries, and real evaluation.
Technical leaders should treat skills as workflow infrastructure, not just a convenience feature.

Claude Skills: Workflow Layer, Not Feature

Dr Hernani Costa — Fri, 22 May 2026 06:57:48 +0000

Most AI teams still try to solve workflow reliability with bigger prompts. Then the prompt gets longer, the edge cases pile up, outputs start drifting, and the team realizes it is trying to run operations from chat history. Claude Skills matter because they point to a better pattern—and they signal how AI workflow design is maturing into something operationally sustainable.

Claude Skills Are More Than a Feature: They Are a New Workflow Layer

TL;DR: Claude Skills turn repeatable workflows into reusable process assets. See what they are, where they fit, and what technical leaders should standardize.

Anthropic's Skills move Claude closer to repeatable execution by separating reusable process knowledge from broad instructions, project context, and external tool access.

Most AI teams still try to solve workflow reliability with bigger prompts.

That works for a while.

Then the prompt gets longer, the edge cases pile up, outputs start drifting, and the team realizes it is trying to run operations from chat history.

Claude Skills matter because they point to a better pattern.

Anthropic describes Skills as portable, composable, efficient, and capable of including executable code when programming is more reliable than token generation. Team and Enterprise users can share skills directly with colleagues or publish them organization-wide.

That is a bigger shift than it may look like at first glance.

Skills are not just a nicer way to save prompts. They are becoming a reusable process layer for AI work.

What Claude Skills actually are

Anthropic's current definition is useful because it cuts through a lot of confusion.

Skills are task-specific procedures that activate dynamically when relevant. Projects, by contrast, provide static background knowledge that is always loaded inside that project. Custom instructions apply broadly across conversations. MCP gives Claude access to external services and data sources. Skills teach Claude how to complete a specific workflow, and they can work together with MCP when a workflow needs external tools or data.

That distinction matters operationally.

A lot of companies are mixing these layers together:

global preferences
project context
external system access
repeatable workflow logic

When those all get collapsed into one giant instruction block, reliability suffers.

Skills are valuable because they separate procedure from context and from access.

Why this matters for technical leaders

Technical leaders should not read this as a UI update.

They should read it as a signal about how AI workflow design is maturing.

Anthropic's own launch post said Claude uses skills by scanning available options, matching what is relevant, and then loading only the minimal information and files needed. Anthropic also says skills can stack together automatically. That is important because it creates a cleaner model for building repeatable operations than endlessly expanding system prompts or project instructions.

In practice, this changes how teams should think about AI delivery.

The question is no longer just, "Which model should we use?"
It becomes, "Which parts of our workflow should be codified as reusable process assets?"

That is a more useful management question.

The real value is process reuse, not personalization

A lot of people first see skills as a personal productivity feature.

That is too small.

Anthropic says the best skills solve a specific, repeatable task, include clear instructions, define when they should be used, and stay focused on one workflow instead of trying to do everything. The company also allows organization-level sharing and provisioning on Team and Enterprise plans.

That makes skills relevant well beyond individual use.

Here is where the business value starts to show up:

1. Skills turn tribal knowledge into reusable process

When the strongest operator on your team knows how to structure a client report, build a board memo, run a product validation screen, or produce a weekly operating review, that method often stays trapped in their head.

A good skill moves that method into a reusable package.

2. Skills reduce prompt sprawl

Instead of copying versions of the same workflow prompt across docs, chats, and internal notes, teams can package the workflow once and improve it over time.

3. Skills improve consistency across humans and AI

Anthropic's docs note that shared skills are view-only for recipients and updates propagate automatically. That means the workflow logic can be improved centrally while remaining reusable across the organization.

That is operationally stronger than relying on everyone to remember the latest version of a prompt.

Where Skills sit in the stack

The easiest way to understand Claude Skills is to place them in the operating stack.

Custom instructions

Use these for broad preferences that should apply across conversations.

Projects

Use these for always-loaded context tied to a body of work.

MCP and connectors

Use these when Claude needs access to tools, systems, or data. Anthropic says connectors let Claude retrieve data and take actions inside connected services, and that MCP is the open standard behind those connections. Anthropic also warns that custom connectors and third-party MCP servers should be treated carefully from a trust and security perspective.

Skills

Use these for reusable procedures: how to perform a workflow, what output shape to produce, what conventions to follow, and what edge cases matter.

That is why I see Skills as the missing layer between instructions and execution.

The practical use cases that matter most

The best early use cases are not "everything Claude can do."

They are workflows with four traits:

repeated often
quality matters
conventions are known
the team wants more consistency

That includes:

board or leadership summaries
operating review templates
report structures
research synthesis
product validation checklists
issue triage formats
sales or customer handoff templates
internal analysis conventions
compliance-aware document generation

Anthropic's help center explicitly says skills work well when they enhance specialized knowledge and workflows specific to an organization or personal work style.

That is why this matters to operations, product, finance, and leadership teams, not just developers.

The limitations matter too

This is where a lot of AI content gets too excited.

Skills do not magically solve every output problem.

Anthropic's documentation makes clear that skills can include executable code when programming is more reliable than token generation. That is an implicit admission of an important truth: some tasks should stay more deterministic.

That means technical leaders should be careful about where they expect Skills alone to deliver high fidelity.

For example:

document structure and summaries are a better fit than highly polished visual design
procedural guidance is a better fit than pixel-perfect creative production
standardized workflow logic is a better fit than niche, high-precision execution that needs dedicated software

The right mental model is not "Skills replace tools."

It is "Skills improve how the model performs within a workflow, often alongside tools."

What to standardize first

If you are leading an engineering or operations team, what you standardize first in an AI dev stack is a critical decision. Do not start by creating dozens of skills.

Start with one of these:

Standard outputs

Reports, summaries, recurring deliverables, and templated artifacts.

Method-heavy workflows

Processes where the real value is not just the answer, but the way the work is framed, structured, and reviewed.

Knowledge transfer bottlenecks

Work that currently depends too heavily on a few senior people.

Tool-using workflows with clear conventions

This is where Skills and MCP can work together well. Anthropic says connectors provide access, while Skills provide procedural knowledge about how to use those tools in context.

That is often the highest-leverage place to begin.

A practical decision lens for buyers

Before you invest time in creating a custom Claude Skill, ask these questions:

Is this task repeatable enough to deserve packaging?
Do we already know what "good" looks like?
Is the workflow stable enough to standardize?
Does this require external system access, and if so, should that be handled through MCP or a connector?
Does the output need deterministic enforcement in any step?
Who owns the skill once it exists?
How will we test whether it actually improves quality, speed, or consistency?

If you cannot answer those questions, you are not yet doing skill design. You are still in workflow discovery.

The strategic takeaway

Claude Skills are easy to underestimate because the packaging looks simple.

A ZIP file.
A markdown manifest.
A few instructions.
Optional supporting files.

But that simplicity is exactly why they matter.

Anthropic is making reusable process knowledge a first-class object inside Claude. The company now supports custom skill uploads, org sharing, and a formal distinction between Skills, Projects, custom instructions, and MCP.

That is not just a feature release.

It is a sign that the next phase of AI adoption will depend less on one-off prompting and more on how well organizations package, govern, test, and distribute repeatable workflow logic.

Practical framework

Use this three-part framework before rolling out Skills:

1. Capture

Identify one repeatable workflow where quality matters and conventions are already understood.

2. Package

Separate the workflow instructions from general context and external access. Put procedure in the skill, background in the project, and system access in MCP or connectors.

3. Govern

Assign ownership, version it clearly, test it against real outputs, and decide whether it belongs at the personal, team, or organization level.

Key takeaways

Claude Skills are task-specific, dynamically loaded procedures, not just saved prompts.
Anthropic now positions Skills as distinct from projects, custom instructions, and MCP.
The real business value is workflow reuse, consistency, and knowledge transfer.
Skills work best for repeatable, method-heavy processes with known output conventions.
Technical leaders should treat Skills as operational assets that need ownership, boundaries, and governance.

Claude Skills Rollout: When to Standardize Across Teams

Dr Hernani Costa — Thu, 21 May 2026 06:57:41 +0000

Governance gaps in Claude Skills are creating real risk for cross-department AI workflow standardization—and most technical leaders are moving too fast.

Claude Skills are already useful for small teams and single departments. Cross-department rollout still looks too immature for most organizations.

Claude Skills are one of those features that look smaller than they are. On the surface, they seem like a cleaner way to save instructions. In reality, they are a new workflow layer. Anthropic defines Skills as folders of instructions, scripts, and resources that Claude loads dynamically for specialized tasks, and says they improve consistency, speed, and performance through progressive disclosure (Claude Help Center).

That matters. But the decision for a technical leader is not whether Skills are interesting. It is whether they are ready to standardize across the team.

The Short Answer

For small teams, yes.

For departments, often yes.

For cross-department use, usually not yet.

That is not because the concept is weak. It is because the current governance and rollout model still looks too coarse for broad, cross-functional operating systems. Anthropic currently supports personal skills, sharing with specific colleagues, organization-directory publishing, and owner-provisioned skills for the whole organization. It also explicitly says group sharing and edit permissions are planned for a future release, which is a strong signal that the control model is still evolving (Claude Help Center).

Why Small Teams Should Move First

Small teams are the cleanest fit for Claude Skills right now.

Anthropic says Skills are available across Free, Pro, Max, Team, and Enterprise plans, and Team plans have the feature enabled by default at the organization level. It also says users can upload custom skills as ZIP files, toggle them on and off, and use Anthropic's built-in document skills automatically when relevant (Claude Help Center).

That creates a strong operating pattern for lean teams because:

Ownership is obvious
Workflows are easier to define
Fewer people need training
Iteration is faster
Prompt sprawl drops quickly

If a five-person product team has a repeatable method for PRD review, release notes, research synthesis, or weekly operating summaries, Claude Skills are already useful infrastructure.

Why Departments Can Usually Make Skills Work

A department is the next logical layer.

Anthropic says the best skills solve a specific, repeatable task, have clear instructions, define when they should be used, and stay focused on one workflow rather than trying to do everything. It also supports organization-wide provisioning on Team and Enterprise plans, with owners able to upload a skill once and make it available to everyone in the organization (Claude Help Center).

That means departments can standardize things like:

Finance memo structure
Product review formats
Customer success handoffs
Brand-constrained document generation
Recurring internal analyses

This works best when one function clearly owns the method and the output standard is already stable.

Why Cross-Department Rollout Still Looks Too Early

This is where most teams should slow down.

Anthropic's current organization-management docs say there are two independent sharing toggles: one for peer-to-peer sharing with specific colleagues, and one for publishing to the organization directory. They also say there is no approval workflow for org-wide sharing if that directory option is enabled. Most importantly, they say group sharing and edit permissions are planned for a future release (Claude Help Center).

That matters because cross-department use usually needs more than simple sharing. It needs:

Scoped rollout by function or group
Clear edit rights
Approval flows
Controlled versioning across teams
Stronger operating ownership

Without that, you risk either over-centralizing Skills too early or letting them spread without enough review.

There is another practical governance caveat. Anthropic says that in the Excel and PowerPoint add-ins, inputs and outputs are deleted from Anthropic's backend within 30 days, but those add-ins do not inherit custom data retention settings and their activity is not currently included in Enterprise audit logs, the Compliance API, or data exports. For teams thinking about cross-functional standardization, especially in regulated or review-heavy environments, that is a real limitation (Claude Help Center).

What Skills Are Best Used For Today

Claude Skills are strongest where the process is known and repeated.

Anthropic describes them as specialized workflows and knowledge packages, and lists use cases such as applying brand guidelines, following company templates, structuring meeting notes, creating tasks in company tools using team conventions, and running company-specific data analysis workflows (Claude Help Center).

That makes them a good fit for:

Recurring summaries
Templated reports
Document formatting standards
Single-team analysis methods
Structured internal reviews
Workflow-specific knowledge capture

That does not automatically make them a good fit for broad company-wide process design.

What I Would Recommend

Use this rollout sequence.

1. Start with one small team

Pick one repeated workflow where quality matters and the owner is obvious.

2. Expand to one department

Only move upward once the skill has proved useful, stable, and easy to maintain.

3. Be selective across departments

Only standardize across functions when the workflow has one clear owner and limited governance complexity.

That gives you the upside of Skills without pretending the platform controls are more mature than they are. This kind of phased rollout is a core part of any practical AI architecture review before you scale.
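The sequence can be written down as a simple gate so the decision is explicit rather than implied. This is my own reading of the three steps, not an Anthropic rule: the function name, its inputs, and the returned labels are hypothetical.

```python
def recommend_scope(
    has_clear_owner: bool,
    proven_and_stable: bool,
    functions_involved: int,
    needs_group_permissions_or_approvals: bool,
) -> str:
    """Suggest how widely to roll out a skill, following the three-step sequence."""
    if not has_clear_owner:
        return "assign an owner first"
    if functions_involved <= 1:
        # Steps 1 and 2: move from one team to the department only once proven.
        return "department" if proven_and_stable else "small team"
    # Step 3: cross-department only with limited governance complexity.
    if proven_and_stable and not needs_group_permissions_or_approvals:
        return "selective cross-department"
    return "hold at department level"
```

The last input carries most of the weight. If a workflow needs scoped group access, edit rights, or approvals to be safe, the sketch keeps it at department level until the platform controls catch up.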

The Takeaway

Claude Skills are already valuable.

Anthropic has made them a first-class workflow object inside Claude, with dynamic loading, ZIP-based custom skill uploads, organization-wide provisioning, and support across Claude surfaces, including Excel and PowerPoint (Claude Help Center).

But the best buyer-facing answer is still practical:

Standardize Claude Skills now if you are a small team or a single department with clear workflow ownership. Do not treat them as a mature cross-department operating layer yet.

That is the decision most technical leaders can act on today, and it aligns with the broader question of what CTOs should standardize first in an AI dev stack.

From Workflow Sprawl to Operating Clarity

Standardizing new AI capabilities like Claude Skills requires more than just enabling a feature. It's an operating model decision that shapes how your team scales AI automation and workflow design. If you're moving from scattered experiments to a clear, governed AI workflow, our AI Readiness Assessment is the right starting point. We'll help you map your current state and identify the highest-value, lowest-risk workflows to standardize first.

For teams already implementing AI workflows and needing to design a scalable, secure operating model, our AI Governance & Risk Advisory and Workflow Automation Design services provide the architectural and governance expertise to move forward with confidence.

FAQ

What is a Claude Skill?

Anthropic defines Skills as folders of instructions, scripts, and resources that Claude loads dynamically for specialized tasks (Claude Help Center).

Are Claude Skills available on Team plans?

Yes. Anthropic says Skills are available on Free, Pro, Max, Team, and Enterprise plans, and Team plans have the feature enabled by default at the organization level (Claude Help Center).

Can we upload our own skills?

Yes. Anthropic says custom skills can be packaged as ZIP files and uploaded through Claude's Skills interface (Claude Help Center).

Are Skills the same as Projects?

No. Projects provide always-loaded background knowledge. Skills are task-specific workflow packages that Claude loads when relevant (Claude Help Center).

Are Skills the same as MCP?

No. MCP provides access to external tools and data. Skills provide the workflow instructions for how to do the task (Claude Help Center).

Are Skills good for small teams?

Yes. That is the clearest fit today because the workflow owner is usually obvious and rollout is easier to govern. Anthropic's current sharing and provisioning model supports this well enough (Claude Help Center).

Are Skills ready for department-level rollout?

Usually yes, when one function owns the method and the workflow is stable enough to standardize. Anthropic's docs support both shared and owner-provisioned rollout patterns for this (Claude Help Center).

Why not standardize Skills across departments yet?

Because Anthropic's current docs say group sharing and edit permissions are still planned for a future release, and there is no approval workflow for org-wide sharing. That makes cross-functional governance weaker than many organizations will want (Claude Help Center).

Do Skills work in Excel and PowerPoint?

Yes. Anthropic says enabled Skills are available in the Excel add-in and across Excel and PowerPoint workflows (Claude Help Center).

Is there any governance caveat for Excel and PowerPoint?

Yes. Anthropic says those add-ins do not inherit custom data retention settings and their activity is not currently included in Enterprise audit logs, the Compliance API, or data exports (Claude Help Center).

AI Readiness for Engineering Teams: 15 Questions Before You Scale

Dr Hernani Costa — Wed, 20 May 2026 06:57:51 +0000

Scaling coding agents without governance creates technical debt faster than velocity. Before your team deploys autonomous workflows, MCP integrations, or background automation, you need answers to 15 practical readiness questions—not maturity theater, but the control, review, and governance decisions that separate leverage from chaos.

TL;DR: Before you scale coding agents, MCP, or AI workflows, answer these 15 readiness questions on control, review, governance, and context access.

Before you expand coding agents, MCP access, or background automation, make sure your team can answer the questions that determine whether scale creates leverage or chaos.

A lot of engineering teams think they are ready for AI because the tools work. That is not the same thing as being ready to scale them.

By April 2026, the strongest products already assume much more autonomous behavior than the "copilot" label suggests. OpenAI positions Codex as a command center for multiple agents, long-running tasks, built-in worktrees, and scheduled automations. GitHub Copilot coding agent can work independently in the background, open pull requests, and run in a sandboxed development environment powered by GitHub Actions. Anthropic positions Claude Code as a terminal-native agent that can connect to external tools and data through MCP. The MCP project itself is now in a more formal maturity phase, with an official registry in preview and a 2026 roadmap centered on transport scalability, agent communication, governance, and enterprise readiness.

That means readiness is no longer about whether one developer got a good result from one tool. It is about whether your team has the operating model to supervise, govern, review, and standardize AI-enabled work. NIST's AI Risk Management Framework and its Generative AI Profile reinforce the same principle from a governance angle: trustworthy AI use requires structured design, evaluation, and risk management across the lifecycle, not just model access.

This article gives you 15 questions to answer before you scale AI across engineering. They are not abstract maturity prompts. They are the practical questions that sit underneath control, context access, workflow design, review logic, security, observability, and rollout. If your team cannot answer most of them clearly, scaling usually increases inconsistency faster than productivity.

1. What exactly are we scaling?

A surprising number of teams cannot answer this cleanly. Are you scaling editor assistance, terminal-native execution, background coding agents, GitHub-native issue-to-PR workflows, shared MCP-connected tools, or a broader multi-agent operating model? Those are different things, with different trust and review implications. OpenAI, GitHub, Anthropic, and MCP are clearly optimizing for different layers of the stack now.

2. Which workflows stay advisory, and which become executable?

This is one of the first readiness gates. GitHub's documentation makes clear that Copilot coding agent works independently in the background but still requests human review. OpenAI frames Codex around directing and supervising agents rather than handing over uncontrolled autonomy. If your team has not split "suggest," "execute," "submit for review," and "never allow," then it is not ready to scale.
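One way to make that four-way split explicit is a small policy table. The sketch below is illustrative only: the workflow names, the `Tier` enum, and the deny-by-default rule are assumptions for the example, not something GitHub or OpenAI prescribes.

```python
from enum import Enum


class Tier(Enum):
    SUGGEST = "suggest"                      # agent proposes, a human applies
    EXECUTE = "execute"                      # agent may act on its own
    SUBMIT_FOR_REVIEW = "submit_for_review"  # agent opens work, a human approves
    NEVER_ALLOW = "never_allow"              # agents are not permitted here


# Hypothetical workflow names; replace them with your own.
POLICY = {
    "docs_update": Tier.EXECUTE,
    "bug_fix": Tier.SUBMIT_FOR_REVIEW,
    "architecture_change": Tier.SUGGEST,
    "prod_database_migration": Tier.NEVER_ALLOW,
}


def tier_for(workflow: str) -> Tier:
    """Return the tier for a workflow; anything unlisted is denied."""
    return POLICY.get(workflow, Tier.NEVER_ALLOW)


def may_run_unattended(workflow: str) -> bool:
    """Only the EXECUTE tier lets an agent act without a human step."""
    return tier_for(workflow) is Tier.EXECUTE
```

Denying unlisted workflows by default is a design choice for this sketch: it forces the team to classify a workflow before an agent touches it.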

3. Where should the primary control plane live?

Your control plane might be the terminal, the IDE, GitHub, a desktop command center, or a hybrid model. Claude Code is terminal-native. GitHub Copilot coding agent is GitHub-native. Codex is positioned as a supervisory command center across app, CLI, IDE, and cloud. If your team has not decided where agent work should start, run, and be supervised, adoption will fragment fast.

4. What systems can agents reach, and through what path?

This is now a core architecture question. Anthropic documents Claude Code MCP access to issue trackers, monitoring, databases, design tools, and workflow systems. OpenAI's MCP guidance separates hosted MCP tools, Streamable HTTP MCP servers, and stdio MCP servers, which means tool access is no longer just "on" or "off." It is a design choice.

5. Do we actually need MCP yet?

MCP is increasingly important, but not every team needs it everywhere. The official registry is in preview, and the roadmap shows the protocol is moving toward broader production and enterprise use. But if your workflows are still local, narrow, and weakly governed, MCP can add infrastructure overhead before it adds real value. The readiness question is not "Can we add MCP?" It is "Do our workflows now require a shared context layer?"

6. Which transport and trust boundary make sense for our context layer?

The MCP roadmap highlights transport evolution and scalability as a priority area, and vendor documentation now distinguishes local and remote patterns much more clearly. Anthropic documents local, project, and user scopes for Claude Code MCP servers. Those are not minor implementation details. They are trust-boundary choices. If your team cannot explain what should stay local, what can be shared at project scope, and what justifies remote service access, it is not ready to scale context exposure.
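A team could record those trust-boundary choices in a simple inventory and check it mechanically. The following is a minimal sketch, assuming a hypothetical `ContextServer` record and the rule that remote access needs a written justification; the transport labels are illustrative strings, not identifiers from any MCP SDK.

```python
from dataclasses import dataclass

SCOPES = {"local", "project", "user"}    # scope names mentioned above
LOCAL_TRANSPORTS = {"stdio"}             # assumed label for local processes
REMOTE_TRANSPORTS = {"streamable_http"}  # assumed label for remote services


@dataclass(frozen=True)
class ContextServer:
    name: str
    scope: str
    transport: str
    justification: str = ""  # why remote access is needed, if it is


def boundary_problems(server: ContextServer) -> list[str]:
    """List the trust-boundary questions this entry leaves unanswered."""
    problems = []
    if server.scope not in SCOPES:
        problems.append(f"{server.name}: unknown scope {server.scope!r}")
    if server.transport not in LOCAL_TRANSPORTS | REMOTE_TRANSPORTS:
        problems.append(f"{server.name}: unknown transport {server.transport!r}")
    if server.transport in REMOTE_TRANSPORTS and not server.justification.strip():
        problems.append(f"{server.name}: remote access lacks a written justification")
    return problems
```

An empty list means the entry states its scope, its transport, and, for remote access, a reason someone wrote down.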

7. How isolated should execution be?

GitHub says Copilot coding agent runs in a sandbox development environment powered by GitHub Actions. OpenAI previously described Codex tasks as running in cloud sandbox environments, and the current Codex app emphasizes isolated worktrees so multiple agents can work on the same repo without conflicts. Readiness means deciding whether your workflows belong on developer machines, in remote sandboxes, in isolated worktrees, or in customer-controlled infrastructure.

8. What is our human review model?

A team is not ready to scale if review still depends on "someone will probably look at it." GitHub explicitly says Copilot coding agent requests review and documents security protections, limitations, and risk mitigations. OpenAI's Codex app is designed around reviewing changes, commenting on diffs, and supervising long-running work. Readiness means knowing what can be auto-executed, what must be reviewed, who approves, and how override works.

9. What counts as success beyond speed?

NIST's AI RMF and Generative AI Profile both push organizations toward trustworthiness, evaluation, and risk-aware lifecycle management. For engineering teams, that means measuring more than output volume. You need to know rework rates, review burden, exception rates, quality drift, and whether the workflow actually became more repeatable. If you only measure speed, you will overestimate readiness.
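As a rough illustration, these measures can be computed from a handful of fields per agent-created pull request. The record shape and field names below are assumptions for the sketch; they do not come from GitHub's API or from NIST.

```python
from dataclasses import dataclass


@dataclass
class AgentPullRequest:
    review_minutes: float    # human time spent reviewing
    follow_up_commits: int   # human commits needed after the agent's work
    merged: bool
    needed_exception: bool   # e.g. a policy override or manual escalation


def operating_metrics(prs: list[AgentPullRequest]) -> dict[str, float]:
    """Summarize operating quality, not just output volume."""
    if not prs:
        raise ValueError("need at least one pull request")
    total = len(prs)
    return {
        "rework_rate": sum(pr.follow_up_commits > 0 for pr in prs) / total,
        "avg_review_minutes": sum(pr.review_minutes for pr in prs) / total,
        "exception_rate": sum(pr.needed_exception for pr in prs) / total,
        "merge_rate": sum(pr.merged for pr in prs) / total,
    }
```

Tracking the same four numbers before and after a rollout change gives a simple view of quality drift over time.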

10. Can we see what the agents actually did?

Observability is a readiness test. GitHub's coding-agent docs now include session logs, security validation details, and guidance on measuring pull request outcomes. OpenAI frames Codex around supervising parallel work and automations, which only works if activity is legible. If your team cannot reconstruct what happened, why it happened, and where it failed, scale will create hidden risk.

11. Where are our permissions, tokens, and secrets exposed?

GitHub's coding-agent docs call out restricted internet access, scoped repository permissions, branch protections, and mitigations against prompt injection. Anthropic's MCP documentation covers OAuth flows and scope-aware access patterns. Those are signs that identity, secret handling, and permission boundaries are already part of the mainstream product design. If your team has not mapped its exposure model, it is not ready.

12. What becomes a team standard, and what stays experimental?

Readiness is partly about deciding what deserves to compound. Codex supports shared skills across surfaces. Claude Code supports shared project guidance and project-scoped MCP configuration. GitHub offers organization-level governance over coding-agent availability. Those product choices all reward shared patterns over private hacks. A team that cannot distinguish "useful experiment" from "candidate standard" will scale noise.

13. Are we ready to support multi-agent work, or are we still managing single-agent habits?

OpenAI's Codex app is explicit that the core challenge has shifted from what agents can do to how people direct, supervise, and collaborate with them at scale. That is a very different readiness question from "Can one assistant help one engineer?" If your team is still organized around isolated assistant usage, multi-agent scaling may be premature even if the tools are impressive.

14. Do we know which workflows should scale first?

Not every successful workflow should become a standard. Readiness means having a rollout logic. Good early candidates are usually narrow, frequent, and easy to review. GitHub's documented agent tasks include bugs, incremental features, test coverage, documentation, and technical debt. Those are good examples because they are bounded enough to evaluate. If your team wants to start with its messiest, most cross-functional workflow, it is probably not ready.

15. If this works, what operating model are we actually moving toward?

This is the final readiness question, and the most strategic one. Are you moving toward a terminal-first engineering model, a GitHub-native delegation model, a multi-agent supervisory model, a customer-hosted execution model, or a layered system that combines several of these? If you cannot name the target operating model, you are not scaling intentionally. You are just accumulating tools.

A practical readiness lens

If I were reviewing an engineering team's readiness right now, I would group those 15 questions into five domains.

Control
What is being delegated, where work runs, and how people stay in charge.

Context
What systems agents can reach, through which scopes, transports, and approval rules.

Review
What gets checked, blocked, approved, or escalated before work becomes trusted output.

Governance
How permissions, secrets, policies, and risk management are handled.

Standardization
What becomes a repeatable team pattern instead of a private experiment.

If your team is weak in more than one of those domains, the right next step is usually not "buy more AI."

It is "tighten the operating model first."
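The "weak in more than one domain" rule is simple enough to write down. This is a minimal sketch, assuming each domain gets a self-assessed score from 1 to 5 and that anything below 3 counts as weak; both the scale and the threshold are assumptions, as is the wording of the second outcome.

```python
DOMAINS = ("control", "context", "review", "governance", "standardization")


def weak_domains(scores: dict[str, int], threshold: int = 3) -> list[str]:
    """Return the domains scoring below the threshold, in lens order."""
    missing = [d for d in DOMAINS if d not in scores]
    if missing:
        raise ValueError(f"missing scores for: {', '.join(missing)}")
    return [d for d in DOMAINS if scores[d] < threshold]


def next_step(scores: dict[str, int], threshold: int = 3) -> str:
    """Apply the 'weak in more than one domain' rule."""
    if len(weak_domains(scores, threshold)) > 1:
        return "tighten the operating model first"
    return "proceed with a narrow rollout"  # assumed wording, not from the article
```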

My take

Most engineering teams are less ready to scale than they think.

Not because the tools are weak.

Because the tools got stronger faster than the surrounding management system.

That is what the current vendor and protocol landscape is telling us. Codex assumes multi-agent supervision. GitHub assumes background delegation with structured review. Claude Code assumes terminal-native execution with optional external tool access. MCP assumes that context exposure itself deserves standardized design. NIST assumes that trustworthy AI use requires lifecycle thinking, not just deployment enthusiasm.

That is why readiness is now the real bottleneck.

Key takeaways

AI readiness for engineering teams in 2026 is not a vague maturity score. It is the ability to answer practical questions about control, context access, review, governance, observability, and standardization before more autonomy enters the system. The current product direction across OpenAI, GitHub, Anthropic, and MCP shows that these questions are no longer optional.

The teams that scale well will not be the ones that adopt the most tools first. They will be the ones that can answer these 15 questions clearly enough to make autonomy governable. NIST's AI RMF and Generative AI Profile reinforce the same lesson: trust, oversight, and lifecycle management have to be designed in, not bolted on later.

Written by Dr Hernani Costa | Powered by Core Ventures

Originally published at First AI Movers.

Technology is easy. Mapping it to P&L is hard. At First AI Movers, we don't just write code; we build the 'Executive Nervous System' for EU SMEs.

Is your architecture creating technical debt or business equity?

👉 Get your AI Readiness Score (Free Company Assessment)

Our AI readiness assessment for EU SMEs covers control, context access, review, governance, and standardization: the five domains that separate scaling leverage from chaos. We also offer AI strategy consulting, workflow automation design, and operational AI implementation guidance.

AI Coding Rollouts Fail on Governance, Not Models

Dr Hernani Costa — Tue, 19 May 2026 06:57:44 +0000

The $2M mistake most engineering leaders make: scaling AI coding autonomy before designing control systems.

The biggest risk in 2026 is not weak AI coding models. It is weak rollout design, unclear review logic, unmanaged context access, and teams scaling autonomy before they can govern it.

Many technical leaders still assume AI coding rollouts fail because the models are not good enough. That is becoming the wrong diagnosis.

By 2026, the leading products are already built for much more than autocomplete. OpenAI positions Codex as a command center for multiple agents and always-on automations. GitHub's Copilot coding agent can work independently in the background on repository tasks. Claude Code can automate GitHub workflows and connect to external tools. These are not lightweight assistant patterns; they are early operating models for delegated software work.

That means the failure point has moved. For many teams, the model is no longer the first thing that breaks. The rollout is.

Most AI coding rollouts fail because the team scales capability faster than it designs control. The products now assume background work, delegated execution, shared context, and structured review. NIST's Generative AI Profile makes the same point from a governance perspective: trustworthy AI use depends on lifecycle design, evaluation, and risk management, not just model access.

The Market Assumes More Autonomy Than Most Teams Are Ready For

OpenAI says the core challenge has shifted from what agents can do to how people direct, supervise, and collaborate with them at scale. GitHub says Copilot coding agent can work independently in the background "just like a human developer." Anthropic documents Claude Code GitHub Actions that can analyze code, implement features, and create pull requests from an @claude mention.

That is why the bottleneck is shifting from intelligence to management. If your team still treats these tools like smarter autocomplete, the rollout logic will lag behind the actual capability surface.

Failure Mode 1: The Team Never Defines What Is Advisory Versus Executable

This is one of the most common rollout mistakes. Teams enable agentic tools before deciding what should stay suggestive, what can execute, and what can submit work for review. GitHub's own documentation makes clear that Copilot coding agent still has limitations and works inside a constrained workflow. OpenAI frames Codex around supervision and review, not unrestricted autonomy.

When those boundaries stay implicit, the rollout becomes socially negotiated instead of architected. That usually looks fast for a few weeks and then messy for months.

Failure Mode 2: Context Access Grows Faster Than Trust Boundaries

The next failure shows up when teams expand what agents can see and touch before they define the context model. Anthropic's Claude Code MCP docs describe local, project, and user scopes, which is effectively a trust-boundary system. OpenAI's MCP guidance distinguishes different server types and supports approval controls and tool filtering.

This means MCP is not just a convenience layer anymore. It is part of the rollout architecture. If your team adds shared tool access before it decides what should stay local, what should be project-scoped, and what needs approval, the rollout becomes a governance problem before it becomes a productivity win.

Failure Mode 3: Review Stays Informal While Delegation Becomes Real

A lot of teams say they have "human in the loop," but what they really have is "someone usually checks the output." That is not a rollout model.

GitHub explicitly documents built-in security protections, risks, and limitations for its coding agent, and its workflow is built around the agent opening work for human review. OpenAI describes Codex as a place to review diffs, comment on changes, and supervise multiple agents. These are product-level acknowledgments that review is not optional once agents are acting in the background.

If review logic is still informal, scale will expose it quickly. The model did not fail in that case. The operating model did.

Failure Mode 4: Teams Confuse Isolation with Safety

Isolation matters, but isolation alone is not enough. GitHub says Copilot coding agent uses a sandbox development environment. Cursor says background agents run in isolated VMs. But Cursor also warns that background agents have internet access and auto-run terminal commands, introducing data exfiltration risk via prompt injection.

This is a useful reminder for technical leaders. A rollout does not become safe just because the work happens away from a developer laptop. You still need permission design, network boundaries, review thresholds, and a clear understanding of what the agent is allowed to do.

Failure Mode 5: The Team Scales Usage Before Standardizing One Good Pattern

Many rollouts fail because they try to scale behavior before they standardize one repeatable workflow. OpenAI's Codex app supports shared skills. Anthropic's GitHub Actions setup uses project standards. GitHub structures coding-agent work around issue-to-PR and reviewable repository workflows. Those product choices all reward repeatable patterns over improvisation.

If every engineer uses a different tool, context, instructions, and review thresholds, the team is not rolling out a system. It is funding individual experiments.

Failure Mode 6: Success Is Measured in Output Volume Instead of Operating Quality

This is where rollout enthusiasm usually hides the damage. Teams count generated code, faster issue turnaround, or more pull requests. But NIST's AI RMF and its Generative AI Profile emphasize that trustworthy adoption requires evaluation, monitoring, and risk-aware lifecycle management.

In engineering terms, that means tracking rework, review burden, failure categories, exception rates, and whether the workflow became more reliable, not just faster. If the only KPI is "the agent produced more," the rollout can look successful while quietly increasing cleanup, risk, and operational fragility.

Failure Mode 7: The Team Buys a Tool When It Really Needs an Operating Model

This is the strategic failure underneath the others. The product category now spans multi-agent supervision, terminal-native execution, and background automation. The buying decision is no longer just "which coding tool is smartest?" It is "how should our engineers, agents, repos, tools, and approvals work together?"

When a team buys a tool without answering that question, the rollout usually fails before the model does.

What a Stronger Rollout Looks Like

A better rollout starts smaller and gets stricter sooner. It usually has five characteristics:

A narrow first workflow: Start with one or two workflows that are frequent, bounded, and easy to review.
Explicit execution boundaries: Define what stays advisory, what can execute, and what always requires approval.
Controlled context access: Only expose the systems and tools the workflow actually needs.
Standardized review logic: Make review a designed step, not a cultural hope.
Better metrics: Track rework, review load, exceptions, and repeatability, not just output volume.

Before You Scale: A Rollout Checklist

Before you expand AI coding across the team, answer these questions:

What exactly are we scaling?
Which workflows are advisory versus executable?
Where does context access need to stop?
What review step is mandatory?
Which metrics show operating quality, not just output?
What becomes a shared team standard?

If those answers are still fuzzy, the right next step is not a bigger rollout. It is a tighter one.

From Rollout Risk to Operating Clarity

Getting this right requires a shift from tool adoption to operating model design. The challenge is not finding better AI—it's building the governance infrastructure to scale it safely.

Written by Dr Hernani Costa | Powered by Core Ventures

Originally published at First AI Movers.

Our AI Readiness Assessment identifies governance gaps before they become operational crises. We help technical leaders design rollout strategies that scale autonomy without sacrificing control.

AI Dev Tool Evaluation: Operating Model Test, Not Feature Bake-Off

Dr Hernani Costa — Mon, 18 May 2026 06:57:46 +0000

The hidden cost of slow AI tool evaluation isn't time—it's technical debt hardening into your stack before you've tested the workflow that actually matters.

A practical evaluation model for technical leaders who need to compare coding agents, context layers, and workflow tools without turning the process into a six-week procurement ritual.

Most AI dev-tool evaluations fail for the opposite reason most software rollouts fail. They are too careful in the wrong places.

Teams spend weeks comparing features, debating model preferences, and watching demos. Then they make a decision without testing the things that actually determine success: where work runs, how review happens, what context gets exposed, and whether the workflow fits the team's real operating model. By April 2026, the major products already make that obvious. OpenAI's Codex app is built around supervising multiple agents, parallel work, worktrees, and automations. GitHub Copilot coding agent works in the background and requests human review. Claude Code is terminal-native and can connect to tools through MCP or automate GitHub workflows. Cursor background agents run in isolated Ubuntu-based machines, with internet access and auto-running terminal commands.

A good evaluation process should be fast enough to preserve momentum and structured enough to prevent expensive mistakes. That means testing the workflow, not just the model. It also means borrowing a lesson from AI governance rather than from traditional software procurement: NIST's AI Risk Management Framework and its Generative AI Profile both emphasize lifecycle thinking, evaluation, and risk management rather than simple capability access. In practice, for engineering teams, that means the right question is not "Which tool looks smartest?" It is "Which tool or combination of tools produces a governed, reviewable, repeatable workflow for the work we actually do?"

Why Most Evaluations Slow Teams Down

They slow down because they try to answer too many questions at once.

A CTO says the team needs an "AI coding tool evaluation," but the category now contains several different things: terminal-native agents, GitHub-native background agents, desktop multi-agent supervisors, remote background agents, and context-layer tooling through MCP. Those are different operating choices. OpenAI's Codex app is designed as a command center for multiple agents. GitHub Copilot coding agent is built around issue and pull-request workflows with review. Claude Code is built around terminal-native execution close to the repository. OpenAI's Agents SDK positions MCP as a standard way to provide tools and context, with hosted MCP, Streamable HTTP MCP, and stdio options.

So the evaluation gets bloated before it even starts.

The team is really evaluating control planes, review models, context boundaries, and execution environments, but it still thinks it is comparing "AI dev tools."

What to Evaluate Instead

The fastest useful evaluation is built around five questions.

1. Where does the work actually happen?

If your best engineers live in the terminal, a terminal-native agent may fit better than an IDE-centered experience. If your workflow is already GitHub-centric, background PR-oriented delegation may matter more than live editing assistance. If your team wants asynchronous remote execution, Cursor's background agents or a multi-agent supervisor like Codex may fit better. These are operating-shape decisions, not cosmetic ones.

2. How does review actually work?

GitHub's own docs tell users to review Copilot-created pull requests thoroughly before merging. Copilot coding agent is treated as an outside collaborator, cannot mark its own PRs ready, and cannot approve or merge them. OpenAI's Codex app is built around reviewing diffs and supervising long-running work. That means the review model is not a side concern. It is one of the main evaluation dimensions.

3. What context does the tool need?

Claude Code can connect to external tools, databases, issue trackers, design systems, and APIs through MCP. OpenAI's MCP support now spans hosted MCP, Streamable HTTP MCP, and stdio. If the workflow depends on external context, you are not just evaluating a coding assistant. You are evaluating context architecture.

4. How isolated is execution?

Cursor's background agents run in isolated Ubuntu-based machines, clone repos from GitHub, can install packages, have internet access, and auto-run terminal commands. GitHub says Copilot coding agent runs in a sandbox development environment with restricted permissions and branch limits. Isolation changes the trust model, but it does not remove the need for review and governance.

5. Can the workflow become a team standard?

Codex uses shared skills across app, CLI, IDE, and cloud. Claude Code GitHub Actions follows project standards and CLAUDE.md guidance. GitHub offers organization-level controls for coding-agent availability. The right evaluation should test whether the workflow can become a repeatable team pattern rather than remain a private power-user trick.

A Faster, Sharper Evaluation Model

Here is the process I would use.

Week 1: Choose two real workflows, not one synthetic benchmark

Do not start with a broad bake-off.

Pick two workflows your team actually cares about. One should be narrow and frequent, such as bug fixes, test generation, or documentation updates. The other should be slightly broader, such as issue-to-PR flow or repo analysis with implementation suggestions. GitHub's own examples for coding-agent work include fixing bugs and implementing incremental features, which is a good pattern for this kind of test.

Now define the success criteria before testing:

Review burden
Rework required
Time to first acceptable result
Clarity of agent behavior
Ease of handoff to the human developer

That keeps the evaluation grounded in operating outcomes rather than enthusiasm.

Week 1: Constrain the context on purpose

Do not give every tool maximum access from day one.

If the workflow needs only repo context, keep it there. If it needs one external tool, add one external tool. Anthropic's MCP docs and OpenAI's MCP guidance both make clear that context access can be scoped and structured. That is an advantage. Use it. A tighter context boundary makes it much easier to see whether the tool is genuinely useful or just powerful because you exposed half the company to it.
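For the evaluation itself, the boundary can be written as a per-workflow allowlist and checked before each test run. The workflow and context-source names below are hypothetical placeholders, and the helper is only a sketch of the idea, not part of any vendor's tooling.

```python
# Hypothetical workflows and context sources for a two-workflow evaluation.
ALLOWED_CONTEXT = {
    "bug_fix": {"repo"},
    "issue_to_pr": {"repo", "issue_tracker"},
}


def excess_context(workflow: str, requested: set[str]) -> set[str]:
    """Return the requested sources that go beyond the workflow's allowlist.

    An unknown workflow has an empty allowlist, so everything is excess.
    """
    allowed = ALLOWED_CONTEXT.get(workflow, set())
    return requested - allowed
```

A non-empty result is a prompt to either tighten the tool's configuration or consciously widen the allowlist, rather than letting access grow by default.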

Week 1: Force review into the evaluation

If a tool's output is good but the review process is awkward, the workflow will not scale.

That is why you should evaluate review as a first-class criterion. GitHub explicitly requires human review for Copilot coding-agent output. OpenAI's Codex app is also designed around diff review and supervision. So your evaluation should include:

How readable the changes are
How easy it is to request follow-up changes
How much back-and-forth is required
Whether the human reviewer stays in control without becoming a bottleneck

Week 2: Compare operating fit, not just output quality

By the second week, the team should stop asking which tool produced the flashiest result.

Instead, compare:

Which tool matched the team's natural working surface
Which tool created the cleanest review loop
Which tool required the least fragile context setup
Which tool fit the security and infrastructure posture
Which tool could realistically become a shared standard

This is where the real decision appears. Cursor may win for remote asynchronous execution. Claude Code may win for terminal-native repo work. GitHub Copilot may win for GitHub-native issue-to-PR flow. Codex may win when multi-agent supervision and automation matter more than single-session editing. Those are all valid wins, but they are wins in different operating models.

The Scorecard to Actually Use

Do not score 25 features. Score seven things, each on a 1 to 5 scale:

Workflow fit: Does it match how your team already works?
Review quality: Does it make human review cleaner or heavier?
Context discipline: Can you keep access narrow and understandable?
Isolation and trust: Is the execution model acceptable for your environment?
Standardization potential: Can this become a shared pattern?
Speed to acceptable output: Not speed to first output. Speed to output a human could actually approve.
Governance friction: How much policy, security, or access cleanup will this create later?

If you score those seven honestly, you will usually know enough to decide.
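To keep the scoring consistent across reviewers, the scorecard can be captured in a few lines. This sketch assumes an unweighted sum and that 5 is always the favorable end of the scale (so for governance friction, 5 means the least friction); both are assumptions, and the tool names in the usage are placeholders.

```python
CRITERIA = (
    "workflow_fit",
    "review_quality",
    "context_discipline",
    "isolation_and_trust",
    "standardization_potential",
    "speed_to_acceptable_output",
    "governance_friction",  # assumed: 5 = least friction
)


def total_score(scores: dict[str, int]) -> int:
    """Validate one tool's scorecard and return its unweighted total."""
    if set(scores) != set(CRITERIA):
        raise ValueError("scorecard must cover exactly the seven criteria")
    for name, value in scores.items():
        if not 1 <= value <= 5:
            raise ValueError(f"{name} must be between 1 and 5, got {value}")
    return sum(scores.values())


def rank_tools(scorecards: dict[str, dict[str, int]]) -> list[tuple[str, int]]:
    """Return (tool, total) pairs, highest total first."""
    totals = {tool: total_score(card) for tool, card in scorecards.items()}
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Usage with placeholder tools and made-up scores.
ranking = rank_tools({
    "tool_a": dict.fromkeys(CRITERIA, 3),
    "tool_b": dict.fromkeys(CRITERIA, 4),
})
```

The total is a conversation starter rather than a verdict: two tools with similar totals can still fit very different operating models.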

What Not to Do

Do not run an abstract benchmark contest across ten tools.

Do not ask every engineer for an unstructured vibe-based opinion.

Do not test the tools with perfect prompts, full admin access, and no review constraints, then assume the results will hold in production.

Do not treat MCP as free infrastructure if the workflow does not need a shared context layer yet. OpenAI's SDK already treats approval flow and tool filtering as meaningful concerns, and Anthropic's MCP docs make scope and auth part of the operating model. That is a clue that context access should be evaluated with as much discipline as code generation.
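
To evaluate context access with that kind of discipline, it helps to write the intended access policy down before the pilot starts. The sketch below is a generic illustration, not the OpenAI SDK's or MCP's actual API: `ContextPolicy`, `decide`, and the tool names in the usage note are all hypothetical. It shows the two ideas mentioned above, filtering which tools an agent can reach and routing some of them through human approval.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContextPolicy:
    """Hypothetical per-workflow policy for which tools an agent may call."""
    allowed_tools: frozenset[str]
    needs_approval: frozenset[str] = frozenset()

    def decide(self, tool_name: str) -> str:
        """Return 'deny', 'ask_human', or 'allow' for a requested tool call."""
        if tool_name not in self.allowed_tools:
            return "deny"       # default-deny anything not explicitly listed
        if tool_name in self.needs_approval:
            return "ask_human"  # allowed, but only after a human approves
        return "allow"
```

For example, a policy built with `allowed_tools=frozenset({"read_repo", "open_pr"})` and `needs_approval=frozenset({"open_pr"})` would allow `read_repo`, ask a human before `open_pr`, and deny everything else. If the team cannot write such a policy for a workflow, that is a sign the workflow is not ready for a shared context layer.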

The Real Evaluation Is an Operating Model Test

The fastest way to evaluate AI dev tools is not to make the process smaller. It is to make it sharper.

Most teams waste time because they evaluate too broadly and too abstractly. They compare tool brands before they compare workflow shape. They compare models before they compare review quality. They compare features before they compare operating fit.

That is why the right evaluation in 2026 is really a miniature operating-model test.

You are asking whether this tool can become part of a governed, repeatable team workflow. If the answer is no, it does not matter how impressive the demo looked. The current product surfaces across Codex, Copilot coding agent, Claude Code, Cursor, and MCP all point to the same lesson: the stack is becoming more autonomous, more connected, and more workflow-shaped. Your evaluation process should reflect that.

Key Takeaways

You can evaluate AI dev tools quickly without slowing the team down, but only if you stop treating the exercise like generic software procurement. In 2026, the meaningful differences across products are about control planes, review models, context exposure, isolation, and standardization potential, not just model quality or interface polish.

The best process is simple: choose two real workflows, constrain context intentionally, force review into the test, and score operating fit instead of feature abundance. Teams that do that will move faster and make better choices. Teams that do not will waste time comparing the wrong things. NIST's AI risk guidance supports the same underlying principle: lifecycle evaluation and risk-aware design matter more than capability access alone.

Written by Dr Hernani Costa | Powered by Core Ventures

Originally published at First AI Movers.

Technology is easy. Mapping it to P&L is hard. At First AI Movers, we don't just write code; we build the 'Executive Nervous System' for EU SMEs.


Agent Interoperability: When A2A Solves Real Problems vs. Protocol Debt

When Agent-to-Agent Interoperability Helps and When It Just Adds Complexity

A2A and MCP solve different problems

When A2A genuinely helps

1. When independent agents need to coordinate across real boundaries

2. When long-running, multi-step collaboration is the real workload

3. When organizational separation matters as much as technical separation

4. When you already know a single control plane is not enough

When A2A just adds complexity

1. When the real problem is still tool access, not agent collaboration

2. When teams have not standardized one governed workflow yet

3. When preview-stage enterprise support is being mistaken for operational maturity

4. When the architecture is trying to solve politics with protocols

The real decision is about coordination maturity

You are probably not ready to standardize A2A yet if:

You may be ready to evaluate A2A seriously if:

A practical decision lens for technical leaders

Step 1: classify the real problem

Step 2: ask whether the agents are truly independent

Step 3: check governance before protocol

Step 4: prefer the smallest working architecture

My take

Key takeaways

A2A Protocol Standardization: The $2M Governance Gap CTOs Miss

A2A in 2026: What Technical Leaders Should Watch Before Standardizing It

Agent-to-agent interoperability is getting more real. That does not mean your team should standardize it yet.

Overview

First, watch whether you have a real interoperability problem

Second, watch protocol maturity rather than protocol enthusiasm

Third, watch the difference between protocol support and enterprise readiness

Fourth, watch whether your governance model is stronger than the protocol layer

Fifth, watch whether MCP is still the more urgent standardization problem

Sixth, watch deployment fit, not just protocol support

Seventh, watch whether vendor support is getting deeper or just louder

What I would tell a CTO to monitor over the next quarter

My take

Key takeaways

Further Reading

EU AI Act Compliance: 11 Questions CTOs Must Answer Before 2026

EU AI Act Questions Technical Leaders Should Answer Before Scaling Agentic Workflows

1. What is the intended purpose of this workflow?

2. Are we acting as provider, deployer, or both?

3. Does any workflow fall into a prohibited or clearly sensitive category?

4. If the workflow is high-risk, do we have the basics the Act expects?

5. Do we have a real human oversight model, or just a human somewhere near the workflow?

6. Are we collecting the logs and documentation we would need later?

7. Are our staff and operators AI-literate enough for the workflows we are scaling?

8. If we rely on GPAI models, what do we need from vendors now?

9. Do transparency obligations affect our workflow design?

10. If we are a public body or in a sensitive use case, do we owe a fundamental rights impact assessment?

11. Are we waiting for standards, or do we already know enough to act?

A Practical Framework for Technical Leaders

My Take

Key Takeaways

Environment Readiness Decides AI Delivery, Not Agent Quality

The market is quietly admitting that environment quality now decides outcomes

Why great agents still fail in bad environments

Old engineering truths still decide agent performance

What environment readiness actually means

1. Fast feedback loops

2. Written instructions instead of hidden tribal knowledge

3. Explicit review design

4. Permissions and boundaries

5. Observability and measurement

6. Security and governance

The easiest mistake to make

What CTOs should fix first

My take

Key takeaways

Further Reading

Metacognition: The Self-Correction Layer AI Rollouts Miss

Metacognition, Translated for Technical Leaders

Why This Matters More Now

The Missing Layer in Most AI Rollouts

1. They confuse activity with progress

2. They blame the model before checking the environment

3. They scale before they standardize

4. They defend the rollout instead of updating it

What Metacognition Looks Like in Practice