A2A Tells Agents How to Talk. It Doesn't Tell Them What Happens When Things Break.

#ai #python #agents #a2a

Google's Agent2Agent (A2A) protocol is a clean spec. Agent A sends a task to Agent B over HTTP. Agent B responds. Done.

Until Agent B crashes mid-task. Or goes down for 20 minutes during a deploy. Or needs a human to approve something before it can continue.

What A2A Gives You

A2A defines how agents find each other (Agent Cards), how they exchange messages (Tasks), and how they stream responses (SSE). It's a communication protocol. A good one.

What it doesn't define:

What happens when the receiving agent crashes mid-task
How to retry delivery if the first attempt fails
How to enforce a timeout if the agent takes too long
How to add a human approval step between agents
How to track whether a task actually completed

These aren't edge cases. They're the entire second half of building a production multi-agent system.

The Code You End Up Writing

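import asyncio
import random
import requests
from requests.exceptions import ConnectionError, Timeout

class RetryableError(Exception):
    """Raised for 5xx responses that are worth retrying."""

# Sender: deliver task to Agent B with retry
async def send_task_with_retry(agent_url, task, max_retries=3):
    for attempt in range(max_retries):
        try:
            # requests is blocking, so run it in a thread to keep the event loop free
            resp = await asyncio.to_thread(
                requests.post, f"{agent_url}/tasks", json=task, timeout=10
            )
            if resp.status_code >= 500:
                raise RetryableError()
            resp.raise_for_status()  # 4xx is not retryable: surface it to the caller
            return resp.json()
        except (ConnectionError, Timeout, RetryableError):
            delay = min(2 ** attempt + random.uniform(0, 1), 60)
            await asyncio.sleep(delay)

    # All retries failed - now what?
    # db and alert stand in for your own storage and paging helpers
    db.insert("failed_tasks", task=task, attempts=max_retries)
    alert(f"Task delivery failed after {max_retries} retries")
    return None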

# Plus: timeout watcher, health check loop, DLQ consumer,
# delivery confirmation handler, state reconciliation cron...

You're not writing agent logic anymore. You're writing infrastructure.

What A2A + Lifecycle Looks Like

import os

from axme import AxmeClient, AxmeClientConfig

client = AxmeClient(AxmeClientConfig(api_key=os.environ["AXME_API_KEY"]))

intent_id = client.send_intent({
    "intent_type": "intent.a2a.task_execution.v1",
    "to_agent": "agent://myorg/production/data-enrichment",
    "payload": {
        "task": "enrich_customer_data",
        "records": 1500,
        "source": "crm_export_q1",
    },
})
result = client.wait_for(intent_id)

Agent B crashes at record 800? Platform retries delivery (up to 3 attempts). Agent B takes too long? 120-second deadline enforced. Need a human to approve the enrichment first? Add an approval gate to the scenario. The A2A communication still works - it just has a lifecycle around it now.
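One way to picture "a lifecycle around it" is as a small state machine with an append-only history. The sketch below is purely illustrative: the state names, the `ALLOWED` table, and `TaskLifecycle` are assumptions made for this example and do not describe how AXME models intents internally.

```python
from dataclasses import dataclass, field
from enum import Enum


class State(Enum):
    SUBMITTED = "submitted"
    WAITING_APPROVAL = "waiting_approval"
    DELIVERED = "delivered"
    COMPLETED = "completed"
    FAILED = "failed"
    TIMED_OUT = "timed_out"


# Allowed transitions; anything else is rejected.
# DELIVERED -> DELIVERED models a redelivery after a crash.
ALLOWED = {
    State.SUBMITTED: {State.WAITING_APPROVAL, State.DELIVERED, State.FAILED},
    State.WAITING_APPROVAL: {State.DELIVERED, State.FAILED},
    State.DELIVERED: {State.DELIVERED, State.COMPLETED, State.FAILED, State.TIMED_OUT},
    State.COMPLETED: set(),
    State.FAILED: set(),
    State.TIMED_OUT: set(),
}


@dataclass
class TaskLifecycle:
    task_id: str
    state: State = State.SUBMITTED
    history: list = field(default_factory=list)

    def transition(self, new_state, reason=""):
        """Move to new_state if allowed, recording the step in history."""
        if new_state not in ALLOWED[self.state]:
            raise ValueError(f"{self.state.value} -> {new_state.value} not allowed")
        self.history.append((self.state.value, new_state.value, reason))
        self.state = new_state
```

A task that passes an approval gate, is redelivered once, and then completes leaves four entries in `history`. That list is the kind of record the audit-trail and "prove it completed" requirements call for.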

What Changes, What Stays

Concern	Raw A2A	A2A + AXME lifecycle
Agent discovery	Agent Cards	Agent Cards
Message format	A2A Tasks	A2A Tasks (wrapped in intent)
Delivery	HTTP POST (you retry)	At-least-once (platform retries)
Crash recovery	Task lost	Automatic redelivery
Timeout	You build it	Configurable deadline
Human approval	Not in spec	Built-in gate
Observability	You instrument	Lifecycle events (SSE)
Audit trail	You build it	Every state transition logged

A2A stays as the communication layer. AXME adds the operational layer on top.
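One consequence of at-least-once delivery is worth spelling out: the receiving agent can see the same task more than once, so its handler should be idempotent. A minimal sketch, assuming each task carries a stable id; `IdempotentReceiver` and `enrich` are hypothetical names, and the in-memory dict stands in for a durable store.

```python
class IdempotentReceiver:
    """Processes each task id once and returns the cached result on redelivery."""

    def __init__(self, handler):
        self._handler = handler
        self._results = {}  # task_id -> result; in-memory for illustration only

    def handle(self, task_id, payload):
        # A redelivered task short-circuits to the stored result.
        if task_id in self._results:
            return self._results[task_id]
        result = self._handler(payload)
        self._results[task_id] = result
        return result


calls = []


def enrich(payload):
    # Stand-in for the real work; records each invocation so duplicates are visible.
    calls.append(payload)
    return {"enriched": payload["records"]}


receiver = IdempotentReceiver(enrich)
first = receiver.handle("task-1", {"records": 1500})
second = receiver.handle("task-1", {"records": 1500})  # simulated redelivery
```

Here `enrich` runs once even though the task arrives twice. In a real agent the result would need to be stored durably, ideally in the same transaction as the work's side effects, or a crash between the two steps reintroduces the duplicate.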

When You Need This

A2A alone works fine for:

Request/response between two always-on agents
Demo scenarios where crashes don't happen
Internal tooling where "restart and try again" is acceptable

You need a lifecycle layer when:

Agent B might be down during deploys
Tasks take minutes or hours, not milliseconds
A human needs to approve before Agent B acts
You need to prove a task completed (compliance, audit)
"It probably worked" isn't good enough

Try It

Working example - Agent A sends enrichment task to Agent B with crash recovery, timeout, and lifecycle tracking:

github.com/AxmeAI/a2a-with-durable-lifecycle

Built with AXME - durable lifecycle for agent operations. Alpha - feedback welcome.
