DEV Community

George Belsky


A2A Tells Agents How to Talk. It Doesn't Tell Them What Happens When Things Break.

Google's Agent-to-Agent protocol is a clean spec. Agent A sends a task to Agent B over HTTP. Agent B responds. Done.

Until Agent B crashes mid-task. Or goes down for 20 minutes during a deploy. Or needs a human to approve something before it can continue.

What A2A Gives You

A2A defines how agents find each other (Agent Cards), how they exchange messages (Tasks), and how they stream responses (SSE). It's a communication protocol. A good one.

What it doesn't define:

  • What happens when the receiving agent crashes mid-task
  • How to retry delivery if the first attempt fails
  • How to enforce a timeout if the agent takes too long
  • How to add a human approval step between agents
  • How to track whether a task actually completed
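To make the last gap concrete, here is a minimal sketch of the bookkeeping a sender needs just to know whether a task finished. The `TaskTracker` class, the state names, and the in-memory storage are all illustrative assumptions, not taken from any particular library; a production version would persist this to a database.

```python
from dataclasses import dataclass, field
from enum import Enum


class TaskState(Enum):
    SUBMITTED = "submitted"
    WORKING = "working"
    COMPLETED = "completed"
    FAILED = "failed"


# Allowed transitions; terminal states have no outgoing edges.
ALLOWED = {
    TaskState.SUBMITTED: {TaskState.WORKING, TaskState.FAILED},
    TaskState.WORKING: {TaskState.COMPLETED, TaskState.FAILED},
    TaskState.COMPLETED: set(),
    TaskState.FAILED: set(),
}


@dataclass
class TaskTracker:
    """Hypothetical in-memory tracker for task state and history."""
    states: dict = field(default_factory=dict)
    history: list = field(default_factory=list)

    def submit(self, task_id):
        self.states[task_id] = TaskState.SUBMITTED
        self.history.append((task_id, TaskState.SUBMITTED))

    def transition(self, task_id, new_state):
        current = self.states[task_id]
        if new_state not in ALLOWED[current]:
            raise ValueError(f"{current.value} -> {new_state.value} not allowed")
        self.states[task_id] = new_state
        self.history.append((task_id, new_state))

    def is_done(self, task_id):
        return self.states[task_id] in (TaskState.COMPLETED, TaskState.FAILED)


tracker = TaskTracker()
tracker.submit("task-1")
tracker.transition("task-1", TaskState.WORKING)
tracker.transition("task-1", TaskState.COMPLETED)
print(tracker.is_done("task-1"))  # True
```

Even this toy version has to decide which transitions are legal and keep a history. That is the kind of code that grows once crashes and retries enter the picture.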

These aren't edge cases. They're the entire second half of building a production multi-agent system.

The Code You End Up Writing

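import asyncio
import random

import requests


class RetryableError(Exception):
    """Raised for 5xx responses, which are worth retrying."""


# Sender: deliver task to Agent B with retry
async def send_task_with_retry(agent_url, task, max_retries=3):
    attempts = 0
    for attempt in range(max_retries):
        attempts = attempt + 1
        try:
            # requests is blocking, so run it in a worker thread
            resp = await asyncio.to_thread(
                requests.post, f"{agent_url}/tasks", json=task, timeout=10
            )
            if resp.status_code == 200:
                return resp.json()
            if resp.status_code >= 500:
                raise RetryableError()
            break  # 4xx: retrying the same request won't help
        except (requests.ConnectionError, requests.Timeout, RetryableError):
            # Exponential backoff with jitter, capped at 60 seconds
            delay = min(2 ** attempt + random.uniform(0, 1), 60)
            await asyncio.sleep(delay)

    # Delivery failed - now what?
    # db and alert are placeholders for your own storage and paging code
    db.insert("failed_tasks", task=task, attempts=attempts)
    alert(f"Task delivery failed after {attempts} attempts")
    return None

# Plus: timeout watcher, health check loop, DLQ consumer,
# delivery confirmation handler, state reconciliation cron...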

You're not writing agent logic anymore. You're writing infrastructure.
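The timeout watcher alone is a small project. As a rough sketch, assuming the call to Agent B is an awaitable, a sender-side deadline can be built on `asyncio.wait_for`. The helper name `run_with_deadline` and the two stand-in agent calls are hypothetical.

```python
import asyncio


async def run_with_deadline(coro_factory, deadline_seconds):
    """Run one attempt and give up once the deadline passes.

    Returns (status, result), where status is "completed" or "timed_out".
    """
    try:
        result = await asyncio.wait_for(coro_factory(), timeout=deadline_seconds)
        return "completed", result
    except asyncio.TimeoutError:
        # wait_for cancels the inner coroutine before raising.
        return "timed_out", None


async def slow_agent_call():
    # Stand-in for a call to Agent B that takes too long.
    await asyncio.sleep(0.2)
    return {"ok": True}


async def fast_agent_call():
    # Stand-in for a call that returns promptly.
    await asyncio.sleep(0)
    return {"ok": True}


async def main():
    print(await run_with_deadline(fast_agent_call, 0.1))
    print(await run_with_deadline(slow_agent_call, 0.05))


asyncio.run(main())
```

Note that this only bounds how long the sender waits. Agent B may still be working on the task after the sender gives up, which is why a timeout usually needs to be paired with state tracking on both sides.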

What A2A + Lifecycle Looks Like

import os

from axme import AxmeClient, AxmeClientConfig

client = AxmeClient(AxmeClientConfig(api_key=os.environ["AXME_API_KEY"]))

intent_id = client.send_intent({
    "intent_type": "intent.a2a.task_execution.v1",
    "to_agent": "agent://myorg/production/data-enrichment",
    "payload": {
        "task": "enrich_customer_data",
        "records": 1500,
        "source": "crm_export_q1",
    },
})
result = client.wait_for(intent_id)

Agent B crashes at record 800? The platform retries delivery (up to 3 attempts). Agent B takes too long? A 120-second deadline is enforced. Need a human to approve the enrichment first? Add an approval gate to the scenario. The A2A communication still works; it just has a lifecycle around it now.
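For a sense of what an approval gate involves if you build it yourself, here is a generic sketch using `asyncio.Event`. It is not how AXME implements its gate; the `ApprovalGate` class and its methods are hypothetical names chosen for illustration.

```python
import asyncio


class ApprovalGate:
    """Hypothetical gate: a task waits here until a human decides."""

    def __init__(self):
        self._decided = asyncio.Event()
        self._approved = False

    def decide(self, approved):
        # Called by whatever UI or webhook records the human's decision.
        self._approved = approved
        self._decided.set()

    async def wait(self, timeout):
        # No decision before the timeout counts as a rejection.
        try:
            await asyncio.wait_for(self._decided.wait(), timeout)
        except asyncio.TimeoutError:
            return False
        return self._approved


async def main():
    gate = ApprovalGate()
    # Simulate a human approving shortly after the task is submitted.
    asyncio.get_running_loop().call_later(0.01, gate.decide, True)
    approved = await gate.wait(timeout=1.0)
    return "run task" if approved else "skip task"


print(asyncio.run(main()))  # run task
```

This version lives in one process and forgets everything on restart. A gate that survives a crash needs the pending decision stored durably, which is the part a lifecycle layer takes over.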

What Changes, What Stays

| | Raw A2A | A2A + AXME lifecycle |
|---|---|---|
| Agent discovery | Agent Cards | Agent Cards |
| Message format | A2A Tasks | A2A Tasks (wrapped in intent) |
| Delivery | HTTP POST (you retry) | At-least-once (platform retries) |
| Crash recovery | Task lost | Automatic redelivery |
| Timeout | You build it | Configurable deadline |
| Human approval | Not in spec | Built-in gate |
| Observability | You instrument | Lifecycle events (SSE) |
| Audit trail | You build it | Every state transition logged |

A2A stays as the communication layer. AXME adds the operational layer on top.
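One consequence of at-least-once delivery is that the receiving agent can see the same task more than once, so its handler should be idempotent. A minimal sketch of deduplicating by delivery ID follows; the `IdempotentReceiver` class, the `enrich` handler, and the `"intent-123"` ID are hypothetical and only illustrate the idea.

```python
class IdempotentReceiver:
    """Hypothetical receiver that processes each delivery ID at most once."""

    def __init__(self, handler):
        self._handler = handler
        self._results = {}  # delivery_id -> cached result

    def receive(self, delivery_id, payload):
        # A redelivered message returns the stored result instead of
        # running the handler a second time.
        if delivery_id in self._results:
            return self._results[delivery_id]
        result = self._handler(payload)
        self._results[delivery_id] = result
        return result


calls = []


def enrich(payload):
    # Stand-in for Agent B's real work; records each actual execution.
    calls.append(payload["records"])
    return {"enriched": payload["records"]}


receiver = IdempotentReceiver(enrich)
first = receiver.receive("intent-123", {"records": 1500})
second = receiver.receive("intent-123", {"records": 1500})  # redelivery
print(first == second, len(calls))  # True 1
```

The in-memory dictionary is only for illustration. If Agent B can crash, the record of processed IDs has to live somewhere that survives the crash.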

When You Need This

A2A alone works fine for:

  • Request/response between two always-on agents
  • Demo scenarios where crashes don't happen
  • Internal tooling where "restart and try again" is acceptable

You need a lifecycle layer when:

  • Agent B might be down during deploys
  • Tasks take minutes or hours, not milliseconds
  • A human needs to approve before Agent B acts
  • You need to prove a task completed (compliance, audit)
  • "It probably worked" isn't good enough

Try It

Working example: Agent A sends an enrichment task to Agent B with crash recovery, timeout, and lifecycle tracking:

github.com/AxmeAI/a2a-with-durable-lifecycle

Built with AXME - durable lifecycle for agent operations. Alpha - feedback welcome.
