Trace agent actions across workflows and kill everything in one call

#ai #python #security #opensource

Two problems kept coming up while I was building multi-step agent workflows.

First, when an orchestrator delegates to sub-agents, the audit trail turns into a mess. You get a flat list of signed actions with no way to tell which ones belong to the same workflow. Was that data:read from the reporting pipeline or the cleanup job? No idea.

Second, when something goes sideways in production, you need to stop every agent in your organization immediately, without first working out which one is causing the problem.

Trace correlation

trace_id solves the first problem. Generate one ID for the workflow, then pass it to every sign() call. Every action in that workflow shares the same trace, so you can reconstruct exactly what happened and in what order.

import asqav

asqav.init(api_key="sk_...")
agent = asqav.Agent.create("orchestrator")

# Generate a trace ID for the whole workflow
trace_id = asqav.generate_trace_id()

# Every action in this workflow shares the same trace
agent.sign("task:start", {"step": "fetch"}, trace_id=trace_id)
# "sig_abc" is a placeholder for the ID of the signature that spawned this action
agent.sign("task:delegate", {"to": "sub-agent"}, trace_id=trace_id, parent_id="sig_abc")
agent.sign("task:complete", {"result": "done"}, trace_id=trace_id)

The parent_id parameter links delegated actions back to the signature that spawned them. So you get a tree, not a list. The orchestrator signed task:delegate, and the sub-agent's actions reference that signature as their parent.

This matters for compliance. When an auditor asks "show me everything this workflow did," you filter by trace_id and get a complete, ordered chain of signed actions. No guessing, no log correlation hacks.
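To make the "tree, not a list" idea concrete, here is a minimal sketch of rebuilding one workflow's tree from a flat export of signed actions. It does not call asqav at all: the record shape (`id`, `trace_id`, `parent_id`, `action`) and the helper names `build_trace_tree` and `render` are assumptions made for illustration, not the library's actual export format.

```python
from collections import defaultdict

# Hypothetical audit records; the field names are assumed, not asqav's schema.
records = [
    {"id": "sig_1", "trace_id": "tr_A", "parent_id": None, "action": "task:start"},
    {"id": "sig_2", "trace_id": "tr_A", "parent_id": "sig_1", "action": "task:delegate"},
    {"id": "sig_3", "trace_id": "tr_B", "parent_id": None, "action": "data:read"},
    {"id": "sig_4", "trace_id": "tr_A", "parent_id": "sig_2", "action": "data:read"},
    {"id": "sig_5", "trace_id": "tr_A", "parent_id": "sig_1", "action": "task:complete"},
]


def build_trace_tree(records, trace_id):
    """Filter to one trace and group records under their parent signature."""
    in_trace = [r for r in records if r["trace_id"] == trace_id]
    known_ids = {r["id"] for r in in_trace}
    children = defaultdict(list)
    roots = []
    for record in in_trace:
        parent = record["parent_id"]
        if parent in known_ids:
            children[parent].append(record)
        else:
            # No parent inside this trace, so treat the record as a root.
            roots.append(record)
    return roots, children


def render(nodes, children, depth=0):
    """Return indented lines, one per action, in depth-first order."""
    lines = []
    for node in nodes:
        lines.append("  " * depth + f'{node["action"]} ({node["id"]})')
        lines.extend(render(children[node["id"]], children, depth + 1))
    return lines


tree_lines = render(*build_trace_tree(records, "tr_A"))
print("\n".join(tree_lines))
```

Filtering by trace first means the `data:read` from the other workflow (`tr_B`) never shows up, and the `data:read` that does appear sits under the `task:delegate` that spawned it.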

Emergency halt

The second problem is simpler but scarier. An agent starts behaving unpredictably, or you detect a security incident, and you need a kill switch.

# Something goes wrong? Kill everything.
asqav.emergency_halt(reason="security incident")

One call. Every agent registered under your organization gets revoked immediately. Their signatures stop verifying, their scope tokens become invalid, and any in-flight actions get rejected. The halt reason gets recorded in the audit trail so you have a clear record of when and why you pulled the plug.

You can also halt a specific agent or agent group instead of the whole org. But the point of emergency_halt() with no target is that you do not need to figure out which agent is the problem when things are on fire. Stop everything, investigate after.
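If it helps to see the semantics in isolation, the following is a toy in-memory model of an org-wide halt versus a targeted one. It is only a sketch of the behavior described above, not how asqav implements it: `ToyRegistry` and its methods are hypothetical names, and real revocation happens on the service side.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class ToyRegistry:
    """Hypothetical stand-in for an organization's agent registry."""
    active: dict = field(default_factory=dict)      # agent name -> still allowed to sign?
    audit_log: list = field(default_factory=list)   # halt events with reason and time

    def register(self, name):
        self.active[name] = True

    def halt(self, reason, targets=None):
        # No targets means "everything": the caller does not have to know
        # which agent is misbehaving.
        names = list(self.active) if targets is None else [
            n for n in targets if n in self.active
        ]
        for name in names:
            self.active[name] = False
        self.audit_log.append({
            "reason": reason,
            "halted": names,
            "at": datetime.now(timezone.utc).isoformat(),
        })
        return names

    def sign(self, name, action):
        # Halted (or unknown) agents are rejected before anything is signed.
        if not self.active.get(name, False):
            raise PermissionError(f"{name} is halted; rejected {action}")
        return f"signed:{name}:{action}"


registry = ToyRegistry()
for agent_name in ("orchestrator", "planner", "worker"):
    registry.register(agent_name)

registry.halt("worker misbehaving", targets=["worker"])  # targeted halt
still_ok = registry.sign("planner", "task:start")        # other agents keep working

registry.halt("security incident")                       # org-wide halt
```

The design choice worth noticing is the default: calling `halt` with no target stops every registered agent, so the fast path during an incident requires no diagnosis.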

Why this matters

Agent workflows are getting more complex. An orchestrator calls a planner, the planner calls three workers, each worker calls external APIs. Without trace correlation, you are debugging in the dark. Without an emergency halt, you are hoping nothing goes wrong.

Both features ship in the latest version: