How a Negotiating Agent Society Out-Plans a Single Scheduler

#agents #ai #python #showdev

Turn a five-way scheduling trade-off into a debate you can audit: five advocate agents argue, a charge-nurse referee rules, and the plan beats a single agent on the same scorer.

The scariest scheduling failures are the confident ones.

You hand one LLM a clinic's week — 56 patients, 43 slots, three nurses — and ask it to book the follow-ups. No errors. No complaints. It returns a clean, plausible schedule in a single pass. But is it a good plan, or did it quietly sacrifice something that mattered?

Score it and you find out. High-acuity patients: all seen. Continuity of care: collapsed — patients bounced to whichever nurse had a gap. Overdue follow-ups: slipping. The agent anchored on the one objective that's easy to name and let the messy ones rot — and it never told you it made that trade.

A quiet objective is not a satisfied objective. Usually it's the opposite.

This is a case study in fixing that with a multi-agent society — and it won't stay abstract. Every number comes from one real system: RehabPanel, a rehab-scheduling agent built for the Qwen Cloud hackathon (Track 3), running on real Qwen models and deployed live on Alibaba Cloud.

The problem: one agent collapses the trade-off

Five objectives fight for the same slots — clinical acuity, overdue windows, continuity with the primary nurse, hard capacity, patient preference. They genuinely conflict: seat the sickest patient and you may break someone's continuity; honor a preference and you may bump an overdue visit.

A single agent resolves that tension inside one prompt, invisibly. You get an answer, not an argument. And you can't audit an answer.

The fix: make the conflict explicit and negotiated

RehabPanel replaces the one planner with five advocate agents, each obsessed with exactly one objective, plus a charge-nurse referee who brokers between them. It's a LangGraph state machine — draft → critique → negotiate → arbitrate, looping until nobody's still objecting.

Each round:

1. All five advocates object to the current plan, in parallel and in their own words.
2. The advocate behind the loudest objection proposes a concrete swap.
3. The rest split into FOR / AGAINST coalitions according to whose objective the swap helps or hurts.
4. The referee rules on a fixed priority ranking (capacity ≻ acuity ≻ overdue ≻ continuity ≻ preference), explains the ruling in plain language, and appends it to a conflict ledger.

Nobody scripts that explanation. Mid-run, the referee writes things like: "approved because it improves preference without violating any higher-ranked objective, and no competing acuity, overdue, or continuity claim contests this slot." That ledger is the whole point — a single agent never shows this work.

One round: five objections, the coalition split, and the referee's prose ruling appended to the conflict ledger.
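To make the coalition split and the ledger concrete, here is a minimal sketch. It is not RehabPanel's code: `split_coalitions`, `log_ruling`, the signed `deltas` shape, the patient IDs, and the assumption that the proposer counts as FOR are all hypothetical. Only the advocate names are borrowed from the referee's rank table shown later in the post.

```python
# Advocate names reuse the keys of the referee's rank table; everything else is illustrative.
ADVOCATES = ("capacity", "priority", "window", "continuity", "preference")

def split_coalitions(proposer, deltas):
    """Sort advocates into FOR / AGAINST by the sign of the swap's effect on their objective.

    `deltas` is an assumed shape: advocate name -> signed change in that objective
    (positive helps, negative hurts, zero or missing means the advocate abstains).
    """
    forc = [{"agent": proposer}]      # assumption: the proposer backs its own swap
    against = []
    for name in ADVOCATES:
        if name == proposer:
            continue
        change = deltas.get(name, 0)
        if change > 0:
            forc.append({"agent": name})
        elif change < 0:
            against.append({"agent": name})
    return forc, against

def log_ruling(ledger, round_no, swap, forc, against, approved, rationale):
    """Append one auditable entry; `rationale` stands in for the referee's prose."""
    ledger.append({
        "round": round_no,
        "swap": swap,
        "for": [f["agent"] for f in forc],
        "against": [a["agent"] for a in against],
        "approved": approved,
        "rationale": rationale,
    })
    return ledger

# Hypothetical swap: it helps preference but breaks one patient's continuity.
ledger = []
forc, against = split_coalitions("preference", {"continuity": -1})
log_ruling(ledger, 1, "swap P17 <-> P32", forc, against, approved=False,
           rationale="rejected: continuity outranks preference")
print(ledger[0]["for"], ledger[0]["against"])
```

Keeping each entry as plain data is what makes the ledger auditable after the run: every ruling records who proposed, who lined up on each side, and what the referee said.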

Drawn as a graph, the loop has one explicit exit:

flowchart TB
    DR["node_draft<br/>acuity-first skeleton — no LLM (or a warm seed)"]
    CR["node_critique<br/>all 5 advocates object — in parallel"]
    CD{"hot objection<br/>&amp; round &lt; cap?"}
    NG["node_negotiate<br/>top objector proposes · FOR/AGAINST coalitions"]
    AR["node_arbitrate<br/>referee brokers on priority rank (deterministic)<br/>+ writes the ruling in prose · logs the ledger"]
    EN(["END"])
    DR --> CR --> CD
    CD -- yes --> NG --> AR --> CR
    CD -- "no · stalled" --> EN

Draft once, then critique → negotiate → arbitrate until no hot objection remains.
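The control flow in the diagram can be sketched in plain Python, without LangGraph. This is a stand-in for illustration, not the project's graph: `run_society`, the `severity` field, the `max_rounds` default, and the four toy node functions are all assumptions.

```python
def run_society(draft, critique, negotiate, arbitrate, max_rounds=12):
    """Draft once, then loop critique -> negotiate -> arbitrate until stalled or capped."""
    plan = draft()
    ledger = []
    round_no = 0
    while True:
        objections = critique(plan)
        hot = [o for o in objections if o["severity"] > 0]
        # The single exit: no hot objection left, or the round cap is reached.
        if not hot or round_no >= max_rounds:
            break
        proposal = negotiate(plan, hot)
        plan, ruling = arbitrate(plan, proposal)
        ledger.append(ruling)
        round_no += 1
    return plan, ledger

# Toy stand-ins: the "plan" is just a list of unresolved conflicts.
def draft():
    return {"conflicts": ["c1", "c2", "c3"]}

def critique(plan):
    return [{"id": c, "severity": 1} for c in plan["conflicts"]]

def negotiate(plan, hot):
    return {"fix": hot[0]["id"]}          # the top objection drives the proposal

def arbitrate(plan, proposal):
    remaining = [c for c in plan["conflicts"] if c != proposal["fix"]]
    return {"conflicts": remaining}, {"fixed": proposal["fix"], "approved": True}

plan, ledger = run_society(draft, critique, negotiate, arbitrate)
print(plan, len(ledger))
```

The loop condition lives in one place, so the "hot objection and round below cap" test is the only way out, which is the property the diagram is drawing attention to.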

Rule 1: Parallelize the debate. Serialize the decision.

The expensive, creative part is the critique — five agents reading the whole caseload and reasoning about their objective. So run the five critiques concurrently: one round-trip, not five sequential ones. That single change is the difference between a demo and a coffee break.

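# orchestrator.py: the five advocates critique the SAME draft concurrently
from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor(max_workers=len(advocates)) as ex:
    # _one maps one (name, advocate) pair to that advocate's list of objections
    objections = [o for group in ex.map(_one, advocates.items()) for o in group]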
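For readers who want to run the pattern, here is a self-contained sketch with stub advocates in place of model calls. `StubAdvocate`, `critique_all`, and the `unassigned` field are hypothetical names for this example; only the executor-plus-flatten shape comes from the excerpt.

```python
from concurrent.futures import ThreadPoolExecutor

class StubAdvocate:
    """Stand-in for an LLM-backed advocate; flags every unassigned patient."""
    def __init__(self, name):
        self.name = name

    def critique(self, draft):
        return [{"agent": self.name, "patient": p} for p in draft["unassigned"]]

def critique_all(advocates, draft):
    def _one(item):
        _name, advocate = item
        return advocate.critique(draft)     # one blocking call per advocate

    with ThreadPoolExecutor(max_workers=len(advocates)) as ex:
        # ex.map returns results in input order, so the flattened list is stable
        return [o for group in ex.map(_one, advocates.items()) for o in group]

names = ("capacity", "priority", "window", "continuity", "preference")
advocates = {n: StubAdvocate(n) for n in names}
objections = critique_all(advocates, {"unassigned": ["P1", "P2"]})
print(len(objections))
```

Threads fit here because each critique is dominated by waiting on a network round-trip, so the five waits overlap instead of adding up.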

The part that has to be trustworthy (who wins) is the opposite. The referee's decision is deterministic Python; only its rationale comes from the LLM. Autonomy lives where it earns its keep (what to flag, how hard, why) and stays out of where it doesn't (the ruling itself). You get a negotiation that's alive but reproducible.

# orchestrator.py: the verdict is a fixed priority ranking, never an LLM
# "priority" is the acuity advocate and "window" the overdue advocate from the ranking above
_RANK = {"capacity": 100, "priority": 40, "window": 30, "continuity": 20, "preference": 10}

def _decide(forc, against):
    if any(a["agent"] == "capacity" for a in against):
        return False                    # capacity veto — never break feasibility
    fr = max((_RANK.get(f["agent"], 0) for f in forc), default=0)
    ar = max((_RANK.get(a["agent"], 0) for a in against), default=0)
    return fr > 0 and fr >= ar           # approve iff a ranked FOR advocate is at least as high as the top AGAINST

Design heuristic: put the LLM on the reasoning, keep a deterministic rule on the verdict. It's testable, it's cheap, and it can't drift.

Rule 2: Keep the judge outside the system

Both planners — the single agent and the society — are scored by the same pure-Python function. No LLM anywhere near it. It's locked in CI.

That's not decoration. If the thing declaring the winner is an LLM, "the society is better" is a vibe. If it's forty deterministic lines that neither planner controls, it's a claim. On the same scorer, same week, the society reaches a plan the single agent can't.

# scorer.py — one pure-Python function, no LLM anywhere, CI-locked
def score(assignments, patients, clinicians, slots, weights=None, meta=None):
    ...   # multi-objective: acuity coverage, overdue days, continuity, capacity, preference

Rule 3: Show the number you can defend

Here's the honest part. The first time I ran the full protocol on real Qwen, the live society scored lower than my deterministic reference implementation of the same protocol.

For a day that looked like a bug. It wasn't. The rule-based reference runs its critique to exhaustion — it finds every fixable conflict. The live LLM critique is sharper about what it flags but flags fewer things per round, so it converges earlier. Same scorer, honest gap.

So I had a choice: quote the big reproducible ceiling, or show the real run for what it is. I showed the real run. On live Qwen the society climbs 160 → 181 (+21) over twelve rounds — continuity breaks 30 → 25, preference misses 26 → 17, high-acuity coverage held throughout.

Same week, same deterministic scorer — the society (+21) recovers the continuity and preference a single agent abandons.

Show the number you can defend on a machine a stranger controls, not the number you wish you had. A smaller live win beats a bigger offline one.

Rule 4: If it costs money per click, gate it

I wanted judges to run it live. That means a public URL that fires real Qwen calls, which is also a way to donate your token voucher to the first crawler that finds it.

So the demo has two doors. ▶ Replay plays a recorded real negotiation for anyone, free, no key. ◉ Run live and ⟳ Re-plan stream a fresh real negotiation over SSE, token-gated — only the judge link fires the models. Same protocol; the cost is contained.

# api.py: the live SSE endpoint bills the voucher, so gate it (constant-time compare)
import hmac
import os
from fastapi.responses import JSONResponse

gate = os.environ.get("REHABPANEL_DEMO_TOKEN")
if gate and not hmac.compare_digest(token or "", gate):   # a missing token is rejected, not an error
    return JSONResponse({"error": "live negotiation is token-gated"}, status_code=401)
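The same check can be pulled into a small pure function, which makes it easy to test without starting a server. This is a sketch under assumptions: `live_run_allowed` is a hypothetical name, and encoding to bytes is my choice so that non-ASCII tokens do not make `hmac.compare_digest` raise.

```python
import hmac

def live_run_allowed(token, gate):
    """Return True when the live endpoint may fire the models."""
    if not gate:
        # No gate configured: mirrors the excerpt, which only checks when the env var is set.
        return True
    if not token:
        return False     # a missing token can never match a configured gate
    # Constant-time comparison on bytes, so timing does not leak how much of the token matched.
    return hmac.compare_digest(token.encode("utf-8"), gate.encode("utf-8"))

print(live_run_allowed("s3cret", "s3cret"), live_run_allowed("guess", "s3cret"))
```

Note that this shape fails open when the environment variable is unset. A deployment that should never run ungated might prefer to return `False` in that branch instead.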

One more trick that keeps live runs watchable instead of expensive: an explicit context cache on the invariant caseload. The big patient-and-slot table is identical on every advocate call, every round, so it's cached once (~99% prefix hit) instead of re-billed forty times.

# advocates.py — mark the invariant caseload block so Qwen caches the prefix
sysblocks = [
    {"type": "text", "text": _caseload_ref(context),
     "cache_control": {"type": "ephemeral"}},   # DashScope context-cache marker
    {"type": "text", "text": self.system},        # + this advocate's role (changes)
]
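Prefix caching only pays off if the cached block is identical on every call, so it helps to render the caseload deterministically and keep it first. The sketch below illustrates that idea with plain dictionaries; `render_caseload`, `build_system_blocks`, and the sample data are hypothetical, and nothing here calls the model API.

```python
import json

def render_caseload(caseload):
    """Serialize the caseload deterministically so the cached prefix never drifts."""
    return json.dumps(caseload, sort_keys=True, separators=(",", ":"))

def build_system_blocks(caseload_text, role_prompt):
    """Invariant block first (cache-marked), per-advocate role second."""
    return [
        {"type": "text", "text": caseload_text,
         "cache_control": {"type": "ephemeral"}},   # same marker as the excerpt above
        {"type": "text", "text": role_prompt},
    ]

# Hypothetical two-patient caseload and role prompts.
caseload = {"patients": [{"id": "P1", "acuity": 3}, {"id": "P2", "acuity": 1}], "slots": 2}
roles = {
    "continuity": "You defend continuity of care.",
    "preference": "You defend patient preference.",
}

text = render_caseload(caseload)
blocks = {name: build_system_blocks(text, prompt) for name, prompt in roles.items()}
print(blocks["continuity"][0] == blocks["preference"][0])
```

Sorting keys matters more than it looks: two dictionaries with the same content but different insertion order would otherwise serialize differently and miss the cache.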

What you take away

Make the conflict a first-class object. A society that argues produces an auditable ledger; a single agent produces an opaque answer.
Parallelize critique, serialize the verdict — LLM on the reasoning, deterministic rule on the decision.
Keep the scorer pure and external, or your "win" isn't measurable.
Deploy it, and gate what costs money. A live, clickable demo is worth ten screenshots — as long as a crawler can't run up the bill.

A single agent collapses the trade-off. A society forced to negotiate under a referee reaches what one agent can't — and shows you exactly why each patient sits where they do.

Code (MIT): github.com/jwlai-cloud/rehabpanel. Live on Alibaba Cloud — ▶ Replay is free; ◉ Run live is token-gated. Built entirely on Qwen Cloud (dashscope-intl), LangGraph, FastAPI. Decision support, fully synthetic data — no real patient records, anywhere.