
Anarchy, Assembly Lines, and Corporate Hierarchy: Benchmarking Multi-Agent Architectures for Medical Device Data

My AI judge gave the anarchists a perfect score. I disagree.
I built three multi-agent systems to analyze data from my insulin pump — a Medtronic MiniMed 780G — and had an LLM evaluate their output. The cheapest, fastest architecture scored identically to the most expensive one. But when I read the actual reports, the cheap one guessed where the expensive one calculated. The evaluator didn't care. That tension — between automated scores and human judgment — turned out to be the most interesting finding of this experiment.
But let's start from the beginning.

A Fair Fight This Time

In my previous blog post, I compared a swarm architecture with a graph pipeline for analyzing CareLink CSV exports. The problem? I used different models for each, which made the comparison unfair.
This time, every agent runs on the same model: Haiku 4.5 via AWS Bedrock. Same prompts, same tools, same data. The only variable is the orchestration pattern.
A LinkedIn commenter also suggested trying prompt caching to reduce costs. Good idea, and worth testing which architecture actually benefits. With this config, caching can cover the system prompt, messages, and tool definitions:

bedrock_model = BedrockModel(
    model_id=MODEL_ID,
    region_name=REGION,
    max_tokens=64000,
    temperature=0.0,
    cache_config=CacheConfig(strategy="auto"),
    cache_tools="default",
    streaming=True,
)

For evaluation, I used the Strands evaluator with a rubric-based prompt and Sonnet 4.5 as the judge.

Every architecture uses the same four agents. Think of them as workers in a factory — the question is how the factory is organized.

CSV Reader — Parses the CareLink CSV export. Returns raw structured data, no interpretation.

Data Analyst — Crunches the numbers: glucose statistics, Time in Range, GMI, coefficient of variation, insulin totals, carb intake.

Pattern Reviewer — Reads the metrics and timestamps to spot clinically meaningful patterns: dawn phenomenon, post-meal spikes, overnight trends, hypo/hyper clustering.

Endocrinologist — Synthesizes everything into pump optimization suggestions, framed as discussion topics for a healthcare professional.
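The Data Analyst's headline metrics are simple arithmetic. As a rough illustration (not the agent's actual tool code), here is how TIR, CV%, and GMI can be computed from raw sensor glucose values in mg/dL; the GMI formula is the standard regression from mean glucose:

```python
import statistics

def glucose_metrics(sg_values_mg_dl: list[float]) -> dict:
    """Toy versions of the Data Analyst's metrics (illustrative only)."""
    mean = statistics.mean(sg_values_mg_dl)
    sd = statistics.stdev(sg_values_mg_dl)           # sample SD
    in_range = [v for v in sg_values_mg_dl if 70 <= v <= 180]
    return {
        "mean": round(mean, 1),
        "tir_pct": round(100 * len(in_range) / len(sg_values_mg_dl), 1),
        "cv_pct": round(100 * sd / mean, 1),         # variability; <36% is the usual target
        "gmi_pct": round(3.31 + 0.02392 * mean, 2),  # estimated A1C from mean glucose
    }

print(glucose_metrics([100, 150, 200]))
```

The point of the benchmark is that these numbers are deterministic: an agent that runs the tools gets 92.3% TIR, an agent that guesses writes "likely <70%".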
I defined the agents using a factory pattern so every architecture gets identical copies:

PROMPTS = {
    "csv_reader": (
        "You are a CSV Parser Agent specializing in Medtronic MiniMed 780G CareLink data exports.\n"
        "Parse the raw CSV and return the extracted data EXACTLY as the tool outputs it.\n"
        "Do NOT summarize, interpret, compute statistics, or analyze the data.\n"
        "Do NOT provide clinical recommendations.\n"
        "Return only the raw parsed output for other agents to process.\n"
    ),
    "data_analyst": (
        "You are a Data Analyst Agent specializing in CGM and insulin pump metrics.\n"
        "Compute statistical analysis on the diabetes pump data provided.\n"
        "Calculate glucose statistics (mean, median, SD, min, max), Time in Range (TIR),\n"
        "GMI (estimated A1C), coefficient of variation (CV%), total daily insulin,\n"
        "insulin-to-carb ratios, correction bolus frequency, and average daily carbs.\n"
        "Use the SG_VALUES data from the parsed CSV to run your statistical tools.\n"
        "If the SG_VALUES are not directly available in your input, use read_carelink_csv\n"
        "to extract them from the CSV file, then run calculate_statistics and time_in_range.\n"
        "Present findings as structured metrics with numbers — no clinical recommendations.\n"
    ),
    "pattern_reviewer": (
        "You are a Pattern Recognition Agent specializing in diabetes data interpretation.\n"
        "Identify clinically significant patterns and anomalies from the data provided.\n"
        "Look for dawn phenomenon, post-meal spikes, overnight trends, and glucose variability.\n"
        "If you have access to the CSV file path, use read_carelink_csv to extract timestamped\n"
        "data, then use hourly_glucose_profile to compute hourly patterns.\n"
        "Flag recurring hypoglycemia with timing, severity, and clustering.\n"
        "Flag prolonged hyperglycemia with duration and potential triggers.\n"
        "Note Auto Mode exits, insulin suspensions, and sensor issues from alert data.\n"
        "Compare metrics against ADA/EASD consensus targets.\n"
        "Note positive trends and areas of good control.\n"
        "Focus on pattern identification — do not suggest treatment changes.\n"
    ),
    "endocrinologist": (
        "You are an Endocrinologist Agent specializing in insulin pump therapy optimization.\n"
        "Provide clinical interpretation and actionable recommendations based on the\n"
        "patterns and metrics from previous analysis.\n"
        "Suggest potential pump setting adjustments: Active Insulin Time, carb ratios,\n"
        "Auto Mode target glucose (5.5 vs 6.7 mmol/L / 100 vs 120 mg/dL), and bolus timing.\n"
        "Use the Fiasp insulin pharmacokinetics profile: onset ~15min, peak ~1-2h, duration ~3-5h.\n"
        "Frame everything as 'discuss with your healthcare team' — not direct medical advice.\n"
        "Provide a clear, prioritized summary highlighting the most impactful improvements.\n"
        "Acknowledge what is working well alongside areas that need attention.\n"
    ),
}


def make_agents():
    """Create a fresh set of specialist agents.

    Tool assignment strategy:
      - csv_reader: read_carelink_csv only (parse the file)
      - data_analyst: read_carelink_csv + calculate_statistics + time_in_range
        (needs CSV access because Graph/Swarm may pass summaries instead of raw values)
      - pattern_reviewer: read_carelink_csv + hourly_glucose_profile
        (needs CSV access for the same reason — LLMs summarize upstream output)
      - endocrinologist: no tools (pure synthesis from previous agents' output)

    This ensures every agent can do its job regardless of orchestration pattern.
    In the Graph, upstream nodes may summarize data. In the Swarm, agents may skip steps.
    In the Coordinator, the LLM decides what to pass. Giving data-processing agents
    direct file access makes them resilient to all three patterns.
    """
    return {
        "csv_reader": Agent(
            system_prompt=PROMPTS["csv_reader"],
            name="csv_reader",
            model=bedrock_model,
            conversation_manager=make_conversation_manager(window_size=15, per_turn=True),
            tools=[read_carelink_csv],
        ),
        "data_analyst": Agent(
            system_prompt=PROMPTS["data_analyst"],
            name="data_analyst",
            model=bedrock_model,
            conversation_manager=make_conversation_manager(window_size=20, per_turn=True),
            tools=[read_carelink_csv, calculate_statistics, time_in_range],
        ),
        "pattern_reviewer": Agent(
            system_prompt=PROMPTS["pattern_reviewer"],
            name="pattern_reviewer",
            model=bedrock_model,
            conversation_manager=make_conversation_manager(window_size=20, per_turn=True),
            tools=[read_carelink_csv, hourly_glucose_profile],
        ),
        "endocrinologist": Agent(
            system_prompt=PROMPTS["endocrinologist"],
            name="endocrinologist",
            model=bedrock_model,
            conversation_manager=make_conversation_manager(window_size=20),
            tools=[],
        ),
    }
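One piece not shown above is make_conversation_manager, which in Strands would typically wrap SlidingWindowConversationManager. The core idea, keeping only the last N messages so long tool outputs don't blow up each agent's context, can be illustrated in plain Python (the real manager also has to keep tool-use/tool-result message pairs intact):

```python
def sliding_window(messages: list[dict], window_size: int) -> list[dict]:
    """Keep only the most recent `window_size` messages (toy version of
    what a sliding-window conversation manager does)."""
    if len(messages) <= window_size:
        return messages
    return messages[-window_size:]

# A 30-message history trimmed to the agent's window of 20
history = [{"role": "user", "content": f"msg {i}"} for i in range(30)]
trimmed = sliding_window(history, window_size=20)
```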

Shared constants across all runs :)

CSV_PATH = "input_carelink.csv"
MODEL_ID = "global.anthropic.claude-haiku-4-5-20251001-v1:0"
REGION = "us-east-1"

PROMPT = (
    f"Analyze the Medtronic MiniMed 780G CareLink CSV export at '{CSV_PATH}'. "
    "Parse the data, compute glucose and insulin metrics, identify patterns, "
    "and provide clinical interpretation with actionable recommendations."
)

Now let's try out the three political systems.

The Commune: Swarm


The swarm is anarchy by design. No central authority, no predefined order. Agents self-organize, decide when to hand off work, and collectively arrive at an answer. Think of a commune where everyone has a specialty but nobody has a boss — the CSV reader finishes and announces "data's ready," and whoever feels qualified picks it up next.
The promise: emergent intelligence. The reality: emergent shortcuts. The swarm ran only two nodes — csv_reader → endocrinologist — skipping the data analyst and pattern reviewer entirely. Without anyone enforcing the pipeline, the endocrinologist never got computed statistics. It estimated Time in Range as "likely <70%" when the actual value was 92.3%. No hourly glucose profiles were generated, no hypoglycemia clustering by timestamp.
And yet — the clinical insights were sharp. The swarm correctly identified carb miscounting, flagged insulin timing issues, and produced a well-prioritized five-tier recommendation system. The anarchists were sloppy with numbers but wise in judgment.
I took inspiration from the Strands agents samples repo for structuring the swarm roles, though I deliberately kept my instructions loose. Anarchy, after all.


agents = make_agents()

swarm = Swarm(
    [agents["csv_reader"], agents["data_analyst"], agents["pattern_reviewer"], agents["endocrinologist"]],
    entry_point=agents["csv_reader"],
    max_handoffs=10,
    max_iterations=15,
    execution_timeout=600.0,
    node_timeout=180.0,
    repetitive_handoff_detection_window=6,
    repetitive_handoff_min_unique_agents=3,
)

start = time.time()
swarm_result = swarm(PROMPT)
swarm_elapsed = time.time() - start

swarm_output = str(swarm_result)
swarm_metrics = extract_metrics(swarm_result)
swarm_metrics["wall_time_s"] = round(swarm_elapsed, 1)

print(swarm_output)
print_metrics(swarm_elapsed, swarm_metrics, "SWARM")

benchmark["swarm"] = {"output": swarm_output, "metrics": swarm_metrics}

The Assembly Line: Graph Pipeline

The graph pipeline is the Soviet factory model — a rigid, sequential process where each station does exactly one job and passes the product forward. No loops, no improvisation. Parse → Analyse → Review → Recommend, in that order, every time.
What the assembly line lacks in flexibility it makes up in thoroughness. Because each agent receives the full output of the previous one, nothing gets lost. The analyst computed exact TIR (92.3%), precise CV% (23.9%), and correct GMI (6.5%). The pattern reviewer clustered hypoglycemia events by timestamp. Every number was grounded in the actual data.
The downside? Every agent processes everything sequentially, which means 4x the token throughput. And the rigid structure means you can't easily skip steps or parallelize work. The factory runs at the speed of its slowest station.
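A back-of-the-envelope sketch of why sequential full-output passing is expensive (the token counts are made up for illustration): each stage re-reads everything produced so far, so input tokens pile up with pipeline depth.

```python
# Toy input-token accounting for a 4-stage pipeline where each stage
# re-reads all upstream output (illustrative numbers, not my actual run)
stage_output = {"parse": 20_000, "analyse": 5_000, "review": 4_000, "recommend": 3_000}

cumulative, total_input = 0, 0
for stage, out_tokens in stage_output.items():
    total_input += cumulative   # this stage reads everything produced so far
    cumulative += out_tokens

print(total_input)  # 0 + 20K + 25K + 29K = 74000 input tokens
```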


agents = make_agents()

builder = GraphBuilder()
builder.add_node(agents["csv_reader"], "parse")
builder.add_node(agents["data_analyst"], "analyse")
builder.add_node(agents["pattern_reviewer"], "review")
builder.add_node(agents["endocrinologist"], "recommend")

builder.add_edge("parse", "analyse")
builder.add_edge("analyse", "review")
builder.add_edge("review", "recommend")
builder.set_entry_point("parse")

graph = builder.build()

start = time.time()
graph_result = graph(PROMPT)
graph_elapsed = time.time() - start


graph_output_parts = []
for node_id, node_result in graph_result.results.items():
    text = str(node_result.result) if hasattr(node_result, "result") else str(node_result)
    print(f"\n--- {node_id} ---")
    print(text)
    graph_output_parts.append(f"[{node_id}]\n{text}")

graph_output = "\n\n".join(graph_output_parts)
graph_metrics = extract_metrics(graph_result)
graph_metrics["wall_time_s"] = round(graph_elapsed, 1)

print_metrics(graph_elapsed, graph_metrics, "GRAPH")

benchmark["graph"] = {"output": graph_output, "metrics": graph_metrics}

Coordinator

The coordinator is the capitalist org chart. One manager, four direct reports. The manager decides who works on what, in what order, and synthesizes the final deliverable. The specialists are demoted to tools — they don't talk to each other, they report up.
Caching works differently across architectures. Both the swarm and coordinator benefit from prompt caching — the swarm's csv_reader cached 73K tokens across its internal conversation cycles, and the coordinator cached 57K tokens across its four sequential tool calls. The graph creates fresh agent contexts per node, so it gets zero cache hits. My extract_metrics function reported the swarm's cache as 0 at the top level, but the nested AgentInvocation metrics tell the real story.
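Digging those nested cache numbers out is a matter of walking the metrics tree. A hedged sketch (the key name cache_read_input_tokens and the nested shape are assumptions for illustration, not a documented Strands schema; adapt to what your SDK actually emits):

```python
def total_cache_reads(metrics: dict) -> int:
    """Recursively sum cache-read token counts in a nested metrics dict.

    Assumes cache reads appear under 'cache_read_input_tokens' at any
    depth -- a hypothetical key, swap in your SDK's real field name.
    """
    total = metrics.get("cache_read_input_tokens", 0)
    for value in metrics.values():
        if isinstance(value, dict):
            total += total_cache_reads(value)
        elif isinstance(value, list):
            total += sum(total_cache_reads(v) for v in value if isinstance(v, dict))
    return total

# Top level reports zero, but the nested invocations tell the real story
nested = {
    "cache_read_input_tokens": 0,
    "agent_invocations": [
        {"cache_read_input_tokens": 40_000},
        {"cache_read_input_tokens": 33_000},
    ],
}
print(total_cache_reads(nested))
```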
The result: the coordinator combined the statistical precision of the graph with the clinical nuance of the swarm. It added measurable targets (e.g., "post-lunch glucose <9.0 mmol/L") and flagged a battery failure event as an "unacceptable safety risk." The graph also caught the battery failure in its pattern review, but the swarm — having skipped the pattern reviewer — missed it entirely.
An interesting self-correction happened inside the coordinator: the pattern reviewer estimated TIR as "~70-75%" (wrong), but the data analyst computed 92.3% (correct), and the final synthesized report used the right numbers. The corporate hierarchy's redundancy — multiple specialists touching the same data — caught errors that a single-pass architecture would miss.


agents = make_agents()

# Wrap each specialist as a tool for the coordinator
@tool
def parse_csv(file_path: str) -> str:
    """Parse a Medtronic CareLink CSV export and extract raw diabetes data."""
    return str(agents["csv_reader"](f"Parse the CareLink CSV at '{file_path}'"))

@tool
def analyse_data(parsed_data: str) -> str:
    """Compute glucose and insulin statistics on parsed CareLink data."""
    return str(agents["data_analyst"](parsed_data))

@tool
def review_patterns(metrics: str) -> str:
    """Identify clinically significant patterns and anomalies in diabetes metrics."""
    return str(agents["pattern_reviewer"](metrics))

@tool
def clinical_assessment(patterns: str) -> str:
    """Provide endocrinology clinical interpretation and pump setting recommendations."""
    return str(agents["endocrinologist"](patterns))

coordinator = Agent(
    system_prompt=(
        "You are a coordinator analyzing Medtronic MiniMed 780G insulin pump data.\n"
        "You have access to four specialist tools:\n"
        "  - parse_csv: extracts raw structured data from the CareLink CSV\n"
        "  - analyse_data: computes glucose statistics, TIR, GMI, CV%\n"
        "  - review_patterns: identifies glucose patterns and anomalies\n"
        "  - clinical_assessment: provides endocrinology interpretation\n"
        "Call them in order: parse_csv first, then analyse_data with the parsed output,\n"
        "then review_patterns with the analysis, then clinical_assessment with the patterns.\n"
        "After all specialists have contributed, synthesize their outputs into a final report.\n"
    ),
    name="coordinator",
    model=bedrock_model,
    conversation_manager=make_conversation_manager(window_size=25, per_turn=True),
    tools=[parse_csv, analyse_data, review_patterns, clinical_assessment],
)

start = time.time()
coordinator_result = coordinator(PROMPT)
coordinator_elapsed = time.time() - start

coordinator_output = str(coordinator_result)
coordinator_metrics = extract_metrics(coordinator_result)
coordinator_metrics["wall_time_s"] = round(coordinator_elapsed, 1)

print(coordinator_output)
print_metrics(coordinator_elapsed, coordinator_metrics, "COORDINATOR")

benchmark["coordinator"] = {"output": coordinator_output, "metrics": coordinator_metrics}


The Judge: LLM-as-Evaluator

To score each architecture's output, I used the Strands evaluator framework with Sonnet 4.5 as the judging model. The rubric scores five criteria, each weighted equally at 20%:

Data Completeness — Are glucose, insulin, carbs, and device status all covered?
Statistical Accuracy — Are TIR, GMI, CV%, and glucose stats correctly calculated and compared to clinical targets?
Pattern Identification — Are meaningful patterns (dawn phenomenon, post-meal spikes, hypo clustering) identified with specifics?
Clinical Recommendations — Are pump optimization suggestions specific, balanced, and actionable?
Safety & Framing — Are severe hypos flagged, and is the report framed as informational rather than medical advice?
In code, this prompt is the rubric:

from strands_evals import Case, Experiment
from strands_evals.evaluators import OutputEvaluator

RUBRIC = """
You are an expert endocrinologist and diabetes data analyst evaluating
AI-generated reports on Medtronic MiniMed 780G insulin pump data.

Score each criterion from 0.0 to 1.0:

1. DATA COMPLETENESS (weight: 0.20)
   - Were Sensor Glucose values extracted and summarized?
   - Were insulin delivery records (basal, bolus, auto-correction) covered?
   - Were carbohydrate entries mentioned?
   - Were device settings or Auto Mode status addressed?
   Score 1.0 if all four are present. Deduct 0.25 per missing category.

2. STATISTICAL ACCURACY (weight: 0.20)
   - Were Time in Range (TIR) percentages reported?
   - Were glucose statistics (mean, median, SD, CV%) included?
   - Was GMI / estimated A1C calculated?
   - Were values compared against ADA/EASD consensus targets?
   Score 1.0 if all metrics present and correctly interpreted.

3. PATTERN IDENTIFICATION (weight: 0.20)
   - Were hypoglycemia patterns identified (timing, severity, frequency)?
   - Were post-meal spikes addressed?
   - Was dawn phenomenon or overnight trends discussed?
   - Were Auto Mode exits or insulin suspensions noted?
   Score 1.0 if clinically significant patterns are identified with specifics.

4. CLINICAL RECOMMENDATIONS (weight: 0.20)
   - Were pump setting adjustments suggested (AIT, carb ratios, target)?
   - Were recommendations specific and actionable (not generic)?
   - Were both problem areas and positive aspects acknowledged?
   - Was pre-bolusing timing discussed if relevant?
   Score 1.0 if recommendations are specific, balanced, and actionable.

5. SAFETY & FRAMING (weight: 0.20)
   - Were severe hypos (<54 mg/dL) flagged as safety concerns?
   - Was the report framed as informational, not medical advice?
   - Was the patient directed to discuss changes with their healthcare team?
   - Was Fiasp pharmacokinetics referenced appropriately if relevant?
   Score 1.0 if safety concerns are prominently flagged and framing is appropriate.

OVERALL SCORE: Weighted average of all five criteria.
Provide the overall score and a brief explanation for each criterion.
"""

evaluator = OutputEvaluator(
    rubric=RUBRIC,
    include_inputs=True,
    model="global.anthropic.claude-sonnet-4-5-20250929-v1:0"
)

# Build cases — one per pattern
pattern_names = list(benchmark.keys())
cases = []
for pattern_name in pattern_names:
    data = benchmark[pattern_name]
    cases.append(
        Case[str, str](
            name=pattern_name,
            input=PROMPT,
            expected_output=data["output"],
            metadata={"pattern": pattern_name},
        )
    )

def task_fn(case: Case) -> str:
    """Return the pre-computed output for evaluation."""
    return case.expected_output

experiment = Experiment[str, str](cases=cases, evaluators=[evaluator])
reports = experiment.run_evaluations(task_fn)

# reports is a list per evaluator — we have 1 evaluator
# report.scores is a list per case, report.reasons is a list per case
report = reports[0]

print(f"Overall score across all patterns: {report.overall_score}")
print()

for idx, pattern_name in enumerate(pattern_names):
    score = report.scores[idx] if idx < len(report.scores) else None
    reason = report.reasons[idx] if idx < len(report.reasons) else None
    benchmark[pattern_name]["eval_score"] = score
    benchmark[pattern_name]["eval_reasoning"] = reason
    print(f"{pattern_name}: {score}")

Results

| Architecture | Score | Cost | Time | Cache | Verdict |
|---|---|---|---|---|---|
| Swarm (anarchist) | 1.00 | $0.057 | 56s | 0 | Anarchists on an assembly line don't work |
| Graph | 0.98 | $0.315 | 337s | 0 | Expensive perfectionist |
| Coordinator | 1.00 | $0.071 | 395s | 57K | Slow but right |

Swarm cache hits were internal to the csv_reader's conversation cycles — extract_metrics reported 0 at the top level, but nested AgentInvocation metrics show 73K tokens read from cache.
The anarchists (swarm) and the corporate manager both scored 1.00. The factory scored lowest. And yet — the factory and coordinator both computed Time in Range (92.3%) from raw data. The swarm wrote "likely <70%" — off by 22 percentage points — and the judge didn't blink.
The swarm passed the rubric. That's the headline.
The swarm skipped two agents entirely — csv_reader handed off straight to the endocrinologist, bypassing the data analyst and pattern reviewer. The endocrinologist improvised statistics from alert counts and data summaries instead of computing them from raw glucose values. Maybe the swarm architecture is better suited to creative, non-deterministic work; here we had a process that required agents to talk to each other in a specific order.

The graph was rigorous but couldn't surface trade-offs the upstream agent didn't frame.

The coordinator had an interesting self-correction: its pattern reviewer also estimated TIR wrong (~70-75%), but the data analyst computed it correctly, and the final synthesis used the right numbers. Redundancy caught what a single pass missed.

Both the swarm and coordinator got cache hits. The graph paid full price for every node — fresh agent context each time, zero caching.

Pick Your Politics

Swarm — cheap and fast, but don't trust the numbers: without explicit handoffs, agents skip steps, which is risky for deterministic use cases. Graph — when precision justifies 5× the cost. Coordinator — the sensible default.

The Real Takeaway
My rubric checked whether metrics appeared, not how they were derived. That's where the next iteration belongs: add a provenance criterion — can each claim be traced to a tool call? — and the swarm's score drops where it belongs.
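A toy version of that provenance criterion: extract every numeric claim in the report and check it against the numbers that actually came out of tool calls. The regex and tolerance here are placeholders; real provenance tracking would tag claims at generation time rather than pattern-match after the fact.

```python
import re

def unsupported_numbers(report: str, tool_outputs: list[str], tol: float = 0.05) -> list[float]:
    """Return numeric claims in `report` that never appeared (within `tol`
    relative tolerance) in any tool output -- a crude provenance check."""
    num = re.compile(r"\d+(?:\.\d+)?")
    grounded = {float(m) for out in tool_outputs for m in num.findall(out)}
    claims = [float(m) for m in num.findall(report)]
    return [
        c for c in claims
        if not any(abs(c - g) <= tol * max(g, 1.0) for g in grounded)
    ]

report = "TIR was likely <70%, mean glucose around 140 mg/dL."
tools = ["time_in_range -> 92.3", "calculate_statistics -> mean=141.2"]
print(unsupported_numbers(report, tools))  # 70 has no grounding; 140 is close enough to 141.2
```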
