The Problem With 5,000 Rows of Blood Sugar Data
I've been living with Type 1 diabetes for over 17 years. My mother had it too, along with some of its complications. The disease hasn't changed much — but the tech around it has.
I use a MiniMed 780G insulin pump with a Guardian 4 CGM sensor running in SmartGuard auto mode. Every 14 days it produces a report: a ~5,000-row CSV of pump events and CGM readings, plus a PDF summary.
I wanted something better — a structured clinical summary that's actually useful to medical staff and patients alike, one that makes sense of the patterns. And because I'm a DevOps engineer who can't resist over-engineering things, I decided to benchmark two multi-agent architectures against each other:
- Graph (sequential pipeline)
- Swarm (autonomous handoffs)
This article covers Graph and Swarm.
Architecture 1: The Graph Pipeline
Four agents, one after another. Each does its job and passes results to the next:
Reader — Ingests the raw CareLink CSV. Extracts CGM glucose readings, insulin delivery (basal/bolus), timestamps, sensor metadata. Flags data quality issues like gaps and sensor warmup periods.
Analyser — Crunches the numbers with Python: Time in Range (TIR), GMI, CV%, time-block patterns. Cross-validates against the PDF report.
Reviewer — The sceptic. Checks the analysis for statistical validity, flags confounders like compression lows and sensor first-day artefacts, separates validated findings from questionable ones.
Endocrinologist — Takes everything and writes a clinical consultation report with pump setting recommendations and discussion points for the next endo visit.
The Code
Setup — imports, PDF tool, and model config
import logging
from pypdf import PdfReader
from strands.models import BedrockModel
from strands import Agent, tool
from strands.multiagent import GraphBuilder
from strands_tools import python_repl, calculator, file_read
import os
from strands.agent.conversation_manager import SlidingWindowConversationManager
os.environ["BYPASS_TOOL_CONSENT"] = "true"
@tool
def read_pdf(filename: str) -> str:
    """Read a PDF file and return its text content as a single string."""
    try:
        reader = PdfReader(filename)
        texts = []
        for page in reader.pages:
            texts.append(page.extract_text())
        return '\n'.join(texts)
    except Exception as e:
        return f"Error reading file {filename}: {str(e)}"
model = BedrockModel(
model_id="global.anthropic.claude-opus-4-6",
region_name="us-east-1"
)
Agent definitions — four specialists, each with a clear handoff
reader = Agent(
name="reader",
system_prompt="""You extract and structure diabetes management data from files
in the current directory. Files follow the pattern 'X DD-MM-YYYY.csv' and
'X DD-MM-YYYY.pdf'.
Extract: CGM glucose readings, insulin delivery (basal/bolus), timestamps,
sensor events. Flag any data quality issues (gaps, sensor warmup, anomalies).
HANDOFF: Summarize date range, data completeness %, readings extracted.""",
tools=[file_read, read_pdf],
model=model
)
analyser = Agent(
name="analyser",
system_prompt="""You perform quantitative analysis on structured diabetes data.
Calculate: TIR (3.9-10.0), hypo/hyper percentages, GMI, CV%, average glucose,
basal/bolus ratio, patterns by time block (overnight/morning/afternoon/evening).
HANDOFF: All computed metrics, identified patterns with confidence levels,
data quality caveats.""",
tools=[calculator, python_repl, file_read, read_pdf],
model=model
)
reviewer = Agent(
name="reviewer",
system_prompt="""You critically review diabetes data analysis.
Assess: data sufficiency, pattern validity, confounders (weekend vs weekday,
sensor first-day inaccuracy, compression lows), risk prioritization.
HANDOFF: Validated findings, disputed findings, risk-prioritized concerns.""",
tools=[],
model=model
)
endocrinologist = Agent(
name="endocrinologist",
system_prompt="""You are a virtual endocrinology consultant. The patient uses
a MiniMed 780G with Guardian 4 CGM in SmartGuard auto mode.
Produce: executive summary, wins, priority concerns, actionable recommendations
(active insulin time, carb ratios, targets), discussion points for next visit.""",
tools=[],
model=model
)
Wiring the graph — four nodes, three edges, linear flow
builder = GraphBuilder()
builder.add_node(reader, "reader")
builder.add_node(analyser, "analyser")
builder.add_node(reviewer, "reviewer")
builder.add_node(endocrinologist, "endocrinologist")
builder.add_edge("reader", "analyser")
builder.add_edge("analyser", "reviewer")
builder.add_edge("reviewer", "endocrinologist")
builder.set_entry_point("reader")
graph = builder.build()
result = graph("Analyze the diabetes data files and produce a clinical report.")
What Came Out
Here's the executive summary the system produced, condensed:
Excellent Type 1 diabetes control — 92.5% Time in Range, GMI 6.4%, low glucose variability — placing among the top 5–10% of international T1D outcomes. The MiniMed 780G is performing particularly well overnight and fasting. Main safety concern: three severe hypoglycaemic episodes (<3.0 mmol/L) within 14 days. Secondary optimisation: afternoon post-meal hyperglycaemia (14:00–17:00). Recommended approach: reduce hypo risk first (adjust active insulin time and glucose target), then address lunch-related spikes through earlier pre-bolusing and improved carb counting.
Clinically useful? I think so.
What impressed me most was how deep the review went. The reviewer noticed that the reader’s rough guess of TIR (55–65%) was actually way off — when it recalculated from all 3,716 data points, the real figure was 92.5%. It also spotted things the analyser completely missed, like differences between weekends and weekdays, first-day sensor issues, and compression lows. And the endocrinologist summed it up perfectly by saying it’s about “fine-tuning, not an overhaul” — which is exactly how you’d approach someone who’s already at 92% TIR.
The $18 Problem
Here's where it gets uncomfortable. The unoptimised graph: 18 minutes, ~3.3 million tokens:
| Node | Input Tokens | Output Tokens | Total Cost |
|---|---|---|---|
| reader | 1,623,805 | 19,252 | $8.60 |
| analyser | 1,667,440 | 40,461 | $9.35 |
| reviewer | 5,810 | 7,780 | $0.22 |
| endocrinologist | ~3,331 | ~4,242 | ~$0.25 |
| Total | ~3.3M | ~67K | $18.42 |
$18.42 for a single report. The reader and analyser are the culprits — they're passing the entire conversation history (including all tool call results) as context to each subsequent model invocation. The 5,000-row CSV gets re-read and re-sent multiple times.
My agents were going through tokens like people go through rakia (a fruit brandy) at a village wedding.
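The per-node cost arithmetic is easy to sanity-check. Here's a minimal sketch; the per-token prices are placeholders I picked for illustration, not official Bedrock rates, so substitute the current pricing for your model and region:

```python
# Hypothetical prices for illustration only -- NOT official Bedrock rates.
PRICE_IN_PER_M = 5.0    # $ per 1M input tokens (placeholder)
PRICE_OUT_PER_M = 16.0  # $ per 1M output tokens (placeholder)

def node_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one node, given its token counts and flat per-token rates."""
    return (input_tokens * PRICE_IN_PER_M
            + output_tokens * PRICE_OUT_PER_M) / 1_000_000

# Token counts from the unoptimised run above.
nodes = {
    "reader":   (1_623_805, 19_252),
    "analyser": (1_667_440, 40_461),
    "reviewer": (5_810, 7_780),
}
total = sum(node_cost(i, o) for i, o in nodes.values())
```

With the reader's token counts this lands in the same ballpark as the table above, which is all a back-of-the-envelope sketch like this is for.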
Cutting It With a Sliding Window
One-liner fix per agent. Strands has a SlidingWindowConversationManager that caps how much conversation history gets sent to the model:
Sliding window configuration — add to reader and analyser
reader = Agent(
name="reader",
system_prompt="...", # same as before
tools=[file_read, read_pdf],
model=model,
conversation_manager=SlidingWindowConversationManager(
window_size=10,
should_truncate_results=True,
per_turn=True,
),
)
analyser = Agent(
name="analyser",
system_prompt="...", # same as before
tools=[calculator, python_repl, file_read, read_pdf],
model=model,
conversation_manager=SlidingWindowConversationManager(
window_size=10,
should_truncate_results=True,
per_turn=True,
),
)
The reviewer and endocrinologist don't need it — their input is already small.
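Conceptually the window is simple: keep only the last N messages and clip oversized tool results before they reach the model. This toy sketch shows the idea only; it is not Strands' actual implementation:

```python
def slide_window(messages: list[dict], window_size: int,
                 max_result_chars: int = 200) -> list[dict]:
    """Keep only the last `window_size` messages, truncating oversized
    tool results. A conceptual sketch, not Strands' real code."""
    recent = messages[-window_size:]
    trimmed = []
    for msg in recent:
        content = msg["content"]
        # Tool results (e.g. a 5,000-row CSV dump) are the usual offenders.
        if msg.get("role") == "tool" and len(content) > max_result_chars:
            content = content[:max_result_chars] + " ...[truncated]"
        trimmed.append({**msg, "content": content})
    return trimmed

history = [{"role": "user", "content": "analyse"},
           {"role": "tool", "content": "row," * 5_000}]  # ~20K chars
windowed = slide_window(history, window_size=10)
```

The point is that every tool call/result pair otherwise stays in history verbatim, which is exactly why the reader and analyser ballooned.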
Token reduction by node:
| Node | Previous (no manager) | With Sliding Window | Savings |
|---|---|---|---|
| reader | 1,623,805 input | 1,054,919 input | -35% |
| analyser | 1,667,440 input | 135,770 input | -92% |
| reviewer | 5,810 input | 1,260 input | -78% |
| endocrinologist | ~3,331 input | ~3,331 input | same |
Cost after optimisation:
| Node | Input Cost | Output Cost | Total |
|---|---|---|---|
| reader | $5.27 | $0.31 | $5.58 |
| analyser | $0.68 | $0.39 | $1.07 |
| reviewer | $0.01 | $0.08 | $0.09 |
| endocrinologist | $0.02 | $0.13 | $0.15 |
| Total | $5.98 | $0.91 | $6.89 |
From $18.42 to $6.89 — a 63% cost reduction with no meaningful loss in output quality. The analyser benefited most because it was the worst offender: every python_repl tool call was accumulating in context.
Architecture 2: The Swarm
Swarm is the wilder of the two. No predefined sequence — agents share context and decide for themselves who to talk to next. Less assembly line, more group chat where everyone's an expert. A life without a manager, basically.
That doesn't mean it's cheaper, though. Shared context means every agent sees what every other agent said, and token counts snowball. I hit ~3.6M tokens and some failed runs trying this on Opus. So I got pragmatic: Opus as the coordinator, three worker agents on Sonnet 4. Costs came down and the thing actually finished. Sonnet 4 is, incidentally, the default model for the agent swarm tool.
Swarm setup — Opus coordinator, Sonnet workers
from strands_tools import swarm
agent = Agent(
name="Diabetes analyser",
model=BedrockModel(model_id="global.anthropic.claude-sonnet-4-6"),
tools=[swarm, python_repl, file_read, calculator]
)
result = agent(
"Create three agents to analyse 'X 10-03-2026.csv' in current folder"
)
Three swarm agents — glucose analyst, insulin analyst, clinical advisor — each working their own angle on the data.
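The handoff mechanic is easiest to see with the LLM stripped out entirely. This toy sketch (all names hypothetical, no real agents involved) shows what "agents decide who acts next" means in control-flow terms:

```python
# Toy simulation of swarm-style handoffs: each "agent" does its work on a
# shared context, then names who (if anyone) should act next.
def glucose_analyst(ctx):
    ctx["tir"] = 92.5
    return "insulin_analyst"        # hand off to the next specialist

def insulin_analyst(ctx):
    ctx["bolus_per_day"] = 12.6
    return "clinical_advisor"

def clinical_advisor(ctx):
    ctx["advice"] = "pre-bolus lunch earlier"
    return None                      # done -- no further handoff

agents = {"glucose_analyst": glucose_analyst,
          "insulin_analyst": insulin_analyst,
          "clinical_advisor": clinical_advisor}

def run_swarm(start: str, shared_context: dict) -> dict:
    current = start
    while current is not None:       # the agents route themselves
        current = agents[current](shared_context)
    return shared_context
```

Contrast this with the Graph, where the reader→analyser→reviewer→endocrinologist route is fixed by the edges, not chosen at runtime.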
What Came Out
The Swarm got the numbers right. TIR 92.5%, afternoon variability, the hypo events — all there. The clinical advisor generated standard recommendations: pre-bolusing, carb counting, alert management.
But the report read more like a template than a consultation. Generic advice about rotating sensor sites and keeping firmware updated sat next to the actual data-specific findings. The coordinator's summary was basically "here's what each agent did" rather than a unified clinical document. In hindsight, I think I got the split backwards: I put the largest model in the coordinator seat and kept the workers small. A better approach might be Haiku as the coordinator with Opus or Sonnet workers, putting the depth where the analysis happens. But I was being careful with tokens (and rakia), so I stuck with the default Sonnet workers.
The Showdown: Graph vs Swarm
Same 14-day CareLink export, both architectures. Here's what happened.
Tokens & Cost
| Metric | Graph (Opus, optimised) | Swarm (Opus coord + Sonnet agents) |
|---|---|---|
| Total tokens | ~900K | ~175K |
| Cost | ~$6.89 | ~$2.50 |
| Time | ~11 min | ~2.5 min |
| Worker model | Opus (all 4 nodes) | Sonnet 4 (3 agents) |
| Coordinator | N/A (sequential) | Opus |
Swarm is 3x cheaper and 4x faster. But this comparison is a bit unfair — the Graph runs everything on the most expensive model, while the Swarm pushes the heavy lifting to Sonnet.
Where the Graph pulled ahead
Self-correction. The reviewer noticed the reader's TIR estimate (55–65%) was way off and flagged 92.5% as the real number. The Swarm had no mechanism for one agent to challenge another.
Confounders. The reviewer flagged that the Level 2 hypo events weren't checked against the sensor change date (Mar 5). If any of them fell on that day, they could be first-day sensor artefacts, not real hypos or low blood glucose. Swarm didn't consider this.
Compression lows. Overnight readings below range could include compression lows — you roll onto the sensor and it reads artificially low. Graph flagged it. Swarm didn't.
Weekend vs weekday. The reviewer noted this wasn't assessed at all. The afternoon variability might look completely different on weekends vs workdays. Swarm just generated recommendations without thinking about it.
Pattern prioritisation. The reviewer took the "12.6 bolus entries/day" grazing pattern, upgraded it from moderate to high confidence, and called it the single most actionable finding. Swarm saw the same number but didn't do anything with it.
Specific recommendations. The endocrinologist said "weaken afternoon CR by 10–15%", "verify AIT is 3.0–3.5 hours", "do NOT lower SmartGuard target until hypos are addressed." The Swarm said "rotate sensor sites" and "keep firmware updated." One of these I can take to my doctor. The other I already know.
Where the Swarm held up
The core metrics were correct — TIR, GMI, CV% all matched. Parallel analysis was efficient. And for a quick "how's my last two weeks looking?" check, it's perfectly fine.
The thing I need to be honest about
I can't cleanly separate architecture quality from model quality here. And that bugs me.
The Graph runs all four nodes on Opus. The Swarm runs the workers on Sonnet. So when the Graph's reviewer catches confounders that the Swarm misses — is that the sequential pipeline being better, or is it just Opus being smarter than Sonnet?
Probably both. My gut says ~60% model, ~40% architecture. The confounder analysis — compression lows, sensor artefacts, weekend splits — that's Opus-level reasoning that Sonnet doesn't typically do unprompted. But the architecture gave Opus a dedicated step to do that reasoning. The Swarm doesn't have a reviewer step even if you ran it all on Opus.
The fair test would be:
- Graph with Sonnet everywhere vs Swarm with Sonnet (isolate architecture)
- Graph with Opus everywhere vs Swarm with Opus (same thing, higher tier)
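If I run that test, I'd parametrise it so the only variable is the model. A small sketch of the idea; the experiment names are my own invention, and the model IDs are the ones used in this article:

```python
# Map each pipeline node to a model ID per experiment, so swapping tiers
# changes exactly one variable. Experiment names are hypothetical.
NODES = ("reader", "analyser", "reviewer", "endocrinologist")

EXPERIMENTS = {
    "graph-sonnet": {n: "global.anthropic.claude-sonnet-4-6" for n in NODES},
    "graph-opus":   {n: "global.anthropic.claude-opus-4-6" for n in NODES},
}

def model_for(experiment: str, node: str) -> str:
    """Look up which model a given node should run in a given experiment."""
    return EXPERIMENTS[experiment][node]
```

Each node's `BedrockModel` would then be built from `model_for(experiment, node)`, so flipping the whole pipeline between tiers is a one-word change.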
The Verdict
| Dimension | Graph | Swarm |
|---|---|---|
| Output quality | ★★★★★ | ★★★☆☆ |
| Clinical depth | Deep, specific | Competent, generic |
| Cost (optimised) | $6.89 | ~$2.50 |
| Speed | ~11 min | ~2.5 min |
| Error correction | Built-in (reviewer) | None |
| Best for | Clinic-ready reports | Weekly trend checks |
Graph wins on quality. I'd hand the Graph report to my endocrinologist. The Swarm report I'd keep for myself.
Swarm wins on speed and cost. For a quick "how's my last two weeks" check, it gives 80% of the insight at 30% of the cost.
As we say in Bulgaria: "A good word opens even an iron door." The Graph doesn't just give you more words — it gives you the right ones. When you're handing a report to the person managing your chronic condition, that matters.
Key Takeaways
1. Context management is everything. Sliding window: $18 → $7. 63% cut. Zero quality loss. If you're building multi-agent pipelines, do this first. Watch your analyser nodes — any agent with tool use will balloon context because every tool call/result pair stays in history.
2. Review steps are worth it. The reviewer cost $0.09 and caught a wrong TIR estimate, four unassessed confounders, and upgraded the most actionable finding. Best nine cents I've ever spent.
3. Model and architecture are tangled. I used Opus for Graph and Sonnet for Swarm workers. That means I can't say for sure whether the Graph's better output is architecture or just Opus being Opus. Probably both — 60/40 model/architecture is my guess. Need to test properly.
4. Pick the right tool. Graph for depth, Swarm for speed. "Which is better?" → "Better at what?"
5. The output is actually useful. Is $7 per report worth it? I think so. My endo agreed with the recommendations. That's the test that matters.
6. Start with the smallest model. It can do surprisingly well, and it keeps costs down.
How Much Did the Over-Engineering Cost?
I always thought this would cost a couple of bucks at most.
Well, I ran these agents quite a few times, around 20 during development, and there's a catch in the pricing: long-context requests incur additional charges. Luckily I'm a Community Builder, so the roughly $500 in model usage didn't come out of my own pocket. Use the models responsibly and start with the smallest possible model.


