The Problem With 5,000 Rows of Blood Sugar Data
I've been living with Type 1 diabetes for over 17 years. My mother had it too, along with some of its complications. The disease hasn't changed much — but the tech around it has.
I use a MiniMed 780G insulin pump with a Guardian 4 CGM sensor running in SmartGuard auto mode. Every 14 days it produces a report: a ~5,000-row CSV of pump events and CGM readings, plus a PDF summary.
I wanted something better — a structured clinical summary that's actually useful to medical staff and patients alike, one that makes sense of the patterns. And because I'm a DevOps engineer who can't resist over-engineering things, I decided to benchmark two multi-agent architectures against each other:
- Graph (sequential pipeline)
- Swarm (autonomous handoffs)
This article covers Graph and Swarm.
Architecture 1: The Graph Pipeline
Four agents, one after another. Each does its job and passes results to the next:
Reader — Ingests the raw CareLink CSV. Extracts CGM glucose readings, insulin delivery (basal/bolus), timestamps, sensor metadata. Flags data quality issues like gaps and sensor warmup periods.
Analyser — Crunches the numbers with Python: Time in Range (TIR), GMI, CV%, time-block patterns. Cross-validates against the PDF report.
Reviewer — The sceptic. Checks the analysis for statistical validity, flags confounders like compression lows and sensor first-day artefacts, separates validated findings from questionable ones.
Endocrinologist — Takes everything and writes a clinical consultation report with pump setting recommendations and discussion points for the next endo visit.
The Code
Setup — imports, PDF tool, and model config
import logging
from pypdf import PdfReader
from strands.models import BedrockModel
from strands import Agent, tool
from strands.multiagent import GraphBuilder
from strands_tools import python_repl, calculator, file_read
import os
from strands.agent.conversation_manager import SlidingWindowConversationManager
os.environ["BYPASS_TOOL_CONSENT"] = "true"
@tool
def read_pdf(filename: str) -> str:
    """Read a PDF file and return its text content as a single string."""
    try:
        reader = PdfReader(filename)
        texts = []
        for page in reader.pages:
            texts.append(page.extract_text())
        return '\n'.join(texts)
    except Exception as e:
        return f"Error reading file {filename}: {str(e)}"
model = BedrockModel(
model_id="global.anthropic.claude-opus-4-6",
region_name="us-east-1"
)
Agent definitions — four specialists, each with a clear handoff
reader = Agent(
name="reader",
system_prompt="""You extract and structure diabetes management data from files
in the current directory. Files follow the pattern 'X DD-MM-YYYY.csv' and
'X DD-MM-YYYY.pdf'.
Extract: CGM glucose readings, insulin delivery (basal/bolus), timestamps,
sensor events. Flag any data quality issues (gaps, sensor warmup, anomalies).
HANDOFF: Summarize date range, data completeness %, readings extracted.""",
tools=[file_read, read_pdf],
model=model
)
analyser = Agent(
name="analyser",
system_prompt="""You perform quantitative analysis on structured diabetes data.
Calculate: TIR (3.9-10.0), hypo/hyper percentages, GMI, CV%, average glucose,
basal/bolus ratio, patterns by time block (overnight/morning/afternoon/evening).
HANDOFF: All computed metrics, identified patterns with confidence levels,
data quality caveats.""",
tools=[calculator, python_repl, file_read, read_pdf],
model=model
)
reviewer = Agent(
name="reviewer",
system_prompt="""You critically review diabetes data analysis.
Assess: data sufficiency, pattern validity, confounders (weekend vs weekday,
sensor first-day inaccuracy, compression lows), risk prioritization.
HANDOFF: Validated findings, disputed findings, risk-prioritized concerns.""",
tools=[],
model=model
)
endocrinologist = Agent(
name="endocrinologist",
system_prompt="""You are a virtual endocrinology consultant. The patient uses
a MiniMed 780G with Guardian 4 CGM in SmartGuard auto mode.
Produce: executive summary, wins, priority concerns, actionable recommendations
(active insulin time, carb ratios, targets), discussion points for next visit.""",
tools=[],
model=model
)
Wiring the graph — four nodes, three edges, linear flow
builder = GraphBuilder()
builder.add_node(reader, "reader")
builder.add_node(analyser, "analyser")
builder.add_node(reviewer, "reviewer")
builder.add_node(endocrinologist, "endocrinologist")
builder.add_edge("reader", "analyser")
builder.add_edge("analyser", "reviewer")
builder.add_edge("reviewer", "endocrinologist")
builder.set_entry_point("reader")
graph = builder.build()
result = graph("Analyze the diabetes data files and produce a clinical report.")
What Came Out
Here's the executive summary the system produced, condensed:
Excellent Type 1 diabetes control — 92.5% Time in Range, GMI 6.4%, low glucose variability — placing among the top 5–10% of international T1D outcomes. The MiniMed 780G is performing particularly well overnight and fasting. Main safety concern: three severe hypoglycaemic episodes (<3.0 mmol/L) within 14 days. Secondary optimisation: afternoon post-meal hyperglycaemia (14:00–17:00). Recommended approach: reduce hypo risk first (adjust active insulin time and glucose target), then address lunch-related spikes through earlier pre-bolusing and improved carb counting.
Clinically useful? I think so.
What impressed me most was how deep the review went. The reviewer noticed that the reader’s rough guess of TIR (55–65%) was actually way off — when it recalculated from all 3,716 data points, the real figure was 92.5%. It also spotted things the analyser completely missed, like differences between weekends and weekdays, first-day sensor issues, and compression lows. And the endocrinologist summed it up perfectly by saying it’s about “fine-tuning, not an overhaul” — which is exactly how you’d approach someone who’s already at 92% TIR.
The $18 Problem
Here's where it gets uncomfortable. The unoptimised graph: 18 minutes, ~3.3 million tokens:
| Node | Input Tokens | Output Tokens | Total Cost |
|---|---|---|---|
| reader | 1,623,805 | 19,252 | $8.60 |
| analyser | 1,667,440 | 40,461 | $9.35 |
| reviewer | 5,810 | 7,780 | $0.22 |
| endocrinologist | ~3,331 | ~4,242 | ~$0.25 |
| Total | ~3.3M | ~67K | $18.42 |
$18.42 for a single report. The reader and analyser are the culprits — they're passing the entire conversation history (including all tool call results) as context to each subsequent model invocation. The 5,000-row CSV gets re-read and re-sent multiple times.
My agents were going through tokens like people go through rakia (a fruit brandy) at a village wedding.
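The per-node cost arithmetic is easy to sanity-check. Here's a minimal sketch; the per-token prices are placeholders I picked for illustration, not official Bedrock rates, so substitute the current pricing for your model and region:

```python
# Hypothetical prices for illustration only -- NOT official Bedrock rates.
PRICE_IN_PER_M = 5.0    # $ per 1M input tokens (placeholder)
PRICE_OUT_PER_M = 16.0  # $ per 1M output tokens (placeholder)

def node_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one node, given its token counts and flat per-token rates."""
    return (input_tokens * PRICE_IN_PER_M
            + output_tokens * PRICE_OUT_PER_M) / 1_000_000

# Token counts from the unoptimised run above.
nodes = {
    "reader":   (1_623_805, 19_252),
    "analyser": (1_667_440, 40_461),
    "reviewer": (5_810, 7_780),
}
total = sum(node_cost(i, o) for i, o in nodes.values())
```

With the reader's token counts this lands in the same ballpark as the table above, which is all a back-of-the-envelope sketch like this is for.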
Cutting It With a Sliding Window
One-liner fix per agent. Strands has a SlidingWindowConversationManager that caps how much conversation history gets sent to the model:
Sliding window configuration — add to reader and analyser
reader = Agent(
name="reader",
system_prompt="...", # same as before
tools=[file_read, read_pdf],
model=model,
conversation_manager=SlidingWindowConversationManager(
window_size=10,
should_truncate_results=True,
per_turn=True,
),
)
analyser = Agent(
name="analyser",
system_prompt="...", # same as before
tools=[calculator, python_repl, file_read, read_pdf],
model=model,
conversation_manager=SlidingWindowConversationManager(
window_size=10,
should_truncate_results=True,
per_turn=True,
),
)
The reviewer and endocrinologist don't need it — their input is already small.
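Conceptually the window is simple: keep only the last N messages and clip oversized tool results before they reach the model. This toy sketch shows the idea only; it is not Strands' actual implementation:

```python
def slide_window(messages: list[dict], window_size: int,
                 max_result_chars: int = 200) -> list[dict]:
    """Keep only the last `window_size` messages, truncating oversized
    tool results. A conceptual sketch, not Strands' real code."""
    recent = messages[-window_size:]
    trimmed = []
    for msg in recent:
        content = msg["content"]
        # Tool results (e.g. a 5,000-row CSV dump) are the usual offenders.
        if msg.get("role") == "tool" and len(content) > max_result_chars:
            content = content[:max_result_chars] + " ...[truncated]"
        trimmed.append({**msg, "content": content})
    return trimmed

history = [{"role": "user", "content": "analyse"},
           {"role": "tool", "content": "row," * 5_000}]  # ~20K chars
windowed = slide_window(history, window_size=10)
```

The point is that every tool call/result pair otherwise stays in history verbatim, which is exactly why the reader and analyser ballooned.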
Token reduction by node:
| Node | Previous (no manager) | With Sliding Window | Savings |
|---|---|---|---|
| reader | 1,623,805 input | 1,054,919 input | -35% |
| analyser | 1,667,440 input | 135,770 input | -92% |
| reviewer | 5,810 input | 1,260 input | -78% |
| endocrinologist | ~3,331 input | ~3,331 input | same |
Cost after optimisation:
| Node | Input Cost | Output Cost | Total |
|---|---|---|---|
| reader | $5.27 | $0.31 | $5.58 |
| analyser | $0.68 | $0.39 | $1.07 |
| reviewer | $0.01 | $0.08 | $0.09 |
| endocrinologist | $0.02 | $0.13 | $0.15 |
| Total | $5.98 | $0.91 | $6.89 |
From $18.42 to $6.89 — a 63% cost reduction with no meaningful loss in output quality. The analyser benefited most because it was the worst offender: every python_repl tool call was accumulating in context.
Architecture 2: The Swarm
Swarm is the wilder of the two. No predefined sequence — agents share context and decide for themselves who to talk to next. Less assembly line, more group chat where everyone's an expert. A life without a manager, basically.
That doesn't mean it's cheaper, though. Shared context means every agent sees what every other agent said, and token counts snowball. I hit ~3.6M tokens and some failed runs trying this on Opus. So I got pragmatic: Opus as the coordinator, three worker agents on Sonnet 4. Costs came down and the thing actually finished. Sonnet 4 is, incidentally, the default model for the agent swarm tool.
Swarm setup — Opus coordinator, Sonnet workers
from strands_tools import swarm
agent = Agent(
name="Diabetes analyser",
model=BedrockModel(model_id="global.anthropic.claude-sonnet-4-6"),
tools=[swarm, python_repl, file_read, calculator]
)
result = agent(
"Create three agents to analyse 'X 10-03-2026.csv' in current folder"
)
Three swarm agents — glucose analyst, insulin analyst, clinical advisor — each working their own angle on the data.
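The handoff mechanic is easiest to see with the LLM stripped out entirely. This toy sketch (all names hypothetical, no real agents involved) shows what "agents decide who acts next" means in control-flow terms:

```python
# Toy simulation of swarm-style handoffs: each "agent" does its work on a
# shared context, then names who (if anyone) should act next.
def glucose_analyst(ctx):
    ctx["tir"] = 92.5
    return "insulin_analyst"        # hand off to the next specialist

def insulin_analyst(ctx):
    ctx["bolus_per_day"] = 12.6
    return "clinical_advisor"

def clinical_advisor(ctx):
    ctx["advice"] = "pre-bolus lunch earlier"
    return None                      # done -- no further handoff

agents = {"glucose_analyst": glucose_analyst,
          "insulin_analyst": insulin_analyst,
          "clinical_advisor": clinical_advisor}

def run_swarm(start: str, shared_context: dict) -> dict:
    current = start
    while current is not None:       # the agents route themselves
        current = agents[current](shared_context)
    return shared_context
```

Contrast this with the Graph, where the reader→analyser→reviewer→endocrinologist route is fixed by the edges, not chosen at runtime.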
What Came Out
The Swarm got the numbers right. TIR 92.5%, afternoon variability, the hypo events — all there. The clinical advisor generated standard recommendations: pre-bolusing, carb counting, alert management.
But the report read more like a template than a consultation. Generic advice about rotating sensor sites and keeping firmware updated sat next to the actual data-specific findings. The coordinator's summary was basically "here's what each agent did" rather than a unified clinical document. In hindsight, I think I got the split backwards: I put the largest model in the coordinator seat and kept the workers small. A better approach might be Haiku as the coordinator with Opus or Sonnet workers, putting the depth where the analysis happens. But I was being careful with tokens (and rakia), so I stuck with the default Sonnet workers.
The Showdown: Graph vs Swarm
Same 14-day CareLink export, both architectures. Here's what happened.
Tokens & Cost
| Metric | Graph (Opus, optimised) | Swarm (Opus coord + Sonnet agents) |
|---|---|---|
| Total tokens | ~900K | ~175K |
| Cost | ~$6.89 | ~$2.50 |
| Time | ~11 min | ~2.5 min |
| Worker model | Opus (all 4 nodes) | Sonnet 4 (3 agents) |
| Coordinator | N/A (sequential) | Opus |
Swarm is 3x cheaper and 4x faster. But this comparison is a bit unfair — the Graph runs everything on the most expensive model, while the Swarm pushes the heavy lifting to Sonnet.
Where the Graph pulled ahead
Self-correction. The reviewer noticed the reader's TIR estimate (55–65%) was way off and flagged 92.5% as the real number. The Swarm had no mechanism for one agent to challenge another.
Confounders. The reviewer flagged that the Level 2 hypo events weren't checked against the sensor change date (Mar 5). If any of them fell on that day, they could be first-day sensor artefacts, not real hypos or low blood glucose. Swarm didn't consider this.
Compression lows. Overnight readings below range could include compression lows — you roll onto the sensor and it reads artificially low. Graph flagged it. Swarm didn't.
Weekend vs weekday. The reviewer noted this wasn't assessed at all. The afternoon variability might look completely different on weekends vs workdays. Swarm just generated recommendations without thinking about it.
Pattern prioritisation. The reviewer took the "12.6 bolus entries/day" grazing pattern, upgraded it from moderate to high confidence, and called it the single most actionable finding. Swarm saw the same number but didn't do anything with it.
Specific recommendations. The endocrinologist said "weaken afternoon CR by 10–15%", "verify AIT is 3.0–3.5 hours", "do NOT lower SmartGuard target until hypos are addressed." The Swarm said "rotate sensor sites" and "keep firmware updated." One of these I can take to my doctor. The other I already know.
Where the Swarm held up
The core metrics were correct — TIR, GMI, CV% all matched. Parallel analysis was efficient. And for a quick "how's my last two weeks looking?" check, it's perfectly fine.
The thing I need to be honest about
I can't cleanly separate architecture quality from model quality here. And that bugs me.
The Graph runs all four nodes on Opus. The Swarm runs the workers on Sonnet. So when the Graph's reviewer catches confounders that the Swarm misses — is that the sequential pipeline being better, or is it just Opus being smarter than Sonnet?
Probably both. My gut says ~60% model, ~40% architecture. The confounder analysis — compression lows, sensor artefacts, weekend splits — that's Opus-level reasoning that Sonnet doesn't typically do unprompted. But the architecture gave Opus a dedicated step to do that reasoning. The Swarm doesn't have a reviewer step even if you ran it all on Opus.
The fair test would be:
- Graph with Sonnet everywhere vs Swarm with Sonnet (isolate architecture)
- Graph with Opus everywhere vs Swarm with Opus (same thing, higher tier)
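If I run that test, I'd parametrise it so the only variable is the model. A small sketch of the idea; the experiment names are my own invention, and the model IDs are the ones used in this article:

```python
# Map each pipeline node to a model ID per experiment, so swapping tiers
# changes exactly one variable. Experiment names are hypothetical.
NODES = ("reader", "analyser", "reviewer", "endocrinologist")

EXPERIMENTS = {
    "graph-sonnet": {n: "global.anthropic.claude-sonnet-4-6" for n in NODES},
    "graph-opus":   {n: "global.anthropic.claude-opus-4-6" for n in NODES},
}

def model_for(experiment: str, node: str) -> str:
    """Look up which model a given node should run in a given experiment."""
    return EXPERIMENTS[experiment][node]
```

Each node's `BedrockModel` would then be built from `model_for(experiment, node)`, so flipping the whole pipeline between tiers is a one-word change.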
The Verdict
| Dimension | Graph | Swarm |
|---|---|---|
| Output quality | ★★★★★ | ★★★☆☆ |
| Clinical depth | Deep, specific | Competent, generic |
| Cost (optimised) | $6.89 | ~$2.50 |
| Speed | ~11 min | ~2.5 min |
| Error correction | Built-in (reviewer) | None |
| Best for | Clinic-ready reports | Weekly trend checks |
Graph wins on quality. I'd hand the Graph report to my endocrinologist. The Swarm report I'd keep for myself.
Swarm wins on speed and cost. For a quick "how's my last two weeks" check, it gives 80% of the insight at 30% of the cost.
As we say in Bulgaria: "A good word opens even an iron door." The Graph doesn't just give you more words — it gives you the right ones. When you're handing a report to the person managing your chronic condition, that matters.
Key Takeaways
1. Context management is everything. Sliding window: $18 → $7. 63% cut. Zero quality loss. If you're building multi-agent pipelines, do this first. Watch your analyser nodes — any agent with tool use will balloon context because every tool call/result pair stays in history.
2. Review steps are worth it. The reviewer cost $0.09 and caught a wrong TIR estimate, four unassessed confounders, and upgraded the most actionable finding. Best nine cents I've ever spent.
3. Model and architecture are tangled. I used Opus for Graph and Sonnet for Swarm workers. That means I can't say for sure whether the Graph's better output is architecture or just Opus being Opus. Probably both — 60/40 model/architecture is my guess. Need to test properly.
4. Pick the right tool. Graph for depth, Swarm for speed. "Which is better?" → "Better at what?"
5. The output is actually useful. Is $7 per report worth it? I think so. My endo agreed with the recommendations. That's the test that matters.
6. Start with the smallest model. It can do surprisingly well, and it keeps costs down.
How Much Did the Over-Engineering Cost?
I always thought this would cost a couple of bucks at most.
Well, I ran these agents quite a few times, around 20 during development, and there's a catch in the pricing: long-context requests incur additional charges. Luckily I'm a Community Builder, so the roughly $500 in model usage didn't come out of my own pocket. Use the models responsibly and start with the smallest possible model.


