Lars Winstand

Posted on Jun 11 • Originally published at standardcompute.com

If your agent touches health data, do the boring part first

#ai #automation #healthtech #devops

I’ll say it plainly: the first health-adjacent agent workflow I’d trust is not an AI doctor.

It’s a narrow pipeline that takes 6 months of Apple Watch sleep data, cleans timestamps, maps records into a fixed sleep-diary schema, flags broken rows, and stops for human review before anything reaches a clinician.

That sounds unsexy.

Good.

That’s exactly why it’s the first version I’d trust.

I landed on this after reading a post on r/openclaw where someone said they had their AI assistant turn months of Apple Watch sleep data into the diary their sleep clinic requested, and the data gotchas were brutal.

That sentence contains the whole product.

Not “AI healthcare.”
Not “autonomous wellness.”
Not a GPT-5 wrapper with a soothing UI pretending it understands sleep medicine.

Just a very practical engineering problem:

parse ugly export data
normalize time boundaries
fit it into a clinician-friendly format
fail loudly on bad rows
require a human to approve it

That is a real use case.

And if you build automations in n8n, Make, Zapier, OpenClaw, or Python, it should feel familiar: the hard part is not the final prompt. The hard part is the ugly middle.

The hard part is ETL, not reasoning

Most health-agent demos skip the only part that matters.

They show the polished summary. They show Claude or GPT-5 saying something calm and articulate. They show a dashboard.

I don’t think that’s the hard part.

The hard part is ETL:

extraction
transformation
loading

For sleep data, that means dealing with stuff like:

timestamps crossing midnight
timezone normalization
naps vs overnight sleep
missing start or end times
overlapping intervals
gaps from the device not recording
clinic-specific diary formats

If you get any of that wrong, the model summary at the end is not helpful. It is actively misleading.

That’s why I think the boring pipeline is the real product.

The workflow I’d actually ship

If I had to build this today, I would keep the architecture aggressively narrow.

Apple Health export
  -> parse sleep records
  -> validate schema
  -> normalize timestamps and diary dates
  -> map into fixed sleep-diary format
  -> optionally generate plain-language notes
  -> require human approval
  -> export for clinic submission

That’s it.

No diagnosis.
No treatment suggestions.
No “you may have a circadian disorder” nonsense.

Just a structured transformation pipeline with a review gate.

A practical implementation shape

Here’s how I’d break it up in code or in an automation builder.

Step 1: Parse the export deterministically

Don’t ask an LLM to parse Apple Health exports if you can avoid it.

Use deterministic code first.

from datetime import datetime
import csv

REQUIRED_FIELDS = ["start", "end", "source", "type"]

def parse_sleep_rows(rows):
    parsed = []
    errors = []

    for i, row in enumerate(rows):
        missing = [f for f in REQUIRED_FIELDS if not row.get(f)]
        if missing:
            errors.append({"row": i, "error": f"missing fields: {missing}"})
            continue

        try:
            start = datetime.fromisoformat(row["start"])
            end = datetime.fromisoformat(row["end"])
        except Exception as e:
            errors.append({"row": i, "error": f"bad timestamp: {e}"})
            continue

        if end <= start:
            errors.append({"row": i, "error": "end must be after start"})
            continue

        parsed.append({
            "start": start,
            "end": end,
            "source": row["source"],
            "type": row["type"]
        })

    return parsed, errors

This is boring code.

That’s the point.

Step 2: Normalize diary boundaries explicitly

Sleep data is annoying because humans think in nights, not rows.

A sleep segment from 11:42 PM to 6:18 AM belongs to one sleep episode, but it spans two calendar dates.

You need a rule.

For example:

def diary_date(record):
    # Example rule: assign sleep to the date it started,
    # unless start time is before 6 AM, then assign to previous day.
    start = record["start"]
    if start.hour < 6:
        return (start.date()).isoformat()
    return start.date().isoformat()

You may choose a different rule.

What matters is that the rule is explicit, testable, and visible to the reviewer.

Step 3: Validate before any model sees the data

This is where most “agent” demos get sloppy.

If rows overlap, if timestamps are missing, if timezone conversion changed the diary date, that should be surfaced before GPT-5, Claude, or any other model writes a nice paragraph about it.

def validate_intervals(records):
    records = sorted(records, key=lambda r: r["start"])
    errors = []

    for i in range(1, len(records)):
        prev = records[i - 1]
        curr = records[i]
        if curr["start"] < prev["end"]:
            errors.append({
                "error": "overlap detected",
                "previous": prev,
                "current": curr
            })

    return errors

If validation fails, stop.

Not “best effort.”
Not “the model can probably infer it.”

Stop.

Where an LLM actually helps

I’m not arguing against LLMs.

I’m arguing for using them in the one place they’re actually useful here: turning already-clean structured data into readable notes.

Example prompt shape:

{
  "diary_date": "2026-02-14",
  "sleep_start": "2026-02-14T23:42:00-08:00",
  "sleep_end": "2026-02-15T06:18:00-08:00",
  "total_sleep_minutes": 396,
  "awakenings": 2,
  "missing_fields": []
}

Then ask the model for something narrow:

Write a short plain-language sleep diary note using only the provided fields.
Do not infer diagnosis.
Do not add medical advice.
If fields are missing, say that explicitly.

That is a good LLM task.

Freeform parsing of raw health exports is not.

If you’re building this in n8n, Make, Zapier, or OpenClaw

The pattern is the same no matter what stack you use.

n8n shape

Webhook / File Trigger
-> Code node: parse export
-> IF node: validation errors?
-> Code node: normalize diary schema
-> OpenAI-compatible chat node: plain-language note generation
-> Human approval step
-> Export to CSV / email / EHR-compatible handoff

Make shape

Watch files
-> Parse JSON/XML/CSV
-> Router for validation failures
-> Transform records
-> LLM module for summary text
-> Approval scenario
-> Final export

Custom Python shape

python ingest.py --input apple_health_export.xml
python validate.py --input parsed_sleep.json
python normalize.py --input validated_sleep.json
python summarize.py --input diary.json
python export.py --input reviewed_diary.json

The stack is not the interesting part.

The boundary design is.

Why multi-agent health workflows make me nervous

I like agents. I build with them.

I still think people reach for multi-agent setups way too early.

If your workflow touches clinician-facing paperwork, every extra agent is another place for state to drift, retries to multiply, and outputs to become harder to audit.

For this class of workflow, I want:

one parser
one validator
one formatter
one optional LLM summarizer
one human reviewer

That’s enough.

If you need three agents debating whether a 2:07 AM sleep segment belongs to Tuesday or Wednesday, your architecture is already too clever.

The cost trap shows up fast

This is also where pricing matters more than people admit.

A real workflow like this does not run once.

It gets:

tested on partial exports
rerun after schema fixes
retried after validation failures
regenerated after human feedback
replayed when formatting changes

That means lots of repeated calls.

And if you’re paying per token every time the workflow loops, the architecture starts punishing you for being careful.

That’s one reason I think flat-rate inference is underrated for production automations.

If you’re using an OpenAI-compatible endpoint from Standard Compute, you can keep the same client setup while routing requests behind the scenes and avoid designing around token anxiety.

That changes behavior.

Teams are more willing to:

add validation passes
split deterministic steps from model steps
retry safely
keep human review in the loop

That’s a better engineering incentive than “please make fewer calls because finance is watching.”

The architecture rule that matters

Here’s the rule I’d use beyond health data too:

deterministic preprocessing first, model summarization second

That applies to:

sleep diaries
invoices
support tickets
compliance forms
CRM cleanup
any workflow where bad structure upstream creates fake confidence downstream

The safer pattern is:

parse
validate
normalize
structure
summarize
review

A lot of teams still do the reverse. They dump messy input into a model and hope the model invents structure on the way out.

That works right up until the workflow matters.

What I’d require before calling this safe enough

A few non-negotiables.

1. Separate source-derived fields from model-generated text

If a timestamp came from Apple Health, label it as source-derived.

If a sentence came from GPT-5 or Claude, label it as model-generated.

Those are not the same thing.

2. Broken rows fail loudly

Missing start time? Reject it.

Overlapping intervals? Flag them.

Timezone normalization changed the diary date? Show it.

Silently smoothing over bad data is exactly how trust gets destroyed.

3. Human review is a real gate

Not a decorative checkbox.

A human should be able to inspect the generated diary against the underlying records before it gets exported.

4. The workflow must admit uncertainty

Wearable data is messy.
Clinics want different formats.
Some records will be incomplete.

A good workflow should say “unknown” when something is unknown.

That is a feature, not a failure.

The weird part: narrower feels smarter

The more I think about this category, the more I think the best “health agent” barely feels like an agent.

It feels like a disciplined conveyor belt with one carefully fenced-off language model near the end.

No fake bedside manner.
No diagnosis theater.
No pretending a summary is the same thing as medical judgment.

Just a boring pipeline that survives the ugly data problems, produces a structured artifact, and hands it to a human.

That may sound small.

I think it’s exactly the right size.

And honestly, that lesson travels well outside health data too:

The more sensitive the workflow, the less your automation should improvise.

DEV Community