Budi Widhiyanto
How I Built an AI Agent That Writes FHIR Templates in Hours Instead of Days

3-4 days to write one JSON template. Sometimes a full week.

It was a Tuesday night and I had the FHIR specification open in twelve browser tabs. On my left monitor, a CSV data dictionary with 150 fields from a Kobo maternal health form for postnatal care, the kind that tracks five KF checkups for the mother and four KN visits for the newborn. On my right, a half-finished JSON template already 200 lines deep. I wasn't even past the Encounter resource yet.

Each visit block needed its own set of vital signs: blood pressure (systolic and diastolic, split from a combined "120/80" string), pulse rate, respiratory rate, temperature. Each needed clinical observations like uterine fundal height, lochia status, breastfeeding assessment. Each needed the right LOINC code, the right data type handler, the right reference chain connecting it back to the Patient and Encounter it belonged to. Five mother visits times eight to twelve observations each, plus four newborn visits with their own observation sets, plus the baby Patient resource with mother-baby linkage via NIK identifiers.

I did the math. This template would land somewhere around 320 resources and several thousand lines of JSON.

I'd done this before. Many times. Four days minimum, probably five. And somewhere around day two I'd lose my mental model of the reference graph, come back the next morning staring at field mappings I no longer remembered the reasoning behind, spend an hour just re-orienting before I could write another line.

That night I closed the FHIR spec tabs and started building the agent instead.

The Old World

Some context. I'm a software engineer working on health data interoperability for district health systems in Indonesia, specifically Purbalingga and Lombok Barat. We pull data from over a dozen source systems: ePuskesmas (community health center records), Kobo Toolbox (mobile data collection forms), SiGizi (nutrition surveillance), SITB (tuberculosis registry), SIHEPI (hepatitis tracking), SIMRS (hospital information systems), and more. Each system has its own data model, its own field names, its own quirks. Our job is to convert all of it into FHIR (Fast Healthcare Interoperability Resources), the international standard for healthcare data exchange.

FHIR, if you haven't worked with it, models healthcare data as discrete "resources." A patient is a Patient resource. A doctor visit is an Encounter. A blood pressure reading is an Observation with a specific LOINC code (85354-9), split into systolic and diastolic components, each coded separately. A diagnosis is a Condition. A vaccination is an Immunization with a CVX vaccine code and dose protocol. Every resource links to others through references: an Observation points to the Patient it belongs to, the Encounter it was recorded during, the Organization that performed it.

The conversion pipeline itself isn't the hard part. We have a Python engine that reads a JSON template, maps incoming data fields to FHIR resource structures, runs transformation functions, and POSTs the result as a FHIR Bundle to Indonesia's national FHIR server (SatuSehat). The hard part is writing the template.

A template in our system is a JSON file that describes every FHIR resource the converter should produce from a given data source. Here's what even a simple resource definition looks like. This is a real Patient resource from a pregnancy tracking form:

{
  "resourceType": "Patient",
  "fullUrl": "urn:uuid:patient",
  "keyField": "[nik_ibu]",
  "mergeWithExistingData": true,
  "requiredFields": {
    "id": "[generated/patient_res_id]",
    "identifier.0.system": "https://fhir.kemkes.go.id/id/nik",
    "identifier.0.value": "[generated/nik_adjusted]",
    "name.0.text": "[nama_ibu]",
    "birthDate": "[generated/birth_date]"
  },
  "extraLogics": [
    {
      "type": "getPatientIdWithFuzzyLogic",
      "input": ["", "[generated/nik_adjusted]",
               "[nama_ibu]", "[generated/birth_date]",
               "female", ""],
      "output": ["[patient_existing_id]"]
    }
  ]
}

The [square_bracket] notation maps to incoming data fields. The [generated/...] notation refers to values produced by transformation functions, called extraLogics, that run before the FHIR resource is assembled. In this example, getPatientIdWithFuzzyLogic queries the FHIR server to find if this patient already exists (matching on national ID, name, birth date, and gender).
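The bracket convention is simple enough to sketch in a few lines. This is a hypothetical illustration of how a converter might resolve the two placeholder forms against a record, not the actual engine code:

```python
import re

def resolve_placeholders(template_value, record, generated):
    """Resolve [field] and [generated/...] placeholders in a template string.

    Illustrative sketch only: the real converter's resolution logic is
    more involved (type handling, missing-field behavior, etc.).
    """
    def lookup(match):
        key = match.group(1)
        if key.startswith("generated/"):
            # [generated/x] refers to an extraLogic output named x.
            return str(generated.get(key.split("/", 1)[1], ""))
        # Plain [x] maps to a field in the incoming data record.
        return str(record.get(key, ""))

    return re.sub(r"\[([^\]]+)\]", lookup, template_value)

record = {"nama_ibu": "Siti"}
generated = {"birth_date": "1990-01-15"}
print(resolve_placeholders("[nama_ibu]", record, generated))
print(resolve_placeholders("[generated/birth_date]", record, generated))
```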

That Patient resource is one of two resources in the simplest template in our system: 279 lines of JSON total. Our hospital templates run to 300+ resources and nearly 500KB. The postnatal care form that broke me that Tuesday night had 320 resources across about 5,000 lines.

Manual template creation took 3-4 days of wall-clock time. Sometimes a full week. And that's with the hardest intellectual work already done for me.

Before I write a single line of JSON, our health informatics team has already spent their own days (sometimes weeks) building the CSV data dictionary. They're the ones who decide that berat_badan maps to LOINC 29463-7 "Body weight," that tekanan_darah should become a blood pressure panel with code 85354-9, that this particular field is a SNOMED-coded Condition and not just a free-text Observation. That work requires clinical domain knowledge I don't have. They understand what these data points mean in a maternal health context, which FHIR resource types are clinically appropriate, which code systems are correct. I turn their decisions into working JSON. Without that CSV, I'd be guessing at clinical semantics, and guessing wrong.

So even with a carefully prepared data dictionary in hand, the template creation itself still took 3-4 days.

The first day was mapping. Read the CSV line by line. For each field (berat_badan, tekanan_darah, tinggi_fundus, denyut_jantung_janin), decide which FHIR resource type it becomes. Observation for vitals, Condition for diagnoses, Procedure for clinical actions. Pick the clinical code: LOINC 29463-7 for body weight, SNOMED 364589006 for fundal height measurement. Choose the data type handler. Numeric fields need handleNumericValueOrDataAbsentReason, which returns four outputs: the value (or null), plus the data-absent-reason system, code, and display for when the field is empty. String fields, dates, booleans, and coded values all have their own handlers.
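The four-output contract of a numeric handler can be sketched in Python. The handler name comes from our codebase, but the body here is my approximation; the data-absent-reason system URL is HL7's standard terminology:

```python
def handle_numeric_value_or_data_absent_reason(raw):
    """Return (value, absent_system, absent_code, absent_display).

    Sketch of the four-output contract: when the field parses as a
    number, the three data-absent-reason outputs are None; when it
    doesn't, the value is None and the absent-reason triple is filled.
    """
    try:
        value = float(raw)
    except (TypeError, ValueError):
        value = None
    if value is None:
        return (None,
                "http://terminology.hl7.org/CodeSystem/data-absent-reason",
                "unknown", "Unknown")
    return (value, None, None, None)
```

A downstream template then wires each of the four outputs to its own `[generated/...]` placeholder.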

Day two was wiring. Connect everything. Every clinical resource needs a subject reference pointing to the Patient, an encounter reference pointing to the Encounter, NIK (national ID) traceability in subject.identifier, and a meta.source field for audit provenance. Every numeric Observation needs addValueIfNotEmpty guards on the unit triple (valueQuantity.unit, valueQuantity.code, valueQuantity.system) because if the value is absent, emitting an empty valueQuantity object alongside a dataAbsentReason violates FHIR's XOR constraint. I'd forgotten this guard at least twice in earlier templates and spent hours debugging why the FHIR server rejected my bundles.
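The guard's contract is worth spelling out, because it's the difference between a valid bundle and an XOR-constraint rejection. A sketch (assumed semantics, inferred from how the template uses it):

```python
def add_value_if_not_empty(value, payload):
    """Emit `payload` only when `value` is present.

    Sketch of the guard described above: if the guarded value is
    absent, return None so the converter drops the dependent field
    (e.g. valueQuantity.unit) entirely, instead of emitting an empty
    valueQuantity object alongside a dataAbsentReason.
    """
    if value is None or value == "":
        return None
    return payload
```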

Day three was debugging. Run the converter against real sample data. Watch it crash. Trace the error to a missing generateUUID extraLogic, or an output array that doesn't match the function's return count, or a reference to a resource that was conditionally skipped because its keyField was empty in the sample. Fix, re-run, fix, re-run. Each cycle takes a few minutes but there are dozens of them.

If it stretched to a week, day four was review and edge cases. Check that every field in the data dictionary actually made it into the template. Verify code systems (is it http://loinc.org or http://loinc.org/? The trailing slash matters). Handle the weird cases: blood pressure stored as a single "120/80" string that needs to be parsed into two separate Observations with component coding. Baby-mother linkage where the newborn's Patient resource needs a nik-ibu identifier pointing back to the mother.
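The blood pressure split is representative of these edge cases. The real converter uses separate parseSystolicBP / parseDiastolicBP functions; this combined sketch shows the idea:

```python
def parse_bp(raw):
    """Split a combined '120/80' blood-pressure string into
    (systolic, diastolic) floats, or (None, None) if unparseable.

    Illustrative stand-in for the separate parseSystolicBP /
    parseDiastolicBP extraLogics mentioned above.
    """
    try:
        sys_part, dia_part = raw.split("/")
        return float(sys_part.strip()), float(dia_part.strip())
    except (AttributeError, ValueError):
        # None, empty string, or malformed input: let the
        # data-absent-reason machinery take over downstream.
        return None, None
```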

The worst part wasn't any single step. It was the context loss. Building a 200-resource template across multiple days means losing your mental model every night. You come back the next morning, open the half-finished JSON, and spend the first hour asking yourself: "Why did I use handleCodeableConceptOrDataAbsentReason here instead of handleValueOrDataAbsentReason? Was that intentional or a mistake?" The template is too large to hold in your head. You're constantly re-deriving conclusions you already reached yesterday.

Why an Agent, Not a Script

I didn't set out to "automate FHIR mapping with AI." I set out to make this work fit in a single focused session.

A Python script was my first thought. But template creation isn't a mechanical transformation; it requires judgment. Is letak_janin (fetal position) a coded Observation or free-text? Depends on whether we have a SNOMED mapper for the values. Should komplikasi_persalinan be a Condition resource or an Observation? Depends on whether the field represents a diagnosis or just a documentation note. These decisions vary across data sources and there's no algorithm for them.

My second thought was a one-shot LLM prompt. "Here's a CSV, here's a sample, generate a FHIR template." I tried it. The output looked plausible but failed in subtle ways: wrong function signatures for extraLogics, missing addValueIfNotEmpty guards, LOINC codes that didn't exist, references to resources that were never generated. A FHIR template is a program that runs through our converter engine, and it has to be mechanically correct, not just structurally plausible.

An agent was the right shape. FHIR mapping requires judgment (which resource type? which code?) interleaved with lookup (does this LOINC code exist? does this extraLogic function exist in our codebase?) followed by validation against real data. That loop (think, look up, try, validate, fix) is exactly the kind of work agents handle well.

My honest motivation wasn't full automation. I wanted to compress the tedious parts, choosing codes and wiring references, so my judgment could focus on the edge cases and domain-specific decisions that require understanding what the data actually means in a clinical context.

Anatomy of the Agent

The agent is built as a set of Claude Code custom slash commands: markdown files that define prompts, tool access, and autonomy rules. They chain together into a pipeline. I went through two major iterations, an original 7-step pipeline and a streamlined 3-step version I use today.

The agent receives three files as input:

First, a reference CSV, which is a data dictionary mapping field names to FHIR resource types, clinical codes, and paths. Our health informatics team builds these, and they're the foundation everything else depends on. Each row represents a clinical judgment call: this field is a vital sign Observation, that one is a Condition, this code is LOINC, that one is SNOMED. A typical row: berat_badan | Body weight in kg | Observation | valueQuantity.value | LOINC | 29463-7 | Body weight | numeric. The agent can wire JSON and pick handlers, but it can't decide whether a field is clinically meaningful or which code system is appropriate. That's what the CSV encodes.

Second, a sample JSON record, a real data row from BigQuery showing actual field values and types. This tells the agent what the data looks like in practice, not just in theory. It reveals things the CSV doesn't: that tekanan_darah is actually a combined "120/80" string, that usia_kehamilan is an integer, that some fields are null in real data.

Third, a reference template, an existing structure/*.json template for a similar data source. The agent uses this to extract project-specific patterns: how we do patient fuzzy matching, which regional helper functions exist (like getPurbalinggaVillageId), what the EpisodeOfCare strategy looks like for this program type.

Step A: Plan and Resolve

The first step is 80% deterministic Python, 20% LLM for genuinely ambiguous cases. It parses the CSV and sample data, cross-references them, resolves code systems from friendly labels to canonical URLs ("LOINC" becomes "http://loinc.org"), detects the appropriate data-absent-reason handler per field based on its type, and identifies the unit of measure from the CSV description.
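The deterministic 80% boils down to table lookups. A sketch, with an abbreviated mapping table (the entries shown are the systems and handlers mentioned in this post, not the agent's actual files):

```python
# Friendly CSV label -> canonical code system URL.
CODE_SYSTEM_URLS = {
    "LOINC": "http://loinc.org",
    "SNOMED": "http://snomed.info/sct",
    "ICD-10": "http://hl7.org/fhir/sid/icd-10",
    "CVX": "http://hl7.org/fhir/sid/cvx",
}

# Field data type -> data-absent-reason handler.
HANDLER_BY_TYPE = {
    "numeric": "handleNumericValueOrDataAbsentReason",
    "string": "handleValueOrDataAbsentReason",
}

def enrich_field(row):
    """Resolve one CSV row into the enriched form Step A emits."""
    return {
        "field": row["field"],
        "system": CODE_SYSTEM_URLS[row["system_label"]],
        "code": row["code"],
        "handler": HANDLER_BY_TYPE.get(
            row["type"], "handleValueOrDataAbsentReason"),
    }

row = {"field": "berat_badan", "system_label": "LOINC",
       "code": "29463-7", "type": "numeric"}
print(enrich_field(row))
```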

The thing that makes this step work is a defaults file, twelve pre-resolved design decisions that the agent must follow without asking me:

## Coded Fields Without an Existing Mapper

When a field has coded values but no mapper function exists:
Default: use handleValueOrDataAbsentReason, store as valueString.
Do NOT stop to ask. Do NOT invent a mapper.

## Baby / Neonate Patient Linking

When the template has a second Patient for the baby:
Default: always add mother-baby linkage without asking.

## Blood Pressure from Combined String

When tekanan_darah is a combined "120/80" string:
Default: two separate Observations (systolic + diastolic),
each with parseSystolicBP / parseDiastolicBP.

I wrote this file after running the first version of the agent and noticing it kept stopping to ask me the same twelve questions across different templates. "Should I use valueString or valueCodeableConcept for this field?" valueString, every time, unless we have a mapper. "Should I add baby-mother linking?" Yes, always. "What should period.end be when there's no discharge field?" Same as period.start. These aren't judgment calls. They're conventions. Encoding them as defaults lets the agent run with far less hand-holding.

The output is an enriched_mapping.json with every field fully typed, every code resolved, every handler assigned.

Step B: Generate

This is where the template gets written. One continuous pass, following a strict resource ordering: Patient first, then EpisodeOfCare, Encounter, Conditions, vital sign Observations, clinical assessments, numeric measurements, datetime Observations, obstetric history, boolean/coded Observations, Procedures, baby Patient with mother linking, baby Observations, QuestionnaireResponse for risk factors, and any remaining resources. Seventeen groups in total, generated without pausing.

The agent assembles each resource from validated skeleton patterns. Here's a simplified skeleton for a numeric Observation, the most common resource type in a typical template:

{
  "resourceType": "Observation",
  "keyField": "[generated/observation_datetime]",
  "extraLogics": [
    {"type": "generateUUID",
     "output": ["[obs_id]"]},
    {"type": "returnPatientReference",
     "input": ["[generated/patient_res_id]", "[nama]", "[nik]"],
     "output": ["[patient_ref]", "[patient_display]", "[patient_nik]"]},
    {"type": "handleNumericValueOrDataAbsentReason",
     "input": ["[berat_badan]"],
     "output": ["[weight_value]", "[weight_absent_system]",
               "[weight_absent_code]", "[weight_absent_display]"]},
    {"type": "addValueIfNotEmpty",
     "input": ["[generated/weight_value]", "kg"],
     "output": ["[weight_unit_code]"]}
  ],
  "requiredFields": {
    "code.coding.0.system": "http://loinc.org",
    "code.coding.0.code": "29463-7",
    "code.coding.0.display": "Body weight",
    "subject.reference": "[generated/patient_ref]",
    "subject.identifier.value": "[generated/patient_nik]"
  },
  "optionalFields": {
    "valueQuantity.value": "[generated/weight_value]",
    "valueQuantity.unit": "[generated/weight_unit_code]",
    "dataAbsentReason.coding.0.code": "[generated/weight_absent_code]"
  }
}

Notice the addValueIfNotEmpty guard on the unit field. Without it, when the numeric value is absent, the handler correctly sets dataAbsentReason, but the unit field still emits an empty valueQuantity object, violating FHIR's constraint that a resource cannot have both value[x] and dataAbsentReason. This bug bit me three times in manual templates before I documented it in failure-modes.md and baked the guard into every pattern.

The strongest autonomy rule, repeated in every step:

Never ask "ready to continue?" or "shall I proceed?". Just proceed.

Before finishing, the agent runs a coverage check, verifying that every field in the sample data is mapped somewhere in the template. Target: zero uncovered fields.
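The coverage check is simple in principle: every field name in the sample must appear as a placeholder somewhere in the template. A rough sketch of that idea (illustrative; the real check is more careful about nested and generated fields):

```python
import json
import re

def coverage_check(template, sample_record):
    """Return sample fields never referenced by any plain [placeholder]
    in the template. Target: an empty list (zero uncovered fields)."""
    template_text = json.dumps(template)
    # Match plain [field] placeholders; [generated/...] outputs are
    # intentionally excluded since they don't map to source fields.
    referenced = set(re.findall(r"\[([^\]/]+)\]", template_text))
    return sorted(set(sample_record) - referenced)
```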

Step C: Validate and Fix

The final step runs the actual converter against real sample data, then passes the output through a 7-layer validator that checks: FHIR R4 schema compliance, reference closure (every referenced resource exists in the bundle), mandatory fields per resource type, NIK traceability and meta.source format, code system terminology (LOINC/SNOMED/ICD-10/CVX existence), value consistency against the sample data, and helper/handler coverage. The first six layers flag errors; the last one flags warnings.
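The reference-closure layer, for example, can be sketched as a recursive walk over the bundle (illustrative; the real validator checks more reference shapes than this):

```python
def check_reference_closure(bundle):
    """Return urn:uuid references that point outside the bundle.

    Sketch of the reference-closure layer: every Reference.reference
    in the bundle should resolve to some entry's fullUrl.
    """
    full_urls = {e.get("fullUrl") for e in bundle.get("entry", [])}

    def walk(node, found):
        if isinstance(node, dict):
            ref = node.get("reference")
            if isinstance(ref, str):
                found.append(ref)
            for value in node.values():
                walk(value, found)
        elif isinstance(node, list):
            for item in node:
                walk(item, found)
        return found

    refs = walk(bundle, [])
    return [r for r in refs
            if r.startswith("urn:uuid:") and r not in full_urls]
```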

If errors exist, the agent reads only the error report (not the full bundle, not the full template), applies surgical fixes to the specific failing resources, and re-runs. Maximum two retry cycles. If errors persist after two passes, it stops and surfaces the remaining issues for me to fix manually.
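The fix-and-retry control flow is a bounded loop. A sketch, with the converter, validator, and fixer abstracted as callables (the real step is an agent prompt, not Python, but the control structure is the same):

```python
def validate_and_fix(run_converter, validate, apply_fixes, max_retries=2):
    """Run the converter, validate, and apply error-report-driven fixes
    at most `max_retries` times. Returns (bundle, remaining_errors);
    remaining_errors is non-empty if the cap was exhausted."""
    bundle, errors = None, []
    for attempt in range(max_retries + 1):
        bundle = run_converter()
        errors = validate(bundle)
        if not errors:
            return bundle, []
        if attempt < max_retries:
            # The agent reads only the error report, not the full
            # bundle or template, and patches the failing resources.
            apply_fixes(errors)
    return bundle, errors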

A completed run produces four things: structure/<name>.json (the production-ready template), updated settings.py and settings_local.py (registering the new data source), bundle.json (a test FHIR Bundle from the sample data), and validation_report.json (showing zero errors across all layers).

Where the Template Lives

The agent's output, a JSON template, plugs into a larger system.

BigQuery holds raw health data across 50+ tables, ingested from source systems across Purbalingga and Lombok Barat districts. The orchestrator queries for unprocessed records and farms them out in parallel batches (15 workers, batches of 25 records). For each record, the converter loads the matching template from structure/, runs the extraLogics transformation chain, maps fields to FHIR paths, resolves references between resources, and assembles a FHIR Bundle. The bundle goes to Indonesia's SatuSehat FHIR server via POST. Success and failure get logged back to BigQuery.
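The fan-out can be sketched with a thread pool (illustrative; the real orchestrator also logs success and failure back to BigQuery):

```python
from concurrent.futures import ThreadPoolExecutor

def process_in_batches(records, convert, workers=15, batch_size=25):
    """Farm records out in parallel batches, mirroring the 15-worker /
    25-record setup described above."""
    results = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for start in range(0, len(records), batch_size):
            batch = records[start:start + batch_size]
            results.extend(pool.map(convert, batch))
    return results
```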

The core of the converter, where the template becomes a FHIR resource:

import copy

def merge_resources_and_build_references(fullUrl, json, resource,
                                         references_data):
    # Optionally merge with the version already on the FHIR server.
    existing_json = {}
    if resource.get("mergeWithExistingData", False):
        if "id" in json:
            existing_json = get_existing_data_with_id(
                resource["resourceType"], json['id'])

    resource_json = merge_nested_dicts(existing_json, json)

    # Static references come from the template; dynamic ones are built
    # only when their key appears in the incoming record.
    references = copy.deepcopy(resource["references"])
    if "dynamicReferences" in resource:
        for ref in resource["dynamicReferences"]:
            if ref["key"] not in references_data:
                continue
            # ... build conditional reference

    resource_json = merge_dicts_with_list_values(
        resource_json, references)
    return {
        "fullUrl": fullUrl,
        "request": {"method": "POST",
                    "url": resource["resourceType"]},
        "resource": resource_json
    }

The agent unblocks this pipeline. Everything upstream (data extraction) and downstream (conversion, FHIR submission, reporting) is mechanical. Before the agent, adding a new data source meant I couldn't take on other work for three to four days. Now it means an afternoon.

The Payoff

3-4 days of wall-clock time down to 3-6 hours.

About 30 minutes goes to setup: extracting a sample record from BigQuery, prepping the CSV data dictionary (our health informatics team builds these, but I sometimes need to clean up column names or add missing code system labels), picking a reference template close to the new data source.

1-3 hours for the agent pipeline and review. Run the three steps, then read through the generated template. Check code selections ("did it pick the right LOINC code for fundal height?"), verify the extraLogics chains make sense ("is it using the right date parser for this data source?"), spot-check the reference graph ("do all Observations point to the right Encounter?").

Then 2-3 hours of testing and manual fixes. Run the converter against real sample data. Fix the things the agent got wrong, usually a transformation function with the wrong number of arguments, or a reference to a field name that doesn't match the sample. Re-run until the validation report comes back clean.

I still read every template. But the review is different from building from scratch. When I built manually, I was making decisions and implementing them simultaneously: the cognitive load of "what should this be?" stacked on top of "how do I express that in our template format?" Now I'm only doing the first part. The agent handles the expression.

One moment that sticks with me: the agent generated a template for a nutrition surveillance form (sigizi_balita_dipantau_pmt) with about 60 Observations covering growth monitoring, nutrition status, and supplementary feeding data. In review, I noticed it had correctly applied addValueIfNotEmpty guards on the unit fields for every single numeric Observation. All sixty of them. That's a pattern I'd missed in at least two earlier manual templates, which led to hours of debugging FHIR constraint violations when the server rejected bundles where a patient had no weight recorded but the template still emitted an empty valueQuantity alongside the dataAbsentReason.

The agent got this right because I'd documented the bug in failure-modes.md after debugging it the hard way, and baked the fix into the skeleton patterns. My past pain became the agent's default behavior.

How It Evolved

The first version of the agent had seven steps: analyze inputs, generate a base template at ~40% coverage, evaluate coverage, implement missing fields, test structural integrity, validate quality, check data consistency. Each step was its own Claude Code command with its own system prompt.

It worked. Templates came out correct. I used it for weeks, and it was a real improvement over manual work. But after running it on maybe a dozen templates, I started noticing where the friction was.

Token cost added up. Seven LLM invocations, each re-reading the template, reference files, and sample data. The pipeline consumed roughly 107,000 tokens per run, about 4,773 lines of markdown across the seven command files. The generate-then-measure-then-patch loop (steps 2, 3, 4) felt especially redundant after a while: why generate an incomplete template at 40% coverage just to measure what's missing and patch it? I had enough experience by then to know the agent could do it in one pass.

Confirmation prompts wore me down. Despite instructions not to, the agent kept pausing between resource groups. "I've generated the vital signs Observations. Ready to continue with clinical assessments?" The first few times I didn't mind. After the twentieth template, pressing enter seventeen times per run felt like exactly the kind of tedious interruption I'd built the agent to eliminate.

Small quality issues compounded. The old data consistency validator (step 7) checked whether sample values appeared in the generated bundle using Python's in operator on the serialized JSON string: value in json.dumps(bundle). The value "1" matched every UUID containing the digit 1. Not a dealbreaker, but it meant an extra fix-and-retry cycle on most runs, and after a few weeks those extra cycles added up.

None of these were really wrong. The 7-step pipeline was a huge improvement over manual work. But once you've lived with a tool long enough, you start seeing the next version. I refactored it over a weekend, less out of frustration than the natural instinct to tighten something you use every day.

The 3-step version changed three things. One-pass generation replaced the generate-measure-patch loop: Step B produces the complete template in a single continuous pass. Pre-resolved defaults reduced confirmation prompts: the defaults.md file encodes every recurring design decision, and Claude still asks about some things (tool calls, file writes, ambiguous decisions outside the defaults) but the constant back-and-forth between resource groups is gone. Typed path-aware validation replaced substring matching: the new validator does type-aware comparison (float() cast for numerics, ISO normalization for dates, case-insensitive for strings) instead of value in json.dumps(bundle).
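The difference between the old substring check and the new typed comparison is easy to show. A sketch of the type-aware comparator (my approximation of the behavior described above, not the validator's actual code):

```python
from datetime import date

def values_match(sample_value, bundle_value):
    """Type-aware comparison replacing `value in json.dumps(bundle)`."""
    # Numeric: "1" should match 1.0, not any UUID containing a "1".
    try:
        return float(sample_value) == float(bundle_value)
    except (TypeError, ValueError):
        pass
    # Date: normalize both sides to ISO before comparing.
    try:
        return (date.fromisoformat(str(sample_value))
                == date.fromisoformat(str(bundle_value)))
    except ValueError:
        pass
    # String: case-insensitive, whitespace-trimmed fallback.
    return (str(sample_value).strip().lower()
            == str(bundle_value).strip().lower())
```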

Token usage dropped by roughly 70%. Fix-and-retry cycles went from 2-3 typical to 0-1. It's not fully hands-off; I still sit with the agent, approve tool calls, and occasionally steer it when it goes down a wrong path. But the interruptions are meaningful now. Not mechanical.

When the Agent Gets It Wrong

The agent still makes mistakes. The most common: picking a transformation function that's incompatible with the actual data shape. Using handleNumericValueOrDataAbsentReason on a field that the CSV says is numeric but the sample data reveals is actually a string like "Normal" or "Tidak ada". The handler tries to cast it to a float, throws an exception, the converter crashes.

Another recurring issue is wrong argument counts in extraLogics. Our transformation functions have strict input/output contracts. getPatientIdWithFuzzyLogic takes exactly 6 inputs and returns 1 output. If the agent wires 5 inputs, the function throws an unexpected argument count error at runtime. Step C's test run always catches this, but it requires a manual fix to figure out which argument was omitted.

Sometimes the agent chooses a structure that looks right but doesn't fit how the converter actually processes the data. A field that should be a requiredField ends up in optionalFields, or vice versa, and the missing-value behavior changes in ways that only show up with certain input records. These are the subtlest bugs. They pass validation against the sample data but fail on edge cases in production.

These bugs are fast to fix because the error messages are specific and the template structure is already complete. Patching a single resource definition, not rebuilding from scratch. A 10-minute fix versus a 4-hour reconstruction.

The lesson from building two versions: the first agent you build optimizes for correctness. The second optimizes for cost and autonomy. Both matter, but you don't know which autonomy rules you need until you've run the first version enough times to feel the friction. Every line in defaults.md represents a specific moment where the agent's behavior cost me time, either by asking an unnecessary question or failing in a way I'd already seen before.

What Changed

I now support 60 template configurations across ten Indonesian health programs: hospital systems, community health centers, mobile data collection, nutrition surveillance, tuberculosis tracking, hepatitis monitoring. The templates range from a 2-resource pregnancy tracker (279 lines) to a 332-resource hospital ANC integration (490KB of JSON). Every one of them was built using the agent pipeline.

Before the agent, I had to defer new integrations. Each new data source meant committing to a multi-day block of focused work, and there was always something more urgent. A new nutrition monitoring form would sit in the backlog for weeks while I finished a hospital template. Now, when the health informatics team hands me a new data dictionary, I can have a working template the same day. That changed what integrations I was willing to take on, and how quickly new health programs could start reporting data through the national FHIR infrastructure.

My job shifted from writing thousands of lines of JSON by hand to designing the system that writes it and then reviewing the output with the kind of attention I couldn't sustain across a four-day manual effort.

The most valuable thing I built was defaults.md, twelve pre-resolved design decisions that let the agent run without constantly asking me questions. Writing that file forced me to articulate judgment calls I'd been making intuitively for months. Which EpisodeOfCare strategy for which program type. How to handle coded fields without an existing mapper. When to split blood pressure into components versus a single observation. These decisions were in my head, applied inconsistently, never written down.

But I want to be clear about something. The agent automates my part of the work: turning clinical mappings into working FHIR templates. It does not automate the clinical thinking that comes before it. Every template starts with a CSV that our health informatics team built, row by row, mapping raw field names to the right FHIR resource types and clinical codes. They're the ones who know that denyut_jantung_janin is LOINC 55283-6 and not some other fetal assessment code. They're the ones who decide when a field warrants its own Condition resource versus being an Observation. Without that upstream work, I'd have nothing to automate. The agent compressed my 3-4 days into an afternoon, but it didn't touch the clinical expertise that makes the output meaningful.

It turns out the hard part of building an AI agent isn't the AI. It's figuring out what you actually know, and writing it down clearly enough that something else can apply it. And sometimes, it's recognizing that the knowledge you depend on most isn't yours at all.
