One afternoon we ran a simple BigQuery query just to check how our FHIR converter was doing. Nothing special; we had been busy shipping new features for weeks, and someone on the team just wanted to see the numbers.
The result was worse than we thought.
| failure rate | table | failed conversions |
|---|---|---|
| 100.0% | tb_tracker_table_a | 7,753 |
| 99.4% | nutrition_app_table_a | 23,602 |
| 77.2% | hepatitis_app_table_a | 22,437 |
| 43.2% | health_service_table_c | 8,096 |
| 28.1% | health_service_table_a | 168,713 |
168,713 failed conversions on one table. Thirteen tables above 25% failure rate. We had a backlog that had been quietly growing while we were looking the other way.
If you have ever maintained a data pipeline, you know how this feels. The normal way to fix it is one table at a time. Pull some failed records, find the error, patch the template, run the tests, move on. Two or three hours per table when you know the codebase. Thirteen tables means weeks of focused work.
We had been experimenting with AI agents for this kind of maintenance work, and they were actually pretty good at it. The problem was that we could only run one agent at a time. Two agents on the same branch would step on each other almost immediately.
So we were running them one after another, which mostly defeated the point.
What finally unblocked us was a Git feature that had been sitting there for years and we had not really used: git worktree. This is the story of how we went from "we will fix this when we have time" to clearing the whole backlog in a few days.
a bit of context
Our team maintains a FHIR R4 converter for a national health platform. It pulls health data from local government systems and converts it into FHIR resources that flow into the central platform serving more than 2,000 healthcare facilities across two districts, from small community health posts to hospitals. Immunization records, maternal care, TB treatment, nutrition monitoring, community health screening. The data sources cover around 30 different table types between the two districts.
If you have not worked with FHIR before, the short version is this. FHIR R4 is the international standard for exchanging healthcare data. Every record has a strict structure, required fields, and value sets that the validator checks. A patient's gender cannot just be any string. It has to be one of the allowed codes. A date cannot be empty if the resource needs it. A coded value has to come from the right terminology system. If anything is wrong, the FHIR server rejects the resource and the record never enters the platform.
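To make the strictness concrete, here is a minimal sketch of the kind of check a validator applies to a coded field. The helper name is ours, but the allowed codes are FHIR R4's administrative-gender value set:

```python
# FHIR R4 administrative-gender value set: a patient's gender must be
# one of these codes, not an arbitrary string from the source system.
FHIR_GENDER_CODES = {"male", "female", "other", "unknown"}


def normalize_gender(raw: str) -> str:
    """Map a raw source value onto an allowed FHIR code, or raise.

    Illustrative helper, not the converter's actual API.
    """
    code = raw.strip().lower()
    if code not in FHIR_GENDER_CODES:
        raise ValueError(f"invalid administrative-gender code: {raw!r}")
    return code
```

A source system sending "Male" is fine after normalization; one sending "M" gets rejected, and that record never enters the platform.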
So when we say "conversion failure", we are not just saying the script crashed. We are saying clinical data from a local health facility never made it into the national system. Lab results, immunization shots, maternal visits. Gone, until we fix it.
Each data source has its own structure and its own quirks. One district had been getting less attention for a while. We did not ignore it on purpose. We always had to fix the most urgent problems first, and the rest kept piling up. You probably know how that goes.
So when we saw those failure numbers, the first reaction was honestly just tired. We knew this work was waiting. We just had not had the bandwidth to face it.
why one agent at a time was not enough
Before we get to worktrees, here is what the agent workflow actually looks like, because that is where the bottleneck became clear.
For every failing table, the steps are roughly:
- Pull 100 failed UUIDs from BigQuery
- Check data quality, fill rates, date format issues
- Test UUIDs one at a time to find the actual error
- Fix the template or the shared utility function
- Run all 100, make sure at least 90% pass
- Delete the failed rows from the report table so the system picks them up again
A note on the 90% threshold in step 5. We accept that the last few percent often come from genuinely bad source data that we cannot fix on our side. Records with corrupted fields, encoding issues from old systems, or data that should not have been entered in the first place. Chasing 100% on every batch means spending hours on records that are not really fixable. 90% is the threshold where we stop and move on.
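The stop condition in the fifth step fits in a few lines. A sketch, with a hypothetical helper name:

```python
def batch_passes(results: list[bool], threshold: float = 0.90) -> bool:
    """Decide whether a test batch is good enough to move on.

    `results` holds one pass/fail flag per tested UUID. We stop chasing
    the long tail once at least `threshold` of the batch converts,
    because the remainder is usually unfixable source data.
    """
    if not results:
        return False
    return sum(results) / len(results) >= threshold
```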
This is the kind of work where an agent shines. It is not creative work. It is patient detective work that follows a clear pattern.
A concrete example. One table was failing because a date field was coming in as "2024-01-15 00:00:00" instead of "2024-01-15". Our convertStringToDate function expected %Y-%m-%d and returned empty when given the datetime format. Empty date meant a required FHIR field was missing. The validator rejected the resource. Record gone.
The agent found this in about 20 minutes. The third UUID hit the error, the agent walked back to the utility function, added a fallback for the extra time part, then ran the full test batch again to confirm. The actual code change was 4 lines, with a clear explanation of what was wrong.
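The shape of that fix, sketched from the description above. The function name comes from our codebase; the exact implementation here is our illustration:

```python
from datetime import datetime


def convertStringToDate(value: str) -> str:
    """Parse a source date into the YYYY-MM-DD form FHIR expects.

    Originally only %Y-%m-%d was accepted; the fallback format handles
    datetimes like "2024-01-15 00:00:00". An empty string still means
    "could not parse", matching the old behavior.
    """
    for fmt in ("%Y-%m-%d", "%Y-%m-%d %H:%M:%S"):
        try:
            return datetime.strptime(value.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return ""
```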
Another table had a similar issue but in a different shape. A boolean field was arriving as the string "true". FHIR boolean fields expect an actual boolean primitive, so the validator rejected it with expected boolean: found "true". Same workflow. The agent found a function returning ["true"] instead of [True], fixed it, and checked that no other templates depended on the old behavior before merging.
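The boolean fix follows the same pattern. A sketch; the helper name is ours:

```python
def toFhirBoolean(value):
    """Coerce a source value into a real boolean for FHIR.

    Source systems hand us the strings "true"/"false"; FHIR boolean
    fields need an actual boolean primitive, or the validator rejects
    the resource with 'expected boolean: found "true"'.
    """
    if isinstance(value, bool):
        return value
    if isinstance(value, str):
        lowered = value.strip().lower()
        if lowered == "true":
            return True
        if lowered == "false":
            return False
    raise ValueError(f"not a recognizable boolean: {value!r}")
```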
So the per-table work was fine. The bottleneck was scale.
If we started one agent on nutrition_app_table_a and another on tb_tracker_table_a, they would conflict almost immediately. Both agents change tests/test_specific_uuids.py to set up their tests. Both might also touch shared functions in extra_logics/general.py if they find a common bug. On the same branch, in the same directory, they overwrite each other within minutes.
We tried the obvious workaround first. Separate clones of the repo, one per agent. It worked, but it was slow to set up, ate disk space, and made it annoying to share git history. Each clone had its own .git folder, its own remotes to configure, its own everything. We also forgot once to pull the latest main into one of the clones, and the agent fixed a bug that was already fixed on main. Wasted half an hour on that one.
Then we read the worktree docs properly and realized this was the tool we had needed all along.
what worktrees actually do
If you are like us and have been using Git for years without touching worktrees, here is the short version.
A git worktree lets you check out multiple branches in separate folders at the same time. Each folder has its own files and its own changes. They all share the same .git folder.
```shell
git worktree add ../converter-fix-1 -b fix/nutrition-app-a
git worktree add ../converter-fix-2 -b fix/tb-tracker-a
git worktree add ../converter-fix-3 -b fix/hepatitis-app-a
git worktree add ../converter-fix-4 -b fix/health-service-a
```
Four folders, four branches, one repository. The extra storage is small because we are not copying the whole .git folder, only the working files.
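Once a few are set up, `git worktree list` shows every checkout and the branch it holds. The paths and hashes below are illustrative:

```shell
git worktree list
# /home/dev/converter        abc1234 [main]
# /home/dev/converter-fix-1  def5678 [fix/nutrition-app-a]
# /home/dev/converter-fix-2  0a1b2c3 [fix/tb-tracker-a]
```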
We knew roughly what worktrees do. What we did not realize until recently is that this is exactly what we needed for parallel agents. Each agent gets its own folder. They never see each other's files. Agent 1 can be halfway through testing nutrition_app_table_a while Agent 2 is just starting on tb_tracker_table_a. No conflicts, no waiting for each other.
This is the kind of feature that is great to know about even if you do not need it today. The day you do need it, you will remember it is there.
how we organized the actual run
With 13 failing tables, we sorted them by how many failed conversions they had. We also checked which tables share the same template file, because fixing one template sometimes fixes several tables at once. Worth doing this step. We saved a lot of duplicate work this way.
The first batch had four worktrees running at the same time:
| worktree | table | failed conversions |
|---|---|---|
| 1 | health_service_table_a | 168,713 |
| 2 | health_service_table_b | 45,707 |
| 3 | nutrition_app_table_b | 28,977 |
| 4 | hepatitis_app_table_a | 22,437 |
Each agent got a prompt like this:
```
please create a new worktree for fixing health_service_table_a,
then run the fix-converter-failures workflow.
TABLE_NAME=health_service_table_a
CODE=health_service_code_a
BQ project=your-project
After the process is READY TO DEPLOY, merge back to the main branch,
commit and push, then delete the worktree.
```
The prompt looks short because fix-converter-failures is a workflow file we maintain in the repo. It defines the 6 steps above and the conventions the agent should follow, so the prompt itself only needs to say which table to run it on. Building this workflow file took us a few iterations, but once it was stable, kicking off a new fix became a one-line task.
The agents ran independently. While worktree 1 was checking fill rates on the patient service table, worktree 2 was already on its third UUID. When one agent found a bug in a shared utility function, it fixed the bug in its own branch, and the other agents kept going.
The first batch did not go perfectly. We hit a small problem on the second day. Two agents found the same bug in convertStringToDate at almost the same time, and both fixed it slightly differently. One added a fallback, the other rewrote the function to use dateutil. Not a real conflict, since each fix was on its own branch, but during merge we had to pick one and revert the other. We added a rule for the next batch: when an agent touches shared code, it has to flag that in the PR description so we know to check for parallel fixes.
After the first batch finished and merged, we started the next four. The whole experience felt more like reviewing work than doing it. Honestly, this took a moment to get used to. After a few rounds, the rhythm became natural. Open four terminals, kick off four agents, let them run, come back to review.
the other half: template coverage
Fixing failures was one problem. The other quality issue was something we did not even have a name for at first; we ended up calling it "silent data loss".
We have a coverage check that compares each template against its data dictionary and counts what percentage of the source fields are actually mapped to FHIR. If it is below 90%, the template needs more work.
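The check itself is conceptually just set arithmetic. A sketch; our real check also knows about nested fields and value sets:

```python
def template_coverage(dictionary_fields, mapped_fields) -> float:
    """Fraction of data-dictionary fields the template maps to FHIR.

    A field counts as covered only if it appears both in the data
    dictionary and in the template's mappings.
    """
    dictionary = set(dictionary_fields)
    if not dictionary:
        return 1.0  # nothing to map counts as fully covered
    covered = dictionary & set(mapped_fields)
    return len(covered) / len(dictionary)
```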
With 15 templates to check, we used the same approach. Four worktrees, grouped by data source type:
- worktree 1: maternal death reporting templates
- worktree 2: nutrition monitoring (three templates)
- worktree 3: community health screening (four templates)
- worktree 4: immunization and TB treatment
Agents ran coverage analysis on each template. If it passed 90%, move on. If not, add the missing fields, test, validate against the FHIR server, merge, delete the worktree.
This step found a different kind of problem than the failure fixes, and honestly the more important one. A template can have a 0% failure rate and still be missing half the fields it should be mapping. The records were converting successfully and reaching the FHIR server clean. But blood glucose, cholesterol, abdominal circumference, lab results, risk factor flags, procedure details, all of it was being silently dropped because the template never mapped those fields.
The failure rate query does not show this. The FHIR server does not show this. From every monitoring view we had, the system looked healthy. But the clinical data we were supposed to be capturing was not actually arriving.
This part was uncomfortable to look at. We had been measuring success by "no errors" without checking if we were actually capturing the data correctly. Some templates were already above 90% and needed no changes. Others were well below.
One community health screening template was correctly mapping vital signs but was missing several fields that the screening program actually collects. Blood glucose, cholesterol, abdominal circumference. All present in the source data, all listed in the data dictionary, just never connected to FHIR Observation resources. The agent added them, ran the tests, validated against the FHIR server. That kind of work would normally sit in a backlog for weeks.
A small thing we did not expect. Some data dictionaries were split into multiple CSV files. One screening app reference came in four separate files with around 1,400 rows total. The agent had to combine them before running the analysis. Worth knowing before you start.
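Combining the pieces is a small preprocessing step. A sketch; the folder layout is hypothetical, and it assumes every part shares the same header row:

```python
import csv
from pathlib import Path


def load_data_dictionary(folder: str) -> list[dict]:
    """Merge a data dictionary that ships as several CSV files.

    Returns one list of rows so the coverage analysis can treat the
    split files as a single dictionary.
    """
    rows: list[dict] = []
    for path in sorted(Path(folder).glob("*.csv")):
        with path.open(newline="", encoding="utf-8") as f:
            rows.extend(csv.DictReader(f))
    return rows
```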
We also had one false positive. The agent reported 78% coverage on a template that was actually fine. The data dictionary listed fields that were deprecated and no longer used in production. We had to manually verify the missing fields before adding code for them. Coverage numbers are useful, but the data dictionary itself can be stale, and the agent cannot know that without help.
a few things worth knowing if you try this
Sharing what tripped us up so you do not have to learn the same way.
Worktrees share your git history and remotes. A commit you make in ../converter-fix-2 will show up in git log from your main folder. git push works normally from any worktree. This is convenient once you are used to it, but it can feel weird at first.
Config files that are not committed need to exist in each worktree folder separately. Credential files, .env, anything in .gitignore, each worktree needs its own copy. We learned this the hard way when the first agent could not connect to BigQuery and we spent 15 minutes thinking the credentials were wrong. We keep a short setup note for this now.
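Our setup note boils down to a couple of commands per worktree. The file names here are examples, not necessarily what your repo ignores:

```shell
# A fresh worktree has the tracked files only; copy the untracked
# config it needs before pointing an agent at it.
git worktree add ../converter-fix-1 -b fix/nutrition-app-a
cp .env service-account.json ../converter-fix-1/
```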
Name your worktrees after what they are doing, not just numbers. ../fix-nutrition-app is much easier to work with than ../worktree-2 when you have four terminals open at the same time. Trust us on this.
Cleanup is one command:
```shell
git worktree remove ../converter-fix-1
```
The folder is removed and the worktree reference is cleaned up. The branch and its commits stay in the repository, in case you need to revisit them later.
For us the real lesson from this work was not about worktrees, and not even really about agents. It was about what we were measuring.
For months we had a dashboard that said the converter was healthy. Failure rates were low on most tables, the FHIR server was accepting resources, alerts were quiet. The dashboard was not lying. It was just answering a different question from the one that mattered for healthcare data. "Is the conversion working" is not the same as "is the clinical data arriving correctly". We were tracking the first one and assuming it answered the second.
Worktrees and agents did not solve that. They just made it cheap enough to fix once we noticed.
If you are working with agents on tasks that are similar and do not depend on each other, the parallel worktree pattern is worth trying. The setup takes a few minutes. The harder part is changing how you think about the work. From one task at a time to groups of tasks running side by side. That feels strange at first and gets natural quickly.
For us, the result was clearing months of backlog in a few days, plus the coverage findings we did not know we were missing. The agents are not magic. We still review every change carefully, especially anything that touches shared code or terminology mappings. But letting them work in parallel changed what was actually possible for our team, on a system that more than 2,000 healthcare facilities depend on.