DEV Community: Budi Widhiyanto

Fixing 168K Failed FHIR Conversions with Parallel AI Agents and Git Worktrees

Budi Widhiyanto — Fri, 17 Apr 2026 16:30:19 +0000

One afternoon we ran a simple BigQuery query just to check how our FHIR converter was doing. Nothing special. We had been busy with new features for weeks, and someone on the team just wanted to see the numbers.

The result was worse than we thought.

  rate | table                       | failed conversions
100.0% | tb_tracker_table_a          |  7,753 failures
 99.4% | nutrition_app_table_a       | 23,602 failures
 77.2% | hepatitis_app_table_a       | 22,437 failures
 43.2% | health_service_table_c      |  8,096 failures
 28.1% | health_service_table_a      | 168,713 failures

168,713 failed conversions on one table. Thirteen tables had a failure rate above 25%. We had a backlog that had been quietly growing while we were looking the other way.

If you have ever maintained a data pipeline, you know how this feels. The normal way to fix it is one table at a time. Pull some failed records, find the error, patch the template, run the tests, move on. Two or three hours per table when you know the codebase. Thirteen tables means weeks of focused work.

We had been experimenting with AI agents for this kind of maintenance work, and they were actually pretty good at it. The problem was that we could only run one agent at a time. Two agents on the same branch would step on each other almost immediately.

So we were running them one after another, which mostly defeated the point.

What finally unblocked us was a Git feature that had been sitting there for years and that we had never really used: git worktree. This is the story of how we went from "we will fix this when we have time" to clearing the whole backlog in a few days.

a bit of context

Our team maintains a FHIR R4 converter for a national health platform. It pulls health data from local government systems and converts it into FHIR resources that flow into the central platform serving more than 2,000 healthcare facilities across two districts, from small community health posts to hospitals. Immunization records, maternal care, TB treatment, nutrition monitoring, community health screening. The data sources cover around 30 different table types between the two districts.

If you have not worked with FHIR before, the short version is this. FHIR R4 is the international standard for exchanging healthcare data. Every record has a strict structure, required fields, and value sets that the validator checks. A patient's gender cannot just be any string. It has to be one of the allowed codes. A date cannot be empty if the resource needs it. A coded value has to come from the right terminology system. If anything is wrong, the FHIR server rejects the resource and the record never enters the platform.

So when we say "conversion failure", we are not just saying the script crashed. We are saying clinical data from a local health facility never made it into the national system. Lab results, immunization shots, maternal visits. Gone, until we fix it.

Each data source has its own structure and its own quirks. One district had been getting less attention for a while. We did not ignore it on purpose. We always had to fix the most urgent problems first, and the rest kept piling up. You probably know how that goes.

So when we saw those failure numbers, our first reaction was honestly just tiredness. We knew this work was waiting. We just had not had the bandwidth to face it.

why one agent at a time was not enough

Before we get to worktrees, here is what the agent workflow actually looks like, because that is where the bottleneck became clear.

For every failing table, the steps are roughly:

1. Pull 100 failed UUIDs from BigQuery
2. Check data quality, fill rates, date format issues
3. Test UUIDs one at a time to find the actual error
4. Fix the template or the shared utility function
5. Run all 100, make sure at least 90% pass
6. Delete the failed rows from the report table so the system picks them up again

A note on the 90% threshold in step 5. We accept that the last few percent often come from genuinely bad source data that we cannot fix on our side. Records with corrupted fields, encoding issues from old systems, or data that should not have been entered in the first place. Chasing 100% on every batch means spending hours on records that are not really fixable. 90% is the threshold where we stop and move on.
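The stop rule in step 5 can be sketched as a small pass-rate check. This is a minimal illustration, not our workflow's actual code: the function name and the shape of `results` (a mapping from UUID to pass or fail) are assumptions.

```python
def batch_passes(results: dict[str, bool], threshold: float = 0.90) -> bool:
    """Return True when the share of passing UUIDs meets the threshold."""
    if not results:
        return False  # an empty batch proves nothing
    pass_rate = sum(results.values()) / len(results)
    return pass_rate >= threshold


# Hypothetical batch: 92 of 100 sampled UUIDs convert cleanly.
sample = {f"uuid-{i}": i < 92 for i in range(100)}
print(batch_passes(sample))
```

Keeping the threshold as a parameter makes it easy to tighten for tables where the source data is known to be clean.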

This is the kind of work where an agent shines. It is not creative work. It is patient detective work that follows a clear pattern.

A concrete example. One table was failing because a date field was coming in as "2024-01-15 00:00:00" instead of "2024-01-15". Our convertStringToDate function expected %Y-%m-%d and returned empty when given the datetime format. Empty date meant a required FHIR field was missing. The validator rejected the resource. Record gone.

The agent found this in about 20 minutes. The third UUID hit the error, the agent walked back to the utility function, added a fallback for the extra time part, then ran the full test batch again to confirm. The actual code change was 4 lines, with a clear explanation of what was wrong.
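A fix of that shape might look like the sketch below. The real convertStringToDate is not shown in this post, so the snake_case name, the return convention (empty string on failure), and the exact list of formats are assumptions.

```python
from datetime import datetime

# Formats tried in order; the second is the fallback for values with a time part.
_DATE_FORMATS = ("%Y-%m-%d", "%Y-%m-%d %H:%M:%S")


def convert_string_to_date(value) -> str:
    """Return an ISO date (YYYY-MM-DD), or "" when no known format matches."""
    text = (value or "").strip()
    for fmt in _DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return ""


print(convert_string_to_date("2024-01-15 00:00:00"))
```

Trying an explicit list of formats keeps the behavior predictable: a value that matches none of them still comes back empty and is caught by the validator instead of being guessed at.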

Another table had a similar issue but in a different shape. A boolean field was arriving as the string "true". FHIR boolean fields expect an actual boolean primitive, so the validator rejected it with expected boolean: found "true". Same workflow. The agent found a function returning ["true"] instead of [True], fixed it, and checked that no other templates depended on the old behavior before merging.
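The idea behind that fix can be sketched as a small coercion helper. The name `to_fhir_boolean` and the choice to return None for unrecognised input are assumptions for illustration; the project's own function is not shown here.

```python
import json


def to_fhir_boolean(value):
    """Coerce "true"/"false" strings to real booleans; return None if unrecognised."""
    if isinstance(value, bool):
        return value
    if isinstance(value, str):
        text = value.strip().lower()
        if text == "true":
            return True
        if text == "false":
            return False
    return None


# A real boolean serialises as a JSON primitive, not as a quoted string.
print(json.dumps({"valueBoolean": to_fhir_boolean("true")}))
```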

So the per-table work was fine. The bottleneck was scale.

If we started one agent on nutrition_app_table_a and another on tb_tracker_table_a, they would conflict almost immediately. Both agents change tests/test_specific_uuids.py to set up their tests. Both might also touch shared functions in extra_logics/general.py if they find a common bug. On the same branch, in the same directory, they overwrite each other within minutes.

We tried the obvious workaround first. Separate clones of the repo, one per agent. It worked, but it was slow to set up, ate disk space, and made it annoying to share git history. Each clone had its own .git folder, its own remotes to configure, its own everything. We also forgot once to pull the latest main into one of the clones, and the agent fixed a bug that was already fixed on main. Wasted half an hour on that one.

Then we read the worktree docs properly and realized this was the tool we had been needing all along.

what worktrees actually do

If you are like us and have been using Git for years without touching worktrees, here is the short version.

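A git worktree lets you check out multiple branches in separate folders at the same time. Each folder has its own files and its own uncommitted changes. They all share one repository: the same objects, branches, and remotes live in the main checkout's .git folder. The one restriction is that a branch can be checked out in only one worktree at a time.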

git worktree add ../converter-fix-1 -b fix/nutrition-app-a
git worktree add ../converter-fix-2 -b fix/tb-tracker-a
git worktree add ../converter-fix-3 -b fix/hepatitis-app-a
git worktree add ../converter-fix-4 -b fix/health-service-a

Four folders, four branches, one repository. The extra storage is small because we are not copying the whole .git folder, only the working files.
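If you would rather script the setup than type the commands, a minimal sketch follows. It only wraps the `git worktree add <path> -b <branch>` form shown above; the helper names and the folder naming scheme are assumptions, and `create_worktrees` has to be run from inside the main checkout.

```python
import subprocess


def worktree_add_command(repo_name: str, table: str) -> list[str]:
    """Build the `git worktree add` command for one table's fix branch."""
    slug = table.replace("_", "-")
    path = f"../{repo_name}-fix-{slug}"
    return ["git", "worktree", "add", path, "-b", f"fix/{slug}"]


def create_worktrees(repo_name: str, tables: list[str]) -> None:
    """Create one worktree per table; raises if any git command fails."""
    for table in tables:
        subprocess.run(worktree_add_command(repo_name, table), check=True)


print(" ".join(worktree_add_command("converter", "tb_tracker_table_a")))
```

Separating command construction from execution keeps the naming rule easy to check without touching a repository.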

We knew roughly what worktrees do. What we did not realize until recently is that this is exactly what we needed for parallel agents. Each agent gets its own folder. They never see each other's files. Agent 1 can be halfway through testing nutrition_app_table_a while Agent 2 is just starting on tb_tracker_table_a. No conflicts, no waiting for each other.

This is the kind of feature that is great to know about even if you do not need it today. The day you do need it, you will remember it is there.

how we organized the actual run

With 13 failing tables, we sorted them by how many failed conversions they had. We also checked which tables share the same template file, because fixing one template sometimes fixes several tables at once. Worth doing this step. We saved a lot of duplicate work this way.
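The planning step can be sketched in a few lines. The failure counts below are the ones from our query; the template mapping (`template_of`) and the function name are hypothetical and only show how tables sharing a template end up in the same unit of work.

```python
from collections import defaultdict


def plan_batches(failures, template_of, batch_size=4):
    """Group tables by shared template, order groups by total failures, then chunk."""
    groups = defaultdict(list)
    for table in failures:
        # Tables without a known shared template form their own group.
        groups[template_of.get(table, table)].append(table)
    ordered = sorted(
        groups.values(),
        key=lambda tables: sum(failures[t] for t in tables),
        reverse=True,
    )
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]


failures = {
    "health_service_table_a": 168713,
    "health_service_table_b": 45707,
    "nutrition_app_table_b": 28977,
    "nutrition_app_table_a": 23602,
    "hepatitis_app_table_a": 22437,
}
# Hypothetical: both nutrition tables are produced from one template file.
template_of = {
    "nutrition_app_table_a": "nutrition_app",
    "nutrition_app_table_b": "nutrition_app",
}
for batch in plan_batches(failures, template_of, batch_size=2):
    print(batch)
```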

The first batch had four worktrees running at the same time:

worktree	table	failed conversions
1	health_service_table_a	168,713
2	health_service_table_b	45,707
3	nutrition_app_table_b	28,977
4	hepatitis_app_table_a	22,437

Each agent got a prompt like this:

please create a new worktree for fixing health_service_table_a,
then run the fix-converter-failures workflow.

TABLE_NAME=health_service_table_a
CODE=health_service_code_a
BQ project=your-project

After the process is READY TO DEPLOY, merge back to the main branch,
commit and push, then delete the worktree.

The prompt looks short because fix-converter-failures is a workflow file we maintain in the repo. It defines the 6 steps above and the conventions the agent should follow, so the prompt itself only needs to say which table to run it on. Building this workflow file took us a few iterations, but once it was stable, kicking off a new fix became a one-line task.

The agents ran independently. While worktree 1 was checking fill rates on health_service_table_a, worktree 2 was already on its third UUID. When one agent found a bug in a shared utility function, it fixed the bug in its own branch, and the other agents kept going.

The first batch did not go perfectly. We hit a small problem on the second day. Two agents found the same bug in convertStringToDate at almost the same time, and both fixed it slightly differently. One added a fallback, the other rewrote the function to use dateutil. Not a real conflict, since each fix was on its own branch, but during merge we had to pick one and revert the other. We added a rule for the next batch: when an agent touches shared code, it has to flag that in the PR description so we know to check for parallel fixes.
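A check for parallel fixes can also be automated before merging. The sketch below assumes you have already collected the changed files per branch, for example from `git diff --name-only main...<branch>`; the function name and the sample data are hypothetical.

```python
from collections import defaultdict


def overlapping_files(changed: dict[str, set[str]]) -> dict[str, list[str]]:
    """Map each file touched by more than one branch to the branches that touched it."""
    touched_by = defaultdict(list)
    for branch, files in changed.items():
        for path in files:
            touched_by[path].append(branch)
    return {
        path: sorted(branches)
        for path, branches in touched_by.items()
        if len(branches) > 1
    }


changed = {
    "fix/nutrition-app-a": {"extra_logics/general.py", "structure/nutrition.json"},
    "fix/tb-tracker-a": {"extra_logics/general.py"},
}
print(overlapping_files(changed))
```

Any file that shows up here is a candidate for a manual look before the second branch is merged.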

After the first batch finished and merged, we started the next four. The whole experience felt more like reviewing work than doing it. Honestly, this took a moment to get used to. After a few rounds, the rhythm became natural. Open four terminals, kick off four agents, let them run, come back to review.

the other half: template coverage

Fixing failures was one problem. The other quality issue was something we did not even have a name for at first. We were calling it "silent data loss".

We have a coverage check that compares each template against its data dictionary and counts what percentage of the source fields are actually mapped to FHIR. If it is below 90%, the template needs more work.
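In spirit, the coverage number is a set comparison. The sketch below is a simplified illustration, not our actual check: it assumes the data dictionary and the template have already been reduced to sets of source field names, and the field names are made up.

```python
def template_coverage(dictionary_fields: set[str], mapped_fields: set[str]):
    """Return (percent of dictionary fields mapped, sorted list of unmapped fields)."""
    if not dictionary_fields:
        return 100.0, []
    missing = sorted(dictionary_fields - mapped_fields)
    covered = len(dictionary_fields) - len(missing)
    return 100.0 * covered / len(dictionary_fields), missing


# Hypothetical field names for illustration only.
dictionary = {"weight", "blood_pressure", "blood_glucose", "cholesterol"}
mapped = {"weight", "blood_pressure"}
percent, missing = template_coverage(dictionary, mapped)
print(percent, missing)
```

Returning the list of unmapped fields alongside the percentage matters more than the number itself, because that list is what a person has to verify.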

With 15 templates to check, we used the same approach. Four worktrees, grouped by data source type:

worktree 1: maternal death reporting templates
worktree 2: nutrition monitoring (three templates)
worktree 3: community health screening (four templates)
worktree 4: immunization and TB treatment

Agents ran coverage analysis on each template. If it passed 90%, move on. If not, add the missing fields, test, validate against the FHIR server, merge, delete the worktree.

This step found a different kind of problem than the failure fixes, and honestly the more important one. A template can have a 0% failure rate and still be missing half the fields it should be mapping. The records were converting successfully and reaching the FHIR server clean. But blood glucose, cholesterol, abdominal circumference, lab results, risk factor flags, procedure details, all of it was being silently dropped because the template never mapped those fields.

The failure rate query does not show this. The FHIR server does not show this. From every monitoring view we had, the system looked healthy. But the clinical data we were supposed to be capturing was not actually arriving.

This part was uncomfortable to look at. We had been measuring success by "no errors" without checking if we were actually capturing the data correctly. Some templates were already above 90% and needed no changes. Others were well below.

One community health screening template was correctly mapping vital signs but was missing several fields that the screening program actually collects. Blood glucose, cholesterol, abdominal circumference. All present in the source data, all listed in the data dictionary, just never connected to FHIR Observation resources. The agent added them, ran the tests, validated against the FHIR server. That kind of work would normally sit in a backlog for weeks.

A small thing we did not expect. Some data dictionaries were split into multiple CSV files. One screening app reference came in four separate files with around 1,400 rows total. The agent had to combine them before running the analysis. Worth knowing before you start.

We also had one false positive. The agent reported 78% coverage on a template that was actually fine. The data dictionary listed fields that were deprecated and no longer used in production. We had to manually verify the missing fields before adding code for them. Coverage numbers are useful, but the data dictionary itself can be stale, and the agent cannot know that without help.

a few things worth knowing if you try this

Sharing what tripped us up so you do not have to learn the same way.

Worktrees share your git history and remotes. A commit you make in ../converter-fix-2 is visible from your main folder right away, for example with git log fix/tb-tracker-a or git log --all. git push works normally from any worktree. This is convenient once you are used to it, but it can feel weird at first.

Config files that are not committed need to exist in each worktree folder separately. Credential files, .env, anything in .gitignore, each worktree needs its own copy. We learned this the hard way when the first agent could not connect to BigQuery and we spent 15 minutes thinking the credentials were wrong. We keep a short setup note for this now.
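Our setup note boils down to copying a known list of untracked files into each new worktree. A minimal sketch, assuming the file list is maintained by hand; the function name and the example file names are hypothetical.

```python
import shutil
from pathlib import Path


def copy_local_config(main_checkout: Path, worktree: Path, names: list[str]) -> list[str]:
    """Copy untracked config files into a worktree; return the names actually copied."""
    copied = []
    for name in names:
        source = main_checkout / name
        if source.is_file():
            target = worktree / name
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(source, target)
            copied.append(name)
    return copied


# Example call (paths are placeholders):
# copy_local_config(Path("."), Path("../fix-nutrition-app"), [".env"])
```

Returning the list of copied names makes a missing credential file obvious before the agent starts.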

Name your worktrees after what they are doing, not just numbers. ../fix-nutrition-app is much easier to work with than ../worktree-2 when you have four terminals open at the same time. Trust us on this.

Cleanup is one command:

git worktree remove ../converter-fix-1

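The folder is removed and the worktree reference is cleaned up. Git only removes a clean worktree; if it still holds modified or untracked files, the command refuses unless you add --force. The branch and its commits stay in the repository, in case you need to revisit them later.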

For us the real lesson from this work was not about worktrees, and not even really about agents. It was about what we were measuring.

For months we had a dashboard that said the converter was healthy. Failure rates were low on most tables, the FHIR server was accepting resources, alerts were quiet. The dashboard was not lying. It was just answering a different question from the one that mattered for healthcare data. "Is the conversion working" is not the same as "is the clinical data arriving correctly". We were tracking the first one and assuming it answered the second.

Worktrees and agents did not solve that. They just made it cheap enough to fix once we noticed.

If you are working with agents on tasks that are similar and do not depend on each other, the parallel worktree pattern is worth trying. The setup takes a few minutes. The harder part is changing how you think about the work. From one task at a time to groups of tasks running side by side. That feels strange at first and gets natural quickly.

For us, the result was clearing months of backlog in a few days, plus the coverage findings we did not know we were missing. The agents are not magic. We still review every change carefully, especially anything that touches shared code or terminology mappings. But letting them work in parallel changed what was actually possible for our team, on a system that more than 2,000 healthcare facilities depend on.

How I Built an AI Agent That Writes FHIR Templates in Hours Instead of Days

Budi Widhiyanto — Sat, 11 Apr 2026 15:49:03 +0000

3-4 days to write one JSON template. Sometimes a full week.

It was a Tuesday night and I had the FHIR specification open in twelve browser tabs. On my left monitor, a CSV data dictionary with 150 fields from a Kobo maternal health form for postnatal care, the kind that tracks five KF checkups for the mother and four KN visits for the newborn. On my right, a half-finished JSON template already 200 lines deep. I wasn't even past the Encounter resource yet.

Each visit block needed its own set of vital signs: blood pressure (systolic and diastolic, split from a combined "120/80" string), pulse rate, respiratory rate, temperature. Each needed clinical observations like uterine fundal height, lochia status, breastfeeding assessment. Each needed the right LOINC code, the right data type handler, the right reference chain connecting it back to the Patient and Encounter it belonged to. Five mother visits times eight to twelve observations each, plus four newborn visits with their own observation sets, plus the baby Patient resource with mother-baby linkage via NIK identifiers.

I did the math. This template would land somewhere around 320 resources and several thousand lines of JSON.

I'd done this before. Many times. Four days minimum, probably five. And somewhere around day two I'd lose my mental model of the reference graph, come back the next morning staring at field mappings I no longer remembered the reasoning behind, spend an hour just re-orienting before I could write another line.

That night I closed the FHIR spec tabs and started building the agent instead.

The Old World

Some context. I'm a software engineer working on health data interoperability for district health systems in Indonesia, specifically Purbalingga and Lombok Barat. We pull data from over a dozen source systems: ePuskesmas (community health center records), Kobo Toolbox (mobile data collection forms), SiGizi (nutrition surveillance), SITB (tuberculosis registry), SIHEPI (hepatitis tracking), SIMRS (hospital information systems), and more. Each system has its own data model, its own field names, its own quirks. Our job is to convert all of it into FHIR (Fast Healthcare Interoperability Resources), the international standard for healthcare data exchange.

FHIR, if you haven't worked with it, models healthcare data as discrete "resources." A patient is a Patient resource. A doctor visit is an Encounter. A blood pressure reading is an Observation with a specific LOINC code (85354-9), split into systolic and diastolic components, each coded separately. A diagnosis is a Condition. A vaccination is an Immunization with a CVX vaccine code and dose protocol. Every resource links to others through references: an Observation points to the Patient it belongs to, the Encounter it was recorded during, the Organization that performed it.

The conversion pipeline itself isn't the hard part. We have a Python engine that reads a JSON template, maps incoming data fields to FHIR resource structures, runs transformation functions, and POSTs the result as a FHIR Bundle to Indonesia's national FHIR server (SatuSehat). The hard part is writing the template.

A template in our system is a JSON file that describes every FHIR resource the converter should produce from a given data source. Here's what even a simple resource definition looks like. This is a real Patient resource from a pregnancy tracking form:

{
  "resourceType": "Patient",
  "fullUrl": "urn:uuid:patient",
  "keyField": "[nik_ibu]",
  "mergeWithExistingData": true,
  "requiredFields": {
    "id": "[generated/patient_res_id]",
    "identifier.0.system": "https://fhir.kemkes.go.id/id/nik",
    "identifier.0.value": "[generated/nik_adjusted]",
    "name.0.text": "[nama_ibu]",
    "birthDate": "[generated/birth_date]"
  },
  "extraLogics": [
    {
      "type": "getPatientIdWithFuzzyLogic",
      "input": ["", "[generated/nik_adjusted]",
               "[nama_ibu]", "[generated/birth_date]",
               "female", ""],
      "output": ["[patient_existing_id]"]
    }
  ]
}

The [square_bracket] notation maps to incoming data fields. The [generated/...] notation refers to values produced by transformation functions, called extraLogics, that run before the FHIR resource is assembled. In this example, getPatientIdWithFuzzyLogic queries the FHIR server to find if this patient already exists (matching on national ID, name, birth date, and gender).
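To make the two notations concrete, here is a minimal sketch of how a single template value could be resolved. It is not the converter's actual code: the function name, the regular expression, and the idea of keeping extraLogics outputs in a separate `generated` dict are assumptions.

```python
import re

# "[field]" or "[generated/name]" filling the whole value.
_PLACEHOLDER = re.compile(r"^\[(generated/)?([^\[\]]+)\]$")


def resolve(value, record: dict, generated: dict):
    """Resolve one template value against the source record and extraLogics outputs."""
    if not isinstance(value, str):
        return value  # booleans and numbers are literals
    match = _PLACEHOLDER.match(value)
    if match is None:
        return value  # plain strings such as URLs are literals
    source = generated if match.group(1) else record
    return source.get(match.group(2))


record = {"nama_ibu": "Siti"}             # hypothetical incoming row
generated = {"birth_date": "1995-03-02"}  # hypothetical extraLogics output
print(resolve("[nama_ibu]", record, generated))
print(resolve("[generated/birth_date]", record, generated))
```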

That Patient resource is one of two resources in the simplest template in our system: 279 lines of JSON total. Our hospital templates run to 300+ resources and nearly 500KB. The postnatal care form that broke me that Tuesday night had 320 resources across about 5,000 lines.

Manual template creation took 3-4 days of wall-clock time. Sometimes a full week. And that's with the hardest intellectual work already done for me.

Before I write a single line of JSON, our health informatics team has already spent their own days (sometimes weeks) building the CSV data dictionary. They're the ones who decide that berat_badan maps to LOINC 29463-7 "Body weight," that tekanan_darah should become a blood pressure panel with code 85354-9, that this particular field is a SNOMED-coded Condition and not just a free-text Observation. That work requires clinical domain knowledge I don't have. They understand what these data points mean in a maternal health context, which FHIR resource types are clinically appropriate, which code systems are correct. I turn their decisions into working JSON. Without that CSV, I'd be guessing at clinical semantics, and guessing wrong.

So even with a carefully prepared data dictionary in hand, the template creation itself still took 3-4 days.

The first day was mapping. Read the CSV line by line. For each field (berat_badan, tekanan_darah, tinggi_fundus, denyut_jantung_janin) decide which FHIR resource type it becomes. Observation for vitals, Condition for diagnoses, Procedure for clinical actions. Pick the clinical code: LOINC 29463-7 for body weight, SNOMED 364589006 for fundal height measurement. Choose the data type handler. Numeric fields need handleNumericValueOrDataAbsentReason, which returns four outputs: the value (or null), plus the data-absent-reason system, code, and display for when the field is empty. String fields, dates, booleans, coded values all have their own handlers.
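The four-output contract is easier to see in code. This is a sketch of what such a handler could look like, not our implementation: the snake_case name, the choice of the "unknown" code, and the empty strings returned when a value is present are assumptions. The code system URL is the standard FHIR data-absent-reason system.

```python
import math

DATA_ABSENT_SYSTEM = "http://terminology.hl7.org/CodeSystem/data-absent-reason"


def handle_numeric_value_or_data_absent_reason(raw):
    """Return [value, absent_system, absent_code, absent_display]."""
    try:
        value = float(str(raw).strip())
    except ValueError:
        value = None
    if value is None or not math.isfinite(value):
        # No usable number: report why instead of emitting a value.
        return [None, DATA_ABSENT_SYSTEM, "unknown", "Unknown"]
    return [value, "", "", ""]


print(handle_numeric_value_or_data_absent_reason("62.5"))
print(handle_numeric_value_or_data_absent_reason(""))
```

The fixed-length return is what makes the template's output array easy to get wrong: four names have to be listed, in this order, every time.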

Day two was wiring. Connect everything. Every clinical resource needs a subject reference pointing to the Patient, an encounter reference pointing to the Encounter, NIK (national ID) traceability in subject.identifier, and a meta.source field for audit provenance. Every numeric Observation needs addValueIfNotEmpty guards on the unit triple (valueQuantity.unit, valueQuantity.code, valueQuantity.system) because if the value is absent, emitting a valueQuantity object alongside a dataAbsentReason violates FHIR's rule that an Observation may carry a value or a dataAbsentReason, but not both. I'd forgotten this guard at least twice in earlier templates and spent hours debugging why the FHIR server rejected my bundles.

Day three was debugging. Run the converter against real sample data. Watch it crash. Trace the error to a missing generateUUID extraLogic, or an output array that doesn't match the function's return count, or a reference to a resource that was conditionally skipped because its keyField was empty in the sample. Fix, re-run, fix, re-run. Each cycle takes a few minutes but there are dozens of them.

If it stretched to a week, day four was review and edge cases. Check that every field in the data dictionary actually made it into the template. Verify code systems (is it http://loinc.org or http://loinc.org/? The trailing slash matters). Handle the weird cases: blood pressure stored as a single "120/80" string that needs to be parsed into two separate Observations with component coding. Baby-mother linkage where the newborn's Patient resource needs a nik-ibu identifier pointing back to the mother.
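The combined blood pressure string is a good example of an edge case that is simple once written down. A standalone sketch, assuming the value looks like "120/80" with optional spaces; the function name and the two-to-three digit limit are my own choices, not the project's helper.

```python
import re

_BP = re.compile(r"^\s*(\d{2,3})\s*/\s*(\d{2,3})\s*$")


def parse_blood_pressure(raw):
    """Split "120/80" into (systolic, diastolic); return (None, None) if malformed."""
    text = "" if raw is None else str(raw)
    match = _BP.match(text)
    if match is None:
        return None, None
    return int(match.group(1)), int(match.group(2))


print(parse_blood_pressure("120/80"))
```

Returning a pair of None values for malformed input lets each half fall through to its data-absent-reason handling instead of crashing the conversion.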

The worst part wasn't any single step. It was the context loss. Building a 200-resource template across multiple days means losing your mental model every night. You come back the next morning, open the half-finished JSON, and spend the first hour asking yourself: "Why did I use handleCodeableConceptOrDataAbsentReason here instead of handleValueOrDataAbsentReason? Was that intentional or a mistake?" The template is too large to hold in your head. You're constantly re-deriving conclusions you already reached yesterday.

Why an Agent, Not a Script

I didn't set out to "automate FHIR mapping with AI." I set out to make this work fit in a single focused session.

A Python script was my first thought. But template creation isn't a mechanical transformation; it requires judgment. Is letak_janin (fetal position) a coded Observation or free-text? Depends on whether we have a SNOMED mapper for the values. Should komplikasi_persalinan be a Condition resource or an Observation? Depends on whether the field represents a diagnosis or just a documentation note. These decisions vary across data sources and there's no algorithm for them.

My second thought was a one-shot LLM prompt. "Here's a CSV, here's a sample, generate a FHIR template." I tried it. The output looked plausible but failed in subtle ways: wrong function signatures for extraLogics, missing addValueIfNotEmpty guards, LOINC codes that didn't exist, references to resources that were never generated. A FHIR template is a program that runs through our converter engine, and it has to be mechanically correct, not just structurally plausible.

An agent was the right shape. FHIR mapping requires judgment (which resource type? which code?) interleaved with lookup (does this LOINC code exist? does this extraLogic function exist in our codebase?) followed by validation against real data. That loop, think-look up-try-validate-fix, is exactly the kind of work agents handle well.

My honest motivation wasn't full automation. I wanted to compress the tedious parts, choosing codes and wiring references, so my judgment could focus on the edge cases and domain-specific decisions that require understanding what the data actually means in a clinical context.

Anatomy of the Agent

The agent is built as a set of Claude Code custom slash commands: markdown files that define prompts, tool access, and autonomy rules. They chain together into a pipeline. I went through two major iterations, an original 7-step pipeline and a streamlined 3-step version I use today.

The agent receives three files as input:

First, a reference CSV, which is a data dictionary mapping field names to FHIR resource types, clinical codes, and paths. Our health informatics team builds these, and they're the foundation everything else depends on. Each row represents a clinical judgment call: this field is a vital sign Observation, that one is a Condition, this code is LOINC, that one is SNOMED. A typical row: berat_badan | Body weight in kg | Observation | valueQuantity.value | LOINC | 29463-7 | Body weight | numeric. The agent can wire JSON and pick handlers, but it can't decide whether a field is clinically meaningful or which code system is appropriate. That's what the CSV encodes.

Second, a sample JSON record, a real data row from BigQuery showing actual field values and types. This tells the agent what the data looks like in practice, not just in theory. It reveals things the CSV doesn't: that tekanan_darah is actually a combined "120/80" string, that usia_kehamilan is an integer, that some fields are null in real data.

Third, a reference template, an existing structure/*.json template for a similar data source. The agent uses this to extract project-specific patterns: how we do patient fuzzy matching, which regional helper functions exist (like getPurbalinggaVillageId), what the EpisodeOfCare strategy looks like for this program type.

Step A: Plan and Resolve

The first step is 80% deterministic Python, 20% LLM for genuinely ambiguous cases. It parses the CSV and sample data, cross-references them, resolves code systems from friendly labels to canonical URLs ("LOINC" becomes "http://loinc.org"), detects the appropriate data-absent-reason handler per field based on its type, and identifies the unit of measure from the CSV description.
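The code system resolution is the most mechanical part of this step. A minimal sketch: the table and function name are assumptions, and only the LOINC entry comes from this post. The other URLs are the standard FHIR identifiers for those systems.

```python
CANONICAL_SYSTEMS = {
    "LOINC": "http://loinc.org",
    "SNOMED": "http://snomed.info/sct",
    "ICD-10": "http://hl7.org/fhir/sid/icd-10",
    "CVX": "http://hl7.org/fhir/sid/cvx",
}


def resolve_code_system(label: str) -> str:
    """Map a friendly label from the CSV to its canonical system URL."""
    key = label.strip().upper()
    if key not in CANONICAL_SYSTEMS:
        # Fail loudly: a guessed URL would pass here and be rejected later.
        raise KeyError(f"No canonical URL registered for code system label {label!r}")
    return CANONICAL_SYSTEMS[key]


print(resolve_code_system("LOINC"))
```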

The thing that makes this step work is a defaults file, twelve pre-resolved design decisions that the agent must follow without asking me:

## Coded Fields Without an Existing Mapper

When a field has coded values but no mapper function exists:
Default: use handleValueOrDataAbsentReason, store as valueString.
Do NOT stop to ask. Do NOT invent a mapper.

## Baby / Neonate Patient Linking

When the template has a second Patient for the baby:
Default: always add mother-baby linkage without asking.

## Blood Pressure from Combined String

When tekanan_darah is a combined "120/80" string:
Default: two separate Observations (systolic + diastolic),
each with parseSystolicBP / parseDiastolicBP.

I wrote this file after running the first version of the agent and noticing it kept stopping to ask me the same twelve questions across different templates. "Should I use valueString or valueCodeableConcept for this field?" valueString, every time, unless we have a mapper. "Should I add baby-mother linking?" Yes, always. "What should period.end be when there's no discharge field?" Same as period.start. These aren't judgment calls. They're conventions. Encoding them as defaults lets the agent run with far less hand-holding.

The output is an enriched_mapping.json with every field fully typed, every code resolved, every handler assigned.

Step B: Generate

This is where the template gets written. One continuous pass, following a strict resource ordering: Patient first, then EpisodeOfCare, Encounter, Conditions, vital sign Observations, clinical assessments, numeric measurements, datetime Observations, obstetric history, boolean/coded Observations, Procedures, baby Patient with mother linking, baby Observations, QuestionnaireResponse for risk factors, and any remaining resources. Seventeen groups in total, generated without pausing.

The agent assembles each resource from validated skeleton patterns. Here's a simplified skeleton for a numeric Observation, the most common resource type in a typical template:

{
  "resourceType": "Observation",
  "keyField": "[generated/observation_datetime]",
  "extraLogics": [
    {"type": "generateUUID",
     "output": ["[obs_id]"]},
    {"type": "returnPatientReference",
     "input": ["[generated/patient_res_id]", "[nama]", "[nik]"],
     "output": ["[patient_ref]", "[patient_display]", "[patient_nik]"]},
    {"type": "handleNumericValueOrDataAbsentReason",
     "input": ["[berat_badan]"],
     "output": ["[weight_value]", "[weight_absent_system]",
               "[weight_absent_code]", "[weight_absent_display]"]},
    {"type": "addValueIfNotEmpty",
     "input": ["[generated/weight_value]", "kg"],
     "output": ["[weight_unit_code]"]}
  ],
  "requiredFields": {
    "code.coding.0.system": "http://loinc.org",
    "code.coding.0.code": "29463-7",
    "code.coding.0.display": "Body weight",
    "subject.reference": "[generated/patient_ref]",
    "subject.identifier.value": "[generated/patient_nik]"
  },
  "optionalFields": {
    "valueQuantity.value": "[generated/weight_value]",
    "valueQuantity.unit": "[generated/weight_unit_code]",
    "dataAbsentReason.coding.0.code": "[generated/weight_absent_code]"
  }
}

Notice the addValueIfNotEmpty guard on the unit field. Without it, when the numeric value is absent, the handler correctly sets dataAbsentReason, but the unit field still emits a valueQuantity with no value in it, violating FHIR's constraint that an Observation cannot have both value[x] and dataAbsentReason. This bug bit me three times in manual templates before I documented it in failure-modes.md and baked the guard into every pattern.
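The effect of the guard can be shown in isolation. This sketch is not the engine's code: the function names are hypothetical, and it assumes the engine drops any field whose resolved value is empty.

```python
def add_value_if_not_empty(guard, value):
    """Return value only when the guard holds real data; otherwise None."""
    return value if guard not in (None, "") else None


def build_weight_fields(weight_value):
    """Assemble the value-related fields of a body weight Observation."""
    fields = {
        "valueQuantity.value": weight_value,
        "valueQuantity.unit": add_value_if_not_empty(weight_value, "kg"),
        "dataAbsentReason.coding.0.code": None if weight_value is not None else "unknown",
    }
    # Assumed engine behavior: fields that resolve to None are not emitted.
    return {path: v for path, v in fields.items() if v is not None}


print(build_weight_fields(62.5))
print(build_weight_fields(None))
```

With the guard in place the absent case produces only the dataAbsentReason path, so no valueQuantity is created at all.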

The strongest autonomy rule, repeated in every step:

Never ask "ready to continue?" or "shall I proceed?". Just proceed.

Before finishing, the agent runs a coverage check, verifying that every field in the sample data is mapped somewhere in the template. Target: zero uncovered fields.

Step C: Validate and Fix

The final step runs the actual converter against real sample data, then passes the output through a 7-layer validator that checks: FHIR R4 schema compliance, reference closure (every referenced resource exists in the bundle), mandatory fields per resource type, NIK traceability and meta.source format, code system terminology (LOINC/SNOMED/ICD-10/CVX existence), value consistency against the sample data, and helper/handler coverage. The first six layers flag errors; the last one flags warnings.
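Of the seven layers, reference closure is the easiest to illustrate. The sketch below is a simplified stand-in for that layer, not the validator itself: it only checks in-bundle `urn:uuid:` references against entry fullUrls, and the function names are mine.

```python
def iter_references(node):
    """Yield every string stored under a "reference" key, at any depth."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "reference" and isinstance(value, str):
                yield value
            else:
                yield from iter_references(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_references(item)


def find_dangling_references(bundle: dict) -> list[str]:
    """Return urn:uuid references that match no entry fullUrl in the bundle."""
    entries = bundle.get("entry", [])
    known = {entry.get("fullUrl") for entry in entries}
    dangling = set()
    for entry in entries:
        for ref in iter_references(entry.get("resource", {})):
            if ref.startswith("urn:uuid:") and ref not in known:
                dangling.add(ref)
    return sorted(dangling)


bundle = {"entry": [
    {"fullUrl": "urn:uuid:patient", "resource": {"resourceType": "Patient"}},
    {"fullUrl": "urn:uuid:obs-1", "resource": {
        "resourceType": "Observation",
        "subject": {"reference": "urn:uuid:patient"},
        "encounter": {"reference": "urn:uuid:encounter"},
    }},
]}
print(find_dangling_references(bundle))
```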

If errors exist, the agent reads only the error report (not the full bundle, not the full template), applies surgical fixes to the specific failing resources, and re-runs. Maximum two retry cycles. If errors persist after two passes, it stops and surfaces the remaining issues for me to fix manually.
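The bounded retry policy is easy to state as code. The sketch below only shows the control flow; `run_validation` and `apply_fixes` are hypothetical callables standing in for the converter run plus validator, and for the agent's targeted fixes.

```python
from typing import Callable, List

MAX_RETRIES = 2  # "maximum two retry cycles"


def validate_with_retries(
    run_validation: Callable[[], List[str]],
    apply_fixes: Callable[[List[str]], None],
) -> List[str]:
    """Run validation, fix, and re-run at most MAX_RETRIES times.

    Returns the errors that remain, which is an empty list on success.
    """
    errors = run_validation()
    retries = 0
    while errors and retries < MAX_RETRIES:
        # Only the error report is handed to the fixer, not the full bundle.
        apply_fixes(errors)
        errors = run_validation()
        retries += 1
    return errors
```

Anything still in the returned list after two passes is what gets surfaced for a manual fix.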

A completed run produces four things: structure/<name>.json (the production-ready template), updated settings.py and settings_local.py (registering the new data source), bundle.json (a test FHIR Bundle from the sample data), and validation_report.json (showing zero errors across all layers).

Where the Template Lives

The agent's output, a JSON template, plugs into a larger system.

BigQuery holds raw health data across 50+ tables, ingested from source systems across Purbalingga and Lombok Barat districts. The orchestrator queries for unprocessed records and farms them out in parallel batches (15 workers, batches of 25 records). For each record, the converter loads the matching template from structure/, runs the extraLogics transformation chain, maps fields to FHIR paths, resolves references between resources, and assembles a FHIR Bundle. The bundle goes to Indonesia's SatuSehat FHIR server via POST. Success and failure get logged back to BigQuery.

The core of the converter, where the template becomes a FHIR resource:

import copy


def merge_resources_and_build_references(fullUrl, json, resource,
                                         references_data):
    # Note: `json` here is the mapped resource dict, not the json module.
    existing_json = {}
    # Optionally start from data already stored for this resource id.
    if resource.get("mergeWithExistingData", False):
        if "id" in json:
            existing_json = get_existing_data_with_id(
                resource["resourceType"], json['id'])

    resource_json = merge_nested_dicts(existing_json, json)

    # Work on a deep copy so the template's own references are not mutated.
    references = copy.deepcopy(resource["references"])
    if "dynamicReferences" in resource:
        for ref in resource["dynamicReferences"]:
            if ref["key"] not in references_data:
                continue
            # ... build conditional reference

    resource_json = merge_dicts_with_list_values(
        resource_json, references)
    # Wrap the resource as a Bundle entry that is POSTed to its resource type.
    return {
        "fullUrl": fullUrl,
        "request": {"method": "POST",
                    "url": resource["resourceType"]},
        "resource": resource_json
    }

The agent unblocks this pipeline. Everything upstream (data extraction) and downstream (conversion, FHIR submission, reporting) is mechanical. Before the agent, adding a new data source meant I couldn't take on other work for three to four days. Now it means an afternoon.

The Payoff

3-4 days of wall-clock time down to 3-6 hours.

About 30 minutes goes to setup: extracting a sample record from BigQuery, prepping the CSV data dictionary (our health informatics team builds these, but I sometimes need to clean up column names or add missing code system labels), picking a reference template close to the new data source.

1-3 hours for the agent pipeline and review. Run the three steps, then read through the generated template. Check code selections ("did it pick the right LOINC code for fundal height?"), verify the extraLogics chains make sense ("is it using the right date parser for this data source?"), spot-check the reference graph ("do all Observations point to the right Encounter?").

Then 2-3 hours of testing and manual fixes. Run the converter against real sample data. Fix the things the agent got wrong, usually a transformation function with the wrong number of arguments, or a reference to a field name that doesn't match the sample. Re-run until the validation report comes back clean.

I still read every template. But the review is different from building from scratch. When I built manually, I was making decisions and implementing them simultaneously: the cognitive load of "what should this be?" stacked on top of "how do I express that in our template format?" Now I'm only doing the first part. The agent handles the expression.

One moment that sticks with me: the agent generated a template for a nutrition surveillance form (sigizi_balita_dipantau_pmt) with about 60 Observations covering growth monitoring, nutrition status, and supplementary feeding data. In review, I noticed it had correctly applied addValueIfNotEmpty guards on the unit fields for every single numeric Observation. All sixty of them. That's a pattern I'd missed in at least two earlier manual templates, which led to hours of debugging FHIR constraint violations when the server rejected bundles where a patient had no weight recorded but the template still emitted an empty valueQuantity alongside the dataAbsentReason.

The agent got this right because I'd documented the bug in failure-modes.md after debugging it the hard way, and baked the fix into the skeleton patterns. My past pain became the agent's default behavior.

How It Evolved

The first version of the agent had seven steps: analyze inputs, generate a base template at ~40% coverage, evaluate coverage, implement missing fields, test structural integrity, validate quality, check data consistency. Each step was its own Claude Code command with its own system prompt.

It worked. Templates came out correct. I used it for weeks, and it was a real improvement over manual work. But after running it on maybe a dozen templates, I started noticing where the friction was.

Token cost added up. Seven LLM invocations, each re-reading the template, reference files, and sample data. The pipeline consumed roughly 107,000 tokens per run, and the seven command files alone came to about 4,773 lines of markdown. The generate-then-measure-then-patch loop (steps 2, 3, 4) felt especially redundant after a while: why generate an incomplete template at 40% coverage just to measure what's missing and patch it? I had enough experience by then to know the agent could do it in one pass.

Confirmation prompts wore me down. Despite instructions not to, the agent kept pausing between resource groups. "I've generated the vital signs Observations. Ready to continue with clinical assessments?" The first few times I didn't mind. After the twentieth template, pressing enter seventeen times per run felt like exactly the kind of tedious interruption I'd built the agent to eliminate.

Small quality issues compounded. The old data consistency validator (step 7) checked whether sample values appeared in the generated bundle using Python's in operator on the serialized JSON string: value in json.dumps(bundle). The value "1" matched every UUID containing the digit 1. Not a dealbreaker, but it meant an extra fix-and-retry cycle on most runs, and after a few weeks those extra cycles added up.

None of these were really wrong. The 7-step pipeline was a huge improvement over manual work. But once you've lived with a tool long enough, you start seeing the next version. I refactored it over a weekend, less out of frustration than the natural instinct to tighten something you use every day.

The 3-step version changed three things. One-pass generation replaced the generate-measure-patch loop: Step B produces the complete template in a single continuous pass. Pre-resolved defaults reduced confirmation prompts: the defaults.md file encodes every recurring design decision, and Claude still asks about some things (tool calls, file writes, ambiguous decisions outside the defaults) but the constant back-and-forth between resource groups is gone. Typed path-aware validation replaced substring matching: the new validator does type-aware comparison (float() cast for numerics, ISO normalization for dates, case-insensitive for strings) instead of value in json.dumps(bundle).
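The difference between the two validators is easiest to see side by side. The following is a minimal Python sketch, not the validator itself: `substring_match` reproduces the old `value in json.dumps(bundle)` check, while `get_path` and `values_equal` are hypothetical helpers that approximate the path-aware, type-aware comparison.

```python
import json
from datetime import date
from typing import Any


def substring_match(value: str, bundle: dict) -> bool:
    """Old approach: is the value anywhere in the serialized bundle?"""
    return value in json.dumps(bundle)


def get_path(resource: Any, path: str) -> Any:
    """Follow a dotted path such as 'code.coding.0.code'; None if missing."""
    node = resource
    try:
        for part in path.split("."):
            node = node[int(part)] if isinstance(node, list) else node[part]
    except (KeyError, IndexError, ValueError, TypeError):
        return None
    return node


def values_equal(expected: str, actual: Any) -> bool:
    """Compare as numbers, then as ISO dates, then as case-insensitive text."""
    try:
        return float(expected) == float(actual)
    except (TypeError, ValueError):
        pass
    try:
        # Compare only the date part, so a timestamp still matches its date.
        return date.fromisoformat(str(expected)[:10]) == date.fromisoformat(str(actual)[:10])
    except ValueError:
        pass
    return str(expected).strip().lower() == str(actual).strip().lower()
```

Given a bundle whose id is a UUID containing the digit 1 and whose `valueQuantity.value` is 2.0, `substring_match("1", bundle)` reports a match, while `values_equal("1", get_path(bundle, "valueQuantity.value"))` does not.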

Token usage dropped by roughly 70%. Fix-and-retry cycles went from 2-3 typical to 0-1. It's not fully hands-off; I still sit with the agent, approve tool calls, and occasionally steer it when it goes down a wrong path. But the interruptions are meaningful now. Not mechanical.

When the Agent Gets It Wrong

The agent still makes mistakes. The most common is picking a transformation function that's incompatible with the actual data shape, for example using handleNumericValueOrDataAbsentReason on a field that the CSV says is numeric but the sample data reveals is actually a string like "Normal" or "Tidak ada". The handler tries to cast it to a float and throws an exception, and the converter crashes.

Another recurring issue is wrong argument counts in extraLogics. Our transformation functions have strict input/output contracts. getPatientIdWithFuzzyLogic takes exactly 6 inputs and returns 1 output. If the agent wires 5 inputs, the function throws an unexpected argument count error at runtime. Step C's test run always catches this, but it requires a manual fix to figure out which argument was omitted.
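Argument-count mistakes like this could also be caught before runtime with a small static check. The sketch below is an assumption about how such a check might look, not something the pipeline is described as having: the `CONTRACTS` table is hypothetical, with arities taken from the examples in this post (6 inputs and 1 output for getPatientIdWithFuzzyLogic, 2 inputs and 1 output for addValueIfNotEmpty).

```python
from typing import Dict, List, Tuple

# Hypothetical contract table: function name -> (input count, output count).
CONTRACTS: Dict[str, Tuple[int, int]] = {
    "getPatientIdWithFuzzyLogic": (6, 1),
    "addValueIfNotEmpty": (2, 1),
}


def check_extra_logics(extra_logics: List[dict]) -> List[str]:
    """Report extraLogics steps whose input/output counts break the contract."""
    problems = []
    for index, step in enumerate(extra_logics):
        name = step.get("type")
        contract = CONTRACTS.get(name)
        if contract is None:
            continue  # Unknown function: nothing to compare against.
        expected_in, expected_out = contract
        got_in = len(step.get("input", []))
        got_out = len(step.get("output", []))
        if got_in != expected_in:
            problems.append(
                f"step {index} ({name}): expected {expected_in} inputs, got {got_in}")
        if got_out != expected_out:
            problems.append(
                f"step {index} ({name}): expected {expected_out} outputs, got {got_out}")
    return problems
```

A check like this says that an argument is missing and in which step, though working out which one was omitted still takes a human look at the template.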

Sometimes the agent chooses a structure that looks right but doesn't fit how the converter actually processes the data. A field that belongs in requiredFields ends up in optionalFields, or vice versa, and the missing-value behavior changes in ways that only show up with certain input records. These are the subtlest bugs. They pass validation against the sample data but fail on edge cases in production.

These bugs are fast to fix because the error messages are specific and the template structure is already complete. Patching a single resource definition, not rebuilding from scratch. A 10-minute fix versus a 4-hour reconstruction.

The lesson from building two versions: the first agent you build optimizes for correctness. The second optimizes for cost and autonomy. Both matter, but you don't know which autonomy rules you need until you've run the first version enough times to feel the friction. Every line in defaults.md represents a specific moment where the agent's behavior cost me time, either by asking an unnecessary question or failing in a way I'd already seen before.

What Changed

I now support 60 template configurations across ten Indonesian health programs: hospital systems, community health centers, mobile data collection, nutrition surveillance, tuberculosis tracking, hepatitis monitoring. The templates range from a 2-resource pregnancy tracker (279 lines) to a 332-resource hospital ANC integration (490KB of JSON). Every one of them was built using the agent pipeline.

Before the agent, I had to defer new integrations. Each new data source meant committing to a multi-day block of focused work, and there was always something more urgent. A new nutrition monitoring form would sit in the backlog for weeks while I finished a hospital template. Now, when the health informatics team hands me a new data dictionary, I can have a working template the same day. That changed what integrations I was willing to take on, and how quickly new health programs could start reporting data through the national FHIR infrastructure.

My job shifted from writing thousands of lines of JSON by hand to designing the system that writes it and then reviewing the output with the kind of attention I couldn't sustain across a four-day manual effort.

The most valuable thing I built was defaults.md, twelve pre-resolved design decisions that let the agent run without constantly asking me questions. Writing that file forced me to articulate judgment calls I'd been making intuitively for months. Which EpisodeOfCare strategy for which program type. How to handle coded fields without an existing mapper. When to split blood pressure into components versus a single observation. These decisions were in my head, applied inconsistently, never written down.

But I want to be clear about something. The agent automates my part of the work: turning clinical mappings into working FHIR templates. It does not automate the clinical thinking that comes before it. Every template starts with a CSV that our health informatics team built, row by row, mapping raw field names to the right FHIR resource types and clinical codes. They're the ones who know that denyut_jantung_janin is LOINC 55283-6 and not some other fetal assessment code. They're the ones who decide when a field warrants its own Condition resource versus being an Observation. Without that upstream work, I'd have nothing to automate. The agent compressed my 3-4 days into an afternoon, but it didn't touch the clinical expertise that makes the output meaningful.

It turns out the hard part of building an AI agent isn't the AI. It's figuring out what you actually know, and writing it down clearly enough that something else can apply it. And sometimes, it's recognizing that the knowledge you depend on most isn't yours at all.

National Vaccine Appointment & Administration System

Budi Widhiyanto — Sat, 28 Feb 2026 09:30:45 +0000

🌱 How It Started

A few years ago, I had a system design interview. The interviewer gave me this scenario:

"Design a national vaccine appointment booking system. Millions of citizens need to register and book slots. Clinics must administer the doses. The government needs audit logs and fraud prevention."

My first thought was simple: just let people book a slot, check the stock, and confirm. I drew a basic flow on the whiteboard and felt pretty good about it. Then the interviewer started asking harder questions.

"What if two people try to book the last slot at the same time?"

"What if the clinic runs out of doses after the booking is already confirmed?"

"How do you undo things if the eligibility check fails in the middle?"

I didn't have good answers. I had only designed for the happy path.

That interview stuck in my mind. Months later, I was doing research on inventory reservation patterns for an internet credit purchase system, and I realized the same ideas could have helped me in that interview. So I went back to the problem and redesigned it. This is what I came up with.

⚡ My Initial (Naïve) Solution

Here's what I proposed during the interview:

Simple, right? But the problems come fast:

Race conditions: Two people click "Book" at the same time for the last slot. Both get confirmed. Now one citizen has no seat.
Stock mismatch: Slot is confirmed, but the clinic ran out of vaccine doses between booking day and appointment day.
Late eligibility failure: System confirms appointment first, then finds out the citizen doesn't meet age or insurance requirement. Now you need to undo everything, but stock is already allocated.
No rollback: If something fails in the middle, there's no way to release the slot or dose back to the pool.

These are the same problems I found later when designing the internet credit purchase system: the happy path is not enough when you deal with limited resources and many users at the same time.

🔍 Rethinking the Flow

The main idea, which I learned from inventory reservation strategies in e-commerce, is: don't confirm anything until everything is verified. Use a multi-stage process: temporary hold first, then verify, then confirm. If anything fails, roll back.

It's like buying concert tickets. When you select a seat, it's held for you while you pay. If you don't finish in time, the seat goes back. Same concept here.

Here's the full flow of the improved design:

🧩 The Improved Design

1. Reserve First (Temporary Hold)

When a citizen selects a clinic, time slot, and vaccine type, the system does not confirm right away. Instead:

It creates a temporary reservation in Redis with a TTL (time-to-live), for example 5 minutes.
Appointment status is set to PENDING.
Slot capacity and vaccine dose count are decreased temporarily, so other users will see less availability.

Why Redis? Because we need something fast and temporary. A relational database could work too, but you would need a separate scheduled job to clean up expired reservations. Redis handles this automatically with TTL: when 5 minutes pass, the key just disappears. For a system that handles millions of bookings during a national vaccine campaign, this performance difference is important.

How do we handle race conditions in Redis? We use the Redis DECR command on the slot counter. This is atomic, meaning that if two requests come at the same time, Redis processes them one by one. If the counter reaches zero, the next request is rejected. For extra safety, you can use a Lua script to make the check-and-decrement happen in one step.
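The hold-with-expiry idea can be sketched without Redis at all. The Python class below is an in-memory stand-in, not Redis code: a lock plays the role of Redis's one-command-at-a-time execution, and the check and the decrement happen together, which is what a Lua script would guarantee. The class and method names are hypothetical. One detail the sketch makes explicit: an expiring hold and a decremented counter are two separate pieces of state, so something has to give the slot back when a hold expires.

```python
import threading
import time
from typing import Dict, Optional


class SlotReservations:
    """In-memory model of temporary slot holds with a TTL (not Redis itself)."""

    def __init__(self, capacity: int, ttl_seconds: float = 300.0):
        self._available = capacity
        self._ttl = ttl_seconds
        self._holds: Dict[str, float] = {}  # citizen_id -> expiry time
        self._lock = threading.Lock()

    def _release_expired(self, now: float) -> None:
        # Expired holds return their slot to the pool.
        expired = [cid for cid, expires in self._holds.items() if expires <= now]
        for cid in expired:
            del self._holds[cid]
            self._available += 1

    def reserve(self, citizen_id: str, now: Optional[float] = None) -> bool:
        """Atomically check availability and take one slot; False if none left."""
        now = time.monotonic() if now is None else now
        with self._lock:
            self._release_expired(now)
            if citizen_id in self._holds or self._available <= 0:
                return False
            self._available -= 1
            self._holds[citizen_id] = now + self._ttl
            return True

    def release(self, citizen_id: str) -> None:
        """Give the slot back, e.g. when an eligibility check fails."""
        with self._lock:
            if self._holds.pop(citizen_id, None) is not None:
                self._available += 1
```

If twenty threads call `reserve` on a slot with capacity 1, exactly one of them gets `True`; once that hold's TTL has passed, the next caller can take the slot.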

2. Eligibility Verification

While the slot is held, the system runs eligibility checks:

Age requirement (e.g., some vaccines only for 60+).
Insurance verification through external API.
Medical history (allergies, previous doses).
Geographic check (is this citizen in the right region?).

If any check fails, the reservation is released: the Redis key is deleted and the slot goes back to the pool. The citizen gets a clear message explaining why they are not eligible, not just "something went wrong."

3. Confirm Appointment

If all checks pass:

Slot capacity and vaccine stock are decreased permanently in the main database.
Appointment status changes from PENDING to CONFIRMED.
Redis reservation is cleared (not needed anymore).
Confirmation is sent to the citizen (SMS, email, or push notification).

This is the point of no return. Before this step, everything can be undone.

4. Administration (Vaccination Day)

When the citizen arrives at the clinic:

Clinic staff scans the citizen's QR code. The QR code contains the appointment ID and a verification hash. The hash is generated on the server using appointment ID + citizen ID + a secret key, so it cannot be faked.
System verifies the QR code against the appointment record.
Staff records the vaccine batch number and time of administration.
Appointment status changes to ADMINISTERED.
An event is sent to other systems: analytics, government reporting, audit logs.
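One way to build the verification hash from the first step is a keyed HMAC. The sketch below uses HMAC-SHA256 from Python's standard library; that choice, the `:` separator, and the function names are my assumptions for illustration, since the design above only says the hash combines appointment ID, citizen ID, and a secret key.

```python
import hashlib
import hmac


def sign_appointment(appointment_id: str, citizen_id: str, secret_key: bytes) -> str:
    """Return a hex HMAC-SHA256 over the appointment and citizen IDs."""
    message = f"{appointment_id}:{citizen_id}".encode("utf-8")
    return hmac.new(secret_key, message, hashlib.sha256).hexdigest()


def verify_qr(appointment_id: str, citizen_id: str,
              presented_hash: str, secret_key: bytes) -> bool:
    """Recompute the hash server-side and compare in constant time."""
    expected = sign_appointment(appointment_id, citizen_id, secret_key)
    return hmac.compare_digest(expected.encode("utf-8"),
                               presented_hash.encode("utf-8"))
```

The secret key stays on the server; the QR code carries only the appointment ID and the hash, so a forged code fails verification unless the key leaks.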

5. Failure & Rollback Scenarios

This is the part I completely missed in my interview. Here's how each failure is handled:

No-show: A scheduled job checks for CONFIRMED appointments that passed their time window. Status becomes NO_SHOW, stock is released back.
Citizen cancels: They can cancel through the portal. Stock is released right away.
Clinic cancels a slot (e.g., not enough staff): All affected appointments are flagged. Citizens get notified and can rebook with priority.
External API is down (e.g., insurance service): The system uses a circuit breaker pattern. After several failures in a row, the system stops calling that API temporarily. Meanwhile, the booking is either queued for retry (with increasing wait time between retries) or allowed provisionally with a flag for manual review later. The important thing is: one broken dependency should not block the whole flow.
Redis goes down: The system falls back to database-level reservations with a cleanup job. It's slower, but the booking still works.
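The circuit breaker mentioned for the insurance API can be sketched as a small wrapper. This is a minimal illustration with hypothetical names and thresholds, not a production implementation: it counts consecutive failures, refuses calls while open, and lets one trial call through after a cool-down.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is skipped."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3,
                 reset_after_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self._threshold = failure_threshold
        self._reset_after = reset_after_seconds
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, func: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_after:
                raise CircuitOpenError("dependency temporarily disabled")
            # Cool-down elapsed: allow one trial call (half-open state).
            self._opened_at = None
            self._failures = self._threshold - 1
        try:
            result = func()
        except Exception:
            self._failures += 1
            if self._failures >= self._threshold:
                self._opened_at = self._clock()
            raise
        self._failures = 0  # Any success closes the circuit again.
        return result
```

A caller that catches `CircuitOpenError` can then choose between the two options described above: queue the booking for retry, or allow it provisionally with a flag for manual review.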

🏗️ System Components

Here's the high-level architecture:

Frontend: Booking portal for citizens + Dashboard for clinic staff.
API Gateway: Authentication, rate limiting (very important during mass booking), and routing.
Core Services:
- Auth Service: Login, national ID verification.
- Patient Service: Medical records, vaccination history.
- Clinic Service: Slot management, staff schedules, capacity.
- Inventory Service: Vaccine stock per clinic, batch tracking.
- Appointment Service: The main service. Manages reservations, confirmations, and status changes.
- Eligibility Service: Rules engine + external API calls.
- Notification Service: SMS, email, push. Retries if delivery fails.
- Audit Service: Append-only logs for every status change. Required for government compliance.
Data Layer: PostgreSQL for permanent data, Redis for temporary reservations and caching.
Async Messaging: Kafka for events: AppointmentReserved, AppointmentConfirmed, AppointmentAdministered, AppointmentCancelled. This keeps services separated and makes the system auditable by default.

🎯 What I Would Do Differently Now

Looking back at that interview, the biggest thing I missed was not about technology; it was about mindset. I jumped to the happy path because it felt complete. But the interviewer was not testing if I can design a booking form. They were testing if I can think about what happens when things go wrong.

Here's what I learned from this experience:

Start with failure scenarios, not the happy path. Ask yourself "what can go wrong at each step?" before finalizing any design.
Temporary reservation is a pattern, not a hack. Whether it's concert tickets, flash sales, or vaccine slots, if you have limited stock and many users, you need a hold-then-confirm flow.
Don't be vague about rollbacks. "We'll handle errors" is not a design. Be specific about what happens to the data, the stock, and the user when something fails.
External services will go down. Always have a plan for when the insurance API or notification service is not available. Circuit breakers and retry queues are not optional; they are necessary.

If you're preparing for system design interviews, I recommend studying inventory reservation patterns. My earlier post on designing an internet credit purchase system covers these patterns with more detail and code examples. The core idea (reserve first, verify, then commit) appears in many systems once you start looking.

Thanks for reading. If you faced similar interview questions or have ideas to improve this design, I would like to hear about it in the comments.

Data Fetching Patterns Every Developer Should Know (And When to Actually Use Them)

Budi Widhiyanto — Sat, 28 Feb 2026 08:56:54 +0000

About a year ago, I was working on a payment app. Solid architecture, clean API design, decent frontend. On paper, everything looked good. But a few months after launch, the ratings started tanking. Users were complaining about slow loads, failed transactions, and the whole thing falling apart on spotty connections.

I spent three months debugging those performance issues, and the fix wasn't some clever algorithm or a server upgrade. It was rethinking how we fetched data. That's it. Same features, same infrastructure, same design, just smarter data fetching patterns. The app went from 3.2 stars to 4.7, and transaction volume jumped 30% within two months.

That experience a year ago changed how I think about data flow end-to-end. Most apps don't have a "feature" problem; they have a "how we get data to the screen" problem. And the difference between a mediocre app and a great one often comes down to picking the right data fetching pattern for the right situation.

Here's everything I learned and wish I'd known sooner.

The Basics: Request-Response

This is where everyone starts, and for good reason. You ask the server for something, you wait, you get it back. It's the foundation of HTTP, and it handles the majority of use cases just fine.

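const fetchUser = async (id: string) => {
  const response = await fetch(`/api/users/${id}`);
  // fetch only rejects on network errors, so check the HTTP status explicitly
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }
  return response.json();
};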

Think of it like ordering at a counter: you place your order, you wait, you get your food. Simple and predictable.

This works great for standard CRUD operations: loading a user profile, submitting a form, fetching account details on page load. Where it falls apart is when you start chaining multiple requests together. If your page needs data from five endpoints and each one takes 300ms, your user is staring at a spinner for 1.5 seconds. That adds up fast.

The key is recognizing when request-response stops being enough, which brings us to everything else.

Polling: The "Are We There Yet?" Approach

Polling is exactly what it sounds like. You ask the server for updates on a regular interval. Every 5 seconds, every 30 seconds, whatever makes sense for your use case.

const pollForUpdates = () => {
  const intervalId = setInterval(async () => {
    try {
      const response = await fetch('/api/updates');
      const data = await response.json();
      updateUI(data);
    } catch (error) {
      console.error('Polling failed:', error);
    }
  }, 5000);

  // Don't forget cleanup
  return () => clearInterval(intervalId);
};

I've seen polling get a bad reputation, and honestly, sometimes it deserves it. Naive polling hammers your server with requests even when nothing has changed. On mobile, it eats battery life. And you'll always have that gap between intervals where updates show up late.

But here's the thing: polling is dead simple to implement, works everywhere, and for many use cases (dashboards refreshing every 30 seconds, checking job status on a build pipeline, order tracking) it's perfectly fine. Not everything needs to be real-time. Sometimes "close enough" is the right engineering decision.

The smarter version is long polling, where the server holds the connection open until it actually has something to send back. It's a nice middle ground before committing to WebSockets.

WebSockets: When You Need Actual Real-Time

WebSockets maintain a persistent, two-way connection between the client and server. Unlike polling, neither side has to ask: data flows both directions whenever either side has something to say.

const socket = new WebSocket('wss://example.com/socket');

socket.onopen = () => {
  console.log('Connected');
};

socket.onmessage = (event) => {
  const data = JSON.parse(event.data);
  updateUI(data);
};

socket.onclose = () => {
  // You'll want reconnection logic here; connections drop
  console.log('Disconnected');
};

This is what powers chat apps, multiplayer games, collaborative editors like Google Docs, and trading platforms where milliseconds matter. If your users need to see changes the moment they happen, and especially if they need to send data back frequently, WebSockets are the right call.

The tradeoff is complexity. You need to handle reconnections (connections will drop). You need to think about scaling, because every connected user holds an open connection on your server. You need to deal with authentication differently than with regular HTTP. It's not hard, but it's more surface area than a simple fetch call.
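For the reconnection part, a common approach is exponential backoff with jitter, so that clients don't all retry at the same instant after an outage. The idea is language-agnostic; the sketch below shows only the delay calculation in Python, and the function name and defaults are assumptions rather than anything from the app described here.

```python
import random
from typing import List, Optional


def reconnect_delays(attempts: int, base_seconds: float = 1.0,
                     cap_seconds: float = 30.0,
                     rng: Optional[random.Random] = None) -> List[float]:
    """Return one randomized wait time per reconnection attempt."""
    if rng is None:
        rng = random.Random()
    delays = []
    for attempt in range(attempts):
        # The ceiling doubles each attempt: 1s, 2s, 4s, ... up to the cap.
        ceiling = min(cap_seconds, base_seconds * (2 ** attempt))
        # "Full jitter": wait a random time between 0 and the ceiling.
        delays.append(rng.uniform(0, ceiling))
    return delays
```

A client would wait for the next delay in the list before each reconnection attempt and start again from the first delay once a connection succeeds.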

My rule of thumb: if you're polling more than once every 5 seconds, it's probably time to consider WebSockets.

Server-Sent Events: Real-Time's Simpler Cousin

SSE is the pattern I wish more developers knew about. It's a one-way channel: the server pushes updates to the client over a long-lived HTTP connection. No polling, no WebSocket complexity.

const eventSource = new EventSource('/api/stream');

eventSource.onmessage = (event) => {
  const data = JSON.parse(event.data);
  updateUI(data);
};

eventSource.onerror = (error) => {
  // The browser reconnects on its own; calling close() here would stop that
  console.error('Stream error:', error);
};

See how much simpler that is compared to WebSockets? And you get automatic reconnection for free: the browser handles it.

SSE is perfect for notifications, live sports scores, progress bars for long-running tasks (think file processing or deployment pipelines), newsfeeds, and anything where the server is doing the talking and the client is just listening.

The limitation is right there in the name: server-sent. If your client needs to send data back frequently, SSE isn't enough. But for a surprising number of "real-time" features, one-way is all you need.

Caching: Making Your App Feel Instant

Caching is less of a fetching pattern and more of a fetching strategy that you layer on top of other patterns. The idea is simple: store data you've already fetched so you don't have to fetch it again.

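// react-query v3 API; newer releases ship as @tanstack/react-query
import { useQuery } from 'react-query';

const { data, isLoading, error } = useQuery(
  'user',
  () => fetch('/api/user').then((res) => res.json()),
  {
    staleTime: 5 * 60 * 1000, // treat data as fresh for 5 minutes
    cacheTime: 30 * 60 * 1000, // keep unused data cached for 30 minutes
  }
);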

Libraries like React Query and SWR have made caching dramatically easier. They handle stale-while-revalidate (show cached data immediately, then refresh in the background), cache invalidation, and deduplication of simultaneous requests.

The impact is hard to overstate. When a user navigates to a page they've already visited and the content appears immediately while a background refresh happens silently, that's the kind of thing that makes an app feel native-quality.

The classic challenge is cache invalidation (there's a reason Phil Karlton called it one of the two hard things in computer science). You have to decide: how long is cached data acceptable? What events should invalidate the cache? What happens when two tabs have different cached versions? These are solvable problems, but they require deliberate thinking.

Lazy Loading: Don't Fetch What You Don't Need Yet

The fastest network request is the one you never make. Lazy loading defers fetching until the user actually needs the data, typically triggered by scrolling, clicking a tab, or navigating to a new section.

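let isLoading = false;

const loadMore = async () => {
  // Skip if a page is already in flight; scroll events fire many times
  if (isLoading || !isNearBottom()) return;
  isLoading = true;
  try {
    const response = await fetch(`/api/items?page=${nextPage}`);
    const newItems = await response.json();
    setItems(prev => [...prev, ...newItems]);
    setNextPage(prev => prev + 1);
  } finally {
    isLoading = false;
  }
};

window.addEventListener('scroll', loadMore);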

You see this everywhere: infinite scroll on social feeds, images loading as you scroll past them, tabs that only fetch their content when clicked. It makes initial page loads fast because you're only loading what's visible.

The gotchas are UX-related. Infinite scroll can make it impossible for users to reach the footer. Loading new content can cause layout shifts that make users lose their place. And for accessibility, you need to make sure screen readers can navigate lazy-loaded content properly.

For very large lists (thousands of items), pair lazy loading with virtualization: only render the DOM elements that are visible in the viewport. Libraries like react-window or TanStack Virtual make this manageable.

Background Sync: Building for the Real World

This one is close to my heart because it solved the biggest pain point in that payment app I worked on last year. Background sync lets users take actions (send a message, submit a form, record a transaction) even when they're offline. The operations get queued and processed automatically when connectivity returns.

// Service Worker
self.addEventListener('sync', (event) => {
  if (event.tag === 'sync-transactions') {
    event.waitUntil(processQueuedTransactions());
  }
});

// Application code
async function recordTransaction(transaction) {
  await saveToLocalQueue(transaction);

  // Show the transaction in the UI immediately
  updateUIOptimistically(transaction);

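  // Background Sync is not available in every browser, so feature-detect it
  if ('serviceWorker' in navigator && 'SyncManager' in window) {
    const registration = await navigator.serviceWorker.ready;
    await registration.sync.register('sync-transactions');
  }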
}

This pattern is essential for mobile apps used in areas with unreliable connections: field service apps, delivery tracking, healthcare in rural areas, anything where you can't assume a stable connection.

The complexity lives in conflict resolution. What happens if two offline users edit the same record? What if the server rejects a queued operation? You need clear strategies for these cases, and they're not always straightforward. But the user experience improvement is massive. Going from "you can't do anything without internet" to "everything just works, and syncs when it can" is a night-and-day difference.

Batch Fetching: One Trip Instead of Ten

If your page makes 8 separate API calls to render, something is probably wrong. Batch fetching combines multiple requests into a single network call.

// Instead of this:
const user = await fetch('/api/users/1');
const posts = await fetch('/api/users/1/posts');
const notifications = await fetch('/api/users/1/notifications');

// Do this:
const dashboard = await fetch('/api/dashboard?userId=1');
// Returns user, posts, and notifications in one response

The savings come from reducing HTTP overhead: connection setup, headers, TLS handshakes. On mobile networks with high latency, the difference between one request and ten is very noticeable.

The downside is coupling. When you batch things together, you can't cache or invalidate them independently. If the notification data changes every 30 seconds but user profile data changes once a month, batching them means either over-fetching profile data or under-fetching notifications. You have to think about which data actually belongs together.

GraphQL: Ask for Exactly What You Need

GraphQL flips the traditional REST model. Instead of the server deciding what data each endpoint returns, the client specifies exactly what it needs.

const query = `
  query GetUser($id: ID!) {
    user(id: $id) {
      name
      email
      posts(last: 5) {
        title
        preview
      }
    }
  }
`;

const response = await fetch('/graphql', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ query, variables: { id: '123' } }),
});

With REST, a mobile app and a desktop app hitting the same /api/user endpoint get the same response, even if the mobile app only needs the name and avatar while the desktop app needs the full profile. GraphQL eliminates that mismatch. Each client asks for exactly what it needs.

This matters most when you have multiple clients with different data requirements, deeply nested data relationships, or you're tired of creating one-off REST endpoints for every new screen design.

The investment is real, though. You need a GraphQL server, a schema, resolvers, and your team needs to learn a new paradigm. Caching is trickier than REST because everything goes through a single endpoint. And poorly written queries can cause serious performance issues on the backend (the N+1 query problem is very real with GraphQL).
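To make the N+1 problem concrete, here's a small simulation in Python. It counts "queries" for a naive resolver that looks up posts once per user, compared with a batched lookup that fetches posts for all users in one go. The data and function names are made up for illustration, and the in-memory list stands in for a real database.

```python
from collections import defaultdict

POSTS = [
    {"id": 1, "user_id": "a", "title": "first"},
    {"id": 2, "user_id": "a", "title": "second"},
    {"id": 3, "user_id": "b", "title": "third"},
]
query_count = 0


def fetch_posts_for_user(user_id):
    """Simulates one database query per user (the N in N+1)."""
    global query_count
    query_count += 1
    return [p for p in POSTS if p["user_id"] == user_id]


def fetch_posts_for_users(user_ids):
    """Simulates a single query such as WHERE user_id IN (...), grouped by user."""
    global query_count
    query_count += 1
    wanted = set(user_ids)
    grouped = defaultdict(list)
    for post in POSTS:
        if post["user_id"] in wanted:
            grouped[post["user_id"]].append(post)
    return {uid: grouped[uid] for uid in user_ids}


user_ids = ["a", "b", "c"]

query_count = 0
naive = {uid: fetch_posts_for_user(uid) for uid in user_ids}
naive_queries = query_count

query_count = 0
batched = fetch_posts_for_users(user_ids)
batched_queries = query_count

print(f"naive: {naive_queries} queries, batched: {batched_queries} query")
```

Both approaches return the same data, but the naive one grows with the number of users on the page. Batching the child lookups per request is the usual way GraphQL servers keep this under control.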

For smaller apps with a single client, REST with good API design is usually simpler and sufficient.

Federated Fetching: Unifying Microservices

In microservice architectures, the data a single page needs might live across five different services. Federated fetching, usually through a BFF (Backend-For-Frontend) layer or API gateway, aggregates that data so the client makes one clean request.

// BFF endpoint
app.get('/api/dashboard/:userId', async (req, res) => {
  const [user, accounts, activity] = await Promise.all([
    fetch(`http://user-service/users/${req.params.userId}`),
    fetch(`http://account-service/accounts?userId=${req.params.userId}`),
    fetch(`http://activity-service/recent?userId=${req.params.userId}`)
  ]);

  res.json({
    user: await user.json(),
    accounts: await accounts.json(),
    recentActivity: (await activity.json()).slice(0, 5)
  });
});

The BFF pattern is a lifesaver in complex systems. Instead of the frontend knowing about every microservice and making separate calls to each, it talks to one unified API that handles the orchestration. The frontend stays clean, and you can tailor responses to what each client actually needs.

The downside is obvious: you're adding another service to build, deploy, and maintain. And if your BFF goes down, everything goes down. It's a pattern that makes sense at a certain scale, but it's overkill for smaller applications.

Combining Patterns: Where It Gets Interesting

No real application uses just one pattern. The interesting decisions happen when you combine them. Here's what that looks like in practice:

A messaging app might use WebSockets for incoming messages, background sync for sending messages in poor connectivity, caching for conversation history, and lazy loading for scrolling through older messages.

An e-commerce app might use request-response for search, caching for product pages, SSE for inventory availability, and batch fetching for the cart summary.

A trading platform might use WebSockets for live prices, polling as a fallback, GraphQL for portfolio data, and caching for historical charts.

The point is to match each data need to the pattern that best serves it. Not every piece of data on a screen has the same freshness requirements, the same access patterns, or the same tolerance for latency.

Quick Reference

Pattern	Use When	Complexity	Offline Support
Request-Response	Standard CRUD, simple pages	Low	No
Polling	Periodic updates, status checks	Low	No
WebSockets	Two-way real-time (chat, collaboration)	High	No
Server-Sent Events	One-way real-time (notifications, feeds)	Medium	No
Caching	Repeated data access, speed matters	Medium	Partial
Lazy Loading	Large lists, heavy initial loads	Low	No
Background Sync	Offline-first, unreliable connections	High	Yes
Batch Fetching	Multiple related data needs	Low	No
GraphQL	Complex/varied data requirements	High	No
Federated Fetching	Microservices, unified APIs	High	No

Final Thoughts

A year ago, when that payment app went from 3.2 stars to 4.7, we didn't add a single new feature. We just changed how existing features got their data. Caching made it feel instant. Background sync made it work offline. WebSockets made payments confirm in real-time. Batch fetching cut load times by 80%.

Looking back, that project taught me something I keep coming back to: users don't care about your architecture. They care that things are fast, reliable, and don't waste their time. Data fetching patterns, on both the backend and the frontend, are how you deliver on that promise.

Pick the right pattern for each situation. Combine them thoughtfully. And when your app ratings start climbing, you'll know why.

How a "Simple" QR Code Generator Ate All My RAM: A Tale of 50,000 QR Codes

Budi Widhiyanto — Tue, 24 Feb 2026 02:43:04 +0000

Sometimes the simplest tasks can become the biggest headaches. Here's how I learned that data size matters more than code complexity.

The Innocent Beginning

It started with a straightforward request: generate 50,000 unique QR codes for a project. "How hard could it be?" I thought. Python has excellent libraries for this. A quick script, a PDF output, done by lunch.

I was wrong. Very wrong.

What I didn't anticipate was that my "simple" script would consume every byte of RAM on my machine, freeze my computer, and teach me an important lesson about thinking at scale.

Let me walk you through what happened, how I fixed it, and what you can learn from my mistakes.

The Original Approach: Looks Good on Paper

Here's the approach I initially took. Generate all the QR codes first, cache them in memory, then write them to a PDF. It sounds logical, right? Pre-compute everything, then assemble the final output.

def generate_pdf(output_path: str, total: int = 50000):
    ids = generate_unique_ids(total)

    # Pre-generate ALL QR codes in parallel for "speed"
    print(f"Pre-generating {total} QR codes in parallel...")
    num_workers = cpu_count()

    # Split IDs into batches for parallel processing
    batch_size = max(1, total // (num_workers * 4))
    batches = [ids[i:i + batch_size] for i in range(0, len(ids), batch_size)]

    # Generate QR codes in parallel using multiprocessing
    qr_cache = {}
    with Pool(num_workers) as pool:
        results = list(tqdm(
            pool.imap(generate_qr_batch, batches),
            total=len(batches),
            desc="Generating QR codes"
        ))

    # Store ALL images in memory
    for batch_result in results:
        for uid, img_bytes in batch_result:
            buf = io.BytesIO(img_bytes)
            qr_cache[uid] = ImageReader(buf)

    # NOW create the PDF using cached images
    # ... PDF generation code ...

I was proud of this code. Multiprocessing! Parallel execution! Batch processing! All the buzzwords that make you feel like a "real" programmer.

Then I ran it.

The Disaster Unfolds

The script started running. Progress bars moved. CPU usage spiked to 100% across all cores. "Excellent," I thought, "parallel processing doing its thing."

Then I noticed my system getting sluggish. Browser tabs stopped responding. My IDE froze. I opened the system monitor and watched in horror as my RAM usage climbed:

2 GB...
4 GB...
8 GB...
12 GB...

My laptop has 16 GB of RAM. The script was devouring it all. Before I could react, the OOM (Out of Memory) killer struck. Process terminated. No PDF. Just a frozen computer and a lesson learned the hard way.

Understanding the Problem

After my system recovered, I sat down to analyze what went wrong. Let me break down the math:

Each QR code image:

Resolution: 400 × 400 pixels
Format: PNG in memory
Approximate size: 15-30 KB per image (compressed)
But in memory as a PIL Image object: ~500 KB - 1 MB

Scale it up:

50,000 QR codes × ~500 KB = ~25 GB of RAM

Even with the compressed PNG byte representation, we're looking at:

50,000 × 20 KB = ~1 GB just for the image bytes
Plus the ImageReader objects
Plus the BytesIO buffers
Plus Python's memory overhead
Plus multiprocessing duplicating data across workers

The actual memory consumption was somewhere between 2-4 GB, which was still way more than what should be acceptable for such a "simple" task.
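The per-image estimate above is easy to sanity-check with a few lines of arithmetic. This sketch only counts the uncompressed pixel buffer for a 400 × 400 image in a few common pixel formats. It ignores Python object overhead and anything the PDF library adds on top, so treat the numbers as a lower bound and not as a measurement.

```python
WIDTH = HEIGHT = 400
COUNT = 50_000

# Bytes per pixel for uncompressed pixel data in common image modes
BYTES_PER_PIXEL = {"grayscale (8-bit)": 1, "RGB": 3, "RGBA": 4}


def raw_image_bytes(width: int, height: int, bytes_per_pixel: int) -> int:
    """Size of the uncompressed pixel buffer, ignoring object overhead."""
    return width * height * bytes_per_pixel


for mode, bpp in BYTES_PER_PIXEL.items():
    per_image = raw_image_bytes(WIDTH, HEIGHT, bpp)
    total_gb = per_image * COUNT / 1e9
    print(f"{mode:>18}: {per_image / 1000:.0f} KB per image, {total_gb:.1f} GB for {COUNT:,}")
```

An RGB image works out to 480 KB, which is where the "~500 KB" ballpark comes from, and 50,000 of them come to 24 GB of raw pixels.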

The fundamental flaw in my approach was this: I was optimizing for speed when I should have been optimizing for resource consumption.

The Fix: Think Like a Stream, Not a Lake

The solution was embarrassingly simple once I understood the problem. Instead of loading all 50,000 QR codes into memory at once (a "lake" of data), I needed to process them as a stream—one page at a time.

Here's the key insight: A PDF with 50,000 QR codes has about 1,667 pages (30 QR codes per page). I only need to hold 30 QR codes in memory at any given time—the ones for the current page.

Here's the refactored approach:

def generate_pdf(output_path: str, total: int = 50000):
    ids = generate_unique_ids(total)
    total_pages = (total + PER_PAGE - 1) // PER_PAGE

    # Create PDF canvas
    c = canvas.Canvas(output_path, pagesize=A4)

    # Process ONE PAGE at a time
    for page_start in tqdm(range(0, total, PER_PAGE), desc="Generating PDF pages"):
        page_ids = ids[page_start : page_start + PER_PAGE]

        # Generate QR codes ONLY for this page
        page_qr_cache = {}
        for uid in page_ids:
            img = make_qr_image(uid)
            page_qr_cache[uid] = img_to_reader(img)

        # Draw this page
        for idx, uid in enumerate(page_ids):
            # ... draw QR code to PDF ...
            c.drawImage(page_qr_cache[uid], qr_x, qr_y, ...)

        c.showPage()

        # CRITICAL: Clear the cache after each page!
        page_qr_cache.clear()

    c.save()

The key changes:

Generate per-page: Only create QR codes for the 30 items on the current page
Clear after use: Explicitly clear the page cache after each page is written
No multiprocessing overhead: Removed the parallel processing that was duplicating data

The Trade-off: Speed vs. Safety

Let's be honest about the trade-offs:

Metric	Original (Parallel)	Optimized (Per-Page)
Memory Usage	2-4 GB	50-100 MB
Speed	Faster (theoretically)	Slower
Stability	Crashes on large datasets	Stable
Scalability	Limited by RAM	Limited by disk space

Yes, the optimized version is slower. Without parallel processing, we're generating QR codes sequentially. For 50,000 codes, the execution time went from "crash before completion" to "about 30-45 minutes of stable execution."

But here's the thing: a slow script that completes is infinitely faster than a fast script that crashes.

I ran the optimized version overnight. When I woke up, both PDF files (100,000 QR codes total) were sitting there, ready to use. My computer was fine. No crashes. No freezing. Just steady, predictable progress.

Lessons Learned

1. Data Size Changes Everything

A script that works perfectly for 100 items might explode at 10,000 items. Always ask yourself: "What happens when this scales 10x? 100x? 1000x?"

In my case, the script probably worked fine during testing with small batches. It was only at production scale that the memory issue became catastrophic.

2. Memory is Not Infinite

This sounds obvious, but it's easy to forget when you're writing code. Every object you create lives somewhere in memory. When you're dealing with images, those objects can be surprisingly large.

# This innocent-looking line...
qr_cache[uid] = ImageReader(buf)

# ...executed 50,000 times becomes a memory bomb

3. Parallel ≠ Better

Parallel processing is great for CPU-bound tasks where you have enough memory to support multiple workers. But when each worker is creating large objects, parallelism can actually make things worse by multiplying memory usage.

Sometimes, a simple sequential loop is the right answer.

4. Clear Your References

Python's garbage collector is good, but it's not magic. If you're holding references to large objects in a dictionary or list, that memory can't be freed until those references go away, either because you remove them yourself or because the container itself is released.

# This single line saved gigabytes of RAM
page_qr_cache.clear()
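You can watch this happen with a weak reference, which lets you observe an object without keeping it alive. This is a small sketch using a stand-in class, not the real image objects from the script. It relies on CPython's reference counting, which frees an object as soon as its last strong reference disappears. Other Python implementations may release memory later.

```python
import weakref


class FakeImage:
    """Stand-in for a large image object."""

    def __init__(self, uid):
        self.uid = uid
        self.pixels = bytearray(400 * 400 * 3)  # roughly 480 KB


cache = {"qr-1": FakeImage("qr-1")}
probe = weakref.ref(cache["qr-1"])  # does not keep the object alive

alive_before = probe() is not None
cache.clear()                       # drop the only strong reference
alive_after = probe() is not None

print(alive_before, alive_after)
```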

5. Progress Bars Are Your Friend

When you're running long-running tasks, always add progress bars. The tqdm library makes this trivially easy:

for page_start in tqdm(range(0, total, PER_PAGE), desc="Generating PDF pages"):
    # ... your code ...

Not only does this give you feedback on how long the task will take, but it also helps you identify when something is wrong. If the progress bar stalls, you know there's a problem.

The Bigger Picture: Thinking About Resources

This experience changed how I approach coding problems. Now, before I write any code that deals with data at scale, I ask myself three questions:

What's the memory footprint per item?
How many items will I process?
Can I process items one at a time instead of all at once?

This is especially important in scenarios like:

Image processing: Images are memory-hungry
Data pipelines: Processing large CSV/JSON files
API responses: Paginating through thousands of records
File operations: Reading/writing large files

The pattern is always the same: stream when you can, batch when you must, and never load everything into memory unless you absolutely have to.

Practical Tips for Your Own Projects

If you're working on a similar task—generating large numbers of images, processing big datasets, or handling any kind of bulk operation—here are some practical tips:

Use Generators Instead of Lists

# Bad: Creates a list of 50,000 items in memory
ids = [generate_id() for _ in range(50000)]

# Better: Generates one at a time
def id_generator(count):
    for _ in range(count):
        yield generate_id()
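A generator pairs nicely with the page-at-a-time idea: you can pull a fixed number of items from it per step without ever building the full list. Here's a minimal sketch using itertools.islice. The `chunked` helper and the ID format are hypothetical, added only so the example runs on its own.

```python
from itertools import islice


def chunked(iterable, size):
    """Yield lists of up to `size` items without materializing the whole input."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def id_generator(count):
    # Hypothetical ID scheme, used only to make the example runnable
    for n in range(count):
        yield f"QR-{n:05d}"


pages = list(chunked(id_generator(7), 3))
print(pages)
```

On Python 3.12 and later, itertools.batched does the same job out of the box (it yields tuples instead of lists).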

Process in Chunks

import gc

# Instead of processing all at once
for item in huge_list:
    process(item)

# Process in manageable chunks
chunk_size = 100
for i in range(0, len(huge_list), chunk_size):
    chunk = huge_list[i:i + chunk_size]
    for item in chunk:
        process(item)
    # Clean up after each chunk
    gc.collect()  # Force garbage collection if needed

Monitor Your Memory Usage

Add memory monitoring to long-running scripts:

import psutil
import os

def get_memory_usage():
    process = psutil.Process(os.getpid())
    return process.memory_info().rss / 1024 / 1024  # MB

# In your loop
for i, item in enumerate(items):
    process(item)
    if i % 1000 == 0:
        print(f"Processed {i} items, Memory: {get_memory_usage():.1f} MB")
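If you'd rather not install psutil, the standard library's tracemalloc module is a possible alternative. Here's a minimal sketch. Keep in mind that it only tracks memory allocated through Python's allocator, so buffers allocated inside C extensions may not show up, and the numbers will not match the process RSS that psutil reports.

```python
import tracemalloc

tracemalloc.start()

data = [bytes(1024) for _ in range(1000)]  # allocate roughly 1 MB
current, peak = tracemalloc.get_traced_memory()
print(f"current: {current / 1024 / 1024:.2f} MB, peak: {peak / 1024 / 1024:.2f} MB")

tracemalloc.stop()
```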

Set Memory Limits

For critical scripts, you can set memory limits to prevent runaway consumption:

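import resource

# Limit the address space to 1 GB (Unix only; the resource module is unavailable on Windows)
one_gb = 1024 * 1024 * 1024
_, hard_limit = resource.getrlimit(resource.RLIMIT_AS)
resource.setrlimit(resource.RLIMIT_AS, (one_gb, hard_limit))  # keep the existing hard limit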

Conclusion

My "simple" QR code generator turned into a valuable lesson about resource management. The original code was clever—parallel processing, batch operations, caching. But clever code that doesn't work is worse than simple code that does.

The final version generates 100,000 QR codes across two PDF files. It takes about an hour to run. It uses less than 100 MB of RAM. And most importantly, it completes successfully every single time.

Sometimes the best optimization isn't making your code faster—it's making it actually work.

The next time you're writing code that processes data at scale, remember: think about memory first, speed second. A slow script that completes is infinitely more valuable than a fast script that crashes.

TL;DR: I tried to generate 50,000 QR codes by loading them all into memory at once. My computer ran out of RAM and crashed. The fix was simple: generate QR codes one page at a time (30 at a time instead of 50,000). It's slower, but it works. Always consider memory usage when working with data at scale.

Building a FHIR Patient Deduplication System: A Journey from Chaos to Performance

Budi Widhiyanto — Sat, 15 Nov 2025 01:32:38 +0000

I'm working on a national project to collect health data from legacy systems in two pilot districts in Indonesia. The goal is to create interoperability between different healthcare systems so we can make better healthcare decisions based on complete patient data. It's an important project, and it's been a challenging one.

One of the biggest challenges has been patient deduplication. We're collecting data from multiple legacy systems, and each one has its own way of storing patient information. When we convert all this data to FHIR R4 format, we end up with duplicate patient records—the same person appearing multiple times in our system because they exist in multiple source systems.

This is the story of how I built a patient deduplication system that processes thousands of records in minutes instead of hours, and the lessons I learned from approaches that didn't work. If you're working with FHIR data from multiple sources or dealing with patient deduplication in healthcare systems, I hope my experience helps you.

The Beginning: Starting with a Partner's Approach

We're working with a technology partner on this national interoperability project. They had already built systems for patient matching in their own applications, and they shared their approach with us. Their method seemed reasonable: when creating a new patient, first search for existing patients using gender and birthdate filters, then apply fuzzy matching on the patient's name. If you find a good match, use that patient ID. Otherwise, create a new patient. If gender or birthdate are missing, fall back to using NIK (Indonesian National Identity Number) for exact matching.

I implemented their approach in our FHIR converter. Here's what it looked like:

def getPatientIdWithFuzzyLogic(internal_id, nik, name, birthdate, gender, parent_name):
    # Strategy 1: Use gender and birthdate if available
    if gender and birthdate:
        # Get patients matching gender and birthdate
        params = {'gender': gender, 'birthdate': birthdate}
        candidates = get_patients_with_params(params)

        if len(candidates) > 0:
            # Apply fuzzy matching on names
            for patient in candidates:
                score = fuzz.token_sort_ratio(name, get_full_name(patient))
                if score >= FUZZY_THRESHOLD:
                    return [patient["id"]]

    # Strategy 2: Fall back to NIK if gender/birthdate didn't work
    if nik:
        nik_patients = get_patients_with_params({'identifier': nik})
        if len(nik_patients) > 0:
            return [nik_patients[0]["id"]]

    # No match found, create new patient
    return None

The logic seemed solid. Search by demographics first, verify with name matching, fall back to NIK if needed. I deployed it and started converting patient data from our legacy systems.

That's when the problems started appearing.

The Failed Solution: Why the Partner's Method Didn't Work

The partner's method worked well in their own internal systems, but it didn't work for our interoperability project. I discovered two fundamental problems that made their approach unsuitable for our needs.

Problem 1: Missing Data in Legacy Systems

The partner's method relied heavily on having gender and birthdate for every patient. But we had data quality issues. Many patient records were missing gender or birthdate fields. When that happened, the search by demographics would fail, and we'd fall back to NIK matching. But if NIK was also missing or inconsistent, we'd create a duplicate patient.

I started seeing duplicate patients in our FHIR server. The same person would appear multiple times because the legacy data from different sources had different levels of completeness. One source might have gender and birthdate, another might only have NIK, and a third might have partial information. The fuzzy matching couldn't handle this inconsistency reliably.

Problem 2: Pagination Limits

The bigger problem was the FHIR API's pagination limit. Our FHIR server returns a maximum of 100 records per search request. When I searched for patients by gender and birthdate, I'd get the first 100 results. If there were more than 100 patients matching those criteria (which is common for popular birthdates), I'd need to paginate through all the results to find the right patient.

But the partner's code didn't handle pagination. It only looked at the first page of results. If the patient I was looking for was on page 2 or page 3, the search would miss them, and the converter would create a duplicate.

I could have fixed the pagination issue by implementing proper page-through logic, but that would make every patient search much slower—potentially making multiple API calls just to check if a patient exists. For batch conversion of thousands of patients, this would be too slow.
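For reference, this is roughly what page-through logic looks like. FHIR search results come back as a Bundle, and when there are more results the Bundle carries a link with relation "next". The sketch below follows those links until they run out. The `fetch_bundle` callable is a hypothetical stand-in for an HTTP GET, and the two fake pages exist only to make the example runnable.

```python
from typing import Callable, Iterator, Optional


def next_page_url(bundle: dict) -> Optional[str]:
    """Return the URL of the next page, if the server provided one."""
    for link in bundle.get("link", []):
        if link.get("relation") == "next":
            return link.get("url")
    return None


def iter_patients(first_url: str, fetch_bundle: Callable[[str], dict]) -> Iterator[dict]:
    """Yield every Patient across all pages of a search result."""
    url: Optional[str] = first_url
    while url:
        bundle = fetch_bundle(url)
        for entry in bundle.get("entry", []):
            yield entry["resource"]
        url = next_page_url(bundle)


# Fake two-page server response, for illustration only
PAGES = {
    "page1": {"entry": [{"resource": {"id": "p1"}}], "link": [{"relation": "next", "url": "page2"}]},
    "page2": {"entry": [{"resource": {"id": "p2"}}], "link": [{"relation": "self", "url": "page2"}]},
}
calls = []


def fake_fetch(url):
    calls.append(url)
    return PAGES[url]


ids = [p["id"] for p in iter_patients("page1", fake_fetch)]
print(ids, calls)
```

The `calls` list makes the cost visible: one request per page, for every single patient lookup.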

The Real Problem

The partner's method was built for their internal systems, where they controlled the data quality and had different constraints. Our situation was different. We were collecting data from multiple independent legacy systems, each with its own data quality issues, and we needed to process it efficiently at scale.

I needed a different approach—one that worked with the data we actually had, not the data we wished we had.

Finding a Better Way

I went back to analyze what reliable data we did have. The answer was NIK—the Indonesian National Identity Number. Almost every patient in our system had a NIK, and it was consistent across different legacy systems. It's a 16-digit number, always formatted the same way, and it uniquely identifies a person.

Why was I treating NIK as a fallback? It should be the primary method. NIK is more reliable than gender or birthdate for identifying patients in Indonesia. Gender and birthdate can be missing or inconsistent, but NIK is designed to be unique.
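If NIK is going to be the primary key for matching, it's worth validating it before it ever reaches a search. Here's a minimal sketch of a defensive normalizer. It assumes source systems might still add stray spaces or separators, which is my assumption and not something observed in the data described here. The function name is hypothetical.

```python
import re
from typing import Optional

NIK_PATTERN = re.compile(r"[0-9]{16}")


def normalize_nik(raw: Optional[str]) -> Optional[str]:
    """Return a cleaned 16-digit NIK, or None if the value is unusable."""
    if not raw:
        return None
    cleaned = re.sub(r"[\s.\-]", "", str(raw))  # drop spaces, dots, and dashes
    return cleaned if NIK_PATTERN.fullmatch(cleaned) else None


print(normalize_nik(" 3201 2345 6789 0123 "))
print(normalize_nik("12345"))
```

Returning None for anything that isn't exactly 16 digits means a malformed NIK falls through to the demographic fallback without producing a false match.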

Building the Reference System: NIK-First Strategy

I redesigned the patient matching system to use NIK as the primary identifier, with demographic matching as a fallback only when necessary. Here's the new approach:

def getPatientIdByNIK(nik):
    """Get patient ID by NIK with caching"""
    if nik in nik_cache:
        return nik_cache[nik]

    params = {'identifier': nik}
    patients = get_patients_with_params(params)

    if len(patients) > 0:
        patient_id = patients[0]["id"]
        nik_cache[nik] = patient_id  # Cache for future lookups
        return patient_id

    return None

def getPatientIdWithFuzzyLogic(nik, name, birthdate, gender, parent_name):
    # Strategy 1: Try NIK exact match first (most reliable)
    if nik:
        nik_patients = get_patients_with_params({'identifier': nik, 'active': True})

        if len(nik_patients) > 0:
            # Single match - verify with name fuzzy matching
            if len(nik_patients) == 1:
                patient = nik_patients[0]
                patient_name = get_full_name(patient)
                new_patient_name = name

                score = fuzz.token_sort_ratio(new_patient_name, patient_name)
                if score >= FUZZY_THRESHOLD:
                    return [patient["id"]]

            # Multiple NIK matches - use fuzzy matching to find best
            elif len(nik_patients) > 1:
                best_match = None
                best_score = 0

                for patient in nik_patients:
                    patient_name = get_full_name(patient)
                    score = fuzz.token_sort_ratio(name, patient_name)
                    if score > best_score:
                        best_score = score
                        best_match = patient

                if best_score >= FUZZY_THRESHOLD:
                    return [best_match["id"]]

    # Strategy 2: Fall back to demographic matching if NIK fails
    if gender and birthdate:
        params = {'gender': gender, 'birthdate': birthdate}
        candidates = get_patients_with_params(params)

        if len(candidates) > 0:
            for patient in candidates:
                score = fuzz.token_sort_ratio(name, get_full_name(patient))
                if score >= FUZZY_THRESHOLD:
                    return [patient["id"]]

    # No match found
    return None

This new system inverts the partner's approach. Instead of searching by demographics first and falling back to NIK, I search by NIK first and fall back to demographics. This solves both problems:

Missing data: NIK is more consistently available than gender/birthdate in our legacy systems
Pagination: Searching by NIK returns far fewer results (usually just one), so pagination isn't an issue

I also added a caching mechanism with getPatientIdByNIK. When converting thousands of patient records, many of them might be the same person (repeat visits, multiple encounters, etc.). By caching the NIK-to-patient-ID mapping, I avoid making redundant API calls for patients I've already looked up.

The fuzzy matching on names is still there as a safety check. Even when I find a patient by NIK, I verify that the name matches using fuzzy string comparison. This catches cases where NIK might have been entered incorrectly or where there might be data quality issues.
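To show what an order-insensitive name comparison does, here's a rough stand-in built on the standard library's difflib. It is not the fuzzy matching library's actual algorithm and its scores will differ; it only illustrates the idea of sorting the tokens before comparing. The threshold value and the example names are assumptions for illustration.

```python
from difflib import SequenceMatcher

FUZZY_THRESHOLD = 85  # assumed value; the real threshold is not stated here


def token_sort_score(a: str, b: str) -> float:
    """Order-insensitive similarity on a 0-100 scale (difflib stand-in)."""

    def normalize(s: str) -> str:
        return " ".join(sorted(s.lower().split()))

    return 100 * SequenceMatcher(None, normalize(a), normalize(b)).ratio()


same_person = token_sort_score("Siti Aminah", "aminah siti")
different = token_sort_score("Siti Aminah", "Budi Santoso")
print(same_person >= FUZZY_THRESHOLD, different >= FUZZY_THRESHOLD)
```

Sorting tokens first is what makes "Siti Aminah" and "aminah siti" score as identical, which matters when legacy systems store name parts in different orders.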

Testing the New System

When I tested the new NIK-first system with real data from our legacy systems, it worked much better. The converter found existing patients reliably, even when demographic data was missing. The pagination problem disappeared because NIK searches rarely return more than 100 results. And the caching made bulk conversion much faster.

I watched the logs during a test run converting 1,000 patient records: "Found patient by NIK... Found patient by NIK... Created new patient (no NIK match)... Found patient by NIK (cached)..." The system was working.

But I still had a problem. Before implementing this fix, the old system had already created duplicate patients in our FHIR server. Some NIKs had 2, 3, or even 10 duplicate patient records. While the new reference system prevented future duplicates, I needed to clean up the existing ones.

I needed a deduplication process.

The Deduplication Challenge: Two Versions

Building a system to deduplicate existing patients was a different challenge entirely. With the reference system, I was preventing new duplicates—a relatively simple task of checking before creating. With deduplication, I needed to find all existing duplicates, choose which one should be the "master," and then update potentially thousands of medical records to point to that master instead of the duplicates.

This was going to touch a lot of data. I needed to be careful.

Version 1: Sequential Processing - The Safe, Slow Way

For my first implementation, I chose the safest possible approach: sequential processing. I would handle one NIK at a time, processing each step completely before moving to the next. No parallelization, no batch operations, just simple, linear execution.

The algorithm was straightforward:

Find all patient IDs with the same NIK
Select which one should be the master (I chose the most recently updated)
Find all resources (observations, encounters, etc.) referencing the duplicate patient IDs
Update each resource to reference the master patient ID instead
Mark the duplicate patients as inactive

I wrote it as a simple loop:

for nik in nik_list:
    # Find duplicate patients
    patient_ids = find_patients_by_nik(nik)

    if len(patient_ids) <= 1:
        continue  # No duplicates, skip

    # Select master patient (most recently updated)
    master_id = select_master_patient(patient_ids)
    duplicate_ids = [pid for pid in patient_ids if pid != master_id]

    # Find all resources referencing duplicates
    for resource_type in RESOURCE_TYPES:
        for patient_id in duplicate_ids:
            resources = fetch_resources(resource_type, patient_id)

            # Update each resource
            for resource in resources:
                update_patient_reference(resource, master_id)
                put_resource(resource)

    # Mark duplicates inactive
    for dup_id in duplicate_ids:
        mark_patient_inactive(dup_id, master_id)

I tested it with a single NIK first. It worked. I checked the data afterward—all the medical records now pointed to the master patient, the duplicates were marked inactive with a replaced-by link to the master. Perfect.
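The helper functions in that loop aren't shown, so here's a minimal sketch of what three of them might look like when written as pure functions on FHIR JSON. It assumes resources are plain dicts, that references are relative and unversioned ("Patient/<id>"), and that the duplicate is linked to the master through Patient.link with type "replaced-by". The function names are hypothetical, and absolute URLs or versioned references would need extra handling.

```python
from datetime import datetime


def last_updated(patient: dict) -> datetime:
    """Parse meta.lastUpdated; a trailing 'Z' is rewritten for older Pythons."""
    stamp = patient["meta"]["lastUpdated"].replace("Z", "+00:00")
    return datetime.fromisoformat(stamp)


def select_master(patients: list) -> dict:
    """Pick the most recently updated patient as the master."""
    return max(patients, key=last_updated)


def rewrite_references(node, duplicate_ids: set, master_id: str) -> bool:
    """Recursively point Patient references at the master. Returns True if changed."""
    changed = False
    if isinstance(node, dict):
        ref = node.get("reference")
        if isinstance(ref, str) and ref.startswith("Patient/"):
            if ref.split("/", 1)[1] in duplicate_ids:
                node["reference"] = f"Patient/{master_id}"
                changed = True
        for value in node.values():
            changed = rewrite_references(value, duplicate_ids, master_id) or changed
    elif isinstance(node, list):
        for item in node:
            changed = rewrite_references(item, duplicate_ids, master_id) or changed
    return changed


def mark_inactive(patient: dict, master_id: str) -> dict:
    """Deactivate a duplicate and link it to its replacement."""
    patient["active"] = False
    patient.setdefault("link", []).append(
        {"other": {"reference": f"Patient/{master_id}"}, "type": "replaced-by"}
    )
    return patient
```

Returning a changed flag from `rewrite_references` makes it easy to skip the PUT for resources that did not actually reference a duplicate.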

Then I tried it with ten NIKs. It worked, but it took 15 minutes. Okay, that's not great, but acceptable for a cleanup operation, right?

Then I ran it on our actual list: 68 NIKs with known duplicates. I started the script, watched the logs for a few minutes, then went to get coffee. When I came back 30 minutes later, it had processed 3 NIKs. I did the math. 68 NIKs at 10 minutes each... over 11 hours.

I let it run overnight. The next morning, it had finished successfully. All the duplicates were cleaned up. But 11 hours was not acceptable. We had hundreds more NIKs to process in other datasets. At this rate, a full deduplication would take days, maybe weeks. And during that time, the script would be constantly hammering our FHIR server with API calls.

The problem was obvious: I was making way too many individual API calls. For each patient ID, I was searching for observations one at a time, then encounters one at a time, then medications, procedures, diagnostic reports—the list went on. And FHIR has a lot of resource types that can reference patients. Even though I filtered it down to the most common ones, I was still checking 27 different resource types. For each duplicate patient. Sequentially.

If a single NIK had 3 duplicate patients and each patient had 20 observations, that's 60 individual GET requests just for observations, plus 60 individual PUT requests to update them. Multiply that by all the other resource types, and you're talking about hundreds of API calls per NIK. No wonder it was slow.

I watched the script run for a while, looking at the logs. The server was responding quickly—each API call only took a few hundred milliseconds. But I was only making one call at a time. The network latency, the sequential execution, it all added up. I was wasting so much time just waiting.

That's when I remembered something. When I built the original FHIR converter, I had faced a similar problem. Converting thousands of patient records one at a time was slow. I had solved it by using batch operations and parallel processing. I could apply the same techniques here.

Version 2: Batch Processing & Parallelization - The Fast Way

The key insight was this: most of the steps in deduplication don't depend on each other. When I'm fetching observations for a patient, I don't need to wait for the encounters to be fetched first. When I'm updating resources, I don't need to update them one at a time—I can batch them together.

I redesigned the system with two major optimizations: parallel resource fetching and batch updates.

For parallel fetching, I used Python's ThreadPoolExecutor:

from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import List, Optional

def find_all_references(self, patient_ids: List[str], master_id: Optional[str] = None):
    """Find ALL resources that reference duplicate patient IDs (parallel)"""
    # Only search for duplicates, not master
    search_patient_ids = [pid for pid in patient_ids if pid != master_id]

    all_references = {}

    # Use ThreadPoolExecutor for parallel fetching
    with ThreadPoolExecutor(max_workers=self.max_workers) as executor:
        # Submit all resource type fetches in parallel
        future_to_type = {
            executor.submit(self._fetch_resources_for_type, resource_type, search_patient_ids): resource_type
            for resource_type in PATIENT_REFERENCING_RESOURCES
        }

        # Collect results as they complete
        for future in as_completed(future_to_type):
            resource_type = future_to_type[future]
            try:
                resource_type, resources = future.result()
                if resources:
                    all_references[resource_type] = resources
            except Exception as e:
                logger.error(f"Error fetching {resource_type}: {e}")

    return all_references

Instead of fetching observations, then encounters, then medications sequentially, I now launch all those searches in parallel. Five worker threads (configurable via MAX_WORKERS) simultaneously fetch different resource types. This alone cut the fetching time by about 80%.
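
To make the effect of the worker pool concrete, here is a minimal, self-contained sketch of the same fan-out pattern. It is not the production code: fake_fetch, RESOURCE_TYPES, and the 50 ms sleep are stand-ins I am assuming for illustration, and real FHIR searches will behave differently.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

# Hypothetical subset of resource types, for illustration only.
RESOURCE_TYPES = ["Observation", "Encounter", "Condition", "Procedure", "Immunization"]


def fake_fetch(resource_type, patient_ids):
    """Stand-in for a FHIR search; the sleep simulates network latency."""
    time.sleep(0.05)
    resources = [{"resourceType": resource_type, "id": f"{resource_type}-{pid}"}
                 for pid in patient_ids]
    return resource_type, resources


def fetch_all(patient_ids, max_workers=5):
    """Submit one fetch per resource type and collect results as they finish."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {executor.submit(fake_fetch, rt, patient_ids): rt
                   for rt in RESOURCE_TYPES}
        for future in as_completed(futures):
            resource_type, resources = future.result()
            results[resource_type] = resources
    return results


def timed(max_workers):
    """Wall-clock seconds for one full fetch with the given pool size."""
    start = time.perf_counter()
    fetch_all(["p1", "p2"], max_workers=max_workers)
    return time.perf_counter() - start


if __name__ == "__main__":
    print(f"1 worker:  {timed(1):.2f}s")
    print(f"5 workers: {timed(5):.2f}s")
```

With one worker the five simulated calls run back to back; with five workers they overlap, which is where the savings on I/O-bound work come from.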

But the real performance gain came from batch updates. FHIR supports bundle operations—instead of sending one resource update at a time, you can send a bundle of up to hundreds of updates in a single API call. I implemented this using FHIR batch bundles:

import math
from typing import List

def _create_batch_bundle(self, resources_to_update: List[tuple]) -> dict:
    """Create a FHIR batch bundle for updating multiple resources"""
    bundle = {
        "resourceType": "Bundle",
        "type": "batch",
        "entry": []
    }

    for resource_type, resource in resources_to_update:
        bundle["entry"].append({
            "request": {
                "method": "PUT",
                "url": f"{resource_type}/{resource['id']}"
            },
            "resource": resource
        })

    return bundle

def update_all_references(self, all_references, duplicate_ids, master_id):
    """Update all references using batch bundles"""
    resources_to_update = []

    # Collect all resources that need updating
    for resource_type, resources in all_references.items():
        for resource in resources:
            if self.update_patient_references(resource, duplicate_ids, master_id):
                resources_to_update.append((resource_type, resource))

    # Split into batches of self.batch_size (100 in my runs)
    num_batches = math.ceil(len(resources_to_update) / self.batch_size)

    for i in range(num_batches):
        batch_resources = resources_to_update[i*self.batch_size:(i+1)*self.batch_size]

        # Create and execute batch bundle
        bundle = self._create_batch_bundle(batch_resources)
        result_bundle = self._execute_batch_bundle(bundle)

        # Check results
        for entry in result_bundle.get('entry', []):
            response = entry.get('response', {})
            if response.get('status', '').startswith('2'):
                self.stats['resources_updated'] += 1

Now instead of 60 separate PUT requests to update 60 observations, I send one request with a bundle containing all 60 updates. The FHIR server applies each entry and returns a response bundle with one status per entry.
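
The batching arithmetic is easy to check in isolation. This is a small sketch with a hypothetical helper, plan_batches, that mirrors the math.ceil split above; the default of 100 is taken from the batch size mentioned in this article.

```python
import math


def plan_batches(num_resources, batch_size=100):
    """Return the number of entries in each bundle needed for num_resources updates."""
    num_batches = math.ceil(num_resources / batch_size)
    return [min(batch_size, num_resources - i * batch_size)
            for i in range(num_batches)]


if __name__ == "__main__":
    print(plan_batches(60))   # one bundle holding all 60 updates
    print(plan_batches(250))  # two full bundles plus a partial one
```

The length of the returned list is the number of API calls, so 60 updates cost one call and 250 updates cost three.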

I also added a smart optimization: only fetch resources for duplicate patients, not the master. Resources already pointing to the master don't need to be fetched or updated. This simple check cut the amount of data I needed to process roughly in half:

# Only search for duplicates, not master (master resources don't need updating)
search_patient_ids = [pid for pid in patient_ids if pid != master_id]
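
The update_patient_references helper called earlier is not shown in this post. As a rough illustration of what such a check could look like, here is a standalone sketch. It only handles the subject and patient fields, which is an assumption on my part; a real implementation would need to cover every reference field used by the resource types involved.

```python
def update_patient_references(resource, duplicate_ids, master_id):
    """Point Patient references at the master. Returns True if anything changed."""
    targets = {f"Patient/{pid}" for pid in duplicate_ids}
    changed = False
    # Assumption: only these two common reference fields are handled in this sketch.
    for field in ("subject", "patient"):
        ref = resource.get(field)
        if isinstance(ref, dict) and ref.get("reference") in targets:
            ref["reference"] = f"Patient/{master_id}"
            changed = True
    return changed


if __name__ == "__main__":
    obs = {"resourceType": "Observation", "id": "o1",
           "subject": {"reference": "Patient/dup-1"}}
    print(update_patient_references(obs, ["dup-1"], "master-1"))
    print(obs["subject"]["reference"])
```

Because the function returns False for a resource that already points at the master, running it a second time adds nothing to the update list.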

I ran the new version on the same 68 NIKs that had taken 11 hours before. This time, I watched the progress in real time. The parallel fetching worked beautifully: I could see all 27 resource types being queried concurrently, up to five at a time. The batch updates were lightning fast, with bundles of 100 resources updated in seconds.

Twenty-three minutes later, it was done. The same operation that took 11 hours (660 minutes) now took 23 minutes. That's roughly 29 times faster.

I ran it again on a larger dataset just to be sure. Same results. The system was consistently fast. The optimization worked.

Technical Deep Dives: Solving Real Problems

While the main architecture was solid, I ran into several challenges that required specific solutions. Let me share three that taught me the most.

Challenge 1: Pagination and URL Handling

One issue that cost me two hours of debugging was pagination. When fetching resources, FHIR servers often return results in pages. You get the first 100 results, plus a "next" link to get the next page. Simple enough, right?

Except the "next" link returned by our FHIR server had a trailing slash before the query parameters: /Patient/?_count=100&_page_token=abc. When I tried to fetch that URL, I got 404 errors. The server expected /Patient?_count=100 (no slash before the question mark).

I spent way too long staring at logs before I noticed that subtle difference. Once I saw it, the fix was simple:

from typing import Optional

def _get_next_link(self, bundle: dict) -> Optional[str]:
    """Get next page URL from bundle, or None when there is no next page"""
    for link in bundle.get('link', []):
        if link.get('relation') == 'next':
            next_url = link.get('url')
            if next_url and '/fhir/' in next_url:
                path_and_query = next_url.split('/fhir/', 1)[1]
                # Remove the trailing slash before the query string (first match only,
                # so a '/?' inside a parameter value is left alone)
                path_and_query = path_and_query.replace('/?', '?', 1)
                return f"{self.base_url}/{path_and_query}"
    return None

This taught me to always validate assumptions about external APIs. Just because something looks like a standard URL doesn't mean it will work exactly as you expect.
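
An alternative to string replacement is to parse the URL and touch only its path. The sketch below uses the standard library's urllib.parse; normalize_next_url and the example.org host are hypothetical names for illustration, not part of the deduplicator.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_next_url(next_url):
    """Drop a trailing slash from the path and leave the query string untouched."""
    parts = urlsplit(next_url)
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


if __name__ == "__main__":
    print(normalize_next_url(
        "https://example.org/fhir/Patient/?_count=100&_page_token=abc"))
```

Splitting first means the query string is never rewritten, so a parameter value that happens to contain the characters /? passes through unchanged.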

Challenge 2: Choosing the Right Master Patient

Initially, I just picked the most recently updated patient as the master. Simple logic: the newest record is probably the most complete. But then I realized this created a problem. If I ran the deduplication twice, I might choose a different master the second time (if one of the duplicates had been updated in between). This would cause unnecessary churn—moving all those resource references back and forth.

The solution was to implement stability: once a patient has been designated as master, it should stay the master. I did this using FHIR meta tags. When a patient becomes the master, I tag it with a "golden resource" tag that points to itself:

def _has_self_referencing_golden_tag(self, patient: dict) -> bool:
    """Check if patient has golden resource tag pointing to itself"""
    patient_id = patient.get('id')
    if not patient_id:
        return False

    tags = patient.get('meta', {}).get('tag', [])
    for tag in tags:
        if tag.get('system') == 'http://terminology.kemkes.go.id/sp-replaced-by':
            if tag.get('code') == patient_id:
                return True
    return False

def select_master_patient(self, patient_ids: List[str]) -> str:
    """Select which patient should be the master

    Priority:
    1. Active patient with golden resource tag (existing master)
    2. Most recently updated active patient
    """
    patients = [fetch_patient(pid) for pid in patient_ids]
    active_patients = [p for p in patients if p.get('active', True)]
    if not active_patients:
        # max() on an empty list raises ValueError; fail with a clearer message
        raise ValueError("No active patients to choose a master from")

    # Prefer an existing master; otherwise consider every active patient
    existing_masters = [p for p in active_patients
                       if self._has_self_referencing_golden_tag(p)]
    candidates = existing_masters or active_patients

    # Tiebreaker: most recently updated. String comparison works here only if
    # the server emits lastUpdated in one consistent ISO 8601 format.
    return max(candidates,
              key=lambda p: p.get('meta', {}).get('lastUpdated', ''))['id']

Now when I run deduplication, it first checks if one of the patients is already marked as master. If so, use that one. If not, pick the most recently updated and mark it as master. This ensures consistency across multiple runs.
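
The "mark it as master" step is not shown above. A minimal sketch of how the tag could be added follows. The tag system URL is the one used in the check above; the mark_as_master name and the in-place dictionary update are assumptions for illustration, and the real code still has to write the patient back to the server.

```python
GOLDEN_TAG_SYSTEM = "http://terminology.kemkes.go.id/sp-replaced-by"


def mark_as_master(patient):
    """Add a self-referencing golden tag unless one is already present."""
    tags = patient.setdefault("meta", {}).setdefault("tag", [])
    already_tagged = any(
        tag.get("system") == GOLDEN_TAG_SYSTEM and tag.get("code") == patient["id"]
        for tag in tags
    )
    if not already_tagged:
        tags.append({"system": GOLDEN_TAG_SYSTEM, "code": patient["id"]})
    return patient


if __name__ == "__main__":
    print(mark_as_master({"resourceType": "Patient", "id": "abc"}))
```

Checking before appending keeps the operation idempotent, so repeated runs do not pile up duplicate tags on the master.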

Challenge 3: Handling Batch Failures Gracefully

When I first implemented batch updates, I used FHIR transaction bundles (type: "transaction"). These are atomic—either all updates succeed, or they all fail. This seemed safe, but it had a major problem: if even one resource in the batch had an issue, the entire batch would fail, and none of the updates would be applied.

During testing, I had a batch of 100 observations to update. One of them had a validation issue (a missing required field from old data). The entire batch failed, and I had to figure out which one was problematic. This was frustrating and slow.

The solution was to switch to batch bundles (type: "batch") instead of transaction bundles. With batch bundles, each operation in the bundle succeeds or fails independently:

bundle = {
    "resourceType": "Bundle",
    "type": "batch",  # Independent operations, not atomic
    "entry": [...]
}

Now if one resource in a batch of 100 fails, the other 99 still get updated successfully. I log the failure, track it in my stats, but don't let it block the entire operation:

for idx, entry in enumerate(result_bundle.get('entry', [])):
    response = entry.get('response', {})
    status = response.get('status', '')

    if status.startswith('2'):  # Success (2xx status code)
        self.stats['resources_updated'] += 1
        self.stats['batch_successes'] += 1
    else:
        # Log failure but continue
        resource_type, resource = batch_resources[idx]
        error_msg = f"Failed to update {resource_type}/{resource['id']}: {status}"
        logger.warning(error_msg)
        self.stats['errors'].append(error_msg)
        self.stats['batch_failures'] += 1

This makes the system much more robust. Even with messy real-world data, the deduplication completes successfully for the vast majority of resources, and I have a clear log of anything that failed.

Making It Production-Ready: The API Layer

The command-line script worked great for batch deduplication, but for ongoing operations, I needed something more accessible. I built a FastAPI wrapper that exposes the deduplication functionality as a REST API.

The API has two main endpoints. This is the single-NIK one:

@app.post("/deduplicate")
async def deduplicate_single_nik(request: SingleNIKRequest):
    """Deduplicate patients for a single NIK"""
    start_time = time.time()

    deduplicator = FHIRPatientDeduplicator(
        base_url=FHIR_BASE_URL,
        nik_system=NIK_SYSTEM,
        api_key=API_KEY,
        batch_size=BATCH_SIZE,
        max_workers=MAX_WORKERS
    )

    patient_ids = deduplicator.find_patients_by_nik(request.nik)

    if len(patient_ids) > 1:
        deduplicator.deduplicate_by_nik(
            nik=request.nik,
            patient_ids=patient_ids,
            delete_duplicates=request.delete_duplicates
        )

    duration = time.time() - start_time

    return DeduplicationResponse(
        success=True,
        nik=request.nik,
        resources_found=deduplicator.stats['resources_found'],
        resources_updated=deduplicator.stats['resources_updated'],
        duration_seconds=round(duration, 2),
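        # utcnow() is deprecated since Python 3.12; this needs
        # "from datetime import datetime, timezone" and adds a +00:00 offset
        timestamp=datetime.now(timezone.utc).isoformat()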
    )

I added timing information so we can track how long each deduplication takes. This is useful for monitoring and capacity planning. I also added a batch endpoint that processes multiple NIKs in sequence, with per-NIK timing and summary statistics.

The API is deployed on Google Cloud Run, which handles scaling automatically. If we need to process a large batch of NIKs, we can send them to the batch endpoint and it processes them sequentially (to maintain data integrity) while still being fast thanks to the parallelization and batch updates happening under the hood.
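
The batch endpoint itself is not shown in this post. As a rough sketch of the sequential loop with per-NIK timing and a summary, something like the following would do. The process_batch name, the result fields, and the choice to keep going after a failed NIK are my assumptions, and deduplicate_one stands in for whatever call performs the real work.

```python
import time


def process_batch(niks, deduplicate_one):
    """Run deduplicate_one(nik) for each NIK in order, recording per-NIK timing."""
    results = []
    for nik in niks:
        start = time.perf_counter()
        try:
            deduplicate_one(nik)
            success, error = True, None
        except Exception as exc:
            # Assumption: one failing NIK should not stop the rest of the batch.
            success, error = False, str(exc)
        results.append({
            "nik": nik,
            "success": success,
            "error": error,
            "duration_seconds": round(time.perf_counter() - start, 2),
        })

    summary = {
        "total": len(results),
        "succeeded": sum(r["success"] for r in results),
        "failed": sum(not r["success"] for r in results),
        "total_seconds": round(sum(r["duration_seconds"] for r in results), 2),
    }
    return results, summary


if __name__ == "__main__":
    print(process_batch(["nik-1", "nik-2"], lambda nik: None)[1])
```

Processing the NIKs one after another keeps the ordering guarantee described above, while each individual call can still use parallel fetching and batch bundles internally.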

The API also makes it easy for other teams to integrate deduplication into their workflows. They can call the endpoint whenever they import new data, and any duplicates get cleaned up automatically.

Reflection & Lessons Learned

Looking back on this project, I'm proud of what I built, but I'm also very aware of what I did wrong and what I'd do differently next time.

What Went Well

The NIK-based reference system is simple and reliable. By choosing the right unique identifier from the start, I avoided all the complexity of demographic matching. The system hasn't created a single duplicate patient since I deployed it.

The optimization from sequential to batch/parallel processing was a huge win. Going from 11 hours to 23 minutes isn't just about speed—it's about practicality. At 11 hours, running deduplication was something you'd do rarely, maybe once a month, as a special operation. At 23 minutes, it's something you can run weekly, or even daily if needed. That changes how useful the tool is.

The architectural decisions around resilience—using batch bundles instead of transactions, tracking errors but continuing, logging everything—have proven their value. The system handles real-world messy data gracefully. It doesn't fail catastrophically because one record has a problem.

What I'd Do Differently

I should have thought about deduplication from day one. If I had implemented the NIK check in the original converter, I wouldn't have created thousands of duplicates that needed cleaning up. This is a classic example of a small amount of foresight preventing a large amount of pain later.

I wasted a month trying to adapt the partner's solution. I should have analyzed our specific problem more carefully first. Their demographic matching system was sophisticated and well-built, but it was solving a different problem than ours. Understanding the problem deeply before jumping to solutions would have saved a lot of time.

I should have built the parallel/batch version first, or at least earlier. I learned more from building it than I would have from just thinking about it, but if I had started with "how do I make this fast?" instead of "how do I make this work?", I would have gotten to the good solution faster.

Technical Learnings

Batch operations are powerful. Reducing API calls from hundreds to dozens makes a massive difference. Whenever you're doing lots of similar operations, look for a way to batch them.

Parallelization works best when operations are independent. Fetching different resource types in parallel is perfect because they don't depend on each other. But I couldn't parallelize the actual deduplication of different NIKs because they might reference the same resources. Understanding these dependencies is crucial.

The FHIR standard is well-designed but implementations vary. Features like batch bundles, search parameters, and pagination work slightly differently on different servers. Always test against your actual FHIR server, not just against the spec.

Real-world data is messy. Invalid formats, missing fields, duplicate identifiers—they're all going to happen. Build your system to handle errors gracefully rather than assuming perfect data.

Future Improvements

If I were to continue improving this system, here's what I'd add:

More sophisticated master selection. Currently I use "most recently updated" as a tiebreaker. But there are other factors that could matter—which patient has the most complete data, which one has the most recent medical records, which one was verified most recently. A scoring system could help.

Automated detection of new duplicates. Right now someone has to identify that duplicates exist and call the API. I could build a background job that periodically scans for NIKs with multiple active patients and flags them for review or automatic deduplication.
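
As a sketch of what the scanning step might look like once the patients have been fetched, the function below groups active patients by NIK and keeps only the NIKs that appear more than once. The NIK_SYSTEM value is a placeholder because the real identifier system is not shown in this post, and find_duplicate_niks is a hypothetical name.

```python
from collections import defaultdict

# Placeholder: the real NIK identifier system is not shown in the article.
NIK_SYSTEM = "https://example.org/identifier/nik"


def find_duplicate_niks(patients, nik_system=NIK_SYSTEM):
    """Return {nik: [patient ids]} for NIKs shared by more than one active patient."""
    by_nik = defaultdict(list)
    for patient in patients:
        if not patient.get("active", True):
            continue  # inactive patients were already merged away
        for identifier in patient.get("identifier", []):
            nik = identifier.get("value")
            if identifier.get("system") == nik_system and nik:
                if patient["id"] not in by_nik[nik]:
                    by_nik[nik].append(patient["id"])
    return {nik: ids for nik, ids in by_nik.items() if len(ids) > 1}


if __name__ == "__main__":
    sample = [
        {"id": "a", "identifier": [{"system": NIK_SYSTEM, "value": "111"}]},
        {"id": "b", "identifier": [{"system": NIK_SYSTEM, "value": "111"}]},
    ]
    print(find_duplicate_niks(sample))
```

The resulting mapping could feed either a review queue or the existing deduplication endpoint.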

Intelligent merging of patient demographic data. When deduplicating patients, I currently just pick one master patient and mark the others inactive. But sometimes the duplicate records have complementary information—one might have a phone number, another might have an address. I could merge the best available data from all duplicates into the master patient record before marking duplicates inactive. This would ensure no valuable information is lost during deduplication.
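
One simple way to approach that merge is to fill only the fields the master is missing. This is a sketch under that assumption: merge_demographics is a hypothetical helper, the choice of telecom and address is illustrative, and a real merge would need rules for conflicting values.

```python
def merge_demographics(master, duplicates, fields=("telecom", "address")):
    """Return a copy of master with empty fields filled from the first duplicate that has them."""
    merged = dict(master)  # shallow copy; the original master dict is not modified
    for field in fields:
        if merged.get(field):
            continue  # the master already has this field; never overwrite it
        for duplicate in duplicates:
            if duplicate.get(field):
                merged[field] = duplicate[field]
                break
    return merged


if __name__ == "__main__":
    master = {"id": "m", "telecom": [{"system": "phone", "value": "0811"}]}
    duplicate = {"id": "d", "address": [{"city": "Bandung"}]}
    print(merge_demographics(master, [duplicate]))
```

Never overwriting a value the master already holds keeps the merge conservative, which matters when the source records disagree.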

Conclusion

Building this patient deduplication system taught me that good software engineering isn't just about making things work—it's about making them work well, reliably, and efficiently. It's about thinking ahead, but also about being willing to rework things when your first approach doesn't scale.

I made mistakes. I spent time on solutions that didn't fit my problem. I built a slow version first when I could have built a fast one. But each of those mistakes taught me something valuable. Now I know to identify the right unique identifier before building a system around it. I know to batch operations whenever possible. I know to design for resilience, not just for the happy path.

Most importantly, I learned that performance optimization isn't just about making things faster—it's about making them useful. A tool that takes 11 hours to run gets used rarely. A tool that takes 23 minutes gets used regularly. Speed enables usefulness.

If you're building something similar—whether it's deduplication, data migration, or any kind of batch processing—I hope my journey helps you avoid some of the wrong turns I took. Think about deduplication early. Choose the right unique identifier. Build for resilience. Batch and parallelize when you can. And don't be afraid to throw away your first version if it doesn't scale.

The code is running in production now, quietly cleaning up duplicate patient records every week. It works. It's fast. And most importantly, it helps make sure that when a healthcare provider looks up a patient's medical history, they see the complete picture. That's what matters.

Relearning Microservices with a Weekend Mini eCommerce Build

Budi Widhiyanto — Wed, 24 Sep 2025 03:24:21 +0000

One rainy weekend I decided to refresh my microservices skills by building a small eCommerce platform from scratch. I wanted a playground that was close enough to real work to show the classic problems—clear boundaries, steady APIs, reliable deployments—without growing into a long project. This article is my field journal from that sprint: what I built, why I made certain choices, and how the code in this repo supports every decision.

Architecture at a Glance

Saturday morning started with a blank page and four simple boxes. I knew the weekend would stay calm only if every box owned one clear job and followed the same rules. The result is a Node.js monorepo with four deployable workspaces that live together but stay independent:

User Service handles registration, login, and profile lookups so the rest of the stack never has to guess who is calling.
Product Service manages the catalog and keeps price data clean.
Order Service turns carts into history by connecting users and products.
API Gateway sits on the edge and hides the backend layout from clients.

Each service gets its own Postgres database and REST API. To avoid copying the same setup again and again, every service depends on @mini/shared for logging, HTTP helpers, error classes, and configuration tools. From there the workflow stays simple on purpose: npm run compose:up brings the stack online with this Compose file driving the topology:

# docker-compose.yml
services:
  user-service:
    command: npm run dev --workspace services/user
    ports:
      - "3001:3001"
    depends_on: [user-db]

  product-service:
    command: npm run dev --workspace services/product
    ports:
      - "3002:3002"
    depends_on: [product-db]

  order-service:
    command: npm run dev --workspace services/order
    ports:
      - "3003:3003"
    depends_on: [order-db, user-service, product-service]

  api-gateway:
    command: npm run dev --workspace gateway
    ports:
      - "8080:8080"
    depends_on: [user-service, product-service, order-service]

volumes:
  user-db-data:
  product-db-data:
  order-db-data:

The manifests in k8s/ reproduce the same shape inside a Kubernetes cluster when I want to push things a little harder.

Shared Platform Capabilities

By midday I noticed the same pattern, service after service. Each one wanted identical Express plumbing, the same error classes, and the same .env routine. Rather than repeat myself, I moved those cross-cutting pieces into @mini/shared so the rest of the weekend could focus on business rules instead of setup.

The shared HTTP helper keeps every edge consistent by centralising the Express setup, wiring in JSON parsing, health checks, and error handling so every service exposes the same behaviour:

// shared/src/http.js
function createApp({ serviceName, logger, routes }) {
  if (!serviceName) throw new Error('serviceName is required');
  const app = express();
  app.disable('x-powered-by');
  app.use(express.json());

  app.get('/healthz', (_req, res) => {
    res.json({ service: serviceName, status: 'ok', uptime: process.uptime() });
  });

  if (typeof routes === 'function') {
    routes(app);
  }

  app.use((_req, _res, next) => next(new NotFoundError()));

  app.use((err, req, res, _next) => {
    const error = err instanceof AppError ? err : new AppError('Internal Server Error');
    logger?.error?.('request failed', { code: error.code, status: error.status, id: req.id });
    res.status(error.status).json({ error: { code: error.code, message: error.message } });
  });

  return app;
}

Error classes stay in one place, so every service can throw meaningful responses and map domain problems to HTTP status codes without duplicating boilerplate:

// shared/src/errors.js
class ValidationError extends AppError {
  constructor(message = 'Validation failed', details) {
    super(message, { status: 400, code: 'validation_error', details });
  }
}

class UnauthorizedError extends AppError {
  constructor(message = 'Unauthorized') {
    super(message, { status: 401, code: 'unauthorized' });
  }
}

Configuration loading is just as centralised, which means each service validates its environment variables before it starts and applies optional parsers or defaults in one predictable location:

// shared/src/env.js
function getConfig(schema) {
  return Object.entries(schema).reduce((acc, [key, options]) => {
    let value = process.env[key];
    const required = !!options?.required;
    const fallback = options?.default;
    const parser = options?.parser;

    if ((value === undefined || value === '') && fallback !== undefined) {
      value = typeof fallback === 'function' ? fallback() : fallback;
    }

    if ((value === undefined || value === '') && required) {
      throw new Error(`Missing required environment variable ${key}`);
    }

    acc[key] = typeof parser === 'function' && value !== undefined ? parser(value) : value;
    return acc;
  }, {});
}

Lastly, the shared logger stamps every log line with the service name, which makes cross-service debugging feel like reading a conversation instead of a jumble of anonymous messages:

// shared/src/logger.js
function createLogger(serviceName) {
  const prefix = serviceName ? `[${serviceName}]` : '[app]';
  const base = { info: console.log, error: console.error, warn: console.warn };

  return {
    info: (msg, meta) => base.info(prefix, msg, meta || ''),
    warn: (msg, meta) => base.warn(prefix, msg, meta || ''),
    error: (msg, meta) => base.error(prefix, msg, meta || ''),
  };
}

After that refactor each service file felt lighter. The interesting code stayed in front, and new features no longer meant reworking the foundations.

Service Deep Dive

User Service: Reestablishing Identity Basics

The first feature I added was identity. Past projects taught me that most bugs look like security bugs when the caller is unknown, so registerUser hashes the password, saves it, and issues a JWT in one short flow:

// services/user/src/service.js
async function registerUser({ username, password }) {
  if (!username || !password) {
    throw new ValidationError('username and password are required');
  }

  const existing = await findByUsername(username);
  if (existing) {
    throw new ValidationError('username already taken');
  }

  const passwordHash = await hashPassword(password);
  const user = await createUser({ username, passwordHash });
  const token = issueToken({ sub: user.id, username: user.username, role: user.role });
  return { user, token };
}

Startup logic seeds an admin account from environment variables because I have locked myself out of dashboards before; the database initializer keeps that safety net in place by creating the table and populating the admin row the moment the service boots:

// services/user/src/db.js
async function initDb(customPool = getPool()) {
  await customPool.query(`
    CREATE TABLE IF NOT EXISTS users (
      id TEXT PRIMARY KEY,
      username TEXT UNIQUE NOT NULL,
      password_hash TEXT NOT NULL,
      role TEXT NOT NULL DEFAULT 'user'
    );
  `);

  const { rows } = await customPool.query('SELECT id FROM users WHERE username = $1 LIMIT 1', [
    config.ADMIN_USERNAME,
  ]);

  if (rows.length === 0 && config.ADMIN_PASSWORD) {
    const passwordHash = await hashPassword(config.ADMIN_PASSWORD);
    await customPool.query(
      'INSERT INTO users (id, username, password_hash, role) VALUES ($1, $2, $3, $4)',
      [crypto.randomUUID(), config.ADMIN_USERNAME, passwordHash, 'admin'],
    );
  }
}

Authentication sits in a small middleware that checks Bearer tokens and attaches the decoded data to the request. The cryptography helpers stay in their own module so the rest of the code can trust req.user without drama, and so future changes to signing logic happen in one place:

// services/user/src/auth-middleware.js
function authRequired(req, _res, next) {
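  const header = req.headers.authorization || '';
  const [scheme, token] = header.split(' ');

  // Reject anything that is not "Bearer <token>" (the scheme is case-insensitive)
  if (!/^bearer$/i.test(scheme) || !token) {
    return next(new UnauthorizedError('Missing bearer token'));
  }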

  try {
    const payload = verifyToken(token);
    req.user = { id: payload.sub, username: payload.username, role: payload.role };
    return next();
  } catch (error) {
    return next(new UnauthorizedError('Invalid token'));
  }
}

// services/user/src/security.js
function issueToken(payload) {
  return jwt.sign(payload, config.JWT_SECRET, { expiresIn: '1h' });
}

function verifyToken(token) {
  return jwt.verify(token, config.JWT_SECRET);
}

Product Service: Guarding the Catalog

With identity stable, I moved to the catalog. Public routes need to be friendly but safe, so they reject non-numeric pagination settings before running a query, which keeps malformed limit and offset values away from the database:

// services/product/src/service.js
async function fetchProducts(query = {}) {
  if (query.limit !== undefined && isNaN(Number(query.limit))) {
    throw new ValidationError('limit must be numeric');
  }
  if (query.offset !== undefined && isNaN(Number(query.offset))) {
    throw new ValidationError('offset must be numeric');
  }
  return listProducts({ limit: query.limit, offset: query.offset });
}

Admin routes are stricter: the price parser stops invalid or negative numbers before they reach the database, and the admin middleware keeps write actions behind a trusted role so change control stays tight:

// services/product/src/service.js
function parsePrice(price) {
  if (price === undefined) return undefined;
  const value = Number(price);
  if (Number.isNaN(value) || value < 0) {
    throw new ValidationError('price must be a non-negative number');
  }
  return Math.round(value * 100) / 100;
}

async function createProductRecord({ name, description, price }) {
  if (!name || !description) {
    throw new ValidationError('name and description are required');
  }
  const parsedPrice = parsePrice(price);
  if (parsedPrice === undefined) {
    throw new ValidationError('price is required');
  }
  return createProduct({
    id: randomUUID(),
    name,
    description,
    price: parsedPrice,
  });
}

// services/product/src/admin-middleware.js
function adminOnly(req, _res, next) {
  if (!req.user) {
    return next(new UnauthorizedError('Auth required'));
  }

  if (req.user.role !== 'admin') {
    return next(new UnauthorizedError('Admin access required'));
  }

  next();
}

Each product receives a UUID when it is created and is stored in Postgres. That small step keeps tracking clear and makes later integrations easier if this prototype grows into something larger because every product ID stays unique across environments and migrations.

Order Service: Cross-Service Collaboration

Orders were the most satisfying part because they make the services work together and force the boundaries to prove themselves. The handler checks that both userId and productId are present and then calls the product service to confirm the product still exists:

// services/order/src/service.js
async function recordOrder({ userId, productId }) {
  if (!userId || !productId) {
    throw new ValidationError('userId and productId are required');
  }

  const product = await fetchProduct(productId);
  if (!product) {
    throw new ValidationError('product not found');
  }

  return createOrder({ id: randomUUID(), userId, productId });
}

That remote call lives in a small client that normalizes URLs, treats 404s as “not found,” and wraps other errors in a validation message so downstream consumers receive clean, human-readable results:

// services/order/src/clients/product-client.js
async function fetchProduct(productId) {
  const base = config.PRODUCT_SERVICE_URL.endsWith('/')
    ? config.PRODUCT_SERVICE_URL.slice(0, -1)
    : config.PRODUCT_SERVICE_URL;
  const res = await fetch(`${base}/products/${productId}`);

  if (res.status === 404) {
    return null;
  }

  if (!res.ok) {
    throw new ValidationError('product lookup failed');
  }

  const body = await res.json();
  return body.product;
}

The repository stays lean by saving only foreign keys. If the catalog changes later, the order history still reads well, and the service can rebuild richer views by fetching user and product details when needed, which keeps the storage footprint small and the coupling loose:

// services/order/src/repository.js
async function createOrder({ id, userId, productId }, pool = getPool()) {
  await pool.query(
    'INSERT INTO orders (id, user_id, product_id) VALUES ($1, $2, $3)',
    [id, userId, productId],
  );
  return mapOrder({ id, user_id: userId, product_id: productId, created_at: new Date() });
}

API Gateway and Service-to-Service Communication

From the start I wanted one door for clients. The gateway connects everything, and the proxyTo helper does the heavy lifting by taking an incoming request, rebuilding the destination URL, forwarding most of the request headers, and relaying the downstream status and body back to the client:

// gateway/src/index.js
function proxyTo(baseUrl) {
  const normalizedBase = baseUrl.endsWith('/') ? baseUrl.slice(0, -1) : baseUrl;

  return async (req, res, next) => {
    try {
      const targetUrl = new URL(req.originalUrl, `${normalizedBase}/`).toString();

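      const headers = { ...req.headers };
      delete headers.host;
      // The body is re-serialized below, so let fetch compute its own length
      delete headers['content-length'];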

      const init = {
        method: req.method,
        headers,
      };

      if (req.method !== 'GET' && req.method !== 'HEAD') {
        init.body = req.body ? JSON.stringify(req.body) : undefined;
        init.headers['content-type'] = 'application/json';
      }

      const response = await fetch(targetUrl, init);
      const text = await response.text();
      res.status(response.status);
      try {
        const parsed = JSON.parse(text || '{}');
        res.json(parsed);
      } catch (_err) {
        res.send(text);
      }
    } catch (error) {
      next(error);
    }
  };
}

The routes mount each downstream service under a clean prefix, which keeps the public API steady even if I move services around inside the cluster and makes documentation easier for anyone consuming the gateway:

// gateway/src/index.js
router.use('/users', proxyTo(serviceConfig.userServiceUrl));
router.use('/products', proxyTo(serviceConfig.productServiceUrl));
router.use('/orders', proxyTo(serviceConfig.orderServiceUrl));

Inside the system, the order service calls the product service through the same HTTP endpoints. The approach is intentionally simple because it matches what many teams already run. Right now those calls trust the network and do not add extra authentication, so improving that handshake is near the top of my hardening list. When I explore rate limiting or service discovery, the gateway will be the natural place to add them.

Configuration, Security, and Secrets Management

One personal rule for the project was simple: avoid “works on my machine” bugs. Every service reads configuration through env.getConfig, which applies defaults, checks required values, and handles small type conversions before the app even starts:

// services/product/src/config.js
env.loadEnv({ files: [path.join(__dirname, '..', '.env')] });

const config = env.getConfig({
  PORT: { default: 3002, parser: Number },
  DATABASE_URL: { default: 'postgres://product_service:password@localhost:5434/product_db' },
  JWT_SECRET: { default: 'devsecret', required: true },
});

When the stack runs in Kubernetes, the JWT secret comes from a cluster secret instead of shipping inside the image, which means the secret can be rotated without rebuilding containers:

# k8s/secret.yaml
apiVersion: v1
kind: Secret
metadata:
  name: jwt-secret
  namespace: mini-ecommerce
stringData:
  value: devsecret

The user service issues tokens with that secret, the other services verify them locally, and role checks—like the admin filter in the product service—use the decoded payload to make decisions.

Local Development Workflow

Weekend hacking works only if the feedback loop stays short, so Docker Compose became the main control room:

Install dependencies once with npm install so every workspace shares the same node_modules tree.
Run npm run compose:up to launch the three services, the gateway, and their Postgres companions (using the compose file shown above) and let Docker wire the local network for you.
Send every request through http://localhost:8080 so the gateway path stays well traveled and the API surface mirrors production traffic.

Right now the services run with plain node processes, so I still restart them by hand when code changes. Hot reloaders are on the to-do list, but even without them the shared package keeps logs and errors consistent. Docker volumes remember the seeded catalog and test users between runs, so I can experiment, restart, and keep moving without rebuilding the database every time.

Deploying to Kubernetes

By Sunday afternoon curiosity won. I wanted to watch the system run inside a cluster, so the manifests in k8s/ mirror the Compose layout almost line for line.

The user service deployment is representative of the pattern:

# k8s/user-service.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: user-service
spec:
  replicas: 1
  # selector and pod labels omitted here for brevity; apps/v1 requires both
  template:
    spec:
      containers:
        - name: user-service
          image: mini-ecommerce-user:latest
          env:
            - name: JWT_SECRET
              valueFrom:
                secretKeyRef:
                  name: jwt-secret
                  key: value
          readinessProbe:
            httpGet:
              path: /healthz
              port: 3001

The gateway pairs a deployment with an ingress so there is one public entry point:

# k8s/gateway.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-gateway
spec:
  template:
    spec:
      containers:
        - name: api-gateway
          image: mini-ecommerce-gateway:latest
          env:
            - name: USER_SERVICE_URL
              value: http://user-service
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: api-gateway
spec:
  rules:
    - http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: api-gateway
                port:
                  number: 80

Dedicated Postgres deployments keep data siloed per service, honoring the “database per service” mantra without any shared state leaks.

With images tagged—think mini-ecommerce-user:latest—a kubectl apply -f k8s/ sets up the same architecture I run locally. Rolling updates and restarts behave the way I expect, which makes this repo a comfortable sandbox for practicing cluster operations. Secrets ship with kubectl apply -f k8s/secret.yaml, and the workload manifests read them as environment variables; config maps follow the same pattern for plain settings.

Observability, Testing, and Next Experiments

I kept observability light but friendly. The logger shown earlier prefixes every line with a service name, so one tail -f gives a clear picture of who is talking. Tests live next to the code inside each service’s __tests__ folder; they mix unit checks with small integration cases so I can change a function and still trust the boundaries, and they double as documentation because they show how the modules are meant to collaborate.

There is still plenty to explore. A message broker for order events, circuit breakers inside the product client, and rate limiting at the gateway are already on the list. The current setup leaves room for those ideas without tearing up the base.

What I Relearned

Clear domain boundaries keep ownership simple and give every rule a home.
A small shared toolkit (@mini/shared) stops the team—future me included—from rebuilding the same helpers.
The API gateway protects client URLs while backend services evolve in private.
Matching the local Compose setup inside Kubernetes lowers the stress when promoting changes.

The weekend build reminded me that microservices are less about counting repositories and more about choosing clear boundaries. Steady ownership, honest contracts, and repeatable operations beat shiny patterns every time. Now that this mini eCommerce system lives in the toolbox, I can reopen the code and the lessons whenever I need a quick refresher.

Scaling Healthcare Data Processing: Multi-Environment FHIR Patient Updates with Smart Batch Processing

Budi Widhiyanto — Tue, 23 Sep 2025 05:36:58 +0000

The request sounded simple: “Can we keep patient phone numbers up to date?”

At first we thought it was a quick operations chore. Then we traced the real data flow and saw the mess underneath. Phone numbers rolled in from WhatsApp, hospital front desks, and survey tools, each with its own format. Patients jumped between facilities, so their trails were often broken. The operations team lived in Google Sheets, and every region guarded its own FHIR server with different credentials, limits, and quirks.

Our first fix was a tiny script that looped through one row at a time. On a test file it worked fine, but once we aimed it at 10,000 rows the run dragged on for hours, chewed through hundreds of megabytes of memory, and could crash if a single record looked wrong.

This article is the story of how that fragile script became a production-ready workflow. The same 10,000-row load now finishes in about 10–12 minutes per region, using only 256Mi memory and 0.5 vCPU. More important, it stays steady, it survives bad data, and operations teams are happy to run it every day.

From One Region to a Platform

The moment a second region joined the queue, the to-do list grew fast. We suddenly needed to:

Serve multiple regions, such as Purbalingga and Lombok Barat, at the same time without stepping on each other.
Keep every environment on its own FHIR endpoint, spreadsheet, and credential set, with no mixing, ever.
Give operators live feedback with status updates, readable logs, and a safe way to restart when things went wrong.

At that point the one-off script had nowhere to grow. We needed a real architecture that could respect those boundaries.

The Challenge: Scale, Isolation, and Real-World Limits

Running two regions side by side exposed the real limits:

Tens of thousands of records every day.
Updates that had to finish in under 15 minutes.
Tight resource limits (very small memory and CPU).
Zero tolerance for mixing environments.

We also needed more resilience: a single broken record could not freeze the run, and the FHIR servers deserved a gentle pace so they never tipped into overload.

Why the Simple Loop Fails

# Inefficient sequential approach
for record in all_records:
    patient = find_patient(record['identifier'])
    update_patient_phone(patient, record['phone_number'])

This loop looked fine in early tests. In production it fell apart: every row opened new network calls, memory crept upward, progress stayed invisible, and one exception could stop the whole job. That simple design hurt us later on.

The Solution: Smart Batching Across Multiple Environments

The turning point came when we stopped thinking about “a script that updates phones” and started thinking about “a pipeline that needs to stay healthy.” Stability, visibility, and consistency became the main goals.

Once we named those needs, the design almost wrote itself: batch the work, reuse connections, pace the requests, and send every result back into the spreadsheets everyone already trusted. On top of that, make sure each batch leaves a clear log trail so operators can watch the system move.

Architecture Overview

┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│  Google Sheets   │    │   Flask Web      │    │    FHIR Server  │
│   (per region)   │───▶│   Application    │───▶│    (per region) │
│                  │    │                  │    │                 │
│ • Regional rows  │    │ • Batch engine   │    │ • Patient query │
│ • Status column  │    │ • Memory hygiene │    │ • Phone update  │
│ • Daily feeds    │    │ • Safe retries   │    │ • Rate limiting │
└─────────────────┘    └──────────────────┘    └─────────────────┘
         │                       │
         └───────────────┬───────┘
                         │
                ┌────────────────────┐
                │  Config & Secrets  │
                │  (per environment) │
                │  JSON + env vars   │
                └────────────────────┘

In practice, each region runs end to end on its own. Operators manage rows in a regional Google Sheet, the Flask app reads that sheet and processes batches, and every update goes to the matching FHIR server. The config and secrets layer supplies the right credentials and URLs per run, so requests stay isolated and nothing leaks across environments.

The Batch Engine (Built for Production)

def process_records(records, worksheet, fhir_url, headers):
    start_time = time.time()
    results = {
        'status': 'success',
        'total_records': len(records),
        'successful_updates': 0,
        'failed_updates': 0,
        'patients_not_found': 0,
        'processing_time_minutes': 0,
        'batches_processed': 0,
        'errors': []
    }

    try:
        session = setup_session()
        batch_size = 100
        logger.info(f"Starting batch processing with batch size: {batch_size}")

        for i in range(0, len(records), batch_size):
            batch = records[i:i+batch_size]
            batch_num = i // batch_size + 1
            total_batches = (len(records) - 1) // batch_size + 1
            logger.info(f"=== Processing batch {batch_num}/{total_batches} ({len(batch)} records) ===")

            for idx, record in enumerate(batch):
                try:
                    row_index   = record['row_index']
                    identifier  = record['identifier']
                    phone       = record['phone_number']
                    record_num  = i + idx + 1

                    if record_num % 500 == 0:
                        logger.info(f"=== PROGRESS: {record_num}/{len(records)} processed ===")

                    patients = find_patient(session, identifier, fhir_url, headers)
                    if not patients:
                        results['patients_not_found'] += 1
                        msg = f"Patient {identifier} not found"
                        results['errors'].append(msg)
                        update_worksheet_status(worksheet, row_index, "failed", msg)
                        continue

                    all_success = True
                    failed_msgs = []
                    success_count = 0

                    for patient in patients:
                        ok, message = update_patient_phone(
                            session, patient['id'], phone, fhir_url, headers
                        )
                        if ok:
                            success_count += 1
                        else:
                            all_success = False
                            failed_msgs.append(f"Patient {patient['id']}: {message}")

                    # Count successful patient updates even when the row partly failed
                    results['successful_updates'] += success_count
                    if all_success:
                        update_worksheet_status(worksheet, row_index, "success")
                    else:
                        results['failed_updates'] += (len(patients) - success_count)
                        msg = f"Failed {len(failed_msgs)}/{len(patients)}"
                        results['errors'].extend(failed_msgs)
                        update_worksheet_status(worksheet, row_index, "failed", msg)

                    time.sleep(0.1)  # be kind to downstreams

                except Exception as e:
                    results['failed_updates'] += 1
                    msg = f"Error processing {record.get('identifier', 'unknown')}: {e}"
                    results['errors'].append(msg)
                    logger.exception(f"❌ {msg}")
                    continue

            # Keep memory flat
            del batch
            time.sleep(0.1)
            results['batches_processed'] += 1

        results['processing_time_minutes'] = round((time.time() - start_time) / 60, 2)
        logger.info(f"Processing completed in {results['processing_time_minutes']} minutes")

    except Exception as e:
        results['status'] = 'error'
        results['message'] = f'Processing failed: {e}'
        logger.exception("Fatal error")

    return results

The batch engine became the system’s heart. Inside the loop you can see how it slices the sheet into blocks of 100 rows, logs the batch number, and keeps track of every success or failure. The short pause after each record and each batch slows the request rate so the FHIR servers never get hammered, and releasing each batch once it is done keeps memory flat. Even when a row misbehaves, the error handler writes it down and the loop keeps going, which means operators still see steady progress instead of a half-finished run.
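The slicing arithmetic inside that loop can be pulled out into a small generator. This is only a minimal sketch, not the production code: iter_batches is a hypothetical helper name, and it assumes the records are already loaded into a list.

```python
from typing import Iterator, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def iter_batches(
    records: Sequence[T], batch_size: int = 100
) -> Iterator[Tuple[int, int, List[T]]]:
    """Yield (batch_num, total_batches, batch) for fixed-size slices of records."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    # Ceiling division without floats: 250 records at size 100 gives 3 batches
    total_batches = (len(records) + batch_size - 1) // batch_size
    for start in range(0, len(records), batch_size):
        batch_num = start // batch_size + 1
        yield batch_num, total_batches, list(records[start:start + batch_size])


if __name__ == "__main__":
    rows = list(range(250))
    for num, total, batch in iter_batches(rows, batch_size=100):
        print(f"batch {num}/{total}: {len(batch)} records")
```

Keeping the arithmetic in one place makes the last, shorter batch and the empty-input case easy to test on their own.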

Practices That Made the System Work

1) Batch Size That Fits Reality

When we tried tiny batches the system spent more time setting up than doing real work. When we went too big, the process grabbed extra memory and slowed everything down. After a few trial runs, 100 records felt balanced: quick to process, light on resources, and easy to monitor in the logs and in the sheet.

2) Connection Pooling and Safe Retries

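import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def setup_session():
    """Create one pooled session that retries transient HTTP failures."""
    session = requests.Session()
    # Up to 3 retries with exponential backoff on throttling and server errors
    retry = Retry(
        total=3,
        backoff_factor=1,
        status_forcelist=[429, 500, 502, 503, 504]
    )
    adapter = HTTPAdapter(max_retries=retry)
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session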

Pooling the HTTP session kept us from opening a fresh connection for every row, which trimmed latency and CPU spikes. The retry helper then waited a little longer after each failure, so short network hiccups cleared on their own instead of breaking the run. With those two pieces in place, the pipeline finished sooner and recovered smoothly from the usual internet noise.
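To make the "wait a little longer after each failure" idea concrete, here is a standalone sketch of exponential backoff. It is not urllib3's implementation (its exact schedule depends on the library version); TransientError, backoff_delays, and call_with_retry are hypothetical names used only for illustration.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (for example HTTP 429 or 503)."""


def backoff_delays(retries: int = 3, factor: float = 1.0) -> List[float]:
    """Delays that double on each attempt: factor, 2*factor, 4*factor, ..."""
    return [factor * (2 ** attempt) for attempt in range(retries)]


def call_with_retry(
    func: Callable[[], T],
    retries: int = 3,
    factor: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying on TransientError with exponential backoff."""
    for delay in backoff_delays(retries, factor):
        try:
            return func()
        except TransientError:
            sleep(delay)  # wait longer after each consecutive failure
    return func()  # final attempt; any error now propagates to the caller
```

Passing sleep as a parameter is a design choice for testability: a test can record the delays instead of actually waiting.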

3) Explicit Memory Hygiene

# keep memory flat
del batch
time.sleep(0.1)

Explicitly deleting each batch and pausing briefly kept memory flat in our runs. No creeping leaks, no surprises during long jobs.

4) Pacing to Protect Servers

time.sleep(0.1)  # caps the loop at roughly 10 records/sec

A short delay between requests prevented FHIR servers from being overwhelmed. Paradoxically, slowing down slightly made the whole system finish faster, because retries and throttling were reduced.
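A fixed sleep adds its delay on top of whatever time the request itself took. One alternative, sketched here under the assumption that a minimum spacing between requests is the real goal, is a pacer that sleeps only for the unused part of the interval. Pacer is a hypothetical class, not part of the pipeline described above.

```python
import time
from typing import Callable, Optional


class Pacer:
    """Keep at least min_interval seconds between the starts of consecutive calls."""

    def __init__(
        self,
        min_interval: float = 0.1,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_start: Optional[float] = None

    def wait(self) -> float:
        """Sleep for the part of the interval not already used; return the sleep time."""
        now = self._clock()
        remaining = 0.0
        if self._last_start is not None:
            remaining = max(0.0, self.min_interval - (now - self._last_start))
        if remaining > 0:
            self._sleep(remaining)
        # Record when the next call is allowed to begin
        self._last_start = now + remaining
        return remaining
```

Calling pacer.wait() before each FHIR request keeps the same ceiling on request rate, while slow responses no longer pay the full delay a second time.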

Features That Build Trust

Real-Time Status in Sheets

def update_worksheet_status(worksheet, row_index, status, error_message=None):
    """
    Update the status_update_phone_number column (Column G)
    """
    try:
        status_column = 7  # Column G (1-indexed)
        value = "success" if status == "success" else (
            f"failed: {error_message}" if error_message else "failed"
        )
        worksheet.update_cell(row_index, status_column, value)
        logger.info(f"Row {row_index} (G): {value}")
    except Exception as e:
        logger.error(f"Failed to update status for row {row_index}: {e}")

Operators never asked for a new dashboard; they just wanted their sheet to tell them what happened. This helper does exactly that, dropping “success” or “failed” (with the reason) straight into Column G so they can watch the run move row by row.
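Each update_cell call is a separate request to the Sheets API. If that ever becomes a bottleneck, the statuses for a whole batch can be collected and written in one call. The sketch below only builds the payload; it assumes the worksheet object comes from gspread, whose batch_update method accepts a list of range/values dictionaries. format_status and build_status_updates are hypothetical helper names.

```python
from typing import Dict, List, Optional, Tuple

STATUS_COLUMN_LETTER = "G"  # the same column the helper above writes to


def format_status(status: str, error_message: Optional[str] = None) -> str:
    """Mirror the cell text written by update_worksheet_status."""
    if status == "success":
        return "success"
    return f"failed: {error_message}" if error_message else "failed"


def build_status_updates(
    results: List[Tuple[int, str, Optional[str]]]
) -> List[Dict[str, object]]:
    """Turn (row_index, status, error_message) tuples into A1-notation updates."""
    return [
        {
            "range": f"{STATUS_COLUMN_LETTER}{row}",
            "values": [[format_status(status, err)]],
        }
        for row, status, err in results
    ]


# Assumed usage with a gspread worksheet, once per batch of 100 rows:
#   worksheet.batch_update(build_status_updates(batch_results))
```

The trade-off is visibility: operators would see statuses appear one batch at a time rather than row by row.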

Structured Batch Logs and Alerts

Every batch writes a compact log entry with the batch number, record count, and any failures. Those logs land in Cloud Logging and a small alerting rule pings the on-call channel when something looks off. If a row fails, the operator spots it in the sheet and can jump straight to the matching log line because the correlation ID is right there in the message.

Handle Multiple Patients per Identifier

def find_patient(session, identifier, fhir_url, headers):
    """Return every Patient resource that matches the identifier (may be empty)."""
    url = f"{fhir_url}/Patient"
    params = {'identifier': identifier}
    r = session.get(url, params=params, headers=headers, timeout=30)
    if r.status_code == 200:
        data = r.json()
        # Bundle.total is optional in FHIR search results, so rely on 'entry'
        return [e['resource'] for e in data.get('entry', []) if 'resource' in e]
    return []

Some identifiers pointed to more than one Patient record. Rather than pretend the duplicates did not exist, the system updates each match so that all copies stay aligned, even when the source data is messy.
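FHIR search results are returned as a Bundle that may be split across pages, with a "next" link pointing at the following page. When one identifier can match several patients, it helps to read the Bundle defensively. The helpers below are a minimal sketch working on plain dictionaries; next_page_url and patients_in are hypothetical names.

```python
from typing import Any, Dict, List, Optional


def next_page_url(bundle: Dict[str, Any]) -> Optional[str]:
    """Return the URL of the Bundle's 'next' link, or None on the last page."""
    for link in bundle.get("link", []):
        if link.get("relation") == "next":
            return link.get("url")
    return None


def patients_in(bundle: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Collect Patient resources from a searchset Bundle, tolerating a missing 'entry'."""
    return [
        entry["resource"]
        for entry in bundle.get("entry", [])
        if entry.get("resource", {}).get("resourceType") == "Patient"
    ]


if __name__ == "__main__":
    page = {
        "resourceType": "Bundle",
        "type": "searchset",
        "link": [{"relation": "next", "url": "https://example.org/fhir?page=2"}],
        "entry": [{"resource": {"resourceType": "Patient", "id": "a1"}}],
    }
    print([p["id"] for p in patients_in(page)], next_page_url(page))
```

A caller would keep requesting next_page_url(bundle) until it returns None, collecting patients from each page along the way.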

Phone Number Cleaning

def clean_phone_number(phone):
    if not phone:
        return None

    phone = str(phone).strip()

    # Sheets can turn big numbers into scientific notation
    if 'e+' in phone.lower() or 'e-' in phone.lower():
        try:
            phone = f"{float(phone):.0f}"
        except ValueError:
            logger.warning(f"Could not parse scientific notation: '{phone}'")

    # If multiple numbers, take the first
    if ',' in phone:
        phone = phone.split(',')[0].strip()

    # Keep only digits and plus
    phone = re.sub(r'[^\d+]', '', phone)

    # Indonesian heuristic: restore missing leading zero
    if not phone.startswith('+') and phone.startswith('8') and len(phone) in [11, 12]:
        phone = '0' + phone
        logger.info(f"Added missing leading zero: '{phone}'")

    # Basic sanity check
    if len(phone.replace('+', '')) < 8:
        return None

    return phone

Phone numbers are notoriously messy. Sheets loves to turn big numbers into scientific notation, people paste in two numbers separated by commas, and in Indonesia a missing leading zero can point to the wrong person. The cleaner walks through each of those cases so the final value is something we can safely send to FHIR.
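One caveat about the scientific-notation step is worth seeing in isolation: expanding the string recovers the full number only if the exported text still carries every digit. The snippet below is a small illustration of that; expand_scientific is a hypothetical name that repeats the single formatting expression used in the cleaner.

```python
def expand_scientific(text: str) -> str:
    """Expand a scientific-notation string into plain digits, as the cleaner does."""
    return f"{float(text):.0f}"


full = expand_scientific("8.12345678901e+11")  # every digit survived the export
lossy = expand_scientific("8.12346E+11")       # the value was rounded before we saw it

print(full)   # 812345678901
print(lossy)  # 812346000000
```

In the second case the missing digits cannot be restored, so a cleaned value that ends in a long run of zeros may be worth flagging for manual review rather than sending to FHIR.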

Multi-Environment Deployment: One Region = One Tenant

JSON Config as Source of Truth

{
  "name": "environment-name",
  "project_id": "your-gcp-project-id",
  "service_name": "your-cloud-run-service-name",
  "region": "asia-southeast2",
  "platform": "managed",
  "memory": "256Mi",
  "cpu": "0.5",
  "timeout": "900",
  "max_instances": "1",
  "min_instances": "0",
  "concurrency": "1",
  "credential_file": "service-account-credentials.json",
  "env_vars": {
    "FHIR_SERVER_URL": "https://your-fhir-server.com/fhir",
    "FHIR_API_TOKEN": "your-api-token",
    "SPREADSHEET_ID": "your-google-sheets-id",
    "WORKSHEET_ID": "your-worksheet-name"
  }
}

Each environment (Purbalingga, Lombok Barat, and friends) gets its own JSON file. The application code stays the same, while the config file names the project, credentials, and spreadsheet for that region. That simple split keeps the runs isolated, makes audits easy, and lets us roll back a region without touching the others.
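The deploy script itself is not shown here, but a config file like this maps naturally onto gcloud run deploy flags. The sketch below is one possible way to do that mapping; build_deploy_command, REQUIRED_KEYS, and the image argument are assumptions for illustration, not the actual tooling.

```python
import json
from typing import Any, Dict, List

REQUIRED_KEYS = ("project_id", "service_name", "region", "env_vars")


def build_deploy_command(config: Dict[str, Any], image: str) -> List[str]:
    """Translate one environment's JSON config into a gcloud argument list."""
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise ValueError(f"config is missing required keys: {missing}")
    # Sorted so the generated command is stable between runs
    env_pairs = ",".join(f"{k}={v}" for k, v in sorted(config["env_vars"].items()))
    return [
        "gcloud", "run", "deploy", config["service_name"],
        "--image", image,
        "--project", config["project_id"],
        "--region", config["region"],
        "--platform", config.get("platform", "managed"),
        "--memory", config.get("memory", "256Mi"),
        "--cpu", str(config.get("cpu", "0.5")),
        "--timeout", str(config.get("timeout", "900")),
        "--max-instances", str(config.get("max_instances", "1")),
        "--min-instances", str(config.get("min_instances", "0")),
        "--concurrency", str(config.get("concurrency", "1")),
        "--set-env-vars", env_pairs,
    ]


if __name__ == "__main__":
    sample = json.loads(
        '{"project_id": "demo-project", "service_name": "phone-updater",'
        ' "region": "asia-southeast2", "env_vars": {"SPREADSHEET_ID": "abc"}}'
    )
    print(" ".join(build_deploy_command(sample, "gcr.io/demo-project/phone-updater:1")))
```

Building the command as a list keeps it safe to hand to subprocess.run without shell quoting. Values that contain commas would need different handling, since the comma is the separator in --set-env-vars.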

Cloud Run Profile Per Environment

Each region deploys to its own Cloud Run service with a lean profile: 256Mi memory, 0.5 vCPU, and a single instance. It keeps costs low, keeps performance predictable, and matches the steady pace we designed for.

Lessons Learned

When we stepped back after the first few successful runs, a handful of habits stood out.

Regional boundaries matter. Keeping configs and credentials per region meant every incident stayed where it started. If Lombok Barat hit a problem, Purbalingga kept running without even noticing.

100-record batches are the sweet spot. That size is big enough to move quickly but small enough to avoid memory spikes. It also lines up nicely with the logging we added, so operators can read progress in plain language.

The spreadsheet is still the source of truth. By writing results directly into Column G, we gave operators instant trust. They did not need to learn a new tool; their everyday sheet became the dashboard.

Polite clients make for calm servers. Gentle pacing and retries with backoff handled the usual internet noise. Instead of chasing flaky errors, we saw quiet logs and smooth throughput.

Clean data upfront saves pain later. Fixing phone numbers at the edge kept downstream systems clean. Once we did that, support tickets about wrong contacts dropped sharply.

Conclusion

The journey took us from a fragile script to a workflow the team can trust. We didn’t introduce exotic technology; we simply leaned on good habits: clear boundaries, careful batching, shared connections, steady pacing, and honest visibility.

Today the 10,000-row jobs finish in about 10–12 minutes per environment. Memory stays flat. Operators watch the spreadsheet fill with results while alerts stay quiet.

For us, that’s what healthcare data scaling looks like: not only faster runs, but calmer shifts, clearer feedback, and an architecture that can keep growing with the organization.

Enhancing FHIR EpisodeOfCare Resources: Improving Interoperability and Performance

Budi Widhiyanto — Wed, 03 Sep 2025 11:19:03 +0000

Healthcare runs on data, but too often, that data doesn't feel like it was designed for the people who use it. Doctors and administrators regularly face cryptic codes when what they really need is a name, an identifier, or a simple confirmation of which hospital is in charge.

This was exactly the challenge we faced with FHIR EpisodeOfCare resources. By default, they tracked patient care episodes using reference codes, but those references weren't very useful in practice. They were difficult to query, slowed down workflows, and created friction in sharing data across organizations.

We set out to fix that by enhancing EpisodeOfCare resources with richer references, adding patient identifiers, display names, and organization names directly into the resource. The reason was simple: more detailed references make queries easier, reduce ambiguity, and dramatically improve interoperability, especially in federated healthcare environments where multiple hospitals, clinics, and systems exchange data.

But there was another challenge: scale. It wasn't just about a handful of resources. We had to handle more than 10,000 EpisodeOfCare records. Enhancing that many records required a reliable, scalable, and cost effective approach.

The solution came through batch processing, powered by Google Cloud Healthcare API, a Python-based enhancement engine, and serverless deployment on Cloud Run. The results were remarkable: thousands of resources enhanced in minutes, at low cost, with 100% success. Most importantly, the data became more useful for both people and systems.

1. Introduction: The Interoperability Challenge

In today's healthcare environment, data lives everywhere. Patients might receive care at multiple facilities across different regions, each with its own information system. These systems often don't agree on how to represent data.

FHIR was designed as a bridge, providing a standard way to share healthcare information. But standards leave room for interpretation, and sometimes that room creates gaps.

The EpisodeOfCare resource is a good example. It records the period during which a patient is under care, but the way it references patients and organizations often leaves out the details that make data useful in practice.

Here's what a typical EpisodeOfCare resource looked like before enhancement:

{
  "patient": {
    "reference": "Patient/f0c29c2a-1f20-4b9f-b49c-8d4ad487a6e8"
  },
  "managingOrganization": {
    "reference": "Organization/100007732"
  }
}

On the surface, it seems fine. But in practice:

The patient's name is missing.
Their identifier, such as a national ID (NIK), is missing.
The organization's name is missing.

This makes queries harder. If you want to filter EpisodeOfCare resources by patient name, you can't do it directly. If you want to find all patients managed by a specific hospital, you have to look up organization codes separately. In federated systems, where data needs to be shared seamlessly across multiple institutions, this lack of clarity makes interoperability fragile.

The result: more work for healthcare staff, more room for errors, and less effective use of valuable data.

2. Our Approach: Adding the Missing Pieces

We set out to make EpisodeOfCare resources more useful, more searchable, and more interoperable.

The key was enhancing the references so that they carried not just codes, but context.

Patient identifiers (NIKs): By embedding identifiers directly into EpisodeOfCare, we made it easy to query by ID and match patients across systems.
Patient display names: Adding names meant healthcare workers could instantly see who the record referred to, and systems could query resources by name.
Organization display names: Embedding the managing organization's actual name improved both usability and data federation across institutions.

Here's what the same resource looked like after enhancement:

{
  "patient": {
    "reference": "Patient/f0c29c2a-1f20-4b9f-b49c-8d4ad487a6e8",
    "display": "Dr. YULIANI SUSANTI",
    "identifier": {
      "system": "https://fhir.kemkes.go.id/id/nik",
      "value": "3271234567890123"
    }
  },
  "managingOrganization": {
    "reference": "Organization/100007732",
    "display": "RSUD PURBALINGGA"
  }
}

Now, queries become much simpler. Need to find all patients managed by RSUD PURBALINGGA? You can query directly on the display field. Need to check for duplicate NIKs? The identifier is already there.

By enriching references, we didn't just improve usability, we made the data easier to search, more consistent across systems, and more reliable in federated environments.
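As an illustration of the identifier case, FHIR R4 search defines an :identifier modifier on reference parameters that matches the identifier stored inside the reference. Whether a given server supports it is an assumption to verify. The helper below only builds the search URL; episode_search_url and the example base URL are hypothetical.

```python
from urllib.parse import urlencode

NIK_SYSTEM = "https://fhir.kemkes.go.id/id/nik"


def episode_search_url(fhir_base: str, nik: str) -> str:
    """Build an EpisodeOfCare search that matches on the patient reference's identifier."""
    # Token syntax is system|value; urlencode percent-encodes the special characters
    query = urlencode({"patient:identifier": f"{NIK_SYSTEM}|{nik}"})
    return f"{fhir_base.rstrip('/')}/EpisodeOfCare?{query}"


if __name__ == "__main__":
    print(episode_search_url("https://example.org/fhir/", "3271234567890123"))
```

Without the embedded identifier, the same lookup needs two steps: find the Patient by NIK first, then search EpisodeOfCare by that patient's id.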

3. The Big Challenge: Scaling Up

Enhancing one EpisodeOfCare resource was easy. Enhancing over 10,000 resources was a serious technical challenge.

Processing each record individually would have been too slow and too fragile. We risked hitting API rate limits, overloading memory, and wasting time. In a federated environment, where performance and reliability are critical, this was unacceptable.

The solution was batch processing. Instead of treating each record as a separate job, we processed them in groups. This allowed us to control throughput, recover gracefully from errors, and optimize performance.

With smart pagination and configurable parameters, the system could handle datasets of any size. Whether it was 1,000 records or 10,000, the engine processed them efficiently.

In practice, this meant:

Thousands of resources processed in under half an hour.
Hundreds of API calls per minute without hitting limits.
Full runs costing less than a dollar, even at large scale.

The leap from hundreds to tens of thousands of records proved that the system wasn't just functional; it was scalable.

4. Inside the Engine

At the heart of the system was a Python-based enhancement engine. Its job was to fetch EpisodeOfCare resources, enrich their references, and save them back in enhanced form.

The logic for enhancing patients, for example, looked like this:

def enhance_patient_reference(self, patient_ref, patient_resource):
    """Add NIK identifier and display name to patient reference"""
    enhanced_ref = patient_ref.copy()

    nik = self.extract_nik_from_patient(patient_resource)
    if nik:
        enhanced_ref['identifier'] = nik

    display_name = self.extract_patient_display_name(patient_resource)
    if display_name:
        enhanced_ref['display'] = display_name

    return enhanced_ref

This small piece of code hides a lot of sophistication. It knows how to handle multiple NIK formats, build names from different components, and gracefully skip enhancements if data is missing.
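The two extraction helpers are not shown above, so the following is only a sketch of what they might do when written as plain functions over a Patient dictionary. The NIK system URL comes from the enhanced example earlier; the fallback order for names (text first, then prefix, given, and family) is an assumption.

```python
from typing import Any, Dict, Optional

NIK_SYSTEM = "https://fhir.kemkes.go.id/id/nik"


def extract_nik(patient: Dict[str, Any]) -> Optional[Dict[str, str]]:
    """Return {'system', 'value'} for the first non-empty NIK identifier, or None."""
    for ident in patient.get("identifier", []):
        value = str(ident.get("value", "")).strip()
        if ident.get("system") == NIK_SYSTEM and value:
            return {"system": NIK_SYSTEM, "value": value}
    return None


def extract_display_name(patient: Dict[str, Any]) -> Optional[str]:
    """Prefer HumanName.text; otherwise join the prefix, given, and family parts."""
    for name in patient.get("name", []):
        if name.get("text"):
            return name["text"].strip()
        parts = [*name.get("prefix", []), *name.get("given", []), name.get("family", "")]
        joined = " ".join(p.strip() for p in parts if p and p.strip())
        if joined:
            return joined
    return None
```

Both functions return None when the data is missing, which matches the behavior described above: the enhancement is skipped instead of writing an empty field.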

But beyond the logic, what mattered most was resilience. Logs tracked every step. Failed calls retried automatically. Errors didn't crash the process. They were logged, and the batch continued. This was essential for handling more than 10,000 records without losing progress.

And of course, everything was designed with security and compliance in mind. Data was encrypted in transit and at rest, authentication was handled securely, and every enhancement left an audit trail.

5. Deployment in the Real World

We chose Google Cloud Run to deploy the service. This gave us a serverless platform that scaled automatically and required almost no infrastructure management.

Each region had its own deployment, with its own configuration. For example, Purbalingga and Lombok Barat each had dedicated endpoints, tuned for their workloads.

We added features that made operations smooth:

Health-check endpoints to confirm the system was running.
Monitoring dashboards to track progress.
Dry-run mode to test enhancements safely.
Automated deployment scripts for fast updates.

The result was a production-ready system that didn't just work in theory, but in the complex, messy reality of healthcare IT.

6. The Results

The numbers tell a powerful story:

Over 10,000 EpisodeOfCare resources enhanced
100% success rate
Zero data loss
293 API calls per minute sustained

But behind the numbers were the real wins:

Doctors could query patients by name or ID without extra clicks.
Administrators no longer had to translate organization codes manually.
Federated systems could exchange EpisodeOfCare data more reliably, because references carried meaningful context.

Healthcare workers reported that tasks which once took minutes now took seconds. And across organizations, interoperability became less of a headache.

7. Looking Ahead

As we reflect on the project, one theme stands out above all others: the importance of the federated healthcare ecosystem.

Healthcare is rarely contained within a single system. Patients move between clinics, hospitals, insurance providers, and government programs. Each of these organizations may run its own IT infrastructure, but the real challenge, and the real opportunity, comes when they need to share data with one another.

This is where enhanced EpisodeOfCare resources make a difference. By embedding identifiers, display names, and organization names directly into the references, we remove friction. Instead of systems struggling to reconcile codes or run complicated joins, they can exchange data that already carries meaningful context.

In a federated setup, this means:

Queries become easier across institutions, because identifiers and names are consistent and visible.
Data exchange becomes more reliable, because the meaning of each reference is clearer.
Patient journeys can be tracked more smoothly, even when care happens across multiple organizations.

The future we see isn't about adding more technology for its own sake. It's about making the data we already have work better in a federated world, where collaboration between systems is the norm, not the exception.

8. Conclusion: Data That Works Better

At the start, our challenge was simple: EpisodeOfCare resources weren't very helpful to humans. They carried references, but not enough detail.

By enhancing those references, we solved three problems at once: we made queries easier, improved day-to-day usability, and strengthened interoperability across federated systems.

Scaling to more than 10,000 records proved that the solution wasn't just a prototype; it was production-ready. And deploying it on the cloud made it secure, cost-efficient, and easy to operate.

In the end, the lesson is clear: good healthcare data is about more than accuracy. It's about accessibility, usability, and interoperability. When data works better, healthcare works better.

This is one step toward a future where healthcare systems truly work together, seamlessly and intelligently. And if there's one takeaway from our journey, it's that improving interoperability doesn't always require massive overhauls. Sometimes, it just takes the right enhancements, applied at the right scale.

Caching Strategies Across Application Layers: Building Faster, More Scalable Products

Budi Widhiyanto — Sun, 23 Mar 2025 06:34:21 +0000

Sarah’s phone buzzed at 2:43 AM. Half-asleep, she answered. On the other end, the on-call engineer sounded stressed:

"The database load just spiked. The app is crawling, and users are reporting timeouts everywhere."

As the product lead for a fast-growing SaaS platform with over 200,000 daily active users, Sarah knew that every minute of slowdown meant frustrated customers—and possibly some of them leaving for good.

The cause? A small feature update had unintentionally bypassed key caching layers, sending every request straight to the database. What should have been a routine release turned into an emergency, taking hours to fix and forcing a partial rollback.

These kinds of issues happen more often than we’d like to admit. In the rush to ship features, caching can feel like an afterthought—something only engineers worry about. But in reality, caching affects all of us, from product managers thinking about user experience to developers balancing performance and reliability.

Caching is essential for applications of every kind, from mobile apps to web platforms and APIs. When implemented correctly, it reduces unnecessary work, prevents delays, and improves user experience. But when we overlook caching, it can lead to stale data, inconsistent behavior, or even system outages like the one Sarah’s team faced.

In this article, we’ll walk through different types of caching, from browser caches that speed up loading times to database caches that reduce repeated queries. Along the way, we’ll share real-world examples, practical strategies, and common mistakes we’ve encountered.

By the end, we hope to have a clearer understanding of how caching fits into product development—not just as a technical detail, but as a tool we can all use to build faster, more reliable applications.

Now that we understand why caching is crucial for performance and scalability, let's start by exploring the first layer of caching: the browser cache. This is often the fastest and easiest way to improve load times for end users.

Browser Caching: The First Line of Defense

Picture this: we're sipping our morning coffee, opening our favorite news app, and it loads instantly. Not because we have the world's fastest internet connection, but because our browser remembers what it downloaded yesterday. That's browser caching at work—the quiet, behind-the-scenes optimization that makes the web feel fast.

When we visit a website for the first time, our browser doesn’t just display the page and forget about it. It stores key resources—the JavaScript that makes buttons work, the CSS that styles the layout, and the images that make the page visually engaging. The next time we visit, instead of downloading everything again, the browser retrieves these cached files in a fraction of the time.

HTTP headers dictate what gets cached and for how long, serving as caching instructions from the server:

Cache-Control: Defines how long to keep a resource before checking for a new version.
ETag: Works like a version number—if the file hasn’t changed, the browser skips downloading it again.
Expires: Sets a specific expiration date for content.
Last-Modified: Lets the browser ask, “Has this changed since the last time I checked?” and reloads only if necessary.
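
To make the header mechanics concrete, here is a minimal sketch of how a server might answer a conditional request using ETag and Cache-Control. The function names (make_etag, respond) and the hash-based ETag are assumptions chosen for illustration, not part of any specific framework, and real servers also handle multiple or weak validators in If-None-Match.

```python
import hashlib


def make_etag(body: bytes) -> str:
    """Derive an ETag from the response body (one common approach, assumed here)."""
    return '"' + hashlib.sha256(body).hexdigest()[:16] + '"'


def respond(body: bytes, request_headers: dict, max_age: int = 3600):
    """Return (status, headers, body) for a GET, honouring If-None-Match."""
    etag = make_etag(body)
    headers = {
        "ETag": etag,
        # The browser may reuse its copy for max_age seconds without asking.
        "Cache-Control": f"public, max-age={max_age}",
    }
    # Simplification: compares a single validator by exact string equality.
    if request_headers.get("If-None-Match") == etag:
        return 304, headers, b""  # Not Modified: the browser keeps its copy
    return 200, headers, body
```

On the first visit the browser receives a 200 with the ETag. Once max-age has passed, it can send that ETag back in If-None-Match and receive an empty 304 instead of the full file.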

To implement browser caching effectively, developers often use tools like Webpack (for asset bundling and versioning), Workbox.js (for managing service worker caching), and browser DevTools (Chrome, Firefox, Safari) to inspect cache behavior. Performance audit tools like Google Lighthouse help measure caching effectiveness alongside other optimizations.

Implementation Strategies

Here are some strategies we can use to make browser caching more effective:

Version-stamping assets. Instead of naming a file main.js, we can use main.d41ef2c.js. The unique fingerprint tells the browser when to fetch a new version, preventing users from seeing outdated files.
Service workers for offline caching. Service workers store and serve critical assets even when there's no internet connection, ensuring smooth offline experiences.
Balancing memory and disk cache. Browsers store frequently used assets in fast memory cache, while less critical ones go to disk cache. Structuring assets properly helps optimize performance.
Preloading and prefetching. Good websites anticipate what users need next. Using <link rel="preload">, we can tell the browser to fetch key assets for the current page early. With <link rel="prefetch">, we can load resources before users even request them, for example prefetching images for the next page they’re likely to visit.
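
Version-stamping is usually handled by a bundler, but the idea fits in a few lines. The sketch below is a simplified illustration, and fingerprinted_name is a hypothetical helper, not a Webpack API. It derives the fingerprint from the file content, so the name changes only when the content does.

```python
import hashlib
from pathlib import PurePosixPath


def fingerprinted_name(filename: str, content: bytes, length: int = 8) -> str:
    """Insert a short content hash before the file extension."""
    digest = hashlib.sha256(content).hexdigest()[:length]
    path = PurePosixPath(filename)
    # main.js -> main.<digest>.js, keeping any directory prefix intact
    return str(path.with_name(f"{path.stem}.{digest}{path.suffix}"))
```

Because the hash depends only on the bytes, rebuilding an unchanged file produces the same name, and returning visitors keep using their cached copy.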

Why This Matters

When browser caching is well-implemented, it does more than just speed up loading times—it improves the overall user experience and optimizes infrastructure efficiency.

Faster load times. Returning visitors don’t have to re-download assets they already have, making pages feel instantly responsive. A content-heavy news site, for example, saw noticeable improvements when implementing proper browser caching, as frequent readers no longer had to reload the same images and scripts every time they visited.
Lower bounce rates. Users expect websites to load quickly, and slow performance often leads to frustration and abandonment. An e-commerce company that optimized its caching strategy found that customers stayed on their product pages longer, leading to better engagement and higher conversion rates.
Improved session continuity. For applications that rely on frequent interactions, caching can make navigation smoother. A media streaming platform optimized caching for its homepage and video thumbnails, ensuring that users could browse without unnecessary delays when switching between content.
Reduced data usage. For mobile users, caching reduces the need to re-download resources, which is particularly valuable for those on limited data plans or in regions with slower network connections. A mobile app improved its usability in areas with spotty internet access by caching key interface elements, allowing users to continue browsing seamlessly even with temporary network disruptions.

Common Pitfalls

While caching improves performance, it comes with challenges that need careful handling:

Over-aggressive caching. Setting cache durations too long for dynamic content can cause users to see outdated information. An e-commerce site once cached its inventory page for six months, leading customers to try purchasing out-of-stock items. To prevent this, it's important to set appropriate expiration times and ensure cache invalidation mechanisms are in place for frequently changing data.
Not caching enough. Some assets, like logos or icons, rarely change but are often fetched repeatedly due to improper caching rules. Without caching, users waste bandwidth downloading the same files on every visit. Identifying truly static assets and assigning them long cache durations helps optimize performance without risking outdated content.
Cache invalidation issues. Even after a successful deployment, users may still see outdated versions of a website due to cache not being properly refreshed. This often happens when file names remain unchanged after an update. Using versioned filenames like main.abc123.js ensures browsers fetch the latest files while still benefiting from caching.
Security and privacy risks. Without proper controls, caching sensitive data can lead to privacy breaches. A banking app once cached account summaries incorrectly, momentarily exposing one user’s balance to another. To prevent such risks, sensitive content should be marked as non-cacheable using headers like Cache-Control: no-store, ensuring it is never stored or served from cache.
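
One way to avoid several of these pitfalls at once is to decide the Cache-Control value in a single place. The sketch below is illustrative only: the path prefixes and the fingerprint pattern are assumptions, and a real policy would follow the application's own routes.

```python
import re

# Assumed convention: fingerprinted assets look like main.d41ef2c.js
FINGERPRINTED = re.compile(r"\.[0-9a-f]{6,}\.(js|css|png|jpg|svg|woff2)$")
# Hypothetical routes that return user-specific, sensitive data
SENSITIVE_PREFIXES = ("/account", "/api/me")


def cache_control_for(path: str) -> str:
    """Pick a Cache-Control header value based on the request path."""
    if path.startswith(SENSITIVE_PREFIXES):
        return "no-store"  # never written to any cache
    if FINGERPRINTED.search(path):
        # The name changes whenever the content changes, so a long lifetime is safe.
        return "public, max-age=31536000, immutable"
    # Everything else may be stored but must be revalidated before reuse.
    return "no-cache"
```

Sensitive routes are checked first, so a fingerprinted file served under /account would still get no-store.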

While browser caching speeds up individual page loads, it has limitations. Users across the globe may experience delays when fetching content from a single server. A visitor in New York might enjoy instant access, while someone in Tokyo faces sluggish load times. This is where Content Delivery Networks (CDNs) step in, delivering content closer to users for a seamless experience.

CDN Caching: Bringing Content Closer to Users

How It Works

Imagine we’re running a global coffee chain. Instead of brewing all our coffee in Jakarta and shipping it worldwide, which would result in cold, stale coffee, we build local shops in every city. That’s essentially what a Content Delivery Network (CDN) does for digital content.

CDNs maintain a vast network of servers—often called edge nodes—strategically positioned around the world. When Marco in Milan or Priya in Pune wants to see our website’s hero image, they don’t have to wait for data to travel from a server in Virginia. Instead, they receive a copy from a nearby edge server, making everything feel much faster no matter where they are.

Modern CDNs go beyond static file caching. Many now support caching API responses, running small programs at the edge using tools like Cloudflare Workers or Lambda@Edge, and even protecting against malicious traffic. It’s like having a combination of a local warehouse, a smart assistant, and a security guard in every city where our users live.

One of the biggest advantages of CDNs is geographical awareness. If our app suddenly gains popularity in South Korea, the CDN automatically ensures those users get the same fast experience as someone near our headquarters.

Different teams select CDNs based on their needs. For example, Cloudflare is widely used for its security and DDoS protection, Fastly is known for real-time cache purging and low latency, AWS CloudFront integrates seamlessly with other AWS services, and Akamai offers a massive global network, making it a preferred choice for large enterprises.

Implementation Strategies

Cache rules configuration. Setting different caching policies for different types of content ensures a balance between freshness and performance. For example, a news organization might set up cache rules as follows:
- Breaking news pages cached for 5 minutes.
- Weekly feature articles for 24 hours.
- Archived content for 30 days.
Cache keys. Cache keys determine what makes content unique. Should two versions of the same product page—one with ?ref=email and one without—be cached separately or treated as the same? One e-commerce company unknowingly created thousands of duplicate caches because they weren’t handling session IDs properly in URLs. A small change to their cache key strategy significantly reduced their CDN costs.
Dynamic content acceleration. Even frequently changing content can benefit from short-lived caching. A financial services app caches personalized portfolio summaries for just 30 seconds, eliminating unnecessary database hits while keeping data fresh enough for users.
Edge functions and workers. Some CDNs allow small programs to run at the edge to dynamically modify responses. A gaming company used edge functions to insert region-specific tournament details into a single cached page, avoiding the need to generate separate pages for each region.
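
The cache key question is easier to reason about with a small example. The sketch below normalizes a URL into a cache key by dropping parameters that do not change the response. The ignored parameter names are assumptions for illustration, and each CDN exposes this through its own configuration rather than through code like this.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit

# Assumed list of parameters that do not affect what the page looks like
IGNORED_PARAMS = {"ref", "sessionid", "utm_source", "utm_medium", "utm_campaign"}


def cache_key(url: str) -> str:
    """Build a normalized cache key from host, path, and meaningful query params."""
    parts = urlsplit(url)
    kept = sorted(
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name.lower() not in IGNORED_PARAMS
    )
    query = urlencode(kept)  # sorted, so parameter order does not split the cache
    return parts.netloc.lower() + parts.path + ("?" + query if query else "")
```

With this rule, the product page with ?ref=email and the one without it share a single cache entry, while parameters that really change the content still produce separate entries.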

Why This Matters

CDN caching improves performance in ways that browser caching alone cannot. Here’s how:

Faster page load times. By reducing the distance between users and content, CDNs significantly decrease delays, especially for users far from the origin server.
Global consistency. Users across different regions experience similar performance, whether they’re in Brazil, Japan, or the U.S.
Reduced load on origin servers. CDNs absorb the bulk of traffic, reducing direct requests to backend infrastructure and preventing overload during high-traffic events. A retail company that experienced traffic spikes on Black Friday relied on a CDN to handle the surge without increasing server capacity.
Optimized bandwidth costs. CDNs apply compression and delivery optimizations, reducing data transfer costs. A video streaming startup switched to a CDN with better video compression, cutting delivery expenses while improving streaming quality.

Common Pitfalls

Overly complex cache configurations. Some teams implement overly complex caching rules, making them difficult to modify. One engineering manager put it this way: “Our CDN config has become our legacy code.” Keeping rules simple and well-documented makes ongoing maintenance easier.
Cache coherency issues. Keeping content synchronized across different regions isn’t always straightforward. A company launching a new product found that European users saw the update two hours before U.S. users due to inconsistent cache invalidation. This led to confusion, support tickets, and customer complaints on social media.
Mismanaged CDN costs. CDN pricing models vary—some charge primarily for bandwidth, while others focus on request volume. A streaming service attempted to reduce bandwidth costs but overlooked the fact that their CDN charged mostly for requests, causing their costs to rise instead of fall. Understanding pricing structures is crucial to avoiding unexpected expenses.
Security gaps at the CDN level. Security measures applied at the origin server don’t automatically carry over to the CDN. A financial services company carefully configured security headers on its main servers but forgot to apply them at the CDN level, leaving key vulnerabilities exposed. Ensuring that security policies are consistently enforced across all layers helps prevent such oversights.

CDNs are fantastic for delivering static assets like images, CSS, and JavaScript files quickly, but what about dynamic data? API calls, such as product listings, user dashboards, or flight availability, often require fresh data. If every request hits the backend, it can slow down the entire system. API Gateway caching steps in as a solution, reducing redundant requests and improving API response times.

API Gateway Caching: The Request Filter

How It Works

Imagine we’re running a popular restaurant where the maître d' intercepts repeat orders before they reach the kitchen. "The table in the corner wants another plate of the special pasta? I already know exactly how the chef prepares that—no need to bother the kitchen again." That’s essentially what API Gateway caching does for our applications.

Acting as a middle layer between client applications and backend services, API Gateway caching reduces redundant processing by storing commonly requested API responses. While CDNs are optimized for static assets like images and scripts, API Gateways are designed to cache structured data, such as JSON or XML responses, helping to offload repeated database queries and reduce API latency.

A travel booking platform initially had uncached API calls taking 600-800ms to return flight results. With API Gateway caching enabled, identical searches took just 40ms, significantly improving responsiveness.

Many teams use AWS API Gateway for cloud-native applications, while Kong is a popular choice for self-hosted and Kubernetes environments. For enterprise-scale API management, solutions like Apigee (Google Cloud) are widely used. If we’re already using NGINX, its built-in proxy cache with very short TTLs (a technique often called microcaching) offers a lightweight alternative. The best choice depends on factors like infrastructure, compliance needs, and scale.

Key Features of API Gateway Caching

Full response caching. Unlike some caching layers that store fragments, API Gateways typically cache entire API responses. One financial services app was making thousands of identical market data queries per minute; API Gateway caching reduced its backend load by 94%.
Security and authentication handling. API Gateways can authenticate requests before checking the cache, ensuring unauthorized users don’t access cached responses they shouldn’t see. This is especially critical for applications handling sensitive data.
Cache key customization. We can define which parts of a request—headers, query parameters, path segments, or even body elements—should determine cache uniqueness. A media streaming service I advised improved cache efficiency by including device type in cache keys but excluding session identifiers, dramatically reducing redundant caching.
Granular TTL control. Different API endpoints have different freshness needs. One banking app cached account history for 30 minutes and transaction statuses for 60 seconds, and never cached current balances.
Rate limiting and quota management. Even when serving cached responses, API Gateways can enforce rate limits, helping prevent traffic spikes from overwhelming backend services.

Implementation Strategies

Cache per endpoint. Each API has different requirements: some are read-heavy, others update frequently. One product catalog API cached:
- Category listings for 30 minutes
- Product details for 5 minutes
- Inventory status for 30 seconds
Cache segmentation. Sometimes the same API endpoint needs different caching rules depending on user type. One B2B platform cached pricing API responses for:
- Anonymous users → Cached for 1 hour
- Authenticated partners → Cached for 5 minutes to reflect negotiated pricing updates
Selective caching. Not all HTTP methods should be cached.
- GET requests are typically safe to cache.
- POST, PUT, and DELETE modify data and should bypass the cache. One team mistakenly cached POST requests, leading to orders not appearing in customer history for 15 minutes.
Using Vary headers. Many applications deliver different responses based on content type, language, or device. Configuring cache keys properly can prevent unnecessary duplicate caching. One global e-commerce site doubled cache efficiency by properly implementing Vary headers for different languages.
Cache bypass options. Some users need real-time data. We can implement a query parameter like ?fresh=true to allow users to bypass the cache when necessary. One investment platform added a “Refresh” button to ensure users saw real-time financial reports while still benefiting from caching during normal browsing.
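
Several of these strategies come down to one decision per request: should this response be cached, and for how long? The sketch below combines per-endpoint TTLs, method-based selective caching, and the ?fresh=true bypass. The endpoint paths and the cache_ttl function are hypothetical, and real gateways express the same rules in configuration.

```python
from typing import Mapping, Optional

# TTLs in seconds, mirroring the product catalog example; paths are assumed
ENDPOINT_TTLS = {
    "/categories": 30 * 60,
    "/products": 5 * 60,
    "/inventory": 30,
}


def cache_ttl(method: str, path: str, query: Mapping[str, str]) -> Optional[int]:
    """Return the TTL for a request, or None if it must bypass the cache."""
    if method.upper() != "GET":
        return None  # POST, PUT, and DELETE modify data
    if query.get("fresh") == "true":
        return None  # explicit bypass requested by the client
    for prefix, ttl in ENDPOINT_TTLS.items():
        if path == prefix or path.startswith(prefix + "/"):
            return ttl
    return None  # unknown endpoints are not cached by default
```

Defaulting to "do not cache" for unknown endpoints is a deliberate choice here: a missed caching opportunity is usually cheaper than accidentally caching something that should have stayed fresh.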

Why This Matters

API Gateway caching improves application performance in multiple ways:

Faster API responses. Caching API calls can reduce response times by 70-95%, making applications feel much snappier.
Lower backend load. By serving cached responses, API Gateways can reduce redundant API calls by over 80%, easing database strain and improving scalability. One social media analytics platform reduced database queries from 3 million to 400,000 per hour just by enabling API caching.
Consistent performance. One e-commerce platform didn’t just improve average API response times; it also eliminated unpredictable latency spikes during peak traffic hours. As their CTO put it, "Users notice inconsistency more than they notice raw speed."
Improved API availability. When a payment service had database slowdowns, its API Gateway continued serving cached responses, preventing an outage. The team estimated that caching bought them 30 minutes of breathing room before backend fixes took effect.

Common Pitfalls

Over-aggressive caching. Caching sensitive, user-specific data can lead to serious security issues. One financial app briefly showed User A’s balance to User B due to improper cache key settings. Always include user identifiers in cache keys when responses are user-specific.
Inconsistent user experience. When caching is misconfigured, some users see fresh data while others get stale responses. A document editing platform cached document statuses too aggressively, causing team members to see outdated content for several minutes, even after refreshing.
Cache poisoning. If an error response or incorrect data gets cached, it can spread to multiple users. A healthcare app cached incomplete patient records due to a database migration issue, turning a 30-second problem into a 15-minute one.
Hidden bugs. When caching is working well, we may not notice backend failures immediately. One team discovered that their API was throwing errors 20% of the time, but the cache had masked the problem for weeks. Regular cache bypass testing helps detect hidden failures.
Cache stampede. When a frequently accessed item expires, multiple clients may request fresh data at the same time. This sudden spike can overload the backend, causing unexpected performance issues. A sports statistics platform saw their database crash during a major game because their player stats cache expired just as a star player scored.

Solution: Use staggered expirations and background refreshes to avoid traffic spikes when cache entries expire.
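
Here is a rough sketch of two stampede defenses in plain Python. It staggers expirations with random jitter, and in place of a background refresh it uses a per-key lock so that only one caller reloads an expired entry while the others wait for the result. The class and function names are hypothetical, and a production setup would more likely rely on the cache server or gateway for this.

```python
import random
import threading
import time


def jittered_ttl(base_seconds: float, spread: float = 0.1) -> float:
    """Spread expirations over roughly base +/- 10% so entries don't expire together."""
    return base_seconds * random.uniform(1 - spread, 1 + spread)


class SingleFlightCache:
    """In-memory cache where each expired key is reloaded by one caller at a time."""

    def __init__(self, loader, ttl_seconds: float):
        self._loader = loader            # hypothetical function: key -> fresh value
        self._ttl = ttl_seconds
        self._entries = {}               # key -> (expires_at, value)
        self._locks = {}                 # key -> lock guarding reloads of that key
        self._guard = threading.Lock()   # protects the _locks dictionary

    def _fresh(self, key):
        entry = self._entries.get(key)
        if entry is not None and entry[0] > time.monotonic():
            return entry
        return None

    def get(self, key):
        entry = self._fresh(key)
        if entry is not None:
            return entry[1]
        with self._guard:
            lock = self._locks.setdefault(key, threading.Lock())
        with lock:
            # Re-check: another caller may have reloaded while we waited.
            entry = self._fresh(key)
            if entry is not None:
                return entry[1]
            value = self._loader(key)
            expires_at = time.monotonic() + jittered_ttl(self._ttl)
            self._entries[key] = (expires_at, value)
            return value
```

If eight requests arrive for the same expired key at once, the loader runs once and the other seven callers receive the value it produced.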

API Gateway caching optimizes API calls, but what if we need to cache frequently accessed data within the application itself? Imagine a dashboard that displays the same metrics for hundreds of users—fetching this data from the database every time would be inefficient. Instead, application caching allows us to store frequently used data in memory, significantly improving performance and reducing backend strain.

Application Layer Caching: The Middle Tier

How It Works

Remember the last time we asked a friend the same question twice in five minutes? They probably gave us a look that said, "I just told you that." With application layer caching, we avoid recalculating or fetching data we’ve already seen, making our systems much more efficient.

Sitting between the application and the database, application caching acts as a short-term memory store. It holds frequently accessed data in fast, in-memory storage like Redis or Memcached—think of them as digital scratch pads that can be read thousands of times faster than even the most optimized database.

Many teams rely on Redis for its versatility and support for different data structures. Managed services like AWS ElastiCache and Azure Cache for Redis simplify operations, while language-specific libraries like Caffeine (Java) and node-cache (Node.js) provide efficient caching within specific tech stacks.

Unlike browser and CDN caching, which primarily handle static assets, application caching deals with dynamic data. It stores information that changes frequently but doesn’t need to be recalculated or retrieved every time, such as:

User profile details
Product inventories
Complex API responses
Results of computationally expensive tasks

A team optimizing a recommendation engine once spent three days refining an algorithm to generate product suggestions. They later discovered that simply caching those recommendations for a few minutes, a change that took just three hours, provided an even greater performance boost. "We were trying to build a faster car," they noted, "when we really just needed to stop making unnecessary trips."

Common Caching Patterns

Data caching. Storing database records or API responses in memory to reduce repetitive database queries. By keeping frequently accessed data readily available, this approach reduces database load and improves response times while maintaining relatively fresh data.
Computation caching. Storing the results of expensive calculations so they don’t have to be recomputed repeatedly. For example, a financial services platform calculating risk assessment scores for users might cache the results for a short period. Instead of recalculating the same data for every request, the system retrieves it instantly from cache, significantly improving response times and reducing the load on computing resources.
Session caching. Storing user session data in memory for quick access. This ensures that applications can efficiently maintain user authentication, preferences, and shopping cart data across multiple requests or page reloads without frequent database lookups.
Rate limiting. Using a cache to track and limit API requests from individual users. This helps prevent accidental or intentional overload of the system by enforcing request thresholds while reducing unnecessary processing.
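
The rate limiting pattern is a good example of a cache used as a counter rather than as a copy of data. Below is a minimal fixed-window limiter kept in process memory. The class name and the injectable clock are assumptions for illustration. With Redis, the same idea is commonly built from an incrementing counter with an expiry.

```python
import time


class FixedWindowRateLimiter:
    """Allow at most `limit` requests per user in each fixed time window."""

    def __init__(self, limit: int, window_seconds: float, clock=time.monotonic):
        self.limit = limit
        self.window = window_seconds
        self._clock = clock       # injectable so the behaviour can be tested
        self._counters = {}       # user_id -> (window_index, count)

    def allow(self, user_id: str) -> bool:
        window_index = int(self._clock() // self.window)
        stored_index, count = self._counters.get(user_id, (window_index, 0))
        if stored_index != window_index:
            count = 0             # a new window has started: reset the counter
        if count >= self.limit:
            return False
        self._counters[user_id] = (window_index, count + 1)
        return True
```

Fixed windows are simple but allow short bursts around a window boundary. Sliding windows or token buckets smooth that out at the cost of a little more bookkeeping.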

Implementation Strategies

Cache-aside (lazy loading). Before making a request to the database, the application first checks the cache. If the data isn’t there, it fetches the information, stores it in the cache, and serves it to the user. This pattern is widely used because it keeps cache management simple. One login API reduced response time from 600ms to 40ms by caching user authentication data this way.
Read-through caching. The cache itself is responsible for retrieving missing data. If the requested data isn’t in the cache, it automatically fetches it from the source. This simplifies application logic but requires a more sophisticated caching layer.
Write-through caching. Every time data is updated, it’s written to both the cache and the backend database simultaneously. This ensures the cache is always in sync with the latest data but adds some latency to write operations. A ticket-booking platform used this approach to ensure seat availability information was always accurate.
Write-behind caching. Data is first written to the cache and then asynchronously updated in the backend. This improves write performance but carries some risk if the system fails before syncing with the database.
Time-based expiration. Different types of data have different expiration needs. An e-commerce site implemented a layered approach:
- Product descriptions cached for 1 week
- Inventory levels cached for 5 minutes
- Flash sale prices cached for 30 seconds
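
Cache-aside combined with time-based expiration is short enough to sketch in full. The example below uses a small in-memory TTL cache as a stand-in for Redis or Memcached. TTLCache, get_product, and load_from_db are hypothetical names, and the 300-second default matches the 5-minute inventory example only by choice.

```python
import time


class TTLCache:
    """Tiny in-memory cache with per-entry expiration (stand-in for Redis)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._store = {}  # key -> (expires_at, value)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if expires_at <= self._clock():
            del self._store[key]  # expired: drop it and report a miss
            return None
        return value

    def set(self, key, value, ttl_seconds):
        self._store[key] = (self._clock() + ttl_seconds, value)


def get_product(product_id, cache, load_from_db, ttl_seconds=300):
    """Cache-aside read: check the cache, fall back to the database, then fill."""
    key = f"product:{product_id}"
    cached = cache.get(key)
    if cached is not None:
        return cached                      # cache hit
    value = load_from_db(product_id)       # cache miss: go to the source
    cache.set(key, value, ttl_seconds)
    return value
```

Note that this version treats None as a miss, so lookups for records that do not exist always reach the database.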

Why This Matters

Application caching improves both user experience and infrastructure efficiency:

Faster user interactions. Caching frequently accessed data can reduce API response times by 50-95%, making applications feel more responsive. A mobile app reduced its average API response time from 300ms to just 35ms after implementing application caching.
Lower infrastructure costs. Optimized caching reduces CPU and memory usage, allowing teams to handle more traffic with fewer resources. A B2B platform reduced its server count by 60% while handling more requests.
Reduced database load. Caching minimizes expensive database queries, keeping systems stable under heavy traffic. An analytics dashboard lowered database CPU utilization from 85% to 30%, eliminating timeouts during peak hours.
Better scalability without extra cost. Caching allows systems to handle traffic spikes without requiring a massive increase in infrastructure. As one CTO put it, "Before caching, each new marketing campaign meant an emergency infrastructure meeting. Now we just watch the metrics and smile, knowing the system will handle it."

Common Pitfalls

Cache invalidation challenges. Knowing when to refresh or discard cached data is surprisingly complex. Some engineering teams have created cache invalidation diagrams that look more like abstract art than structured designs.
Stale data issues. If cache invalidation isn’t handled correctly, users may see outdated information. A marketplace app once displayed "In Stock" labels for products that had already sold out, frustrating customers and increasing support tickets.
Cache penetration. If non-existent data is frequently requested, it can bypass the cache and overload the database. A system experiencing slowdowns due to bots requesting random product IDs mitigated the issue by implementing a "negative result cache" to remember which IDs didn’t exist.
Cache avalanche. If many cached items expire simultaneously, the sudden surge of database queries can cause system failure. A social platform crashed during a product launch when all promotional content caches expired simultaneously, triggering thousands of database queries.
Local vs. distributed caching challenges. Local in-memory caches work well for small applications, but as traffic grows, a distributed caching system becomes essential. A startup struggling with inconsistent user experiences found that switching to Redis as a centralized cache immediately resolved the issue.
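
The "negative result cache" used against cache penetration can be sketched as a lookup that remembers misses as well as hits. The names below are hypothetical, and the loader is assumed to return None for IDs that do not exist.

```python
import time

_MISSING = object()  # sentinel meaning "we looked, and it does not exist"


class NegativeCachingLookup:
    """Cache-aside lookup that also remembers keys known to be absent."""

    def __init__(self, load_from_db, hit_ttl=300, miss_ttl=30, clock=time.monotonic):
        self._load = load_from_db    # assumed to return None when the record is absent
        self._hit_ttl = hit_ttl
        self._miss_ttl = miss_ttl    # kept short so newly created records appear soon
        self._clock = clock
        self._store = {}             # key -> (expires_at, value or _MISSING)

    def get(self, key):
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            return None if entry[1] is _MISSING else entry[1]
        value = self._load(key)
        if value is None:
            self._store[key] = (now + self._miss_ttl, _MISSING)
            return None
        self._store[key] = (now + self._hit_ttl, value)
        return value
```

Repeated requests for the same non-existent ID now stop at the cache until the short miss TTL runs out. This sketch never evicts entries, so against bots sending truly random IDs a real deployment would also need a size limit or a Bloom filter in front of the cache.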

Application caching keeps frequently accessed data readily available, but what happens when the database itself becomes the bottleneck? Every database query consumes CPU and memory, leading to slow response times under heavy load. Database caching steps in as the final layer of defense, storing precomputed results and frequently queried data in memory to keep things running smoothly.

Database Caching: The Foundation Layer

How It Works

Imagine walking into a library where the librarian already has our favorite books set aside because "we always ask for these." That’s similar to how database caching works—it anticipates frequently accessed data and keeps it readily available, making our applications run more efficiently.

Database caching is the deepest, most fundamental layer of caching in our application stack. Unlike other caching layers that store pre-processed responses, database caching optimizes how data is retrieved, reducing redundant work and improving performance.

Modern database systems do much more than just store data—they actively optimize how information is accessed and processed. They incorporate multiple caching mechanisms, much like a master chess player thinking several moves ahead:

Buffer cache. Stores frequently accessed disk pages directly in memory, reducing the need to fetch data from disk. This is like keeping our most-read pages in a high-speed binder instead of going back to the filing cabinet every time.
Query cache. Remembers answers to frequently asked questions. If multiple users request "How many users signed up last Tuesday?" the database can return a cached result instead of recalculating it. MySQL had a built-in query cache, but it was deprecated and then removed in MySQL 8.0 because it scaled poorly. PostgreSQL has no built-in result cache; prepared statements let it reuse parsing and planning work, while result caching is usually handled with materialized views or an application-level cache.
Execution plan cache. Before executing a query, the database determines the most efficient way to fetch the data. This planning process can be expensive, but caching the execution plan avoids recalculating it every time. It’s like a delivery driver optimizing a route once and reusing it for similar trips.
Materialized views. Store pre-computed query results as tables, eliminating redundant calculations and speeding up queries. Instead of recalculating "total sales by region, by month" each time someone loads a dashboard, a materialized view keeps this report pre-generated.

These caching mechanisms vary by database. PostgreSQL supports prepared statements and buffer management tools like pgFincore, while MySQL/MariaDB optimize performance through the InnoDB buffer pool. SQL Server provides buffer pool extensions and execution plan caching.

For high-read scenarios, tools like PgBouncer and Amazon RDS Proxy help manage database connections efficiently, while materialized views (supported in PostgreSQL and Oracle, with indexed views as the SQL Server equivalent) provide powerful query caching capabilities.

Implementation Strategies

Index optimization. Indexes act like pre-sorted lookup tables, allowing databases to quickly locate specific data. A well-designed index can turn a slow, full-table scan into a lightning-fast lookup. One team reduced a product search query from 30 seconds to 25 milliseconds simply by adding the right index.
Query optimization. Small changes in how queries are written can significantly impact performance. Queries that leverage cached execution plans often run much faster than those that force full table scans. Two engineering teams wrote nearly identical queries—one was 50 times faster because it aligned with the database’s caching mechanisms.
Connection pooling. Establishing database connections is expensive, involving authentication, state setup, and memory allocation. Connection pooling maintains pre-established connections that applications can reuse, preserving cached execution plans and reducing response times. An e-commerce platform reduced checkout page load time by 300ms simply by implementing connection pooling.
Read replicas. For read-heavy workloads, having multiple read-only copies of the database can reduce the load on the primary database. News websites, for example, use read replicas to handle peak morning traffic without slowing down content updates.
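
To make the connection pooling idea concrete, here is a minimal sketch in JavaScript. It is not a production pool and not the API of any real driver: ConnectionPool, createConnection, and runQuery are hypothetical names, and the "connection" is just a plain object standing in for a real database connection.

```javascript
// A minimal pool: reuse idle connections, cap the total, queue callers when exhausted.
class ConnectionPool {
  constructor(createConnection, maxSize) {
    this.createConnection = createConnection; // hypothetical factory for new connections
    this.maxSize = maxSize;
    this.idle = [];    // connections ready for reuse
    this.waiting = []; // resolvers for callers waiting on a free connection
    this.total = 0;    // connections created so far
  }

  async acquire() {
    if (this.idle.length > 0) return this.idle.pop();
    if (this.total < this.maxSize) {
      this.total += 1;
      try {
        return await this.createConnection();
      } catch (error) {
        this.total -= 1; // creation failed, so free the slot
        throw error;
      }
    }
    // Pool exhausted: wait until another caller releases a connection.
    return new Promise((resolve) => this.waiting.push(resolve));
  }

  release(connection) {
    const next = this.waiting.shift();
    if (next) next(connection); // hand over to the longest-waiting caller
    else this.idle.push(connection);
  }
}

// Usage: three queries share a pool capped at two connections.
let opened = 0;
const pool = new ConnectionPool(async () => ({ id: ++opened }), 2);

async function runQuery(label) {
  const connection = await pool.acquire();
  try {
    return `${label} ran on connection ${connection.id}`;
  } finally {
    pool.release(connection); // always return the connection, even on errors
  }
}

const poolDemo = Promise.all([runQuery('a'), runQuery('b'), runQuery('c')]);
```

The third query does not open a new connection. It waits until one of the first two is released, which is the whole point: the expensive setup happens at most maxSize times.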

Why This Matters

Database caching improves deep system-level performance, directly impacting user experience and scalability:

Faster queries. Caching reduces query execution time dramatically. In some cases, complex analytics queries that originally took 30 seconds can be reduced to milliseconds, improving application responsiveness.
Lower database load. Caching minimizes redundant computations and disk access, reducing CPU and I/O usage. A healthcare platform lowered database CPU utilization from 95% to 30% just by optimizing buffer cache settings.
More concurrent users. Optimized database caching enables applications to support more simultaneous users. One team improved their system capacity from 5,000 to 20,000 concurrent users without upgrading infrastructure.
Reduced infrastructure costs. Efficient caching delays or eliminates the need for expensive database upgrades. A CTO once noted, "We were about to spend $10,000 a month on a database upgrade. After optimizing our caching strategy, we achieved better performance on our existing hardware and postponed that expense for 18 months."

Common Pitfalls

Cache size allocation. Allocating the right amount of memory for database caching is critical. Too little memory reduces caching benefits, while too much can starve other processes or cause the system to swap to disk, leading to performance degradation.
Query plan instability. Cached execution plans are optimized based on current data distribution. As data changes, a previously efficient query plan can suddenly become inefficient, leading to performance issues. A retail company saw checkout times slow dramatically when their database chose a suboptimal execution plan due to shifting customer behavior.
Over-indexing. While indexes speed up reads, they slow down writes because every update requires index maintenance. One team once had 25 indexes on a single table—so many that inserts and updates were spending more time updating indexes than modifying the actual data.
Isolation level mismatches. Database transactions follow isolation rules that control how changes are visible to concurrent requests. A financial services app showed inconsistent account balances because its caching strategy conflicted with its chosen isolation level.

We’ve explored each caching layer independently, but the real magic happens when they work together. A robust caching strategy isn't just about optimizing one layer—it’s about ensuring browser, CDN, API, application, and database caching function cohesively to prevent bottlenecks. Let’s explore how to align these layers for maximum efficiency and scalability.

Integrated Caching Strategy: Putting It All Together

Layered Caching Architecture

An effective caching strategy leverages multiple layers, with each layer optimized for specific types of data and access patterns:

Browser Cache. Stores static assets like images, stylesheets, and scripts directly on the user’s device, reducing load times for repeat visits.
CDN. Distributes cached content across globally distributed edge servers, ensuring fast delivery regardless of location.
API Gateway Cache. Speeds up API response times by caching frequently requested data before it reaches backend services.
Application Cache. Reduces redundant computations by caching frequently accessed data and computed results in memory.
Database Cache. Optimizes query execution by storing precomputed results, reducing database load and improving scalability.
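
A rough way to picture how the layers cooperate is a lookup that walks them from fastest to slowest and backfills the faster layers on a hit. The sketch below assumes simple in-memory stand-ins; MapLayer, layeredGet, and the layer names are hypothetical, and real layers such as a CDN or a browser cache are obviously not JavaScript Maps.

```javascript
// Hypothetical in-memory stand-in for one cache layer.
class MapLayer {
  constructor(name) {
    this.name = name;
    this.store = new Map();
  }
  get(key) { return this.store.get(key); }
  set(key, value) { this.store.set(key, value); }
}

// Check each layer in order (fastest first). On a hit, copy the value into
// the faster layers that missed. If every layer misses, load from the source.
function layeredGet(layers, key, loadFromSource) {
  const missed = [];
  for (const layer of layers) {
    const value = layer.get(key);
    if (value !== undefined) {
      for (const m of missed) m.set(key, value);
      return { value, servedBy: layer.name };
    }
    missed.push(layer);
  }
  const value = loadFromSource(key);
  for (const m of missed) m.set(key, value);
  return { value, servedBy: 'source' };
}

// Usage: the first read goes to the source, the second is served by the first layer.
const layers = [new MapLayer('application'), new MapLayer('database')];
let sourceReads = 0;
const load = (key) => { sourceReads += 1; return `row:${key}`; };

const firstRead = layeredGet(layers, 'user:1', load);
const secondRead = layeredGet(layers, 'user:1', load);
```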

Cache Coherency Across Layers

One of the most challenging aspects of multi-layer caching is maintaining consistency across layers. Strategies include:

Cache Invalidation Chains. Ensures that when data is updated in one layer, all dependent caches are invalidated, preventing stale responses.
TTL Hierarchies. Higher caching layers (e.g., browser and CDN) expire cached content more quickly than lower layers, balancing freshness and efficiency.
Event-Based Invalidation. Uses pub/sub messaging (e.g., Redis Pub/Sub, Kafka) to notify caching layers when data changes, improving consistency.
Versioned Cache Keys. Embeds data versions into cache keys, ensuring clients retrieve the latest content without requiring manual cache clearing.
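
Of these strategies, versioned cache keys are the easiest to show in a few lines. The following is a minimal sketch under simple assumptions: a single process, an in-memory Map as the cache, and hypothetical helper names (versionedKey, readThrough, invalidate).

```javascript
const cache = new Map();
const versions = new Map(); // entity id -> current version number

// Build the cache key from the entity id plus its current version.
function versionedKey(entityId) {
  const version = versions.get(entityId) ?? 0;
  return `${entityId}:v${version}`;
}

// Read through the cache using the versioned key.
function readThrough(entityId, load) {
  const key = versionedKey(entityId);
  if (!cache.has(key)) cache.set(key, load(entityId));
  return cache.get(key);
}

// Bumping the version makes the old key unreachable; nothing is deleted here.
function invalidate(entityId) {
  versions.set(entityId, (versions.get(entityId) ?? 0) + 1);
}

// Usage: the price changes in the source, and the version bump exposes it.
let price = 100;
const loadPrice = () => price;

const beforeUpdate = readThrough('product:42', loadPrice);
price = 120;
const staleRead = readThrough('product:42', loadPrice); // still the cached value
invalidate('product:42');
const afterInvalidate = readThrough('product:42', loadPrice);
```

One trade-off to keep in mind: old entries are never removed by the version bump itself, so this approach relies on a TTL or an eviction policy to reclaim that memory.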

Monitoring and Optimization

A successful caching implementation requires ongoing monitoring and refinement:

Cache Hit Ratio. Measures how often data is served from the cache instead of being fetched from the database. A higher ratio means fewer expensive API or database calls. Regular monitoring helps fine-tune caching strategies.
Cache Size and Eviction. Ensuring caches aren’t overfilled prevents excessive evictions, which can lead to performance drops.
Response Time Distribution. Comparing cached vs. non-cached response times highlights areas where caching can be improved.
Cost-Benefit Analysis. Balances the savings from caching with the risk of serving stale data, ensuring an optimal caching strategy.
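
As a small illustration of the first metric, a cache can count its own hits and misses. This is only a sketch with a hypothetical InstrumentedCache class; in practice these counters usually come from the cache server or an observability tool.

```javascript
// A cache wrapper that counts hits and misses so the hit ratio can be reported.
class InstrumentedCache {
  constructor() {
    this.store = new Map();
    this.hits = 0;
    this.misses = 0;
  }

  get(key, load) {
    if (this.store.has(key)) {
      this.hits += 1;
      return this.store.get(key);
    }
    this.misses += 1;
    const value = load(key);
    this.store.set(key, value);
    return value;
  }

  // hits / (hits + misses), or 0 before any lookup has happened.
  hitRatio() {
    const total = this.hits + this.misses;
    return total === 0 ? 0 : this.hits / total;
  }
}

// Usage: four lookups, two of them repeats of an already cached key.
const metricsCache = new InstrumentedCache();
const loadValue = (key) => key.toUpperCase();
['a', 'a', 'a', 'b'].forEach((key) => metricsCache.get(key, loadValue));
const ratio = metricsCache.hitRatio(); // 2 hits out of 4 lookups
```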

Caching isn’t just about improving performance—it’s a core architectural decision that directly impacts user experience and scalability. A well-designed caching strategy ensures applications remain fast, efficient, and resilient under heavy load. Now, let’s summarize the key takeaways and best practices to implement caching successfully.

Conclusion: Caching as a Product Strategy

As explored throughout this article, caching isn’t merely a technical optimization—it’s a fundamental product strategy that impacts everything from user experience to operational costs. When implemented thoughtfully across all application layers, caching provides a competitive edge through superior performance, lower infrastructure costs, and improved scalability.

The most successful product teams treat caching as a core architectural decision rather than an afterthought. They recognize that different layers require different caching approaches and design their systems to maximize the strengths of each layer.

A well-designed caching strategy goes beyond speed—it enhances user experience, optimizes infrastructure costs, and ensures applications scale efficiently. By treating caching as an integral part of system architecture from the start, products can be built to handle growth and traffic surges while providing a seamless experience for users, no matter where they are.

Mastering The Art of Pull Requests: A Developer's Guide to Smooth Code Reviews

Budi Widhiyanto — Mon, 24 Feb 2025 20:03:53 +0000

In the fast-paced world of software development, version control tools like Git have become essential for keeping projects organized and collaborative. As developers, we often work in parallel, creating branches for new features, fixing bugs, and making updates. But when it’s time to bring those changes back into the main codebase, pull requests (PRs) are the bridge between isolated development and team collaboration.

We’ve all faced the frustration of a PR getting rejected or sent back for revisions. It’s easy to spend hours on a feature, only to realize our implementation doesn’t align with the team’s vision. Over time, however, we learn that effective PRs aren’t just about the code itself. They’re about how we communicate our changes.

In this article, we’ll walk through how to craft pull requests that make the review process smoother, faster, and more efficient for everyone involved.

Crafting the Perfect Pull Request

1. Start with an Informative Title

First impressions matter, even in the world of code. The title of our pull request is our chance to immediately convey the essence of the change. Think of it as the headline of an article; it should be short, clear, and informative. Avoid titles like:

"Updated code"
"Fixed bug"
"Changes from yesterday"

Instead, use more descriptive titles, such as:

feat: Add OAuth2 authentication for API endpoints
fix: Resolve race condition in user session handling
refactor: Optimize database query performance in user search

A well-crafted title helps our reviewers understand the change without opening the PR. It sets the tone for the whole review process.

2. Set the Context (Don’t Assume Everyone Knows the Problem)

Before diving into code, it’s important to provide context. Not everyone may be familiar with the specific problem we’re solving, and it’s essential to explain the why behind the changes. A solid PR description makes the review easier and faster. Here’s a template for a well-structured PR description:

## Problem
The current user authentication system doesn't support social login, causing friction during user onboarding. We're seeing a 40% drop-off rate at the registration step.
Related ticket: AUTH-123

## Solution
Implemented OAuth2 authentication flow with Google:
- Added OAuth2 middleware for handling Google authentication
- Created new user profile mapping logic
- Implemented session management for social login

## Technical Details
- Uses passport-google-oauth20 for authentication
- Added new database fields: googleId, socialProfile
- Modified user model to support multiple auth methods

## Testing
1. Click "Login with Google" button
2. Authorize test application
3. Verify successful redirect to dashboard
4. Check user profile contains Google data

## Configuration
New environment variables required:
- GOOGLE_CLIENT_ID
- GOOGLE_CLIENT_SECRET
- OAUTH_CALLBACK_URL

This structure ensures that our reviewers understand the issue, the approach we’ve taken, and how to validate the solution. A PR without context can slow things down significantly, so always make sure to include enough information for our team to understand our changes.

3. Keep the PR Focused

We’ve all been there: tempted to tackle multiple issues in one pull request. But this often leads to oversized PRs that can overwhelm reviewers. Instead, we should try to break our work into smaller, more focused PRs. Each PR should address a specific task. For example, if we’re building a user management system, we can break it down into smaller tasks like:

First PR: feat: Add basic user model and migration
Second PR: feat: Implement user authentication endpoints
Third PR: feat: Add user profile management UI
Fourth PR: feat: Integrate email verification system

Each PR should focus on a single feature or bug fix, which makes the review process easier for everyone. This leads to faster reviews and fewer reworks.

4. Commit Messages: Keep It Clean

A good commit message does more than explain the code. It helps everyone understand why the change was made and how it fits into the bigger picture.

Why Commit Messages Matter:

They provide context: A well-written commit message explains the why behind a change.
They improve collaboration: Future developers can trace the history of the project and easily understand the purpose of each commit.
They save time: Clear commit messages reduce the need for follow-up questions and prevent back-and-forth during the review process.

Examples of Poor Commit Messages:

"Fixed stuff"

Why it’s bad: This is vague and doesn’t specify what was fixed or why. Did we fix a bug, improve performance, or refactor code? It’s unclear.

Better version: fix(auth): resolve user login bug caused by expired tokens

"Updated files"

Why it’s bad: This message doesn’t tell the reviewer what was changed or why the update was necessary.

Better version: chore: update dependencies to fix security vulnerabilities

"Work in progress"

Why it’s bad: This doesn’t describe any meaningful change and suggests the code is incomplete. It also makes the review process harder, as the reviewer doesn’t know if they’re looking at a finished feature or just a draft.

Better version: feat(api): add user authentication endpoints
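
The difference between the poor and better versions above can also be checked automatically. Here is a minimal sketch of a header check, assuming the type(scope): description shape used in the examples. The list of allowed types is an assumption on our side (a team would pick its own), and isConventionalHeader is a hypothetical helper, not part of any tool.

```javascript
// Allowed types are a team choice; this list is an assumption for the sketch.
const ALLOWED_TYPES = ['feat', 'fix', 'chore', 'docs', 'refactor', 'test'];

// Matches: type, optional (scope), optional "!", then ": " and a description.
const HEADER_PATTERN = new RegExp(
  `^(${ALLOWED_TYPES.join('|')})(\\([\\w-]+\\))?!?: \\S.*$`
);

function isConventionalHeader(message) {
  const header = message.split('\n')[0]; // only the first line is the header
  return HEADER_PATTERN.test(header);
}

// Usage with the examples from this section.
const commitChecks = {
  good: isConventionalHeader('fix(auth): resolve user login bug caused by expired tokens'),
  alsoGood: isConventionalHeader('chore: update dependencies to fix security vulnerabilities'),
  vague: isConventionalHeader('Fixed stuff'),
  draft: isConventionalHeader('Work in progress'),
};
```

A check like this could run in a commit hook or in CI, so the format is enforced before a reviewer ever sees the PR.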

5. Review-Readiness Checklist: Why It’s Crucial

Before submitting a PR, we should use a review-readiness checklist to ensure our code is in top shape. Here’s why having a checklist is important:

Saves reviewers’ time: It minimizes the chances of reviewers asking for basic fixes, allowing them to focus on the logic of the code.
Improves consistency: A checklist ensures that all pull requests meet the same standard, making the review process smoother for everyone.
Reduces back-and-forth: By double-checking our code and tests, we avoid multiple rounds of revisions.

Example checklist:

## Pre-Submission Checklist:
- [ ] Code follows project style guide
- [ ] All tests pass (`npm run test`)
- [ ] Lint checks pass (`npm run lint`)
- [ ] Documentation updated
- [ ] No sensitive data in commits
- [ ] Branch is up to date with `main`

Overcoming Common Pull Request Challenges

Challenge 1: Pull Requests with Too Many Changes

We’ve faced the temptation to submit a massive PR that includes multiple features. However, this often leads to confusion and long review times. Here’s how we solved it. Instead of trying to work on everything at once, we broke the task into smaller, more manageable PRs: first the database schema changes, then the API modifications, and finally the frontend adjustments. This made it easier for reviewers to focus on one thing at a time.

Lesson Learned: Keep the PRs focused on a single aspect of the project.

Challenge 2: Lack of Context in PR Descriptions

Early in my career, I submitted PRs with minimal descriptions, assuming everyone knew what was happening. This led to confusion, questions, and delays in the review process. To fix this, I started adding detailed descriptions for each PR, explaining what the change was, why it was necessary, and how to test it. It made the review process much faster and more efficient.

Lesson Learned: Always provide context in our PR descriptions. Clear explanations can save us time and reduce the need for back-and-forth.

Challenge 3: Inconsistent Commit Messages

Our team once struggled with inconsistent commit messages, which made it difficult to track changes. Some messages were too vague, while others were overly detailed. To resolve this, we created a standardized commit message format (e.g., feat:, fix:, chore:) and made sure everyone followed it. This made the project history much more readable and improved collaboration.

Lesson Learned: Use a consistent commit message format. It helps everyone understand changes quickly.

Conclusion: Pull Requests as a Collaborative Opportunity

Crafting effective pull requests is a skill that improves with practice. By following these guidelines, we can make the review process more efficient while maintaining a cleaner, more maintainable codebase. Each PR is a learning opportunity. Every review comment, suggestion, or question helps us grow as developers and improve our coding practices.

The goal isn’t just to get our code merged; it’s to build a high-quality codebase that’s easy for everyone to understand and maintain. By putting extra care into our PRs, we’re investing in the success of our project and our growth as developers.

https://www.conventionalcommits.org/en/v1.0.0/

Feature Toggles: A Simple Way to Manage Access to Premium Features

Budi Widhiyanto — Tue, 11 Feb 2025 22:48:05 +0000

Imagine building an application with both free and premium features, and wanting to roll out a new feature for premium users without disrupting the experience for everyone else. The challenge is ensuring only the right users get access while keeping things smooth. For example, a streaming service like Netflix offers exclusive content or features to premium users, such as higher streaming quality or early access to new shows. They need a way to release these features gradually to the right audience and monitor performance before full rollout.

This is where feature toggles come in. They allow us to control which features are visible to different user groups, like premium subscribers. By enabling or disabling specific features based on user access levels, we can release new features gradually to the right audience, just like Netflix does. This approach helps monitor the feature’s performance, gather feedback, and ensure everything runs smoothly before making it available to all premium users, without affecting non-premium ones.

Understanding Feature Toggles: More Than Just On/Off Switches

So, what is a feature toggle? A feature toggle, also known as a feature flag, is a technique used in software development to enable or disable specific features without deploying new code. It allows us to control the visibility and availability of certain features in an application based on specific conditions, such as user roles or subscription levels.

It’s much more than a simple if/else statement. Feature toggles provide us with a powerful strategy for managing application behavior. Think of it as a control panel for features, where switches can be flipped on and off based on the needs of different users. This gives us the flexibility to gradually release features, test them, or provide exclusive content to certain user groups—without disrupting the experience for everyone else.

When Is This Useful?

Here are some scenarios where feature toggles really shine:

Gradually rolling out a new UI to ensure it doesn’t break anything
Running A/B tests to figure out if a blue button performs better than a red one
Controlling access to premium features (we’ll dive deeper into this one!)
Testing features in different environments without changing the code
Giving specific users access to beta-test features while others remain unaware
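
For the gradual rollout scenario, a common approach is to hash each user into a stable bucket and enable the feature for buckets below a percentage. The sketch below assumes that approach; hash32, rolloutBucket, and isInRollout are hypothetical names, and the hash is a simple FNV-1a-style function chosen only because it is short.

```javascript
// FNV-1a-style 32-bit hash over the string's UTF-16 code units.
function hash32(text) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i += 1) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // keep it an unsigned 32-bit value
  }
  return hash;
}

// Map a (feature, user) pair to a stable bucket in the range 0..99.
function rolloutBucket(featureName, userId) {
  return hash32(`${featureName}:${userId}`) % 100;
}

// A user is in the rollout when their bucket falls below the percentage.
function isInRollout(featureName, userId, percentage) {
  return rolloutBucket(featureName, userId) < percentage;
}

// Usage: the same user always gets the same answer for the same percentage.
const inTenPercent = isInRollout('new-ui', 'user-123', 10);
const inTenPercentAgain = isInRollout('new-ui', 'user-123', 10);
```

Because the bucket is derived from the user ID rather than from a random number, raising the percentage from 10 to 50 only adds users. Nobody who already had the feature loses it.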

System Architecture: How It All Fits Together

Let’s take a closer look at how a feature toggle system works. Breaking it down into components helps make it clearer.

The System Components

The feature toggle system consists of several key parts:

Client Browser: This is where the user interacts with the application (e.g., web or mobile browser).
Frontend App: The client browser communicates with the frontend application, where the user interface is managed.
Backend API: The frontend app sends requests to the backend API, which handles the business logic and makes decisions about feature availability.
Toggle Service: The toggle service evaluates whether a specific feature should be enabled or disabled based on user access levels and feature configurations.
User Service: This service checks user-related information, like their subscription level or role, to determine their access rights.
Toggle DB: The database where feature configurations are stored, determining whether a feature is on or off for specific user groups.
User DB: The database that stores user information, such as their subscription level or status.

Understanding the Flow

Consider the flowchart that illustrates how the system handles feature access. Here’s what happens step-by-step:

The frontend app requests information about feature access from the backend API.
The backend API then checks the user’s subscription level (whether they are a premium or free user).
If the user is a premium subscriber, the toggle service checks if the requested feature is enabled for their user group by looking at the Toggle DB.
- If the feature is enabled, the feature is shown to the user.
- If the feature is disabled, the user is shown an upgrade prompt.
If the user is a free subscriber, the backend skips the feature toggle check and directly shows the upgrade prompt.
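
The decision flow above is small enough to write down as one function. This is a sketch only: resolveFeatureView is a hypothetical name, the Map stands in for the Toggle DB, and the 'premium' level string and the key format are assumptions.

```javascript
// toggles: Map of "featureName:subscriptionLevel" -> boolean (stand-in for the Toggle DB).
function resolveFeatureView(user, featureName, toggles) {
  // Free users skip the toggle check and go straight to the upgrade prompt.
  if (user.subscriptionLevel !== 'premium') return 'upgrade-prompt';

  // Premium users see the feature only if it is enabled for their group.
  const enabled = toggles.get(`${featureName}:${user.subscriptionLevel}`) === true;
  return enabled ? 'feature' : 'upgrade-prompt';
}

// Usage: one enabled feature, one disabled feature, one free user.
const toggles = new Map([
  ['hd-streaming:premium', true],
  ['early-access:premium', false],
]);

const premiumUser = { id: 'u1', subscriptionLevel: 'premium' };
const freeUser = { id: 'u2', subscriptionLevel: 'free' };

const views = {
  premiumEnabled: resolveFeatureView(premiumUser, 'hd-streaming', toggles),
  premiumDisabled: resolveFeatureView(premiumUser, 'early-access', toggles),
  free: resolveFeatureView(freeUser, 'hd-streaming', toggles),
};
```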

Premium Feature Access Control: A Real-World Example

Imagine building a SaaS platform with both free and premium features. Just like streaming services such as Netflix, where premium subscribers get access to exclusive content like higher streaming quality or early releases, we need to ensure that only paying users can access these features. The challenge is offering these benefits without affecting the experience of free-tier users.

Netflix, for example, uses feature toggles to gradually release premium features or exclusive content to a select group of premium users. This helps them test the feature, gather feedback, and monitor performance before rolling it out to all paying users. This strategy ensures a smooth user experience while maintaining the value of their subscription plans.

To implement this in our own projects, let’s take a look at how we can set up the necessary data models and backend logic to manage access control based on subscription levels, using feature toggles to control feature availability.

Setting Up the Data Models

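// Jakarta Persistence imports (Spring Boot 3+); on Spring Boot 2.x use javax.persistence instead
import jakarta.persistence.Entity;
import jakarta.persistence.Id;
import java.util.Date;

@Entity
public class ToggleConfig {
    // With featureName as the only key, each feature has a single row;
    // per-level rows would need a composite key of featureName + subscriptionLevel
    @Id
    private String featureName;
    private String subscriptionLevel;
    private boolean enabled;
    private Date lastModified;
    private String modifiedBy;

    // Getters and setters
}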

ToggleConfig stores the configuration for each feature. The featureName identifies the feature, subscriptionLevel defines which users can access it, and enabled indicates whether it is active for that level. The lastModified and modifiedBy fields help track changes made to the feature.

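import jakarta.persistence.Entity;
import jakarta.persistence.Id;
import jakarta.persistence.Table;
import java.util.Date;

@Entity
@Table(name = "users") // "user" is a reserved word in several databases, including PostgreSQL
public class User {
    @Id
    private String id;
    private String subscriptionLevel;
    private Date subscriptionStart;
    private Date subscriptionEnd;

    // Getters and setters
}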

User stores information about a user, such as their subscriptionLevel, which is essential for checking feature access.

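import jakarta.persistence.Entity;
import jakarta.persistence.Id;
import java.util.Date;

@Entity
public class FeatureAccess {
    // Not auto-generated: the caller must assign an id before saving
    @Id
    private String id;
    private String userId;
    private String featureName;
    private boolean hasAccess;
    private Date lastChecked;

    // Getters and setters
}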

FeatureAccess records whether a user has access to a particular feature, including the user ID, feature name, access status, and the date it was checked.

The Backend Magic

Here’s where it gets interesting. Let’s look at how we check if a user can access a feature:

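import java.util.Date;
import java.util.Optional;
import java.util.UUID;
import org.springframework.stereotype.Service;

@Service
public class FeatureToggleService {
    private final ToggleRepository toggleRepository;
    private final UserRepository userRepository;
    private final FeatureAccessRepository accessRepository;

    // Constructor injection: Spring supplies the three repositories
    public FeatureToggleService(ToggleRepository toggleRepository,
                                UserRepository userRepository,
                                FeatureAccessRepository accessRepository) {
        this.toggleRepository = toggleRepository;
        this.userRepository = userRepository;
        this.accessRepository = accessRepository;
    }

    public boolean isFeatureEnabledForUser(String userId, String featureName) {
        // Get user subscription level
        User user = userRepository.findById(userId)
            .orElseThrow(() -> new UserNotFoundException(userId));

        // Check feature toggle configuration; a missing configuration means no access.
        // The derived query name has to match the featureName property on ToggleConfig.
        Optional<ToggleConfig> toggleConfig = toggleRepository
            .findByFeatureNameAndSubscriptionLevel(featureName, user.getSubscriptionLevel());
        boolean hasAccess = toggleConfig.map(ToggleConfig::isEnabled).orElse(false);

        // Record access check
        FeatureAccess access = new FeatureAccess();
        access.setId(UUID.randomUUID().toString()); // the String @Id is not generated, so assign one
        access.setUserId(userId);
        access.setFeatureName(featureName);
        access.setHasAccess(hasAccess);
        access.setLastChecked(new Date());
        accessRepository.save(access);

        return hasAccess;
    }
}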

This service first retrieves the user’s subscription level by querying the UserRepository. Then it checks if the feature is enabled for that subscription by querying the ToggleRepository. After checking, the result is saved in the FeatureAccess repository, and the access status is returned.

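import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class FeatureToggleController {
    private final FeatureToggleService featureToggleService;

    // Constructor injection of the service
    public FeatureToggleController(FeatureToggleService featureToggleService) {
        this.featureToggleService = featureToggleService;
    }

    @GetMapping("/api/features/{featureName}/access")
    public ResponseEntity<FeatureAccessResponse> checkFeatureAccess(
        @PathVariable String featureName,
        @RequestParam String userId
    ) {
        boolean hasAccess = featureToggleService.isFeatureEnabledForUser(userId, featureName);

        FeatureAccessResponse response = new FeatureAccessResponse(
            featureName,
            hasAccess,
            hasAccess ? "Feature available" : "Please upgrade to access this feature"
        );

        return ResponseEntity.ok(response);
    }
}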

This controller exposes an API endpoint that checks whether a user has access to a specific feature. It calls the FeatureToggleService to perform the necessary checks and returns the result.

Making It Look Good: The Frontend

On the frontend, the idea is to either show the premium feature or an upgrade prompt. Here’s how this is handled in React:

import React from 'react';
import { useFeatureToggle } from './hooks/useFeatureToggle';
// LoadingSpinner, ErrorMessage, PremiumContent, and UpgradeButton are assumed to be defined elsewhere in the app

const PremiumFeature = ({ userId, featureName }) => {
  const { isEnabled, isLoading, error } = useFeatureToggle(userId, featureName);

  if (isLoading) return <LoadingSpinner />;
  if (error) return <ErrorMessage error={error} />;

  return (
    <div className="feature-container">
      {isEnabled ? (
        <div className="premium-feature">
          <h2>Premium Feature</h2>
          <PremiumContent />
        </div>
      ) : (
        <div className="upgrade-prompt">
          <h2>Upgrade Required</h2>
          <p>This feature is available for premium subscribers only.</p>
          <UpgradeButton />
        </div>
      )}
    </div>
  );
};

export default PremiumFeature;

This component checks the feature status using the useFeatureToggle hook. If the feature is enabled, it displays the premium content; if not, it shows an upgrade prompt.

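// Custom Hook for Feature Toggle (hooks/useFeatureToggle.js)
import { useEffect, useState } from 'react';

export const useFeatureToggle = (userId, featureName) => {
  const [state, setState] = useState({
    isEnabled: false,
    isLoading: true,
    error: null
  });

  useEffect(() => {
    // Ignore responses that arrive after unmount or after the inputs change
    let cancelled = false;

    const checkFeatureAccess = async () => {
      try {
        const response = await fetch(
          `/api/features/${encodeURIComponent(featureName)}/access?userId=${encodeURIComponent(userId)}`
        );
        // fetch only rejects on network errors, so surface HTTP errors explicitly
        if (!response.ok) {
          throw new Error(`Request failed with status ${response.status}`);
        }
        const data = await response.json();

        if (!cancelled) {
          setState({
            isEnabled: data.hasAccess,
            isLoading: false,
            error: null
          });
        }
      } catch (error) {
        if (!cancelled) {
          setState({
            isEnabled: false,
            isLoading: false,
            error: error.message
          });
        }
      }
    };

    checkFeatureAccess();
    return () => { cancelled = true; };
  }, [userId, featureName]);

  return state;
};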

The custom hook makes an API call to check the feature's availability for the given user and feature name. It manages the state for loading, error, and the feature's enabled status.

Making It Work in the Real World: Best Practices

Now that we’ve seen how everything works, here are some additional techniques to make the feature toggle system robust:

1. Performance Matters

Caching toggle states is crucial since checking the database for every request can be slow.
Keep the toggle logic efficient and simple to avoid unnecessary delays.
Use background jobs for logging to keep the user experience smooth.
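
To illustrate the first point, here is a minimal sketch of a time-based cache in front of the toggle lookup. It is written in JavaScript for brevity rather than in the Java of the backend above. ToggleCache and loadToggle are hypothetical names, and the injectable clock exists only to make the expiry easy to demonstrate.

```javascript
// Caches toggle lookups for ttlMs milliseconds before asking the source again.
class ToggleCache {
  constructor(loadToggle, ttlMs, now = () => Date.now()) {
    this.loadToggle = loadToggle; // hypothetical lookup, e.g. a database query
    this.ttlMs = ttlMs;
    this.now = now;
    this.entries = new Map();
  }

  isEnabled(featureName, subscriptionLevel) {
    const key = `${featureName}:${subscriptionLevel}`;
    const entry = this.entries.get(key);
    if (entry && entry.expiresAt > this.now()) return entry.enabled;

    // Missing or expired: reload and remember the answer until the TTL passes.
    const enabled = this.loadToggle(featureName, subscriptionLevel);
    this.entries.set(key, { enabled, expiresAt: this.now() + this.ttlMs });
    return enabled;
  }
}

// Usage with a fake clock: two lookups inside the TTL cost one load.
let clock = 0;
let toggleLoads = 0;
const toggleCache = new ToggleCache(
  () => { toggleLoads += 1; return true; },
  30000,
  () => clock
);

toggleCache.isEnabled('hd-streaming', 'premium');
toggleCache.isEnabled('hd-streaming', 'premium');
const loadsWithinTtl = toggleLoads;

clock = 30001; // move past the TTL
toggleCache.isEnabled('hd-streaming', 'premium');
const loadsAfterExpiry = toggleLoads;
```

The TTL is the trade-off here: a longer value saves more database reads, but a toggle that was just switched off can stay visible for up to that long.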

2. Keep It Secure

Always validate user permissions to prevent unauthorized access.
Encrypt sensitive toggle configurations for security.
Use audit logs to track changes made to feature configurations.

3. Stay Organized

Regularly monitor which toggles are in use.
Clean up unused toggles to keep things tidy.
Document everything to maintain clarity.

4. Think About Scale

Design the system to scale efficiently for a large user base and numerous toggles.
Use caching effectively for scalability.
Consider using a distributed configuration system for global reach.

Wrapping Up

Feature toggles really feel like a superpower in the development world. They help us:

Gradually release features and test them safely.
Experiment with new ideas without risking the entire application.
Provide tailored experiences for different user groups.
Do all of this without deploying new code each time.

Starting with a single feature toggle is a great way to get familiar with this concept. As the comfort level grows, expanding to more complex scenarios—like controlling premium feature access—becomes easier.

The best part is that once feature toggles are in place, releasing new features becomes far less stressful. Instead of hoping for the best, we have full control over who sees what and when.

Thanks for reading! I’d love to hear your thoughts and experiences with implementing feature toggles. Feel free to share your comments or any insights you have from using them in your own projects.