Introduction
The request sounded simple: “Can we keep patient phone numbers up to date?”
At first we thought it was a quick operations chore. Then we traced the real data flow and saw the mess underneath. Phone numbers rolled in from WhatsApp, hospital front desks, and survey tools, each with its own format. Patients jumped between facilities, so their trails were often broken. The operations team lived in Google Sheets, and every region guarded its own FHIR server with different credentials, limits, and quirks.
Our first fix was a tiny script that looped through one row at a time. On a test file it worked fine, but once we aimed it at 10,000 rows the run dragged on for hours, chewed through hundreds of megabytes of memory, and could crash if a single record looked wrong.
This article is the story of how that fragile script became a production-ready workflow. The same 10,000-row load now finishes in about 10–12 minutes per region, using only 256Mi memory and 0.5 vCPU. More important, it stays steady, it survives bad data, and operations teams are happy to run it every day.
From One Region to a Platform
The moment a second region joined the queue, the to-do list grew fast. We suddenly needed to:
- Serve multiple regions, such as Purbalingga and Lombok Barat, at the same time without stepping on each other.
- Keep every environment on its own FHIR endpoint, spreadsheet, and credential set; no mixing, ever.
- Give operators live feedback with status updates, readable logs, and a safe way to restart when things went wrong.
At that point the one-off script had nowhere to grow. We needed a real architecture that could respect those boundaries.
The Challenge: Scale, Isolation, and Real-World Limits
Running two regions side by side exposed the real limits:
- Tens of thousands of records every day.
- Updates that had to finish in under 15 minutes.
- Tight resource limits (very small memory and CPU).
- Zero tolerance for mixing environments.
We also needed more resilience: a single broken record could not freeze the run, and the FHIR servers deserved a gentle pace so they never tipped into overload.
Why the Simple Loop Fails
# Inefficient sequential approach
for record in all_records:
    patient = find_patient(record['identifier'])
    update_patient_phone(patient, record['phone_number'])
This loop looked fine in early tests. In production it fell apart: every row opened new network calls, memory crept upward, progress stayed invisible, and one exception could stop the whole job. Every shortcut in that design came back to hurt us later.
The Solution: Smart Batching Across Multiple Environments
The turning point came when we stopped thinking about “a script that updates phones” and started thinking about “a pipeline that needs to stay healthy.” Stability, visibility, and consistency became the main goals.
Once we named those needs, the design almost wrote itself: batch the work, reuse connections, pace the requests, and send every result back into the spreadsheets everyone already trusted. On top of that, make sure each batch leaves a clear log trail so operators can watch the system move.
Architecture Overview
┌─────────────────┐ ┌──────────────────┐ ┌─────────────────┐
│ Google Sheets │ │ Flask Web │ │ FHIR Server │
│ (per region) │───▶│ Application │───▶│ (per region) │
│ │ │ │ │ │
│ • Regional rows │ │ • Batch engine │ │ • Patient query │
│ • Status column │ │ • Memory hygiene │ │ • Phone update │
│ • Daily feeds │ │ • Safe retries │ │ • Rate limiting │
└─────────────────┘ └──────────────────┘ └─────────────────┘
│ │
└───────────────┬───────┘
│
┌────────────────────┐
│ Config & Secrets │
│ (per environment) │
│ JSON + env vars │
└────────────────────┘
In practice, each region runs end to end on its own. Operators manage rows in a regional Google Sheet, the Flask app reads that sheet and processes batches, and every update goes to the matching FHIR server. The config and secrets layer supplies the right credentials and URLs per run, so requests stay isolated and nothing leaks across environments.
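To make that isolation concrete, here is a minimal sketch of how a per-environment config might be loaded at startup. It mirrors the JSON file shown later in this article; the EnvironmentConfig class and load_config helper are illustrative names, not the exact code we run.
import json
import os
from dataclasses import dataclass

@dataclass
class EnvironmentConfig:
    name: str
    fhir_server_url: str
    fhir_api_token: str
    spreadsheet_id: str
    worksheet_id: str

def load_config(path):
    """Read the region's JSON file; let real env vars override it."""
    with open(path) as f:
        raw = json.load(f)
    env = raw.get('env_vars', {})
    return EnvironmentConfig(
        name=raw['name'],
        fhir_server_url=os.environ.get('FHIR_SERVER_URL', env.get('FHIR_SERVER_URL')),
        fhir_api_token=os.environ.get('FHIR_API_TOKEN', env.get('FHIR_API_TOKEN')),
        spreadsheet_id=os.environ.get('SPREADSHEET_ID', env.get('SPREADSHEET_ID')),
        worksheet_id=os.environ.get('WORKSHEET_ID', env.get('WORKSHEET_ID')),
    )

# One process, one region:
# config = load_config('purbalingga.json')
One process only ever sees one config, which is what keeps a bad deploy in one region from touching any other.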
The Batch Engine (Built for Production)
import time

# Helpers (setup_session, find_patient, update_patient_phone,
# update_worksheet_status) and the shared logger are defined elsewhere
# in this article and in the app.
def process_records(records, worksheet, fhir_url, headers):
    start_time = time.time()
    results = {
        'status': 'success',
        'total_records': len(records),
        'successful_updates': 0,
        'failed_updates': 0,
        'patients_not_found': 0,
        'processing_time_minutes': 0,
        'batches_processed': 0,
        'errors': []
    }
    try:
        session = setup_session()
        batch_size = 100
        logger.info(f"Starting batch processing with batch size: {batch_size}")
        for i in range(0, len(records), batch_size):
            batch = records[i:i+batch_size]
            batch_num = i // batch_size + 1
            total_batches = (len(records) - 1) // batch_size + 1
            logger.info(f"=== Processing batch {batch_num}/{total_batches} ({len(batch)} records) ===")
            for idx, record in enumerate(batch):
                try:
                    row_index = record['row_index']
                    identifier = record['identifier']
                    phone = record['phone_number']
                    record_num = i + idx + 1
                    if record_num % 500 == 0:
                        logger.info(f"=== PROGRESS: {record_num}/{len(records)} processed ===")
                    patients = find_patient(session, identifier, fhir_url, headers)
                    if not patients:
                        results['patients_not_found'] += 1
                        msg = f"Patient {identifier} not found"
                        results['errors'].append(msg)
                        update_worksheet_status(worksheet, row_index, "failed", msg)
                        continue
                    # One identifier can match several Patient resources; update them all
                    all_success = True
                    failed_msgs = []
                    success_count = 0
                    for patient in patients:
                        ok, message = update_patient_phone(
                            session, patient['id'], phone, fhir_url, headers
                        )
                        if ok:
                            success_count += 1
                        else:
                            all_success = False
                            failed_msgs.append(f"Patient {patient['id']}: {message}")
                    if all_success:
                        results['successful_updates'] += success_count
                        update_worksheet_status(worksheet, row_index, "success")
                    else:
                        results['failed_updates'] += (len(patients) - success_count)
                        msg = f"Failed {len(failed_msgs)}/{len(patients)}"
                        update_worksheet_status(worksheet, row_index, "failed", msg)
                    time.sleep(0.1)  # be kind to downstreams
                except Exception as e:
                    # One bad row must never stop the run
                    results['failed_updates'] += 1
                    msg = f"Error processing {record.get('identifier', 'unknown')}: {e}"
                    results['errors'].append(msg)
                    logger.exception(f"❌ {msg}")
                    continue
            # Keep memory flat
            del batch
            time.sleep(0.1)
            results['batches_processed'] += 1
        results['processing_time_minutes'] = round((time.time() - start_time) / 60, 2)
        logger.info(f"Processing completed in {results['processing_time_minutes']} minutes")
    except Exception as e:
        results['status'] = 'error'
        results['message'] = f'Processing failed: {e}'
        logger.exception("Fatal error")
    return results
The batch engine became the system’s heart. Inside the loop you can see how it slices the sheet into blocks of 100 rows, logs the batch number, and keeps track of every success or failure. That small pause after each batch and record keeps memory flat and slows the request rate so the FHIR servers never get hammered. Even when a row misbehaves, the error handler writes it down and the loop keeps going, which means operators still see steady progress instead of a half-finished run.
Practices That Made the System Work
1) Batch Size That Fits Reality
When we tried tiny batches the system spent more time setting up than doing real work. When we went too big, the process grabbed extra memory and slowed everything down. After a few trial runs, 100 records felt balanced: quick to process, light on resources, and easy to monitor in the logs and in the sheet.
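The slicing itself is ordinary Python. A minimal sketch of the pattern (the chunked helper name is ours, not from the production code):
def chunked(items, size=100):
    """Yield consecutive slices so only one batch is alive at a time."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

# e.g. 10,000 rows -> 100 batches of 100:
# for batch in chunked(records):
#     process(batch)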
2) Connection Pooling and Safe Retries
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def setup_session():
    session = requests.Session()
    retry = Retry(
        total=3,
        backoff_factor=1,
        status_forcelist=[429, 500, 502, 503, 504]
    )
    adapter = HTTPAdapter(max_retries=retry)
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session
Pooling the HTTP session kept us from opening a fresh connection for every row, which trimmed latency and CPU spikes. The retry helper then waited a little longer after each failure, so short network hiccups cleared on their own instead of breaking the run. With those two pieces in place, the pipeline finished sooner and recovered smoothly from the usual internet noise.
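For illustration, the session is created once per run and passed to every helper, so all requests share one connection pool (fhir_url and headers here are the same per-region values used throughout):
session = setup_session()  # created once per run
r1 = session.get(f"{fhir_url}/Patient", params={'identifier': 'A-001'}, headers=headers, timeout=30)
r2 = session.get(f"{fhir_url}/Patient", params={'identifier': 'A-002'}, headers=headers, timeout=30)
# r2 reuses the pooled TCP connection opened for r1, and both calls
# automatically retry on 429/5xx with increasing backoff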
3) Explicit Memory Hygiene
# keep memory flat
del batch
time.sleep(0.1)
By explicitly deleting each batch and pausing briefly, memory remained flat. No creeping leaks, no surprises during long runs.
4) Pacing to Protect Servers
time.sleep(0.1) # ~10 ops/sec
A short delay between requests prevented FHIR servers from being overwhelmed. Paradoxically, slowing down slightly made the whole system finish faster, because retries and throttling were reduced.
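The fixed sleep is the simplest possible throttle. If you want a steadier ceiling regardless of how long each request takes, a minimal pacing helper looks like this (our sketch, not the production code):
import time

class Pacer:
    """Enforce a minimum interval between operations (~10 ops/sec at 0.1s)."""
    def __init__(self, min_interval=0.1):
        self.min_interval = min_interval
        self._last = 0.0

    def wait(self):
        # Sleep only for whatever remains of the interval
        elapsed = time.monotonic() - self._last
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self._last = time.monotonic()

# pacer = Pacer()
# pacer.wait() before each FHIR call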
Features That Build Trust
Real-Time Status in Sheets
def update_worksheet_status(worksheet, row_index, status, error_message=None):
    """Update the status_update_phone_number column (Column G)."""
    try:
        status_column = 7  # Column G (1-indexed)
        value = "success" if status == "success" else (
            f"failed: {error_message}" if error_message else "failed"
        )
        worksheet.update_cell(row_index, status_column, value)
        logger.info(f"Row {row_index} (G): {value}")
    except Exception as e:
        logger.error(f"Failed to update status for row {row_index}: {e}")
Operators never asked for a new dashboard; they just wanted their sheet to tell them what happened. This helper does exactly that, dropping “success” or “failed” (with the reason) straight into Column G so they can watch the run move row by row.
Structured Batch Logs and Alerts
Every batch writes a compact log entry with the batch number, record count, and any failures. Those logs land in Cloud Logging and a small alerting rule pings the on-call channel when something looks off. If a row fails, the operator spots it in the sheet and can jump straight to the matching log line because the correlation ID is right there in the message.
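We have not shown the logging setup here, but the shape of each entry is simple. A hedged sketch of what a per-batch log line might look like (field names are illustrative):
import json

def log_batch_summary(run_id, batch_num, total_batches, batch_size, failures):
    # One JSON line to stdout; Cloud Run forwards stdout to Cloud Logging,
    # which parses JSON lines into a structured jsonPayload
    print(json.dumps({
        'run_id': run_id,  # correlation ID, echoed in the sheet's failure message
        'batch': f"{batch_num}/{total_batches}",
        'records': batch_size,
        'failures': failures,
    }))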
Handle Multiple Patients per Identifier
def find_patient(session, identifier, fhir_url, headers):
    url = f"{fhir_url}/Patient"
    params = {'identifier': identifier}
    r = session.get(url, params=params, headers=headers, timeout=30)
    if r.status_code == 200:
        data = r.json()
        if data.get('total', 0) > 0:
            # 'entry' can be absent even when 'total' is set; guard for it
            return [e['resource'] for e in data.get('entry', [])]
    return []
Some identifiers pointed to more than one Patient record. Rather than pretend the duplicates did not exist, the system updates each match so that all copies stay aligned, even when the source data is messy.
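We have used update_patient_phone throughout without showing it. Here is a minimal sketch of one plausible implementation, assuming a standard FHIR read-modify-write on the Patient's telecom field; our production version differs in details, but the (ok, message) contract is the one the batch engine relies on:
def update_patient_phone(session, patient_id, phone, fhir_url, headers):
    """Read the Patient, replace its phone telecom entry, and PUT it back."""
    url = f"{fhir_url}/Patient/{patient_id}"
    r = session.get(url, headers=headers, timeout=30)
    if r.status_code != 200:
        return False, f"Read failed: HTTP {r.status_code}"
    patient = r.json()
    # Drop existing phone entries, keep e-mail and other contact points
    telecom = [t for t in patient.get('telecom', []) if t.get('system') != 'phone']
    telecom.append({'system': 'phone', 'value': phone, 'use': 'mobile'})
    patient['telecom'] = telecom
    r = session.put(url, json=patient, headers=headers, timeout=30)
    if r.status_code in (200, 201):
        return True, "updated"
    return False, f"Update failed: HTTP {r.status_code}"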
Phone Number Cleaning
import re

def clean_phone_number(phone):
    if not phone:
        return None
    phone = str(phone).strip()
    # Sheets can turn big numbers into scientific notation
    if 'e+' in phone.lower() or 'e-' in phone.lower():
        try:
            phone = f"{float(phone):.0f}"
        except ValueError:
            logger.warning(f"Could not parse scientific notation: '{phone}'")
    # If multiple numbers, take the first
    if ',' in phone:
        phone = phone.split(',')[0].strip()
    # Keep only digits and plus
    phone = re.sub(r'[^\d+]', '', phone)
    # Indonesian heuristic: restore missing leading zero
    if not phone.startswith('+') and phone.startswith('8') and len(phone) in [11, 12]:
        phone = '0' + phone
        logger.info(f"Added missing leading zero: '{phone}'")
    # Basic sanity check
    if len(phone.replace('+', '')) < 8:
        return None
    return phone
Phone numbers are notoriously messy. Sheets loves to turn big numbers into scientific notation, people paste in two numbers separated by commas, and in Indonesia a missing leading zero can point to the wrong person. The cleaner walks through each of those cases so the final value is something we can safely send to FHIR.
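A few examples of what the cleaner produces, worked by hand from the code above:
clean_phone_number("6.28123456789E+11")               # -> "628123456789" (scientific notation undone)
clean_phone_number("0812-3456-7890, 0813-0000-0000")  # -> "081234567890" (first number, digits only)
clean_phone_number("81234567890")                     # -> "081234567890" (leading zero restored)
clean_phone_number("12345")                           # -> None (fails the minimum-length check)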
Multi-Environment Deployment: One Region = One Tenant
JSON Config as Source of Truth
{
  "name": "environment-name",
  "project_id": "your-gcp-project-id",
  "service_name": "your-cloud-run-service-name",
  "region": "asia-southeast2",
  "platform": "managed",
  "memory": "256Mi",
  "cpu": "0.5",
  "timeout": "900",
  "max_instances": "1",
  "min_instances": "0",
  "concurrency": "1",
  "credential_file": "service-account-credentials.json",
  "env_vars": {
    "FHIR_SERVER_URL": "https://your-fhir-server.com/fhir",
    "FHIR_API_TOKEN": "your-api-token",
    "SPREADSHEET_ID": "your-google-sheets-id",
    "WORKSHEET_ID": "your-worksheet-name"
  }
}
Each environment (Purbalingga, Lombok Barat, and friends) gets its own JSON file. The application code stays the same, while the config file names the project, credentials, and spreadsheet for that region. That simple split keeps the runs isolated, makes audits easy, and lets us roll back a region without touching the others.
Cloud Run Profile Per Environment
Each region deploys to its own Cloud Run service with a lean profile: 256Mi memory, 0.5 vCPU, and a single instance. It keeps costs low, keeps performance predictable, and matches the steady pace we designed for.
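Because everything lives in the JSON file, deployment can stay boring. As a sketch, a small script can translate the config straight into gcloud flags (the deploy function and image name are ours, not from the original pipeline):
import json
import subprocess

def deploy(config_path, image):
    with open(config_path) as f:
        cfg = json.load(f)
    env_vars = ','.join(f"{k}={v}" for k, v in cfg['env_vars'].items())
    # Each flag maps one-to-one onto a key in the region's JSON file
    subprocess.run([
        'gcloud', 'run', 'deploy', cfg['service_name'],
        '--image', image,
        '--project', cfg['project_id'],
        '--region', cfg['region'],
        '--platform', cfg['platform'],
        '--memory', cfg['memory'],
        '--cpu', cfg['cpu'],
        '--timeout', cfg['timeout'],
        '--max-instances', cfg['max_instances'],
        '--min-instances', cfg['min_instances'],
        '--concurrency', cfg['concurrency'],
        '--set-env-vars', env_vars,
    ], check=True)

# deploy('purbalingga.json', 'gcr.io/your-project/phone-updater')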
Lessons Learned
When we stepped back after the first few successful runs, a handful of habits stood out.
Regional boundaries matter. Keeping configs and credentials per region meant every incident stayed where it started. If Lombok Barat hit a problem, Purbalingga kept running without even noticing.
100-record batches are the sweet spot. That size is big enough to move quickly but small enough to avoid memory spikes. It also lines up nicely with the logging we added, so operators can read progress in plain language.
The spreadsheet is still the source of truth. By writing results directly into Column G, we gave operators instant trust. They did not need to learn a new tool; their everyday sheet became the dashboard.
Polite clients make for calm servers. Gentle pacing and retries with backoff handled the usual internet noise. Instead of chasing flaky errors, we saw quiet logs and smooth throughput.
Clean data upfront saves pain later. Fixing phone numbers at the edge kept downstream systems clean. Once we did that, support tickets about wrong contacts dropped sharply.
Conclusion
The journey took us from a fragile script to a workflow the team can trust. We didn’t introduce exotic technology; we simply leaned on good habits: clear boundaries, careful batching, shared connections, steady pacing, and honest visibility.
Today the 10,000-row jobs finish in about 12 minutes per environment. Memory stays flat. Operators watch the spreadsheet fill with results while alerts stay quiet.
For us, that’s what healthcare data scaling looks like: not only faster runs, but calmer shifts, clearer feedback, and an architecture that can keep growing with the organization.