The Challenge: From Data Chaos to Clear Metrics
Data engineers working in the health sector face a familiar but critical challenge: Community Health Workers (CHWs) generate thousands of activity records daily, but turning this raw data into actionable performance metrics is a manual, error-prone process. Field coordinators need to answer simple but vital questions:
- How many households did each CHW visit last month?
- Which communities have coverage gaps?
- Are our health workers meeting their activity targets?
I worked on one such project where the existing process involved Excel exports, manual date calculations, and fragile SQL queries that broke whenever source data changed. Something had to change.
The Solution: A dbt-Driven Analytics Pipeline
I designed and built a scalable analytics platform using dbt (data build tool) and Snowflake that could grow with our organization's needs. Here's how I approached it:
The "Aha!" Moment: Month Assignment Logic
The first major insight came from understanding our operational reality. CHWs work in remote areas, and activities often happen late in the month but get recorded days later. Our old system used simple calendar months, which unfairly penalized workers for reporting delays.
The breakthrough was implementing a business-specific month assignment rule:
CASE
    WHEN DAY(activity_date) >= 26 THEN
        DATE_TRUNC('MONTH', DATEADD(MONTH, 1, activity_date))
    ELSE
        DATE_TRUNC('MONTH', activity_date)
END AS report_month
This simple change meant activities from the 26th onward counted toward the next month's metrics, effectively defining a reporting month that runs from the 26th of one calendar month through the 25th of the next. It sounds small, but it transformed how we measured performance and boosted CHW morale overnight.
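Because this rule ends up in more than one model, it is worth factoring into a dbt macro so every model applies it identically. Here is a minimal sketch; the macro name and file path are illustrative, not part of the original project:
-- macros/report_month.sql (illustrative name and path)
{% macro report_month(date_column) %}
    CASE
        WHEN DAY({{ date_column }}) >= 26 THEN
            DATE_TRUNC('MONTH', DATEADD(MONTH, 1, {{ date_column }}))
        ELSE
            DATE_TRUNC('MONTH', {{ date_column }})
    END
{% endmacro %}
Downstream models can then write {{ report_month('activity_date') }} AS report_month instead of repeating the CASE expression.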
Building the Foundation: Staging Layer
I started with a clean staging layer that handled the messy reality of field data:
-- models/staging/stg_chw_activity.sql
{{
    config(
        materialized='view',
        schema='ANALYTICS'
    )
}}

SELECT
    chv_id,
    activity_date,
    household_id,
    patient_id,
    activity_type
FROM {{ source('raw', 'fct_chv_activity') }}
WHERE activity_date IS NOT NULL
    AND deleted = FALSE
    AND chv_id IS NOT NULL
The beauty of this approach? Data quality issues were handled once, at the source. Downstream models could focus on business logic rather than data cleaning.
The Power of Incremental Processing
With thousands of new activities daily, full refreshes became impractical. I implemented incremental processing with a delete+insert strategy:
{{
    config(
        materialized='incremental',
        unique_key=['chv_id', 'report_month'],
        incremental_strategy='delete+insert'
    )
}}
This handled late-arriving data gracefully while maintaining performance. Processing time dropped from hours to minutes.
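For context, here is a sketch of what the full incremental monthly model might look like. The model name, the metric columns, and the two-month lookback window are illustrative assumptions rather than the exact production code:
-- models/marts/fct_chw_monthly_metrics.sql (illustrative)
{{
    config(
        materialized='incremental',
        unique_key=['chv_id', 'report_month'],
        incremental_strategy='delete+insert'
    )
}}

SELECT
    chv_id,
    CASE
        WHEN DAY(activity_date) >= 26 THEN
            DATE_TRUNC('MONTH', DATEADD(MONTH, 1, activity_date))
        ELSE
            DATE_TRUNC('MONTH', activity_date)
    END AS report_month,
    COUNT(DISTINCT household_id) AS households_visited,
    COUNT(*) AS total_activities
FROM {{ ref('stg_chw_activity') }}
{% if is_incremental() %}
    -- Only rebuild recent reporting months on incremental runs;
    -- delete+insert replaces those partitions wholesale, so
    -- late-arriving rows are picked up on the next run
    WHERE activity_date >= DATEADD(MONTH, -2, CURRENT_DATE())
{% endif %}
GROUP BY 1, 2
Replacing whole (chv_id, report_month) partitions rather than merging row by row is what keeps late reports from creating duplicates or double-counting.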
The Game Changer: Custom Data Quality Tests
Early on, I discovered that data quality issues could undermine even the best analytics. Missing dates, negative IDs, and stale data created silent errors. So I built a comprehensive framework of singular dbt tests. A singular test fails whenever its query returns rows, so each test below is written to return rows only when something is wrong:
1. Data Freshness Monitoring
-- tests/data_freshness.sql
SELECT MAX(activity_date) AS latest_activity
FROM {{ source('raw', 'fct_chv_activity') }}
HAVING MAX(activity_date) < DATEADD(day, -7, CURRENT_DATE())
This test ensured we knew within days if field data collection stopped working.
2. Date Boundary Validation
-- tests/date_boundaries.sql
SELECT COUNT(*) AS invalid_dates
FROM {{ ref('stg_chw_activity') }}
WHERE activity_date < '2020-01-01'
    OR activity_date > CURRENT_DATE()
HAVING COUNT(*) > 0
This caught system date errors before they polluted our metrics.
3. Negative Value Detection
-- tests/negative_values.sql
SELECT COUNT(*) AS negative_values
FROM {{ ref('stg_chw_activity') }}
WHERE chv_id < 0 OR household_id < 0 OR patient_id < 0
HAVING COUNT(*) > 0
This identified application bugs that created invalid identifiers.
The tests became our early warning system, catching issues before they reached decision-makers.
Technical Architecture Decisions
Why dbt?
- Documentation as Code: Every model includes descriptions and tests
- Modular SQL: Reusable components that team members could understand
- Dependency Management: Automatic handling of model dependencies
- Version Control: All changes tracked in Git
Why Snowflake?
- Cross-Database Processing: Seamlessly query across RAW and analytics databases
- Warehouse Scaling: Handle large processing jobs without infrastructure changes
- Zero-Copy Cloning: Safe testing and development environments
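To illustrate that last point, a full development environment can be cloned from production with a single statement. The database names here are illustrative:
-- Zero-copy clone for a safe development environment (names are illustrative)
CREATE DATABASE analytics_dev CLONE analytics;
The clone shares the underlying storage with the source, so it is created almost instantly and only consumes extra storage as the two copies diverge.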
The Implementation Journey
Week 1: Foundation
I started small with the staging model and basic monthly aggregation. The first version processed just core activities but proved the concept.
Week 2: Data Quality
After discovering some data issues in production, I implemented the custom test suite. This built confidence in our metrics.
Week 3: Incremental Processing
As data volumes grew, I refactored to incremental models. The performance improvement was immediate and dramatic.
Week 4: Documentation & Deployment
I used dbt's built-in documentation to create a data dictionary that non-technical stakeholders could understand.
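Generating and browsing that documentation is built into the dbt CLI:
dbt docs generate   # compiles the project and builds the documentation site
dbt docs serve      # hosts it locally so stakeholders can browse models and lineage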
Lessons Learned
1. Start with Business Logic
The most valuable part wasn't the technology—it was codifying our month assignment rule. Technology should serve business needs, not dictate them.
2. Build Quality In Early
Adding tests from day one prevented data quality debt from accumulating. It's easier to maintain quality than to fix it later.
3. Documentation is a Feature
The time I spent on dbt docs paid for itself when new team members could understand the data pipeline without hand-holding.
4. Simple is Scalable
The clean separation between staging and metrics layers meant we could easily add new activity types without refactoring everything.
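To make that concrete, adding a new activity type is usually just another conditional aggregate in the metrics layer. The activity_type values below are illustrative:
-- Extending the monthly model with new activity types
-- (activity_type values are illustrative; report_month() is the macro sketched earlier)
SELECT
    chv_id,
    {{ report_month('activity_date') }} AS report_month,
    COUNT_IF(activity_type = 'household_visit') AS household_visits,
    COUNT_IF(activity_type = 'referral') AS referrals
FROM {{ ref('stg_chw_activity') }}
GROUP BY 1, 2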
The Impact
Within a month of deployment:
- Monthly reporting time dropped from 3 days to 15 minutes
- Data quality issues were caught and fixed before affecting metrics
- Field coordinators could access current metrics anytime via the dashboard
- CHW performance conversations became data-driven rather than anecdotal
One field manager told me: "For the first time, I can see which communities need support and recognize workers who are going above and beyond."
Looking Ahead
The platform continues to evolve. We're now adding:
- Predictive analytics for CHW performance
- Automated alerting for coverage gaps
- Integration with supply chain data
But the foundation remains the same: clean, tested, documented data transformations that serve real business needs.
My Advice for Other Data Practitioners
- Understand the business context before writing a single line of SQL
- Invest in testing: it's cheaper than fixing bad decisions made with bad data
- Document as you build: your future self will thank you
- Start simple and iterate based on user feedback
Building this platform taught me that the most sophisticated data engineering serves the simplest human needs: helping people understand their work and make better decisions. And that's a technical and professional achievement I'm proud to share.
The complete codebase is available on GitHub.