
Blessing Angus
Building an Incremental Zoho Desk to BigQuery Pipeline: Lessons from the Trenches

What I thought would take a few days ended up taking weeks, so here's everything I learned building a production Zoho Desk to BigQuery pipeline from scratch.


When my company decided to centralise customer support analytics, the task landed on my plate: pull data from Zoho Desk, land it in BigQuery, transform with dbt, done. The plan looked clean on paper.

What followed was a masterclass in why production data engineering is never as simple as the happy path suggests.

This is the story of building that pipeline: the architecture decisions, the walls I hit, and the lessons I'll carry into every pipeline I build from here.


The Starting Point: A Full Load That Took Forever

The first working version of the pipeline was blunt but functional. Every day, pull every ticket ever created from the Zoho Desk API and overwrite the BigQuery table. Simple. Predictable. And for tickets, completely untenable at scale because the table had over a million rows and was growing daily.

I needed incremental loads. But before I could get there, I had a different problem to solve: how do I get all the historical data into BigQuery in the first place without running an API job for days?


The Bootstrapping Problem: When the API Is Too Slow for the Initial Load

Loading years of historical data through a paginated API isn't a pipeline problem, it's a waiting problem. For a table with hundreds of thousands of rows, even a well-optimised API pull can take hours or days just for the seed load.

The solution was to sidestep the API entirely for the initial load. Zoho Desk has a built-in data backup feature that exports your entire account data as CSV files. I used this to export a full snapshot of tickets, threads, contacts, and calls, then loaded each CSV directly into BigQuery via the BQ Console UI.

The UI load process:

  • Format: CSV
  • Schema: defined manually (not auto-detect — more on why this matters later)
  • Skip leading rows: 1 (header row)
  • Allow quoted newlines: yes (critical for fields like ticket descriptions that contain line breaks)
  • Allow jagged rows: yes (exported rows sometimes omit trailing optional fields)

Once the historical snapshot was in BigQuery, I set the incremental pipeline's start_date to the backup date. The first scheduled run picks up any changes from that day forward, no gap, no overlap.

Lesson: For large initial loads, don't fight the API. Use native export features if they exist. The pipeline is for keeping data fresh; getting the history in is a one-time bootstrapping problem that deserves its own solution.
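If you'd rather script the load than click through the Console, the same options map onto `bq load` flags. A sketch with placeholder dataset, table, bucket, and schema-file names:

```shell
# One-time bootstrap load of a Zoho Desk backup CSV into BigQuery.
# Dataset, table, bucket, and schema paths below are placeholders.
bq load \
  --source_format=CSV \
  --skip_leading_rows=1 \
  --allow_quoted_newlines \
  --allow_jagged_rows \
  --schema=./schemas/zoho_desk_tickets.json \
  my_dataset.zoho_desk_tickets \
  gs://my-bucket/backups/tickets.csv
```

The explicit `--schema` file plays the role of "defined manually (not auto-detect)" from the list above.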


The Architecture: Code Generation Over Copy-Paste

Here's how the pipeline itself is structured. Rather than writing a separate Airflow DAG for every Zoho Desk endpoint, I built a code-generation system:

  • A custom Airflow operator (ZohoDeskToGCSOperator) handles all the API extraction logic: pagination, OAuth, concurrent detail fetching, incremental search
  • A Jinja template defines the DAG structure once
  • YAML config files, one per endpoint, each defining the parameters: schedule, columns, schema, endpoint type
  • A generator script renders each YAML config through the template into a DAG Python file

Adding a new endpoint means writing a YAML file and running the generator, not copying a DAG from scratch. The template handles the branching logic for different endpoint types automatically.
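The generator idea can be sketched in a few lines. This is illustrative, not the actual project code: stdlib `string.Template` stands in for Jinja, and a dict stands in for a parsed YAML config.

```python
from string import Template

# Stand-in for the Jinja DAG template; the real one also defines the
# extraction, load, and merge operators for each endpoint type.
DAG_TEMPLATE = Template('''\
from airflow import DAG

dag = DAG(
    dag_id="zoho_desk_${endpoint}",
    schedule="${schedule}",
)
''')

def render_dag(config: dict) -> str:
    """Render one endpoint's DAG source file from its config."""
    return DAG_TEMPLATE.substitute(config)

# Stand-in for one endpoint's parsed YAML config file.
print(render_dag({"endpoint": "tickets", "schedule": "@daily"}))
```

The point of the pattern: the template is written once, reviewed once, and every endpoint inherits fixes to it automatically.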

Zoho Desk API
      │
      ▼
ZohoDeskToGCSOperator  ──►  GCS (staging CSV)
                                    │
                                    ▼
                GCSToBigQueryOperator  ──►  BigQuery (_staging table)
                                                              │
                                                              ▼
                                            BigQueryInsertJobOperator
                                              (MERGE into main table)

For large transactional tables (tickets, contacts, threads, calls), data lands in a _staging table first, then gets merged into the main table, updating existing rows and inserting new ones. For small reference tables (agents, teams, departments), a daily WRITE_TRUNCATE is sufficient.
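The staging-then-merge step boils down to a statement like this (a simplified sketch with placeholder project, dataset, and column names; the real statement is generated from the YAML schema):

```sql
MERGE `project.dataset.zoho_desk_tickets` T
USING `project.dataset.zoho_desk_tickets_staging` S
ON T.`id` = S.`id`
WHEN MATCHED THEN
  UPDATE SET T.`subject` = S.`subject`, T.`status` = S.`status`
WHEN NOT MATCHED THEN
  INSERT (`id`, `subject`, `status`)
  VALUES (S.`id`, S.`subject`, S.`status`)
```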


Challenge 1: Not All APIs Are Created Equal

The ticket and contact endpoints support Zoho's modifiedTimeRange parameter, which lets you pass a start and end timestamp and get back only records modified in that window. Perfect for incremental loads.

The /calls endpoint does not. Pass modifiedTimeRange to it and you get a 422 back.

The workaround: sort by createdTime descending and stop paginating as soon as the oldest record on the current page predates your window. Since calls are append-only in practice, this is equivalent.

for rec in records:
    # ISO-8601 timestamps compare correctly as strings
    if rec["createdTime"] < data_interval_start:
        done = True  # everything older is outside the window; stop paginating
        break
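Fleshed out, the createdTime-based workaround is a full pagination loop. A sketch: `fetch_calls_page` stands in for the operator's actual page-fetching method and is not the real name.

```python
def fetch_calls_since(fetch_calls_page, data_interval_start: str) -> list:
    """Collect /calls records created on or after data_interval_start.

    Pages are requested sorted by createdTime descending, so we can stop
    as soon as a page's oldest record predates the window.
    fetch_calls_page(page) returns a list of records, [] when exhausted.
    """
    collected = []
    page = 1
    while True:
        records = fetch_calls_page(page)
        if not records:
            break
        done = False
        for rec in records:
            # ISO-8601 timestamps compare correctly as strings
            if rec["createdTime"] < data_interval_start:
                done = True
                break
            collected.append(rec)
        if done:
            break
        page += 1
    return collected
```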

Lesson: Don't assume API feature parity across endpoints from the same vendor. Test every endpoint independently before writing a single line of pipeline code.


Challenge 2: A Reserved Word Hiding in Your Column Names

The threads table has a column called to — as in the recipient of a thread. Perfectly reasonable name. Except TO is a reserved keyword in BigQuery SQL.

The MERGE statement was generating SQL like:

INSERT (`from`, to, `subject`, ...)
VALUES (S.`from`, S.to, S.`subject`, ...)

with to unquoted. BigQuery's parser sees the keyword TO in an unexpected position and throws:

Syntax error: Unexpected keyword TO at [40:130]

The fix was to backtick-quote the column name in the generated MERGE SQL.

Lesson: When generating SQL programmatically, always quote all identifiers. You won't always know which column names will collide with reserved words, especially when the schema is driven by a third-party API you don't control.
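One way to make that systematic is a small helper that quotes every identifier when the column lists are built. A sketch (not the project's actual generator code):

```python
def quote_ident(name: str) -> str:
    """Backtick-quote a BigQuery identifier, escaping embedded backticks."""
    return "`" + name.replace("`", "\\`") + "`"

def insert_clause(columns: list) -> str:
    """Build the INSERT/VALUES fragment of a generated MERGE statement,
    with every column name quoted regardless of whether it is reserved."""
    cols = ", ".join(quote_ident(c) for c in columns)
    vals = ", ".join("S." + quote_ident(c) for c in columns)
    return f"INSERT ({cols})\nVALUES ({vals})"
```

Quoting unconditionally means a new reserved-word column arriving from the API can never reintroduce the bug.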


Challenge 3: The MERGE That Couldn't Match Rows

After fixing the syntax error, the threads MERGE hit a different wall:

UPDATE/MERGE must match at most one source row for each target row

This one took some digging. The threads endpoint works like this: search for tickets modified in the time window, then fetch all threads for each of those tickets. The problem is that Zoho's modifiedTimeRange search is paginated and the same ticket can appear on multiple pages if the result set shifts between requests.

When that happens, threads get fetched for the same ticket twice. The staging table ends up with duplicate thread IDs. BigQuery's MERGE correctly refuses to update a target row when multiple source rows match it.

I fixed it at two layers:

In the operator (Python): deduplicate ticket IDs before fetching threads.

ticket_ids = list(dict.fromkeys(ticket_ids))

dict.fromkeys removes duplicates while preserving insertion order, which I find cleaner than converting to a set and back (a set would also scramble the order).

In the MERGE SQL (template): add a deduplication guard in the USING clause.

USING (
    SELECT * FROM `staging_table`
    QUALIFY ROW_NUMBER() OVER (PARTITION BY `id`) = 1
) S

The Python fix prevents the problem from occurring. The SQL fix is a safety net for edge cases I haven't thought of yet.

Lesson: For MERGE pipelines, always add a QUALIFY ROW_NUMBER() dedup guard in the staging select. Even if your source looks clean, it defends against edge cases you haven't anticipated.


Challenge 4: Auto-Detect Might Tell Lies

Here's where the bootstrapping decision from earlier came back to bite me. When I loaded the initial backup CSVs into BigQuery, I let auto-detect infer the schema rather than defining it explicitly. It was fast and convenient, and it caused failures down the line.

Auto-detect made choices that didn't match what the incremental pipeline expected:

  • onholdTime — stored by Zoho as a timestamp string, loaded by auto-detect as STRING
  • tagCount — a count field, auto-detected as INT64 (actually correct, but the YAML I'd written said STRING)
  • isEscalated — loaded as STRING because the CSV values were "true"/"false" strings, not proper booleans

None of these caused problems until the first MERGE ran and tried to assign a TIMESTAMP value to a STRING column. BigQuery's type enforcement at MERGE time is strict — and unforgiving.

The fix was to query INFORMATION_SCHEMA.COLUMNS on the main table and reconcile every column type against the YAML schema:

SELECT column_name, data_type
FROM `project.dataset.INFORMATION_SCHEMA.COLUMNS`
WHERE table_name = 'zoho_desk_tickets'
ORDER BY ordinal_position;
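The reconciliation itself is mechanical once both sides are reduced to column→type dicts. A sketch of the comparison (names are illustrative):

```python
def schema_mismatches(table_types: dict, yaml_types: dict) -> list:
    """Compare actual BigQuery column types (from INFORMATION_SCHEMA)
    against the types declared in the YAML schema config.

    Returns (column, actual_type, expected_type) tuples for every column
    whose types disagree, or that exists on only one side (type None).
    """
    mismatches = []
    for col in sorted(set(table_types) | set(yaml_types)):
        actual = table_types.get(col)
        expected = yaml_types.get(col)
        if actual != expected:
            mismatches.append((col, actual, expected))
    return mismatches
```

Running this once per table after any manual load would have caught all three type mismatches before the first MERGE did.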

Lesson: The initial load is the source of truth — not your assumptions about what the types should be. Always verify INFORMATION_SCHEMA before writing a schema config for any table that was bootstrapped manually.


The dbt Layer: Fixing What the Raw Layer Can't

With the raw pipeline stable, dbt handles the transformation into clean, analyst-ready tables.

This keeps the raw layer as a faithful copy of the source, while the transformation layer handles type normalisation.
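For instance, a staging model might normalise the types that auto-detect got wrong. A hypothetical dbt model; the source and column names are illustrative, not the project's actual models:

```sql
-- models/staging/stg_zoho_desk__tickets.sql
SELECT
  id,
  subject,
  SAFE_CAST(onholdTime AS TIMESTAMP) AS onhold_time,
  SAFE_CAST(isEscalated AS BOOL)     AS is_escalated
FROM {{ source('zoho_desk', 'zoho_desk_tickets') }}
```

SAFE_CAST returns NULL instead of failing on malformed values, so one bad row from the source can't break the whole model run.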


What I'd Do Differently

  1. Use native export for the initial load: don't fight the API for historical data. Export a full backup, load via the UI, then let the pipeline handle increments from that point.

  2. Never use auto-detect for a table that an incremental pipeline will MERGE into: define the schema explicitly, and verify with INFORMATION_SCHEMA immediately after loading.

  3. Add the QUALIFY ROW_NUMBER() dedup guard from day one: it costs nothing and saves you from mysterious MERGE failures later.

  4. Test every API endpoint's query parameter support independently: don't inherit assumptions from one endpoint to another.

  5. Backtick-quote all identifiers in generated SQL: reserved word collisions are unpredictable and the fix is trivial.

  6. Keep the raw and transformation layers separate: having raw data land in BigQuery with minimal transformation, and a separate dbt layer for typing and renaming, makes debugging far easier. You can always re-run dbt without re-hitting the API.


Wrapping Up

None of the challenges described here were exotic. Reserved words, duplicate rows, type mismatches, API inconsistencies: these are bread-and-butter data engineering problems. What made them feel hard was hitting them one at a time in production, under pressure, on a pipeline that analysts were waiting on.

The pipeline is now running reliably in production: tickets, contacts, threads, agents, teams, departments, and accounts incrementally loaded daily.

If you're building something similar — whether with Zoho, Salesforce, HubSpot, or any SaaS API — I hope these lessons save you a few hours of head-scratching.


The pipeline is built with Apache Airflow, Google Cloud Storage, BigQuery, and dbt. The custom operator pattern and code-generation approach described here are applicable to any REST API integration.
