Panagiotis for Agile Actors Hellas

Cross-Cloud Pipeline with ADF & STS: Architecture, Troubleshooting & Costs

Every data engineer eventually ends up staring at a problem that shouldn't exist. Data that needs to be somewhere it isn't. Two systems that should talk to each other but don't. A business requirement that assumes clouds are just different tabs in the same browser.

Our version of this problem was simple to describe and genuinely interesting to solve: operational data lived in PostgreSQL on Azure, while the analytics team (data scientists, BI developers, the people who actually make decisions from data) had built everything in BigQuery on GCP. Nobody was migrating either side, so my job was to make them talk.

What followed was one of those projects that starts as "a quick pipeline" and ends up teaching you more about cloud architecture, cross-service authentication, and silent failure modes than you expected. Every layer worked beautifully in isolation, but the problems lived exclusively in the spaces between services, in the handoffs, the assumptions, the error messages that pointed everywhere except at the actual cause.

This is that story. The architecture, yes, but more so the debugging sessions that shaped it. If you're building anything that crosses cloud boundaries, the troubleshooting sections alone might save you a few weeks.


How We Got Here

Companies rarely end up multi-cloud by design. It usually happens through acquisitions, through teams making independent vendor decisions, or through the gravitational pull of a tool that's genuinely best-in-class for its purpose.

In our case, the operational side of the business had grown up on Azure, with infrastructure, networking, and identity all running on Microsoft. PostgreSQL on Azure's managed Flexible Server made sense because it's a solid managed database with clean VNet integration and no public endpoint, which is a feature, not a limitation.

The analytics side had independently converged on Google Cloud. BigQuery is genuinely exceptional for analytical workloads, dbt had become the transformation layer, and Looker sat on top. The team had invested years building in this ecosystem, so migrating to Azure wasn't realistic, and nobody had the appetite for it either.

So we had two clouds, both legitimate, both entrenched, and we needed a bridge.


The First Surprise: ADF Can't Write to BigQuery

The natural starting point was Azure Data Factory, Microsoft's managed data integration service that has connectors for hundreds of sources and sinks, including a Google BigQuery connector right there in the UI.

What the marketing materials don't lead with: the BigQuery connector in ADF is source-only. You can read data from BigQuery into Azure, but you cannot write to it. Same story with Google Cloud Storage, which is also not a supported sink.

I remember the exact moment I discovered this. I had already designed half the pipeline in my head, envisioning a clean Copy Activity with source PostgreSQL and sink BigQuery, done by lunch. I opened the sink configuration dropdown, scrolled through every Azure-native option on offer, and scrolled again. BigQuery wasn't among them. I scrolled one more time, but no, I hadn't missed it.

This is one of those discoveries that reshapes an entire project in a single moment. It's not a bug or a misconfiguration, it's a fundamental constraint of how ADF's connector ecosystem works, and once you accept it, everything downstream changes. The tempting response is frustration, because you've just lost the simplest possible architecture.

The productive response is to ask:

  • What can ADF write to natively? Azure Blob Storage, obviously.
  • What can Google Cloud pull data from natively?

This is where things got interesting.


Finding the Right Shape

When you can't go direct, you look for managed services designed for the exact gap you're trying to cross.

Google Cloud Storage Transfer Service is exactly that, a managed GCP service whose entire job is moving data between storage systems, including Azure Blob Storage. It authenticates with Azure using a SAS token, reads files from a Blob container, and writes them into a GCS bucket, all without VMs, custom code, or an ETL framework.

Once you see it, the architecture snaps into place:

The complete pipeline. Five managed services, zero custom connectors.

Azure Data Factory extracts from PostgreSQL through a Self-Hosted Integration Runtime and stages data as Parquet files in Blob Storage. Storage Transfer Service then moves those files from Azure to GCS, acting as the cross-cloud bridge. BigQuery's Jobs API loads the Parquet into raw tables, and dbt Cloud deduplicates and transforms the raw data into clean, analytics-ready tables.

Five hops, but no custom code in any of them. That's the design philosophy that made this project work: use each provider's own tools for what they're designed to do, and design the handoffs between them carefully.


Getting Out of the Private Network

Before we could even think about cross-cloud transfers, we had a more immediate challenge: PostgreSQL was deployed as an Azure Flexible Server with VNet integration, meaning it sat inside a private Azure VNet on a delegated subnet with no public endpoint. This is by design, but it creates a chain of constraints that narrows your options considerably.

First, Azure does not support private endpoint creation for VNet-integrated Flexible Servers, so there was no way to expose the database through Private Link. That rules out more than just direct access: ADF's Managed Virtual Network integration runtime connects to data sources exclusively through managed private endpoints, which means it can only reach resources that support Private Link. No private endpoint on Postgres means no managed VNet runtime either.

The only remaining option was a Self-Hosted Integration Runtime (SHIR), a Windows VM deployed inside the same VNet and registered with ADF, acting as its private agent.

Think of it less as a separate component and more as ADF's arm reaching inside the locked room. Conceptually elegant, though setup is where the surprises live.

The Java Mystery

Our first pipeline run against a real table failed with a cryptic error. The Copy Activity connected to PostgreSQL successfully (we could see it reading rows in the logs), but the moment it tried to write the first Parquet file to Blob Storage, it crashed with something about a JRE not being found, which was not exactly self-documenting.

If you're not already expecting this, you'd spend your first hour looking at network rules, storage account permissions, or the SHIR registration itself, which is exactly what we did. We checked the linked service credentials, verified the Blob container existed, and tested with a CSV sink instead of Parquet. The CSV worked, which narrowed it down to something specific about the Parquet writer.

Here's what was actually happening: ADF's Copy Activity uses a Java-based Parquet writer under the hood. Our SHIR VM was a clean Windows Server image with no Java runtime. The SHIR installed fine, registered fine, and connected to PostgreSQL fine, but when it needed to write Parquet, it looked for a JRE, found nothing, and threw an error that only mentioned Java obliquely.

The fix took five minutes (install OpenJDK 17 and restart the runtime service), but finding it took most of a morning. The frustrating part is that the error message doesn't say "install Java." You have to mentally connect "JRE not found" to "Parquet writing requires Java, and this VM doesn't have it." In hindsight it's obvious, but in the moment, with ten other possible causes competing for attention, it's not.
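In hindsight, a tiny preflight check on the SHIR VM would have surfaced this in seconds. A sketch of one, using only the Python standard library; the script and its messages are ours, not anything ADF ships:

```python
# Hypothetical preflight check for a fresh SHIR VM before the first
# Parquet-based Copy Activity. ADF's Parquet writer needs a JRE on the box;
# this just verifies one is discoverable on PATH.
import shutil
import subprocess
from typing import Optional

def java_available() -> bool:
    """Return True if a `java` executable is on PATH."""
    return shutil.which("java") is not None

def java_version() -> Optional[str]:
    """Return the first line of `java -version` output, or None if absent."""
    if not java_available():
        return None
    # `java -version` prints to stderr by convention
    result = subprocess.run(["java", "-version"], capture_output=True, text=True)
    return result.stderr.splitlines()[0] if result.stderr else None

if __name__ == "__main__":
    if java_available():
        print(f"OK: {java_version()}")
    else:
        print("MISSING: install a JRE (e.g. OpenJDK 17), then restart the SHIR service")
```

Running something like this as part of VM provisioning turns a morning of debugging into a one-line checklist item.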

The DNS Ghost

With Java installed, the next run hung for two minutes and timed out with a connection error to PostgreSQL. I knew the SHIR was inside the VNet and I could RDP in and ping other resources, so everything looked connected, yet the SHIR couldn't resolve the PostgreSQL hostname.

Azure Flexible Server uses a private DNS zone for hostname resolution, meaning the hostname resolves to a private IP only if that DNS zone is properly linked to the VNet where the SHIR lives. Our VNet was there, the DNS zone was there, but the link between them wasn't. The portal showed the zone as "active," just not active for our VNet.

The error from ADF was a plain connection timeout with nothing DNS-related in it. The debugging path that cracked it: I opened a command prompt on the SHIR VM and ran an nslookup against the PostgreSQL hostname, which returned the public Azure DNS answer instead of a private IP. That was the tell.

Linking the DNS zone took thirty seconds, but the lesson is broader: in Azure's private networking model, connectivity and name resolution are two entirely different things. You can have full network connectivity and still fail because DNS doesn't resolve correctly, and the errors don't help you distinguish between the two.
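The nslookup check can also be scripted. Here's an illustrative Python version of the same diagnostic; `resolves_to_private_ip` is a name we made up, and on the SHIR VM you would pass it your server's real FQDN:

```python
# Diagnostic in the spirit of the nslookup check: resolve a hostname and
# report whether the answer is a private address. On a correctly linked
# VNet, the Flexible Server hostname should resolve to a private IP.
import socket
import ipaddress

def resolves_to_private_ip(hostname: str) -> bool:
    """True if the hostname resolves and the first IPv4 answer is private.

    Returns False both when resolution fails outright and when the name
    resolves to a public address -- either way, the SHIR won't reach the
    database over the private path.
    """
    try:
        addr = socket.getaddrinfo(hostname, None, family=socket.AF_INET)[0][4][0]
    except socket.gaierror:
        return False
    return ipaddress.ip_address(addr).is_private

# e.g. resolves_to_private_ip("myserver.postgres.database.azure.com")
```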


Making the Extraction Incremental

Full reloads were never an option because some tables had billions of rows and were growing constantly, making a complete load on every run expensive, slow, and fragile. So we went with watermark-based incremental extraction, tracking the maximum timestamp from the last successful run and extracting only newer rows.

Sounds simple, but there's a subtle data loss scenario hiding in the most natural approach.

The Watermark Race Condition

The intuitive pattern goes like this: read the last watermark, extract all rows newer than that, then record the current maximum as the next starting point. Clean and simple, and broken in one specific case that took us a while to find.

While the copy is running (say it takes eight minutes for a large table), new rows are being inserted into PostgreSQL with timestamps between the old watermark and the current moment. The copy finishes, captures the maximum timestamp from the data it extracted, and records that as the new watermark, but rows inserted during the copy, after the query started reading that portion of the table, weren't in the batch. On the next run, they're below the new watermark, which means they're gone. Silently.

The insidious part is the scale: you don't lose thousands of rows, just a handful per run, the ones that happened to be inserted in that narrow window. Row counts still look roughly right, dashboards still update, and everything appears healthy until someone runs a precise reconciliation and the numbers are off by a fraction of a percent. That's how we found it.

The fix: capture the current maximum before the copy starts and use it as an upper bound. Your extraction becomes a bounded window containing everything between the old watermark and the pre-captured ceiling, with anything above that ceiling waiting for the next run. Nothing falls through.
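The bounded-window pattern can be sketched in a few lines of Python. The in-memory table and names here are illustrative only; the real pipeline expresses this in ADF activities and SQL, not Python:

```python
# Minimal sketch of bounded-window incremental extraction, with a list of
# (timestamp, row) pairs standing in for the PostgreSQL table.
def bounded_incremental_extract(rows, last_watermark, ceiling):
    """Extract rows with last_watermark < ts <= ceiling.

    `ceiling` is captured BEFORE the copy starts (e.g. SELECT max(updated_at)),
    so rows inserted while the copy runs fall above it and are safely picked
    up on the next run instead of vanishing below an advanced watermark.
    """
    return [(ts, row) for ts, row in rows if last_watermark < ts <= ceiling]

# Simulate the race: the copy starts at t=10 (the pre-captured ceiling),
# and a row lands at t=11 while the copy is still running.
table = [(5, "a"), (8, "b"), (11, "late")]
batch = bounded_incremental_extract(table, last_watermark=3, ceiling=10)
new_watermark = 10  # the pre-captured ceiling, NOT max(ts) of the batch
# The next run picks up the late row instead of losing it:
next_batch = bounded_incremental_extract(table, last_watermark=new_watermark, ceiling=15)
```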

This pattern is in Microsoft's documentation, but it's not the first result when you search for "ADF incremental load." The first results show the simpler version, the one with the race condition. You have to dig deeper to find the bounded window variant, and by the time you're digging, you've usually already lost some data.


Why Parquet Matters More Than You Think

Parquet as the staging format goes beyond performance because it's what makes the whole pipeline schema-agnostic. Parquet embeds schema information inside the file itself, so when BigQuery receives a Parquet file, it reads the schema from the headers and creates the target table automatically. Adding a new table to the pipeline is a single configuration entry with no manual schema definitions and no migrations.

Schema drift works the same way: a new column appears in PostgreSQL, BigQuery adds it, old rows show null, and the pipeline doesn't need to know or care.

One wrinkle: PostgreSQL has a richer type system than BigQuery, with spatial types, custom domains, and array columns that don't translate directly. What ADF does is quietly cast any incompatible type to plain text before writing the file, with no error and no warning. We didn't know it was happening until a data scientist asked why a column that should have been an array was showing up as a string. The lesson: when bridging type systems, always verify what arrives, not just what was sent.
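A verification step like the following would have caught the silent cast early. This is a minimal sketch that assumes you fetch the arrived schema from BigQuery separately; here it's a hand-written dict, and the column names are invented:

```python
# Illustrative check for silent type coercion: compare the types you expect
# (derived from the PostgreSQL schema) against what actually landed in
# BigQuery. Columns and types below are examples, not our production schema.
EXPECTED = {"id": "INTEGER", "location": "GEOGRAPHY", "tags": "ARRAY<STRING>"}

def find_coerced_columns(expected, arrived):
    """Return {column: (expected_type, arrived_type)} for every mismatch."""
    return {
        col: (want, arrived.get(col))
        for col, want in expected.items()
        if arrived.get(col) != want
    }

# What BigQuery showed after ADF quietly cast incompatible types to text:
arrived = {"id": "INTEGER", "location": "STRING", "tags": "STRING"}
mismatches = find_coerced_columns(EXPECTED, arrived)
# mismatches flags location and tags as strings -- exactly the symptom we hit
```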


The Cross-Cloud Handoff

Storage Transfer Service is elegant in theory, but getting it to work in production revealed a series of gotchas that the documentation glosses over. I'm going to walk through each one in the order we hit them, because the order matters: each looks like the previous problem until you realize it's something entirely different.


The Firewall Problem

We'd configured the Blob Storage account with firewall rules allowing only our VNet and known IPs, which is standard practice. Then we created the STS job, which started, ran for ten seconds, and failed with an authentication error.

The actual problem: Google's transfer agents connect from IP ranges that are large, dynamic, and change frequently, so you cannot whitelist them statically. The storage account needs to be open to all networks, with security coming from the SAS token instead: short-lived, read-only, HTTPS-only, and automatically rotated. The token is the lock, not the firewall. This requires a mental model shift, but it's actually more robust than maintaining a firewall against a moving target.

The Double-Encoding Trap

This one cost us three days and stands as the single most frustrating debugging experience of the entire project, because the symptom and the cause are completely disconnected.

After opening the firewall, STS still failed with a PERMISSION_DENIED error. We regenerated the SAS token, checked permissions, and verified expiry, but everything looked perfect. We tried creating a brand new SAS token from scratch with the same result, then created a different storage container and pointed STS at that, still the same error. At this point, every diagnostic pointed to a valid configuration, yet Azure kept rejecting the request.

On day two, I started comparing the raw SAS token with what STS was actually sending to Azure. Azure generates SAS tokens with URL encoding built in, so special characters are already percent-encoded in the token string. When you store that token in GCP Secret Manager, it sits there in its encoded form, and when STS retrieves it to authenticate with Azure, it applies its own URL encoding on top. Suddenly the encoded characters get double-encoded, Azure receives a garbled signature, and it rejects it.

The error looks exactly like an IAM problem, with nothing about encoding or the signature being malformed, just that it's not valid. So you keep checking IAM, keep regenerating tokens, keep trying different service accounts, all the wrong paths, because the real issue is a string transformation happening silently between two systems that each assume they're the only one handling encoding.

The fix: URL-decode the token before storing it in Secret Manager. One line of Python. Three days to find it. That's the ratio this project taught me to expect at service boundaries.
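The failure is easy to reproduce with nothing but the standard library. This models the encoding chain under the assumption that STS applies one ordinary percent-encoding pass to whatever it reads from Secret Manager; the signature value is a made-up example:

```python
# Reproduction of the double-encoding failure. Azure issues the SAS
# signature already percent-encoded ('+' -> %2B, '=' -> %3D).
from urllib.parse import quote, unquote

sig_from_azure = "abc%2Bdef%3D"

# Store the token as-is and the downstream encoding pass garbles it:
# '%' itself gets encoded to %25, so %2B becomes %252B
double_encoded = quote(sig_from_azure)

# The fix: decode ONCE before storing, so the downstream pass reproduces
# exactly what Azure issued
stored = unquote(sig_from_azure)   # "abc+def="
round_tripped = quote(stored)      # back to "abc%2Bdef%3D"
```

`round_tripped` matches what Azure signed; `double_encoded` does not, which is why Azure rejects the request with an error that looks like IAM.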

The Permission Nobody Told You About

With encoding fixed, STS could authenticate with Azure, but job creation failed with a FAILED_PRECONDITION error on the GCP side. It turns out STS verifies that the destination bucket exists, which requires a permission called legacyBucketReader, an older role that doesn't overlap with the newer IAM roles the way you'd expect. We'd already granted objectAdmin on the bucket, but that didn't matter, and the error message said nothing about which permission was missing.

Project Number vs. Project ID

When referencing secrets from an STS job, the configuration expects the project's numeric identifier, not the human-readable name. Using the name produces yet another FAILED_PRECONDITION error with no mention of the format. By this point, we'd developed a reflex: when STS throws FAILED_PRECONDITION, the problem is almost never what the error implies.


Automating the Credential Rotation

SAS tokens expire, and a pipeline that works today but silently breaks in 90 days isn't production engineering, it's technical debt with a countdown timer.

We solved this with an Azure Function on a weekly timer that generates a new SAS token, URL-decodes it (the hard-won lesson), and pushes the decoded token to both Azure Key Vault and GCP Secret Manager. STS then reads the latest version automatically on the next transfer. The function runs on a Consumption plan, and the monthly bill rounds to zero.
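A sketch of the rotation function's core logic, with the actual Azure/GCP SDK calls left as comments because they need live credentials; the secret names, the 14-day lifetime, and the function names are placeholders, not our production values:

```python
# Core of a weekly SAS rotation function. The one non-obvious step is the
# unquote() before storing -- the hard-won double-encoding lesson.
from urllib.parse import unquote
from datetime import datetime, timedelta, timezone

def build_rotation_payload(raw_sas_from_azure: str) -> dict:
    """Prepare the decoded token plus metadata for both secret stores."""
    return {
        "secret_value": unquote(raw_sas_from_azure),  # decode exactly once
        "expires_at": (datetime.now(timezone.utc) + timedelta(days=14)).isoformat(),
    }

def rotate(raw_sas_from_azure: str) -> dict:
    payload = build_rotation_payload(raw_sas_from_azure)
    # 1. Push to Azure Key Vault, e.g.:
    #    key_vault_client.set_secret("sts-sas-token", payload["secret_value"])
    # 2. Push to GCP Secret Manager, e.g.:
    #    secret_manager_client.add_secret_version(
    #        parent="projects/<PROJECT_NUMBER>/secrets/sts-sas-token",
    #        payload={"data": payload["secret_value"].encode()})
    # STS reads the latest version automatically on its next transfer run.
    return payload
```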

One nuance worth mentioning: the Function itself needs credentials to write to GCP Secret Manager, which we handle with a GCP service account key stored in Azure Key Vault. Yes, there's a philosophical irony in storing a GCP credential in Azure to rotate an Azure credential into GCP. Welcome to multi-cloud.


Loading into BigQuery

Once files land in GCS, the BigQuery Jobs API loads them into raw tables in append mode, so reruns are safe by design. The Jobs API works well, but it has one behavior that caught us off guard.

When "DONE" Doesn't Mean "Succeeded"

BigQuery returns a status of DONE for both successful and failed jobs, with the difference being a separate error field that's only present on failure. This is documented, but it's the kind of API behavior you read once, think "that's odd," and then forget about until it bites you.
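A minimal version of the corrected check, written against the job-resource shape the BigQuery REST API returns (`status.state` plus `status.errorResult` on failure); the job dicts below are hand-written examples, not real API responses:

```python
# DONE alone is not success: a DONE job that carries errorResult has failed.
def job_succeeded(job: dict) -> bool:
    """True only when the job finished AND reported no error."""
    status = job.get("status", {})
    return status.get("state") == "DONE" and "errorResult" not in status

ok_job = {"status": {"state": "DONE"}}
failed_job = {"status": {"state": "DONE",
                         "errorResult": {"reason": "invalid",
                                         "message": "Field x: type mismatch"}}}
running_job = {"status": {"state": "RUNNING"}}
```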

Our initial implementation polled for DONE and moved on, and for weeks this worked because no jobs were failing. The pipeline hummed along, watermarks advanced, dashboards updated, and everything seemed healthy.

Then one day a schema mismatch caused a load to fail: a column that had been integer upstream had changed to string, so the load job rejected the file. BigQuery returned DONE, our pipeline marked the run as successful, the watermark advanced, and the data simply wasn't in BigQuery.

Nobody noticed for four days until a BI developer flagged that a dashboard was showing stale numbers. We traced it to the failed load and then to our status-checking logic. The fix took ten minutes (check the error field alongside the status), but recovering four days of missed data took considerably longer because the watermark had already advanced past the missing rows. We had to manually reset watermarks, re-extract, and re-load, exactly the kind of manual intervention the pipeline was designed to avoid.

Always check both fields. BigQuery's error messages are specific and actionable when you actually look at them.


dbt: Making Sense of Append-Only Data

Appending rows every run means duplicates accumulate, which is intentional because it keeps the loading layer simple and safe, but it also means raw tables can't be used directly for analytics. You need a deduplication layer, and that's where dbt comes in.

dbt's incremental models handle exactly this. Configured with a unique key, each run generates a MERGE statement that updates changed rows and inserts new ones. The deduplication logic lives in a well-tested SQL model, version-controlled in Git, not in a fragile Python script or an ADF expression buried three menus deep.
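What that MERGE accomplishes can be expressed in a few lines of plain Python: for each unique key, the most recently ingested row wins. The field names here are illustrative, not our actual model:

```python
# Pure-Python equivalent of the dedup a keyed incremental MERGE performs.
def deduplicate(raw_rows, unique_key="id", ingested_at="loaded_at"):
    """Keep the latest row per key, mirroring a MERGE keyed on unique_key."""
    latest = {}
    for row in raw_rows:
        key = row[unique_key]
        if key not in latest or row[ingested_at] > latest[key][ingested_at]:
            latest[key] = row
    return list(latest.values())

raw = [
    {"id": 1, "status": "new",     "loaded_at": "2024-01-01"},
    {"id": 1, "status": "shipped", "loaded_at": "2024-01-02"},  # re-extracted later
    {"id": 2, "status": "new",     "loaded_at": "2024-01-01"},
]
silver = deduplicate(raw)  # one row per id, latest status wins
```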

The result is a clean two-layer architecture. Raw tables hold every row ever loaded with ingestion timestamps, which is useful for debugging, auditing, and reprocessing. If something goes wrong downstream, you can always go back to the raw layer and replay. dbt silver tables hold deduplicated, partitioned, clustered data, the kind that analysts actually query. The complexity of the multi-cloud pipeline is invisible to data consumers.

When something looks wrong in the analytics layer, you trace it through the dbt model to the raw load and see exactly what arrived and when. This audit trail doesn't seem important until the first time it saves you from a long debugging session.

After all loads complete, ADF retrieves the dbt Cloud API token from Key Vault and triggers the transformation job automatically, so the entire pipeline runs end to end without human involvement.


The Metadata-Driven Design

The decision that paid off most disproportionately was making the pipeline entirely metadata-driven from day one. I almost didn't, because the first prototype was hardcoded for three tables, and the temptation to just keep adding tables manually was real. But the upfront investment in a configuration layer saved us weeks of work over the following months.

Every table is a single row in a configuration table stored in Azure SQL Database, and that row tracks where data comes from, where it's going, how far the last run got, and what happened. ADF reads this table at the start of every run, with nothing hardcoded in the pipeline itself. Adding a new table means adding one row, with no pipeline changes, no GCP console work, and no manual STS job creation. On first run, ADF creates the STS transfer job automatically, BigQuery creates the target table from the Parquet schema, and data starts flowing.
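To make the idea concrete, here's an illustrative shape for one configuration row as a Python dataclass. The real column names in our Azure SQL table differ, and the helper function is invented for the sketch:

```python
# One row per table: where data comes from, where it goes, how far the
# last run got, and what happened.
from dataclasses import dataclass
from typing import Optional

@dataclass
class TableConfig:
    source_schema: str
    source_table: str
    target_dataset: str
    target_table: str
    watermark_column: str
    last_watermark: Optional[str]   # None means "full initial load"
    last_run_at: Optional[str]
    last_row_count: Optional[int]
    last_error: Optional[str]
    enabled: bool = True

def tables_to_process(configs):
    """The set the orchestrator iterates over at the start of every run."""
    return [c for c in configs if c.enabled]

# Onboarding a new table really is just one more row:
configs = [
    TableConfig("public", "orders", "raw", "orders", "updated_at",
                "2024-06-01T00:00:00", "2024-06-02T03:00:00", 120_431, None),
    TableConfig("public", "customers", "raw", "customers", "updated_at",
                None, None, None, None),  # new table, first run pending
]
```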

The same table doubles as the operational dashboard. The error column tells you what went wrong, the watermark tells you where each table stands, the timestamp tells you when each was last loaded, and the row count tells you if something loaded suspiciously fewer rows than expected. A single query gives you the health of every table in the pipeline at a glance.

It also made the project easier to hand off, because everything about the pipeline's configuration lives in a table anyone can read, with no tribal knowledge buried in JSON that requires ADF Studio access to understand.

One More Thing: ADF's Nesting Limits

ADF has a limitation that isn't widely documented: you cannot nest certain activity types inside other activities beyond a certain depth. We discovered this when trying to put a polling loop inside a conditional block, and while the pipeline validated fine in ADF Studio, at runtime ADF threw a validation error about unsupported nesting.

The solution was to break the nested logic into a separate child pipeline connected via Execute Pipeline. The child contains the polling loop, isolated from any conditional wrapper, which means more pipelines to manage, but each one is simpler and the nesting constraint disappears.


The Cost Reality

Cross-cloud pipelines have a reputation for being expensive, but this one isn't, though you do need to account for a cost that's easy to overlook.

The largest ongoing cost is the SHIR VM, which runs continuously. The Azure SQL Database runs on Basic tier at around €4/month, the Azure Function runs on Consumption for single-digit euros, and Blob Storage staging costs near zero because files are deleted after each load.

The cost that catches most people off guard in multi-cloud architectures is the cross-cloud data transfer. When STS pulls files from Azure Blob Storage, that data leaves Azure's network as egress to the public internet, which Azure charges at roughly $0.087/GB for the first 10 TB. On the GCP side, ingress into Cloud Storage is free, so you're only paying the Azure side of the transfer. For our workload of a dozen tables with incremental loads, this amounts to a few euros per month because we're only moving deltas, not full table dumps. If you were moving terabytes daily, though, this line item would dominate the bill, and you'd want to look into Azure ExpressRoute or Google Cloud Interconnect to bring those rates down significantly.
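A back-of-the-envelope calculator using the rough rate quoted above. Real Azure egress pricing is tiered, region-dependent, and includes a small free allowance, so treat this strictly as an order-of-magnitude sketch:

```python
# Rough Azure egress estimate at the first-tier rate (~$0.087/GB up to 10 TB).
RATE_FIRST_10TB_PER_GB = 0.087

def monthly_egress_cost_usd(gb_per_day: float, days: int = 30) -> float:
    gb = gb_per_day * days
    # Beyond 10 TB/month, lower tiered rates apply and this estimate breaks down
    assert gb <= 10_000, "beyond 10 TB, use the tiered rates"
    return round(gb * RATE_FIRST_10TB_PER_GB, 2)

# Incremental deltas for a dozen tables, say ~1 GB/day: a few dollars a month
delta_cost = monthly_egress_cost_usd(1.0)
# A naive 300 GB/day full-dump pipeline would dominate the bill
dump_cost = monthly_egress_cost_usd(300.0)
```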

On the GCP side beyond ingress, Storage Transfer Service is free for Azure-to-GCS transfers, and BigQuery load jobs are free as well since Google charges for storage and queries, not ingestion. The GCS staging bucket costs a few euros.

Total for a dozen tables with incremental loads: well under €150 per month. The comparison that matters isn't against doing nothing, it's against a self-managed ETL tool on a VM, a Python script on a scheduler, or an Airbyte instance you're responsible for operating. Those trade low licensing cost for high operational burden, while managed services invert that trade-off.


What the Documentation Doesn't Tell You

Looking back, a pattern emerges: the hardest problems were always at the boundaries between services. Within any single cloud service, the documentation is generally good, but at the handoffs, where Azure talks to GCP, where ADF talks to the SHIR, where BigQuery interprets what "done" means, the documentation assumes things will go smoothly.

A summary of what actually bit us, roughly in order of encounter:

  1. Install Java on the SHIR VM before running any Parquet-based Copy Activity.
  2. Verify the Private DNS Zone is linked to the correct VNet before assuming connectivity works.
  3. Always use a bounded watermark window to prevent the incremental extraction race condition.
  4. URL-decode SAS tokens before storing them in Secret Manager.
  5. Open the storage account to all networks when using STS. The token is the security layer, not the firewall.
  6. Grant legacyBucketReader to the STS service agent. Use numeric project IDs in secret references, not human-readable names.
  7. Check BigQuery's error field, not just the status.
  8. Split ADF logic across child pipelines to avoid nesting limits.

None of these are difficult once you know them, but all of them are invisible until you hit them. The list above represents roughly two and a half weeks of cumulative debugging time.


Two Clouds, One Pipeline

The pipeline has been running in production for months without manual intervention. Watermarks advance automatically, new tables go live in minutes, SAS tokens rotate on schedule, dbt keeps the silver layer clean, and the configuration table is the single source of truth.

The architecture isn't elegant in the way a single-cloud pipeline can be. There are five hops where a native solution might have two, there are IAM permissions to manage across two providers, and there are encoding quirks and API behaviors you have to learn once and then never forget.

But it works, it's observable, it costs less per month than a team dinner, and it was built entirely from managed services the team already understood, with no new tools to learn, no new infrastructure to operate, and no new vendor relationships to manage.

The hardest part wasn't the code, because there is almost no code. It was understanding what each managed service was designed to do, what it quietly assumed, and building the handoffs between them well enough that when something goes wrong, it fails loudly, not silently and slowly, weeks later, when the damage is already done.

If this story has a thesis, it's this: the documentation for any individual cloud service is generally good, but the gaps are always in the spaces between services. That's where the interesting engineering happens, and it's where most of the debugging time goes. Plan for it.

That understanding is the actual deliverable. The pipeline is just what you get when you have it.
