<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Panagiotis</title>
    <description>The latest articles on DEV Community by Panagiotis (@ntarzanos).</description>
    <link>https://dev.to/ntarzanos</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3456209%2Fb1b537d0-38ca-4135-aaed-013352d9bbb2.png</url>
      <title>DEV Community: Panagiotis</title>
      <link>https://dev.to/ntarzanos</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ntarzanos"/>
    <language>en</language>
    <item>
      <title>Cross-Cloud Pipeline with ADF &amp; STS: Architecture, Troubleshooting &amp; Costs</title>
      <dc:creator>Panagiotis</dc:creator>
      <pubDate>Tue, 31 Mar 2026 12:15:23 +0000</pubDate>
      <link>https://dev.to/agileactors/cross-cloud-pipeline-with-adf-sts-architecture-troubleshooting-costs-5bj5</link>
      <guid>https://dev.to/agileactors/cross-cloud-pipeline-with-adf-sts-architecture-troubleshooting-costs-5bj5</guid>
      <description>&lt;p&gt;Every data engineer eventually ends up staring at a problem that shouldn't exist. Data that needs to be somewhere it isn't. Two systems that should talk to each other but don't. A business requirement that assumes clouds are just different tabs in the same browser.&lt;/p&gt;

&lt;p&gt;Our version of this problem was simple to describe and genuinely interesting to solve: operational data lived in PostgreSQL on Azure, while the analytics team (data scientists, BI developers, the people who actually make decisions from data) had built everything in BigQuery on GCP. Nobody was migrating either side, so my job was to make them talk.&lt;/p&gt;

&lt;p&gt;What followed was one of those projects that starts as "a quick pipeline" and ends up teaching you more about cloud architecture, cross-service authentication, and silent failure modes than you expected. Every layer worked beautifully in isolation, but the problems lived exclusively in the spaces between services, in the handoffs, the assumptions, the error messages that pointed everywhere except at the actual cause.&lt;/p&gt;

&lt;p&gt;This is that story. The architecture, yes, but more so the debugging sessions that shaped it. If you're building anything that crosses cloud boundaries, the troubleshooting sections alone might save you a few weeks.&lt;/p&gt;




&lt;h2&gt;
  
  
  How We Got Here
&lt;/h2&gt;

&lt;p&gt;Companies rarely end up multi-cloud by design. It usually happens through acquisitions, through teams making independent vendor decisions, or through the gravitational pull of a tool that's genuinely best-in-class for its purpose.&lt;/p&gt;

&lt;p&gt;In our case, the operational side of the business had grown up on Azure, with infrastructure, networking, and identity all running on Microsoft. PostgreSQL on Azure's managed Flexible Server made sense because it's a solid managed database with clean VNet integration and no public endpoint, which is a feature, not a limitation.&lt;/p&gt;

&lt;p&gt;The analytics side had independently converged on Google Cloud. BigQuery is genuinely exceptional for analytical workloads, dbt had become the transformation layer, and Looker sat on top. The team had invested years building in this ecosystem, so migrating to Azure wasn't realistic, and nobody had the appetite for it either.&lt;/p&gt;

&lt;p&gt;So we had two clouds, both legitimate, both entrenched, and we needed a bridge.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi1riekj177ttukzjlcvw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi1riekj177ttukzjlcvw.png" alt="This means war" width="800" height="435"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The First Surprise: ADF Can't Write to BigQuery
&lt;/h2&gt;

&lt;p&gt;The natural starting point was Azure Data Factory, Microsoft's managed data integration service that has connectors for hundreds of sources and sinks, including a Google BigQuery connector right there in the UI.&lt;/p&gt;

&lt;p&gt;What the marketing materials don't lead with: the BigQuery connector in ADF is &lt;strong&gt;source-only&lt;/strong&gt;. You can read data &lt;em&gt;from&lt;/em&gt; BigQuery into Azure, but you cannot write &lt;em&gt;to&lt;/em&gt; it. Same story with &lt;strong&gt;Google Cloud Storage&lt;/strong&gt;, which is also not a supported sink.&lt;/p&gt;

&lt;p&gt;I remember the exact moment I discovered this. I had already designed half the pipeline in my head, envisioning a clean Copy Activity with source PostgreSQL and sink BigQuery, done by lunch. I opened the sink configuration dropdown, scrolled through every Azure-native option on offer, and scrolled again. BigQuery wasn't among them. I scrolled one more time, but no, I hadn't missed it.&lt;/p&gt;

&lt;p&gt;This is one of those discoveries that reshapes an entire project in a single moment. It's not a bug or a misconfiguration, it's a fundamental constraint of how ADF's connector ecosystem works, and once you accept it, everything downstream changes. The tempting response is frustration, because you've just lost the simplest possible architecture.&lt;/p&gt;

&lt;p&gt;The productive response is to ask:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;what can ADF write to natively?&lt;/em&gt; Azure Blob Storage, obviously.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;what can Google Cloud pull data from natively?&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is where things got interesting.&lt;/p&gt;




&lt;h2&gt;
  
  
  Finding the Right Shape
&lt;/h2&gt;

&lt;p&gt;When you can't go direct, you look for managed services designed for the exact gap you're trying to cross.&lt;/p&gt;

&lt;p&gt;Google Cloud Storage Transfer Service is exactly that, a managed GCP service whose entire job is moving data between storage systems, including Azure Blob Storage. It authenticates with Azure using a SAS token, reads files from a Blob container, and writes them into a GCS bucket, all without VMs, custom code, or an ETL framework.&lt;/p&gt;

&lt;p&gt;Once you see it, the architecture snaps into place:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq85n28l8qlmm67y5rhcw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq85n28l8qlmm67y5rhcw.png" alt="The complete pipeline. Five managed services, zero custom connectors." width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Azure Data Factory extracts from PostgreSQL through a Self-Hosted Integration Runtime and stages data as Parquet files in Blob Storage. Storage Transfer Service then moves those files from Azure to GCS, acting as the cross-cloud bridge. BigQuery's Jobs API loads the Parquet into raw tables, and dbt Cloud deduplicates and transforms the raw data into clean, analytics-ready tables.&lt;/p&gt;

&lt;p&gt;Five hops, but no custom code in any of them. That's the design philosophy that made this project work: &lt;strong&gt;use each provider's own tools for what they're designed to do&lt;/strong&gt;, and design the handoffs between them carefully.&lt;/p&gt;




&lt;h2&gt;
  
  
  Getting Out of the Private Network
&lt;/h2&gt;

&lt;p&gt;Before we could even think about cross-cloud transfers, we had a more immediate challenge: PostgreSQL was deployed as an Azure Flexible Server with VNet integration, meaning it sat inside a private Azure VNet on a delegated subnet with no public endpoint. This is by design, but it creates a chain of constraints that narrows your options considerably.&lt;/p&gt;

&lt;p&gt;First, &lt;a href="https://learn.microsoft.com/en-us/azure/postgresql/network/concepts-networking-private-link" rel="noopener noreferrer"&gt;Azure does not support private endpoint creation for VNet-integrated Flexible Servers&lt;/a&gt;, so there was no way to expose the database through Private Link. That rules out more than just direct access, because ADF's Managed Virtual Network integration runtime &lt;a href="https://learn.microsoft.com/en-us/azure/data-factory/managed-virtual-network-private-endpoint" rel="noopener noreferrer"&gt;connects to data sources exclusively through managed private endpoints&lt;/a&gt;, which means it can only reach resources that support Private Link. No private endpoint on Postgres means no managed VNet runtime either. The only remaining option was a Self-Hosted Integration Runtime: a Windows VM deployed inside the same VNet and registered with ADF, acting as its private agent.&lt;/p&gt;

&lt;p&gt;Think of it less as a separate component and more as ADF's arm reaching inside the locked room. Conceptually elegant, though setup is where the surprises live.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Java Mystery
&lt;/h3&gt;

&lt;p&gt;Our first pipeline run against a real table failed with a cryptic error. The Copy Activity connected to PostgreSQL successfully (we could see it reading rows in the logs), but the moment it tried to write the first Parquet file to Blob Storage, it crashed with something about a JRE not being found, which was not exactly self-documenting.&lt;/p&gt;

&lt;p&gt;If you're not already expecting this, you'd spend your first hour looking at network rules, storage account permissions, or the SHIR registration itself, which is exactly what we did. We checked the linked service credentials, verified the Blob container existed, and tested with a CSV sink instead of Parquet. The CSV worked, which narrowed it down to something specific about the Parquet writer.&lt;/p&gt;

&lt;p&gt;Here's what was actually happening: &lt;strong&gt;ADF's Copy Activity uses a Java-based Parquet writer under the hood.&lt;/strong&gt; Our SHIR VM was a clean Windows Server image with no Java runtime. The SHIR installed fine, registered fine, and connected to PostgreSQL fine, but when it needed to write Parquet, it looked for a JRE, found nothing, and threw an error that only mentioned Java obliquely.&lt;/p&gt;

&lt;p&gt;The fix took five minutes (install OpenJDK 17 and restart the runtime service), but finding it took most of a morning. The frustrating part is that the error message doesn't say "install Java." You have to mentally connect "JRE not found" to "Parquet writing requires Java, and this VM doesn't have it." In hindsight it's obvious, but in the moment, with ten other possible causes competing for attention, it's not.&lt;/p&gt;

&lt;h3&gt;
  
  
  The DNS Ghost
&lt;/h3&gt;

&lt;p&gt;With Java installed, the next run hung for two minutes and timed out with a connection error to PostgreSQL. I knew the SHIR was inside the VNet and I could RDP in and ping other resources, so everything looked connected, yet the SHIR couldn't resolve the PostgreSQL hostname.&lt;/p&gt;

&lt;p&gt;Azure Flexible Server uses a private DNS zone for hostname resolution, meaning the hostname resolves to a private IP only if that DNS zone is properly linked to the VNet where the SHIR lives. Our VNet was there, the DNS zone was there, but the link between them wasn't. The portal showed the zone as "active," just not active &lt;em&gt;for our VNet&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;The error from ADF was a plain connection timeout with nothing DNS-related in it. The debugging path that cracked it: I opened a command prompt on the SHIR VM and ran an nslookup against the PostgreSQL hostname, which returned the public Azure DNS answer instead of a private IP. That was the tell.&lt;/p&gt;

&lt;p&gt;Linking the DNS zone took thirty seconds, but the lesson is broader: in Azure's private networking model, connectivity and name resolution are two entirely different things. You can have full network connectivity and still fail because DNS doesn't resolve correctly, and the errors don't help you distinguish between the two.&lt;/p&gt;




&lt;h2&gt;
  
  
  Making the Extraction Incremental
&lt;/h2&gt;

&lt;p&gt;Full reloads were never an option because some tables had billions of rows and were growing constantly, making a complete load on every run expensive, slow, and fragile. So we went with watermark-based incremental extraction, tracking the maximum timestamp from the last successful run and extracting only newer rows.&lt;/p&gt;

&lt;p&gt;Sounds simple, but there's a subtle data loss scenario hiding in the most natural approach.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Watermark Race Condition
&lt;/h3&gt;

&lt;p&gt;The intuitive pattern goes like this: read the last watermark, extract all rows newer than that, then record the current maximum as the next starting point. Clean and simple, and broken in one specific case that took us a while to find.&lt;/p&gt;

&lt;p&gt;While the copy is running (say it takes eight minutes for a large table), new rows are being inserted into PostgreSQL with timestamps between the old watermark and the current moment. The copy finishes, captures the maximum timestamp from the data it extracted, and records that as the new watermark, but rows inserted &lt;em&gt;during&lt;/em&gt; the copy, after the query started reading that portion of the table, weren't in the batch. On the next run, they're below the new watermark, which means they're gone. Silently.&lt;/p&gt;

&lt;p&gt;The insidious part is the scale: you don't lose thousands of rows, just a handful per run, the ones that happened to be inserted in that narrow window. Row counts still look roughly right, dashboards still update, and everything appears healthy until someone runs a precise reconciliation and the numbers are off by a fraction of a percent. That's how we found it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fix:&lt;/strong&gt; capture the current maximum &lt;em&gt;before&lt;/em&gt; the copy starts and use it as an upper bound. Your extraction becomes a bounded window containing everything between the old watermark and the pre-captured ceiling, with anything above that ceiling waiting for the next run. Nothing falls through.&lt;/p&gt;
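
&lt;p&gt;The bounded window is easier to see in code. A minimal sketch, with table and column names as placeholders (the real pipeline builds the equivalent query from ADF expressions and the config table, not Python):&lt;/p&gt;

```python
from datetime import datetime, timezone

def build_extraction(last_watermark, now=None):
    """Bounded-window incremental extraction: capture the ceiling BEFORE
    the copy starts. Rows inserted while the copy runs land above the
    ceiling and are picked up by the next run instead of being lost."""
    ceiling = now or datetime.now(timezone.utc).isoformat()
    # Placeholder table/column names, for illustration only.
    query = (
        "SELECT * FROM orders "
        "WHERE updated_at > %(low)s AND %(high)s >= updated_at"
    )
    params = {"low": last_watermark, "high": ceiling}
    return query, params  # on success, record `ceiling` as the new watermark
```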

&lt;p&gt;This pattern is in Microsoft's documentation, but it's not the first result when you search for "ADF incremental load." The first results show the simpler version, the one with the race condition. You have to dig deeper to find the bounded window variant, and by the time you're digging, you've usually already lost some data.&lt;/p&gt;




&lt;h3&gt;
  
  
  Why Parquet Matters More Than You Think
&lt;/h3&gt;

&lt;p&gt;The choice of Parquet as the staging format goes beyond performance: it's what makes the whole pipeline schema-agnostic. Parquet embeds schema information inside the file itself, so when BigQuery receives a Parquet file, it reads the embedded schema and creates the target table automatically. Adding a new table to the pipeline is a single configuration entry with no manual schema definitions and no migrations.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Schema drift works the same way: a new column appears in PostgreSQL, BigQuery adds it, old rows show null, and the pipeline doesn't need to know or care.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;One wrinkle:&lt;/strong&gt; PostgreSQL has a richer type system than BigQuery, with spatial types, custom domains, and array columns that don't translate directly. What ADF does is quietly cast any incompatible type to plain text before writing the file, with no error and no warning. We didn't know it was happening until a data scientist asked why a column that should have been an array was showing up as a string. The lesson: when bridging type systems, always verify what arrives, not just what was sent.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Cross-Cloud Handoff
&lt;/h2&gt;

&lt;p&gt;Storage Transfer Service is elegant in theory, but getting it to work in production revealed a series of gotchas that the documentation glosses over. I'm going to walk through each one in the order we hit them, because the order matters: each looks like the previous problem until you realize it's something entirely different.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feiu9km7t59jv0pgf3kbc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feiu9km7t59jv0pgf3kbc.png" alt="Moving Data" width="800" height="448"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The Firewall Problem
&lt;/h3&gt;

&lt;p&gt;We'd configured the Blob Storage account with firewall rules allowing only our VNet and known IPs, which is standard practice. Then we created the STS job, which started, ran for ten seconds, and failed with an authentication error.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The actual problem:&lt;/strong&gt; Google's transfer agents connect from IP ranges that are large, dynamic, and change frequently, so you cannot whitelist them statically. The storage account needs to be open to all networks, with security coming from the SAS token instead: short-lived, read-only, HTTPS-only, and automatically rotated. The token is the lock, not the firewall. This requires a mental model shift, but it's actually more robust than maintaining a firewall against a moving target.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Double-Encoding Trap
&lt;/h3&gt;

&lt;p&gt;This one cost us three days and stands as the single most frustrating debugging experience of the entire project, because the symptom and the cause are completely disconnected.&lt;/p&gt;

&lt;p&gt;After opening the firewall, STS still failed with a PERMISSION_DENIED error. We regenerated the SAS token, checked permissions, and verified expiry, but everything looked perfect. We tried creating a brand new SAS token from scratch with the same result, then created a different storage container and pointed STS at that, still the same error. At this point, every diagnostic pointed to a valid configuration, yet Azure kept rejecting the request.&lt;/p&gt;

&lt;p&gt;On day two, I started comparing the raw SAS token with what STS was actually sending to Azure. Azure generates SAS tokens with URL encoding built in, so special characters are already percent-encoded in the token string. When you store that token in GCP Secret Manager, it sits there in its encoded form, and when STS retrieves it to authenticate with Azure, it applies its own URL encoding on top. Suddenly the encoded characters get double-encoded, Azure receives a garbled signature, and it rejects it.&lt;/p&gt;

&lt;p&gt;The error looks exactly like an IAM problem, with nothing about encoding or the signature being malformed, just that it's not valid. So you keep checking IAM, keep regenerating tokens, keep trying different service accounts, all the wrong paths, because the real issue is a string transformation happening silently between two systems that each assume they're the only one handling encoding.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fix:&lt;/strong&gt; URL-decode the token before storing it in Secret Manager. One line of Python. Three days to find it. That's the ratio this project taught me to expect at service boundaries.&lt;/p&gt;
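
&lt;p&gt;The one line, shown here on a hypothetical single-parameter fragment (a real SAS token carries several ampersand-separated parameters, all of which arrive percent-encoded):&lt;/p&gt;

```python
from urllib.parse import unquote

# Azure hands the token back already percent-encoded; decode it once
# before storing, so STS's own encoding pass can't double-encode it.
encoded = "sig=abc%2Bxyz%3D"   # hypothetical SAS fragment
decoded = unquote(encoded)     # "sig=abc+xyz="
```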

&lt;h3&gt;
  
  
  The Permission Nobody Told You About
&lt;/h3&gt;

&lt;p&gt;With encoding fixed, STS could authenticate with Azure, but job creation failed with a FAILED_PRECONDITION error on the GCP side. It turns out STS verifies that the destination bucket exists, which requires the &lt;code&gt;legacyBucketReader&lt;/code&gt; role, an older IAM role whose permissions don't overlap with the newer roles the way you'd expect. We'd already granted &lt;code&gt;objectAdmin&lt;/code&gt; on the bucket, but that didn't matter, and the error message said nothing about which permission was missing.&lt;/p&gt;

&lt;h3&gt;
  
  
  Project Number vs. Project ID
&lt;/h3&gt;

&lt;p&gt;When referencing secrets from an STS job, the configuration expects the numeric project number, not the human-readable project ID. Using the ID produces yet another FAILED_PRECONDITION error with no mention of the format. By this point, we'd developed a reflex: when STS throws FAILED_PRECONDITION, the problem is almost never what the error implies.&lt;/p&gt;




&lt;h2&gt;
  
  
  Automating the Credential Rotation
&lt;/h2&gt;

&lt;p&gt;SAS tokens expire, and a pipeline that works today but silently breaks in 90 days isn't production engineering, it's technical debt with a countdown timer.&lt;/p&gt;

&lt;p&gt;We solved this with an Azure Function on a weekly timer that generates a new SAS token, URL-decodes it (the hard-won lesson), and pushes the decoded token to both Azure Key Vault and GCP Secret Manager. STS then reads the latest version automatically on the next transfer. The function runs on a Consumption plan, and the monthly bill rounds to zero.&lt;/p&gt;
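
&lt;p&gt;The shape of the function, sketched with the Azure and GCP SDK calls abstracted into injected callables (the real implementation uses the Blob Storage and Secret Manager clients; only the decode-once ordering is the point here):&lt;/p&gt;

```python
from urllib.parse import unquote

def rotate_sas(generate_token, push_to_key_vault, push_to_secret_manager):
    """Weekly rotation: generate, decode once, then push the same decoded
    token to both stores so STS never sees a double-encoded signature."""
    token = generate_token()    # Azure returns the token percent-encoded
    decoded = unquote(token)    # the hard-won double-encoding lesson
    push_to_key_vault(decoded)
    push_to_secret_manager(decoded)
    return decoded
```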

&lt;p&gt;One nuance worth mentioning: the Function itself needs credentials to write to GCP Secret Manager, which we handle with a GCP service account key stored in Azure Key Vault. Yes, there's a philosophical irony in storing a GCP credential in Azure to rotate an Azure credential into GCP. Welcome to multi-cloud.&lt;/p&gt;




&lt;h2&gt;
  
  
  Loading into BigQuery
&lt;/h2&gt;

&lt;p&gt;Once files land in GCS, the BigQuery Jobs API loads them into raw tables in append mode, so reruns are safe by design. The Jobs API works well, but it has one behavior that caught us off guard.&lt;/p&gt;

&lt;h3&gt;
  
  
  When "DONE" Doesn't Mean "Succeeded"
&lt;/h3&gt;

&lt;p&gt;BigQuery returns a status of DONE for both successful and failed jobs, with the difference being a separate error field that's only present on failure. This is documented, but it's the kind of API behavior you read once, think "that's odd," and then forget about until it bites you.&lt;/p&gt;

&lt;p&gt;Our initial implementation polled for DONE and moved on, and for weeks this worked because no jobs were failing. The pipeline hummed along, watermarks advanced, dashboards updated, and everything seemed healthy.&lt;/p&gt;

&lt;p&gt;Then one day a schema mismatch caused a load to fail: a column that had been integer upstream had changed to string, so the load job rejected the file. BigQuery returned DONE, our pipeline marked the run as successful, the watermark advanced, and the data simply wasn't in BigQuery.&lt;/p&gt;

&lt;p&gt;Nobody noticed for four days until a BI developer flagged that a dashboard was showing stale numbers. We traced it to the failed load and then to our status-checking logic. The fix took ten minutes (check the error field alongside the status), but recovering four days of missed data took considerably longer because the watermark had already advanced past the missing rows. We had to manually reset watermarks, re-extract, and re-load, exactly the kind of manual intervention the pipeline was designed to avoid.&lt;/p&gt;

&lt;p&gt;Always check both fields. BigQuery's error messages are specific and actionable when you actually look at them.&lt;/p&gt;
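
&lt;p&gt;The corrected check, sketched against the shape of the Jobs API status object (our pipeline does the equivalent inside an ADF expression, so treat this as illustrative):&lt;/p&gt;

```python
def job_succeeded(status):
    """DONE alone is not success: BigQuery reports DONE for failed jobs
    too, recording the failure in a separate errorResult field."""
    return status.get("state") == "DONE" and "errorResult" not in status
```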




&lt;h2&gt;
  
  
  dbt: Making Sense of Append-Only Data
&lt;/h2&gt;

&lt;p&gt;Appending rows every run means duplicates accumulate, which is intentional because it keeps the loading layer simple and safe, but it also means raw tables can't be used directly for analytics. You need a deduplication layer, and that's where dbt comes in.&lt;/p&gt;

&lt;p&gt;dbt's incremental models handle exactly this. Configured with a unique key, each run generates a MERGE statement that updates changed rows and inserts new ones. The deduplication logic lives in a well-tested SQL model, version-controlled in Git, not in a fragile Python script or an ADF expression buried three menus deep.&lt;/p&gt;
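
&lt;p&gt;The MERGE semantics are easy to show in miniature: the latest version of each unique key wins, and new keys are inserted. A toy Python equivalent of what the generated SQL does at warehouse scale (field names are illustrative):&lt;/p&gt;

```python
def merge_incremental(existing, new_rows, unique_key="id"):
    """Mimic the MERGE dbt generates for an incremental model with a
    unique_key: update rows whose key already exists, insert the rest."""
    merged = {row[unique_key]: row for row in existing}
    for row in new_rows:
        merged[row[unique_key]] = row   # later batch wins for the same key
    return sorted(merged.values(), key=lambda r: r[unique_key])
```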

&lt;p&gt;The result is a clean two-layer architecture. Raw tables hold every row ever loaded with ingestion timestamps, which is useful for debugging, auditing, and reprocessing. If something goes wrong downstream, you can always go back to the raw layer and replay. dbt silver tables hold deduplicated, partitioned, clustered data, the kind that analysts actually query. The complexity of the multi-cloud pipeline is invisible to data consumers.&lt;/p&gt;

&lt;p&gt;When something looks wrong in the analytics layer, you trace it through the dbt model to the raw load and see exactly what arrived and when. This audit trail doesn't seem important until the first time it saves you from a long debugging session.&lt;/p&gt;

&lt;p&gt;After all loads complete, ADF retrieves the dbt Cloud API token from Key Vault and triggers the transformation job automatically, so the entire pipeline runs end to end without human involvement.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Metadata-Driven Design
&lt;/h2&gt;

&lt;p&gt;The decision that paid off most disproportionately was making the pipeline entirely metadata-driven from day one. I almost didn't, because the first prototype was hardcoded for three tables, and the temptation to just keep adding tables manually was real. But the upfront investment in a configuration layer saved us weeks of work over the following months.&lt;/p&gt;

&lt;p&gt;Every table is a single row in a configuration table stored in Azure SQL Database, and that row tracks where data comes from, where it's going, how far the last run got, and what happened. ADF reads this table at the start of every run, with nothing hardcoded in the pipeline itself. Adding a new table means adding one row, with no pipeline changes, no GCP console work, and no manual STS job creation. On first run, ADF creates the STS transfer job automatically, BigQuery creates the target table from the Parquet schema, and data starts flowing.&lt;/p&gt;
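
&lt;p&gt;One illustrative row, with column names simplified placeholders rather than the production schema:&lt;/p&gt;

```python
# Illustrative shape of a single row in the Azure SQL configuration table.
# Column names are simplified placeholders, not the production schema.
config_row = {
    "source_table": "public.orders",          # where data comes from
    "target_dataset": "raw",                  # where it's going in BigQuery
    "watermark_column": "updated_at",         # drives incremental extraction
    "last_watermark": "2026-03-01T00:00:00Z", # how far the last run got
    "last_status": "Succeeded",               # what happened
    "last_row_count": 18234,                  # sanity check on volume
    "last_error": None,                       # populated on failure
}
```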

&lt;p&gt;The same table doubles as the operational dashboard. The error column tells you what went wrong, the watermark tells you where each table stands, the timestamp tells you when each was last loaded, and the row count tells you if something loaded suspiciously fewer rows than expected. A single query gives you the health of every table in the pipeline at a glance.&lt;/p&gt;

&lt;p&gt;It also made the project easier to hand off, because everything about the pipeline's configuration lives in a table anyone can read, with no tribal knowledge buried in JSON that requires ADF Studio access to understand.&lt;/p&gt;

&lt;h3&gt;
  
  
  One More Thing: ADF's Nesting Limits
&lt;/h3&gt;

&lt;p&gt;ADF has a limitation that isn't widely documented: you cannot nest certain activity types inside other activities beyond a certain depth. We discovered this when trying to put a polling loop inside a conditional block, and while the pipeline validated fine in ADF Studio, at runtime ADF threw a validation error about unsupported nesting.&lt;/p&gt;

&lt;p&gt;The solution was to break the nested logic into a separate child pipeline connected via Execute Pipeline. The child contains the polling loop, isolated from any conditional wrapper, which means more pipelines to manage, but each one is simpler and the nesting constraint disappears.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Cost Reality
&lt;/h2&gt;

&lt;p&gt;Cross-cloud pipelines have a reputation for being expensive, but this one isn't, though you do need to account for a cost that's easy to overlook.&lt;/p&gt;

&lt;p&gt;The largest ongoing cost is the SHIR VM, which runs continuously. The Azure SQL Database runs on Basic tier at around €4/month, the Azure Function runs on Consumption for single-digit euros, and Blob Storage staging costs near zero because files are deleted after each load.&lt;/p&gt;

&lt;p&gt;The cost that catches most people off guard in multi-cloud architectures is the &lt;strong&gt;cross-cloud data transfer&lt;/strong&gt;. When STS pulls files from Azure Blob Storage, that data leaves Azure's network as egress to the public internet, which Azure charges at roughly $0.087/GB for the first 10 TB. On the GCP side, ingress into Cloud Storage is free, so you're only paying the Azure side of the transfer. For our workload of a dozen tables with incremental loads, this amounts to a few euros per month because we're only moving deltas, not full table dumps. If you were moving terabytes daily, though, this line item would dominate the bill, and you'd want to look into Azure ExpressRoute or Google Cloud Interconnect to bring those rates down significantly.&lt;/p&gt;
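
&lt;p&gt;A back-of-the-envelope check (the 50 GB monthly delta volume is an assumed figure for illustration, not our exact number):&lt;/p&gt;

```python
EGRESS_USD_PER_GB = 0.087   # Azure internet egress, first 10 TB tier
monthly_delta_gb = 50       # assumed incremental volume across all tables

monthly_cost = monthly_delta_gb * EGRESS_USD_PER_GB
print(f"~${monthly_cost:.2f}/month")  # a few dollars a month, as described
```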

&lt;p&gt;On the GCP side beyond ingress, Storage Transfer Service is free for Azure-to-GCS transfers, and BigQuery load jobs are free as well since Google charges for storage and queries, not ingestion. The GCS staging bucket costs a few euros.&lt;/p&gt;

&lt;p&gt;Total for a dozen tables with incremental loads: well under €150 per month. The comparison that matters isn't against doing nothing, it's against a self-managed ETL tool on a VM, a Python script on a scheduler, or an Airbyte instance you're responsible for operating. Those trade low licensing cost for high operational burden, while managed services invert that trade-off.&lt;/p&gt;




&lt;h2&gt;
  
  
  What the Documentation Doesn't Tell You
&lt;/h2&gt;

&lt;p&gt;Looking back, a pattern emerges: the hardest problems were always at the boundaries between services. Within any single cloud service, the documentation is generally good, but at the handoffs, where Azure talks to GCP, where ADF talks to the SHIR, where BigQuery interprets what "done" means, the documentation assumes things will go smoothly.&lt;/p&gt;

&lt;p&gt;A summary of what actually bit us, roughly in order of encounter:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Install Java on the SHIR VM before running any Parquet-based Copy Activity.&lt;/li&gt;
&lt;li&gt;Verify the Private DNS Zone is linked to the correct VNet before assuming connectivity works.&lt;/li&gt;
&lt;li&gt;Always use a bounded watermark window to prevent the incremental extraction race condition.&lt;/li&gt;
&lt;li&gt;URL-decode SAS tokens before storing them in Secret Manager.&lt;/li&gt;
&lt;li&gt;Open the storage account to all networks when using STS. The token is the security layer, not the firewall.&lt;/li&gt;
&lt;li&gt;Grant &lt;code&gt;legacyBucketReader&lt;/code&gt; to the STS service agent. Use numeric project numbers in secret references, not human-readable project IDs.&lt;/li&gt;
&lt;li&gt;Check BigQuery's error field, not just the status.&lt;/li&gt;
&lt;li&gt;Split ADF logic across child pipelines to avoid nesting limits.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;None of these are difficult once you know them, but all of them are invisible until you hit them. The list above represents roughly two and a half weeks of cumulative debugging time.&lt;/p&gt;




&lt;h2&gt;
  
  
  Two Clouds, One Pipeline
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa57hs8v5i6qzirj8teyl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa57hs8v5i6qzirj8teyl.png" alt="Two Clouds, One Pipeline" width="800" height="286"&gt;&lt;/a&gt;&lt;br&gt;
The pipeline has been running in production for months without manual intervention. Watermarks advance automatically, new tables go live in minutes, SAS tokens rotate on schedule, dbt keeps the silver layer clean, and the configuration table is the single source of truth.&lt;/p&gt;

&lt;p&gt;The architecture isn't elegant in the way a single-cloud pipeline can be. There are five hops where a native solution might have two, there are IAM permissions to manage across two providers, and there are encoding quirks and API behaviors you have to learn once and then never forget.&lt;/p&gt;

&lt;p&gt;But it works, it's observable, it costs less per month than a team dinner, and it was built entirely from managed services the team already understood, with no new tools to learn, no new infrastructure to operate, and no new vendor relationships to manage.&lt;/p&gt;

&lt;p&gt;The hardest part wasn't the code, because there is almost no code. It was understanding what each managed service was designed to do, what it quietly assumed, and building the handoffs between them well enough that when something goes wrong, it fails loudly, not silently and slowly, weeks later, when the damage is already done.&lt;/p&gt;

&lt;p&gt;If this story has a thesis, it's this: the documentation for any individual cloud service is generally good, but the gaps are always in the spaces between services. That's where the interesting engineering happens, and it's where most of the debugging time goes. Plan for it.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;That understanding is the actual deliverable. The pipeline is just what you get when you have it.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>azure</category>
      <category>googlecloud</category>
      <category>dataengineering</category>
      <category>analytics</category>
    </item>
    <item>
      <title>From Pipelines to Product: My Journey from Data Engineer to Data Product Owner</title>
      <dc:creator>Panagiotis</dc:creator>
      <pubDate>Tue, 14 Oct 2025 07:58:26 +0000</pubDate>
      <link>https://dev.to/agileactors/from-pipelines-to-product-my-journey-from-data-engineer-to-data-product-owner-53n1</link>
      <guid>https://dev.to/agileactors/from-pipelines-to-product-my-journey-from-data-engineer-to-data-product-owner-53n1</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F09zz55qlxnlmekz5ise0.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F09zz55qlxnlmekz5ise0.jpg" alt=" " width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Most career transitions happen quietly: one project ends, another begins, and slowly a new title appears on your LinkedIn. Mine didn’t. Mine started with a single, uncomfortable question in a demo meeting:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;“Okay… and what do you want me to do with that?”&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;That question revealed a blind spot in my work as a data engineer and set me on a journey I didn’t expect — from building technically flawless pipelines to owning the vision of a data platform as a product. This is the story of how I moved from the comfort of code to the ambiguity of human needs, and what I learned along the way.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Haunting Question of 'Why'&lt;/strong&gt;&lt;br&gt;
We were showcasing our latest work to the client's logistics leadership—a dynamic heatmap tracking parcel congestion across logistic centers in near real-time. We had built it using a streaming pipeline that ingested tens of thousands of scan events per minute. The UI was sleek, the data was fresh, and the latency was under 15 minutes. It was, by every engineering measure, a win.&lt;/p&gt;

&lt;p&gt;As we walked through the interface, I zoomed into a distribution center. “You can see here,” I said proudly, “we’re detecting a 43% spike in inbound volume over baseline for this time of day.”&lt;/p&gt;

&lt;p&gt;There was a pause. Then one of the senior ops managers leaned forward and asked, “Okay... and what do you want me to do with that?”&lt;/p&gt;

&lt;p&gt;That one question knocked the wind out of me. He wasn’t being dismissive—he was being honest. In that moment, I realized the painful truth: we hadn’t built a decision-support tool—we had built a statistics mirror. It was technically elegant but operationally incomplete.&lt;/p&gt;

&lt;p&gt;I had given him the signal, but not the meaning. I had shown him something interesting, but not something useful. The spike was real, the data was right—but I hadn’t connected it to the decisions he was responsible for: rerouting vans, calling in night shift early, delaying outbound dispatches. To him, the number was noise until it came packaged with a recommendation or an alert.&lt;/p&gt;

&lt;p&gt;That question—“What do you want me to do with that?”—echoed in my mind for weeks. It marked a shift in my thinking: from delivering outputs to enabling outcomes. From answering what, to relentlessly chasing the so what.&lt;/p&gt;

&lt;p&gt;In a different environment, the feedback might have been logged as a feature request for "v2.0." But our culture values impact over output. That manager's question wasn't a critique; it was an invitation to solve a deeper problem.&lt;/p&gt;

&lt;p&gt;As a data engineer, I had built my career on the bedrock of how. I found contentment in the elegant logic of a well-designed pipeline. Yet, that heatmap demo marked a turning point. It wasn't enough for the data to be fast and correct; I needed it to be meaningful. The "why" behind the request was no longer a background detail; it was becoming the only thing that mattered. That obsession with purpose was the beginning of my transition to Data Platform Product Owner: a journey from the certainty of code to the ambiguity of human needs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A Culture of Curiosity, Not Just Code&lt;/strong&gt;&lt;br&gt;
My transition was made possible by the exceptional dynamic I share with my employer, Agile Actors. I’ve heard tales from peers of workplaces where career paths are rigid, but my experience was the opposite: I was the beneficiary of a dual culture that saw its people as evolving investments.&lt;/p&gt;

&lt;p&gt;This wasn't just a poster on the wall. During a planning session, we were reviewing a list of upcoming data pipeline tasks, mostly prioritized by technical effort. As I looked through it, I found myself asking, “Which of these will actually help someone on the business side in the next couple of months?”&lt;/p&gt;

&lt;p&gt;It wasn't a bold challenge, just a quiet question, but it shifted the discussion. We ended up rethinking the priorities, reached out to a few internal users for input, and adjusted our plan based on real impact rather than just complexity. My Agile Actors Chapter Lead heard about this, and instead of seeing it as scope creep, he saw it as me embodying our value of 'continuous improvement'. He went beyond acknowledgment, setting up a meeting to discuss my development path, seeing an opportunity for me to create more value for our client by moving closer to the business.&lt;/p&gt;

&lt;p&gt;This support system was crucial, and my Chapter Lead became my advocate. When Agile Actors sponsored my PSPO certification, it wasn't an exception; it was an extension of a belief that investing in an employee’s curiosity pays the highest dividends. They weren't just training a data engineer; they were cultivating a future leader who could bridge the gap between their technical teams and their client's business goals.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;From Building Pipelines to Charting a Product Vision&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This unwavering support transformed a personal ambition into a clear career path. My mentors introduced me to a revolutionary concept for a centuries-old postal service: treating our entire data platform as an internal product.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyyr6oow45x45wzsx3yld.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyyr6oow45x45wzsx3yld.jpg" alt=" " width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Traditionally, we were seen as a service team—implementing requests, building pipelines, fixing bugs. But the “platform as a product” mindset changed everything. Our infrastructure, tools, and datasets weren’t just technical assets—they were products with internal customers: analysts, data scientists, developers, and decision-makers across the business. My new job was to be the Product Owner for this data platform.&lt;/p&gt;

&lt;p&gt;One of my first major initiatives was the development of a reusable ingestion framework to power our Databricks lakehouse. Until then, bringing in a new data source meant writing custom Spark code, managing brittle workflows, and duplicating logic across teams.&lt;/p&gt;

&lt;p&gt;We flipped that model. We built a framework that allowed data engineers to onboard new sources using only configuration files—defining schema mappings, update frequency, and quality rules in YAML, with minimal code. It abstracted away complexity and gave teams a standard, governed, and scalable way to land their data in the lake.&lt;/p&gt;
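&lt;p&gt;For a sense of what that looks like in practice, here is a hypothetical onboarding config; the keys and values below are invented for this post, not the framework’s actual schema:&lt;/p&gt;

```yaml
# Hypothetical source-onboarding config; all field names are illustrative.
source:
  name: parcel_scans
  type: jdbc
  connection_secret: secrets/parcel-scans-jdbc   # resolved by the framework
schedule:
  frequency: hourly
schema_mapping:
  scan_id: string
  scanned_at: timestamp
  hub_code: string
quality_rules:
  - column: scan_id
    rule: not_null
  - column: scanned_at
    rule: not_in_future
```

&lt;p&gt;The framework reads a file like this, generates the Spark ingestion job, and applies the quality rules before the data lands in the lakehouse.&lt;/p&gt;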

&lt;p&gt;Beyond the framework, the product delivered an ecosystem: documentation, onboarding guides, reusable templates, and SLAs that teams could trust. What used to take weeks could now be done in a few hours. At its core, the difference was cultural, not only technical: we gave teams autonomy while ensuring consistency and quality across the platform.&lt;/p&gt;

&lt;p&gt;Soon, I was creating roadmaps for feature rollouts, prioritizing enhancements based on internal feedback, and aligning delivery with cross-functional use cases. The shift from the technical how to the strategic why felt like stepping back from coding individual pipelines to shaping the way our entire organization worked with data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What I Kept, What I Learned&lt;/strong&gt;&lt;br&gt;
Moving from engineering to product wasn't about erasing my past; it was about building upon it.&lt;/p&gt;

&lt;p&gt;What I Kept:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;em&gt;Systems Thinking:&lt;/em&gt;&lt;/strong&gt; The ability to see the entire data ecosystem—from a mail carrier's handheld scanner to the final delivery confirmation—was invaluable for understanding downstream consequences.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;em&gt;Problem Decomposition:&lt;/em&gt;&lt;/strong&gt; Breaking down a massive problem like "improve delivery efficiency" into logical, manageable steps is the same skill used to design a complex data pipeline.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;em&gt;A Respect for Quality:&lt;/em&gt;&lt;/strong&gt; Obsession with data integrity became a secret weapon in discussions about building robust, reliable data products that the business could trust.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What I Had to Learn:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;em&gt;Stakeholder Management:&lt;/em&gt;&lt;/strong&gt; My world expanded to include logistics, sales, finance, and executive leadership. I had to learn their languages and negotiate compromises.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;em&gt;The Art of Saying 'No':&lt;/em&gt;&lt;/strong&gt; The Head of Regional Distribution wanted a real-time dashboard tracking every single delivery truck on a map, refreshed every second. My engineering gut knew it was feasible. But my new Product Owner brain had to ask why. After interviewing the dispatchers, I discovered they didn't need a flashy map; they needed a reliable alert when a truck was projected to be more than 30 minutes late. We built the simpler, more valuable alerting system instead. Saying 'no' to the 'wow' feature in favor of the 'working' feature was terrifying, but it was the right call.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;em&gt;Embracing Ambiguity:&lt;/em&gt;&lt;/strong&gt; I had to get comfortable making decisions with incomplete information, moving forward to learn and iterate rather than waiting for the "perfect" answer.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Finding Rhythm in the Chaos with Scrum&lt;/strong&gt;&lt;br&gt;
When Agile Actors offered to sponsor my Professional Scrum Product Owner (PSPO) certification, I was skeptical: I associated Scrum with rigid project management rituals. The training was a revelation. Scrum, it turns out, is an empirical framework designed to navigate ambiguity deliberately.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpy4sn6ouqw1wktc427r4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpy4sn6ouqw1wktc427r4.png" alt=" " width="572" height="567"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Concerning a data product, "value" can be elusive. It's an insight that prevents a sorting machine from breaking down, an automated process that optimizes a delivery route to save fuel, or a model that improves address correction. The PSPO training taught me to make this concrete. I learned to define a clear Product Goal (our north star) and break it down into tangible Sprint Goals.&lt;/p&gt;

&lt;p&gt;This transformed our work. Our Sprint Goal was no longer "build a pipeline," but something like: "Provide the 'Address Quality' team with a reliable daily source of truth for returned mail, so they can validate their new correction algorithm."&lt;/p&gt;

&lt;p&gt;The Sprint Retrospective became the embodiment of our dual-company growth mindset. In one retro, we realized our planning was failing because the client's subject matter expert was only available on Thursdays. To solve this, our Agile Actors team proposed a new "Co-creation Wednesday" meeting. It wasn't in the Scrum guide, but it was our adaptation to make the framework succeed in our unique client environment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Trading the Keyboard for a Compass&lt;/strong&gt;&lt;br&gt;
The most challenging part was internal. My confidence came from my hands-on ability to solve problems. I remember a critical project where the team was wrestling with a nasty performance bug in our dbt models processing scanner data from the hubs. The build was taking three hours instead of thirty minutes. My fingers itched to dive into the Jinja macros and start debugging. I felt a pang of anxiety, a fear of losing my technical credibility.&lt;/p&gt;

&lt;p&gt;My Chapter Lead said, "You’re proving you can still handle the technical work. But the team doesn’t need another set of hands—they need someone to set direction and show them where to focus."&lt;/p&gt;

&lt;p&gt;That was a breakthrough. I had to learn to lead through influence, not instruction. My value was no longer in the code; it was in the clarity of the vision. I had to empower the engineering team and then get out of their way.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A New Definition of 'Done'&lt;/strong&gt;&lt;br&gt;
 Today, my work starts with a need for data and ends with someone being able to act on it confidently. My definition of "done" has evolved. It’s no longer writing a custom pipeline to bring in a single source; it’s a new dataset flowing into the lakehouse through our ingestion framework with nothing more than a configuration file. It’s an engineer onboarding a system in hours instead of weeks, or an analyst querying consistent, well-documented data without worrying about hidden transformations. It’s a data scientist running experiments on fresh, trusted data because the platform makes quality and availability a given.&lt;/p&gt;

&lt;p&gt;I’ve shifted from building pipelines myself to enabling others to move faster, safer, and with more autonomy; from completing tasks to enabling outcomes. “Done” is no longer code that works, it’s a platform that empowers.&lt;/p&gt;

&lt;p&gt;Becoming a Data Product Owner didn’t erase my engineering roots—it gave them purpose. The journey was a personal transformation, made possible by the unique partnership between a consultancy that invests in its people and a client that trusts them to solve real problems. I learned that the most powerful growth happens when you have the courage and the support to build not just the right thing, but the right thing together.&lt;/p&gt;

&lt;p&gt;Looking back, the hardest part wasn’t learning product frameworks or stakeholder management. It was letting go of the idea that my value was in the code I could write. My value became the clarity I could bring, the questions I could ask, and the outcomes I could enable for others.&lt;/p&gt;

&lt;p&gt;That shift — from outputs to outcomes, from what to why — changed not only my career, but also the way I see impact in any technical role.&lt;/p&gt;

&lt;p&gt;For anyone standing at a similar crossroads, my advice is simple: stay curious, ask the uncomfortable questions, and don’t be afraid to trade your keyboard for a compass. The right environment will see that curiosity not as scope creep, but as leadership in the making.&lt;/p&gt;

&lt;p&gt;At Agile Actors, we thrive on challenges with a bold and adventurous spirit. We confront problems directly, using cutting-edge technologies in the most innovative and daring ways. If you’re excited to join a dynamic learning organization where knowledge flows freely and skills are refined to excellence, come join our exceptional team. Let’s conquer new frontiers together. Check out our &lt;a href="https://apply.workable.com/agileactors/" rel="noopener noreferrer"&gt;openings&lt;/a&gt; and choose the Agile Actors Experience!&lt;/p&gt;

</description>
      <category>dataplatform</category>
      <category>dataproduct</category>
      <category>analytics</category>
      <category>scrum</category>
    </item>
  </channel>
</rss>
