<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Λ\: Clément Bosc</title>
    <description>The latest articles on DEV Community by Λ\: Clément Bosc (@clementbosc).</description>
    <link>https://dev.to/clementbosc</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F536186%2F1e53dd8c-36b2-4855-b3a5-01c8833e9dfc.jpeg</url>
      <title>DEV Community: Λ\: Clément Bosc</title>
      <link>https://dev.to/clementbosc</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/clementbosc"/>
    <language>en</language>
    <item>
      <title>Applying graph theory for inferring your BigQuery SQL transformations: an experimental DataOps tool</title>
      <dc:creator>Λ\: Clément Bosc</dc:creator>
      <pubDate>Tue, 16 Apr 2024 20:26:40 +0000</pubDate>
      <link>https://dev.to/stack-labs/applying-graph-theory-for-inferring-your-bigquery-sql-transformations-an-experimental-dataops-tool-463n</link>
      <guid>https://dev.to/stack-labs/applying-graph-theory-for-inferring-your-bigquery-sql-transformations-an-experimental-dataops-tool-463n</guid>
      <description>&lt;p&gt;If you work with Google Cloud for your Data Platform there are chances that you use BigQuery and run your Data pipelines transformations in a ELT manner: using BQ query engine to run transformations as a series of SELECT statements, one after another. Indeed over the last few years, ELT and tools like DBT or Dataform have been the de-facto standard for running and organizing your Data transformations at scale.&lt;/p&gt;

&lt;p&gt;These tools, which we may group under the “SQL orchestration tools” banner, are great for many reasons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;SQL is the one and only language used to express a transformation, and SQL is great for structured data (and even semi-structured data)&lt;/li&gt;
&lt;li&gt;They do a great job at centralizing the transformations: nice for audits, lineage tracking and trust&lt;/li&gt;
&lt;li&gt;They simplify the DataOps experience and help onboard Data Analysts in Data Engineer tasks&lt;/li&gt;
&lt;li&gt;They can almost automatically infer the transformation dependencies by creating a DAG. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;BUT, coming from a Platform Engineering background, I see a major flaw: they lack a state. Indeed, if you take declarative IaC tools like Terraform, the current state of the Data Platform infrastructure is stored in a file (the state), including the tables/views, the permissions, etc.&lt;/p&gt;

&lt;h2&gt;
  
  
  But how is this a problem?
&lt;/h2&gt;

&lt;p&gt;The problem is that tools like DBT or Dataform only run SQL statements generated at execution time. For example, to create a table the generated statement will be &lt;code&gt;CREATE OR REPLACE TABLE AS SELECT your_transformation&lt;/code&gt;. This means the tool never knows whether the object already existed, so you cannot attach IAM permissions to it with Terraform (because the object is re-created every day by your daily batch), nor can you use the table as a contract interface with consumers, because the table does not exist prior to the transformation.&lt;/p&gt;

&lt;h2&gt;
  
  
  The solution: an experimental tool that uses the best of both worlds
&lt;/h2&gt;

&lt;p&gt;I wanted to keep the benefits of SQL orchestration tools (like Dataform on GCP) while using Terraform for the Ops benefits, keeping in mind the following requirements:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Table dependencies between 2 transformations (running transformation B after A if B references table A in its query) should be &lt;strong&gt;automatically inferred&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Table schemas (column name, type) must be &lt;strong&gt;automatically inferred&lt;/strong&gt;: users should not waste time writing a table schema that can be deduced from the query output.&lt;/li&gt;
&lt;li&gt;Tables should be automatically created prior to the transformation (not by the transformation) with an IaC tool: Terraform&lt;/li&gt;
&lt;li&gt;Be able to have a custom monitoring interface that gathers all the transformation information: status, cost, performance, custom error messages, etc.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Architecture proposal
&lt;/h2&gt;

&lt;p&gt;Here is the architecture proposal for my experimental, DataOps-oriented transformation tool:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Transformations are BigQuery queries&lt;/li&gt;
&lt;li&gt;Orchestration is handled by an auto-generated Cloud Workflow, with all the correct dependencies and parallel steps when possible (if two transformations can run at the same time)&lt;/li&gt;
&lt;li&gt;Monitoring is a BigQuery timestamp-partitioned table with a Pub/Sub topic (and an Avro schema for the interface) and a push-to-BQ streaming subscription&lt;/li&gt;
&lt;li&gt;Transformations are defined in a git repository as YAML files (Jinja templates are supported for flexibility and factorization)&lt;/li&gt;
&lt;li&gt;A Cloud Run endpoint hosts all the schema/dependency inference logic and generates the Workflow body according to the transformation dependencies (more on the Cloud Run below)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F19r1krgme0i3jj44la4s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F19r1krgme0i3jj44la4s.png" alt="Automatically create BigQuery tables with Terraform" width="800" height="344"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How to infer dependencies ?
&lt;/h2&gt;

&lt;p&gt;Here is where the magic happens: the automatic dependency inference. As a reminder, DAGs in data pipelines are nothing more than graphs (Directed Acyclic Graphs), so let’s use a graph library to build them from raw SQL declarations. You can find the detailed process and Python implementation examples in this post: &lt;a href="https://dev.to/stack-labs/build-the-dependency-graph-of-your-bigquery-pipelines-at-no-cost-a-python-implementation-1ik1"&gt;Build the dependency graph of your BigQuery pipelines at no cost: a Python implementation&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The raw SQL declarations are sent by Terraform to a remote Cloud Run instance that computes the inference logic (DAG creation, Workflows source code generation, table schema generation), so that Terraform can immediately create the tables and workflows, prior to any transformation running.&lt;/p&gt;

&lt;h2&gt;
  
  
  Example: the experiment in action
&lt;/h2&gt;

&lt;p&gt;Let’s take a simple example: we are in a standard Data Platform architecture with a 3-layer principle: Bronze (raw data), Silver (curated data) and Gold (aggregated/meaningful data). We need to run a data transformation pipeline, in SQL, that cleans the raw data (deduplication and type conversion, for example) and builds an analytics-ready aggregated table from the cleaned data.&lt;/p&gt;

&lt;p&gt;The demo dataset is a very simple retail-oriented data model (orders, products and users), orders being the fact table.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzi2cpty6hmj68amy0ibd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzi2cpty6hmj68amy0ibd.png" alt="Infer BigQuery table dependencies" width="800" height="425"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Our tool, based on Terraform, needs to create the SILVER and GOLD tables, with the correct schemas, before the transformations run, as well as the Cloud Workflow source definition.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The data transformation files:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The transformations are described in a YAML file, specifying the destination table and the SQL transformation query as a single SELECT.&lt;/p&gt;

&lt;p&gt;Building the silver layer; here it is only a deduplication step for the sake of the demo:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;workflow_group&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;demo&lt;/span&gt;
&lt;span class="na"&gt;destination_table&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;project_id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${raw_project}&lt;/span&gt;
  &lt;span class="na"&gt;dataset_id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${app_name}_ds_3_demo_${multiregion_id}_${project_env}&lt;/span&gt;
  &lt;span class="na"&gt;table_id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;orders&lt;/span&gt;
  &lt;span class="na"&gt;location&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;EU&lt;/span&gt;

&lt;span class="na"&gt;query&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="s"&gt;SELECT&lt;/span&gt;
    &lt;span class="s"&gt;*&lt;/span&gt;
  &lt;span class="s"&gt;FROM `${raw_project}.sldp_demo_retail_analytics_raw_data_eu_${project_env}.orders_v1`&lt;/span&gt;
  &lt;span class="s"&gt;QUALIFY ROW_NUMBER() OVER (PARTITION BY id ORDER BY insertion_time DESC) = 1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Building the gold layer; here, an aggregated table of the total amount of sold products per month and per consumer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;workflow_group&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;3-demo&lt;/span&gt;
&lt;span class="na"&gt;destination_table&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;project_id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${raw_project}&lt;/span&gt;
  &lt;span class="na"&gt;dataset_id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${app_name}_ds_3_demo_${multiregion_id}_${project_env}&lt;/span&gt;
  &lt;span class="na"&gt;table_id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;total_cost_by_user&lt;/span&gt;
  &lt;span class="na"&gt;location&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;EU&lt;/span&gt;

&lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Total&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;cost&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;by&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;and&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;month.&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Granularity:&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;[user_id,&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;month]"&lt;/span&gt;

&lt;span class="na"&gt;query&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="s"&gt;SELECT&lt;/span&gt;
    &lt;span class="s"&gt;u.email,&lt;/span&gt;
    &lt;span class="s"&gt;DATE_TRUNC(DATE(o.created_at), MONTH) as month,&lt;/span&gt;
    &lt;span class="s"&gt;SUM(o.quantity * p.price) as total_amount,&lt;/span&gt;
    &lt;span class="s"&gt;COUNT(DISTINCT o.id) as total_orders,&lt;/span&gt;
    &lt;span class="s"&gt;CURRENT_TIMESTAMP() as insertion_time&lt;/span&gt;
  &lt;span class="s"&gt;FROM `${raw_project}.${app_name}_ds_3_demo_${multiregion_id}_${project_env}.orders` o&lt;/span&gt;
  &lt;span class="s"&gt;JOIN `${raw_project}.${app_name}_ds_3_demo_${multiregion_id}_${project_env}.users` u&lt;/span&gt;
    &lt;span class="s"&gt;ON u.id = o.user_id&lt;/span&gt;
 &lt;span class="s"&gt;JOIN `${raw_project}.${app_name}_ds_3_demo_${multiregion_id}_${project_env}.products` p&lt;/span&gt;
    &lt;span class="s"&gt;ON p.id = o.product_id&lt;/span&gt;
  &lt;span class="s"&gt;GROUP BY email, month&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And after running the &lt;code&gt;terraform plan&lt;/code&gt; command we can see the following output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;Terraform will perform the following actions:

&lt;span class="c"&gt;# google_bigquery_table.destination_tables["orders"] will be created&lt;/span&gt;
  + resource &lt;span class="s2"&gt;"google_bigquery_table"&lt;/span&gt; &lt;span class="s2"&gt;"destination_tables"&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
      + creation_time       &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;known after apply&lt;span class="o"&gt;)&lt;/span&gt;
      + dataset_id          &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"sldp_ds_3_demo_eu_dev"&lt;/span&gt;
      + schema              &lt;span class="o"&gt;=&lt;/span&gt; jsonencode&lt;span class="o"&gt;(&lt;/span&gt;
            &lt;span class="o"&gt;[&lt;/span&gt;
              + &lt;span class="o"&gt;{&lt;/span&gt;
                  + mode        &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"NULLABLE"&lt;/span&gt;
                  + name        &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"id"&lt;/span&gt;
                  + &lt;span class="nb"&gt;type&lt;/span&gt;        &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"INTEGER"&lt;/span&gt;
                &lt;span class="o"&gt;}&lt;/span&gt;,
        ...
          &lt;span class="o"&gt;])&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;

&lt;span class="c"&gt;# google_bigquery_table.destination_tables["products"] will be created&lt;/span&gt;
  + resource &lt;span class="s2"&gt;"google_bigquery_table"&lt;/span&gt; &lt;span class="s2"&gt;"destination_tables"&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    ...
&lt;span class="o"&gt;}&lt;/span&gt;

&lt;span class="c"&gt;# google_bigquery_table.destination_tables["users"] will be created&lt;/span&gt;
  + resource &lt;span class="s2"&gt;"google_bigquery_table"&lt;/span&gt; &lt;span class="s2"&gt;"destination_tables"&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    ...
&lt;span class="o"&gt;}&lt;/span&gt;

 &lt;span class="c"&gt;# google_bigquery_table.destination_tables["total_cost_by_user"] will be created&lt;/span&gt;
  + resource &lt;span class="s2"&gt;"google_bigquery_table"&lt;/span&gt; &lt;span class="s2"&gt;"destination_tables"&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
      + dataset_id          &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"sldp_ds_3_demo_eu_dev"&lt;/span&gt;
      + description         &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Total cost by user and month. Granularity: [user_id, month]"&lt;/span&gt;
      + &lt;span class="nb"&gt;id&lt;/span&gt;                  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;known after apply&lt;span class="o"&gt;)&lt;/span&gt;
      + schema              &lt;span class="o"&gt;=&lt;/span&gt; jsonencode&lt;span class="o"&gt;(&lt;/span&gt;
            &lt;span class="o"&gt;[&lt;/span&gt;
              + &lt;span class="o"&gt;{&lt;/span&gt;
                  + description &lt;span class="o"&gt;=&lt;/span&gt; null
                  + mode        &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"NULLABLE"&lt;/span&gt;
                  + name        &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"email"&lt;/span&gt;
                  + policyTags  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
                      + names &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;[]&lt;/span&gt;
                    &lt;span class="o"&gt;}&lt;/span&gt;
                  + &lt;span class="nb"&gt;type&lt;/span&gt;        &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"STRING"&lt;/span&gt;
                &lt;span class="o"&gt;}&lt;/span&gt;,
              + &lt;span class="o"&gt;{&lt;/span&gt;
                  + description &lt;span class="o"&gt;=&lt;/span&gt; null
                  + mode        &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"NULLABLE"&lt;/span&gt;
                  + name        &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"month"&lt;/span&gt;
                  + policyTags  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
                      + names &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;[]&lt;/span&gt;
                    &lt;span class="o"&gt;}&lt;/span&gt;
                  + &lt;span class="nb"&gt;type&lt;/span&gt;        &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"DATE"&lt;/span&gt;
                &lt;span class="o"&gt;}&lt;/span&gt;,
        ...
     &lt;span class="o"&gt;])&lt;/span&gt;

&lt;span class="o"&gt;}&lt;/span&gt;

&lt;span class="c"&gt;# google_workflows_workflow.data_transfo["3-demo"] will be created&lt;/span&gt;
  + resource &lt;span class="s2"&gt;"google_workflows_workflow"&lt;/span&gt; &lt;span class="s2"&gt;"data_transfo"&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
      + create_time      &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;known after apply&lt;span class="o"&gt;)&lt;/span&gt;
      + description      &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;known after apply&lt;span class="o"&gt;)&lt;/span&gt;
      + effective_labels &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;known after apply&lt;span class="o"&gt;)&lt;/span&gt;
      + &lt;span class="nb"&gt;id&lt;/span&gt;               &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;known after apply&lt;span class="o"&gt;)&lt;/span&gt;
      + name             &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"wkf_datatransfo_3_demo_euw1_dev"&lt;/span&gt;
      + name_prefix      &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;known after apply&lt;span class="o"&gt;)&lt;/span&gt;
      + project          &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"sldp-front-dev"&lt;/span&gt;
      + region           &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"europe-west1"&lt;/span&gt;
      + revision_id      &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;known after apply&lt;span class="o"&gt;)&lt;/span&gt;
      + service_account  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"..."&lt;/span&gt;
      + source_contents  &lt;span class="o"&gt;=&lt;/span&gt; jsonencode&lt;span class="o"&gt;(&lt;/span&gt;
        &amp;lt;Coming from the Cloud Run backend mentioned above, called directly by terraform with the data http provider&amp;gt;
     &lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;


Plan: 5 to add, 0 to change, 0 to destroy.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The auto-generated Cloud Workflow DAG:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In the auto-generated Cloud Workflow, we can find 4 steps, one for each table. In our example above: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;3 steps can run in parallel (the Silver tables), for deduplication and typing. Here we use the &lt;a href="https://networkx.org/documentation/stable/reference/algorithms/generated/networkx.algorithms.dag.topological_generations.html" rel="noopener noreferrer"&gt;topological generations&lt;/a&gt; method on our graph.&lt;/li&gt;
&lt;li&gt;1 step for the Gold transformation, which needs to wait for the previous steps to finish, because the Silver tables are referenced by the Gold table.&lt;/li&gt;
&lt;/ul&gt;
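&lt;p&gt;For intuition, here is a minimal, dependency-free Python sketch of what the topological generations computation gives us (the real tool uses networkx; the table names below are just the demo’s):&lt;/p&gt;

```python
from collections import defaultdict

def topological_generations(edges):
    """Group nodes into 'generations': every node in a generation only
    depends on nodes from earlier generations, so a whole generation can
    run in parallel (a stdlib mirror of networkx's topological_generations)."""
    indegree = defaultdict(int)
    successors = defaultdict(list)
    nodes = set()
    for src, dst in edges:
        nodes.update((src, dst))
        indegree[dst] += 1
        successors[src].append(dst)
    # start with the nodes that depend on nothing
    current = sorted(n for n in nodes if indegree[n] == 0)
    generations = []
    while current:
        generations.append(current)
        ready = []
        for node in current:
            for succ in successors[node]:
                indegree[succ] -= 1
                if indegree[succ] == 0:
                    ready.append(succ)
        current = sorted(ready)
    return generations

# The demo pipeline: three Silver tables feeding one Gold table
edges = [
    ("orders", "total_cost_by_user"),
    ("users", "total_cost_by_user"),
    ("products", "total_cost_by_user"),
]
print(topological_generations(edges))
# [['orders', 'products', 'users'], ['total_cost_by_user']]
```

&lt;p&gt;The first generation becomes the parallel branch of the Workflow, the second one the step that waits on it.&lt;/p&gt;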

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F02i1mg2kk7lox9cs8764.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F02i1mg2kk7lox9cs8764.png" alt="Auto generated GCP Cloud Workflow DAG" width="725" height="699"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this workflow, each step will do the following:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Compile the query: in all our transformations we can use the Jinja templating language, and Workflows input parameters can be used in the transformation template. For example, we can use the “incremental” parameter to switch to a different transformation logic if we want to deal with incremental updates&lt;/li&gt;
&lt;li&gt;Run the BigQuery job (compiled query)&lt;/li&gt;
&lt;li&gt;Log the status of the job: the workflow publishes an event to a Pub/Sub topic that streams in real time into a BigQuery monitoring table, in order to track the status of every step of every workflow.&lt;/li&gt;
&lt;/ol&gt;
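&lt;p&gt;The “compile then run” logic of step 1 can be sketched in a few lines of Python. This is a simplified stand-in: the real tool renders Jinja templates, and the table and parameter names here are hypothetical:&lt;/p&gt;

```python
# Simplified sketch of step 1 (query compilation). Plain str.format stands
# in for Jinja here, and the names (silver.orders, run_date) are
# illustrative only, not the tool's actual identifiers.
def compile_query(template: str, params: dict) -> str:
    """Render a transformation template with the workflow's input parameters."""
    if params.get("incremental"):
        # incremental mode: only process the current partition
        where_clause = "WHERE DATE(insertion_time) = @run_date"
    else:
        where_clause = ""  # full refresh: take everything
    return template.format(where_clause=where_clause).strip()

template = "SELECT * FROM silver.orders {where_clause}"
print(compile_query(template, {"incremental": True}))
# SELECT * FROM silver.orders WHERE DATE(insertion_time) = @run_date
print(compile_query(template, {"incremental": False}))
# SELECT * FROM silver.orders
```

&lt;p&gt;The compiled string is then submitted as a BigQuery job (step 2) and its status published to Pub/Sub (step 3).&lt;/p&gt;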

&lt;h2&gt;
  
  
  More features…
&lt;/h2&gt;

&lt;p&gt;The experiment is now quite feature-rich; here are some of the features we added:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Every transformation can have SQL scripting pre-operations. The pre-operations are taken into account when processing the dependency graph (if you create temporary tables, for example) and run in the same BQ session as the main transformation. By the way, check out this great article by my friend Matthieu explaining the implementation in Python: &lt;a href="https://dev.to/stack-labs/bigquery-transactions-over-multiple-queries-with-sessions-2ll5"&gt;BigQuery transactions over multiple queries, with sessions&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;You can use Jinja templating in every transformation, with some common variables available at run time: in the workflow, every transformation step is first “compiled” before being sent to BigQuery.&lt;/li&gt;
&lt;li&gt;You can define custom query templates usable across the whole project: for example, a Merge template is available for everyone to implement a merge strategy on the final table instead of replace/append.&lt;/li&gt;
&lt;li&gt;All templates can implement an incremental alternative (using Jinja conditions). For example, the Default template appends data to the final table if the workflow is run in incremental mode, or overwrites the data in non-incremental mode.&lt;/li&gt;
&lt;li&gt;All the input parameters of your workflows can be used in Jinja templates.&lt;/li&gt;
&lt;li&gt;After every workflow step, a structured log event is published in real time to the monitoring Pub/Sub topic and immediately streamed into the monitoring BQ table.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;It works like a charm!&lt;/p&gt;

&lt;p&gt;This architecture has been used for a few months internally at &lt;a href="https://stack-labs.com/" rel="noopener noreferrer"&gt;Stack Labs&lt;/a&gt; to process our internal data pipelines: there are extremely few pipeline errors at runtime (even fewer than with Dataform, which sometimes lost its connection to the git provider), it’s very cost-effective (the DAG generation is completely free thanks to a few hacks), the custom templating system is very flexible for advanced data engineering use cases, and we now have proper custom monitoring logs at every transformation step to build real-time monitoring dashboards!&lt;/p&gt;

&lt;p&gt;So yes, it’s a very geeky approach, and the developer experience is local-first and git-oriented, but if like me you have a Software Engineering background, you will feel very comfortable doing Data Engineer/Analyst tasks using this approach. This will probably stay in the experimental phase, but it was fun designing a serverless, DevOps-oriented data transformation tool and applying graph theory in the solution. Feel free to ping me for the source code.&lt;/p&gt;

</description>
      <category>dataops</category>
      <category>bigquery</category>
      <category>googlecloud</category>
      <category>data</category>
    </item>
    <item>
      <title>Build the dependency graph of your BigQuery pipelines at no cost: a Python implementation</title>
      <dc:creator>Λ\: Clément Bosc</dc:creator>
      <pubDate>Thu, 11 Jan 2024 10:36:34 +0000</pubDate>
      <link>https://dev.to/stack-labs/build-the-dependency-graph-of-your-bigquery-pipelines-at-no-cost-a-python-implementation-1ik1</link>
      <guid>https://dev.to/stack-labs/build-the-dependency-graph-of-your-bigquery-pipelines-at-no-cost-a-python-implementation-1ik1</guid>
      <description>&lt;p&gt;Nowadays, in a lot of Data Stacks, most of the Data Engineering task is writing SQL. &lt;br&gt;
SQL is a powerful language that is well suited for building most batch data pipelines: it's universal, great for structured data and (most of the time) easy to write. On the other hand, one of the complexities can be orchestrating the series of SQL statements in the correct order of the dependencies, meaning that if a SQL query references the result of another one, the second should be run before the first. &lt;/p&gt;

&lt;p&gt;For a personal project I pursued the quest of automatically generating the graph of BigQuery transformation dependencies using a small Python script. I wanted to use graph theory and a SQL syntax parser. Let’s see how.&lt;/p&gt;
&lt;h2&gt;
  
  
  What are the dependencies in BigQuery SQL transformations?
&lt;/h2&gt;

&lt;p&gt;In BigQuery (or any other data warehouse, actually), transformations are typically chained to form a more or less complex DAG (Directed Acyclic Graph). In this DAG, each transformation is sourced from one or more tables and its result is dumped into a single table, which can in turn be used in other transformations, and so on. In this graph, all the nodes are tables (sources or destinations) and the edges are dependencies, meaning the transformation references the source tables in a FROM or JOIN clause to produce the target.&lt;/p&gt;

&lt;p&gt;Here is an example:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdv3zq9gfcf7qe5oev0fy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdv3zq9gfcf7qe5oev0fy.png" alt="Build Graph from BigQuery transformations" width="800" height="526"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here we can see that table D is produced from tables A &amp;amp; B, table E from tables D &amp;amp; C, and table F from tables D &amp;amp; E. From this diagram we can easily see that transformation D should run first, followed by E and finally F.&lt;/p&gt;
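&lt;p&gt;This run order can be recovered with a few lines of plain Python (a dependency-free sketch; in the project itself, networkx does this job):&lt;/p&gt;

```python
# Sketch of the diagram above: D is built from A and B, E from D and C,
# F from D and E. A topological sort of the dependency graph yields the
# order in which the transformations must run.
edges = [("A", "D"), ("B", "D"), ("D", "E"), ("C", "E"), ("D", "F"), ("E", "F")]
produced = {"D", "E", "F"}  # tables created by a transformation (A, B, C are raw sources)

# dependencies between transformations only (raw sources are always available)
deps = {t: {s for s, d in edges if d == t and s in produced} for t in produced}

run_order = []
while deps:
    done = set(run_order)
    # a transformation is ready once all the tables it reads are built
    ready = sorted(t for t, srcs in deps.items() if srcs.issubset(done))
    run_order.extend(ready)
    for t in ready:
        del deps[t]

print(run_order)
# ['D', 'E', 'F']
```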
&lt;h2&gt;
  
  
  Automatically infer the graph with Python
&lt;/h2&gt;

&lt;p&gt;In the project we used the Python lib &lt;a href="https://networkx.org/" rel="noopener noreferrer"&gt;networkx&lt;/a&gt; and a DiGraph object (Directed Graph). To detect a table reference in a query, we use &lt;a href="https://github.com/tobymao/sqlglot" rel="noopener noreferrer"&gt;sqlglot&lt;/a&gt;, a SQL parser (among other things) that works well with BigQuery.&lt;/p&gt;

&lt;p&gt;Here is the first part of the Python script to create the graph, simplified for this blog post.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;networkx&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;nx&lt;/span&gt;

&lt;span class="c1"&gt;# all the transfromations are stored in a structure
# here let's assume Transformation object only contains the
# query and the destination table (project_id, dataset_id,
# table_id)
&lt;/span&gt;
&lt;span class="n"&gt;transformations&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Transformation&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;graph&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DiGraph&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;transfo&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;transformations&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;dependencies&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;find_query_dependencies&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;transfo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Add nodes, the transfo infos are added as node metadata
&lt;/span&gt;    &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;transfo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;destination_table&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;transfo&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;transfo&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Add edges
&lt;/span&gt;    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;src&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;dependencies&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# here note that is the src does not exist yet are a
&lt;/span&gt;        &lt;span class="c1"&gt;# node it's created
&lt;/span&gt;        &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_edge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;transfo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;destination_table&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now let’s see how to find the dependencies of a SQL query using the sqlglot SQL parser:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sqlglot&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;parse_one&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;exp&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;find_query_dependencies&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;set&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Find all the tables in the query&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;catalog&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;parse_one&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bigquery&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;find_all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Table&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the piece of code above, we use the &lt;code&gt;parse_one&lt;/code&gt; function from sqlglot to parse the transformation query, using the BigQuery dialect, into a syntax tree that can be searched:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdgmg9hw1rl6h096uchv0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdgmg9hw1rl6h096uchv0.png" alt="sqlglot parse_one" width="741" height="679"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Automatically infer the schema of the output tables
&lt;/h2&gt;

&lt;p&gt;Another cool feature we can add to our script is the ability to auto-detect, with high precision, the expected output schema of all our tables (D, E and F in the example), even if they haven't been created yet. This can be very helpful if we want to create the tables with an Infrastructure as Code tool like Terraform before the transformations even run.&lt;br&gt;
For this feature, I used the following BigQuery trick: we can run a query with a &lt;code&gt;LIMIT 0&lt;/code&gt; at the end of the &lt;code&gt;SELECT&lt;/code&gt; statement. The awesome part is that BigQuery won't charge anything (0 bytes billed) but will still create a temporary output table with the expected schema (including &lt;code&gt;NULLABLE&lt;/code&gt; / &lt;code&gt;REQUIRED&lt;/code&gt; modes)!&lt;br&gt;
To generate the query with &lt;code&gt;LIMIT 0&lt;/code&gt;, we need to add it to every &lt;code&gt;SELECT&lt;/code&gt; statement in the query (including all subqueries). Let’s use sqlglot again by defining a SQL transformer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;limit_transformer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;This sqlglot transformer function add a limit 0 to
       every SELECT stmnt&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Select&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;limit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt;

&lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
WITH source_A as (
   SELECT &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;HelloWorld&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; as hello
), source_B as (
   SELECT CURRENT_DATE() as date
)
SELECT * 
FROM source_A, source_B
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="n"&gt;sample_query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nf"&gt;parse_one&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dialect&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Dialects&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;BIGQUERY&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;limit_transformer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dialect&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Dialects&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;BIGQUERY&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sample_query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# =====================
&lt;/span&gt;
&lt;span class="c1"&gt;# WITH source_A as (
#    SELECT "HelloWorld" as hello LIMIT 0
# ), source_B as (
#    SELECT CURRENT_DATE() as date LIMIT 0
# )
# SELECT * 
# FROM source_A, source_B
# LIMIT 0
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once we have our new query with all the &lt;code&gt;LIMIT 0&lt;/code&gt; clauses, we need to create a BQ job that runs it, free of charge!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.cloud&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;bigquery&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;fetch_destination_schema&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;span class="n"&gt;bq_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bigquery&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;query_job&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bq_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;query_job&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;result&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Fetch the temporary table schema created by BigQuery
&lt;/span&gt;&lt;span class="n"&gt;tmp_dest_table&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bq_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query_job&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;destination&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;destination_schema&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tmp_dest_table&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;schema&lt;/span&gt;

&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;destination_schema&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query_job&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;destination&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, in order to generate the output schema of all the tables in the graph, even the last one, which depends on tables that have not yet been created (here, tables D and E will only be temporary tables, not real tables), we need to take our schema generation method and apply it to each node in the DiGraph in the "correct" order.&lt;br&gt;
In graph theory this is called the topological order: in this example, first table D, then E, then F. For each node, we replace the reference to the real table in the body of the transformation with the reference to the previously created temporary table. This way, even if neither table D nor E exists, we can still deduce the final schema of table F!&lt;/p&gt;
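&lt;p&gt;To make the notion of topological order concrete, here is a minimal, dependency-free sketch of Kahn's algorithm (the exact edge list is an assumed illustration of the A-to-F example graph, not taken from the actual tool; the real script uses &lt;code&gt;nx.topological_sort&lt;/code&gt;):&lt;/p&gt;

```python
# Assumed example edges: A, B, C are source tables; D, E, F are
# produced by transformations (F depends on D and E).
edges = [("A", "D"), ("B", "D"), ("B", "E"), ("C", "E"), ("D", "F"), ("E", "F")]

nodes = {n for edge in edges for n in edge}

# Count incoming edges (unresolved dependencies) for each node
in_degree = {n: 0 for n in nodes}
for _, dst in edges:
    in_degree[dst] += 1

# Repeatedly emit nodes whose dependencies are all satisfied
order = []
ready = sorted(n for n in nodes if in_degree[n] == 0)
while ready:
    node = ready.pop(0)
    order.append(node)
    for src, dst in edges:
        if src == node:
            in_degree[dst] -= 1
            if in_degree[dst] == 0:
                ready.append(dst)

print(order)  # sources first, then D and E, then F
```

Every table appears only after all of its sources, which is exactly the order in which the schema-inference jobs must run.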

&lt;p&gt;This process can be illustrated like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi09hxlq3sbjdib495yfe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi09hxlq3sbjdib495yfe.png" alt="automatically infer data transformation schemas bigquery" width="800" height="538"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;node_id&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;nx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;topological_sort&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# here the "transfo" key is where the transformation
&lt;/span&gt;    &lt;span class="c1"&gt;# metadata have been stored in the node
&lt;/span&gt;    &lt;span class="n"&gt;query_to_run&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nodes&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;node_id&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;transfo&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;
    &lt;span class="n"&gt;ancestors&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ancestors&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;node_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# We exclude all the ancestors that don't have transfo
&lt;/span&gt;    &lt;span class="c1"&gt;# metadata, meaning all nodes that are not and
&lt;/span&gt;    &lt;span class="c1"&gt;# intermediary output table
&lt;/span&gt;    &lt;span class="n"&gt;ancestors&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;transfo&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nodes&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;ancestors&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;ancestor_table&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;ancestors&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;query_to_run&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;query_to_run&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;ancestor_table&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nodes&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;ancestor_table&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;transfo&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;tmp_table&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tmp_table&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;fetch_destination_schema&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query_to_run&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nodes&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;node_id&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;transfo&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;tmp_table&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tmp_table&lt;/span&gt;

&lt;span class="c1"&gt;# This will be the exact schema of the last transformation
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
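&lt;p&gt;As mentioned earlier, the inferred schema can then feed an Infrastructure as Code workflow. Here is a minimal sketch that serializes it to the JSON format accepted by the &lt;code&gt;schema&lt;/code&gt; attribute of the &lt;code&gt;google_bigquery_table&lt;/code&gt; Terraform resource (the field tuples are hypothetical sample data; real code would read &lt;code&gt;name&lt;/code&gt;, &lt;code&gt;field_type&lt;/code&gt; and &lt;code&gt;mode&lt;/code&gt; from the &lt;code&gt;SchemaField&lt;/code&gt; objects returned by &lt;code&gt;fetch_destination_schema&lt;/code&gt;):&lt;/p&gt;

```python
import json

# Hypothetical sample standing in for the inferred SchemaField list
inferred_schema = [
    ("hello", "STRING", "NULLABLE"),
    ("date", "DATE", "NULLABLE"),
]

# google_bigquery_table expects a JSON array of field objects
terraform_schema = json.dumps(
    [{"name": n, "type": t, "mode": m} for n, t, m in inferred_schema],
    indent=2,
)
print(terraform_schema)
```

The resulting string can be written to a file and referenced from Terraform, so the tables exist with the right schema before the first transformation runs.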



&lt;p&gt;And voilà! With this graph generation technique and a few BigQuery tricks, we were able to automatically infer a dependency graph of complex SQL transformations, as well as the exact final schema of all tables, without any tables being created and at zero cost!&lt;br&gt;
In the next blog post we will see how I've applied and improved this technique to build an experimental, Ops-oriented data orchestration tool!&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Part 5. Provision Azure resources with Terraform from GCP with token exchange</title>
      <dc:creator>Λ\: Clément Bosc</dc:creator>
      <pubDate>Tue, 14 Feb 2023 13:28:18 +0000</pubDate>
      <link>https://dev.to/stack-labs/multi-cloud-identity-federation-explained-part-5-provision-azure-resources-with-terraform-from-gcp-with-token-exchange-3833</link>
      <guid>https://dev.to/stack-labs/multi-cloud-identity-federation-explained-part-5-provision-azure-resources-with-terraform-from-gcp-with-token-exchange-3833</guid>
      <description>&lt;p&gt;A small bonus for a use case we had in my project : all our CI is on GCP, using Cloud Build and I wanted to create Azure Resource along with Google resources with Terraform by exchanging my Cloud Build service account identity for an equivalent Azure identity.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Favx5g7v3swb11eozd8ph.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Favx5g7v3swb11eozd8ph.png" alt="Provision Azure resources with Terraform from GCP with token exchange" width="800" height="234"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you don’t know what I am referring to with identity federation and multi-cloud token exchange, and to understand the prerequisites, make sure to catch up with the previous articles linked above.&lt;/p&gt;

&lt;h2&gt;
  
  
  Create Azure resources with Terraform from GCP
&lt;/h2&gt;

&lt;p&gt;Remember &lt;a href="https://dev.to/stack-labs/multi-cloud-identity-federation-explained-part-3-token-exchange-from-gcp-to-azure-4o8e"&gt;Part 3 of this series&lt;/a&gt;: we needed to exchange a Google identity token for an Azure access token, and we used a &lt;code&gt;curl&lt;/code&gt; request to the AAD Authorization Server. Let’s do the same with Terraform, using the &lt;a href="https://registry.terraform.io/providers/hashicorp/http/latest/docs" rel="noopener noreferrer"&gt;http provider&lt;/a&gt;!&lt;/p&gt;

&lt;p&gt;The http provider is a provider maintained by HashiCorp that cannot create resources; it can only make HTTP requests, in the form of Terraform Data Sources.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. &lt;strong&gt;Add Terraform providers&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;First let’s add the required providers: &lt;code&gt;google&lt;/code&gt; for creating GCP resources and getting the ID token, &lt;code&gt;http&lt;/code&gt; to exchange the token and &lt;code&gt;azurerm&lt;/code&gt; to create Azure resources.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;terraform&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="n"&gt;required_providers&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;google&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="n"&gt;source&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hashicorp/google&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
      &lt;span class="n"&gt;version&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;4.52.0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;http&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="n"&gt;source&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hashicorp/http&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
      &lt;span class="n"&gt;version&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;3.2.1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;azurerm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="n"&gt;source&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hashicorp/azurerm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
      &lt;span class="n"&gt;version&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;3.42.0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Simply configure the Google provider and use Application Default Credentials
&lt;/span&gt;&lt;span class="n"&gt;provider&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;google&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="n"&gt;project&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;google_project_id&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. &lt;strong&gt;Generate the Google ID token (JWT token)&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;google_service_account_id_token&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;oidc&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;# the GCP SA mapped to Azure App Registration
&lt;/span&gt;  &lt;span class="n"&gt;target_service_account&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;target_service_account&lt;/span&gt;
  &lt;span class="n"&gt;target_audience&lt;/span&gt;        &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;api://AzureADTokenExchange&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3. &lt;strong&gt;Call the Azure Authorization Server to Exchange the access token&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Here we first build the payload in a local variable using the same parameters described in &lt;a href="https://dev.to/stack-labs/multi-cloud-identity-federation-explained-part-3-token-exchange-from-gcp-to-azure-4o8e"&gt;Part 3 of this series&lt;/a&gt;. We finally query the Authorization Server with the &lt;code&gt;data "http" "azure_id_token"&lt;/code&gt; Data Source.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nb"&gt;locals&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="n"&gt;azure_id_token_request_body_obj&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;client_id&lt;/span&gt;             &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;client_id&lt;/span&gt;
    &lt;span class="n"&gt;scope&lt;/span&gt;                 &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.default&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;client_assertion_type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;urn:ietf:params:oauth:client-assertion-type:jwt-bearer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;client_assertion&lt;/span&gt;      &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;google_service_account_id_token&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;oidc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id_token&lt;/span&gt;
    &lt;span class="n"&gt;grant_type&lt;/span&gt;            &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;client_credentials&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="n"&gt;azure_id_token_request_body&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;amp;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;formatlist&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;%s=%s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;keys&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;local&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;azure_id_token_request_body_obj&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="nf"&gt;values&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;local&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;azure_id_token_request_body_obj&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;azure_id_token&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="n"&gt;url&lt;/span&gt;    &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;${var.aad_authority}${var.azure_tenant_id}/oauth2/v2.0/token&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
  &lt;span class="n"&gt;method&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GET&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

  &lt;span class="n"&gt;request_headers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;Content&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;Type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;application/x-www-form-urlencoded&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="n"&gt;request_body&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;local&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;azure_id_token_request_body&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
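&lt;p&gt;For reference, the body built with &lt;code&gt;join&lt;/code&gt;/&lt;code&gt;formatlist&lt;/code&gt; above is just a form-encoded string. The equivalent in Python looks like this (all values are placeholders; note that &lt;code&gt;urlencode&lt;/code&gt; also percent-encodes the values, which the Terraform version does not):&lt;/p&gt;

```python
from urllib.parse import urlencode

# Placeholder values standing in for the Terraform variables
params = {
    "client_id": "00000000-0000-0000-0000-000000000000",
    "scope": ".default",
    "client_assertion_type": "urn:ietf:params:oauth:client-assertion-type:jwt-bearer",
    "client_assertion": "eyJhbGciOi...",  # the Google ID token goes here
    "grant_type": "client_credentials",
}

# Produces "client_id=...&scope=...&..." ready for the token endpoint
request_body = urlencode(params)
print(request_body)
```

Since the Google ID token (a JWT) only contains URL-safe characters, both encodings produce a body the token endpoint accepts.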



&lt;h3&gt;
  
  
  4. &lt;strong&gt;Configure Azure provider and create resources!&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;You can now configure the &lt;code&gt;azurerm&lt;/code&gt; provider by setting &lt;code&gt;oidc_token&lt;/code&gt; to the resulting exchanged access_token. You are ready to create Azure resources alongside your GCP resources! Make sure to grant the correct IAM role to the target App Registration, depending on which resources you want to create.&lt;/p&gt;

&lt;p&gt;Here we create a resource_group and an Azure Storage Account 💾&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;provider &lt;span class="s2"&gt;"azurerm"&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    features &lt;span class="o"&gt;{}&lt;/span&gt;

    oidc_token &lt;span class="o"&gt;=&lt;/span&gt; jsondecode&lt;span class="o"&gt;(&lt;/span&gt;data.http.azure_id_token.body&lt;span class="o"&gt;)&lt;/span&gt;.access_token
&lt;span class="o"&gt;}&lt;/span&gt;

resource &lt;span class="s2"&gt;"azurerm_resource_group"&lt;/span&gt; &lt;span class="s2"&gt;"example"&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
  name     &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"example-resources"&lt;/span&gt;
  location &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"West Europe"&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;

resource &lt;span class="s2"&gt;"azurerm_storage_account"&lt;/span&gt; &lt;span class="s2"&gt;"example"&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
  name                     &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"storageaccountname"&lt;/span&gt;
  resource_group_name      &lt;span class="o"&gt;=&lt;/span&gt; azurerm_resource_group.example.name
  location                 &lt;span class="o"&gt;=&lt;/span&gt; azurerm_resource_group.example.location
  account_tier             &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Standard"&lt;/span&gt;
  account_replication_type &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"GRS"&lt;/span&gt;

  tags &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    environment &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"staging"&lt;/span&gt;
  &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the end of the multi-cloud service-to-service identity federation series.&lt;/p&gt;

&lt;p&gt;After this series of articles you know how to set up identity federation between GCP and Azure in a secure way. We saw what access tokens and ID tokens are and how they are used by cloud providers. We saw the steps to exchange a Google ID token for an Azure access token, and how to impersonate a GCP service account from an Azure App Registration using Workload Identity Federation. Finally, with &lt;a href="https://dev.to/stack-labs/multi-cloud-identity-federation-explained-part-4-implement-token-exchange-between-azure-and-gcp-in-python-1gop"&gt;Part 4&lt;/a&gt; and Part 5 we detailed concrete implementations in Python and Terraform for your production applications.&lt;/p&gt;

&lt;p&gt;If you keep exposing service account keys or secrets after this, you have no excuse!&lt;/p&gt;

&lt;p&gt;Thanks for reading! I'm Clement, Data Engineer at &lt;a href="https://stack-labs.com/" rel="noopener noreferrer"&gt;Stack Labs&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If you want to discover the &lt;a href="https://cloud.stack-labs.com/cloud-data-platform" rel="noopener noreferrer"&gt;Stack Labs Data Platform&lt;/a&gt; or join an enthusiast &lt;a href="https://www.welcometothejungle.com/fr/companies/stack-labs" rel="noopener noreferrer"&gt;data engineering team&lt;/a&gt; and work on awesome technical subjects, please contact us.&lt;/p&gt;

</description>
      <category>gratitude</category>
    </item>
    <item>
      <title>Part 4. Implement token exchange between Azure and GCP in Python</title>
      <dc:creator>Λ\: Clément Bosc</dc:creator>
      <pubDate>Tue, 14 Feb 2023 13:24:56 +0000</pubDate>
      <link>https://dev.to/stack-labs/multi-cloud-identity-federation-explained-part-4-implement-token-exchange-between-azure-and-gcp-in-python-1gop</link>
      <guid>https://dev.to/stack-labs/multi-cloud-identity-federation-explained-part-4-implement-token-exchange-between-azure-and-gcp-in-python-1gop</guid>
      <description>&lt;p&gt;In the previous three article of the multi-cloud identity federation series we discussed about access token, identity token, how to differentiate them and how to exchange service identity between Google Cloud and Azure without exposing your keys and secrets. If you don’t know want I am referring to, make sure to catch up with the links above.&lt;/p&gt;

&lt;p&gt;Now let’s see the Python implementation for your production applications: first from an Azure environment to impersonate a Google Cloud service account, then from GCP to impersonate an Azure App Registration.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. From an Azure environment: impersonate a GCP service account
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Generate the Azure access token
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;a. Use the Azure Identity library to generate an access token&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You can use this option if your application runs in a compute instance that has access to the &lt;a href="https://learn.microsoft.com/en-us/azure/virtual-machines/windows/instance-metadata-service?tabs=windows" rel="noopener noreferrer"&gt;Azure Instance Metadata Service (IMDS)&lt;/a&gt;. It’s the recommended method because you do not have to store secrets in the instance or in environment variables, but it’s not always available depending on your use case.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: the &lt;code&gt;azure.identity&lt;/code&gt; module is part of the &lt;a href="https://pypi.org/project/azure-identity/" rel="noopener noreferrer"&gt;azure-identity package&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;azure.identity&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DefaultAzureCredential&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;azure.identity&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AzureAuthorityHosts&lt;/span&gt;

&lt;span class="n"&gt;default_credential&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;DefaultAzureCredential&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;authority&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;AzureAuthorityHosts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AZURE_PUBLIC_CLOUD&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;azure_access_token&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;default_credential&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_token&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;scopes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;APPLICATION_ID&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.default&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
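&lt;p&gt;Before exchanging the Azure token, it can help to peek at its claims (audience, issuer) to debug your federation configuration. A minimal sketch, not from the original article, that decodes the JWT payload without verifying the signature (signature validation is the receiving side’s job):&lt;/p&gt;

```python
import base64
import json


def jwt_claims(token: str) -> dict:
    """Decode the payload of a JWT without verifying its signature (debug only)."""
    payload = token.split(".")[1]
    # restore the base64 padding that the JWT encoding strips
    payload += "=" * (-len(payload) % 4)
    return json.loads(base64.urlsafe_b64decode(payload))


# e.g. jwt_claims(azure_access_token)["aud"] should match the audience
# configured on your GCP Workload Identity provider
```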



&lt;p&gt;&lt;strong&gt;b. Use the MSAL library with your client_id and client_secret&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you do not have access to IMDS, you can still expose the CLIENT_SECRET and CLIENT_ID (App ID) as environment variables of your application or, preferably, store and retrieve them from a Key Vault. You can then use the &lt;a href="https://learn.microsoft.com/en-us/azure/active-directory/develop/msal-overview" rel="noopener noreferrer"&gt;MSAL&lt;/a&gt; &lt;code&gt;ConfidentialClientApplication&lt;/code&gt; to get your App Registration access token.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;msal&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ConfidentialClientApplication&lt;/span&gt;

&lt;span class="n"&gt;CLIENT_SECRET&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CLIENT_SECRET&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;TENANT_ID&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;TENANT_ID&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;APPLICATION_ID&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;APPLICATION_ID&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ConfidentialClientApplication&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;client_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;APPLICATION_ID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;client_credential&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;CLIENT_SECRET&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;authority&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;AzureAuthorityHosts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AZURE_PUBLIC_CLOUD&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;TENANT_ID&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;azure_access_token&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;acquire_token_for_client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;scopes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;APPLICATION_ID&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/.default&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;access_token&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
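&lt;p&gt;Note that &lt;code&gt;acquire_token_for_client&lt;/code&gt; returns a dict, and indexing &lt;code&gt;["access_token"]&lt;/code&gt; directly raises a bare &lt;code&gt;KeyError&lt;/code&gt; when the request fails. A small helper (the function name is mine, not from the article) can surface the AAD error message instead:&lt;/p&gt;

```python
def extract_access_token(result: dict) -> str:
    """Return the access token from an MSAL result dict, or raise with the AAD error."""
    if "access_token" in result:
        return result["access_token"]
    # failed MSAL results carry "error" and "error_description" keys
    raise RuntimeError(
        f"token request failed: {result.get('error')}: {result.get('error_description')}"
    )


# usage sketch:
# azure_access_token = extract_access_token(app.acquire_token_for_client(scopes=[...]))
```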



&lt;h3&gt;
  
  
  Use Google’s STS client to get a federated token via Workload Identity Federation
&lt;/h3&gt;

&lt;p&gt;The second step of the token exchange process is to request a short-lived token from the Google STS API. Make sure you understand the parameters detailed in &lt;a href="https://dev.to/stack-labs/multi-cloud-identity-federation-explained-part-2-token-exchange-from-azure-to-gcp-5a1f"&gt;article Part 2&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.oauth2.sts&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.auth.transport.requests&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Request&lt;/span&gt;

&lt;span class="n"&gt;GCP_PROJECT_NUMBER&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PROJECT_NUMER&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;GCP_PROJECT_ID&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GCP_PROJECT_ID&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;POOL_ID&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;POOL_ID&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;PROVIDER_ID&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PROVIDER_ID&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;sts_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;token_exchange_endpoint&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://sts.googleapis.com/v1/token&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sts_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exchange_token&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="n"&gt;audience&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;//iam.googleapis.com/projects/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;GCP_PROJECT_NUMBER&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/locations/global/workloadIdentityPools/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;POOL_ID&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/providers/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;PROVIDER_ID&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;grant_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;urn:ietf:params:oauth:grant-type:token-exchange&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;subject_token&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;azure_access_token&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;scopes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://www.googleapis.com/auth/cloud-platform&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;subject_token_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;urn:ietf:params:oauth:token-type:jwt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;requested_token_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;urn:ietf:params:oauth:token-type:access_token&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;sts_access_token&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;access_token&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
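&lt;p&gt;The &lt;code&gt;audience&lt;/code&gt; parameter is the most error-prone part of the STS call: it must be the full resource name of your Workload Identity Pool provider, prefixed with &lt;code&gt;//iam.googleapis.com&lt;/code&gt; and without an &lt;code&gt;https:&lt;/code&gt; scheme. A small sketch (the helper name is mine, not from the article):&lt;/p&gt;

```python
def sts_audience(project_number: str, pool_id: str, provider_id: str) -> str:
    """Build the audience string expected by Google STS for a Workload Identity provider."""
    return (
        f"//iam.googleapis.com/projects/{project_number}"
        f"/locations/global/workloadIdentityPools/{pool_id}"
        f"/providers/{provider_id}"
    )


print(sts_audience("123456789", "azure-pool", "azure-provider"))
# -> //iam.googleapis.com/projects/123456789/locations/global/workloadIdentityPools/azure-pool/providers/azure-provider
```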



&lt;h3&gt;
  
  
  Impersonate the target service account with STS token
&lt;/h3&gt;

&lt;p&gt;Once you have your STS token (a federated token), you can finally impersonate the target service account (assuming you granted the correct role to your &lt;a href="https://dev.to/stack-labs/multi-cloud-identity-federation-explained-part-2-token-exchange-from-azure-to-gcp-5a1f#4-a-new-gcp-principal-the-principalset"&gt;Workload Identity PrincipalSet&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Create the target credential object&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.oauth2.credentials&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Credentials&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.auth&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;impersonated_credentials&lt;/span&gt;

&lt;span class="n"&gt;TARGET_SERVICE_ACCOUNT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;TARGET_SERVICE_ACCOUNT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;sts_credentials&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Credentials&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;sts_access_token&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;credentials&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;impersonated_credentials&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Credentials&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;source_credentials&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;sts_credentials&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;target_principal&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;TARGET_SERVICE_ACCOUNT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;target_scopes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://www.googleapis.com/auth/cloud-platform&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="n"&gt;lifetime&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;credentials&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;refresh&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Call your Google API (here BigQuery) from Azure environment&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Now that the token exchange process is complete, you can call any API the target service account has access to, using the corresponding client library (here, BigQuery).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.cloud&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;bigquery&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bigquery&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;credentials&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;credentials&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;project&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;GCP_PROJECT_ID&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Here my TARGET_SERVICE_ACCOUNT has bigquery.jobUser role.
&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SELECT CURRENT_DATE() as date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;query_job&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Make an API request.
&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The query data:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;query_job&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="c1"&gt;# It works !
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  2. From Google Cloud: impersonate an Azure App
&lt;/h2&gt;

&lt;p&gt;Let’s see the Python implementation from the other perspective: impersonating an Azure App from a GCP environment. This process is detailed in &lt;a href="https://dev.to/stack-labs/multi-cloud-identity-federation-explained-part-3-token-exchange-from-gcp-to-azure-4o8e"&gt;Part 3 of the series&lt;/a&gt;; make sure to read it to understand the process.&lt;/p&gt;

&lt;h3&gt;
  
  
  Implement a Python credential class from TokenCredential
&lt;/h3&gt;

&lt;p&gt;Most Microsoft client libraries can take a credential instance as an argument. Even though most of the time it’s a &lt;code&gt;DefaultAzureCredential&lt;/code&gt; or a &lt;code&gt;ConfidentialClientApplication&lt;/code&gt;, you can create your own by implementing the &lt;code&gt;TokenCredential&lt;/code&gt; interface. The class must implement the &lt;code&gt;get_token&lt;/code&gt; method, which the client library calls when authenticating.&lt;/p&gt;

&lt;p&gt;Here we first perform the Google ID token generation by querying the Google Metadata Server, then we use the &lt;code&gt;ConfidentialClientApplication&lt;/code&gt; with the ID token as &lt;code&gt;client_assertion&lt;/code&gt; to get the federated token.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;azure.core.credentials&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;TokenCredential&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;AccessToken&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;msal&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ConfidentialClientApplication&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.auth.transport.requests&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Request&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;GoogleAssertionCredential&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;TokenCredential&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;azure_client_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;azure_tenant_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;azure_authority_host&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# create a confidential client application
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ConfidentialClientApplication&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;azure_client_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;client_credential&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;client_assertion&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_get_google_id_token&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="n"&gt;authority&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;azure_authority_host&lt;/span&gt;&lt;span class="si"&gt;}{&lt;/span&gt;&lt;span class="n"&gt;azure_tenant_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_get_google_id_token&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Request an ID token to the Metadata Server&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;()(&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;GOOGLE_METADATA_API&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/instance/service-accounts/default/identity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;?audience=api://AzureADTokenExchange&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;method&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GET&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Metadata-Flavor&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Google&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
          &lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_token&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;scopes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;claims&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;tenant_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;AccessToken&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# get the token using the application
&lt;/span&gt;        &lt;span class="n"&gt;token&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;acquire_token_for_client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scopes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;error&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;error_description&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="n"&gt;expires_on&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;expires_in&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="c1"&gt;# return an access token with the token string and expiration time
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;AccessToken&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;access_token&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;expires_on&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: token generation with the Metadata Server only works for an app deployed on GCP. To test locally, you can use a service account key file instead and generate the ID token like this (e.g. in the body of &lt;code&gt;_get_google_id_token&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;credentials&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;IDTokenCredentials&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_service_account_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;GOOGLE_APPLICATION_CREDENTIALS&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;target_audience&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;api://AzureADTokenExchange&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;credentials&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;refresh&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;credentials&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Instantiate the GoogleAssertionCredential and call the final Azure API
&lt;/h3&gt;

&lt;p&gt;Finally, you can call any API the Azure App registration has access to. Just instantiate the &lt;code&gt;GoogleAssertionCredential&lt;/code&gt; with your target Azure App’s CLIENT_ID and TENANT_ID, and pass it to the client library (here &lt;code&gt;BlobServiceClient&lt;/code&gt;, assuming the App registration has the Contributor role on the Azure Storage Account).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;CLIENT_ID&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CLIENT_ID&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;TENANT_ID&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;TENANT_ID&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;creds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;GoogleAssertionCredential&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;azure_client_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;CLIENT_ID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;azure_tenant_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;TENANT_ID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;azure_authority_host&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;AzureAuthorityHosts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AZURE_PUBLIC_CLOUD&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;STORAGE_ACCOUNT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;STORAGE_ACCOUNT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;CONTAINER&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CONTAINER&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="c1"&gt;# Here the App registration is Contributor of the Azure storage account
&lt;/span&gt;&lt;span class="n"&gt;blob_service_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BlobServiceClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;STORAGE_ACCOUNT&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.blob.core.windows.net&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;credential&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;creds&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;container_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;blob_service_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_container_client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;container&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;CONTAINER&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;blob&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;container_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;list_blob_names&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;blob&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="c1"&gt;# It works !
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We just saw how to concretely impersonate service identities between Google Cloud and Azure in production with Python. Keep in mind these good practices:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Avoid storing secrets when you do not need to: both clouds provide a Metadata Server&lt;/li&gt;
&lt;li&gt;Use the correct audience or scope for just what you need to do, so if the token leaks, a thief can only use it against the target service, and only until the token expires (less than one hour)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We will see in the next and final part of this multi-cloud series how to exchange tokens using Terraform to create Azure resources from Google Cloud Build.&lt;/p&gt;

</description>
      <category>googlecloud</category>
      <category>azure</category>
      <category>multicloud</category>
      <category>python</category>
    </item>
    <item>
      <title>Part 3. Token exchange from GCP to Azure</title>
      <dc:creator>Λ\: Clément Bosc</dc:creator>
      <pubDate>Tue, 14 Feb 2023 13:22:32 +0000</pubDate>
      <link>https://dev.to/stack-labs/multi-cloud-identity-federation-explained-part-3-token-exchange-from-gcp-to-azure-4o8e</link>
      <guid>https://dev.to/stack-labs/multi-cloud-identity-federation-explained-part-3-token-exchange-from-gcp-to-azure-4o8e</guid>
      <description>&lt;p&gt;In the previous article of this multi-cloud identity federation series, we saw how to securely exchange an Azure access token (in the form of a JWT) for a GCP access token, to access private GCP resources from Azure. If you are wondering how to do the reverse operation, you are in the right spot!&lt;/p&gt;

&lt;h2&gt;
  
  
  The big picture
&lt;/h2&gt;

&lt;p&gt;To request Azure APIs from a GCP environment, we need the same two objects as before: a GCP service account and an Azure App Registration. The process is straightforward: since there is no Workload Identity Federation-like product on Azure, everything happens in the App registration configuration:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Generate a GCP ID token for the source service account, either via the Metadata Server (the recommended way for production applications) or via the CLI or IAM REST API (which requires impersonation permissions on the service account)&lt;/li&gt;
&lt;li&gt;Ask the Azure OAuth2 Authorization server to exchange the token for an Azure access token representing the target App registration.&lt;/li&gt;
&lt;li&gt;Enjoy your APIs requests 🙂&lt;/li&gt;
&lt;/ul&gt;
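
&lt;p&gt;As an illustrative sketch in Python (the helper names below are ours, not from any SDK), the first two steps boil down to one Metadata Server GET and one token-endpoint form; the endpoints and parameter names mirror the curl examples later in this article:&lt;/p&gt;

```python
import urllib.parse
import urllib.request

# Metadata Server endpoint that serves ID tokens for the attached service account
METADATA_ID_TOKEN_URL = (
    "http://metadata/computeMetadata/v1/instance/"
    "service-accounts/default/identity"
)


def metadata_id_token_request(audience: str) -> urllib.request.Request:
    """Step 1: build the Metadata Server request for a GCP ID token."""
    query = urllib.parse.urlencode({"audience": audience})
    return urllib.request.Request(
        f"{METADATA_ID_TOKEN_URL}?{query}",
        headers={"Metadata-Flavor": "Google"},
    )


def azure_token_exchange_form(client_id: str, scope: str, gcp_id_token: str) -> dict:
    """Step 2: form fields for the Azure OAuth2 token endpoint."""
    return {
        "client_id": client_id,
        "scope": scope,
        "client_assertion_type": "urn:ietf:params:oauth:client-assertion-type:jwt-bearer",
        "client_assertion": gcp_id_token,
        "grant_type": "client_credentials",
    }
```

&lt;p&gt;Opening the first request with &lt;code&gt;urllib.request.urlopen&lt;/code&gt; only works inside a GCP compute context, and the form is meant to be POSTed to your tenant's &lt;code&gt;/oauth2/v2.0/token&lt;/code&gt; endpoint.&lt;/p&gt;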

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyl1ljldu2zbhlvbqgsuq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyl1ljldu2zbhlvbqgsuq.png" alt="Token exchange between Google Cloud and Azure" width="800" height="133"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Generate a GCP ID token
&lt;/h2&gt;

&lt;p&gt;First you need to generate an ID token on behalf of the source service account. Why an ID token and not an access token? Because access tokens on GCP are opaque and cannot be decoded! We need a JWT here: Azure must be able to inspect the content of the token to map it to the App registration on the other side and to check that the issuer is Google.&lt;/p&gt;

&lt;p&gt;For this step you can either use the Metadata Server if your workload is running in a GCP compute context (the recommended way for production applications), or use the &lt;code&gt;gcloud&lt;/code&gt; CLI (or any other method available). The result will be a valid identity token. Note that the audience must match the audience you will configure in the next step. The recommended value (according to Azure) is &lt;code&gt;api://AzureADTokenExchange&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;a. Using the magic Metadata Server URL&lt;/strong&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Metadata-Flavor: Google"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s1"&gt;'http://metadata/computeMetadata/v1/instance/service-accounts/default/identity?audience=api://AzureADTokenExchange'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;b. Using the gcloud CLI (for testing purposes only)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud auth print-identity-token &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--impersonate-service-account&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;SOURCE_SERVICE_ACCOUNT_EMAIL &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--audiences&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;api://AzureADTokenExchange
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  2. Create a federated credential in your Azure App Registration
&lt;/h2&gt;

&lt;p&gt;Secondly, configure your target Azure App Registration to allow impersonation from the source GCP service account. This operation happens in the &lt;em&gt;Federated Credential&lt;/em&gt; section of your App registration in Azure Active Directory. You will need to specify the trusted issuer, the subject and the audience.&lt;/p&gt;

&lt;p&gt;How do you get these values? By decoding the GCP identity token, of course! As usual, you can go to &lt;a href="https://jwt.io/" rel="noopener noreferrer"&gt;https://jwt.io/&lt;/a&gt; and inspect the content of your token’s payload:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="s2"&gt;"aud"&lt;/span&gt;: &lt;span class="s2"&gt;"api://AzureADTokenExchange"&lt;/span&gt;,
  &lt;span class="s2"&gt;"azp"&lt;/span&gt;: &lt;span class="s2"&gt;"106697322240068434726"&lt;/span&gt;,
  &lt;span class="s2"&gt;"exp"&lt;/span&gt;: 1676301170,
  &lt;span class="s2"&gt;"iat"&lt;/span&gt;: 1676297570,
  &lt;span class="s2"&gt;"iss"&lt;/span&gt;: &lt;span class="s2"&gt;"https://accounts.google.com"&lt;/span&gt;,
  &lt;span class="s2"&gt;"sub"&lt;/span&gt;: &lt;span class="s2"&gt;"106697322240068434726"&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;iss&lt;/code&gt; = &lt;a href="https://accounts.google.com" rel="noopener noreferrer"&gt;https://accounts.google.com&lt;/a&gt; is the issuer of the token (Google)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;sub&lt;/code&gt; = subject is the source service account ID. You can also find this info in the GCP console, on the service account page (Unique ID)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;aud&lt;/code&gt; is the default audience value, defined in the first step.&lt;/li&gt;
&lt;/ul&gt;
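
&lt;p&gt;If you prefer not to paste a token into a website, you can decode it locally: a JWT payload is just base64url-encoded JSON. Here is a minimal sketch (our own helper, not part of any SDK) that performs no signature verification:&lt;/p&gt;

```python
import base64
import json


def jwt_claims(token: str) -> dict:
    """Decode the payload segment of a JWT without verifying the signature."""
    payload_b64 = token.split(".")[1]
    # base64url data must be padded to a multiple of 4 characters
    payload_b64 += "=" * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(payload_b64))
```

&lt;p&gt;The &lt;code&gt;iss&lt;/code&gt;, &lt;code&gt;sub&lt;/code&gt; and &lt;code&gt;aud&lt;/code&gt; claims of the decoded payload are exactly the three values the federated credential needs.&lt;/p&gt;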

&lt;p&gt;From this information you can create your federated credential settings:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;az ad app federated-credential create &lt;span class="nt"&gt;--id&lt;/span&gt; APPLICATION_ID &lt;span class="nt"&gt;--parameters&lt;/span&gt; credential.json
&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"credential.json"&lt;/span&gt; contains the following content&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="s2"&gt;"name"&lt;/span&gt;: &lt;span class="s2"&gt;"GcpFederation"&lt;/span&gt;,
    &lt;span class="s2"&gt;"issuer"&lt;/span&gt;: &lt;span class="s2"&gt;"https://accounts.google.com"&lt;/span&gt;,
    &lt;span class="s2"&gt;"subject"&lt;/span&gt;: &lt;span class="s2"&gt;"106697322240068434726"&lt;/span&gt;,
    &lt;span class="s2"&gt;"description"&lt;/span&gt;: &lt;span class="s2"&gt;"Test GCP federation"&lt;/span&gt;,
    &lt;span class="s2"&gt;"audiences"&lt;/span&gt;: &lt;span class="o"&gt;[&lt;/span&gt;
        &lt;span class="s2"&gt;"api://AzureADTokenExchange"&lt;/span&gt;
    &lt;span class="o"&gt;]&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  3. Exchange your GCP ID token for an Azure access token
&lt;/h2&gt;

&lt;p&gt;Last but not least, exchange your GCP ID token for an Azure access token so you can act on the Azure side: make a request to the Azure OAuth2 Authorization Server with the following parameters:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;client_id&lt;/code&gt; set to your App registration ID,&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;scope&lt;/code&gt; set to the desired scope, depending on the future usage of your token,&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;client_assertion_type&lt;/code&gt; fixed to &lt;code&gt;urn:ietf:params:oauth:client-assertion-type:jwt-bearer&lt;/code&gt; for this operation,&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;grant_type&lt;/code&gt; set to &lt;code&gt;client_credentials&lt;/code&gt;,
&lt;/li&gt;
&lt;li&gt;and, most importantly, your GCP ID token under &lt;code&gt;client_assertion&lt;/code&gt;.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl GET &lt;span class="s1"&gt;'https://login.partner.microsoftonline.cn/TENANT_ID/oauth2/v2.0/token'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--header&lt;/span&gt; &lt;span class="s1"&gt;'Content-Type: application/x-www-form-urlencoded'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--data-urlencode&lt;/span&gt; &lt;span class="s1"&gt;'client_id=APP_ID'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--data-urlencode&lt;/span&gt; &lt;span class="s1"&gt;'scope=https://storage.azure.com/.default'&lt;/span&gt; &lt;span class="se"&gt;\ &lt;/span&gt;&lt;span class="c"&gt;# or whatever other scope you might want&lt;/span&gt;
&lt;span class="nt"&gt;--data-urlencode&lt;/span&gt; &lt;span class="s1"&gt;'client_assertion_type=urn:ietf:params:oauth:client-assertion-type:jwt-bearer'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--data-urlencode&lt;/span&gt; &lt;span class="s1"&gt;'client_assertion=GCP_ID_TOKEN'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--data-urlencode&lt;/span&gt; &lt;span class="s1"&gt;'grant_type=client_credentials'&lt;/span&gt;

&lt;span class="c"&gt;# Reponse&lt;/span&gt;

&lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="s2"&gt;"token_type"&lt;/span&gt;: &lt;span class="s2"&gt;"Bearer"&lt;/span&gt;,
    &lt;span class="s2"&gt;"expires_in"&lt;/span&gt;: 3599,
    &lt;span class="s2"&gt;"ext_expires_in"&lt;/span&gt;: 3599,
    &lt;span class="s2"&gt;"access_token"&lt;/span&gt;: &lt;span class="s2"&gt;"eyJ0eXAiO********"&lt;/span&gt; &lt;span class="c"&gt;# JWT token&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once Azure decodes the GCP token, if the audience, issuer and subject match, you are good to go with a brand new Azure access token! You can now access the APIs that match the scope you specified (here the Azure Storage API):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl GET &lt;span class="s1"&gt;'https://STORAGE_ACCOUNT_NAME.blob.core.windows.net/CONTAINER'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--header&lt;/span&gt; &lt;span class="s1"&gt;'x-ms-version: 2020-04-08'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--header&lt;/span&gt; &lt;span class="s1"&gt;'Authorization: Bearer AZURE_TOKEN'&lt;/span&gt;

&lt;span class="c"&gt;# Response 200 OK&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No need to store a client_id and client_secret in GCP and risk a security breach! Just use Azure Active Directory federated credentials!&lt;/p&gt;

&lt;p&gt;In the next two articles we will see concrete implementations in Python and Terraform for your production applications.&lt;/p&gt;

</description>
      <category>googlecloud</category>
      <category>azure</category>
      <category>multicloud</category>
      <category>security</category>
    </item>
    <item>
      <title>Part 2. Token exchange from Azure to GCP</title>
      <dc:creator>Λ\: Clément Bosc</dc:creator>
      <pubDate>Tue, 14 Feb 2023 13:09:19 +0000</pubDate>
      <link>https://dev.to/stack-labs/multi-cloud-identity-federation-explained-part-2-token-exchange-from-azure-to-gcp-5a1f</link>
      <guid>https://dev.to/stack-labs/multi-cloud-identity-federation-explained-part-2-token-exchange-from-azure-to-gcp-5a1f</guid>
      <description>&lt;p&gt;In the previous article &lt;a href="https://dev.to/stack-labs/multi-cloud-identity-federation-explained-part-1-access-token-vs-id-token-l4"&gt;Part 1. Access token vs ID token&lt;/a&gt;, we saw why going multi-cloud is a security challenge and why we need a more sustainable solution than exporting and storing sensitive secrets. We also saw what an access token is, what an ID token is, and the difference between them. Keep this information in mind; we will need it in what follows!&lt;/p&gt;

&lt;p&gt;Now let’s see in detail the technical implementation for securely exchanging tokens from Azure to Google Cloud, so you can query Google APIs from Azure without having to generate a Google service account JSON key.&lt;/p&gt;

&lt;h2&gt;
  
  
  The big picture
&lt;/h2&gt;

&lt;p&gt;To request a service or API hosted on GCP, you need a GCP access token (or ID token if your service is Cloud Run). But all you have at this point is a token delivered by Azure, related to your Azure identity. That’s why you need to exchange it for a GCP token. &lt;/p&gt;

&lt;p&gt;To exchange an Azure access token for a Google access token you need to configure a GCP service called &lt;strong&gt;Workload Identity Federation&lt;/strong&gt;. This service allows you to configure external providers (Azure, AWS, GitLab, anything that uses OIDC and JWT tokens) and map entities from these providers to Service Accounts in GCP. This allows external entities to impersonate the GCP service account, that is to say, inherit all the permissions the service account has on the platform.&lt;/p&gt;

&lt;p&gt;The process goes in 3 steps: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Generate an Azure Active Directory (AAD) access token for an App registration (more on them below), either using the &lt;code&gt;client_id&lt;/code&gt; and &lt;code&gt;client_secret&lt;/code&gt; or via the Metadata Server.&lt;/li&gt;
&lt;li&gt;Exchange the Azure access token for a short-lived access token from Google’s Security Token Service API (STS).&lt;/li&gt;
&lt;li&gt;Exchange the STS access token for a Service Account’s access token and use this one to query Google APIs!&lt;/li&gt;
&lt;/ul&gt;
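
&lt;p&gt;The pivotal second step reduces to a single JSON document. As a sketch, a small Python helper (the function name is ours) can build the body that gets POSTed to &lt;code&gt;https://sts.googleapis.com/v1/token&lt;/code&gt;, matching the curl call shown later in this article:&lt;/p&gt;

```python
def sts_exchange_body(project_number: str, pool_id: str, provider_id: str,
                      azure_token: str) -> dict:
    """Build the body for Google's Security Token Service token exchange."""
    audience = (
        f"//iam.googleapis.com/projects/{project_number}"
        f"/locations/global/workloadIdentityPools/{pool_id}"
        f"/providers/{provider_id}"
    )
    return {
        # grantType, requestedTokenType and subjectTokenType are fixed by convention
        "grantType": "urn:ietf:params:oauth:grant-type:token-exchange",
        "audience": audience,
        "scope": "https://www.googleapis.com/auth/cloud-platform",
        "requestedTokenType": "urn:ietf:params:oauth:token-type:access_token",
        "subjectToken": azure_token,
        "subjectTokenType": "urn:ietf:params:oauth:token-type:jwt",
    }
```

&lt;p&gt;The only variable parts are the audience (your Workload Identity Provider resource name) and the Azure token itself.&lt;/p&gt;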

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu8l2kz2ycucmjsi18ysg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu8l2kz2ycucmjsi18ysg.png" alt="Token exchange between Azure and GCP" width="800" height="104"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Azure App registration creation and token generation
&lt;/h2&gt;

&lt;p&gt;In Azure world, the App registration is the identity of a service (or app). It’s kind of like a Service Account if you are coming from the Google Cloud world. You must first create an App registration in Azure Active Directory.&lt;/p&gt;

&lt;p&gt;In Azure boundaries you can generate an access token on behalf of the application, either via the Authorization Server with the &lt;code&gt;client_id&lt;/code&gt; and &lt;code&gt;client_secret&lt;/code&gt; or via the Metadata Server.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;a. Generate an access token with the Authorization Server, client_id and client_secret&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In my case my project is in Azure China so the AAD Authority (host) is &lt;code&gt;https://login.partner.microsoftonline.cn&lt;/code&gt;, but for you it’s probably &lt;a href="https://login.microsoftonline.com/" rel="noopener noreferrer"&gt;&lt;code&gt;https://login.microsoftonline.com/&lt;/code&gt;&lt;/a&gt; (Azure Global)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;--location&lt;/span&gt; &lt;span class="nt"&gt;--request&lt;/span&gt; POST &lt;span class="s1"&gt;'https://login.partner.microsoftonline.cn/TENANT_ID/oauth2/v2.0/token'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--header&lt;/span&gt; &lt;span class="s1"&gt;'Content-Type: application/x-www-form-urlencoded'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--data-urlencode&lt;/span&gt; &lt;span class="s1"&gt;'client_id=APPLICATION_ID'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--data-urlencode&lt;/span&gt; &lt;span class="s1"&gt;'grant_type=client_credentials'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--data-urlencode&lt;/span&gt; &lt;span class="s1"&gt;'client_secret=CLIENT_SECRET'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--data-urlencode&lt;/span&gt; &lt;span class="s1"&gt;'scope=.default'&lt;/span&gt; &lt;span class="c"&gt;# or anything else that you would like for your token&lt;/span&gt;

&lt;span class="c"&gt;# Response&lt;/span&gt;
&lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="s2"&gt;"token_type"&lt;/span&gt;: &lt;span class="s2"&gt;"Bearer"&lt;/span&gt;,
    &lt;span class="s2"&gt;"expires_in"&lt;/span&gt;: 3599,
    &lt;span class="s2"&gt;"ext_expires_in"&lt;/span&gt;: 3599,
    &lt;span class="s2"&gt;"access_token"&lt;/span&gt;: &lt;span class="s2"&gt;"eyJ0eXAiOiJK*********"&lt;/span&gt; &lt;span class="c"&gt;# AZURE_TOKEN&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;b. Generate an access token with the Azure Instance Metadata Service&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In the Cloud world there is a reserved magic IP, “&lt;a href="http://169.254.169.254/metadata/identity/oauth2/token?api-version=2018-02-01&amp;amp;resource=https://management.azure.com/" rel="noopener noreferrer"&gt;169.254.169.254&lt;/a&gt;”, which is used to fetch user or service information when your workload is running in the Cloud compute context: it’s the Metadata Server. You can request this service from inside an Azure VM to generate a token for the managed identity attached to the VM, without ever having the secret! 🪄&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl GET &lt;span class="s1"&gt;'http://169.254.169.254/metadata/identity/oauth2/token?api-version=2018-02-01&amp;amp;resource=AZURE_APP_ID/.default'&lt;/span&gt;
&lt;span class="nt"&gt;--header&lt;/span&gt; &lt;span class="s1"&gt;'Metadata: true'&lt;/span&gt;

&lt;span class="c"&gt;# Response &lt;/span&gt;
&lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="s2"&gt;"access_token"&lt;/span&gt;: &lt;span class="s2"&gt;"eyJ0eXA********"&lt;/span&gt;, &lt;span class="c"&gt;# AZURE_TOKEN&lt;/span&gt;
  &lt;span class="s2"&gt;"expires_in"&lt;/span&gt;: 3599,
  &lt;span class="s2"&gt;"token_type"&lt;/span&gt;: &lt;span class="s2"&gt;"Bearer"&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;More info on using the Azure Metadata Server to acquire an access token &lt;a href="https://learn.microsoft.com/en-us/azure/active-directory/managed-identities-azure-resources/how-to-use-vm-token" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;
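
&lt;p&gt;Since the Metadata Server response only carries a relative &lt;code&gt;expires_in&lt;/code&gt;, a small helper (our own, for illustration) can turn it into an absolute expiry, so your code knows when to request a fresh token instead of caching one forever:&lt;/p&gt;

```python
import json
from datetime import datetime, timedelta, timezone


def parse_imds_token(response_body: str) -> tuple:
    """Return (access_token, absolute UTC expiry) from an IMDS token response."""
    data = json.loads(response_body)
    expires_at = datetime.now(timezone.utc) + timedelta(seconds=int(data["expires_in"]))
    return data["access_token"], expires_at
```

&lt;p&gt;A production wrapper would simply re-call the Metadata Server once &lt;code&gt;datetime.now(timezone.utc)&lt;/code&gt; passes the returned expiry.&lt;/p&gt;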

&lt;h2&gt;
  
  
  2. Set up Workload Identity Federation: Pool and Azure Provider
&lt;/h2&gt;

&lt;p&gt;First, let’s create a Workload Identity Pool on GCP; you only need a name and an ID for this one. You can have many providers per pool, but a provider is limited to one tenant. So if you are in a multi-tenant Azure pattern, you might need a provider for each of them.&lt;/p&gt;

&lt;p&gt;To create the Azure provider, select the OpenID Connect (OIDC) type: you will need a name, an ID and an issuer.&lt;/p&gt;

&lt;p&gt;Let’s recall what we learned from the &lt;a href="https://dev.to/stack-labs/multi-cloud-identity-federation-explained-part-1-access-token-vs-id-token-l4"&gt;previous article&lt;/a&gt;: the issuer is the trusted entity that signs the original access token, and it can easily be retrieved by decoding the JWT. As usual, go to &lt;a href="http://jwt.io" rel="noopener noreferrer"&gt;jwt.io&lt;/a&gt; with your AZURE_ACCESS_TOKEN and find out the issuer.&lt;/p&gt;

&lt;p&gt;In my case it’s &lt;a href="https://sts.chinacloudapi.cn/TENANT_ID" rel="noopener noreferrer"&gt;https://sts.chinacloudapi.cn/TENANT_ID&lt;/a&gt; because my project is on Azure China, but for you it will probably look more like &lt;a href="https://sts.windows.net/TENANT_ID" rel="noopener noreferrer"&gt;https://sts.windows.net/TENANT_ID&lt;/a&gt; (Azure Global).&lt;/p&gt;

&lt;p&gt;You are then asked to set up the attribute mapping: this is used later to allow a subset of entities to impersonate the target GCP service account. You must at least set the &lt;code&gt;google.subject&lt;/code&gt; mapping; once again, look for our subject in the decoded JWT payload, in the &lt;code&gt;sub&lt;/code&gt; attribute. This is a unique ID for your Azure application. You can add other JWT mappings at your convenience.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud iam workload-identity-pools create POOL_ID &lt;span class="se"&gt;\ &lt;/span&gt;
    &lt;span class="nt"&gt;--location&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"global"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--display-name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;POOL_DISPLAY_NAME

gcloud iam workload-identity-pools providers create-oidc PROVIDER_ID &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--location&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"global"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--workload-identity-pool&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;POOL_ID &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--display-name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;PROVIDER_DISPLAY_NAME &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--issuer-uri&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"https://sts.chinacloudapi.cn/TENANT_ID"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--allowed-audiences&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;AZURE_APP_ID/.default
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  3. Exchange your Azure token for a GCP token with STS
&lt;/h2&gt;

&lt;p&gt;In this step, you request a short-lived access token from Google’s Security Token Service (STS) API in exchange for your Azure App registration access token. A simple curl call does the job, as shown below. Make sure to specify the previously created Workload Identity Provider as the audience. &lt;code&gt;grantType&lt;/code&gt;, &lt;code&gt;requestedTokenType&lt;/code&gt; and &lt;code&gt;subjectTokenType&lt;/code&gt; are fixed by convention.&lt;/p&gt;

&lt;p&gt;The result is an STS token, representing the &lt;code&gt;principalSet&lt;/code&gt; of your Workload Identity Pool.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl POST &lt;span class="s1"&gt;'https://sts.googleapis.com/v1/token'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--header&lt;/span&gt; &lt;span class="s1"&gt;'Content-Type: application/json'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--data-raw&lt;/span&gt; &lt;span class="s1"&gt;'{
  "grantType": "urn:ietf:params:oauth:grant-type:token-exchange",
  "audience": "//iam.googleapis.com/projects/PROJECT_NUMBER/locations/global/workloadIdentityPools/WORKLOAD_IDENTITY_POOL/providers/AZURE_PROVIDER",
  "scope": "https://www.googleapis.com/auth/cloud-platform",
  "requestedTokenType": "urn:ietf:params:oauth:token-type:access_token",
  "subjectToken": "AZURE_TOKEN",
  "subjectTokenType": "urn:ietf:params:oauth:token-type:jwt"
}'&lt;/span&gt;

&lt;span class="c"&gt;# Response&lt;/span&gt;

&lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="s2"&gt;"access_token"&lt;/span&gt;: &lt;span class="s2"&gt;"ya29.d.b0Aaekm1K9f******"&lt;/span&gt;, &lt;span class="c"&gt;# STS_ACCESS_TOKEN&lt;/span&gt;
    &lt;span class="s2"&gt;"issued_token_type"&lt;/span&gt;: &lt;span class="s2"&gt;"urn:ietf:params:oauth:token-type:access_token"&lt;/span&gt;,
    &lt;span class="s2"&gt;"token_type"&lt;/span&gt;: &lt;span class="s2"&gt;"Bearer"&lt;/span&gt;,
    &lt;span class="s2"&gt;"expires_in"&lt;/span&gt;: 3587
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  4. A new GCP principal: the principalSet
&lt;/h2&gt;

&lt;p&gt;Before jumping to the last exchange operation, let’s get back to fundamentals.&lt;/p&gt;

&lt;p&gt;In GCP, a principal is an entity that can be allowed, via IAM, to perform certain actions. If you have never used Workload Identity Federation you are probably convinced that there are only three kinds of principal: &lt;em&gt;user&lt;/em&gt;, &lt;em&gt;group&lt;/em&gt; &amp;amp; &lt;em&gt;serviceAccount&lt;/em&gt;. But with Workload Identity Federation, Google introduced a fourth: the &lt;em&gt;principalSet&lt;/em&gt;. The &lt;em&gt;principalSet&lt;/em&gt; is the principal identity for a pool, but it can only be used with the Workload Identity User role to impersonate a real Service Account. Moreover, the particularity of this principal is that the corresponding identity is dynamic, based on a pattern: you can apply filters based on the source JWT attributes that were previously mapped!&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;To limit the impersonation permission to a specific subject you can set the permission on &lt;code&gt;principal://iam.googleapis.com/projects/PROJECT_NUMBER/locations/global/workloadIdentityPools/POOL_ID/subject/SUBJECT_ATTRIBUTE_VALUE&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;But you can use any custom attribute by using &lt;code&gt;principalSet://iam.googleapis.com/projects/PROJECT_NUMBER/locations/global/workloadIdentityPools/POOL_ID/attribute.ATTRIBUTE_NAME/ATTRIBUTE_VALUE&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
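
&lt;p&gt;These two member formats are easy to get wrong by hand. As a sketch, two small helpers (our own, purely illustrative) build them from their parts:&lt;/p&gt;

```python
def subject_member(project_number: str, pool_id: str, subject: str) -> str:
    """IAM member targeting one external identity (note the principal:// prefix)."""
    return (
        f"principal://iam.googleapis.com/projects/{project_number}"
        f"/locations/global/workloadIdentityPools/{pool_id}/subject/{subject}"
    )


def attribute_member(project_number: str, pool_id: str,
                     attribute: str, value: str) -> str:
    """IAM member targeting every identity whose mapped attribute has a given value."""
    return (
        f"principalSet://iam.googleapis.com/projects/{project_number}"
        f"/locations/global/workloadIdentityPools/{pool_id}"
        f"/attribute.{attribute}/{value}"
    )
```

&lt;p&gt;Either string can then be passed as the &lt;code&gt;--member&lt;/code&gt; of an IAM policy binding.&lt;/p&gt;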

&lt;p&gt;The full list of possible patterns is available in the Workload Identity Federation documentation.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Give the Workload Identity principal access to the target service account and exchange the final token
&lt;/h2&gt;

&lt;p&gt;Now that you have your STS access token, you are nearly at the end!&lt;/p&gt;

&lt;p&gt;The Workload Identity principal must be authorized by GCP IAM to impersonate your final, target service account. To do so, grant the Workload Identity User role, at the service account level, to the principal represented by &lt;code&gt;principal://iam.googleapis.com/projects/PROJECT_NUMBER/locations/global/workloadIdentityPools/POOL_ID/subject/AZURE_APP_SUBJECT&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud iam service-accounts add-iam-policy-binding &lt;span class="se"&gt;\&lt;/span&gt;
TARGET_SERVICE_ACCOUNT_EMAIL &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--member&lt;/span&gt; principal://iam.googleapis.com/projects/PROJECT_NUMBER/locations/global/workloadIdentityPools/POOL_ID/subject/AZURE_APP_SUBJECT &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--role&lt;/span&gt; roles/iam.workloadIdentityUser
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Exchange STS for a final access token
&lt;/h3&gt;

&lt;p&gt;Here you are: you can now finally impersonate the target service account to access real Google Cloud APIs. Just pass the STS access token as the Bearer token of your HTTP request against the IAM access token generation endpoint (or the ID token endpoint, depending on your use case) 🙂&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;--location&lt;/span&gt; &lt;span class="nt"&gt;--request&lt;/span&gt; POST &lt;span class="s1"&gt;'https://iamcredentials.googleapis.com/v1/projects/-/serviceAccounts/TARGET_SERVICE_ACCOUNT_EMAIL:generateAccessToken'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--header&lt;/span&gt; &lt;span class="s1"&gt;'Authorization: Bearer STS_ACCESS_TOKEN'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--header&lt;/span&gt; &lt;span class="s1"&gt;'Content-Type: application/json'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--data-raw&lt;/span&gt; &lt;span class="s1"&gt;'{
    "scope": [
        "https://www.googleapis.com/auth/cloud-platform"
    ]
}'&lt;/span&gt;

&lt;span class="c"&gt;# Response&lt;/span&gt;

&lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="s2"&gt;"accessToken"&lt;/span&gt;: &lt;span class="s2"&gt;"ya29.c.b0Aaekm1Izvf********"&lt;/span&gt;, &lt;span class="c"&gt;# GCP_ACCESS_TOKEN&lt;/span&gt;
    &lt;span class="s2"&gt;"expireTime"&lt;/span&gt;: &lt;span class="s2"&gt;"2023-02-12T20:36:27Z"&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
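The same call can be assembled from code; this is a minimal Python sketch that only builds the URL, headers and body (the service account email and STS token are placeholders, and actually sending the request is left to any HTTP client):

```python
import json

def build_generate_access_token_request(sa_email: str, sts_token: str):
    """Assemble URL, headers and body for the generateAccessToken call."""
    url = ("https://iamcredentials.googleapis.com/v1/projects/-/"
           f"serviceAccounts/{sa_email}:generateAccessToken")
    headers = {
        "Authorization": f"Bearer {sts_token}",  # the STS token from the previous step
        "Content-Type": "application/json",
    }
    body = json.dumps({"scope": ["https://www.googleapis.com/auth/cloud-platform"]})
    return url, headers, body
```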



&lt;p&gt;The resulting access token can be used to do anything that the TARGET_SERVICE_ACCOUNT can do, for example running BigQuery queries 📊&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl POST &lt;span class="s1"&gt;'https://bigquery.googleapis.com/bigquery/v2/projects/PROJECT_ID/queries'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--header&lt;/span&gt; &lt;span class="s1"&gt;'Authorization: Bearer GCP_ACCESS_TOKEN'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--header&lt;/span&gt; &lt;span class="s1"&gt;'Content-Type: application/json'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--data-raw&lt;/span&gt; &lt;span class="s1"&gt;'{
  "query": "SELECT CURRENT_TIMESTAMP()",
  "useLegacySql": false,
  "location": "EU"
}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We have seen how to securely exchange an Azure App Registration access token by impersonating a GCP service account, and thus access Google APIs from another cloud :) &lt;/p&gt;

&lt;p&gt;Let’s see how to do the reverse operation in a future article!&lt;/p&gt;

</description>
      <category>softwaredevelopment</category>
      <category>productivity</category>
      <category>discuss</category>
    </item>
    <item>
      <title>Part 1. Access token vs ID token</title>
      <dc:creator>Λ\: Clément Bosc</dc:creator>
      <pubDate>Tue, 14 Feb 2023 13:08:19 +0000</pubDate>
      <link>https://dev.to/stack-labs/multi-cloud-identity-federation-explained-part-1-access-token-vs-id-token-l4</link>
      <guid>https://dev.to/stack-labs/multi-cloud-identity-federation-explained-part-1-access-token-vs-id-token-l4</guid>
<description>&lt;p&gt;Nowadays, multi-cloud is on everyone's lips in big companies. The motivations for going multi-cloud are various: technical flexibility (choosing the best tool for each usage), fault tolerance and redundancy, or reducing the risk of a single point of failure. But in any case, the &lt;strong&gt;problem of security&lt;/strong&gt; and of federating identities between services will come up pretty quickly.&lt;/p&gt;

&lt;h3&gt;
  
  
  How we usually share credentials between clouds and why it is bad
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F083dco7dvlm3blq6otel.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F083dco7dvlm3blq6otel.png" alt="Security risk secret storage Cloud" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The issue is that service identities (Service Account or Service Principal) are not shared between clouds. A common way to communicate from one cloud to the other is to create a secret for the service identity and store it on the other cloud. This way, the service account of Cloud A can access Cloud A’s APIs from Cloud B by generating a token from the secret, and vice versa. But the extradition of a secret (credential or key) is a security risk: if the secret falls into the wrong hands, the thief can perform requests as the service account and extract sensitive data or cause chaos. The solution to this issue might be Identity Federation, but first let’s discuss a bit of theory: how can a service account generate a token to prove its identity?&lt;/p&gt;

&lt;h3&gt;
  
  
  Access token vs ID token
&lt;/h3&gt;

&lt;p&gt;“&lt;em&gt;For authentication and authorization, a token is a digital object that contains information about the identity of the principal making the request and what kind of access they are authorized for.”&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In the Cloud world (and not only there), there are mainly two types of token, each with an associated protocol: access tokens and ID tokens.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Access Token&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Access tokens are opaque tokens that conform to the &lt;a href="https://datatracker.ietf.org/doc/html/rfc6749#section-1.4" rel="noopener noreferrer"&gt;OAuth 2.0 framework&lt;/a&gt;. They contain &lt;strong&gt;authorization&lt;/strong&gt; information (what the service can do), but not identity information. They are opaque because they are in a proprietary format: applications cannot inspect them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: In some cases, access tokens can be decoded (if they are JWTs, for example) but do not take this for granted.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ID Token&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;ID tokens are JSON Web Tokens (JWT) that conform to the &lt;a href="https://openid.net/specs/openid-connect-core-1_0.html" rel="noopener noreferrer"&gt;OpenID Connect (OIDC) specification&lt;/a&gt;. Unlike access tokens, ID tokens can be decoded and inspected by the application. That is why in a multi-cloud world, &lt;strong&gt;you will always exchange JWT tokens between Clouds.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A JWT is composed of three parts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;strong&gt;header&lt;/strong&gt;, with information about the algorithm and the token type.&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;payload&lt;/strong&gt;, with info about the &lt;code&gt;subject&lt;/code&gt; (service identity), the &lt;code&gt;issuer&lt;/code&gt; (trust authority which signed the token) and more identity info like &lt;code&gt;email&lt;/code&gt;, &lt;code&gt;first_name&lt;/code&gt;, &lt;code&gt;last_name&lt;/code&gt;…&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;signature&lt;/strong&gt;, generated by signing the base64 representations of the header and the payload with a secret (or private) key, using the algorithm specified in the header.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All 3 parts are joined with a dot “.” to form the final token.&lt;/p&gt;

&lt;p&gt;Example: &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fal5xmq0une5325yw3vcy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fal5xmq0une5325yw3vcy.png" alt="JWT tokens" width="800" height="322"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can decode JWT tokens manually with a base64decode utility or with websites like &lt;a href="https://jwt.io/" rel="noopener noreferrer"&gt;https://jwt.io/&lt;/a&gt;. Decoding the first two parts will give us the following objects:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;#&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;header&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"alg"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"RS256"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"kid"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"b49c5062d890f5ce449e890c88e8dd98c4fee0ab"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"typ"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"JWT"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="err"&gt;#&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;payload&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"aud"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"32555940559.apps.googleusercontent.com"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"azp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"103441981692756022942"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"exp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1675978466&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"iat"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1675974866&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"iss"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://accounts.google.com"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"sub"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"103441981692756022942"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
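The manual decoding step can also be done in a few lines of Python; the toy token below is fabricated for illustration (NOT a real Google-issued token):

```python
import base64
import json

def decode_jwt_unverified(token: str):
    """Decode a JWT's header and payload WITHOUT verifying the signature
    (fine for inspection, never for trusting the contents)."""
    header_b64, payload_b64, _signature = token.split(".")
    def _part(p: str):
        p += "=" * (-len(p) % 4)  # restore the base64 padding JWTs strip out
        return json.loads(base64.urlsafe_b64decode(p))
    return _part(header_b64), _part(payload_b64)

# Toy token fabricated for illustration:
toy = ".".join(
    base64.urlsafe_b64encode(json.dumps(seg).encode()).decode().rstrip("=")
    for seg in ({"alg": "RS256", "typ": "JWT"},
                {"iss": "https://accounts.google.com", "sub": "123"})
) + ".fake-signature"

header, payload = decode_jwt_unverified(toy)
print(payload["iss"])  # prints: https://accounts.google.com
```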



&lt;p&gt;Here the issuer (&lt;code&gt;iss&lt;/code&gt;) which signed the token is &lt;code&gt;https://accounts.google.com&lt;/code&gt; and the service account which is represented by the token is the subject (&lt;code&gt;sub&lt;/code&gt;) : &lt;code&gt;103441981692756022942&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Now that we have the basics: how can we exchange an identity token from one cloud for the identity of another cloud, through impersonation, while protecting against secret leaks? This process is called identity federation.&lt;/p&gt;

&lt;p&gt;Let’s see this in detail with two technical implementations, from Azure to Google Cloud Platform and vice versa, in the following articles…&lt;/p&gt;

</description>
      <category>googlecloud</category>
      <category>azure</category>
      <category>multicloud</category>
      <category>security</category>
    </item>
    <item>
      <title>[Feedback] GCP : Cross region Data transfer with BigQuery. Part 2 - Schema drift with the Google Analytics use case</title>
      <dc:creator>Λ\: Clément Bosc</dc:creator>
      <pubDate>Fri, 28 Jan 2022 12:07:51 +0000</pubDate>
      <link>https://dev.to/stack-labs/feedback-gcp-cross-region-data-transfer-with-bigquery-part-2-schema-drift-with-the-google-analytics-use-case-59k3</link>
      <guid>https://dev.to/stack-labs/feedback-gcp-cross-region-data-transfer-with-bigquery-part-2-schema-drift-with-the-google-analytics-use-case-59k3</guid>
      <description>&lt;p&gt;In a previous article, we detailed the process that we set up to transfer data incrementally and periodically between GCP regions in BigQuery. It’s a common problem when working in a global context where your data resides in locations all over the world. If you missed it, be sure to catch up here : &lt;a href="https://dev.to/stack-labs/feedback-gcp-cross-region-data-transfer-with-bigquery-part-1-workflow-and-dts-at-the-rescue-4o3i"&gt;[Feedback] GCP : Cross region Data transfer with BigQuery. Part 1 - Workflows and DTS at the rescue&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now that you have the architecture in mind, let's dive into a problem we had when working with Google Analytics (GA): the schema drift situation.&lt;/p&gt;

&lt;h2&gt;
  
  
  The use case : Google Analytics
&lt;/h2&gt;

&lt;p&gt;Let's add some context: the company I am working for owns websites. A lot of them. And like pretty much everybody else, they use Google Analytics to track audience, user behavior, acquisition, conversion and a bunch of other metrics. &lt;br&gt;
Additionally, they use the GA feature that automatically exports the full content of Google Analytics data into BigQuery. Each website (or &lt;em&gt;view&lt;/em&gt;, in GA jargon) is exported into a dedicated dataset (named after the viewId), and a table (a shard) is created each day with the data from the day before. As before, the data is split across different projects, located in remote GCP regions according to the country managing the website.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fekyyu5ikepb0pv4j0aw4.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fekyyu5ikepb0pv4j0aw4.jpg" alt="Google Analytics to BigQuery architecture" width="800" height="326"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The problem is that the configuration to export data into BigQuery is done and maintained manually for each website! Moreover, from time to time new websites are created and some of them stop publishing data (because they were closed, for instance). And Google Analytics, like every information system, is constantly evolving: adding dimensions, features and metrics (and so columns in the data model). But the system does not update the schema of all the previously created tables, and this leads to a large issue: schema drift.&lt;/p&gt;
&lt;h2&gt;
  
  
  What we needed to do
&lt;/h2&gt;

&lt;p&gt;The very big workflow that we built to transfer data across regions takes as argument a SQL query to read data from the source. But here, compared to the situation described in the previous blog post, the sharded tables are located across many datasets (see schema above). Our first reflex was to pass an input query like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="nv"&gt;`source-project-1.236548976.ga_sessions_*`&lt;/span&gt;
&lt;span class="k"&gt;UNION&lt;/span&gt; &lt;span class="k"&gt;ALL&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="nv"&gt;`source-project-1.987698768.ga_sessions_*`&lt;/span&gt;
&lt;span class="k"&gt;UNION&lt;/span&gt; &lt;span class="k"&gt;ALL&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt;
&lt;span class="c1"&gt;-- For each dataset in a given project&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But this was the beginning of our misery: the schemas of the tables in datasets 236548976 and 987698768 are not exactly the same (probably one of them was initiated later, with some changed fields). Easy, you would say: just specify the explicit list of fields in the &lt;code&gt;SELECT&lt;/code&gt; statement, replacing missing fields with something like &lt;code&gt;NULL as &amp;lt;alias&amp;gt;&lt;/code&gt;. Well, it’s not so simple, because: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The Google Analytics data model is One Big Table with over 320 columns, spread over four levels of nested columns, repeated fields, arrays of repeated fields, etc. The differences could be at any level, and differ from one dataset to another.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;We have hundreds of websites, and so hundreds of datasets.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A new dataset can be added at any time, and the solution had to automatically load the new data without further re-configuration.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We needed a way to automate all of this mess.&lt;/p&gt;

&lt;h3&gt;
  
  
  The first idea: a magical stored procedure 🧙🏼‍♂️
&lt;/h3&gt;

&lt;p&gt;The first, and I think the most logical, reflex that we had was to generate SQL queries. And what nicer way of doing this than with SQL? Given a projectId and a dataset, the procedure would have to generate something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
&lt;span class="n"&gt;channelGrouping&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;channelGrouping&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;clientId&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;clientId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;STRUCT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;browser&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="c1"&gt;-- this field does not exist in the current dataset&lt;/span&gt;
    &lt;span class="k"&gt;CAST&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;browserSize&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    
    &lt;span class="p"&gt;...&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;...&lt;/span&gt;
&lt;span class="n"&gt;ARRAY&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt;
        &lt;span class="n"&gt;STRUCT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ARRAY&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;STRUCT&lt;/span&gt;&lt;span class="p"&gt;(...)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;promotion&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="k"&gt;UNNEST&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;product&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;product&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="k"&gt;UNNEST&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hits&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;hits&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="nv"&gt;`source-project-1.236548976.ga_sessions_*`&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With, let's remember, a &lt;strong&gt;4-level depth of nested structures and more than 320 columns&lt;/strong&gt;. Fortunately, like most respectable databases, BigQuery exposes &lt;code&gt;INFORMATION_SCHEMA&lt;/code&gt; views with all the metadata that we needed. &lt;/p&gt;

&lt;p&gt;Not without some effort, it worked: with a recursive and generic approach, we succeeded in dynamically generating the massive SQL query. We only needed to call the procedure for each dataset to get the &lt;code&gt;SELECT&lt;/code&gt; statement and perform a &lt;code&gt;UNION ALL&lt;/code&gt;, and the problem of schema drift would have been solved!&lt;/p&gt;

&lt;p&gt;But this was not the right approach: we had fun building the magical stored procedure, but it was too slow, consumed too many resources, and the generated query was so large (once joined together with &lt;code&gt;UNION ALL&lt;/code&gt;) that it didn’t fit in a variable in our orchestrator 🤦‍♂️.&lt;/p&gt;

&lt;h3&gt;
  
  
  The real magic resides in simplicity
&lt;/h3&gt;

&lt;p&gt;Most of the time, the simpler the better. We actually realized quickly that BigQuery already had its own way of dealing with the schema drift problem: the &lt;u&gt;wildcard functionality&lt;/u&gt;, and we had been using it all along! &lt;/p&gt;

&lt;p&gt;In a sharded table structure, a table is actually split into many smaller tables with a suffix to differentiate them (most commonly a date). Conveniently, BigQuery lets you query all the tables sharing the same base name with the wildcard notation, just like we did to get all the data from a website: &lt;br&gt;
&lt;code&gt;SELECT * FROM `ga_sessions_*`&lt;/code&gt; processes all the data from tables that match the pattern &lt;code&gt;ga_sessions_*&lt;/code&gt;. And this works even if the schema has evolved since the first table! &lt;/p&gt;

&lt;p&gt;BigQuery automatically applies the schema of the last created table to the query result and fills the missing fields with NULL values. Sadly, doing the same over a batch of datasets is not possible (like &lt;code&gt;`project.*.ga_sessions_*`&lt;/code&gt;), so we got around the issue by doing the following: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;For each dataset in the source, create a large table that contains all the data from the &lt;code&gt;ga_sessions_*&lt;/code&gt; sharded table. This table is named with the datasetId as a suffix, in a buffer project dedicated to replication. In practice this table contains all the partitions created since the last transfer (so most of the time one partition, except in the init phase).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Create a "fake" empty table with the exact same schema as the destination table. This schema is our reference, from which the other tables might differ slightly (it’s the &lt;code&gt;ga_sessions_DONOTUSE&lt;/code&gt; table in the schema below).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Use the wildcard syntax again to append all the data into the final (partitioned) table: the suffix is not a date anymore but the source datasetId! As the fake &lt;em&gt;DONOTUSE&lt;/em&gt; table is always the last created, its schema is applied to all the other tables.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
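The steps above can be sketched as a small query generator; a minimal Python illustration with hypothetical project and dataset names (the real workflow logic is more involved):

```python
def staging_queries(source_project: str, dataset_ids: list[str],
                    buffer_project: str) -> list[str]:
    """Step 1: one CREATE TABLE AS SELECT per source dataset; the buffer table
    reuses the datasetId as its suffix so a single wildcard can cover them all."""
    return [
        f"CREATE OR REPLACE TABLE `{buffer_project}.replication.ga_sessions_{ds}` AS "
        f"SELECT * FROM `{source_project}.{ds}.ga_sessions_*`"
        for ds in dataset_ids
    ]

def merge_query(buffer_project: str, destination_table: str) -> str:
    """Step 3: append every buffer shard into the final partitioned table.
    The empty ga_sessions_DONOTUSE table, created last, imposes the reference schema."""
    return (
        f"INSERT INTO `{destination_table}` "
        f"SELECT * FROM `{buffer_project}.replication.ga_sessions_*` "
        "WHERE _TABLE_SUFFIX != 'DONOTUSE'"
    )
```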

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbpp910m9qnpjylh1zvt8.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbpp910m9qnpjylh1zvt8.jpg" alt="Schema drift solution with table in different datasets" width="800" height="488"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now, the next time Google updates the GA data model, we only have to apply the change to our final partitioned table, and the whole process will adapt and won’t fail, even if the new columns are not yet present in every source table at the same time. On the downside, with this process we might miss schema updates from the source if we aren't aware of new columns, but for now the current architecture fits our needs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;To be honest, I felt a bit ashamed not to have thought of the second solution sooner: it is so simple and much more maintainable than the first one! It works like a charm in production today, and we are transferring tens of gigabytes of data daily, coming from thousands of websites across all the regions of GCP, into a unique, massive partitioned table that is available to analysts.&lt;/p&gt;

</description>
      <category>bigquery</category>
      <category>data</category>
      <category>googlecloud</category>
      <category>analytics</category>
    </item>
    <item>
      <title>[Feedback] GCP : Cross region Data transfer with BigQuery. Part 1 - Workflows and DTS at the rescue</title>
      <dc:creator>Λ\: Clément Bosc</dc:creator>
      <pubDate>Mon, 24 Jan 2022 16:38:41 +0000</pubDate>
      <link>https://dev.to/stack-labs/feedback-gcp-cross-region-data-transfer-with-bigquery-part-1-workflow-and-dts-at-the-rescue-4o3i</link>
      <guid>https://dev.to/stack-labs/feedback-gcp-cross-region-data-transfer-with-bigquery-part-1-workflow-and-dts-at-the-rescue-4o3i</guid>
<description>&lt;p&gt;I have been working for this very large French cosmetic company for a few months now, and here is some feedback about a common problem we had with BigQuery when working in a global context: how to query data located all around the world?&lt;/p&gt;

&lt;p&gt;Here is the use case we had, and you might recognize yourself (or your company) in it: imagine that you have many subsidiary companies all over the world, and each of these entities produces a lot of data. Nowadays data is everywhere: from financial documents to website sessions, online advertisement, in-store sales… and the volume is growing exponentially! In our case, every locality (basically at country level) is responsible for collecting and managing the data it produces, and for storing it in BigQuery in the GCP region closest to it.&lt;/p&gt;

&lt;p&gt;The problem is that BigQuery, let’s remember, is actually two distinct products: the Query Engine (based on Google’s &lt;a href="https://research.google/pubs/pub36632/" rel="noopener noreferrer"&gt;Dremel&lt;/a&gt;) and the Storage (based on &lt;a href="https://cloud.google.com/blog/products/bigquery/inside-capacitor-bigquerys-next-generation-columnar-storage-format" rel="noopener noreferrer"&gt;Capacitor&lt;/a&gt;, Google’s columnar storage format); but you cannot use the query engine in one location to process data stored in another location! And this is a big issue when your company is distributed globally.&lt;/p&gt;

&lt;p&gt;To get around this issue, we need to periodically transfer the data from remote locations to the main location of analysis (closest to the users, in our case it’s the EU region) and ideally this transfer must be done incrementally : we only want to transfer the new data produced since the last transfer, in order to save cost and improve performance.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to efficiently transfer data between BigQuery regions ?
&lt;/h2&gt;

&lt;p&gt;There are two main methods to address this issue: &lt;/p&gt;

&lt;h3&gt;
  
  
  1. The "legacy" method : use Google Cloud Storage
&lt;/h3&gt;

&lt;p&gt;For this kind of problem, the historical solution would be the following process: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;export the data from BigQuery storage to Parquet files (or any other format, but Parquet is great for most use cases) in a GCS bucket located in the same region as the original dataset, &lt;/li&gt;
&lt;li&gt;transfer the created objects (whose number we cannot know in advance, because that’s how BigQuery exports work) to another bucket located in the same region as the final location, &lt;/li&gt;
&lt;li&gt;finally, load the Parquet files into the final table&lt;/li&gt;
&lt;/ul&gt;
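Assembled as CLI invocations, the three steps look roughly like this; a Python sketch where every bucket and table name is a placeholder:

```python
def legacy_transfer_commands(src_table: str, src_bucket: str,
                             dst_bucket: str, dst_table: str) -> list[str]:
    """The three legacy steps as bq/gsutil invocations (all names are placeholders)."""
    return [
        # 1. export to Parquet, in a bucket in the same region as the source dataset
        f"bq extract --destination_format=PARQUET '{src_table}' "
        f"'gs://{src_bucket}/export/part-*.parquet'",
        # 2. copy the unknown number of files to a bucket in the destination region
        f"gsutil -m cp 'gs://{src_bucket}/export/part-*.parquet' "
        f"'gs://{dst_bucket}/export/'",
        # 3. load the Parquet files into the final table
        f"bq load --source_format=PARQUET '{dst_table}' "
        f"'gs://{dst_bucket}/export/part-*.parquet'",
    ]
```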

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fie2zh04cbf2u5csc8acu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fie2zh04cbf2u5csc8acu.png" alt="Data transfer using Parquet export" width="800" height="106"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  2. The “new” method : use the dedicated service : Data transfer Service
&lt;/h3&gt;

&lt;p&gt;But there is something more straightforward to solve this : use the new feature from &lt;a href="https://cloud.google.com/bigquery/docs/copying-datasets" rel="noopener noreferrer"&gt;Data Transfer Service for copying full datasets&lt;/a&gt; ! It does the same thing as the first method without extracting data out of the BigQuery storage 😎. Under the hood, DTS is performing a table copy command for each table in the source dataset. If the table was already transferred in a previous run, and no row has been modified or added since then, the table is skipped.&lt;/p&gt;

&lt;p&gt;The service is still in beta, and we found some inconveniences though: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The incremental load is managed automatically based on internal hidden metadata in BQ tables, but it’s not yet supported for partitioned tables (no append mode).&lt;/li&gt;
&lt;li&gt;You cannot choose which tables from the source dataset are transferred : it’s everything or nothing !&lt;/li&gt;
&lt;li&gt;In a GCP project, we can only transfer 2000 tables a day (cross-region), but some of our sources hosted sharded tables (one table per day, with a history of years), so we hit the quota pretty quickly.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And these limitations were a &lt;em&gt;pain in the ass&lt;/em&gt; for us: we didn’t want to transfer every table from the source, and our destination tables were partitioned 😭.&lt;/p&gt;

&lt;p&gt;Our use case was as follows: the data from each country resided in a dedicated dataset, in a country-dedicated project, located in a different GCP region. Each dataset contained many sharded tables sharing the same schema. But due to DTS limitations we couldn’t use the service as is: since the append mode for partitioned tables doesn’t work yet, every transfer would have erased the data previously transferred from another country…&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F20opy6snabck8ngt4es4.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F20opy6snabck8ngt4es4.jpg" alt="Our global organisation" width="800" height="229"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;h2&gt;
  
  
  Cloud Workflow as a wrapper for Data Transfer Service
&lt;/h2&gt;

&lt;p&gt;We finally managed to design an architecture for our situation by transferring the data not directly from source to destination, but through dedicated temporary datasets:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Stage 1: For each table to transfer, we create a temporary table in a temporary dataset that contains only the last partitions (or shards) added since the last transfer happened (applying custom business logic)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Stage 2: Once the temporary table is built, we trigger the Data Transfer Service to copy everything into a temporary destination dataset (in the same region as the final tables)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Stage 3: We merge the transferred partitions into the final tables with a custom BQ job.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
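&lt;p&gt;To illustrate, Stage 1 and Stage 3 boil down to plain BigQuery jobs. A minimal sketch with hypothetical table names and a pseudo-parameter for the watermark (the real logic depends on your business rules):&lt;/p&gt;

```sql
-- Stage 1: materialize only the partitions added since the last transfer,
-- in a temporary dataset in the source region.
CREATE OR REPLACE TABLE tmp_source_dataset.sales
PARTITION BY event_date AS
SELECT *
FROM source_dataset.sales
WHERE event_date BETWEEN @last_transfer_date AND CURRENT_DATE();

-- Stage 3: merge the transferred partitions into the final table,
-- in the destination region, without erasing other countries' data.
MERGE final_dataset.sales AS target
USING tmp_destination_dataset.sales AS source
ON target.event_date = source.event_date
   AND target.country = source.country
   AND target.id = source.id
WHEN MATCHED THEN UPDATE SET amount = source.amount
WHEN NOT MATCHED THEN INSERT ROW;
```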

&lt;p&gt;&lt;em&gt;And voilà !&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6rbknw4b3kvghte5rzxj.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6rbknw4b3kvghte5rzxj.jpg" alt="Architecture : transfer data cross region using Data Transfer Service" width="800" height="367"&gt;&lt;/a&gt;&lt;em&gt;Our final architecture&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The whole thing was orchestrated with a monstrous &lt;a href="https://cloud.google.com/workflows" rel="noopener noreferrer"&gt;Workflow&lt;/a&gt; (more than 130 actual steps) designed with &lt;a href="https://en.wikipedia.org/wiki/Don%27t_repeat_yourself" rel="noopener noreferrer"&gt;DRY&lt;/a&gt; principles: everything is generic, adapts to input params from the user, and is as detached as possible from our specific use case so that other teams can reuse it for other use cases.&lt;br&gt;
If you don’t know Cloud Workflows you should definitely give it a try: it’s a new serverless, minimalist orchestration service from Google Cloud that is explicitly designed to make API calls. Indeed, everything is an API in the cloud, and calling APIs is the only thing Cloud Workflows needs to do (well, some additional features would be nice to have. Google, if you read me, let’s talk)&lt;/p&gt;
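&lt;p&gt;To give an idea, a single Workflow step triggering a DTS run could look like the following sketch (the step name and input params are hypothetical; &lt;code&gt;startManualRuns&lt;/code&gt; is the underlying REST method of the Data Transfer API):&lt;/p&gt;

```yaml
# Hypothetical step: trigger a manual run of an existing transfer config.
- startTransferRun:
    call: http.post
    args:
      url: ${"https://bigquerydatatransfer.googleapis.com/v1/" + args.transferConfigName + ":startManualRuns"}
      auth:
        type: OAuth2
      body:
        requestedRunTime: ${args.requestedRunTime}
    result: transferRun
```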

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;It works like a charm! And thanks to these DRY principles we have been able to make the workflow as generic as possible and use the same code to transfer different sources with different tables and structures. Performance is satisfying: on the order of minutes to less than an hour to transfer GBs to TBs of data across more than 60 countries.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's next ?
&lt;/h2&gt;

&lt;p&gt;In an upcoming article, we will discuss the daily transfer of Google Analytics data with this method, and the issues and solutions we found regarding schema drift, a common problem in the Data Engineering world. Stay tuned…&lt;/p&gt;

</description>
      <category>bigquery</category>
      <category>data</category>
      <category>googlecloud</category>
      <category>analytics</category>
    </item>
    <item>
      <title>Serverless Spark on GCP : How does it compare with Dataflow ?</title>
      <dc:creator>Λ\: Clément Bosc</dc:creator>
      <pubDate>Tue, 16 Nov 2021 08:13:41 +0000</pubDate>
      <link>https://dev.to/stack-labs/serverless-spark-on-gcp-how-does-it-compare-with-dataflow--2o8n</link>
      <guid>https://dev.to/stack-labs/serverless-spark-on-gcp-how-does-it-compare-with-dataflow--2o8n</guid>
      <description>&lt;p&gt;I am a huge fan of serverless products: it allows developers to be focused on bringing business value on the software they are working on and not the underlying infrastructure, at the end they are more autonomous to test and deploy and the scaling capabilities are often better than an equivalent self managed service.&lt;/p&gt;

&lt;p&gt;When it comes to data processing on GCP there are not so many options for serverless products; the choice is often limited to Dataflow. Dataflow is great, but the learning curve is a bit steeper and Beam (the OSS framework behind Dataflow) is not promoted by other providers, which often prioritize Spark. And the solutions to run Spark workloads on GCP were not so lightweight: you had to provision a Dataproc cluster or run your workload in Kubernetes: a whole different level of complexity!&lt;br&gt;
This was until recently, because Google surprisingly announced a new Serverless Spark service at Next’21!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw3kx0725rf0sm9dyjo97.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw3kx0725rf0sm9dyjo97.png" alt="Spark + Google = ❤️" width="611" height="116"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Spark on GCP: a new area for data processing
&lt;/h2&gt;

&lt;p&gt;According to Google, this new service is the industry’s first autoscaling serverless Spark. You do not need any infrastructure provisioning or tuning; it is integrated with BigQuery, Vertex AI and Dataplex, and it’s ready to use via a submission service (API), notebooks or the BigQuery console for any usage you can imagine (except streaming analytics): ETL, data exploration, analysis, and ML.&lt;/p&gt;

&lt;p&gt;On my side I have been able to test the workload submission service (the most interesting part to me): it’s an API endpoint to submit custom Spark code (Python, Java, R or SQL). You can see this submission service as an answer to the &lt;code&gt;spark-submit&lt;/code&gt; problem.&lt;br&gt;
On the autoscaling side, Google will magically decide on the number of executors to run the job optimally, but you can still set it manually.&lt;br&gt;
The service is part of the Dataproc family and accessible on the console through the Dataproc page. After some tests the service seems to work fine, but how does it compare to Dataflow? Let’s check with a small experiment.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3h1nxzsw9lipishe26y9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3h1nxzsw9lipishe26y9.png" alt="The new Dataproc console with Serverless Spark" width="800" height="495"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  The experiment: Dataflow vs Serverless Spark
&lt;/h2&gt;

&lt;p&gt;I wrote 2 simple programs: the first one in PySpark and the second one with Beam (Python SDK). The goal is to read 100GB of ASCII data from a Cloud Storage bucket, parse the content of the files, filter according to a regex pattern, group by a key value (some column) and count the number of lines having the same key. The result is written in Parquet format to another bucket. &lt;/p&gt;

&lt;p&gt;For the input data I used a subset of a 100TB dataset publicly available here: &lt;code&gt;gs://dataproc-datasets-us-central1/teragen/100tb/ascii_sort_1GB_input.*&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;In this dataset, each file is about 1GB and the content is as below (not very relevant):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;7ed&amp;gt;@}"UeR  0000000000000000000000024FDFC680  1111555588884444DDDD0000555511113333DDDDFFFF88881111
3AXi 40'NA  0000000000000000000000024FDFC681  888800000000CCCCEEEEDDDD11110000DDDD55553333CCCC6666
PL.Ez`vXmt  0000000000000000000000024FDFC682  111122225555CCCC000000002222FFFFFFFFFFFF88885555FFFF
5^?a=6o0]?  0000000000000000000000024FDFC683  7777FFFF55551111BBBBDDDD44447777DDDD5555BBBB9999CCCC
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The regex filter applied is &lt;code&gt;^.*FFFF.*$&lt;/code&gt; on the 3rd “column”, meaning the column content for a given record must contain at least 4 consecutive Fs (totally useless, but it's for the sake of the experiment). The grouping key is the first column. The observed reducing factor of the filter operation is about 50%. I agree the experiment is not something we would normally do in a real project, but that is not important: it's just to simulate a workload with a significant compute task.&lt;/p&gt;
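&lt;p&gt;Stripped of the Spark and Beam machinery, the core of both pipelines boils down to the following logic (a pure-Python sketch, assuming the teragen fixed-width layout shown above, not the actual code from the repo):&lt;/p&gt;

```python
import re
from collections import Counter

# Filter: the 3rd fixed-width "column" must contain at least 4 consecutive Fs.
PATTERN = re.compile(r"^.*FFFF.*$")

def count_by_key(lines):
    """Count surviving lines per grouping key (the 1st column).

    Assumes the teragen fixed-width layout: a 10-char key, a 32-char
    counter and the payload, separated by 2 spaces each.
    """
    counts = Counter()
    for line in lines:
        key, payload = line[0:10], line[46:]
        if PATTERN.match(payload):
            counts[key] += 1
    return counts
```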

&lt;p&gt;On the configuration side, for the Dataflow job I enabled the &lt;a href="https://cloud.google.com/blog/products/data-analytics/simplify-and-automate-data-processing-with-dataflow-prime" rel="noopener noreferrer"&gt;Dataflow Prime&lt;/a&gt; feature but left everything else by default (the Prime feature is more optimized and makes it simpler to calculate the total cost of the job). For the Spark service, everything was left as default and I manually asked for 17 executors (why 17? why not 😅)&lt;/p&gt;

&lt;p&gt;The result:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Dataflow&lt;/th&gt;
&lt;th&gt;Serverless Spark&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Total execution time&lt;/td&gt;
&lt;td&gt;36 min 34 sec&lt;/td&gt;
&lt;td&gt;12 min 36 sec&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Number of vCPU&lt;/td&gt;
&lt;td&gt;64 (autoscaling)&lt;/td&gt;
&lt;td&gt;68 (17 executors * 4 vCPUs)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Total cost of the job&lt;/td&gt;
&lt;td&gt;28.88 DPU * $0.071 = $2.05&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Both jobs accomplished the desired task and output 567 M rows in multiple Parquet files (I checked with BigQuery external tables):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp9woggvsch2hmu3sa8te.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp9woggvsch2hmu3sa8te.png" alt="BigQuery results with external tables" width="689" height="458"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Serverless Spark service processed the data in about a third of the time compared to Dataflow! Nice performance 👏.&lt;/p&gt;

&lt;p&gt;Currently however there are some limitations to this Serverless service: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It’s only for batch processing, not streaming (Dataflow would probably be better for that anyway) and job duration is limited to 24 hours.&lt;/li&gt;
&lt;li&gt;There is no monitoring dashboard whatsoever and the Spark UI is not accessible, whereas Dataflow has a pretty good real-time dashboarding functionality&lt;/li&gt;
&lt;li&gt;It’s only Spark 3.2 for now; this might not be a limitation for you, but if you want to migrate existing workloads to the service they might not work.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Remarks about the experiment: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The Beam/Dataflow pipeline was developed with the Python SDK and I would probably achieve better results with the Java SDK and by using Flex templates (the scaling operation is more efficient because the pipeline is containerized), so it’s not totally fair to Dataflow.&lt;/li&gt;
&lt;li&gt;Dataflow targeted an ideal number of 260 vCPUs, but I limited the max number of workers to save cost (and also because my CPU usage quota was at its maximum); without this limit Dataflow would probably have solved the problem much quicker.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To conclude, I am pretty optimistic about this new Spark serverless service. Running Spark on GCP was not really a solution promoted natively by Google (except for &lt;em&gt;lift and shift&lt;/em&gt; migrations on Dataproc), whereas AWS and Azure based their main data processing products on Spark (&lt;em&gt;Glue&lt;/em&gt; and &lt;em&gt;Mapping data flows&lt;/em&gt;). On the downside, the integration with the GCP ecosystem is way behind Dataflow for now (monitoring &amp;amp; operations), it does not support Spark Streaming, and the autoscaling feature is still a bit obscure.&lt;/p&gt;

&lt;p&gt;In the end, keep in mind that Serverless Spark and Dataflow are two different products, and the choice between the two is not only about performance and pricing, but also about the need for batch vs streaming ingestion (Dataflow is much better for the latter) and your team’s background knowledge of the two frameworks: Spark or Beam. &lt;/p&gt;

&lt;p&gt;Anyway, the service should get out of Private Preview by mid-December 2021 and be integrated with other GCP products (BigQuery, Vertex AI, Dataplex) later this year. It’s only the beginning, but it’s promising.&lt;/p&gt;

&lt;h3&gt;
  
  
  References:
&lt;/h3&gt;

&lt;p&gt;Both Spark and Dataflow pipelines are available in the following &lt;a href="https://gitlab.com/clement.bosc/dataflow-vs-spark-serverless/-/blob/main/dataflow_experiment.py" rel="noopener noreferrer"&gt;GitLab repo 🦊&lt;/a&gt;&lt;/p&gt;

</description>
      <category>dataflow</category>
      <category>spark</category>
      <category>analytics</category>
      <category>googlecloud</category>
    </item>
    <item>
      <title>What’s new with BigQuery ?</title>
      <dc:creator>Λ\: Clément Bosc</dc:creator>
      <pubDate>Mon, 25 Oct 2021 07:15:22 +0000</pubDate>
      <link>https://dev.to/stack-labs/whats-new-with-bigquery--35eo</link>
      <guid>https://dev.to/stack-labs/whats-new-with-bigquery--35eo</guid>
      <description>&lt;p&gt;To all BigQuery lovers around here (and others too !) : Google Cloud Next'21 is just over and there was an important part of announcements regarding Data ! &lt;br&gt;
Let’s see the latest news and functionalities of BigQuery, announced at Next or in the past weeks. (careful, some of them are still in Preview)&lt;/p&gt;
&lt;h2&gt;
  
  
  SQL Translator &amp;amp; BigQuery Migration Service
&lt;/h2&gt;

&lt;p&gt;You want to migrate your existing data warehouse to BigQuery? (Congratulations, it’s probably a good idea 🎉) Check out &lt;a href="https://cloud.google.com/bigquery/docs/migration-intro" rel="noopener noreferrer"&gt;BigQuery Migration Service&lt;/a&gt;, a set of free tools to help you migrate. One is particularly interesting: the &lt;a href="https://cloud.google.com/bigquery/docs/interactive-sql-translator" rel="noopener noreferrer"&gt;SQL translator&lt;/a&gt;. Accessible from the API or the Console, this tool will help you translate your current SQL queries into BigQuery Standard SQL. &lt;/p&gt;

&lt;p&gt;Only Teradata SQL is supported for now, but let’s bet there is more to come!&lt;/p&gt;
&lt;h2&gt;
  
  
  Sessions and transactions
&lt;/h2&gt;

&lt;p&gt;There is now support for transactions in BigQuery! Yes, you’ve read that right: transactions! &lt;br&gt;
The functionality is called &lt;a href="https://cloud.google.com/bigquery/docs/reference/standard-sql/transactions" rel="noopener noreferrer"&gt;Multi-statement transactions&lt;/a&gt; and allows you to perform data modifications in one or more tables with ACID guarantees. During the transaction, all reads return a consistent version of the tables referenced in the transaction, and any modification is either committed or rolled back.&lt;br&gt;
Multi-statement transactions spanning multiple queries are started within a &lt;a href="https://cloud.google.com/bigquery/docs/sessions-intro" rel="noopener noreferrer"&gt;Session&lt;/a&gt;. The new session abstraction lets you separate users and applications from each other. &lt;/p&gt;
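&lt;p&gt;For example, a multi-statement transaction looks like this (table names are hypothetical):&lt;/p&gt;

```sql
BEGIN TRANSACTION;

-- Both statements are committed (or rolled back) atomically.
INSERT INTO mydataset.inventory_archive
SELECT * FROM mydataset.inventory WHERE quantity = 0;

DELETE FROM mydataset.inventory WHERE quantity = 0;

COMMIT TRANSACTION;
```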
&lt;h2&gt;
  
  
  Table Snapshots and table Clone
&lt;/h2&gt;

&lt;p&gt;Heard of time travel in BigQuery? It’s pretty useful but only lets you go 7 days back. To store the state of a table for longer than that, &lt;a href="https://cloud.google.com/bigquery/docs/table-snapshots-intro" rel="noopener noreferrer"&gt;table snapshots&lt;/a&gt; are here to help: they allow you to preserve the contents of a table at a particular time, and keep this image for as long as you want. BigQuery minimizes the storage cost by only storing the bytes that differ between a snapshot and its base table.&lt;/p&gt;

&lt;p&gt;Tip: think about periodic snapshot creation with the query scheduler&lt;/p&gt;

&lt;p&gt;A similar upcoming functionality is the table clone. While table snapshots are immutable (you can restore them but not edit them directly), a clone is a mutable version of the base table. Clones allow you to copy a table and perform read/write/schema evolution operations on the copy. Pretty useful for testing production changes. Something nice: same as snapshots, BQ only bills you for the new data, because it stores only the difference between the base table and the cloned one.&lt;/p&gt;
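&lt;p&gt;A minimal sketch of both features, with hypothetical table names:&lt;/p&gt;

```sql
-- Snapshot: an immutable, storage-efficient image of the table.
CREATE SNAPSHOT TABLE mydataset.orders_backup
CLONE mydataset.orders
OPTIONS (expiration_timestamp = TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 90 DAY));

-- Clone: a mutable, writable copy, billed only on the delta.
CREATE TABLE mydataset.orders_dev
CLONE mydataset.orders;
```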
&lt;h2&gt;
  
  
  Table sampling
&lt;/h2&gt;

&lt;p&gt;You have a machine learning model to train with BigQuery ML but want to use only a subset of a table for the training set? Try the &lt;a href="https://cloud.google.com/bigquery/docs/table-sampling" rel="noopener noreferrer"&gt;table sampling&lt;/a&gt; functionality: sampling returns a variety of records while avoiding the costs associated with scanning and processing an entire table (unlike the &lt;code&gt;LIMIT&lt;/code&gt; clause!)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- this will give you a random 10% of a table data &lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;my_table&lt;/span&gt; &lt;span class="n"&gt;TABLESAMPLE&lt;/span&gt; &lt;span class="k"&gt;SYSTEM&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt; &lt;span class="n"&gt;PERCENT&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Table functions
&lt;/h2&gt;

&lt;p&gt;Have you ever wanted parameters in views? Now you have &lt;a href="https://cloud.google.com/bigquery/docs/reference/standard-sql/table-functions" rel="noopener noreferrer"&gt;table functions&lt;/a&gt; (TVF) for that! Table-valued functions allow you to create a SQL function that returns a table. You can see it as creating a view with parameters, and you can call the result of this function in standard queries:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="k"&gt;REPLACE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="k"&gt;FUNCTION&lt;/span&gt; &lt;span class="n"&gt;mydataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;names_by_year&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="n"&gt;INT64&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;AS&lt;/span&gt;
  &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="nb"&gt;year&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;number&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt;
  &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="nv"&gt;`bigquery-public-data.usa_names.usa_1910_current`&lt;/span&gt;
  &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="nb"&gt;year&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;
  &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="nb"&gt;year&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;

&lt;span class="c1"&gt;-- use your table function in an other query, just like a view&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;mydataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;names_by_year&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1950&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
  &lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can use &lt;strong&gt;Authorized functions&lt;/strong&gt; to give specific users access to your TVF without giving them access to the underlying table (just like authorized views work for standard views)&lt;/p&gt;

&lt;h2&gt;
  
  
  Storage Write API
&lt;/h2&gt;

&lt;p&gt;There is now a new write API unifying the old batch and streaming APIs: the &lt;a href="https://cloud.google.com/bigquery/docs/write-api" rel="noopener noreferrer"&gt;Storage Write API&lt;/a&gt;! This new API gives you more control over the loading process and is more performant than the previous ones. Moreover, it is cheaper than the legacy streaming insert functionality while providing a free usage tier!&lt;/p&gt;

&lt;h2&gt;
  
  
  BigQuery Omni
&lt;/h2&gt;

&lt;p&gt;The multi-cloud analytics engine BigQuery Omni is going GA for AWS and Azure! With Omni you can query large amounts of data in AWS S3 or Azure ADLS without maintaining complex cross-cloud Extract-Transform-Load pipelines. The functionality allows multi-cloud organisations to save on egress costs and join data between cloud providers and locations. The BigQuery console on GCP becomes the central access point for analytics, and you can define governance and access control in a single place!&lt;/p&gt;

&lt;h2&gt;
  
  
  BigQuery BI Engine
&lt;/h2&gt;

&lt;p&gt;After being in Preview for a while with Data Studio, BigQuery &lt;a href="https://cloud.google.com/bi-engine/docs/sql-interface-overview" rel="noopener noreferrer"&gt;BI Engine&lt;/a&gt; is also going GA! BI Engine is a reserved in-memory database used to obtain sub-second query results with any BI tool (Looker, Tableau, Power BI, etc.), even over very large amounts of data. This functionality removes the need for OLAP cubes and complex ETL pipelines: Google automatically handles the movement and freshness of the data between standard BigQuery storage and BI Engine. And this also works for streaming!&lt;/p&gt;

&lt;h2&gt;
  
  
  Parameterized data types
&lt;/h2&gt;

&lt;p&gt;Historically, BigQuery did not allow restricting the size of certain data types, but this is about to change: there is now a &lt;a href="https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types#parameterized_data_types" rel="noopener noreferrer"&gt;parameterized data types&lt;/a&gt; syntax for &lt;code&gt;STRING&lt;/code&gt;, &lt;code&gt;BYTES&lt;/code&gt;, &lt;code&gt;NUMERIC&lt;/code&gt; and &lt;code&gt;BIGNUMERIC&lt;/code&gt;. Want to raise an error if the text value in a certain column is longer than 10 characters? Type your column as &lt;code&gt;STRING(10)&lt;/code&gt;&lt;/p&gt;
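&lt;p&gt;A minimal sketch (hypothetical table and columns):&lt;/p&gt;

```sql
CREATE TABLE mydataset.customers (
  country_code STRING(2),   -- fails on insert if longer than 2 characters
  price NUMERIC(10, 2)      -- at most 10 digits, 2 after the decimal point
);
```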

&lt;h2&gt;
  
  
  SQL functions : PIVOT and QUALIFY
&lt;/h2&gt;

&lt;p&gt;Among many new geography functions (not detailed here), here are two interesting new functions and Standard SQL syntax evolutions that caught my attention:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax#pivot_operator" rel="noopener noreferrer"&gt;&lt;code&gt;PIVOT&lt;/code&gt;&lt;/a&gt; and &lt;a href="https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax#unpivot_operator" rel="noopener noreferrer"&gt;&lt;code&gt;UNPIVOT&lt;/code&gt;&lt;/a&gt; to turn rows into columns and columns into rows.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax#qualify_clause" rel="noopener noreferrer"&gt;&lt;code&gt;QUALIFY&lt;/code&gt;&lt;/a&gt;, a new SQL clause used to filter on the result of an analytical function without the need for a subquery ! It’s kind of like the &lt;code&gt;HAVING&lt;/code&gt; to filter on a standard aggregation function.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;Produce&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;Produce&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;category&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'vegetable'&lt;/span&gt;
&lt;span class="n"&gt;QUALIFY&lt;/span&gt; &lt;span class="n"&gt;RANK&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;category&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;purchases&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
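&lt;p&gt;And a &lt;code&gt;PIVOT&lt;/code&gt; example on a hypothetical Produce table (with product, sales and quarter columns), turning one row per quarter into one column per quarter:&lt;/p&gt;

```sql
SELECT * FROM
  (SELECT product, sales, quarter FROM Produce)
PIVOT (SUM(sales) FOR quarter IN ('Q1', 'Q2', 'Q3', 'Q4'))
```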



</description>
      <category>database</category>
      <category>analytics</category>
      <category>sql</category>
      <category>bigquery</category>
    </item>
  </channel>
</rss>
