<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Λ\: Clément Bosc</title>
    <description>The latest articles on DEV Community by Λ\: Clément Bosc (@clementbosc).</description>
    <link>https://dev.to/clementbosc</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F536186%2F1e53dd8c-36b2-4855-b3a5-01c8833e9dfc.jpeg</url>
      <title>DEV Community: Λ\: Clément Bosc</title>
      <link>https://dev.to/clementbosc</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/clementbosc"/>
    <language>en</language>
    <item>
      <title>Applying graph theory for inferring your BigQuery SQL transformations: an experimental DataOps tool</title>
      <dc:creator>Λ\: Clément Bosc</dc:creator>
      <pubDate>Tue, 16 Apr 2024 20:26:40 +0000</pubDate>
      <link>https://dev.to/stack-labs/applying-graph-theory-for-inferring-your-bigquery-sql-transformations-an-experimental-dataops-tool-463n</link>
      <guid>https://dev.to/stack-labs/applying-graph-theory-for-inferring-your-bigquery-sql-transformations-an-experimental-dataops-tool-463n</guid>
      <description>&lt;p&gt;If you work with Google Cloud for your Data Platform there are chances that you use BigQuery and run your Data pipelines transformations in a ELT manner: using BQ query engine to run transformations as a series of SELECT statements, one after another. Indeed over the last few years, ELT and tools like DBT or Dataform have been the de-facto standard for running and organizing your Data transformations at scale.&lt;/p&gt;

&lt;p&gt;These tools, which we may group under the “SQL orchestration tools” banner, are great for many reasons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;SQL is the one and only language used to express a transformation, and SQL is great for structured data (and even semi-structured data)&lt;/li&gt;
&lt;li&gt;They do a great job at centralizing the transformations: nice for audits, lineage tracking and trust&lt;/li&gt;
&lt;li&gt;They simplify the DataOps experience and help onboard Data Analysts in Data Engineer tasks&lt;/li&gt;
&lt;li&gt;They can almost automatically infer the transformation dependencies by creating a DAG. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;BUT, coming from a Platform Engineering background, I see a major flaw: they lack a state. Indeed, if you take declarative IaC tools like Terraform, the current state of the Data Platform infrastructure is stored in a file (the state), including the tables/views, the permissions, etc.&lt;/p&gt;

&lt;h2&gt;
  
  
  But how is this a problem?
&lt;/h2&gt;

&lt;p&gt;The problem is that tools like DBT or Dataform only run SQL statements generated at execution time. For example, to create a table the generated statement will be &lt;code&gt;CREATE OR REPLACE TABLE AS SELECT your_transformation&lt;/code&gt;. This means the tool never knows whether the object already existed, so you cannot attach IAM permissions to it with Terraform (because the object is re-created every day by your daily batch), nor can you use the table as a contract interface with consumers, because the table does not exist prior to the transformation.&lt;/p&gt;

&lt;h2&gt;
  
  
  The solution: an experimental tool that uses the best of both worlds
&lt;/h2&gt;

&lt;p&gt;I wanted to keep the benefits of SQL orchestration tools (like Dataform on GCP) while using Terraform for the Ops benefits, keeping in mind the following requirements:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Table dependencies between 2 transformations (running transformation B after A if B references table A in its query) should be &lt;strong&gt;automatically inferred&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Table schemas (column name, type) must be &lt;strong&gt;automatically inferred&lt;/strong&gt;: users should not waste time writing a table schema that can be deduced from the query output.&lt;/li&gt;
&lt;li&gt;Tables should be automatically created prior to the transformation (not by the transformation) with an IaC tool: Terraform&lt;/li&gt;
&lt;li&gt;Be able to have a custom monitoring interface that gathers all the transformation information: status, cost, performance, custom error messages, etc.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Architecture proposal
&lt;/h2&gt;

&lt;p&gt;Here is the architecture proposal for my experimental, DataOps-oriented transformation tool:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Transformations are BigQuery queries&lt;/li&gt;
&lt;li&gt;Orchestration is handled by an auto-generated Cloud Workflow, with all the correct dependencies and parallel steps when possible (if two transformations can run at the same time)&lt;/li&gt;
&lt;li&gt;Monitoring is a BigQuery timestamp-partitioned table with a Pub/Sub topic (and an Avro schema for the interface) and a push-to-BQ streaming subscription&lt;/li&gt;
&lt;li&gt;Transformations are defined in a git repository as YAML files (Jinja templates are supported for flexibility and factorization)&lt;/li&gt;
&lt;li&gt;A Cloud Run endpoint hosts all the schema/dependency inference logic and generates the Workflow body according to the transformation dependencies (more on the Cloud Run below)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F19r1krgme0i3jj44la4s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F19r1krgme0i3jj44la4s.png" alt="Automatically create BigQuery tables with Terraform" width="800" height="344"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How to infer dependencies ?
&lt;/h2&gt;

&lt;p&gt;Here is where the magic happens: the automatic dependency inference. As a reminder, DAGs in data pipelines are nothing more than graphs (Directed Acyclic Graphs), so let’s use a graph library to build them from raw SQL declarations. You can find the detailed process and Python implementation examples in this post: &lt;a href="https://dev.to/stack-labs/build-the-dependency-graph-of-your-bigquery-pipelines-at-no-cost-a-python-implementation-1ik1"&gt;Build the dependency graph of your BigQuery pipelines at no cost: a Python implementation&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The raw SQL declarations are sent by Terraform to a remote Cloud Run instance that computes the inference logic (DAG creation, Workflows source code generation, table schema generation), so that Terraform can immediately create the tables and workflows, prior to any transformation running.&lt;/p&gt;

&lt;h2&gt;
  
  
  Example: the experiment in action
&lt;/h2&gt;

&lt;p&gt;Let’s take a simple example: we are in a standard Data Platform architecture with a 3-layer principle: Bronze (raw data), Silver (curated data) and Gold (aggregated/meaningful data). We need to run a data transformation pipeline, in SQL, that cleans the raw data (deduplication and type conversion, for example) and builds an analytics-ready aggregated table from the cleaned data.&lt;/p&gt;

&lt;p&gt;The demo dataset is a very simple retail-oriented data model (orders, products and users), orders being the fact table.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzi2cpty6hmj68amy0ibd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzi2cpty6hmj68amy0ibd.png" alt="Infer BigQuery table dependencies" width="800" height="425"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Our tool, based on Terraform, needs to create the SILVER and GOLD tables, with the correct schemas, before the transformations run, as well as the Cloud Workflow source definition.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The data transformation files:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The transformations are described in a YAML file, specifying the destination table and the SQL transformation query as a single SELECT.&lt;/p&gt;

&lt;p&gt;Building the silver layer; here it is only a deduplication step for the sake of the demo:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;workflow_group&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;demo&lt;/span&gt;
&lt;span class="na"&gt;destination_table&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;project_id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${raw_project}&lt;/span&gt;
  &lt;span class="na"&gt;dataset_id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${app_name}_ds_3_demo_${multiregion_id}_${project_env}&lt;/span&gt;
  &lt;span class="na"&gt;table_id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;orders&lt;/span&gt;
  &lt;span class="na"&gt;location&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;EU&lt;/span&gt;

&lt;span class="na"&gt;query&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="s"&gt;SELECT&lt;/span&gt;
    &lt;span class="s"&gt;*&lt;/span&gt;
  &lt;span class="s"&gt;FROM `${raw_project}.sldp_demo_retail_analytics_raw_data_eu_${project_env}.orders_v1`&lt;/span&gt;
  &lt;span class="s"&gt;QUALIFY ROW_NUMBER() OVER (PARTITION BY id ORDER BY insertion_time DESC) = 1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Building the gold layer; here, an aggregated table of the total amount of sold products per month and per consumer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;workflow_group&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;3-demo&lt;/span&gt;
&lt;span class="na"&gt;destination_table&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;project_id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${raw_project}&lt;/span&gt;
  &lt;span class="na"&gt;dataset_id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${app_name}_ds_3_demo_${multiregion_id}_${project_env}&lt;/span&gt;
  &lt;span class="na"&gt;table_id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;total_cost_by_user&lt;/span&gt;
  &lt;span class="na"&gt;location&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;EU&lt;/span&gt;

&lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Total&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;cost&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;by&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;and&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;month.&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Granularity:&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;[user_id,&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;month]"&lt;/span&gt;

&lt;span class="na"&gt;query&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="s"&gt;SELECT&lt;/span&gt;
    &lt;span class="s"&gt;u.email,&lt;/span&gt;
    &lt;span class="s"&gt;DATE_TRUNC(DATE(o.created_at), MONTH) as month,&lt;/span&gt;
    &lt;span class="s"&gt;SUM(o.quantity * p.price) as total_amount,&lt;/span&gt;
    &lt;span class="s"&gt;COUNT(DISTINCT o.id) as total_orders,&lt;/span&gt;
    &lt;span class="s"&gt;CURRENT_TIMESTAMP() as insertion_time&lt;/span&gt;
  &lt;span class="s"&gt;FROM `${raw_project}.${app_name}_ds_3_demo_${multiregion_id}_${project_env}.orders` o&lt;/span&gt;
  &lt;span class="s"&gt;JOIN `${raw_project}.${app_name}_ds_3_demo_${multiregion_id}_${project_env}.users` u&lt;/span&gt;
    &lt;span class="s"&gt;ON u.id = o.user_id&lt;/span&gt;
 &lt;span class="s"&gt;JOIN `${raw_project}.${app_name}_ds_3_demo_${multiregion_id}_${project_env}.products` p&lt;/span&gt;
    &lt;span class="s"&gt;ON p.id = o.product_id&lt;/span&gt;
  &lt;span class="s"&gt;GROUP BY email, month&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And after running the &lt;code&gt;terraform plan&lt;/code&gt; command we can see the following output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;Terraform will perform the following actions:

&lt;span class="c"&gt;# google_bigquery_table.destination_tables["orders"] will be created&lt;/span&gt;
  + resource &lt;span class="s2"&gt;"google_bigquery_table"&lt;/span&gt; &lt;span class="s2"&gt;"destination_tables"&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
      + creation_time       &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;known after apply&lt;span class="o"&gt;)&lt;/span&gt;
      + dataset_id          &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"sldp_ds_3_demo_eu_dev"&lt;/span&gt;
      + schema              &lt;span class="o"&gt;=&lt;/span&gt; jsonencode&lt;span class="o"&gt;(&lt;/span&gt;
            &lt;span class="o"&gt;[&lt;/span&gt;
              + &lt;span class="o"&gt;{&lt;/span&gt;
                  + mode        &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"NULLABLE"&lt;/span&gt;
                  + name        &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"id"&lt;/span&gt;
                  + &lt;span class="nb"&gt;type&lt;/span&gt;        &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"INTEGER"&lt;/span&gt;
                &lt;span class="o"&gt;}&lt;/span&gt;,
        ...
          &lt;span class="o"&gt;])&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;

&lt;span class="c"&gt;# google_bigquery_table.destination_tables["products"] will be created&lt;/span&gt;
  + resource &lt;span class="s2"&gt;"google_bigquery_table"&lt;/span&gt; &lt;span class="s2"&gt;"destination_tables"&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    ...
&lt;span class="o"&gt;}&lt;/span&gt;

&lt;span class="c"&gt;# google_bigquery_table.destination_tables["users"] will be created&lt;/span&gt;
  + resource &lt;span class="s2"&gt;"google_bigquery_table"&lt;/span&gt; &lt;span class="s2"&gt;"destination_tables"&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    ...
&lt;span class="o"&gt;}&lt;/span&gt;

 &lt;span class="c"&gt;# google_bigquery_table.destination_tables["total_cost_by_user"] will be created&lt;/span&gt;
  + resource &lt;span class="s2"&gt;"google_bigquery_table"&lt;/span&gt; &lt;span class="s2"&gt;"destination_tables"&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
      + dataset_id          &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"sldp_ds_3_demo_eu_dev"&lt;/span&gt;
      + description         &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Total cost by user and month. Granularity: [user_id, month]"&lt;/span&gt;
      + &lt;span class="nb"&gt;id&lt;/span&gt;                  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;known after apply&lt;span class="o"&gt;)&lt;/span&gt;
      + schema              &lt;span class="o"&gt;=&lt;/span&gt; jsonencode&lt;span class="o"&gt;(&lt;/span&gt;
            &lt;span class="o"&gt;[&lt;/span&gt;
              + &lt;span class="o"&gt;{&lt;/span&gt;
                  + description &lt;span class="o"&gt;=&lt;/span&gt; null
                  + mode        &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"NULLABLE"&lt;/span&gt;
                  + name        &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"email"&lt;/span&gt;
                  + policyTags  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
                      + names &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;[]&lt;/span&gt;
                    &lt;span class="o"&gt;}&lt;/span&gt;
                  + &lt;span class="nb"&gt;type&lt;/span&gt;        &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"STRING"&lt;/span&gt;
                &lt;span class="o"&gt;}&lt;/span&gt;,
              + &lt;span class="o"&gt;{&lt;/span&gt;
                  + description &lt;span class="o"&gt;=&lt;/span&gt; null
                  + mode        &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"NULLABLE"&lt;/span&gt;
                  + name        &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"month"&lt;/span&gt;
                  + policyTags  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
                      + names &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;[]&lt;/span&gt;
                    &lt;span class="o"&gt;}&lt;/span&gt;
                  + &lt;span class="nb"&gt;type&lt;/span&gt;        &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"DATE"&lt;/span&gt;
                &lt;span class="o"&gt;}&lt;/span&gt;,
        ...
     &lt;span class="o"&gt;])&lt;/span&gt;

&lt;span class="o"&gt;}&lt;/span&gt;

&lt;span class="c"&gt;# google_workflows_workflow.data_transfo["3-demo"] will be created&lt;/span&gt;
  + resource &lt;span class="s2"&gt;"google_workflows_workflow"&lt;/span&gt; &lt;span class="s2"&gt;"data_transfo"&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
      + create_time      &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;known after apply&lt;span class="o"&gt;)&lt;/span&gt;
      + description      &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;known after apply&lt;span class="o"&gt;)&lt;/span&gt;
      + effective_labels &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;known after apply&lt;span class="o"&gt;)&lt;/span&gt;
      + &lt;span class="nb"&gt;id&lt;/span&gt;               &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;known after apply&lt;span class="o"&gt;)&lt;/span&gt;
      + name             &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"wkf_datatransfo_3_demo_euw1_dev"&lt;/span&gt;
      + name_prefix      &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;known after apply&lt;span class="o"&gt;)&lt;/span&gt;
      + project          &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"sldp-front-dev"&lt;/span&gt;
      + region           &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"europe-west1"&lt;/span&gt;
      + revision_id      &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;known after apply&lt;span class="o"&gt;)&lt;/span&gt;
      + service_account  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"..."&lt;/span&gt;
      + source_contents  &lt;span class="o"&gt;=&lt;/span&gt; jsonencode&lt;span class="o"&gt;(&lt;/span&gt;
        &amp;lt;Coming from the Cloud Run backend mentioned above, called directly by terraform with the data http provider&amp;gt;
     &lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;


Plan: 5 to add, 0 to change, 0 to destroy.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The auto-generated Cloud Workflow DAG:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In the auto-generated Cloud Workflow, we can find 4 steps, one for each table. In our example above: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;3 steps can run in parallel (the Silver tables), for deduplication and typing. Here we use the &lt;a href="https://networkx.org/documentation/stable/reference/algorithms/generated/networkx.algorithms.dag.topological_generations.html" rel="noopener noreferrer"&gt;topological generations&lt;/a&gt; method on our graph.&lt;/li&gt;
&lt;li&gt;1 step for the Gold transformation, which needs to wait for the previous steps to finish, because the Silver tables are referenced by the Gold table.&lt;/li&gt;
&lt;/ul&gt;
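&lt;p&gt;For intuition, here is a minimal, dependency-free Python sketch of what the topological generations computation gives us (the real tool uses networkx; the table names below are just the demo’s):&lt;/p&gt;

```python
from collections import defaultdict

def topological_generations(edges):
    """Group nodes into 'generations': every node in a generation only
    depends on nodes from earlier generations, so a whole generation can
    run in parallel (a stdlib mirror of networkx's topological_generations)."""
    indegree = defaultdict(int)
    successors = defaultdict(list)
    nodes = set()
    for src, dst in edges:
        nodes.update((src, dst))
        indegree[dst] += 1
        successors[src].append(dst)
    # start with the nodes that depend on nothing
    current = sorted(n for n in nodes if indegree[n] == 0)
    generations = []
    while current:
        generations.append(current)
        ready = []
        for node in current:
            for succ in successors[node]:
                indegree[succ] -= 1
                if indegree[succ] == 0:
                    ready.append(succ)
        current = sorted(ready)
    return generations

# The demo pipeline: three Silver tables feeding one Gold table
edges = [
    ("orders", "total_cost_by_user"),
    ("users", "total_cost_by_user"),
    ("products", "total_cost_by_user"),
]
print(topological_generations(edges))
# [['orders', 'products', 'users'], ['total_cost_by_user']]
```

&lt;p&gt;The first generation becomes the parallel branch of the Workflow, the second one the step that waits on it.&lt;/p&gt;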

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F02i1mg2kk7lox9cs8764.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F02i1mg2kk7lox9cs8764.png" alt="Auto generated GCP Cloud Workflow DAG" width="725" height="699"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this workflow, each step will do the following:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Compile the query: in all our transformations we can use the Jinja templating language, and Workflows input parameters can be used in the transformation template. For example, we can use the “incremental” parameter to switch to a different transformation logic if we want to deal with incremental updates&lt;/li&gt;
&lt;li&gt;Run the BigQuery job (compiled query)&lt;/li&gt;
&lt;li&gt;Log the status of the job: the workflow publishes an event to a Pub/Sub topic that streams in real time into a BigQuery monitoring table, in order to track the status of every step of every workflow.&lt;/li&gt;
&lt;/ol&gt;
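&lt;p&gt;The “compile then run” logic of step 1 can be sketched in a few lines of Python. This is a simplified stand-in: the real tool renders Jinja templates, and the table and parameter names here are hypothetical:&lt;/p&gt;

```python
# Simplified sketch of step 1 (query compilation). Plain str.format stands
# in for Jinja here, and the names (silver.orders, run_date) are
# illustrative only, not the tool's actual identifiers.
def compile_query(template: str, params: dict) -> str:
    """Render a transformation template with the workflow's input parameters."""
    if params.get("incremental"):
        # incremental mode: only process the current partition
        where_clause = "WHERE DATE(insertion_time) = @run_date"
    else:
        where_clause = ""  # full refresh: take everything
    return template.format(where_clause=where_clause).strip()

template = "SELECT * FROM silver.orders {where_clause}"
print(compile_query(template, {"incremental": True}))
# SELECT * FROM silver.orders WHERE DATE(insertion_time) = @run_date
print(compile_query(template, {"incremental": False}))
# SELECT * FROM silver.orders
```

&lt;p&gt;The compiled string is then submitted as a BigQuery job (step 2) and its status published to Pub/Sub (step 3).&lt;/p&gt;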

&lt;h2&gt;
  
  
  More features…
&lt;/h2&gt;

&lt;p&gt;The experiment is now quite feature-rich; here are some of the features we added:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Every transformation can have SQL scripting pre-operations. The pre-operations are taken into account when processing the dependency graph (if you create temporary tables, for example) and run in the same BQ session as the main transformation. By the way, check out this great article by my friend Matthieu explaining the implementation in Python: &lt;a href="https://dev.to/stack-labs/bigquery-transactions-over-multiple-queries-with-sessions-2ll5"&gt;BigQuery transactions over multiple queries, with sessions&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;You can use Jinja templating in every transformation, with some common variables available at run time: in the workflow, every transformation step is first “compiled” before being sent to BigQuery.&lt;/li&gt;
&lt;li&gt;You can define custom query templates usable across the whole project: for example, a Merge template is available for everyone to implement a merge strategy on the final table instead of replace/append.&lt;/li&gt;
&lt;li&gt;All templates can implement an incremental alternative (using Jinja conditions). For example, the Default template appends data to the final table if the workflow is run in incremental mode, or overwrites the data in non-incremental mode.&lt;/li&gt;
&lt;li&gt;All the input parameters of your workflows can be used in Jinja templates.&lt;/li&gt;
&lt;li&gt;After every workflow step, a structured log event is published in real time to the monitoring Pub/Sub topic and immediately streamed into the monitoring BQ table.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;It works like a charm!&lt;/p&gt;

&lt;p&gt;This architecture has been used for a few months internally at &lt;a href="https://stack-labs.com/" rel="noopener noreferrer"&gt;Stack Labs&lt;/a&gt; to process our internal data pipelines: there are extremely few pipeline errors at runtime (even fewer than with Dataform, which sometimes lost its connection to the git provider), it’s very cost-effective (the DAG generation is completely free thanks to a few hacks), the custom templating system is very flexible for advanced data engineering use cases, and we now have proper custom monitoring logs at every transformation step to build real-time monitoring dashboards!&lt;/p&gt;

&lt;p&gt;So yes, it’s a very geeky approach, and the developer experience is local-first and git-oriented, but if like me you have a Software Engineering background, you will feel very comfortable doing Data Engineer/Analyst tasks using this approach. This will probably stay in the experimental phase, but it was fun designing a serverless, DevOps-oriented data transformation tool and applying graph theory in the solution. Feel free to ping me for the source code.&lt;/p&gt;

</description>
      <category>dataops</category>
      <category>bigquery</category>
      <category>googlecloud</category>
      <category>data</category>
    </item>
    <item>
      <title>Build the dependency graph of your BigQuery pipelines at no cost: a Python implementation</title>
      <dc:creator>Λ\: Clément Bosc</dc:creator>
      <pubDate>Thu, 11 Jan 2024 10:36:34 +0000</pubDate>
      <link>https://dev.to/stack-labs/build-the-dependency-graph-of-your-bigquery-pipelines-at-no-cost-a-python-implementation-1ik1</link>
      <guid>https://dev.to/stack-labs/build-the-dependency-graph-of-your-bigquery-pipelines-at-no-cost-a-python-implementation-1ik1</guid>
      <description>&lt;p&gt;Nowadays, in a lot of Data Stacks, most of the Data Engineering task is writing SQL. &lt;br&gt;
SQL is a powerful language that is well suited for building most batch data pipelines: it's universal, great for structured data and (most of the time) easy to write. On the other hand, one of the complexities can be orchestrating the series of SQL statements in the correct order of the dependencies, meaning that if a SQL query references the result of another one, the second should be run before the first. &lt;/p&gt;

&lt;p&gt;For a personal project I pursued the quest of automatically generating the graph of BigQuery transformation dependencies using a small Python script. I wanted to use graph theory and a SQL syntax parser. Let’s see how.&lt;/p&gt;
&lt;h2&gt;
  
  
  What are the dependencies in BigQuery SQL transformations?
&lt;/h2&gt;

&lt;p&gt;In BigQuery (or any other data warehouse, actually), transformations are typically chained to form a more or less complex DAG (Directed Acyclic Graph). In this DAG, each transformation is sourced from one or more tables and its result is dumped into a single table, which can in turn be used in other transformations, and so on. In this graph, all the nodes are tables (sources or destinations) and the edges are dependencies, meaning the transformation references the source tables in a FROM or JOIN clause to produce the target.&lt;/p&gt;

&lt;p&gt;Here is an example:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdv3zq9gfcf7qe5oev0fy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdv3zq9gfcf7qe5oev0fy.png" alt="Build Graph from BigQuery transformations" width="800" height="526"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here we can see that table D is produced from tables A &amp;amp; B, table E from tables D &amp;amp; C, and table F from tables D &amp;amp; E. From this diagram we can easily see that transformation D should run first, followed by E and finally F.&lt;/p&gt;
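&lt;p&gt;This run order can be recovered with a few lines of plain Python (a dependency-free sketch; in the project itself, networkx does this job):&lt;/p&gt;

```python
# Sketch of the diagram above: D is built from A and B, E from D and C,
# F from D and E. A topological sort of the dependency graph yields the
# order in which the transformations must run.
edges = [("A", "D"), ("B", "D"), ("D", "E"), ("C", "E"), ("D", "F"), ("E", "F")]
produced = {"D", "E", "F"}  # tables created by a transformation (A, B, C are raw sources)

# dependencies between transformations only (raw sources are always available)
deps = {t: {s for s, d in edges if d == t and s in produced} for t in produced}

run_order = []
while deps:
    done = set(run_order)
    # a transformation is ready once all the tables it reads are built
    ready = sorted(t for t, srcs in deps.items() if srcs.issubset(done))
    run_order.extend(ready)
    for t in ready:
        del deps[t]

print(run_order)
# ['D', 'E', 'F']
```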
&lt;h2&gt;
  
  
  Automatically infer the graph with Python
&lt;/h2&gt;

&lt;p&gt;In the project we used the Python lib &lt;a href="https://networkx.org/" rel="noopener noreferrer"&gt;networkx&lt;/a&gt; and a DiGraph object (Directed Graph). To detect a table reference in a query, we use &lt;a href="https://github.com/tobymao/sqlglot" rel="noopener noreferrer"&gt;sqlglot&lt;/a&gt;, a SQL parser (among other things) that works well with BigQuery.&lt;/p&gt;

&lt;p&gt;Here is the first part of the Python script to create the graph, simplified for this blog post.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;networkx&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;nx&lt;/span&gt;

&lt;span class="c1"&gt;# all the transfromations are stored in a structure
# here let's assume Transformation object only contains the
# query and the destination table (project_id, dataset_id,
# table_id)
&lt;/span&gt;
&lt;span class="n"&gt;transformations&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Transformation&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;graph&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DiGraph&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;transfo&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;transformations&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;dependencies&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;find_query_dependencies&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;transfo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Add nodes, the transfo infos are added as node metadata
&lt;/span&gt;    &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;transfo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;destination_table&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;transfo&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;transfo&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Add edges
&lt;/span&gt;    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;src&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;dependencies&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# here note that is the src does not exist yet are a
&lt;/span&gt;        &lt;span class="c1"&gt;# node it's created
&lt;/span&gt;        &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_edge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;transfo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;destination_table&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now let’s see how to find the dependencies of a SQL query using the sqlglot SQL parser:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sqlglot&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;parse_one&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;exp&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;find_query_dependencies&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;set&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Find all the tables in the query&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;catalog&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;parse_one&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bigquery&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;find_all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Table&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the piece of code above, we use the &lt;code&gt;parse_one&lt;/code&gt; function from sqlglot to parse the transformation query, using the BigQuery dialect, into a syntax tree that can be searched:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdgmg9hw1rl6h096uchv0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdgmg9hw1rl6h096uchv0.png" alt="sqlglot parse_one" width="741" height="679"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Automatically infer the schema of the output tables
&lt;/h2&gt;

&lt;p&gt;Another cool feature we can add to our script is the ability to auto-detect, with high precision, the expected output schema of all our tables (D, E and F in the example), even if they haven't been created yet. This can be very helpful if we want to create the tables with an Infrastructure as Code tool like Terraform before the transformations even run.&lt;br&gt;
For this feature, I used the following BigQuery trick: we can run a query with a &lt;code&gt;LIMIT 0&lt;/code&gt; at the end of the &lt;code&gt;SELECT&lt;/code&gt; statement. The awesome part is that BigQuery won't charge anything (0 bytes billed) but will still create a temporary output table with the expected schema (including &lt;code&gt;NULLABLE&lt;/code&gt; / &lt;code&gt;REQUIRED&lt;/code&gt; modes)!&lt;br&gt;
To generate the query with &lt;code&gt;LIMIT 0&lt;/code&gt;, we need to add it to every &lt;code&gt;SELECT&lt;/code&gt; statement in the query (including all subqueries). Let’s use sqlglot again by defining a SQL transformer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;limit_transformer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;This sqlglot transformer function add a limit 0 to
       every SELECT stmnt&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Select&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;limit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt;

&lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
WITH source_A as (
   SELECT &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;HelloWorld&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; as hello
), source_B as (
   SELECT CURRENT_DATE() as date
)
SELECT * 
FROM source_A, source_B
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="n"&gt;sample_query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nf"&gt;parse_one&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dialect&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Dialects&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;BIGQUERY&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;limit_transformer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dialect&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Dialects&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;BIGQUERY&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sample_query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# =====================
&lt;/span&gt;
&lt;span class="c1"&gt;# WITH source_A as (
#    SELECT "HelloWorld" as hello LIMIT 0
# ), source_B as (
#    SELECT CURRENT_DATE() as date LIMIT 0
# )
# SELECT * 
# FROM source_A, source_B
# LIMIT 0
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once we have our new query with all the &lt;code&gt;LIMIT 0&lt;/code&gt; clauses, we need to create a BQ job that runs it, free of charge!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.cloud&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;bigquery&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;fetch_destination_schema&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;span class="n"&gt;bq_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bigquery&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;query_job&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bq_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;query_job&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;result&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Fetch the temporary table schema created by BigQuery
&lt;/span&gt;&lt;span class="n"&gt;tmp_dest_table&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bq_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query_job&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;destination&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;destination_schema&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tmp_dest_table&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;schema&lt;/span&gt;

&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;destination_schema&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query_job&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;destination&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, in order to generate the output schema of all the tables in the graph, even the last one, which depends on tables that have not yet been created (here, tables D and E will only be temporary tables, not real tables), we need to take our schema generation method and apply it to each node in the DiGraph in the "correct" order.&lt;br&gt;
In graph theory this is called the topological order: in this example, first table D, then E, then F. For each node, we replace the reference to the real table in the body of the transformation with the reference to the previously created temporary table. This way, even if neither table D nor E exists, we can still deduce the final schema of table F!&lt;/p&gt;
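&lt;p&gt;To make the notion of topological order concrete, here is a minimal, dependency-free sketch of Kahn's algorithm (the exact edge list is an assumed illustration of the A-to-F example graph, not taken from the actual tool; the real script uses &lt;code&gt;nx.topological_sort&lt;/code&gt;):&lt;/p&gt;

```python
# Assumed example edges: A, B, C are source tables; D, E, F are
# produced by transformations (F depends on D and E).
edges = [("A", "D"), ("B", "D"), ("B", "E"), ("C", "E"), ("D", "F"), ("E", "F")]

nodes = {n for edge in edges for n in edge}

# Count incoming edges (unresolved dependencies) for each node
in_degree = {n: 0 for n in nodes}
for _, dst in edges:
    in_degree[dst] += 1

# Repeatedly emit nodes whose dependencies are all satisfied
order = []
ready = sorted(n for n in nodes if in_degree[n] == 0)
while ready:
    node = ready.pop(0)
    order.append(node)
    for src, dst in edges:
        if src == node:
            in_degree[dst] -= 1
            if in_degree[dst] == 0:
                ready.append(dst)

print(order)  # sources first, then D and E, then F
```

Every table appears only after all of its sources, which is exactly the order in which the schema-inference jobs must run.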

&lt;p&gt;This process can be illustrated like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi09hxlq3sbjdib495yfe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi09hxlq3sbjdib495yfe.png" alt="automatically infer data transformation schemas bigquery" width="800" height="538"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;node_id&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;nx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;topological_sort&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# here the "transfo" key is where the transformation
&lt;/span&gt;    &lt;span class="c1"&gt;# metadata have been stored in the node
&lt;/span&gt;    &lt;span class="n"&gt;query_to_run&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nodes&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;node_id&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;transfo&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;
    &lt;span class="n"&gt;ancestors&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ancestors&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;node_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# We exclude all the ancestors that don't have transfo
&lt;/span&gt;    &lt;span class="c1"&gt;# metadata, meaning all nodes that are not and
&lt;/span&gt;    &lt;span class="c1"&gt;# intermediary output table
&lt;/span&gt;    &lt;span class="n"&gt;ancestors&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;transfo&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nodes&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;ancestors&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;ancestor_table&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;ancestors&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;query_to_run&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;query_to_run&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;ancestor_table&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nodes&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;ancestor_table&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;transfo&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;tmp_table&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tmp_table&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;fetch_destination_schema&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query_to_run&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nodes&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;node_id&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;transfo&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;tmp_table&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tmp_table&lt;/span&gt;

&lt;span class="c1"&gt;# This will be the exact schema of the last transformation
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
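&lt;p&gt;As mentioned earlier, the inferred schema can then feed an Infrastructure as Code workflow. Here is a minimal sketch that serializes it to the JSON format accepted by the &lt;code&gt;schema&lt;/code&gt; attribute of the &lt;code&gt;google_bigquery_table&lt;/code&gt; Terraform resource (the field tuples are hypothetical sample data; real code would read &lt;code&gt;name&lt;/code&gt;, &lt;code&gt;field_type&lt;/code&gt; and &lt;code&gt;mode&lt;/code&gt; from the &lt;code&gt;SchemaField&lt;/code&gt; objects returned by &lt;code&gt;fetch_destination_schema&lt;/code&gt;):&lt;/p&gt;

```python
import json

# Hypothetical sample standing in for the inferred SchemaField list
inferred_schema = [
    ("hello", "STRING", "NULLABLE"),
    ("date", "DATE", "NULLABLE"),
]

# google_bigquery_table expects a JSON array of field objects
terraform_schema = json.dumps(
    [{"name": n, "type": t, "mode": m} for n, t, m in inferred_schema],
    indent=2,
)
print(terraform_schema)
```

The resulting string can be written to a file and referenced from Terraform, so the tables exist with the right schema before the first transformation runs.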



&lt;p&gt;And voilà! With this graph generation technique and a few BigQuery tricks, we were able to automatically infer a dependency graph of complex SQL transformations, as well as the exact final schema of all tables, without any tables being created and at zero cost!&lt;br&gt;
In the next blog post we will see how I've applied and improved this technique to build an experimental, Ops-oriented data orchestration tool!&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Part 5. Provision Azure resources with Terraform from GCP with token exchange</title>
      <dc:creator>Λ\: Clément Bosc</dc:creator>
      <pubDate>Tue, 14 Feb 2023 13:28:18 +0000</pubDate>
      <link>https://dev.to/stack-labs/multi-cloud-identity-federation-explained-part-5-provision-azure-resources-with-terraform-from-gcp-with-token-exchange-3833</link>
      <guid>https://dev.to/stack-labs/multi-cloud-identity-federation-explained-part-5-provision-azure-resources-with-terraform-from-gcp-with-token-exchange-3833</guid>
      <description>&lt;p&gt;A small bonus for a use case we had in my project : all our CI is on GCP, using Cloud Build and I wanted to create Azure Resource along with Google resources with Terraform by exchanging my Cloud Build service account identity for an equivalent Azure identity.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Favx5g7v3swb11eozd8ph.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Favx5g7v3swb11eozd8ph.png" alt="Provision Azure resources with Terraform from GCP with token exchange" width="800" height="234"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you don’t know what I am referring to with identity federation and multi-cloud token exchange, and to understand the prerequisites, make sure to catch up with the previous articles linked above.&lt;/p&gt;

&lt;h2&gt;
  
  
  Create Azure resources with Terraform from GCP
&lt;/h2&gt;

&lt;p&gt;Remember &lt;a href="https://dev.to/stack-labs/multi-cloud-identity-federation-explained-part-3-token-exchange-from-gcp-to-azure-4o8e"&gt;Part 3 of this series&lt;/a&gt;: we needed to exchange a Google identity token for an Azure access token, and we used a &lt;code&gt;curl&lt;/code&gt; request to the AAD Authorization Server. Let’s do the same with Terraform, using the &lt;a href="https://registry.terraform.io/providers/hashicorp/http/latest/docs" rel="noopener noreferrer"&gt;http provider&lt;/a&gt;!&lt;/p&gt;

&lt;p&gt;The http provider is a provider maintained by HashiCorp that cannot create resources; it can only make HTTP requests, in the form of Terraform Data Sources.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. &lt;strong&gt;Add Terraform providers&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;First let’s add the required providers: &lt;code&gt;google&lt;/code&gt; for creating GCP resources and getting the ID token, &lt;code&gt;http&lt;/code&gt; to exchange the token and &lt;code&gt;azurerm&lt;/code&gt; to create Azure resources.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;terraform&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="n"&gt;required_providers&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;google&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="n"&gt;source&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hashicorp/google&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
      &lt;span class="n"&gt;version&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;4.52.0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;http&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="n"&gt;source&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hashicorp/http&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
      &lt;span class="n"&gt;version&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;3.2.1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;azurerm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="n"&gt;source&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hashicorp/azurerm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
      &lt;span class="n"&gt;version&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;3.42.0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Simply configure the Google provider and use Application Default Credentials
&lt;/span&gt;&lt;span class="n"&gt;provider&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;google&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="n"&gt;project&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;google_project_id&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. &lt;strong&gt;Generate the Google ID token (JWT token)&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;google_service_account_id_token&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;oidc&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;# the GCP SA mapped to Azure App Registration
&lt;/span&gt;  &lt;span class="n"&gt;target_service_account&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;target_service_account&lt;/span&gt;
  &lt;span class="n"&gt;target_audience&lt;/span&gt;        &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;api://AzureADTokenExchange&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3. &lt;strong&gt;Call the Azure Authorization Server to Exchange the access token&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Here we first build the payload in a local variable using the same parameters described in &lt;a href="https://dev.to/stack-labs/multi-cloud-identity-federation-explained-part-3-token-exchange-from-gcp-to-azure-4o8e"&gt;Part 3 of this series&lt;/a&gt;. We finally query the Authorization Server with the &lt;code&gt;data "http" "azure_id_token"&lt;/code&gt; Data Source.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nb"&gt;locals&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="n"&gt;azure_id_token_request_body_obj&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;client_id&lt;/span&gt;             &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;client_id&lt;/span&gt;
    &lt;span class="n"&gt;scope&lt;/span&gt;                 &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.default&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;client_assertion_type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;urn:ietf:params:oauth:client-assertion-type:jwt-bearer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;client_assertion&lt;/span&gt;      &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;google_service_account_id_token&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;oidc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id_token&lt;/span&gt;
    &lt;span class="n"&gt;grant_type&lt;/span&gt;            &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;client_credentials&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="n"&gt;azure_id_token_request_body&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;amp;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;formatlist&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;%s=%s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;keys&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;local&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;azure_id_token_request_body_obj&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="nf"&gt;values&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;local&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;azure_id_token_request_body_obj&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;azure_id_token&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="n"&gt;url&lt;/span&gt;    &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;${var.aad_authority}${var.azure_tenant_id}/oauth2/v2.0/token&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
  &lt;span class="n"&gt;method&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GET&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

  &lt;span class="n"&gt;request_headers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;Content&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;Type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;application/x-www-form-urlencoded&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="n"&gt;request_body&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;local&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;azure_id_token_request_body&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
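&lt;p&gt;For reference, the body built with &lt;code&gt;join&lt;/code&gt;/&lt;code&gt;formatlist&lt;/code&gt; above is just a form-encoded string. The equivalent in Python looks like this (all values are placeholders; note that &lt;code&gt;urlencode&lt;/code&gt; also percent-encodes the values, which the Terraform version does not):&lt;/p&gt;

```python
from urllib.parse import urlencode

# Placeholder values standing in for the Terraform variables
params = {
    "client_id": "00000000-0000-0000-0000-000000000000",
    "scope": ".default",
    "client_assertion_type": "urn:ietf:params:oauth:client-assertion-type:jwt-bearer",
    "client_assertion": "eyJhbGciOi...",  # the Google ID token goes here
    "grant_type": "client_credentials",
}

# Produces "client_id=...&scope=...&..." ready for the token endpoint
request_body = urlencode(params)
print(request_body)
```

Since the Google ID token (a JWT) only contains URL-safe characters, both encodings produce a body the token endpoint accepts.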



&lt;h3&gt;
  
  
  4. &lt;strong&gt;Configure Azure provider and create resources!&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;You can now configure the &lt;code&gt;azurerm&lt;/code&gt; provider by setting &lt;code&gt;oidc_token&lt;/code&gt; to the resulting exchanged access_token. You are ready to create Azure resources alongside your GCP resources! Make sure to grant the correct IAM role to the target App Registration, depending on which resources you want to create.&lt;/p&gt;

&lt;p&gt;Here we create a resource_group and an Azure Storage Account 💾&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;provider &lt;span class="s2"&gt;"azurerm"&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    features &lt;span class="o"&gt;{}&lt;/span&gt;

    oidc_token &lt;span class="o"&gt;=&lt;/span&gt; jsondecode&lt;span class="o"&gt;(&lt;/span&gt;data.http.azure_id_token.body&lt;span class="o"&gt;)&lt;/span&gt;.access_token
&lt;span class="o"&gt;}&lt;/span&gt;

resource &lt;span class="s2"&gt;"azurerm_resource_group"&lt;/span&gt; &lt;span class="s2"&gt;"example"&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
  name     &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"example-resources"&lt;/span&gt;
  location &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"West Europe"&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;

resource &lt;span class="s2"&gt;"azurerm_storage_account"&lt;/span&gt; &lt;span class="s2"&gt;"example"&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
  name                     &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"storageaccountname"&lt;/span&gt;
  resource_group_name      &lt;span class="o"&gt;=&lt;/span&gt; azurerm_resource_group.example.name
  location                 &lt;span class="o"&gt;=&lt;/span&gt; azurerm_resource_group.example.location
  account_tier             &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Standard"&lt;/span&gt;
  account_replication_type &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"GRS"&lt;/span&gt;

  tags &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    environment &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"staging"&lt;/span&gt;
  &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the end of the multi-cloud service-to-service identity federation series.&lt;/p&gt;

&lt;p&gt;After this series of articles you know how to set up identity federation between GCP and Azure in a secure way. We saw what access tokens and ID tokens are and how they are used by cloud providers. We saw the steps to exchange a Google ID token for an Azure access token, and how to impersonate a GCP service account from an Azure App Registration using Workload Identity Federation. Finally, with &lt;a href="https://dev.to/stack-labs/multi-cloud-identity-federation-explained-part-4-implement-token-exchange-between-azure-and-gcp-in-python-1gop"&gt;Part 4&lt;/a&gt; and Part 5 we detailed concrete implementations in Python and Terraform for your production applications.&lt;/p&gt;

&lt;p&gt;If you keep exposing service account keys or secrets after this, you have no excuse!&lt;/p&gt;

&lt;p&gt;Thanks for reading! I'm Clement, Data Engineer at &lt;a href="https://stack-labs.com/" rel="noopener noreferrer"&gt;Stack Labs&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If you want to discover the &lt;a href="https://cloud.stack-labs.com/cloud-data-platform" rel="noopener noreferrer"&gt;Stack Labs Data Platform&lt;/a&gt; or join an enthusiast &lt;a href="https://www.welcometothejungle.com/fr/companies/stack-labs" rel="noopener noreferrer"&gt;data engineering team&lt;/a&gt; and work on awesome technical subjects, please contact us.&lt;/p&gt;

</description>
      <category>gratitude</category>
    </item>
    <item>
      <title>Part 4. Implement token exchange between Azure and GCP in Python</title>
      <dc:creator>Λ\: Clément Bosc</dc:creator>
      <pubDate>Tue, 14 Feb 2023 13:24:56 +0000</pubDate>
      <link>https://dev.to/stack-labs/multi-cloud-identity-federation-explained-part-4-implement-token-exchange-between-azure-and-gcp-in-python-1gop</link>
      <guid>https://dev.to/stack-labs/multi-cloud-identity-federation-explained-part-4-implement-token-exchange-between-azure-and-gcp-in-python-1gop</guid>
      <description>&lt;p&gt;In the previous three article of the multi-cloud identity federation series we discussed about access token, identity token, how to differentiate them and how to exchange service identity between Google Cloud and Azure without exposing your keys and secrets. If you don’t know want I am referring to, make sure to catch up with the links above.&lt;/p&gt;

&lt;p&gt;Now let’s see the Python implementation for your production applications: first from an Azure environment to impersonate a Google Cloud service account, then from GCP to impersonate an Azure App Registration.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. From an Azure environment: impersonate a GCP service account
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Generate the Azure access token
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;a. Use the Azure Identity library to generate an access token&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You can use this option if your application runs in a compute instance that has access to the &lt;a href="https://learn.microsoft.com/en-us/azure/virtual-machines/windows/instance-metadata-service?tabs=windows" rel="noopener noreferrer"&gt;Azure Instance Metadata Service (IMDS)&lt;/a&gt;. It’s the recommended method because you do not have to store secrets in the instance or in environment variables, but it’s not always available depending on your use case.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: the &lt;code&gt;azure.identity&lt;/code&gt; module is part of the &lt;a href="https://pypi.org/project/azure-identity/" rel="noopener noreferrer"&gt;azure-identity package&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;azure.identity&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DefaultAzureCredential&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;azure.identity&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AzureAuthorityHosts&lt;/span&gt;

&lt;span class="n"&gt;default_credential&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;DefaultAzureCredential&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;authority&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;AzureAuthorityHosts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AZURE_PUBLIC_CLOUD&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;azure_access_token&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;default_credential&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_token&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;scopes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;APPLICATION_ID&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.default&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
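&lt;p&gt;Before exchanging the Azure token, it can help to peek at its claims (audience, issuer) to debug your federation configuration. A minimal sketch, not from the original article, that decodes the JWT payload without verifying the signature (signature validation is the receiving side’s job):&lt;/p&gt;

```python
import base64
import json


def jwt_claims(token: str) -> dict:
    """Decode the payload of a JWT without verifying its signature (debug only)."""
    payload = token.split(".")[1]
    # restore the base64 padding that the JWT encoding strips
    payload += "=" * (-len(payload) % 4)
    return json.loads(base64.urlsafe_b64decode(payload))


# e.g. jwt_claims(azure_access_token)["aud"] should match the audience
# configured on your GCP Workload Identity provider
```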



&lt;p&gt;&lt;strong&gt;b. Use the MSAL library with your client_id and client_secret&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you do not have access to IMDS, you can still expose the CLIENT_SECRET and CLIENT_ID (App ID) as environment variables of your application or, preferably, store and retrieve them from a Key Vault. You can then use the &lt;a href="https://learn.microsoft.com/en-us/azure/active-directory/develop/msal-overview" rel="noopener noreferrer"&gt;MSAL&lt;/a&gt; &lt;code&gt;ConfidentialClientApplication&lt;/code&gt; to get your App Registration access token.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;msal&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ConfidentialClientApplication&lt;/span&gt;

&lt;span class="n"&gt;CLIENT_SECRET&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CLIENT_SECRET&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;TENANT_ID&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;TENANT_ID&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;APPLICATION_ID&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;APPLICATION_ID&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ConfidentialClientApplication&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;client_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;APPLICATION_ID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;client_credential&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;CLIENT_SECRET&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;authority&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;AzureAuthorityHosts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AZURE_PUBLIC_CLOUD&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;TENANT_ID&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;azure_access_token&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;acquire_token_for_client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;scopes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;APPLICATION_ID&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/.default&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;access_token&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
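&lt;p&gt;Note that &lt;code&gt;acquire_token_for_client&lt;/code&gt; returns a dict, and indexing &lt;code&gt;["access_token"]&lt;/code&gt; directly raises a bare &lt;code&gt;KeyError&lt;/code&gt; when the request fails. A small helper (the function name is mine, not from the article) can surface the AAD error message instead:&lt;/p&gt;

```python
def extract_access_token(result: dict) -> str:
    """Return the access token from an MSAL result dict, or raise with the AAD error."""
    if "access_token" in result:
        return result["access_token"]
    # failed MSAL results carry "error" and "error_description" keys
    raise RuntimeError(
        f"token request failed: {result.get('error')}: {result.get('error_description')}"
    )


# usage sketch:
# azure_access_token = extract_access_token(app.acquire_token_for_client(scopes=[...]))
```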



&lt;h3&gt;
  
  
  Use Google’s STS client to get a federated token via Workload Identity Federation
&lt;/h3&gt;

&lt;p&gt;The second step of the token exchange process is to request a short-lived token from the Google STS API. Make sure you understand the parameters detailed in &lt;a href="https://dev.to/stack-labs/multi-cloud-identity-federation-explained-part-2-token-exchange-from-azure-to-gcp-5a1f"&gt;article Part 2&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.oauth2.sts&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.auth.transport.requests&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Request&lt;/span&gt;

&lt;span class="n"&gt;GCP_PROJECT_NUMBER&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PROJECT_NUMER&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;GCP_PROJECT_ID&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GCP_PROJECT_ID&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;POOL_ID&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;POOL_ID&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;PROVIDER_ID&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PROVIDER_ID&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;sts_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;token_exchange_endpoint&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://sts.googleapis.com/v1/token&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sts_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exchange_token&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="n"&gt;audience&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;//iam.googleapis.com/projects/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;GCP_PROJECT_NUMBER&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/locations/global/workloadIdentityPools/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;POOL_ID&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/providers/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;PROVIDER_ID&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;grant_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;urn:ietf:params:oauth:grant-type:token-exchange&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;subject_token&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;azure_access_token&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;scopes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://www.googleapis.com/auth/cloud-platform&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;subject_token_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;urn:ietf:params:oauth:token-type:jwt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;requested_token_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;urn:ietf:params:oauth:token-type:access_token&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;sts_access_token&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;access_token&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
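&lt;p&gt;The &lt;code&gt;audience&lt;/code&gt; parameter is the most error-prone part of the STS call: it must be the full resource name of your Workload Identity Pool provider, prefixed with &lt;code&gt;//iam.googleapis.com&lt;/code&gt; and without an &lt;code&gt;https:&lt;/code&gt; scheme. A small sketch (the helper name is mine, not from the article):&lt;/p&gt;

```python
def sts_audience(project_number: str, pool_id: str, provider_id: str) -> str:
    """Build the audience string expected by Google STS for a Workload Identity provider."""
    return (
        f"//iam.googleapis.com/projects/{project_number}"
        f"/locations/global/workloadIdentityPools/{pool_id}"
        f"/providers/{provider_id}"
    )


print(sts_audience("123456789", "azure-pool", "azure-provider"))
# -> //iam.googleapis.com/projects/123456789/locations/global/workloadIdentityPools/azure-pool/providers/azure-provider
```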



&lt;h3&gt;
  
  
  Impersonate the target service account with STS token
&lt;/h3&gt;

&lt;p&gt;Once you have your STS token (a federated token), you can finally impersonate the target service account (assuming you granted the correct role to your &lt;a href="https://dev.to/stack-labs/multi-cloud-identity-federation-explained-part-2-token-exchange-from-azure-to-gcp-5a1f#4-a-new-gcp-principal-the-principalset"&gt;Workload Identity PrincipalSet&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Create the target credential object&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.oauth2.credentials&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Credentials&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.auth&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;impersonated_credentials&lt;/span&gt;

&lt;span class="n"&gt;TARGET_SERVICE_ACCOUNT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;TARGET_SERVICE_ACCOUNT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;sts_credentials&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Credentials&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;sts_access_token&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;credentials&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;impersonated_credentials&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Credentials&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;source_credentials&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;sts_credentials&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;target_principal&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;TARGET_SERVICE_ACCOUNT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;target_scopes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://www.googleapis.com/auth/cloud-platform&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="n"&gt;lifetime&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;credentials&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;refresh&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Call your Google API (here BigQuery) from Azure environment&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Now that the token exchange process is complete, you can call any API the target service account has access to, using the corresponding client library (here, BigQuery).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.cloud&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;bigquery&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bigquery&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;credentials&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;credentials&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;project&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;GCP_PROJECT_ID&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Here my TARGET_SERVICE_ACCOUNT has bigquery.jobUser role.
&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SELECT CURRENT_DATE() as date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;query_job&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Make an API request.
&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The query data:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;query_job&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="c1"&gt;# It works !
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  2. From Google Cloud: impersonate an Azure App
&lt;/h2&gt;

&lt;p&gt;Let’s see the Python implementation from the other perspective: impersonating an Azure App from a GCP environment. This process is detailed in &lt;a href="https://dev.to/stack-labs/multi-cloud-identity-federation-explained-part-3-token-exchange-from-gcp-to-azure-4o8e"&gt;Part 3 of the series&lt;/a&gt;; make sure to read it to understand the process.&lt;/p&gt;

&lt;h3&gt;
  
  
  Implement a Python credential class from TokenCredential
&lt;/h3&gt;

&lt;p&gt;Most Microsoft client libraries can take a credential instance as an argument. Even though most of the time it’s a &lt;code&gt;DefaultAzureCredential&lt;/code&gt; or a &lt;code&gt;ConfidentialClientApplication&lt;/code&gt;, you can create your own by implementing the &lt;code&gt;TokenCredential&lt;/code&gt; interface. The class must implement the &lt;code&gt;get_token&lt;/code&gt; method, which the client library calls when authenticating.&lt;/p&gt;

&lt;p&gt;Here we first perform the Google ID token generation by querying the Google Metadata Server, then we use the &lt;code&gt;ConfidentialClientApplication&lt;/code&gt; with the ID token as &lt;code&gt;client_assertion&lt;/code&gt; to get the federated token.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;azure.core.credentials&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;TokenCredential&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;AccessToken&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;msal&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ConfidentialClientApplication&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.auth.transport.requests&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Request&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;GoogleAssertionCredential&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;TokenCredential&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;azure_client_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;azure_tenant_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;azure_authority_host&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# create a confidential client application
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ConfidentialClientApplication&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;azure_client_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;client_credential&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;client_assertion&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_get_google_id_token&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="n"&gt;authority&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;azure_authority_host&lt;/span&gt;&lt;span class="si"&gt;}{&lt;/span&gt;&lt;span class="n"&gt;azure_tenant_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_get_google_id_token&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Request an ID token to the Metadata Server&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;()(&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;GOOGLE_METADATA_API&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/instance/service-accounts/default/identity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;?audience=api://AzureADTokenExchange&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;method&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GET&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Metadata-Flavor&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Google&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
          &lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_token&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;scopes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;claims&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;tenant_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;AccessToken&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# get the token using the application
&lt;/span&gt;        &lt;span class="n"&gt;token&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;acquire_token_for_client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scopes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;error&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;error_description&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="n"&gt;expires_on&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;expires_in&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="c1"&gt;# return an access token with the token string and expiration time
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;AccessToken&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;access_token&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;expires_on&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: token generation with the Metadata Server only works for an app deployed on GCP. To test locally, you can use a service account key file instead and generate the ID token like this (e.g. in the body of &lt;code&gt;_get_google_id_token&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;credentials&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;IDTokenCredentials&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_service_account_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;GOOGLE_APPLICATION_CREDENTIALS&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;target_audience&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;api://AzureADTokenExchange&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;credentials&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;refresh&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;credentials&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Instantiate the GoogleAssertionCredential and call the final Azure API
&lt;/h3&gt;

&lt;p&gt;Finally, you can call any API the Azure App registration has access to. Just instantiate the &lt;code&gt;GoogleAssertionCredential&lt;/code&gt; with your target Azure App’s CLIENT_ID and TENANT_ID, and pass it to the client library (here &lt;code&gt;BlobServiceClient&lt;/code&gt;, assuming the App registration has the Contributor role on the Azure Storage Account).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;CLIENT_ID&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CLIENT_ID&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;TENANT_ID&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;TENANT_ID&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;creds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;GoogleAssertionCredential&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;azure_client_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;CLIENT_ID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;azure_tenant_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;TENANT_ID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;azure_authority_host&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;AzureAuthorityHosts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AZURE_PUBLIC_CLOUD&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;STORAGE_ACCOUNT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;STORAGE_ACCOUNT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;CONTAINER&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CONTAINER&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="c1"&gt;# Here the App registration is Contributor of the Azure storage account
&lt;/span&gt;&lt;span class="n"&gt;blob_service_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BlobServiceClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;STORAGE_ACCOUNT&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.blob.core.windows.net&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;credential&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;creds&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;container_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;blob_service_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_container_client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;container&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;CONTAINER&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;blob&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;container_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;list_blob_names&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;blob&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="c1"&gt;# It works !
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We just saw how to concretely impersonate service identities between Google Cloud and Azure in production with Python. Keep in mind these good practices:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Avoid storing secrets when you do not need to: both clouds provide a Metadata Server&lt;/li&gt;
&lt;li&gt;Use the correct audience or scope for just what you need to do, so if the token leaks, a thief can only use it against the target service, and only until the token expires (less than one hour)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We will see in the next and final part of this multi-cloud series how to exchange tokens using Terraform to create Azure resources from Google Cloud Build.&lt;/p&gt;

</description>
      <category>googlecloud</category>
      <category>azure</category>
      <category>multicloud</category>
      <category>python</category>
    </item>
    <item>
      <title>Part 3. Token exchange from GCP to Azure</title>
      <dc:creator>Λ\: Clément Bosc</dc:creator>
      <pubDate>Tue, 14 Feb 2023 13:22:32 +0000</pubDate>
      <link>https://dev.to/stack-labs/multi-cloud-identity-federation-explained-part-3-token-exchange-from-gcp-to-azure-4o8e</link>
      <guid>https://dev.to/stack-labs/multi-cloud-identity-federation-explained-part-3-token-exchange-from-gcp-to-azure-4o8e</guid>
      <description>&lt;p&gt;In the previous article of this multi-cloud identity federation series, we saw how to securely exchange an Azure access token (in the form of a JWT) for a GCP access token, to access private GCP resources from Azure. If you are wondering how to do the reverse operation, you are in the right spot!&lt;/p&gt;

&lt;h2&gt;
  
  
  The big picture
&lt;/h2&gt;

&lt;p&gt;To request Azure APIs from a GCP environment, we need the same two objects as before: a GCP service account and an Azure App Registration. The process is straightforward: since there is no Workload Identity Federation-like product on Azure, everything happens in the App registration configuration:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Generate a GCP ID token for the source service account, either via the Metadata Server (the recommended way for production applications) or via the CLI or IAM REST API (which requires impersonation permissions on the service account)&lt;/li&gt;
&lt;li&gt;Ask the Azure OAuth2 Authorization server to exchange the token for an Azure access token representing the target App registration.&lt;/li&gt;
&lt;li&gt;Enjoy your APIs requests 🙂&lt;/li&gt;
&lt;/ul&gt;
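
&lt;p&gt;As an illustrative sketch in Python (the helper names below are ours, not from any SDK), the first two steps boil down to one Metadata Server GET and one token-endpoint form; the endpoints and parameter names mirror the curl examples later in this article:&lt;/p&gt;

```python
import urllib.parse
import urllib.request

# Metadata Server endpoint that serves ID tokens for the attached service account
METADATA_ID_TOKEN_URL = (
    "http://metadata/computeMetadata/v1/instance/"
    "service-accounts/default/identity"
)


def metadata_id_token_request(audience: str) -> urllib.request.Request:
    """Step 1: build the Metadata Server request for a GCP ID token."""
    query = urllib.parse.urlencode({"audience": audience})
    return urllib.request.Request(
        f"{METADATA_ID_TOKEN_URL}?{query}",
        headers={"Metadata-Flavor": "Google"},
    )


def azure_token_exchange_form(client_id: str, scope: str, gcp_id_token: str) -> dict:
    """Step 2: form fields for the Azure OAuth2 token endpoint."""
    return {
        "client_id": client_id,
        "scope": scope,
        "client_assertion_type": "urn:ietf:params:oauth:client-assertion-type:jwt-bearer",
        "client_assertion": gcp_id_token,
        "grant_type": "client_credentials",
    }
```

&lt;p&gt;Opening the first request with &lt;code&gt;urllib.request.urlopen&lt;/code&gt; only works inside a GCP compute context, and the form is meant to be POSTed to your tenant's &lt;code&gt;/oauth2/v2.0/token&lt;/code&gt; endpoint.&lt;/p&gt;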

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyl1ljldu2zbhlvbqgsuq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyl1ljldu2zbhlvbqgsuq.png" alt="Token exchange between Google Cloud and Azure" width="800" height="133"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Generate a GCP ID token
&lt;/h2&gt;

&lt;p&gt;First you need to generate an ID token on behalf of the source service account. Why an ID token and not an access token? Because access tokens on GCP are opaque and cannot be decoded! We need a JWT here: Azure must be able to inspect the content of the token to map it to the App registration on the other side and to check that the issuer is Google.&lt;/p&gt;

&lt;p&gt;For this step you can either use the Metadata Server if your workload is running in a GCP compute context (the recommended way for production applications), or use the &lt;code&gt;gcloud&lt;/code&gt; CLI (or any other method available). The result will be a valid identity token. Note that the audience must match the audience you will configure in the next step. The recommended value (according to Azure) is &lt;code&gt;api://AzureADTokenExchange&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;a. Using the magic Metadata Server URL&lt;/strong&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Metadata-Flavor: Google"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s1"&gt;'http://metadata/computeMetadata/v1/instance/service-accounts/default/identity?audience=api://AzureADTokenExchange'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;b. Using the gcloud CLI (for testing purposes only)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud auth print-identity-token &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--impersonate-service-account&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;SOURCE_SERVICE_ACCOUNT_EMAIL &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--audiences&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;api://AzureADTokenExchange
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  2. Create a federated credential in your Azure App Registration
&lt;/h2&gt;

&lt;p&gt;Secondly, configure your target Azure App Registration to allow impersonation from the source GCP service account. This operation happens in the &lt;em&gt;Federated Credential&lt;/em&gt; section of your App registration in Azure Active Directory. You will need to specify the trusted issuer, the subject and the audience.&lt;/p&gt;

&lt;p&gt;How do you get these values? By decoding the GCP identity token, of course! As usual, you can go to &lt;a href="https://jwt.io/" rel="noopener noreferrer"&gt;https://jwt.io/&lt;/a&gt; and inspect the content of your token’s payload:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="s2"&gt;"aud"&lt;/span&gt;: &lt;span class="s2"&gt;"api://AzureADTokenExchange"&lt;/span&gt;,
  &lt;span class="s2"&gt;"azp"&lt;/span&gt;: &lt;span class="s2"&gt;"106697322240068434726"&lt;/span&gt;,
  &lt;span class="s2"&gt;"exp"&lt;/span&gt;: 1676301170,
  &lt;span class="s2"&gt;"iat"&lt;/span&gt;: 1676297570,
  &lt;span class="s2"&gt;"iss"&lt;/span&gt;: &lt;span class="s2"&gt;"https://accounts.google.com"&lt;/span&gt;,
  &lt;span class="s2"&gt;"sub"&lt;/span&gt;: &lt;span class="s2"&gt;"106697322240068434726"&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;iss&lt;/code&gt; = &lt;a href="https://accounts.google.com" rel="noopener noreferrer"&gt;https://accounts.google.com&lt;/a&gt; is the issuer of the token (Google)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;sub&lt;/code&gt; = subject is the source service account ID. You can also find this info in the GCP console, on the service account page (Unique ID)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;aud&lt;/code&gt; is the default audience value, defined in the first step.&lt;/li&gt;
&lt;/ul&gt;
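
&lt;p&gt;If you prefer not to paste a token into a website, you can decode it locally: a JWT payload is just base64url-encoded JSON. Here is a minimal sketch (our own helper, not part of any SDK) that performs no signature verification:&lt;/p&gt;

```python
import base64
import json


def jwt_claims(token: str) -> dict:
    """Decode the payload segment of a JWT without verifying the signature."""
    payload_b64 = token.split(".")[1]
    # base64url data must be padded to a multiple of 4 characters
    payload_b64 += "=" * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(payload_b64))
```

&lt;p&gt;The &lt;code&gt;iss&lt;/code&gt;, &lt;code&gt;sub&lt;/code&gt; and &lt;code&gt;aud&lt;/code&gt; claims of the decoded payload are exactly the three values the federated credential needs.&lt;/p&gt;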

&lt;p&gt;From this information you can create your federated credential settings:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;az ad app federated-credential create &lt;span class="nt"&gt;--id&lt;/span&gt; APPLICATION_ID &lt;span class="nt"&gt;--parameters&lt;/span&gt; credential.json
&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"credential.json"&lt;/span&gt; contains the following content&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="s2"&gt;"name"&lt;/span&gt;: &lt;span class="s2"&gt;"GcpFederation"&lt;/span&gt;,
    &lt;span class="s2"&gt;"issuer"&lt;/span&gt;: &lt;span class="s2"&gt;"https://accounts.google.com"&lt;/span&gt;,
    &lt;span class="s2"&gt;"subject"&lt;/span&gt;: &lt;span class="s2"&gt;"106697322240068434726"&lt;/span&gt;,
    &lt;span class="s2"&gt;"description"&lt;/span&gt;: &lt;span class="s2"&gt;"Test GCP federation"&lt;/span&gt;,
    &lt;span class="s2"&gt;"audiences"&lt;/span&gt;: &lt;span class="o"&gt;[&lt;/span&gt;
        &lt;span class="s2"&gt;"api://AzureADTokenExchange"&lt;/span&gt;
    &lt;span class="o"&gt;]&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  3. Exchange your GCP ID token for an Azure access token
&lt;/h2&gt;

&lt;p&gt;Last but not least, exchange your GCP ID token for an Azure access token so you can act on the Azure side: make a request to the Azure OAuth2 Authorization Server with the following parameters:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;client_id&lt;/code&gt; set to your App registration ID,&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;scope&lt;/code&gt; set to the desired scope, depending on the future usage of your token,&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;client_assertion_type&lt;/code&gt; fixed to &lt;code&gt;urn:ietf:params:oauth:client-assertion-type:jwt-bearer&lt;/code&gt; for this operation,&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;grant_type&lt;/code&gt; set to &lt;code&gt;client_credentials&lt;/code&gt;,
&lt;/li&gt;
&lt;li&gt;and, most importantly, your GCP ID token under &lt;code&gt;client_assertion&lt;/code&gt;.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl GET &lt;span class="s1"&gt;'https://login.partner.microsoftonline.cn/TENANT_ID/oauth2/v2.0/token'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--header&lt;/span&gt; &lt;span class="s1"&gt;'Content-Type: application/x-www-form-urlencoded'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--data-urlencode&lt;/span&gt; &lt;span class="s1"&gt;'client_id=APP_ID'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--data-urlencode&lt;/span&gt; &lt;span class="s1"&gt;'scope=https://storage.azure.com/.default'&lt;/span&gt; &lt;span class="se"&gt;\ &lt;/span&gt;&lt;span class="c"&gt;# or whatever other scope you might want&lt;/span&gt;
&lt;span class="nt"&gt;--data-urlencode&lt;/span&gt; &lt;span class="s1"&gt;'client_assertion_type=urn:ietf:params:oauth:client-assertion-type:jwt-bearer'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--data-urlencode&lt;/span&gt; &lt;span class="s1"&gt;'client_assertion=GCP_ID_TOKEN'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--data-urlencode&lt;/span&gt; &lt;span class="s1"&gt;'grant_type=client_credentials'&lt;/span&gt;

&lt;span class="c"&gt;# Reponse&lt;/span&gt;

&lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="s2"&gt;"token_type"&lt;/span&gt;: &lt;span class="s2"&gt;"Bearer"&lt;/span&gt;,
    &lt;span class="s2"&gt;"expires_in"&lt;/span&gt;: 3599,
    &lt;span class="s2"&gt;"ext_expires_in"&lt;/span&gt;: 3599,
    &lt;span class="s2"&gt;"access_token"&lt;/span&gt;: &lt;span class="s2"&gt;"eyJ0eXAiO********"&lt;/span&gt; &lt;span class="c"&gt;# JWT token&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once Azure decodes the GCP token, if the audience, issuer and subject match, you are good to go with a brand new Azure access token! You can now access the APIs that match the scope you specified (here the Azure Storage API):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl GET &lt;span class="s1"&gt;'https://STORAGE_ACCOUNT_NAME.blob.core.windows.net/CONTAINER'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--header&lt;/span&gt; &lt;span class="s1"&gt;'x-ms-version: 2020-04-08'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--header&lt;/span&gt; &lt;span class="s1"&gt;'Authorization: Bearer AZURE_TOKEN'&lt;/span&gt;

&lt;span class="c"&gt;# Response 200 OK&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No need to store a client_id and client_secret in GCP and risk a security breach! Just use Azure Active Directory federated credentials!&lt;/p&gt;

&lt;p&gt;In the next two articles we will see concrete implementations in Python and Terraform for your production applications.&lt;/p&gt;

</description>
      <category>googlecloud</category>
      <category>azure</category>
      <category>multicloud</category>
      <category>security</category>
    </item>
    <item>
      <title>Part 2. Token exchange from Azure to GCP</title>
      <dc:creator>Λ\: Clément Bosc</dc:creator>
      <pubDate>Tue, 14 Feb 2023 13:09:19 +0000</pubDate>
      <link>https://dev.to/stack-labs/multi-cloud-identity-federation-explained-part-2-token-exchange-from-azure-to-gcp-5a1f</link>
      <guid>https://dev.to/stack-labs/multi-cloud-identity-federation-explained-part-2-token-exchange-from-azure-to-gcp-5a1f</guid>
      <description>&lt;p&gt;In the previous article &lt;a href="https://dev.to/stack-labs/multi-cloud-identity-federation-explained-part-1-access-token-vs-id-token-l4"&gt;Part 1. Access token vs ID token&lt;/a&gt;, we saw why going multi-cloud is a security challenge and why we need a more sustainable solution than exporting and storing sensitive secrets. We also saw what an access token is, what an ID token is, and the difference between them. Keep this information in mind; we will need it in what follows!&lt;/p&gt;

&lt;p&gt;Now let’s see in detail the technical implementation for securely exchanging tokens from Azure to Google Cloud, so you can query Google APIs from Azure without having to generate a Google service account JSON key.&lt;/p&gt;

&lt;h2&gt;
  
  
  The big picture
&lt;/h2&gt;

&lt;p&gt;To request a service or API hosted on GCP, you need a GCP access token (or ID token if your service is Cloud Run). But all you have at this point is a token delivered by Azure, related to your Azure identity. That’s why you need to exchange it for a GCP token. &lt;/p&gt;

&lt;p&gt;To exchange an Azure access token for a Google access token you need to configure a GCP service called &lt;strong&gt;Workload Identity Federation&lt;/strong&gt;. This service allows you to configure external providers (Azure, AWS, GitLab, anything that uses OIDC and JWT tokens) and map entities from these providers to Service Accounts in GCP. This allows external entities to impersonate the GCP service account, that is to say, inherit all the permissions the service account has on the platform.&lt;/p&gt;

&lt;p&gt;The process goes in 3 steps: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Generate an Azure Active Directory (AAD) access token for an App registration (more on them below), either using the &lt;code&gt;client_id&lt;/code&gt; and &lt;code&gt;client_secret&lt;/code&gt; or via the Metadata Server.&lt;/li&gt;
&lt;li&gt;Exchange the Azure access token for a short-lived access token from Google’s Security Token Service API (STS).&lt;/li&gt;
&lt;li&gt;Exchange the STS access token for a Service Account’s access token and use this one to query Google APIs!&lt;/li&gt;
&lt;/ul&gt;
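
&lt;p&gt;The pivotal second step reduces to a single JSON document. As a sketch, a small Python helper (the function name is ours) can build the body that gets POSTed to &lt;code&gt;https://sts.googleapis.com/v1/token&lt;/code&gt;, matching the curl call shown later in this article:&lt;/p&gt;

```python
def sts_exchange_body(project_number: str, pool_id: str, provider_id: str,
                      azure_token: str) -> dict:
    """Build the body for Google's Security Token Service token exchange."""
    audience = (
        f"//iam.googleapis.com/projects/{project_number}"
        f"/locations/global/workloadIdentityPools/{pool_id}"
        f"/providers/{provider_id}"
    )
    return {
        # grantType, requestedTokenType and subjectTokenType are fixed by convention
        "grantType": "urn:ietf:params:oauth:grant-type:token-exchange",
        "audience": audience,
        "scope": "https://www.googleapis.com/auth/cloud-platform",
        "requestedTokenType": "urn:ietf:params:oauth:token-type:access_token",
        "subjectToken": azure_token,
        "subjectTokenType": "urn:ietf:params:oauth:token-type:jwt",
    }
```

&lt;p&gt;The only variable parts are the audience (your Workload Identity Provider resource name) and the Azure token itself.&lt;/p&gt;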

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu8l2kz2ycucmjsi18ysg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu8l2kz2ycucmjsi18ysg.png" alt="Token exchange between Azure and GCP" width="800" height="104"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Azure App registration creation and token generation
&lt;/h2&gt;

&lt;p&gt;In Azure world, the App registration is the identity of a service (or app). It’s kind of like a Service Account if you are coming from the Google Cloud world. You must first create an App registration in Azure Active Directory.&lt;/p&gt;

&lt;p&gt;In Azure boundaries you can generate an access token on behalf of the application, either via the Authorization Server with the &lt;code&gt;client_id&lt;/code&gt; and &lt;code&gt;client_secret&lt;/code&gt; or via the Metadata Server.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;a. Generate an access token with the Authorization Server, client_id and client_secret&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In my case my project is in Azure China so the AAD Authority (host) is &lt;code&gt;https://login.partner.microsoftonline.cn&lt;/code&gt;, but for you it’s probably &lt;a href="https://login.microsoftonline.com/" rel="noopener noreferrer"&gt;&lt;code&gt;https://login.microsoftonline.com/&lt;/code&gt;&lt;/a&gt; (Azure Global)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;--location&lt;/span&gt; &lt;span class="nt"&gt;--request&lt;/span&gt; POST &lt;span class="s1"&gt;'https://login.partner.microsoftonline.cn/TENANT_ID/oauth2/v2.0/token'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--header&lt;/span&gt; &lt;span class="s1"&gt;'Content-Type: application/x-www-form-urlencoded'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--data-urlencode&lt;/span&gt; &lt;span class="s1"&gt;'client_id=APPLICATION_ID'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--data-urlencode&lt;/span&gt; &lt;span class="s1"&gt;'grant_type=client_credentials'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--data-urlencode&lt;/span&gt; &lt;span class="s1"&gt;'client_secret=CLIENT_SECRET'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--data-urlencode&lt;/span&gt; &lt;span class="s1"&gt;'scope=.default'&lt;/span&gt; &lt;span class="c"&gt;# or anything else that you would like for your token&lt;/span&gt;

&lt;span class="c"&gt;# Response&lt;/span&gt;
&lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="s2"&gt;"token_type"&lt;/span&gt;: &lt;span class="s2"&gt;"Bearer"&lt;/span&gt;,
    &lt;span class="s2"&gt;"expires_in"&lt;/span&gt;: 3599,
    &lt;span class="s2"&gt;"ext_expires_in"&lt;/span&gt;: 3599,
    &lt;span class="s2"&gt;"access_token"&lt;/span&gt;: &lt;span class="s2"&gt;"eyJ0eXAiOiJK*********"&lt;/span&gt; &lt;span class="c"&gt;# AZURE_TOKEN&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;b. Generate an access token with the Azure Instance Metadata Service&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In the Cloud world there is a reserved magic IP, “&lt;a href="http://169.254.169.254/metadata/identity/oauth2/token?api-version=2018-02-01&amp;amp;resource=https://management.azure.com/" rel="noopener noreferrer"&gt;169.254.169.254&lt;/a&gt;”, which is used to fetch user or service information when your workload is running in the Cloud compute context: it’s the Metadata Server. You can request this service from inside an Azure VM to generate a token for the managed identity attached to the VM, without ever having the secret! 🪄&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl GET &lt;span class="s1"&gt;'http://169.254.169.254/metadata/identity/oauth2/token?api-version=2018-02-01&amp;amp;resource=AZURE_APP_ID/.default'&lt;/span&gt;
&lt;span class="nt"&gt;--header&lt;/span&gt; &lt;span class="s1"&gt;'Metadata: true'&lt;/span&gt;

&lt;span class="c"&gt;# Response &lt;/span&gt;
&lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="s2"&gt;"access_token"&lt;/span&gt;: &lt;span class="s2"&gt;"eyJ0eXA********"&lt;/span&gt;, &lt;span class="c"&gt;# AZURE_TOKEN&lt;/span&gt;
  &lt;span class="s2"&gt;"expires_in"&lt;/span&gt;: 3599,
  &lt;span class="s2"&gt;"token_type"&lt;/span&gt;: &lt;span class="s2"&gt;"Bearer"&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;More info on using the Azure Metadata Server to acquire an access token &lt;a href="https://learn.microsoft.com/en-us/azure/active-directory/managed-identities-azure-resources/how-to-use-vm-token" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;
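
&lt;p&gt;Since the Metadata Server response only carries a relative &lt;code&gt;expires_in&lt;/code&gt;, a small helper (our own, for illustration) can turn it into an absolute expiry, so your code knows when to request a fresh token instead of caching one forever:&lt;/p&gt;

```python
import json
from datetime import datetime, timedelta, timezone


def parse_imds_token(response_body: str) -> tuple:
    """Return (access_token, absolute UTC expiry) from an IMDS token response."""
    data = json.loads(response_body)
    expires_at = datetime.now(timezone.utc) + timedelta(seconds=int(data["expires_in"]))
    return data["access_token"], expires_at
```

&lt;p&gt;A production wrapper would simply re-call the Metadata Server once &lt;code&gt;datetime.now(timezone.utc)&lt;/code&gt; passes the returned expiry.&lt;/p&gt;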

&lt;h2&gt;
  
  
  2. Set up Workload Identity Federation: Pool and Azure Provider
&lt;/h2&gt;

&lt;p&gt;First, let’s create a Workload Identity Pool on GCP; you only need a name and an ID for this one. You can have many providers per pool, but a provider is limited to one tenant. So if you are in a multi-tenant Azure pattern, you might need a provider for each of them.&lt;/p&gt;

&lt;p&gt;To create the Azure provider, select the OpenID Connect (OIDC) type: you will need a name, an ID and an issuer.&lt;/p&gt;

&lt;p&gt;Let’s recall what we learned from the &lt;a href="https://dev.to/stack-labs/multi-cloud-identity-federation-explained-part-1-access-token-vs-id-token-l4"&gt;previous article&lt;/a&gt;: the issuer is the trusted entity that signs the original access token, and it can easily be retrieved by decoding the JWT. As usual, go to &lt;a href="http://jwt.io" rel="noopener noreferrer"&gt;jwt.io&lt;/a&gt; with your AZURE_ACCESS_TOKEN and find out the issuer.&lt;/p&gt;

&lt;p&gt;In my case it’s &lt;a href="https://sts.chinacloudapi.cn/TENANT_ID" rel="noopener noreferrer"&gt;https://sts.chinacloudapi.cn/TENANT_ID&lt;/a&gt; because my project is on Azure China, but for you it will probably look more like &lt;a href="https://sts.windows.net/TENANT_ID" rel="noopener noreferrer"&gt;https://sts.windows.net/TENANT_ID&lt;/a&gt; (Azure Global).&lt;/p&gt;

&lt;p&gt;You are then asked to set up the attribute mapping: this is used later to allow a subset of entities to impersonate the target GCP service account. You must at least set the &lt;code&gt;google.subject&lt;/code&gt; mapping; once again, look for our subject in the decoded JWT payload, in the &lt;code&gt;sub&lt;/code&gt; attribute. This is a unique ID for your Azure application. You can add other JWT mappings at your convenience.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud iam workload-identity-pools create POOL_ID &lt;span class="se"&gt;\ &lt;/span&gt;
    &lt;span class="nt"&gt;--location&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"global"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--display-name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;POOL_DISPLAY_NAME

gcloud iam workload-identity-pools providers create-oidc PROVIDER_ID &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--location&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"global"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--workload-identity-pool&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;POOL_ID &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--display-name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;PROVIDER_DISPLAY_NAME &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--issuer-uri&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"https://sts.chinacloudapi.cn/TENANT_ID"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--allowed-audiences&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;AZURE_APP_ID/.default
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  3. Exchange your Azure token for a GCP token with STS
&lt;/h2&gt;

&lt;p&gt;In this step, you request a short-lived access token from Google’s Security Token Service (STS) API in exchange for your Azure App registration access token. A simple curl call does the job, as shown below. Make sure to specify the previously created Workload Identity Provider as the audience. &lt;code&gt;grantType&lt;/code&gt;, &lt;code&gt;requestedTokenType&lt;/code&gt; and &lt;code&gt;subjectTokenType&lt;/code&gt; are fixed by convention.&lt;/p&gt;

&lt;p&gt;The result is an STS token, representing the &lt;code&gt;principalSet&lt;/code&gt; of your Workload Identity Pool.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl POST &lt;span class="s1"&gt;'https://sts.googleapis.com/v1/token'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--header&lt;/span&gt; &lt;span class="s1"&gt;'Content-Type: application/json'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--data-raw&lt;/span&gt; &lt;span class="s1"&gt;'{
  "grantType": "urn:ietf:params:oauth:grant-type:token-exchange",
  "audience": "//iam.googleapis.com/projects/PROJECT_NUMBER/locations/global/workloadIdentityPools/WORKLOAD_IDENTITY_POOL/providers/AZURE_PROVIDER",
  "scope": "https://www.googleapis.com/auth/cloud-platform",
  "requestedTokenType": "urn:ietf:params:oauth:token-type:access_token",
  "subjectToken": "AZURE_TOKEN",
  "subjectTokenType": "urn:ietf:params:oauth:token-type:jwt"
}'&lt;/span&gt;

&lt;span class="c"&gt;# Response&lt;/span&gt;

&lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="s2"&gt;"access_token"&lt;/span&gt;: &lt;span class="s2"&gt;"ya29.d.b0Aaekm1K9f******"&lt;/span&gt;, &lt;span class="c"&gt;# STS_ACCESS_TOKEN&lt;/span&gt;
    &lt;span class="s2"&gt;"issued_token_type"&lt;/span&gt;: &lt;span class="s2"&gt;"urn:ietf:params:oauth:token-type:access_token"&lt;/span&gt;,
    &lt;span class="s2"&gt;"token_type"&lt;/span&gt;: &lt;span class="s2"&gt;"Bearer"&lt;/span&gt;,
    &lt;span class="s2"&gt;"expires_in"&lt;/span&gt;: 3587
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  4. A new GCP principal: the principalSet
&lt;/h2&gt;

&lt;p&gt;Before jumping to the last exchange operation, let’s get back to fundamentals.&lt;/p&gt;

&lt;p&gt;In GCP, a principal is an entity that can be allowed, via IAM, to perform certain actions. If you have never used Workload Identity Federation you are probably convinced that there are only three kinds of principal: &lt;em&gt;user&lt;/em&gt;, &lt;em&gt;group&lt;/em&gt; &amp;amp; &lt;em&gt;serviceAccount&lt;/em&gt;. But with Workload Identity Federation, Google introduced a fourth: the &lt;em&gt;principalSet&lt;/em&gt;. The &lt;em&gt;principalSet&lt;/em&gt; is the principal identity for a pool, but it can only be used with the Workload Identity User role to impersonate a real Service Account. Moreover, the particularity of this principal is that the corresponding identity is dynamic, based on a pattern: you can apply filters based on the source JWT attributes that were previously mapped!&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;To limit the impersonation permission to a specific subject you can set the permission on &lt;code&gt;principal://iam.googleapis.com/projects/PROJECT_NUMBER/locations/global/workloadIdentityPools/POOL_ID/subject/SUBJECT_ATTRIBUTE_VALUE&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;But you can use any custom attribute by using &lt;code&gt;principalSet://iam.googleapis.com/projects/PROJECT_NUMBER/locations/global/workloadIdentityPools/POOL_ID/attribute.ATTRIBUTE_NAME/ATTRIBUTE_VALUE&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
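
&lt;p&gt;These two member formats are easy to get wrong by hand. As a sketch, two small helpers (our own, purely illustrative) build them from their parts:&lt;/p&gt;

```python
def subject_member(project_number: str, pool_id: str, subject: str) -> str:
    """IAM member targeting one external identity (note the principal:// prefix)."""
    return (
        f"principal://iam.googleapis.com/projects/{project_number}"
        f"/locations/global/workloadIdentityPools/{pool_id}/subject/{subject}"
    )


def attribute_member(project_number: str, pool_id: str,
                     attribute: str, value: str) -> str:
    """IAM member targeting every identity whose mapped attribute has a given value."""
    return (
        f"principalSet://iam.googleapis.com/projects/{project_number}"
        f"/locations/global/workloadIdentityPools/{pool_id}"
        f"/attribute.{attribute}/{value}"
    )
```

&lt;p&gt;Either string can then be passed as the &lt;code&gt;--member&lt;/code&gt; of an IAM policy binding.&lt;/p&gt;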

&lt;p&gt;The full list of possible patterns is available in the Workload Identity Federation documentation.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Give the Workload Identity principal access to the target service account and exchange the final token
&lt;/h2&gt;

&lt;p&gt;Now that you have your STS access token, you are nearly at the end!&lt;/p&gt;

&lt;p&gt;The Workload Identity principal must be authorized by GCP IAM to impersonate your final, target service account. To do so, grant the Workload Identity User role, at the service account level, to the principal represented by &lt;code&gt;principal://iam.googleapis.com/projects/PROJECT_NUMBER/locations/global/workloadIdentityPools/POOL_ID/subject/AZURE_APP_SUBJECT&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud iam service-accounts add-iam-policy-binding &lt;span class="se"&gt;\&lt;/span&gt;
TARGET_SERVICE_ACCOUNT_EMAIL &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--member&lt;/span&gt; principal://iam.googleapis.com/projects/PROJECT_NUMBER/locations/global/workloadIdentityPools/POOL_ID/subject/AZURE_APP_SUBJECT &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--role&lt;/span&gt; roles/iam.workloadIdentityUser
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Exchange STS for a final access token
&lt;/h3&gt;

&lt;p&gt;Here you are: you can now finally impersonate the target service account to access real Google Cloud APIs. Just pass the STS access token as the Bearer token of your HTTP request against the IAM access token generation endpoint (or the ID token endpoint, depending on your use case) 🙂&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;--location&lt;/span&gt; &lt;span class="nt"&gt;--request&lt;/span&gt; POST &lt;span class="s1"&gt;'https://iamcredentials.googleapis.com/v1/projects/-/serviceAccounts/TARGET_SERVICE_ACCOUNT_EMAIL:generateAccessToken'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--header&lt;/span&gt; &lt;span class="s1"&gt;'Authorization: Bearer STS_ACCESS_TOKEN'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--header&lt;/span&gt; &lt;span class="s1"&gt;'Content-Type: application/json'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--data-raw&lt;/span&gt; &lt;span class="s1"&gt;'{
    "scope": [
        "https://www.googleapis.com/auth/cloud-platform"
    ]
}'&lt;/span&gt;

&lt;span class="c"&gt;# Response&lt;/span&gt;

&lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="s2"&gt;"accessToken"&lt;/span&gt;: &lt;span class="s2"&gt;"ya29.c.b0Aaekm1Izvf********"&lt;/span&gt;, &lt;span class="c"&gt;# GCP_ACCESS_TOKEN&lt;/span&gt;
    &lt;span class="s2"&gt;"expireTime"&lt;/span&gt;: &lt;span class="s2"&gt;"2023-02-12T20:36:27Z"&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
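The same call can be assembled from code; this is a minimal Python sketch that only builds the URL, headers and body (the service account email and STS token are placeholders, and actually sending the request is left to any HTTP client):

```python
import json

def build_generate_access_token_request(sa_email: str, sts_token: str):
    """Assemble URL, headers and body for the generateAccessToken call."""
    url = ("https://iamcredentials.googleapis.com/v1/projects/-/"
           f"serviceAccounts/{sa_email}:generateAccessToken")
    headers = {
        "Authorization": f"Bearer {sts_token}",  # the STS token from the previous step
        "Content-Type": "application/json",
    }
    body = json.dumps({"scope": ["https://www.googleapis.com/auth/cloud-platform"]})
    return url, headers, body
```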



&lt;p&gt;The resulting access token can be used to do anything that the TARGET_SERVICE_ACCOUNT can do, for example running BigQuery queries 📊&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl POST &lt;span class="s1"&gt;'https://bigquery.googleapis.com/bigquery/v2/projects/PROJECT_ID/queries'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--header&lt;/span&gt; &lt;span class="s1"&gt;'Authorization: Bearer GCP_ACCESS_TOKEN'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--header&lt;/span&gt; &lt;span class="s1"&gt;'Content-Type: application/json'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--data-raw&lt;/span&gt; &lt;span class="s1"&gt;'{
  "query": "SELECT CURRENT_TIMESTAMP()",
  "useLegacySql": false,
  "location": "EU"
}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We have seen how to securely exchange an Azure App Registration access token by impersonating a GCP service account, and thus access Google APIs from another cloud :) &lt;/p&gt;

&lt;p&gt;Let’s see how to do the reverse operation in a future article!&lt;/p&gt;

</description>
      <category>softwaredevelopment</category>
      <category>productivity</category>
      <category>discuss</category>
    </item>
    <item>
      <title>Part 1. Access token vs ID token</title>
      <dc:creator>Λ\: Clément Bosc</dc:creator>
      <pubDate>Tue, 14 Feb 2023 13:08:19 +0000</pubDate>
      <link>https://dev.to/stack-labs/multi-cloud-identity-federation-explained-part-1-access-token-vs-id-token-l4</link>
      <guid>https://dev.to/stack-labs/multi-cloud-identity-federation-explained-part-1-access-token-vs-id-token-l4</guid>
<description>&lt;p&gt;Nowadays, multi-cloud is on everyone's lips in big companies. The motivations for going multi-cloud are various: technical flexibility (choosing the best tool for each usage), fault tolerance and redundancy, or reducing the risk of a single point of failure. But in any case, the &lt;strong&gt;problem of security&lt;/strong&gt; and of federating identities between services will come up pretty quickly.&lt;/p&gt;

&lt;h3&gt;
  
  
  How we usually share credentials between clouds and why it is bad
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F083dco7dvlm3blq6otel.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F083dco7dvlm3blq6otel.png" alt="Security risk secret storage Cloud" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The issue is that service identities (Service Account or Service Principal) are not shared between clouds. A common way to communicate from one cloud to the other is to create a secret for the service identity and store it on the other cloud. This way, the service account of Cloud A can access Cloud A’s APIs from Cloud B by generating a token from the secret, and vice versa. But the extradition of a secret (credential or key) is a security risk: if the secret falls into the wrong hands, the thief can perform requests as the service account and extract sensitive data or cause chaos. The solution to this issue might be Identity Federation, but first let’s discuss a bit of theory: how can a service account generate a token to prove its identity?&lt;/p&gt;

&lt;h3&gt;
  
  
  Access token vs ID token
&lt;/h3&gt;

&lt;p&gt;“&lt;em&gt;For authentication and authorization, a token is a digital object that contains information about the identity of the principal making the request and what kind of access they are authorized for.”&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In the Cloud world (and not only there), there are mainly two types of token, each with an associated protocol: access tokens and ID tokens.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Access Token&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Access tokens are opaque tokens that conform to the &lt;a href="https://datatracker.ietf.org/doc/html/rfc6749#section-1.4" rel="noopener noreferrer"&gt;OAuth 2.0 framework&lt;/a&gt;. They contain &lt;strong&gt;authorization&lt;/strong&gt; information (what the service can do), but not identity information. They are opaque because they are in a proprietary format: applications cannot inspect them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: In some cases, access tokens can be decoded (if they are JWTs, for example) but do not take this for granted.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ID Token&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;ID tokens are JSON Web Tokens (JWT) that conform to the &lt;a href="https://openid.net/specs/openid-connect-core-1_0.html" rel="noopener noreferrer"&gt;OpenID Connect (OIDC) specification&lt;/a&gt;. Unlike access tokens, ID tokens can be decoded and inspected by the application. That is why in a multi-cloud world, &lt;strong&gt;you will always exchange JWT tokens between Clouds.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A JWT is composed of three parts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;strong&gt;header&lt;/strong&gt;, with information about the algorithm and the token type.&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;payload&lt;/strong&gt;, with info about the &lt;code&gt;subject&lt;/code&gt; (service identity), the &lt;code&gt;issuer&lt;/code&gt; (trust authority which signed the token) and more identity info like &lt;code&gt;email&lt;/code&gt;, &lt;code&gt;first_name&lt;/code&gt;, &lt;code&gt;last_name&lt;/code&gt;…&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;signature&lt;/strong&gt;, generated by signing the base64 representations of the header and the payload with a secret (or private) key, using the algorithm specified in the header.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All 3 parts are joined with a dot “.” to form the final token.&lt;/p&gt;

&lt;p&gt;Example: &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fal5xmq0une5325yw3vcy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fal5xmq0une5325yw3vcy.png" alt="JWT tokens" width="800" height="322"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can decode JWT tokens manually with a base64decode utility or with websites like &lt;a href="https://jwt.io/" rel="noopener noreferrer"&gt;https://jwt.io/&lt;/a&gt;. Decoding the first two parts will give us the following objects:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;#&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;header&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"alg"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"RS256"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"kid"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"b49c5062d890f5ce449e890c88e8dd98c4fee0ab"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"typ"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"JWT"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="err"&gt;#&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;payload&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"aud"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"32555940559.apps.googleusercontent.com"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"azp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"103441981692756022942"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"exp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1675978466&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"iat"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1675974866&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"iss"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://accounts.google.com"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"sub"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"103441981692756022942"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
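The manual decoding step can also be done in a few lines of Python; the toy token below is fabricated for illustration (NOT a real Google-issued token):

```python
import base64
import json

def decode_jwt_unverified(token: str):
    """Decode a JWT's header and payload WITHOUT verifying the signature
    (fine for inspection, never for trusting the contents)."""
    header_b64, payload_b64, _signature = token.split(".")
    def _part(p: str):
        p += "=" * (-len(p) % 4)  # restore the base64 padding JWTs strip out
        return json.loads(base64.urlsafe_b64decode(p))
    return _part(header_b64), _part(payload_b64)

# Toy token fabricated for illustration:
toy = ".".join(
    base64.urlsafe_b64encode(json.dumps(seg).encode()).decode().rstrip("=")
    for seg in ({"alg": "RS256", "typ": "JWT"},
                {"iss": "https://accounts.google.com", "sub": "123"})
) + ".fake-signature"

header, payload = decode_jwt_unverified(toy)
print(payload["iss"])  # prints: https://accounts.google.com
```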



&lt;p&gt;Here the issuer (&lt;code&gt;iss&lt;/code&gt;) which signed the token is &lt;code&gt;https://accounts.google.com&lt;/code&gt; and the service account which is represented by the token is the subject (&lt;code&gt;sub&lt;/code&gt;) : &lt;code&gt;103441981692756022942&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Now that we have the basics: how can we exchange an identity token from one cloud for the identity of another cloud, through impersonation, while protecting against secret leaks? This process is called identity federation.&lt;/p&gt;

&lt;p&gt;Let’s see this in detail with two technical implementations, from Azure to Google Cloud Platform and vice versa, in the following articles…&lt;/p&gt;

</description>
      <category>googlecloud</category>
      <category>azure</category>
      <category>multicloud</category>
      <category>security</category>
    </item>
    <item>
      <title>[Feedback] GCP : Cross region Data transfer with BigQuery. Part 2 - Schema drift with the Google Analytics use case</title>
      <dc:creator>Λ\: Clément Bosc</dc:creator>
      <pubDate>Fri, 28 Jan 2022 12:07:51 +0000</pubDate>
      <link>https://dev.to/stack-labs/feedback-gcp-cross-region-data-transfer-with-bigquery-part-2-schema-drift-with-the-google-analytics-use-case-59k3</link>
      <guid>https://dev.to/stack-labs/feedback-gcp-cross-region-data-transfer-with-bigquery-part-2-schema-drift-with-the-google-analytics-use-case-59k3</guid>
      <description>&lt;p&gt;In a previous article, we detailed the process that we set up to transfer data incrementally and periodically between GCP regions in BigQuery. It’s a common problem when working in a global context where your data resides in locations all over the world. If you missed it, be sure to catch up here : &lt;a href="https://dev.to/stack-labs/feedback-gcp-cross-region-data-transfer-with-bigquery-part-1-workflow-and-dts-at-the-rescue-4o3i"&gt;[Feedback] GCP : Cross region Data transfer with BigQuery. Part 1 - Workflows and DTS at the rescue&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now that you have the architecture in mind, let's dive into a problem we had when working with Google Analytics (GA): the schema drift situation.&lt;/p&gt;

&lt;h2&gt;
  
  
  The use case : Google Analytics
&lt;/h2&gt;

&lt;p&gt;Let's add some context: the company I am working for owns websites. A lot of them. And like pretty much everybody else, they use Google Analytics to track audience, user behavior, acquisition, conversion and a bunch of other metrics. &lt;br&gt;
Additionally, they use the GA feature that automatically exports the full content of Google Analytics data into BigQuery. Each website (or &lt;em&gt;view&lt;/em&gt;, in GA jargon) is exported into a dedicated dataset (named after the viewId), and a table (a shard) is created each day with the data from the day before. As before, the data is split across different projects, located in remote GCP regions according to the country managing the website.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fekyyu5ikepb0pv4j0aw4.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fekyyu5ikepb0pv4j0aw4.jpg" alt="Google Analytics to BigQuery architecture" width="800" height="326"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The problem is that the configuration to export data into BigQuery is done and maintained manually for each website! Moreover, from time to time new websites are created and some of them stop publishing data (because they were closed, for instance). And Google Analytics, like every information system, is constantly evolving: adding dimensions, features and metrics (and so columns in the data model). But the system does not update the schema of all the previously created tables, and this leads to a large issue: schema drift.&lt;/p&gt;
&lt;h2&gt;
  
  
  What we needed to do
&lt;/h2&gt;

&lt;p&gt;The very big workflow that we built to transfer data across regions takes as argument a SQL query to read data from the source. But here, compared to the situation described in the previous blog post, the sharded tables are located across many datasets (see schema above). Our first reflex was to pass an input query like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="nv"&gt;`source-project-1.236548976.ga_sessions_*`&lt;/span&gt;
&lt;span class="k"&gt;UNION&lt;/span&gt; &lt;span class="k"&gt;ALL&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="nv"&gt;`source-project-1.987698768.ga_sessions_*`&lt;/span&gt;
&lt;span class="k"&gt;UNION&lt;/span&gt; &lt;span class="k"&gt;ALL&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt;
&lt;span class="c1"&gt;-- For each dataset in a given project&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But this was the beginning of our misery: the schemas of the tables in datasets 236548976 and 987698768 are not exactly the same (probably one of them was initiated later, with some changed fields). Easy, you would say: just specify the explicit list of fields in the &lt;code&gt;SELECT&lt;/code&gt; statement, replacing missing fields with something like &lt;code&gt;NULL as &amp;lt;alias&amp;gt;&lt;/code&gt;. Well, it’s not so simple, because: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The Google Analytics data model is One Big Table with over 320 columns, spread over four levels of nested columns, repeated fields, arrays of repeated fields, etc. The differences could be at any level, and differ from one dataset to another.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;We have hundreds of websites, and so hundreds of datasets.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A new dataset can be added at any time, and the solution had to automatically load the new data without further re-configuration.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We needed a way to automate all of this mess.&lt;/p&gt;

&lt;h3&gt;
  
  
  The first idea: a magical stored procedure 🧙🏼‍♂️
&lt;/h3&gt;

&lt;p&gt;The first, and I think the most logical, reflex that we had was to generate SQL queries. And what nicer way of doing this than with SQL? Given a projectId and a dataset, the procedure would have to generate something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
&lt;span class="n"&gt;channelGrouping&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;channelGrouping&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;clientId&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;clientId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;STRUCT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;browser&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="c1"&gt;-- this field does not exist in the current dataset&lt;/span&gt;
    &lt;span class="k"&gt;CAST&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;browserSize&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    
    &lt;span class="p"&gt;...&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;...&lt;/span&gt;
&lt;span class="n"&gt;ARRAY&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt;
        &lt;span class="n"&gt;STRUCT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ARRAY&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;STRUCT&lt;/span&gt;&lt;span class="p"&gt;(...)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;promotion&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="k"&gt;UNNEST&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;product&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;product&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="k"&gt;UNNEST&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hits&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;hits&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="nv"&gt;`source-project-1.236548976.ga_sessions_*`&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With, let's remember, a &lt;strong&gt;4-level depth of nested structures and more than 320 columns&lt;/strong&gt;. Fortunately, like most respectable databases, BigQuery exposes &lt;code&gt;INFORMATION_SCHEMA&lt;/code&gt; views with all the metadata that we needed. &lt;/p&gt;

&lt;p&gt;Not without some effort, it worked: with a recursive and generic approach, we succeeded in dynamically generating the massive SQL query. We only needed to call the procedure for each dataset to get the &lt;code&gt;SELECT&lt;/code&gt; statement and perform a &lt;code&gt;UNION ALL&lt;/code&gt;, and the problem of schema drift would have been solved!&lt;/p&gt;

&lt;p&gt;But this was not the right approach: we had fun building the magical stored procedure, but it was too slow, consumed too many resources, and the generated query was so large (once joined together with &lt;code&gt;UNION ALL&lt;/code&gt;) that it didn’t fit in a variable in our orchestrator 🤦‍♂️.&lt;/p&gt;

&lt;h3&gt;
  
  
  The real magic resides in simplicity
&lt;/h3&gt;

&lt;p&gt;Most of the time, the simpler the better. We actually realized quickly that BigQuery already had its own way of dealing with the schema drift problem: the &lt;u&gt;wildcard functionality&lt;/u&gt;, and we had been using it all along! &lt;/p&gt;

&lt;p&gt;In a sharded table structure, a table is actually split into many smaller tables with a suffix to differentiate them (most commonly a date). Conveniently, BigQuery lets you query all the tables sharing the same base name with the wildcard notation, just like we did to get all the data from a website: &lt;br&gt;
&lt;code&gt;SELECT * FROM `ga_sessions_*`&lt;/code&gt; processes all the data from tables that match the pattern &lt;code&gt;ga_sessions_*&lt;/code&gt;. And this works even if the schema has evolved since the first table! &lt;/p&gt;

&lt;p&gt;BigQuery automatically applies the schema of the last created table to the query result and fills the missing fields with NULL values. Sadly, doing the same over a batch of datasets is not possible (like &lt;code&gt;`project.*.ga_sessions_*`&lt;/code&gt;), so we got around the issue by doing the following: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;For each dataset in the source, create a large table that contains all the data from the &lt;code&gt;ga_sessions_*&lt;/code&gt; sharded table. This table is named with the datasetId as a suffix, in a buffer project dedicated to replication. In practice this table contains all the partitions created since the last transfer (so most of the time one partition, except in the init phase).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Create a "fake" empty table with the exact same schema as the destination table. This schema is our reference, from which the other tables might differ slightly (it’s the &lt;code&gt;ga_sessions_DONOTUSE&lt;/code&gt; table in the schema below).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Use the wildcard syntax again to append all the data into the final (partitioned) table: the suffix is not a date anymore but the source datasetId! As the fake &lt;em&gt;DONOTUSE&lt;/em&gt; table is always the last created, its schema is applied to all the other tables.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
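The steps above can be sketched as a small query generator; a minimal Python illustration with hypothetical project and dataset names (the real workflow logic is more involved):

```python
def staging_queries(source_project: str, dataset_ids: list[str],
                    buffer_project: str) -> list[str]:
    """Step 1: one CREATE TABLE AS SELECT per source dataset; the buffer table
    reuses the datasetId as its suffix so a single wildcard can cover them all."""
    return [
        f"CREATE OR REPLACE TABLE `{buffer_project}.replication.ga_sessions_{ds}` AS "
        f"SELECT * FROM `{source_project}.{ds}.ga_sessions_*`"
        for ds in dataset_ids
    ]

def merge_query(buffer_project: str, destination_table: str) -> str:
    """Step 3: append every buffer shard into the final partitioned table.
    The empty ga_sessions_DONOTUSE table, created last, imposes the reference schema."""
    return (
        f"INSERT INTO `{destination_table}` "
        f"SELECT * FROM `{buffer_project}.replication.ga_sessions_*` "
        "WHERE _TABLE_SUFFIX != 'DONOTUSE'"
    )
```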

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbpp910m9qnpjylh1zvt8.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbpp910m9qnpjylh1zvt8.jpg" alt="Schema drift solution with table in different datasets" width="800" height="488"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now, the next time Google updates the GA data model, we only have to apply the change to our final partitioned table, and the whole process will adapt and won’t fail, even if the new columns are not yet present in every source table at the same time. On the downside, with this process we might miss schema updates from the source if we aren't aware of new columns, but for now the current architecture fits our needs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;To be honest, I felt a bit ashamed not to have thought of the second solution sooner: it is so simple and much more maintainable than the first one! It works like a charm in production today, and we are transferring tens of gigabytes of data daily, coming from thousands of websites across all the regions of GCP, into a unique, massive partitioned table that is available to analysts.&lt;/p&gt;

</description>
      <category>bigquery</category>
      <category>data</category>
      <category>googlecloud</category>
      <category>analytics</category>
    </item>
    <item>
      <title>[Feedback] GCP : Cross region Data transfer with BigQuery. Part 1 - Workflows and DTS at the rescue</title>
      <dc:creator>Λ\: Clément Bosc</dc:creator>
      <pubDate>Mon, 24 Jan 2022 16:38:41 +0000</pubDate>
      <link>https://dev.to/stack-labs/feedback-gcp-cross-region-data-transfer-with-bigquery-part-1-workflow-and-dts-at-the-rescue-4o3i</link>
      <guid>https://dev.to/stack-labs/feedback-gcp-cross-region-data-transfer-with-bigquery-part-1-workflow-and-dts-at-the-rescue-4o3i</guid>
<description>&lt;p&gt;I have been working for this very large French cosmetic company for a few months now, and here is some feedback about a common problem we had with BigQuery when working in a global context: how to query data located all around the world?&lt;/p&gt;

&lt;p&gt;Here is the use case we had, and you might recognize yourself (or your company) in it: imagine that you have many subsidiary companies all over the world, and each of these entities produces a lot of data. Nowadays data is everywhere: from financial documents to website sessions, online advertisement, in-store sales… and the volume is growing exponentially! In our case, every locality (basically at country level) is responsible for collecting and managing the data it produces, and for storing it in BigQuery in the GCP region closest to it.&lt;/p&gt;

&lt;p&gt;The problem is that BigQuery, let’s remember, is actually two distinct products: the Query Engine (based on Google’s &lt;a href="https://research.google/pubs/pub36632/" rel="noopener noreferrer"&gt;Dremel&lt;/a&gt;) and the Storage (based on &lt;a href="https://cloud.google.com/blog/products/bigquery/inside-capacitor-bigquerys-next-generation-columnar-storage-format" rel="noopener noreferrer"&gt;Capacitor&lt;/a&gt;, Google’s columnar storage format); but you cannot use the query engine in one location to process data stored in another location! And this is a big issue when your company is distributed globally.&lt;/p&gt;

&lt;p&gt;To get around this issue, we need to periodically transfer the data from remote locations to the main location of analysis (closest to the users, in our case it’s the EU region) and ideally this transfer must be done incrementally : we only want to transfer the new data produced since the last transfer, in order to save cost and improve performance.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to efficiently transfer data between BigQuery regions ?
&lt;/h2&gt;

&lt;p&gt;There are two main methods to address this issue: &lt;/p&gt;

&lt;h3&gt;
  
  
  1. The "legacy" method : use Google Cloud Storage
&lt;/h3&gt;

&lt;p&gt;For this kind of problem, the historical solution would be the following process: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;export the data from BigQuery storage to Parquet files (or any other format, but Parquet is great for most use cases) in a GCS bucket located in the same region as the original dataset, &lt;/li&gt;
&lt;li&gt;transfer the created objects (whose number we cannot know in advance, because that’s how BigQuery exports work) to another bucket located in the same region as the final location, &lt;/li&gt;
&lt;li&gt;finally, load the Parquet files into the final table&lt;/li&gt;
&lt;/ul&gt;
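Assembled as CLI invocations, the three steps look roughly like this; a Python sketch where every bucket and table name is a placeholder:

```python
def legacy_transfer_commands(src_table: str, src_bucket: str,
                             dst_bucket: str, dst_table: str) -> list[str]:
    """The three legacy steps as bq/gsutil invocations (all names are placeholders)."""
    return [
        # 1. export to Parquet, in a bucket in the same region as the source dataset
        f"bq extract --destination_format=PARQUET '{src_table}' "
        f"'gs://{src_bucket}/export/part-*.parquet'",
        # 2. copy the unknown number of files to a bucket in the destination region
        f"gsutil -m cp 'gs://{src_bucket}/export/part-*.parquet' "
        f"'gs://{dst_bucket}/export/'",
        # 3. load the Parquet files into the final table
        f"bq load --source_format=PARQUET '{dst_table}' "
        f"'gs://{dst_bucket}/export/part-*.parquet'",
    ]
```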

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fie2zh04cbf2u5csc8acu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fie2zh04cbf2u5csc8acu.png" alt="Data transfer using Parquet export" width="800" height="106"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  2. The “new” method : use the dedicated service : Data transfer Service
&lt;/h3&gt;

&lt;p&gt;But there is something more straightforward to solve this : use the new feature from &lt;a href="https://cloud.google.com/bigquery/docs/copying-datasets" rel="noopener noreferrer"&gt;Data Transfer Service for copying full datasets&lt;/a&gt; ! It does the same thing as the first method without extracting data out of the BigQuery storage 😎. Under the hood, DTS is performing a table copy command for each table in the source dataset. If the table was already transferred in a previous run, and no row has been modified or added since then, the table is skipped.&lt;/p&gt;

&lt;p&gt;The service is still in beta, and we found some inconveniences though: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The incremental load is managed automatically based on internal hidden metadata in BQ tables, but it’s not yet supported for partitioned tables (no append mode).&lt;/li&gt;
&lt;li&gt;You cannot choose which tables from the source dataset are transferred : it’s everything or nothing !&lt;/li&gt;
&lt;li&gt;In a GCP project, we can only transfer 2000 tables a day (cross-region), but some of our sources hosted sharded tables (one table per day, with a history of years), so we hit the quota pretty quickly.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And these limitations were a &lt;em&gt;pain in the ass&lt;/em&gt; for us: we didn’t want to transfer every table from the source, and our destination tables were partitioned 😭.&lt;/p&gt;

&lt;p&gt;Our use case was as follows: the data from each country resided in a dedicated dataset, in a country-dedicated project, located in a different GCP region. Each dataset contained many sharded tables sharing the same schema. But due to DTS limitations we couldn’t use the service as is: since the append mode for partitioned tables doesn’t work yet, every transfer would have erased the data previously transferred from another country…&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F20opy6snabck8ngt4es4.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F20opy6snabck8ngt4es4.jpg" alt="Our global organisation" width="800" height="229"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;h2&gt;
  
  
  Cloud Workflow as a wrapper for Data Transfer Service
&lt;/h2&gt;

&lt;p&gt;We finally managed to design an architecture for our situation by transferring the data not directly from source to destination, but through dedicated temporary datasets:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Stage 1: For each table to transfer, we create a temporary table in a temporary dataset that contains only the last partitions (or shards) added since the last transfer happened (applying custom business logic)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Stage 2: Once the temporary table is built, we trigger the Data Transfer Service to copy everything into a temporary destination dataset (in the same region as the final tables)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Stage 3: We merge the transferred partitions into the final tables with a custom BQ job.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
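&lt;p&gt;To illustrate, Stage 1 and Stage 3 boil down to plain BigQuery jobs. A minimal sketch with hypothetical table names and a pseudo-parameter for the watermark (the real logic depends on your business rules):&lt;/p&gt;

```sql
-- Stage 1: materialize only the partitions added since the last transfer,
-- in a temporary dataset in the source region.
CREATE OR REPLACE TABLE tmp_source_dataset.sales
PARTITION BY event_date AS
SELECT *
FROM source_dataset.sales
WHERE event_date BETWEEN @last_transfer_date AND CURRENT_DATE();

-- Stage 3: merge the transferred partitions into the final table,
-- in the destination region, without erasing other countries' data.
MERGE final_dataset.sales AS target
USING tmp_destination_dataset.sales AS source
ON target.event_date = source.event_date
   AND target.country = source.country
   AND target.id = source.id
WHEN MATCHED THEN UPDATE SET amount = source.amount
WHEN NOT MATCHED THEN INSERT ROW;
```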

&lt;p&gt;&lt;em&gt;And voilà !&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6rbknw4b3kvghte5rzxj.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6rbknw4b3kvghte5rzxj.jpg" alt="Architecture : transfer data cross region using Data Transfer Service" width="800" height="367"&gt;&lt;/a&gt;&lt;em&gt;Our final architecture&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The whole thing was orchestrated with a monstrous &lt;a href="https://cloud.google.com/workflows" rel="noopener noreferrer"&gt;Workflow&lt;/a&gt; (more than 130 actual steps) designed with &lt;a href="https://en.wikipedia.org/wiki/Don%27t_repeat_yourself" rel="noopener noreferrer"&gt;DRY&lt;/a&gt; principles: everything is generic, adapts to input params from the user, and is as detached as possible from our specific use case so that other teams can reuse it for other use cases.&lt;br&gt;
If you don’t know Cloud Workflows you should definitely give it a try: it’s a new serverless, minimalist orchestration service from Google Cloud that is explicitly designed to make API calls. Indeed, everything is an API in the cloud, and calling APIs is the only thing Cloud Workflows needs to do (well, some additional features would be nice to have. Google, if you read me, let’s talk)&lt;/p&gt;
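&lt;p&gt;To give an idea, a single Workflow step triggering a DTS run could look like the following sketch (the step name and input params are hypothetical; &lt;code&gt;startManualRuns&lt;/code&gt; is the underlying REST method of the Data Transfer API):&lt;/p&gt;

```yaml
# Hypothetical step: trigger a manual run of an existing transfer config.
- startTransferRun:
    call: http.post
    args:
      url: ${"https://bigquerydatatransfer.googleapis.com/v1/" + args.transferConfigName + ":startManualRuns"}
      auth:
        type: OAuth2
      body:
        requestedRunTime: ${args.requestedRunTime}
    result: transferRun
```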

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;It works like a charm! And thanks to these DRY principles we have been able to make the workflow as generic as possible and use the same code to transfer different sources with different tables and structures. Performance is satisfying: on the order of minutes to less than an hour to transfer GBs to TBs of data across more than 60 countries.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's next ?
&lt;/h2&gt;

&lt;p&gt;In an upcoming article, we will discuss the daily transfer of Google Analytics data with this method, and the issues and solutions we found regarding schema drift, a common problem in the Data Engineering world. Stay tuned…&lt;/p&gt;

</description>
      <category>bigquery</category>
      <category>data</category>
      <category>googlecloud</category>
      <category>analytics</category>
    </item>
    <item>
      <title>Serverless Spark on GCP : How does it compare with Dataflow ?</title>
      <dc:creator>Λ\: Clément Bosc</dc:creator>
      <pubDate>Tue, 16 Nov 2021 08:13:41 +0000</pubDate>
      <link>https://dev.to/stack-labs/serverless-spark-on-gcp-how-does-it-compare-with-dataflow--2o8n</link>
      <guid>https://dev.to/stack-labs/serverless-spark-on-gcp-how-does-it-compare-with-dataflow--2o8n</guid>
      <description>&lt;p&gt;I am a huge fan of serverless products: it allows developers to be focused on bringing business value on the software they are working on and not the underlying infrastructure, at the end they are more autonomous to test and deploy and the scaling capabilities are often better than an equivalent self managed service.&lt;/p&gt;

&lt;p&gt;When it comes to data processing on GCP there are not so many options for serverless products; the choice is often limited to Dataflow. Dataflow is great, but the learning curve is a bit steeper and Beam (the OSS framework behind Dataflow) is not promoted by other providers, which often prioritize Spark. And the solutions to run Spark workloads on GCP were not so lightweight: you had to provision a Dataproc cluster or run your workload in Kubernetes: a whole different level of complexity!&lt;br&gt;
This was until recently, because Google surprisingly announced a new Serverless Spark service at Next’21!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw3kx0725rf0sm9dyjo97.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw3kx0725rf0sm9dyjo97.png" alt="Spark + Google = ❤️" width="611" height="116"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Spark on GCP: a new area for data processing
&lt;/h2&gt;

&lt;p&gt;According to Google, this new service is the industry’s first autoscaling serverless Spark. You do not need any infrastructure provisioning or tuning; it is integrated with BigQuery, Vertex AI and Dataplex, and it’s ready to use via a submission service (API), notebooks or the BigQuery console for any usage you can imagine (except streaming analytics): ETL, data exploration, analysis, and ML.&lt;/p&gt;

&lt;p&gt;On my side I have been able to test the workload submission service (the most interesting part to me): it’s an API endpoint to submit custom Spark code (Python, Java, R or SQL). You can see this submission service as an answer to the &lt;code&gt;spark-submit&lt;/code&gt; problem.&lt;br&gt;
On the autoscaling side, Google will magically decide on the number of executors to run the job optimally, but you can still set it manually.&lt;br&gt;
The service is part of the Dataproc family and accessible on the console through the Dataproc page. After some tests the service seems to work fine, but how does it compare to Dataflow? Let’s check with a small experiment.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3h1nxzsw9lipishe26y9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3h1nxzsw9lipishe26y9.png" alt="The new Dataproc console with Serverless Spark" width="800" height="495"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  The experiment: Dataflow vs Serverless Spark
&lt;/h2&gt;

&lt;p&gt;I wrote 2 simple programs: the first one in PySpark and the second one with Beam (Python SDK). The goal is to read 100GB of ASCII data from a Cloud Storage bucket, parse the content of the files, filter according to a regex pattern, group by a key value (some column) and count the number of lines having the same key. The result is written in Parquet format to another bucket. &lt;/p&gt;

&lt;p&gt;For the input data I used a subset of a 100TB dataset publicly available here: &lt;code&gt;gs://dataproc-datasets-us-central1/teragen/100tb/ascii_sort_1GB_input.*&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;In this dataset, each file is about 1GB and the content is as below (not very relevant):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;7ed&amp;gt;@}"UeR  0000000000000000000000024FDFC680  1111555588884444DDDD0000555511113333DDDDFFFF88881111
3AXi 40'NA  0000000000000000000000024FDFC681  888800000000CCCCEEEEDDDD11110000DDDD55553333CCCC6666
PL.Ez`vXmt  0000000000000000000000024FDFC682  111122225555CCCC000000002222FFFFFFFFFFFF88885555FFFF
5^?a=6o0]?  0000000000000000000000024FDFC683  7777FFFF55551111BBBBDDDD44447777DDDD5555BBBB9999CCCC
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The regex filter applied is &lt;code&gt;^.*FFFF.*$&lt;/code&gt; on the 3rd “column”, meaning the column content for a given record must contain at least 4 consecutive Fs (totally useless, but it's for the sake of the experiment). The grouping key is the first column. The observed reducing factor of the filter operation is about 50%. I agree the experiment is not something we would normally do in a real project, but that is not important: it's just to simulate a workload with a significant compute task.&lt;/p&gt;
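&lt;p&gt;Stripped of the Spark and Beam machinery, the core of both pipelines boils down to the following logic (a pure-Python sketch, assuming the teragen fixed-width layout shown above, not the actual code from the repo):&lt;/p&gt;

```python
import re
from collections import Counter

# Filter: the 3rd fixed-width "column" must contain at least 4 consecutive Fs.
PATTERN = re.compile(r"^.*FFFF.*$")

def count_by_key(lines):
    """Count surviving lines per grouping key (the 1st column).

    Assumes the teragen fixed-width layout: a 10-char key, a 32-char
    counter and the payload, separated by 2 spaces each.
    """
    counts = Counter()
    for line in lines:
        key, payload = line[0:10], line[46:]
        if PATTERN.match(payload):
            counts[key] += 1
    return counts
```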

&lt;p&gt;On the configuration side, for the Dataflow job I enabled the &lt;a href="https://cloud.google.com/blog/products/data-analytics/simplify-and-automate-data-processing-with-dataflow-prime" rel="noopener noreferrer"&gt;Dataflow Prime&lt;/a&gt; feature but left everything else by default (the Prime feature is more optimized and makes it simpler to calculate the total cost of the job). For the Spark service, everything was left as default and I manually asked for 17 executors (why 17? why not 😅)&lt;/p&gt;

&lt;p&gt;The result:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Dataflow&lt;/th&gt;
&lt;th&gt;Serverless Spark&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Total execution time&lt;/td&gt;
&lt;td&gt;36 min 34 sec&lt;/td&gt;
&lt;td&gt;12 min 36 sec&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Number of vCPU&lt;/td&gt;
&lt;td&gt;64 (autoscaling)&lt;/td&gt;
&lt;td&gt;68 (17 executors * 4 vCPUs)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Total cost of the job&lt;/td&gt;
&lt;td&gt;28.88 DPU * $0.071 = $2.05&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Both jobs accomplished the desired task and output 567 M rows in multiple Parquet files (I checked with BigQuery external tables):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp9woggvsch2hmu3sa8te.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp9woggvsch2hmu3sa8te.png" alt="BigQuery results with external tables" width="689" height="458"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Serverless Spark service processed the data in about a third of the time compared to Dataflow! Nice performance 👏.&lt;/p&gt;

&lt;p&gt;Currently however there are some limitations to this Serverless service: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It’s only for batch processing, not streaming (Dataflow would probably be better for that anyway) and job duration is limited to 24 hours.&lt;/li&gt;
&lt;li&gt;There is no monitoring dashboard whatsoever and the Spark UI is not accessible, whereas Dataflow has a pretty good real-time dashboarding functionality&lt;/li&gt;
&lt;li&gt;It’s only Spark 3.2 for now; this might not be a limitation for you, but if you want to migrate existing workloads to the service they might not work.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Remarks about the experiment: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The Beam/Dataflow pipeline was developed with the Python SDK and I would probably achieve better results with the Java SDK and by using Flex templates (the scaling operation is more efficient because the pipeline is containerized), so it’s not totally fair to Dataflow.&lt;/li&gt;
&lt;li&gt;Dataflow targeted an ideal number of 260 vCPUs, but I limited the max number of workers to save cost (and also because my CPU usage quota was at its maximum); without this limit Dataflow would probably have solved the problem much quicker.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To conclude, I am pretty optimistic about this new Spark serverless service. Running Spark on GCP was not really a solution promoted natively by Google (except for &lt;em&gt;lift and shift&lt;/em&gt; migrations on Dataproc), whereas AWS and Azure based their main data processing products on Spark (&lt;em&gt;Glue&lt;/em&gt; and &lt;em&gt;Mapping data flows&lt;/em&gt;). On the downside, the integration with the GCP ecosystem is way behind Dataflow for now (monitoring &amp;amp; operations), it does not support Spark Streaming, and the autoscaling feature is still a bit obscure.&lt;/p&gt;

&lt;p&gt;In the end, keep in mind that Serverless Spark and Dataflow are two different products, and the choice between the two is not only about performance and pricing, but also about the need for batch vs streaming ingestion (Dataflow is much better for the latter) and your team’s background knowledge of the two frameworks: Spark or Beam. &lt;/p&gt;

&lt;p&gt;Anyway, the service should get out of Private Preview by mid-December 2021 and be integrated with other GCP products (BigQuery, Vertex AI, Dataplex) later this year. It’s only the beginning, but it’s promising.&lt;/p&gt;

&lt;h3&gt;
  
  
  References:
&lt;/h3&gt;

&lt;p&gt;Both Spark and Dataflow pipelines are available in the following &lt;a href="https://gitlab.com/clement.bosc/dataflow-vs-spark-serverless/-/blob/main/dataflow_experiment.py" rel="noopener noreferrer"&gt;GitLab repo 🦊&lt;/a&gt;&lt;/p&gt;

</description>
      <category>dataflow</category>
      <category>spark</category>
      <category>analytics</category>
      <category>googlecloud</category>
    </item>
    <item>
      <title>What’s new with BigQuery ?</title>
      <dc:creator>Λ\: Clément Bosc</dc:creator>
      <pubDate>Mon, 25 Oct 2021 07:15:22 +0000</pubDate>
      <link>https://dev.to/stack-labs/whats-new-with-bigquery--35eo</link>
      <guid>https://dev.to/stack-labs/whats-new-with-bigquery--35eo</guid>
      <description>&lt;p&gt;To all BigQuery lovers around here (and others too !) : Google Cloud Next'21 is just over and there was an important part of announcements regarding Data ! &lt;br&gt;
Let’s see the latest news and functionalities of BigQuery, announced at Next or in the past weeks. (careful, some of them are still in Preview)&lt;/p&gt;
&lt;h2&gt;
  
  
  SQL Translator &amp;amp; BigQuery Migration Service
&lt;/h2&gt;

&lt;p&gt;You want to migrate your existing data warehouse to BigQuery? (Congratulations, it’s probably a good idea 🎉) Check out &lt;a href="https://cloud.google.com/bigquery/docs/migration-intro" rel="noopener noreferrer"&gt;BigQuery Migration Service&lt;/a&gt;, a set of free tools to help you migrate. One is particularly interesting: the &lt;a href="https://cloud.google.com/bigquery/docs/interactive-sql-translator" rel="noopener noreferrer"&gt;SQL translator&lt;/a&gt;. Accessible from the API or the Console, this tool will help you translate your current SQL queries into BigQuery Standard SQL. &lt;/p&gt;

&lt;p&gt;Only Teradata SQL is supported for now, but let’s bet there is more to come!&lt;/p&gt;
&lt;h2&gt;
  
  
  Sessions and transactions
&lt;/h2&gt;

&lt;p&gt;There is now support for transactions in BigQuery! Yes, you’ve read that right: transactions! &lt;br&gt;
The functionality is called &lt;a href="https://cloud.google.com/bigquery/docs/reference/standard-sql/transactions" rel="noopener noreferrer"&gt;Multi-statement transactions&lt;/a&gt; and allows you to perform data modifications in one or more tables with ACID guarantees. During the transaction, all reads return a consistent version of the tables referenced in the transaction, and any modification is either committed or rolled back.&lt;br&gt;
Multi-statement transactions spanning multiple queries are started within a &lt;a href="https://cloud.google.com/bigquery/docs/sessions-intro" rel="noopener noreferrer"&gt;Session&lt;/a&gt;. The new session abstraction lets you separate users and applications from each other. &lt;/p&gt;
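&lt;p&gt;For example, a multi-statement transaction looks like this (table names are hypothetical):&lt;/p&gt;

```sql
BEGIN TRANSACTION;

-- Both statements are committed (or rolled back) atomically.
INSERT INTO mydataset.inventory_archive
SELECT * FROM mydataset.inventory WHERE quantity = 0;

DELETE FROM mydataset.inventory WHERE quantity = 0;

COMMIT TRANSACTION;
```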
&lt;h2&gt;
  
  
  Table Snapshots and table Clone
&lt;/h2&gt;

&lt;p&gt;Heard of time travel in BigQuery? It’s pretty useful but only lets you go 7 days back. To store the state of a table for longer than that, &lt;a href="https://cloud.google.com/bigquery/docs/table-snapshots-intro" rel="noopener noreferrer"&gt;table snapshots&lt;/a&gt; are here to help: they allow you to preserve the contents of a table at a particular time, and keep this image for as long as you want. BigQuery minimizes the storage cost by only storing the bytes that differ between a snapshot and its base table.&lt;/p&gt;

&lt;p&gt;Tip: think about periodic snapshot creation with the query scheduler&lt;/p&gt;

&lt;p&gt;A similar upcoming functionality is the table clone. While table snapshots are immutable (you can restore them but not edit them directly), a clone is a mutable version of the base table. Clones allow you to copy a table and perform read/write/schema evolution operations on the copy. Pretty useful for testing production changes. Something nice: same as snapshots, BQ only bills you for the new data, because it stores only the difference between the base table and the cloned one.&lt;/p&gt;
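&lt;p&gt;A minimal sketch of both features, with hypothetical table names:&lt;/p&gt;

```sql
-- Snapshot: an immutable, storage-efficient image of the table.
CREATE SNAPSHOT TABLE mydataset.orders_backup
CLONE mydataset.orders
OPTIONS (expiration_timestamp = TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 90 DAY));

-- Clone: a mutable, writable copy, billed only on the delta.
CREATE TABLE mydataset.orders_dev
CLONE mydataset.orders;
```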
&lt;h2&gt;
  
  
  Table sampling
&lt;/h2&gt;

&lt;p&gt;You have a machine learning model to train with BigQuery ML but want to use only a subset of a table for the training set? Try the &lt;a href="https://cloud.google.com/bigquery/docs/table-sampling" rel="noopener noreferrer"&gt;table sampling&lt;/a&gt; functionality: sampling returns a variety of records while avoiding the costs associated with scanning and processing an entire table (unlike the &lt;code&gt;LIMIT&lt;/code&gt; clause!)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- this will give you a random 10% of a table data &lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;my_table&lt;/span&gt; &lt;span class="n"&gt;TABLESAMPLE&lt;/span&gt; &lt;span class="k"&gt;SYSTEM&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt; &lt;span class="n"&gt;PERCENT&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Table functions
&lt;/h2&gt;

&lt;p&gt;Have you ever wanted parameters in views? Now you have &lt;a href="https://cloud.google.com/bigquery/docs/reference/standard-sql/table-functions" rel="noopener noreferrer"&gt;table functions&lt;/a&gt; (TVF) for that! Table-valued functions allow you to create a SQL function that returns a table. You can see it as creating a view with parameters, and you can call the result of this function in standard queries:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="k"&gt;REPLACE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="k"&gt;FUNCTION&lt;/span&gt; &lt;span class="n"&gt;mydataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;names_by_year&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="n"&gt;INT64&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;AS&lt;/span&gt;
  &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="nb"&gt;year&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;number&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt;
  &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="nv"&gt;`bigquery-public-data.usa_names.usa_1910_current`&lt;/span&gt;
  &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="nb"&gt;year&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;
  &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="nb"&gt;year&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;

&lt;span class="c1"&gt;-- use your table function in an other query, just like a view&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;mydataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;names_by_year&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1950&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
  &lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can use &lt;strong&gt;Authorized functions&lt;/strong&gt; to give specific users access to your TVF without giving them access to the underlying table (just like authorized views work for standard views)&lt;/p&gt;

&lt;h2&gt;
  
  
  Storage Write API
&lt;/h2&gt;

&lt;p&gt;There is now a new write API unifying the old batch and streaming APIs: the &lt;a href="https://cloud.google.com/bigquery/docs/write-api" rel="noopener noreferrer"&gt;Storage Write API&lt;/a&gt;! This new API gives you more control over the loading process and is more performant than the previous ones. Moreover, it is cheaper than the legacy streaming insert functionality while providing a free usage tier!&lt;/p&gt;

&lt;h2&gt;
  
  
  BigQuery Omni
&lt;/h2&gt;

&lt;p&gt;The multi-cloud analytics engine BigQuery Omni is going GA for AWS and Azure! With Omni you can query large amounts of data in AWS S3 or Azure ADLS without maintaining complex cross-cloud Extract-Transform-Load pipelines. The functionality allows multi-cloud organisations to save on egress costs and join data between cloud providers and locations. The BigQuery console on GCP becomes the central access point for analytics, and you can define governance and access control in a single place!&lt;/p&gt;

&lt;h2&gt;
  
  
  BigQuery BI Engine
&lt;/h2&gt;

&lt;p&gt;After being in Preview for a while with Data Studio, BigQuery &lt;a href="https://cloud.google.com/bi-engine/docs/sql-interface-overview" rel="noopener noreferrer"&gt;BI Engine&lt;/a&gt; is also going GA! BI Engine is a reserved in-memory database used to obtain sub-second query results with any BI tool (Looker, Tableau, Power BI, etc.), even over very large amounts of data. This functionality removes the need for OLAP cubes and complex ETL pipelines: Google automatically handles the movement and freshness of the data between standard BigQuery storage and BI Engine. And this also works for streaming!&lt;/p&gt;

&lt;h2&gt;
  
  
  Parameterized data types
&lt;/h2&gt;

&lt;p&gt;Historically, BigQuery did not allow restricting the size of certain data types, but this is about to change: there is now a &lt;a href="https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types#parameterized_data_types" rel="noopener noreferrer"&gt;parameterized data types&lt;/a&gt; syntax for &lt;code&gt;STRING&lt;/code&gt;, &lt;code&gt;BYTES&lt;/code&gt;, &lt;code&gt;NUMERIC&lt;/code&gt; and &lt;code&gt;BIGNUMERIC&lt;/code&gt;. Want to raise an error if the text value in a certain column is longer than 10 characters? Type your column as &lt;code&gt;STRING(10)&lt;/code&gt;&lt;/p&gt;
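&lt;p&gt;A minimal sketch (hypothetical table and columns):&lt;/p&gt;

```sql
CREATE TABLE mydataset.customers (
  country_code STRING(2),   -- fails on insert if longer than 2 characters
  price NUMERIC(10, 2)      -- at most 10 digits, 2 after the decimal point
);
```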

&lt;h2&gt;
  
  
  SQL functions : PIVOT and QUALIFY
&lt;/h2&gt;

&lt;p&gt;Among many new geography functions (not detailed here), here are two interesting new functions and Standard SQL syntax evolutions that caught my attention:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax#pivot_operator" rel="noopener noreferrer"&gt;&lt;code&gt;PIVOT&lt;/code&gt;&lt;/a&gt; and &lt;a href="https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax#unpivot_operator" rel="noopener noreferrer"&gt;&lt;code&gt;UNPIVOT&lt;/code&gt;&lt;/a&gt; to turn rows into columns and columns into rows.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax#qualify_clause" rel="noopener noreferrer"&gt;&lt;code&gt;QUALIFY&lt;/code&gt;&lt;/a&gt;, a new SQL clause used to filter on the result of an analytical function without the need for a subquery ! It’s kind of like the &lt;code&gt;HAVING&lt;/code&gt; to filter on a standard aggregation function.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;Produce&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;Produce&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;category&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'vegetable'&lt;/span&gt;
&lt;span class="n"&gt;QUALIFY&lt;/span&gt; &lt;span class="n"&gt;RANK&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;category&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;purchases&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
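&lt;p&gt;And a &lt;code&gt;PIVOT&lt;/code&gt; example on a hypothetical Produce table (with product, sales and quarter columns), turning one row per quarter into one column per quarter:&lt;/p&gt;

```sql
SELECT * FROM
  (SELECT product, sales, quarter FROM Produce)
PIVOT (SUM(sales) FOR quarter IN ('Q1', 'Q2', 'Q3', 'Q4'))
```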



</description>
      <category>database</category>
      <category>analytics</category>
      <category>sql</category>
      <category>bigquery</category>
    </item>
  </channel>
</rss>
