<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Jader Lima</title>
    <description>The latest articles on DEV Community by Jader Lima (@jader_lima_b72a63be5bbddc).</description>
    <link>https://dev.to/jader_lima_b72a63be5bbddc</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1878052%2F95282525-5af2-48aa-8c30-ebe572a23d06.jpg</url>
      <title>DEV Community: Jader Lima</title>
      <link>https://dev.to/jader_lima_b72a63be5bbddc</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/jader_lima_b72a63be5bbddc"/>
    <language>en</language>
    <item>
      <title>Using Google Cloud Functions for Three-Tier Data Processing with Google Composer and Automated Deployments via GitHub Actions</title>
      <dc:creator>Jader Lima</dc:creator>
      <pubDate>Fri, 25 Oct 2024 02:03:03 +0000</pubDate>
      <link>https://dev.to/jader_lima_b72a63be5bbddc/using-google-cloud-functions-for-three-tier-data-processing-with-google-composer-and-automated-deployments-via-github-actions-5f02</link>
      <guid>https://dev.to/jader_lima_b72a63be5bbddc/using-google-cloud-functions-for-three-tier-data-processing-with-google-composer-and-automated-deployments-via-github-actions-5f02</guid>
      <description>&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;About&lt;/li&gt;
&lt;li&gt;Tools Used&lt;/li&gt;
&lt;li&gt;Solution Architecture Diagram&lt;/li&gt;
&lt;li&gt;
Deployment Process

&lt;ul&gt;
&lt;li&gt;Prerequisites&lt;/li&gt;
&lt;li&gt;Secrets for GitHub Actions&lt;/li&gt;
&lt;li&gt;To create a new secret:&lt;/li&gt;
&lt;li&gt;Setting Up the DevOps Service Account&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
GitHub Actions Pipeline: Steps

&lt;ul&gt;
&lt;li&gt;enable-services&lt;/li&gt;
&lt;li&gt;deploy-buckets&lt;/li&gt;
&lt;li&gt;deploy-cloud-function&lt;/li&gt;
&lt;li&gt;deploy-composer-service-account&lt;/li&gt;
&lt;li&gt;deploy-bigquery-dataset-bigquery-tables&lt;/li&gt;
&lt;li&gt;deploy-composer-environment&lt;/li&gt;
&lt;li&gt;deploy-composer-http-connection&lt;/li&gt;
&lt;li&gt;deploy-dags&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;GitHub Actions Workflow Explanation&lt;/li&gt;
&lt;li&gt;Resources Created After Deployment&lt;/li&gt;
&lt;li&gt;Conclusion&lt;/li&gt;
&lt;li&gt;References&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  About
&lt;/h2&gt;

&lt;p&gt;This post explores the use of &lt;strong&gt;Google Cloud Functions&lt;/strong&gt; for processing data in a &lt;strong&gt;three-tier architecture&lt;/strong&gt;. The solution is orchestrated with &lt;strong&gt;Google Composer&lt;/strong&gt; and features &lt;strong&gt;automated deployments&lt;/strong&gt; using &lt;strong&gt;GitHub Actions&lt;/strong&gt;. We will walk through the tools used, deployment process, and pipeline steps, providing a clear guide for building an end-to-end cloud-based data pipeline.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tools Used
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Google Cloud Platform (GCP):&lt;/strong&gt; The primary cloud environment.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cloud Storage:&lt;/strong&gt; For storing input and processed data across different layers (Bronze, Silver, Gold); a possible folder layout is sketched after this list.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cloud Functions:&lt;/strong&gt; Serverless functions responsible for data processing in each tier.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Google Composer:&lt;/strong&gt; An orchestration tool based on Apache Airflow, used to schedule and manage workflows.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GitHub Actions:&lt;/strong&gt; Automation tool for deploying and managing the pipeline infrastructure.&lt;/li&gt;
&lt;/ol&gt;
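
&lt;p&gt;To make the storage layers concrete, the sketch below shows one possible folder layout for the data lake bucket. The folder names are illustrative assumptions suggested by the TRANSIENT_FILE_PATH, BRONZE_PATH and SILVER_PATH variables used later in the workflow, not values taken from the repository.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;gs://your-datalake-bucket/
  transient/   # raw input files uploaded by the deploy-buckets job
  bronze/      # raw data converted to Parquet by the first Cloud Function
  silver/      # cleaned and standardized data
  gold/        # curated data, ready to be loaded into BigQuery
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;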

&lt;h2&gt;
  
  
  Solution Architecture Diagram
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0qpvy32rtwd6vzg3m77c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0qpvy32rtwd6vzg3m77c.png" alt="project architecture" width="800" height="445"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Deployment Process
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;p&gt;Before setting up the project, ensure you have the following:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;GCP Account&lt;/strong&gt;: A Google Cloud account with billing enabled.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Service Account for DevOps&lt;/strong&gt;: A service account with the required permissions to deploy resources in GCP.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Secrets in GitHub&lt;/strong&gt;: Store the GCP service account credentials, project ID, and bucket name as secrets in GitHub for secure access.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Secrets for GitHub Actions
&lt;/h3&gt;

&lt;p&gt;To securely access your GCP project and resources, set the following secrets in GitHub Actions (a short example of how a job consumes them follows this list):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;BUCKET_DATALAKE&lt;/code&gt;: Your Cloud Storage bucket for the data lake.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;GCP_DEVOPS_SA_KEY&lt;/code&gt;: The service account key in JSON format.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;PROJECT_ID&lt;/code&gt;: Your GCP project ID.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;REGION_PROJECT_ID&lt;/code&gt;: The region where your GCP project is deployed.&lt;/li&gt;
&lt;/ul&gt;
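
&lt;p&gt;These secrets are read in the workflow through the &lt;code&gt;${{ secrets.* }}&lt;/code&gt; context. As a minimal sketch, using the same pattern repeated in every job below, the authentication steps look like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;- name: Authorize GCP
  uses: 'google-github-actions/auth@v2'
  with:
    credentials_json: ${{ secrets.GCP_DEVOPS_SA_KEY }}

- name: Set up Cloud SDK
  uses: google-github-actions/setup-gcloud@v2
  with:
    project_id: ${{ secrets.PROJECT_ID }}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;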

&lt;h3&gt;
  
  
  To create a new secret:
&lt;/h3&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. In project repository, menu **Settings** 
2. **Security**, 
3. **Secrets and variables**,click in access **Action**
4. **New repository secret**, type a **name** and **value** for secret.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi45cicz0q89ije7j70yf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi45cicz0q89ije7j70yf.png" alt="github secret creation" width="800" height="609"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For more details, see:&lt;br&gt;
&lt;a href="https://docs.github.com/pt/actions/security-for-github-actions/security-guides/using-secrets-in-github-actions" rel="noopener noreferrer"&gt;https://docs.github.com/pt/actions/security-for-github-actions/security-guides/using-secrets-in-github-actions&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Setting Up the DevOps Service Account
&lt;/h3&gt;

&lt;p&gt;Create a service account in GCP with permissions for Cloud Functions, Composer, BigQuery, and Cloud Storage. Grant it the necessary roles, such as those listed below; a gcloud sketch follows the list:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cloud Functions Admin&lt;/li&gt;
&lt;li&gt;Composer User&lt;/li&gt;
&lt;li&gt;BigQuery Data Editor&lt;/li&gt;
&lt;li&gt;Storage Object Admin&lt;/li&gt;
&lt;/ul&gt;
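
&lt;p&gt;As a minimal sketch of the equivalent gcloud commands: the service account name &lt;code&gt;gcp-devops&lt;/code&gt; and the &lt;code&gt;PROJECT_ID&lt;/code&gt; placeholder are assumptions for illustration, and only one role binding is shown; repeat it for each role above. This is a one-time bootstrap, run by an account that can create service accounts and grant IAM roles.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Create the DevOps service account (the name is an assumption)
gcloud iam service-accounts create gcp-devops \
    --display-name="DevOps service account for GitHub Actions"

# Grant one of the roles listed above; repeat for the remaining roles
gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="serviceAccount:gcp-devops@PROJECT_ID.iam.gserviceaccount.com" \
    --role="roles/cloudfunctions.admin"

# Export a JSON key and store its contents in the GCP_DEVOPS_SA_KEY secret
gcloud iam service-accounts keys create gcp-devops-key.json \
    --iam-account=gcp-devops@PROJECT_ID.iam.gserviceaccount.com
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;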

&lt;h2&gt;
  
  
  GitHub Actions Pipeline: Steps
&lt;/h2&gt;

&lt;p&gt;The pipeline automates the entire deployment process, ensuring all components are set up correctly. Here's a breakdown of the key jobs from the GitHub Actions file, each responsible for a different aspect of the deployment.&lt;/p&gt;
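
&lt;p&gt;The jobs below also reference a set of workflow-level environment variables (for example &lt;code&gt;env.REGION&lt;/code&gt;, &lt;code&gt;env.CLOUD_FUNCTION_1_NAME&lt;/code&gt; and &lt;code&gt;env.GCP_SERVICE_API_0&lt;/code&gt;), declared once in an &lt;code&gt;env:&lt;/code&gt; block at the top of the workflow file. The sketch below shows such a block; the values are illustrative assumptions, not the repository's actual settings.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;env:
  REGION: us-east1                                        # example region
  # APIs enabled by the enable-services job (assumed list)
  GCP_SERVICE_API_0: cloudfunctions.googleapis.com
  GCP_SERVICE_API_1: cloudbuild.googleapis.com
  GCP_SERVICE_API_2: composer.googleapis.com
  GCP_SERVICE_API_3: bigquery.googleapis.com
  GCP_SERVICE_API_4: run.googleapis.com
  # Cloud Functions (function 1 name comes from the workflow comments; 2 and 3 are assumed)
  FUNCTION_SCRIPTS: cloud_function_scripts
  CLOUD_FUNCTION_1_NAME: csv_to_parquet
  CLOUD_FUNCTION_2_NAME: bronze_to_silver
  CLOUD_FUNCTION_3_NAME: silver_to_gold
  PYTHON_FUNCTION_RUNTIME: python310
  FUNCTION_CPU: 1
  FUNCTION_MEMORY: 256Mi
  # Data lake folders
  INPUT_FOLDER: transient
  TRANSIENT_FILE_PATH: transient
  BRONZE_PATH: bronze
  SILVER_PATH: silver
  # Composer and BigQuery
  SERVICE_ACCOUNT_NAME: composer-sa
  SERVICE_ACCOUNT_DESCRIPTION: composer-sa
  COMPOSER_ENV_NAME: composer-env
  COMPOSER_ENV_SIZE: small
  COMPOSER_IMAGE_VERSION: composer-2.9.7-airflow-2.9.3    # use a currently supported image
  HTTP_CONNECTION: google_cloud_function_connection
  CONNECTION_TYPE: http
  BIGQUERY_DATASET: customer_dataset
  BIGQUERY_TABLE_CUSTOMER: customer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;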

&lt;h3&gt;
  
  
  enable-services
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;enable-services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-22.04&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v2&lt;/span&gt;

    &lt;span class="c1"&gt;# Step to Authenticate with GCP&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Authorize GCP&lt;/span&gt;
      &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;google-github-actions/auth@v2'&lt;/span&gt;
      &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;credentials_json&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  &lt;span class="s"&gt;${{ secrets.GCP_DEVOPS_SA_KEY }}&lt;/span&gt;

    &lt;span class="c1"&gt;# Step to Configure  Cloud SDK&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Set up Cloud SDK&lt;/span&gt;
      &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;google-github-actions/setup-gcloud@v2&lt;/span&gt;
      &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;&amp;gt;=&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;363.0.0'&lt;/span&gt;
        &lt;span class="na"&gt;project_id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.PROJECT_ID }}&lt;/span&gt;

    &lt;span class="c1"&gt;# Step to Configure Docker to use the gcloud command-line tool as a credential helper&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Configure Docker&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
        &lt;span class="s"&gt;gcloud auth configure-docker&lt;/span&gt;

    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Set up python &lt;/span&gt;&lt;span class="m"&gt;3.8&lt;/span&gt;
      &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/setup-python@v2&lt;/span&gt;
      &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;python-version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;3.8.16&lt;/span&gt;

    &lt;span class="c1"&gt;# Step to Create GCP Bucket &lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Enable gcp service api's&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
        &lt;span class="s"&gt;gcloud services enable ${{ env.GCP_SERVICE_API_0 }}&lt;/span&gt;
        &lt;span class="s"&gt;gcloud services enable ${{ env.GCP_SERVICE_API_1 }}&lt;/span&gt;
        &lt;span class="s"&gt;gcloud services enable ${{ env.GCP_SERVICE_API_2 }}&lt;/span&gt;
        &lt;span class="s"&gt;gcloud services enable ${{ env.GCP_SERVICE_API_3 }}&lt;/span&gt;
        &lt;span class="s"&gt;gcloud services enable ${{ env.GCP_SERVICE_API_4 }}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  deploy-buckets
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;deploy-buckets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;needs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;enable-services&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-22.04&lt;/span&gt;
    &lt;span class="na"&gt;timeout-minutes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;

    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Checkout&lt;/span&gt;
      &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;

    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Authorize GCP&lt;/span&gt;
      &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;google-github-actions/auth@v2'&lt;/span&gt;
      &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;credentials_json&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  &lt;span class="s"&gt;${{ secrets.GCP_DEVOPS_SA_KEY }}&lt;/span&gt;

    &lt;span class="c1"&gt;# Step to Authenticate with GCP&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Set up Cloud SDK&lt;/span&gt;
      &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;google-github-actions/setup-gcloud@v2&lt;/span&gt;
      &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;&amp;gt;=&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;363.0.0'&lt;/span&gt;
        &lt;span class="na"&gt;project_id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.PROJECT_ID }}&lt;/span&gt;

    &lt;span class="c1"&gt;# Step to Configure Docker to use the gcloud command-line tool as a credential helper&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Configure Docker&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
        &lt;span class="s"&gt;gcloud auth configure-docker&lt;/span&gt;

    &lt;span class="c1"&gt;# Step to Create GCP Bucket &lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Create Google Cloud Storage - datalake&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
        &lt;span class="s"&gt;if ! gsutil ls -p ${{ secrets.PROJECT_ID }} gs://${{ secrets.BUCKET_DATALAKE }} &amp;amp;&amp;gt; /dev/null; \&lt;/span&gt;
          &lt;span class="s"&gt;then \&lt;/span&gt;
            &lt;span class="s"&gt;gcloud storage buckets create gs://${{ secrets.BUCKET_DATALAKE }} --default-storage-class=nearline --location=${{ env.REGION }}&lt;/span&gt;
          &lt;span class="s"&gt;else&lt;/span&gt;
            &lt;span class="s"&gt;echo "Cloud Storage : gs://${{ secrets.BUCKET_DATALAKE }}  already exists" ! &lt;/span&gt;
          &lt;span class="s"&gt;fi&lt;/span&gt;


    &lt;span class="c1"&gt;# Step to Upload the file to GCP Bucket - transient files&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Upload transient files to Google Cloud Storage&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
        &lt;span class="s"&gt;TARGET=${{ env.INPUT_FOLDER }}&lt;/span&gt;
        &lt;span class="s"&gt;BUCKET_PATH=${{ secrets.BUCKET_DATALAKE }}/${{ env.INPUT_FOLDER }}    &lt;/span&gt;
        &lt;span class="s"&gt;gsutil cp -r $TARGET gs://${BUCKET_PATH}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  deploy-cloud-function
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;deploy-cloud-function&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;needs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;enable-services&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;deploy-buckets&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-22.04&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v2&lt;/span&gt;

    &lt;span class="c1"&gt;# Step to Authenticate with GCP&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Authorize GCP&lt;/span&gt;
      &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;google-github-actions/auth@v2'&lt;/span&gt;
      &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;credentials_json&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  &lt;span class="s"&gt;${{ secrets.GCP_DEVOPS_SA_KEY }}&lt;/span&gt;

    &lt;span class="c1"&gt;# Step to Configure  Cloud SDK&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Set up Cloud SDK&lt;/span&gt;
      &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;google-github-actions/setup-gcloud@v2&lt;/span&gt;
      &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;&amp;gt;=&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;363.0.0'&lt;/span&gt;
        &lt;span class="na"&gt;project_id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.PROJECT_ID }}&lt;/span&gt;

    &lt;span class="c1"&gt;# Step to Configure Docker to use the gcloud command-line tool as a credential helper&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Configure Docker&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
        &lt;span class="s"&gt;gcloud auth configure-docker&lt;/span&gt;

    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Set up python &lt;/span&gt;&lt;span class="m"&gt;3.10&lt;/span&gt;
      &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/setup-python@v2&lt;/span&gt;
      &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;python-version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;3.10.12&lt;/span&gt;
    &lt;span class="c1"&gt;#cloud_function_scripts/csv_to_parquet&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Create cloud function - ${{ env.CLOUD_FUNCTION_1_NAME }}&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
        &lt;span class="s"&gt;cd ${{ env.FUNCTION_SCRIPTS }}/${{ env.CLOUD_FUNCTION_1_NAME }}&lt;/span&gt;
        &lt;span class="s"&gt;gcloud functions deploy ${{ env.CLOUD_FUNCTION_1_NAME }} \&lt;/span&gt;
        &lt;span class="s"&gt;--gen2 \&lt;/span&gt;
        &lt;span class="s"&gt;--cpu=${{ env.FUNCTION_CPU  }} \&lt;/span&gt;
        &lt;span class="s"&gt;--memory=${{ env.FUNCTION_MEMORY  }} \&lt;/span&gt;
        &lt;span class="s"&gt;--runtime ${{ env.PYTHON_FUNCTION_RUNTIME }} \&lt;/span&gt;
        &lt;span class="s"&gt;--trigger-http \&lt;/span&gt;
        &lt;span class="s"&gt;--region ${{ env.REGION }} \&lt;/span&gt;
        &lt;span class="s"&gt;--entry-point ${{ env.CLOUD_FUNCTION_1_NAME }}&lt;/span&gt;

    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Create cloud function - ${{ env.CLOUD_FUNCTION_2_NAME }}&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
        &lt;span class="s"&gt;cd ${{ env.FUNCTION_SCRIPTS }}/${{ env.CLOUD_FUNCTION_2_NAME }}&lt;/span&gt;
        &lt;span class="s"&gt;gcloud functions deploy ${{ env.CLOUD_FUNCTION_2_NAME }} \&lt;/span&gt;
        &lt;span class="s"&gt;--gen2 \&lt;/span&gt;
        &lt;span class="s"&gt;--cpu=${{ env.FUNCTION_CPU  }} \&lt;/span&gt;
        &lt;span class="s"&gt;--memory=${{ env.FUNCTION_MEMORY  }} \&lt;/span&gt;
        &lt;span class="s"&gt;--runtime ${{ env.PYTHON_FUNCTION_RUNTIME }} \&lt;/span&gt;
        &lt;span class="s"&gt;--trigger-http \&lt;/span&gt;
        &lt;span class="s"&gt;--region ${{ env.REGION }} \&lt;/span&gt;
        &lt;span class="s"&gt;--entry-point ${{ env.CLOUD_FUNCTION_2_NAME }}&lt;/span&gt;

    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Create cloud function - ${{ env.CLOUD_FUNCTION_3_NAME }}&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
        &lt;span class="s"&gt;cd ${{ env.FUNCTION_SCRIPTS }}/${{ env.CLOUD_FUNCTION_3_NAME }}&lt;/span&gt;
        &lt;span class="s"&gt;gcloud functions deploy ${{ env.CLOUD_FUNCTION_3_NAME }} \&lt;/span&gt;
        &lt;span class="s"&gt;--gen2 \&lt;/span&gt;
        &lt;span class="s"&gt;--cpu=${{ env.FUNCTION_CPU  }} \&lt;/span&gt;
        &lt;span class="s"&gt;--memory=${{ env.FUNCTION_MEMORY  }} \&lt;/span&gt;
        &lt;span class="s"&gt;--runtime ${{ env.PYTHON_FUNCTION_RUNTIME }} \&lt;/span&gt;
        &lt;span class="s"&gt;--trigger-http \&lt;/span&gt;
        &lt;span class="s"&gt;--region ${{ env.REGION }} \&lt;/span&gt;
        &lt;span class="s"&gt;--entry-point ${{ env.CLOUD_FUNCTION_3_NAME }}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  deploy-composer-service-account
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;deploy-composer-service-account&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;needs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;enable-services&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;deploy-buckets&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;deploy-cloud-function&lt;/span&gt; &lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-22.04&lt;/span&gt;
    &lt;span class="na"&gt;timeout-minutes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;

    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Checkout&lt;/span&gt;
      &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;

    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Authorize GCP&lt;/span&gt;
      &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;google-github-actions/auth@v2'&lt;/span&gt;
      &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;credentials_json&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  &lt;span class="s"&gt;${{ secrets.GCP_DEVOPS_SA_KEY }}&lt;/span&gt;

    &lt;span class="c1"&gt;# Step to Authenticate with GCP&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Set up Cloud SDK&lt;/span&gt;
      &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;google-github-actions/setup-gcloud@v2&lt;/span&gt;
      &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;&amp;gt;=&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;363.0.0'&lt;/span&gt;
        &lt;span class="na"&gt;project_id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.PROJECT_ID }}&lt;/span&gt;

    &lt;span class="c1"&gt;# Step to Configure Docker to use the gcloud command-line tool as a credential helper&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Configure Docker&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
        &lt;span class="s"&gt;gcloud auth configure-docker&lt;/span&gt;


    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Create service account&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;

        &lt;span class="s"&gt;if ! gcloud iam service-accounts list | grep -i ${{ env.SERVICE_ACCOUNT_NAME}} &amp;amp;&amp;gt; /dev/null; \&lt;/span&gt;
          &lt;span class="s"&gt;then \&lt;/span&gt;
            &lt;span class="s"&gt;gcloud iam service-accounts create ${{ env.SERVICE_ACCOUNT_NAME }} \&lt;/span&gt;
            &lt;span class="s"&gt;--display-name=${{ env.SERVICE_ACCOUNT_DESCRIPTION }}&lt;/span&gt;
          &lt;span class="s"&gt;fi&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Add permissions to service account&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
        &lt;span class="s"&gt;gcloud projects add-iam-policy-binding ${{secrets.PROJECT_ID}} \&lt;/span&gt;
          &lt;span class="s"&gt;--member="serviceAccount:${{env.SERVICE_ACCOUNT_NAME}}@${{secrets.PROJECT_ID}}.iam.gserviceaccount.com" \&lt;/span&gt;
          &lt;span class="s"&gt;--role="roles/composer.user"&lt;/span&gt;

        &lt;span class="s"&gt;gcloud projects add-iam-policy-binding ${{secrets.PROJECT_ID}} \&lt;/span&gt;
          &lt;span class="s"&gt;--member="serviceAccount:${{env.SERVICE_ACCOUNT_NAME}}@${{secrets.PROJECT_ID}}.iam.gserviceaccount.com" \&lt;/span&gt;
          &lt;span class="s"&gt;--role="roles/storage.objectAdmin"&lt;/span&gt;

        &lt;span class="s"&gt;gcloud projects add-iam-policy-binding ${{secrets.PROJECT_ID}} \&lt;/span&gt;
          &lt;span class="s"&gt;--member="serviceAccount:${{env.SERVICE_ACCOUNT_NAME}}@${{secrets.PROJECT_ID}}.iam.gserviceaccount.com" \&lt;/span&gt;
          &lt;span class="s"&gt;--role="roles/cloudfunctions.invoker"&lt;/span&gt;

        &lt;span class="s"&gt;# Permissão para criar e gerenciar ambientes Composer&lt;/span&gt;
        &lt;span class="s"&gt;gcloud projects add-iam-policy-binding ${{secrets.PROJECT_ID}} \&lt;/span&gt;
          &lt;span class="s"&gt;--member="serviceAccount:${{env.SERVICE_ACCOUNT_NAME}}@${{secrets.PROJECT_ID}}.iam.gserviceaccount.com" \&lt;/span&gt;
          &lt;span class="s"&gt;--role="roles/composer.admin"&lt;/span&gt;

        &lt;span class="s"&gt;gcloud projects add-iam-policy-binding ${{secrets.PROJECT_ID}} \&lt;/span&gt;
          &lt;span class="s"&gt;--member="serviceAccount:${{env.SERVICE_ACCOUNT_NAME}}@${{secrets.PROJECT_ID}}.iam.gserviceaccount.com" \&lt;/span&gt;
          &lt;span class="s"&gt;--role="roles/composer.worker"&lt;/span&gt;

        &lt;span class="s"&gt;# Permissão para criar e configurar instâncias e recursos na VPC&lt;/span&gt;
        &lt;span class="s"&gt;gcloud projects add-iam-policy-binding ${{secrets.PROJECT_ID}} \&lt;/span&gt;
          &lt;span class="s"&gt;--member="serviceAccount:${{env.SERVICE_ACCOUNT_NAME}}@${{secrets.PROJECT_ID}}.iam.gserviceaccount.com" \&lt;/span&gt;
          &lt;span class="s"&gt;--role="roles/compute.networkAdmin"&lt;/span&gt;

        &lt;span class="s"&gt;# Permissão para interagir com o Cloud Storage, necessário para buckets e logs&lt;/span&gt;
        &lt;span class="s"&gt;gcloud projects add-iam-policy-binding ${{secrets.PROJECT_ID}} \&lt;/span&gt;
          &lt;span class="s"&gt;--member="serviceAccount:${{env.SERVICE_ACCOUNT_NAME}}@${{secrets.PROJECT_ID}}.iam.gserviceaccount.com" \&lt;/span&gt;
          &lt;span class="s"&gt;--role="roles/storage.admin"&lt;/span&gt;

        &lt;span class="s"&gt;# Permissão para criar e gerenciar recursos no projeto, como buckets e instâncias&lt;/span&gt;
        &lt;span class="s"&gt;gcloud projects add-iam-policy-binding ${{secrets.PROJECT_ID}} \&lt;/span&gt;
          &lt;span class="s"&gt;--member="serviceAccount:${{env.SERVICE_ACCOUNT_NAME}}@${{secrets.PROJECT_ID}}.iam.gserviceaccount.com" \&lt;/span&gt;
          &lt;span class="s"&gt;--role="roles/editor"&lt;/span&gt;

        &lt;span class="s"&gt;# Permissão para acessar e usar recursos necessários para o IAM&lt;/span&gt;
        &lt;span class="s"&gt;gcloud projects add-iam-policy-binding ${{secrets.PROJECT_ID}} \&lt;/span&gt;
          &lt;span class="s"&gt;--member="serviceAccount:${{env.SERVICE_ACCOUNT_NAME}}@${{secrets.PROJECT_ID}}.iam.gserviceaccount.com" \&lt;/span&gt;
          &lt;span class="s"&gt;--role="roles/iam.serviceAccountUser"&lt;/span&gt;

        &lt;span class="s"&gt;gcloud functions add-iam-policy-binding ${{env.CLOUD_FUNCTION_1_NAME}} \&lt;/span&gt;
          &lt;span class="s"&gt;--region="${{env.REGION}}" \&lt;/span&gt;
          &lt;span class="s"&gt;--member="serviceAccount:${{env.SERVICE_ACCOUNT_NAME}}@${{secrets.PROJECT_ID}}.iam.gserviceaccount.com" \&lt;/span&gt;
          &lt;span class="s"&gt;--role="roles/cloudfunctions.invoker"&lt;/span&gt;

        &lt;span class="s"&gt;gcloud functions add-invoker-policy-binding ${{env.CLOUD_FUNCTION_1_NAME}} \&lt;/span&gt;
          &lt;span class="s"&gt;--region="${{env.REGION}}" \&lt;/span&gt;
          &lt;span class="s"&gt;--member="serviceAccount:${{env.SERVICE_ACCOUNT_NAME}}@${{secrets.PROJECT_ID}}.iam.gserviceaccount.com" &lt;/span&gt;

        &lt;span class="s"&gt;gcloud functions add-iam-policy-binding ${{env.CLOUD_FUNCTION_2_NAME}} \&lt;/span&gt;
          &lt;span class="s"&gt;--region="${{env.REGION}}" \&lt;/span&gt;
          &lt;span class="s"&gt;--member="serviceAccount:${{env.SERVICE_ACCOUNT_NAME}}@${{secrets.PROJECT_ID}}.iam.gserviceaccount.com" \&lt;/span&gt;
          &lt;span class="s"&gt;--role="roles/cloudfunctions.invoker"&lt;/span&gt;

        &lt;span class="s"&gt;gcloud functions add-invoker-policy-binding ${{env.CLOUD_FUNCTION_2_NAME}} \&lt;/span&gt;
          &lt;span class="s"&gt;--region="${{env.REGION}}" \&lt;/span&gt;
          &lt;span class="s"&gt;--member="serviceAccount:${{env.SERVICE_ACCOUNT_NAME}}@${{secrets.PROJECT_ID}}.iam.gserviceaccount.com"     &lt;/span&gt;

        &lt;span class="s"&gt;gcloud functions add-iam-policy-binding ${{env.CLOUD_FUNCTION_3_NAME}} \&lt;/span&gt;
          &lt;span class="s"&gt;--region="${{env.REGION}}" \&lt;/span&gt;
          &lt;span class="s"&gt;--member="serviceAccount:${{env.SERVICE_ACCOUNT_NAME}}@${{secrets.PROJECT_ID}}.iam.gserviceaccount.com" \&lt;/span&gt;
          &lt;span class="s"&gt;--role="roles/cloudfunctions.invoker"&lt;/span&gt;

        &lt;span class="s"&gt;gcloud functions add-invoker-policy-binding ${{env.CLOUD_FUNCTION_3_NAME}} \&lt;/span&gt;
          &lt;span class="s"&gt;--region="${{env.REGION}}" \&lt;/span&gt;
          &lt;span class="s"&gt;--member="serviceAccount:${{env.SERVICE_ACCOUNT_NAME}}@${{secrets.PROJECT_ID}}.iam.gserviceaccount.com"   &lt;/span&gt;

        &lt;span class="s"&gt;SERVICE_NAME_1=$(gcloud functions describe ${{ env.CLOUD_FUNCTION_1_NAME }} --region=${{ env.REGION }} --format="value(serviceConfig.service)")&lt;/span&gt;
        &lt;span class="s"&gt;gcloud run services add-iam-policy-binding $SERVICE_NAME_1 \&lt;/span&gt;
        &lt;span class="s"&gt;--region="${{env.REGION}}" \&lt;/span&gt;
        &lt;span class="s"&gt;--member="serviceAccount:${{env.SERVICE_ACCOUNT_NAME}}@${{secrets.PROJECT_ID}}.iam.gserviceaccount.com" \&lt;/span&gt;
        &lt;span class="s"&gt;--role="roles/run.invoker"&lt;/span&gt;

        &lt;span class="s"&gt;SERVICE_NAME_2=$(gcloud functions describe ${{ env.CLOUD_FUNCTION_2_NAME }} --region=${{ env.REGION }} --format="value(serviceConfig.service)")&lt;/span&gt;
        &lt;span class="s"&gt;gcloud run services add-iam-policy-binding $SERVICE_NAME_2 \&lt;/span&gt;
        &lt;span class="s"&gt;--region="${{env.REGION}}" \&lt;/span&gt;
        &lt;span class="s"&gt;--member="serviceAccount:${{env.SERVICE_ACCOUNT_NAME}}@${{secrets.PROJECT_ID}}.iam.gserviceaccount.com" \&lt;/span&gt;
        &lt;span class="s"&gt;--role="roles/run.invoker"&lt;/span&gt;

        &lt;span class="s"&gt;SERVICE_NAME_3=$(gcloud functions describe ${{ env.CLOUD_FUNCTION_3_NAME }} --region=${{ env.REGION }} --format="value(serviceConfig.service)")&lt;/span&gt;
        &lt;span class="s"&gt;gcloud run services add-iam-policy-binding $SERVICE_NAME_3 \&lt;/span&gt;
        &lt;span class="s"&gt;--region="${{env.REGION}}" \&lt;/span&gt;
        &lt;span class="s"&gt;--member="serviceAccount:${{env.SERVICE_ACCOUNT_NAME}}@${{secrets.PROJECT_ID}}.iam.gserviceaccount.com" \&lt;/span&gt;
        &lt;span class="s"&gt;--role="roles/run.invoker"&lt;/span&gt;


        &lt;span class="s"&gt;gcloud functions add-invoker-policy-binding ${{env.CLOUD_FUNCTION_1_NAME}} \&lt;/span&gt;
        &lt;span class="s"&gt;--region="${{env.REGION}}" \&lt;/span&gt;
        &lt;span class="s"&gt;--member="allUsers"&lt;/span&gt;

        &lt;span class="s"&gt;gcloud functions add-invoker-policy-binding ${{env.CLOUD_FUNCTION_2_NAME}} \&lt;/span&gt;
        &lt;span class="s"&gt;--region="${{env.REGION}}" \&lt;/span&gt;
        &lt;span class="s"&gt;--member="allUsers"&lt;/span&gt;

        &lt;span class="s"&gt;gcloud functions add-invoker-policy-binding ${{env.CLOUD_FUNCTION_3_NAME}} \&lt;/span&gt;
        &lt;span class="s"&gt;--region="${{env.REGION}}" \&lt;/span&gt;
        &lt;span class="s"&gt;--member="allUsers"&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  deploy-bigquery-dataset-bigquery-tables
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;deploy-bigquery-dataset-bigquery-tables&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;needs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;enable-services&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;deploy-buckets&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;deploy-cloud-function&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;deploy-composer-service-account&lt;/span&gt; &lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-22.04&lt;/span&gt;
    &lt;span class="na"&gt;timeout-minutes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;

    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Checkout&lt;/span&gt;
      &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;

    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Authorize GCP&lt;/span&gt;
      &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;google-github-actions/auth@v2'&lt;/span&gt;
      &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;credentials_json&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  &lt;span class="s"&gt;${{ secrets.GCP_DEVOPS_SA_KEY }}&lt;/span&gt;

    &lt;span class="c1"&gt;# Step to Authenticate with GCP&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Set up Cloud SDK&lt;/span&gt;
      &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;google-github-actions/setup-gcloud@v2&lt;/span&gt;
      &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;&amp;gt;=&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;363.0.0'&lt;/span&gt;
        &lt;span class="na"&gt;project_id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.PROJECT_ID }}&lt;/span&gt;

    &lt;span class="c1"&gt;# Step to Configure Docker to use the gcloud command-line tool as a credential helper&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Configure Docker&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
        &lt;span class="s"&gt;gcloud auth configure-docker&lt;/span&gt;


    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Create Big Query Dataset&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;  
        &lt;span class="s"&gt;if ! bq ls --project_id ${{ secrets.PROJECT_ID}} -a | grep -w ${{ env.BIGQUERY_DATASET}} &amp;amp;&amp;gt; /dev/null; \&lt;/span&gt;
          &lt;span class="s"&gt;then &lt;/span&gt;

            &lt;span class="s"&gt;bq --location=${{ env.REGION }} mk \&lt;/span&gt;
          &lt;span class="s"&gt;--default_table_expiration 0 \&lt;/span&gt;
          &lt;span class="s"&gt;--dataset ${{ env.BIGQUERY_DATASET }}&lt;/span&gt;

          &lt;span class="s"&gt;else&lt;/span&gt;
            &lt;span class="s"&gt;echo "Big Query Dataset : ${{ env.BIGQUERY_DATASET }} already exists" ! &lt;/span&gt;
          &lt;span class="s"&gt;fi&lt;/span&gt;

    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Create Big Query table&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
        &lt;span class="s"&gt;TABLE_NAME_CUSTOMER=${{ env.BIGQUERY_DATASET}}.${{ env.BIGQUERY_TABLE_CUSTOMER}}&lt;/span&gt;
        &lt;span class="s"&gt;c=0&lt;/span&gt;
        &lt;span class="s"&gt;for table in $(bq ls --max_results 1000 "${{ secrets.PROJECT_ID}}:${{ env.BIGQUERY_DATASET}}" | tail -n +3 | awk '{print $1}'); do&lt;/span&gt;

            &lt;span class="s"&gt;# Determine the table type and file extension&lt;/span&gt;
            &lt;span class="s"&gt;if bq show --format=prettyjson $BIGQUERY_TABLE_CUSTOMER | jq -r '.type' | grep -q -E "TABLE"; then&lt;/span&gt;
              &lt;span class="s"&gt;echo "Dataset ${{ env.BIGQUERY_DATASET}} already has table named : $table " !&lt;/span&gt;
              &lt;span class="s"&gt;if [ "$table" == "${{ env.BIGQUERY_TABLE_CUSTOMER}}" ]; then&lt;/span&gt;
                &lt;span class="s"&gt;echo "Dataset ${{ env.BIGQUERY_DATASET}} already has table named : $table " !&lt;/span&gt;
                &lt;span class="s"&gt;((c=c+1))       &lt;/span&gt;
              &lt;span class="s"&gt;fi                  &lt;/span&gt;
            &lt;span class="s"&gt;else&lt;/span&gt;
                &lt;span class="s"&gt;echo "Ignoring $table"            &lt;/span&gt;
                &lt;span class="s"&gt;continue&lt;/span&gt;
            &lt;span class="s"&gt;fi&lt;/span&gt;
        &lt;span class="s"&gt;done&lt;/span&gt;
        &lt;span class="s"&gt;echo " contador $c "&lt;/span&gt;
        &lt;span class="s"&gt;if [ $c == 0 ]; then&lt;/span&gt;
          &lt;span class="s"&gt;echo "Creating table named : $table for Dataset ${{ env.BIGQUERY_DATASET}} " !&lt;/span&gt;

          &lt;span class="s"&gt;bq mk --table \&lt;/span&gt;
          &lt;span class="s"&gt;$TABLE_NAME_CUSTOMER \&lt;/span&gt;
          &lt;span class="s"&gt;./big_query_schemas/customer_schema.json&lt;/span&gt;


        &lt;span class="s"&gt;fi&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  deploy-composer-environment
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;deploy-composer-environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;needs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;enable-services&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;deploy-buckets&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;deploy-cloud-function&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;deploy-composer-service-account&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;deploy-bigquery-dataset-bigquery-tables&lt;/span&gt; &lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-22.04&lt;/span&gt;
    &lt;span class="na"&gt;timeout-minutes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;40&lt;/span&gt;

    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Checkout&lt;/span&gt;
      &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;

    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Authorize GCP&lt;/span&gt;
      &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;google-github-actions/auth@v2'&lt;/span&gt;
      &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;credentials_json&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  &lt;span class="s"&gt;${{ secrets.GCP_DEVOPS_SA_KEY }}&lt;/span&gt;

    &lt;span class="c1"&gt;# Step to Authenticate with GCP&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Set up Cloud SDK&lt;/span&gt;
      &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;google-github-actions/setup-gcloud@v2&lt;/span&gt;
      &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;&amp;gt;=&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;363.0.0'&lt;/span&gt;
        &lt;span class="na"&gt;project_id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.PROJECT_ID }}&lt;/span&gt;

    &lt;span class="c1"&gt;# Step to Configure Docker to use the gcloud command-line tool as a credential helper&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Configure Docker&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
        &lt;span class="s"&gt;gcloud auth configure-docker&lt;/span&gt;

    &lt;span class="c1"&gt;# Create composer environments&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Create composer environments&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
        &lt;span class="s"&gt;if ! gcloud composer environments list --project=${{ secrets.PROJECT_ID }} --locations=${{ env.REGION }} | grep -i ${{ env.COMPOSER_ENV_NAME }} &amp;amp;&amp;gt; /dev/null; then&lt;/span&gt;
            &lt;span class="s"&gt;gcloud composer environments create ${{ env.COMPOSER_ENV_NAME }} \&lt;/span&gt;
                &lt;span class="s"&gt;--project ${{ secrets.PROJECT_ID }} \&lt;/span&gt;
                &lt;span class="s"&gt;--location ${{ env.REGION }} \&lt;/span&gt;
                &lt;span class="s"&gt;--environment-size ${{ env.COMPOSER_ENV_SIZE }} \&lt;/span&gt;
                &lt;span class="s"&gt;--image-version ${{ env.COMPOSER_IMAGE_VERSION }} \&lt;/span&gt;
                &lt;span class="s"&gt;--service-account ${{ env.SERVICE_ACCOUNT_NAME }}@${{ secrets.PROJECT_ID }}.iam.gserviceaccount.com&lt;/span&gt;
        &lt;span class="s"&gt;else&lt;/span&gt;
            &lt;span class="s"&gt;echo "Composer environment ${{ env.COMPOSER_ENV_NAME }} already exists!"&lt;/span&gt;
        &lt;span class="s"&gt;fi&lt;/span&gt;

    &lt;span class="c1"&gt;# Create composer environments&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Create composer variable PROJECT_ID&lt;/span&gt; 
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
        &lt;span class="s"&gt;gcloud composer environments run ${{ env.COMPOSER_ENV_NAME }} \&lt;/span&gt;
        &lt;span class="s"&gt;--location ${{ env.REGION}} variables \&lt;/span&gt;
        &lt;span class="s"&gt;-- set PROJECT_ID ${{ secrets.PROJECT_ID }}&lt;/span&gt;

    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Create composer variable REGION&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;  
        &lt;span class="s"&gt;gcloud composer environments run ${{ env.COMPOSER_ENV_NAME }} \&lt;/span&gt;
          &lt;span class="s"&gt;--location ${{ env.REGION }} variables \&lt;/span&gt;
          &lt;span class="s"&gt;-- set REGION ${{ env.REGION }}&lt;/span&gt;

    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Create composer variable CLOUD_FUNCTION_1_NAME&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
        &lt;span class="s"&gt;gcloud composer environments run ${{ env.COMPOSER_ENV_NAME }}\&lt;/span&gt;
          &lt;span class="s"&gt;--location ${{ env.REGION }} variables \&lt;/span&gt;
          &lt;span class="s"&gt;-- set CLOUD_FUNCTION_1_NAME ${{ env.CLOUD_FUNCTION_1_NAME }}&lt;/span&gt;

    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Create composer variable CLOUD_FUNCTION_2_NAME&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
        &lt;span class="s"&gt;gcloud composer environments run ${{ env.COMPOSER_ENV_NAME }} \&lt;/span&gt;
        &lt;span class="s"&gt;--location ${{ env.REGION }} variables \&lt;/span&gt;
        &lt;span class="s"&gt;-- set CLOUD_FUNCTION_2_NAME ${{ env.CLOUD_FUNCTION_2_NAME }}&lt;/span&gt;

    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Create composer variable CLOUD_FUNCTION_3_NAME&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
        &lt;span class="s"&gt;gcloud composer environments run ${{ env.COMPOSER_ENV_NAME }} \&lt;/span&gt;
        &lt;span class="s"&gt;--location ${{ env.REGION }} variables \&lt;/span&gt;
        &lt;span class="s"&gt;-- set CLOUD_FUNCTION_3_NAME ${{ env.CLOUD_FUNCTION_3_NAME }}&lt;/span&gt;

    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Create composer variable BUCKET_DATALAKE&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
        &lt;span class="s"&gt;gcloud composer environments run ${{ env.COMPOSER_ENV_NAME }} \&lt;/span&gt;
        &lt;span class="s"&gt;--location ${{ env.REGION}} variables \&lt;/span&gt;
        &lt;span class="s"&gt;-- set BUCKET_NAME ${{ secrets.BUCKET_DATALAKE }}&lt;/span&gt;

    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Create composer variable TRANSIENT_FILE_PATH&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
        &lt;span class="s"&gt;gcloud composer environments run ${{ env.COMPOSER_ENV_NAME }} \&lt;/span&gt;
        &lt;span class="s"&gt;--location ${{ env.REGION }} variables \&lt;/span&gt;
        &lt;span class="s"&gt;-- set TRANSIENT_FILE_PATH ${{ env.TRANSIENT_FILE_PATH }}&lt;/span&gt;

    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Create composer variable BRONZE_PATH&lt;/span&gt; 
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
        &lt;span class="s"&gt;gcloud composer environments run ${{ env.COMPOSER_ENV_NAME }} \&lt;/span&gt;
        &lt;span class="s"&gt;--location ${{ env.REGION }} variables \&lt;/span&gt;
        &lt;span class="s"&gt;-- set BRONZE_PATH ${{ env.BRONZE_PATH }}&lt;/span&gt;

    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Create composer variable SILVER_PATH&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
        &lt;span class="s"&gt;gcloud composer environments run ${{ env.COMPOSER_ENV_NAME }} \&lt;/span&gt;
        &lt;span class="s"&gt;--location ${{ env.REGION }} variables \&lt;/span&gt;
        &lt;span class="s"&gt;-- set SILVER_PATH ${{ env.SILVER_PATH }}&lt;/span&gt;

    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Create composer variable REGION_PROJECT_ID&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
        &lt;span class="s"&gt;gcloud composer environments run ${{ env.COMPOSER_ENV_NAME }} \&lt;/span&gt;
        &lt;span class="s"&gt;--location ${{ env.REGION }} variables \&lt;/span&gt;
        &lt;span class="s"&gt;-- set REGION_PROJECT_ID "${{ env.REGION }}-${{ secrets.PROJECT_ID }}"&lt;/span&gt;


    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Create composer variable BIGQUERY_DATASET&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
        &lt;span class="s"&gt;gcloud composer environments run ${{ env.COMPOSER_ENV_NAME }} \&lt;/span&gt;
        &lt;span class="s"&gt;--location ${{ env.REGION }} variables \&lt;/span&gt;
        &lt;span class="s"&gt;-- set BIGQUERY_DATASET "${{ env.BIGQUERY_DATASET }}"&lt;/span&gt;

    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Create composer variable BIGQUERY_TABLE_CUSTOMER&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
        &lt;span class="s"&gt;gcloud composer environments run ${{ env.COMPOSER_ENV_NAME }} \&lt;/span&gt;
        &lt;span class="s"&gt;--location ${{ env.REGION }} variables \&lt;/span&gt;
        &lt;span class="s"&gt;-- set BIGQUERY_TABLE_CUSTOMER "${{ env.BIGQUERY_TABLE_CUSTOMER }}"&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  deploy-composer-http-connection
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;deploy-composer-http-connection&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;needs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;enable-services&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;deploy-buckets&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;deploy-cloud-function&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;deploy-composer-service-account&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;deploy-bigquery-dataset-bigquery-tables&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;deploy-composer-environment&lt;/span&gt; &lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-22.04&lt;/span&gt;

    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Checkout&lt;/span&gt;
      &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;

    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Authorize GCP&lt;/span&gt;
      &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;google-github-actions/auth@v2'&lt;/span&gt;
      &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;credentials_json&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  &lt;span class="s"&gt;${{ secrets.GCP_DEVOPS_SA_KEY }}&lt;/span&gt;

    &lt;span class="c1"&gt;# Step to Authenticate with GCP&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Set up Cloud SDK&lt;/span&gt;
      &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;google-github-actions/setup-gcloud@v2&lt;/span&gt;
      &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;&amp;gt;=&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;363.0.0'&lt;/span&gt;
        &lt;span class="na"&gt;project_id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.PROJECT_ID }}&lt;/span&gt;

    &lt;span class="c1"&gt;# Step to Configure Docker to use the gcloud command-line tool as a credential helper&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Configure Docker&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
        &lt;span class="s"&gt;gcloud auth configure-docker&lt;/span&gt;

    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Create composer http connection HTTP_CONNECTION&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
        &lt;span class="s"&gt;HOST="https://${{ env.REGION }}-${{ secrets.PROJECT_ID }}.cloudfunctions.net"&lt;/span&gt;
        &lt;span class="s"&gt;gcloud composer environments run ${{ env.COMPOSER_ENV_NAME }} \&lt;/span&gt;
        &lt;span class="s"&gt;--location ${{ env.REGION }} connections \&lt;/span&gt;
        &lt;span class="s"&gt;-- add ${{ env.HTTP_CONNECTION }} \&lt;/span&gt;
        &lt;span class="s"&gt;--conn-type ${{ env.CONNECTION_TYPE  }} \&lt;/span&gt;
        &lt;span class="s"&gt;--conn-host $HOST&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
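
&lt;p&gt;If you want the workflow to confirm that the connection was registered, a small verification step can be appended to this job. The sketch below reuses the environment variables already defined in the workflow and assumes the Airflow 2 CLI, where &lt;code&gt;connections list&lt;/code&gt; is available:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;- name: List composer connections (verification sketch)
  run: |-
    gcloud composer environments run ${{ env.COMPOSER_ENV_NAME }} \
    --location ${{ env.REGION }} connections \
    -- list
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;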



&lt;h3&gt;
  
  
  deploy-dags
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;deploy-dags&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;needs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;enable-services&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;deploy-buckets&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;deploy-cloud-function&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;deploy-composer-service-account&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;deploy-bigquery-dataset-bigquery-tables&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;deploy-composer-environment&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;deploy-composer-http-connection&lt;/span&gt; &lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-22.04&lt;/span&gt;

    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Checkout&lt;/span&gt;
      &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;

    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Authorize GCP&lt;/span&gt;
      &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;google-github-actions/auth@v2'&lt;/span&gt;
      &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;credentials_json&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  &lt;span class="s"&gt;${{ secrets.GCP_DEVOPS_SA_KEY }}&lt;/span&gt;

    &lt;span class="c1"&gt;# Step to Authenticate with GCP&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Set up Cloud SDK&lt;/span&gt;
      &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;google-github-actions/setup-gcloud@v2&lt;/span&gt;
      &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;&amp;gt;=&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;363.0.0'&lt;/span&gt;
        &lt;span class="na"&gt;project_id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.PROJECT_ID }}&lt;/span&gt;

    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Get Composer bucket name and Deploy DAG to Composer&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
        &lt;span class="s"&gt;COMPOSER_BUCKET=$(gcloud composer environments describe ${{ env.COMPOSER_ENV_NAME }} \&lt;/span&gt;
        &lt;span class="s"&gt;--location ${{ env.REGION }} \&lt;/span&gt;
        &lt;span class="s"&gt;--format="value(config.dagGcsPrefix)")&lt;/span&gt;
        &lt;span class="s"&gt;gsutil -m cp -r ./dags/* $COMPOSER_BUCKET/dags/&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
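
&lt;p&gt;After the copy, the DAG should appear in the Composer environment within a few minutes. As a sketch, again assuming the Airflow 2 CLI, a follow-up step could list the registered DAGs to confirm the deployment:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;- name: List DAGs (verification sketch)
  run: |-
    gcloud composer environments run ${{ env.COMPOSER_ENV_NAME }} \
    --location ${{ env.REGION }} dags \
    -- list
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;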



&lt;h3&gt;
  
  
  Resources Created After Deployment
&lt;/h3&gt;

&lt;p&gt;After the deployment process is complete, the following resources will be available:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Cloud Storage Buckets&lt;/strong&gt;: Organized into Bronze, Silver, and Gold layers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cloud Functions&lt;/strong&gt;: Responsible for processing data across the three layers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Service Account for Composer&lt;/strong&gt;: With appropriate permissions for invoking Cloud Functions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;BigQuery Dataset and Tables&lt;/strong&gt;: A dataset and tables created for storing processed data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Google Composer Environment&lt;/strong&gt;: Orchestrates the Cloud Functions with daily executions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Composer DAG&lt;/strong&gt;: The DAG manages the workflow that invokes Cloud Functions and processes data.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;This solution demonstrates how to leverage Google Cloud Functions, Composer, and BigQuery to create a robust three-tier data processing pipeline. The automation using GitHub Actions ensures a smooth, reproducible deployment process, making it easier to manage cloud-based data pipelines at scale.&lt;/p&gt;

&lt;h3&gt;
  
  
  References
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Google Cloud Platform Documentation&lt;/strong&gt;: &lt;a href="https://cloud.google.com/docs" rel="noopener noreferrer"&gt;https://cloud.google.com/docs&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GitHub Actions Documentation&lt;/strong&gt;: &lt;a href="https://docs.github.com/en/actions" rel="noopener noreferrer"&gt;https://docs.github.com/en/actions&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Google Composer Documentation&lt;/strong&gt;: &lt;a href="https://cloud.google.com/composer/docs" rel="noopener noreferrer"&gt;https://cloud.google.com/composer/docs&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cloud Functions Documentation&lt;/strong&gt;: &lt;a href="https://cloud.google.com/functions/docs" rel="noopener noreferrer"&gt;https://cloud.google.com/functions/docs&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GitHub Repo&lt;/strong&gt;: &lt;a href="https://github.com/jader-lima/gcp-cloud-functions-to-bigquery" rel="noopener noreferrer"&gt;https://github.com/jader-lima/gcp-cloud-functions-to-bigquery&lt;/a&gt;
&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>gcp</category>
      <category>python</category>
      <category>airflow</category>
      <category>serverless</category>
    </item>
    <item>
      <title>Using Cloud Functions and Cloud Schedule to process data with Google Dataflow</title>
      <dc:creator>Jader Lima</dc:creator>
      <pubDate>Sun, 22 Sep 2024 21:31:29 +0000</pubDate>
      <link>https://dev.to/jader_lima_b72a63be5bbddc/using-cloud-functions-and-cloud-schedule-to-process-data-with-google-dataflow-1p1k</link>
      <guid>https://dev.to/jader_lima_b72a63be5bbddc/using-cloud-functions-and-cloud-schedule-to-process-data-with-google-dataflow-1p1k</guid>
      <description>&lt;h2&gt;
  
  
  GCP DataFlow Function Schedule
&lt;/h2&gt;

&lt;h2&gt;
  
  
  Overview
&lt;/h2&gt;

&lt;p&gt;This project showcases the integration of Google Cloud services, specifically Dataflow, Cloud Functions, and Cloud Scheduler, to create a highly scalable, cost-effective, and easy-to-maintain data processing solution. It demonstrates how you can automate data pipelines, perform seamless integration with other GCP services like BigQuery, and manage workflows efficiently through CI/CD pipelines with GitHub Actions. This setup provides flexibility, reduces manual intervention, and ensures that the data processing workflows run smoothly and consistently.&lt;/p&gt;

&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Technologies Used&lt;/li&gt;
&lt;li&gt;Features&lt;/li&gt;
&lt;li&gt;Architecture Diagram&lt;/li&gt;
&lt;li&gt;
Getting Started

&lt;ul&gt;
&lt;li&gt;Prerequisites&lt;/li&gt;
&lt;li&gt;Setup Instructions&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

Deploying the Project

&lt;ul&gt;
&lt;li&gt;Workflow YAML Explanation&lt;/li&gt;
&lt;li&gt;Workflow Job Steps&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Resources Created After Deployment&lt;/li&gt;

&lt;li&gt;Conclusion&lt;/li&gt;

&lt;li&gt;Documentation Links&lt;/li&gt;

&lt;/ul&gt;

&lt;h2&gt;
  
  
  Technologies Used
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Google Dataflow&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Google Dataflow is a fully managed service for stream and batch data processing, which is built on Apache Beam. It allows for the creation of highly efficient, low-latency, and cost-effective data pipelines. Dataflow can handle large-scale data processing tasks, making it ideal for use cases like real-time analytics and ETL jobs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cloud Storage&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Google Cloud Storage is a scalable, durable, and secure object storage service designed to handle large volumes of unstructured data. It is ideal for use in big data analysis, backups, and content distribution, offering high availability and low latency across the globe.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cloud Functions&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Google Cloud Functions is a serverless execution environment that allows you to run code in response to events. In this project, Cloud Functions are used to trigger Dataflow jobs and manage workflow automation efficiently with minimal operational overhead.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cloud Scheduler&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Google Cloud Scheduler is a fully managed cron job service that allows you to schedule tasks or trigger cloud services at specific intervals. It’s used in this project to automate the execution of the Cloud Functions, ensuring that Dataflow jobs run as needed without manual intervention.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;CI/CD Process with GitHub Actions&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
GitHub Actions enables continuous integration and continuous delivery (CI/CD) workflows directly from your GitHub repository. In this project, it is used to automate the build, testing, and deployment of resources to Google Cloud, ensuring consistent and reliable deployments.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;GitHub Secrets and Configuration&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
GitHub Secrets securely store sensitive information such as API keys, service account credentials, and configuration settings required for deployment. By keeping these details secure, the risk of leaks and unauthorized access is minimized.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Features
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Ingest and transform data from Google Cloud Storage using Google Dataflow.&lt;/li&gt;
&lt;li&gt;Encapsulate the Dataflow process into a reusable Dataflow template.&lt;/li&gt;
&lt;li&gt;Create a Cloud Function that executes the Dataflow template through a REST API.&lt;/li&gt;
&lt;li&gt;Automate the execution of the Cloud Function using Cloud Scheduler.&lt;/li&gt;
&lt;li&gt;Implement a CI/CD pipeline with GitHub Actions for automated deployments.&lt;/li&gt;
&lt;li&gt;Incorporate comprehensive error handling and logging for reliable data processing.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Architecture Diagram
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3883w42kkahmkz27yri5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3883w42kkahmkz27yri5.png" alt="architecture" width="800" height="483"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;p&gt;Before getting started, ensure you have the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A Google Cloud account with billing enabled.&lt;/li&gt;
&lt;li&gt;A GitHub account.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Setup Instructions
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Clone the Repository&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   git clone https://github.com/jader-lima/gcp-dataflow-function-schedule.git
   &lt;span class="nb"&gt;cd &lt;/span&gt;gcp-dataproc-bigquery-workflow-template
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Set Up Google Cloud Environment
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Create a Google Cloud Storage bucket&lt;/strong&gt; to store your data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Set up a BigQuery dataset&lt;/strong&gt; where your data will be ingested.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Create a Dataproc cluster&lt;/strong&gt; for processing.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Create a new service account for deployment purposes
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Create the Service Account:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud iam service-accounts create devops-dataops-sa &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"Service account for DevOps and DataOps tasks"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--display-name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"DevOps DataOps Service Account"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Grant Storage Access Permissions (Buckets): Storage Admin (roles/storage.admin): 
Grants permissions to create, list, and manipulate buckets and files.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud projects add-iam-policy-binding &lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--member&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"serviceAccount:devops-dataops-sa@&lt;/span&gt;&lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt;&lt;span class="s2"&gt;.iam.gserviceaccount.com"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"roles/storage.admin"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Grant Dataflow Permissions: 
Dataflow Admin (roles/dataflow.admin): To create, run, and manage Dataflow jobs. 
Dataflow Developer (roles/dataflow.developer): Allows the development and submission of Dataflow jobs.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud projects add-iam-policy-binding &lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--member&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"serviceAccount:devops-dataops-sa@&lt;/span&gt;&lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt;&lt;span class="s2"&gt;.iam.gserviceaccount.com"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"roles/dataflow.admin"&lt;/span&gt;

gcloud projects add-iam-policy-binding &lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--member&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"serviceAccount:devops-dataops-sa@&lt;/span&gt;&lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt;&lt;span class="s2"&gt;.iam.gserviceaccount.com"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"roles/dataflow.developer"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Permissions to Create and Manage Cloud Functions and Cloud Scheduler: 
Cloud Functions Admin (roles/cloudfunctions.admin): To create and manage Cloud Functions. 
Cloud Scheduler Admin (roles/cloudscheduler.admin): To create and manage Cloud Scheduler jobs.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud projects add-iam-policy-binding &lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--member&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"serviceAccount:devops-dataops-sa@&lt;/span&gt;&lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt;&lt;span class="s2"&gt;.iam.gserviceaccount.com"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"roles/cloudfunctions.admin"&lt;/span&gt;

gcloud projects add-iam-policy-binding &lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--member&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"serviceAccount:devops-dataops-sa@&lt;/span&gt;&lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt;&lt;span class="s2"&gt;.iam.gserviceaccount.com"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"roles/cloudscheduler.admin"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Grant Permissions to Manage Service Accounts: 
IAM Service Account Admin (roles/iam.serviceAccountAdmin): To create and manage other service accounts. 
IAM Service Account User (roles/iam.serviceAccountUser): To use service accounts in different services.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud projects add-iam-policy-binding &lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--member&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"serviceAccount:devops-dataops-sa@&lt;/span&gt;&lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt;&lt;span class="s2"&gt;.iam.gserviceaccount.com"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"roles/iam.serviceAccountAdmin"&lt;/span&gt;

gcloud projects add-iam-policy-binding &lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--member&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"serviceAccount:devops-dataops-sa@&lt;/span&gt;&lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt;&lt;span class="s2"&gt;.iam.gserviceaccount.com"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"roles/iam.serviceAccountUser"&lt;/span&gt;


gcloud projects add-iam-policy-binding &lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--member&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"serviceAccount:devops-dataops-sa@&lt;/span&gt;&lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt;&lt;span class="s2"&gt;.iam.gserviceaccount.com"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"roles/serviceusage.serviceUsageAdmin"&lt;/span&gt;

gcloud projects add-iam-policy-binding &lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--member&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"serviceAccount:devops-dataops-sa@&lt;/span&gt;&lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt;&lt;span class="s2"&gt;.iam.gserviceaccount.com"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"roles/resourcemanager.projectIamAdmin"&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Permission to enable API services:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud projects add-iam-policy-binding &lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--member&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"serviceAccount:devops-dataops-sa@&lt;/span&gt;&lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt;&lt;span class="s2"&gt;.iam.gserviceaccount.com"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"roles/cloudscheduler.admin"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Additional Permissions (Optional): 
Compute Admin (roles/compute.admin): If your pipeline needs to create compute resources (e.g., virtual machine instances). Viewer (roles/viewer): 
To ensure the account can view other resources in the project.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud projects add-iam-policy-binding &lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--member&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"serviceAccount:devops-dataops-sa@&lt;/span&gt;&lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt;&lt;span class="s2"&gt;.iam.gserviceaccount.com"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"roles/compute.admin"&lt;/span&gt;

gcloud projects add-iam-policy-binding &lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--member&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"serviceAccount:devops-dataops-sa@&lt;/span&gt;&lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt;&lt;span class="s2"&gt;.iam.gserviceaccount.com"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"roles/viewer"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
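
&lt;p&gt;To double-check which roles ended up bound to the service account, the project IAM policy can be inspected. The step below is only an illustrative sketch (the same command can be run locally in Cloud Shell with &lt;code&gt;$PROJECT_ID&lt;/code&gt; instead of the secret) and assumes the &lt;code&gt;devops-dataops-sa&lt;/code&gt; account created above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;- name: Check granted roles (verification sketch)
  run: |-
    gcloud projects get-iam-policy ${{ secrets.PROJECT_ID }} \
    --flatten="bindings[].members" \
    --filter="bindings.members:devops-dataops-sa@${{ secrets.PROJECT_ID }}.iam.gserviceaccount.com" \
    --format="table(bindings.role)"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;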



&lt;h2&gt;
  
  
  Configure Environment Variables and Secrets
&lt;/h2&gt;

&lt;p&gt;Ensure the following environment variables are set in your deployment configuration or within GitHub Secrets:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;GCP_BUCKET_BIGDATA_FILES&lt;/code&gt;: Secret used to store the name of the Cloud Storage bucket&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;GCP_BUCKET_DATALAKE&lt;/code&gt;: Secret used to store the name of the Cloud Storage bucket&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;GCP_BUCKET_DATAPROC&lt;/code&gt;: Secret used to store the name of the Cloud Storage bucket&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;GCP_BUCKET_TEMP_BIGQUERY&lt;/code&gt;: Secret used to store the name of the Cloud Storage bucket&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;GCP_DEVOPS_SA_KEY&lt;/code&gt;: Secret used to store the service account key (JSON) used for deployment. For this project, the default service key was used.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;GCP_SERVICE_ACCOUNT&lt;/code&gt;: Secret used to store the service account e-mail.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;PROJECT_ID&lt;/code&gt;: Secret used to store the project ID&lt;/li&gt;
&lt;/ul&gt;


&lt;h3&gt;
  
  
  Creating a GitHub secret
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;To create a new secret:

&lt;ol&gt;
&lt;li&gt;In the project repository, open the &lt;strong&gt;Settings&lt;/strong&gt; menu&lt;/li&gt;
&lt;li&gt;under
&lt;strong&gt;Security&lt;/strong&gt;, &lt;/li&gt;
&lt;li&gt;go to
&lt;strong&gt;Secrets and variables&lt;/strong&gt;, then click &lt;strong&gt;Actions&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;click
&lt;strong&gt;New repository secret&lt;/strong&gt;, and enter a &lt;strong&gt;name&lt;/strong&gt; and &lt;strong&gt;value&lt;/strong&gt; for the secret.&lt;/li&gt;
&lt;/ol&gt;


&lt;/li&gt;

&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi45cicz0q89ije7j70yf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi45cicz0q89ije7j70yf.png" alt="github secret creation" width="800" height="609"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For more details, see:&lt;br&gt;
&lt;a href="https://docs.github.com/pt/actions/security-for-github-actions/security-guides/using-secrets-in-github-actions" rel="noopener noreferrer"&gt;https://docs.github.com/pt/actions/security-for-github-actions/security-guides/using-secrets-in-github-actions&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Deploying the project &lt;a&gt;&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;Whenever a push to the main branch occurs, GitHub Actions triggers and runs the workflow defined in the YAML file. The workflow contains five jobs, described in detail below. In essence, GitHub Actions uses the service account credentials to authenticate with Google Cloud and execute the necessary steps as described in the YAML file.&lt;/p&gt;
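
&lt;p&gt;The trigger itself is declared at the top of the workflow file. As a minimal sketch (the workflow in the repository may also define path filters or other options), it looks like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;on:
  push:
    branches:
      - main
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;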

&lt;h2&gt;
  
  
  Workflow File YAML Explanation&lt;a&gt;&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;Environment variables needed&lt;br&gt;
The workflow defines environment variables for basic settings such as bucket paths, process names, and workflow steps.&lt;br&gt;
If new steps or new scripts are added to the workflow, new variables can easily be added, as shown below:&lt;/p&gt;
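
&lt;p&gt;For example (the variable name and value below are placeholders, not values from the repository):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;env:
  MY_VAR_NAME: my_var_value

# referenced later in the workflow as ${{ env.MY_VAR_NAME }}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;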

&lt;h2&gt;
  
  
  Workflow Job Steps &lt;a&gt;&lt;/a&gt;
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;enable-services&lt;/strong&gt;:&lt;br&gt;
This step enables the necessary APIs for Cloud Functions, Dataflow, and the build process.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;deploy-buckets&lt;/strong&gt;:&lt;br&gt;
This step creates Google Cloud Storage buckets and copies the required data files and scripts into them.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;build-dataflow-classic-template&lt;/strong&gt;:&lt;br&gt;
Builds and stores a Dataflow template in a Cloud Storage bucket for future execution.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;deploy-cloud-function&lt;/strong&gt;:&lt;br&gt;
Deploys a Cloud Function that triggers the execution of the Dataflow template using the google-api-python-client library.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;deploy-cloud-schedule&lt;/strong&gt;:&lt;br&gt;
Creates a Cloud Scheduler job to automate the execution of the Cloud Function, ensuring data is processed at defined intervals (a sketch of such a job follows this list).&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
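
&lt;p&gt;As referenced in the last item above, the Cloud Scheduler job is what turns the Cloud Function into a recurring process. The step below is only an illustrative sketch: the job name, schedule, function URL, and service account e-mail are placeholders, not the values used in the repository.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;- name: Create Cloud Scheduler job (illustrative sketch)
  run: |-
    gcloud scheduler jobs create http trigger-dataflow-function \
    --schedule="0 12 * * *" \
    --uri="https://us-east1-my-project.cloudfunctions.net/my-function" \
    --http-method=POST \
    --oidc-service-account-email=my-sa@my-project.iam.gserviceaccount.com
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;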

&lt;h2&gt;
  
  
  Resources Created After Deployment
&lt;/h2&gt;

&lt;p&gt;Upon deployment, the following resources are created:&lt;/p&gt;

&lt;h3&gt;
  
  
  Google Cloud Storage Bucket
&lt;/h3&gt;

&lt;p&gt;A Cloud Storage bucket to store data and templates.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5au57kwtdazwa8qovvkl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5au57kwtdazwa8qovvkl.png" alt="buckets" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;CSV files of the Olist dataset, stored in the transient layer of the data lake.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwcby7lbjl2aqly7fvfgl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwcby7lbjl2aqly7fvfgl.png" alt="bucket transient" width="800" height="393"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;CSV file created after Dataflow processing; this file can be used in analysis tools, spreadsheets, databases, etc.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjdfb9ebtdgp69n6syqwd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjdfb9ebtdgp69n6syqwd.png" alt="bucket silver" width="800" height="277"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Dataflow Classic Template
&lt;/h3&gt;

&lt;p&gt;A reusable Dataflow template stored in Cloud Storage.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjj0payqwvuca6pq1vyq1.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjj0payqwvuca6pq1vyq1.JPG" alt="dataproc-workflow3" width="800" height="375"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Cloud Scheduler Job
&lt;/h3&gt;

&lt;p&gt;Automated scheduled jobs for Dataflow executions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx74p6oys3q5n66vx2ojf.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx74p6oys3q5n66vx2ojf.JPG" alt="Cloud Schedule" width="800" height="200"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion&lt;a&gt;&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;This project demonstrates how to leverage Google Cloud services like Dataflow, Cloud Functions, and Cloud Scheduler to create a fully automated and scalable data processing pipeline. The integration with GitHub Actions ensures continuous deployment, while the use of Cloud Functions and Scheduler provides flexibility and automation, minimizing operational overhead. This setup is versatile and can be easily extended to incorporate additional GCP services such as BigQuery.&lt;/p&gt;

&lt;p&gt;Links and References&lt;br&gt;
&lt;a href="https://github.com/jader-lima/gcp-dataflow-function-schedule" rel="noopener noreferrer"&gt;GitHub Repo&lt;/a&gt;&lt;br&gt;
&lt;a href="https://cloud.google.com/functions/docs" rel="noopener noreferrer"&gt;Cloud Functions&lt;/a&gt;&lt;br&gt;
&lt;a href="https://cloud.google.com/dataflow/docs" rel="noopener noreferrer"&gt;DataFlow&lt;/a&gt;&lt;br&gt;
&lt;a href="https://docs.github.com/en/actions" rel="noopener noreferrer"&gt;Cloud Scheduler&lt;/a&gt;&lt;/p&gt;

</description>
      <category>gcp</category>
      <category>dataflow</category>
      <category>cloudfunctions</category>
      <category>cloudscheduler</category>
    </item>
    <item>
      <title>Loading data to Google Big Query using Dataproc workflow templates and cloud Schedule</title>
      <dc:creator>Jader Lima</dc:creator>
      <pubDate>Fri, 06 Sep 2024 23:29:22 +0000</pubDate>
      <link>https://dev.to/jader_lima_b72a63be5bbddc/loading-data-to-google-big-query-using-dataproc-workflow-templates-and-cloud-schedule-4l4</link>
      <guid>https://dev.to/jader_lima_b72a63be5bbddc/loading-data-to-google-big-query-using-dataproc-workflow-templates-and-cloud-schedule-4l4</guid>
      <description>&lt;h2&gt;
  
  
  Overview
&lt;/h2&gt;

&lt;p&gt;This project is designed to facilitate the ingestion of data from Google Cloud Storage into BigQuery using Apache PySpark on Google Dataproc. Furthermore, it utilizes Google Cloud Scheduler for automated execution and GitHub Actions for seamless deployment. &lt;/p&gt;

&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Technologies Used&lt;/li&gt;
&lt;li&gt;Features&lt;/li&gt;
&lt;li&gt;Architecture Diagram&lt;/li&gt;
&lt;li&gt;
Getting Started

&lt;ul&gt;
&lt;li&gt;Prerequisites&lt;/li&gt;
&lt;li&gt;Setup Instructions&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

Deploying the project

&lt;ul&gt;
&lt;li&gt;Workflow File YAML Explanation&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Resources Created After Deployment&lt;/li&gt;

&lt;li&gt;Conclusion&lt;/li&gt;

&lt;/ul&gt;

&lt;h2&gt;
  
  
  Technologies Used
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Google Dataproc&lt;/strong&gt;&lt;br&gt;
Google Dataproc is a fully managed cloud service designed to simplify running Apache Spark and Hadoop clusters in the Google Cloud ecosystem. It provides a fast and scalable way to process large datasets while integrating seamlessly with other Google Cloud services, such as Cloud Storage, to minimize operational overhead and ensure efficient big data processing.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cloud Storage&lt;/strong&gt;&lt;br&gt;
Google Cloud Storage is a scalable and secure object storage service for storing large amounts of unstructured data. It offers high availability and strong global consistency, making it suitable for a wide range of scenarios, such as data backups, big data analytics, and content distribution.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Workflow Templates&lt;/strong&gt;&lt;br&gt;
Workflow templates in Google Cloud simplify the definition and management of complex processes involving multiple cloud services. This feature helps in scheduling and executing intricate workflows, optimizing resource management, and automating multi-step tasks across different services.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cloud Scheduler&lt;/strong&gt;&lt;br&gt;
Google Cloud Scheduler is a fully managed service for running scheduled jobs with no infrastructure to manage. It can be used to automate workflows, run reports, or trigger specific cloud services at defined intervals.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;CI/CD Process with GitHub Actions&lt;/strong&gt;&lt;br&gt;
Implementing a CI/CD pipeline with GitHub Actions automates the build, test, and deployment stages of your project. In this workflow, GitHub Actions triggers deployment to Google Cloud every time code changes are pushed to the repository, ensuring a consistent and accurate deployment process with minimal manual intervention.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;GitHub Secrets and Configuration&lt;/strong&gt;&lt;br&gt;
GitHub Secrets is essential for maintaining the security of sensitive information like API keys, service account credentials, and other configuration data. By securely storing these details outside your source code, you mitigate the risk of unauthorized access and potential leaks.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Google BigQuery&lt;/strong&gt;&lt;br&gt;
A fully managed, scalable data warehouse that enables lightning-fast SQL queries and supports large-scale analytics across terabytes of data.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Features
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Ingest data from Google Cloud Storage into BigQuery using Dataproc with PySpark.&lt;/li&gt;
&lt;li&gt;Utilize Cloud Scheduler to automate the execution of data ingestion workflows.&lt;/li&gt;
&lt;li&gt;Implement CI/CD for automated deployment through GitHub Actions.&lt;/li&gt;
&lt;li&gt;Comprehensive error handling and logging for reliable data processing.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Architecture Diagram
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8q2nw194pekf1nfkctbr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8q2nw194pekf1nfkctbr.png" alt="Architecture Diagram" width="800" height="451"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;p&gt;Before you begin, ensure you have the following prerequisites set up:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Google Cloud account with billing enabled.&lt;/li&gt;
&lt;li&gt;GitHub account.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Setup Instructions
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Clone the Repository&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   git clone https://github.com/jader-lima/gcp-dataproc-bigquery-workflow-template.git
   &lt;span class="nb"&gt;cd &lt;/span&gt;gcp-dataproc-bigquery-workflow-template
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Set Up Google Cloud Environment
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Create a Google Cloud Storage bucket&lt;/strong&gt; to store your data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Set up a BigQuery dataset&lt;/strong&gt; where your data will be ingested.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Create a Dataproc cluster&lt;/strong&gt; for processing.&lt;/li&gt;
&lt;/ol&gt;
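
&lt;p&gt;If you prefer to script these prerequisites instead of creating them in the console, a rough sketch is shown below as a workflow-style step (the same commands can be run in Cloud Shell); the bucket, dataset, and cluster names here are placeholders, not the ones used by this project:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;- name: Provision prerequisites (illustrative sketch)
  run: |-
    gcloud storage buckets create gs://my-datalake-bucket --location=us-east1
    bq mk --dataset my-project:my_dataset
    gcloud dataproc clusters create my-cluster --region=us-east1 --single-node
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;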

&lt;h2&gt;
  
  
  Configure Environment Variables
&lt;/h2&gt;

&lt;p&gt;Ensure the following environment variables are set in your deployment configuration or within GitHub Secrets:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;GCP_BUCKET_BIGDATA_FILES&lt;/code&gt;: Secret used to store the name of the Cloud Storage bucket&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;GCP_BUCKET_DATALAKE&lt;/code&gt;: Secret used to store the name of the Cloud Storage bucket&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;GCP_BUCKET_DATAPROC&lt;/code&gt;: Secret used to store the name of the Cloud Storage bucket&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;GCP_BUCKET_TEMP_BIGQUERY&lt;/code&gt;: Secret used to store the name of the Cloud Storage bucket&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;GCP_SA_KEY&lt;/code&gt;: Secret used to store the service account key (JSON) used for deployment. For this project, the default service key was used.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;GCP_SERVICE_ACCOUNT&lt;/code&gt;: Secret used to store the service account e-mail.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;PROJECT_ID&lt;/code&gt;: Secret used to store the project ID&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Creating a GitHub secret
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;To create a new secret:

&lt;ol&gt;
&lt;li&gt;In the project repository, open the &lt;strong&gt;Settings&lt;/strong&gt; menu&lt;/li&gt;
&lt;li&gt;under
&lt;strong&gt;Security&lt;/strong&gt;, &lt;/li&gt;
&lt;li&gt;go to
&lt;strong&gt;Secrets and variables&lt;/strong&gt;, then click &lt;strong&gt;Actions&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;click
&lt;strong&gt;New repository secret&lt;/strong&gt;, and enter a &lt;strong&gt;name&lt;/strong&gt; and &lt;strong&gt;value&lt;/strong&gt; for the secret.&lt;/li&gt;
&lt;/ol&gt;


&lt;/li&gt;

&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi45cicz0q89ije7j70yf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi45cicz0q89ije7j70yf.png" alt="github secret creation" width="800" height="609"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For more details, see:&lt;br&gt;
&lt;a href="https://docs.github.com/pt/actions/security-for-github-actions/security-guides/using-secrets-in-github-actions" rel="noopener noreferrer"&gt;https://docs.github.com/pt/actions/security-for-github-actions/security-guides/using-secrets-in-github-actions&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Deploying the project &lt;a&gt;&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;Whenever a push to the main branch occurs, GitHub Actions will trigger and run the YAML script. The script contains four jobs, described in detail below. In essence, GitHub Actions uses the service account credentials to authenticate with Google Cloud and execute the necessary steps as described in the YAML file.&lt;/p&gt;
&lt;h2&gt;
  
  
  Workflow File YAML Explanation&lt;a&gt;&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;Environment variables needed&lt;br&gt;
The workflow defines environment variables for basic settings such as cluster characteristics, bucket paths, process names, and workflow steps.&lt;br&gt;
If new steps or new scripts are added to the workflow, new variables can easily be added, as shown below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;MY_VAR_NAME &lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my_var_value&lt;/span&gt; 
&lt;span class="s"&gt;${{ env.MY_VAR_NAME}}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;REGION&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;us-east1&lt;/span&gt;
    &lt;span class="na"&gt;ZONE&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;us-east1-b&lt;/span&gt;
    &lt;span class="na"&gt;DATAPROC_CLUSTER_NAME&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dataproc-bigdata-multi-node-cluster&lt;/span&gt;
    &lt;span class="na"&gt;DATAPROC_WORKER_TYPE&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;n2-standard-2&lt;/span&gt;
    &lt;span class="na"&gt;DATAPROC_MASTER_TYPE&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;n2-standard-2&lt;/span&gt;
    &lt;span class="na"&gt;DATAPROC_NUM_WORKERS&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;
    &lt;span class="na"&gt;DATAPROC_IMAGE_VERSION&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;2.1-debian11&lt;/span&gt;
    &lt;span class="na"&gt;DATAPROC_WORKER_NUM_LOCAL_SSD&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
    &lt;span class="na"&gt;DATAPROC_MASTER_NUM_LOCAL_SSD&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
    &lt;span class="na"&gt;DATAPROC_MASTER_BOOT_DISK_SIZE&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;32&lt;/span&gt;   
    &lt;span class="na"&gt;DATAPROC_WORKER_DISK_SIZE&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;32&lt;/span&gt;
    &lt;span class="na"&gt;DATAPROC_MASTER_BOOT_DISK_TYPE&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pd-balanced&lt;/span&gt;
    &lt;span class="na"&gt;DATAPROC_WORKER_BOOT_DISK_TYPE&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pd-balanced&lt;/span&gt;
    &lt;span class="na"&gt;DATAPROC_COMPONENTS&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;JUPYTER&lt;/span&gt;
    &lt;span class="na"&gt;DATAPROC_WORKFLOW_NAME&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;report_olist_order_items&lt;/span&gt;
    &lt;span class="na"&gt;BIGQUERY_DATASET&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;olist&lt;/span&gt;
    &lt;span class="na"&gt;BIGQUERY_TABLE_ORDER_ITEMS&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;order_items_report&lt;/span&gt;
    &lt;span class="na"&gt;BRONZE_DATALAKE_FILES&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;bronze&lt;/span&gt;
    &lt;span class="na"&gt;TRANSIENT_DATALAKE_FILES&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;transient&lt;/span&gt;
    &lt;span class="na"&gt;BUCKET_DATALAKE_FOLDER&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;transient&lt;/span&gt;
    &lt;span class="na"&gt;BUCKET_BIGDATA_JAR_FOLDER&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;jars&lt;/span&gt;
    &lt;span class="na"&gt;BUCKET_BIGDATA_SCRIPT_FOLDER&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;scripts&lt;/span&gt;
    &lt;span class="na"&gt;BUCKET_BIGDATA_PYSPARK_FOLDER&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pyspark&lt;/span&gt;
    &lt;span class="na"&gt;BUCKET_BIGDATA_PYSPARK_INGESTION&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ingestion&lt;/span&gt;
    &lt;span class="na"&gt;BUCKET_BIGDATA_PYSPARK_ENRICHMENT&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;enrichment/&lt;/span&gt;
    &lt;span class="na"&gt;DATAPROC_APP_NAME&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ingestion_countries_csv_to_delta&lt;/span&gt; 
    &lt;span class="na"&gt;JAR_LIB1&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;delta-core_2.12-2.3.0.jar&lt;/span&gt;
    &lt;span class="na"&gt;JAR_LIB2&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;delta-storage-2.3.0.jar&lt;/span&gt; 
    &lt;span class="na"&gt;APP_NAME&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;countries_ingestion_csv_to_delta'&lt;/span&gt;
    &lt;span class="na"&gt;PYSPARK_INGESTION_SCRIPT&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ingestion_csv_to_delta.py&lt;/span&gt;
    &lt;span class="na"&gt;PYSPARK_ENRICHMENT_SCRIPT_ORDER_ITENS&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;order_order_items_to_bigquery.py&lt;/span&gt;
    &lt;span class="na"&gt;TIME_PARTITION_FIELD&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;datePartition&lt;/span&gt;
    &lt;span class="na"&gt;FILE1&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;orders&lt;/span&gt;
    &lt;span class="na"&gt;FILE2&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;order_items&lt;/span&gt;
    &lt;span class="na"&gt;SUBJECT &lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;olist&lt;/span&gt;
    &lt;span class="na"&gt;STEP1 &lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;orders&lt;/span&gt;
    &lt;span class="na"&gt;STEP2 &lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;order_items&lt;/span&gt;
    &lt;span class="na"&gt;STEP3 &lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;order_items_report&lt;/span&gt;
    &lt;span class="na"&gt;TIME_ZONE &lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;America/Sao_Paulo&lt;/span&gt;
    &lt;span class="na"&gt;SCHEDULE &lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;00&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;12&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;*"&lt;/span&gt;
    &lt;span class="na"&gt;SCHEDULE_NAME &lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;schedule_olist_etl&lt;/span&gt;
    &lt;span class="na"&gt;SERVICE_ACCOUNT_NAME &lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;account-dataproc-bq-workflow&lt;/span&gt;
    &lt;span class="na"&gt;CUSTOM_ROLE &lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;DataProcBigQueryWorkflowCustomRole&lt;/span&gt;
    &lt;span class="na"&gt;STEP1_NAME &lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;step_ingestion_orders&lt;/span&gt;
    &lt;span class="na"&gt;STEP2_NAME &lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;step_ingestion_order_items&lt;/span&gt;
    &lt;span class="na"&gt;STEP3_NAME &lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;step_ingestion_order_items_bigquery&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Workflow Job Steps &lt;a&gt;&lt;/a&gt;
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;deploy-buckets&lt;/strong&gt;:
This step creates the required Cloud Storage buckets; each bucket is created only if it does not already exist. Once the buckets exist, the necessary data files, JARs, and scripts are copied into the appropriate folders.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;deploy-buckets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-22.04&lt;/span&gt;
    &lt;span class="na"&gt;timeout-minutes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;

    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Checkout&lt;/span&gt;
      &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;

    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Authorize GCP&lt;/span&gt;
      &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;google-github-actions/auth@v2'&lt;/span&gt;
      &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;credentials_json&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  &lt;span class="s"&gt;${{ secrets.GCP_SA_KEY }}&lt;/span&gt;

    &lt;span class="c1"&gt;# Step to Authenticate with GCP&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Set up Cloud SDK&lt;/span&gt;
      &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;google-github-actions/setup-gcloud@v2&lt;/span&gt;
      &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;&amp;gt;=&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;363.0.0'&lt;/span&gt;
        &lt;span class="na"&gt;project_id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.PROJECT_ID }}&lt;/span&gt;

    &lt;span class="c1"&gt;# Step to Configure Docker to use the gcloud command-line tool as a credential helper&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Configure Docker&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
        &lt;span class="s"&gt;gcloud auth configure-docker&lt;/span&gt;


    &lt;span class="c1"&gt;# Step to Create GCP Bucket &lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Create Google Cloud Storage - files&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
        &lt;span class="s"&gt;if ! gsutil ls -p ${{ secrets.PROJECT_ID }} gs://${{ secrets.GCP_BUCKET_BIGDATA_FILES }} &amp;amp;&amp;gt; /dev/null; \&lt;/span&gt;
          &lt;span class="s"&gt;then \&lt;/span&gt;
            &lt;span class="s"&gt;gcloud storage buckets create gs://${{ secrets.GCP_BUCKET_BIGDATA_FILES }} --default-storage-class=nearline --location=${{ env.REGION }}&lt;/span&gt;
          &lt;span class="s"&gt;else&lt;/span&gt;
            &lt;span class="s"&gt;echo "Cloud Storage : gs://${{ secrets.GCP_BUCKET_BIGDATA_FILES }}  already exists" ! &lt;/span&gt;
          &lt;span class="s"&gt;fi&lt;/span&gt;

    &lt;span class="c1"&gt;# Step to Create GCP Bucket &lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Create Google Cloud Storage - dataproc&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
        &lt;span class="s"&gt;if ! gsutil ls -p ${{ secrets.PROJECT_ID }} gs://${{ secrets.GCP_BUCKET_DATAPROC }} &amp;amp;&amp;gt; /dev/null; \&lt;/span&gt;
          &lt;span class="s"&gt;then \&lt;/span&gt;
            &lt;span class="s"&gt;gcloud storage buckets create gs://${{ secrets.GCP_BUCKET_DATAPROC }} --default-storage-class=nearline --location=${{ env.REGION }}&lt;/span&gt;
          &lt;span class="s"&gt;else&lt;/span&gt;
            &lt;span class="s"&gt;echo "Cloud Storage : gs://${{ secrets.GCP_BUCKET_DATAPROC }}  already exists" ! &lt;/span&gt;
          &lt;span class="s"&gt;fi&lt;/span&gt;

    &lt;span class="c1"&gt;# Step to Create GCP Bucket &lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Create Google Cloud Storage - datalake&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
        &lt;span class="s"&gt;if ! gsutil ls -p ${{ secrets.PROJECT_ID }} gs://${{ secrets.GCP_BUCKET_DATALAKE }} &amp;amp;&amp;gt; /dev/null; \&lt;/span&gt;
          &lt;span class="s"&gt;then \&lt;/span&gt;
            &lt;span class="s"&gt;gcloud storage buckets create gs://${{ secrets.GCP_BUCKET_DATALAKE }} --default-storage-class=nearline --location=${{ env.REGION }}&lt;/span&gt;
          &lt;span class="s"&gt;else&lt;/span&gt;
            &lt;span class="s"&gt;echo "Cloud Storage : gs://${{ secrets.GCP_BUCKET_DATALAKE }}  already exists" ! &lt;/span&gt;
          &lt;span class="s"&gt;fi&lt;/span&gt;

    &lt;span class="c1"&gt;# Step to Create GCP Bucket &lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Create Google Cloud Storage - big query temp files&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
        &lt;span class="s"&gt;if ! gsutil ls -p ${{ secrets.PROJECT_ID }} gs://${{ secrets.GCP_BUCKET_TEMP_BIGQUERY }} &amp;amp;&amp;gt; /dev/null; \&lt;/span&gt;
          &lt;span class="s"&gt;then \&lt;/span&gt;
            &lt;span class="s"&gt;gcloud storage buckets create gs://${{ secrets.GCP_BUCKET_TEMP_BIGQUERY }} --default-storage-class=nearline --location=${{ env.REGION }}&lt;/span&gt;
          &lt;span class="s"&gt;else&lt;/span&gt;
            &lt;span class="s"&gt;echo "Cloud Storage : gs://${{ secrets.GCP_BUCKET_TEMP_BIGQUERY }}  already exists" ! &lt;/span&gt;
          &lt;span class="s"&gt;fi&lt;/span&gt;

    &lt;span class="c1"&gt;# Step to Upload the file to GCP Bucket - transient files&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Upload transient files to Google Cloud Storage&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
        &lt;span class="s"&gt;TARGET=${{ env.TRANSIENT_DATALAKE_FILES }}&lt;/span&gt;
        &lt;span class="s"&gt;BUCKET_PATH=${{ secrets.GCP_BUCKET_DATALAKE }}/${{ env.BUCKET_DATALAKE_FOLDER }}    &lt;/span&gt;
        &lt;span class="s"&gt;gsutil cp -r $TARGET gs://${BUCKET_PATH}&lt;/span&gt;


    &lt;span class="c1"&gt;# Step to Upload the file to GCP Bucket - jar files&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Upload jar files to Google Cloud Storage&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
        &lt;span class="s"&gt;TARGET=${{ env.BUCKET_BIGDATA_JAR_FOLDER }}&lt;/span&gt;
        &lt;span class="s"&gt;BUCKET_PATH=${{ secrets.GCP_BUCKET_BIGDATA_FILES }}/${{ env.BUCKET_BIGDATA_JAR_FOLDER }}&lt;/span&gt;
        &lt;span class="s"&gt;gsutil cp -r $TARGET gs://${BUCKET_PATH}&lt;/span&gt;

    &lt;span class="c1"&gt;# Step to Upload the file to GCP Bucket - pyspark files&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Upload pyspark files to Google Cloud Storage&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
        &lt;span class="s"&gt;TARGET=${{ env.BUCKET_BIGDATA_SCRIPT_FOLDER}}/${{ env.BUCKET_BIGDATA_PYSPARK_FOLDER}}&lt;/span&gt;
        &lt;span class="s"&gt;BUCKET_PATH=${{ secrets.GCP_BUCKET_BIGDATA_FILES}}/${{ env.BUCKET_BIGDATA_SCRIPT_FOLDER}}/${{ env.BUCKET_BIGDATA_PYSPARK_FOLDER}}&lt;/span&gt;
        &lt;span class="s"&gt;gsutil cp -r $TARGET gs://${BUCKET_PATH}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;deploy-bigquery-dataset-bigquery-tables&lt;/strong&gt;:
This step creates the BigQuery dataset and table. It checks whether the dataset already exists and whether the table is already present before creating them. The table schema is read from a predefined JSON schema file kept under the repository's scripts folder.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;deploy-bigquery-dataset-bigquery-tables&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;needs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;deploy-buckets&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-22.04&lt;/span&gt;
    &lt;span class="na"&gt;timeout-minutes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;

    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Checkout&lt;/span&gt;
      &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;

    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Authorize GCP&lt;/span&gt;
      &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;google-github-actions/auth@v2'&lt;/span&gt;
      &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;credentials_json&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  &lt;span class="s"&gt;${{ secrets.GCP_SA_KEY }}&lt;/span&gt;

    &lt;span class="c1"&gt;# Step to Authenticate with GCP&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Set up Cloud SDK&lt;/span&gt;
      &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;google-github-actions/setup-gcloud@v2&lt;/span&gt;
      &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;&amp;gt;=&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;363.0.0'&lt;/span&gt;
        &lt;span class="na"&gt;project_id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.PROJECT_ID }}&lt;/span&gt;

    &lt;span class="c1"&gt;# Step to Configure Docker to use the gcloud command-line tool as a credential helper&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Configure Docker&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
        &lt;span class="s"&gt;gcloud auth configure-docker&lt;/span&gt;


    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Create Big Query Dataset&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;  
        &lt;span class="s"&gt;if ! bq ls --project_id ${{ secrets.PROJECT_ID}} -a | grep -w ${{ env.BIGQUERY_DATASET}} &amp;amp;&amp;gt; /dev/null; \&lt;/span&gt;
          &lt;span class="s"&gt;then &lt;/span&gt;

            &lt;span class="s"&gt;bq --location=${{ env.REGION }} mk \&lt;/span&gt;
          &lt;span class="s"&gt;--default_table_expiration 0 \&lt;/span&gt;
          &lt;span class="s"&gt;--dataset ${{ env.BIGQUERY_DATASET }}&lt;/span&gt;

          &lt;span class="s"&gt;else&lt;/span&gt;
            &lt;span class="s"&gt;echo "Big Query Dataset : ${{ env.BIGQUERY_DATASET }} already exists" ! &lt;/span&gt;
          &lt;span class="s"&gt;fi&lt;/span&gt;

    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Create Big Query table&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
        &lt;span class="s"&gt;TABLE_NAME_ORDER_ITEMS=${{ env.BIGQUERY_DATASET}}.${{ env.BIGQUERY_TABLE_ORDER_ITEMS}}&lt;/span&gt;
        &lt;span class="s"&gt;c=0&lt;/span&gt;
        &lt;span class="s"&gt;for table in $(bq ls --max_results 1000 "${{ secrets.PROJECT_ID}}:${{ env.BIGQUERY_DATASET}}" | tail -n +3 | awk '{print $1}'); do&lt;/span&gt;

            &lt;span class="s"&gt;# Determine the table type and file extension&lt;/span&gt;
            &lt;span class="s"&gt;if bq show --format=prettyjson $TABLE_NAME_ORDER_ITEMS | jq -r '.type' | grep -q -E "TABLE"; then&lt;/span&gt;
              &lt;span class="s"&gt;echo "Dataset ${{ env.BIGQUERY_DATASET}} already has table named : $table " !&lt;/span&gt;
              &lt;span class="s"&gt;if [ "$table" == "${{ env.BIGQUERY_TABLE_ORDER_ITEMS}}" ]; then&lt;/span&gt;
                &lt;span class="s"&gt;echo "Dataset ${{ env.BIGQUERY_DATASET}} already has table named : $table " !&lt;/span&gt;
                &lt;span class="s"&gt;((c=c+1))       &lt;/span&gt;
              &lt;span class="s"&gt;fi                  &lt;/span&gt;
            &lt;span class="s"&gt;else&lt;/span&gt;
                &lt;span class="s"&gt;echo "Ignoring $table"            &lt;/span&gt;
                &lt;span class="s"&gt;continue&lt;/span&gt;
            &lt;span class="s"&gt;fi&lt;/span&gt;
        &lt;span class="s"&gt;done&lt;/span&gt;
        &lt;span class="s"&gt;echo " contador $c "&lt;/span&gt;
        &lt;span class="s"&gt;if [ $c == 0 ]; then&lt;/span&gt;
          &lt;span class="s"&gt;echo "Creating table named : $table for Dataset ${{ env.BIGQUERY_DATASET}} " !&lt;/span&gt;

          &lt;span class="s"&gt;bq mk --table \&lt;/span&gt;
          &lt;span class="s"&gt;--time_partitioning_field ${{ env.TIME_PARTITION_FIELD}} \&lt;/span&gt;
          &lt;span class="s"&gt;$TABLE_NAME_ORDER_ITEMS \&lt;/span&gt;
          &lt;span class="s"&gt;./scripts/bigquery_files/schemas/order_items_schema.json&lt;/span&gt;


        &lt;span class="s"&gt;fi&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
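
&lt;p&gt;As a side note, the table-existence loop above can be reduced to a single &lt;code&gt;bq show&lt;/code&gt; call, which fails when the table does not exist. The sketch below is an untested simplification that reuses the same variables the workflow defines:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Sketch: bq show exits with a non-zero status when the table is missing,
# so the per-table loop can be replaced by a direct check.
TABLE_NAME_ORDER_ITEMS=${BIGQUERY_DATASET}.${BIGQUERY_TABLE_ORDER_ITEMS}
if ! bq show --format=none "$TABLE_NAME_ORDER_ITEMS"; then
  bq mk --table \
    --time_partitioning_field "$TIME_PARTITION_FIELD" \
    "$TABLE_NAME_ORDER_ITEMS" \
    ./scripts/bigquery_files/schemas/order_items_schema.json
fi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;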



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;deploy-dataproc-workflow-template&lt;/strong&gt;:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This step creates the workflow template. It sets up a Dataproc cluster and links it with the workflow. The three main steps of the workflow are then created, with validations ensuring that each component is only created once.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt; &lt;span class="na"&gt;deploy-dataproc-workflow-template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;needs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;deploy-buckets&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-22.04&lt;/span&gt;
    &lt;span class="na"&gt;timeout-minutes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;

    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Checkout&lt;/span&gt;
      &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;

    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Authorize GCP&lt;/span&gt;
      &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;google-github-actions/auth@v2'&lt;/span&gt;
      &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;credentials_json&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  &lt;span class="s"&gt;${{ secrets.GCP_SA_KEY }}&lt;/span&gt;

    &lt;span class="c1"&gt;# Step to Authenticate with GCP&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Set up Cloud SDK&lt;/span&gt;
      &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;google-github-actions/setup-gcloud@v2&lt;/span&gt;
      &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;&amp;gt;=&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;363.0.0'&lt;/span&gt;
        &lt;span class="na"&gt;project_id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.PROJECT_ID }}&lt;/span&gt;

    &lt;span class="c1"&gt;# Step to Configure Docker to use the gcloud command-line tool as a credential helper&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Configure Docker&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
        &lt;span class="s"&gt;gcloud auth configure-docker&lt;/span&gt;

    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Create Dataproc Workflow&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;  
        &lt;span class="s"&gt;if ! gcloud dataproc workflow-templates list --region=${{ env.REGION}} | grep -i ${{ env.DATAPROC_WORKFLOW_NAME}} &amp;amp;&amp;gt; /dev/null; \&lt;/span&gt;
          &lt;span class="s"&gt;then \&lt;/span&gt;
            &lt;span class="s"&gt;gcloud dataproc workflow-templates create ${{ env.DATAPROC_WORKFLOW_NAME }} --region ${{ env.REGION }}&lt;/span&gt;

          &lt;span class="s"&gt;else&lt;/span&gt;
            &lt;span class="s"&gt;echo "Workflow Template : ${{ env.DATAPROC_WORKFLOW_NAME }} already exists" ! &lt;/span&gt;
          &lt;span class="s"&gt;fi&lt;/span&gt;

    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Create Dataproc Managed Cluster&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;&amp;gt;&lt;/span&gt;       
        &lt;span class="s"&gt;gcloud dataproc workflow-templates set-managed-cluster ${{ env.DATAPROC_WORKFLOW_NAME }} &lt;/span&gt;
        &lt;span class="s"&gt;--region ${{ env.REGION }} &lt;/span&gt;
        &lt;span class="s"&gt;--zone ${{ env.ZONE }} &lt;/span&gt;
        &lt;span class="s"&gt;--image-version ${{ env.DATAPROC_IMAGE_VERSION }} &lt;/span&gt;
        &lt;span class="s"&gt;--master-machine-type=${{ env.DATAPROC_MASTER_TYPE }} &lt;/span&gt;
        &lt;span class="s"&gt;--master-boot-disk-type ${{ env.DATAPROC_MASTER_BOOT_DISK_TYPE }} &lt;/span&gt;
        &lt;span class="s"&gt;--master-boot-disk-size ${{ env.DATAPROC_MASTER_BOOT_DISK_SIZE }} &lt;/span&gt;
        &lt;span class="s"&gt;--worker-machine-type=${{ env.DATAPROC_WORKER_TYPE }} &lt;/span&gt;
        &lt;span class="s"&gt;--worker-boot-disk-type ${{ env.DATAPROC_WORKER_BOOT_DISK_TYPE }}&lt;/span&gt;
        &lt;span class="s"&gt;--worker-boot-disk-size ${{ env.DATAPROC_WORKER_DISK_SIZE }} &lt;/span&gt;
        &lt;span class="s"&gt;--num-workers=${{ env.DATAPROC_NUM_WORKERS }} &lt;/span&gt;
        &lt;span class="s"&gt;--cluster-name=${{ env.DATAPROC_CLUSTER_NAME }} &lt;/span&gt;
        &lt;span class="s"&gt;--optional-components ${{ env.DATAPROC_COMPONENTS }} &lt;/span&gt;
        &lt;span class="s"&gt;--service-account=${{ env.GCP_SERVICE_ACCOUNT }}&lt;/span&gt;

    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Add Job Ingestion orders to Workflow&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
        &lt;span class="s"&gt;if gcloud dataproc workflow-templates list --region=${{ env.REGION}} | grep -i ${{ env.DATAPROC_WORKFLOW_NAME}} &amp;amp;&amp;gt; /dev/null; \&lt;/span&gt;
          &lt;span class="s"&gt;then \&lt;/span&gt;

            &lt;span class="s"&gt;if gcloud dataproc workflow-templates describe ${{ env.DATAPROC_WORKFLOW_NAME}} --region=${{ env.REGION}} | grep -i ${{ env.STEP1_NAME  }} &amp;amp;&amp;gt; /dev/null; \&lt;/span&gt;
            &lt;span class="s"&gt;then \&lt;/span&gt;
              &lt;span class="s"&gt;echo "Workflow Template : ${{ env.DATAPROC_WORKFLOW_NAME }} already has step : ${{ env.STEP1_NAME  }} " ! &lt;/span&gt;
            &lt;span class="s"&gt;else&lt;/span&gt;
              &lt;span class="s"&gt;PYSPARK_SCRIPT_PATH=${{ secrets.GCP_BUCKET_BIGDATA_FILES}}/${{ env.BUCKET_BIGDATA_SCRIPT_FOLDER}}/${{ env.BUCKET_BIGDATA_PYSPARK_FOLDER}}/${{ env.BUCKET_BIGDATA_PYSPARK_INGESTION}}/${{ env.PYSPARK_INGESTION_SCRIPT}}&lt;/span&gt;
              &lt;span class="s"&gt;JARS_PATH=gs://${{ secrets.GCP_BUCKET_BIGDATA_FILES }}/${{ env.BUCKET_BIGDATA_JAR_FOLDER }}/${{ env.JAR_LIB1 }}&lt;/span&gt;
              &lt;span class="s"&gt;JARS_PATH=${JARS_PATH},gs://${{ secrets.GCP_BUCKET_BIGDATA_FILES }}/${{ env.BUCKET_BIGDATA_JAR_FOLDER }}/${{ env.JAR_LIB2 }}&lt;/span&gt;
              &lt;span class="s"&gt;TRANSIENT=${{ secrets.GCP_BUCKET_DATALAKE }}/${{ env.BUCKET_DATALAKE_FOLDER }}/${{ env.SUBJECT }}/${{ env.FILE1 }}&lt;/span&gt;
              &lt;span class="s"&gt;BRONZE=${{ secrets.GCP_BUCKET_DATALAKE }}/${{ env.BRONZE_DATALAKE_FILES }}/${{ env.SUBJECT }}/${{ env.FILE1 }}&lt;/span&gt;

              &lt;span class="s"&gt;gcloud dataproc workflow-templates add-job pyspark gs://${PYSPARK_SCRIPT_PATH} \&lt;/span&gt;
              &lt;span class="s"&gt;--workflow-template ${{ env.DATAPROC_WORKFLOW_NAME }}  \&lt;/span&gt;
              &lt;span class="s"&gt;--step-id ${{env.STEP1_NAME }} \&lt;/span&gt;
              &lt;span class="s"&gt;--region ${{ env.REGION }} \&lt;/span&gt;
              &lt;span class="s"&gt;--jars ${JARS_PATH} \&lt;/span&gt;
              &lt;span class="s"&gt;-- --app_name=${{ env.APP_NAME }}${{ env.STEP1 }} --bucket_transient=gs://${TRANSIENT} \&lt;/span&gt;
              &lt;span class="s"&gt;--bucket_bronze=gs://${BRONZE}&lt;/span&gt;
            &lt;span class="s"&gt;fi&lt;/span&gt;
        &lt;span class="s"&gt;else&lt;/span&gt;
          &lt;span class="s"&gt;echo "Workflow Template : ${{ env.DATAPROC_WORKFLOW_NAME}} not exists" ! &lt;/span&gt;
        &lt;span class="s"&gt;fi        &lt;/span&gt;

    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Add Job Ingestion order items to Workflow&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
        &lt;span class="s"&gt;if gcloud dataproc workflow-templates list --region=${{ env.REGION}} | grep -i ${{ env.DATAPROC_WORKFLOW_NAME}} &amp;amp;&amp;gt; /dev/null; \&lt;/span&gt;
          &lt;span class="s"&gt;then \&lt;/span&gt;
            &lt;span class="s"&gt;if gcloud dataproc workflow-templates describe ${{ env.DATAPROC_WORKFLOW_NAME}} --region=${{ env.REGION}} | grep -i ${{ env.STEP2_NAME }} &amp;amp;&amp;gt; /dev/null; \&lt;/span&gt;
            &lt;span class="s"&gt;then \&lt;/span&gt;
              &lt;span class="s"&gt;echo "Workflow Template : ${{ env.DATAPROC_WORKFLOW_NAME }} already has step : ${{ env.STEP2_NAME  }} " ! &lt;/span&gt;
            &lt;span class="s"&gt;else&lt;/span&gt;
              &lt;span class="s"&gt;PYSPARK_SCRIPT_PATH=${{ secrets.GCP_BUCKET_BIGDATA_FILES}}/${{ env.BUCKET_BIGDATA_SCRIPT_FOLDER}}/${{ env.BUCKET_BIGDATA_PYSPARK_FOLDER}}/${{ env.BUCKET_BIGDATA_PYSPARK_INGESTION}}/${{ env.PYSPARK_INGESTION_SCRIPT}}&lt;/span&gt;
              &lt;span class="s"&gt;JARS_PATH=gs://${{ secrets.GCP_BUCKET_BIGDATA_FILES }}/${{ env.BUCKET_BIGDATA_JAR_FOLDER }}/${{ env.JAR_LIB1 }}&lt;/span&gt;
              &lt;span class="s"&gt;JARS_PATH=${JARS_PATH},gs://${{ secrets.GCP_BUCKET_BIGDATA_FILES }}/${{ env.BUCKET_BIGDATA_JAR_FOLDER }}/${{ env.JAR_LIB2 }}&lt;/span&gt;
              &lt;span class="s"&gt;TRANSIENT=${{ secrets.GCP_BUCKET_DATALAKE }}/${{ env.BUCKET_DATALAKE_FOLDER }}/${{ env.SUBJECT }}/${{ env.FILE2 }}&lt;/span&gt;
              &lt;span class="s"&gt;BRONZE=${{ secrets.GCP_BUCKET_DATALAKE }}/${{ env.BRONZE_DATALAKE_FILES }}/${{ env.SUBJECT }}/${{ env.FILE2 }}&lt;/span&gt;


              &lt;span class="s"&gt;gcloud dataproc workflow-templates add-job pyspark gs://${PYSPARK_SCRIPT_PATH} \&lt;/span&gt;
              &lt;span class="s"&gt;--workflow-template ${{ env.DATAPROC_WORKFLOW_NAME }}  \&lt;/span&gt;
              &lt;span class="s"&gt;--step-id ${{ env.STEP2_NAME }} \&lt;/span&gt;
              &lt;span class="s"&gt;--start-after ${{ env.STEP1_NAME }} \&lt;/span&gt;
              &lt;span class="s"&gt;--region ${{ env.REGION }} \&lt;/span&gt;
              &lt;span class="s"&gt;--jars ${JARS_PATH} \&lt;/span&gt;
              &lt;span class="s"&gt;-- --app_name=${{ env.APP_NAME }}${{ env.STEP2 }} --bucket_transient=gs://${TRANSIENT} \&lt;/span&gt;
              &lt;span class="s"&gt;--bucket_bronze=gs://${BRONZE}&lt;/span&gt;
            &lt;span class="s"&gt;fi&lt;/span&gt;
        &lt;span class="s"&gt;else&lt;/span&gt;
          &lt;span class="s"&gt;echo "Workflow Template : ${{ env.DATAPROC_WORKFLOW_NAME}} not exists" ! &lt;/span&gt;
        &lt;span class="s"&gt;fi&lt;/span&gt;


    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Add Job order + order items ingestion into big query  to Workflow&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
        &lt;span class="s"&gt;if gcloud dataproc workflow-templates list --region=${{ env.REGION}} | grep -i ${{ env.DATAPROC_WORKFLOW_NAME}} &amp;amp;&amp;gt; /dev/null; \&lt;/span&gt;
          &lt;span class="s"&gt;then \&lt;/span&gt;
            &lt;span class="s"&gt;if gcloud dataproc workflow-templates describe ${{ env.DATAPROC_WORKFLOW_NAME}} --region=${{ env.REGION}} | grep -i ${{ env.STEP3_NAME }} &amp;amp;&amp;gt; /dev/null; \&lt;/span&gt;
            &lt;span class="s"&gt;then \&lt;/span&gt;
              &lt;span class="s"&gt;echo "Workflow Template : ${{ env.DATAPROC_WORKFLOW_NAME }} already has step : ${{ env.STEP3_NAME }} " ! &lt;/span&gt;
            &lt;span class="s"&gt;else&lt;/span&gt;

              &lt;span class="s"&gt;PYSPARK_SCRIPT_PATH=${{ secrets.GCP_BUCKET_BIGDATA_FILES}}/${{ env.BUCKET_BIGDATA_SCRIPT_FOLDER}}/${{ env.BUCKET_BIGDATA_PYSPARK_FOLDER}}/${{ env.BUCKET_BIGDATA_PYSPARK_ENRICHMENT}}/${{ env.PYSPARK_ENRICHMENT_SCRIPT_ORDER_ITENS}}&lt;/span&gt;
              &lt;span class="s"&gt;JARS_PATH=gs://${{ secrets.GCP_BUCKET_BIGDATA_FILES }}/${{ env.BUCKET_BIGDATA_JAR_FOLDER }}/${{ env.JAR_LIB1 }}&lt;/span&gt;
              &lt;span class="s"&gt;JARS_PATH=${JARS_PATH},gs://${{ secrets.GCP_BUCKET_BIGDATA_FILES }}/${{ env.BUCKET_BIGDATA_JAR_FOLDER }}/${{ env.JAR_LIB2 }}&lt;/span&gt;
              &lt;span class="s"&gt;BRONZE_ORDERS=${{ secrets.GCP_BUCKET_DATALAKE}}/${{ env.BRONZE_DATALAKE_FILES}}/${{ env.SUBJECT}}/${{ env.FILE1}}&lt;/span&gt;
              &lt;span class="s"&gt;BRONZE_ORDER_ITEMS=${{ secrets.GCP_BUCKET_DATALAKE}}/${{ env.BRONZE_DATALAKE_FILES}}/${{ env.SUBJECT}}/${{ env.FILE2}}&lt;/span&gt;
              &lt;span class="s"&gt;TABLE_NAME_ORDER_ITEMS=${{ secrets.PROJECT_ID}}.${{ env.BIGQUERY_DATASET}}.${{ env.BIGQUERY_TABLE_ORDER_ITEMS}}&lt;/span&gt;
              &lt;span class="s"&gt;BIG_QUERY_TEMP=${{ secrets.GCP_BUCKET_TEMP_BIGQUERY }}&lt;/span&gt;



              &lt;span class="s"&gt;gcloud dataproc workflow-templates add-job pyspark gs://${PYSPARK_SCRIPT_PATH} \&lt;/span&gt;
              &lt;span class="s"&gt;--workflow-template ${{ env.DATAPROC_WORKFLOW_NAME }}  \&lt;/span&gt;
              &lt;span class="s"&gt;--step-id ${{ env.STEP3_NAME }} \&lt;/span&gt;
              &lt;span class="s"&gt;--start-after ${{ env.STEP1_NAME }},${{ env.STEP2_NAME }} \&lt;/span&gt;
              &lt;span class="s"&gt;--region ${{ env.REGION }} \&lt;/span&gt;
              &lt;span class="s"&gt;--jars ${JARS_PATH} \&lt;/span&gt;
              &lt;span class="s"&gt;-- --app_name=${{ env.APP_NAME }}${{ env.STEP3 }} --bronze_orders_zone=gs://${BRONZE_ORDERS} \&lt;/span&gt;
              &lt;span class="s"&gt;--bronze_orders_items_zone=gs://${BRONZE_ORDER_ITEMS} \&lt;/span&gt;
              &lt;span class="s"&gt;--bigquery_table=${TABLE_NAME_ORDER_ITEMS} \&lt;/span&gt;
              &lt;span class="s"&gt;--temp_bucket=${BIG_QUERY_TEMP} &lt;/span&gt;

            &lt;span class="s"&gt;fi&lt;/span&gt;
        &lt;span class="s"&gt;else&lt;/span&gt;
          &lt;span class="s"&gt;echo "Workflow Template : ${{ env.DATAPROC_WORKFLOW_NAME}} not exists" ! &lt;/span&gt;
        &lt;span class="s"&gt;fi&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
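
&lt;p&gt;Before relying on the scheduler, the template can also be instantiated by hand to confirm that the managed cluster and the three jobs are wired correctly. A minimal sketch, assuming the same environment variables used by the workflow:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Sketch: run the workflow template once manually. The command creates the
# managed cluster, runs the jobs in order and deletes the cluster at the end.
gcloud dataproc workflow-templates instantiate "${DATAPROC_WORKFLOW_NAME}" \
  --region="${REGION}"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;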



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;deploy-cloud-schedule&lt;/strong&gt;:
In this final step, a service account, a custom role, and a Cloud Scheduler job are created. The Cloud Scheduler job triggers the workflow on a predefined schedule, and the service account it uses is granted the necessary permissions.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;deploy-cloud-schedule&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;needs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;deploy-buckets&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;deploy-dataproc-workflow-template&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-22.04&lt;/span&gt;
    &lt;span class="na"&gt;timeout-minutes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;

    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Checkout&lt;/span&gt;
      &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;

    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Authorize GCP&lt;/span&gt;
      &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;google-github-actions/auth@v2'&lt;/span&gt;
      &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;credentials_json&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  &lt;span class="s"&gt;${{ secrets.GCP_SA_KEY }}&lt;/span&gt;

    &lt;span class="c1"&gt;# Step to Authenticate with GCP&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Set up Cloud SDK&lt;/span&gt;
      &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;google-github-actions/setup-gcloud@v2&lt;/span&gt;
      &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;&amp;gt;=&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;363.0.0'&lt;/span&gt;
        &lt;span class="na"&gt;project_id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.PROJECT_ID }}&lt;/span&gt;

    &lt;span class="c1"&gt;# Step to Configure Docker to use the gcloud command-line tool as a credential helper&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Configure Docker&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
        &lt;span class="s"&gt;gcloud auth configure-docker&lt;/span&gt;


    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Create service account&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;

        &lt;span class="s"&gt;if ! gcloud iam service-accounts list | grep -i ${{ env.SERVICE_ACCOUNT_NAME}} &amp;amp;&amp;gt; /dev/null; \&lt;/span&gt;
          &lt;span class="s"&gt;then \&lt;/span&gt;
            &lt;span class="s"&gt;gcloud iam service-accounts create ${{ env.SERVICE_ACCOUNT_NAME }} \&lt;/span&gt;
            &lt;span class="s"&gt;--display-name="scheduler dataproc workflow service account"&lt;/span&gt;
          &lt;span class="s"&gt;fi&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Create Custom role for service account&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
        &lt;span class="s"&gt;if ! gcloud iam roles list --project ${{ secrets.PROJECT_ID }} | grep -i ${{ env.CUSTOM_ROLE }} &amp;amp;&amp;gt; /dev/null; \&lt;/span&gt;
          &lt;span class="s"&gt;then \&lt;/span&gt;
            &lt;span class="s"&gt;gcloud iam roles create ${{ env.CUSTOM_ROLE }} --project ${{ secrets.PROJECT_ID }} \&lt;/span&gt;
            &lt;span class="s"&gt;--title "Dataproc Workflow template scheduler" --description "Dataproc Workflow template scheduler" \&lt;/span&gt;
            &lt;span class="s"&gt;--permissions "dataproc.workflowTemplates.instantiate,iam.serviceAccounts.actAs" --stage ALPHA&lt;/span&gt;
          &lt;span class="s"&gt;fi    &lt;/span&gt;

    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Add the custom role for service account&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
        &lt;span class="s"&gt;gcloud projects add-iam-policy-binding ${{secrets.PROJECT_ID}} \&lt;/span&gt;
        &lt;span class="s"&gt;--member=serviceAccount:${{env.SERVICE_ACCOUNT_NAME}}@${{secrets.PROJECT_ID}}.iam.gserviceaccount.com \&lt;/span&gt;
        &lt;span class="s"&gt;--role=projects/${{secrets.PROJECT_ID}}/roles/${{env.CUSTOM_ROLE}}&lt;/span&gt;

    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Create cloud schedule for workflow execution&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
        &lt;span class="s"&gt;if ! gcloud scheduler jobs list --location ${{env.REGION}} | grep -i ${{env.SCHEDULE_NAME}} &amp;amp;&amp;gt; /dev/null; \&lt;/span&gt;
          &lt;span class="s"&gt;then \&lt;/span&gt;
            &lt;span class="s"&gt;gcloud scheduler jobs create http ${{env.SCHEDULE_NAME}} \&lt;/span&gt;
            &lt;span class="s"&gt;--schedule="30 12 * * *" \&lt;/span&gt;
            &lt;span class="s"&gt;--description="Dataproc workflow " \&lt;/span&gt;
            &lt;span class="s"&gt;--location=${{env.REGION}} \&lt;/span&gt;
            &lt;span class="s"&gt;--uri=https://dataproc.googleapis.com/v1/projects/${{secrets.PROJECT_ID}}/regions/${{env.REGION}}/workflowTemplates/${{env.DATAPROC_WORKFLOW_NAME}}:instantiate?alt=json \&lt;/span&gt;
            &lt;span class="s"&gt;--time-zone=${{env.TIME_ZONE}} \&lt;/span&gt;
            &lt;span class="s"&gt;--oauth-service-account-email=${{env.SERVICE_ACCOUNT_NAME}}@${{secrets.PROJECT_ID}}.iam.gserviceaccount.com&lt;/span&gt;
          &lt;span class="s"&gt;fi&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Resources Created After Deployment
&lt;/h2&gt;

&lt;p&gt;Upon deployment, the following resources are created:&lt;/p&gt;

&lt;h3&gt;
  
  
  Google Cloud Storage Bucket
&lt;/h3&gt;

&lt;p&gt;At the end of the deployment process, several Cloud Storage buckets are created: one for data-lake storage, another for the Dataproc cluster, one for the PySpark scripts and libraries used in the project, and one for the temporary files BigQuery generates during the load process. The Dataproc service itself also creates its own bucket to manage temporary data generated during processing.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8egg4eg3jeonb1bwuu9q.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8egg4eg3jeonb1bwuu9q.JPG" alt="cloud storage" width="800" height="224"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Dataproc Cluster
&lt;/h3&gt;

&lt;p&gt;The Dataproc service now shows the new workflow template; the picture below shows 2 templates. In the Workflows tab, it is possible to explore options such as monitoring workflow executions and analyzing their details.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsmp4rsn1x7x5paazy75e.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsmp4rsn1x7x5paazy75e.JPG" alt="dataproc-workflow1" width="800" height="354"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Selecting the created workflow, it is possible to see the cluster used for processing, the workflow's steps, and the dependencies between them.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe0vypblqtao6k13v8lfh.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe0vypblqtao6k13v8lfh.JPG" alt="dataproc-workflow2" width="800" height="596"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With the Dataproc service, it is possible to monitor the execution status of each job, with individual details about each execution and its performance; an example is displayed below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjj0payqwvuca6pq1vyq1.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjj0payqwvuca6pq1vyq1.JPG" alt="dataproc-workflow3" width="800" height="375"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  BigQuery Dataset
&lt;/h3&gt;

&lt;p&gt;The image below shows the BigQuery dataset. In this case there is just one table, with several columns of different numeric types, but it is possible to have several other datasets in the same project, as well as countless other tables, procedures, views, and functions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frtuezk6zshzxxgxbxd6g.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frtuezk6zshzxxgxbxd6g.JPG" alt="Big Query1" width="800" height="509"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The result of a query is shown below. In this case the query returns all columns in the table. Filtering on the partitioned column in the WHERE clause is always a good practice to avoid excessive costs, since BigQuery charges for the data processed by each query as well as for stored data. Because the data used in this experiment is small and a trial account was used, there was no cost for the query.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhtr8eqw0u0j3qn5kfiz9.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhtr8eqw0u0j3qn5kfiz9.JPG" alt="Big Query2" width="800" height="417"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Cloud Scheduler Job
&lt;/h3&gt;

&lt;p&gt;The Cloud Scheduler console shows two Dataproc workflow schedules, which automate the entire data ingestion, transformation, and enrichment process in an orchestrated manner and on a fixed schedule.&lt;br&gt;
It is possible to force the execution of any schedule manually by selecting the desired Cloud Scheduler job and clicking the "FORCE RUN" button, and the scheduling time can be changed under "EDIT".&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx74p6oys3q5n66vx2ojf.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx74p6oys3q5n66vx2ojf.JPG" alt="Cloud Schedule" width="800" height="200"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
Conclusion
&lt;/h2&gt;

&lt;p&gt;This project demonstrates how Google Cloud’s Dataproc, BigQuery, Cloud Storage, and Cloud Scheduler can be integrated to create a scalable, automated data ingestion pipeline. By leveraging GitHub Actions for CI/CD, the project ensures streamlined deployment, robust automation, and seamless workflows. This setup can be adapted to suit various use cases in big data processing, enabling organizations to process, store, and analyze large datasets efficiently.&lt;/p&gt;

&lt;p&gt;Links and References&lt;br&gt;
&lt;a href="https://github.com/jader-lima/gcp-dataproc-bigquery-workflow-template" rel="noopener noreferrer"&gt;GitHub Repo&lt;/a&gt;&lt;br&gt;
&lt;a href="https://cloud.google.com/bigquery/docs" rel="noopener noreferrer"&gt;Big Query&lt;/a&gt;&lt;br&gt;
&lt;a href="https://cloud.google.com/dataproc/docs" rel="noopener noreferrer"&gt;DataProc&lt;/a&gt;&lt;br&gt;
&lt;a href="https://cloud.google.com/scheduler/docs" rel="noopener noreferrer"&gt;Cloud Scheduler&lt;/a&gt;&lt;br&gt;
&lt;a href="https://cloud.google.com/dataproc/docs/concepts/workflows/workflow-schedule-solutions" rel="noopener noreferrer"&gt;Wrokflows&lt;/a&gt;&lt;/p&gt;

</description>
      <category>gcp</category>
      <category>dataproc</category>
      <category>bigquery</category>
      <category>bigdata</category>
    </item>
    <item>
      <title>Creating a data pipeline using Dataproc workflow templates and cloud Schedule</title>
      <dc:creator>Jader Lima</dc:creator>
      <pubDate>Wed, 21 Aug 2024 13:11:52 +0000</pubDate>
      <link>https://dev.to/jader_lima_b72a63be5bbddc/creating-a-data-pipeline-using-dataproc-workflow-templates-and-cloud-schedule-267d</link>
      <guid>https://dev.to/jader_lima_b72a63be5bbddc/creating-a-data-pipeline-using-dataproc-workflow-templates-and-cloud-schedule-267d</guid>
      <description>&lt;h2&gt;
  
  
  About This Post
&lt;/h2&gt;

&lt;p&gt;Data pipelines are orchestrated, scheduled processes for acquiring, transforming, and enriching data,&lt;br&gt;
handling information from different sources with countless possible destinations and applications.&lt;br&gt;
There are several systems that help in building data pipelines. In this post we will cover creating a data pipeline on Google Cloud Platform, using a Dataproc workflow template and scheduling it with Cloud Scheduler.&lt;/p&gt;
&lt;h2&gt;
  
  
  Description of Services Used in GCP
&lt;/h2&gt;
&lt;h2&gt;
  
  
  Apache Spark
&lt;/h2&gt;

&lt;p&gt;Apache Spark is an open-source unified analytics engine for large-scale data processing. It is known for its speed, ease of use, and sophisticated analytics capabilities. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance, making it suitable for a wide variety of big data applications.&lt;/p&gt;
&lt;h2&gt;
  
  
  Google Dataproc
&lt;/h2&gt;

&lt;p&gt;Google Dataproc is a fully managed cloud service that simplifies running Apache Spark and Apache Hadoop clusters in the Google Cloud environment. It allows users to easily process large datasets and integrates seamlessly with other Google Cloud services such as Cloud Storage. Dataproc is designed to make big data processing fast and efficient while minimizing operational overhead.&lt;/p&gt;
&lt;h2&gt;
  
  
  Cloud Storage
&lt;/h2&gt;

&lt;p&gt;Google Cloud Storage is a scalable and secure object storage service for storing large amounts of unstructured data. It offers high availability and strong global consistency, making it suitable for a wide range of scenarios, such as data backups, big data analytics, and content distribution.&lt;/p&gt;
&lt;h2&gt;
  
  
  Workflow Templates
&lt;/h2&gt;

&lt;p&gt;Workflow templates in Google Cloud allow you to define and manage complex workflows that automate interactions between different cloud services. This feature simplifies the process of building, scheduling, and executing intricate workflows, ensuring better management of resources and tasks.&lt;/p&gt;
&lt;h2&gt;
  
  
  Cloud Scheduler
&lt;/h2&gt;

&lt;p&gt;Google Cloud Scheduler is a fully managed cron job service that allows you to run arbitrary functions at specified times without needing to manage the infrastructure. It is useful for automating tasks such as running reports, triggering workflows, and executing other scheduled jobs.&lt;/p&gt;
&lt;h2&gt;
  
  
  CI/CD Process with GitHub Actions
&lt;/h2&gt;

&lt;p&gt;Incorporating a CI/CD pipeline using GitHub Actions involves automating the build, test, and deployment processes of your applications. For this project, GitHub Actions simplifies the deployment of code and resources to Google Cloud. This automation leverages GitHub's infrastructure to trigger workflows based on events such as code pushes, ensuring that your applications are built and deployed consistently and accurately each time code changes are made.&lt;/p&gt;
&lt;h2&gt;
  
  
  GitHub Secrets and Configuration
&lt;/h2&gt;

&lt;p&gt;Utilizing secrets in GitHub Actions is vital for maintaining security during the deployment process. Secrets allow you to store sensitive information such as API keys, passwords, and service account credentials securely. By keeping this sensitive data out of your source code, you minimize the risk of leaks and unauthorized access.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;GCP_BUCKET_BIGDATA_FILES&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Secret used to store the name of the cloud storage&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;GCP_BUCKET_DATALAKE&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Secret used to store the name of the cloud storage&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;GCP_BUCKET_DATAPROC&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Secret used to store the name of the cloud storage&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;GCP_SERVICE_ACCOUNT&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;GCP_SA_KEY&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Secret used to store the value of the service account key. For this project, the default service key was used. &lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;PROJECT_ID&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Secret used to store the project id value&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;
  
  
  Creating a GCP service account key
&lt;/h3&gt;

&lt;p&gt;To create computing resources in any cloud in an automated or programmatic way, it is necessary to have an access key.&lt;br&gt;
In the case of GCP, we use an access key linked to a service account; for this project the default account was used. The console steps are listed below, and a command-line alternative is sketched after them.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;In GCP Console, access :

&lt;ol&gt;
&lt;li&gt;IAM &amp;amp; Admin&lt;/li&gt;
&lt;li&gt;Service accounts&lt;/li&gt;
&lt;li&gt;Select default service account, default name is something like &lt;strong&gt;Compute Engine default service account&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;In the selected service account, open the &lt;strong&gt;KEYS&lt;/strong&gt; menu, 

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;ADD KEY&lt;/strong&gt;, &lt;strong&gt;Create new Key&lt;/strong&gt;, Key Type &lt;strong&gt;json&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Download the key file and use its content as the value of your GitHub secret&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fua188978d64xi4o6php2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fua188978d64xi4o6php2.png" alt="GCP service account key" width="800" height="430"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For more details, access:&lt;br&gt;
&lt;a href="https://cloud.google.com/iam/docs/keys-create-delete" rel="noopener noreferrer"&gt;https://cloud.google.com/iam/docs/keys-create-delete&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Creating github secret
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;To create a new secret:

&lt;ol&gt;
&lt;li&gt;In the project repository, open the &lt;strong&gt;Settings&lt;/strong&gt; menu&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security&lt;/strong&gt;, &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Secrets and variables&lt;/strong&gt;, then &lt;strong&gt;Actions&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;New repository secret&lt;/strong&gt;, then type a &lt;strong&gt;name&lt;/strong&gt; and &lt;strong&gt;value&lt;/strong&gt; for the secret.&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi45cicz0q89ije7j70yf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi45cicz0q89ije7j70yf.png" alt="github secret creation" width="800" height="609"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For more details , access :&lt;br&gt;
&lt;a href="https://docs.github.com/pt/actions/security-for-github-actions/security-guides/using-secrets-in-github-actions" rel="noopener noreferrer"&gt;https://docs.github.com/pt/actions/security-for-github-actions/security-guides/using-secrets-in-github-actions&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Architecture Diagram
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvfastxh43rr298iljs9j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvfastxh43rr298iljs9j.png" alt="Architecture Diagram" width="506" height="554"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Deploying the project
&lt;/h2&gt;

&lt;p&gt;Every time a push to the &lt;strong&gt;main&lt;/strong&gt; branch happens, GitHub Actions is triggered and&lt;br&gt;
runs the YAML script.&lt;br&gt;
The script contains 3 jobs, which are explained in more detail below; essentially,&lt;br&gt;
GitHub Actions uses the credentials of a service account with rights to create&lt;br&gt;
computing resources, authenticates to GCP, and performs the steps described in the YAML file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;push&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;branchs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;main&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Workflow File YAML Explanation
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Environments Needed
&lt;/h3&gt;

&lt;p&gt;Here's a brief overview of the environment variables the workflow relies on, as defined in the YAML below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;BRONZE_DATALAKE_FILES&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;bronze&lt;/span&gt;
    &lt;span class="na"&gt;TRANSIENT_DATALAKE_FILES&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;transient&lt;/span&gt;
    &lt;span class="na"&gt;BUCKET_DATALAKE_FOLDER&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;transient&lt;/span&gt;
    &lt;span class="na"&gt;BUCKET_BIGDATA_JAR_FOLDER&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;jars&lt;/span&gt;
    &lt;span class="na"&gt;BUCKET_BIGDATA_PYSPARK_FOLDER&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;scripts&lt;/span&gt;
    &lt;span class="na"&gt;PYSPARK_INGESTION_SCRIPT &lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ingestion_csv_to_delta.py&lt;/span&gt; 
    &lt;span class="na"&gt;REGION&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;us-east1&lt;/span&gt;
    &lt;span class="na"&gt;ZONE&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;us-east1-b&lt;/span&gt;
    &lt;span class="na"&gt;DATAPROC_CLUSTER_NAME &lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dataproc-bigdata-multi-node-cluster&lt;/span&gt;
    &lt;span class="na"&gt;DATAPROC_WORKER_TYPE &lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;n2-standard-2&lt;/span&gt;
    &lt;span class="na"&gt;DATAPROC_MASTER_TYPE &lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;n2-standard-2&lt;/span&gt;
    &lt;span class="na"&gt;DATAPROC_NUM_WORKERS &lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;
    &lt;span class="na"&gt;DATAPROC_IMAGE_VERSION &lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;2.1-debian11&lt;/span&gt;
    &lt;span class="na"&gt;DATAPROC_WORKER_NUM_LOCAL_SSD&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
    &lt;span class="na"&gt;DATAPROC_MASTER_NUM_LOCAL_SSD&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
    &lt;span class="na"&gt;DATAPROC_MASTER_BOOT_DISK_SIZE&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;32&lt;/span&gt;   
    &lt;span class="na"&gt;DATAPROC_WORKER_DISK_SIZE&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;32&lt;/span&gt;
    &lt;span class="na"&gt;DATAPROC_MASTER_BOOT_DISK_TYPE&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pd-balanced&lt;/span&gt;
    &lt;span class="na"&gt;DATAPROC_WORKER_BOOT_DISK_TYPE&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pd-balanced&lt;/span&gt;
    &lt;span class="na"&gt;DATAPROC_COMPONENTS&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;JUPYTER&lt;/span&gt;
    &lt;span class="na"&gt;DATAPROC_WORKFLOW_NAME&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;departments_etl&lt;/span&gt;
    &lt;span class="na"&gt;DATAPROC_WORKFLOW_INGESTION_STEP_NAME&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ingestion_countries_csv_to_delta&lt;/span&gt; 
    &lt;span class="na"&gt;JAR_LIB1 &lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;delta-core_2.12-2.3.0.jar&lt;/span&gt;
    &lt;span class="na"&gt;JAR_LIB2 &lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;delta-storage-2.3.0.jar&lt;/span&gt; 
    &lt;span class="na"&gt;APP_NAME &lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;countries_ingestion_csv_to_delta'&lt;/span&gt;
    &lt;span class="na"&gt;SUBJECT &lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;departments&lt;/span&gt;
    &lt;span class="na"&gt;STEP1 &lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;countries&lt;/span&gt;
    &lt;span class="na"&gt;STEP2 &lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;departments&lt;/span&gt;
    &lt;span class="na"&gt;STEP3 &lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;employees&lt;/span&gt;
    &lt;span class="na"&gt;STEP4 &lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;jobs&lt;/span&gt;
    &lt;span class="na"&gt;TIME_ZONE &lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;America/Sao_Paulo&lt;/span&gt;
    &lt;span class="na"&gt;SCHEDULE &lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;20&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;12&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;*"&lt;/span&gt;
    &lt;span class="na"&gt;SCHEDULE_NAME &lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;schedule_departments_etl&lt;/span&gt;
    &lt;span class="na"&gt;SERVICE_ACCOUNT_NAME &lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dataproc-account-workflow&lt;/span&gt;
    &lt;span class="na"&gt;CUSTOM_ROLE &lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;WorkflowCustomRole&lt;/span&gt;
    &lt;span class="na"&gt;STEP1_NAME &lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;step_countries&lt;/span&gt;
    &lt;span class="na"&gt;STEP2_NAME &lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;step_departments&lt;/span&gt;
    &lt;span class="na"&gt;STEP3_NAME &lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;step_employees&lt;/span&gt;
    &lt;span class="na"&gt;STEP4_NAME &lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;step_jobs&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Deploy Buckets Job&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;This job is responsible for creating three Google Cloud Storage buckets: one for the data lake (transient input files), one for Dataproc staging data, and one for the JAR libraries and PySpark scripts. It checks whether each bucket already exists before attempting to create it. It then uploads the required files into these buckets to prepare for later processing.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;deploy-buckets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-22.04&lt;/span&gt;
    &lt;span class="na"&gt;timeout-minutes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Checkout&lt;/span&gt;
      &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;

    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Authorize GCP&lt;/span&gt;
      &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;google-github-actions/auth@v2'&lt;/span&gt;
      &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;credentials_json&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  &lt;span class="s"&gt;${{ secrets.GCP_SA_KEY }}&lt;/span&gt;

    &lt;span class="c1"&gt;# Step to Create GCP Bucket &lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Create Google Cloud Storage - files&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
        &lt;span class="s"&gt;if ! gsutil ls -p ${{ secrets.PROJECT_ID }} gs://${{ secrets.GCP_BUCKET_BIGDATA_FILES }} &amp;amp;&amp;gt; /dev/null; \&lt;/span&gt;
          &lt;span class="s"&gt;then \&lt;/span&gt;
            &lt;span class="s"&gt;gcloud storage buckets create gs://${{ secrets.GCP_BUCKET_BIGDATA_FILES }} --default-storage-class=nearline --location=${{ env.REGION }}&lt;/span&gt;
          &lt;span class="s"&gt;else&lt;/span&gt;
            &lt;span class="s"&gt;echo "Cloud Storage : gs://${{ secrets.GCP_BUCKET_BIGDATA_FILES }}  already exists" ! &lt;/span&gt;
          &lt;span class="s"&gt;fi&lt;/span&gt;

    &lt;span class="c1"&gt;# Step to Create GCP Bucket &lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Create Google Cloud Storage - dataproc&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
        &lt;span class="s"&gt;if ! gsutil ls -p ${{ secrets.PROJECT_ID }} gs://${{ secrets.GCP_BUCKET_DATAPROC }} &amp;amp;&amp;gt; /dev/null; \&lt;/span&gt;
          &lt;span class="s"&gt;then \&lt;/span&gt;
            &lt;span class="s"&gt;gcloud storage buckets create gs://${{ secrets.GCP_BUCKET_DATAPROC }} --default-storage-class=nearline --location=${{ env.REGION }}&lt;/span&gt;
          &lt;span class="s"&gt;else&lt;/span&gt;
            &lt;span class="s"&gt;echo "Cloud Storage : gs://${{ secrets.GCP_BUCKET_DATAPROC }}  already exists" ! &lt;/span&gt;
          &lt;span class="s"&gt;fi&lt;/span&gt;

    &lt;span class="c1"&gt;# Step to Create GCP Bucket &lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Create Google Cloud Storage - datalake&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
        &lt;span class="s"&gt;if ! gsutil ls -p ${{ secrets.PROJECT_ID }} gs://${{ secrets.GCP_BUCKET_DATALAKE }} &amp;amp;&amp;gt; /dev/null; \&lt;/span&gt;
          &lt;span class="s"&gt;then \&lt;/span&gt;
            &lt;span class="s"&gt;gcloud storage buckets create gs://${{ secrets.GCP_BUCKET_DATALAKE }} --default-storage-class=nearline --location=${{ env.REGION }}&lt;/span&gt;
          &lt;span class="s"&gt;else&lt;/span&gt;
            &lt;span class="s"&gt;echo "Cloud Storage : gs://${{ secrets.GCP_BUCKET_DATALAKE }}  already exists" ! &lt;/span&gt;
          &lt;span class="s"&gt;fi&lt;/span&gt;

    &lt;span class="c1"&gt;# Step to Upload the file to GCP Bucket - transient files&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Upload transient files to Google Cloud Storage&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
        &lt;span class="s"&gt;TARGET=${{ env.TRANSIENT_DATALAKE_FILES }}&lt;/span&gt;
        &lt;span class="s"&gt;BUCKET_PATH=${{ secrets.GCP_BUCKET_DATALAKE }}/${{ env.BUCKET_DATALAKE_FOLDER }}    &lt;/span&gt;
        &lt;span class="s"&gt;gsutil cp -r $TARGET gs://${BUCKET_PATH}&lt;/span&gt;


    &lt;span class="c1"&gt;# Step to Upload the file to GCP Bucket - jar files&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Upload jar files to Google Cloud Storage&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
        &lt;span class="s"&gt;TARGET=${{ env.BUCKET_BIGDATA_JAR_FOLDER }}&lt;/span&gt;
        &lt;span class="s"&gt;BUCKET_PATH=${{ secrets.GCP_BUCKET_BIGDATA_FILES }}/${{ env.BUCKET_BIGDATA_JAR_FOLDER }}&lt;/span&gt;
        &lt;span class="s"&gt;gsutil cp -r $TARGET gs://${BUCKET_PATH}&lt;/span&gt;

    &lt;span class="c1"&gt;# Step to Upload the file to GCP Bucket - pyspark files&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Upload pyspark files to Google Cloud Storage&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
        &lt;span class="s"&gt;TARGET=${{ env.BUCKET_BIGDATA_PYSPARK_FOLDER }}&lt;/span&gt;
        &lt;span class="s"&gt;BUCKET_PATH=${{ secrets.GCP_BUCKET_BIGDATA_FILES }}/${{ env.BUCKET_BIGDATA_PYSPARK_FOLDER }}&lt;/span&gt;
        &lt;span class="s"&gt;gsutil cp -r $TARGET gs://${BUCKET_PATH}&lt;/span&gt;

    &lt;span class="c1"&gt;# Step to create dataproc cluster&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Upload pyspark files to Google Cloud Storage&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
        &lt;span class="s"&gt;TARGET=${{ env.BUCKET_BIGDATA_PYSPARK_FOLDER }}&lt;/span&gt;
        &lt;span class="s"&gt;BUCKET_PATH=${{ secrets.GCP_BUCKET_BIGDATA_FILES }}/${{ env.BUCKET_BIGDATA_PYSPARK_FOLDER }}&lt;/span&gt;
        &lt;span class="s"&gt;gsutil cp -r $TARGET gs://${BUCKET_PATH}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Explanation:&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;This job begins by checking out the code and authorizing Google Cloud credentials. It then checks for the existence of the three Cloud Storage buckets: one for the data lake, one for Dataproc staging, and one for the JAR files and PySpark scripts. Any bucket that does not exist is created with gcloud. Finally, it uploads the relevant files to the corresponding buckets using gsutil.&lt;/p&gt;
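
&lt;p&gt;For reference, the same idempotent check-then-create pattern can be reproduced by hand from Cloud Shell. The snippet below is a minimal sketch, assuming the project, region and datalake bucket name are exported as shell variables matching the workflow variables above.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Minimal sketch of the idempotent bucket-creation pattern used by this job.
# Assumes PROJECT_ID, REGION and GCP_BUCKET_DATALAKE are already exported.
if ! gsutil ls -p "$PROJECT_ID" "gs://$GCP_BUCKET_DATALAKE"; then
  gcloud storage buckets create "gs://$GCP_BUCKET_DATALAKE" \
    --default-storage-class=nearline --location="$REGION"
else
  echo "Bucket gs://$GCP_BUCKET_DATALAKE already exists"
fi

# Copy the transient CSV files into the datalake bucket
gsutil cp -r transient "gs://$GCP_BUCKET_DATALAKE"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
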

&lt;h2&gt;
  
  
  &lt;strong&gt;Deploy Dataproc Workflow Template Job&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;This job deploys a Dataproc workflow template in Google Cloud. It begins by checking if the workflow template already exists; if not, it creates one. It also sets up a managed Dataproc cluster with specific configurations such as the machine types and number of workers. Subsequently, it adds various steps (jobs) to the workflow template to outline the processing tasks for data ingestion.&lt;/p&gt;

&lt;p&gt;Code Snippet:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;  &lt;span class="na"&gt;deploy-dataproc-workflow-template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;needs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;deploy-buckets&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-22.04&lt;/span&gt;
    &lt;span class="na"&gt;timeout-minutes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;

    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Checkout&lt;/span&gt;
      &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;

    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Authorize GCP&lt;/span&gt;
      &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;google-github-actions/auth@v2'&lt;/span&gt;
      &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;credentials_json&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  &lt;span class="s"&gt;${{ secrets.GCP_SA_KEY }}&lt;/span&gt;

    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Create Dataproc Workflow&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;  
        &lt;span class="s"&gt;if ! gcloud dataproc workflow-templates list --region=${{ env.REGION}} | grep -i ${{ env.DATAPROC_WORKFLOW_NAME}} &amp;amp;&amp;gt; /dev/null; \&lt;/span&gt;
          &lt;span class="s"&gt;then \&lt;/span&gt;
            &lt;span class="s"&gt;gcloud dataproc workflow-templates create ${{ env.DATAPROC_WORKFLOW_NAME }} --region ${{ env.REGION }}&lt;/span&gt;
          &lt;span class="s"&gt;else&lt;/span&gt;
            &lt;span class="s"&gt;echo "Workflow Template : ${{ env.DATAPROC_WORKFLOW_NAME }} already exists" ! &lt;/span&gt;
          &lt;span class="s"&gt;fi&lt;/span&gt;

    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Create Dataproc Managed Cluster&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;&amp;gt;&lt;/span&gt;       
        &lt;span class="s"&gt;gcloud dataproc workflow-templates set-managed-cluster ${{ env.DATAPROC_WORKFLOW_NAME }} &lt;/span&gt;
        &lt;span class="s"&gt;--region ${{ env.REGION }} &lt;/span&gt;
        &lt;span class="s"&gt;--zone ${{ env.ZONE }} &lt;/span&gt;
        &lt;span class="s"&gt;--image-version ${{ env.DATAPROC_IMAGE_VERSION }} &lt;/span&gt;
        &lt;span class="s"&gt;--master-machine-type=${{ env.DATAPROC_MASTER_TYPE }} &lt;/span&gt;
        &lt;span class="s"&gt;--master-boot-disk-type ${{ env.DATAPROC_MASTER_BOOT_DISK_TYPE }} &lt;/span&gt;
        &lt;span class="s"&gt;--master-boot-disk-size ${{ env.DATAPROC_MASTER_BOOT_DISK_SIZE }} &lt;/span&gt;
        &lt;span class="s"&gt;--worker-machine-type=${{ env.DATAPROC_WORKER_TYPE }} &lt;/span&gt;
        &lt;span class="s"&gt;--worker-boot-disk-type ${{ env.DATAPROC_WORKER_BOOT_DISK_TYPE }}&lt;/span&gt;
        &lt;span class="s"&gt;--worker-boot-disk-size ${{ env.DATAPROC_WORKER_DISK_SIZE }} &lt;/span&gt;
        &lt;span class="s"&gt;--num-workers=${{ env.DATAPROC_NUM_WORKERS }} &lt;/span&gt;
        &lt;span class="s"&gt;--cluster-name=${{ env.DATAPROC_CLUSTER_NAME }} &lt;/span&gt;
        &lt;span class="s"&gt;--optional-components ${{ env.DATAPROC_COMPONENTS }} &lt;/span&gt;
        &lt;span class="s"&gt;--service-account=${{ env.GCP_SERVICE_ACCOUNT }}&lt;/span&gt;

    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Add Job Ingestion countries to Workflow&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
        &lt;span class="s"&gt;if gcloud dataproc workflow-templates list --region=${{ env.REGION}} | grep -i ${{ env.DATAPROC_WORKFLOW_NAME}} &amp;amp;&amp;gt; /dev/null; \&lt;/span&gt;
          &lt;span class="s"&gt;then \&lt;/span&gt;

            &lt;span class="s"&gt;if gcloud dataproc workflow-templates describe ${{ env.DATAPROC_WORKFLOW_NAME}} --region=${{ env.REGION}} | grep -i ${{ env.STEP1_NAME  }} &amp;amp;&amp;gt; /dev/null; \&lt;/span&gt;
            &lt;span class="s"&gt;then \&lt;/span&gt;
              &lt;span class="s"&gt;echo "Workflow Template : ${{ env.DATAPROC_WORKFLOW_NAME }} already has step : ${{ env.STEP1_NAME  }} " ! &lt;/span&gt;
            &lt;span class="s"&gt;else&lt;/span&gt;
              &lt;span class="s"&gt;PYSPARK_SCRIPT_PATH=${{ secrets.GCP_BUCKET_BIGDATA_FILES }}/${{ env.BUCKET_BIGDATA_PYSPARK_FOLDER }}/${{ env.PYSPARK_INGESTION_SCRIPT }}&lt;/span&gt;
              &lt;span class="s"&gt;JARS_PATH=gs://${{ secrets.GCP_BUCKET_BIGDATA_FILES }}/${{ env.BUCKET_BIGDATA_JAR_FOLDER }}/${{ env.JAR_LIB1 }}&lt;/span&gt;
              &lt;span class="s"&gt;JARS_PATH=${JARS_PATH},gs://${{ secrets.GCP_BUCKET_BIGDATA_FILES }}/${{ env.BUCKET_BIGDATA_JAR_FOLDER }}/${{ env.JAR_LIB2 }}&lt;/span&gt;
              &lt;span class="s"&gt;TRANSIENT=${{ secrets.GCP_BUCKET_DATALAKE }}/${{ env.BUCKET_DATALAKE_FOLDER }}/${{ env.SUBJECT }}/${{ env.STEP1 }}&lt;/span&gt;
              &lt;span class="s"&gt;BRONZE=${{ secrets.GCP_BUCKET_DATALAKE }}/${{ env.BRONZE_DATALAKE_FILES }}/${{ env.SUBJECT }}/${{ env.STEP1 }}&lt;/span&gt;

              &lt;span class="s"&gt;gcloud dataproc workflow-templates add-job pyspark gs://${PYSPARK_SCRIPT_PATH} \&lt;/span&gt;
              &lt;span class="s"&gt;--workflow-template ${{ env.DATAPROC_WORKFLOW_NAME }}  \&lt;/span&gt;
              &lt;span class="s"&gt;--step-id ${{env.STEP1_NAME }} \&lt;/span&gt;
              &lt;span class="s"&gt;--region ${{ env.REGION }} \&lt;/span&gt;
              &lt;span class="s"&gt;--jars ${JARS_PATH} \&lt;/span&gt;
              &lt;span class="s"&gt;-- --app_name=${{ env.APP_NAME }}${{ env.STEP1 }} --bucket_transient=gs://${TRANSIENT} \&lt;/span&gt;
              &lt;span class="s"&gt;--bucket_bronze=gs://${BRONZE}&lt;/span&gt;
            &lt;span class="s"&gt;fi&lt;/span&gt;
        &lt;span class="s"&gt;else&lt;/span&gt;
          &lt;span class="s"&gt;echo "Workflow Template : ${{ env.DATAPROC_WORKFLOW_NAME}} not exists" ! &lt;/span&gt;
        &lt;span class="s"&gt;fi        &lt;/span&gt;

    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Add Job Ingestion departments to Workflow&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
        &lt;span class="s"&gt;if gcloud dataproc workflow-templates list --region=${{ env.REGION}} | grep -i ${{ env.DATAPROC_WORKFLOW_NAME}} &amp;amp;&amp;gt; /dev/null; \&lt;/span&gt;
          &lt;span class="s"&gt;then \&lt;/span&gt;
            &lt;span class="s"&gt;if gcloud dataproc workflow-templates describe ${{ env.DATAPROC_WORKFLOW_NAME}} --region=${{ env.REGION}} | grep -i ${{ env.STEP2_NAME }} &amp;amp;&amp;gt; /dev/null; \&lt;/span&gt;
            &lt;span class="s"&gt;then \&lt;/span&gt;
              &lt;span class="s"&gt;echo "Workflow Template : ${{ env.DATAPROC_WORKFLOW_NAME }} already has step : ${{ env.STEP2_NAME  }} " ! &lt;/span&gt;
            &lt;span class="s"&gt;else&lt;/span&gt;
              &lt;span class="s"&gt;PYSPARK_SCRIPT_PATH=${{ secrets.GCP_BUCKET_BIGDATA_FILES }}/${{ env.BUCKET_BIGDATA_PYSPARK_FOLDER }}/${{ env.PYSPARK_INGESTION_SCRIPT }}&lt;/span&gt;
              &lt;span class="s"&gt;JARS_PATH=gs://${{ secrets.GCP_BUCKET_BIGDATA_FILES }}/${{ env.BUCKET_BIGDATA_JAR_FOLDER }}/${{ env.JAR_LIB1 }}&lt;/span&gt;
              &lt;span class="s"&gt;JARS_PATH=${JARS_PATH},gs://${{ secrets.GCP_BUCKET_BIGDATA_FILES }}/${{ env.BUCKET_BIGDATA_JAR_FOLDER }}/${{ env.JAR_LIB2 }}&lt;/span&gt;
              &lt;span class="s"&gt;TRANSIENT=${{ secrets.GCP_BUCKET_DATALAKE }}/${{ env.BUCKET_DATALAKE_FOLDER }}/${{ env.SUBJECT }}/${{ env.STEP2 }}&lt;/span&gt;
              &lt;span class="s"&gt;BRONZE=${{ secrets.GCP_BUCKET_DATALAKE }}/${{ env.BRONZE_DATALAKE_FILES }}/${{ env.SUBJECT }}/${{ env.STEP2 }}&lt;/span&gt;

              &lt;span class="s"&gt;gcloud dataproc workflow-templates add-job pyspark gs://${PYSPARK_SCRIPT_PATH} \&lt;/span&gt;
              &lt;span class="s"&gt;--workflow-template ${{ env.DATAPROC_WORKFLOW_NAME }}  \&lt;/span&gt;
              &lt;span class="s"&gt;--step-id ${{ env.STEP2_NAME }} \&lt;/span&gt;
              &lt;span class="s"&gt;--start-after ${{ env.STEP1_NAME }} \&lt;/span&gt;
              &lt;span class="s"&gt;--region ${{ env.REGION }} \&lt;/span&gt;
              &lt;span class="s"&gt;--jars ${JARS_PATH} \&lt;/span&gt;
              &lt;span class="s"&gt;-- --app_name=${{ env.APP_NAME }}${{ env.STEP2 }} --bucket_transient=gs://${TRANSIENT} \&lt;/span&gt;
              &lt;span class="s"&gt;--bucket_bronze=gs://${BRONZE}&lt;/span&gt;
            &lt;span class="s"&gt;fi&lt;/span&gt;
        &lt;span class="s"&gt;else&lt;/span&gt;
          &lt;span class="s"&gt;echo "Workflow Template : ${{ env.DATAPROC_WORKFLOW_NAME}} not exists" ! &lt;/span&gt;
        &lt;span class="s"&gt;fi&lt;/span&gt;


    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Add Job Ingestion employees to Workflow&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
        &lt;span class="s"&gt;if gcloud dataproc workflow-templates list --region=${{ env.REGION}} | grep -i ${{ env.DATAPROC_WORKFLOW_NAME}} &amp;amp;&amp;gt; /dev/null; \&lt;/span&gt;
          &lt;span class="s"&gt;then \&lt;/span&gt;
            &lt;span class="s"&gt;if gcloud dataproc workflow-templates describe ${{ env.DATAPROC_WORKFLOW_NAME}} --region=${{ env.REGION}} | grep -i ${{ env.STEP3_NAME }} &amp;amp;&amp;gt; /dev/null; \&lt;/span&gt;
            &lt;span class="s"&gt;then \&lt;/span&gt;
              &lt;span class="s"&gt;echo "Workflow Template : ${{ env.DATAPROC_WORKFLOW_NAME }} already has step : ${{ env.STEP3_NAME }} " ! &lt;/span&gt;
            &lt;span class="s"&gt;else&lt;/span&gt;
              &lt;span class="s"&gt;PYSPARK_SCRIPT_PATH=${{ secrets.GCP_BUCKET_BIGDATA_FILES }}/${{ env.BUCKET_BIGDATA_PYSPARK_FOLDER }}/${{ env.PYSPARK_INGESTION_SCRIPT }}&lt;/span&gt;
              &lt;span class="s"&gt;JARS_PATH=gs://${{ secrets.GCP_BUCKET_BIGDATA_FILES }}/${{ env.BUCKET_BIGDATA_JAR_FOLDER }}/${{ env.JAR_LIB1 }}&lt;/span&gt;
              &lt;span class="s"&gt;JARS_PATH=${JARS_PATH},gs://${{ secrets.GCP_BUCKET_BIGDATA_FILES }}/${{ env.BUCKET_BIGDATA_JAR_FOLDER }}/${{ env.JAR_LIB2 }}&lt;/span&gt;
              &lt;span class="s"&gt;TRANSIENT=${{ secrets.GCP_BUCKET_DATALAKE }}/${{ env.BUCKET_DATALAKE_FOLDER }}/${{ env.SUBJECT }}/${{ env.STEP2 }}&lt;/span&gt;
              &lt;span class="s"&gt;PYSPARK_SCRIPT_PATH=${{ secrets.GCP_BUCKET_BIGDATA_FILES }}/${{ env.BUCKET_BIGDATA_PYSPARK_FOLDER }}/${{ env.PYSPARK_INGESTION_SCRIPT }}&lt;/span&gt;
              &lt;span class="s"&gt;JARS_PATH=gs://${{ secrets.GCP_BUCKET_BIGDATA_FILES }}/${{ env.BUCKET_BIGDATA_JAR_FOLDER }}/${{ env.JAR_LIB1 }}&lt;/span&gt;
              &lt;span class="s"&gt;JARS_PATH=${JARS_PATH},gs://${{ secrets.GCP_BUCKET_BIGDATA_FILES }}/${{ env.BUCKET_BIGDATA_JAR_FOLDER }}/${{ env.JAR_LIB2 }}&lt;/span&gt;
              &lt;span class="s"&gt;TRANSIENT=${{ secrets.GCP_BUCKET_DATALAKE }}/${{ env.BUCKET_DATALAKE_FOLDER }}/${{ env.SUBJECT }}/${{ env.STEP3 }}&lt;/span&gt;
              &lt;span class="s"&gt;BRONZE=${{ secrets.GCP_BUCKET_DATALAKE }}/${{ env.BRONZE_DATALAKE_FILES }}/${{ env.SUBJECT }}/${{ env.STEP3 }}&lt;/span&gt;

              &lt;span class="s"&gt;gcloud dataproc workflow-templates add-job pyspark gs://${PYSPARK_SCRIPT_PATH} \&lt;/span&gt;
              &lt;span class="s"&gt;--workflow-template ${{ env.DATAPROC_WORKFLOW_NAME }}  \&lt;/span&gt;
              &lt;span class="s"&gt;--step-id ${{ env.STEP3_NAME }} \&lt;/span&gt;
              &lt;span class="s"&gt;--start-after ${{ env.STEP1_NAME }},${{ env.STEP2_NAME }} \&lt;/span&gt;
              &lt;span class="s"&gt;--region ${{ env.REGION }} \&lt;/span&gt;
              &lt;span class="s"&gt;--jars ${JARS_PATH} \&lt;/span&gt;
              &lt;span class="s"&gt;-- --app_name=${{ env.APP_NAME }}${{ env.STEP3 }} --bucket_transient=gs://${TRANSIENT} \&lt;/span&gt;
              &lt;span class="s"&gt;--bucket_bronze=gs://${BRONZE}&lt;/span&gt;
            &lt;span class="s"&gt;fi&lt;/span&gt;
        &lt;span class="s"&gt;else&lt;/span&gt;
          &lt;span class="s"&gt;echo "Workflow Template : ${{ env.DATAPROC_WORKFLOW_NAME}} not exists" ! &lt;/span&gt;
        &lt;span class="s"&gt;fi&lt;/span&gt;

    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Add Job Ingestion Jobs to Workflow&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
        &lt;span class="s"&gt;if gcloud dataproc workflow-templates list --region=${{ env.REGION}} | grep -i ${{ env.DATAPROC_WORKFLOW_NAME}} &amp;amp;&amp;gt; /dev/null; \&lt;/span&gt;
          &lt;span class="s"&gt;then \&lt;/span&gt;
            &lt;span class="s"&gt;if gcloud dataproc workflow-templates describe ${{ env.DATAPROC_WORKFLOW_NAME}} --region=${{ env.REGION}} | grep -i ${{ env.STEP4_NAME }} &amp;amp;&amp;gt; /dev/null; \&lt;/span&gt;
            &lt;span class="s"&gt;then \&lt;/span&gt;
              &lt;span class="s"&gt;echo "Workflow Template : ${{ env.DATAPROC_WORKFLOW_NAME }} already has step : ${{ env.STEP4_NAME }} " ! &lt;/span&gt;
            &lt;span class="s"&gt;else&lt;/span&gt;
              &lt;span class="s"&gt;PYSPARK_SCRIPT_PATH=${{ secrets.GCP_BUCKET_BIGDATA_FILES }}/${{ env.BUCKET_BIGDATA_PYSPARK_FOLDER }}/${{ env.PYSPARK_INGESTION_SCRIPT }}&lt;/span&gt;
              &lt;span class="s"&gt;JARS_PATH=gs://${{ secrets.GCP_BUCKET_BIGDATA_FILES }}/${{ env.BUCKET_BIGDATA_JAR_FOLDER }}/${{ env.JAR_LIB1 }}&lt;/span&gt;
              &lt;span class="s"&gt;JARS_PATH=${JARS_PATH},gs://${{ secrets.GCP_BUCKET_BIGDATA_FILES }}/${{ env.BUCKET_BIGDATA_JAR_FOLDER }}/${{ env.JAR_LIB2 }}&lt;/span&gt;
              &lt;span class="s"&gt;TRANSIENT=${{ secrets.GCP_BUCKET_DATALAKE }}/${{ env.BUCKET_DATALAKE_FOLDER }}/${{ env.SUBJECT }}/${{ env.STEP2 }}&lt;/span&gt;
              &lt;span class="s"&gt;PYSPARK_SCRIPT_PATH=${{ secrets.GCP_BUCKET_BIGDATA_FILES }}/${{ env.BUCKET_BIGDATA_PYSPARK_FOLDER }}/${{ env.PYSPARK_INGESTION_SCRIPT }}&lt;/span&gt;
              &lt;span class="s"&gt;JARS_PATH=gs://${{ secrets.GCP_BUCKET_BIGDATA_FILES }}/${{ env.BUCKET_BIGDATA_JAR_FOLDER }}/${{ env.JAR_LIB1 }}&lt;/span&gt;
              &lt;span class="s"&gt;JARS_PATH=${JARS_PATH},gs://${{ secrets.GCP_BUCKET_BIGDATA_FILES }}/${{ env.BUCKET_BIGDATA_JAR_FOLDER }}/${{ env.JAR_LIB2 }}&lt;/span&gt;
              &lt;span class="s"&gt;TRANSIENT=${{ secrets.GCP_BUCKET_DATALAKE }}/${{ env.BUCKET_DATALAKE_FOLDER }}/${{ env.SUBJECT }}/${{ env.STEP4 }}&lt;/span&gt;
              &lt;span class="s"&gt;BRONZE=${{ secrets.GCP_BUCKET_DATALAKE }}/${{ env.BRONZE_DATALAKE_FILES }}/${{ env.SUBJECT }}/${{ env.STEP4 }}&lt;/span&gt;

              &lt;span class="s"&gt;gcloud dataproc workflow-templates add-job pyspark gs://${PYSPARK_SCRIPT_PATH} \&lt;/span&gt;
              &lt;span class="s"&gt;--workflow-template ${{ env.DATAPROC_WORKFLOW_NAME }}  \&lt;/span&gt;
              &lt;span class="s"&gt;--step-id ${{ env.STEP4_NAME }} \&lt;/span&gt;
              &lt;span class="s"&gt;--start-after ${{ env.STEP1_NAME }},${{ env.STEP2_NAME }},${{ env.STEP3_NAME }} \&lt;/span&gt;
              &lt;span class="s"&gt;--region ${{ env.REGION }} \&lt;/span&gt;
              &lt;span class="s"&gt;--jars ${JARS_PATH} \&lt;/span&gt;
              &lt;span class="s"&gt;-- --app_name=${{ env.APP_NAME }}${{ env.STEP4 }} --bucket_transient=gs://${TRANSIENT} \&lt;/span&gt;
              &lt;span class="s"&gt;--bucket_bronze=gs://${BRONZE}&lt;/span&gt;
            &lt;span class="s"&gt;fi&lt;/span&gt;
        &lt;span class="s"&gt;else&lt;/span&gt;
          &lt;span class="s"&gt;echo "Workflow Template : ${{ env.DATAPROC_WORKFLOW_NAME}} not exists" ! &lt;/span&gt;
        &lt;span class="s"&gt;fi&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Explanation&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;This job follows a systematic approach to deploying a Dataproc workflow template. It first checks whether the workflow template exists and creates it if it does not. Next, a managed Dataproc cluster is attached to the template with the specified properties (machine types, disk sizes, number of workers). The job then adds each data ingestion task to the template as a PySpark job, passing the transient and bronze bucket paths as script arguments; steps after the first declare their predecessors with --start-after, which defines the execution order. The remaining Add Job steps are structured in the same way, each handling a different dataset within the workflow.&lt;/p&gt;
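
&lt;p&gt;Stripped of the GitHub Actions wrapper, the job boils down to a handful of gcloud calls. The sketch below condenses the pattern for a single step, using values from the workflow env vars; YOUR_BIGDATA_BUCKET and YOUR_DATALAKE_BUCKET are placeholders for the bucket names stored as secrets, and the app_name value is illustrative.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Condensed sketch of the gcloud calls behind this job, outside GitHub Actions.
# YOUR_BIGDATA_BUCKET / YOUR_DATALAKE_BUCKET are placeholders for the secret bucket names.
gcloud dataproc workflow-templates create departments_etl --region us-east1

gcloud dataproc workflow-templates set-managed-cluster departments_etl \
  --region us-east1 --zone us-east1-b --image-version 2.1-debian11 \
  --master-machine-type n2-standard-2 --worker-machine-type n2-standard-2 \
  --num-workers 2 --cluster-name dataproc-bigdata-multi-node-cluster

# Each ingestion task is a PySpark job; --start-after wires the dependency chain.
gcloud dataproc workflow-templates add-job pyspark \
  gs://YOUR_BIGDATA_BUCKET/scripts/ingestion_csv_to_delta.py \
  --workflow-template departments_etl \
  --step-id step_departments \
  --start-after step_countries \
  --region us-east1 \
  --jars gs://YOUR_BIGDATA_BUCKET/jars/delta-core_2.12-2.3.0.jar,gs://YOUR_BIGDATA_BUCKET/jars/delta-storage-2.3.0.jar \
  -- --app_name=departments_ingestion \
  --bucket_transient=gs://YOUR_DATALAKE_BUCKET/transient/departments/departments \
  --bucket_bronze=gs://YOUR_DATALAKE_BUCKET/bronze/departments/departments
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
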

&lt;h2&gt;
  
  
  &lt;strong&gt;Deploy Cloud Schedule Job&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;This job sets up a scheduling mechanism using Google Cloud Scheduler. It creates a service account specifically for the scheduled job, defines a custom role with the required permissions, and binds the custom role to the service account. Finally, it creates the Cloud Scheduler job that triggers the execution of the workflow at defined intervals.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;    &lt;span class="na"&gt;deploy-cloud-schedule&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;needs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;deploy-buckets&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;deploy-dataproc-workflow-template&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-22.04&lt;/span&gt;
    &lt;span class="na"&gt;timeout-minutes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;

    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Checkout&lt;/span&gt;
      &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;

    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Authorize GCP&lt;/span&gt;
      &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;google-github-actions/auth@v2'&lt;/span&gt;
      &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;credentials_json&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  &lt;span class="s"&gt;${{ secrets.GCP_SA_KEY }}&lt;/span&gt;

    &lt;span class="c1"&gt;# Step to Authenticate with GCP&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Set up Cloud SDK&lt;/span&gt;
      &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;google-github-actions/setup-gcloud@v2&lt;/span&gt;
      &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;&amp;gt;=&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;363.0.0'&lt;/span&gt;
        &lt;span class="na"&gt;project_id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.PROJECT_ID }}&lt;/span&gt;

    &lt;span class="c1"&gt;# Step to Configure Docker to use the gcloud command-line tool as a credential helper&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Configure Docker&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
        &lt;span class="s"&gt;gcloud auth configure-docker&lt;/span&gt;


    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Create service account&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;

        &lt;span class="s"&gt;if ! gcloud iam service-accounts list | grep -i ${{ env.SERVICE_ACCOUNT_NAME}} &amp;amp;&amp;gt; /dev/null; \&lt;/span&gt;
          &lt;span class="s"&gt;then \&lt;/span&gt;
            &lt;span class="s"&gt;gcloud iam service-accounts create ${{ env.SERVICE_ACCOUNT_NAME }} \&lt;/span&gt;
            &lt;span class="s"&gt;--display-name="scheduler dataproc workflow service account"&lt;/span&gt;
          &lt;span class="s"&gt;fi&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Create Custom role for service account&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
        &lt;span class="s"&gt;if ! gcloud iam roles list --project ${{ secrets.PROJECT_ID }} | grep -i ${{ env.CUSTOM_ROLE }} &amp;amp;&amp;gt; /dev/null; \&lt;/span&gt;
          &lt;span class="s"&gt;then \&lt;/span&gt;
            &lt;span class="s"&gt;gcloud iam roles create ${{ env.CUSTOM_ROLE }} --project ${{ secrets.PROJECT_ID }} \&lt;/span&gt;
            &lt;span class="s"&gt;--title "Dataproc Workflow template scheduler" --description "Dataproc Workflow template scheduler" \&lt;/span&gt;
            &lt;span class="s"&gt;--permissions "dataproc.workflowTemplates.instantiate,iam.serviceAccounts.actAs" --stage ALPHA&lt;/span&gt;
          &lt;span class="s"&gt;fi    &lt;/span&gt;

    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Add the custom role for service account&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
        &lt;span class="s"&gt;gcloud projects add-iam-policy-binding ${{secrets.PROJECT_ID}} \&lt;/span&gt;
        &lt;span class="s"&gt;--member=serviceAccount:${{env.SERVICE_ACCOUNT_NAME}}@${{secrets.PROJECT_ID}}.iam.gserviceaccount.com \&lt;/span&gt;
        &lt;span class="s"&gt;--role=projects/${{secrets.PROJECT_ID}}/roles/${{env.CUSTOM_ROLE}}&lt;/span&gt;

    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Create cloud schedule for workflow execution&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
        &lt;span class="s"&gt;if ! gcloud scheduler jobs list --location ${{env.REGION}} | grep -i ${{env.SCHEDULE_NAME}} &amp;amp;&amp;gt; /dev/null; \&lt;/span&gt;
          &lt;span class="s"&gt;then \&lt;/span&gt;
            &lt;span class="s"&gt;gcloud scheduler jobs create http ${{env.SCHEDULE_NAME}} \&lt;/span&gt;
            &lt;span class="s"&gt;--schedule="30 12 * * *" \&lt;/span&gt;
            &lt;span class="s"&gt;--description="Dataproc workflow " \&lt;/span&gt;
            &lt;span class="s"&gt;--location=${{env.REGION}} \&lt;/span&gt;
            &lt;span class="s"&gt;--uri=https://dataproc.googleapis.com/v1/projects/${{secrets.PROJECT_ID}}/regions/${{env.REGION}}/workflowTemplates/${{env.DATAPROC_WORKFLOW_NAME}}:instantiate?alt=json \&lt;/span&gt;
            &lt;span class="s"&gt;--time-zone=${{env.TIME_ZONE}} \&lt;/span&gt;
            &lt;span class="s"&gt;--oauth-service-account-email=${{env.SERVICE_ACCOUNT_NAME}}@${{secrets.PROJECT_ID}}.iam.gserviceaccount.com&lt;/span&gt;
          &lt;span class="s"&gt;fi&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Explanation&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;In this job, a service account is created specifically for handling the scheduled workflow execution. It also defines a custom role that grants the necessary permissions for the service account to instantiate the workflow template. This custom role is then associated with the service account to ensure it has the required permissions. Finally, the job creates a cloud schedule that triggers the workflow execution at predetermined times, ensuring automated execution of the data processing workflow.&lt;/p&gt;
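
&lt;p&gt;Outside of the pipeline, the same setup can be reproduced with plain gcloud commands. The sketch below mirrors the job above, assuming PROJECT_ID is exported and reusing the service account, role and schedule names from the workflow env vars.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Condensed sketch of the scheduling setup, mirroring the job above.
# Assumes PROJECT_ID is exported; the other names come from the workflow env vars.
gcloud iam service-accounts create dataproc-account-workflow \
  --display-name="scheduler dataproc workflow service account"

gcloud iam roles create WorkflowCustomRole --project "$PROJECT_ID" \
  --title "Dataproc Workflow template scheduler" \
  --permissions "dataproc.workflowTemplates.instantiate,iam.serviceAccounts.actAs" \
  --stage ALPHA

gcloud projects add-iam-policy-binding "$PROJECT_ID" \
  --member="serviceAccount:dataproc-account-workflow@$PROJECT_ID.iam.gserviceaccount.com" \
  --role="projects/$PROJECT_ID/roles/WorkflowCustomRole"

# The scheduler job POSTs to the workflowTemplates:instantiate REST endpoint.
gcloud scheduler jobs create http schedule_departments_etl \
  --schedule="20 12 * * *" --time-zone="America/Sao_Paulo" --location=us-east1 \
  --uri="https://dataproc.googleapis.com/v1/projects/$PROJECT_ID/regions/us-east1/workflowTemplates/departments_etl:instantiate?alt=json" \
  --oauth-service-account-email="dataproc-account-workflow@$PROJECT_ID.iam.gserviceaccount.com"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
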

&lt;h2&gt;
  
  
  &lt;strong&gt;Resources created after deploy process&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Dataproc Workflow Template&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;After deploying the project, you can access the Dataproc service to view the Workflow template. In the Workflow tab, you can explore various options, including monitoring workflow executions and analyzing their details.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl8c09t53cv125vjx89gd.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl8c09t53cv125vjx89gd.jpeg" alt="Dataproc Workflow Template 1" width="800" height="309"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When you select the created workflow, you can see the cluster used for processing and the steps that comprise the workflow, including any dependencies between the steps. This visibility allows you to track the workflow's operational flow.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmsalytbi3oc92ofwj9z0.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmsalytbi3oc92ofwj9z0.jpeg" alt="Dataproc Workflow Template 2" width="800" height="709"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Additionally, within the Dataproc service, you can monitor the execution status of each job. It provides details about each execution, including the performance of individual steps within the workflow template, as illustrated below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr1aps48bpuu9hpsztbtb.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr1aps48bpuu9hpsztbtb.jpeg" alt="Dataproc Workflow Template 3" width="800" height="297"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Cloud Scheduler&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;By accessing the Cloud Scheduler service, you'll find the scheduled job created during the deployment process. The interface displays the last run status, the defined schedule for execution, and additional details about the target URL and other parameters.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7k92sfeqnxwynlr200lj.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7k92sfeqnxwynlr200lj.jpeg" alt="Cloud Scheduler" width="800" height="115"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Cloud Storage&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;As part of the deployment process, several Cloud Storage buckets are created: one bucket for storing the data lake data, another for the Dataproc cluster, and a third for the PySpark scripts and libraries used in the project. The Dataproc service itself also creates buckets to manage temporary data generated during processing.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuulc45ddeqy0ucggsclt.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuulc45ddeqy0ucggsclt.jpeg" alt="Cloud Storage" width="800" height="394"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After the data processing is complete, a new bronze directory is created in the data lake bucket to store the ingested data. The transient directory, created during the deployment phase, is the location where the input data was copied from the GitHub repository to Cloud Storage. In a production environment, another application would likely handle the ingestion of data into this transient layer.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwpm81m10u0gjt401dftw.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwpm81m10u0gjt401dftw.jpeg" alt="Cloud Storage Files" width="800" height="437"&gt;&lt;/a&gt;&lt;/p&gt;
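
&lt;p&gt;To confirm the ingestion output without opening the console, the bucket layout can also be listed from the command line; a small sketch, assuming the datalake bucket name is exported as before.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# List the transient (input) and bronze (output) layers of the datalake bucket
gsutil ls -r gs://$GCP_BUCKET_DATALAKE/transient
gsutil ls -r gs://$GCP_BUCKET_DATALAKE/bronze
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
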

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;Data pipelines are crucial components in the landscape of data processing. While there are robust and feature-rich tools available, such as Azure Data Factory and Apache Airflow, simpler solutions can be valuable in certain scenarios. The decision on the most appropriate tool ultimately rests with the data and architecture teams, who must assess the specific needs and context to select the best solution for the moment.&lt;/p&gt;

&lt;h3&gt;
  
  
  Links and References
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://github.com/jader-lima/gcp-dataproc-workflow" rel="noopener noreferrer"&gt;Github Repo&lt;/a&gt;&lt;br&gt;
&lt;a href="https://cloud.google.com/dataproc/docs/concepts/workflows/workflow-schedule-solutions" rel="noopener noreferrer"&gt;Dataproc workflow documentation&lt;/a&gt;&lt;/p&gt;

</description>
      <category>pyspark</category>
      <category>gcp</category>
      <category>dataproc</category>
      <category>pipelines</category>
    </item>
    <item>
      <title>Running pyspark jobs on Google Cloud Dataproc</title>
      <dc:creator>Jader Lima</dc:creator>
      <pubDate>Mon, 05 Aug 2024 22:50:14 +0000</pubDate>
      <link>https://dev.to/jader_lima_b72a63be5bbddc/running-pyspark-jobs-on-google-cloud-dataproc-jd1</link>
      <guid>https://dev.to/jader_lima_b72a63be5bbddc/running-pyspark-jobs-on-google-cloud-dataproc-jd1</guid>
      <description>&lt;p&gt;This blog focuses on data processing and its tools and techniques, with a particular emphasis on big data tools in cloud environments like GCP, AWS, and Azure. Featuring hands-on material, we will demonstrate various ways to use different tools and techniques. Feel free to use the material as you wish and reach out with any suggestions.&lt;/p&gt;

&lt;h2&gt;
  
  
  About This Post
&lt;/h2&gt;

&lt;p&gt;In this post, we will process data using batch processing, in which jobs are started manually or by a scheduling tool. A batch job completes its task at the end of processing and then waits until it is triggered again, either manually or automatically. The data will be stored in cloud storage following the medallion architecture, also known as the multi-hop architecture. This architecture typically consists of three layers: bronze, silver, and gold, and is widely used for building data lakes and data lakehouses.&lt;/p&gt;

&lt;p&gt;In the medallion architecture:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Bronze Layer&lt;/strong&gt;: This is the raw data layer, where data is stored in a format suitable for big data processing, such as Parquet, Delta, or ORC.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Silver Layer&lt;/strong&gt;: In this layer, data is cleaned, organized, and deduplicated but without significant transformations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gold Layer&lt;/strong&gt;: This is the layer designed for consumption, where significant business transformations and aggregations take place.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Raw, Trusted, and Refined Layers&lt;/strong&gt;: These terms are equivalent to bronze, silver, and gold, respectively.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The Dataproc service will handle reading, processing, cleaning, and writing the data in the different layers of the data lake.&lt;/p&gt;

&lt;h2&gt;
  
  
  Apache Spark
&lt;/h2&gt;

&lt;p&gt;Apache Spark is an open-source framework for large-scale data processing. It provides a unified programming interface for cluster computing and enables efficient parallel data processing. Spark supports multiple programming languages, including Python, Scala, and Java, and is widely used for data analysis, machine learning, and real-time data processing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Google Dataproc
&lt;/h2&gt;

&lt;p&gt;Google Dataproc is a managed service from Google Cloud that simplifies the processing of large datasets using frameworks like Apache Spark and Apache Hadoop. It streamlines the creation, management, and scaling of data processing clusters, allowing you to focus on data analysis rather than infrastructure.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cloud Storage
&lt;/h2&gt;

&lt;p&gt;Google Cloud Storage is a highly scalable and durable object storage service from Google Cloud. It allows you to store and access large volumes of unstructured data, such as media files, backups, and large datasets. Cloud Storage offers various storage options to meet different performance and cost needs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture Diagram
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhynzqq67862fmg08c14i.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhynzqq67862fmg08c14i.jpg" alt="Architecture Diagram" width="485" height="340"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Environment Variable Configuration
&lt;/h2&gt;

&lt;p&gt;The commands below can be executed using the Google Cloud Shell or by configuring the GCP CLI on a personal notebook.&lt;/p&gt;

&lt;p&gt;To list existing projects, execute the command below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud projects list
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To list available regions, execute the command below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud compute regions list
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To list available zones, execute the command below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud compute zones list
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The variables below are used to create the necessary storage buckets. Three buckets will be created: one to serve as the data lake, another to store the PySpark scripts and auxiliary JARs, and the last to store Dataproc cluster information. For a simple test of the Dataproc service, the configurations below are sufficient and work with a GCP free trial account.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;######## STORAGE NAMES AND GENERAL PARAMETERS ##########&lt;/span&gt;
&lt;span class="nv"&gt;PROJECT_ID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&amp;lt;YOUR_GCP_PROJECT_ID&amp;gt;
&lt;span class="nv"&gt;REGION&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&amp;lt;YOUR_GCP_REGION&amp;gt;
&lt;span class="nv"&gt;ZONE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&amp;lt;YOUR_GCP_ZONE&amp;gt;
&lt;span class="nv"&gt;GCP_BUCKET_DATALAKE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&amp;lt;YOUR_DATALAKE_STORAGE_NAME&amp;gt;
&lt;span class="nv"&gt;GCP_BUCKET_BIGDATA_FILES&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&amp;lt;YOUR_STORAGE_FILE_NAME&amp;gt;
&lt;span class="nv"&gt;GCP_BUCKET_DATAPROC&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&amp;lt;YOUR_STORAGE_DATAPROC_NAME&amp;gt;

&lt;span class="c"&gt;###### DATAPROC ENV #################&lt;/span&gt;
&lt;span class="nv"&gt;DATAPROC_CLUSTER_NAME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&amp;lt;YOUR_DATAPROC_CLUSTER_NAME&amp;gt;
&lt;span class="nv"&gt;DATAPROC_WORKER_TYPE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;n2-standard-2
&lt;span class="nv"&gt;DATAPROC_MASTER_TYPE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;n2-standard-2
&lt;span class="nv"&gt;DATAPROC_NUM_WORKERS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;2
&lt;span class="nv"&gt;DATAPROC_IMAGE_VERSION&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;2.1-debian11
&lt;span class="nv"&gt;DATAPROC_WORKER_NUM_LOCAL_SSD&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1
&lt;span class="nv"&gt;DATAPROC_MASTER_NUM_LOCAL_SSD&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1
&lt;span class="nv"&gt;DATAPROC_MASTER_BOOT_DISK_SIZE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;32   
&lt;span class="nv"&gt;DATAPROC_WORKER_DISK_SIZE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;32
&lt;span class="nv"&gt;DATAPROC_MASTER_BOOT_DISK_TYPE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;pd-balanced
&lt;span class="nv"&gt;DATAPROC_WORKER_BOOT_DISK_TYPE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;pd-balanced
&lt;span class="nv"&gt;DATAPROC_COMPONENTS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;JUPYTER

&lt;span class="c"&gt;#########&lt;/span&gt;
&lt;span class="nv"&gt;GCP_STORAGE_PREFIX&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;gs://
&lt;span class="nv"&gt;BRONZE_DATALAKE_FILES&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;bronze
&lt;span class="nv"&gt;TRANSIENT_DATALAKE_FILES&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;transient
&lt;span class="nv"&gt;BUCKET_DATALAKE_FOLDER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;transient
&lt;span class="nv"&gt;BUCKET_BIGDATA_JAR_FOLDER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;jars
&lt;span class="nv"&gt;BUCKET_BIGDATA_PYSPARK_FOLDER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;scripts
&lt;span class="nv"&gt;DATAPROC_APP_NAME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;ingestion_countries_csv_to_delta 
&lt;span class="nv"&gt;JAR_LIB1&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;delta-core_2.12-2.3.0.jar
&lt;span class="nv"&gt;JAR_LIB2&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;delta-storage-2.3.0.jar 
&lt;span class="nv"&gt;APP_NAME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'countries_ingestion_csv_to_delta'&lt;/span&gt;
&lt;span class="nv"&gt;SUBJECT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;departments
&lt;span class="nv"&gt;FILE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;countries
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Creating the Services that Make Up the Solution
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Storage&lt;/strong&gt;
Create the storage buckets with the commands below:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud storage buckets create gs://&lt;span class="nv"&gt;$GCP_BUCKET_BIGDATA_FILES&lt;/span&gt; &lt;span class="nt"&gt;--default-storage-class&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;nearline &lt;span class="nt"&gt;--location&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$REGION&lt;/span&gt;
gcloud storage buckets create gs://&lt;span class="nv"&gt;$GCP_BUCKET_DATALAKE&lt;/span&gt; &lt;span class="nt"&gt;--default-storage-class&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;nearline &lt;span class="nt"&gt;--location&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$REGION&lt;/span&gt;
gcloud storage buckets create gs://&lt;span class="nv"&gt;$GCP_BUCKET_DATAPROC&lt;/span&gt; &lt;span class="nt"&gt;--default-storage-class&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;nearline &lt;span class="nt"&gt;--location&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$REGION&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The result should look like the image below:&lt;br&gt;
&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6fbhao8ztvbh9c8vrbe9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6fbhao8ztvbh9c8vrbe9.png" alt="storage created" width="800" height="351"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Uploading Solution Files
&lt;/h2&gt;

&lt;p&gt;After creating the Cloud Storage buckets, we need to upload the CSV files that will be processed by Dataproc, as well as the PySpark script and the libraries it uses.&lt;/p&gt;

&lt;p&gt;Before starting the file upload, it's important to understand the project repository structure, which includes folders for the Data Lake files, the Python code, and the libraries.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnqe64qlqlrh5emmm8tgm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnqe64qlqlrh5emmm8tgm.png" alt="repo structure" width="249" height="353"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Data Lake Storage
&lt;/h3&gt;

&lt;p&gt;There are various ways to upload the files. We'll choose the easiest method: simply select the storage created to store the Data Lake files, click on "Upload Folder," and select the "transient" folder. The "transient" folder is available in the application repository; just download it to your local machine.&lt;/p&gt;

&lt;p&gt;The result should look like the image below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgbdovv3fcack3tbtsjj5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgbdovv3fcack3tbtsjj5.png" alt="datalake" width="800" height="419"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwh7w56kt62cwlkl1juub.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwh7w56kt62cwlkl1juub.png" alt="datalake2" width="800" height="574"&gt;&lt;/a&gt;&lt;/p&gt;
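
&lt;p&gt;Alternatively, the same upload can be done from Cloud Shell with gsutil, reusing the variables defined earlier. A minimal sketch, assuming the command is run from the root of the cloned repository.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Copy the transient folder (CSV files) from the repository to the datalake bucket
gsutil cp -r transient gs://$GCP_BUCKET_DATALAKE
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
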
&lt;h3&gt;
  
  
  Big Data Files Storage
&lt;/h3&gt;

&lt;p&gt;Next, upload the PySpark script that contains the application's processing logic, together with the libraries required to save the data in Delta format: upload the "jars" and "scripts" folders.&lt;/p&gt;
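&lt;p&gt;This upload can also be scripted. A minimal sketch, assuming the "jars" and "scripts" folders are in your current working directory and that $GCP_BUCKET_BIGDATA_FILES holds the name of the big data files bucket created earlier:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Upload the PySpark script and the Delta Lake jars to the big data files bucket
gcloud storage cp --recursive ./scripts gs://$GCP_BUCKET_BIGDATA_FILES/
gcloud storage cp --recursive ./jars gs://$GCP_BUCKET_BIGDATA_FILES/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
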

&lt;p&gt;The result should look like the image below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft9sc7fm39gqhtonz63y1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft9sc7fm39gqhtonz63y1.png" alt="big data files" width="800" height="435"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Dataproc Cluster
&lt;/h2&gt;

&lt;p&gt;When choosing between a single-node cluster and a multi-node cluster, consider the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Single-Node Cluster:&lt;/strong&gt; Ideal for proof-of-concept projects and more cost-effective, but with limited computational power.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-Node Cluster:&lt;/strong&gt; Offers greater computational power but comes at a higher cost.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For this experiment, you can choose either option based on your needs and resources.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Single-Node Cluster&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud dataproc clusters create &lt;span class="nv"&gt;$DATAPROC_CLUSTER_NAME&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--enable-component-gateway&lt;/span&gt; &lt;span class="nt"&gt;--bucket&lt;/span&gt; &lt;span class="nv"&gt;$GCP_BUCKET_DATAPROC&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--region&lt;/span&gt; &lt;span class="nv"&gt;$REGION&lt;/span&gt; &lt;span class="nt"&gt;--zone&lt;/span&gt; &lt;span class="nv"&gt;$ZONE&lt;/span&gt; &lt;span class="nt"&gt;--master-machine-type&lt;/span&gt; &lt;span class="nv"&gt;$DATAPROC_MASTER_TYPE&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--master-boot-disk-type&lt;/span&gt; &lt;span class="nv"&gt;$DATAPROC_MASTER_BOOT_DISK_TYPE&lt;/span&gt; &lt;span class="nt"&gt;--master-boot-disk-size&lt;/span&gt; &lt;span class="nv"&gt;$DATAPROC_MASTER_BOOT_DISK_SIZE&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--num-master-local-ssds&lt;/span&gt; &lt;span class="nv"&gt;$DATAPROC_MASTER_NUM_LOCAL_SSD&lt;/span&gt; &lt;span class="nt"&gt;--image-version&lt;/span&gt; &lt;span class="nv"&gt;$DATAPROC_IMAGE_VERSION&lt;/span&gt; &lt;span class="nt"&gt;--single-node&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--optional-components&lt;/span&gt; &lt;span class="nv"&gt;$DATAPROC_COMPONENTS&lt;/span&gt; &lt;span class="nt"&gt;--project&lt;/span&gt; &lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Multi-Node Cluster&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud dataproc clusters create &lt;span class="nv"&gt;$DATAPROC_CLUSTER_NAME&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--enable-component-gateway&lt;/span&gt; &lt;span class="nt"&gt;--bucket&lt;/span&gt; &lt;span class="nv"&gt;$GCP_BUCKET_DATAPROC&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--region&lt;/span&gt; &lt;span class="nv"&gt;$REGION&lt;/span&gt; &lt;span class="nt"&gt;--zone&lt;/span&gt; &lt;span class="nv"&gt;$ZONE&lt;/span&gt; &lt;span class="nt"&gt;--master-machine-type&lt;/span&gt; &lt;span class="nv"&gt;$DATAPROC_MASTER_TYPE&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--master-boot-disk-type&lt;/span&gt; &lt;span class="nv"&gt;$DATAPROC_MASTER_BOOT_DISK_TYPE&lt;/span&gt; &lt;span class="nt"&gt;--master-boot-disk-size&lt;/span&gt; &lt;span class="nv"&gt;$DATAPROC_MASTER_BOOT_DISK_SIZE&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--num-master-local-ssds&lt;/span&gt; &lt;span class="nv"&gt;$DATAPROC_MASTER_NUM_LOCAL_SSD&lt;/span&gt; &lt;span class="nt"&gt;--num-workers&lt;/span&gt; &lt;span class="nv"&gt;$DATAPROC_NUM_WORKERS&lt;/span&gt; &lt;span class="nt"&gt;--worker-machine-type&lt;/span&gt; &lt;span class="nv"&gt;$DATAPROC_WORKER_TYPE&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--worker-boot-disk-type&lt;/span&gt; &lt;span class="nv"&gt;$DATAPROC_WORKER_BOOT_DISK_TYPE&lt;/span&gt; &lt;span class="nt"&gt;--worker-boot-disk-size&lt;/span&gt; &lt;span class="nv"&gt;$DATAPROC_WORKER_DISK_SIZE&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--num-worker-local-ssds&lt;/span&gt; &lt;span class="nv"&gt;$DATAPROC_WORKER_NUM_LOCAL_SSD&lt;/span&gt; &lt;span class="nt"&gt;--image-version&lt;/span&gt; &lt;span class="nv"&gt;$DATAPROC_IMAGE_VERSION&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--optional-components&lt;/span&gt; &lt;span class="nv"&gt;$DATAPROC_COMPONENTS&lt;/span&gt; &lt;span class="nt"&gt;--project&lt;/span&gt; &lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Dataproc graphical interface:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd46w6zcl5evcmq2ww8vi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd46w6zcl5evcmq2ww8vi.png" alt="dataproc web interface" width="800" height="330"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Dataproc details:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3ldmqirdmb32oweyda4n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3ldmqirdmb32oweyda4n.png" alt="dataproc web interface details" width="800" height="437"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Dataproc web interfaces:&lt;br&gt;
&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi1ul1mtab50abxv0h0jr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi1ul1mtab50abxv0h0jr.png" alt="dataproc web interface items" width="800" height="531"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Dataproc Jupyter web interface:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqqwcxbkpefconguw0qzb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqqwcxbkpefconguw0qzb.png" alt="dataproc jupyter" width="800" height="438"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To list existing Dataproc clusters in a specific region, execute:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud dataproc clusters list &lt;span class="nt"&gt;--region&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$REGION&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Running a PySpark Job with Spark Submit
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Spark Submit
&lt;/h3&gt;

&lt;p&gt;Besides the notebook interface, we can also use an existing cluster by submitting a PySpark or Spark script with Spark Submit.&lt;/p&gt;

&lt;p&gt;Create new variables for job execution:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;PYSPARK_SCRIPT_PATH&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$GCP_STORAGE_PREFIX$GCP_BUCKET_BIGDATA_FILES&lt;/span&gt;/&lt;span class="nv"&gt;$BUCKET_BIGDATA_PYSPARK_FOLDER&lt;/span&gt;/&lt;span class="nv"&gt;$PYSPARK_INGESTION_SCRIPT&lt;/span&gt;
&lt;span class="nv"&gt;JARS_PATH&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$GCP_STORAGE_PREFIX$GCP_BUCKET_BIGDATA_FILES&lt;/span&gt;/&lt;span class="nv"&gt;$BUCKET_BIGDATA_JAR_FOLDER&lt;/span&gt;/&lt;span class="nv"&gt;$JAR_LIB1&lt;/span&gt;
&lt;span class="nv"&gt;JARS_PATH&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$JARS_PATH&lt;/span&gt;,&lt;span class="nv"&gt;$GCP_STORAGE_PREFIX$GCP_BUCKET_BIGDATA_FILES&lt;/span&gt;/&lt;span class="nv"&gt;$BUCKET_BIGDATA_JAR_FOLDER&lt;/span&gt;/&lt;span class="nv"&gt;$JAR_LIB2&lt;/span&gt;
&lt;span class="nv"&gt;TRANSIENT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$GCP_STORAGE_PREFIX$GCP_BUCKET_DATALAKE&lt;/span&gt;/&lt;span class="nv"&gt;$BUCKET_DATALAKE_FOLDER&lt;/span&gt;/&lt;span class="nv"&gt;$SUBJECT&lt;/span&gt;/&lt;span class="nv"&gt;$FILE&lt;/span&gt;
&lt;span class="nv"&gt;BRONZE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$GCP_STORAGE_PREFIX$GCP_BUCKET_DATALAKE&lt;/span&gt;/&lt;span class="nv"&gt;$BRONZE_DATALAKE_FILES&lt;/span&gt;/&lt;span class="nv"&gt;$SUBJECT&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To verify the contents of the variables, use the echo command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nv"&gt;$PYSPARK_SCRIPT_PATH&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nv"&gt;$JARS_PATH&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nv"&gt;$TRANSIENT&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nv"&gt;$BRONZE&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  About the PySpark Script
&lt;/h2&gt;

&lt;p&gt;The PySpark script is divided into two steps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Receiving the parameters sent by the spark-submit command. These parameters are:

&lt;ul&gt;
&lt;li&gt;--app_name - PySpark application name&lt;/li&gt;
&lt;li&gt;--bucket_transient - URI of the GCS transient bucket&lt;/li&gt;
&lt;li&gt;--bucket_bronze - URI of the GCS bronze bucket&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Calling the main method&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F22bd4itby94sr95my0hn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F22bd4itby94sr95my0hn.png" alt="step1" width="792" height="541"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Main method:

&lt;ul&gt;
&lt;li&gt;Calls the method that creates the Spark session&lt;/li&gt;
&lt;li&gt;Calls the method that reads the data from the transient layer in storage&lt;/li&gt;
&lt;li&gt;Calls the method that writes the data to the bronze layer in storage&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7re4pelojwgvtkam5kah.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7re4pelojwgvtkam5kah.png" alt="pyspark_functions" width="800" height="641"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Execute Spark Submit with the command below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud dataproc &lt;span class="nb"&gt;jobs &lt;/span&gt;submit pyspark &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--project&lt;/span&gt; &lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt; &lt;span class="nt"&gt;--region&lt;/span&gt; &lt;span class="nv"&gt;$REGION&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--cluster&lt;/span&gt; &lt;span class="nv"&gt;$DATAPROC_CLUSTER_NAME&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--jars&lt;/span&gt; &lt;span class="nv"&gt;$JARS_PATH&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nv"&gt;$PYSPARK_SCRIPT_PATH&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--&lt;/span&gt; &lt;span class="nt"&gt;--app_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$DATAPROC_APP_NAME&lt;/span&gt; &lt;span class="nt"&gt;--bucket_transient&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$TRANSIENT&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--bucket_bronze&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$BRONZE&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To verify the Dataproc job execution, select the Dataproc cluster and click on the job to see its results and logs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F31y0c3jvmt3mcv97ayli.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F31y0c3jvmt3mcv97ayli.png" alt="dataproc_execution" width="800" height="202"&gt;&lt;/a&gt;&lt;/p&gt;
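&lt;p&gt;The job can also be checked from the command line. The sketch below uses the variables defined earlier; replace JOB_ID with an ID taken from the list output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# List the jobs submitted to the cluster
gcloud dataproc jobs list --cluster=$DATAPROC_CLUSTER_NAME --region=$REGION

# Show the status and details of a specific job
gcloud dataproc jobs describe JOB_ID --region=$REGION
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
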

&lt;p&gt;In the Data Lake storage, a new folder was created for the bronze layer data. As your Data Lake grows, more and more folders will be created.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9zxeed2h6kv282i10g12.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9zxeed2h6kv282i10g12.png" alt="bronze_folder" width="800" height="433"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffi7c3hafusukeq0c9xgf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffi7c3hafusukeq0c9xgf.png" alt="bronze_folder_detail" width="800" height="422"&gt;&lt;/a&gt;&lt;/p&gt;
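&lt;p&gt;You can also confirm the new bronze folder from the command line, using the $BRONZE variable defined earlier:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# List the files written to the bronze layer by the PySpark job
gcloud storage ls --recursive $BRONZE
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
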

&lt;h2&gt;
  
  
  Removing the Created Services
&lt;/h2&gt;

&lt;p&gt;To avoid unexpected costs, remove the created services and resources after use.&lt;/p&gt;

&lt;p&gt;To delete the created storage buckets along with their contents, execute:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud storage &lt;span class="nb"&gt;rm&lt;/span&gt; &lt;span class="nt"&gt;--recursive&lt;/span&gt; &lt;span class="nv"&gt;$GCP_STORAGE_PREFIX$GCP_BUCKET_DATALAKE&lt;/span&gt;
gcloud storage &lt;span class="nb"&gt;rm&lt;/span&gt; &lt;span class="nt"&gt;--recursive&lt;/span&gt; &lt;span class="nv"&gt;$GCP_STORAGE_PREFIX$GCP_BUCKET_BIGDATA_FILES&lt;/span&gt;
gcloud storage &lt;span class="nb"&gt;rm&lt;/span&gt; &lt;span class="nt"&gt;--recursive&lt;/span&gt; &lt;span class="nv"&gt;$GCP_STORAGE_PREFIX$GCP_BUCKET_DATAPROC&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To delete the Dataproc cluster created in the experiment, execute:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud dataproc clusters delete &lt;span class="nv"&gt;$DATAPROC_CLUSTER_NAME&lt;/span&gt; &lt;span class="nt"&gt;--region&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$REGION&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After deletion, the command below should not list any existing clusters:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud dataproc clusters list &lt;span class="nt"&gt;--region&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$REGION&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The need to process data with PySpark arises in many scenarios, including both batch and streaming workloads. Most cloud providers now offer managed clusters, which remove much of the operational overhead.&lt;/p&gt;

&lt;p&gt;In this post, we demonstrated how to create a managed Dataproc cluster on Google Cloud Platform and how to use Cloud Storage as its distributed file system. Other clouds offer similar services, such as AWS EMR with S3, or Azure HDInsight with Data Lake Storage Gen2.&lt;/p&gt;

&lt;h3&gt;
  
  
  Links and References
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/jader-lima/gcp-dataproc-cluster" rel="noopener noreferrer"&gt;Github Repo&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/dataproc/docs" rel="noopener noreferrer"&gt;Dataproc documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.databricks.com/glossary/medallion-architecture" rel="noopener noreferrer"&gt;medallion-architecture&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>pyspark</category>
      <category>gcp</category>
      <category>dataproc</category>
      <category>pipeline</category>
    </item>
  </channel>
</rss>
