<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Jader Lima</title>
    <description>The latest articles on DEV Community by Jader Lima (@jader_lima_b72a63be5bbddc).</description>
    <link>https://dev.to/jader_lima_b72a63be5bbddc</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1878052%2F95282525-5af2-48aa-8c30-ebe572a23d06.jpg</url>
      <title>DEV Community: Jader Lima</title>
      <link>https://dev.to/jader_lima_b72a63be5bbddc</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/jader_lima_b72a63be5bbddc"/>
    <language>en</language>
    <item>
      <title>Using Google Cloud Functions for Three-Tier Data Processing with Google Composer and Automated Deployments via GitHub Actions</title>
      <dc:creator>Jader Lima</dc:creator>
      <pubDate>Fri, 25 Oct 2024 02:03:03 +0000</pubDate>
      <link>https://dev.to/jader_lima_b72a63be5bbddc/using-google-cloud-functions-for-three-tier-data-processing-with-google-composer-and-automated-deployments-via-github-actions-5f02</link>
      <guid>https://dev.to/jader_lima_b72a63be5bbddc/using-google-cloud-functions-for-three-tier-data-processing-with-google-composer-and-automated-deployments-via-github-actions-5f02</guid>
      <description>&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;About&lt;/li&gt;
&lt;li&gt;Tools Used&lt;/li&gt;
&lt;li&gt;Solution Architecture Diagram&lt;/li&gt;
&lt;li&gt;
Deployment Process

&lt;ul&gt;
&lt;li&gt;Prerequisites&lt;/li&gt;
&lt;li&gt;Secrets for GitHub Actions&lt;/li&gt;
&lt;li&gt;To create a new secret:&lt;/li&gt;
&lt;li&gt;Setting Up the DevOps Service Account&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
GitHub Actions Pipeline: Steps

&lt;ul&gt;
&lt;li&gt;enable-services&lt;/li&gt;
&lt;li&gt;deploy-buckets&lt;/li&gt;
&lt;li&gt;deploy-cloud-function&lt;/li&gt;
&lt;li&gt;deploy-composer-service-account&lt;/li&gt;
&lt;li&gt;deploy-bigquery-dataset-bigquery-tables&lt;/li&gt;
&lt;li&gt;deploy-composer-environment&lt;/li&gt;
&lt;li&gt;deploy-composer-http-connection&lt;/li&gt;
&lt;li&gt;deploy-dags&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;GitHub Actions Workflow Explanation&lt;/li&gt;
&lt;li&gt;Resources Created After Deployment&lt;/li&gt;
&lt;li&gt;Conclusion&lt;/li&gt;
&lt;li&gt;References&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  About
&lt;/h2&gt;

&lt;p&gt;This post explores the use of &lt;strong&gt;Google Cloud Functions&lt;/strong&gt; for processing data in a &lt;strong&gt;three-tier architecture&lt;/strong&gt;. The solution is orchestrated with &lt;strong&gt;Google Composer&lt;/strong&gt; and features &lt;strong&gt;automated deployments&lt;/strong&gt; using &lt;strong&gt;GitHub Actions&lt;/strong&gt;. We will walk through the tools used, deployment process, and pipeline steps, providing a clear guide for building an end-to-end cloud-based data pipeline.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tools Used
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Google Cloud Platform (GCP):&lt;/strong&gt; The primary cloud environment.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cloud Storage:&lt;/strong&gt; For storing input and processed data across different layers (Bronze, Silver, Gold); a possible folder layout is sketched after this list.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cloud Functions:&lt;/strong&gt; Serverless functions responsible for data processing in each tier.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Google Composer:&lt;/strong&gt; An orchestration tool based on Apache Airflow, used to schedule and manage workflows.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GitHub Actions:&lt;/strong&gt; Automation tool for deploying and managing the pipeline infrastructure.&lt;/li&gt;
&lt;/ol&gt;
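
&lt;p&gt;To make the storage layers concrete, the sketch below shows one possible folder layout for the data lake bucket. The folder names are illustrative assumptions suggested by the TRANSIENT_FILE_PATH, BRONZE_PATH and SILVER_PATH variables used later in the workflow, not values taken from the repository.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;gs://your-datalake-bucket/
  transient/   # raw input files uploaded by the deploy-buckets job
  bronze/      # raw data converted to Parquet by the first Cloud Function
  silver/      # cleaned and standardized data
  gold/        # curated data, ready to be loaded into BigQuery
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;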

&lt;h2&gt;
  
  
  Solution Architecture Diagram
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0qpvy32rtwd6vzg3m77c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0qpvy32rtwd6vzg3m77c.png" alt="project architecture" width="800" height="445"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Deployment Process
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;p&gt;Before setting up the project, ensure you have the following:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;GCP Account&lt;/strong&gt;: A Google Cloud account with billing enabled.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Service Account for DevOps&lt;/strong&gt;: A service account with the required permissions to deploy resources in GCP.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Secrets in GitHub&lt;/strong&gt;: Store the GCP service account credentials, project ID, and bucket name as secrets in GitHub for secure access.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Secrets for GitHub Actions
&lt;/h3&gt;

&lt;p&gt;To securely access your GCP project and resources, set the following secrets in GitHub Actions (a short example of how a job consumes them follows this list):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;BUCKET_DATALAKE&lt;/code&gt;: Your Cloud Storage bucket for the data lake.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;GCP_DEVOPS_SA_KEY&lt;/code&gt;: The service account key in JSON format.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;PROJECT_ID&lt;/code&gt;: Your GCP project ID.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;REGION_PROJECT_ID&lt;/code&gt;: The region where your GCP project is deployed.&lt;/li&gt;
&lt;/ul&gt;
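
&lt;p&gt;These secrets are read in the workflow through the &lt;code&gt;${{ secrets.* }}&lt;/code&gt; context. As a minimal sketch, using the same pattern repeated in every job below, the authentication steps look like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;- name: Authorize GCP
  uses: 'google-github-actions/auth@v2'
  with:
    credentials_json: ${{ secrets.GCP_DEVOPS_SA_KEY }}

- name: Set up Cloud SDK
  uses: google-github-actions/setup-gcloud@v2
  with:
    project_id: ${{ secrets.PROJECT_ID }}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;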

&lt;h3&gt;
  
  
  To create a new secret:
&lt;/h3&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. In project repository, menu **Settings** 
2. **Security**, 
3. **Secrets and variables**,click in access **Action**
4. **New repository secret**, type a **name** and **value** for secret.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi45cicz0q89ije7j70yf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi45cicz0q89ije7j70yf.png" alt="github secret creation" width="800" height="609"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For more details, see:&lt;br&gt;
&lt;a href="https://docs.github.com/pt/actions/security-for-github-actions/security-guides/using-secrets-in-github-actions" rel="noopener noreferrer"&gt;https://docs.github.com/pt/actions/security-for-github-actions/security-guides/using-secrets-in-github-actions&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Setting Up the DevOps Service Account
&lt;/h3&gt;

&lt;p&gt;Create a service account in GCP with permissions for Cloud Functions, Composer, BigQuery, and Cloud Storage. Grant it the necessary roles, such as those listed below; a gcloud sketch follows the list:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cloud Functions Admin&lt;/li&gt;
&lt;li&gt;Composer User&lt;/li&gt;
&lt;li&gt;BigQuery Data Editor&lt;/li&gt;
&lt;li&gt;Storage Object Admin&lt;/li&gt;
&lt;/ul&gt;
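
&lt;p&gt;As a minimal sketch of the equivalent gcloud commands: the service account name &lt;code&gt;gcp-devops&lt;/code&gt; and the &lt;code&gt;PROJECT_ID&lt;/code&gt; placeholder are assumptions for illustration, and only one role binding is shown; repeat it for each role above. This is a one-time bootstrap, run by an account that can create service accounts and grant IAM roles.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Create the DevOps service account (the name is an assumption)
gcloud iam service-accounts create gcp-devops \
    --display-name="DevOps service account for GitHub Actions"

# Grant one of the roles listed above; repeat for the remaining roles
gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="serviceAccount:gcp-devops@PROJECT_ID.iam.gserviceaccount.com" \
    --role="roles/cloudfunctions.admin"

# Export a JSON key and store its contents in the GCP_DEVOPS_SA_KEY secret
gcloud iam service-accounts keys create gcp-devops-key.json \
    --iam-account=gcp-devops@PROJECT_ID.iam.gserviceaccount.com
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;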

&lt;h2&gt;
  
  
  GitHub Actions Pipeline: Steps
&lt;/h2&gt;

&lt;p&gt;The pipeline automates the entire deployment process, ensuring all components are set up correctly. Here's a breakdown of the key jobs from the GitHub Actions file, each responsible for a different aspect of the deployment.&lt;/p&gt;
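
&lt;p&gt;The jobs below also reference a set of workflow-level environment variables (for example &lt;code&gt;env.REGION&lt;/code&gt;, &lt;code&gt;env.CLOUD_FUNCTION_1_NAME&lt;/code&gt; and &lt;code&gt;env.GCP_SERVICE_API_0&lt;/code&gt;), declared once in an &lt;code&gt;env:&lt;/code&gt; block at the top of the workflow file. The sketch below shows such a block; the values are illustrative assumptions, not the repository's actual settings.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;env:
  REGION: us-east1                                        # example region
  # APIs enabled by the enable-services job (assumed list)
  GCP_SERVICE_API_0: cloudfunctions.googleapis.com
  GCP_SERVICE_API_1: cloudbuild.googleapis.com
  GCP_SERVICE_API_2: composer.googleapis.com
  GCP_SERVICE_API_3: bigquery.googleapis.com
  GCP_SERVICE_API_4: run.googleapis.com
  # Cloud Functions (function 1 name comes from the workflow comments; 2 and 3 are assumed)
  FUNCTION_SCRIPTS: cloud_function_scripts
  CLOUD_FUNCTION_1_NAME: csv_to_parquet
  CLOUD_FUNCTION_2_NAME: bronze_to_silver
  CLOUD_FUNCTION_3_NAME: silver_to_gold
  PYTHON_FUNCTION_RUNTIME: python310
  FUNCTION_CPU: 1
  FUNCTION_MEMORY: 256Mi
  # Data lake folders
  INPUT_FOLDER: transient
  TRANSIENT_FILE_PATH: transient
  BRONZE_PATH: bronze
  SILVER_PATH: silver
  # Composer and BigQuery
  SERVICE_ACCOUNT_NAME: composer-sa
  SERVICE_ACCOUNT_DESCRIPTION: composer-sa
  COMPOSER_ENV_NAME: composer-env
  COMPOSER_ENV_SIZE: small
  COMPOSER_IMAGE_VERSION: composer-2.9.7-airflow-2.9.3    # use a currently supported image
  HTTP_CONNECTION: google_cloud_function_connection
  CONNECTION_TYPE: http
  BIGQUERY_DATASET: customer_dataset
  BIGQUERY_TABLE_CUSTOMER: customer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;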

&lt;h3&gt;
  
  
  enable-services
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;enable-services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-22.04&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v2&lt;/span&gt;

    &lt;span class="c1"&gt;# Step to Authenticate with GCP&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Authorize GCP&lt;/span&gt;
      &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;google-github-actions/auth@v2'&lt;/span&gt;
      &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;credentials_json&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  &lt;span class="s"&gt;${{ secrets.GCP_DEVOPS_SA_KEY }}&lt;/span&gt;

    &lt;span class="c1"&gt;# Step to Configure  Cloud SDK&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Set up Cloud SDK&lt;/span&gt;
      &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;google-github-actions/setup-gcloud@v2&lt;/span&gt;
      &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;&amp;gt;=&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;363.0.0'&lt;/span&gt;
        &lt;span class="na"&gt;project_id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.PROJECT_ID }}&lt;/span&gt;

    &lt;span class="c1"&gt;# Step to Configure Docker to use the gcloud command-line tool as a credential helper&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Configure Docker&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
        &lt;span class="s"&gt;gcloud auth configure-docker&lt;/span&gt;

    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Set up python &lt;/span&gt;&lt;span class="m"&gt;3.8&lt;/span&gt;
      &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/setup-python@v2&lt;/span&gt;
      &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;python-version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;3.8.16&lt;/span&gt;

    &lt;span class="c1"&gt;# Step to Create GCP Bucket &lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Enable gcp service api's&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
        &lt;span class="s"&gt;gcloud services enable ${{ env.GCP_SERVICE_API_0 }}&lt;/span&gt;
        &lt;span class="s"&gt;gcloud services enable ${{ env.GCP_SERVICE_API_1 }}&lt;/span&gt;
        &lt;span class="s"&gt;gcloud services enable ${{ env.GCP_SERVICE_API_2 }}&lt;/span&gt;
        &lt;span class="s"&gt;gcloud services enable ${{ env.GCP_SERVICE_API_3 }}&lt;/span&gt;
        &lt;span class="s"&gt;gcloud services enable ${{ env.GCP_SERVICE_API_4 }}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  deploy-buckets
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;deploy-buckets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;needs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;enable-services&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-22.04&lt;/span&gt;
    &lt;span class="na"&gt;timeout-minutes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;

    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Checkout&lt;/span&gt;
      &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;

    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Authorize GCP&lt;/span&gt;
      &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;google-github-actions/auth@v2'&lt;/span&gt;
      &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;credentials_json&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  &lt;span class="s"&gt;${{ secrets.GCP_DEVOPS_SA_KEY }}&lt;/span&gt;

    &lt;span class="c1"&gt;# Step to Authenticate with GCP&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Set up Cloud SDK&lt;/span&gt;
      &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;google-github-actions/setup-gcloud@v2&lt;/span&gt;
      &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;&amp;gt;=&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;363.0.0'&lt;/span&gt;
        &lt;span class="na"&gt;project_id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.PROJECT_ID }}&lt;/span&gt;

    &lt;span class="c1"&gt;# Step to Configure Docker to use the gcloud command-line tool as a credential helper&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Configure Docker&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
        &lt;span class="s"&gt;gcloud auth configure-docker&lt;/span&gt;

    &lt;span class="c1"&gt;# Step to Create GCP Bucket &lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Create Google Cloud Storage - datalake&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
        &lt;span class="s"&gt;if ! gsutil ls -p ${{ secrets.PROJECT_ID }} gs://${{ secrets.BUCKET_DATALAKE }} &amp;amp;&amp;gt; /dev/null; \&lt;/span&gt;
          &lt;span class="s"&gt;then \&lt;/span&gt;
            &lt;span class="s"&gt;gcloud storage buckets create gs://${{ secrets.BUCKET_DATALAKE }} --default-storage-class=nearline --location=${{ env.REGION }}&lt;/span&gt;
          &lt;span class="s"&gt;else&lt;/span&gt;
            &lt;span class="s"&gt;echo "Cloud Storage : gs://${{ secrets.BUCKET_DATALAKE }}  already exists" ! &lt;/span&gt;
          &lt;span class="s"&gt;fi&lt;/span&gt;


    &lt;span class="c1"&gt;# Step to Upload the file to GCP Bucket - transient files&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Upload transient files to Google Cloud Storage&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
        &lt;span class="s"&gt;TARGET=${{ env.INPUT_FOLDER }}&lt;/span&gt;
        &lt;span class="s"&gt;BUCKET_PATH=${{ secrets.BUCKET_DATALAKE }}/${{ env.INPUT_FOLDER }}    &lt;/span&gt;
        &lt;span class="s"&gt;gsutil cp -r $TARGET gs://${BUCKET_PATH}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  deploy-cloud-function
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;deploy-cloud-function&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;needs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;enable-services&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;deploy-buckets&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-22.04&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v2&lt;/span&gt;

    &lt;span class="c1"&gt;# Step to Authenticate with GCP&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Authorize GCP&lt;/span&gt;
      &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;google-github-actions/auth@v2'&lt;/span&gt;
      &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;credentials_json&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  &lt;span class="s"&gt;${{ secrets.GCP_DEVOPS_SA_KEY }}&lt;/span&gt;

    &lt;span class="c1"&gt;# Step to Configure  Cloud SDK&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Set up Cloud SDK&lt;/span&gt;
      &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;google-github-actions/setup-gcloud@v2&lt;/span&gt;
      &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;&amp;gt;=&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;363.0.0'&lt;/span&gt;
        &lt;span class="na"&gt;project_id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.PROJECT_ID }}&lt;/span&gt;

    &lt;span class="c1"&gt;# Step to Configure Docker to use the gcloud command-line tool as a credential helper&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Configure Docker&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
        &lt;span class="s"&gt;gcloud auth configure-docker&lt;/span&gt;

    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Set up python &lt;/span&gt;&lt;span class="m"&gt;3.10&lt;/span&gt;
      &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/setup-python@v2&lt;/span&gt;
      &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;python-version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;3.10.12&lt;/span&gt;
    &lt;span class="c1"&gt;#cloud_function_scripts/csv_to_parquet&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Create cloud function - ${{ env.CLOUD_FUNCTION_1_NAME }}&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
        &lt;span class="s"&gt;cd ${{ env.FUNCTION_SCRIPTS }}/${{ env.CLOUD_FUNCTION_1_NAME }}&lt;/span&gt;
        &lt;span class="s"&gt;gcloud functions deploy ${{ env.CLOUD_FUNCTION_1_NAME }} \&lt;/span&gt;
        &lt;span class="s"&gt;--gen2 \&lt;/span&gt;
        &lt;span class="s"&gt;--cpu=${{ env.FUNCTION_CPU  }} \&lt;/span&gt;
        &lt;span class="s"&gt;--memory=${{ env.FUNCTION_MEMORY  }} \&lt;/span&gt;
        &lt;span class="s"&gt;--runtime ${{ env.PYTHON_FUNCTION_RUNTIME }} \&lt;/span&gt;
        &lt;span class="s"&gt;--trigger-http \&lt;/span&gt;
        &lt;span class="s"&gt;--region ${{ env.REGION }} \&lt;/span&gt;
        &lt;span class="s"&gt;--entry-point ${{ env.CLOUD_FUNCTION_1_NAME }}&lt;/span&gt;

    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Create cloud function - ${{ env.CLOUD_FUNCTION_2_NAME }}&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
        &lt;span class="s"&gt;cd ${{ env.FUNCTION_SCRIPTS }}/${{ env.CLOUD_FUNCTION_2_NAME }}&lt;/span&gt;
        &lt;span class="s"&gt;gcloud functions deploy ${{ env.CLOUD_FUNCTION_2_NAME }} \&lt;/span&gt;
        &lt;span class="s"&gt;--gen2 \&lt;/span&gt;
        &lt;span class="s"&gt;--cpu=${{ env.FUNCTION_CPU  }} \&lt;/span&gt;
        &lt;span class="s"&gt;--memory=${{ env.FUNCTION_MEMORY  }} \&lt;/span&gt;
        &lt;span class="s"&gt;--runtime ${{ env.PYTHON_FUNCTION_RUNTIME }} \&lt;/span&gt;
        &lt;span class="s"&gt;--trigger-http \&lt;/span&gt;
        &lt;span class="s"&gt;--region ${{ env.REGION }} \&lt;/span&gt;
        &lt;span class="s"&gt;--entry-point ${{ env.CLOUD_FUNCTION_2_NAME }}&lt;/span&gt;

    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Create cloud function - ${{ env.CLOUD_FUNCTION_3_NAME }}&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
        &lt;span class="s"&gt;cd ${{ env.FUNCTION_SCRIPTS }}/${{ env.CLOUD_FUNCTION_3_NAME }}&lt;/span&gt;
        &lt;span class="s"&gt;gcloud functions deploy ${{ env.CLOUD_FUNCTION_3_NAME }} \&lt;/span&gt;
        &lt;span class="s"&gt;--gen2 \&lt;/span&gt;
        &lt;span class="s"&gt;--cpu=${{ env.FUNCTION_CPU  }} \&lt;/span&gt;
        &lt;span class="s"&gt;--memory=${{ env.FUNCTION_MEMORY  }} \&lt;/span&gt;
        &lt;span class="s"&gt;--runtime ${{ env.PYTHON_FUNCTION_RUNTIME }} \&lt;/span&gt;
        &lt;span class="s"&gt;--trigger-http \&lt;/span&gt;
        &lt;span class="s"&gt;--region ${{ env.REGION }} \&lt;/span&gt;
        &lt;span class="s"&gt;--entry-point ${{ env.CLOUD_FUNCTION_3_NAME }}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  deploy-composer-service-account
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;deploy-composer-service-account&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;needs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;enable-services&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;deploy-buckets&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;deploy-cloud-function&lt;/span&gt; &lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-22.04&lt;/span&gt;
    &lt;span class="na"&gt;timeout-minutes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;

    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Checkout&lt;/span&gt;
      &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;

    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Authorize GCP&lt;/span&gt;
      &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;google-github-actions/auth@v2'&lt;/span&gt;
      &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;credentials_json&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  &lt;span class="s"&gt;${{ secrets.GCP_DEVOPS_SA_KEY }}&lt;/span&gt;

    &lt;span class="c1"&gt;# Step to Authenticate with GCP&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Set up Cloud SDK&lt;/span&gt;
      &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;google-github-actions/setup-gcloud@v2&lt;/span&gt;
      &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;&amp;gt;=&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;363.0.0'&lt;/span&gt;
        &lt;span class="na"&gt;project_id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.PROJECT_ID }}&lt;/span&gt;

    &lt;span class="c1"&gt;# Step to Configure Docker to use the gcloud command-line tool as a credential helper&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Configure Docker&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
        &lt;span class="s"&gt;gcloud auth configure-docker&lt;/span&gt;


    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Create service account&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;

        &lt;span class="s"&gt;if ! gcloud iam service-accounts list | grep -i ${{ env.SERVICE_ACCOUNT_NAME}} &amp;amp;&amp;gt; /dev/null; \&lt;/span&gt;
          &lt;span class="s"&gt;then \&lt;/span&gt;
            &lt;span class="s"&gt;gcloud iam service-accounts create ${{ env.SERVICE_ACCOUNT_NAME }} \&lt;/span&gt;
            &lt;span class="s"&gt;--display-name=${{ env.SERVICE_ACCOUNT_DESCRIPTION }}&lt;/span&gt;
          &lt;span class="s"&gt;fi&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Add permissions to service account&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
        &lt;span class="s"&gt;gcloud projects add-iam-policy-binding ${{secrets.PROJECT_ID}} \&lt;/span&gt;
          &lt;span class="s"&gt;--member="serviceAccount:${{env.SERVICE_ACCOUNT_NAME}}@${{secrets.PROJECT_ID}}.iam.gserviceaccount.com" \&lt;/span&gt;
          &lt;span class="s"&gt;--role="roles/composer.user"&lt;/span&gt;

        &lt;span class="s"&gt;gcloud projects add-iam-policy-binding ${{secrets.PROJECT_ID}} \&lt;/span&gt;
          &lt;span class="s"&gt;--member="serviceAccount:${{env.SERVICE_ACCOUNT_NAME}}@${{secrets.PROJECT_ID}}.iam.gserviceaccount.com" \&lt;/span&gt;
          &lt;span class="s"&gt;--role="roles/storage.objectAdmin"&lt;/span&gt;

        &lt;span class="s"&gt;gcloud projects add-iam-policy-binding ${{secrets.PROJECT_ID}} \&lt;/span&gt;
          &lt;span class="s"&gt;--member="serviceAccount:${{env.SERVICE_ACCOUNT_NAME}}@${{secrets.PROJECT_ID}}.iam.gserviceaccount.com" \&lt;/span&gt;
          &lt;span class="s"&gt;--role="roles/cloudfunctions.invoker"&lt;/span&gt;

        &lt;span class="s"&gt;# Permissão para criar e gerenciar ambientes Composer&lt;/span&gt;
        &lt;span class="s"&gt;gcloud projects add-iam-policy-binding ${{secrets.PROJECT_ID}} \&lt;/span&gt;
          &lt;span class="s"&gt;--member="serviceAccount:${{env.SERVICE_ACCOUNT_NAME}}@${{secrets.PROJECT_ID}}.iam.gserviceaccount.com" \&lt;/span&gt;
          &lt;span class="s"&gt;--role="roles/composer.admin"&lt;/span&gt;

        &lt;span class="s"&gt;gcloud projects add-iam-policy-binding ${{secrets.PROJECT_ID}} \&lt;/span&gt;
          &lt;span class="s"&gt;--member="serviceAccount:${{env.SERVICE_ACCOUNT_NAME}}@${{secrets.PROJECT_ID}}.iam.gserviceaccount.com" \&lt;/span&gt;
          &lt;span class="s"&gt;--role="roles/composer.worker"&lt;/span&gt;

        &lt;span class="s"&gt;# Permissão para criar e configurar instâncias e recursos na VPC&lt;/span&gt;
        &lt;span class="s"&gt;gcloud projects add-iam-policy-binding ${{secrets.PROJECT_ID}} \&lt;/span&gt;
          &lt;span class="s"&gt;--member="serviceAccount:${{env.SERVICE_ACCOUNT_NAME}}@${{secrets.PROJECT_ID}}.iam.gserviceaccount.com" \&lt;/span&gt;
          &lt;span class="s"&gt;--role="roles/compute.networkAdmin"&lt;/span&gt;

        &lt;span class="s"&gt;# Permissão para interagir com o Cloud Storage, necessário para buckets e logs&lt;/span&gt;
        &lt;span class="s"&gt;gcloud projects add-iam-policy-binding ${{secrets.PROJECT_ID}} \&lt;/span&gt;
          &lt;span class="s"&gt;--member="serviceAccount:${{env.SERVICE_ACCOUNT_NAME}}@${{secrets.PROJECT_ID}}.iam.gserviceaccount.com" \&lt;/span&gt;
          &lt;span class="s"&gt;--role="roles/storage.admin"&lt;/span&gt;

        &lt;span class="s"&gt;# Permissão para criar e gerenciar recursos no projeto, como buckets e instâncias&lt;/span&gt;
        &lt;span class="s"&gt;gcloud projects add-iam-policy-binding ${{secrets.PROJECT_ID}} \&lt;/span&gt;
          &lt;span class="s"&gt;--member="serviceAccount:${{env.SERVICE_ACCOUNT_NAME}}@${{secrets.PROJECT_ID}}.iam.gserviceaccount.com" \&lt;/span&gt;
          &lt;span class="s"&gt;--role="roles/editor"&lt;/span&gt;

        &lt;span class="s"&gt;# Permissão para acessar e usar recursos necessários para o IAM&lt;/span&gt;
        &lt;span class="s"&gt;gcloud projects add-iam-policy-binding ${{secrets.PROJECT_ID}} \&lt;/span&gt;
          &lt;span class="s"&gt;--member="serviceAccount:${{env.SERVICE_ACCOUNT_NAME}}@${{secrets.PROJECT_ID}}.iam.gserviceaccount.com" \&lt;/span&gt;
          &lt;span class="s"&gt;--role="roles/iam.serviceAccountUser"&lt;/span&gt;

        &lt;span class="s"&gt;gcloud functions add-iam-policy-binding ${{env.CLOUD_FUNCTION_1_NAME}} \&lt;/span&gt;
          &lt;span class="s"&gt;--region="${{env.REGION}}" \&lt;/span&gt;
          &lt;span class="s"&gt;--member="serviceAccount:${{env.SERVICE_ACCOUNT_NAME}}@${{secrets.PROJECT_ID}}.iam.gserviceaccount.com" \&lt;/span&gt;
          &lt;span class="s"&gt;--role="roles/cloudfunctions.invoker"&lt;/span&gt;

        &lt;span class="s"&gt;gcloud functions add-invoker-policy-binding ${{env.CLOUD_FUNCTION_1_NAME}} \&lt;/span&gt;
          &lt;span class="s"&gt;--region="${{env.REGION}}" \&lt;/span&gt;
          &lt;span class="s"&gt;--member="serviceAccount:${{env.SERVICE_ACCOUNT_NAME}}@${{secrets.PROJECT_ID}}.iam.gserviceaccount.com" &lt;/span&gt;

        &lt;span class="s"&gt;gcloud functions add-iam-policy-binding ${{env.CLOUD_FUNCTION_2_NAME}} \&lt;/span&gt;
          &lt;span class="s"&gt;--region="${{env.REGION}}" \&lt;/span&gt;
          &lt;span class="s"&gt;--member="serviceAccount:${{env.SERVICE_ACCOUNT_NAME}}@${{secrets.PROJECT_ID}}.iam.gserviceaccount.com" \&lt;/span&gt;
          &lt;span class="s"&gt;--role="roles/cloudfunctions.invoker"&lt;/span&gt;

        &lt;span class="s"&gt;gcloud functions add-invoker-policy-binding ${{env.CLOUD_FUNCTION_2_NAME}} \&lt;/span&gt;
          &lt;span class="s"&gt;--region="${{env.REGION}}" \&lt;/span&gt;
          &lt;span class="s"&gt;--member="serviceAccount:${{env.SERVICE_ACCOUNT_NAME}}@${{secrets.PROJECT_ID}}.iam.gserviceaccount.com"     &lt;/span&gt;

        &lt;span class="s"&gt;gcloud functions add-iam-policy-binding ${{env.CLOUD_FUNCTION_3_NAME}} \&lt;/span&gt;
          &lt;span class="s"&gt;--region="${{env.REGION}}" \&lt;/span&gt;
          &lt;span class="s"&gt;--member="serviceAccount:${{env.SERVICE_ACCOUNT_NAME}}@${{secrets.PROJECT_ID}}.iam.gserviceaccount.com" \&lt;/span&gt;
          &lt;span class="s"&gt;--role="roles/cloudfunctions.invoker"&lt;/span&gt;

        &lt;span class="s"&gt;gcloud functions add-invoker-policy-binding ${{env.CLOUD_FUNCTION_3_NAME}} \&lt;/span&gt;
          &lt;span class="s"&gt;--region="${{env.REGION}}" \&lt;/span&gt;
          &lt;span class="s"&gt;--member="serviceAccount:${{env.SERVICE_ACCOUNT_NAME}}@${{secrets.PROJECT_ID}}.iam.gserviceaccount.com"   &lt;/span&gt;

        &lt;span class="s"&gt;SERVICE_NAME_1=$(gcloud functions describe ${{ env.CLOUD_FUNCTION_1_NAME }} --region=${{ env.REGION }} --format="value(serviceConfig.service)")&lt;/span&gt;
        &lt;span class="s"&gt;gcloud run services add-iam-policy-binding $SERVICE_NAME_1 \&lt;/span&gt;
        &lt;span class="s"&gt;--region="${{env.REGION}}" \&lt;/span&gt;
        &lt;span class="s"&gt;--member="serviceAccount:${{env.SERVICE_ACCOUNT_NAME}}@${{secrets.PROJECT_ID}}.iam.gserviceaccount.com" \&lt;/span&gt;
        &lt;span class="s"&gt;--role="roles/run.invoker"&lt;/span&gt;

        &lt;span class="s"&gt;SERVICE_NAME_2=$(gcloud functions describe ${{ env.CLOUD_FUNCTION_2_NAME }} --region=${{ env.REGION }} --format="value(serviceConfig.service)")&lt;/span&gt;
        &lt;span class="s"&gt;gcloud run services add-iam-policy-binding $SERVICE_NAME_2 \&lt;/span&gt;
        &lt;span class="s"&gt;--region="${{env.REGION}}" \&lt;/span&gt;
        &lt;span class="s"&gt;--member="serviceAccount:${{env.SERVICE_ACCOUNT_NAME}}@${{secrets.PROJECT_ID}}.iam.gserviceaccount.com" \&lt;/span&gt;
        &lt;span class="s"&gt;--role="roles/run.invoker"&lt;/span&gt;

        &lt;span class="s"&gt;SERVICE_NAME_3=$(gcloud functions describe ${{ env.CLOUD_FUNCTION_3_NAME }} --region=${{ env.REGION }} --format="value(serviceConfig.service)")&lt;/span&gt;
        &lt;span class="s"&gt;gcloud run services add-iam-policy-binding $SERVICE_NAME_3 \&lt;/span&gt;
        &lt;span class="s"&gt;--region="${{env.REGION}}" \&lt;/span&gt;
        &lt;span class="s"&gt;--member="serviceAccount:${{env.SERVICE_ACCOUNT_NAME}}@${{secrets.PROJECT_ID}}.iam.gserviceaccount.com" \&lt;/span&gt;
        &lt;span class="s"&gt;--role="roles/run.invoker"&lt;/span&gt;


        &lt;span class="s"&gt;gcloud functions add-invoker-policy-binding ${{env.CLOUD_FUNCTION_1_NAME}} \&lt;/span&gt;
        &lt;span class="s"&gt;--region="${{env.REGION}}" \&lt;/span&gt;
        &lt;span class="s"&gt;--member="allUsers"&lt;/span&gt;

        &lt;span class="s"&gt;gcloud functions add-invoker-policy-binding ${{env.CLOUD_FUNCTION_2_NAME}} \&lt;/span&gt;
        &lt;span class="s"&gt;--region="${{env.REGION}}" \&lt;/span&gt;
        &lt;span class="s"&gt;--member="allUsers"&lt;/span&gt;

        &lt;span class="s"&gt;gcloud functions add-invoker-policy-binding ${{env.CLOUD_FUNCTION_3_NAME}} \&lt;/span&gt;
        &lt;span class="s"&gt;--region="${{env.REGION}}" \&lt;/span&gt;
        &lt;span class="s"&gt;--member="allUsers"&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  deploy-bigquery-dataset-bigquery-tables
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;deploy-bigquery-dataset-bigquery-tables&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;needs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;enable-services&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;deploy-buckets&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;deploy-cloud-function&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;deploy-composer-service-account&lt;/span&gt; &lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-22.04&lt;/span&gt;
    &lt;span class="na"&gt;timeout-minutes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;

    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Checkout&lt;/span&gt;
      &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;

    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Authorize GCP&lt;/span&gt;
      &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;google-github-actions/auth@v2'&lt;/span&gt;
      &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;credentials_json&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  &lt;span class="s"&gt;${{ secrets.GCP_DEVOPS_SA_KEY }}&lt;/span&gt;

    &lt;span class="c1"&gt;# Step to Authenticate with GCP&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Set up Cloud SDK&lt;/span&gt;
      &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;google-github-actions/setup-gcloud@v2&lt;/span&gt;
      &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;&amp;gt;=&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;363.0.0'&lt;/span&gt;
        &lt;span class="na"&gt;project_id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.PROJECT_ID }}&lt;/span&gt;

    &lt;span class="c1"&gt;# Step to Configure Docker to use the gcloud command-line tool as a credential helper&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Configure Docker&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
        &lt;span class="s"&gt;gcloud auth configure-docker&lt;/span&gt;


    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Create Big Query Dataset&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;  
        &lt;span class="s"&gt;if ! bq ls --project_id ${{ secrets.PROJECT_ID}} -a | grep -w ${{ env.BIGQUERY_DATASET}} &amp;amp;&amp;gt; /dev/null; \&lt;/span&gt;
          &lt;span class="s"&gt;then &lt;/span&gt;

            &lt;span class="s"&gt;bq --location=${{ env.REGION }} mk \&lt;/span&gt;
          &lt;span class="s"&gt;--default_table_expiration 0 \&lt;/span&gt;
          &lt;span class="s"&gt;--dataset ${{ env.BIGQUERY_DATASET }}&lt;/span&gt;

          &lt;span class="s"&gt;else&lt;/span&gt;
            &lt;span class="s"&gt;echo "Big Query Dataset : ${{ env.BIGQUERY_DATASET }} already exists" ! &lt;/span&gt;
          &lt;span class="s"&gt;fi&lt;/span&gt;

    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Create Big Query table&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
        &lt;span class="s"&gt;TABLE_NAME_CUSTOMER=${{ env.BIGQUERY_DATASET}}.${{ env.BIGQUERY_TABLE_CUSTOMER}}&lt;/span&gt;
        &lt;span class="s"&gt;c=0&lt;/span&gt;
        &lt;span class="s"&gt;for table in $(bq ls --max_results 1000 "${{ secrets.PROJECT_ID}}:${{ env.BIGQUERY_DATASET}}" | tail -n +3 | awk '{print $1}'); do&lt;/span&gt;

            &lt;span class="s"&gt;# Determine the table type and file extension&lt;/span&gt;
            &lt;span class="s"&gt;if bq show --format=prettyjson $BIGQUERY_TABLE_CUSTOMER | jq -r '.type' | grep -q -E "TABLE"; then&lt;/span&gt;
              &lt;span class="s"&gt;echo "Dataset ${{ env.BIGQUERY_DATASET}} already has table named : $table " !&lt;/span&gt;
              &lt;span class="s"&gt;if [ "$table" == "${{ env.BIGQUERY_TABLE_CUSTOMER}}" ]; then&lt;/span&gt;
                &lt;span class="s"&gt;echo "Dataset ${{ env.BIGQUERY_DATASET}} already has table named : $table " !&lt;/span&gt;
                &lt;span class="s"&gt;((c=c+1))       &lt;/span&gt;
              &lt;span class="s"&gt;fi                  &lt;/span&gt;
            &lt;span class="s"&gt;else&lt;/span&gt;
                &lt;span class="s"&gt;echo "Ignoring $table"            &lt;/span&gt;
                &lt;span class="s"&gt;continue&lt;/span&gt;
            &lt;span class="s"&gt;fi&lt;/span&gt;
        &lt;span class="s"&gt;done&lt;/span&gt;
        &lt;span class="s"&gt;echo " contador $c "&lt;/span&gt;
        &lt;span class="s"&gt;if [ $c == 0 ]; then&lt;/span&gt;
          &lt;span class="s"&gt;echo "Creating table named : $table for Dataset ${{ env.BIGQUERY_DATASET}} " !&lt;/span&gt;

          &lt;span class="s"&gt;bq mk --table \&lt;/span&gt;
          &lt;span class="s"&gt;$TABLE_NAME_CUSTOMER \&lt;/span&gt;
          &lt;span class="s"&gt;./big_query_schemas/customer_schema.json&lt;/span&gt;


        &lt;span class="s"&gt;fi&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  deploy-composer-environment
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;deploy-composer-environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;needs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;enable-services&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;deploy-buckets&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;deploy-cloud-function&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;deploy-composer-service-account&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;deploy-bigquery-dataset-bigquery-tables&lt;/span&gt; &lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-22.04&lt;/span&gt;
    &lt;span class="na"&gt;timeout-minutes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;40&lt;/span&gt;

    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Checkout&lt;/span&gt;
      &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;

    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Authorize GCP&lt;/span&gt;
      &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;google-github-actions/auth@v2'&lt;/span&gt;
      &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;credentials_json&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  &lt;span class="s"&gt;${{ secrets.GCP_DEVOPS_SA_KEY }}&lt;/span&gt;

    &lt;span class="c1"&gt;# Step to Authenticate with GCP&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Set up Cloud SDK&lt;/span&gt;
      &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;google-github-actions/setup-gcloud@v2&lt;/span&gt;
      &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;&amp;gt;=&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;363.0.0'&lt;/span&gt;
        &lt;span class="na"&gt;project_id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.PROJECT_ID }}&lt;/span&gt;

    &lt;span class="c1"&gt;# Step to Configure Docker to use the gcloud command-line tool as a credential helper&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Configure Docker&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
        &lt;span class="s"&gt;gcloud auth configure-docker&lt;/span&gt;

    &lt;span class="c1"&gt;# Create composer environments&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Create composer environments&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
        &lt;span class="s"&gt;if ! gcloud composer environments list --project=${{ secrets.PROJECT_ID }} --locations=${{ env.REGION }} | grep -i ${{ env.COMPOSER_ENV_NAME }} &amp;amp;&amp;gt; /dev/null; then&lt;/span&gt;
            &lt;span class="s"&gt;gcloud composer environments create ${{ env.COMPOSER_ENV_NAME }} \&lt;/span&gt;
                &lt;span class="s"&gt;--project ${{ secrets.PROJECT_ID }} \&lt;/span&gt;
                &lt;span class="s"&gt;--location ${{ env.REGION }} \&lt;/span&gt;
                &lt;span class="s"&gt;--environment-size ${{ env.COMPOSER_ENV_SIZE }} \&lt;/span&gt;
                &lt;span class="s"&gt;--image-version ${{ env.COMPOSER_IMAGE_VERSION }} \&lt;/span&gt;
                &lt;span class="s"&gt;--service-account ${{ env.SERVICE_ACCOUNT_NAME }}@${{ secrets.PROJECT_ID }}.iam.gserviceaccount.com&lt;/span&gt;
        &lt;span class="s"&gt;else&lt;/span&gt;
            &lt;span class="s"&gt;echo "Composer environment ${{ env.COMPOSER_ENV_NAME }} already exists!"&lt;/span&gt;
        &lt;span class="s"&gt;fi&lt;/span&gt;

    &lt;span class="c1"&gt;# Create composer environments&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Create composer variable PROJECT_ID&lt;/span&gt; 
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
        &lt;span class="s"&gt;gcloud composer environments run ${{ env.COMPOSER_ENV_NAME }} \&lt;/span&gt;
        &lt;span class="s"&gt;--location ${{ env.REGION}} variables \&lt;/span&gt;
        &lt;span class="s"&gt;-- set PROJECT_ID ${{ secrets.PROJECT_ID }}&lt;/span&gt;

    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Create composer variable REGION&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;  
        &lt;span class="s"&gt;gcloud composer environments run ${{ env.COMPOSER_ENV_NAME }} \&lt;/span&gt;
          &lt;span class="s"&gt;--location ${{ env.REGION }} variables \&lt;/span&gt;
          &lt;span class="s"&gt;-- set REGION ${{ env.REGION }}&lt;/span&gt;

    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Create composer variable CLOUD_FUNCTION_1_NAME&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
        &lt;span class="s"&gt;gcloud composer environments run ${{ env.COMPOSER_ENV_NAME }}\&lt;/span&gt;
          &lt;span class="s"&gt;--location ${{ env.REGION }} variables \&lt;/span&gt;
          &lt;span class="s"&gt;-- set CLOUD_FUNCTION_1_NAME ${{ env.CLOUD_FUNCTION_1_NAME }}&lt;/span&gt;

    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Create composer variable CLOUD_FUNCTION_2_NAME&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
        &lt;span class="s"&gt;gcloud composer environments run ${{ env.COMPOSER_ENV_NAME }} \&lt;/span&gt;
        &lt;span class="s"&gt;--location ${{ env.REGION }} variables \&lt;/span&gt;
        &lt;span class="s"&gt;-- set CLOUD_FUNCTION_2_NAME ${{ env.CLOUD_FUNCTION_2_NAME }}&lt;/span&gt;

    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Create composer variable CLOUD_FUNCTION_3_NAME&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
        &lt;span class="s"&gt;gcloud composer environments run ${{ env.COMPOSER_ENV_NAME }} \&lt;/span&gt;
        &lt;span class="s"&gt;--location ${{ env.REGION }} variables \&lt;/span&gt;
        &lt;span class="s"&gt;-- set CLOUD_FUNCTION_3_NAME ${{ env.CLOUD_FUNCTION_3_NAME }}&lt;/span&gt;

    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Create composer variable BUCKET_DATALAKE&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
        &lt;span class="s"&gt;gcloud composer environments run ${{ env.COMPOSER_ENV_NAME }} \&lt;/span&gt;
        &lt;span class="s"&gt;--location ${{ env.REGION}} variables \&lt;/span&gt;
        &lt;span class="s"&gt;-- set BUCKET_NAME ${{ secrets.BUCKET_DATALAKE }}&lt;/span&gt;

    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Create composer variable TRANSIENT_FILE_PATH&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
        &lt;span class="s"&gt;gcloud composer environments run ${{ env.COMPOSER_ENV_NAME }} \&lt;/span&gt;
        &lt;span class="s"&gt;--location ${{ env.REGION }} variables \&lt;/span&gt;
        &lt;span class="s"&gt;-- set TRANSIENT_FILE_PATH ${{ env.TRANSIENT_FILE_PATH }}&lt;/span&gt;

    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Create composer variable BRONZE_PATH&lt;/span&gt; 
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
        &lt;span class="s"&gt;gcloud composer environments run ${{ env.COMPOSER_ENV_NAME }} \&lt;/span&gt;
        &lt;span class="s"&gt;--location ${{ env.REGION }} variables \&lt;/span&gt;
        &lt;span class="s"&gt;-- set BRONZE_PATH ${{ env.BRONZE_PATH }}&lt;/span&gt;

    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Create composer variable SILVER_PATH&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
        &lt;span class="s"&gt;gcloud composer environments run ${{ env.COMPOSER_ENV_NAME }} \&lt;/span&gt;
        &lt;span class="s"&gt;--location ${{ env.REGION }} variables \&lt;/span&gt;
        &lt;span class="s"&gt;-- set SILVER_PATH ${{ env.SILVER_PATH }}&lt;/span&gt;

    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Create composer variable REGION_PROJECT_ID&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
        &lt;span class="s"&gt;gcloud composer environments run ${{ env.COMPOSER_ENV_NAME }} \&lt;/span&gt;
        &lt;span class="s"&gt;--location ${{ env.REGION }} variables \&lt;/span&gt;
        &lt;span class="s"&gt;-- set REGION_PROJECT_ID "${{ env.REGION }}-${{ secrets.PROJECT_ID }}"&lt;/span&gt;


    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Create composer variable BIGQUERY_DATASET&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
        &lt;span class="s"&gt;gcloud composer environments run ${{ env.COMPOSER_ENV_NAME }} \&lt;/span&gt;
        &lt;span class="s"&gt;--location ${{ env.REGION }} variables \&lt;/span&gt;
        &lt;span class="s"&gt;-- set BIGQUERY_DATASET "${{ env.BIGQUERY_DATASET }}"&lt;/span&gt;

    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Create composer variable BIGQUERY_TABLE_CUSTOMER&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
        &lt;span class="s"&gt;gcloud composer environments run ${{ env.COMPOSER_ENV_NAME }} \&lt;/span&gt;
        &lt;span class="s"&gt;--location ${{ env.REGION }} variables \&lt;/span&gt;
        &lt;span class="s"&gt;-- set BIGQUERY_TABLE_CUSTOMER "${{ env.BIGQUERY_TABLE_CUSTOMER }}"&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  deploy-composer-http-connection
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;deploy-composer-http-connection&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;needs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;enable-services&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;deploy-buckets&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;deploy-cloud-function&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;deploy-composer-service-account&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;deploy-bigquery-dataset-bigquery-tables&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;deploy-composer-environment&lt;/span&gt; &lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-22.04&lt;/span&gt;

    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Checkout&lt;/span&gt;
      &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;

    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Authorize GCP&lt;/span&gt;
      &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;google-github-actions/auth@v2'&lt;/span&gt;
      &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;credentials_json&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  &lt;span class="s"&gt;${{ secrets.GCP_DEVOPS_SA_KEY }}&lt;/span&gt;

    &lt;span class="c1"&gt;# Step to Authenticate with GCP&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Set up Cloud SDK&lt;/span&gt;
      &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;google-github-actions/setup-gcloud@v2&lt;/span&gt;
      &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;&amp;gt;=&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;363.0.0'&lt;/span&gt;
        &lt;span class="na"&gt;project_id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.PROJECT_ID }}&lt;/span&gt;

    &lt;span class="c1"&gt;# Step to Configure Docker to use the gcloud command-line tool as a credential helper&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Configure Docker&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
        &lt;span class="s"&gt;gcloud auth configure-docker&lt;/span&gt;

    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Create composer http connection HTTP_CONNECTION&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
        &lt;span class="s"&gt;HOST="https://${{ env.REGION }}-${{ secrets.PROJECT_ID }}.cloudfunctions.net"&lt;/span&gt;
        &lt;span class="s"&gt;gcloud composer environments run ${{ env.COMPOSER_ENV_NAME }} \&lt;/span&gt;
        &lt;span class="s"&gt;--location ${{ env.REGION }} connections \&lt;/span&gt;
        &lt;span class="s"&gt;-- add ${{ env.HTTP_CONNECTION }} \&lt;/span&gt;
        &lt;span class="s"&gt;--conn-type ${{ env.CONNECTION_TYPE  }} \&lt;/span&gt;
        &lt;span class="s"&gt;--conn-host $HOST&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
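
&lt;p&gt;If you want the workflow to confirm that the connection was registered, a small verification step can be appended to this job. The sketch below reuses the environment variables already defined in the workflow and assumes the Airflow 2 CLI, where &lt;code&gt;connections list&lt;/code&gt; is available:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;- name: List composer connections (verification sketch)
  run: |-
    gcloud composer environments run ${{ env.COMPOSER_ENV_NAME }} \
    --location ${{ env.REGION }} connections \
    -- list
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;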



&lt;h3&gt;
  
  
  deploy-dags
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;deploy-dags&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;needs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;enable-services&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;deploy-buckets&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;deploy-cloud-function&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;deploy-composer-service-account&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;deploy-bigquery-dataset-bigquery-tables&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;deploy-composer-environment&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;deploy-composer-http-connection&lt;/span&gt; &lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-22.04&lt;/span&gt;

    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Checkout&lt;/span&gt;
      &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;

    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Authorize GCP&lt;/span&gt;
      &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;google-github-actions/auth@v2'&lt;/span&gt;
      &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;credentials_json&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  &lt;span class="s"&gt;${{ secrets.GCP_DEVOPS_SA_KEY }}&lt;/span&gt;

    &lt;span class="c1"&gt;# Step to Authenticate with GCP&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Set up Cloud SDK&lt;/span&gt;
      &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;google-github-actions/setup-gcloud@v2&lt;/span&gt;
      &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;&amp;gt;=&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;363.0.0'&lt;/span&gt;
        &lt;span class="na"&gt;project_id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.PROJECT_ID }}&lt;/span&gt;

    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Get Composer bucket name and Deploy DAG to Composer&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
        &lt;span class="s"&gt;COMPOSER_BUCKET=$(gcloud composer environments describe ${{ env.COMPOSER_ENV_NAME }} \&lt;/span&gt;
        &lt;span class="s"&gt;--location ${{ env.REGION }} \&lt;/span&gt;
        &lt;span class="s"&gt;--format="value(config.dagGcsPrefix)")&lt;/span&gt;
        &lt;span class="s"&gt;gsutil -m cp -r ./dags/* $COMPOSER_BUCKET/dags/&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
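
&lt;p&gt;After the copy, the DAG should appear in the Composer environment within a few minutes. As a sketch, again assuming the Airflow 2 CLI, a follow-up step could list the registered DAGs to confirm the deployment:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;- name: List DAGs (verification sketch)
  run: |-
    gcloud composer environments run ${{ env.COMPOSER_ENV_NAME }} \
    --location ${{ env.REGION }} dags \
    -- list
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;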



&lt;h3&gt;
  
  
  Resources Created After Deployment
&lt;/h3&gt;

&lt;p&gt;After the deployment process is complete, the following resources will be available:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Cloud Storage Buckets&lt;/strong&gt;: Organized into Bronze, Silver, and Gold layers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cloud Functions&lt;/strong&gt;: Responsible for processing data across the three layers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Service Account for Composer&lt;/strong&gt;: With appropriate permissions for invoking Cloud Functions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;BigQuery Dataset and Tables&lt;/strong&gt;: A dataset and tables created for storing processed data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Google Composer Environment&lt;/strong&gt;: Orchestrates the Cloud Functions with daily executions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Composer DAG&lt;/strong&gt;: The DAG manages the workflow that invokes Cloud Functions and processes data.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;This solution demonstrates how to leverage Google Cloud Functions, Composer, and BigQuery to create a robust three-tier data processing pipeline. The automation using GitHub Actions ensures a smooth, reproducible deployment process, making it easier to manage cloud-based data pipelines at scale.&lt;/p&gt;

&lt;h3&gt;
  
  
  References
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Google Cloud Platform Documentation&lt;/strong&gt;: &lt;a href="https://cloud.google.com/docs" rel="noopener noreferrer"&gt;https://cloud.google.com/docs&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GitHub Actions Documentation&lt;/strong&gt;: &lt;a href="https://docs.github.com/en/actions" rel="noopener noreferrer"&gt;https://docs.github.com/en/actions&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Google Composer Documentation&lt;/strong&gt;: &lt;a href="https://cloud.google.com/composer/docs" rel="noopener noreferrer"&gt;https://cloud.google.com/composer/docs&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cloud Functions Documentation&lt;/strong&gt;: &lt;a href="https://cloud.google.com/functions/docs" rel="noopener noreferrer"&gt;https://cloud.google.com/functions/docs&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GitHub Repo&lt;/strong&gt;: &lt;a href="https://github.com/jader-lima/gcp-cloud-functions-to-bigquery" rel="noopener noreferrer"&gt;https://github.com/jader-lima/gcp-cloud-functions-to-bigquery&lt;/a&gt;
&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>gcp</category>
      <category>python</category>
      <category>airflow</category>
      <category>serverless</category>
    </item>
    <item>
      <title>Using Cloud Functions and Cloud Schedule to process data with Google Dataflow</title>
      <dc:creator>Jader Lima</dc:creator>
      <pubDate>Sun, 22 Sep 2024 21:31:29 +0000</pubDate>
      <link>https://dev.to/jader_lima_b72a63be5bbddc/using-cloud-functions-and-cloud-schedule-to-process-data-with-google-dataflow-1p1k</link>
      <guid>https://dev.to/jader_lima_b72a63be5bbddc/using-cloud-functions-and-cloud-schedule-to-process-data-with-google-dataflow-1p1k</guid>
      <description>&lt;h2&gt;
  
  
  GCP DataFlow Function Schedule
&lt;/h2&gt;

&lt;h2&gt;
  
  
  Overview
&lt;/h2&gt;

&lt;p&gt;This project showcases the integration of Google Cloud services, specifically Dataflow, Cloud Functions, and Cloud Scheduler, to create a highly scalable, cost-effective, and easy-to-maintain data processing solution. It demonstrates how you can automate data pipelines, perform seamless integration with other GCP services like BigQuery, and manage workflows efficiently through CI/CD pipelines with GitHub Actions. This setup provides flexibility, reduces manual intervention, and ensures that the data processing workflows run smoothly and consistently.&lt;/p&gt;

&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Technologies Used&lt;/li&gt;
&lt;li&gt;Features&lt;/li&gt;
&lt;li&gt;Architecture Diagram&lt;/li&gt;
&lt;li&gt;
Getting Started

&lt;ul&gt;
&lt;li&gt;Prerequisites&lt;/li&gt;
&lt;li&gt;Setup Instructions&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

Deploying the Project

&lt;ul&gt;
&lt;li&gt;Workflow YAML Explanation&lt;/li&gt;
&lt;li&gt;Workflow Job Steps&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Resources Created After Deployment&lt;/li&gt;

&lt;li&gt;Conclusion&lt;/li&gt;

&lt;li&gt;Documentation Links&lt;/li&gt;

&lt;/ul&gt;

&lt;h2&gt;
  
  
  Technologies Used
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Google Dataflow&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Google Dataflow is a fully managed service for stream and batch data processing, which is built on Apache Beam. It allows for the creation of highly efficient, low-latency, and cost-effective data pipelines. Dataflow can handle large-scale data processing tasks, making it ideal for use cases like real-time analytics and ETL jobs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cloud Storage&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Google Cloud Storage is a scalable, durable, and secure object storage service designed to handle large volumes of unstructured data. It is ideal for use in big data analysis, backups, and content distribution, offering high availability and low latency across the globe.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cloud Functions&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Google Cloud Functions is a serverless execution environment that allows you to run code in response to events. In this project, Cloud Functions are used to trigger Dataflow jobs and manage workflow automation efficiently with minimal operational overhead.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cloud Scheduler&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Google Cloud Scheduler is a fully managed cron job service that allows you to schedule tasks or trigger cloud services at specific intervals. It’s used in this project to automate the execution of the Cloud Functions, ensuring that Dataflow jobs run as needed without manual intervention.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;CI/CD Process with GitHub Actions&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
GitHub Actions enables continuous integration and continuous delivery (CI/CD) workflows directly from your GitHub repository. In this project, it is used to automate the build, testing, and deployment of resources to Google Cloud, ensuring consistent and reliable deployments.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;GitHub Secrets and Configuration&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
GitHub Secrets securely store sensitive information such as API keys, service account credentials, and configuration settings required for deployment. By keeping these details secure, the risk of leaks and unauthorized access is minimized.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Features
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Ingest and transform data from Google Cloud Storage using Google Dataflow.&lt;/li&gt;
&lt;li&gt;Encapsulate the Dataflow process into a reusable Dataflow template.&lt;/li&gt;
&lt;li&gt;Create a Cloud Function that executes the Dataflow template through a REST API.&lt;/li&gt;
&lt;li&gt;Automate the execution of the Cloud Function using Cloud Scheduler.&lt;/li&gt;
&lt;li&gt;Implement a CI/CD pipeline with GitHub Actions for automated deployments.&lt;/li&gt;
&lt;li&gt;Incorporate comprehensive error handling and logging for reliable data processing.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Architecture Diagram
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3883w42kkahmkz27yri5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3883w42kkahmkz27yri5.png" alt="architecture" width="800" height="483"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;p&gt;Before getting started, ensure you have the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A Google Cloud account with billing enabled.&lt;/li&gt;
&lt;li&gt;A GitHub account.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Setup Instructions
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Clone the Repository&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   git clone https://github.com/jader-lima/gcp-dataflow-function-schedule.git
   &lt;span class="nb"&gt;cd &lt;/span&gt;gcp-dataproc-bigquery-workflow-template
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Set Up Google Cloud Environment
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Create a Google Cloud Storage bucket&lt;/strong&gt; to store your data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Set up a BigQuery dataset&lt;/strong&gt; where your data will be ingested.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Create a Dataproc cluster&lt;/strong&gt; for processing.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Create a new service account for deployment purposes
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Create the Service Account:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud iam service-accounts create devops-dataops-sa &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"Service account for DevOps and DataOps tasks"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--display-name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"DevOps DataOps Service Account"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Grant Storage Access Permissions (Buckets): Storage Admin (roles/storage.admin): 
Grants permissions to create, list, and manipulate buckets and files.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud projects add-iam-policy-binding &lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--member&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"serviceAccount:devops-dataops-sa@&lt;/span&gt;&lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt;&lt;span class="s2"&gt;.iam.gserviceaccount.com"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"roles/storage.admin"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Grant Dataflow Permissions: 
Dataflow Admin (roles/dataflow.admin): To create, run, and manage Dataflow jobs. 
Dataflow Developer (roles/dataflow.developer): Allows the development and submission of Dataflow jobs.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud projects add-iam-policy-binding &lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--member&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"serviceAccount:devops-dataops-sa@&lt;/span&gt;&lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt;&lt;span class="s2"&gt;.iam.gserviceaccount.com"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"roles/dataflow.admin"&lt;/span&gt;

gcloud projects add-iam-policy-binding &lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--member&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"serviceAccount:devops-dataops-sa@&lt;/span&gt;&lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt;&lt;span class="s2"&gt;.iam.gserviceaccount.com"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"roles/dataflow.developer"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Permissions to Create and Manage Cloud Functions and Cloud Scheduler: 
Cloud Functions Admin (roles/cloudfunctions.admin): To create and manage Cloud Functions. 
Cloud Scheduler Admin (roles/cloudscheduler.admin): To create and manage Cloud Scheduler jobs.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud projects add-iam-policy-binding &lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--member&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"serviceAccount:devops-dataops-sa@&lt;/span&gt;&lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt;&lt;span class="s2"&gt;.iam.gserviceaccount.com"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"roles/cloudfunctions.admin"&lt;/span&gt;

gcloud projects add-iam-policy-binding &lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--member&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"serviceAccount:devops-dataops-sa@&lt;/span&gt;&lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt;&lt;span class="s2"&gt;.iam.gserviceaccount.com"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"roles/cloudscheduler.admin"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Grant Permissions to Manage Service Accounts: 
IAM Service Account Admin (roles/iam.serviceAccountAdmin): To create and manage other service accounts. 
IAM Service Account User (roles/iam.serviceAccountUser): To use service accounts in different services.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud projects add-iam-policy-binding &lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--member&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"serviceAccount:devops-dataops-sa@&lt;/span&gt;&lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt;&lt;span class="s2"&gt;.iam.gserviceaccount.com"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"roles/iam.serviceAccountAdmin"&lt;/span&gt;

gcloud projects add-iam-policy-binding &lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--member&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"serviceAccount:devops-dataops-sa@&lt;/span&gt;&lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt;&lt;span class="s2"&gt;.iam.gserviceaccount.com"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"roles/iam.serviceAccountUser"&lt;/span&gt;


gcloud projects add-iam-policy-binding &lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--member&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"serviceAccount:devops-dataops-sa@&lt;/span&gt;&lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt;&lt;span class="s2"&gt;.iam.gserviceaccount.com"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"roles/serviceusage.serviceUsageAdmin"&lt;/span&gt;

gcloud projects add-iam-policy-binding &lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--member&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"serviceAccount:devops-dataops-sa@&lt;/span&gt;&lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt;&lt;span class="s2"&gt;.iam.gserviceaccount.com"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"roles/resourcemanager.projectIamAdmin"&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Permission to enable API services:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud projects add-iam-policy-binding &lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--member&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"serviceAccount:devops-dataops-sa@&lt;/span&gt;&lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt;&lt;span class="s2"&gt;.iam.gserviceaccount.com"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"roles/cloudscheduler.admin"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Additional Permissions (Optional): 
Compute Admin (roles/compute.admin): If your pipeline needs to create compute resources (e.g., virtual machine instances). Viewer (roles/viewer): 
To ensure the account can view other resources in the project.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud projects add-iam-policy-binding &lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--member&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"serviceAccount:devops-dataops-sa@&lt;/span&gt;&lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt;&lt;span class="s2"&gt;.iam.gserviceaccount.com"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"roles/compute.admin"&lt;/span&gt;

gcloud projects add-iam-policy-binding &lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--member&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"serviceAccount:devops-dataops-sa@&lt;/span&gt;&lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt;&lt;span class="s2"&gt;.iam.gserviceaccount.com"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"roles/viewer"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
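
&lt;p&gt;To double-check which roles ended up bound to the service account, the project IAM policy can be inspected. The step below is only an illustrative sketch (the same command can be run locally in Cloud Shell with &lt;code&gt;$PROJECT_ID&lt;/code&gt; instead of the secret) and assumes the &lt;code&gt;devops-dataops-sa&lt;/code&gt; account created above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;- name: Check granted roles (verification sketch)
  run: |-
    gcloud projects get-iam-policy ${{ secrets.PROJECT_ID }} \
    --flatten="bindings[].members" \
    --filter="bindings.members:devops-dataops-sa@${{ secrets.PROJECT_ID }}.iam.gserviceaccount.com" \
    --format="table(bindings.role)"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;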



&lt;h2&gt;
  
  
  Configure Environment Variables and Secrets
&lt;/h2&gt;

&lt;p&gt;Ensure the following environment variables are set in your deployment configuration or within GitHub Secrets:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;GCP_BUCKET_BIGDATA_FILES&lt;/code&gt;: Secret used to store the name of the Cloud Storage bucket&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;GCP_BUCKET_DATALAKE&lt;/code&gt;: Secret used to store the name of the Cloud Storage bucket&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;GCP_BUCKET_DATAPROC&lt;/code&gt;: Secret used to store the name of the Cloud Storage bucket&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;GCP_BUCKET_TEMP_BIGQUERY&lt;/code&gt;: Secret used to store the name of the Cloud Storage bucket&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;GCP_DEVOPS_SA_KEY&lt;/code&gt;: Secret used to store the service account key (JSON) used for deployment. For this project, the default service key was used.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;GCP_SERVICE_ACCOUNT&lt;/code&gt;: Secret used to store the service account e-mail.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;PROJECT_ID&lt;/code&gt;: Secret used to store the project ID&lt;/li&gt;
&lt;/ul&gt;


&lt;h3&gt;
  
  
  Creating a GitHub secret
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;To create a new secret:

&lt;ol&gt;
&lt;li&gt;In the project repository, open the &lt;strong&gt;Settings&lt;/strong&gt; menu&lt;/li&gt;
&lt;li&gt;under
&lt;strong&gt;Security&lt;/strong&gt;, &lt;/li&gt;
&lt;li&gt;go to
&lt;strong&gt;Secrets and variables&lt;/strong&gt;, then click &lt;strong&gt;Actions&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;click
&lt;strong&gt;New repository secret&lt;/strong&gt;, and enter a &lt;strong&gt;name&lt;/strong&gt; and &lt;strong&gt;value&lt;/strong&gt; for the secret.&lt;/li&gt;
&lt;/ol&gt;


&lt;/li&gt;

&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi45cicz0q89ije7j70yf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi45cicz0q89ije7j70yf.png" alt="github secret creation" width="800" height="609"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For more details, see:&lt;br&gt;
&lt;a href="https://docs.github.com/pt/actions/security-for-github-actions/security-guides/using-secrets-in-github-actions" rel="noopener noreferrer"&gt;https://docs.github.com/pt/actions/security-for-github-actions/security-guides/using-secrets-in-github-actions&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Deploying the project &lt;a&gt;&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;Whenever a push to the main branch occurs, GitHub Actions triggers and runs the workflow defined in the YAML file. The workflow contains five jobs, described in detail below. In essence, GitHub Actions uses the service account credentials to authenticate with Google Cloud and execute the necessary steps as described in the YAML file.&lt;/p&gt;
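
&lt;p&gt;The trigger itself is declared at the top of the workflow file. As a minimal sketch (the workflow in the repository may also define path filters or other options), it looks like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;on:
  push:
    branches:
      - main
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;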

&lt;h2&gt;
  
  
  Workflow File YAML Explanation&lt;a&gt;&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;Environment variables needed&lt;br&gt;
The workflow defines environment variables for basic settings such as bucket paths, process names, and workflow steps.&lt;br&gt;
If new steps or new scripts are added to the workflow, new variables can easily be added, as shown below:&lt;/p&gt;
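
&lt;p&gt;For example (the variable name and value below are placeholders, not values from the repository):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;env:
  MY_VAR_NAME: my_var_value

# referenced later in the workflow as ${{ env.MY_VAR_NAME }}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;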

&lt;h2&gt;
  
  
  Workflow Job Steps &lt;a&gt;&lt;/a&gt;
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;enable-services&lt;/strong&gt;:&lt;br&gt;
This step enables the necessary APIs for Cloud Functions, Dataflow, and the build process.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;deploy-buckets&lt;/strong&gt;:&lt;br&gt;
This step creates Google Cloud Storage buckets and copies the required data files and scripts into them.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;build-dataflow-classic-template&lt;/strong&gt;:&lt;br&gt;
Builds and stores a Dataflow template in a Cloud Storage bucket for future execution.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;deploy-cloud-function&lt;/strong&gt;:&lt;br&gt;
Deploys a Cloud Function that triggers the execution of the Dataflow template using the google-api-python-client library.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;deploy-cloud-schedule&lt;/strong&gt;:&lt;br&gt;
Creates a Cloud Scheduler job to automate the execution of the Cloud Function, ensuring data is processed at defined intervals (a sketch of such a job follows this list).&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
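
&lt;p&gt;As referenced in the last item above, the Cloud Scheduler job is what turns the Cloud Function into a recurring process. The step below is only an illustrative sketch: the job name, schedule, function URL, and service account e-mail are placeholders, not the values used in the repository.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;- name: Create Cloud Scheduler job (illustrative sketch)
  run: |-
    gcloud scheduler jobs create http trigger-dataflow-function \
    --schedule="0 12 * * *" \
    --uri="https://us-east1-my-project.cloudfunctions.net/my-function" \
    --http-method=POST \
    --oidc-service-account-email=my-sa@my-project.iam.gserviceaccount.com
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;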

&lt;h2&gt;
  
  
  Resources Created After Deployment
&lt;/h2&gt;

&lt;p&gt;Upon deployment, the following resources are created:&lt;/p&gt;

&lt;h3&gt;
  
  
  Google Cloud Storage Bucket
&lt;/h3&gt;

&lt;p&gt;A Cloud Storage bucket to store data and templates.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5au57kwtdazwa8qovvkl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5au57kwtdazwa8qovvkl.png" alt="buckets" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;CSV files of the Olist dataset, stored in the transient layer of the data lake.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwcby7lbjl2aqly7fvfgl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwcby7lbjl2aqly7fvfgl.png" alt="bucket transient" width="800" height="393"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;CSV file created after Dataflow processing; this file can be used in analysis tools, spreadsheets, databases, etc.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjdfb9ebtdgp69n6syqwd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjdfb9ebtdgp69n6syqwd.png" alt="bucket silver" width="800" height="277"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Dataflow Classic Template
&lt;/h3&gt;

&lt;p&gt;A reusable Dataflow template stored in Cloud Storage.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjj0payqwvuca6pq1vyq1.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjj0payqwvuca6pq1vyq1.JPG" alt="dataproc-workflow3" width="800" height="375"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Cloud Scheduler Job
&lt;/h3&gt;

&lt;p&gt;Automated scheduled jobs for Dataflow executions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx74p6oys3q5n66vx2ojf.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx74p6oys3q5n66vx2ojf.JPG" alt="Cloud Schedule" width="800" height="200"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion&lt;a&gt;&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;This project demonstrates how to leverage Google Cloud services like Dataflow, Cloud Functions, and Cloud Scheduler to create a fully automated and scalable data processing pipeline. The integration with GitHub Actions ensures continuous deployment, while the use of Cloud Functions and Scheduler provides flexibility and automation, minimizing operational overhead. This setup is versatile and can be easily extended to incorporate additional GCP services such as BigQuery.&lt;/p&gt;

&lt;p&gt;Links and References&lt;br&gt;
&lt;a href="https://github.com/jader-lima/gcp-dataflow-function-schedule" rel="noopener noreferrer"&gt;GitHub Repo&lt;/a&gt;&lt;br&gt;
&lt;a href="https://cloud.google.com/functions/docs" rel="noopener noreferrer"&gt;Cloud Functions&lt;/a&gt;&lt;br&gt;
&lt;a href="https://cloud.google.com/dataflow/docs" rel="noopener noreferrer"&gt;DataFlow&lt;/a&gt;&lt;br&gt;
&lt;a href="https://docs.github.com/en/actions" rel="noopener noreferrer"&gt;Cloud Scheduler&lt;/a&gt;&lt;/p&gt;

</description>
      <category>gcp</category>
      <category>dataflow</category>
      <category>cloudfunctions</category>
      <category>cloudscheduler</category>
    </item>
    <item>
      <title>Loading data to Google Big Query using Dataproc workflow templates and cloud Schedule</title>
      <dc:creator>Jader Lima</dc:creator>
      <pubDate>Fri, 06 Sep 2024 23:29:22 +0000</pubDate>
      <link>https://dev.to/jader_lima_b72a63be5bbddc/loading-data-to-google-big-query-using-dataproc-workflow-templates-and-cloud-schedule-4l4</link>
      <guid>https://dev.to/jader_lima_b72a63be5bbddc/loading-data-to-google-big-query-using-dataproc-workflow-templates-and-cloud-schedule-4l4</guid>
      <description>&lt;h2&gt;
  
  
  Overview
&lt;/h2&gt;

&lt;p&gt;This project is designed to facilitate the ingestion of data from Google Cloud Storage into BigQuery using Apache PySpark on Google Dataproc. Furthermore, it utilizes Google Cloud Scheduler for automated execution and GitHub Actions for seamless deployment. &lt;/p&gt;

&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Technologies Used&lt;/li&gt;
&lt;li&gt;Features&lt;/li&gt;
&lt;li&gt;Architecture Diagram&lt;/li&gt;
&lt;li&gt;
Getting Started

&lt;ul&gt;
&lt;li&gt;Prerequisites&lt;/li&gt;
&lt;li&gt;Setup Instructions&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

Deploying the project

&lt;ul&gt;
&lt;li&gt;Workflow File YAML Explanation&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Resources Created After Deployment&lt;/li&gt;

&lt;li&gt;Conclusion&lt;/li&gt;

&lt;/ul&gt;

&lt;h2&gt;
  
  
  Technologies Used
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Google Dataproc&lt;/strong&gt;&lt;br&gt;
Google Dataproc is a fully managed cloud service designed to simplify running Apache Spark and Hadoop clusters in the Google Cloud ecosystem. It provides a fast and scalable way to process large datasets while integrating seamlessly with other Google Cloud services, such as Cloud Storage, to minimize operational overhead and ensure efficient big data processing.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cloud Storage&lt;/strong&gt;&lt;br&gt;
Google Cloud Storage is a scalable and secure object storage service for storing large amounts of unstructured data. It offers high availability and strong global consistency, making it suitable for a wide range of scenarios, such as data backups, big data analytics, and content distribution.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Workflow Templates&lt;/strong&gt;&lt;br&gt;
Workflow templates in Google Cloud simplify the definition and management of complex processes involving multiple cloud services. This feature helps in scheduling and executing intricate workflows, optimizing resource management, and automating multi-step tasks across different services.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cloud Scheduler&lt;/strong&gt;&lt;br&gt;
Google Cloud Scheduler is a fully managed service for running scheduled jobs with no infrastructure to manage. It can be used to automate workflows, run reports, or trigger specific cloud services at defined intervals.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;CI/CD Process with GitHub Actions&lt;/strong&gt;&lt;br&gt;
Implementing a CI/CD pipeline with GitHub Actions automates the build, test, and deployment stages of your project. In this workflow, GitHub Actions triggers deployment to Google Cloud every time code changes are pushed to the repository, ensuring a consistent and accurate deployment process with minimal manual intervention.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;GitHub Secrets and Configuration&lt;/strong&gt;&lt;br&gt;
GitHub Secrets is essential for maintaining the security of sensitive information like API keys, service account credentials, and other configuration data. By securely storing these details outside your source code, you mitigate the risk of unauthorized access and potential leaks.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Google BigQuery&lt;/strong&gt;&lt;br&gt;
A fully managed, scalable data warehouse that enables lightning-fast SQL queries and supports large-scale analytics across terabytes of data.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Features
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Ingest data from Google Cloud Storage into BigQuery using Dataproc with PySpark.&lt;/li&gt;
&lt;li&gt;Utilize Cloud Scheduler to automate the execution of data ingestion workflows.&lt;/li&gt;
&lt;li&gt;Implement CI/CD for automated deployment through GitHub Actions.&lt;/li&gt;
&lt;li&gt;Comprehensive error handling and logging for reliable data processing.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Architecture Diagram
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8q2nw194pekf1nfkctbr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8q2nw194pekf1nfkctbr.png" alt="Architecture Diagram" width="800" height="451"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;p&gt;Before you begin, ensure you have the following prerequisites set up:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Google Cloud account with billing enabled.&lt;/li&gt;
&lt;li&gt;GitHub account.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Setup Instructions
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Clone the Repository&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   git clone https://github.com/jader-lima/gcp-dataproc-bigquery-workflow-template.git
   &lt;span class="nb"&gt;cd &lt;/span&gt;gcp-dataproc-bigquery-workflow-template
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Set Up Google Cloud Environment
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Create a Google Cloud Storage bucket&lt;/strong&gt; to store your data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Set up a BigQuery dataset&lt;/strong&gt; where your data will be ingested.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Create a Dataproc cluster&lt;/strong&gt; for processing.&lt;/li&gt;
&lt;/ol&gt;
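
&lt;p&gt;If you prefer to script these prerequisites instead of creating them in the console, a rough sketch is shown below as a workflow-style step (the same commands can be run in Cloud Shell); the bucket, dataset, and cluster names here are placeholders, not the ones used by this project:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;- name: Provision prerequisites (illustrative sketch)
  run: |-
    gcloud storage buckets create gs://my-datalake-bucket --location=us-east1
    bq mk --dataset my-project:my_dataset
    gcloud dataproc clusters create my-cluster --region=us-east1 --single-node
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;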

&lt;h2&gt;
  
  
  Configure Environment Variables
&lt;/h2&gt;

&lt;p&gt;Ensure the following environment variables are set in your deployment configuration or within GitHub Secrets:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;GCP_BUCKET_BIGDATA_FILES&lt;/code&gt;: Secret used to store the name of the Cloud Storage bucket&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;GCP_BUCKET_DATALAKE&lt;/code&gt;: Secret used to store the name of the Cloud Storage bucket&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;GCP_BUCKET_DATAPROC&lt;/code&gt;: Secret used to store the name of the Cloud Storage bucket&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;GCP_BUCKET_TEMP_BIGQUERY&lt;/code&gt;: Secret used to store the name of the Cloud Storage bucket&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;GCP_SA_KEY&lt;/code&gt;: Secret used to store the service account key (JSON) used for deployment. For this project, the default service key was used.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;GCP_SERVICE_ACCOUNT&lt;/code&gt;: Secret used to store the service account e-mail.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;PROJECT_ID&lt;/code&gt;: Secret used to store the project ID&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Creating a GitHub secret
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;To create a new secret:

&lt;ol&gt;
&lt;li&gt;In the project repository, open the &lt;strong&gt;Settings&lt;/strong&gt; menu&lt;/li&gt;
&lt;li&gt;under
&lt;strong&gt;Security&lt;/strong&gt;, &lt;/li&gt;
&lt;li&gt;go to
&lt;strong&gt;Secrets and variables&lt;/strong&gt;, then click &lt;strong&gt;Actions&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;click
&lt;strong&gt;New repository secret&lt;/strong&gt;, and enter a &lt;strong&gt;name&lt;/strong&gt; and &lt;strong&gt;value&lt;/strong&gt; for the secret.&lt;/li&gt;
&lt;/ol&gt;


&lt;/li&gt;

&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi45cicz0q89ije7j70yf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi45cicz0q89ije7j70yf.png" alt="github secret creation" width="800" height="609"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For more details, see:&lt;br&gt;
&lt;a href="https://docs.github.com/pt/actions/security-for-github-actions/security-guides/using-secrets-in-github-actions" rel="noopener noreferrer"&gt;https://docs.github.com/pt/actions/security-for-github-actions/security-guides/using-secrets-in-github-actions&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Deploying the project &lt;a&gt;&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;Whenever a push to the main branch occurs, GitHub Actions will trigger and run the YAML script. The script contains four jobs, described in detail below. In essence, GitHub Actions uses the service account credentials to authenticate with Google Cloud and execute the necessary steps as described in the YAML file.&lt;/p&gt;
&lt;h2&gt;
  
  
  Workflow File YAML Explanation&lt;a&gt;&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;Environment variables needed&lt;br&gt;
The workflow defines environment variables for basic settings such as cluster characteristics, bucket paths, process names, and workflow steps.&lt;br&gt;
If new steps or new scripts are added to the workflow, new variables can easily be added, as shown below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;MY_VAR_NAME &lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my_var_value&lt;/span&gt; 
&lt;span class="s"&gt;${{ env.MY_VAR_NAME}}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;REGION&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;us-east1&lt;/span&gt;
    &lt;span class="na"&gt;ZONE&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;us-east1-b&lt;/span&gt;
    &lt;span class="na"&gt;DATAPROC_CLUSTER_NAME&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dataproc-bigdata-multi-node-cluster&lt;/span&gt;
    &lt;span class="na"&gt;DATAPROC_WORKER_TYPE&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;n2-standard-2&lt;/span&gt;
    &lt;span class="na"&gt;DATAPROC_MASTER_TYPE&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;n2-standard-2&lt;/span&gt;
    &lt;span class="na"&gt;DATAPROC_NUM_WORKERS&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;
    &lt;span class="na"&gt;DATAPROC_IMAGE_VERSION&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;2.1-debian11&lt;/span&gt;
    &lt;span class="na"&gt;DATAPROC_WORKER_NUM_LOCAL_SSD&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
    &lt;span class="na"&gt;DATAPROC_MASTER_NUM_LOCAL_SSD&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
    &lt;span class="na"&gt;DATAPROC_MASTER_BOOT_DISK_SIZE&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;32&lt;/span&gt;   
    &lt;span class="na"&gt;DATAPROC_WORKER_DISK_SIZE&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;32&lt;/span&gt;
    &lt;span class="na"&gt;DATAPROC_MASTER_BOOT_DISK_TYPE&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pd-balanced&lt;/span&gt;
    &lt;span class="na"&gt;DATAPROC_WORKER_BOOT_DISK_TYPE&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pd-balanced&lt;/span&gt;
    &lt;span class="na"&gt;DATAPROC_COMPONENTS&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;JUPYTER&lt;/span&gt;
    &lt;span class="na"&gt;DATAPROC_WORKFLOW_NAME&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;report_olist_order_items&lt;/span&gt;
    &lt;span class="na"&gt;BIGQUERY_DATASET&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;olist&lt;/span&gt;
    &lt;span class="na"&gt;BIGQUERY_TABLE_ORDER_ITEMS&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;order_items_report&lt;/span&gt;
    &lt;span class="na"&gt;BRONZE_DATALAKE_FILES&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;bronze&lt;/span&gt;
    &lt;span class="na"&gt;TRANSIENT_DATALAKE_FILES&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;transient&lt;/span&gt;
    &lt;span class="na"&gt;BUCKET_DATALAKE_FOLDER&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;transient&lt;/span&gt;
    &lt;span class="na"&gt;BUCKET_BIGDATA_JAR_FOLDER&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;jars&lt;/span&gt;
    &lt;span class="na"&gt;BUCKET_BIGDATA_SCRIPT_FOLDER&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;scripts&lt;/span&gt;
    &lt;span class="na"&gt;BUCKET_BIGDATA_PYSPARK_FOLDER&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pyspark&lt;/span&gt;
    &lt;span class="na"&gt;BUCKET_BIGDATA_PYSPARK_INGESTION&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ingestion&lt;/span&gt;
    &lt;span class="na"&gt;BUCKET_BIGDATA_PYSPARK_ENRICHMENT&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;enrichment/&lt;/span&gt;
    &lt;span class="na"&gt;DATAPROC_APP_NAME&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ingestion_countries_csv_to_delta&lt;/span&gt; 
    &lt;span class="na"&gt;JAR_LIB1&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;delta-core_2.12-2.3.0.jar&lt;/span&gt;
    &lt;span class="na"&gt;JAR_LIB2&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;delta-storage-2.3.0.jar&lt;/span&gt; 
    &lt;span class="na"&gt;APP_NAME&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;countries_ingestion_csv_to_delta'&lt;/span&gt;
    &lt;span class="na"&gt;PYSPARK_INGESTION_SCRIPT&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ingestion_csv_to_delta.py&lt;/span&gt;
    &lt;span class="na"&gt;PYSPARK_ENRICHMENT_SCRIPT_ORDER_ITENS&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;order_order_items_to_bigquery.py&lt;/span&gt;
    &lt;span class="na"&gt;TIME_PARTITION_FIELD&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;datePartition&lt;/span&gt;
    &lt;span class="na"&gt;FILE1&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;orders&lt;/span&gt;
    &lt;span class="na"&gt;FILE2&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;order_items&lt;/span&gt;
    &lt;span class="na"&gt;SUBJECT &lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;olist&lt;/span&gt;
    &lt;span class="na"&gt;STEP1 &lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;orders&lt;/span&gt;
    &lt;span class="na"&gt;STEP2 &lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;order_items&lt;/span&gt;
    &lt;span class="na"&gt;STEP3 &lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;order_items_report&lt;/span&gt;
    &lt;span class="na"&gt;TIME_ZONE &lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;America/Sao_Paulo&lt;/span&gt;
    &lt;span class="na"&gt;SCHEDULE &lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;00&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;12&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;*"&lt;/span&gt;
    &lt;span class="na"&gt;SCHEDULE_NAME &lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;schedule_olist_etl&lt;/span&gt;
    &lt;span class="na"&gt;SERVICE_ACCOUNT_NAME &lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;account-dataproc-bq-workflow&lt;/span&gt;
    &lt;span class="na"&gt;CUSTOM_ROLE &lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;DataProcBigQueryWorkflowCustomRole&lt;/span&gt;
    &lt;span class="na"&gt;STEP1_NAME &lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;step_ingestion_orders&lt;/span&gt;
    &lt;span class="na"&gt;STEP2_NAME &lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;step_ingestion_order_items&lt;/span&gt;
    &lt;span class="na"&gt;STEP3_NAME &lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;step_ingestion_order_items_bigquery&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Workflow Job Steps &lt;a&gt;&lt;/a&gt;
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;deploy-buckets&lt;/strong&gt;:
This step creates the required Cloud Storage buckets; each bucket is created only if it does not already exist. Once the buckets exist, the necessary data files, JARs, and scripts are copied into the appropriate folders.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;deploy-buckets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-22.04&lt;/span&gt;
    &lt;span class="na"&gt;timeout-minutes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;

    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Checkout&lt;/span&gt;
      &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;

    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Authorize GCP&lt;/span&gt;
      &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;google-github-actions/auth@v2'&lt;/span&gt;
      &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;credentials_json&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  &lt;span class="s"&gt;${{ secrets.GCP_SA_KEY }}&lt;/span&gt;

    &lt;span class="c1"&gt;# Step to Authenticate with GCP&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Set up Cloud SDK&lt;/span&gt;
      &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;google-github-actions/setup-gcloud@v2&lt;/span&gt;
      &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;&amp;gt;=&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;363.0.0'&lt;/span&gt;
        &lt;span class="na"&gt;project_id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.PROJECT_ID }}&lt;/span&gt;

    &lt;span class="c1"&gt;# Step to Configure Docker to use the gcloud command-line tool as a credential helper&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Configure Docker&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
        &lt;span class="s"&gt;gcloud auth configure-docker&lt;/span&gt;


    &lt;span class="c1"&gt;# Step to Create GCP Bucket &lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Create Google Cloud Storage - files&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
        &lt;span class="s"&gt;if ! gsutil ls -p ${{ secrets.PROJECT_ID }} gs://${{ secrets.GCP_BUCKET_BIGDATA_FILES }} &amp;amp;&amp;gt; /dev/null; \&lt;/span&gt;
          &lt;span class="s"&gt;then \&lt;/span&gt;
            &lt;span class="s"&gt;gcloud storage buckets create gs://${{ secrets.GCP_BUCKET_BIGDATA_FILES }} --default-storage-class=nearline --location=${{ env.REGION }}&lt;/span&gt;
          &lt;span class="s"&gt;else&lt;/span&gt;
            &lt;span class="s"&gt;echo "Cloud Storage : gs://${{ secrets.GCP_BUCKET_BIGDATA_FILES }}  already exists" ! &lt;/span&gt;
          &lt;span class="s"&gt;fi&lt;/span&gt;

    &lt;span class="c1"&gt;# Step to Create GCP Bucket &lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Create Google Cloud Storage - dataproc&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
        &lt;span class="s"&gt;if ! gsutil ls -p ${{ secrets.PROJECT_ID }} gs://${{ secrets.GCP_BUCKET_DATAPROC }} &amp;amp;&amp;gt; /dev/null; \&lt;/span&gt;
          &lt;span class="s"&gt;then \&lt;/span&gt;
            &lt;span class="s"&gt;gcloud storage buckets create gs://${{ secrets.GCP_BUCKET_DATAPROC }} --default-storage-class=nearline --location=${{ env.REGION }}&lt;/span&gt;
          &lt;span class="s"&gt;else&lt;/span&gt;
            &lt;span class="s"&gt;echo "Cloud Storage : gs://${{ secrets.GCP_BUCKET_DATAPROC }}  already exists" ! &lt;/span&gt;
          &lt;span class="s"&gt;fi&lt;/span&gt;

    &lt;span class="c1"&gt;# Step to Create GCP Bucket &lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Create Google Cloud Storage - datalake&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
        &lt;span class="s"&gt;if ! gsutil ls -p ${{ secrets.PROJECT_ID }} gs://${{ secrets.GCP_BUCKET_DATALAKE }} &amp;amp;&amp;gt; /dev/null; \&lt;/span&gt;
          &lt;span class="s"&gt;then \&lt;/span&gt;
            &lt;span class="s"&gt;gcloud storage buckets create gs://${{ secrets.GCP_BUCKET_DATALAKE }} --default-storage-class=nearline --location=${{ env.REGION }}&lt;/span&gt;
          &lt;span class="s"&gt;else&lt;/span&gt;
            &lt;span class="s"&gt;echo "Cloud Storage : gs://${{ secrets.GCP_BUCKET_DATALAKE }}  already exists" ! &lt;/span&gt;
          &lt;span class="s"&gt;fi&lt;/span&gt;

    &lt;span class="c1"&gt;# Step to Create GCP Bucket &lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Create Google Cloud Storage - big query temp files&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
        &lt;span class="s"&gt;if ! gsutil ls -p ${{ secrets.PROJECT_ID }} gs://${{ secrets.GCP_BUCKET_TEMP_BIGQUERY }} &amp;amp;&amp;gt; /dev/null; \&lt;/span&gt;
          &lt;span class="s"&gt;then \&lt;/span&gt;
            &lt;span class="s"&gt;gcloud storage buckets create gs://${{ secrets.GCP_BUCKET_TEMP_BIGQUERY }} --default-storage-class=nearline --location=${{ env.REGION }}&lt;/span&gt;
          &lt;span class="s"&gt;else&lt;/span&gt;
            &lt;span class="s"&gt;echo "Cloud Storage : gs://${{ secrets.GCP_BUCKET_TEMP_BIGQUERY }}  already exists" ! &lt;/span&gt;
          &lt;span class="s"&gt;fi&lt;/span&gt;

    &lt;span class="c1"&gt;# Step to Upload the file to GCP Bucket - transient files&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Upload transient files to Google Cloud Storage&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
        &lt;span class="s"&gt;TARGET=${{ env.TRANSIENT_DATALAKE_FILES }}&lt;/span&gt;
        &lt;span class="s"&gt;BUCKET_PATH=${{ secrets.GCP_BUCKET_DATALAKE }}/${{ env.BUCKET_DATALAKE_FOLDER }}    &lt;/span&gt;
        &lt;span class="s"&gt;gsutil cp -r $TARGET gs://${BUCKET_PATH}&lt;/span&gt;


    &lt;span class="c1"&gt;# Step to Upload the file to GCP Bucket - jar files&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Upload jar files to Google Cloud Storage&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
        &lt;span class="s"&gt;TARGET=${{ env.BUCKET_BIGDATA_JAR_FOLDER }}&lt;/span&gt;
        &lt;span class="s"&gt;BUCKET_PATH=${{ secrets.GCP_BUCKET_BIGDATA_FILES }}/${{ env.BUCKET_BIGDATA_JAR_FOLDER }}&lt;/span&gt;
        &lt;span class="s"&gt;gsutil cp -r $TARGET gs://${BUCKET_PATH}&lt;/span&gt;

    &lt;span class="c1"&gt;# Step to Upload the file to GCP Bucket - pyspark files&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Upload pyspark files to Google Cloud Storage&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
        &lt;span class="s"&gt;TARGET=${{ env.BUCKET_BIGDATA_SCRIPT_FOLDER}}/${{ env.BUCKET_BIGDATA_PYSPARK_FOLDER}}&lt;/span&gt;
        &lt;span class="s"&gt;BUCKET_PATH=${{ secrets.GCP_BUCKET_BIGDATA_FILES}}/${{ env.BUCKET_BIGDATA_SCRIPT_FOLDER}}/${{ env.BUCKET_BIGDATA_PYSPARK_FOLDER}}&lt;/span&gt;
        &lt;span class="s"&gt;gsutil cp -r $TARGET gs://${BUCKET_PATH}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;deploy-bigquery-dataset-bigquery-tables&lt;/strong&gt;:
This step creates the BigQuery dataset and table. It checks whether the dataset already exists and whether the table is already present before creating them. The table schema is read from a predefined JSON schema file kept under the repository's scripts folder.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;deploy-bigquery-dataset-bigquery-tables&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;needs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;deploy-buckets&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-22.04&lt;/span&gt;
    &lt;span class="na"&gt;timeout-minutes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;

    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Checkout&lt;/span&gt;
      &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;

    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Authorize GCP&lt;/span&gt;
      &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;google-github-actions/auth@v2'&lt;/span&gt;
      &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;credentials_json&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  &lt;span class="s"&gt;${{ secrets.GCP_SA_KEY }}&lt;/span&gt;

    &lt;span class="c1"&gt;# Step to Authenticate with GCP&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Set up Cloud SDK&lt;/span&gt;
      &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;google-github-actions/setup-gcloud@v2&lt;/span&gt;
      &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;&amp;gt;=&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;363.0.0'&lt;/span&gt;
        &lt;span class="na"&gt;project_id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.PROJECT_ID }}&lt;/span&gt;

    &lt;span class="c1"&gt;# Step to Configure Docker to use the gcloud command-line tool as a credential helper&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Configure Docker&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
        &lt;span class="s"&gt;gcloud auth configure-docker&lt;/span&gt;


    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Create Big Query Dataset&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;  
        &lt;span class="s"&gt;if ! bq ls --project_id ${{ secrets.PROJECT_ID}} -a | grep -w ${{ env.BIGQUERY_DATASET}} &amp;amp;&amp;gt; /dev/null; \&lt;/span&gt;
          &lt;span class="s"&gt;then &lt;/span&gt;

            &lt;span class="s"&gt;bq --location=${{ env.REGION }} mk \&lt;/span&gt;
          &lt;span class="s"&gt;--default_table_expiration 0 \&lt;/span&gt;
          &lt;span class="s"&gt;--dataset ${{ env.BIGQUERY_DATASET }}&lt;/span&gt;

          &lt;span class="s"&gt;else&lt;/span&gt;
            &lt;span class="s"&gt;echo "Big Query Dataset : ${{ env.BIGQUERY_DATASET }} already exists" ! &lt;/span&gt;
          &lt;span class="s"&gt;fi&lt;/span&gt;

    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Create Big Query table&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
        &lt;span class="s"&gt;TABLE_NAME_ORDER_ITEMS=${{ env.BIGQUERY_DATASET}}.${{ env.BIGQUERY_TABLE_ORDER_ITEMS}}&lt;/span&gt;
        &lt;span class="s"&gt;c=0&lt;/span&gt;
        &lt;span class="s"&gt;for table in $(bq ls --max_results 1000 "${{ secrets.PROJECT_ID}}:${{ env.BIGQUERY_DATASET}}" | tail -n +3 | awk '{print $1}'); do&lt;/span&gt;

            &lt;span class="s"&gt;# Determine the table type and file extension&lt;/span&gt;
            &lt;span class="s"&gt;if bq show --format=prettyjson $TABLE_NAME_ORDER_ITEMS | jq -r '.type' | grep -q -E "TABLE"; then&lt;/span&gt;
              &lt;span class="s"&gt;echo "Dataset ${{ env.BIGQUERY_DATASET}} already has table named : $table " !&lt;/span&gt;
              &lt;span class="s"&gt;if [ "$table" == "${{ env.BIGQUERY_TABLE_ORDER_ITEMS}}" ]; then&lt;/span&gt;
                &lt;span class="s"&gt;echo "Dataset ${{ env.BIGQUERY_DATASET}} already has table named : $table " !&lt;/span&gt;
                &lt;span class="s"&gt;((c=c+1))       &lt;/span&gt;
              &lt;span class="s"&gt;fi                  &lt;/span&gt;
            &lt;span class="s"&gt;else&lt;/span&gt;
                &lt;span class="s"&gt;echo "Ignoring $table"            &lt;/span&gt;
                &lt;span class="s"&gt;continue&lt;/span&gt;
            &lt;span class="s"&gt;fi&lt;/span&gt;
        &lt;span class="s"&gt;done&lt;/span&gt;
        &lt;span class="s"&gt;echo " contador $c "&lt;/span&gt;
        &lt;span class="s"&gt;if [ $c == 0 ]; then&lt;/span&gt;
          &lt;span class="s"&gt;echo "Creating table named : $table for Dataset ${{ env.BIGQUERY_DATASET}} " !&lt;/span&gt;

          &lt;span class="s"&gt;bq mk --table \&lt;/span&gt;
          &lt;span class="s"&gt;--time_partitioning_field ${{ env.TIME_PARTITION_FIELD}} \&lt;/span&gt;
          &lt;span class="s"&gt;$TABLE_NAME_ORDER_ITEMS \&lt;/span&gt;
          &lt;span class="s"&gt;./scripts/bigquery_files/schemas/order_items_schema.json&lt;/span&gt;


        &lt;span class="s"&gt;fi&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
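
&lt;p&gt;As a side note, the table-existence loop above can be reduced to a single &lt;code&gt;bq show&lt;/code&gt; call, which fails when the table does not exist. The sketch below is an untested simplification that reuses the same variables the workflow defines:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Sketch: bq show exits with a non-zero status when the table is missing,
# so the per-table loop can be replaced by a direct check.
TABLE_NAME_ORDER_ITEMS=${BIGQUERY_DATASET}.${BIGQUERY_TABLE_ORDER_ITEMS}
if ! bq show --format=none "$TABLE_NAME_ORDER_ITEMS"; then
  bq mk --table \
    --time_partitioning_field "$TIME_PARTITION_FIELD" \
    "$TABLE_NAME_ORDER_ITEMS" \
    ./scripts/bigquery_files/schemas/order_items_schema.json
fi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;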



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;deploy-dataproc-workflow-template&lt;/strong&gt;:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This step creates the workflow template. It sets up a Dataproc cluster and links it with the workflow. The three main steps of the workflow are then created, with validations ensuring that each component is only created once.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt; &lt;span class="na"&gt;deploy-dataproc-workflow-template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;needs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;deploy-buckets&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-22.04&lt;/span&gt;
    &lt;span class="na"&gt;timeout-minutes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;

    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Checkout&lt;/span&gt;
      &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;

    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Authorize GCP&lt;/span&gt;
      &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;google-github-actions/auth@v2'&lt;/span&gt;
      &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;credentials_json&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  &lt;span class="s"&gt;${{ secrets.GCP_SA_KEY }}&lt;/span&gt;

    &lt;span class="c1"&gt;# Step to Authenticate with GCP&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Set up Cloud SDK&lt;/span&gt;
      &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;google-github-actions/setup-gcloud@v2&lt;/span&gt;
      &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;&amp;gt;=&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;363.0.0'&lt;/span&gt;
        &lt;span class="na"&gt;project_id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.PROJECT_ID }}&lt;/span&gt;

    &lt;span class="c1"&gt;# Step to Configure Docker to use the gcloud command-line tool as a credential helper&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Configure Docker&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
        &lt;span class="s"&gt;gcloud auth configure-docker&lt;/span&gt;

    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Create Dataproc Workflow&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;  
        &lt;span class="s"&gt;if ! gcloud dataproc workflow-templates list --region=${{ env.REGION}} | grep -i ${{ env.DATAPROC_WORKFLOW_NAME}} &amp;amp;&amp;gt; /dev/null; \&lt;/span&gt;
          &lt;span class="s"&gt;then \&lt;/span&gt;
            &lt;span class="s"&gt;gcloud dataproc workflow-templates create ${{ env.DATAPROC_WORKFLOW_NAME }} --region ${{ env.REGION }}&lt;/span&gt;

          &lt;span class="s"&gt;else&lt;/span&gt;
            &lt;span class="s"&gt;echo "Workflow Template : ${{ env.DATAPROC_WORKFLOW_NAME }} already exists" ! &lt;/span&gt;
          &lt;span class="s"&gt;fi&lt;/span&gt;

    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Create Dataproc Managed Cluster&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;&amp;gt;&lt;/span&gt;       
        &lt;span class="s"&gt;gcloud dataproc workflow-templates set-managed-cluster ${{ env.DATAPROC_WORKFLOW_NAME }} &lt;/span&gt;
        &lt;span class="s"&gt;--region ${{ env.REGION }} &lt;/span&gt;
        &lt;span class="s"&gt;--zone ${{ env.ZONE }} &lt;/span&gt;
        &lt;span class="s"&gt;--image-version ${{ env.DATAPROC_IMAGE_VERSION }} &lt;/span&gt;
        &lt;span class="s"&gt;--master-machine-type=${{ env.DATAPROC_MASTER_TYPE }} &lt;/span&gt;
        &lt;span class="s"&gt;--master-boot-disk-type ${{ env.DATAPROC_MASTER_BOOT_DISK_TYPE }} &lt;/span&gt;
        &lt;span class="s"&gt;--master-boot-disk-size ${{ env.DATAPROC_MASTER_BOOT_DISK_SIZE }} &lt;/span&gt;
        &lt;span class="s"&gt;--worker-machine-type=${{ env.DATAPROC_WORKER_TYPE }} &lt;/span&gt;
        &lt;span class="s"&gt;--worker-boot-disk-type ${{ env.DATAPROC_WORKER_BOOT_DISK_TYPE }}&lt;/span&gt;
        &lt;span class="s"&gt;--worker-boot-disk-size ${{ env.DATAPROC_WORKER_DISK_SIZE }} &lt;/span&gt;
        &lt;span class="s"&gt;--num-workers=${{ env.DATAPROC_NUM_WORKERS }} &lt;/span&gt;
        &lt;span class="s"&gt;--cluster-name=${{ env.DATAPROC_CLUSTER_NAME }} &lt;/span&gt;
        &lt;span class="s"&gt;--optional-components ${{ env.DATAPROC_COMPONENTS }} &lt;/span&gt;
        &lt;span class="s"&gt;--service-account=${{ env.GCP_SERVICE_ACCOUNT }}&lt;/span&gt;

    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Add Job Ingestion orders to Workflow&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
        &lt;span class="s"&gt;if gcloud dataproc workflow-templates list --region=${{ env.REGION}} | grep -i ${{ env.DATAPROC_WORKFLOW_NAME}} &amp;amp;&amp;gt; /dev/null; \&lt;/span&gt;
          &lt;span class="s"&gt;then \&lt;/span&gt;

            &lt;span class="s"&gt;if gcloud dataproc workflow-templates describe ${{ env.DATAPROC_WORKFLOW_NAME}} --region=${{ env.REGION}} | grep -i ${{ env.STEP1_NAME  }} &amp;amp;&amp;gt; /dev/null; \&lt;/span&gt;
            &lt;span class="s"&gt;then \&lt;/span&gt;
              &lt;span class="s"&gt;echo "Workflow Template : ${{ env.DATAPROC_WORKFLOW_NAME }} already has step : ${{ env.STEP1_NAME  }} " ! &lt;/span&gt;
            &lt;span class="s"&gt;else&lt;/span&gt;
              &lt;span class="s"&gt;PYSPARK_SCRIPT_PATH=${{ secrets.GCP_BUCKET_BIGDATA_FILES}}/${{ env.BUCKET_BIGDATA_SCRIPT_FOLDER}}/${{ env.BUCKET_BIGDATA_PYSPARK_FOLDER}}/${{ env.BUCKET_BIGDATA_PYSPARK_INGESTION}}/${{ env.PYSPARK_INGESTION_SCRIPT}}&lt;/span&gt;
              &lt;span class="s"&gt;JARS_PATH=gs://${{ secrets.GCP_BUCKET_BIGDATA_FILES }}/${{ env.BUCKET_BIGDATA_JAR_FOLDER }}/${{ env.JAR_LIB1 }}&lt;/span&gt;
              &lt;span class="s"&gt;JARS_PATH=${JARS_PATH},gs://${{ secrets.GCP_BUCKET_BIGDATA_FILES }}/${{ env.BUCKET_BIGDATA_JAR_FOLDER }}/${{ env.JAR_LIB2 }}&lt;/span&gt;
              &lt;span class="s"&gt;TRANSIENT=${{ secrets.GCP_BUCKET_DATALAKE }}/${{ env.BUCKET_DATALAKE_FOLDER }}/${{ env.SUBJECT }}/${{ env.FILE1 }}&lt;/span&gt;
              &lt;span class="s"&gt;BRONZE=${{ secrets.GCP_BUCKET_DATALAKE }}/${{ env.BRONZE_DATALAKE_FILES }}/${{ env.SUBJECT }}/${{ env.FILE1 }}&lt;/span&gt;

              &lt;span class="s"&gt;gcloud dataproc workflow-templates add-job pyspark gs://${PYSPARK_SCRIPT_PATH} \&lt;/span&gt;
              &lt;span class="s"&gt;--workflow-template ${{ env.DATAPROC_WORKFLOW_NAME }}  \&lt;/span&gt;
              &lt;span class="s"&gt;--step-id ${{env.STEP1_NAME }} \&lt;/span&gt;
              &lt;span class="s"&gt;--region ${{ env.REGION }} \&lt;/span&gt;
              &lt;span class="s"&gt;--jars ${JARS_PATH} \&lt;/span&gt;
              &lt;span class="s"&gt;-- --app_name=${{ env.APP_NAME }}${{ env.STEP1 }} --bucket_transient=gs://${TRANSIENT} \&lt;/span&gt;
              &lt;span class="s"&gt;--bucket_bronze=gs://${BRONZE}&lt;/span&gt;
            &lt;span class="s"&gt;fi&lt;/span&gt;
        &lt;span class="s"&gt;else&lt;/span&gt;
          &lt;span class="s"&gt;echo "Workflow Template : ${{ env.DATAPROC_WORKFLOW_NAME}} not exists" ! &lt;/span&gt;
        &lt;span class="s"&gt;fi        &lt;/span&gt;

    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Add Job Ingestion order items to Workflow&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
        &lt;span class="s"&gt;if gcloud dataproc workflow-templates list --region=${{ env.REGION}} | grep -i ${{ env.DATAPROC_WORKFLOW_NAME}} &amp;amp;&amp;gt; /dev/null; \&lt;/span&gt;
          &lt;span class="s"&gt;then \&lt;/span&gt;
            &lt;span class="s"&gt;if gcloud dataproc workflow-templates describe ${{ env.DATAPROC_WORKFLOW_NAME}} --region=${{ env.REGION}} | grep -i ${{ env.STEP2_NAME }} &amp;amp;&amp;gt; /dev/null; \&lt;/span&gt;
            &lt;span class="s"&gt;then \&lt;/span&gt;
              &lt;span class="s"&gt;echo "Workflow Template : ${{ env.DATAPROC_WORKFLOW_NAME }} already has step : ${{ env.STEP2_NAME  }} " ! &lt;/span&gt;
            &lt;span class="s"&gt;else&lt;/span&gt;
              &lt;span class="s"&gt;PYSPARK_SCRIPT_PATH=${{ secrets.GCP_BUCKET_BIGDATA_FILES}}/${{ env.BUCKET_BIGDATA_SCRIPT_FOLDER}}/${{ env.BUCKET_BIGDATA_PYSPARK_FOLDER}}/${{ env.BUCKET_BIGDATA_PYSPARK_INGESTION}}/${{ env.PYSPARK_INGESTION_SCRIPT}}&lt;/span&gt;
              &lt;span class="s"&gt;JARS_PATH=gs://${{ secrets.GCP_BUCKET_BIGDATA_FILES }}/${{ env.BUCKET_BIGDATA_JAR_FOLDER }}/${{ env.JAR_LIB1 }}&lt;/span&gt;
              &lt;span class="s"&gt;JARS_PATH=${JARS_PATH},gs://${{ secrets.GCP_BUCKET_BIGDATA_FILES }}/${{ env.BUCKET_BIGDATA_JAR_FOLDER }}/${{ env.JAR_LIB2 }}&lt;/span&gt;
              &lt;span class="s"&gt;TRANSIENT=${{ secrets.GCP_BUCKET_DATALAKE }}/${{ env.BUCKET_DATALAKE_FOLDER }}/${{ env.SUBJECT }}/${{ env.FILE2 }}&lt;/span&gt;
              &lt;span class="s"&gt;BRONZE=${{ secrets.GCP_BUCKET_DATALAKE }}/${{ env.BRONZE_DATALAKE_FILES }}/${{ env.SUBJECT }}/${{ env.FILE2 }}&lt;/span&gt;


              &lt;span class="s"&gt;gcloud dataproc workflow-templates add-job pyspark gs://${PYSPARK_SCRIPT_PATH} \&lt;/span&gt;
              &lt;span class="s"&gt;--workflow-template ${{ env.DATAPROC_WORKFLOW_NAME }}  \&lt;/span&gt;
              &lt;span class="s"&gt;--step-id ${{ env.STEP2_NAME }} \&lt;/span&gt;
              &lt;span class="s"&gt;--start-after ${{ env.STEP1_NAME }} \&lt;/span&gt;
              &lt;span class="s"&gt;--region ${{ env.REGION }} \&lt;/span&gt;
              &lt;span class="s"&gt;--jars ${JARS_PATH} \&lt;/span&gt;
              &lt;span class="s"&gt;-- --app_name=${{ env.APP_NAME }}${{ env.STEP2 }} --bucket_transient=gs://${TRANSIENT} \&lt;/span&gt;
              &lt;span class="s"&gt;--bucket_bronze=gs://${BRONZE}&lt;/span&gt;
            &lt;span class="s"&gt;fi&lt;/span&gt;
        &lt;span class="s"&gt;else&lt;/span&gt;
          &lt;span class="s"&gt;echo "Workflow Template : ${{ env.DATAPROC_WORKFLOW_NAME}} not exists" ! &lt;/span&gt;
        &lt;span class="s"&gt;fi&lt;/span&gt;


    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Add Job order + order items ingestion into big query  to Workflow&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
        &lt;span class="s"&gt;if gcloud dataproc workflow-templates list --region=${{ env.REGION}} | grep -i ${{ env.DATAPROC_WORKFLOW_NAME}} &amp;amp;&amp;gt; /dev/null; \&lt;/span&gt;
          &lt;span class="s"&gt;then \&lt;/span&gt;
            &lt;span class="s"&gt;if gcloud dataproc workflow-templates describe ${{ env.DATAPROC_WORKFLOW_NAME}} --region=${{ env.REGION}} | grep -i ${{ env.STEP3_NAME }} &amp;amp;&amp;gt; /dev/null; \&lt;/span&gt;
            &lt;span class="s"&gt;then \&lt;/span&gt;
              &lt;span class="s"&gt;echo "Workflow Template : ${{ env.DATAPROC_WORKFLOW_NAME }} already has step : ${{ env.STEP3_NAME }} " ! &lt;/span&gt;
            &lt;span class="s"&gt;else&lt;/span&gt;

              &lt;span class="s"&gt;PYSPARK_SCRIPT_PATH=${{ secrets.GCP_BUCKET_BIGDATA_FILES}}/${{ env.BUCKET_BIGDATA_SCRIPT_FOLDER}}/${{ env.BUCKET_BIGDATA_PYSPARK_FOLDER}}/${{ env.BUCKET_BIGDATA_PYSPARK_ENRICHMENT}}/${{ env.PYSPARK_ENRICHMENT_SCRIPT_ORDER_ITENS}}&lt;/span&gt;
              &lt;span class="s"&gt;JARS_PATH=gs://${{ secrets.GCP_BUCKET_BIGDATA_FILES }}/${{ env.BUCKET_BIGDATA_JAR_FOLDER }}/${{ env.JAR_LIB1 }}&lt;/span&gt;
              &lt;span class="s"&gt;JARS_PATH=${JARS_PATH},gs://${{ secrets.GCP_BUCKET_BIGDATA_FILES }}/${{ env.BUCKET_BIGDATA_JAR_FOLDER }}/${{ env.JAR_LIB2 }}&lt;/span&gt;
              &lt;span class="s"&gt;BRONZE_ORDERS=${{ secrets.GCP_BUCKET_DATALAKE}}/${{ env.BRONZE_DATALAKE_FILES}}/${{ env.SUBJECT}}/${{ env.FILE1}}&lt;/span&gt;
              &lt;span class="s"&gt;BRONZE_ORDER_ITEMS=${{ secrets.GCP_BUCKET_DATALAKE}}/${{ env.BRONZE_DATALAKE_FILES}}/${{ env.SUBJECT}}/${{ env.FILE2}}&lt;/span&gt;
              &lt;span class="s"&gt;TABLE_NAME_ORDER_ITEMS=${{ secrets.PROJECT_ID}}.${{ env.BIGQUERY_DATASET}}.${{ env.BIGQUERY_TABLE_ORDER_ITEMS}}&lt;/span&gt;
              &lt;span class="s"&gt;BIG_QUERY_TEMP=${{ secrets.GCP_BUCKET_TEMP_BIGQUERY }}&lt;/span&gt;



              &lt;span class="s"&gt;gcloud dataproc workflow-templates add-job pyspark gs://${PYSPARK_SCRIPT_PATH} \&lt;/span&gt;
              &lt;span class="s"&gt;--workflow-template ${{ env.DATAPROC_WORKFLOW_NAME }}  \&lt;/span&gt;
              &lt;span class="s"&gt;--step-id ${{ env.STEP3_NAME }} \&lt;/span&gt;
              &lt;span class="s"&gt;--start-after ${{ env.STEP1_NAME }},${{ env.STEP2_NAME }} \&lt;/span&gt;
              &lt;span class="s"&gt;--region ${{ env.REGION }} \&lt;/span&gt;
              &lt;span class="s"&gt;--jars ${JARS_PATH} \&lt;/span&gt;
              &lt;span class="s"&gt;-- --app_name=${{ env.APP_NAME }}${{ env.STEP3 }} --bronze_orders_zone=gs://${BRONZE_ORDERS} \&lt;/span&gt;
              &lt;span class="s"&gt;--bronze_orders_items_zone=gs://${BRONZE_ORDER_ITEMS} \&lt;/span&gt;
              &lt;span class="s"&gt;--bigquery_table=${TABLE_NAME_ORDER_ITEMS} \&lt;/span&gt;
              &lt;span class="s"&gt;--temp_bucket=${BIG_QUERY_TEMP} &lt;/span&gt;

            &lt;span class="s"&gt;fi&lt;/span&gt;
        &lt;span class="s"&gt;else&lt;/span&gt;
          &lt;span class="s"&gt;echo "Workflow Template : ${{ env.DATAPROC_WORKFLOW_NAME}} not exists" ! &lt;/span&gt;
        &lt;span class="s"&gt;fi&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
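
&lt;p&gt;Before relying on the scheduler, the template can also be instantiated by hand to confirm that the managed cluster and the three jobs are wired correctly. A minimal sketch, assuming the same environment variables used by the workflow:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Sketch: run the workflow template once manually. The command creates the
# managed cluster, runs the jobs in order and deletes the cluster at the end.
gcloud dataproc workflow-templates instantiate "${DATAPROC_WORKFLOW_NAME}" \
  --region="${REGION}"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;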



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;deploy-cloud-schedule&lt;/strong&gt;:
In this final step, a service account, a custom role, and a Cloud Scheduler job are created. The Cloud Scheduler job triggers the workflow on a predefined schedule, and the service account it uses is granted the necessary permissions.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;deploy-cloud-schedule&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;needs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;deploy-buckets&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;deploy-dataproc-workflow-template&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-22.04&lt;/span&gt;
    &lt;span class="na"&gt;timeout-minutes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;

    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Checkout&lt;/span&gt;
      &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;

    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Authorize GCP&lt;/span&gt;
      &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;google-github-actions/auth@v2'&lt;/span&gt;
      &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;credentials_json&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  &lt;span class="s"&gt;${{ secrets.GCP_SA_KEY }}&lt;/span&gt;

    &lt;span class="c1"&gt;# Step to Authenticate with GCP&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Set up Cloud SDK&lt;/span&gt;
      &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;google-github-actions/setup-gcloud@v2&lt;/span&gt;
      &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;&amp;gt;=&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;363.0.0'&lt;/span&gt;
        &lt;span class="na"&gt;project_id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.PROJECT_ID }}&lt;/span&gt;

    &lt;span class="c1"&gt;# Step to Configure Docker to use the gcloud command-line tool as a credential helper&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Configure Docker&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
        &lt;span class="s"&gt;gcloud auth configure-docker&lt;/span&gt;


    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Create service account&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;

        &lt;span class="s"&gt;if ! gcloud iam service-accounts list | grep -i ${{ env.SERVICE_ACCOUNT_NAME}} &amp;amp;&amp;gt; /dev/null; \&lt;/span&gt;
          &lt;span class="s"&gt;then \&lt;/span&gt;
            &lt;span class="s"&gt;gcloud iam service-accounts create ${{ env.SERVICE_ACCOUNT_NAME }} \&lt;/span&gt;
            &lt;span class="s"&gt;--display-name="scheduler dataproc workflow service account"&lt;/span&gt;
          &lt;span class="s"&gt;fi&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Create Custom role for service account&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
        &lt;span class="s"&gt;if ! gcloud iam roles list --project ${{ secrets.PROJECT_ID }} | grep -i ${{ env.CUSTOM_ROLE }} &amp;amp;&amp;gt; /dev/null; \&lt;/span&gt;
          &lt;span class="s"&gt;then \&lt;/span&gt;
            &lt;span class="s"&gt;gcloud iam roles create ${{ env.CUSTOM_ROLE }} --project ${{ secrets.PROJECT_ID }} \&lt;/span&gt;
            &lt;span class="s"&gt;--title "Dataproc Workflow template scheduler" --description "Dataproc Workflow template scheduler" \&lt;/span&gt;
            &lt;span class="s"&gt;--permissions "dataproc.workflowTemplates.instantiate,iam.serviceAccounts.actAs" --stage ALPHA&lt;/span&gt;
          &lt;span class="s"&gt;fi    &lt;/span&gt;

    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Add the custom role for service account&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
        &lt;span class="s"&gt;gcloud projects add-iam-policy-binding ${{secrets.PROJECT_ID}} \&lt;/span&gt;
        &lt;span class="s"&gt;--member=serviceAccount:${{env.SERVICE_ACCOUNT_NAME}}@${{secrets.PROJECT_ID}}.iam.gserviceaccount.com \&lt;/span&gt;
        &lt;span class="s"&gt;--role=projects/${{secrets.PROJECT_ID}}/roles/${{env.CUSTOM_ROLE}}&lt;/span&gt;

    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Create cloud schedule for workflow execution&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
        &lt;span class="s"&gt;if ! gcloud scheduler jobs list --location ${{env.REGION}} | grep -i ${{env.SCHEDULE_NAME}} &amp;amp;&amp;gt; /dev/null; \&lt;/span&gt;
          &lt;span class="s"&gt;then \&lt;/span&gt;
            &lt;span class="s"&gt;gcloud scheduler jobs create http ${{env.SCHEDULE_NAME}} \&lt;/span&gt;
            &lt;span class="s"&gt;--schedule="30 12 * * *" \&lt;/span&gt;
            &lt;span class="s"&gt;--description="Dataproc workflow " \&lt;/span&gt;
            &lt;span class="s"&gt;--location=${{env.REGION}} \&lt;/span&gt;
            &lt;span class="s"&gt;--uri=https://dataproc.googleapis.com/v1/projects/${{secrets.PROJECT_ID}}/regions/${{env.REGION}}/workflowTemplates/${{env.DATAPROC_WORKFLOW_NAME}}:instantiate?alt=json \&lt;/span&gt;
            &lt;span class="s"&gt;--time-zone=${{env.TIME_ZONE}} \&lt;/span&gt;
            &lt;span class="s"&gt;--oauth-service-account-email=${{env.SERVICE_ACCOUNT_NAME}}@${{secrets.PROJECT_ID}}.iam.gserviceaccount.com&lt;/span&gt;
          &lt;span class="s"&gt;fi&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Resources Created After Deployment
&lt;/h2&gt;

&lt;p&gt;Upon deployment, the following resources are created:&lt;/p&gt;

&lt;h3&gt;
  
  
  Google Cloud Storage Bucket
&lt;/h3&gt;

&lt;p&gt;At the end of the deployment process, several Cloud Storage buckets are created: one for data-lake storage, another for the Dataproc cluster, one for the PySpark scripts and libraries used in the project, and one for the temporary files BigQuery generates during the load process. The Dataproc service itself also creates its own bucket to manage temporary data generated during processing.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8egg4eg3jeonb1bwuu9q.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8egg4eg3jeonb1bwuu9q.JPG" alt="cloud storage" width="800" height="224"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Dataproc Cluster
&lt;/h3&gt;

&lt;p&gt;The Dataproc service now shows the new workflow template; the picture below shows 2 templates. In the Workflows tab, it is possible to explore options such as monitoring workflow executions and analyzing their details.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsmp4rsn1x7x5paazy75e.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsmp4rsn1x7x5paazy75e.JPG" alt="dataproc-workflow1" width="800" height="354"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Selecting the created workflow, it is possible to see the cluster used for processing, the workflow's steps, and the dependencies between them.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe0vypblqtao6k13v8lfh.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe0vypblqtao6k13v8lfh.JPG" alt="dataproc-workflow2" width="800" height="596"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With the Dataproc service, it is possible to monitor the execution status of each job, with individual details about each execution and its performance; an example is displayed below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjj0payqwvuca6pq1vyq1.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjj0payqwvuca6pq1vyq1.JPG" alt="dataproc-workflow3" width="800" height="375"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  BigQuery Dataset
&lt;/h3&gt;

&lt;p&gt;The image below shows the BigQuery dataset. In this case there is just one table, with several columns of different numeric types, but it is possible to have several other datasets in the same project, as well as countless other tables, procedures, views, and functions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frtuezk6zshzxxgxbxd6g.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frtuezk6zshzxxgxbxd6g.JPG" alt="Big Query1" width="800" height="509"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The result of a query is shown below. In this case the query returns all columns in the table. Filtering on the partitioned column in the WHERE clause is always a good practice to avoid excessive costs, since BigQuery charges for the data processed by each query as well as for stored data. Because the data used in this experiment is small and a trial account was used, there was no cost for the query.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhtr8eqw0u0j3qn5kfiz9.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhtr8eqw0u0j3qn5kfiz9.JPG" alt="Big Query2" width="800" height="417"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Cloud Scheduler Job
&lt;/h3&gt;

&lt;p&gt;The Cloud Scheduler console shows two Dataproc workflow schedules, which automate the entire data ingestion, transformation, and enrichment process in an orchestrated manner and on a fixed schedule.&lt;br&gt;
It is possible to force the execution of any schedule manually by selecting the desired Cloud Scheduler job and clicking the "FORCE RUN" button, and the scheduling time can be changed under "EDIT".&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx74p6oys3q5n66vx2ojf.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx74p6oys3q5n66vx2ojf.JPG" alt="Cloud Schedule" width="800" height="200"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
Conclusion
&lt;/h2&gt;

&lt;p&gt;This project demonstrates how Google Cloud’s Dataproc, BigQuery, Cloud Storage, and Cloud Scheduler can be integrated to create a scalable, automated data ingestion pipeline. By leveraging GitHub Actions for CI/CD, the project ensures streamlined deployment, robust automation, and seamless workflows. This setup can be adapted to suit various use cases in big data processing, enabling organizations to process, store, and analyze large datasets efficiently.&lt;/p&gt;

&lt;p&gt;Links and References&lt;br&gt;
&lt;a href="https://github.com/jader-lima/gcp-dataproc-bigquery-workflow-template" rel="noopener noreferrer"&gt;GitHub Repo&lt;/a&gt;&lt;br&gt;
&lt;a href="https://cloud.google.com/bigquery/docs" rel="noopener noreferrer"&gt;Big Query&lt;/a&gt;&lt;br&gt;
&lt;a href="https://cloud.google.com/dataproc/docs" rel="noopener noreferrer"&gt;DataProc&lt;/a&gt;&lt;br&gt;
&lt;a href="https://cloud.google.com/scheduler/docs" rel="noopener noreferrer"&gt;Cloud Scheduler&lt;/a&gt;&lt;br&gt;
&lt;a href="https://cloud.google.com/dataproc/docs/concepts/workflows/workflow-schedule-solutions" rel="noopener noreferrer"&gt;Wrokflows&lt;/a&gt;&lt;/p&gt;

</description>
      <category>gcp</category>
      <category>dataproc</category>
      <category>bigquery</category>
      <category>bigdata</category>
    </item>
    <item>
      <title>Creating a data pipeline using Dataproc workflow templates and cloud Schedule</title>
      <dc:creator>Jader Lima</dc:creator>
      <pubDate>Wed, 21 Aug 2024 13:11:52 +0000</pubDate>
      <link>https://dev.to/jader_lima_b72a63be5bbddc/creating-a-data-pipeline-using-dataproc-workflow-templates-and-cloud-schedule-267d</link>
      <guid>https://dev.to/jader_lima_b72a63be5bbddc/creating-a-data-pipeline-using-dataproc-workflow-templates-and-cloud-schedule-267d</guid>
      <description>&lt;h2&gt;
  
  
  About This Post
&lt;/h2&gt;

&lt;p&gt;Data pipelines are orchestrated, scheduled processes for acquiring, transforming, and enriching data,&lt;br&gt;
handling information from different sources with countless possible destinations and applications.&lt;br&gt;
There are several systems that help in building data pipelines. In this post we will cover creating a data pipeline on Google Cloud Platform, using a Dataproc workflow template and scheduling it with Cloud Scheduler.&lt;/p&gt;
&lt;h2&gt;
  
  
  Description of Services Used in GCP
&lt;/h2&gt;
&lt;h2&gt;
  
  
  Apache Spark
&lt;/h2&gt;

&lt;p&gt;Apache Spark is an open-source unified analytics engine for large-scale data processing. It is known for its speed, ease of use, and sophisticated analytics capabilities. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance, making it suitable for a wide variety of big data applications.&lt;/p&gt;
&lt;h2&gt;
  
  
  Google Dataproc
&lt;/h2&gt;

&lt;p&gt;Google Dataproc is a fully managed cloud service that simplifies running Apache Spark and Apache Hadoop clusters in the Google Cloud environment. It allows users to easily process large datasets and integrates seamlessly with other Google Cloud services such as Cloud Storage. Dataproc is designed to make big data processing fast and efficient while minimizing operational overhead.&lt;/p&gt;
&lt;h2&gt;
  
  
  Cloud Storage
&lt;/h2&gt;

&lt;p&gt;Google Cloud Storage is a scalable and secure object storage service for storing large amounts of unstructured data. It offers high availability and strong global consistency, making it suitable for a wide range of scenarios, such as data backups, big data analytics, and content distribution.&lt;/p&gt;
&lt;h2&gt;
  
  
  Workflow Templates
&lt;/h2&gt;

&lt;p&gt;Workflow templates in Google Cloud allow you to define and manage complex workflows that automate interactions between different cloud services. This feature simplifies the process of building, scheduling, and executing intricate workflows, ensuring better management of resources and tasks.&lt;/p&gt;
&lt;h2&gt;
  
  
  Cloud Scheduler
&lt;/h2&gt;

&lt;p&gt;Google Cloud Scheduler is a fully managed cron job service that allows you to run arbitrary functions at specified times without needing to manage the infrastructure. It is useful for automating tasks such as running reports, triggering workflows, and executing other scheduled jobs.&lt;/p&gt;
&lt;h2&gt;
  
  
  CI/CD Process with GitHub Actions
&lt;/h2&gt;

&lt;p&gt;Incorporating a CI/CD pipeline using GitHub Actions involves automating the build, test, and deployment processes of your applications. For this project, GitHub Actions simplifies the deployment of code and resources to Google Cloud. This automation leverages GitHub's infrastructure to trigger workflows based on events such as code pushes, ensuring that your applications are built and deployed consistently and accurately each time code changes are made.&lt;/p&gt;
&lt;h2&gt;
  
  
  GitHub Secrets and Configuration
&lt;/h2&gt;

&lt;p&gt;Utilizing secrets in GitHub Actions is vital for maintaining security during the deployment process. Secrets allow you to store sensitive information such as API keys, passwords, and service account credentials securely. By keeping this sensitive data out of your source code, you minimize the risk of leaks and unauthorized access.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;GCP_BUCKET_BIGDATA_FILES&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Secret used to store the name of the cloud storage&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;GCP_BUCKET_DATALAKE&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Secret used to store the name of the cloud storage&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;GCP_BUCKET_DATAPROC&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Secret used to store the name of the cloud storage&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;GCP_SERVICE_ACCOUNT&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;GCP_SA_KEY&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Secret used to store the value of the service account key. For this project, the default service key was used. &lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;PROJECT_ID&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Secret used to store the project id value&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;
  
  
  Creating a GCP service account key
&lt;/h3&gt;

&lt;p&gt;To create computing resources in any cloud in an automated or programmatic way, it is necessary to have an access key.&lt;br&gt;
In the case of GCP, we use an access key linked to a service account; for this project the default account was used. The console steps are listed below, and a command-line alternative is sketched after them.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;In GCP Console, access :

&lt;ol&gt;
&lt;li&gt;IAM &amp;amp; Admin&lt;/li&gt;
&lt;li&gt;Service accounts&lt;/li&gt;
&lt;li&gt;Select default service account, default name is something like &lt;strong&gt;Compute Engine default service account&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;In the selected service account, open the &lt;strong&gt;KEYS&lt;/strong&gt; menu, 

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;ADD KEY&lt;/strong&gt;, &lt;strong&gt;Create new Key&lt;/strong&gt;, Key Type &lt;strong&gt;json&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Download the key file and use its content as the value of your GitHub secret&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fua188978d64xi4o6php2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fua188978d64xi4o6php2.png" alt="GCP service account key" width="800" height="430"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For more details, access:&lt;br&gt;
&lt;a href="https://cloud.google.com/iam/docs/keys-create-delete" rel="noopener noreferrer"&gt;https://cloud.google.com/iam/docs/keys-create-delete&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Creating github secret
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;To create a new secret:

&lt;ol&gt;
&lt;li&gt;In the project repository, open the &lt;strong&gt;Settings&lt;/strong&gt; menu&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security&lt;/strong&gt;, &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Secrets and variables&lt;/strong&gt;, then &lt;strong&gt;Actions&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;New repository secret&lt;/strong&gt;, then type a &lt;strong&gt;name&lt;/strong&gt; and &lt;strong&gt;value&lt;/strong&gt; for the secret.&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi45cicz0q89ije7j70yf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi45cicz0q89ije7j70yf.png" alt="github secret creation" width="800" height="609"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For more details , access :&lt;br&gt;
&lt;a href="https://docs.github.com/pt/actions/security-for-github-actions/security-guides/using-secrets-in-github-actions" rel="noopener noreferrer"&gt;https://docs.github.com/pt/actions/security-for-github-actions/security-guides/using-secrets-in-github-actions&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Architecture Diagram
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvfastxh43rr298iljs9j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvfastxh43rr298iljs9j.png" alt="Architecture Diagram" width="506" height="554"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Deploying the project
&lt;/h2&gt;

&lt;p&gt;Every time a push to the &lt;strong&gt;main&lt;/strong&gt; branch happens, GitHub Actions is triggered and&lt;br&gt;
runs the YAML script.&lt;br&gt;
The script contains 3 jobs, which are explained in more detail below; essentially,&lt;br&gt;
GitHub Actions uses the credentials of a service account with rights to create&lt;br&gt;
computing resources, authenticates to GCP, and performs the steps described in the YAML file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;push&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;branchs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;main&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Workflow File YAML Explanation
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Environments Needed
&lt;/h3&gt;

&lt;p&gt;Here's a brief overview of the environment variables the workflow relies on, as defined in the YAML below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;BRONZE_DATALAKE_FILES&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;bronze&lt;/span&gt;
    &lt;span class="na"&gt;TRANSIENT_DATALAKE_FILES&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;transient&lt;/span&gt;
    &lt;span class="na"&gt;BUCKET_DATALAKE_FOLDER&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;transient&lt;/span&gt;
    &lt;span class="na"&gt;BUCKET_BIGDATA_JAR_FOLDER&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;jars&lt;/span&gt;
    &lt;span class="na"&gt;BUCKET_BIGDATA_PYSPARK_FOLDER&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;scripts&lt;/span&gt;
    &lt;span class="na"&gt;PYSPARK_INGESTION_SCRIPT &lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ingestion_csv_to_delta.py&lt;/span&gt; 
    &lt;span class="na"&gt;REGION&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;us-east1&lt;/span&gt;
    &lt;span class="na"&gt;ZONE&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;us-east1-b&lt;/span&gt;
    &lt;span class="na"&gt;DATAPROC_CLUSTER_NAME &lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dataproc-bigdata-multi-node-cluster&lt;/span&gt;
    &lt;span class="na"&gt;DATAPROC_WORKER_TYPE &lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;n2-standard-2&lt;/span&gt;
    &lt;span class="na"&gt;DATAPROC_MASTER_TYPE &lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;n2-standard-2&lt;/span&gt;
    &lt;span class="na"&gt;DATAPROC_NUM_WORKERS &lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;
    &lt;span class="na"&gt;DATAPROC_IMAGE_VERSION &lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;2.1-debian11&lt;/span&gt;
    &lt;span class="na"&gt;DATAPROC_WORKER_NUM_LOCAL_SSD&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
    &lt;span class="na"&gt;DATAPROC_MASTER_NUM_LOCAL_SSD&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
    &lt;span class="na"&gt;DATAPROC_MASTER_BOOT_DISK_SIZE&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;32&lt;/span&gt;   
    &lt;span class="na"&gt;DATAPROC_WORKER_DISK_SIZE&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;32&lt;/span&gt;
    &lt;span class="na"&gt;DATAPROC_MASTER_BOOT_DISK_TYPE&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pd-balanced&lt;/span&gt;
    &lt;span class="na"&gt;DATAPROC_WORKER_BOOT_DISK_TYPE&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pd-balanced&lt;/span&gt;
    &lt;span class="na"&gt;DATAPROC_COMPONENTS&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;JUPYTER&lt;/span&gt;
    &lt;span class="na"&gt;DATAPROC_WORKFLOW_NAME&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;departments_etl&lt;/span&gt;
    &lt;span class="na"&gt;DATAPROC_WORKFLOW_INGESTION_STEP_NAME&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ingestion_countries_csv_to_delta&lt;/span&gt; 
    &lt;span class="na"&gt;JAR_LIB1 &lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;delta-core_2.12-2.3.0.jar&lt;/span&gt;
    &lt;span class="na"&gt;JAR_LIB2 &lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;delta-storage-2.3.0.jar&lt;/span&gt; 
    &lt;span class="na"&gt;APP_NAME &lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;countries_ingestion_csv_to_delta'&lt;/span&gt;
    &lt;span class="na"&gt;SUBJECT &lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;departments&lt;/span&gt;
    &lt;span class="na"&gt;STEP1 &lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;countries&lt;/span&gt;
    &lt;span class="na"&gt;STEP2 &lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;departments&lt;/span&gt;
    &lt;span class="na"&gt;STEP3 &lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;employees&lt;/span&gt;
    &lt;span class="na"&gt;STEP4 &lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;jobs&lt;/span&gt;
    &lt;span class="na"&gt;TIME_ZONE &lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;America/Sao_Paulo&lt;/span&gt;
    &lt;span class="na"&gt;SCHEDULE &lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;20&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;12&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;*"&lt;/span&gt;
    &lt;span class="na"&gt;SCHEDULE_NAME &lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;schedule_departments_etl&lt;/span&gt;
    &lt;span class="na"&gt;SERVICE_ACCOUNT_NAME &lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dataproc-account-workflow&lt;/span&gt;
    &lt;span class="na"&gt;CUSTOM_ROLE &lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;WorkflowCustomRole&lt;/span&gt;
    &lt;span class="na"&gt;STEP1_NAME &lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;step_countries&lt;/span&gt;
    &lt;span class="na"&gt;STEP2_NAME &lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;step_departments&lt;/span&gt;
    &lt;span class="na"&gt;STEP3_NAME &lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;step_employees&lt;/span&gt;
    &lt;span class="na"&gt;STEP4_NAME &lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;step_jobs&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Deploy Buckets Job&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;This job is responsible for creating three Google Cloud Storage buckets: one for the data lake (transient input files), one for Dataproc staging data, and one for the JAR libraries and PySpark scripts. It checks whether each bucket already exists before attempting to create it. It then uploads the required files into these buckets to prepare for later processing.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;deploy-buckets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-22.04&lt;/span&gt;
    &lt;span class="na"&gt;timeout-minutes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Checkout&lt;/span&gt;
      &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;

    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Authorize GCP&lt;/span&gt;
      &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;google-github-actions/auth@v2'&lt;/span&gt;
      &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;credentials_json&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  &lt;span class="s"&gt;${{ secrets.GCP_SA_KEY }}&lt;/span&gt;

    &lt;span class="c1"&gt;# Step to Create GCP Bucket &lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Create Google Cloud Storage - files&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
        &lt;span class="s"&gt;if ! gsutil ls -p ${{ secrets.PROJECT_ID }} gs://${{ secrets.GCP_BUCKET_BIGDATA_FILES }} &amp;amp;&amp;gt; /dev/null; \&lt;/span&gt;
          &lt;span class="s"&gt;then \&lt;/span&gt;
            &lt;span class="s"&gt;gcloud storage buckets create gs://${{ secrets.GCP_BUCKET_BIGDATA_FILES }} --default-storage-class=nearline --location=${{ env.REGION }}&lt;/span&gt;
          &lt;span class="s"&gt;else&lt;/span&gt;
            &lt;span class="s"&gt;echo "Cloud Storage : gs://${{ secrets.GCP_BUCKET_BIGDATA_FILES }}  already exists" ! &lt;/span&gt;
          &lt;span class="s"&gt;fi&lt;/span&gt;

    &lt;span class="c1"&gt;# Step to Create GCP Bucket &lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Create Google Cloud Storage - dataproc&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
        &lt;span class="s"&gt;if ! gsutil ls -p ${{ secrets.PROJECT_ID }} gs://${{ secrets.GCP_BUCKET_DATAPROC }} &amp;amp;&amp;gt; /dev/null; \&lt;/span&gt;
          &lt;span class="s"&gt;then \&lt;/span&gt;
            &lt;span class="s"&gt;gcloud storage buckets create gs://${{ secrets.GCP_BUCKET_DATAPROC }} --default-storage-class=nearline --location=${{ env.REGION }}&lt;/span&gt;
          &lt;span class="s"&gt;else&lt;/span&gt;
            &lt;span class="s"&gt;echo "Cloud Storage : gs://${{ secrets.GCP_BUCKET_DATAPROC }}  already exists" ! &lt;/span&gt;
          &lt;span class="s"&gt;fi&lt;/span&gt;

    &lt;span class="c1"&gt;# Step to Create GCP Bucket &lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Create Google Cloud Storage - datalake&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
        &lt;span class="s"&gt;if ! gsutil ls -p ${{ secrets.PROJECT_ID }} gs://${{ secrets.GCP_BUCKET_DATALAKE }} &amp;amp;&amp;gt; /dev/null; \&lt;/span&gt;
          &lt;span class="s"&gt;then \&lt;/span&gt;
            &lt;span class="s"&gt;gcloud storage buckets create gs://${{ secrets.GCP_BUCKET_DATALAKE }} --default-storage-class=nearline --location=${{ env.REGION }}&lt;/span&gt;
          &lt;span class="s"&gt;else&lt;/span&gt;
            &lt;span class="s"&gt;echo "Cloud Storage : gs://${{ secrets.GCP_BUCKET_DATALAKE }}  already exists" ! &lt;/span&gt;
          &lt;span class="s"&gt;fi&lt;/span&gt;

    &lt;span class="c1"&gt;# Step to Upload the file to GCP Bucket - transient files&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Upload transient files to Google Cloud Storage&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
        &lt;span class="s"&gt;TARGET=${{ env.TRANSIENT_DATALAKE_FILES }}&lt;/span&gt;
        &lt;span class="s"&gt;BUCKET_PATH=${{ secrets.GCP_BUCKET_DATALAKE }}/${{ env.BUCKET_DATALAKE_FOLDER }}    &lt;/span&gt;
        &lt;span class="s"&gt;gsutil cp -r $TARGET gs://${BUCKET_PATH}&lt;/span&gt;


    &lt;span class="c1"&gt;# Step to Upload the file to GCP Bucket - jar files&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Upload jar files to Google Cloud Storage&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
        &lt;span class="s"&gt;TARGET=${{ env.BUCKET_BIGDATA_JAR_FOLDER }}&lt;/span&gt;
        &lt;span class="s"&gt;BUCKET_PATH=${{ secrets.GCP_BUCKET_BIGDATA_FILES }}/${{ env.BUCKET_BIGDATA_JAR_FOLDER }}&lt;/span&gt;
        &lt;span class="s"&gt;gsutil cp -r $TARGET gs://${BUCKET_PATH}&lt;/span&gt;

    &lt;span class="c1"&gt;# Step to Upload the file to GCP Bucket - pyspark files&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Upload pyspark files to Google Cloud Storage&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
        &lt;span class="s"&gt;TARGET=${{ env.BUCKET_BIGDATA_PYSPARK_FOLDER }}&lt;/span&gt;
        &lt;span class="s"&gt;BUCKET_PATH=${{ secrets.GCP_BUCKET_BIGDATA_FILES }}/${{ env.BUCKET_BIGDATA_PYSPARK_FOLDER }}&lt;/span&gt;
        &lt;span class="s"&gt;gsutil cp -r $TARGET gs://${BUCKET_PATH}&lt;/span&gt;

    &lt;span class="c1"&gt;# Step to create dataproc cluster&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Upload pyspark files to Google Cloud Storage&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
        &lt;span class="s"&gt;TARGET=${{ env.BUCKET_BIGDATA_PYSPARK_FOLDER }}&lt;/span&gt;
        &lt;span class="s"&gt;BUCKET_PATH=${{ secrets.GCP_BUCKET_BIGDATA_FILES }}/${{ env.BUCKET_BIGDATA_PYSPARK_FOLDER }}&lt;/span&gt;
        &lt;span class="s"&gt;gsutil cp -r $TARGET gs://${BUCKET_PATH}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Explanation:&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;This job begins by checking out the code and authorizing Google Cloud credentials. It then checks for the existence of the three Cloud Storage buckets: one for the data lake, one for Dataproc staging, and one for the JAR files and PySpark scripts. Any bucket that does not exist is created with gcloud. Finally, it uploads the relevant files to the corresponding buckets using gsutil.&lt;/p&gt;
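
&lt;p&gt;For reference, the same idempotent check-then-create pattern can be reproduced by hand from Cloud Shell. The snippet below is a minimal sketch, assuming the project, region and datalake bucket name are exported as shell variables matching the workflow variables above.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Minimal sketch of the idempotent bucket-creation pattern used by this job.
# Assumes PROJECT_ID, REGION and GCP_BUCKET_DATALAKE are already exported.
if ! gsutil ls -p "$PROJECT_ID" "gs://$GCP_BUCKET_DATALAKE"; then
  gcloud storage buckets create "gs://$GCP_BUCKET_DATALAKE" \
    --default-storage-class=nearline --location="$REGION"
else
  echo "Bucket gs://$GCP_BUCKET_DATALAKE already exists"
fi

# Copy the transient CSV files into the datalake bucket
gsutil cp -r transient "gs://$GCP_BUCKET_DATALAKE"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
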

&lt;h2&gt;
  
  
  &lt;strong&gt;Deploy Dataproc Workflow Template Job&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;This job deploys a Dataproc workflow template in Google Cloud. It begins by checking if the workflow template already exists; if not, it creates one. It also sets up a managed Dataproc cluster with specific configurations such as the machine types and number of workers. Subsequently, it adds various steps (jobs) to the workflow template to outline the processing tasks for data ingestion.&lt;/p&gt;

&lt;p&gt;Code Snippet:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;  &lt;span class="na"&gt;deploy-dataproc-workflow-template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;needs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;deploy-buckets&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-22.04&lt;/span&gt;
    &lt;span class="na"&gt;timeout-minutes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;

    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Checkout&lt;/span&gt;
      &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;

    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Authorize GCP&lt;/span&gt;
      &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;google-github-actions/auth@v2'&lt;/span&gt;
      &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;credentials_json&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  &lt;span class="s"&gt;${{ secrets.GCP_SA_KEY }}&lt;/span&gt;

    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Create Dataproc Workflow&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;  
        &lt;span class="s"&gt;if ! gcloud dataproc workflow-templates list --region=${{ env.REGION}} | grep -i ${{ env.DATAPROC_WORKFLOW_NAME}} &amp;amp;&amp;gt; /dev/null; \&lt;/span&gt;
          &lt;span class="s"&gt;then \&lt;/span&gt;
            &lt;span class="s"&gt;gcloud dataproc workflow-templates create ${{ env.DATAPROC_WORKFLOW_NAME }} --region ${{ env.REGION }}&lt;/span&gt;
          &lt;span class="s"&gt;else&lt;/span&gt;
            &lt;span class="s"&gt;echo "Workflow Template : ${{ env.DATAPROC_WORKFLOW_NAME }} already exists" ! &lt;/span&gt;
          &lt;span class="s"&gt;fi&lt;/span&gt;

    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Create Dataproc Managed Cluster&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;&amp;gt;&lt;/span&gt;       
        &lt;span class="s"&gt;gcloud dataproc workflow-templates set-managed-cluster ${{ env.DATAPROC_WORKFLOW_NAME }} &lt;/span&gt;
        &lt;span class="s"&gt;--region ${{ env.REGION }} &lt;/span&gt;
        &lt;span class="s"&gt;--zone ${{ env.ZONE }} &lt;/span&gt;
        &lt;span class="s"&gt;--image-version ${{ env.DATAPROC_IMAGE_VERSION }} &lt;/span&gt;
        &lt;span class="s"&gt;--master-machine-type=${{ env.DATAPROC_MASTER_TYPE }} &lt;/span&gt;
        &lt;span class="s"&gt;--master-boot-disk-type ${{ env.DATAPROC_MASTER_BOOT_DISK_TYPE }} &lt;/span&gt;
        &lt;span class="s"&gt;--master-boot-disk-size ${{ env.DATAPROC_MASTER_BOOT_DISK_SIZE }} &lt;/span&gt;
        &lt;span class="s"&gt;--worker-machine-type=${{ env.DATAPROC_WORKER_TYPE }} &lt;/span&gt;
        &lt;span class="s"&gt;--worker-boot-disk-type ${{ env.DATAPROC_WORKER_BOOT_DISK_TYPE }}&lt;/span&gt;
        &lt;span class="s"&gt;--worker-boot-disk-size ${{ env.DATAPROC_WORKER_DISK_SIZE }} &lt;/span&gt;
        &lt;span class="s"&gt;--num-workers=${{ env.DATAPROC_NUM_WORKERS }} &lt;/span&gt;
        &lt;span class="s"&gt;--cluster-name=${{ env.DATAPROC_CLUSTER_NAME }} &lt;/span&gt;
        &lt;span class="s"&gt;--optional-components ${{ env.DATAPROC_COMPONENTS }} &lt;/span&gt;
        &lt;span class="s"&gt;--service-account=${{ env.GCP_SERVICE_ACCOUNT }}&lt;/span&gt;

    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Add Job Ingestion countries to Workflow&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
        &lt;span class="s"&gt;if gcloud dataproc workflow-templates list --region=${{ env.REGION}} | grep -i ${{ env.DATAPROC_WORKFLOW_NAME}} &amp;amp;&amp;gt; /dev/null; \&lt;/span&gt;
          &lt;span class="s"&gt;then \&lt;/span&gt;

            &lt;span class="s"&gt;if gcloud dataproc workflow-templates describe ${{ env.DATAPROC_WORKFLOW_NAME}} --region=${{ env.REGION}} | grep -i ${{ env.STEP1_NAME  }} &amp;amp;&amp;gt; /dev/null; \&lt;/span&gt;
            &lt;span class="s"&gt;then \&lt;/span&gt;
              &lt;span class="s"&gt;echo "Workflow Template : ${{ env.DATAPROC_WORKFLOW_NAME }} already has step : ${{ env.STEP1_NAME  }} " ! &lt;/span&gt;
            &lt;span class="s"&gt;else&lt;/span&gt;
              &lt;span class="s"&gt;PYSPARK_SCRIPT_PATH=${{ secrets.GCP_BUCKET_BIGDATA_FILES }}/${{ env.BUCKET_BIGDATA_PYSPARK_FOLDER }}/${{ env.PYSPARK_INGESTION_SCRIPT }}&lt;/span&gt;
              &lt;span class="s"&gt;JARS_PATH=gs://${{ secrets.GCP_BUCKET_BIGDATA_FILES }}/${{ env.BUCKET_BIGDATA_JAR_FOLDER }}/${{ env.JAR_LIB1 }}&lt;/span&gt;
              &lt;span class="s"&gt;JARS_PATH=${JARS_PATH},gs://${{ secrets.GCP_BUCKET_BIGDATA_FILES }}/${{ env.BUCKET_BIGDATA_JAR_FOLDER }}/${{ env.JAR_LIB2 }}&lt;/span&gt;
              &lt;span class="s"&gt;TRANSIENT=${{ secrets.GCP_BUCKET_DATALAKE }}/${{ env.BUCKET_DATALAKE_FOLDER }}/${{ env.SUBJECT }}/${{ env.STEP1 }}&lt;/span&gt;
              &lt;span class="s"&gt;BRONZE=${{ secrets.GCP_BUCKET_DATALAKE }}/${{ env.BRONZE_DATALAKE_FILES }}/${{ env.SUBJECT }}/${{ env.STEP1 }}&lt;/span&gt;

              &lt;span class="s"&gt;gcloud dataproc workflow-templates add-job pyspark gs://${PYSPARK_SCRIPT_PATH} \&lt;/span&gt;
              &lt;span class="s"&gt;--workflow-template ${{ env.DATAPROC_WORKFLOW_NAME }}  \&lt;/span&gt;
              &lt;span class="s"&gt;--step-id ${{env.STEP1_NAME }} \&lt;/span&gt;
              &lt;span class="s"&gt;--region ${{ env.REGION }} \&lt;/span&gt;
              &lt;span class="s"&gt;--jars ${JARS_PATH} \&lt;/span&gt;
              &lt;span class="s"&gt;-- --app_name=${{ env.APP_NAME }}${{ env.STEP1 }} --bucket_transient=gs://${TRANSIENT} \&lt;/span&gt;
              &lt;span class="s"&gt;--bucket_bronze=gs://${BRONZE}&lt;/span&gt;
            &lt;span class="s"&gt;fi&lt;/span&gt;
        &lt;span class="s"&gt;else&lt;/span&gt;
          &lt;span class="s"&gt;echo "Workflow Template : ${{ env.DATAPROC_WORKFLOW_NAME}} not exists" ! &lt;/span&gt;
        &lt;span class="s"&gt;fi        &lt;/span&gt;

    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Add Job Ingestion departments to Workflow&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
        &lt;span class="s"&gt;if gcloud dataproc workflow-templates list --region=${{ env.REGION}} | grep -i ${{ env.DATAPROC_WORKFLOW_NAME}} &amp;amp;&amp;gt; /dev/null; \&lt;/span&gt;
          &lt;span class="s"&gt;then \&lt;/span&gt;
            &lt;span class="s"&gt;if gcloud dataproc workflow-templates describe ${{ env.DATAPROC_WORKFLOW_NAME}} --region=${{ env.REGION}} | grep -i ${{ env.STEP2_NAME }} &amp;amp;&amp;gt; /dev/null; \&lt;/span&gt;
            &lt;span class="s"&gt;then \&lt;/span&gt;
              &lt;span class="s"&gt;echo "Workflow Template : ${{ env.DATAPROC_WORKFLOW_NAME }} already has step : ${{ env.STEP2_NAME  }} " ! &lt;/span&gt;
            &lt;span class="s"&gt;else&lt;/span&gt;
              &lt;span class="s"&gt;PYSPARK_SCRIPT_PATH=${{ secrets.GCP_BUCKET_BIGDATA_FILES }}/${{ env.BUCKET_BIGDATA_PYSPARK_FOLDER }}/${{ env.PYSPARK_INGESTION_SCRIPT }}&lt;/span&gt;
              &lt;span class="s"&gt;JARS_PATH=gs://${{ secrets.GCP_BUCKET_BIGDATA_FILES }}/${{ env.BUCKET_BIGDATA_JAR_FOLDER }}/${{ env.JAR_LIB1 }}&lt;/span&gt;
              &lt;span class="s"&gt;JARS_PATH=${JARS_PATH},gs://${{ secrets.GCP_BUCKET_BIGDATA_FILES }}/${{ env.BUCKET_BIGDATA_JAR_FOLDER }}/${{ env.JAR_LIB2 }}&lt;/span&gt;
              &lt;span class="s"&gt;TRANSIENT=${{ secrets.GCP_BUCKET_DATALAKE }}/${{ env.BUCKET_DATALAKE_FOLDER }}/${{ env.SUBJECT }}/${{ env.STEP2 }}&lt;/span&gt;
              &lt;span class="s"&gt;BRONZE=${{ secrets.GCP_BUCKET_DATALAKE }}/${{ env.BRONZE_DATALAKE_FILES }}/${{ env.SUBJECT }}/${{ env.STEP2 }}&lt;/span&gt;

              &lt;span class="s"&gt;gcloud dataproc workflow-templates add-job pyspark gs://${PYSPARK_SCRIPT_PATH} \&lt;/span&gt;
              &lt;span class="s"&gt;--workflow-template ${{ env.DATAPROC_WORKFLOW_NAME }}  \&lt;/span&gt;
              &lt;span class="s"&gt;--step-id ${{ env.STEP2_NAME }} \&lt;/span&gt;
              &lt;span class="s"&gt;--start-after ${{ env.STEP1_NAME }} \&lt;/span&gt;
              &lt;span class="s"&gt;--region ${{ env.REGION }} \&lt;/span&gt;
              &lt;span class="s"&gt;--jars ${JARS_PATH} \&lt;/span&gt;
              &lt;span class="s"&gt;-- --app_name=${{ env.APP_NAME }}${{ env.STEP2 }} --bucket_transient=gs://${TRANSIENT} \&lt;/span&gt;
              &lt;span class="s"&gt;--bucket_bronze=gs://${BRONZE}&lt;/span&gt;
            &lt;span class="s"&gt;fi&lt;/span&gt;
        &lt;span class="s"&gt;else&lt;/span&gt;
          &lt;span class="s"&gt;echo "Workflow Template : ${{ env.DATAPROC_WORKFLOW_NAME}} not exists" ! &lt;/span&gt;
        &lt;span class="s"&gt;fi&lt;/span&gt;


    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Add Job Ingestion employees to Workflow&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
        &lt;span class="s"&gt;if gcloud dataproc workflow-templates list --region=${{ env.REGION}} | grep -i ${{ env.DATAPROC_WORKFLOW_NAME}} &amp;amp;&amp;gt; /dev/null; \&lt;/span&gt;
          &lt;span class="s"&gt;then \&lt;/span&gt;
            &lt;span class="s"&gt;if gcloud dataproc workflow-templates describe ${{ env.DATAPROC_WORKFLOW_NAME}} --region=${{ env.REGION}} | grep -i ${{ env.STEP3_NAME }} &amp;amp;&amp;gt; /dev/null; \&lt;/span&gt;
            &lt;span class="s"&gt;then \&lt;/span&gt;
              &lt;span class="s"&gt;echo "Workflow Template : ${{ env.DATAPROC_WORKFLOW_NAME }} already has step : ${{ env.STEP3_NAME }} " ! &lt;/span&gt;
            &lt;span class="s"&gt;else&lt;/span&gt;
              &lt;span class="s"&gt;PYSPARK_SCRIPT_PATH=${{ secrets.GCP_BUCKET_BIGDATA_FILES }}/${{ env.BUCKET_BIGDATA_PYSPARK_FOLDER }}/${{ env.PYSPARK_INGESTION_SCRIPT }}&lt;/span&gt;
              &lt;span class="s"&gt;JARS_PATH=gs://${{ secrets.GCP_BUCKET_BIGDATA_FILES }}/${{ env.BUCKET_BIGDATA_JAR_FOLDER }}/${{ env.JAR_LIB1 }}&lt;/span&gt;
              &lt;span class="s"&gt;JARS_PATH=${JARS_PATH},gs://${{ secrets.GCP_BUCKET_BIGDATA_FILES }}/${{ env.BUCKET_BIGDATA_JAR_FOLDER }}/${{ env.JAR_LIB2 }}&lt;/span&gt;
              &lt;span class="s"&gt;TRANSIENT=${{ secrets.GCP_BUCKET_DATALAKE }}/${{ env.BUCKET_DATALAKE_FOLDER }}/${{ env.SUBJECT }}/${{ env.STEP2 }}&lt;/span&gt;
              &lt;span class="s"&gt;PYSPARK_SCRIPT_PATH=${{ secrets.GCP_BUCKET_BIGDATA_FILES }}/${{ env.BUCKET_BIGDATA_PYSPARK_FOLDER }}/${{ env.PYSPARK_INGESTION_SCRIPT }}&lt;/span&gt;
              &lt;span class="s"&gt;JARS_PATH=gs://${{ secrets.GCP_BUCKET_BIGDATA_FILES }}/${{ env.BUCKET_BIGDATA_JAR_FOLDER }}/${{ env.JAR_LIB1 }}&lt;/span&gt;
              &lt;span class="s"&gt;JARS_PATH=${JARS_PATH},gs://${{ secrets.GCP_BUCKET_BIGDATA_FILES }}/${{ env.BUCKET_BIGDATA_JAR_FOLDER }}/${{ env.JAR_LIB2 }}&lt;/span&gt;
              &lt;span class="s"&gt;TRANSIENT=${{ secrets.GCP_BUCKET_DATALAKE }}/${{ env.BUCKET_DATALAKE_FOLDER }}/${{ env.SUBJECT }}/${{ env.STEP3 }}&lt;/span&gt;
              &lt;span class="s"&gt;BRONZE=${{ secrets.GCP_BUCKET_DATALAKE }}/${{ env.BRONZE_DATALAKE_FILES }}/${{ env.SUBJECT }}/${{ env.STEP3 }}&lt;/span&gt;

              &lt;span class="s"&gt;gcloud dataproc workflow-templates add-job pyspark gs://${PYSPARK_SCRIPT_PATH} \&lt;/span&gt;
              &lt;span class="s"&gt;--workflow-template ${{ env.DATAPROC_WORKFLOW_NAME }}  \&lt;/span&gt;
              &lt;span class="s"&gt;--step-id ${{ env.STEP3_NAME }} \&lt;/span&gt;
              &lt;span class="s"&gt;--start-after ${{ env.STEP1_NAME }},${{ env.STEP2_NAME }} \&lt;/span&gt;
              &lt;span class="s"&gt;--region ${{ env.REGION }} \&lt;/span&gt;
              &lt;span class="s"&gt;--jars ${JARS_PATH} \&lt;/span&gt;
              &lt;span class="s"&gt;-- --app_name=${{ env.APP_NAME }}${{ env.STEP3 }} --bucket_transient=gs://${TRANSIENT} \&lt;/span&gt;
              &lt;span class="s"&gt;--bucket_bronze=gs://${BRONZE}&lt;/span&gt;
            &lt;span class="s"&gt;fi&lt;/span&gt;
        &lt;span class="s"&gt;else&lt;/span&gt;
          &lt;span class="s"&gt;echo "Workflow Template : ${{ env.DATAPROC_WORKFLOW_NAME}} not exists" ! &lt;/span&gt;
        &lt;span class="s"&gt;fi&lt;/span&gt;

    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Add Job Ingestion Jobs to Workflow&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
        &lt;span class="s"&gt;if gcloud dataproc workflow-templates list --region=${{ env.REGION}} | grep -i ${{ env.DATAPROC_WORKFLOW_NAME}} &amp;amp;&amp;gt; /dev/null; \&lt;/span&gt;
          &lt;span class="s"&gt;then \&lt;/span&gt;
            &lt;span class="s"&gt;if gcloud dataproc workflow-templates describe ${{ env.DATAPROC_WORKFLOW_NAME}} --region=${{ env.REGION}} | grep -i ${{ env.STEP4_NAME }} &amp;amp;&amp;gt; /dev/null; \&lt;/span&gt;
            &lt;span class="s"&gt;then \&lt;/span&gt;
              &lt;span class="s"&gt;echo "Workflow Template : ${{ env.DATAPROC_WORKFLOW_NAME }} already has step : ${{ env.STEP4_NAME }} " ! &lt;/span&gt;
            &lt;span class="s"&gt;else&lt;/span&gt;
              &lt;span class="s"&gt;PYSPARK_SCRIPT_PATH=${{ secrets.GCP_BUCKET_BIGDATA_FILES }}/${{ env.BUCKET_BIGDATA_PYSPARK_FOLDER }}/${{ env.PYSPARK_INGESTION_SCRIPT }}&lt;/span&gt;
              &lt;span class="s"&gt;JARS_PATH=gs://${{ secrets.GCP_BUCKET_BIGDATA_FILES }}/${{ env.BUCKET_BIGDATA_JAR_FOLDER }}/${{ env.JAR_LIB1 }}&lt;/span&gt;
              &lt;span class="s"&gt;JARS_PATH=${JARS_PATH},gs://${{ secrets.GCP_BUCKET_BIGDATA_FILES }}/${{ env.BUCKET_BIGDATA_JAR_FOLDER }}/${{ env.JAR_LIB2 }}&lt;/span&gt;
              &lt;span class="s"&gt;TRANSIENT=${{ secrets.GCP_BUCKET_DATALAKE }}/${{ env.BUCKET_DATALAKE_FOLDER }}/${{ env.SUBJECT }}/${{ env.STEP2 }}&lt;/span&gt;
              &lt;span class="s"&gt;PYSPARK_SCRIPT_PATH=${{ secrets.GCP_BUCKET_BIGDATA_FILES }}/${{ env.BUCKET_BIGDATA_PYSPARK_FOLDER }}/${{ env.PYSPARK_INGESTION_SCRIPT }}&lt;/span&gt;
              &lt;span class="s"&gt;JARS_PATH=gs://${{ secrets.GCP_BUCKET_BIGDATA_FILES }}/${{ env.BUCKET_BIGDATA_JAR_FOLDER }}/${{ env.JAR_LIB1 }}&lt;/span&gt;
              &lt;span class="s"&gt;JARS_PATH=${JARS_PATH},gs://${{ secrets.GCP_BUCKET_BIGDATA_FILES }}/${{ env.BUCKET_BIGDATA_JAR_FOLDER }}/${{ env.JAR_LIB2 }}&lt;/span&gt;
              &lt;span class="s"&gt;TRANSIENT=${{ secrets.GCP_BUCKET_DATALAKE }}/${{ env.BUCKET_DATALAKE_FOLDER }}/${{ env.SUBJECT }}/${{ env.STEP4 }}&lt;/span&gt;
              &lt;span class="s"&gt;BRONZE=${{ secrets.GCP_BUCKET_DATALAKE }}/${{ env.BRONZE_DATALAKE_FILES }}/${{ env.SUBJECT }}/${{ env.STEP4 }}&lt;/span&gt;

              &lt;span class="s"&gt;gcloud dataproc workflow-templates add-job pyspark gs://${PYSPARK_SCRIPT_PATH} \&lt;/span&gt;
              &lt;span class="s"&gt;--workflow-template ${{ env.DATAPROC_WORKFLOW_NAME }}  \&lt;/span&gt;
              &lt;span class="s"&gt;--step-id ${{ env.STEP4_NAME }} \&lt;/span&gt;
              &lt;span class="s"&gt;--start-after ${{ env.STEP1_NAME }},${{ env.STEP2_NAME }},${{ env.STEP3_NAME }} \&lt;/span&gt;
              &lt;span class="s"&gt;--region ${{ env.REGION }} \&lt;/span&gt;
              &lt;span class="s"&gt;--jars ${JARS_PATH} \&lt;/span&gt;
              &lt;span class="s"&gt;-- --app_name=${{ env.APP_NAME }}${{ env.STEP4 }} --bucket_transient=gs://${TRANSIENT} \&lt;/span&gt;
              &lt;span class="s"&gt;--bucket_bronze=gs://${BRONZE}&lt;/span&gt;
            &lt;span class="s"&gt;fi&lt;/span&gt;
        &lt;span class="s"&gt;else&lt;/span&gt;
          &lt;span class="s"&gt;echo "Workflow Template : ${{ env.DATAPROC_WORKFLOW_NAME}} not exists" ! &lt;/span&gt;
        &lt;span class="s"&gt;fi&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Explanation&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;This job follows a systematic approach to deploying a Dataproc workflow template. It first checks whether the workflow template exists and creates it if it does not. Next, a managed Dataproc cluster is attached to the template with the specified properties (machine types, disk sizes, number of workers). The job then adds each data ingestion task to the template as a PySpark job, passing the transient and bronze bucket paths as script arguments; steps after the first declare their predecessors with --start-after, which defines the execution order. The remaining Add Job steps are structured in the same way, each handling a different dataset within the workflow.&lt;/p&gt;
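
&lt;p&gt;Stripped of the GitHub Actions wrapper, the job boils down to a handful of gcloud calls. The sketch below condenses the pattern for a single step, using values from the workflow env vars; YOUR_BIGDATA_BUCKET and YOUR_DATALAKE_BUCKET are placeholders for the bucket names stored as secrets, and the app_name value is illustrative.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Condensed sketch of the gcloud calls behind this job, outside GitHub Actions.
# YOUR_BIGDATA_BUCKET / YOUR_DATALAKE_BUCKET are placeholders for the secret bucket names.
gcloud dataproc workflow-templates create departments_etl --region us-east1

gcloud dataproc workflow-templates set-managed-cluster departments_etl \
  --region us-east1 --zone us-east1-b --image-version 2.1-debian11 \
  --master-machine-type n2-standard-2 --worker-machine-type n2-standard-2 \
  --num-workers 2 --cluster-name dataproc-bigdata-multi-node-cluster

# Each ingestion task is a PySpark job; --start-after wires the dependency chain.
gcloud dataproc workflow-templates add-job pyspark \
  gs://YOUR_BIGDATA_BUCKET/scripts/ingestion_csv_to_delta.py \
  --workflow-template departments_etl \
  --step-id step_departments \
  --start-after step_countries \
  --region us-east1 \
  --jars gs://YOUR_BIGDATA_BUCKET/jars/delta-core_2.12-2.3.0.jar,gs://YOUR_BIGDATA_BUCKET/jars/delta-storage-2.3.0.jar \
  -- --app_name=departments_ingestion \
  --bucket_transient=gs://YOUR_DATALAKE_BUCKET/transient/departments/departments \
  --bucket_bronze=gs://YOUR_DATALAKE_BUCKET/bronze/departments/departments
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
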

&lt;h2&gt;
  
  
  &lt;strong&gt;Deploy Cloud Schedule Job&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;This job sets up a scheduling mechanism using Google Cloud Scheduler. It creates a service account specifically for the scheduled job, defines a custom role with the required permissions, and binds the custom role to the service account. Finally, it creates the Cloud Scheduler job that triggers the execution of the workflow at defined intervals.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;    &lt;span class="na"&gt;deploy-cloud-schedule&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;needs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;deploy-buckets&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;deploy-dataproc-workflow-template&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-22.04&lt;/span&gt;
    &lt;span class="na"&gt;timeout-minutes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;

    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Checkout&lt;/span&gt;
      &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;

    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Authorize GCP&lt;/span&gt;
      &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;google-github-actions/auth@v2'&lt;/span&gt;
      &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;credentials_json&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  &lt;span class="s"&gt;${{ secrets.GCP_SA_KEY }}&lt;/span&gt;

    &lt;span class="c1"&gt;# Step to Authenticate with GCP&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Set up Cloud SDK&lt;/span&gt;
      &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;google-github-actions/setup-gcloud@v2&lt;/span&gt;
      &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;&amp;gt;=&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;363.0.0'&lt;/span&gt;
        &lt;span class="na"&gt;project_id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.PROJECT_ID }}&lt;/span&gt;

    &lt;span class="c1"&gt;# Step to Configure Docker to use the gcloud command-line tool as a credential helper&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Configure Docker&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
        &lt;span class="s"&gt;gcloud auth configure-docker&lt;/span&gt;


    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Create service account&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;

        &lt;span class="s"&gt;if ! gcloud iam service-accounts list | grep -i ${{ env.SERVICE_ACCOUNT_NAME}} &amp;amp;&amp;gt; /dev/null; \&lt;/span&gt;
          &lt;span class="s"&gt;then \&lt;/span&gt;
            &lt;span class="s"&gt;gcloud iam service-accounts create ${{ env.SERVICE_ACCOUNT_NAME }} \&lt;/span&gt;
            &lt;span class="s"&gt;--display-name="scheduler dataproc workflow service account"&lt;/span&gt;
          &lt;span class="s"&gt;fi&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Create Custom role for service account&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
        &lt;span class="s"&gt;if ! gcloud iam roles list --project ${{ secrets.PROJECT_ID }} | grep -i ${{ env.CUSTOM_ROLE }} &amp;amp;&amp;gt; /dev/null; \&lt;/span&gt;
          &lt;span class="s"&gt;then \&lt;/span&gt;
            &lt;span class="s"&gt;gcloud iam roles create ${{ env.CUSTOM_ROLE }} --project ${{ secrets.PROJECT_ID }} \&lt;/span&gt;
            &lt;span class="s"&gt;--title "Dataproc Workflow template scheduler" --description "Dataproc Workflow template scheduler" \&lt;/span&gt;
            &lt;span class="s"&gt;--permissions "dataproc.workflowTemplates.instantiate,iam.serviceAccounts.actAs" --stage ALPHA&lt;/span&gt;
          &lt;span class="s"&gt;fi    &lt;/span&gt;

    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Add the custom role for service account&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
        &lt;span class="s"&gt;gcloud projects add-iam-policy-binding ${{secrets.PROJECT_ID}} \&lt;/span&gt;
        &lt;span class="s"&gt;--member=serviceAccount:${{env.SERVICE_ACCOUNT_NAME}}@${{secrets.PROJECT_ID}}.iam.gserviceaccount.com \&lt;/span&gt;
        &lt;span class="s"&gt;--role=projects/${{secrets.PROJECT_ID}}/roles/${{env.CUSTOM_ROLE}}&lt;/span&gt;

    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Create cloud schedule for workflow execution&lt;/span&gt;
      &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
        &lt;span class="s"&gt;if ! gcloud scheduler jobs list --location ${{env.REGION}} | grep -i ${{env.SCHEDULE_NAME}} &amp;amp;&amp;gt; /dev/null; \&lt;/span&gt;
          &lt;span class="s"&gt;then \&lt;/span&gt;
            &lt;span class="s"&gt;gcloud scheduler jobs create http ${{env.SCHEDULE_NAME}} \&lt;/span&gt;
            &lt;span class="s"&gt;--schedule="30 12 * * *" \&lt;/span&gt;
            &lt;span class="s"&gt;--description="Dataproc workflow " \&lt;/span&gt;
            &lt;span class="s"&gt;--location=${{env.REGION}} \&lt;/span&gt;
            &lt;span class="s"&gt;--uri=https://dataproc.googleapis.com/v1/projects/${{secrets.PROJECT_ID}}/regions/${{env.REGION}}/workflowTemplates/${{env.DATAPROC_WORKFLOW_NAME}}:instantiate?alt=json \&lt;/span&gt;
            &lt;span class="s"&gt;--time-zone=${{env.TIME_ZONE}} \&lt;/span&gt;
            &lt;span class="s"&gt;--oauth-service-account-email=${{env.SERVICE_ACCOUNT_NAME}}@${{secrets.PROJECT_ID}}.iam.gserviceaccount.com&lt;/span&gt;
          &lt;span class="s"&gt;fi&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Explanation&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;In this job, a service account is created specifically for handling the scheduled workflow execution. It also defines a custom role that grants the necessary permissions for the service account to instantiate the workflow template. This custom role is then associated with the service account to ensure it has the required permissions. Finally, the job creates a cloud schedule that triggers the workflow execution at predetermined times, ensuring automated execution of the data processing workflow.&lt;/p&gt;
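
&lt;p&gt;Outside of the pipeline, the same setup can be reproduced with plain gcloud commands. The sketch below mirrors the job above, assuming PROJECT_ID is exported and reusing the service account, role and schedule names from the workflow env vars.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Condensed sketch of the scheduling setup, mirroring the job above.
# Assumes PROJECT_ID is exported; the other names come from the workflow env vars.
gcloud iam service-accounts create dataproc-account-workflow \
  --display-name="scheduler dataproc workflow service account"

gcloud iam roles create WorkflowCustomRole --project "$PROJECT_ID" \
  --title "Dataproc Workflow template scheduler" \
  --permissions "dataproc.workflowTemplates.instantiate,iam.serviceAccounts.actAs" \
  --stage ALPHA

gcloud projects add-iam-policy-binding "$PROJECT_ID" \
  --member="serviceAccount:dataproc-account-workflow@$PROJECT_ID.iam.gserviceaccount.com" \
  --role="projects/$PROJECT_ID/roles/WorkflowCustomRole"

# The scheduler job POSTs to the workflowTemplates:instantiate REST endpoint.
gcloud scheduler jobs create http schedule_departments_etl \
  --schedule="20 12 * * *" --time-zone="America/Sao_Paulo" --location=us-east1 \
  --uri="https://dataproc.googleapis.com/v1/projects/$PROJECT_ID/regions/us-east1/workflowTemplates/departments_etl:instantiate?alt=json" \
  --oauth-service-account-email="dataproc-account-workflow@$PROJECT_ID.iam.gserviceaccount.com"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
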

&lt;h2&gt;
  
  
  &lt;strong&gt;Resources created after deploy process&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Dataproc Workflow Template&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;After deploying the project, you can access the Dataproc service to view the Workflow template. In the Workflow tab, you can explore various options, including monitoring workflow executions and analyzing their details.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl8c09t53cv125vjx89gd.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl8c09t53cv125vjx89gd.jpeg" alt="Dataproc Workflow Template 1" width="800" height="309"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When you select the created workflow, you can see the cluster used for processing and the steps that comprise the workflow, including any dependencies between the steps. This visibility allows you to track the workflow's operational flow.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmsalytbi3oc92ofwj9z0.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmsalytbi3oc92ofwj9z0.jpeg" alt="Dataproc Workflow Template 2" width="800" height="709"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Additionally, within the Dataproc service, you can monitor the execution status of each job. It provides details about each execution, including the performance of individual steps within the workflow template, as illustrated below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr1aps48bpuu9hpsztbtb.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr1aps48bpuu9hpsztbtb.jpeg" alt="Dataproc Workflow Template 3" width="800" height="297"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Cloud Scheduler&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;By accessing the Cloud Scheduler service, you'll find the scheduled job created during the deployment process. The interface displays the last run status, the defined schedule for execution, and additional details about the target URL and other parameters.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7k92sfeqnxwynlr200lj.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7k92sfeqnxwynlr200lj.jpeg" alt="Cloud Scheduler" width="800" height="115"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Cloud Storage&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;As part of the deployment process, several Cloud Storage buckets are created: one bucket for storing the data lake data, another for the Dataproc cluster, and a third for the PySpark scripts and libraries used in the project. The Dataproc service itself also creates buckets to manage temporary data generated during processing.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuulc45ddeqy0ucggsclt.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuulc45ddeqy0ucggsclt.jpeg" alt="Cloud Storage" width="800" height="394"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After the data processing is complete, a new bronze directory is created in the data lake bucket to store the ingested data. The transient directory, created during the deployment phase, is the location where the input data was copied from the GitHub repository to Cloud Storage. In a production environment, another application would likely handle the ingestion of data into this transient layer.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwpm81m10u0gjt401dftw.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwpm81m10u0gjt401dftw.jpeg" alt="Cloud Storage Files" width="800" height="437"&gt;&lt;/a&gt;&lt;/p&gt;
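
&lt;p&gt;To confirm the ingestion output without opening the console, the bucket layout can also be listed from the command line; a small sketch, assuming the datalake bucket name is exported as before.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# List the transient (input) and bronze (output) layers of the datalake bucket
gsutil ls -r gs://$GCP_BUCKET_DATALAKE/transient
gsutil ls -r gs://$GCP_BUCKET_DATALAKE/bronze
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
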

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;Data pipelines are crucial components in the landscape of data processing. While there are robust and feature-rich tools available, such as Azure Data Factory and Apache Airflow, simpler solutions can be valuable in certain scenarios. The decision on the most appropriate tool ultimately rests with the data and architecture teams, who must assess the specific needs and context to select the best solution for the moment.&lt;/p&gt;

&lt;h3&gt;
  
  
  Links and References
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://github.com/jader-lima/gcp-dataproc-workflow" rel="noopener noreferrer"&gt;Github Repo&lt;/a&gt;&lt;br&gt;
&lt;a href="https://cloud.google.com/dataproc/docs/concepts/workflows/workflow-schedule-solutions" rel="noopener noreferrer"&gt;Dataproc workflow documentation&lt;/a&gt;&lt;/p&gt;

</description>
      <category>pyspark</category>
      <category>gcp</category>
      <category>dataproc</category>
      <category>pipelines</category>
    </item>
    <item>
      <title>Running pyspark jobs on Google Cloud Dataproc</title>
      <dc:creator>Jader Lima</dc:creator>
      <pubDate>Mon, 05 Aug 2024 22:50:14 +0000</pubDate>
      <link>https://dev.to/jader_lima_b72a63be5bbddc/running-pyspark-jobs-on-google-cloud-dataproc-jd1</link>
      <guid>https://dev.to/jader_lima_b72a63be5bbddc/running-pyspark-jobs-on-google-cloud-dataproc-jd1</guid>
      <description>&lt;p&gt;This blog focuses on data processing and its tools and techniques, with a particular emphasis on big data tools in cloud environments like GCP, AWS, and Azure. Featuring hands-on material, we will demonstrate various ways to use different tools and techniques. Feel free to use the material as you wish and reach out with any suggestions.&lt;/p&gt;

&lt;h2&gt;
  
  
  About This Post
&lt;/h2&gt;

&lt;p&gt;In this post, we will process data using batch processing, in which jobs are started manually or by a scheduling tool. A batch job completes its task at the end of processing and then waits until it is triggered again, either manually or automatically. The data will be stored in cloud storage following the medallion architecture, also known as the multi-hop architecture. This architecture typically consists of three layers: bronze, silver, and gold, and is widely used for building data lakes and data lakehouses.&lt;/p&gt;

&lt;p&gt;In the medallion architecture:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Bronze Layer&lt;/strong&gt;: This is the raw data layer, where data is stored in a format suitable for big data processing, such as Parquet, Delta, or ORC.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Silver Layer&lt;/strong&gt;: In this layer, data is cleaned, organized, and deduplicated but without significant transformations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gold Layer&lt;/strong&gt;: This is the layer designed for consumption, where significant business transformations and aggregations take place.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Raw, Trusted, and Refined Layers&lt;/strong&gt;: These terms are equivalent to bronze, silver, and gold, respectively.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The Dataproc service will handle reading, processing, cleaning, and writing the data in the different layers of the data lake.&lt;/p&gt;

&lt;h2&gt;
  
  
  Apache Spark
&lt;/h2&gt;

&lt;p&gt;Apache Spark is an open-source framework for large-scale data processing. It provides a unified programming interface for cluster computing and enables efficient parallel data processing. Spark supports multiple programming languages, including Python, Scala, and Java, and is widely used for data analysis, machine learning, and real-time data processing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Google Dataproc
&lt;/h2&gt;

&lt;p&gt;Google Dataproc is a managed service from Google Cloud that simplifies the processing of large datasets using frameworks like Apache Spark and Apache Hadoop. It streamlines the creation, management, and scaling of data processing clusters, allowing you to focus on data analysis rather than infrastructure.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cloud Storage
&lt;/h2&gt;

&lt;p&gt;Google Cloud Storage is a highly scalable and durable object storage service from Google Cloud. It allows you to store and access large volumes of unstructured data, such as media files, backups, and large datasets. Cloud Storage offers various storage options to meet different performance and cost needs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture Diagram
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhynzqq67862fmg08c14i.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhynzqq67862fmg08c14i.jpg" alt="Architecture Diagram" width="485" height="340"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Environment Variable Configuration
&lt;/h2&gt;

&lt;p&gt;The commands below can be executed using the Google Cloud Shell or by configuring the GCP CLI on a personal notebook.&lt;/p&gt;

&lt;p&gt;To list existing projects, execute the command below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud projects list
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To list available regions, execute the command below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud compute regions list
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To list available zones, execute the command below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud compute zones list
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The variables below are used to create the necessary storage buckets. Three buckets will be created: one to serve as the data lake, another to store the PySpark scripts and auxiliary JARs, and the last to store Dataproc cluster information. For a simple test of the Dataproc service, the configurations below are sufficient and work with a GCP free trial account.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;######## STORAGE NAMES AND GENERAL PARAMETERS ##########&lt;/span&gt;
&lt;span class="nv"&gt;PROJECT_ID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&amp;lt;YOUR_GCP_PROJECT_ID&amp;gt;
&lt;span class="nv"&gt;REGION&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&amp;lt;YOUR_GCP_REGION&amp;gt;
&lt;span class="nv"&gt;ZONE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&amp;lt;YOUR_GCP_ZONE&amp;gt;
&lt;span class="nv"&gt;GCP_BUCKET_DATALAKE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&amp;lt;YOUR_DATALAKE_STORAGE_NAME&amp;gt;
&lt;span class="nv"&gt;GCP_BUCKET_BIGDATA_FILES&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&amp;lt;YOUR_STORAGE_FILE_NAME&amp;gt;
&lt;span class="nv"&gt;GCP_BUCKET_DATAPROC&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&amp;lt;YOUR_STORAGE_DATAPROC_NAME&amp;gt;

&lt;span class="c"&gt;###### DATAPROC ENV #################&lt;/span&gt;
&lt;span class="nv"&gt;DATAPROC_CLUSTER_NAME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&amp;lt;YOUR_DATAPROC_CLUSTER_NAME&amp;gt;
&lt;span class="nv"&gt;DATAPROC_WORKER_TYPE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;n2-standard-2
&lt;span class="nv"&gt;DATAPROC_MASTER_TYPE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;n2-standard-2
&lt;span class="nv"&gt;DATAPROC_NUM_WORKERS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;2
&lt;span class="nv"&gt;DATAPROC_IMAGE_VERSION&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;2.1-debian11
&lt;span class="nv"&gt;DATAPROC_WORKER_NUM_LOCAL_SSD&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1
&lt;span class="nv"&gt;DATAPROC_MASTER_NUM_LOCAL_SSD&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1
&lt;span class="nv"&gt;DATAPROC_MASTER_BOOT_DISK_SIZE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;32   
&lt;span class="nv"&gt;DATAPROC_WORKER_DISK_SIZE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;32
&lt;span class="nv"&gt;DATAPROC_MASTER_BOOT_DISK_TYPE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;pd-balanced
&lt;span class="nv"&gt;DATAPROC_WORKER_BOOT_DISK_TYPE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;pd-balanced
&lt;span class="nv"&gt;DATAPROC_COMPONENTS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;JUPYTER

&lt;span class="c"&gt;#########&lt;/span&gt;
&lt;span class="nv"&gt;GCP_STORAGE_PREFIX&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;gs://
&lt;span class="nv"&gt;BRONZE_DATALAKE_FILES&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;bronze
&lt;span class="nv"&gt;TRANSIENT_DATALAKE_FILES&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;transient
&lt;span class="nv"&gt;BUCKET_DATALAKE_FOLDER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;transient
&lt;span class="nv"&gt;BUCKET_BIGDATA_JAR_FOLDER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;jars
&lt;span class="nv"&gt;BUCKET_BIGDATA_PYSPARK_FOLDER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;scripts
&lt;span class="nv"&gt;DATAPROC_APP_NAME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;ingestion_countries_csv_to_delta 
&lt;span class="nv"&gt;JAR_LIB1&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;delta-core_2.12-2.3.0.jar
&lt;span class="nv"&gt;JAR_LIB2&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;delta-storage-2.3.0.jar 
&lt;span class="nv"&gt;APP_NAME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'countries_ingestion_csv_to_delta'&lt;/span&gt;
&lt;span class="nv"&gt;SUBJECT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;departments
&lt;span class="nv"&gt;FILE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;countries
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Creating the Services that Make Up the Solution
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Storage&lt;/strong&gt;
Create the storage buckets with the commands below:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud storage buckets create gs://&lt;span class="nv"&gt;$GCP_BUCKET_BIGDATA_FILES&lt;/span&gt; &lt;span class="nt"&gt;--default-storage-class&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;nearline &lt;span class="nt"&gt;--location&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$REGION&lt;/span&gt;
gcloud storage buckets create gs://&lt;span class="nv"&gt;$GCP_BUCKET_DATALAKE&lt;/span&gt; &lt;span class="nt"&gt;--default-storage-class&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;nearline &lt;span class="nt"&gt;--location&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$REGION&lt;/span&gt;
gcloud storage buckets create gs://&lt;span class="nv"&gt;$GCP_BUCKET_DATAPROC&lt;/span&gt; &lt;span class="nt"&gt;--default-storage-class&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;nearline &lt;span class="nt"&gt;--location&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$REGION&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The result should look like the image below:&lt;br&gt;
&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6fbhao8ztvbh9c8vrbe9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6fbhao8ztvbh9c8vrbe9.png" alt="storage created" width="800" height="351"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Uploading Solution Files
&lt;/h2&gt;

&lt;p&gt;After creating the Cloud Storage buckets, we need to upload the CSV files that will be processed by Dataproc, as well as the PySpark script and the libraries it uses.&lt;/p&gt;

&lt;p&gt;Before starting the file upload, it's important to understand the project repository structure, which includes folders for the Data Lake files, the Python code, and the libraries.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnqe64qlqlrh5emmm8tgm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnqe64qlqlrh5emmm8tgm.png" alt="repo structure" width="249" height="353"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Data Lake Storage
&lt;/h3&gt;

&lt;p&gt;There are various ways to upload the files. We'll choose the easiest method: simply select the storage created to store the Data Lake files, click on "Upload Folder," and select the "transient" folder. The "transient" folder is available in the application repository; just download it to your local machine.&lt;/p&gt;

&lt;p&gt;The result should look like the image below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgbdovv3fcack3tbtsjj5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgbdovv3fcack3tbtsjj5.png" alt="datalake" width="800" height="419"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwh7w56kt62cwlkl1juub.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwh7w56kt62cwlkl1juub.png" alt="datalake2" width="800" height="574"&gt;&lt;/a&gt;&lt;/p&gt;
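
&lt;p&gt;Alternatively, the same upload can be done from Cloud Shell with gsutil, reusing the variables defined earlier. A minimal sketch, assuming the command is run from the root of the cloned repository.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Copy the transient folder (CSV files) from the repository to the datalake bucket
gsutil cp -r transient gs://$GCP_BUCKET_DATALAKE
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
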
&lt;h3&gt;
  
  
  Big Data Files Storage
&lt;/h3&gt;

&lt;p&gt;Next, upload the PySpark script that contains the application's processing logic, together with the libraries required to save the data in Delta format: upload the "jars" and "scripts" folders.&lt;/p&gt;
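&lt;p&gt;This upload can also be scripted. A minimal sketch, assuming the "jars" and "scripts" folders are in your current working directory and that $GCP_BUCKET_BIGDATA_FILES holds the name of the big data files bucket created earlier:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Upload the PySpark script and the Delta Lake jars to the big data files bucket
gcloud storage cp --recursive ./scripts gs://$GCP_BUCKET_BIGDATA_FILES/
gcloud storage cp --recursive ./jars gs://$GCP_BUCKET_BIGDATA_FILES/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
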

&lt;p&gt;The result should look like the image below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft9sc7fm39gqhtonz63y1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft9sc7fm39gqhtonz63y1.png" alt="big data files" width="800" height="435"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Dataproc Cluster
&lt;/h2&gt;

&lt;p&gt;When choosing between a single-node cluster and a multi-node cluster, consider the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Single-Node Cluster:&lt;/strong&gt; Ideal for proof-of-concept projects and more cost-effective, but with limited computational power.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-Node Cluster:&lt;/strong&gt; Offers greater computational power but comes at a higher cost.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For this experiment, you can choose either option based on your needs and resources.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Single-Node Cluster&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud dataproc clusters create &lt;span class="nv"&gt;$DATAPROC_CLUSTER_NAME&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--enable-component-gateway&lt;/span&gt; &lt;span class="nt"&gt;--bucket&lt;/span&gt; &lt;span class="nv"&gt;$GCP_BUCKET_DATAPROC&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--region&lt;/span&gt; &lt;span class="nv"&gt;$REGION&lt;/span&gt; &lt;span class="nt"&gt;--zone&lt;/span&gt; &lt;span class="nv"&gt;$ZONE&lt;/span&gt; &lt;span class="nt"&gt;--master-machine-type&lt;/span&gt; &lt;span class="nv"&gt;$DATAPROC_MASTER_TYPE&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--master-boot-disk-type&lt;/span&gt; &lt;span class="nv"&gt;$DATAPROC_MASTER_BOOT_DISK_TYPE&lt;/span&gt; &lt;span class="nt"&gt;--master-boot-disk-size&lt;/span&gt; &lt;span class="nv"&gt;$DATAPROC_MASTER_BOOT_DISK_SIZE&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--num-master-local-ssds&lt;/span&gt; &lt;span class="nv"&gt;$DATAPROC_MASTER_NUM_LOCAL_SSD&lt;/span&gt; &lt;span class="nt"&gt;--image-version&lt;/span&gt; &lt;span class="nv"&gt;$DATAPROC_IMAGE_VERSION&lt;/span&gt; &lt;span class="nt"&gt;--single-node&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--optional-components&lt;/span&gt; &lt;span class="nv"&gt;$DATAPROC_COMPONENTS&lt;/span&gt; &lt;span class="nt"&gt;--project&lt;/span&gt; &lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Multi-Node Cluster&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud dataproc clusters create &lt;span class="nv"&gt;$DATAPROC_CLUSTER_NAME&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--enable-component-gateway&lt;/span&gt; &lt;span class="nt"&gt;--bucket&lt;/span&gt; &lt;span class="nv"&gt;$GCP_BUCKET_DATAPROC&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--region&lt;/span&gt; &lt;span class="nv"&gt;$REGION&lt;/span&gt; &lt;span class="nt"&gt;--zone&lt;/span&gt; &lt;span class="nv"&gt;$ZONE&lt;/span&gt; &lt;span class="nt"&gt;--master-machine-type&lt;/span&gt; &lt;span class="nv"&gt;$DATAPROC_MASTER_TYPE&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--master-boot-disk-type&lt;/span&gt; &lt;span class="nv"&gt;$DATAPROC_MASTER_BOOT_DISK_TYPE&lt;/span&gt; &lt;span class="nt"&gt;--master-boot-disk-size&lt;/span&gt; &lt;span class="nv"&gt;$DATAPROC_MASTER_BOOT_DISK_SIZE&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--num-master-local-ssds&lt;/span&gt; &lt;span class="nv"&gt;$DATAPROC_MASTER_NUM_LOCAL_SSD&lt;/span&gt; &lt;span class="nt"&gt;--num-workers&lt;/span&gt; &lt;span class="nv"&gt;$DATAPROC_NUM_WORKERS&lt;/span&gt; &lt;span class="nt"&gt;--worker-machine-type&lt;/span&gt; &lt;span class="nv"&gt;$DATAPROC_WORKER_TYPE&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--worker-boot-disk-type&lt;/span&gt; &lt;span class="nv"&gt;$DATAPROC_WORKER_BOOT_DISK_TYPE&lt;/span&gt; &lt;span class="nt"&gt;--worker-boot-disk-size&lt;/span&gt; &lt;span class="nv"&gt;$DATAPROC_WORKER_DISK_SIZE&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--num-worker-local-ssds&lt;/span&gt; &lt;span class="nv"&gt;$DATAPROC_WORKER_NUM_LOCAL_SSD&lt;/span&gt; &lt;span class="nt"&gt;--image-version&lt;/span&gt; &lt;span class="nv"&gt;$DATAPROC_IMAGE_VERSION&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--optional-components&lt;/span&gt; &lt;span class="nv"&gt;$DATAPROC_COMPONENTS&lt;/span&gt; &lt;span class="nt"&gt;--project&lt;/span&gt; &lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Dataproc graphical interface:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd46w6zcl5evcmq2ww8vi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd46w6zcl5evcmq2ww8vi.png" alt="dataproc web interface" width="800" height="330"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Dataproc details:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3ldmqirdmb32oweyda4n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3ldmqirdmb32oweyda4n.png" alt="dataproc web interface details" width="800" height="437"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Dataproc web interfaces:&lt;br&gt;
&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi1ul1mtab50abxv0h0jr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi1ul1mtab50abxv0h0jr.png" alt="dataproc web interface items" width="800" height="531"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Dataproc Jupyter web interface:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqqwcxbkpefconguw0qzb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqqwcxbkpefconguw0qzb.png" alt="dataproc jupyter" width="800" height="438"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To list existing Dataproc clusters in a specific region, execute:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud dataproc clusters list &lt;span class="nt"&gt;--region&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$REGION&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Running a PySpark Job with Spark Submit
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Spark Submit
&lt;/h3&gt;

&lt;p&gt;Besides the notebook interface, we can also use an existing cluster by submitting a PySpark or Spark script with Spark Submit.&lt;/p&gt;

&lt;p&gt;Create new variables for job execution:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;PYSPARK_SCRIPT_PATH&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$GCP_STORAGE_PREFIX$GCP_BUCKET_BIGDATA_FILES&lt;/span&gt;/&lt;span class="nv"&gt;$BUCKET_BIGDATA_PYSPARK_FOLDER&lt;/span&gt;/&lt;span class="nv"&gt;$PYSPARK_INGESTION_SCRIPT&lt;/span&gt;
&lt;span class="nv"&gt;JARS_PATH&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$GCP_STORAGE_PREFIX$GCP_BUCKET_BIGDATA_FILES&lt;/span&gt;/&lt;span class="nv"&gt;$BUCKET_BIGDATA_JAR_FOLDER&lt;/span&gt;/&lt;span class="nv"&gt;$JAR_LIB1&lt;/span&gt;
&lt;span class="nv"&gt;JARS_PATH&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$JARS_PATH&lt;/span&gt;,&lt;span class="nv"&gt;$GCP_STORAGE_PREFIX$GCP_BUCKET_BIGDATA_FILES&lt;/span&gt;/&lt;span class="nv"&gt;$BUCKET_BIGDATA_JAR_FOLDER&lt;/span&gt;/&lt;span class="nv"&gt;$JAR_LIB2&lt;/span&gt;
&lt;span class="nv"&gt;TRANSIENT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$GCP_STORAGE_PREFIX$GCP_BUCKET_DATALAKE&lt;/span&gt;/&lt;span class="nv"&gt;$BUCKET_DATALAKE_FOLDER&lt;/span&gt;/&lt;span class="nv"&gt;$SUBJECT&lt;/span&gt;/&lt;span class="nv"&gt;$FILE&lt;/span&gt;
&lt;span class="nv"&gt;BRONZE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$GCP_STORAGE_PREFIX$GCP_BUCKET_DATALAKE&lt;/span&gt;/&lt;span class="nv"&gt;$BRONZE_DATALAKE_FILES&lt;/span&gt;/&lt;span class="nv"&gt;$SUBJECT&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To verify the contents of the variables, use the echo command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nv"&gt;$PYSPARK_SCRIPT_PATH&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nv"&gt;$JARS_PATH&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nv"&gt;$TRANSIENT&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nv"&gt;$BRONZE&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  About the PySpark Script
&lt;/h2&gt;

&lt;p&gt;The PySpark script is divided into two steps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Receiving the parameters sent by the spark-submit command. These parameters are:

&lt;ul&gt;
&lt;li&gt;--app_name - PySpark application name&lt;/li&gt;
&lt;li&gt;--bucket_transient - URI of the GCS transient bucket&lt;/li&gt;
&lt;li&gt;--bucket_bronze - URI of the GCS bronze bucket&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Calling the main method&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F22bd4itby94sr95my0hn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F22bd4itby94sr95my0hn.png" alt="step1" width="792" height="541"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Main method:

&lt;ul&gt;
&lt;li&gt;Calls the method that creates the Spark session&lt;/li&gt;
&lt;li&gt;Calls the method that reads the data from the transient layer in storage&lt;/li&gt;
&lt;li&gt;Calls the method that writes the data to the bronze layer in storage&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7re4pelojwgvtkam5kah.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7re4pelojwgvtkam5kah.png" alt="pyspark_functions" width="800" height="641"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Execute Spark Submit with the command below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud dataproc &lt;span class="nb"&gt;jobs &lt;/span&gt;submit pyspark &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--project&lt;/span&gt; &lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt; &lt;span class="nt"&gt;--region&lt;/span&gt; &lt;span class="nv"&gt;$REGION&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--cluster&lt;/span&gt; &lt;span class="nv"&gt;$DATAPROC_CLUSTER_NAME&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--jars&lt;/span&gt; &lt;span class="nv"&gt;$JARS_PATH&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nv"&gt;$PYSPARK_SCRIPT_PATH&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--&lt;/span&gt; &lt;span class="nt"&gt;--app_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$DATAPROC_APP_NAME&lt;/span&gt; &lt;span class="nt"&gt;--bucket_transient&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$TRANSIENT&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--bucket_bronze&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$BRONZE&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To verify the Dataproc job execution, select the Dataproc cluster and click on the job to see its results and logs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F31y0c3jvmt3mcv97ayli.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F31y0c3jvmt3mcv97ayli.png" alt="dataproc_execution" width="800" height="202"&gt;&lt;/a&gt;&lt;/p&gt;
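&lt;p&gt;The job can also be checked from the command line. The sketch below uses the variables defined earlier; replace JOB_ID with an ID taken from the list output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# List the jobs submitted to the cluster
gcloud dataproc jobs list --cluster=$DATAPROC_CLUSTER_NAME --region=$REGION

# Show the status and details of a specific job
gcloud dataproc jobs describe JOB_ID --region=$REGION
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
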

&lt;p&gt;In the Data Lake storage, a new folder was created for the bronze layer data. As your Data Lake grows, more and more folders will be created.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9zxeed2h6kv282i10g12.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9zxeed2h6kv282i10g12.png" alt="bronze_folder" width="800" height="433"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffi7c3hafusukeq0c9xgf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffi7c3hafusukeq0c9xgf.png" alt="bronze_folder_detail" width="800" height="422"&gt;&lt;/a&gt;&lt;/p&gt;
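&lt;p&gt;You can also confirm the new bronze folder from the command line, using the $BRONZE variable defined earlier:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# List the files written to the bronze layer by the PySpark job
gcloud storage ls --recursive $BRONZE
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
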

&lt;h2&gt;
  
  
  Removing the Created Services
&lt;/h2&gt;

&lt;p&gt;To avoid unexpected costs, remove the created services and resources after use.&lt;/p&gt;

&lt;p&gt;To delete the created storage buckets along with their contents, execute:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud storage &lt;span class="nb"&gt;rm&lt;/span&gt; &lt;span class="nt"&gt;--recursive&lt;/span&gt; &lt;span class="nv"&gt;$GCP_STORAGE_PREFIX$GCP_BUCKET_DATALAKE&lt;/span&gt;
gcloud storage &lt;span class="nb"&gt;rm&lt;/span&gt; &lt;span class="nt"&gt;--recursive&lt;/span&gt; &lt;span class="nv"&gt;$GCP_STORAGE_PREFIX$GCP_BUCKET_BIGDATA_FILES&lt;/span&gt;
gcloud storage &lt;span class="nb"&gt;rm&lt;/span&gt; &lt;span class="nt"&gt;--recursive&lt;/span&gt; &lt;span class="nv"&gt;$GCP_STORAGE_PREFIX$GCP_BUCKET_DATAPROC&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To delete the Dataproc cluster created in the experiment, execute:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud dataproc clusters delete &lt;span class="nv"&gt;$DATAPROC_CLUSTER_NAME&lt;/span&gt; &lt;span class="nt"&gt;--region&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$REGION&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After deletion, the command below should not list any existing clusters:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud dataproc clusters list &lt;span class="nt"&gt;--region&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$REGION&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The need to process data with PySpark arises in many scenarios, including both batch and streaming workloads. Most cloud providers now offer managed clusters, which remove much of the operational overhead.&lt;/p&gt;

&lt;p&gt;In this post, we demonstrated how to create a managed Dataproc cluster on Google Cloud Platform and how to use Cloud Storage as its distributed file system. Other clouds offer similar services, such as AWS EMR with S3, or Azure HDInsight with Data Lake Storage Gen2.&lt;/p&gt;

&lt;h3&gt;
  
  
  Links and References
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/jader-lima/gcp-dataproc-cluster" rel="noopener noreferrer"&gt;Github Repo&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/dataproc/docs" rel="noopener noreferrer"&gt;Dataproc documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.databricks.com/glossary/medallion-architecture" rel="noopener noreferrer"&gt;medallion-architecture&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>pyspark</category>
      <category>gcp</category>
      <category>dataproc</category>
      <category>pipeline</category>
    </item>
  </channel>
</rss>
