Jader Lima

Using Cloud Functions and Cloud Schedule to process data with Google Dataflow

GCP DataFlow Function Schedule

Overview

This project showcases the integration of Google Cloud services, specifically Dataflow, Cloud Functions, and Cloud Scheduler, to create a highly scalable, cost-effective, and easy-to-maintain data processing solution. It demonstrates how you can automate data pipelines, perform seamless integration with other GCP services like BigQuery, and manage workflows efficiently through CI/CD pipelines with GitHub Actions. This setup provides flexibility, reduces manual intervention, and ensures that the data processing workflows run smoothly and consistently.

Technologies Used

  • Google Dataflow

    Google Dataflow is a fully managed service for stream and batch data processing, which is built on Apache Beam. It allows for the creation of highly efficient, low-latency, and cost-effective data pipelines. Dataflow can handle large-scale data processing tasks, making it ideal for use cases like real-time analytics and ETL jobs.

  • Cloud Storage

    Google Cloud Storage is a scalable, durable, and secure object storage service designed to handle large volumes of unstructured data. It is ideal for use in big data analysis, backups, and content distribution, offering high availability and low latency across the globe.

  • Cloud Functions

    Google Cloud Functions is a serverless execution environment that allows you to run code in response to events. In this project, Cloud Functions are used to trigger Dataflow jobs and manage workflow automation efficiently with minimal operational overhead.

  • Cloud Scheduler

    Google Cloud Scheduler is a fully managed cron job service that allows you to schedule tasks or trigger cloud services at specific intervals. It’s used in this project to automate the execution of the Cloud Functions, ensuring that Dataflow jobs run as needed without manual intervention.

  • CI/CD Process with GitHub Actions

    GitHub Actions enables continuous integration and continuous delivery (CI/CD) workflows directly from your GitHub repository. In this project, it is used to automate the build, testing, and deployment of resources to Google Cloud, ensuring consistent and reliable deployments.

  • GitHub Secrets and Configuration

    GitHub Secrets securely store sensitive information such as API keys, service account credentials, and configuration settings required for deployment. By keeping these details secure, the risk of leaks and unauthorized access is minimized.

Features

  • Ingest and transform data from Google Cloud Storage using Google Dataflow.
  • Encapsulate the Dataflow process into a reusable Dataflow template.
  • Create a Cloud Function that executes the Dataflow template through a REST API.
  • Automate the execution of the Cloud Function using Cloud Scheduler.
  • Implement a CI/CD pipeline with GitHub Actions for automated deployments.
  • Incorporate comprehensive error handling and logging for reliable data processing.

Architecture Diagram

architecture

Getting Started

Prerequisites

Before getting started, ensure you have the following:

  • A Google Cloud account with billing enabled.
  • A GitHub account.

Setup Instructions

  1. Clone the Repository
   git clone https://github.com/jader-lima/gcp-dataproc-bigquery-workflow-template.git
   cd gcp-dataproc-bigquery-workflow-template

Set Up Google Cloud Environment

  1. Create a Google Cloud Storage bucket to store your data.
  2. Set up a BigQuery dataset where your data will be ingested.
  3. Create a Dataproc cluster for processing.
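
If you prefer to create these resources manually, the commands below are a minimal sketch; the bucket, dataset, and cluster names and the regions are placeholders, not values from this project, and the GitHub Actions workflow described later also creates the buckets for you:

# Cloud Storage bucket for the raw data (name and location are placeholders)
gcloud storage buckets create gs://<your-datalake-bucket> --location=us-central1

# BigQuery dataset to receive ingested data (dataset name is a placeholder)
bq --location=US mk --dataset "$PROJECT_ID:<your_dataset>"

# Dataproc cluster, only if you also run a Dataproc-based variant of the pipeline
gcloud dataproc clusters create <your-cluster> --region=us-central1 --single-node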

Create a new service account for deployment purposes

  1. Create the Service Account:
gcloud iam service-accounts create devops-dataops-sa \
    --description="Service account for DevOps and DataOps tasks" \
    --display-name="DevOps DataOps Service Account"
  2. Grant Storage Access Permissions (Buckets): Storage Admin (roles/storage.admin): Grants permissions to create, list, and manipulate buckets and files.
gcloud projects add-iam-policy-binding $PROJECT_ID \
    --member="serviceAccount:devops-dataops-sa@$PROJECT_ID.iam.gserviceaccount.com" \
    --role="roles/storage.admin"
  3. Grant Dataflow Permissions: Dataflow Admin (roles/dataflow.admin): To create, run, and manage Dataflow jobs. Dataflow Developer (roles/dataflow.developer): Allows the development and submission of Dataflow jobs.
gcloud projects add-iam-policy-binding $PROJECT_ID \
    --member="serviceAccount:devops-dataops-sa@$PROJECT_ID.iam.gserviceaccount.com" \
    --role="roles/dataflow.admin"

gcloud projects add-iam-policy-binding $PROJECT_ID \
    --member="serviceAccount:devops-dataops-sa@$PROJECT_ID.iam.gserviceaccount.com" \
    --role="roles/dataflow.developer"
  4. Permissions to Create and Manage Cloud Functions and Cloud Scheduler: Cloud Functions Admin (roles/cloudfunctions.admin): To create and manage Cloud Functions. Cloud Scheduler Admin (roles/cloudscheduler.admin): To create and manage Cloud Scheduler jobs.
gcloud projects add-iam-policy-binding $PROJECT_ID \
    --member="serviceAccount:devops-dataops-sa@$PROJECT_ID.iam.gserviceaccount.com" \
    --role="roles/cloudfunctions.admin"

gcloud projects add-iam-policy-binding $PROJECT_ID \
    --member="serviceAccount:devops-dataops-sa@$PROJECT_ID.iam.gserviceaccount.com" \
    --role="roles/cloudscheduler.admin"
  5. Grant Permissions to Manage Service Accounts: IAM Service Account Admin (roles/iam.serviceAccountAdmin): To create and manage other service accounts. IAM Service Account User (roles/iam.serviceAccountUser): To use service accounts in other services.
gcloud projects add-iam-policy-binding $PROJECT_ID \
    --member="serviceAccount:devops-dataops-sa@$PROJECT_ID.iam.gserviceaccount.com" \
    --role="roles/iam.serviceAccountAdmin"

gcloud projects add-iam-policy-binding $PROJECT_ID \
    --member="serviceAccount:devops-dataops-sa@$PROJECT_ID.iam.gserviceaccount.com" \
    --role="roles/iam.serviceAccountUser"
  6. Grant Permissions to Enable APIs and Manage Project IAM Policy: Service Usage Admin (roles/serviceusage.serviceUsageAdmin): To enable the required API services. Project IAM Admin (roles/resourcemanager.projectIamAdmin): To grant IAM roles to other service accounts during deployment.
gcloud projects add-iam-policy-binding $PROJECT_ID \
    --member="serviceAccount:devops-dataops-sa@$PROJECT_ID.iam.gserviceaccount.com" \
    --role="roles/serviceusage.serviceUsageAdmin"

gcloud projects add-iam-policy-binding $PROJECT_ID \
    --member="serviceAccount:devops-dataops-sa@$PROJECT_ID.iam.gserviceaccount.com" \
    --role="roles/resourcemanager.projectIamAdmin"
  7. Additional Permissions (Optional): Compute Admin (roles/compute.admin): If your pipeline needs to create compute resources (e.g., virtual machine instances). Viewer (roles/viewer): To ensure the account can view other resources in the project.
gcloud projects add-iam-policy-binding $PROJECT_ID \
    --member="serviceAccount:devops-dataops-sa@$PROJECT_ID.iam.gserviceaccount.com" \
    --role="roles/compute.admin"

gcloud projects add-iam-policy-binding $PROJECT_ID \
    --member="serviceAccount:devops-dataops-sa@$PROJECT_ID.iam.gserviceaccount.com" \
    --role="roles/viewer"
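
To let GitHub Actions authenticate as this service account, you can generate a JSON key for it and store the file's contents in the GCP_DEVOPS_SA_KEY secret described in the next section (a sketch; the key file name is only an example):

# Create a JSON key for the deployment service account
gcloud iam service-accounts keys create devops-dataops-sa-key.json \
    --iam-account="devops-dataops-sa@$PROJECT_ID.iam.gserviceaccount.com"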

Configure Environment Variables and Secrets

Ensure the following environment variables are set in your deployment configuration or within GitHub Secrets:

  • GCP_BUCKET_BIGDATA_FILES: Secret that stores the name of the Cloud Storage bucket for the big data files
  • GCP_BUCKET_DATALAKE: Secret that stores the name of the Cloud Storage bucket used as the data lake
  • GCP_BUCKET_DATAPROC: Secret that stores the name of the Cloud Storage bucket used for processing scripts and templates
  • GCP_BUCKET_TEMP_BIGQUERY: Secret that stores the name of the Cloud Storage bucket used for BigQuery temporary files
  • GCP_DEVOPS_SA_KEY: Secret that stores the service account key. For this project, the default service key was used.
  • GCP_SERVICE_ACCOUNT: Secret that stores the service account email
  • PROJECT_ID: Secret that stores the project ID

Creating a GitHub secret

  1. To create a new secret:
    1. In the project repository, open the Settings menu.
    2. In the Security section, click Secrets and variables.
    3. Click Actions.
    4. Click New repository secret, then enter a name and a value for the secret.

github secret creation

For more details, see:
https://docs.github.com/pt/actions/security-for-github-actions/security-guides/using-secrets-in-github-actions
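
If you prefer the command line, the same secrets can also be created with the GitHub CLI. A sketch, assuming you run it from the repository root and substitute your own values:

# Create repository secrets with the GitHub CLI
gh secret set PROJECT_ID --body "<your-project-id>"
gh secret set GCP_SERVICE_ACCOUNT --body "devops-dataops-sa@<your-project-id>.iam.gserviceaccount.com"
gh secret set GCP_DEVOPS_SA_KEY < devops-dataops-sa-key.json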

Deploying the project

Whenever a push to the main branch occurs, GitHub Actions triggers the workflow and runs the YAML script. The script contains five jobs, described in detail below. In essence, GitHub Actions uses the service account credentials to authenticate with Google Cloud and execute the necessary steps as described in the YAML file.

Workflow File YAML Explanation

Environment Variables Needed
The workflow defines variables for basic settings such as cluster characteristics, bucket paths, process names, and the steps that make up the workflow. If new steps or scripts are added to the workflow, new variables can easily be added in the same way.

Workflow Job Steps

  • enable-services:
    This step enables the necessary APIs for Cloud Functions, Dataflow, and the build process.

  • deploy-buckets:
    This step creates Google Cloud Storage buckets and copies the required data files and scripts into them.

  • build-dataflow-classic-template:
    Builds and stores a Dataflow template in a Cloud Storage bucket for future execution.

  • deploy-cloud-function:
    Deploys a Cloud Function that triggers the execution of the Dataflow template using the google-api-python-client library.

  • deploy-cloud-schedule:
    Creates a Cloud Scheduler job to automate the execution of the Cloud Function, ensuring data is processed at defined intervals.
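
The exact commands live in the workflow YAML, but as a rough sketch, the enable-services and deploy-cloud-function jobs boil down to gcloud calls along these lines (the function name, region, runtime, entry point, and source path are assumptions for illustration, not the project's actual values):

# Enable the APIs used by the pipeline
gcloud services enable dataflow.googleapis.com cloudfunctions.googleapis.com \
    cloudbuild.googleapis.com cloudscheduler.googleapis.com storage.googleapis.com

# Deploy an HTTP-triggered Cloud Function that launches the Dataflow template
gcloud functions deploy trigger-dataflow-job \
    --region=us-central1 \
    --runtime=python310 \
    --source=./functions \
    --entry-point=main \
    --trigger-http \
    --no-allow-unauthenticated \
    --service-account="devops-dataops-sa@$PROJECT_ID.iam.gserviceaccount.com"

# Invoke the function manually for a quick test (1st gen URL format shown; requires an identity token)
curl -X POST "https://us-central1-$PROJECT_ID.cloudfunctions.net/trigger-dataflow-job" \
    -H "Authorization: Bearer $(gcloud auth print-identity-token)"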

Resources Created After Deployment

Upon deployment, the following resources are created:

Google Cloud Storage Bucket

A Cloud Storage bucket to store data and templates.

buckets

CSV files of the Olist dataset, stored in the transient layer of the data lake.

bucket transient

CSV file created after Dataflow processing. This file can be used in analysis tools, spreadsheets, databases, etc.

bucket silver
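
To inspect the processed output from the command line, something like the following works (the bucket name and layer path are placeholders that depend on the secrets you configured):

# List the processed CSV files in the silver layer and copy them locally
gcloud storage ls gs://<your-datalake-bucket>/silver/
gcloud storage cp "gs://<your-datalake-bucket>/silver/*.csv" .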

Dataflow Classic Template

A reusable Dataflow template stored in Cloud Storage.
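
For reference, a classic template is typically staged by running the Apache Beam pipeline with a --template_location option, roughly as below; the pipeline file name and bucket paths here are assumptions, since the project's build-dataflow-classic-template job performs this step for you:

# Stage the Beam pipeline as a classic Dataflow template in Cloud Storage
python dataflow_pipeline.py \
    --runner=DataflowRunner \
    --project=$PROJECT_ID \
    --region=us-central1 \
    --staging_location=gs://<your-dataflow-bucket>/staging \
    --temp_location=gs://<your-dataflow-bucket>/temp \
    --template_location=gs://<your-dataflow-bucket>/templates/ingest_transform_template

# Run the staged template on demand to validate it
gcloud dataflow jobs run manual-test-run \
    --gcs-location=gs://<your-dataflow-bucket>/templates/ingest_transform_template \
    --region=us-central1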

dataproc-workflow3

Cloud Scheduler Job

Automated scheduled jobs for Dataflow executions.
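
The deploy-cloud-schedule job creates something roughly equivalent to the command below (the schedule, job name, and function URL are assumptions for illustration):

# Call the HTTP-triggered Cloud Function every day at 06:00 UTC
gcloud scheduler jobs create http trigger-dataflow-daily \
    --location=us-central1 \
    --schedule="0 6 * * *" \
    --time-zone="Etc/UTC" \
    --http-method=POST \
    --uri="https://us-central1-$PROJECT_ID.cloudfunctions.net/trigger-dataflow-job" \
    --oidc-service-account-email="devops-dataops-sa@$PROJECT_ID.iam.gserviceaccount.com"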

Cloud Schedule

Conclusion

This project demonstrates how to leverage Google Cloud services like Dataflow, Cloud Functions, and Cloud Scheduler to create a fully automated and scalable data processing pipeline. The integration with GitHub Actions ensures continuous deployment, while the use of Cloud Functions and Scheduler provides flexibility and automation, minimizing operational overhead. This setup is versatile and can be easily extended to incorporate additional GCP services such as BigQuery.

Links and References
GitHub Repo: https://github.com/jader-lima/gcp-dataproc-bigquery-workflow-template
Cloud Functions: https://cloud.google.com/functions
DataFlow: https://cloud.google.com/dataflow
Cloud Scheduler: https://cloud.google.com/scheduler
