Jader Lima

Using Cloud Functions and Cloud Schedule to process data with Google Dataflow

GCP DataFlow Function Schedule

Overview

This project showcases the integration of Google Cloud services, specifically Dataflow, Cloud Functions, and Cloud Scheduler, to create a highly scalable, cost-effective, and easy-to-maintain data processing solution. It demonstrates how you can automate data pipelines, perform seamless integration with other GCP services like BigQuery, and manage workflows efficiently through CI/CD pipelines with GitHub Actions. This setup provides flexibility, reduces manual intervention, and ensures that the data processing workflows run smoothly and consistently.

Technologies Used

  • Google Dataflow

    Google Dataflow is a fully managed service for stream and batch data processing, which is built on Apache Beam. It allows for the creation of highly efficient, low-latency, and cost-effective data pipelines. Dataflow can handle large-scale data processing tasks, making it ideal for use cases like real-time analytics and ETL jobs.

  • Cloud Storage

    Google Cloud Storage is a scalable, durable, and secure object storage service designed to handle large volumes of unstructured data. It is ideal for use in big data analysis, backups, and content distribution, offering high availability and low latency across the globe.

  • Cloud Functions

    Google Cloud Functions is a serverless execution environment that allows you to run code in response to events. In this project, Cloud Functions are used to trigger Dataflow jobs and manage workflow automation efficiently with minimal operational overhead.

  • Cloud Scheduler

    Google Cloud Scheduler is a fully managed cron job service that allows you to schedule tasks or trigger cloud services at specific intervals. It’s used in this project to automate the execution of the Cloud Functions, ensuring that Dataflow jobs run as needed without manual intervention.

  • CI/CD Process with GitHub Actions

    GitHub Actions enables continuous integration and continuous delivery (CI/CD) workflows directly from your GitHub repository. In this project, it is used to automate the build, testing, and deployment of resources to Google Cloud, ensuring consistent and reliable deployments.

  • GitHub Secrets and Configuration

    GitHub Secrets securely store sensitive information such as API keys, service account credentials, and configuration settings required for deployment. By keeping these details secure, the risk of leaks and unauthorized access is minimized.

Features

  • Ingest and transform data from Google Cloud Storage using Google Dataflow.
  • Encapsulate the Dataflow process into a reusable Dataflow template.
  • Create a Cloud Function that executes the Dataflow template through a REST API.
  • Automate the execution of the Cloud Function using Cloud Scheduler.
  • Implement a CI/CD pipeline with GitHub Actions for automated deployments.
  • Incorporate comprehensive error handling and logging for reliable data processing.

Architecture Diagram

architecture

Getting Started

Prerequisites

Before getting started, ensure you have the following:

  • A Google Cloud account with billing enabled.
  • A GitHub account.

Setup Instructions

  1. Clone the Repository
   git clone https://github.com/jader-lima/gcp-dataproc-bigquery-workflow-template.git
   cd gcp-dataproc-bigquery-workflow-template

Set Up Google Cloud Environment

  1. Create a Google Cloud Storage bucket to store your data.
  2. Set up a BigQuery dataset where your data will be ingested.
  3. Create a Dataproc cluster for processing.
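
If you prefer to create these resources manually, the commands below are a minimal sketch; the bucket, dataset, and cluster names and the regions are placeholders, not values from this project, and the GitHub Actions workflow described later also creates the buckets for you:

# Cloud Storage bucket for the raw data (name and location are placeholders)
gcloud storage buckets create gs://<your-datalake-bucket> --location=us-central1

# BigQuery dataset to receive ingested data (dataset name is a placeholder)
bq --location=US mk --dataset "$PROJECT_ID:<your_dataset>"

# Dataproc cluster, only if you also run a Dataproc-based variant of the pipeline
gcloud dataproc clusters create <your-cluster> --region=us-central1 --single-node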

Create a new service account for deployment purposes

  1. Create the Service Account:
gcloud iam service-accounts create devops-dataops-sa \
    --description="Service account for DevOps and DataOps tasks" \
    --display-name="DevOps DataOps Service Account"
  2. Grant Storage Access Permissions (Buckets): Storage Admin (roles/storage.admin): Grants permissions to create, list, and manipulate buckets and files.
gcloud projects add-iam-policy-binding $PROJECT_ID \
    --member="serviceAccount:devops-dataops-sa@$PROJECT_ID.iam.gserviceaccount.com" \
    --role="roles/storage.admin"
  3. Grant Dataflow Permissions: Dataflow Admin (roles/dataflow.admin): To create, run, and manage Dataflow jobs. Dataflow Developer (roles/dataflow.developer): Allows the development and submission of Dataflow jobs.
gcloud projects add-iam-policy-binding $PROJECT_ID \
    --member="serviceAccount:devops-dataops-sa@$PROJECT_ID.iam.gserviceaccount.com" \
    --role="roles/dataflow.admin"

gcloud projects add-iam-policy-binding $PROJECT_ID \
    --member="serviceAccount:devops-dataops-sa@$PROJECT_ID.iam.gserviceaccount.com" \
    --role="roles/dataflow.developer"
  4. Permissions to Create and Manage Cloud Functions and Cloud Scheduler: Cloud Functions Admin (roles/cloudfunctions.admin): To create and manage Cloud Functions. Cloud Scheduler Admin (roles/cloudscheduler.admin): To create and manage Cloud Scheduler jobs.
gcloud projects add-iam-policy-binding $PROJECT_ID \
    --member="serviceAccount:devops-dataops-sa@$PROJECT_ID.iam.gserviceaccount.com" \
    --role="roles/cloudfunctions.admin"

gcloud projects add-iam-policy-binding $PROJECT_ID \
    --member="serviceAccount:devops-dataops-sa@$PROJECT_ID.iam.gserviceaccount.com" \
    --role="roles/cloudscheduler.admin"
  5. Grant Permissions to Manage Service Accounts: IAM Service Account Admin (roles/iam.serviceAccountAdmin): To create and manage other service accounts. IAM Service Account User (roles/iam.serviceAccountUser): To use service accounts in other services.
gcloud projects add-iam-policy-binding $PROJECT_ID \
    --member="serviceAccount:devops-dataops-sa@$PROJECT_ID.iam.gserviceaccount.com" \
    --role="roles/iam.serviceAccountAdmin"

gcloud projects add-iam-policy-binding $PROJECT_ID \
    --member="serviceAccount:devops-dataops-sa@$PROJECT_ID.iam.gserviceaccount.com" \
    --role="roles/iam.serviceAccountUser"
  6. Grant Permissions to Enable APIs and Manage Project IAM Policy: Service Usage Admin (roles/serviceusage.serviceUsageAdmin): To enable the required API services. Project IAM Admin (roles/resourcemanager.projectIamAdmin): To grant IAM roles to other service accounts during deployment.
gcloud projects add-iam-policy-binding $PROJECT_ID \
    --member="serviceAccount:devops-dataops-sa@$PROJECT_ID.iam.gserviceaccount.com" \
    --role="roles/serviceusage.serviceUsageAdmin"

gcloud projects add-iam-policy-binding $PROJECT_ID \
    --member="serviceAccount:devops-dataops-sa@$PROJECT_ID.iam.gserviceaccount.com" \
    --role="roles/resourcemanager.projectIamAdmin"
  7. Additional Permissions (Optional): Compute Admin (roles/compute.admin): If your pipeline needs to create compute resources (e.g., virtual machine instances). Viewer (roles/viewer): To ensure the account can view other resources in the project.
gcloud projects add-iam-policy-binding $PROJECT_ID \
    --member="serviceAccount:devops-dataops-sa@$PROJECT_ID.iam.gserviceaccount.com" \
    --role="roles/compute.admin"

gcloud projects add-iam-policy-binding $PROJECT_ID \
    --member="serviceAccount:devops-dataops-sa@$PROJECT_ID.iam.gserviceaccount.com" \
    --role="roles/viewer"
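
To let GitHub Actions authenticate as this service account, you can generate a JSON key for it and store the file's contents in the GCP_DEVOPS_SA_KEY secret described in the next section (a sketch; the key file name is only an example):

# Create a JSON key for the deployment service account
gcloud iam service-accounts keys create devops-dataops-sa-key.json \
    --iam-account="devops-dataops-sa@$PROJECT_ID.iam.gserviceaccount.com"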

Configure Environment Variables and Secrets

Ensure the following environment variables are set in your deployment configuration or within GitHub Secrets:

  • GCP_BUCKET_BIGDATA_FILES: Secret that stores the name of the Cloud Storage bucket for the big data files
  • GCP_BUCKET_DATALAKE: Secret that stores the name of the Cloud Storage bucket used as the data lake
  • GCP_BUCKET_DATAPROC: Secret that stores the name of the Cloud Storage bucket used for processing scripts and templates
  • GCP_BUCKET_TEMP_BIGQUERY: Secret that stores the name of the Cloud Storage bucket used for BigQuery temporary files
  • GCP_DEVOPS_SA_KEY: Secret that stores the service account key. For this project, the default service key was used.
  • GCP_SERVICE_ACCOUNT: Secret that stores the service account email
  • PROJECT_ID: Secret that stores the project ID

Creating a GitHub secret

  1. To create a new secret:
    1. In the project repository, open the Settings menu.
    2. In the Security section, click Secrets and variables.
    3. Click Actions.
    4. Click New repository secret, then enter a name and a value for the secret.

github secret creation

For more details, see:
https://docs.github.com/pt/actions/security-for-github-actions/security-guides/using-secrets-in-github-actions
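
If you prefer the command line, the same secrets can also be created with the GitHub CLI. A sketch, assuming you run it from the repository root and substitute your own values:

# Create repository secrets with the GitHub CLI
gh secret set PROJECT_ID --body "<your-project-id>"
gh secret set GCP_SERVICE_ACCOUNT --body "devops-dataops-sa@<your-project-id>.iam.gserviceaccount.com"
gh secret set GCP_DEVOPS_SA_KEY < devops-dataops-sa-key.json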

Deploying the project

Whenever a push to the main branch occurs, GitHub Actions triggers the workflow and runs the YAML script. The script contains five jobs, described in detail below. In essence, GitHub Actions uses the service account credentials to authenticate with Google Cloud and execute the necessary steps as described in the YAML file.

Workflow File YAML Explanation

Environment Variables Needed
The workflow defines variables for basic settings such as cluster characteristics, bucket paths, process names, and the steps that make up the workflow. If new steps or scripts are added to the workflow, new variables can easily be added in the same way.

Workflow Job Steps

  • enable-services:
    This step enables the necessary APIs for Cloud Functions, Dataflow, and the build process.

  • deploy-buckets:
    This step creates Google Cloud Storage buckets and copies the required data files and scripts into them.

  • build-dataflow-classic-template:
    Builds and stores a Dataflow template in a Cloud Storage bucket for future execution.

  • deploy-cloud-function:
    Deploys a Cloud Function that triggers the execution of the Dataflow template using the google-api-python-client library.

  • deploy-cloud-schedule:
    Creates a Cloud Scheduler job to automate the execution of the Cloud Function, ensuring data is processed at defined intervals.
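
The exact commands live in the workflow YAML, but as a rough sketch, the enable-services and deploy-cloud-function jobs boil down to gcloud calls along these lines (the function name, region, runtime, entry point, and source path are assumptions for illustration, not the project's actual values):

# Enable the APIs used by the pipeline
gcloud services enable dataflow.googleapis.com cloudfunctions.googleapis.com \
    cloudbuild.googleapis.com cloudscheduler.googleapis.com storage.googleapis.com

# Deploy an HTTP-triggered Cloud Function that launches the Dataflow template
gcloud functions deploy trigger-dataflow-job \
    --region=us-central1 \
    --runtime=python310 \
    --source=./functions \
    --entry-point=main \
    --trigger-http \
    --no-allow-unauthenticated \
    --service-account="devops-dataops-sa@$PROJECT_ID.iam.gserviceaccount.com"

# Invoke the function manually for a quick test (1st gen URL format shown; requires an identity token)
curl -X POST "https://us-central1-$PROJECT_ID.cloudfunctions.net/trigger-dataflow-job" \
    -H "Authorization: Bearer $(gcloud auth print-identity-token)"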

Resources Created After Deployment

Upon deployment, the following resources are created:

Google Cloud Storage Bucket

A Cloud Storage bucket to store data and templates.

buckets

CSV files of the Olist dataset, stored in the transient layer of the data lake.

bucket transient

CSV file created after Dataflow processing. This file can be used in analysis tools, spreadsheets, databases, etc.

bucket silver
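
To inspect the processed output from the command line, something like the following works (the bucket name and layer path are placeholders that depend on the secrets you configured):

# List the processed CSV files in the silver layer and copy them locally
gcloud storage ls gs://<your-datalake-bucket>/silver/
gcloud storage cp "gs://<your-datalake-bucket>/silver/*.csv" .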

Dataflow Classic Template

A reusable Dataflow template stored in Cloud Storage.
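
For reference, a classic template is typically staged by running the Apache Beam pipeline with a --template_location option, roughly as below; the pipeline file name and bucket paths here are assumptions, since the project's build-dataflow-classic-template job performs this step for you:

# Stage the Beam pipeline as a classic Dataflow template in Cloud Storage
python dataflow_pipeline.py \
    --runner=DataflowRunner \
    --project=$PROJECT_ID \
    --region=us-central1 \
    --staging_location=gs://<your-dataflow-bucket>/staging \
    --temp_location=gs://<your-dataflow-bucket>/temp \
    --template_location=gs://<your-dataflow-bucket>/templates/ingest_transform_template

# Run the staged template on demand to validate it
gcloud dataflow jobs run manual-test-run \
    --gcs-location=gs://<your-dataflow-bucket>/templates/ingest_transform_template \
    --region=us-central1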

dataproc-workflow3

Cloud Scheduler Job

Automated scheduled jobs for Dataflow executions.
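
The deploy-cloud-schedule job creates something roughly equivalent to the command below (the schedule, job name, and function URL are assumptions for illustration):

# Call the HTTP-triggered Cloud Function every day at 06:00 UTC
gcloud scheduler jobs create http trigger-dataflow-daily \
    --location=us-central1 \
    --schedule="0 6 * * *" \
    --time-zone="Etc/UTC" \
    --http-method=POST \
    --uri="https://us-central1-$PROJECT_ID.cloudfunctions.net/trigger-dataflow-job" \
    --oidc-service-account-email="devops-dataops-sa@$PROJECT_ID.iam.gserviceaccount.com"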

Cloud Schedule

Conclusion

This project demonstrates how to leverage Google Cloud services like Dataflow, Cloud Functions, and Cloud Scheduler to create a fully automated and scalable data processing pipeline. The integration with GitHub Actions ensures continuous deployment, while the use of Cloud Functions and Scheduler provides flexibility and automation, minimizing operational overhead. This setup is versatile and can be easily extended to incorporate additional GCP services such as BigQuery.

Links and References
GitHub Repo: https://github.com/jader-lima/gcp-dataproc-bigquery-workflow-template
Cloud Functions: https://cloud.google.com/functions
DataFlow: https://cloud.google.com/dataflow
Cloud Scheduler: https://cloud.google.com/scheduler
