Recently, I've been experimenting with ETL tools while learning data engineering. Every time I hear "ETL," I can't help but think of NewJeans' ETA... It's driving me crazy.
Anyway, here's a quick summary of two open-source tools, Embulk and Digdag, focusing on how they help when you're loading data into Google BigQuery.
Embulk: OSS Tool for Bulk Data Transfer (Load) Between Databases, Storage, File Formats, and Cloud Services
You can easily transfer local files into Google Cloud BigQuery using Embulk.
You write a simple .yml configuration file to define the input and output: which file to read, how to parse it, and where to load it.
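For example, loading a local CSV into a BigQuery staging table could look roughly like this. This is a minimal sketch assuming the embulk-output-bigquery plugin is installed; the file paths, columns, project/dataset/table names, and keyfile are placeholders, and option names can vary by plugin version:

```yaml
in:
  type: file
  path_prefix: ./data/sales_          # local CSV files to read (placeholder)
  parser:
    type: csv
    skip_header_lines: 1
    columns:
      - {name: id, type: long}
      - {name: name, type: string}
      - {name: created_at, type: timestamp, format: '%Y-%m-%d %H:%M:%S'}
out:
  type: bigquery                      # embulk-output-bigquery plugin
  mode: replace
  auth_method: json_key
  json_keyfile: ./service_account.json   # placeholder credentials file
  project: my-gcp-project
  dataset: my_dataset
  table: staging_table
  auto_create_table: true
```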
Execution is straightforward:
```bash
embulk run <file_name.yml>
```
Tip: When handling JSON files, define only one column during parsing, so the whole JSON record lands in a single column; you can split the fields out later with SQL.
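For instance, with Embulk's built-in json parser each record becomes a single json-typed column (the file path below is a placeholder):

```yaml
in:
  type: file
  path_prefix: ./data/events.json   # placeholder path
  parser:
    type: json   # built-in JSON parser: the whole record goes into one column
```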
Digdag: Task Pipeline, Scheduling, and Workflow Automation
In simple terms, Digdag is a workflow automation tool.
After Embulk loads the data, a Digdag workflow can run SQL queries on the table and load the processed results back into BigQuery.
You can also specify the target table name for each step in the workflow.
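As a rough sketch, a transform step using Digdag's bq> operator might look like this (the query file, dataset, and table names are placeholders, and I'm assuming GCP credentials are already set up for the workflow):

```
# sketch of a transform step in a .dig file
+transform:
  bq>: queries/transform.sql                  # SQL to run against the loaded table
  destination_table: my_dataset.final_table   # hypothetical target table for the results
```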
To execute:
```bash
digdag run <file_name.dig>
```
Example Workflow
Here’s a simple example of how Embulk and Digdag can be combined:
- Load CSV File using Embulk (Input & Output)
  - Parse the file and load it into a staging table.
- Transform the Loaded Table using SQL
  - Use SELECT to extract and split specific fields into proper columns (see the SQL sketch after this list).
- Load the Transformed Data into a Final Table
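Here's a rough idea of what that transform query could look like, assuming the staging table holds a single raw JSON column (as in the JSON tip above); the table name, column name, and JSON paths are all illustrative:

```sql
-- Split a raw JSON column from the staging table into typed columns.
SELECT
  JSON_VALUE(record, '$.user_id')                   AS user_id,
  CAST(JSON_VALUE(record, '$.amount') AS NUMERIC)   AS amount,
  TIMESTAMP(JSON_VALUE(record, '$.created_at'))     AS created_at
FROM my_dataset.staging_table;
```

The BigQuery docs linked below cover the JSON functions, conversion functions, and data types used here.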
That's it. Very simple but very effective combo when you're dealing with lightweight ETL pipelines into BigQuery.
Official Google BigQuery Docs
- JSON functions: https://cloud.google.com/bigquery/docs/reference/standard-sql/json_functions
- Conversion functions: https://cloud.google.com/bigquery/docs/reference/standard-sql/conversion_functions
- Data types: https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types