Hey everyone,
I wanted to share a tool I've been working on called BigDAG. It's born out of my own frustrations with managing data transformations in BigQuery, and I figured others might find it useful too.
## The Core Problem: BigQuery Complexity
BigQuery is fantastic for data analysis, but when you start building real pipelines, things can get messy. You end up with:
- SQL Files Everywhere: Queries for tables, views, and transformations scattered across different files.
- Manual Dependency Tracking: Figuring out the correct order to run scripts becomes a chore.
- Repetitive `bq` Commands: Creating datasets, tables, and views involves running similar commands repeatedly (see the sample commands after this list).
- Deployment Headaches: Pushing changes to BigQuery can be a manual and error-prone process.
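To make that concrete: by hand, even a small three-layer pipeline means a separate `bq` invocation per object, run in the right order. Something like this (illustrative commands with placeholder project, dataset, and file names):

```bash
# Roughly the manual ceremony this replaces (illustrative, placeholder names).
bq mk --dataset my_project:my_dataset
bq mk --external_table_definition=sales.sheet.def.json my_dataset.raw_sales
bq mk --use_legacy_sql=false \
  --view 'SELECT * FROM `my_project.my_dataset.raw_sales`' my_dataset.trusted_sales
bq query --use_legacy_sql=false \
  --destination_table my_dataset.monthly_sales \
  'SELECT DATE_TRUNC(sale_date, MONTH) AS month, SUM(amount) AS total_sales
   FROM `my_project.my_dataset.trusted_sales` GROUP BY month'
```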
I needed a way to bring some order to this chaos, to define my workflows as code, and to automate the deployment process. That's the motivation behind BigDAG.
## BigDAG: A Practical Solution
BigDAG is a Python tool that helps you manage data workflows in BigQuery. It's not trying to be a full-fledged data orchestration platform; it's a practical tool to solve a specific problem. Here's the basic idea:
- Organize Your Files: You structure your SQL scripts, schema definitions, and external table definitions in a folder. This folder represents your data pipeline.
- Dependency Detection: BigDAG automatically detects dependencies between your SQL scripts by looking at the queries. If a view uses another table, BigDAG knows it needs to be created first.
- Manual Overrides: If automatic detection isn't enough, you can specify dependencies in a `deps.yaml` file (a hypothetical example follows this list).
- Simple CLI: BigDAG provides a command-line interface (CLI) to execute your workflows. You can create, update, or delete objects in BigQuery with a single command.
- Basic Templating: BigDAG uses simple templating to inject your project ID and dataset name into your SQL queries.
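For the manual overrides, the idea is that `deps.yaml` maps each object to the objects it must be built after. The layout below is a hypothetical illustration, not the documented format; check the repo for the actual schema:

```yaml
# Hypothetical deps.yaml layout -- verify the real schema in the BigDAG repo.
monthly_sales:
  - trusted_sales
trusted_sales:
  - raw_sales
```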
## Key Features
- Dependency Management: Automatically figures out the order to run your scripts (a sketch of the idea appears after this list).
- Code-Based Workflows: Define your pipelines using files and YAML.
- CLI for Deployment: Deploy changes to BigQuery with a single command.
- Dry Run Option: See the commands that will be executed before running them.
- Recreation Support: Easily recreate your entire dataset and all objects.
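To give a feel for what dependency inference involves, here's a minimal sketch (my illustration only, not BigDAG's actual code): scan each query for the names of known objects, then order the objects so every dependency comes before its dependents.

```python
import re

# Minimal sketch of dependency inference -- not BigDAG's actual implementation.
# Assumes objects are referenced by name (e.g. raw_sales) inside queries.

def find_references(sql: str, known_objects: set[str]) -> set[str]:
    """Return the known object names mentioned in a SQL string."""
    words = set(re.findall(r"[A-Za-z_][A-Za-z0-9_]*", sql))
    return words & known_objects

def topological_order(deps: dict[str, set[str]]) -> list[str]:
    """Order objects so every dependency is created before its dependents."""
    ordered, seen = [], set()

    def visit(name: str) -> None:
        if name in seen:
            return
        seen.add(name)
        for dep in deps[name]:
            visit(dep)
        ordered.append(name)

    for name in deps:
        visit(name)
    return ordered

# The objects and queries from the example below:
objects = {"raw_sales", "trusted_sales", "monthly_sales"}
queries = {
    "raw_sales": "",  # external table: no query, no dependencies
    "trusted_sales": "SELECT * FROM raw_sales",
    "monthly_sales": "SELECT SUM(amount) FROM trusted_sales",
}
deps = {name: find_references(sql, objects) - {name} for name, sql in queries.items()}
print(topological_order(deps))  # ['raw_sales', 'trusted_sales', 'monthly_sales']
```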
## A Simple DAG Example
Let's look at a simple example of how you might structure your DAG folder:
```
my_dag/
├── raw/
│   ├── sales.sheet.def.json
│   └── sales.sheet.schema.json
├── trusted/
│   └── sales.view.sql
└── refined/
    └── monthly_sales.table.sql
```
Here's what the files might contain:
`my_dag/raw/sales.sheet.def.json`:
```json
{
  "sourceFormat": "GOOGLE_SHEETS",
  "sourceUris": [
    "https://docs.google.com/spreadsheets/d/FAKE_SPREADSHEET_ID"
  ],
  "googleSheetsOptions": {
    "range": "sales!A1:Z",
    "skipLeadingRows": 1
  }
}
```
`my_dag/raw/sales.sheet.schema.json`:
```json
{
  "fields": [
    {"name": "sale_date", "type": "DATE", "mode": "REQUIRED"},
    {"name": "product_id", "type": "STRING", "mode": "REQUIRED"},
    {"name": "amount", "type": "NUMERIC", "mode": "REQUIRED"}
  ]
}
```
`my_dag/trusted/sales.view.sql`:
```sql
-- SQL script for creating the sales view
SELECT
  *
FROM
  `{{project_id}}.{{dataset}}.raw_sales`;
```
`my_dag/refined/monthly_sales.table.sql`:
```sql
-- SQL script for creating the monthly_sales table
SELECT
  DATE_TRUNC(sale_date, MONTH) AS month,
  SUM(amount) AS total_sales
FROM
  `{{project_id}}.{{dataset}}.trusted_sales`
GROUP BY
  month;
```
In this example:
- `sales.sheet.def.json` and `sales.sheet.schema.json` define an external table based on a Google Sheet.
- `sales.view.sql` creates a view on top of the raw sales data.
- `monthly_sales.table.sql` creates a table based on the trusted sales view.
BigDAG will automatically infer that `monthly_sales.table.sql` depends on `trusted_sales`, and that `trusted_sales` depends on `raw_sales`.
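Resolved into a single valid execution order, those inferred edges give:

```
raw_sales (external table)  →  trusted_sales (view)  →  monthly_sales (table)
```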
## How to Use It (Quickly)
- Install:

```bash
git clone https://github.com/feroult/bigdag.git
cd bigdag
pip install -e .
```

- Set up your DAG folder: Organize your SQL scripts and definitions.
- Run the CLI:

```bash
bigdag --folder path/to/your/dag --project your_project_id --dataset your_dataset_name
```
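The dry-run option mentioned above lets you preview the generated commands first. The flag spelling below is an assumption on my part; confirm it with `bigdag --help`:

```bash
# Flag spelling assumed for illustration -- confirm with `bigdag --help`.
bigdag --folder path/to/your/dag --project your_project_id \
  --dataset your_dataset_name --dry-run
```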
## The Goal: Making Life Easier
BigDAG is about making data workflows more manageable. It's about:
- Reducing Manual Errors: Automating repetitive tasks to avoid mistakes.
- Improving Workflow Clarity: Making it easier to understand and maintain your data pipelines.
- Saving Time: Spending less time on infrastructure and more time on analysis.
## Check it Out
If you're dealing with similar challenges in BigQuery, you might find BigDAG useful. It's open-source, so feel free to take a look, contribute, or just use it as is.
https://github.com/feroult/bigdag
Hope it helps!