<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Cem Keskin</title>
    <description>The latest articles on DEV Community by Cem Keskin (@cemkeskin84).</description>
    <link>https://dev.to/cemkeskin84</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F808708%2F601ee1d5-ce2d-44c5-9c9c-43cb05e10c3b.png</url>
      <title>DEV Community: Cem Keskin</title>
      <link>https://dev.to/cemkeskin84</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/cemkeskin84"/>
    <language>en</language>
    <item>
      <title>Using dbt for Transformation Tasks on BigQuery</title>
      <dc:creator>Cem Keskin</dc:creator>
      <pubDate>Mon, 02 May 2022 19:36:47 +0000</pubDate>
      <link>https://dev.to/cemkeskin84/using-dbt-for-transformation-tasks-on-bigquery-3p1g</link>
      <guid>https://dev.to/cemkeskin84/using-dbt-for-transformation-tasks-on-bigquery-3p1g</guid>
      <description>&lt;h2&gt;
  
  
  Introduction: What Is dbt?
&lt;/h2&gt;

&lt;p&gt;Two common approaches to enable the flow of big data are &lt;a href="https://www.ibm.com/cloud/learn/elt"&gt;ELT (extract, load, transform)&lt;/a&gt; and &lt;a href="https://www.ibm.com/cloud/learn/etl"&gt;ETL (extract, transform, load)&lt;/a&gt;. Both start with unstructured data, and despite the slight difference in naming, they lead to distinct data engineering practices:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ELT prioritizes loading and postpones transformation. It handles only basic pre-processing, such as removing duplicate records or filling missing values, before serving the data to the team responsible for transforming it.&lt;/li&gt;
&lt;li&gt;ETL focuses on transformation before delivering to target systems/teams. Hence, it goes beyond the pre-processing of ELT: the data is structured, cleaned and type-converted before loading.&lt;/li&gt;
&lt;/ul&gt;
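The difference in ordering can be sketched in a few lines of Python; the function names and the toy "transform" step below are hypothetical, meant only to make the call order visible:

```python
# Hypothetical sketch: the same three stages composed in ELT vs ETL order.
def extract(log):
    log.append("extract")
    return ["  Raw ", None, "DATA"]    # messy source records

def transform(records, log):
    log.append("transform")
    return [r.strip().lower() for r in records if r is not None]

def load(records, log):
    log.append("load")
    return list(records)               # stand-in for landing in a warehouse

def elt():
    log = []
    records = extract(log)
    records = load(records, log)       # load the raw data first...
    transform(records, log)            # ...transform later, in the warehouse
    return log

def etl():
    log = []
    records = extract(log)
    records = transform(records, log)  # transform before delivery...
    load(records, log)                 # ...then load to the target system
    return log
```

Either way the same stages run; what changes is where the heavy transformation happens, and for ELT that place is the warehouse, which is exactly where dbt operates.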

&lt;p&gt;dbt is a tool to conduct transformation (“T”) practices on data warehouses for &lt;strong&gt;ELT&lt;/strong&gt;. As the name suggests, it covers the operations after the data is extracted and loaded. In other words, once you have “landed” your big data on a data warehouse, dbt can help you process it before serving it to subsequent consumers. The visualization below shows its role in a data pipeline.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--shRls_Oc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/w24gdekyyovf7khvhcpw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--shRls_Oc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/w24gdekyyovf7khvhcpw.png" alt="transformation with dbt while building a data pipeleine" width="800" height="293"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Tutorial for BigQuery Transformations
&lt;/h2&gt;

&lt;p&gt;dbt is originally a command-line tool, but it now also offers a cloud service (&lt;a href="http://www.getdbt.com/"&gt;www.getdbt.com&lt;/a&gt;) that makes the initial steps more convenient for newcomers. In this short tutorial, the dbt cloud service is used to conduct some basic transformation tasks on data uploaded from a public dataset (the upload process is explained in a previous post) as a part of the Data Engineering Zoomcamp (by DataTalks.club) Capstone Project.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step-1: Initiate a project on dbt cloud&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The process starts with creating an account and a project on dbt cloud, which is free for individuals:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--7zVTjZMR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6p2oo098yao6n71zw956.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--7zVTjZMR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6p2oo098yao6n71zw956.png" alt="starting a dbt project" width="796" height="466"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--jz2AuQcG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/tm1amyxhq3jyoaju2zub.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--jz2AuQcG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/tm1amyxhq3jyoaju2zub.png" alt="starting a dbt project" width="800" height="415"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step-2: Connect to BigQuery&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Clicking on “Create Project” is followed by simple questions to declare the project name and the data warehouse to integrate. Selecting BigQuery as the data warehouse, you will land on a page to provide GCP service account information. (Note that the account has to have BigQuery credentials.)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--53Dv8slS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ivpdyhnn2thxjndbaf17.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--53Dv8slS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ivpdyhnn2thxjndbaf17.png" alt="dbt integration with BigQuery" width="800" height="773"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here, downloading the service account information as a JSON file from GCP and uploading it to dbt helps prevent errors. After testing the authorization, you can continue by choosing the repository to host the dbt project. It is possible to host it on a “dbt Cloud managed repository” or on another repo of your choice. Completing this step, you will have a project ready to initialize on dbt cloud. Clicking the “initialize your project” button gives you a fresh project template:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--8uiwhCQM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/inhv71qaqfaujen4284m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--8uiwhCQM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/inhv71qaqfaujen4284m.png" alt="Image description" width="498" height="474"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step-3: Identify the required components and configurations&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The core elements of a dbt project are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;dbt_project.yml&lt;/strong&gt; file that configures the project,&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;models&lt;/strong&gt; folder to host the models to be run for the proposed transformations,&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;macros&lt;/strong&gt; folder to hold files that declare reusable SQL queries in Jinja format,&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;seeds&lt;/strong&gt; folder to host CSV files with declarations about the data on the data warehouse, such as zip code &amp;amp; city name or employee ID &amp;amp; personal data mappings. Note that these are not the data itself but references needed to use it properly.&lt;/li&gt;
&lt;/ul&gt;
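As a quick illustration, the skeleton above can be mirrored with a few lines of Python. Note that dbt generates this layout itself when a project is initialized, so this snippet (including the project name in it) is purely a hypothetical sketch of the structure:

```python
# Hypothetical sketch of the core dbt project layout listed above;
# dbt Cloud's "initialize your project" creates this for you.
from pathlib import Path
import tempfile

def scaffold_dbt_project(root):
    # the configuration file plus the three core folders
    (root / "dbt_project.yml").write_text("name: pv_project\nversion: '1.0'\n")
    for folder in ("models", "macros", "seeds"):
        (root / folder).mkdir(parents=True, exist_ok=True)
    return sorted(entry.name for entry in root.iterdir())

entries = scaffold_dbt_project(Path(tempfile.mkdtemp()))
```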

&lt;p&gt;&lt;strong&gt;Step-4: Define the model&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In this introductory tutorial, we will only use models. The task is to unify all the daily data a PV system produced during a year, that is, to unify 365 files. The source of the data and how it was uploaded to BigQuery were explained in &lt;a href="https://dev.to/cemkeskin84/how-to-use-apache-airflow-to-get-1000-files-from-a-public-dataset-1mnd"&gt;a previous post&lt;/a&gt;. The contents of the tables can be unified with a &lt;strong&gt;UNION ALL&lt;/strong&gt; query as below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Select columns of interest
SELECT measured_on, system_id,
    ac_power__5074 as ac_power,
    ambient_temp__5042 as ambient_temp,
    kwh_net__5046 as kwh_net,
    module_temp__5043 as module_temp,
    poa_irradiance__5041 as poa_irradiance,
    pr__5047 as pr

-- Declare one of the tables to be combined
FROM `project_name.dataset_name.table_name`

UNION ALL

SELECT ...
    ...
    ...
    ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;However, writing 365 such blocks by hand is not convenient, of course. Hence, it is possible to take advantage of a property of BigQuery: it helps you synthesize long queries with an obviously repetitive pattern. For the case study of this tutorial, the 365-part UNION ALL query for a year was produced by running the following query on BigQuery:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT string_agg(
    concat(
        "SELECT measured_on, system_id, ",
        "ac_power__5069 as ac_power, ",
        "ambient_temp__5062 as ambient_temp, ",
        "kwh_net__5066 as kwh_net, ",
        "module_temp__5063 as module_temp, ",
        "poa_irradiance__5061 as poa_irradiance, ",
        "pr__5067 as pr ",
        "FROM `YOUR-BQ-PROJECT-NAME.pvsys1433.", table_id, "`"
    ), " UNION ALL \n")

FROM `YOUR-BQ-PROJECT-NAME.pvsys1433.__TABLES__`;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
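The same synthesis can also be done outside BigQuery. Below is a minimal Python sketch of the idea; the helper name and the day_### table names are hypothetical, while the column list matches the first query above:

```python
# Plain-Python alternative to the string_agg trick: stitch one SELECT
# per daily table into a single UNION ALL query.
COLUMNS = (
    "measured_on, system_id, "
    "ac_power__5074 as ac_power, "
    "ambient_temp__5042 as ambient_temp, "
    "kwh_net__5046 as kwh_net, "
    "module_temp__5043 as module_temp, "
    "poa_irradiance__5041 as poa_irradiance, "
    "pr__5047 as pr"
)

def union_all_query(project, dataset, tables):
    # one SELECT per table, joined with UNION ALL
    selects = [
        f"SELECT {COLUMNS} FROM `{project}.{dataset}.{table}`"
        for table in tables
    ]
    return "\nUNION ALL\n".join(selects)

# hypothetical project/dataset/table names, for illustration only
query = union_all_query("my-project", "pvsys1433",
                        ["day_001", "day_002", "day_003"])
```

For 365 tables you would pass the full table list (for example, fetched from the `__TABLES__` view) instead of the three placeholders.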



&lt;p&gt;Then, you can simply copy-paste the long query into a model file created in the models/staging folder of your project on dbt cloud. Running the model with the &lt;code&gt;dbt run --select your_model_name&lt;/code&gt; command, you will receive a new table, named after your model file, in the dataset of your corresponding BigQuery project.&lt;/p&gt;

</description>
      <category>dezoomcamp</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>How to Use Apache Airflow to Get 1000+ Files From a Public Dataset</title>
      <dc:creator>Cem Keskin</dc:creator>
      <pubDate>Sun, 24 Apr 2022 21:30:40 +0000</pubDate>
      <link>https://dev.to/cemkeskin84/how-to-use-apache-airflow-to-get-1000-files-from-a-public-dataset-1mnd</link>
      <guid>https://dev.to/cemkeskin84/how-to-use-apache-airflow-to-get-1000-files-from-a-public-dataset-1mnd</guid>
<description>&lt;p&gt;Apache Airflow is a platform to manage workflows, which play a crucial role in data-intensive applications. One can define, schedule, monitor and troubleshoot data workflows as code, which makes maintenance, versioning, dependency management and testing more convenient. Initiated by Airbnb, today it is an open-source tool backed by the Apache Software Foundation. &lt;/p&gt;

&lt;p&gt;Airflow provides robust integrations with major cloud platforms (including GCP, AWS, MS Azure, etc.) as well as local resources. Moreover, it is written in Python, which is also the language used for creating workflows. Accordingly, and not surprisingly, it is a well-accepted solution in the industry for applications at different scales. It is also important to note that while Airflow lets you manage workflows (data pipelines) dynamically, the workflows themselves are expected to be -almost- static. It is definitely not for streaming.  &lt;/p&gt;

&lt;h1&gt;
  
  
  1. The Basic Architecture and Terminology
&lt;/h1&gt;

&lt;p&gt;Task and Directed Acyclic Graph (DAG) are two fundamental concepts to understand how to use Airflow. &lt;/p&gt;

&lt;p&gt;A &lt;strong&gt;task&lt;/strong&gt; is an atomized and standalone piece of work (an action). Airflow helps you define, run and monitor tasks in Python3, bash scripts, etc. A task can be any operation on or with data, such as transfer, analysis or storage. Tasks are defined using code templates called operators, and the building block of all operators is the &lt;code&gt;BaseOperator&lt;/code&gt;. Generic operators are used for the variety of tasks that build up DAGs. Moreover, there are specialized versions of operators. One of them is the sensor, which observes a specific point of a DAG, waiting for a specific event to happen. Tasks with unique functionalities are defined with the &lt;code&gt;@task&lt;/code&gt; decorator, which is handled by the TaskFlow API. The code snippet given below shows the basic structure of defining tasks within DAG files, with examples for BashOperator and PythonOperator.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;airflow.operators.bash&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BashOperator&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;airflow.operators.python&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;PythonOperator&lt;/span&gt;

&lt;span class="c1"&gt;# ... We see complete DAG files below.
# Here is just an example for how to define tasks. 
&lt;/span&gt;
&lt;span class="n"&gt;my_Bash_task&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;BashOperator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"Bash_task_for_XX"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;bash_command&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"....bash....command....."&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;my_Python_task&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;PythonOperator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"Python_task_for_YY"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;python_callable&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;a_predefined_Python_function&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;op_kwargs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="s"&gt;"src_file"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"address_to_source_file"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A &lt;strong&gt;DAG&lt;/strong&gt; represents the interdependence among tasks (see Figure 1, &lt;a href="https://airflow.apache.org/"&gt;Source&lt;/a&gt;). Nodes of a DAG are individual tasks, whereas the edges correspond to data transitions between two tasks. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ryWiSklq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/z9f0zy559ont13cx9e2l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ryWiSklq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/z9f0zy559ont13cx9e2l.png" alt="A basic DAG example" width="800" height="158"&gt;&lt;/a&gt; &lt;/p&gt;
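The "directed acyclic" idea can be illustrated without Airflow at all: model tasks as nodes and their dependencies as edges, and any topological order of the graph is a valid run order. A minimal sketch with hypothetical task names, using only the Python standard library:

```python
# Tasks as nodes, "depends on" relations as edges; a valid execution
# order is any topological sort of this graph.
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# each task maps to the set of tasks it depends on (hypothetical names)
dag = {
    "ingest": set(),
    "clean": {"ingest"},
    "analyze": {"clean"},
    "store": {"clean"},
    "report": {"analyze", "store"},
}

# every task appears after all of its dependencies
run_order = list(TopologicalSorter(dag).static_order())
```

This is essentially what the scheduler does for real DAGs: it only triggers a task once everything it depends on has completed.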

&lt;p&gt;Airflow helps you link tasks to compose DAGs for controlling flow. To do so, it brings together a variety of services as represented in Figure 2 (&lt;a href="https://airflow.apache.org/"&gt;Source&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--iVmKqkP1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8dma0qkvs324qudtumyt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--iVmKqkP1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8dma0qkvs324qudtumyt.png" alt="Components and Architecture of Apache Airflow" width="744" height="484"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the architecture of Apache Airflow,&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Workers&lt;/strong&gt; are the components in which tasks are run, in line with the commands received from the executor.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scheduler&lt;/strong&gt; follows the dependencies defined for tasks and DAGs. Once these are met, the scheduler triggers the tasks in accordance with the given timing policy.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Executor&lt;/strong&gt; runs tasks either inside the scheduler or by pushing them to the corresponding workers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DAG Directory&lt;/strong&gt; is the folder in which the &lt;code&gt;.py&lt;/code&gt; files for each DAG live.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Metadata Database&lt;/strong&gt; stores the state of the scheduler, executor and webserver.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;User interface&lt;/strong&gt; helps users control and follow workflows on an intuitive graphical screen and reach outputs of the system (such as logs) easily.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Webserver&lt;/strong&gt; links the system with user interface for remote control with interactive GUI.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once tasks and DAGs are defined and the system is activated (usually within containers), users get a screen as shown below (&lt;a href="https://airflow.apache.org/"&gt;Source&lt;/a&gt;) where the workflows can be followed.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--FG0alNTg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/su2og8dxh2plld384wf9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--FG0alNTg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/su2og8dxh2plld384wf9.png" alt="An example of Apache Airflow web interface" width="800" height="466"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Having reviewed the basic architecture and terminology, let’s see them in action.&lt;/p&gt;

&lt;h1&gt;
  
  
  2. Installing Airflow
&lt;/h1&gt;

&lt;p&gt;Airflow is a highly configurable tool. Accordingly, its installation can be customized to the requirements of each specific application. Moreover, it is a common practice to host it in a container to isolate it from system interactions and dependency conflicts. The brief guide presented here is based on the &lt;a href="https://airflow.apache.org/docs/apache-airflow/stable/start/docker.html"&gt;official guide&lt;/a&gt;, and the showcase was originally presented by DataTalks Club during the &lt;a href="https://github.com/DataTalksClub/data-engineering-zoomcamp/tree/main/week_2_data_ingestion/airflow"&gt;Data Engineering Zoomcamp (2022 cohort)&lt;/a&gt;. The code base and the configuration files used in this tutorial are available &lt;a href="https://github.com/CemKeskin84/DataEng_Zoomcamp/tree/main/dez-project/airflow"&gt;here&lt;/a&gt;. The case in this tutorial aims to &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;get a large number of files (1000+) from a public dataset (&lt;a href="https://data.openei.org/s3_viewer?bucket=oedi-data-lake&amp;amp;limit=100&amp;amp;prefix=pvdaq%2Fcsv%2F"&gt;OEDI photovoltaic systems dataset&lt;/a&gt;) to the local machine in &lt;code&gt;.csv&lt;/code&gt; format,&lt;/li&gt;
&lt;li&gt;convert them to &lt;code&gt;.parquet&lt;/code&gt; format for more effective computation on cloud in following steps,&lt;/li&gt;
&lt;li&gt;upload data to a bucket on Google Cloud Platform (GCP),&lt;/li&gt;
&lt;li&gt;transfer them from bucket to BigQuery for further analysis.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Parquet is a free and open-source columnar storage format backed by the Apache Software Foundation. It allows efficient compression and encoding within the Hadoop ecosystem, independent of frameworks or programming languages. Hence, it is a common format for public databases, and the OEDI PV dataset actually also has a version in &lt;code&gt;.parquet&lt;/code&gt; format. However, in line with the corresponding &lt;a href="https://github.com/DataTalksClub/data-engineering-zoomcamp/tree/main/week_2_data_ingestion/airflow"&gt;DataTalks Club tutorials&lt;/a&gt; and &lt;a href="https://www.youtube.com/watch?v=lqDMzReAtrw&amp;amp;list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&amp;amp;index=18"&gt;videos&lt;/a&gt;, the conversion process is included here to present a variety of operations. Otherwise, it would be possible to directly transfer &lt;code&gt;.parquet&lt;/code&gt; files to GCP using Airflow or any other tool.&lt;/p&gt;
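To see what "columnar" means in practice, here is a toy, stdlib-only sketch of turning row-oriented records into a column-oriented layout; real conversions would use pandas/pyarrow, and the field names below simply reuse two of the PV dataset columns:

```python
# Toy illustration of row-oriented vs columnar layout (what Parquet
# stores); the records are made up for the example.
rows = [
    {"measured_on": "2019-01-01", "ac_power": 1.2},
    {"measured_on": "2019-01-02", "ac_power": 1.5},
    {"measured_on": "2019-01-03", "ac_power": 1.1},
]

def to_columnar(rows):
    # one contiguous list per column: values of the same type sit together,
    # which is what makes columnar compression and encoding effective
    return {name: [row[name] for row in rows] for name in rows[0]}

columns = to_columnar(rows)
```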

&lt;h2&gt;
  
  
  2.1 Containerization
&lt;/h2&gt;

&lt;p&gt;Following best practices, the installation of Airflow will be containerized. Accordingly, the &lt;code&gt;Dockerfile&lt;/code&gt; and the &lt;code&gt;docker-compose.yaml&lt;/code&gt; files have a crucial role. The Airflow documentation presents a typical docker-compose file to make life easier for newcomers. It uses the official Airflow image: &lt;code&gt;apache/airflow:2.2.3&lt;/code&gt;. Hence, the &lt;a href="https://github.com/DataTalksClub/data-engineering-zoomcamp/blob/main/week_2_data_ingestion/airflow/Dockerfile"&gt;Dockerfile developed by DataTalks&lt;/a&gt; Club starts from this image, followed by system requirements and settings. Then, the SDK for GCP is downloaded and installed for cloud integrations. The file concludes with &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;setting a home directory for Airflow within container,&lt;/li&gt;
&lt;li&gt;including additional scripts if necessary,&lt;/li&gt;
&lt;li&gt;setting a parameter for the user ID of Airflow.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;(Note: it is a common practice to host the following files and folders within a single folder, preferably named ‘airflow’.)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# First-time build can take up to 10 mins.&lt;/span&gt;

FROM apache/airflow:2.2.3

ENV &lt;span class="nv"&gt;AIRFLOW_HOME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/opt/airflow

USER root
RUN apt-get update &lt;span class="nt"&gt;-qq&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; apt-get &lt;span class="nb"&gt;install &lt;/span&gt;vim &lt;span class="nt"&gt;-qqq&lt;/span&gt;
&lt;span class="c"&gt;# git gcc g++ -qqq&lt;/span&gt;

COPY requirements.txt &lt;span class="nb"&gt;.&lt;/span&gt;
RUN pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--no-cache-dir&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt

&lt;span class="c"&gt;# Ref: https://airflow.apache.org/docs/docker-stack/recipes.html&lt;/span&gt;

SHELL &lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"/bin/bash"&lt;/span&gt;, &lt;span class="s2"&gt;"-o"&lt;/span&gt;, &lt;span class="s2"&gt;"pipefail"&lt;/span&gt;, &lt;span class="s2"&gt;"-e"&lt;/span&gt;, &lt;span class="s2"&gt;"-u"&lt;/span&gt;, &lt;span class="s2"&gt;"-x"&lt;/span&gt;, &lt;span class="s2"&gt;"-c"&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;

ARG &lt;span class="nv"&gt;CLOUD_SDK_VERSION&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;322.0.0
ENV &lt;span class="nv"&gt;GCLOUD_HOME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/home/google-cloud-sdk

ENV &lt;span class="nv"&gt;PATH&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GCLOUD_HOME&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/bin/:&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;PATH&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

RUN &lt;span class="nv"&gt;DOWNLOAD_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"https://dl.google.com/dl/cloudsdk/channels/rapid/downloads/google-cloud-sdk-&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;CLOUD_SDK_VERSION&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;-linux-x86_64.tar.gz"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nv"&gt;TMP_DIR&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;mktemp&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; curl &lt;span class="nt"&gt;-fL&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;DOWNLOAD_URL&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;--output&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;TMP_DIR&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/google-cloud-sdk.tar.gz"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GCLOUD_HOME&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;tar &lt;/span&gt;xzf &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;TMP_DIR&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/google-cloud-sdk.tar.gz"&lt;/span&gt; &lt;span class="nt"&gt;-C&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GCLOUD_HOME&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;--strip-components&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1 &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GCLOUD_HOME&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/install.sh"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
       &lt;span class="nt"&gt;--bash-completion&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;false&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
       &lt;span class="nt"&gt;--path-update&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;false&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
       &lt;span class="nt"&gt;--usage-reporting&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;false&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
       &lt;span class="nt"&gt;--quiet&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;rm&lt;/span&gt; &lt;span class="nt"&gt;-rf&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;TMP_DIR&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; gcloud &lt;span class="nt"&gt;--version&lt;/span&gt;

WORKDIR &lt;span class="nv"&gt;$AIRFLOW_HOME&lt;/span&gt;

COPY scripts scripts
RUN &lt;span class="nb"&gt;chmod&lt;/span&gt; +x scripts

USER &lt;span class="nv"&gt;$AIRFLOW_UID&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;docker-compose.yaml&lt;/code&gt; file suggested by the Airflow documentation should be downloaded next to the Dockerfile:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-LfO&lt;/span&gt; &lt;span class="s1"&gt;'https://airflow.apache.org/docs/apache-airflow/2.2.5/docker-compose.yaml'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It involves a variety of services that Airflow needs (descriptions are from the &lt;a href="https://airflow.apache.org/docs/apache-airflow/stable/start/docker.html#using-custom-images"&gt;official guide&lt;/a&gt;):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;airflow-scheduler&lt;/code&gt; - The &lt;a href="https://airflow.apache.org/docs/apache-airflow/stable/concepts/scheduler.html"&gt;scheduler&lt;/a&gt; monitors all tasks and DAGs, then triggers the task instances once their dependencies are complete.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;airflow-webserver&lt;/code&gt; - The webserver is available at &lt;code&gt;http://localhost:8080&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;airflow-worker&lt;/code&gt; - The worker that executes the tasks given by the scheduler.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;airflow-init&lt;/code&gt; - The initialization service.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;flower&lt;/code&gt; - &lt;a href="https://flower.readthedocs.io/en/latest/"&gt;The flower app&lt;/a&gt; for monitoring the environment. It is available at &lt;code&gt;http://localhost:5555&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;postgres&lt;/code&gt; - The database.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;redis&lt;/code&gt; - &lt;a href="https://redis.io/"&gt;The redis&lt;/a&gt; - broker that forwards messages from scheduler to worker.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The Airflow documentation also suggests creating folders to keep DAGs, log files and plugins outside the container:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; ./dags ./logs ./plugins
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Moreover, a &lt;code&gt;.env&lt;/code&gt; file should be created to declare the user ID to docker-compose:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s2"&gt;"AIRFLOW_UID=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt; &lt;span class="nt"&gt;-u&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; .env
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Having the base image, the next step is to include the GCP-related components in the &lt;code&gt;docker-compose.yaml&lt;/code&gt; and to set the credentials. The &lt;a href="https://github.com/DataTalksClub/data-engineering-zoomcamp/blob/main/week_2_data_ingestion/airflow/docker-compose.yaml"&gt;DataTalks Club template&lt;/a&gt; includes the following lines:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# (line 61 to 66)&lt;/span&gt;
GOOGLE_APPLICATION_CREDENTIALS: /.google/credentials/google_credentials.json
AIRFLOW_CONN_GOOGLE_CLOUD_DEFAULT: &lt;span class="s1"&gt;'google-cloud-platform://?extra__google_cloud_platform__key_path=/.google/credentials/google_credentials.json'&lt;/span&gt;

&lt;span class="c"&gt;# TODO: Please change GCP_PROJECT_ID &amp;amp; GCP_GCS_BUCKET, as per your config&lt;/span&gt;
GCP_PROJECT_ID: &lt;span class="s1"&gt;'YOUR-PROJECT-ID'&lt;/span&gt;
GCP_GCS_BUCKET: &lt;span class="s1"&gt;'YOUR-BUCKET-NAME'&lt;/span&gt;

&lt;span class="c"&gt;# line 72 &amp;gt;&amp;gt; link to the credentials file (at your host machine) for your GCP service account&lt;/span&gt;
- ~/.google/credentials/:/.google/credentials:ro
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Also note that the DataTalks template replaces the &lt;code&gt;image&lt;/code&gt; tag in the original document with a &lt;code&gt;build&lt;/code&gt; of the Dockerfile (lines 47 to 49). The rest of the &lt;code&gt;docker-compose.yaml&lt;/code&gt; file runs to roughly 300 lines and is needless to display here. Please investigate the file downloaded (with the &lt;code&gt;curl&lt;/code&gt; command given above) and visit the &lt;a href="https://github.com/DataTalksClub/data-engineering-zoomcamp/blob/main/week_2_data_ingestion/airflow/docker-compose.yaml"&gt;DataTalks repository&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  2.2 Running the Containers
&lt;/h2&gt;

&lt;p&gt;Once the necessary files and folders are gathered, it’s time to build and start the services with the following commands:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker-compose build  &lt;span class="c"&gt;#would require 10-15 mins&lt;/span&gt;

docker-compose up airflow-init &lt;span class="c"&gt;#requires ~1 min&lt;/span&gt;

docker-compose up &lt;span class="c"&gt;#requires 2-3 mins &lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As mentioned above, Airflow has a webserver that provides an interactive GUI (&lt;code&gt;localhost:8080&lt;/code&gt;) to monitor and control the processes declared by DAGs.&lt;/p&gt;

&lt;h1&gt;
  
  
  3. Composing DAGs to Use OEDI Data
&lt;/h1&gt;

&lt;p&gt;With an Airflow setup up and running, the next step is to compose DAG files to execute the tasks. The complete code for this part can be seen &lt;a href="https://github.com/CemKeskin84/DataEng_Zoomcamp/blob/main/dez-project/airflow/dags/pv_system1430.py"&gt;here&lt;/a&gt;. In this text, only the parts custom to the OEDI PV dataset are reviewed; the rest of the code is in line with the &lt;a href="https://github.com/DataTalksClub/data-engineering-zoomcamp/blob/main/week_2_data_ingestion/airflow/dags/data_ingestion_gcs_dag.py"&gt;DataTalks tutorial&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Like typical Python files, DAG files start with imports followed by declarations. The first part of the declarations involves the environment parameters that refer to the Dockerfile and docker-compose files:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;AIRFLOW_HOME &lt;span class="o"&gt;=&lt;/span&gt; os.environ.get&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"AIRFLOW_HOME"&lt;/span&gt;, &lt;span class="s2"&gt;"/opt/airflow/"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;

PROJECT_ID &lt;span class="o"&gt;=&lt;/span&gt; os.environ.get&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"GCP_PROJECT_ID"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
BUCKET &lt;span class="o"&gt;=&lt;/span&gt; os.environ.get&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"GCP_GCS_BUCKET"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
BIGQUERY_DATASET &lt;span class="o"&gt;=&lt;/span&gt; os.environ.get&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"BIGQUERY_DATASET"&lt;/span&gt;, pv_system_label&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The rest of the declarations relate to the OEDI data lake, which is hosted on AWS S3 buckets. An example URL for the files:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;[&lt;/span&gt;https://oedi-data-lake.s3.amazonaws.com/pvdaq/csv/pvdata/system_id&lt;span class="o"&gt;=&lt;/span&gt;1199/year&lt;span class="o"&gt;=&lt;/span&gt;2011/month&lt;span class="o"&gt;=&lt;/span&gt;1/day&lt;span class="o"&gt;=&lt;/span&gt;1/system_1199__date_2011_01_01.csv]&lt;span class="o"&gt;(&lt;/span&gt;https://oedi-data-lake.s3.amazonaws.com/pvdaq/csv/pvdata/system_id&lt;span class="o"&gt;=&lt;/span&gt;1199/year&lt;span class="o"&gt;=&lt;/span&gt;2011/month&lt;span class="o"&gt;=&lt;/span&gt;1/day&lt;span class="o"&gt;=&lt;/span&gt;1/system_1199__date_2011_01_01.csv&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It can be separated into the following components:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;URL Core for PV dataset: &lt;a href="https://oedi-data-lake.s3.amazonaws.com/pvdaq/csv/pvdata/system_id=1199/year=2011/month=1/day=1/system_1199__date_2011_01_01.csv"&gt;&lt;code&gt;https://oedi-data-lake.s3.amazonaws.com/pvdaq/csv/pvdata/&lt;/code&gt;&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;System declaration with a numeric code: &lt;a href="https://oedi-data-lake.s3.amazonaws.com/pvdaq/csv/pvdata/system_id=1199/year=2011/month=1/day=1/system_1199__date_2011_01_01.csv"&gt;&lt;code&gt;system_id=1199/&lt;/code&gt;&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Year declaration: &lt;a href="https://oedi-data-lake.s3.amazonaws.com/pvdaq/csv/pvdata/system_id=1199/year=2011/month=1/day=1/system_1199__date_2011_01_01.csv"&gt;&lt;code&gt;year=2011/&lt;/code&gt;&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Month declaration: &lt;a href="https://oedi-data-lake.s3.amazonaws.com/pvdaq/csv/pvdata/system_id=1199/year=2011/month=1/day=1/system_1199__date_2011_01_01.csv"&gt;&lt;code&gt;month=1/&lt;/code&gt;&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Day declaration: &lt;a href="https://oedi-data-lake.s3.amazonaws.com/pvdaq/csv/pvdata/system_id=1199/year=2011/month=1/day=1/system_1199__date_2011_01_01.csv"&gt;&lt;code&gt;day=1/&lt;/code&gt;&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;File name declaration: &lt;a href="https://oedi-data-lake.s3.amazonaws.com/pvdaq/csv/pvdata/system_id=1199/year=2011/month=1/day=1/system_1199__date_2011_01_01.csv"&gt;&lt;code&gt;system_1199__date_2011_01_01.csv&lt;/code&gt;&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
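&lt;p&gt;To make the pattern concrete, the components above can be reassembled in plain Python (a sketch using &lt;code&gt;strftime&lt;/code&gt; directly, in place of Airflow's templated &lt;code&gt;execution_date&lt;/code&gt;):&lt;/p&gt;

```python
from datetime import datetime

# Rebuild the example OEDI URL from its components for system 1199
# on 2011-01-01, mirroring the templated URL used in the DAG.
URL_CORE = 'https://oedi-data-lake.s3.amazonaws.com/pvdaq/csv/pvdata/'
system_id = '1199'
day = datetime(2011, 1, 1)

url = (
    URL_CORE
    + f'system_id={system_id}/'
    + f'year={day.strftime("%Y")}/'
    + f'month={day.strftime("%-m")}/'   # %-m: month without zero padding
    + f'day={day.strftime("%-e")}/'     # %-e: day of month without padding
    + f'system_{system_id}__date_{day.strftime("%Y_%m_%d")}.csv'
)
print(url)
```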

&lt;p&gt;Hence, we need a parameter to define the system ID, plus independent year, month and day parameters to target specific files. The ID is just a string:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pv_system_ID &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'1430'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To manipulate the date parameters, we can embed Python code within a Jinja template:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;{{ execution_date.strftime('%Y') }}&lt;/code&gt;&lt;/p&gt;
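&lt;p&gt;Under the hood this is just Python's &lt;code&gt;strftime&lt;/code&gt;. A quick sketch of the format codes involved (note that the unpadded variants &lt;code&gt;%-m&lt;/code&gt; and &lt;code&gt;%-e&lt;/code&gt; are glibc extensions, so this behavior is Linux-specific):&lt;/p&gt;

```python
from datetime import datetime

# The Jinja expressions in the DAG call strftime on execution_date;
# the dash modifier strips zero padding, which the OEDI paths need
# (month=1 rather than month=01).
d = datetime(2011, 1, 1)
print(d.strftime('%Y'))   # year
print(d.strftime('%m'))   # zero-padded month
print(d.strftime('%-m'))  # month without padding (glibc extension)
print(d.strftime('%-e'))  # day of month without padding (glibc extension)
```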

&lt;p&gt;Hence, we can declare the parametrized URL as:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;URL_PREFIX&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'https://oedi-data-lake.s3.amazonaws.com/ \
            pvdaq/csv/pvdata/ \
            system_id='&lt;/span&gt;+pv_system_ID+ &lt;span class="se"&gt;\&lt;/span&gt;
            &lt;span class="s1"&gt;'/year='&lt;/span&gt;+&lt;span class="s1"&gt;'{{ execution_date.strftime(\'&lt;/span&gt;%Y&lt;span class="se"&gt;\'&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;}}&lt;/span&gt;&lt;span class="s1"&gt;'+\
            '&lt;/span&gt;/month&lt;span class="o"&gt;={{&lt;/span&gt; execution_date.strftime&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="se"&gt;\'&lt;/span&gt;%-m&lt;span class="se"&gt;\'&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;}}&lt;/span&gt;&lt;span class="s1"&gt;' +\
            '&lt;/span&gt;/day&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'+'&lt;/span&gt;&lt;span class="o"&gt;{{&lt;/span&gt; execution_date.strftime&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="se"&gt;\'&lt;/span&gt;%-e&lt;span class="se"&gt;\'&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;}}&lt;/span&gt;&lt;span class="s1"&gt;'

URL_TEMPLATE= URL_PREFIX +\
            '&lt;/span&gt;/system_&lt;span class="s1"&gt;'+pv_system_ID+'&lt;/span&gt;__date_&lt;span class="s1"&gt;'+\
            '&lt;/span&gt;&lt;span class="o"&gt;{{&lt;/span&gt; execution_date.strftime&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="se"&gt;\'&lt;/span&gt;%Y&lt;span class="se"&gt;\'&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;}}&lt;/span&gt;&lt;span class="s1"&gt;'+\
            '&lt;/span&gt;_&lt;span class="o"&gt;{{&lt;/span&gt; execution_date.strftime&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="se"&gt;\'&lt;/span&gt;%m&lt;span class="se"&gt;\'&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;}}&lt;/span&gt;_&lt;span class="s1"&gt;'+\
            '&lt;/span&gt;&lt;span class="o"&gt;{{&lt;/span&gt; execution_date.strftime&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="se"&gt;\'&lt;/span&gt;%d&lt;span class="se"&gt;\'&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;}}&lt;/span&gt;&lt;span class="s1"&gt;'+'&lt;/span&gt;.csv&lt;span class="s1"&gt;'
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In a similar manner, it is useful to rename downloaded files before conversion:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;OUTPUT_FILE_TEMPLATE &lt;span class="o"&gt;=&lt;/span&gt; AIRFLOW_HOME + &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="s1"&gt;'/pvsys'&lt;/span&gt;+pv_system_ID+&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="s1"&gt;'data_{{ execution_date.strftime(\'&lt;/span&gt;%Y&lt;span class="se"&gt;\'&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;}}&lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="o"&gt;{{&lt;/span&gt; execution_date.strftime&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="se"&gt;\'&lt;/span&gt;%m&lt;span class="se"&gt;\'&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;}}&lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="o"&gt;{{&lt;/span&gt; execution_date.strftime&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="se"&gt;\'&lt;/span&gt;%d&lt;span class="se"&gt;\'&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;}}&lt;/span&gt;.csv&lt;span class="s1"&gt;'

parquet_file = OUTPUT_FILE_TEMPLATE.\
replace('&lt;/span&gt;.csv&lt;span class="s1"&gt;', '&lt;/span&gt;.parquet&lt;span class="s1"&gt;')
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The DAG code continues with two function definitions (namely &lt;code&gt;format_to_parquet&lt;/code&gt; and &lt;code&gt;upload_to_gcs&lt;/code&gt;, &lt;a href="https://github.com/DataTalksClub/data-engineering-zoomcamp/blob/main/week_2_data_ingestion/airflow/dags/data_ingestion_gcs_dag.py"&gt;defined by DataTalks Club&lt;/a&gt;) to be used in the operators. In line with the 4 tasks given at the beginning of Section 2, the DAG involves 4 operators (remember the definition and role of operators given in Section 1).&lt;/p&gt;
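&lt;p&gt;For reference, a minimal sketch of what such a &lt;code&gt;format_to_parquet&lt;/code&gt; callable can look like (the actual implementation lives in the DataTalks repository; the &lt;code&gt;pyarrow&lt;/code&gt; calls below are one common approach and are assumed to be available in the Airflow image):&lt;/p&gt;

```python
def format_to_parquet(src_file: str) -> str:
    """Convert a local CSV file to Parquet and return the new path.

    Sketch of the helper wired into the PythonOperator; assumes
    pyarrow is installed in the worker environment.
    """
    if not src_file.endswith('.csv'):
        raise ValueError('Can only accept source files in CSV format')
    dst_file = src_file.replace('.csv', '.parquet')
    # Lazy imports so the extension guard works even without pyarrow.
    import pyarrow.csv as pv
    import pyarrow.parquet as pq
    pq.write_table(pv.read_csv(src_file), dst_file)
    return dst_file
```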

&lt;p&gt;The first operator gets the data from the corresponding link by using the &lt;code&gt;curl&lt;/code&gt; command with the URL and output templates declared above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;download_task &lt;span class="o"&gt;=&lt;/span&gt; BashOperator&lt;span class="o"&gt;(&lt;/span&gt;
        &lt;span class="nv"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'get_data'&lt;/span&gt;,
        &lt;span class="nv"&gt;bash_command&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;f&lt;span class="s1"&gt;'curl -sSL {URL_TEMPLATE} &amp;gt; {OUTPUT_FILE_TEMPLATE}'&lt;/span&gt;
    &lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The second operator converts the &lt;code&gt;.csv&lt;/code&gt; file to a &lt;code&gt;.parquet&lt;/code&gt; file using the &lt;code&gt;format_to_parquet&lt;/code&gt; function:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;convert_task &lt;span class="o"&gt;=&lt;/span&gt; PythonOperator&lt;span class="o"&gt;(&lt;/span&gt;
        &lt;span class="nv"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"convert_csv_to_parquet"&lt;/span&gt;,
        &lt;span class="nv"&gt;python_callable&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;format_to_parquet,
        &lt;span class="nv"&gt;op_kwargs&lt;/span&gt;&lt;span class="o"&gt;={&lt;/span&gt;
            &lt;span class="s2"&gt;"src_file"&lt;/span&gt;: OUTPUT_FILE_TEMPLATE,
        &lt;span class="o"&gt;}&lt;/span&gt;,
    &lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The third operator sends the converted file to the GCS bucket using the &lt;code&gt;upload_to_gcs&lt;/code&gt; function with parametrized system and file names:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;local_to_gcs_task &lt;span class="o"&gt;=&lt;/span&gt; PythonOperator&lt;span class="o"&gt;(&lt;/span&gt;
        &lt;span class="nv"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"local_to_gcs_task"&lt;/span&gt;,
        &lt;span class="nv"&gt;python_callable&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;upload_to_gcs,
        &lt;span class="nv"&gt;op_kwargs&lt;/span&gt;&lt;span class="o"&gt;={&lt;/span&gt;
            &lt;span class="s2"&gt;"bucket"&lt;/span&gt;: BUCKET,
            &lt;span class="s2"&gt;"object_name"&lt;/span&gt;: f&lt;span class="s2"&gt;"{pv_system_label}/{parquet_file_name}"&lt;/span&gt;,
            &lt;span class="s2"&gt;"local_file"&lt;/span&gt;: f&lt;span class="s2"&gt;"{parquet_file}"&lt;/span&gt;,
        &lt;span class="o"&gt;}&lt;/span&gt;,
    &lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The last operator transfers the files from the bucket to BigQuery with an operator specifically defined for this task:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;bigquery_external_table_task &lt;span class="o"&gt;=&lt;/span&gt; BigQueryCreateExternalTableOperator&lt;span class="o"&gt;(&lt;/span&gt;
        &lt;span class="nv"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"bigquery_external_table_task"&lt;/span&gt;,
        &lt;span class="nv"&gt;table_resource&lt;/span&gt;&lt;span class="o"&gt;={&lt;/span&gt;
            &lt;span class="s2"&gt;"tableReference"&lt;/span&gt;: &lt;span class="o"&gt;{&lt;/span&gt;
                &lt;span class="s2"&gt;"projectId"&lt;/span&gt;: PROJECT_ID,
                &lt;span class="s2"&gt;"datasetId"&lt;/span&gt;: BIGQUERY_DATASET,
                &lt;span class="s2"&gt;"tableId"&lt;/span&gt;: pv_system_label+&lt;span class="s2"&gt;"_"&lt;/span&gt;+&lt;span class="s2"&gt;"{{execution_date.strftime(&lt;/span&gt;&lt;span class="se"&gt;\'&lt;/span&gt;&lt;span class="s2"&gt;%d&lt;/span&gt;&lt;span class="se"&gt;\'&lt;/span&gt;&lt;span class="s2"&gt;) }}{{execution_date.strftime(&lt;/span&gt;&lt;span class="se"&gt;\'&lt;/span&gt;&lt;span class="s2"&gt;%m&lt;/span&gt;&lt;span class="se"&gt;\'&lt;/span&gt;&lt;span class="s2"&gt;) }}{{execution_date.strftime(&lt;/span&gt;&lt;span class="se"&gt;\'&lt;/span&gt;&lt;span class="s2"&gt;%Y&lt;/span&gt;&lt;span class="se"&gt;\'&lt;/span&gt;&lt;span class="s2"&gt;) }}"&lt;/span&gt;,
            &lt;span class="o"&gt;}&lt;/span&gt;,
            &lt;span class="s2"&gt;"externalDataConfiguration"&lt;/span&gt;: &lt;span class="o"&gt;{&lt;/span&gt;
                &lt;span class="s2"&gt;"sourceFormat"&lt;/span&gt;: &lt;span class="s2"&gt;"PARQUET"&lt;/span&gt;,
                &lt;span class="s2"&gt;"sourceUris"&lt;/span&gt;: &lt;span class="o"&gt;[&lt;/span&gt;f&lt;span class="s2"&gt;"gs://{BUCKET}/{pv_system_label}/{parquet_file_name}"&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;,
            &lt;span class="o"&gt;}&lt;/span&gt;,
        &lt;span class="o"&gt;}&lt;/span&gt;,
    &lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The last step is to ‘chain’ all these operators to build the ‘tree’ of tasks:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;download_task &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; convert_task &lt;span class="se"&gt;\&lt;/span&gt;
        &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; local_to_gcs_task &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; bigquery_external_table_task
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Initiating Airflow with such a DAG definition for 4 PV systems (IDs 1430 to 1433), one should get the following graph:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--3qhAYTzN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vg2bln9hyjoixpun5vts.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--3qhAYTzN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vg2bln9hyjoixpun5vts.png" alt="Image description" width="800" height="296"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After triggering a DAG, the following tree visualization appears:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--BDL2FBxF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ifbdboa9m06afkzvykva.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--BDL2FBxF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ifbdboa9m06afkzvykva.png" alt="Image description" width="800" height="326"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This also visualizes how the DAG works. Referring to the date declarations given in the DAG initiation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;with DAG&lt;span class="o"&gt;(&lt;/span&gt;
    &lt;span class="nv"&gt;dag_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"dag_for_"&lt;/span&gt;+pv_system_label+&lt;span class="s2"&gt;"data"&lt;/span&gt;,
    &lt;span class="nv"&gt;start_date&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;datetime&lt;span class="o"&gt;(&lt;/span&gt;2015, 1, 1&lt;span class="o"&gt;)&lt;/span&gt;,
    &lt;span class="nv"&gt;end_date&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;datetime&lt;span class="o"&gt;(&lt;/span&gt;2015, 12, 31&lt;span class="o"&gt;)&lt;/span&gt;,
    &lt;span class="nv"&gt;schedule_interval&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"@daily"&lt;/span&gt;,

&lt;span class="o"&gt;)&lt;/span&gt; as dag:
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Airflow parametrizes the year, month and day information to be used in the DAG. These parameters are used in the URL template explained above to get the exact file for each iteration. Hence, beginning from &lt;code&gt;start_date&lt;/code&gt;, Airflow iterates the DAG for each day up to the &lt;code&gt;end_date&lt;/code&gt;. After completing the runs for all 4 systems, the following screen appears on the Airflow DAGs menu and in BigQuery:&lt;/p&gt;
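&lt;p&gt;As a sanity check, the daily schedule over that window implies one run per day per system (a sketch with plain &lt;code&gt;datetime&lt;/code&gt;, leaving aside Airflow's end-of-interval scheduling details):&lt;/p&gt;

```python
from datetime import date

# Count the daily DAG runs implied by the start_date/end_date above.
start_date = date(2015, 1, 1)
end_date = date(2015, 12, 31)
n_runs = (end_date - start_date).days + 1  # both endpoints inclusive
print(n_runs)  # 365: 2015 is not a leap year
```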

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--8VFQwjzc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/z2ye81p0bwkb7rbodfwl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--8VFQwjzc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/z2ye81p0bwkb7rbodfwl.png" alt="Image description" width="800" height="318"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--lXu60zjJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fngfeokq7tsxecf04se2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--lXu60zjJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fngfeokq7tsxecf04se2.png" alt="Image description" width="800" height="632"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>dezoomcamp</category>
      <category>dataengineering</category>
      <category>airflow</category>
      <category>publicdatasets</category>
    </item>
    <item>
      <title>Building GCS Buckets and BigQuery Tables with Terraform</title>
      <dc:creator>Cem Keskin</dc:creator>
      <pubDate>Sun, 03 Apr 2022 18:20:40 +0000</pubDate>
      <link>https://dev.to/cemkeskin84/building-gcs-bucket-and-bigquery-tables-with-terraform-4hf4</link>
      <guid>https://dev.to/cemkeskin84/building-gcs-bucket-and-bigquery-tables-with-terraform-4hf4</guid>
<description>&lt;p&gt;Terraform helps data scientists and engineers build an infrastructure and manage its lifecycle.&lt;/p&gt;

&lt;p&gt;There are two ways to use it: locally or in the cloud. Below is a description of how to install and use it locally to build an infrastructure on Google Cloud Platform (GCP).&lt;/p&gt;

&lt;h2&gt;
  
  
  Initial Installations: Terraform and Google Cloud SDK
&lt;/h2&gt;

&lt;p&gt;For installing Terraform, pick the proper guide for your operating system provided in their &lt;a href="https://www.terraform.io/downloads" rel="noopener noreferrer"&gt;webpage.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once the Terraform installation is complete, you also need a GCP account and an initiated project. The ID of the project is important to note while proceeding with Terraform.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faqyenvdwlyte90xobhqy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faqyenvdwlyte90xobhqy.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The next step is to get a key to access and control your GCP project. Pick the project you just created from the pull-down menu on the header of GCP and go to: &lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;         Navigation Menu &amp;gt;&amp;gt; IAM &amp;amp; Admin &amp;gt;&amp;gt; Service Accounts &amp;gt;&amp;gt; Create Service Account
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;and follow the steps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Step 1: Assign a name of your preference&lt;/li&gt;
&lt;li&gt;Step 2: Pick the role “Viewer” for initiation&lt;/li&gt;
&lt;li&gt;Step 3: Skip this optional step for your personal projects.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then you should see a new account on your Service Accounts list. Click on Actions &amp;gt;&amp;gt; Manage Keys &amp;gt;&amp;gt; Add Key &amp;gt;&amp;gt; JSON to download the key on your local machine.&lt;/p&gt;

&lt;p&gt;The next step is installing the Google Cloud SDK on your local machine, following the straightforward instructions given &lt;a href="https://cloud.google.com/sdk/docs/install-sdk" rel="noopener noreferrer"&gt;here.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then open your terminal (below is a GNU/Linux example) and set the environment variable on your local machine to link to the key (JSON file) you downloaded, with the following instructions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;GOOGLE_APPLICATION_CREDENTIALS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/--path to your JSON---/XXXXX-dadas2a4cff8.json

gcloud auth application-default login
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This redirects you to the browser in order to select your corresponding Google account. Now your local SDK has the credentials to reach and configure your cloud services. However, despite these initial authentications, you still need to modify your service account permissions specifically for the GCP services you intend to build, namely Google Cloud Storage (GCS) and BigQuery.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;   Navigation Menu &amp;gt;&amp;gt; IAM &amp;amp; Admin &amp;gt;&amp;gt; IAM
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;and pick your project to edit its permissions as following:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F15y1400bxxt2e8pmsl41.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F15y1400bxxt2e8pmsl41.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next step is enabling the APIs for your project by following the links:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://console.cloud.google.com/apis/library/iam.googleapis.com" rel="noopener noreferrer"&gt;IAM API&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://console.cloud.google.com/apis/library/iamcredentials.googleapis.com" rel="noopener noreferrer"&gt;IAM Credentials&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;(Take care of the GCP account and the project name while enabling APIs.)&lt;/p&gt;

&lt;h2&gt;
  
  
  Building GCP Services with Terraform
&lt;/h2&gt;

&lt;p&gt;Having completed the necessary installations (Terraform and the Google Cloud SDK) and authentications, we are ready to build these two GCP services via Terraform from your local machine. Basically, two files are needed to configure the installations: &lt;code&gt;main.tf&lt;/code&gt; and &lt;code&gt;variables.tf&lt;/code&gt;. The former requires the code given below to create the GCP services with respect to the variables provided in the latter (the following code snippet).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# The code below is from https://github.com/DataTalksClub/data-engineering-zoomcamp/tree/main/week_1_basics_n_setup/1_terraform_gcp&lt;/span&gt;
&lt;span class="c"&gt;# --------------------------------------------------&lt;/span&gt;

terraform &lt;span class="o"&gt;{&lt;/span&gt;
  required_version &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"&amp;gt;= 1.0"&lt;/span&gt;
  backend &lt;span class="s2"&gt;"local"&lt;/span&gt; &lt;span class="o"&gt;{}&lt;/span&gt;
  required_providers &lt;span class="o"&gt;{&lt;/span&gt;
    google &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
      &lt;span class="nb"&gt;source&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"hashicorp/google"&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
  &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;

provider &lt;span class="s2"&gt;"google"&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
  project &lt;span class="o"&gt;=&lt;/span&gt; var.project
  region &lt;span class="o"&gt;=&lt;/span&gt; var.region
  // credentials &lt;span class="o"&gt;=&lt;/span&gt; file&lt;span class="o"&gt;(&lt;/span&gt;var.credentials&lt;span class="o"&gt;)&lt;/span&gt;  &lt;span class="c"&gt;# Use this if you do not want to set env-var GOOGLE_APPLICATION_CREDENTIALS&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;

&lt;span class="c"&gt;# Data Lake Bucket&lt;/span&gt;
&lt;span class="c"&gt;# Ref: https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/storage_bucket&lt;/span&gt;
resource &lt;span class="s2"&gt;"google_storage_bucket"&lt;/span&gt; &lt;span class="s2"&gt;"data-lake-bucket"&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
  name          &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;local&lt;/span&gt;&lt;span class="p"&gt;.data_lake_bucket&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;_&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.project&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="c"&gt;# Concatenating DL bucket &amp;amp; Project name for unique naming&lt;/span&gt;
  location      &lt;span class="o"&gt;=&lt;/span&gt; var.region

  &lt;span class="c"&gt;# Optional, but recommended settings:&lt;/span&gt;
  storage_class &lt;span class="o"&gt;=&lt;/span&gt; var.storage_class
  uniform_bucket_level_access &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;true

  &lt;/span&gt;versioning &lt;span class="o"&gt;{&lt;/span&gt;
    enabled     &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;true&lt;/span&gt;
  &lt;span class="o"&gt;}&lt;/span&gt;

  lifecycle_rule &lt;span class="o"&gt;{&lt;/span&gt;
    action &lt;span class="o"&gt;{&lt;/span&gt;
      &lt;span class="nb"&gt;type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Delete"&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
    condition &lt;span class="o"&gt;{&lt;/span&gt;
      age &lt;span class="o"&gt;=&lt;/span&gt; 30  // days
    &lt;span class="o"&gt;}&lt;/span&gt;
  &lt;span class="o"&gt;}&lt;/span&gt;

  force_destroy &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;true&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;

&lt;span class="c"&gt;# DWH&lt;/span&gt;
&lt;span class="c"&gt;# Ref: https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/bigquery_dataset&lt;/span&gt;
resource &lt;span class="s2"&gt;"google_bigquery_dataset"&lt;/span&gt; &lt;span class="s2"&gt;"dataset"&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
  dataset_id &lt;span class="o"&gt;=&lt;/span&gt; var.BQ_DATASET
  project    &lt;span class="o"&gt;=&lt;/span&gt; var.project
  location   &lt;span class="o"&gt;=&lt;/span&gt; var.region
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The code for &lt;code&gt;variables.tf&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# The code below is from https://github.com/DataTalksClub/data-engineering-zoomcamp/tree/main/week_1_basics_n_setup/1_terraform_gcp&lt;/span&gt;
&lt;span class="c"&gt;# The comments are added by the author&lt;/span&gt;

locals &lt;span class="o"&gt;{&lt;/span&gt;
  data_lake_bucket &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"BUCKET_NAME"&lt;/span&gt;  &lt;span class="c"&gt;# Write a name for the GCS bucket to be created&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;

variable &lt;span class="s2"&gt;"project"&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
  description &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Your GCP Project ID"&lt;/span&gt;   &lt;span class="c"&gt;# Don't write anything here: it will be prompted during installation&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;

variable &lt;span class="s2"&gt;"region"&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
  description &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Region for GCP resources. Choose as per your location: https://cloud.google.com/about/locations"&lt;/span&gt;
  default &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"europe-west6"&lt;/span&gt;  &lt;span class="c"&gt;# Pick a data center location in which your services will be located&lt;/span&gt;
  &lt;span class="nb"&gt;type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; string
&lt;span class="o"&gt;}&lt;/span&gt;

variable &lt;span class="s2"&gt;"storage_class"&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
  description &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Storage class type for your bucket. Check official docs for more info."&lt;/span&gt;
  default &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"STANDARD"&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;

variable &lt;span class="s2"&gt;"BQ_DATASET"&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
  description &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"BigQuery Dataset that raw data (from GCS) will be written to"&lt;/span&gt;
  &lt;span class="nb"&gt;type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; string
  default &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Dataset_Name"&lt;/span&gt; &lt;span class="c"&gt;# Write a name for the BigQuery Dataset to be created&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once the files given above are placed in a folder, it is time to execute them. The Terraform CLI has a few main commands:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;init:&lt;/strong&gt; prepares the working directory by downloading the required providers and creating the files needed by the following commands&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;validate:&lt;/strong&gt; checks whether the existing configuration is valid&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;plan:&lt;/strong&gt; shows the changes planned for the given configuration&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;apply:&lt;/strong&gt; creates the infrastructure described by the given configuration&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;destroy:&lt;/strong&gt; tears down the previously created infrastructure&lt;/li&gt;
&lt;/ul&gt;
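&lt;p&gt;The typical workflow chains these commands in order. As a minimal illustrative sketch (the Python wrapper below is hypothetical; only the Terraform command names and the &lt;code&gt;-auto-approve&lt;/code&gt; flag are real), the sequence can be driven with &lt;code&gt;subprocess&lt;/code&gt;:&lt;/p&gt;

```python
import subprocess

# The standard Terraform workflow, as ordered argv lists.
# "-auto-approve" skips the interactive confirmation of `terraform apply`.
WORKFLOW = [
    ["terraform", "init"],
    ["terraform", "validate"],
    ["terraform", "plan"],
    ["terraform", "apply", "-auto-approve"],
]

def run_workflow(commands=WORKFLOW, cwd="."):
    """Run each Terraform command in order, stopping at the first failure."""
    for argv in commands:
        if subprocess.run(argv, cwd=cwd).returncode != 0:
            return False
    return True
```

&lt;p&gt;In the interactive session below, the same commands are simply typed one by one instead.&lt;/p&gt;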

&lt;p&gt;The init, plan and apply commands will give the following (shortened) outputs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;x@y:~/-----/terraform&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="k"&gt;**&lt;/span&gt;terraform init&lt;span class="k"&gt;**&lt;/span&gt;

Initializing the backend...

Successfully configured the backend &lt;span class="s2"&gt;"local"&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt; Terraform will automatically
use this backend unless the backend configuration changes.
&lt;span class="nb"&gt;.&lt;/span&gt;
&lt;span class="nb"&gt;.&lt;/span&gt;
&lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;x@y:~/-----/terraform&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="k"&gt;**&lt;/span&gt;terraform plan&lt;span class="k"&gt;**&lt;/span&gt;
var.project
  Your GCP Project ID

  Enter a value: xxx-yyy &lt;span class="c"&gt;# write your GCP project ID here&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;x@y:~/-----/terraform&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="k"&gt;**&lt;/span&gt;terraform apply&lt;span class="k"&gt;**&lt;/span&gt;

var.project
  Your GCP Project ID

  Enter a value: xxx-yyy &lt;span class="c"&gt;# write your GCP project ID here&lt;/span&gt;

Terraform used the selected providers to generate the following execution plan. Resource actions are indicated with the following
symbols:
  + create

Terraform will perform the following actions:
&lt;span class="nb"&gt;.&lt;/span&gt;
&lt;span class="nb"&gt;.&lt;/span&gt;
&lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After executing the three simple commands above, you will see a new GCS bucket and a BigQuery dataset in your GCP account.&lt;/p&gt;

</description>
      <category>dezoomcamp</category>
      <category>dataengineering</category>
      <category>bigquery</category>
      <category>gcp</category>
    </item>
    <item>
      <title>Versioning Data and Pipeline With Git, DVC and Cloud Storage</title>
      <dc:creator>Cem Keskin</dc:creator>
      <pubDate>Fri, 04 Feb 2022 12:24:42 +0000</pubDate>
      <link>https://dev.to/cemkeskin84/versioning-data-and-pipeline-with-git-dvc-and-cloud-storage-5cpd</link>
      <guid>https://dev.to/cemkeskin84/versioning-data-and-pipeline-with-git-dvc-and-cloud-storage-5cpd</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;The vast majority of data science projects are born in Jupyter Notebooks. Being interactive and easy to use, they make exploratory data analysis (EDA) very convenient. They are also widely used for further steps such as machine learning model development, performance assessment and hyper-parameter tuning, among others. However, as the project progresses and deployment scenarios come under investigation, notebooks start to suffer in terms of versioning, reproducibility, interoperability, file type issues, etc. In other words, as the project moves from isolated local environments to shared ones, one needs tools oriented more toward software engineering than toward data science alone. At this point, some DevOps practices can contribute.&lt;/p&gt;

&lt;p&gt;A common and inevitable best practice of modern software development is using version control. In daily practice, it is almost synonymous with using Git, which makes switching among different versions of files easy. However, Git has limitations, with file size and file type compatibility being the major ones. One can conveniently version-control .py or .ipynb files; versioning a dataset of a couple of hundred megabytes, or a binary file, is neither possible nor convenient with Git. At this point, a free and open source tool, namely Data Version Control (DVC) by iterative.ai, comes onto the scene.&lt;/p&gt;

&lt;h3&gt;
  
  
  1.1 DVC
&lt;/h3&gt;

&lt;p&gt;DVC is a highly capable command line tool. Broadly speaking, it makes dataset and experiment versioning convenient by complementing other major developer tools such as Git. For example, it enables dataset versioning together with Git and cloud (or remote) storage: it provides command line tools to define .dvc files whose versions are tracked by Git, and these small files keep references to the exact versions of your dataset stored in the cloud (like S3, Gdrive, etc.). In short, DVC acts as a middleman that integrates Git and cloud storage for dataset versioning.&lt;/p&gt;
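&lt;p&gt;The middleman idea can be sketched in a few lines of Python: hash the data file, copy its content into a cache keyed by that hash, and let Git track only the tiny pointer file. This is a simplified illustration of the concept, not DVC's actual implementation:&lt;/p&gt;

```python
import hashlib
import shutil
import tempfile
from pathlib import Path

def add_to_cache(data_path, cache_dir):
    """Store a file under its content hash and write a tiny pointer file
    (the analogue of a .dvc file) that Git can track instead of the data."""
    data = Path(data_path)
    md5 = hashlib.md5(data.read_bytes()).hexdigest()
    cache = Path(cache_dir)
    cache.mkdir(parents=True, exist_ok=True)
    shutil.copy2(data, cache / md5)  # content-addressed copy
    pointer = Path(str(data) + ".dvc")
    pointer.write_text(f"outs:\n- md5: {md5}\n  path: {data.name}\n")
    return pointer

def restore_from_cache(pointer_path, cache_dir):
    """Recreate the data file a pointer refers to (analogue of `dvc checkout`)."""
    lines = Path(pointer_path).read_text().splitlines()
    md5 = lines[1].split("md5: ")[1]
    target = Path(str(pointer_path)[: -len(".dvc")])
    shutil.copy2(Path(cache_dir) / md5, target)
    return target

# Tiny demo: version a file, delete the local copy, restore it from the cache.
workdir = Path(tempfile.mkdtemp())
data_file = workdir / "dataset.csv"
data_file.write_text("a,b\n1,2\n")
pointer = add_to_cache(data_file, workdir / "cache")
data_file.unlink()  # the local copy is gone ...
restored = restore_from_cache(pointer, workdir / "cache")  # ... and comes back
```

&lt;p&gt;In real DVC the cache lives under &lt;code&gt;.dvc/cache&lt;/code&gt; and can be pushed to a remote, but the pointer-plus-cache principle is the same.&lt;/p&gt;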

&lt;p&gt;As with dataset versioning, it also helps with experiment versioning using .dvc, .yaml and other config files. You can build pipelines that make your data flow through several processes to yield value for your research, business or hobby. Using DVC, you can define such a pipeline and maintain it seamlessly: you can redefine the complete pipeline or fix some parts of it. Whatever the case, DVC makes life easier for your team.&lt;/p&gt;

&lt;p&gt;No need for more descriptions or promises; let's jump in with an introductory tutorial. Note that the dataset used in this tutorial is a small one that could be stored in Git directly. DVC is actually developed for larger datasets, but a smaller one is used here to save loading and computation time. Apart from these costs, the procedure is exactly the same for larger datasets with the steps and tools defined below.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Let's start
&lt;/h2&gt;

&lt;p&gt;This tutorial is a partial reproduction of &lt;a href="https://github.com/CemKeskin84/ML-Zoomcamp/tree/main/midterm_project"&gt;a previous data science project&lt;/a&gt; that depended on notebooks during development. Here, a simple pipeline is built for the workflow that starts with getting the data and ends with evaluating two simple model alternatives (deployment is not involved). Since the tutorial is about versioning, the code for data preparation, model training and model performance evaluation is simply transferred from the corresponding notebooks of the previous project to the &lt;code&gt;src&lt;/code&gt; folder of the new project as &lt;code&gt;.py&lt;/code&gt; files. Hence, the tree was as lean as below at the beginning.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;.&lt;/span&gt;
├── README.md
└── src
    ├── config.py
    ├── evaluate.py
    ├── prepare.py
    └── train.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The next step is to deploy &lt;code&gt;pipenv&lt;/code&gt; for dependency management:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;x@y:~/DVC_tutorial&lt;span class="nv"&gt;$ &lt;/span&gt;pipenv &lt;span class="nb"&gt;install
&lt;/span&gt;x@y:~/DVC_tutorial&lt;span class="nv"&gt;$ &lt;/span&gt;pipenv shell
&lt;span class="o"&gt;(&lt;/span&gt;DVC_tutorial&lt;span class="o"&gt;)&lt;/span&gt; x@y:~/DVC_tutorial&lt;span class="nv"&gt;$ &lt;/span&gt;pipenv &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
dvc[gdrive] pandas numpy sklearn openpyxl
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;Note that installing DVC on Linux is as easy as &lt;code&gt;pip install dvc&lt;/code&gt;. However, &lt;code&gt;dvc[gdrive]&lt;/code&gt; is used here to install DVC together with its optional Google Drive dependencies, since Gdrive will be used for storing the versions of the data during the tutorial. For other installation alternatives, see the &lt;a href="https://dvc.org/doc/install"&gt;DVC website&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Then git and DVC are initialized with the &lt;code&gt;git init&lt;/code&gt; and &lt;code&gt;dvc init&lt;/code&gt; commands. At this point we have the following files and folders:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;(&lt;/span&gt;DVC_tutorial&lt;span class="o"&gt;)&lt;/span&gt; x@y:~/DVC_tutorial&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;ls&lt;/span&gt; &lt;span class="nt"&gt;-a&lt;/span&gt;
.dvc  .dvcignore  .git  Pipfile  Pipfile.lock  README.md  src
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  3. Versioning Data
&lt;/h2&gt;

&lt;p&gt;The original dataset of the project is stored in a &lt;a href="https://archive.ics.uci.edu/ml/machine-learning-databases/00242/"&gt;UCI repository&lt;/a&gt;. Create the &lt;code&gt;data&lt;/code&gt; folder and pull the data file (in .xlsx format) with the dedicated DVC command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;(&lt;/span&gt;DVC_tutorial&lt;span class="o"&gt;)&lt;/span&gt; x@y:~/DVC_tutorial&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;mkdir &lt;/span&gt;data &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;cd &lt;/span&gt;data
&lt;span class="o"&gt;(&lt;/span&gt;DVC_tutorial&lt;span class="o"&gt;)&lt;/span&gt; x@y:~/DVC_tutorial/data&lt;span class="nv"&gt;$ &lt;/span&gt;dvc get-url https://archive.ics.uci.edu/ml/machine-learning-databases/00242/ENB2012_data.xlsx

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then "add" the data file to DVC with &lt;code&gt;dvc add ENB2012_data.xlsx&lt;/code&gt; command. This yields the corresponding &lt;code&gt;.dvc&lt;/code&gt; file for tracking it. This is the file that git will be tracking; not the original dataset. Using this file, DVC acts a tool that matches a dataset in a local or remote storage (Gdrive, S3, etc.) with the code base of the project stored on git.&lt;/p&gt;

&lt;p&gt;The next step is pushing the data to the cloud, which is Gdrive for this tutorial. First, create a folder manually on the Gdrive web interface and get its identifier from the URL:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--tb1WTzLo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/spgyah12wbtvxy7nyshk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--tb1WTzLo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/spgyah12wbtvxy7nyshk.png" alt="Image description" width="664" height="170"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once you have the identifier, declare it to DVC as the remote storage location and commit the change as below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;(&lt;/span&gt;DVC_tutorial&lt;span class="o"&gt;)&lt;/span&gt; x@y:~/DVC_tutorial/data&lt;span class="nv"&gt;$ &lt;/span&gt;dvc remote add &lt;span class="nt"&gt;-d&lt;/span&gt; raw_storage gdrive://1I680q6HvPqcxbNJ8qnQ01c1pKxxxxxxxxxx
Setting &lt;span class="s1"&gt;'raw_storage'&lt;/span&gt; as a default remote.
&lt;span class="o"&gt;(&lt;/span&gt;DVC_tutorial&lt;span class="o"&gt;)&lt;/span&gt; x@y:~/DVC_tutorial/data&lt;span class="nv"&gt;$ &lt;/span&gt;git commit ../.dvc/config &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="s2"&gt;"Remote data storage is added with name: dataset"&lt;/span&gt;
&lt;span class="o"&gt;(&lt;/span&gt;DVC_tutorial&lt;span class="o"&gt;)&lt;/span&gt; x@y:~/DVC_tutorial/data&lt;span class="nv"&gt;$ &lt;/span&gt;dvc push
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once you push, Gdrive will take you through a simple-to-follow authentication procedure to get a verification code. After entering it, the upload will start and you will get a folder with a random name on your Gdrive:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--RNTsSLfT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/t0m05ytsfo417ay3p7v1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--RNTsSLfT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/t0m05ytsfo417ay3p7v1.png" alt="Image description" width="674" height="259"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  3.1 Building and storing another version of the dataset
&lt;/h3&gt;

&lt;p&gt;Imagine a case where you have to keep the raw data but it is not useful as it is: you would need to transform it and keep only the new version locally. As an example, let's say we need a .csv file instead of .xlsx. Simply convert it using Python, saving it under the name "dataset.csv":&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;(&lt;/span&gt;DVC_tutorial&lt;span class="o"&gt;)&lt;/span&gt; x@y:~/DVC_tutorial/data&lt;span class="nv"&gt;$ &lt;/span&gt;python3
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; import pandas as pd
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; pd.read_excel&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"ENB2012_data.xlsx"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;.to_csv&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"dataset.csv"&lt;/span&gt;, &lt;span class="nv"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;None, &lt;span class="nv"&gt;header&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;True&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;(&lt;/span&gt;DVC_tutorial&lt;span class="o"&gt;)&lt;/span&gt; x@y:~/DVC_tutorial/data&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;ls
&lt;/span&gt;dataset.csv  ENB2012_data.xlsx  ENB2012_data.xlsx.dvc
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then repeat the DVC and git steps:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;(&lt;/span&gt;DVC_tutorial&lt;span class="o"&gt;)&lt;/span&gt; x@y:~/DVC_tutorial/data&lt;span class="nv"&gt;$ &lt;/span&gt;dvc add dataset.csv 
&lt;span class="o"&gt;(&lt;/span&gt;DVC_tutorial&lt;span class="o"&gt;)&lt;/span&gt; x@y:~/DVC_tutorial/data&lt;span class="nv"&gt;$ &lt;/span&gt;git add dataset.csv.dvc
&lt;span class="o"&gt;(&lt;/span&gt;DVC_tutorial&lt;span class="o"&gt;)&lt;/span&gt; x@y:~/DVC_tutorial/data&lt;span class="nv"&gt;$ &lt;/span&gt;git commit &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="s2"&gt;"Converted data is integrated with DVC"&lt;/span&gt;
&lt;span class="o"&gt;(&lt;/span&gt;DVC_tutorial&lt;span class="o"&gt;)&lt;/span&gt; x@y:~/DVC_tutorial/data&lt;span class="nv"&gt;$ &lt;/span&gt;dvc push
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and get a second folder on Gdrive for the converted data:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--wNUjYTHV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/29k4lw2zzgnchl7icakv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--wNUjYTHV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/29k4lw2zzgnchl7icakv.png" alt="Image description" width="800" height="295"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now you can remove the raw data (keeping the .dvc file) from your local environment to save disk space. However, as you progress in your EDA, you may still need to update your dataset. Once again, you can create a new version of your dataset and keep only that version locally. Previous versions will stay in remote storage and you can reach them as needed.&lt;/p&gt;

&lt;p&gt;As a fictitious scenario, say that the last 100 lines of the .csv file are irrelevant for your purposes and you plan to proceed by removing them:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;(&lt;/span&gt;DVC_tutorial&lt;span class="o"&gt;)&lt;/span&gt; x@y:~/DVC_tutorial/data&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;cat &lt;/span&gt;dataset.csv | &lt;span class="nb"&gt;wc&lt;/span&gt; &lt;span class="nt"&gt;-l&lt;/span&gt;
769
&lt;span class="o"&gt;(&lt;/span&gt;DVC_tutorial&lt;span class="o"&gt;)&lt;/span&gt; x@y:~/DVC_tutorial/data&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;head&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; &lt;span class="nt"&gt;-100&lt;/span&gt; dataset.csv &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; tmp.txt &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;mv &lt;/span&gt;tmp.txt dataset.csv
&lt;span class="o"&gt;(&lt;/span&gt;DVC_tutorial&lt;span class="o"&gt;)&lt;/span&gt; x@y:~/DVC_tutorial/data&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;cat &lt;/span&gt;dataset.csv | &lt;span class="nb"&gt;wc&lt;/span&gt; &lt;span class="nt"&gt;-l&lt;/span&gt;
669
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Having a new version of the dataset, you also need to store it in the remote repository. Again, use the &lt;code&gt;dvc add&lt;/code&gt;, &lt;code&gt;git add&lt;/code&gt; (for the .dvc file), &lt;code&gt;git commit&lt;/code&gt; and &lt;code&gt;dvc push&lt;/code&gt; command sequence as above. Once completed, you will have another folder on your Gdrive page.&lt;/p&gt;

&lt;h3&gt;
  
  
  3.2 Switching among dataset versions
&lt;/h3&gt;

&lt;p&gt;Of course, DVC not only helps to store different versions of your dataset; it also makes it possible to switch among them. Let's see our git logs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;(&lt;/span&gt;DVC_tutorial&lt;span class="o"&gt;)&lt;/span&gt; x@y:~/DVC_tutorial/data&lt;span class="nv"&gt;$ &lt;/span&gt;git log &lt;span class="nt"&gt;--oneline&lt;/span&gt;
ca1258b &lt;span class="o"&gt;(&lt;/span&gt;HEAD -&amp;gt; master&lt;span class="o"&gt;)&lt;/span&gt; dataset is pre-processed
ea6973a Converted data is integrated with DVC
4025c49 Remote data storage is added with name: dataset
a20ad92 Raw data is pulled and integrated with DVC
6bff3fb &lt;span class="o"&gt;(&lt;/span&gt;origin/master&lt;span class="o"&gt;)&lt;/span&gt; initiation
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Say we regret erasing the last 100 lines and would like to use them again. The only thing we need to do is check out the corresponding state of the .dvc file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;(&lt;/span&gt;DVC_tutorial&lt;span class="o"&gt;)&lt;/span&gt; x@y:~/DVC_tutorial/data&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;cat &lt;/span&gt;dataset.csv | &lt;span class="nb"&gt;wc&lt;/span&gt; &lt;span class="nt"&gt;-l&lt;/span&gt;
669
&lt;span class="o"&gt;(&lt;/span&gt;DVC_tutorial&lt;span class="o"&gt;)&lt;/span&gt; x@y:~/DVC_tutorial/data&lt;span class="nv"&gt;$ &lt;/span&gt;git checkout ea6973a dataset.csv.dvc
&lt;span class="o"&gt;(&lt;/span&gt;DVC_tutorial&lt;span class="o"&gt;)&lt;/span&gt; x@y:~/DVC_tutorial/data&lt;span class="nv"&gt;$ &lt;/span&gt;dvc checkout
&lt;span class="o"&gt;(&lt;/span&gt;DVC_tutorial&lt;span class="o"&gt;)&lt;/span&gt; x@y:~/DVC_tutorial/data&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;cat &lt;/span&gt;dataset.csv | &lt;span class="nb"&gt;wc&lt;/span&gt; &lt;span class="nt"&gt;-l&lt;/span&gt;
769
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As the examples above show, DVC helps us move between different versions of our dataset with git-based tracking of .dvc files. You can see on your GitHub page that the dataset.csv file itself is not there; only the corresponding .dvc files are available.&lt;/p&gt;

&lt;p&gt;At this point, we have the following tree for our local project folder:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;.
├── data
│   ├── dataset.csv
│   ├── dataset.csv.dvc
│   └── ENB2012_data.xlsx.dvc
├── Pipfile
├── Pipfile.lock
├── README.md
└── src
    ├── config.py
    ├── evaluate.py
    ├── prepare.py
    └── train.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  4. Building Pipelines
&lt;/h2&gt;

&lt;p&gt;Having the desired form(s) of the dataset, the next step is to iterate over a sequence of steps (a pipeline) to build and test model(s). DVC helps you automate this procedure as well. The procedure can involve any step from data wrangling to model performance visualization. While working with different pipelines, DVC helps you document and compare the alternatives in terms of the parameters you picked.&lt;/p&gt;

&lt;p&gt;You can build a pipeline with DVC using the &lt;code&gt;dvc run&lt;/code&gt; command or via the dvc.yaml file. In fact, when you use the former, DVC itself produces the latter. Let's try it.&lt;/p&gt;

&lt;p&gt;The primitive pipeline we will build here involves three fundamental steps: preparation, training and evaluation. For each of those steps, there is a dedicated &lt;code&gt;.py&lt;/code&gt; file under the src folder. Using those and the proper declarations, building a pipeline is a straightforward task.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Code for the preparation step of the pipeline:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;(&lt;/span&gt;DVC_tutorial&lt;span class="o"&gt;)&lt;/span&gt;x@y:~/DVC_tutorial/src&lt;span class="nv"&gt;$ &lt;/span&gt;dvc run&lt;span class="se"&gt;\ &lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; prepare &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt; prepare.py &lt;span class="nt"&gt;-d&lt;/span&gt; ../data/dataset.csv &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; ../data/prepared
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; python3 prepare.py ../data/dataset.csv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Code for the training step of the pipeline:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;(&lt;/span&gt;DVC_tutorial&lt;span class="o"&gt;)&lt;/span&gt;x@y:~/DVC_tutorial/src&lt;span class="nv"&gt;$ &lt;/span&gt;dvc run &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; training &lt;span class="nt"&gt;-d&lt;/span&gt; train.py &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt; ../data/prepared &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; ../assets &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; python3 train.py ../data/dataset.csv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Code for the evaluation step of the pipeline:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;(&lt;/span&gt;DVC_tutorial&lt;span class="o"&gt;)&lt;/span&gt;x@y:~/DVC_tutorial/src&lt;span class="nv"&gt;$ &lt;/span&gt;dvc run &lt;span class="se"&gt;\ &lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; evaluating &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt; evaluate.py &lt;span class="nt"&gt;-d&lt;/span&gt; ../data/prepared &lt;span class="nt"&gt;-d&lt;/span&gt; ../assets/models &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; ../assets/metrics &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; python3 evaluate.py ../data/prepared ../assets/metrics 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note that there is a common pattern for declaration of each step. You define&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a name for the procedure with &lt;code&gt;-n&lt;/code&gt; flag,&lt;/li&gt;
&lt;li&gt;dependencies with &lt;code&gt;-d&lt;/code&gt; flag,&lt;/li&gt;
&lt;li&gt;output location with &lt;code&gt;-o&lt;/code&gt; flag,&lt;/li&gt;
&lt;li&gt;code to execute and its dependencies with &lt;code&gt;python3&lt;/code&gt; command.&lt;/li&gt;
&lt;/ul&gt;
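&lt;p&gt;The mapping from these flags to a dvc.yaml stage entry is mechanical. A hypothetical Python helper makes the pattern explicit (for illustration only; DVC generates the file for you):&lt;/p&gt;

```python
def stage_entry(name, cmd, deps, outs):
    """Build the dvc.yaml fragment that `dvc run -n NAME -d ... -o ... CMD`
    would generate (illustrative helper, not part of DVC)."""
    return {name: {"cmd": cmd, "deps": sorted(deps), "outs": list(outs)}}

# The `prepare` stage from above, expressed through the helper:
prepare = stage_entry(
    name="prepare",
    cmd="python3 prepare.py ../data/dataset.csv",
    deps=["prepare.py", "../data/dataset.csv"],
    outs=["../data/prepared"],
)
```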

&lt;p&gt;After entering the commands above, you get the following dvc.yaml file, which can also be used for modifying the pipeline (remember that you can equally define the pipeline through this file and the corresponding DVC commands).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;(&lt;/span&gt;DVC_tutorial&lt;span class="o"&gt;)&lt;/span&gt; x@y:~/DVC_tutorial&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;cat &lt;/span&gt;src/dvc.yaml 
stages:
  prepare:
    cmd: python3 prepare.py ../data/dataset.csv
    deps:
    - ../data/dataset.csv
    - prepare.py
    outs:
    - ../data/prepared
  training:
    cmd: python3 train.py ../data/dataset.csv
    deps:
    - ../data/prepared
    - train.py
    outs:
    - ../assets/models
  evaluating:
    cmd: python3 evaluate.py ../data/prepared ../assets/metrics
    deps:
    - ../assets/models
    - ../data/prepared
    - evaluate.py
    outs:
    - ../assets/metrics
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once you have built the pipeline, you can modify any part and rebuild it very conveniently. Say you wish to change the model used in the &lt;code&gt;train.py&lt;/code&gt; file: initially it was a Random Forest model, but you would like to try an Extra Trees Regressor as well. You only need to modify the corresponding part:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Build the Random Forest  model:
&lt;/span&gt;
&lt;span class="c1"&gt;# model = RandomForestRegressor(
#    n_estimators=150, max_depth=6, random_state=Config.RANDOM_SEED )
&lt;/span&gt;
&lt;span class="c1"&gt;# Build Etra Tree Regression Model (alternative model):
&lt;/span&gt;
&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ExtraTreesRegressor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;155&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After the modification, the only thing you have to do is run the &lt;code&gt;dvc repro&lt;/code&gt; command. It will run the whole procedure for you. DVC is also smart enough to skip the steps that are not affected by the change; in our example, there is no need to re-run the preparation step. &lt;/p&gt;
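&lt;p&gt;This skipping behaviour can be understood with a simplified model: each stage records a fingerprint of its dependencies, and a stage is re-run only when that fingerprint changes. A minimal Python sketch of the idea (not DVC's actual code):&lt;/p&gt;

```python
import hashlib

def fingerprint(dep_contents):
    """Hash all dependency contents into a single fingerprint (simplified)."""
    h = hashlib.md5()
    for content in dep_contents:
        h.update(content.encode())
    return h.hexdigest()

def run_stage(name, dep_contents, lock, action):
    """Run `action` only if the stage's dependencies changed since last time."""
    fp = fingerprint(dep_contents)
    if lock.get(name) == fp:
        return f"Stage '{name}' didn't change, skipping"
    lock[name] = fp
    action()
    return f"Running stage '{name}'"

lock, runs = {}, []
# First `repro`: every stage runs.
run_stage("prepare", ["raw data v1"], lock, lambda: runs.append("prepare"))
run_stage("training", ["prepared v1", "train.py v1"], lock, lambda: runs.append("train"))
# Only train.py changed: `prepare` is skipped, `training` re-runs.
skip_msg = run_stage("prepare", ["raw data v1"], lock, lambda: runs.append("prepare"))
run_stage("training", ["prepared v1", "train.py v2"], lock, lambda: runs.append("train"))
```

&lt;p&gt;Real DVC keeps these fingerprints in the dvc.lock file, which is why committing that file to git matters.&lt;/p&gt;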

&lt;p&gt;With the given code and config file, the performance metric of each run is stored in &lt;code&gt;assets/metrics/metrics.json&lt;/code&gt; file.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Conclusion
&lt;/h2&gt;

&lt;p&gt;This article has presented how DVC makes iterating over your workflow convenient. The tutorial focused on versioning the dataset and the pipeline (the repository is &lt;a href="https://github.com/CemKeskin84/DVC_basics_tutorial"&gt;here&lt;/a&gt;). However, DVC offers more tools for hyper-parameter tuning, plotting and experiment management, which will be the subject of an upcoming post. &lt;/p&gt;

</description>
      <category>datascience</category>
      <category>dvc</category>
      <category>git</category>
      <category>python</category>
    </item>
  </channel>
</rss>
