<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Cem Keskin</title>
    <description>The latest articles on DEV Community by Cem Keskin (@cemkeskin84).</description>
    <link>https://dev.to/cemkeskin84</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F808708%2F601ee1d5-ce2d-44c5-9c9c-43cb05e10c3b.png</url>
      <title>DEV Community: Cem Keskin</title>
      <link>https://dev.to/cemkeskin84</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/cemkeskin84"/>
    <language>en</language>
    <item>
      <title>Using dbt for Transformation Tasks on BigQuery</title>
      <dc:creator>Cem Keskin</dc:creator>
      <pubDate>Mon, 02 May 2022 19:36:47 +0000</pubDate>
      <link>https://dev.to/cemkeskin84/using-dbt-for-transformation-tasks-on-bigquery-3p1g</link>
      <guid>https://dev.to/cemkeskin84/using-dbt-for-transformation-tasks-on-bigquery-3p1g</guid>
      <description>&lt;h2&gt;
  
  
  Introduction: What Is dbt?
&lt;/h2&gt;

&lt;p&gt;Two common approaches to enable the flow of big data are &lt;a href="https://www.ibm.com/cloud/learn/elt"&gt;ELT (extract, load, transform)&lt;/a&gt; and &lt;a href="https://www.ibm.com/cloud/learn/etl"&gt;ETL (extract, transform, load)&lt;/a&gt;. Both start with unstructured data, and despite the slight difference in naming, they lead to distinct data engineering practices:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ELT prioritizes loading and postpones transformation. It handles only basic pre-processing, such as removing duplicate records or filling missing values, before serving the data to the team responsible for transforming it.&lt;/li&gt;
&lt;li&gt;ETL focuses on transformation before delivering to target systems/teams. Hence, it goes beyond the pre-processing of ELT: the data is structured, cleaned and type-converted before loading.&lt;/li&gt;
&lt;/ul&gt;
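The difference in ordering can be sketched in a few lines of Python; the function names and the toy "transform" step below are hypothetical, meant only to make the call order visible:

```python
# Hypothetical sketch: the same three stages composed in ELT vs ETL order.
def extract(log):
    log.append("extract")
    return ["  Raw ", None, "DATA"]    # messy source records

def transform(records, log):
    log.append("transform")
    return [r.strip().lower() for r in records if r is not None]

def load(records, log):
    log.append("load")
    return list(records)               # stand-in for landing in a warehouse

def elt():
    log = []
    records = extract(log)
    records = load(records, log)       # load the raw data first...
    transform(records, log)            # ...transform later, in the warehouse
    return log

def etl():
    log = []
    records = extract(log)
    records = transform(records, log)  # transform before delivery...
    load(records, log)                 # ...then load to the target system
    return log
```

Either way the same stages run; what changes is where the heavy transformation happens, and for ELT that place is the warehouse, which is exactly where dbt operates.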

&lt;p&gt;dbt is a tool to conduct transformation (“T”) practices on data warehouses for &lt;strong&gt;ELT&lt;/strong&gt;. As the name suggests, it covers the operations after the data is extracted and loaded. In other words, once you have “landed” your big data on a data warehouse, dbt can help you process it before serving it to subsequent consumers. The visualization below shows its role in a data pipeline.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--shRls_Oc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/w24gdekyyovf7khvhcpw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--shRls_Oc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/w24gdekyyovf7khvhcpw.png" alt="transformation with dbt while building a data pipeleine" width="800" height="293"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Tutorial for BigQuery Transformations
&lt;/h2&gt;

&lt;p&gt;dbt is originally a command-line tool, but it now also offers a cloud service (&lt;a href="http://www.getdbt.com/"&gt;www.getdbt.com&lt;/a&gt;) that makes the initial steps more convenient for newcomers. In this short tutorial, the dbt cloud service is used to conduct some basic transformation tasks on data uploaded from a public dataset (the upload process is explained in a previous post) as a part of the Data Engineering Zoomcamp (by DataTalks.club) Capstone Project.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step-1: Initiate a project on dbt cloud&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The process starts with creating an account and a project on dbt cloud, which is free for individuals:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--7zVTjZMR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6p2oo098yao6n71zw956.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--7zVTjZMR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6p2oo098yao6n71zw956.png" alt="starting a dbt project" width="796" height="466"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--jz2AuQcG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/tm1amyxhq3jyoaju2zub.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--jz2AuQcG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/tm1amyxhq3jyoaju2zub.png" alt="starting a dbt project" width="800" height="415"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step-2: Connect to BigQuery&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Clicking on “Create Project” is followed by simple questions to declare the project name and the data warehouse to integrate. Selecting BigQuery as the data warehouse, you will land on a page to provide GCP service account information. (Note that the account has to have BigQuery credentials.)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--53Dv8slS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ivpdyhnn2thxjndbaf17.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--53Dv8slS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ivpdyhnn2thxjndbaf17.png" alt="dbt integration with BigQuery" width="800" height="773"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here, downloading the service account information as a JSON file from GCP and uploading it to dbt helps prevent errors. After testing the authorization, you can continue by choosing the repository to host the dbt project. It is possible to host it on a “dbt Cloud managed repository” or on another repo of your choice. Completing this step, you will have a project ready to initialize on dbt cloud. Clicking the “initialize your project” button gives you a fresh project template:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--8uiwhCQM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/inhv71qaqfaujen4284m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--8uiwhCQM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/inhv71qaqfaujen4284m.png" alt="Image description" width="498" height="474"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step-3: Identify the required components and configurations&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The core elements of a dbt project are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;dbt_project.yml&lt;/strong&gt; file that configures the project,&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;models&lt;/strong&gt; folder to host the models to be run for the proposed transformations,&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;macros&lt;/strong&gt; folder to hold files that declare reusable SQL queries in Jinja format,&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;seeds&lt;/strong&gt; folder to host CSV files with declarations about the data on the data warehouse, such as zip code &amp;amp; city name or employee ID &amp;amp; personal data mappings. Note that these are not the data itself but references needed to use it properly.&lt;/li&gt;
&lt;/ul&gt;
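As a quick illustration, the skeleton above can be mirrored with a few lines of Python. Note that dbt generates this layout itself when a project is initialized, so this snippet (including the project name in it) is purely a hypothetical sketch of the structure:

```python
# Hypothetical sketch of the core dbt project layout listed above;
# dbt Cloud's "initialize your project" creates this for you.
from pathlib import Path
import tempfile

def scaffold_dbt_project(root):
    # the configuration file plus the three core folders
    (root / "dbt_project.yml").write_text("name: pv_project\nversion: '1.0'\n")
    for folder in ("models", "macros", "seeds"):
        (root / folder).mkdir(parents=True, exist_ok=True)
    return sorted(entry.name for entry in root.iterdir())

entries = scaffold_dbt_project(Path(tempfile.mkdtemp()))
```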

&lt;p&gt;&lt;strong&gt;Step-4: Define the model&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In this introductory tutorial, we will only use models. The task is to unify all the daily data a PV system produced during a year, that is, to unify 365 files. The source of the data and how it was uploaded to BigQuery were explained in &lt;a href="https://dev.to/cemkeskin84/how-to-use-apache-airflow-to-get-1000-files-from-a-public-dataset-1mnd"&gt;a previous post&lt;/a&gt;. The contents of the tables can be unified with a &lt;strong&gt;UNION ALL&lt;/strong&gt; query as below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Select columns of interest
SELECT measured_on, system_id,
    ac_power__5074 as ac_power,
    ambient_temp__5042 as ambient_temp,
    kwh_net__5046 as kwh_net,
    module_temp__5043 as module_temp,
    poa_irradiance__5041 as poa_irradiance,
    pr__5047 as pr

-- Declare one of the tables to be combined
FROM `project_name.dataset_name.table_name`

UNION ALL

SELECT ...
    ...
    ...
    ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;However, writing 365 such blocks by hand is not convenient, of course. Hence, it is possible to take advantage of a property of BigQuery: it helps you synthesize long queries with an obviously repetitive pattern. For the case study of this tutorial, the 365-part UNION ALL query for a year was produced by running the following query on BigQuery:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT string_agg(
    concat(
        "SELECT measured_on, system_id, ",
        "ac_power__5069 as ac_power, ",
        "ambient_temp__5062 as ambient_temp, ",
        "kwh_net__5066 as kwh_net, ",
        "module_temp__5063 as module_temp, ",
        "poa_irradiance__5061 as poa_irradiance, ",
        "pr__5067 as pr ",
        "FROM `YOUR-BQ-PROJECT-NAME.pvsys1433.", table_id, "`"
    ), " UNION ALL \n")

FROM `YOUR-BQ-PROJECT-NAME.pvsys1433.__TABLES__`;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
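The same synthesis can also be done outside BigQuery. Below is a minimal Python sketch of the idea; the helper name and the day_### table names are hypothetical, while the column list matches the first query above:

```python
# Plain-Python alternative to the string_agg trick: stitch one SELECT
# per daily table into a single UNION ALL query.
COLUMNS = (
    "measured_on, system_id, "
    "ac_power__5074 as ac_power, "
    "ambient_temp__5042 as ambient_temp, "
    "kwh_net__5046 as kwh_net, "
    "module_temp__5043 as module_temp, "
    "poa_irradiance__5041 as poa_irradiance, "
    "pr__5047 as pr"
)

def union_all_query(project, dataset, tables):
    # one SELECT per table, joined with UNION ALL
    selects = [
        f"SELECT {COLUMNS} FROM `{project}.{dataset}.{table}`"
        for table in tables
    ]
    return "\nUNION ALL\n".join(selects)

# hypothetical project/dataset/table names, for illustration only
query = union_all_query("my-project", "pvsys1433",
                        ["day_001", "day_002", "day_003"])
```

For 365 tables you would pass the full table list (for example, fetched from the `__TABLES__` view) instead of the three placeholders.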



&lt;p&gt;Then, you can simply copy-paste the long query into a model file created in the models/staging folder of your project on dbt cloud. Running the model with the &lt;code&gt;dbt run --select your_model_name&lt;/code&gt; command, you will receive a new table, named after your model file, in the dataset of your corresponding BigQuery project.&lt;/p&gt;

</description>
      <category>dezoomcamp</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>How to Use Apache Airflow to Get 1000+ Files From a Public Dataset</title>
      <dc:creator>Cem Keskin</dc:creator>
      <pubDate>Sun, 24 Apr 2022 21:30:40 +0000</pubDate>
      <link>https://dev.to/cemkeskin84/how-to-use-apache-airflow-to-get-1000-files-from-a-public-dataset-1mnd</link>
      <guid>https://dev.to/cemkeskin84/how-to-use-apache-airflow-to-get-1000-files-from-a-public-dataset-1mnd</guid>
<description>&lt;p&gt;Apache Airflow is a platform to manage workflows, which play a crucial role in data-intensive applications. One can define, schedule, monitor and troubleshoot data workflows as code, which makes maintenance, versioning, dependency management and testing more convenient. Initiated by Airbnb, today it is an open-source tool backed by the Apache Software Foundation. &lt;/p&gt;

&lt;p&gt;Airflow provides robust integrations with major cloud platforms (including GCP, AWS, MS Azure, etc.) as well as local resources. Moreover, it is written in Python, which is also the language used for creating workflows. Accordingly, and not surprisingly, it is a well-accepted solution in the industry for applications at different scales. It is also important to note that while Airflow lets you manage workflows (data pipelines) dynamically, the workflows themselves are expected to be -almost- static. It is definitely not for streaming.  &lt;/p&gt;

&lt;h1&gt;
  
  
  1. The Basic Architecture and Terminology
&lt;/h1&gt;

&lt;p&gt;Task and Directed Acyclic Graph (DAG) are two fundamental concepts to understand how to use Airflow. &lt;/p&gt;

&lt;p&gt;A &lt;strong&gt;task&lt;/strong&gt; is an atomized and standalone piece of work (an action). Airflow helps you define, run and monitor tasks in Python3, bash scripts, etc. A task can be any operation on or with data, such as transfer, analysis or storage. Tasks are defined using code templates called operators, and the building block of all operators is the &lt;code&gt;BaseOperator&lt;/code&gt;. Generic operators are used for the variety of tasks that build up DAGs. Moreover, there are specialized versions of operators. One of them is the sensor, which observes a specific point of a DAG, waiting for a specific event to happen. Tasks with unique functionalities are defined with the &lt;code&gt;@task&lt;/code&gt; decorator, which is handled by the TaskFlow API. The code snippet given below shows the basic structure of defining tasks within DAG files, with examples for BashOperator and PythonOperator.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;airflow.operators.bash&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BashOperator&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;airflow.operators.python&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;PythonOperator&lt;/span&gt;

&lt;span class="c1"&gt;# ... We see complete DAG files below.
# Here is just an example for how to define tasks. 
&lt;/span&gt;
&lt;span class="n"&gt;my_Bash_task&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;BashOperator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"Bash_task_for_XX"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;bash_command&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"....bash....command....."&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;my_Python_task&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;PythonOperator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"Python_task_for_YY"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;python_callable&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;a_predefined_Python_function&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;op_kwargs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="s"&gt;"src_file"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"address_to_source_file"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A &lt;strong&gt;DAG&lt;/strong&gt; represents the interdependence among tasks (see Figure 1, &lt;a href="https://airflow.apache.org/"&gt;Source&lt;/a&gt;). Nodes of a DAG are individual tasks, whereas the edges correspond to data transitions between two tasks. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ryWiSklq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/z9f0zy559ont13cx9e2l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ryWiSklq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/z9f0zy559ont13cx9e2l.png" alt="A basic DAG example" width="800" height="158"&gt;&lt;/a&gt; &lt;/p&gt;
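The "directed acyclic" idea can be illustrated without Airflow at all: model tasks as nodes and their dependencies as edges, and any topological order of the graph is a valid run order. A minimal sketch with hypothetical task names, using only the Python standard library:

```python
# Tasks as nodes, "depends on" relations as edges; a valid execution
# order is any topological sort of this graph.
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# each task maps to the set of tasks it depends on (hypothetical names)
dag = {
    "ingest": set(),
    "clean": {"ingest"},
    "analyze": {"clean"},
    "store": {"clean"},
    "report": {"analyze", "store"},
}

# every task appears after all of its dependencies
run_order = list(TopologicalSorter(dag).static_order())
```

This is essentially what the scheduler does for real DAGs: it only triggers a task once everything it depends on has completed.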

&lt;p&gt;Airflow helps you link tasks to compose DAGs for controlling flow. To do so, it brings together a variety of services as represented in Figure 2 (&lt;a href="https://airflow.apache.org/"&gt;Source&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--iVmKqkP1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8dma0qkvs324qudtumyt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--iVmKqkP1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8dma0qkvs324qudtumyt.png" alt="Components and Architecture of Apache Airflow" width="744" height="484"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the architecture of Apache Airflow,&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Workers&lt;/strong&gt; are the components in which tasks are run, in line with the commands received from the executor.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scheduler&lt;/strong&gt; follows the dependencies defined for tasks and DAGs. Once these are met, the scheduler triggers the tasks in accordance with the given timing policy.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Executor&lt;/strong&gt; runs tasks either inside the scheduler or by pushing them to the corresponding workers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DAG Directory&lt;/strong&gt; is the folder in which the &lt;code&gt;.py&lt;/code&gt; files for each DAG live.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Metadata Database&lt;/strong&gt; stores the state of the scheduler, executor and webserver.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;User interface&lt;/strong&gt; helps users control and follow workflows on an intuitive graphical screen and reach outputs of the system (such as logs) easily.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Webserver&lt;/strong&gt; links the system with user interface for remote control with interactive GUI.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once tasks and DAGs are defined and the system is activated (usually within containers), users get a screen as shown below (&lt;a href="https://airflow.apache.org/"&gt;Source&lt;/a&gt;) where the workflows can be followed.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--FG0alNTg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/su2og8dxh2plld384wf9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--FG0alNTg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/su2og8dxh2plld384wf9.png" alt="An example of Apache Airflow web interface" width="800" height="466"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Having reviewed the basic architecture and terminology, let’s see them in action.&lt;/p&gt;

&lt;h1&gt;
  
  
  2. Installing Airflow
&lt;/h1&gt;

&lt;p&gt;Airflow is a highly configurable tool. Accordingly, its installation can be customized to the requirements of each specific application. Moreover, it is a common practice to host it in a container to isolate it from system interactions and dependency conflicts. The brief guide presented here is based on the &lt;a href="https://airflow.apache.org/docs/apache-airflow/stable/start/docker.html"&gt;official guide&lt;/a&gt;, and the showcase was originally presented by DataTalks Club during the &lt;a href="https://github.com/DataTalksClub/data-engineering-zoomcamp/tree/main/week_2_data_ingestion/airflow"&gt;Data Engineering Zoomcamp (2022 cohort)&lt;/a&gt;. The code base and the configuration files used in this tutorial are available &lt;a href="https://github.com/CemKeskin84/DataEng_Zoomcamp/tree/main/dez-project/airflow"&gt;here&lt;/a&gt;. The case in this tutorial aims to &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;get a large number of files (1000+) from a public dataset (&lt;a href="https://data.openei.org/s3_viewer?bucket=oedi-data-lake&amp;amp;limit=100&amp;amp;prefix=pvdaq%2Fcsv%2F"&gt;OEDI photovoltaic systems dataset&lt;/a&gt;) to the local machine in &lt;code&gt;.csv&lt;/code&gt; format,&lt;/li&gt;
&lt;li&gt;convert them to &lt;code&gt;.parquet&lt;/code&gt; format for more effective computation on cloud in following steps,&lt;/li&gt;
&lt;li&gt;upload data to a bucket on Google Cloud Platform (GCP),&lt;/li&gt;
&lt;li&gt;transfer them from bucket to BigQuery for further analysis.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Parquet is a free and open-source columnar storage format backed by the Apache Software Foundation. It allows efficient compression and encoding within the Hadoop ecosystem, independent of frameworks or programming languages. Hence, it is a common format for public databases, and the OEDI PV dataset actually also has a version in &lt;code&gt;.parquet&lt;/code&gt; format. However, in line with the corresponding &lt;a href="https://github.com/DataTalksClub/data-engineering-zoomcamp/tree/main/week_2_data_ingestion/airflow"&gt;DataTalks Club tutorials&lt;/a&gt; and &lt;a href="https://www.youtube.com/watch?v=lqDMzReAtrw&amp;amp;list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&amp;amp;index=18"&gt;videos&lt;/a&gt;, the conversion process is included here to present a variety of operations. Otherwise, it would be possible to directly transfer &lt;code&gt;.parquet&lt;/code&gt; files to GCP using Airflow or any other tool.&lt;/p&gt;
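To see what "columnar" means in practice, here is a toy, stdlib-only sketch of turning row-oriented records into a column-oriented layout; real conversions would use pandas/pyarrow, and the field names below simply reuse two of the PV dataset columns:

```python
# Toy illustration of row-oriented vs columnar layout (what Parquet
# stores); the records are made up for the example.
rows = [
    {"measured_on": "2019-01-01", "ac_power": 1.2},
    {"measured_on": "2019-01-02", "ac_power": 1.5},
    {"measured_on": "2019-01-03", "ac_power": 1.1},
]

def to_columnar(rows):
    # one contiguous list per column: values of the same type sit together,
    # which is what makes columnar compression and encoding effective
    return {name: [row[name] for row in rows] for name in rows[0]}

columns = to_columnar(rows)
```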

&lt;h2&gt;
  
  
  2.1 Containerization
&lt;/h2&gt;

&lt;p&gt;Following best practices, the installation of Airflow will be containerized. Accordingly, the &lt;code&gt;Dockerfile&lt;/code&gt; and the &lt;code&gt;docker-compose.yaml&lt;/code&gt; files have a crucial role. The Airflow documentation presents a typical docker-compose file to make life easier for newcomers. It uses the official Airflow image: &lt;code&gt;apache/airflow:2.2.3&lt;/code&gt;. Hence, the &lt;a href="https://github.com/DataTalksClub/data-engineering-zoomcamp/blob/main/week_2_data_ingestion/airflow/Dockerfile"&gt;Dockerfile developed by DataTalks&lt;/a&gt; Club starts from this image, followed by system requirements and settings. Then, the SDK for GCP is downloaded and installed for cloud integrations. The file concludes with &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;setting a home directory for Airflow within container,&lt;/li&gt;
&lt;li&gt;including additional scripts if necessary,&lt;/li&gt;
&lt;li&gt;setting a parameter for the user ID of Airflow.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;(Note: it is a common practice to host the following files and folders within a single folder, preferably named ‘airflow’.)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# First-time build can take up to 10 mins.&lt;/span&gt;

FROM apache/airflow:2.2.3

ENV &lt;span class="nv"&gt;AIRFLOW_HOME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/opt/airflow

USER root
RUN apt-get update &lt;span class="nt"&gt;-qq&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; apt-get &lt;span class="nb"&gt;install &lt;/span&gt;vim &lt;span class="nt"&gt;-qqq&lt;/span&gt;
&lt;span class="c"&gt;# git gcc g++ -qqq&lt;/span&gt;

COPY requirements.txt &lt;span class="nb"&gt;.&lt;/span&gt;
RUN pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--no-cache-dir&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt

&lt;span class="c"&gt;# Ref: https://airflow.apache.org/docs/docker-stack/recipes.html&lt;/span&gt;

SHELL &lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"/bin/bash"&lt;/span&gt;, &lt;span class="s2"&gt;"-o"&lt;/span&gt;, &lt;span class="s2"&gt;"pipefail"&lt;/span&gt;, &lt;span class="s2"&gt;"-e"&lt;/span&gt;, &lt;span class="s2"&gt;"-u"&lt;/span&gt;, &lt;span class="s2"&gt;"-x"&lt;/span&gt;, &lt;span class="s2"&gt;"-c"&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;

ARG &lt;span class="nv"&gt;CLOUD_SDK_VERSION&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;322.0.0
ENV &lt;span class="nv"&gt;GCLOUD_HOME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/home/google-cloud-sdk

ENV &lt;span class="nv"&gt;PATH&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GCLOUD_HOME&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/bin/:&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;PATH&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

RUN &lt;span class="nv"&gt;DOWNLOAD_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"https://dl.google.com/dl/cloudsdk/channels/rapid/downloads/google-cloud-sdk-&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;CLOUD_SDK_VERSION&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;-linux-x86_64.tar.gz"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nv"&gt;TMP_DIR&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;mktemp&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; curl &lt;span class="nt"&gt;-fL&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;DOWNLOAD_URL&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;--output&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;TMP_DIR&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/google-cloud-sdk.tar.gz"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GCLOUD_HOME&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;tar &lt;/span&gt;xzf &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;TMP_DIR&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/google-cloud-sdk.tar.gz"&lt;/span&gt; &lt;span class="nt"&gt;-C&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GCLOUD_HOME&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;--strip-components&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1 &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GCLOUD_HOME&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/install.sh"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
       &lt;span class="nt"&gt;--bash-completion&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;false&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
       &lt;span class="nt"&gt;--path-update&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;false&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
       &lt;span class="nt"&gt;--usage-reporting&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;false&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
       &lt;span class="nt"&gt;--quiet&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;rm&lt;/span&gt; &lt;span class="nt"&gt;-rf&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;TMP_DIR&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; gcloud &lt;span class="nt"&gt;--version&lt;/span&gt;

WORKDIR &lt;span class="nv"&gt;$AIRFLOW_HOME&lt;/span&gt;

COPY scripts scripts
RUN &lt;span class="nb"&gt;chmod&lt;/span&gt; +x scripts

USER &lt;span class="nv"&gt;$AIRFLOW_UID&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;docker-compose.yaml&lt;/code&gt; file suggested by the Airflow documentation should be downloaded next to the Dockerfile:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-LfO&lt;/span&gt; &lt;span class="s1"&gt;'https://airflow.apache.org/docs/apache-airflow/2.2.5/docker-compose.yaml'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It involves a variety of services that Airflow needs (descriptions are from the &lt;a href="https://airflow.apache.org/docs/apache-airflow/stable/start/docker.html#using-custom-images"&gt;official guide&lt;/a&gt;):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;airflow-scheduler&lt;/code&gt; - The &lt;a href="https://airflow.apache.org/docs/apache-airflow/stable/concepts/scheduler.html"&gt;scheduler&lt;/a&gt; monitors all tasks and DAGs, then triggers the task instances once their dependencies are complete.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;airflow-webserver&lt;/code&gt; - The webserver is available at &lt;code&gt;http://localhost:8080&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;airflow-worker&lt;/code&gt; - The worker that executes the tasks given by the scheduler.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;airflow-init&lt;/code&gt; - The initialization service.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;flower&lt;/code&gt; - &lt;a href="https://flower.readthedocs.io/en/latest/"&gt;The flower app&lt;/a&gt; for monitoring the environment. It is available at &lt;code&gt;http://localhost:5555&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;postgres&lt;/code&gt; - The database.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;redis&lt;/code&gt; - &lt;a href="https://redis.io/"&gt;The redis&lt;/a&gt; - broker that forwards messages from scheduler to worker.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The Airflow documentation also suggests creating folders to keep DAGs, log files and plugins outside the container:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; ./dags ./logs ./plugins
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Moreover, a &lt;code&gt;.env&lt;/code&gt; file should be created to declare the user ID to docker-compose:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s2"&gt;"AIRFLOW_UID=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt; &lt;span class="nt"&gt;-u&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; .env
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Having the base image, the next step is to include the GCP-related components in the &lt;code&gt;docker-compose.yaml&lt;/code&gt; and to set the credentials. The &lt;a href="https://github.com/DataTalksClub/data-engineering-zoomcamp/blob/main/week_2_data_ingestion/airflow/docker-compose.yaml"&gt;DataTalks Club template&lt;/a&gt; includes the following lines:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# (line 61 to 66)&lt;/span&gt;
GOOGLE_APPLICATION_CREDENTIALS: /.google/credentials/google_credentials.json
AIRFLOW_CONN_GOOGLE_CLOUD_DEFAULT: &lt;span class="s1"&gt;'google-cloud-platform://?extra__google_cloud_platform__key_path=/.google/credentials/google_credentials.json'&lt;/span&gt;

&lt;span class="c"&gt;# TODO: Please change GCP_PROJECT_ID &amp;amp; GCP_GCS_BUCKET, as per your config&lt;/span&gt;
GCP_PROJECT_ID: &lt;span class="s1"&gt;'YOUR-PROJECT-ID'&lt;/span&gt;
GCP_GCS_BUCKET: &lt;span class="s1"&gt;'YOUR-BUCKET-NAME'&lt;/span&gt;

&lt;span class="c"&gt;# line 72 &amp;gt;&amp;gt; link to the credentials file (at your host machine) for your GCP service account&lt;/span&gt;
- ~/.google/credentials/:/.google/credentials:ro
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Also note that the DataTalks template replaces the &lt;code&gt;image&lt;/code&gt; tag in the original document with a &lt;code&gt;build&lt;/code&gt; of the Dockerfile (lines 47 to 49). The rest of the &lt;code&gt;docker-compose.yaml&lt;/code&gt; file runs to roughly 300 lines and is needless to display here. Please investigate the file downloaded (with the &lt;code&gt;curl&lt;/code&gt; command given above) and visit the &lt;a href="https://github.com/DataTalksClub/data-engineering-zoomcamp/blob/main/week_2_data_ingestion/airflow/docker-compose.yaml"&gt;DataTalks repository&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  2.2 Running the Containers
&lt;/h2&gt;

&lt;p&gt;Once the necessary files and folders are gathered, it’s time to build and start the services with the following commands:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker-compose build  &lt;span class="c"&gt;#would require 10-15 mins&lt;/span&gt;

docker-compose up airflow-init &lt;span class="c"&gt;#requires ~1 min&lt;/span&gt;

docker-compose up &lt;span class="c"&gt;#requires 2-3 mins &lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As mentioned above, Airflow has a webserver that provides an interactive GUI (&lt;code&gt;localhost:8080&lt;/code&gt;) to monitor and control the processes declared by DAGs.&lt;/p&gt;

&lt;h1&gt;
  
  
  3. Composing DAGs to Use OEDI Data
&lt;/h1&gt;

&lt;p&gt;With an Airflow setup up and running, the next step is to compose DAG files to execute the tasks. The complete code for this part can be seen &lt;a href="https://github.com/CemKeskin84/DataEng_Zoomcamp/blob/main/dez-project/airflow/dags/pv_system1430.py"&gt;here&lt;/a&gt;. In this text, only the parts custom to the OEDI PV dataset are reviewed; the rest of the code is in line with the &lt;a href="https://github.com/DataTalksClub/data-engineering-zoomcamp/blob/main/week_2_data_ingestion/airflow/dags/data_ingestion_gcs_dag.py"&gt;DataTalks tutorial&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Like typical Python files, DAG files start with imports followed by declarations. The first part of the declarations involves the environment parameters that refer to the Dockerfile and docker-compose files:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;AIRFLOW_HOME &lt;span class="o"&gt;=&lt;/span&gt; os.environ.get&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"AIRFLOW_HOME"&lt;/span&gt;, &lt;span class="s2"&gt;"/opt/airflow/"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;

PROJECT_ID &lt;span class="o"&gt;=&lt;/span&gt; os.environ.get&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"GCP_PROJECT_ID"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
BUCKET &lt;span class="o"&gt;=&lt;/span&gt; os.environ.get&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"GCP_GCS_BUCKET"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
BIGQUERY_DATASET &lt;span class="o"&gt;=&lt;/span&gt; os.environ.get&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"BIGQUERY_DATASET"&lt;/span&gt;, pv_system_label&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The rest of the declarations relate to the OEDI data lake, which is hosted on AWS S3 buckets. An example URL for the files:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;[&lt;/span&gt;https://oedi-data-lake.s3.amazonaws.com/pvdaq/csv/pvdata/system_id&lt;span class="o"&gt;=&lt;/span&gt;1199/year&lt;span class="o"&gt;=&lt;/span&gt;2011/month&lt;span class="o"&gt;=&lt;/span&gt;1/day&lt;span class="o"&gt;=&lt;/span&gt;1/system_1199__date_2011_01_01.csv]&lt;span class="o"&gt;(&lt;/span&gt;https://oedi-data-lake.s3.amazonaws.com/pvdaq/csv/pvdata/system_id&lt;span class="o"&gt;=&lt;/span&gt;1199/year&lt;span class="o"&gt;=&lt;/span&gt;2011/month&lt;span class="o"&gt;=&lt;/span&gt;1/day&lt;span class="o"&gt;=&lt;/span&gt;1/system_1199__date_2011_01_01.csv&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It can be separated into the following components:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;URL Core for PV dataset: &lt;a href="https://oedi-data-lake.s3.amazonaws.com/pvdaq/csv/pvdata/system_id=1199/year=2011/month=1/day=1/system_1199__date_2011_01_01.csv"&gt;&lt;code&gt;https://oedi-data-lake.s3.amazonaws.com/pvdaq/csv/pvdata/&lt;/code&gt;&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;System declaration with a numeric code: &lt;a href="https://oedi-data-lake.s3.amazonaws.com/pvdaq/csv/pvdata/system_id=1199/year=2011/month=1/day=1/system_1199__date_2011_01_01.csv"&gt;&lt;code&gt;system_id=1199/&lt;/code&gt;&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Year declaration: &lt;a href="https://oedi-data-lake.s3.amazonaws.com/pvdaq/csv/pvdata/system_id=1199/year=2011/month=1/day=1/system_1199__date_2011_01_01.csv"&gt;&lt;code&gt;year=2011/&lt;/code&gt;&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Month declaration: &lt;a href="https://oedi-data-lake.s3.amazonaws.com/pvdaq/csv/pvdata/system_id=1199/year=2011/month=1/day=1/system_1199__date_2011_01_01.csv"&gt;&lt;code&gt;month=1/&lt;/code&gt;&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Day declaration: &lt;a href="https://oedi-data-lake.s3.amazonaws.com/pvdaq/csv/pvdata/system_id=1199/year=2011/month=1/day=1/system_1199__date_2011_01_01.csv"&gt;&lt;code&gt;day=1/&lt;/code&gt;&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;File name declaration: &lt;a href="https://oedi-data-lake.s3.amazonaws.com/pvdaq/csv/pvdata/system_id=1199/year=2011/month=1/day=1/system_1199__date_2011_01_01.csv"&gt;&lt;code&gt;system_1199__date_2011_01_01.csv&lt;/code&gt;&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
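&lt;p&gt;To make the pattern concrete, the components above can be reassembled in plain Python (a sketch using &lt;code&gt;strftime&lt;/code&gt; directly, in place of Airflow's templated &lt;code&gt;execution_date&lt;/code&gt;):&lt;/p&gt;

```python
from datetime import datetime

# Rebuild the example OEDI URL from its components for system 1199
# on 2011-01-01, mirroring the templated URL used in the DAG.
URL_CORE = 'https://oedi-data-lake.s3.amazonaws.com/pvdaq/csv/pvdata/'
system_id = '1199'
day = datetime(2011, 1, 1)

url = (
    URL_CORE
    + f'system_id={system_id}/'
    + f'year={day.strftime("%Y")}/'
    + f'month={day.strftime("%-m")}/'   # %-m: month without zero padding
    + f'day={day.strftime("%-e")}/'     # %-e: day of month without padding
    + f'system_{system_id}__date_{day.strftime("%Y_%m_%d")}.csv'
)
print(url)
```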

&lt;p&gt;Hence, we need a parameter to define the system ID, plus independent year, month and day parameters to target specific files. The ID is just a string:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pv_system_ID &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'1430'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To manipulate the date parameters, we can embed Python code within a Jinja template:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;{{ execution_date.strftime('%Y') }}&lt;/code&gt;&lt;/p&gt;
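&lt;p&gt;Under the hood this is just Python's &lt;code&gt;strftime&lt;/code&gt;. A quick sketch of the format codes involved (note that the unpadded variants &lt;code&gt;%-m&lt;/code&gt; and &lt;code&gt;%-e&lt;/code&gt; are glibc extensions, so this behavior is Linux-specific):&lt;/p&gt;

```python
from datetime import datetime

# The Jinja expressions in the DAG call strftime on execution_date;
# the dash modifier strips zero padding, which the OEDI paths need
# (month=1 rather than month=01).
d = datetime(2011, 1, 1)
print(d.strftime('%Y'))   # year
print(d.strftime('%m'))   # zero-padded month
print(d.strftime('%-m'))  # month without padding (glibc extension)
print(d.strftime('%-e'))  # day of month without padding (glibc extension)
```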

&lt;p&gt;Hence, we can declare the parametrized URL as:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;URL_PREFIX&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'https://oedi-data-lake.s3.amazonaws.com/ \
            pvdaq/csv/pvdata/ \
            system_id='&lt;/span&gt;+pv_system_ID+ &lt;span class="se"&gt;\&lt;/span&gt;
            &lt;span class="s1"&gt;'/year='&lt;/span&gt;+&lt;span class="s1"&gt;'{{ execution_date.strftime(\'&lt;/span&gt;%Y&lt;span class="se"&gt;\'&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;}}&lt;/span&gt;&lt;span class="s1"&gt;'+\
            '&lt;/span&gt;/month&lt;span class="o"&gt;={{&lt;/span&gt; execution_date.strftime&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="se"&gt;\'&lt;/span&gt;%-m&lt;span class="se"&gt;\'&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;}}&lt;/span&gt;&lt;span class="s1"&gt;' +\
            '&lt;/span&gt;/day&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'+'&lt;/span&gt;&lt;span class="o"&gt;{{&lt;/span&gt; execution_date.strftime&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="se"&gt;\'&lt;/span&gt;%-e&lt;span class="se"&gt;\'&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;}}&lt;/span&gt;&lt;span class="s1"&gt;'

URL_TEMPLATE= URL_PREFIX +\
            '&lt;/span&gt;/system_&lt;span class="s1"&gt;'+pv_system_ID+'&lt;/span&gt;__date_&lt;span class="s1"&gt;'+\
            '&lt;/span&gt;&lt;span class="o"&gt;{{&lt;/span&gt; execution_date.strftime&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="se"&gt;\'&lt;/span&gt;%Y&lt;span class="se"&gt;\'&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;}}&lt;/span&gt;&lt;span class="s1"&gt;'+\
            '&lt;/span&gt;_&lt;span class="o"&gt;{{&lt;/span&gt; execution_date.strftime&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="se"&gt;\'&lt;/span&gt;%m&lt;span class="se"&gt;\'&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;}}&lt;/span&gt;_&lt;span class="s1"&gt;'+\
            '&lt;/span&gt;&lt;span class="o"&gt;{{&lt;/span&gt; execution_date.strftime&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="se"&gt;\'&lt;/span&gt;%d&lt;span class="se"&gt;\'&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;}}&lt;/span&gt;&lt;span class="s1"&gt;'+'&lt;/span&gt;.csv&lt;span class="s1"&gt;'
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In a similar manner, it is useful to rename downloaded files before conversion:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;OUTPUT_FILE_TEMPLATE &lt;span class="o"&gt;=&lt;/span&gt; AIRFLOW_HOME + &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="s1"&gt;'/pvsys'&lt;/span&gt;+pv_system_ID+&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="s1"&gt;'data_{{ execution_date.strftime(\'&lt;/span&gt;%Y&lt;span class="se"&gt;\'&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;}}&lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="o"&gt;{{&lt;/span&gt; execution_date.strftime&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="se"&gt;\'&lt;/span&gt;%m&lt;span class="se"&gt;\'&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;}}&lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="o"&gt;{{&lt;/span&gt; execution_date.strftime&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="se"&gt;\'&lt;/span&gt;%d&lt;span class="se"&gt;\'&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;}}&lt;/span&gt;.csv&lt;span class="s1"&gt;'

parquet_file = OUTPUT_FILE_TEMPLATE.\
replace('&lt;/span&gt;.csv&lt;span class="s1"&gt;', '&lt;/span&gt;.parquet&lt;span class="s1"&gt;')
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The DAG code continues with two function definitions (namely &lt;code&gt;format_to_parquet&lt;/code&gt; and &lt;code&gt;upload_to_gcs&lt;/code&gt;, &lt;a href="https://github.com/DataTalksClub/data-engineering-zoomcamp/blob/main/week_2_data_ingestion/airflow/dags/data_ingestion_gcs_dag.py"&gt;defined by DataTalks Club&lt;/a&gt;) to be used in the operators. In line with the 4 tasks given at the beginning of Section 2, the DAG involves 4 operators (remember the definition and role of operators given in Section 1).&lt;/p&gt;
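&lt;p&gt;For reference, a minimal sketch of what such a &lt;code&gt;format_to_parquet&lt;/code&gt; callable can look like (the actual implementation lives in the DataTalks repository; the &lt;code&gt;pyarrow&lt;/code&gt; calls below are one common approach and are assumed to be available in the Airflow image):&lt;/p&gt;

```python
def format_to_parquet(src_file: str) -> str:
    """Convert a local CSV file to Parquet and return the new path.

    Sketch of the helper wired into the PythonOperator; assumes
    pyarrow is installed in the worker environment.
    """
    if not src_file.endswith('.csv'):
        raise ValueError('Can only accept source files in CSV format')
    dst_file = src_file.replace('.csv', '.parquet')
    # Lazy imports so the extension guard works even without pyarrow.
    import pyarrow.csv as pv
    import pyarrow.parquet as pq
    pq.write_table(pv.read_csv(src_file), dst_file)
    return dst_file
```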

&lt;p&gt;The first operator gets the data from the corresponding link by using the &lt;code&gt;curl&lt;/code&gt; command with the URL and output templates declared above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;download_task &lt;span class="o"&gt;=&lt;/span&gt; BashOperator&lt;span class="o"&gt;(&lt;/span&gt;
        &lt;span class="nv"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'get_data'&lt;/span&gt;,
        &lt;span class="nv"&gt;bash_command&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;f&lt;span class="s1"&gt;'curl -sSL {URL_TEMPLATE} &amp;gt; {OUTPUT_FILE_TEMPLATE}'&lt;/span&gt;
    &lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The second operator converts the &lt;code&gt;.csv&lt;/code&gt; file to a &lt;code&gt;.parquet&lt;/code&gt; file using the &lt;code&gt;format_to_parquet&lt;/code&gt; function:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;convert_task &lt;span class="o"&gt;=&lt;/span&gt; PythonOperator&lt;span class="o"&gt;(&lt;/span&gt;
        &lt;span class="nv"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"convert_csv_to_parquet"&lt;/span&gt;,
        &lt;span class="nv"&gt;python_callable&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;format_to_parquet,
        &lt;span class="nv"&gt;op_kwargs&lt;/span&gt;&lt;span class="o"&gt;={&lt;/span&gt;
            &lt;span class="s2"&gt;"src_file"&lt;/span&gt;: OUTPUT_FILE_TEMPLATE,
        &lt;span class="o"&gt;}&lt;/span&gt;,
    &lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The third operator sends the converted file to the GCS bucket using the &lt;code&gt;upload_to_gcs&lt;/code&gt; function with parametrized system and file names:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;local_to_gcs_task &lt;span class="o"&gt;=&lt;/span&gt; PythonOperator&lt;span class="o"&gt;(&lt;/span&gt;
        &lt;span class="nv"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"local_to_gcs_task"&lt;/span&gt;,
        &lt;span class="nv"&gt;python_callable&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;upload_to_gcs,
        &lt;span class="nv"&gt;op_kwargs&lt;/span&gt;&lt;span class="o"&gt;={&lt;/span&gt;
            &lt;span class="s2"&gt;"bucket"&lt;/span&gt;: BUCKET,
            &lt;span class="s2"&gt;"object_name"&lt;/span&gt;: f&lt;span class="s2"&gt;"{pv_system_label}/{parquet_file_name}"&lt;/span&gt;,
            &lt;span class="s2"&gt;"local_file"&lt;/span&gt;: f&lt;span class="s2"&gt;"{parquet_file}"&lt;/span&gt;,
        &lt;span class="o"&gt;}&lt;/span&gt;,
    &lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The last operator transfers the files from the bucket to BigQuery with an operator specifically defined for this task:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;bigquery_external_table_task &lt;span class="o"&gt;=&lt;/span&gt; BigQueryCreateExternalTableOperator&lt;span class="o"&gt;(&lt;/span&gt;
        &lt;span class="nv"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"bigquery_external_table_task"&lt;/span&gt;,
        &lt;span class="nv"&gt;table_resource&lt;/span&gt;&lt;span class="o"&gt;={&lt;/span&gt;
            &lt;span class="s2"&gt;"tableReference"&lt;/span&gt;: &lt;span class="o"&gt;{&lt;/span&gt;
                &lt;span class="s2"&gt;"projectId"&lt;/span&gt;: PROJECT_ID,
                &lt;span class="s2"&gt;"datasetId"&lt;/span&gt;: BIGQUERY_DATASET,
                &lt;span class="s2"&gt;"tableId"&lt;/span&gt;: pv_system_label+&lt;span class="s2"&gt;"_"&lt;/span&gt;+&lt;span class="s2"&gt;"{{execution_date.strftime(&lt;/span&gt;&lt;span class="se"&gt;\'&lt;/span&gt;&lt;span class="s2"&gt;%d&lt;/span&gt;&lt;span class="se"&gt;\'&lt;/span&gt;&lt;span class="s2"&gt;) }}{{execution_date.strftime(&lt;/span&gt;&lt;span class="se"&gt;\'&lt;/span&gt;&lt;span class="s2"&gt;%m&lt;/span&gt;&lt;span class="se"&gt;\'&lt;/span&gt;&lt;span class="s2"&gt;) }}{{execution_date.strftime(&lt;/span&gt;&lt;span class="se"&gt;\'&lt;/span&gt;&lt;span class="s2"&gt;%Y&lt;/span&gt;&lt;span class="se"&gt;\'&lt;/span&gt;&lt;span class="s2"&gt;) }}"&lt;/span&gt;,
            &lt;span class="o"&gt;}&lt;/span&gt;,
            &lt;span class="s2"&gt;"externalDataConfiguration"&lt;/span&gt;: &lt;span class="o"&gt;{&lt;/span&gt;
                &lt;span class="s2"&gt;"sourceFormat"&lt;/span&gt;: &lt;span class="s2"&gt;"PARQUET"&lt;/span&gt;,
                &lt;span class="s2"&gt;"sourceUris"&lt;/span&gt;: &lt;span class="o"&gt;[&lt;/span&gt;f&lt;span class="s2"&gt;"gs://{BUCKET}/{pv_system_label}/{parquet_file_name}"&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;,
            &lt;span class="o"&gt;}&lt;/span&gt;,
        &lt;span class="o"&gt;}&lt;/span&gt;,
    &lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The last step is to ‘chain’ all these operators to build the ‘tree’ of tasks:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;download_task &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; convert_task &lt;span class="se"&gt;\&lt;/span&gt;
        &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; local_to_gcs_task &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; bigquery_external_table_task
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Initiating Airflow with such a DAG definition for 4 PV systems (IDs 1430 to 1433), one should get the following graph:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--3qhAYTzN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vg2bln9hyjoixpun5vts.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--3qhAYTzN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vg2bln9hyjoixpun5vts.png" alt="Image description" width="800" height="296"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After triggering a DAG, the following tree visualization appears:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--BDL2FBxF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ifbdboa9m06afkzvykva.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--BDL2FBxF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ifbdboa9m06afkzvykva.png" alt="Image description" width="800" height="326"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This also visualizes how the DAG works. Referring to the date declarations given in the DAG initiation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;with DAG&lt;span class="o"&gt;(&lt;/span&gt;
    &lt;span class="nv"&gt;dag_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"dag_for_"&lt;/span&gt;+pv_system_label+&lt;span class="s2"&gt;"data"&lt;/span&gt;,
    &lt;span class="nv"&gt;start_date&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;datetime&lt;span class="o"&gt;(&lt;/span&gt;2015, 1, 1&lt;span class="o"&gt;)&lt;/span&gt;,
    &lt;span class="nv"&gt;end_date&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;datetime&lt;span class="o"&gt;(&lt;/span&gt;2015, 12, 31&lt;span class="o"&gt;)&lt;/span&gt;,
    &lt;span class="nv"&gt;schedule_interval&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"@daily"&lt;/span&gt;,

&lt;span class="o"&gt;)&lt;/span&gt; as dag:
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Airflow parametrizes the year, month and day information to be used in the DAG. These parameters are used in the URL template explained above to get the exact file for each iteration. Hence, beginning from &lt;code&gt;start_date&lt;/code&gt;, Airflow iterates the DAG for each day up to the &lt;code&gt;end_date&lt;/code&gt;. After completing the runs for all 4 systems, the following screen appears on the Airflow DAGs menu and in BigQuery:&lt;/p&gt;
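&lt;p&gt;As a sanity check, the daily schedule over that window implies one run per day per system (a sketch with plain &lt;code&gt;datetime&lt;/code&gt;, leaving aside Airflow's end-of-interval scheduling details):&lt;/p&gt;

```python
from datetime import date

# Count the daily DAG runs implied by the start_date/end_date above.
start_date = date(2015, 1, 1)
end_date = date(2015, 12, 31)
n_runs = (end_date - start_date).days + 1  # both endpoints inclusive
print(n_runs)  # 365: 2015 is not a leap year
```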

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--8VFQwjzc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/z2ye81p0bwkb7rbodfwl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--8VFQwjzc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/z2ye81p0bwkb7rbodfwl.png" alt="Image description" width="800" height="318"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--lXu60zjJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fngfeokq7tsxecf04se2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--lXu60zjJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fngfeokq7tsxecf04se2.png" alt="Image description" width="800" height="632"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>dezoomcamp</category>
      <category>dataengineering</category>
      <category>airflow</category>
      <category>publicdatasets</category>
    </item>
    <item>
      <title>Building GCS Buckets and BigQuery Tables with Terraform</title>
      <dc:creator>Cem Keskin</dc:creator>
      <pubDate>Sun, 03 Apr 2022 18:20:40 +0000</pubDate>
      <link>https://dev.to/cemkeskin84/building-gcs-bucket-and-bigquery-tables-with-terraform-4hf4</link>
      <guid>https://dev.to/cemkeskin84/building-gcs-bucket-and-bigquery-tables-with-terraform-4hf4</guid>
<description>&lt;p&gt;Terraform helps data scientists and engineers build an infrastructure and manage its lifecycle.&lt;/p&gt;

&lt;p&gt;There are two ways to use it: locally or in the cloud. Below is a description of how to install and use it locally to build an infrastructure on Google Cloud Platform (GCP).&lt;/p&gt;

&lt;h2&gt;
  
  
  Initial Installations: Terraform and Google Cloud SDK
&lt;/h2&gt;

&lt;p&gt;For installing Terraform, pick the proper guide for your operating system provided in their &lt;a href="https://www.terraform.io/downloads" rel="noopener noreferrer"&gt;webpage.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once the Terraform installation is complete, you also need a GCP account and an initiated project. The ID of the project is important to note while proceeding with Terraform.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faqyenvdwlyte90xobhqy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faqyenvdwlyte90xobhqy.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The next step is to get a key to access and control your GCP project. Pick the project you just created from the pull-down menu on the header of GCP and go to: &lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;         Navigation Menu &amp;gt;&amp;gt; IAM &amp;amp; Admin &amp;gt;&amp;gt; Service Accounts &amp;gt;&amp;gt; Create Service Account
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;and follow the steps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Step 1: Assign a name of your preference&lt;/li&gt;
&lt;li&gt;Step 2: Pick the role “Viewer” for initiation&lt;/li&gt;
&lt;li&gt;Step 3: Skip this optional step for your personal projects.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then you should see a new account on your Service Accounts list. Click on Actions &amp;gt;&amp;gt; Manage Keys &amp;gt;&amp;gt; Add Key &amp;gt;&amp;gt; JSON to download the key on your local machine.&lt;/p&gt;

&lt;p&gt;The next step is installing the Google Cloud SDK on your local machine, following the straightforward instructions given &lt;a href="https://cloud.google.com/sdk/docs/install-sdk" rel="noopener noreferrer"&gt;here.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then open your terminal (below is a GNU/Linux example) and set the environment variable on your local machine to link to the key (JSON file) you downloaded, with the following instructions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;GOOGLE_APPLICATION_CREDENTIALS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/--path to your JSON---/XXXXX-dadas2a4cff8.json

gcloud auth application-default login
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This redirects you to the browser in order to select your corresponding Google account. Now your local SDK has the credentials to reach and configure your cloud services. However, despite these initial authentications, you still need to modify your service account permissions specifically for the GCP services you intend to build, namely Google Cloud Storage (GCS) and BigQuery.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;   Navigation Menu &amp;gt;&amp;gt; IAM &amp;amp; Admin &amp;gt;&amp;gt; IAM
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;and pick your project to edit its permissions as following:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F15y1400bxxt2e8pmsl41.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F15y1400bxxt2e8pmsl41.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next step is enabling the APIs for your project by following the links:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://console.cloud.google.com/apis/library/iam.googleapis.com" rel="noopener noreferrer"&gt;IAM API&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://console.cloud.google.com/apis/library/iamcredentials.googleapis.com" rel="noopener noreferrer"&gt;IAM Credentials&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;(Take care of the GCP account and the project name while enabling APIs.)&lt;/p&gt;

&lt;h2&gt;
  
  
  Building GCP Services with Terraform
&lt;/h2&gt;

&lt;p&gt;Having completed the necessary installations (Terraform and the Google Cloud SDK) and authentications, we are ready to build these two GCP services via Terraform from your local machine. Basically, two files are needed to configure the installations: &lt;code&gt;main.tf&lt;/code&gt; and &lt;code&gt;variables.tf&lt;/code&gt;. The former requires the code given below to create the GCP services with respect to the variables provided in the latter (the following code snippet).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# The code below is from https://github.com/DataTalksClub/data-engineering-zoomcamp/tree/main/week_1_basics_n_setup/1_terraform_gcp&lt;/span&gt;
&lt;span class="c"&gt;# --------------------------------------------------&lt;/span&gt;

terraform &lt;span class="o"&gt;{&lt;/span&gt;
  required_version &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"&amp;gt;= 1.0"&lt;/span&gt;
  backend &lt;span class="s2"&gt;"local"&lt;/span&gt; &lt;span class="o"&gt;{}&lt;/span&gt;
  required_providers &lt;span class="o"&gt;{&lt;/span&gt;
    google &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
      &lt;span class="nb"&gt;source&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"hashicorp/google"&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
  &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;

provider &lt;span class="s2"&gt;"google"&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
  project &lt;span class="o"&gt;=&lt;/span&gt; var.project
  region &lt;span class="o"&gt;=&lt;/span&gt; var.region
  // credentials &lt;span class="o"&gt;=&lt;/span&gt; file&lt;span class="o"&gt;(&lt;/span&gt;var.credentials&lt;span class="o"&gt;)&lt;/span&gt;  &lt;span class="c"&gt;# Use this if you do not want to set env-var GOOGLE_APPLICATION_CREDENTIALS&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;

&lt;span class="c"&gt;# Data Lake Bucket&lt;/span&gt;
&lt;span class="c"&gt;# Ref: https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/storage_bucket&lt;/span&gt;
resource &lt;span class="s2"&gt;"google_storage_bucket"&lt;/span&gt; &lt;span class="s2"&gt;"data-lake-bucket"&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
  name          &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;local&lt;/span&gt;&lt;span class="p"&gt;.data_lake_bucket&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;_&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.project&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="c"&gt;# Concatenating DL bucket &amp;amp; Project name for unique naming&lt;/span&gt;
  location      &lt;span class="o"&gt;=&lt;/span&gt; var.region

  &lt;span class="c"&gt;# Optional, but recommended settings:&lt;/span&gt;
  storage_class &lt;span class="o"&gt;=&lt;/span&gt; var.storage_class
  uniform_bucket_level_access &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;true

  &lt;/span&gt;versioning &lt;span class="o"&gt;{&lt;/span&gt;
    enabled     &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;true&lt;/span&gt;
  &lt;span class="o"&gt;}&lt;/span&gt;

  lifecycle_rule &lt;span class="o"&gt;{&lt;/span&gt;
    action &lt;span class="o"&gt;{&lt;/span&gt;
      &lt;span class="nb"&gt;type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Delete"&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
    condition &lt;span class="o"&gt;{&lt;/span&gt;
      age &lt;span class="o"&gt;=&lt;/span&gt; 30  // days
    &lt;span class="o"&gt;}&lt;/span&gt;
  &lt;span class="o"&gt;}&lt;/span&gt;

  force_destroy &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;true&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;

&lt;span class="c"&gt;# DWH&lt;/span&gt;
&lt;span class="c"&gt;# Ref: https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/bigquery_dataset&lt;/span&gt;
resource &lt;span class="s2"&gt;"google_bigquery_dataset"&lt;/span&gt; &lt;span class="s2"&gt;"dataset"&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
  dataset_id &lt;span class="o"&gt;=&lt;/span&gt; var.BQ_DATASET
  project    &lt;span class="o"&gt;=&lt;/span&gt; var.project
  location   &lt;span class="o"&gt;=&lt;/span&gt; var.region
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The code for &lt;code&gt;variables.tf&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# The code below is from https://github.com/DataTalksClub/data-engineering-zoomcamp/tree/main/week_1_basics_n_setup/1_terraform_gcp&lt;/span&gt;
&lt;span class="c"&gt;# The comments are added by the author&lt;/span&gt;

locals &lt;span class="o"&gt;{&lt;/span&gt;
  data_lake_bucket &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"BUCKET_NAME"&lt;/span&gt;  &lt;span class="c"&gt;# Write a name for the GCS bucket to be created&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;

variable &lt;span class="s2"&gt;"project"&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
  description &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Your GCP Project ID"&lt;/span&gt;   &lt;span class="c"&gt;# Don't write anything here: it will be prompted during installation&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;

variable &lt;span class="s2"&gt;"region"&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
  description &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Region for GCP resources. Choose as per your location: https://cloud.google.com/about/locations"&lt;/span&gt;
  default &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"europe-west6"&lt;/span&gt;  &lt;span class="c"&gt;# Pick a data center location in which your services will be located&lt;/span&gt;
  &lt;span class="nb"&gt;type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; string
&lt;span class="o"&gt;}&lt;/span&gt;

variable &lt;span class="s2"&gt;"storage_class"&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
  description &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Storage class type for your bucket. Check official docs for more info."&lt;/span&gt;
  default &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"STANDARD"&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;

variable &lt;span class="s2"&gt;"BQ_DATASET"&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
  description &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"BigQuery Dataset that raw data (from GCS) will be written to"&lt;/span&gt;
  &lt;span class="nb"&gt;type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; string
  default &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Dataset_Name"&lt;/span&gt; &lt;span class="c"&gt;# Write a name for the BigQuery Dataset to be created&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once the files given above are placed in a folder, it is time to execute them. The Terraform CLI has a few main commands:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;init:&lt;/strong&gt; prepares the working directory by downloading the required providers and creating the files needed by the following commands&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;validate:&lt;/strong&gt; checks whether the existing configuration is valid&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;plan:&lt;/strong&gt; shows the changes planned for the given configuration&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;apply:&lt;/strong&gt; creates the infrastructure described by the given configuration&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;destroy:&lt;/strong&gt; tears down the previously created infrastructure&lt;/li&gt;
&lt;/ul&gt;
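&lt;p&gt;The typical workflow chains these commands in order. As a minimal illustrative sketch (the Python wrapper below is hypothetical; only the Terraform command names and the &lt;code&gt;-auto-approve&lt;/code&gt; flag are real), the sequence can be driven with &lt;code&gt;subprocess&lt;/code&gt;:&lt;/p&gt;

```python
import subprocess

# The standard Terraform workflow, as ordered argv lists.
# "-auto-approve" skips the interactive confirmation of `terraform apply`.
WORKFLOW = [
    ["terraform", "init"],
    ["terraform", "validate"],
    ["terraform", "plan"],
    ["terraform", "apply", "-auto-approve"],
]

def run_workflow(commands=WORKFLOW, cwd="."):
    """Run each Terraform command in order, stopping at the first failure."""
    for argv in commands:
        if subprocess.run(argv, cwd=cwd).returncode != 0:
            return False
    return True
```

&lt;p&gt;In the interactive session below, the same commands are simply typed one by one instead.&lt;/p&gt;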

&lt;p&gt;The init, plan and apply commands will give the following (shortened) outputs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;x@y:~/-----/terraform&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="k"&gt;**&lt;/span&gt;terraform init&lt;span class="k"&gt;**&lt;/span&gt;

Initializing the backend...

Successfully configured the backend &lt;span class="s2"&gt;"local"&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt; Terraform will automatically
use this backend unless the backend configuration changes.
&lt;span class="nb"&gt;.&lt;/span&gt;
&lt;span class="nb"&gt;.&lt;/span&gt;
&lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;x@y:~/-----/terraform&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="k"&gt;**&lt;/span&gt;terraform plan&lt;span class="k"&gt;**&lt;/span&gt;
var.project
  Your GCP Project ID

  Enter a value: xxx-yyy &lt;span class="c"&gt;# write your GCP project ID here&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;x@y:~/-----/terraform&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="k"&gt;**&lt;/span&gt;terraform apply&lt;span class="k"&gt;**&lt;/span&gt;

var.project
  Your GCP Project ID

  Enter a value: xxx-yyy &lt;span class="c"&gt;# write your GCP project ID here&lt;/span&gt;

Terraform used the selected providers to generate the following execution plan. Resource actions are indicated with the following
symbols:
  + create

Terraform will perform the following actions:
&lt;span class="nb"&gt;.&lt;/span&gt;
&lt;span class="nb"&gt;.&lt;/span&gt;
&lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After executing the three simple commands above, you will see a new GCS bucket and a BigQuery dataset in your GCP account.&lt;/p&gt;

</description>
      <category>dezoomcamp</category>
      <category>dataengineering</category>
      <category>bigquery</category>
      <category>gcp</category>
    </item>
    <item>
      <title>Versioning Data and Pipeline With Git, DVC and Cloud Storage</title>
      <dc:creator>Cem Keskin</dc:creator>
      <pubDate>Fri, 04 Feb 2022 12:24:42 +0000</pubDate>
      <link>https://dev.to/cemkeskin84/versioning-data-and-pipeline-with-git-dvc-and-cloud-storage-5cpd</link>
      <guid>https://dev.to/cemkeskin84/versioning-data-and-pipeline-with-git-dvc-and-cloud-storage-5cpd</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;The vast majority of data science projects are born in Jupyter Notebooks. Being interactive and easy to use, they make exploratory data analysis (EDA) very convenient. They are also widely used for further steps such as machine learning model development, performance assessment and hyper-parameter tuning, among others. However, as the project progresses and deployment scenarios come under investigation, notebooks start to suffer in terms of versioning, reproducibility, interoperability, file type issues, etc. In other words, as the project moves from isolated local environments to shared ones, one needs tools oriented more toward software engineering than toward data science alone. At this point, some DevOps practices can contribute.&lt;/p&gt;

&lt;p&gt;A common and inevitable best practice of modern software development is using version control. In daily practice, it is almost synonymous with using Git, which makes switching among different versions of files easy. However, Git has limitations, with file size and file type compatibility being the major ones. One can conveniently version-control .py or .ipynb files; versioning a dataset of a couple of hundred megabytes, or a binary file, is neither possible nor convenient with Git. At this point, a free and open source tool, namely Data Version Control (DVC) by iterative.ai, comes onto the scene.&lt;/p&gt;

&lt;h3&gt;
  
  
  1.1 DVC
&lt;/h3&gt;

&lt;p&gt;DVC is a highly capable command line tool. Broadly speaking, it makes dataset and experiment versioning convenient by complementing other major developer tools such as Git. For example, it enables dataset versioning together with Git and cloud (or remote) storage: it provides command line tools to define .dvc files whose versions are tracked by Git, and these small files keep references to the exact versions of your dataset stored in the cloud (like S3, Gdrive, etc.). In short, DVC acts as a middleman that integrates Git and cloud storage for dataset versioning.&lt;/p&gt;
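&lt;p&gt;The middleman idea can be sketched in a few lines of Python: hash the data file, copy its content into a cache keyed by that hash, and let Git track only the tiny pointer file. This is a simplified illustration of the concept, not DVC's actual implementation:&lt;/p&gt;

```python
import hashlib
import shutil
import tempfile
from pathlib import Path

def add_to_cache(data_path, cache_dir):
    """Store a file under its content hash and write a tiny pointer file
    (the analogue of a .dvc file) that Git can track instead of the data."""
    data = Path(data_path)
    md5 = hashlib.md5(data.read_bytes()).hexdigest()
    cache = Path(cache_dir)
    cache.mkdir(parents=True, exist_ok=True)
    shutil.copy2(data, cache / md5)  # content-addressed copy
    pointer = Path(str(data) + ".dvc")
    pointer.write_text(f"outs:\n- md5: {md5}\n  path: {data.name}\n")
    return pointer

def restore_from_cache(pointer_path, cache_dir):
    """Recreate the data file a pointer refers to (analogue of `dvc checkout`)."""
    lines = Path(pointer_path).read_text().splitlines()
    md5 = lines[1].split("md5: ")[1]
    target = Path(str(pointer_path)[: -len(".dvc")])
    shutil.copy2(Path(cache_dir) / md5, target)
    return target

# Tiny demo: version a file, delete the local copy, restore it from the cache.
workdir = Path(tempfile.mkdtemp())
data_file = workdir / "dataset.csv"
data_file.write_text("a,b\n1,2\n")
pointer = add_to_cache(data_file, workdir / "cache")
data_file.unlink()  # the local copy is gone ...
restored = restore_from_cache(pointer, workdir / "cache")  # ... and comes back
```

&lt;p&gt;In real DVC the cache lives under &lt;code&gt;.dvc/cache&lt;/code&gt; and can be pushed to a remote, but the pointer-plus-cache principle is the same.&lt;/p&gt;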

&lt;p&gt;As with dataset versioning, it also helps with experiment versioning using .dvc, .yaml and other config files. You can build pipelines that make your data flow through several processes to yield value for your research, business or hobby. Using DVC, you can define such a pipeline and maintain it seamlessly: you can redefine the complete pipeline or fix some parts of it. Whatever the case, DVC makes life easier for your team.&lt;/p&gt;

&lt;p&gt;No need for more descriptions or promises; let's jump in with an introductory tutorial. Note that the dataset used in this tutorial is a small one that could be stored in Git directly. DVC is actually developed for larger datasets, but a smaller one is used here to save loading and computation time. Apart from these costs, the procedure is exactly the same for larger datasets with the steps and tools defined below.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Let's start
&lt;/h2&gt;

&lt;p&gt;This tutorial is a partial reproduction of &lt;a href="https://github.com/CemKeskin84/ML-Zoomcamp/tree/main/midterm_project"&gt;a previous data science project&lt;/a&gt; that depended on notebooks during development. Here, a simple pipeline is built for the workflow that starts with getting the data and ends with evaluating two simple model alternatives (deployment is not involved). Since the tutorial is about versioning, the code for data preparation, model training and model performance evaluation is simply transferred from the corresponding notebooks of the previous project to the &lt;code&gt;src&lt;/code&gt; folder of the new project as &lt;code&gt;.py&lt;/code&gt; files. Hence, the tree was as lean as below at the beginning.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;.&lt;/span&gt;
├── README.md
└── src
    ├── config.py
    ├── evaluate.py
    ├── prepare.py
    └── train.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The next step is to deploy &lt;code&gt;pipenv&lt;/code&gt; for dependency management:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;x@y:~/DVC_tutorial&lt;span class="nv"&gt;$ &lt;/span&gt;pipenv &lt;span class="nb"&gt;install
&lt;/span&gt;x@y:~/DVC_tutorial&lt;span class="nv"&gt;$ &lt;/span&gt;pipenv shell
&lt;span class="o"&gt;(&lt;/span&gt;DVC_tutorial&lt;span class="o"&gt;)&lt;/span&gt; x@y:~/DVC_tutorial&lt;span class="nv"&gt;$ &lt;/span&gt;pipenv &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
dvc[gdrive] pandas numpy sklearn openpyxl
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;Note that installing DVC on Linux is as easy as &lt;code&gt;pip install dvc&lt;/code&gt;. However, &lt;code&gt;dvc[gdrive]&lt;/code&gt; is used here to install DVC together with its optional Google Drive dependencies, since Gdrive will be used for storing the versions of the data during the tutorial. For other installation alternatives, see the &lt;a href="https://dvc.org/doc/install"&gt;DVC website&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Then git and DVC are initialized with the &lt;code&gt;git init&lt;/code&gt; and &lt;code&gt;dvc init&lt;/code&gt; commands. At this point we have the following files and folders:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;(&lt;/span&gt;DVC_tutorial&lt;span class="o"&gt;)&lt;/span&gt; x@y:~/DVC_tutorial&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;ls&lt;/span&gt; &lt;span class="nt"&gt;-a&lt;/span&gt;
.dvc  .dvcignore  .git  Pipfile  Pipfile.lock  README.md  src
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  3. Versioning Data
&lt;/h2&gt;

&lt;p&gt;The original dataset of the project is stored in a &lt;a href="https://archive.ics.uci.edu/ml/machine-learning-databases/00242/"&gt;UCI repository&lt;/a&gt;. Create the &lt;code&gt;data&lt;/code&gt; folder and pull the data file (in .xlsx format) with the dedicated DVC command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;(&lt;/span&gt;DVC_tutorial&lt;span class="o"&gt;)&lt;/span&gt; x@y:~/DVC_tutorial&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;mkdir &lt;/span&gt;data &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;cd &lt;/span&gt;data
&lt;span class="o"&gt;(&lt;/span&gt;DVC_tutorial&lt;span class="o"&gt;)&lt;/span&gt; x@y:~/DVC_tutorial/data&lt;span class="nv"&gt;$ &lt;/span&gt;dvc get-url https://archive.ics.uci.edu/ml/machine-learning-databases/00242/ENB2012_data.xlsx

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then "add" the data file to DVC with &lt;code&gt;dvc add ENB2012_data.xlsx&lt;/code&gt; command. This yields the corresponding &lt;code&gt;.dvc&lt;/code&gt; file for tracking it. This is the file that git will be tracking; not the original dataset. Using this file, DVC acts a tool that matches a dataset in a local or remote storage (Gdrive, S3, etc.) with the code base of the project stored on git.&lt;/p&gt;

&lt;p&gt;The next step is pushing the data to the cloud, which is Gdrive for this tutorial. First, create a folder manually on the Gdrive web interface and get its identifier from the URL:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--tb1WTzLo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/spgyah12wbtvxy7nyshk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--tb1WTzLo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/spgyah12wbtvxy7nyshk.png" alt="Image description" width="664" height="170"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once you have the identifier, declare it to DVC as the remote storage location and commit the change as below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;(&lt;/span&gt;DVC_tutorial&lt;span class="o"&gt;)&lt;/span&gt; x@y:~/DVC_tutorial/data&lt;span class="nv"&gt;$ &lt;/span&gt;dvc remote add &lt;span class="nt"&gt;-d&lt;/span&gt; raw_storage gdrive://1I680q6HvPqcxbNJ8qnQ01c1pKxxxxxxxxxx
Setting &lt;span class="s1"&gt;'raw_storage'&lt;/span&gt; as a default remote.
&lt;span class="o"&gt;(&lt;/span&gt;DVC_tutorial&lt;span class="o"&gt;)&lt;/span&gt; x@y:~/DVC_tutorial/data&lt;span class="nv"&gt;$ &lt;/span&gt;git commit ../.dvc/config &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="s2"&gt;"Remote data storage is added with name: dataset"&lt;/span&gt;
&lt;span class="o"&gt;(&lt;/span&gt;DVC_tutorial&lt;span class="o"&gt;)&lt;/span&gt; x@y:~/DVC_tutorial/data&lt;span class="nv"&gt;$ &lt;/span&gt;dvc push
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once you push, Gdrive will take you through a simple-to-follow authentication procedure to get a verification code. After entering it, the upload will start and you will get a folder with a random name on your Gdrive:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--RNTsSLfT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/t0m05ytsfo417ay3p7v1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--RNTsSLfT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/t0m05ytsfo417ay3p7v1.png" alt="Image description" width="674" height="259"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  3.1 Building and storing another version of the dataset
&lt;/h3&gt;

&lt;p&gt;Imagine a case where you have to keep the raw data but it is not useful as it is: you would need to transform it and keep only the new version locally. As an example, let's say we need a .csv file instead of .xlsx. Simply convert it using Python, saving it under the name "dataset.csv":&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;(&lt;/span&gt;DVC_tutorial&lt;span class="o"&gt;)&lt;/span&gt; x@y:~/DVC_tutorial/data&lt;span class="nv"&gt;$ &lt;/span&gt;python3
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; import pandas as pd
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; pd.read_excel&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"ENB2012_data.xlsx"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;.to_csv&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"dataset.csv"&lt;/span&gt;, &lt;span class="nv"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;None, &lt;span class="nv"&gt;header&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;True&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;(&lt;/span&gt;DVC_tutorial&lt;span class="o"&gt;)&lt;/span&gt; x@y:~/DVC_tutorial/data&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;ls
&lt;/span&gt;dataset.csv  ENB2012_data.xlsx  ENB2012_data.xlsx.dvc
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then repeat the DVC and git steps:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;(&lt;/span&gt;DVC_tutorial&lt;span class="o"&gt;)&lt;/span&gt; x@y:~/DVC_tutorial/data&lt;span class="nv"&gt;$ &lt;/span&gt;dvc add dataset.csv 
&lt;span class="o"&gt;(&lt;/span&gt;DVC_tutorial&lt;span class="o"&gt;)&lt;/span&gt; x@y:~/DVC_tutorial/data&lt;span class="nv"&gt;$ &lt;/span&gt;git add dataset.csv.dvc
&lt;span class="o"&gt;(&lt;/span&gt;DVC_tutorial&lt;span class="o"&gt;)&lt;/span&gt; x@y:~/DVC_tutorial/data&lt;span class="nv"&gt;$ &lt;/span&gt;git commit &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="s2"&gt;"Converted data is integrated with DVC"&lt;/span&gt;
&lt;span class="o"&gt;(&lt;/span&gt;DVC_tutorial&lt;span class="o"&gt;)&lt;/span&gt; x@y:~/DVC_tutorial/data&lt;span class="nv"&gt;$ &lt;/span&gt;dvc push
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and get a second folder on Gdrive for the converted data:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--wNUjYTHV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/29k4lw2zzgnchl7icakv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--wNUjYTHV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/29k4lw2zzgnchl7icakv.png" alt="Image description" width="800" height="295"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now you can remove the raw data (keeping the .dvc file) from your local environment to save disk space. However, as you progress in your EDA, you may still need to update your dataset. Once again, you can create a new version of your dataset and keep only that version locally. Previous versions will stay in remote storage and you can reach them as needed.&lt;/p&gt;

&lt;p&gt;As a fictitious scenario, say that the last 100 lines of the .csv file are irrelevant for your purposes and you plan to proceed by removing them:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;(&lt;/span&gt;DVC_tutorial&lt;span class="o"&gt;)&lt;/span&gt; x@y:~/DVC_tutorial/data&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;cat &lt;/span&gt;dataset.csv | &lt;span class="nb"&gt;wc&lt;/span&gt; &lt;span class="nt"&gt;-l&lt;/span&gt;
769
&lt;span class="o"&gt;(&lt;/span&gt;DVC_tutorial&lt;span class="o"&gt;)&lt;/span&gt; x@y:~/DVC_tutorial/data&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;head&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; &lt;span class="nt"&gt;-100&lt;/span&gt; dataset.csv &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; tmp.txt &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;mv &lt;/span&gt;tmp.txt dataset.csv
&lt;span class="o"&gt;(&lt;/span&gt;DVC_tutorial&lt;span class="o"&gt;)&lt;/span&gt; x@y:~/DVC_tutorial/data&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;cat &lt;/span&gt;dataset.csv | &lt;span class="nb"&gt;wc&lt;/span&gt; &lt;span class="nt"&gt;-l&lt;/span&gt;
669
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Having a new version of the dataset, you also need to store it in the remote repository. Again, use the &lt;code&gt;dvc add&lt;/code&gt;, &lt;code&gt;git add&lt;/code&gt; (for the .dvc file), &lt;code&gt;git commit&lt;/code&gt; and &lt;code&gt;dvc push&lt;/code&gt; command sequence as above. Once completed, you will have another folder on your Gdrive page.&lt;/p&gt;

&lt;h3&gt;
  
  
  3.2 Switching among dataset versions
&lt;/h3&gt;

&lt;p&gt;Of course, DVC not only helps to store different versions of your dataset; it also makes it possible to switch among them. Let's see our git logs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;(&lt;/span&gt;DVC_tutorial&lt;span class="o"&gt;)&lt;/span&gt; x@y:~/DVC_tutorial/data&lt;span class="nv"&gt;$ &lt;/span&gt;git log &lt;span class="nt"&gt;--oneline&lt;/span&gt;
ca1258b &lt;span class="o"&gt;(&lt;/span&gt;HEAD -&amp;gt; master&lt;span class="o"&gt;)&lt;/span&gt; dataset is pre-processed
ea6973a Converted data is integrated with DVC
4025c49 Remote data storage is added with name: dataset
a20ad92 Raw data is pulled and integrated with DVC
6bff3fb &lt;span class="o"&gt;(&lt;/span&gt;origin/master&lt;span class="o"&gt;)&lt;/span&gt; initiation
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Say we regret erasing the last 100 lines and would like to use them again. The only thing we need to do is check out the corresponding state of the .dvc file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;(&lt;/span&gt;DVC_tutorial&lt;span class="o"&gt;)&lt;/span&gt; x@y:~/DVC_tutorial/data&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;cat &lt;/span&gt;dataset.csv | &lt;span class="nb"&gt;wc&lt;/span&gt; &lt;span class="nt"&gt;-l&lt;/span&gt;
669
&lt;span class="o"&gt;(&lt;/span&gt;DVC_tutorial&lt;span class="o"&gt;)&lt;/span&gt; x@y:~/DVC_tutorial/data&lt;span class="nv"&gt;$ &lt;/span&gt;git checkout ea6973a dataset.csv.dvc
&lt;span class="o"&gt;(&lt;/span&gt;DVC_tutorial&lt;span class="o"&gt;)&lt;/span&gt; x@y:~/DVC_tutorial/data&lt;span class="nv"&gt;$ &lt;/span&gt;dvc checkout
&lt;span class="o"&gt;(&lt;/span&gt;DVC_tutorial&lt;span class="o"&gt;)&lt;/span&gt; x@y:~/DVC_tutorial/data&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;cat &lt;/span&gt;dataset.csv | &lt;span class="nb"&gt;wc&lt;/span&gt; &lt;span class="nt"&gt;-l&lt;/span&gt;
769
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As the examples above show, DVC helps us move between different versions of our dataset with git-based tracking of .dvc files. You can see on your GitHub page that the dataset.csv file itself is not there; only the corresponding .dvc files are available.&lt;/p&gt;

&lt;p&gt;At this point, we have the following tree for our local project folder:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;.
├── data
│   ├── dataset.csv
│   ├── dataset.csv.dvc
│   └── ENB2012_data.xlsx.dvc
├── Pipfile
├── Pipfile.lock
├── README.md
└── src
    ├── config.py
    ├── evaluate.py
    ├── prepare.py
    └── train.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  4. Building Pipelines
&lt;/h2&gt;

&lt;p&gt;Having the desired form(s) of the dataset, the next step is to iterate over a sequence of steps (a pipeline) to build and test model(s). DVC helps you automate this procedure as well. The procedure can involve any step from data wrangling to model performance visualization. While working with different pipelines, DVC helps you document and compare the alternatives in terms of the parameters you picked.&lt;/p&gt;

&lt;p&gt;You can build a pipeline with DVC using the &lt;code&gt;dvc run&lt;/code&gt; command or via the dvc.yaml file. In fact, when you use the former, DVC itself produces the latter. Let's try it.&lt;/p&gt;

&lt;p&gt;The primitive pipeline we will build here involves three fundamental steps: preparation, training and evaluation. For each of those steps, there is a dedicated &lt;code&gt;.py&lt;/code&gt; file under the src folder. Using those and the proper declarations, building a pipeline is a straightforward task.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Code for the preparation step of the pipeline:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;(&lt;/span&gt;DVC_tutorial&lt;span class="o"&gt;)&lt;/span&gt;x@y:~/DVC_tutorial/src&lt;span class="nv"&gt;$ &lt;/span&gt;dvc run&lt;span class="se"&gt;\ &lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; prepare &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt; prepare.py &lt;span class="nt"&gt;-d&lt;/span&gt; ../data/dataset.csv &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; ../data/prepared
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; python3 prepare.py ../data/dataset.csv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Code for the training step of the pipeline:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;(&lt;/span&gt;DVC_tutorial&lt;span class="o"&gt;)&lt;/span&gt;x@y:~/DVC_tutorial/src&lt;span class="nv"&gt;$ &lt;/span&gt;dvc run &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; training &lt;span class="nt"&gt;-d&lt;/span&gt; train.py &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt; ../data/prepared &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; ../assets &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; python3 train.py ../data/dataset.csv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Code for the evaluation step of the pipeline:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;(&lt;/span&gt;DVC_tutorial&lt;span class="o"&gt;)&lt;/span&gt;x@y:~/DVC_tutorial/src&lt;span class="nv"&gt;$ &lt;/span&gt;dvc run &lt;span class="se"&gt;\ &lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; evaluating &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt; evaluate.py &lt;span class="nt"&gt;-d&lt;/span&gt; ../data/prepared &lt;span class="nt"&gt;-d&lt;/span&gt; ../assets/models &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; ../assets/metrics &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; python3 evaluate.py ../data/prepared ../assets/metrics 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note that there is a common pattern for declaration of each step. You define&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a name for the procedure with &lt;code&gt;-n&lt;/code&gt; flag,&lt;/li&gt;
&lt;li&gt;dependencies with &lt;code&gt;-d&lt;/code&gt; flag,&lt;/li&gt;
&lt;li&gt;output location with &lt;code&gt;-o&lt;/code&gt; flag,&lt;/li&gt;
&lt;li&gt;code to execute and its dependencies with &lt;code&gt;python3&lt;/code&gt; command.&lt;/li&gt;
&lt;/ul&gt;
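&lt;p&gt;The mapping from these flags to a dvc.yaml stage entry is mechanical. A hypothetical Python helper makes the pattern explicit (for illustration only; DVC generates the file for you):&lt;/p&gt;

```python
def stage_entry(name, cmd, deps, outs):
    """Build the dvc.yaml fragment that `dvc run -n NAME -d ... -o ... CMD`
    would generate (illustrative helper, not part of DVC)."""
    return {name: {"cmd": cmd, "deps": sorted(deps), "outs": list(outs)}}

# The `prepare` stage from above, expressed through the helper:
prepare = stage_entry(
    name="prepare",
    cmd="python3 prepare.py ../data/dataset.csv",
    deps=["prepare.py", "../data/dataset.csv"],
    outs=["../data/prepared"],
)
```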

&lt;p&gt;After entering the commands above, you get the following dvc.yaml file, which can also be used for modifying the pipeline (remember that you can equally define the pipeline through this file and the corresponding DVC commands).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;(&lt;/span&gt;DVC_tutorial&lt;span class="o"&gt;)&lt;/span&gt; x@y:~/DVC_tutorial&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;cat &lt;/span&gt;src/dvc.yaml 
stages:
  prepare:
    cmd: python3 prepare.py ../data/dataset.csv
    deps:
    - ../data/dataset.csv
    - prepare.py
    outs:
    - ../data/prepared
  training:
    cmd: python3 train.py ../data/dataset.csv
    deps:
    - ../data/prepared
    - train.py
    outs:
    - ../assets/models
  evaluating:
    cmd: python3 evaluate.py ../data/prepared ../assets/metrics
    deps:
    - ../assets/models
    - ../data/prepared
    - evaluate.py
    outs:
    - ../assets/metrics
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once you have built the pipeline, you can modify any part and rebuild it very conveniently. Say you wish to change the model used in the &lt;code&gt;train.py&lt;/code&gt; file: initially it was a Random Forest model, but you would like to try an Extra Trees Regressor as well. You only need to modify the corresponding part:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Build the Random Forest  model:
&lt;/span&gt;
&lt;span class="c1"&gt;# model = RandomForestRegressor(
#    n_estimators=150, max_depth=6, random_state=Config.RANDOM_SEED )
&lt;/span&gt;
&lt;span class="c1"&gt;# Build Etra Tree Regression Model (alternative model):
&lt;/span&gt;
&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ExtraTreesRegressor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;155&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After the modification, the only thing you have to do is run the &lt;code&gt;dvc repro&lt;/code&gt; command. It will run the whole procedure for you. DVC is also smart enough to skip the steps that are not affected by the change; in our example, there is no need to re-run the preparation step. &lt;/p&gt;
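&lt;p&gt;This skipping behaviour can be understood with a simplified model: each stage records a fingerprint of its dependencies, and a stage is re-run only when that fingerprint changes. A minimal Python sketch of the idea (not DVC's actual code):&lt;/p&gt;

```python
import hashlib

def fingerprint(dep_contents):
    """Hash all dependency contents into a single fingerprint (simplified)."""
    h = hashlib.md5()
    for content in dep_contents:
        h.update(content.encode())
    return h.hexdigest()

def run_stage(name, dep_contents, lock, action):
    """Run `action` only if the stage's dependencies changed since last time."""
    fp = fingerprint(dep_contents)
    if lock.get(name) == fp:
        return f"Stage '{name}' didn't change, skipping"
    lock[name] = fp
    action()
    return f"Running stage '{name}'"

lock, runs = {}, []
# First `repro`: every stage runs.
run_stage("prepare", ["raw data v1"], lock, lambda: runs.append("prepare"))
run_stage("training", ["prepared v1", "train.py v1"], lock, lambda: runs.append("train"))
# Only train.py changed: `prepare` is skipped, `training` re-runs.
skip_msg = run_stage("prepare", ["raw data v1"], lock, lambda: runs.append("prepare"))
run_stage("training", ["prepared v1", "train.py v2"], lock, lambda: runs.append("train"))
```

&lt;p&gt;Real DVC keeps these fingerprints in the dvc.lock file, which is why committing that file to git matters.&lt;/p&gt;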

&lt;p&gt;With the given code and config file, the performance metric of each run is stored in &lt;code&gt;assets/metrics/metrics.json&lt;/code&gt; file.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Conclusion
&lt;/h2&gt;

&lt;p&gt;This article has presented how DVC makes iterating over your workflow convenient. The tutorial focused on versioning the dataset and the pipeline (the repository is &lt;a href="https://github.com/CemKeskin84/DVC_basics_tutorial"&gt;here&lt;/a&gt;). However, DVC offers more tools for hyper-parameter tuning, plotting and experiment management, which will be the subject of an upcoming post. &lt;/p&gt;

</description>
      <category>datascience</category>
      <category>dvc</category>
      <category>git</category>
      <category>python</category>
    </item>
  </channel>
</rss>
