<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Eduardo Blancas</title>
    <description>The latest articles on DEV Community by Eduardo Blancas (@edublancas).</description>
    <link>https://dev.to/edublancas</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F354645%2F700c32da-5d53-4883-b533-557d460620b9.jpeg</url>
      <title>DEV Community: Eduardo Blancas</title>
      <link>https://dev.to/edublancas</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/edublancas"/>
    <language>en</language>
    <item>
      <title>How to Access Google Sheets from a Jupyter Notebook</title>
      <dc:creator>Eduardo Blancas</dc:creator>
      <pubDate>Tue, 16 Aug 2022 16:12:00 +0000</pubDate>
      <link>https://dev.to/edublancas/how-to-access-google-sheets-from-a-jupyter-notebook-3omm</link>
      <guid>https://dev.to/edublancas/how-to-access-google-sheets-from-a-jupyter-notebook-3omm</guid>
      <description>&lt;p&gt;Data is the starting point for all the projects and products in data science. The first lines of a script are dedicated to reading the data. This is a fact and does not change based on the project. What changes is the source of data. There are a variety of places where we store the data such as databases, S3 bucket, BigQuery, external files, spreadsheets, and so on.&lt;/p&gt;

&lt;p&gt;Google Sheets is quite common for storing small to medium-sized datasets. One of its nice features is that you can connect to a sheet directly from a Jupyter notebook: you don’t have to download the data to a local directory and read it from there. Another advantage of connecting from within a notebook is that you can update the sheet directly.&lt;/p&gt;

&lt;p&gt;Suppose you have a data manipulation task: you write a script that connects to Google Sheets, reads the data, performs the manipulation, and writes the updated data back to the same sheet. You can schedule this script so that the data in the sheet is always up to date.&lt;/p&gt;
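&lt;p&gt;Here is a minimal sketch of what such a script could look like, using the gspread library we install later in this article (the sheet name, credentials file, and the &lt;code&gt;Total&lt;/code&gt; column logic are hypothetical placeholders for your own task):&lt;/p&gt;

```python
def add_total_column(records):
    """Pure transformation step: add a Total field to each row dict."""
    return [{**row, "Total": row["Price"] * row["Quantity"]} for row in records]


def sync_sheet():
    """Read the sheet, transform it, and write the result back."""
    import gspread  # third-party library; installation covered below

    # "service-account.json" and "sample_sales" are hypothetical names
    sa = gspread.service_account(filename="service-account.json")
    ws = sa.open("sample_sales").worksheet("Sheet1")
    updated = add_total_column(ws.get_all_records())
    header = list(updated[0])
    rows = [[row[col] for col in header] for row in updated]
    ws.update([header] + rows)  # overwrite the sheet with the updated data

# call sync_sheet() from a scheduled job (e.g., cron) to keep the sheet fresh
```

&lt;p&gt;The rest of this article walks through the setup that makes the connection part of this sketch work.&lt;/p&gt;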

&lt;p&gt;In this article, we will learn how to access Google Sheets from a Jupyter notebook.&lt;/p&gt;

&lt;h3&gt;
  
  
  Create a project
&lt;/h3&gt;

&lt;p&gt;The first step is to create a project in the &lt;a href="https://console.cloud.google.com/cloud-resource-manager"&gt;Google Cloud Console&lt;/a&gt;, which can be done by clicking on “Create Project” in the console. You have a quota of 12 free projects with your account.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ObkgalcG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/nzja53qx9pd7ibdgq6sg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ObkgalcG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/nzja53qx9pd7ibdgq6sg.png" alt="Image description" width="624" height="207"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Enable Google Drive API and Google Sheets API
&lt;/h3&gt;

&lt;p&gt;The next step is to enable the Google Drive API and Google Sheets API. In the Google Cloud Console menu, select APIs &amp;amp; Services and then Enabled APIs &amp;amp; Services as shown below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--9c-uwzc5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9zn639x140csybbc05ad.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--9c-uwzc5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9zn639x140csybbc05ad.png" alt="Image description" width="475" height="408"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you have multiple projects, you will need to select the one you want to enable the API for. If you only have one project, it will automatically be selected.&lt;/p&gt;

&lt;p&gt;Click on “ENABLE APIS AND SERVICES”, then search for “Google Drive API” and “Google Sheets API”. In the search results, click on the corresponding entries as shown below, then click ENABLE on the page that opens.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--1r6WiGnU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/nw1dn1js5ns0fg2kwsc0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--1r6WiGnU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/nw1dn1js5ns0fg2kwsc0.png" alt="Image description" width="624" height="144"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--2947m8hr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7zp3kuux6ibc4kk9bibl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--2947m8hr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7zp3kuux6ibc4kk9bibl.png" alt="Image description" width="578" height="131"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Create credentials
&lt;/h3&gt;

&lt;p&gt;Now that we have the Google Drive API enabled, we need to create credentials. At the top right corner of the page that opens after enabling the API, click CREATE CREDENTIALS and select the service account option. The page shown below opens up.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--QpMs6u-7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/das8pvr6ab1agrdwb5u5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--QpMs6u-7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/das8pvr6ab1agrdwb5u5.png" alt="Image description" width="579" height="372"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Give the service account a name and click CREATE AND CONTINUE. When the grant pages open up, just hit CONTINUE and then DONE. You will then see the following screen; click on the link in the email column.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--0UC7rE12--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/u4e5ry87bo72r5v9hsno.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--0UC7rE12--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/u4e5ry87bo72r5v9hsno.png" alt="Image description" width="624" height="80"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It will take you to the service account details page. Copy the email address shown here; you will share the Google sheet with this address so the service account can access it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Xhlv-ztf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/yfkxuww24mqqbmlv6igj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Xhlv-ztf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/yfkxuww24mqqbmlv6igj.png" alt="Image description" width="624" height="238"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We also need to generate a key. Go back to the service accounts page and click on the three dots under the Actions column.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--d0p9aQII--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fup8xz2yvtbfpftkgub8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--d0p9aQII--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fup8xz2yvtbfpftkgub8.png" alt="Image description" width="624" height="66"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Click on MANAGE KEYS and then ADD KEY. It will ask you to choose a format; select JSON and hit CREATE. The JSON key file will be downloaded automatically.&lt;/p&gt;
&lt;h3&gt;
  
  
  Connecting to the Google Sheet
&lt;/h3&gt;

&lt;p&gt;You, of course, need a Google sheet to connect to. I created one that contains some sample sales data. You will need to share the sheet with the email address copied in the previous step: click on the Share button at the top right corner of the sheet, paste the address, and hit SEND.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Av1ABdbJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/e1s03tfqy66chlo79mdq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Av1ABdbJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/e1s03tfqy66chlo79mdq.png" alt="Image description" width="399" height="319"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We are now ready to connect to this sheet. Open up a Jupyter notebook. We will use a Python library called gspread, which can be installed with pip.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;!&lt;/span&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;gspread
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The next step is to import the library.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;gspread&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We need to give the service account details to gspread, which can be done using the &lt;code&gt;service_account&lt;/code&gt; method.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;sa&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;gspread&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;service_account&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"project-1-357814-ba841f7c3630.json"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The path to the JSON file that contains the service account credentials is passed to the &lt;code&gt;filename&lt;/code&gt; parameter. If the file is in the same directory as the notebook, you can just pass the file name.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;sa&lt;/code&gt; is a gspread client, which can be used to connect to sheets via the &lt;code&gt;open&lt;/code&gt; method and the sheet name.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;sheet&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sa&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"sample_sales"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A Google Sheets document might have multiple pages (i.e., worksheets), so we also need to specify the page name before getting the data. Ours has a single page called “Sheet1”.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;work_sheet&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sheet&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;worksheet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Sheet1"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We now have the data in a gspread worksheet object. We can extract it using the &lt;code&gt;get_all_records&lt;/code&gt; method and create a pandas DataFrame as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;

&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;work_sheet&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_all_records&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
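&lt;p&gt;To make this step concrete: &lt;code&gt;get_all_records&lt;/code&gt; pairs the header row with every data row and returns one dictionary per row. Conceptually, it works like this (the sample grid below is made up for illustration, not the actual sheet contents):&lt;/p&gt;

```python
# Conceptual mimic of get_all_records: zip the header row with each data row
grid = [
    ["product", "price", "quantity"],  # header row of the sheet
    ["pencil", 2, 10],
    ["notebook", 5, 3],
]

header, *rows = grid
records = [dict(zip(header, row)) for row in rows]
# records == [{'product': 'pencil', 'price': 2, 'quantity': 10},
#             {'product': 'notebook', 'price': 5, 'quantity': 3}]
```

&lt;p&gt;Passing such a list of dictionaries to &lt;code&gt;pd.DataFrame&lt;/code&gt; produces the same frame as the real call above.&lt;/p&gt;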



&lt;p&gt;Let's take a look at the first 5 rows of the data using the &lt;code&gt;head&lt;/code&gt; method.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--azUA3xl1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/f7wko6rw53njqat3jejj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--azUA3xl1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/f7wko6rw53njqat3jejj.png" alt="Image description" width="624" height="162"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We have successfully connected to a Google sheet and retrieved the data it contains. Most of the steps we completed need to be done only once. After that, connecting is just a matter of passing the sheet name, which is definitely more practical than downloading the data as a CSV file and then reading it.&lt;/p&gt;
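&lt;p&gt;As mentioned earlier, you can also update the sheet directly. A minimal sketch (the &lt;code&gt;total&lt;/code&gt; column and the sample values are hypothetical): convert the DataFrame back to the &lt;code&gt;[header, *rows]&lt;/code&gt; layout that &lt;code&gt;worksheet.update&lt;/code&gt; expects, then push it:&lt;/p&gt;

```python
import pandas as pd


def frame_to_payload(df):
    """Convert a DataFrame to the [header, *rows] layout worksheet.update expects."""
    return [df.columns.tolist()] + df.values.tolist()


df = pd.DataFrame({"product": ["pencil"], "price": [2], "quantity": [10]})
df["total"] = df["price"] * df["quantity"]  # a hypothetical modification
payload = frame_to_payload(df)
# payload == [['product', 'price', 'quantity', 'total'], ['pencil', 2, 10, 20]]
# work_sheet.update(payload)  # uncomment to write the change back to the sheet
```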

&lt;blockquote&gt;
&lt;p&gt;Organizing your data analysis code can become a challenging task, check out our &lt;a href="https://github.com/ploomber/ploomber"&gt;open-source framework&lt;/a&gt; which allows you to build modular data analysis pipelines so you can extract those insights from the spreadsheets!&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>jupyter</category>
      <category>datascience</category>
      <category>analytics</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Analyze and plot 5.5M records in 20s with BigQuery and Ploomber</title>
      <dc:creator>Eduardo Blancas</dc:creator>
      <pubDate>Mon, 08 Aug 2022 21:55:42 +0000</pubDate>
      <link>https://dev.to/edublancas/analyze-and-plot-55m-records-in-20s-with-bigquery-and-ploomber-4837</link>
      <guid>https://dev.to/edublancas/analyze-and-plot-55m-records-in-20s-with-bigquery-and-ploomber-4837</guid>
      <description>&lt;p&gt;This tutorial will show how you can use Google Cloud and Ploomber to develop a scalable and production-ready pipeline.&lt;/p&gt;

&lt;p&gt;We'll use Google BigQuery (a data warehouse) and Cloud Storage to show how we can transform big datasets with ease using SQL, plot the results with Python, and store everything in the cloud. Thanks to BigQuery's scalability (we'll use a dataset with 5.5M records!) and Ploomber's convenience, &lt;strong&gt;the entire process, from importing the data to producing the summary report in the cloud, takes less than 20 seconds!&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Before we begin, I'll quickly go over the two Google Cloud services we use for this project. &lt;a href="https://en.wikipedia.org/wiki/BigQuery"&gt;Google BigQuery&lt;/a&gt; is a serverless data warehouse that allows us to analyze data at scale. In simpler terms, we can store massive datasets and query them using SQL without managing servers. On the other hand, &lt;a href="https://en.wikipedia.org/wiki/Google_Cloud_Storage"&gt;Google Cloud Storage&lt;/a&gt; is a storage service equivalent to Amazon S3.&lt;/p&gt;

&lt;p&gt;Since our analysis comprises SQL and Python, we use &lt;a href="https://github.com/ploomber/ploomber"&gt;Ploomber&lt;/a&gt;, an open-source framework for writing maintainable pipelines. It abstracts away the details, so we can focus on writing the SQL and Python scripts.&lt;/p&gt;

&lt;p&gt;Finally, the data. We'll be using a public dataset that contains statistics of people's names in the US over time. The dataset contains 5.5M records. Here's what it looks like:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--a1Sb8q8f--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/dw4vnjifx7wdnt4mkj6w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--a1Sb8q8f--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/dw4vnjifx7wdnt4mkj6w.png" alt="data" width="560" height="581"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's now take a look at the pipeline's architecture!&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture overview
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--8cL0jTUg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/eqied2nt7etnlsdno1sp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--8cL0jTUg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/eqied2nt7etnlsdno1sp.png" alt="architecture" width="880" height="552"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The first step is the &lt;code&gt;create-table.sql&lt;/code&gt; script, which runs a &lt;code&gt;CREATE TABLE&lt;/code&gt; statement to copy a public dataset. &lt;code&gt;create-view.sql&lt;/code&gt; and &lt;code&gt;create-materialized-view.sql&lt;/code&gt; use the resulting table to generate a view and a materialized view (their purpose is to show how to create other types of SQL relations; we don't use their outputs).&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;dump-table.sql&lt;/code&gt; script queries the existing table and dumps the results into a local file. Then, the &lt;code&gt;plot.py&lt;/code&gt; script reads the local data file, generates a plot, and uploads it in HTML format to Cloud Storage. The whole process may seem intimidating, but Ploomber makes it straightforward!&lt;/p&gt;

&lt;p&gt;Let's now configure the cloud services we'll use!&lt;/p&gt;

&lt;h2&gt;
  
  
  Setup
&lt;/h2&gt;

&lt;p&gt;We need to create a bucket in Cloud Storage and a dataset in BigQuery; the following sections explain how to do so.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cloud Storage
&lt;/h3&gt;

&lt;p&gt;Go to the &lt;a href="https://console.cloud.google.com/storage"&gt;Cloud Storage&lt;/a&gt; console (select a project or create a new one, if needed) and create a new bucket (you may use an existing one if you prefer so). In our case, we'll create a bucket "ploomber-bucket" under the project "ploomber":&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--WtPsl-4R--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mi1q5sj0iv9kqqbee7vt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--WtPsl-4R--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mi1q5sj0iv9kqqbee7vt.png" alt="create bucket" width="453" height="334"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then, enter a name (in our case "ploomber-bucket"), and click on "CREATE":&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--MVLS99PU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/l84is5exgaiec21arfye.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--MVLS99PU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/l84is5exgaiec21arfye.png" alt="storage confirm" width="461" height="714"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's now configure BigQuery.&lt;/p&gt;

&lt;h3&gt;
  
  
  BigQuery
&lt;/h3&gt;

&lt;p&gt;Go to the &lt;a href="https://console.cloud.google.com/bigquery"&gt;BigQuery&lt;/a&gt; console and create a dataset. To do so, click on the three stacked dots next to your project's name and then click on "Create dataset":&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--t_w9pkdX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qy6rc9p6lsox2hh6rvk9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--t_w9pkdX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qy6rc9p6lsox2hh6rvk9.png" alt="bigquery-create" width="522" height="377"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now, enter "my_dataset" as the Dataset ID and "us" in &lt;em&gt;Data location&lt;/em&gt; (location&lt;br&gt;
is important since we'll be using a public dataset located in such region),&lt;br&gt;
then click on "CREATE DATASET":&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--1azc8yoh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fi4ymok2i3ed28cwy6w0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--1azc8yoh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fi4ymok2i3ed28cwy6w0.png" alt="bigquery confirm" width="567" height="547"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Google Cloud is ready now! Let's now configure our local environment.&lt;/p&gt;
&lt;h3&gt;
  
  
  Local setup
&lt;/h3&gt;

&lt;p&gt;First, let's authenticate so we can make API calls to Google Cloud. Ensure&lt;br&gt;
you authenticate with an account that has enough permissions in the project&lt;br&gt;
to use BigQuery and Cloud Storage:&lt;/p&gt;




&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud auth login
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;p&gt;If you have trouble, check &lt;a href="https://cloud.google.com/sdk/gcloud/reference/auth"&gt;the docs.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now, let's install Ploomber to get the code example:&lt;/p&gt;




&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# note: this example requires ploomber 0.19.2 or higher&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;ploomber &lt;span class="nt"&gt;--upgrade&lt;/span&gt;

&lt;span class="c"&gt;# download example&lt;/span&gt;
ploomber examples &lt;span class="nt"&gt;-n&lt;/span&gt; templates/google-cloud &lt;span class="nt"&gt;-o&lt;/span&gt; gcloud

&lt;span class="c"&gt;# move to the example folder&lt;/span&gt;
&lt;span class="nb"&gt;cd &lt;/span&gt;gcloud
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;p&gt;Let's now review the structure of the project.&lt;/p&gt;
&lt;h2&gt;
  
  
  Project structure
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;pipeline.yaml&lt;/code&gt; Pipeline declaration&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;clients.py&lt;/code&gt; Functions to create BigQuery and Cloud Storage clients&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;requirements.txt&lt;/code&gt; Python dependencies&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;sql/&lt;/code&gt; SQL scripts (executed in BigQuery)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;scripts/&lt;/code&gt; Python scripts (executed locally, outputs uploaded to Cloud Storage)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can look at the files in detail &lt;a href="https://github.com/ploomber/projects/tree/master/templates/google-cloud"&gt;here.&lt;/a&gt; For this tutorial, I'll quickly mention a few crucial details.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;pipeline.yaml&lt;/code&gt; is the central file in this project; Ploomber uses this file&lt;br&gt;
to assemble your pipeline and run it, here's what it looks like:&lt;/p&gt;




&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Content of pipeline.yaml&lt;/span&gt;
&lt;span class="na"&gt;tasks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="c1"&gt;# NOTE: ensure all products match the dataset name you created&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;source&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;sql/create-table.sql&lt;/span&gt;
    &lt;span class="na"&gt;product&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;my_dataset&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;my_table&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;table&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;source&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;sql/create-view.sql&lt;/span&gt;
    &lt;span class="na"&gt;product&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;my_dataset&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;my_view&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;view&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;source&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;sql/create-materialized-view.sql&lt;/span&gt;
    &lt;span class="na"&gt;product&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;my_dataset&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;my_materialized_view&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;view&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

  &lt;span class="c1"&gt;# dump data locally (and upload outputs to Cloud Storage)&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;source&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;sql/dump-table.sql&lt;/span&gt;
    &lt;span class="na"&gt;product&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;products/dump.parquet&lt;/span&gt;
    &lt;span class="na"&gt;chunksize&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="no"&gt;null&lt;/span&gt;

  &lt;span class="c1"&gt;# process data with Python (and upload outputs to Cloud Storage)&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;source&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;scripts/plot.py&lt;/span&gt;
    &lt;span class="na"&gt;product&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;products/plot.html&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;p&gt;Each task in the &lt;code&gt;pipeline.yaml&lt;/code&gt; file contains two elements: the source code we want to execute and the product. You can see that we have a few SQL scripts that generate tables and views. However, &lt;code&gt;dump-table.sql&lt;/code&gt; declares a &lt;code&gt;.parquet&lt;/code&gt; file as its product, which tells Ploomber to download the query results instead of storing them in BigQuery. Finally, the &lt;code&gt;plot.py&lt;/code&gt; script declares an &lt;code&gt;.html&lt;/code&gt; product; Ploomber will automatically run the script and store the results in the HTML file.&lt;/p&gt;

&lt;p&gt;You might be wondering how the execution order is determined. Ploomber extracts references from the source code itself; for example, &lt;code&gt;create-view.sql&lt;/code&gt; depends on &lt;code&gt;create-table.sql&lt;/code&gt;. If we look at the code, we'll see the reference:&lt;/p&gt;




&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="o"&gt;#&lt;/span&gt; &lt;span class="n"&gt;Content&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="k"&gt;sql&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="k"&gt;create&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="k"&gt;view&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;sql&lt;/span&gt;
&lt;span class="k"&gt;DROP&lt;/span&gt; &lt;span class="k"&gt;VIEW&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="n"&gt;product&lt;/span&gt; &lt;span class="p"&gt;}};&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;VIEW&lt;/span&gt; &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="n"&gt;product&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="n"&gt;upstream&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;"create-table"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;p&gt;There is a placeholder &lt;code&gt;{{ upstream["create-table"] }}&lt;/code&gt;, which indicates that &lt;code&gt;create-table.sql&lt;/code&gt; must run first. At runtime, Ploomber will replace the placeholder with the table name. We also have a second placeholder, &lt;code&gt;{{ product }}&lt;/code&gt;, which will be replaced with the value declared in the &lt;code&gt;pipeline.yaml&lt;/code&gt; file.&lt;/p&gt;
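&lt;p&gt;Conceptually, you can mimic the two steps (extracting upstream references, then rendering the placeholders) with a few lines of Python. This is a simplified illustration, not Ploomber's actual implementation:&lt;/p&gt;

```python
import re

sql = '''
DROP VIEW IF EXISTS {{ product }};
CREATE VIEW {{ product }} AS
SELECT *
FROM {{ upstream["create-table"] }}
'''

# step 1: find which tasks must run first
upstream = re.findall(r'{{\s*upstream\["(.+?)"\]\s*}}', sql)
# upstream == ['create-table']

# step 2: replace the placeholders with concrete relation names at runtime
rendered = (sql
            .replace('{{ product }}', 'my_dataset.my_view')
            .replace('{{ upstream["create-table"] }}', 'my_dataset.my_table'))
```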

&lt;p&gt;That's it for the &lt;code&gt;pipeline.yaml&lt;/code&gt;. Let's review the &lt;code&gt;clients.py&lt;/code&gt; file.&lt;/p&gt;
&lt;h2&gt;
  
  
  Configure &lt;code&gt;clients.py&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;clients.py&lt;/code&gt; contains a function that returns clients to communicate with&lt;br&gt;
BigQuery and Cloud Storage.&lt;/p&gt;

&lt;p&gt;For example, this is how we connect to BigQuery:&lt;/p&gt;




&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Content of clients.py
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;db&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="s"&gt;"""Client to send queries to BigQuery
    """&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;DBAPIClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;p&gt;Note that we're returning a &lt;code&gt;ploomber.clients.DBAPIClient&lt;/code&gt; object. Ploomber&lt;br&gt;
wraps BigQuery's connector behind a generic interface, so the same code works with other databases.&lt;/p&gt;
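`DBAPIClient` takes a DB-API 2.0 `connect` callable plus a dictionary of keyword arguments for it, which is why switching databases only changes the function you pass in. A rough illustration of that interface, using the stdlib `sqlite3` driver instead of BigQuery:

```python
import sqlite3

# DBAPIClient(connect, connect_kwargs) ultimately calls connect(**connect_kwargs);
# every DB-API 2.0 driver exposes the same shape, which makes the client
# database-agnostic. Here we mimic that call with sqlite3:
connect, connect_kwargs = sqlite3.connect, {"database": ":memory:"}
conn = connect(**connect_kwargs)

result = conn.execute("SELECT 1 + 1").fetchone()
print(result)  # (2,)
```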

&lt;p&gt;Next, we configure the Cloud Storage client:&lt;/p&gt;




&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Content of clients.py
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;storage&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="s"&gt;"""Client to upload files to Google Cloud Storage
    """&lt;/span&gt;
    &lt;span class="c1"&gt;# ensure your bucket_name matches
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;GCloudStorageClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bucket_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'ploomber-bucket'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                               &lt;span class="n"&gt;parent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'my-pipeline'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;p&gt;Here, we return a &lt;code&gt;ploomber.clients.GCloudStorageClient&lt;/code&gt; object (ensure&lt;br&gt;
that the &lt;code&gt;bucket_name&lt;/code&gt; matches yours!)&lt;/p&gt;

&lt;p&gt;Great, we're ready to run the pipeline!&lt;/p&gt;
&lt;h2&gt;
  
  
  Running the pipeline
&lt;/h2&gt;

&lt;p&gt;Ensure your terminal is open in the &lt;code&gt;gcloud&lt;/code&gt; folder and execute the following:&lt;/p&gt;




&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# install dependencies&lt;/span&gt;
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt

&lt;span class="c"&gt;# run the pipeline&lt;/span&gt;
ploomber build
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;p&gt;After a few seconds of running the &lt;code&gt;ploomber build&lt;/code&gt; command, you should see&lt;br&gt;
something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;name                      Ran?      Elapsed (s)    Percentage
------------------------  ------  -------------  ------------
create-table              True          5.67999      30.1718
create-view               True          1.84277       9.78868
create-materialized-view  True          1.566         8.31852
dump-table                True          5.57417      29.6097
plot                      True          4.16257      22.1113
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you get an error, you most likely have a misconfiguration. Please send us a&lt;br&gt;
&lt;a href="https://ploomber.io/community"&gt;message on Slack&lt;/a&gt; so we can help you fix it!&lt;/p&gt;

&lt;p&gt;If you open the &lt;a href="https://console.cloud.google.com/bigquery"&gt;BigQuery&lt;/a&gt; console, you'll see the new tables and views:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--YfgINYwN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9dat44pbaw7cw8js667x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--YfgINYwN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9dat44pbaw7cw8js667x.png" alt="bigquery" width="308" height="294"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the &lt;a href="https://console.cloud.google.com/storage"&gt;Cloud Storage&lt;/a&gt; console, you'll see the HTML report:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ctOXTDm---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/tnhcougeayrcgz3mnpts.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ctOXTDm---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/tnhcougeayrcgz3mnpts.png" alt="storage" width="584" height="548"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Finally, if you download and open the HTML file, you'll see the plot!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--AarK7ahw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/tabljlg7ikoovtvs4j8n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--AarK7ahw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/tabljlg7ikoovtvs4j8n.png" alt="plot" width="437" height="317"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Incremental builds
&lt;/h2&gt;

&lt;p&gt;It may take a few iterations to arrive at the final analysis. This process involves&lt;br&gt;
making small changes to your code and rerunning the workflow. Ploomber can&lt;br&gt;
track source code changes to accelerate iterations, so on the next run it only&lt;br&gt;
executes the scripts that are outdated. Enabling this requires a bit of extra&lt;br&gt;
configuration, since Ploomber needs somewhere to store your pipeline's metadata. We&lt;br&gt;
already pre-configured the same workflow to store the metadata in a SQLite&lt;br&gt;
database; you can run it with the following command:&lt;/p&gt;




&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ploomber build &lt;span class="nt"&gt;--entry-point&lt;/span&gt; pipeline.incremental.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;p&gt;If you run the command another time, you'll see that it skips all tasks:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;name                      Ran?      Elapsed (s)    Percentage
------------------------  ------  -------------  ------------
create-table              False               0             0
create-view               False               0             0
create-materialized-view  False               0             0
dump-table                False               0             0
plot                      False               0             0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now try changing &lt;code&gt;plot.py&lt;/code&gt; and rerun the pipeline; you'll see that it skips&lt;br&gt;
most tasks!&lt;/p&gt;
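Conceptually, incremental builds work by recording a fingerprint of each task's source code and skipping tasks whose fingerprint is unchanged since the last run. The snippet below is only a sketch of that idea, not Ploomber's actual implementation:

```python
import hashlib

def fingerprint(source: str) -> str:
    """Hash a task's source code; a different hash marks the task as outdated."""
    return hashlib.sha256(source.encode()).hexdigest()

# recorded on the previous run vs. recomputed on the current run
stored = fingerprint("SELECT * FROM my_table")
current = fingerprint("SELECT * FROM my_table")

# identical fingerprints mean the task can be skipped
print(current == stored)  # True
```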

&lt;h2&gt;
  
  
  Closing remarks
&lt;/h2&gt;

&lt;p&gt;This tutorial showed how to build maintainable and scalable data analysis pipelines on Google Cloud. Ploomber has many other features to simplify your&lt;br&gt;
workflow, such as parametrization (store outputs in a different location each time you&lt;br&gt;
run the pipeline), task parallelization, and even cloud execution (in case&lt;br&gt;
you need more power to run your Python scripts!).&lt;/p&gt;

&lt;p&gt;Check out our &lt;a href="https://docs.ploomber.io/"&gt;documentation&lt;/a&gt; to learn more, and don't hesitate to &lt;a href="https://ploomber.io/community"&gt;send us any questions!&lt;/a&gt;&lt;/p&gt;

</description>
      <category>database</category>
      <category>datascience</category>
      <category>opensource</category>
      <category>cloud</category>
    </item>
    <item>
      <title>Tips and Tricks to Use Jupyter Notebooks Effectively</title>
      <dc:creator>Eduardo Blancas</dc:creator>
      <pubDate>Mon, 01 Aug 2022 17:27:00 +0000</pubDate>
      <link>https://dev.to/edublancas/tips-and-tricks-to-use-jupyter-notebooks-effectively-589n</link>
      <guid>https://dev.to/edublancas/tips-and-tricks-to-use-jupyter-notebooks-effectively-589n</guid>
      <description>&lt;p&gt;The Jupyter Notebook is a web-based interactive computing platform, and it is usually the first tool we learn about in data science. Most of us start our learning journeys in Jupyter notebooks. They are great for learning, practicing, and experimenting.&lt;/p&gt;

&lt;p&gt;There are several reasons why the Jupyter notebook is a highly popular tool. Here are some of them:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Being able to see the code and the output together makes it easier to learn and practice.&lt;/li&gt;
&lt;li&gt;It supports Markdown cells which are great for write-ups, preparing reports, and documenting your work.&lt;/li&gt;
&lt;li&gt;In-line outputs including data visualizations are highly useful for exploratory data analysis.&lt;/li&gt;
&lt;li&gt;You can run the code cell-by-cell which expedites the debugging process as well as understanding other people’s code.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Although we quite often use Jupyter notebooks in our work, we do not always make the most of them and often fail to discover their full potential. In this article, we will go over some tips and tricks to get more out of Jupyter notebooks. Some of these are shortcuts that can increase your efficiency.&lt;/p&gt;

&lt;h3&gt;
  
  
  New cell
&lt;/h3&gt;

&lt;p&gt;Creating a new cell is one of the most frequently done operations while working in a Jupyter notebook so a quick way of doing this is very helpful.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ESC + A creates a new cell above the current cell&lt;/li&gt;
&lt;li&gt;ESC + B creates a new cell below the current cell&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--An_KX3ID--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jjm1igiownqmzjlfv71c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--An_KX3ID--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jjm1igiownqmzjlfv71c.png" alt="Image description" width="454" height="192"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Cell output
&lt;/h3&gt;

&lt;p&gt;One of the great features of Jupyter notebooks is that they maintain the state of execution of each cell. In other words, cell outputs are cached. This is very useful because you do not have to execute a cell each time you want to check its output or results. &lt;/p&gt;

&lt;p&gt;However, some outputs take too much space and make the overall content hard to follow. We can hide a cell output with  “ESC + O” and unhide by pressing these keys again.&lt;/p&gt;

&lt;p&gt;If a cell is not needed anymore, you can delete it with “ESC + D + D”.&lt;/p&gt;

&lt;h3&gt;
  
  
  Magic commands
&lt;/h3&gt;

&lt;p&gt;Magic commands are built into the IPython kernel. They are quite useful for performing a variety of tasks. Magic commands start with the “%” character. Here are some examples that will help you become more productive.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# prints the current working directory&lt;/span&gt;
%pwd

&lt;span class="c"&gt;# change the current working directory&lt;/span&gt;
%cd

&lt;span class="c"&gt;# list files and folders in the current working directory&lt;/span&gt;
%ls

&lt;span class="c"&gt;# list files and folders in a specific folder&lt;/span&gt;
%ls &lt;span class="o"&gt;[&lt;/span&gt;path to folder]

&lt;span class="c"&gt;# export the current current IPython history to a notebook file&lt;/span&gt;
%notebook &lt;span class="o"&gt;[&lt;/span&gt;filename]

&lt;span class="c"&gt;# lists currently available magics&lt;/span&gt;
%lsmagic
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;If you're looking to improve your Jupyter workflow, check out Ploomber's open-source projects: &lt;a href="https://github.com/ploomber/ploomber"&gt;Ploomber&lt;/a&gt; for developing modular data pipelines, &lt;a href="https://github.com/ploomber/soorgeon"&gt;Soorgeon&lt;/a&gt; for refactoring and cleaning notebooks, or &lt;a href="https://github.com/ploomber/nbsnapshot"&gt;nbsnapshot&lt;/a&gt; for notebook testing.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Multiple outputs
&lt;/h3&gt;

&lt;p&gt;By default, when you execute a cell that returns multiple outputs, only the last output is shown. Here is an example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;mylist&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"programming"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;myset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"programming"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;mylist&lt;/span&gt;
&lt;span class="n"&gt;myset&lt;/span&gt;

&lt;span class="c1"&gt;# output
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;'a'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'g'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'i'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'m'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'n'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'o'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'p'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'r'&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It is possible to see all the outputs, but we need to change a setting, as shown below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;IPython.core.interactiveshell&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;InteractiveShell&lt;/span&gt;
&lt;span class="n"&gt;InteractiveShell&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ast_node_interactivity&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"all"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If we execute the code block above, we will see the output of both variables.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;mylist&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"programming"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;myset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"programming"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;mylist&lt;/span&gt;
&lt;span class="n"&gt;myset&lt;/span&gt;

&lt;span class="c1"&gt;# output
&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'p'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'r'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'o'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'g'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'r'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'a'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'m'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'m'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'i'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'n'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'g'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;'a'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'g'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'i'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'m'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'n'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'o'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'p'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'r'&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Shortcuts
&lt;/h3&gt;

&lt;p&gt;There are several shortcuts that you can use in Jupyter notebooks. Here are the ones I find quite useful:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CTRL + D (⌘ + D in Mac) deletes what is written in the current line.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Vkhkaa-W--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/n47d731gmf5v7k6224nn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Vkhkaa-W--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/n47d731gmf5v7k6224nn.png" alt="Image description" width="505" height="156"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ESC + M changes a cell to Markdown.&lt;/li&gt;
&lt;li&gt;ESC + UP or ESC + K selects the cell above.&lt;/li&gt;
&lt;li&gt;ESC + DOWN or ESC + J selects the cell below.&lt;/li&gt;
&lt;li&gt;ESC + SHIFT + M merges selected cells. If only one cell is selected, it is merged with the cell below.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Learning about a function
&lt;/h3&gt;

&lt;p&gt;Python has a rich selection of third-party libraries, which simplify tasks and speed up development. These libraries typically provide many functions and methods, and we sometimes can’t remember exactly what a function does or what its syntax is.&lt;/p&gt;

&lt;p&gt;In such cases, we can view a function's signature and docstring inside the Jupyter notebook. We just need to add &lt;code&gt;?&lt;/code&gt; at the end of the function name. Here is how we can learn about the &lt;code&gt;query&lt;/code&gt; function of the pandas library.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="err"&gt;?&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We have learned about some useful tips and tricks for Jupyter notebooks. You do not have to use all of them immediately but you will see they increase your productivity and efficiency once you start using them.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Questions? &lt;a href="https://ploomber.io/community"&gt;Join our growing community of Jupyter practitioners!&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Credits
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://ploomber.io"&gt;ploomber.io&lt;/a&gt; by Soner Yildirim, re-shared with permission&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.flaticon.com/free-icons/effective"&gt;Effective icons created by Pixel perfect - Flaticon&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.flaticon.com/free-icons/notebook"&gt;Notebook icons created by mikan933 - Flaticon&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>datascience</category>
      <category>machinelearning</category>
      <category>jupyter</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Machine Learning Model Selection with Nested Cross-Validation</title>
      <dc:creator>Eduardo Blancas</dc:creator>
      <pubDate>Wed, 27 Jul 2022 16:14:49 +0000</pubDate>
      <link>https://dev.to/edublancas/machine-learning-model-selection-with-nested-cross-validation-3kgg</link>
      <guid>https://dev.to/edublancas/machine-learning-model-selection-with-nested-cross-validation-3kgg</guid>
      <description>&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/37NIM3RSMz4"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;&lt;a href="https://ploomber.io/blog/nested-cv/"&gt;Supplemental material.&lt;/a&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>ai</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Three Tools for Executing Jupyter Notebooks</title>
      <dc:creator>Eduardo Blancas</dc:creator>
      <pubDate>Tue, 26 Jul 2022 02:35:00 +0000</pubDate>
      <link>https://dev.to/edublancas/three-tools-for-executing-jupyter-notebooks-fob</link>
      <guid>https://dev.to/edublancas/three-tools-for-executing-jupyter-notebooks-fob</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Executing notebooks programmatically can be very helpful in various situations, especially for long-running code (e.g., training a model) or parallelized execution (e.g., training a hundred models at the same time). It is also vital for automating data analysis in projects that run at regular intervals or that involve more than one notebook. This blog post will introduce three commonly used tools for executing notebooks: Ploomber, Papermill, and NBClient.&lt;/p&gt;

&lt;h2&gt;
  
  
  Ploomber
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8cpur8tnfy6gdxzlnrv5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8cpur8tnfy6gdxzlnrv5.png" alt="ploomber logo"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Ploomber is a complete solution for notebook execution. It builds on top of papermill and extends it to allow writing multi-stage workflows where each task is a notebook. Since it manages orchestration automatically, you can &lt;a href="https://github.com/ploomber/projects/tree/master/cookbook/grid" rel="noopener noreferrer"&gt;run notebooks in parallel&lt;/a&gt; without having to write extra code.&lt;/p&gt;

&lt;p&gt;Another feature of Ploomber is that you can use the &lt;a href="https://jupytext.readthedocs.io/en/latest/formats.html#the-percent-format" rel="noopener noreferrer"&gt;percent format&lt;/a&gt; (supported by &lt;a href="https://code.visualstudio.com/docs/python/jupyter-support-py#_jupyter-code-cells" rel="noopener noreferrer"&gt;VSCode&lt;/a&gt;, PyCharm, etc.) and execute it as a notebook, to automatically capture the outputs like charts or tables in an output file.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz7x1snkut539595eaunu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz7x1snkut539595eaunu.png" alt="python-percent-format"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Also, you can export the pipelines to airflow, Kubernetes, etc. Please refer to &lt;a href="https://soopervisor.readthedocs.io/en/latest/" rel="noopener noreferrer"&gt;this documentation&lt;/a&gt; for more information on how to go from a notebook to a production pipeline.&lt;/p&gt;

&lt;p&gt;Ploomber offers two interfaces for notebook execution: YAML and Python. The first one is the easiest to get started, and the second offers more flexibility for building more complex workflows. Furthermore, it provides a free cloud service to execute your notebooks in the cloud and parallelize experiments.&lt;/p&gt;

&lt;h3&gt;
  
  
  Execute Notebooks via Python API
&lt;/h3&gt;

&lt;p&gt;Ploomber offers a Python API for executing notebooks. The following example will run &lt;code&gt;first.ipynb&lt;/code&gt;, then &lt;code&gt;second.ipynb&lt;/code&gt; and store the executed notebooks in &lt;code&gt;out/first.ipynb&lt;/code&gt; and &lt;code&gt;out/second.ipynb&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pathlib&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;ploomber&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DAG&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;ploomber.tasks&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;NotebookRunner&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;ploomber.products&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;File&lt;/span&gt;

&lt;span class="n"&gt;dag&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;DAG&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;first&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;NotebookRunner&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;first.ipynb&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="nc"&gt;File&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;out/first.ipynb&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;second&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;NotebookRunner&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;second.ipynb&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="nc"&gt;File&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;out/second.ipynb&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;first&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;second&lt;/span&gt;

&lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;build&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Execute Notebooks via YAML API
&lt;/h3&gt;

&lt;p&gt;Ploomber also offers a YAML API for executing notebooks:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;tasks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;source&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;first.ipynb&lt;/span&gt;
    &lt;span class="na"&gt;product&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;out/first.ipynb&lt;/span&gt;

  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;source&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;second.ipynb&lt;/span&gt;
    &lt;span class="na"&gt;product&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;out/second.ipynb&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then users call the following code to execute the notebook:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ploomber build
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Execute Notebooks on Cloud
&lt;/h3&gt;

&lt;p&gt;Ploomber also supports executing notebooks on the cloud by running:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ploomber cloud build
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Please refer to &lt;a href="https://docs.ploomber.io/en/latest/cloud/cloud-execution.html" rel="noopener noreferrer"&gt;this documentation&lt;/a&gt; for more information on Cloud Execution with Ploomber.&lt;/p&gt;

&lt;h2&gt;
  
  
  Papermill
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8vsmlyi93rcpo97td8ci.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8vsmlyi93rcpo97td8ci.png" alt="papermill"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Papermill's main feature is injecting parameters into a notebook. This lets you use notebooks as templates (e.g., run the same model-training notebook with different parameters). However, it limits itself to providing a function that executes a single notebook, so there is no built-in way to manage concurrent executions.&lt;/p&gt;
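Parameter injection works through a cell tagged `parameters` in the input notebook, which holds the default values; papermill inserts a new cell right after it containing the overrides you pass in. A sketch of what ends up in the executed notebook (the default values here are hypothetical):

```python
# cell tagged "parameters" in the input notebook: the defaults
alpha = 0.1
ratio = 0.5

# cell injected by papermill right after it, overriding the defaults
# (matching, e.g., parameters=dict(alpha=0.6, ratio=0.1))
alpha = 0.6
ratio = 0.1
```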

&lt;p&gt;There are two ways to execute the notebook:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Python API&lt;/li&gt;
&lt;li&gt;Command Line Interface&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Execute Notebooks via Python API
&lt;/h3&gt;

&lt;p&gt;Papermill offers a Python API. Users can execute notebooks with Papermill by running:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;papermill&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pm&lt;/span&gt;

&lt;span class="n"&gt;pm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute_notebook&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
   &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;path/to/input.ipynb&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;path/to/output.ipynb&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="n"&gt;parameters&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ratio&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
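&lt;p&gt;Under the hood, parameter injection amounts to inserting a new code cell right after the cell tagged &lt;code&gt;parameters&lt;/code&gt;. The following is a minimal standard-library sketch of the idea on a notebook represented as a plain dict; it is not Papermill's actual implementation.&lt;/p&gt;

```python
def inject_parameters(nb, params):
    """Insert a code cell assigning `params` after the cell tagged 'parameters'."""
    lines = [f"{name} = {value!r}" for name, value in params.items()]
    new_cell = {
        "cell_type": "code",
        "metadata": {"tags": ["injected-parameters"]},
        "source": "\n".join(lines),
        "outputs": [],
        "execution_count": None,
    }
    for i, cell in enumerate(nb["cells"]):
        if "parameters" in cell.get("metadata", {}).get("tags", []):
            # the injected cell overrides the defaults declared above it
            nb["cells"].insert(i + 1, new_cell)
            break
    return nb

nb = {"cells": [
    {"cell_type": "code", "metadata": {"tags": ["parameters"]}, "source": "alpha = 0.1"},
    {"cell_type": "code", "metadata": {}, "source": "print(alpha)"},
]}
nb = inject_parameters(nb, {"alpha": 0.6, "ratio": 0.1})
```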



&lt;h3&gt;
  
  
  Execute Notebooks via CLI
&lt;/h3&gt;

&lt;p&gt;Users can also execute notebooks via the CLI. To run a notebook, enter the following &lt;code&gt;papermill&lt;/code&gt; command in the terminal with the input notebook, the location for the output notebook, and any options:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;papermill input.ipynb output.ipynb &lt;span class="nt"&gt;-p&lt;/span&gt; alpha 0.6 &lt;span class="nt"&gt;-p&lt;/span&gt; l1_ratio 0.1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Via the CLI, users can pass parameters as a parameters file, a YAML string, or raw strings. You can refer to &lt;a href="https://papermill.readthedocs.io/en/latest/usage-execute.html" rel="noopener noreferrer"&gt;this documentation&lt;/a&gt; for more information on executing a notebook with parameters via the CLI.&lt;/p&gt;
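&lt;p&gt;If you want to drive the CLI from a script, one approach is to turn a parameters dict into &lt;code&gt;-p&lt;/code&gt; flags and hand the result to &lt;code&gt;subprocess&lt;/code&gt;. A hedged sketch (the helper below is illustrative, not part of Papermill):&lt;/p&gt;

```python
import subprocess

def papermill_command(input_nb, output_nb, params):
    """Build the argv list for a papermill CLI call using -p flags."""
    cmd = ["papermill", input_nb, output_nb]
    for name, value in params.items():
        cmd += ["-p", name, str(value)]
    return cmd

cmd = papermill_command("input.ipynb", "output.ipynb",
                        {"alpha": 0.6, "l1_ratio": 0.1})
# subprocess.run(cmd, check=True)  # uncomment to actually execute
```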

&lt;h2&gt;
  
  
  NBClient
&lt;/h2&gt;

&lt;p&gt;NBClient provides a convenient way to execute the input cells of a &lt;code&gt;.ipynb&lt;/code&gt; notebook file and save the results, both input and output cells, as a &lt;code&gt;.ipynb&lt;/code&gt; file. If you need to export notebooks to other formats, such as reStructured Text or Markdown (optionally executing them), please refer to &lt;a href="https://nbconvert.readthedocs.io/en/latest/" rel="noopener noreferrer"&gt;nbconvert&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;It offers a few extra features like notebook-level and cell-level hooks and also supports two ways of executing notebooks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Python API&lt;/li&gt;
&lt;li&gt;Command Line Interface&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Execute Notebooks via Python API
&lt;/h3&gt;

&lt;p&gt;The following quick example shows how to import the &lt;code&gt;nbformat&lt;/code&gt; module and the &lt;code&gt;NotebookClient&lt;/code&gt; class, then load and configure the notebook stored at &lt;code&gt;notebook_filename&lt;/code&gt;. We specified two optional arguments, &lt;code&gt;timeout&lt;/code&gt; and &lt;code&gt;kernel_name&lt;/code&gt;, which define the cell execution timeout and the execution kernel. Usually, we don’t need to set these options, but these and others are available to control the execution context.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;nbformat&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;nbclient&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;NotebookClient&lt;/span&gt;

&lt;span class="n"&gt;nb&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nbformat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;notebook_filename&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;as_version&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;NotebookClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nb&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;600&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;kernel_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;python3&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then we can execute the notebook by running:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And we can save the resulting notebook in the current folder in the file &lt;code&gt;executed_notebook.ipynb&lt;/code&gt; by running:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;nbformat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nb&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;executed_notebook.ipynb&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
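&lt;p&gt;After execution, each code cell carries its results in an &lt;code&gt;outputs&lt;/code&gt; list, as defined by the nbformat v4 schema. Here is a small standard-library sketch of pulling the printed text out of an executed notebook represented as a dict (the helper is illustrative, not part of NBClient):&lt;/p&gt;

```python
def collect_stream_text(nb):
    """Gather the stream (stdout/stderr) text emitted by each executed code cell."""
    chunks = []
    for cell in nb["cells"]:
        if cell.get("cell_type") != "code":
            continue
        for output in cell.get("outputs", []):
            if output.get("output_type") == "stream":
                chunks.append(output.get("text", ""))
    return "".join(chunks)

# a minimal executed-notebook dict following the nbformat v4 layout
executed = {"cells": [
    {"cell_type": "markdown", "source": "# Title"},
    {"cell_type": "code", "outputs": [
        {"output_type": "stream", "name": "stdout", "text": "hello\n"}]},
]}
text = collect_stream_text(executed)
```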



&lt;h3&gt;
  
  
  Execute Notebooks via CLI
&lt;/h3&gt;

&lt;p&gt;NBClient supports running notebooks via the CLI for the most basic use cases. However, for more sophisticated execution options, consider &lt;a href="https://ploomber.io/" rel="noopener noreferrer"&gt;Ploomber&lt;/a&gt;!&lt;/p&gt;

&lt;p&gt;Running a notebook is this easy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;jupyter execute notebook.ipynb
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It expects notebooks as input arguments and accepts optional flags to modify the default behavior. We can also pass more than one notebook:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;jupyter execute notebook.ipynb notebook2.ipynb
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;In a nutshell, NBClient is the most basic way to execute notebooks, and Papermill builds on top of NBClient. Both support running notebooks via a Python API and a CLI.&lt;/p&gt;

&lt;p&gt;Ploomber is the most complete and most convenient solution. It builds on top of Papermill and extends it to allow writing multi-stage workflows where each task is a notebook. Besides the Python API and CLI, Ploomber also lets users execute notebooks via a YAML API or in the cloud.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Enjoyed this article? Join our growing &lt;a href="https://ploomber.io/community" rel="noopener noreferrer"&gt;community of Jupyter users&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/ploomber/ploomber" rel="noopener noreferrer"&gt;Ploomber Source Code&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.ploomber.io/en/stable/" rel="noopener noreferrer"&gt;Ploomber Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/nteract/papermill" rel="noopener noreferrer"&gt;Papermill Source Code&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://papermill.readthedocs.io/en/latest/" rel="noopener noreferrer"&gt;Papermill Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/jupyter/nbclient" rel="noopener noreferrer"&gt;NBClient Source Code&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://nbclient.readthedocs.io/en/latest/client.html" rel="noopener noreferrer"&gt;NBClient Documentation&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>python</category>
      <category>opensource</category>
      <category>github</category>
      <category>datascience</category>
    </item>
    <item>
      <title>The 10 Trending Python Repositories on GitHub (May 2022)</title>
      <dc:creator>Eduardo Blancas</dc:creator>
      <pubDate>Thu, 23 Jun 2022 15:11:10 +0000</pubDate>
      <link>https://dev.to/edublancas/the-10-trending-python-repositories-on-github-may-2022-l92</link>
      <guid>https://dev.to/edublancas/the-10-trending-python-repositories-on-github-may-2022-l92</guid>
      <description>&lt;p&gt;A few months ago, I discovered that GitHub keeps track of &lt;a href="https://github.com/trending/python?since=monthly"&gt;trending repositories&lt;/a&gt;, and since then, I often take a look at it to see what's up. So this month, I decided to share my thoughts on what I found; let's get started!&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;a href="https://github.com/borisdayma/dalle-mini"&gt;DALL·E Mini&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;AI model that generates images from text.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--LwS3uG4q--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/26xr68d6kswq7w0y1v86.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--LwS3uG4q--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/26xr68d6kswq7w0y1v86.png" alt="dall e mini" width="880" height="812"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The announcement of OpenAI's &lt;a href="https://openai.com/dall-e-2/"&gt;DALL·E 2&lt;/a&gt; took the community by storm, but given that it's not publicly available, it's no surprise that this project is seeing significant interest.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;a href="https://github.com/PaddlePaddle/PaddleNLP/blob/develop/README_en.md"&gt;PaddleNLP&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;NLP library with pre-trained models.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--dWZHkijL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mqp6iq6oh201mdj8caz4.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--dWZHkijL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mqp6iq6oh201mdj8caz4.gif" alt="paddle nlp" width="880" height="285"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;PaddleNLP is a library for Natural Language processing. It provides a comprehensive set of Chinese transformer models, and its design is based on Hugging Face's Transformer library.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;a href="https://github.com/hpcaitech/ColossalAI"&gt;ColossalAI&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;A framework for large-scale Deep Learning parallel training.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ANELUyG6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/h5ha20c3f4y7mydseugu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ANELUyG6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/h5ha20c3f4y7mydseugu.png" alt="colossal" width="880" height="178"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As transformer architectures become the standard in many CV and NLP tasks, better performance comes with larger model sizes. Colossal AI aims to provide a simple interface to train large models in parallel.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;a href="https://github.com/iperov/DeepFaceLive"&gt;DeepFaceLive&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;A library to swap faces from a webcam or video.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--OkHBgDXs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ma6ba61280ts7sgwhcfc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--OkHBgDXs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ma6ba61280ts7sgwhcfc.png" alt="DeepFaceLive" width="880" height="650"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;DeepFaceLive allows changing the face in real-time or from a recording. Imagine hopping on a Zoom call and looking like &lt;a href="https://github.com/iperov/DeepFaceLive/blob/master/doc/celebs/Keanu_Reeves/examples.md"&gt;Keanu Reeves&lt;/a&gt;. Crazy!&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;a href="https://github.com/heartexlabs/label-studio"&gt;Label Studio&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;A data labeling tool for audio, text, images, videos, and time series via a UI.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--8hWHc3lO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/oa5imkl285yr9lfqso4y.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--8hWHc3lO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/oa5imkl285yr9lfqso4y.gif" alt="LabelStudio" width="880" height="488"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Getting accurately labeled data is the first task in many ML projects. Label Studio supports many types of data and offers a graphical user interface for labeling it.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;a href="https://github.com/ploomber/ploomber"&gt;Intermission: Ploomber&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Ploomber is a framework to develop pipelines interactively (Jupyter, VSCode) and deploy them to the cloud (Kubernetes, Airflow, AWS Batch, SLURM).&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--xNZpnkot--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/o948pqr0g66m1ilrwwc9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--xNZpnkot--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/o948pqr0g66m1ilrwwc9.png" alt="ploomber" width="880" height="689"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Interactive tools like Jupyter make it hard to develop maintainable projects; Ploomber allows data scientists to keep the interactive workflow they are used to while embracing best practices from software engineering to ease the transition to production.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;a href="https://github.com/bregman-arie/devops-exercises"&gt;DevOps Exercise&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;A collection of &amp;gt;2.2k DevOps interview questions.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--JlGQIeWk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6zp70ep28be96qejyjgb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--JlGQIeWk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6zp70ep28be96qejyjgb.png" alt="devops exercises" width="500" height="120"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The first non-AI repository on the list! This repository hosts more than 2.2k DevOps questions to help you prepare for your interview!&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;a href="https://github.com/PaddlePaddle/PaddleOCR"&gt;PaddleOCR&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;A library for creating Optical Character Recognition tools.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--_5OQHlBA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/pl7jxowshx4d1nsokzyr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--_5OQHlBA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/pl7jxowshx4d1nsokzyr.png" alt="paddle ocr" width="880" height="228"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;PaddleOCR supports many OCR-related algorithms to help users through data production, model training, compression, inference, and deployment.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;a href="https://github.com/iperov/DeepFaceLab"&gt;DeepFaceLab&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;DeepFaceLab is a library to replace faces in videos.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--WshuH7D3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/aakka47qodxfm0a2n5nx.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--WshuH7D3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/aakka47qodxfm0a2n5nx.jpg" alt="deepface lab" width="880" height="838"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Another deepfakes library! According to the repository, more than 95% of deepfake videos are created with DeepFaceLab.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;a href="https://github.com/unifyai/ivy"&gt;IVY&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Ivy aims to provide a single interface for ML frameworks.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--1kh0FDsh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4xcqpl02eo7bvsaw4dff.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--1kh0FDsh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4xcqpl02eo7bvsaw4dff.png" alt="ivy" width="880" height="321"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With the explosion of computational frameworks such as JAX, TensorFlow, PyTorch, MXNet, and NumPy, it's hard for practitioners to keep up and master them. Ivy aims to unify them so you can write once and export to any of them.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;a href="https://github.com/apache/airflow"&gt;Airflow&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Airflow is a platform to author, schedule, and monitor workflows.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--jY9OeWUI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/otgs1quiuqwkemijpjmy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--jY9OeWUI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/otgs1quiuqwkemijpjmy.png" alt="airflow" width="880" height="495"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Airflow is one of the most widely used platforms for managing workflows. It allows you to define workflows as directed acyclic graphs of tasks and schedule them.&lt;/p&gt;




&lt;p&gt;Originally posted at &lt;a href="https://ploomber.io/posts/github-22-05"&gt;ploomber.io&lt;/a&gt;&lt;/p&gt;

</description>
      <category>opensource</category>
      <category>github</category>
      <category>machinelearning</category>
      <category>python</category>
    </item>
    <item>
      <title>From Jupyter to Kubernetes: Refactoring and Deploying Notebooks Using Open-Source Tools</title>
      <dc:creator>Eduardo Blancas</dc:creator>
      <pubDate>Thu, 23 Jun 2022 12:14:44 +0000</pubDate>
      <link>https://dev.to/edublancas/from-jupyter-to-kubernetes-refactoring-and-deploying-notebooks-using-open-source-tools-2a2</link>
      <guid>https://dev.to/edublancas/from-jupyter-to-kubernetes-refactoring-and-deploying-notebooks-using-open-source-tools-2a2</guid>
      <description>&lt;p&gt;Notebooks are great for rapid iterations and prototyping but quickly get messy. After working on a notebook, my code becomes difficult to manage and unsuitable for deployment. In production, code organization is essential for maintainability (it's much easier to improve and debug organized code than a long, messy notebook).&lt;/p&gt;

&lt;p&gt;In this post, I'll describe how you can use our open-source tools to cover the entire life cycle of a Data Science project: starting from a messy notebook and ending with that code running in production. Let's get started!&lt;/p&gt;

&lt;p&gt;The first step is to clean up our notebook with automated tools; then, we'll automatically refactor our monolithic notebook into a modular pipeline with &lt;code&gt;soorgeon&lt;/code&gt;; after that, we'll test that our pipeline runs; and, finally, we'll deploy our pipeline to Kubernetes. The main benefit of this workflow is that all steps are fully automated, so we can return to Jupyter, iterate (or fix bugs), and deploy again effortlessly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cleaning up the notebook
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkjbugwa11fe0guur493e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkjbugwa11fe0guur493e.png" alt="soorgeon-clean"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The interactivity of notebooks makes it simple to try out new ideas, but it also yields messy code. While exploring data, we often rush to write code without considering readability. Lucky for us, there are tools like &lt;a href="https://github.com/PyCQA/isort" rel="noopener noreferrer"&gt;isort&lt;/a&gt; and &lt;a href="https://github.com/psf/black" rel="noopener noreferrer"&gt;black&lt;/a&gt; which allow us to easily re-format our code to improve readability. Unfortunately, these tools only work with &lt;code&gt;.py&lt;/code&gt; files; however, &lt;code&gt;soorgeon&lt;/code&gt; enables us to run them on notebook files (&lt;code&gt;.ipynb&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;soorgeon
soorgeon clean path/to/notebook.ipynb
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Note: If you need an example notebook to try these commands, here's one:&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl https://raw.githubusercontent.com/ploomber/soorgeon/main/examples/machine-learning/nb.ipynb &lt;span class="nt"&gt;-o&lt;/span&gt; notebook.ipynb
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check out the image at the beginning of this section: I introduced some extra whitespace in the notebook on the left. After applying &lt;code&gt;soorgeon clean&lt;/code&gt; (notebook on the right), the extra whitespace is gone. So now we can focus on writing code and run &lt;code&gt;soorgeon clean&lt;/code&gt; to auto-format it easily!&lt;/p&gt;
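&lt;p&gt;Since a &lt;code&gt;.ipynb&lt;/code&gt; file is just JSON, the core of the trick is easy to picture with the standard library alone. The toy sketch below only strips trailing whitespace from code cells; &lt;code&gt;soorgeon clean&lt;/code&gt; itself delegates to black and isort:&lt;/p&gt;

```python
import json

def strip_trailing_whitespace(nb):
    """Remove trailing whitespace from every line of each code cell."""
    for cell in nb["cells"]:
        if cell["cell_type"] == "code":
            src = cell["source"]
            # nbformat allows source as a single string or a list of lines
            lines = src.split("\n") if isinstance(src, str) else src
            cell["source"] = "\n".join(line.rstrip() for line in lines)
    return nb

nb = {"cells": [{"cell_type": "code", "source": "x = 1   \ny = 2  "}]}
cleaned = strip_trailing_whitespace(nb)
# json.dump(cleaned, ...) would write the cleaned notebook back to disk
```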

&lt;h2&gt;
  
  
  Refactoring the notebook
&lt;/h2&gt;

&lt;p&gt;Creating an analysis in a single notebook is convenient: we can move around sections and edit them easily; however, this has many drawbacks: it's hard to collaborate and test. Organizing our analysis in multiple files will allow us to define clear boundaries so multiple people can work on the project without getting in each other's way.&lt;/p&gt;

&lt;p&gt;The process of going from a single notebook to a modular pipeline is time-consuming and error-prone; fortunately, &lt;code&gt;soorgeon&lt;/code&gt; can do the heavy lifting for us:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;soorgeon
soorgeon refactor path/to/notebook.ipynb
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Upon refactoring, we'll see a bunch of new files:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw6za31xxny7gq26mxs6d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw6za31xxny7gq26mxs6d.png" alt="file tree"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;soorgeon&lt;/code&gt; turns our notebook into a modularized project automatically! It generates a &lt;code&gt;README.md&lt;/code&gt; with basic instructions and a &lt;code&gt;requirements.txt&lt;/code&gt; (extracting package names from &lt;code&gt;import&lt;/code&gt; statements). Furthermore, it creates a &lt;code&gt;tasks/&lt;/code&gt; directory with a few &lt;code&gt;.ipynb&lt;/code&gt; files; these files come from the original notebook's sections, separated by Markdown headings. &lt;code&gt;soorgeon refactor&lt;/code&gt; figures out which sections depend on which ones.&lt;/p&gt;

&lt;p&gt;If you prefer to export &lt;code&gt;.py&lt;/code&gt; files, you can pass the &lt;code&gt;--file-format&lt;/code&gt; option:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;soorgeon refactor nb.ipynb --file-format py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;tasks/&lt;/code&gt; directory will have &lt;code&gt;.py&lt;/code&gt; files this time:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;.
├── README.md
├── nb.ipynb
├── pipeline.yaml
├── requirements.txt
└── tasks
    ├── clean.py
    ├── linear-regression.py
    ├── load.py
    ├── random-forest-regressor.py
    └── train-test-split.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;soorgeon&lt;/code&gt; uses Markdown headings to determine how many output tasks to generate. In our case, there are five of them. Then, &lt;code&gt;soorgeon&lt;/code&gt; analyzes the code to resolve the dependencies among sections and adds the necessary code to pass outputs to each task.&lt;/p&gt;

&lt;p&gt;For example, our "Train test split" section creates the variables &lt;code&gt;X&lt;/code&gt;, &lt;code&gt;y&lt;/code&gt;, &lt;code&gt;X_train&lt;/code&gt;, &lt;code&gt;X_test&lt;/code&gt;, &lt;code&gt;y_train&lt;/code&gt;, and &lt;code&gt;y_test&lt;/code&gt;, and the last four variables are used by the "Linear regression" section:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsbj7qx56oup60dpsqxrs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsbj7qx56oup60dpsqxrs.png" alt="notebook-sections"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;By determining input and output variables, &lt;code&gt;soorgeon&lt;/code&gt; determines that the "Linear regression" section depends on the "Train test split" section. Furthermore, the "Random Forest Regressor" section also depends on the "Train test split" since it also uses the variables generated by the "Train test split" section. With this information, &lt;code&gt;soorgeon&lt;/code&gt; builds the dependency graph.&lt;/p&gt;
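&lt;p&gt;The dependency resolution described above can be pictured as set algebra over each section's defined and used variables. The sketch below is an illustration of that idea, not &lt;code&gt;soorgeon&lt;/code&gt;'s actual code; the sections mirror the example in this post:&lt;/p&gt;

```python
def infer_dependencies(sections):
    """sections maps name -> (defined_vars, used_vars); returns name -> upstream names."""
    deps = {}
    for name, (_, used) in sections.items():
        upstream = set()
        for other, (defined, _) in sections.items():
            # a section depends on any other section that defines a variable it uses
            if other != name and used.intersection(defined):
                upstream.add(other)
        deps[name] = upstream
    return deps

sections = {
    "train-test-split": ({"X", "y", "X_train", "X_test", "y_train", "y_test"}, {"df"}),
    "linear-regression": (set(), {"X_train", "X_test", "y_train", "y_test"}),
    "random-forest-regressor": (set(), {"X_train", "X_test", "y_train", "y_test"}),
}
deps = infer_dependencies(sections)
```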

&lt;h2&gt;
  
  
  Testing our pipeline
&lt;/h2&gt;

&lt;p&gt;Now it's time to ensure that our modular pipeline runs correctly. To do so, we'll use the second package in our toolbox: &lt;code&gt;ploomber&lt;/code&gt;. Ploomber allows us to develop and execute our pipelines locally.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# install dependencies&lt;/span&gt;
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt

&lt;span class="c"&gt;# execute pipeline&lt;/span&gt;
ploomber build
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;name               Ran?      Elapsed (s)    Percentage
-----------------  ------  -------------  ------------
load               True         14.4272       38.6993
clean              True          7.89353      21.1734
train-test-split   True          2.98341       8.00263
linear-regression  True          3.77029      10.1133
random-forest-regressor  True    8.20591       22.0113
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;ploomber&lt;/code&gt; offers a lot of tools to manage our pipeline; for example, we can generate a plot:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ploomber plot
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq31y3tde6drw7z02kbb2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq31y3tde6drw7z02kbb2.png" alt="pipeline-plot"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can see the dependency graph; there are three serial tasks: &lt;code&gt;load&lt;/code&gt;, &lt;code&gt;clean&lt;/code&gt;, and &lt;code&gt;train-test-split&lt;/code&gt;. After them, we see two independent tasks: &lt;code&gt;linear-regression&lt;/code&gt;, and &lt;code&gt;random-forest-regressor&lt;/code&gt;. The advantage of modularizing our work is that members of our team can work independently, we can &lt;a href="https://docs.ploomber.io/en/latest/user-guide/testing.html" rel="noopener noreferrer"&gt;test tasks&lt;/a&gt; in isolation, and run independent tasks in &lt;a href="https://docs.ploomber.io/en/latest/api/_modules/executors/ploomber.executors.Parallel.html" rel="noopener noreferrer"&gt;parallel&lt;/a&gt;. With &lt;code&gt;ploomber&lt;/code&gt; we can keep developing the pipeline with Jupyter until we're ready to deploy!&lt;/p&gt;
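&lt;p&gt;Which tasks can run in parallel follows directly from the dependency graph: repeatedly schedule every task whose upstream tasks have all finished. A minimal sketch of that scheduling idea (illustrative only, not Ploomber's executor; the graph mirrors the plot above):&lt;/p&gt;

```python
def execution_waves(deps):
    """Group tasks into waves; tasks in the same wave can run in parallel."""
    done, waves = set(), []
    while done != set(deps):
        # a task is ready once every upstream dependency has completed
        ready = sorted(t for t, up in deps.items()
                       if t not in done and up.issubset(done))
        if not ready:
            raise ValueError("cycle detected")
        waves.append(ready)
        done.update(ready)
    return waves

deps = {
    "load": set(),
    "clean": {"load"},
    "train-test-split": {"clean"},
    "linear-regression": {"train-test-split"},
    "random-forest-regressor": {"train-test-split"},
}
waves = execution_waves(deps)
```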

&lt;h2&gt;
  
  
  Deployment
&lt;/h2&gt;

&lt;p&gt;To keep things simple, you may deploy your Ploomber pipeline with &lt;a href="https://ploomber.io/blog/cron/" rel="noopener noreferrer"&gt;cron&lt;/a&gt;, and run &lt;code&gt;ploomber build&lt;/code&gt; on a schedule. However, in some cases, you may want to leverage existing infrastructure. We got you covered! With &lt;code&gt;soopervisor&lt;/code&gt;, you can export your pipeline to &lt;a href="https://soopervisor.readthedocs.io/en/latest/tutorials/airflow.html" rel="noopener noreferrer"&gt;Airflow&lt;/a&gt;, &lt;a href="https://soopervisor.readthedocs.io/en/latest/tutorials/aws-batch.html" rel="noopener noreferrer"&gt;AWS Batch&lt;/a&gt;, &lt;a href="https://soopervisor.readthedocs.io/en/latest/tutorials/kubernetes.html" rel="noopener noreferrer"&gt;Kubernetes&lt;/a&gt;, &lt;a href="https://soopervisor.readthedocs.io/en/latest/tutorials/slurm.html" rel="noopener noreferrer"&gt;SLURM&lt;/a&gt;, or &lt;a href="https://soopervisor.readthedocs.io/en/latest/tutorials/kubeflow.html" rel="noopener noreferrer"&gt;Kubeflow&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# add a target environment named 'argo'&lt;/span&gt;
soopervisor add argo &lt;span class="nt"&gt;--backend&lt;/span&gt; argo-workflows

&lt;span class="c"&gt;# generate argo yaml spec&lt;/span&gt;
soopervisor &lt;span class="nb"&gt;export &lt;/span&gt;argo &lt;span class="nt"&gt;--skip-tests&lt;/span&gt;  &lt;span class="nt"&gt;--ignore-git&lt;/span&gt;

&lt;span class="c"&gt;# submit workflow&lt;/span&gt;
argo submit &lt;span class="nt"&gt;-n&lt;/span&gt; argo argo/argo.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;soopervisor add&lt;/code&gt; adds some files to our project, like a preconfigured &lt;code&gt;Dockerfile&lt;/code&gt; (which we can modify if we want to). On the other hand, &lt;code&gt;soopervisor export&lt;/code&gt; takes our existing pipeline and exports it to Argo Workflows so we can run it on Kubernetes.&lt;/p&gt;

&lt;p&gt;By changing the &lt;code&gt;--backend&lt;/code&gt; argument in the &lt;code&gt;soopervisor add&lt;/code&gt; command, you can switch to other supported platforms. Alternatively, you may sign up for our &lt;a href="https://docs.ploomber.io/en/latest/cloud/cloud-execution.html" rel="noopener noreferrer"&gt;free cloud service&lt;/a&gt;, which allows you to run your notebooks in the cloud with one command.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final remarks
&lt;/h2&gt;

&lt;p&gt;Notebook cleaning and refactoring are time-consuming and error-prone, and we are developing tools to make this process a breeze. In this blog post, we went from a monolithic notebook to a modular pipeline running in production&amp;mdash;all of it in an automated way using open-source tools. So please let us know what features you'd like to see. &lt;a href="https://ploomber.io/community" rel="noopener noreferrer"&gt;Join our community&lt;/a&gt; and share your thoughts!&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>machinelearning</category>
      <category>jupyter</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>Collaborative Data Science with Ploomber</title>
      <dc:creator>Eduardo Blancas</dc:creator>
      <pubDate>Sat, 27 Jun 2020 23:03:50 +0000</pubDate>
      <link>https://dev.to/edublancas/collaborative-data-science-with-ploomber-ad</link>
      <guid>https://dev.to/edublancas/collaborative-data-science-with-ploomber-ad</guid>
      <description>&lt;p&gt;&lt;em&gt;Introducing Ploomber Spec API, a simple way to sync Data Science teamwork using a short YAML file.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The &lt;em&gt;did you upload the latest data version yet?&lt;/em&gt; nightmare
&lt;/h2&gt;

&lt;p&gt;Data pipelines are multi-stage processes. Whether you are doing data visualization or training a Machine Learning model, there is an inherent workflow structure. The following diagram shows a typical Machine Learning pipeline:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--m_lfANTZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/lychkkow4ut8lzlcctlw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--m_lfANTZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/lychkkow4ut8lzlcctlw.png" alt="pipeline" width="880" height="304"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Imagine you are working with three colleagues, taking one feature branch each (A to D). To execute your pipeline, you could write a &lt;em&gt;master script&lt;/em&gt; that runs all tasks from left to right.&lt;/p&gt;

&lt;p&gt;During development, end-to-end runs are rare because they take too long. If you want to skip redundant computations (tasks whose source code has not changed), you could open the master script and manually run outdated tasks, but this will soon turn into a mess. Not only do you have to keep track of each task's status along your feature branch, but also of every task that merges with any of your tasks. For example, if you need to use the output from "Join features", you have to ensure the output generated by all four branches is up-to-date.&lt;/p&gt;

&lt;p&gt;Since full runs take too long and keeping track of outdated tasks manually is a laborious process, you might resort to the &lt;em&gt;evil trick of sharing intermediate results&lt;/em&gt;. Everyone uploads the &lt;em&gt;latest version&lt;/em&gt; of all or some selected tasks (most likely the ones with the computed features) to a shared location that you can then copy to your local workspace.&lt;/p&gt;

&lt;p&gt;But to generate the &lt;em&gt;latest&lt;/em&gt; version of each task, you have to ensure it was generated from the &lt;em&gt;latest version&lt;/em&gt; of all its upstream dependencies, which takes us back to the original problem: there are simply no guarantees about data lineage. Using a data file whose origin is unknown as input severely compromises reproducibility.&lt;/p&gt;

&lt;h2&gt;
  
  
  Introducing the Ploomber Spec API
&lt;/h2&gt;

&lt;p&gt;The new API offers a simple way to sync Data Science teamwork. All you have to do is list the source code location and products (files or database tables/views) for each task in a &lt;code&gt;pipeline.yaml&lt;/code&gt; file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# pipeline.yaml&lt;/span&gt;

&lt;span class="c1"&gt;# clean data from the raw table&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;source&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;clean.sql&lt;/span&gt;
  &lt;span class="na"&gt;product&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;clean_data&lt;/span&gt;
  &lt;span class="c1"&gt;# function that returns a db client&lt;/span&gt;
  &lt;span class="na"&gt;client&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;db.get_client&lt;/span&gt;

&lt;span class="c1"&gt;# aggregate clean data&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;source&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aggregate.sql&lt;/span&gt;
  &lt;span class="na"&gt;product&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;agg_data&lt;/span&gt;
  &lt;span class="na"&gt;client&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;db.get_client&lt;/span&gt;

&lt;span class="c1"&gt;# dump data to a csv file&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;SQLDump&lt;/span&gt;
  &lt;span class="na"&gt;source&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dump_agg_data.sql&lt;/span&gt;
  &lt;span class="na"&gt;product&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;output/data.csv&lt;/span&gt;  
  &lt;span class="na"&gt;client&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;db.get_client&lt;/span&gt;

&lt;span class="c1"&gt;# visualize data from csv file&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;source&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;plot.py&lt;/span&gt;
  &lt;span class="na"&gt;product&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# where to save the executed notebook&lt;/span&gt;
    &lt;span class="na"&gt;nb&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;output/executed-notebook-plot.ipynb&lt;/span&gt;
    &lt;span class="c1"&gt;# tasks can generate other outputs&lt;/span&gt;
    &lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;output/some_data.csv&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Ploomber will analyze your source code to determine dependencies and skip a task if its source code (and the source code of all its upstream dependencies) has not changed since the last run.&lt;/p&gt;
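&lt;p&gt;The skipping logic can be pictured with a short sketch (hypothetical code for illustration, not Ploomber's actual implementation): a task must run if its source changed since the last run, or if any of its upstream dependencies must run.&lt;/p&gt;

```python
import hashlib

def source_hash(source):
    # hash the source code; a changed hash means the task must run again
    return hashlib.sha256(source.encode()).hexdigest()

def is_outdated(task, sources, upstream, last_hashes):
    # a task is outdated if its own source changed since the last run,
    # or if any of its upstream dependencies is outdated
    changed = last_hashes.get(task) != source_hash(sources[task])
    return changed or any(
        is_outdated(dep, sources, upstream, last_hashes)
        for dep in upstream.get(task, ())
    )

sources = {'load': 'SELECT * FROM raw', 'clean': 'SELECT * FROM load_output'}
upstream = {'clean': ['load']}
last_hashes = {name: source_hash(code) for name, code in sources.items()}

print(is_outdated('clean', sources, upstream, last_hashes))  # False: nothing changed
sources['load'] = 'SELECT * FROM raw_v2'
print(is_outdated('clean', sources, upstream, last_hashes))  # True: upstream changed
```

&lt;p&gt;Because the check recurses through upstream tasks, a change anywhere in the dependency chain propagates to all downstream tasks.&lt;/p&gt;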

&lt;p&gt;Say your colleagues updated a few tasks. To bring the pipeline up-to-date:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git pull
ploomber entry pipeline.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you run the command again, nothing will be executed.&lt;/p&gt;

&lt;p&gt;Apart from helping you sync with your team, Ploomber is great for developing pipelines iteratively: modify any part and build again; only modified tasks will be executed. Since the tool is not tied to git, you can experiment without committing changes; if you don't like them, just discard them and build again. If your pipeline fails, fix the issue and build again; execution will resume from the point of failure.&lt;/p&gt;

&lt;p&gt;Ploomber is robust to code style changes. It won't trigger execution if you only added whitespace or formatted your source code.&lt;/p&gt;
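&lt;p&gt;One way to get this behavior (a sketch of the idea, not Ploomber's actual code) is to fingerprint the parsed syntax tree instead of the raw text, so whitespace and formatting changes leave the fingerprint untouched:&lt;/p&gt;

```python
import ast
import hashlib

def code_fingerprint(source):
    # parse the code and dump the AST: whitespace and formatting disappear,
    # so only meaningful changes alter the fingerprint
    tree = ast.dump(ast.parse(source))
    return hashlib.sha256(tree.encode()).hexdigest()

original = "x = 1\ny = x + 2\n"
reformatted = "x  =  1\n\ny = x   +   2\n"
modified = "x = 1\ny = x + 3\n"

print(code_fingerprint(original) == code_fingerprint(reformatted))  # True: no re-run
print(code_fingerprint(original) == code_fingerprint(modified))     # False: re-run
```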

&lt;h2&gt;
  
  
  Try it!
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://mybinder.org/v2/gh/ploomber/projects/master?filepath=spec%2FREADME.md"&gt;&lt;strong&gt;Click here to try out the live demo (no installation required).&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you prefer to run it locally:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="s2"&gt;"ploomber[all]"&lt;/span&gt;

&lt;span class="c"&gt;# create a new project with basic structure&lt;/span&gt;
ploomber new

ploomber entry pipeline.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you want to know how Ploomber works and what other neat features there are, keep on reading.&lt;/p&gt;

&lt;h2&gt;
  
  
  Inferring dependencies and injecting products
&lt;/h2&gt;

&lt;p&gt;For Jupyter notebooks (and annotated Python scripts), Ploomber looks for a "parameters" cell and extracts dependencies from an "upstream" variable.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# annotated python file, it will be converted to a notebook during execution
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;

&lt;span class="c1"&gt;# + tags=["parameters"]
&lt;/span&gt;&lt;span class="n"&gt;upstream&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;'some_task'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;product&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

&lt;span class="c1"&gt;# +
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;upstream&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'some_task'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="c1"&gt;# do data processing...
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;to_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;product&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'data'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In SQL files, it will look for an "upstream" placeholder:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="p"&gt;{{&lt;/span&gt;&lt;span class="n"&gt;product&lt;/span&gt;&lt;span class="p"&gt;}}&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="p"&gt;{{&lt;/span&gt;&lt;span class="n"&gt;upstream&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'some_task'&lt;/span&gt;&lt;span class="p"&gt;]}}&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once dependencies are figured out, the next step is to inject the products declared in the YAML file into the source code, and the upstream dependencies into downstream consumers.&lt;/p&gt;

&lt;p&gt;In Jupyter notebooks and Python scripts, Ploomber injects a cell with a variable called "product" and an "upstream" dictionary with the location of its upstream dependencies.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;

&lt;span class="c1"&gt;# + tags=["parameters"]
&lt;/span&gt;&lt;span class="n"&gt;upstream&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;'some_task'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;product&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

&lt;span class="c1"&gt;# + tags=["injected-parameters"]
# this task uses the output from "some_task" as input
&lt;/span&gt;&lt;span class="n"&gt;upstream&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;'some_task'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;'output/from/some_task.csv'&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;product&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;'data'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;'output/current/task.csv'&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# +
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;upstream&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'some_task'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="c1"&gt;# do data processing...
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;to_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;product&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'data'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For SQL files, Ploomber replaces the placeholders with the appropriate table/view names. For example, the SQL script shown above will be resolved as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;clean&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;some_table&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Embracing Jupyter notebooks as an output format
&lt;/h2&gt;

&lt;p&gt;If you look at the &lt;code&gt;plot.py&lt;/code&gt; task above, you'll notice that it has two products. This is because "source" is interpreted as the set of instructions to execute, while the product is the executed notebook with cell outputs. This executed notebook serves as a rich log that can include tables and charts, which is incredibly useful for debugging data processing code.&lt;/p&gt;

&lt;p&gt;Since existing cell outputs from the source file are ignored, there is no strong reason to use &lt;code&gt;.ipynb&lt;/code&gt; files as sources. We highly recommend working with annotated Python scripts (&lt;code&gt;.py&lt;/code&gt;) instead. They will be converted to notebooks at runtime via &lt;a href="https://github.com/mwouts/jupytext"&gt;jupytext&lt;/a&gt; and then executed using &lt;a href="https://github.com/nteract/papermill"&gt;papermill&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Another nice feature of jupytext is that you can develop Python scripts interactively. Once you start &lt;code&gt;jupyter notebook&lt;/code&gt;, your &lt;code&gt;.py&lt;/code&gt; files will render as regular &lt;code&gt;.ipynb&lt;/code&gt; files. You can modify and execute cells at will, but building your pipeline will enforce a top-to-bottom execution. This helps prevent the most common source of errors in Jupyter notebooks: hidden state due to out-of-order cell execution.&lt;/p&gt;

&lt;p&gt;Using annotated Python scripts makes code versioning simpler. Jupyter notebooks (&lt;code&gt;.ipynb&lt;/code&gt;) are JSON files, which makes code reviews and merges harder; by using plain scripts as sources and notebooks as products, you get the best of both worlds: simple code versioning and rich execution logs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Seamlessly mix Python and (templated) SQL
&lt;/h2&gt;

&lt;p&gt;If your data lives in a database, you could write a Python script that connects to it, sends the query and closes the connection. Ploomber allows you to skip boilerplate code so you focus on writing the SQL part. You could even write entire pipelines using SQL alone.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://jinja.palletsprojects.com/en/2.11.x/"&gt;jinja&lt;/a&gt; templating is integrated, which can help you modularize your SQL code by using &lt;a href="https://jinja.palletsprojects.com/en/2.11.x/templates/#macros"&gt;macros&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If your database is supported by &lt;a href="https://www.sqlalchemy.org/"&gt;SQLAlchemy&lt;/a&gt; or it has a client that implements the &lt;a href="https://www.python.org/dev/peps/pep-0249/"&gt;DBAPI&lt;/a&gt; interface, it will work with Ploomber. This covers pretty much all databases.&lt;/p&gt;
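&lt;p&gt;As an illustration, here is what the &lt;code&gt;db.get_client&lt;/code&gt; function referenced in the &lt;code&gt;pipeline.yaml&lt;/code&gt; above might look like. This is a hypothetical sketch using Python's built-in &lt;code&gt;sqlite3&lt;/code&gt; module (a DBAPI-compliant client); a real project would point it at its actual database, and Ploomber may expect the connection wrapped in one of its client classes:&lt;/p&gt;

```python
import sqlite3

# db.py: a hypothetical module exposing the get_client function referenced
# in pipeline.yaml. sqlite3 ships with Python and implements the DBAPI
# interface; swap the URI for your actual database in a real project.
def get_client(uri=':memory:'):
    return sqlite3.connect(uri)

# quick sanity check that the client works
conn = get_client()
conn.execute('CREATE TABLE clean_data (x INTEGER)')
conn.execute('INSERT INTO clean_data VALUES (1), (2)')
rows = conn.execute('SELECT COUNT(*) FROM clean_data').fetchone()[0]
print(rows)  # 2
conn.close()
```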

&lt;h2&gt;
  
  
  Interactive development and debugging
&lt;/h2&gt;

&lt;p&gt;Ploomber gives you structure without sacrificing interactivity. You can load your pipeline and interact with it using the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ipython &lt;span class="nt"&gt;-i&lt;/span&gt; &lt;span class="nt"&gt;-m&lt;/span&gt; ploomber.entry pipeline.yaml &lt;span class="nt"&gt;--&lt;/span&gt; &lt;span class="nt"&gt;--action&lt;/span&gt; status
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will start an interactive session with a &lt;code&gt;dag&lt;/code&gt; object.&lt;/p&gt;

&lt;p&gt;Visualize pipeline dependencies (requires &lt;a href="https://graphviz.org/"&gt;graphviz&lt;/a&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can interactively develop Python scripts:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'task'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;develop&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will open the Python script as a Jupyter notebook with injected parameters but will remove them before the file is saved.&lt;/p&gt;

&lt;p&gt;Line by line debugging is also supported:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'task'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;debug&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Since SQL code goes through a rendering step to replace placeholders, it is useful to see what the rendered code looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'sql_task'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;source&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Closing remarks
&lt;/h2&gt;

&lt;p&gt;There are many more features available through the Python API that are not yet implemented in the spec API. We are currently porting some of the most important features (integration testing, task parallelization).&lt;/p&gt;

&lt;p&gt;We want to keep the spec API short and simple for data scientists looking to get the Ploomber experience without having to learn the Python framework. For many projects, the spec API is more than enough.&lt;/p&gt;

&lt;p&gt;The Python API is recommended for projects that require advanced features such as dynamic pipelines (pipelines whose exact number of tasks is determined by its parameters).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where to go from here&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/ploomber/ploomber"&gt;Github&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ploomber.readthedocs.io/en/stable/"&gt;Documentation&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Found an error in this post? &lt;a href="https://github.com/ploomber/posts/issues/new?title=Issue%20in%20post%3A%20%22Collaborative%20Data%20Science%20with%20Ploomber%22"&gt;Click here to let us know&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;Originally posted at &lt;a href="https://ploomber.io/posts/collaborative-ds"&gt;ploomber.io&lt;/a&gt;&lt;/p&gt;

</description>
      <category>jupyter</category>
      <category>python</category>
      <category>datascience</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Rethinking Continuous Integration for Data Science</title>
      <dc:creator>Eduardo Blancas</dc:creator>
      <pubDate>Thu, 18 Jun 2020 00:54:56 +0000</pubDate>
      <link>https://dev.to/edublancas/rethinking-continuous-integration-for-data-science-1c0c</link>
      <guid>https://dev.to/edublancas/rethinking-continuous-integration-for-data-science-1c0c</guid>
      <description>&lt;h1&gt;
  
  
  Prelude: Software development practice in Data Science
&lt;/h1&gt;

&lt;p&gt;As Data Science and Machine learning get wider industry adoption, practitioners realize that deploying data products comes with a high (and often unexpected) maintenance cost. As &lt;a href="https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf"&gt;Sculley and co-authors&lt;/a&gt; argue in their well-known paper:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;(ML systems) have all the maintenance problems of traditional code plus an additional set of ML-specific issues.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Paradoxically, even though data-intensive systems have higher maintenance cost than their traditional software counterparts, software engineering best practices are mostly overlooked. Based on my conversations with fellow data scientists, I believe that such practices are ignored primarily because they are perceived as unnecessary extra work due to misaligned incentives.&lt;/p&gt;

&lt;p&gt;A data project's ultimate objective is to impact the business, but this impact is hard to assess during development. How much impact will a dashboard have? What about a predictive model? If the product is not yet in production, it is hard to estimate business impact, so we resort to proxy metrics: for decision-making tools, business stakeholders might subjectively judge how much a new dashboard can help them improve their decisions; for a predictive model, we could come up with a rough estimate based on the model's performance.&lt;/p&gt;

&lt;p&gt;This causes the tool (e.g. a dashboard or model) to be perceived as the only valuable piece in the data pipeline, because it is what the proxy metric acts upon. In consequence, most time and effort is spent trying to improve this final deliverable, while all the previous intermediate steps get less attention.&lt;/p&gt;

&lt;p&gt;If the project is taken to production, depending on the overall code quality, the team might have to refactor a lot of the codebase to be production-ready. This refactoring can range from small improvements to a complete overhaul; the more changes the project goes through, the harder it will be to reproduce the original results. All of this can severely delay the launch or put it at risk.&lt;/p&gt;

&lt;p&gt;A better approach is to keep our code deploy-ready (or almost) at all times. This calls for a workflow that ensures our code is always tested and our results are always reproducible. This concept is called Continuous Integration, and it is a widely adopted practice in software engineering. This blog post introduces an adapted CI procedure that can be effectively applied to data projects with existing open source tools.&lt;/p&gt;

&lt;h1&gt;
  
  
  Summary
&lt;/h1&gt;

&lt;ol&gt;
&lt;li&gt;Structure your pipeline in several tasks, each one saving intermediate results to disk&lt;/li&gt;
&lt;li&gt;Implement your pipeline in such a way that you can parametrize it&lt;/li&gt;
&lt;li&gt;The first parameter should sample raw data to allow quick end-to-end runs for testing&lt;/li&gt;
&lt;li&gt;A second parameter should change artifacts location to separate testing and production environments&lt;/li&gt;
&lt;li&gt;On every push, the CI service runs unit tests that verify logic inside each task&lt;/li&gt;
&lt;li&gt;The pipeline is then executed with a data sample and integration tests verify integrity of intermediate results&lt;/li&gt;
&lt;/ol&gt;
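&lt;p&gt;Points 1 to 4 can be sketched as follows (hypothetical code, not tied to any particular framework): the pipeline takes a sampling flag and an output location, so the CI service can run it end-to-end on a small data sample in an isolated directory:&lt;/p&gt;

```python
import csv
import pathlib
import tempfile

def load(raw_rows, sample=False):
    # parameter 1: sampling raw data enables quick end-to-end test runs
    return raw_rows[:10] if sample else raw_rows

def clean(rows):
    # drop rows with missing values; intermediate results go to disk below
    return [r for r in rows if all(v is not None for v in r)]

def run_pipeline(raw_rows, sample=False, output_dir='output'):
    # parameter 2: output_dir separates testing and production artifacts
    out = pathlib.Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    rows = clean(load(raw_rows, sample=sample))
    with open(out / 'clean.csv', 'w', newline='') as f:
        csv.writer(f).writerows(rows)
    return out / 'clean.csv'

# a CI run: small sample, isolated directory
raw = [(i, i * 2) for i in range(100)] + [(None, 0)]
path = run_pipeline(raw, sample=True, output_dir=tempfile.mkdtemp())
print(path.exists())  # True
```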

&lt;h1&gt;
  
  
  What is Continuous Integration?
&lt;/h1&gt;

&lt;p&gt;Continuous Integration (CI) is a software development practice where small changes get continuously integrated in the project's codebase. Each change is automatically tested to ensure that the project will work as expected for end-users in a production environment.&lt;/p&gt;

&lt;p&gt;To contrast traditional software with a data project, we compare two use cases: a software engineer working on an e-commerce website and a data scientist developing a data pipeline that outputs a report with daily sales.&lt;/p&gt;

&lt;p&gt;In the e-commerce portal use case, the production environment is the live website and end-users are people who use it; in the data pipeline use case, the production environment is the server that runs the daily pipeline to generate the report and end-users are business analysts that use the report to inform decisions.&lt;/p&gt;

&lt;p&gt;We define a data pipeline as a series of ordered tasks whose inputs are raw datasets, whose intermediate tasks generate transformed datasets (saved to disk), and whose final task produces a data product; in this case, a report with daily sales (but it could be something else, like a Machine Learning model). The following diagram shows our daily report pipeline example:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--QNMZEiAs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/qjjkj5ueppir4diwpzgz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--QNMZEiAs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/qjjkj5ueppir4diwpzgz.png" alt="Alt Text" width="880" height="359"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Each blue block represents a pipeline task, and the green block represents a script that generates the final report. Orange blocks contain the schemas for the raw sources. Every task generates one product: blue blocks generate data files (but these could also be tables/views in a database), while the green block generates the report with charts and tables.&lt;/p&gt;

&lt;h1&gt;
  
  
  Continuous Integration for Data Science: Ideal workflow
&lt;/h1&gt;

&lt;p&gt;As I mentioned in the prelude, the last task in the data pipeline is often what gets the most attention (e.g. the trained model in a Machine Learning pipeline). Not surprisingly, existing articles on CI for Data Science/Machine Learning also focus on this; but to effectively apply the CI framework we have to think in terms of the whole computational chain: from getting raw data to delivering a data product. Failing to acknowledge that a data pipeline has a richer structure causes data scientists to focus too much on the very end and ignore code quality in the rest of the tasks.&lt;/p&gt;

&lt;p&gt;In my experience, most bugs are introduced along the way; even worse, in many cases errors won't break the pipeline but will contaminate your data and compromise your results. Each step along the way should be given equal importance.&lt;/p&gt;

&lt;p&gt;Let's make things more concrete with a description of the proposed workflow:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A data scientist pushes code changes (e.g. modifies one of the tasks in the pipeline)&lt;/li&gt;
&lt;li&gt;Pushing triggers the CI service to run the pipeline end-to-end and test each generated artifact (e.g. one test could verify that all rows in the &lt;code&gt;customers&lt;/code&gt; table have a non-empty &lt;code&gt;customer_id&lt;/code&gt; value)&lt;/li&gt;
&lt;li&gt;If tests pass, a code review follows&lt;/li&gt;
&lt;li&gt;If changes are approved by the reviewer, code is merged&lt;/li&gt;
&lt;li&gt;Every morning, the "production" pipeline (latest commit in the main branch) runs end-to-end and sends the report to the business analysts&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Such a workflow has two primary advantages:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Early bug detection: Bugs are detected in the development phase, instead of production&lt;/li&gt;
&lt;li&gt;Always production-ready: Since we require code changes to pass all tests before integrating them into the main branch, we ensure we can deploy our latest stable version continuously by just deploying the latest commit in the main branch&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This workflow is what software engineers do in traditional software projects. I call this ideal workflow because it is what we'd do if we could do an end-to-end pipeline run in a reasonable amount of time. This isn't true for a lot of projects due to data scale: if our pipeline takes hours to run end-to-end it is unfeasible to run it every time we make a small change. This is why we cannot simply apply the standard CI workflow (steps 1 to 4) to Data Science. We'll make a few changes to make it feasible for projects where running time is a challenge.&lt;/p&gt;

&lt;h1&gt;
  
  
  Software testing
&lt;/h1&gt;

&lt;p&gt;CI allows developers to continuously integrate code changes by running automated tests: if any of the tests fail, the commit is rejected. This makes sure that we always have a working project in the main branch.&lt;/p&gt;

&lt;p&gt;Traditional software is developed in small, largely independent modules. This separation is natural, as there are clear boundaries among components (e.g. sign up, billing, notifications, etc). Going back to the e-commerce website use case, an engineer's to-do list might look like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;People can create new accounts using email and password&lt;/li&gt;
&lt;li&gt;Passwords can be recovered by sending a message to the registered email&lt;/li&gt;
&lt;li&gt;Users can login using previously saved credentials&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Once the engineer writes the code to support such functionality (&lt;a href="https://en.wikipedia.org/wiki/Test-driven_development"&gt;or even before!&lt;/a&gt;), they will make sure the code works by writing tests that execute the code being tested and check that it behaves as expected:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;my_project&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;create_account&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;users_db&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_create_account&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="c1"&gt;# simulate creating a new account
&lt;/span&gt;    &lt;span class="n"&gt;create_account&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'someone@ploomber.io'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'somepassword'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# verify the account was created by qerying the users database
&lt;/span&gt;    &lt;span class="n"&gt;user&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;users_db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;find_with_email&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'someone@ploomber.io'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;exists&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But unit testing is not the only type of testing, as we will see in the next section.&lt;/p&gt;

&lt;h2&gt;
  
  
  Testing levels
&lt;/h2&gt;

&lt;p&gt;There are &lt;a href="https://en.wikipedia.org/wiki/Software_testing#Testing_levels"&gt;four levels&lt;/a&gt; of software testing. It is important to understand the differences to develop effective tests for our projects. For this post, we'll focus on the first two.&lt;/p&gt;

&lt;h3&gt;
  
  
  Unit testing
&lt;/h3&gt;

&lt;p&gt;The snippet I showed in the previous section is called a unit test. Unit tests verify that a single &lt;em&gt;unit&lt;/em&gt; works. There isn't a strict definition of &lt;em&gt;unit&lt;/em&gt;, but it's often equivalent to calling a single procedure; in our case, we are testing the &lt;code&gt;create_account&lt;/code&gt; procedure.&lt;/p&gt;

&lt;p&gt;Unit testing is effective in traditional software projects because modules are designed to be largely independent from each other; by unit testing them separately, we can quickly pinpoint errors. Sometimes new changes break tests not because of the changes themselves but because of their side effects; if the module is independent, we have a guarantee that the error lies within the module's scope.&lt;/p&gt;

&lt;p&gt;The utility of having procedures is that we can reuse them by parametrizing their behavior with input parameters. The input space for our &lt;code&gt;create_account&lt;/code&gt; function is the combination of all possible email addresses and all possible passwords. There is an infinite number of combinations, but it is reasonable to say that if we test our code against a representative number of cases, we can conclude the procedure works (and if we find a case where it doesn't, we fix the code and add a new test case). In practice, this boils down to testing the procedure against a set of representative cases and known edge cases.&lt;/p&gt;
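&lt;p&gt;As a minimal sketch of this idea, assume the validation step inside &lt;code&gt;create_account&lt;/code&gt; is a hypothetical &lt;code&gt;is_valid_email&lt;/code&gt; helper; we check it against a handful of representative cases and known edge cases:&lt;/p&gt;

```python
import re

# Hypothetical helper: a minimal email validator standing in for the
# validation logic inside create_account (not shown in this post)
def is_valid_email(email):
    return re.match(r"^[^@\s]+@[^@\s]+\.[^@\s]+$", email) is not None

# representative cases: inputs we expect during normal operation
assert is_valid_email("someone@ploomber.io")
assert is_valid_email("first.last@example.com")

# known edge cases: inputs that should be rejected
assert not is_valid_email("")
assert not is_valid_email("no-at-sign")
assert not is_valid_email("two words@example.com")
```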

&lt;p&gt;Given that tests run in an automated way, we need a pass/fail criteria for each. In the software engineering jargon this is called a &lt;a href="https://en.wikipedia.org/wiki/Test_oracle"&gt;test oracle&lt;/a&gt;. Coming up with good test oracles is essential for testing: tests are useful to the extent that they evaluate the right outcome.&lt;/p&gt;

&lt;h3&gt;
  
  
  Integration testing
&lt;/h3&gt;

&lt;p&gt;The second testing level is integration testing. Unit tests are a bit simplistic since they test units independently; this simplification is useful for efficiency, as there is no need to start up the whole system to test a small part of it.&lt;/p&gt;

&lt;p&gt;But sometimes errors arise when inputs and outputs cross module boundaries. Even though our modules are largely independent, they still have to interact with each other at some point (e.g. the billing module has to interact with the notifications module to send a receipt). To catch potential errors during this interaction, we use integration testing.&lt;/p&gt;

&lt;p&gt;Writing integration tests is more complex than writing unit tests as there are more elements to be considered. This is why traditional software systems are designed to be &lt;a href="https://en.wikipedia.org/wiki/Loose_coupling"&gt;loosely coupled&lt;/a&gt; by limiting the number of interactions and avoiding cross-module side effects. As we will see in the next section, integration testing is essential for testing data projects.&lt;/p&gt;

&lt;h2&gt;
  
  
  Effective testing
&lt;/h2&gt;

&lt;p&gt;Writing tests is an art of its own; the purpose of testing is to catch as many errors as we can during development so they don't show up in production. In a way, tests simulate user actions and check that the system behaves as expected. For that reason, an effective test is one that simulates realistic scenarios and appropriately evaluates whether the system did the right thing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;An effective test should meet four requirements:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;1. The simulated state of the system must be representative of the system when the user is interacting with it&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The goal of tests is to prevent errors in production, so we have to represent the system state as closely as possible. Even though our e-commerce website might have dozens of modules (user signup, billing, product listing, customer support, etc.), they are designed to be as independent as possible, which makes simulating our system easier. We could argue that a dummy database is enough to simulate the system when a new user signs up; the existence or absence of any other module should have no effect on the module being tested. The more interactions among components, the harder it is to simulate a realistic scenario.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;2. Input data must be representative of real user input&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;When testing a procedure, we want to know if, given an input, the procedure does what it's supposed to do. Since we cannot run every possible input, we have to think of enough cases that represent regular operation as well as possible edge cases (e.g. what happens if a user signs up with an invalid e-mail address). To test our &lt;code&gt;create_account&lt;/code&gt; procedure, we should pass a few regular e-mail addresses but also some invalid ones and verify that it either creates the account or shows an appropriate error message.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;3. Appropriate test oracle&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;As we mentioned in the previous section, the test oracle is our pass/fail criteria. The simpler and smaller the procedure to test, the easier it is to come up with one. If we are not testing the right outcome, our test won't be useful. Our test for &lt;code&gt;create_account&lt;/code&gt; assumes that checking the users table in the database is an appropriate way of evaluating our function.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;4. Reasonable runtime&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;While tests run, the developer has to wait until results come back. If testing is slow, developers will either wait a long time or ignore the CI system altogether. The latter causes code changes to accumulate, making debugging much harder (it is easier to find an error when we changed 5 lines than when we changed 100).&lt;/p&gt;

&lt;h1&gt;
  
  
  Effective testing for data pipelines
&lt;/h1&gt;

&lt;p&gt;In the previous sections, we described the first two levels of software testing and the four properties of an effective test. This section discusses how to adapt testing techniques from traditional software development to data projects.&lt;/p&gt;

&lt;h2&gt;
  
  
  Unit testing for data pipelines
&lt;/h2&gt;

&lt;p&gt;Unlike modules in traditional software, our pipeline tasks (blocks in our diagram) are not independent; they have a logical execution order. To accurately represent the state of our system, we have to respect that order. Since the input of one task depends on the output of its upstream dependencies, the root cause of an error can be either in the failing task or in any upstream task. This increases the number of potential places to search for the bug; abstracting logic into smaller procedures and unit testing them in isolation helps reduce this problem.&lt;/p&gt;

&lt;p&gt;Say that our task &lt;code&gt;add_product_information&lt;/code&gt; performs some data cleaning before joining sales with products:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;my_project&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;clean&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;add_product_information&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;upstream&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# load
&lt;/span&gt;    &lt;span class="n"&gt;sales&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_parquet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;upstream&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'sales'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;products&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_parquet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;upstream&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'products'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

    &lt;span class="c1"&gt;# clean
&lt;/span&gt;    &lt;span class="n"&gt;sales_clean&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;clean&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fix_timestamps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sales&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;products_clean&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;clean&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;remove_discontinued&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;products&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# join
&lt;/span&gt;    &lt;span class="n"&gt;output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sales_clean&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;merge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;products_clean&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;on&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'product_id'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;to_parquet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'clean/sales_w_product_info.parquet'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We abstracted the cleaning logic into two sub-procedures, &lt;code&gt;clean.fix_timestamps&lt;/code&gt; and &lt;code&gt;clean.remove_discontinued&lt;/code&gt;; errors in either sub-procedure will propagate to the output and, in consequence, to any downstream tasks. To prevent this, we should add a few unit tests that verify the logic of each sub-procedure in isolation.&lt;/p&gt;
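&lt;p&gt;For instance, a unit test for &lt;code&gt;clean.remove_discontinued&lt;/code&gt; might look like this (the implementation shown is a hypothetical sketch, since the post doesn't define it):&lt;/p&gt;

```python
import pandas as pd

# Hypothetical sketch of clean.remove_discontinued: assume it drops rows
# flagged as discontinued (the actual implementation is not shown here)
def remove_discontinued(products):
    return products[~products["discontinued"]].reset_index(drop=True)

def test_remove_discontinued():
    products = pd.DataFrame({
        "product_id": [1, 2, 3],
        "discontinued": [False, True, False],
    })
    result = remove_discontinued(products)
    # only active products remain
    assert set(result["product_id"]) == {1, 3}
    assert not result["discontinued"].any()

test_remove_discontinued()
```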

&lt;p&gt;Often, pipeline tasks that transform data are composed of just a few calls to external packages (e.g. pandas) with little custom logic. In such cases, unit testing won't be very effective. Imagine one of the tasks in your pipeline looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# cleaning
# ...
# ...
&lt;/span&gt;
&lt;span class="c1"&gt;# transform
&lt;/span&gt;&lt;span class="n"&gt;series&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;groupby&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="s"&gt;'customer_id'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'product_category'&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="n"&gt;price&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DatFrame&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="s"&gt;'mean_price'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;series&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Assuming you already unit tested the cleaning logic, there isn't much to unit test about your transformations; writing unit tests for such simple procedures is not a good investment of your time. This is where integration testing comes in.&lt;/p&gt;

&lt;h2&gt;
  
  
  Integration testing for data pipelines
&lt;/h2&gt;

&lt;p&gt;Our pipeline passes inputs and outputs from one task to the next until it generates the final result. This flow can break if a task's input expectations aren't met (e.g. column names); moreover, each data transformation encodes certain assumptions &lt;em&gt;we&lt;/em&gt; make about the data. Integration tests help us verify that outputs flow through the pipeline correctly.&lt;/p&gt;

&lt;p&gt;If we wanted to test the &lt;em&gt;group by&lt;/em&gt; transformation shown above, we could run the pipeline task and evaluate our expectations using the output data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# since we are grouping by these two keys, they should be unique
&lt;/span&gt;&lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;is_unique&lt;/span&gt;
&lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;product_category&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;is_unique&lt;/span&gt;

&lt;span class="c1"&gt;# price is always positive, mean should be as well
&lt;/span&gt;&lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mean_price&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;

&lt;span class="c1"&gt;# check there are no NAs (this might happen if we take the mean of
# an array with NAs)
&lt;/span&gt;&lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mean_price&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isna&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nb"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These assertions are quick to write and clearly encode our output expectations. Let's now see how we can write effective integration tests in detail.&lt;/p&gt;
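&lt;p&gt;Here is a self-contained version of these checks with toy data (the values are made up for illustration); &lt;code&gt;reset_index&lt;/code&gt; turns the grouping keys back into regular columns so the assertions can reference them:&lt;/p&gt;

```python
import pandas as pd

# toy input standing in for the real sales data
df = pd.DataFrame({
    "customer_id": [1, 1, 2, 2],
    "product_category": ["a", "b", "a", "b"],
    "price": [10.0, 20.0, 30.0, 40.0],
})

# the transformation under test
series = df.groupby(["customer_id", "product_category"]).price.mean()
out = pd.DataFrame({"mean_price": series}).reset_index()

# the pair of grouping keys should be unique in the output
assert not out.duplicated(["customer_id", "product_category"]).any()

# price is always positive, so the mean should be as well
assert (out.mean_price > 0).all()

# no NAs introduced by the aggregation
assert not out.mean_price.isna().any()
```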

&lt;p&gt;&lt;strong&gt;State of the system&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;As we mentioned in the previous section, pipeline tasks have dependencies. In order to accurately represent the system status in our tests, we have to respect the execution order and run our integration tests after each task is done. Let's modify our original diagram to reflect this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--73AJFb5D--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/n1t5elvuro0epgirj4cj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--73AJFb5D--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/n1t5elvuro0epgirj4cj.png" alt="Alt Text" width="880" height="359"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Test oracle&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The challenge when testing pipeline tasks is that there is no single right answer. When developing a &lt;code&gt;create_account&lt;/code&gt; procedure, we can argue that inspecting the database for the new user is an appropriate measure of success, but what about a procedure that cleans data?&lt;/p&gt;

&lt;p&gt;There is no unique answer, because the concept of &lt;em&gt;clean data&lt;/em&gt; depends on the specifics of our project. The best we can do is to explicitly code our output expectations as a series of tests. Common scenarios to guard against are invalid observations in the analysis, null values, duplicates, unexpected column names, etc. Such expectations are good candidates for integration tests to prevent dirty data from leaking into our pipeline. Even tasks that pull raw data should be tested to detect data changes: columns get deleted, renamed, etc. Testing raw data properties helps us quickly identify when our source data has changed.&lt;/p&gt;

&lt;p&gt;Some changes, such as column renaming, will break our pipeline even if we don't write a test, but explicit testing has a big advantage: we can fix the error in the right place and avoid redundant fixes. Imagine what would happen if renaming a column broke two downstream tasks, each developed by a different colleague. Once they encounter the error, they will be tempted to rename the column in their own code (the two downstream tasks), when the correct approach is to fix it in the upstream task.&lt;/p&gt;

&lt;p&gt;Furthermore, errors that break our pipeline should be the least of our worries; the most dangerous bugs in data pipelines are sneaky. They won't break your pipeline but will contaminate all downstream tasks in subtle ways that can severely flaw your data analysis and even flip your conclusions, which is the worst possible scenario. Because of this, I cannot stress enough how important it is to code data expectations as part of any data analysis project.&lt;/p&gt;

&lt;p&gt;Pipeline tasks don't have to be Python procedures; they'll often be SQL scripts, and you should test them the same way. For example, you can check that there are no nulls in a certain column with the following query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;some_table&lt;/span&gt;
    &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;some_column&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
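&lt;p&gt;A runnable sketch of this check using an in-memory SQLite database (table and column names are made up for illustration):&lt;/p&gt;

```python
import sqlite3

# build a toy table in an in-memory SQLite database
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE some_table (some_column INTEGER)")
conn.executemany("INSERT INTO some_table VALUES (?)", [(1,), (2,), (3,)])

query = """
SELECT NOT EXISTS(
    SELECT * FROM some_table
    WHERE some_column IS NULL
)
"""

# returns 1 (true) when no nulls are present
no_nulls, = conn.execute(query).fetchone()
assert no_nulls == 1

# inserting a NULL makes the check fail
conn.execute("INSERT INTO some_table VALUES (NULL)")
no_nulls, = conn.execute(query).fetchone()
assert no_nulls == 0
```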



&lt;p&gt;For procedures whose output is not a dataset, coming up with a test oracle gets trickier. A common output in data pipelines is a human-readable document (i.e. a report). While it is technically possible to test graphical outputs such as tables or charts, this requires more setup. A first (and often good enough) approach is to unit test the input that generates the visual output (e.g. test the function that prepares the data for plotting instead of the actual plot). If you're curious about testing plots, &lt;a href="https://github.com/matplotlib/pytest-mpl"&gt;click here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Realistic input data and running time&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We mentioned that realistic input data is important for testing. In data projects, we already have real data we can use in our tests; however, passing the full dataset for testing is infeasible, as data pipelines have computationally expensive tasks that take a long time to finish.&lt;/p&gt;

&lt;p&gt;To reduce running time and keep our input data realistic we pass a data sample. How this sample is obtained depends on the specifics of the project. The objective is to get a representative data sample whose properties are similar to the full dataset. In our example, we could take a random sample of yesterday's sales. Then, if we want to test certain properties (e.g. that our pipeline handles NAs correctly), we could either insert some NAs in the random sample or use another sampling method such as &lt;a href="https://en.wikipedia.org/wiki/Stratified_sampling"&gt;stratified sampling&lt;/a&gt;. Sampling only needs to happen in tasks that pull raw data, downstream tasks will just process whatever output came from their upstream dependencies.&lt;/p&gt;
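&lt;p&gt;A sampling step could be sketched like this; &lt;code&gt;sample_raw&lt;/code&gt; is a hypothetical helper that takes a fraction of the raw data, optionally stratified so every group stays represented:&lt;/p&gt;

```python
import pandas as pd

# Hypothetical sampling helper: take a fraction of the raw data, optionally
# stratified by a column so every group stays represented in the sample
def sample_raw(df, frac=0.1, stratify_by=None, seed=0):
    if stratify_by is None:
        return df.sample(frac=frac, random_state=seed)
    return (df.groupby(stratify_by, group_keys=False)
              .apply(lambda g: g.sample(frac=frac, random_state=seed)))

# toy raw data standing in for yesterday's sales
sales = pd.DataFrame({
    "sale_id": range(20),
    "product_category": ["a"] * 10 + ["b"] * 10,
})

small = sample_raw(sales, frac=0.5, stratify_by="product_category")
assert len(small) == 10
assert set(small.product_category) == {"a", "b"}
```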

&lt;p&gt;Sampling should only be enabled during testing. Make sure your pipeline is designed to easily switch this setting off, and keep generated artifacts (test vs. production) clearly labeled:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;my_project&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;daily_sales_pipeline&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_with_sample&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="c1"&gt;# run with sample and stores all artifacts in the testing folder
&lt;/span&gt;    &lt;span class="n"&gt;pipeline&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;daily_sales_pipeline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sample&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;artifacts&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'/path/to/output/testing'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;build&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The snippet above assumes that we can represent our pipeline as a "pipeline object" and call it with parameters. This is a very powerful abstraction that makes your pipeline flexible enough to execute under different settings. Upon successful task execution, you should run the corresponding integration test. For example, say we want to test our &lt;code&gt;add_product_information&lt;/code&gt; procedure; our pipeline should call the following function once such a task is done:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_product_information&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;product&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_parquet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;product&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isna&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nb"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;product_id&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isna&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nb"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isna&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nb"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;price&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nb"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note that we are passing the path to the data as an argument to the function, which allows us to easily switch the path we load the data from. This is important to prevent pipeline runs from interfering with each other. For example, if you have several git branches, you can organize artifacts by branch in a folder called &lt;code&gt;/data/{branch-name}&lt;/code&gt;; if you are sharing a server with a colleague, each one can save their artifacts to &lt;code&gt;/data/{username}&lt;/code&gt;.&lt;/p&gt;
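&lt;p&gt;A small, hypothetical helper for computing such isolated artifact paths might look like this:&lt;/p&gt;

```python
import getpass
from pathlib import Path

# Hypothetical helper: compute an isolated artifact directory per git
# branch (or per user when no branch is given) so concurrent pipeline
# runs don't overwrite each other's outputs
def artifacts_dir(root="/data", branch=None):
    name = branch if branch is not None else getpass.getuser()
    return Path(root) / name

assert artifacts_dir(branch="feature-x") == Path("/data/feature-x")
```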

&lt;p&gt;If you are working with SQL scripts, you can apply the same testing pattern:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_product_information_sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;relation&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Assume client is an object to send queries to the db
&lt;/span&gt;    &lt;span class="c1"&gt;# and relation the table/view to test
&lt;/span&gt;    &lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"""
    SELECT EXISTS(
        SELECT * FROM {relation}
        WHERE {column} IS NULL
    )
    """&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;relation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;relation&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;column&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'customer_id'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;relation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;relation&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;column&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'product_id'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;relation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;relation&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;column&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'category'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Apart from sampling, we can further speed up testing by running tasks in parallel, although the amount of parallelization is limited by the pipeline structure: we cannot run a task until its upstream dependencies have completed.&lt;/p&gt;
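&lt;p&gt;To sketch the idea, the following hypothetical function computes which tasks can run concurrently: tasks in the same level have all their upstream dependencies satisfied (task names are made up to mirror the sales pipeline):&lt;/p&gt;

```python
# Sketch: given task -> upstream dependencies, group tasks into "levels";
# tasks within a level have no pending dependencies and can run in parallel
def parallel_levels(dependencies):
    levels, done = [], set()
    pending = dict(dependencies)
    while pending:
        # tasks whose upstream dependencies have all finished
        ready = [t for t, deps in pending.items() if set(deps) <= done]
        if not ready:
            raise ValueError("cyclic dependencies")
        levels.append(sorted(ready))
        done.update(ready)
        for t in ready:
            del pending[t]
    return levels

deps = {
    "raw_sales": [],
    "raw_products": [],
    "clean_sales": ["raw_sales"],
    "add_product_information": ["clean_sales", "raw_products"],
}
# the two raw tasks can run in parallel; the join must wait
assert parallel_levels(deps) == [
    ["raw_products", "raw_sales"],
    ["clean_sales"],
    ["add_product_information"],
]
```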

&lt;p&gt;Parametrized pipelines and executing tests upon task completion are supported in our library &lt;a href="https://github.com/ploomber/ploomber"&gt;Ploomber&lt;/a&gt;.&lt;/p&gt;

&lt;h1&gt;
  
  
  The testing tradeoff in Data Science
&lt;/h1&gt;

&lt;p&gt;Data projects carry much more uncertainty than traditional software. Sometimes we don't even know if the project is technically possible, so we have to invest some time to find out. This uncertainty works against good software practices: because we want to reduce uncertainty and estimate the project's impact by making as much progress as we can toward answering questions about feasibility, good software practices (such as testing) are usually not perceived as &lt;em&gt;actual progress&lt;/em&gt; and are habitually overlooked.&lt;/p&gt;

&lt;p&gt;My recommendation is to incrementally increase testing as you make progress. During the early stages, it is important to focus on integration tests, as they are quick to implement and effective. The most common errors in data transformations are easy to detect with simple assertions: check that IDs are unique, that there are no duplicates or empty values, and that columns fall within expected ranges. You'll be surprised how many bugs you catch with a few lines of code. These errors are obvious once you take a look at the data, but they might not even break your pipeline; they will just produce wrong results. Integration testing prevents this.&lt;/p&gt;

&lt;p&gt;Second, leverage off-the-shelf packages as much as possible, especially for highly complex data transformations or algorithms; but beware of quality, and favor maintained packages even if they don't offer state-of-the-art performance. Third-party packages come with their own tests, which reduces the work left for you.&lt;/p&gt;

&lt;p&gt;There might also be parts that are not critical or are very hard to test. Plotting procedures are a common example: unless you are producing a highly customized plot, there is little benefit in testing a small plotting function that just calls matplotlib and tweaks the axes a bit. Focus on testing the input that goes into the plotting function.&lt;/p&gt;

&lt;p&gt;As your project matures, you can start focusing on increasing your testing coverage and paying some technical debt.&lt;/p&gt;

&lt;h1&gt;
  
  
  Debugging data pipelines
&lt;/h1&gt;

&lt;p&gt;When tests fail, it is time to debug. Our first line of defense is logging: every pipeline run should generate a relevant set of logging records for us to review. I recommend taking a look at the &lt;a href="https://docs.python.org/3/library/logging.html"&gt;&lt;code&gt;logging&lt;/code&gt; module in the Python standard library&lt;/a&gt;, which provides a flexible framework for this (do not use &lt;code&gt;print&lt;/code&gt; for logging). A good practice is to keep a log file from every pipeline run.&lt;/p&gt;
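
&lt;p&gt;For example, a minimal setup that writes one timestamped log file per pipeline run might look like this (the logger name and file layout are arbitrary choices, not a prescribed convention):&lt;/p&gt;

```python
# Sketch: one log file per pipeline run, named with a timestamp.
import logging
from datetime import datetime
from pathlib import Path

log_dir = Path("logs")
log_dir.mkdir(exist_ok=True)
log_file = log_dir / datetime.now().strftime("run-%Y-%m-%d-%H%M%S.log")

logger = logging.getLogger("pipeline")
logger.setLevel(logging.INFO)
handler = logging.FileHandler(log_file)
handler.setFormatter(
    logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s"))
logger.addHandler(handler)

# each task logs its progress, so failures can be traced after the fact
logger.info("starting task: clean_sales")
```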

&lt;p&gt;While logging can hint you where the problem is, designing your pipeline for easy debugging is critical. Let's recall our definition of data pipeline:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Series of ordered tasks whose inputs are raw datasets, intermediate tasks generate transformed datasets (saved &lt;strong&gt;to disk&lt;/strong&gt;)  and the final task produces a data product.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Keeping all intermediate results in memory is certainly faster, since disk operations are slower than memory. However, saving results to disk makes debugging much easier. If we don't persist intermediate results, debugging means re-executing the pipeline to replicate the error conditions; if we do persist them, we only have to reload the upstream dependencies of the failing task. Let's see how we can debug our &lt;code&gt;add_product_information&lt;/code&gt; procedure using the &lt;a href="https://docs.python.org/3/library/pdb.html"&gt;Python debugger&lt;/a&gt; from the standard library:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pdb&lt;/span&gt;

&lt;span class="n"&gt;pdb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;runcall&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;add_product_information&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;upstream&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;'sales'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;path_to_sales&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'product'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;path_to_product&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Since our tasks are isolated from each other and only interact via inputs and outputs, we can easily replicate the error conditions. Just make sure that you are passing the right input parameters to your function. You can easily apply this workflow if you use &lt;a href="https://ploomber.readthedocs.io/en/stable/guide/debugging.html"&gt;Ploomber's debugging capabilities.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Debugging SQL scripts is harder since we don't have debuggers as we do in Python. My recommendation is to keep your SQL scripts at a reasonable size: once a script becomes too big, consider breaking it down into two separate tasks. Organizing your SQL code with &lt;code&gt;WITH&lt;/code&gt; improves readability and can help you debug complex statements:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;customers_subset&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="p"&gt;..&lt;/span&gt;
&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;products_subset&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;products&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt;
&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;products&lt;/span&gt;
&lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;product_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you find an error in a SQL script organized like this, you can replace the final &lt;code&gt;SELECT&lt;/code&gt; statement with something like &lt;code&gt;SELECT * FROM customers_subset&lt;/code&gt; to inspect the intermediate results.&lt;/p&gt;

&lt;h1&gt;
  
  
  Running integration tests in production
&lt;/h1&gt;

&lt;p&gt;In traditional software, tests only run in the development environment; it is assumed that if a piece of code reaches production, it must have been tested and works correctly.&lt;/p&gt;

&lt;p&gt;For data pipelines, integration tests are part of the pipeline itself, and it is up to you to decide whether to execute them. The two variables at play are response time and end-users. If running frequency is low (e.g. a pipeline that executes daily) and end-users are internal (e.g. business analysts), you should consider keeping the tests in production. An ML training pipeline also follows this pattern: it has low running frequency because it executes on demand (whenever you want to train a model), and the end-users are you and anyone else on the team. This matters because we run our tests with a sample of the data; running them with the full dataset might give a different result if our sampling method failed to capture certain properties of the data.&lt;/p&gt;

&lt;p&gt;Another common (and often unforeseeable) scenario is data changes. It is important to keep yourself informed of planned changes in upstream data (e.g. a migration to a different warehouse platform), but there's still a chance you won't find out about a data change until you pass new data through the pipeline. In the best case, your pipeline will raise an exception you can detect; in the worst case, it will execute just fine but the output will contain wrong results. For this reason, it is important to keep your integration tests running in the production environment.&lt;/p&gt;

&lt;p&gt;Bottom line: if you can allow a pipeline to delay its final output (e.g. the daily sales report), keep tests in production and make sure you are properly notified when they fail; the simplest solution is to have your pipeline send you an e-mail.&lt;/p&gt;
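
&lt;p&gt;A sketch of such a notification using the standard library; the addresses and SMTP host are placeholders, and the actual sending is left commented out:&lt;/p&gt;

```python
# Sketch of a failure notification e-mail; addresses are placeholders.
import smtplib  # only needed for the (commented) send step below
from email.message import EmailMessage

def build_failure_email(task_name, error):
    msg = EmailMessage()
    msg["Subject"] = f"Pipeline test failed: {task_name}"
    msg["From"] = "pipeline@example.com"
    msg["To"] = "team@example.com"
    msg.set_content(f"Integration test failed in {task_name}: {error}")
    return msg

msg = build_failure_email("clean_sales", "duplicate order_id")

# In production you would actually send it, e.g.:
# with smtplib.SMTP("smtp.example.com") as server:
#     server.send_message(msg)
```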

&lt;p&gt;For pipelines whose output is expected often and quickly (e.g. an API), you can change your strategy. For non-critical errors, you can log instead of raising exceptions; for critical cases, where you know a failing test will prevent you from returning an appropriate result (e.g. the user entered a negative value for an "age" column), you should return an appropriate error message. Handling errors in production is part of &lt;em&gt;model monitoring&lt;/em&gt;, which we will cover in an upcoming post.&lt;/p&gt;

&lt;h1&gt;
  
  
  Revisited workflow
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--U9-WOpE1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/yd0p3q8hkayba2uy0zgf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--U9-WOpE1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/yd0p3q8hkayba2uy0zgf.png" alt="Alt Text" width="800" height="622"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We now revisit the workflow based on observations from the previous sections. On every push, unit tests run first; the pipeline is then executed with a sample of the data, and upon each task execution, integration tests verify its output. If all tests pass, the commit is marked as successful. This is the end of the CI process, and it should only take a few minutes.&lt;/p&gt;

&lt;p&gt;Given that we are continuously testing each code change, we should be able to deploy at any time. This idea of continuously deploying software is called &lt;a href="https://en.wikipedia.org/wiki/Continuous_deployment"&gt;Continuous Deployment&lt;/a&gt;; it deserves a dedicated post, but here's the summary.&lt;/p&gt;

&lt;p&gt;Since we need to generate a daily report, the pipeline runs every morning. The first step is to pull (from the repository or an artifact store) the latest stable version available and install it in the production server. Integration tests run after each successful task to check data expectations; if any of these tests fails, a notification is sent. If everything goes well, the pipeline emails the report to the business analysts.&lt;/p&gt;

&lt;h2&gt;
  
  
  Implementation details
&lt;/h2&gt;

&lt;p&gt;This section provides general guidelines and resources to implement the CI workflow with existing tools.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Unit testing&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To unit test the logic inside each pipeline task, we can leverage existing tools. I highly recommend &lt;a href="https://docs.pytest.org/en/stable/"&gt;pytest&lt;/a&gt;. It has a small learning curve for basic usage, and as you get more comfortable with it, I'd advise exploring more of its features (e.g. &lt;a href="https://docs.pytest.org/en/stable/fixture.html#fixture"&gt;fixtures&lt;/a&gt;). Becoming a power user of any testing framework comes with great benefits: you'll spend less time writing tests and maximize their effectiveness at catching bugs. Keep practicing until writing tests becomes the natural first step before writing any actual code. This technique of writing tests first is called &lt;a href="https://en.wikipedia.org/wiki/Test-driven_development"&gt;Test-driven development (TDD)&lt;/a&gt;.&lt;/p&gt;
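
&lt;p&gt;A minimal pytest-style example; the &lt;code&gt;add_discount&lt;/code&gt; function is invented for illustration, not taken from the pipeline above:&lt;/p&gt;

```python
# A pytest-style unit test for a small transformation function.
def add_discount(price, fraction):
    """Return price after applying a discount fraction (e.g. 0.1 for 10%)."""
    return round(price * (1 - fraction), 2)

def test_add_discount():
    assert add_discount(100.0, 0.1) == 90.0
    assert add_discount(0.0, 0.5) == 0.0

# pytest discovers and runs test_* functions automatically, e.g.:
#   $ pytest test_transforms.py
test_add_discount()
```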

&lt;p&gt;&lt;strong&gt;Running integration tests&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Integration tests have more tooling requirements since they need to account for the pipeline structure (run tasks in order), parametrization (for sampling), and test execution (run tests after each task). There has been a recent surge in workflow management tools that can help with this to some extent.&lt;/p&gt;

&lt;p&gt;Our library &lt;a href="https://github.com/ploomber/ploomber"&gt;Ploomber&lt;/a&gt; supports all features required to implement this workflow: representing your pipeline as a &lt;a href="https://en.wikipedia.org/wiki/Directed_acyclic_graph"&gt;DAG&lt;/a&gt;, separating dev/test/production environments, parametrizing pipelines, running test functions upon task execution, integration with the Python debugger, among other features.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;External systems&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A lot of simple to moderately complex data applications are developed on a single server: the first pipeline tasks dump raw data from a warehouse, and all downstream tasks write intermediate results as local files (e.g. parquet or CSV files). This architecture makes it easy to contain and execute the pipeline in a different system: to test locally, just run the pipeline and save the artifacts in a folder of your choice; to run it in the CI server, copy the source code and execute the pipeline there. There is no dependency on any external system.&lt;/p&gt;

&lt;p&gt;However, when data scale is a challenge, the pipeline might just serve as an execution coordinator that does little to no actual computation; think, for example, of a purely SQL pipeline that only sends scripts to an analytical database and waits for completion.&lt;/p&gt;

&lt;p&gt;When execution depends on external systems, implementing CI is harder because you depend on another system to execute your pipeline. In traditional software projects, this is solved by creating &lt;a href="https://en.wikipedia.org/wiki/Mock_object"&gt;mocks&lt;/a&gt;, which mimic the behavior of another object. Think about the e-commerce website: the production database is a large server that supports all users. During development and testing, there is no need for such a big system; a smaller one with some data (maybe a sample of real data, or even fake data) is enough, as long as it accurately mimics the behavior of the production database.&lt;/p&gt;

&lt;p&gt;This is often not possible in data projects. If we are using a big external server to speed up computations, we most likely only have that one system (e.g. a company-wide Hadoop cluster), and mocking it is unfeasible. One way to tackle this is to store pipeline artifacts in different "environments". For example, if you are using a large analytical database for your project, store production artifacts in a &lt;code&gt;prod&lt;/code&gt; schema and testing artifacts in a &lt;code&gt;test&lt;/code&gt; schema. If you cannot create schemas, you can instead prefix all your tables and views (e.g. &lt;code&gt;prod_customers&lt;/code&gt; and &lt;code&gt;test_customers&lt;/code&gt;). Parametrizing your pipeline lets you easily switch schemas or prefixes.&lt;/p&gt;
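
&lt;p&gt;A sketch of such parametrization in Python; the query and table names are hypothetical:&lt;/p&gt;

```python
# Sketch: parametrize a SQL pipeline by environment so the same code
# reads/writes "prod" artifacts in production and "test" artifacts in CI.
def render_query(env):
    """Prefix tables with the target schema ("prod" or "test")."""
    template = ("SELECT * FROM {schema}.customers "
                "JOIN {schema}.products USING (product_id)")
    return template.format(schema=env)

prod_query = render_query("prod")
test_query = render_query("test")
```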

&lt;p&gt;&lt;strong&gt;CI server&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To automate test execution you need a CI server: whenever you push to the repository, the CI server runs the tests against the new commit. There are many options available; check whether the company you work for already has a CI service. If there isn't one, you won't get the automated process, but you can still implement it halfway by running your tests locally on each commit.&lt;/p&gt;

&lt;h1&gt;
  
  
  Extension: Machine Learning pipeline
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--VodVaQP6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/wlwzmebfz6730fb9zwwr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--VodVaQP6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/wlwzmebfz6730fb9zwwr.png" alt="Alt Text" width="880" height="359"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's modify our previous daily report pipeline to cover an important use case: developing a Machine Learning model. Say that we now want to forecast daily sales for the next month. We could do this by getting historical sales (instead of just yesterday's sales), generating features and training a model.&lt;/p&gt;

&lt;p&gt;Our end-to-end process has two stages: first, we process the data to generate a training set; then, we train models and select the best one. If we follow the same rigorous testing approach for each task along the way, we will be able to keep dirty data from getting into our model; remember: &lt;a href="https://en.wikipedia.org/wiki/Garbage_in,_garbage_out"&gt;garbage in, garbage out&lt;/a&gt;. Practitioners sometimes focus too much on the training task, trying out many fancy models or complex hyperparameter tuning schemes. While this approach is certainly valuable, there is usually a lot of low-hanging fruit in the data preparation process that can significantly impact our model's performance. But to maximize this impact, we must ensure that the data preparation stage is reliable and reproducible.&lt;/p&gt;

&lt;p&gt;Bugs in data preparation cause either results that are &lt;em&gt;too good to be true&lt;/em&gt; (i.e. data leakage) or suboptimal models; our tests should address both scenarios. To prevent data leakage, we can test for the existence of problematic columns in the training set (e.g. a column whose value only becomes known after our target variable is observed). To avoid suboptimal performance, integration tests that verify our data assumptions play an important role, but we can include other tests to check the quality of our final dataset, such as verifying that we have data across all years and that data regarded as unsuitable for training does not appear.&lt;/p&gt;
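
&lt;p&gt;A sketch of both kinds of checks; the column names and year range are made up for illustration:&lt;/p&gt;

```python
# Sketch of leakage and coverage checks on a training set.
def check_training_set(columns, years):
    # columns only known after the outcome would leak the target
    leaky = {"churn_date", "refund_issued"}
    assert not leaky.intersection(columns), "leaky column in training set"
    # verify we have data for every expected year (2015 through 2020)
    assert set(years) == set(range(2015, 2021)), "missing years"
    return True

ok = check_training_set(
    columns=["customer_id", "age", "total_spend"],
    years=[2015, 2016, 2017, 2018, 2019, 2020],
)
```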

&lt;p&gt;Getting historical data will increase overall CI running time, but data sampling (as we did in the daily report pipeline) helps. Even better, you can cache a local copy of the data sample to avoid fetching it every time you run your tests.&lt;/p&gt;
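
&lt;p&gt;One possible caching sketch, where &lt;code&gt;fetch_sample&lt;/code&gt; stands in for the actual (expensive) query against the warehouse:&lt;/p&gt;

```python
# Sketch: cache the data sample locally so repeated test runs don't
# re-fetch it from the warehouse.
import json
from pathlib import Path

def fetch_sample():
    # placeholder for an expensive remote query
    return [{"order_id": 1, "amount": 9.99}]

def load_sample(cache=Path("sample.json")):
    if cache.exists():
        return json.loads(cache.read_text())
    data = fetch_sample()
    cache.write_text(json.dumps(data))
    return data

data = load_sample()  # fetches once, then reads from disk
```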

&lt;p&gt;To ensure full model reproducibility, we should only train models using artifacts generated by an automated process. Once tests pass, a process could automatically trigger an end-to-end pipeline execution with the full dataset to generate the training data.&lt;/p&gt;

&lt;p&gt;Keeping historical artifacts also helps with model auditability: given a commit hash, we should be able to locate the training data it generated; moreover, re-executing the pipeline from the same commit should yield identical results.&lt;/p&gt;
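
&lt;p&gt;One simple convention is to key artifact locations by the commit hash that produced them; the directory layout below is an assumption, not a prescribed standard:&lt;/p&gt;

```python
# Sketch: address each generated training set by the commit hash that
# produced it, so any experiment can be traced back to its exact inputs.
from pathlib import Path

def artifact_path(commit_hash, root=Path("artifacts")):
    return root / commit_hash / "training_data.parquet"

path = artifact_path("4f2a9c1")  # hypothetical short commit hash
```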

&lt;h2&gt;
  
  
  Model evaluation as part of the CI workflow
&lt;/h2&gt;

&lt;p&gt;Our current CI workflow tests our pipeline with a data sample to make sure the final output is suitable for training. Wouldn't it be nice if we could also test the training procedure?&lt;/p&gt;

&lt;p&gt;Recall that the purpose of CI is to let developers integrate small changes iteratively; for this to be effective, feedback needs to come back quickly. Training ML models usually comes with a long running time; unless we have a way of finishing our training procedure in a few minutes, we'll have to think about how to test swiftly.&lt;/p&gt;

&lt;p&gt;Let's analyze two subtly different scenarios to understand how we can integrate them in the CI workflow.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Testing a training algorithm&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you are implementing your own training algorithm, you should test your implementation independently of the rest of the pipeline. These tests verify the correctness of your implementation.&lt;/p&gt;

&lt;p&gt;This is something that every ML framework does (scikit-learn, keras, etc.), since they have to ensure that improvements to the current implementations do not break them. In most cases, unless you are working with a very data-hungry algorithm, this won't pose a running-time problem, because you can unit test your implementation with a synthetic or toy dataset. The same logic applies to any training preprocessors (such as data scaling).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Testing your training pipeline&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In practice, training is not a single-stage procedure. The first step is to load your data; then you might do some final cleaning, such as removing IDs or one-hot encoding categorical features. After that, you pass the data to a multi-stage training pipeline that involves splitting, data preprocessing (e.g. standardization, PCA), hyperparameter tuning, and model selection. Things can go wrong in any of these steps, especially if your pipeline has highly customized procedures.&lt;/p&gt;

&lt;p&gt;Testing your training pipeline is hard because there is no obvious test oracle. My advice is to try to make your pipeline as simple as possible by leveraging existing implementations (scikit-learn has amazing tools for &lt;a href="https://scikit-learn.org/stable/modules/compose.html"&gt;this&lt;/a&gt;) to reduce the amount of code to test.&lt;/p&gt;

&lt;p&gt;In practice, I've found it useful to define a test criterion &lt;em&gt;relative&lt;/em&gt; to previous results. If the first time I trained a model I got an accuracy of X, I save this number and use it as a reference. Subsequent experiments should fall within a &lt;em&gt;reasonable range&lt;/em&gt; of X: sudden drops or gains in performance trigger an alert to review results manually. Sometimes this is good news (performance improved because my new features are working); other times it is bad news: sudden gains in performance might come from information leakage, while sudden drops might come from incorrectly processed data or accidentally dropped rows or columns.&lt;/p&gt;
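
&lt;p&gt;A sketch of such a relative criterion using only the standard library; the 0.05 tolerance is an arbitrary choice you would tune to your own metric's variance:&lt;/p&gt;

```python
# Sketch: compare the new accuracy against a stored reference and flag
# sudden drops or gains for manual review.
import math

def within_expected_range(accuracy, reference, tolerance=0.05):
    # True when the absolute difference is at most the tolerance
    return math.isclose(accuracy, reference, abs_tol=tolerance, rel_tol=0.0)

reference = 0.87  # saved from the first accepted experiment

ok = within_expected_range(0.89, reference)          # small change: fine
suspicious = within_expected_range(0.99, reference)  # sudden gain: review
```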

&lt;p&gt;To keep running time feasible, run the training pipeline with the data sample and have your test compare performance against a metric obtained using the same sampling procedure. This is more complex than it sounds: variance in the results increases when you train with less data, which makes defining the &lt;em&gt;reasonable range&lt;/em&gt; more challenging.&lt;/p&gt;

&lt;p&gt;If the above strategy does not work, you can try using a surrogate model in your CI pipeline that is faster to train, and increase your data sample size. For example, if you are training a neural network, you could train with a simpler architecture to make training faster, and increase the data sample used in CI to reduce variance across CI runs.&lt;/p&gt;

&lt;h1&gt;
  
  
  The next frontier: CD for Data Science
&lt;/h1&gt;

&lt;p&gt;CI allows us to integrate code in short cycles, but that's not the end of the story. At some point we have to &lt;em&gt;deploy&lt;/em&gt; our project, this is where &lt;a href="https://en.wikipedia.org/wiki/Continuous_delivery"&gt;Continuous Delivery&lt;/a&gt; and &lt;a href="https://en.wikipedia.org/wiki/Continuous_deployment"&gt;Continuous Deployment&lt;/a&gt; come in.&lt;/p&gt;

&lt;p&gt;The first step towards deployment is &lt;em&gt;releasing&lt;/em&gt; our project. Releasing is taking all necessary files (i.e. source code, configuration files, etc) and putting them in a format that can be used for installing our project in the production environment. For example, releasing a Python package requires uploading our code to the Python Package Index.&lt;/p&gt;

&lt;p&gt;Continuous Delivery ensures that software &lt;em&gt;can be&lt;/em&gt; released at any time, but deployment remains a manual process (i.e. someone has to execute instructions in the production environment); in other words, it only automates the release process. Continuous Deployment automates both release and deployment. Let's now analyze these concepts in terms of data projects.&lt;/p&gt;

&lt;p&gt;For pipelines that produce human-readable documents (e.g. a report), Continuous Deployment is straightforward. After CI passes, another process grabs all the necessary files and creates an installable artifact; the production environment then uses this artifact to set up and install our project. The next time the pipeline runs, it will be using the latest stable version.&lt;/p&gt;

&lt;p&gt;On the other hand, Continuous Deployment for ML pipelines is much harder. The output of a pipeline is not a single model, but several candidate models that should be compared so the best one can be deployed. Things get even more complicated if we already have a model in production, because not deploying at all might be the best option (e.g. if the new model doesn't improve predictive power significantly but comes with an increase in runtime or more data dependencies).&lt;/p&gt;

&lt;p&gt;An even more important (and more difficult) property to assess than predictive power is &lt;a href="https://en.wikipedia.org/wiki/Fairness_(machine_learning)"&gt;model fairness&lt;/a&gt;. Every new deployment must be evaluated for bias towards sensitive groups. Coming up with an automated way to assess a model on both predictive power and fairness is very difficult, and it deserves its own post. If you want to know more about model fairness, &lt;a href="https://fairmlbook.org/"&gt;this is a great place to start&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;But Continuous Delivery for ML is still a manageable process. Once a commit passes all tests with a data sample (CI), another process runs the pipeline with the full dataset and stores the final dataset in object storage (CD stage 1).&lt;/p&gt;

&lt;p&gt;The training procedure then loads the artifacts and finds optimal models by tuning the hyperparameters of each selected algorithm. Finally, it serializes the best model specification (i.e. the algorithm and its best hyperparameters) along with evaluation reports (CD stage 2). When it's all done, we look at the reports and choose a model for deployment.&lt;/p&gt;

&lt;p&gt;In the previous section, we discussed how to include model evaluation in the CI workflow, where the proposed solution is limited by CI running-time requirements. Once the first stage of the CD process is done, we can add a more robust check: train the latest best model specification with the full dataset. This catches bugs that cause performance drops right away, instead of having to wait for the much longer second stage to finish. The CD workflow looks like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--O1ktuFzH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/towp1atdnzyv5x4prznm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--O1ktuFzH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/towp1atdnzyv5x4prznm.png" alt="Alt Text" width="800" height="644"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Triggering CD from a successful CI run can be manual; a data scientist might not want to generate datasets for every passing commit. But it should be easy to do given the commit hash (i.e. with a single click or command).&lt;/p&gt;

&lt;p&gt;It is also convenient to allow manual execution of the second stage, because data scientists often use the same dataset to run several &lt;em&gt;experiments&lt;/em&gt; by customizing the training pipeline; thus, a single dataset can potentially trigger many training jobs.&lt;/p&gt;

&lt;p&gt;Experiment reproducibility is critical in ML pipelines. There is a one-to-one relationship between a commit, a CI run, and a data preparation run (CD stage 1); thus, we can uniquely identify a dataset by the commit hash that generated it. We should be able to reproduce any experiment by running the data preparation step again and then running the training stage with the same training parameters.&lt;/p&gt;

&lt;h1&gt;
  
  
  Closing remarks
&lt;/h1&gt;

&lt;p&gt;As CI/CD processes for Data Science mature and standardize, we'll start to see new tools that ease implementation. Currently, many data scientists don't even consider CI/CD as part of their workflow; in some cases they just don't know about it, in others because implementing CI/CD effectively requires repurposing existing tools through a nontrivial setup process. Data scientists should not have to worry about setting up a CI/CD service; they should just focus on writing code and tests, and push.&lt;/p&gt;

&lt;p&gt;Apart from CI/CD tools specifically tailored for data projects, we also need data pipeline management tools to standardize pipeline development. In the last couple of years, I've seen a lot of new projects; unfortunately, most of them focus on aspects such as scheduling or scaling rather than user experience, which is critical if we want software engineering practices such as modularization and testing to be embraced by all data scientists. This is why we built &lt;a href="https://github.com/ploomber/ploomber"&gt;Ploomber&lt;/a&gt;: to help data scientists easily and incrementally adopt better development practices.&lt;/p&gt;

&lt;p&gt;Shortening the developer feedback loop is critical for CI success. While data sampling is an effective approach, we can do better with incremental runs: changing a single task in a pipeline should trigger the least amount of work possible by reusing previously computed artifacts. Ploomber already offers this to some extent, and we are experimenting with ways to improve this feature.&lt;/p&gt;

&lt;p&gt;I believe CI is the most important missing piece in the Data Science software stack: we already have great tools for AutoML, one-click model deployment, and model monitoring. CI will close the gap, allowing teams to confidently and continuously train and deploy models.&lt;/p&gt;

&lt;p&gt;To move our field forward, we must start paying more attention to our development processes.&lt;/p&gt;

&lt;p&gt;Found an error in this post? &lt;a href="https://github.com/ploomber/posts/issues/new?title=Issue%20in%20post%3A%20%22Rethinking%20Continuous%20Integration%20for%20Data%20Science%22"&gt;Click here to let us know&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;Originally posted at &lt;a href="https://ploomber.io/posts/ci4ds"&gt;ploomber.io&lt;/a&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>datascience</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Robust Jupyter report generation using static analysis</title>
      <dc:creator>Eduardo Blancas</dc:creator>
      <pubDate>Mon, 20 Apr 2020 23:25:39 +0000</pubDate>
      <link>https://dev.to/edublancas/robust-jupyter-report-generation-using-static-analysis-1ch9</link>
      <guid>https://dev.to/edublancas/robust-jupyter-report-generation-using-static-analysis-1ch9</guid>
      <description>&lt;p&gt;Jupyter notebooks are a great format for generating data analysis reports since they can contain rich output such as tables and charts in a single file. With the release of &lt;a href="https://github.com/nteract/papermill"&gt;papermill&lt;/a&gt;, a package that lets you parametrize and execute &lt;code&gt;.ipynb&lt;/code&gt; files programmatically, it became easier to use notebooks as templates to  generate analytical reports. When developing a Machine Learning model, I use Jupyter notebooks in tandem with papermill to generate a report for each experiment I run, this way, I can always go back and check performance metrics, tables and charts to compare one experiment to another.&lt;/p&gt;

&lt;p&gt;After trying out the Jupyter notebook + papermill combination in a few projects, I found some recurring problems:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;.ipynb&lt;/code&gt; files store cells' output in the same file. This is good for the final report, but for development purposes, if two or more people edit the same notebook, the cells' output gets in the way, making &lt;code&gt;git merge&lt;/code&gt; a big pain&lt;/li&gt;
&lt;li&gt;Even if we make sure cells' output is deleted before pushing to the git repository, comparing versions with &lt;code&gt;git diff&lt;/code&gt; yields illegible results (&lt;code&gt;.ipynb&lt;/code&gt; files are JSON files with a complex structure)&lt;/li&gt;
&lt;li&gt;Notebooks are developed interactively: cells are added and moved around, and this interactivity often causes a top-to-bottom execution to fail. Given that papermill executes notebooks cell by cell, something as simple as a syntax error in the very last cell won't be raised until that cell is executed&lt;/li&gt;
&lt;li&gt;Papermill doesn't validate input parameters; it just adds a new cell. This might lead to unexpected behavior, such as an "undefined variable" error or inadvertently using a default parameter value. This is especially frustrating for long-running notebooks, where one finds out about errors only after waiting for the notebook to finish execution&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In this blog post, I'll explain my workflow for robust report generation. The post is divided into two parts: part I discusses the solution to problems 1 and 2, and part II covers problems 3 and 4. Incorporating this workflow will help you better integrate your report's source code with git and save precious time by automatically preventing notebook execution when errors are detected.&lt;/p&gt;

&lt;p&gt;Along the way, you'll also learn a few interesting things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How Jupyter notebooks are represented (the &lt;code&gt;.ipynb&lt;/code&gt; format)&lt;/li&gt;
&lt;li&gt;How to read and manipulate notebooks using the &lt;code&gt;nbformat&lt;/code&gt; package&lt;/li&gt;
&lt;li&gt;How to convert a Python script (&lt;code&gt;.py&lt;/code&gt;) to a Jupyter notebook using &lt;code&gt;jupytext&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Basic Jupyter notebook static analysis using &lt;code&gt;pyflakes&lt;/code&gt; and &lt;code&gt;parso&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;How to programmatically execute Jupyter notebooks using &lt;code&gt;papermill&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;How to automate report validation and generation using &lt;code&gt;ploomber&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Workflow summary
&lt;/h2&gt;

&lt;p&gt;The solution to problems 1 and 2 is to use a different format for development and convert to &lt;code&gt;.ipynb&lt;/code&gt; right before execution; &lt;a href="https://github.com/mwouts/jupytext"&gt;jupytext&lt;/a&gt; does exactly that. Problems 3 and 4 are approached by doing static analysis before executing the notebook.&lt;/p&gt;

&lt;p&gt;Step by step summary:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Work on &lt;code&gt;.py&lt;/code&gt; files (instead of &lt;code&gt;.ipynb&lt;/code&gt;) to make git integration easier&lt;/li&gt;
&lt;li&gt;Declare your "notebook" parameters at the top, tagging the cell as "parameters" (see &lt;a href="https://jupytext.readthedocs.io/en/latest/formats.html#notebooks-as-scripts"&gt;jupytext reference&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Before executing your notebook, validate the &lt;code&gt;.py&lt;/code&gt; file using pyflakes and parso&lt;/li&gt;
&lt;li&gt;If validation succeeds, use jupytext to convert your &lt;code&gt;.py&lt;/code&gt; file to a &lt;code&gt;.ipynb&lt;/code&gt; notebook&lt;/li&gt;
&lt;li&gt;Execute your &lt;code&gt;.ipynb&lt;/code&gt; notebook using papermill&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Alternatively, you can use &lt;a href="https://github.com/ploomber/ploomber"&gt;ploomber&lt;/a&gt; to automate the whole process, sample code is provided at the end of this post.&lt;/p&gt;

&lt;h1&gt;
  
  
  Part I: To ease git integration, replace &lt;code&gt;.ipynb&lt;/code&gt; notebooks with &lt;code&gt;.py&lt;/code&gt; files
&lt;/h1&gt;

&lt;h2&gt;
  
  
  How are notebooks represented on disk?
&lt;/h2&gt;

&lt;p&gt;A &lt;code&gt;notebook.ipynb&lt;/code&gt; file is just a JSON file with a certain structure, which is defined in the &lt;a href="https://nbformat.readthedocs.io/en/latest/"&gt;&lt;code&gt;nbformat&lt;/code&gt; package&lt;/a&gt;. When we open the Jupyter application (by using the &lt;code&gt;jupyter notebook&lt;/code&gt; command), Jupyter uses nbformat under the hood to save our changes in the &lt;code&gt;.ipynb&lt;/code&gt; file.&lt;/p&gt;

&lt;p&gt;Let's see how we can create a notebook by directly manipulating an object and then serializing it to JSON:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# create a new notebook (nbformat.v4 defines the lastest jupyter notebook format)
&lt;/span&gt;&lt;span class="n"&gt;nb&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nbformat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;v4&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;new_notebook&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# let's add a new code cell
&lt;/span&gt;&lt;span class="n"&gt;cell&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nbformat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;v4&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;new_code_cell&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;source&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'# this line was added programatically&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt; 1 + 1'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;nb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cells&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cell&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# what kind of object is this?
&lt;/span&gt;&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;'A notebook is an object of type: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nb&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Console output: (1/1):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;A notebook is an object of type: &amp;lt;class 'nbformat.notebooknode.NotebookNode'&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can convert the notebook object to its JSON representation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;writer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nbformat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;v4&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nbjson&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;JSONWriter&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;nb_json&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;writer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;writes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nb&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'Notebook JSON representation (content of the .ipynb file):&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;nb_json&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Console output: (1/1):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Notebook JSON representation (content of the .ipynb file):

 {
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# this line was added programmatically\n",
    " 1 + 1"
   ]
  }
 ],
 "metadata": {},
 "nbformat": 4,
 "nbformat_minor": 4
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The notebook is a great output format since it supports embedded charts and tables in a single file, which we can easily share or review later, but it's not a good choice as a source code format. Say we edit the previous notebook by changing the first cell and adding a second one:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# edit first cell
&lt;/span&gt;&lt;span class="n"&gt;nb&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'cells'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="s"&gt;'source'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;'# Change cell&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt; 2 + 2'&lt;/span&gt;

&lt;span class="c1"&gt;# add a new one
&lt;/span&gt;&lt;span class="n"&gt;cell&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nbformat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;v4&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;new_code_cell&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;source&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'# This is a new cell&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt; 3 + 3'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;nb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cells&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cell&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;How would our changes look to a reviewer?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# generate diff view between the old and the new notebook
&lt;/span&gt;&lt;span class="n"&gt;nb_json_edited&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;writer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;writes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nb&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;diff&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;difflib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ndiff&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nb_json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;splitlines&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;keepends&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                     &lt;span class="n"&gt;nb_json_edited&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;splitlines&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;keepends&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;''&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;diff&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;end&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Console output: (1/1):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
   "cells": [
    {
     "cell_type": "code",
     "execution_count": null,
     "metadata": {},
     "outputs": [],
     "source": [
-     "# this line was added programmatically\n",
+     "# Change cell\n",
-     " 1 + 1"
?       ^   ^
+     " 2 + 2"
?       ^   ^
+    ]
+   },
+   {
+    "cell_type": "code",
+    "execution_count": null,
+    "metadata": {},
+    "outputs": [],
+    "source": [
+     "# This is a new cell\n",
+     " 3 + 3"
     ]
    }
   ],
   "metadata": {},
   "nbformat": 4,
   "nbformat_minor": 4
  }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It's hard to see what's going on, and this is just a notebook with two cells and no output. In a real notebook with dozens of cells, understanding the difference between the old and new versions by eye is impossible.&lt;/p&gt;
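&lt;p&gt;For comparison, here's what the same edit looks like when the notebook is stored as plain Python (a sketch using jupytext's percent format; the light format used later in this post diffs just as cleanly):&lt;/p&gt;

```python
import difflib

# the same two-cell notebook stored as plain Python (jupytext percent format)
old = "# %%\n# this line was added programmatically\n1 + 1\n"
new = "# %%\n# Change cell\n2 + 2\n\n# %%\n# This is a new cell\n3 + 3\n"

# generate the same diff view as before, now over plain Python source
result = ''.join(difflib.ndiff(old.splitlines(keepends=True),
                               new.splitlines(keepends=True)))
print(result, end='')
```

&lt;p&gt;The diff now contains only the lines we actually touched, with no JSON noise around them.&lt;/p&gt;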

&lt;p&gt;To ease git integration, I use plain &lt;code&gt;.py&lt;/code&gt; files and only convert them to &lt;code&gt;.ipynb&lt;/code&gt; notebooks before execution. We could parse a &lt;code&gt;.py&lt;/code&gt; file and convert it to a valid &lt;code&gt;.ipynb&lt;/code&gt; file using nbformat, but there are important details such as tags or markdown cells that we have to take care of; fortunately, &lt;a href="https://github.com/mwouts/jupytext"&gt;&lt;code&gt;jupytext&lt;/code&gt;&lt;/a&gt; does that for us.&lt;/p&gt;

&lt;p&gt;Furthermore, once jupytext is installed, the &lt;code&gt;jupyter notebook&lt;/code&gt; application will treat an opened &lt;code&gt;.py&lt;/code&gt; file as a notebook, and we will be able to run, add, and remove cells as usual.&lt;/p&gt;

&lt;p&gt;Let's see how to convert a &lt;code&gt;.py&lt;/code&gt; file to &lt;code&gt;.ipynb&lt;/code&gt; using jupytext:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# define your "notebook" in a plain .py file
# note that jupytext defines a syntax to support markdown cells and cell tags
&lt;/span&gt;&lt;span class="n"&gt;py_code&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"""# This is a markdown cell

# + tags=['parameters']
x = 1
y = 2
"""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# use jupyter to convert it to a notebook object
&lt;/span&gt;&lt;span class="n"&gt;nb&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;jupytext&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;py_code&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'py'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;'Object type:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nb&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Console output: (1/1):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Object type:
&amp;lt;class 'nbformat.notebooknode.NotebookNode'&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Using &lt;code&gt;.py&lt;/code&gt; files solves problems 1 and 2. Let's now discuss problems 3 and 4.&lt;/p&gt;

&lt;h1&gt;
  
  
  Part II: To catch errors before execution, use static analysis
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://en.wikipedia.org/wiki/Static_program_analysis"&gt;Static analysis&lt;/a&gt; is the analysis of source code without execution. Since our notebooks usually take a lot to run, we want to catch as many errors as we can before running them, given that we got rid of the complex &lt;code&gt;.ipynb&lt;/code&gt; format, we can now use tools that analyze Python source code to spot errors.&lt;/p&gt;

&lt;h2&gt;
  
  
  How does papermill execute Jupyter notebooks?
&lt;/h2&gt;

&lt;p&gt;To motivate this section, it is important to understand how papermill executes notebooks. papermill performs a cell-by-cell execution: it takes the code from the first cell, sends it to the Python kernel, waits for a response, saves the output, and repeats this process for all cells. You can see the details in the source code &lt;a href="https://github.com/nteract/papermill/blob/fdd62827d789cb93e3e5c8debe3b5c5d8d3c58b8/papermill/clientwrap.py#L62"&gt;here&lt;/a&gt;; you'll notice that &lt;code&gt;PapermillNotebookClient&lt;/code&gt; is a subclass of &lt;code&gt;NotebookClient&lt;/code&gt;, which is part of &lt;a href="https://github.com/jupyter/nbclient"&gt;nbclient&lt;/a&gt;, an official package that also implements a notebook executor.&lt;/p&gt;

&lt;p&gt;This cell-by-cell logic has an important implication: an error in cell i won't be raised until that cell is executed. Imagine your notebook looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;time&lt;/span&gt;

&lt;span class="c1"&gt;# cell 1 - simulate a long-running operation
&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3600&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# ...
# ...
&lt;/span&gt;
&lt;span class="c1"&gt;# cell 100 - there is a syntax error here (missing ":")!
&lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;
    &lt;span class="k"&gt;pass&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Something as simple as a syntax error won't make your notebook crash until execution reaches cell 100, one hour later. To fix this problem, we will do a very simple static analysis on the whole notebook source code before executing it.&lt;/p&gt;
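&lt;p&gt;The cheapest possible check is the built-in &lt;code&gt;compile()&lt;/code&gt; function, which parses the whole source without running it, so a syntax error like the one above surfaces instantly instead of an hour later (a minimal sketch of the idea; the next section uses pyflakes, which catches more than just syntax errors):&lt;/p&gt;

```python
# notebook source with a syntax error in the very last "cell"
source = """
import time

# cell 1 - simulate a long-running operation
time.sleep(3600)

# cell 100 - there is a syntax error here (missing ":")!
if x > 10
    pass
"""

try:
    # compile() parses the code without executing it, so time.sleep
    # never runs and the syntax error is reported immediately
    compile(source, 'notebook.py', 'exec')
    ok = True
except SyntaxError as e:
    ok = False
    print(f'{e.filename}:{e.lineno}: {e.msg}')
```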

&lt;h2&gt;
  
  
  Finding errors with pyflakes
&lt;/h2&gt;

&lt;p&gt;To prevent some runtime errors, we will run a few checks on our source code before executing it. pyflakes is a tool that looks for errors in source code by parsing it. Since pyflakes does not execute the code, it is limited in how many errors it can find, but it is very useful for finding simple errors that would otherwise only be detected at runtime. For the full list of errors pyflakes can detect, &lt;a href="https://github.com/PyCQA/pyflakes/blob/master/pyflakes/messages.py"&gt;see this&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Let's see how it works:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;py_code&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"""
import time

time.sleep(3600)

x = 1
y = 2

# z is never defined!
x + y + z

print('Variables. x: {}. y: {}'.format(x))
"""&lt;/span&gt;

&lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pyflakes_check&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;py_code&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'my_file.py'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Console output: (1/1):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;my_file.py:10:9 undefined name 'z'
my_file.py:12:7 '...'.format(...) is missing argument(s) for placeholder(s): 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;pyflakes found that the variable 'z' is used but never defined; had we executed this notebook, we'd have found out about the error only after waiting for one hour.&lt;/p&gt;

&lt;p&gt;There are other projects similar to pyflakes, such as &lt;a href="https://www.pylint.org/"&gt;pylint&lt;/a&gt;. pylint is able to find more errors than pyflakes, but it also flags style issues (such as inconsistent indentation). We probably don't want to prevent notebook execution due to style issues, so we'd have to filter out some messages; pyflakes works just fine for our purposes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Parametrized notebooks with papermill
&lt;/h2&gt;

&lt;p&gt;papermill can parametrize notebooks, which allows you to use them as templates. Say you have a notebook called &lt;code&gt;yearly_template.ipynb&lt;/code&gt; that takes a year as a parameter and generates a summary of the data generated in that year; you could execute it from the command line using papermill like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;papermill yearly_template.ipynb report_2019.ipynb &lt;span class="nt"&gt;-p&lt;/span&gt; year 2019
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;.ipynb&lt;/code&gt; files support &lt;a href="https://nbformat.readthedocs.io/en/latest/format_description.html#cell-metadata"&gt;cell tags&lt;/a&gt;. When you execute a notebook, papermill will inject a new cell with your parameters just below the cell tagged with "parameters". Although we are not dealing with &lt;code&gt;.ipynb&lt;/code&gt; files anymore, we can still tag cells using jupytext's syntax. Let's define a simple notebook:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# + tags=['parameters']
&lt;/span&gt;&lt;span class="n"&gt;year&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;


&lt;span class="c1"&gt;# +
&lt;/span&gt;&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'the year is {}'&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;year&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you convert the code above to &lt;code&gt;.ipynb&lt;/code&gt; and then execute it using papermill, the following cells will run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Cell 1: cell tagged with "parameters"
&lt;/span&gt;&lt;span class="n"&gt;year&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;


&lt;span class="c1"&gt;# Cell 2: injected by papermill
&lt;/span&gt;&lt;span class="n"&gt;year&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2019&lt;/span&gt;


&lt;span class="c1"&gt;# Cell 3
&lt;/span&gt;&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'the year is {}'&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;year&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;papermill limits itself to injecting the passed parameters and executing the notebook; it does not perform any kind of validation. Adding simple validation logic can help us prevent runtime errors before execution.&lt;/p&gt;
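&lt;p&gt;The validation we're after boils down to two set comparisons between the declared and the passed parameter names (a minimal sketch with hypothetical names; the full implementation later in this post also runs pyflakes):&lt;/p&gt;

```python
def validate_params(declared, passed):
    """Compare declared parameter names against the passed ones."""
    declared, passed = set(declared), set(passed)

    # passing a parameter that was never declared is an error...
    extra = passed - declared
    if extra:
        raise ValueError(f'Undeclared parameters passed: {sorted(extra)}')

    # ...while a declared-but-missing parameter only deserves a warning,
    # since the declared default value will be used
    return sorted(declared - passed)

missing = validate_params(declared={'year', 'country'}, passed={'year'})
print('declared but not passed:', missing)  # ['country']
```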

&lt;h2&gt;
  
  
  Extracting declared parameters with parso
&lt;/h2&gt;

&lt;p&gt;I want parametrized notebooks to behave more like functions: they should refuse to run if any parameter is missing or if anything is passed but not declared. To enable this feature we have to analyze the "parameters" cell and compare it with the parameters passed via papermill. &lt;a href="https://github.com/davidhalter/parso"&gt;parso&lt;/a&gt; is a package that parses Python code and allows us to do exactly that.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;params_cell&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"""
# + tags=['parameters']
a = 1
b = 2
c = 3
"""&lt;/span&gt;

&lt;span class="c1"&gt;# parse "parameters" cell, find which variables are defined
&lt;/span&gt;&lt;span class="n"&gt;module&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;parso&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;params_cell&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Defined variables: '&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;module&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_used_names&lt;/span&gt;&lt;span class="p"&gt;()))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Console output: (1/1):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Defined variables:  ['a', 'b', 'c']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We see that parso detected the three variables; we can use this information to validate input parameters against the declared ones.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note: I recently discovered that finding declared variables can also be done with the &lt;a href="https://docs.python.org/3/library/ast.html"&gt;ast module&lt;/a&gt;, which is part of the standard library&lt;/em&gt;.&lt;/p&gt;
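&lt;p&gt;For reference, here's a sketch of the same extraction using the standard library's &lt;code&gt;ast&lt;/code&gt; module, collecting the names bound by plain assignments in the "parameters" cell:&lt;/p&gt;

```python
import ast

params_cell = """
a = 1
b = 2
c = 3
"""

# walk the module body and collect names bound by simple assignments
declared = {
    target.id
    for node in ast.parse(params_cell).body
    if isinstance(node, ast.Assign)
    for target in node.targets
    if isinstance(target, ast.Name)
}
print(sorted(declared))  # ['a', 'b', 'c']
```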

&lt;h2&gt;
  
  
  Putting it all together
&lt;/h2&gt;

&lt;p&gt;We now implement the logic in a single function that takes Python source code as input and validates it using pyflakes and parso.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;check_notebook_source&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nb_source&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'notebook'&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="s"&gt;"""
    Perform static analysis on Jupyter notebook source code;
    raises an exception if validation fails

    Parameters
    ----------
    nb_source : str
        Jupyter notebook source code in jupytext's py format,
        must have a cell with the tag "parameters"

    params : dict
        Parameters that will be added to the notebook source

    filename : str
        Filename to identify pyflakes warnings and errors
    """&lt;/span&gt;
    &lt;span class="c1"&gt;# parse the JSON string and convert it to a notebook object using jupytext
&lt;/span&gt;    &lt;span class="n"&gt;nb&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;jupytext&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nb_source&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'py'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# add a new cell just below the cell tagged with "parameters"
&lt;/span&gt;    &lt;span class="c1"&gt;# this emulates the notebook that papermill will run
&lt;/span&gt;    &lt;span class="n"&gt;nb&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;params_cell&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;add_passed_parameters&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nb&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# run pyflakes and collect errors
&lt;/span&gt;    &lt;span class="n"&gt;res&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;check_source&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nb&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;error_message&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;'&lt;/span&gt;

    &lt;span class="c1"&gt;# pyflakes returns "warnings" and "errors", collect them separately
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;res&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'warnings'&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="n"&gt;error_message&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="s"&gt;'pyflakes warnings:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;'&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;res&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'warnings'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;res&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'errors'&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="n"&gt;error_message&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="s"&gt;'pyflakes errors:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;'&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;res&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'errors'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="c1"&gt;# compare passed parameters with declared
&lt;/span&gt;    &lt;span class="c1"&gt;# parameters. This will make our notebook behave more
&lt;/span&gt;    &lt;span class="c1"&gt;# like a "function", if any parameter is passed but not
&lt;/span&gt;    &lt;span class="c1"&gt;# declared, this will return an error message, if any parameter
&lt;/span&gt;    &lt;span class="c1"&gt;# is declared but not passed, a warning is shown
&lt;/span&gt;    &lt;span class="n"&gt;res_params&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;check_params&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;params_cell&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'source'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;error_message&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;res_params&lt;/span&gt;

    &lt;span class="c1"&gt;# if any errors were returned, raise an exception
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;error_message&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="s"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nb"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;error_message&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's now see the implementation of the functions used above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;check_params&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;params_source&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="s"&gt;"""
    Compare the parameters cell's source with the passed parameters: warn
    on missing parameters, return an error message if an extra one was passed.
    """&lt;/span&gt;
    &lt;span class="c1"&gt;# params are keys in "params" dictionary
&lt;/span&gt;    &lt;span class="n"&gt;params&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# use parso to parse the "parameters" cell source code and get all variable names declared
&lt;/span&gt;    &lt;span class="n"&gt;declared&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parso&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;params_source&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;get_used_names&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;keys&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

    &lt;span class="c1"&gt;# now act depending on missing variables and/or extra variables
&lt;/span&gt;
    &lt;span class="n"&gt;missing&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;declared&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;
    &lt;span class="n"&gt;extra&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;declared&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;missing&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;warnings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;warn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="s"&gt;'Missing parameters: {}, will use default value'&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;missing&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;extra&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="s"&gt;'Passed non-declared parameters: {}'&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;extra&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="s"&gt;''&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
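&lt;p&gt;The comparison inside &lt;code&gt;check_params&lt;/code&gt; is plain set arithmetic. Here is a standalone sketch with hypothetical &lt;code&gt;declared&lt;/code&gt; and &lt;code&gt;passed&lt;/code&gt; values (in the real function, &lt;code&gt;declared&lt;/code&gt; comes from parsing the cell with parso):&lt;/p&gt;

```python
# hypothetical example values for illustration
declared = {'a', 'b'}   # variable names found in the "parameters" cell
passed = {'a', 'c'}     # keys of the params dict sent by the caller

missing = declared - passed  # declared but not passed: the default is used
extra = passed - declared    # passed but never declared: an error

print(missing)  # {'b'}
print(extra)    # {'c'}
```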





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;check_source&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nb&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="s"&gt;"""
    Run pyflakes on the notebook source; this will catch errors such as
    references to parameters that were neither passed nor given default values
    """&lt;/span&gt;
    &lt;span class="c1"&gt;# concatenate all cell's source code in a single string
&lt;/span&gt;    &lt;span class="n"&gt;source&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;'&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'source'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;nb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cells&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

    &lt;span class="c1"&gt;# this objects are needed to capture pyflakes output
&lt;/span&gt;    &lt;span class="n"&gt;warn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;StringIO&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;StringIO&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;reporter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Reporter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;warn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# run pyflakes.api.check on the source code
&lt;/span&gt;    &lt;span class="n"&gt;pyflakes_check&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;source&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;reporter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;reporter&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;warn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;seek&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;seek&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# return any error messages returned by pyflakes
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;'warnings'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;'&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;warn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;readlines&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt;
            &lt;span class="s"&gt;'errors'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;'&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;readlines&lt;/span&gt;&lt;span class="p"&gt;())}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
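&lt;p&gt;If you only need to reject notebooks that will not even parse, a rough standard-library analogue of &lt;code&gt;check_source&lt;/code&gt; can use the built-in &lt;code&gt;compile&lt;/code&gt;. This is a simplification: unlike pyflakes, it catches syntax errors only, not undefined names.&lt;/p&gt;

```python
def check_syntax(source, filename='notebook'):
    """Stdlib-only sketch: catch syntax errors via compile().

    Unlike pyflakes, this cannot detect undefined names; it only
    verifies that the concatenated cell source parses.
    """
    try:
        compile(source, filename, 'exec')
    except SyntaxError as e:
        return {'errors': '{}:{}:{}: {}'.format(filename, e.lineno, e.offset, e.msg)}
    return {'errors': ''}

print(check_syntax('a = 1\nb = 2'))  # no errors reported
print(check_syntax('if'))            # reports a syntax error
```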





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;add_passed_parameters&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nb&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="s"&gt;"""
    Insert a cell just below the one tagged with "parameters"

    Notes
    -----
    Insert a code cell with params, to simulate the notebook papermill
    will run. This is a simple implementation, for the actual one see:
    https://github.com/nteract/papermill/blob/master/papermill/parameterize.py
    """&lt;/span&gt;
    &lt;span class="c1"&gt;# find "parameters" cell
&lt;/span&gt;    &lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;params_cell&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_get_parameters_cell&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nb&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# convert the parameters passed to valid python code
&lt;/span&gt;    &lt;span class="c1"&gt;# e.g {'a': 1, 'b': 'hi'} to:
&lt;/span&gt;    &lt;span class="c1"&gt;# a = 1
&lt;/span&gt;    &lt;span class="c1"&gt;# b = 'hi'
&lt;/span&gt;    &lt;span class="n"&gt;params_as_code&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;'&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;_parse_token&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;()])&lt;/span&gt;

    &lt;span class="c1"&gt;# insert the cell with the passed parameters
&lt;/span&gt;    &lt;span class="n"&gt;nb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cells&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;insert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;'cell_type'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;'code'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'metadata'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{},&lt;/span&gt;
                              &lt;span class="s"&gt;'execution_count'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                              &lt;span class="s"&gt;'source'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;params_as_code&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                              &lt;span class="s"&gt;'outputs'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[]})&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;nb&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;params_cell&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_get_parameters_cell&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nb&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="s"&gt;"""
    Iterate over cells and return the index and content of the
    first cell tagged "parameters"; raise a ValueError if no
    such cell is found
    """&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cells&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;cell_tags&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'tags'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;cell_tags&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="s"&gt;'parameters'&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;cell_tags&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;

    &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nb"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'Notebook does not have a cell tagged "parameters"'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_parse_token&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="s"&gt;"""
    Convert parameters to their Python code representation

    Notes
    -----
    This is a very simple way of doing it, for a more complete implementation,
    check out papermill's source code:
    https://github.com/nteract/papermill/blob/master/papermill/translators.py
    """&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="s"&gt;'{} = {}'&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;repr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Testing our &lt;code&gt;check_notebook_source&lt;/code&gt; function
&lt;/h2&gt;

&lt;p&gt;Here we show some use cases for our validation function.&lt;/p&gt;

&lt;h3&gt;
  
  
  Raise an error if the "parameters" cell does not exist:
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;notebook_no_parameters_tag&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"""
a + b
"""&lt;/span&gt;

&lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;check_notebook_source&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;notebook_no_parameters_tag&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;'a'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'b'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'Raised exception:'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Console output: (1/1):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Raised exception: Notebook does not have a cell tagged "parameters"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Do not raise errors if the "parameters" cell exists and the passed parameters match:
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;notebook_ab&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"""
# + tags=['parameters']
a = 1
b = 2

# +
a + b
"""&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;check_notebook_source&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;notebook_ab&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;'a'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'b'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Warn if using a default value:
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;check_notebook_source&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;notebook_ab&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;'a'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Console output: (1/1):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/Users/Edu/miniconda3/envs/blog/lib/python3.6/site-packages/ipykernel_launcher.py:19: UserWarning: Missing parameters: {'b'}, will use default value
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Raise an error when passing an undeclared parameter:
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;check_notebook_source&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;notebook_ab&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;'a'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'b'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'c'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'Raised exception:'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Console output: (1/2):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Raised exception:
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Console output: (2/2):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Passed non-declared parameters: {'c'}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Raise an error if a variable is used but never declared:
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;notebook_w_warning&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"""
# + tags=['parameters']
a = 1
b = 2

# +
# variable "c" is used but never declared!
a + b + c
"""&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;check_notebook_source&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;notebook_w_warning&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;'a'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'b'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'Raised exception:'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Console output: (1/1):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Raised exception: 
pyflakes warnings:
notebook:7:9 undefined name 'c'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Catch syntax error:
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;notebook_w_error&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"""
# + tags=['parameters']
a = 1
b = 2

# +
if
"""&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;check_notebook_source&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;notebook_w_error&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;'a'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'b'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'Raised exception:'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Console output: (1/2):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Raised exception:
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Console output: (2/2):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pyflakes errors:
notebook:6:3: invalid syntax

if

  ^
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Automating the workflow using ploomber
&lt;/h2&gt;

&lt;p&gt;To implement this workflow effectively, we have to make sure our validation function always runs, then convert the &lt;code&gt;.py&lt;/code&gt; file to &lt;code&gt;.ipynb&lt;/code&gt;, and finally execute it with papermill. ploomber automates this workflow easily; it can even convert the final output to several formats, such as HTML. We only have to pass the source code and call our &lt;code&gt;check_notebook_source&lt;/code&gt; from the &lt;code&gt;on_render&lt;/code&gt; hook.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note: for simplicity, we include the notebooks' source code as strings in the following example; in a real project, it is better to load it from a file.&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# source code for report 1
&lt;/span&gt;&lt;span class="n"&gt;nb&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"""
# # My report

# + tags=["parameters"]
product = None
x = 1
y = 2
# -

print('x + y =', x + y)
"""&lt;/span&gt;

&lt;span class="c1"&gt;# source code for report 2
&lt;/span&gt;&lt;span class="n"&gt;nb_another&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"""
# # Another report

# + tags=["parameters"]
product = None
x = 1
y = 2
# -

print('x - y =', x - y)
"""&lt;/span&gt;

&lt;span class="c1"&gt;# on render hook: run before executing the notebooks
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;on_render&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# task.params are read-only, get a copy
&lt;/span&gt;    &lt;span class="n"&gt;params&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;to_dict&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c1"&gt;# papermill (the library ploomber uses to execute notebooks) only supports
&lt;/span&gt;    &lt;span class="c1"&gt;# parameters that are JSON serializable, to test what papermill will
&lt;/span&gt;    &lt;span class="c1"&gt;# actually run, we have to check on the product in its serializable form
&lt;/span&gt;    &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'product'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'product'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;to_json_serializable&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;check_notebook_source&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;source&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="c1"&gt;# store all reports under output/
&lt;/span&gt;&lt;span class="n"&gt;out_dir&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'output'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;out_dir&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mkdir&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;exist_ok&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# ploomber gives you parallel execution for free
&lt;/span&gt;&lt;span class="n"&gt;dag&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;DAG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;executor&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'parallel'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# ploomber supports exporting ipynb files to several formats
# using the official nbconvert package, we convert our
# reports to HTML here by just adding the .html extension
&lt;/span&gt;&lt;span class="n"&gt;t1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;NotebookRunner&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nb&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;File&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;out_dir&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="s"&gt;'out.html'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'t1'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;ext_in&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'py'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;kernelspec_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'python3'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;'x'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="n"&gt;t1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;on_render&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;on_render&lt;/span&gt;

&lt;span class="n"&gt;t2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;NotebookRunner&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nb_another&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;File&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;out_dir&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="s"&gt;'another.html'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'t2'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;ext_in&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'py'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;kernelspec_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'python3'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;'x'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'y'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="n"&gt;t2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;on_render&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;on_render&lt;/span&gt;

&lt;span class="c1"&gt;# run the pipeline. No errors are raised but note that a warning is shown
&lt;/span&gt;&lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;build&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Console output: (1/1):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/Users/Edu/dev/ploomber/src/ploomber/dag.py:469: UserWarning: Task "NotebookRunner: t1 -&amp;gt; File(output/out.html)" had the following warnings:

Missing parameters: {'y'}, will use default value
  warnings.warn(warning)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;Jupyter notebooks (&lt;code&gt;.ipynb&lt;/code&gt;) are a great &lt;strong&gt;output&lt;/strong&gt; format, but keeping them under version control causes a lot of trouble. By using simple &lt;code&gt;.py&lt;/code&gt; files and leveraging jupytext, we get the best of both worlds: we edit plain Python source code files, but our generated reports are executed as Jupyter notebooks, which allows them to contain rich output such as tables and charts. To save time, we developed a function that validates our input notebook and catches errors before the notebook is executed. Finally, by using ploomber, we created a clean and efficient workflow: HTML reports are transparently generated from plain &lt;code&gt;.py&lt;/code&gt; files.&lt;/p&gt;
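&lt;p&gt;As a minimal illustration of the workflow summarized above (the cell contents and parameter are hypothetical), a jupytext percent-format &lt;code&gt;.py&lt;/code&gt; file that executes as a notebook might look like this:&lt;/p&gt;

```python
# %% [markdown]
# # Sales report

# %% tags=["parameters"]
# default parameter value; it can be overridden at execution time
x = 1

# %%
# a regular code cell; rich output (tables, charts) shows up
# in the executed notebook, while this file stays plain Python
print(f"running with x={x}")
```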

&lt;p&gt;Found an error in this post? &lt;a href="https://github.com/ploomber/posts/issues/new?title=Issue%20in%20post%3A%20%22Robust%20Jupyter%20report%20generation%20using%20static%20analysis%22"&gt;Click here to let us know&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;Originally posted at &lt;a href="https://ploomber.io/posts/nb-static-analysis"&gt;ploomber.io&lt;/a&gt;&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>jupyter</category>
      <category>python</category>
    </item>
    <item>
      <title>Setting up reproducible Python environments for Data Science</title>
      <dc:creator>Eduardo Blancas</dc:creator>
      <pubDate>Wed, 01 Apr 2020 21:41:03 +0000</pubDate>
      <link>https://dev.to/edublancas/setting-up-reproducible-python-environments-for-data-science-3822</link>
      <guid>https://dev.to/edublancas/setting-up-reproducible-python-environments-for-data-science-3822</guid>
      <description>&lt;p&gt;&lt;strong&gt;Setting up a Python environment for Data Science is hard&lt;/strong&gt;. Throughout my projects, I've experienced a recurring pattern when attempting to set up Python packages and their dependencies:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;You start with a clean Python environment, usually using &lt;code&gt;conda&lt;/code&gt;, &lt;code&gt;venv&lt;/code&gt; or &lt;code&gt;virtualenv&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;As you make progress, you start adding dependencies via &lt;code&gt;pip install&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Most of the time, it just works, but when it doesn't, a painful trial and error process follows&lt;/li&gt;
&lt;li&gt;And things keep working, until you have to reproduce your environment on a different machine...&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In this blog post, you'll learn how to set up reproducible Python environments for Data Science that are robust across operating systems and guidelines for troubleshooting installation errors.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why does &lt;code&gt;pip install&lt;/code&gt; fail?
&lt;/h2&gt;

&lt;p&gt;Most &lt;code&gt;pip install&lt;/code&gt; failures are due to missing dependencies. For example, some database drivers such as &lt;code&gt;psycopg2&lt;/code&gt; are just bindings to another library (in this case, &lt;a href="https://www.postgresql.org/docs/9.1/libpq.html"&gt;&lt;code&gt;libpq&lt;/code&gt;&lt;/a&gt;); if you try to install &lt;code&gt;psycopg2&lt;/code&gt; without having &lt;code&gt;libpq&lt;/code&gt;, it will fail. The key is knowing which dependencies are missing and how to install them.&lt;/p&gt;

&lt;p&gt;Before diving into more details, let's first give some background on how Python packages are distributed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Source distributions and built distributions
&lt;/h2&gt;

&lt;p&gt;There are two primary ways of distributing Python packages (distribution just means making a Python package available to anyone who wants to use it). The first one is a source distribution (&lt;code&gt;.tar.gz&lt;/code&gt; files), the second one is a built distribution (&lt;code&gt;.whl&lt;/code&gt; files), also known as wheels.&lt;/p&gt;
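&lt;p&gt;A quick way to tell the two apart is the file extension. The following sketch (a hypothetical helper, not part of any packaging tool) classifies a distribution file by name:&lt;/p&gt;

```python
def dist_type(filename):
    """Classify a Python distribution file by its extension."""
    if filename.endswith(".whl"):
        return "built distribution (wheel)"
    if filename.endswith((".tar.gz", ".zip")):
        return "source distribution (sdist)"
    return "unknown"


# a manylinux wheel vs. a source tarball
print(dist_type("numpy-1.18.1-cp36-cp36m-manylinux1_x86_64.whl"))
print(dist_type("bitarray-1.2.1.tar.gz"))
```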

&lt;p&gt;As the name implies, &lt;a href="https://packaging.python.org/glossary/#term-source-distribution-or-sdist"&gt;source distributions&lt;/a&gt; contain all the source code you need to build a package (building is a prerequisite to installing a package). The recipe to build is usually declared in a &lt;code&gt;setup.py&lt;/code&gt; file. This is the equivalent of having all the raw ingredients and instructions for cooking something.&lt;/p&gt;

&lt;p&gt;On the other hand, &lt;a href="https://packaging.python.org/glossary/#term-built-distribution"&gt;built distributions&lt;/a&gt; are generated by having source distributions go through the &lt;em&gt;build process&lt;/em&gt;. They are "already cooked" packages whose files only need to be moved to the correct location for you to use them. Built distributions are OS-specific, which means that you need a version compatible with your current operating system. It is the equivalent of having the dish ready and only taking it to your table.&lt;/p&gt;

&lt;p&gt;There are many nuances to this, but the bottom line is that built distributions are easier and faster to install (you just have to move files). If you want to know more about the differences, &lt;a href="https://packaging.python.org/overview/"&gt;this is a good place to start&lt;/a&gt;. Let's go back to our &lt;code&gt;pip install&lt;/code&gt; discussion.&lt;/p&gt;

&lt;h2&gt;
  
  
  What happens when I run &lt;code&gt;pip install [package]&lt;/code&gt;?
&lt;/h2&gt;

&lt;p&gt;When you execute &lt;code&gt;pip install [package]&lt;/code&gt;, pip will try to find a package with that name in the &lt;a href="https://pypi.org/"&gt;Python Package Index&lt;/a&gt; (or pypi). If it finds the package, it will first try to find a wheel for your OS; if it cannot find one, it will fetch the source distribution. Making wheels available for different OSs is up to the developer; popular packages usually do this, see for example &lt;a href="https://pypi.org/project/numpy/#files"&gt;numpy's available files on pypi&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;pip install&lt;/code&gt; will also install any dependencies required by the package you requested; however, it has some limitations and can only install dependencies that can also be installed via &lt;code&gt;pip&lt;/code&gt;. It is important to emphasize that these limitations are by design: &lt;strong&gt;&lt;code&gt;pip&lt;/code&gt; is a Python package manager, it is not designed to deal with non-Python packages.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Since &lt;code&gt;pip&lt;/code&gt; is not designed to handle arbitrary dependencies, it will ask the OS for dependencies it cannot install such as compilers (this happens often with Python packages with parts written in C). This implicit process makes environments managed by &lt;code&gt;pip&lt;/code&gt; harder to reproduce: if you take your &lt;code&gt;requirements.txt&lt;/code&gt; to a different system, it might break if a non-Python dependency that existed in the previous environment does not exist in the new one.&lt;/p&gt;

&lt;p&gt;Since &lt;code&gt;pip install [package]&lt;/code&gt; triggers the installation of &lt;code&gt;[package]&lt;/code&gt; plus all its dependencies, it has to fetch distributions (built or source) for every one of them; the process will vary depending on how many there are and in which format they are obtained (built distributions are easier to install).&lt;/p&gt;
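&lt;p&gt;On the Python side, pinning exact versions in &lt;code&gt;requirements.txt&lt;/code&gt; helps with reproducibility. The sketch below (a hypothetical helper, not a pip feature) flags requirement lines that do not pin a version; note that pinning only covers Python packages, so non-Python dependencies such as compilers remain outside pip's control:&lt;/p&gt;

```python
def unpinned(requirements_text):
    """Return requirement lines that do not pin an exact version."""
    flagged = []
    for line in requirements_text.splitlines():
        line = line.strip()
        # skip blank lines and comments
        if not line or line.startswith("#"):
            continue
        if "==" not in line:
            flagged.append(line)
    return flagged


reqs = """numpy==1.18.1
# a comment
pandas
psycopg2==2.8.4"""
print(unpinned(reqs))
```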

&lt;h2&gt;
  
  
  Build and runtime dependencies
&lt;/h2&gt;

&lt;p&gt;Sometimes Python packages need other non-Python packages &lt;em&gt;to build&lt;/em&gt;. As I mentioned before, packages that have C code need a compiler (such as &lt;code&gt;gcc&lt;/code&gt;) at build time; once the C source code is compiled, &lt;code&gt;gcc&lt;/code&gt; is no longer needed. That's why these are called &lt;em&gt;build dependencies&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Other packages have non-Python dependencies &lt;em&gt;to run&lt;/em&gt;, for example, &lt;code&gt;psycopg2&lt;/code&gt; requires the PostgreSQL library &lt;code&gt;libpq&lt;/code&gt; to submit queries to the database. This is called a &lt;em&gt;runtime dependency&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;This difference leads to the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;When installing from source (&lt;code&gt;.tar.gz&lt;/code&gt; file) you need build + runtime dependencies&lt;/li&gt;
&lt;li&gt;When installing from a wheel (&lt;code&gt;.whl&lt;/code&gt; file) you only need runtime dependencies&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Why can't &lt;code&gt;pip install&lt;/code&gt; just install all dependencies for me?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Sqn1BbhH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/3bfpn9y42akhkpt1966d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Sqn1BbhH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/3bfpn9y42akhkpt1966d.png" alt="pip-diagram" width="800" height="667"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;pip&lt;/code&gt;'s purpose is to handle Python dependencies; installing things such as a compiler is out of its scope, so it will simply request them from the system. These limitations are well-known, which is the reason why &lt;code&gt;conda&lt;/code&gt; exists. &lt;code&gt;conda&lt;/code&gt; is also a package manager, but unlike &lt;code&gt;pip&lt;/code&gt;, its scope is not restricted to Python (for example, it can install the &lt;code&gt;gcc&lt;/code&gt; compiler); this makes &lt;code&gt;conda install&lt;/code&gt; more flexible, since it can handle dependencies that &lt;code&gt;pip&lt;/code&gt; cannot.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note:&lt;/em&gt; when we refer to &lt;code&gt;conda&lt;/code&gt;, we mean the command-line tool (also known as miniconda), &lt;strong&gt;not&lt;/strong&gt; to the whole Anaconda distribution. This is true for the rest of this article. For an article describing some &lt;code&gt;conda&lt;/code&gt; misconceptions, &lt;a href="https://jakevdp.github.io/blog/2016/08/25/conda-myths-and-misconceptions/"&gt;click here&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Using &lt;code&gt;conda install [package]&lt;/code&gt; for robust installations
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--swt5D9W---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/2bjlprw8oelhy3rfo3zp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--swt5D9W---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/2bjlprw8oelhy3rfo3zp.png" alt="conda-diagram" width="800" height="667"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;But &lt;code&gt;conda&lt;/code&gt; is not only a package manager but an environment manager as well&lt;/strong&gt;; this is key to understanding the operational difference between &lt;code&gt;pip install&lt;/code&gt; and &lt;code&gt;conda install&lt;/code&gt;. When using &lt;code&gt;pip&lt;/code&gt;, packages install into the currently active Python environment, whatever that is. This could be a system-wide installation or, more often, a local virtual environment created using tools such as &lt;code&gt;venv&lt;/code&gt; or &lt;code&gt;virtualenv&lt;/code&gt;; but still, any non-Python dependencies will be requested from the system.&lt;/p&gt;

&lt;p&gt;In contrast, &lt;code&gt;conda&lt;/code&gt; is both a package manager and an environment manager: &lt;code&gt;conda install&lt;/code&gt; installs dependencies into the currently active conda environment. At first glance, this is very similar to using &lt;code&gt;pip&lt;/code&gt; + &lt;code&gt;venv&lt;/code&gt;, but &lt;code&gt;conda&lt;/code&gt; can also install non-Python dependencies, which provides a higher level of isolation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Downsides of using &lt;code&gt;conda install&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;There are a few downsides to using &lt;code&gt;conda&lt;/code&gt;, though. For &lt;code&gt;conda install [package]&lt;/code&gt; to work, someone has to write a &lt;em&gt;conda recipe&lt;/em&gt;. Sometimes developers maintain these recipes, but other times the recipe maintainers are third parties; in that case, recipes might become outdated, and &lt;code&gt;conda install&lt;/code&gt; will yield an older version than &lt;code&gt;pip install&lt;/code&gt;. Fortunately, well-known packages such as numpy, tensorflow, or pytorch have high-quality recipes, and installation through conda is reliable.&lt;/p&gt;

&lt;p&gt;The second downside is that many packages are not available in conda, which means we have no option but to use &lt;code&gt;pip install&lt;/code&gt;; fortunately, with a few precautions, we can safely use it inside a conda environment. &lt;strong&gt;The &lt;code&gt;conda&lt;/code&gt; + &lt;code&gt;pip&lt;/code&gt; combination gives a robust way of setting up Python environments.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note: there is a way to access more packages when using &lt;code&gt;conda install&lt;/code&gt; by adding &lt;a href="https://docs.conda.io/projects/conda/en/latest/user-guide/concepts/channels.html"&gt;channels&lt;/a&gt;, which are locations where conda searches for packages. Only add channels from sources you trust. The most popular community-driven channel is &lt;a href="https://github.com/conda-forge"&gt;conda-forge&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Using &lt;code&gt;pip install&lt;/code&gt; and &lt;code&gt;conda install&lt;/code&gt; inside a conda environment
&lt;/h2&gt;

&lt;p&gt;At the time of writing, using pip inside a conda environment has a few problems, you can &lt;a href="https://www.anaconda.com/using-pip-in-a-conda-environment/"&gt;read the details here&lt;/a&gt;. Since sometimes we have no other way but to use &lt;code&gt;pip&lt;/code&gt; to install dependencies not available through &lt;code&gt;conda&lt;/code&gt;, here's my recommended workflow:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Start with a clean conda environment&lt;/li&gt;
&lt;li&gt;Install as many packages as you can using &lt;code&gt;conda install&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Install the rest of your dependencies using &lt;code&gt;pip install&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Manually keep a list of conda dependencies using an &lt;a href="https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html#creating-an-environment-file-manually"&gt;&lt;code&gt;environment.yml&lt;/code&gt;&lt;/a&gt; file and pip dependencies using a &lt;code&gt;requirements.txt&lt;/code&gt; (See note below)&lt;/li&gt;
&lt;li&gt;If you need to install a new package via conda, after you've used pip, re-create the conda environment&lt;/li&gt;
&lt;li&gt;Make environment creation part of your testing procedure. Use tools such as &lt;a href="https://nox.thea.codes/en/stable/"&gt;nox&lt;/a&gt; to run your tests in a &lt;a href="https://nox.thea.codes/en/stable/tutorial.html#testing-with-conda"&gt;clean conda environment&lt;/a&gt;, this way you'll make sure your environment is reproducible&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If you follow this procedure, anyone looking to reproduce your results only needs two files: &lt;code&gt;environment.yml&lt;/code&gt; and &lt;code&gt;requirements.txt&lt;/code&gt;.&lt;/p&gt;
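&lt;p&gt;For reference, an &lt;code&gt;environment.yml&lt;/code&gt; following this workflow might look like the sketch below (the project name and package list are placeholders); conda can delegate the pip-only dependencies to &lt;code&gt;requirements.txt&lt;/code&gt; directly:&lt;/p&gt;

```yaml
# environment.yml (hypothetical project; package names are placeholders)
name: my-project
channels:
  - defaults
dependencies:
  - python=3.6
  - numpy
  - psycopg2
  - pip
  # delegate pip-only dependencies to requirements.txt
  - pip:
    - -r requirements.txt
```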

&lt;p&gt;Note: The reason I recommend keeping a manual list is to be conscious about each dependency: if we decide to experiment with some library but end up not using it, it is good practice to remove it from our dependencies. If we rely on auto-generated lists such as the output of &lt;code&gt;pip freeze&lt;/code&gt;, we might end up including dependencies that we don't need.&lt;/p&gt;

&lt;h2&gt;
  
  
  Debugging installation errors
&lt;/h2&gt;

&lt;p&gt;While using &lt;code&gt;conda&lt;/code&gt; is a more reliable way to install packages with complex dependencies, there is no guarantee that things will just work; furthermore, if you need a package only available through &lt;code&gt;pip&lt;/code&gt; via a source distribution, you are more likely to encounter installation issues. Here are some examples of troubleshooting installation errors.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note: All the following tests were performed using a clean Ubuntu 18.04.4 image with miniconda3&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Example 1: &lt;code&gt;impyla&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;When we try to install impyla (an Apache Hive driver) using &lt;code&gt;pip install impyla&lt;/code&gt;, we get a long error output. When fixing installation issues, it is important to skim through the output to spot the missing dependency; these are the important lines:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;unable to execute 'gcc': No such file or directory
error: command 'gcc' failed with exit status 1
----------------------------------------
ERROR: Failed building wheel for bitarray

...
...
...


unable to execute 'gcc': No such file or directory
error: command 'gcc' failed with exit status 1
----------------------------------------
ERROR: Failed building wheel for thriftpy2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;bitarray&lt;/code&gt; and &lt;code&gt;thriftpy2&lt;/code&gt; are &lt;code&gt;impyla&lt;/code&gt; dependencies. Wheels for them are not available, so pip had to use source distributions; we can confirm this in the first output lines (look at the &lt;code&gt;.tar.gz&lt;/code&gt; extension):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Collecting bitarray
  Downloading bitarray-1.2.1.tar.gz (48 kB)
     |################################| 48 kB 5.6 MB/s
Collecting thrift&amp;gt;=0.9.3
  Downloading thrift-0.13.0.tar.gz (59 kB)
     |################################| 59 kB 7.7 MB/s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But why did these dependencies fail to install? We see in the log that both dependencies tried to use &lt;code&gt;gcc&lt;/code&gt; but they could not find it. Installing it (e.g. &lt;code&gt;apt install gcc&lt;/code&gt;) and trying &lt;code&gt;pip install impyla&lt;/code&gt; again fixes the issue. But you can also do &lt;code&gt;conda install impyla&lt;/code&gt; which has the advantage of not installing &lt;code&gt;gcc&lt;/code&gt; system-wide. &lt;strong&gt;Using conda is often the easiest way to fix installation issues.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Example 2: &lt;code&gt;psycopg2&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;Let's first see what happens with &lt;code&gt;pip install psycopg2&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Error: pg_config executable not found.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As in the previous case, we are missing one dependency. The tricky part is that &lt;code&gt;pg_config&lt;/code&gt; is not a standalone executable; it is installed by another package, which is what you'll find after some online digging. If using apt, you can get this to work by doing &lt;code&gt;apt install libpq-dev&lt;/code&gt; before using pip. But again, &lt;code&gt;conda install psycopg2&lt;/code&gt; works out of the box. This is the output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;krb5               pkgs/main/linux-64::krb5-1.16.4-h173b8e3_0
libpq              pkgs/main/linux-64::libpq-11.2-h20c2e04_0
psycopg2           pkgs/main/linux-64::psycopg2-2.8.4-py36h1ba5d50_0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We see that conda will install &lt;code&gt;libpq&lt;/code&gt; along with &lt;code&gt;psycopg2&lt;/code&gt;, but unlike using a system package manager (e.g. &lt;code&gt;apt&lt;/code&gt;), it will do it locally, which is good for isolating our environment.&lt;/p&gt;

&lt;h3&gt;
  
  
  Example 3: &lt;code&gt;numpy&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;Numpy is one of the most widely used packages. &lt;code&gt;pip install numpy&lt;/code&gt; works reliably since developers upload wheels for the most popular operating systems. But this doesn't mean using pip is the best we can do.&lt;/p&gt;

&lt;p&gt;Taken from the &lt;a href="https://numpy.org/devdocs/user/building.html"&gt;docs&lt;/a&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;NumPy does not require any external linear algebra libraries to be installed. However, if these are available, NumPy’s setup script can detect them and use them for building.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;In other words, depending on the availability of external linear algebra libraries, your numpy installation will be different. Let's see what happens when we run &lt;code&gt;conda install numpy&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;package                    |            build
---------------------------|-----------------
blas-1.0                   |              mkl           6 KB
intel-openmp-2020.0        |              166         756 KB
libgfortran-ng-7.3.0       |       hdf63c60_0        1006 KB
mkl-2020.0                 |              166       128.9 MB
mkl-service-2.3.0          |   py36he904b0f_0         219 KB
mkl_fft-1.0.15             |   py36ha843d7b_0         155 KB
mkl_random-1.1.0           |   py36hd6b4f25_0         324 KB
numpy-1.18.1               |   py36h4f9e942_0           5 KB
numpy-base-1.18.1          |   py36hde5b4d6_1         4.2 MB
------------------------------------------------------------
                                       Total:       135.5 MB
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Along with numpy, conda will also install &lt;code&gt;mkl&lt;/code&gt;, a library that optimizes math routines on systems with Intel processors. By using &lt;code&gt;conda install&lt;/code&gt;, you get this for free; if using &lt;code&gt;pip&lt;/code&gt;, you'd only get a vanilla numpy installation.&lt;/p&gt;
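&lt;p&gt;You can inspect which accelerated libraries your numpy build is linked against with &lt;code&gt;numpy.show_config()&lt;/code&gt;; in a conda environment like the one above it will typically report MKL, while a pip-installed wheel reports whatever BLAS implementation it was built with:&lt;/p&gt;

```python
import numpy as np

# prints the BLAS/LAPACK libraries this numpy build was compiled against
np.show_config()
```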

&lt;h2&gt;
  
  
  What about containers?
&lt;/h2&gt;

&lt;p&gt;Containerization technologies such as Docker provide a higher level of isolation than a conda environment, but I think they are used in Data Science projects earlier than they should be. Once you have to run your pipeline in a production environment, containers are a natural choice, but for development purposes, a conda environment goes a long way.&lt;/p&gt;

&lt;h2&gt;
  
  
  Closing remarks
&lt;/h2&gt;

&lt;p&gt;Getting a Python environment up and running is an error-prone process, and nobody likes spending time fixing installation issues. Understanding the basics of how Python packages are built and distributed, plus using the right tool for the job is a huge improvement over trial and error.&lt;/p&gt;

&lt;p&gt;Furthermore, it is not enough to set up our environment once. If we want others to reproduce our work, or want to make the transition to production simple, we have to ensure that there is an automated way to set up our environment from scratch; including environment creation as part of our testing process will let us know when things break.&lt;/p&gt;

&lt;p&gt;Found an error in this post? &lt;a href="https://github.com/ploomber/posts/issues/new?title=Issue%20in%20post%3A%20%22Setting%20up%20reproducible%20Python%20environments%20for%20Data%20Science%22"&gt;Click here to let us know&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;Originally posted at &lt;a href="https://ploomber.io/posts/python-envs"&gt;ploomber.io&lt;/a&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>datascience</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Model selection with scikit-learn and ploomber</title>
      <dc:creator>Eduardo Blancas</dc:creator>
      <pubDate>Tue, 24 Mar 2020 21:12:02 +0000</pubDate>
      <link>https://dev.to/edublancas/model-selection-with-scikit-learn-and-ploomber-3ek3</link>
      <guid>https://dev.to/edublancas/model-selection-with-scikit-learn-and-ploomber-3ek3</guid>
      <description>&lt;p&gt;Model selection is an important part of any Machine Learning task. Since each model encodes their own &lt;a href="https://en.wikipedia.org/wiki/Inductive_bias"&gt;inductive bias&lt;/a&gt;, it is important to compare them to understand their subtleties and choose the best one for the problem at hand. While knowing each learning algorithm in detail is important to have an intuition about which ones to try, it is always helpful to visualize actual results in our data.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note&lt;/em&gt;: This blog post assumes you are familiar with the model selection framework via &lt;a href="https://sebastianraschka.com/blog/2018/model-evaluation-selection-part4.html#nested-cross-validation"&gt;nested cross-validation&lt;/a&gt; and with the following scikit-learn modules (click for documentation): &lt;a href="https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html"&gt;&lt;code&gt;GridSearchCV&lt;/code&gt;&lt;/a&gt;, &lt;a href="https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_predict.html"&gt;&lt;code&gt;cross_val_predict&lt;/code&gt;&lt;/a&gt; and &lt;a href="https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html"&gt;&lt;code&gt;Pipeline&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The quick and dirty approach for model selection would be to have a long Jupyter notebook, where we train all models and output charts for each one. In this post we will show how to achieve this in a cleaner way by using scikit-learn and &lt;a href="https://github.com/ploomber/ploomber"&gt;ploomber&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Project layout
&lt;/h2&gt;

&lt;p&gt;We split the code in three files:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;pipelines.py&lt;/code&gt;. Contains functions to instantiate scikit-learn pipelines&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;report.py&lt;/code&gt;. Contains the source code that performs hyperparameter tuning and model evaluation, imports pipelines defined in &lt;code&gt;pipelines.py&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;main.py&lt;/code&gt;. Contains the loop that executes &lt;code&gt;report.py&lt;/code&gt; for each pipeline using ploomber&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Unless otherwise noted, the snippets shown in this post belong to &lt;code&gt;main.py&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Functions to instantiate pipelines (&lt;code&gt;pipelines.py&lt;/code&gt;)
&lt;/h2&gt;

&lt;p&gt;We start by declaring each of our &lt;em&gt;model pipelines&lt;/em&gt;, which are just functions that return a scikit-learn &lt;code&gt;Pipeline&lt;/code&gt; instance; we will use them in a nested cross-validation loop to choose the best hyperparameters and estimate generalization performance.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Content of pipelines.py
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.pipeline&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Pipeline&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.preprocessing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;StandardScaler&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.linear_model&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Ridge&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.svm&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;NuSVR&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;ridge&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;Pipeline&lt;/span&gt;&lt;span class="p"&gt;([(&lt;/span&gt;&lt;span class="s"&gt;'scaler'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;StandardScaler&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt;
                     &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'reg'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Ridge&lt;/span&gt;&lt;span class="p"&gt;())])&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;nusvr&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;Pipeline&lt;/span&gt;&lt;span class="p"&gt;([(&lt;/span&gt;&lt;span class="s"&gt;'scaler'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;StandardScaler&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt;
                     &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'reg'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;NuSVR&lt;/span&gt;&lt;span class="p"&gt;())])&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We have one factory for NuSVR and another one for Ridge regression. Since these two models are sensitive to scaling, we include them in a scikit-learn pipeline that scales all features before feeding the data into the model.&lt;/p&gt;
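&lt;p&gt;These factories plug directly into the nested cross-validation loop. As a sketch (with made-up data; the grid values here are illustrative, not the ones used in the project), hyperparameters of the inner estimator are addressed via the step name followed by a double underscore:&lt;/p&gt;

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# same shape as the ridge() factory above
pipe = Pipeline([('scaler', StandardScaler()), ('reg', Ridge())])

# parameters of a pipeline step are addressed as "stepname__parameter"
grid = GridSearchCV(pipe, param_grid={'reg__alpha': [0.1, 1.0, 10.0]}, cv=3)

rng = np.random.RandomState(0)
X, y = rng.rand(30, 3), rng.rand(30)
grid.fit(X, y)
print(grid.best_params_)
```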

&lt;h2&gt;
  
  
  Hyperparameter tuning and performance estimation (&lt;code&gt;report.py&lt;/code&gt;)
&lt;/h2&gt;

&lt;p&gt;We will process each model separately, generating three HTML reports in total; the reports will be generated using the following source code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Content of report.py
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;IPython.display&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Markdown&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;importlib&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.datasets&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;load_boston&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.model_selection&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;cross_val_predict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;GridSearchCV&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;seaborn&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;sns&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;

&lt;span class="c1"&gt;# + tags=["parameters"]
&lt;/span&gt;&lt;span class="n"&gt;m_init&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
&lt;span class="n"&gt;m_params&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
&lt;span class="c1"&gt;# -
&lt;/span&gt;
&lt;span class="n"&gt;Markdown&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'# Report for {}'&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;m_init&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'Params: '&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;m_params&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# +
# m_init is module.sub_module.constructor import it from the string
&lt;/span&gt;&lt;span class="n"&gt;parts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;m_init&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'.'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;mod_str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;constructor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;'.'&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parts&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt; &lt;span class="n"&gt;parts&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;mod&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;importlib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;import_module&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mod_str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# instantiate it
&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;getattr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mod&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;constructor&lt;/span&gt;&lt;span class="p"&gt;)()&lt;/span&gt;
&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# -
&lt;/span&gt;
&lt;span class="c1"&gt;# load data
&lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;load_boston&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;feature_names&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;

&lt;span class="c1"&gt;# +
# Perform grid search over the passed parameters
&lt;/span&gt;&lt;span class="n"&gt;grid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;GridSearchCV&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;m_params&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_jobs&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# We want to estimate generalization performance *and* tune hyperparameters
# so we are using nested cross-validation
&lt;/span&gt;&lt;span class="n"&gt;y_pred&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cross_val_predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;grid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# -
&lt;/span&gt;
&lt;span class="c1"&gt;# prev vs actual scatter plot
&lt;/span&gt;&lt;span class="n"&gt;fig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ax&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;subplots&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;fig&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;set_size_inches&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scatter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_pred&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;grid&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;set_xlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'Predicted'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;set_ylabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'Actual'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# residuals
&lt;/span&gt;&lt;span class="n"&gt;fig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ax&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;subplots&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;fig&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;set_size_inches&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;res&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;y_pred&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scatter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;arange&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;res&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt; &lt;span class="n"&gt;res&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;grid&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;set_ylabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'Residual'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# residuals distribution
&lt;/span&gt;&lt;span class="n"&gt;fig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ax&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;subplots&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;fig&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;set_size_inches&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;distplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;res&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;grid&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;set_title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'Residual distribution'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# print metrics
&lt;/span&gt;&lt;span class="n"&gt;mae&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;abs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;y_pred&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;mse&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;y_pred&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;'MAE: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;mae&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;'MSE: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;mse&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Running the execution loop (&lt;code&gt;main.py&lt;/code&gt;)
&lt;/h2&gt;

&lt;p&gt;We now turn our attention to the main script that takes the model pipelines and the report source code and executes them. First, we define the parameters to try for each model, one dictionary per model: the &lt;code&gt;m_init&lt;/code&gt; key holds the pipeline location (we will dynamically import it using the &lt;a href="https://docs.python.org/3/library/importlib.html#module-importlib"&gt;&lt;code&gt;importlib&lt;/code&gt;&lt;/a&gt; library), and the &lt;code&gt;m_params&lt;/code&gt; key contains the hyperparameters to try. Note that for Ridge Regression and NuSVR, we have to add a &lt;code&gt;reg__&lt;/code&gt; prefix to each parameter; this is because the factories return scikit-learn &lt;code&gt;Pipeline&lt;/code&gt; objects and we need to specify which step the parameters belong to.&lt;/p&gt;
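&lt;p&gt;The dynamic import mentioned above boils down to splitting the dotted string and combining &lt;code&gt;importlib.import_module&lt;/code&gt; with &lt;code&gt;getattr&lt;/code&gt;. A self-contained sketch, using a standard-library class as a stand-in for the pipeline factories:&lt;/p&gt;

```python
import importlib


def import_from_dotted_path(dotted):
    # 'module.sub_module.constructor' -> import the module, grab the attribute
    mod_str, _, constructor = dotted.rpartition('.')
    mod = importlib.import_module(mod_str)
    return getattr(mod, constructor)


# example with a standard-library class instead of 'pipelines.ridge'
cls = import_from_dotted_path('collections.OrderedDict')
print(cls)  # <class 'collections.OrderedDict'>
```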



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;pathlib&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;ploomber.tasks&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;NotebookRunner&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;ploomber.products&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;File&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;ploomber&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DAG&lt;/span&gt;

&lt;span class="c1"&gt;# Ridge Regression grid
&lt;/span&gt;&lt;span class="n"&gt;params_ridge&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="s"&gt;'m_init'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;'pipelines.ridge'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s"&gt;'m_params'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="s"&gt;'reg__alpha'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;1.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;2.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;3.0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Random Forest Regression grid
&lt;/span&gt;&lt;span class="n"&gt;params_rf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="s"&gt;'m_init'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;'sklearn.ensemble.RandomForestRegressor'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s"&gt;'m_params'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="s"&gt;'n_estimators'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="s"&gt;'min_samples_leaf'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Nu Support Vector Regression grid
&lt;/span&gt;&lt;span class="n"&gt;params_nusvr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="s"&gt;'m_init'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;'pipelines.nusvr'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s"&gt;'m_params'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="s"&gt;'reg__nu'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.8&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="s"&gt;'reg__C'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;1.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;2.0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="s"&gt;'reg__kernel'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'rbf'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'sigmoid'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note that we do not have a pipeline for &lt;code&gt;RandomForestRegressor&lt;/code&gt;: Random Forest is not sensitive to feature scaling, so we use the model directly.&lt;/p&gt;

&lt;p&gt;We now add the execution loop, which we will run using &lt;a href="https://github.com/ploomber/ploomber"&gt;ploomber&lt;/a&gt;. We just have to tell &lt;code&gt;ploomber&lt;/code&gt; where to load the source code from, which parameters to use on each iteration, and where to save the output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# load report source code
&lt;/span&gt;&lt;span class="n"&gt;notebook&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'report.py'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;read_text&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# we will save all notebooks in the artifacts/ folder
&lt;/span&gt;&lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'artifacts'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mkdir&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;exist_ok&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;params_all&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;'ridge'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;params_ridge&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'rf'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;params_rf&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'nusvr'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;params_nusvr&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;dag&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;DAG&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# loop over params and create one notebook task for each...
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;params_all&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="c1"&gt;# NotebookRunner is able to execute ipynb files using
&lt;/span&gt;    &lt;span class="c1"&gt;# papermill under the hood, if the input file has a
&lt;/span&gt;    &lt;span class="c1"&gt;# different extension (like in our case), it will first
&lt;/span&gt;    &lt;span class="c1"&gt;# convert it to an ipynb file using jupytext
&lt;/span&gt;    &lt;span class="n"&gt;NotebookRunner&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;notebook&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                   &lt;span class="c1"&gt;# save it in artifacts/{name}.html
&lt;/span&gt;                   &lt;span class="c1"&gt;# NotebookRunner will generate ipynb files by
&lt;/span&gt;                   &lt;span class="c1"&gt;# default, but you can choose other formats,
&lt;/span&gt;                   &lt;span class="c1"&gt;# any format supported by the official nbconvert
&lt;/span&gt;                   &lt;span class="c1"&gt;# package is supported here
&lt;/span&gt;                   &lt;span class="n"&gt;product&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;File&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="s"&gt;'.html'&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
                   &lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                   &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                   &lt;span class="c1"&gt;# pass the parameters
&lt;/span&gt;                   &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                   &lt;span class="n"&gt;ext_in&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'py'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                   &lt;span class="n"&gt;kernelspec_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'python3'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Build the DAG:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;build&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Output:
name    Ran?      Elapsed (s)    Percentage
------  ------  -------------  ------------
nusvr   True          6.95555       27.8197
rf      True         11.6961        46.78
ridge   True          6.35066       25.4003
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. After building the DAG, each model generates one report; you can see them here: &lt;a href="https://ploomber.github.io/posts/model-selection/artifacts/ridge"&gt;Ridge&lt;/a&gt;, &lt;a href="https://ploomber.github.io/posts/model-selection/artifacts/rf"&gt;Random Forest&lt;/a&gt; and &lt;a href="https://ploomber.github.io/posts/model-selection/artifacts/nusvr"&gt;NuSVR&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Splitting the logic into separate files improves readability and maintainability: if we want to add another model, we only have to add a new dictionary with its parameter grid; if preprocessing is needed, we just add a factory in &lt;code&gt;pipelines.py&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;ploomber provides a concise and clean framework for generating reports: in just a few lines of code, we generated all of them. However, we made a big simplification in our &lt;code&gt;report.py&lt;/code&gt; file: we load, train, and evaluate in a single source file, so even a small change to our charts would force us to re-train every model. A better approach is to split that logic into several steps, a scenario where ploomber is very effective:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Clean raw data (save clean dataset)&lt;/li&gt;
&lt;li&gt;Train model and predict (save predictions)&lt;/li&gt;
&lt;li&gt;Evaluate predictions&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If we split each model pipeline into these three steps and run the build, we will obtain the same results. Now let's say you want to add a new chart, so you modify step 3. All you have to do to update your reports is call &lt;code&gt;dag.build()&lt;/code&gt; again: ploomber will figure out that it does not have to re-run steps 1-2 and will overwrite the old reports with the new ones.&lt;/p&gt;
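&lt;p&gt;At its core, this skip-unchanged-steps behavior is a staleness check between a task's source and its product (ploomber's actual implementation is more sophisticated, since it also tracks upstream dependencies and parameter changes). A highly simplified sketch of the idea:&lt;/p&gt;

```python
from pathlib import Path


def is_outdated(source: Path, product: Path) -> bool:
    # a task must run if its product is missing or older than its source
    if not product.exists():
        return True
    return source.stat().st_mtime > product.stat().st_mtime
```

&lt;p&gt;With a check like this, modifying only the evaluation step leaves the cleaning and training products up to date, so only the reports are rebuilt.&lt;/p&gt;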

&lt;h2&gt;
  
  
  Closing remarks
&lt;/h2&gt;

&lt;p&gt;Developing a Machine Learning model is an iterative process. By breaking the entire pipeline logic into small steps and maximizing code reusability, we can develop short, maintainable pipelines. Jupyter is a superb tool (I use it every day, and I'm actually writing this blog post from Jupyter), but do not fall into the habit of coding everything in one big notebook, which inevitably leads to unmaintainable code; prefer many short notebooks (or .py files) over a single big one.&lt;/p&gt;

&lt;p&gt;Source code for this post is available &lt;a href="https://github.com/ploomber/posts/tree/master/model-selection"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Found an error in this post? &lt;a href="https://github.com/ploomber/posts/issues/new?title=Issue%20in%20model-selection"&gt;Click here to let us know&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This blog post was generated using package versions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Output:
matplotlib==3.1.3
numpy==1.18.1
pandas==1.0.1
scikit-learn==0.22.2
seaborn==0.10.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>python</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
