<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Wei Lee</title>
    <description>The latest articles on DEV Community by Wei Lee (@leew).</description>
    <link>https://dev.to/leew</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F385872%2F67b87b97-3a82-468b-9f48-1b7c9a284ea7.jpeg</url>
      <title>DEV Community: Wei Lee</title>
      <link>https://dev.to/leew</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/leew"/>
    <language>en</language>
    <item>
      <title>DVC - Pipeline Versioning</title>
      <dc:creator>Wei Lee</dc:creator>
      <pubDate>Fri, 02 Jul 2021 10:51:07 +0000</pubDate>
      <link>https://dev.to/leew/dvc-pipeline-versioning-3a53</link>
      <guid>https://dev.to/leew/dvc-pipeline-versioning-3a53</guid>
      <description>&lt;p&gt;This post also lives on &lt;a href="https://lee-w.github.io/posts/tech/2021/06/dvc-pipeline-versioning/"&gt;GitHub&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;We've versioned our data in the previous post. This article will demonstrate how we could define a data pipeline and version it through DVC.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pipeline versioning
&lt;/h2&gt;

&lt;p&gt;We'll continue using &lt;a href="https://github.com/Lee-W/dvc_example/"&gt;dvc_example&lt;/a&gt;. You can check out the tag &lt;a href="https://github.com/Lee-W/dvc_example/tree/v3-remove-2-rows"&gt;v3-remove-2-rows&lt;/a&gt; to follow along.&lt;/p&gt;

&lt;h3&gt;
  
  
  Split training logic into different stages
&lt;/h3&gt;

&lt;p&gt;In the original design, we used &lt;code&gt;pipenv run python digit_recognizer/digit_recognizer.py&lt;/code&gt; to run the whole training process. We'll split it into &lt;code&gt;process-data&lt;/code&gt;, &lt;code&gt;train&lt;/code&gt;, and &lt;code&gt;report&lt;/code&gt; stages.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="p"&gt;......&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;command&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s"&gt;"process-data"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;load_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"data/digit_data.csv"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"data/digit_target.csv"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_test&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;process_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;export_processed_data&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s"&gt;"output/training_data.pkl"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;export_processed_data&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s"&gt;"output/testing_data.pkl"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;command&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s"&gt;"train"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;load_processed_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"output/training_data.pkl"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;train_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;export_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"output/model.pkl"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;command&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s"&gt;"report"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_test&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;load_processed_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"output/testing_data.pkl"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;load_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"output/model.pkl"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;predicted_y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;output_test_data_results&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;predicted_y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;output_metrics&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;predicted_y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
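
&lt;p&gt;The snippet above elides how &lt;code&gt;args.command&lt;/code&gt; is produced. As a minimal sketch, assuming a plain &lt;code&gt;argparse&lt;/code&gt; positional argument (the helper name &lt;code&gt;parse_args&lt;/code&gt; is hypothetical and may differ from the actual repository code), it could look like this:&lt;/p&gt;

```python
import argparse


def parse_args(argv=None):
    # Hypothetical parser: the post does not show how args.command is built.
    parser = argparse.ArgumentParser(description="digit recognizer pipeline")
    parser.add_argument(
        "command",
        choices=["process-data", "train", "report"],
        help="which pipeline stage to run",
    )
    return parser.parse_args(argv)


# Passing a list makes the sketch runnable without a real command line.
args = parse_args(["train"])
print(args.command)
```

&lt;p&gt;Restricting &lt;code&gt;choices&lt;/code&gt; to the three stage names makes a mistyped stage fail early instead of silently doing nothing.&lt;/p&gt;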



&lt;p&gt;You can view the complete code change on &lt;a href="https://github.com/Lee-W/dvc_example/compare/v3-remove-2-rows...v4-split-pipeline-logic"&gt;v3-remove-2-rows...v4-split-pipeline-logic&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;After these changes, we'll use the following 3 commands to run the pipeline.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pipenv run python digit_recognizer/digit_recognizer.py process-data
pipenv run python digit_recognizer/digit_recognizer.py train
pipenv run python digit_recognizer/digit_recognizer.py report
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Add the first stage in our pipeline to DVC
&lt;/h3&gt;

&lt;p&gt;We add stages through the &lt;a href="https://dvc.org/doc/command-reference/run"&gt;dvc run&lt;/a&gt; command. Let's add our first stage, &lt;code&gt;process-data&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# add process-data stage&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;pipenv run dvc run &lt;span class="nt"&gt;--name&lt;/span&gt; process-data &lt;span class="se"&gt;\&lt;/span&gt;
      &lt;span class="nt"&gt;-d&lt;/span&gt; digit_recognizer/digit_recognizer.py &lt;span class="se"&gt;\&lt;/span&gt;
      &lt;span class="nt"&gt;-d&lt;/span&gt; data/digit_data.csv &lt;span class="se"&gt;\&lt;/span&gt;
      &lt;span class="nt"&gt;-d&lt;/span&gt; data/digit_target.csv &lt;span class="se"&gt;\&lt;/span&gt;
      &lt;span class="nt"&gt;-o&lt;/span&gt; output/training_data.pkl &lt;span class="se"&gt;\&lt;/span&gt;
      &lt;span class="nt"&gt;-o&lt;/span&gt; output/testing_data.pkl &lt;span class="se"&gt;\&lt;/span&gt;
      &lt;span class="s2"&gt;"pipenv run python digit_recognizer/digit_recognizer.py process-data"&lt;/span&gt;

Running stage &lt;span class="s1"&gt;'process-data'&lt;/span&gt;:
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; pipenv run python digit_recognizer/digit_recognizer.py process-data
Creating &lt;span class="s1"&gt;'dvc.yaml'&lt;/span&gt;
Adding stage &lt;span class="s1"&gt;'process-data'&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="s1"&gt;'dvc.yaml'&lt;/span&gt;
Generating lock file &lt;span class="s1"&gt;'dvc.lock'&lt;/span&gt;
Updating lock file &lt;span class="s1"&gt;'dvc.lock'&lt;/span&gt;

To track the changes with git, run:

git add dvc.yaml output/.gitignore dvc.lock


&lt;span class="c"&gt;# next, add these DVC files to git so they are tracked&lt;/span&gt;

&lt;span class="c"&gt;# add DVC configuration to git and commit&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;git add dvc.yaml dvc.lock output/.gitignore
&lt;span class="nv"&gt;$ &lt;/span&gt;pipenv run cz commit
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's see what this command is composed of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;--name&lt;/code&gt;: the name of this stage

&lt;ul&gt;
&lt;li&gt;It must be unique throughout the project.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;-d&lt;/code&gt;: the dependencies of this stage

&lt;ul&gt;
&lt;li&gt;All the files related to running this stage should be counted as dependencies.&lt;/li&gt;
&lt;li&gt;DVC won't store these dependency files in its storage; it only stores their hashes.&lt;/li&gt;
&lt;li&gt;In this example, we need &lt;code&gt;digit_recognizer/digit_recognizer.py&lt;/code&gt; to load &lt;code&gt;data/digit_data.csv&lt;/code&gt; and &lt;code&gt;data/digit_target.csv&lt;/code&gt; to process the data. Thus, these 3 files are added as dependencies.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;-o&lt;/code&gt;: the output files of this stage

&lt;ul&gt;
&lt;li&gt;DVC stores these files in its storage. If you want to track them through git, or simply don't want to track them, you can use &lt;code&gt;-O&lt;/code&gt; instead.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;dvc run&lt;/code&gt; runs the stage right after adding it. If you don't want DVC to run it, you can add the &lt;code&gt;--no-exec&lt;/code&gt; flag or use &lt;a href="https://dvc.org/doc/command-reference/stage/add"&gt;dvc stage add&lt;/a&gt; with the same arguments.&lt;/p&gt;

&lt;p&gt;After adding a stage, DVC updates &lt;code&gt;dvc.yaml&lt;/code&gt;, &lt;code&gt;output/.gitignore&lt;/code&gt;, and &lt;code&gt;dvc.lock&lt;/code&gt;.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;code&gt;dvc.yaml&lt;/code&gt;: the definition of stages
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;stages&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;process-data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;cmd&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pipenv run python digit_recognizer/digit_recognizer.py process-data&lt;/span&gt;
    &lt;span class="na"&gt;deps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;data/digit_data.csv&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;data/digit_target.csv&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;digit_recognizer/digit_recognizer.py&lt;/span&gt;
    &lt;span class="na"&gt;outs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;output/testing_data.pkl&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;output/training_data.pkl&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;DVC transforms what we defined in &lt;code&gt;dvc run&lt;/code&gt; into a human-readable format and stores it. If you already know how to define a stage, you can edit &lt;code&gt;dvc.yaml&lt;/code&gt; directly. In addition, there are advanced techniques like &lt;a href="https://dvc.org/doc/user-guide/project-structure/pipelines-files#templating"&gt;Templating&lt;/a&gt; and &lt;a href="https://dvc.org/doc/user-guide/project-structure/pipelines-files#foreach-stages"&gt;foreach stages&lt;/a&gt; that can help us define complicated stages.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;code&gt;dvc.lock&lt;/code&gt;: the hashes of dependencies and outputs
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;schema&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2.0'&lt;/span&gt;
&lt;span class="na"&gt;stages&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;process-data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;cmd&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pipenv run python digit_recognizer/digit_recognizer.py process-data&lt;/span&gt;
    &lt;span class="na"&gt;deps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;data/digit_data.csv&lt;/span&gt;
      &lt;span class="na"&gt;md5&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;942481fce846fb9750b7b8023c80a5ef&lt;/span&gt;
      &lt;span class="na"&gt;size&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;490582&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;data/digit_target.csv&lt;/span&gt;
      &lt;span class="na"&gt;md5&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;2a6cfa13365ac9b3af5146133aca6789&lt;/span&gt;
      &lt;span class="na"&gt;size&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3590&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;digit_recognizer/digit_recognizer.py&lt;/span&gt;
      &lt;span class="na"&gt;md5&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;65ecf27479538a74ade42462b1566db1&lt;/span&gt;
      &lt;span class="na"&gt;size&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3629&lt;/span&gt;
    &lt;span class="na"&gt;outs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;output/testing_data.pkl&lt;/span&gt;
      &lt;span class="na"&gt;md5&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;78be1761d227f71b1a8f858fed766982&lt;/span&gt;
      &lt;span class="na"&gt;size&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;529016&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;output/training_data.pkl&lt;/span&gt;
      &lt;span class="na"&gt;md5&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;f95e8f978a05395ba23479ff60eda076&lt;/span&gt;
      &lt;span class="na"&gt;size&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;528427&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;DVC uses these hashes to know whether anything in our stages has been modified. Therefore, we should only add deterministic files; randomness in the outputs might make this lock file meaningless. Taking a look at "Avoiding unexpected behavior" in &lt;a href="https://dvc.org/doc/command-reference/run#description"&gt;dvc run - Description&lt;/a&gt; could save you time debugging unexpected failures.&lt;/p&gt;
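
&lt;p&gt;To make the idea of hash-based change detection concrete, here is a simplified sketch that uses only the standard library. It is not DVC's actual implementation; the helper names &lt;code&gt;file_md5&lt;/code&gt; and &lt;code&gt;has_changed&lt;/code&gt; and the sample file contents are assumptions made for illustration.&lt;/p&gt;

```python
import hashlib
import tempfile
from pathlib import Path


def file_md5(path):
    # Hash the file in chunks so large data files need not fit in memory.
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()


def has_changed(path, recorded_md5):
    # A file counts as changed when its current hash differs from the recorded one.
    return file_md5(path) != recorded_md5


with tempfile.TemporaryDirectory() as tmp:
    dep = Path(tmp) / "digit_target.csv"
    dep.write_text("0\n1\n2\n")
    recorded = file_md5(dep)  # plays the role of the md5 kept in a lock file
    unchanged = has_changed(dep, recorded)
    dep.write_text("0\n1\n")  # simulate removing a row
    changed = has_changed(dep, recorded)

print(unchanged, changed)
```

&lt;p&gt;This also shows why nondeterministic outputs are a problem: if a stage writes different bytes on every run, its hash never matches the recorded one.&lt;/p&gt;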

&lt;h4&gt;
  
  
  &lt;code&gt;output/.gitignore&lt;/code&gt;: Add files that DVC should track to gitignore
&lt;/h4&gt;

&lt;h3&gt;
  
  
  Define the whole pipeline
&lt;/h3&gt;

&lt;p&gt;With similar commands, we can add the &lt;code&gt;train&lt;/code&gt; and &lt;code&gt;report&lt;/code&gt; stages to our pipeline.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# add train stage&lt;/span&gt;
pipenv run dvc run &lt;span class="nt"&gt;--name&lt;/span&gt; train &lt;span class="se"&gt;\&lt;/span&gt;
      &lt;span class="nt"&gt;-d&lt;/span&gt; digit_recognizer/digit_recognizer.py &lt;span class="se"&gt;\&lt;/span&gt;
      &lt;span class="nt"&gt;-d&lt;/span&gt; output/training_data.pkl &lt;span class="se"&gt;\&lt;/span&gt;
      &lt;span class="nt"&gt;-o&lt;/span&gt; output/model.pkl &lt;span class="se"&gt;\&lt;/span&gt;
      &lt;span class="s2"&gt;"pipenv run python digit_recognizer/digit_recognizer.py train"&lt;/span&gt;

&lt;span class="c"&gt;# add report stage&lt;/span&gt;
pipenv run dvc run &lt;span class="nt"&gt;--name&lt;/span&gt; report &lt;span class="se"&gt;\&lt;/span&gt;
      &lt;span class="nt"&gt;-d&lt;/span&gt; digit_recognizer/digit_recognizer.py &lt;span class="se"&gt;\&lt;/span&gt;
      &lt;span class="nt"&gt;-d&lt;/span&gt; output/testing_data.pkl &lt;span class="se"&gt;\&lt;/span&gt;
      &lt;span class="nt"&gt;-d&lt;/span&gt; output/model.pkl &lt;span class="se"&gt;\&lt;/span&gt;
      &lt;span class="nt"&gt;-o&lt;/span&gt; output/metrics.json &lt;span class="se"&gt;\&lt;/span&gt;
      &lt;span class="nt"&gt;-o&lt;/span&gt; output/test_data_results.csv &lt;span class="se"&gt;\&lt;/span&gt;
      &lt;span class="s2"&gt;"pipenv run python digit_recognizer/digit_recognizer.py report"&lt;/span&gt;

&lt;span class="c"&gt;# add DVC configuration to git and commit&lt;/span&gt;
git add dvc.yaml dvc.lock output/.gitignore
pipenv run cz commit
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Similar to our previous step, DVC updates &lt;code&gt;dvc.yaml&lt;/code&gt;, &lt;code&gt;dvc.lock&lt;/code&gt; and &lt;code&gt;output/.gitignore&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="s"&gt;$ cat dvc.yaml&lt;/span&gt;

&lt;span class="nn"&gt;...&lt;/span&gt;
  &lt;span class="na"&gt;train&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;cmd&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pipenv run python digit_recognizer/digit_recognizer.py train&lt;/span&gt;
    &lt;span class="na"&gt;deps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;digit_recognizer/digit_recognizer.py&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;output/training_data.pkl&lt;/span&gt;
    &lt;span class="na"&gt;outs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;output/model.pkl&lt;/span&gt;
  &lt;span class="na"&gt;report&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;cmd&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pipenv run python digit_recognizer/digit_recognizer.py report&lt;/span&gt;
    &lt;span class="na"&gt;deps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;digit_recognizer/digit_recognizer.py&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;output/model.pkl&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;output/testing_data.pkl&lt;/span&gt;
    &lt;span class="na"&gt;outs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;output/metrics.json&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;output/test_data_results.csv&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can visualize the pipeline through &lt;a href="https://dvc.org/doc/command-reference/dag"&gt;dvc dag&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;pipenv run dvc dag

    +----------+
    | data.dvc |
    +----------+
          &lt;span class="k"&gt;*&lt;/span&gt;
          &lt;span class="k"&gt;*&lt;/span&gt;
          &lt;span class="k"&gt;*&lt;/span&gt;
  +--------------+
  | process-data |
  +--------------+
     &lt;span class="k"&gt;**&lt;/span&gt;        &lt;span class="k"&gt;**&lt;/span&gt;
   &lt;span class="k"&gt;**&lt;/span&gt;            &lt;span class="k"&gt;*&lt;/span&gt;
  &lt;span class="k"&gt;*&lt;/span&gt;               &lt;span class="k"&gt;**&lt;/span&gt;
+-------+           &lt;span class="k"&gt;*&lt;/span&gt;
| train |         &lt;span class="k"&gt;**&lt;/span&gt;
+-------+        &lt;span class="k"&gt;*&lt;/span&gt;
     &lt;span class="k"&gt;**&lt;/span&gt;        &lt;span class="k"&gt;**&lt;/span&gt;
       &lt;span class="k"&gt;**&lt;/span&gt;    &lt;span class="k"&gt;**&lt;/span&gt;
         &lt;span class="k"&gt;*&lt;/span&gt;  &lt;span class="k"&gt;*&lt;/span&gt;
     +--------+
     | report |
     +--------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you pay attention to each parameter passed to &lt;code&gt;dvc run&lt;/code&gt;, you might have noticed that the &lt;code&gt;train&lt;/code&gt; stage depends on the output &lt;code&gt;output/training_data.pkl&lt;/code&gt; of the &lt;code&gt;process-data&lt;/code&gt; stage. This is how DVC decides the order of the stages in our pipeline.&lt;/p&gt;
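
&lt;p&gt;As a rough illustration of deriving a stage order from dependencies and outputs, the sketch below runs a topological sort over the three stages defined in this post. It assumes Python 3.9 or later for &lt;code&gt;graphlib&lt;/code&gt;; the &lt;code&gt;stage_order&lt;/code&gt; helper is hypothetical and is not how DVC is implemented internally.&lt;/p&gt;

```python
from graphlib import TopologicalSorter

# deps and outs copied from the dvc.yaml shown in this post
stages = {
    "process-data": {
        "deps": [
            "data/digit_data.csv",
            "data/digit_target.csv",
            "digit_recognizer/digit_recognizer.py",
        ],
        "outs": ["output/testing_data.pkl", "output/training_data.pkl"],
    },
    "train": {
        "deps": ["digit_recognizer/digit_recognizer.py", "output/training_data.pkl"],
        "outs": ["output/model.pkl"],
    },
    "report": {
        "deps": [
            "digit_recognizer/digit_recognizer.py",
            "output/model.pkl",
            "output/testing_data.pkl",
        ],
        "outs": ["output/metrics.json", "output/test_data_results.csv"],
    },
}


def stage_order(stages):
    # Map every output file to the stage that produces it.
    producer = {out: name for name, s in stages.items() for out in s["outs"]}
    # A stage must run after every stage that produces one of its dependencies.
    graph = {
        name: {producer[dep] for dep in s["deps"] if dep in producer}
        for name, s in stages.items()
    }
    return list(TopologicalSorter(graph).static_order())


order = stage_order(stages)
print(order)  # ['process-data', 'train', 'report']
```

&lt;p&gt;Files such as &lt;code&gt;data/digit_data.csv&lt;/code&gt; have no producing stage, so they add no edge to the graph.&lt;/p&gt;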

&lt;h3&gt;
  
  
  Run the pipeline
&lt;/h3&gt;

&lt;p&gt;Contrary to its name, &lt;code&gt;dvc run&lt;/code&gt; is only used for defining a stage and running it for the first time. &lt;a href="https://dvc.org/doc/command-reference/repro#repro"&gt;dvc repro&lt;/a&gt; (reproduce) is what we use to run the pipeline.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;pipenv run dvc repro

&lt;span class="s1"&gt;'data.dvc'&lt;/span&gt; didn&lt;span class="s1"&gt;'t change, skipping
Stage '&lt;/span&gt;train&lt;span class="s1"&gt;' didn'&lt;/span&gt;t change, skipping
Data and pipelines are up to date.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Because we haven't made any changes since we defined our pipeline, DVC won't waste time and resources regenerating results it already has. However, you can add the &lt;code&gt;-f&lt;/code&gt; flag to force DVC to rerun the pipeline.&lt;/p&gt;

&lt;p&gt;Next, we'll change gamma to 0.01 to see how &lt;code&gt;dvc repro&lt;/code&gt; works.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;train_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="p"&gt;...&lt;/span&gt;
    &lt;span class="n"&gt;clf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;svm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SVC&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gamma&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.01&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Because our dependency &lt;code&gt;digit_recognizer/digit_recognizer.py&lt;/code&gt; has been modified, DVC expects the result might be different. Therefore, we can now run &lt;code&gt;dvc repro&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;pipenv run dvc repro

&lt;span class="s1"&gt;'data.dvc'&lt;/span&gt; didn&lt;span class="s1"&gt;'t change, skipping
Running stage '&lt;/span&gt;process-data&lt;span class="s1"&gt;':
&amp;gt; pipenv run python digit_recognizer/digit_recognizer.py process-data
Updating lock file '&lt;/span&gt;dvc.lock&lt;span class="s1"&gt;'

Running stage '&lt;/span&gt;train&lt;span class="s1"&gt;':
&amp;gt; pipenv run python digit_recognizer/digit_recognizer.py train
Updating lock file '&lt;/span&gt;dvc.lock&lt;span class="s1"&gt;'

Running stage '&lt;/span&gt;report&lt;span class="s1"&gt;':
&amp;gt; pipenv run python digit_recognizer/digit_recognizer.py report
Updating lock file '&lt;/span&gt;dvc.lock&lt;span class="s1"&gt;'

To track the changes with git, run:

git add dvc.lock
Use `dvc push` to send your updates to remote storage.
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By running &lt;code&gt;git diff&lt;/code&gt;, you'll find that the hashes of &lt;code&gt;digit_recognizer/digit_recognizer.py&lt;/code&gt;, &lt;code&gt;output/model.pkl&lt;/code&gt;, &lt;code&gt;output/metrics.json&lt;/code&gt;, and &lt;code&gt;output/test_data_results.csv&lt;/code&gt; inside &lt;code&gt;dvc.lock&lt;/code&gt; have changed.&lt;/p&gt;

&lt;h3&gt;
  
  
  Track parameters
&lt;/h3&gt;

&lt;p&gt;In the previous section, even though we changed only a parameter related to the &lt;code&gt;train&lt;/code&gt; stage, DVC still reran the whole pipeline, because every stage depends on &lt;code&gt;digit_recognizer/digit_recognizer.py&lt;/code&gt;. To make DVC run only the stages affected by the changed parameters, we can refactor our code to load parameters from a separate file, &lt;code&gt;params.yaml&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;params&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;load_params&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"params.yaml"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;load_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"data/digit_data.csv"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"data/digit_target.csv"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_test&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;process_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"process_data"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;train_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"train"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;export_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;......&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is what &lt;code&gt;params.yaml&lt;/code&gt; looks like.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;process_data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;test_size&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.5&lt;/span&gt;
  &lt;span class="na"&gt;shuffle&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="no"&gt;false&lt;/span&gt;
&lt;span class="na"&gt;train&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;gamma&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.01&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Full code changes can be found on &lt;a href="https://github.com/Lee-W/dvc_example/tree/v5-parameters-in-separate-file"&gt;v5-parameters-in-separate-file&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;To add parameters to a stage, we'll need to run the previous &lt;code&gt;dvc run&lt;/code&gt; command again with the &lt;code&gt;-f&lt;/code&gt; and &lt;code&gt;-p&lt;/code&gt; flags.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;-f&lt;/code&gt;: overwrite the stage with the same name&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;-p&lt;/code&gt;: parameters

&lt;ul&gt;
&lt;li&gt;Use "," to separate parameters
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Add parameters process_data.test_size and process_data.shuffle to process-data stage&lt;/span&gt;
pipenv run dvc run &lt;span class="nt"&gt;-f&lt;/span&gt; &lt;span class="nt"&gt;--name&lt;/span&gt; process-data &lt;span class="se"&gt;\&lt;/span&gt;
      &lt;span class="nt"&gt;-d&lt;/span&gt; digit_recognizer/digit_recognizer.py &lt;span class="se"&gt;\&lt;/span&gt;
      &lt;span class="nt"&gt;-d&lt;/span&gt; data/digit_data.csv &lt;span class="se"&gt;\&lt;/span&gt;
      &lt;span class="nt"&gt;-d&lt;/span&gt; data/digit_target.csv &lt;span class="se"&gt;\&lt;/span&gt;
      &lt;span class="nt"&gt;-o&lt;/span&gt; output/training_data.pkl &lt;span class="se"&gt;\&lt;/span&gt;
      &lt;span class="nt"&gt;-o&lt;/span&gt; output/testing_data.pkl &lt;span class="se"&gt;\&lt;/span&gt;
      &lt;span class="nt"&gt;-p&lt;/span&gt; process_data.test_size,process_data.shuffle &lt;span class="se"&gt;\&lt;/span&gt;
      &lt;span class="s2"&gt;"pipenv run python digit_recognizer/digit_recognizer.py process-data"&lt;/span&gt;

&lt;span class="c"&gt;# Add parameters train.gamma to train stage&lt;/span&gt;
pipenv run dvc run &lt;span class="nt"&gt;-f&lt;/span&gt; &lt;span class="nt"&gt;--name&lt;/span&gt; train &lt;span class="se"&gt;\&lt;/span&gt;
      &lt;span class="nt"&gt;-d&lt;/span&gt; digit_recognizer/digit_recognizer.py &lt;span class="se"&gt;\&lt;/span&gt;
      &lt;span class="nt"&gt;-d&lt;/span&gt; output/training_data.pkl &lt;span class="se"&gt;\&lt;/span&gt;
      &lt;span class="nt"&gt;-o&lt;/span&gt; output/model.pkl &lt;span class="se"&gt;\&lt;/span&gt;
      &lt;span class="nt"&gt;-p&lt;/span&gt; train.gamma &lt;span class="se"&gt;\&lt;/span&gt;
      &lt;span class="s2"&gt;"pipenv run python digit_recognizer/digit_recognizer.py train"&lt;/span&gt;  

&lt;span class="c"&gt;# add DVC configuration to git and commit&lt;/span&gt;
git add dvc.yaml dvc.lock model/.gitignore
pipenv run cz commit
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;DVC adds a &lt;code&gt;params&lt;/code&gt; key to both the &lt;code&gt;process-data&lt;/code&gt; and &lt;code&gt;train&lt;/code&gt; stages in &lt;code&gt;dvc.yaml&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;stages:
  process-data:
    ......
    params:
    - process_data.shuffle
    - process_data.test_size
  train:
      ......
    params:
    - train.gamma
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;params.yaml&lt;/code&gt; is the default parameter file name, but DVC also supports other YAML, JSON, TOML, and &lt;a href="https://dvc.org/doc/command-reference/params#examples-python-parameters-file"&gt;Python files&lt;/a&gt;. To use one, we only need to add its file name as an additional layer under &lt;code&gt;params&lt;/code&gt;, e.g.,&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;  &lt;span class="c1"&gt;# this is an example of using different parameter file name&lt;/span&gt;
  &lt;span class="c1"&gt;# we don't need to make changes to our code&lt;/span&gt;
  &lt;span class="na"&gt;train&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="s"&gt;......&lt;/span&gt;
    &lt;span class="na"&gt;params&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;params.json&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;train.gamma&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
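&lt;p&gt;As a rough sketch of how training code could pick up a value from such a file: the dotted name used in &lt;code&gt;dvc.yaml&lt;/code&gt; maps to nested keys. The &lt;code&gt;load_param&lt;/code&gt; helper and the demo &lt;code&gt;params.json&lt;/code&gt; below are assumptions made for illustration and are not part of dvc_example.&lt;/p&gt;

```python
import json
import os
import tempfile


def load_param(path, dotted_name):
    """Return the value stored under a dotted name such as "train.gamma"."""
    with open(path) as params_file:
        value = json.load(params_file)
    # Walk down one nesting level per dotted component.
    for key in dotted_name.split("."):
        value = value[key]
    return value


# Hypothetical parameter file, written to a temporary directory for the demo.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "params.json")
with open(demo_path, "w") as params_file:
    json.dump({"train": {"gamma": 0.01}}, params_file)

gamma = load_param(demo_path, "train.gamma")
print(gamma)
```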



&lt;p&gt;Let's change gamma to 0.1. We can check this change through &lt;a href="https://dvc.org/doc/command-reference/params/diff"&gt;dvc params diff&lt;/a&gt;. By providing a git reference, we can even see the parameter differences between git commits (e.g., &lt;code&gt;dvc params diff HEAD~1&lt;/code&gt;).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;pipenv run dvc params diff

Path         Param        Old    New
params.yaml  train.gamma  0.01   0.1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
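&lt;p&gt;To make the report above less of a black box, here is a minimal sketch of the same idea: flatten two nested parameter dictionaries into dotted names and list the entries that differ. It is not DVC's implementation; &lt;code&gt;flatten&lt;/code&gt; and &lt;code&gt;params_diff&lt;/code&gt; are hypothetical helpers, and the &lt;code&gt;shuffle&lt;/code&gt; value is made up.&lt;/p&gt;

```python
def flatten(params, prefix=""):
    """Flatten nested dicts into {"section.name": value} pairs."""
    flat = {}
    for key, value in params.items():
        name = prefix + key
        if isinstance(value, dict):
            flat.update(flatten(value, name + "."))
        else:
            flat[name] = value
    return flat


def params_diff(old, new):
    """List (name, old_value, new_value) for every parameter that differs."""
    old_flat, new_flat = flatten(old), flatten(new)
    names = sorted(set(old_flat) | set(new_flat))
    return [
        (name, old_flat.get(name), new_flat.get(name))
        for name in names
        if old_flat.get(name) != new_flat.get(name)
    ]


# The shuffle value is illustrative; only train.gamma changes between the two.
old = {"process_data": {"shuffle": False}, "train": {"gamma": 0.01}}
new = {"process_data": {"shuffle": False}, "train": {"gamma": 0.1}}
changes = params_diff(old, new)
print(changes)
```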



&lt;p&gt;If we run &lt;code&gt;dvc repro&lt;/code&gt; now, DVC reruns only the &lt;code&gt;train&lt;/code&gt; and &lt;code&gt;report&lt;/code&gt; stages. The &lt;code&gt;train&lt;/code&gt; stage is affected by the &lt;code&gt;train.gamma&lt;/code&gt; change, so its output file is updated. Because &lt;code&gt;report&lt;/code&gt; depends on that output, DVC reruns the &lt;code&gt;report&lt;/code&gt; stage as well.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;pipenv run dvc repro

&lt;span class="s1"&gt;'data.dvc'&lt;/span&gt; didn&lt;span class="s1"&gt;'t change, skipping
Stage '&lt;/span&gt;process-data&lt;span class="s1"&gt;' didn'&lt;/span&gt;t change, skipping
Running stage &lt;span class="s1"&gt;'train'&lt;/span&gt;:
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; pipenv run python digit_recognizer/digit_recognizer.py train
Updating lock file &lt;span class="s1"&gt;'dvc.lock'&lt;/span&gt;

Running stage &lt;span class="s1"&gt;'report'&lt;/span&gt;:
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; pipenv run python digit_recognizer/digit_recognizer.py report
Updating lock file &lt;span class="s1"&gt;'dvc.lock'&lt;/span&gt;

To track the changes with git, run:

    git add dvc.lock
Use &lt;span class="sb"&gt;`&lt;/span&gt;dvc push&lt;span class="sb"&gt;`&lt;/span&gt; to send your updates to remote storage.

&lt;span class="c"&gt;# reset gamma back to 0.01&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;git checkout dvc.lock params.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
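&lt;p&gt;The propagation described above can be sketched as follows. This is a simplification under stated assumptions: DVC actually compares the hashes recorded in &lt;code&gt;dvc.lock&lt;/code&gt;, whereas this hypothetical &lt;code&gt;stages_to_rerun&lt;/code&gt; helper simply marks every stage downstream of a changed one.&lt;/p&gt;

```python
# Stage -> stages it depends on, mirroring the -d flags used in this post.
DEPENDS_ON = {
    "process-data": [],
    "train": ["process-data"],
    "report": ["process-data", "train"],
}


def stages_to_rerun(changed):
    """Return the changed stages plus everything downstream of them."""
    rerun = set(changed)
    grew = True
    while grew:
        grew = False
        for stage, deps in DEPENDS_ON.items():
            # A stage must rerun when any of its dependencies reruns.
            if stage not in rerun and rerun.intersection(deps):
                rerun.add(stage)
                grew = True
    return rerun


rerun = stages_to_rerun({"train"})
print(sorted(rerun))
```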



&lt;p&gt;We're not going to store this parameter change. Run &lt;code&gt;git checkout params.yaml dvc.lock&lt;/code&gt; to restore the previous state.&lt;/p&gt;

&lt;h3&gt;
  
  
  Track metrics
&lt;/h3&gt;

&lt;p&gt;We now know how to track parameters. Next, we'll see how changing these parameters affects the performance of our model. You may have already noticed that we output an &lt;code&gt;output/metrics.json&lt;/code&gt; file. Although we could track it as a regular output file, DVC has better support for metrics files.&lt;/p&gt;

&lt;p&gt;As with parameters, we add the &lt;code&gt;-m&lt;/code&gt; flag for DVC to recognize the output as metrics. Instead of using &lt;code&gt;-M&lt;/code&gt; as the official tutorial does, I use &lt;code&gt;-m&lt;/code&gt; because I prefer tracking metrics through DVC remote storage rather than saving them to git as part of our source code.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Add output/metrics.json as metrics to report stage&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;pipenv run dvc run &lt;span class="nt"&gt;-f&lt;/span&gt; &lt;span class="nt"&gt;--name&lt;/span&gt; report &lt;span class="se"&gt;\&lt;/span&gt;
      &lt;span class="nt"&gt;-d&lt;/span&gt; digit_recognizer/digit_recognizer.py &lt;span class="se"&gt;\&lt;/span&gt;
      &lt;span class="nt"&gt;-d&lt;/span&gt; output/testing_data.pkl &lt;span class="se"&gt;\&lt;/span&gt;
      &lt;span class="nt"&gt;-d&lt;/span&gt; output/model.pkl &lt;span class="se"&gt;\&lt;/span&gt;
      &lt;span class="nt"&gt;-o&lt;/span&gt; output/test_data_results.csv &lt;span class="se"&gt;\&lt;/span&gt;
      &lt;span class="nt"&gt;-m&lt;/span&gt; output/metrics.json &lt;span class="se"&gt;\&lt;/span&gt;
      &lt;span class="s2"&gt;"pipenv run python digit_recognizer/digit_recognizer.py report"&lt;/span&gt;

&lt;span class="c"&gt;# add DVC configuration to git and commit&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;git add dvc.yaml dvc.lock model/.gitignore
&lt;span class="nv"&gt;$ &lt;/span&gt;pipenv run cz commit

&lt;span class="c"&gt;# metrics have been added to the report stage as expected&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;cat &lt;/span&gt;dvc.yaml

...
  report:
    ......
    metrics:
    - output/metrics.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Use &lt;a href="https://dvc.org/doc/command-reference/metrics/show"&gt;dvc metrics show&lt;/a&gt; to see how well our model performs.&lt;br&gt;
Note that the values are not calculated by DVC. DVC only provides a way to display values from files organized as tree hierarchies and to compare them across git commits.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;pipenv run dvc metrics show

Path                 accuracy_score    weighted_f1_score    weighted_precision    weighted_recall
output/metrics.json  0.69265           0.74567              0.91941               0.69265
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
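&lt;p&gt;Since DVC only displays the values, the script itself has to compute and write them. The following is a dependency-free sketch of that step; dvc_example presumably relies on scikit-learn's metric functions instead, and &lt;code&gt;accuracy&lt;/code&gt; and &lt;code&gt;write_metrics&lt;/code&gt; are hypothetical names with made-up sample data.&lt;/p&gt;

```python
import json
import os
import tempfile


def accuracy(actual, predicted):
    """Fraction of predictions that match the ground truth."""
    matches = sum(1 for a, p in zip(actual, predicted) if a == p)
    return matches / len(actual)


def write_metrics(path, actual, predicted):
    """Write a flat JSON metrics file that a metrics viewer can display."""
    metrics = {"accuracy_score": round(accuracy(actual, predicted), 5)}
    with open(path, "w") as metrics_file:
        json.dump(metrics, metrics_file)
    return metrics


# Made-up labels: three of the four predictions are correct.
metrics_path = os.path.join(tempfile.mkdtemp(), "metrics.json")
metrics = write_metrics(metrics_path, [4, 8, 1, 3], [4, 8, 7, 3])
print(metrics)
```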



&lt;p&gt;Change gamma to 0.1 again and use &lt;a href="https://dvc.org/doc/command-reference/metrics/diff"&gt;dvc metrics diff&lt;/a&gt; to see whether model performance improves after this change.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# reruns the pipeline with new parameters&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;pipenv run dvc repro

&lt;span class="c"&gt;# check metrics differences between unstaged and HEAD&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;pipenv run dvc metrics diff

Path                 Metric              Old      New      Change
output/metrics.json  accuracy_score      0.69265  0.10134  &lt;span class="nt"&gt;-0&lt;/span&gt;.59131
output/metrics.json  weighted_f1_score   0.74567  0.01865  &lt;span class="nt"&gt;-0&lt;/span&gt;.72702
output/metrics.json  weighted_precision  0.91941  0.01027  &lt;span class="nt"&gt;-0&lt;/span&gt;.90914
output/metrics.json  weighted_recall     0.69265  0.10134  &lt;span class="nt"&gt;-0&lt;/span&gt;.59131

&lt;span class="c"&gt;# reset gamma back to 0.01&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;git checkout dvc.lock params.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
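&lt;p&gt;The Change column is simply the new value minus the old one. A small sketch, using two of the metrics reported above and a hypothetical &lt;code&gt;metrics_change&lt;/code&gt; helper, makes the arithmetic explicit:&lt;/p&gt;

```python
def metrics_change(old, new, digits=5):
    """Return {metric: new - old} for metrics present in both runs."""
    return {
        name: round(new[name] - old[name], digits)
        for name in old
        if name in new
    }


# Values copied from the dvc metrics diff output in this post.
old = {"accuracy_score": 0.69265, "weighted_f1_score": 0.74567}
new = {"accuracy_score": 0.10134, "weighted_f1_score": 0.01865}
change = metrics_change(old, new)
print(change)
```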



&lt;p&gt;We don't need this change either. Reset gamma back to 0.01 through &lt;code&gt;git checkout&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Plotting
&lt;/h3&gt;

&lt;p&gt;There's only one output left that has not yet been used: &lt;code&gt;output/test_data_results.csv&lt;/code&gt;. This file stores the ground truth and the predicted results from our model. We're going to use it to see how DVC plots our data. Before plotting, let's change gamma to 0.001 and run &lt;code&gt;dvc repro&lt;/code&gt;. Otherwise, the output plot will look a bit odd due to the low model performance.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;cat &lt;/span&gt;output/test_data_results.csv

actual,predicted
4.0,4.0
8.0,8.0
......
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Add the &lt;code&gt;--plots&lt;/code&gt; flag and specify &lt;code&gt;output/test_data_results.csv&lt;/code&gt; as the file to plot.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# add output/test_data_results.csv as the file to plot to report stage&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;pipenv run dvc run &lt;span class="nt"&gt;-f&lt;/span&gt; &lt;span class="nt"&gt;--name&lt;/span&gt; report &lt;span class="se"&gt;\&lt;/span&gt;
      &lt;span class="nt"&gt;-d&lt;/span&gt; digit_recognizer/digit_recognizer.py &lt;span class="se"&gt;\&lt;/span&gt;
      &lt;span class="nt"&gt;-d&lt;/span&gt; output/testing_data.pkl &lt;span class="se"&gt;\&lt;/span&gt;
      &lt;span class="nt"&gt;-d&lt;/span&gt; output/model.pkl &lt;span class="se"&gt;\&lt;/span&gt;
      &lt;span class="nt"&gt;-o&lt;/span&gt; output/test_data_results.csv &lt;span class="se"&gt;\&lt;/span&gt;
      &lt;span class="nt"&gt;-m&lt;/span&gt; output/metrics.json &lt;span class="se"&gt;\&lt;/span&gt;
      &lt;span class="nt"&gt;--plots&lt;/span&gt; output/test_data_results.csv &lt;span class="se"&gt;\&lt;/span&gt;
      &lt;span class="s2"&gt;"pipenv run python digit_recognizer/digit_recognizer.py report"&lt;/span&gt;

&lt;span class="c"&gt;# plots have been added to dvc.yaml&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;cat &lt;/span&gt;dvc.yaml
  ......
  plots:
  - output/test_data_results.csv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;DVC generates plots through &lt;a href="https://vega.github.io/vega/"&gt;Vega&lt;/a&gt;, a declarative grammar that defines interactive graphs in JSON format. By default, it provides templates for linear plots, scatter plots, and confusion matrices. These templates are stored in &lt;code&gt;.dvc/plots&lt;/code&gt;. We can also define our own templates. (Read&lt;br&gt;
&lt;a href="https://dvc.org/doc/command-reference/plots#custom-templates"&gt;dvc plots - Custom templates&lt;/a&gt; to find out more)&lt;/p&gt;

&lt;p&gt;In the following example, we'll plot a confusion matrix through &lt;a href="https://dvc.org/doc/command-reference/plots/show"&gt;dvc plots show&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;pipenv run dvc plots show output/test_data_results.csv &lt;span class="nt"&gt;--template&lt;/span&gt; confusion &lt;span class="nt"&gt;-x&lt;/span&gt; actual &lt;span class="nt"&gt;-y&lt;/span&gt; predicted &lt;span class="nt"&gt;--out&lt;/span&gt; confusion_matrix.html

file:///....../confusion_matrix.html
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;--template&lt;/code&gt;: name of the plot template&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;-x&lt;/code&gt;: field name of the data for the x-axis&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;-y&lt;/code&gt;: field name of the data for the y-axis&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--out&lt;/code&gt;: output file name&lt;/li&gt;
&lt;/ul&gt;
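&lt;p&gt;Conceptually, a confusion matrix counts how often each (actual, predicted) pair occurs. The sketch below shows that counting step on made-up rows shaped like &lt;code&gt;output/test_data_results.csv&lt;/code&gt;; it only illustrates the idea and is not how the Vega template is implemented. &lt;code&gt;confusion_counts&lt;/code&gt; is a hypothetical helper.&lt;/p&gt;

```python
import csv
import io
from collections import Counter


def confusion_counts(csv_text):
    """Count (actual, predicted) pairs from CSV text with those two columns."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter((row["actual"], row["predicted"]) for row in reader)


# Made-up rows in the same shape as output/test_data_results.csv.
sample = "actual,predicted\n4.0,4.0\n8.0,8.0\n8.0,3.0\n4.0,4.0\n"
counts = confusion_counts(sample)
print(counts[("4.0", "4.0")], counts[("8.0", "3.0")])
```

&lt;p&gt;Pairs on the diagonal (actual equals predicted) are correct predictions; everything else is a misclassification.&lt;/p&gt;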

&lt;p&gt;The following is a screenshot of the generated plot.&lt;/p&gt;

&lt;p&gt;&lt;a href="/images/posts-image/2021-dvc/confusion-matrix.jpg" class="article-body-image-wrapper"&gt;&lt;img src="/images/posts-image/2021-dvc/confusion-matrix.jpg" alt="confusion-matrix"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As of now, DVC does not track our plot (i.e., &lt;code&gt;confusion_matrix.html&lt;/code&gt;) but only the data to plot (i.e., &lt;code&gt;output/test_data_results.csv&lt;/code&gt;). Let's add &lt;code&gt;plot&lt;/code&gt; as the final stage of our pipeline.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Add stage plot&lt;/span&gt;
pipenv run dvc run &lt;span class="nt"&gt;-f&lt;/span&gt; &lt;span class="nt"&gt;--name&lt;/span&gt; plot &lt;span class="se"&gt;\&lt;/span&gt;
          &lt;span class="nt"&gt;-d&lt;/span&gt; output/test_data_results.csv &lt;span class="se"&gt;\&lt;/span&gt;
          &lt;span class="nt"&gt;-o&lt;/span&gt; confusion_matrix.html &lt;span class="se"&gt;\&lt;/span&gt;
          &lt;span class="s2"&gt;"pipenv run dvc plots show output/test_data_results.csv --template confusion -x actual -y predicted --out confusion_matrix.html"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In this post, we created a data pipeline that processes the data, trains the model, generates the report, and visualizes it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;pipenv run dvc dag

       +----------+
       | data.dvc |
       +----------+
             &lt;span class="k"&gt;*&lt;/span&gt;
             &lt;span class="k"&gt;*&lt;/span&gt;
             &lt;span class="k"&gt;*&lt;/span&gt;
     +--------------+
     | process-data |
     +--------------+
         &lt;span class="k"&gt;*&lt;/span&gt;        &lt;span class="k"&gt;*&lt;/span&gt;
       &lt;span class="k"&gt;**&lt;/span&gt;          &lt;span class="k"&gt;*&lt;/span&gt;
      &lt;span class="k"&gt;*&lt;/span&gt;             &lt;span class="k"&gt;**&lt;/span&gt;
+-------+             &lt;span class="k"&gt;*&lt;/span&gt;
| train |           &lt;span class="k"&gt;**&lt;/span&gt;
+-------+          &lt;span class="k"&gt;*&lt;/span&gt;
         &lt;span class="k"&gt;*&lt;/span&gt;        &lt;span class="k"&gt;*&lt;/span&gt;
          &lt;span class="k"&gt;**&lt;/span&gt;    &lt;span class="k"&gt;**&lt;/span&gt;
            &lt;span class="k"&gt;*&lt;/span&gt;  &lt;span class="k"&gt;*&lt;/span&gt;
        +--------+
        | report |
        +--------+
             &lt;span class="k"&gt;*&lt;/span&gt;
             &lt;span class="k"&gt;*&lt;/span&gt;
             &lt;span class="k"&gt;*&lt;/span&gt;
         +------+
         | plot |
         +------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
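&lt;p&gt;The diagram above is a directed acyclic graph, so a valid execution order can be derived from the dependencies alone. As a minimal sketch (assuming Python 3.9+ for the standard-library &lt;code&gt;graphlib&lt;/code&gt; module, and not claiming this is how DVC schedules stages):&lt;/p&gt;

```python
from graphlib import TopologicalSorter

# Stage -> stages it depends on, read off the diagram above.
pipeline = {
    "process-data": {"data.dvc"},
    "train": {"process-data"},
    "report": {"process-data", "train"},
    "plot": {"report"},
}

# static_order() yields every node after all of its dependencies.
order = list(TopologicalSorter(pipeline).static_order())
print(order)
```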



&lt;p&gt;We also saw how DVC tracks each component and provides an easy way to run the pipeline. In the next article, we will discuss how to run experiments with different parameters and compare the results in an even more convenient way.&lt;/p&gt;

&lt;h2&gt;
  
  
  One more thing: When should we save files to DVC instead of git?
&lt;/h2&gt;

&lt;p&gt;Short answer: It depends.&lt;/p&gt;

&lt;p&gt;When defining a pipeline, we can decide whether to save our outputs (&lt;code&gt;-o&lt;/code&gt; / &lt;code&gt;-O&lt;/code&gt;), metrics (&lt;code&gt;-m&lt;/code&gt; / &lt;code&gt;-M&lt;/code&gt;), and plots (&lt;code&gt;--plots&lt;/code&gt; / &lt;code&gt;--plots-no-cache&lt;/code&gt;) to DVC storage. The DVC documentation suggests not storing metrics and plots in DVC, as they are typically small enough for git to track. But I prefer storing only things related to our logic in git. That's why I use &lt;code&gt;-m&lt;/code&gt; and &lt;code&gt;--plots&lt;/code&gt; in the examples. If you don't want to track these files at all, you could pass &lt;code&gt;-O&lt;/code&gt;, &lt;code&gt;-M&lt;/code&gt;, or &lt;code&gt;--plots-no-cache&lt;/code&gt; and add the files to both &lt;code&gt;.gitignore&lt;/code&gt; and &lt;code&gt;.dvcignore&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Reference
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dvc.org/"&gt;DVC&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://stanford-cs329s.github.io/syllabus.html"&gt;CS 329S: Machine Learning Systems Design&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>datascience</category>
      <category>monitoring</category>
    </item>
    <item>
      <title>DVC - Data Versioning</title>
      <dc:creator>Wei Lee</dc:creator>
      <pubDate>Fri, 02 Jul 2021 10:49:33 +0000</pubDate>
      <link>https://dev.to/leew/dvc-data-versioning-3kl8</link>
      <guid>https://dev.to/leew/dvc-data-versioning-3kl8</guid>
      <description>&lt;p&gt;This article also lives on &lt;a href="https://lee-w.github.io/posts/tech/2021/06/dvc-data-versioning/"&gt;GitHub&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;About DVC (Data Version Control)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What's DVC?

&lt;ul&gt;
&lt;li&gt;version control system for data science and machine learning&lt;/li&gt;
&lt;li&gt;compatible with git (it's based on git)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;What can DVC do?

&lt;ul&gt;
&lt;li&gt;track

&lt;ul&gt;
&lt;li&gt;data&lt;/li&gt;
&lt;li&gt;model&lt;/li&gt;
&lt;li&gt;pipeline&lt;/li&gt;
&lt;li&gt;metrics&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;use storage directly&lt;/li&gt;
&lt;li&gt;no external services needed&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Who are the targeted users of DVC?

&lt;ul&gt;
&lt;li&gt;ML researchers / engineers&lt;/li&gt;
&lt;li&gt;DevOps &amp;amp; Engineers&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Why DVC?

&lt;ul&gt;
&lt;li&gt;It links your data, model, and pipelines with your metrics.

&lt;ul&gt;
&lt;li&gt;reproducibility&lt;/li&gt;
&lt;li&gt;trackability&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Read &lt;a href="https://dvc.org/doc/use-cases/versioning-data-and-model-files"&gt;DVC - Versioning Data and Models&lt;/a&gt; for more use cases&lt;/p&gt;

&lt;h2&gt;
  
  
  How to use DVC?
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Install DVC globally
&lt;/h3&gt;

&lt;p&gt;I suggest using &lt;a href="https://pypa.github.io/pipx/"&gt;pipx&lt;/a&gt; if you want to install DVC globally. However, an even better way is to install it inside your project's virtual environment.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;pipx
&lt;span class="nv"&gt;$ &lt;/span&gt;pipx &lt;span class="nb"&gt;install &lt;/span&gt;dvc
&lt;span class="nv"&gt;$ &lt;/span&gt;dvc &lt;span class="nt"&gt;--version&lt;/span&gt;

2.3.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;DVC also provides &lt;a href="https://dvc.org/doc/install/completion"&gt;Shell Completion&lt;/a&gt; and &lt;a href="https://dvc.org/doc/install/plugins"&gt;Syntax Highlighting Plugins&lt;/a&gt; for popular editors.&lt;/p&gt;

&lt;h3&gt;
  
  
  Take a look at the example project
&lt;/h3&gt;

&lt;p&gt;I'll use &lt;a href="https://github.com/Lee-W/dvc_example/"&gt;dvc_example&lt;/a&gt; to demonstrate how I applied DVC to an existing machine learning project. The example is based on &lt;a href="https://scikit-learn.org/0.24/auto_examples/classification/plot_digits_classification.html"&gt;Recognizing hand-written digits&lt;/a&gt; from scikit-learn documentation. All the DVC parts start from &lt;a href="https://github.com/Lee-W/dvc_example/tree/v1-base"&gt;v1-base&lt;/a&gt;. You can &lt;code&gt;git checkout&lt;/code&gt; to the tag to follow along.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;git clone https://github.com/Lee-W/dvc_example/ &lt;span class="nt"&gt;--branch&lt;/span&gt; v1-base
&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;dvc_example
&lt;span class="nv"&gt;$ &lt;/span&gt;tree
&lt;span class="nb"&gt;.&lt;/span&gt;
├── LICENSE
├── Pipfile
├── Pipfile.lock
├── digit_recognizer
│   ├── __init__.py
│   └── digit_recognizer.py
├── docs
│   └── README.md
├── mkdocs.yml
├── output
└── tasks.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To set up the development environment, you'll need &lt;a href="https://pipenv.pypa.io/en/latest/"&gt;pipenv&lt;/a&gt; and &lt;a href="https://www.pyinvoke.org/"&gt;invoke&lt;/a&gt;. If you run into an error when running &lt;code&gt;pipenv install&lt;/code&gt;, you can run &lt;code&gt;export SYSTEM_VERSION_COMPAT=1&lt;/code&gt; before it. It's an open issue (&lt;a href="https://github.com/pypa/pipenv/issues/4564#issuecomment-756625303"&gt;Issue with NumPy, macOS 11 Big Sur, Python 3.9.1 Does pipenv not use the latest pip? #4564&lt;/a&gt;) of pipenv as of now. Or, you can just run the following commands.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# install needed tools&lt;/span&gt;
pipx &lt;span class="nb"&gt;install &lt;/span&gt;pipenv invoke

&lt;span class="c"&gt;# set up environments&lt;/span&gt;
invoke init-dev
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We'll use &lt;a href="https://github.com/Lee-W/dvc_example/blob/v1-base/digit_recognizer/digit_recognizer.py"&gt;digit_recognizer/digit_recognizer.py&lt;/a&gt; for training a model that can recognize handwritten digits.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;load_data&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_test&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;process_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;train_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;predicted_y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;output_results&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;predicted_y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;output_metrics&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;predicted_y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Install DVC into the virtual environment
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pipenv &lt;span class="nb"&gt;install &lt;/span&gt;dvc
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you plan to save data to remote storage, you might need to install extra dependencies.&lt;br&gt;
(e.g., &lt;code&gt;pipenv install dvc[s3]&lt;/code&gt;)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Supported types

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;[s3]&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;[azure]&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;[gdrive]&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;[gs]&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;[oss]&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;[ssh]&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Or, use &lt;code&gt;pipenv install dvc[all]&lt;/code&gt; to install them all.&lt;/p&gt;

&lt;p&gt;Read &lt;a href="https://dvc.org/doc/command-reference/remote"&gt;dvc remote&lt;/a&gt; for more information.&lt;/p&gt;
&lt;h3&gt;
  
  
  Initialize DVC
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# initialize DVC configurations&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;pipenv run dvc init

&lt;span class="c"&gt;# see what's created by DVC&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;tree .dvc

.dvc
├── config
└── plots
    ├── confusion.json
    ├── confusion_normalized.json
    ├── default.json
    ├── linear.json
    ├── scatter.json
    └── smooth.json

&lt;span class="c"&gt;# track DVC configuration through git&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;git add .dvc

&lt;span class="c"&gt;# git commit&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;pipenv run cz commit
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Add DVC remote
&lt;/h3&gt;

&lt;p&gt;I'll use another local directory &lt;code&gt;../dvc_remote&lt;/code&gt; as our remote storage. You can change it to s3 or other remote storage.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;mkdir&lt;/span&gt; ../dvc_remote
dvc remote add &lt;span class="nt"&gt;--default&lt;/span&gt; &lt;span class="nb"&gt;local&lt;/span&gt; ../dvc_remote
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With the &lt;code&gt;--default&lt;/code&gt; flag, we can push to and pull from the &lt;code&gt;local&lt;/code&gt; remote without specifying the remote name.&lt;/p&gt;

&lt;p&gt;Let's see what's changed in &lt;code&gt;.dvc/config&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ cat .dvc/config

[core]
remote = local
['remote "local"']
url = ../../dvc_remote
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The url is &lt;code&gt;../../dvc_remote&lt;/code&gt; instead of &lt;code&gt;../dvc_remote&lt;/code&gt; because it's stored as a path relative to &lt;code&gt;.dvc&lt;/code&gt;. As we've not yet pushed anything to our pseudo remote, &lt;code&gt;../dvc_remote&lt;/code&gt; is still empty.&lt;/p&gt;
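&lt;p&gt;A quick way to convince ourselves that both paths point to the same directory is to resolve them. In this sketch, &lt;code&gt;/work/dvc_example&lt;/code&gt; is a hypothetical project location, and &lt;code&gt;configparser&lt;/code&gt; is used only as a convenient way to read the INI-style text; it is not meant to mirror how DVC parses its config.&lt;/p&gt;

```python
import configparser
import posixpath

# Same content as the .dvc/config shown above.
CONFIG_TEXT = """
[core]
remote = local
['remote "local"']
url = ../../dvc_remote
"""

config = configparser.ConfigParser()
config.read_string(CONFIG_TEXT)
# The section name keeps its surrounding single quotes.
url = config["'remote \"local\"'"]["url"]

# Hypothetical absolute location of the project, for illustration only.
project_dir = "/work/dvc_example"
from_config = posixpath.normpath(posixpath.join(project_dir, ".dvc", url))
from_cli = posixpath.normpath(posixpath.join(project_dir, "../dvc_remote"))
print(from_config, from_cli)
```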

&lt;h3&gt;
  
  
  Track data through DVC
&lt;/h3&gt;

&lt;p&gt;At this point, the data is loaded through &lt;a href="https://scikit-learn.org/0.24/modules/generated/sklearn.datasets.load_digits.html"&gt;sklearn.datasets.load_digits&lt;/a&gt;. We're going to change it to read from static files in &lt;code&gt;data/&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;load_data&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="c1"&gt;# Load data
&lt;/span&gt;    &lt;span class="n"&gt;digits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;datasets&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;load_digits&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can use the following script to output the digit data into &lt;code&gt;data/&lt;/code&gt;. Note that it's a one-time use script. We won't add it into git.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;os&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datasets&lt;/span&gt;

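&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;makedirs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"data"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;exist_ok&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;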

&lt;span class="n"&gt;digits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;datasets&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;load_digits&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;digits&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;to_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"data/digit_data.csv"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;header&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;digits&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;to_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"data/digit_target.csv"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;header&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We'll need to change the &lt;code&gt;load_data&lt;/code&gt; and &lt;code&gt;main&lt;/code&gt; functions so that they read data from these files.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;load_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_path&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nb"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;input_file&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;csv_reader&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;csv&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reader&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_file&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;quoting&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;csv&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;QUOTE_NONNUMERIC&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;csv_reader&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nb"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;input_file&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;csv_reader&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;csv&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reader&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_file&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;quoting&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;csv&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;QUOTE_NONNUMERIC&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;csv_reader&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;

&lt;span class="p"&gt;......&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;load_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"data/digit_data.csv"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"data/digit_target.csv"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;......&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
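
&lt;p&gt;To see why &lt;code&gt;csv.QUOTE_NONNUMERIC&lt;/code&gt; is used when reading, here is a minimal round-trip sketch that only relies on the standard library. The two-sample dataset and the helper names &lt;code&gt;write_rows&lt;/code&gt; and &lt;code&gt;load_rows&lt;/code&gt; are made up for illustration and are not part of the project. Unquoted fields come back as floats, so the labels are read as &lt;code&gt;0.0&lt;/code&gt; and &lt;code&gt;1.0&lt;/code&gt;.&lt;/p&gt;

```python
import csv
import os
import tempfile


def write_rows(path, rows):
    """Write rows without a header, similar to to_csv(header=False, index=False)."""
    with open(path, "w", newline="") as output_file:
        csv.writer(output_file).writerows(rows)


def load_rows(path):
    """Read a header-less numeric CSV; unquoted fields are converted to floats."""
    with open(path, newline="") as input_file:
        return list(csv.reader(input_file, quoting=csv.QUOTE_NONNUMERIC))


# Hypothetical two-sample dataset standing in for the digits data.
features = [[0.0, 5.0, 13.0], [0.0, 12.0, 16.0]]
targets = [[0], [1]]

with tempfile.TemporaryDirectory() as tmp_dir:
    X_path = os.path.join(tmp_dir, "digit_data.csv")
    y_path = os.path.join(tmp_dir, "digit_target.csv")
    write_rows(X_path, features)
    write_rows(y_path, targets)

    # Same shape of logic as load_data: a list of rows and a flat list of labels.
    X = load_rows(X_path)
    y = [row[0] for row in load_rows(y_path)]

print(X)
print(y)
```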



&lt;p&gt;Run &lt;code&gt;pipenv run python digit_recognizer/digit_recognizer.py&lt;/code&gt; to check whether everything works as expected. If so, commit these code changes to git.&lt;/p&gt;

&lt;p&gt;Next, add &lt;code&gt;data/&lt;/code&gt; to DVC.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;pipenv run dvc add data

100% Add|████████████████|1/1 &lt;span class="o"&gt;[&lt;/span&gt;00:00,  2.14file/s]

To track the changes with git, run:

git add data.dvc .gitignore
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;dvc add&lt;/code&gt; creates a &lt;code&gt;data.dvc&lt;/code&gt; file to track &lt;code&gt;data/&lt;/code&gt; and adds &lt;code&gt;data/&lt;/code&gt; to &lt;code&gt;.gitignore&lt;/code&gt;, so that &lt;code&gt;data/&lt;/code&gt; is tracked by DVC rather than by git.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Add DVC files into git track&lt;/span&gt;
git add .gitignore data.dvc

&lt;span class="c"&gt;# git commit&lt;/span&gt;
pipenv run cz commit
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In &lt;code&gt;data.dvc&lt;/code&gt;, we can see 2 files (&lt;code&gt;digit_data.csv&lt;/code&gt; and &lt;code&gt;digit_target.csv&lt;/code&gt;) are tracked.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;cat &lt;/span&gt;data.dvc

outs:
- md5: b8d81f4964ecb86739c79c833fb491f3.dir
  size: 494728
  nfiles: 2
  path: data
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Push the tracked data to the DVC remote.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;dvc push
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;See what's changed in our remote storage &lt;code&gt;../dvc_remote&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;tree ../dvc_remote

../dvc_remote
├── 02
│   └── b861b6dc8e08da6d66547860f69277
├── 8c
│   └── ba569595920d230ade453b150f372b
└── b8
    └── d81f4964ecb86739c79c833fb491f3.dir

3 directories, 3 files
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The md5 value of our tracked data is &lt;code&gt;b8d81f4964ecb86739c79c833fb491f3.dir&lt;/code&gt;. There's also a corresponding file in &lt;code&gt;../dvc_remote/b8/d81f4964ecb86739c79c833fb491f3.dir&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; ../dvc_remote/b8/d81f4964ecb86739c79c833fb491f3.dir

&lt;span class="o"&gt;[{&lt;/span&gt;&lt;span class="s2"&gt;"md5"&lt;/span&gt;: &lt;span class="s2"&gt;"02b861b6dc8e08da6d66547860f69277"&lt;/span&gt;, &lt;span class="s2"&gt;"relpath"&lt;/span&gt;: &lt;span class="s2"&gt;"digit_data.csv"&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;, &lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"md5"&lt;/span&gt;: &lt;span class="s2"&gt;"8cba569595920d230ade453b150f372b"&lt;/span&gt;, &lt;span class="s2"&gt;"relpath"&lt;/span&gt;: &lt;span class="s2"&gt;"digit_target.csv"&lt;/span&gt;&lt;span class="o"&gt;}]&lt;/span&gt;%
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This file indicates where the actual data sources are stored in &lt;code&gt;../dvc_remote&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;In conclusion, if we want to know how data is stored through DVC,&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;find the md5 value in the &lt;code&gt;*.dvc&lt;/code&gt; file in our project&lt;/li&gt;
&lt;li&gt;find the &lt;code&gt;.dir&lt;/code&gt; file whose path matches this md5 value in our remote storage&lt;/li&gt;
&lt;li&gt;use the md5 values listed in that file to find the actual data files in our remote storage&lt;/li&gt;
&lt;/ol&gt;
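
&lt;p&gt;As a rough sketch, the lookup can be written in a few lines of Python. It assumes the remote layout shown in this post (the first two characters of the md5 value are the directory name, the rest is the file name), which may differ in other DVC versions. The helper names &lt;code&gt;object_path&lt;/code&gt; and &lt;code&gt;resolve_tracked_files&lt;/code&gt; are hypothetical and not part of DVC.&lt;/p&gt;

```python
import json
import tempfile
from pathlib import Path


def object_path(remote_dir, md5):
    """Map an md5 value to its location: the first two characters are the directory."""
    return Path(remote_dir) / md5[:2] / md5[2:]


def resolve_tracked_files(remote_dir, dir_md5):
    """Read the .dir listing and return {relpath: path of the stored file}."""
    listing = json.loads(object_path(remote_dir, dir_md5).read_text())
    return {
        entry["relpath"]: object_path(remote_dir, entry["md5"])
        for entry in listing
    }


# Step 1: the md5 value found in data.dvc.
dir_md5 = "b8d81f4964ecb86739c79c833fb491f3.dir"
# Content of the .dir file, copied from the output shown in this post.
listing = [
    {"md5": "02b861b6dc8e08da6d66547860f69277", "relpath": "digit_data.csv"},
    {"md5": "8cba569595920d230ade453b150f372b", "relpath": "digit_target.csv"},
]

# Rebuild a tiny stand-in remote in a temporary directory.
with tempfile.TemporaryDirectory() as remote_dir:
    dir_file = object_path(remote_dir, dir_md5)
    dir_file.parent.mkdir(parents=True)
    dir_file.write_text(json.dumps(listing))

    # Steps 2 and 3: open the .dir file, then map each entry to its stored file.
    resolved = {
        relpath: path.relative_to(remote_dir).as_posix()
        for relpath, path in resolve_tracked_files(remote_dir, dir_md5).items()
    }

print(resolved)
```

&lt;p&gt;The resolved paths match the entries under &lt;code&gt;02&lt;/code&gt; and &lt;code&gt;8c&lt;/code&gt; in the &lt;code&gt;tree&lt;/code&gt; output of &lt;code&gt;../dvc_remote&lt;/code&gt;.&lt;/p&gt;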

&lt;p&gt;But most of the time, we don't need to do so. We can leave the tracking work to DVC.&lt;/p&gt;

&lt;h3&gt;
  
  
  Fetch data from DVC remote storage
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# temporary delete our data locally&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;rm&lt;/span&gt; &lt;span class="nt"&gt;-rf&lt;/span&gt; data

&lt;span class="c"&gt;# check whether DVC actually tracks our data&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;dvc status

data.dvc:
changed outs:
    deleted:            data

&lt;span class="c"&gt;# bring our data back from remote storage&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;dvc checkout data

data
├── digit_data.csv
└── digit_target.csv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Add data changes into DVC
&lt;/h3&gt;

&lt;p&gt;To demonstrate how DVC tracks data changes, let's remove the last 2 rows from &lt;code&gt;data/digit_data.csv&lt;/code&gt; and &lt;code&gt;data/digit_target.csv&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# check what's changed&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;dvc status

data.dvc:
changed outs:
    modified:           data

&lt;span class="c"&gt;# Add these changes to DVC and git&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;dvc add
&lt;span class="nv"&gt;$ &lt;/span&gt;git add data.dvc
&lt;span class="c"&gt;# git commit&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;pipenv run cz commit

&lt;span class="c"&gt;# Push these changes to our remote storage&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;dvc push
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The md5 value has changed, and the size of our data is now smaller than the previously recorded 494728.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;cat &lt;/span&gt;data.dvc

outs:
- md5: a333e114a49194e823ab9a4fa9e33ee9.dir
  size: 494172
  nfiles: 2
  path: data
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;More files are added to &lt;code&gt;../dvc_remote&lt;/code&gt; due to the data changes. You can follow the steps in the previous section to see what is actually stored.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;tree ../dvc_remote

../dvc_remote
├── 02
│   └── b861b6dc8e08da6d66547860f69277
├── 2a
│   └── 6cfa13365ac9b3af5146133aca6789
├── 8c
│   └── ba569595920d230ade453b150f372b
├── 94
│   └── 2481fce846fb9750b7b8023c80a5ef
├── a3
│   └── 33e114a49194e823ab9a4fa9e33ee9.dir
└── b8
    └── d81f4964ecb86739c79c833fb491f3.dir

6 directories, 6 files
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
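
&lt;p&gt;The old files stay next to the new ones because each stored file is named after an md5 value. The sketch below assumes that this value is derived from the file content, which is what the &lt;code&gt;md5&lt;/code&gt; fields above suggest. The five-row content and the helper &lt;code&gt;md5_of&lt;/code&gt; are made up for illustration, so the digests will not match the ones in this post.&lt;/p&gt;

```python
import hashlib


def md5_of(content):
    """Return the hex md5 digest of some text content."""
    return hashlib.md5(content.encode("utf-8")).hexdigest()


# Hypothetical five-row file standing in for data/digit_target.csv.
original = "0\n1\n2\n3\n4\n"
# Drop the last 2 rows, as we did to the real data.
trimmed = "".join(original.splitlines(keepends=True)[:-2])

original_md5 = md5_of(original)
trimmed_md5 = md5_of(trimmed)

# A different digest means a different object name in the remote,
# so the old version stays untouched next to the new one.
print(original_md5[:2], original_md5[2:])
print(trimmed_md5[:2], trimmed_md5[2:])
```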



&lt;p&gt;Let's &lt;code&gt;git checkout&lt;/code&gt; the previous git commit to see what happens if we only revert the changes in &lt;code&gt;data.dvc&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# or "git checkout v2-track-data"&lt;/span&gt;
git checkout HEAD~1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After running &lt;code&gt;wc -l data/digit_data.csv&lt;/code&gt;, we'll still find 1795 rows instead of the 1797 rows of the previous version. That's because &lt;code&gt;git checkout&lt;/code&gt; only reverts &lt;code&gt;data.dvc&lt;/code&gt;; we need to run &lt;code&gt;dvc checkout&lt;/code&gt; as well to update the data itself.&lt;/p&gt;

&lt;p&gt;This step is easy to forget. Thus, DVC provides a git hook that triggers &lt;code&gt;dvc checkout&lt;/code&gt; right after &lt;code&gt;git checkout&lt;/code&gt;. You can install these git hooks through &lt;code&gt;dvc install&lt;/code&gt;. The hooks are added into &lt;code&gt;.git/hooks&lt;/code&gt;. If you want to know the details of what's added, read &lt;a href="https://dvc.org/doc/command-reference/install"&gt;dvc install&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Test these steps again. There should be an additional line after running &lt;code&gt;git checkout&lt;/code&gt;. This is the output message of &lt;code&gt;dvc checkout&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;M       data/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Push our code to a remote git repository&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git remote add origin &amp;lt;REMOTE GIT REPO&amp;gt;
git push origin main
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Fetch code and data changes from remote
&lt;/h3&gt;

&lt;p&gt;We've already pushed all the code and data changes to remote. Let's see how we could reproduce them in another environment.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# check what's in our repo&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;dvc list &amp;lt;REMOTE GIT REPO&amp;gt;

.dvcignore
.github
.gitignore
LICENSE
Pipfile
Pipfile.lock
data
data.dvc
digit_recognizer
docs
mkdocs.yml
output
tasks.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Although git does not track &lt;code&gt;data/&lt;/code&gt;, we can still list it through DVC.&lt;/p&gt;

&lt;p&gt;Because we use the relative path &lt;code&gt;../dvc_remote&lt;/code&gt; as DVC remote storage, we need to create the new project at the same directory level as &lt;code&gt;dvc_example&lt;/code&gt;. We'll clone the project into &lt;code&gt;../dvc_example_on_another_machine&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;
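
&lt;p&gt;The sketch below shows why the sibling location matters. It is a simplification: it resolves the relative path against the project directory, which is not exactly how DVC reads its configuration, and the absolute paths and the helper &lt;code&gt;resolve_remote&lt;/code&gt; are hypothetical. Only a project sitting next to &lt;code&gt;dvc_example&lt;/code&gt; ends up pointing at the same storage.&lt;/p&gt;

```python
import posixpath


def resolve_remote(project_dir, remote_url):
    """Resolve a relative remote path against a project directory."""
    return posixpath.normpath(posixpath.join(project_dir, remote_url))


# Hypothetical absolute locations; only the relative layout matters.
original = resolve_remote("/work/dvc_example", "../dvc_remote")
sibling = resolve_remote("/work/dvc_example_on_another_machine", "../dvc_remote")
nested = resolve_remote("/work/tmp/dvc_example_copy", "../dvc_remote")

print(original)  # same storage as the original project
print(sibling)   # a sibling clone resolves to the same storage
print(nested)    # a clone placed elsewhere points at a different directory
```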

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Clone repo git repo&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;git clone &amp;lt;YOUR REMOTE GIT REPO&amp;gt; ../dvc_example_on_another_machine
&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; ../dvc_example_on_another_machine
&lt;span class="nv"&gt;$ &lt;/span&gt;tree &lt;span class="nb"&gt;.&lt;/span&gt;

&lt;span class="nb"&gt;.&lt;/span&gt;
├── LICENSE
├── Pipfile
├── Pipfile.lock
├── data.dvc
├── digit_recognizer
│   ├── __init__.py
│   └── digit_recognizer.py
├── docs
│   └── README.md
├── mkdocs.yml
├── output
└── tasks.py

3 directories, 9 files
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As you can see, &lt;code&gt;data/&lt;/code&gt; has not yet been added to the project. We can now pull data from our DVC remote storage.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# pull data from default DVC remote storage&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;dvc pull

A   data/
1 file added and 2 files fetched

&lt;span class="c"&gt;# `data` has now been added to the project&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;tree &lt;span class="nb"&gt;.&lt;/span&gt;

&lt;span class="nb"&gt;.&lt;/span&gt;
├── LICENSE
├── Pipfile
├── Pipfile.lock
├── data
│   ├── digit_data.csv
│   └── digit_target.csv
├── data.dvc
├── digit_recognizer
│   ├── __init__.py
│   └── digit_recognizer.py
├── docs
│   └── README.md
├── mkdocs.yml
├── output
└── tasks.py

4 directories, 11 files
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's all for data versioning in DVC. In the next post, we'll continue with versioning a data pipeline and tracking parameters and metrics. We won't need &lt;code&gt;dvc_example_on_another_machine&lt;/code&gt; for the following steps. Feel free to remove it and change directory back to &lt;code&gt;dvc_example&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Reference
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dvc.org/"&gt;DVC&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://stanford-cs329s.github.io/syllabus.html"&gt;CS 329S: Machine Learning Systems Design&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>datascience</category>
      <category>monitoring</category>
    </item>
  </channel>
</rss>
