DEV Community

Déborah Mesquita

Visualizing Machine Learning pipelines from the command line

Machine Learning projects sometimes look like building a brick wall: we define a stage and then build the other stages on top of it. Being able to visualize this "brick wall" is very useful, and today we'll learn how to do it using DVC.

What is DVC?

DVC is an open-source version control system for Machine Learning projects. Here we'll focus on the ML pipelines feature.

📍 Step 1: Installing DVC

You can install it as a Python package. DVC works best with Git, so we'll initialize a Git repository too:

$ python3 -m venv .env
$ source .env/bin/activate
(.env)$ pip3 install dvc
(.env)$ git init
(.env)$ dvc init

📍 Step 2: Creating the pipeline stages

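To create the pipeline stages we use the dvc run command. (In DVC 3.0 and later, dvc run is no longer available; dvc stage add accepts these same options, and dvc repro then runs the stages.) These are the main options: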

  • -n : specify a name for the stage generated by this command
  • -p : specify one or more parameters the stage depends on (read from params.yaml by default)
  • -d : specify a file or a directory the stage depends on
  • -o : specify a file or directory that is the result of running the command
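
To make the shape of the command concrete, here is a minimal sketch in Python that assembles a dvc run command line from a stage description. The build_dvc_run helper is hypothetical (it is not part of DVC); it only illustrates that -p, -d and -o can each be repeated and that the command to execute comes last.

```python
import shlex

def build_dvc_run(name, command, params=(), deps=(), outs=()):
    """Hypothetical helper: build the argument list for one `dvc run` call."""
    argv = ["dvc", "run", "-n", name]
    for param in params:
        argv += ["-p", param]   # one -p per parameter dependency
    for dep in deps:
        argv += ["-d", dep]     # one -d per file or directory dependency
    for out in outs:
        argv += ["-o", out]     # one -o per output
    return argv + list(command) # the stage command always goes last

prepare_argv = build_dvc_run(
    "prepare",
    ["python3", "src/prepare.py"],
    params=["prepare.categories"],
    deps=["src/prepare.py"],
    outs=["data/prepared"],
)
print(shlex.join(prepare_argv))
```

The printed line can be pasted into a shell, or the list can be handed to subprocess.run if you prefer to drive DVC from a script.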

We then create a Python script file for each pipeline stage and use the dvc run command with the above options to tell DVC how to run the pipeline. The parameters for each stage (-p option) can be specified in a YAML file.

# file params.yaml
prepare:
    categories:
        - comp.graphics
        - rec.sport.baseball
train:
    alpha: 0.9
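
A parameter dependency is written as a dotted name, such as prepare.categories or train.alpha, where each part names one level of nesting in the YAML file. The sketch below illustrates that lookup on a plain Python dict that mirrors params.yaml. The lookup function is a hypothetical helper for illustration only, not DVC's own code.

```python
from functools import reduce

# Same content as params.yaml, written as a dict so the example
# runs without a YAML parser.
params = {
    "prepare": {"categories": ["comp.graphics", "rec.sport.baseball"]},
    "train": {"alpha": 0.9},
}

def lookup(tree, dotted_name):
    """Hypothetical helper: follow one nested key per dot-separated part."""
    return reduce(lambda node, key: node[key], dotted_name.split("."), tree)

print(lookup(params, "prepare.categories"))
print(lookup(params, "train.alpha"))
```

Because each stage depends only on the keys it names, changing train.alpha should affect the train stage without invalidating prepare.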

Here is an example of a pipeline to:

  1. Gather data
  2. Generate the features
  3. Train a model
  4. Evaluate the model

The names of the pipeline stages are prepare, featurize, train and evaluate.

(.env)$ dvc run -n prepare -p prepare.categories \
    -d src/prepare.py -o data/prepared \
    python3 src/prepare.py
(.env)$ dvc run -n featurize \
    -d src/featurize.py -d data/prepared -o data/features \
    python3 src/featurize.py data/prepared data/features
(.env)$ dvc run -n train -p train.alpha \
    -d src/train.py -d data/features -o model.pkl \
    python3 src/train.py data/features model.pkl
(.env)$ dvc run -n evaluate \
    -d src/evaluate.py -d model.pkl -d data/features \
    --metrics-no-cache scores.json --plots-no-cache plots.json \
    python3 src/evaluate.py model.pkl data/features scores.json plots.json
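
The evaluate stage declares scores.json as a metrics file and plots.json as a plots file, so the script has to write both. Here is a minimal sketch of that last step, using only the standard library. The write_evaluation function, the "auc" and "prc" keys, and the numbers are all assumptions made for illustration; they are placeholders, not results from the real pipeline.

```python
import json
import os
import tempfile

def write_evaluation(scores_path, plots_path, auc, curve_points):
    """Hypothetical helper: write one metrics file and one plots file as JSON."""
    with open(scores_path, "w") as f:
        json.dump({"auc": auc}, f)  # a single named metric
    with open(plots_path, "w") as f:
        # one record per point on the curve
        records = [{"precision": p, "recall": r} for p, r in curve_points]
        json.dump({"prc": records}, f)

# Placeholder values, written to a temporary directory so the sketch
# does not touch a real project.
out_dir = tempfile.mkdtemp()
scores_file = os.path.join(out_dir, "scores.json")
plots_file = os.path.join(out_dir, "plots.json")
write_evaluation(scores_file, plots_file, 0.5, [(1.0, 0.0), (0.5, 1.0)])
```

In the real script the two paths would come from the command-line arguments that the dvc run call passes in (scores.json and plots.json).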

📍 Step 3: Visualizing the pipeline

We can run the dvc dag command anytime to visualize the pipeline. This is the pipeline after creating all the stages above:

(.env)$ dvc dag
         +---------+      
         | prepare |      
         +---------+      
              *           
              *           
              *           
        +-----------+     
        | featurize |     
        +-----------+     
         **        **     
       **            *    
      *               **  
+-------+               * 
| train |             **  
+-------+            *    
         **        **     
           **    **       
             *  *         
        +----------+      
        | evaluate |      
        +----------+  
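
The edges in this graph come from the -d and -o options: a stage sits downstream of another when one of its dependencies is that stage's output. The sketch below rebuilds the same graph from the four stage definitions using Python's graphlib (Python 3.9+). It illustrates the idea only and is not how DVC itself is implemented; the stages dict is a hand-written summary of the commands above.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Dependencies (-d) and outputs (-o, metrics, plots) of each stage.
stages = {
    "prepare": {"deps": ["src/prepare.py"], "outs": ["data/prepared"]},
    "featurize": {"deps": ["src/featurize.py", "data/prepared"],
                  "outs": ["data/features"]},
    "train": {"deps": ["src/train.py", "data/features"], "outs": ["model.pkl"]},
    "evaluate": {"deps": ["src/evaluate.py", "model.pkl", "data/features"],
                 "outs": ["scores.json", "plots.json"]},
}

# Map every output path to the stage that produces it.
producer = {out: name for name, stage in stages.items() for out in stage["outs"]}

# A stage's upstream stages are the producers of its dependencies.
upstream = {
    name: {producer[dep] for dep in stage["deps"] if dep in producer}
    for name, stage in stages.items()
}

order = list(TopologicalSorter(upstream).static_order())
print(order)  # ['prepare', 'featurize', 'train', 'evaluate']
```

Here evaluate ends up with two upstream stages, featurize and train, which matches the two lines that meet at the evaluate box in the diagram.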

And that's it!
You can see the whole code for this pipeline here.

If you want a more in-depth guide to building pipelines, check out this other article: The ultimate guide to building maintainable Machine Learning pipelines using DVC.

Thanks for reading!
