<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Sephi Berry</title>
    <description>The latest articles on DEV Community by Sephi Berry (@sephib).</description>
    <link>https://dev.to/sephib</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F427607%2F195bb67b-b142-4cac-a00e-d5278da27f5f.png</url>
      <title>DEV Community: Sephi Berry</title>
      <link>https://dev.to/sephib</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/sephib"/>
    <language>en</language>
    <item>
      <title>ML Configuration Management</title>
      <dc:creator>Sephi Berry</dc:creator>
      <pubDate>Thu, 28 Jul 2022 10:40:00 +0000</pubDate>
      <link>https://dev.to/artlist/ml-configuration-management-4hde</link>
      <guid>https://dev.to/artlist/ml-configuration-management-4hde</guid>
<description>&lt;p&gt;This post follows our blog describing our &lt;a href="https://dev.to/artlist/lessons-learned-on-the-road-to-mlops-22lj"&gt;ML Ops manifest&lt;/a&gt;. In this post we dive into the configuration management within our ML projects.&lt;/p&gt;

&lt;h2&gt;
  
  
  Background
&lt;/h2&gt;

&lt;p&gt;Working in a real-world business environment requires moving code between research/development, test and production environments, and doing so smoothly is crucial for development velocity. It is important to maintain a common language &amp;amp; standards between the various AI &amp;amp; Development teams for frictionless deployment of code. Additionally, configuration management helps because:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ML work involves many parameters, hyperparameters etc.&lt;/li&gt;
&lt;li&gt;we want to separate config from code (12 factor app)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These days, it comes as no surprise that there are several open-source configuration frameworks that can be utilized for this. After reviewing several options (including &lt;a href="https://hydra.cc/" rel="noopener noreferrer"&gt;Hydra&lt;/a&gt;), we decided on &lt;a href="https://www.dynaconf.com/" rel="noopener noreferrer"&gt;dynaconf&lt;/a&gt;, since it fulfilled our requirements of being:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Python based&lt;/li&gt;
&lt;li&gt;Simple&lt;/li&gt;
&lt;li&gt;Easily configurable and extendable&lt;/li&gt;
&lt;li&gt;Able to override and cascade settings&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbljmsryuxo7o4vyp6q2k.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbljmsryuxo7o4vyp6q2k.jpeg" alt="Moving from Dev to Prod"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Development Environment
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.artlsit.io" rel="noopener noreferrer"&gt;Artlist&lt;/a&gt; runs in multiple cloud environments, however currently most of the ML workloads run on GCP. Following GCP best practices, we have set up different projects for each environment, thus allowing for strict isolation between them, in addition to enabling billing segmentation. This separation needs to be easily propagated into the configuration, for seamless code execution.&lt;/p&gt;

&lt;p&gt;In this post we review our configuration in relation to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Basic implementation&lt;/li&gt;
&lt;li&gt;
Advanced Templating
&lt;/li&gt;
&lt;li&gt;Simple Overriding&lt;/li&gt;
&lt;li&gt;Project vs. Module settings&lt;/li&gt;
&lt;li&gt;Updating Configurations&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Now let’s see how &lt;code&gt;dynaconf&lt;/code&gt; can help out with this.  &lt;/p&gt;

&lt;h3&gt;
  
  
  Basic Implementation
&lt;/h3&gt;

&lt;p&gt;We decided to work with configuration settings that are stored in external &lt;code&gt;toml&lt;/code&gt; files, which are easily readable and are becoming one of the &lt;a href="https://peps.python.org/pep-0680/" rel="noopener noreferrer"&gt;de-facto standards&lt;/a&gt; in python.  &lt;/p&gt;

&lt;p&gt;A code snippet from our basic configuration file is as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="nn"&gt;[default]&lt;/span&gt;
&lt;span class="py"&gt;PROJECT_ID&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"@format artlist-{this.current_env}"&lt;/span&gt;
&lt;span class="py"&gt;BASE_NAME&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="err"&gt;“my_feature_name”&lt;/span&gt;
&lt;span class="py"&gt;BASE_PIPELINE_NAME_GCP&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"@jinja {this.BASE_NAME | replace('_', '-')}"&lt;/span&gt;
&lt;span class="py"&gt;BUCKET_NAME&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"@format {this.BASE_NAME}--{this.current_env}"&lt;/span&gt;

&lt;span class="nn"&gt;[dev]&lt;/span&gt;
&lt;span class="py"&gt;SERVICE_ACCOUNT&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"service-account1@artlist-dev.iam.gserviceaccount.com"&lt;/span&gt;

&lt;span class="nn"&gt;[tst]&lt;/span&gt;
&lt;span class="py"&gt;SERVICE_ACCOUNT&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"service-account2@artlist-tst.iam.gserviceaccount.com"&lt;/span&gt;

&lt;span class="nn"&gt;[prd]&lt;/span&gt;
&lt;span class="py"&gt;SERVICE_ACCOUNT&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"service-account3@artlist-prd.iam.gserviceaccount.com"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now let's break it down.&lt;br&gt;&lt;br&gt;
Whenever &lt;em&gt;dynaconf&lt;/em&gt; runs, it runs in a specific environment. The default environment is called DEVELOPMENT. However, since we wanted to move easily between the environments (and GCP resources), we changed the naming convention of the environments to three-letter acronyms (dev, tst, prd), so we can readily reference the relevant GCP project while specifying the environment.&lt;br&gt;&lt;br&gt;
Using the &lt;a href="https://www.dynaconf.com/configuration/#env_switcher" rel="noopener noreferrer"&gt;&lt;code&gt;env_switcher&lt;/code&gt;&lt;/a&gt;, we can indicate to &lt;em&gt;dynaconf&lt;/em&gt; which configuration to load and which GCP project to access with the following line: &lt;br&gt;
 &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;PROJECT_ID = "@format artlist-{this.current_env}"&lt;br&gt;&lt;br&gt;
&lt;/p&gt;


&lt;/blockquote&gt;

&lt;p&gt;Using &lt;code&gt;@format&lt;/code&gt; as a prefix to the string, we can interpolate the parameter that is within the curly brackets. For example, if the current environment is set to ‘dev’, the PROJECT_ID variable will be &lt;code&gt;artlist-dev&lt;/code&gt;, thus accessing only the resources from the &lt;code&gt;dev&lt;/code&gt; project, whereas if the environment is set to ‘prd’, the PROJECT_ID will be &lt;code&gt;artlist-prd&lt;/code&gt;.  &lt;/p&gt;

&lt;p&gt;Accessing the rest of the relevant variables is based on the &lt;a href="https://github.com/toml-lang/toml" rel="noopener noreferrer"&gt;various sections in the toml&lt;/a&gt; file.&lt;br&gt;&lt;br&gt;
For example, the &lt;em&gt;Production Service Account&lt;/em&gt; (SA) is referenced by accessing the &lt;code&gt;SERVICE_ACCOUNT&lt;/code&gt; variable under the [prd] section.&lt;/p&gt;
&lt;h3&gt;
  
  
Advanced Templating
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;Dynaconf&lt;/em&gt; includes the ability to work with &lt;a href="https://jinja.palletsprojects.com/" rel="noopener noreferrer"&gt;&lt;code&gt;Jinja&lt;/code&gt; templating&lt;/a&gt;, which can be useful for manipulating strings. GCP has a quirk that requires container names within the GCP registry to use ‘-’ (hyphen) rather than ‘_’ (underscore) as separators. Since we wanted to sync our registry with the artifacts coming out of &lt;a href="https://cloud.google.com/vertex-ai/docs/pipelines/introduction" rel="noopener noreferrer"&gt;&lt;em&gt;Vertex AI pipelines&lt;/em&gt;&lt;/a&gt; (which are stored within buckets / Cloud Storage), we were able to keep the python naming convention of ‘_’ while converting the strings to the GCP convention when required.&lt;br&gt;&lt;br&gt;
Using &lt;em&gt;jinja&lt;/em&gt;’s &lt;a href="https://jinja.palletsprojects.com/en/3.0.x/templates/?highlight=replace#jinja-filters.replace" rel="noopener noreferrer"&gt;text replace&lt;/a&gt; filter we can easily alter the text as necessary:&lt;/p&gt;



&lt;blockquote&gt;
&lt;p&gt;BASE_PIPELINE_NAME_GCP = "@jinja {this.BASE_NAME | replace('_', '-')}"&lt;br&gt;&lt;br&gt;
&lt;/p&gt;


&lt;/blockquote&gt;
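The same filter can be exercised directly with the jinja2 library, outside of dynaconf; a minimal sketch with a made-up feature name:

```python
from jinja2 import Template

# GCP registry names want '-', python identifiers use '_'.
base_name = "my_feature_name"
gcp_name = Template("{{ name | replace('_', '-') }}").render(name=base_name)
print(gcp_name)  # my-feature-name
```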

&lt;h3&gt;
  
  
  Simple Overriding
&lt;/h3&gt;

&lt;p&gt;Another useful feature of &lt;code&gt;dynaconf&lt;/code&gt; is that you can easily override the configuration using local settings. This is very convenient since local settings for development don’t need to be checked into source control, while the general settings should be synced to the entire team.&lt;br&gt;&lt;br&gt;
All that is required to differentiate between the settings is to add the &lt;code&gt;.local&lt;/code&gt; &lt;a href="https://docs.python.org/3/library/pathlib.html#pathlib.PurePath.suffix" rel="noopener noreferrer"&gt;suffix&lt;/a&gt; to the file name, as in the example below:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;General Settings - settings.toml
&lt;/li&gt;
&lt;li&gt;Local Settings - settings.local.toml
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Whenever &lt;code&gt;dynaconf&lt;/code&gt; identifies the suffix &lt;code&gt;.local.toml&lt;/code&gt;, it will override the values defined in &lt;code&gt;settings.toml&lt;/code&gt; with those loaded from the &lt;code&gt;settings.local.toml&lt;/code&gt; file.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fps21dh0nqxqdkgsqeyfi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fps21dh0nqxqdkgsqeyfi.png" alt="overriding example"&gt;&lt;/a&gt;&lt;br&gt;
An example of overriding with local credentials&lt;/p&gt;
&lt;h3&gt;
  
  
  Project vs. Module settings
&lt;/h3&gt;

&lt;p&gt;Our ML framework is &lt;a href="https://www.kubeflow.org/" rel="noopener noreferrer"&gt;KubeFlow&lt;/a&gt; (hosted by &lt;em&gt;GCP VertexAI pipelines&lt;/em&gt;), which requires various configurations: some at the component level (components are reused independently in various pipelines), while others are at the pipeline/project/cross-component level. To load both settings, we can use another feature of &lt;code&gt;dynaconf&lt;/code&gt;, which lets us define specific file names that will automatically be loaded. Here is our implementation practice:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Any configurations that are at the project level will be written in the project settings - &lt;code&gt;settings_prj.toml&lt;/code&gt; (see &lt;a href="https://www.dynaconf.com/configuration/#settings_file-or-settings_files" rel="noopener noreferrer"&gt;dynaconf settings_files&lt;/a&gt; configuration)&lt;/li&gt;
&lt;li&gt;Any configurations that are at the component level will be written in the component settings - &lt;code&gt;settings.toml&lt;/code&gt; .&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Updating Configurations
&lt;/h3&gt;

&lt;p&gt;Sometimes there is a need to update the configuration at runtime. This can be challenging since the entire configuration is loaded as soon as the library is called. To do so, we can use a decorator to update the configuration. Assuming &lt;code&gt;cfg&lt;/code&gt; is the configuration settings, we can write the following decorator:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@input_to_config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cfg&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;input_to_config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sequence_override&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
   &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;[decorator] override config parameters with function inputs
   Args:
       config: Dynaconf configuration / settings to be updated
       wrapped_func ([function]): [the function to capture it&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s input and push to the config]
       sequence_override: configures the option for overriding keys or merging the values
   &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

   &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;decorator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;wrapped_func&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
       &lt;span class="nd"&gt;@wraps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;wrapped_func&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
       &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;inner&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
           &lt;span class="nf"&gt;_override_config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sequence_override&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;sequence_override&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
           &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;wrapped_func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

       &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;inner&lt;/span&gt;

   &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;decorator&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
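Since `_override_config` is not shown above, here is a self-contained sketch of the pattern with a plain dict standing in for the dynaconf settings object (which exposes a similar `update` method); all names here are illustrative:

```python
from functools import wraps

cfg = {"base_name": "my_feature_name"}  # stand-in for the settings object


def _override_config(kwargs, config, sequence_override=True):
    # Push the wrapped function's keyword arguments into the config,
    # so any later code reading `config` sees the runtime values.
    config.update(kwargs)


def input_to_config(config, sequence_override=True):
    def decorator(wrapped_func):
        @wraps(wrapped_func)
        def inner(*args, **kwargs):
            _override_config(kwargs, config, sequence_override=sequence_override)
            return wrapped_func(*args, **kwargs)

        return inner

    return decorator


@input_to_config(cfg)
def run_step(base_name="my_feature_name"):
    # By the time the body runs, cfg already reflects the call's kwargs.
    return cfg["base_name"]


print(run_step(base_name="other_feature"))  # other_feature
print(cfg["base_name"])                     # other_feature
```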



&lt;h1&gt;
  
  
  Summary
&lt;/h1&gt;

&lt;p&gt;In this blog post, we have laid out our configuration implementation using the &lt;code&gt;dynaconf&lt;/code&gt; library. We saw how we:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Used the basic setup of &lt;code&gt;dynaconf&lt;/code&gt; configuration&lt;/li&gt;
&lt;li&gt;Synced our GCP project with &lt;code&gt;dynaconf&lt;/code&gt; environments&lt;/li&gt;
&lt;li&gt;Worked with the advanced &lt;code&gt;dynaconf&lt;/code&gt; settings&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In our next posts, we will extend our description of the various elements that have been incorporated into our ML project workflow while developing our internal base library, which includes standardization of:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Logging&lt;/li&gt;
&lt;li&gt;Accessing our secrets (using &lt;a href="https://cloud.google.com/secret-manager" rel="noopener noreferrer"&gt;GCP Secret Manager&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Experiment tracking (using &lt;a href="https://clear.ml/" rel="noopener noreferrer"&gt;ClearML&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;h6&gt;
  
  
Image copyright
&lt;/h6&gt;

&lt;p&gt;The banner image was co-created using &lt;a href="https://openai.com/dall-e-2/" rel="noopener noreferrer"&gt;DALL·E 2&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>mlops</category>
      <category>dynaconf</category>
      <category>configuration</category>
      <category>ai</category>
    </item>
    <item>
      <title>Design patterns in ML</title>
      <dc:creator>Sephi Berry</dc:creator>
      <pubDate>Wed, 27 Jan 2021 07:31:34 +0000</pubDate>
      <link>https://dev.to/sephib/design-patterns-in-ml-2f8c</link>
      <guid>https://dev.to/sephib/design-patterns-in-ml-2f8c</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--XPCrCrov--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://images-na.ssl-images-amazon.com/images/I/51pSVhMRMkL._SX379_BO1%2C204%2C203%2C200_.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--XPCrCrov--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://images-na.ssl-images-amazon.com/images/I/51pSVhMRMkL._SX379_BO1%2C204%2C203%2C200_.jpg" alt="ML Design Patterns" href="https://www.oreilly.com/library/view/machine-learning-design/9781098115777/" width="381" height="499"&gt;&lt;/a&gt;  &lt;/p&gt;

&lt;p&gt;Listening to &lt;a href="https://www.linkedin.com/in/sara-robinson-40377924/"&gt;Sara Robinson&lt;/a&gt;, co-author of the "Machine Learning Design Patterns" book, on the &lt;a href="https://podcastaddict.com/episode/117142559"&gt;MLOps.community podcast&lt;/a&gt; raised some issues relevant to our growing ML/AI industry in general and to the Data Engineering realm specifically. Although I have not yet read the book, I'm allowing myself to reflect on what I understood from the episode. There are many issues discussed in the book; fortunately, they decided to speak about &lt;em&gt;Workflow Pipelines&lt;/em&gt; (chapter 25), which is dear to my heart, since I believe it is a key element of successful ML projects.&lt;/p&gt;

&lt;p&gt;As an industry, we are still evolving and best practices are still emerging. That said, there are many simple solutions and practices from project management and software development that can easily put an ML project on the right track. Identifying the business values that are currently most relevant is a key component in understanding the various trade-offs in the engineering processes.&lt;/p&gt;

&lt;p&gt;We too enjoy the flexibility of jupyter notebooks, but I disagree with what Sara said about when to transition from a jupyter notebook to a more structured pipeline. Working methodically with templates and clear inputs and outputs for each notebook should be implemented from day one. Breaking up notebooks into one per step and writing down the logical stages in a markdown file is a key component for saving time and for successful collaboration with any team member. This is true even for yourself - there is nothing better than returning to a project after a weekend and getting up and running within a few minutes.  &lt;/p&gt;

&lt;p&gt;Reproducibility is another key component in any ML project. &lt;a href="https://mlflow.org"&gt;MLflow&lt;/a&gt; is our framework of choice for tracking our experiments, and it is mentioned in the book as a tool for creating pipelines. However, presenting &lt;code&gt;MLflow&lt;/code&gt; and &lt;code&gt;Airflow&lt;/code&gt; as the same solution for a &lt;code&gt;Workflow Pipeline&lt;/code&gt; (page 284) doesn't seem right.&lt;br&gt;&lt;br&gt;
In the book they state that the following stages make up an ML pipeline:  &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Data Collection
&lt;/li&gt;
&lt;li&gt;Data Validation
&lt;/li&gt;
&lt;li&gt;Data Processing
&lt;/li&gt;
&lt;li&gt;Model Building
&lt;/li&gt;
&lt;li&gt;Training &amp;amp; Evaluating
&lt;/li&gt;
&lt;li&gt;Model Deployment
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;While &lt;code&gt;MLflow&lt;/code&gt; may be best suited for steps 4-6, &lt;code&gt;Airflow&lt;/code&gt; is probably best suited for steps 1-3. &lt;br&gt;
Here I think it is worthwhile pointing out that there are considerable differences between:  &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Workflow Pipeline (e.g. as described in this chapter - containerizing is the key issue here)
&lt;/li&gt;
&lt;li&gt;Data Pipeline (e.g Airflow or &lt;a href="https://dagster.io/"&gt;Dagster&lt;/a&gt;)
&lt;/li&gt;
&lt;li&gt;Model Pipeline (e.g. &lt;a href="https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html"&gt;scikit-learn pipeline&lt;/a&gt;)
&lt;/li&gt;
&lt;li&gt;ML Tracking Pipeline (e.g. experiment tracking - &lt;a href="https://mlflow.org/docs/latest/tracking.html"&gt;MLflow Tracking&lt;/a&gt; )
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;(A full write-up will come in a future blog.)&lt;br&gt;&lt;br&gt;
I think there are considerable differences between these pipelines, and lumping them together is confusing. Additional information about the complex landscape can be found in the blog post &lt;a href="https://a16z.com/2020/10/15/the-emerging-architectures-for-modern-data-infrastructure/"&gt;Emerging Architectures for Modern Data Infrastructure&lt;/a&gt; by Matt Bornstein, Martin Casado, and Jennifer Li.&lt;/p&gt;

&lt;p&gt;Finally, I totally agree with the excitement conveyed by the participants: the MLOps field is still growing in many directions, and part of the attraction of the field is that we are able to experiment with different methodologies while learning new libraries and designs as the industry matures. &lt;/p&gt;

</description>
    </item>
    <item>
      <title>NaNs bites</title>
      <dc:creator>Sephi Berry</dc:creator>
      <pubDate>Tue, 29 Dec 2020 11:18:17 +0000</pubDate>
      <link>https://dev.to/sephib/nans-bites-17kk</link>
      <guid>https://dev.to/sephib/nans-bites-17kk</guid>
      <description>&lt;p&gt;This post is co-authored with &lt;a href="https://www.linkedin.com/in/davidkatz-il/"&gt;David Katz&lt;/a&gt;  &lt;/p&gt;

&lt;h1&gt;
  
  
  Background
&lt;/h1&gt;

&lt;p&gt;While working on a health-related classification project, our team encountered a very large &lt;a href="https://en.wikipedia.org/wiki/Sparse_matrix"&gt;sparse matrix&lt;/a&gt;, due to the vast amount of health/lab tests that were available. After a meeting with the business domain experts, we understood that our initial data preprocessing for removing missing data (NaN) was faulty.&lt;br&gt;&lt;br&gt;
In this post, we would like to share the pitfall that we experienced and our process for identifying features with missing values in classification problems.  &lt;/p&gt;
&lt;h1&gt;
  
  
The Pitfall
&lt;/h1&gt;

&lt;p&gt;We had over 2K features across 40K patients. We knew that most of the features had a significant amount of &lt;code&gt;NaN&lt;/code&gt;s, so we used the common methods - such as &lt;a href="https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.VarianceThreshold.html"&gt;scikit-learn's VarianceThreshold&lt;/a&gt; and &lt;a href="http://topepo.github.io/caret/pre-processing.html#zero--and-near-zero-variance-predictors"&gt;caret's near zero variance&lt;/a&gt; functions - to remove features with many &lt;code&gt;NaN&lt;/code&gt; values.&lt;br&gt;&lt;br&gt;
We were left with fewer than 70 features and ran our base model to see if our classifier could predict better than random. After displaying the &lt;a href="https://catboost.ai/docs/concepts/python-reference_catboostclassifier_get_feature_importance.html#python-reference_catboostclassifier_get_feature_importance"&gt;feature importance&lt;/a&gt; from our &lt;code&gt;CatBoost&lt;/code&gt; model, some concerns were raised regarding some of the features.&lt;br&gt;&lt;br&gt;
So we went back and did some homework...  &lt;/p&gt;

&lt;p&gt;While re-analyzing the features that were left, we saw that although they had passed our initial &lt;code&gt;NaN&lt;/code&gt; tests, we had not checked the distribution of the NaNs across our classes, i.e. some features had a significant amount of NaNs concentrated in a specific class rather than evenly distributed across our group classifications.&lt;/p&gt;
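This per-class check can be sketched with pandas on a toy frame (the column names are made up):

```python
import numpy as np
import pandas as pd

# Toy data: "lab_x" is mostly missing for one class -- a pattern that an
# overall per-feature NaN count would not reveal.
df = pd.DataFrame(
    {
        "outcome": ["lived"] * 4 + ["died"] * 4,
        "lab_x": [1.0, np.nan, 2.0, 1.5, np.nan, np.nan, np.nan, 3.0],
        "lab_y": [5.0, 6.0, np.nan, 7.0, 8.0, np.nan, 9.0, 10.0],
    }
)

# Fraction of missing values per feature, per class.
nan_by_class = df.drop(columns="outcome").isna().groupby(df["outcome"]).mean()
print(nan_by_class)
```

Here `lab_x` is 75% missing for the "died" class but only 25% missing for "lived", even though its overall missingness looks moderate.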
&lt;h1&gt;
  
  
  Data sample
&lt;/h1&gt;

&lt;p&gt;To demonstrate this process let's look at an example dataset  - the &lt;a href="https://archive.ics.uci.edu/ml/datasets/Horse+Colic"&gt;Horse Colic&lt;/a&gt; dataset.&lt;br&gt;&lt;br&gt;
This dataset includes the outcome/survival of horses diagnosed with colic disease based upon their past medical histories.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;itertools&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;combinations&lt;/span&gt;

&lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;display&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;float_format&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"{:,.2f}"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;format&lt;/span&gt;

&lt;span class="n"&gt;names&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"surgery,Age,Hospital Number,rectal temperature,pulse,respiratory rate,temperature of extremities,peripheral pulse,mucous membranes,capillary refill time,pain,peristalsis,abdominal distension,nasogastric tube,nasogastric reflux,nasogastric reflux PH,rectal examination,abdomen,packed cell volume,total protein,abdominocentesis appearance,abdomcentesis total protein,outcome,surgical lesion,type of lesion1,type of lesion2,type of lesion3,cp_data"&lt;/span&gt;
&lt;span class="n"&gt;names&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;names&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;" "&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"_"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;","&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;file_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s"&gt;"https://raw.githubusercontent.com/jbrownlee/Datasets/master/horse-colic.csv"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;names&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;names&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;label_col&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"outcome"&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;preprocess_df&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"?"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nan&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"nan"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nan&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"outcome"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;isna&lt;/span&gt;&lt;span class="p"&gt;())].&lt;/span&gt;&lt;span class="n"&gt;copy&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# clean up label column
&lt;/span&gt;    &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"ID"&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;label_col&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;label_col&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="s"&gt;"1"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"lived"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"2"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"died"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"3"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"euthanized"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;


&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;preprocess_df&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;"df.shape: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;df.shape: (299, 28)&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;surgery&lt;/th&gt;
      &lt;th&gt;Age&lt;/th&gt;
      &lt;th&gt;Hospital_Number&lt;/th&gt;
      &lt;th&gt;rectal_temperature&lt;/th&gt;
      &lt;th&gt;pulse&lt;/th&gt;
      &lt;th&gt;respiratory_rate&lt;/th&gt;
      &lt;th&gt;temperature_of_extremities&lt;/th&gt;
      &lt;th&gt;peripheral_pulse&lt;/th&gt;
      &lt;th&gt;mucous_membranes&lt;/th&gt;
      &lt;th&gt;capillary_refill_time&lt;/th&gt;
      &lt;th&gt;...&lt;/th&gt;
      &lt;th&gt;packed_cell_volume&lt;/th&gt;
      &lt;th&gt;total_protein&lt;/th&gt;
      &lt;th&gt;abdominocentesis_appearance&lt;/th&gt;
      &lt;th&gt;abdomcentesis_total_protein&lt;/th&gt;
      &lt;th&gt;outcome&lt;/th&gt;
      &lt;th&gt;surgical_lesion&lt;/th&gt;
      &lt;th&gt;type_of_lesion1&lt;/th&gt;
      &lt;th&gt;type_of_lesion2&lt;/th&gt;
      &lt;th&gt;type_of_lesion3&lt;/th&gt;
      &lt;th&gt;cp_data&lt;/th&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;ID&lt;/th&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;th&gt;0&lt;/th&gt;
      &lt;td&gt;2&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;530101&lt;/td&gt;
      &lt;td&gt;38.50&lt;/td&gt;
      &lt;td&gt;66&lt;/td&gt;
      &lt;td&gt;28&lt;/td&gt;
      &lt;td&gt;3&lt;/td&gt;
      &lt;td&gt;3&lt;/td&gt;
      &lt;td&gt;NaN&lt;/td&gt;
      &lt;td&gt;2&lt;/td&gt;
      &lt;td&gt;...&lt;/td&gt;
      &lt;td&gt;45.00&lt;/td&gt;
      &lt;td&gt;8.40&lt;/td&gt;
      &lt;td&gt;NaN&lt;/td&gt;
      &lt;td&gt;NaN&lt;/td&gt;
      &lt;td&gt;died&lt;/td&gt;
      &lt;td&gt;2&lt;/td&gt;
      &lt;td&gt;11300&lt;/td&gt;
      &lt;td&gt;0&lt;/td&gt;
      &lt;td&gt;0&lt;/td&gt;
      &lt;td&gt;2&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;1&lt;/th&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;534817&lt;/td&gt;
      &lt;td&gt;39.2&lt;/td&gt;
      &lt;td&gt;88&lt;/td&gt;
      &lt;td&gt;20&lt;/td&gt;
      &lt;td&gt;NaN&lt;/td&gt;
      &lt;td&gt;NaN&lt;/td&gt;
      &lt;td&gt;4&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;...&lt;/td&gt;
      &lt;td&gt;50&lt;/td&gt;
      &lt;td&gt;85&lt;/td&gt;
      &lt;td&gt;2&lt;/td&gt;
      &lt;td&gt;2&lt;/td&gt;
      &lt;td&gt;euthanized&lt;/td&gt;
      &lt;td&gt;2&lt;/td&gt;
      &lt;td&gt;2208&lt;/td&gt;
      &lt;td&gt;0&lt;/td&gt;
      &lt;td&gt;0&lt;/td&gt;
      &lt;td&gt;2&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;2 rows × 28 columns&lt;/p&gt;

&lt;p&gt;First we will check how many features have &lt;code&gt;NaN&lt;/code&gt; values.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# There are 28 features in this dataset
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;number_of_features_with_NaN&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;_s_na&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isna&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nb"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_s_na&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;_s_na&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;


&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;"The number features with NaN values are &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;number_of_features_with_NaN&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;The number features with NaN values are 19
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
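&lt;p&gt;As a side note, the same count can be computed with a one-liner; this is just an equivalent pandas idiom, not part of the original workflow (the toy frame below is illustrative):&lt;/p&gt;

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the horse-colic dataset
df = pd.DataFrame({"a": [1.0, np.nan], "b": [1, 2], "c": [np.nan, np.nan]})

# A column contains NaNs iff isna().any() is True for it
n_features_with_nan = int(df.isna().any().sum())
print(n_features_with_nan)  # 2
```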
&lt;h2&gt;
  
  
  Near Zero Variance
&lt;/h2&gt;

&lt;p&gt;Simulating our initial workflow, we will first drop low-information features using our Python implementation of the &lt;a href=""&gt;near zero variance&lt;/a&gt; function from R's caret library (with its default values).&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;near_zero_variance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cols&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;frq_cut&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;95&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;unique_cut&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;cols&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;cols&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;
    &lt;span class="n"&gt;drop_cols&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;cols&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;val_count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;value_counts&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dropna&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;normalize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;to_dict&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;val_count&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;drop_cols&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;continue&lt;/span&gt;
        &lt;span class="n"&gt;lunique&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;val_count&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;percent_unique&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;lunique&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="n"&gt;freq_ratio&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;val_count&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;val_count&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mf"&gt;1e-5&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;freq_ratio&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;frq_cut&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;percent_unique&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="n"&gt;unique_cut&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;drop_cols&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;drop_cols&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;


&lt;span class="n"&gt;df_nzr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;near_zero_variance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;"After processeing the dataset via `near_zero_variance` we are left with &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;number_of_features_with_NaN&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df_nzr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; features with NaN values.&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;After processeing the dataset via `near_zero_variance` we are left with 18 features with NaN values.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Deeper NaN  Analysis
&lt;/h2&gt;

&lt;p&gt;Since we are interested in understanding the missing values in the dataset, we can look at how many &lt;code&gt;NaN&lt;/code&gt; values each feature contains.&lt;br&gt;
Let's plot the number of remaining features against the percentage of &lt;code&gt;NaN&lt;/code&gt;s in each feature.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;plot_percent_nan_in_features&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;nan&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isna&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nb"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;feature_nan_threshold&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nb"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nan&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;nan&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;arange&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;1.01&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.05&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;from_dict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;feature_nan_threshold&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;orient&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"index"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"num_of_features"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;kind&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"bar"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;ylabel&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"Number of Features"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;xlabel&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"Percentage of NaNs in feature"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;grid&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="n"&gt;plot_percent_nan_in_features&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df_nzr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--7TiEpLQX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/58zwqpi41aoh8cvvsi1e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--7TiEpLQX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/58zwqpi41aoh8cvvsi1e.png" alt="plot_percent_nan_in_features" width="606" height="329"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From the above plot, we can see that two features have more than 40% &lt;code&gt;NaN&lt;/code&gt;s and six features have more than 25%.  &lt;/p&gt;

&lt;p&gt;Some of the features have a very high proportion of &lt;code&gt;NaN&lt;/code&gt; values.  &lt;/p&gt;

&lt;p&gt;Let's now remove these problematic features.&lt;br&gt;
For this example we will set a threshold of 35%.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;threshold_max_na&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.35&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;drop_features_above_threshold_max_na&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;threshold_max_na&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;threshold_max_na&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;nan&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isna&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nb"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;nan_threshold&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nan&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;nan&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;threshold_max_na&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;drop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;nan_threshold&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;copy&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;


&lt;span class="n"&gt;df_nzr_threshold&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;drop_features_above_threshold_max_na&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df_nzr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;"After drop_features_above_threshold_max_na the number of features with NaNs that we are left with are : &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;number_of_features_with_NaN&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df_nzr_threshold&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;After drop_features_above_threshold_max_na the number of features with NaNs that we are left with are : 14
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;We may assume that we have removed the problematic features and can now try to impute the remaining NaN values and run our pipeline/model.  &lt;/p&gt;
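&lt;p&gt;The post does not prescribe an imputation strategy, but as a rough sketch (the function name and the toy frame below are illustrative, not from the original code), a simple median/mode fill could look like this:&lt;/p&gt;

```python
import numpy as np
import pandas as pd

def impute_simple(df):
    """Median-impute numeric columns, mode-impute the rest (illustrative only)."""
    df = df.copy()
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            df[col] = df[col].fillna(df[col].median())
        else:
            df[col] = df[col].fillna(df[col].mode().iloc[0])
    return df

# Tiny stand-in for the post's df_nzr_threshold
toy = pd.DataFrame({"pulse": [66.0, np.nan, 88.0], "surgery": ["1", None, "2"]})
imputed = impute_simple(toy)
print(imputed.isna().sum().sum())  # 0
```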

&lt;p&gt;But before we do so, let's take a closer look at our classification labels. &lt;/p&gt;

&lt;h2&gt;
  
  
  NaN Distribution Among the Classifier Labels
&lt;/h2&gt;

&lt;p&gt;Looking at the classifier label feature &lt;code&gt;outcome&lt;/code&gt;, we can see the following distribution:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;create_value_counts_df&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s2&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Series&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;check_na&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;s1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;value_counts&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dropna&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;check_na&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;s1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isna&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nb"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;s2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isna&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nb"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;s2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;empty&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;s2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;value_counts&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;normalize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dropna&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;concat&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;s1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s2&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"num"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"percent"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sort_values&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;by&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"percent"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ascending&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="n"&gt;df_labels_counts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;create_value_counts_df&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;label_col&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df_labels_counts&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;num&lt;/th&gt;
      &lt;th&gt;percent&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;th&gt;lived&lt;/th&gt;
      &lt;td&gt;178&lt;/td&gt;
      &lt;td&gt;0.60&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;died&lt;/th&gt;
      &lt;td&gt;77&lt;/td&gt;
      &lt;td&gt;0.26&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;euthanized&lt;/th&gt;
      &lt;td&gt;44&lt;/td&gt;
      &lt;td&gt;0.15&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;We can see that the distribution of the classes is uneven.&lt;br&gt;&lt;br&gt;
Our class distribution is approximately 60%, 26%, and 15% for the lived, died, and euthanized classes, respectively.   &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;But how are our NaNs distributed&lt;/strong&gt;?     &lt;/p&gt;

&lt;p&gt;What is the distribution of &lt;code&gt;NaN&lt;/code&gt;s in each feature in relation to our classification field?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_na_per_label&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;label_col&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;df_labels_counts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;cols&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="n"&gt;label_col&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;sum_na&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;isna&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nb"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()]&lt;/span&gt;
        &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;label_col&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;isna&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nb"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;df_labels_counts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tolist&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;cols&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;df_sum_na&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;_sum_na"&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"all"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;df_labels_counts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tolist&lt;/span&gt;&lt;span class="p"&gt;()],&lt;/span&gt;
        &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;sum_na&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;cols&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;df_sum_na&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"all_percentage_na"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isna&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nb"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;df_sum_na&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;_percentage_na"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;df_sum_na&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;_sum_na"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;df_labels_counts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"num"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;df_sum_na&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_na_cols&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;_s_na&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isna&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nb"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;na_cols&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_s_na&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;_s_na&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;
    &lt;span class="n"&gt;na_cols&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;na_cols&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;label_col&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;na_cols&lt;/span&gt;

&lt;span class="n"&gt;na_cols&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;get_na_cols&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df_nzr_threshold&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df_sum_na&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;get_na_per_label&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df_nzr_threshold&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;na_cols&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;label_col&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df_sum_na&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;all_sum_na&lt;/th&gt;
      &lt;th&gt;lived_sum_na&lt;/th&gt;
      &lt;th&gt;died_sum_na&lt;/th&gt;
      &lt;th&gt;euthanized_sum_na&lt;/th&gt;
      &lt;th&gt;all_percentage_na&lt;/th&gt;
      &lt;th&gt;lived_percentage_na&lt;/th&gt;
      &lt;th&gt;died_percentage_na&lt;/th&gt;
      &lt;th&gt;euthanized_percentage_na&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;th&gt;rectal_temperature&lt;/th&gt;
      &lt;td&gt;60&lt;/td&gt;
      &lt;td&gt;26&lt;/td&gt;
      &lt;td&gt;24&lt;/td&gt;
      &lt;td&gt;10&lt;/td&gt;
      &lt;td&gt;0.20&lt;/td&gt;
      &lt;td&gt;0.15&lt;/td&gt;
      &lt;td&gt;0.31&lt;/td&gt;
      &lt;td&gt;0.23&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;pulse&lt;/th&gt;
      &lt;td&gt;24&lt;/td&gt;
      &lt;td&gt;12&lt;/td&gt;
      &lt;td&gt;11&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;0.08&lt;/td&gt;
      &lt;td&gt;0.07&lt;/td&gt;
      &lt;td&gt;0.14&lt;/td&gt;
      &lt;td&gt;0.02&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;respiratory_rate&lt;/th&gt;
      &lt;td&gt;58&lt;/td&gt;
      &lt;td&gt;31&lt;/td&gt;
      &lt;td&gt;19&lt;/td&gt;
      &lt;td&gt;8&lt;/td&gt;
      &lt;td&gt;0.19&lt;/td&gt;
      &lt;td&gt;0.17&lt;/td&gt;
      &lt;td&gt;0.25&lt;/td&gt;
      &lt;td&gt;0.18&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;temperature_of_extremities&lt;/th&gt;
      &lt;td&gt;56&lt;/td&gt;
      &lt;td&gt;32&lt;/td&gt;
      &lt;td&gt;13&lt;/td&gt;
      &lt;td&gt;11&lt;/td&gt;
      &lt;td&gt;0.19&lt;/td&gt;
      &lt;td&gt;0.18&lt;/td&gt;
      &lt;td&gt;0.17&lt;/td&gt;
      &lt;td&gt;0.25&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;peripheral_pulse&lt;/th&gt;
      &lt;td&gt;69&lt;/td&gt;
      &lt;td&gt;39&lt;/td&gt;
      &lt;td&gt;18&lt;/td&gt;
      &lt;td&gt;12&lt;/td&gt;
      &lt;td&gt;0.23&lt;/td&gt;
      &lt;td&gt;0.22&lt;/td&gt;
      &lt;td&gt;0.23&lt;/td&gt;
      &lt;td&gt;0.27&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;mucous_membranes&lt;/th&gt;
      &lt;td&gt;47&lt;/td&gt;
      &lt;td&gt;28&lt;/td&gt;
      &lt;td&gt;11&lt;/td&gt;
      &lt;td&gt;8&lt;/td&gt;
      &lt;td&gt;0.16&lt;/td&gt;
      &lt;td&gt;0.16&lt;/td&gt;
      &lt;td&gt;0.14&lt;/td&gt;
      &lt;td&gt;0.18&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;capillary_refill_time&lt;/th&gt;
      &lt;td&gt;32&lt;/td&gt;
      &lt;td&gt;19&lt;/td&gt;
      &lt;td&gt;10&lt;/td&gt;
      &lt;td&gt;3&lt;/td&gt;
      &lt;td&gt;0.11&lt;/td&gt;
      &lt;td&gt;0.11&lt;/td&gt;
      &lt;td&gt;0.13&lt;/td&gt;
      &lt;td&gt;0.07&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;pain&lt;/th&gt;
      &lt;td&gt;55&lt;/td&gt;
      &lt;td&gt;34&lt;/td&gt;
      &lt;td&gt;12&lt;/td&gt;
      &lt;td&gt;9&lt;/td&gt;
      &lt;td&gt;0.18&lt;/td&gt;
      &lt;td&gt;0.19&lt;/td&gt;
      &lt;td&gt;0.16&lt;/td&gt;
      &lt;td&gt;0.20&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;peristalsis&lt;/th&gt;
      &lt;td&gt;44&lt;/td&gt;
      &lt;td&gt;22&lt;/td&gt;
      &lt;td&gt;15&lt;/td&gt;
      &lt;td&gt;7&lt;/td&gt;
      &lt;td&gt;0.15&lt;/td&gt;
      &lt;td&gt;0.12&lt;/td&gt;
      &lt;td&gt;0.19&lt;/td&gt;
      &lt;td&gt;0.16&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;abdominal_distension&lt;/th&gt;
      &lt;td&gt;56&lt;/td&gt;
      &lt;td&gt;31&lt;/td&gt;
      &lt;td&gt;14&lt;/td&gt;
      &lt;td&gt;11&lt;/td&gt;
      &lt;td&gt;0.19&lt;/td&gt;
      &lt;td&gt;0.17&lt;/td&gt;
      &lt;td&gt;0.18&lt;/td&gt;
      &lt;td&gt;0.25&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;nasogastric_tube&lt;/th&gt;
      &lt;td&gt;104&lt;/td&gt;
      &lt;td&gt;62&lt;/td&gt;
      &lt;td&gt;25&lt;/td&gt;
      &lt;td&gt;17&lt;/td&gt;
      &lt;td&gt;0.35&lt;/td&gt;
      &lt;td&gt;0.35&lt;/td&gt;
      &lt;td&gt;0.32&lt;/td&gt;
      &lt;td&gt;0.39&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;rectal_examination&lt;/th&gt;
      &lt;td&gt;102&lt;/td&gt;
      &lt;td&gt;56&lt;/td&gt;
      &lt;td&gt;26&lt;/td&gt;
      &lt;td&gt;20&lt;/td&gt;
      &lt;td&gt;0.34&lt;/td&gt;
      &lt;td&gt;0.31&lt;/td&gt;
      &lt;td&gt;0.34&lt;/td&gt;
      &lt;td&gt;0.45&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;packed_cell_volume&lt;/th&gt;
      &lt;td&gt;29&lt;/td&gt;
      &lt;td&gt;13&lt;/td&gt;
      &lt;td&gt;8&lt;/td&gt;
      &lt;td&gt;8&lt;/td&gt;
      &lt;td&gt;0.10&lt;/td&gt;
      &lt;td&gt;0.07&lt;/td&gt;
      &lt;td&gt;0.10&lt;/td&gt;
      &lt;td&gt;0.18&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;total_protein&lt;/th&gt;
      &lt;td&gt;33&lt;/td&gt;
      &lt;td&gt;13&lt;/td&gt;
      &lt;td&gt;12&lt;/td&gt;
      &lt;td&gt;8&lt;/td&gt;
      &lt;td&gt;0.11&lt;/td&gt;
      &lt;td&gt;0.07&lt;/td&gt;
      &lt;td&gt;0.16&lt;/td&gt;
      &lt;td&gt;0.18&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;We can see that the distribution of the &lt;code&gt;NaN&lt;/code&gt;s across the classification field is not even, &lt;em&gt;e.g.&lt;/em&gt; the &lt;code&gt;rectal_temperature&lt;/code&gt; feature has more than twice as many &lt;code&gt;NaN&lt;/code&gt;s in the &lt;code&gt;died&lt;/code&gt; &amp;amp; &lt;code&gt;lived&lt;/code&gt; classes as in the &lt;code&gt;euthanized&lt;/code&gt; class.  &lt;/p&gt;

&lt;p&gt;Assume that we don't want to remove any feature with less than 15% &lt;code&gt;NaN&lt;/code&gt;s, regardless of how the &lt;code&gt;NaN&lt;/code&gt;s are distributed across the classification field. We can filter the summary table accordingly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;threshold_min_na&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.15&lt;/span&gt;
&lt;span class="n"&gt;mask_threshold_min_na&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df_sum_na&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"all_percentage_na"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;threshold_min_na&lt;/span&gt;
&lt;span class="n"&gt;df_sum_min_na&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df_sum_na&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;mask_threshold_min_na&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;copy&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;df_sum_min_na&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;all_sum_na&lt;/th&gt;
      &lt;th&gt;lived_sum_na&lt;/th&gt;
      &lt;th&gt;died_sum_na&lt;/th&gt;
      &lt;th&gt;euthanized_sum_na&lt;/th&gt;
      &lt;th&gt;all_percentage_na&lt;/th&gt;
      &lt;th&gt;lived_percentage_na&lt;/th&gt;
      &lt;th&gt;died_percentage_na&lt;/th&gt;
      &lt;th&gt;euthanized_percentage_na&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;th&gt;rectal_temperature&lt;/th&gt;
      &lt;td&gt;60&lt;/td&gt;
      &lt;td&gt;26&lt;/td&gt;
      &lt;td&gt;24&lt;/td&gt;
      &lt;td&gt;10&lt;/td&gt;
      &lt;td&gt;0.20&lt;/td&gt;
      &lt;td&gt;0.15&lt;/td&gt;
      &lt;td&gt;0.31&lt;/td&gt;
      &lt;td&gt;0.23&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;respiratory_rate&lt;/th&gt;
      &lt;td&gt;58&lt;/td&gt;
      &lt;td&gt;31&lt;/td&gt;
      &lt;td&gt;19&lt;/td&gt;
      &lt;td&gt;8&lt;/td&gt;
      &lt;td&gt;0.19&lt;/td&gt;
      &lt;td&gt;0.17&lt;/td&gt;
      &lt;td&gt;0.25&lt;/td&gt;
      &lt;td&gt;0.18&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;temperature_of_extremities&lt;/th&gt;
      &lt;td&gt;56&lt;/td&gt;
      &lt;td&gt;32&lt;/td&gt;
      &lt;td&gt;13&lt;/td&gt;
      &lt;td&gt;11&lt;/td&gt;
      &lt;td&gt;0.19&lt;/td&gt;
      &lt;td&gt;0.18&lt;/td&gt;
      &lt;td&gt;0.17&lt;/td&gt;
      &lt;td&gt;0.25&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;peripheral_pulse&lt;/th&gt;
      &lt;td&gt;69&lt;/td&gt;
      &lt;td&gt;39&lt;/td&gt;
      &lt;td&gt;18&lt;/td&gt;
      &lt;td&gt;12&lt;/td&gt;
      &lt;td&gt;0.23&lt;/td&gt;
      &lt;td&gt;0.22&lt;/td&gt;
      &lt;td&gt;0.23&lt;/td&gt;
      &lt;td&gt;0.27&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;mucous_membranes&lt;/th&gt;
      &lt;td&gt;47&lt;/td&gt;
      &lt;td&gt;28&lt;/td&gt;
      &lt;td&gt;11&lt;/td&gt;
      &lt;td&gt;8&lt;/td&gt;
      &lt;td&gt;0.16&lt;/td&gt;
      &lt;td&gt;0.16&lt;/td&gt;
      &lt;td&gt;0.14&lt;/td&gt;
      &lt;td&gt;0.18&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;pain&lt;/th&gt;
      &lt;td&gt;55&lt;/td&gt;
      &lt;td&gt;34&lt;/td&gt;
      &lt;td&gt;12&lt;/td&gt;
      &lt;td&gt;9&lt;/td&gt;
      &lt;td&gt;0.18&lt;/td&gt;
      &lt;td&gt;0.19&lt;/td&gt;
      &lt;td&gt;0.16&lt;/td&gt;
      &lt;td&gt;0.20&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;abdominal_distension&lt;/th&gt;
      &lt;td&gt;56&lt;/td&gt;
      &lt;td&gt;31&lt;/td&gt;
      &lt;td&gt;14&lt;/td&gt;
      &lt;td&gt;11&lt;/td&gt;
      &lt;td&gt;0.19&lt;/td&gt;
      &lt;td&gt;0.17&lt;/td&gt;
      &lt;td&gt;0.18&lt;/td&gt;
      &lt;td&gt;0.25&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;nasogastric_tube&lt;/th&gt;
      &lt;td&gt;104&lt;/td&gt;
      &lt;td&gt;62&lt;/td&gt;
      &lt;td&gt;25&lt;/td&gt;
      &lt;td&gt;17&lt;/td&gt;
      &lt;td&gt;0.35&lt;/td&gt;
      &lt;td&gt;0.35&lt;/td&gt;
      &lt;td&gt;0.32&lt;/td&gt;
      &lt;td&gt;0.39&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;rectal_examination&lt;/th&gt;
      &lt;td&gt;102&lt;/td&gt;
      &lt;td&gt;56&lt;/td&gt;
      &lt;td&gt;26&lt;/td&gt;
      &lt;td&gt;20&lt;/td&gt;
      &lt;td&gt;0.34&lt;/td&gt;
      &lt;td&gt;0.31&lt;/td&gt;
      &lt;td&gt;0.34&lt;/td&gt;
      &lt;td&gt;0.45&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Now we can further analyse our data. Let's see the ratio between the classifications.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;create_ratio_between_classes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;label_classes&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;label_a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;label_b&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;combinations&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;label_classes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;"ratio_percentage_&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;label_a&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;_&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;label_b&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;label_a&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;_percentage_na"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;label_b&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;_percentage_na"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;col_ratio&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="s"&gt;"ratio_percentage"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;col_ratio&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;


&lt;span class="n"&gt;create_ratio_between_classes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df_sum_min_na&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;df_labels_counts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;rot&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;45&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;grid&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Bp3AFlxX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/xh75jtlnzejad2jeco5w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Bp3AFlxX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/xh75jtlnzejad2jeco5w.png" alt="plot ratio_between_classes" width="714" height="398"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Ratios close to 1 indicate a similar percentage of &lt;code&gt;NaN&lt;/code&gt;s in the two classes, whereas ratios further from 1 indicate that the &lt;code&gt;NaN&lt;/code&gt;s are unevenly distributed across the classification field.  &lt;/p&gt;

&lt;p&gt;We can set a lower and an upper threshold for filtering out the problematic features. If all of a feature's ratios fall between these limits, we keep the feature; for any value outside these limits we assume that the &lt;code&gt;NaN&lt;/code&gt;s are unevenly distributed, and the feature should be removed.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_features_outside_threshold&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;lt_ratio_threshold&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;gt_ratio_threshold&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;features_to_drop&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="s"&gt;"ratio"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="n"&gt;mask_ratio_threshold&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;between&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lt_ratio_threshold&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;gt_ratio_threshold&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;features_to_drop&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;extend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mask_ratio_threshold&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;mask_ratio_threshold&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;features_to_drop&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;features_to_drop&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;features_to_drop&lt;/span&gt;


&lt;span class="n"&gt;features_to_drop&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;get_features_outside_threshold&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df_sum_min_na&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;features_to_drop&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;['rectal_examination',
 'abdominal_distension',
 'rectal_temperature',
 'temperature_of_extremities',
 'respiratory_rate']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;df_for_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df_nzr_threshold&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;drop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;features_to_drop&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;"After removing the features from get_features_outside_threshold function we are left with &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;number_of_features_with_NaN&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df_for_model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; features with `NaN`s that we are going to imputate"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After removing the features from get_features_outside_threshold function we are left with 9 features with &lt;code&gt;NaN&lt;/code&gt;s that we are going to impute.&lt;/p&gt;
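The `number_of_features_with_NaN` helper is defined earlier in the notebook; a minimal sketch consistent with how it is used here (the toy DataFrame below is illustrative, not the post's data) might be:

```python
import numpy as np
import pandas as pd

def number_of_features_with_NaN(df: pd.DataFrame) -> int:
    """Count how many columns still contain at least one NaN."""
    return int((df.isna().sum() > 0).sum())

# Tiny illustration: two of the three columns contain NaNs
df_demo = pd.DataFrame({"a": [1.0, np.nan], "b": [1, 2], "c": [np.nan, 3.0]})
print(number_of_features_with_NaN(df_demo))  # 2
```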

&lt;h1&gt;
  
  
  Summary
&lt;/h1&gt;

&lt;p&gt;This post described the issues that arise when analysing &lt;code&gt;NaN&lt;/code&gt;s for feature selection.  &lt;/p&gt;

&lt;p&gt;Simple filtering methods do not always perform as expected, and additional care should be taken when working with sparse matrices.&lt;/p&gt;

&lt;p&gt;We can analyse &lt;code&gt;NaN&lt;/code&gt;s within features at multiple levels:  &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;At the global level - i.e. the total number of NaNs within a feature (both for removing and for keeping features) &lt;/li&gt;
&lt;li&gt;At the label/classification level - i.e. the relative distribution of NaNs per class.&lt;/li&gt;
&lt;/ol&gt;
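As a compact sketch, both levels can be computed with `isna` plus a groupby. This is a minimal illustration on a toy DataFrame (the column names here are hypothetical), not the notebook's exact code:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "feature": [1.0, np.nan, 3.0, np.nan],
    "outcome": ["lived", "lived", "died", "died"],
})
features = df.drop(columns="outcome")

# 1. Global level: overall fraction of NaNs per feature
global_na = features.isna().mean()

# 2. Label level: fraction of NaNs per feature within each class
per_class_na = features.isna().groupby(df["outcome"]).mean()

print(global_na["feature"])                 # 0.5
print(per_class_na.loc["died", "feature"])  # 0.5
```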

&lt;p&gt;Finally, we recommend trying out the &lt;a href="https://github.com/ResidentMario/missingno"&gt;missingno package&lt;/a&gt; for graphical analysis of &lt;code&gt;NaN&lt;/code&gt; values.&lt;/p&gt;

&lt;p&gt;The notebook for this post is available at &lt;a href="https://gist.github.com/DavidKatz-il/16737cb60733303c4ac65a0dd288609a"&gt;this gist link&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>datapreperations</category>
      <category>dataanalysis</category>
    </item>
    <item>
      <title>Tidying up Pipelines with DataClasses</title>
      <dc:creator>Sephi Berry</dc:creator>
      <pubDate>Mon, 16 Nov 2020 21:27:29 +0000</pubDate>
      <link>https://dev.to/sephib/tidying-up-pipelines-with-dataclasses-2bde</link>
      <guid>https://dev.to/sephib/tidying-up-pipelines-with-dataclasses-2bde</guid>
      <description>&lt;h2&gt;
  
  
  Background
&lt;/h2&gt;

&lt;p&gt;Tidy code makes everyone's life easier.&lt;br&gt;&lt;br&gt;
The code in an ML project will probably be read many times, so making our workflow easier to understand will be appreciated later by everyone on the team.&lt;br&gt;
During ML projects, we need to access data in a consistent manner throughout our workflow for training, validation and prediction. Clear semantics for accessing the data make the code easier to manage across projects, and good naming conventions help us understand and reuse the code.&lt;br&gt;&lt;br&gt;
There are tools that can assist in this cleanliness, such as Pipelines and Dataclasses.  &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;An ML Engineer is 10% ML, 90% Engineer.   &lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3&gt;
  
  
  Pipeline
&lt;/h3&gt;

&lt;p&gt;A Pipeline is a &lt;em&gt;meta&lt;/em&gt; object that assists in managing the processes in an ML model. Pipelines can encapsulate separate processes which can later be combined together.&lt;br&gt;&lt;br&gt;
Forcing a workflow to be implemented within a &lt;a href="https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html"&gt;Pipeline object&lt;/a&gt; can be a nuisance at the beginning (especially the conversion between &lt;code&gt;pandas DataFrame&lt;/code&gt; and &lt;code&gt;np.ndarray&lt;/code&gt;), but down the line it guarantees the quality of the model (no data leakage, modularity etc.). Here is &lt;a href="https://www.youtube.com/watch?v=yv4adDGcFE8"&gt;Kevin Markham's 4-minute video&lt;/a&gt; explaining the advantages of pipelines.   &lt;/p&gt;
&lt;h3&gt;
  
  
  Dataclass
&lt;/h3&gt;

&lt;p&gt;Another useful &lt;em&gt;Python object&lt;/em&gt; for saving datasets along the pipeline is the &lt;code&gt;dataclass&lt;/code&gt;. Before Python 3.7 you may have been using &lt;a href="https://docs.python.org/3.9/library/collections.html?highlight=namedtuple#collections.namedtuple"&gt;namedtuple&lt;/a&gt;; since Python 3.7, &lt;a href="https://docs.python.org/3/library/dataclasses.html"&gt;dataclasses&lt;/a&gt; are a great candidate for storing such data objects. Using dataclasses allows for consistent access to the various datasets throughout the ML Pipeline.&lt;/p&gt;
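For example, a minimal dataclass holding train/test splits might look like this (the class and field names here are illustrative, not taken from the original project):

```python
from dataclasses import dataclass

import pandas as pd

@dataclass
class DataSets:
    """Container giving consistent access to the pipeline's datasets."""
    X_train: pd.DataFrame
    X_test: pd.DataFrame
    y_train: pd.Series
    y_test: pd.Series

# Toy data for illustration
ds = DataSets(
    X_train=pd.DataFrame({"a": [1, 2]}),
    X_test=pd.DataFrame({"a": [3]}),
    y_train=pd.Series([0, 1]),
    y_test=pd.Series([1]),
)
print(ds.X_train.shape)  # (2, 1)
```

Every stage of the workflow can then refer to `ds.X_train`, `ds.y_test`, and so on, rather than each script inventing its own variable names.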
&lt;h2&gt;
  
  
  Pipeline
&lt;/h2&gt;

&lt;p&gt;Since we are not analysing any dataset, this blog post is an example of an &lt;em&gt;advanced&lt;/em&gt; &lt;code&gt;pipeline&lt;/code&gt; that incorporates non-standard pieces (modules outside the standard &lt;code&gt;sklearn&lt;/code&gt; library).&lt;br&gt;&lt;br&gt;
Assuming that we have a classification problem and our data has numeric and categorical column types, the &lt;code&gt;pipeline&lt;/code&gt; incorporates:  &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Preprocess data preparation per column type
&lt;/li&gt;
&lt;li&gt;Handle the &lt;code&gt;categorical&lt;/code&gt; columns using the &lt;a href="https://github.com/WinVector/pyvtreat"&gt;vtreat&lt;/a&gt; package
&lt;/li&gt;
&lt;li&gt;Run a &lt;a href="https://catboost.ai/"&gt;catboost&lt;/a&gt; classifier.
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We may build our pipeline as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"label"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;copy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;num_pipe&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Pipeline&lt;/span&gt;&lt;span class="p"&gt;([(&lt;/span&gt;&lt;span class="s"&gt;"scaler"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;StanderdScaler&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt;
                     &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"variance"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;VarianceThreshold&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt;
                     &lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;preprocess_pipe&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ColumnTransformer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
   &lt;span class="n"&gt;remainder&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"passthrough"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="n"&gt;transformers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="s"&gt;"num_pipe"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num_pipe&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;select_dtypes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"number"&lt;/span&gt;&lt;span class="p"&gt;))]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;                     

&lt;span class="n"&gt;pipe&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Pipeline&lt;/span&gt;&lt;span class="p"&gt;([(&lt;/span&gt;&lt;span class="s"&gt;"preprocess_pipe"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;preprocess_pipe&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
               &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"vtreat"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;BinomiaOutcomeTreatmentPlan&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt;
&lt;span class="p"&gt;])&lt;/span&gt;                 

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this &lt;em&gt;pseudo code&lt;/em&gt; our Pipeline first applies some preprocessing to the numeric columns and then processes the categorical columns with the vtreat package (vtreat passes the non-categorical columns, including the numeric ones, through untouched).  &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Since &lt;code&gt;catboost&lt;/code&gt; does not have a transform method, we will introduce it into the pipeline later on.
&lt;/li&gt;
&lt;li&gt;Using &lt;code&gt;vtreat&lt;/code&gt; demonstrates that nonstandard modules can be used within the classification pipeline (as long as they follow the &lt;code&gt;sklearn&lt;/code&gt; paradigms)&lt;/li&gt;
&lt;/ol&gt;
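&lt;p&gt;As a runnable sanity check of the same structure, here is a minimal sketch using only standard scikit-learn parts; the toy dataframe is made up, and &lt;code&gt;OneHotEncoder&lt;/code&gt; merely stands in for the vtreat step to show where a categorical handler fits:&lt;/p&gt;

```python
# Minimal runnable sketch of the pipeline structure above, using only
# scikit-learn parts. The toy dataframe is illustrative; OneHotEncoder
# stands in for the vtreat step just to show where a categorical handler fits.
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.feature_selection import VarianceThreshold

df = pd.DataFrame({
    "age": [25, 32, 47, 51],
    "income": [40.0, 55.0, 72.0, 63.0],
    "city": ["a", "b", "a", "c"],
})

num_pipe = Pipeline([("scaler", StandardScaler()),
                     ("variance", VarianceThreshold())])
preprocess_pipe = ColumnTransformer(
    remainder="passthrough",
    transformers=[
        ("num_pipe", num_pipe, df.select_dtypes("number").columns.tolist()),
        ("cat", OneHotEncoder(), ["city"]),
    ],
)
Xt = preprocess_pipe.fit_transform(df)
# 2 scaled numeric columns + 3 one-hot city columns -> shape (4, 5)
```

&lt;p&gt;Note that the column selection passed to &lt;code&gt;ColumnTransformer&lt;/code&gt; should be a list of column names (or a selector), hence the &lt;code&gt;.columns.tolist()&lt;/code&gt; call.&lt;/p&gt;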

&lt;p&gt;So now the time has come to cut up our data...&lt;br&gt;&lt;br&gt;
&lt;a href="https://getyarn.io/yarn-clip/2c689f11-6d71-425c-a701-81be09ad034e#llil9DAFRQ.copy" rel="Homicidal Barber"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--juaxiWdw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://static.wikia.nocookie.net/montypython/images/5/53/The_ant_5.jpg" width="854" height="640"&gt;&lt;/a&gt;  &lt;/p&gt;
&lt;h2&gt;
  
  
  Test vs. Train vs. Valid
&lt;/h2&gt;

&lt;p&gt;A common workflow when developing an ML model is the need to split the data into &lt;a href="https://machinelearningmastery.com/difference-test-validation-datasets/"&gt;Test/Train/Valid datasets&lt;/a&gt;.    &lt;/p&gt;

&lt;p&gt;&lt;a href="http://www.cs.nthu.edu.tw/~shwu/courses/ml/labs/08_CV_Ensembling" rel="TTV datasets"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Cm6yxCwe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/http://www.cs.nthu.edu.tw/%7Eshwu/courses/ml/labs/08_CV_Ensembling/fig-holdout.png" width="585" height="369"&gt;&lt;/a&gt;  source: Shan-Hung Wu &amp;amp; DataLab, National Tsing Hua University&lt;/p&gt;

&lt;p&gt;In a nutshell, the differences between the datasets are:  &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Test - put aside and not touched until the final model evaluation
&lt;/li&gt;
&lt;li&gt;Train - dataset to train model
&lt;/li&gt;
&lt;li&gt;Valid - dataset to validate model during the training phase (this can be via Cross Validation iteration, GridSearch, etc.)
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Each dataset will have similar attributes that we will need to save and access throughout the ML workflow.&lt;br&gt;&lt;br&gt;
In order to prevent confusion, let's create a &lt;code&gt;dataclass&lt;/code&gt; that stores each dataset in a structured manner.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# basic dataclass 
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;dataclasses&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;dataclass&lt;/span&gt;

&lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="n"&gt;dataclass&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Split&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ndarray&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="n"&gt;pred_class&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="n"&gt;pred_proba&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ndarray&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now we can create the training and test datasets as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;train&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'train'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   
&lt;span class="n"&gt;test&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'test'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each &lt;code&gt;dataclass&lt;/code&gt; will have the following &lt;a href="https://docs.python.org/3/library/dataclasses.html#dataclasses.Field"&gt;fields&lt;/a&gt;:  &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;X&lt;/code&gt; - a numpy ndarray storing all the features
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;y&lt;/code&gt; - a numpy array storing the classification labels
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;idx&lt;/code&gt; - a numpy array storing the original indexes, useful for referencing rows at the end of the pipeline
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;pred_class&lt;/code&gt; - a numpy array storing the predicted classification
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;pred_proba&lt;/code&gt; - a numpy ndarray for storing the probabilities of the classifications&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Additionally we store a &lt;code&gt;name&lt;/code&gt; on the dataclass so we can easily reference each split along the pipeline.&lt;/p&gt;
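&lt;p&gt;For illustration, here is a hypothetical trimmed-down &lt;code&gt;Split&lt;/code&gt; (only a few of the fields) showing how the &lt;code&gt;name&lt;/code&gt; field labels results when iterating over splits:&lt;/p&gt;

```python
# Hypothetical trimmed-down Split: the `name` field labels each split's results.
from dataclasses import dataclass
import numpy as np

@dataclass
class Split:
    name: str
    y: np.ndarray = None
    pred_class: np.ndarray = None

train = Split(name="train", y=np.array([0, 1, 1]), pred_class=np.array([0, 1, 0]))
test = Split(name="test", y=np.array([1, 0]), pred_class=np.array([1, 0]))

for split in (train, test):
    acc = (split.y == split.pred_class).mean()
    print(f"{split.name}: accuracy={acc:.2f}")
```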

&lt;h3&gt;
  
  
  Splitting In Action
&lt;/h3&gt;

&lt;p&gt;There are several methods that can be used to split the datasets. When the data are imbalanced it is important to split them with a stratified method. In our case we chose &lt;a href="https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedShuffleSplit.html"&gt;StratifiedShuffleSplit&lt;/a&gt;. However, in contrast to the simple &lt;a href="https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html?highlight=train%20split#sklearn.model_selection.train_test_split"&gt;train-test split&lt;/a&gt;, which returns the datasets themselves, StratifiedShuffleSplit returns only the indices for each group, so we need a helper function to retrieve the datasets themselves (our helper function stays nice and minimal thanks to our &lt;code&gt;dataclasses&lt;/code&gt;).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_split_from_idx&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;split1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Split&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;split2&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Split&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;split1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;split2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;iloc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;split1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;iloc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;split2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;split1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;split2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;iloc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;split1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;iloc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;split2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;split1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;split2&lt;/span&gt;  

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;fold_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;test&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;StratifiedSplitValid&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_split&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;train_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.8&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;test&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;get_split_from_idx&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# a helper function
&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
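&lt;p&gt;&lt;code&gt;StratifiedSplitValid&lt;/code&gt; above is our own helper whose implementation is not shown here; a minimal sketch of such a wrapper around scikit-learn's &lt;a href="https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedShuffleSplit.html"&gt;StratifiedShuffleSplit&lt;/a&gt; (an assumption about how it might look, not the original code) could be:&lt;/p&gt;

```python
# Hypothetical sketch of a StratifiedSplitValid-style helper: a thin wrapper
# around StratifiedShuffleSplit that yields (train_idx, test_idx) index pairs.
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

def stratified_split_valid(X, y, n_split=5, train_size=0.8, random_state=42):
    sss = StratifiedShuffleSplit(n_splits=n_split, train_size=train_size,
                                 random_state=random_state)
    yield from sss.split(X, y)

# toy balanced data: 10 rows, two classes
X = np.arange(20).reshape(10, 2)
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
for fold, (train_idx, test_idx) in enumerate(stratified_split_valid(X, y, n_split=2)):
    print(fold, len(train_idx), len(test_idx))  # each fold: 8 train / 2 test indices
```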



&lt;h3&gt;
  
  
  Pipeline in action
&lt;/h3&gt;

&lt;p&gt;Now we can run the first part of our Pipeline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;_train_X&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pipe&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit_transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once we have run &lt;code&gt;fit_transform&lt;/code&gt; on our data (allowing the vtreat magic to work), we can introduce the &lt;code&gt;catboost&lt;/code&gt; classifier into our Pipeline.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;catboost_clf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;CatBoostClassifier&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;train_valid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"train_valid"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;valid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"valid"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;fold_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_valid&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;valid&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;StratifiedSplitValid&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_train_X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;train&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_split&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;train_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.9&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;train_valid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;valid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;get_split_from_idx&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_train_X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;train&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;train_valid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;valid&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;pipe&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;steps&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="s"&gt;"catboost_clf"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;catboost_clf&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="n"&gt;pipe&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_size&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;train_valid&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;catboost_clf__eval_set&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="n"&gt;valid&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;valid&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)],&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note the following two points:  &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Using &lt;code&gt;pipe.steps.append&lt;/code&gt; we can introduce steps into the pipeline that were not part of the initial workflow.
&lt;/li&gt;
&lt;li&gt;Passing parameters to steps within the pipeline requires the double-underscore (&lt;code&gt;__&lt;/code&gt;) notation for &lt;a href="https://scikit-learn.org/stable/modules/compose.html#nested-parameters"&gt;nested parameters&lt;/a&gt;.
&lt;/li&gt;
&lt;/ol&gt;
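&lt;p&gt;The double-underscore convention can be checked in isolation (a minimal sketch with a stand-in &lt;code&gt;LogisticRegression&lt;/code&gt; classifier, not our actual model):&lt;/p&gt;

```python
# Minimal illustration of scikit-learn's "step__param" naming:
# a parameter prefixed by the step name is routed to that step.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipe = Pipeline([("scaler", StandardScaler()),
                 ("clf", LogisticRegression())])

pipe.set_params(clf__C=0.5)          # routed to the "clf" step
print(pipe.get_params()["clf__C"])   # 0.5
```

&lt;p&gt;The same convention routes fit-time keyword arguments, which is how &lt;code&gt;catboost_clf__eval_set&lt;/code&gt; above reaches the catboost step during &lt;code&gt;pipe.fit&lt;/code&gt;.&lt;/p&gt;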

&lt;p&gt;Finally, we can get some results:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;test&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pred_class&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pipe&lt;/span&gt;&lt;span class="p"&gt;.(&lt;/span&gt;&lt;span class="n"&gt;test&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
&lt;span class="n"&gt;test&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pred_proba&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pipe&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pred_proba&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;)[:,&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now when we analyse our model we can generate our metrics (e.g. confusion_matrix) by easily referencing the relevant dataset as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.metrics&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;confusion_matrix&lt;/span&gt;
&lt;span class="n"&gt;conf_matrix_test&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;confusion_matrix&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_true&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;test&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_pred&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;test&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pred_class&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This blog post outlines the advantages of using Pipelines and Dataclasses.&lt;br&gt;&lt;br&gt;
Working with Dataclasses is really a no-brainer since they are very simple and can easily be incorporated into any code base. Pipelines require more effort to integrate into the code, but the benefits are substantial and well worth it.&lt;br&gt;
I hope the example illustrated the potential of this approach and will inspire and encourage you to try it out.&lt;/p&gt;

</description>
      <category>pipeline</category>
      <category>dataengineering</category>
      <category>python</category>
      <category>scikit</category>
    </item>
    <item>
      <title>Simple Pipeline Monitoring Dashboard</title>
      <dc:creator>Sephi Berry</dc:creator>
      <pubDate>Sun, 02 Aug 2020 09:54:06 +0000</pubDate>
      <link>https://dev.to/sephib/simple-pipeline-monitoring-dashboard-386p</link>
      <guid>https://dev.to/sephib/simple-pipeline-monitoring-dashboard-386p</guid>
      <description>&lt;p&gt;This post is co-authored with &lt;a href="https://www.linkedin.com/in/davidkatz-il/"&gt;David Katz&lt;/a&gt;  &lt;/p&gt;

&lt;h2&gt;
  
  
  Background
&lt;/h2&gt;

&lt;p&gt;These days any deployed project should incorporate the principles of CI/CD (highly recommended: this &lt;a href="https://www.youtube.com/watch?v=Dx2vG6qmtPs&amp;amp;t=232s"&gt;great talk from Eric Ma&lt;/a&gt;, July 2020, describes the issue in the realm of &lt;em&gt;Data Science&lt;/em&gt;). Thus, after setting up our &lt;a href="https://dev.to/sephib/implementing-a-graph-network-pipeline-with-dagster-3i3a"&gt;dagster pipeline&lt;/a&gt; we needed to implement some sort of monitoring solution to review the outcome of our workflow. Working in a small DS team, we needed to push forward and couldn't wait for the &lt;em&gt;heavy guns&lt;/em&gt; of enterprise IT to take over. So until we have their support, here is a simple dashboard that we put together to monitor our &lt;code&gt;Assets&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;In this blog post we aim to describe how we created a functional dashboard based on python widgets.&lt;br&gt;&lt;br&gt;
We will describe the origin of our data, followed by our solution using python's &lt;a href="https://panel.holoviz.org/"&gt;Panel&lt;/a&gt; library.  &lt;/p&gt;

&lt;p&gt;The code for this post is in this repo &lt;br&gt;
&lt;/p&gt;
&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--566lAguM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/sephib"&gt;
        sephib
      &lt;/a&gt; / &lt;a href="https://github.com/sephib/dagster-graph-project"&gt;
        dagster-graph-project
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      Repo demonstrating a Dagster pipeline to generate Neo4j Graph
    &lt;/h3&gt;
  &lt;/div&gt;
&lt;/div&gt;


&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--nq2_zzHG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/cae38vg0dkeyw4teeyqz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--nq2_zzHG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/cae38vg0dkeyw4teeyqz.png" alt="Simple Dashboard demo" width="825" height="511"&gt;&lt;/a&gt;  &lt;/p&gt;

&lt;h2&gt;
  
  
  Dagster Assets
&lt;/h2&gt;

&lt;p&gt;We are not going to dive into &lt;a href="https://docs.dagster.io/"&gt;Dagster&lt;/a&gt; (see  previous &lt;a href="https://dev.to/sephib/implementing-a-graph-network-pipeline-with-dagster-3i3a"&gt;blog post on our data pipeline&lt;/a&gt;), but the TLDR  is that Dagster is an orchestration framework for building modern data applications and workflows. The framework has integrated logging and the ability to &lt;a href="https://docs.dagster.io/overview/asset-materializations#materializing-an-asset"&gt;produce persistent assets&lt;/a&gt; that are stored in a database (in our case &lt;em&gt;postgresql&lt;/em&gt;) for future references.&lt;br&gt;&lt;br&gt;
For our project we are interested in monitoring the number of nodes and edges that we generate in our data pipeline workflow. During a &lt;em&gt;pipeline run&lt;/em&gt; we log (or, in Dagster's jargon, &lt;code&gt;Materialize&lt;/code&gt; - see &lt;a href="https://docs.dagster.io/examples/materializations#main"&gt;AssetMaterialization in the documentation&lt;/a&gt;) various stats on the datasets that we wish to manipulate. We would like to view the changes in these stats over time in order to verify the "health" of our system/pipeline.&lt;/p&gt;
&lt;h2&gt;
  
  
  Panel widgets
&lt;/h2&gt;

&lt;p&gt;Today, the python ecosystem is very rich and vibrant with various visualization libraries that are constantly being developed. Two of the libraries that we reviewed were &lt;a href="https://www.streamlit.io/"&gt;streamlit&lt;/a&gt; and &lt;a href="https://panel.holoviz.org/"&gt;Panel&lt;/a&gt;. We decided to go with Panel which seemed to suit our needs (due mainly to its structure and maintenance from our side).&lt;br&gt;&lt;br&gt;
Inspired by a talk given by  &lt;a href="https://www.youtube.com/watch?v=Un30yb1WlpU&amp;amp;feature=youtu.be"&gt;Lina Weichbrodt in the MLOps meetup&lt;/a&gt;,  we wanted to view the percent change of our metrics over time.   &lt;/p&gt;

&lt;p&gt;Panel is capable of displaying and integrating many python widgets from various packages. We are going to work with hvplot which best fits our needs, due to its richness and its integration with Pandas.    &lt;/p&gt;
&lt;h2&gt;
  
  
  Getting our data/assets from the database
&lt;/h2&gt;

&lt;p&gt;In this section we describe how we extracted the data from &lt;code&gt;Dagster's Asset&lt;/code&gt; database. If this is not relevant, you may want to jump to the sample data section below.&lt;br&gt;&lt;br&gt;
In order to access the &lt;code&gt;Asset&lt;/code&gt; data we needed to dig into the &lt;code&gt;event_log&lt;/code&gt; table, which logs all the events that are generated when running a Dagster pipeline. The script that extracts the data into a Pandas DataFrame, based on the &lt;code&gt;Asset Keys&lt;/code&gt; that are defined in the &lt;code&gt;Materialization&lt;/code&gt; process, is in the repo linked above.    &lt;/p&gt;

&lt;p&gt;Here are the key elements in the script:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;In order to access the assets we need to query the &lt;code&gt;event_logs&lt;/code&gt; table. We can use a &lt;code&gt;sqlalchemy&lt;/code&gt; query as follows:  &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;select([t_event_logs.c.event]).where(t_event_logs.c.asset_key.in_(assets))&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;For parsing the results we can use &lt;code&gt;dagster's&lt;/code&gt; internal utility &lt;code&gt;deserialize_json_to_dagster_namedtuple&lt;/code&gt;. Below is the function that converts the assets into a dictionary. Please note that we only retrieve assets of a numeric type (which can be plotted); this mirrors &lt;code&gt;dagit's&lt;/code&gt; decision to display only numeric asset values in graphs.&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def get_asset_keys_values(results)-&amp;gt;dict:
    assets={}
    for result in results:
        dagster_namedtuple = deserialize_json_to_dagster_namedtuple(result[0])
        time_stamp = datetime.fromtimestamp(dagster_namedtuple.timestamp).strftime('%Y-%m-%d %H:%M:%S')
        assets[time_stamp] = {}        
        assets[time_stamp]['asset_key'] = dagster_namedtuple.dagster_event.asset_key.to_string()
        for entry in dagster_namedtuple.dagster_event.event_specific_data.materialization.metadata_entries:
            if isinstance(entry.entry_data, FloatMetadataEntryData):  # Only assets that are numerical
                assets[time_stamp][entry.label] = entry.entry_data.value
    return assets
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The full code for retrieving the data is in &lt;a href="https://github.com/sephib/dagster-graph-project/blob/master/src/get_dagset_assets.py"&gt;get_dagster_asset.py&lt;/a&gt; file. &lt;/p&gt;
&lt;h3&gt;
  
  
  Sample Data
&lt;/h3&gt;

&lt;p&gt;For the dashboard in this post, we are going to use the sample data from bokeh.    &lt;/p&gt;

&lt;p&gt;Since we are simulating our data pipeline outcomes, we are going to use a sample of the columns:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;date - as our X / time axis&lt;/li&gt;
&lt;li&gt;Temperature
&lt;/li&gt;
&lt;li&gt;Humidity
&lt;/li&gt;
&lt;li&gt;Light &lt;/li&gt;
&lt;li&gt;CO2&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let's view the data  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--LB7EozhQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/ogy5gg8stqu4mk3mbk1u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--LB7EozhQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/ogy5gg8stqu4mk3mbk1u.png" alt="sample dataframe" width="631" height="240"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Since we are interested in the change of the various stats with time we can use Panda's &lt;a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.pct_change.html"&gt;pct_change&lt;/a&gt; method to generate the values that we need.  This also allows displaying all the datasets in the same graph since the nominal values of the various datasets are of different orders of magnitude.    &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--tX6Qt-zN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/77pp9a6wl4n987kmq1rv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--tX6Qt-zN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/77pp9a6wl4n987kmq1rv.png" alt="sample df pct_change" width="534" height="177"&gt;&lt;/a&gt;  &lt;/p&gt;

&lt;p&gt;Now that we have the data we can build our dashboard  &lt;/p&gt;
&lt;h2&gt;
  
  
  Dashboard
&lt;/h2&gt;

&lt;p&gt;We have 2 widgets that we want to use in our dashboard:  &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A line plot - displaying the datasets, with a scatter plot overlaid to add value markers&lt;/li&gt;
&lt;li&gt;A date_range_slider widget - selecting the date range that we want to display&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Our dashboard will display each data series along the X time axis.  &lt;/p&gt;
&lt;h3&gt;
  
  
  DateRangeSlider
&lt;/h3&gt;

&lt;p&gt;Panel's &lt;a href="https://panel.holoviz.org/reference/widgets/DateRangeSlider.html"&gt;DateRangeSlider&lt;/a&gt; widget "allows selecting a date range using a slider with two handles".  &lt;/p&gt;

&lt;p&gt;The parameters of the widget are self-explanatory.&lt;br&gt;&lt;br&gt;
Please note that the &lt;code&gt;value&lt;/code&gt; parameter holds the default selection of the DateRangeSlider - a (start, end) tuple within the slider's range.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;date_range_slider = pn.widgets.DateRangeSlider(
        name='Date Range Slider',
        start=data[date_col].min(), 
        end=data[date_col].max(),
        value=(data[date_col].max() - timedelta(hours=1), 
               data[date_col].max()
               )   # default value for slider
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Line Plot &amp;amp; Panel's Glue
&lt;/h2&gt;

&lt;p&gt;Now let's look at the Line plot code:  &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;import holoviews.plotting.bokeh&lt;br&gt;&lt;br&gt;
import hvplot.pandas  &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;These imports set &lt;a href="https://bokeh.org/"&gt;bokeh&lt;/a&gt; as the visualization backend for hvplot, and allow hvplot to use pandas DataFrames directly as the data sources for the plots.  &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;@pn.depends(date_range_slider.param.value)  &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This Panel decorator makes the &lt;em&gt;line plot&lt;/em&gt; re-render whenever the value of the &lt;code&gt;date_range_slider&lt;/code&gt; widget changes.   &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;start_date = date_range[0]&lt;br&gt;&lt;br&gt;
end_date = date_range[1]&lt;br&gt;&lt;br&gt;
mask = (data[date_col] &amp;gt; start_date) &amp;amp; (data[date_col] &amp;lt;= end_date)&lt;br&gt;&lt;br&gt;
data = data.loc[mask]   &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In order to filter the dataframe we are masking the data based on the current values from the &lt;code&gt;date_range_slider&lt;/code&gt; widget.  &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;data.hvplot.line  &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is the basic call for a line plot to be rendered from the pandas DataFrame.  &lt;/p&gt;

&lt;p&gt;The &lt;a href="https://hvplot.holoviz.org/reference/pandas/scatter.html"&gt;scatter plot&lt;/a&gt; was added in order to display the value markers on the line plot.&lt;/p&gt;

&lt;p&gt;Here is the full function:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;@pn.depends(date_range_slider.param.value)
def get_plot(date_range):
    data = dft
    start_date = date_range[0]
    end_date = date_range[1]
    mask = (data[date_col] &amp;gt; start_date) &amp;amp; (data[date_col] &amp;lt;= end_date)
    data = data.loc[mask]

    lines = data[cols + [date_col]].hvplot.line(
          x=date_col
        , y=cols
        , value_label= 'value'  
        , legend='right'
        , height=400
        , width=800
        , muted_alpha=0
        , ylim=(-0.1, 0.1)  # This can be configured based on the pct change scale 
        , xlabel='time'
        , ylabel='% change'
    )   
    scatter = data[cols + [date_col]].hvplot.scatter(
                x=date_col,
                y= cols,

    )
    return lines.opts(axiswise=True) * scatter
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Final function
&lt;/h2&gt;

&lt;p&gt;Now we can create a function that connects the different widgets:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def get_dashboard(dft, cols, date_col):
    date_range_slider = pn.widgets.DateRangeSlider(
        name='Date Range Slider',
        start=dft[date_col].min(), end=dft[date_col].max(),
        value=(dft[date_col].max() - timedelta(hours=1), dft[date_col].max(),)
    )
    @pn.depends(date_range_slider.param.value)
    def get_plot(date_range):
        data = dft
        start_date = date_range[0]
        end_date = date_range[1]
        mask = (data[date_col] &amp;gt; start_date) &amp;amp; (data[date_col] &amp;lt;= end_date)
        data = data.loc[mask]

        lines = data[cols + [date_col]].hvplot.line(
              x=date_col
            , y=cols
            , value_label= 'value'  
            , legend='right'
            , height=400
            , width=800
            , muted_alpha=0
            , ylim=(-0.1, 0.1)  # This can be configured based on the pct change scale 
            , xlabel='time'
            , ylabel='% change'
        )   
        scatter = data[cols + [date_col]].hvplot.scatter(
                    x=date_col,
                    y= cols,

        )
        return lines.opts(axiswise=True) * scatter
    return get_plot, date_range_slider
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Design the Dashboard
&lt;/h2&gt;

&lt;p&gt;Panel has a simple way of aggregating all the widgets together using rows and columns (like a simple HTML table).   &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--02ICjvHJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/ihj8fz6ampqcvy7jw44l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--02ICjvHJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/ihj8fz6ampqcvy7jw44l.png" alt="Panel Layout" width="739" height="505"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Below is the code to design the layout&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;plot, date_range_slider = get_dashboard(data, cols, 'date')
dashboard=pn.Row(
    pn.Column(
        pn.pane.Markdown(''' ## Dataset Percent Change'''),
        plot,
        date_range_slider,
    ),

)
dashboard

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--nq2_zzHG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/cae38vg0dkeyw4teeyqz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--nq2_zzHG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/cae38vg0dkeyw4teeyqz.png" alt="Simple Dashboard demo" width="825" height="511"&gt;&lt;/a&gt;   &lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In this blog post we have outlined our solution for monitoring the Dagster Assets that we log during our data pipeline workflow.&lt;br&gt;&lt;br&gt;
Using the Panel / hvplot libraries was quite straightforward. The documentation and reference galleries were very useful, although linking some widget actions may require a bit of JS. Working through the examples, such as the last section of the &lt;a href="https://panel.holoviz.org/getting_started/index.html"&gt;getting started documentation&lt;/a&gt;, in addition to the more advanced examples, shows the potential for building an elaborate dashboard if required.  &lt;/p&gt;

</description>
      <category>python</category>
      <category>monitor</category>
      <category>pyviz</category>
    </item>
    <item>
      <title>Implementing a graph network pipeline with Dagster</title>
      <dc:creator>Sephi Berry</dc:creator>
      <pubDate>Thu, 09 Jul 2020 13:56:21 +0000</pubDate>
      <link>https://dev.to/sephib/implementing-a-graph-network-pipeline-with-dagster-3i3a</link>
      <guid>https://dev.to/sephib/implementing-a-graph-network-pipeline-with-dagster-3i3a</guid>
      <description>&lt;p&gt;This post is co-authored with &lt;a href="https://www.linkedin.com/in/davidkatz-il/" rel="noopener noreferrer"&gt;David Katz&lt;/a&gt; &lt;/p&gt;

&lt;h2&gt;
  
  
  Background
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Working in the Intelligence arena we try to 'connect the dots' to extract meaningful information from data.&lt;/li&gt;
&lt;li&gt;We analyze various datasets to link between them in a logical manner.&lt;/li&gt;
&lt;li&gt;This is useful in many different projects - so we needed to build a pipeline that can be both dynamic and robust, and be readily and easily  utilized.&lt;/li&gt;
&lt;li&gt;In this blog post we share our experience in running one of our data pipelines with  &lt;a href="https://docs.dagster.io/" rel="noopener noreferrer"&gt;dagster&lt;/a&gt; - which uses a modern approach (compared to the traditional Airflow / Luigi task managers), see &lt;a href="https://docs.dagster.io/docs/learn" rel="noopener noreferrer"&gt;Dagster's website description&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;We hope this blog post will help others to adopt such a data-pipeline and allow them to learn from our experiences.
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The code for this post is in this repo &lt;br&gt;
&lt;/p&gt;
&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev.to%2Fassets%2Fgithub-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/sephib" rel="noopener noreferrer"&gt;
        sephib
      &lt;/a&gt; / &lt;a href="https://github.com/sephib/dagster-graph-project" rel="noopener noreferrer"&gt;
        dagster-graph-project
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      Repo demonstrating a Dagster pipeline to generate Neo4j Graph
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;p&gt;This repo is an example of using &lt;a href="https://docs.dagster.io/" rel="nofollow noopener noreferrer"&gt;dagster framework&lt;/a&gt; in a real-world data pipeline.&lt;/p&gt;
&lt;div&gt;&lt;a rel="noopener noreferrer" href="https://github.com/sephib/dagster-graph-projectdocs/images/postBanner.png"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fsephib%2Fdagster-graph-projectdocs%2Fimages%2FpostBanner.png" alt="postBanner dagster spark neo4j" width="650"&gt;&lt;/a&gt;&lt;/div&gt;
&lt;ul&gt;
&lt;li&gt;See &lt;a href="https://dev.to/sephib/implementing-a-graph-network-pipeline-with-dagster-3i3a" rel="nofollow"&gt;Implementing a graph network pipeline with Dagster&lt;/a&gt; blog post for the entire write-up describing how we created a graph (nodes and edges) from separate data sources and batch import them into Neo4j. A &lt;a href="https://github.com/sephib/dagster-graph-project/tree/master/notebooks/dagster_pipeline_blog.ipynb" rel="noopener noreferrer"&gt;jupyter notebook&lt;/a&gt; is also available in this repo, in addition to the entire code to replicate this example.&lt;/li&gt;
&lt;/ul&gt;
&lt;div&gt;&lt;a rel="noopener noreferrer" href="https://github.com/sephib/dagster-graph-projectdocs/images/monitorpostBanner.png"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fsephib%2Fdagster-graph-projectdocs%2Fimages%2FmonitorpostBanner.png" alt="postBanner dagster spark neo4j" width="450"&gt;&lt;/a&gt;&lt;/div&gt;
&lt;ul&gt;
&lt;li&gt;See &lt;a href="https://dev.to/sephib/simple-pipeline-monitoring-dashboard-386p" rel="nofollow"&gt;Simple Pipeline Monitoring Dashboard&lt;/a&gt; blog post for the entire write-up describing the monitoring dashboard that we created using &lt;a href="https://panel.holoviz.org/" rel="nofollow noopener noreferrer"&gt;Panel&lt;/a&gt;. A &lt;a href="https://github.com/sephib/dagster-graph-project/tree/master/notebooks/dashboard_blog.ipynb" rel="noopener noreferrer"&gt;jupyter notebook&lt;/a&gt; is also available in this repo.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;



&lt;/div&gt;
&lt;br&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/sephib/dagster-graph-project" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;br&gt;
&lt;/div&gt;
&lt;br&gt;


&lt;h2&gt;
  
  
  Our Challenge
&lt;/h2&gt;

&lt;h4&gt;
  
  
  Connecting the dots
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Finding relationships between different entities can be challenging (especially across datasets) - but is vital when building a reliable intelligence report.
&lt;/li&gt;
&lt;li&gt;A logical structure to store the entities and their relationships is in a &lt;a href="https://en.wikipedia.org/wiki/Graph_database" rel="noopener noreferrer"&gt;Graph Database&lt;/a&gt;. In this example we are going to use &lt;a href="https://neo4j.com/" rel="noopener noreferrer"&gt;Neo4j&lt;/a&gt; DB for storing the graph.
&lt;/li&gt;
&lt;li&gt;In this example we are going to use &lt;em&gt;pseudo BigData&lt;/em&gt;; in production, however, the pipeline that we present generates billions of relationships.
&lt;/li&gt;
&lt;li&gt;The pipeline's output files will be in a format that allows us to use the dedicated &lt;a href="https://neo4j.com/docs/operations-manual/4.1/tools/import/" rel="noopener noreferrer"&gt;Neo4j tool for bulk import&lt;/a&gt;. In the future we will do a separate blog post on our data analysis workflow with &lt;code&gt;Neo4j&lt;/code&gt;. &lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  First take
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Our initial version for our pipeline was based on custom code and configurations (YAML) files.
&lt;/li&gt;
&lt;li&gt; The code base is a combination of R and Python scripts that utilize Spark, Dask and HDFS.
&lt;/li&gt;
&lt;li&gt; A shell script aggregated all the scripts to run the entire workflow&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Inspect and Adapt
&lt;/h4&gt;

&lt;p&gt;After our initial alpha version we noticed that we had some problems that required our attention:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The pipeline was built horizontally (per process) and not vertically (per dataset) - leading to uncertainty and fragmented results
&lt;/li&gt;
&lt;li&gt;We needed to refactor our code in order to stabilize and verify the quality of the product.
&lt;/li&gt;
&lt;li&gt;Working with &lt;a href="//www.dask.org"&gt;dask&lt;/a&gt; didn't solve all of our use-cases and so we needed to run some workloads on &lt;a href="https://spark.apache.org/" rel="noopener noreferrer"&gt;spark&lt;/a&gt;.
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After checking several options we chose &lt;code&gt;dagster&lt;/code&gt; as our pipeline framework for numerous reasons - including:  &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Simplicity of its logic/framework
&lt;/li&gt;
&lt;li&gt;Modern architecture with data as a "first class citizen".
&lt;/li&gt;
&lt;li&gt;Open-source code base
&lt;/li&gt;
&lt;li&gt;The framework includes a modern UI for monitoring and communicating the workflow status
&lt;/li&gt;
&lt;li&gt;Pipeline supports data dependencies (and not function outputs)
&lt;/li&gt;
&lt;li&gt;Ability to log/monitor metrics
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;It was fortunate that our configuration files allowed us to identify the various functions and abstract our business logic from the data transformation.  &lt;/p&gt;

&lt;p&gt;*** This is a &lt;code&gt;dagster&lt;/code&gt; &lt;em&gt;intermediate level&lt;/em&gt; blog post - newcomers are encouraged to run through the &lt;a href="https://docs.dagster.io/docs/tutorial" rel="noopener noreferrer"&gt;beginner tutorial&lt;/a&gt; on dagster's site.   &lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding &lt;code&gt;Dagster&lt;/code&gt; &lt;strong&gt;lego&lt;/strong&gt; blocks
&lt;/h2&gt;

&lt;p&gt;Before we start, here's a short introduction to Dagster's &lt;em&gt;&lt;em&gt;lego&lt;/em&gt;&lt;/em&gt; building blocks &lt;a href="https://docs.dagster.io/docs/learn/concepts" rel="noopener noreferrer"&gt;see dagster documentation&lt;/a&gt;: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;At the core we have &lt;code&gt;solids&lt;/code&gt; - these are the various "functional unit of computation that consumes and produces data assets".
&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;solids&lt;/code&gt; can be aggregated into &lt;code&gt;composite solids&lt;/code&gt;.
&lt;/li&gt;
&lt;li&gt;A &lt;code&gt;solid&lt;/code&gt; can have &lt;strong&gt;inputs&lt;/strong&gt; and &lt;strong&gt;outputs&lt;/strong&gt; that can be passed along the pipeline.
&lt;/li&gt;
&lt;li&gt;A &lt;code&gt;pipeline&lt;/code&gt; orchestrates the various &lt;code&gt;solids&lt;/code&gt; and the data flows.
&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Design implementation
&lt;/h4&gt;

&lt;p&gt;The following diagram displays the architecture  outline of the code:  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F1q3rdf7iqaynsfm7xl6y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F1q3rdf7iqaynsfm7xl6y.png" alt="entity model"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;YAML configurations&lt;br&gt;&lt;br&gt;
&lt;code&gt;Dagster&lt;/code&gt; has many configuration files that assist in managing pipelines and their environments. In this example we will use only 2 configuration types:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;resources - configuration file that manages the resources for the running pipeline
&lt;/li&gt;
&lt;li&gt;solids - configuration file for the &lt;code&gt;composite solids&lt;/code&gt;. Each data source has its own configuration, in addition to the composite solid that implements the creation of the Neo4j DB.
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;Inputs  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Our premise is that the datasets inputs arrive in a timely manner (batch not streaming).
&lt;/li&gt;
&lt;li&gt;Each dataset source has a corresponding configuration file.
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;Pipeline  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;code&gt;pipeline&lt;/code&gt; consists of all the &lt;code&gt;composite solids&lt;/code&gt; that organize the work that needs to be executed within each &lt;code&gt;solid&lt;/code&gt;.
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;Output Files &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In our example the outputs are:

&lt;ul&gt;
&lt;li&gt;Nodes and Edges flat files in the format to be bulk imported into Neo4j&lt;/li&gt;
&lt;li&gt;Neo4j DB&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;/li&gt;

&lt;/ul&gt;

&lt;h3&gt;
  
  
  Let's build
&lt;/h3&gt;

&lt;p&gt;Now we will build the basic units needed for our project.&lt;br&gt;&lt;br&gt;
Since we have several datasets, we can build each building block to be executed with a configuration file:  &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Read file into dataframe  (e.g. &lt;code&gt;csv&lt;/code&gt; and &lt;code&gt;parquet&lt;/code&gt;)
&lt;/li&gt;
&lt;li&gt;Massage our data to fit the schema that we want for our graph network (including adding/dropping columns, renaming columns, concatenating columns, etc.)
&lt;/li&gt;
&lt;li&gt;Generate nodes and edges  (per entity in the graph model)
&lt;/li&gt;
&lt;li&gt;Save nodes and edges into csv files
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Finally we will bulk import the csv files (nodes and edges) into &lt;code&gt;Neo4j&lt;/code&gt;  &lt;/p&gt;
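&lt;p&gt;The bulk-import step itself is a command-line call to &lt;code&gt;neo4j-admin import&lt;/code&gt;; as a hedged sketch it can be driven from Python - note the CSV file names here are illustrative assumptions, not the repo's actual output files:&lt;/p&gt;

```python
import subprocess

# Illustrative file names -- the real pipeline writes its own node/edge CSVs
cmd = [
    "neo4j-admin", "import",
    "--nodes=Player=players.csv",
    "--nodes=Team=teams.csv",
    "--relationships=PLAYED_IN=played_in.csv",
    "--relationships=PLAYED_TOGETHER=played_together.csv",
]

# Requires a local Neo4j installation; uncomment to actually run the import:
# subprocess.run(cmd, check=True)
print(" ".join(cmd))
```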
&lt;h2&gt;
  
  
  Data
&lt;/h2&gt;

&lt;p&gt;In order to demonstrate our workflow we will use data from &lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fy695a4pgr7ndhr2azc0r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fy695a4pgr7ndhr2azc0r.png" alt="StatsBomb logo"&gt;&lt;/a&gt; &lt;a href="https://statsbomb.com/" rel="noopener noreferrer"&gt;StatsBomb&lt;/a&gt;, a football analytics company that provides data from various leagues and competitions (for the American readers - we are talking about the &lt;strong&gt;original&lt;/strong&gt; football - &lt;em&gt;soccer&lt;/em&gt;). The company has a free open data API tier that can be accessed using the instructions on &lt;a href="https://github.com/statsbomb/statsbombpy#open-data" rel="noopener noreferrer"&gt;their GitHub page&lt;/a&gt;.    &lt;/p&gt;

&lt;p&gt;We would like to find the relationships between the players in the following dimensions:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Player relationship

&lt;ol&gt;
&lt;li&gt;Played together in the same match&lt;/li&gt;
&lt;li&gt;Passed the ball to another Player&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;Team relationship - Player played in a team
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We will be building a graph based on the following entity model:  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F8exzh9mptxx2p9ezc4o7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F8exzh9mptxx2p9ezc4o7.png" alt="entity model"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The following notebooks are available to download the data into 2 separate datasets:  &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Player lineup as csv  &lt;a href="//notebooks/statsbomb_player_team.ipynb"&gt;link to notebook&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Player passes as parquet  &lt;a href="//notebooks/statsbomb_player_pass.ipynb"&gt;link to notebook&lt;/a&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;
  
  
  Understanding the datasets
&lt;/h3&gt;
&lt;h4&gt;
  
  
  Player Lineup
&lt;/h4&gt;

&lt;p&gt;This dataset has all the information regarding the players relationships with their teams. The key columns that we will use are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;player_id&lt;/code&gt; -  identifies each &lt;code&gt;Player&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;team_id&lt;/code&gt; -  identifies each &lt;code&gt;Team&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;team_id&lt;/code&gt; &amp;amp; &lt;code&gt;match_id&lt;/code&gt; - will create our &lt;code&gt;EdgeID&lt;/code&gt; to identify when a &lt;code&gt;Player&lt;/code&gt; &lt;code&gt;PLAYED_TOGETHER&lt;/code&gt; with another &lt;code&gt;Player&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt; The &lt;code&gt;Player&lt;/code&gt; &lt;code&gt;PLAYED_IN&lt;/code&gt; relationship can be immediately derived from the table.&lt;/li&gt;
&lt;/ol&gt;
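&lt;p&gt;To make the &lt;code&gt;PLAYED_TOGETHER&lt;/code&gt; idea concrete, here is a hedged pandas sketch on a toy lineup (column names follow the description above; the real pipeline does this at scale in spark):&lt;/p&gt;

```python
import pandas as pd
from itertools import combinations

# Toy lineup rows: one row per player appearance in a match
lineup = pd.DataFrame({
    "match_id": [1, 1, 1],
    "team_id": [100, 100, 100],
    "player_id": [10, 7, 9],
})

# Pair every two players who share the same (team_id, match_id) appearance;
# the pair's EdgeID is derived from team_id and match_id
edges = [
    {"src": a, "dst": b, "edge_id": f"{team}_{match}"}
    for (team, match), grp in lineup.groupby(["team_id", "match_id"])
    for a, b in combinations(sorted(grp["player_id"]), 2)
]
```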
&lt;h4&gt;
  
  
  Player Event
&lt;/h4&gt;

&lt;p&gt;This dataset has all the information regarding the players action within a match. The key columns that we will use are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;player_id&lt;/code&gt; - identifies each &lt;code&gt;Player&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;pass_type&lt;/code&gt; - identifies the event in the match (we will select only the &lt;code&gt;pass&lt;/code&gt; event)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;pass_recipient&lt;/code&gt; - will identify the recipient of the pass&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Additional properties will enrich the Nodes and Edges.&lt;/p&gt;
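&lt;p&gt;As a hedged illustration of how the pass events become edges (toy rows; the column names follow the description above, and the real pipeline runs this in spark):&lt;/p&gt;

```python
import pandas as pd

# Toy event rows standing in for the StatsBomb events data
events = pd.DataFrame({
    "player_id": [10, 10, 7],
    "pass_type": ["pass", "shot", "pass"],
    "pass_recipient": [7.0, None, 10.0],
})

# Keep only the pass events, then rename to a source/target edge schema
passes = events[events["pass_type"] == "pass"]
edges = passes.rename(columns={"player_id": "src", "pass_recipient": "dst"})[["src", "dst"]]
```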
&lt;h3&gt;
  
  
  Let's  play  Lego
&lt;/h3&gt;

&lt;p&gt;Let's see how we can use the &lt;code&gt;dagster&lt;/code&gt;'s  blocks &lt;/p&gt;

&lt;p&gt;Since we are working with &lt;em&gt;BigData&lt;/em&gt; we will be working with &lt;code&gt;spark&lt;/code&gt; (we also implemented some of the workflow on &lt;code&gt;Dask&lt;/code&gt; - but will keep this for a future post).&lt;br&gt;&lt;br&gt;
We will need to tell &lt;code&gt;dagster&lt;/code&gt; what resources we are going to use. In dagster's environment everything is very modular, so we can define our resources with a YAML file &lt;a href="https://docs.dagster.io/docs/apidocs/pipeline#dagster.resource" rel="noopener noreferrer"&gt;see api resource documentation&lt;/a&gt;.    &lt;/p&gt;

&lt;p&gt;In our env_yaml folder we have the following &lt;code&gt;env_yaml/resources.yaml&lt;/code&gt; file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;resources:  
    spark:  
      config:  
        spark_conf:  
          spark:  
            master: "local"  
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Additional configuration for &lt;code&gt;spark&lt;/code&gt; can be included under &lt;code&gt;spark_conf&lt;/code&gt;. For our workload we added, for example, the following parameters:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;spark.executor.memoryOverhead: "16G"&lt;br&gt;
spark.driver.memory: "8G"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Obviously for the example in this post there is no need to add any additional parameters. &lt;/p&gt;

&lt;h3&gt;
  
  
  Solid Intro
&lt;/h3&gt;

&lt;p&gt;Let's review a simple &lt;code&gt;solid&lt;/code&gt;, such as a &lt;code&gt;show&lt;/code&gt; function for a spark dataframe:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;@solid(
    config_schema={
        "num_rows": Field(
            Int, is_required=False, default_value=5, description=("Number of rows to display"),
        ),
    }
)
def show(context, df: DataFrame):
    num_rows = context.solid_config.get("num_rows")
    context.log.info(f"df.show():\n{df._jdf.showString(num_rows, 20, False)}")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;@solid()&lt;/code&gt; decorator above the function converts the function into a &lt;code&gt;solid&lt;/code&gt; so dagster can utilize/ingest it.&lt;br&gt;&lt;br&gt;
The &lt;a href="https://docs.dagster.io/docs/apidocs/solids" rel="noopener noreferrer"&gt;solid decorator&lt;/a&gt;  can take several parameters. In this &lt;code&gt;solid&lt;/code&gt; we are just using    &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href=""&gt;config_schema&lt;/a&gt; which consists of:

&lt;ul&gt;
&lt;li&gt;
&lt;a href=""&gt;Field&lt;/a&gt; "num_rows" which will determine the number of rows to print out with the following properties:

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;Int&lt;/code&gt; type (see list of available &lt;a href="https://docs.dagster.io/docs/apidocs/types" rel="noopener noreferrer"&gt;types&lt;/a&gt;).
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;is_required&lt;/code&gt; (boolean) determines if this parameter is required
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;default_value&lt;/code&gt; for the number of rows to show
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;description&lt;/code&gt; of the Field parameter.
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The config_schema assists in checking the validity of our pipeline.&lt;/p&gt;
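&lt;p&gt;For example, a run config supplying the field could look like this (assuming the solid is registered under the name &lt;code&gt;show&lt;/code&gt; in the pipeline):&lt;/p&gt;

```yaml
solids:
  show:
    config:
      num_rows: 10
```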

&lt;blockquote&gt;
&lt;p&gt;def show(context, df: DataFrame):  &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Our function receives the &lt;code&gt;context&lt;/code&gt; of the solid (supplementing the function with some additional inputs), in addition to a parameter that it will receive from the pipeline (which is a DataFrame).   &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;num_rows = context.solid_config.get("num_rows")  &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;When the function is executed we get the &lt;code&gt;num_rows&lt;/code&gt; parameter   &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;context.log.info(f"df.show():\n{df._jdf.showString(num_rows, 20, False)}")   &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Then we are using the &lt;a href="https://docs.dagster.io/docs/apidocs/internals#dagster.DagsterLogManager" rel="noopener noreferrer"&gt;internal dagster logging&lt;/a&gt; to print out the first rows of the DataFrame.  &lt;/p&gt;
&lt;h3&gt;
  
  
  Solid  Cont.
&lt;/h3&gt;

&lt;p&gt;Now let's delve deeper into a more complex &lt;code&gt;solid&lt;/code&gt;, such as &lt;code&gt;read_file&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;@solid(
    output_defs=[OutputDefinition(dagster_type=DataFrame, name="df")],
    required_resource_keys={"spark"},
    config_schema={
        "path": Field(
            Any,
            is_required=True,
            description=(
                "String or a list of string for file-system backed data sources."
            ),
        ),
        "dtype": Field(
            list,
            is_required=False,
            description='Dictionary with column types e.g. {"col_name": "string"}.',
        ),
        "format": Field(
            String,
            default_value="csv",
            is_required=False,
            description='String for the format of the data source. Defaults to "csv".',
        ),
        "options": Field(
            Permissive(
                fields={
                    "inferSchema": Field(Bool, is_required=False),
                    "sep": Field(String, is_required=False),
                    "header": Field(Bool, is_required=False),
                    "encoding": Field(String, is_required=False),
                }
            ),
            is_required=False,
        ),
    },
)
def read_file(context) -&amp;gt; DataFrame:
    path = context.solid_config["path"]
    dtype = context.solid_config.get("dtype")
    _format = context.solid_config.get("format")
    options = context.solid_config.get("options", {})
    context.log.debug(
        f"read_file: path={path}, dtype={dtype}, _format={_format}, options={options}, "
    )
    spark = context.resources.spark.spark_session
    if dtype:
        df = (
            spark.read.format(_format)
            .options(**options)
            .schema(transform.create_schema(dtype))
            .load(path)
        )
    else:
        df = spark.read.format(_format).options(**options).load(path)

    yield Output(df, "df")


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's now break it down.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;code&gt;solid&lt;/code&gt; decorator
&lt;/h4&gt;

&lt;p&gt;In this &lt;a href="https://docs.dagster.io/docs/apidocs/solids#dagster.SolidDefinition" rel="noopener noreferrer"&gt;&lt;code&gt;solid&lt;/code&gt; decorator&lt;/a&gt; we have some additional parameters:     &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;output_defs=[OutputDefinition(dagster_type=DataFrame, name="df")],  &lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
&lt;li&gt;output_defs defines a list of &lt;a href="https://docs.dagster.io/docs/apidocs/solids#dagster.OutputDefinition" rel="noopener noreferrer"&gt;OutputDefinition&lt;/a&gt;s for the &lt;code&gt;solid&lt;/code&gt;.  In our case the output will be a &lt;code&gt;dataframe&lt;/code&gt; that will be consumed by other &lt;code&gt;solid&lt;/code&gt;s in the pipeline.  The &lt;code&gt;name&lt;/code&gt; in the OutputDefinition is how other &lt;code&gt;solid&lt;/code&gt;s refer to this output.
&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;required_resource_keys={"spark"}  &lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;a href="https://docs.dagster.io/docs/apidocs/solids#dagster.SolidDefinition" rel="noopener noreferrer"&gt;resources&lt;/a&gt; that the solid requires in order to execute.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; config_schema={  
         "path": Field(  
        ...          
        "options": Field(  
            Permissive(   
                fields={  
                    "inferSchema": Field(Bool, is_required=False),  
                    "sep": Field(String, is_required=False),  
                    "header": Field(Bool, is_required=False),  
                    "encoding": Field(String, is_required=False),  
                }  
            ),  
            is_required=False,  
        ),  

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;a href="("&gt;config_schema&lt;/a&gt; - similar to the explanation above.

&lt;ul&gt;
&lt;li&gt;In this &lt;code&gt;solid&lt;/code&gt; we also have a &lt;code&gt;Permissive&lt;/code&gt; Field type - a dictionary that accepts various optional parameters for reading in the file.
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
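&lt;p&gt;To make this concrete, here is a minimal sketch (plain Python, mirroring the YAML run config) of configuration that would satisfy this &lt;code&gt;config_schema&lt;/code&gt;. The file path and option values are hypothetical; the point is that a &lt;code&gt;Permissive&lt;/code&gt; field also accepts extra keys beyond the four it declares:&lt;/p&gt;

```python
# Hypothetical run config for the read_file solid, expressed as a
# Python dict (in practice this usually lives in a YAML file).
run_config = {
    "solids": {
        "read_file": {
            "config": {
                "path": "data/raw/events.csv",  # required field
                "format": "csv",                # optional, defaults to "csv"
                "options": {
                    "header": True,             # declared in the Permissive fields
                    "sep": ",",
                    "quote": '"',               # extra key: allowed because the
                },                              # field type is Permissive
            }
        }
    }
}

# The Permissive field passes undeclared keys through untouched.
options = run_config["solids"]["read_file"]["config"]["options"]
print(sorted(options))  # ['header', 'quote', 'sep']
```

&lt;p&gt;With a strict &lt;code&gt;Shape&lt;/code&gt; instead of &lt;code&gt;Permissive&lt;/code&gt;, the extra &lt;code&gt;quote&lt;/code&gt; key would fail config validation before the run starts.&lt;/p&gt;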

&lt;h4&gt;
  
  
  solid in action
&lt;/h4&gt;

&lt;p&gt;Now let's look at what the &lt;code&gt;solid&lt;/code&gt; does.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;def read_file(context) -&amp;gt; DataFrame:  &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Every solid has a &lt;code&gt;context&lt;/code&gt;, which is a collection of information provided by the system, such as the parameters provided within the &lt;code&gt;config_schema&lt;/code&gt;.  In this solid there is no additional input parameter (compared to the &lt;code&gt;show&lt;/code&gt; solid above), since this is the starting point of the pipeline. &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;path = context.solid_config["path"]&lt;br&gt;&lt;br&gt;
   dtype = context.solid_config.get("dtype")&lt;br&gt;&lt;br&gt;
   _format = context.solid_config.get("format")&lt;br&gt;&lt;br&gt;
   options = context.solid_config.get("options", {})     &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In order to obtain the values from the &lt;code&gt;context&lt;/code&gt; we can use its &lt;code&gt;solid_config&lt;/code&gt; attribute.  &lt;/p&gt;
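&lt;p&gt;The access pattern above can be sketched with a plain dict, since &lt;code&gt;solid_config&lt;/code&gt; behaves like a dict of the values supplied under &lt;code&gt;config_schema&lt;/code&gt; (the path here is hypothetical). Required fields can be indexed directly; optional fields may be absent, hence &lt;code&gt;.get()&lt;/code&gt;:&lt;/p&gt;

```python
# Stand-in for context.solid_config when only "path" and "format"
# were supplied in the run config.
solid_config = {"path": "data/raw/events.csv", "format": "csv"}

path = solid_config["path"]                # required -- indexing is safe
dtype = solid_config.get("dtype")          # optional -- None when omitted
options = solid_config.get("options", {})  # optional with a default value

print(dtype is None, options)  # True {}
```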

&lt;blockquote&gt;
&lt;p&gt;context.log.debug(f"read_file: path={path}, dtype={dtype}, _format={_format}, options={options}, ")  &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Dagster comes with a &lt;a href="https://docs.dagster.io/docs/apidocs/pipeline#dagster.logger" rel="noopener noreferrer"&gt;built-in logger&lt;/a&gt; that tracks all the events in the pipeline. In addition, you are able to add any logs that you require.  &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;spark = context.resources.spark.spark_session  &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Normally we would get the &lt;code&gt;spark_session&lt;/code&gt; by importing &lt;a href="http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.SparkSession" rel="noopener noreferrer"&gt;pyspark.sql.SparkSession&lt;/a&gt;; however, since we already configured our &lt;code&gt;resources&lt;/code&gt;, we will get our session from the &lt;code&gt;context.resources&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    if dtype:
        df = (
            spark.read.format(_format)
            .options(**options)
            .schema(transform.create_schema(dtype))
            .load(path)
        )
    else:
        df = spark.read.format(_format).options(**options).load(path)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now that we have everything in place we can run the basic code to read the data into spark.  &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;yield Output(df, "df")  &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Finally the function yields an &lt;a href="https://docs.dagster.io/docs/apidocs/solids#dagster.Output" rel="noopener noreferrer"&gt;Output&lt;/a&gt; that can be consumed by the other solids in the pipeline.  &lt;/p&gt;

&lt;h3&gt;
  
  
  Composite Solids
&lt;/h3&gt;

&lt;p&gt;A single &lt;code&gt;solid&lt;/code&gt; executes a single computation; however, when we want to create dependencies between solids we can use a &lt;a href="https://docs.dagster.io/docs/apidocs/solids#dagster.composite_solid" rel="noopener noreferrer"&gt;composite_solid&lt;/a&gt;.    &lt;/p&gt;

&lt;p&gt;Here is a screenshot of the &lt;code&gt;composite_solid&lt;/code&gt; from &lt;a href="https://docs.dagster.io/tutorial/execute#executing-our-first-pipeline" rel="noopener noreferrer"&gt;dagit&lt;/a&gt;&lt;br&gt;&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fkyn1ylmt24qcnhacss2g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fkyn1ylmt24qcnhacss2g.png" alt="pass_to composite solid"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's review the &lt;code&gt;passed_to&lt;/code&gt; composite_solid:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;@composite_solid(
    output_defs=(
        [
            OutputDefinition(name="df_edges", dagster_type=DataFrame),
            OutputDefinition(name="df_nodes", dagster_type=DataFrame),
        ]
    ),
)
def passed_to():
    df_edges_disc = solids_transform.lit.alias("add_col_label1")(
        solids_transform.lit.alias("add_col_label2")(
            solids_transform.astype(
                solids_transform.rename_cols(
                    solids_transform.drop_cols(
                        solids_transform.dropna(solids_utils.read_file())
                    )
                )
            )
        )
    )
    solids_utils.show(df_edges_disc)
    df_edges, df_nodes = solids_edges.edges_agg(df_edges_disc)
    solids_utils.save_header.alias("save_header_edges")(
        solids_transform.rename_cols.alias("rename_cols_neo4j")(df_edges)
    )
    return df_edges, df_nodes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The &lt;a href="https://docs.dagster.io/docs/apidocs/solids#dagster.composite_solid" rel="noopener noreferrer"&gt;@composite_solid&lt;/a&gt; is very similar to the &lt;a href="https://docs.dagster.io/docs/apidocs/solids#dagster.SolidDefinition" rel="noopener noreferrer"&gt;@solid&lt;/a&gt; decorator.  &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Here we can see how the solids are nested within each other.  Every &lt;code&gt;solid&lt;/code&gt; has an input of a &lt;code&gt;DataFrame&lt;/code&gt; (except for the &lt;code&gt;read_file&lt;/code&gt; solid), and every solid has an &lt;code&gt;Output&lt;/code&gt; &lt;code&gt;DataFrame&lt;/code&gt;.  &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The &lt;code&gt;alias&lt;/code&gt; allows calling the same &lt;code&gt;solid&lt;/code&gt; several times within a single &lt;a href="https://docs.dagster.io/docs/apidocs/solids#dagster.composite_solid" rel="noopener noreferrer"&gt;composite_solid&lt;/a&gt;.  &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A solid can return several outputs (which need to be defined in the solid decorator under the &lt;code&gt;output_defs&lt;/code&gt; parameter).&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
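&lt;p&gt;The multiple-output pattern in the last bullet can be sketched without Dagster installed by using a tiny stand-in for &lt;code&gt;dagster.Output&lt;/code&gt; (in real code you would import &lt;code&gt;Output&lt;/code&gt; from dagster itself, and the computations here are placeholders):&lt;/p&gt;

```python
from collections import namedtuple

# Stand-in for dagster.Output -- in real code: from dagster import Output
Output = namedtuple("Output", ["value", "output_name"])

def edges_agg(df_edges_disc):
    """Sketch of a two-output solid: it yields one Output per declared
    OutputDefinition, matched to downstream consumers by output_name."""
    df_edges = f"aggregated({df_edges_disc})"  # placeholder computations
    df_nodes = f"nodes({df_edges_disc})"
    yield Output(df_edges, "df_edges")
    yield Output(df_nodes, "df_nodes")

# Dagster routes each named Output to whichever solid consumes it.
outputs = {o.output_name: o.value for o in edges_agg("df")}
print(sorted(outputs))  # ['df_edges', 'df_nodes']
```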

&lt;h3&gt;
  
  
  Pipeline
&lt;/h3&gt;

&lt;p&gt;Finally we can put everything together in our pipeline. A &lt;a href="https://docs.dagster.io/docs/apidocs/pipeline#dagster.pipeline" rel="noopener noreferrer"&gt;pipeline&lt;/a&gt; builds up a dependency graph between the solids/composite_solids.   &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Ffdxdmavtjcmoblfkf82d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Ffdxdmavtjcmoblfkf82d.png" alt="pipeline"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;@pipeline(mode_defs=[ModeDefinition(resource_defs={"spark": pyspark_resource})])
def statsbomb_pipeline():
    passed_to_edges, passed_to_nodes = passed_to()
    played_together_edges, played_together_nodes = played_together()
    played_in_edges, played_in_nodes = played_in()
    create_neo4j_db(dfs_nodes=[passed_to_nodes, played_together_nodes, played_in_nodes])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In our pipeline, we have three &lt;code&gt;composite_solid&lt;/code&gt;s, each of which outputs two &lt;code&gt;DataFrame&lt;/code&gt; objects.&lt;br&gt;&lt;br&gt;
Our final &lt;code&gt;create_neo4j_db&lt;/code&gt; &lt;code&gt;composite_solid&lt;/code&gt; is dependent on the outputs of the three prior &lt;code&gt;composite_solid&lt;/code&gt;s; it executes a solid to generate the node files, in addition to executing a &lt;code&gt;neo4j-admin.bat&lt;/code&gt; script to bulk import the data into the database.  &lt;/p&gt;
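&lt;p&gt;A run of a pipeline like this is driven by a run config that supplies both the resource config and the per-solid config. Here is a hedged sketch of its shape as a Python dict (all paths, aliases, and values are hypothetical; composite solids nest their children's config under a &lt;code&gt;solids&lt;/code&gt; key):&lt;/p&gt;

```python
# Hypothetical run config wiring the pipeline's resources and solids together.
run_config = {
    "resources": {
        "spark": {"config": {"spark_conf": {"spark.executor.memory": "4g"}}}
    },
    "solids": {
        "passed_to": {
            "solids": {  # composite solids nest their children's config
                "read_file": {"config": {"path": "data/raw/events.csv"}},
            }
        }
    },
}

# With dagster installed, the pipeline would then be launched with:
# from dagster import execute_pipeline
# execute_pipeline(statsbomb_pipeline, run_config=run_config)
print("passed_to" in run_config["solids"])  # True
```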

&lt;h3&gt;
  
  
  Dagit
&lt;/h3&gt;

&lt;p&gt;The DAG output of the run can be viewed with &lt;a href="https://docs.dagster.io/tutorial/execute#executing-our-first-pipeline" rel="noopener noreferrer"&gt;Dagit&lt;/a&gt; (&lt;code&gt;dagster&lt;/code&gt;'s UI).  This allows reviewing the various steps in the pipeline and getting additional information on the various solids' tasks.&lt;br&gt;&lt;br&gt;
Below is an image from one of our pipelines:&lt;br&gt;&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F9o9vyebqigutm4eov3j9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F9o9vyebqigutm4eov3j9.png" alt="dagit run screenshot"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Into Neo4j
&lt;/h2&gt;

&lt;p&gt;The results of the pipeline can be imported into &lt;code&gt;neo4j&lt;/code&gt;.&lt;br&gt;&lt;br&gt;
In order to import &lt;em&gt;big data&lt;/em&gt; in an optimal manner we will use the &lt;a href="https://neo4j.com/docs/operations-manual/4.1/tools/import/" rel="noopener noreferrer"&gt;batch import&lt;/a&gt; admin tool.  This allows for loading tens of millions of nodes and billions of relationships in a reasonable time.  &lt;/p&gt;
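&lt;p&gt;Driving the batch import from the pipeline amounts to assembling a &lt;code&gt;neo4j-admin import&lt;/code&gt; command line and running it; a hedged sketch from Python (the CSV paths are hypothetical, and the exact flags depend on your neo4j version):&lt;/p&gt;

```python
import subprocess

# Sketch of how the bulk import might be driven from Python.
cmd = [
    "neo4j-admin", "import",
    "--nodes=data/processed/nodes_header.csv,data/processed/nodes.csv",
    "--relationships=data/processed/edges_header.csv,data/processed/edges.csv",
]

print(cmd[0])  # neo4j-admin
# subprocess.run(cmd, check=True)  # uncomment on a machine with neo4j installed
```

&lt;p&gt;The header/data CSV split is what the &lt;code&gt;save_header&lt;/code&gt; solids in the composite solid prepare for.&lt;/p&gt;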

&lt;p&gt;The result of the &lt;em&gt;import&lt;/em&gt; command is a &lt;code&gt;Neo4j&lt;/code&gt; database that can be loaded from &lt;a href="//data/processed/stats_player.db.zip"&gt;data/processed/&lt;/a&gt;.  To load the database you can use &lt;a href="https://neo4j.com/docs/operations-manual/3.5/tools/dump-load/" rel="noopener noreferrer"&gt;neo4j-admin load&lt;/a&gt; command.&lt;br&gt;&lt;br&gt;
Note that the neo4j database is in version 3.X&lt;/p&gt;

&lt;p&gt;Here is a screenshot for Liverpool's striker &lt;a href="https://en.wikipedia.org/wiki/Mohamed_Salah" rel="noopener noreferrer"&gt;Mohamed Salah&lt;/a&gt;:  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Ftcqtit8k9sffqz89rk7h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Ftcqtit8k9sffqz89rk7h.png" alt="graph Mohamed Salah"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Some Tips
&lt;/h2&gt;

&lt;p&gt;Once we managed to &lt;em&gt;grok&lt;/em&gt; &lt;code&gt;dagster&lt;/code&gt; and wrap our pyspark functions, our workflow was quite productive.  Here are some tips to make your onboarding easier:  &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;When working with Spark:

&lt;ol&gt;
&lt;li&gt;Run a &lt;a href="https://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#cache" rel="noopener noreferrer"&gt;cache&lt;/a&gt; when returning a &lt;code&gt;spark dataframe&lt;/code&gt; from a &lt;code&gt;solid&lt;/code&gt;. This will prevent re-running parts of the &lt;code&gt;DAG&lt;/code&gt; in a complex pipeline that has multiple outputs.
&lt;/li&gt;
&lt;li&gt;Since we had various checkpoints where we needed to dump our datasets, we found that when &lt;code&gt;spark&lt;/code&gt; performed unexpectedly, breaking up the pipeline by reading back the output file (instead of passing on the dataframe object) allowed spark to manage its resources in an optimal manner.&lt;/li&gt;
&lt;li&gt;Since our environment is on a CDH cluster, the iteration of building a pipeline was faster when combined with a &lt;code&gt;jupyter notebook&lt;/code&gt; that would implement each step in the &lt;code&gt;composite solid&lt;/code&gt;.
&lt;/li&gt;
&lt;/ol&gt;


&lt;/li&gt;

&lt;li&gt;Dagster:

&lt;ol&gt;
&lt;li&gt;Remember to set &lt;a href="https://docs.dagster.io/overview/instances/dagster-instance" rel="noopener noreferrer"&gt;DAGSTER_HOME&lt;/a&gt; once the pipeline is no longer a playground (in order to log the runs etc.; otherwise each run is ephemeral).
&lt;/li&gt;
&lt;/ol&gt;


&lt;/li&gt;

&lt;/ol&gt;
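&lt;p&gt;The caching tip can be illustrated without Spark: like a Spark DataFrame, a lazily evaluated result is recomputed every time a downstream consumer forces it, unless it is cached. A stdlib analogy of the same effect (the function names are illustrative stand-ins):&lt;/p&gt;

```python
from functools import lru_cache

calls = {"n": 0}

def expensive_transform():
    # Stands in for the work behind an uncached Spark DataFrame.
    calls["n"] += 1
    return [1, 2, 3]

@lru_cache(maxsize=1)
def cached_transform():
    # Stands in for df.cache(): computed once, reused by every consumer.
    return tuple(expensive_transform())

# Two downstream "solids" each force the result:
uncached = [expensive_transform() for _ in range(2)]  # recomputed twice
calls["n"] = 0
cached = [cached_transform() for _ in range(2)]       # computed only once
print(calls["n"])  # 1
```

&lt;p&gt;In a pipeline with several outputs hanging off one dataframe, that repeated recomputation is exactly what &lt;code&gt;df.cache()&lt;/code&gt; avoids.&lt;/p&gt;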

&lt;h2&gt;
  
  
  What else....
&lt;/h2&gt;

&lt;p&gt;Dagster has several additional components that can upgrade the pipeline in a significant manner. These include, among others, &lt;a href="https://docs.dagster.io/docs/learn/guides/testing/testing" rel="noopener noreferrer"&gt;Test framework&lt;/a&gt;,  &lt;a href="https://docs.dagster.io/docs/apidocs/solids#dagster.Materialization" rel="noopener noreferrer"&gt;Materialization&lt;/a&gt; for persistent artifacts in the pipeline and  a  &lt;a href="https://docs.dagster.io/docs/apidocs/schedules#dagster.schedule" rel="noopener noreferrer"&gt;scheduler&lt;/a&gt;.  Unfortunately these topics are outside the scope of this post.  &lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;In this post we described our workflow for generating a graph from separate data sources.&lt;br&gt;&lt;br&gt;
As our project matured, we needed to stabilize our workflow, thus migrating our ad-hoc script scaffolds into &lt;code&gt;Dagster&lt;/code&gt;'s framework.  In this process we were able to improve the quality of our pipeline and enable new data sources to be quickly integrated into our product in a frictionless manner.&lt;br&gt;&lt;br&gt;
We hope that this post will inspire you to upgrade your workflow.  &lt;/p&gt;

&lt;p&gt;Please feel free to contact us directly if you have any questions&lt;br&gt;&lt;br&gt;
&lt;a href="https://www.linkedin.com/in/davidkatz-il/" rel="noopener noreferrer"&gt;David Katz&lt;/a&gt;&lt;br&gt;&lt;br&gt;
&lt;a href="https://www.linkedin.com/in/berrygis/" rel="noopener noreferrer"&gt;Sephi Berry&lt;/a&gt;  &lt;/p&gt;

</description>
      <category>dagster</category>
      <category>dataengineering</category>
      <category>pipeline</category>
      <category>graph</category>
    </item>
  </channel>
</rss>
