<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Sephi Berry</title>
    <description>The latest articles on DEV Community by Sephi Berry (@sephib).</description>
    <link>https://dev.to/sephib</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F427607%2F195bb67b-b142-4cac-a00e-d5278da27f5f.png</url>
      <title>DEV Community: Sephi Berry</title>
      <link>https://dev.to/sephib</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/sephib"/>
    <language>en</language>
    <item>
      <title>ML Configuration Management</title>
      <dc:creator>Sephi Berry</dc:creator>
      <pubDate>Thu, 28 Jul 2022 10:40:00 +0000</pubDate>
      <link>https://dev.to/artlist/ml-configuration-management-4hde</link>
      <guid>https://dev.to/artlist/ml-configuration-management-4hde</guid>
<description>&lt;p&gt;This post follows our blog describing our &lt;a href="https://dev.to/artlist/lessons-learned-on-the-road-to-mlops-22lj"&gt;ML Ops manifest&lt;/a&gt;. In this post we dive into the configuration management within our ML projects.&lt;/p&gt;

&lt;h2&gt;
  
  
  Background
&lt;/h2&gt;

&lt;p&gt;Working in a real-world business environment requires moving code between research/development, test and production environments, and doing so smoothly is crucial for development velocity. It is important to maintain a common language &amp;amp; standards between the various AI &amp;amp; Development teams for frictionless deployment of code. Additionally, configuration management helps because:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ML work involves many parameters, hyperparameters etc.&lt;/li&gt;
&lt;li&gt;we want to separate config from code (12 factor app)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These days, it comes as no surprise that there are several open-source configuration frameworks that can be utilized for this. After reviewing several options (including &lt;a href="https://hydra.cc/" rel="noopener noreferrer"&gt;Hydra&lt;/a&gt;), we decided on &lt;a href="https://www.dynaconf.com/" rel="noopener noreferrer"&gt;dynaconf&lt;/a&gt;, since it fulfilled our requirements of being:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Python based&lt;/li&gt;
&lt;li&gt;Simple&lt;/li&gt;
&lt;li&gt;Easily configurable and extendable&lt;/li&gt;
&lt;li&gt;Able to override and cascade settings&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbljmsryuxo7o4vyp6q2k.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbljmsryuxo7o4vyp6q2k.jpeg" alt="Moving from Dev to Prod"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Development Environment
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.artlsit.io" rel="noopener noreferrer"&gt;Artlist&lt;/a&gt; runs in multiple cloud environments, however currently most of the ML workloads run on GCP. Following GCP best practices, we have set up different projects for each environment, thus allowing for strict isolation between them, in addition to enabling billing segmentation. This separation needs to be easily propagated into the configuration, for seamless code execution.&lt;/p&gt;

&lt;p&gt;In this post we review our configuration in relation to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Basic implementation&lt;/li&gt;
&lt;li&gt;
Advanced Templating
&lt;/li&gt;
&lt;li&gt;Simple Overriding&lt;/li&gt;
&lt;li&gt;Project vs. Module settings&lt;/li&gt;
&lt;li&gt;Updating Configurations&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Now let’s see how &lt;code&gt;dynaconf&lt;/code&gt; can help out with this.  &lt;/p&gt;

&lt;h3&gt;
  
  
  Basic Implementation
&lt;/h3&gt;

&lt;p&gt;We decided to work with configuration settings that are stored in external &lt;code&gt;toml&lt;/code&gt; files, which are easily readable and are becoming one of the &lt;a href="https://peps.python.org/pep-0680/" rel="noopener noreferrer"&gt;de-facto standards&lt;/a&gt; in python.  &lt;/p&gt;

&lt;p&gt;A code snippet from our basic configuration file is as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="nn"&gt;[default]&lt;/span&gt;
&lt;span class="py"&gt;PROJECT_ID&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"@format artlist-{this.current_env}"&lt;/span&gt;
&lt;span class="py"&gt;BASE_NAME&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="err"&gt;“my_feature_name”&lt;/span&gt;
&lt;span class="py"&gt;BASE_PIPELINE_NAME_GCP&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"@jinja {this.BASE_NAME | replace('_', '-')}"&lt;/span&gt;
&lt;span class="py"&gt;BUCKET_NAME&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"@format {this.BASE_NAME}--{this.current_env}"&lt;/span&gt;

&lt;span class="nn"&gt;[dev]&lt;/span&gt;
&lt;span class="py"&gt;SERVICE_ACCOUNT&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"service-account1@artlist-dev.iam.gserviceaccount.com"&lt;/span&gt;

&lt;span class="nn"&gt;[tst]&lt;/span&gt;
&lt;span class="py"&gt;SERVICE_ACCOUNT&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"service-account2@artlist-tst.iam.gserviceaccount.com"&lt;/span&gt;

&lt;span class="nn"&gt;[prd]&lt;/span&gt;
&lt;span class="py"&gt;SERVICE_ACCOUNT&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"service-account3@artlist-prd.iam.gserviceaccount.com"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now let's break it down.&lt;br&gt;&lt;br&gt;
Whenever &lt;em&gt;dynaconf&lt;/em&gt; runs, it runs in a specific environment. The default environment is called DEVELOPMENT. However, since we wanted to move easily between the environments (and GCP resources), we changed the naming convention of the environments to three-letter acronyms (dev, tst, prd), so we can readily reference the relevant GCP project while specifying the environment.&lt;br&gt;&lt;br&gt;
Using the &lt;a href="https://www.dynaconf.com/configuration/#env_switcher" rel="noopener noreferrer"&gt;&lt;code&gt;env_switcher&lt;/code&gt;&lt;/a&gt;, we can indicate to &lt;em&gt;dynaconf&lt;/em&gt; which configuration to load and which GCP project to access with the following line: &lt;br&gt;
 &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;PROJECT_ID = "@format artlist-{this.current_env}"&lt;br&gt;&lt;br&gt;
&lt;/p&gt;


&lt;/blockquote&gt;

&lt;p&gt;Using &lt;code&gt;@format&lt;/code&gt; as a prefix to the string, we can interpolate the parameter that is within the curly brackets. For example, if the current environment is set to ‘dev’, the PROJECT_ID variable will be &lt;code&gt;artlist-dev&lt;/code&gt;, thus accessing only the resources from the &lt;code&gt;dev&lt;/code&gt; project, whereas if the environment is set to ‘prd’, the PROJECT_ID will be &lt;code&gt;artlist-prd&lt;/code&gt;.  &lt;/p&gt;

&lt;p&gt;Accessing the rest of the relevant variables is based on the &lt;a href="https://github.com/toml-lang/toml" rel="noopener noreferrer"&gt;various sections in the toml&lt;/a&gt; file.&lt;br&gt;&lt;br&gt;
For example, the &lt;em&gt;Production Service Account&lt;/em&gt; (SA) is referenced by accessing the &lt;code&gt;SERVICE_ACCOUNT&lt;/code&gt; variable under the [prd] section.&lt;/p&gt;
&lt;h3&gt;
  
  
Advanced Templating
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;Dynaconf&lt;/em&gt; includes the ability to work with &lt;a href="https://jinja.palletsprojects.com/" rel="noopener noreferrer"&gt;&lt;code&gt;Jinja&lt;/code&gt; templating&lt;/a&gt;, which can be useful for manipulating strings. GCP has a quirk that requires container names within the GCP registry to use ‘-’ (hyphen) rather than ‘_’ (underscore) as separators. Since we wanted to sync our registry with the artifacts coming out of &lt;a href="https://cloud.google.com/vertex-ai/docs/pipelines/introduction" rel="noopener noreferrer"&gt;&lt;em&gt;Vertex AI pipelines&lt;/em&gt;&lt;/a&gt; (which are stored within buckets / Cloud Storage), we were able to keep the python naming convention of ‘_’ while converting the strings to the GCP convention when required.&lt;br&gt;&lt;br&gt;
Using &lt;em&gt;jinja&lt;/em&gt;’s &lt;a href="https://jinja.palletsprojects.com/en/3.0.x/templates/?highlight=replace#jinja-filters.replace" rel="noopener noreferrer"&gt;text replace&lt;/a&gt; filter we can easily alter the text as necessary:&lt;/p&gt;



&lt;blockquote&gt;
&lt;p&gt;BASE_PIPELINE_NAME_GCP = "@jinja {this.BASE_NAME | replace('_', '-')}"&lt;br&gt;&lt;br&gt;
&lt;/p&gt;


&lt;/blockquote&gt;
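The same filter can be exercised directly with the jinja2 library, outside of dynaconf; a minimal sketch with a made-up feature name:

```python
from jinja2 import Template

# GCP registry names want '-', python identifiers use '_'.
base_name = "my_feature_name"
gcp_name = Template("{{ name | replace('_', '-') }}").render(name=base_name)
print(gcp_name)  # my-feature-name
```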

&lt;h3&gt;
  
  
  Simple Overriding
&lt;/h3&gt;

&lt;p&gt;Another useful feature of &lt;code&gt;dynaconf&lt;/code&gt; is that you can easily override the configuration using local settings. This is very convenient since local settings for development don’t need to be checked into source control, while the general settings should be synced to the entire team.&lt;br&gt;&lt;br&gt;
All that is required to differentiate between the settings is to add the &lt;code&gt;.local&lt;/code&gt; &lt;a href="https://docs.python.org/3/library/pathlib.html#pathlib.PurePath.suffix" rel="noopener noreferrer"&gt;suffix&lt;/a&gt; to the file name, as in the example below:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;General Settings - settings.toml
&lt;/li&gt;
&lt;li&gt;Local Settings - settings.local.toml
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Whenever &lt;code&gt;dynaconf&lt;/code&gt; identifies the suffix &lt;code&gt;.local.toml&lt;/code&gt;, it will override the values defined in &lt;code&gt;settings.toml&lt;/code&gt; with those loaded from the &lt;code&gt;settings.local.toml&lt;/code&gt; file.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fps21dh0nqxqdkgsqeyfi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fps21dh0nqxqdkgsqeyfi.png" alt="overriding example"&gt;&lt;/a&gt;&lt;br&gt;
An example of overriding with local credentials&lt;/p&gt;
&lt;h3&gt;
  
  
  Project vs. Module settings
&lt;/h3&gt;

&lt;p&gt;Our ML framework is &lt;a href="https://www.kubeflow.org/" rel="noopener noreferrer"&gt;KubeFlow&lt;/a&gt; (hosted by &lt;em&gt;GCP VertexAI pipelines&lt;/em&gt;), which requires various configurations: some at the component level (components are reused independently in various pipelines), while others are at the pipeline/project/cross-component level. To load both settings, we can use another feature of &lt;code&gt;dynaconf&lt;/code&gt;, which lets us define specific file names that will automatically be loaded. Here is our implementation practice:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Any configurations that are at the project level will be written in the project settings - &lt;code&gt;settings_prj.toml&lt;/code&gt; (see &lt;a href="https://www.dynaconf.com/configuration/#settings_file-or-settings_files" rel="noopener noreferrer"&gt;dynaconf settings_files&lt;/a&gt; configuration)&lt;/li&gt;
&lt;li&gt;Any configurations that are at the component level will be written in the component settings - &lt;code&gt;settings.toml&lt;/code&gt; .&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Updating Configurations
&lt;/h3&gt;

&lt;p&gt;Sometimes there is a need to update the configuration at runtime. This can be challenging since the entire configuration is loaded as soon as the library is called. To do so, we can use a decorator to update the configuration. Assuming &lt;code&gt;cfg&lt;/code&gt; is the configuration settings, we can write the following decorator:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@input_to_config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cfg&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;input_to_config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sequence_override&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
   &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;[decorator] override config parameters with function inputs
   Args:
       config: Dynaconf configuration / settings to be updated
       wrapped_func ([function]): [the function to capture it&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s input and push to the config]
       sequence_override: configures the option for overriding keys or merging the values
   &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

   &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;decorator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;wrapped_func&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
       &lt;span class="nd"&gt;@wraps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;wrapped_func&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
       &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;inner&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
           &lt;span class="nf"&gt;_override_config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sequence_override&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;sequence_override&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
           &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;wrapped_func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

       &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;inner&lt;/span&gt;

   &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;decorator&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
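Since `_override_config` is not shown above, here is a self-contained sketch of the pattern with a plain dict standing in for the dynaconf settings object (which exposes a similar `update` method); all names here are illustrative:

```python
from functools import wraps

cfg = {"base_name": "my_feature_name"}  # stand-in for the settings object


def _override_config(kwargs, config, sequence_override=True):
    # Push the wrapped function's keyword arguments into the config,
    # so any later code reading `config` sees the runtime values.
    config.update(kwargs)


def input_to_config(config, sequence_override=True):
    def decorator(wrapped_func):
        @wraps(wrapped_func)
        def inner(*args, **kwargs):
            _override_config(kwargs, config, sequence_override=sequence_override)
            return wrapped_func(*args, **kwargs)

        return inner

    return decorator


@input_to_config(cfg)
def run_step(base_name="my_feature_name"):
    # By the time the body runs, cfg already reflects the call's kwargs.
    return cfg["base_name"]


print(run_step(base_name="other_feature"))  # other_feature
print(cfg["base_name"])                     # other_feature
```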



&lt;h1&gt;
  
  
  Summary
&lt;/h1&gt;

&lt;p&gt;In this blog post, we have laid out our configuration implementation using the &lt;code&gt;dynaconf&lt;/code&gt; library. We saw how we:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Used the basic setup of &lt;code&gt;dynaconf&lt;/code&gt; configuration&lt;/li&gt;
&lt;li&gt;Synced our GCP project with &lt;code&gt;dynaconf&lt;/code&gt; environments&lt;/li&gt;
&lt;li&gt;Worked with the advanced &lt;code&gt;dynaconf&lt;/code&gt; settings&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In our next posts, we will extend our description of the various elements that have been incorporated into our ML project workflow while developing our internal base library, which includes standardization of:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Logging&lt;/li&gt;
&lt;li&gt;Accessing our secrets (using &lt;a href="https://cloud.google.com/secret-manager" rel="noopener noreferrer"&gt;GCP Secret Manager&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Experiment tracking (using &lt;a href="https://clear.ml/" rel="noopener noreferrer"&gt;ClearML&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;h6&gt;
  
  
Image copyright
&lt;/h6&gt;

&lt;p&gt;The banner image was co-created using &lt;a href="https://openai.com/dall-e-2/" rel="noopener noreferrer"&gt;DALL·E 2&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>mlops</category>
      <category>dynaconf</category>
      <category>configuration</category>
      <category>ai</category>
    </item>
    <item>
      <title>Design patterns in ML</title>
      <dc:creator>Sephi Berry</dc:creator>
      <pubDate>Wed, 27 Jan 2021 07:31:34 +0000</pubDate>
      <link>https://dev.to/sephib/design-patterns-in-ml-2f8c</link>
      <guid>https://dev.to/sephib/design-patterns-in-ml-2f8c</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--XPCrCrov--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://images-na.ssl-images-amazon.com/images/I/51pSVhMRMkL._SX379_BO1%2C204%2C203%2C200_.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--XPCrCrov--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://images-na.ssl-images-amazon.com/images/I/51pSVhMRMkL._SX379_BO1%2C204%2C203%2C200_.jpg" alt="ML Design Patterns" href="https://www.oreilly.com/library/view/machine-learning-design/9781098115777/" width="381" height="499"&gt;&lt;/a&gt;  &lt;/p&gt;

&lt;p&gt;Listening to &lt;a href="https://www.linkedin.com/in/sara-robinson-40377924/"&gt;Sara Robinson&lt;/a&gt;, co-author of the "Machine Learning Design Patterns" book, on the &lt;a href="https://podcastaddict.com/episode/117142559"&gt;MLOps.community podcast&lt;/a&gt; raised some issues relevant to our growing ML/AI industry in general and to the Data Engineering realm specifically. Although I have not yet read the book, I'm allowing myself to reflect on what I understood from the episode. There are many issues discussed in the book; fortunately, they decided to speak about &lt;em&gt;Workflow Pipelines&lt;/em&gt; (chapter 25), which is dear to my heart, since I believe it is a key element of successful ML projects.&lt;/p&gt;

&lt;p&gt;As an industry, we are still evolving and best practices are still emerging. That said, there are many simple solutions and practices from project management and software development that can easily put an ML project on the right track. Identifying the business values that are currently most relevant is a key component in understanding the various trade-offs in the engineering processes.&lt;/p&gt;

&lt;p&gt;We too enjoy the flexibility of jupyter notebooks, but I disagree with what Sara said about when to transition from a jupyter notebook to a more structured pipeline. Working methodically with templates and clear inputs and outputs for each notebook should be implemented from day one. Breaking up notebooks into one per step and writing down the logical stages in a markdown file is a key component for saving time and for successful collaboration with any team member. This is true even for yourself - there is nothing better than returning to a project after a weekend and getting up and running within a few minutes.  &lt;/p&gt;

&lt;p&gt;Reproducibility is another key component in any ML project. &lt;a href="https://mlflow.org"&gt;MLflow&lt;/a&gt; is our framework of choice for tracking our experiments, and it is mentioned in the book as a tool for creating pipelines. However, presenting &lt;code&gt;MLflow&lt;/code&gt; and &lt;code&gt;Airflow&lt;/code&gt; as the same solution for a &lt;code&gt;Workflow Pipeline&lt;/code&gt; (page 284) doesn't seem right.&lt;br&gt;&lt;br&gt;
In the book they state that the following stages make up an ML pipeline:  &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Data Collection
&lt;/li&gt;
&lt;li&gt;Data Validation
&lt;/li&gt;
&lt;li&gt;Data Processing
&lt;/li&gt;
&lt;li&gt;Model Building
&lt;/li&gt;
&lt;li&gt;Training &amp;amp; Evaluating
&lt;/li&gt;
&lt;li&gt;Model Deployment
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;While &lt;code&gt;MLflow&lt;/code&gt; may be best suited for steps 4-6, &lt;code&gt;Airflow&lt;/code&gt; is probably best suited for steps 1-3. &lt;br&gt;
Here I think it is worthwhile pointing out that there are considerable differences between:  &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Workflow Pipeline (e.g. as described in this chapter - containerizing is the key issue here)
&lt;/li&gt;
&lt;li&gt;Data Pipeline (e.g Airflow or &lt;a href="https://dagster.io/"&gt;Dagster&lt;/a&gt;)
&lt;/li&gt;
&lt;li&gt;Model Pipeline (e.g. &lt;a href="https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html"&gt;scikit-learn pipeline&lt;/a&gt;)
&lt;/li&gt;
&lt;li&gt;ML Tracking Pipeline (e.g. experiment tracking - &lt;a href="https://mlflow.org/docs/latest/tracking.html"&gt;MLflow Tracking&lt;/a&gt; )
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;(A full write-up will come in a future blog.)&lt;br&gt;&lt;br&gt;
I think there are considerable differences between these pipelines, and lumping them together is confusing. Additional information about the complex landscape can be found in the blog post &lt;a href="https://a16z.com/2020/10/15/the-emerging-architectures-for-modern-data-infrastructure/"&gt;Emerging Architectures for Modern Data Infrastructure&lt;/a&gt; by Matt Bornstein, Martin Casado, and Jennifer Li.&lt;/p&gt;

&lt;p&gt;Finally, I totally agree with the excitement conveyed by the participants: the MLOps field is still growing in many directions, and part of the attraction of the field is that we are able to experiment with different methodologies while learning new libraries and designs as the industry matures. &lt;/p&gt;

</description>
    </item>
    <item>
      <title>NaNs bites</title>
      <dc:creator>Sephi Berry</dc:creator>
      <pubDate>Tue, 29 Dec 2020 11:18:17 +0000</pubDate>
      <link>https://dev.to/sephib/nans-bites-17kk</link>
      <guid>https://dev.to/sephib/nans-bites-17kk</guid>
      <description>&lt;p&gt;This post is co-authored with &lt;a href="https://www.linkedin.com/in/davidkatz-il/"&gt;David Katz&lt;/a&gt;  &lt;/p&gt;

&lt;h1&gt;
  
  
  Background
&lt;/h1&gt;

&lt;p&gt;While working on a health-related classification project, our team encountered a very large &lt;a href="https://en.wikipedia.org/wiki/Sparse_matrix"&gt;sparse matrix&lt;/a&gt;, due to the vast amount of health/lab tests that were available. After a meeting with the business domain experts, we understood that our initial data preprocessing for removing missing data (NaN) was faulty.&lt;br&gt;&lt;br&gt;
In this post, we would like to share the pitfall that we experienced and our process for identifying features with missing values in classification problems.  &lt;/p&gt;
&lt;h1&gt;
  
  
The Pitfall
&lt;/h1&gt;

&lt;p&gt;We had over 2K features across 40K patients. We knew that most of the features had a significant amount of &lt;code&gt;NaN&lt;/code&gt;s, so we used the common methods - such as &lt;a href="https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.VarianceThreshold.html"&gt;scikit-learn's VarianceThreshold&lt;/a&gt; and &lt;a href="http://topepo.github.io/caret/pre-processing.html#zero--and-near-zero-variance-predictors"&gt;caret's near zero variance&lt;/a&gt; functions - to remove features with many &lt;code&gt;NaN&lt;/code&gt; values.&lt;br&gt;&lt;br&gt;
We were left with fewer than 70 features and ran our base model to see if our classifier could predict better than random. After displaying the &lt;a href="https://catboost.ai/docs/concepts/python-reference_catboostclassifier_get_feature_importance.html#python-reference_catboostclassifier_get_feature_importance"&gt;feature importance&lt;/a&gt; from our &lt;code&gt;CatBoost&lt;/code&gt; model, some concerns were raised regarding some of the features.&lt;br&gt;&lt;br&gt;
So we went back and did some homework...  &lt;/p&gt;

&lt;p&gt;While re-analyzing the features that were left, we saw that although they had passed our initial &lt;code&gt;NaN&lt;/code&gt; tests, we had not checked the distribution of the NaNs across our classes, i.e. some features had a significant amount of NaNs concentrated in a specific class rather than evenly distributed across our group classifications.&lt;/p&gt;
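This per-class check can be sketched with pandas on a toy frame (the column names are made up):

```python
import numpy as np
import pandas as pd

# Toy data: "lab_x" is mostly missing for one class -- a pattern that an
# overall per-feature NaN count would not reveal.
df = pd.DataFrame(
    {
        "outcome": ["lived"] * 4 + ["died"] * 4,
        "lab_x": [1.0, np.nan, 2.0, 1.5, np.nan, np.nan, np.nan, 3.0],
        "lab_y": [5.0, 6.0, np.nan, 7.0, 8.0, np.nan, 9.0, 10.0],
    }
)

# Fraction of missing values per feature, per class.
nan_by_class = df.drop(columns="outcome").isna().groupby(df["outcome"]).mean()
print(nan_by_class)
```

Here `lab_x` is 75% missing for the "died" class but only 25% missing for "lived", even though its overall missingness looks moderate.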
&lt;h1&gt;
  
  
  Data sample
&lt;/h1&gt;

&lt;p&gt;To demonstrate this process let's look at an example dataset  - the &lt;a href="https://archive.ics.uci.edu/ml/datasets/Horse+Colic"&gt;Horse Colic&lt;/a&gt; dataset.&lt;br&gt;&lt;br&gt;
This dataset includes the outcome/survival of horses diagnosed with colic disease based upon their past medical histories.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;itertools&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;combinations&lt;/span&gt;

&lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;display&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;float_format&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"{:,.2f}"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;format&lt;/span&gt;

&lt;span class="n"&gt;names&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"surgery,Age,Hospital Number,rectal temperature,pulse,respiratory rate,temperature of extremities,peripheral pulse,mucous membranes,capillary refill time,pain,peristalsis,abdominal distension,nasogastric tube,nasogastric reflux,nasogastric reflux PH,rectal examination,abdomen,packed cell volume,total protein,abdominocentesis appearance,abdomcentesis total protein,outcome,surgical lesion,type of lesion1,type of lesion2,type of lesion3,cp_data"&lt;/span&gt;
&lt;span class="n"&gt;names&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;names&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;" "&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"_"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;","&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;file_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s"&gt;"https://raw.githubusercontent.com/jbrownlee/Datasets/master/horse-colic.csv"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;names&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;names&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;label_col&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"outcome"&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;preprocess_df&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"?"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nan&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"nan"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nan&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"outcome"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;isna&lt;/span&gt;&lt;span class="p"&gt;())].&lt;/span&gt;&lt;span class="n"&gt;copy&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# clean up label column
&lt;/span&gt;    &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"ID"&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;label_col&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;label_col&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="s"&gt;"1"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"lived"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"2"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"died"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"3"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"euthanized"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;


&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;preprocess_df&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;"df.shape: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;df.shape: (299, 28)&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;surgery&lt;/th&gt;
      &lt;th&gt;Age&lt;/th&gt;
      &lt;th&gt;Hospital_Number&lt;/th&gt;
      &lt;th&gt;rectal_temperature&lt;/th&gt;
      &lt;th&gt;pulse&lt;/th&gt;
      &lt;th&gt;respiratory_rate&lt;/th&gt;
      &lt;th&gt;temperature_of_extremities&lt;/th&gt;
      &lt;th&gt;peripheral_pulse&lt;/th&gt;
      &lt;th&gt;mucous_membranes&lt;/th&gt;
      &lt;th&gt;capillary_refill_time&lt;/th&gt;
      &lt;th&gt;...&lt;/th&gt;
      &lt;th&gt;packed_cell_volume&lt;/th&gt;
      &lt;th&gt;total_protein&lt;/th&gt;
      &lt;th&gt;abdominocentesis_appearance&lt;/th&gt;
      &lt;th&gt;abdomcentesis_total_protein&lt;/th&gt;
      &lt;th&gt;outcome&lt;/th&gt;
      &lt;th&gt;surgical_lesion&lt;/th&gt;
      &lt;th&gt;type_of_lesion1&lt;/th&gt;
      &lt;th&gt;type_of_lesion2&lt;/th&gt;
      &lt;th&gt;type_of_lesion3&lt;/th&gt;
      &lt;th&gt;cp_data&lt;/th&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;ID&lt;/th&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;th&gt;0&lt;/th&gt;
      &lt;td&gt;2&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;530101&lt;/td&gt;
      &lt;td&gt;38.50&lt;/td&gt;
      &lt;td&gt;66&lt;/td&gt;
      &lt;td&gt;28&lt;/td&gt;
      &lt;td&gt;3&lt;/td&gt;
      &lt;td&gt;3&lt;/td&gt;
      &lt;td&gt;NaN&lt;/td&gt;
      &lt;td&gt;2&lt;/td&gt;
      &lt;td&gt;...&lt;/td&gt;
      &lt;td&gt;45.00&lt;/td&gt;
      &lt;td&gt;8.40&lt;/td&gt;
      &lt;td&gt;NaN&lt;/td&gt;
      &lt;td&gt;NaN&lt;/td&gt;
      &lt;td&gt;died&lt;/td&gt;
      &lt;td&gt;2&lt;/td&gt;
      &lt;td&gt;11300&lt;/td&gt;
      &lt;td&gt;0&lt;/td&gt;
      &lt;td&gt;0&lt;/td&gt;
      &lt;td&gt;2&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;1&lt;/th&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;534817&lt;/td&gt;
      &lt;td&gt;39.2&lt;/td&gt;
      &lt;td&gt;88&lt;/td&gt;
      &lt;td&gt;20&lt;/td&gt;
      &lt;td&gt;NaN&lt;/td&gt;
      &lt;td&gt;NaN&lt;/td&gt;
      &lt;td&gt;4&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;...&lt;/td&gt;
      &lt;td&gt;50&lt;/td&gt;
      &lt;td&gt;85&lt;/td&gt;
      &lt;td&gt;2&lt;/td&gt;
      &lt;td&gt;2&lt;/td&gt;
      &lt;td&gt;euthanized&lt;/td&gt;
      &lt;td&gt;2&lt;/td&gt;
      &lt;td&gt;2208&lt;/td&gt;
      &lt;td&gt;0&lt;/td&gt;
      &lt;td&gt;0&lt;/td&gt;
      &lt;td&gt;2&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;2 rows × 28 columns&lt;/p&gt;

&lt;p&gt;First we will check how many features have &lt;code&gt;NaN&lt;/code&gt; values.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# There are 28 features in this dataset
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;number_of_features_with_NaN&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;_s_na&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isna&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nb"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_s_na&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;_s_na&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;


&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;"The number features with NaN values are &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;number_of_features_with_NaN&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;The number features with NaN values are 19
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
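&lt;p&gt;As a side note, the same count can be computed with a one-liner; this is just an equivalent pandas idiom, not part of the original workflow (the toy frame below is illustrative):&lt;/p&gt;

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the horse-colic dataset
df = pd.DataFrame({"a": [1.0, np.nan], "b": [1, 2], "c": [np.nan, np.nan]})

# A column contains NaNs iff isna().any() is True for it
n_features_with_nan = int(df.isna().any().sum())
print(n_features_with_nan)  # 2
```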
&lt;h2&gt;
  
  
  Near Zero Variance
&lt;/h2&gt;

&lt;p&gt;Simulating our initial workflow, we will first drop low-information features using our Python implementation of the &lt;a href=""&gt;near zero variance&lt;/a&gt; function from R's caret library (with its default values).&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;near_zero_variance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cols&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;frq_cut&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;95&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;unique_cut&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;cols&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;cols&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;
    &lt;span class="n"&gt;drop_cols&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;cols&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;val_count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;value_counts&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dropna&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;normalize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;to_dict&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;val_count&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;drop_cols&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;continue&lt;/span&gt;
        &lt;span class="n"&gt;lunique&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;val_count&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;percent_unique&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;lunique&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="n"&gt;freq_ratio&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;val_count&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;val_count&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mf"&gt;1e-5&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;freq_ratio&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;frq_cut&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;percent_unique&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="n"&gt;unique_cut&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;drop_cols&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;drop_cols&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;


&lt;span class="n"&gt;df_nzr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;near_zero_variance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;"After processeing the dataset via `near_zero_variance` we are left with &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;number_of_features_with_NaN&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df_nzr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; features with NaN values.&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;After processeing the dataset via `near_zero_variance` we are left with 18 features with NaN values.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Deeper NaN  Analysis
&lt;/h2&gt;

&lt;p&gt;Since we are interested in understanding the missing values in the dataset, we can look at how many &lt;code&gt;NaN&lt;/code&gt; values each feature contains.&lt;br&gt;
Let's plot the number of remaining features against the percentage of &lt;code&gt;NaN&lt;/code&gt;s in each feature.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;plot_percent_nan_in_features&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;nan&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isna&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nb"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;feature_nan_threshold&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nb"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nan&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;nan&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;arange&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;1.01&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.05&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;from_dict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;feature_nan_threshold&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;orient&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"index"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"num_of_features"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;kind&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"bar"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;ylabel&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"Number of Features"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;xlabel&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"Percentage of NaNs in feature"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;grid&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="n"&gt;plot_percent_nan_in_features&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df_nzr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--7TiEpLQX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/58zwqpi41aoh8cvvsi1e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--7TiEpLQX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/58zwqpi41aoh8cvvsi1e.png" alt="plot_percent_nan_in_features" width="606" height="329"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From the above plot, we can see that two features have more than 40% &lt;code&gt;NaN&lt;/code&gt;s and six features have more than 25%.  &lt;/p&gt;

&lt;p&gt;Some of the features have a very high proportion of &lt;code&gt;NaN&lt;/code&gt; values.  &lt;/p&gt;

&lt;p&gt;Let's now remove these problematic features.&lt;br&gt;
For this example we will set a threshold of 35%.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;threshold_max_na&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.35&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;drop_features_above_threshold_max_na&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;threshold_max_na&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;threshold_max_na&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;nan&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isna&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nb"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;nan_threshold&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nan&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;nan&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;threshold_max_na&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;drop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;nan_threshold&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;copy&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;


&lt;span class="n"&gt;df_nzr_threshold&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;drop_features_above_threshold_max_na&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df_nzr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;"After drop_features_above_threshold_max_na the number of features with NaNs that we are left with are : &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;number_of_features_with_NaN&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df_nzr_threshold&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;After drop_features_above_threshold_max_na the number of features with NaNs that we are left with are : 14
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;We may assume that we have removed the problematic features and can now try to impute the remaining NaN values and run our pipeline/model.  &lt;/p&gt;
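&lt;p&gt;The post does not prescribe an imputation strategy, but as a rough sketch (the function name and the toy frame below are illustrative, not from the original code), a simple median/mode fill could look like this:&lt;/p&gt;

```python
import numpy as np
import pandas as pd

def impute_simple(df):
    """Median-impute numeric columns, mode-impute the rest (illustrative only)."""
    df = df.copy()
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            df[col] = df[col].fillna(df[col].median())
        else:
            df[col] = df[col].fillna(df[col].mode().iloc[0])
    return df

# Tiny stand-in for the post's df_nzr_threshold
toy = pd.DataFrame({"pulse": [66.0, np.nan, 88.0], "surgery": ["1", None, "2"]})
imputed = impute_simple(toy)
print(imputed.isna().sum().sum())  # 0
```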

&lt;p&gt;But before we do so, let's take a closer look at our classification labels. &lt;/p&gt;

&lt;h2&gt;
  
  
  NaN Distribution Among the Classifier Labels
&lt;/h2&gt;

&lt;p&gt;Looking at the classifier label feature &lt;code&gt;outcome&lt;/code&gt;, we can see the following distribution:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;create_value_counts_df&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s2&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Series&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;check_na&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;s1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;value_counts&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dropna&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;check_na&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;s1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isna&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nb"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;s2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isna&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nb"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;s2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;empty&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;s2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;value_counts&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;normalize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dropna&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;concat&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;s1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s2&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"num"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"percent"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sort_values&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;by&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"percent"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ascending&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="n"&gt;df_labels_counts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;create_value_counts_df&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;label_col&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df_labels_counts&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;num&lt;/th&gt;
      &lt;th&gt;percent&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;th&gt;lived&lt;/th&gt;
      &lt;td&gt;178&lt;/td&gt;
      &lt;td&gt;0.60&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;died&lt;/th&gt;
      &lt;td&gt;77&lt;/td&gt;
      &lt;td&gt;0.26&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;euthanized&lt;/th&gt;
      &lt;td&gt;44&lt;/td&gt;
      &lt;td&gt;0.15&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;We can see that the distribution of the classes is uneven.&lt;br&gt;&lt;br&gt;
Our class distribution is approximately 60%, 26%, and 15% for the lived, died, and euthanized classes, respectively.   &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;But how are our NaNs distributed&lt;/strong&gt;?     &lt;/p&gt;

&lt;p&gt;What is the distribution of &lt;code&gt;NaN&lt;/code&gt;s in each feature in relation to our classification field?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_na_per_label&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;label_col&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;df_labels_counts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;cols&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="n"&gt;label_col&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;sum_na&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;isna&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nb"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()]&lt;/span&gt;
        &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;label_col&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;isna&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nb"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;df_labels_counts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tolist&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;cols&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;df_sum_na&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;_sum_na"&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"all"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;df_labels_counts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tolist&lt;/span&gt;&lt;span class="p"&gt;()],&lt;/span&gt;
        &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;sum_na&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;cols&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;df_sum_na&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"all_percentage_na"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isna&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nb"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;df_sum_na&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;_percentage_na"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;df_sum_na&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;_sum_na"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;df_labels_counts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"num"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;df_sum_na&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_na_cols&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;_s_na&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isna&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nb"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;na_cols&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_s_na&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;_s_na&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;
    &lt;span class="n"&gt;na_cols&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;na_cols&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;label_col&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;na_cols&lt;/span&gt;

&lt;span class="n"&gt;na_cols&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;get_na_cols&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df_nzr_threshold&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df_sum_na&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;get_na_per_label&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df_nzr_threshold&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;na_cols&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;label_col&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df_sum_na&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;all_sum_na&lt;/th&gt;
      &lt;th&gt;lived_sum_na&lt;/th&gt;
      &lt;th&gt;died_sum_na&lt;/th&gt;
      &lt;th&gt;euthanized_sum_na&lt;/th&gt;
      &lt;th&gt;all_percentage_na&lt;/th&gt;
      &lt;th&gt;lived_percentage_na&lt;/th&gt;
      &lt;th&gt;died_percentage_na&lt;/th&gt;
      &lt;th&gt;euthanized_percentage_na&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;th&gt;rectal_temperature&lt;/th&gt;
      &lt;td&gt;60&lt;/td&gt;
      &lt;td&gt;26&lt;/td&gt;
      &lt;td&gt;24&lt;/td&gt;
      &lt;td&gt;10&lt;/td&gt;
      &lt;td&gt;0.20&lt;/td&gt;
      &lt;td&gt;0.15&lt;/td&gt;
      &lt;td&gt;0.31&lt;/td&gt;
      &lt;td&gt;0.23&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;pulse&lt;/th&gt;
      &lt;td&gt;24&lt;/td&gt;
      &lt;td&gt;12&lt;/td&gt;
      &lt;td&gt;11&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;0.08&lt;/td&gt;
      &lt;td&gt;0.07&lt;/td&gt;
      &lt;td&gt;0.14&lt;/td&gt;
      &lt;td&gt;0.02&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;respiratory_rate&lt;/th&gt;
      &lt;td&gt;58&lt;/td&gt;
      &lt;td&gt;31&lt;/td&gt;
      &lt;td&gt;19&lt;/td&gt;
      &lt;td&gt;8&lt;/td&gt;
      &lt;td&gt;0.19&lt;/td&gt;
      &lt;td&gt;0.17&lt;/td&gt;
      &lt;td&gt;0.25&lt;/td&gt;
      &lt;td&gt;0.18&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;temperature_of_extremities&lt;/th&gt;
      &lt;td&gt;56&lt;/td&gt;
      &lt;td&gt;32&lt;/td&gt;
      &lt;td&gt;13&lt;/td&gt;
      &lt;td&gt;11&lt;/td&gt;
      &lt;td&gt;0.19&lt;/td&gt;
      &lt;td&gt;0.18&lt;/td&gt;
      &lt;td&gt;0.17&lt;/td&gt;
      &lt;td&gt;0.25&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;peripheral_pulse&lt;/th&gt;
      &lt;td&gt;69&lt;/td&gt;
      &lt;td&gt;39&lt;/td&gt;
      &lt;td&gt;18&lt;/td&gt;
      &lt;td&gt;12&lt;/td&gt;
      &lt;td&gt;0.23&lt;/td&gt;
      &lt;td&gt;0.22&lt;/td&gt;
      &lt;td&gt;0.23&lt;/td&gt;
      &lt;td&gt;0.27&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;mucous_membranes&lt;/th&gt;
      &lt;td&gt;47&lt;/td&gt;
      &lt;td&gt;28&lt;/td&gt;
      &lt;td&gt;11&lt;/td&gt;
      &lt;td&gt;8&lt;/td&gt;
      &lt;td&gt;0.16&lt;/td&gt;
      &lt;td&gt;0.16&lt;/td&gt;
      &lt;td&gt;0.14&lt;/td&gt;
      &lt;td&gt;0.18&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;capillary_refill_time&lt;/th&gt;
      &lt;td&gt;32&lt;/td&gt;
      &lt;td&gt;19&lt;/td&gt;
      &lt;td&gt;10&lt;/td&gt;
      &lt;td&gt;3&lt;/td&gt;
      &lt;td&gt;0.11&lt;/td&gt;
      &lt;td&gt;0.11&lt;/td&gt;
      &lt;td&gt;0.13&lt;/td&gt;
      &lt;td&gt;0.07&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;pain&lt;/th&gt;
      &lt;td&gt;55&lt;/td&gt;
      &lt;td&gt;34&lt;/td&gt;
      &lt;td&gt;12&lt;/td&gt;
      &lt;td&gt;9&lt;/td&gt;
      &lt;td&gt;0.18&lt;/td&gt;
      &lt;td&gt;0.19&lt;/td&gt;
      &lt;td&gt;0.16&lt;/td&gt;
      &lt;td&gt;0.20&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;peristalsis&lt;/th&gt;
      &lt;td&gt;44&lt;/td&gt;
      &lt;td&gt;22&lt;/td&gt;
      &lt;td&gt;15&lt;/td&gt;
      &lt;td&gt;7&lt;/td&gt;
      &lt;td&gt;0.15&lt;/td&gt;
      &lt;td&gt;0.12&lt;/td&gt;
      &lt;td&gt;0.19&lt;/td&gt;
      &lt;td&gt;0.16&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;abdominal_distension&lt;/th&gt;
      &lt;td&gt;56&lt;/td&gt;
      &lt;td&gt;31&lt;/td&gt;
      &lt;td&gt;14&lt;/td&gt;
      &lt;td&gt;11&lt;/td&gt;
      &lt;td&gt;0.19&lt;/td&gt;
      &lt;td&gt;0.17&lt;/td&gt;
      &lt;td&gt;0.18&lt;/td&gt;
      &lt;td&gt;0.25&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;nasogastric_tube&lt;/th&gt;
      &lt;td&gt;104&lt;/td&gt;
      &lt;td&gt;62&lt;/td&gt;
      &lt;td&gt;25&lt;/td&gt;
      &lt;td&gt;17&lt;/td&gt;
      &lt;td&gt;0.35&lt;/td&gt;
      &lt;td&gt;0.35&lt;/td&gt;
      &lt;td&gt;0.32&lt;/td&gt;
      &lt;td&gt;0.39&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;rectal_examination&lt;/th&gt;
      &lt;td&gt;102&lt;/td&gt;
      &lt;td&gt;56&lt;/td&gt;
      &lt;td&gt;26&lt;/td&gt;
      &lt;td&gt;20&lt;/td&gt;
      &lt;td&gt;0.34&lt;/td&gt;
      &lt;td&gt;0.31&lt;/td&gt;
      &lt;td&gt;0.34&lt;/td&gt;
      &lt;td&gt;0.45&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;packed_cell_volume&lt;/th&gt;
      &lt;td&gt;29&lt;/td&gt;
      &lt;td&gt;13&lt;/td&gt;
      &lt;td&gt;8&lt;/td&gt;
      &lt;td&gt;8&lt;/td&gt;
      &lt;td&gt;0.10&lt;/td&gt;
      &lt;td&gt;0.07&lt;/td&gt;
      &lt;td&gt;0.10&lt;/td&gt;
      &lt;td&gt;0.18&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;total_protein&lt;/th&gt;
      &lt;td&gt;33&lt;/td&gt;
      &lt;td&gt;13&lt;/td&gt;
      &lt;td&gt;12&lt;/td&gt;
      &lt;td&gt;8&lt;/td&gt;
      &lt;td&gt;0.11&lt;/td&gt;
      &lt;td&gt;0.07&lt;/td&gt;
      &lt;td&gt;0.16&lt;/td&gt;
      &lt;td&gt;0.18&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;We can see that the distribution of the &lt;code&gt;NaN&lt;/code&gt;s across the classification field is not even, &lt;em&gt;e.g.&lt;/em&gt; the &lt;code&gt;rectal_temperature&lt;/code&gt; feature has more than twice as many &lt;code&gt;NaN&lt;/code&gt;s in the &lt;code&gt;died&lt;/code&gt; &amp;amp; &lt;code&gt;lived&lt;/code&gt; classes as in the &lt;code&gt;euthanized&lt;/code&gt; class.  &lt;/p&gt;

&lt;p&gt;Assume that we don't want to remove any feature with less than 15% &lt;code&gt;NaN&lt;/code&gt;s, regardless of how the &lt;code&gt;NaN&lt;/code&gt;s are distributed across the classification field. We can filter the summary table accordingly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;threshold_min_na&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.15&lt;/span&gt;
&lt;span class="n"&gt;mask_threshold_min_na&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df_sum_na&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"all_percentage_na"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;threshold_min_na&lt;/span&gt;
&lt;span class="n"&gt;df_sum_min_na&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df_sum_na&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;mask_threshold_min_na&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;copy&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;df_sum_min_na&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;all_sum_na&lt;/th&gt;
      &lt;th&gt;lived_sum_na&lt;/th&gt;
      &lt;th&gt;died_sum_na&lt;/th&gt;
      &lt;th&gt;euthanized_sum_na&lt;/th&gt;
      &lt;th&gt;all_percentage_na&lt;/th&gt;
      &lt;th&gt;lived_percentage_na&lt;/th&gt;
      &lt;th&gt;died_percentage_na&lt;/th&gt;
      &lt;th&gt;euthanized_percentage_na&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;th&gt;rectal_temperature&lt;/th&gt;
      &lt;td&gt;60&lt;/td&gt;
      &lt;td&gt;26&lt;/td&gt;
      &lt;td&gt;24&lt;/td&gt;
      &lt;td&gt;10&lt;/td&gt;
      &lt;td&gt;0.20&lt;/td&gt;
      &lt;td&gt;0.15&lt;/td&gt;
      &lt;td&gt;0.31&lt;/td&gt;
      &lt;td&gt;0.23&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;respiratory_rate&lt;/th&gt;
      &lt;td&gt;58&lt;/td&gt;
      &lt;td&gt;31&lt;/td&gt;
      &lt;td&gt;19&lt;/td&gt;
      &lt;td&gt;8&lt;/td&gt;
      &lt;td&gt;0.19&lt;/td&gt;
      &lt;td&gt;0.17&lt;/td&gt;
      &lt;td&gt;0.25&lt;/td&gt;
      &lt;td&gt;0.18&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;temperature_of_extremities&lt;/th&gt;
      &lt;td&gt;56&lt;/td&gt;
      &lt;td&gt;32&lt;/td&gt;
      &lt;td&gt;13&lt;/td&gt;
      &lt;td&gt;11&lt;/td&gt;
      &lt;td&gt;0.19&lt;/td&gt;
      &lt;td&gt;0.18&lt;/td&gt;
      &lt;td&gt;0.17&lt;/td&gt;
      &lt;td&gt;0.25&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;peripheral_pulse&lt;/th&gt;
      &lt;td&gt;69&lt;/td&gt;
      &lt;td&gt;39&lt;/td&gt;
      &lt;td&gt;18&lt;/td&gt;
      &lt;td&gt;12&lt;/td&gt;
      &lt;td&gt;0.23&lt;/td&gt;
      &lt;td&gt;0.22&lt;/td&gt;
      &lt;td&gt;0.23&lt;/td&gt;
      &lt;td&gt;0.27&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;mucous_membranes&lt;/th&gt;
      &lt;td&gt;47&lt;/td&gt;
      &lt;td&gt;28&lt;/td&gt;
      &lt;td&gt;11&lt;/td&gt;
      &lt;td&gt;8&lt;/td&gt;
      &lt;td&gt;0.16&lt;/td&gt;
      &lt;td&gt;0.16&lt;/td&gt;
      &lt;td&gt;0.14&lt;/td&gt;
      &lt;td&gt;0.18&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;pain&lt;/th&gt;
      &lt;td&gt;55&lt;/td&gt;
      &lt;td&gt;34&lt;/td&gt;
      &lt;td&gt;12&lt;/td&gt;
      &lt;td&gt;9&lt;/td&gt;
      &lt;td&gt;0.18&lt;/td&gt;
      &lt;td&gt;0.19&lt;/td&gt;
      &lt;td&gt;0.16&lt;/td&gt;
      &lt;td&gt;0.20&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;abdominal_distension&lt;/th&gt;
      &lt;td&gt;56&lt;/td&gt;
      &lt;td&gt;31&lt;/td&gt;
      &lt;td&gt;14&lt;/td&gt;
      &lt;td&gt;11&lt;/td&gt;
      &lt;td&gt;0.19&lt;/td&gt;
      &lt;td&gt;0.17&lt;/td&gt;
      &lt;td&gt;0.18&lt;/td&gt;
      &lt;td&gt;0.25&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;nasogastric_tube&lt;/th&gt;
      &lt;td&gt;104&lt;/td&gt;
      &lt;td&gt;62&lt;/td&gt;
      &lt;td&gt;25&lt;/td&gt;
      &lt;td&gt;17&lt;/td&gt;
      &lt;td&gt;0.35&lt;/td&gt;
      &lt;td&gt;0.35&lt;/td&gt;
      &lt;td&gt;0.32&lt;/td&gt;
      &lt;td&gt;0.39&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;rectal_examination&lt;/th&gt;
      &lt;td&gt;102&lt;/td&gt;
      &lt;td&gt;56&lt;/td&gt;
      &lt;td&gt;26&lt;/td&gt;
      &lt;td&gt;20&lt;/td&gt;
      &lt;td&gt;0.34&lt;/td&gt;
      &lt;td&gt;0.31&lt;/td&gt;
      &lt;td&gt;0.34&lt;/td&gt;
      &lt;td&gt;0.45&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Now we can further analyse our data. Let's see the ratio between the classifications.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;create_ratio_between_classes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;label_classes&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;label_a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;label_b&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;combinations&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;label_classes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;"ratio_percentage_&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;label_a&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;_&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;label_b&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;label_a&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;_percentage_na"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;label_b&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;_percentage_na"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;col_ratio&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="s"&gt;"ratio_percentage"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;col_ratio&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;


&lt;span class="n"&gt;create_ratio_between_classes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df_sum_min_na&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;df_labels_counts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;rot&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;45&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;grid&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Bp3AFlxX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/xh75jtlnzejad2jeco5w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Bp3AFlxX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/xh75jtlnzejad2jeco5w.png" alt="plot ratio_between_classes" width="714" height="398"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Ratios close to 1 indicate a similar percentage of &lt;code&gt;NaN&lt;/code&gt;s in the two classes, whereas ratios further from 1 indicate that the &lt;code&gt;NaN&lt;/code&gt;s are unevenly distributed across the classification field.  &lt;/p&gt;

&lt;p&gt;We can set a lower and an upper threshold for filtering out the problematic features. If all of a feature's ratios fall between these limits, we keep the feature; for any value outside these limits we assume that the &lt;code&gt;NaN&lt;/code&gt;s are unevenly distributed, and the feature should be removed.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_features_outside_threshold&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;lt_ratio_threshold&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;gt_ratio_threshold&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;features_to_drop&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="s"&gt;"ratio"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="n"&gt;mask_ratio_threshold&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;between&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lt_ratio_threshold&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;gt_ratio_threshold&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;features_to_drop&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;extend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mask_ratio_threshold&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;mask_ratio_threshold&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;features_to_drop&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;features_to_drop&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;features_to_drop&lt;/span&gt;


&lt;span class="n"&gt;features_to_drop&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;get_features_outside_threshold&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df_sum_min_na&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;features_to_drop&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;['rectal_examination',
 'abdominal_distension',
 'rectal_temperature',
 'temperature_of_extremities',
 'respiratory_rate']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;df_for_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df_nzr_threshold&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;drop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;features_to_drop&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;"After removing the features from get_features_outside_threshold function we are left with &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;number_of_features_with_NaN&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df_for_model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; features with `NaN`s that we are going to imputate"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After removing the features from get_features_outside_threshold function we are left with 9 features with &lt;code&gt;NaN&lt;/code&gt;s that we are going to impute.&lt;/p&gt;
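The `number_of_features_with_NaN` helper is defined earlier in the notebook; a minimal sketch consistent with how it is used here (the toy DataFrame below is illustrative, not the post's data) might be:

```python
import numpy as np
import pandas as pd

def number_of_features_with_NaN(df: pd.DataFrame) -> int:
    """Count how many columns still contain at least one NaN."""
    return int((df.isna().sum() > 0).sum())

# Tiny illustration: two of the three columns contain NaNs
df_demo = pd.DataFrame({"a": [1.0, np.nan], "b": [1, 2], "c": [np.nan, 3.0]})
print(number_of_features_with_NaN(df_demo))  # 2
```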

&lt;h1&gt;
  
  
  Summary
&lt;/h1&gt;

&lt;p&gt;This post described the issues that arise when analysing &lt;code&gt;NaN&lt;/code&gt;s for feature selection.  &lt;/p&gt;

&lt;p&gt;Simple filtering methods do not always perform as expected, and additional care should be taken when working with sparse matrices.&lt;/p&gt;

&lt;p&gt;We can analyse &lt;code&gt;NaN&lt;/code&gt;s within features at multiple levels:  &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;At the global level - i.e. the total number of NaNs within a feature (both for removing and for keeping features) &lt;/li&gt;
&lt;li&gt;At the label/classification level - i.e. the relative distribution of NaNs per class.&lt;/li&gt;
&lt;/ol&gt;
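As a compact sketch, both levels can be computed with `isna` plus a groupby. This is a minimal illustration on a toy DataFrame (the column names here are hypothetical), not the notebook's exact code:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "feature": [1.0, np.nan, 3.0, np.nan],
    "outcome": ["lived", "lived", "died", "died"],
})
features = df.drop(columns="outcome")

# 1. Global level: overall fraction of NaNs per feature
global_na = features.isna().mean()

# 2. Label level: fraction of NaNs per feature within each class
per_class_na = features.isna().groupby(df["outcome"]).mean()

print(global_na["feature"])                 # 0.5
print(per_class_na.loc["died", "feature"])  # 0.5
```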

&lt;p&gt;Finally, we recommend trying out the &lt;a href="https://github.com/ResidentMario/missingno"&gt;missingno package&lt;/a&gt; for graphical analysis of &lt;code&gt;NaN&lt;/code&gt; values.&lt;/p&gt;

&lt;p&gt;The notebook for this post is available at &lt;a href="https://gist.github.com/DavidKatz-il/16737cb60733303c4ac65a0dd288609a"&gt;this gist link&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>datapreperations</category>
      <category>dataanalysis</category>
    </item>
    <item>
      <title>Tidying up Pipelines with DataClasses</title>
      <dc:creator>Sephi Berry</dc:creator>
      <pubDate>Mon, 16 Nov 2020 21:27:29 +0000</pubDate>
      <link>https://dev.to/sephib/tidying-up-pipelines-with-dataclasses-2bde</link>
      <guid>https://dev.to/sephib/tidying-up-pipelines-with-dataclasses-2bde</guid>
      <description>&lt;h2&gt;
  
  
  Background
&lt;/h2&gt;

&lt;p&gt;Tidy code makes everyone's life easier.&lt;br&gt;&lt;br&gt;
The code in an ML project will probably be read many times, so making our workflow easier to understand will be appreciated later by everyone on the team.&lt;br&gt;
During ML projects, we need to access data in a consistent manner throughout our workflow for training, validation and prediction. Clear semantics for accessing the data make the code easier to manage across projects, and good naming conventions help us understand and reuse the code.&lt;br&gt;&lt;br&gt;
There are tools that can assist in this cleanliness, such as Pipelines and Dataclasses.  &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;An ML Engineer is 10% ML, 90% Engineer.   &lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3&gt;
  
  
  Pipeline
&lt;/h3&gt;

&lt;p&gt;A Pipeline is a &lt;em&gt;meta&lt;/em&gt; object that assists in managing the processes in an ML model. Pipelines can encapsulate separate processes which can later be combined together.&lt;br&gt;&lt;br&gt;
Forcing a workflow to be implemented within a &lt;a href="https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html"&gt;Pipeline object&lt;/a&gt; can be a nuisance at the beginning (especially the conversion between &lt;code&gt;pandas DataFrame&lt;/code&gt; and &lt;code&gt;np.ndarray&lt;/code&gt;), but down the line it guarantees the quality of the model (no data leakage, modularity etc.). Here is &lt;a href="https://www.youtube.com/watch?v=yv4adDGcFE8"&gt;Kevin Markham's 4-minute video&lt;/a&gt; explaining the advantages of pipelines.   &lt;/p&gt;
&lt;h3&gt;
  
  
  Dataclass
&lt;/h3&gt;

&lt;p&gt;Another useful &lt;em&gt;Python object&lt;/em&gt; for saving datasets along the pipeline is the &lt;code&gt;dataclass&lt;/code&gt;. Before Python 3.7 you may have been using &lt;a href="https://docs.python.org/3.9/library/collections.html?highlight=namedtuple#collections.namedtuple"&gt;namedtuple&lt;/a&gt;; since Python 3.7, &lt;a href="https://docs.python.org/3/library/dataclasses.html"&gt;dataclasses&lt;/a&gt; are a great candidate for storing such data objects. Using dataclasses allows for consistent access to the various datasets throughout the ML Pipeline.&lt;/p&gt;
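For example, a minimal dataclass holding train/test splits might look like this (the class and field names here are illustrative, not taken from the original project):

```python
from dataclasses import dataclass

import pandas as pd

@dataclass
class DataSets:
    """Container giving consistent access to the pipeline's datasets."""
    X_train: pd.DataFrame
    X_test: pd.DataFrame
    y_train: pd.Series
    y_test: pd.Series

# Toy data for illustration
ds = DataSets(
    X_train=pd.DataFrame({"a": [1, 2]}),
    X_test=pd.DataFrame({"a": [3]}),
    y_train=pd.Series([0, 1]),
    y_test=pd.Series([1]),
)
print(ds.X_train.shape)  # (2, 1)
```

Every stage of the workflow can then refer to `ds.X_train`, `ds.y_test`, and so on, rather than each script inventing its own variable names.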
&lt;h2&gt;
  
  
  Pipeline
&lt;/h2&gt;

&lt;p&gt;Since we are not analysing any dataset, this blog post is an example of an &lt;em&gt;advanced&lt;/em&gt; &lt;code&gt;pipeline&lt;/code&gt; that incorporates non-standard pieces (modules outside the standard &lt;code&gt;sklearn&lt;/code&gt; library).&lt;br&gt;&lt;br&gt;
Assuming that we have a classification problem and our data has numeric and categorical column types, the &lt;code&gt;pipeline&lt;/code&gt; incorporates:  &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Preprocess data preparation per column type
&lt;/li&gt;
&lt;li&gt;Handle the &lt;code&gt;categorical&lt;/code&gt; columns using the &lt;a href="https://github.com/WinVector/pyvtreat"&gt;vtreat&lt;/a&gt; package
&lt;/li&gt;
&lt;li&gt;Run a &lt;a href="https://catboost.ai/"&gt;catboost&lt;/a&gt; classifier.
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We may build our pipeline as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"label"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;copy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;num_pipe&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Pipeline&lt;/span&gt;&lt;span class="p"&gt;([(&lt;/span&gt;&lt;span class="s"&gt;"scaler"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;StanderdScaler&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt;
                     &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"variance"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;VarianceThreshold&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt;
                     &lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;preprocess_pipe&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ColumnTransformer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
   &lt;span class="n"&gt;remainder&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"passthrough"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="n"&gt;transformers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="s"&gt;"num_pipe"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num_pipe&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;select_dtypes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"number"&lt;/span&gt;&lt;span class="p"&gt;))]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;                     

&lt;span class="n"&gt;pipe&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Pipeline&lt;/span&gt;&lt;span class="p"&gt;([(&lt;/span&gt;&lt;span class="s"&gt;"preprocess_pipe"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;preprocess_pipe&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
               &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"vtreat"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;BinomiaOutcomeTreatmentPlan&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt;
&lt;span class="p"&gt;])&lt;/span&gt;                 

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this &lt;em&gt;pseudo code&lt;/em&gt; our Pipeline first applies some preprocessing to the numeric columns and then processes the categorical columns with the vtreat package (vtreat passes the non-categorical columns, including the numeric ones, through untouched).  &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Since &lt;code&gt;catboost&lt;/code&gt; does not have a transform method, we will introduce it into the pipeline later on.
&lt;/li&gt;
&lt;li&gt;Using &lt;code&gt;vtreat&lt;/code&gt; demonstrates that nonstandard modules can be used within the classification pipeline (as long as they follow the &lt;code&gt;sklearn&lt;/code&gt; paradigms)&lt;/li&gt;
&lt;/ol&gt;
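&lt;p&gt;As a runnable sanity check of the same structure, here is a minimal sketch using only standard scikit-learn parts; the toy dataframe is made up, and &lt;code&gt;OneHotEncoder&lt;/code&gt; merely stands in for the vtreat step to show where a categorical handler fits:&lt;/p&gt;

```python
# Minimal runnable sketch of the pipeline structure above, using only
# scikit-learn parts. The toy dataframe is illustrative; OneHotEncoder
# stands in for the vtreat step just to show where a categorical handler fits.
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.feature_selection import VarianceThreshold

df = pd.DataFrame({
    "age": [25, 32, 47, 51],
    "income": [40.0, 55.0, 72.0, 63.0],
    "city": ["a", "b", "a", "c"],
})

num_pipe = Pipeline([("scaler", StandardScaler()),
                     ("variance", VarianceThreshold())])
preprocess_pipe = ColumnTransformer(
    remainder="passthrough",
    transformers=[
        ("num_pipe", num_pipe, df.select_dtypes("number").columns.tolist()),
        ("cat", OneHotEncoder(), ["city"]),
    ],
)
Xt = preprocess_pipe.fit_transform(df)
# 2 scaled numeric columns + 3 one-hot city columns -> shape (4, 5)
```

&lt;p&gt;Note that the column selection passed to &lt;code&gt;ColumnTransformer&lt;/code&gt; should be a list of column names (or a selector), hence the &lt;code&gt;.columns.tolist()&lt;/code&gt; call.&lt;/p&gt;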

&lt;p&gt;So now the time has come to cut up our data...&lt;br&gt;&lt;br&gt;
&lt;a href="https://getyarn.io/yarn-clip/2c689f11-6d71-425c-a701-81be09ad034e#llil9DAFRQ.copy" rel="Homicidal Barber"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--juaxiWdw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://static.wikia.nocookie.net/montypython/images/5/53/The_ant_5.jpg" width="854" height="640"&gt;&lt;/a&gt;  &lt;/p&gt;
&lt;h2&gt;
  
  
  Test vs. Train vs. Valid
&lt;/h2&gt;

&lt;p&gt;A common workflow when developing an ML model is the need to split the data into &lt;a href="https://machinelearningmastery.com/difference-test-validation-datasets/"&gt;Test/Train/Valid datasets&lt;/a&gt;.    &lt;/p&gt;

&lt;p&gt;&lt;a href="http://www.cs.nthu.edu.tw/~shwu/courses/ml/labs/08_CV_Ensembling" rel="TTV datasets"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Cm6yxCwe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/http://www.cs.nthu.edu.tw/%7Eshwu/courses/ml/labs/08_CV_Ensembling/fig-holdout.png" width="585" height="369"&gt;&lt;/a&gt;  source: Shan-Hung Wu &amp;amp; DataLab, National Tsing Hua University&lt;/p&gt;

&lt;p&gt;In a nutshell, the differences between the datasets are:  &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Test - put aside and not touched until the final model evaluation
&lt;/li&gt;
&lt;li&gt;Train - dataset to train model
&lt;/li&gt;
&lt;li&gt;Valid - dataset to validate model during the training phase (this can be via Cross Validation iteration, GridSearch, etc.)
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Each dataset will have similar attributes that we will need to save and access throughout the ML workflow.&lt;br&gt;&lt;br&gt;
In order to prevent confusion, let's create a &lt;code&gt;dataclass&lt;/code&gt; that stores each dataset in a structured manner.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# basic dataclass 
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;dataclasses&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;dataclass&lt;/span&gt;

&lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="n"&gt;dataclass&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Split&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ndarray&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="n"&gt;pred_class&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="n"&gt;pred_proba&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ndarray&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now we can create the training and test datasets as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;train&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'train'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   
&lt;span class="n"&gt;test&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'test'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each &lt;code&gt;dataclass&lt;/code&gt; will have the following &lt;a href="https://docs.python.org/3/library/dataclasses.html#dataclasses.Field"&gt;fields&lt;/a&gt;:  &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;X&lt;/code&gt; - a numpy ndarray storing all the features
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;y&lt;/code&gt; - a numpy array storing the classification labels
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;idx&lt;/code&gt; - a numpy array storing the original indexes, useful for referencing rows at the end of the pipeline
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;pred_class&lt;/code&gt; - a numpy array storing the predicted classification
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;pred_proba&lt;/code&gt; - a numpy ndarray for storing the probabilities of the classifications&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Additionally we store a &lt;code&gt;name&lt;/code&gt; on the dataclass so we can easily reference each split along the pipeline.&lt;/p&gt;
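&lt;p&gt;For illustration, here is a hypothetical trimmed-down &lt;code&gt;Split&lt;/code&gt; (only a few of the fields) showing how the &lt;code&gt;name&lt;/code&gt; field labels results when iterating over splits:&lt;/p&gt;

```python
# Hypothetical trimmed-down Split: the `name` field labels each split's results.
from dataclasses import dataclass
import numpy as np

@dataclass
class Split:
    name: str
    y: np.ndarray = None
    pred_class: np.ndarray = None

train = Split(name="train", y=np.array([0, 1, 1]), pred_class=np.array([0, 1, 0]))
test = Split(name="test", y=np.array([1, 0]), pred_class=np.array([1, 0]))

for split in (train, test):
    acc = (split.y == split.pred_class).mean()
    print(f"{split.name}: accuracy={acc:.2f}")
```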

&lt;h3&gt;
  
  
  Splitting In Action
&lt;/h3&gt;

&lt;p&gt;There are several methods that can be used to split the datasets. When the data are imbalanced it is important to split them with a stratified method. In our case we chose &lt;a href="https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedShuffleSplit.html"&gt;StratifiedShuffleSplit&lt;/a&gt;. However, in contrast to the simple &lt;a href="https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html?highlight=train%20split#sklearn.model_selection.train_test_split"&gt;train-test split&lt;/a&gt;, which returns the datasets themselves, StratifiedShuffleSplit returns only the indices for each group, so we need a helper function to retrieve the datasets themselves (our helper function stays nice and minimal thanks to our &lt;code&gt;dataclasses&lt;/code&gt;).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_split_from_idx&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;split1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Split&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;split2&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Split&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;split1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;split2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;iloc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;split1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;iloc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;split2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;split1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;split2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;iloc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;split1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;iloc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;split2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;split1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;split2&lt;/span&gt;  

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;fold_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;test&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;StratifiedSplitValid&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_split&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;train_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.8&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;test&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;get_split_from_idx&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# a helper function
&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
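&lt;p&gt;&lt;code&gt;StratifiedSplitValid&lt;/code&gt; above is our own helper whose implementation is not shown here; a minimal sketch of such a wrapper around scikit-learn's &lt;a href="https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedShuffleSplit.html"&gt;StratifiedShuffleSplit&lt;/a&gt; (an assumption about how it might look, not the original code) could be:&lt;/p&gt;

```python
# Hypothetical sketch of a StratifiedSplitValid-style helper: a thin wrapper
# around StratifiedShuffleSplit that yields (train_idx, test_idx) index pairs.
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

def stratified_split_valid(X, y, n_split=5, train_size=0.8, random_state=42):
    sss = StratifiedShuffleSplit(n_splits=n_split, train_size=train_size,
                                 random_state=random_state)
    yield from sss.split(X, y)

# toy balanced data: 10 rows, two classes
X = np.arange(20).reshape(10, 2)
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
for fold, (train_idx, test_idx) in enumerate(stratified_split_valid(X, y, n_split=2)):
    print(fold, len(train_idx), len(test_idx))  # each fold: 8 train / 2 test indices
```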



&lt;h3&gt;
  
  
  Pipeline in action
&lt;/h3&gt;

&lt;p&gt;Now we can run the first part of our Pipeline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;_train_X&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pipe&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit_transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once we have run &lt;code&gt;fit_transform&lt;/code&gt; on our data (allowing the vtreat magic to work), we can introduce the &lt;code&gt;catboost&lt;/code&gt; classifier into our Pipeline.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;catboost_clf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;CatBoostClassifier&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;train_valid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"train_valid"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;valid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"valid"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;fold_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_valid&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;valid&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;StratifiedSplitValid&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_train_X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;train&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_split&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;train_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.9&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;train_valid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;valid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;get_split_from_idx&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_train_X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;train&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;train_valid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;valid&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;pipe&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;steps&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="s"&gt;"catboost_clf"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;catboost_clf&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="n"&gt;pipe&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_size&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;train_valid&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;catboost_clf__eval_set&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="n"&gt;valid&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;valid&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)],&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note the following two points:  &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Using &lt;code&gt;pipe.steps.append&lt;/code&gt; we can introduce steps into the pipeline that were not part of the initial workflow.
&lt;/li&gt;
&lt;li&gt;Passing parameters to steps within the pipeline requires the double-underscore (&lt;code&gt;__&lt;/code&gt;) notation for &lt;a href="https://scikit-learn.org/stable/modules/compose.html#nested-parameters"&gt;nested parameters&lt;/a&gt;.
&lt;/li&gt;
&lt;/ol&gt;
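&lt;p&gt;The double-underscore convention can be checked in isolation (a minimal sketch with a stand-in &lt;code&gt;LogisticRegression&lt;/code&gt; classifier, not our actual model):&lt;/p&gt;

```python
# Minimal illustration of scikit-learn's "step__param" naming:
# a parameter prefixed by the step name is routed to that step.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipe = Pipeline([("scaler", StandardScaler()),
                 ("clf", LogisticRegression())])

pipe.set_params(clf__C=0.5)          # routed to the "clf" step
print(pipe.get_params()["clf__C"])   # 0.5
```

&lt;p&gt;The same convention routes fit-time keyword arguments, which is how &lt;code&gt;catboost_clf__eval_set&lt;/code&gt; above reaches the catboost step during &lt;code&gt;pipe.fit&lt;/code&gt;.&lt;/p&gt;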

&lt;p&gt;Finally, we can get some results:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;test&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pred_class&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pipe&lt;/span&gt;&lt;span class="p"&gt;.(&lt;/span&gt;&lt;span class="n"&gt;test&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
&lt;span class="n"&gt;test&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pred_proba&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pipe&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pred_proba&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;)[:,&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now when we analyse our model we can generate our metrics (e.g. confusion_matrix) by easily referencing the relevant dataset as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.metrics&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;confusion_matrix&lt;/span&gt;
&lt;span class="n"&gt;conf_matrix_test&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;confusion_matrix&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_true&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;test&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_pred&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;test&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pred_class&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This blog post outlines the advantages of using Pipelines and Dataclasses.&lt;br&gt;&lt;br&gt;
Working with Dataclasses is really a no-brainer since they are very simple and can easily be incorporated into any code base. Pipelines require more effort to integrate into the code, but the benefits are substantial and well worth it.&lt;br&gt;
I hope the example illustrated the potential of this approach and will inspire and encourage you to try it out.&lt;/p&gt;

</description>
      <category>pipeline</category>
      <category>dataengineering</category>
      <category>python</category>
      <category>scikit</category>
    </item>
    <item>
      <title>Simple Pipeline Monitoring Dashboard</title>
      <dc:creator>Sephi Berry</dc:creator>
      <pubDate>Sun, 02 Aug 2020 09:54:06 +0000</pubDate>
      <link>https://dev.to/sephib/simple-pipeline-monitoring-dashboard-386p</link>
      <guid>https://dev.to/sephib/simple-pipeline-monitoring-dashboard-386p</guid>
      <description>&lt;p&gt;This post is co-authored with &lt;a href="https://www.linkedin.com/in/davidkatz-il/"&gt;David Katz&lt;/a&gt;  &lt;/p&gt;

&lt;h2&gt;
  
  
  Background
&lt;/h2&gt;

&lt;p&gt;These days any deployed project should incorporate the principles of CI/CD (highly recommended: this &lt;a href="https://www.youtube.com/watch?v=Dx2vG6qmtPs&amp;amp;t=232s"&gt;great talk from Eric Ma&lt;/a&gt;, July 2020, describes the issue in the realm of &lt;em&gt;Data Science&lt;/em&gt;). Thus, after setting up our &lt;a href="https://dev.to/sephib/implementing-a-graph-network-pipeline-with-dagster-3i3a"&gt;dagster pipeline&lt;/a&gt; we needed to implement some sort of monitoring solution to review the outcome of our workflow. Working in a small DS team, we needed to push forward and couldn't wait for the &lt;em&gt;heavy guns&lt;/em&gt; of enterprise IT to take over. So until we have their support, here is a simple dashboard that we put together to monitor our &lt;code&gt;Assets&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;In this blog post we aim to describe how we created a functional dashboard based on python widgets.&lt;br&gt;&lt;br&gt;
We will describe the origin of our data, followed by our solution using python's &lt;a href="https://panel.holoviz.org/"&gt;Panel&lt;/a&gt; library.  &lt;/p&gt;

&lt;p&gt;The code for this post is in this repo &lt;br&gt;
&lt;/p&gt;
&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--566lAguM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/sephib"&gt;
        sephib
      &lt;/a&gt; / &lt;a href="https://github.com/sephib/dagster-graph-project"&gt;
        dagster-graph-project
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      Repo demonstrating a Dagster pipeline to generate Neo4j Graph
    &lt;/h3&gt;
  &lt;/div&gt;
&lt;/div&gt;


&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--nq2_zzHG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/cae38vg0dkeyw4teeyqz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--nq2_zzHG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/cae38vg0dkeyw4teeyqz.png" alt="Simple Dashboard demo" width="825" height="511"&gt;&lt;/a&gt;  &lt;/p&gt;

&lt;h2&gt;
  
  
  Dagster Assets
&lt;/h2&gt;

&lt;p&gt;We are not going to dive into &lt;a href="https://docs.dagster.io/"&gt;Dagster&lt;/a&gt; (see  previous &lt;a href="https://dev.to/sephib/implementing-a-graph-network-pipeline-with-dagster-3i3a"&gt;blog post on our data pipeline&lt;/a&gt;), but the TLDR  is that Dagster is an orchestration framework for building modern data applications and workflows. The framework has integrated logging and the ability to &lt;a href="https://docs.dagster.io/overview/asset-materializations#materializing-an-asset"&gt;produce persistent assets&lt;/a&gt; that are stored in a database (in our case &lt;em&gt;postgresql&lt;/em&gt;) for future references.&lt;br&gt;&lt;br&gt;
For our project we are interested in monitoring the number of nodes and edges that we generate in our data pipeline workflow. During a &lt;em&gt;pipeline run&lt;/em&gt; we log (or, in Dagster's jargon, &lt;code&gt;Materialize&lt;/code&gt; - see &lt;a href="https://docs.dagster.io/examples/materializations#main"&gt;AssetMaterialization in the documentation&lt;/a&gt;) various stats on the datasets that we wish to manipulate. We would like to view the changes in these stats over time in order to verify the "health" of our system/pipeline.&lt;/p&gt;
&lt;h2&gt;
  
  
  Panel widgets
&lt;/h2&gt;

&lt;p&gt;Today, the python ecosystem is very rich and vibrant with various visualization libraries that are constantly being developed. Two of the libraries that we reviewed were &lt;a href="https://www.streamlit.io/"&gt;streamlit&lt;/a&gt; and &lt;a href="https://panel.holoviz.org/"&gt;Panel&lt;/a&gt;. We decided to go with Panel which seemed to suit our needs (due mainly to its structure and maintenance from our side).&lt;br&gt;&lt;br&gt;
Inspired by a talk given by  &lt;a href="https://www.youtube.com/watch?v=Un30yb1WlpU&amp;amp;feature=youtu.be"&gt;Lina Weichbrodt in the MLOps meetup&lt;/a&gt;,  we wanted to view the percent change of our metrics over time.   &lt;/p&gt;

&lt;p&gt;Panel is capable of displaying and integrating many python widgets from various packages. We are going to work with hvplot which best fits our needs, due to its richness and its integration with Pandas.    &lt;/p&gt;
&lt;h2&gt;
  
  
  Getting our data/assets from the database
&lt;/h2&gt;

&lt;p&gt;In this section we describe how we extracted the data from &lt;code&gt;Dagster's Asset&lt;/code&gt; database. If this is not relevant, you may want to jump to the sample data section below.&lt;br&gt;&lt;br&gt;
In order to access the &lt;code&gt;Asset&lt;/code&gt; data we needed to dig into the &lt;code&gt;event_log&lt;/code&gt; table, which logs all the events that are generated when running a Dagster pipeline. The script that extracts the data into a Pandas DataFrame, based on the &lt;code&gt;Asset Keys&lt;/code&gt; that are defined in the &lt;code&gt;Materialization&lt;/code&gt; process, is in the repo linked above.    &lt;/p&gt;

&lt;p&gt;Here are the key elements in the script:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;In order to access the assets we need to query the &lt;code&gt;event_logs&lt;/code&gt; table. We can use a &lt;code&gt;sqlalchemy&lt;/code&gt; query as follows:  &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;select([t_event_logs.c.event]).where(t_event_logs.c.asset_key.in_(assets))&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;For parsing the results we can use &lt;code&gt;dagster's&lt;/code&gt; internal utility &lt;code&gt;deserialize_json_to_dagster_namedtuple&lt;/code&gt;. Below is the function that converts the assets into a dictionary. Please note that we only retrieve assets of a numeric type (which can be plotted); this mirrors &lt;code&gt;dagit's&lt;/code&gt; decision to display only numeric asset values in graphs.&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def get_asset_keys_values(results)-&amp;gt;dict:
    assets={}
    for result in results:
        dagster_namedtuple = deserialize_json_to_dagster_namedtuple(result[0])
        time_stamp = datetime.fromtimestamp(dagster_namedtuple.timestamp).strftime('%Y-%m-%d %H:%M:%S')
        assets[time_stamp] = {}        
        assets[time_stamp]['asset_key'] = dagster_namedtuple.dagster_event.asset_key.to_string()
        for entry in dagster_namedtuple.dagster_event.event_specific_data.materialization.metadata_entries:
            if isinstance(entry.entry_data, FloatMetadataEntryData):  # Only assets that are numerical
                assets[time_stamp][entry.label] = entry.entry_data.value
    return assets
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The full code for retrieving the data is in &lt;a href="https://github.com/sephib/dagster-graph-project/blob/master/src/get_dagset_assets.py"&gt;get_dagster_asset.py&lt;/a&gt; file. &lt;/p&gt;
&lt;h3&gt;
  
  
  Sample Data
&lt;/h3&gt;

&lt;p&gt;For the dashboard in this post, we are going to use the sample data from bokeh.    &lt;/p&gt;

&lt;p&gt;Since we are simulating our data pipeline outcomes, we are going to use a sample of the columns:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;date - as our X / time axis&lt;/li&gt;
&lt;li&gt;Temperature
&lt;/li&gt;
&lt;li&gt;Humidity
&lt;/li&gt;
&lt;li&gt;Light &lt;/li&gt;
&lt;li&gt;CO2&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let's view the data  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--LB7EozhQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/ogy5gg8stqu4mk3mbk1u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--LB7EozhQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/ogy5gg8stqu4mk3mbk1u.png" alt="sample dataframe" width="631" height="240"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Since we are interested in the change of the various stats with time we can use Panda's &lt;a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.pct_change.html"&gt;pct_change&lt;/a&gt; method to generate the values that we need.  This also allows displaying all the datasets in the same graph since the nominal values of the various datasets are of different orders of magnitude.    &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--tX6Qt-zN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/77pp9a6wl4n987kmq1rv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--tX6Qt-zN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/77pp9a6wl4n987kmq1rv.png" alt="sample df pct_change" width="534" height="177"&gt;&lt;/a&gt;  &lt;/p&gt;

&lt;p&gt;Now that we have the data we can build our dashboard  &lt;/p&gt;
&lt;h2&gt;
  
  
  Dashboard
&lt;/h2&gt;

&lt;p&gt;We have 2 widgets that we want to use in our dashboard:  &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A line plot - displaying the datasets, with a scatter plot overlaid to add value markers&lt;/li&gt;
&lt;li&gt;A date_range_slider widget - selecting the date range that we want to display&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Our dashboard will display each data series along the X time axis.  &lt;/p&gt;
&lt;h3&gt;
  
  
  DateRangeSlider
&lt;/h3&gt;

&lt;p&gt;Panel's &lt;a href="https://panel.holoviz.org/reference/widgets/DateRangeSlider.html"&gt;DateRangeSlider&lt;/a&gt; widget "allows selecting a date range using a slider with two handles".  &lt;/p&gt;

&lt;p&gt;The parameters of the widget are self-explanatory.&lt;br&gt;&lt;br&gt;
Please note that the &lt;code&gt;value&lt;/code&gt; parameter holds the default selection of the DateRangeSlider - a (start, end) tuple within the slider's range.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;date_range_slider = pn.widgets.DateRangeSlider(
        name='Date Range Slider',
        start=data[date_col].min(), 
        end=data[date_col].max(),
        value=(data[date_col].max() - timedelta(hours=1), 
               data[date_col].max()
               )   # default value for slider
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Line Plot &amp;amp; Panel's Glue
&lt;/h2&gt;

&lt;p&gt;Now let's look at the Line plot code:  &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;import holoviews.plotting.bokeh&lt;br&gt;&lt;br&gt;
import hvplot.pandas  &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;These imports set &lt;a href="https://bokeh.org/"&gt;bokeh&lt;/a&gt; as the visualization backend for hvplot, and allow hvplot to use pandas DataFrames directly as the data sources for the plots.  &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;@pn.depends(date_range_slider.param.value)  &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This Panel decorator makes the &lt;em&gt;line plot&lt;/em&gt; re-render whenever the value of the &lt;code&gt;date_range_slider&lt;/code&gt; widget changes.   &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;start_date = date_range[0]&lt;br&gt;&lt;br&gt;
end_date = date_range[1]&lt;br&gt;&lt;br&gt;
mask = (data[date_col] &amp;gt; start_date) &amp;amp; (data[date_col] &amp;lt;= end_date)&lt;br&gt;&lt;br&gt;
data = data.loc[mask]   &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In order to filter the dataframe we are masking the data based on the current values from the &lt;code&gt;date_range_slider&lt;/code&gt; widget.  &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;data.hvplot.line  &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is the basic call for a line plot to be rendered from the pandas DataFrame.  &lt;/p&gt;

&lt;p&gt;The &lt;a href="https://hvplot.holoviz.org/reference/pandas/scatter.html"&gt;scatter plot&lt;/a&gt; was added in order to display the value markers on the line plot.&lt;/p&gt;

&lt;p&gt;Here is the full function:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;@pn.depends(date_range_slider.param.value)
def get_plot(date_range):
    data = dft
    start_date = date_range[0]
    end_date = date_range[1]
    mask = (data[date_col] &amp;gt; start_date) &amp;amp; (data[date_col] &amp;lt;= end_date)
    data = data.loc[mask]

    lines = data[cols + [date_col]].hvplot.line(
          x=date_col
        , y=cols
        , value_label= 'value'  
        , legend='right'
        , height=400
        , width=800
        , muted_alpha=0
        , ylim=(-0.1, 0.1)  # This can be configured based on the pct change scale 
        , xlabel='time'
        , ylabel='% change'
    )   
    scatter = data[cols + [date_col]].hvplot.scatter(
                x=date_col,
                y= cols,

    )
    return lines.opts(axiswise=True) * scatter
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Final function
&lt;/h2&gt;

&lt;p&gt;Now we can create a function that connects the different widgets:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def get_dashboard(dft, cols, date_col):
    date_range_slider = pn.widgets.DateRangeSlider(
        name='Date Range Slider',
        start=dft[date_col].min(), end=dft[date_col].max(),
        value=(dft[date_col].max() - timedelta(hours=1), dft[date_col].max(),)
    )
    @pn.depends(date_range_slider.param.value)
    def get_plot(date_range):
        data = dft
        start_date = date_range[0]
        end_date = date_range[1]
        mask = (data[date_col] &amp;gt; start_date) &amp;amp; (data[date_col] &amp;lt;= end_date)
        data = data.loc[mask]

        lines = data[cols + [date_col]].hvplot.line(
              x=date_col
            , y=cols
            , value_label= 'value'  
            , legend='right'
            , height=400
            , width=800
            , muted_alpha=0
            , ylim=(-0.1, 0.1)  # This can be configured based on the pct change scale 
            , xlabel='time'
            , ylabel='% change'
        )   
        scatter = data[cols + [date_col]].hvplot.scatter(
                    x=date_col,
                    y= cols,

        )
        return lines.opts(axiswise=True) * scatter
    return get_plot, date_range_slider
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Design the Dashboard
&lt;/h2&gt;

&lt;p&gt;Panel has a simple way of aggregating all the widgets together using rows and columns (like a simple HTML table).   &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--02ICjvHJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/ihj8fz6ampqcvy7jw44l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--02ICjvHJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/ihj8fz6ampqcvy7jw44l.png" alt="Panel Layout" width="739" height="505"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Below is the code to design the layout&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;plot, date_range_slider = get_dashboard(data, cols, 'date')
dashboard=pn.Row(
    pn.Column(
        pn.pane.Markdown(''' ## Dataset Percent Change'''),
        plot,
        date_range_slider,
    ),

)
dashboard

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--nq2_zzHG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/cae38vg0dkeyw4teeyqz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--nq2_zzHG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/cae38vg0dkeyw4teeyqz.png" alt="Simple Dashboard demo" width="825" height="511"&gt;&lt;/a&gt;   &lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In this blog post we have outlined our solution for monitoring the Dagster Assets that we log during our data pipeline workflow.&lt;br&gt;&lt;br&gt;
Using the Panel / hvplot libraries was quite straightforward. The documentation and reference galleries were very useful, although linking some widget actions may require a bit of JS. Working through the examples, such as the last section of the &lt;a href="https://panel.holoviz.org/getting_started/index.html"&gt;getting started documentation&lt;/a&gt;, in addition to the more advanced examples, shows the potential for building an elaborate dashboard if required.  &lt;/p&gt;

</description>
      <category>python</category>
      <category>monitor</category>
      <category>pyviz</category>
    </item>
    <item>
      <title>Implementing a graph network pipeline with Dagster</title>
      <dc:creator>Sephi Berry</dc:creator>
      <pubDate>Thu, 09 Jul 2020 13:56:21 +0000</pubDate>
      <link>https://dev.to/sephib/implementing-a-graph-network-pipeline-with-dagster-3i3a</link>
      <guid>https://dev.to/sephib/implementing-a-graph-network-pipeline-with-dagster-3i3a</guid>
      <description>&lt;p&gt;This post is co-authored with &lt;a href="https://www.linkedin.com/in/davidkatz-il/" rel="noopener noreferrer"&gt;David Katz&lt;/a&gt; &lt;/p&gt;

&lt;h2&gt;
  
  
  Background
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Working in the Intelligence arena we try to 'connect the dots' to extract meaningful information from data.&lt;/li&gt;
&lt;li&gt;We analyze various datasets to link between them in a logical manner.&lt;/li&gt;
&lt;li&gt;This is useful in many different projects - so we needed to build a pipeline that can be both dynamic and robust, and be readily and easily  utilized.&lt;/li&gt;
&lt;li&gt;In this blog post we share our experience in running one of our data pipelines with  &lt;a href="https://docs.dagster.io/" rel="noopener noreferrer"&gt;dagster&lt;/a&gt; - which uses a modern approach (compared to the traditional Airflow / Luigi task managers), see &lt;a href="https://docs.dagster.io/docs/learn" rel="noopener noreferrer"&gt;Dagster's website description&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;We hope this blog post will help others to adopt such a data-pipeline and allow them to learn from our experiences.
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The code for this post is in this repo &lt;br&gt;
&lt;/p&gt;
&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev.to%2Fassets%2Fgithub-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/sephib" rel="noopener noreferrer"&gt;
        sephib
      &lt;/a&gt; / &lt;a href="https://github.com/sephib/dagster-graph-project" rel="noopener noreferrer"&gt;
        dagster-graph-project
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      Repo demonstrating a Dagster pipeline to generate Neo4j Graph
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;p&gt;This repo is an example of using &lt;a href="https://docs.dagster.io/" rel="nofollow noopener noreferrer"&gt;dagster framework&lt;/a&gt; in a real-world data pipeline.&lt;/p&gt;
&lt;div&gt;&lt;a rel="noopener noreferrer" href="https://github.com/sephib/dagster-graph-projectdocs/images/postBanner.png"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fsephib%2Fdagster-graph-projectdocs%2Fimages%2FpostBanner.png" alt="postBanner dagster spark neo4j" width="650"&gt;&lt;/a&gt;&lt;/div&gt;
&lt;ul&gt;
&lt;li&gt;See &lt;a href="https://dev.to/sephib/implementing-a-graph-network-pipeline-with-dagster-3i3a" rel="nofollow"&gt;Implementing a graph network pipeline with Dagster&lt;/a&gt; blog post for the entire write-up describing how we created a graph (nodes and edges) from separate data sources and batch import them into Neo4j. A &lt;a href="https://github.com/sephib/dagster-graph-project/tree/master/notebooks/dagster_pipeline_blog.ipynb" rel="noopener noreferrer"&gt;jupyter notebook&lt;/a&gt; is also available in this repo, in addition to the entire code to replicate this example.&lt;/li&gt;
&lt;/ul&gt;
&lt;div&gt;&lt;a rel="noopener noreferrer" href="https://github.com/sephib/dagster-graph-projectdocs/images/monitorpostBanner.png"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fsephib%2Fdagster-graph-projectdocs%2Fimages%2FmonitorpostBanner.png" alt="postBanner dagster spark neo4j" width="450"&gt;&lt;/a&gt;&lt;/div&gt;
&lt;ul&gt;
&lt;li&gt;See &lt;a href="https://dev.to/sephib/simple-pipeline-monitoring-dashboard-386p" rel="nofollow"&gt;Simple Pipeline Monitoring Dashboard&lt;/a&gt; blog post for the entire write-up describing the monitoring dashboard that we created using &lt;a href="https://panel.holoviz.org/" rel="nofollow noopener noreferrer"&gt;Panel&lt;/a&gt;. A &lt;a href="https://github.com/sephib/dagster-graph-project/tree/master/notebooks/dashboard_blog.ipynb" rel="noopener noreferrer"&gt;jupyter notebook&lt;/a&gt; is also available in this repo.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;



&lt;/div&gt;
&lt;br&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/sephib/dagster-graph-project" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;br&gt;
&lt;/div&gt;
&lt;br&gt;


&lt;h2&gt;
  
  
  Our Challenge
&lt;/h2&gt;

&lt;h4&gt;
  
  
  Connecting the dots
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Finding relationships between different entities can be challenging (especially across datasets) - but is vital when building a reliable intelligence report.
&lt;/li&gt;
&lt;li&gt;A logical structure to store the entities and their relationships is in a &lt;a href="https://en.wikipedia.org/wiki/Graph_database" rel="noopener noreferrer"&gt;Graph Database&lt;/a&gt;. In this example we are going to use &lt;a href="https://neo4j.com/" rel="noopener noreferrer"&gt;Neo4j&lt;/a&gt; DB for storing the graph.
&lt;/li&gt;
&lt;li&gt;In this example we are going to use &lt;em&gt;pseudo BigData&lt;/em&gt;; in production, however, the pipeline that we present generates billions of relationships.
&lt;/li&gt;
&lt;li&gt;The pipeline's output files will be in a format that allows us to use the dedicated &lt;a href="https://neo4j.com/docs/operations-manual/4.1/tools/import/" rel="noopener noreferrer"&gt;Neo4j tool for bulk import&lt;/a&gt;. In the future we will do a separate blog post on our data analysis workflow with &lt;code&gt;Neo4j&lt;/code&gt;. &lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  First take
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Our initial version for our pipeline was based on custom code and configurations (YAML) files.
&lt;/li&gt;
&lt;li&gt; The code base is a combination of R and Python scripts that utilize Spark, Dask and HDFS.
&lt;/li&gt;
&lt;li&gt; A shell script aggregated all the scripts to run the entire workflow&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Inspect and Adapt
&lt;/h4&gt;

&lt;p&gt;After our initial alpha version we noticed that we had some problems that required our attention:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The pipeline was built horizontally (per process) and not vertically (per dataset) - leading to uncertainty and fragmented results
&lt;/li&gt;
&lt;li&gt;We needed to refactor our code in order to stabilize and verify the quality of the product.
&lt;/li&gt;
&lt;li&gt;Working with &lt;a href="//www.dask.org"&gt;dask&lt;/a&gt; didn't solve all of our use-cases and so we needed to run some workloads on &lt;a href="https://spark.apache.org/" rel="noopener noreferrer"&gt;spark&lt;/a&gt;.
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After checking several options we chose &lt;code&gt;dagster&lt;/code&gt; as our pipeline framework for numerous reasons - including:  &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Simplicity of its logic/framework
&lt;/li&gt;
&lt;li&gt;Modern architecture with data as a "first class citizen".
&lt;/li&gt;
&lt;li&gt;Open-source code base
&lt;/li&gt;
&lt;li&gt;The framework includes a modern UI for monitoring and communicating the workflow status
&lt;/li&gt;
&lt;li&gt;Pipeline supports data dependencies (and not function outputs)
&lt;/li&gt;
&lt;li&gt;Ability to log/monitor metrics
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;It was fortunate that our configuration files allowed us to identify the various functions and abstract our business logic from the data transformation.  &lt;/p&gt;

&lt;p&gt;*** This is a &lt;code&gt;dagster&lt;/code&gt; &lt;em&gt;intermediate level&lt;/em&gt; blog post - newcomers are encouraged to run through the &lt;a href="https://docs.dagster.io/docs/tutorial" rel="noopener noreferrer"&gt;beginner tutorial&lt;/a&gt; on dagster's site.   &lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding &lt;code&gt;Dagster&lt;/code&gt; &lt;strong&gt;lego&lt;/strong&gt; blocks
&lt;/h2&gt;

&lt;p&gt;Before we start, here's a short introduction to Dagster's &lt;em&gt;&lt;em&gt;lego&lt;/em&gt;&lt;/em&gt; building blocks &lt;a href="https://docs.dagster.io/docs/learn/concepts" rel="noopener noreferrer"&gt;see dagster documentation&lt;/a&gt;: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;At the core we have &lt;code&gt;solids&lt;/code&gt; - these are the various "functional unit of computation that consumes and produces data assets".
&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;solids&lt;/code&gt; can be aggregated into &lt;code&gt;composite solids&lt;/code&gt;.
&lt;/li&gt;
&lt;li&gt;A &lt;code&gt;solid&lt;/code&gt; can have &lt;strong&gt;inputs&lt;/strong&gt; and &lt;strong&gt;outputs&lt;/strong&gt; that can be passed along the pipeline.
&lt;/li&gt;
&lt;li&gt;A &lt;code&gt;pipeline&lt;/code&gt; orchestrates the various &lt;code&gt;solids&lt;/code&gt; and the data flows.
&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Design implementation
&lt;/h4&gt;

&lt;p&gt;The following diagram displays the architecture  outline of the code:  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F1q3rdf7iqaynsfm7xl6y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F1q3rdf7iqaynsfm7xl6y.png" alt="entity model"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;YAML configurations&lt;br&gt;&lt;br&gt;
&lt;code&gt;Dagster&lt;/code&gt; has many configuration files that assist in managing pipelines and their environments. In this example we will use only 2 configuration types:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;resources - configuration file that manages the resources for the running pipeline
&lt;/li&gt;
&lt;li&gt;solids - configuration file for the &lt;code&gt;composite solids&lt;/code&gt;. Each data source has its own configuration, in addition to the composite solid that implements the creation of the Neo4j DB.
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;Inputs  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Our premise is that the datasets inputs arrive in a timely manner (batch not streaming).
&lt;/li&gt;
&lt;li&gt;Each dataset source has a corresponding configuration file.
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;Pipeline  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;code&gt;pipeline&lt;/code&gt; consists of all the &lt;code&gt;composite solids&lt;/code&gt; that organize the work that needs to be executed within each &lt;code&gt;solid&lt;/code&gt;.
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;Output Files &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In our example the outputs are:

&lt;ul&gt;
&lt;li&gt;Nodes and Edges flat files in the format to be bulk imported into Neo4j&lt;/li&gt;
&lt;li&gt;Neo4j DB&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;/li&gt;

&lt;/ul&gt;

&lt;h3&gt;
  
  
  Let's build
&lt;/h3&gt;

&lt;p&gt;Now we will build the basic units needed for our project.&lt;br&gt;&lt;br&gt;
Since we have several datasets, we can build each building block to be executed with a configuration file:  &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Read file into dataframe  (e.g. &lt;code&gt;csv&lt;/code&gt; and &lt;code&gt;parquet&lt;/code&gt;)
&lt;/li&gt;
&lt;li&gt;Massage our data to fit the schema that we want for our graph network (including adding/dropping columns, renaming columns, concatenating columns, etc.)
&lt;/li&gt;
&lt;li&gt;Generate nodes and edges  (per entity in the graph model)
&lt;/li&gt;
&lt;li&gt;Save nodes and edges into csv files
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Finally we will bulk import the csv files (nodes and edges) into &lt;code&gt;Neo4j&lt;/code&gt;  &lt;/p&gt;
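&lt;p&gt;The bulk-import step itself is a command-line call to &lt;code&gt;neo4j-admin import&lt;/code&gt;; as a hedged sketch it can be driven from Python - note the CSV file names here are illustrative assumptions, not the repo's actual output files:&lt;/p&gt;

```python
import subprocess

# Illustrative file names -- the real pipeline writes its own node/edge CSVs
cmd = [
    "neo4j-admin", "import",
    "--nodes=Player=players.csv",
    "--nodes=Team=teams.csv",
    "--relationships=PLAYED_IN=played_in.csv",
    "--relationships=PLAYED_TOGETHER=played_together.csv",
]

# Requires a local Neo4j installation; uncomment to actually run the import:
# subprocess.run(cmd, check=True)
print(" ".join(cmd))
```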
&lt;h2&gt;
  
  
  Data
&lt;/h2&gt;

&lt;p&gt;In order to demonstrate our workflow we will use data from &lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fy695a4pgr7ndhr2azc0r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fy695a4pgr7ndhr2azc0r.png" alt="StatsBomb logo"&gt;&lt;/a&gt; &lt;a href="https://statsbomb.com/" rel="noopener noreferrer"&gt;StatsBomb&lt;/a&gt;, a football analytics company that provides data from various leagues and competitions (for the American readers - we are talking about the &lt;strong&gt;original&lt;/strong&gt; football - &lt;em&gt;soccer&lt;/em&gt;). The company has a free open data API tier that can be accessed using the instructions on &lt;a href="https://github.com/statsbomb/statsbombpy#open-data" rel="noopener noreferrer"&gt;their GitHub page&lt;/a&gt;.    &lt;/p&gt;

&lt;p&gt;We would like to find the relationships between the players in the following dimensions:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Player relationship

&lt;ol&gt;
&lt;li&gt;Played together in the same match&lt;/li&gt;
&lt;li&gt;Passed the ball to another Player&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;Team relationship - Player played in a team
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We will be building a graph based on the following entity model:  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F8exzh9mptxx2p9ezc4o7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F8exzh9mptxx2p9ezc4o7.png" alt="entity model"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The following notebooks are available to download the data into 2 separate datasets:  &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Player lineup as csv  &lt;a href="//notebooks/statsbomb_player_team.ipynb"&gt;link to notebook&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Player passes as parquet  &lt;a href="//notebooks/statsbomb_player_pass.ipynb"&gt;link to notebook&lt;/a&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;
  
  
  Understanding the datasets
&lt;/h3&gt;
&lt;h4&gt;
  
  
  Player Lineup
&lt;/h4&gt;

&lt;p&gt;This dataset has all the information regarding the players relationships with their teams. The key columns that we will use are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;player_id&lt;/code&gt; -  identifies each &lt;code&gt;Player&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;team_id&lt;/code&gt; -  identifies each &lt;code&gt;Team&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;team_id&lt;/code&gt; &amp;amp; &lt;code&gt;match_id&lt;/code&gt; - will create our &lt;code&gt;EdgeID&lt;/code&gt; to identify when a &lt;code&gt;Player&lt;/code&gt; &lt;code&gt;PLAYED_TOGETHER&lt;/code&gt; with another &lt;code&gt;Player&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt; The &lt;code&gt;Player&lt;/code&gt; &lt;code&gt;PLAYED_IN&lt;/code&gt; relationship can be immediately derived from the table.&lt;/li&gt;
&lt;/ol&gt;
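&lt;p&gt;To make the &lt;code&gt;PLAYED_TOGETHER&lt;/code&gt; idea concrete, here is a hedged pandas sketch on a toy lineup (column names follow the description above; the real pipeline does this at scale in spark):&lt;/p&gt;

```python
import pandas as pd
from itertools import combinations

# Toy lineup rows: one row per player appearance in a match
lineup = pd.DataFrame({
    "match_id": [1, 1, 1],
    "team_id": [100, 100, 100],
    "player_id": [10, 7, 9],
})

# Pair every two players who share the same (team_id, match_id) appearance;
# the pair's EdgeID is derived from team_id and match_id
edges = [
    {"src": a, "dst": b, "edge_id": f"{team}_{match}"}
    for (team, match), grp in lineup.groupby(["team_id", "match_id"])
    for a, b in combinations(sorted(grp["player_id"]), 2)
]
```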
&lt;h4&gt;
  
  
  Player Event
&lt;/h4&gt;

&lt;p&gt;This dataset has all the information regarding the players action within a match. The key columns that we will use are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;player_id&lt;/code&gt; - identifies each &lt;code&gt;Player&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;pass_type&lt;/code&gt; - identifies the event in the match (we will select only the &lt;code&gt;pass&lt;/code&gt; event)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;pass_recipient&lt;/code&gt; - will identify the recipient of the pass&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Additional properties will enrich the Nodes and Edges.&lt;/p&gt;
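&lt;p&gt;As a hedged illustration of how the pass events become edges (toy rows; the column names follow the description above, and the real pipeline runs this in spark):&lt;/p&gt;

```python
import pandas as pd

# Toy event rows standing in for the StatsBomb events data
events = pd.DataFrame({
    "player_id": [10, 10, 7],
    "pass_type": ["pass", "shot", "pass"],
    "pass_recipient": [7.0, None, 10.0],
})

# Keep only the pass events, then rename to a source/target edge schema
passes = events[events["pass_type"] == "pass"]
edges = passes.rename(columns={"player_id": "src", "pass_recipient": "dst"})[["src", "dst"]]
```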
&lt;h3&gt;
  
  
  Let's  play  Lego
&lt;/h3&gt;

&lt;p&gt;Let's see how we can use the &lt;code&gt;dagster&lt;/code&gt;'s  blocks &lt;/p&gt;

&lt;p&gt;Since we are working with &lt;em&gt;BigData&lt;/em&gt; we will be working with &lt;code&gt;spark&lt;/code&gt; (we also implemented some of the workflow on &lt;code&gt;Dask&lt;/code&gt; - but will keep this for a future post).&lt;br&gt;&lt;br&gt;
We will need to tell &lt;code&gt;dagster&lt;/code&gt; what resources we are going to use. In dagster's environment everything is very modular, so we can define our resources with a YAML file &lt;a href="https://docs.dagster.io/docs/apidocs/pipeline#dagster.resource" rel="noopener noreferrer"&gt;see api resource documentation&lt;/a&gt;.    &lt;/p&gt;

&lt;p&gt;In our env_yaml folder we have the following &lt;code&gt;env_yaml/resources.yaml&lt;/code&gt; file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;resources:  
    spark:  
      config:  
        spark_conf:  
          spark:  
            master: "local"  
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Additional configuration for &lt;code&gt;spark&lt;/code&gt; can be included under &lt;code&gt;spark_conf&lt;/code&gt;. For our workload we added, for example, the following parameters:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;spark.executor.memoryOverhead: "16G"&lt;br&gt;
spark.driver.memory: "8G"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Obviously for the example in this post there is no need to add any additional parameters. &lt;/p&gt;

&lt;h3&gt;
  
  
  Solid Intro
&lt;/h3&gt;

&lt;p&gt;Let's review a simple &lt;code&gt;solid&lt;/code&gt;, such as a &lt;code&gt;show&lt;/code&gt; function for a spark dataframe:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;@solid(
    config_schema={
        "num_rows": Field(
            Int, is_required=False, default_value=5, description=("Number of rows to display"),
        ),
    }
)
def show(context, df: DataFrame):
    num_rows = context.solid_config.get("num_rows")
    context.log.info(f"df.show():\n{df._jdf.showString(num_rows, 20, False)}")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;@solid()&lt;/code&gt; decorator above the function converts the function into a &lt;code&gt;solid&lt;/code&gt; so dagster can utilize/ingest it.&lt;br&gt;&lt;br&gt;
The &lt;a href="https://docs.dagster.io/docs/apidocs/solids" rel="noopener noreferrer"&gt;solid decorator&lt;/a&gt;  can take several parameters. In this &lt;code&gt;solid&lt;/code&gt; we are just using    &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href=""&gt;config_schema&lt;/a&gt; which consists of:

&lt;ul&gt;
&lt;li&gt;
&lt;a href=""&gt;Field&lt;/a&gt; "num_rows" which will determine the number of rows to print out with the following properties:

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;Int&lt;/code&gt; type (see list of available &lt;a href="https://docs.dagster.io/docs/apidocs/types" rel="noopener noreferrer"&gt;types&lt;/a&gt;).
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;is_required&lt;/code&gt; (boolean) determines if this parameter is required
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;default_value&lt;/code&gt; for the number of rows to show
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;description&lt;/code&gt; of the Field parameter.
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The config_schema assists in checking the validity of our pipeline.&lt;/p&gt;
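&lt;p&gt;For example, a run config supplying the field could look like this (assuming the solid is registered under the name &lt;code&gt;show&lt;/code&gt; in the pipeline):&lt;/p&gt;

```yaml
solids:
  show:
    config:
      num_rows: 10
```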

&lt;blockquote&gt;
&lt;p&gt;def show(context, df: DataFrame):  &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Our function receives the &lt;code&gt;context&lt;/code&gt; of the solid (supplementing the function with some additional inputs), in addition to a parameter that it will receive from the pipeline (which is a DataFrame).   &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;num_rows = context.solid_config.get("num_rows")  &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;When the function is executed we get the &lt;code&gt;num_rows&lt;/code&gt; parameter   &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;context.log.info(f"df.show():\n{df._jdf.showString(num_rows, 20, False)}")   &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Then we are using the &lt;a href="https://docs.dagster.io/docs/apidocs/internals#dagster.DagsterLogManager" rel="noopener noreferrer"&gt;internal dagster logging&lt;/a&gt; to print out the first rows of the DataFrame.  &lt;/p&gt;
&lt;h3&gt;
  
  
  Solid  Cont.
&lt;/h3&gt;

&lt;p&gt;Now let's delve deeper into a more complex &lt;code&gt;solid&lt;/code&gt;, such as &lt;code&gt;read_file&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;@solid(
    output_defs=[OutputDefinition(dagster_type=DataFrame, name="df")],
    required_resource_keys={"spark"},
    config_schema={
        "path": Field(
            Any,
            is_required=True,
            description=(
                "String or a list of string for file-system backed data sources."
            ),
        ),
        "dtype": Field(
            list,
            is_required=False,
            description='Dictionary with column types e.g. {"col_name": "string"}.',
        ),
        "format": Field(
            String,
            default_value="csv",
            is_required=False,
            description='String for the format of the data source. Defaults to "csv".',
        ),
        "options": Field(
            Permissive(
                fields={
                    "inferSchema": Field(Bool, is_required=False),
                    "sep": Field(String, is_required=False),
                    "header": Field(Bool, is_required=False),
                    "encoding": Field(String, is_required=False),
                }
            ),
            is_required=False,
        ),
    },
)
def read_file(context) -&amp;gt; DataFrame:
    path = context.solid_config["path"]
    dtype = context.solid_config.get("dtype")
    _format = context.solid_config.get("format")
    options = context.solid_config.get("options", {})
    context.log.debug(
        f"read_file: path={path}, dtype={dtype}, _format={_format}, options={options}, "
    )
    spark = context.resources.spark.spark_session
    if dtype:
        df = (
            spark.read.format(_format)
            .options(**options)
            .schema(transform.create_schema(dtype))
            .load(path)
        )
    else:
        df = spark.read.format(_format).options(**options).load(path)

    yield Output(df, "df")


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's now break it down.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;code&gt;solid&lt;/code&gt; decorator
&lt;/h4&gt;

&lt;p&gt;In this &lt;a href="https://docs.dagster.io/docs/apidocs/solids#dagster.SolidDefinition" rel="noopener noreferrer"&gt;&lt;code&gt;solid&lt;/code&gt; decorator&lt;/a&gt; we have some additional parameters:     &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;output_defs=[OutputDefinition(dagster_type=DataFrame, name="df")],  &lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
&lt;li&gt;output_defs defines a list of &lt;a href="https://docs.dagster.io/docs/apidocs/solids#dagster.OutputDefinition" rel="noopener noreferrer"&gt;OutputDefinition&lt;/a&gt;s for the &lt;code&gt;solid&lt;/code&gt;.  In our case the output will be a &lt;code&gt;dataframe&lt;/code&gt; that will be consumed by other &lt;code&gt;solid&lt;/code&gt;s in the pipeline.  The &lt;code&gt;name&lt;/code&gt; in the OutputDefinition is how other &lt;code&gt;solid&lt;/code&gt;s refer to this output.
&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;required_resource_keys={"spark"}  &lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;a href="https://docs.dagster.io/docs/apidocs/solids#dagster.SolidDefinition" rel="noopener noreferrer"&gt;resources&lt;/a&gt; that the solid requires in order to execute.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; config_schema={  
         "path": Field(  
        ...          
        "options": Field(  
            Permissive(   
                fields={  
                    "inferSchema": Field(Bool, is_required=False),  
                    "sep": Field(String, is_required=False),  
                    "header": Field(Bool, is_required=False),  
                    "encoding": Field(String, is_required=False),  
                }  
            ),  
            is_required=False,  
        ),  

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;a href="("&gt;config_schema&lt;/a&gt; - similar to the explanation above.

&lt;ul&gt;
&lt;li&gt;In this &lt;code&gt;solid&lt;/code&gt; we also have a &lt;code&gt;Permissive&lt;/code&gt; Field type - a dictionary that accepts various optional parameters for reading in the file.
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
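&lt;p&gt;To make this concrete, here is a minimal sketch (plain Python, mirroring the YAML run config) of configuration that would satisfy this &lt;code&gt;config_schema&lt;/code&gt;. The file path and option values are hypothetical; the point is that a &lt;code&gt;Permissive&lt;/code&gt; field also accepts extra keys beyond the four it declares:&lt;/p&gt;

```python
# Hypothetical run config for the read_file solid, expressed as a
# Python dict (in practice this usually lives in a YAML file).
run_config = {
    "solids": {
        "read_file": {
            "config": {
                "path": "data/raw/events.csv",  # required field
                "format": "csv",                # optional, defaults to "csv"
                "options": {
                    "header": True,             # declared in the Permissive fields
                    "sep": ",",
                    "quote": '"',               # extra key: allowed because the
                },                              # field type is Permissive
            }
        }
    }
}

# The Permissive field passes undeclared keys through untouched.
options = run_config["solids"]["read_file"]["config"]["options"]
print(sorted(options))  # ['header', 'quote', 'sep']
```

&lt;p&gt;With a strict &lt;code&gt;Shape&lt;/code&gt; instead of &lt;code&gt;Permissive&lt;/code&gt;, the extra &lt;code&gt;quote&lt;/code&gt; key would fail config validation before the run starts.&lt;/p&gt;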

&lt;h4&gt;
  
  
  solid in action
&lt;/h4&gt;

&lt;p&gt;Now let's look at what the &lt;code&gt;solid&lt;/code&gt; does.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;def read_file(context) -&amp;gt; DataFrame:  &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Every solid has a &lt;code&gt;context&lt;/code&gt;, which is a collection of information provided by the system, such as the parameters provided within the &lt;code&gt;config_schema&lt;/code&gt;.  In this solid there is no additional input parameter (compared to the &lt;code&gt;show&lt;/code&gt; solid above), since this is the starting point of the pipeline. &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;path = context.solid_config["path"]&lt;br&gt;&lt;br&gt;
   dtype = context.solid_config.get("dtype")&lt;br&gt;&lt;br&gt;
   _format = context.solid_config.get("format")&lt;br&gt;&lt;br&gt;
   options = context.solid_config.get("options", {})     &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In order to obtain the values from the &lt;code&gt;context&lt;/code&gt; we can use its &lt;code&gt;solid_config&lt;/code&gt; attribute.  &lt;/p&gt;
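&lt;p&gt;The access pattern above can be sketched with a plain dict, since &lt;code&gt;solid_config&lt;/code&gt; behaves like a dict of the values supplied under &lt;code&gt;config_schema&lt;/code&gt; (the path here is hypothetical). Required fields can be indexed directly; optional fields may be absent, hence &lt;code&gt;.get()&lt;/code&gt;:&lt;/p&gt;

```python
# Stand-in for context.solid_config when only "path" and "format"
# were supplied in the run config.
solid_config = {"path": "data/raw/events.csv", "format": "csv"}

path = solid_config["path"]                # required -- indexing is safe
dtype = solid_config.get("dtype")          # optional -- None when omitted
options = solid_config.get("options", {})  # optional with a default value

print(dtype is None, options)  # True {}
```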

&lt;blockquote&gt;
&lt;p&gt;context.log.debug(f"read_file: path={path}, dtype={dtype}, _format={_format}, options={options}, ")  &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Dagster comes with a &lt;a href="https://docs.dagster.io/docs/apidocs/pipeline#dagster.logger" rel="noopener noreferrer"&gt;built-in logger&lt;/a&gt; that tracks all the events in the pipeline. In addition, you are able to add any logs that you require.  &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;spark = context.resources.spark.spark_session  &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Normally we would get the &lt;code&gt;spark_session&lt;/code&gt; by importing &lt;a href="http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.SparkSession" rel="noopener noreferrer"&gt;pyspark.sql.SparkSession&lt;/a&gt;; however, since we already configured our &lt;code&gt;resources&lt;/code&gt;, we will get our session from the &lt;code&gt;context.resources&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    if dtype:
        df = (
            spark.read.format(_format)
            .options(**options)
            .schema(transform.create_schema(dtype))
            .load(path)
        )
    else:
        df = spark.read.format(_format).options(**options).load(path)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now that we have everything in place we can run the basic code to read the data into spark.  &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;yield Output(df, "df")  &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Finally the function yields an &lt;a href="https://docs.dagster.io/docs/apidocs/solids#dagster.Output" rel="noopener noreferrer"&gt;Output&lt;/a&gt; that can be consumed by the other solids in the pipeline.  &lt;/p&gt;

&lt;h3&gt;
  
  
  Composite Solids
&lt;/h3&gt;

&lt;p&gt;A single &lt;code&gt;solid&lt;/code&gt; executes a single computation; however, when we want to create dependencies between solids we can use a &lt;a href="https://docs.dagster.io/docs/apidocs/solids#dagster.composite_solid" rel="noopener noreferrer"&gt;composite_solid&lt;/a&gt;.    &lt;/p&gt;

&lt;p&gt;Here is a screenshot of the &lt;code&gt;composite_solid&lt;/code&gt; from &lt;a href="https://docs.dagster.io/tutorial/execute#executing-our-first-pipeline" rel="noopener noreferrer"&gt;dagit&lt;/a&gt;&lt;br&gt;&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fkyn1ylmt24qcnhacss2g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fkyn1ylmt24qcnhacss2g.png" alt="pass_to composite solid"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's review the &lt;code&gt;passed_to&lt;/code&gt; composite_solid:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;@composite_solid(
    output_defs=(
        [
            OutputDefinition(name="df_edges", dagster_type=DataFrame),
            OutputDefinition(name="df_nodes", dagster_type=DataFrame),
        ]
    ),
)
def passed_to():
    df_edges_disc = solids_transform.lit.alias("add_col_label1")(
        solids_transform.lit.alias("add_col_label2")(
            solids_transform.astype(
                solids_transform.rename_cols(
                    solids_transform.drop_cols(
                        solids_transform.dropna(solids_utils.read_file())
                    )
                )
            )
        )
    )
    solids_utils.show(df_edges_disc)
    df_edges, df_nodes = solids_edges.edges_agg(df_edges_disc)
    solids_utils.save_header.alias("save_header_edges")(
        solids_transform.rename_cols.alias("rename_cols_neo4j")(df_edges)
    )
    return df_edges, df_nodes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The &lt;a href="https://docs.dagster.io/docs/apidocs/solids#dagster.composite_solid" rel="noopener noreferrer"&gt;@composite_solid&lt;/a&gt; is very similar to the &lt;a href="https://docs.dagster.io/docs/apidocs/solids#dagster.SolidDefinition" rel="noopener noreferrer"&gt;@solid&lt;/a&gt; decorator.  &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Here we can see how the solids are nested within each other.  Every &lt;code&gt;solid&lt;/code&gt; has an input of a &lt;code&gt;DataFrame&lt;/code&gt; (except for the &lt;code&gt;read_file&lt;/code&gt; solid), and every solid has an &lt;code&gt;Output&lt;/code&gt; &lt;code&gt;DataFrame&lt;/code&gt;.  &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The &lt;code&gt;alias&lt;/code&gt; allows calling the same &lt;code&gt;solid&lt;/code&gt; several times within a single &lt;a href="https://docs.dagster.io/docs/apidocs/solids#dagster.composite_solid" rel="noopener noreferrer"&gt;composite_solid&lt;/a&gt;.  &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A solid can return several outputs (which need to be defined in the solid decorator under the &lt;code&gt;output_defs&lt;/code&gt; parameter).&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
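&lt;p&gt;The multiple-output pattern in the last bullet can be sketched without Dagster installed by using a tiny stand-in for &lt;code&gt;dagster.Output&lt;/code&gt; (in real code you would import &lt;code&gt;Output&lt;/code&gt; from dagster itself, and the computations here are placeholders):&lt;/p&gt;

```python
from collections import namedtuple

# Stand-in for dagster.Output -- in real code: from dagster import Output
Output = namedtuple("Output", ["value", "output_name"])

def edges_agg(df_edges_disc):
    """Sketch of a two-output solid: it yields one Output per declared
    OutputDefinition, matched to downstream consumers by output_name."""
    df_edges = f"aggregated({df_edges_disc})"  # placeholder computations
    df_nodes = f"nodes({df_edges_disc})"
    yield Output(df_edges, "df_edges")
    yield Output(df_nodes, "df_nodes")

# Dagster routes each named Output to whichever solid consumes it.
outputs = {o.output_name: o.value for o in edges_agg("df")}
print(sorted(outputs))  # ['df_edges', 'df_nodes']
```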

&lt;h3&gt;
  
  
  Pipeline
&lt;/h3&gt;

&lt;p&gt;Finally we can put everything together in our pipeline. A &lt;a href="https://docs.dagster.io/docs/apidocs/pipeline#dagster.pipeline" rel="noopener noreferrer"&gt;pipeline&lt;/a&gt; builds up a dependency graph between the solids/composite_solids.   &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Ffdxdmavtjcmoblfkf82d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Ffdxdmavtjcmoblfkf82d.png" alt="pipeline"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;@pipeline(mode_defs=[ModeDefinition(resource_defs={"spark": pyspark_resource})])
def statsbomb_pipeline():
    passed_to_edges, passed_to_nodes = passed_to()
    played_together_edges, played_together_nodes = played_together()
    played_in_edges, played_in_nodes = played_in()
    create_neo4j_db(dfs_nodes=[passed_to_nodes, played_together_nodes, played_in_nodes])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In our pipeline, we have three &lt;code&gt;composite_solid&lt;/code&gt;s, each of which outputs two &lt;code&gt;DataFrame&lt;/code&gt; objects.&lt;br&gt;&lt;br&gt;
Our final &lt;code&gt;create_neo4j_db&lt;/code&gt; &lt;code&gt;composite_solid&lt;/code&gt; is dependent on the outputs of the three prior &lt;code&gt;composite_solid&lt;/code&gt;s; it executes a solid to generate the node files, in addition to executing a &lt;code&gt;neo4j-admin.bat&lt;/code&gt; script to bulk import the data into the database.  &lt;/p&gt;
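&lt;p&gt;A run of a pipeline like this is driven by a run config that supplies both the resource config and the per-solid config. Here is a hedged sketch of its shape as a Python dict (all paths, aliases, and values are hypothetical; composite solids nest their children's config under a &lt;code&gt;solids&lt;/code&gt; key):&lt;/p&gt;

```python
# Hypothetical run config wiring the pipeline's resources and solids together.
run_config = {
    "resources": {
        "spark": {"config": {"spark_conf": {"spark.executor.memory": "4g"}}}
    },
    "solids": {
        "passed_to": {
            "solids": {  # composite solids nest their children's config
                "read_file": {"config": {"path": "data/raw/events.csv"}},
            }
        }
    },
}

# With dagster installed, the pipeline would then be launched with:
# from dagster import execute_pipeline
# execute_pipeline(statsbomb_pipeline, run_config=run_config)
print("passed_to" in run_config["solids"])  # True
```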

&lt;h3&gt;
  
  
  Dagit
&lt;/h3&gt;

&lt;p&gt;The DAG output of the run can be viewed with &lt;a href="https://docs.dagster.io/tutorial/execute#executing-our-first-pipeline" rel="noopener noreferrer"&gt;Dagit&lt;/a&gt; (&lt;code&gt;dagster&lt;/code&gt;'s UI).  This allows reviewing the various steps in the pipeline and getting additional information on the various solids' tasks.&lt;br&gt;&lt;br&gt;
Below is an image from one of our pipelines:&lt;br&gt;&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F9o9vyebqigutm4eov3j9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F9o9vyebqigutm4eov3j9.png" alt="dagit run screenshot"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Into Neo4j
&lt;/h2&gt;

&lt;p&gt;The results of the pipeline can be imported into &lt;code&gt;neo4j&lt;/code&gt;.&lt;br&gt;&lt;br&gt;
In order to import &lt;em&gt;big data&lt;/em&gt; in an optimal manner we will use the &lt;a href="https://neo4j.com/docs/operations-manual/4.1/tools/import/" rel="noopener noreferrer"&gt;batch import&lt;/a&gt; admin tool.  This allows for loading tens of millions of nodes and billions of relationships in a reasonable time.  &lt;/p&gt;
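&lt;p&gt;Driving the batch import from the pipeline amounts to assembling a &lt;code&gt;neo4j-admin import&lt;/code&gt; command line and running it; a hedged sketch from Python (the CSV paths are hypothetical, and the exact flags depend on your neo4j version):&lt;/p&gt;

```python
import subprocess

# Sketch of how the bulk import might be driven from Python.
cmd = [
    "neo4j-admin", "import",
    "--nodes=data/processed/nodes_header.csv,data/processed/nodes.csv",
    "--relationships=data/processed/edges_header.csv,data/processed/edges.csv",
]

print(cmd[0])  # neo4j-admin
# subprocess.run(cmd, check=True)  # uncomment on a machine with neo4j installed
```

&lt;p&gt;The header/data CSV split is what the &lt;code&gt;save_header&lt;/code&gt; solids in the composite solid prepare for.&lt;/p&gt;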

&lt;p&gt;The result of the &lt;em&gt;import&lt;/em&gt; command is a &lt;code&gt;Neo4j&lt;/code&gt; database that can be loaded from &lt;a href="//data/processed/stats_player.db.zip"&gt;data/processed/&lt;/a&gt;.  To load the database you can use &lt;a href="https://neo4j.com/docs/operations-manual/3.5/tools/dump-load/" rel="noopener noreferrer"&gt;neo4j-admin load&lt;/a&gt; command.&lt;br&gt;&lt;br&gt;
Note that the neo4j database is in version 3.X&lt;/p&gt;

&lt;p&gt;Here is a screenshot for Liverpool's striker &lt;a href="https://en.wikipedia.org/wiki/Mohamed_Salah" rel="noopener noreferrer"&gt;Mohamed Salah&lt;/a&gt;:  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Ftcqtit8k9sffqz89rk7h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Ftcqtit8k9sffqz89rk7h.png" alt="graph Mohamed Salah"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Some Tips
&lt;/h2&gt;

&lt;p&gt;Once we managed to &lt;em&gt;grok&lt;/em&gt; &lt;code&gt;dagster&lt;/code&gt; and wrap our pyspark functions, our workflow was quite productive.  Here are some tips to make your onboarding easier:  &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;When working with Spark:

&lt;ol&gt;
&lt;li&gt;Run a &lt;a href="https://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#cache" rel="noopener noreferrer"&gt;cache&lt;/a&gt; when returning a &lt;code&gt;spark dataframe&lt;/code&gt; from a &lt;code&gt;solid&lt;/code&gt;. This will prevent re-running parts of the &lt;code&gt;DAG&lt;/code&gt; in a complex pipeline that has multiple outputs.
&lt;/li&gt;
&lt;li&gt;Since we had various checkpoints where we needed to dump our datasets, we found that when &lt;code&gt;spark&lt;/code&gt; performed unexpectedly, breaking up the pipeline by reading back the output file (instead of passing on the dataframe object) allowed spark to manage its resources in an optimal manner.&lt;/li&gt;
&lt;li&gt;Since our environment is on a CDH cluster, the iteration of building a pipeline was faster when combined with a &lt;code&gt;jupyter notebook&lt;/code&gt; that would implement each step in the &lt;code&gt;composite solid&lt;/code&gt;.
&lt;/li&gt;
&lt;/ol&gt;


&lt;/li&gt;

&lt;li&gt;Dagster:

&lt;ol&gt;
&lt;li&gt;Remember to set &lt;a href="https://docs.dagster.io/overview/instances/dagster-instance" rel="noopener noreferrer"&gt;DAGSTER_HOME&lt;/a&gt; once the pipeline is no longer a playground (in order to log the runs etc.; otherwise each run is ephemeral).
&lt;/li&gt;
&lt;/ol&gt;


&lt;/li&gt;

&lt;/ol&gt;
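&lt;p&gt;The caching tip can be illustrated without Spark: like a Spark DataFrame, a lazily evaluated result is recomputed every time a downstream consumer forces it, unless it is cached. A stdlib analogy of the same effect (the function names are illustrative stand-ins):&lt;/p&gt;

```python
from functools import lru_cache

calls = {"n": 0}

def expensive_transform():
    # Stands in for the work behind an uncached Spark DataFrame.
    calls["n"] += 1
    return [1, 2, 3]

@lru_cache(maxsize=1)
def cached_transform():
    # Stands in for df.cache(): computed once, reused by every consumer.
    return tuple(expensive_transform())

# Two downstream "solids" each force the result:
uncached = [expensive_transform() for _ in range(2)]  # recomputed twice
calls["n"] = 0
cached = [cached_transform() for _ in range(2)]       # computed only once
print(calls["n"])  # 1
```

&lt;p&gt;In a pipeline with several outputs hanging off one dataframe, that repeated recomputation is exactly what &lt;code&gt;df.cache()&lt;/code&gt; avoids.&lt;/p&gt;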

&lt;h2&gt;
  
  
  What else....
&lt;/h2&gt;

&lt;p&gt;Dagster has several additional components that can upgrade the pipeline in a significant manner. These include, among others, &lt;a href="https://docs.dagster.io/docs/learn/guides/testing/testing" rel="noopener noreferrer"&gt;Test framework&lt;/a&gt;,  &lt;a href="https://docs.dagster.io/docs/apidocs/solids#dagster.Materialization" rel="noopener noreferrer"&gt;Materialization&lt;/a&gt; for persistent artifacts in the pipeline and  a  &lt;a href="https://docs.dagster.io/docs/apidocs/schedules#dagster.schedule" rel="noopener noreferrer"&gt;scheduler&lt;/a&gt;.  Unfortunately these topics are outside the scope of this post.  &lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;In this post we described our workflow for generating a graph from separate data sources.&lt;br&gt;&lt;br&gt;
As our project matured, we needed to stabilize our workflow, thus migrating our ad-hoc script scaffolds into &lt;code&gt;Dagster&lt;/code&gt;'s framework.  In this process we were able to improve the quality of our pipeline and enable new data sources to be quickly integrated into our product in a frictionless manner.&lt;br&gt;&lt;br&gt;
We hope that this post will inspire you to upgrade your workflow.  &lt;/p&gt;

&lt;p&gt;Please feel free to contact us directly if you have any questions&lt;br&gt;&lt;br&gt;
&lt;a href="https://www.linkedin.com/in/davidkatz-il/" rel="noopener noreferrer"&gt;David Katz&lt;/a&gt;&lt;br&gt;&lt;br&gt;
&lt;a href="https://www.linkedin.com/in/berrygis/" rel="noopener noreferrer"&gt;Sephi Berry&lt;/a&gt;  &lt;/p&gt;

</description>
      <category>dagster</category>
      <category>dataengineering</category>
      <category>pipeline</category>
      <category>graph</category>
    </item>
  </channel>
</rss>
