<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Edward Amor</title>
    <description>The latest articles on DEV Community by Edward Amor (@edwardamor).</description>
    <link>https://dev.to/edwardamor</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F515425%2F52b97273-f0b7-4206-922f-5b4a4134d59d.jpeg</url>
      <title>DEV Community: Edward Amor</title>
      <link>https://dev.to/edwardamor</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/edwardamor"/>
    <language>en</language>
    <item>
      <title>Analysis of Hidden Technical Debt in Machine Learning Systems</title>
      <dc:creator>Edward Amor</dc:creator>
      <pubDate>Sun, 01 Nov 2020 05:00:00 +0000</pubDate>
      <link>https://dev.to/edwardamor/analysis-of-hidden-technical-debt-in-machine-learning-systems-4hp6</link>
      <guid>https://dev.to/edwardamor/analysis-of-hidden-technical-debt-in-machine-learning-systems-4hp6</guid>
      <description>&lt;p&gt;&lt;em&gt;Hidden Technical Debt in Machine Learning Systems&lt;/em&gt; &lt;sup id="fnref:1"&gt;1&lt;/sup&gt; offers a very interesting high level overview of the numerous extra layers of &lt;em&gt;technical debt&lt;/em&gt; &lt;sup id="fnref:2"&gt;2&lt;/sup&gt; which exist in Machine Learning enabled systems. Unlike standard software systems, ML-enabled systems utilize external data instead of standard code and software logic, and contain a machine learning component. This replacement of standard software logic with data results in systems which are much harder to maintain in the long run if the proper precautions aren’t taken. Therefore, it’s imperative that every Data Scientist/Machine Learning Engineer be aware of the various debts that come with ML-enabled systems in order to prevent serious catastrophe in the future.&lt;/p&gt;

&lt;h2&gt;Model Complexity&lt;/h2&gt;

&lt;p&gt;Model complexity refers to the overall complex nature of ML-enabled systems: the process through which data flows in and out, and any intermediary stages in between. This complexity makes it practically impossible to make isolated changes in an ML-enabled system, since any change also alters the behavior of the ML component within. An interesting problem that arises as complexity in an ML system increases is the advent of undeclared consumers: other systems or parts of the development stack that silently consume outputs and/or intermediary files generated by the ML system. This poses a huge risk, since those components are now silently coupled to the system, and any change in the ML component affects them. The coupling can produce adverse outcomes that are tough to debug at best and cascading failures at worst.&lt;/p&gt;

&lt;h2&gt;Data Dependencies&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;ML is required in exactly those cases when the desired behavior cannot be effectively expressed in software logic without dependency on external data.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Unlike regular software systems, ML-enabled systems are entirely dependent on external data. When the inputs to an ML system aren’t strictly maintained, the input data may change and lead to an adverse effect on the system. This includes any improvements to the input signals, since &lt;em&gt;changing anything changes everything&lt;/em&gt;. Additionally, over time underutilized data features, legacy features, bundled features, and/or correlated features can generate inefficiencies at best and faults at worst. It’s imperative that regular input validation checks are made, and exhaustive leave-one-out feature selection evaluations are run to eliminate underutilized features.&lt;/p&gt;

&lt;h2&gt;Feedback Loops&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;One of the key features of live ML systems is that they often end up influencing their own behavior if they update over time.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;ML systems are unique in that they require training, which requires data. Often, direct feedback loops arise over time, where an ML system directly influences the selection of its own future training data. Although this is a relatively tough problem, it’s exactly the kind data scientists love to research and solve. The more challenging issue is hidden feedback loops, in which two systems influence each other indirectly through the world. For example, consider two stock market prediction models from independent investment firms: any improvement (or, at worst, bug) in one may influence the bidding and buying behavior of the other.&lt;/p&gt;

&lt;h2&gt;Anti-Patterns&lt;/h2&gt;

&lt;p&gt;An anti-pattern is a common response to a recurring problem that is usually ineffective and risks being highly counterproductive. &lt;sup id="fnref:3"&gt;3&lt;/sup&gt; Within an ML-enabled system there are a few unique anti-patterns that hinder the maintainability of the system. Glue code, often written to get data into and out of a general-purpose ML package, accumulates into lots of supporting code that is costly in the long term. Pipeline jungles evolve organically over time as the result of incremental scrapes, joins, and sampling steps, often with intermediate outputs and files. Managing and testing these pipelines is costly and time-consuming; fortunately, many libraries (such as &lt;code&gt;scikit-learn&lt;/code&gt; with its &lt;code&gt;Pipeline&lt;/code&gt; class) now provide built-in pipeline abstractions that ease their management.&lt;/p&gt;
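&lt;p&gt;As a minimal sketch of that abstraction (my example, not from the paper), chaining preprocessing and modeling into a single &lt;code&gt;Pipeline&lt;/code&gt; object keeps the glue code in one declared place:&lt;/p&gt;

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data standing in for the output of a real feature pipeline.
X, y = make_regression(n_samples=100, n_features=5, random_state=0)

# One object owns scaling + modeling, so there is no loose glue code
# shuttling intermediate arrays between hand-rolled steps.
pipe = Pipeline([("scale", StandardScaler()), ("model", Ridge())])
pipe.fit(X, y)
preds = pipe.predict(X)
```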

&lt;h2&gt;Other Areas of ML Debt&lt;/h2&gt;

&lt;p&gt;Lastly, there are many other areas of technical debt in the production of ML-enabled systems. Process management debt, which occurs in very mature systems that may have hundreds or thousands of models running simultaneously, involves managing and assigning resources across different business priorities. Reproducibility debt arises because real-world systems are hard to make strictly reproducible in the face of randomized algorithms, non-determinism in parallel learning, and interactions with the outside world. Finally, there is probably the most important type of debt: cultural debt. Cultural debt exists when there is a hard line between research and engineering, which is counterproductive in the long term. It’s therefore imperative to cultivate a culture that rewards simplicity, stability, and reproducibility.&lt;/p&gt;

&lt;h2&gt;Key Takeaways&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;The goal is not to add new functionality, but to enable future improvements, reduce errors, and improve maintainability.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;As the Data Science field continues to grow, it’s important that those within the community are aware of the issues involved in putting ML into production. Luckily, since generally 95% of any ML system isn’t actually ML, it works to our benefit to learn from the software engineering field and take advantage of its many decades of accumulated experience. The authors of &lt;em&gt;Hidden Technical Debt in Machine Learning Systems&lt;/em&gt; did an excellent job of expressing the additional layers of technical debt involved in ML systems, and the various solutions and measures that limit it. Some of the key ways they offer to pay down the debt are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Using common APIs, which allows support infrastructure to be more reusable.&lt;/li&gt;
&lt;li&gt;Isolating models by serving ensembles to reduce interaction between the external world and models.&lt;/li&gt;
&lt;li&gt;Creating versioned copies of inputs, to prevent detriments to the system from changes in the input.&lt;/li&gt;
&lt;li&gt;Regularly running exhaustive leave-one-feature-out evaluations, to identify and remove unnecessary features.&lt;/li&gt;
&lt;li&gt;Testing of input signals, providing sanity checks which prevent corruption of models.&lt;/li&gt;
&lt;li&gt;Improving documentation.&lt;/li&gt;
&lt;/ol&gt;
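&lt;p&gt;For point 5, input-signal tests can be as simple as null and range checks run before training or scoring. Here is a small hypothetical sketch (the column name and bounds are made up for illustration):&lt;/p&gt;

```python
import pandas as pd

def validate_inputs(df: pd.DataFrame) -> list:
    """Collect sanity-check failures for an input feature frame."""
    failures = []
    if df["temperature"].isna().any():
        failures.append("temperature contains missing values")
    if not df["temperature"].between(-30, 60).all():
        failures.append("temperature outside plausible range")
    return failures

# A clean frame passes; a corrupted signal is flagged before it
# silently degrades the model.
clean = pd.DataFrame({"temperature": [12.3, 15.0, 9.8]})
bad = pd.DataFrame({"temperature": [12.3, 999.0]})
```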

&lt;p&gt;Data Science has never been an isolated field, but it is more important than ever that as a community we pay attention to the long-term implications of our ML systems. Investing time and care from the beginning when building these systems results in better maintainability and room for future growth that would otherwise be burdensome.&lt;/p&gt;




&lt;ol&gt;
&lt;li id="fn:1"&gt;
&lt;p&gt;&lt;a href="https://papers.nips.cc/paper/2015/file/86df7dcfd896fcaf2674f757a2463eba-Paper.pdf"&gt;https://papers.nips.cc/paper/2015/file/86df7dcfd896fcaf2674f757a2463eba-Paper.pdf&lt;/a&gt; ↩︎&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:2"&gt;
&lt;p&gt;&lt;a href="https://en.wikipedia.org/wiki/Technical_debt"&gt;https://en.wikipedia.org/wiki/Technical_debt&lt;/a&gt; ↩︎&lt;/p&gt;
&lt;/li&gt;
&lt;li id="fn:3"&gt;
&lt;p&gt;&lt;a href="https://en.wikipedia.org/wiki/Anti-pattern"&gt;https://en.wikipedia.org/wiki/Anti-pattern&lt;/a&gt; ↩︎&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;

</description>
    </item>
    <item>
      <title>Simple Time Series Forecasting with ML</title>
      <dc:creator>Edward Amor</dc:creator>
      <pubDate>Wed, 16 Sep 2020 13:00:00 +0000</pubDate>
      <link>https://dev.to/edwardamor/simple-time-series-forecasting-with-ml-2kda</link>
      <guid>https://dev.to/edwardamor/simple-time-series-forecasting-with-ml-2kda</guid>
      <description>&lt;p&gt;Time series forecasting is an interesting sub-topic within the field of machine learning, mainly due to the time component which adds to the complexity of making predictions. Over the past month I’ve grown quite fond of it, and one of the best things I’ve learned is that standard supervised machine learning algorithms can be applied to time series to make predictions. The process is quite similar to a standard ML process with the exception that you have to structure your data a specific way to maintain the temporal structure.&lt;/p&gt;

&lt;h2&gt;Environment Setup&lt;/h2&gt;

&lt;p&gt;For setting up your environment I recommend using anaconda; it’s more or less the de facto environment manager for data science. However, if you only have plain python on your system, that is more than enough as well. I’m also assuming you have a terminal available with a unix-like shell such as bash or git bash.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;mkdir &lt;/span&gt;tsml-tutorial
&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;tsml-tutorial
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you have anaconda available on your system:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;conda create &lt;span class="nt"&gt;-n&lt;/span&gt; tsml jupyter pandas scipy numpy matplotlib seaborn scikit-learn statsmodels
&lt;span class="nv"&gt;$ &lt;/span&gt;conda activate tsml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you don’t have anaconda available on your system, but have python 3.3+ installed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;python &lt;span class="nt"&gt;-m&lt;/span&gt; venv venv
&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;source &lt;/span&gt;venv/bin/activate
&lt;span class="nv"&gt;$ &lt;/span&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;jupyter pandas scipy numpy matplotlib seaborn scikit-learn statsmodels
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now that you have an environment installed you can start following along by starting your local jupyter server and opening a fresh notebook.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;jupyter notebook
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;Data Extraction&lt;/h2&gt;

&lt;p&gt;For this little tutorial we’ll be using one of the most common univariate time series datasets, one you’ve probably already seen: daily minimum temperatures in Melbourne, Australia, 1981-1990. The data consists of, as you may have guessed, the daily minimum temperature over the course of 10 years in Melbourne. We’ll grab the data with pandas from a GitHub repository; you can find it at &lt;a href="https://github.com/jbrownlee/Datasets/blob/master/daily-min-temperatures.csv"&gt;https://github.com/jbrownlee/Datasets/blob/master/daily-min-temperatures.csv&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# import libraries
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.ensemble&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RandomForestRegressor&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.pipeline&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Pipeline&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.model_selection&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;cross_val_score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;TimeSeriesSplit&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.preprocessing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OneHotEncoder&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;StandardScaler&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.metrics&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;mean_squared_error&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.compose&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ColumnTransformer&lt;/span&gt;

&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;matplotlib&lt;/span&gt; &lt;span class="n"&gt;inline&lt;/span&gt;
&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;set&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# load our dataset
&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"https://raw.githubusercontent.com/jbrownlee/Datasets/master/daily-min-temperatures.csv"&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# output dataframe info
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;info&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;class 'pandas.core.frame.DataFrame'&amp;gt;
RangeIndex: 3650 entries, 0 to 3649
Data columns (total 2 columns):
 # Column Non-Null Count Dtype  
--- ------ -------------- -----  
 0 Date 3650 non-null object 
 1 Temp 3650 non-null float64
dtypes: float64(1), object(1)
memory usage: 57.2+ KB
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Our dataframe consists of 2 columns, &lt;code&gt;Date&lt;/code&gt; and &lt;code&gt;Temp&lt;/code&gt;, with no missing values, and 3650 observations (365 per year). Our data is typed as follows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;Date&lt;/code&gt; column as a string, which we’ll want to convert to a &lt;code&gt;DatetimeIndex&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Temp&lt;/code&gt; column as a float64.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# set Date as datetimeindex
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"Date"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;to_datetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"Date"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;set_index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Date"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;Data Exploration&lt;/h2&gt;

&lt;p&gt;Since this is a time series, we’d be remiss if we didn’t plot the data out fully. We’ll also want to inspect our data and see if there is autocorrelation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# plot full 10 years
&lt;/span&gt;&lt;span class="n"&gt;fig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ax&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;subplots&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"Daily minimum temperatures in Melbourne, Australia, 1981-1990"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;style&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rolling&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;style&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"-"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rolling&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;style&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"-"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;legend&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="s"&gt;"Temperature"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"30-Day Rolling Average"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"30-Day Rolling Std. Dev."&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--WQiS16Lz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://edwardamor.xyz/posts/005/ts-plot.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--WQiS16Lz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://edwardamor.xyz/posts/005/ts-plot.png" alt="Time Series Plot"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Our plot of the temperature over the ten years shows that it oscillates, almost like a sinusoidal wave, and the rolling standard deviation shows that the variance does not grow as time progresses. This would definitely be an optimal dataset for a SARIMA model, but that isn’t what we’re here for.&lt;/p&gt;

&lt;h2&gt;Modeling&lt;/h2&gt;

&lt;p&gt;This is the crux of our tutorial: essentially we’ll be doing regression (albeit with a random forest regressor) to predict the temperature. To start, we’ll create some features, such as time lags and calendar features, to incorporate the temporal structure into our model. To make things easier as our data grows, we’ll also build a pipeline.&lt;/p&gt;

&lt;p&gt;Features to create:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Time lags for the previous week&lt;/li&gt;
&lt;li&gt;Rolling 30-Day Temperature average&lt;/li&gt;
&lt;li&gt;Rolling 7-Day Temperature average&lt;/li&gt;
&lt;li&gt;Month of the year&lt;/li&gt;
&lt;li&gt;Week of the year&lt;/li&gt;
&lt;li&gt;Day of the week&lt;/li&gt;
&lt;li&gt;Next day’s temperature (what we are predicting)
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# create our features and new dataframe
&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="s"&gt;f"t-&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Temp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shift&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)})&lt;/span&gt;
&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"t"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Temp&lt;/span&gt;
&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"day"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isocalendar&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;day&lt;/span&gt;
&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"week_of_year"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isocalendar&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;week&lt;/span&gt;
&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"month"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;month&lt;/span&gt;
&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"7-Day Temp. Avg."&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Temp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rolling&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"30-Day Temp. Avg."&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Temp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rolling&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"t+1"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Temp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shift&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dropna&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;t&lt;/code&gt; is our current time step, and &lt;code&gt;t+1&lt;/code&gt; is the next day’s temperature, which we’ll be predicting. To make our model aware of time we’ve also created day-of-week, week-of-year, and month features, and included lag values and rolling averages. Next we’ll want to divide our data into training and testing sets so we can train and validate our model. However, since we’re working with time series data there is a strict order dependence: we can’t shuffle and split our data; we have to maintain the order.&lt;/p&gt;
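&lt;p&gt;The &lt;code&gt;TimeSeriesSplit&lt;/code&gt; we imported earlier encodes exactly this constraint: each fold trains on an initial segment and validates on the slice that immediately follows it. A minimal illustration on a toy array:&lt;/p&gt;

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X_demo = np.arange(10).reshape(10, 1)
tscv = TimeSeriesSplit(n_splits=3)
for train_idx, test_idx in tscv.split(X_demo):
    # Every validation index comes strictly after every training index,
    # so the model never peeks into the future.
    assert train_idx.max() < test_idx.min()
```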

&lt;p&gt;We’ll split our data using a 70-30 split: the first 70% is our training data and the last 30% is our testing data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# split data up into training and testing set, preprocess
&lt;/span&gt;&lt;span class="n"&gt;num_cols&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'t-7'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'t-6'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'t-5'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'t-4'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'t-3'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'t-2'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'t-1'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'t'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="s"&gt;'7-Day Temp. Avg.'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'30-Day Temp. Avg.'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;col_trans&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ColumnTransformer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"categorical_cols"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;OneHotEncoder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;drop&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"first"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sparse&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"week_of_year"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"month"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"day"&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"numeric_cols"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;StandardScaler&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;num_cols&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;pipe&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Pipeline&lt;/span&gt;&lt;span class="p"&gt;([(&lt;/span&gt;&lt;span class="s"&gt;"trans"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;col_trans&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"regression"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;RandomForestRegressor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_jobs&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))])&lt;/span&gt;

&lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;drop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"t+1"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"t+1"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_test&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;)],&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;):]&lt;/span&gt;
&lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_test&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;)],&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;):]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Since we can’t use standard cross validation on time series data, we’ll use the time series split class from sklearn, which is essentially the time series analogue of k-fold validation. The alternative would be to train our model on all the data and compare candidates using an information criterion; realistically, when doing any model selection you should use multiple metrics to select your model.&lt;br&gt;
&lt;/p&gt;
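&lt;p&gt;As a quick illustration (a sketch on toy data, not part of the original notebook), TimeSeriesSplit always trains on a prefix of the series and validates on the window that follows it, so no fold ever peeks at the future:&lt;/p&gt;

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# ten sequential observations standing in for a time series
X_demo = np.arange(10).reshape(-1, 1)

for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(X_demo):
    # each training window is an expanding prefix; each test window follows it
    print(train_idx, test_idx)
```

&lt;p&gt;Each successive fold extends the training prefix, mirroring how the model would be retrained as new observations arrive.&lt;/p&gt;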

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# perform cross validation on training data
&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;cross_val_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pipe&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cv&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;TimeSeriesSplit&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;scoring&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"neg_root_mean_squared_error"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2.5270646279116358
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here we have the RMSE score from cross validation. It isn’t anything special, but it verifies that we can apply our standard ML toolset to a time series dataset. From our CV we can see our model is off by about 2.5 degrees on average.&lt;br&gt;
&lt;/p&gt;
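&lt;p&gt;For intuition (a toy sketch with hypothetical temperatures, not from the notebook), RMSE is just the square root of the average squared error, so a score of 2.5 means predictions miss the true value by roughly 2.5 degrees on a typical day:&lt;/p&gt;

```python
import numpy as np

# hypothetical observed vs predicted temperatures
y_true = np.array([60.0, 62.0, 61.0])
y_pred = np.array([58.0, 65.0, 61.0])

# root mean squared error, computed by hand
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
print(rmse)
```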

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# fit our model and make predictions on testing data
&lt;/span&gt;&lt;span class="n"&gt;pipe&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;preds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pipe&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# show the predictions
&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1086&lt;/span&gt;&lt;span class="p"&gt;:].&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"Predictions on Hold Out Data"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Series&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;preds&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1086&lt;/span&gt;&lt;span class="p"&gt;:].&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;legend&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="s"&gt;"Observations"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Predictions"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--N7j06U7S--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://edwardamor.xyz/posts/005/modelpredictions.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--N7j06U7S--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://edwardamor.xyz/posts/005/modelpredictions.png" alt="Model Predictions"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;
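&lt;p&gt;One fragility worth noting: the 1086 in the plotting code is the hard-coded length of the hold-out set. A sketch (on hypothetical data) of deriving the index from the split instead, so the plot stays correct if the split fraction ever changes:&lt;/p&gt;

```python
import numpy as np
import pandas as pd

# hypothetical series standing in for the temperature data
y_all = pd.Series(np.linspace(50.0, 60.0, 10))
split = int(len(y_all) * 0.7)
y_test_demo = y_all[split:]

# dummy predictions aligned to the hold-out index, no magic numbers needed
preds_demo = pd.Series(np.zeros(len(y_test_demo)), index=y_test_demo.index)
print(preds_demo.index.tolist())
```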

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# output RMSE score on test data
&lt;/span&gt;&lt;span class="n"&gt;mean_squared_error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;preds&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;squared&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2.3153790785018127
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Looking at the predictions made by our model, we aren’t going to be telling anyone the weather anytime soon. However, this is a prime example of how to apply standard Machine Learning algorithms to your time series.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Model Selection, Validation, and Hyperparameter Tuning</title>
      <dc:creator>Edward Amor</dc:creator>
      <pubDate>Sat, 01 Aug 2020 21:00:00 +0000</pubDate>
      <link>https://dev.to/edwardamor/model-selection-validation-and-hyperparameter-tuning-156j</link>
      <guid>https://dev.to/edwardamor/model-selection-validation-and-hyperparameter-tuning-156j</guid>
      <description>&lt;p&gt;In practice, a majority of the time dedicated to any data science project (unless you’re lucky) is consumed by data cleaning and wrangling. However, once you’ve completed you’re data mining, cleaning, exploration, and feature engineering, generally the next step is to do some machine learning. The ML process is pretty standard regardless of the algorithm you choose, it’ll always require some model selection, model validation, and hyperparameter tuning. One of the easiest ways you can wrap your mind around the process is through trial and error.&lt;/p&gt;

&lt;h2&gt;
  
  
  Logistic Regression
&lt;/h2&gt;

&lt;p&gt;Logistic regression is a very well known supervised binary classification algorithm. Unlike linear regression, where you’re predicting a continuous value, logistic regression outputs a prediction of either a 0 or a 1, or a probability (there is also softmax regression, an extension of logistic regression to more than 2 classes). Although logistic regression isn’t as fancy as neural networks or natural language processing, it is still a significantly useful tool in any data scientist’s toolbelt. Just like other classification algorithms it can be used for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Predicting Bank Loan Worthiness&lt;/li&gt;
&lt;li&gt;Detecting Credit Card Fraud&lt;/li&gt;
&lt;li&gt;Detecting Email Spam&lt;/li&gt;
&lt;li&gt;And Many More …&lt;/li&gt;
&lt;/ul&gt;
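&lt;p&gt;Under the hood, the probability a logistic regression outputs comes from the logistic (sigmoid) function applied to a linear score; a minimal sketch:&lt;/p&gt;

```python
import numpy as np

def sigmoid(z):
    # logistic function: squashes any real-valued score into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# a score of 0 sits exactly on the decision boundary
print(sigmoid(0.0))
# large positive scores approach 1, large negative scores approach 0
print(sigmoid(4.0), sigmoid(-4.0))
```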

&lt;p&gt;To get a grasp on how model selection, model validation, and hyperparameter tuning work, we’ll run through an example with a simple dataset. For the purposes of demonstration we’ll be using a dataset from the &lt;a href="https://archive.ics.uci.edu/ml/datasets.php"&gt;UCI Machine Learning Repository&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Blood Transfusion Service Center Data Set
&lt;/h2&gt;

&lt;p&gt;The dataset we’ll be using can be found in the UCI Machine Learning Repository by &lt;a href="https://archive.ics.uci.edu/ml/datasets/Blood+Transfusion+Service+Center"&gt;clicking here&lt;/a&gt;. Below is some information about the data for those who would like to know, it was taken directly from the UCI Machine Learning Repository.&lt;/p&gt;

&lt;h3&gt;
  
  
  Summary:
&lt;/h3&gt;

&lt;p&gt;To demonstrate the RFMTC marketing model (a modified version of RFM), this study adopted the donor database of the Blood Transfusion Service Center in Hsin-Chu City in Taiwan. The center passes its blood transfusion service bus to one university in Hsin-Chu City to gather donated blood about every three months. To build an RFMTC model, we selected 748 donors at random from the donor database. Each of these 748 donor records included R (Recency - months since last donation), F (Frequency - total number of donations), M (Monetary - total blood donated in c.c.), T (Time - months since first donation), and a binary variable representing whether he/she donated blood in March 2007 (1 stands for donating blood; 0 stands for not donating blood).&lt;/p&gt;

&lt;h3&gt;
  
  
  Attribute Information:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;R (Recency - months since last donation)&lt;/li&gt;
&lt;li&gt;F (Frequency - total number of donations)&lt;/li&gt;
&lt;li&gt;M (Monetary - total blood donated in c.c.)&lt;/li&gt;
&lt;li&gt;T (Time - months since first donation)&lt;/li&gt;
&lt;li&gt;A binary variable representing whether he/she donated blood in March 2007 (1 stands for donating blood; 0 stands for not donating blood)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Source:
&lt;/h3&gt;

&lt;p&gt;Original Owner and Donor&lt;br&gt;&lt;br&gt;
Prof. I-Cheng Yeh&lt;br&gt;&lt;br&gt;
Department of Information Management&lt;br&gt;&lt;br&gt;
Chung-Hua University,&lt;br&gt;&lt;br&gt;
Hsin Chu, Taiwan 30067, R.O.C.&lt;br&gt;&lt;br&gt;
e-mail: icyeh ‘at’ chu.edu.tw&lt;br&gt;&lt;br&gt;
TEL:886-3-5186511&lt;/p&gt;
&lt;h3&gt;
  
  
  Citation Request:
&lt;/h3&gt;

&lt;p&gt;Yeh, I-Cheng, Yang, King-Jang, and Ting, Tao-Ming, “Knowledge discovery on RFM model using Bernoulli sequence,” &lt;em&gt;Expert Systems with Applications&lt;/em&gt;, 2008 (doi:10.1016/j.eswa.2008.07.018).&lt;/p&gt;
&lt;h2&gt;
  
  
  Exploratory Data Analysis
&lt;/h2&gt;

&lt;p&gt;It’s always a good habit, after extracting and cleaning your data, to perform some EDA to get a grasp of the main characteristics of the data you’ll be working with. It also provides you with visuals that immensely assist in the preprocessing steps later on.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# import libraries
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;seaborn&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;sns&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.linear_model&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LogisticRegression&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.model_selection&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;train_test_split&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cross_val_score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;StratifiedKFold&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;GridSearchCV&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.metrics&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;classification_report&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;plot_roc_curve&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;plot_confusion_matrix&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;plot_precision_recall_curve&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;sklearn.preprocessing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;StandardScaler&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;imblearn.over_sampling&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SMOTE&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;imblearn.pipeline&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;make_pipeline&lt;/span&gt;

&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;matplotlib&lt;/span&gt; &lt;span class="n"&gt;inline&lt;/span&gt;
&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;font_scale&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1.1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;SEED&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;890432&lt;/span&gt;


&lt;span class="c1"&gt;# load data
&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"https://archive.ics.uci.edu/ml/machine-learning-databases/blood-transfusion/transfusion.data"&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"r"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"f"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"m"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"t"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"y"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="c1"&gt;# rename the columns for brevity
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;info&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;class 'pandas.core.frame.DataFrame'&amp;gt;
RangeIndex: 748 entries, 0 to 747
Data columns (total 5 columns):
 # Column Non-Null Count Dtype
--- ------ -------------- -----
 0 r 748 non-null int64
 1 f 748 non-null int64
 2 m 748 non-null int64
 3 t 748 non-null int64
 4 y 748 non-null int64
dtypes: int64(5)
memory usage: 29.3 KB
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In our case, the dataset is typed correctly and devoid of any null values, since a majority of the processing has already been done, leaving just EDA and modeling to us. Since this is a classification challenge, one of the best things to look at immediately is the class distribution of your output. This will give us insight into whether we should implement some downsampling/upsampling/hybrid-sampling to adjust for imbalance.&lt;br&gt;
&lt;/p&gt;
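&lt;p&gt;A quick numeric complement to the plot (using the class counts reported for this dataset, roughly 570 non-donors versus 178 donors) shows the imbalance directly:&lt;/p&gt;

```python
import pandas as pd

# class counts reported for the transfusion dataset: 570 non-donors, 178 donors
y_demo = pd.Series([0] * 570 + [1] * 178)
print(y_demo.value_counts(normalize=True).round(3))
```

&lt;p&gt;Roughly three in four donors fall in the negative class, which is why a resampling strategy like SMOTE is worth considering later.&lt;/p&gt;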

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# plot class distribution
&lt;/span&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;figure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;countplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"y"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# adjust figure 
&lt;/span&gt;&lt;span class="n"&gt;ticks&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;xticks&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;xticks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ticks&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"Didn't donate in 2007 (0)"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Did donate in 2007 (1)"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;xlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ylabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"# of individuals"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Class Distribution of Dependent Variable"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--BH83UZw4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://edwardamor.xyz/posts/004/model-shv_16_0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--BH83UZw4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://edwardamor.xyz/posts/004/model-shv_16_0.png" alt="png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There is clearly a class imbalance, which we’ll have to account for later in our modeling; this highlights why it’s always important to visually explore your data. Next we should also determine whether there exists any correlation amongst our independent variables; if there is, we could implement some dimensionality reduction or remove some redundant variables. We’ll inspect this by plotting the pairwise relationships between the variables, and also viewing a heatmap of pairwise Pearson correlation values.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# view the correlation between variables
&lt;/span&gt;&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pairplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;vars&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"r"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"f"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"m"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"t"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;aspect&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;corner&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--x4i2xd_Z--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://edwardamor.xyz/posts/004/model-shv_18_0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--x4i2xd_Z--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://edwardamor.xyz/posts/004/model-shv_18_0.png" alt="png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Simply looking at the histogram of each variable, we see they’re all positively skewed, which is another thing we’ll have to adjust for before modeling by scaling our data. Along with that, the variables “f” and “m” appear to be positively correlated, forming basically a straight line. This makes sense, as from our attribute information we know that frequency is “the total number of donations” and monetary is “the total blood donated in c.c.”. We should therefore expect that as frequency increases, monetary will also increase.&lt;br&gt;
&lt;/p&gt;
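&lt;p&gt;One caveat, illustrated on synthetic skewed data (an aside, not from the original post): StandardScaler centers to mean 0 and rescales to unit variance, but it does not remove skew on its own; a log or power transform would be needed for that:&lt;/p&gt;

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# synthetic positively skewed feature
rng = np.random.default_rng(1)
skewed = rng.exponential(scale=5.0, size=(1000, 1))

scaled = StandardScaler().fit_transform(skewed)
# mean is ~0 and std is ~1 after scaling, but the shape stays skewed
print(scaled.mean().round(6), scaled.std().round(6))
```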

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# obtain the pairwise pearson correlation
&lt;/span&gt;&lt;span class="n"&gt;correlation&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="s"&gt;"r"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"f"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"m"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"t"&lt;/span&gt;&lt;span class="p"&gt;]].&lt;/span&gt;&lt;span class="n"&gt;corr&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;mask&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;triu&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ones_like&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;correlation&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="c1"&gt;# for removing upper diagonal of information
&lt;/span&gt;
&lt;span class="c1"&gt;# plot heatmap of pearson correlation
&lt;/span&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;figure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;labels&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"Recency"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Frequency"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Monetary"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Time"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;heatmap&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;correlation&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;annot&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;square&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;xticklabels&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;yticklabels&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mask&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;mask&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Pairwise Correlation Heatmap"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--U-HwwYZt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://edwardamor.xyz/posts/004/model-shv_20_0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--U-HwwYZt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://edwardamor.xyz/posts/004/model-shv_20_0.png" alt="png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Our suspicions are confirmed: by the looks of it, our Frequency and Monetary variables are perfectly positively correlated with each other. We’ll remove one of these columns, as it would interfere with our modeling efforts down the line. It also appears that Frequency/Monetary are somewhat positively correlated with Time; my heuristic is to leave a variable in if its Pearson correlation is |x| &amp;lt; .75, so in this case we’ll simply stick to removing either frequency or monetary.&lt;br&gt;
&lt;/p&gt;
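&lt;p&gt;The perfect correlation is no accident: each donation is a fixed volume, so monetary is just frequency times a constant. A synthetic sketch of the same structure (assuming 250 c.c. per donation, as in the original data) shows why the Pearson correlation comes out as exactly 1:&lt;/p&gt;

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
f = rng.integers(1, 50, size=100)

df_demo = pd.DataFrame({
    "f": f,        # frequency: number of donations
    "m": f * 250,  # monetary: a constant multiple of frequency
})

# a positive constant multiple yields a Pearson correlation of 1
print(df_demo.corr().loc["f", "m"].round(6))
```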

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# drop the monetary column
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;drop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"m"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;inplace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;errors&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"ignore"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Since this is a classification task, another good thing to do is look at the distribution of values within each class. To get a quick summary of this information, we can use a box and whisker plot.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# plot categorical distribution of values
&lt;/span&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;figure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;subplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;131&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;boxplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"y"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"r"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;showmeans&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Recency"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;subplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;132&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;boxplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"y"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"f"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;showmeans&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Frequency"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;subplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;133&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;boxplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"y"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"t"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;showmeans&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Time"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Hjp5VQ8x--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://edwardamor.xyz/posts/004/model-shv_24_0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Hjp5VQ8x--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://edwardamor.xyz/posts/004/model-shv_24_0.png" alt="png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As noted before, we will be scaling our data to within the same range; it is quite noticeable that the ranges in the 3 plots above differ substantially. Beyond that, what stands out to me is the somewhat similar characteristics between the classes, particularly in the Time variable. In both the Recency and Frequency variables we see a decent amount of outliers, which could dampen our logit model’s ability to distinguish between the two classes. After generating a vanilla model we’ll assess its performance and decide whether we want to drop our outlier observations.&lt;/p&gt;
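If we do decide to drop outliers later, the standard 1.5×IQR rule (which matches what the box plot whiskers flag) is one way to find them. A minimal sketch, with made-up recency-like values for illustration:

```python
import pandas as pd

def iqr_outlier_mask(s: pd.Series, k: float = 1.5) -> pd.Series:
    """True for values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)

recency = pd.Series([2, 3, 4, 4, 5, 6, 7, 9, 11, 14, 16, 74])
mask = iqr_outlier_mask(recency)
print(recency[mask].tolist())  # only the extreme 74 is flagged
```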

&lt;p&gt;Last but certainly not least, we’ll look at the descriptive statistics of our variables. This is typically also helpful at the beginning of any EDA, as you should notice any suspicious facts about your data almost immediately.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# output descriptive statistics
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;describe&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;r&lt;/th&gt;
&lt;th&gt;f&lt;/th&gt;
&lt;th&gt;t&lt;/th&gt;
&lt;th&gt;y&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;count&lt;/td&gt;
&lt;td&gt;748.000000&lt;/td&gt;
&lt;td&gt;748.000000&lt;/td&gt;
&lt;td&gt;748.000000&lt;/td&gt;
&lt;td&gt;748.000000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;mean&lt;/td&gt;
&lt;td&gt;9.506684&lt;/td&gt;
&lt;td&gt;5.514706&lt;/td&gt;
&lt;td&gt;34.282086&lt;/td&gt;
&lt;td&gt;0.237968&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;std&lt;/td&gt;
&lt;td&gt;8.095396&lt;/td&gt;
&lt;td&gt;5.839307&lt;/td&gt;
&lt;td&gt;24.376714&lt;/td&gt;
&lt;td&gt;0.426124&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;min&lt;/td&gt;
&lt;td&gt;0.000000&lt;/td&gt;
&lt;td&gt;1.000000&lt;/td&gt;
&lt;td&gt;2.000000&lt;/td&gt;
&lt;td&gt;0.000000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;25%&lt;/td&gt;
&lt;td&gt;2.750000&lt;/td&gt;
&lt;td&gt;2.000000&lt;/td&gt;
&lt;td&gt;16.000000&lt;/td&gt;
&lt;td&gt;0.000000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;50%&lt;/td&gt;
&lt;td&gt;7.000000&lt;/td&gt;
&lt;td&gt;4.000000&lt;/td&gt;
&lt;td&gt;28.000000&lt;/td&gt;
&lt;td&gt;0.000000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;75%&lt;/td&gt;
&lt;td&gt;14.000000&lt;/td&gt;
&lt;td&gt;7.000000&lt;/td&gt;
&lt;td&gt;50.000000&lt;/td&gt;
&lt;td&gt;0.000000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;max&lt;/td&gt;
&lt;td&gt;74.000000&lt;/td&gt;
&lt;td&gt;50.000000&lt;/td&gt;
&lt;td&gt;98.000000&lt;/td&gt;
&lt;td&gt;1.000000&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Nothing unusual here, as we’ve uncovered the majority of our information from our previous visualizations.&lt;/p&gt;

&lt;h2&gt;
  
  
  Preprocessing
&lt;/h2&gt;

&lt;p&gt;This step is short: it just involves setting our data up for modeling by splitting it into training and testing sets and doing any additional scaling/manipulation. It’s typically best to use a pipeline for the scaling/manipulation of your data, as it reduces headaches down the line and provides a simple interface for modeling.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# create our data pipeline
&lt;/span&gt;&lt;span class="n"&gt;imbalanced_pipeline&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;make_pipeline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;StandardScaler&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="n"&gt;LogisticRegression&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;solver&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"liblinear"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;SEED&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;skfold&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;StratifiedKFold&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;shuffle&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;SEED&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Our pipeline is quite simple, but given a different dataset with different characteristics we’d have to use something else. One thing to note: we are doing nothing to account for class imbalance in our vanilla pipeline, but based on our assessment we will see whether we need to.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# separate our data into training and testing sets
&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="s"&gt;"r"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"f"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"t"&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;
&lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"y"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_test&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;train_test_split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;test_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;SEED&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;stratify&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We use the stratify keyword argument to make sure we separate a proportionate amount of each class into our training and testing sets. Since test_size is .3, roughly 70% of class 0 and class 1 respectively will be in the training set, and 30% will be in the test set.&lt;/p&gt;
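We can verify that stratification preserves the class balance in both splits. A quick sketch with a synthetic 0/1 target at the same roughly 24% positive rate as our data (the features and seed here are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# synthetic stand-in: 748 rows, 178 positives (178/748 ≈ 0.238)
rng = np.random.default_rng(0)
X = rng.normal(size=(748, 3))
y = np.array([1] * 178 + [0] * 570)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)
# both splits keep the full dataset's positive rate
print(round(y_tr.mean(), 3), round(y_te.mean(), 3))
print(len(y_tr))  # 523 training rows, matching the report's support
```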

&lt;h2&gt;
  
  
  Model Validation
&lt;/h2&gt;

&lt;p&gt;To validate the results of our training, we’ll use cross validation with stratified k-folds to get a general sense of our model’s performance.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# perform CV on our data and output f1 results
&lt;/span&gt;&lt;span class="n"&gt;f1_score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cross_val_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;imbalanced_pipeline&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;scoring&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"f1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cv&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;skfold&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_jobs&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="s"&gt;"%.4f"&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;f1_score&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;'0.2470'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Our current model performs horribly. This may be due to a number of reasons; it could even be that the logistic regression algorithm just isn’t suited to this task. Either way, we should take this baseline score and try to improve on it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# fit our model and output precision recall curve
&lt;/span&gt;&lt;span class="n"&gt;imbalanced_pipeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;fig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ax&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;subplots&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;plot_precision_recall_curve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;imbalanced_pipeline&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Precision-Recall Curve Baseline Logit"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--BBVNkeEy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://edwardamor.xyz/posts/004/model-shv_39_0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--BBVNkeEy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://edwardamor.xyz/posts/004/model-shv_39_0.png" alt="png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The precision-recall curve for our classifier shows it barely improves on a no-skill baseline (whose precision equals the positive-class prevalence, roughly .24): on the minority class, it is wrong more often than it is right.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# output classification report
&lt;/span&gt;&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;classification_report&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;imbalanced_pipeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;              precision recall f1-score support

           0 0.79 0.98 0.88 399
           1 0.72 0.17 0.27 124

    accuracy 0.79 523
   macro avg 0.76 0.57 0.58 523
weighted avg 0.78 0.79 0.73 523
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
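As a quick sanity check, the minority-class f1 in the report is just the harmonic mean of that row’s precision and recall. Using the rounded values from the table, so the result only approximates the reported 0.27:

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# minority-class precision and recall from the report above
print(round(f1(0.72, 0.17), 3))  # ≈ 0.275, in line with the reported 0.27
```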



&lt;p&gt;It appears we are being severely hurt by the imbalance in our classes, so next we’ll use synthetic oversampling and random undersampling to better our model. Although this isn’t hyperparameter tuning, it is important to tune your data just as much as you’d tune your model, as what you get out is only as good as what you put in.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# make balanced pipeline
&lt;/span&gt;&lt;span class="n"&gt;balanced_pipeline&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;make_pipeline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;StandardScaler&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="n"&gt;SMOTE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;SEED&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="c1"&gt;# completely balance the two classes
&lt;/span&gt;    &lt;span class="n"&gt;LogisticRegression&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;solver&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"liblinear"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;SEED&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Our new pipeline integrates the imbalanced-learn library’s Synthetic Minority Oversampling Technique (SMOTE) to upsample our minority class. It does this by generating synthetic data points from our minority class data, yielding an updated dataset with a balanced class distribution. Since we’ll have a balanced dataset for training, we’ll be able to use an ROC curve and AUC to assess performance.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# perform CV on our data and output f1 results
&lt;/span&gt;&lt;span class="n"&gt;f1_score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cross_val_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;balanced_pipeline&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;scoring&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"f1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cv&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;skfold&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_jobs&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="s"&gt;"%.4f"&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;f1_score&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;'0.5153'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We’ve already doubled our baseline f1 score, which is an excellent sign; we desperately needed to implement that upsampling.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# fit our model and output ROC curve, now that our pipeline is balancing our dataset
&lt;/span&gt;&lt;span class="n"&gt;balanced_pipeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;fig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ax&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;subplots&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;plot_roc_curve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;balanced_pipeline&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Precision-Recall Curve Balanced Logit"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linspace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linspace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s"&gt;"--"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--tkcAMdFm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://edwardamor.xyz/posts/004/model-shv_47_0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--tkcAMdFm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://edwardamor.xyz/posts/004/model-shv_47_0.png" alt="png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here we can see that our model performs alright. It isn’t anything special, but it does have some ability to distinguish between the two classes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# output classification report
&lt;/span&gt;&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;classification_report&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;balanced_pipeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;              precision recall f1-score support

           0 0.90 0.63 0.74 399
           1 0.39 0.77 0.52 124

    accuracy 0.66 523
   macro avg 0.64 0.70 0.63 523
weighted avg 0.78 0.66 0.69 523
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Likewise, our classification report shows that the recall score for our minority class has dramatically gone up, by .60. This has come at the cost of some of the other numbers, though.&lt;/p&gt;

&lt;p&gt;Next we will be looking into some more model validation, and finally hyperparameter tuning to optimize our model some more.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tuning
&lt;/h2&gt;

&lt;p&gt;Now that we have our data pipeline working how we want it to, we’ll look into tuning our logit classifier by altering some of its parameters (called hyperparameter tuning in the biz). The easiest way to run through a finite set of possible parameters is to use GridSearchCV; if you have a large set of parameters to search through, it is a lot better to use RandomizedSearchCV, which will only build the number of models you specify.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# create parameter grid and grid search
&lt;/span&gt;&lt;span class="n"&gt;param_grid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="s"&gt;"logisticregression__penalty"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"l1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"l2"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="s"&gt;"logisticregression__C"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linspace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;1e-10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="s"&gt;"logisticregression__fit_intercept"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;

&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;gs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;GridSearchCV&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;balanced_pipeline&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;param_grid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;scoring&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"f1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_jobs&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cv&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;skfold&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;gs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="s"&gt;"%.4f"&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;gs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;best_score_&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;'0.5153'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It looks like our grid search yields no improvement. However, this is a good example of what not to do if you have a very large parameter space to search through. Instead, you should check out RandomizedSearchCV, which randomly samples a parameter set for each iteration.&lt;br&gt;
&lt;/p&gt;
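A randomized search over the same kind of pipeline might look like the sketch below. The synthetic data, the search budget of 20 candidates, and the log-uniform range for C are all illustrative assumptions:

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# synthetic stand-in with roughly our data's class split
X, y = make_classification(n_samples=523, n_features=3, n_informative=3,
                           n_redundant=0, weights=[0.76, 0.24], random_state=0)

pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression(solver="liblinear", random_state=0),
)

# distributions are sampled, so only n_iter models get built
param_distributions = {
    "logisticregression__penalty": ["l1", "l2"],
    "logisticregression__C": loguniform(1e-4, 1e2),
}
rs = RandomizedSearchCV(pipe, param_distributions, n_iter=20,
                        scoring="f1", random_state=0, n_jobs=-1)
rs.fit(X, y)
print(rs.best_params_)
```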

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# perform CV on our data and output f1 results
&lt;/span&gt;&lt;span class="n"&gt;f1_score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cross_val_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;best_estimator_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;scoring&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"f1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cv&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;skfold&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_jobs&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="s"&gt;"%.4f"&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;f1_score&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;'0.5153'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# fit our model and output ROC curve, now that our pipeline is balancing our dataset
&lt;/span&gt;&lt;span class="n"&gt;gs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;best_estimator_&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;fig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ax&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;subplots&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;plot_roc_curve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;best_estimator_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Precision-Recall Curve Grid Search Balanced Logit"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linspace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linspace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s"&gt;"--"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--vssuMvdN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://edwardamor.xyz/posts/004/model-shv_57_0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--vssuMvdN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://edwardamor.xyz/posts/004/model-shv_57_0.png" alt="png"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# output classification report
&lt;/span&gt;&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;classification_report&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;gs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;best_estimator_&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;              precision recall f1-score support

           0 0.90 0.63 0.74 399
           1 0.39 0.77 0.52 124

    accuracy 0.66 523
   macro avg 0.64 0.70 0.63 523
weighted avg 0.78 0.66 0.69 523
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# test data predictions
&lt;/span&gt;&lt;span class="n"&gt;test_pred&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;gs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;best_estimator_&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="c1"&gt;# output classification report
&lt;/span&gt;&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;classification_report&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;test_pred&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;              precision recall f1-score support

           0 0.90 0.57 0.70 171
           1 0.37 0.80 0.50 54

    accuracy 0.62 225
   macro avg 0.63 0.68 0.60 225
weighted avg 0.77 0.62 0.65 225
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# plot confusion matrix
&lt;/span&gt;&lt;span class="n"&gt;fig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ax&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;subplots&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;plot_confusion_matrix&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;best_estimator_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--wpMrCGs0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://edwardamor.xyz/posts/004/model-shv_61_0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--wpMrCGs0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://edwardamor.xyz/posts/004/model-shv_61_0.png" alt="png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;By the looks of it, we’ve carefully tuned a model that performs poorly to begin with. The lesson still stands, though: the process of validation and tuning remains very much the same.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;When working on any machine learning problem, your data is invaluable; sometimes you have a lot and sometimes very little. Regardless of the volume, you must always ensure there is no data leakage, which at best leads to false confidence and at worst leaves you unaware that your models are terrible. Preventing leakage is often simple: split your data into two groups, one for training and another for testing. Do make sure, however, that there is no order dependence (or any other dependence) in your data before splitting.&lt;/p&gt;
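&lt;p&gt;A leakage-free split is a one-liner with scikit-learn. A minimal sketch (the dataset here is synthetic, sized to mirror the 523/225 split in the reports above, not the post’s actual data):&lt;/p&gt;

```python
# Hold out a test set before any modeling; stratify to preserve class balance.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=748, weights=[0.76], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)
print(X_train.shape[0], X_test.shape[0])  # 523 train / 225 test
```

&lt;p&gt;From here on, only &lt;code&gt;X_train&lt;/code&gt;/&lt;code&gt;y_train&lt;/code&gt; should ever touch your model until the final evaluation.&lt;/p&gt;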

&lt;p&gt;After splitting your data, it’s always good to use cross-validation: it lets you verify, using only your training data, that you’re heading in the right direction. You only need cross-validation on the training set, and your model should never touch the testing/hold-out data (don’t even look at it yourself) until you’re reasonably sure you have your final model. The problem that usually arises is that developers check whether their model is good on the testing data, and then optimize against that same testing data, which defeats the purpose. You want your model to generalize well, and that means it can’t see your testing data at all.&lt;/p&gt;

&lt;p&gt;Lastly, hyperparameter tuning: not everyone remembers exactly what each parameter does for every function call. That’s why documentation exists; check the docs for whichever algorithm you’re using and determine how to manipulate its hyperparameters. You can often identify good hyperparameter values during EDA, but when you can’t, tools like GridSearchCV or RandomizedSearchCV help you search a large parameter space.&lt;/p&gt;
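&lt;p&gt;As a rough sketch of that workflow (a synthetic dataset and a logistic-regression pipeline standing in for whatever model you’re tuning):&lt;/p&gt;

```python
# Minimal grid search sketch: tune C for a logistic regression with 5-fold CV.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, random_state=0)
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
param_grid = {"logisticregression__C": [0.01, 0.1, 1, 10]}

gs = GridSearchCV(pipe, param_grid, scoring="f1", cv=StratifiedKFold(5), n_jobs=-1)
gs.fit(X, y)
print(gs.best_params_, round(gs.best_score_, 4))
```

&lt;p&gt;The grid here is tiny on purpose; with a large parameter space, RandomizedSearchCV with a fixed budget of iterations is usually the better starting point.&lt;/p&gt;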

</description>
    </item>
    <item>
      <title>The Bias Variance Tradeoff</title>
      <dc:creator>Edward Amor</dc:creator>
      <pubDate>Wed, 17 Jun 2020 05:00:00 +0000</pubDate>
      <link>https://dev.to/edwardamor/the-bias-variance-tradeoff-2a8f</link>
      <guid>https://dev.to/edwardamor/the-bias-variance-tradeoff-2a8f</guid>
      <description>&lt;p&gt;Foundational to any data science curriculum is the introduction of the terms bias and variance, and subsequently the trade-off that exists between the two. As machine learning continues to grow it is imperative that we understand these concepts, as they directly effect the predictions we make and the business value we can derive from our generated models. While machine learning may seem simple, one of the more difficult parts is optimizing your models but sometimes optimization can lead to over-fitting and if your model is too simple it may be under-fitting your data. The inevitable trade-off between these two aspects will greatly impact the validity of your model, and the predictions you make. But what is bias, what is variance, and what is this trade-off?&lt;/p&gt;

&lt;h2&gt;
  
  
  Bias
&lt;/h2&gt;

&lt;p&gt;When we speak of bias, we aren’t talking about the ordinary cognitive bias us humans are susceptible to. In machine learning, bias refers to the difference between our model’s predictions and the expected values (prediction - reality). A model with high bias consistently makes wrong predictions because it fails to capture the complexity of our data. &lt;strong&gt;A model with high bias is under-fitting&lt;/strong&gt; our data, and it does so consistently on the testing/validation data as well as during training.&lt;/p&gt;

&lt;p&gt;We can identify if our model has high bias if the following occur:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;We tend to get high training errors.&lt;/li&gt;
&lt;li&gt;The validation error or test error will be similar to the training error.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We can compensate for high bias by doing the following:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;We need to gather more input features, or generate new ones using feature engineering techniques.&lt;/li&gt;
&lt;li&gt;We can add polynomial features in order to increase the complexity.&lt;/li&gt;
&lt;li&gt;If we are using any regularization terms in our model, we can try to minimize them.&lt;/li&gt;
&lt;/ol&gt;
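&lt;p&gt;A small sketch of point 2, assuming scikit-learn: a plain line under-fits a quadratic trend, while the same linear model on polynomial features captures it (the data here is synthetic, for illustration only):&lt;/p&gt;

```python
# Underfitting fix sketch: a line can't capture a quadratic trend,
# but adding polynomial features lets a linear model fit it.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = X[:, 0] ** 2 + rng.normal(scale=0.1, size=200)  # quadratic signal + noise

linear = LinearRegression().fit(X, y)                                    # high bias
poly = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X, y)
print(round(linear.score(X, y), 3), round(poly.score(X, y), 3))
```

&lt;p&gt;The linear model’s R&amp;#178; stays near zero even on its own training data, the telltale high-bias symptom from the list above.&lt;/p&gt;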

&lt;h2&gt;
  
  
  Variance
&lt;/h2&gt;

&lt;p&gt;Similar to the statistical term, variance refers to the variability of our model’s predictions. A model with high variance does not generalize well; instead, it pays too much attention to our training data. What ends up happening is we get a model which performs very well during training, but shows very high error rates when introduced to testing/validation or any other unseen data. One way to think about it is like a travel route: if you took the route and mapped it onto a completely different area, it wouldn’t work, as it only fits the particular origin and destination it was made for. We aren’t making routes, but the concept still holds: we want models which generalize well to unseen data similar to the data used during training.&lt;/p&gt;

&lt;p&gt;We can identify whether the model has high variance if:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;We tend to get a low training error.&lt;/li&gt;
&lt;li&gt;The validation error or test error will be very high.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We can fix high variance by:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Gathering more training data, so that the model can learn more based on the patterns rather than the noise.&lt;/li&gt;
&lt;li&gt;We can even try to reduce the input features or do feature selection, reducing model complexity.&lt;/li&gt;
&lt;li&gt;If we are using any regularization terms in our model, we can try to maximize them.&lt;/li&gt;
&lt;/ol&gt;
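&lt;p&gt;A quick sketch of point 3 using ridge regression (scikit-learn, synthetic data): a larger penalty shrinks the coefficients, reducing the model’s sensitivity to the particular training sample:&lt;/p&gt;

```python
# Overfitting fix sketch: increasing the ridge penalty (alpha) shrinks the
# coefficients, trading a little bias for lower variance.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 20))                 # few samples, many features
y = X @ rng.normal(size=20) + rng.normal(scale=0.5, size=30)

weak = Ridge(alpha=0.01).fit(X, y)            # barely regularized
strong = Ridge(alpha=100.0).fit(X, y)         # heavily regularized
print(round(float(np.linalg.norm(weak.coef_)), 2),
      round(float(np.linalg.norm(strong.coef_)), 2))
```

&lt;p&gt;The same idea applies to the &lt;code&gt;C&lt;/code&gt; parameter of a logistic regression (where smaller &lt;code&gt;C&lt;/code&gt; means stronger regularization).&lt;/p&gt;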

&lt;h2&gt;
  
  
  The Trade-Off
&lt;/h2&gt;

&lt;p&gt;Now that we know what bias and variance are, it is key to understand that reducing one tends to increase the other: a model with high bias will have low variance, and vice versa. With respect to bias and variance, a model’s expected error can be decomposed into the sum of three parts: the squared bias, the variance, and irreducible noise (&lt;code&gt;Error = Bias^2 + Variance + Noise&lt;/code&gt;).&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The &lt;strong&gt;bias–variance decomposition&lt;/strong&gt; is a way of analyzing a learning algorithm’s &lt;a href="https://en.wikipedia.org/wiki/Expected_value"&gt;expected&lt;/a&gt; &lt;a href="https://en.wikipedia.org/wiki/Generalization_error"&gt;generalization error&lt;/a&gt; with respect to a particular problem as a sum of three terms, the bias, variance, and a quantity called the &lt;em&gt;irreducible error&lt;/em&gt;, resulting from noise in the problem itself.&lt;/p&gt;

&lt;p&gt;— Wikipedia &lt;sup id="fnref:1"&gt;1&lt;/sup&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The trade-off, therefore, is finding the bias and variance levels that minimize our overall error. Using the steps listed previously for reducing either of the two, one has to iteratively improve on the generated models until arriving at one that is reasonably balanced between bias and variance. If we don’t balance these two terms, we’ll end up with a model that either under-fits or over-fits our data, which gives us no value when it comes to making predictions.&lt;/p&gt;
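&lt;p&gt;The decomposition can also be checked empirically. A small Monte Carlo sketch of my own (fitting polynomials to noisy sine data, not anything from the post): refitting on many resampled training sets, a degree-1 model shows high squared bias while a degree-10 model shows high variance:&lt;/p&gt;

```python
# Empirical bias-variance sketch: refit a simple and a complex model on many
# resampled training sets, then measure squared bias and variance of their
# predictions at a single fixed test point.
import numpy as np

rng = np.random.default_rng(0)

def true_f(x):
    return np.sin(x)

x_test = 1.0

def predictions(degree, n_runs=300, n_train=30, noise=0.3):
    preds = []
    for _ in range(n_runs):
        x = rng.uniform(-3, 3, n_train)
        y = true_f(x) + rng.normal(scale=noise, size=n_train)
        preds.append(np.polyval(np.polyfit(x, y, degree), x_test))
    return np.array(preds)

results = {}
for degree in (1, 10):
    p = predictions(degree)
    results[degree] = ((p.mean() - true_f(x_test)) ** 2, p.var())
    print(degree, results[degree])  # (squared bias, variance)
```

&lt;p&gt;Neither term can be driven to zero here without inflating the other, which is exactly the trade-off described above.&lt;/p&gt;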




&lt;ol&gt;
&lt;li id="fn:1"&gt;
&lt;p&gt;&lt;a href="https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff"&gt;https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff&lt;/a&gt; ↩︎&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;

</description>
    </item>
    <item>
      <title>And Now for Something Completely Different</title>
      <dc:creator>Edward Amor</dc:creator>
      <pubDate>Sat, 02 May 2020 13:00:00 +0000</pubDate>
      <link>https://dev.to/edwardamor/and-now-for-something-completely-different-7l6</link>
      <guid>https://dev.to/edwardamor/and-now-for-something-completely-different-7l6</guid>
      <description>&lt;p&gt;On January 21st, 2020, I enrolled in Flatiron School’s Data Science Bootcamp with the intention of gaining and developing the foundational skills and techniques necessary to become a Data Scientist. At the time of writing, I’m about 4 months into the program and in retrospect, I believe my decision to enroll was one of the best choices I’ve made. Along with the opportunities that will be available to me when I finish, the passionate and intelligent peers that I get to collaborate with, and the breadth of exciting, new, and challenging material I’m learning, I am glad I made the decision to learn data science.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;As my journey progresses, there is one question I keep getting, &lt;em&gt;why did I decide to learn data science?&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--PkXaADZF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://edwardamor.xyz/posts/002/xps.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--PkXaADZF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://edwardamor.xyz/posts/002/xps.jpg" alt=""&gt;&lt;/a&gt;&lt;br&gt;
            &lt;p&gt;Photo by &lt;a href="https://unsplash.com/@xps?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText"&gt;XPS&lt;/a&gt; on &lt;a href="https://unsplash.com/t/technology?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText"&gt;Unsplash&lt;/a&gt;&lt;/p&gt;

 &lt;/p&gt;

&lt;h2&gt;
  
  
  Passion for STEM
&lt;/h2&gt;

&lt;p&gt;As an interdisciplinary field, &lt;strong&gt;Data Science&lt;/strong&gt; incorporates a lot from many other fields like mathematics, statistics, computer science, and information science. I wouldn’t have made the decision to learn data science if it wasn’t for my passion and joy for programming coupled with my love for mathematics. Ultimately, &lt;strong&gt;it really brings me joy&lt;/strong&gt; to work on the projects and material at my Bootcamp, a feeling I can only compare to the addiction of staying up late to solve calculus equations. Except now instead of calculus equations, it’s iteratively designing regression models and extracting business insights from large datasets, and who can forget making beautiful visualizations.&lt;/p&gt;

&lt;h2&gt;
  
  
  Adaptability
&lt;/h2&gt;

&lt;p&gt;One of the best parts of data science is that it’s so &lt;strong&gt;flexible and can be applied to practically any domain&lt;/strong&gt;. This high adaptability is clearly reflected in the tools data scientists use, the basic skills they require, and the knowledge they draw from. And with the rising amount of data collection by businesses large and small, data scientists fit in by analyzing that data and providing businesses with insights into the relationships within it. Moreover, since they’re generally skilled at every step of the data lifecycle, many organizations could benefit from having a specialist like a data scientist on staff.&lt;/p&gt;

&lt;h2&gt;
  
  
  Return on Investment
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--v2O7xpa3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://edwardamor.xyz/posts/002/pennyplant.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--v2O7xpa3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://edwardamor.xyz/posts/002/pennyplant.jpeg" alt=""&gt;&lt;/a&gt;&lt;br&gt;
            &lt;p&gt;Photo by &lt;a href="https://unsplash.com/@micheile?utm_source=medium&amp;amp;utm_medium=referral"&gt;Micheile Henderson&lt;/a&gt; on &lt;a href="https://unsplash.com/t/technology?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText"&gt;Unsplash&lt;/a&gt;&lt;/p&gt;

 &lt;/p&gt;

&lt;p&gt;At the end of the day &lt;strong&gt;I’m investing in myself and my future&lt;/strong&gt;, so why not take advantage of my youth, break a few eggs trying new things, and experiment with what works and what doesn’t? I don’t know with absolute certainty that my decision will put me in a better financial position in the long run, but I do know that I am getting a head start &lt;strong&gt;developing my skills and turning myself into an asset&lt;/strong&gt;. And since the role is in such high demand, &lt;strong&gt;data science is very lucrative&lt;/strong&gt;. Who wouldn’t want to do something they’re excited about while also making a living? It sounds like the perfect scenario.&lt;/p&gt;

&lt;p&gt;The journey ahead of me is a long and arduous one, but I know beyond all doubt that I will look back on my decision to learn Data Science as the best choice I made for my career. I can’t wait to see the amazing things I will accomplish. Until then I’ll keep working hard and sharpening my skills.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Virtual Environments with Python</title>
      <dc:creator>Edward Amor</dc:creator>
      <pubDate>Tue, 17 Mar 2020 21:00:00 +0000</pubDate>
      <link>https://dev.to/edwardamor/virtual-environments-with-python-4djc</link>
      <guid>https://dev.to/edwardamor/virtual-environments-with-python-4djc</guid>
      <description>&lt;p&gt;Similar to other programming languages (R, Ruby, Scala, JavaScript) Python comes with its own way of managing third party packages you choose to install for projects. And since Python 3.4, pip has been included by default in all binary installations of Python, allowing users to install packages from the Python Packaging Index (a public repository of open source licensed packages). However, there is one major shortcoming of the way packages are managed, and that is all packages get installed and retrieved from the same place. To the uninitiated this may not seem like an issue, however it is a disaster waiting to happen.&lt;/p&gt;

&lt;p&gt;Without going in-depth on the inner workings of package managers and dependency resolution, I’ll paint a simple picture. You’re a hobbyist developer and you enjoy scripting general tasks that are monotonous, and your new project involves downloading a bunch of images from a website. You’ve read through some repositories and figure you only need the package &lt;code&gt;foo&lt;/code&gt; to get the task done. You go to download &lt;code&gt;foo&lt;/code&gt; using pip but suddenly an error gets raised &lt;code&gt;ERROR: bar 1.0 has requirement requests==2.24.0, but you'll have requests 2.22.0 which is incompatible&lt;/code&gt;. It appears a package you previously installed &lt;code&gt;bar&lt;/code&gt; has a conflicting dependency with &lt;code&gt;foo&lt;/code&gt;, in this case they both require different versions of the &lt;code&gt;requests&lt;/code&gt; library. To solve this issue you could manually go through your package list and remove stuff, or you could use a virtual environment.&lt;/p&gt;

&lt;p&gt;A virtual environment is essentially an isolated sandbox, with its own instance of pip and an isolated set of packages (and their dependencies). This means that instead of downloading packages globally, each project can have its own isolated environment with no dependency conflicts with other projects. There is also no limit: every project you have can get its own unique sandbox of packages to work with. I imagine your next question is, how do I get started using them? Honestly, there are so many ways the options seem endless.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Standard Library
&lt;/h2&gt;

&lt;p&gt;The simplest way to start using virtual environments is to use the &lt;code&gt;venv&lt;/code&gt; module (&lt;a href="https://docs.python.org/3/library/venv.html"&gt;link here&lt;/a&gt;) that’s available in the standard library. Simply run the following command inside your project’s directory.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ python -m venv env
$ source env/bin/activate

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Et voilà, you’ve officially created and activated your virtual environment. You’ll know it’s active because your prompt will change, and you can verify by running &lt;code&gt;pip list&lt;/code&gt;; you should see both &lt;code&gt;pip&lt;/code&gt; and &lt;code&gt;setuptools&lt;/code&gt; installed.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(env) $ pip list

Package Version
---------- -------
pip 20.1.1
setuptools 47.1.0

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You’ll also have a new directory &lt;code&gt;env&lt;/code&gt; in your project; make sure not to commit it to your version control system. Instead, if you aren’t already, keep a requirements.txt in your project: a plain text file listing one required package per line. This allows you, and any collaborators, to recreate your environment simply by running &lt;code&gt;pip install -r requirements.txt&lt;/code&gt;.&lt;/p&gt;
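&lt;p&gt;A requirements.txt might look like the following (the package names and version pins here are purely illustrative):&lt;/p&gt;

```
requests==2.24.0
beautifulsoup4==4.9.1
pandas
```

&lt;p&gt;Entries without a version pin install the latest available release, while pinned entries make the environment reproducible.&lt;/p&gt;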

&lt;p&gt;The main disadvantage of this method of creating virtual environments is you have to maintain your requirements.txt file. Typically this means manually appending packages to the file &lt;em&gt;(if you want a human readable version)&lt;/em&gt;, or running &lt;code&gt;pip freeze &amp;gt; requirements.txt&lt;/code&gt; &lt;em&gt;(for a more explicit machine readable version)&lt;/em&gt; every time you install something new. &lt;strong&gt;Note that the output of the &lt;code&gt;pip freeze&lt;/code&gt; command will include the exact version number of each package you’ve installed along with their dependencies&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pipenv
&lt;/h2&gt;

&lt;p&gt;An alternative to the standard library’s &lt;code&gt;venv&lt;/code&gt; module is &lt;code&gt;pipenv&lt;/code&gt;, from the same mind that created the popular &lt;code&gt;requests&lt;/code&gt; python library. As the &lt;code&gt;pipenv&lt;/code&gt; &lt;a href="https://github.com/pypa/pipenv"&gt;repository&lt;/a&gt; says:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;[Pipenv] automatically creates and manages a virtualenv for your projects, as well as adds/removes packages from your &lt;code&gt;Pipfile&lt;/code&gt; as you install/uninstall packages. It also generates the ever-important &lt;code&gt;Pipfile.lock&lt;/code&gt;, which is used to produce deterministic builds.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It is an amazing tool once you get tired of using &lt;code&gt;venv&lt;/code&gt;, and similar to other projects by &lt;a href="https://github.com/ken-reitz"&gt;Ken Reitz&lt;/a&gt; it is made for humans, and is immensely simple to use.&lt;/p&gt;

&lt;p&gt;To get started, the best way to use pipenv is to have a fresh install of python, although it is fine if you don’t. Simply run &lt;code&gt;pip install pipenv&lt;/code&gt; and you’re all set. Moving forward instead of using pip to install packages for your projects, you’ll use &lt;code&gt;pipenv install [insert package name]&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;One thing you’ll notice when using &lt;code&gt;pipenv&lt;/code&gt; is that instead of a requirements.txt, it generates a Pipfile and Pipfile.lock. Both of these files are important and should be committed to your version control system. The Pipfile contains information about your project’s dependencies, whereas the Pipfile.lock contains sha256 hashes of each downloaded package, allowing &lt;code&gt;pip&lt;/code&gt; to guarantee you’re installing exactly what you intend to. The result is a simple way to get deterministic environments (environments which are exactly the same), without any intervention from you.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note that in order to run your python files using your virtual environment, you’ll need to activate it. With &lt;code&gt;pipenv&lt;/code&gt; it’s as simple as running &lt;code&gt;pipenv shell&lt;/code&gt; while inside your project’s directory.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;One disadvantage to using &lt;code&gt;pipenv&lt;/code&gt;, similar to &lt;code&gt;venv&lt;/code&gt;, is that you need to already have python and pip installed on your system, otherwise you won’t be able to use them.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conda
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;conda&lt;/code&gt; is the de facto environment/package manager for python data scientists for a reason. It’s a platform-agnostic binary (python doesn’t need to already be on your system) which not only does package management but also lets you have different versions of python for different projects. You can download it by going to the &lt;a href="https://www.anaconda.com/"&gt;Anaconda website&lt;/a&gt; and selecting the installer for your platform. After you have it, you can use the graphical user interface &lt;code&gt;anaconda-navigator&lt;/code&gt; to access and manage your virtual environments. However, if you’re like me and live in your shell, then you’ll most likely want to use the &lt;code&gt;conda&lt;/code&gt; command.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note there is wonderful &lt;a href="https://docs.conda.io/en/latest/"&gt;documentation&lt;/a&gt; on all the configuration you can perform, which I won’t go into, but I highly recommend you edit your condarc to your specification.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The main parts of &lt;code&gt;conda&lt;/code&gt; that you should get acquainted with are creating environments, installing packages, and generating an environment.yml. The environment.yml is similar to both a requirements.txt and Pipfile.&lt;/p&gt;

&lt;p&gt;To create an environment, activate, and install a package to the environment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ conda create --name my-env # create a new environment
$ conda activate my-env # activate the environment
(my-env) $ conda install jupyter # install jupyter in the environment

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One of the most important things is to create an environment.yml after installing packages, and commit it to version control. You can generate two types: &lt;code&gt;conda env export&lt;/code&gt; produces a fully pinned, deterministic version, which isn’t as useful when working across multiple platforms, while &lt;code&gt;conda env export --from-history&lt;/code&gt; produces a more generally useful version containing only the packages you explicitly requested. These files can also be written by hand if you ever need to.&lt;/p&gt;
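&lt;p&gt;An environment.yml in the second, hand-maintainable style might look like this (contents illustrative, matching the &lt;code&gt;my-env&lt;/code&gt;/&lt;code&gt;jupyter&lt;/code&gt; example above):&lt;/p&gt;

```yaml
name: my-env
channels:
  - defaults
dependencies:
  - python=3.8
  - jupyter
```

&lt;p&gt;Anyone can then recreate the environment with &lt;code&gt;conda env create -f environment.yml&lt;/code&gt;.&lt;/p&gt;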

&lt;p&gt;One of the biggest advantages of using &lt;code&gt;conda&lt;/code&gt; is that it also works for multiple programming languages, not just Python. A short list of the languages it is available for include R, Ruby, Lua, Scala, Java, JavaScript, C/ C++, FORTRAN, and more.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pyenv &amp;amp; Pyenv-Virtualenv
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;pyenv&lt;/code&gt; is a python environment manager written in shell scripts (available for *nix systems). As it says in the &lt;a href="https://github.com/pyenv/pyenv"&gt;project&lt;/a&gt; description, &lt;em&gt;“It’s simple, unobtrusive, and follows the UNIX tradition of single-purpose tools that do one thing well.”&lt;/em&gt; Coupled with &lt;code&gt;pyenv-virtualenv&lt;/code&gt; you can create virtual environments for many versions of python, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Python 2&lt;/li&gt;
&lt;li&gt;Python 3&lt;/li&gt;
&lt;li&gt;activepython&lt;/li&gt;
&lt;li&gt;anaconda2&lt;/li&gt;
&lt;li&gt;anaconda3&lt;/li&gt;
&lt;li&gt;ironpython&lt;/li&gt;
&lt;li&gt;jython&lt;/li&gt;
&lt;li&gt;micropython&lt;/li&gt;
&lt;li&gt;miniconda3&lt;/li&gt;
&lt;li&gt;pypy&lt;/li&gt;
&lt;li&gt;pypy2.7&lt;/li&gt;
&lt;li&gt;pypy3.6&lt;/li&gt;
&lt;li&gt;pyston&lt;/li&gt;
&lt;li&gt;stackless&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You’ll notice that anaconda is available, and by using &lt;code&gt;pyenv&lt;/code&gt; you’re not limited to just regular python. &lt;code&gt;pyenv&lt;/code&gt; really shines on GNU/Linux because it alleviates the pressures of installing packages to your system version of python. Personally, I prefer to use &lt;code&gt;pyenv&lt;/code&gt; as it allows me to mix and match and play around with my python environments with no worry. And it’s super simple to install, and remove if you choose to abandon it.&lt;/p&gt;

&lt;p&gt;The simplest way to install it is to use the &lt;a href="https://github.com/pyenv/pyenv-installer"&gt;automatic installer&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ curl https://pyenv.run | bash
$ exec $SHELL

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once done, you’ll have the &lt;code&gt;pyenv&lt;/code&gt; command available, along with some additional plugins. Most importantly you’ll have &lt;code&gt;pyenv-virtualenv&lt;/code&gt;, which allows you to create virtual environments for the python versions you install.&lt;/p&gt;

&lt;p&gt;To get started creating virtual environments, you’ll first need to install a version of python, then use &lt;code&gt;pyenv virtualenv&lt;/code&gt; to create one.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ pyenv install 3.8.5 # version I want to install
$ pyenv virtualenv 3.8.5 my-env # create the environment
$ pyenv activate my-env # activate the environment

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The nicest part of &lt;code&gt;pyenv-virtualenv&lt;/code&gt; is that you can set which python version to use both globally and per directory. For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[me@host my-project] $ pyenv local my-env

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The above command creates a new file, &lt;code&gt;.python-version&lt;/code&gt;, in the project directory, and every time you enter or leave the directory the environment is automatically activated or deactivated. The best part is that it’s so easy.&lt;/p&gt;
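&lt;p&gt;There’s no magic in the file itself: &lt;code&gt;pyenv local&lt;/code&gt; simply writes the environment name into a plain-text &lt;code&gt;.python-version&lt;/code&gt; file, which pyenv’s shell hooks read whenever you change directories. A rough sketch of the mechanism, using a hypothetical &lt;code&gt;my-project&lt;/code&gt; directory and no pyenv installed:&lt;/p&gt;

```shell
# What `pyenv local my-env` does under the hood: write the env name
# into a .python-version file in the current directory.
mkdir -p my-project
cd my-project
echo "my-env" > .python-version   # equivalent to: pyenv local my-env
cat .python-version               # pyenv's shell hook reads this on cd
```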

&lt;p&gt;The one disadvantage of &lt;code&gt;pyenv&lt;/code&gt; and friends is that you’ll most likely need to read up on it to get comfortable with it. The learning curve isn’t steep, and it’s always best to know the inner workings of any tool you use. &lt;strong&gt;Note: on some distributions of GNU/Linux you will have to install additional build dependencies, which can be found in the ‘Common build problems’ section of the &lt;a href="https://github.com/pyenv/pyenv/wiki/Common-build-problems"&gt;wiki&lt;/a&gt;.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Recommendations
&lt;/h2&gt;

&lt;p&gt;My recommendation differs depending on your platform and use case; above all else, experiment with the options above. I’ve only covered some of the mainstream tools here, and plenty of more obscure options exist as well.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Platform&lt;/th&gt;
&lt;th&gt;Recommendation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Windows&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;pipenv&lt;/code&gt; for 99% of use cases. &lt;code&gt;conda&lt;/code&gt; if you’re into data science&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mac OSX&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;pyenv&lt;/code&gt; with &lt;code&gt;pyenv-virtualenv&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GNU/Linux&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;pyenv&lt;/code&gt; with &lt;code&gt;pyenv-virtualenv&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;If this was of any help to you, I hope you start taking environment management seriously moving forward. It’s a powerful practice that will spare you some of the more serious headaches that arise from skipping it.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>High Level Overview of Quantile Quantile Plots</title>
      <dc:creator>Edward Amor</dc:creator>
      <pubDate>Sat, 01 Feb 2020 05:00:00 +0000</pubDate>
      <link>https://dev.to/edwardamor/high-level-overview-of-quantile-quantile-plots-13ih</link>
      <guid>https://dev.to/edwardamor/high-level-overview-of-quantile-quantile-plots-13ih</guid>
      <description>&lt;p&gt;A part of any data analyst’s toolkit when working with one dimensional data, is the Quantile Quantile plot. Colloquially referred to as Q-Q plots, these visualizations are unique in that they’re mainly utilized when comparing samples and/or comparing distributions. Although they’re not intuitive, Q-Q plots are amazing tools, especially when assessing whether a sample fits a known distribution, like the Gaussian distribution.&lt;/p&gt;

&lt;p&gt;Q-Q plots work by plotting the quantiles of one distribution (the x-coordinates), typically a theoretical distribution, against the quantiles of another distribution (the y-coordinates), typically an observed dataset. If the two distributions being compared are similar, the resulting points will lie approximately on the line &lt;code&gt;y=x&lt;/code&gt;. There are some variations of the Q-Q plot, and each one tells you something different about the data being compared. Q-Q plots are also loosely open to interpretation: a good heuristic is that if the points generally lie close enough to the line &lt;code&gt;y=x&lt;/code&gt;, you’re golden. Even data randomly drawn from the Gaussian distribution won’t lie exactly on the line &lt;code&gt;y=x&lt;/code&gt;, so there is wiggle room.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--2sDgU8kj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://edwardamor.xyz/posts/000/normal-qq-plot.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--2sDgU8kj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://edwardamor.xyz/posts/000/normal-qq-plot.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This Q-Q plot shows the quantiles of 75 randomly drawn data points from the standard normal distribution, compared against the normal distribution. One might intuitively expect the points to lie perfectly on the line &lt;code&gt;y=x&lt;/code&gt;, but this isn’t the case, which is why we say Q-Q plots are loosely open to interpretation.&lt;/p&gt;

&lt;p&gt;One place where Q-Q plots are routinely applied is linear regression. In linear regression, certain assumptions have to be met for the fitted model to be considered valid and not misleading. One of those assumptions is that the residuals of the model are normally distributed. To verify this assumption has not been violated, we typically use a Q-Q plot to quickly compare the distribution of the residuals to the Gaussian distribution. If the residuals loosely fit the line &lt;code&gt;y=x&lt;/code&gt;, one can state that the assumption has not been violated.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--kl9C2XMq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://edwardamor.xyz/posts/000/residuals-qq-plot.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--kl9C2XMq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://edwardamor.xyz/posts/000/residuals-qq-plot.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This Q-Q plot was generated from fitting a multivariate linear regression model; the residuals from the training data were then plotted against the standard normal distribution. One can see that this data doesn’t appear to be normal, due to the curvature of the points. This upward curvature denotes a positive skew in the residuals, meaning our model is over-predicting even on our training set.&lt;/p&gt;

&lt;p&gt;Just like any other graphical method for analyzing data, there are strengths and weaknesses to Q-Q plots. One has to know when best to use a Q-Q plot to receive the most benefit from it. In the case of Q-Q plots, they are immensely beneficial when comparing two distributions (theoretical or empirical), as they show how location (mean), scale (standard deviation), and skewness are similar or different in the two distributions. They’re also extremely beneficial when assessing the residuals of a regression model as shown previously.&lt;/p&gt;

&lt;p&gt;The biggest weakness of Q-Q plots, in my eyes, is the steep initial learning curve, but luckily the Internet offers a trove of information; one of the most helpful resources I found was a post on &lt;a href="https://stats.stackexchange.com/a/101290"&gt;StackExchange&lt;/a&gt;. Beyond that, the other major issue with Q-Q plots is that there is some room for interpretation in deciding whether your data lies close enough to the line &lt;code&gt;y=x&lt;/code&gt;. One person’s assessment will not always line up with another’s, but after some practice Q-Q plots provide an immense benefit when quickly assessing data.&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
