<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Océane</title>
    <description>The latest articles on DEV Community by Océane (@oyane806).</description>
    <link>https://dev.to/oyane806</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F212820%2F6c375fb6-68ce-45cc-99b1-800ba640c2aa.png</url>
      <title>DEV Community: Océane</title>
      <link>https://dev.to/oyane806</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/oyane806"/>
    <language>en</language>
    <item>
      <title>How to deal with a big dataset? - fastai lesson3</title>
      <dc:creator>Océane</dc:creator>
      <pubDate>Mon, 30 Sep 2019 12:18:11 +0000</pubDate>
      <link>https://dev.to/oyane806/how-to-deal-with-a-big-dataset-fastai-lesson3-oh</link>
      <guid>https://dev.to/oyane806/how-to-deal-with-a-big-dataset-fastai-lesson3-oh</guid>
      <description>&lt;p&gt;Hi there,&lt;/p&gt;

&lt;p&gt;This is lesson3 from &lt;a href="http://course18.fast.ai/ml.html" rel="noopener noreferrer"&gt;fast.ai&lt;/a&gt;! This part will cover how to deal with a big dataset, how to construct a good validation set and how to interpret random forest models.&lt;/p&gt;

&lt;h1&gt;
  
  
  1. Working with a large dataset
&lt;/h1&gt;

&lt;p&gt;Before loading data into a dataframe, it is good practice to check the size of the dataset. The fastest way is to do it from the command line, which works in a Jupyter notebook by adding &lt;code&gt;!&lt;/code&gt; in front of the commands.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="err"&gt;!&lt;/span&gt;&lt;span class="n"&gt;ls&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;lh&lt;/span&gt;
&lt;span class="err"&gt;!&lt;/span&gt;&lt;span class="n"&gt;wc&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;l&lt;/span&gt; &lt;span class="n"&gt;train&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;csv&lt;/span&gt;
&lt;span class="err"&gt;!&lt;/span&gt;&lt;span class="n"&gt;head&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt; &lt;span class="n"&gt;train&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;csv&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;tiny_subset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;csv&lt;/span&gt;
&lt;span class="err"&gt;!&lt;/span&gt;&lt;span class="n"&gt;shuf&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;o&lt;/span&gt; &lt;span class="n"&gt;tiny_subset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;csv&lt;/span&gt; &lt;span class="n"&gt;train&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;csv&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These lines give the size of the csv file and the number of lines in the file, and they create a tiny subset containing 5 lines of the training set. &lt;/p&gt;

&lt;p&gt;This tiny dataset can be used to figure out which datatypes pandas infers before reading in the whole table. It also helps to spot the date columns, which is useful information so that pandas can parse them as dates.&lt;/p&gt;

&lt;p&gt;To create a dictionary from the pandas datatypes, we can use the following line of code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;types&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df_tiny&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dtypes&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;to_dict&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;types = {&lt;br&gt;
'id': 'int64',&lt;br&gt;
'item_nbr': 'int32',&lt;br&gt;
'store_nbr': 'int8',&lt;br&gt;
'unit_sales': 'float32',&lt;br&gt;
'onpromotion': 'object'&lt;br&gt;
}&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;After this initial exploration, reading the whole file can be optimized. Passing the dictionary of datatypes can make the read about 5 times faster.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;df_all&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="err"&gt;‘&lt;/span&gt;&lt;span class="n"&gt;train&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;csv&lt;/span&gt;&lt;span class="err"&gt;’&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;parse_dates&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="err"&gt;‘&lt;/span&gt;&lt;span class="n"&gt;dates&lt;/span&gt;&lt;span class="err"&gt;’&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;infer_datetime_format&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It is convenient to keep the model analysis fast and light at first. Before training the model, we can call &lt;code&gt;set_rf_samples(50000)&lt;/code&gt; so that each tree is grown from a random sample of 50,000 rows.&lt;/p&gt;
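&lt;p&gt;Here is a minimal sketch of what that first pass might look like (&lt;code&gt;set_rf_samples&lt;/code&gt; and &lt;code&gt;reset_rf_samples&lt;/code&gt; come from the fastai 0.7 library used in this course; &lt;code&gt;x&lt;/code&gt; and &lt;code&gt;y&lt;/code&gt; are assumed to be the features and target prepared earlier):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from fastai.structured import set_rf_samples, reset_rf_samples
from sklearn.ensemble import RandomForestRegressor

# Each tree is now grown from a random sample of 50,000 rows,
# so training stays fast even on a very large dataset.
set_rf_samples(50000)
m = RandomForestRegressor(n_jobs=-1)
%time m.fit(x, y)

# Once the analysis is settled, go back to using the full bootstrap sample.
reset_rf_samples()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;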

&lt;p&gt;If a line of code takes a long time, it can be useful to run it under a profiler with &lt;code&gt;%prun m.fit(x, y)&lt;/code&gt;. It reports which lines of code took most of the time.&lt;/p&gt;

&lt;h1&gt;
  
  
  2. Building a robust validation set
&lt;/h1&gt;

&lt;p&gt;Without a good validation set, it is hard to create a good model: the validation set is what tells whether a model is performing well or not. A good practice is to calibrate the validation set against the test set.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fv8lrn21oiqihy0rziroa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fv8lrn21oiqihy0rziroa.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the above image, four different models are scored on both the test set and the validation set. If the validation set is good, then the test score and the validation score should lie roughly on a straight line.&lt;/p&gt;
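&lt;p&gt;For instance, a quick way to check this calibration is to plot both scores for a handful of models against each other (the numbers below are purely illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import matplotlib.pyplot as plt

# Hypothetical validation and test scores for four submitted models
val_scores  = [0.83, 0.85, 0.88, 0.90]
test_scores = [0.81, 0.84, 0.87, 0.89]

plt.scatter(val_scores, test_scores)
plt.xlabel('validation score')
plt.ylabel('test score')
# A well-calibrated validation set puts these points roughly on a straight line.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;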

&lt;h1&gt;
  
  
  3. Interpreting machine learning models
&lt;/h1&gt;

&lt;h3&gt;
  
  
  Feature importance
&lt;/h3&gt;

&lt;p&gt;Feature importance helps to understand which variables contribute the most to the model. The following code plots the 30 most important features for the current model.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;rf_feat_importance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;cols&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;imp&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;feature_importances_&lt;/span&gt;
                       &lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;sort_values&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;imp&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ascending&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;fi&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;rf_feat_importance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;plot_fi&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fi&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; 
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;fi&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;cols&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;imp&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;barh&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;legend&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                 &lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;#ffb500&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;plot_fi&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fi&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;xlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;feature importance&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F0cig9j9a2sc4q0g3tuou.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F0cig9j9a2sc4q0g3tuou.JPG" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;How is feature importance calculated?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The accuracy score is calculated with all the columns.&lt;/li&gt;
&lt;li&gt;One column is chosen and its values are randomly shuffled.&lt;/li&gt;
&lt;li&gt;The accuracy score is calculated once again.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This process is repeated for each column, and it is then possible to figure out which columns impact the accuracy score the most.&lt;/p&gt;
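&lt;p&gt;Here is a rough sketch of that shuffling procedure (this is not the exact scikit-learn implementation; &lt;code&gt;m&lt;/code&gt;, &lt;code&gt;X&lt;/code&gt; and &lt;code&gt;y&lt;/code&gt; are assumed to be a fitted model, a feature dataframe and its target):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np
from sklearn.metrics import r2_score

def permutation_importance(m, X, y):
    # Score with all columns intact
    base = r2_score(y, m.predict(X))
    imps = {}
    for col in X.columns:
        X_shuffled = X.copy()
        # Shuffle one column and measure how much the score drops
        X_shuffled[col] = np.random.permutation(X_shuffled[col].values)
        imps[col] = base - r2_score(y, m.predict(X_shuffled))
    return imps
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;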

&lt;p&gt;Features with a small importance can be removed from the model. Removing redundant columns lowers the risk of collinearity (two columns that are related to each other) and makes the model run faster, as shown in the sketch below.&lt;/p&gt;
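&lt;p&gt;For example, using the &lt;code&gt;fi&lt;/code&gt; dataframe computed above, we could keep only the columns above some threshold (0.005 here is an arbitrary cut-off) and retrain:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Keep only the features whose importance is above the chosen threshold
to_keep = fi[fi.imp &gt; 0.005].cols
df_keep = x[to_keep].copy()

m = RandomForestRegressor(n_jobs=-1, oob_score=True)
m.fit(df_keep, y)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;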

&lt;h3&gt;
  
  
  Confidence intervals based on standard deviation
&lt;/h3&gt;

&lt;p&gt;After running a model, we get predictions. The question is then "how confident are we in that prediction?" The standard deviation of the predictions across the trees gives a relative sense of how confident we should be in this prediction.&lt;/p&gt;

&lt;p&gt;The fastai library provides a handy function called &lt;code&gt;parallel_trees&lt;/code&gt;. It takes a random forest model &lt;code&gt;m&lt;/code&gt; and a function, calls that function on every tree in parallel, and returns a list with the result for each tree.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_preds&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_valid&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt; &lt;span class="n"&gt;preds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stack&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;parallel_trees&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;get_preds&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;raw_valid&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;copy&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;pred_std&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;std&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;preds&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;pred&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;preds&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can explore the feature &lt;code&gt;ProductSize&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;First, we can count the number of lines for each &lt;code&gt;ProductSize&lt;/code&gt; category.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ProductSize&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;value_counts&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;barh&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;#ffb500&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;xlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;value_counts&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ylabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ProductSize&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fcg0agxwvhqy17vhogsdy.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fcg0agxwvhqy17vhogsdy.JPG" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can create a new dataframe that contains the mean of predictions and the mean of the standard deviation of predictions for each &lt;code&gt;ProductSize&lt;/code&gt; category.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;flds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ProductSize&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;SalePrice&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;pred&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;pred_std&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;summ&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;flds&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;groupby&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;flds&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;summ&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fitfqmur0ifld2y16qwtt.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fitfqmur0ifld2y16qwtt.JPG" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can take the ratio of the standard deviation to the mean prediction in order to compare which category has a higher relative deviation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;summ&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pred_std&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;summ&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pred&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;sort_values&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ascending&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F4wpmo9phkafe0adfpwdx.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F4wpmo9phkafe0adfpwdx.JPG" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this case, the higher standard deviation for the small, compact, large and mini &lt;code&gt;ProductSize&lt;/code&gt; categories can be explained by the fact that we have fewer rows for these categories.&lt;/p&gt;

&lt;p&gt;We can say that we are more confident about the predictions for the medium and large/medium &lt;code&gt;ProductSize&lt;/code&gt; and less confident about the small, compact, large and mini &lt;code&gt;ProductSize&lt;/code&gt;.&lt;/p&gt;




&lt;p&gt;That is it for the third lesson! Happy machine learning!&lt;/p&gt;

&lt;p&gt;&lt;em&gt;You can follow me on Instagram &lt;a href="https://www.instagram.com/oyane806/" rel="noopener noreferrer"&gt;@oyane806&lt;/a&gt; !&lt;/em&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>machinelearning</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Do you want to learn more about random forests? - fastai lesson2</title>
      <dc:creator>Océane</dc:creator>
      <pubDate>Tue, 17 Sep 2019 07:04:05 +0000</pubDate>
      <link>https://dev.to/oyane806/do-you-want-to-learn-more-about-random-forests-fastai-lesson2-5bph</link>
      <guid>https://dev.to/oyane806/do-you-want-to-learn-more-about-random-forests-fastai-lesson2-5bph</guid>
      <description>&lt;p&gt;Hi there,&lt;/p&gt;

&lt;p&gt;I am currently studying machine learning with &lt;a href="http://course18.fast.ai/ml.html" rel="noopener noreferrer"&gt;fast.ai&lt;/a&gt;. This is the second lesson!&lt;/p&gt;

&lt;p&gt;I organized my notes in 4 categories: RMSE &amp;amp; R², decision tree, bag of little bootstraps, hyperparameter tuning. The best way to learn is to be able to re-explain all of this and to apply this new knowledge to Kaggle competitions.&lt;/p&gt;

&lt;h1&gt;
  
  
  1. RMSE &amp;amp; R²
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;RMSE&lt;/strong&gt; = standard deviation of residuals.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fjpwmikxw7006qn6o6xkw.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fjpwmikxw7006qn6o6xkw.gif" alt="RMSE"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A residual is the difference between the actual value and the predicted value. It is shown in yellow in the figure below. Each residual is squared, which makes the impact of outliers stronger and prevents negative and positive residuals from cancelling each other out.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;R²&lt;/strong&gt; measures how well the model explains the data compared to the simple mean.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Ft20zo8sffe2z3ix18yio.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Ft20zo8sffe2z3ix18yio.gif" alt="r2"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Faur32j4yysevupikhbby.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Faur32j4yysevupikhbby.JPG" alt="graph"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;SSres is the sum of squares of residuals; they are represented in yellow in the figure above. SStot is the total sum of squares, proportional to the variance of the data; it is represented by the sum of the yellow and purple parts in the figure above.&lt;/p&gt;

&lt;p&gt;When SSres is as big as SStot (R² close to 0), it means that the model is no better than the simple mean at making predictions.&lt;/p&gt;

&lt;p&gt;R² is the ratio of how good your model is versus how good the naïve mean model is.&lt;/p&gt;
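&lt;p&gt;A minimal numpy sketch of both metrics, following the definitions above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np

def rmse(y_true, y_pred):
    # Root mean squared error: the standard deviation of the residuals
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def r2(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)           # sum of squares of residuals
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)  # total sum of squares
    return 1 - ss_res / ss_tot
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;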

&lt;h1&gt;
  
  
  2. A simple decision tree
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Feapo4irs9rn5rsrn6fsf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Feapo4irs9rn5rsrn6fsf.png" alt="decision-tree"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A tree consists of a sequence of binary decisions.&lt;/p&gt;

&lt;p&gt;For each variable (Coupler_System, Enclosure, etc.) and for each possible split value of that variable (&amp;lt;0.1, &amp;lt;0.2, etc.), the algorithm calculates the weighted average of the scores of the two new nodes (here, for the first split, 16815 x 0.414 + 3185 x 0.109). It keeps the variable and value with the best score.&lt;/p&gt;
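&lt;p&gt;Here is a toy sketch of that search for a single column (&lt;code&gt;y&lt;/code&gt; is assumed to be a numpy array of targets; real implementations are far more efficient):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np

def best_split(x_col, y):
    # Try every candidate value and keep the split with the lowest
    # size-weighted score (here the standard deviation of the target).
    best_value, best_score = None, float('inf')
    for value in np.unique(x_col):
        lhs, rhs = y[x_col &lt;= value], y[x_col &gt; value]
        if len(lhs) == 0 or len(rhs) == 0:
            continue
        score = len(lhs) * lhs.std() + len(rhs) * rhs.std()
        if score &lt; best_score:
            best_value, best_score = value, score
    return best_value, best_score
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;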

&lt;h1&gt;
  
  
  3. Bag of little bootstraps
&lt;/h1&gt;

&lt;p&gt;To improve on a simple decision tree, we can build a forest, which uses a statistical technique called bagging. The key is to construct multiple models that are each better than nothing, trained on different subsets of data, and whose errors are, as much as possible, not correlated with each other. &lt;/p&gt;

&lt;p&gt;When the average of these models is taken, the accuracy is better than the accuracy of a simple decision tree.&lt;/p&gt;

&lt;p&gt;A &lt;strong&gt;bootstrap&lt;/strong&gt; is a random subset of the original data, typically drawn with replacement, so some samples may occur several times in each split. The idea is that a bootstrap contains only a part of the whole set of observations. Bootstrapping is used to train a different classifier each time on a different set of observations.&lt;/p&gt;

&lt;p&gt;The part not used forms the &lt;strong&gt;out-of-bag&lt;/strong&gt; sample and can be used to assess the error rate of the classifier. Setting &lt;code&gt;oob_score=True&lt;/code&gt; will create an attribute called &lt;code&gt;oob_score_&lt;/code&gt; on the fitted model. It is very useful for hyperparameter tuning, and in that case a separate validation set is not strictly needed.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fme7sj35oq0uyfht4400j.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fme7sj35oq0uyfht4400j.JPG" alt="bootstrap"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In scikit-learn, there is another class called &lt;code&gt;ExtraTreesClassifier&lt;/code&gt;, which is an extremely randomized trees model. Rather than trying every split of every variable, it randomly tries a few splits of a few variables, which makes training much faster, so more trees can be built and the ensemble generalizes better.&lt;/p&gt;

&lt;p&gt;👉 Tips: start with 20 or 30 trees and later try with 1000 trees.&lt;/p&gt;

&lt;p&gt;In the following image, we can see that the prediction accuracy is improving with the number of trees in the random forest.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fov67uziehl4nyrkw3686.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fov67uziehl4nyrkw3686.JPG" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  4. Hyperparameters tuning
&lt;/h1&gt;

&lt;p&gt;&lt;code&gt;n_estimators=40&lt;/code&gt; represents the number of trees averaged.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;max_depth=3&lt;/code&gt; means that the decision trees will only use 3 levels of decision.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;min_samples_leaf=3&lt;/code&gt; stops splitting a node further when a leaf would have 3 or fewer samples (before, we were going all the way down to 1). Each tree will generalize better, but will be slightly less powerful on its own.&lt;/p&gt;

&lt;p&gt;👉 Tips: the numbers that work well are 1, 3, 5, 10, 25, but it is relative to the overall dataset size.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;max_features=0.5&lt;/code&gt;: the idea is that the less correlated the trees are with each other, the better. For row sampling, each new tree is built from a random set of rows. For column sampling, every individual binary split chooses from a different subset of columns. In this case, only half of the columns will be considered at each split.&lt;/p&gt;

&lt;p&gt;👉 Tips: good values to use are 1, 0.5, log2 or sqrt.&lt;/p&gt;
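&lt;p&gt;Putting these hyperparameters together, a starting point might look like this (the values are only examples to tune; &lt;code&gt;X_train&lt;/code&gt;, &lt;code&gt;y_train&lt;/code&gt;, &lt;code&gt;X_valid&lt;/code&gt; and &lt;code&gt;y_valid&lt;/code&gt; come from an earlier split):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from sklearn.ensemble import RandomForestRegressor

# Example hyperparameter choices, to be tuned on the validation set
m = RandomForestRegressor(n_estimators=40, min_samples_leaf=3,
                          max_features=0.5, n_jobs=-1)
m.fit(X_train, y_train)
print(m.score(X_valid, y_valid))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;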

&lt;p&gt;Scikit-learn has a grid search feature (&lt;code&gt;GridSearchCV&lt;/code&gt;) that takes all the hyperparameters you want to tune and all of the values you want to try for each. It runs the model on every possible combination of these hyperparameters and tells you which one is the best.&lt;/p&gt;
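&lt;p&gt;A small sketch with scikit-learn's &lt;code&gt;GridSearchCV&lt;/code&gt; (the parameter values are only examples):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Every combination of the listed values is tried with cross-validation
param_grid = {
    'n_estimators': [20, 40, 80],
    'min_samples_leaf': [1, 3, 5],
    'max_features': [0.5, 'sqrt'],
}
gs = GridSearchCV(RandomForestRegressor(n_jobs=-1), param_grid, cv=3)
gs.fit(X_train, y_train)
print(gs.best_params_)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;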

&lt;p&gt;&lt;strong&gt;training set/validation set/test set&lt;/strong&gt;: the training set is used to train the model, the validation set is used to tune hyperparameters, and the test set is used at the very end to check whether the tuned model performs well on unseen data.&lt;/p&gt;

&lt;p&gt;There are different ways to create a training set and a validation set:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;bagging&lt;/li&gt;
&lt;li&gt;k-fold cross-validation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Frm10okii013boihnxzyl.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Frm10okii013boihnxzyl.JPG" alt="dataset"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When working on a new data project, we want to iterate quickly. If running the model takes more than 10 seconds, it is good practice to create a subset of the data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;split_vals&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df_trn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;20000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;split_vals&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_trn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;20000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;p&gt;&lt;em&gt;You can follow me on Instagram &lt;a href="https://www.instagram.com/oyane806/" rel="noopener noreferrer"&gt;@oyane806&lt;/a&gt; !&lt;/em&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>machinelearning</category>
      <category>datascience</category>
    </item>
    <item>
      <title>What is your routine in a data science project? - fastai lesson1</title>
      <dc:creator>Océane</dc:creator>
      <pubDate>Thu, 12 Sep 2019 11:35:59 +0000</pubDate>
      <link>https://dev.to/oyane806/what-is-your-routine-in-a-data-science-project-fastai-lesson1-10dh</link>
      <guid>https://dev.to/oyane806/what-is-your-routine-in-a-data-science-project-fastai-lesson1-10dh</guid>
      <description>&lt;p&gt;Hi there,&lt;/p&gt;

&lt;p&gt;I am currently studying machine learning with &lt;a href="http://course18.fast.ai/ml.html" rel="noopener noreferrer"&gt;fast.ai&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I organized my notes for lesson1 below in 3 categories: a process summary, code used and concepts seen in the video. The best way to learn is to be able to reexplain all of this and to apply this new knowledge to Kaggle competitions.&lt;/p&gt;

&lt;h1&gt;
  
  
  1. Process summary
&lt;/h1&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;set up jupyter notebook and the environment&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;download data from kaggle&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;convert all data into numbers or booleans&lt;/strong&gt;: new features are extracted from dates (year, month) and string categorical data are mapped to numbers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;take care of missing data&lt;/strong&gt;: missing continuous values are replaced with the median and a new boolean feature column suffixed with _na is created; missing categorical values are handled by pandas and automatically set to -1&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;separate the training and validation sets&lt;/strong&gt;: the model is trained on the training set and then evaluated on the validation set to check whether it generalizes well&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;train the model&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;print accuracy scores&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h1&gt;
  
  
  2. Code
&lt;/h1&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;load_ext&lt;/span&gt; &lt;span class="n"&gt;autoreload&lt;/span&gt;
&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;autoreload&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;
&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;matplotlib&lt;/span&gt; &lt;span class="n"&gt;inline&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fastai.imports&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fastai.structured&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pandas_summary&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DataFrameSummary&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.ensemble&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RandomForestRegressor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;RandomForestClassifier&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;IPython.display&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;display&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;metrics&lt;/span&gt;

&lt;span class="n"&gt;PATH&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data/bulldozers/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="n"&gt;df_raw&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;PATH&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;Train.csv&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;low_memory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;parse_dates&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;saledate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="nf"&gt;display_all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df_raw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tail&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;df_raw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SalePrice&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df_raw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SalePrice&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;add_datepart&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df_raw&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;saledate&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df_raw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;saleYear&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="nf"&gt;train_cats&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df_raw&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df_raw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;UsageBand&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_categories&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;High&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Medium&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Low&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;ordered&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;inplace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df_raw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;UsageBand&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df_raw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;UsageBand&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;codes&lt;/span&gt;

&lt;span class="nf"&gt;display_all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df_raw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;isnull&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;sort_index&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df_raw&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;makedirs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;tmp&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;exist_ok&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df_raw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_feather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;tmp/bulldozers-raw&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df_raw&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_feather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;tmp/bulldozers-raw&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;nas&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;proc_df&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df_raw&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;SalePrice&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;RandomForestRegressor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_jobs&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;n_valid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;12000&lt;/span&gt;  &lt;span class="c1"&gt;# same as Kaggle's test set size
&lt;/span&gt;&lt;span class="n"&gt;n_trn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;n_valid&lt;/span&gt;
&lt;span class="n"&gt;raw_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;raw_valid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;split_vals&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df_raw&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_trn&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_valid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;split_vals&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_trn&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_valid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;split_vals&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_trn&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_valid&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There are a few helper functions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;display_all() displays a dataframe without pandas truncating it; combined with .T, the feature names appear as rows instead of columns&lt;/li&gt;
&lt;li&gt;split_vals() splits the dataset into a training set and a validation set&lt;/li&gt;
&lt;li&gt;print_score() prints the RMSE and R² scores on the training and validation sets (and the OOB score when available)
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;display_all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option_context&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;display.max_rows&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;display.max_columns&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; 
        &lt;span class="nf"&gt;display&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;split_vals&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;copy&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;:].&lt;/span&gt;&lt;span class="nf"&gt;copy&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;rmse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;(((&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;print_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;res&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;rmse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="nf"&gt;rmse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_valid&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;y_valid&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_valid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_valid&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;hasattr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;oob_score_&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="n"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;oob_score_&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;res&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The following functions are included in fast.ai library:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;add_datepart() generates new numerical features from dates (year, month, etc.) and removes the original date column&lt;/li&gt;
&lt;li&gt;train_cats() maps strings to integer categories (ex: red: 1, blue: 2, etc.)&lt;/li&gt;
&lt;li&gt;proc_df() replaces categories with their numeric codes, handles missing continuous values (replaces them with the median and creates a new boolean _na feature column) and splits the dependent variable into a separate variable&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  3. Concepts
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;structured data/unstructured data&lt;/strong&gt;: structured data are tabular data, an example of unstructured data is images&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;curse of dimensionality&lt;/strong&gt;: idea that the more dimensions you have, the more all of the points sit on the edge of that space&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;no free lunch theorem&lt;/strong&gt;: in theory, there is no single type of model that works well for every possible random data set&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;regression/classification&lt;/strong&gt;: regression is continuous variable prediction (ex: price prediction) and classification is true/false categorization or identification of multiple categories (ex: categorization of fruits)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;overfitting&lt;/strong&gt;: when a model is too specific to a dataset, it will not generalize well to a new dataset; a validation set helps diagnose this problem&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fd36w7uxb2vrwoi7vwmgh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fd36w7uxb2vrwoi7vwmgh.png" alt="overfitting"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;That is it for the first lesson!&lt;/p&gt;

&lt;p&gt;Don’t forget to recall. Are you able to explain the basic process to begin a data science notebook? How do you handle missing values? What is regression and classification? What is over-fitting?&lt;/p&gt;

&lt;p&gt;And practice with Kaggle to make this new knowledge a second nature!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: I think that the easiest way to begin a data science notebook is to use Google Colab.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="err"&gt;!&lt;/span&gt;&lt;span class="n"&gt;pip&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="n"&gt;fastai&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="mf"&gt;0.7&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.colab&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;files&lt;/span&gt;
&lt;span class="n"&gt;uploaded&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;files&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;upload&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;io&lt;/span&gt;
&lt;span class="n"&gt;df_raw&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;io&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;BytesIO&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;uploaded&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;train.csv&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;You can follow me on Instagram &lt;a href="https://www.instagram.com/oyane806/" rel="noopener noreferrer"&gt;@oyane806&lt;/a&gt; !&lt;/em&gt;&lt;/p&gt;

</description>
      <category>beginners</category>
      <category>python</category>
      <category>machinelearning</category>
      <category>datascience</category>
    </item>
  </channel>
</rss>
