Océane
How to deal with a big dataset? - fastai lesson3

Hi there,

This is lesson 3 from fast.ai! This part covers how to deal with a big dataset, how to construct a good validation set, and how to interpret random forest models.

1. Working with a large dataset

Before loading data into a dataframe, it is good practice to check the size of the dataset. The fastest way is to check it from the command line, which can be done in a Jupyter notebook by adding ! in front of the lines of code.

!ls -lh
!wc -l train.csv
!head -5 train.csv > tiny_subset.csv
!shuf -n 5 -o tiny_subset.csv train.csv

These lines give the size of the csv file and the number of lines it contains, and create a tiny subset of the training set: head takes the first 5 lines (including the header), while shuf samples 5 random lines.

This tiny dataset can be used to figure out which datatypes pandas uses before reading in the whole table. It also helps to find the date column, which is useful information so that pandas can parse that column as a date.
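As a minimal sketch (assuming the tiny_subset.csv created above), the inferred datatypes can be inspected like this:

import pandas as pd

# Read only the tiny subset; pandas infers a dtype for each column.
df_tiny = pd.read_csv('tiny_subset.csv')
df_tiny.dtypes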

To create a dictionary from the pandas datatypes, we can use the following line of code:

types = df_tiny.dtypes.apply(lambda x: x.name).to_dict()

types = {
    'id': 'int64',
    'item_nbr': 'int32',
    'store_nbr': 'int8',
    'unit_sales': 'float32',
    'onpromotion': 'object'
}

After this initial exploration, the reading of the whole file can be optimized. By passing a dictionary of the datatypes, reading the file can be around 5 times faster.

df_all = pd.read_csv('train.csv', parse_dates=['date'], dtype=types, infer_datetime_format=True)

It is convenient to keep the model analysis fast and light at first. To train the model on a sample rather than the full dataset, we can use set_rf_samples(50000), which makes each tree in the forest train on a random subset of 50,000 rows.
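As a rough sketch (assuming the old fastai 0.7 library used in this course, where the import path may differ depending on the version, and x, y already prepared as in the previous lessons):

from fastai.structured import set_rf_samples
from sklearn.ensemble import RandomForestRegressor

# Each tree is grown on a random sample of 50,000 rows instead of the full set.
set_rf_samples(50000)
m = RandomForestRegressor(n_estimators=40, min_samples_leaf=3, n_jobs=-1)
m.fit(x, y)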

If a line of code takes a long time, it can be useful to run it under a profiler with %prun m.fit(x, y). It reports which lines of code took most of the time.

2. Building a robust validation set

Without a good validation set, it is hard to build a good model, because the validation set is what tells us whether the model is performing well or not. A good practice is to calibrate the validation set against the test set.

[Image: validation score vs. test score for four models]

In the above image, four different models are scored on both the test set and the validation set. If the validation set is good, the relationship between the test score and the validation score should lie roughly on a straight line.
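As a minimal sketch of this check (plot_val_vs_test is a hypothetical helper; the scores are whatever your models and the test set report):

import matplotlib.pyplot as plt

def plot_val_vs_test(val_scores, test_scores):
    # One point per submitted model; points lying roughly on a straight
    # line mean the validation set is a good proxy for the test set.
    plt.scatter(val_scores, test_scores)
    plt.xlabel('validation score')
    plt.ylabel('test score')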

3. Interpreting machine learning models

Feature importance

Feature importance helps to understand which variables contribute the most to the model. The following code plots the 30 most important features for the current model.

def rf_feat_importance(m, df):
  return pd.DataFrame({'cols':df.columns, 'imp':m.feature_importances_}
                       ).sort_values('imp', ascending=False)

fi = rf_feat_importance(m, x)

def plot_fi(fi): 
  return fi.plot('cols', 'imp', 'barh', figsize=(12,7), legend=False,
                 color='#ffb500')

plot_fi(fi[:30]);
plt.xlabel('feature importance')

[Image: horizontal bar chart of the top 30 feature importances]

How is feature importance calculated?

  • The accuracy score is calculated with all the columns.
  • One column is chosen and its values are randomly shuffled.
  • The accuracy score is calculated once again.

This process is repeated for each column, and it is then possible to figure out which columns impact the accuracy score the most.
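A minimal sketch of the shuffle-one-column procedure described above (purely illustrative; m is assumed to be a fitted model with a score method, X a dataframe of features and y the target):

import numpy as np
import pandas as pd

def permutation_importance(m, X, y):
    # Baseline score with all columns intact.
    baseline = m.score(X, y)
    imps = {}
    for col in X.columns:
        X_shuffled = X.copy()
        # Shuffle one column and measure how much the score drops.
        X_shuffled[col] = np.random.permutation(X_shuffled[col].values)
        imps[col] = baseline - m.score(X_shuffled, y)
    return pd.Series(imps).sort_values(ascending=False)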

Features with low importance can be removed from the model. Removing redundant columns lowers the possibility of collinearity (two columns that may be related to each other) and makes the model run faster.
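For example, one way to keep only the more important features (the 0.005 threshold is an arbitrary cutoff to adjust for your own data, and x_keep is a hypothetical name):

# Keep only features whose importance is above the threshold,
# then retrain the model on the reduced set of columns.
to_keep = fi[fi.imp > 0.005].cols
x_keep = x[to_keep].copy()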

Confidence intervals based on standard deviation

After running a model, we get predictions. The question is then: how confident are we in those predictions? The standard deviation of the predictions across the trees gives a relative sense of how confident we should be in each prediction.

The fastai library provides a handy function called parallel_trees. It takes a random forest model m and a function, calls that function on every tree in parallel, and returns a list of the results.

def get_preds(t): return t.predict(X_valid)
%time preds = np.stack(parallel_trees(m, get_preds))

x = raw_valid.copy()
x['pred_std'] = np.std(preds, axis=0)
x['pred'] = np.mean(preds, axis=0)

We can explore the feature ProductSize.

First, we can count the number of rows for each ProductSize category.

x.ProductSize.value_counts().plot.barh(color='#ffb500');
plt.xlabel('value_counts')
plt.ylabel('ProductSize')

[Image: bar chart of row counts per ProductSize category]

We can create a new dataframe that contains, for each ProductSize category, the mean sale price, the mean of the predictions and the mean of the standard deviation of the predictions.

flds = ['ProductSize', 'SalePrice', 'pred', 'pred_std']
summ = x[flds].groupby(flds[0]).mean()
summ

[Image: table of mean SalePrice, pred and pred_std per ProductSize category]

We can take the ratio of the standard deviation of the predictions to the predictions themselves in order to compare which category has a higher relative deviation.

(summ.pred_std/summ.pred).sort_values(ascending=False)


[Image: pred_std/pred ratio per ProductSize category]

In this case, we can explain the higher standard deviation for the small, compact, large and mini ProductSize categories by the fact that we have fewer rows for these categories.

We can say that we are more confident about the predictions for the medium and large/medium ProductSize categories and less confident about the small, compact, large and mini categories.


That is it for the third lesson! Happy machine learning!

You can follow me on Instagram @oyane806 !
