How to deal with a big dataset? - fastai lesson3

Hi there,

This is lesson 3 from fast.ai! This part covers how to deal with a big dataset, how to construct a good validation set, and how to interpret random forest models.

1. Working with a large dataset

Before loading data into a dataframe, it is good practice to check the size of the dataset. The fastest way to do this is from the command line, which can be done in a Jupyter notebook by adding ! in front of the commands.

!ls -lh                                  # file sizes in the current directory
!wc -l train.csv                         # number of lines in the training file
!head -5 train.csv > tiny_subset.csv     # first 5 lines (keeps the header row)
!shuf -n 5 -o tiny_subset.csv train.csv  # or a random sample of 5 lines (the header may not be included)

These commands give the size of the CSV file and its number of lines, and create a small subset containing 5 lines of the training set.

This tiny dataset can be used to figure out which datatypes pandas will use before reading in the whole table. It also helps to find the date columns, which is useful information so that pandas can parse those columns as dates.
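
As a quick sketch of that step (assuming the tiny_subset.csv file created above):

import pandas as pd

# Read the 5-line subset: fast, and enough to see the column names and inferred dtypes
df_tiny = pd.read_csv('tiny_subset.csv')
print(df_tiny.dtypes)  # date-like columns usually show up here as plain 'object'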

To create a dictionary from the pandas datatypes, we can use the following line of code:

types = df_tiny.dtypes.apply(lambda x: x.name).to_dict()

The resulting dictionary can then be edited by hand to use smaller datatypes where possible, for example:

types = {
    'id': 'int64',
    'item_nbr': 'int32',
    'store_nbr': 'int8',
    'unit_sales': 'float32',
    'onpromotion': 'object'
}

After this initial exploration, the reading of the whole file can be optimized: by passing the dictionary of datatypes to pandas, reading the file can be about 5 times faster.

df_all = pd.read_csv('train.csv', parse_dates=['date'], dtype=types, infer_datetime_format=True)  # 'date' is the name of the date column here

It is convenient to keep the model analysis fast and light at first. To do that, we can call set_rf_samples(50000) before training, so that each tree is trained on a random sample of 50,000 rows instead of the whole dataset.
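
As a minimal training sketch (assuming the old fastai 0.7 library, where set_rf_samples lives in fastai.structured, and that x and y are the prepared feature dataframe and target; the hyperparameters are only illustrative):

from fastai.structured import set_rf_samples   # assumed location of the helper in fastai 0.7
from sklearn.ensemble import RandomForestRegressor

set_rf_samples(50000)   # each tree is trained on a random sample of 50,000 rows
m = RandomForestRegressor(n_estimators=40, min_samples_leaf=3, n_jobs=-1)
m.fit(x, y)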

If a line of code takes quite a long time, it can be interesting to use a profiler with %prun m.fit(x, y). It reports which function calls take most of the time.

2. Building a robust validation set

Without a good validation set, it is hard to create a good model, because the validation set is what tells us whether the model is performing well or not. A good practice is to calibrate the validation set against the test set.

[Image: validation score vs. test score for four models]

In the above image, four different models are scored on both the test set and the validation set. If the validation set is good, the test scores and the validation scores should lie roughly on a straight line.
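
As a hypothetical sketch of this check (the score values below are made up purely for illustration):

import matplotlib.pyplot as plt

# hypothetical validation and test scores for four models
val_scores = [0.33, 0.36, 0.40, 0.45]
test_scores = [0.34, 0.38, 0.41, 0.47]

plt.scatter(val_scores, test_scores, color='#ffb500')
plt.xlabel('validation score')
plt.ylabel('test score')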

3. Interpreting machine learning models

Feature importance

Feature importance helps to understand which variables contribute the most to the model. The following code computes the feature importances and plots the 30 most important features of the current model.

import pandas as pd
import matplotlib.pyplot as plt

def rf_feat_importance(m, df):
    # pair each column name with the importance the forest assigned to it
    return pd.DataFrame({'cols': df.columns, 'imp': m.feature_importances_}
                        ).sort_values('imp', ascending=False)

fi = rf_feat_importance(m, x)

def plot_fi(fi):
    return fi.plot(x='cols', y='imp', kind='barh', figsize=(12,7), legend=False,
                   color='#ffb500')

plot_fi(fi[:30]);
plt.xlabel('feature importance')

[Image: bar plot of the feature importances of the top 30 features]

How is feature importance calculated?

  • The accuracy score is calculated with all the columns.
  • One column is chosen and its values are randomly shuffled.
  • The accuracy score is calculated once again.

This process is repeated for each column, and it is then possible to figure out which columns impact the accuracy score the most; a minimal sketch of this idea follows.
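
The shuffle idea can be sketched in a few lines. Note this is only a conceptual illustration: the feature_importances_ attribute used above is computed by scikit-learn from impurity decreases inside the trees, and the helper below is hypothetical, not part of fastai or scikit-learn.

import numpy as np
import pandas as pd

def permutation_importance_sketch(m, X, y):
    baseline = m.score(X, y)                              # score with all columns intact
    drops = {}
    for col in X.columns:
        X_shuffled = X.copy()
        X_shuffled[col] = np.random.permutation(X_shuffled[col].values)
        drops[col] = baseline - m.score(X_shuffled, y)    # how much the score drops
    return pd.Series(drops).sort_values(ascending=False)  # biggest drop = most important column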

Features with a small importance can be removed from the model. Removing redundant columns lowers the risk of collinearity (two columns that carry closely related information) and makes the models run faster.
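
A possible sketch of that pruning step, reusing the fi dataframe from above (the 0.005 threshold is an arbitrary example, not a fixed rule):

from sklearn.ensemble import RandomForestRegressor

to_keep = fi[fi.imp > 0.005].cols   # keep only the columns above the importance threshold
x_keep = x[to_keep].copy()
m_keep = RandomForestRegressor(n_estimators=40, min_samples_leaf=3, n_jobs=-1)
m_keep.fit(x_keep, y)               # retrain on the reduced feature set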

Confidence intervals based on standard deviation

After running a model, we get predictions. The question is then: "how confident are we in that prediction?" The standard deviation of the predictions across the trees gives a relative sense of how confident we should be in a given prediction.

The fastai library provides a handy function called parallel_trees. It takes a random forest model m and a function to call on every tree in parallel, and it returns a list of the results of applying that function to every tree.

import numpy as np

def get_preds(t): return t.predict(X_valid)           # predictions of a single tree
%time preds = np.stack(parallel_trees(m, get_preds))  # shape: (n_trees, n_valid_rows)

x = raw_valid.copy()
x['pred_std'] = np.std(preds, axis=0)                 # spread of the trees' predictions
x['pred'] = np.mean(preds, axis=0)                    # mean prediction of the forest
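
If the fastai helper is not available, an equivalent (if slower, sequential) way to get the per-tree predictions is to use scikit-learn's estimators_ attribute directly:

import numpy as np

preds = np.stack([t.predict(X_valid) for t in m.estimators_])  # shape: (n_trees, n_valid_rows)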

We can explore the feature ProductSize.

First, we can count the number of lines for each ProductSize category.

x.ProductSize.value_counts().plot.barh(color='#ffb500');
plt.xlabel('value_counts')
plt.ylabel('ProductSize')

[Image: bar plot of the value counts per ProductSize category]

We can create a new dataframe that contains, for each ProductSize category, the mean sale price, the mean prediction and the mean of the prediction standard deviation.

flds = ['ProductSize', 'SalePrice', 'pred', 'pred_std']
summ = x[flds].groupby(flds[0]).mean()
summ

[Image: table of the mean SalePrice, pred and pred_std per ProductSize category]

We can take the ratio of the prediction standard deviation to the prediction itself in order to compare which category has a relatively higher deviation.

(summ.pred_std/summ.pred).sort_values(ascending=False)


[Image: pred_std/pred ratio per ProductSize category, sorted in descending order]

In this case, the higher standard deviation for the small, compact, large and mini ProductSize categories can be explained by the fact that there are fewer rows for these categories.

We can therefore be more confident in the predictions for the medium and large/medium categories, and less confident in the predictions for the small, compact, large and mini categories.


That is it for the third lesson! Happy machine learning!

You can follow me on Instagram @oyane806 !
