
What is your routine in a data science project? - fastai lesson1

Hi there,

I am currently studying machine learning with fastai.

I organized my notes for lesson 1 below in three categories: a process summary, the code used, and the concepts covered in the video. The best way to learn is to be able to re-explain all of this and to apply this new knowledge to Kaggle competitions.

1. Process summary

  1. set up Jupyter Notebook and the environment
  2. download the data from Kaggle
  3. convert all data into numbers or booleans: new features are extracted from dates (year, month) and categorical string data is mapped to numbers
  4. take care of missing data: missing continuous values are replaced with the median and a new boolean _na feature column is created; missing categorical values are handled by pandas and automatically set to -1
  5. separate the training and validation sets: the model is trained on the training set, then evaluated on the validation set to check that it generalizes
  6. train the model
  7. print accuracy scores
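
The steps above can be sketched end to end on a toy DataFrame. This is only an illustration: saledate, UsageBand, and SalePrice are the real Blue Book for Bulldozers column names, but the values, MachineHours, and the tiny split sizes are made up.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Toy data standing in for the Kaggle download (steps 1-2)
df = pd.DataFrame({
    'saledate': pd.to_datetime(['2011-01-15', '2011-06-20', '2012-02-01', '2012-08-09']),
    'UsageBand': ['High', 'Low', None, 'Medium'],
    'MachineHours': [120.0, np.nan, 300.0, 80.0],
    'SalePrice': [9.2, 9.5, 10.1, 9.8],
})

# Step 3: numbers/booleans only - date parts, then category codes
df['saleYear'] = df.saledate.dt.year
df['saleMonth'] = df.saledate.dt.month
df = df.drop(columns='saledate')
df['UsageBand'] = df.UsageBand.astype('category').cat.codes  # missing -> -1

# Step 4: median-fill continuous NaNs and flag them in a _na column
df['MachineHours_na'] = df.MachineHours.isna()
df['MachineHours'] = df.MachineHours.fillna(df.MachineHours.median())

# Step 5: hold out the last row as a (tiny) validation set
y = df.pop('SalePrice')
X_train, X_valid = df[:3], df[3:]
y_train, y_valid = y[:3], y[3:]

# Steps 6-7: train and score
m = RandomForestRegressor(n_jobs=-1, random_state=0)
m.fit(X_train, y_train)
print(m.score(X_train, y_train))
```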

2. Code

%load_ext autoreload
%autoreload 2
%matplotlib inline

from fastai.imports import *
from fastai.structured import *
from pandas_summary import DataFrameSummary
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from IPython.display import display
from sklearn import metrics

PATH = "data/bulldozers/"

df_raw = pd.read_csv(f'{PATH}Train.csv', low_memory=False, parse_dates=["saledate"])


df_raw.SalePrice = np.log(df_raw.SalePrice)

add_datepart(df_raw, 'saledate')

train_cats(df_raw)
df_raw.UsageBand.cat.set_categories(['High', 'Medium', 'Low'], ordered=True, inplace=True)


os.makedirs('tmp', exist_ok=True)
df_raw.to_feather('tmp/bulldozers-raw')
df_raw = pd.read_feather('tmp/bulldozers-raw')

df, y, nas = proc_df(df_raw, 'SalePrice')

m = RandomForestRegressor(n_jobs=-1)
m.fit(df, y)

n_valid = 12000  # same as Kaggle's test set size
n_trn = len(df)-n_valid
raw_train, raw_valid = split_vals(df_raw, n_trn)
X_train, X_valid = split_vals(df, n_trn)
y_train, y_valid = split_vals(y, n_trn)

X_train.shape, y_train.shape, X_valid.shape
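
As a side note, the preprocessing that add_datepart() and train_cats() do above can be approximated in plain pandas. The helper name add_datepart_simple and the columns below are my own illustrative choices, not fastai's:

```python
import pandas as pd

def add_datepart_simple(df, col):
    # Rough pandas-only sketch of fastai's add_datepart:
    # extract numeric parts from a date column, then drop it
    d = df[col]
    for part in ['year', 'month', 'day', 'dayofweek']:
        df[f'{col}_{part}'] = getattr(d.dt, part)
    df.drop(columns=[col], inplace=True)

df = pd.DataFrame({'saledate': pd.to_datetime(['2011-11-16', '2012-03-26']),
                   'UsageBand': ['High', 'Low']})
add_datepart_simple(df, 'saledate')

# train_cats-style: turn strings into ordered pandas categories
df['UsageBand'] = (df['UsageBand'].astype('category')
                   .cat.set_categories(['High', 'Medium', 'Low'], ordered=True))
print(df['UsageBand'].cat.codes.tolist())  # High -> 0, Low -> 2
```

Ordering the categories explicitly matters for UsageBand because High/Medium/Low have a natural rank that alphabetical order would scramble.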

There are a few helper functions:

  • display_all() displays the full DataFrame without pandas' usual truncation, so all the feature names can be listed as rows instead of columns
  • split_vals() splits the dataset into a training set and a validation set
  • print_score() prints the RMSE and R² scores on the training and validation sets (plus the OOB score if available)

def display_all(df):
    with pd.option_context("display.max_rows", 1000, "display.max_columns", 1000):
        display(df)

def split_vals(a,n): return a[:n].copy(), a[n:].copy()

def rmse(x,y): return math.sqrt(((x-y)**2).mean())

def print_score(m):
    res = [rmse(m.predict(X_train), y_train), rmse(m.predict(X_valid), y_valid),
           m.score(X_train, y_train), m.score(X_valid, y_valid)]
    if hasattr(m, 'oob_score_'): res.append(m.oob_score_)
    print(res)
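
To sanity-check rmse(), here is a tiny worked example with made-up numbers:

```python
import math
import numpy as np

def rmse(x, y): return math.sqrt(((x - y) ** 2).mean())

# Both errors are 1, so the squared mean is 1 and its square root is 1.0
preds = np.array([2.0, 4.0])
actuals = np.array([1.0, 3.0])
print(rmse(preds, actuals))  # 1.0
```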

The following functions are included in the fastai library:

  • add_datepart() generates new numerical features from a date column (year, month, etc.) and removes the original date column
  • train_cats() maps strings to integer-backed pandas categories (e.g. red: 1, blue: 2)
  • proc_df() replaces categories with their numeric codes, handles missing continuous values (replaces them with the median and creates a new _na feature column), and splits the dependent variable into a separate variable
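
A rough pandas-only sketch of the proc_df() behavior described above (the function name proc_df_simple and the toy columns are mine; the real proc_df does quite a bit more):

```python
import numpy as np
import pandas as pd

def proc_df_simple(df, y_col):
    # Simplified version of fastai's proc_df: median-fill continuous
    # NaNs with an added _na flag column, replace categoricals with
    # their codes (missing -> -1), and split off the target
    df = df.copy()
    y = df.pop(y_col).values
    nas = {}
    for col in list(df.columns):
        if pd.api.types.is_numeric_dtype(df[col]):
            if df[col].isna().any():
                nas[col] = df[col].median()
                df[col + '_na'] = df[col].isna()
                df[col] = df[col].fillna(nas[col])
        else:
            df[col] = df[col].astype('category').cat.codes
    return df, y, nas

raw = pd.DataFrame({'hours': [100.0, np.nan, 300.0],
                    'band': ['High', None, 'Low'],
                    'SalePrice': [10.0, 20.0, 30.0]})
X, y, nas = proc_df_simple(raw, 'SalePrice')
print(X)
```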

3. Concepts

  • structured data/unstructured data: structured data is tabular data; images are an example of unstructured data
  • curse of dimensionality: the idea that the more dimensions you have, the more all of the points sit on the edge of that space
  • no free lunch theorem: in theory, there is no single type of model that works best for every possible dataset
  • regression/classification: regression predicts a continuous variable (e.g. price prediction); classification predicts a true/false label or one of multiple categories (e.g. categorizing fruits)
  • overfitting: when a model is too specific to its training data, it will not generalize well to new data; a validation set helps diagnose this problem
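
The overfitting point is easy to see in code. In this hypothetical example (random data I made up, not the bulldozers set), the target is mostly noise, so a random forest scores far better on the rows it was trained on than on held-out rows:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
X = rng.rand(200, 5)
y = X[:, 0] + rng.randn(200) * 0.5  # weak signal plus heavy noise

# Simple holdout split, as in the lesson
X_train, X_valid = X[:150], X[150:]
y_train, y_valid = y[:150], y[150:]

m = RandomForestRegressor(n_estimators=40, n_jobs=-1, random_state=0)
m.fit(X_train, y_train)
print(m.score(X_train, y_train))  # high: the model memorized much of the noise
print(m.score(X_valid, y_valid))  # much lower: it does not generalize
```

The gap between the two scores is exactly what the validation set is there to reveal.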


That is it for the first lesson!

Don’t forget to practice recall. Can you explain the basic process for beginning a data science notebook? How do you handle missing values? What are regression and classification? What is overfitting?

And practice with Kaggle to make this new knowledge second nature!

Note: I think that the easiest way to begin a data science notebook is to use Google Colab.

!pip install fastai==0.7.0
from google.colab import files
uploaded = files.upload()
import pandas as pd
import io
df_raw = pd.read_csv(io.BytesIO(uploaded['train.csv']))

You can follow me on Instagram @oyane806 !
