<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Manav Modi</title>
    <description>The latest articles on DEV Community by Manav Modi (@manavmodi).</description>
    <link>https://dev.to/manavmodi</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F330025%2F24fa28b2-deed-4457-ac86-a447dc6ba6a4.jpeg</url>
      <title>DEV Community: Manav Modi</title>
      <link>https://dev.to/manavmodi</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/manavmodi"/>
    <language>en</language>
    <item>
      <title>Case Study: Data Preprocessing</title>
      <dc:creator>Manav Modi</dc:creator>
      <pubDate>Mon, 24 Jan 2022 14:41:21 +0000</pubDate>
      <link>https://dev.to/manavmodi/case-study-data-preprocessing-1dpf</link>
      <guid>https://dev.to/manavmodi/case-study-data-preprocessing-1dpf</guid>
      <description>&lt;p&gt;In the final blog of this series, we will walk through the entire preprocessing workflow on a dataset of UFO sightings. Each row in this dataset contains information like the location, the type of the sighting, the number of seconds and minutes the sighting lasted, a description of the sighting, and the date the sighting was recorded.&lt;/p&gt;

&lt;p&gt;The full implementation is in the Python notebook &lt;a href="https://github.com/manavmodi22/Preprocessing-for-Machine-Learning-in-Python/blob/main/data_preprocessing_casestudy_exercise.ipynb"&gt;here&lt;/a&gt;, so it is highly recommended to keep it open while reading through this blog.&lt;/p&gt;

&lt;p&gt;Steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Checking column types&lt;/li&gt;
&lt;li&gt;Dropping missing data&lt;/li&gt;
&lt;li&gt;Extracting numbers from strings&lt;/li&gt;
&lt;li&gt;Identifying features for standardization&lt;/li&gt;
&lt;li&gt;Encoding categorical variables&lt;/li&gt;
&lt;li&gt;Features from dates&lt;/li&gt;
&lt;li&gt;Text Vectorization&lt;/li&gt;
&lt;li&gt;Selecting the ideal dataset&lt;/li&gt;
&lt;li&gt;Modeling the dataset&lt;/li&gt;
&lt;/ol&gt;

&lt;h4&gt;
  
  
  Categorical variables and standardization
&lt;/h4&gt;

&lt;p&gt;Let's tackle the categorical variables and standardization in the UFO dataset.&lt;/p&gt;

&lt;p&gt;There are a number of categorical variables in the UFO dataset, including the location data and the type of encounter. These need to be one-hot encoded. &lt;/p&gt;

&lt;p&gt;In addition, we need to standardize the &lt;code&gt;seconds&lt;/code&gt; column. Check the variance using the &lt;code&gt;var()&lt;/code&gt; method and log normalize using NumPy's log function.&lt;/p&gt;
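
&lt;p&gt;A minimal sketch of both steps, assuming a &lt;code&gt;ufo&lt;/code&gt; DataFrame whose column names here (&lt;code&gt;type&lt;/code&gt;, &lt;code&gt;seconds&lt;/code&gt;) are illustrative:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np
import pandas as pd

# One-hot encode a categorical column ("type" is an assumed name)
type_set = pd.get_dummies(ufo["type"])
ufo = pd.concat([ufo, type_set], axis=1)

# Check the variance of "seconds", then log normalize it with NumPy
print(ufo["seconds"].var())
ufo["seconds_log"] = np.log(ufo["seconds"])
print(ufo["seconds_log"].var())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
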

&lt;h4&gt;
  
  
  Engineering new features!✨
&lt;/h4&gt;

&lt;p&gt;There are several fields in the UFO dataset that are great candidates for feature engineering. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;From the &lt;code&gt;date&lt;/code&gt; field, we may want to know the month of the sighting. &lt;/li&gt;
&lt;li&gt;The number of minutes needs to be extracted from the &lt;code&gt;length of time&lt;/code&gt; field. &lt;/li&gt;
&lt;li&gt;The &lt;code&gt;description&lt;/code&gt; field contains a text description of the sighting. It would be interesting to vectorize that text and see what we can learn from it.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For &lt;code&gt;date&lt;/code&gt; extraction, the key code to remember is to use datetime attributes like &lt;code&gt;month&lt;/code&gt; and &lt;code&gt;hour&lt;/code&gt; to get the pieces of the date you need. &lt;/p&gt;
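
&lt;p&gt;For instance, a sketch assuming a &lt;code&gt;date&lt;/code&gt; column on the &lt;code&gt;ufo&lt;/code&gt; DataFrame:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd

# Convert to datetime, then pull out pieces via the month/hour attributes
ufo["date"] = pd.to_datetime(ufo["date"])
ufo["month"] = ufo["date"].apply(lambda row: row.month)
ufo["hour"] = ufo["date"].apply(lambda row: row.hour)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
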

&lt;p&gt;Regular expressions will help you extract numbers from text, and you can use &lt;code&gt;group&lt;/code&gt; to return your results. &lt;/p&gt;
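
&lt;p&gt;A sketch of that pattern, assuming a &lt;code&gt;length_of_time&lt;/code&gt; column with strings like "about 5 minutes":&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import re

def return_minutes(time_string):
    # Search for one or more digits and use group() to return the match
    num = re.search(r"\d+", time_string)
    if num is not None:
        return int(num.group(0))

ufo["minutes"] = ufo["length_of_time"].apply(return_minutes)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
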

&lt;p&gt;Scikit-learn's &lt;code&gt;TfidfVectorizer&lt;/code&gt; will vectorize text fields.&lt;/p&gt;
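
&lt;p&gt;For example, assuming the sighting descriptions live in a &lt;code&gt;desc&lt;/code&gt; column:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sklearn.feature_extraction.text import TfidfVectorizer

# Vectorize the text descriptions ("desc" is an assumed column name)
vec = TfidfVectorizer()
desc_tfidf = vec.fit_transform(ufo["desc"])
print(desc_tfidf.shape)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
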

&lt;h4&gt;
  
  
  Feature Selection and Modeling
&lt;/h4&gt;

&lt;p&gt;We need to do a little bit of feature selection before we model this data. &lt;/p&gt;

&lt;p&gt;Keep in mind that you want to eliminate redundant features, and there are a couple of candidates for that in this dataset, both in its original form and due to feature engineering. &lt;/p&gt;

&lt;p&gt;We also have a text vector that we can inspect and eliminate words from.&lt;/p&gt;
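
&lt;p&gt;To close the loop, a minimal modeling sketch, assuming a filtered feature matrix &lt;code&gt;X&lt;/code&gt; and labels &lt;code&gt;y&lt;/code&gt;; a KNN classifier is just one reasonable choice here:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Stratified split, then fit and score the model
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
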

&lt;h4&gt;
  
  
  Final Thoughts🎓
&lt;/h4&gt;

&lt;p&gt;Remember that &lt;code&gt;preprocessing&lt;/code&gt; and &lt;code&gt;modeling&lt;/code&gt; are often iterative practices, and it might take a few tries to find the ideal feature configuration that improves your model's performance. It also helps to be extremely knowledgeable about the dataset you're working with and to have a good understanding of the model you're trying to build.&lt;/p&gt;

&lt;p&gt;Check out the exercises linked to this post &lt;a href="https://github.com/manavmodi22/Preprocessing-for-Machine-Learning-in-Python/blob/main/data_preprocessing_casestudy_exercise.ipynb"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Interested in Machine Learning content? Follow me on &lt;a href="https://twitter.com/manavmtwt"&gt;Twitter&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>python</category>
      <category>programming</category>
      <category>beginners</category>
    </item>
    <item>
      <title>What is Feature Selection?</title>
      <dc:creator>Manav Modi</dc:creator>
      <pubDate>Mon, 24 Jan 2022 06:02:21 +0000</pubDate>
      <link>https://dev.to/manavmodi/what-is-feature-selection-1hkn</link>
      <guid>https://dev.to/manavmodi/what-is-feature-selection-1hkn</guid>
      <description>&lt;h4&gt;
  
  
  What is feature selection?
&lt;/h4&gt;

&lt;p&gt;Feature Selection is the process of selecting features from the existing set to be used for modeling; it doesn't create new features.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Goal: Improve the model's performance&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;One of the easiest ways is to determine whether a feature is redundant or not.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Redundant Features&lt;/strong&gt;
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;Remove noisy features&lt;/li&gt;
&lt;li&gt;Remove correlated features&lt;/li&gt;
&lt;li&gt;Remove duplicated features&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Feature selection is an iterative process.&lt;/p&gt;
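
&lt;p&gt;A small illustrative sketch of removing duplicated and zero-variance (noisy) columns; the DataFrame here is made up:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd

df = pd.DataFrame({
    "a":      [1, 2, 3, 4],
    "a_copy": [1, 2, 3, 4],   # duplicate of "a"
    "noise":  [5, 5, 5, 5],   # zero variance, carries no signal
})

print(df.var())   # "noise" shows a variance of 0
df = df.drop(["a_copy", "noise"], axis=1)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
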

&lt;h4&gt;
  
  
  &lt;strong&gt;Correlated Features&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Linear models generally assume feature independence. Features that are statistically correlated, meaning they move together directionally, can introduce bias into such models. &lt;br&gt;
The Pearson correlation coefficient is the standard measure for this; a quick check is sketched after the list below. &lt;br&gt;
A score closer to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;1&lt;/code&gt; indicates a strong positive correlation&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;0&lt;/code&gt; indicates no correlation&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;-1&lt;/code&gt; indicates a strong negative correlation, implying that the features move in opposite directions&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
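
&lt;p&gt;A quick way to run this check: a minimal pandas sketch, assuming &lt;code&gt;df&lt;/code&gt; is a numeric DataFrame and "redundant_col" is a hypothetical column name:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Pairwise Pearson correlations between all numeric columns
print(df.corr())

# If two columns correlate very strongly, drop one of them
df = df.drop("redundant_col", axis=1)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
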
&lt;h4&gt;
  
  
  Selecting features using Text Vectors
&lt;/h4&gt;

&lt;p&gt;After you have vectorized the text, the vocabulary and weights will be stored in the vectorizer. To pull out the vocabulary list and have a look at the word weights, you can use the &lt;code&gt;vocabulary_&lt;/code&gt; attribute.&lt;/p&gt;

&lt;p&gt;Here, we have a vector of location descriptions from the hiking dataset:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;print(tfidf_vec.vocabulary_)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--dyP_keAX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1642999723615/DiJpd619W.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--dyP_keAX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1642999723615/DiJpd619W.png" alt="image.png" width="416" height="248"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Row data contains two components: &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;word weights&lt;/li&gt;
&lt;li&gt;index of word&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;To take a look at the weights in the fourth row, we use the &lt;code&gt;data&lt;/code&gt; attribute on that specific row, accessed like you would access items in a list.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;print(text_tfidf[3].data)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--nTN7U60h--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1643000215435/oSxfJdis9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--nTN7U60h--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1643000215435/oSxfJdis9.png" alt="image.png" width="410" height="92"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To get the indices of the words that have been weighted, we use the indices attribute.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;print(text_tfidf[3].indices)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--GEsV96S3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1643000229414/58q2fHFoD.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--GEsV96S3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1643000229414/58q2fHFoD.png" alt="image.png" width="412" height="44"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It will be easier later on if the index numbers sit in the key position of the dictionary. To reverse the vocabulary dictionary, swap the key-value pairs by grabbing the items from the dictionary and reversing their order. Finally, we can zip together the row indices and weights and pass them into the &lt;code&gt;dict&lt;/code&gt; function to turn them into a dictionary.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;vocab = {v:k for k,v in tfidf_vec.vocabulary_.items()}
zipped_row = dict(zip(text_tfidf[3].indices,text_tfidf[3].data))
print(vocab)
print(zipped_row)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--9BKWV7Nr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1643003847398/Je6NZNnup.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--9BKWV7Nr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1643003847398/Je6NZNnup.png" alt="image.png" width="402" height="214"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--OyhvKt3s--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1643003858402/6tGfBWMrJ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--OyhvKt3s--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1643003858402/6tGfBWMrJ.png" alt="image.png" width="398" height="200"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def return_weights(vocab,vector,vector_index):
zipped = dict(zip(vector[vector_index].indices,vector[vector_index].data))
return {vocab[I]:zipped[I] for I in vector[vector_index].indices}
print(return_weights(vocab,text_tfidf,3))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--YdNboJ2Y--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1643003872572/U1FumSwDV.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--YdNboJ2Y--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1643003872572/U1FumSwDV.png" alt="image.png" width="844" height="152"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Check out the exercises linked to this post &lt;a href="https://github.com/manavmodi22/Preprocessing-for-Machine-Learning-in-Python/blob/main/data_preprocessing_chapter4_exercise.ipynb"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Interested in Machine Learning content? Follow me on &lt;a href="https://twitter.com/manavmtwt"&gt;Twitter&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>beginners</category>
      <category>python</category>
      <category>machinelearning</category>
      <category>datascience</category>
    </item>
    <item>
      <title>What is Feature Engineering?</title>
      <dc:creator>Manav Modi</dc:creator>
      <pubDate>Sat, 22 Jan 2022 07:13:37 +0000</pubDate>
      <link>https://dev.to/manavmodi/what-is-feature-engineering-35pg</link>
      <guid>https://dev.to/manavmodi/what-is-feature-engineering-35pg</guid>
      <description>&lt;h4&gt;
  
  
  What is Feature Engineering?
&lt;/h4&gt;

&lt;p&gt;Feature Engineering is the process of creating new features based on existing ones.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;It can add features important for clustering tasks or insight into relationships between features.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Real-world data is rarely complete, so you will likely have to expand and extract features in addition to performing preprocessing steps like standardization.&lt;/p&gt;

&lt;h4&gt;
  
  
  Encoding categorical variables
&lt;/h4&gt;

&lt;p&gt;Since the models in scikit-learn require numerical inputs, you will need to encode categorical data.&lt;/p&gt;

&lt;h5&gt;
  
  
  &lt;strong&gt;Using Pandas&lt;/strong&gt;
&lt;/h5&gt;

&lt;p&gt;Using &lt;code&gt;apply&lt;/code&gt; we can replace the values.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;users["sub_enc"] = users["subscribed"].apply(lambda val: 
1 if val=="y" else 0)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h5&gt;
  
  
  &lt;strong&gt;Using scikit-learn&lt;/strong&gt;
&lt;/h5&gt;

&lt;p&gt;Alternatively, we can do this using scikit-learn's LabelEncoder. This is helpful if the preprocessing is implemented using scikit-learn's pipeline functionality.&lt;/p&gt;

&lt;p&gt;Creating a LabelEncoder object also lets us reuse it on the test set or on new data.&lt;/p&gt;

&lt;p&gt;You can use the &lt;code&gt;fit_transform&lt;/code&gt; method to both fit the encoder to data and transform the column.&lt;/p&gt;

&lt;p&gt;Printing out the columns, we can see how the y's and n's have been encoded to 1's and 0's.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sklearn.preprocessing import LabelEncoder
le =LabelEncoder()
users["sub_enc_le"] = le.fit_transform(user["subscribed"])
print(users[["subscribed","sub_enc_le"]])

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1642827877845%2FORx52PqPr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1642827877845%2FORx52PqPr.png" alt="image.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;One-Hot Encoder&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;When a column has more than two values to encode, we can use one-hot encoding. For example, the &lt;code&gt;fav_color&lt;/code&gt; column has 3 different colors: blue, green, and orange.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1642828160061%2F9fElOTTz9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1642828160061%2F9fElOTTz9.png" alt="image.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It will be encoded as follows:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;blue: [1,0,0]&lt;/li&gt;
&lt;li&gt;green: [0,1,0]&lt;/li&gt;
&lt;li&gt;orange: [0,0,1]&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1642828181859%2FvBpaDAOil.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1642828181859%2FvBpaDAOil.png" alt="image.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This operation can be done using pandas' &lt;code&gt;get_dummies&lt;/code&gt; function on the desired column.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;print(pd.get_dummies(users["fav_color"]))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1642828360341%2FyOa2SZWaZ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1642828360341%2FyOa2SZWaZ.png" alt="image.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Engineering Numerical Features
&lt;/h4&gt;

&lt;p&gt;This can be helpful in dimensionality reduction. &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Say you have a table of temperature readings over 3 days from 4 different cities. Given that the readings for each city are close in value, it would be more appropriate to take their average. &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Here, we apply a lambda to get the mean of the values. &lt;code&gt;axis=1&lt;/code&gt; is specified to operate across each row.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;columns = ["day1", "day2" , "day3"]
df["mean"] = df.apply(lambda row: row[columns].mean(), axis=1)
print(df)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1642829841157%2FEEtWoyPj6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1642829841157%2FEEtWoyPj6.png" alt="image.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;In the case of dates, it is much more useful to reduce the granularity.&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df["date_converted"] = pd.to_datetime(df["date"])
df["month"] = df["date_converted"].apply(lambda row:row.month)
print(df)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1642830028213%2FvtkAGkQlK.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1642830028213%2FvtkAGkQlK.png" alt="image.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Text Classification
&lt;/h4&gt;

&lt;p&gt;Let's extract numbers from a string.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import re
my_string = "temperature is 75.6 F"
pattern  = re.compile("\d+\.\d+")
temp = re.match(pattern,my_string)
print(float(temp.group(0)))

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;\d&lt;/code&gt; matches a digit&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;+&lt;/code&gt; matches one or more of the preceding token, so as many digits as possible are captured&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;\.&lt;/code&gt; matches the literal decimal point&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Vectorizing Text&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We will be using &lt;code&gt;tf/idf&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;tf/idf&lt;/code&gt; is a way of vectorizing text that reflects how important a word is in a document beyond how frequently it occurs. &lt;/p&gt;

&lt;p&gt;It stands for &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;tf: term frequency &lt;/li&gt;
&lt;li&gt;idf: inverse document frequency &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It places more weight on words that are more significant across the entire corpus.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sklearn.feature_extraction.text import Tfidfvectorizer
print(documents.head())
tfidf_vec = TfidfVectorizer()
text_tfidf = tfidf_vec.fit_transform(documents)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We will be using the Naive Bayes classifier for text classification. It treats each feature as independent of the others, which can be a naive assumption, but it works well on text data.&lt;/p&gt;
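
&lt;p&gt;A minimal sketch, assuming &lt;code&gt;text_tfidf&lt;/code&gt; from above and a label vector &lt;code&gt;y&lt;/code&gt;; MultinomialNB is one common choice for tf-idf features:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Split the tf-idf matrix and labels, then fit and score the classifier
X_train, X_test, y_train, y_test = train_test_split(text_tfidf, y, stratify=y)
nb = MultinomialNB()
nb.fit(X_train, y_train)
print(nb.score(X_test, y_test))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
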

&lt;p&gt;Check out the exercises linked to this post &lt;a href="https://github.com/manavmodi22/Preprocessing-for-Machine-Learning-in-Python/blob/main/data_preprocessing_chapter3_exercise.ipynb" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Interested in Machine Learning content? Follow me on &lt;a href="https://twitter.com/manavmtwt" rel="noopener noreferrer"&gt;Twitter&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>beginners</category>
      <category>python</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Standardizing Data</title>
      <dc:creator>Manav Modi</dc:creator>
      <pubDate>Fri, 21 Jan 2022 15:20:59 +0000</pubDate>
      <link>https://dev.to/manavmodi/standardizing-data-283i</link>
      <guid>https://dev.to/manavmodi/standardizing-data-283i</guid>
      <description>&lt;h4&gt;
  
  
  What is standardization?
&lt;/h4&gt;

&lt;p&gt;It is a preprocessing method used to transform continuous data to make it look normally distributed. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Many scikit-learn models assume normally distributed data. Feeding them continuous data that isn't normally distributed risks biasing the models.&lt;/p&gt;

&lt;p&gt;Two methods can be used for the standardization process:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Log Normalization&lt;/li&gt;
&lt;li&gt;Feature Scaling&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;These methods are applied to continuous numerical data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When?&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The model operates in linear space (e.g. KNN, K-Means), so the data must also be in linear space.&lt;/li&gt;
&lt;li&gt;The dataset has features with high variance, which could bias a model that assumes the data is normally distributed. &lt;/li&gt;
&lt;li&gt;The dataset has continuous features on different scales. For example, a dataset that has height and weight as its features needs to be standardized to make sure they are on the same linear scale.&lt;/li&gt;
&lt;/ol&gt;

&lt;h4&gt;
  
  
  What is Log Normalization?
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;A log transformation is applied&lt;/li&gt;
&lt;li&gt;Used on datasets where the variance of a particular column is significantly higher than that of the other columns&lt;/li&gt;
&lt;li&gt;The natural log is applied to the values&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1642770484282%2FTTQ1ZEYkw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1642770484282%2FTTQ1ZEYkw.png" alt="image.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It is used to capture relative changes and the magnitude of change, and it keeps everything in positive space.&lt;/p&gt;

&lt;p&gt;Let's see the implementation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;print(df) 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1642770726107%2FY5bhqEDVp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1642770726107%2FY5bhqEDVp.png" alt="image.png"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;print(df.var())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1642770738314%2FDqEMoYHbC.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1642770738314%2FDqEMoYHbC.png" alt="image.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We will use the log function from the NumPy library to perform the normalization.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np
df["log_2"] = np.log(df["col2"])
print(df)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1642770768558%2FJ8SBzLm-V.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1642770768558%2FJ8SBzLm-V.png" alt="image.png"&gt;&lt;/a&gt;&lt;br&gt;
Let's compare the variances.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;print(np.var(df[["col1","log_2"]]))

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1642770825749%2FmNxa5rRZU5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1642770825749%2FmNxa5rRZU5.png" alt="image.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  What is feature scaling?
&lt;/h4&gt;

&lt;p&gt;This method is useful when: &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;continuous features are present on different scales, and&lt;/li&gt;
&lt;li&gt;the model operates in linear space.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The dataset is transformed such that the resulting mean of each feature is 0 and the variance is 1.&lt;/p&gt;
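
&lt;p&gt;Written out by hand, the same transformation is just a z-score; a sketch assuming &lt;code&gt;df&lt;/code&gt; is a numeric DataFrame:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Manual z-score standardization: subtract the mean, divide by the
# standard deviation (ddof=0 matches StandardScaler's population std)
df_scaled = (df - df.mean()) / df.std(ddof=0)
print(df_scaled.mean())   # approximately 0 for every column
print(df_scaled.var())    # approximately 1 for every column
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
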

&lt;p&gt;Here you can see how the variance differs across the features.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;print(df)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1642776382839%2FPIOIrDxDH.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1642776382839%2FPIOIrDxDH.png" alt="image.png"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;print(df.var())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1642776418943%2FlR2Su08Lj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1642776418943%2FlR2Su08Lj.png" alt="image.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The process is done using the &lt;code&gt;StandardScaler&lt;/code&gt; class from scikit-learn.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sklearn.preprocessing  import StandardScaler
scaler = StandardScaler()
df_scaled= pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
print(df_scaled)
print(df.var())

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1642776443908%2FFBChoCtcn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1642776443908%2FFBChoCtcn.png" alt="image.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1642776460766%2FFm6fv4aib.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1642776460766%2FFm6fv4aib.png" alt="image.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Check out the exercises linked to this post &lt;a href="https://github.com/manavmodi22/Preprocessing-for-Machine-Learning-in-Python/blob/main/data_preprocessing_chapter2_exercise.ipynb" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Interested in Machine Learning content? Follow me on &lt;a href="https://twitter.com/manavmtwt" rel="noopener noreferrer"&gt;Twitter&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>python</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Introduction to Data Preprocessing</title>
      <dc:creator>Manav Modi</dc:creator>
      <pubDate>Fri, 21 Jan 2022 06:57:49 +0000</pubDate>
      <link>https://dev.to/manavmodi/introduction-to-data-preprocessing-4cac</link>
      <guid>https://dev.to/manavmodi/introduction-to-data-preprocessing-4cac</guid>
      <description>&lt;h2&gt;
  
  
  What is Data Preprocessing?
&lt;/h2&gt;

&lt;p&gt;Data Preprocessing comes right after you have cleaned up your data and done some Exploratory Data Analysis. It is the step where we prepare the data for modeling; modeling in Python requires numerical input. &lt;/p&gt;

&lt;h3&gt;
  
  
  Refreshing Pandas Skills
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;You can skip this section if you know the basics.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Before we proceed with the series, it is important to know the commands that help you get to know your dataset well.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd
hiking = pd.read_json("datasets/hiking.json")
print(hiking.head())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1642686604057%2FTbvTv_xRK.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1642686604057%2FTbvTv_xRK.png" alt="image.png"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;print(hiking.columns)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1642686689206%2F8vQr-iXIH.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1642686689206%2F8vQr-iXIH.png" alt="image.png"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;print(hiking.dtypes)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1642686753805%2FoxrnZS0Tk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1642686753805%2FoxrnZS0Tk.png" alt="image.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Removing Missing Data
&lt;/h4&gt;

&lt;h5&gt;
  
  
  Sample Data
&lt;/h5&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1642688263433%2FXTS-7u0dd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1642688263433%2FXTS-7u0dd.png" alt="image.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dropping rows with null values&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;print(df.dropna())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1642688356752%2FVuNftTH3A.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1642688356752%2FVuNftTH3A.png" alt="image.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dropping specific rows using an array of indices&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;print(df.drop([1,2,3]))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1642688377993%2Fn2U_jjKU7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1642688377993%2Fn2U_jjKU7.png" alt="image.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dropping a specific column (here, axis=1 specifies that a column, not a row, should be dropped)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;print(df.drop("A", axis=1))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1642688474064%2FZZapwSQc3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1642688474064%2FZZapwSQc3.png" alt="image.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fetching the rows where a specific column is &lt;code&gt;not null&lt;/code&gt;&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;print(df[df["B"].notnull()])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1642688523708%2FZHW79pBf-.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1642688523708%2FZHW79pBf-.png" alt="image.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Working on DataTypes
&lt;/h3&gt;

&lt;p&gt;While preprocessing data, the datatype of a column is often not as desired. We use the following command to convert a column's datatype. &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Remember: Always apply the datatype that fits all of the data in the particular column.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This code sample will help you convert column "C" to the &lt;code&gt;float&lt;/code&gt; datatype.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df["C"] = df["C"].astype("float")
print(df.dtypes)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Stratified Sampling
&lt;/h3&gt;

&lt;p&gt;A train-test split is done on the dataset for training and testing the model.&lt;br&gt;
Say the original dataset is 80% class 1 and 20% class 2. You would want a similar distribution in both the train and test datasets to make sure you have the best representation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; # Total "labels" counts
y["labels"].value_counts()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1642741825508%2FBEqF-Aji5Y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1642741825508%2FBEqF-Aji5Y.png" alt="image.png"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;X_train,X_test,y_train,y_test = train_test_split(X,y, stratify=y)
y_train["labels"].value_counts() 
y_test["labels"].value_counts()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1642741893752%2FBIOiW_-rS.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1642741893752%2FBIOiW_-rS.png" alt="image.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1642741911120%2FP9sEfI8zg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1642741911120%2FP9sEfI8zg.png" alt="image.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Check out the exercises linked to this post &lt;a href="https://github.com/manavmodi22/Preprocessing-for-Machine-Learning-in-Python/blob/main/data_preprocessing_chapter1_exercise.ipynb" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Interested in Machine Learning content? Follow me on &lt;a href="https://twitter.com/manavmtwt" rel="noopener noreferrer"&gt;Twitter&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>python</category>
      <category>beginners</category>
      <category>datascience</category>
    </item>
  </channel>
</rss>
