<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Manav Modi</title>
    <description>The latest articles on DEV Community by Manav Modi (@manavmodi).</description>
    <link>https://dev.to/manavmodi</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F330025%2F24fa28b2-deed-4457-ac86-a447dc6ba6a4.jpeg</url>
      <title>DEV Community: Manav Modi</title>
      <link>https://dev.to/manavmodi</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/manavmodi"/>
    <language>en</language>
    <item>
      <title>Case Study: Data Preprocessing</title>
      <dc:creator>Manav Modi</dc:creator>
      <pubDate>Mon, 24 Jan 2022 14:41:21 +0000</pubDate>
      <link>https://dev.to/manavmodi/case-study-data-preprocessing-1dpf</link>
      <guid>https://dev.to/manavmodi/case-study-data-preprocessing-1dpf</guid>
      <description>&lt;p&gt;In the final blog of this series, we will walk through the entire preprocessing workflow on a dataset of UFO sightings. Each row in this dataset contains information like the location, the type of the sighting, the number of seconds and minutes the sighting lasted, a description of the sighting, and the date the sighting was recorded.&lt;/p&gt;

&lt;p&gt;The full implementation is in the Python notebook &lt;a href="https://github.com/manavmodi22/Preprocessing-for-Machine-Learning-in-Python/blob/main/data_preprocessing_casestudy_exercise.ipynb"&gt;here&lt;/a&gt;, so it is highly recommended to keep it open while reading through this blog.&lt;/p&gt;

&lt;p&gt;Steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Checking column types&lt;/li&gt;
&lt;li&gt;Dropping missing data&lt;/li&gt;
&lt;li&gt;Extracting numbers from strings&lt;/li&gt;
&lt;li&gt;Identifying features for standardization&lt;/li&gt;
&lt;li&gt;Encoding categorical variables&lt;/li&gt;
&lt;li&gt;Features from dates&lt;/li&gt;
&lt;li&gt;Text Vectorization&lt;/li&gt;
&lt;li&gt;Selecting the ideal dataset&lt;/li&gt;
&lt;li&gt;Modeling the dataset&lt;/li&gt;
&lt;/ol&gt;

&lt;h4&gt;
  
  
  Categorical variables and standardization
&lt;/h4&gt;

&lt;p&gt;Let's tackle the categorical variables and standardization in the UFO dataset.&lt;/p&gt;

&lt;p&gt;There are a number of categorical variables in the UFO dataset, including the location data and the type of encounter. These need to be one-hot encoded. &lt;/p&gt;

&lt;p&gt;In addition, we need to standardize the &lt;code&gt;seconds&lt;/code&gt; column. Check the variance using the &lt;code&gt;var()&lt;/code&gt; method and log normalize using NumPy's log function.&lt;/p&gt;
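
&lt;p&gt;A minimal sketch of both steps, assuming a &lt;code&gt;ufo&lt;/code&gt; DataFrame whose column names here (&lt;code&gt;type&lt;/code&gt;, &lt;code&gt;seconds&lt;/code&gt;) are illustrative:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np
import pandas as pd

# One-hot encode a categorical column ("type" is an assumed name)
type_set = pd.get_dummies(ufo["type"])
ufo = pd.concat([ufo, type_set], axis=1)

# Check the variance of "seconds", then log normalize it with NumPy
print(ufo["seconds"].var())
ufo["seconds_log"] = np.log(ufo["seconds"])
print(ufo["seconds_log"].var())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
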

&lt;h4&gt;
  
  
  Engineering new features!✨
&lt;/h4&gt;

&lt;p&gt;There are several fields in the UFO dataset that are great candidates for feature engineering. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;From the &lt;code&gt;date&lt;/code&gt; field, we may want to know the month of the sighting. &lt;/li&gt;
&lt;li&gt;The number of minutes needs to be extracted from the &lt;code&gt;length of time&lt;/code&gt; field. &lt;/li&gt;
&lt;li&gt;The &lt;code&gt;description&lt;/code&gt; field contains a text description of the sighting. It would be interesting to vectorize that text and see what we can learn from it.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For &lt;code&gt;date&lt;/code&gt; extraction, the key code to remember is to use datetime attributes like &lt;code&gt;month&lt;/code&gt; and &lt;code&gt;hour&lt;/code&gt; to get the pieces of the date you need. &lt;/p&gt;
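
&lt;p&gt;For instance, a sketch assuming a &lt;code&gt;date&lt;/code&gt; column on the &lt;code&gt;ufo&lt;/code&gt; DataFrame:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd

# Convert to datetime, then pull out pieces via the month/hour attributes
ufo["date"] = pd.to_datetime(ufo["date"])
ufo["month"] = ufo["date"].apply(lambda row: row.month)
ufo["hour"] = ufo["date"].apply(lambda row: row.hour)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
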

&lt;p&gt;Regular expressions will help you extract numbers from text, and you can use &lt;code&gt;group&lt;/code&gt; to return your results. &lt;/p&gt;
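
&lt;p&gt;A sketch of that pattern, assuming a &lt;code&gt;length_of_time&lt;/code&gt; column with strings like "about 5 minutes":&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import re

def return_minutes(time_string):
    # Search for one or more digits and use group() to return the match
    num = re.search(r"\d+", time_string)
    if num is not None:
        return int(num.group(0))

ufo["minutes"] = ufo["length_of_time"].apply(return_minutes)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
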

&lt;p&gt;Scikit-learn's &lt;code&gt;TfidfVectorizer&lt;/code&gt; will vectorize text fields.&lt;/p&gt;
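
&lt;p&gt;For example, assuming the sighting descriptions live in a &lt;code&gt;desc&lt;/code&gt; column:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sklearn.feature_extraction.text import TfidfVectorizer

# Vectorize the text descriptions ("desc" is an assumed column name)
vec = TfidfVectorizer()
desc_tfidf = vec.fit_transform(ufo["desc"])
print(desc_tfidf.shape)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
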

&lt;h4&gt;
  
  
  Feature Selection and Modeling
&lt;/h4&gt;

&lt;p&gt;We need to do a little bit of feature selection before we model this data. &lt;/p&gt;

&lt;p&gt;Keep in mind that you want to eliminate redundant features, and there are a couple of candidates for that in this dataset, both in its original form and due to feature engineering. &lt;/p&gt;

&lt;p&gt;We also have a text vector that we can inspect and eliminate words from.&lt;/p&gt;
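
&lt;p&gt;To close the loop, a minimal modeling sketch, assuming a filtered feature matrix &lt;code&gt;X&lt;/code&gt; and labels &lt;code&gt;y&lt;/code&gt;; a KNN classifier is just one reasonable choice here:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Stratified split, then fit and score the model
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
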

&lt;h4&gt;
  
  
  Final Thoughts🎓
&lt;/h4&gt;

&lt;p&gt;Remember that &lt;code&gt;preprocessing&lt;/code&gt; and &lt;code&gt;modeling&lt;/code&gt; are often iterative practices, and it might take a few tries to find the ideal feature configuration that improves your model's performance. It also helps to be extremely knowledgeable about the dataset you're working with and to have a good understanding of the model you're trying to build.&lt;/p&gt;

&lt;p&gt;Check out the exercises linked to this post &lt;a href="https://github.com/manavmodi22/Preprocessing-for-Machine-Learning-in-Python/blob/main/data_preprocessing_casestudy_exercise.ipynb"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Interested in Machine Learning content? Follow me on &lt;a href="https://twitter.com/manavmtwt"&gt;Twitter&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>python</category>
      <category>programming</category>
      <category>beginners</category>
    </item>
    <item>
      <title>What is Feature Selection?</title>
      <dc:creator>Manav Modi</dc:creator>
      <pubDate>Mon, 24 Jan 2022 06:02:21 +0000</pubDate>
      <link>https://dev.to/manavmodi/what-is-feature-selection-1hkn</link>
      <guid>https://dev.to/manavmodi/what-is-feature-selection-1hkn</guid>
      <description>&lt;h4&gt;
  
  
  What is feature selection?
&lt;/h4&gt;

&lt;p&gt;Feature Selection is the process of selecting features from the existing set to be used for modeling; it doesn't create new features.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Goal: Improve the model's performance&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;One of the easiest ways is to determine whether a feature is redundant or not.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Redundant Features&lt;/strong&gt;
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;Remove noisy features&lt;/li&gt;
&lt;li&gt;Remove correlated features&lt;/li&gt;
&lt;li&gt;Remove duplicated features&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Feature selection is an iterative process.&lt;/p&gt;
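
&lt;p&gt;A small illustrative sketch of removing duplicated and zero-variance (noisy) columns; the DataFrame here is made up:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd

df = pd.DataFrame({
    "a":      [1, 2, 3, 4],
    "a_copy": [1, 2, 3, 4],   # duplicate of "a"
    "noise":  [5, 5, 5, 5],   # zero variance, carries no signal
})

print(df.var())   # "noise" shows a variance of 0
df = df.drop(["a_copy", "noise"], axis=1)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
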

&lt;h4&gt;
  
  
  &lt;strong&gt;Correlated Features&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Linear models generally assume feature independence. Features that are statistically correlated, meaning they move together directionally, can introduce bias into such models. &lt;br&gt;
The Pearson correlation coefficient is the standard measure for this; a quick check is sketched after the list below. &lt;br&gt;
A score closer to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;1&lt;/code&gt; indicates a strong positive correlation&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;0&lt;/code&gt; indicates no correlation&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;-1&lt;/code&gt; indicates a strong negative correlation, implying that the features move in opposite directions&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
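
&lt;p&gt;A quick way to run this check: a minimal pandas sketch, assuming &lt;code&gt;df&lt;/code&gt; is a numeric DataFrame and "redundant_col" is a hypothetical column name:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Pairwise Pearson correlations between all numeric columns
print(df.corr())

# If two columns correlate very strongly, drop one of them
df = df.drop("redundant_col", axis=1)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
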
&lt;h4&gt;
  
  
  Selecting features using Text Vectors
&lt;/h4&gt;

&lt;p&gt;After you have vectorized the text, the vocabulary and weights will be stored in the vectorizer. To pull out the vocabulary list and have a look at the word weights, you can use the &lt;code&gt;vocabulary_&lt;/code&gt; attribute.&lt;/p&gt;

&lt;p&gt;Here, we have a vector of location descriptions from the hiking dataset:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;print(tfidf_vec.vocabulary_)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--dyP_keAX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1642999723615/DiJpd619W.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--dyP_keAX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1642999723615/DiJpd619W.png" alt="image.png" width="416" height="248"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Row data contains two components: &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;word weights&lt;/li&gt;
&lt;li&gt;index of word&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;To take a look at the weights in the fourth row, we use the &lt;code&gt;data&lt;/code&gt; attribute on that specific row, accessed like you would access items in a list.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;print(text_tfidf[3].data)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--nTN7U60h--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1643000215435/oSxfJdis9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--nTN7U60h--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1643000215435/oSxfJdis9.png" alt="image.png" width="410" height="92"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To get the indices of the words that have been weighted, we use the indices attribute.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;print(text_tfidf[3].indices)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--GEsV96S3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1643000229414/58q2fHFoD.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--GEsV96S3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1643000229414/58q2fHFoD.png" alt="image.png" width="412" height="44"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It will be easier later on if the index numbers sit in the key position of the dictionary. To reverse the vocabulary dictionary, swap the key-value pairs by grabbing the items from the dictionary and reversing their order. Finally, we can zip together the row indices and weights and pass them into the &lt;code&gt;dict&lt;/code&gt; function to turn them into a dictionary.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;vocab = {v:k for k,v in tfidf_vec.vocabulary_.items()}
zipped_row = dict(zip(text_tfidf[3].indices,text_tfidf[3].data))
print(vocab)
print(zipped_row)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--9BKWV7Nr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1643003847398/Je6NZNnup.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--9BKWV7Nr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1643003847398/Je6NZNnup.png" alt="image.png" width="402" height="214"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--OyhvKt3s--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1643003858402/6tGfBWMrJ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--OyhvKt3s--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1643003858402/6tGfBWMrJ.png" alt="image.png" width="398" height="200"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def return_weights(vocab,vector,vector_index):
zipped = dict(zip(vector[vector_index].indices,vector[vector_index].data))
return {vocab[I]:zipped[I] for I in vector[vector_index].indices}
print(return_weights(vocab,text_tfidf,3))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--YdNboJ2Y--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1643003872572/U1FumSwDV.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--YdNboJ2Y--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1643003872572/U1FumSwDV.png" alt="image.png" width="844" height="152"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Check out the exercises linked to this post &lt;a href="https://github.com/manavmodi22/Preprocessing-for-Machine-Learning-in-Python/blob/main/data_preprocessing_chapter4_exercise.ipynb"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Interested in Machine Learning content? Follow me on &lt;a href="https://twitter.com/manavmtwt"&gt;Twitter&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>beginners</category>
      <category>python</category>
      <category>machinelearning</category>
      <category>datascience</category>
    </item>
    <item>
      <title>What is Feature Engineering?</title>
      <dc:creator>Manav Modi</dc:creator>
      <pubDate>Sat, 22 Jan 2022 07:13:37 +0000</pubDate>
      <link>https://dev.to/manavmodi/what-is-feature-engineering-35pg</link>
      <guid>https://dev.to/manavmodi/what-is-feature-engineering-35pg</guid>
      <description>&lt;h4&gt;
  
  
  What is Feature Engineering?
&lt;/h4&gt;

&lt;p&gt;Feature Engineering is the process of creating new features based on existing ones.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;It can add features important for clustering tasks or insight into relationships between features.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Real-world data is rarely complete, so you will likely have to expand and extract features in addition to performing preprocessing steps like standardization.&lt;/p&gt;

&lt;h4&gt;
  
  
  Encoding categorical variables
&lt;/h4&gt;

&lt;p&gt;Since the models in scikit-learn require numerical inputs, you will need to encode categorical data.&lt;/p&gt;

&lt;h5&gt;
  
  
  &lt;strong&gt;Using Pandas&lt;/strong&gt;
&lt;/h5&gt;

&lt;p&gt;Using &lt;code&gt;apply&lt;/code&gt; we can replace the values.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;users["sub_enc"] = users["subscribed"].apply(lambda val: 
1 if val=="y" else 0)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h5&gt;
  
  
  &lt;strong&gt;Using scikit-learn&lt;/strong&gt;
&lt;/h5&gt;

&lt;p&gt;Alternatively, we can do this using scikit-learn's LabelEncoder. This is helpful if the preprocessing is implemented using scikit-learn's pipeline functionality.&lt;/p&gt;

&lt;p&gt;Creating a LabelEncoder object also lets us reuse it on the test set or on new data.&lt;/p&gt;

&lt;p&gt;You can use the &lt;code&gt;fit_transform&lt;/code&gt; method to both fit the encoder to data and transform the column.&lt;/p&gt;

&lt;p&gt;Printing out the columns, we can see how the y's and n's have been encoded to 1's and 0's.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sklearn.preprocessing import LabelEncoder
le =LabelEncoder()
users["sub_enc_le"] = le.fit_transform(user["subscribed"])
print(users[["subscribed","sub_enc_le"]])

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1642827877845%2FORx52PqPr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1642827877845%2FORx52PqPr.png" alt="image.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;One-Hot Encoder&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;When a column has more than two values to encode, we can use one-hot encoding. For example, the &lt;code&gt;fav_color&lt;/code&gt; column has 3 different colors: blue, green, and orange.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1642828160061%2F9fElOTTz9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1642828160061%2F9fElOTTz9.png" alt="image.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It will be encoded as follows:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;blue: [1,0,0]&lt;/li&gt;
&lt;li&gt;green: [0,1,0]&lt;/li&gt;
&lt;li&gt;orange: [0,0,1]&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1642828181859%2FvBpaDAOil.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1642828181859%2FvBpaDAOil.png" alt="image.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This operation can be done using pandas' &lt;code&gt;get_dummies&lt;/code&gt; function on the desired column.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;print(pd.get_dummies(users["fav_color"]))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1642828360341%2FyOa2SZWaZ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1642828360341%2FyOa2SZWaZ.png" alt="image.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Engineering Numerical Features
&lt;/h4&gt;

&lt;p&gt;This can be helpful in dimensionality reduction. &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Say you have a table of temperature readings over 3 days from 4 different cities. Given that the readings for each city are close in value, it would be more appropriate to take their average. &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Here, we apply a lambda to get the mean of the values. &lt;code&gt;axis=1&lt;/code&gt; is specified to operate across each row.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;columns = ["day1", "day2" , "day3"]
df["mean"] = df.apply(lambda row: row[columns].mean(), axis=1)
print(df)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1642829841157%2FEEtWoyPj6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1642829841157%2FEEtWoyPj6.png" alt="image.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;In the case of dates, it is much more useful to reduce the granularity.&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df["date_converted"] = pd.to_datetime(df["date"])
df["month"] = df["date_converted"].apply(lambda row:row.month)
print(df)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1642830028213%2FvtkAGkQlK.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1642830028213%2FvtkAGkQlK.png" alt="image.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Text Classification
&lt;/h4&gt;

&lt;p&gt;Let's extract numbers from a string.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import re
my_string = "temperature is 75.6 F"
pattern  = re.compile("\d+\.\d+")
temp = re.match(pattern,my_string)
print(float(temp.group(0)))

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;\d&lt;/code&gt; matches a digit&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;+&lt;/code&gt; matches one or more of the preceding token, so as many digits as possible are captured&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;\.&lt;/code&gt; matches the literal decimal point&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Vectorizing Text&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We will be using &lt;code&gt;tf/idf&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;tf/idf&lt;/code&gt; is a way of vectorizing text that reflects how important a word is in a document beyond how frequently it occurs. &lt;/p&gt;

&lt;p&gt;It stands for &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;tf: term frequency &lt;/li&gt;
&lt;li&gt;idf: inverse document frequency &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It places more weight on words that are more significant across the entire corpus.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sklearn.feature_extraction.text import Tfidfvectorizer
print(documents.head())
tfidf_vec = TfidfVectorizer()
text_tfidf = tfidf_vec.fit_transform(documents)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We will be using the Naive Bayes classifier for text classification. It treats each feature as independent of the others, which can be a naive assumption, but it works well on text data.&lt;/p&gt;
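
&lt;p&gt;A minimal sketch, assuming &lt;code&gt;text_tfidf&lt;/code&gt; from above and a label vector &lt;code&gt;y&lt;/code&gt;; MultinomialNB is one common choice for tf-idf features:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Split the tf-idf matrix and labels, then fit and score the classifier
X_train, X_test, y_train, y_test = train_test_split(text_tfidf, y, stratify=y)
nb = MultinomialNB()
nb.fit(X_train, y_train)
print(nb.score(X_test, y_test))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
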

&lt;p&gt;Check out the exercises linked to this post &lt;a href="https://github.com/manavmodi22/Preprocessing-for-Machine-Learning-in-Python/blob/main/data_preprocessing_chapter3_exercise.ipynb" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Interested in Machine Learning content? Follow me on &lt;a href="https://twitter.com/manavmtwt" rel="noopener noreferrer"&gt;Twitter&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>beginners</category>
      <category>python</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Standardizing Data</title>
      <dc:creator>Manav Modi</dc:creator>
      <pubDate>Fri, 21 Jan 2022 15:20:59 +0000</pubDate>
      <link>https://dev.to/manavmodi/standardizing-data-283i</link>
      <guid>https://dev.to/manavmodi/standardizing-data-283i</guid>
      <description>&lt;h4&gt;
  
  
  What is standardization?
&lt;/h4&gt;

&lt;p&gt;It is a preprocessing method used to transform continuous data to make it look normally distributed. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Many scikit-learn models assume normally distributed data. Feeding them continuous data that isn't normally distributed risks biasing the models.&lt;/p&gt;

&lt;p&gt;Two methods can be used for the standardization process:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Log Normalization&lt;/li&gt;
&lt;li&gt;Feature Scaling&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;These methods are applied to continuous numerical data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When?&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The model operates in linear space (e.g. KNN, K-Means), so the data must also be in linear space.&lt;/li&gt;
&lt;li&gt;The dataset has features with high variance, which could bias a model that assumes the data is normally distributed. &lt;/li&gt;
&lt;li&gt;The dataset has continuous features on different scales. For example, a dataset that has height and weight as its features needs to be standardized to make sure they are on the same linear scale.&lt;/li&gt;
&lt;/ol&gt;

&lt;h4&gt;
  
  
  What is Log Normalization?
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;A log transformation is applied&lt;/li&gt;
&lt;li&gt;Used on datasets where the variance of a particular column is significantly higher than that of the other columns&lt;/li&gt;
&lt;li&gt;The natural log is applied to the values&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1642770484282%2FTTQ1ZEYkw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1642770484282%2FTTQ1ZEYkw.png" alt="image.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It is used to capture relative changes and the magnitude of change, and it keeps everything in positive space.&lt;/p&gt;

&lt;p&gt;Let's see the implementation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;print(df) 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1642770726107%2FY5bhqEDVp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1642770726107%2FY5bhqEDVp.png" alt="image.png"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;print(df.var())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1642770738314%2FDqEMoYHbC.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1642770738314%2FDqEMoYHbC.png" alt="image.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We will use the log function from the NumPy library to perform the normalization.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np
df["log_2"] = np.log(df["col2"])
print(df)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1642770768558%2FJ8SBzLm-V.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1642770768558%2FJ8SBzLm-V.png" alt="image.png"&gt;&lt;/a&gt;&lt;br&gt;
Let's compare the variances.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;print(np.var(df[["col1","log_2"]]))

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1642770825749%2FmNxa5rRZU5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1642770825749%2FmNxa5rRZU5.png" alt="image.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  What is feature scaling?
&lt;/h4&gt;

&lt;p&gt;This method is useful when: &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;continuous features are present on different scales, and&lt;/li&gt;
&lt;li&gt;the model operates in linear space.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The dataset is transformed such that the resulting mean of each feature is 0 and the variance is 1.&lt;/p&gt;
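
&lt;p&gt;Written out by hand, the same transformation is just a z-score; a sketch assuming &lt;code&gt;df&lt;/code&gt; is a numeric DataFrame:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Manual z-score standardization: subtract the mean, divide by the
# standard deviation (ddof=0 matches StandardScaler's population std)
df_scaled = (df - df.mean()) / df.std(ddof=0)
print(df_scaled.mean())   # approximately 0 for every column
print(df_scaled.var())    # approximately 1 for every column
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
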

&lt;p&gt;Here you can see how the variance differs across the features.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;print(df)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1642776382839%2FPIOIrDxDH.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1642776382839%2FPIOIrDxDH.png" alt="image.png"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;print(df.var())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1642776418943%2FlR2Su08Lj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1642776418943%2FlR2Su08Lj.png" alt="image.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The process is done using the &lt;code&gt;StandardScaler&lt;/code&gt; class from scikit-learn.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sklearn.preprocessing  import StandardScaler
scaler = StandardScaler()
df_scaled= pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
print(df_scaled)
print(df.var())

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1642776443908%2FFBChoCtcn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1642776443908%2FFBChoCtcn.png" alt="image.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1642776460766%2FFm6fv4aib.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1642776460766%2FFm6fv4aib.png" alt="image.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Check out the exercises linked to this post &lt;a href="https://github.com/manavmodi22/Preprocessing-for-Machine-Learning-in-Python/blob/main/data_preprocessing_chapter2_exercise.ipynb" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Interested in Machine Learning content? Follow me on &lt;a href="https://twitter.com/manavmtwt" rel="noopener noreferrer"&gt;Twitter&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>python</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Introduction to Data Preprocessing</title>
      <dc:creator>Manav Modi</dc:creator>
      <pubDate>Fri, 21 Jan 2022 06:57:49 +0000</pubDate>
      <link>https://dev.to/manavmodi/introduction-to-data-preprocessing-4cac</link>
      <guid>https://dev.to/manavmodi/introduction-to-data-preprocessing-4cac</guid>
      <description>&lt;h2&gt;
  
  
  What is Data Preprocessing?
&lt;/h2&gt;

&lt;p&gt;Data Preprocessing comes right after you have cleaned up your data and done some Exploratory Data Analysis. It is the step where we prepare the data for modeling; modeling in Python requires numerical input. &lt;/p&gt;

&lt;h3&gt;
  
  
  Refreshing Pandas Skills
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;You can skip this section if you know the basics.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Before we proceed with the series, it is important to know the commands that help you get to know your dataset well.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd
hiking = pd.read_json("datasets/hiking.json")
print(hiking.head())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1642686604057%2FTbvTv_xRK.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1642686604057%2FTbvTv_xRK.png" alt="image.png"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;print(hiking.columns)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1642686689206%2F8vQr-iXIH.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1642686689206%2F8vQr-iXIH.png" alt="image.png"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;print(hiking.dtypes)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1642686753805%2FoxrnZS0Tk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1642686753805%2FoxrnZS0Tk.png" alt="image.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Removing Missing Data
&lt;/h4&gt;

&lt;h5&gt;
  
  
  Sample Data
&lt;/h5&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1642688263433%2FXTS-7u0dd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1642688263433%2FXTS-7u0dd.png" alt="image.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dropping rows with null values&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;print(df.dropna())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1642688356752%2FVuNftTH3A.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1642688356752%2FVuNftTH3A.png" alt="image.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dropping specific rows using an array of indices&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;print(df.drop([1,2,3]))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1642688377993%2Fn2U_jjKU7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1642688377993%2Fn2U_jjKU7.png" alt="image.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dropping a specific column (here, axis=1 specifies that a column, not a row, should be dropped)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;print(df.drop("A", axis=1))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1642688474064%2FZZapwSQc3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1642688474064%2FZZapwSQc3.png" alt="image.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fetching the rows where a specific column is &lt;code&gt;not null&lt;/code&gt;&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;print(df[df["B"].notnull()])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1642688523708%2FZHW79pBf-.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1642688523708%2FZHW79pBf-.png" alt="image.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Working on DataTypes
&lt;/h3&gt;

&lt;p&gt;While preprocessing data, the datatype of a column is often not as desired. We use the following command to convert a column's datatype. &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Remember: Always apply the datatype that fits all of the data in the particular column.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This code sample will help you convert column "C" to the &lt;code&gt;float&lt;/code&gt; datatype.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df["C"] = df["C"].astype("float")
print(df.dtypes)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Stratified Sampling
&lt;/h3&gt;

&lt;p&gt;A train-test split is done on the dataset for training and testing the model.&lt;br&gt;
Say the original dataset is 80% class 1 and 20% class 2. You would want a similar distribution in both the train and test datasets to make sure you have the best representation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; # Total "labels" counts
y["labels"].value_counts()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1642741825508%2FBEqF-Aji5Y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1642741825508%2FBEqF-Aji5Y.png" alt="image.png"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;X_train,X_test,y_train,y_test = train_test_split(X,y, stratify=y)
y_train["labels"].value_counts() 
y_test["labels"].value_counts()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1642741893752%2FBIOiW_-rS.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1642741893752%2FBIOiW_-rS.png" alt="image.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1642741911120%2FP9sEfI8zg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.hashnode.com%2Fres%2Fhashnode%2Fimage%2Fupload%2Fv1642741911120%2FP9sEfI8zg.png" alt="image.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Check out the exercises linked to this post &lt;a href="https://github.com/manavmodi22/Preprocessing-for-Machine-Learning-in-Python/blob/main/data_preprocessing_chapter1_exercise.ipynb" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Interested in Machine Learning content? Follow me on &lt;a href="https://twitter.com/manavmtwt" rel="noopener noreferrer"&gt;Twitter&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>python</category>
      <category>beginners</category>
      <category>datascience</category>
    </item>
  </channel>
</rss>
