<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: arham jawad</title>
    <description>The latest articles on DEV Community by arham jawad (@arham_jawad_51b88b85913b5).</description>
    <link>https://dev.to/arham_jawad_51b88b85913b5</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2506014%2Fc1eb2006-4f25-45f2-9347-dc906d23870a.png</url>
      <title>DEV Community: arham jawad</title>
      <link>https://dev.to/arham_jawad_51b88b85913b5</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/arham_jawad_51b88b85913b5"/>
    <language>en</language>
    <item>
      <title>Chapter 2: Classification</title>
      <dc:creator>arham jawad</dc:creator>
      <pubDate>Wed, 10 Dec 2025 08:10:07 +0000</pubDate>
      <link>https://dev.to/arham_jawad_51b88b85913b5/chapter-2-classification-537k</link>
      <guid>https://dev.to/arham_jawad_51b88b85913b5/chapter-2-classification-537k</guid>
      <description>&lt;h2&gt;
  
  
  MNIST
&lt;/h2&gt;

&lt;p&gt;MNIST is a dataset which is a set of 70,000 small images of digits handwritten by high school students and employees of the US Census Bureau.&lt;/p&gt;

&lt;p&gt;In this chapter, we will train a machine learning model that can classify if a given image is of a 5 or not-5.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.datasets&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;fetch_openml&lt;/span&gt;

&lt;span class="n"&gt;mnist&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;fetch_openml&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;mnist_784&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;as_frame&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's create two variables: X and y for features and labels respectively:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mnist&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mnist&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There are 70,000 images in total, and each image has 784 features / pixels. This means that each image has a resolution of 28x28.&lt;/p&gt;

&lt;p&gt;Let's split this into training set and test set:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_test&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;60000&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;60000&lt;/span&gt;&lt;span class="p"&gt;:],&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;60000&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;60000&lt;/span&gt;&lt;span class="p"&gt;:]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Training a Binary Classifier
&lt;/h2&gt;

&lt;p&gt;Let's start training a binary classifier. A binary classifier is basically a type of classifier where our model has to choose from just two options, like cat or dog, five or not-5, comedy or horror etc. In our case, we need to train a binary classifier that classifies whether a given image is of a 5 or not-5. Let's narrow down our labels into just 5 or not 5:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;y_train_5&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_train&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;5&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;y_test_5&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;5&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This means that y_train_5 will have the value of True if a given image has the label of '5', and False if it's not 5.&lt;/p&gt;

&lt;p&gt;Now let's train a SGD Classifier. An SGD Classifier uses gradient descent to tweak the parameters of the model, but instead of tweaking all the parameters at once, it looks at one instance / gradient at a time, so it is very efficient for larger datasets. I will cover more about gradient descent in the next chapter.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.linear_model&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SGDClassifier&lt;/span&gt;

&lt;span class="n"&gt;sgd_clf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SGDClassifier&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;sgd_clf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train_5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Finally! You’ve built the model. Everything perfect? Well, not exactly. Now we need to measure its performance, and it might sound easy in words, but measuring the performance of a classifier is significantly harder than a regression model. There are a lot of things you have to measure, we will cover these right now:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Confusion Matrix&lt;/li&gt;
&lt;li&gt;Precision Score&lt;/li&gt;
&lt;li&gt;Recall Score&lt;/li&gt;
&lt;li&gt;F1 Score&lt;/li&gt;
&lt;li&gt;ROC (Receiver Operating Characteristic)&lt;/li&gt;
&lt;li&gt;ROC-AUC (Area Under the Curve)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Okay! Let's get right into this.&lt;/p&gt;

&lt;h3&gt;
  
  
  Confusion Matrix
&lt;/h3&gt;

&lt;p&gt;To first understand how a confusion matrix works, or in general, how to measure the performance of a classifier, you must first understand these terms:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: I will define them in our context (5 or not-5)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;True Positive&lt;/strong&gt;: The model said that an image is of a 5 and it was a 5. The model was correct&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;True Negative&lt;/strong&gt;: The model said that an image is not a 5 and it was not a 5. The model was correct.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;False Positive&lt;/strong&gt;: The model said that the image is a 5 but it was not 5. The model was incorrect.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;False Negative&lt;/strong&gt;: The model said that the image is not a 5 but it was a 5. The model was incorrect.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here is what a &lt;strong&gt;Confusion Matrix&lt;/strong&gt; looks like:&lt;/p&gt;

&lt;p&gt;[[58391, 687],&lt;/p&gt;

&lt;p&gt;[1891,   3530]]&lt;/p&gt;

&lt;p&gt;The top left number is the number of &lt;strong&gt;True Negatives&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The top right number is the number of &lt;strong&gt;False Positives&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The bottom left number is the number of &lt;strong&gt;False Negatives&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The bottom right number is the number of &lt;strong&gt;True Positives&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;To first see the confusion matrix of the model, we should first do &lt;code&gt;cross_val_predict&lt;/code&gt; to first make some predictions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.model_selection&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;cross_val_predict&lt;/span&gt;

&lt;span class="n"&gt;y_train_pred&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;cross_val_predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sgd_clf&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train_5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cv&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Precision
&lt;/h3&gt;

&lt;p&gt;Precision tells you: "Out of all the predicted positives, how many were actually positives and correct?"&lt;/p&gt;

&lt;p&gt;Precision = TP/TP+FP&lt;/p&gt;

&lt;h3&gt;
  
  
  Recall
&lt;/h3&gt;

&lt;p&gt;Recall tells you: "Out of all the actual positives, how many did the model catch?"&lt;/p&gt;

&lt;p&gt;Recall = TP/TP+FN&lt;/p&gt;

&lt;h3&gt;
  
  
  Precision and Recall
&lt;/h3&gt;

&lt;p&gt;To measure the precision and recall of a model, you can use the &lt;code&gt;precision_score&lt;/code&gt; and &lt;code&gt;recall_score&lt;/code&gt; methods provided by sklearn.metrics respectively:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.metrics&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;precision_score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;recall_score&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;precision_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_train_5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train_pred&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;recall_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_train_5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train_pred&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  F1 Score
&lt;/h3&gt;

&lt;p&gt;The F1 score is a harmonic mean of precision and recall meaning the classifier will only get a high F1 score if both precision and recall are high.&lt;/p&gt;

&lt;p&gt;To measure F1 score you can use the &lt;code&gt;f1_score&lt;/code&gt; function provided by &lt;code&gt;sklearn.metrics&lt;/code&gt; library:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.metrics&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;f1_score&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;f1_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_train_5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train_pred&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The F1 score favors classifiers that have similar precision and recall.&lt;/p&gt;

&lt;h3&gt;
  
  
  Precision/Recall Trade-off
&lt;/h3&gt;

&lt;p&gt;To understand this tradeoff, let's first understand how the SGDClassifier makes it classification decisions. For each instance, it computes a score based on a decision function. If that score is greater than a threshold, it assigns the instance to the positive class, otherwise it assigns it to the negative class.&lt;/p&gt;

&lt;p&gt;This results in a problem: You have to balance precision and recall, but it's very hard to do so. If you have 99% precision, recall could be very low!&lt;/p&gt;

&lt;h3&gt;
  
  
  ROC Curve
&lt;/h3&gt;

&lt;p&gt;The ROC Curve plots the Recall against FPR (False Positive Rate). But there is a tradeoff, the higher the recall, the more false positives the classifier produces.&lt;/p&gt;

&lt;p&gt;Another way to compare classifiers is to measure the area under the curve (AUC). A perfect classifier will have a ROC-AUC of 1. Scikit-learn provides the &lt;code&gt;roc_auc_score&lt;/code&gt; function in the &lt;code&gt;sklearn.metrics&lt;/code&gt; library to measure the ROC-AUC of a model.&lt;/p&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;Finally! You have done it! You learnt about SGDClassifier, we’ve covered too many performance metrics to count and now you should be pretty comfortable making some pretty good classification projects! But this is not the end! In the next chapter, we will cover the theory behind these machine learning models, which we’ve treated as black boxes for the last two chapters. I’ll explain how they make decisions and what happens under the hood, which is really useful if you want to actually deeply understand machine learning, or if you want to go into research (Like ME!). Overall, this was a lot of fun, see you soon!&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>tutorial</category>
      <category>python</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Chapter 2: End-to-End Machine Learning Project</title>
      <dc:creator>arham jawad</dc:creator>
      <pubDate>Sat, 06 Dec 2025 15:09:29 +0000</pubDate>
      <link>https://dev.to/arham_jawad_51b88b85913b5/chapter-2-end-to-end-machine-learning-project-1b8a</link>
      <guid>https://dev.to/arham_jawad_51b88b85913b5/chapter-2-end-to-end-machine-learning-project-1b8a</guid>
      <description>&lt;h1&gt;
  
  
  Chapter 2
&lt;/h1&gt;

&lt;p&gt;In this chapter, we create a very simple project where we predict house prices given the features such as Total Number of Bedrooms, age of the house, size of the house, and much much more.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pipeline
&lt;/h2&gt;

&lt;p&gt;A pipeline is just a set of data processing components. Pipelines are used very commonly as there are lot of data transformations to apply.&lt;/p&gt;

&lt;h2&gt;
  
  
  Classification or Regression?
&lt;/h2&gt;

&lt;p&gt;First, we need to decide whether this problem is Regression or Classification. In Classification, your model chooses from a given set of categories, like Cat or Dog? Genre of a Movie. In classification, the options are limited, the model just has to predict the probability of a specific thing being in a class, so it has to choose. In Regression, your model has to predict a continuous value, meaning it doesn't choose, the number it predicts can go upto infinity and down to negative infinity. The Housing Price problem is a Regression problem as we are predicting a price, and prices can vary highly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Select a Performance Measure
&lt;/h2&gt;

&lt;p&gt;Now that we know what kind of problem it is, we need to select a performance measure. For regression, the most commonly used performance measure is the RMSE, which tells you how bad your model is, not how good it is, so the goal in this problem is to minimize your RMSE, not maximize it like accuracy.&lt;/p&gt;

&lt;h2&gt;
  
  
  Take a Look at the Data Structure
&lt;/h2&gt;

&lt;p&gt;You can start by looking at the top five rows of the data using panda's head() method.&lt;/p&gt;

&lt;p&gt;Note: You can look at more than 5 rows by specifying a number inside the head() method's bracket, the default value is 5 rows.&lt;/p&gt;

&lt;h2&gt;
  
  
  Create a Test Set
&lt;/h2&gt;

&lt;p&gt;In Machine Learning, your models learn from a training set, and then you test them on the test set, so at this point, we will create a test set, a typical split is 80/20, where we set 80% of data for training and remaining data for testing.&lt;/p&gt;

&lt;p&gt;Scikit-learn has a &lt;code&gt;train_test_split&lt;/code&gt; method inside its &lt;code&gt;sklearn.model_selection&lt;/code&gt; library. This allows you to specify your features, and your labels, and your test_size, and it automatically splits it for you:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.model_selection&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;train_test_split&lt;/span&gt;

&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_test&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;train_test_split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;test_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Data Cleaning
&lt;/h2&gt;

&lt;p&gt;Machine Learning Algorithms cannot work with missing features, so you have to take care of them, for this, you can either drop the entire rows containing missing values, drop the full feature, or set the missing values to some other value like mean, median, mode. The third option is the best and most commonly used one. Scikit-learn provides a &lt;code&gt;SimpleImputer&lt;/code&gt; inside its &lt;code&gt;sklearn.impute&lt;/code&gt; library:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.impute&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SimpleImputer&lt;/span&gt;


&lt;span class="n"&gt;imputer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SimpleImputer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;strategy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;median&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Since you can only apply median on numerical features, lets set aside a subset of our dataset containing only the numerical attributes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;

&lt;span class="n"&gt;housing_num&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;housing&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select_dtypes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;include&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;number&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;imputer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit_transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;housing_num&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Handling Text and Categorical Attributes
&lt;/h2&gt;

&lt;p&gt;Since ML Models cannot work with categorical data, we have to convert them into numbers. To do this, we can use Ordinal Encoder, which assigns a number to each category, it is useful for tasks where two nearby categories are more similar than two distant categories, like "Bad", "Average", "Good", "Excellent" etc. But in our scenerio, OneHotEncoder is better. It creates one column for each category, so in the presence of one category, the column for that category will be 1 and other will be 0 and vice versa:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.preprocessing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OneHotEncoder&lt;/span&gt;

&lt;span class="n"&gt;cat_encoder&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OneHotEncoder&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;housing_cat_1hot&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cat_encoder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit_transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;housing_cat&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Feature Scaling
&lt;/h2&gt;

&lt;p&gt;ML Models cannot work with numbers on different scales like 1, and 100,000 or 100,000 and 1,000,000,000, so we need to balance them and bring them on a similar scale like between 0 and 1. To do this, Scikit-learn provides a &lt;code&gt;StandardScaler&lt;/code&gt; class in the &lt;code&gt;sklearn.preprocessing&lt;/code&gt; library:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.preprocessing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;StandardScaler&lt;/span&gt;

&lt;span class="n"&gt;std_scaler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;StandardScaler&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;housing_num_std_scaled&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;std_scaler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit_transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;housing_num&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Transformation Pipelines
&lt;/h2&gt;

&lt;p&gt;As you can clearly see, you need to apply a lot of transformations on your data, like Imputing, Feature Scaling, Categorical Encoding, and doing them manually one by one seems redundant. So, we can use the &lt;code&gt;Pipeline&lt;/code&gt; class provided by &lt;code&gt;sklearn.pipeline&lt;/code&gt; library to group all these data transformations together:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.pipeline&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Pipeline&lt;/span&gt;

&lt;span class="n"&gt;num_pipeline&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Pipeline&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;impute&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;SimpleImputer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;strategy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;median&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;standardize&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;StandardScaler&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt;
&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It would be more convenient to handle both categorical and numerical features together, for this, we can use the &lt;code&gt;ColumnTransformer&lt;/code&gt; class provided by the &lt;code&gt;sklearn.compose&lt;/code&gt; library:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.compose&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ColumnTransformers&lt;/span&gt;

&lt;span class="n"&gt;num_attribs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;housing&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select_dtypes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;include&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;number&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;
&lt;span class="n"&gt;cat_attribs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;housing&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select_dtpes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;include&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;object&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;

&lt;span class="n"&gt;num_pipeline&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Pipeline&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;impute&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;SimpleImputer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;strategy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;median&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;standardize&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;StandardScaler&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt;
&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="n"&gt;cat_pipeline&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Pipeline&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;impute&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;SimpleImputer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;strategy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;most_frequent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;onehotencode&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;OneHotEncoder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;handle_unknown&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ignore&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="n"&gt;preprocessor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ColumnTransformer&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;num&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num_pipeline&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num_attribs&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cat&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cat_pipeline&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cat_attribs&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Select and Train a Model
&lt;/h2&gt;

&lt;p&gt;Finally! Its time for us to select and train a machine learning model on our data! In our case, we have used the XGBoost Regressor which has same syntax as a RandomForestRegressor:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nc"&gt;XGBRegressor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;n_estimators&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;330&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;max_depth&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Evaluate on the Test Set
&lt;/h2&gt;

&lt;p&gt;I evaluated our model on the test set, and here are the results, not bad!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;RMSE: 764340224.0, RMSE: 27646.703125
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>ai</category>
      <category>python</category>
    </item>
  </channel>
</rss>
