<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Migot Ndede</title>
    <description>The latest articles on DEV Community by Migot Ndede (@gm_ndede_3d7307448f4fda4a).</description>
    <link>https://dev.to/gm_ndede_3d7307448f4fda4a</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2908840%2F04dbbdbb-0f74-4656-bb50-f3cf502ee05f.jpg</url>
      <title>DEV Community: Migot Ndede</title>
      <link>https://dev.to/gm_ndede_3d7307448f4fda4a</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/gm_ndede_3d7307448f4fda4a"/>
    <language>en</language>
    <item>
      <title>Data Cleaning Part I.</title>
      <dc:creator>Migot Ndede</dc:creator>
      <pubDate>Mon, 12 May 2025 17:51:32 +0000</pubDate>
      <link>https://dev.to/gm_ndede_3d7307448f4fda4a/when-is-it-necessary-to-split-a-dataset-for-analysis-is-it-before-or-after-we-clean-the-data-p31</link>
      <guid>https://dev.to/gm_ndede_3d7307448f4fda4a/when-is-it-necessary-to-split-a-dataset-for-analysis-is-it-before-or-after-we-clean-the-data-p31</guid>
      <description>&lt;p&gt;This is a multipart series highlighting the processes involved in Cleaning data for Analysis. &lt;/p&gt;

&lt;p&gt;Data cleaning refers to the process of identifying and correcting inaccuracies, inconsistencies, and errors in a dataset to improve its readability, quality, reliability, and robustness. &lt;/p&gt;

&lt;p&gt;Data wrangling, also known as data munging, is the process of transforming raw, messy data into a clean, usable format for analysis and decision-making. It involves a range of techniques like cleaning, transforming, and restructuring data to ensure it is reliable, accurate, and consistent. Essentially, data wrangling prepares data for further processing, modeling, and analysis. &lt;/p&gt;

&lt;p&gt;Benefits of data cleaning include more accurate decision-making, increased productivity, and improved data-driven insights. &lt;/p&gt;

&lt;p&gt;In Python, the most popular library for cleaning data is Pandas, alongside others such as Scikit-learn, Pyjanitor, SciPy, DataPrep, CleanLab, Scrubadub, DataCleaner, CleanPrep, and many more. Data cleaning with Pandas involves identifying and correcting errors, inconsistencies, and missing values in a dataset to ensure its accuracy and reliability for further analysis.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Common data cleaning tasks using pandas include:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;i) Handling Missing values:&lt;/strong&gt;&lt;br&gt;
     a) Identifying missing values using &lt;em&gt;isnull()&lt;/em&gt; and/or &lt;br&gt;
        &lt;em&gt;isna()&lt;/em&gt; functions.&lt;br&gt;
     b) Finding and filling missing values using the &lt;em&gt;fillna()&lt;/em&gt; &lt;br&gt;
        function with a specific value, the mean, median, or mode, &lt;br&gt;
        or any other appropriate strategy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Additional fillna() Options&lt;/strong&gt;&lt;br&gt;
     &lt;strong&gt;inplace=True&lt;/strong&gt;: Modifies the DataFrame directly without &lt;br&gt;
     creating a new one.&lt;br&gt;
     &lt;strong&gt;method='ffill'&lt;/strong&gt; or &lt;strong&gt;method='pad'&lt;/strong&gt;: Fills NaN values &lt;br&gt;
     with the previous valid value.&lt;br&gt;
     &lt;strong&gt;method='bfill'&lt;/strong&gt; or &lt;strong&gt;method='backfill'&lt;/strong&gt;: Fills NaN &lt;br&gt;
     values with the next valid value.&lt;br&gt;
     &lt;strong&gt;limit:&lt;/strong&gt; Sets the maximum number of consecutive NaN &lt;br&gt;
     values to fill.&lt;br&gt;
     c) Removing rows or columns with missing values using &lt;br&gt;
        &lt;em&gt;dropna()&lt;/em&gt; function.&lt;/p&gt;
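
&lt;p&gt;As a minimal sketch of these options on an invented Series (note that in recent Pandas versions, the dedicated &lt;em&gt;ffill()&lt;/em&gt; and &lt;em&gt;bfill()&lt;/em&gt; methods are preferred over the deprecated &lt;em&gt;method=&lt;/em&gt; argument to &lt;em&gt;fillna()&lt;/em&gt;):&lt;/p&gt;

```python
import pandas as pd

# A small Series with gaps, purely for illustration
s = pd.Series([1.0, None, None, 4.0, None])

# Fill every missing value with a constant
print(s.fillna(0).tolist())       # [1.0, 0.0, 0.0, 4.0, 0.0]

# Forward-fill: propagate the previous valid value
print(s.ffill().tolist())         # [1.0, 1.0, 1.0, 4.0, 4.0]

# Back-fill: pull the next valid value backwards
# (the last element stays NaN because nothing follows it)
print(s.bfill().tolist())         # [1.0, 4.0, 4.0, 4.0, nan]

# limit caps how many consecutive NaNs get filled
print(s.ffill(limit=1).tolist())  # [1.0, 1.0, nan, 4.0, 4.0]
```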

&lt;p&gt;&lt;strong&gt;ii) Removing Duplicates:&lt;/strong&gt; &lt;br&gt;
    a) Identifying duplicate rows using the &lt;em&gt;duplicated()&lt;/em&gt; &lt;br&gt;
      function.&lt;br&gt;
    b) Removing duplicate rows using &lt;em&gt;drop_duplicates()&lt;/em&gt; &lt;br&gt;
      function.&lt;/p&gt;
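
&lt;p&gt;A quick sketch of both functions on a hypothetical frame (the names and scores are made up for illustration):&lt;/p&gt;

```python
import pandas as pd

# Hypothetical frame in which the third row repeats the first
df = pd.DataFrame({'name': ['Ann', 'Ben', 'Ann'],
                   'score': [90, 85, 90]})

# duplicated() flags rows that repeat an earlier row
print(df.duplicated().tolist())   # [False, False, True]

# drop_duplicates() keeps the first occurrence by default
df_unique = df.drop_duplicates()
print(len(df_unique))             # 2
```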

&lt;p&gt;&lt;strong&gt;iii) Correcting Data Types:&lt;/strong&gt; &lt;br&gt;
    a) Checking column data types using the &lt;em&gt;dtypes&lt;/em&gt; attribute.&lt;br&gt;
    b) Converting data types using &lt;em&gt;astype()&lt;/em&gt; function to ensure &lt;br&gt;
       consistency and also enable proper data analysis.&lt;/p&gt;
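
&lt;p&gt;For instance (a hypothetical frame where the numbers arrived as strings, a common artifact of CSV imports):&lt;/p&gt;

```python
import pandas as pd

# Numeric data stored as strings, as often happens after a CSV import
df = pd.DataFrame({'year': ['2021', '2022'], 'gdp': ['108', '48.77']})
print(df.dtypes)             # both columns start out as object

# Convert to proper numeric types with astype()
df['year'] = df['year'].astype(int)
df['gdp'] = df['gdp'].astype(float)
print(df['gdp'].sum())       # arithmetic now works on the converted column
```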

&lt;p&gt;&lt;strong&gt;iv) Handling Outliers:&lt;/strong&gt; &lt;br&gt;
    a) Identifying outliers using statistical methods such as the &lt;br&gt;
       Interquartile Range (IQR) method or the Z-score method, or &lt;br&gt;
       using visualizations like box plots.&lt;br&gt;
    b) Removing or transforming outliers based on the context &lt;br&gt;
       and analysis objectives.&lt;/p&gt;
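
&lt;p&gt;A minimal sketch of the IQR method (the values are invented, with 95 deliberately planted as the outlier):&lt;/p&gt;

```python
import pandas as pd

# Illustrative data with one planted extreme value (95)
s = pd.Series([10, 12, 11, 13, 12, 95])

# Values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are treated as outliers
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

mask = s.between(lower, upper)   # True for in-range values
print(s[~mask].tolist())         # [95] is flagged as an outlier

# One option: keep only the in-range values
s_clean = s[mask]
```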

&lt;p&gt;&lt;strong&gt;v) Clean Text Data and Formatting:&lt;/strong&gt;&lt;br&gt;
    a) Removing leading/trailing spaces using &lt;em&gt;strip(), lstrip(), &lt;br&gt;
       rstrip()&lt;/em&gt;.&lt;br&gt;
    b) You may opt to convert the text to upper or lowercase for &lt;br&gt;
       data consistency, e.g., &lt;em&gt;lower()&lt;/em&gt; or &lt;em&gt;upper()&lt;/em&gt;. &lt;br&gt;
    c) Replacing specific characters or patterns using &lt;br&gt;
       &lt;em&gt;replace()&lt;/em&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;  &lt;span class="c1"&gt;#Example of replace() function in Python.
&lt;/span&gt;  &lt;span class="n"&gt;string&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The Quick Brown Fox Jumped Over the Lazy Dog&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
  &lt;span class="n"&gt;new_string&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;string&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Over&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Under&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;new_string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Output: sample string sample
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;vi) Renaming Columns:&lt;/strong&gt;&lt;br&gt;
   a) Renaming columns to meaningful names using rename() &lt;br&gt;
      function.&lt;/p&gt;
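
&lt;p&gt;For instance (the column names here are hypothetical):&lt;/p&gt;

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})

# Map terse names to descriptive ones; unlisted columns are untouched
df = df.rename(columns={'A': 'gdp_billions', 'B': 'pop_millions'})
print(df.columns.tolist())   # ['gdp_billions', 'pop_millions']
```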

&lt;p&gt;&lt;strong&gt;vii) Removing or Avoiding Irrelevant Columns:&lt;/strong&gt;&lt;br&gt;
   a) Removing irrelevant columns that are not needed for &lt;br&gt;
      analysis using the &lt;em&gt;drop()&lt;/em&gt; function.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;   &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;

  &lt;span class="c1"&gt;# Sample Data-frame with potential data cleaning issues
&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;A&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; 
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;B&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; 
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;C&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Size(Sq.Miles)&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;224961&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;93065&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;365755&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;248777&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10169&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Country&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; Kenya &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Uganda &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Tanzania &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;S. Sudan &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Rwanda &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;GDP(2023) in Billions&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;108&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;48.77&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;79.06&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;4.7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;14.1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Pop(2023) in Millions&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;55.34&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;66.62&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;48.66&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;13.95&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;11.48&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;

  &lt;span class="c1"&gt;# convert the dataset into a dataframe        
&lt;/span&gt;  &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

  &lt;span class="c1"&gt;# Drop rows with missing values
&lt;/span&gt;  &lt;span class="n"&gt;df_cleaned&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dropna&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

  &lt;span class="c1"&gt;# Fill missing values with 0
&lt;/span&gt;  &lt;span class="n"&gt;df_filled&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fillna&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

  &lt;span class="c1"&gt;# Remove duplicate rows
&lt;/span&gt;  &lt;span class="n"&gt;df_no_duplicates&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;drop_duplicates&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;subset&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;GDP(2023) in 
  Billions&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Pop(2023) in Millions&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

  &lt;span class="c1"&gt;# Strip whitespace from column 'Country'
&lt;/span&gt;  &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Country&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Country&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

  &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Original DataFrame:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-------------------&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Cleaned DataFrame (missing values dropped):&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-------------------------------------------&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df_cleaned&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

  &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Cleaned DataFrame (missing values filled with 0):&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;---------------------------------------------------&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df_filled&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

  &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Cleaned DataFrame (duplicates dropped):&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;---------------------------------------&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df_no_duplicates&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When working with a dataset, it is important to identify its rows and columns and their data types, particularly when working with data structures like Pandas DataFrames: &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Using dtypes&lt;/strong&gt;&lt;br&gt;
The &lt;em&gt;.dtypes&lt;/em&gt; attribute in Pandas is the most direct way to inspect the data type of each column in a DataFrame.&lt;/p&gt;

&lt;p&gt;Below is how this helps you identify the data types you are working with.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;   &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;

   &lt;span class="c1"&gt;# Sample DataFrame
&lt;/span&gt;   &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

   &lt;span class="c1"&gt;# Identify column data types
&lt;/span&gt;   &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dtypes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It is also imperative to understand the dataset you plan to work with and to identify missing or null values. The &lt;em&gt;info()&lt;/em&gt; method reports, for each column, how many non-null values it contains, which shows you where values are missing.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;   &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Below is the output of the above code snippet. As you can see, the first three columns (A, B, and C) are missing some values, while the rest of the columns have 5 non-null values each, i.e., a full column.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl1mb4g2qni6p3zbink15.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl1mb4g2qni6p3zbink15.png" alt="Image description" width="730" height="363"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Below is a screenshot of what you shall see to help determine the data types you have. Columns A to C are of the float64 data type, which means they hold decimal numbers; the Size column is int64 (whole numbers); and Country is an object, which means it holds string values. The GDP and Pop columns are also float64.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fddhagz4w4gto30saso9x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fddhagz4w4gto30saso9x.png" alt="Image description" width="618" height="228"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The &lt;em&gt;shape&lt;/em&gt; attribute (note that it is an attribute, not a function, so it takes no parentheses) tells you how many rows and columns a dataset contains.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;  &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The snippet above helps you determine the size of the dataset you are working with: (5, 7) in our case represents &lt;strong&gt;5 rows&lt;/strong&gt; and &lt;strong&gt;7 columns&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Using the &lt;em&gt;head()&lt;/em&gt; and &lt;em&gt;tail()&lt;/em&gt; functions gives us a snapshot of the first 5 and the last 5 rows of the dataset respectively (by default), assuming that your DataFrame is stored in a variable, df.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;   &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgbes9mlm1e7il97w2kfz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgbes9mlm1e7il97w2kfz.png" alt="Image description" width="800" height="247"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;   &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tail&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6e24mtrgbdrwa74lyj5b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6e24mtrgbdrwa74lyj5b.png" alt="Image description" width="800" height="247"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To find out how many values in each column are empty or null, you can employ one of the following strategies, assuming that your DataFrame is stored in a variable, df.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;   &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;isnull&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnoyav4zjj4ijdr1gtgry.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnoyav4zjj4ijdr1gtgry.png" alt="Image description" width="318" height="485"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;OR&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;   &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;isna&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4k6fmu9pq5rqz5kl16x5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4k6fmu9pq5rqz5kl16x5.png" alt="Image description" width="288" height="492"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Either of the functions, &lt;em&gt;isnull()&lt;/em&gt; or &lt;em&gt;isna()&lt;/em&gt;, will work just fine; &lt;em&gt;isna()&lt;/em&gt; is simply an alias of &lt;em&gt;isnull()&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In Summary&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This brings us to the end of Part I. Data cleaning, also known as data cleansing or scrubbing, refers to the process of identifying and correcting inconsistencies and removing errors in a dataset to improve its quality and usability. It involves removing duplicates, handling missing values, dropping irrelevant data, fixing incorrect formats, and standardizing entries for consistency. The goal is to ensure data accuracy, completeness, and consistency, making it suitable for analysis and decision-making.&lt;/p&gt;

</description>
      <category>blogging</category>
      <category>coverimages</category>
      <category>article</category>
    </item>
    <item>
      <title>When is it necessary to split a dataset for Analysis? Is it before, or after we clean the data? That is the question.</title>
      <dc:creator>Migot Ndede</dc:creator>
      <pubDate>Mon, 05 May 2025 16:55:01 +0000</pubDate>
      <link>https://dev.to/gm_ndede_3d7307448f4fda4a/adding-a-cover-image-for-your-devto-articles-5cmh</link>
      <guid>https://dev.to/gm_ndede_3d7307448f4fda4a/adding-a-cover-image-for-your-devto-articles-5cmh</guid>
      <description>&lt;h2&gt;
  
  
  Data Analysis
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;What is data?&lt;/em&gt; &lt;br&gt;
First and foremost, let us find out what Data is and is not. &lt;/p&gt;

&lt;p&gt;Data, in its simplest form, is raw, unprocessed facts and figures. It can be numbers, text, images, or any other form of information that can be stored and processed by computers. Data becomes meaningful information when it is analyzed, interpreted, and placed in context. Therefore, data is the basic unit of information before it has been organized, analyzed, or interpreted. &lt;/p&gt;

&lt;p&gt;&lt;em&gt;What is information?&lt;/em&gt;&lt;br&gt;
Information, on the other hand, is the result of taking raw data and transforming it into a meaningful, usable format for analysis. The process may involve interpreting, organizing, and contextualizing data. &lt;/p&gt;

&lt;p&gt;In data science, splitting your dataset effectively is an important initial step towards building a robust model. Generally, you'll want to allocate a larger portion of your data for training, very often around 70%-80%, with the remaining 20%-30% for testing. This allows the model to learn from a substantial amount of data while still retaining enough unique data points to test its predictions. &lt;br&gt;
However, this split may depend on the data size and its diversity. For smaller datasets, you may need to use techniques like &lt;strong&gt;cross-validation&lt;/strong&gt; to maximize the use of your data for training while still getting a reliable estimate of model performance. To accomplish this, we can use the &lt;em&gt;train_test_split()&lt;/em&gt; function from the scikit-learn library.&lt;/p&gt;

&lt;p&gt;You use it to split your dataset into training and test subsets, which enables unbiased model evaluation and validation. &lt;strong&gt;X_train and y_train&lt;/strong&gt; are the parts of your dataset that you use to train, or fit, your machine learning model.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;#import the needed libraries 
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.model_selection&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;train_test_split&lt;/span&gt;
   &lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_test&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;train_test_split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;test_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Explanation of the above variables&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;i) &lt;strong&gt;X and y&lt;/strong&gt; are your &lt;strong&gt;feature&lt;/strong&gt; and &lt;strong&gt;target&lt;/strong&gt; variables respectively.&lt;br&gt;
ii) &lt;strong&gt;test_size=0.2&lt;/strong&gt; specifies that 20% of the data should be allocated to the test set.&lt;br&gt;
iii) &lt;strong&gt;random_state=42&lt;/strong&gt; ensures that the split is reproducible. &lt;/p&gt;

&lt;p&gt;The above &lt;strong&gt;train_test_split()&lt;/strong&gt; function returns four new variables: &lt;strong&gt;X_train, X_test, y_train, and y_test&lt;/strong&gt;. These represent the training features, testing features, training target, and testing target, respectively. &lt;/p&gt;

&lt;p&gt;By using the train_test_split() function, you can effectively divide your data into separate training and testing sets, allowing you to train your models on one set and evaluate their performance on the other. &lt;br&gt;
Below is sample code showing how the whole process goes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;   &lt;span class="c1"&gt;#import libraries
&lt;/span&gt;   &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.model_selection&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;train_test_split&lt;/span&gt;
   &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;

   &lt;span class="c1"&gt;# Create a sample dataset
&lt;/span&gt;   &lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;array&lt;/span&gt;&lt;span class="p"&gt;([[&lt;/span&gt;&lt;span class="mi"&gt;11&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;13&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;14&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;17&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;18&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;19&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;]])&lt;/span&gt;
   &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;array&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;a&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;b&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;a&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;b&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;a&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

   &lt;span class="c1"&gt;# Split the dataset into training and testing sets
&lt;/span&gt;   &lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_test&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;train_test_split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
   &lt;span class="n"&gt;test_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

   &lt;span class="c1"&gt;# Print the shapes of the resulting arrays
&lt;/span&gt;   &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;X_train shape:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
   &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;X_test shape:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
   &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;y_train shape:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
   &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;y_test shape:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
   &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;------------------------------&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
   &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
   &lt;span class="c1"&gt;# Print the resulting arrays
&lt;/span&gt;   &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;X_train:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
   &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;------------------------------&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
   &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
   &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;X_test:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
   &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;------------------------------&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
   &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
   &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;y_train:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
   &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;------------------------------&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
   &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
   &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;y_test:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Types of Data:&lt;/strong&gt;&lt;br&gt;
Data can be broadly categorized as qualitative (descriptive, non-numerical) or quantitative (numerical, measurable). &lt;/p&gt;

&lt;p&gt;In Machine Learning, Data Analysis, and Data Science, it is generally recommended that you split the dataset before you start cleaning and pre-processing it. This helps prevent data leakage, where information from the test set influences the training set. &lt;/p&gt;

&lt;p&gt;For example, if you scale data before splitting, the scaling parameters (like the mean and standard deviation) would be calculated over the entire dataset, including the test set, which compromises the model's ability to generalize to unseen data.&lt;/p&gt;
&lt;h2&gt;
  &lt;strong&gt;Explanation:&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Avoiding Data Leakage:&lt;/strong&gt;&lt;br&gt;
Splitting before cleaning ensures that the training set is independent of the test set. This prevents the model from "seeing" information from the test set during training, which could lead to overfitting and poor performance on new, unseen data. (Overfitting occurs when a model learns the training data too well, including its noise and outliers; the model memorizes the training data instead of learning the underlying patterns, leading to inaccurate predictions on new data.) &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Global Pre-processing:&lt;/strong&gt;&lt;br&gt;
Some cleaning and pre-processing steps, like fixing structural errors or removing exact duplicates, do not learn parameters from the data, so they can be applied globally across the entire dataset before splitting without creating inconsistencies between the training and test sets. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Local Pre-processing:&lt;/strong&gt;&lt;br&gt;
Other pre-processing steps, like scaling, are done locally: the transformation is fitted on the training set and then applied, with the same learned parameters, to the test set. (Data scaling is the process of transforming numerical values to a specific range or distribution, such as 0-1 or a mean of 0 and a standard deviation of 1, often done to improve the performance of machine learning models.) These steps should be performed after splitting to avoid data leakage. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Why Split Before?&lt;/strong&gt;&lt;br&gt;
Splitting before cleaning and pre-processing allows you to use the training set to learn the necessary transformations and then apply those same transformations to the test set. This ensures that the model is evaluated on data that it has not "seen" during training.&lt;/p&gt;
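A minimal sketch of this fit-on-train, apply-to-test pattern, using a toy array and scikit-learn's StandardScaler (the data and variable names here are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy dataset: 10 samples, 2 features
X = np.arange(20, dtype=float).reshape(10, 2)
y = np.array([0, 1] * 5)

# Split FIRST, before any parameter-learning transformation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # parameters learned from the training set only
X_test_scaled = scaler.transform(X_test)        # the SAME learned parameters reused on the test set
```

Note the asymmetry: `fit_transform()` on the training set, but only `transform()` on the test set, so the test data never influences the learned mean and standard deviation.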

&lt;p&gt;Now, let us explore the possibilities, and what would be considered the best option and practice.&lt;/p&gt;

&lt;p&gt;Data pre-processing involves cleaning and transforming raw data into a usable format for analysis, improving both accuracy and efficiency. It addresses issues like missing values, inconsistencies, and outliers in the data, preparing it for subsequent tasks like machine learning and model training.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. For what Purpose?&lt;/strong&gt;&lt;br&gt;
   a) &lt;em&gt;Improve Data Quality:&lt;/em&gt; Addressing inaccuracies, inconsistencies, and errors in the data. &lt;/p&gt;

&lt;p&gt;b) &lt;em&gt;Enhance Model Performance:&lt;/em&gt; Preparing data for machine learning algorithms helps make it easier for the algorithm to understand and learn. &lt;/p&gt;

&lt;p&gt;c) &lt;em&gt;Streamline Analysis:&lt;/em&gt; Ensuring data is in a format suitable for analysis and visualization - so it is in the right consumable format. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Key Steps:&lt;/strong&gt;&lt;br&gt;
   a) &lt;strong&gt;Data Cleaning:&lt;/strong&gt; &lt;br&gt;
       i) &lt;em&gt;Handling Missing Values:&lt;/em&gt; Imputing or removing &lt;br&gt;
           missing data points.&lt;br&gt;
      ii) &lt;em&gt;Identifying and Correcting Errors:&lt;/em&gt; Addressing &lt;br&gt;
          inconsistencies, outliers, and other data quality &lt;br&gt;
          issues.&lt;br&gt;
      iii) &lt;em&gt;Removing Duplicates:&lt;/em&gt; Ensuring each record is &lt;br&gt;
           unique.&lt;/p&gt;
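The three cleaning steps above can be sketched with pandas on a small made-up DataFrame (the column names and values here are purely illustrative):

```python
import numpy as np
import pandas as pd

# Toy data: one missing value and one exact duplicate row
df = pd.DataFrame({
    'age':  [25, np.nan, 31, 31, 120],
    'city': ['Nairobi', 'Kisumu', 'Nairobi', 'Nairobi', 'Kisumu'],
})

# iii) Removing duplicates: keep one copy of each identical record
df = df.drop_duplicates()

# i) Handling missing values: impute with the column median
df['age'] = df['age'].fillna(df['age'].median())
```

`drop_duplicates()` removes the repeated `(31, 'Nairobi')` row, and `fillna()` replaces the missing age with the median of the remaining values.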

&lt;p&gt;b) &lt;strong&gt;Data Transformation:&lt;/strong&gt;&lt;br&gt;
        i) &lt;em&gt;Feature Scaling:&lt;/em&gt; Normalizing or standardizing &lt;br&gt;
           numerical features to a common scale.&lt;br&gt;
       ii) &lt;em&gt;One-Hot Encoding:&lt;/em&gt; Converting categorical data into &lt;br&gt;
           numerical representations.&lt;br&gt;
      iii) &lt;em&gt;Data Transformation:&lt;/em&gt; Applying functions to modify &lt;br&gt;
           the values of features, e.g., taking logarithms or &lt;br&gt;
           square roots.&lt;/p&gt;

&lt;p&gt;c) &lt;strong&gt;Feature Engineering:&lt;/strong&gt;&lt;br&gt;
        i) Is the creation of new features from existing ones to &lt;br&gt;
           help improve model performance.&lt;br&gt;
       ii) &lt;em&gt;Data Integration:&lt;/em&gt; Combining data from multiple &lt;br&gt;
           sources into a single dataset for easy manipulation &lt;br&gt;
           and analysis.&lt;br&gt;
      iii) &lt;em&gt;Data Reduction:&lt;/em&gt; Reducing the dimensionality of the &lt;br&gt;
           data to improve model efficiency. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Examples:&lt;/strong&gt;&lt;br&gt;
   a) &lt;em&gt;Filling Missing Values:&lt;/em&gt;&lt;br&gt;
      Replacing missing values with the mean, median, or a &lt;br&gt;
      predicted value based on other features.&lt;/p&gt;

&lt;p&gt;b) &lt;em&gt;Removing Outliers:&lt;/em&gt;&lt;br&gt;
      Identifying and removing data points that are &lt;br&gt;
      significantly different from the rest of the data.&lt;/p&gt;
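One common way to flag such points is the interquartile range (IQR) rule; a small sketch on made-up numbers (the 1.5 multiplier is the conventional choice, not the only one):

```python
import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 11, 300])  # 300 is an obvious outlier

# Keep only points within [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
mask = (s >= q1 - 1.5 * iqr) & (s <= q3 + 1.5 * iqr)
cleaned = s[mask]
```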

&lt;p&gt;c) &lt;em&gt;Scaling Data:&lt;/em&gt;&lt;br&gt;
      Transforming numerical features to a common scale (e.g., &lt;br&gt;
      0-1 or -1-1) using techniques like Min-Max Scaling or &lt;br&gt;
      Standardization.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;        &lt;span class="c1"&gt;#Scaling using Min-Max Scaling Technique
&lt;/span&gt;        &lt;span class="c1"&gt;#import the needed packages and libraries
&lt;/span&gt;        &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
        &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.preprocessing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;MinMaxScaler&lt;/span&gt;

        &lt;span class="c1"&gt;# A Sample DataFrame
&lt;/span&gt;        &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;column1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;400&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; 
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;column2&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;101&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;202&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;303&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;404&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;505&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;
        &lt;span class="n"&gt;dataframe&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;#Initialize MinMaxScaler function
&lt;/span&gt;        &lt;span class="n"&gt;scaler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;MinMaxScaler&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

        &lt;span class="c1"&gt;# Fit and transform the desired dataset columns
&lt;/span&gt;        &lt;span class="n"&gt;dataframe&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;column1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;column2&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; 
        &lt;span class="n"&gt;scaler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit_transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dataframe&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;column1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;column2&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]])&lt;/span&gt;

        &lt;span class="c1"&gt;#print(dataframe)
&lt;/span&gt;        &lt;span class="n"&gt;dataframe&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;d) &lt;em&gt;Encoding Categorical Data:&lt;/em&gt;&lt;br&gt;
      Converting categorical features into numerical &lt;br&gt;
      representations, for example, using one-hot encoding. One-hot &lt;br&gt;
      encoding is a process of converting categorical variables into &lt;br&gt;
      a binary matrix. Each category is represented by a new column, &lt;br&gt;
      and each row is marked with a 1 or a 0 depending on whether it &lt;br&gt;
       belongs to that category. This is useful because many machine &lt;br&gt;
       learning algorithms cannot work with categorical data &lt;br&gt;
       directly.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;       &lt;span class="c1"&gt;# One-hot encoding
&lt;/span&gt;       &lt;span class="c1"&gt;# import pandas package library
&lt;/span&gt;       &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;

       &lt;span class="c1"&gt;# Sample dataset
&lt;/span&gt;       &lt;span class="n"&gt;dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Color&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Red&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Blue&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Green&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Red&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; 
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Size&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Small&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Medium&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Large&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Medium&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;
       &lt;span class="n"&gt;dataframe&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

       &lt;span class="c1"&gt;# One-hot encode the 'color' column
&lt;/span&gt;       &lt;span class="n"&gt;dataframe&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_dummies&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dataframe&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Color&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

       &lt;span class="c1"&gt;# Print the result
&lt;/span&gt;       &lt;span class="c1"&gt;#print(dataframe)
&lt;/span&gt;       &lt;span class="n"&gt;dataframe&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;        &lt;span class="c1"&gt;# convert categorical features into numerical features
&lt;/span&gt;        &lt;span class="c1"&gt;# import the needed packages and libraries 
&lt;/span&gt;        &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.preprocessing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LabelEncoder&lt;/span&gt;
        &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;

       &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Gender&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Male&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Female&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Female&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Male&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;
       &lt;span class="n"&gt;dataframe&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

       &lt;span class="n"&gt;gender&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LabelEncoder&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
       &lt;span class="n"&gt;dataframe&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Gender_Encoded&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; 
       &lt;span class="n"&gt;gender&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit_transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dataframe&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Gender&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

       &lt;span class="c1"&gt;#print(dataframe)
&lt;/span&gt;       &lt;span class="n"&gt;dataframe&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;4. Why do we even care?&lt;/strong&gt;&lt;br&gt;
   a) &lt;em&gt;Improved Model Accuracy:&lt;/em&gt; Pre-processing can significantly &lt;br&gt;
      improve the accuracy of machine learning models.&lt;/p&gt;

&lt;p&gt;b) &lt;em&gt;Enhanced Model Performance:&lt;/em&gt; Pre-processing can make models &lt;br&gt;
     faster and more efficient.&lt;/p&gt;

&lt;p&gt;c) &lt;em&gt;Better Interpretability:&lt;/em&gt; Cleaned and transformed data is &lt;br&gt;
      easier to understand and interpret. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When is the best time to do Feature Selection?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In a dataset, a feature is a measurable property or characteristic of the data points. It's also known as a &lt;strong&gt;variable or attribute&lt;/strong&gt;, representing a definable quality that can vary within the dataset. Features can be used to describe and understand the data, and they are often used as inputs to machine learning models. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key aspects of features:&lt;/strong&gt;&lt;br&gt;
   a) &lt;em&gt;Measurable properties:&lt;/em&gt; Features are quantifiable &lt;br&gt;
      characteristics, like age, height, or temperature.&lt;br&gt;
   b) &lt;em&gt;Variables:&lt;/em&gt; Their values can change from one data point &lt;br&gt;
     to another.&lt;br&gt;
   c) &lt;em&gt;Attributes:&lt;/em&gt; These describe the data points in a dataset.&lt;br&gt;
   d) &lt;em&gt;Inputs to models:&lt;/em&gt; In machine learning, features are &lt;br&gt;
      often used as inputs to train and predict outcomes.  &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Examples of Features:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;a) In a medical dataset:&lt;br&gt;
Features could include patient age, gender, blood pressure, cholesterol levels, etc. &lt;/p&gt;

&lt;p&gt;b) In a weather dataset:&lt;br&gt;
Features could include temperature, humidity, wind speed, cloud coverage, etc. &lt;/p&gt;

&lt;p&gt;c) In a student performance dataset:&lt;br&gt;
Features could include student attendance, grades, age, GPA, etc.&lt;/p&gt;

&lt;p&gt;d) In a dataset of employee records:&lt;br&gt;
Features could include age, location, salary, title, performance metrics, etc., &lt;a href="https://www.ibm.com/think/topics/feature-selection" rel="noopener noreferrer"&gt;according to IBM&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;Feature selection is an important step in machine learning that involves selecting a subset of relevant features from the original feature set, reducing the feature space while improving the model’s performance and lowering computational cost. It’s a critical step, especially when dealing with high-dimensional data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When should feature selection be done?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Perform feature selection during the model training process. Feature selection can be integrated into model training so that the model dynamically selects the most relevant features, according to &lt;a href="https://www.geeksforgeeks.org/feature-selection-techniques-in-machine-learning/" rel="noopener noreferrer"&gt;Geeks for Geeks&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Feature selection helps by improving the model’s accuracy, since the model learns from the most informative features rather than all of them, and by increasing interpretability.&lt;/p&gt;

&lt;p&gt;The best practice is to do feature selection after splitting the data; otherwise, information can leak from the test set into the selection.&lt;/p&gt;

&lt;p&gt;Additionally, if the selected features change from one task or run to the next, no generalization of feature importance can be made, which is not desirable.&lt;/p&gt;

&lt;p&gt;There is a caveat: if only the training set is used for feature selection, the test set may contain records that contradict the selection made on the training set, since the overall historical data is not analyzed.&lt;/p&gt;

&lt;p&gt;Even so, using the full dataset (including the test set) for feature selection is not recommended, because it can lead to an overly optimistic model and potentially poor generalization on unseen data. Feature selection should be performed on the training set only, to prevent information leakage and maintain an unbiased evaluation of the model's performance. &lt;/p&gt;
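A sketch of training-set-only feature selection, here using scikit-learn's SelectKBest as one possible technique (the synthetic data and the choice of k=2 are illustrative assumptions):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split

# Synthetic data: only the first of 5 features is informative
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = (X[:, 0] + 0.1 * rng.normal(size=100) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Fit the selector on the training split ONLY...
selector = SelectKBest(f_classif, k=2)
X_train_sel = selector.fit_transform(X_train, y_train)
# ...then keep the same columns in the test split
X_test_sel = selector.transform(X_test)
```

Because `fit_transform()` sees only `X_train`, the test set cannot influence which features survive, which keeps the final evaluation unbiased.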

&lt;p&gt;In machine learning, feature importance scores are used to determine the relative importance of each feature in a dataset when building a predictive model. These scores are calculated using a variety of techniques, such as decision trees, random forests, linear models, and neural networks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How do you evaluate feature importance?&lt;/strong&gt;&lt;br&gt;
Permutation feature importance is calculated by noting the increase or decrease in error when we permute (shuffle) the values of a feature. If permuting the values causes a large change in the error, the feature is important to our model; otherwise it is not.&lt;/p&gt;
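This permutation idea can be sketched with scikit-learn's permutation_importance helper on made-up data (the model and dataset here are illustrative assumptions):

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression

# Synthetic data: only the first of 3 features determines the label
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] > 0).astype(int)

model = LogisticRegression().fit(X, y)

# Shuffle each feature in turn and measure the drop in accuracy
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
```

Shuffling the informative first feature destroys the model's accuracy, so its mean importance score comes out far higher than those of the noise features.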

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;br&gt;
In a nutshell, splitting a dataset is necessary for several reasons, especially in machine learning, to ensure accurate model evaluation and prevent overfitting. By dividing the dataset into training, validation, and testing sets, you can train a model on one portion, fine-tune it on another (validation), and then evaluate its performance on unseen data (testing), giving a more realistic assessment of its generalization ability. And yes, it is better to split the data into training and testing sets before steps like scaling and imputation: those steps should learn their parameters from the training set, and the same parameters should then be applied to the testing set.&lt;/p&gt;

&lt;p&gt;If you liked the article, please join the discussion or share any insights in the comments below. If you would like us to cover other aspects of Python, ML, AI, or any related topics of interest to you, please let us know by tagging or adding comments and notes below. Your feedback helps a lot in learning and development.&lt;/p&gt;

</description>
      <category>blogging</category>
      <category>coverimages</category>
      <category>article</category>
    </item>
  </channel>
</rss>
