<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Migot Ndede</title>
    <description>The latest articles on DEV Community by Migot Ndede (@gm_ndede_3d7307448f4fda4a).</description>
    <link>https://dev.to/gm_ndede_3d7307448f4fda4a</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2908840%2F04dbbdbb-0f74-4656-bb50-f3cf502ee05f.jpg</url>
      <title>DEV Community: Migot Ndede</title>
      <link>https://dev.to/gm_ndede_3d7307448f4fda4a</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/gm_ndede_3d7307448f4fda4a"/>
    <language>en</language>
    <item>
      <title>Data Cleaning Part I.</title>
      <dc:creator>Migot Ndede</dc:creator>
      <pubDate>Mon, 12 May 2025 17:51:32 +0000</pubDate>
      <link>https://dev.to/gm_ndede_3d7307448f4fda4a/when-is-it-necessary-to-split-a-dataset-for-analysis-is-it-before-or-after-we-clean-the-data-p31</link>
      <guid>https://dev.to/gm_ndede_3d7307448f4fda4a/when-is-it-necessary-to-split-a-dataset-for-analysis-is-it-before-or-after-we-clean-the-data-p31</guid>
      <description>&lt;p&gt;This is a multipart series highlighting the processes involved in Cleaning data for Analysis. &lt;/p&gt;

&lt;p&gt;Data cleaning refers to the process of identifying and correcting inaccuracies, inconsistencies, and errors in a dataset to improve its readability, quality, reliability, and robustness. &lt;/p&gt;

&lt;p&gt;Data wrangling, also known as data munging, is the process of transforming raw, messy data into a clean, usable format for analysis and decision-making. It involves a range of techniques like cleaning, transforming, and restructuring data to ensure it is reliable, accurate, and consistent. Essentially, data wrangling prepares data for further processing, modeling, and analysis. &lt;/p&gt;

&lt;p&gt;Benefits of data cleaning include more accurate decision-making, increased productivity, and improved data-driven insights. &lt;/p&gt;

&lt;p&gt;In Python, the most popular library for cleaning data is Pandas, alongside others such as Scikit-learn, Pyjanitor, SciPy, DataPrep, CleanLab, Scrubadub, DataCleaner, CleanPrep, and many more. Data cleaning with Pandas involves identifying and correcting errors, inconsistencies, and missing values in a dataset to ensure its accuracy and reliability for further analysis.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Common data cleaning tasks using pandas include:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;i) Handling Missing values:&lt;/strong&gt;&lt;br&gt;
     a) Identifying missing values using &lt;em&gt;isnull()&lt;/em&gt; and/or &lt;br&gt;
        &lt;em&gt;isna()&lt;/em&gt; functions.&lt;br&gt;
     b) Finding and filling missing values using the &lt;em&gt;fillna()&lt;/em&gt; &lt;br&gt;
        function with a specific value, the mean, median, or mode, &lt;br&gt;
        or any other appropriate strategy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Additional fillna() Options&lt;/strong&gt;&lt;br&gt;
     &lt;strong&gt;inplace=True&lt;/strong&gt;: Modifies the DataFrame directly without &lt;br&gt;
     creating a new one.&lt;br&gt;
     &lt;strong&gt;method='ffill'&lt;/strong&gt; or &lt;strong&gt;method='pad'&lt;/strong&gt;: Fills NaN values &lt;br&gt;
     with the previous valid value.&lt;br&gt;
     &lt;strong&gt;method='bfill'&lt;/strong&gt; or &lt;strong&gt;method='backfill'&lt;/strong&gt;: Fills NaN &lt;br&gt;
     values with the next valid value.&lt;br&gt;
     &lt;strong&gt;limit:&lt;/strong&gt; Sets the maximum number of consecutive NaN &lt;br&gt;
     values to fill.&lt;br&gt;
     c) Removing rows or columns with missing values using &lt;br&gt;
        &lt;em&gt;dropna()&lt;/em&gt; function.&lt;/p&gt;
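
&lt;p&gt;As a minimal sketch of these options on an invented Series (note that in recent Pandas versions, the dedicated &lt;em&gt;ffill()&lt;/em&gt; and &lt;em&gt;bfill()&lt;/em&gt; methods are preferred over the deprecated &lt;em&gt;method=&lt;/em&gt; argument to &lt;em&gt;fillna()&lt;/em&gt;):&lt;/p&gt;

```python
import pandas as pd

# A small Series with gaps, purely for illustration
s = pd.Series([1.0, None, None, 4.0, None])

# Fill every missing value with a constant
print(s.fillna(0).tolist())       # [1.0, 0.0, 0.0, 4.0, 0.0]

# Forward-fill: propagate the previous valid value
print(s.ffill().tolist())         # [1.0, 1.0, 1.0, 4.0, 4.0]

# Back-fill: pull the next valid value backwards
# (the last element stays NaN because nothing follows it)
print(s.bfill().tolist())         # [1.0, 4.0, 4.0, 4.0, nan]

# limit caps how many consecutive NaNs get filled
print(s.ffill(limit=1).tolist())  # [1.0, 1.0, nan, 4.0, 4.0]
```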

&lt;p&gt;&lt;strong&gt;ii) Removing Duplicates:&lt;/strong&gt; &lt;br&gt;
    a) Identifying duplicate rows using the &lt;em&gt;duplicated()&lt;/em&gt; &lt;br&gt;
      function.&lt;br&gt;
    b) Removing duplicate rows using &lt;em&gt;drop_duplicates()&lt;/em&gt; &lt;br&gt;
      function.&lt;/p&gt;
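
&lt;p&gt;A quick sketch of both functions on a hypothetical frame (the names and scores are made up for illustration):&lt;/p&gt;

```python
import pandas as pd

# Hypothetical frame in which the third row repeats the first
df = pd.DataFrame({'name': ['Ann', 'Ben', 'Ann'],
                   'score': [90, 85, 90]})

# duplicated() flags rows that repeat an earlier row
print(df.duplicated().tolist())   # [False, False, True]

# drop_duplicates() keeps the first occurrence by default
df_unique = df.drop_duplicates()
print(len(df_unique))             # 2
```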

&lt;p&gt;&lt;strong&gt;iii) Correcting Data Types:&lt;/strong&gt; &lt;br&gt;
    a) Checking column data types using the &lt;em&gt;dtypes&lt;/em&gt; attribute.&lt;br&gt;
    b) Converting data types using &lt;em&gt;astype()&lt;/em&gt; function to ensure &lt;br&gt;
       consistency and also enable proper data analysis.&lt;/p&gt;
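
&lt;p&gt;For instance (a hypothetical frame where the numbers arrived as strings, a common artifact of CSV imports):&lt;/p&gt;

```python
import pandas as pd

# Numeric data stored as strings, as often happens after a CSV import
df = pd.DataFrame({'year': ['2021', '2022'], 'gdp': ['108', '48.77']})
print(df.dtypes)             # both columns start out as object

# Convert to proper numeric types with astype()
df['year'] = df['year'].astype(int)
df['gdp'] = df['gdp'].astype(float)
print(df['gdp'].sum())       # arithmetic now works on the converted column
```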

&lt;p&gt;&lt;strong&gt;iv) Handling Outliers:&lt;/strong&gt; &lt;br&gt;
    a) Identifying outliers using statistical methods such as the &lt;br&gt;
       Interquartile Range (IQR) method or the Z-score method, or &lt;br&gt;
       using visualizations like box plots.&lt;br&gt;
    b) Removing or transforming outliers based on the context &lt;br&gt;
       and analysis objectives.&lt;/p&gt;
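
&lt;p&gt;A minimal sketch of the IQR method (the values are invented, with 95 deliberately planted as the outlier):&lt;/p&gt;

```python
import pandas as pd

# Illustrative data with one planted extreme value (95)
s = pd.Series([10, 12, 11, 13, 12, 95])

# Values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are treated as outliers
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

mask = s.between(lower, upper)   # True for in-range values
print(s[~mask].tolist())         # [95] is flagged as an outlier

# One option: keep only the in-range values
s_clean = s[mask]
```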

&lt;p&gt;&lt;strong&gt;v) Clean Text Data and Formatting:&lt;/strong&gt;&lt;br&gt;
    a) Removing leading/trailing spaces using &lt;em&gt;strip(), lstrip(), &lt;br&gt;
       rstrip()&lt;/em&gt;.&lt;br&gt;
    b) You may opt to convert the text to upper or lowercase for &lt;br&gt;
       data consistency, e.g., &lt;em&gt;lower()&lt;/em&gt; or &lt;em&gt;upper()&lt;/em&gt;. &lt;br&gt;
    c) Replacing specific characters or patterns using &lt;br&gt;
       &lt;em&gt;replace()&lt;/em&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;  &lt;span class="c1"&gt;#Example of replace() function in Python.
&lt;/span&gt;  &lt;span class="n"&gt;string&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The Quick Brown Fox Jumped Over the Lazy Dog&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
  &lt;span class="n"&gt;new_string&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;string&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Over&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Under&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;new_string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Output: sample string sample
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;vi) Renaming Columns:&lt;/strong&gt;&lt;br&gt;
   a) Renaming columns to meaningful names using rename() &lt;br&gt;
      function.&lt;/p&gt;
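
&lt;p&gt;For instance (the column names here are hypothetical):&lt;/p&gt;

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})

# Map terse names to descriptive ones; unlisted columns are untouched
df = df.rename(columns={'A': 'gdp_billions', 'B': 'pop_millions'})
print(df.columns.tolist())   # ['gdp_billions', 'pop_millions']
```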

&lt;p&gt;&lt;strong&gt;vii) Removing or Avoiding Irrelevant Columns:&lt;/strong&gt;&lt;br&gt;
   a) Removing irrelevant columns that are not needed for &lt;br&gt;
      analysis using the &lt;em&gt;drop()&lt;/em&gt; function.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;   &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;

  &lt;span class="c1"&gt;# Sample Data-frame with potential data cleaning issues
&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;A&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; 
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;B&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; 
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;C&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Size(Sq.Miles)&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;224961&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;93065&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;365755&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;248777&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10169&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Country&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; Kenya &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Uganda &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Tanzania &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;S. Sudan &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Rwanda &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;GDP(2023) in Billions&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;108&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;48.77&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;79.06&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;4.7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;14.1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Pop(2023) in Millions&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;55.34&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;66.62&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;48.66&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;13.95&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;11.48&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;

  &lt;span class="c1"&gt;# convert the dataset into a dataframe        
&lt;/span&gt;  &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

  &lt;span class="c1"&gt;# Drop rows with missing values
&lt;/span&gt;  &lt;span class="n"&gt;df_cleaned&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dropna&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

  &lt;span class="c1"&gt;# Fill missing values with 0
&lt;/span&gt;  &lt;span class="n"&gt;df_filled&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fillna&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

  &lt;span class="c1"&gt;# Remove duplicate rows
&lt;/span&gt;  &lt;span class="n"&gt;df_no_duplicates&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;drop_duplicates&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;subset&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;GDP(2023) in 
  Billions&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Pop(2023) in Millions&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

  &lt;span class="c1"&gt;# Strip whitespace from column 'Country'
&lt;/span&gt;  &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Country&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Country&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

  &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Original DataFrame:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-------------------&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Cleaned DataFrame (missing values dropped):&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-------------------------------------------&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df_cleaned&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

  &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Cleaned DataFrame (missing values filled with 0):&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;---------------------------------------------------&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df_filled&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

  &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Cleaned DataFrame (duplicates dropped):&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;---------------------------------------&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df_no_duplicates&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When working with a dataset, it is important to identify its rows and columns and their data types, particularly when working with data structures like Pandas DataFrames: &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Using dtypes&lt;/strong&gt;&lt;br&gt;
The &lt;em&gt;.dtypes&lt;/em&gt; attribute in Pandas is the most direct way to inspect the data type of each column in a DataFrame.&lt;/p&gt;

&lt;p&gt;Below is how this helps you identify the data types you are working with.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;   &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;

   &lt;span class="c1"&gt;# Sample DataFrame
&lt;/span&gt;   &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

   &lt;span class="c1"&gt;# Identify column data types
&lt;/span&gt;   &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dtypes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It is also imperative to understand the dataset you plan to work with and to identify missing or null values. The &lt;em&gt;info()&lt;/em&gt; method reports, for each column, how many non-null values it contains, which shows you where values are missing.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;   &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Below is the output of the above code snippet. As you can see, the first three columns (A, B, and C) are missing some values, while the rest of the columns have 5 non-null values each, i.e., a full column.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl1mb4g2qni6p3zbink15.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl1mb4g2qni6p3zbink15.png" alt="Image description" width="730" height="363"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Below is a screenshot of what you shall see to help determine the data types you have. Columns A to C are of the float64 data type, which means they hold decimal numbers; the Size column is int64 (whole numbers); and Country is an object, which means it holds string values. The GDP and Pop columns are also float64.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fddhagz4w4gto30saso9x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fddhagz4w4gto30saso9x.png" alt="Image description" width="618" height="228"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The &lt;em&gt;shape&lt;/em&gt; attribute (note that it is an attribute, not a function, so it takes no parentheses) tells you how many rows and columns a dataset contains.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;  &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The snippet above helps you determine the size of the dataset you are working with: (5, 7) in our case represents &lt;strong&gt;5 rows&lt;/strong&gt; and &lt;strong&gt;7 columns&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Using the &lt;em&gt;head()&lt;/em&gt; and &lt;em&gt;tail()&lt;/em&gt; functions gives us a snapshot of the first 5 and the last 5 rows of the dataset respectively (by default), assuming that your DataFrame is stored in a variable, df.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;   &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgbes9mlm1e7il97w2kfz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgbes9mlm1e7il97w2kfz.png" alt="Image description" width="800" height="247"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;   &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tail&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6e24mtrgbdrwa74lyj5b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6e24mtrgbdrwa74lyj5b.png" alt="Image description" width="800" height="247"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To find out how many values in each column are empty or null, you can employ one of the following strategies, assuming that your DataFrame is stored in a variable, df.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;   &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;isnull&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnoyav4zjj4ijdr1gtgry.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnoyav4zjj4ijdr1gtgry.png" alt="Image description" width="318" height="485"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;OR&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;   &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;isna&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4k6fmu9pq5rqz5kl16x5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4k6fmu9pq5rqz5kl16x5.png" alt="Image description" width="288" height="492"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Either of the functions, &lt;em&gt;isnull()&lt;/em&gt; or &lt;em&gt;isna()&lt;/em&gt;, will work just fine; &lt;em&gt;isna()&lt;/em&gt; is simply an alias of &lt;em&gt;isnull()&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In Summary&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This brings us to the end of Part I. Data cleaning, also known as data cleansing or scrubbing, refers to the process of identifying and correcting inconsistencies and removing errors in a dataset to improve its quality and usability. It involves removing duplicates, handling missing values, dropping irrelevant data, fixing incorrect formats, and standardizing entries for consistency. The goal is to ensure data accuracy, completeness, and consistency, making it suitable for analysis and decision-making.&lt;/p&gt;

</description>
      <category>blogging</category>
      <category>coverimages</category>
      <category>article</category>
    </item>
    <item>
      <title>When is it necessary to split a dataset for Analysis? Is it before, or after we clean the data? That is the question.</title>
      <dc:creator>Migot Ndede</dc:creator>
      <pubDate>Mon, 05 May 2025 16:55:01 +0000</pubDate>
      <link>https://dev.to/gm_ndede_3d7307448f4fda4a/adding-a-cover-image-for-your-devto-articles-5cmh</link>
      <guid>https://dev.to/gm_ndede_3d7307448f4fda4a/adding-a-cover-image-for-your-devto-articles-5cmh</guid>
      <description>&lt;h2&gt;
  
  
  Data Analysis
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;What is data?&lt;/em&gt; &lt;br&gt;
First and foremost, let us find out what Data is and is not. &lt;/p&gt;

&lt;p&gt;Data, in its simplest form, is raw, unprocessed facts and figures. It can be numbers, text, images, or any other form of information that can be stored and processed by computers. Data becomes meaningful information when it is analyzed, interpreted, and placed in context. Therefore, data is the basic unit of information before it has been organized, analyzed, or interpreted. &lt;/p&gt;

&lt;p&gt;&lt;em&gt;What is information?&lt;/em&gt;&lt;br&gt;
Information, on the other hand, is the result of taking raw data and transforming it into a meaningful, usable format for analysis. The process may involve interpreting, organizing, and contextualizing data. &lt;/p&gt;

&lt;p&gt;In data science, splitting your dataset effectively is an important initial step towards building a robust model. Generally, you'll want to allocate a larger portion of your data for training, very often around 70%-80%, with the remaining 20%-30% for testing. This allows the model to learn from a substantial amount of data while still retaining enough unique data points to test its predictions. &lt;br&gt;
However, this split may depend on the data size and its diversity. For smaller datasets, you may need to use techniques like &lt;strong&gt;cross-validation&lt;/strong&gt; to maximize the use of your data for training while still getting a reliable estimate of model performance. To accomplish this, we can use the &lt;em&gt;train_test_split()&lt;/em&gt; function from the scikit-learn library.&lt;/p&gt;

&lt;p&gt;You use it to split your dataset into training and test subsets, which enables unbiased model evaluation and validation. &lt;strong&gt;X_train and y_train&lt;/strong&gt; are the parts of your dataset that you use to train, or fit, your machine learning model.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;#import the needed libraries 
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.model_selection&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;train_test_split&lt;/span&gt;
   &lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_test&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;train_test_split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;test_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Explanation of the above variables&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;i) &lt;strong&gt;X and y&lt;/strong&gt; are your &lt;strong&gt;feature&lt;/strong&gt; and &lt;strong&gt;target&lt;/strong&gt; variables respectively.&lt;br&gt;
ii) &lt;strong&gt;test_size=0.2&lt;/strong&gt; specifies that 20% of the data should be allocated to the test set.&lt;br&gt;
iii) &lt;strong&gt;random_state=42&lt;/strong&gt; ensures that the split is reproducible. &lt;/p&gt;

&lt;p&gt;The above &lt;strong&gt;train_test_split()&lt;/strong&gt; function returns four new variables: &lt;strong&gt;X_train, X_test, y_train, and y_test&lt;/strong&gt;. These represent the training features, testing features, training target, and testing target, respectively. &lt;/p&gt;

&lt;p&gt;By using the train_test_split() function, you can effectively divide your data into separate training and testing sets, allowing you to train your models on one set and evaluate their performance on the other. &lt;br&gt;
Below is sample code showing how the whole process goes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;   &lt;span class="c1"&gt;#import libraries
&lt;/span&gt;   &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.model_selection&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;train_test_split&lt;/span&gt;
   &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;

   &lt;span class="c1"&gt;# Create a sample dataset
&lt;/span&gt;   &lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;array&lt;/span&gt;&lt;span class="p"&gt;([[&lt;/span&gt;&lt;span class="mi"&gt;11&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;13&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;14&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;17&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;18&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;19&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;]])&lt;/span&gt;
   &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;array&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;a&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;b&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;a&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;b&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;a&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

   &lt;span class="c1"&gt;# Split the dataset into training and testing sets
&lt;/span&gt;   &lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_test&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;train_test_split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
   &lt;span class="n"&gt;test_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

   &lt;span class="c1"&gt;# Print the shapes of the resulting arrays
&lt;/span&gt;   &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;X_train shape:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
   &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;X_test shape:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
   &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;y_train shape:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
   &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;y_test shape:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
   &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;------------------------------&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
   &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
   &lt;span class="c1"&gt;# Print the resulting arrays
&lt;/span&gt;   &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;X_train:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
   &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;------------------------------&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
   &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
   &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;X_test:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
   &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;------------------------------&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
   &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
   &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;y_train:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
   &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;------------------------------&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
   &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
   &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;y_test:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Types of Data:&lt;/strong&gt;&lt;br&gt;
Data can be broadly categorized as qualitative (descriptive, non-numerical) or quantitative (numerical, measurable). &lt;/p&gt;

&lt;p&gt;In Machine Learning, Data Analysis, and Data Science, it is generally recommended that you split the dataset before you start cleaning and pre-processing it. This helps prevent data leakage, where information from the test set influences the training set. &lt;/p&gt;

&lt;p&gt;For example, if you scale data before splitting, the scaling parameters (like the mean and standard deviation) would be calculated over the entire dataset, including the test set, which compromises the model's ability to generalize to unseen data.&lt;/p&gt;
&lt;h2&gt;
  &lt;strong&gt;Explanation:&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Avoiding Data Leakage:&lt;/strong&gt;&lt;br&gt;
Splitting before cleaning ensures that the training set is independent of the test set. This prevents the model from "seeing" information from the test set during training, which could lead to overfitting and poor performance on new, unseen data. (Overfitting occurs when a model learns the training data too well, including its noise and outliers; the model memorizes the training data instead of learning the underlying patterns, leading to inaccurate predictions on new data.) &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Global Pre-processing:&lt;/strong&gt;&lt;br&gt;
Some cleaning and pre-processing steps, like fixing structural errors or removing exact duplicates, do not learn parameters from the data, so they can be applied globally across the entire dataset before splitting without creating inconsistencies between the training and test sets. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Local Pre-processing:&lt;/strong&gt;&lt;br&gt;
Other pre-processing steps, like scaling, are done locally: the transformation is fitted on the training set and then applied, with the same learned parameters, to the test set. (Data scaling is the process of transforming numerical values to a specific range or distribution, such as 0-1 or a mean of 0 and a standard deviation of 1, often done to improve the performance of machine learning models.) These steps should be performed after splitting to avoid data leakage. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Why Split Before?&lt;/strong&gt;&lt;br&gt;
Splitting before cleaning and pre-processing allows you to use the training set to learn the necessary transformations and then apply those same transformations to the test set. This ensures that the model is evaluated on data that it has not "seen" during training.&lt;/p&gt;
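A minimal sketch of this fit-on-train, apply-to-test pattern, using a toy array and scikit-learn's StandardScaler (the data and variable names here are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy dataset: 10 samples, 2 features
X = np.arange(20, dtype=float).reshape(10, 2)
y = np.array([0, 1] * 5)

# Split FIRST, before any parameter-learning transformation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # parameters learned from the training set only
X_test_scaled = scaler.transform(X_test)        # the SAME learned parameters reused on the test set
```

Note the asymmetry: `fit_transform()` on the training set, but only `transform()` on the test set, so the test data never influences the learned mean and standard deviation.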

&lt;p&gt;Now, let us explore the possibilities, and what would be considered the best option and practice.&lt;/p&gt;

&lt;p&gt;Data pre-processing involves cleaning and transforming raw data into a usable format for analysis, improving both accuracy and efficiency. It addresses issues like missing values, inconsistencies, and outliers in the data, preparing it for subsequent tasks like machine learning and model training.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. For what Purpose?&lt;/strong&gt;&lt;br&gt;
   a) &lt;em&gt;Improve Data Quality:&lt;/em&gt; Addressing inaccuracies, inconsistencies, and errors in the data. &lt;/p&gt;

&lt;p&gt;b) &lt;em&gt;Enhance Model Performance:&lt;/em&gt; Preparing data for machine learning algorithms helps make it easier for the algorithm to understand and learn. &lt;/p&gt;

&lt;p&gt;c) &lt;em&gt;Streamline Analysis:&lt;/em&gt; Ensuring data is in a format suitable for analysis and visualization - so it is in the right consumable format. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Key Steps:&lt;/strong&gt;&lt;br&gt;
   a) &lt;strong&gt;Data Cleaning:&lt;/strong&gt; &lt;br&gt;
       i) &lt;em&gt;Handling Missing Values:&lt;/em&gt; Imputing or removing &lt;br&gt;
           missing data points.&lt;br&gt;
      ii) &lt;em&gt;Identifying and Correcting Errors:&lt;/em&gt; Addressing &lt;br&gt;
          inconsistencies, outliers, and other data quality &lt;br&gt;
          issues.&lt;br&gt;
      iii) &lt;em&gt;Removing Duplicates:&lt;/em&gt; Ensuring each record is &lt;br&gt;
           unique.&lt;/p&gt;
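The three cleaning steps above can be sketched with pandas on a small made-up DataFrame (the column names and values here are purely illustrative):

```python
import numpy as np
import pandas as pd

# Toy data: one missing value and one exact duplicate row
df = pd.DataFrame({
    'age':  [25, np.nan, 31, 31, 120],
    'city': ['Nairobi', 'Kisumu', 'Nairobi', 'Nairobi', 'Kisumu'],
})

# iii) Removing duplicates: keep one copy of each identical record
df = df.drop_duplicates()

# i) Handling missing values: impute with the column median
df['age'] = df['age'].fillna(df['age'].median())
```

`drop_duplicates()` removes the repeated `(31, 'Nairobi')` row, and `fillna()` replaces the missing age with the median of the remaining values.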

&lt;p&gt;b) &lt;strong&gt;Data Transformation:&lt;/strong&gt;&lt;br&gt;
        i) &lt;em&gt;Feature Scaling:&lt;/em&gt; Normalizing or standardizing &lt;br&gt;
           numerical features to a common scale.&lt;br&gt;
       ii) &lt;em&gt;One-Hot Encoding:&lt;/em&gt; Converting categorical data into &lt;br&gt;
           numerical representations.&lt;br&gt;
      iii) &lt;em&gt;Data Transformation:&lt;/em&gt; Applying functions to modify &lt;br&gt;
           the values of features, e.g., taking logarithms or &lt;br&gt;
           square roots.&lt;/p&gt;

&lt;p&gt;c) &lt;strong&gt;Feature Engineering:&lt;/strong&gt;&lt;br&gt;
        i) Is the creation of new features from existing ones to &lt;br&gt;
           help improve model performance.&lt;br&gt;
       ii) &lt;em&gt;Data Integration:&lt;/em&gt; Combining data from multiple &lt;br&gt;
           sources into a single dataset for easy manipulation &lt;br&gt;
           and analysis.&lt;br&gt;
      iii) &lt;em&gt;Data Reduction:&lt;/em&gt; Reducing the dimensionality of the &lt;br&gt;
           data to improve model efficiency. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Examples:&lt;/strong&gt;&lt;br&gt;
   a) &lt;em&gt;Filling Missing Values:&lt;/em&gt;&lt;br&gt;
      Replacing missing values with the mean, median, or a &lt;br&gt;
      predicted value based on other features.&lt;/p&gt;

&lt;p&gt;b) &lt;em&gt;Removing Outliers:&lt;/em&gt;&lt;br&gt;
      Identifying and removing data points that are &lt;br&gt;
      significantly different from the rest of the data.&lt;/p&gt;
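One common way to flag such points is the interquartile range (IQR) rule; a small sketch on made-up numbers (the 1.5 multiplier is the conventional choice, not the only one):

```python
import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 11, 300])  # 300 is an obvious outlier

# Keep only points within [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
mask = (s >= q1 - 1.5 * iqr) & (s <= q3 + 1.5 * iqr)
cleaned = s[mask]
```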

&lt;p&gt;c) &lt;em&gt;Scaling Data:&lt;/em&gt;&lt;br&gt;
      Transforming numerical features to a common scale (e.g., &lt;br&gt;
      0-1 or -1-1) using techniques like Min-Max Scaling or &lt;br&gt;
      Standardization.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;        &lt;span class="c1"&gt;#Scaling using Min-Max Scaling Technique
&lt;/span&gt;        &lt;span class="c1"&gt;#import the needed packages and libraries
&lt;/span&gt;        &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
        &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.preprocessing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;MinMaxScaler&lt;/span&gt;

        &lt;span class="c1"&gt;# A Sample DataFrame
&lt;/span&gt;        &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;column1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;400&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; 
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;column2&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;101&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;202&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;303&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;404&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;505&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;
        &lt;span class="n"&gt;dataframe&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;#Initialize MinMaxScaler function
&lt;/span&gt;        &lt;span class="n"&gt;scaler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;MinMaxScaler&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

        &lt;span class="c1"&gt;# Fit and transform the desired dataset columns
&lt;/span&gt;        &lt;span class="n"&gt;dataframe&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;column1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;column2&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; 
        &lt;span class="n"&gt;scaler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit_transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dataframe&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;column1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;column2&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]])&lt;/span&gt;

        &lt;span class="c1"&gt;#print(dataframe)
&lt;/span&gt;        &lt;span class="n"&gt;dataframe&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;d) &lt;em&gt;Encoding Categorical Data:&lt;/em&gt;&lt;br&gt;
      Converting categorical features into numerical &lt;br&gt;
      representations, for example, using one-hot encoding. One-hot &lt;br&gt;
      encoding is a process of converting categorical variables into &lt;br&gt;
      a binary matrix. Each category is represented by a new column, &lt;br&gt;
      and each row is marked with a 1 or a 0 depending on whether it &lt;br&gt;
       belongs to that category. This is useful because many machine &lt;br&gt;
       learning algorithms cannot work with categorical data &lt;br&gt;
       directly.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;       &lt;span class="c1"&gt;# One-hot encoding
&lt;/span&gt;       &lt;span class="c1"&gt;# import pandas package library
&lt;/span&gt;       &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;

       &lt;span class="c1"&gt;# Sample dataset
&lt;/span&gt;       &lt;span class="n"&gt;dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Color&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Red&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Blue&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Green&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Red&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; 
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Size&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Small&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Medium&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Large&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Medium&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;
       &lt;span class="n"&gt;dataframe&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

       &lt;span class="c1"&gt;# One-hot encode the 'color' column
&lt;/span&gt;       &lt;span class="n"&gt;dataframe&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_dummies&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dataframe&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Color&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

       &lt;span class="c1"&gt;# Print the result
&lt;/span&gt;       &lt;span class="c1"&gt;#print(dataframe)
&lt;/span&gt;       &lt;span class="n"&gt;dataframe&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;        &lt;span class="c1"&gt;# convert categorical features into numerical features
&lt;/span&gt;        &lt;span class="c1"&gt;# import the needed packages and libraries 
&lt;/span&gt;        &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.preprocessing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LabelEncoder&lt;/span&gt;
        &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;

       &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Gender&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Male&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Female&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Female&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Male&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;
       &lt;span class="n"&gt;dataframe&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

       &lt;span class="n"&gt;gender&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LabelEncoder&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
       &lt;span class="n"&gt;dataframe&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Gender_Encoded&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; 
       &lt;span class="n"&gt;gender&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit_transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dataframe&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Gender&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

       &lt;span class="c1"&gt;#print(dataframe)
&lt;/span&gt;       &lt;span class="n"&gt;dataframe&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;4. Why do we even care?&lt;/strong&gt;&lt;br&gt;
   a) &lt;em&gt;Improved Model Accuracy:&lt;/em&gt; Pre-processing can significantly &lt;br&gt;
      improve the accuracy of machine learning models.&lt;/p&gt;

&lt;p&gt;b) &lt;em&gt;Enhanced Model Performance:&lt;/em&gt; Pre-processing can make models &lt;br&gt;
     faster and more efficient.&lt;/p&gt;

&lt;p&gt;c) &lt;em&gt;Better Interpretability:&lt;/em&gt; Cleaned and transformed data is &lt;br&gt;
      easier to understand and interpret. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When is the best time to do Feature Selection?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In a dataset, a feature is a measurable property or characteristic of the data points. It's also known as a &lt;strong&gt;variable or attribute&lt;/strong&gt;, representing a definable quality that can vary within the dataset. Features can be used to describe and understand the data, and they are often used as inputs to machine learning models. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key aspects of features:&lt;/strong&gt;&lt;br&gt;
   a) &lt;em&gt;Measurable properties:&lt;/em&gt; Features are quantifiable &lt;br&gt;
      characteristics, like age, height, or temperature.&lt;br&gt;
   b) &lt;em&gt;Variables:&lt;/em&gt; Their values can change from one data point &lt;br&gt;
     to another.&lt;br&gt;
   c) &lt;em&gt;Attributes:&lt;/em&gt; These describe the data points in a dataset.&lt;br&gt;
   d) &lt;em&gt;Inputs to models:&lt;/em&gt; In machine learning, features are &lt;br&gt;
      often used as inputs to train and predict outcomes.  &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Examples of Features:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;a) In a medical dataset:&lt;br&gt;
Features could include patient age, gender, blood pressure, cholesterol levels, etc. &lt;/p&gt;

&lt;p&gt;b) In a weather dataset:&lt;br&gt;
Features could include temperature, humidity, wind speed, cloud coverage, etc. &lt;/p&gt;

&lt;p&gt;c) In a student performance dataset:&lt;br&gt;
Features could include student attendance, grades, age, GPA, etc.&lt;/p&gt;

&lt;p&gt;d) In a dataset of employee records:&lt;br&gt;
Features could include age, location, salary, title, performance metrics, etc., &lt;a href="https://www.ibm.com/think/topics/feature-selection" rel="noopener noreferrer"&gt;according to IBM&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;Feature selection is an important step in machine learning that involves selecting a subset of relevant features from the original feature set, reducing the feature space while improving the model’s performance and lowering computational cost. It’s a critical step, especially when dealing with high-dimensional data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When should feature selection be done?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Perform feature selection during the model training process. Feature selection can be integrated into model training so that the model dynamically selects the most relevant features, according to &lt;a href="https://www.geeksforgeeks.org/feature-selection-techniques-in-machine-learning/" rel="noopener noreferrer"&gt;Geeks for Geeks&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Feature selection helps by improving the model’s accuracy, since the model learns from the most informative features rather than all of them, and by increasing interpretability.&lt;/p&gt;

&lt;p&gt;The best practice is to do feature selection after splitting the data; otherwise, information can leak from the test set into the selection.&lt;/p&gt;

&lt;p&gt;Additionally, if the selected features change from one task or run to the next, no generalization of feature importance can be made, which is not desirable.&lt;/p&gt;

&lt;p&gt;There is a caveat: if only the training set is used for feature selection, the test set may contain records that contradict the selection made on the training set, since the overall historical data is not analyzed.&lt;/p&gt;

&lt;p&gt;Even so, using the full dataset (including the test set) for feature selection is not recommended, because it can lead to an overly optimistic model and potentially poor generalization on unseen data. Feature selection should be performed on the training set only, to prevent information leakage and maintain an unbiased evaluation of the model's performance. &lt;/p&gt;
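A sketch of training-set-only feature selection, here using scikit-learn's SelectKBest as one possible technique (the synthetic data and the choice of k=2 are illustrative assumptions):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split

# Synthetic data: only the first of 5 features is informative
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = (X[:, 0] + 0.1 * rng.normal(size=100) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Fit the selector on the training split ONLY...
selector = SelectKBest(f_classif, k=2)
X_train_sel = selector.fit_transform(X_train, y_train)
# ...then keep the same columns in the test split
X_test_sel = selector.transform(X_test)
```

Because `fit_transform()` sees only `X_train`, the test set cannot influence which features survive, which keeps the final evaluation unbiased.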

&lt;p&gt;In machine learning, feature importance scores are used to determine the relative importance of each feature in a dataset when building a predictive model. These scores are calculated using a variety of techniques, such as decision trees, random forests, linear models, and neural networks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How do you evaluate feature importance?&lt;/strong&gt;&lt;br&gt;
Permutation feature importance is calculated by noting the increase or decrease in error when we permute (shuffle) the values of a feature. If permuting the values causes a large change in the error, the feature is important to our model; otherwise it is not.&lt;/p&gt;
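This permutation idea can be sketched with scikit-learn's permutation_importance helper on made-up data (the model and dataset here are illustrative assumptions):

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression

# Synthetic data: only the first of 3 features determines the label
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] > 0).astype(int)

model = LogisticRegression().fit(X, y)

# Shuffle each feature in turn and measure the drop in accuracy
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
```

Shuffling the informative first feature destroys the model's accuracy, so its mean importance score comes out far higher than those of the noise features.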

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;br&gt;
In a nutshell, splitting a dataset is necessary for several reasons, especially in machine learning, to ensure accurate model evaluation and prevent overfitting. By dividing the dataset into training, validation, and testing sets, you can train a model on one portion, fine-tune it on another (validation), and then evaluate its performance on unseen data (testing), giving a more realistic assessment of its generalization ability. And yes, it is better to split the data into training and testing sets before steps like scaling and imputation: those steps should learn their parameters from the training set, and the same parameters should then be applied to the testing set.&lt;/p&gt;

&lt;p&gt;If you liked the article, please join the discussion or share any insights in the comments below. If you would like us to cover other aspects of Python, ML, AI, or any related topics of interest to you, please let us know by tagging or adding comments and notes below. Your feedback helps a lot in learning and development.&lt;/p&gt;

</description>
      <category>blogging</category>
      <category>coverimages</category>
      <category>article</category>
    </item>
  </channel>
</rss>
