
Isaac Oresanya

Originally published at Medium

Guide to Data Cleaning and Preparation for Analysis using Pandas library in Python

Attempting analysis with unrefined data is like trying to build a sturdy house on a shaky foundation.
It leads to bad decisions that can cost you time, money, and reputation. To avoid this, clean your data and verify its quality and accuracy before you use it. Clean data is the foundation of good decision-making.
The goal of this article is to show you how to clean, transform, and explore a dataset using various Pandas methods and functions. By the end, you will have a better understanding of how to work with large and messy datasets using the Pandas library in Python.

Prerequisites

To follow along with this article, you will need:

  • A basic knowledge of Python and Pandas
  • A Jupyter Notebook or Google Colab environment
  • The FOIA dataset downloaded from here

What is Data Cleaning

Data cleaning is the process of identifying and correcting or removing errors, inconsistencies, and inaccuracies in data. It is an essential step in the data science workflow: it ensures that data is accurate, complete, and reliable, which in turn improves the quality of any analysis and decision-making built on it. However, there is no one-size-fits-all technique for data cleaning; different datasets require different approaches and methods, so the process should be tailored to the specific needs and goals of each project.

Some common data-cleaning techniques are:

  • Removing duplicates: This involves deleting records that are identical, or that share the same values in key columns. For example, if a customer has two entries with the same name and email address, one of them can be removed as a duplicate.
  • Handling missing values: This involves dealing with records that have missing values, also called null values, represented in Pandas as NaN. Missing values can result from data entry errors or gaps during data collection. For example, if a survey respondent did not answer a question, the value for that question will be null. Missing values can be handled by deleting the affected rows, replacing them with a default value, or imputing them using statistical methods.
  • Standardizing formats: This involves normalizing values so that they follow a consistent format or unit, rather than being spread across incompatible representations. For example, if a feature has values in different units, such as meters and feet, they can all be converted to a common unit, such as meters.
  • Validating data: This involves identifying errors and inconsistencies in data, such as outliers or mixed datatypes, which can result from typographical errors or from faulty byte encoding and decoding. For example, if a feature is supposed to be numeric but some of its values are strings, those values can be converted to numbers or removed as invalid data (see the sketch after this list).
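To make the first and last techniques concrete, here is a minimal sketch on a toy DataFrame (hypothetical data, not the FOIA dataset used later):

import pandas as pd

df = pd.DataFrame({
    "name":   ["Ada", "Ada", "Ben"],
    "email":  ["ada@x.com", "ada@x.com", "ben@x.com"],
    "height": ["1.7", "1.7", "six feet"],   # supposedly numeric
})

# Removing duplicates: keep only the first of identical rows
df = df.drop_duplicates()

# Validating data: coerce the numeric column; invalid strings become NaN,
# so they can then be inspected, corrected, or removed
df["height"] = pd.to_numeric(df["height"], errors="coerce")
print(df)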

In this article, I will share my thought process and steps for preparing the Freedom of Information Act (FOIA) dataset for analysis using the Pandas library in Python. This dataset contains information about the 7(a) loan program of the U.S. Small Business Administration (SBA) from fiscal year 2010 to 2019. The 7(a) loan program is the SBA's primary program for providing financial assistance to small businesses.
The dataset was obtained from the Makeover Monday project, a weekly series that challenges data enthusiasts to use data and charts to tell a story.
Additionally, you can find the source code for the data cleaning techniques used in the article on my GitHub here.

Reading and inspecting data

To begin, import the Pandas library into your environment and load the CSV file into a Pandas data frame:

import pandas as pd
pd.set_option("display.max_columns", None)

# Define custom datatypes for specific columns
dtypes = {"BankZip": str, "BusinessAge": str}

# Read the CSV file using the specified encoding and column datatypes
foiaData = pd.read_csv("foia-7afy2010-fy2019-asof-220930.csv", encoding="ISO-8859-1", dtype=dtypes)

By setting display.max_columns to None, you're telling Pandas to display all columns without any truncation. This is particularly helpful when you're working with wide datasets or data frames containing numerous columns.

While attempting to read this CSV file using only its name, you might encounter an error and a warning:

  1. Error:
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 246121-246122: invalid continuation byte

Description: This error occurs when the 'utf-8' encoding doesn't match the content of the file. It usually happens when there are characters in the file that cannot be decoded using the specified encoding.
Resolution: To resolve this, specify the correct encoding for the file when reading it, as demonstrated above.

  2. Warning:
DtypeWarning: Columns (13,32) have mixed types. Specify dtype option on import or set low_memory=False

Description: This warning arises due to mixed data types in the columns "BankZip" and "BusinessAge." It indicates that the data in these columns may not be consistently of the same type.
Resolution: To address this, change the data type of these columns to 'str' (string) initially. This allows you to inspect the data and later convert it to the appropriate data type if necessary. Create a dictionary named 'dtypes' (you can choose a different name) with column names as keys and desired data types as values. Then pass this dictionary to the 'dtype' parameter of the 'read_csv()' function when loading the data. This ensures that the data is loaded correctly with the specified data types.

Next, study the data frame using the head() method and the shape and dtypes attributes to get a sense of the data's structure, size, columns, and types.
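For example (in a notebook, run each expression in its own cell so every output displays):

# Inspect the data: first rows, dimensions, and column datatypes
foiaData.head()
foiaData.shape    # (545751, 39)
foiaData.dtypes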
Here are a few of my observations:

  • The dataset has 545,751 rows and 39 columns. Each row represents a loan approved by the SBA under the 7(a) program, and each column contains a variable related to the loan, such as the borrower's name, address, business type, loan amount, interest rate, term, status, etc.
|   AsOfDate | Program   | BorrName                       | BorrStreet                   | BorrCity    | BorrState   |   BorrZip | BankName                                  |   BankFDICNumber |   BankNCUANumber | BankStreet           | BankCity     | BankState   |   BankZip |   GrossApproval |   SBAGuaranteedApproval | ApprovalDate   |   ApprovalFiscalYear | FirstDisbursementDate   | DeliveryMethod   | subpgmdesc                  |   InitialInterestRate |   TermInMonths |   NaicsCode | NaicsDescription                                                       |   FranchiseCode |   FranchiseName | ProjectCounty   | ProjectState   | SBADistrictOffice                 |   CongressionalDistrict | BusinessType   |   BusinessAge | LoanStatus   | PaidInFullDate   |   ChargeOffDate |   GrossChargeOffAmount |   RevolverStatus |   JobsSupported |
|-----------:|:----------|:-------------------------------|:-----------------------------|:------------|:------------|----------:|:------------------------------------------|-----------------:|-----------------:|:---------------------|:-------------|:------------|----------:|----------------:|------------------------:|:---------------|---------------------:|:------------------------|:-----------------|:----------------------------|----------------------:|---------------:|------------:|:-----------------------------------------------------------------------|----------------:|----------------:|:----------------|:---------------|:----------------------------------|------------------------:|:---------------|--------------:|:-------------|:-----------------|----------------:|-----------------------:|-----------------:|----------------:|
|   20220930 | 7A        | RIVERLAND LASER LLC            | 703 ST CROIX STREET          | RIVER FALLS | WI          |     54022 | First National Community Bank             |             5357 |              nan | 109 E 2nd St         | NEW RICHMOND | WI          |     54017 |          491000 |                  441900 | 10/01/2009     |                 2010 | 10/01/2009              | OTH 7A           | Guaranty                    |                  6    |            120 |      332812 | Metal Coating, Engraving (except Jewelry and Silverware), and Allied S |             nan |             nan | PIERCE          | WI             | WISCONSIN DISTRICT OFFICE         |                       7 | CORPORATION    |           nan | PIF          | 08/31/2016       |             nan |                      0 |                0 |              11 |
|   20220930 | 7A        | MARTINEZ AND MARTINEZ INC.     | 2430 DALY AVE                | LOS ANGELES | CA          |     90031 | Gain FCU                                  |              nan |             3943 | 1800 W Magnolia Blvd | BURBANK      | CA          |     91506 |           35000 |                   31500 | 10/01/2009     |                 2010 | 04/01/2011              | OTH 7A           | Guaranty                    |                  6.5  |             60 |      541213 | Tax Preparation Services                                               |             nan |             nan | LOS ANGELES     | CA             | LOS ANGELES DISTRICT OFFICE       |                      28 | CORPORATION    |           nan | PIF          | 04/30/2016       |             nan |                      0 |                1 |               9 |
|   20220930 | 7A        | Erez & Koby Genesis Investment | 10999 Reed Hartman Hwy, SUit | CINCINNATI  | OH          |     45242 | JPMorgan Chase Bank, National Association |              628 |              nan | 1111 Polaris Pkwy    | COLUMBUS     | OH          |     43240 |          100500 |                   50250 | 10/01/2009     |                 2010 | 10/01/2009              | SBA EXPRES       | FA$TRK (Small Loan Express) |                  4.51 |             54 |      722110 | Full-Service Restaurants                                               |             nan |             nan | HAMILTON        | OH             | COLUMBUS DISTRICT OFFICE          |                       1 | CORPORATION    |           nan | PIF          | 05/31/2012       |             nan |                      0 |                0 |              20 |
|   20220930 | 7A        | Square Burger LLC              | 115 Kentucky Street          | McKinney    | TX          |     75069 | American Bank of Commerce                 |            18609 |              nan | 530 E Hwy 62/82      | WOLFFORTH    | TX          |     79382 |          288000 |                  259200 | 10/01/2009     |                 2010 | 10/01/2009              | PLP              | Guaranty                    |                  6    |            120 |      722110 | Full-Service Restaurants                                               |             nan |             nan | COLLIN          | TX             | DALLAS / FT WORTH DISTRICT OFFICE |                       3 | CORPORATION    |           nan | PIF          | 07/31/2014       |             nan |                      0 |                0 |              18 |
|   20220930 | 7A        | HIT FITNESS LLC                | 614 A SOUTH MAIN STREET      | MAULDIN     | SC          |     29662 | Truist Bank                               |             9846 |              nan | 214 N Tryon St       | CHARLOTTE    | NC          |     28202 |           37000 |                   18500 | 10/01/2009     |                 2010 | 10/01/2009              | SBA EXPRES       | FA$TRK (Small Loan Express) |                  7.5  |             60 |      812199 | Other Personal Care Services                                           |             nan |             nan | GREENVILLE      | SC             | SOUTH CAROLINA DISTRICT OFFICE    |                       4 | CORPORATION    |           nan | PIF          | 09/30/2013       |             nan |                      0 |                0 |               6 |
  • The dataset has some missing values in some columns, such as "BankFDICNumber", "BankNCUANumber", "FirstDisbursementDate", "NaicsCode", "NaicsDescription", "FranchiseCode", "FranchiseName", "BusinessAge", "PaidInFullDate", and "ChargeOffDate". These missing values may indicate that the data was not recorded or available for some loans. For example, "FranchiseCode" and "FranchiseName" are only relevant for loans to franchise businesses, so they are missing for non-franchise businesses.
  • Besides the two columns whose datatypes we explicitly set when reading the data, some other columns, such as "AsOfDate", "ApprovalDate", "FirstDisbursementDate", "PaidInFullDate", and "ChargeOffDate", have improper datatypes: they are stored as integers or strings rather than dates.
  • The BankZip and BankName columns have some data quality issues. The BankZip column should contain only numeric values, but it has the string value "M5H 1", which is probably a typo. The BankName column has the value "MISSINGMAINBANKID", which indicates a missing value. These two columns are why the mixed-datatype warning was raised initially; one way to surface such values is shown below.
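As a quick sanity check (a sketch, not a strict validation: hyphenated ZIP+4 codes would also appear), you can list the BankZip entries that contain non-digit characters and count the placeholder bank names:

# List BankZip values containing non-digit characters, ignoring nulls
odd_zips = foiaData["BankZip"].dropna().loc[lambda s: ~s.str.isdigit()]
print(odd_zips.unique())   # e.g. array(['M5H 1'], dtype=object)

# Count occurrences of the placeholder bank name
print((foiaData["BankName"] == "MISSINGMAINBANKID").sum())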

1. Handling Data Quality Issues

To fix these problems, you can drop the two rows that have these values by using the following code:

# Drop rows where BankZip is M5H 1 or BankName is MISSINGMAINBANKID
bankzip_condition = foiaData["BankZip"] == "M5H 1"
bankname_condition = foiaData["BankName"] == "MISSINGMAINBANKID"
drop_condition = bankzip_condition | bankname_condition
foiaData.drop(foiaData[drop_condition].index, inplace=True)

The "|" operator is called the "OR" operator. It means that the rows will be dropped if either one of the conditions is true.

2. Handling Null/Missing Values

In this section, I'll review each column containing missing data in the dataset, exploring various techniques for handling these missing values.

Exploring Data Columns

To identify which columns in your dataset contain null values, use the isna() method on your DataFrame. It returns a DataFrame of the same shape as the original, with boolean values indicating the presence of missing data. Chaining sum() then counts the number of missing values in each column.

foiaData.isna().sum()

This code will return the following results:


AsOfDate                      0
Program                       0
BorrName                     34
BorrStreet                    0
BorrCity                      0
BorrState                     0
BorrZip                       0
BankName                      0
BankFDICNumber            37241
BankNCUANumber           529342
BankStreet                    2
BankCity                      2
BankState                     2
BankZip                       2
GrossApproval                 0
SBAGuaranteedApproval         0
ApprovalDate                  0
ApprovalFiscalYear            0
FirstDisbursementDate     65434
DeliveryMethod                0
subpgmdesc                    0
InitialInterestRate           0
TermInMonths                  0
NaicsCode                    11
NaicsDescription            592
FranchiseCode            498139
FranchiseName            498240
ProjectCounty                 2
ProjectState                  0
SBADistrictOffice             0
CongressionalDistrict        29
BusinessType                  5
BusinessAge              434057
LoanStatus                    0
PaidInFullDate           235140
ChargeOffDate            518461
GrossChargeOffAmount          0
RevolverStatus                0
JobsSupported                 0
dtype: int64

Dropping Rows

Let's begin with the first column containing missing values, "BorrName", which represents the borrower's name. Since only 34 rows have missing values, it's advisable to drop them.
Here's how to implement this in code:

# Drop rows with missing Borrower Name
foiaData.dropna(subset=["BorrName"], inplace=True)

This column holds significant importance within the dataset, so substituting another borrower's name would be misleading. One option is to fill in "MISSING NAME", but given the limited number of missing values, such a step seems unnecessary. Remember, the approach you choose should be tailored to the project's specific requirements rather than a one-size-fits-all solution.
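If flagging were preferable for your project, a one-line alternative to the dropna above (not applied here) would be:

# Alternative: flag missing borrower names instead of dropping the rows
foiaData["BorrName"] = foiaData["BorrName"].fillna("MISSING NAME")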

Dropping Columns

Next, we have the BankFDICNumber and BankNCUANumber columns, with 37,241 and 529,342 missing values respectively. These columns hold the FDIC number and the NCUA number of the lending bank. Including them adds complexity to the dataset without providing significant analytical value, so dropping them lets you focus on the aspects of the data that relate more directly to your analysis goals, leading to a more concise and focused dataset.
We can achieve that with this code:

# Drop the BankFDICNumber and BankNCUANumber columns
columns_to_drop = ["BankFDICNumber", "BankNCUANumber"]
foiaData.drop(columns_to_drop, axis=1, inplace=True)

Introducing a Binary Indicator

More than 90% of the records have missing values in the FranchiseCode and FranchiseName columns, increasing data sparsity and reducing the overall data quality. These columns represent the franchise code and the name of the franchise.
Introducing a HasFranchise column with two distinct values (0 and 1) that indicate whether a record has a franchise (1) or not (0) simplifies the dataset, making it more readable and intuitive for analysis.

# Create a HasFranchise column
foiaData["HasFranchise"]=foiaData["FranchiseName"].notna().astype(int)

Then drop the FranchiseCode and FranchiseName columns:

# Drop the FranchiseCode and FranchiseName columns
columns_to_drop = ["FranchiseCode", "FranchiseName"]
foiaData.drop(columns_to_drop, axis=1, inplace=True)
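As a quick sanity check, you can inspect the distribution of the new indicator:

# 1 = has a franchise, 0 = no franchise
print(foiaData["HasFranchise"].value_counts())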

Handling Categorical Columns with Imputation

The NaicsCode, NaicsDescription, CongressionalDistrict, and BusinessType columns have only a few missing values, and since they are categorical, the missing values can be imputed with the most frequent value (the mode).

# Fill missing values in NaicsCode, NaicsDescription,
# CongressionalDistrict, and BusinessType with the most frequent value
for column in ["NaicsCode", "NaicsDescription",
               "CongressionalDistrict", "BusinessType"]:
    most_frequent_value = foiaData[column].mode()[0]
    foiaData[column] = foiaData[column].fillna(most_frequent_value)

A similar technique can be applied to the ProjectCounty column, but in this case missing values will be imputed with the most frequent ProjectCounty among rows in the same state, to keep the imputation geographically consistent. The rows with missing ProjectCounty all have ProjectState "CT", hence that state is hardcoded in the fill. You can verify this assumption first:
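# If the assumption holds, this prints only 'CT'
print(foiaData.loc[foiaData["ProjectCounty"].isna(), "ProjectState"].unique())

With that confirmed, impute the missing counties with the mode of the Connecticut counties: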

# Fill missing values in ProjectCounty with the most frequent county
# among rows where ProjectState == "CT"
ctmode_projectstate = foiaData.loc[foiaData["ProjectState"] == "CT", "ProjectCounty"].mode()[0]
foiaData["ProjectCounty"] = foiaData["ProjectCounty"].fillna(ctmode_projectstate)

Dropping Columns

The BusinessAge column has over 75% missing values. Such a high percentage might significantly affect the reliability of any analysis or insights derived from it. Since imputation isn't feasible in this case, consider dropping the column to maintain data integrity and focus on other, more complete variables.
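To confirm the share of missing values, a quick check (run before dropping the column):

# Fraction of missing values in BusinessAge: 434,057 of ~545,700 rows, ≈ 0.8
print(foiaData["BusinessAge"].isna().mean())

With that confirmed, drop the column: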

# Drop the BusinessAge column
foiaData.drop("BusinessAge",axis=1,inplace=True)

Dealing with Date Columns: Handling Missing Dates

The missing values in PaidInFullDate and ChargeOffDate indicate that the data is not applicable for some loans. The two columns are mutually exclusive: the presence of a value in one implies the absence of a value in the other. For example, PaidInFullDate applies only to loans that were paid in full, so it is missing for all other loans. One possible approach is to fill the gaps with words like "Not Applicable" (or "Data Entry Delay" for the missing values in the FirstDisbursementDate column) to send a clear message to other users of the data. However, that prevents converting the columns to a datetime format, because datetime columns can only contain valid date or timestamp values.
Instead, you can leave the values as null and document this in your analysis, so users understand why the dates are missing and interpret them correctly. This approach maintains the integrity of your data while allowing for clear communication of its limitations.
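With that decision made, the date columns can be converted to proper datetimes, since nulls simply become NaT (pandas' native missing-timestamp value). A minimal sketch, assuming the MM/DD/YYYY strings seen in the sample and the YYYYMMDD integers in AsOfDate:

# Convert the date columns; missing entries become NaT and the dtype stays clean
date_cols = ["ApprovalDate", "FirstDisbursementDate",
             "PaidInFullDate", "ChargeOffDate"]
for col in date_cols:
    foiaData[col] = pd.to_datetime(foiaData[col], format="%m/%d/%Y")

# AsOfDate is stored as an integer like 20220930
foiaData["AsOfDate"] = pd.to_datetime(foiaData["AsOfDate"], format="%Y%m%d")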

Conclusion

We've explored various techniques for handling missing data in your dataset using the Pandas library in Python.
Here's a concise breakdown of the main techniques covered:

  1. Dropping Rows and Columns: Removing entries with missing values to maintain data quality.
  2. Introducing Binary Indicators: Simplifying columns with missing data by using indicators like "HasFranchise".
  3. Imputing with Most Frequent Values: Filling in categorical columns using the most common value.
  4. Imputing with Consistent Values: Filling missing values using the most frequent value conditioned on a related column.
  5. Handling Date Columns: Leaving date columns blank when missing values signify non-applicability, and documenting the logic.

Depending on the context and your analysis goals, you can choose the most suitable approach to maintain data quality, integrity, and transparency. By carefully handling missing values, you ensure that your analysis is based on reliable and meaningful insights.

Make sure to document the handling of missing data, whether through imputation, flagging, or dropping, to maintain transparency in your analysis process.

In this article, we learned how to handle missing data in Python using Pandas. But how can we use this data to answer some important questions about small business financing? Stay tuned for the next article, where we will explore data analysis with PostgreSQL and Tableau.
