justkmike

Posted on Oct 20, 2023

Data Cleaning with Pandas

#datascience #machinelearning #pandas #python

In this guide, we'll explore various data-cleaning techniques using Python and the Pandas library. We'll cover the methods head(), tail(), info(), and describe() and the attributes shape and size, then demonstrate how to remove empty cells, fix wrong data formats, remove duplicates, and access data.

DataFrame Basics

`head()` and `tail()`

These methods return the first and last n rows of a DataFrame, respectively. If you omit n, it defaults to 5.

# Display the first 5 rows
df.head()

# Display the last 5 rows
df.tail()
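As a minimal sketch of passing an explicit n, the example below builds a small DataFrame (the id and score columns are made up for illustration) and pulls rows from each end:

```python
import pandas as pd

# Hypothetical sample data, used only for illustration
df = pd.DataFrame({"id": range(1, 11), "score": [v * 10 for v in range(1, 11)]})

first_three = df.head(3)   # first 3 rows
last_two = df.tail(2)      # last 2 rows
default_head = df.head()   # n defaults to 5

print(first_three)
print(last_two)
```

Both methods return a new DataFrame, so the original df is left untouched.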

`info()`

info() provides essential information about the DataFrame, including column data types, non-null counts, and memory usage.

df.info()
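info() prints its report rather than returning it. If you want the text in a variable, or just a quick count of missing values per column, something like the following sketch may help (the name and age columns are assumptions made up for this example):

```python
import io

import pandas as pd

# Hypothetical sample data with one missing name
df = pd.DataFrame({"name": ["Ann", None, "Cy"], "age": [28, 35, 41]})

# Capture the info() report in a string instead of printing it
buffer = io.StringIO()
df.info(buf=buffer)
report = buffer.getvalue()
print(report)

# Count missing values per column
missing_per_column = df.isna().sum()
print(missing_per_column)
```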

`describe()`

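describe() returns summary statistics for the DataFrame: count, mean, standard deviation, minimum, maximum, and the 25%, 50% (median), and 75% quartiles. By default, it summarizes only the numeric columns.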

df.describe()
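To illustrate the numeric-only default, here is a small sketch with one numeric and one text column (both column names and values are invented for the example). Passing include="all" asks describe() to summarize the text column as well:

```python
import pandas as pd

# Hypothetical sample data: one numeric column, one text column
df = pd.DataFrame({
    "age": [20, 30, 40, 50],
    "city": ["Nairobi", "Mombasa", "Nairobi", "Kisumu"],
})

summary = df.describe()                   # numeric columns only
median_age = summary.loc["50%", "age"]    # the 50% row is the median
all_summary = df.describe(include="all")  # also summarizes "city"

print(summary)
print(all_summary)
```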

`shape`

shape returns the dimensions of the DataFrame as a tuple (number of rows, number of columns).

df.shape

`size`

size returns the total number of elements in the DataFrame, that is, the number of rows multiplied by the number of columns.

df.size
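The relationship between the two attributes can be checked with a short sketch (the columns a and b are placeholders):

```python
import pandas as pd

# Hypothetical 3-row, 2-column DataFrame
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

rows, cols = df.shape   # shape is an attribute, so no parentheses
total = df.size         # rows * cols

print(rows, cols, total)
```

Note that shape and size are attributes, not methods, so writing df.shape() raises a TypeError.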

Data Cleaning

Removing Empty Cells

`dropna()`

dropna() removes rows that contain empty cells. By default, it returns a new DataFrame and leaves the original unchanged. To modify the existing DataFrame instead, pass inplace=True.

# Create a new DataFrame with empty cells removed
new_df = df.dropna()

# Modify the existing DataFrame in-place
df.dropna(inplace=True)
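Dropping every row that has any empty cell can discard more data than you intend. As a rough sketch, the how and subset parameters let you narrow the rule (the name and age columns here are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with gaps in both columns
df = pd.DataFrame({
    "name": ["Ann", "Ben", None, None],
    "age": [28, np.nan, 35, np.nan],
})

any_missing = df.dropna()                   # drop rows with at least one empty cell
all_missing = df.dropna(how="all")          # drop only rows where every cell is empty
name_required = df.dropna(subset=["name"])  # drop rows only if "name" is empty

print(any_missing)
print(all_missing)
print(name_required)
```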

[`fillna()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.fillna.html#pandas.DataFrame.fillna)

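fillna(value) replaces empty cells with the specified value, which can be a single scalar or a dict mapping column names to replacement values. It also accepts parameters such as axis, limit, and inplace. The method parameter is deprecated as of pandas 2.1 in favor of ffill() and bfill().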

# Replace empty cells with a specific value
df.fillna("Replacement Value", inplace=True)
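Filling every column with the same string is rarely what you want when columns have different types. A minimal sketch of per-column replacement and forward filling, assuming made-up city and temp columns:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with one gap in each column
df = pd.DataFrame({
    "city": ["Nairobi", None, "Kisumu"],
    "temp": [25.0, np.nan, 27.0],
})

# Per-column replacements: a label for text, the column mean for numbers
filled = df.fillna({"city": "Unknown", "temp": df["temp"].mean()})

# Forward fill: copy the previous row's value into each gap
forward = df.ffill()

print(filled)
print(forward)
```

Using the mean is only one possible choice; which replacement makes sense depends on the data.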

Handling Wrong Data Formats

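Columns often load with the wrong type, such as dates stored as plain strings. For example, to convert a column named "date" to datetime format (assuming pandas is imported as pd):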

df["date"] = pd.to_datetime(df["date"])
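Real data often contains values that cannot be parsed at all. One way to handle them, sketched below with an invented date column, is errors="coerce", which turns unparseable entries into NaT (the datetime equivalent of an empty cell) instead of raising an error:

```python
import pandas as pd

# Hypothetical sample data: the last entry is not a valid date
df = pd.DataFrame({"date": ["2023-10-20", "2023-10-21", "not a date"]})

# Parse with an explicit format; invalid entries become NaT
df["date"] = pd.to_datetime(df["date"], format="%Y-%m-%d", errors="coerce")

bad_rows = int(df["date"].isna().sum())  # number of entries that failed to parse
print(df)
print(bad_rows)
```

The resulting NaT values can then be removed with dropna() or replaced with fillna().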

Removing Duplicates

To identify and remove duplicate rows:

`duplicated()`

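duplicated() returns a Boolean Series indicating whether each row is a duplicate (True) or not (False). By default, the first occurrence of a row is marked False and only its later repeats are marked True.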

duplicate_rows = df.duplicated()

`drop_duplicates()`

drop_duplicates() removes duplicate rows, keeping the first occurrence of each, and returns a new DataFrame by default. Use the inplace=True parameter to modify the existing DataFrame.

df.drop_duplicates(inplace=True)
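Both methods accept subset and keep parameters for cases where only some columns define a duplicate. A short sketch, using made-up name and city columns:

```python
import pandas as pd

# Hypothetical sample data: rows 0 and 2 are identical
df = pd.DataFrame({
    "name": ["Ann", "Ben", "Ann", "Ann"],
    "city": ["Nairobi", "Kisumu", "Nairobi", "Mombasa"],
})

flags = df.duplicated()         # True only for the repeated row
deduped = df.drop_duplicates()  # full-row duplicates removed

# Treat rows as duplicates when "name" matches, keeping the last one
by_name_last = df.drop_duplicates(subset=["name"], keep="last")

print(flags)
print(deduped)
print(by_name_last)
```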

Accessing Data in a DataFrame

`at` and `iat`

`at`

at is used to get or set a specific element by row and column labels.

# Get the value at the row labeled 2, column "name"
value = df.at[2, "name"]

# Assign a new value to the selected element
df.at[2, "name"] = "Justkmike"

`iat`

iat is used to get or set a single element by integer position, counting rows and columns from 0.

# Get the value at row position 1, column position 2 (the second row, third column)
value = df.iat[1, 2]

# Update data at a specific index
df.iat[1, 2] = 10
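The difference between labels and positions is easiest to see when the index is not 0, 1, 2. In the sketch below, the index values 10, 20, 30 and the column names are chosen purely for illustration:

```python
import pandas as pd

# Hypothetical sample data with a non-default index
df = pd.DataFrame(
    {"name": ["Ann", "Ben", "Cy"], "age": [28, 35, 41]},
    index=[10, 20, 30],
)

by_label = df.at[20, "name"]  # row labeled 20, column "name"
by_position = df.iat[1, 0]    # second row, first column (the same cell)

df.at[30, "age"] = 42         # set by label
df.iat[0, 1] = 29             # set by position

print(by_label, by_position)
print(df)
```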

[`loc`](https://www.statology.org/pandas-loc-vs-iloc/) and `iloc`

`loc`

loc selects rows (and, optionally, columns) by label. It also accepts label slices and Boolean masks.

# Select a row with the index label "12-23-23"
selected_row = df.loc["12-23-23"]

`iloc`

iloc selects rows (and, optionally, columns) by integer position, counting from 0. As with normal Python slicing, the end of a slice is excluded.

# Select the first two rows and the first two columns
selected_data = df.iloc[0:2, 0:2]
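One detail that often trips people up is that label slices with loc include the end label, while position slices with iloc exclude the end position. A minimal sketch, assuming a DataFrame indexed by date strings with invented sales and returns columns:

```python
import pandas as pd

# Hypothetical sample data indexed by date strings
df = pd.DataFrame(
    {"sales": [100, 150, 120], "returns": [5, 3, 8]},
    index=["12-21-23", "12-22-23", "12-23-23"],
)

row = df.loc["12-23-23"]                              # single row as a Series
label_slice = df.loc["12-21-23":"12-22-23", "sales"]  # end label included
pos_slice = df.iloc[0:2, 0:1]                         # end position excluded
high_sales = df.loc[df["sales"] > 110, ["sales"]]     # Boolean mask with loc

print(row)
print(label_slice)
print(pos_slice)
print(high_sales)
```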

Remember that this is just the tip of the iceberg regarding Pandas. There are many more operations and functions available for data manipulation. If you encounter any issues or need further assistance, feel free to contact mwkariuki2e@gmail.com. Stay tuned for our next guide on data visualization. See you! 😊
