In this guide, we'll explore data-cleaning techniques using Python and the Pandas library. We'll cover the inspection methods head(), tail(), info(), and describe(), the attributes shape and size, and then show how to remove empty cells, fix wrong data formats, remove duplicates, and access data.
DataFrame Basics
head() and tail()
These methods display the first and last n rows of a DataFrame, respectively. If n is omitted, it defaults to 5.
# Display the first 5 rows
df.head()
# Display the last 5 rows
df.tail()
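Both methods also accept an explicit row count. The sketch below uses a small made-up DataFrame (the column names and values are illustrative assumptions, not data from this guide):

```python
import pandas as pd

# Hypothetical sample data, used only for illustration
df = pd.DataFrame({"name": list("abcdefg"), "score": range(7)})

first_three = df.head(3)  # first 3 rows
last_two = df.tail(2)     # last 2 rows

print(first_three)
print(last_two)
```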
info()
info() prints a concise summary of the DataFrame, including the index range, column data types, non-null counts, and memory usage.
df.info()
describe()
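describe() returns summary statistics. By default it covers only the numeric columns of a mixed-type DataFrame and reports the count, mean, standard deviation, minimum, maximum, and the 25th, 50th (median), and 75th percentiles.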
df.describe()
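As a rough illustration of what describe() returns, here is a sketch on a made-up DataFrame with one text column and one numeric column (both are assumptions for the example). The result is itself a DataFrame, so individual statistics can be read with loc:

```python
import pandas as pd

# Hypothetical sample data, used only for illustration
df = pd.DataFrame({"name": ["Ann", "Bob", "Cy"], "age": [30, 40, 50]})

summary = df.describe()  # only the numeric "age" column is summarized

mean_age = summary.loc["mean", "age"]
median_age = summary.loc["50%", "age"]  # the 50th percentile is the median

print(summary)
```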
shape
shape returns the dimensions of the DataFrame as a tuple (number of rows, number of columns). It is an attribute, not a method, so it takes no parentheses.
df.shape
size
size returns the total number of elements in the DataFrame, that is, the number of rows multiplied by the number of columns.
df.size
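The relationship between shape and size can be checked directly. A minimal sketch, assuming a small made-up DataFrame:

```python
import pandas as pd

# Hypothetical sample data, used only for illustration
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

rows, cols = df.shape  # unpack the (rows, columns) tuple
total = df.size        # rows * cols

print(rows, cols, total)
```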
Data Cleaning
Removing Empty Cells
dropna()
dropna() removes rows that contain empty cells. By default it returns a new DataFrame and leaves the original unchanged. To modify the existing DataFrame instead, pass inplace=True.
# Create a new DataFrame with empty cells removed
new_df = df.dropna()
# Modify the existing DataFrame in-place
df.dropna(inplace=True)
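Dropping every row with any empty cell can discard more data than intended. The sketch below shows three narrower options using the subset, axis/how, and thresh parameters of dropna(); the DataFrame and its column names are made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, used only for illustration
df = pd.DataFrame({
    "name": ["Ann", None, "Cy"],
    "age": [30, 40, np.nan],
    "city": [None, None, None],
})

# Drop rows only when "name" is missing
by_name = df.dropna(subset=["name"])

# Drop columns in which every value is missing
no_empty_cols = df.dropna(axis=1, how="all")

# Keep only rows that have at least 2 non-missing values
at_least_two = df.dropna(thresh=2)
```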
[fillna](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.fillna.html#pandas.DataFrame.fillna)()
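fillna("Value to replace with") replaces empty cells with a specified value. The value can be a single scalar or a dict that maps column names to replacement values. It also accepts parameters such as axis, limit, and inplace. The method parameter is deprecated in recent pandas releases in favor of ffill() and bfill().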
# Replace empty cells with a specific value
df.fillna("Replacement Value", inplace=True)
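A single replacement value rarely suits every column. One possible approach, sketched here on made-up data (the "city" and "temp" columns are assumptions), is to pass a dict of per-column values, or to carry the previous value forward with ffill():

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, used only for illustration
df = pd.DataFrame({
    "city": ["Nairobi", None, "Mombasa"],
    "temp": [25.0, np.nan, 30.0],
})

# Fill each column with its own replacement value
filled = df.fillna({"city": "Unknown", "temp": df["temp"].mean()})

# Alternatively, carry the last valid value forward
carried = df["temp"].ffill()
```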
Handling Wrong Data Formats
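Columns that store dates or numbers as text should be converted to the proper type before analysis. For example, to convert a column named "date" to datetime format (assuming pandas is imported as pd):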
df["date"] = pd.to_datetime(df["date"])
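By default, pd.to_datetime() raises an error when it meets a value it cannot parse. A minimal sketch, assuming a made-up "date" column with one bad entry, uses errors="coerce" to turn unparseable values into NaT so they can then be handled like any other empty cell:

```python
import pandas as pd

# Hypothetical sample data with one malformed entry
df = pd.DataFrame({"date": ["2023-12-23", "not a date", "2023-12-25"]})

# Unparseable values become NaT instead of raising an error
df["date"] = pd.to_datetime(df["date"], errors="coerce")

# Rows with NaT can then be removed like other empty cells
clean = df.dropna(subset=["date"])
```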
Removing Duplicates
To identify and remove duplicate rows:
duplicated()
duplicated() returns a Boolean Series indicating whether each row is a duplicate (True) or not (False). By default, the first occurrence of a row is not marked as a duplicate; only later repeats are.
duplicate_rows = df.duplicated()
drop_duplicates()
drop_duplicates() removes duplicate rows, keeping the first occurrence by default. It returns a new DataFrame unless you pass inplace=True to modify the existing one.
df.drop_duplicates(inplace=True)
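Both methods accept subset and keep parameters, which control which columns are compared and which occurrence survives. A small sketch on made-up data (column names are illustrative assumptions):

```python
import pandas as pd

# Hypothetical sample data, used only for illustration
df = pd.DataFrame({
    "name": ["Ann", "Ann", "Bob", "Ann"],
    "age": [30, 30, 40, 31],
})

# Rows identical across all columns: only row 1 repeats row 0
mask = df.duplicated()

# Drop fully identical rows, keeping the first occurrence
deduped = df.drop_duplicates()

# Compare on "name" only and keep the last occurrence of each name
by_name_last = df.drop_duplicates(subset=["name"], keep="last")
```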
Accessing Data in a DataFrame
at and iat
at
at gets or sets a single value by row and column labels.
# Get the value at row 2, column "name"
value = df.at[2, "name"]
# Assign a new value to the selected element
df.at[2, "name"] = "Justkmike"
iat
iat gets or sets a single value by zero-based integer position (row position, column position).
# Get the value at row position 1, column position 2 (both zero-based)
value = df.iat[1, 2]
# Update data at a specific index
df.iat[1, 2] = 10
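The difference between labels and positions is easiest to see when the index is not 0, 1, 2. The sketch below assumes a made-up DataFrame whose row labels are 10, 20, and 30:

```python
import pandas as pd

# Hypothetical sample data with non-default row labels
df = pd.DataFrame(
    {"name": ["Ann", "Bob", "Cy"], "age": [30, 40, 50]},
    index=[10, 20, 30],
)

by_label = df.at[20, "name"]  # row label 20, column label "name"
by_position = df.iat[1, 0]    # second row, first column (zero-based)

# Both refer to the same cell here, so either can update it
df.at[20, "age"] = 41
```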
[loc](https://www.statology.org/pandas-loc-vs-iloc/) and iloc
loc
loc selects rows and columns by label. It also accepts label slices and Boolean masks.
# Select a row with the index label "12-23-23"
selected_row = df.loc["12-23-23"]
iloc
iloc selects rows and columns by zero-based integer position. As with ordinary Python slices, the end position is excluded.
# Select the first two rows and the first two columns
selected_data = df.iloc[0:2, 0:2]
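One detail that often surprises newcomers is that label slices with loc include the end label, while position slices with iloc exclude the end position. A minimal sketch, assuming a made-up DataFrame indexed by date strings like the one in the loc example above:

```python
import pandas as pd

# Hypothetical sample data indexed by date strings
df = pd.DataFrame(
    {"sales": [100, 150, 200], "returns": [5, 3, 8]},
    index=["12-23-23", "12-24-23", "12-25-23"],
)

# A single label returns that row as a Series
row = df.loc["12-23-23"]

# Label slice: both endpoints are included (2 rows)
label_slice = df.loc["12-23-23":"12-24-23", "sales"]

# Position slice: the end position is excluded (2 rows, 1 column)
pos_slice = df.iloc[0:2, 0:1]

# loc also accepts a Boolean mask to filter rows
high_sales = df.loc[df["sales"] > 120, ["sales"]]
```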
Remember that this is just the tip of the iceberg regarding Pandas. There are many more operations and functions available for data manipulation. If you encounter any issues or need further assistance, feel free to contact mwkariuki2e@gmail.com. Stay tuned for our next guide on data visualization. See you! 😊