Today, I explored common data issues in pandas and how to handle them. Here’s what I learned:
1. Handling Empty/Null Values
Definition: Empty or null values are missing data points in a dataset that can affect analysis.
import pandas as pd
import numpy as np
data = {'Name': ['Ramya', 'Aruna', None, 'Sekar'],
'Age': [25, np.nan, 22, 28]}
df = pd.DataFrame(data)
# Check null values
print(df.isnull())
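# Fill missing ages with the column mean.
# Assign the result back; calling fillna(inplace=True) on a selected column triggers warnings in recent pandas versions.
df['Age'] = df['Age'].fillna(df['Age'].mean())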
print(df)
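Filling is not the only option. As a small sketch (the variable names here are my own, not part of the example above), counting the nulls per column and dropping incomplete rows could look like this:

```python
import numpy as np
import pandas as pd

people = pd.DataFrame({
    "Name": ["Ramya", "Aruna", None, "Sekar"],
    "Age": [25, np.nan, 22, 28],
})

# Count missing values per column instead of printing the full boolean mask
null_counts = people.isnull().sum()
print(null_counts)

# Drop every row that has at least one missing value
complete_rows = people.dropna()
print(complete_rows)

# Drop only rows where 'Name' is missing; rows with a missing 'Age' are kept
named_rows = people.dropna(subset=["Name"])
print(named_rows)
```

Whether to fill or drop depends on how much data you can afford to lose, so treat this as one possible approach.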
2. Removing Duplicates
Definition: Duplicate rows are repeated entries in the dataset. Removing them keeps counts and averages from being skewed.
df = pd.DataFrame({'Name': ['Ramya', 'Aruna', 'Ramya'], 'Age': [25, 22, 25]})
df = df.drop_duplicates()
print(df)
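By default, drop_duplicates() only removes rows where every column matches. A minimal sketch, assuming you want to treat rows with the same Name as duplicates even when Age differs (the sample ages are made up for illustration):

```python
import pandas as pd

records = pd.DataFrame({
    "Name": ["Ramya", "Aruna", "Ramya"],
    "Age": [25, 22, 26],  # hypothetical: the second 'Ramya' row has a different age
})

# Flag rows whose 'Name' already appeared earlier in the frame
dup_mask = records.duplicated(subset=["Name"])
print(dup_mask)

# Keep only the last occurrence of each name
latest = records.drop_duplicates(subset=["Name"], keep="last")
print(latest)
```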
3. Case Sensitivity
Definition: String comparisons in pandas are case-sensitive, so 'Ramya' and 'ramya' count as different values. Standardizing case prevents mismatches.
df['Name'] = df['Name'].str.upper()
print(df)
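To see why this matters, here is a small sketch (sample values are my own) comparing an exact match with a case-normalized match:

```python
import pandas as pd

names = pd.DataFrame({"Name": ["Ramya", "ARUNA", "ramya"]})

# Exact comparison only matches the lowercase spelling
exact = names[names["Name"] == "ramya"]
print(exact)

# Lowercasing both sides makes the comparison case-insensitive
normalized = names[names["Name"].str.lower() == "ramya"]
print(normalized)

# str.contains also accepts case=False for substring searches
contains_ram = names["Name"].str.contains("ram", case=False)
print(contains_ram)
```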
4. Changing Data Type
Definition: Data type conversion ensures operations work correctly (e.g., numbers as int/float).
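# astype(int) raises an error if the column still contains NaN, so fill or drop missing values first
df['Age'] = df['Age'].astype(int)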
print(df.dtypes)
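Real data often arrives as text with a few unparseable entries. A hedged sketch, assuming an Age column stored as strings (the "unknown" value is made up), using pd.to_numeric and the nullable Int64 dtype:

```python
import pandas as pd

raw = pd.DataFrame({"Age": ["25", "22", "unknown"]})

# errors="coerce" turns values that cannot be parsed into NaN instead of raising
ages = pd.to_numeric(raw["Age"], errors="coerce")
print(ages.dtype)

# The nullable 'Int64' dtype can hold integers alongside missing values
raw["Age"] = ages.astype("Int64")
print(raw.dtypes)
print(raw)
```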
5. Adding/Removing Columns
Definition: Adding or removing columns allows you to customize the dataset for analysis.
# Add a new column (the list needs one value per row; df has 2 rows after drop_duplicates)
df['City'] = ['Chennai', 'Velachery']
# Remove a column
df.drop('City', axis=1, inplace=True)
print(df)
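If you would rather not modify the DataFrame in place, assign() and drop(columns=...) return new DataFrames. A small sketch; the Age_Next_Year column is a hypothetical example of a derived column:

```python
import pandas as pd

people = pd.DataFrame({"Name": ["Ramya", "Aruna"], "Age": [25, 22]})

# assign() returns a new DataFrame with the extra column; 'people' is unchanged
with_next_year = people.assign(Age_Next_Year=people["Age"] + 1)
print(with_next_year)

# drop(columns=...) is an alternative spelling of drop(..., axis=1)
trimmed = with_next_year.drop(columns=["Age"])
print(trimmed)
```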
6. Removing Rows
Definition: Filtering out unwanted rows keeps only relevant data for analysis.
df = df[df['Age'] > 22]
print(df)
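Rows can also be removed by index label rather than by condition. A minimal sketch with sample data of my own, which also resets the index after filtering so it runs 0, 1, 2... again:

```python
import pandas as pd

people = pd.DataFrame({"Name": ["Ramya", "Aruna", "Sekar"], "Age": [25, 22, 28]})

# Remove the row with index label 0
without_first = people.drop(index=0)
print(without_first)

# Filtering keeps the original index labels; reset_index(drop=True) renumbers them
older = people[people["Age"] > 22].reset_index(drop=True)
print(older)
```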
7. Interpolate Missing Values
Definition: Interpolation fills missing values by estimating them from existing data.
df = pd.DataFrame({'Score': [85, None, 90, None, 95]})
df['Score'] = df['Score'].interpolate()
print(df)
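The default method is linear, so each gap is filled with a value on the straight line between its neighbours. A small sketch showing that result, plus one caveat: a missing value at the very start has no earlier neighbour and stays NaN unless you handle it separately (here with bfill(), which is just one option):

```python
import numpy as np
import pandas as pd

scores = pd.Series([85, np.nan, 90, np.nan, 95], dtype="float64")

# Linear interpolation: each gap becomes the midpoint of its neighbours
filled = scores.interpolate()
print(filled.tolist())  # [85.0, 87.5, 90.0, 92.5, 95.0]

# A leading NaN is not filled by the default forward direction
late_start = pd.Series([np.nan, 10.0, np.nan, 30.0])
forward_only = late_start.interpolate()
print(forward_only.tolist())

# Back-fill afterwards to copy the first valid value into the leading gap
backfilled = forward_only.bfill()
print(backfilled.tolist())  # [10.0, 10.0, 20.0, 30.0]
```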
8. String Operations
Definition: String operations help clean and transform text data.
df = pd.DataFrame({'Name': [' ramya ', 'aruna', 'sekar']})
# Strip surrounding whitespace, then convert to title case
df['Name'] = df['Name'].str.strip().str.title()
print(df)
# First letter of each name (string indexing, not a split)
df['First_Letter'] = df['Name'].str[0]
print(df)
# Swapcase
df['Name'] = df['Name'].str.swapcase()
print(df)
# Length
df['Length'] = df['Name'].str.len()
print(df)
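For an actual split, str.split() breaks each string on a separator. A hedged sketch, assuming a hypothetical Name_City column where the two values are joined by a hyphen:

```python
import pandas as pd

# Hypothetical combined column, made up for illustration
people = pd.DataFrame({"Name_City": ["Ramya-Chennai", "Aruna-Velachery"]})

# expand=True returns a DataFrame with one column per piece
parts = people["Name_City"].str.split("-", expand=True)
people["Name"] = parts[0]
people["City"] = parts[1]
print(people)
```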
9. Filtering
Definition: Filtering selects only the rows that meet certain conditions.
df = pd.DataFrame({'Name': ['Ramya', 'Aruna', 'Sekar'], 'Age': [25, 22, 28]})
filtered = df[df['Age'] > 23]
print(filtered)
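Conditions can be combined. A small sketch using the same sample data: wrap each condition in parentheses and join them with & (and) or | (or), use isin() for membership tests, or write the condition as a string with query():

```python
import pandas as pd

people = pd.DataFrame({"Name": ["Ramya", "Aruna", "Sekar"], "Age": [25, 22, 28]})

# Both conditions must hold; the parentheses are required
in_range = people[(people["Age"] > 22) & (people["Age"] < 28)]
print(in_range)

# Keep rows whose Name is in a given list
selected = people[people["Name"].isin(["Aruna", "Sekar"])]
print(selected)

# query() expresses the same range filter as a string
same_range = people.query("Age > 22 and Age < 28")
print(same_range)
```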
10. Map vs Filter
Definition:
- Map: Applies a function to each element and returns the transformed values.
- Filter: Keeps only the elements for which a condition is true.
nums = [1, 2, 3, 4, 5]
# Map: apply function to each element
squared = list(map(lambda x: x**2, nums))
print(squared)
# Filter: select elements based on condition
even = list(filter(lambda x: x % 2 == 0, nums))
print(even)
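The same two ideas carry over to list comprehensions and to pandas. A minimal sketch, assuming the same nums list: Series.map() plays the role of map, and a boolean mask plays the role of filter:

```python
import pandas as pd

nums = [1, 2, 3, 4, 5]

# List comprehensions: the usual Python alternative to map() and filter()
squared = [x ** 2 for x in nums]
even = [x for x in nums if x % 2 == 0]
print(squared, even)

# pandas equivalents on a Series
s = pd.Series(nums)
squared_s = s.map(lambda x: x ** 2)  # map: transform every element
even_s = s[s % 2 == 0]               # filter: keep elements matching a condition
print(squared_s.tolist(), even_s.tolist())
```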
✨ Today’s takeaway: Cleaning up data quality issues in pandas is a crucial step before performing any analysis.