Pandas
Grouping & Aggregation
Grouping a DataFrame and applying aggregation functions, such as sum and mean, calculated per category.
import pandas as pd
# Create a Data Frame
data = {'Category':['A', 'B', 'C', 'A','B','C','C','A'],
'Value':[10, 20, 40, 45, 50, 90,100, 200]}
df = pd.DataFrame(data)
# Grouping by 'Category' and calculating the sum and mean
grouped = df.groupby('Category').agg({'Value': ['sum', 'mean']})
print("Grouped Data Frame with Aggregation: \n\n", grouped)
Output:
Grouped Data Frame with Aggregation:
| Category | Value (sum) | Value (mean) |
|---|---|---|
| A | 255 | 85.000000 |
| B | 70 | 35.000000 |
| C | 230 | 76.666667 |
Here, the data is grouped by Category, and for each category we calculate:
- the sum of the Value column
- the mean of the Value column
Explanation of Columns:
- Category: The category label for each group.
- Value (sum): The total sum of values within each category.
- Value (mean): The average value within each category.
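The aggregated result carries a two-level column index (Value paired with sum and mean). If flat column names are easier to work with downstream, the levels can be joined; a minimal sketch, where the joined names ('Value_sum', 'Value_mean') are illustrative:
import pandas as pd
df = pd.DataFrame({'Category': ['A', 'B', 'C', 'A', 'B', 'C', 'C', 'A'],
'Value': [10, 20, 40, 45, 50, 90, 100, 200]})
grouped = df.groupby('Category').agg({'Value': ['sum', 'mean']})
# Join each (column, function) pair into a single flat name
grouped.columns = ['_'.join(col) for col in grouped.columns]
print(grouped)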
Grouping & Aggregation with Multiple Functions
Grouping a DataFrame and applying multiple aggregation functions, including sum, mean, and max, calculated per category.
import pandas as pd
# Create dataframe
data = {
'Category':['A','A','B','B'],
'Values':[10, 20, 30,40]
}
df = pd.DataFrame(data)
# Grouping by 'Category' and calculating the sum, mean, and max
grouped_df = df.groupby('Category').agg({'Values': ['sum', 'mean','max']})
print("Grouped Data Frame with with multiple functions \n\n", grouped_df)
Output:
Grouped Data Frame with Multiple Functions:
| Category | sum | mean | max |
|---|---|---|---|
| A | 30 | 15.0 | 20 |
| B | 70 | 35.0 | 40 |
MultiIndex DataFrame Operations
import pandas as pd
# Create Multi Index Dataframe
arrays = [['A', 'A', 'B', 'B'], [2020, 2021, 2022, 2023]]
index = pd.MultiIndex.from_arrays(arrays, names=('Category', 'Year'))
data = [100, 150, 200, 250]
df = pd.DataFrame(data, index=index, columns=['Sales'])
# Accessing data in a multi-index Dataframe
sales_in_2020 = df.xs(2020, level='Year')
print("Sales Data in 2020: ", sales_in_2020)
Output
| Category | Sales (2020) |
|---|---|
| A | 100 |
Here, the output shows the sales for 2020; only category A has an entry for that year.
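By default, xs() drops the level it selects on. If keeping the Year level matters, drop_level=False preserves it; a minimal sketch reusing the DataFrame above:
import pandas as pd
arrays = [['A', 'A', 'B', 'B'], [2020, 2021, 2022, 2023]]
index = pd.MultiIndex.from_arrays(arrays, names=('Category', 'Year'))
df = pd.DataFrame([100, 150, 200, 250], index=index, columns=['Sales'])
# Keep the 'Year' level in the result instead of dropping it
sales_in_2020 = df.xs(2020, level='Year', drop_level=False)
print(sales_in_2020)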
Using Apply with Lambda Functions
import pandas as pd
# Create Dataframe
data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Score':[85,90,95]}
df = pd.DataFrame(data)
# Applying a lambda function to modify scores
df['Adjusted Score'] = df['Score'].apply(lambda x: x+5 if x<90 else x)
print("Dataframe with Adjusted Scores :\n",df)
Output
| Name | Score | Adjusted Score |
|---|---|---|
| Alice | 85 | 90 |
| Bob | 90 | 90 |
| Charlie | 95 | 95 |
Here, scores below 90 are increased by 5, while scores of 90 and above are left unchanged.
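For a simple element-wise condition like this, a vectorized alternative to apply() is usually faster on large frames. A minimal sketch of the same adjustment using Series.where(), which keeps values that satisfy the condition and replaces the rest:
import pandas as pd
df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'], 'Score': [85, 90, 95]})
# Keep scores of 90 or more; add 5 to everything else
df['Adjusted Score'] = df['Score'].where(df['Score'] >= 90, df['Score'] + 5)
print(df)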
Merging DataFrame with Different Keys
import pandas as pd
# Create two DataFrames
df1 = pd.DataFrame({'ID':[1,2,3], 'Name':['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'Emp_ID':[2,3,4], 'Department':['HR','Finance','IT']})
# Merging with different keys
merged_df = pd.merge(df1, df2, left_on='ID', right_on='Emp_ID', how='outer')
print("Merged Dataframe: \n", merged_df)
Output
| ID | Name | Emp_ID | Department |
|---|---|---|---|
| 1.0 | Alice | NaN | NaN |
| 2.0 | Bob | 2.0 | HR |
| 3.0 | Charlie | 3.0 | Finance |
| NaN | NaN | 4.0 | IT |
Here, the merged data frame combines two data sources and may contain missing values:
- ID: Identifier from the first data source.
- Name: Name from the first data source.
- Emp_ID: Employee ID from the second data source.
- Department: Department from the second data source.
- NaN values indicate missing data from one of the sources.
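After an outer merge on differently named keys, the two key columns can be coalesced into a single column. A minimal sketch; the combined 'Key' column is an illustrative name, not part of the example above:
import pandas as pd
df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'Emp_ID': [2, 3, 4], 'Department': ['HR', 'Finance', 'IT']})
merged_df = pd.merge(df1, df2, left_on='ID', right_on='Emp_ID', how='outer')
# Take ID where present, otherwise fall back to Emp_ID
merged_df['Key'] = merged_df['ID'].fillna(merged_df['Emp_ID']).astype(int)
print(merged_df)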
Handling Missing Values with Custom Functions
import pandas as pd
import numpy as np
# Creating dataframe with missing values
data = {'Name':['Alice', 'Bob', np.nan, 'David'], 'Age':[13,np.nan,32,20]}
df = pd.DataFrame(data)
# Filling missing values with per-column defaults (assigning back avoids
# the chained-assignment warning that inplace fillna on a column triggers)
df['Name'] = df['Name'].fillna('Unknown')
df['Age'] = df['Age'].fillna(df['Age'].median())
print("Dataframe after handling Missing values: \n", df)
Output
Dataframe after handling Missing values:
| Name | Age |
|---|---|
| Alice | 13.0 |
| Bob | 20.0 |
| Unknown | 32.0 |
| David | 20.0 |
Here, missing values in the Name column were replaced with a default string ("Unknown"), and missing ages with the column median.
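Both fills can also be expressed in one call by passing a column-to-value mapping to fillna(); a minimal sketch of the same replacement:
import pandas as pd
import numpy as np
data = {'Name':['Alice', 'Bob', np.nan, 'David'], 'Age':[13, np.nan, 32, 20]}
df = pd.DataFrame(data)
# One fillna call with a per-column default
df = df.fillna({'Name': 'Unknown', 'Age': df['Age'].median()})
print(df)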
Handling Missing Values with ffill & bfill methods
import pandas as pd
import numpy as np
# Creating dataframe with missing values
data = {'Name':['Alice', 'Bob', np.nan, 'David'], 'Age':[13,np.nan,32,20]}
df = pd.DataFrame(data)
# Filling missing values using ffill (forward fill) & bfill (backward fill);
# fillna(method=...) is deprecated in modern pandas
df_ffill = df.ffill()
df_bfill = df.bfill()
print("Dataframe with forward fill: \n", df_ffill)
print("Dataframe with backward fill: \n", df_bfill)
Output
DataFrame with Forward Fill
Forward fill propagates the last valid non-null value forward until the next non-null value is encountered.
| Name | Age |
|---|---|
| Alice | 13.0 |
| Bob | 13.0 |
| Bob | 32.0 |
| David | 20.0 |
DataFrame with Backward Fill
Backward fill propagates the next valid non-null value backward until the previous non-null value is encountered.
| Name | Age |
|---|---|
| Alice | 13.0 |
| Bob | 32.0 |
| David | 32.0 |
| David | 20.0 |
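If propagating values indefinitely is too aggressive, both methods accept a limit on how many consecutive missing values to fill; a minimal sketch:
import pandas as pd
import numpy as np
data = {'Name':['Alice', 'Bob', np.nan, 'David'], 'Age':[13, np.nan, 32, 20]}
df = pd.DataFrame(data)
# Fill at most one consecutive missing value in each direction
print("Forward fill, limit 1: \n", df.ffill(limit=1))
print("Backward fill, limit 1: \n", df.bfill(limit=1))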
Pivoting DataFrame
import pandas as pd
# Create dataframe
data = {'Date':['2023-01-01', '2023-01-02','2023-01-01','2023-01-02'],
'City':['NY', 'NY','LA', 'LA'],
'Sales':[200,250,300,400]
}
df = pd.DataFrame(data)
# Pivoting the Dataframe
pivot_df = df.pivot(index='Date', columns='City', values='Sales')
print('Pivoted Dataframe : \n', pivot_df)
Output
Pivoted Dataframe :
| Date | LA | NY |
|---|---|---|
| 2023-01-01 | 300 | 200 |
| 2023-01-02 | 400 | 250 |
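Note that pivot() raises a ValueError if any (index, column) pair occurs more than once. When duplicates are possible, pivot_table() with an aggregation function is the safer choice; a minimal sketch with a duplicated (Date, City) pair:
import pandas as pd
df = pd.DataFrame({'Date': ['2023-01-01', '2023-01-01', '2023-01-02'],
'City': ['NY', 'NY', 'LA'],
'Sales': [200, 100, 400]})
# Duplicate (Date, City) pairs are aggregated instead of raising an error
pivot_df = df.pivot_table(index='Date', columns='City', values='Sales', aggfunc='sum')
print(pivot_df)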
Pivoting DataFrame with multi index columns
import pandas as pd
# Create dataframe
data = {'Date':['2023-01-01', '2023-01-02','2023-01-01','2023-01-02'],
'City':['NY', 'NY','LA', 'LA'],
'Type': ['Online', 'Store', 'Online','Store'],
'Sales':[200,250,300,400]
}
df = pd.DataFrame(data)
# Pivoting to create a Dataframe with multi-index columns
pivot_df = df.pivot(index='Date', columns=['City','Type'], values='Sales')
print('Pivoted Dataframe with multi-index columns: \n', pivot_df)
Output
Pivoted DataFrame with Multi-Index Columns:
| Date | NY (Online) | NY (Store) | LA (Online) | LA (Store) |
|---|---|---|---|---|
| 2023-01-01 | 200.0 | NaN | 300.0 | NaN |
| 2023-01-02 | NaN | 250.0 | NaN | 400.0 |
Melting DataFrame
import pandas as pd
# Create dataframe
data = {'ID':[1,2], 'Math':[90,79], 'Science':[85,88]}
df = pd.DataFrame(data)
# Melting the dataframe
melted_dataframe = pd.melt(df,id_vars=['ID'], value_vars=['Math', 'Science'], var_name='Subject', value_name='Score')
print("Melted Dataframe : \n", melted_dataframe)
Output
Melted DataFrame :
| ID | Subject | Score |
|---|---|---|
| 1 | Math | 90 |
| 2 | Math | 79 |
| 1 | Science | 85 |
| 2 | Science | 88 |
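A melt can be reversed with pivot(), turning the long format back into one column per subject; a minimal sketch starting from the melted result above:
import pandas as pd
melted = pd.DataFrame({'ID': [1, 2, 1, 2],
'Subject': ['Math', 'Math', 'Science', 'Science'],
'Score': [90, 79, 85, 88]})
# Pivot the long format back to wide: one column per Subject
wide_df = melted.pivot(index='ID', columns='Subject', values='Score').reset_index()
print(wide_df)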
Time-based indexing and resampling
import pandas as pd
# Create time Series Dataframe
date_range = pd.date_range(start='2023-01-01', periods=10, freq='D')
data = {'Sales':[100, 200, 150, 300, 250, 400, 300, 350, 300, 400]}
df = pd.DataFrame(data, index=date_range)
# Resampling to find weekly sales
weekly_sales = df.resample('W').sum()
print("Weekly Sales : \n", weekly_sales)
Output
Weekly Sales :
| Date | Sales |
|---|---|
| 2023-01-01 | 100 |
| 2023-01-08 | 1950 |
| 2023-01-15 | 700 |
Resampling with different aggregations
import pandas as pd
# Create time Series Dataframe
date_range = pd.date_range(start='2023-01-01', periods=10, freq='D')
data = {'Sales':[100, 200, 150, 300, 250, 400, 300, 350, 300, 400]}
df = pd.DataFrame(data, index=date_range)
# Resampling weekly with multiple aggregations
resampled_df = df.resample('W').agg({'Sales':['sum', 'mean', 'max']})
print("Resampled Dataframe with different aggregation : \n", resampled_df)
Output
Resampled DataFrame with Different Aggregation:
| Date | Sales (sum) | Sales (mean) | Sales (max) |
|---|---|---|---|
| 2023-01-01 | 100 | 100.000000 | 100 |
| 2023-01-08 | 1950 | 278.571429 | 400 |
| 2023-01-15 | 700 | 350.000000 | 400 |
Conditional Filtering with Multiple Conditions
import pandas as pd
# Create dataframe
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'], 'Score':[85, 80,40,70], 'Passed':[True, False, True, False]}
df = pd.DataFrame(data)
# Filtering with multiple conditions
filtered_df = df[(df['Score'] > 80) & (df['Passed'] == True)]
print("FIltered Dataframe : \n", filtered_df)
Output
Filtered DataFrame:
| Name | Score | Passed |
|---|---|---|
| Alice | 85 | True |
Creating Custom Categorical Data
import pandas as pd
# Create dataframe
data = {'Name':['Alice', 'Bob', 'Charlie', 'David'], 'Score': [85, 70, 95,60]}
df = pd.DataFrame(data)
# Creating a new column with categorical data
df['Performance'] = pd.cut(df['Score'], bins=[0,70,90,100], labels=['Poor', 'Average', 'Excellent'])
print('Dataframe with categorical performance : \n', df)
Output
DataFrame with Categorical Performance :
| Name | Score | Performance |
|---|---|---|
| Alice | 85 | Average |
| Bob | 70 | Poor |
| Charlie | 95 | Excellent |
| David | 60 | Poor |
Binning Data using pd.cut
import pandas as pd
# Create dataframe
data = {'Score': [85, 70, 95,60, 100, 90]}
df = pd.DataFrame(data)
# Binning the scores into categories
df['Grade'] = pd.cut(df['Score'], bins=[0,59,69,79,89,100], labels=['F', 'D', 'C', 'B','A'])
print('Dataframe with binned scores : \n', df)
Output
Dataframe with binned scores:
| Score | Grade |
|---|---|
| 85 | B |
| 70 | C |
| 95 | A |
| 60 | D |
| 100 | A |
| 90 | A |
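pd.cut bins on fixed value boundaries; the related pd.qcut bins on quantiles instead, so each bin holds roughly the same number of rows. A minimal sketch; the tier labels are illustrative:
import pandas as pd
df = pd.DataFrame({'Score': [85, 70, 95, 60, 100, 90]})
# Three quantile-based bins of roughly equal size
df['Tier'] = pd.qcut(df['Score'], q=3, labels=['Low', 'Mid', 'High'])
print(df)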
Rolling Window Calculations
import pandas as pd
# Create a time series dataframe
data_range = pd.date_range(start='2023-01-01', periods=10, freq='D')
data = {'Sales': [100, 200, 150, 300, 250, 400, 300, 350, 300, 400]}
df = pd.DataFrame(data, index= data_range)
# Calculating a rolling mean with a window of 3 days
df['Rolling Mean'] = df['Sales'].rolling(window=3).mean()
print("Dataframe with rolling mean : \n", df)
Output
DataFrame with Rolling Mean :
| Date | Sales | Rolling Mean |
|---|---|---|
| 2023-01-01 | 100 | NaN |
| 2023-01-02 | 200 | NaN |
| 2023-01-03 | 150 | 150.000000 |
| 2023-01-04 | 300 | 216.666667 |
| 2023-01-05 | 250 | 233.333333 |
| 2023-01-06 | 400 | 316.666667 |
| 2023-01-07 | 300 | 316.666667 |
| 2023-01-08 | 350 | 350.000000 |
| 2023-01-09 | 300 | 316.666667 |
| 2023-01-10 | 400 | 350.000000 |
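The leading NaN values appear because a full window of 3 observations is required. min_periods relaxes that, so shorter windows at the start are averaged too; a minimal sketch:
import pandas as pd
date_range = pd.date_range(start='2023-01-01', periods=5, freq='D')
df = pd.DataFrame({'Sales': [100, 200, 150, 300, 250]}, index=date_range)
# Allow partial windows at the start of the series
df['Rolling Mean'] = df['Sales'].rolling(window=3, min_periods=1).mean()
print(df)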
Shift and Lagging Data
import pandas as pd
# Create Dataframe
data = {'Date': pd.date_range(start='2023-01-01', periods=5, freq='D'),
'Temperature': [30, 32, 35,33, 31]}
df = pd.DataFrame(data)
# Shifting data by one to create a lag
df['Prev Day Temp'] = df['Temperature'].shift(1)
print('Dataframe with shifted data: \n', df)
Output
DataFrame with Previous Day Temperature:
| Date | Temperature | Prev Day Temp |
|---|---|---|
| 2023-01-01 | 30 | NaN |
| 2023-01-02 | 32 | 30.0 |
| 2023-01-03 | 35 | 32.0 |
| 2023-01-04 | 33 | 35.0 |
| 2023-01-05 | 31 | 33.0 |
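A common follow-up to a lag column is the day-over-day change, which diff() computes directly (equivalent to the column minus its shifted copy); a minimal sketch:
import pandas as pd
df = pd.DataFrame({'Date': pd.date_range(start='2023-01-01', periods=5, freq='D'),
'Temperature': [30, 32, 35, 33, 31]})
# diff(1) is Temperature - Temperature.shift(1)
df['Temp Change'] = df['Temperature'].diff(1)
print(df)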
Cumulative sum and product
import pandas as pd
# Create dataframe
data = {'Sales': [100, 200, 150, 300, 250]}
df = pd.DataFrame(data)
# Calculating cumulative sum and product
df['cumulative sum'] = df['Sales'].cumsum()
df['cumulative product'] = df['Sales'].cumprod()
print('DataFrame with cumulative operations:\n', df)
Output
DataFrame with Cumulative Operations:
| Sales | Cumulative Sum | Cumulative Product |
|---|---|---|
| 100 | 100 | 100 |
| 200 | 300 | 20000 |
| 150 | 450 | 3000000 |
| 300 | 750 | 900000000 |
| 250 | 1000 | 225000000000 |
Merging on multiple columns
import pandas as pd
# Create two dataframe
df1 = pd.DataFrame({'ID':[1,2,3], 'Year':[2020,2021,2023], 'Name':['Alice','Bob','Charlie'] })
df2 = pd.DataFrame({'ID':[2,3,1], 'Year':[2023,2023,2020], 'Score':[88,92,75]})
# Merging on multiple columns
merged_df = pd.merge(df1,df2, on=['ID','Year'], how='inner')
print('Merged Dataframe on multiple columns : \n', merged_df)
Output
Merged DataFrame on Multiple Columns:
| ID | Year | Name | Score |
|---|---|---|---|
| 1 | 2020 | Alice | 75 |
| 3 | 2023 | Charlie | 92 |
Handling Outliers
import pandas as pd
# Create dataframe
data = {'Value': [10, 12, 14, 100, 15, 13, 12]}
df = pd.DataFrame(data)
# Identifying outliers using the IQR method
Q1 = df['Value'].quantile(0.25)
Q3 = df['Value'].quantile(0.75)
IQR = Q3 - Q1
outliers = df[(df['Value'] < (Q1 - 1.5 * IQR)) | (df['Value'] > (Q3 + 1.5 * IQR))]
print('Outliers: \n', outliers)
Output
Outliers:
| Value |
|---|
| 100 |
Here, the value 100 is an outlier compared to the other values.
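Once the IQR fences are known, outliers can be capped rather than removed using clip(); a minimal sketch that adds an illustrative 'Clipped' column:
import pandas as pd
df = pd.DataFrame({'Value': [10, 12, 14, 100, 15, 13, 12]})
Q1 = df['Value'].quantile(0.25)
Q3 = df['Value'].quantile(0.75)
IQR = Q3 - Q1
# Cap values outside the IQR fences instead of dropping them
df['Clipped'] = df['Value'].clip(lower=Q1 - 1.5 * IQR, upper=Q3 + 1.5 * IQR)
print(df)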
Creating a pivot table with multiple aggregation
import pandas as pd
# Create dataframe
data = {'City':['NY', 'LA', 'NY','SF','LA'], 'Year':[2020,2020,2021,2021,2020], 'Sales':[100, 150, 200, 250,300]}
df = pd.DataFrame(data)
# Create a pivot table with multiple aggregation
pivot_table = pd.pivot_table(df, values='Sales', index='City', columns='Year', aggfunc=['sum', 'mean'])
print('Pivot table with multiple aggregation: \n', pivot_table)
Output
Pivot Table with Multiple Aggregation:
| City | sum (2020) | sum (2021) | mean (2020) | mean (2021) |
|---|---|---|---|---|
| LA | 450.0 | NaN | 225.0 | NaN |
| NY | 100.0 | 200.0 | 100.0 | 200.0 |
| SF | NaN | 250.0 | NaN | 250.0 |
Here, a pivot table is created with sum and mean as the aggregation functions.
Creating a pivot table with multiple indexes
import pandas as pd
# Create dataframe
data = {'City':['NY', 'LA', 'NY','SF','LA'], 'Year':[2020,2020,2021,2021,2020], 'Sales':[100, 150, 200, 250,300]}
df = pd.DataFrame(data)
# Create a pivot table with multiple indexes
pivot_table = pd.pivot_table(df, values='Sales', index=['City', 'Year'], aggfunc='sum')
print('Pivot table with multiple indexes: \n', pivot_table)
Output
Pivot Table with Multiple Indexes :
| City | Year | Sales |
|---|---|---|
| LA | 2020 | 450 |
| NY | 2020 | 100 |
| NY | 2021 | 200 |
| SF | 2021 | 250 |
Using map() for Value Replacement
import pandas as pd
# Create dataframe
data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Department':['HR', 'IT', 'Finance']}
df = pd.DataFrame(data)
# Mapping department to codes
department_map = {'HR':1, 'IT':2, 'Finance':3}
df['Dept Code'] = df['Department'].map(department_map)
print("Dataframe with mapped values: \n", df)
Output
DataFrame with Mapped Values:
| Name | Department | Dept Code |
|---|---|---|
| Alice | HR | 1 |
| Bob | IT | 2 |
| Charlie | Finance | 3 |
Here, the department names are mapped to department codes.
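Values missing from the mapping become NaN, so a default is often chained on with fillna(); a minimal sketch where the department 'Legal' and the default code 0 are illustrative:
import pandas as pd
df = pd.DataFrame({'Department': ['HR', 'IT', 'Legal']})
department_map = {'HR': 1, 'IT': 2, 'Finance': 3}
# 'Legal' has no entry, so map() yields NaN; fill a default code instead
df['Dept Code'] = df['Department'].map(department_map).fillna(0).astype(int)
print(df)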
Detecting Duplicates
import pandas as pd
# Create a dataframe with duplicate rows
data = {'ID':[1,2,2,3,4,4], 'Value': [10, 20,20,30,40,40]}
df = pd.DataFrame(data)
# Detecting duplicates
duplicates = df[df.duplicated()]
print('Duplicate Rows : \n', duplicates)
Output
Duplicate Rows:
| ID | Value |
|---|---|
| 2 | 20 |
| 4 | 40 |
In this example, duplicated() flags the second occurrence of each repeated row, so only the repeats with ID 2 and 4 appear.
Using explode() for list in columns
import pandas as pd
# Create a dataframe with lists in a column
data = {'ID':[1,2], 'Hobbies':[['Reading', 'Swimming'], ['Running', 'Cycling'] ]}
df = pd.DataFrame(data)
# Exploding the 'Hobbies' column
exploded_df = df.explode('Hobbies')
print("Dataframe after exploding lists: \n", exploded_df)
Output
DataFrame After Exploding Lists:
| ID | Hobbies |
|---|---|
| 1 | Reading |
| 1 | Swimming |
| 2 | Running |
| 2 | Cycling |
In this example, each hobby in a list associated with an ID is expanded into its own row.
Using rank() to rank values
import pandas as pd
# Create dataframe
data = {'Name' : ['Alice', 'Bob', 'Charlie', 'David'], 'Score': [85,90,78,92]}
df = pd.DataFrame(data)
# Ranking scores in descending order
df['Rank'] = df['Score'].rank(ascending=False)
print('Dataframe with ranked scores : \n', df)
Output
DataFrame with Ranked Scores:
| Name | Score | Rank |
|---|---|---|
| Alice | 85 | 3.0 |
| Bob | 90 | 2.0 |
| Charlie | 78 | 4.0 |
| David | 92 | 1.0 |
In this example, the scores are ranked in descending order, with the highest score receiving the highest rank.
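By default, rank() averages tied values. The method parameter controls tie handling; for example, method='dense' assigns consecutive ranks with no gaps after a tie. A minimal sketch with a tie added:
import pandas as pd
df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'], 'Score': [90, 90, 78]})
# Tied scores share rank 1; 'dense' leaves no gap, so Charlie is rank 2
df['Rank'] = df['Score'].rank(ascending=False, method='dense')
print(df)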
Using applymap() for element-wise operations
import pandas as pd
# Create dataframe
data = {'A': [1,2,3], 'B':[4,5,6]}
df = pd.DataFrame(data)
# Applying an operation to each element in the Dataframe
# (in pandas >= 2.1, DataFrame.map is the preferred name for applymap)
df_transformed = df.applymap(lambda x:x**2)
print("Dataframe with squared values : \n", df_transformed)
Output
DataFrame with Squared Values:
| A | B |
|---|---|
| 1 | 16 |
| 4 | 25 |
| 9 | 36 |
In this example, each value in columns A and B has been squared.
Using df.apply() with axis parameter
import pandas as pd
# Create dataframe
data = {'A': [1,2,3], 'B':[4,5,6]}
df = pd.DataFrame(data)
# Applying a function to rows and columns using the axis parameter
row_sum = df.apply(lambda x: x.sum() , axis=1)
col_sum = df.apply(lambda x: x.sum() , axis=0)
print("Row wise sum : \n", row_sum)
print("Column wise sum : \n", col_sum)
Output
Row-wise Sum:
| Index | Sum |
|---|---|
| 0 | 5 |
| 1 | 7 |
| 2 | 9 |
Column-wise Sum:
| Column | Sum |
|---|---|
| A | 6 |
| B | 15 |
Create dummy variables
import pandas as pd
# Create dataframe
data = {'City': ['NY', 'LA', 'SF', 'NY']}
df = pd.DataFrame(data)
# Creating dummy variables
dummies = pd.get_dummies(df['City'], prefix='City')
print('Dummy variables dataframe: \n', dummies)
Output
Dummy Variables DataFrame:
| Index | City_LA | City_NY | City_SF |
|---|---|---|---|
| 0 | False | True | False |
| 1 | True | False | False |
| 2 | False | False | True |
| 3 | False | True | False |
Here, each city becomes its own indicator column (boolean in pandas 2.x; 0/1 in older versions).
Use query() for filtering
import pandas as pd
# Create dataframe
data = {'Name' : ['Alice', 'Bob', 'Charlie', 'David'], 'Score': [85,90,78,92]}
df = pd.DataFrame(data)
# Use query for filtering
filtered_df = df.query('Score > 80')
print('Filtered dataframe using query : \n', filtered_df)
Output
Filtered DataFrame :
| Name | Score |
|---|---|
| Alice | 85 |
| Bob | 90 |
| David | 92 |
Here, only rows with scores above a certain threshold are included in the filtered DataFrame.
Adding a new column using assign()
import pandas as pd
# Create dataframe
data = {'Name' : ['Alice', 'Bob', 'Charlie', 'David'], 'Score': [85,90,78,92]}
df = pd.DataFrame(data)
# Adding a new column using assign
df = df.assign(Grade=lambda x: ['A' if score >= 90 else 'B' for score in x['Score']])
print("Dataframe with assigned grade column : \n", df)
Output
DataFrame with Assigned Grade Column :
| Name | Score | Grade |
|---|---|---|
| Alice | 85 | B |
| Bob | 90 | A |
| Charlie | 78 | B |
| David | 92 | A |
In this example, grades are assigned based on the score values: 'A' for scores of 90 and above, 'B' otherwise.
Applying a custom function with groupby
import pandas as pd
# Create dataframe
data = {
'Category':['A','A','B','B'],
'Value':[10, 20, 30,40]
}
df = pd.DataFrame(data)
# Applying a custom function with groupby
grouped_df = df.groupby('Category')['Value'].apply(lambda x: x.max() - x.min())
print("Custom Aggregation result : \n", grouped_df)
Output
Custom Aggregation Result:
| Category | Value |
|---|---|
| A | 10 |
| B | 10 |
In this example, the Value column is aggregated with a custom function that computes the range (max minus min) of each category.
Applying functions to grouped data
import pandas as pd
# Create dataframe
data = {
'Team':['A','A','B','B'],
'Points':[10, 20, 30,40]
}
df = pd.DataFrame(data)
# Applying a function to grouped data
grouped_df = df.groupby('Team')['Points'].transform(lambda x: x / x.max())
print("Dataframe with applied functions to grouped data : \n", grouped_df)
Output
Dataframe with Applied Functions to Grouped Data:
| Index | Points |
|---|---|
| 0 | 0.50 |
| 1 | 1.00 |
| 2 | 0.75 |
| 3 | 1.00 |
DataFrame info and memory usage
import pandas as pd
# Create dataframe
data = {
'A': range(1000),
'B':range(1000, 2000)
}
df = pd.DataFrame(data)
# Displaying dataframe info and memory usage (info() prints directly and returns None)
print('Dataframe info and Memory usage :')
df.info(memory_usage='deep')
Output
DataFrame Info and Memory Usage :
| # | Column | Non-Null Count | Dtype |
|---|---|---|---|
| 0 | A | 1000 non-null | int64 |
| 1 | B | 1000 non-null | int64 |
- Dtypes: int64 (2 columns)
- Memory Usage: 15.8 KB
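When memory matters, narrower dtypes shrink the frame considerably; a minimal sketch that downcasts both columns from int64 to int32, roughly halving their memory:
import pandas as pd
df = pd.DataFrame({'A': range(1000), 'B': range(1000, 2000)})
# Downcast from int64 to int32; these values fit comfortably
df_small = df.astype({'A': 'int32', 'B': 'int32'})
df_small.info(memory_usage='deep')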
Filter with isin()
import pandas as pd
# Create dataframe
data = {'Name' : ['Alice', 'Bob', 'Charlie', 'David'], 'City': ['NY','LA','SF','NY']}
df = pd.DataFrame(data)
# filter with isin()
filtered_df = df[df['City'].isin(['NY','SF'])]
print("Filtered Dataframe with isin method: \n", filtered_df)
Output
Filtered Dataframe with isin method:
| Index | Name | City |
|---|---|---|
| 0 | Alice | NY |
| 2 | Charlie | SF |
| 3 | David | NY |
This section demonstrates how to filter rows in a DataFrame based on a list of values using the isin method.
Concatenating Dataframe using concat()
import pandas as pd
# Create dataframes
df1 = pd.DataFrame({
'A':[1,2],
'B':[3,4]
})
df2 = pd.DataFrame({
'A': [5,6],
'B': [7,8]
})
# Concatenating Dataframes
concat_df = pd.concat([df1, df2], ignore_index=True)
print("Concatenated DataFrame : \n", concat_df)
Output
Concatenated DataFrame :
| Index | A | B |
|---|---|---|
| 0 | 1 | 3 |
| 1 | 2 | 4 |
| 2 | 5 | 7 |
| 3 | 6 | 8 |
This section demonstrates how to concatenate two DataFrames vertically.
Sorting a DataFrame by Multiple Columns
import pandas as pd
# Create dataframe
data = {'Name' : ['Alice', 'Bob', 'Charlie', 'David'], 'Score': [85,90,78,92], 'Age': [25,37,35,21]}
df = pd.DataFrame(data)
# Sorting by multiple columns
sorted_df = df.sort_values(by=['Score', 'Age'], ascending=[False, True])
print("Sorted Dataframe by multiple columns: \n", sorted_df)
Output
Sorted Dataframe by Multiple Columns:
| Index | Name | Score | Age |
|---|---|---|---|
| 3 | David | 92 | 21 |
| 1 | Bob | 90 | 37 |
| 0 | Alice | 85 | 25 |
| 2 | Charlie | 78 | 35 |
- Primary sort: Score in descending order.
- Secondary sort: Age in ascending order.
Displaying the styled Dataframe
import pandas as pd
# Create dataframe
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'], 'Score': [85, 90, 78, 92]}
df = pd.DataFrame(data)
# Applying styles to highlight scores above 80; limiting to the 'Score' column
# with subset avoids the type check (isinstance(x, int) misses numpy integers)
styled_df = df.style.map(
lambda x: 'background-color: yellow' if x > 80 else '',
subset=['Score']
)
# Save styled DataFrame to HTML
styled_df.to_html("styled_dataframe.html")
# Displaying the styled Dataframe
print("Styled dataframe with conditional formatting has been saved to 'styled_dataframe.html'.")
Output
"Styled dataframe with conditional formatting has been saved to 'styled_dataframe.html'."
| Index | Name | Score |
|---|---|---|
| 0 | Alice | 85 |
| 1 | Bob | 90 |
| 2 | Charlie | 78 |
| 3 | David | 92 |
Note: Cells with scores greater than 80 are highlighted in yellow. To view the styling, check the styled_dataframe.html file in a browser.

Using explode() for nested JSON columns
import pandas as pd
# Create a dataframe with nested lists
data = {
'ID' : [1,2],
'Hobbies':[['Reading', 'Swimming', 'Gaming'], ['Hiking', 'Drawing']]
}
df = pd.DataFrame(data)
# Exploding the 'Hobbies' column
exploded_df = df.explode('Hobbies')
print("Dataframe after exploding nested json columns:\n", exploded_df)
Output
DataFrame after Exploding Nested JSON Columns
| Index | ID | Hobbies |
|---|---|---|
| 0 | 1 | Reading |
| 0 | 1 | Swimming |
| 0 | 1 | Gaming |
| 1 | 2 | Hiking |
| 1 | 2 | Drawing |
Using pipe() for method chaining
import pandas as pd
# Create dataframe
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'], 'Score': [85, 90, 78, 92]}
df = pd.DataFrame(data)
# Custom function for modifying the dataframe
def add_grade_column(df):
df['Grade'] = df['Score'].apply(lambda x: 'A' if x >=90 else 'B')
return df
# Using pipe() for method chaining
df = df.pipe(add_grade_column)
print("Dataframe after using pipe : \n", df)
Output
DataFrame after Using Pipe
| Index | Name | Score | Grade |
|---|---|---|---|
| 0 | Alice | 85 | B |
| 1 | Bob | 90 | A |
| 2 | Charlie | 78 | B |
| 3 | David | 92 | A |
Multi Index Slicing
import pandas as pd
# Create Multi Index Dataframe
arrays = [['A', 'A', 'B', 'B'], [2020, 2021, 2022, 2023]]
index = pd.MultiIndex.from_arrays(arrays, names=('Category', 'Year'))
data = [100, 150, 200, 250]
df = pd.DataFrame(data, index=index, columns=['Sales'])
# Slicing the multi-index Dataframe
sliced_df = df.loc['A']
print("Sliced multi-index Dataframe \n", sliced_df)
Output
Sliced MultiIndex DataFrame
| Year | Sales |
|---|---|
| 2020 | 100 |
| 2021 | 150 |
Adding prefix or suffix to column names
import pandas as pd
# Create a dataframe
data = {
'Math': [90, 80],
'Science':[85,88]
}
df = pd.DataFrame(data)
# Adding prefix to column names
df_prefixed = df.add_prefix('Grade_')
# Adding suffix to column names
df_suffixed = df.add_suffix('_Score')
print("Dataframe with Prefix :\n", df_prefixed)
print('\n')
print("Dataframe with Suffix :\n", df_suffixed)
Output
DataFrame with Prefix
| Index | Grade_Math | Grade_Science |
|---|---|---|
| 0 | 90 | 85 |
| 1 | 80 | 88 |
DataFrame with Suffix
| Index | Math_Score | Science_Score |
|---|---|---|
| 0 | 90 | 85 |
| 1 | 80 | 88 |
Using at and iat for fast scalar access
import pandas as pd
# Create dataframe
data = {
'A':[1,2,3],
'B':[4,5,6]
}
df = pd.DataFrame(data)
# Accessing a single value using at and iat
value_at = df.at[1, 'A']
value_iat = df.iat[1,0]
print('values using at : ' , value_at)
print('values using iat : ' , value_iat)
Output
values using at : 2
values using iat : 2
Renaming index & columns
import pandas as pd
# Create dataframe
data = {'Name' : ['Alice', 'Bob'], 'Score': [78,92]}
df = pd.DataFrame(data, index=['Row1','Row2'])
# Renaming index & columns
df_renamed = df.rename(index={'Row1':'Student1', 'Row2':'Student2'}, columns={'Score':'Grade'})
print("Renamed Dataframe : \n",df_renamed)
Output
Renamed DataFrame
| Index | Name | Grade |
|---|---|---|
| Student1 | Alice | 78 |
| Student2 | Bob | 92 |
Use query() with variables
import pandas as pd
# Create dataframe
data = {'Name' : ['Alice', 'Bob', 'Charlie', 'David'], 'Score': [85,90,78,92]}
df = pd.DataFrame(data)
# Using query() with variables
threshold = 80
filtered_df = df.query('Score > @threshold')
print("Filtered Dataframe with query() and varibales: \n", filtered_df)
Output
Filtered DataFrame with query() and Variables:
| Index | Name | Score |
|---|---|---|
| 0 | Alice | 85 |
| 1 | Bob | 90 |
| 3 | David | 92 |
Using groupby with custom aggregation
import pandas as pd
# Create dataframe
data = {
'Category':['A','A','B','B'],
'Value':[10, 20, 30,40]
}
df = pd.DataFrame(data)
# using groupby with custom aggregation
custom_df = df.groupby('Category').agg(max_value = ('Value', 'max'), min_value=('Value', 'min'))
print("Custom Aggregation result : \n", custom_df)
Output
Custom Aggregation Result
| Category | max_value | min_value |
|---|---|---|
| A | 20 | 10 |
| B | 40 | 30 |
Pivoting with fill_value
import pandas as pd
# Create dataframe
data = {'Date':['2023-01-01', '2023-01-02','2023-01-01','2023-01-02'],
'City':['NY', 'NY','LA', 'LA'],
'Sales':[200,250,300,None]
}
df = pd.DataFrame(data)
# Pivoting, then filling missing values with 0
pivot_df = df.pivot(index='Date', columns='City', values='Sales').fillna(0)
print('Pivoted Dataframe with fill_value: \n', pivot_df)
Output
Pivoted Dataframe with fill_value:
| Date | LA | NY |
|---|---|---|
| 2023-01-01 | 300.0 | 200.0 |
| 2023-01-02 | 0.0 | 250.0 |
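pivot() itself has no fill_value parameter, which is why fillna(0) is chained above. pivot_table() does accept fill_value directly; a minimal sketch:
import pandas as pd
df = pd.DataFrame({'Date': ['2023-01-01', '2023-01-02', '2023-01-01'],
'City': ['NY', 'NY', 'LA'],
'Sales': [200, 250, 300]})
# pivot_table fills the missing (2023-01-02, LA) cell in one step
pivot_df = pd.pivot_table(df, index='Date', columns='City', values='Sales',
aggfunc='sum', fill_value=0)
print(pivot_df)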
Dropping Duplicate Rows
import pandas as pd
# Create a dataframe with duplicate rows
data = {'ID':[1,2,2,3,4,4], 'Value': [10, 20,20,30,40,40]}
df = pd.DataFrame(data)
# Dropping duplicate rows
unique_df = df.drop_duplicates()
print("Dataframe after dropping duplicates: \n", unique_df)
Output
DataFrame After Dropping Duplicates
| Index | ID | Value |
|---|---|---|
| 0 | 1 | 10 |
| 1 | 2 | 20 |
| 3 | 3 | 30 |
| 4 | 4 | 40 |
Combining Dataframe with Multi-level Indexes
import pandas as pd
# Create Multi Index Dataframe
arrays = [['A', 'A', 'B', 'B'], [2020, 2021, 2022, 2023]]
index = pd.MultiIndex.from_arrays(arrays, names=('Category', 'Year'))
data1 = [100, 150, 200, 250]
df1 = pd.DataFrame(data1, index=index, columns=['Sales'])
data2 = [300, 400, 500, 6000]
df2 = pd.DataFrame(data2, index=index, columns=['Profit'])
# Combining Dataframe
combined_df = pd.concat([df1, df2], axis=1)
print("Combining Dataframe with Multi-level Indexes: \n ", combined_df)
Output
Combining DataFrame with Multi-level Indexes:
| Category | Year | Sales | Profit |
|---|---|---|---|
| A | 2020 | 100 | 300 |
| A | 2021 | 150 | 400 |
| B | 2022 | 200 | 500 |
| B | 2023 | 250 | 6000 |
Using rolling() with custom functions
import pandas as pd
# Create a time series dataframe
data_range = pd.date_range(start='2023-01-01', periods=10, freq='D')
data = {'Sales': [100, 200, 150, 300, 250, 400, 300, 350, 300, 400]}
df = pd.DataFrame(data, index= data_range)
# Applying a custom function with rolling()
df['Rolling Sum'] = df['Sales'].rolling(window=3).apply(lambda x:x.sum() if x.sum() > 50 else 0)
print("Dataframe with custom rolling sum : \n", df)
Output
DataFrame with Custom Rolling Sum
| Date | Sales | Rolling Sum |
|---|---|---|
| 2023-01-01 | 100 | NaN |
| 2023-01-02 | 200 | NaN |
| 2023-01-03 | 150 | 450.0 |
| 2023-01-04 | 300 | 650.0 |
| 2023-01-05 | 250 | 700.0 |
| 2023-01-06 | 400 | 950.0 |
| 2023-01-07 | 300 | 950.0 |
| 2023-01-08 | 350 | 1050.0 |
| 2023-01-09 | 300 | 950.0 |
| 2023-01-10 | 400 | 1050.0 |
Setting & Resetting Indexes
import pandas as pd
# Create dataframe
data = {'City':['NY', 'LA', 'NY','SF','LA'], 'Year':[2020,2020,2021,2021,2020], 'Sales':[100, 150, 200, 250,300]}
df = pd.DataFrame(data)
# Setting an index
df.set_index('City', inplace=True)
print("Dataframe after setting index: \n", df)
print('\n')
# Resetting index
df.reset_index(inplace=True)
print("Dataframe after resetting index: \n", df)
Output
DataFrame after Setting Index
| City | Year | Sales |
|---|---|---|
| NY | 2020 | 100 |
| LA | 2020 | 150 |
| NY | 2021 | 200 |
| SF | 2021 | 250 |
| LA | 2020 | 300 |
DataFrame after Resetting Index
| City | Year | Sales |
|---|---|---|
| NY | 2020 | 100 |
| LA | 2020 | 150 |
| NY | 2021 | 200 |
| SF | 2021 | 250 |
| LA | 2020 | 300 |
Working with time-zone
import pandas as pd
# Create a time series dataframe
data_range = pd.date_range(start='2023-01-01', periods=10, freq='D')
data = {'Sales': [100, 200, 150, 300, 250, 400, 300, 350, 300, 400]}
df = pd.DataFrame(data, index=data_range)
# Localizing the timezone to UTC (or any initial timezone)
df.index = df.index.tz_localize('UTC')
# Converting to a different timezone (US/Eastern)
df = df.tz_convert('US/Eastern')
print("Time series dataframe with time zone: \n", df)
Output
Time series dataframe with time zone:
| DateTime | Sales |
|---|---|
| 2022-12-31 19:00:00-05:00 | 100 |
| 2023-01-01 19:00:00-05:00 | 200 |
| 2023-01-02 19:00:00-05:00 | 150 |
| 2023-01-03 19:00:00-05:00 | 300 |
| 2023-01-04 19:00:00-05:00 | 250 |
| 2023-01-05 19:00:00-05:00 | 400 |
| 2023-01-06 19:00:00-05:00 | 300 |
| 2023-01-07 19:00:00-05:00 | 350 |
| 2023-01-08 19:00:00-05:00 | 300 |
| 2023-01-09 19:00:00-05:00 | 400 |
Detecting and Replacing Outliers
import pandas as pd
# Create dataframe
data = {'Values': [10,12,14,100,15,13,12]}
df = pd.DataFrame(data)
# Detecting Outliers using the IQR method
Q1 = df['Values'].quantile(0.25)
Q3 = df['Values'].quantile(0.75)
IQR = Q3 - Q1
outliers = (df['Values'] < (Q1-1.5 * IQR)) | (df['Values'] > (Q3 + 1.5 * IQR))
# Replacing outliers with the mean
df.loc[outliers, 'Values'] = df['Values'].mean()
print("Dataframe after Replacing outliers: \n", df)
Output
Dataframe after Replacing Outliers:
| Index | Values |
|---|---|
| 0 | 10.000000 |
| 1 | 12.000000 |
| 2 | 14.000000 |
| 3 | 25.142857 |
| 4 | 15.000000 |
| 5 | 13.000000 |
| 6 | 12.000000 |
Using eval() for calculations
import pandas as pd
# Create a dataframe
data = {'A':[1,2,3,4], 'B':[5,6,7,8]}
df = pd.DataFrame(data)
# Using eval for calculations
df['C'] = df.eval('A + B * 2')
print("Dataframe with evaluated calculation\n",df)
Output
Dataframe with Evaluated Calculation:
| A | B | C | |
|---|---|---|---|
| 0 | 1 | 5 | 11 |
| 1 | 2 | 6 | 14 |
| 2 | 3 | 7 | 17 |
| 3 | 4 | 8 | 20 |
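eval() can also perform the assignment inside the expression string, which keeps chained transformations readable; a minimal sketch:
import pandas as pd
df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [5, 6, 7, 8]})
# The new column is assigned inside the eval expression itself
df = df.eval('C = A + B * 2')
print(df)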
Merging indicator for tracing source
import pandas as pd
# Create dataframe
df1 = pd.DataFrame({'ID': [1,2,3,4] ,'Name' : ['Alice', 'Bob', 'Charlie', 'David']})
df2 = pd.DataFrame({'ID': [3,4,5,6] ,'Name' : ['Charlie', 'David','Robin','Michale']})
# Merging with indicator to track the source
merged_df = pd.merge(df1, df2, on='ID', how='outer', indicator=True)
print('Merged dataframe with indicator: \n', merged_df)
Output
Merged DataFrame with Indicator:
| ID | Name_x | Name_y | _merge |
|---|---|---|---|
| 1 | Alice | NaN | left_only |
| 2 | Bob | NaN | left_only |
| 3 | Charlie | Charlie | both |
| 4 | David | David | both |
| 5 | NaN | Robin | right_only |
| 6 | NaN | Michale | right_only |
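The _merge column makes it easy to isolate rows that came from only one side of the merge; a minimal sketch continuing from the merge above:
import pandas as pd
df1 = pd.DataFrame({'ID': [1, 2, 3, 4], 'Name': ['Alice', 'Bob', 'Charlie', 'David']})
df2 = pd.DataFrame({'ID': [3, 4, 5, 6], 'Name': ['Charlie', 'David', 'Robin', 'Michale']})
merged_df = pd.merge(df1, df2, on='ID', how='outer', indicator=True)
# Keep only the rows that appear in df1 alone
left_only = merged_df[merged_df['_merge'] == 'left_only']
print(left_only)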
Time Series with Business days
import pandas as pd
# Create a time series dataframe
data_range = pd.date_range(start='2023-01-01', periods=10, freq='B')
data = {'Sales': range(10)}
df = pd.DataFrame(data, index= data_range)
print("Time series Dataframe with business days : \n", df)
Output
Time Series DataFrame with Business Days:
| Date | Sales |
|---|---|
| 2023-01-02 | 0 |
| 2023-01-03 | 1 |
| 2023-01-04 | 2 |
| 2023-01-05 | 3 |
| 2023-01-06 | 4 |
| 2023-01-09 | 5 |
| 2023-01-10 | 6 |
| 2023-01-11 | 7 |
| 2023-01-12 | 8 |
| 2023-01-13 | 9 |
Stacking & Unstacking dataframe
import pandas as pd
# Create dataframe
data = {'City':['NY','LA'],
'2020':[100,500],
'2021':[200,450],
}
df = pd.DataFrame(data)
# Stacking & Unstacking dataframe
stack_df = df.stack()
unstack_df = df.unstack()
print('Stacking dataframe \n', stack_df)
print('Unstacking dataframe \n', unstack_df)
Output
Stacked DataFrame:
| Row | Label | Value |
|---|---|---|
| 0 | City | NY |
| 0 | 2020 | 100 |
| 0 | 2021 | 200 |
| 1 | City | LA |
| 1 | 2020 | 500 |
| 1 | 2021 | 450 |
Unstacked Series:
| Column | Row | Value |
|---|---|---|
| City | 0 | NY |
| City | 1 | LA |
| 2020 | 0 | 100 |
| 2020 | 1 | 500 |
| 2021 | 0 | 200 |
| 2021 | 1 | 450 |
On a DataFrame with a flat index, stack() moves the column labels into an inner row level, while unstack() produces a Series with the column labels as the outer index level.
Creating Custom Indexes with Multi-Index
import pandas as pd
# Create a multi-index from tuple
index = pd.MultiIndex.from_tuples([('A',2020), ('A', 2021), ('B', 2021), ('C', 2022)], names=['Category', 'Year'])
data = [100, 150, 200, 250]
df = pd.DataFrame(data, index=index, columns=['Sales'])
print("Dataframe with custom multi-index: \n", df)
Output
Dataframe with Custom Multi-Index:
| Category | Year | Sales |
|---|---|---|
| A | 2020 | 100 |
| A | 2021 | 150 |
| B | 2021 | 200 |
| C | 2022 | 250 |