Pandas
Grouping & Aggregation
Grouping a DataFrame using aggregation functions, including sum and mean, calculated per category.
import pandas as pd
# Create a DataFrame
data = {'Category': ['A', 'B', 'C', 'A', 'B', 'C', 'C', 'A'],
        'Value': [10, 20, 40, 45, 50, 90, 100, 200]}
df = pd.DataFrame(data)
# Grouping by 'Category' and calculating the sum and mean
grouped = df.groupby('Category').agg({'Value': ['sum', 'mean']})
print("Grouped DataFrame with Aggregation:\n\n", grouped)
Output:
Grouped DataFrame with Aggregation:
Category | Value (sum) | Value (mean) |
---|---|---|
A | 255 | 85.000000 |
B | 70 | 35.000000 |
C | 230 | 76.666667 |
Here, the data is grouped by Category, and for each category we calculate:
- Sum of the Value column
- Mean of the Value column
Explanation of Columns:
- Category: The category label for each group.
- Value (sum): The total sum of values within each category.
- Value (mean): The average value within each category.
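The agg call above returns MultiIndex columns like (Value, sum) and (Value, mean). As a minimal follow-up sketch (the flattened names such as Value_sum are a choice made here, not part of the output above), the columns can be flattened for easier downstream use:
# Flattening the MultiIndex columns produced by agg
grouped.columns = ['_'.join(col) for col in grouped.columns]
grouped = grouped.reset_index()
print(grouped)  # columns: Category, Value_sum, Value_mean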
Grouping & Aggregation
Grouping a DataFrame using multiple aggregation functions (sum, mean, and max), calculated per category.
import pandas as pd
# Create DataFrame
data = {
    'Category': ['A', 'A', 'B', 'B'],
    'Values': [10, 20, 30, 40]
}
df = pd.DataFrame(data)
# Grouping by 'Category' and calculating the sum, mean, and max
grouped_df = df.groupby('Category').agg({'Values': ['sum', 'mean', 'max']})
print("Grouped DataFrame with multiple functions:\n\n", grouped_df)
Output:
Grouped DataFrame with multiple functions:
Category | sum | mean | max |
---|---|---|---|
A | 30 | 15.0 | 20 |
B | 70 | 35.0 | 40 |
MultiIndex DataFrame Operations
import pandas as pd
# Create a MultiIndex DataFrame
arrays = [['A', 'A', 'B', 'B'], [2020, 2021, 2022, 2023]]
index = pd.MultiIndex.from_arrays(arrays, names=('Category', 'Year'))
data = [100, 150, 200, 250]
df = pd.DataFrame(data, index=index, columns=['Sales'])
# Accessing data in a MultiIndex DataFrame with a cross-section
sales_in_2020 = df.xs(2020, level='Year')
print("Sales Data in 2020: ", sales_in_2020)
Output
Category | Sales (2020) |
---|---|
A | 100 |
Here, the output shows the sales data for 2020. Only category A has a 2020 entry, so it is the only row returned.
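The same cross-section idea works on the other level; a minimal sketch, continuing with the df defined above:
# Cross-section by the 'Category' level instead of 'Year'
sales_for_A = df.xs('A', level='Category')
print("Sales Data for Category A:\n", sales_for_A)  # rows for A, indexed by Year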
Using Apply with Lambda Functions
import pandas as pd
# Create Dataframe
data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Score':[85,90,95]}
df = pd.DataFrame(data)
# Applying a lambda function to modify scores
df['Adjusted Score'] = df['Score'].apply(lambda x: x+5 if x<90 else x)
print("Dataframe with Adjusted Scores :\n",df)
Output
Name | Score | Adjusted Score |
---|---|---|
Alice | 85 | 90 |
Bob | 90 | 90 |
Charlie | 95 | 95 |
Here, the DataFrame shows each name, the original score, and the adjusted score: scores below 90 gain 5 points, while scores of 90 and above are unchanged.
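For larger DataFrames, a vectorized alternative avoids calling the lambda once per element. A minimal sketch with numpy.where, equivalent to the apply above:
import numpy as np
# Vectorized: add 5 where Score < 90, otherwise keep the score
df['Adjusted Score'] = np.where(df['Score'] < 90, df['Score'] + 5, df['Score'])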
Merging DataFrames with Different Keys
import pandas as pd
# Create two DataFrames
df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'Emp_ID': [2, 3, 4], 'Department': ['HR', 'Finance', 'IT']})
# Merging on different key columns
merged_df = pd.merge(df1, df2, left_on='ID', right_on='Emp_ID', how='outer')
print("Merged Dataframe:\n", merged_df)
Output
ID | Name | Emp_ID | Department |
---|---|---|---|
1.0 | Alice | NaN | NaN |
2.0 | Bob | 2.0 | HR |
3.0 | Charlie | 3.0 | Finance |
NaN | NaN | 4.0 | IT |
Here, the merged DataFrame combines the two data sources; the outer join keeps all rows, so unmatched rows contain missing values:
- ID: Identifier from the first data source.
- Name: Name from the first data source.
- Emp_ID: Employee ID from the second data source.
- Department: Department from the second data source.
- NaN values indicate missing data from one of the sources.
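A common follow-up is to coalesce the two key columns into a single ID after the outer merge; a minimal sketch (keeping 'ID' as the combined key is a choice made here):
# Coalesce 'ID' and 'Emp_ID' into one key, then drop the redundant column
merged_df['ID'] = merged_df['ID'].fillna(merged_df['Emp_ID'])
merged_df = merged_df.drop(columns=['Emp_ID'])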
Handling Missing Values with Custom Functions
import pandas as pd
import numpy as np
# Creating a DataFrame with missing values
data = {'Name': ['Alice', 'Bob', np.nan, 'David'], 'Age': [13, np.nan, 32, 20]}
df = pd.DataFrame(data)
# Filling missing values (assignment avoids the deprecated inplace fillna on a column)
df['Name'] = df['Name'].fillna('Unknown')
df['Age'] = df['Age'].fillna(df['Age'].median())
print("Dataframe after handling Missing values:\n", df)
Output
Dataframe after handling Missing values:
Name | Age |
---|---|
Alice | 13.0 |
Bob | 20.0 |
Unknown | 32.0 |
David | 20.0 |
Here, missing values in the Name or Age columns were replaced with a default string ("Unknown") or a computed numeric value (the column median).
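Alternatively, both fills can be expressed in a single call by passing a column-to-value dict to fillna; a minimal sketch on the original data:
# Fill several columns at once with a dict of column -> fill value
df = df.fillna({'Name': 'Unknown', 'Age': df['Age'].median()})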
Handling Missing Values with the ffill & bfill methods
import pandas as pd
import numpy as np
# Creating a DataFrame with missing values
data = {'Name': ['Alice', 'Bob', np.nan, 'David'], 'Age': [13, np.nan, 32, 20]}
df = pd.DataFrame(data)
# Filling missing values with ffill & bfill (fillna(method=...) is deprecated)
df_ffill = df.ffill()
df_bfill = df.bfill()
print("Dataframe with forward fill:\n", df_ffill)
print("Dataframe with backward fill:\n", df_bfill)
Output
DataFrame with Forward Fill
Forward fill propagates the last valid non-null value forward until the next non-null value is encountered.
Name | Age |
---|---|
Alice | 13.0 |
Bob | 13.0 |
Bob | 32.0 |
David | 20.0 |
DataFrame with Backward Fill
Backward fill propagates the next valid non-null value backward until the previous non-null value is encountered.
Name | Age |
---|---|
Alice | 13.0 |
Bob | 32.0 |
David | 32.0 |
David | 20.0 |
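Both methods accept a limit parameter to cap how far a value propagates; a minimal sketch:
# Propagate each valid value forward by at most one row
df_limited = df.ffill(limit=1)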
Pivoting DataFrame
import pandas as pd
# Create dataframe
data = {'Date': ['2023-01-01', '2023-01-02', '2023-01-01', '2023-01-02'],
        'City': ['NY', 'NY', 'LA', 'LA'],
        'Sales': [200, 250, 300, 400]}
df = pd.DataFrame(data)
# Pivoting the DataFrame
pivot_df = df.pivot(index='Date', columns='City', values='Sales')
print('Pivoted Dataframe:\n', pivot_df)
Output
Pivoted Dataframe :
Date | LA | NY |
---|---|---|
2023-01-01 | 300 | 200 |
2023-01-02 | 400 | 250 |
Pivoting DataFrame with multi index columns
import pandas as pd
# Create dataframe
data = {'Date': ['2023-01-01', '2023-01-02', '2023-01-01', '2023-01-02'],
        'City': ['NY', 'NY', 'LA', 'LA'],
        'Type': ['Online', 'Store', 'Online', 'Store'],
        'Sales': [200, 250, 300, 400]}
df = pd.DataFrame(data)
# Pivoting to create a Dataframe with multi-index columns
pivot_df = df.pivot(index='Date', columns=['City','Type'], values='Sales')
print('Pivoted Dataframe with multi-index columns: \n', pivot_df)
Output
Pivoted DataFrame with Multi-Index Columns:
Date | NY (Online) | NY (Store) | LA (Online) | LA (Store) |
---|---|---|---|---|
2023-01-01 | 200.0 | NaN | 300.0 | NaN |
2023-01-02 | NaN | 250.0 | NaN | 400.0 |
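Since pivot leaves NaN where a Date/City/Type combination is absent, pivot_table with fill_value is a handy alternative; a minimal sketch:
# pivot_table can fill the missing combinations directly
pivot_filled = pd.pivot_table(df, index='Date', columns=['City', 'Type'],
                              values='Sales', aggfunc='sum', fill_value=0)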
Melting DataFrame
import pandas as pd
# Create dataframe
data = {'ID':[1,2], 'Math':[90,79], 'Science':[85,88]}
df = pd.DataFrame(data)
# Melting the dataframe
melted_dataframe = pd.melt(df,id_vars=['ID'], value_vars=['Math', 'Science'], var_name='Subject', value_name='Score')
print("Melted Dataframe : \n", melted_dataframe)
Output
Melted DataFrame :
ID | Subject | Score |
---|---|---|
1 | Math | 90 |
2 | Math | 79 |
1 | Science | 85 |
2 | Science | 88 |
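Melting is reversible: pivoting the long frame on ID and Subject restores the wide layout. A minimal sketch:
# Round-trip: pivot the melted frame back to wide format
wide_df = melted_dataframe.pivot(index='ID', columns='Subject', values='Score')
wide_df = wide_df.reset_index()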
Time-based indexing and resampling
import pandas as pd
# Create time Series Dataframe
date_range = pd.date_range(start='2023-01-01', periods=10, freq='D')
data = {'Sales':[100, 200, 150, 300, 250, 400, 300, 350, 300, 400]}
df = pd.DataFrame(data, index=date_range)
# Resampling to find weekly sales
weekly_sales = df.resample('W').sum()
print("Weekly Sales : \n", weekly_sales)
Output
Weekly Sales :
Date | Sales |
---|---|
2023-01-01 | 100 |
2023-01-08 | 1950 |
2023-01-15 | 700 |
Resampling with different aggregations
import pandas as pd
# Create time Series Dataframe
date_range = pd.date_range(start='2023-01-01', periods=10, freq='D')
data = {'Sales':[100, 200, 150, 300, 250, 400, 300, 350, 300, 400]}
df = pd.DataFrame(data, index=date_range)
# Resampling to find weekly sales
resampled_df = df.resample('W').agg({'Sales':['sum', 'mean', 'max']})
print("Resampled Dataframe with different aggregation : \n", resampled_df)
Output
Resampled DataFrame with Different Aggregation:
Date | Sales (sum) | Sales (mean) | Sales (max) |
---|---|---|---|
2023-01-01 | 100 | 100.000000 | 100 |
2023-01-08 | 1950 | 278.571429 | 400 |
2023-01-15 | 700 | 350.000000 | 400 |
Conditional Filtering with Multiple Conditions
import pandas as pd
# Create dataframe
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'], 'Score':[85, 80,40,70], 'Passed':[True, False, True, False]}
df = pd.DataFrame(data)
# Filtering with multiple conditions
filtered_df = df[(df['Score'] > 80) & (df['Passed'] == True)]
print("FIltered Dataframe : \n", filtered_df)
Output
Example Filtered DataFrame :
Name | Score | Passed |
---|---|---|
Alice | 85 | True |
Creating Custom Categorical Data
import pandas as pd
# Create dataframe
data = {'Name':['Alice', 'Bob', 'Charlie', 'David'], 'Score': [85, 70, 95,60]}
df = pd.DataFrame(data)
# Creating a new column with categorical data
df['Performance'] = pd.cut(df['Score'], bins=[0,70,90,100], labels=['Poor', 'Average', 'Excellent'])
print('Dataframe with categorical performance : \n', df)
Output
DataFrame with Categorical Performance :
Name | Score | Performance |
---|---|---|
Alice | 85 | Average |
Bob | 70 | Poor |
Charlie | 95 | Excellent |
David | 60 | Poor |
Binning Data using pd.cut
import pandas as pd
# Create dataframe
data = {'Score': [85, 70, 95,60, 100, 90]}
df = pd.DataFrame(data)
# Binning the scores into categories
df['Grade'] = pd.cut(df['Score'], bins=[0,59,69,79,89,100], labels=['F', 'D', 'C', 'B','A'])
print('Dataframe with binned scores : \n', df)
Output
Dataframe with binned scores:
Score | Grade |
---|---|
85 | B |
70 | C |
95 | A |
60 | D |
100 | A |
90 | A |
Rolling Window Calculations
import pandas as pd
# Create a time series dataframe
data_range = pd.date_range(start='2023-01-01', periods=10, freq='D')
data = {'Sales': [100, 200, 150, 300, 250, 400, 300, 350, 300, 400]}
df = pd.DataFrame(data, index= data_range)
# Calculating a rolling mean with a window of 3 days
df['Rolling Mean'] = df['Sales'].rolling(window=3).mean()
print("Dataframe with rolling mean : \n", df)
Output
DataFrame with Rolling Mean :
Date | Sales | Rolling Mean |
---|---|---|
2023-01-01 | 100 | NaN |
2023-01-02 | 200 | NaN |
2023-01-03 | 150 | 150.000000 |
2023-01-04 | 300 | 216.666667 |
2023-01-05 | 250 | 233.333333 |
2023-01-06 | 400 | 316.666667 |
2023-01-07 | 300 | 316.666667 |
2023-01-08 | 350 | 350.000000 |
2023-01-09 | 300 | 316.666667 |
2023-01-10 | 400 | 350.000000 |
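The first two rows are NaN because a full 3-day window is not yet available; min_periods relaxes that requirement. A minimal sketch:
# Allow partial windows at the start instead of producing NaN
df['Rolling Mean'] = df['Sales'].rolling(window=3, min_periods=1).mean()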
Shift and Lagging Data
import pandas as pd
# Create Dataframe
data = {'Date': pd.date_range(start='2023-01-01', periods=5, freq='D'),
'Temperature': [30, 32, 35,33, 31]}
df = pd.DataFrame(data)
# Shifting data by one to create a lag
df['Prev Day Temp'] = df['Temperature'].shift(1)
print('Dataframe with shifted data: \n', df)
Output
DataFrame with Previous Day Temperature:
Date | Temperature | Prev Day Temp |
---|---|---|
2023-01-01 | 30 | NaN |
2023-01-02 | 32 | 30.0 |
2023-01-03 | 35 | 32.0 |
2023-01-04 | 33 | 35.0 |
2023-01-05 | 31 | 33.0 |
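A shifted column is typically used to compute day-over-day changes; diff() does the subtraction in one step. A minimal sketch:
# Day-over-day change, equivalent to Temperature - Temperature.shift(1)
df['Temp Change'] = df['Temperature'].diff()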
Cumulative sum and product
import pandas as pd
# Create dataframe
data = {'Sales': [100, 200, 150, 300, 250]}
df = pd.DataFrame(data)
# Calculating cumulative sum and product
df['Cumulative Sum'] = df['Sales'].cumsum()
df['Cumulative Product'] = df['Sales'].cumprod()
print('DataFrame with cumulative operations:\n', df)
Output
DataFrame with Cumulative Operations:
Sales | Cumulative Sum | Cumulative Product |
---|---|---|
100 | 100 | 100 |
200 | 300 | 20000 |
150 | 450 | 3000000 |
300 | 750 | 900000000 |
250 | 1000 | 225000000000 |
Merging on multiple columns
import pandas as pd
# Create two dataframe
df1 = pd.DataFrame({'ID':[1,2,3], 'Year':[2020,2021,2023], 'Name':['Alice','Bob','Charlie'] })
df2 = pd.DataFrame({'ID':[2,3,1], 'Year':[2023,2023,2020], 'Score':[88,92,75]})
# Merging on multiple columns
merged_df = pd.merge(df1,df2, on=['ID','Year'], how='inner')
print('Merged Dataframe on multiple columns : \n', merged_df)
Output
Merged DataFrame on Multiple Columns:
ID | Year | Name | Score |
---|---|---|---|
1 | 2020 | Alice | 75 |
3 | 2023 | Charlie | 92 |
Handling Outliers
import pandas as pd
# Create dataframe
data = {'Value': [10, 12, 14, 100, 15, 13, 12]}
df = pd.DataFrame(data)
# Identifying outliers using the IQR method
Q1 = df['Value'].quantile(0.25)
Q3 = df['Value'].quantile(0.75)
IQR = Q3 - Q1
outliers = df[(df['Value'] < (Q1 - 1.5 * IQR)) | (df['Value'] > (Q3 + 1.5 * IQR))]
print('Outliers: \n', outliers)
Output
Outliers:
Index | Value |
---|---|
3 | 100 |
Here, the value 100 lies outside the IQR fences (Q1 - 1.5*IQR and Q3 + 1.5*IQR), so it is the only row flagged as an outlier.
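Rather than only flagging outliers, they can be capped to the IQR fences with clip; a minimal sketch reusing the bounds computed above:
# Cap values at the IQR fences instead of removing them
lower, upper = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR
df['Value'] = df['Value'].clip(lower=lower, upper=upper)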
Creating a pivot table with multiple aggregations
import pandas as pd
# Create dataframe
data = {'City':['NY', 'LA', 'NY','SF','LA'], 'Year':[2020,2020,2021,2021,2020], 'Sales':[100, 150, 200, 250,300]}
df = pd.DataFrame(data)
# Create a pivot table with multiple aggregation
pivot_table = pd.pivot_table(df, values='Sales', index='City', columns='Year', aggfunc=['sum', 'mean'])
print('Pivot table with multiple aggregation: \n', pivot_table)
Output
Pivot Table with Multiple Aggregations:
City | sum (2020) | sum (2021) | mean (2020) | mean (2021) |
---|---|---|---|---|
LA | 450.0 | NaN | 225.0 | NaN |
NY | 100.0 | 200.0 | 100.0 | 200.0 |
SF | NaN | 250.0 | NaN | 250.0 |
Here, a pivot table is created with sum and mean as the aggregation functions; City/Year combinations with no data appear as NaN.
Creating a pivot table with multiple indexes
import pandas as pd
# Create dataframe
data = {'City':['NY', 'LA', 'NY','SF','LA'], 'Year':[2020,2020,2021,2021,2020], 'Sales':[100, 150, 200, 250,300]}
df = pd.DataFrame(data)
# Create a pivot table with multiple indexes
pivot_table = pd.pivot_table(df, values='Sales', index=['City', 'Year'], aggfunc='sum')
print('Pivot table with multiple indexes: \n', pivot_table)
Output
Pivot Table with Multiple Indexes :
City | Year | Sales |
---|---|---|
LA | 2020 | 450 |
NY | 2020 | 100 |
NY | 2021 | 200 |
SF | 2021 | 250 |
Using map() for Value Replacement
import pandas as pd
# Create dataframe
data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Department':['HR', 'IT', 'Finance']}
df = pd.DataFrame(data)
# Mapping department to codes
department_map = {'HR':1, 'IT':2, 'Finance':3}
df['Dept Code'] = df['Department'].map(department_map)
print("Dataframe with mapped values: \n", df)
Output
DataFrame with Mapped Values:
Name | Department | Dept Code |
---|---|---|
Alice | HR | 1 |
Bob | IT | 2 |
Charlie | Finance | 3 |
Here, the department names are mapped to department codes.
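map returns NaN for any department missing from the dict; chaining fillna assigns those a default. A minimal sketch (the default code 0 is a choice made here):
# Unmapped departments become NaN; give them a default code of 0
df['Dept Code'] = df['Department'].map(department_map).fillna(0).astype(int)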
Detecting Duplicates
import pandas as pd
# Create a dataframe with duplicate rows
data = {'ID':[1,2,2,3,4,4], 'Value': [10, 20,20,30,40,40]}
df = pd.DataFrame(data)
# Detecting duplicates
duplicates = df[df.duplicated()]
print('Duplicate Row : \n', duplicates)
Output
DataFrame with Duplicate Rows:
Index | ID | Value |
---|---|---|
2 | 2 | 20 |
5 | 4 | 40 |
In this example, the rows with ID 2 and 4 are duplicates; df.duplicated() flags only the second occurrence of each.
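To see every occurrence of a duplicated row rather than only the repeats, pass keep=False; a minimal sketch:
# Show all rows that have a duplicate anywhere in the frame
all_duplicates = df[df.duplicated(keep=False)]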
Using explode() for lists in columns
import pandas as pd
# Create a dataframe with lists in a column
data = {'ID':[1,2], 'Hobbies':[['Reading', 'Swimming'], ['Running', 'Cycling'] ]}
df = pd.DataFrame(data)
# Exploding the 'Hobbies' column
exploded_df = df.explode('Hobbies')
print("Dataframe after exploding lists:\n", exploded_df)
Output
DataFrame After Exploding Lists:
ID | Hobbies |
---|---|
1 | Reading |
1 | Swimming |
2 | Running |
2 | Cycling |
In this example, each hobby in a list associated with an ID is expanded into its own row.
Using rank() to rank values
import pandas as pd
# Create dataframe
data = {'Name' : ['Alice', 'Bob', 'Charlie', 'David'], 'Score': [85,90,78,92]}
df = pd.DataFrame(data)
# Ranking scores in descending order
df['Rank'] = df['Score'].rank(ascending=False)
print('Dataframe with ranked scores : \n', df)
Output
DataFrame with Ranked Scores:
Name | Score | Rank |
---|---|---|
Alice | 85 | 3.0 |
Bob | 90 | 2.0 |
Charlie | 78 | 4.0 |
David | 92 | 1.0 |
In this example, the scores are ranked in descending order, with the highest score receiving rank 1.
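By default rank() assigns tied values the average of their ranks; the method parameter controls tie handling. A minimal sketch:
# 'dense' gives ties the same rank with no gaps in the sequence
df['Dense Rank'] = df['Score'].rank(ascending=False, method='dense')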
Using applymap() for element-wise operations
import pandas as pd
# Create dataframe
data = {'A': [1,2,3], 'B':[4,5,6]}
df = pd.DataFrame(data)
# Applying an operation to each element (DataFrame.map replaces applymap in pandas >= 2.1)
df_transformed = df.applymap(lambda x: x ** 2)
print("Dataframe with squared values : \n", df_transformed)
Output
DataFrame with Squared Values:
A | B |
---|---|
1 | 16 |
4 | 25 |
9 | 36 |
In this example, each value in columns A and B has been squared.
Using df.apply() with the axis parameter
import pandas as pd
# Create dataframe
data = {'A': [1,2,3], 'B':[4,5,6]}
df = pd.DataFrame(data)
# Applying a function to rows and columns using the axis parameter
row_sum = df.apply(lambda x: x.sum(), axis=1)
col_sum = df.apply(lambda x: x.sum(), axis=0)
print("Row wise sum:\n", row_sum)
print("Column wise sum:\n", col_sum)
Output
Row-wise Sum:
Index | Sum |
---|---|
0 | 5 |
1 | 7 |
2 | 9 |
Column-wise Sum:
Column | Sum |
---|---|
A | 6 |
B | 15 |
Create dummy variables
import pandas as pd
# Create dataframe
data = {'City': ['NY', 'LA', 'SF', 'NY']}
df = pd.DataFrame(data)
# Creating dummy variables
dummies = pd.get_dummies(df['City'], prefix='City')
print('Dummy variables dataframe:\n', dummies)
Output
Dummy variables dataframe:
Index | City_LA | City_NY | City_SF |
---|---|---|---|
0 | False | True | False |
1 | True | False | False |
2 | False | False | True |
3 | False | True | False |
Here, each unique city becomes its own indicator column (boolean in pandas >= 2.0; 0/1 in earlier versions).
Use query() for filtering
import pandas as pd
# Create dataframe
data = {'Name' : ['Alice', 'Bob', 'Charlie', 'David'], 'Score': [85,90,78,92]}
df = pd.DataFrame(data)
# Use query for filtering
filtered_df = df.query('Score > 80')
print('Filtered dataframe using query : \n', filtered_df)
Output
Filtered DataFrame :
Name | Score |
---|---|
Alice | 85 |
Bob | 90 |
David | 92 |
Here, only rows with a Score above 80 are included in the filtered DataFrame.
Adding a new column using assign()
import pandas as pd
# Create dataframe
data = {'Name' : ['Alice', 'Bob', 'Charlie', 'David'], 'Score': [85,90,78,92]}
df = pd.DataFrame(data)
# Adding a new column using assign
df = df.assign(Grade=lambda x: ['A' if score >= 90 else 'B' for score in x['Score']])
print("Dataframe with assigned grade column : \n", df)
Output
DataFrame with Assigned Grade Column :
Name | Score | Grade |
---|---|---|
Alice | 85 | B |
Bob | 90 | A |
Charlie | 78 | B |
David | 92 | A |
In this example, grades are assigned based on the score: 'A' for scores of 90 or above, and 'B' otherwise.
Applying a custom function with groupby
import pandas as pd
# Create dataframe
data = {
'Category':['A','A','B','B'],
'Value':[10, 20, 30,40]
}
df = pd.DataFrame(data)
# Applying a custom function with groupby: the range (max - min) per category
grouped_df = df.groupby('Category')['Value'].apply(lambda x: x.max() - x.min())
print("Custom Aggregation result:\n", grouped_df)
Output
Custom Aggregation Result:
Category | Value |
---|---|
A | 10 |
B | 10 |
In this example, the Value column is aggregated with a custom function, the range (max minus min), for each category.
Applying functions to grouped data
import pandas as pd
# Create dataframe
data = {
'Team':['A','A','B','B'],
'Points':[10, 20, 30,40]
}
df = pd.DataFrame(data)
# Applying a function to grouped data: scale each value by its group's maximum
grouped_df = df.groupby('Team')['Points'].transform(lambda x: x / x.max())
print("Dataframe with applied functions to grouped data:\n", grouped_df)
Output
Dataframe with Applied Functions to Grouped Data:
Points | |
---|---|
0 | 0.50 |
1 | 1.00 |
2 | 0.75 |
3 | 1.00 |
DataFrame info and memory usage
import pandas as pd
# Create dataframe
data = {
'A': range(1000),
'B':range(1000, 2000)
}
df = pd.DataFrame(data)
# Displaying DataFrame info and memory usage (info() prints directly and returns None)
print('Dataframe info and Memory usage:')
df.info(memory_usage='deep')
Output
DataFrame Info and Memory Usage :
# | Column | Non-Null Count | Dtype |
---|---|---|---|
0 | A | 1000 non-null | int64 |
1 | B | 1000 non-null | int64 |
- Dtypes: int64 (2 columns)
- Memory Usage: 15.8 KB
Filter with isin()
import pandas as pd
# Create dataframe
data = {'Name' : ['Alice', 'Bob', 'Charlie', 'David'], 'City': ['NY','LA','SF','NY']}
df = pd.DataFrame(data)
# filter with isin()
filtered_df = df[df['City'].isin(['NY','SF'])]
print("Filtered Dataframe with isin method: \n", filtered_df)
Output
Filtered Dataframe with isin method:
Name | City | |
---|---|---|
0 | Alice | NY |
2 | Charlie | SF |
3 | David | NY |
This section demonstrates how to filter rows in a DataFrame based on a list of values using the isin method.
Concatenating Dataframe using concat()
import pandas as pd
# Create dataframes
df1 = pd.DataFrame({
'A':[1,2],
'B':[3,4]
})
df2 = pd.DataFrame({
'A': [5,6],
'B': [7,8]
})
# Concatenating DataFrames
concat_df = pd.concat([df1, df2], ignore_index=True)
print("Concatenated DataFrame:\n", concat_df)
Output
Concatenated DataFrame :
A | B | |
---|---|---|
0 | 1 | 3 |
1 | 2 | 4 |
2 | 5 | 7 |
3 | 6 | 8 |
This section demonstrates how to concatenate two DataFrames vertically.
Sorting a DataFrame by Multiple Columns
import pandas as pd
# Create dataframe
data = {'Name' : ['Alice', 'Bob', 'Charlie', 'David'], 'Score': [85,90,78,92], 'Age': [25,37,35,21]}
df = pd.DataFrame(data)
# Sorting by multiple columns
sorted_df = df.sort_values(by=['Score', 'Age'], ascending=[False, True])
print("Sorted Dataframe by multiple columns: \n", sorted_df)
Output
Sorted DataFrame by Multiple Columns:
Index | Name | Score | Age |
---|---|---|---|
3 | David | 92 | 21 |
1 | Bob | 90 | 37 |
0 | Alice | 85 | 25 |
2 | Charlie | 78 | 35 |
- Primary sort by Score in descending order.
- Secondary sort by Age in ascending order.
Displaying the styled Dataframe
import pandas as pd
# Create dataframe
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'], 'Score': [85, 90, 78, 92]}
df = pd.DataFrame(data)
# Applying styles to highlight scores above 80
styled_df = df.style.map(
lambda x: 'background-color: yellow' if isinstance(x, int) and x > 80 else ''
)
# Save styled DataFrame to HTML
styled_df.to_html("styled_dataframe.html")
# Displaying the styled Dataframe
print("Styled dataframe with conditional formatting has been saved to 'styled_dataframe.html'.")
Output
"Styled dataframe with conditional formatting has been saved to 'styled_dataframe.html'."
Index | Name | Score |
---|---|---|
0 | Alice | 85 |
1 | Bob | 90 |
2 | Charlie | 78 |
3 | David | 92 |
Note: Cells with scores greater than 80 are highlighted in yellow. To view the styling, open the styled_dataframe.html file in a browser.
Using explode() for nested JSON columns
import pandas as pd
# Create a dataframe with nested lists
data = {
'ID' : [1,2],
'Hobbies':[['Reading', 'Swimming', 'Gaming'], ['Hiking', 'Drawing']]
}
df = pd.DataFrame(data)
# Exploding the 'Hobbies' column
exploded_df = df.explode('Hobbies')
print("Dataframe after exploding nested json columns:\n", exploded_df)
Output
DataFrame after Exploding Nested JSON Columns
Index | ID | Hobbies |
---|---|---|
0 | 1 | Reading |
0 | 1 | Swimming |
0 | 1 | Gaming |
1 | 2 | Hiking |
1 | 2 | Drawing |
Using pipe() for method chaining
import pandas as pd
# Create dataframe
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'], 'Score': [85, 90, 78, 92]}
df = pd.DataFrame(data)
# Custom function for modifying the dataframe
def add_grade_column(df):
    df['Grade'] = df['Score'].apply(lambda x: 'A' if x >= 90 else 'B')
    return df
# Using pipe() for method chaining
df = df.pipe(add_grade_column)
print("Dataframe after using pipe : \n", df)
Output
DataFrame after Using Pipe
Index | Name | Score | Grade |
---|---|---|---|
0 | Alice | 85 | B |
1 | Bob | 90 | A |
2 | Charlie | 78 | B |
3 | David | 92 | A |
Multi Index Slicing
import pandas as pd
# Create Multi Index Dataframe
arrays = [['A', 'A', 'B', 'B'], [2020, 2021, 2022, 2023]]
index = pd.MultiIndex.from_arrays(arrays, names=('Category', 'Year'))
data = [100, 150, 200, 250]
df = pd.DataFrame(data, index=index, columns=['Sales'])
# Slicing the MultiIndex DataFrame
sliced_df = df.loc['A']
print("Sliced MultiIndex Dataframe:\n", sliced_df)
Output
Sliced MultiIndex DataFrame
Year | Sales |
---|---|
2020 | 100 |
2021 | 150 |
Adding prefix or suffix to column names
import pandas as pd
# Create a dataframe
data = {
'Math': [90, 80],
'Science':[85,88]
}
df = pd.DataFrame(data)
# Adding prefix to column names
df_prefixed = df.add_prefix('Grade_')
# Adding suffix to column names
df_suffixed = df.add_suffix('_Score')
print("Dataframe with Prefix :\n", df_prefixed)
print('\n')
print("Dataframe with Suffix :\n", df_suffixed)
Output
DataFrame with Prefix
Index | Grade_Math | Grade_Science |
---|---|---|
0 | 90 | 85 |
1 | 80 | 88 |
DataFrame with Suffix
Index | Math_Score | Science_Score |
---|---|---|
0 | 90 | 85 |
1 | 80 | 88 |
Using at and iat for fast scalar access
import pandas as pd
# Create dataframes
data = {
'A':[1,2,3],
'B':[4,5,6]
}
df = pd.DataFrame(data)
# Accessing a single value using at and iat
value_at = df.at[1, 'A']
value_iat = df.iat[1,0]
print('values using at : ' , value_at)
print('values using iat : ' , value_iat)
Output
values using at :  2
values using iat :  2
Renaming index & columns
import pandas as pd
# Create dataframe
data = {'Name' : ['Alice', 'Bob'], 'Score': [78,92]}
df = pd.DataFrame(data, index=['Row1','Row2'])
# Renaming index & columns
df_renamed = df.rename(index={'Row1':'Student1', 'Row2':'Student2'}, columns={'Score':'Grade'})
print("Renamed Dataframe : \n",df_renamed)
Output
Renamed DataFrame
Index | Name | Grade |
---|---|---|
Student1 | Alice | 78 |
Student2 | Bob | 92 |
Use query() with variables
import pandas as pd
# Create dataframe
data = {'Name' : ['Alice', 'Bob', 'Charlie', 'David'], 'Score': [85,90,78,92]}
df = pd.DataFrame(data)
# Using query() with variables
threshold = 80
filtered_df = df.query('Score > @threshold')
print("Filtered Dataframe with query() and variables:\n", filtered_df)
Output
Filtered DataFrame with query() and Variables:
Index | Name | Score |
---|---|---|
0 | Alice | 85 |
1 | Bob | 90 |
3 | David | 92 |
Using groupby with custom aggregation
import pandas as pd
# Create dataframe
data = {
'Category':['A','A','B','B'],
'Value':[10, 20, 30,40]
}
df = pd.DataFrame(data)
# Using groupby with named aggregation
custom_df = df.groupby('Category').agg(max_value=('Value', 'max'), min_value=('Value', 'min'))
print("Custom Aggregation result:\n", custom_df)
Output
Custom Aggregation Result
Category | max_value | min_value |
---|---|---|
A | 20 | 10 |
B | 40 | 30 |
Pivoting with fill_value
import pandas as pd
# Create dataframe
data = {'Date':['2023-01-01', '2023-01-02','2023-01-01','2023-01-02'],
'City':['NY', 'NY','LA', 'LA'],
'Sales':[200,250,300,None]
}
df = pd.DataFrame(data)
# Pivoting with fill_value
pivot_df = df.pivot(index='Date', columns='City', values='Sales').fillna(0)
print('Pivoted Dataframe with fill_value: \n', pivot_df)
Output
Pivoted Dataframe with fill_value:
Date | LA | NY |
---|---|---|
2023-01-01 | 300.0 | 200.0 |
2023-01-02 | 0.0 | 250.0 |
Dropping Duplicate Rows
import pandas as pd
# Create a dataframe with duplicate rows
data = {'ID':[1,2,2,3,4,4], 'Value': [10, 20,20,30,40,40]}
df = pd.DataFrame(data)
# Dropping duplicate rows
unique_df = df.drop_duplicates()
print("Dataframe after dropping duplicates: \n", unique_df)
Output
DataFrame After Dropping Duplicates
Index | ID | Value |
---|---|---|
0 | 1 | 10 |
1 | 2 | 20 |
3 | 3 | 30 |
4 | 4 | 40 |
Combining DataFrames with Multi-level Indexes
import pandas as pd
# Create Multi Index Dataframe
arrays = [['A', 'A', 'B', 'B'], [2020, 2021, 2022, 2023]]
index = pd.MultiIndex.from_arrays(arrays, names=('Category', 'Year'))
data1 = [100, 150, 200, 250]
df1 = pd.DataFrame(data1, index=index, columns=['Sales'])
data2 = [300, 400, 500, 6000]
df2 = pd.DataFrame(data2, index=index, columns=['Profit'])
# Combining Dataframe
combined_df = pd.concat([df1, df2], axis=1)
print("Combining Dataframe with Multi-level Indexes: \n ", combined_df)
Output
Combining DataFrame with Multi-level Indexes:
Category | Year | Sales | Profit |
---|---|---|---|
A | 2020 | 100 | 300 |
A | 2021 | 150 | 400 |
B | 2022 | 200 | 500 |
B | 2023 | 250 | 6000 |
Using rolling() with custom functions
import pandas as pd
# Create a time series dataframe
data_range = pd.date_range(start='2023-01-01', periods=10, freq='D')
data = {'Sales': [100, 200, 150, 300, 250, 400, 300, 350, 300, 400]}
df = pd.DataFrame(data, index= data_range)
# Applying a custom function with rolling()
df['Rolling Sum'] = df['Sales'].rolling(window=3).apply(lambda x: x.sum() if x.sum() > 50 else 0)
print("Dataframe with custom rolling sum:\n", df)
Output
DataFrame with Custom Rolling Sum
Date | Sales | Rolling Sum |
---|---|---|
2023-01-01 | 100 | NaN |
2023-01-02 | 200 | NaN |
2023-01-03 | 150 | 450.0 |
2023-01-04 | 300 | 650.0 |
2023-01-05 | 250 | 700.0 |
2023-01-06 | 400 | 950.0 |
2023-01-07 | 300 | 950.0 |
2023-01-08 | 350 | 1050.0 |
2023-01-09 | 300 | 950.0 |
2023-01-10 | 400 | 1050.0 |
Setting & Resetting Indexes
import pandas as pd
# Create dataframe
data = {'City':['NY', 'LA', 'NY','SF','LA'], 'Year':[2020,2020,2021,2021,2020], 'Sales':[100, 150, 200, 250,300]}
df = pd.DataFrame(data)
# Setting an index
df.set_index('City', inplace=True)
print("Dataframe after setting index: \n", df)
print('\n')
# Resetting index
df.reset_index(inplace=True)
print("Dataframe after resetting index: \n", df)
Output
DataFrame after Setting Index
City | Year | Sales |
---|---|---|
NY | 2020 | 100 |
LA | 2020 | 150 |
NY | 2021 | 200 |
SF | 2021 | 250 |
LA | 2020 | 300 |
DataFrame after Resetting Index
City | Year | Sales |
---|---|---|
NY | 2020 | 100 |
LA | 2020 | 150 |
NY | 2021 | 200 |
SF | 2021 | 250 |
LA | 2020 | 300 |
Working with time zones
import pandas as pd
# Create a time series dataframe
data_range = pd.date_range(start='2023-01-01', periods=10, freq='D')
data = {'Sales': [100, 200, 150, 300, 250, 400, 300, 350, 300, 400]}
df = pd.DataFrame(data, index=data_range)
# Localizing the timezone to UTC (or any initial timezone)
df.index = df.index.tz_localize('UTC')
# Converting to a different timezone (US/Eastern)
df = df.tz_convert('US/Eastern')
print("Time series dataframe with time zone: \n", df)
Output
Time series dataframe with time zone:
DateTime | Sales |
---|---|
2022-12-31 19:00:00-05:00 | 100 |
2023-01-01 19:00:00-05:00 | 200 |
2023-01-02 19:00:00-05:00 | 150 |
2023-01-03 19:00:00-05:00 | 300 |
2023-01-04 19:00:00-05:00 | 250 |
2023-01-05 19:00:00-05:00 | 400 |
2023-01-06 19:00:00-05:00 | 300 |
2023-01-07 19:00:00-05:00 | 350 |
2023-01-08 19:00:00-05:00 | 300 |
2023-01-09 19:00:00-05:00 | 400 |
Detecting and Replacing Outliers
import pandas as pd
# Create dataframe
data = {'Values': [10,12,14,100,15,13,12]}
df = pd.DataFrame(data)
# Detecting Outliers using the IQR method
Q1 = df['Values'].quantile(0.25)
Q3 = df['Values'].quantile(0.75)
IQR = Q3 - Q1
outliers = (df['Values'] < (Q1-1.5 * IQR)) | (df['Values'] > (Q3 + 1.5 * IQR))
# Replacing outliers with the column mean (computed over all values, including the outlier)
df.loc[outliers, 'Values'] = df['Values'].mean()
print("Dataframe after Replacing outliers: \n", df)
Output
Dataframe after Replacing Outliers:
Index | Values |
---|---|
0 | 10.000000 |
1 | 12.000000 |
2 | 14.000000 |
3 | 25.142857 |
4 | 15.000000 |
5 | 13.000000 |
6 | 12.000000 |
Using eval() for calculations
import pandas as pd
# Create a dataframe
data = {'A':[1,2,3,4], 'B':[5,6,7,8]}
df = pd.DataFrame(data)
# Using eval() for calculations
df['C'] = df.eval('A + B * 2')
print("Dataframe with evaluated calculation:\n", df)
Output
Dataframe with Evaluated Calculation:
A | B | C | |
---|---|---|---|
0 | 1 | 5 | 11 |
1 | 2 | 6 | 14 |
2 | 3 | 7 | 17 |
3 | 4 | 8 | 20 |
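eval can also assign inside the expression string itself; a minimal sketch (the column name D is a choice made here):
# Assignment form of eval: creates column 'D' in place
df.eval('D = A - B', inplace=True)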
Merging with an indicator to trace the source
import pandas as pd
# Create dataframe
df1 = pd.DataFrame({'ID': [1,2,3,4] ,'Name' : ['Alice', 'Bob', 'Charlie', 'David']})
df2 = pd.DataFrame({'ID': [3,4,5,6] ,'Name' : ['Charlie', 'David','Robin','Michale']})
# Merging with indicator to track the source
merged_df = pd.merge(df1, df2, on='ID', how='outer', indicator=True)
print('Merged dataframe with indicator: \n', merged_df)
Output
Merged DataFrame with Indicator:
ID | Name_x | Name_y | _merge |
---|---|---|---|
1 | Alice | NaN | left_only |
2 | Bob | NaN | left_only |
3 | Charlie | Charlie | both |
4 | David | David | both |
5 | NaN | Robin | right_only |
6 | NaN | Michale | right_only |
Time Series with Business days
import pandas as pd
# Create a time series dataframe
data_range = pd.date_range(start='2023-01-01', periods=10, freq='B')
data = {'Sales': range(10)}
df = pd.DataFrame(data, index= data_range)
print("Time series Dataframe with business days : \n", df)
Output
Time Series DataFrame with Business Days:
Date | Sales |
---|---|
2023-01-02 | 0 |
2023-01-03 | 1 |
2023-01-04 | 2 |
2023-01-05 | 3 |
2023-01-06 | 4 |
2023-01-09 | 5 |
2023-01-10 | 6 |
2023-01-11 | 7 |
2023-01-12 | 8 |
2023-01-13 | 9 |
Stacking & Unstacking dataframe
import pandas as pd
# Create dataframe
data = {'City':['NY','LA'],
'2020':[100,500],
'2021':[200,450],
}
df = pd.DataFrame(data)
# Stacking & Unstacking dataframe
stack_df = df.stack()
unstack_df = df.unstack()
print('Stacking dataframe \n', stack_df)
print('Unstacking dataframe \n', unstack_df)
Output
Stacked DataFrame:
Level 0 | Level 1 | Value |
---|---|---|
0 | City | NY |
0 | 2020 | 100 |
0 | 2021 | 200 |
1 | City | LA |
1 | 2020 | 500 |
1 | 2021 | 450 |
Unstacked DataFrame:
Column | Index | Value |
---|---|---|
City | 0 | NY |
City | 1 | LA |
2020 | 0 | 100 |
2020 | 1 | 500 |
2021 | 0 | 200 |
2021 | 1 | 450 |
Here, stack pivots each row's columns into the index, while unstack on a default-index DataFrame returns a Series keyed by (column, row).
Creating Custom Indexes with Multi-Index
import pandas as pd
# Create a multi-index from tuple
index = pd.MultiIndex.from_tuples([('A',2020), ('A', 2021), ('B', 2021), ('C', 2022)], names=['Category', 'Year'])
data = [100, 150, 200, 250]
df = pd.DataFrame(data, index=index, columns=['Sales'])
print("Dataframe with custom multi-index: \n", df)
Output
Dataframe with Custom Multi-Index:
Category | Year | Sales |
---|---|---|
A | 2020 | 100 |
A | 2021 | 150 |
B | 2021 | 200 |
C | 2022 | 250 |