Pandas (the name derives from "panel data" and is also a play on "Python data analysis") is an open-source software library designed for data manipulation and analysis.
- Revolves around two primary data structures: Series (1D) and DataFrame (2D).
- Built on top of NumPy, it efficiently manages large datasets and offers tools for data cleaning, transformation, and analysis.
- Provides tools for working with time series data, including date range generation and frequency conversion. For example, we can convert date or time columns into pandas’ datetime type using pd.to_datetime(), or pass the column names to parse_dates when loading a CSV with pd.read_csv().
- Integrates with other Python libraries like NumPy, Matplotlib, and scikit-learn.
- Provides methods like .dropna() and .fillna() to handle missing values.
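The time-series and missing-value points above can be sketched as follows. The column names and values are made up for illustration; only pd.to_datetime(), .dropna() and .fillna() come from the text.

```python
import numpy as np
import pandas as pd

# Hypothetical data: dates stored as strings, one missing reading
raw = pd.DataFrame({
    "date": ["2024-01-01", "2024-01-02", "2024-01-03"],
    "reading": [1.5, np.nan, 3.0],
})

# Convert the string column to pandas' datetime type
raw["date"] = pd.to_datetime(raw["date"])

# Drop every row that contains a missing value
dropped = raw.dropna()

# Or keep all rows and replace the missing reading with a fixed value
filled = raw.fillna({"reading": 0.0})

print(raw.dtypes)
print(dropped)
print(filled)
```

Whether dropping or filling is appropriate depends on the dataset; both return a new DataFrame and leave the original unchanged.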
Here are various tasks that we can do using Pandas:
Data Cleaning, Merging and Joining: Clean and combine data from multiple sources, handling inconsistencies and duplicates.
Handling Missing Data: Manage missing values (NaN) in both floating and non-floating point data.
Column Insertion and Deletion: Easily add, remove or modify columns in a DataFrame.
Group By Operations: Use "split-apply-combine" to group and analyze data.
Data Visualization: Create visualizations with Matplotlib and Seaborn, integrated with Pandas.
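As a rough illustration of the "split-apply-combine" idea from the list above, the sketch below groups a small made-up table by one column and sums another. The DataFrame `sales` and its column names are assumptions for this example.

```python
import pandas as pd

# Hypothetical sales records
sales = pd.DataFrame({
    "region": ["North", "South", "North", "South"],
    "amount": [100, 200, 150, 50],
})

# Split rows by region, apply sum to each group, combine into one Series
totals = sales.groupby("region")["amount"].sum()
print(totals)
```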
Pandas Dataframe:
A Pandas DataFrame is a two-dimensional table-like structure in Python where data is arranged in rows and columns. It’s one of the most commonly used tools for handling data and makes it easy to organize, analyze and manipulate data. It can store different types of data such as numbers, text and dates across its columns. The main parts of a DataFrame are:
Data: Actual values in the table.
Rows (index): Labels that identify each row.
Columns: Labels that define each data category.
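The three parts can be inspected directly on any DataFrame. A minimal sketch, using a made-up two-row table (the names `frame`, `item` and `price` are assumptions):

```python
import pandas as pd

# Hypothetical two-row table used only to show the three parts
frame = pd.DataFrame(
    {"item": ["pen", "book"], "price": [10, 50]},
    index=["r1", "r2"],
)

print(frame.values)   # data: the underlying values as a NumPy array
print(frame.index)    # row labels
print(frame.columns)  # column labels
```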
Creating Empty DataFrame:
import pandas as pd
df = pd.DataFrame()
print(df)
Creating a DataFrame from a List
A simple way to create a DataFrame is by using a single list. Pandas automatically assigns index values to the rows when you pass a list.
- Each item in the list becomes a row.
- The DataFrame consists of a single column, which is labeled 0 by default because no column name was given.
import pandas as pd
lst = ['Geeks', 'For', 'Geeks', 'is',
'portal', 'for', 'Geeks']
df = pd.DataFrame(lst)
print(df)
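To give that single column a readable label, one option is the columns argument of pd.DataFrame. A minimal sketch, where the column name 'word' is an assumption chosen for illustration:

```python
import pandas as pd

lst = ['Geeks', 'For', 'Geeks', 'is', 'portal', 'for', 'Geeks']

# 'word' is a hypothetical column name; without it the column is labeled 0
named_df = pd.DataFrame(lst, columns=['word'])
print(named_df)
```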
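Creating a DataFrame from a List of Dictionaries
In a list of dictionaries, each dictionary corresponds to one row and its keys become the column names. This form is common in web scraping and API data processing, since JSON responses often contain lists of dictionaries.
import pandas as pd
# Each dictionary is one row; its keys become the column names
records = [{'name': "aparna", 'degree': "MBA", 'score': 90},
{'name': "pankaj", 'degree': "BCA", 'score': 40},
{'name': "sudhir", 'degree': "M.Tech", 'score': 80},
{'name': "Geeku", 'degree': "MBA", 'score': 98}]
df = pd.DataFrame(records)
print(df)
Add index Explicitly:
df = pd.DataFrame(records, index=['Rollno1','Rollno2','Rollno3','Rollno4'])
Method #2: Using from_dict() function
# from_dict() expects a dictionary (column name -> list of values), not a list of rows
df = pd.DataFrame.from_dict({'name': ["aparna", "pankaj", "sudhir", "Geeku"],
'degree': ["MBA", "BCA", "M.Tech", "MBA"],
'score': [90, 40, 80, 98]})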
Creating a DataFrame by Passing List Variables to a Dictionary
import pandas as pd
# dictionary of lists: each key becomes a column
name = ['aparna', 'pankaj', 'sudhir', 'Geeku']
degree = ['MBA', 'BCA', 'M.Tech', 'MBA']
score = [90, 40, 80, 98]
# named 'data' rather than 'dict' to avoid shadowing the built-in dict type
data = {'name': name,
'degree': degree,
'score': score}
df = pd.DataFrame(data, index=['Rollno1','Rollno2','Rollno3','Rollno4'])
print(df)
Pandas Dataframe Index:
import pandas as pd
data = {'Name': ['John', 'Alice', 'Bob', 'Eve', 'Charlie'],
'Age': [25, 30, 22, 35, 28],
'Gender': ['Male', 'Female', 'Male', 'Female', 'Male'],
'Salary': [50000, 55000, 40000, 70000, 48000]}
df = pd.DataFrame(data)
print(df.index) # Accessing the index
Custom Index:
# Set 'Name' column as the index
df_with_index = df.set_index('Name')
Resetting the Index
# Reset the index of df_with_index back to the default integer index;
# 'Name' becomes an ordinary column again
df_reset = df_with_index.reset_index()
print(df_reset)
Indexing with loc
# 'Name' is the index of df_with_index, so rows can be looked up by name
row = df_with_index.loc['Alice']
print(row)
Changing the Index
# Set 'Age' as the new index
df_with_new_index = df.set_index('Age')
print(df_with_new_index)
Accessing Columns From DataFrame
Columns in a DataFrame can be accessed individually using bracket notation. Accessing a column retrieves it as a Series, which can then be further manipulated.
# Access the 'Age' column
age_column = df['Age']
print(age_column)
Accessing Rows by Index
To access specific rows in a DataFrame, you can use iloc (for positional indexing) or loc (for label-based indexing). These methods allow you to retrieve rows based on their index positions or labels.
# Access the row at index 1 (second row)
second_row = df.iloc[1]
print(second_row)
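The difference between the two indexers is easiest to see when the index is not the default 0, 1, 2, and so on. The sketch below uses a small frame indexed by name (the variable `people` is introduced here for illustration) and retrieves the same row both ways.

```python
import pandas as pd

# Small frame whose row labels are names rather than integers
people = pd.DataFrame(
    {"Age": [25, 30, 22]},
    index=["John", "Alice", "Bob"],
)

by_position = people.iloc[1]     # second row, counted from 0
by_label = people.loc["Alice"]   # row whose label is 'Alice'

print(by_position)
print(by_label)
```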
Accessing Multiple Rows or Columns
You can access multiple rows or columns at once by passing a list of column names or index positions. This is useful when you need to select several columns or rows for further analysis.
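# Access the first three rows and the 'Name' and 'Age' columns
# (.loc slices by label and includes both endpoints, so 0:2 selects three rows)
subset = df.loc[0:2, ['Name', 'Age']]
print(subset)
# Access a single value: the 'Gender' of the row labeled 2
print(df.loc[2, 'Gender'])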
Accessing Rows Based on Conditions
Pandas allows you to filter rows based on conditions, which can be very powerful for exploring subsets of data that meet specific criteria.
# Access rows where 'Age' is greater than 25
filtered_data = df[df['Age'] > 25]
print(filtered_data)
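Several conditions can be combined with & (and) and | (or), with each condition wrapped in parentheses. A sketch using the same sample data as above; the salary threshold is an arbitrary value picked for this example.

```python
import pandas as pd

people = pd.DataFrame({
    "Name": ["John", "Alice", "Bob", "Eve", "Charlie"],
    "Age": [25, 30, 22, 35, 28],
    "Salary": [50000, 55000, 40000, 70000, 48000],
})

# Rows where Age is above 25 AND Salary is below 60000
mask = (people["Age"] > 25) & (people["Salary"] < 60000)
result = people[mask]
print(result)
```

The parentheses matter because & binds more tightly than the comparison operators in Python.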
Accessing Specific Cells with at and iat
If you need to access a specific cell, you can use the .at[] method for label-based indexing and the .iat[] method for integer position-based indexing. These are optimized for fast access to single values.
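# Access the 'Salary' of the row with label 2
salary_at_index_2 = df.at[2, 'Salary']
print(salary_at_index_2)
# .iat[] takes integer positions instead: row 2, column 3 ('Salary')
salary_at_position = df.iat[2, 3]
print(salary_at_position)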
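When the index holds names instead of integers, the two accessors take visibly different arguments. A minimal sketch, assuming the same sample data with 'Name' moved into the index (the variable `by_name` is introduced here):

```python
import pandas as pd

people = pd.DataFrame({
    "Name": ["John", "Alice", "Bob", "Eve", "Charlie"],
    "Age": [25, 30, 22, 35, 28],
    "Gender": ["Male", "Female", "Male", "Female", "Male"],
    "Salary": [50000, 55000, 40000, 70000, 48000],
})
by_name = people.set_index("Name")

# Label-based: row label 'Bob', column label 'Salary'
via_at = by_name.at["Bob", "Salary"]

# Position-based: third row, third remaining column (Age, Gender, Salary)
via_iat = by_name.iat[2, 2]

print(via_at, via_iat)
```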
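Indexing and Selecting Data with Pandas
import pandas as pd
# Load the NBA dataset used in the examples below
data = pd.read_csv("/content/nba.csv")
- Selecting a Single Column
first = data["Age"]
first.head(5)
- Selecting Multiple Columns
first = data[["Age", "College", "Salary"]]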
Indexing with .loc[ ]
The .loc[] indexer is used for label-based indexing. It allows us to access rows and columns by their labels. Unlike the plain indexing operator [], it can select subsets of rows and columns in a single call.
- Selecting a Single Row by Label
import pandas as pd
data = pd.read_csv("/content/nba.csv", index_col="Name")
row = data.loc["Avery Bradley"]
print(row)
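Selecting rows and columns in one .loc[] call might look like the sketch below. It uses a small made-up table in place of nba.csv, so the player labels and values are placeholders, not data from that file.

```python
import pandas as pd

# Hypothetical stand-in for the NBA data, indexed by player label
players = pd.DataFrame(
    {"Team": ["A", "B", "C"], "Age": [25, 27, 22], "Salary": [100, 200, 300]},
    index=["P1", "P2", "P3"],
)

# A list of row labels and a list of column labels
subset = players.loc[["P1", "P3"], ["Team", "Salary"]]

# A label slice includes both endpoints
sliced = players.loc["P1":"P2", "Age"]

print(subset)
print(sliced)
```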
- Concatenating DataFrames using pd.concat()
import pandas as pd
data1 = {'Name': ['Jai', 'Princi', 'Gaurav', 'Anuj'],
'Age': [27, 24, 22, 32],
'Address': ['Nagpur', 'Kanpur', 'Allahabad', 'Kannuaj'],
'Qualification': ['Msc', 'MA', 'MCA', 'Phd']}
data2 = {'Name': ['Abhi', 'Ayushi', 'Dhiraj', 'Hitesh'],
'Age': [17, 14, 12, 52],
'Address': ['Nagpur', 'Kanpur', 'Allahabad', 'Kannuaj'],
'Qualification': ['Btech', 'B.A', 'Bcom', 'B.hons']}
df = pd.DataFrame(data1, index=[0, 1, 2, 3])
df1 = pd.DataFrame(data2, index=[4, 5, 6, 7])
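print(df, "\n\n", df1)
# Stack the two DataFrames vertically (row-wise, the default axis=0)
res = pd.concat([df, df1])
print(res)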
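- Concatenating DataFrames by Setting Logic on Axes
We can control how the axis that is not being concatenated along gets aligned: join='outer' takes the union of its labels and join='inner' takes the intersection. The example below concatenates with axis=1, so the join applies to the row index; with the default axis=0 it would apply to the columns.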
import pandas as pd
data1 = {'Name': ['Jai', 'Princi', 'Gaurav', 'Anuj'],
'Age': [27, 24, 22, 32],
'Address': ['Nagpur', 'Kanpur', 'Allahabad', 'Kannuaj'],
'Qualification': ['Msc', 'MA', 'MCA', 'Phd'],
'Mobile No': [97, 91, 58, 76]}
data2 = {'Name': ['Gaurav', 'Anuj', 'Dhiraj', 'Hitesh'],
'Age': [22, 32, 12, 52],
'Address': ['Allahabad', 'Kannuaj', 'Allahabad', 'Kannuaj'],
'Qualification': ['MCA', 'Phd', 'Bcom', 'B.hons'],
'Salary': [1000, 2000, 3000, 4000]}
df = pd.DataFrame(data1, index=[0, 1, 2, 3])
df1 = pd.DataFrame(data2, index=[2, 3, 6, 7])
print(df, "\n\n", df1)
res2 = pd.concat([df, df1], axis=1, join='inner')
res2
Now we set join='outer' (the default) to take the union, which keeps every row label from both DataFrames and fills the gaps with NaN.
res2 = pd.concat([df, df1], axis=1, join='outer', sort=False)
res2
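The effect of the two join modes is easier to see on very small inputs. In this sketch (the frames `left` and `right` are made up for illustration) the row labels only partly overlap:

```python
import pandas as pd

left = pd.DataFrame({"a": [1, 2, 3]}, index=[0, 1, 2])
right = pd.DataFrame({"b": [10, 20, 30]}, index=[1, 2, 3])

# Intersection of row labels: only 1 and 2 appear in both
inner = pd.concat([left, right], axis=1, join="inner")

# Union of row labels: 0 through 3, with NaN where a frame has no row
outer = pd.concat([left, right], axis=1, join="outer")

print(inner)
print(outer)
```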
- Concatenating DataFrames by Ignoring Indexes
res = pd.concat([df, df1], ignore_index=True)
res
- Concatenating DataFrames with group keys
If we want to retain information about the DataFrame from which each row came, we can use the keys argument. This assigns a label to each group of rows based on the source DataFrame.
frames = [df, df1 ]
res = pd.concat(frames, keys=['x', 'y'])
res
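With keys set, the result carries a two-level index, so the rows from one source can be pulled back out by their key. A sketch on two tiny made-up frames (`first` and `second` are assumptions):

```python
import pandas as pd

first = pd.DataFrame({"val": [1, 2]})
second = pd.DataFrame({"val": [3, 4]})

combined = pd.concat([first, second], keys=["x", "y"])

# Select only the rows that came from the second frame
from_y = combined.loc["y"]

print(combined)
print(from_y)
```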
- Concatenating Mixed DataFrames and Series
We can also concatenate a mix of Series and DataFrames. A Series in the list is treated as a single-column DataFrame, and its name (set with the name argument) becomes the column name.
import pandas as pd
data1 = {'Name':['Jai', 'Princi', 'Gaurav', 'Anuj'],
'Age':[27, 24, 22, 32],
'Address':['Nagpur', 'Kanpur', 'Allahabad', 'Kannuaj'],
'Qualification':['Msc', 'MA', 'MCA', 'Phd']}
df = pd.DataFrame(data1,index=[0, 1, 2, 3])
s1 = pd.Series([1000, 2000, 3000, 4000], name='Salary')
print(df, "\n\n", s1)
res = pd.concat([df, s1], axis=1)
res
Merging DataFrame
Merging DataFrames in Pandas is similar to performing SQL joins. It is useful when we need to combine two DataFrames based on a common column or index. The merge() function provides flexibility for different types of joins.
- Merging DataFrames Using One Key
import pandas as pd
data1 = {'key': ['K0', 'K1', 'K2', 'K3'],
'Name':['Jai', 'Princi', 'Gaurav', 'Anuj'],
'Age':[27, 24, 22, 32],}
data2 = {'key': ['K0', 'K1', 'K2', 'K3'],
'Address':['Nagpur', 'Kanpur', 'Allahabad', 'Kannuaj'],
'Qualification':['Btech', 'B.A', 'Bcom', 'B.hons']}
df = pd.DataFrame(data1)
df1 = pd.DataFrame(data2)
print(df, "\n\n", df1)
res = pd.merge(df, df1, on='key')
res
- Merging DataFrames Using Multiple Keys
We can also merge DataFrames based on more than one column by passing a list of column names to the on argument.
import pandas as pd
data1 = {'key': ['K0', 'K1', 'K2', 'K3'],
'key1': ['K0', 'K1', 'K0', 'K1'],
'Name':['Jai', 'Princi', 'Gaurav', 'Anuj'],
'Age':[27, 24, 22, 32],}
data2 = {'key': ['K0', 'K1', 'K2', 'K3'],
'key1': ['K0', 'K0', 'K0', 'K0'],
'Address':['Nagpur', 'Kanpur', 'Allahabad', 'Kannuaj'],
'Qualification':['Btech', 'B.A', 'Bcom', 'B.hons']}
df = pd.DataFrame(data1)
df1 = pd.DataFrame(data2)
print(df, "\n\n", df1)
res1 = pd.merge(df, df1, on=['key', 'key1'])
res1
- Merging DataFrames Using the how Argument
res = pd.merge(df, df1, how='left', on=['key', 'key1'])
res
| Merge method | SQL join name | Description |
|---|---|---|
| left | LEFT OUTER JOIN | Use keys from left frame only |
| right | RIGHT OUTER JOIN | Use keys from right frame only |
| outer | FULL OUTER JOIN | Use union of keys from both frames |
| inner | INNER JOIN | Use intersection of keys from both frames |
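The four options can be compared side by side on two small frames whose keys only partly overlap. This is a sketch with made-up frames `left` and `right`; it records which keys survive each kind of merge.

```python
import pandas as pd

left = pd.DataFrame({"key": ["K0", "K1", "K2"],
                     "Name": ["Jai", "Princi", "Gaurav"]})
right = pd.DataFrame({"key": ["K1", "K2", "K3"],
                      "Address": ["Kanpur", "Allahabad", "Kannuaj"]})

keys_by_how = {}
for how in ["left", "right", "outer", "inner"]:
    merged = pd.merge(left, right, how=how, on="key")
    # Remember which keys are present in the merged result
    keys_by_how[how] = merged["key"].tolist()
    print(how, keys_by_how[how])
```

Rows that have no match on the other side get NaN in the columns that come from that side.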
Joining DataFrame
The .join() method in Pandas is used to combine columns of two DataFrames based on their indexes. It's a simple way of merging two DataFrames when the relationship between them is primarily based on their row indexes. It is used when we want to combine DataFrames along their indexes rather than specific columns.
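- Joining DataFrames Using .join()
If both DataFrames share the same index, we can use .join() to combine their columns. Overlapping column names must be told apart with the lsuffix and rsuffix arguments.
# df and df1 from the merge example both contain 'key' and 'key1',
# so suffixes are required to keep the overlapping columns apart
res = df.join(df1, lsuffix='_left', rsuffix='_right')
res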
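A sketch of an index-based join on frames without overlapping columns is shown below. The frames `names` and `addresses` are made up for this example; .join() performs a left join by default, and the how argument switches to other join types.

```python
import pandas as pd

names = pd.DataFrame({"Name": ["Jai", "Princi"]}, index=["K0", "K1"])
addresses = pd.DataFrame({"Address": ["Nagpur", "Kanpur"]}, index=["K0", "K2"])

# Default left join: keep every row label of `names`
joined = names.join(addresses)

# Inner join: keep only labels present in both indexes
inner = names.join(addresses, how="inner")

print(joined)
print(inner)
```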
Pivot Table
# importing pandas
import pandas as pd
# creating dataframe
df = pd.DataFrame({'Product': ['Carrots', 'Broccoli', 'Banana', 'Banana',
'Beans', 'Orange', 'Broccoli', 'Banana'],
'Category': ['Vegetable', 'Vegetable', 'Fruit', 'Fruit',
'Vegetable', 'Fruit', 'Vegetable', 'Fruit'],
'Quantity': [8, 5, 3, 4, 5, 9, 11, 8],
'Amount': [270, 239, 617, 384, 626, 610, 62, 90]})
pivot = df.pivot_table(index=['Product'],
values=['Amount'],
aggfunc='sum')
print(pivot)
pivot = df.pivot_table(index=['Category'],
values=['Amount'],
aggfunc='sum')
print(pivot)
pivot = df.pivot_table(index=['Product', 'Category'],
values=['Amount'], aggfunc='sum')
print(pivot)
# Several aggregations at once: pass a list of function names
pivot = df.pivot_table(index=['Category'], values=['Amount'],
aggfunc=['median', 'mean', 'min'])
print(pivot)
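As a cross-check on the per-category totals, the same numbers can be produced with groupby. This sketch rebuilds the sample data from above under new names (`sales`, `by_pivot` and `by_group` are introduced here) and suggests that pivot_table with aggfunc='sum' can be read as a group-and-sum presented as a table.

```python
import pandas as pd

sales = pd.DataFrame({
    "Product": ["Carrots", "Broccoli", "Banana", "Banana",
                "Beans", "Orange", "Broccoli", "Banana"],
    "Category": ["Vegetable", "Vegetable", "Fruit", "Fruit",
                 "Vegetable", "Fruit", "Vegetable", "Fruit"],
    "Amount": [270, 239, 617, 384, 626, 610, 62, 90],
})

# Total amount per category, computed two ways
by_pivot = sales.pivot_table(index="Category", values="Amount", aggfunc="sum")
by_group = sales.groupby("Category")["Amount"].sum()

print(by_pivot)
print(by_group)
```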
Pandas Series
import pandas as pd
import numpy as np
# creating simple array
data = np.array(['g', 'e', 'e', 'k', 's', 'f',
'o', 'r', 'g', 'e', 'e', 'k', 's'])
ser = pd.Series(data)
# retrieve the first element
print(ser[0])
Accessing First 5 Elements of Series
print(ser[:5])
Accessing Last 10 Elements of Series
print(ser[-10:])
Accessing First 10 Elements of Series with head()
ser.head(10)
Accessing a Single Element Using index Label
# Recreate the Series with custom integer labels 10 to 22,
# so that label 16 exists
data = np.array(['g', 'e', 'e', 'k', 's', 'f',
'o', 'r', 'g', 'e', 'e', 'k', 's'])
ser = pd.Series(data, index=[10, 11, 12, 13, 14,
15, 16, 17, 18, 19, 20, 21, 22])
print(ser[16])
Accessing Multiple Elements Using index Label
print(ser[[10, 11, 12, 13, 14]])
Access Multiple Elements by Providing Label of Index
ser = pd.Series(np.arange(3, 9), index=['a', 'b', 'c', 'd', 'e', 'f'])
print(ser[['a', 'd']])
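Because plain brackets can mean either a label or a position depending on the index, it can be clearer to spell out the intent with .loc[] (labels) and .iloc[] (positions). A short sketch on the same Series as above, with the variable renamed to `letters` for this example:

```python
import numpy as np
import pandas as pd

letters = pd.Series(np.arange(3, 9), index=['a', 'b', 'c', 'd', 'e', 'f'])

by_label = letters.loc['d']          # value stored under label 'd'
by_position = letters.iloc[0]        # first value, regardless of label
label_slice = letters.loc['b':'d']   # label slices include both endpoints

print(by_label, by_position)
print(label_slice)
```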