Rajasekaran Palraj

Posted on Oct 11 • Edited on Nov 5

Data Analysis Pandas Notes

#datascience #learning #python

Pandas (the name derives from "panel data" and is also read as "Python Data Analysis") is an open-source library designed for data manipulation and analysis.

It revolves around two primary data structures: Series (1D) and DataFrame (2D).
It is built on top of NumPy and manages large datasets efficiently, offering tools for data cleaning, transformation, and analysis.
It includes tools for working with time series data, such as date range generation and frequency conversion. For example, we can convert date or time columns into pandas' datetime type using pd.to_datetime(), or pass the column names to the parse_dates argument when loading a CSV with pd.read_csv().
It integrates with other Python libraries such as NumPy, Matplotlib, and scikit-learn.
It provides methods such as .dropna() and .fillna() to handle missing values.
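
The time series tools mentioned above can be sketched as follows. The column names and values are made up for illustration, and resample() is used here as one way to do frequency conversion.

```python
import pandas as pd

# Hypothetical data: dates stored as plain strings, one reading per day
raw = pd.DataFrame({
    "date": ["2024-01-01", "2024-01-02", "2024-01-03", "2024-01-04"],
    "value": [10, 20, 30, 40],
})

# Convert the string column to pandas' datetime type
raw["date"] = pd.to_datetime(raw["date"])

# Date range generation: four consecutive days
days = pd.date_range(start="2024-01-01", periods=4, freq="D")

# Frequency conversion: aggregate the daily readings into 2-day bins
two_day_totals = raw.set_index("date")["value"].resample("2D").sum()
print(two_day_totals)
```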

Here are various tasks that we can do using Pandas:

Data Cleaning, Merging and Joining: Clean and combine data from multiple sources, handling inconsistencies and duplicates.
Handling Missing Data: Manage missing values (NaN) in both floating and non-floating point data.
Column Insertion and Deletion: Easily add, remove or modify columns in a DataFrame.
Group By Operations: Use "split-apply-combine" to group and analyze data.
Data Visualization: Create visualizations with Matplotlib and Seaborn, integrated with Pandas.
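
As a minimal sketch of column insertion, column deletion and "split-apply-combine", assuming a small made-up sales table (the names sales, region, units and price are hypothetical):

```python
import pandas as pd

# Hypothetical sales data used only for illustration
sales = pd.DataFrame({
    "region": ["North", "South", "North", "South"],
    "units": [10, 5, 7, 3],
    "price": [2.0, 4.0, 2.0, 4.0],
})

# Column insertion: derive a new column from existing ones
sales["revenue"] = sales["units"] * sales["price"]

# Column deletion: drop() returns a new DataFrame without 'price'
sales = sales.drop(columns=["price"])

# Split by region, apply sum to each group, combine into one result
totals = sales.groupby("region")["revenue"].sum()
print(totals)
```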

Pandas Dataframe:
A Pandas DataFrame is a two-dimensional table-like structure in Python where data is arranged in rows and columns. It’s one of the most commonly used tools for handling data and makes it easy to organize, analyze and manipulate data. It can store different types of data such as numbers, text and dates across its columns. The main parts of a DataFrame are:

Data: Actual values in the table.
Rows: Labels that identify each row.
Columns: Labels that define each data category.
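
The three parts can be inspected directly on any DataFrame. A small sketch with made-up values:

```python
import pandas as pd

# Hypothetical two-row table with explicit row labels
people = pd.DataFrame(
    {"name": ["Ann", "Ben"], "age": [31, 45]},
    index=["r1", "r2"],
)

print(people.to_numpy())   # data: the underlying values
print(people.index)        # row labels
print(people.columns)      # column labels
```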

Creating Empty DataFrame:

import pandas as pd

df = pd.DataFrame()

print(df)

Creating a DataFrame from a List
A simple way to create a DataFrame is by using a single list. Pandas automatically assigns index values to the rows when you pass a list.

Each item in the list becomes a row.
The DataFrame consists of a single unnamed column.

import pandas as pd

lst = ['Geeks', 'For', 'Geeks', 'is', 
            'portal', 'for', 'Geeks']

df = pd.DataFrame(lst)
print(df)

Creating a DataFrame from a Dictionary of Lists

Each key in the dictionary becomes a column name and each list holds that column's values. The related row-oriented form, a list of dictionaries in which each dictionary corresponds to a row, is common in web scraping and API data processing, since JSON responses often contain lists of dictionaries.

import pandas as pd

# Avoid naming the variable 'dict', which would shadow Python's built-in type
data = {'name': ["aparna", "pankaj", "sudhir", "Geeku"],
        'degree': ["MBA", "BCA", "M.Tech", "MBA"],
        'score': [90, 40, 80, 98]}

df = pd.DataFrame(data)

print(df)

Add index Explicitly:
df = pd.DataFrame(data, index=['Rollno1', 'Rollno2', 'Rollno3', 'Rollno4'])

Method #2: Using from_dict() function

df = pd.DataFrame.from_dict(data)
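
A list of dictionaries, where each dictionary is one row, can be passed to pd.DataFrame() in the same way. The sketch below uses made-up records shaped like a parsed JSON response; a key that is missing from a record shows up as NaN.

```python
import pandas as pd

# Hypothetical records, shaped like a parsed JSON response
records = [
    {"name": "aparna", "degree": "MBA", "score": 90},
    {"name": "pankaj", "degree": "BCA", "score": 40},
    {"name": "sudhir", "degree": "M.Tech"},  # 'score' is missing here
]

# Each dictionary becomes one row; keys become column names
df_records = pd.DataFrame(records)
print(df_records)
```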

Creating a DataFrame by Passing List Variables to a Dictionary

import pandas as pd

# dictionary of lists
name = ['aparna', 'pankaj', 'sudhir', 'Geeku']
degree = ['MBA', 'BCA', 'M.Tech', 'MBA']
score = [90, 40, 80, 98]
data = {'name': name,
        'degree': degree,
        'score': score}

df = pd.DataFrame(data, index=['Rollno1', 'Rollno2', 'Rollno3', 'Rollno4'])

Pandas Dataframe Index:

import pandas as pd

data = {'Name': ['John', 'Alice', 'Bob', 'Eve', 'Charlie'],
        'Age': [25, 30, 22, 35, 28],
        'Gender': ['Male', 'Female', 'Male', 'Female', 'Male'],
        'Salary': [50000, 55000, 40000, 70000, 48000]}

df = pd.DataFrame(data)
print(df.index)  # Accessing the index

Custom Index:

# Set 'Name' column as the index
df_with_index = df.set_index('Name')

Resetting the Index

# Reset the index back to the default integer index
# (call it on df_with_index; 'Name' becomes a regular column again)
df_reset = df_with_index.reset_index()
print(df_reset)

Indexing with loc

# Label lookup needs 'Name' as the index, so use df_with_index rather than df
row = df_with_index.loc['Alice']
print(row)

Changing the Index

# Set 'Age' as the new index
df_with_new_index = df.set_index('Age')
print(df_with_new_index)

Accessing Columns From DataFrame
Columns in a DataFrame can be accessed individually using bracket notation. Accessing a column returns it as a Series, which can then be manipulated further.

# Access the 'Age' column
age_column = df['Age']
print(age_column)

Accessing Rows by Index

To access specific rows in a DataFrame, you can use iloc (for positional indexing) or loc (for label-based indexing). These methods allow you to retrieve rows based on their index positions or labels.

# Access the row at index 1 (second row)
second_row = df.iloc[1]
print(second_row)
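
The difference between the two only becomes visible when the index labels are not 0, 1, 2 and so on. A minimal sketch, assuming a hypothetical frame df_demo with labels 10, 20 and 30:

```python
import pandas as pd

# Hypothetical frame whose index labels differ from the row positions
df_demo = pd.DataFrame(
    {"Name": ["John", "Alice", "Bob"], "Age": [25, 30, 22]},
    index=[10, 20, 30],
)

by_position = df_demo.iloc[1]   # second row, whatever its label is
by_label = df_demo.loc[20]      # the row whose index label is 20

print(by_position["Name"], by_label["Name"])
```

Both lookups return the same row here, because label 20 sits at position 1.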

Accessing Multiple Rows or Columns
You can access multiple rows or columns at once by passing a list of column names or index positions. This is useful when you need to select several columns or rows for further analysis.

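# Access the first three rows and the 'Name' and 'Age' columns
# (label slices with loc include the end label, so 0:2 returns three rows)
subset = df.loc[0:2, ['Name', 'Age']]
print(subset)

# Access a single value: the 'Gender' of the row labelled 2
print(df.loc[2, 'Gender'])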

Accessing Rows Based on Conditions
Pandas allows you to filter rows based on conditions, which can be very powerful for exploring subsets of data that meet specific criteria.

# Access rows where 'Age' is greater than 25

filtered_data = df[df['Age'] > 25]
print(filtered_data)
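
Several conditions can be combined in one filter. The sketch below reuses the names, ages and genders from the example data; the variable names are chosen here for illustration.

```python
import pandas as pd

people = pd.DataFrame({
    "Name": ["John", "Alice", "Bob", "Eve", "Charlie"],
    "Age": [25, 30, 22, 35, 28],
    "Gender": ["Male", "Female", "Male", "Female", "Male"],
})

# Each condition needs its own parentheses; use & for AND, | for OR
older_men = people[(people["Age"] > 24) & (people["Gender"] == "Male")]
young_or_female = people[(people["Age"] < 25) | (people["Gender"] == "Female")]

print(older_men["Name"].tolist())
print(young_or_female["Name"].tolist())
```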

Accessing Specific Cells with at and iat
If you need to access a specific cell, you can use the .at[] method for label-based indexing and the .iat[] method for integer position-based indexing. These are optimized for fast access to single values.

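# Access the 'Salary' of the row with label 2
salary_at_index_2 = df.at[2, 'Salary']
print(salary_at_index_2)

# iat takes integer positions: row position 2, column position 3 ('Salary')
salary_by_position = df.iat[2, 3]
print(salary_by_position)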

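Indexing and Selecting Data with Pandas

import pandas as pd

# 'data' must be a DataFrame here, so load the dataset first
data = pd.read_csv("/content/nba.csv", index_col="Name")

# Selecting a single column returns a Series
first = data["Age"]
print(first.head(5))

Selecting Multiple Columns

# Passing a list of column names returns a DataFrame
first = data[["Age", "College", "Salary"]]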

Indexing with .loc[ ]
The .loc[] indexer is used for label-based indexing. It allows us to access rows and columns by their labels. Unlike the plain indexing operator, it can select subsets of rows and columns in a single call.

Selecting a Single Row by Label

import pandas as pd
data = pd.read_csv("/content/nba.csv", index_col="Name")

row = data.loc["Avery Bradley"]
print(row)
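
Selecting rows and columns in one call can be sketched as follows. Because the CSV file is not included here, the example uses a small made-up stand-in named players with hypothetical labels.

```python
import pandas as pd

# Small made-up stand-in for the CSV, indexed by player name
players = pd.DataFrame(
    {"Team": ["A", "B", "A"], "Age": [25, 31, 22], "Salary": [100, 250, 80]},
    index=["P1", "P2", "P3"],
)

# Rows and columns selected in a single call
subset = players.loc[["P1", "P3"], ["Team", "Salary"]]

# A label slice includes both endpoints, unlike a positional slice
sliced = players.loc["P1":"P2", "Age"]

print(subset)
print(sliced)
```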

Concatenating DataFrame using .concat()

import pandas as pd

data1 = {'Name': ['Jai', 'Princi', 'Gaurav', 'Anuj'],
         'Age': [27, 24, 22, 32],
         'Address': ['Nagpur', 'Kanpur', 'Allahabad', 'Kannuaj'],
         'Qualification': ['Msc', 'MA', 'MCA', 'Phd']}

data2 = {'Name': ['Abhi', 'Ayushi', 'Dhiraj', 'Hitesh'],
         'Age': [17, 14, 12, 52],
         'Address': ['Nagpur', 'Kanpur', 'Allahabad', 'Kannuaj'],
         'Qualification': ['Btech', 'B.A', 'Bcom', 'B.hons']}

df = pd.DataFrame(data1, index=[0, 1, 2, 3])

df1 = pd.DataFrame(data2, index=[4, 5, 6, 7])

print(df, "\n\n", df1)

# Stack the two DataFrames vertically (row-wise)
res = pd.concat([df, df1])
print(res)

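Concatenating DataFrames by Setting Logic on Axes

We can modify the concatenation by setting the join logic. The join argument controls how the axis that is not being concatenated is aligned: join='outer' takes the union of its labels and join='inner' takes the intersection. In the example below, axis=1 places the DataFrames side by side, so join applies to the row index.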

import pandas as pd

data1 = {'Name': ['Jai', 'Princi', 'Gaurav', 'Anuj'],
         'Age': [27, 24, 22, 32],
         'Address': ['Nagpur', 'Kanpur', 'Allahabad', 'Kannuaj'],
         'Qualification': ['Msc', 'MA', 'MCA', 'Phd'],
         'Mobile No': [97, 91, 58, 76]}

data2 = {'Name': ['Gaurav', 'Anuj', 'Dhiraj', 'Hitesh'],
         'Age': [22, 32, 12, 52],
         'Address': ['Allahabad', 'Kannuaj', 'Allahabad', 'Kannuaj'],
         'Qualification': ['MCA', 'Phd', 'Bcom', 'B.hons'],
         'Salary': [1000, 2000, 3000, 4000]}

df = pd.DataFrame(data1, index=[0, 1, 2, 3])

df1 = pd.DataFrame(data2, index=[2, 3, 6, 7])

print(df, "\n\n", df1)


res2 = pd.concat([df, df1], axis=1, join='inner')

res2

Now we set join='outer' (the default), which keeps every row label from both DataFrames and fills the gaps with NaN.

res2 = pd.concat([df, df1], axis=1, join='outer', sort=False)

res2
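
To make the effect of join easier to see, here is a reduced sketch with two single-column frames (left and right are hypothetical names) whose indexes only partly overlap:

```python
import pandas as pd

left = pd.DataFrame({"A": [1, 2, 3]}, index=[0, 1, 2])
right = pd.DataFrame({"B": [10, 20, 30]}, index=[1, 2, 3])

# With axis=1 the frames sit side by side and join decides which rows survive
inner = pd.concat([left, right], axis=1, join="inner")
outer = pd.concat([left, right], axis=1, join="outer")

print(inner)   # only the row labels found in both frames
print(outer)   # every row label, with NaN where a frame has no such row
```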

Concatenating DataFrames by Ignoring Indexes

res = pd.concat([df, df1], ignore_index=True)

res

Concatenating DataFrame with group keys

If we want to retain information about the DataFrame from which each row came, we can use the keys argument. This assigns a label to each group of rows based on the source DataFrame.

frames = [df, df1 ]

res = pd.concat(frames, keys=['x', 'y'])
res
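
The keys end up as the outer level of a MultiIndex, which can then be used to pull one source frame back out. A small sketch with two made-up frames:

```python
import pandas as pd

first = pd.DataFrame({"Name": ["Jai", "Princi"]}, index=[0, 1])
second = pd.DataFrame({"Name": ["Abhi", "Ayushi"]}, index=[0, 1])

combined = pd.concat([first, second], keys=["x", "y"])

# The keys form the outer level of a MultiIndex
print(combined.index.tolist())

# Select only the rows that came from the second frame
print(combined.loc["y"])
```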

Concatenating Mixed DataFrames and Series

We can also concatenate a mix of Series and DataFrames. A Series in the list is treated as a single-column DataFrame, and its name becomes the column name.

import pandas as pd

data1 = {'Name':['Jai', 'Princi', 'Gaurav', 'Anuj'],
        'Age':[27, 24, 22, 32],
        'Address':['Nagpur', 'Kanpur', 'Allahabad', 'Kannuaj'],
        'Qualification':['Msc', 'MA', 'MCA', 'Phd']}

df = pd.DataFrame(data1,index=[0, 1, 2, 3])

s1 = pd.Series([1000, 2000, 3000, 4000], name='Salary')

print(df, "\n\n", s1)

res = pd.concat([df, s1], axis=1)

res

Merging DataFrame
Merging DataFrames in Pandas is similar to performing SQL joins. It is useful when we need to combine two DataFrames based on a common column or index. The merge() function provides flexibility for different types of joins.

Merging DataFrames Using One Key

import pandas as pd

data1 = {'key': ['K0', 'K1', 'K2', 'K3'],
         'Name':['Jai', 'Princi', 'Gaurav', 'Anuj'],
        'Age':[27, 24, 22, 32],}

data2 = {'key': ['K0', 'K1', 'K2', 'K3'],
         'Address':['Nagpur', 'Kanpur', 'Allahabad', 'Kannuaj'],
        'Qualification':['Btech', 'B.A', 'Bcom', 'B.hons']}

df = pd.DataFrame(data1)

df1 = pd.DataFrame(data2)


print(df, "\n\n", df1)

res = pd.merge(df, df1, on='key')

res

Merging DataFrames Using Multiple Keys

We can also merge DataFrames based on more than one column by passing a list of column names to the on argument.

import pandas as pd

data1 = {'key': ['K0', 'K1', 'K2', 'K3'],
         'key1': ['K0', 'K1', 'K0', 'K1'],
         'Name':['Jai', 'Princi', 'Gaurav', 'Anuj'],
        'Age':[27, 24, 22, 32],}

data2 = {'key': ['K0', 'K1', 'K2', 'K3'],
         'key1': ['K0', 'K0', 'K0', 'K0'],
         'Address':['Nagpur', 'Kanpur', 'Allahabad', 'Kannuaj'],
        'Qualification':['Btech', 'B.A', 'Bcom', 'B.hons']}

df = pd.DataFrame(data1)

df1 = pd.DataFrame(data2)


print(df, "\n\n", df1)

res1 = pd.merge(df, df1, on=['key', 'key1'])

res1

Merging DataFrames Using the how Argument

res = pd.merge(df, df1, how='left', on=['key', 'key1'])

res

MERGE METHOD | JOIN NAME        | DESCRIPTION
left         | LEFT OUTER JOIN  | Use keys from left frame only
right        | RIGHT OUTER JOIN | Use keys from right frame only
outer        | FULL OUTER JOIN  | Use union of keys from both frames
inner        | INNER JOIN       | Use intersection of keys from both frames
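
The four join types can be compared side by side. This sketch assumes two small frames (left and right are hypothetical names) whose keys only partly overlap, and collects the keys that survive each merge.

```python
import pandas as pd

left = pd.DataFrame({"key": ["K0", "K1", "K2"],
                     "Name": ["Jai", "Princi", "Gaurav"]})
right = pd.DataFrame({"key": ["K1", "K2", "K3"],
                      "Address": ["Kanpur", "Allahabad", "Kannuaj"]})

# Collect the keys that survive each join type
keys_by_how = {}
for how in ["left", "right", "outer", "inner"]:
    merged = pd.merge(left, right, on="key", how=how)
    keys_by_how[how] = merged["key"].tolist()
    print(how, keys_by_how[how])

# indicator=True adds a '_merge' column naming the source of each row
print(pd.merge(left, right, on="key", how="outer", indicator=True))
```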

Joining DataFrame
The .join() method combines the columns of two DataFrames based on their indexes rather than on specific columns. It is a simple way of merging when the relationship between the DataFrames lies in their row labels.

Joining DataFrames Using .join()

If both DataFrames share index labels, .join() combines their columns. It performs a left join by default, and column names that appear in both DataFrames must be disambiguated with lsuffix and rsuffix.

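# df and df1 both contain 'key' and 'key1', so the overlapping names need suffixes
res = df.join(df1, lsuffix='_left', rsuffix='_right')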

res
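
A self-contained sketch of an index-based join, assuming two made-up frames (names and cities) with no overlapping column names:

```python
import pandas as pd

names = pd.DataFrame({"Name": ["Jai", "Princi", "Gaurav"]},
                     index=["K0", "K1", "K2"])
cities = pd.DataFrame({"Address": ["Nagpur", "Allahabad"]},
                      index=["K0", "K2"])

# join() aligns on the index and is a left join by default
joined = names.join(cities)
print(joined)

# how='inner' keeps only index labels present in both frames
inner_joined = names.join(cities, how="inner")
print(inner_joined)
```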

Pivot Table

# importing pandas
import pandas as pd

# creating dataframe
df = pd.DataFrame({'Product': ['Carrots', 'Broccoli', 'Banana', 'Banana',
                               'Beans', 'Orange', 'Broccoli', 'Banana'],
                   'Category': ['Vegetable', 'Vegetable', 'Fruit', 'Fruit',
                                'Vegetable', 'Fruit', 'Vegetable', 'Fruit'],
                   'Quantity': [8, 5, 3, 4, 5, 9, 11, 8],
                   'Amount': [270, 239, 617, 384, 626, 610, 62, 90]})

pivot = df.pivot_table(index=['Product'],
                       values=['Amount'],
                       aggfunc='sum')
print(pivot)

pivot = df.pivot_table(index=['Category'],
                       values=['Amount'],
                       aggfunc='sum')
print(pivot)

pivot = df.pivot_table(index=['Product', 'Category'],
                       values=['Amount'], aggfunc='sum')
print(pivot)

# Pass several aggregations as a list so the column order is predictable
pivot = df.pivot_table(index=['Category'], values=['Amount'],
                       aggfunc=['median', 'mean', 'min'])
print(pivot)
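
A pivot table with a single index and aggfunc='sum' produces the same totals as a groupby. The sketch below checks this on the first four rows of the example data; the variable names are chosen here for illustration.

```python
import pandas as pd

df_sales = pd.DataFrame({
    "Category": ["Vegetable", "Vegetable", "Fruit", "Fruit"],
    "Amount": [270, 239, 617, 384],
})

# Same aggregation expressed two ways
by_pivot = df_sales.pivot_table(index="Category", values="Amount", aggfunc="sum")
by_group = df_sales.groupby("Category")[["Amount"]].sum()

print(by_pivot)
print(by_group)
```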

Pandas Series

import pandas as pd
import numpy as np

# creating simple array
data = np.array(['g', 'e', 'e', 'k', 's', 'f',
                 'o', 'r', 'g', 'e', 'e', 'k', 's'])
ser = pd.Series(data)
# retrieve the first element
print(ser[0])

Accessing First 5 Elements of Series

print(ser[:5])

Accessing Last 10 Elements of Series

print(ser[-10:])

Accessing First 10 Elements of Series Using head()

print(ser.head(10))

Accessing a Single Element Using index Label

# Give the Series explicit labels 10..22 so that label 16 exists
data = np.array(['g', 'e', 'e', 'k', 's', 'f',
                 'o', 'r', 'g', 'e', 'e', 'k', 's'])
ser = pd.Series(data, index=[10, 11, 12, 13, 14,
                             15, 16, 17, 18, 19, 20, 21, 22])
print(ser[16])

Accessing Multiple Elements Using index Label

print(ser[[10, 11, 12, 13, 14]])

Access Multiple Elements by Providing Label of Index


ser = pd.Series(np.arange(3, 9), index=['a', 'b', 'c', 'd', 'e', 'f'])
print(ser[['a', 'd']])

Working with Missing Data in Pandas

Using isnull()

isnull() returns a DataFrame of Boolean values in which True marks missing data (NaN). It is a simple first step when we want to find, and later fill, the missing data in a dataset.

Example 1: Finding Missing Values in a DataFrame

import pandas as pd
import numpy as np

d = {'First Score': [100, 90, np.nan, 95],
        'Second Score': [30, 45, 56, np.nan],
        'Third Score': [np.nan, 40, 80, 98]}
df = pd.DataFrame(d)

mv = df.isnull()

print(mv)
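
The Boolean mask is usually summarized rather than printed in full. A minimal sketch using the same scores (the variable names are chosen here for illustration):

```python
import pandas as pd
import numpy as np

scores = pd.DataFrame({
    "First Score": [100, 90, np.nan, 95],
    "Second Score": [30, 45, 56, np.nan],
    "Third Score": [np.nan, 40, 80, 98],
})

# True counts as 1, so summing the mask counts missing values per column
missing_per_column = scores.isnull().sum()
print(missing_per_column)

# Rows that contain at least one missing value
rows_with_missing = scores[scores.isnull().any(axis=1)]
print(rows_with_missing)
```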

Example 2: Filtering Data Based on Missing Values


import pandas as pd
d = pd.read_csv("/content/employees.csv")

bool_series = pd.isnull(d["Gender"])
missing_gender_data = d[bool_series]
print(missing_gender_data)

Filling Missing Values in Pandas

The following functions replace missing values with a specified value, or use interpolation to estimate them.

Using fillna()

d = {'First Score': [100, 90, np.nan, 95],
        'Second Score': [30, 45, 56, np.nan],
        'Third Score': [np.nan, 40, 80, 98]}
df = pd.DataFrame(d)

df.fillna(0)

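Example 2: Fill with Previous Value (Forward Fill)

# fillna(method='pad') is deprecated in recent pandas versions; ffill() is the equivalent
df.ffill()

Example 3: Fill with Next Value (Backward Fill)

# bfill() replaces fillna(method='bfill')
df.bfill()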

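Example 4: Fill NaN Values with 'No Gender'

# Reload the employees data, since 'd' was reused for the scores dictionary above
d = pd.read_csv("/content/employees.csv")

# Assign the result back instead of using inplace=True on a selected column
d["Gender"] = d["Gender"].fillna('No Gender')
d[10:25]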

Using replace()

# Replace every NaN in the scores DataFrame with -99
df.replace(to_replace=np.nan, value=-99)

Using interpolate()

df.interpolate(method='linear', limit_direction='forward')
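
What linear interpolation does is easiest to see on a short Series. A minimal sketch with made-up numbers:

```python
import pandas as pd
import numpy as np

s = pd.Series([10.0, np.nan, np.nan, 40.0])

# Linear interpolation spaces the missing values evenly between their neighbours
filled = s.interpolate(method="linear")
print(filled.tolist())   # [10.0, 20.0, 30.0, 40.0]
```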

Dropping Rows with At Least One Null Value

import pandas as pd
import numpy as np

# Named 'data' rather than 'dict' to avoid shadowing the built-in type
data = {'First Score': [100, 90, np.nan, 95],
        'Second Score': [30, np.nan, 45, 56],
        'Third Score': [52, 40, 80, 98],
        'Fourth Score': [np.nan, np.nan, np.nan, 65]}
df = pd.DataFrame(data)

df.dropna()
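
dropna() accepts a few arguments that change what gets dropped. The sketch below applies some of them to the same scores; the variable names are chosen here for illustration.

```python
import pandas as pd
import numpy as np

scores = pd.DataFrame({
    "First Score": [100, 90, np.nan, 95],
    "Second Score": [30, np.nan, 45, 56],
    "Third Score": [52, 40, 80, 98],
    "Fourth Score": [np.nan, np.nan, np.nan, 65],
})

any_null = scores.dropna()                        # drop rows with any NaN
all_null = scores.dropna(how="all")               # drop rows where every value is NaN
null_cols = scores.dropna(axis=1)                 # drop columns with any NaN
by_subset = scores.dropna(subset=["First Score"]) # only check one column

print(len(any_null), len(all_null), list(null_cols.columns), len(by_subset))
```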