DEV Community

Rajasekaran Palraj


Data Analysis Pandas/Numpy Notes

Pandas (the name derives from "panel data" and is also a play on "Python data analysis") is an open-source software library designed for data manipulation and analysis.

  • Revolves around two primary Data structures: Series (1D) and DataFrame (2D)
  • Built on top of NumPy, efficiently manages large datasets, offering tools for data cleaning, transformation, and analysis.
  • Tools for working with time series data, including date range generation and frequency conversion. For example, we can convert date or time columns into pandas’ datetime type using pd.to_datetime(), or pass parse_dates (for example, a list of column names) to pd.read_csv() when loading a CSV file.
  • Seamlessly integrates with other Python libraries like NumPy, Matplotlib, and scikit-learn.
  • Provides methods like .dropna() and .fillna() to drop or fill missing values.
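The date conversion and missing-value methods mentioned in the list above can be sketched as follows. The sales table, its column names and its values are hypothetical and exist only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: dates stored as text, one amount missing
sales = pd.DataFrame({
    "order_date": ["2024-01-05", "2024-01-06", "2024-01-07"],
    "amount": [120.0, np.nan, 80.0],
})

# Convert the text column to pandas' datetime type
sales["order_date"] = pd.to_datetime(sales["order_date"])

# Option 1: drop every row that contains a missing value
dropped = sales.dropna()

# Option 2: keep all rows and fill the gap with a chosen default
filled = sales.fillna({"amount": 0.0})

print(sales.dtypes)
print(dropped)
print(filled)
```

Whether dropping or filling is the better choice depends on the analysis; filling with 0.0 here is only a placeholder decision.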

Here are various tasks that we can do using Pandas:

Data Cleaning, Merging and Joining: Clean and combine data from multiple sources, handling inconsistencies and duplicates.
Handling Missing Data: Manage missing values (NaN) in both floating and non-floating point data.
Column Insertion and Deletion: Easily add, remove or modify columns in a DataFrame.
Group By Operations: Use "split-apply-combine" to group and analyze data.
Data Visualization: Create visualizations with Matplotlib and Seaborn, integrated with Pandas.
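As a small illustration of the "split-apply-combine" idea from the list above, the sketch below groups a hypothetical scores table by degree and averages each group. The table and the variable names are assumptions made for this example.

```python
import pandas as pd

# Hypothetical data for illustration
scores = pd.DataFrame({
    "degree": ["MBA", "BCA", "M.Tech", "MBA"],
    "score": [90, 40, 80, 98],
})

# Split the rows by degree, apply mean() to each group, combine into one Series
mean_by_degree = scores.groupby("degree")["score"].mean()
print(mean_by_degree)
```

The result is a Series indexed by the group labels, so a single group's value can be read with mean_by_degree["MBA"].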

Pandas DataFrame:
A Pandas DataFrame is a two-dimensional table-like structure in Python where data is arranged in rows and columns. It’s one of the most commonly used tools for handling data and makes it easy to organize, analyze and manipulate data. It can store different types of data such as numbers, text and dates across its columns. The main parts of a DataFrame are:

Data: Actual values in the table.
Rows: Labels that identify each row.
Columns: Labels that define each data category.
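The three parts listed above can each be inspected directly. This is a minimal sketch on a hypothetical two-row table; the name df_parts is made up for the example.

```python
import pandas as pd

# Hypothetical two-row table with explicit row labels
df_parts = pd.DataFrame(
    {"name": ["aparna", "pankaj"], "score": [90, 40]},
    index=["Rollno1", "Rollno2"],
)

print(df_parts.to_numpy())  # Data: the actual values as a NumPy array
print(df_parts.index)       # Rows: the row labels
print(df_parts.columns)     # Columns: the column labels
```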

Creating Empty DataFrame:

import pandas as pd

df = pd.DataFrame()

print(df)

Creating a DataFrame from a List
A simple way to create a DataFrame is by using a single list. Pandas automatically assigns index values to the rows when you pass a list.

  • Each item in the list becomes a row.
  • The DataFrame consists of a single column, which Pandas labels 0 by default.

import pandas as pd

lst = ['Geeks', 'For', 'Geeks', 'is',
       'portal', 'for', 'Geeks']

df = pd.DataFrame(lst)
print(df)

Creating a DataFrame from a Dictionary of Lists

Here each dictionary key becomes a column name and each list holds that column's values, so all lists must have the same length. A related layout is a list of dictionaries, where each dictionary corresponds to a row. That form is common in web scraping and API data processing, since JSON responses often contain lists of dictionaries, and pd.DataFrame() accepts it as well.

import pandas as pd

# Dictionary of lists: each key is a column name, each list holds that column's values.
# Named `data` rather than `dict` to avoid shadowing Python's built-in dict type.
data = {'name': ["aparna", "pankaj", "sudhir", "Geeku"],
        'degree': ["MBA", "BCA", "M.Tech", "MBA"],
        'score': [90, 40, 80, 98]}

df = pd.DataFrame(data)

print(df)

Add index Explicitly:
df = pd.DataFrame(data, index=['Rollno1', 'Rollno2', 'Rollno3', 'Rollno4'])

Method #2: Using from_dict() function

df = pd.DataFrame.from_dict(data)
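For the row-oriented layout, where each dictionary is one row (as in a typical JSON response), a minimal sketch could look like this. The records list is hypothetical, and the third record deliberately omits 'score' to show how a missing key is handled.

```python
import pandas as pd

# Hypothetical records: one dictionary per row
records = [
    {"name": "aparna", "degree": "MBA", "score": 90},
    {"name": "pankaj", "degree": "BCA", "score": 40},
    {"name": "sudhir", "degree": "M.Tech"},  # 'score' left out on purpose
]

# Keys become column names; a key missing from a record becomes NaN in that row
df_records = pd.DataFrame(records)
print(df_records)
```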

Creating a DataFrame by Passing List Variables to a Dictionary

import pandas as pd

# dictionary of lists
name=['aparna', 'pankaj', 'sudhir', 'Geeku']
degree=['MBA','BCA', 'M.Tech', 'MBA']
score=[90, 40, 80, 98]
# Named `data` rather than `dict` to avoid shadowing Python's built-in dict type
data = {'name': name,
        'degree': degree,
        'score': score}

df = pd.DataFrame(data, index=['Rollno1', 'Rollno2', 'Rollno3', 'Rollno4'])
print(df)

Pandas DataFrame Index:

import pandas as pd

data = {'Name': ['John', 'Alice', 'Bob', 'Eve', 'Charlie'],
        'Age': [25, 30, 22, 35, 28],
        'Gender': ['Male', 'Female', 'Male', 'Female', 'Male'],
        'Salary': [50000, 55000, 40000, 70000, 48000]}

df = pd.DataFrame(data)
print(df.index)  # Accessing the index

Custom Index:

# Set 'Name' column as the index
df_with_index = df.set_index('Name')

Resetting the Index

# Reset the index back to the default integer index;
# 'Name' becomes a regular column again
df_reset = df_with_index.reset_index()
print(df_reset)

Indexing with loc

# Label lookup needs 'Name' as the index, so use df_with_index from above
row = df_with_index.loc['Alice']
print(row)

Changing the Index

# Set 'Age' as the new index
df_with_new_index = df.set_index('Age')
print(df_with_new_index)

Accessing Columns From DataFrame
Columns in a DataFrame can be accessed individually using bracket notation. Accessing a column retrieves that column as a Series, which can then be further manipulated.

# Access the 'Age' column
age_column = df['Age']
print(age_column)

Accessing Rows by Index

To access specific rows in a DataFrame, you can use iloc (for positional indexing) or loc (for label-based indexing). These methods allow you to retrieve rows based on their index positions or labels.

# Access the row at index 1 (second row)
second_row = df.iloc[1]
print(second_row)
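The difference between positional and label-based access is easier to see when the index is not the default 0, 1, 2. The sketch below uses a hypothetical table indexed by name; people and the other variable names are assumptions for this example.

```python
import pandas as pd

# Hypothetical table whose row labels are names, not integers
people = pd.DataFrame(
    {"Age": [25, 30, 22]},
    index=["John", "Alice", "Bob"],
)

by_position = people.iloc[1]     # second row, whatever its label is
by_label = people.loc["Alice"]   # the row labelled 'Alice'
print(by_position.equals(by_label))

# Slices differ: iloc excludes the end position, loc includes the end label
first_two_by_position = people.iloc[0:2]
first_two_by_label = people.loc["John":"Alice"]
print(first_two_by_position)
print(first_two_by_label)
```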




Accessing Multiple Rows or Columns
You can access multiple rows or columns at once by passing a list of column names or index positions. This is useful when you need to select several columns or rows for further analysis.

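# Access the first three rows and the 'Name' and 'Age' columns
# (.loc slices by label and includes the end label, so 0:2 returns rows 0, 1 and 2)
subset = df.loc[0:2, ['Name', 'Age']]
print(subset)

# Access a single value: the 'Gender' of the row labelled 2
print(df.loc[2, 'Gender'])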

Accessing Rows Based on Conditions
Pandas allows you to filter rows based on conditions, which can be very powerful for exploring subsets of data that meet specific criteria.

# Access rows where 'Age' is greater than 25

filtered_data = df[df['Age'] > 25]
print(filtered_data)
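Conditions can also be combined. The sketch below is one way to do it, using a reduced copy of the table from this section under the hypothetical name df_people. Each condition needs its own parentheses, and & (and) or | (or) replace Python's and/or keywords.

```python
import pandas as pd

# Reduced copy of the table used above; df_people is a name chosen for this example
df_people = pd.DataFrame({
    "Name": ["John", "Alice", "Bob", "Eve", "Charlie"],
    "Age": [25, 30, 22, 35, 28],
    "Gender": ["Male", "Female", "Male", "Female", "Male"],
})

# Rows where Age is above 24 AND Gender is 'Male'
older_males = df_people[(df_people["Age"] > 24) & (df_people["Gender"] == "Male")]
print(older_males)

# The same filter written with DataFrame.query()
older_males_query = df_people.query("Age > 24 and Gender == 'Male'")
print(older_males_query)
```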

Accessing Specific Cells with at and iat
If you need to access a specific cell, you can use the .at[] method for label-based indexing and the .iat[] method for integer position-based indexing. These are optimized for fast access to single values.

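# Access the 'Salary' of the row with label 2
salary_at_index_2 = df.at[2, 'Salary']
print(salary_at_index_2)

# Same cell by integer position: row 2, column 3 ('Salary' is the fourth column)
salary_by_position = df.iat[2, 3]
print(salary_by_position)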

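Indexing and Selecting Data with Pandas

import pandas as pd

# Load the NBA dataset used in this section (the same file as in the .loc[] example)
data = pd.read_csv("/content/nba.csv", index_col="Name")

first = data["Age"]
print(first.head(5))

  1. Selecting Multiple Columns
first = data[["Age", "College", "Salary"]]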

Indexing with .loc[ ]
The .loc[] indexer is used for label-based indexing. It allows us to access rows and columns by their labels. Unlike the plain indexing operator [], it can select subsets of rows and columns at the same time, which makes data retrieval more flexible.

  1. Selecting a Single Row by Label
import pandas as pd
data = pd.read_csv("/content/nba.csv", index_col="Name")

row = data.loc["Avery Bradley"]
print(row)
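Selecting rows and columns together with .loc[] can be sketched without the CSV file. The players table below, including its labels P1 to P3 and all values, is hypothetical and stands in for the NBA data.

```python
import pandas as pd

# Hypothetical stand-in for the NBA data, indexed by a player label
players = pd.DataFrame(
    {"Team": ["A", "B", "A"], "Age": [25, 31, 27], "Salary": [100, 200, 150]},
    index=["P1", "P2", "P3"],
)

# A list of row labels and a list of column labels, in one call
subset = players.loc[["P1", "P3"], ["Age", "Salary"]]
print(subset)

# A single row label and a single column label return one value
one_value = players.loc["P2", "Salary"]
print(one_value)
```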

  1. Concatenating DataFrames using pd.concat()
import pandas as pd

data1 = {'Name': ['Jai', 'Princi', 'Gaurav', 'Anuj'],
         'Age': [27, 24, 22, 32],
         'Address': ['Nagpur', 'Kanpur', 'Allahabad', 'Kannuaj'],
         'Qualification': ['Msc', 'MA', 'MCA', 'Phd']}

data2 = {'Name': ['Abhi', 'Ayushi', 'Dhiraj', 'Hitesh'],
         'Age': [17, 14, 12, 52],
         'Address': ['Nagpur', 'Kanpur', 'Allahabad', 'Kannuaj'],
         'Qualification': ['Btech', 'B.A', 'Bcom', 'B.hons']}

df = pd.DataFrame(data1, index=[0, 1, 2, 3])

df1 = pd.DataFrame(data2, index=[4, 5, 6, 7])

print(df, "\n\n", df1)

# Stack the two DataFrames row-wise
res = pd.concat([df, df1])
print(res)
  1. Concatenating DataFrames by Setting Logic on Axes
We can modify the concatenation by setting logic on the axes. Specifically, we can choose whether to take the union (join='outer') or the intersection (join='inner') of the labels on the axis that is not being concatenated. With axis=1, as in the example below, that is the row index.
import pandas as pd

data1 = {'Name': ['Jai', 'Princi', 'Gaurav', 'Anuj'],
         'Age': [27, 24, 22, 32],
         'Address': ['Nagpur', 'Kanpur', 'Allahabad', 'Kannuaj'],
         'Qualification': ['Msc', 'MA', 'MCA', 'Phd'],
         'Mobile No': [97, 91, 58, 76]}

data2 = {'Name': ['Gaurav', 'Anuj', 'Dhiraj', 'Hitesh'],
         'Age': [22, 32, 12, 52],
         'Address': ['Allahabad', 'Kannuaj', 'Allahabad', 'Kannuaj'],
         'Qualification': ['MCA', 'Phd', 'Bcom', 'B.hons'],
         'Salary': [1000, 2000, 3000, 4000]}

df = pd.DataFrame(data1, index=[0, 1, 2, 3])

df1 = pd.DataFrame(data2, index=[2, 3, 6, 7])

print(df, "\n\n", df1)


res2 = pd.concat([df, df1], axis=1, join='inner')

res2

Now we use join='outer' (the default) to take the union, which keeps every row label from both DataFrames and fills the missing cells with NaN.

res2 = pd.concat([df, df1], axis=1, join='outer', sort=False)

res2
  1. Concatenating DataFrames by Ignoring Indexes
res = pd.concat([df, df1], ignore_index=True)

res
  1. Concatenating DataFrames with Group Keys
If we want to retain information about the DataFrame from which each row came, we can use the keys argument. This assigns a label to each group of rows based on the source DataFrame.
frames = [df, df1 ]

res = pd.concat(frames, keys=['x', 'y'])
res
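The keys become the outer level of a two-level row index, which makes it possible to pull one source's rows back out later. A minimal sketch, assuming two small hypothetical frames named left and right:

```python
import pandas as pd

# Two small hypothetical frames with the same columns
left = pd.DataFrame({"Name": ["Jai", "Princi"], "Age": [27, 24]})
right = pd.DataFrame({"Name": ["Abhi", "Ayushi"], "Age": [17, 14]})

combined = pd.concat([left, right], keys=["x", "y"])
print(combined)

# The outer index level holds the keys, so one source can be selected by its key
print(combined.loc["y"])

# A (key, original label) tuple addresses a single row
print(combined.loc[("x", 1), "Age"])
```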
  1. Concatenating Mixed DataFrames and Series
We can also concatenate a mix of Series and DataFrames. A Series in the list is treated as a single column, and its name attribute (name='Salary' below) becomes the column label.
import pandas as pd

data1 = {'Name':['Jai', 'Princi', 'Gaurav', 'Anuj'],
        'Age':[27, 24, 22, 32],
        'Address':['Nagpur', 'Kanpur', 'Allahabad', 'Kannuaj'],
        'Qualification':['Msc', 'MA', 'MCA', 'Phd']}

df = pd.DataFrame(data1,index=[0, 1, 2, 3])

s1 = pd.Series([1000, 2000, 3000, 4000], name='Salary')

print(df, "\n\n", s1)
res = pd.concat([df, s1], axis=1)

res

Merging DataFrame
Merging DataFrames in Pandas is similar to performing SQL joins. It is useful when we need to combine two DataFrames based on a common column or index. The merge() function provides flexibility for different types of joins.

  1. Merging DataFrames Using One Key
import pandas as pd

data1 = {'key': ['K0', 'K1', 'K2', 'K3'],
         'Name':['Jai', 'Princi', 'Gaurav', 'Anuj'],
        'Age':[27, 24, 22, 32],}

data2 = {'key': ['K0', 'K1', 'K2', 'K3'],
         'Address':['Nagpur', 'Kanpur', 'Allahabad', 'Kannuaj'],
        'Qualification':['Btech', 'B.A', 'Bcom', 'B.hons']}

df = pd.DataFrame(data1)

df1 = pd.DataFrame(data2)


print(df, "\n\n", df1)
res = pd.merge(df, df1, on='key')

res
  1. Merging DataFrames Using Multiple Keys
We can also merge DataFrames based on more than one column by passing a list of column names to the on argument.
import pandas as pd

data1 = {'key': ['K0', 'K1', 'K2', 'K3'],
         'key1': ['K0', 'K1', 'K0', 'K1'],
         'Name': ['Jai', 'Princi', 'Gaurav', 'Anuj'],
         'Age': [27, 24, 22, 32]}

data2 = {'key': ['K0', 'K1', 'K2', 'K3'],
         'key1': ['K0', 'K0', 'K0', 'K0'],
         'Address': ['Nagpur', 'Kanpur', 'Allahabad', 'Kannuaj'],
         'Qualification': ['Btech', 'B.A', 'Bcom', 'B.hons']}

df = pd.DataFrame(data1)

df1 = pd.DataFrame(data2)

print(df, "\n\n", df1)
res1 = pd.merge(df, df1, on=['key', 'key1'])

res1
  1. Merging DataFrames Using the how Argument
res = pd.merge(df, df1, how='left', on=['key', 'key1'])

res

MERGE METHOD | JOIN NAME        | DESCRIPTION
left         | LEFT OUTER JOIN  | Use keys from left frame only
right        | RIGHT OUTER JOIN | Use keys from right frame only
outer        | FULL OUTER JOIN  | Use union of keys from both frames
inner        | INNER JOIN       | Use intersection of keys from both frames
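The four join types in the table are easiest to compare when the keys only partly overlap. The sketch below uses two hypothetical frames (left and right, sharing the keys K1 and K2) and counts the rows that each how value produces.

```python
import pandas as pd

# Hypothetical frames whose keys only partly overlap (K1 and K2 are shared)
left = pd.DataFrame({"key": ["K0", "K1", "K2"],
                     "Name": ["Jai", "Princi", "Gaurav"]})
right = pd.DataFrame({"key": ["K1", "K2", "K3"],
                      "Address": ["Kanpur", "Allahabad", "Kannuaj"]})

# Count the rows each join type produces
row_counts = {}
for how in ["left", "right", "outer", "inner"]:
    merged = pd.merge(left, right, on="key", how=how)
    row_counts[how] = len(merged)
print(row_counts)  # {'left': 3, 'right': 3, 'outer': 4, 'inner': 2}

# indicator=True adds a '_merge' column recording where each row came from
outer = pd.merge(left, right, on="key", how="outer", indicator=True)
print(outer[["key", "_merge"]])
```

In the outer result, K0 is marked left_only, K3 is marked right_only, and the shared keys are marked both.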

Joining DataFrame
The .join() method in Pandas combines the columns of two DataFrames by aligning them on their indexes rather than on specific columns. By default it performs a left join, keeping every row of the DataFrame it is called on.

  1. Joining DataFrames Using .join()
If both DataFrames share the same index, .join() combines their columns row by row.
# df and df1 from the merge example share the columns 'key' and 'key1',
# so suffixes are required to tell the overlapping columns apart
res = df.join(df1, lsuffix='_left', rsuffix='_right')

res
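When the two frames have no column names in common, no suffixes are needed. The sketch below shows index alignment and the how argument on two hypothetical frames, names and addresses, where one index label is missing on the right side.

```python
import pandas as pd

# Hypothetical frames indexed by the same kind of label; 'K1' is missing on the right
names = pd.DataFrame({"Name": ["Jai", "Princi", "Gaurav"]},
                     index=["K0", "K1", "K2"])
addresses = pd.DataFrame({"Address": ["Nagpur", "Allahabad"]},
                         index=["K0", "K2"])

# Default left join: every row of `names` is kept, unmatched cells become NaN
joined = names.join(addresses)
print(joined)

# Inner join: only index labels present in both frames survive
inner = names.join(addresses, how="inner")
print(inner)
```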

Pivot Table

# importing pandas
import pandas as pd

# creating dataframe
df = pd.DataFrame({'Product': ['Carrots', 'Broccoli', 'Banana', 'Banana',
                               'Beans', 'Orange', 'Broccoli', 'Banana'],
                   'Category': ['Vegetable', 'Vegetable', 'Fruit', 'Fruit',
                                'Vegetable', 'Fruit', 'Vegetable', 'Fruit'],
                   'Quantity': [8, 5, 3, 4, 5, 9, 11, 8],
                   'Amount': [270, 239, 617, 384, 626, 610, 62, 90]})
pivot = df.pivot_table(index=['Product'],
                       values=['Amount'],
                       aggfunc='sum')
print(pivot)
pivot = df.pivot_table(index=['Category'],
                       values=['Amount'],
                       aggfunc='sum')
print(pivot)
pivot = df.pivot_table(index=['Product', 'Category'],
                       values=['Amount'], aggfunc='sum')
print(pivot)

# Pass the aggregations as a list so they are applied in a fixed order
pivot = df.pivot_table(index=['Category'], values=['Amount'],
                       aggfunc=['median', 'mean', 'min'])
print(pivot)

Pandas Series

import pandas as pd
import numpy as np

# creating simple array
data = np.array(['g', 'e', 'e', 'k', 's', 'f',
                 'o', 'r', 'g', 'e', 'e', 'k', 's'])
ser = pd.Series(data)
# retrieve the first element
print(ser[0])

Accessing First 5 Elements of Series

print(ser[:5])

Accessing Last 10 Elements of Series

print(ser[-10:])


Accessing First 10 Elements of Series Using head()

print(ser.head(10))

Accessing a Single Element Using index Label

# Give the Series an explicit integer index (10 to 22) so that labels differ from positions
data = np.array(['g', 'e', 'e', 'k', 's', 'f',
                 'o', 'r', 'g', 'e', 'e', 'k', 's'])
ser = pd.Series(data, index=[10, 11, 12, 13, 14,
                             15, 16, 17, 18, 19, 20, 21, 22])
print(ser[16])

Accessing Multiple Elements Using index Label

print(ser[[10, 11, 12, 13, 14]])

Access Multiple Elements by Providing Label of Index


ser = pd.Series(np.arange(3, 9), index=['a', 'b', 'c', 'd', 'e', 'f'])
print(ser[['a', 'd']])
