If you're new to Python and want to work with data—like spreadsheets, tables, or numbers—then Pandas and NumPy are your best friends. This guide will walk you through the basics of data manipulation using these two powerful libraries, with simple explanations and examples.
📦 Installing Pandas and NumPy
Before you start, install the libraries using pip:
pip install pandas numpy
Or using conda (recommended for Anaconda users):
conda install pandas numpy
🔢 What is NumPy?
NumPy stands for Numerical Python. It helps you work with numbers and arrays efficiently.
✅ Creating Arrays
import numpy as np
# Create a simple array
arr = np.array([1, 2, 3, 4, 5])
print(arr)
Output:
[1 2 3 4 5]
📝 Explanation: np.array() turns a Python list into a NumPy array. The array stores its values in one block of memory with a single data type, so math on the whole array runs much faster than looping over a list.
✅ Array Operations
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
# Addition
print(a + b)
# Multiplication
print(a * b)
Output:
[5 7 9]
[ 4 10 18]
📝 Explanation: NumPy performs element-wise operations. It adds or multiplies each pair of elements from the arrays.
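Element-wise math also works between an array and a single number. Here is a minimal sketch, assuming the same array `a` as above; the names `doubled` and `shifted` are only for illustration:

```python
import numpy as np

a = np.array([1, 2, 3])

# A single number is applied to every element (NumPy calls this broadcasting)
doubled = a * 2
shifted = a + 10

print(doubled)  # [2 4 6]
print(shifted)  # [11 12 13]
```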
✅ Filtering with Conditions
data = np.array([10, 20, 30, 40, 50])
filtered = data[data > 30]
print(filtered)
Output:
[40 50]
📝 Explanation: This filters the array to show only values greater than 30.
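It can help to see what happens behind the filter. The sketch below (variable names `mask` and `middle` are my own) prints the True/False mask that the comparison produces and then combines two conditions:

```python
import numpy as np

data = np.array([10, 20, 30, 40, 50])

# The comparison on its own produces one True/False value per element
mask = data > 30
print(mask.tolist())  # [False, False, False, True, True]

# Combine conditions with & (and) or | (or); each condition needs parentheses
middle = data[(data >= 20) & (data <= 40)]
print(middle)  # [20 30 40]
```

Note that Python's `and` and `or` keywords do not work on arrays here; use `&` and `|` instead.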
📊 What is Pandas?
Pandas is a library for working with tabular data—like rows and columns in Excel.
✅ Creating a DataFrame
import pandas as pd
data = {
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 22],
'City': ['New York', 'San Francisco', 'Chicago']
}
df = pd.DataFrame(data)
print(df)
Output:
Name Age City
0 Alice 25 New York
1 Bob 30 San Francisco
2 Charlie 22 Chicago
📝 Explanation: A DataFrame is like a table. Each column has a name, and each row has an index.
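Once you have a DataFrame, a few attributes tell you what it contains. A small sketch using the same data as above:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 22],
    'City': ['New York', 'San Francisco', 'Chicago'],
})

print(df.shape)          # (3, 3) -> 3 rows, 3 columns
print(list(df.columns))  # ['Name', 'Age', 'City']
print(list(df.index))    # [0, 1, 2]
```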
✅ Selecting Data
# Select a column
print(df['Name'])
# Select a row by its index label
print(df.loc[1])
Output:
0 Alice
1 Bob
2 Charlie
Name: Name, dtype: object
Name Bob
Age 30
City San Francisco
Name: 1, dtype: object
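In the example above the index labels happen to be 0, 1, 2, so labels and positions look the same. The sketch below uses made-up letter labels (an assumption for illustration only) to show how `loc` (by label) differs from `iloc` (by position), and how to pick several columns at once:

```python
import pandas as pd

# Hypothetical letter labels make the label/position difference visible
df = pd.DataFrame(
    {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 22]},
    index=['a', 'b', 'c'],
)

by_label = df.loc['b', 'Name']   # the row labelled 'b'
by_position = df.iloc[1, 0]      # second row, first column
print(by_label, by_position)     # Bob Bob

# A list of column names returns a smaller DataFrame
subset = df[['Name', 'Age']]
print(subset.shape)  # (3, 2)
```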
✅ Filtering Rows
# Show people older than 23
print(df[df['Age'] > 23])
Output:
Name Age City
0 Alice 25 New York
1 Bob 30 San Francisco
✅ Adding a Column
df['Country'] = ['USA', 'USA', 'USA']
print(df)
Output:
Name Age City Country
0 Alice 25 New York USA
1 Bob 30 San Francisco USA
2 Charlie 22 Chicago USA
✅ Summary Statistics
print(df.describe())
Output:
Age
count 3.000000
mean 25.666667
std 4.041452
min 22.000000
25% 23.500000
50% 25.000000
75% 27.500000
max 30.000000
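📝 Explanation: describe() summarizes every numeric column with its count, mean, standard deviation, min, max, and quartiles. Name, City, and Country are left out because they contain text.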
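If you only need one number, each statistic is also available as its own method. A quick sketch on the same three ages (the Series name `ages` is my own):

```python
import pandas as pd

ages = pd.Series([25, 30, 22], name='Age')

print(round(ages.mean(), 2))   # 25.67
print(ages.median())           # 25.0
print(ages.min(), ages.max())  # 22 30
```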
✅ Grouping Data
grouped = df.groupby('City')['Age'].mean()
print(grouped)
Output:
City
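Chicago 22.0
New York 25.0
San Francisco 30.0
Name: Age, dtype: float64
📝 Explanation: groupby() splits the rows by city and mean() averages the ages in each group. Each city has only one person here, so the average equals that person's age. The result is a float even though the ages are integers.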
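Grouping gets more interesting when a group holds more than one row. The sketch below uses made-up data (the `people` table is hypothetical, not from the example above) and asks for several statistics at once with agg():

```python
import pandas as pd

# Hypothetical data with repeated cities so each group has two rows
people = pd.DataFrame({
    'City': ['Chicago', 'Chicago', 'New York', 'New York'],
    'Age': [22, 28, 25, 35],
})

# One row per city, one column per statistic
summary = people.groupby('City')['Age'].agg(['mean', 'min', 'max', 'count'])
print(summary)
```

The result has one row per city and one column for each statistic you listed.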
✅ Merging DataFrames
df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'ID': [2, 3, 4], 'City': ['New York', 'Chicago', 'Los Angeles']})
merged = pd.merge(df1, df2, on='ID')
print(merged)
Output:
ID Name City
0 2 Bob New York
1 3 Charlie Chicago
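📝 Explanation: merge() combines two tables based on a common column. By default it does an inner join, keeping only the IDs found in both tables, which is why ID 1 and ID 4 are missing from the result.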
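If you want to keep rows that have no match, the `how` parameter controls that. A short sketch with the same two tables (the names `outer` and `left` are just for illustration):

```python
import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'ID': [2, 3, 4], 'City': ['New York', 'Chicago', 'Los Angeles']})

# how='outer' keeps every ID from both tables
outer = pd.merge(df1, df2, on='ID', how='outer')
print(len(outer))  # 4

# how='left' keeps every row of df1; unmatched cities become NaN
left = pd.merge(df1, df2, on='ID', how='left')
print(len(left))   # 3
print(left)
```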
✅ Handling Missing Data
import numpy as np
df.loc[1, 'Age'] = np.nan
print(df)
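# Fill missing values and assign the result back to the column.
# (Calling fillna(..., inplace=True) on a selected column is deprecated in recent pandas and may not update df.)
df['Age'] = df['Age'].fillna(0)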
print(df)
Output:
Name Age City Country
0 Alice 25.0 New York USA
1 Bob NaN San Francisco USA
2 Charlie 22.0 Chicago USA
Name Age City Country
0 Alice 25.0 New York USA
1 Bob 0.0 San Francisco USA
2 Charlie 22.0 Chicago USA
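Filling with 0 is easy, but it can distort averages. Here is a sketch of a few other common options, using a small stand-alone table; which one is right depends on your data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, np.nan, 22],
})

# Count the missing values in a column
print(df['Age'].isna().sum())  # 1

# Option 1: drop the rows where Age is missing
dropped = df.dropna(subset=['Age'])
print(len(dropped))  # 2

# Option 2: fill with the column mean (mean() skips NaN by default)
filled = df['Age'].fillna(df['Age'].mean())
print(filled.tolist())  # [25.0, 23.5, 22.0]
```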
🔄 Reshaping Data
✅ NumPy Reshape
import numpy as np
arr = np.arange(12)
reshaped = arr.reshape(3, 4)
print(reshaped)
Output:
[[ 0 1 2 3]
[ 4 5 6 7]
[ 8 9 10 11]]
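📝 Explanation: np.arange(12) creates an array of the integers 0 through 11 (the stop value 12 is not included). reshape(3, 4) turns it into a 3-row, 4-column array. The new shape must hold exactly the same number of elements, here 3 × 4 = 12.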
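You can also let NumPy work out one of the dimensions by passing -1. A minimal sketch (the names `auto` and `flat` are my own):

```python
import numpy as np

arr = np.arange(12)

# -1 means "figure this dimension out from the total size"
auto = arr.reshape(2, -1)
print(auto.shape)  # (2, 6)

# reshape(-1) flattens back to one dimension
flat = auto.reshape(-1)
print(flat.shape)  # (12,)

# A shape that does not fit, such as arr.reshape(5, 3), raises a ValueError
```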
✅ Pandas Pivot Table
import pandas as pd
data = {
'Name': ['Alice', 'Bob', 'Alice', 'Bob'],
'Subject': ['Math', 'Math', 'Science', 'Science'],
'Score': [85, 90, 95, 80]
}
df = pd.DataFrame(data)
pivot = df.pivot_table(values='Score', index='Name', columns='Subject')
print(pivot)
Output:
Subject Math Science
Name
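Alice 85.0 95.0
Bob 90.0 80.0
📝 Explanation: Pivot tables summarize data. Here, we show each person’s score by subject. pivot_table() averages the values in each cell by default, which is why the scores appear as floats.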
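The averaging matters when a person has more than one score in the same subject. The sketch below uses made-up scores (the `scores` table is hypothetical) to compare the default with `aggfunc='max'`:

```python
import pandas as pd

# Hypothetical data: Alice has two Math scores
scores = pd.DataFrame({
    'Name': ['Alice', 'Alice', 'Bob'],
    'Subject': ['Math', 'Math', 'Math'],
    'Score': [80, 90, 70],
})

# Default: the mean of the scores that land in each cell
avg = scores.pivot_table(values='Score', index='Name', columns='Subject')
print(avg.loc['Alice', 'Math'])  # 85.0

# aggfunc picks a different summary, here the highest score
best = scores.pivot_table(values='Score', index='Name', columns='Subject', aggfunc='max')
print(int(best.loc['Alice', 'Math']))  # 90
```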
🧮 Applying Functions
✅ NumPy Vectorized Operations
arr = np.array([1, 2, 3, 4, 5])
squared = arr ** 2
print(squared)
Output:
[ 1 4 9 16 25]
📝 Explanation: NumPy applies operations to each element without loops.
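The same idea covers conditions and totals. Here is a small sketch using np.where() to label each element and built-in methods to summarize the array (the name `labels` is my own):

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5])

# np.where(condition, value_if_true, value_if_false) works element by element
labels = np.where(arr % 2 == 0, 'even', 'odd')
print(labels.tolist())  # ['odd', 'even', 'odd', 'even', 'odd']

# Aggregations also run over the whole array without a loop
print(arr.sum(), arr.mean())  # 15 3.0
```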
✅ Pandas Apply Function
df = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 22]
})
# Add 5 years to each age
df['AgePlus5'] = df['Age'].apply(lambda x: x + 5)
print(df)
Output:
Name Age AgePlus5
0 Alice 25 30
1 Bob 30 35
2 Charlie 22 27
📝 Explanation: apply() lets you run a function on each value in a column.
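For simple arithmetic like this, plain column math gives the same result and is usually faster. apply() is more useful when the logic has no ready-made operator. A sketch of both, where `age_group` is a hypothetical helper written for this example:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 22],
})

# Same result as the lambda version, using column arithmetic
df['AgePlus5'] = df['Age'] + 5
print(df['AgePlus5'].tolist())  # [30, 35, 27]

# Hypothetical helper: custom logic is where apply() is handy
def age_group(age):
    return 'under 25' if age < 25 else '25 or older'

df['Group'] = df['Age'].apply(age_group)
print(df['Group'].tolist())  # ['25 or older', '25 or older', 'under 25']
```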
🧼 Cleaning Data
✅ Removing Duplicates
df = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Alice'],
'Age': [25, 30, 25]
})
df_cleaned = df.drop_duplicates()
print(df_cleaned)
Output:
Name Age
0 Alice 25
1 Bob 30
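📝 Explanation: drop_duplicates() removes rows that repeat an earlier row in every column and keeps the first occurrence. It returns a new DataFrame, so df itself still has all three rows.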
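Sometimes rows count as duplicates even when not every column matches. The sketch below uses slightly different made-up data (Alice appears with two ages) and the `subset` and `keep` parameters:

```python
import pandas as pd

# Hypothetical data: same name, different ages
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Alice'],
    'Age': [25, 30, 26],
})

# No row is an exact copy of another, so nothing is removed
print(len(df.drop_duplicates()))  # 3

# Compare on Name only and keep the last occurrence of each name
by_name = df.drop_duplicates(subset='Name', keep='last')
print(by_name['Age'].tolist())  # [30, 26]
```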
✅ Renaming Columns
df.rename(columns={'Name': 'Full Name', 'Age': 'Years'}, inplace=True)
print(df)
Output:
Full Name Years
0 Alice 25
1 Bob 30
2 Alice 25
📅 Working with Dates
df = pd.DataFrame({
'Date': pd.to_datetime(['2023-01-01', '2023-01-02', '2023-01-03']),
'Sales': [100, 150, 200]
})
# Extract day of week
df['Day'] = df['Date'].dt.day_name()
print(df)
Output:
Date Sales Day
0 2023-01-01 100 Sunday
1 2023-01-02 150 Monday
2 2023-01-03 200 Tuesday
📝 Explanation: dt.day_name() extracts the weekday name from a date column.
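The `.dt` accessor exposes other parts of a date too, and date columns can be filtered like numbers. A short sketch on the same sales data (the column names `Month` and `Weekday` and the variable `recent` are my own):

```python
import pandas as pd

df = pd.DataFrame({
    'Date': pd.to_datetime(['2023-01-01', '2023-01-02', '2023-01-03']),
    'Sales': [100, 150, 200],
})

df['Month'] = df['Date'].dt.month
df['Weekday'] = df['Date'].dt.dayofweek  # Monday=0 ... Sunday=6
print(df['Weekday'].tolist())  # [6, 0, 1]

# Keep only the rows on or after a given date
recent = df[df['Date'] >= '2023-01-02']
print(recent['Sales'].tolist())  # [150, 200]
```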
🔗 Combining Data
✅ Concatenating DataFrames
df1 = pd.DataFrame({'A': [1, 2]})
df2 = pd.DataFrame({'A': [3, 4]})
combined = pd.concat([df1, df2])
print(combined)
Output:
A
0 1
1 2
0 3
1 4
📝 Explanation: concat() stacks DataFrames vertically. Each DataFrame keeps its original index, which is why 0 and 1 appear twice in the output.
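Two options are worth knowing here. A minimal sketch (the names `renumbered` and `side` are just for illustration):

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2]})
df2 = pd.DataFrame({'A': [3, 4]})

# ignore_index=True renumbers the rows from 0
renumbered = pd.concat([df1, df2], ignore_index=True)
print(list(renumbered.index))  # [0, 1, 2, 3]

# axis=1 places the DataFrames side by side instead of stacking them
side = pd.concat([df1, df2.rename(columns={'A': 'B'})], axis=1)
print(list(side.columns))  # ['A', 'B']
```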
📌 Conclusion
With Pandas and NumPy, you can:
- Clean messy data
- Analyze and summarize information
- Perform fast calculations
- Work with dates and times
- Combine and reshape datasets