DEV Community

likhitha manikonda

Data Manipulation With Pandas And Numpy

If you're new to Python and want to work with data—like spreadsheets, tables, or numbers—then Pandas and NumPy are your best friends. This guide will walk you through the basics of data manipulation using these two powerful libraries, with simple explanations and examples.


📦 Installing Pandas and NumPy

Before you start, install the libraries using pip:

pip install pandas numpy

Or using conda (recommended for Anaconda users):

conda install pandas numpy

🔢 What is NumPy?

NumPy stands for Numerical Python. It helps you work with numbers and arrays efficiently.

✅ Creating Arrays

import numpy as np

# Create a simple array
arr = np.array([1, 2, 3, 4, 5])
print(arr)

Output:

[1 2 3 4 5]

📝 Explanation: np.array() turns a Python list into a NumPy array. Arrays store elements of a single type in a compact block of memory, so math on them is typically much faster than on plain lists.
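
To make the difference between a list and an array concrete, here is a small sketch. The variable name `numbers` is just an illustration and not part of the example above.

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5])

# Arrays carry their shape and offer built-in math methods
print(arr.shape)   # (5,)
print(arr.sum())   # 15
print(arr.mean())  # 3.0

# The same operator behaves differently on a list and on an array
numbers = [1, 2, 3]               # hypothetical example list
print(numbers * 2)                # [1, 2, 3, 1, 2, 3]  (the list is repeated)
print(np.array(numbers) * 2)      # [2 4 6]             (each element is doubled)
```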


✅ Array Operations

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

# Addition
print(a + b)

# Multiplication
print(a * b)

Output:

[5 7 9]
[ 4 10 18]

📝 Explanation: NumPy performs element-wise operations. It adds or multiplies each pair of elements from the arrays.
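
Because `*` works element by element, it is not a dot product. As a quick sketch using the same two arrays, the `@` operator (or `np.dot`) gives the dot product instead:

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

# Element-wise product: one result per pair of elements
elementwise = a * b
print(elementwise)  # [ 4 10 18]

# Dot product: multiply the pairs, then add them up (4 + 10 + 18)
dot = a @ b
print(dot)          # 32
```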


✅ Filtering with Conditions

data = np.array([10, 20, 30, 40, 50])
filtered = data[data > 30]
print(filtered)

Output:

[40 50]

📝 Explanation: data > 30 produces a boolean array (True where the condition holds), and indexing with it keeps only the values greater than 30.
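
Conditions can also be combined. The sketch below reuses the same array; the names `between` and `extremes` are made up for illustration.

```python
import numpy as np

data = np.array([10, 20, 30, 40, 50])

# Wrap each condition in parentheses; use & for "and", | for "or"
between = data[(data > 10) & (data < 50)]
extremes = data[(data < 20) | (data > 40)]

print(between)   # [20 30 40]
print(extremes)  # [10 50]
```

Python's `and` / `or` keywords do not work here, because they expect a single True or False rather than an array of them.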


📊 What is Pandas?

Pandas is a library for working with tabular data—like rows and columns in Excel.

✅ Creating a DataFrame

import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 22],
    'City': ['New York', 'San Francisco', 'Chicago']
}

df = pd.DataFrame(data)
print(df)

Output:

      Name  Age           City
0    Alice   25       New York
1      Bob   30  San Francisco
2  Charlie   22        Chicago

📝 Explanation: A DataFrame is like a table. Each column has a name, and each row has an index.
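
A few attributes let you inspect a table without printing all of it. This is a minimal sketch that rebuilds the same DataFrame so it runs on its own:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 22],
    'City': ['New York', 'San Francisco', 'Chicago']
})

print(df.shape)            # (3, 3) -> 3 rows, 3 columns
print(list(df.columns))    # ['Name', 'Age', 'City']
print(df.index.tolist())   # [0, 1, 2]
print(df.head(2))          # first two rows only
```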


✅ Selecting Data

# Select a column
print(df['Name'])

# Select a row by index
print(df.loc[1])

Output:

0     Alice
1       Bob
2    Charlie
Name: Name, dtype: object

Name              Bob
Age                30
City    San Francisco
Name: 1, dtype: object
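
Selection goes a bit further than one column or one row. As a sketch on the same data (the names `subset`, `first_row`, and `bob_city` are illustrative only):

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 22],
    'City': ['New York', 'San Francisco', 'Chicago']
})

# Double brackets select several columns and return a DataFrame
subset = df[['Name', 'City']]

# iloc selects by position, loc selects by label
first_row = df.iloc[0]
bob_city = df.loc[1, 'City']

print(subset.shape)       # (3, 2)
print(first_row['Name'])  # Alice
print(bob_city)           # San Francisco
```

With the default index, labels and positions happen to match, so loc[1] and iloc[1] return the same row. They differ once the index is something else, such as names or dates.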

✅ Filtering Rows

# Show people older than 23
print(df[df['Age'] > 23])

Output:

    Name  Age           City
0  Alice   25       New York
1    Bob   30  San Francisco

✅ Adding a Column

df['Country'] = ['USA', 'USA', 'USA']
print(df)

Output:

      Name  Age           City  Country
0    Alice   25       New York      USA
1      Bob   30  San Francisco      USA
2  Charlie   22        Chicago      USA

✅ Summary Statistics

print(df.describe())

Output:

             Age
count   3.000000
mean   25.666667
std     4.041452
min    22.000000
25%    23.500000
50%    25.000000
75%    27.500000
max    30.000000

📝 Explanation: describe() gives you basic statistics like mean, min, max, etc.


✅ Grouping Data

grouped = df.groupby('City')['Age'].mean()
print(grouped)

Output:

City
Chicago          22.0
New York         25.0
San Francisco    30.0
Name: Age, dtype: float64
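
In the example above every city appears only once, so each "group" holds a single person. Grouping becomes more useful when keys repeat. Here is a small sketch with hypothetical data (the `people` table is invented for illustration):

```python
import pandas as pd

# Hypothetical data in which two people share a city
people = pd.DataFrame({
    'City': ['Chicago', 'Chicago', 'New York'],
    'Age': [22, 30, 25]
})

# agg() computes several statistics per group in one call
stats = people.groupby('City')['Age'].agg(['mean', 'count'])
print(stats)
```

Chicago ends up with a mean age of 26.0 across 2 people, and New York with 25.0 across 1.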

✅ Merging DataFrames

df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'ID': [2, 3, 4], 'City': ['New York', 'Chicago', 'Los Angeles']})

merged = pd.merge(df1, df2, on='ID')
print(merged)

Output:

   ID     Name      City
0   2      Bob  New York
1   3  Charlie   Chicago

📝 Explanation: merge() combines two tables based on a common column. By default it performs an inner join, so only the IDs present in both tables (2 and 3) are kept.
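
If you want to keep rows that have no match, the `how` parameter controls the join type. A short sketch with the same two tables:

```python
import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'ID': [2, 3, 4], 'City': ['New York', 'Chicago', 'Los Angeles']})

# Keep every row from df1; City is NaN where df2 has no matching ID
left = pd.merge(df1, df2, on='ID', how='left')
print(left)

# Keep every ID from either table
outer = pd.merge(df1, df2, on='ID', how='outer')
print(outer)
```

The left join has 3 rows (Alice gets a missing City), and the outer join has 4 rows covering IDs 1 through 4.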


✅ Handling Missing Data

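import numpy as np

# Store ages as floats so the column can hold NaN
# (newer pandas versions warn when NaN is written into an integer column)
df['Age'] = df['Age'].astype(float)
df.loc[1, 'Age'] = np.nan
print(df)

# Fill missing values and assign the result back
# (inplace=True on a selected column is deprecated in recent pandas)
df['Age'] = df['Age'].fillna(0)
print(df)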

Output:

      Name   Age           City  Country
0    Alice  25.0       New York      USA
1      Bob   NaN  San Francisco      USA
2  Charlie  22.0        Chicago      USA

      Name   Age           City  Country
0    Alice  25.0       New York      USA
1      Bob   0.0  San Francisco      USA
2  Charlie  22.0        Chicago      USA
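
Filling with 0 is only one option, and for ages it can distort later statistics. As a hedged sketch of some alternatives (the small table here is rebuilt for illustration, and which approach fits depends on your data):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25.0, np.nan, 22.0]
})

# Count missing values in each column
missing_per_column = df.isna().sum()
print(missing_per_column)

# Option 1: drop rows that contain any missing value
without_missing = df.dropna()
print(without_missing)

# Option 2: fill with the column mean (mean() skips NaN by default)
filled_with_mean = df['Age'].fillna(df['Age'].mean())
print(filled_with_mean.tolist())  # [25.0, 23.5, 22.0]
```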

🔄 Reshaping Data

✅ NumPy Reshape

import numpy as np

arr = np.arange(12)
reshaped = arr.reshape(3, 4)
print(reshaped)

Output:

[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]

📝 Explanation: np.arange(12) creates an array from 0 to 11. reshape(3, 4) turns it into a 3-row, 4-column array.
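
The new shape has to hold exactly the same number of elements (3 × 4 = 12 here). You can pass -1 for one dimension and let NumPy work it out, as in this small sketch:

```python
import numpy as np

arr = np.arange(12)

# -1 tells NumPy to infer that dimension: 12 elements / 2 rows = 6 columns
two_rows = arr.reshape(2, -1)
print(two_rows.shape)  # (2, 6)

# reshape(-1) flattens back to one dimension
flat = two_rows.reshape(-1)
print(flat.shape)      # (12,)

# A shape that does not fit, such as arr.reshape(5, -1), raises a ValueError
```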


✅ Pandas Pivot Table

import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Alice', 'Bob'],
    'Subject': ['Math', 'Math', 'Science', 'Science'],
    'Score': [85, 90, 95, 80]
}

df = pd.DataFrame(data)
pivot = df.pivot_table(values='Score', index='Name', columns='Subject')
print(pivot)

Output:

Subject  Math  Science
Name                  
Alice    85.0     95.0
Bob      90.0     80.0

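📝 Explanation: Pivot tables summarize data. By default, pivot_table() averages the values that share the same row and column (aggfunc='mean'). Here each person has one score per subject, so the table simply shows that score.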
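
The aggregation only becomes visible when a cell has more than one value. A minimal sketch with hypothetical scores (the `scores` table is invented for illustration):

```python
import pandas as pd

# Hypothetical data: Alice has two Math scores
scores = pd.DataFrame({
    'Name': ['Alice', 'Alice', 'Bob'],
    'Subject': ['Math', 'Math', 'Math'],
    'Score': [80, 90, 70]
})

# Default aggregation is the mean
mean_pivot = scores.pivot_table(values='Score', index='Name', columns='Subject')
print(mean_pivot)

# aggfunc picks a different summary, for example the best score
max_pivot = scores.pivot_table(values='Score', index='Name',
                               columns='Subject', aggfunc='max')
print(max_pivot)
```

Alice's Math cell is 85.0 in the first table (the average of 80 and 90) and 90 in the second.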


🧮 Applying Functions

✅ NumPy Vectorized Operations

arr = np.array([1, 2, 3, 4, 5])
squared = arr ** 2
print(squared)

Output:

[ 1  4  9 16 25]

📝 Explanation: NumPy applies the operation to every element at once, so you don't need to write a Python loop.


✅ Pandas Apply Function

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 22]
})

# Add 5 years to each age
df['AgePlus5'] = df['Age'].apply(lambda x: x + 5)
print(df)

Output:

      Name  Age  AgePlus5
0    Alice   25        30
1      Bob   30        35
2  Charlie   22        27

📝 Explanation: apply() lets you run a function on each value in a column.
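
For simple arithmetic like this, plain column math gives the same result and is usually faster, since pandas columns support vectorized operations just like NumPy arrays. A quick sketch (the column name `ViaApply` is made up for the comparison):

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 22]
})

# Vectorized: the whole column is processed at once
df['AgePlus5'] = df['Age'] + 5

# Same values as the apply() version
df['ViaApply'] = df['Age'].apply(lambda x: x + 5)

print(df['AgePlus5'].tolist())                  # [30, 35, 27]
print((df['AgePlus5'] == df['ViaApply']).all())  # True
```

apply() is still handy when the logic does not map to a built-in operation, such as custom string handling or branching rules.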


🧼 Cleaning Data

✅ Removing Duplicates

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Alice'],
    'Age': [25, 30, 25]
})

df_cleaned = df.drop_duplicates()
print(df_cleaned)

Output:

    Name  Age
0  Alice   25
1    Bob   30

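📝 Explanation: drop_duplicates() removes rows that repeat an earlier row in every column, keeping the first occurrence. It returns a new DataFrame and leaves df unchanged.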
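
Sometimes rows count as duplicates even though not every column matches. The `subset` and `keep` parameters cover that case. Here is a sketch with slightly different hypothetical data, where Alice appears twice with different ages:

```python
import pandas as pd

# Hypothetical data: same name, different age
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Alice'],
    'Age': [25, 30, 26]
})

# No row is fully identical, so nothing is dropped
full = df.drop_duplicates()
print(len(full))                # 3

# Compare on Name only; the first Alice is kept
by_name = df.drop_duplicates(subset='Name')
print(by_name.index.tolist())   # [0, 1]

# keep='last' keeps the last occurrence instead
last = df.drop_duplicates(subset='Name', keep='last')
print(last.index.tolist())      # [1, 2]
```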


✅ Renaming Columns

# rename() is applied to the original df here (not df_cleaned), so all three rows remain
df.rename(columns={'Name': 'Full Name', 'Age': 'Years'}, inplace=True)
print(df)

Output:

  Full Name  Years
0     Alice     25
1       Bob     30
2     Alice     25

📅 Working with Dates

df = pd.DataFrame({
    'Date': pd.to_datetime(['2023-01-01', '2023-01-02', '2023-01-03']),
    'Sales': [100, 150, 200]
})

# Extract day of week
df['Day'] = df['Date'].dt.day_name()
print(df)

Output:

        Date  Sales      Day
0 2023-01-01    100   Sunday
1 2023-01-02    150   Monday
2 2023-01-03    200  Tuesday

📝 Explanation: dt.day_name() extracts the weekday name from a date column.
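
The `.dt` accessor exposes other parts of a date too, and datetime columns can be filtered much like numbers. A small sketch on the same sales data (the `Month` and `Weekday` column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'Date': pd.to_datetime(['2023-01-01', '2023-01-02', '2023-01-03']),
    'Sales': [100, 150, 200]
})

# Numeric parts of the date
df['Month'] = df['Date'].dt.month
df['Weekday'] = df['Date'].dt.dayofweek  # Monday is 0, Sunday is 6

print(df['Month'].tolist())    # [1, 1, 1]
print(df['Weekday'].tolist())  # [6, 0, 1]

# Filter rows by comparing against a date string
after_first = df[df['Date'] > '2023-01-01']
print(after_first['Sales'].tolist())  # [150, 200]
```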


🔗 Combining Data

✅ Concatenating DataFrames

df1 = pd.DataFrame({'A': [1, 2]})
df2 = pd.DataFrame({'A': [3, 4]})

combined = pd.concat([df1, df2])
print(combined)

Output:

   A
0  1
1  2
0  3
1  4

📝 Explanation: concat() stacks DataFrames vertically by default. Each frame keeps its original index, which is why the labels 0 and 1 appear twice.
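
If you would rather have a clean 0, 1, 2, 3 index, or want to place the frames side by side, concat() has options for both. A minimal sketch (renaming the second column to 'B' is only done here to keep the column names distinct):

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2]})
df2 = pd.DataFrame({'A': [3, 4]})

# ignore_index=True renumbers the rows from 0
combined = pd.concat([df1, df2], ignore_index=True)
print(combined.index.tolist())  # [0, 1, 2, 3]

# axis=1 places the frames side by side, matching rows by index
side = pd.concat([df1, df2.rename(columns={'A': 'B'})], axis=1)
print(list(side.columns))       # ['A', 'B']
print(side.shape)               # (2, 2)
```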


📌 Conclusion

With Pandas and NumPy, you can:

  • Clean messy data
  • Analyze and summarize information
  • Perform fast calculations
  • Work with dates and times
  • Combine and reshape datasets
