NumPy is for numbers.
Pure numbers, same type, organized in grids. Fast, powerful, no labels.
Real data is not like that. Real data has column names. It has strings mixed with numbers. It has dates. It has missing values. It has a mix of ages, salaries, cities, and booleans all in the same table.
NumPy cannot handle that cleanly. Pandas was built specifically for it.
If NumPy is a calculator, Pandas is the spreadsheet. And in AI and data science, you will spend more time in that spreadsheet than anywhere else.
What a DataFrame Actually Is
A DataFrame is a table with labeled rows and columns. Think of it as a dictionary of arrays, all sharing the same index.
import pandas as pd
import numpy as np
data = {
"name": ["Alex", "Priya", "Sam", "Jordan", "Lisa"],
"age": [25, 30, 22, 35, 28],
"salary": [55000, 82000, 43000, 95000, 67000],
"department": ["Engineering", "Marketing", "Engineering", "Sales", "Marketing"],
"promoted": [True, False, False, True, True]
}
df = pd.DataFrame(data)
print(df)
Output:
name age salary department promoted
0 Alex 25 55000 Engineering True
1 Priya 30 82000 Marketing False
2 Sam 22 43000 Engineering False
3 Jordan 35 95000 Sales True
4 Lisa 28 67000 Marketing True
Five rows. Five columns. Every column has a name. Every row has an index (0 through 4 by default). That index can be anything: numbers, dates, strings.
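To see a non-default index in action, here is a small sketch that promotes the name column to the row index so rows can be looked up by label instead of position:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alex", "Priya", "Sam"],
    "salary": [55000, 82000, 43000],
})

# Use the name column as the row index instead of 0, 1, 2
by_name = df.set_index("name")

# Now rows are addressed by name, not position
print(by_name.loc["Priya", "salary"])  # 82000
```

set_index returns a new DataFrame by default; the original df keeps its 0-based index.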
The First Things You Do With Any New DataFrame
Every time you load a new dataset, run these before doing anything else.
print(df.shape) # rows and columns
print(df.dtypes) # data type of each column
df.info()                  # shape + dtypes + null counts (prints itself; wrapping it in print() adds a stray None)
print(df.head(3)) # first 3 rows
print(df.tail(2)) # last 2 rows
print(df.describe()) # statistics for numeric columns
Output from df.describe():
age salary
count 5.000000 5.000000
mean 28.000000 68400.000000
std 4.949747 20731.618364
min 22.000000 43000.000000
25% 25.000000 55000.000000
50% 28.000000 67000.000000
75% 30.000000 82000.000000
max 35.000000 95000.000000
describe() gives you count, mean, std, min, max and the quartiles for every numeric column in one shot. This is your first look at the shape of the data. Run it every time before touching anything else.
Selecting Columns
print(df["name"]) # one column, returns Series
print(df[["name", "salary"]]) # multiple columns, returns DataFrame
print(df["salary"].mean()) # compute on a column directly
print(df["department"].value_counts()) # frequency of each unique value
Output from value_counts():
department
Engineering 2
Marketing 2
Sales 1
Name: count, dtype: int64
value_counts() is one of the most useful quick methods. Run it on any categorical column and you instantly know the distribution of categories. Is your dataset balanced? Are there rare categories? This tells you in one line.
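When the raw counts matter less than the proportions, value_counts can report fractions instead. A quick sketch:

```python
import pandas as pd

departments = pd.Series(
    ["Engineering", "Marketing", "Engineering", "Sales", "Marketing"]
)

# normalize=True converts counts into fractions that sum to 1.0
props = departments.value_counts(normalize=True)
print(props)
```

Engineering and Marketing each account for 0.4 of the rows, Sales for 0.2. This is the one-liner for checking class imbalance.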
Selecting Rows: loc and iloc
Two methods. One uses labels, one uses positions.
iloc is position-based. Treats everything like NumPy.
print(df.iloc[0]) # first row
print(df.iloc[1:3]) # rows 1 and 2
print(df.iloc[0, 2]) # row 0, column 2 (salary)
loc is label-based. Uses actual row and column names.
print(df.loc[0]) # row with index label 0
print(df.loc[0:2, ["name", "salary"]]) # rows 0-2, specific columns
The one that trips people up: iloc[0:3] gives rows 0, 1, 2 (exclusive end). loc[0:2] gives rows 0, 1, 2 (inclusive end). They are different. loc is inclusive on both ends.
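A quick sketch that makes the inclusive/exclusive difference concrete by counting rows:

```python
import pandas as pd

df = pd.DataFrame({"x": [10, 20, 30, 40, 50]})

print(len(df.iloc[0:3]))  # 3 rows: positions 0, 1, 2 (end excluded)
print(len(df.loc[0:3]))   # 4 rows: labels 0, 1, 2, 3 (end included)
```

Same-looking slice, different row counts. This only coincidentally works with integer labels here; once the index is strings or dates, loc slicing by label is the only option of the two.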
Boolean Filtering
This is where Pandas becomes genuinely powerful.
high_earners = df[df["salary"] > 65000]
print(high_earners)
Output:
name age salary department promoted
1 Priya 30 82000 Marketing False
3 Jordan 35 95000 Sales True
4 Lisa 28 67000 Marketing True
eng_promoted = df[(df["department"] == "Engineering") & df["promoted"]]
print(eng_promoted)
Output:
name age salary department promoted
0 Alex 25 55000 Engineering True
Conditions in parentheses. & for AND, | for OR, ~ for NOT. Same boolean logic from Python, applied to entire columns at once.
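The other two operators, plus a useful shortcut: a boolean mask passed to loc filters rows and selects columns in one step. A sketch on a small frame:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alex", "Priya", "Sam", "Jordan", "Lisa"],
    "salary": [55000, 82000, 43000, 95000, 67000],
    "promoted": [True, False, False, True, True],
})

# OR: promoted, or earning above 90000
either = df[df["promoted"] | (df["salary"] > 90000)]

# NOT: everyone who was not promoted
not_promoted = df[~df["promoted"]]

# Mask plus column selection in one step with loc
names = df.loc[df["salary"] > 65000, "name"]

print(either)
print(not_promoted)
print(list(names))  # ['Priya', 'Jordan', 'Lisa']
```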
Adding and Modifying Columns
df["salary_monthly"] = df["salary"] / 12
df["seniority"] = df["age"].apply(lambda x: "senior" if x >= 30 else "junior")
df["salary_normalized"] = (df["salary"] - df["salary"].mean()) / df["salary"].std()
print(df[["name", "salary", "salary_monthly", "seniority", "salary_normalized"]])
Output:
name salary salary_monthly seniority salary_normalized
0 Alex 55000 4583.333333 junior -0.646356
1 Priya 82000 6833.333333 senior 0.656003
2 Sam 43000 3583.333333 junior -1.225182
3 Jordan 95000 7916.666667 senior 1.283064
4 Lisa 67000 5583.333333 junior -0.067530
.apply() runs a function on every value in a column. Lambda, regular function, anything callable. This is how you transform data row by row when vectorized operations cannot do it directly.
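That said, .apply() with a Python lambda runs row by row and gets slow on large columns. For a simple two-way conditional like this one, np.where is a vectorized alternative — a sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, 30, 22, 35, 28]})

# Same result as the lambda version, computed on the whole column at once
df["seniority"] = np.where(df["age"] >= 30, "senior", "junior")
print(df)
```

Reach for apply when the logic genuinely cannot be expressed with column-wide operations; prefer vectorized forms when it can.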
Handling Missing Values
Real data always has missing values. Always.
messy_data = {
"name": ["Alex", "Priya", None, "Jordan", "Lisa"],
"age": [25, None, 22, 35, 28],
"score": [88, 92, None, 76, None]
}
df_messy = pd.DataFrame(messy_data)
print(df_messy)
print("\nNull counts:")
print(df_messy.isnull().sum())
Output:
name age score
0 Alex 25.0 88.0
1 Priya NaN 92.0
2 None 22.0 NaN
3 Jordan 35.0 76.0
4 Lisa 28.0 NaN
Null counts:
name 1
age 1
score 2
isnull().sum() gives you a count of missing values per column. First thing to check after describe().
Options for handling them:
df_dropped = df_messy.dropna()
df_filled = df_messy.fillna({"age": df_messy["age"].mean(), "score": 0})
df_filled_fw = df_messy.ffill() # fill with previous value (fillna(method="ffill") is deprecated)
print(f"Original rows: {len(df_messy)}")
print(f"After dropna: {len(df_dropped)}")
Output:
Original rows: 5
After dropna: 2
dropna() removed rows with any missing value. Only 2 rows survived out of 5. Be careful with dropping: you can lose most of your data.
fillna with a dictionary lets you specify different fill strategies per column. Mean for numerical, 0 for scores, "Unknown" for strings. This is the more controlled approach.
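A sketch of that per-column dictionary pattern on a string column and a numeric column together:

```python
import pandas as pd

df_messy = pd.DataFrame({
    "name": ["Alex", None, "Lisa"],
    "score": [88, None, 76],
})

# Placeholder string for names, column median for scores
filled = df_messy.fillna({
    "name": "Unknown",
    "score": df_messy["score"].median(),
})
print(filled)
```

The median of [88, 76] is 82.0, so the missing score becomes 82.0 and the missing name becomes "Unknown". No rows are lost.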
GroupBy: Aggregating by Category
One of the most useful operations in all of data analysis.
data = {
"name": ["Alex", "Priya", "Sam", "Jordan", "Lisa", "Ravi", "Tom"],
"department": ["Eng", "Marketing", "Eng", "Sales", "Marketing", "Eng", "Sales"],
"salary": [55000, 82000, 43000, 95000, 67000, 71000, 88000],
"years": [2, 5, 1, 8, 4, 3, 6]
}
df = pd.DataFrame(data)
dept_stats = df.groupby("department")["salary"].agg(["mean", "min", "max", "count"])
print(dept_stats)
Output:
mean min max count
department
Eng 56333.333333 43000 71000 3
Marketing 74500.000000 67000 82000 2
Sales 91500.000000 88000 95000 2
Three departments. Salary stats for each. One line. groupby followed by a column selection followed by agg. This is the standard pattern.
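A variation worth knowing: agg also accepts keyword arguments (named aggregation), which lets you name the output columns yourself instead of getting "mean" and "count" as headers. A sketch:

```python
import pandas as pd

df = pd.DataFrame({
    "department": ["Eng", "Marketing", "Eng", "Sales"],
    "salary": [55000, 82000, 43000, 95000],
})

# Each keyword becomes an output column: name=(source_column, function)
stats = df.groupby("department").agg(
    avg_salary=("salary", "mean"),
    headcount=("salary", "count"),
)
print(stats)
```

Readable column names matter once the result feeds into a report or another merge.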
Multiple columns at once:
dept_multi = df.groupby("department")[["salary", "years"]].mean()
print(dept_multi)
Output:
salary years
department
Eng 56333.333333 2.0
Marketing 74500.000000 4.5
Sales 91500.000000 7.0
Sorting
df_sorted = df.sort_values("salary", ascending=False)
print(df_sorted[["name", "department", "salary"]])
Output:
name department salary
3 Jordan Sales 95000
6 Tom Sales 88000
1 Priya Marketing 82000
5 Ravi Eng 71000
4 Lisa Marketing 67000
0 Alex Eng 55000
2 Sam Eng 43000
Sort by multiple columns: df.sort_values(["department", "salary"], ascending=[True, False]). Alphabetical departments, highest salary first within each.
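That multi-column sort in runnable form, plus nlargest as a shortcut when all you want is the top few rows — a sketch:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alex", "Priya", "Sam", "Jordan"],
    "department": ["Eng", "Marketing", "Eng", "Sales"],
    "salary": [55000, 82000, 43000, 95000],
})

# Alphabetical departments, highest salary first within each
ranked = df.sort_values(["department", "salary"], ascending=[True, False])
print(ranked)

# Top 2 salaries without sorting the whole frame
print(df.nlargest(2, "salary"))
```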
Saving and Loading
df.to_csv("employees.csv", index=False)
df_loaded = pd.read_csv("employees.csv")
df.to_json("employees.json", orient="records", indent=2)
df_loaded_json = pd.read_json("employees.json")
print(df_loaded.shape)
index=False stops Pandas from writing the row numbers as an extra column in the CSV. Almost always what you want.
Try This
Create pandas_practice.py.
Download or create a CSV file of at least 20 rows with these columns: name, age, city, score, category. Make some values missing.
Load it with pd.read_csv. Run head(), info(), describe() and print results.
Do all of the following:
Find rows where score is above the mean score.
Fill missing score values with the column median. Fill missing city values with "Unknown."
Add a new column called grade that assigns "A" for score >= 85, "B" for 70-84, "C" for below 70.
Group by category and compute mean, max, and count of scores for each group.
Sort the entire DataFrame by score descending and print the top 5.
Save the cleaned DataFrame to a new CSV called cleaned_data.csv.
Go Deeper
Official Pandas docs (best reference):
https://pandas.pydata.org/docs/user_guide/index.html
Corey Schafer's Pandas tutorial series (best YouTube series, covers everything):
https://www.youtube.com/playlist?list=PL-osiE80TeTsWmV9i9c58mdDCSskIFdDS
Pandas in 10 minutes (official quick overview):
https://pandas.pydata.org/docs/user_guide/10min.html
Keith Galli's complete Pandas tutorial (real dataset walkthrough):
https://www.youtube.com/watch?v=vmEHCJofslg
Kaggle's free Pandas micro-course (hands-on exercises):
https://www.kaggle.com/learn/pandas
The Corey Schafer playlist is the one to watch alongside this post. He covers everything here in video form with great examples. The Kaggle course is worth doing for the practice exercises alone.
What's Next
You can create and manipulate DataFrames now. The next step is loading real data from files. CSVs, JSON, Excel, APIs. Each format has quirks. Each has common issues. The next post covers all of it with the errors you will actually hit and how to fix them.