Akhilesh
Pandas DataFrames: Your Data Spreadsheet

NumPy is for numbers.

Pure numbers, same type, organized in grids. Fast, powerful, no labels.

Real data is not like that. Real data has column names. It has strings mixed with numbers. It has dates. It has missing values. It has a mix of ages, salaries, cities, and booleans all in the same table.

NumPy cannot handle that cleanly. Pandas was built specifically for it.

If NumPy is a calculator, Pandas is the spreadsheet. And in AI and data science, you will spend more time in that spreadsheet than anywhere else.


What a DataFrame Actually Is

A DataFrame is a table with labeled rows and columns. Think of it as a dictionary of arrays, all sharing the same index.

import pandas as pd
import numpy as np

data = {
    "name":       ["Alex", "Priya", "Sam", "Jordan", "Lisa"],
    "age":        [25, 30, 22, 35, 28],
    "salary":     [55000, 82000, 43000, 95000, 67000],
    "department": ["Engineering", "Marketing", "Engineering", "Sales", "Marketing"],
    "promoted":   [True, False, False, True, True]
}

df = pd.DataFrame(data)
print(df)

Output:

    name  age  salary   department  promoted
0   Alex   25   55000  Engineering      True
1  Priya   30   82000    Marketing     False
2    Sam   22   43000  Engineering     False
3 Jordan   35   95000        Sales      True
4   Lisa   28   67000    Marketing      True

Five rows. Five columns. Every column has a name. Every row has an index (0 through 4 by default). That index can be anything: numbers, dates, strings.
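A custom index makes label-based lookups possible. A minimal sketch (the names and values here are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "name":   ["Alex", "Priya", "Sam"],
    "salary": [55000, 82000, 43000],
})

# Use the name column as the row index instead of the default 0..2
df_named = df.set_index("name")

# Now rows can be looked up by label
print(df_named.loc["Priya", "salary"])
```

Note that set_index returns a new DataFrame; the original is unchanged unless you reassign.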


The First Things You Do With Any New DataFrame

Every time you load a new dataset, run these before doing anything else.

print(df.shape)          # rows and columns
print(df.dtypes)         # data type of each column
df.info()                # shape + dtypes + null counts (prints directly, returns None, so no print() needed)
print(df.head(3))        # first 3 rows
print(df.tail(2))        # last 2 rows
print(df.describe())     # statistics for numeric columns

Output from df.describe():

             age        salary
count   5.000000      5.000000
mean   28.000000  68400.000000
std     4.949747  20731.618364
min    22.000000  43000.000000
25%    25.000000  55000.000000
50%    28.000000  67000.000000
75%    30.000000  82000.000000
max    35.000000  95000.000000

describe() gives you count, mean, std, min, max, and the quartiles for every numeric column in one shot. It is your first real look at the data. Run it every time before touching anything else.
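By default, describe() skips non-numeric columns. The `include` parameter controls that; a small sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({
    "name":       ["Alex", "Priya", "Sam"],
    "department": ["Eng", "Marketing", "Eng"],
    "salary":     [55000, 82000, 43000],
})

# Summarize string (object) columns: count, unique, top, freq
print(df.describe(include="object"))

# Summarize every column, numeric or not
print(df.describe(include="all"))
```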


Selecting Columns

print(df["name"])                        # one column, returns Series
print(df[["name", "salary"]])            # multiple columns, returns DataFrame
print(df["salary"].mean())               # compute on a column directly
print(df["department"].value_counts())   # frequency of each unique value

Output from value_counts():

department
Engineering    2
Marketing      2
Sales          1
Name: count, dtype: int64

value_counts() is one of the most useful quick methods. Run it on any categorical column and you instantly know the distribution of categories. Is your dataset balanced? Are there rare categories? This tells you in one line.
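Two options worth knowing: `normalize=True` gives proportions instead of raw counts, and `dropna=False` keeps missing values visible. A sketch:

```python
import pandas as pd

s = pd.Series(["Eng", "Marketing", "Eng", "Sales", "Marketing"])

# Proportions instead of raw counts
print(s.value_counts(normalize=True))

# Keep NaN as its own row in the counts
print(s.value_counts(dropna=False))
```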


Selecting Rows: loc and iloc

Two methods. One uses labels, one uses positions.

iloc is position-based. Treats everything like NumPy.

print(df.iloc[0])        # first row
print(df.iloc[1:3])      # rows 1 and 2
print(df.iloc[0, 2])     # row 0, column 2 (salary)

loc is label-based. Uses actual row and column names.

print(df.loc[0])                         # row with index label 0
print(df.loc[0:2, ["name", "salary"]])   # rows 0-2, specific columns

The one that trips people up: iloc[0:3] gives rows 0, 1, 2 (exclusive end). loc[0:2] gives rows 0, 1, 2 (inclusive end). They are different. loc is inclusive on both ends.
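The difference is easiest to see with a non-default index. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({"score": [88, 92, 76]}, index=["a", "b", "c"])

# Label slice: rows "a" and "b", inclusive on both ends
print(df.loc["a":"b"])

# Position slice: rows 0 and 1, exclusive end
print(df.iloc[0:2])

# Same two rows here, selected two different ways
print(df.iloc[0:2].equals(df.loc["a":"b"]))
```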


Boolean Filtering

This is where Pandas becomes genuinely powerful.

high_earners = df[df["salary"] > 65000]
print(high_earners)

Output:

    name  age  salary   department  promoted
1  Priya   30   82000    Marketing     False
3 Jordan   35   95000        Sales      True
4   Lisa   28   67000    Marketing      True

eng_promoted = df[(df["department"] == "Engineering") & (df["promoted"] == True)]
print(eng_promoted)

Output:

  name  age  salary   department  promoted
0  Alex   25   55000  Engineering      True

Conditions in parentheses. & for AND, | for OR, ~ for NOT. Same boolean logic from Python, applied to entire columns at once.
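For set-style membership tests, `isin` pairs naturally with `~`. A sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({
    "name":       ["Alex", "Priya", "Sam", "Jordan"],
    "department": ["Engineering", "Marketing", "Engineering", "Sales"],
})

# Rows whose department is in a given set
tech = df[df["department"].isin(["Engineering", "Sales"])]
print(tech["name"].tolist())      # ['Alex', 'Sam', 'Jordan']

# ~ negates the mask: rows whose department is NOT in the set
non_tech = df[~df["department"].isin(["Engineering", "Sales"])]
print(non_tech["name"].tolist())  # ['Priya']
```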


Adding and Modifying Columns

df["salary_monthly"] = df["salary"] / 12

df["seniority"] = df["age"].apply(lambda x: "senior" if x >= 30 else "junior")

df["salary_normalized"] = (df["salary"] - df["salary"].mean()) / df["salary"].std()

print(df[["name", "salary", "salary_monthly", "seniority", "salary_normalized"]])

Output:

     name  salary  salary_monthly seniority  salary_normalized
0    Alex   55000     4583.333333    junior          -0.646356
1   Priya   82000     6833.333333    senior           0.656003
2     Sam   43000     3583.333333    junior          -1.225182
3  Jordan   95000     7916.666667    senior           1.283064
4    Lisa   67000     5583.333333    junior          -0.067530

.apply() runs a function on every value in a column. Lambda, regular function, anything callable. This is how you transform data row by row when vectorized operations cannot do it directly.
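For a simple two-way split like seniority, a vectorized alternative to .apply() is np.where, which avoids a Python-level loop and is usually faster on large columns. A sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, 30, 22, 35, 28]})

# Vectorized condition applied to the whole column at once
df["seniority"] = np.where(df["age"] >= 30, "senior", "junior")
print(df["seniority"].tolist())  # ['junior', 'senior', 'junior', 'senior', 'junior']
```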


Handling Missing Values

Real data always has missing values. Always.

messy_data = {
    "name":   ["Alex", "Priya", None, "Jordan", "Lisa"],
    "age":    [25, None, 22, 35, 28],
    "score":  [88, 92, None, 76, None]
}

df_messy = pd.DataFrame(messy_data)
print(df_messy)
print("\nNull counts:")
print(df_messy.isnull().sum())

Output:

     name   age  score
0    Alex  25.0   88.0
1   Priya   NaN   92.0
2    None  22.0    NaN
3  Jordan  35.0   76.0
4    Lisa  28.0    NaN

Null counts:
name     1
age      1
score    2

isnull().sum() gives you a count of missing values per column. First thing to check after describe().

Options for handling them:

df_dropped   = df_messy.dropna()
df_filled    = df_messy.fillna({"age": df_messy["age"].mean(), "score": 0})
df_filled_fw = df_messy.ffill()                  # fill with previous value (fillna(method="ffill") is deprecated)

print(f"Original rows: {len(df_messy)}")
print(f"After dropna:  {len(df_dropped)}")

Output:

Original rows: 5
After dropna:  2

dropna() removed every row with any missing value. Only 2 rows survived out of 5. Be careful with dropping: you can lose most of your data.

fillna with a dictionary lets you specify different fill strategies per column. Mean for numerical, 0 for scores, "Unknown" for strings. This is the more controlled approach.
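A small sketch of the per-column approach, using the median for a numeric column and "Unknown" for a string column (made-up data):

```python
import pandas as pd

df = pd.DataFrame({
    "city":  ["Pune", None, "Delhi"],
    "score": [88.0, None, 76.0],
})

# Different fill strategy per column
df_clean = df.fillna({
    "score": df["score"].median(),   # median of [88, 76] is 82
    "city":  "Unknown",
})
print(df_clean)
```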


GroupBy: Aggregating by Category

One of the most useful operations in all of data analysis.

data = {
    "name":       ["Alex", "Priya", "Sam", "Jordan", "Lisa", "Ravi", "Tom"],
    "department": ["Eng", "Marketing", "Eng", "Sales", "Marketing", "Eng", "Sales"],
    "salary":     [55000, 82000, 43000, 95000, 67000, 71000, 88000],
    "years":      [2, 5, 1, 8, 4, 3, 6]
}

df = pd.DataFrame(data)

dept_stats = df.groupby("department")["salary"].agg(["mean", "min", "max", "count"])
print(dept_stats)

Output:

                    mean    min    max  count
department
Eng          56333.333  43000  71000      3
Marketing    74500.000  67000  82000      2
Sales        91500.000  88000  95000      2

Three departments. Salary stats for each. One line. groupby followed by a column selection followed by agg. This is the standard pattern.

Multiple columns at once:

dept_multi = df.groupby("department")[["salary", "years"]].mean()
print(dept_multi)

Output:

                salary  years
department
Eng          56333.33   2.00
Marketing    74500.00   4.50
Sales        91500.00   7.00
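If you want readable column names in the result, pandas also supports named aggregation. A sketch (the result column names here are my own choices, not anything pandas mandates):

```python
import pandas as pd

df = pd.DataFrame({
    "department": ["Eng", "Marketing", "Eng", "Sales"],
    "salary":     [55000, 82000, 43000, 95000],
})

# Each keyword becomes an output column: (source column, aggregation)
stats = df.groupby("department").agg(
    avg_salary=("salary", "mean"),
    headcount=("salary", "count"),
)
print(stats)
```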

Sorting

df_sorted = df.sort_values("salary", ascending=False)
print(df_sorted[["name", "department", "salary"]])

Output:

     name department  salary
3  Jordan      Sales   95000
6     Tom      Sales   88000
1   Priya  Marketing   82000
5    Ravi        Eng   71000
4    Lisa  Marketing   67000
0    Alex        Eng   55000
2     Sam        Eng   43000

Sort by multiple columns: df.sort_values(["department", "salary"], ascending=[True, False]). Alphabetical departments, highest salary first within each.
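The multi-column sort can be sketched directly; `nlargest` is also a handy shortcut for "top N by one column" (made-up data):

```python
import pandas as pd

df = pd.DataFrame({
    "name":       ["Alex", "Priya", "Sam", "Jordan"],
    "department": ["Eng", "Marketing", "Eng", "Sales"],
    "salary":     [55000, 82000, 43000, 95000],
})

# Alphabetical departments, highest salary first within each
by_dept = df.sort_values(["department", "salary"], ascending=[True, False])
print(by_dept["name"].tolist())  # ['Alex', 'Sam', 'Priya', 'Jordan']

# Top 2 earners overall, without sorting the whole frame yourself
print(df.nlargest(2, "salary")["name"].tolist())  # ['Jordan', 'Priya']
```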


Saving and Loading

df.to_csv("employees.csv", index=False)
df_loaded = pd.read_csv("employees.csv")

df.to_json("employees.json", orient="records", indent=2)
df_loaded_json = pd.read_json("employees.json")

print(df_loaded.shape)

index=False stops Pandas from writing the row numbers as an extra column in the CSV. Almost always what you want.
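You can see the difference in a round trip. A sketch using an in-memory buffer instead of a real file:

```python
from io import StringIO

import pandas as pd

df = pd.DataFrame({"name": ["Alex", "Priya"], "salary": [55000, 82000]})

# WITH the index (the default): it comes back as an extra "Unnamed: 0" column
buf = StringIO()
df.to_csv(buf)
buf.seek(0)
print(pd.read_csv(buf).columns.tolist())  # ['Unnamed: 0', 'name', 'salary']

# WITHOUT the index: the columns come back clean
buf = StringIO()
df.to_csv(buf, index=False)
buf.seek(0)
print(pd.read_csv(buf).columns.tolist())  # ['name', 'salary']
```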


Try This

Create pandas_practice.py.

Download or create a CSV file of at least 20 rows with these columns: name, age, city, score, category. Make some values missing.

Load it with pd.read_csv. Run head(), info(), describe() and print results.

Do all of the following:

Find rows where score is above the mean score.

Fill missing score values with the column median. Fill missing city values with "Unknown."

Add a new column called grade that assigns "A" for score >= 85, "B" for 70-84, "C" for below 70.

Group by category and compute mean, max, and count of scores for each group.

Sort the entire DataFrame by score descending and print the top 5.

Save the cleaned DataFrame to a new CSV called cleaned_data.csv.


Go Deeper

Official Pandas docs (best reference):
https://pandas.pydata.org/docs/user_guide/index.html

Corey Schafer's Pandas tutorial series (best YouTube series, covers everything):
https://www.youtube.com/playlist?list=PL-osiE80TeTsWmV9i9c58mdDCSskIFdDS

Pandas in 10 minutes (official quick overview):
https://pandas.pydata.org/docs/user_guide/10min.html

Keith Galli's complete Pandas tutorial (real dataset walkthrough):
https://www.youtube.com/watch?v=vmEHCJofslg

Kaggle's free Pandas micro-course (hands-on exercises):
https://www.kaggle.com/learn/pandas

The Corey Schafer playlist is the one to watch alongside this post. He covers everything here in video form with great examples. The Kaggle course is worth doing for the practice exercises alone.


What's Next

You can create and manipulate DataFrames now. The next step is loading real data from files. CSVs, JSON, Excel, APIs. Each format has quirks. Each has common issues. The next post covers all of it with the errors you will actually hit and how to fix them.
