Akhilesh

Filtering Rows and Selecting Columns (The Right Way)

You know df["column"] selects a column.

You know df.head() shows the top rows.

But in real analysis, you need surgical precision: "Give me rows 50 through 200 where salary is above the median and department is not Sales, and only show me the name, age, and salary columns."

That sentence is one Pandas expression when you know what you are doing. Three nested loops when you do not.

This post is about knowing what you are doing.


Two Ways to Select: loc and iloc

These are the two main selectors. They look similar but they think differently.

iloc thinks in positions. Row 0 is the first row. Column 2 is the third column. Pure integer indexing like NumPy.

loc thinks in labels. Use the actual row index value and the actual column name. More readable, and easier to trip over once the index labels stop matching positions.

import pandas as pd

data = {
    "name":       ["Alex", "Priya", "Sam", "Jordan", "Lisa", "Ravi"],
    "age":        [25, 30, 22, 35, 28, 31],
    "salary":     [55000, 82000, 43000, 95000, 67000, 71000],
    "department": ["Engineering", "Marketing", "Engineering", "Sales", "Marketing", "Engineering"],
    "score":      [88, 92, 76, 85, 91, 79]
}

df = pd.DataFrame(data)
print(df)

Output:

     name  age  salary   department  score
0    Alex   25   55000  Engineering     88
1   Priya   30   82000    Marketing     92
2     Sam   22   43000  Engineering     76
3  Jordan   35   95000        Sales     85
4    Lisa   28   67000    Marketing     91
5    Ravi   31   71000  Engineering     79

iloc: Position-Based Selection

print(df.iloc[0])          # first row, all columns
print(df.iloc[2, 3])       # row index 2, column index 3 -> "Engineering"
print(df.iloc[1:4])        # rows 1, 2, 3 (4 is excluded)
print(df.iloc[:, 0:3])     # all rows, columns 0, 1, 2
print(df.iloc[[0, 3, 5]])  # rows 0, 3, and 5 specifically
print(df.iloc[-1])         # last row

Output from df.iloc[2, 3]:

Engineering

Output from df.iloc[[0, 3, 5]]:

     name  age  salary   department  score
0    Alex   25   55000  Engineering     88
3  Jordan   35   95000        Sales     85
5    Ravi   31   71000  Engineering     79

iloc is straightforward. The confusion starts after filtering, when the index has gaps and positions no longer line up with labels.


loc: Label-Based Selection

print(df.loc[0])                          # row with index label 0
print(df.loc[0, "salary"])                # row 0, salary column -> 55000
print(df.loc[0:3, "name":"salary"])       # rows 0-3, columns name through salary
print(df.loc[[1, 4], ["name", "score"]]) # specific rows and columns by name

Output from df.loc[0:3, "name":"salary"]:

     name  age  salary
0    Alex   25   55000
1   Priya   30   82000
2     Sam   22   43000
3  Jordan   35   95000

Critical difference from iloc: loc[0:3] includes row 3. iloc[0:3] excludes row 3. loc slices are inclusive on both ends.
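You can verify the two rules side by side. A minimal sketch on a throwaway frame with a default 0-based index:

```python
import pandas as pd

tmp = pd.DataFrame({"x": [10, 20, 30, 40, 50]})

print(len(tmp.loc[0:3]))   # 4 -> labels 0, 1, 2, 3 (both ends included)
print(len(tmp.iloc[0:3]))  # 3 -> positions 0, 1, 2 (end excluded)
```

Same-looking slice, different row counts. That off-by-one is the single most common loc/iloc bug.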


Why iloc Breaks After Filtering

This is the thing that catches everyone.

high_earners = df[df["salary"] > 65000]
print(high_earners)
print()
print(high_earners.iloc[0])  # first row of filtered df
print()
try:
    print(high_earners.loc[0])  # row with label 0
except KeyError as e:
    print(f"KeyError: {e}")

Output:

     name  age  salary   department  score
1   Priya   30   82000    Marketing     92
3  Jordan   35   95000        Sales     85
4    Lisa   28   67000    Marketing     91
5    Ravi   31   71000  Engineering     79

name              Priya
age                  30
salary            82000
department    Marketing
score                92

KeyError: 0

After filtering, the index labels are 1, 3, 4, 5. Row 0 no longer exists. loc[0] fails because there is no row labeled 0.

iloc[0] still works because it means "first row of this DataFrame, whatever its label is."

Use iloc when you want positional access after filtering. Use loc when you want to select by label and you know the labels.

Reset the index if you want to work with 0-based labels again:

high_earners_reset = high_earners.reset_index(drop=True)
print(high_earners_reset)

Output:

     name  age  salary   department  score
0   Priya   30   82000    Marketing     92
1  Jordan   35   95000        Sales     85
2    Lisa   28   67000    Marketing     91
3    Ravi   31   71000  Engineering     79

Labels now start at 0 again. drop=True removes the old index instead of saving it as a column.
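If you would rather keep the old labels than discard them, leave out drop=True and they survive as a regular column (named "index" by default). A small sketch on a cut-down frame:

```python
import pandas as pd

pay = pd.DataFrame({"salary": [55000, 82000, 43000, 95000, 67000, 71000]})
high = pay[pay["salary"] > 65000]   # index labels: 1, 3, 4, 5

kept = high.reset_index()           # old labels preserved in an "index" column
print(kept)
```

Handy when the original positions carry meaning, like row numbers in a source file.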


Boolean Filtering: The Power Move

You already saw basic filtering in the Pandas post. This is the deep version.

Single condition:

seniors = df[df["age"] >= 30]
engineers = df[df["department"] == "Engineering"]
high_scorers = df[df["score"] > 85]

Multiple conditions with AND:

eng_seniors = df[(df["department"] == "Engineering") & (df["age"] >= 28)]
print(eng_seniors[["name", "age", "department"]])

Output:

   name  age   department
5  Ravi   31  Engineering

Multiple conditions with OR:

marketing_or_high = df[(df["department"] == "Marketing") | (df["score"] > 90)]
print(marketing_or_high[["name", "department", "score"]])

Output:

    name  department  score
1  Priya   Marketing     92
4   Lisa   Marketing     91

Negation with ~:

not_sales = df[~(df["department"] == "Sales")]
print(not_sales["department"].unique())

Output:

['Engineering' 'Marketing']

isin() for matching against a list:

selected_depts = df[df["department"].isin(["Engineering", "Marketing"])]
print(selected_depts["department"].value_counts())

Output:

Engineering    3
Marketing      2
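isin() also pairs with ~ for exclusion, which reads better than chaining several != comparisons:

```python
import pandas as pd

departments = pd.Series(["Engineering", "Marketing", "Engineering", "Sales",
                         "Marketing", "Engineering"])

# Everything except Sales and Marketing
not_these = departments[~departments.isin(["Sales", "Marketing"])]
print(not_these.unique())  # ['Engineering']
```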

between() for range checks:

mid_range = df[df["salary"].between(60000, 85000)]
print(mid_range[["name", "salary"]])

Output:

    name  salary
1  Priya   82000
4   Lisa   67000
5   Ravi   71000
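between() includes both endpoints by default. In pandas 1.3 and later you can change that with the inclusive parameter, which accepts "both", "left", "right", or "neither":

```python
import pandas as pd

salaries = pd.Series([60000, 67000, 82000, 85000])

print(salaries.between(60000, 85000).sum())                       # 4: endpoints count
print(salaries.between(60000, 85000, inclusive="neither").sum())  # 2: strict inequality
```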

str.contains() for partial string matching:

data2 = df.copy()
data2.loc[5, "name"] = "Ravi Kumar"

kumar = data2[data2["name"].str.contains("Kumar", case=False, na=False)]
print(kumar[["name", "department"]])
Output:

         name   department
5  Ravi Kumar  Engineering

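One caveat: str.contains treats the pattern as a regular expression by default, so "Mrs." matches "Mrs" followed by any character. Pass regex=False for a literal match, or use str.startswith / str.endswith for anchored checks. A small sketch on made-up names:

```python
import pandas as pd

names = pd.Series(["Ravi Kumar", "Mrs. Lisa Ray", "Mr. Sam Lee"])

print(names.str.contains("Mrs.", regex=False))  # literal match: only the second name
print(names.str.startswith("Mr"))               # anchored: matches "Mr." and "Mrs."
```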

Combining Selection and Filtering

The real pattern: filter rows first, then select only the columns you care about.

result = df.loc[df["salary"] > 65000, ["name", "salary", "department"]]
print(result)

Output:

     name  salary   department
1   Priya   82000    Marketing
3  Jordan   95000        Sales
4    Lisa   67000    Marketing
5    Ravi   71000  Engineering

df.loc[condition, columns] in one expression. Row selection and column selection together. This is the pattern you will use constantly.
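This pattern scales up to the sentence from the intro. A sketch on synthetic data (the six-row example is too small, so the frame here is invented): rows 50 through 200 where salary beats the median and department is not Sales, showing only name, age, and salary.

```python
import numpy as np
import pandas as pd

# Synthetic 300-row frame standing in for a real dataset
rng = np.random.default_rng(42)
big = pd.DataFrame({
    "name":       [f"emp_{i}" for i in range(300)],
    "age":        rng.integers(22, 60, size=300),
    "salary":     rng.integers(40000, 120000, size=300),
    "department": rng.choice(["Sales", "Engineering", "Marketing"], size=300),
})

median_salary = big["salary"].median()

# Rows 50 through 200 by position, then filter by label, then pick columns
result = big.iloc[50:201].loc[
    lambda d: (d["salary"] > median_salary) & (d["department"] != "Sales"),
    ["name", "age", "salary"],
]
print(result.head())
```

The callable inside loc receives the already-sliced frame, so the filter and the slice compose cleanly in one expression.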

Sort the result:

result_sorted = df.loc[df["salary"] > 65000, ["name", "salary", "department"]] \
                  .sort_values("salary", ascending=False)
print(result_sorted)

Output:

     name  salary   department
3  Jordan   95000        Sales
1   Priya   82000    Marketing
5    Ravi   71000  Engineering
4    Lisa   67000    Marketing

Method chaining. Filter, select columns, sort. All in one readable expression.


Modifying Values in Place

Filtering selects. loc can also assign.

df.loc[df["department"] == "Sales", "score"] += 5
print(df[df["department"] == "Sales"][["name", "department", "score"]])

Output:

     name  department  score
3  Jordan       Sales     90

Jordan's score went from 85 to 90. Only the Sales rows were affected. Everyone else unchanged.

Never do this:

df[df["department"] == "Sales"]["score"] += 5   # wrong

This is chained indexing. It creates a copy, modifies the copy, throws it away. The original DataFrame does not change. Pandas usually warns you with SettingWithCopyWarning. Use .loc for assignments. Always.
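When you genuinely want an independent sub-DataFrame to modify, make the copy explicit with .copy() so the intent is clear and no warning fires. A general sketch, not tied to the example above:

```python
import pandas as pd

mini = pd.DataFrame({"department": ["Sales", "Engineering"], "score": [85, 88]})

sales = mini[mini["department"] == "Sales"].copy()  # explicit, independent copy
sales["score"] += 5

print(sales["score"].tolist())  # [90]
print(mini["score"].tolist())   # original untouched: [85, 88]
```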


Quick Reference

df.iloc[row_number]                         # single row by position
df.iloc[start:end]                          # slice by position
df.iloc[row, col]                           # single value by position
df.loc[label]                               # single row by label
df.loc[start:end, col_name]                 # slice by label + column name
df.loc[bool_series, columns]                # filter + select columns
df[condition]                               # shorthand filter (rows only)
df[col_name]                                # single column
df[[col1, col2]]                            # multiple columns
df.query("salary > 60000 and age < 30")    # SQL-style string filter

df.query() is worth knowing. It accepts a string expression and is sometimes more readable than boolean indexing for complex conditions.

result = df.query("salary > 60000 and age < 30")
print(result[["name", "age", "salary"]])

Output:

   name  age  salary
4  Lisa   28   67000
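query() can also reference local Python variables with an @ prefix, which keeps thresholds out of the string:

```python
import pandas as pd

people = pd.DataFrame({"salary": [55000, 82000, 67000], "age": [25, 30, 28]})

min_salary = 60000
young_earners = people.query("salary > @min_salary and age < 30")
print(young_earners)  # only the 67000 / 28 row qualifies
```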

*args and **kwargs: Why They Matter Here

You have been calling Pandas methods with keyword arguments this whole series.

df.sort_values("salary", ascending=False)
df.fillna({"age": 0, "salary": df["salary"].median()})
pd.read_csv("file.csv", sep=";", encoding="utf-8", nrows=1000)

Those extra named arguments like ascending, encoding, nrows are kwargs. Keyword arguments. They have names and default values. You only pass them when you want to override the default.

*args and **kwargs are how Python functions accept a variable number of arguments. They are worth understanding deeply because you will use them when writing your own utility functions for data work.

def summarize(*columns, df=None, **agg_funcs):
    if df is None:
        return
    for col in columns:
        print(f"\n{col}:")
        for func_name, func in agg_funcs.items():
            print(f"  {func_name}: {func(df[col])}")

summarize("age", "salary", df=df, mean=lambda x: x.mean().round(1), maximum=max)

Output:

age:
  mean: 28.5
  maximum: 35

salary:
  mean: 68833.3
  maximum: 95000

*columns collects any number of positional arguments into a tuple. **agg_funcs collects any number of keyword arguments into a dictionary. This pattern shows up in utility functions, wrappers, and decorators. A dedicated deep-dive on *args and **kwargs is coming up after the core data tools posts.
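A common day-to-day use of **kwargs in data work is a thin wrapper that forwards options to a library call. A sketch (load_csv and its defaults are invented for illustration):

```python
import pandas as pd


def load_csv(path, **read_kwargs):
    """Hypothetical project helper: apply defaults, let callers override."""
    options = {"encoding": "utf-8"}
    options.update(read_kwargs)          # the caller's kwargs win
    return pd.read_csv(path, **options)
```

Calling load_csv("data.csv", sep=";", nrows=1000) behaves exactly like pd.read_csv with those options layered on top of the defaults, and the wrapper never needs to know what options exist.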


Try This

Create filtering_practice.py.

Load the Titanic dataset (from the last post's practice).

Using only loc and iloc, do every one of these without loops:

Select the Name, Age, and Survived columns for the first 10 passengers using iloc.

Find all female passengers who survived. Show only Name, Age, and Fare.

Find passengers who paid more than 50 for their ticket and were in first class. How many are there?

Use str.contains to find passengers whose name contains "Mrs." Show their names and ages.

Select the last 5 rows of the dataset using iloc negative indexing.

Use df.query() to find passengers younger than 18 who survived. Compare the result to the boolean indexing version. They should be identical.

For each result, chain a .sort_values() to sort by age descending.


What's Next

You can select any slice of your data now. The next step is finding patterns across groups. Average salary per department. Count of survivors per passenger class. Maximum score per category. That is groupby, and it is one of the most powerful tools in data analysis.
