Pandas is the Swiss Army knife of data processing. Here are patterns and tricks that separate beginners from power users.
## Reading Data Efficiently

```python
import pandas as pd

# Only read the columns you need
df = pd.read_csv("huge_file.csv", usecols=["name", "email", "amount"])

# Parse dates automatically
df = pd.read_csv("sales.csv", parse_dates=["order_date", "ship_date"])

# Handle large files in chunks
chunks = pd.read_csv("massive.csv", chunksize=10000)
result = pd.concat([chunk[chunk["status"] == "active"] for chunk in chunks])
```
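If even the filtered rows won't fit in memory, you can aggregate each chunk as it streams by and combine the partial results. A minimal sketch, assuming a CSV with `status` and `amount` columns (the file name and helper are illustrative):

```python
import pandas as pd

def chunked_sum(path, chunksize=10000):
    """Sum 'amount' for active rows without loading the whole file."""
    total = 0.0
    for chunk in pd.read_csv(path, chunksize=chunksize):
        # Each chunk is an ordinary DataFrame; aggregate and discard it
        total += chunk.loc[chunk["status"] == "active", "amount"].sum()
    return total
```

The same shape works for counts, min/max, or per-group sums accumulated in a dict.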
## Method Chaining
Write clean, readable transformations:
```python
result = (
    pd.read_csv("orders.csv")
    .query("amount > 100")
    .assign(
        order_date=lambda df: pd.to_datetime(df["order_date"]),
        tax=lambda df: df["amount"] * 0.08,
        total=lambda df: df["amount"] + df["tax"],
    )
    .sort_values("total", ascending=False)
    .head(20)
)
```
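Chains also accommodate custom steps via `.pipe()`, which passes the DataFrame to any function and keeps the chain flat. A sketch with a hypothetical `add_tax` helper (the function name and sample data are assumptions, not part of the original example):

```python
import pandas as pd

def add_tax(df, rate=0.08):
    # Hypothetical helper: returns a copy with tax and total columns added.
    # assign evaluates keyword arguments in order, so the total lambda
    # can already see the freshly created tax column.
    return df.assign(
        tax=df["amount"] * rate,
        total=lambda d: d["amount"] + d["tax"],
    )

orders = pd.DataFrame({"amount": [50.0, 150.0, 300.0]})
result = (
    orders
    .query("amount > 100")
    .pipe(add_tax, rate=0.10)
    .sort_values("total", ascending=False)
)
```

`.pipe()` keeps project-specific logic testable on its own while staying readable inside the chain.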
## GroupBy Patterns

```python
# Multiple aggregations with named output columns
summary = (
    df.groupby("category")
    .agg(
        total_sales=("amount", "sum"),
        avg_order=("amount", "mean"),
        order_count=("order_id", "count"),
        unique_customers=("customer_id", "nunique"),
    )
    .sort_values("total_sales", ascending=False)
)

# Transform: add group-level stats back to each row
df["category_avg"] = df.groupby("category")["amount"].transform("mean")
df["pct_of_category"] = df["amount"] / df["category_avg"]
```
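Alongside `agg` and `transform`, `groupby().filter()` keeps or drops whole groups based on a group-level condition. A small sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({
    "category": ["a", "a", "b", "c", "c", "c"],
    "amount": [10, 20, 5, 1, 2, 3],
})

# Keep only rows belonging to categories with at least 2 orders
big_groups = df.groupby("category").filter(lambda g: len(g) >= 2)
```

Unlike `agg`, the result keeps the original row-level shape, just with the small groups removed.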
## Vectorized String Operations

```python
# Clean messy data
df["email"] = df["email"].str.lower().str.strip()
df["domain"] = df["email"].str.split("@").str[1]
df["first_name"] = df["full_name"].str.split().str[0]

# Filter with string methods; regex=False makes the dot literal,
# and na=False treats missing emails as non-matches
gmail_users = df[df["email"].str.contains("gmail.com", na=False, regex=False)]
```
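For pulling structured pieces out of strings, `.str.extract()` with named regex groups turns each group into a column. A sketch on assumed sample data:

```python
import pandas as pd

df = pd.DataFrame({"email": ["alice@gmail.com", "bob@work.org", None]})

# Named groups (?P<name>...) become column names in the result
parts = df["email"].str.extract(r"(?P<user>[^@]+)@(?P<domain>.+)")
```

Missing or non-matching values come back as `NaN`, so no pre-filtering is needed.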
## Handling Missing Data

```python
# See what's missing
print(df.isnull().sum())

# Fill strategically: per-category median price
df["price"] = df["price"].fillna(
    df.groupby("category")["price"].transform("median")
)

# Drop rows missing critical fields
df = df.dropna(subset=["email", "order_id"])
```
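Missing values often hide inside string columns as placeholders like `"N/A"`. `pd.to_numeric(errors="coerce")` converts what it can and turns the rest into real `NaN`, which the fill strategies above can then handle. A sketch with assumed placeholder data:

```python
import pandas as pd

raw = pd.Series(["19.99", "N/A", "5"])

# "N/A" can't be parsed, so coerce turns it into NaN
prices = pd.to_numeric(raw, errors="coerce")
filled = prices.fillna(prices.median())
```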
## Merge and Join

```python
# Like a SQL LEFT JOIN
orders = pd.merge(
    orders_df,
    customers_df,
    on="customer_id",
    how="left",
)

# Chain merges across multiple DataFrames
from functools import reduce

dfs = [orders_df, products_df, shipping_df]
merged = reduce(lambda left, right: pd.merge(left, right, on="order_id"), dfs)
```
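Merges fail silently when keys unexpectedly duplicate or don't match. The `validate` and `indicator` parameters of `pd.merge` catch both; a sketch on assumed data where one order has no matching customer:

```python
import pandas as pd

orders_df = pd.DataFrame({"customer_id": [1, 2, 3], "amount": [10, 20, 30]})
customers_df = pd.DataFrame({"customer_id": [1, 2], "name": ["Ann", "Bo"]})

merged = pd.merge(
    orders_df,
    customers_df,
    on="customer_id",
    how="left",
    validate="many_to_one",  # raise if customers_df has duplicate ids
    indicator=True,          # adds a _merge column: both / left_only / right_only
)

# Orders that found no matching customer
unmatched = merged[merged["_merge"] == "left_only"]
```

Checking `unmatched` after a join is a cheap way to spot key mismatches before they corrupt downstream numbers.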
## Performance Tips

```python
# Use categories for low-cardinality string columns (big memory savings)
df["status"] = df["status"].astype("category")

# query() reads better than boolean indexing
# Instead of: df[(df["age"] > 25) & (df["city"] == "NYC")]
df.query("age > 25 and city == 'NYC'")

# Avoid iterrows(); use vectorized operations
# Slow:
for idx, row in df.iterrows():
    df.at[idx, "total"] = row["price"] * row["qty"]

# Fast:
df["total"] = df["price"] * df["qty"]
```
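Row loops often sneak back in for if/else logic. `np.where` handles a single condition and `np.select` handles several, both fully vectorized. A sketch with assumed thresholds:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"amount": [50, 150, 600]})

# Conditions are checked in order; first match wins
conditions = [df["amount"] >= 500, df["amount"] >= 100]
choices = ["large", "medium"]
df["size"] = np.select(conditions, choices, default="small")
```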
## Export Patterns

```python
# To JSON (for APIs)
df.to_json("output.json", orient="records", indent=2)

# To Excel with multiple sheets
with pd.ExcelWriter("report.xlsx", engine="openpyxl") as writer:
    summary.to_excel(writer, sheet_name="Summary")
    details.to_excel(writer, sheet_name="Details")

# To SQL
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:pass@localhost/mydb")
df.to_sql("orders", engine, if_exists="replace", index=False)
```
## Key Takeaways

- Use method chaining for readable transformations
- `groupby().agg()` with named aggregations is powerful
- Vectorized operations are often 100x faster than loops
- Use the `category` dtype for memory efficiency
- Read only the columns and rows you need
- `query()` is more readable than boolean indexing
Master these patterns and you'll process data in minutes, not hours.