Pandas is the Swiss Army knife of data processing. Here are patterns and tricks that separate beginners from power users.
## Reading Data Efficiently

```python
import pandas as pd

# Only read the columns you need
df = pd.read_csv("huge_file.csv", usecols=["name", "email", "amount"])

# Parse dates automatically
df = pd.read_csv("sales.csv", parse_dates=["order_date", "ship_date"])

# Handle large files in chunks
chunks = pd.read_csv("massive.csv", chunksize=10000)
result = pd.concat([chunk[chunk["status"] == "active"] for chunk in chunks])
```
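If even the filtered rows won't fit in memory, you can aggregate each chunk as it streams by and combine the partial results. A minimal sketch, assuming a CSV with `status` and `amount` columns (the file name and helper are illustrative):

```python
import pandas as pd

def chunked_sum(path, chunksize=10000):
    """Sum 'amount' for active rows without loading the whole file."""
    total = 0.0
    for chunk in pd.read_csv(path, chunksize=chunksize):
        # Each chunk is an ordinary DataFrame; aggregate and discard it
        total += chunk.loc[chunk["status"] == "active", "amount"].sum()
    return total
```

The same shape works for counts, min/max, or per-group sums accumulated in a dict.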
## Method Chaining
Write clean, readable transformations:
```python
result = (
    pd.read_csv("orders.csv")
    .query("amount > 100")
    .assign(
        order_date=lambda df: pd.to_datetime(df["order_date"]),
        tax=lambda df: df["amount"] * 0.08,
        total=lambda df: df["amount"] + df["tax"],
    )
    .sort_values("total", ascending=False)
    .head(20)
)
```
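Chains also accommodate custom steps via `.pipe()`, which passes the DataFrame to any function and keeps the chain flat. A sketch with a hypothetical `add_tax` helper (the function name and sample data are assumptions, not part of the original example):

```python
import pandas as pd

def add_tax(df, rate=0.08):
    # Hypothetical helper: returns a copy with tax and total columns added.
    # assign evaluates keyword arguments in order, so the total lambda
    # can already see the freshly created tax column.
    return df.assign(
        tax=df["amount"] * rate,
        total=lambda d: d["amount"] + d["tax"],
    )

orders = pd.DataFrame({"amount": [50.0, 150.0, 300.0]})
result = (
    orders
    .query("amount > 100")
    .pipe(add_tax, rate=0.10)
    .sort_values("total", ascending=False)
)
```

`.pipe()` keeps project-specific logic testable on its own while staying readable inside the chain.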
## GroupBy Patterns

```python
# Multiple aggregations with named output columns
summary = (
    df.groupby("category")
    .agg(
        total_sales=("amount", "sum"),
        avg_order=("amount", "mean"),
        order_count=("order_id", "count"),
        unique_customers=("customer_id", "nunique"),
    )
    .sort_values("total_sales", ascending=False)
)

# Transform: add group-level stats back to each row
df["category_avg"] = df.groupby("category")["amount"].transform("mean")
df["pct_of_category"] = df["amount"] / df["category_avg"]
```
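Alongside `agg` and `transform`, `groupby().filter()` keeps or drops whole groups based on a group-level condition. A small sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({
    "category": ["a", "a", "b", "c", "c", "c"],
    "amount": [10, 20, 5, 1, 2, 3],
})

# Keep only rows belonging to categories with at least 2 orders
big_groups = df.groupby("category").filter(lambda g: len(g) >= 2)
```

Unlike `agg`, the result keeps the original row-level shape, just with the small groups removed.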
## Vectorized String Operations

```python
# Clean messy data
df["email"] = df["email"].str.lower().str.strip()
df["domain"] = df["email"].str.split("@").str[1]
df["first_name"] = df["full_name"].str.split().str[0]

# Filter with string methods; regex=False makes the dot literal,
# and na=False treats missing emails as non-matches
gmail_users = df[df["email"].str.contains("gmail.com", na=False, regex=False)]
```
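For pulling structured pieces out of strings, `.str.extract()` with named regex groups turns each group into a column. A sketch on assumed sample data:

```python
import pandas as pd

df = pd.DataFrame({"email": ["alice@gmail.com", "bob@work.org", None]})

# Named groups (?P<name>...) become column names in the result
parts = df["email"].str.extract(r"(?P<user>[^@]+)@(?P<domain>.+)")
```

Missing or non-matching values come back as `NaN`, so no pre-filtering is needed.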
## Handling Missing Data

```python
# See what's missing
print(df.isnull().sum())

# Fill strategically: per-category median price
df["price"] = df["price"].fillna(
    df.groupby("category")["price"].transform("median")
)

# Drop rows missing critical fields
df = df.dropna(subset=["email", "order_id"])
```
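Missing values often hide inside string columns as placeholders like `"N/A"`. `pd.to_numeric(errors="coerce")` converts what it can and turns the rest into real `NaN`, which the fill strategies above can then handle. A sketch with assumed placeholder data:

```python
import pandas as pd

raw = pd.Series(["19.99", "N/A", "5"])

# "N/A" can't be parsed, so coerce turns it into NaN
prices = pd.to_numeric(raw, errors="coerce")
filled = prices.fillna(prices.median())
```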
## Merge and Join

```python
# Like a SQL LEFT JOIN
orders = pd.merge(
    orders_df,
    customers_df,
    on="customer_id",
    how="left",
)

# Chain merges across multiple DataFrames
from functools import reduce

dfs = [orders_df, products_df, shipping_df]
merged = reduce(lambda left, right: pd.merge(left, right, on="order_id"), dfs)
```
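Merges fail silently when keys unexpectedly duplicate or don't match. The `validate` and `indicator` parameters of `pd.merge` catch both; a sketch on assumed data where one order has no matching customer:

```python
import pandas as pd

orders_df = pd.DataFrame({"customer_id": [1, 2, 3], "amount": [10, 20, 30]})
customers_df = pd.DataFrame({"customer_id": [1, 2], "name": ["Ann", "Bo"]})

merged = pd.merge(
    orders_df,
    customers_df,
    on="customer_id",
    how="left",
    validate="many_to_one",  # raise if customers_df has duplicate ids
    indicator=True,          # adds a _merge column: both / left_only / right_only
)

# Orders that found no matching customer
unmatched = merged[merged["_merge"] == "left_only"]
```

Checking `unmatched` after a join is a cheap way to spot key mismatches before they corrupt downstream numbers.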
## Performance Tips

```python
# Use categories for low-cardinality string columns (big memory savings)
df["status"] = df["status"].astype("category")

# query() reads better than boolean indexing
# Instead of: df[(df["age"] > 25) & (df["city"] == "NYC")]
df.query("age > 25 and city == 'NYC'")

# Avoid iterrows(); use vectorized operations
# Slow:
for idx, row in df.iterrows():
    df.at[idx, "total"] = row["price"] * row["qty"]

# Fast:
df["total"] = df["price"] * df["qty"]
```
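Row loops often sneak back in for if/else logic. `np.where` handles a single condition and `np.select` handles several, both fully vectorized. A sketch with assumed thresholds:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"amount": [50, 150, 600]})

# Conditions are checked in order; first match wins
conditions = [df["amount"] >= 500, df["amount"] >= 100]
choices = ["large", "medium"]
df["size"] = np.select(conditions, choices, default="small")
```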
## Export Patterns

```python
# To JSON (for APIs)
df.to_json("output.json", orient="records", indent=2)

# To Excel with multiple sheets
with pd.ExcelWriter("report.xlsx", engine="openpyxl") as writer:
    summary.to_excel(writer, sheet_name="Summary")
    details.to_excel(writer, sheet_name="Details")

# To SQL
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:pass@localhost/mydb")
df.to_sql("orders", engine, if_exists="replace", index=False)
```
## Key Takeaways

- Use method chaining for readable transformations
- `groupby().agg()` with named aggregations is powerful
- Vectorized operations are often 100x faster than loops
- Use the `category` dtype for memory efficiency
- Read only the columns and rows you need
- `query()` is more readable than boolean indexing
Master these patterns and you'll process data in minutes, not hours.