DEV Community

郑沛沛
郑沛沛

Posted on

Data Processing in Python: Pandas Tricks That Will Save You Hours

Pandas is the Swiss Army knife of data processing. Here are patterns and tricks that separate beginners from power users.

Reading Data Efficiently

import pandas as pd

# Only read columns you need
df = pd.read_csv("huge_file.csv", usecols=["name", "email", "amount"])

# Parse dates automatically
df = pd.read_csv("sales.csv", parse_dates=["order_date", "ship_date"])

# Handle large files in chunks
chunks = pd.read_csv("massive.csv", chunksize=10000)
result = pd.concat([chunk[chunk["status"] == "active"] for chunk in chunks])
Enter fullscreen mode Exit fullscreen mode

Method Chaining

Write clean, readable transformations:

result = (
    pd.read_csv("orders.csv")
    .query("amount > 100")
    .assign(
        order_date=lambda df: pd.to_datetime(df["order_date"]),
        tax=lambda df: df["amount"] * 0.08,
        total=lambda df: df["amount"] + df["tax"],
    )
    .sort_values("total", ascending=False)
    .head(20)
)
Enter fullscreen mode Exit fullscreen mode

GroupBy Patterns

# Multiple aggregations
summary = (
    df.groupby("category")
    .agg(
        total_sales=("amount", "sum"),
        avg_order=("amount", "mean"),
        order_count=("order_id", "count"),
        unique_customers=("customer_id", "nunique"),
    )
    .sort_values("total_sales", ascending=False)
)

# Transform: add group-level stats back to each row
df["category_avg"] = df.groupby("category")["amount"].transform("mean")
df["pct_of_category"] = df["amount"] / df["category_avg"]
Enter fullscreen mode Exit fullscreen mode

Vectorized String Operations

# Clean messy data
df["email"] = df["email"].str.lower().str.strip()
df["domain"] = df["email"].str.split("@").str[1]
df["first_name"] = df["full_name"].str.split().str[0]

# Filter with string methods
gmail_users = df[df["email"].str.contains("gmail.com", na=False)]
Enter fullscreen mode Exit fullscreen mode

Handling Missing Data

# See what's missing
print(df.isnull().sum())

# Fill strategically
df["price"] = df["price"].fillna(df.groupby("category")["price"].transform("median"))

# Drop rows missing critical fields
df = df.dropna(subset=["email", "order_id"])
Enter fullscreen mode Exit fullscreen mode

Merge and Join

# Like SQL JOIN
orders = pd.merge(
    orders_df,
    customers_df,
    on="customer_id",
    how="left"
)

# Multiple DataFrames
from functools import reduce
dfs = [orders_df, products_df, shipping_df]
merged = reduce(lambda left, right: pd.merge(left, right, on="order_id"), dfs)
Enter fullscreen mode Exit fullscreen mode

Performance Tips

# Use categories for low-cardinality strings
df["status"] = df["status"].astype("category")  # huge memory savings

# Use query() instead of boolean indexing for readability
# Instead of: df[(df["age"] > 25) & (df["city"] == "NYC")]
df.query("age > 25 and city == 'NYC'")

# Avoid iterrows! Use vectorized operations
# Slow:
for idx, row in df.iterrows():
    df.at[idx, "total"] = row["price"] * row["qty"]

# Fast:
df["total"] = df["price"] * df["qty"]
Enter fullscreen mode Exit fullscreen mode

Export Patterns

# To JSON (for APIs)
df.to_json("output.json", orient="records", indent=2)

# To Excel with formatting
with pd.ExcelWriter("report.xlsx", engine="openpyxl") as writer:
    summary.to_excel(writer, sheet_name="Summary")
    details.to_excel(writer, sheet_name="Details")

# To SQL
from sqlalchemy import create_engine
engine = create_engine("postgresql://user:pass@localhost/mydb")
df.to_sql("orders", engine, if_exists="replace", index=False)
Enter fullscreen mode Exit fullscreen mode

Key Takeaways

  1. Use method chaining for readable transformations
  2. groupby().agg() with named aggregations is powerful
  3. Vectorized operations are 100x faster than loops
  4. Use category dtype for memory efficiency
  5. Read only the columns and rows you need
  6. query() is more readable than boolean indexing

Master these patterns and you'll process data in minutes, not hours.

🚀 Level up your AI workflow! Check out my AI Developer Mega Prompt Pack — 80 battle-tested prompts for developers. $9.99

Top comments (0)