DEV Community

WDSEGA
WDSEGA

Posted on

Python数据处理10个实用模板:从CSV清洗到VLOOKUP匹配

在日常工作中,数据处理是最常见的重复性任务之一。本文整理了10个基于Python pandas库的实用数据处理模板,涵盖从CSV清洗到VLOOKUP匹配的常见场景。

1. 快速加载与预览数据

import pandas as pd

df = pd.read_csv('data.csv', encoding='utf-8')
df_excel = pd.read_excel('data.xlsx', sheet_name='Sheet1')

print(df.head(10))       # 前10行
print(df.info())         # 数据类型和空值概览
print(df.describe())     # 数值列统计摘要
Enter fullscreen mode Exit fullscreen mode

小技巧:大文件用 nrows 参数先预览前100行,确认格式无误再全量加载。

2. 数据清洗:去重与空值处理

# 去除完全重复的行
df = df.drop_duplicates()

# 填充空值
df['age'] = df['age'].fillna(0)
df['name'] = df['name'].fillna('未知')
df['salary'] = df['salary'].fillna(df['salary'].median())

# 删除空值超过30%的列
threshold = len(df) * 0.7
df = df.dropna(thresh=threshold, axis=1)
Enter fullscreen mode Exit fullscreen mode

3. 字符串清洗与标准化

df['name'] = df['name'].str.strip()
df['category'] = df['category'].str.upper()
df['phone'] = df['phone'].str.replace('-', '').str.replace(' ', '')

# 批量字符串替换
mapping = {'北京': 'BJ', '上海': 'SH', '广州': 'GZ'}
df['city_code'] = df['city'].map(mapping).fillna(df['city'])
Enter fullscreen mode Exit fullscreen mode

4. 数据类型转换

df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d')
df['month'] = df['date'].dt.month
df['price'] = pd.to_numeric(df['price'], errors='coerce')
df['status'] = df['status'].astype('category')
Enter fullscreen mode Exit fullscreen mode

5. VLOOKUP匹配(双表关联)

这是Excel用户最常用的功能,在pandas中用 merge 实现:

# 左连接(类似VLOOKUP)
result = orders.merge(
    users[['user_id', 'name', 'city', 'phone']],
    on='user_id',
    how='left'
)

# 多列匹配
result = orders.merge(users, on=['dept_id', 'team_id'], how='left')

# 不同列名匹配
result = orders.merge(users, left_on='customer_id', right_on='user_id', how='left')
Enter fullscreen mode Exit fullscreen mode

6. 数据透视表

pivot = df.pivot_table(
    values='amount', index='category', columns='month',
    aggfunc='sum', fill_value=0
)

# 分组统计
stats = df.groupby('department').agg(
    total_revenue=('amount', 'sum'),
    avg_salary=('salary', 'mean'),
    headcount=('employee_id', 'count')
).reset_index()
Enter fullscreen mode Exit fullscreen mode

7. 批量处理多个文件

import glob

all_files = glob.glob('data/*.csv')
dfs = [pd.read_csv(f, encoding='utf-8') for f in all_files]
merged = pd.concat(dfs, ignore_index=True)

# 按条件拆分
for name, group in df.groupby('region'):
    group.to_csv(f'output/{name}.csv', index=False, encoding='utf-8-sig')
Enter fullscreen mode Exit fullscreen mode

8. 数据对比与差异分析

# 新增的行
new_rows = df_new.merge(df_old, how='left', indicator=True)
added = new_rows[new_rows['_merge'] == 'right_only']

# 数值变化
comparison['change_pct'] = (
    (comparison['revenue_new'] - comparison['revenue_old'])
    / comparison['revenue_old'] * 100
).round(2)
Enter fullscreen mode Exit fullscreen mode

核心要点

  1. 先预览再处理:用 head()info() 了解数据全貌
  2. 善用链式操作:pandas支持方法链,代码更简洁
  3. 注意编码问题:中文数据建议用 utf-8-sig 编码
  4. 备份原始数据:处理前先备份,避免不可逆操作

Read more at My Blog

Top comments (0)