When merging population Excel data and GDP Excel data in Pandas, how can we ensure that the "prefecture/city name" field in the two tables is matched accurately (for example, it is "Liangshan Prefecture" in one statistical dataset and "Liangshan Yi Autonomous Prefecture" in the other)? Please write the core code logic.
For further actions, you may consider blocking this person and/or reporting abuse
Top comments (1)
Step 1: Install Required Library
First, install thefuzz (maintained fork of fuzzywuzzy):
bash
运行
pip install thefuzz python-Levenshtein # python-Levenshtein speeds up matching
Step 2: Core Code Logic
python
运行
import pandas as pd
from thefuzz import process
Load the two datasets
df_pop = pd.read_excel("population_data.xlsx") # has "prefecture/city name"
df_gdp = pd.read_excel("gdp_data.xlsx") # has "prefecture/city name"
Create a mapping: for each name in df_pop, find the BEST match in df_gdp's names
def map_names(name, candidate_list, threshold=80):
# Get the best match (score >= threshold)
match, score = process.extractOne(name, candidate_list)
return match if score >= threshold else None # return None if no good match
Apply mapping to df_pop (or df_gdp, depending on which needs standardization)
df_gdp_names = df_gdp["prefecture/city name"].tolist()
df_pop["standardized_name"] = df_pop["prefecture/city name"].apply(
lambda x: map_names(x, df_gdp_names)
)
Merge the datasets using the standardized name
merged_df = pd.merge(
df_pop, df_gdp,
left_on="standardized_name",
right_on="prefecture/city name",
how="inner" # adjust "how" (e.g., "left") based on your needs
)
Key Notes:
threshold=80: Adjust this (0–100) based on how "strict" you want matches to be (higher = stricter).
For large datasets: Use process.extract with limit or optimize with rapidfuzz (faster alternative).
After mapping: Inspect rows where standardized_name is None (no match) and handle them manually (e.g., fix typos).