DEV Community

kettle
kettle

Posted on

PYTHON QUESTION

When merging population Excel data and GDP Excel data in Pandas, how can we ensure that the "prefecture/city name" field in the two tables is matched accurately (for example, it is "Liangshan Prefecture" in one statistical dataset and "Liangshan Yi Autonomous Prefecture" in the other)? Please write the core code logic.

Top comments (1)

Collapse
 
213123213213 profile image
123

Step 1: Install Required Library
First, install thefuzz (maintained fork of fuzzywuzzy):
bash
运行
pip install thefuzz python-Levenshtein # python-Levenshtein speeds up matching
Step 2: Core Code Logic
python
运行
import pandas as pd
from thefuzz import process

Load the two datasets

df_pop = pd.read_excel("population_data.xlsx") # has "prefecture/city name"
df_gdp = pd.read_excel("gdp_data.xlsx") # has "prefecture/city name"

Create a mapping: for each name in df_pop, find the BEST match in df_gdp's names

def map_names(name, candidate_list, threshold=80):
# Get the best match (score >= threshold)
match, score = process.extractOne(name, candidate_list)
return match if score >= threshold else None # return None if no good match

Apply mapping to df_pop (or df_gdp, depending on which needs standardization)

df_gdp_names = df_gdp["prefecture/city name"].tolist()
df_pop["standardized_name"] = df_pop["prefecture/city name"].apply(
lambda x: map_names(x, df_gdp_names)
)

Merge the datasets using the standardized name

merged_df = pd.merge(
df_pop, df_gdp,
left_on="standardized_name",
right_on="prefecture/city name",
how="inner" # adjust "how" (e.g., "left") based on your needs
)
Key Notes:
threshold=80: Adjust this (0–100) based on how "strict" you want matches to be (higher = stricter).
For large datasets: Use process.extract with limit or optimize with rapidfuzz (faster alternative).
After mapping: Inspect rows where standardized_name is None (no match) and handle them manually (e.g., fix typos).