circobit

Posted on Jun 16

5 Wikipedia Table Patterns That Break Most Scrapers (and How to Fix Them)

#python #webscraping #wikipedia #dataengineering

Wikipedia is the most common source of tabular data on the web. It is also a minefield of edge cases that break naive scrapers.

While building HTML Table Exporter, I collected the five patterns that cause the most problems. Each one is presented below with detection code and a fix.

Pattern 1: Navigation Rows ("v t e")

Problem:

<table>
  <tr>
    <td colspan="5">v t e Countries by population</td>
  </tr>
  <tr>
    <td>Rank</td><td>Country</td><td>Population</td>...
  </tr>
  ...
</table>

The first row holds the "v t e" (view/talk/edit) links to the Wikipedia template page. If the scraper treats row 0 as the header, everything breaks.

pd.read_html output:

          v t e Countries by population
0   Rank                        Country    ...
1      1                          China    ...

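Detection:

import re

def is_nav_row(row_values):
    """Detect a Wikipedia navigation prefix ("v t e") in the first cell."""
    # Use len() rather than truthiness: `not array` raises ValueError
    # for NumPy arrays with more than one element.
    if row_values is None or len(row_values) == 0:
        return False

    first_cell = str(row_values[0]).strip().lower()
    patterns = [
        r'^v\s+t\s+e(\s|$)',        # "v t e " or a bare "v t e"
        r'^v\s*\|\s*t\s*\|\s*e',    # "v | t | e"
        r'^\[v\]\s*\[t\]\s*\[e\]',  # "[v] [t] [e]"
    ]

    return any(re.match(p, first_cell) for p in patterns)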
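As a quick sanity check, the sketch below runs a condensed variation of these patterns against a few sample cell values. The helper name `looks_like_nav_cell` and the sample strings are illustrative assumptions, not part of HTML Table Exporter.

```python
import re

# Variation of the patterns above; `(\s|$)` also accepts a bare "v t e".
NAV_PATTERNS = [
    re.compile(r'^v\s+t\s+e(\s|$)'),
    re.compile(r'^v\s*\|\s*t\s*\|\s*e'),
    re.compile(r'^\[v\]\s*\[t\]\s*\[e\]'),
]

def looks_like_nav_cell(cell):
    """Return True if the cell text starts with a "v t e" style prefix."""
    text = str(cell).strip().lower()
    return any(p.match(text) for p in NAV_PATTERNS)

samples = [
    "v t e Countries by population",
    "V | T | E",
    "[v] [t] [e] Cities",
    "Vote totals",   # starts with "v" but is ordinary text
    "Rank",
]
flags = [looks_like_nav_cell(s) for s in samples]
print(flags)
```

Only the first three samples should be flagged; ordinary cells that merely start with "v" are left alone.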

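Fix:

import pandas as pd

def read_wikipedia_table(url, table_index=0):
    tables = pd.read_html(url)
    df = tables[table_index]

    if is_nav_row(list(df.columns)):
        # pandas promoted the navigation row to the column labels
        # (as in the output above): use the first data row as the header
        df.columns = df.iloc[0]
        df = df.iloc[1:].reset_index(drop=True)
    elif len(df) > 1 and is_nav_row(df.iloc[0].tolist()):
        # The navigation row is the first data row: use the second row as the header
        df.columns = df.iloc[1]
        df = df.iloc[2:].reset_index(drop=True)

    return df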

Pattern 2: Horizontally Duplicated Tables

Problem:

To save vertical space, Wikipedia lays some tables out as several column groups side by side:

| Rank | Name   | Pop  | Rank | Name    | Pop  |
|------|--------|------|------|---------|------|
| 1    | Tokyo  | 37M  | 11   | Paris   | 11M  |
| 2    | Delhi  | 32M  | 12   | Cairo   | 10M  |

Logically, this is a single table whose column structure repeats.

pd.read_html output:

   Rank    Name  Pop  Rank.1   Name.1  Pop.1
0     1   Tokyo  37M      11    Paris    11M
1     2   Delhi  32M      12    Cairo    10M

Pandas sees six columns. Filter on "Name" and you miss half the data.

Detection:

import re

def detect_horizontal_duplication(columns):
    """Check for a repeated column group (Rank, Name, Pop, Rank, Name, Pop)."""
    # Strip only the ".1", ".2" suffixes pandas appends to duplicate names;
    # rstrip('.0123456789') would also eat real trailing digits ("Pop 2020").
    cols = [re.sub(r'\.\d+$', '', str(c)) for c in columns]
    n = len(cols)

    # Try 4, 3, then 2 equal groups so the smallest repeating group wins
    for divisor in [4, 3, 2]:
        if n % divisor != 0:
            continue

        chunk_size = n // divisor
        base_pattern = cols[:chunk_size]

        is_duplicate = all(
            cols[i * chunk_size : (i + 1) * chunk_size] == base_pattern
            for i in range(1, divisor)
        )
        if is_duplicate:
            return chunk_size

    return None

Fix:

def normalize_duplicated_table(df, base_columns):
    """Stack a horizontally duplicated table vertically."""
    n_repeats = len(df.columns) // base_columns

    frames = []
    for i in range(n_repeats):
        start = i * base_columns
        end = start + base_columns
        chunk = df.iloc[:, start:end].copy()
        chunk.columns = df.columns[:base_columns]
        # Drop rows where every value is NaN (padding in a shorter group)
        chunk = chunk.dropna(how='all')
        frames.append(chunk)

    return pd.concat(frames, ignore_index=True)

# Usage
df = pd.read_html(url)[0]
chunk_size = detect_horizontal_duplication(df.columns)
if chunk_size:
    df = normalize_duplicated_table(df, chunk_size)
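To make the reshaping concrete, here is a dependency-free sketch of the same stacking idea on plain lists, using the Tokyo/Paris rows from the example above. The function name `stack_repeated_columns` is hypothetical; it is meant as an illustration (or a unit-test fixture), not a replacement for the pandas version.

```python
def stack_repeated_columns(header, rows, chunk_size):
    """Stack side-by-side column groups vertically.

    header: list of column labels, e.g. ["Rank", "Name", "Pop", "Rank", "Name", "Pop"]
    rows: list of row lists, each as wide as header
    chunk_size: number of columns in one logical group
    """
    base = header[:chunk_size]
    stacked = []
    # Walk the groups left to right so the first group's rows come first.
    for start in range(0, len(header), chunk_size):
        for row in rows:
            chunk = row[start:start + chunk_size]
            # Skip padding: groups whose cells are all empty or None.
            if all(cell in ("", None) for cell in chunk):
                continue
            stacked.append(dict(zip(base, chunk)))
    return stacked

header = ["Rank", "Name", "Pop", "Rank", "Name", "Pop"]
rows = [
    ["1", "Tokyo", "37M", "11", "Paris", "11M"],
    ["2", "Delhi", "32M", "12", "Cairo", "10M"],
]
records = stack_repeated_columns(header, rows, chunk_size=3)
names = [record["Name"] for record in records]
print(names)
```

After stacking, a lookup on "Name" sees all four cities instead of only the left-hand group.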

Pattern 3: Title Rows (Full-Width Span)

Problem:

<table>
  <tr>
    <td colspan="4">List of tallest buildings in the world</td>
  </tr>
  <tr>
    <td>Rank</td><td>Building</td><td>City</td><td>Height</td>
  </tr>
  ...
</table>

The first row is a title, not data. After colspan expansion, the title text is repeated in every column:

['List of tallest...', 'List of tallest...', 'List of tallest...', 'List of tallest...']
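The expansion step can be reproduced without pandas. The following is a minimal sketch using only the standard library's `html.parser`; the class name `ColspanExpander` is hypothetical, and the sketch deliberately ignores `rowspan` and nested tables.

```python
from html.parser import HTMLParser

class ColspanExpander(HTMLParser):
    """Collect table rows, repeating each cell's text `colspan` times."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows
        self._row = None    # row currently being built
        self._span = 1      # colspan of the current cell
        self._text = None   # text chunks of the current cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            raw = dict(attrs).get("colspan") or "1"
            self._span = int(raw) if raw.isdigit() else 1
            self._text = []

    def handle_data(self, data):
        if self._text is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._text is not None and self._row is not None:
            cell = "".join(self._text).strip()
            self._row.extend([cell] * self._span)
            self._text = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

html = """
<table>
  <tr><td colspan="4">List of tallest buildings in the world</td></tr>
  <tr><td>Rank</td><td>Building</td><td>City</td><td>Height</td></tr>
</table>
"""
parser = ColspanExpander()
parser.feed(html)
parser.close()
title_row, header_row = parser.rows
print(title_row)
print(header_row)
```

The title row ends up with one unique value repeated four times, which is exactly the signal the detection code below relies on.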

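Detection:

def is_title_row(row_values, next_row_values):
    """Detect a full-width title row."""
    # Explicit None/len checks: `not array` raises ValueError for NumPy arrays
    if row_values is None or next_row_values is None:
        return False
    if len(row_values) == 0 or len(next_row_values) == 0:
        return False

    # All values identical (the colspan was expanded)
    unique_values = set(str(v).strip() for v in row_values if str(v).strip())

    # Title row: exactly one unique value, while the next row has several,
    # and the value is long (titles usually run past 20 characters)
    if len(unique_values) == 1:
        title = list(unique_values)[0]
        next_unique = len(set(str(v).strip() for v in next_row_values if str(v).strip()))
        return len(title) > 20 and next_unique > 2

    return False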

Fix:

def skip_title_rows(df):
    """Remove title rows from the top of the DataFrame."""
    skip_count = 0

    # Inspect at most the first three rows; each needs a following row to compare
    for i in range(min(3, len(df) - 1)):
        # Pass plain lists so truthiness checks in is_title_row are safe
        current_row = df.iloc[i].tolist()
        next_row = df.iloc[i + 1].tolist()

        if is_title_row(current_row, next_row):
            skip_count = i + 1
        else:
            break

    if skip_count > 0:
        # Use the row after the title(s) as the header
        df.columns = df.iloc[skip_count]
        df = df.iloc[skip_count + 1:].reset_index(drop=True)

    return df

Pattern 4: Grouped Headers (Two Levels)

Problem:

|        |         | Statistics        | Statistics       |
| Rank   | Country | GDP (nominal)     | GDP (PPP)        |
|--------|---------|-------------------|------------------|
| 1      | USA     | 25.5 trillion     | 25.5 trillion    |

Row 0 is a category header; row 1 holds the actual column headers. Semantically, both are headers.

pd.read_html output:

Usually either mangled columns or a MultiIndex that is awkward to work with.
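When pandas does produce a MultiIndex, its labels are tuples such as `("Statistics", "GDP (nominal)")`. The sketch below flattens such tuples into single strings. The function name `flatten_header_tuples` is hypothetical, and the "Unnamed: ..." prefix is an assumption about how pandas labels empty header cells; check it against your own output.

```python
def flatten_header_tuples(columns, sep=" - "):
    """Collapse (group, header) label tuples into single strings."""
    flat = []
    for levels in columns:
        parts = []
        for level in levels:
            text = str(level).strip()
            # Assumption: pandas names empty header cells "Unnamed: ..._level_...".
            if not text or text.startswith("Unnamed:"):
                continue
            # Drop a level that merely repeats the previous one (rowspan headers).
            if parts and parts[-1] == text:
                continue
            parts.append(text)
        flat.append(sep.join(parts))
    return flat

columns = [
    ("Unnamed: 0_level_0", "Rank"),
    ("Unnamed: 1_level_0", "Country"),
    ("Statistics", "GDP (nominal)"),
    ("Statistics", "GDP (PPP)"),
]
labels = flatten_header_tuples(columns)
print(labels)
```

With a real DataFrame whose columns are a MultiIndex, something like `df.columns = flatten_header_tuples(df.columns.to_list())` should apply the same flattening.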

Detection:

def has_grouped_headers(df):
    """Detect two-level grouped headers."""
    if len(df) < 3:
        return False

    def text(v):
        # Treat NaN/None as empty; str(nan) would otherwise count as the value "nan"
        return '' if pd.isna(v) else str(v).strip()

    row0 = [text(v) for v in df.iloc[0]]
    row1 = [text(v) for v in df.iloc[1]]

    # Count non-empty values in row 0 that repeat their left neighbor
    repeat_count = sum(
        1 for prev, cur in zip(row0, row0[1:]) if cur and cur == prev
    )
    repeat_ratio = repeat_count / max(1, len(row0) - 1)

    # Grouped headers: more than 30% of row 0 repeats its left neighbor
    # AND row 1 has more unique non-empty values than row 0
    unique0 = len(set(v for v in row0 if v))
    unique1 = len(set(v for v in row1 if v))

    return repeat_ratio > 0.3 and unique1 > unique0

Fix:

def merge_grouped_headers(df):
    """Merge two-level headers into a single level."""
    group_row = df.iloc[0].values
    header_row = df.iloc[1].values

    merged = []
    for group, header in zip(group_row, header_row):
        # Treat NaN as empty so blank group cells don't become "nan - Rank"
        g = '' if pd.isna(group) else str(group).strip()
        h = '' if pd.isna(header) else str(header).strip()

        if not g or g == h:
            merged.append(h)
        elif not h:
            merged.append(g)
        else:
            merged.append(f"{g} - {h}")

    df.columns = merged
    return df.iloc[2:].reset_index(drop=True)

# Usage
if has_grouped_headers(df):
    df = merge_grouped_headers(df)

Pattern 5: Nested Infobox Tables

Problem:

Wikipedia infoboxes can contain tables inside table cells:

<table class="infobox">
  <tr>
    <td>Population</td>
    <td>
      <table>  <!-- nested! -->
        <tr><td>Urban</td><td>8.3M</td></tr>
        <tr><td>Metro</td><td>20.1M</td></tr>
      </table>
    </td>
  </tr>
</table>

pd.read_html output:

Both the outer and the inner table are returned. Ask for "every table on the page" and you get duplicates and nested junk.

Detection and filtering:

from bs4 import BeautifulSoup
import requests

def get_top_level_tables(url):
    """Return only top-level tables (those not nested inside another table)."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # fail loudly instead of parsing an error page
    soup = BeautifulSoup(response.text, 'html.parser')

    # A table is nested if any of its ancestors is also a <table>
    return [
        table for table in soup.find_all('table')
        if table.find_parent('table') is None
    ]
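The nesting check itself does not depend on BeautifulSoup. As a rough cross-check, the sketch below counts top-level and nested tables with a depth counter built on the standard library's `html.parser`. The class name `TableDepthCounter` is hypothetical, and the sketch assumes every `<table>` has a matching closing tag.

```python
from html.parser import HTMLParser

class TableDepthCounter(HTMLParser):
    """Count top-level and nested <table> elements by tracking nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0       # how many <table> elements are currently open
        self.top_level = 0
        self.nested = 0

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            if self.depth == 0:
                self.top_level += 1
            else:
                self.nested += 1
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "table" and self.depth > 0:
            self.depth -= 1

html = """
<table class="infobox">
  <tr>
    <td>Population</td>
    <td>
      <table>
        <tr><td>Urban</td><td>8.3M</td></tr>
        <tr><td>Metro</td><td>20.1M</td></tr>
      </table>
    </td>
  </tr>
</table>
"""
counter = TableDepthCounter()
counter.feed(html)
counter.close()
print(counter.top_level, counter.nested)
```

For the infobox snippet above this reports one top-level table and one nested table, matching what the BeautifulSoup filter keeps and drops.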

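A Complete Wikipedia Table Reader

Combining the navigation-row, title-row, and nested-table fixes (the Pattern 2 and Pattern 4 helpers can then be applied to each returned DataFrame):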

import re
from io import StringIO

import pandas as pd
import requests
from bs4 import BeautifulSoup

class WikipediaTableReader:
    def __init__(self, url):
        self.url = url
        self.soup = None

    def _fetch(self):
        # Download and parse the page once, then reuse the parsed tree
        if self.soup is None:
            response = requests.get(self.url, timeout=30)
            response.raise_for_status()
            self.soup = BeautifulSoup(response.text, 'html.parser')

    def _is_nav_row(self, values):
        # len() check: `not values` raises ValueError for multi-element NumPy arrays
        if len(values) == 0:
            return False
        first = str(values[0]).strip().lower()
        return bool(re.match(r'^v\s+t\s+e(\s|$)', first))

    def _is_title_row(self, values, next_values):
        unique = set(str(v).strip() for v in values if str(v).strip())
        if len(unique) != 1:
            return False
        title = list(unique)[0]
        next_unique = len(set(str(v).strip() for v in next_values if str(v).strip()))
        return len(title) > 20 and next_unique > 2

    def get_tables(self, skip_infobox=True):
        """Return all data tables on the page."""
        self._fetch()

        tables = self.soup.find_all('table')
        results = []

        for table in tables:
            # Skip nested tables
            if table.find_parent('table'):
                continue

            # Skip infoboxes (when requested)
            if skip_infobox and 'infobox' in table.get('class', []):
                continue

            try:
                # Wrap in StringIO: newer pandas deprecates passing literal HTML strings
                df = pd.read_html(StringIO(str(table)))[0]
                df = self._clean_table(df)
                if len(df) > 0 and len(df.columns) > 1:
                    results.append(df)
            except Exception:
                # Best effort: skip tables that pandas cannot parse
                continue

        return results

    def _clean_table(self, df):
        """Apply all cleaning steps."""
        # Skip navigation rows and promote the following row to the header
        while len(df) > 0 and self._is_nav_row(df.iloc[0].values):
            if len(df) > 1:
                df.columns = df.iloc[1]
            df = df.iloc[2:].reset_index(drop=True)

        # Skip title rows; re-check the length on every pass so iloc stays in range
        while len(df) > 1 and self._is_title_row(df.iloc[0].values, df.iloc[1].values):
            df.columns = df.iloc[1]
            df = df.iloc[2:].reset_index(drop=True)

        return df

# Usage
reader = WikipediaTableReader("https://en.wikipedia.org/wiki/List_of_countries_by_population")
tables = reader.get_tables()