circobit

Posted on Jun 17

5 Wikipedia Table Patterns That Break Most Scrapers (and How to Fix Them)

#python #webscraping #wikipedia #datascience

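Wikipedia is the most common source of web table data. It is also a goldmine of edge cases that break naive scrapers.

While developing HTML Table Exporter, I collected the five patterns that cause the most trouble. For each one, this post shows detection code and a fix.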

Pattern 1: Navigation Rows ("v t e")

Problem:

<table>
  <tr>
    <td colspan="5">v t e Countries by population</td>
  </tr>
  <tr>
    <td>Rank</td><td>Country</td><td>Population</td>...
  </tr>
  ...
</table>

The first row contains the "v t e" (view/talk/edit) links to the Wikipedia template page. If your scraper treats row 0 as the header, everything breaks.

Output of pd.read_html:

          v t e Countries by population
0   Rank                        Country    ...
1      1                          China    ...

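Detection:

import re

def is_nav_row(row_values):
    """Detect the Wikipedia navigation prefix in the first cell of a row."""
    # Use len() rather than truthiness: row_values is often a NumPy array,
    # and `not array` raises ValueError for arrays with more than one element.
    if row_values is None or len(row_values) == 0:
        return False

    first_cell = str(row_values[0]).strip().lower()
    patterns = [
        r'^v\s+t\s+e\s',        # "v t e "
        r'^v\s*\|\s*t\s*\|\s*e', # "v | t | e"
        r'^\[v\]\s*\[t\]\s*\[e\]' # "[v] [t] [e]"
    ]

    return any(re.match(p, first_cell) for p in patterns)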
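As a quick sanity check, the three prefix patterns can be run against a few sample cells. This is a minimal standalone sketch: the helper name first_cell_is_nav and the sample strings are illustrative assumptions, not part of the original code.

```python
import re

# The three prefix styles used by the detection code, precompiled.
NAV_PREFIXES = [
    re.compile(r"^v\s+t\s+e\s"),            # "v t e "
    re.compile(r"^v\s*\|\s*t\s*\|\s*e"),    # "v | t | e"
    re.compile(r"^\[v\]\s*\[t\]\s*\[e\]"),  # "[v] [t] [e]"
]

def first_cell_is_nav(first_cell):
    """Hypothetical helper: True if a cell starts with a nav prefix."""
    text = str(first_cell).strip().lower()
    return any(p.match(text) for p in NAV_PREFIXES)

# Made-up sample cells for illustration.
for cell in ["v t e Countries by population", "V | T | E", "Rank", "Vietnam", "v t e"]:
    print(repr(cell), first_cell_is_nav(cell))
```

Note that a cell containing only "v t e" is not matched, because the first pattern requires whitespace after the "e" and strip() removes it. Whether that matters depends on the tables you scrape.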

Fix:

import pandas as pd

def read_wikipedia_table(url, table_index=0):
    tables = pd.read_html(url)
    df = tables[table_index]

    # Check whether the first row is a navigation row
    if len(df) > 1 and is_nav_row(df.iloc[0].values):
        # Use the second row as the header
        df.columns = df.iloc[1]
        df = df.iloc[2:].reset_index(drop=True)

    return df

Pattern 2: Horizontally Duplicated Tables

Problem:

To save vertical space, Wikipedia displays some tables as several column groups side by side:

| Rank | Name   | Pop  | Rank | Name    | Pop  |
|------|--------|------|------|---------|------|
| 1    | Tokyo  | 37M  | 11   | Paris   | 11M  |
| 2    | Delhi  | 32M  | 12   | Cairo   | 10M  |

Logically, this is a single table whose column structure repeats.

Output of pd.read_html:

   Rank    Name  Pop  Rank.1   Name.1  Pop.1
0     1   Tokyo  37M      11    Paris    11M
1     2   Delhi  32M      12    Cairo    10M

Pandas sees six columns. If you filter on "Name", you miss half of the data.

Detection:

import re

def detect_horizontal_duplication(columns):
    """Check whether columns repeat (Rank, Name, Pop, Rank, Name, Pop).

    Returns the width of one repeated block, or None if nothing repeats.
    """
    # Remove only the ".1", ".2" suffixes pandas appends to duplicate names.
    # rstrip('.0123456789') would also erase names such as "2020".
    cols = [re.sub(r'\.\d+$', '', str(c)) for c in columns]
    n = len(cols)

    # Try 4, 3, then 2 repeats so the smallest repeating block wins
    for divisor in [4, 3, 2]:
        if n % divisor != 0:
            continue

        chunk_size = n // divisor
        base_pattern = cols[:chunk_size]

        is_duplicate = all(
            cols[i * chunk_size : (i + 1) * chunk_size] == base_pattern
            for i in range(1, divisor)
        )

        if is_duplicate:
            return chunk_size

    return None

Fix:

def normalize_duplicated_table(df, base_columns):
    """Stack a horizontally duplicated table vertically."""
    n_repeats = len(df.columns) // base_columns

    frames = []
    for i in range(n_repeats):
        start = i * base_columns
        end = start + base_columns
        chunk = df.iloc[:, start:end].copy()
        chunk.columns = df.columns[:base_columns]
        chunk = chunk.dropna(how='all')
        frames.append(chunk)

    return pd.concat(frames, ignore_index=True)
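To make the effect concrete, here is a small self-contained sketch that rebuilds the six-column frame from the pd.read_html output shown above and stacks it. The function name stack_repeated_blocks and the hand-built DataFrame are assumptions for illustration; the block width of 3 is supplied by hand rather than detected.

```python
import re
import pandas as pd

# Hand-built frame mirroring the six-column output shown earlier.
wide = pd.DataFrame(
    [[1, "Tokyo", "37M", 11, "Paris", "11M"],
     [2, "Delhi", "32M", 12, "Cairo", "10M"]],
    columns=["Rank", "Name", "Pop", "Rank.1", "Name.1", "Pop.1"],
)

def stack_repeated_blocks(df, block_width):
    """Hypothetical helper: stack side-by-side column blocks vertically."""
    # Column names of the first block, without pandas' ".N" suffixes.
    base_names = [re.sub(r"\.\d+$", "", str(c)) for c in df.columns[:block_width]]
    blocks = []
    for start in range(0, df.shape[1], block_width):
        block = df.iloc[:, start:start + block_width].copy()
        block.columns = base_names
        # Padding rows in a shorter block are entirely NaN; drop them.
        blocks.append(block.dropna(how="all"))
    return pd.concat(blocks, ignore_index=True)

tall = stack_repeated_blocks(wide, block_width=3)
print(tall)
```

After stacking, filtering on "Name" sees all four cities instead of only the first two.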

Pattern 3: Title Rows (Spanning All Columns)

Problem:

<table>
  <tr>
    <td colspan="4">List of tallest buildings in the world</td>
  </tr>
  <tr>
    <td>Rank</td><td>Building</td><td>City</td><td>Height</td>
  </tr>
  ...
</table>

The first row is a title, not data. After colspan expansion:

['List of tallest...', 'List of tallest...', 'List of tallest...', 'List of tallest...']
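The colspan expansion itself can be sketched in a few lines. This is an illustration under assumptions: each row is represented as hypothetical (text, colspan) pairs, which is not how any particular parser exposes cells, and expand_colspan is a made-up name.

```python
def expand_colspan(cells):
    """Expand (text, colspan) pairs into one value per grid column."""
    row = []
    for text, colspan in cells:
        # A cell spanning N columns contributes its text N times.
        row.extend([text] * colspan)
    return row

title_row = expand_colspan([("List of tallest buildings in the world", 4)])
header_row = expand_colspan(
    [("Rank", 1), ("Building", 1), ("City", 1), ("Height", 1)]
)

# A title row collapses to a single distinct value; a header row does not.
print(len(set(title_row)), len(set(header_row)))
```

This is the signal the detection relies on: one distinct value across the whole row, followed by a row with several distinct values.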

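Detection:

def is_title_row(row_values, next_row_values):
    """Detect a full-width title row."""
    # Explicit None/len() checks: `not array` raises ValueError for NumPy arrays
    if row_values is None or next_row_values is None:
        return False
    if len(row_values) == 0 or len(next_row_values) == 0:
        return False

    unique_values = set(str(v).strip() for v in row_values if str(v).strip())

    if len(unique_values) == 1:
        title = list(unique_values)[0]
        next_unique = len(set(str(v).strip() for v in next_row_values if str(v).strip()))
        # A long single value followed by a row with several distinct values
        return len(title) > 20 and next_unique > 2

    return False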

Fix:

def skip_title_rows(df):
    """Remove title rows from the top of the DataFrame."""
    skip_count = 0

    for i in range(min(3, len(df) - 1)):
        current_row = df.iloc[i].values
        next_row = df.iloc[i + 1].values if i + 1 < len(df) else None

        if is_title_row(current_row, next_row):
            skip_count = i + 1
        else:
            break

    if skip_count > 0:
        df.columns = df.iloc[skip_count]
        df = df.iloc[skip_count + 1:].reset_index(drop=True)

    return df

Pattern 4: Grouped Headers (Two Levels)

Problem:

|        |         | Statistics        | Statistics       |
| Rank   | Country | GDP (nominal)     | GDP (PPP)        |
|--------|---------|-------------------|------------------|
| 1      | USA     | 25.5 trillion     | 25.5 trillion    |

Row 0 holds the category headers. Row 1 holds the actual column headers. Semantically, both are "headers".

Detection:

def has_grouped_headers(df):
    """Detect a two-level grouped header."""
    if len(df) < 3:
        return False

    row0 = df.iloc[0].values
    row1 = df.iloc[1].values

    repeat_count = 0
    for i in range(1, len(row0)):
        if str(row0[i]).strip() == str(row0[i-1]).strip() and str(row0[i]).strip():
            repeat_count += 1

    repeat_ratio = repeat_count / max(1, len(row0) - 1)

    unique0 = len(set(str(v).strip() for v in row0 if str(v).strip()))
    unique1 = len(set(str(v).strip() for v in row1 if str(v).strip()))

    return repeat_ratio > 0.3 and unique1 > unique0

Fix:

def merge_grouped_headers(df):
    """Merge a two-level header into a single level."""
    group_row = df.iloc[0].values
    header_row = df.iloc[1].values

    merged = []
    for group, header in zip(group_row, header_row):
        g = str(group).strip()
        h = str(header).strip()

        if not g or g == h:
            merged.append(h)
        elif not h:
            merged.append(g)
        else:
            merged.append(f"{g} - {h}")

    df.columns = merged
    return df.iloc[2:].reset_index(drop=True)
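An alternative, not the approach used in this post: if the table is loaded so that pandas already holds both header rows as a column MultiIndex (for example through the header=[0, 1] argument of pd.read_html), the two levels can be flattened directly. The sketch below builds such a frame by hand; flatten_header is a hypothetical helper, and skipping "Unnamed:" placeholders is an assumption about how empty group cells may be labeled.

```python
import pandas as pd

# Hand-built two-level header mirroring the example table above.
columns = pd.MultiIndex.from_tuples([
    ("", "Rank"),
    ("", "Country"),
    ("Statistics", "GDP (nominal)"),
    ("Statistics", "GDP (PPP)"),
])
df = pd.DataFrame([[1, "USA", "25.5 trillion", "25.5 trillion"]], columns=columns)

def flatten_header(levels):
    """Hypothetical helper: join the non-empty, non-repeated header levels."""
    parts = []
    for level in levels:
        text = str(level).strip()
        # Assumption: empty group cells may appear as "" or "Unnamed: ..." labels.
        if text and not text.startswith("Unnamed:") and text not in parts:
            parts.append(text)
    return " - ".join(parts)

# Iterating over a MultiIndex yields one tuple of levels per column.
df.columns = [flatten_header(col) for col in df.columns]
print(list(df.columns))
```

The result uses the same "group - header" naming as merge_grouped_headers, so downstream code does not need to care which route produced the columns.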

Pattern 5: Nested Infobox Tables

Problem:

Wikipedia infoboxes sometimes contain tables inside table cells:

<table class="infobox">
  <tr>
    <td>Population</td>
    <td>
      <table>  <!-- nested! -->
        <tr><td>Urban</td><td>8.3M</td></tr>
        <tr><td>Metro</td><td>20.1M</td></tr>
      </table>
    </td>
  </tr>
</table>

Detection and filtering:

from bs4 import BeautifulSoup
import requests

def get_top_level_tables(url):
    """Get only top-level tables (excluding nested ones)."""
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')

    all_tables = soup.find_all('table')

    top_level = []
    for table in all_tables:
        parent = table.parent
        is_nested = False

        while parent:
            if parent.name == 'table':
                is_nested = True
                break
            parent = parent.parent

        if not is_nested:
            top_level.append(table)

    return top_level
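The same nesting idea can be checked offline without any network access or third-party parser. The following is a dependency-free sketch using the standard library's html.parser on the infobox snippet from above; the class name TableDepthCounter is an assumption, and it only counts tables rather than extracting them.

```python
from html.parser import HTMLParser

class TableDepthCounter(HTMLParser):
    """Hypothetical helper: count top-level vs nested <table> elements."""

    def __init__(self):
        super().__init__()
        self.depth = 0       # how many <table> elements are currently open
        self.top_level = 0
        self.nested = 0

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            if self.depth == 0:
                self.top_level += 1
            else:
                self.nested += 1
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "table" and self.depth > 0:
            self.depth -= 1

html = """
<table class="infobox">
  <tr>
    <td>Population</td>
    <td>
      <table>  <!-- nested -->
        <tr><td>Urban</td><td>8.3M</td></tr>
        <tr><td>Metro</td><td>20.1M</td></tr>
      </table>
    </td>
  </tr>
</table>
"""

counter = TableDepthCounter()
counter.feed(html)
print(counter.top_level, counter.nested)
```

For real pages the BeautifulSoup version is more convenient, since it returns the table elements themselves; this sketch is mainly useful as a quick cross-check in tests.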

A Complete Wikipedia Table Reader

Combining the navigation-row, title-row, and nested-table fixes:

from io import StringIO
import re

import pandas as pd
import requests
from bs4 import BeautifulSoup

class WikipediaTableReader:
    def __init__(self, url):
        self.url = url
        self.soup = None

    def _fetch(self):
        if self.soup is None:
            response = requests.get(self.url)
            response.raise_for_status()
            self.soup = BeautifulSoup(response.text, 'html.parser')

    def _is_nav_row(self, values):
        # len() instead of truthiness: values is a NumPy array here
        if len(values) == 0:
            return False
        first = str(values[0]).strip().lower()
        return bool(re.match(r'^v\s+t\s+e\s', first))

    def _is_title_row(self, values, next_values):
        unique = set(str(v).strip() for v in values if str(v).strip())
        if len(unique) != 1:
            return False
        title = list(unique)[0]
        next_unique = len(set(str(v).strip() for v in next_values if str(v).strip()))
        return len(title) > 20 and next_unique > 2

    def get_tables(self, skip_infobox=True):
        """Get all data tables from the page."""
        self._fetch()

        tables = self.soup.find_all('table')
        results = []

        for table in tables:
            # Skip tables nested inside another table
            if table.find_parent('table'):
                continue
            if skip_infobox and 'infobox' in table.get('class', []):
                continue

            try:
                # Wrap in StringIO: passing literal HTML is deprecated in pandas 2.1+
                df = pd.read_html(StringIO(str(table)))[0]
                df = self._clean_table(df)
                if len(df) > 0 and len(df.columns) > 1:
                    results.append(df)
            except Exception:
                continue

        return results

    def _clean_table(self, df):
        """Apply all cleaning steps."""
        # Drop a leading navigation row and promote the next row to the header
        while len(df) > 0 and self._is_nav_row(df.iloc[0].values):
            if len(df) > 1:
                df.columns = df.iloc[1]
            df = df.iloc[2:].reset_index(drop=True)

        # Do the same for full-width title rows; the length guard keeps
        # df.iloc[0] and df.iloc[1] valid after each pass
        while len(df) > 1 and self._is_title_row(df.iloc[0].values, df.iloc[1].values):
            df.columns = df.iloc[1]
            df = df.iloc[2:].reset_index(drop=True)

        return df

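When to Use the Extension

For ad hoc extraction (as opposed to building a pipeline), a browser extension handles all of these patterns automatically.

HTML Table Exporter detects these patterns and normalizes the output. Instead of debugging edge cases, you are done in one click.

For a hands-on tutorial, see also the guide to the best Chrome extensions for copying tables to Excel.

For automated pipelines, use the code above. For occasional exports, pick the right tool.

For details, visit gauchogrid.com/ja/html-table-exporter or try it for free on the Chrome Web Store.

Found a Wikipedia table that breaks this code? Share the URL and I'll add it to the test suite.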
