DEV Community

Keita Onabuta

Posted on Mar 30, 2020 • Edited on Apr 5, 2020

Japanese Text Classification with BERT (Azure)

#azure #bert #machinelearning #pytorch

This article was originally written in Japanese.

konabuta / AzureML-NLP

NLP for japanese language text.

AzureML-NLP

This repository provides sample code for building Japanese natural language processing (NLP) models with Azure Machine Learning. It is modeled on Microsoft's NLP Best Practices.

Contents

Scenario	Model	Description	Supported language
Text classification	BERT	Supervised learning that trains on and predicts text categories.	Japanese

Get started

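We recommend evaluating Azure Cognitive Services first. If its pretrained models cannot handle your use case, you will need to build a custom machine learning model. To begin, follow Setup and install the required libraries.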


I wrote sample code for Japanese text classification based on "NLP Best Practices", the collection of natural language processing best practices published by Microsoft.

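The main differences from the original repository are:

Uses a Japanese BERT tokenizer
- Adds steps for downloading and installing MeCab (plus a dictionary)
Uses a Japanese pretrained model
- Uses a model from Hugging Face
Uses the Livedoor news corpus as sample data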
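
The tokenizer change matters because Japanese text has no spaces between words. Japanese BERT tokenizers commonly segment the text into words with MeCab first and then split each word into WordPiece subwords. The sketch below illustrates only that second step in plain Python. The `wordpiece` function, the toy vocabulary, and the pre-segmented word list are illustrative assumptions, not code from the repository or from Hugging Face.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry a "##" prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary; a real model ships a far larger one.
vocab = {"自然", "言語", "処理", "機械", "##学習", "を", "学ぶ"}
# Assumed output of a MeCab-style word segmenter (hard-coded here).
words = ["自然", "言語", "処理", "を", "学ぶ", "機械学習", "深層"]

tokens = [piece for word in words for piece in wordpiece(word, vocab)]
print(tokens)
```

A word that is missing from the vocabulary is split into known pieces where possible and otherwise replaced by the unknown token, which is why the quality of the word segmentation and dictionary affects the model input.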

Because installing the MeCab dictionary is complicated, I have not yet decided whether to merge this into the original repository.

The code (excerpts only) is shown below.

1. Preparing the Livedoor corpus data

# Download the Livedoor news corpus.
from urllib.request import urlretrieve
import tarfile

text_url = "https://www.rondhuit.com/download/ldcc-20140209.tar.gz"
file_path = "./ldcc-20140209.tar.gz"
urlretrieve(text_url, file_path)

# Extract the tar.gz archive (the with block closes the file automatically).
with tarfile.open('./ldcc-20140209.tar.gz', 'r:gz') as tar:
    tar.extractall(path='livedoor')

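# Build a pandas DataFrame with one row per article.
import os
import pandas as pd

# Assumption: the archive unpacks its category folders under livedoor/text.
path = os.path.join('livedoor', 'text')

rows = []
for folder_name in os.listdir(path):
    # Skip top-level files such as README.txt; only folders hold articles.
    if folder_name.endswith(".txt"):
        continue
    print(folder_name)
    for file in os.listdir(os.path.join(path, folder_name)):
        # Each category folder contains a LICENSE.txt that is not an article.
        if file == "LICENSE.txt":
            continue
        with open(os.path.join(path, folder_name, file), 'r', encoding='utf-8') as f:
            lines = f.read().split('\n')
        if len(lines) == 1:
            continue
        # File layout: URL, date, title, then the body text.
        rows.append({
            'url': lines[0],
            'date': lines[1],
            'label': folder_name,
            'title': lines[2],
            'text': "".join(lines[3:]),
        })

# DataFrame.append was removed in pandas 2.0, so build the frame in one step.
df = pd.DataFrame(rows, columns=['url', 'date', 'label', 'title', 'text'])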
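
The evaluation step later refers to `df_test` and `label_encoder`, which the excerpt does not show. As a rough illustration of that missing preparation, the sketch below maps label strings to integer ids and makes a stratified train/test split in plain Python. The function names, the 20% test ratio, and the sample labels are assumptions for illustration; the repository itself may well use scikit-learn or other helpers for this.

```python
import random
from collections import defaultdict


def encode_labels(labels):
    """Map each distinct label string to an integer id (sorted order)."""
    classes = sorted(set(labels))
    index = {name: i for i, name in enumerate(classes)}
    return [index[name] for name in labels], classes


def stratified_split(label_ids, test_ratio=0.2, seed=0):
    """Return (train_idx, test_idx), drawing test rows from every class."""
    by_class = defaultdict(list)
    for row, label in enumerate(label_ids):
        by_class[label].append(row)
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        rows = by_class[label]
        rng.shuffle(rows)
        # Keep at least one test row per class.
        n_test = max(1, round(len(rows) * test_ratio))
        test_idx.extend(rows[:n_test])
        train_idx.extend(rows[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical labels; in the post they would come from df["label"].
labels = ["sports-watch"] * 5 + ["it-life-hack"] * 5
label_ids, classes = encode_labels(labels)
train_idx, test_idx = stratified_split(label_ids, test_ratio=0.2)
print(classes, len(train_idx), len(test_idx))
```

Splitting per class keeps the category proportions similar in both sets, which makes the reported accuracy less sensitive to how the rows happened to be shuffled.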

2. Fine-tuning

This step uses the helper functions provided in the utils_nlp package.

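# Import paths as laid out in the NLP Best Practices repository (assumed here).
from utils_nlp.common.timer import Timer
from utils_nlp.models.transformers.sequence_classification import SequenceClassifier

classifier = SequenceClassifier(
    model_name=model_name, num_labels=num_labels, cache_dir=CACHE_DIR
)

# Fine-tune on the training data and measure the elapsed time.
with Timer() as t:
    classifier.fit(
        train_dataloader, num_epochs=NUM_EPOCHS, num_gpus=NUM_GPUS, verbose=False,
    )
train_time = t.interval / 3600  # seconds to hours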
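
The timing pattern above relies on a context manager that exposes the elapsed seconds as `interval`, which is then divided by 3600 to get hours. A minimal sketch of how such a timer could work is below. `SimpleTimer` is a hypothetical stand-in written for this illustration, not the library's actual `Timer` class, and the short sleep stands in for the training call.

```python
import time


class SimpleTimer:
    """Context manager that records elapsed seconds in `interval`."""

    def __enter__(self):
        self._start = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.interval = time.perf_counter() - self._start
        return False  # do not swallow exceptions raised inside the block


with SimpleTimer() as t:
    time.sleep(0.01)  # stand-in for classifier.fit(...)

train_time_hours = t.interval / 3600  # seconds to hours
print(f"{t.interval:.3f} s = {train_time_hours:.6f} h")
```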

Next, check the accuracy on the test data.

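from sklearn.metrics import accuracy_score, classification_report

# Predict labels for the test data.
preds = classifier.predict(test_dataloader, num_gpus=NUM_GPUS, verbose=False)

# Evaluate the predictions against the true labels.
accuracy = accuracy_score(df_test[LABEL_COL], preds)
class_report = classification_report(
    df_test[LABEL_COL], preds, target_names=label_encoder.classes_, output_dict=True
)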

The final accuracy came out at roughly 87%, with an F1 score of about 86%.

accuracy : 0.866052
f1-score : 0.858849
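
To make these two numbers concrete, the sketch below computes accuracy and an F1 score by hand in plain Python. The post does not say how the f1-score was averaged across classes, so macro averaging is an assumption here, and the labels are made-up toy data rather than results from the model.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative
    scores = []
    for label in sorted(set(y_true) | set(y_pred)):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels for three classes (assumed data, not from the post).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
print(accuracy(y_true, y_pred), macro_f1(y_true, y_pred))
```

Accuracy counts every article equally, while a macro-averaged F1 counts every category equally, so the two can diverge when some categories are harder or rarer than others.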

