Kaggle Coleridge 52nd Solution

#kaggle #huggingface

This article is translated from my Japanese tech blog.
https://tmyoda.hatenablog.com/entry/20210628/1624883322

About the Coleridge Competition

https://www.kaggle.com/c/coleridgeinitiative-show-us-the-data

This is a competition to predict the dataset names mentioned in academic papers. The only data provided is the text of the papers and the ground-truth labels (GT).

How I Split the Dataset for Validation

In this competition, the training set contains about 130 dataset names (targets), but the test set includes dataset names that never appear in training.

Therefore, the validation split must not share any dataset names with the training split. To achieve this, I implemented a BFS and split the data at an 8:2 ratio so that no dataset name appears on both sides.
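One way to picture this kind of split is as a graph problem: papers that share a dataset name are connected, a BFS collects each connected group, and whole groups are then assigned to one side. The following is only a minimal sketch under that assumption, not the code used in the competition. The function name `split_by_label_groups`, the `paper_labels` input format, and the greedy assignment are all hypothetical.

```python
from collections import defaultdict, deque


def split_by_label_groups(paper_labels, val_ratio=0.2):
    """Split papers so that no dataset name appears in both splits.

    paper_labels: dict mapping a paper id to the set of dataset names
    it mentions (hypothetical input format).
    """
    # Index: dataset name -> papers that mention it.
    label_to_papers = defaultdict(set)
    for paper, labels in paper_labels.items():
        for label in labels:
            label_to_papers[label].add(paper)

    # BFS over papers; two papers are neighbors if they share a label.
    seen = set()
    components = []
    for start in paper_labels:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        component = []
        while queue:
            paper = queue.popleft()
            component.append(paper)
            for label in paper_labels[paper]:
                for neighbor in label_to_papers[label]:
                    if neighbor not in seen:
                        seen.add(neighbor)
                        queue.append(neighbor)
        components.append(component)

    # Greedy assignment (an assumption): fill validation with whole
    # components, largest first, without exceeding the target size.
    target = val_ratio * len(paper_labels)
    train, val = [], []
    for component in sorted(components, key=len, reverse=True):
        if len(val) + len(component) <= target:
            val.extend(component)
        else:
            train.extend(component)
    return train, val
```

Because components are never broken up, the resulting ratio is only approximately 8:2 when a few dataset names connect many papers.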

Pipeline

Classifier

This classifier worked better than I expected, and most of our team's top submissions included it.

It simply classifies whether a dataset name is present or not.
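The post does not state the model used or whether the classifier works on sentences, passages, or whole documents. Purely as an illustration of how binary targets for such a classifier could be derived, here is a sketch that assumes sentence-level inputs. The helper names `clean_text` and `make_binary_examples` are hypothetical.

```python
import re


def clean_text(txt):
    # Assumed normalization: lowercase, non-alphanumerics become spaces.
    return re.sub(r"[^A-Za-z0-9]+", " ", str(txt).lower()).strip()


def make_binary_examples(sentences, dataset_names):
    """Label each sentence 1 if it mentions a known dataset name, else 0."""
    cleaned_names = [clean_text(name) for name in dataset_names]
    examples = []
    for sentence in sentences:
        # Pad with spaces so that only whole-word matches count.
        padded = f" {clean_text(sentence)} "
        label = int(any(f" {name} " in padded for name in cleaned_names))
        examples.append((sentence, label))
    return examples
```

The resulting `(sentence, label)` pairs could then be fed to any binary text classifier.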

MLM

We reused the kernel below almost as is.
https://www.kaggle.com/tungmphung/coleridge-predict-with-masked-dataset-modeling

Jaccard filter

This was also reused from the kernel above.

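def jaccard_filter(org_labels, threshold=0.75):
    # org_labels: one list of predicted label strings per document.
    # clean_text and jaccard are helpers defined elsewhere (not shown here).
    assert isinstance(org_labels, list)

    filtered_labels = []
    for labels in org_labels:
        filtered = []

        # Visit shorter labels first so that they are the ones kept.
        for label in sorted(labels, key=len):
            label = clean_text(label)
            # Keep a label only if it is not too similar to any label already kept.
            if all(jaccard(label, got_label) < threshold for got_label in filtered):
                filtered.append(label)

        # Join the surviving labels of each document with '|'.
        filtered_labels.append('|'.join(filtered))

    return filtered_labels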
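To make the effect of the filter concrete, here is a self-contained sketch for a single document. The two helpers are not shown in this post, so the versions below are assumptions: `clean_text` lowercases and strips punctuation, and `jaccard` is a word-level Jaccard similarity. `filter_one_document` is a hypothetical name for the inner loop of the function above.

```python
import re


def clean_text(txt):
    # Assumed helper: lowercase, non-alphanumerics become single spaces.
    return re.sub(r"[^A-Za-z0-9]+", " ", str(txt).lower()).strip()


def jaccard(s1, s2):
    # Assumed helper: Jaccard similarity over the sets of words.
    a, b = set(s1.split()), set(s2.split())
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def filter_one_document(labels, threshold=0.75):
    # Same logic as the inner loop above, for one document's predictions.
    kept = []
    for label in sorted(labels, key=len):
        label = clean_text(label)
        if all(jaccard(label, other) < threshold for other in kept):
            kept.append(label)
    return kept


predictions = [
    "Alzheimer's Disease Neuroimaging Initiative (ADNI)",
    "ADNI",
    "Alzheimer's Disease Neuroimaging Initiative",
]
print("|".join(filter_one_document(predictions)))
# prints: adni|alzheimer s disease neuroimaging initiative
```

Under these assumptions, the longest variant shares 5 of 6 words with the already kept full name (similarity of about 0.83), so it is dropped, while the acronym and the full name are both kept.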

What I tried

DiceLoss and FocalLoss, which are suited to imbalanced data: The score decreased
NER (Named Entity Recognition): It did not seem to be effective
SciBERT: No change
Adding more external dataset CSVs: Extraneous strings were matched, and the score decreased
Switching from BERT to ELECTRA: The score decreased
Changing CONNECTION_TOKEN: The number of target documents increased, and the score decreased
Beam search with k-fold: We could not run it because of time constraints
