Tomoya Oda

Kaggle Coleridge 52nd Solution

This article is translated from my Japanese tech blog.
https://tmyoda.hatenablog.com/entry/20210628/1624883322

About the Coleridge Competition

https://www.kaggle.com/c/coleridgeinitiative-show-us-the-data

This is a competition to predict the dataset names mentioned in academic papers. The only data provided are the paper texts and the ground-truth (GT) dataset names.

How I Split the Dataset for Validation

In this competition, about 130 dataset names (targets) appear in the training set, but the test set includes dataset names that never appear during training.

Therefore, the train/validation split must not share any dataset names. I implemented a BFS to group documents that share dataset names, then split the groups in an 8:2 ratio so that no name appears on both sides.
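As a sketch of that split (assuming each document maps to the set of dataset names it mentions; the BFS-grouping detail below is my reconstruction, not the exact competition code):

```python
from collections import defaultdict, deque

def split_by_label_groups(doc_labels, valid_frac=0.2):
    """doc_labels: dict of doc_id -> set of dataset names.
    BFS groups documents that (transitively) share a dataset name,
    then assigns whole groups to train/validation (~8:2) so that
    no dataset name leaks across the split."""
    # map each dataset name to the documents that mention it
    label_to_docs = defaultdict(set)
    for doc, labels in doc_labels.items():
        for lab in labels:
            label_to_docs[lab].add(doc)

    seen, groups = set(), []
    for start in doc_labels:
        if start in seen:
            continue
        group, queue = [], deque([start])
        seen.add(start)
        while queue:  # BFS over docs connected via shared labels
            doc = queue.popleft()
            group.append(doc)
            for lab in doc_labels[doc]:
                for other in label_to_docs[lab]:
                    if other not in seen:
                        seen.add(other)
                        queue.append(other)
        groups.append(group)

    # fill validation with whole groups until ~valid_frac of docs
    target = valid_frac * len(doc_labels)
    train, valid = [], []
    for group in sorted(groups, key=len):
        if len(valid) < target:
            valid.extend(group)
        else:
            train.extend(group)
    return train, valid

# toy example: docs a-c share names X/Y, docs d-e share name Z
doc_labels = {'a': {'X'}, 'b': {'X', 'Y'}, 'c': {'Y'},
              'd': {'Z'}, 'e': {'Z'}}
train, valid = split_by_label_groups(doc_labels, valid_frac=0.2)
```

Because whole connected groups move together, the two sides are guaranteed to have disjoint sets of dataset names.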

Pipeline

(Pipeline diagram)

Classifier

This classifier worked better than expected, and most of our team's top submissions included it.

It simply classifies whether a given text contains a dataset name or not.
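As a rough illustration of that binary task (using a scikit-learn TF-IDF pipeline as a stand-in for the BERT-based model in our pipeline; the toy sentences are made up):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# toy training data: 1 = text mentions a dataset name, 0 = it does not
texts = [
    "We analyzed the National Education Longitudinal Study cohort.",
    "Data were drawn from the Baltimore Longitudinal Study of Aging.",
    "Results are discussed in the final section.",
    "The method converges quickly in practice.",
]
labels = [1, 1, 0, 0]

# TF-IDF features + logistic regression, trained end to end
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(texts, labels)

preds = clf.predict(["We used the Early Childhood Longitudinal Study data."])
```

In the real pipeline, only texts the classifier flags as containing a dataset name move on to the extraction steps.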

MLM

We largely reused the kernel below.
https://www.kaggle.com/tungmphung/coleridge-predict-with-masked-dataset-modeling
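The rough idea of that kernel: dataset-name spans in the training text are replaced with mask tokens, and the model learns to recover them. A minimal sketch of building such masked examples (the helper below is illustrative, not the kernel's actual code):

```python
def mask_dataset_names(text, dataset_names, mask_token="[MASK]"):
    """Replace each known dataset name with one mask token per word,
    producing MLM-style training text."""
    for name in dataset_names:
        masked = " ".join([mask_token] * len(name.split()))
        text = text.replace(name, masked)
    return text

print(mask_dataset_names(
    "We use the ADNI cohort and NELS data.",
    ["ADNI", "NELS"]))
# We use the [MASK] cohort and [MASK] data.
```

At inference time, spans whose tokens the model confidently fills as dataset-like become candidate dataset names.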

Jaccard filter

This also reuses code from the same kernel.

import re

# helper implementations assumed here: clean_text follows the
# competition's standard cleaning, jaccard compares word sets
def clean_text(txt):
    return re.sub('[^A-Za-z0-9]+', ' ', str(txt).lower()).strip()

def jaccard(a, b):
    x, y = set(a.split()), set(b.split())
    return len(x & y) / len(x | y)

def jaccard_filter(org_labels, threshold=0.75):
    assert isinstance(org_labels, list)

    filtered_labels = []
    for labels in org_labels:
        filtered = []

        # shortest first: drop any label too similar to one already kept
        for label in sorted(labels, key=len):
            label = clean_text(label)
            if len(filtered) == 0 or all(jaccard(label, got_label)
                                         < threshold for got_label in filtered):
                filtered.append(label)

        filtered_labels.append('|'.join(filtered))

    return filtered_labels


What I tried

  • Using DiceLoss and FocalLoss, which are suited to imbalanced data: the score decreased
  • NER (Named Entity Recognition): it didn't seem to be effective
  • SciBERT: no change
  • Adding more external dataset-name CSVs: extraneous strings were matched, decreasing the score
  • Switching from BERT to Electra: the score decreased
  • Changing CONNECTION_TOKEN: the number of target documents increased, and the score decreased
  • Beam search with k-fold: we could not run it within the time limit
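For reference, focal loss down-weights easy examples relative to cross-entropy; a minimal NumPy sketch of the binary form we experimented with (the exact training-loop integration is omitted):

```python
import numpy as np

def focal_loss(probs, targets, gamma=2.0, eps=1e-7):
    """Binary focal loss: FL = -(1 - p_t)^gamma * log(p_t),
    where p_t is the predicted probability of the true class.
    gamma=0 reduces to plain cross-entropy."""
    probs = np.clip(probs, eps, 1 - eps)
    p_t = np.where(targets == 1, probs, 1 - probs)
    return float(np.mean(-((1 - p_t) ** gamma) * np.log(p_t)))

probs = np.array([0.9, 0.2])   # predicted P(positive)
targets = np.array([1, 0])     # both examples are nearly correct ("easy")
print(focal_loss(probs, targets, gamma=2.0))
```

With gamma > 0, confident correct predictions contribute much less to the loss, which is the intended effect on imbalanced data (even though, in our case, it hurt the score).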
