
Tomoya Oda

Kaggle Coleridge 52nd Place Solution

This article is translated from my Japanese tech blog.
https://tmyoda.hatenablog.com/entry/20210628/1624883322

About the Coleridge Competition

https://www.kaggle.com/c/coleridgeinitiative-show-us-the-data

This competition asks participants to predict the dataset names mentioned in academic papers. The only data provided are the text of the papers and the ground-truth (GT) labels.

How I Split the Data for Validation

In this competition, there are about 130 dataset names (targets) in the training set, but the test set includes dataset names that never appear in the training set.

Therefore, the training data must be split so that no dataset name appears in both the training part and the validation part. To do this, I implemented a BFS (breadth-first search) and split the data at an 8:2 ratio without any overlap in dataset names.
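The post does not show the split code, so the following is only a minimal sketch of one way such a BFS split could work. It assumes the labels are available as a mapping from paper id to a set of dataset names; the function name `split_without_label_overlap`, its parameters, and the toy data are all hypothetical. Papers that share a dataset name are treated as neighbours, BFS collects each connected group, and whole groups are assigned to one side so that no name crosses the split.

```python
from collections import defaultdict, deque
import random


def split_without_label_overlap(paper_labels, val_ratio=0.2, seed=0):
    """paper_labels: dict mapping paper id -> set of dataset names (assumed format)."""
    # Index papers by dataset name so neighbours can be looked up quickly.
    label_to_papers = defaultdict(set)
    for paper, labels in paper_labels.items():
        for label in labels:
            label_to_papers[label].add(paper)

    # BFS over papers: two papers are neighbours if they share a dataset name.
    seen = set()
    components = []
    for start in paper_labels:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        component = []
        while queue:
            paper = queue.popleft()
            component.append(paper)
            for label in paper_labels[paper]:
                for neighbour in label_to_papers[label]:
                    if neighbour not in seen:
                        seen.add(neighbour)
                        queue.append(neighbour)
        components.append(component)

    # Move whole components into validation while they still fit the target size.
    random.Random(seed).shuffle(components)
    target = val_ratio * len(paper_labels)
    train, val = [], []
    for component in components:
        if len(val) + len(component) <= target:
            val.extend(component)
        else:
            train.extend(component)
    return train, val


# Toy example with made-up paper ids.
toy = {
    "p1": {"ADNI"},
    "p2": {"ADNI", "NELS"},
    "p3": {"NELS"},
    "p4": {"SLOSH"},
    "p5": {"BLS"},
}
train_ids, val_ids = split_without_label_overlap(toy)
```

Because components are moved as a whole, the resulting ratio is only approximately 8:2, and a single very large component can make the validation part smaller than the target.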

Pipeline

(Figure: pipeline diagram)

Classifier

This classifier worked better than I expected, and most of our team's top submissions included it.

It simply classifies whether a dataset name is present or not.
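The post does not say how the training labels for this classifier were built or what unit of text it scores. As a rough sketch only, binary labels could be derived by checking whether any known dataset name occurs in a piece of text. The helper `has_dataset_name`, the normalization in `clean_text`, and the example sentences below are assumptions for illustration, not the team's actual code.

```python
import re


def clean_text(txt):
    # Assumed normalization: lowercase and keep only letters and digits.
    return re.sub(r"[^A-Za-z0-9]+", " ", str(txt).lower()).strip()


def has_dataset_name(text, dataset_names):
    """Return 1 if any known dataset name occurs in the text as whole words, else 0."""
    # Pad with spaces so that only whole-word matches count.
    cleaned = f" {clean_text(text)} "
    return int(any(f" {clean_text(name)} " in cleaned for name in dataset_names))


known_names = ["ADNI", "National Education Longitudinal Study"]
positive = has_dataset_name("We used data from the ADNI cohort.", known_names)
negative = has_dataset_name("We describe our administrative process.", known_names)
```

Labels built this way could then be used to train any binary text classifier.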

MLM

We reused the kernel below almost as is.
https://www.kaggle.com/tungmphung/coleridge-predict-with-masked-dataset-modeling

Jaccard filter

This is also reused from the kernel.

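def jaccard_filter(org_labels, threshold=0.75):
    # org_labels: one list of predicted label strings per document.
    # clean_text and jaccard are helper functions that are not shown in this post.
    assert isinstance(org_labels, list)

    filtered_labels = []
    for labels in org_labels:
        filtered = []

        # Visit shorter labels first so that longer near-duplicates are dropped.
        for label in sorted(labels, key=len):
            label = clean_text(label)
            # Keep the label only if it is dissimilar to every label kept so far.
            if all(jaccard(label, got_label) < threshold for got_label in filtered):
                filtered.append(label)

        # Join the surviving labels into one pipe-separated string.
        filtered_labels.append('|'.join(filtered))

    return filtered_labels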
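The filter calls two helpers, `clean_text` and `jaccard`, that are not shown in this post. The sketch below is one plausible version of them, assuming lowercase alphanumeric normalization and word-level Jaccard similarity. Both bodies are assumptions and may differ from the helpers in the original kernel.

```python
import re


def clean_text(txt):
    # Assumed normalization: lowercase and keep only letters and digits.
    return re.sub(r"[^A-Za-z0-9]+", " ", str(txt).lower()).strip()


def jaccard(a, b):
    """Word-level Jaccard similarity: shared words divided by distinct words."""
    words_a, words_b = set(a.split()), set(b.split())
    union = words_a | words_b
    if not union:
        return 0.0
    return len(words_a & words_b) / len(union)


short_name = clean_text("Education Longitudinal Study")
long_name = clean_text("National Education Longitudinal Study")
# Three shared words out of four distinct words.
similarity = jaccard(short_name, long_name)
```

Under these assumed helpers the similarity is 3/4 = 0.75. Since that is not below the default threshold of 0.75, the filter would keep the shorter name, which it visits first, and drop the longer one.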


What I tried

  • Using DiceLoss and FocalLoss, which are designed for imbalanced data: the score decreased
  • NER (Named Entity Recognition): it did not seem to be effective
  • SciBERT: no change
  • Adding more external dataset names (CSV): extraneous strings were matched, and the score decreased
  • Switching from BERT to ELECTRA: the score decreased
  • Changing CONNECTION_TOKEN: the number of target documents increased, and the score decreased
  • Beam search with k-fold: it was hard for us to run because of time constraints
