Getting started with NLP using BERT on Kaggle

1、Import and EDA

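import os
from pathlib import Path

# Kaggle sets this variable inside its kernels; it is empty elsewhere
iskaggle = os.environ.get('KAGGLE_KERNEL_RUN_TYPE', '')
if iskaggle:
    path = Path('/kaggle/input/us-patent-phrase-to-phrase-matching')
else:
    # Assumption: outside Kaggle, the competition data sits in a local folder of the same name
    path = Path('us-patent-phrase-to-phrase-matching')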
import pandas as pd
df = pd.read_csv(path/'train.csv')
# Combine context, target and anchor into a single text field for the model
df['input'] = 'TEXT1: ' + df.context + '; TEXT2: ' + df.target + '; ANC1: ' + df.anchor
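
To make the layout of the `input` column concrete, here is a minimal plain-Python sketch of the same concatenation for a single row. It assumes that TEXT2 holds the `target` phrase; the helper name `make_input` and the sample values are illustrative only and do not come from the original notebook.

```python
def make_input(context: str, target: str, anchor: str) -> str:
    """Build one model input string in the same layout as df['input'].

    Hypothetical helper for illustration; the post builds the column in pandas.
    """
    return 'TEXT1: ' + context + '; TEXT2: ' + target + '; ANC1: ' + anchor


# Made-up sample row: a context code, a target phrase and an anchor phrase
example = make_input('A47', 'abatement of pollution', 'abatement')
print(example)
```

Packing all three fields into one string lets a standard single-sequence classifier see the context alongside both phrases.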


from datasets import Dataset, DatasetDict
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Wrap the DataFrame in a Hugging Face Dataset
ds = Dataset.from_pandas(df)

# BERT checkpoint pre-trained on patent text
model_nm = 'anferico/bert-for-patents'
# num_labels=1 gives a single-output regression head for the similarity score
model = AutoModelForSequenceClassification.from_pretrained(model_nm, num_labels=1)
tokenizer = AutoTokenizer.from_pretrained(model_nm)
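def tok_func(x):
    # Tokenize the combined 'input' string of each row
    return tokenizer(x['input'])

# Tokenize every row; batched=True passes rows to tok_func in batches
tok_ds = ds.map(tok_func, batched=True)
# Trainer expects the target column to be named 'labels'
tok_ds = tok_ds.rename_columns({'score':'labels'})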

3、Test and Validation sets

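eval_df = pd.read_csv(path/'test.csv')
# Hold out 25% of the training data as a validation set (stored under dds['test'])
dds = tok_ds.train_test_split(0.25, seed=42)
# Build the same 'input' field for the competition test set
eval_df['input'] = 'TEXT1: ' + eval_df.context + '; TEXT2: ' + eval_df.target + '; ANC1: ' + eval_df.anchor
eval_ds = Dataset.from_pandas(eval_df).map(tok_func, batched=True)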
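
Note that `dds['test']` is the validation split carved out of `train.csv`, while `eval_ds` is the real competition test set. As a rough illustration of what a seeded 75/25 split does, here is a plain-Python sketch. The function `split_indices` is hypothetical, and `datasets` uses its own shuffling, so the actual rows it selects will differ.

```python
import random


def split_indices(n_rows: int, test_size: float = 0.25, seed: int = 42):
    """Shuffle row indices with a fixed seed and cut off a held-out slice."""
    idxs = list(range(n_rows))
    random.Random(seed).shuffle(idxs)    # seeded, so the split is reproducible
    n_test = int(round(n_rows * test_size))
    return idxs[n_test:], idxs[:n_test]  # (train indices, validation indices)


train_idx, valid_idx = split_indices(100)
print(len(train_idx), len(valid_idx))
```

Fixing the seed means repeated runs score the model on the same held-out rows, which keeps the per-epoch metric comparable between experiments.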

4、Metrics and Correlation

import numpy as np
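def corr(x,y):
    # Predictions arrive as a 2-D array of shape (n, 1); flatten to 1-D to match the labels
    return np.corrcoef(x.flatten(), y)[0,1]
# Trainer passes (predictions, labels); return the metric as a named dict
def corr_d(eval_pred): return {'pearson': corr(*eval_pred)}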
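
For readers who want to see what `np.corrcoef(...)[0,1]` computes, the following is a minimal pure-Python sketch of the Pearson correlation coefficient. The function name `pearson` and the sample numbers are made up for illustration; the post itself relies on NumPy.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Numerator: sum of products of deviations from the means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


preds = [0.1, 0.4, 0.5, 0.9]   # made-up predictions
labels = [0.0, 0.5, 0.5, 1.0]  # made-up similarity scores
print(round(pearson(preds, labels), 3))
```

The coefficient ranges from -1 to 1 and depends only on how well predictions and labels move together linearly, not on their absolute scale.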

5、Training our model

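from transformers import TrainingArguments,Trainer
bs = 128      # per-device training batch size
epochs = 4
lr = 8e-5     # peak learning rate, reached after warmup
# Note: newer transformers releases use eval_strategy in place of evaluation_strategy
args = TrainingArguments('outputs', learning_rate=lr, warmup_ratio=0.1, lr_scheduler_type='cosine', fp16=True,
    evaluation_strategy="epoch", per_device_train_batch_size=bs, per_device_eval_batch_size=bs*2,
    num_train_epochs=epochs, weight_decay=0.01, report_to='none')
trainer = Trainer(model, args, train_dataset=dds['train'], eval_dataset=dds['test'],
                  tokenizer=tokenizer, compute_metrics=corr_d)
# Run fine-tuning; the Pearson score on the validation split is reported after each epoch
trainer.train()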
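
The combination of `warmup_ratio=0.1` and `lr_scheduler_type='cosine'` means the learning rate ramps up linearly for roughly the first 10% of steps and then decays along a cosine curve. The sketch below approximates that shape in plain Python; `lr_at_step` and the step count of 1000 are hypothetical, and the exact step rounding inside `transformers` may differ slightly.

```python
import math


def lr_at_step(step: int, total_steps: int, peak_lr: float = 8e-5,
               warmup_ratio: float = 0.1) -> float:
    """Approximate learning rate at a given optimizer step."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear warmup from 0 up to the peak learning rate
        return peak_lr * step / max(1, warmup_steps)
    # Cosine decay from the peak down to 0 over the remaining steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


total = 1000  # hypothetical number of optimizer steps
for step in (0, 50, 100, 550, 1000):
    print(step, f"{lr_at_step(step, total):.2e}")
```

Warmup avoids large, noisy updates while the freshly initialised regression head is still far from a good solution; the cosine decay then shrinks the step size as training converges.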


6、Get the predictions on the test set

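# Predictions have shape (n, 1); flatten so each row gets a single score
preds = trainer.predict(eval_ds).predictions.astype(float).flatten()
# Similarity scores lie in [0, 1], so clip any out-of-range outputs
preds = np.clip(preds, 0, 1)
import datasets

submission = datasets.Dataset.from_dict({
    'id': eval_ds['id'],
    'score': preds
})

submission.to_csv('submission.csv', index=False)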
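
Because the model is a regressor, its raw outputs can fall slightly outside the valid score range, which is why the predictions are clipped before being written out. Here is a small standard-library sketch of the same clip-and-write step; `clip01`, `submission_csv`, and the sample ids and scores are made up for illustration.

```python
import csv
import io


def clip01(value: float) -> float:
    """Clamp a raw regression output into the valid score range [0, 1]."""
    return min(max(value, 0.0), 1.0)


def submission_csv(ids, raw_preds) -> str:
    """Return CSV text with an 'id' column and a clipped 'score' column."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator='\n')
    writer.writerow(['id', 'score'])
    for row_id, pred in zip(ids, raw_preds):
        writer.writerow([row_id, clip01(pred)])
    return buffer.getvalue()


# Made-up ids and raw model outputs, two of them outside [0, 1]
print(submission_csv(['a1', 'b2', 'c3'], [-0.07, 0.42, 1.13]))
```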