This is the fourth post in my series Sequence labelling in Python; you can find the previous one here: Extracting more features. Get the code for this series on GitHub.
Once we have our dataset with all the features we want to include, as well as all the labels for our sequences, we can move on to the actual training of our algorithm. For this task we'll be using the python-crfsuite
package, so pip-install it (or use your favourite package manager to get it):
pip install python-crfsuite
Once installed, let's load our dataset:
import pandas as pd
import string
features_labels = pd.read_csv("data/features-labels.csv")
features_labels = features_labels[~features_labels['label'].isna()]
features_labels.head()
We also need some helper functions that work on strings; you'll see how they are used a bit further down:
punctuation = set(string.punctuation)
def is_punctuation(token):
    return token in punctuation

def is_numeric(token):
    try:
        float(token.replace(",", ""))
        return True
    except ValueError:
        return False
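A couple of quick examples of what these helpers return (the tokens here are just illustrative):
is_punctuation("!")      # True
is_numeric("1,200")      # True, commas are stripped before parsing
is_numeric("cancún")     # False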
Input to pycrfsuite
The input to the algorithm must follow a particular format: each token has its features represented as key-value pairs, and each token may have different features depending on factors such as its position in the sequence. The following function takes a dataframe and returns the features in a form that can be consumed by the training method of our algorithm:
def featurise(sentence_frame, current_idx):
    current_token = sentence_frame.iloc[current_idx]
    token = current_token['token']
    position = current_token['position']
    token_count = current_token['token_count']
    pos = current_token['pos_tag']

    # Features shared across all tokens
    features = {
        'bias': True,
        'word.lower': token.lower(),
        'word.istitle': token.istitle(),
        'word.isdigit': is_numeric(token),
        'word.ispunct': is_punctuation(token),
        'word.position': position,
        'word.token_count': token_count,
        'postag': pos,
    }

    if current_idx > 0:  # The word is not the first one...
        prev_token = sentence_frame.iloc[current_idx - 1]['token']
        prev_pos = sentence_frame.iloc[current_idx - 1]['pos_tag']
        features.update({
            '-1:word.lower': prev_token.lower(),
            '-1:word.istitle': prev_token.istitle(),
            '-1:word.isdigit': is_numeric(prev_token),
            '-1:word.ispunct': is_punctuation(prev_token),
            '-1:postag': prev_pos,
        })
    else:
        features['BOS'] = True

    if current_idx < len(sentence_frame) - 1:  # The word is not the last one...
        next_token = sentence_frame.iloc[current_idx + 1]['token']
        next_tag = sentence_frame.iloc[current_idx + 1]['pos_tag']
        features.update({
            '+1:word.lower': next_token.lower(),
            '+1:word.istitle': next_token.istitle(),
            '+1:word.isdigit': is_numeric(next_token),
            '+1:word.ispunct': is_punctuation(next_token),
            '+1:postag': next_tag,
        })
    else:
        features['EOS'] = True

    return features
featurise(offer_0, 1)
Here offer_0 is a dataframe holding the tokens of a single offer (for instance, the first group when grouping the dataset by offer_id). Featurising its second token (index 1) gives us the following:
{'bias': True,
'word.lower': 'cun',
'word.istitle': False,
'word.isdigit': False,
'word.ispunct': False,
'word.position': 1,
'word.token_count': 11,
'postag': 'np00000',
'-1:word.lower': '¡',
'-1:word.istitle': False,
'-1:word.isdigit': False,
'-1:word.ispunct': False,
'-1:postag': 'faa',
'+1:word.lower': 'a',
'+1:word.istitle': False,
'+1:word.isdigit': False,
'+1:word.ispunct': False,
'+1:postag': 'sp000'}
As you can see, the features are represented in a dictionary. The keys could be any strings, but I chose these names to make it easy to see where each particular value comes from.
Again, we need some functions to build the sentences back from the tokens:
def featurise_sentence(sentence_frame):
    labels = list(sentence_frame['label'].values)
    features = [featurise(sentence_frame, i) for i in range(len(sentence_frame))]
    return features, labels

def rollup(dataset):
    sequences = []
    labels = []
    offers = dataset.groupby('offer_id')
    for name, group in offers:
        sqs, lbls = featurise_sentence(group)
        sequences.append(sqs)
        labels.append(lbls)
    return sequences, labels
all_sequences, all_labels = rollup(features_labels)
We now have our feature sequences in all_sequences
and the corresponding labels in all_labels
, ready for training.
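As a quick sanity check, the two lists should be parallel: one entry per offer, and within each offer one feature dictionary per label (the counts printed below are illustrative and depend on your dataset):
assert len(all_sequences) == len(all_labels)
# One feature dict per token, one label per token
print(len(all_sequences[0]), len(all_labels[0]))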
Training
Pretty much as in any other supervised problem, we need to split our dataset into two (preferably three) sets; we can use train_test_split
for this:
from sklearn.model_selection import train_test_split
train_docs, test_docs, train_labels, test_labels = train_test_split(all_sequences, all_labels)
len(train_docs), len(test_docs)
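Note that train_test_split shuffles the data and, by default, holds out 25% of it for testing; if you want reproducible experiments you can pin the seed and set the test size explicitly (the values below are a common choice, not something this series depends on):
train_docs, test_docs, train_labels, test_labels = train_test_split(
    all_sequences, all_labels, test_size=0.25, random_state=42
)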
Creating a CRF
Though one can use a sklearn-like interface to create, train and infer with python-crfsuite, I've decided to use the original package and do everything "by hand".
The first step is to create an object of the Trainer
class; then we can set some parameters for the training phase. Feel free to play with these, as they may improve the quality of the tagger. Finally, we need to pass our training data to the algorithm, and we do that with the append method:
import pycrfsuite
trainer = pycrfsuite.Trainer(verbose=False)
trainer.set_params({
    'c1': 1.0,    # coefficient for L1 penalty
    'c2': 1e-3,   # coefficient for L2 penalty
    'max_iterations': 200,
    'feature.possible_transitions': True
})

# We are feeding our training set to the algorithm here.
for xseq, yseq in zip(train_docs, train_labels):
    trainer.append(xseq, yseq)
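If you are curious about what else can be tuned, python-crfsuite exposes the full list of training parameters; as far as I know, the Trainer class offers params() and help() for this (a quick sketch):
print(trainer.params())     # names of all tunable parameters
print(trainer.help('c1'))   # short description of a single parameter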
Finally, we call the train method, which trains the tagger and, at the same time, saves the model to a file that we can later use to perform inference on new sentences.
trainer.train('model/vuelax-bad.crfsuite')
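Since we created the trainer with verbose=False, nothing gets printed during training, but the trainer still keeps a parsed log of the optimisation; to the best of my knowledge you can peek at its final state like this:
print(trainer.logparser.last_iteration)  # loss, error norm, active features...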
Labelling "unseen" sequences
To perform sequence labelling on instances that our algorithm did not see during training, we need an object of the Tagger
class, into which we load our saved model using the open
method.
crf_tagger = pycrfsuite.Tagger()
crf_tagger.open('model/vuelax-bad.crfsuite')
Remember that each sentence needs to be processed into the format required by the tagger; that is, it must have the same features we used for training. Our test_docs
are already in this format, so we can use them directly:
predicted_tags = crf_tagger.tag(test_docs[2])
print("Predicted: ", predicted_tags)
print("Correct : ", test_labels[2])
As the result shows, in this specific instance the tagger made no mistakes:
Predicted: ['n', 'o', 's', 'd', 'd', 'n', 'p', 'n']
Correct : ['n', 'o', 's', 'd', 'd', 'n', 'p', 'n']
Evaluating the tagger
Seeing our algorithm perform well on a single example does not tell us much, so let's look at the bigger picture. While there may be better ways to evaluate the performance of a sequence tagger, we'll use the traditional tools of a classification problem, flattening all the sequences into one long list of labels:
from sklearn.metrics import classification_report
all_true, all_pred = [], []

for i in range(len(test_docs)):
    all_true.extend(test_labels[i])
    all_pred.extend(crf_tagger.tag(test_docs[i]))

print(classification_report(all_true, all_pred))
Should give you a result similar to...
              precision    recall  f1-score   support

           d       0.96      0.99      0.97        98
           f       1.00      1.00      1.00        10
           n       1.00      1.00      1.00       831
           o       1.00      1.00      1.00        80
           p       1.00      1.00      1.00        60
           s       1.00      1.00      1.00        60

    accuracy                           1.00      1139
   macro avg       0.99      1.00      1.00      1139
weighted avg       1.00      1.00      1.00      1139
Our algorithm performs very, very well.
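Beyond aggregate metrics, we can also inspect what the model actually learned. As far as I know, python-crfsuite exposes the learned weights through the tagger's info() method; a quick sketch to look at the strongest label-to-label transitions:
info = crf_tagger.info()

# info.transitions maps (label_from, label_to) pairs to learned weights;
# higher weights mean the model strongly favours that transition.
for (label_from, label_to), weight in sorted(
        info.transitions.items(), key=lambda item: -item[1])[:5]:
    print(f"{label_from} -> {label_to}: {weight:.3f}")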
It may seem like we are done here, but we still need to put everything together to make it easy to tag new offers outside of our training and testing sets. We'll do that in my next post: putting everything together.
As always, feel free to ask questions by leaving a comment here or contacting me on Twitter via @io_exception.