Extracting more features - Sequence labelling in Python (part 3)

This is the third post in my series Sequence labelling in Python; you can find the previous one here: Part of speech tagging. Get the code for this series on GitHub.

While the POS tags are good and valuable information, they may not be enough to get good predictions for our task. Fortunately, we can provide our algorithm with more information, such as the length of the token, the length of the sentence, the position within the sentence, whether the token is a number or all uppercase...
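Most of these checks are plain Python string operations. As a rough sketch for a single token (the token value and the comma-stripping number check here are just illustrative, not taken from the repository):

token = "8,960"

print(len(token))                                   # length of the token
print(all(letter.isupper() for letter in token))    # is the token all uppercase?
print(token.replace(",", "").isdigit())             # does the token represent a number?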

Some imports:

from vuelax.tokenisation import index_emoji_tokenize
import pandas as pd
import csv

We start from our already labelled dataset (remember the data/to_label.csv file from the previous post; the labelled version is data/to_label-done.csv). The following are just a few snippets to read and augment our dataset:

labelled_data = pd.read_csv("data/to_label-done.csv")
labelled_data.head()

We need to create a helper function to read all the labelled offers from the previously created labelled_data dataframe:

def read_whole_offers(dataset):
    """Group the labelled rows by their offer_id and yield one offer at a time."""
    current_offer = 0
    rows = []
    for _, row in dataset.iterrows():
        if row['offer_id'] != current_offer:
            # A new offer_id means the previous offer is complete
            yield rows
            current_offer = row['offer_id']
            rows = []
        rows.append(list(row.values))
    # Don't forget the last offer
    yield rows

offers = read_whole_offers(labelled_data)
offer_ids, tokens, positions, pos_tags, labels = zip(*next(offers))
print(offer_ids)
print(tokens)
print(positions)
print(pos_tags)
print(labels)

And here is the output from the first flight offer:

(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
('¡', 'CUN', 'a', 'Ámsterdam', '$', '8,960', '!', 'Sin', 'escala', 'en', 'EE.UU')
(0, 1, 5, 7, 17, 18, 23, 25, 29, 36, 39)
('faa', 'np00000', 'sp000', 'np00000', 'zm', 'dn0000', 'fat', 'sp000', 'nc0s000', 'sp000', 'np00000')
('n', 'o', 's', 'd', 'n', 'p', 'n', 'n', 'n', 'n', 'n')

Building our training set

The features I decided to augment the data with are the following:

  • The length of each token
  • The length of the whole offer (counted in tokens)
  • The POS tag of the token to the left
  • The POS tag of the token to the right
  • Whether the token is uppercase or not

And this is the respective function to generate said features:

def generate_more_features(tokens, pos_tags):
    # Length of each token
    lengths = [len(token) for token in tokens]
    # Length of the whole offer, repeated once per token
    n_tokens = [len(tokens) for _ in tokens]
    # Pad the POS tags so every token has a left and a right neighbour
    augmented = ['<p>'] + list(pos_tags) + ['</p>']
    # Whether every character in the token is uppercase
    uppercase = [all(letter.isupper() for letter in token) for token in tokens]
    return lengths, n_tokens, augmented[:len(tokens)], augmented[2:], uppercase

generate_more_features(tokens, pos_tags)

As an example output:

([1, 3, 1, 9, 1, 5, 1, 3, 6, 2, 5],
 [11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11],
 ['<p>',
  'faa',
  'np00000',
  'sp000',
...
  'sp000',
  'nc0s000',
  'sp000'],
 ['np00000',
  'sp000',
...
  'nc0s000',
  'sp000',
  'np00000',
  '</p>'],
 [False, True, False, False, False, False, False, False, False, False, False])

Finally, we need to apply the function to all the offers in our dataset and save the results to a file, to keep them handy for the next task; you can read more about it here: Training a CRF in Python.
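Roughly, that last step could look like the snippet below; the data/features-labels.csv file name and the column names are illustrative choices on my part, not something fixed by the rest of the code:

output_columns = [
    "offer_id", "token", "position", "token_length", "offer_length",
    "pos_left", "pos", "pos_right", "uppercase", "label",
]

with open("data/features-labels.csv", "w", newline="") as fp:
    writer = csv.writer(fp)
    writer.writerow(output_columns)
    # The offers generator above was partially consumed, so create a fresh one
    for offer in read_whole_offers(labelled_data):
        offer_ids, tokens, positions, pos_tags, labels = zip(*offer)
        lengths, n_tokens, pos_left, pos_right, uppercase = generate_more_features(tokens, pos_tags)
        for row in zip(offer_ids, tokens, positions, lengths, n_tokens,
                       pos_left, pos_tags, pos_right, uppercase, labels):
            writer.writerow(row)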

As always, feel free to ask any questions you may have by leaving a comment here or contacting me on Twitter via @io_exception.
