Antonio Feregrino

Sequence labelling in Python (part 1)

Why?

I was looking for a cool project to practice sequence labelling with Python, and there is a Mexican website called VuelaX that publishes flight offers. Most of the offers follow a simple pattern: Origin - Destination - Price - Extras. Extracting this may seem like an easy job for a regular expression, but it is not: the titles come in many variations, and it would be tough to cover them all.
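To make the limitation concrete, here is a minimal sketch of a naive regular expression applied to two of the example titles from this post. The pattern and the group names are my own illustration, not something the site or the later parts of this series define.

```python
import re

# A naive pattern for titles shaped like "¡<origin> a <destination> $<price>!".
# The pattern and group names are assumptions made for this illustration.
SIMPLE_OFFER = re.compile(
    r"^¡(?P<origin>.+?) a (?P<destination>.+?) \$(?P<price>[\d,]+)!"
)

titles = [
    "¡CUN a Holanda $8,885! Sin escala EE.UU",
    "¡Todo México a Pisa, Toscana Italia $12,915! Sin escala EE.UU "
    "(Y por $3,975 agrega 13 noches hotel)",
]

results = []
for title in titles:
    match = SIMPLE_OFFER.match(title)
    # Store the captured groups, or None when the title does not fit the pattern.
    results.append(match.groupdict() if match else None)

for title, result in zip(titles, results):
    print(title, "->", result)
```

The first title is split cleanly, but for the second one the origin comes out as "Todo México" rather than just "México", and the extras after the price are ignored altogether. Each new variation would need another special case.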

I know it is not ideal to work in a foreign language, but bear with me, as the same techniques could be applied in your language of choice.

The idea is to create a tagger that is able to extract this information. The first task, however, is to identify the information that we want to extract. Following the pattern described above, each token gets one of these tags:

  • o: Origin
  • d: Destination
  • s: Separator token
  • p: Price
  • f: Flag
  • n: Irrelevant token
| Text | o | d | p | n |
| --- | --- | --- | --- | --- |
| ¡CUN a Holanda $8,885! Sin escala EE.UU | CUN | Holanda | 8,885 | Sin escala EE.UU |
| ¡CDMX a Noruega $10,061! (Y agrega 9 noches de hotel por $7,890!) | CDMX | Noruega | 10,061 | Y agrega 9 noches de hotel por $7,890! |
| ¡Todo México a Pisa, Toscana Italia $12,915! Sin escala EE.UU (Y por $3,975 agrega 13 noches hotel) | México | Pisa, Toscana Italia | 12,915 | Sin escala EE.UU (Y por $3,975 agrega 13 noches hotel) |
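In code, a labelled offer can be thought of as two parallel lists, one with tokens and one with tags. The sketch below is only an illustration: the tokenisation (splitting on whitespace and dropping punctuation such as ¡, $ and !) and the helper name `spans_by_tag` are assumptions, not the format used later in the series.

```python
# First example offer, split into tokens by hand.
# Assumption: punctuation such as "¡", "$" and "!" is dropped for brevity.
tokens = ["CUN", "a", "Holanda", "8,885", "Sin", "escala", "EE.UU"]

# One tag per token, using the tag set listed above.
tags = ["o", "s", "d", "p", "n", "n", "n"]


def spans_by_tag(tokens, tags):
    """Group tokens by tag and join them back into text (hypothetical helper)."""
    spans = {}
    for token, tag in zip(tokens, tags):
        spans.setdefault(tag, []).append(token)
    return {tag: " ".join(words) for tag, words in spans.items()}


fields = spans_by_tag(tokens, tags)
print(fields)
```

Grouping the tokens by tag gives back the origin, destination and price values of the first example offer, which is what the tagger should eventually predict on its own.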

CRFs in Python

If you are familiar with data science, you will recognise this as a sequence labelling problem. There are various ways to approach it; in this post I will show you one that uses a statistical model known as Conditional Random Fields (CRFs). I will not delve into the theory, so if you want to learn more about CRFs you are on your own; instead, I will show you a practical way to use them with a Python implementation.
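In practice, a CRF does not see raw tokens but a set of features describing each token and its neighbours. As a rough sketch, and assuming the one-dict-per-token shape that libraries such as sklearn-crfsuite accept, feature extraction could look like the code below. The feature names and the function names are hypothetical choices for illustration, not the features used in this series.

```python
def token_features(tokens, i):
    """Describe the token at position i with a few hand-picked features."""
    token = tokens[i]
    features = {
        "lower": token.lower(),
        "is_upper": token.isupper(),  # airport-style codes such as "CUN"
        "is_title": token.istitle(),  # capitalised names such as "Holanda"
        "has_digit": any(ch.isdigit() for ch in token),  # prices such as "8,885"
        "BOS": i == 0,  # beginning of the sequence
        "EOS": i == len(tokens) - 1,  # end of the sequence
    }
    # Context features: the neighbouring tokens, when they exist.
    if i > 0:
        features["prev_lower"] = tokens[i - 1].lower()
    if i < len(tokens) - 1:
        features["next_lower"] = tokens[i + 1].lower()
    return features


def offer_features(tokens):
    """Build one feature dict per token for a whole offer."""
    return [token_features(tokens, i) for i in range(len(tokens))]


example = offer_features(["CUN", "a", "Holanda", "8,885"])
for feats in example:
    print(feats)
```

The model then learns which combinations of features and neighbouring tags are likely, instead of relying on a fixed pattern.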

Getting some data

To start, I scraped the offer titles from the website mentioned above. I will not detail how I did it, since tutorials on web scraping are easy to find. If you don't feel like spending time scraping a website, I collected some data in a CSV file that you can access here.
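Once you have the CSV file, loading the titles takes only a few lines with the standard library. The sketch below assumes the titles live in a column named `text`; that column name and the function name `read_titles` are assumptions, so adjust them to match the actual file. An in-memory sample stands in for the real file so the example runs on its own.

```python
import csv
import io


def read_titles(handle, column="text"):
    """Return the non-empty offer titles found in the given CSV column."""
    reader = csv.DictReader(handle)
    return [row[column].strip() for row in reader if row.get(column)]


# In-memory stand-in for the CSV file; the "text" header is an assumption.
sample = io.StringIO(
    "text\n"
    '"¡CUN a Holanda $8,885! Sin escala EE.UU"\n'
    '"¡CDMX a Noruega $10,061! (Y agrega 9 noches de hotel por $7,890!)"\n'
)

titles = read_titles(sample)
for title in titles:
    print(title)
```

For a real file, the same function could be called on a handle opened with `open(path, encoding="utf-8", newline="")`, where `path` is wherever you saved the CSV.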

This tutorial will continue in four more parts.

Hopefully you will follow along. If you have any questions, leave a comment here or contact me on Twitter via @io_exception.

Top comments (1)

Rodrigo Cuéllar Hidalgo

There is a little error in your labeled data example: in the first text, CUN is the origin and Holanda is the destination. This happens in all rows.
