DEV Community

Antonio Feregrino

Sequence labelling in Python (part 1)

Why?

I was looking for a cool project to practice sequence labelling with Python, and I found one: there is a Mexican website called VuelaX that publishes flight offers. Most of the offers follow a simple pattern: Origin - Destination - Price - Extras. Extracting these fields may look like an easy job for a regular expression, but it is not, because the titles come in many variations and it would be tough to cover them all.
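To make the point about regular expressions concrete, here is a minimal sketch. The pattern `SIMPLE_OFFER` and the helper `parse_offer` are my own illustrative names, not part of any code from this series, and the last title is a made-up variation rather than a real offer from the site.

```python
import re

# Hypothetical pattern for the simplest layout only:
# "¡<origin> a <destination> $<price>!"
SIMPLE_OFFER = re.compile(
    r"^¡(?P<origin>.+?) a (?P<destination>.+?) \$(?P<price>[\d,]+)!"
)


def parse_offer(title):
    """Return the captured fields as a dict, or None if the title does not match."""
    match = SIMPLE_OFFER.match(title)
    return match.groupdict() if match else None


# A title that follows the simple layout is parsed as expected.
simple = parse_offer("¡CUN a Holanda $8,885! Sin escala EE.UU")
print(simple)

# The origin keeps the extra word "Todo", which we may not want.
noisy = parse_offer("¡Todo México a Pisa, Toscana Italia $12,915! Sin escala EE.UU")
print(noisy)

# Made-up variation (an assumption, for illustration): the pattern still
# matches, but the captured fields are wrong.
reordered = parse_offer("¡Vuela a Holanda desde CUN por $8,885!")
print(reordered)
```

In the last case the pattern reports "Vuela" as the origin and "Holanda desde CUN por" as the destination. Each new variation needs another hand-written rule, which is what motivates learning the labels from data instead.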

I know it is not ideal to work in a foreign language, but bear with me, as the same techniques could be applied in your language of choice.

The idea is to create a tagger that will be able to extract this information. However, the first task is to identify the information that we want to extract. Following the pattern described above, these are the labels:

  • o: Origin
  • d: Destination
  • s: Separator token
  • p: Price
  • f: Flag
  • n: Irrelevant token

| Text | o | d | p | n |
| --- | --- | --- | --- | --- |
| ¡CUN a Holanda $8,885! Sin escala EE.UU | CUN | Holanda | 8,885 | Sin escala EE.UU |
| ¡CDMX a Noruega $10,061! (Y agrega 9 noches de hotel por $7,890!) | CDMX | Noruega | 10,061 | Y agrega 9 noches de hotel por $7,890! |
| ¡Todo México a Pisa, Toscana Italia $12,915! Sin escala EE.UU (Y por $3,975 agrega 13 noches hotel) | México | Pisa, Toscana Italia | 12,915 | Sin escala EE.UU (Y por $3,975 agrega 13 noches hotel) |
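
At the token level, the same idea can be sketched as one label per token. The tokenizer below is a simple regular-expression stand-in (an assumption, not necessarily the tokenizer used later in this series), and the label sequence is my own reading of the label list above for the first example.

```python
import re

# Assumed tokenizer: keeps "8,885" and "EE.UU" together and splits
# punctuation such as "¡", "$" and "!" into separate tokens.
TOKEN_PATTERN = re.compile(r"\w+(?:[.,]\w+)*|[^\w\s]")


def tokenize(text):
    """Split an offer title into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


title = "¡CUN a Holanda $8,885! Sin escala EE.UU"
tokens = tokenize(title)

# Hand-assigned labels, one per token (illustrative, not the author's
# annotations): o = origin, s = separator, d = destination, p = price,
# n = irrelevant token.
labels = ["n", "o", "s", "d", "n", "p", "n", "n", "n", "n"]

# A sequence labeller needs exactly one label for every token.
assert len(tokens) == len(labels)

for token, label in zip(tokens, labels):
    print(f"{token}\t{label}")
```

A tagger is then trained to predict the second column from the first, one token at a time, while taking the neighbouring tokens into account.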

CRFs in Python

If you are familiar with data science, you will recognise this as a sequence labelling problem. There are various ways to approach it; in this post, I will show you one that uses a statistical model known as Conditional Random Fields (CRFs). I will not delve into the theory, so if you want to learn more about CRFs you are on your own; instead, I will show you a practical way to use them with a Python implementation.
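
To give a feel for what a CRF works with, here is a hedged sketch of the input side. This part of the series does not name the Python implementation, so the code uses no CRF library at all: it only builds one feature dictionary per token, which is the input shape that libraries such as sklearn-crfsuite accept. The function names and the chosen features are assumptions for illustration.

```python
def token_features(tokens, i):
    """Describe token i with simple features of itself and its neighbours."""
    token = tokens[i]
    features = {
        "lower": token.lower(),
        "is_upper": token.isupper(),  # airport codes such as CUN
        "is_title": token.istitle(),  # capitalised names such as Holanda
        "has_digit": any(ch.isdigit() for ch in token),  # prices
        "is_first": i == 0,
        "is_last": i == len(tokens) - 1,
    }
    # Context features: the neighbouring tokens, when they exist.
    if i > 0:
        features["prev_lower"] = tokens[i - 1].lower()
    if i < len(tokens) - 1:
        features["next_lower"] = tokens[i + 1].lower()
    return features


def sequence_features(tokens):
    """Build one feature dict per token for a whole offer title."""
    return [token_features(tokens, i) for i in range(len(tokens))]


tokens = ["¡", "CUN", "a", "Holanda", "$", "8,885", "!"]
features = sequence_features(tokens)
print(features[5])
```

The model learns weights that connect such features, together with the labels of adjacent tokens, to the label of each token. For instance, a token containing digits that follows a "$" is a strong hint for the price label.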

Getting some data

To start, I scraped the offer titles from the website mentioned above. I will not detail how I did it, since tutorials on web scraping are easy to find. If you don't feel like spending time scraping a website, I collected some data in a CSV file that you can access here.
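
If you go with the CSV file, reading the titles could look like the sketch below. The column name `title` and the inline sample are assumptions (I am not reproducing the real file layout here), so adjust them to whatever the file actually contains.

```python
import csv
import io

# Inline stand-in for the CSV file. The "title" header is an assumption.
# Titles contain commas (for example "$8,885"), so the fields are quoted.
SAMPLE_CSV = '''title
"¡CUN a Holanda $8,885! Sin escala EE.UU"
"¡CDMX a Noruega $10,061! (Y agrega 9 noches de hotel por $7,890!)"
'''


def read_titles(handle, column="title"):
    """Return the values of one column from an open CSV file handle."""
    reader = csv.DictReader(handle)
    return [row[column] for row in reader]


titles = read_titles(io.StringIO(SAMPLE_CSV))
for title in titles:
    print(title)
```

With a real file you would pass `open(path, newline="", encoding="utf-8")` instead of the `io.StringIO` object, where `path` is wherever you saved the download.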

The rest of this tutorial is divided into four more parts.

Hopefully, you will follow along. If you have any questions, leave a comment here or contact me on Twitter via @io_exception.


Top comments (1)

Rodrigo Cuéllar Hidalgo

Is there a little error in your labeled data example? For example, in the first text CUN is the origin and Holanda is the destination; this happens in all rows...
