With the advent of Natural Language Processing (NLP), traditional job searches based on static keywords are becoming less desirable because of their inaccuracy and will eventually become obsolete. While the traditional search engine performs simple keyword searches, the NLP based search engine extract named entities, key phrases, sentiment, etc. to enrich the documents with metadata and perform search query based on the extracted metadata. In this tutorial, we will build a model to extract entities, such as skills, diploma and diploma major, from job descriptions using Named Entity Recognition (NER).
In this tutorial we will use Amazon Comprehend custom entity recognizer to extract entities from job descriptions. There are two ways to train the model (see documentation):
Entity List: Provide a list of words with their associated entity type
Annotation: Provide the location of the word in the document and its entity type so Amazon Comprehend can train on both the entity and its context
Providing an entity list is usually the fastest way to train the model but this will result in lower accuracy. We decided to use the Annotation method to train the model to get the most accurate results. This step requires manual annotation of hundreds of documents which can be very time consuming. Choosing the right annotation tool is therefore of the utmost importance. In this tutorial, we used the UBIAI annotation tool (available in the beta version for free) because it comes with extensive features such as:
- ML auto-annotation
- Dictionary and regex auto-annotation
- Team collaboration to share annotation tasks
- Direct export of annotation to Amazon Comprehend format
For more information about UBIAI annotation tool ,please visit the documentation page.
Using manual dictionary and ML auto annotation, I was able to perform 200 annotation per entity type, which is the minimum required by Amazon comprehend to train a model, in few hours. Once done with the annotation, I export it using the “Annotations” option to Amazon Comprehend format:
The downloaded annotation file will include each document in a separate txt file plus a csv file that contains 4 columns:
File: The name of the file containing the document which is included in the downloaded file from the UBIAI website.
Line: The line number containing the entity
Begin Offset: The character offset in the input text (relative to the beginning of the line) that shows where the entity begins. The first character is at position 0.
End Offset: The character offset in the input text that shows where the entity ends.
Type: User defined entity type
Next step is to upload the documents and the annotation csv file to Amazon S3 database and configure Amazon Comprehend custom NER model.
Training a custom NER in Amazon Comprehend is fairly easy and quick. First, I create a folder in Amazon S3 to transfer all the training and testing documents (Note: Amazon Comprehend requires 1000 documents for training and testing), named “Document list”. In a different folder “Annotation entities list” I upload the csv file containing the annotation output from UBIAI tool.
Once all the documents and annotation csv file are uploaded, we are ready to start the NER custom recognizer configuration:
- Go to the Amazon Comprehend console and click on the Custom entity recognition (1) in the left tab.
- Click on the Train recognizer button (2)
- Next, choose the location of the annotation csv file and the training/testing documents.
- Finally select an IAM role that grants access to S3 and press the train button to start model training
Training time will depend on the training/testing volume. For 1000 documents, it took 20 min to train the model. Once trained, the training status in the dashboard will display “Trained”:
You can access the training performance by clicking on the trained model. Once satisfied with the performance, I submit an analysis job to test the model on a new job description that was never used for testing or training:
Below are the extracted entities using the Amazon Comprehend model. While the model missed few SKILLS entities, it correctly identified DIPLOMA and DIPLOMA_MAJOR and was able to extract ~70% of the skills. Given the low annotation volume (200 annotation/entity) that was used for training, the performance is particularly good.
In this tutorial, we showed how to annotate job descriptions using UBIAI tool and trained a custom entity recognizer in Amazon Comprehend. With a only few hundreds annotation, we were able to successfully extract entities from a new job description. As a next step, we can:
- Index job descriptions with metadata using extracted entities
- Create a search engine that filters jobs using entities
- Create a recommendation system using similarity between entities from job descriptions and entities extracted from resumes
The method demonstrated here can be applied to any field of study such as Medical, Finance, Science, etc. to perform advanced search query based on entities.
Any question or suggestion? Please leave a comment.