When we read an article on a news website, Medium, dev.to, etc., we usually see additional sections such as Recommended Articles or Similar Articles. These list a few more articles that match the content of the one we are reading, or that are suggested based on our previous reading history.
Broadly, article recommendation can be done in two ways:
- Collaborative Filtering
- Content similarity
Collaborative Filtering:
Collaborative filtering is a method of making automatic predictions (filtering) about the interests of a user by collecting preferences or taste information from many users (collaborating).
This can be implemented using machine learning techniques that identify groups of users with similar behaviour.
Another way of identifying similar groups of users is to use graph databases such as Neo4j or ArangoDB, where you can build a graph of users connected via their interests, activity on the website, purchase patterns, etc., and then identify correlated user groups.
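The idea of grouping users by similar behaviour can be sketched in a few lines of Python. This is only a toy illustration of user-based collaborative filtering: the `reads` data, the user names, and the `recommend` helper are all made up for this example and are not part of the dataset used later in the post.

```python
from math import sqrt

# Hypothetical read history: user -> set of article ids they have read.
reads = {
    "alice": {1, 2, 3},
    "bob": {2, 3, 4},
    "carol": {7, 8},
}


def cosine(a, b):
    """Cosine similarity between two sets treated as binary vectors."""
    if not a or not b:
        return 0.0
    return len(a & b) / sqrt(len(a) * len(b))


def recommend(user, reads, top_n=3):
    """Suggest unread articles, weighted by how similar their readers are to `user`."""
    seen = reads[user]
    scores = {}
    for other, articles in reads.items():
        if other == user:
            continue
        sim = cosine(seen, articles)
        if sim == 0.0:
            continue  # users with nothing in common contribute nothing
        for article in articles - seen:
            scores[article] = scores.get(article, 0.0) + sim
    # Highest score first; ties broken by article id for a stable order.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [article for article, _ in ranked[:top_n]]


print(recommend("alice", reads))
```

Here `alice` and `bob` share two articles, so the one article `bob` has read and `alice` has not is suggested to her, while `carol` has nothing in common with anyone and gets no suggestions.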
Content Similarity:
Content similarity is the degree of similarity between two articles, based on their textual content (the terms appearing in them).
This can be implemented using information retrieval techniques such as BoW (Bag of Words), TF-IDF, etc.
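To make the TF-IDF idea concrete, here is a minimal pure-Python sketch that turns a handful of short texts into TF-IDF vectors and ranks them by cosine similarity. The headlines in `docs` are invented for illustration, and the smoothed IDF formula is just one common variant, not necessarily the one Elasticsearch applies internally.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed IDF (an assumption): rare terms get a higher weight.
        vectors.append(
            {t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in tf.items()}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(index, docs, top_n=2):
    """Indices of the documents most similar to docs[index], best first."""
    vectors = tfidf_vectors(docs)
    scored = [
        (cosine(vectors[index], vec), i)
        for i, vec in enumerate(vectors)
        if i != index
    ]
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [i for _, i in scored[:top_n]]


# Hypothetical headlines, invented for this example.
docs = [
    "Amla century puts South Africa ahead",
    "Amla and Cook lead South Africa",
    "Messi scores as Argentina win",
    "Argentina thrash Venezuela with Messi hat trick",
]

print(most_similar(0, docs, top_n=1))
```

The first headline shares its terms only with the second one, so that is the one returned as its closest match.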
In this blog post, I will explain how to implement content similarity using Elasticsearch, which internally uses TF-IDF to find the articles relevant to a given search query. I took a sample articles dataset from Kaggle (a dataset from thenews website) for this activity.
This dataset contains 2692 articles, of which 1408 are sports-related and the remaining 1284 are business-related. I will explore the sports articles in this post.
Let’s look at the interesting part: the implementation.
Below is a sample article stored in Elasticsearch, which is about cricket.
Image from Firstpost
The above article is mostly about Hashim Amla, Temba Bavuma, South Africa, England and Test cricket. Now, let’s see the top 4 articles matching this content (article id 1121, as mentioned above).
Below are the recommended articles with similar content:
- “title”: “Injured Amla stands firm as South Africa build lead”
- “title”: “Amla makes century De Villiers falls for 88”
- “title”: “Amla and Stephen Cook lead South Africa to 329/5”
- “title”: “Root Stokes fire up England ” (Eng Vs SA Test match)
So, as you can see, most of the recommended articles are in context with the article (id 1121).
I used Elasticsearch’s More Like This query to identify articles whose content is similar to that of the current article.
Sample Elasticsearch More Like This query:
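A minimal sketch of what such a query could look like, built as a Python dict and printed as JSON. The index name `articles`, the field names `title` and `content`, and the tuning values are assumptions for illustration, not necessarily the ones used to produce the results in this post. The `build_mlt_query` helper is hypothetical; `more_like_this`, `fields`, `like`, `min_term_freq`, `min_doc_freq` and `max_query_terms` are parameters of the Elasticsearch query DSL.

```python
import json


def build_mlt_query(index, article_id, fields=("title", "content"), size=4):
    """Build a search body asking for articles similar to an indexed one."""
    return {
        "size": size,  # how many similar articles to return
        "query": {
            "more_like_this": {
                # Fields to compare (assumed names).
                "fields": list(fields),
                # Reference an already indexed article by index and id.
                "like": [{"_index": index, "_id": str(article_id)}],
                # Tuning values below are illustrative assumptions.
                "min_term_freq": 1,
                "min_doc_freq": 2,
                "max_query_terms": 25,
            }
        },
    }


body = build_mlt_query("articles", 1121)
print(json.dumps(body, indent=2))
```

The printed JSON could then be sent as the body of a search request against the index, for example `POST /articles/_search`.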
Let’s see one more example, this time for football news.
Image from Business Insider
The above article is mostly about Lionel Messi and Argentina. Now, let’s see the top 4 articles matching this content (article id 1817, as mentioned above).
Below are the recommended articles with similar content:
- “title”: “Messi record as Argentina thrash Venezuela”
- “title”: “Magical Messi grabs hat trick as Argentina romp into quarter”
- “title”: “Messi scores 50th Argentina goal in 2 0 wins over Bolivia”
- “title”: “Messi primed to end Argentina drought Copa fi”
Almost all the recommended articles are in context with the article (id 1817).
Conclusion:
Overall, the Elasticsearch More Like This query helps you identify articles with similar content quickly and efficiently, and can give you decent recommendations based on text content alone.
Thank you for your time!
If you liked the article, please share it. You can also follow me on Twitter for more posts on technology, startups and leadership.