
Website categorization - use cases, taxonomies, content extraction

Website categorization is, broadly speaking, the task of classifying a website into one or more categories.

The process of categorizing websites is usually automated, most often with supervised machine learning models, which tend to be the most effective solution.

Use cases

Website categorization has many use cases across a wide range of fields. One important application is in business intelligence. By analyzing the content and structure of your competitors’ websites, you can get an idea of what they are up to and discover new opportunities for your own business.

Another use case is the management of website content. A company may have different policies for different types of websites, such as different levels of access based on the age range of a site's visitors, which can be inferred from the website's category.

Another important application is cybersecurity, where we classify websites as potential spam or phishing, or as sites that we do not want visited by, for example, our employees.

This allows us to block such potentially harmful websites and prevent users from accessing them. We can also employ whitelists, where only safe websites on a pre-defined list are allowed to be accessed by users of the network.

Another important use of website categorization is online advertising, where advertisers can target ads more effectively if they know what the content of a site is about.

Website categorization can also be used to build a dataset of millions of categorized domains, which can then serve as the basis for an AI-powered lead generation tool.

Taxonomies

Taxonomies are developed to help categorize content and make it easier to find. In the context of websites, an ad-focused taxonomy is most useful, and the Interactive Advertising Bureau (IAB) has developed a taxonomy geared toward ads and marketing.

This taxonomy, which can be found on the IAB website, is constantly being revised as user behavior and content categories change. For this reason, if you are using the IAB taxonomy to categorize websites, it is important to use the latest version.

Choosing a URL category taxonomy for web content is one of the important steps when building a machine learning classifier. It is worth taking time over this, as later revisions are often difficult and time-consuming. If you later decide to change the categories, you may have to reclassify all the previously labeled data items and spend time again on data acquisition for the training set.

Also, customers may not be happy if you change the taxonomy of your classifier, because it affects the data that they have already classified and the platform where they use it. For example, they may have to change navigation menus that were derived directly from the taxonomy.

If your website is focused on ecommerce, then a different, product-oriented taxonomy may be more appropriate. The best-known one in this segment is the one from Google:
https://www.google.com/basepages/producttype/taxonomy.en-US.txt

Example of categories: "Animals & Pet Supplies > Pet Supplies", "Apparel & Accessories > Clothing > Dresses".

The Google product taxonomy is structured into product categories and subcategories, making it easy to organize your content. It has several levels of depth, or "tiers". There are more than 1,000 (sub)categories in the taxonomy, so you will most likely find the right one for your products.
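The taxonomy file itself is plain text, with one full category path per line and tiers separated by " > ". As a quick illustration, here is a minimal Python sketch (assuming the requests library is available; the URL is the one given above) that downloads the file and counts categories per tier:

```python
import requests

TAXONOMY_URL = "https://www.google.com/basepages/producttype/taxonomy.en-US.txt"

# Download the plain-text taxonomy; each line is a full category path,
# e.g. "Animals & Pet Supplies > Pet Supplies", with tiers separated by " > ".
response = requests.get(TAXONOMY_URL, timeout=30)
response.raise_for_status()

lines = [
    line.strip()
    for line in response.text.splitlines()
    if line.strip() and not line.startswith("#")  # skip the version header
]

# Group category paths by their depth (tier).
tiers = {}
for path in lines:
    depth = len(path.split(" > "))
    tiers.setdefault(depth, []).append(path)

for depth in sorted(tiers):
    print(f"Tier {depth}: {len(tiers[depth])} categories, e.g. {tiers[depth][0]}")
```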

Machine learning models

Before you start building a supervised machine learning model for automated website categorization, you need to prepare a large amount of high-quality training data. The more training data in your dataset and the better its quality in terms of relevance and diversity, the more accurate and reliable your model will be. Therefore, it is recommended that you invest most of your time and resources into this part of the process.

There are several ways to collect training data for website categorization. One way is to use existing datasets from various agencies or other third parties. You can also use existing web-crawling tools to crawl websites yourself and collect their content into a dataset.

Another option is to manually curate a dataset by simply opening up websites that are relevant to your use case and categorizing them according to your custom taxonomy or a taxonomy from Google, Facebook, or the IAB; a crawling-based collection step is sketched below.
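If you go the crawling route, a minimal collection step might look like the following sketch (Python; the requests and BeautifulSoup libraries, and the seed domains and labels, are assumptions made purely for illustration):

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical, manually curated seed list: domain -> category label.
labeled_domains = {
    "example-news-site.com": "News",
    "example-shop.com": "Shopping",
}

def fetch_visible_text(domain: str) -> str:
    """Download a homepage and return its visible text."""
    response = requests.get(f"https://{domain}", timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()  # drop non-visible elements
    return " ".join(soup.get_text(separator=" ").split())

dataset = []
for domain, label in labeled_domains.items():
    try:
        dataset.append(
            {"domain": domain, "text": fetch_visible_text(domain), "label": label}
        )
    except requests.RequestException:
        pass  # skip unreachable or misbehaving sites

print(f"Collected {len(dataset)} labeled examples")
```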

There are many machine learning models that can be used for text classification, from simple to complex ones. Some of them are listed below, in increasing order of complexity.

Naive Bayes
Naive Bayes classification is a popular method of machine learning that is used to classify data into categories.

The model is based on Bayes' theorem, which states that the posterior probability of an event given some evidence is proportional to the product of the prior probability of the event (before taking the evidence into account) and the likelihood of the evidence given that event.
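To make this concrete, here is a minimal Naive Bayes text classifier sketch using scikit-learn (the library choice and the toy documents and labels are assumptions, invented purely for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training data (invented for illustration): website text -> category.
texts = [
    "breaking news politics election results coverage",
    "live scores football league match highlights",
    "buy shoes free shipping discount checkout cart",
    "laptop reviews specs price comparison deals",
]
labels = ["News", "Sports", "Shopping", "Shopping"]

# TF-IDF features feeding a multinomial Naive Bayes classifier.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(texts, labels)

print(model.predict(["discount shoes free shipping"]))  # likely ['Shopping']
```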

Logistic regression
Logistic regression is a supervised classification method that uses the logistic function to model a dependent variable with discrete possible outcomes.

Logistic regression can be binary, helping us predict, for example, whether a given message is spam or not; when it supports more than two possible discrete outcomes, it is known as multinomial logistic regression.

Logistic regression is an extension of linear regression, adapted for classification tasks. Both calculate a weighted sum of the input variables plus a bias term. However, whereas in linear regression this sum is the output, logistic regression passes the sum through the logistic (sigmoid) function.
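In standard notation, the predicted probability of the positive class is:

$$\hat{p} = \sigma\left(\mathbf{w}^{\top}\mathbf{x} + b\right), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}$$

where **w** is the vector of learned weights, b is the bias term, and σ is the logistic (sigmoid) function, which squashes the weighted sum into the range (0, 1) so that it can be read as a probability.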

Logistic regression can be used for many different text classification models and services, including a product categorization API.

Selecting the decision threshold for logistic regression

For many classification problems, the decision threshold is left at the default value of 0.5. Nevertheless, when selecting its value, it makes sense to consider the relative importance/cost of false positives and false negatives in your classification problem.

If we change the decision threshold it has different effects on two important evaluation metrics:

  • precision

  • recall

Increasing the decision threshold decreases the number of false positives and increases the number of false negatives, and thus leads to higher precision and lower recall. Conversely, decreasing the decision threshold leads to lower precision and higher recall. What we have is a tradeoff between precision and recall, with different classification problems requiring different combinations of the two.
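The effect is easy to see numerically. Here is a minimal sketch (Python with scikit-learn; the predicted probabilities and true labels are invented numbers for illustration) that sweeps the threshold and prints precision and recall:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Illustrative predicted probabilities and true labels (invented numbers).
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_prob = np.array([0.9, 0.6, 0.8, 0.4, 0.3, 0.7, 0.55, 0.2])

# Sweep the decision threshold and watch precision and recall trade off.
for threshold in (0.3, 0.5, 0.7):
    y_pred = (y_prob >= threshold).astype(int)
    p = precision_score(y_true, y_pred)
    r = recall_score(y_true, y_pred)
    print(f"threshold={threshold:.1f}  precision={p:.2f}  recall={r:.2f}")
```

As the threshold rises from 0.3 to 0.7, precision climbs while recall falls, exactly the tradeoff described above.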

Article/Content extraction

Websites consist of a mix of content and supporting elements like menus, sidebars, and footers. These ancillary elements are usually less relevant to the central topic of the website, and they are often not unique across multiple websites.
Text can even be extracted from images, by using optical character recognition (OCR) to convert the image to text.

Consider a news website: the menus, footer, and headlines/teasers in the sidebar may be common across many articles on that website. What we are generally interested in is the content of the article. This is where article extraction comes into play.

Article extraction means extracting the text that is likely to be relevant to the topic of interest from a website that also contains material which is less related (or unrelated) to this topic.

Text pre-processing is an important part of the data pipeline for website categorization models. As we are dealing with websites, the first step is the extraction of relevant text from them. In most cases, we want to remove all non-article parts of a web page as part of so-called article extraction.

There has been a lot of work done on the topic of content extraction. A great early research paper on this topic is https://www.researchgate.net/publication/221519989_Boilerplate_Detection_Using_Shallow_Text_Features, which has an open-source implementation written in Java (boilerpipe).

There are also many ready-made libraries for content extraction written in Python, which is more commonly used in data science, e.g. goose3 (https://github.com/goose3/goose3) and newspaper (https://github.com/codelucas/newspaper).
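For example, with newspaper the extraction step takes only a few lines (the article URL below is a hypothetical placeholder):

```python
from newspaper import Article  # pip install newspaper3k

# Hypothetical article URL, used only for illustration.
url = "https://example.com/some-news-article"

article = Article(url)
article.download()  # fetch the raw HTML
article.parse()     # strip boilerplate and keep the main article body

print(article.title)
print(article.text[:500])  # first 500 characters of the extracted article text
```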

We used newspaper as part of our website categorization API (see also https://hub.docker.com/r/categorizations/websitecategorizationapi).

Conclusion

Website categorization is an important task in machine learning and natural language processing. It has many use cases across a number of industries, including cybersecurity and online store categorization.

An important part of website categorization is the extraction of relevant text from websites (by removing boilerplate elements); specialized machine learning models can be used for this purpose.

For text classification itself, a wide range of machine learning models can be used, from logistic regression to deep neural networks.
