Web scraping has become the default way to extract data from websites. It is used for, among many other things, price monitoring, competitor tracking, and detecting website updates. With its popularity came many libraries and other tools that make it easier, so extracting all the text from a page's source code is now straightforward. However, you still have to identify the items of interest yourself, for example with CSS selectors. If the site's authors change its structure or rename its elements, you will have to adjust your scraper code. But what if I told you that it’s possible to build a mechanism that is as resistant as possible to changes in the code, and even the layout, of any web page?
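To make that fragility concrete, here is a minimal sketch using only Python's standard library. The `ClassTextExtractor` helper, the class names `headline` and `article-title`, and both HTML snippets are made up for illustration. The helper is a simplified stand-in for a CSS class selector such as `.headline`, not a real selector engine.

```python
from html.parser import HTMLParser


class ClassTextExtractor(HTMLParser):
    """Collects the text of elements that carry a given CSS class.

    Hypothetical helper: a simplified stand-in for a selector like
    ".headline". It assumes well-formed markup with no unclosed void
    elements (such as <br>) inside the matched element.
    """

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.depth = 0  # greater than 0 while inside a matching element
        self.results = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.depth:
            self.depth += 1  # nested tag inside a matching element
        elif self.css_class in classes:
            self.depth = 1
            self.results.append("")

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.results[-1] += data


def extract(html, css_class):
    """Return the stripped text of every element with the given class."""
    parser = ClassTextExtractor(css_class)
    parser.feed(html)
    parser.close()
    return [text.strip() for text in parser.results]


# The same headline before and after a purely cosmetic class rename.
old_page = '<div><h2 class="headline">Rates rise again</h2></div>'
new_page = '<div><h2 class="article-title">Rates rise again</h2></div>'

print(extract(old_page, "headline"))  # the headline is found
print(extract(new_page, "headline"))  # nothing is found after the rename
```

The visible page is identical in both cases, yet the scraper silently returns nothing for the second one. A reader looking at the rendered page would not notice any difference.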
In this article, I will show you how to extract data from websites using methods that are successfully used to process scanned documents. You’ll learn how to track trends on a news site by regularly taking screenshots of it. We will do this using image and photo detection and text recognition. Finally, we will generate word clouds of the most popular words from the data obtained this way.
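A word cloud is essentially a picture of word frequencies, so the last step boils down to counting words in the recognized text. The sketch below shows that counting step only. The sample headlines are invented and stand in for text recognized from screenshots, and `top_words` and `STOP_WORDS` are hypothetical names, not part of any library.

```python
import re
from collections import Counter

# A tiny, illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "an", "of", "to", "in", "and", "on", "for", "is"}


def top_words(texts, n=3):
    """Return the n most frequent non-stop-words across all texts."""
    counts = Counter()
    for text in texts:
        words = re.findall(r"[a-z']+", text.lower())
        counts.update(w for w in words if w not in STOP_WORDS and len(w) > 2)
    return counts.most_common(n)


# Invented sample data standing in for recognized screenshot text.
recognized = [
    "Election results delayed in the capital",
    "Markets react to election results",
    "Election officials promise a recount",
]

print(top_words(recognized, 2))
```

The resulting word and count pairs are the kind of input a word cloud renderer needs: the higher the count, the larger the word is drawn.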