DEV Community

Discussion on: Web Scraping Walkthrough with Python

rhymes • Edited

Nice idea, though scraping always depends on the website's structure and can run into copyright or terms-of-service issues (they might block your user agent or IP if they don't allow scraping). Indeed explicitly forbids it:

You are not permitted to use Indeed’s Site or its content other than for non-commercial purposes. Use of any automated system or software, whether operated by a third party or otherwise, to extract data from the Site (such as screen scraping or crawling) is prohibited. Indeed reserves the right to take such action as it considers necessary, including issuing legal proceedings without further notice, in relation to any unauthorized use of the Site.

😏

This is going to take a while, so I'll go grab some coffee and come back...

Ahah, if you want to actually build a scraping tool, I would consider Scrapy, a framework for building crawlers and scrapers with async concurrency built in.

It's definitely more complicated than BeautifulSoup, which is only a parsing library. Scrapy contains it all: downloaders, parsers, streaming processors, concurrency, hooks, logging, statistics. You can use BeautifulSoup as the parser instead of the default one. It even lets you choose between breadth-first and depth-first crawl order.
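
To make the breadth-first vs depth-first point concrete, here's a rough sketch in plain Python (not Scrapy code). The `LINKS` dictionary and the `crawl_order` function are made up for illustration: a tiny in-memory "site" stands in for real pages so nothing is actually downloaded. The only difference between the two orders is whether the pending-pages queue is read FIFO or LIFO.

```python
from collections import deque

# Hypothetical in-memory "site": each page maps to the links found on it.
LINKS = {
    "/": ["/jobs", "/about"],
    "/jobs": ["/jobs/1", "/jobs/2"],
    "/about": ["/team"],
    "/jobs/1": [],
    "/jobs/2": [],
    "/team": [],
}


def crawl_order(start, links, breadth_first=True):
    """Return the order in which pages would be visited."""
    frontier = deque([start])  # pages waiting to be visited
    seen = {start}             # pages already queued, to avoid revisiting
    order = []
    while frontier:
        # FIFO (popleft) gives breadth-first; LIFO (pop) gives depth-first.
        page = frontier.popleft() if breadth_first else frontier.pop()
        order.append(page)
        for link in links.get(page, []):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order


if __name__ == "__main__":
    print("breadth-first:", crawl_order("/", LINKS, breadth_first=True))
    print("depth-first:  ", crawl_order("/", LINKS, breadth_first=False))
```

As far as I know, Scrapy crawls depth-first by default (LIFO queue) and you switch to breadth-first through its settings (`DEPTH_PRIORITY` plus the FIFO scheduler queue classes), but check the Scrapy FAQ for the exact values.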

Andrew (he/him)

Oh jeez let's hope I don't get permabanned from Indeed.

rhymes

There's an Indeed API on Mashape, though I don't know how flexible it is: rapidapi.com/indeed/api/indeed