I get to work with a variety of web scraping products and techniques at my job at Diffbot. Aligned with Diffbot's mission to "structure the world's knowledge" is an initial step of first gathering the underlying data to be structured. Diffbot is one of three western entities that truly crawl the whole public web. So this involves a pretty stellar stack of web crawling, extraction, and parsing tools.
Even with great tools, one of the challenges with crawling and extracting data from pages at a large scale is you don't really know what structure a page is going to have before you get to it. To this end, Diffbot employs a series of Automatic APIs. These are AI-enabled web extraction APIs that employ a range of techniques from computer vision through NLP to discern what data may be valuable on a page, and then to grab and structure that data.
Based on our research, around 90% of the surface of the web can be classified into 20 distinct page types. These can be discussion pages, product pages, article pages, nav pages, organizational "about" pages, and so forth. And typically each "type" of page will share a cluster of characteristics.
An event page is likely to have a time and date for the event. An article is likely to have an author. A product is likely to have an SKU. By training AI to look for the visual and non-visual fields that a page is likely to have (given its type), you bypass the need to dive into site-specific structural details.
This leads me to my first tip…
Rule-based extraction is fine for small scale scraping, one-off scripts to grab some data, and sites that don't routinely change. But these days a site with data of any value that isn't dynamic to some degree is relatively rare.
Additionally, classifying extraction rules for a given domain doesn't scale to multiple domains. Simply ensuring regularly updated web data from a small group of domains routinely requires a whole team to manage the process. And the process still breaks down. Trust me, we hear this a ton in conversations with current or potential clients.
So you have a few choices for following this tip. Or at least for avoiding what this tip is meant to avoid: unscalable or regularly broken scrapers.
The first is that you can build your own custom form of extraction that isn't rule-based. There are more free training data sets out there than ever before. Out-of-the-box NLP from a handful of providers keeps improving. And particularly if you want to focus on a small set of domains, you may be able to pull this off.
Secondly, you can reach out to the small handful of providers who truly offer rule-less web extraction. If you want to extract from a wide range of sites, your sites are regularly changing, or you're seeking a variety of document types, this is likely the way to go.
Third, you can stick to gathering public web data about particularly well-known sites. At the end of the day this may simply be paying someone else to maintain rule-based extractors for you. But, for example, there's a veritable cottage industry around scraping very specific sites like social media. Their whole business is to provide up-to-date extractors for things like lists of members of a given Facebook group. These scrape providers won't help, though, if you want to monitor custom domains or cover the vast majority of the web.
If you truly can't find a way to extract what you need with one of the options above, there are a few ways you can at least future-proof your scraping of dynamic content.
Among Diffbot products, this is what the Custom API is for. It's our only rule-based extractor, and it's essentially for page types unique enough that they don't fit into a major page category, or for when you just want to grab specific pieces of information from a page. You can pair it with Crawlbot to apply this API to large numbers of pages at once.
Alternatively, this type of rule-based selector extraction is how most major extraction services work. That includes Import.io, Octoparse, and plugin web extractors, and it's also what you're doing if you roll your own extractor with something like Selenium or BeautifulSoup.
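To make "rule-based" concrete, here is a minimal sketch that uses only Python's standard library html.parser. The ClassTextExtractor helper, the sample markup, and the "price" class are all made up for illustration; tools like BeautifulSoup or Selenium give you far richer selectors. The point is simply that a rule tied to one exact class name works until the site renames that class.

```python
from html.parser import HTMLParser


class ClassTextExtractor(HTMLParser):
    """Collect the text of every element carrying one exact class name.

    Hypothetical helper for illustration. It assumes well-formed markup
    with no unclosed void elements (such as <br>) inside the matched element.
    """

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.depth = 0       # greater than 0 while inside a matching element
        self.results = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.depth or self.wanted_class in classes:
            self.depth += 1
            if self.depth == 1:
                self.results.append("")

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.results[-1] += data


def extract(html, wanted_class):
    """Return the stripped text of each element whose class list contains wanted_class."""
    parser = ClassTextExtractor(wanted_class)
    parser.feed(html)
    parser.close()
    return [text.strip() for text in parser.results]


# Made-up markup: the same page before and after a redesign renames the class.
before = '<div><span class="price">$10</span></div>'
after = '<div><span class="price-v2">$10</span></div>'

found_before = extract(before, "price")   # the rule finds the value
found_after = extract(after, "price")     # the same rule now finds nothing
```

Nothing in the rule itself is wrong after the redesign; it just no longer describes the page, which is why rule sets need constant upkeep.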
Now there are a few scenarios where these selectors become useful. Typically if a site is well structured, class and ID names make sense, and you have classed elements inside of classed elements, you're good without these techniques.
But if you've spent any time with web scraping, don't tell me you haven't occasionally gotten a few of these:
<a href="/some/stuff" data-event="ev=filedownload" data-link-event=" Our_Book "> <span class="">Download Our Book</span> </a>
<div class="Cell-sc-1abjmm4-0 Layout__RailCell-sc-1goy157-1 hcxgdw"> <div class="RailGeneric__RailBox-sc-1565s4y-0 iZilXF mt5"> ... </div> <div class="RailGeneric__AdviceBox-sc-1565s4y-3 kObkOT"> ... </div> </div>
Both of the above stray from regular class declarations and frustrate attempts to extract data using typical selectors. They're both examples of irregular markup, but in potentially opposite ways.
The first example provides very little traditional markup that could be used for typical CSS selectors.
The second contains very specific class names that are dynamically created in something like React.
For both, we can use the same handful of advanced CSS selectors to grab the values we want.
CSS Begins With, Ends With, and Contains
You won't encounter these CSS selectors very often when building your own site. And maybe that's why they're often overlooked in explanations. But many people don't know that a subset of CSS selector types supports regex-like matching.
Specifically, these regex-like operators can be applied to HTML attribute/value selectors.
So in the first example above, something like a[data-link-event*="Our_Book"] works great.
Within CSS, square brackets are used to filter by attribute, and follow the general format of element[attribute="value"].
This in and of itself doesn't solve either of our issues up there. What does is the inclusion of the three regex-like operators for begins with, ends with, and contains.
In the above example grabbing Our_Book (note these selectors are case sensitive), the original markup has extra whitespace on either side of the characters. That's where our friend "contains" comes into play. In short, these selectors work like so:
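The whitespace problem can be checked outside the browser too. Below is a rough sketch using Python's standard library html.parser on the first snippet from above (without its stray closing div). The AttrCollector helper is hypothetical, and the plain string comparisons only imitate what an exact attribute selector and a "contains" selector would do; they are not a CSS engine.

```python
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Record the value of one attribute wherever it appears (hypothetical helper)."""

    def __init__(self, attr_name):
        super().__init__()
        self.attr_name = attr_name
        self.values = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == self.attr_name and value is not None:
                self.values.append(value)


snippet = (
    '<a href="/some/stuff" data-event="ev=filedownload" '
    'data-link-event=" Our_Book "> <span class="">Download Our Book</span> </a>'
)

collector = AttrCollector("data-link-event")
collector.feed(snippet)
raw_value = collector.values[0]   # the padding spaces are part of the value

# Imitates an exact match such as [data-link-event="Our_Book"]: fails on the padding.
exact_match = raw_value == "Our_Book"

# Imitates a contains match such as [data-link-event*="Our_Book"]: ignores the padding.
contains_match = "Our_Book" in raw_value

# Matching is case sensitive, so a lowercased needle finds nothing.
wrong_case_match = "our_book" in raw_value
```

The exact comparison fails only because of the two padding spaces, which is the kind of detail that silently breaks a scraper.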
div[class^="beginsWith"]
div[class$="endsWith"]
div[class*="containsThis"]
Here class can be any attribute, and the quoted string is matched against the beginning, the end, or any substring of the attribute's full value.
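As a sanity check on how the three operators behave, here is a small Python sketch that imitates them with plain string methods and runs them against the class values from the second snippet above. The function names and the labels in class_values are mine, not part of any library. One detail worth seeing in action: the operators test the whole attribute value as a single string, not each space-separated class on its own.

```python
def begins_with(value, needle):
    """Imitates [attr^="needle"]; an empty needle matches nothing."""
    return needle != "" and value.startswith(needle)


def ends_with(value, needle):
    """Imitates [attr$="needle"]; an empty needle matches nothing."""
    return needle != "" and value.endswith(needle)


def contains(value, needle):
    """Imitates [attr*="needle"]; an empty needle matches nothing."""
    return needle != "" and needle in value


# Full class attribute values from the second snippet, keyed by made-up labels.
class_values = {
    "outer": "Cell-sc-1abjmm4-0 Layout__RailCell-sc-1goy157-1 hcxgdw",
    "rail_box": "RailGeneric__RailBox-sc-1565s4y-0 iZilXF mt5",
    "advice_box": "RailGeneric__AdviceBox-sc-1565s4y-3 kObkOT",
}


def matching(operator, needle):
    """Return the labels whose class value satisfies the given operator."""
    return [label for label, value in class_values.items() if operator(value, needle)]


starts_rail = matching(begins_with, "RailGeneric__")       # both inner boxes
ends_mt5 = matching(ends_with, "mt5")                      # only the rail box
has_rail_cell = matching(contains, "RailCell")             # only the outer div
starts_layout = matching(begins_with, "Layout__RailCell")  # nothing matches
```

The last line is the gotcha: Layout__RailCell is the second class on the outer div, so "begins with" misses it because the full value starts with Cell-sc. "Contains" is the safer choice when you can't predict the order of generated class names.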