Even with great tools, one of the challenges of crawling and extracting data from pages at large scale is that you don’t know what structure a page will have before you get to it.
Here are a couple of quick hacks to scrape data from pages that have dynamic CSS rules.
Rule-based extraction is fine for small-scale scraping, one-off scripts to grab some data, and sites that don’t routinely change. But these days, a site with data of any value that isn’t dynamic to some degree is relatively rare.
Additionally, writing extraction rules for one domain doesn’t scale to many domains. Simply keeping web data from a small group of domains regularly updated routinely requires a whole team to manage the process. And the process still breaks down. Trust me, we hear this a ton in conversations with current and potential clients.
So you have a few choices for following this tip. Or at least for avoiding what this tip is meant to avoid: unscalable or regularly broken scrapers.
The first is to build your own non-rule-based form of extraction. There are more free training data sets out there than ever before, and out-of-the-box NLP from a handful of providers keeps improving. Particularly if you want to focus on a small set of domains, you may be able to pull this off.
Second, you can reach out to the small handful of providers who truly offer rule-less web extraction. If you want to extract from a wide range of sites, your sites change regularly, or you’re after a variety of document types, this is likely the way to go.
Third, you can stick to gathering public web data from particularly well-known sites. At the end of the day, this may simply mean paying someone else to maintain rule-based extractors for you. But, for example, there’s a veritable cottage industry around scraping very specific sites like social media platforms. Their whole business is providing up-to-date extractors for things like the member list of a given Facebook group.
If you truly can’t find a way to extract what you need with one of the options above, there are a few ways you can at least future-proof your scraping of dynamic content.
For context, rule-based selector extraction is how most major extraction services work (Import.io, browser-plugin extractors, Octoparse), and it’s what you’re building if you roll your own extractor with something like Selenium or BeautifulSoup.
Now, there are a few scenarios where advanced selectors become useful. If a site is well structured, class and ID names make sense, and classed elements sit inside other classed elements, you’re fine without these techniques. But if you’ve spent any time with web scraping, don’t tell me you haven’t occasionally run into a few of these:
```html
<div>
  <a href="/some/stuff" data-event="ev=filedownload" data-link-event=" Our_Book ">
    <span class="">Download Our Book</span>
  </a>
</div>
```
```html
<div class="Cell-sc-1abjmm4-0 Layout__RailCell-sc-1goy157-1 hcxgdw">
  <div class="RailGeneric__RailBox-sc-1565s4y-0 iZilXF mt5"> ... </div>
  <div class="RailGeneric__AdviceBox-sc-1565s4y-3 kObkOT"> ... </div>
</div>
```
Both of the above stray from regular class declarations and defeat attempts to extract data with typical selectors, but in opposite ways. The first example provides almost no traditional markup to hang a CSS selector on. The second contains very specific class names that are dynamically generated by something like React. For both, we can use the same handful of advanced CSS selectors to grab the values we want.
CSS Begins With, Ends With, and Contains
You won’t encounter these CSS selectors very often when building your own site, and maybe that’s why they’re often overlooked in explanations. But many people don’t know that a subset of CSS selectors gives you regex-like matching: attribute/value selectors can match against the beginning, end, or any substring of an attribute’s value.
So in the first example above, something like the following works great:
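A “contains” match on the data-link-event attribute does the job (a sketch against the first snippet above; note the attribute’s value has whitespace around it, so an exact match would fail):

```css
a[data-link-event*="Our_Book"]
```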
Within CSS, square brackets filter elements by attribute, and follow the general format of:
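In its basic exact-match form, the attribute selector looks like this:

```css
element[attribute="value"]
```

For example, a[href="/some/stuff"] would match the anchor in the first snippet above.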
That format in and of itself doesn’t solve either of our issues above; what does is the addition of three regex-like operators: begins with, ends with, and contains.
In the example above grabbing Our_Book (note these selectors are case sensitive), the original markup has extra whitespace on either side of the value, so an exact-match selector would fail. That’s where our friend “contains” comes into play. In short, these selectors work like so:
```css
div[class^="beginsWith"]
div[class$="endsWith"]
div[class*="containsThis"]
```
Here class can be any attribute, and the quoted string matches the beginning, end, or some substring of the attribute’s full value. Note that these selectors match against the attribute’s entire value string, not against individual space-separated class names the way a .className selector does.
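The same idea handles the second example above, where a build tool appends a generated suffix to each class name. Anchoring on the stable, human-readable prefix of the class attribute sidesteps the generated part (a sketch using the class names from that snippet):

```css
/* Matches the div whose class attribute begins with the stable component prefix */
div[class^="RailGeneric__RailBox"]

/* Or match on a substring anywhere in the class attribute */
div[class*="RailGeneric__AdviceBox"]
```

Even if the hashed suffix changes on every deploy, these selectors keep working as long as the readable prefix does.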
Thanks for reading; I hope this has helped outline some methods for scraping pages with dynamic CSS rules. This is an area that continues to be refined and improved, so be sure to keep up to date as progress is made. Feel free to leave any feedback or questions in the comments.