Web scraping is a powerful tool, but I sometimes find that spinning up a full-fledged Beautiful Soup Python script for a short task is unnecessary. Today I ran into a web page that would not let me select items in a table to copy, and even if it had, I would have ended up with additional, unwanted column data in my clipboard.
Solution: Console web scraping
Let's break this down.
First, I needed a way to capture each element. The text I wanted from the table was wrapped in a <div id="edit-tid-24-view"></div> tag. I first tried targeting these elements with a "begins with" attribute selector:
document.querySelectorAll('[id^="edit-tid"]');
This got me part of the way there, but I needed to target ID attribute values that not only started with this prefix but also ended with -view. In a typical regex, you might write something like /edit-tid.*-view/. A bit greedy, but it would have done the trick in my case. However, querySelectorAll takes CSS selectors, not regular expressions. So I combined two attribute selectors: one for the beginning portion, the other for the ending portion.
document.querySelectorAll('[id^="edit-tid"][id$="-view"]');
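Here is a minimal sketch of what the combined selector matches, written against plain strings so it also runs outside a browser. The helper name matchesViewId and the sample IDs are hypothetical; only the edit-tid prefix and the -view suffix come from the page described above. The guarded part at the end shows one possible way to apply a real regular expression when attribute selectors are not expressive enough.

```javascript
// Hypothetical helper: mirrors [id^="edit-tid"][id$="-view"] on a plain string.
function matchesViewId(id) {
  return id.startsWith('edit-tid') && id.endsWith('-view');
}

// Made-up IDs for illustration only.
const sampleIds = [
  'edit-tid-24-view',
  'edit-tid-24-weight',
  'edit-name',
  'edit-tid-7-view',
];

const matchedIds = sampleIds.filter(matchesViewId);
console.log(matchedIds);

// Browser only: select every element with an id, then filter with a regex.
if (typeof document !== 'undefined') {
  const viaRegex = Array.from(document.querySelectorAll('[id]'))
    .filter((el) => /^edit-tid.*-view$/.test(el.id));
  console.log(viaRegex.length);
}
```

The two-selector form lets the browser do the filtering, so the regex variant is probably only worth it for patterns CSS cannot express.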
After that, it was quite simple. I wanted to loop through the NodeList object that was returned. A NodeList has no map method, so I first converted it to an Array.
Array.from(someObject);
Once there, I could have mapped the innerText of each Node from the DOM to an array of the desired strings.
Array.from(someObject).map(function(item) { return item.innerText; });
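To make the conversion step concrete, here is a small sketch that uses a stand-in object instead of a real NodeList, so it can be tried outside a browser. The name fakeNodeList and its contents are assumptions for illustration. Array.from accepts any array-like object, and it can also take a mapping function as its second argument.

```javascript
// Stand-in for a NodeList: an array-like object with indexed items and a length.
const fakeNodeList = {
  0: { innerText: 'Apples' },
  1: { innerText: 'Pears' },
  length: 2,
};

// Convert first, then map each item to its text.
const texts = Array.from(fakeNodeList).map(function (item) {
  return item.innerText;
});

// Same result in one step: the second argument is applied to every item.
const sameTexts = Array.from(fakeNodeList, (item) => item.innerText);

console.log(texts, sameTexts);
```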
However, I was not satisfied with that.
I wanted my list cleanly output and piped directly to my clipboard. JavaScript lets you select text and execute a copy command on the document object. However, I was working in the console and found something much simpler: the copy function works in the console.
I simply concatenated the strings together with a newline, and copied the result to my clipboard.
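As a rough illustration of this step, the sketch below builds the clipboard text two ways from made-up values: appending a newline after every string, and joining with a newline. The join variant is a swapped-in alternative, not what the original one-liner does; the only difference is that it leaves no trailing newline. The copy call is guarded because copy is a console utility and is not defined in ordinary page scripts or in Node.

```javascript
// Made-up strings standing in for the scraped table cells.
const rows = ['Apples', 'Pears', 'Plums'];

// Concatenation: every row, including the last, is followed by '\n'.
let concatenated = '';
rows.forEach(function (row) {
  concatenated += row + '\n';
});

// join: '\n' goes between rows only, so there is no trailing newline.
const joined = rows.join('\n');

console.log(JSON.stringify(concatenated));
console.log(JSON.stringify(joined));

// Only runs where the console's copy utility exists.
if (typeof copy === 'function') {
  copy(joined);
}
```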
Conclusion
Here's my Developer Tools Console web scraper in all its glory.
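// var keeps the snippet re-runnable in the same console session
var copyText = '';
Array.from(document.querySelectorAll('[id^="edit-tid"][id$="-view"]'))
  .forEach(function (x) {
    // innerText is the rendered text of each matched div
    copyText += x.innerText + '\n';
  });
// copy() is a console utility that places the string on the clipboard
copy(copyText);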
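If the same trick is needed on other pages, the steps could be wrapped in a small function. This is only a sketch: scrapeText and fakeRoot are hypothetical names, and trimming whitespace and skipping empty cells are my assumptions rather than part of the scraper above. The stand-in root exists so the snippet can be tried outside a browser.

```javascript
// Hypothetical helper: collects the text of every element matching `selector`.
function scrapeText(root, selector) {
  return Array.from(root.querySelectorAll(selector), (el) => el.innerText.trim())
    .filter((text) => text.length > 0) // assumption: empty cells are unwanted
    .join('\n');
}

// Stand-in for `document` with made-up contents.
const fakeRoot = {
  querySelectorAll: () => [
    { innerText: ' Apples ' },
    { innerText: '' },
    { innerText: 'Pears' },
  ],
};

const scraped = scrapeText(fakeRoot, '[id^="edit-tid"][id$="-view"]');
console.log(scraped);

// In the Developer Tools console, the real call would look like:
// copy(scrapeText(document, '[id^="edit-tid"][id$="-view"]'));
```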