If you are into building web scrapers, you know how hard it is to scrape infinite loading pages. Most of the search results you find on Google focus on two methods:
- Find the AJAX request in the network tab and try to scrape the data from it.
- Use a combination of `document.body.scrollHeight`, `window.scrollTo`, and some kind of for loop.
Unfortunately, most of them do not work well with lazy loading images, or with pages where infinite loading is triggered by smooth scrolling. Fiddling with the network tab also seems counterproductive in many cases, and it can easily get you flagged as a bot on some websites.
I had been thinking about a solution to this for a few years. I tried all sorts of different approaches and got disappointed, because they varied a lot between websites.
It finally clicked when I was updating this and this on Stack Overflow. Feel free to explore them.
Here is a small preview of what we will be building today. It's a one minute video (sorry, no audio).
Case:
You need to scrape 100 results from Product Hunt. Each result should contain the post title and image URL in the following structure. The scraper has to stop once the limit has been reached or there are no elements left.
```json
[
  {
    "title": "Some product title",
    "img": "https://ph-files.imgix.net/123456-abcdefghijkl"
  }
]
```
We will be using the surefire method called `window.scrollTo`, but not with `document.body.scrollHeight`.
Solution:
PS: If you want to read the code, jump to the final code.
Here is what we will do:
- We will extract the selector (obviously 🤷).
- Then we will find the first element on the page for that selector. We won't continue if there are no elements.
- Scroll the element into view.
- Optional! Wait for a few milliseconds to let it load images and so on.
- Extract information from that element.
- Important! Remove the element from dom.
- Important! Scroll to top of the page.
- Move on to the next element, or stop if the limit has been reached.
The steps marked as Important are the key. They trigger the scroll event on the page without manually scrolling the way others do with `document.body.scrollHeight` and so on.
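To make that idea concrete, here is a bare-bones sketch of just those two important steps. The `.item` selector is a placeholder for illustration: removing a processed element shrinks the page, and jumping back to the top keeps feeding fresh elements into the viewport, which is what makes the site load the next batch.

```js
// Minimal sketch of the two "Important" steps; ".item" is a placeholder selector.
const consumeNext = () => {
  const el = document.querySelector(".item"); // hypothetical selector for illustration
  if (!el) return false;       // nothing left to process
  el.scrollIntoView();         // bring it into view (fires scroll / lazy loading)
  el.remove();                 // shrink the page so new items have to be fetched
  window.scrollTo(0, 0);       // jump back to the top of the page
  return true;
};
```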
Alright, now that you know the solution, you can code the rest of it from the video above.
I'm kidding 😅! Here are the steps!
Extracting the selectors
You have probably done this lots of times, but here is a short recap anyway. Keep in mind that the exact selectors might change by the time you are reading this post.
Open Chrome and load the Product Hunt page, then right click on a title and inspect it.
Now pick any of these classes. We will find the right one in the next step.
Type the class name into the console. The console evaluates it instantly, so you will know right away whether the selector is correct.
We got 27 results, so we are probably on the right track, since there are roughly 20-30 results when the page first loads.
Alright, next we can extract the selector for the image.
Fortunately for us, the image selector is even more straightforward, because we have a nice data attribute there.
However, if you tweak the selector a bit, you will see it matches only 25 of the 27 products, which means the last two images have not loaded yet.
If you scraped this page right now, you would get 25 proper results.
Additionally, I extracted the parent element for each product listing.
Now I see something a bit weird: it says 34 results, which means the last 7 entries have not loaded at all, not even their titles. They are in the DOM, but their content is not loaded at the moment.
Finally, we have three selectors:
- Product Entry (Optional): `div.white_09016 ul li`
- Title: `.title_9ddaf`
- Image: `[data-test="post-thumbnail"] img`
These selectors can change at any time, since it is a React based website.
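As a quick sanity check, you can count the matches for each selector in the DevTools console. The numbers in the comments below are the ones from this walkthrough; yours will differ depending on how much of the page has loaded.

```js
// Count how many elements each selector currently matches.
const count = sel => document.querySelectorAll(sel).length;

console.log({
  entries: count("div.white_09016 ul li"),            // e.g. 34 in this walkthrough
  titles: count(".title_9ddaf"),                      // e.g. 27
  images: count('[data-test="post-thumbnail"] img'),  // e.g. 25
});
```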
Scrape the data
You can execute this code in the browser's console or via some script/library, e.g. puppeteer has a `page.evaluate` method for executing functions on a page. I will be using Scratch JS to run the code on the page.
Grab Single Product
Let's create an async function called `scrollAndExtract` which accepts two parameters called `selector` and `leaf`. The leaf is the property we read from the element, such as `innerText` or `src`. We need `async` because we will be using a delay inside the function for showcase purposes.
```js
const scrollAndExtract = async (selector, leaf) => {
  const element = document.querySelector(selector);
  if (element) {
    element.scrollIntoView();
    return element[leaf];
  }
};
```
Let’s run it,
scrollAndExtract(".title_9ddaf", "innerText").then(console.log);
scrollAndExtract('[data-test="post-thumbnail"] img', "src").then(console.log);
Cool! We got the first title and image URL.
Scroll and Remove the element
Next, we will remove the element from the view. We can do this in a simpler manner by adding another parameter and tweaking our function a bit.
Let's add a `remove` parameter. If it's provided, we will remove the element instead of extracting the data.
```js
const scrollAndExtract = async (selector, leaf, remove) => {
  const element = document.querySelector(selector);
  if (element) {
    element.scrollIntoView();
    if (remove) return element.remove(); // <-- Remove and exit
    return element[leaf];
  }
};
```
Let’s test it out,
scrollAndExtract(".title_9ddaf", "innerText").then(() => {
scrollAndExtract(".title_9ddaf", null, true);
});
The product title vanished!
Scrape the image
Now we can scrape the image as well, in a similar fashion.
```js
scrollAndExtract('[data-test="post-thumbnail"] img', "src").then(() => {
  scrollAndExtract('[data-test="post-thumbnail"] img', "src", true);
});
```
This will extract the src attribute from the image.
Both of them can be merged into a single function which returns an object. We can push it to an array later.
```js
async function extractor() {
  const title = await scrollAndExtract(".title_9ddaf", "innerText");
  await scrollAndExtract(".title_9ddaf", null, true);

  const img = await scrollAndExtract('[data-test="post-thumbnail"] img', "src");
  await scrollAndExtract('[data-test="post-thumbnail"] img', null, true);

  return { title, img };
}
```
Let’s test it out,
```js
extractor().then(console.log);
```
Optional: Remove parent container for the title and image
Let’s remove the parent element after scraping the title.
This is optional because the logic will work even without it, but it saves some space in the viewport, and some memory as well, since we are removing the DOM element.
By removing the parent container we also won't have to worry about removing the image or title elements separately, since they are removed along with it.
```js
async function extractor() {
  const title = await scrollAndExtract(".title_9ddaf", "innerText");
  const img = await scrollAndExtract('[data-test="post-thumbnail"] img', "src");

  // remove the parent here
  await scrollAndExtract("div.white_09016 ul li", null, true);

  return { title, img };
}
```
It should work flawlessly.
Loop through 100 elements
We won’t be using a traditional for loop. We will use recursion instead.
Let's create another function to go through the elements one by one. We will store the results in a `products` array.
```js
const products = [];

async function hundredProducts() {
  if (products.length < 100) {
    const data = await extractor();
    if (!data.title || !data.img) return null;

    products.push(data);
    return hundredProducts();
  }
}
```
This will grab the first hundred elements for us. Not only that, it will stop the loop if the extractor returns no results.
We can peek into the `products` array to grab our results.
```js
hundredProducts().then(() => console.log(products));
```
And bam!
We got 7 results!
Wait! Wut?
Adding a small delay for lazily loaded product data and images
As you can see, we got only 7 results. That's because we told it to stop the loop if there is no image or title, and it scrolled too fast to trigger any scroll events and load new data.
Let’s use a simple delay function, which will wait for a bit before running the loop.
```js
const delay = d => new Promise(r => setTimeout(r, d));
```
Also, optionally, we will scroll to the top of the page.
```js
const products = [];

async function hundredProducts() {
  if (products.length < 100) {
    // Let's wait 0.5 seconds before moving to the next one
    await delay(500);
    // also trigger a scroll event just in case
    window.scrollTo(0, 0);

    const data = await extractor();
    if (!data.title || !data.img) return null;

    products.push(data);
    return hundredProducts();
  }
}
```
Final Result
Alright! It's been a long post, and now we have a script and the logic to scrape infinite scrolling pages like Product Hunt.
Here is the complete code, which you can run in your browser's console. Make sure to un-comment the last line to run `hundredProducts()` and then log the `products` array.
```js
const delay = d => new Promise(r => setTimeout(r, d));

const scrollAndExtract = async (selector, leaf, remove) => {
  const element = document.querySelector(selector);
  if (element) {
    element.scrollIntoView();
    if (remove) return element.remove(); // <-- Remove and exit
    return element[leaf];
  }
};

async function extractor() {
  const title = await scrollAndExtract(".title_9ddaf", "innerText");
  const img = await scrollAndExtract('[data-test="post-thumbnail"] img', "src");

  // remove the parent here
  await scrollAndExtract("div.white_09016 ul li", null, true);

  return { title, img };
}

const products = [];

async function hundredProducts() {
  if (products.length < 100) {
    await delay(500);
    window.scrollTo(0, 0);

    const data = await extractor();
    if (!data.title || !data.img) return null;

    products.push(data);
    return hundredProducts();
  }
}

// hundredProducts().then(() => console.log(products))
```
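Once the run finishes, an easy way to get the data out of the browser is to serialize the `products` array and put it on the clipboard. The `copy()` helper used below is a Chrome DevTools console utility, not a standard browser API, so this only works when run directly in the console.

```js
// Run in the DevTools console: copies the scraped data to the clipboard as JSON.
hundredProducts().then(() => copy(JSON.stringify(products, null, 2)));
```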
Optional: Puppeteer script
If you want to automate this with puppeteer, you can put the code inside a `page.evaluate` function. Here is a snippet and here is the git repo with the complete code.
```js
const puppeteer = require("puppeteer");

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto("https://www.producthunt.com");

  const productList = await page.evaluate(async () => {
    // paste the final code here
    // ...
    // run the function to grab data
    await hundredProducts();
    // and return the product from inside the page
    return products;
  });

  await browser.close();
})();
```
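The snippet above collects `productList` but does not do anything with it yet. As a minimal follow-up sketch (the `products.json` filename and the `waitForSelector` call are my additions, not part of the original script), you could wait for the first title to render and then write the results to disk with Node's `fs` module:

```js
const fs = require("fs");
const puppeteer = require("puppeteer");

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto("https://www.producthunt.com");

  // Wait until at least one title is rendered, since the page is built client-side.
  await page.waitForSelector(".title_9ddaf");

  const productList = await page.evaluate(async () => {
    // paste the final code here, then:
    await hundredProducts();
    return products;
  });

  // Write the scraped data to a local JSON file ("products.json" is just an example name).
  fs.writeFileSync("products.json", JSON.stringify(productList, null, 2));

  await browser.close();
})();
```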
Closing Thoughts
This post looks ten times bigger than all the other posts on the internet, but the final code above is actually quite small. No crazy scroll-to-height tricks or anything like that.
But hopefully I was able to show you a different way of scraping than the one you normally use. Feel free to fiddle and experiment with the data.
Let me know what you think of this method, and what you think is the best method out there for scraping infinite scrolling pages in general.