Currently, I'm exploring various ideas for side projects. One of them needs automated scraping that runs as a cron job. Because I'm used to scraping with Python using Selenium, I immediately chose it and connected it to Flask. There was one little problem: my previous scraping experience was only on a local machine, so I needed to learn how to scrape on a server and add a cron job on top of that. Since it's quite a common use case, there should be plenty of tutorials about it to make troubleshooting easier, right?
Oh god, how wrong I was.
Okay, so before I started building this, I researched the three common libraries for scraping in Python, along with their advantages and disadvantages. To recap what I learned:
- BeautifulSoup + requests: The simplest solution (I used requests on a previous project, so I only needed to learn how to parse HTML with BeautifulSoup, which is quite easy too).
- Scrapy: Lots of functionality and faster than Selenium, but less flexible than the other solutions.
- Selenium: Drives a real browser, so it can handle JavaScript-heavy pages, but it's the slowest of the three.
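The requests + BeautifulSoup combo from the list can be sketched like this; the URL and the `h2.title` selector are hypothetical placeholders, not from the actual project:

```python
import requests
from bs4 import BeautifulSoup

def extract_titles(html: str) -> list[str]:
    # Parse the HTML and pull the text out of every matching element.
    soup = BeautifulSoup(html, "html.parser")
    return [tag.get_text(strip=True) for tag in soup.select("h2.title")]

def scrape_titles(url: str) -> list[str]:
    # Fetch the raw HTML with requests, then hand it to BeautifulSoup.
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return extract_titles(response.text)
```

The split between fetching and parsing also makes the parser easy to test without hitting the network.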
In total, it took me one day to finish the prototype for this side project. I tested the scraper locally and it worked. It was slow, but that was expected. Satisfied with the result, I tried to start the scraping on the VPS.
It broke. Okay, no worries, I can fix this.
The error keeps changing, and it's getting more obscure.
I spent a few days after that debugging it, and my final solution was "restart and hope for the best".
There is so much hardship that comes with scraping on a VPS.
First, a VPS doesn't have a GUI, so we need to configure either the server or Selenium so it can run without one. I found two solutions for this: using a virtual display, or running Selenium with a headless browser.
I tried the virtual display approach first, but the same error kept coming up. When I switched to the headless solution, the error changed, so I went with that.
Second, the program frequently crashes.
To handle this, I used WebDriverWait so the driver waits until the browser finishes loading. I also added various try/except blocks around risky lines that had a high chance of raising an error.
Lastly, the driver frequently disconnects.
This was the hardest problem for me. What I found out is that ChromeDriver is very unstable compared to GeckoDriver. Switching drivers successfully reduced the disconnections, but not to zero. So I needed to run the scraper in batches and restart the driver every time it disconnected.
As a result, I changed my approach. I used the simpler scraping method (BeautifulSoup + requests) for the websites that can handle it. For the rest of the websites, I used Selenium and waited patiently.
From this project, I learned that scheduled scraping on a VPS is hard and time-consuming. Moreover, the problems I encountered don't even include the case where the target website blocks your scraping attempt. To handle that, you need to configure your scraper so it doesn't look like it comes from a scraping program. But because I'm doing this side project just for fun, I only changed a few basic configurations. If I ever need to scrape on a VPS again, I will probably invest in a few proxies to rotate the requester's IP, or even use a web scraping service such as ScraperAPI.
But if it's not really needed, I don't want to do scraping on a VPS again.
If you're curious about the side project I'm building, you can check it out on this page. It's a simple price-aggregation website for light novels. I'm planning to release a weekly blog post about something I find interesting while working on my side projects, and this article is one of them.