I do a lot of data scraping on the web, and one of the first things I look for is an API. Even if the platform doesn't provide a publicly documented API, it will probably have some sort of undocumented "private" API to facilitate client-server communication like search queries or other fun AJAX stuff without reloading the page.
In fact, because it's undocumented, there may be a lot of security-related issues that its developers haven't thought about, simply because it's not intended to be consumed by the public (but probably more likely, who likes to think about security?).
One thing I've noticed, as a web scraper, is how easy it generally is to make API requests myself.
Let's say you have a blog or other content system and you want to implement a search function, where the user enters a search query and then your server uses it to return a list of relevant results from the database.
After building your API, you might make a request like this on the client-side:
POST https://api.myblog.com/search query=scrape&type=tag
This is intended to return all posts that are tagged with "scrape". You then proceed to test your API, verify that it works, and commit your code.
Fantastic, job well done. You release your feature, and now you can tag your posts and let people find them using tag search.
So now I come along and I want a list of all the posts on your site. Perhaps I want to build my own site that basically mirrors all of your content (think about all those Instagram clones) so that I can enjoy some extra traffic without doing any work. All I really have to do is run a scraper that periodically checks for new content and downloads it to my own server.
To figure out how your site works, I would come to your blog, type in a search query, and hit submit. I would then notice that you make an API request and then proceed to take a look at how the request and responses are constructed:
- the headers, like cookies, origin, host, referer, user-agent, custom headers, etc
- the body, to see what data is sent
- any security features like CSRF tokens or authorization
Then I would replicate this request and send it from my own server, which sidesteps CORS entirely: CORS is enforced by browsers, so it doesn't mean anything to a client that can send whatever Origin header it likes. It also helps that a lot of people probably don't really understand CORS and set Access-Control-Allow-Origin: * on their server anyways because half the answers on StackOverflow recommend it as a solution. Conceptually, it's not that difficult to understand and I highly recommend reading about it, maybe over at MDN.
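Here is a rough sketch of what replicating that request might look like in Python, using only the standard library. The endpoint is the made-up one from the example above, and every header value is an assumption that stands in for whatever the browser's network tab actually shows. The request is only built here, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint from the example above.
SEARCH_URL = "https://api.myblog.com/search"


def build_search_request(query: str, search_type: str = "tag") -> Request:
    """Build a POST that imitates what the site's own front end sends."""
    body = urlencode({"query": query, "type": search_type}).encode("ascii")
    headers = {
        # Assumed values, copied by hand from a real browser session.
        "Origin": "https://myblog.com",
        "Referer": "https://myblog.com/search",
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64)",
        "Content-Type": "application/x-www-form-urlencoded",
    }
    return Request(SEARCH_URL, data=body, headers=headers, method="POST")


req = build_search_request("scrape")
# urllib.request.urlopen(req) would send it; left out on purpose.
```

Nothing stops a non-browser client from setting Origin to any value, which is why the header can't be used to tell your own front end apart from a script.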
I would start by trying different things. Maybe something as simple as an empty string:
POST https://api.myblog.com/search query=&type=tag
Maybe your search engine will match on EVERY tag and then give me everything in one shot.
Or I might try some wildcards, maybe * or %, hoping you don't sanitize your parameters (which can potentially open up a different world of hurt!):
POST https://api.myblog.com/search query=%&type=tag
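The request lines above show the body in its raw form. A scripted client would normally URL-encode it, so the probes might be generated along these lines. The list of probe values is just the ones mentioned here, and `probe_bodies` is a hypothetical helper name.

```python
from urllib.parse import urlencode

# Probe values from the text: an empty string and two common wildcards.
PROBES = ["", "%", "*"]


def probe_bodies(search_type: str = "tag") -> list[str]:
    """Return one form-encoded request body per probe value."""
    return [urlencode({"query": q, "type": search_type}) for q in PROBES]


for body in probe_bodies():
    print(body)
```

Note that % and * get percent-encoded on the wire, so a server that only checks the raw body for those characters before decoding it would miss them.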
Generally, depending on what country you're operating in, the law is on your side since your terms of service will include something about unauthorized access to APIs. And if they don't, they probably should! But of course not everyone will respect the law, and depending on the kind of data you're working with, sometimes you might want to be a bit more proactive in preventing scrapers from taking all your data too quickly.
Getting your database hacked and exposing millions of customer records would be a massive blow to your business even if you have the right to sue the hackers. In the same way, you might not want to wait until scraping actually happens before you play your legal cards.
Unlike securing a database, you can't just stop people from making requests to your server. After all, how does one distinguish between a request from your website, and a request from a 3rd party client that I wrote in Ruby or Python or Java or straight-up curl?
I believe the goal in this case is to make scrapers work hard. The harder they have to work, the more requests they need to make, the slower the data collection process becomes, and the easier it is for you to flag it as suspicious activity and then take action automatically.
Depending on the nature of your content, you might for example enforce a minimum query length and sanitize the inputs on the server side to avoid wildcard operations. Where you do your checks is important because I build my own requests, so any front-end validation is basically useless. Relying on the client to be honest is like giving me the key to your safe and hoping that I don't open it.
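As a minimal sketch of that kind of server-side check: the minimum length of 3 and the set of rejected characters are both assumptions, and you would adjust them to whatever your search backend treats as special. `clean_query` is a hypothetical helper that your request handler would call before touching the database.

```python
MIN_QUERY_LENGTH = 3     # assumed threshold, tune for your content
WILDCARDS = set("%*_")   # assumed: SQL LIKE and glob-style wildcards


def clean_query(raw: str) -> str:
    """Validate a search query on the server, never trusting the client."""
    query = raw.strip()
    if len(query) < MIN_QUERY_LENGTH:
        raise ValueError("query too short")
    if any(ch in WILDCARDS for ch in query):
        raise ValueError("wildcards are not allowed")
    return query


print(clean_query("  scrape "))
```

Rejecting the characters outright is the bluntest option. If your tags can legitimately contain an underscore, escaping them for the database is the gentler alternative.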
Other common examples I've seen in other applications include:
- Limiting the number of requests per time interval (i.e. request cooldowns). If your app users aren't intended to make 100 requests a second, don't let them.
- Paginating your results. This is a pretty common strategy for various performance-related purposes (for better or worse), but combined with request cooldowns, it can be pretty nice.
- Geofencing strategies, where search results are limited based on a provided location, which could be the name of a region or a pair of latitude and longitude coordinates. This might not apply to you, but if it does, it really makes life hard for scrapers.
- Rate limiting, where you cap the total number of API requests that can be made before further requests are refused. This is useful if requests must be authenticated with a token, possibly tied to a user account. It won't be effective if I'm hitting the server directly with the same token that your own client uses.
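To make the cooldown and pagination ideas concrete, here is a small in-memory sketch. `SlidingWindowLimiter` and `paginate` are hypothetical names, the limits are arbitrary, and a real deployment would more likely keep the counters in a shared store than in a Python dict.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most max_requests per client within window_seconds."""

    def __init__(self, max_requests, window_seconds, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._clock = clock                  # injectable so it can be tested
        self._hits = defaultdict(deque)      # client id -> request timestamps

    def allow(self, client_id):
        now = self._clock()
        hits = self._hits[client_id]
        # Forget requests that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False                     # still cooling down
        hits.append(now)
        return True


def paginate(items, page, page_size=10):
    """Return one page of results; pages are numbered from 1."""
    start = (max(page, 1) - 1) * page_size
    return items[start:start + page_size]
```

With, say, 10 results per page and 2 requests per second, pulling 10,000 posts takes at least 1,000 requests and more than 8 minutes, which is plenty of time for the traffic to get noticed.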
By effectively using filters and cooldowns, you can force scrapers to work hard to obtain your data instead of just coming in and then walking away with everything in 5 seconds!
Are there any techniques that you like to use to "secure" your data from scrapers? Or perhaps it's not necessary for the average app developer to think about?