<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: iankalvin</title>
    <description>The latest articles on DEV Community by iankalvin (@iankalvin).</description>
    <link>https://dev.to/iankalvin</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F472701%2F110a4d7b-8723-45a2-a6dc-b20140ae1fad.JPG</url>
      <title>DEV Community: iankalvin</title>
      <link>https://dev.to/iankalvin</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/iankalvin"/>
    <language>en</language>
    <item>
      <title>Scraping Facebook groups using Python? Avoid getting blocked with Crawlbase</title>
      <dc:creator>iankalvin</dc:creator>
      <pubDate>Fri, 13 Nov 2020 11:56:03 +0000</pubDate>
      <link>https://dev.to/iankalvin/scraping-facebook-groups-using-python-avoid-getting-blocked-with-proxycrawl-456a</link>
      <guid>https://dev.to/iankalvin/scraping-facebook-groups-using-python-avoid-getting-blocked-with-proxycrawl-456a</guid>
      <description>&lt;p&gt;&lt;strong&gt;Scraping Facebook&lt;/strong&gt; may sound easy at first, but I've tried several times crawling and scraping different Facebook groups and ended up getting errors and CAPTCHAs most of the time, or worst, banned. For a beginner like me, this is frustrating and could take a lot of time that could have been used for something more productive. &lt;/p&gt;

&lt;p&gt;There are ways to solve or avoid such hindrances when scraping, like solving CAPTCHAs manually or setting a timer on your script to scrape more slowly. Another workaround is to switch your IP every couple of minutes via proxy servers or a VPN, but that takes considerably more time and effort. &lt;/p&gt;

&lt;p&gt;Luckily, I’ve found a solution that can handle most of the issues we normally encounter when scraping, and it can be easily integrated into any of your scraping projects. &lt;a href="https://crawlbase.com/?s=csv4g83q" rel="noopener noreferrer"&gt;Crawlbase&lt;/a&gt; (formerly known as ProxyCrawl) offers an API that lets you easily scrape the web while protecting your web crawler against blocked requests, proxy failures, IP leaks, browser crashes, and more. They provide one of the best APIs available for everyone, be it for small or big projects.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;p&gt;In this article, I want to share with you how I used Crawlbase to easily crawl Facebook groups using their Crawling API and built-in scraper. We will also tackle some useful parameter features like automatic scrolling to extract more data per API request. &lt;/p&gt;

&lt;p&gt;I will provide a very basic sample API call and Python 3 code, and discuss each part, which you can then use as a baseline for your existing or future projects. The scraper I will be using can extract information like the member count, usernames, members' posts, and much more from a public Facebook group.&lt;/p&gt;

&lt;p&gt;Before we start, let’s have a list of things that we will use for this project:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.python.org/downloads/" rel="noopener noreferrer"&gt;Python 3&lt;/a&gt; &lt;/li&gt;
&lt;li&gt;&lt;a href="https://pypi.org/project/proxycrawl/" rel="noopener noreferrer"&gt;Python library from Crawlbase&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://crawlbase.com/signup?s=csv4g83q" rel="noopener noreferrer"&gt;Crawlbase account&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://crawlbase.com/docs/crawling-api/" rel="noopener noreferrer"&gt;Crawling API&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://crawlbase.com/docs/crawling-api/scrapers" rel="noopener noreferrer"&gt;Facebook Data scraper&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://crawlbase.com/docs/crawling-api/parameters" rel="noopener noreferrer"&gt;Parameters&lt;/a&gt; for the API: page_wait, ajax_wait, scroll, and scroll_interval&lt;/li&gt;
&lt;li&gt;Any public Facebook group URL&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Simple API Call
&lt;/h2&gt;

&lt;p&gt;Now that you have an idea of what we will need to accomplish this task, we can get started. &lt;/p&gt;

&lt;p&gt;First, it is important to know that every request to Crawlbase’s API starts with the following base part:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

https://api.crawlbase.com


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;You will also need an authentication token for every request. Crawlbase provides two kinds of tokens upon signing up: the normal token for generic requests, and the JavaScript token, which acts like a real browser. &lt;/p&gt;

&lt;p&gt;In this case, we will use the JavaScript token, since the page must be rendered via JavaScript to properly scrape Facebook groups. The token is inserted into our request as shown below:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

https://api.crawlbase.com/?token=USER_TOKEN


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;To make an API call, you just need to append the URL (percent-encoded) that you wish to crawl, as in the example below:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

https://api.crawlbase.com/?token=JS_TOKEN&amp;amp;url=https%3A%2F%2Fwww.facebook.com%2FBreakingNews


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This simple line will instruct the API to fetch the full HTML source code of any website that you are trying to crawl. You can make this API request using cURL on your terminal or just open a browser and paste it into the address bar.&lt;/p&gt;
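&lt;p&gt;As a sketch, here is how the encoded request URL can be built in Python 3 (JS_TOKEN is a placeholder for your actual token, and the helper name is my own):&lt;/p&gt;

```python
from urllib.parse import quote_plus

API_BASE = 'https://api.crawlbase.com/'

def build_request_url(token, target_url):
    # The target URL must be percent-encoded before being
    # passed as the url query parameter
    return API_BASE + '?token=' + token + '&url=' + quote_plus(target_url)

print(build_request_url('JS_TOKEN', 'https://www.facebook.com/BreakingNews'))
# https://api.crawlbase.com/?token=JS_TOKEN&url=https%3A%2F%2Fwww.facebook.com%2FBreakingNews
```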

&lt;p&gt;Now that I have explained the very basics of making an API call, we can then try to use this knowledge to scrape Facebook groups. &lt;/p&gt;

&lt;p&gt;Depending on your project, fetching the full HTML source code may not be efficient if you only want to extract a particular data set. You could build your own scraper; however, if you are just starting out or don’t want to spend time and resources building one yourself, Crawlbase has various ready-made data scrapers that make it easy to extract data from supported websites like Facebook.&lt;/p&gt;

&lt;p&gt;Using their data scraper, we can easily retrieve the following information on most Facebook groups:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;title&lt;br&gt;
type&lt;br&gt;
membersCount&lt;br&gt;
url&lt;br&gt;
description&lt;br&gt;
feeds including username, text, link, likesCount, commentsCount&lt;br&gt;
comments including username and text&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;To get all the information mentioned above, we just need to pass three parameters: &amp;amp;scraper=facebook-group, alongside &amp;amp;page_wait and &amp;amp;ajax_wait. Using these will return the result in JSON format. &lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

https://api.crawlbase.com/?token=JS_TOKEN&amp;amp;url=https%3A%2F%2Fwww.facebook.com%2Fgroups%2F198722650913932&amp;amp;scraper=facebook-group&amp;amp;page_wait=10000&amp;amp;ajax_wait=true


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Example output:&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fqxamb2r8m5siini7t2wa.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fqxamb2r8m5siini7t2wa.jpg" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Scraping with Python
&lt;/h2&gt;

&lt;p&gt;Crawlbase provides libraries that anyone can freely use to write a simple API call in Python. The example below shows how we can utilize their Python library in this project.&lt;/p&gt;

&lt;p&gt;First, make sure to download and install the &lt;a href="https://pypi.org/project/proxycrawl/" rel="noopener noreferrer"&gt;Crawlbase API Python class&lt;/a&gt;. You can either download it from GitHub or use the PyPI package manager: &lt;em&gt;pip install proxycrawl&lt;/em&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

from proxycrawl.proxycrawl_api import ProxyCrawlAPI

api = ProxyCrawlAPI({'token': 'YOUR_TOKEN'})

response = api.get('https://www.facebook.com/groups/381067052051677',
                   {'scraper': 'facebook-group', 'ajax_wait':'true','page_wait': 10000})

if response['status_code'] == 200:
    print(response['body'])



&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Note that in this case we do not need to encode the URL, since the library encodes it for us. &lt;/p&gt;
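&lt;p&gt;Since the body returned with the scraper parameter is JSON, it can be parsed with Python's json module. A minimal sketch, assuming the field names shown in the example output above (the sample string here is hypothetical):&lt;/p&gt;

```python
import json

def parse_group(body):
    # body is the JSON string the scraper returns
    data = json.loads(body)
    return {
        'title': data.get('title'),
        'membersCount': data.get('membersCount'),
        # each feed entry carries username, text, likesCount, commentsCount
        'posts': [feed.get('text') for feed in data.get('feeds', [])],
    }

# Hypothetical sample shaped like the output above
sample = '{"title": "Example Group", "membersCount": 1234, "feeds": [{"username": "jane", "text": "Hello"}]}'
print(parse_group(sample))
```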

&lt;p&gt;From this point on, using other parameters would be as easy as adding another option to the GET request.&lt;/p&gt;

&lt;p&gt;Let us use scroll and scroll_interval in this next example. These parameters let our scraper scroll for a set interval, which in turn gives us more data, as if we were scrolling down the page in a real browser. For example, setting the interval to 20 instructs the browser to scroll for 20 seconds after loading the page. The maximum is 60 seconds, after which the API captures the data and returns it to us.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

from proxycrawl.proxycrawl_api import ProxyCrawlAPI

api = ProxyCrawlAPI({'token': 'YOUR_TOKEN'})

response = api.get('https://www.facebook.com/groups/381067052051677',
                   {'scraper': 'facebook-group', 'scroll': 'true', 'scroll_interval': 20})

if response['status_code'] == 200:
    print(response['body'])


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;As you may have noticed in the code, we get status codes back each time we send a request to Crawlbase. The request is a success if both pc_status and original_status are 200. In some cases a request may fail with a different status code, such as 503. However, Crawlbase does not charge for failed requests, so if a request fails for some reason, you can simply retry the call.&lt;/p&gt;
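&lt;p&gt;A minimal retry sketch around the library's api.get call (the function name, retry count, and backoff are my own choices):&lt;/p&gt;

```python
import time

def fetch_with_retries(api, url, options, max_retries=3):
    # Failed requests are not charged, so retrying is safe
    for attempt in range(max_retries):
        response = api.get(url, options)
        if response['status_code'] == 200:
            return response
        time.sleep(2 ** attempt)  # simple exponential backoff
    return None
```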

&lt;p&gt;The example output below shows a successfully scraped public Facebook group.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fkbtp95wq7hz09x9bs5n4.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fkbtp95wq7hz09x9bs5n4.jpg" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;There you have it: scraping Facebook content in just a few lines of code. At the moment, Crawlbase only offers a scraper for groups, but you can use the Crawling API if you wish to crawl other pages. &lt;/p&gt;

&lt;p&gt;Remember, you can use any programming language you are familiar with, and this can be integrated into any of your existing systems. The Crawlbase API is stable and reliable enough to serve as the backbone of any of your apps. They also offer great support for all their products, which is why I’m happy using their service.&lt;/p&gt;

&lt;p&gt;I hope you have learned something new in this article. Do not forget to sign up at &lt;a href="https://crawlbase.com/signup?s=csv4g83q" rel="noopener noreferrer"&gt;Crawlbase&lt;/a&gt; to get your token if you want to test this on your end. The first 2000 requests are free of charge; just make sure to use the links found in this guide. :)&lt;/p&gt;

</description>
      <category>webscraping</category>
      <category>facebook</category>
      <category>api</category>
      <category>python</category>
    </item>
    <item>
      <title>Scraping in Node.js + Cheerio made easy with ProxyCrawl</title>
      <dc:creator>iankalvin</dc:creator>
      <pubDate>Wed, 07 Oct 2020 07:02:42 +0000</pubDate>
      <link>https://dev.to/iankalvin/scraping-on-node-js-cheerio-made-easy-with-proxycrawl-397</link>
      <guid>https://dev.to/iankalvin/scraping-on-node-js-cheerio-made-easy-with-proxycrawl-397</guid>
      <description>&lt;p&gt;If you are new to web scraping like me, chances are, you already experienced being blocked by a certain website or unable to bypass CAPTCHAs.&lt;/p&gt;

&lt;p&gt;While searching for an easy way to scrape web pages without worrying too much about being blocked, I came across &lt;a href="https://proxycrawl.com/?s=csv4g83q" rel="noopener noreferrer"&gt;ProxyCrawl&lt;/a&gt;, which offers an easy-to-use Crawling API. The product allowed me to scrape Amazon pages smoothly with incredible reliability.&lt;/p&gt;

&lt;p&gt;In this article, I want to share the steps of how I built a scraper and integrated the Crawling API into my project. This simple code scrapes product reviews from a list of Amazon URLs and writes the scraped data straight to a CSV file.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Preparation&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;For this Node project, I used ProxyCrawl's library and Cheerio, which is like a jQuery tool for the server, used in web scraping. So before starting with the actual coding, here is a list of everything needed for this to work:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;We need a list of URLs so I have provided several examples &lt;a href="https://drive.google.com/file/d/13U15rzX4zDYmUvPiQ6_meD_Wj1iuob04/view?usp=sharing" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;A &lt;a href="https://proxycrawl.com/login?s=csv4g83q" rel="noopener noreferrer"&gt;ProxyCrawl account&lt;/a&gt;. They have a free trial that you can use to call their API free of charge for your first 1000 requests, so this is perfect for our project.&lt;/li&gt;
&lt;li&gt;The &lt;a href="https://proxycrawl.com/dashboard/api/libraries?s=csv4g83q" rel="noopener noreferrer"&gt;Nodejs library&lt;/a&gt; from ProxyCrawl&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/cheeriojs/cheerio" rel="noopener noreferrer"&gt;Node Cheerio Library&lt;/a&gt; from GitHub&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Really, that’s it. So, without further ado, let’s start writing the code.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Coding with Node&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;At this point, you may already have installed your favorite code editor, but if not, I recommend installing Visual Studio code.&lt;/p&gt;

&lt;p&gt;To set up our project structure, please do the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Create a project folder and name it Amazon &lt;/li&gt;
&lt;li&gt;Inside the folder, create a file and name it Scraper.js&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once done, go to your terminal and install the following requirements:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;npm i proxycrawl&lt;/li&gt;
&lt;li&gt;npm i cheerio&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After the package installation, go to your Amazon folder and paste the text file that contains the list of Amazon URLs which will be scraped by our code later.&lt;/p&gt;

&lt;p&gt;Our project structure should now look like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fzmezeysv8qqr6jtfw58x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fzmezeysv8qqr6jtfw58x.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now that everything is set, let us start writing our code in the Scraper.js file. The following lines will load the Amazon-products.txt file into an array:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const fs = require('fs');
const file = fs.readFileSync('Amazon-products.txt');
const urls = file.toString().split('\n').filter((url) =&amp;gt; url.trim() !== ''); // skip blank lines
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, we’ll utilize the ProxyCrawl node library so we can easily integrate the crawling API into our project.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const { ProxyCrawlAPI } = require('proxycrawl');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This code below will create a worker where we can place our token. Just make sure to replace the value with your normal token from your ProxyCrawl account:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const api = new ProxyCrawlAPI({ token: '_YOUR_TOKEN_' });
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After that, we can write code that makes 10 requests per second to the API, using the setInterval function to crawl each of the URLs in your text file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const requestsPerSecond = 10;
var currentIndex = 0;
setInterval(() =&amp;gt; {
  for (let i = 0; i &amp;lt; requestsPerSecond; i++) {
    api.get(urls[currentIndex]);
    currentIndex++;
  }
}, 1000);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At this point, we’re just loading the URLs. To do the actual scraping, we will use the Node Cheerio library and extract the reviews from the full HTML code of the webpage.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const cheerio = require('cheerio');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The next part of our code is a function which will parse the returned HTML.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;function parseHtml(html) {
  // Load the html in cheerio
  const $ = cheerio.load(html);
  // Load the reviews
  const reviews = $('.review');
  reviews.each((i, review) =&amp;gt; {
    // Find the text children
    const textReview = $(review).find('.review-text').text().replace(/\s\s+/g, '');
    console.log(textReview);
  })
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This code is ready to use but will just log the results in the console.  Let’s go ahead and insert a few lines to write this into a CSV file instead.&lt;/p&gt;

&lt;p&gt;To do this, we will use the FS module that comes with Node, then create a variable called writeStream.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const fs = require('fs');
const writeStream = fs.createWriteStream('Reviews.csv');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Remember that Reviews.csv is your CSV file; you can name it whatever you want.&lt;/p&gt;
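&lt;p&gt;Since review text can contain commas, quotes, or line breaks, strict CSV output needs escaping. A small sketch of an RFC 4180-style helper (the function name is my own):&lt;/p&gt;

```javascript
// Wrap a field in quotes and double any embedded quotes,
// so commas and newlines inside the field stay intact
function csvEscape(field) {
  return '"' + String(field).replace(/"/g, '""') + '"';
}

console.log(csvEscape('Great product, would buy "again"'));
// "Great product, would buy ""again"""
```

&lt;p&gt;You could then write writeStream.write(csvEscape(textReview) + '\n'); instead of the raw string.&lt;/p&gt;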

&lt;p&gt;We’ll add a header as well:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;writeStream.write(`ProductReview \n \n`);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Lastly, we’ll have to instruct our code to write the actual value to our CSV file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;writeStream.write(`${textReview} \n \n`);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now that our scraper is complete, the full code should look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const fs = require('fs');
const { ProxyCrawlAPI } = require('proxycrawl');
const cheerio = require('cheerio');
const writeStream = fs.createWriteStream('Reviews.csv');

//headers
writeStream.write(`ProductReview \n \n`);

const file = fs.readFileSync('Amazon-products.txt');
const urls = file.toString().split('\n').filter((url) =&amp;gt; url.trim() !== ''); // skip blank lines
const api = new ProxyCrawlAPI({ token: '_YOUR_TOKEN_' });

function parseHtml(html) {
  // Load the html in cheerio
  const $ = cheerio.load(html);
  // Load the reviews
  const reviews = $('.review');
  reviews.each((i, review) =&amp;gt; {
    // Find the text children
    const textReview = $(review).find('.review-text').text().replace(/\s\s+/g, '');
    console.log(textReview);
    // write the reviews in the csv file
    writeStream.write(`${textReview} \n \n`);
  })
}

const requestsPerSecond = 10;
var currentIndex = 0;
const interval = setInterval(() =&amp;gt; {
  for (let i = 0; i &amp;lt; requestsPerSecond; i++) {
    // Stop once every URL has been requested
    if (currentIndex &amp;gt;= urls.length) {
      clearInterval(interval);
      return;
    }
    api.get(urls[currentIndex]).then(response =&amp;gt; {
      // Make sure the response is a success
      if (response.statusCode === 200 &amp;amp;&amp;amp; response.originalStatus === 200) {
        parseHtml(response.body);
      } else {
        console.log('Failed: ', response.statusCode, response.originalStatus);
      }
    });
    currentIndex++;
  }
}, 1000);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;RESULT&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;To run your scraper, simply press F5 on Windows or go to your terminal and type &lt;em&gt;node Scraper.js&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Example output: &lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fy84ffaax723zwq3c38g6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fy84ffaax723zwq3c38g6.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I hope you’ve learned something from this guide. Just remember to sign up at &lt;a href="https://proxycrawl.com/signup?s=csv4g83q" rel="noopener noreferrer"&gt;ProxyCrawl&lt;/a&gt; to get your token and use the API to avoid blocks. &lt;/p&gt;

&lt;p&gt;Feel free to utilize this code however you like 😊&lt;/p&gt;

</description>
      <category>webscraping</category>
      <category>scraping</category>
      <category>node</category>
      <category>cheerio</category>
    </item>
  </channel>
</rss>
