zchtodd

Posted on Jun 21, 2021 • Edited on Jul 5, 2021 • Originally published at theparsedweb.com

Building a SaaS App: Beyond the Basics

#javascript #python #docker

This is the first post in a series on building your own SaaS application. We'll go step by step through what it takes to build a real product: taking payments, system monitoring, user management, and more.

So what kind of product are we going to build?

We're going to build a fully functioning (if minimal) Google rank tracker.

Enter a domain and some keywords, and the app will track that domain's performance on Google search over time. Does this idea make business sense? Probably not! But it's a fun idea that does something useful, it's a task we can accomplish, and you can take it as far as you like. We'll cover all the fundamentals of building a SaaS app along the way.

You can find the complete code on GitHub.

Building the Google Search scraper

Scraping Google search results is the core of this application. Although we could start building just about anywhere, I think beginning with the scraper itself makes sense.

The scraper should take a search query, load several pages of results, and return those results to our app. That sounds so simple! But a lot can go wrong in between. Because we don't want irate emails from unhappy customers, a great deal of the code will be dedicated to handling failures.

Setting up Puppeteer on an AWS instance

We'll use Puppeteer to do the scraping. Puppeteer provides a JavaScript API for remotely controlling a Chromium browser session. Best of all, the browser can run without a desktop environment (headless mode), so our code can execute independently on a server in the cloud. For this tutorial, we'll start with an Ubuntu 18.04 instance on AWS, and step through installing all of the dependencies needed for Puppeteer.

I'm using an EC2 t2.medium instance for this project. It comes with 2 vCPUs and 4GB of RAM, so it's powerful enough to run Puppeteer, as well as what we're going to add later.

Chromium comes bundled with Puppeteer, but a wide array of prerequisite system libraries is needed before we can get started. Luckily, we can install all of them with this one-liner.

sudo apt-get install -y ca-certificates fonts-liberation libappindicator3-1 libasound2 libatk-bridge2.0-0 libatk1.0-0 libc6 libcairo2 libcups2 libdbus-1-3 libexpat1 libfontconfig1 libgbm1 libgcc1 libglib2.0-0 libgtk-3-0 libnspr4 libnss3 libpango-1.0-0 libpangocairo-1.0-0 libstdc++6 libx11-6 libx11-xcb1 libxcb1 libxcomposite1 libxcursor1 libxdamage1 libxext6 libxfixes3 libxi6 libxrandr2 libxrender1 libxss1 libxtst6 lsb-release wget xdg-utils

Once the Chromium dependencies are installed, we can move on to setting up Node v14. The simplest way to do this is via a downloadable setup script, which will tell our package manager how to find v14 of Node, instead of the much older version that it's already pointing to.

curl -sL https://deb.nodesource.com/setup_14.x -o nodesource_setup.sh
sudo bash nodesource_setup.sh
sudo apt-get install -y nodejs

At this point, we have Node and the Chromium dependencies installed. Next we'll create a package.json file so that we can use NPM to install the project dependencies: Puppeteer, plus Axios for communicating with the app server.

{
    "name": "agent-function",
    "version": "0.0.1",
    "dependencies": {
        "axios": "^0.19.2",
        "puppeteer": "10.0.0",
        "puppeteer-extra": "3.1.8",
        "puppeteer-extra-plugin-stealth": "2.7.8"
    }
}

After running npm install, you should have all the necessary pieces in place. Let's use a very simple Node script to verify that Puppeteer is installed and working.

const puppeteer = require("puppeteer-extra");

async function crawl(browser) {
    console.log("It worked!!!");
    // Close the browser so the Node process can exit.
    await browser.close();
}

puppeteer
    .launch({
        headless: true,
        executablePath:
            "./node_modules/puppeteer/.local-chromium/linux-884014/chrome-linux/chrome",
        ignoreHTTPSErrors: true,
        args: [
            "--start-fullscreen",
            "--no-sandbox",
            "--disable-setuid-sandbox"
        ]
    })
    .then(crawl)
    .catch(error => {
        console.error(error);
        process.exit(1); // Exit with a non-zero code to signal the failure.
    });

Notice the headless key in the config object. Setting it to true launches Chromium without a GUI, which is what we want when running on a server in EC2. If all goes well, you'll see It worked!!! printed to the console when you execute this script.

Making a simple Google search request

Now that we know everything is correctly installed, let's start with a simple Google search. We won't bother with any actual scraping at this point. The goal is simply to type a search query into the search bar, load the Google results, and take a screenshot to prove that it worked.

This is the crawl function after updating it to do what I just described.

async function crawl(browser) {
    const page = await browser.newPage();
    await page.goto("https://www.google.com/?hl=en");

    // Find an input with the name 'q' and type the search query into it, while 
    // pausing 100ms between keystrokes.
    const inputHandle = await page.waitForXPath("//input[@name = 'q']");
    await inputHandle.type("puppeteer", { delay: 100 });

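    // Start waiting for the navigation before pressing Enter, so that a fast
    // page load can't finish before we begin listening for it.
    await Promise.all([
        page.waitForNavigation(),
        page.keyboard.press("Enter")
    ]);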

    await page.screenshot({ path: "./screenshot.png" });
    await browser.close();
}

Puppeteer loads the Google search page (adding hl=en to request the English version), enters the search query, and presses enter.

The waitForNavigation method pauses the script until the browser emits the load event (i.e. the page and all of its resources, such as CSS and images, have loaded). This is important, because we'd like to wait until the results are visible before we take the screenshot.
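One subtlety is worth sketching here: if the navigation finishes before waitForNavigation is called, the wait never sees it and eventually times out. The usual pattern is to start waiting and trigger the action together with Promise.all. The helper name navigateWith below is hypothetical, and the only thing it assumes about page is that it exposes waitForNavigation.

```javascript
// Hypothetical helper: start listening for the navigation *before* running
// the action that triggers it, so a fast page load cannot be missed.
async function navigateWith(page, action, options = { waitUntil: "load" }) {
    const [response] = await Promise.all([
        page.waitForNavigation(options),
        action()
    ]);
    return response;
}

// Usage inside crawl(), assuming `page` is a Puppeteer Page:
//   await navigateWith(page, () => page.keyboard.press("Enter"));
```

Passing the action as a function keeps the helper reusable for key presses, clicks, or anything else that causes a page load.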

Hopefully you'll see the Google results page for the query "puppeteer" in screenshot.png after running the script.

Using a proxy network for scraper requests

Odds are good, however, that even if your first request was successful, you'll eventually be faced with a CAPTCHA. This is pretty much inevitable if you send too many requests from the same IP address.

The solution is to route requests through a proxy network to avoid triggering CAPTCHA blocks. The scraper will still be blocked from time to time, but with any luck, the majority of our requests will make it through.

There are many different types of proxies and a huge number of vendors. For a scraping project like this, there are three main options.

Purchasing a single IP address, or a bundle of IP addresses, through a service like Proxyall. This is the lowest cost option. I purchased 5 IP addresses for about $5/month.
Data-center proxies that provide a wide range of IP addresses, but charge for bandwidth. Smartproxy, as an example, provides 100GB for $100. Many of these IP addresses, however, are already blocked.
Residential proxies also provide a wide range of IP addresses, but the addresses come from a residential or mobile ISP, and so will encounter CAPTCHA less frequently. The trade-off comes in price. Smartproxy charges $75 for 5GB of data transfer.

You may be able to get away with no proxy if your scraper works very slowly and makes infrequent requests. I actually want to track rankings for my own site, so going with a handful of dedicated IP addresses made sense.
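To put the bandwidth prices in perspective, here is a rough back-of-the-envelope calculation using the figures quoted above ($100 for 100GB of data-center traffic, $75 for 5GB of residential traffic). The 500KB-per-results-page figure, the workload of 100 keywords, and the function name are all assumptions for illustration; measure your own traffic before budgeting.

```javascript
// Prices quoted in the list above, expressed as dollars per GB.
const PRICE_PER_GB = {
    datacenter: 100 / 100, // $100 for 100GB
    residential: 75 / 5    // $75 for 5GB
};

// Assumed average transfer per results page (hypothetical; measure your own).
const ASSUMED_KB_PER_PAGE = 500;

// Estimate the monthly bandwidth bill for a given number of page loads.
function estimateMonthlyCost(pageLoads, pricePerGb, kbPerPage = ASSUMED_KB_PER_PAGE) {
    const gigabytes = (pageLoads * kbPerPage) / (1000 * 1000);
    return gigabytes * pricePerGb;
}

// Example workload: 100 keywords x 3 pages x 30 days = 9,000 page loads.
const pageLoads = 100 * 3 * 30;
console.log(estimateMonthlyCost(pageLoads, PRICE_PER_GB.datacenter).toFixed(2));  // 4.50
console.log(estimateMonthlyCost(pageLoads, PRICE_PER_GB.residential).toFixed(2)); // 67.50
```

Under these assumptions the residential option costs fifteen times as much per gigabyte, which is why a handful of dedicated IP addresses can be attractive for a small project.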

Sending requests over the proxy, instead of the default network, is straightforward with Puppeteer. The start-up args list accepts a proxy-server value.

puppeteer
    .launch({
        headless: true,
        executablePath:
            "./node_modules/puppeteer/.local-chromium/linux-884014/chrome-linux/chrome",
        ignoreHTTPSErrors: true,
        args: [
            `--proxy-server=${proxyUrl}`, // Specifying a proxy URL.
            "--start-fullscreen",
            "--no-sandbox",
            "--disable-setuid-sandbox"
        ]
    })

The proxyUrl might be something like http://gate.dc.smartproxy.com:20000. Most proxy configurations will require a username and password, unless you're using IP white-listing as an authentication method. You'll need to authenticate with that username/password combination before making any requests.

async function crawl(browser) {
    const page = await browser.newPage();
    await page.authenticate({ username, password });
    await page.goto("https://www.google.com/?hl=en");
}
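Some vendors hand out proxy URLs with the credentials embedded, in the form http://user:pass@host:port. Chromium generally does not use credentials passed in the --proxy-server value, so they need to be split out and handed to page.authenticate separately. Here is a minimal sketch using Node's built-in URL class; the function name and the example credentials are hypothetical.

```javascript
// Hypothetical helper: split a credentialed proxy URL into the value for
// --proxy-server and the object expected by page.authenticate().
function splitProxyUrl(rawUrl) {
    const url = new URL(rawUrl);
    return {
        server: `${url.protocol}//${url.host}`,
        credentials: url.username
            ? {
                  username: decodeURIComponent(url.username),
                  password: decodeURIComponent(url.password)
              }
            : null
    };
}

const proxy = splitProxyUrl("http://alice:s3cret@gate.dc.smartproxy.com:20000");
console.log(proxy.server); // http://gate.dc.smartproxy.com:20000

// Usage, assuming `page` is a Puppeteer Page:
//   args: [`--proxy-server=${proxy.server}`, ...]
//   if (proxy.credentials) await page.authenticate(proxy.credentials);
```

When the URL carries no credentials (for example with IP white-listing), credentials is null and the authenticate step can be skipped.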

Any heavily used scraper is still going to get blocked occasionally, but a decent proxy will make the process sustainable, as long as we build in good error handling.

Gathering the search results

We turn now to the actual scraping part of the process. The overall goal of the app is to track rankings, but for simplicity's sake, the scraper doesn't care about any particular website or domain. Instead, the scraper simply returns a list of links (in the order seen on the page!) to the app server.

To do this, we're going to rely on XPath to select the correct elements on the page. CSS selectors are often not good enough when it comes to complex scraping scenarios. In this case, Google doesn't offer any easy ID or class name that we can use to identify the correct links. We'll have to rely on a combination of class names, as well as tag structure, to extract the correct set of links.

This code will extract the links and press the Next button a predetermined number of times, or until there is no more Next button.

// 'pages' is the maximum number of result pages to load, set by the caller.
let rankData = [];
while (pages) {
    // Find the search result links -- they are children of div elements
    // that have a class of 'g', while the links themselves must also
    // have an H3 tag as a child.
    const results = await page.$x("//div[@class = 'g']//a[h3]");

    // Extract the links from the tags using a call to 'evaluate', which
    // will execute the function in the context of the browser (i.e. not
    // within the current Node process).
    const links = await page.evaluate(
        (...results) => results.map(link => link.href),
        ...results
    );

    const [next] = await page.$x(
        "//div[@role = 'navigation']//a[descendant::span[contains(text(), 'Next')]]"
    );

    rankData = rankData.concat(links);

    if (!next) {
        break;
    }

    // Begin waiting before clicking so the navigation can't be missed.
    await Promise.all([page.waitForNavigation(), next.click()]);

    pages--;
}
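One caveat with the XPath above: @class = 'g' only matches when the class attribute is exactly g, so a result container carrying additional classes would be skipped. A common XPath 1.0 idiom matches a single class token instead. The helper name below is hypothetical, and whether Google's current markup needs it is an assumption you should verify against the live page.

```javascript
// Hypothetical helper: build an XPath predicate that matches one class token,
// regardless of what other classes the element carries.
function hasClass(token) {
    return `contains(concat(' ', normalize-space(@class), ' '), ' ${token} ')`;
}

const resultLinksXPath = `//div[${hasClass("g")}]//a[h3]`;
console.log(resultLinksXPath);

// Usage, assuming `page` is a Puppeteer Page:
//   const results = await page.$x(resultLinksXPath);
```

Padding both the attribute and the token with spaces prevents a class like "bg" or "g-wide" from matching by accident.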

Now that we have the search results, how do we get them out of the Node process and back to somewhere to be recorded?

There are a lot of ways to do this, but I chose to have the app make an API available for the scraper, so that it can send the results as a POST request. The Axios library makes this pretty easy, so I'll share what that looks like here.

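    axios
        .post(`http://172.17.0.1/api/keywords/${keywordID}/callback/`, {
            secret_key: secretKey,
            proxy_id: proxyID,
            results: rankData,
            blocked: blocked,
            error: ""
        })
        .then(() => {
            console.log("Successfully returned ranking data.");
        })
        .catch(error => {
            // Without this handler a failed request is an unhandled rejection.
            console.error("Failed to return ranking data:", error.message);
        });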

Don't worry about the blocked or error variables here. We'll get into error handling in a moment. The most important thing here is the rankData variable, which refers to the list containing all of the search result links.
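The callback itself can fail if the app server is restarting or briefly unreachable. A small retry wrapper is one way to cope with that. This is a sketch under assumed settings (three attempts, a delay that doubles after each failure), not necessarily what the finished project does, and the names postWithRetry and send are hypothetical.

```javascript
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

// Hypothetical helper: call `send` until it succeeds or the attempts run out,
// doubling the delay after each failure.
async function postWithRetry(send, { attempts = 3, baseDelayMs = 500 } = {}) {
    let lastError;
    for (let attempt = 0; attempt < attempts; attempt++) {
        try {
            return await send();
        } catch (error) {
            lastError = error;
            // No point in sleeping after the final attempt.
            if (attempt < attempts - 1) {
                await sleep(baseDelayMs * 2 ** attempt);
            }
        }
    }
    throw lastError;
}

// Usage with the axios call above (callbackUrl and payload are placeholders):
//   await postWithRetry(() => axios.post(callbackUrl, payload));
```

Because send is just a function returning a promise, the same wrapper works for any request, and it can be exercised without a network by passing in a stub.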

Scraper error handling

Handling the unexpected is important in any kind of programming, but especially so with a scraper. There is a lot that can go wrong: running into a CAPTCHA, proxy connection failures, our XPath becoming obsolete, general network flakiness, and more.

Some of our error handling will come later, because we can only do so much within the scraper code itself. The app will need to be smart enough to know when it should retry, or if it should retire a certain proxy IP address because it's getting blocked too frequently.

If you'll recall from earlier, the scraper returns a blocked value. Let's take a look at how we determine whether the scraper has been blocked.

    let blocked = false;

    try {
        const [captcha] = await page.$x("//form[@id = 'captcha-form']");
        if (captcha) {
            console.log("Agent encountered a CAPTCHA");
            blocked = true;
        }
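    } catch (e) {
        // Don't let a failed lookup crash the scraper, but leave a trace.
        console.error("CAPTCHA check failed:", e.message);
    }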

This code simply looks for a form with the ID captcha-form and sets the blocked value to true if one is found. As we'll see later, if a proxy IP is reported as blocked too many times, the app will no longer use that IP address.
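The error field in the callback payload can be filled in the same spirit. Below is a sketch that assumes the scraping steps are wrapped in a function returning { results, blocked }; the wrapper name safeCrawl is hypothetical. Any exception (a proxy connection failure, a navigation timeout) is converted into a message the app can record, instead of crashing the Node process.

```javascript
// Hypothetical wrapper: run the scraping steps and always resolve with a
// payload-shaped object, even when something throws.
async function safeCrawl(scrape) {
    try {
        const { results, blocked } = await scrape();
        return { results, blocked, error: "" };
    } catch (e) {
        return { results: [], blocked: false, error: String(e.message || e) };
    }
}

// Usage, assuming scrapeGoogle(page) performs the steps shown earlier:
//   const payload = await safeCrawl(() => scrapeGoogle(page));
```

The returned object lines up with the results, blocked, and error keys sent in the POST request, so the app receives a report for every run, successful or not.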

What's next?

I hope you've enjoyed this first part of the SaaS app series! Up next, I'll go through setting up NGINX, Flask, and Postgres using Docker, so that our scraper has an API to call. You can always find the complete code for the project on GitHub.