Jordan Scrapes Websites for Keywords

#webscraping #axios #javascript #typescript

Axios

Okay, fine. Axios is pretty good. I’ve been pretty stubborn in my use of request and request-promise. And honestly, they’ve been great. I’ve gotten used to what they can do and they’ve been consistent.

The link checking stuff I’ve been working on, however, has made it important to get accurate responses from the sites being scraped. As I was going through thousands of pages, I was getting a lot of false negatives. Pages would return a 403 or just give me an ECONNREFUSED when using request-promise, but when I checked the same pages in the browser, they would work fine.

I’m working on another post with more details on this, but for now I can confidently say that Axios completed far more successful requests than request-promise. I’m going to dig further because I have to imagine that the same work is happening under the covers and maybe I just have some kind of config wrong in request-promise.

A tale of three functions

async function getLinks

export async function getLinks(html: any, originalDomain: string, links: any[]) {
    const $ = cheerio.load(html);

    $('a').each((index, element) => {
        let link = $(element).attr('href');
        if (link && (!link.includes('javascript:') && !link.includes('tel:') && !link.includes('mailto:'))) {
            // Sometimes the first character of the link isn't the domain and has a slash. Let's clean it up
            if (link.charAt(0) === '/') {
                // This is within our original domain, so we are good
                link = link.slice(1)
            }
            // our original domain isn't in this link, skip it
            else if (!link.includes(originalDomain)) {
                return true;
            }

            let linkToPush = link.includes('http') ? link : `${originalDomain}/${link}`;
            linkToPush = linkToPush.split('?')[0];

            // We're going to skip #comment and #respond since it's not really a link
            if (!linkToPush.includes('#comment') && !linkToPush.includes('#respond') 
                && !linkToPush.includes('.PNG')
                && !linkToPush.includes('.png') 
                && !linkToPush.includes('.jpg')
                && !linkToPush.includes('.jpeg')
                && links.indexOf(linkToPush) === -1) {
                links.push(linkToPush);
            }
        }
    });

    return links;

}

This function is pretty much identical to the one from the link checker. The idea is that it accepts any html and looks for new links in order to scrape through an entire domain.

In the link checker I checked the status of every link found within the target domain, regardless of whether it pointed to another domain or not. In this project, I wanted to target specific domains and so didn’t do anything with the links that were pointing to another domain.

I didn’t do any checking of URLs that included common image extensions, like .png or .jpg. They aren’t going to contain any useful keywords so I saved myself the time and skipped them.
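The string checks in getLinks can also be expressed with the built-in URL class. What follows is only a minimal sketch, not the code this script uses: normalizeLink and SKIPPED_EXTENSIONS are hypothetical names, and it assumes "same site" means "same hostname", so www.example.com and example.com would count as different sites.

```typescript
// Hypothetical helper (not part of the original script): resolve an href against
// the page it was found on and keep it only if it stays on the same host.
const SKIPPED_EXTENSIONS = ['.png', '.jpg', '.jpeg'];

function normalizeLink(href: string, pageUrl: string): string | null {
    let resolved: URL;
    try {
        // The URL constructor resolves '/about', 'about' and absolute URLs alike.
        resolved = new URL(href, pageUrl);
    } catch {
        return null; // not parseable as a URL
    }

    // Drops javascript:, tel:, mailto: and anything else that isn't a web page.
    if (resolved.protocol !== 'http:' && resolved.protocol !== 'https:') {
        return null;
    }

    // Assumption: a different hostname means a different site, so skip it.
    if (resolved.hostname !== new URL(pageUrl).hostname) {
        return null;
    }

    // Compare extensions case-insensitively, so '.PNG' needs no separate entry.
    const path = resolved.pathname.toLowerCase();
    if (SKIPPED_EXTENSIONS.some(ext => path.endsWith(ext))) {
        return null;
    }

    // Strip the query string and every fragment, not just #comment and #respond.
    resolved.search = '';
    resolved.hash = '';
    return resolved.href;
}
```

One difference from getLinks is that this drops every fragment rather than only #comment and #respond, which may or may not be what you want for single-page sites.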

function checkKeywords

async function checkKeywords(html: string, keywords: string[], pagesWithKeywords: string[], currentUrl: string) {
    if (new RegExp(keywords.join("|")).test(html)) {
        console.log('found a potential here', currentUrl);
        pagesWithKeywords.push(currentUrl);
    }
}

Super simple. I accept an array of keywords and the html. I just do a simple regex test and if any of them are found on the page, I push the currentUrl into an array.
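One thing to watch with keywords.join("|") is that the keywords are treated as regex source. A keyword like "c++" would make the RegExp constructor throw, and an empty keyword list produces a pattern that matches every page. Here is a hedged sketch of a safer builder. The names escapeRegExp and buildKeywordRegex are hypothetical, and the case-insensitive flag is my assumption, not something the original script does.

```typescript
// Hypothetical helper: escape regex metacharacters so a keyword such as "c++"
// or "node.js" is matched literally.
function escapeRegExp(text: string): string {
    return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: build one alternation from the keyword list.
function buildKeywordRegex(keywords: string[]): RegExp | null {
    const cleaned = keywords.map(k => k.trim()).filter(k => k.length > 0);
    // An empty alternation would match every page, so signal "no pattern" instead.
    if (cleaned.length === 0) {
        return null;
    }
    // Assumption: the 'i' flag treats "Pricing" and "pricing" as the same keyword.
    return new RegExp(cleaned.map(escapeRegExp).join('|'), 'i');
}
```

A caller would check for null first and skip the page test entirely when there are no usable keywords.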

It’s probably worth noting that this is not great functional programming at all. These functions are absolutely not pure: they mutate the arrays passed into them. I don’t love this and maybe I’ll adjust this more in the future.

async function getEmailAddresses

export async function getEmailAddresses(html: any, emails: string[] = []) {
    const regex = /([a-zA-Z0-9._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9._-]+)/gi;

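    // match() returns null when nothing is found, so the type has to allow for it
    const emailsToTest: string[] | null = html.match(regex);
    if (emailsToTest) {
        // Walk every match, including the last one
        for (let i = 0; i < emailsToTest.length; i++) {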
            const testTerms = ['.jpg', '.jpeg', '.png', '.svg', '.img', '.gif', '@example', '@email'];
            if (!testTerms.some(term => emailsToTest[i].toLowerCase().includes(term)) && emails.indexOf(emailsToTest[i]) === -1) {
                emails.push(emailsToTest[i]);
            }
        }
    }
    return Promise.resolve();
}

Same idea as above. I have a regex for common email address formats and I test the html for it. I also do a check to try and ensure that I’m not duplicating email addresses.
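Since the functions here mutate the arrays handed to them, it may help to see what a pure variant could look like. This is only a sketch under my own assumptions: extractEmails is a hypothetical name, and lowercasing addresses before deduplicating is a choice the original does not make.

```typescript
// Same pattern and ignore list as getEmailAddresses.
const EMAIL_REGEX = /([a-zA-Z0-9._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9._-]+)/gi;
const IGNORED_TERMS = ['.jpg', '.jpeg', '.png', '.svg', '.img', '.gif', '@example', '@email'];

// Hypothetical pure variant: takes html, returns a new array, mutates nothing.
function extractEmails(html: string): string[] {
    // match() returns null when there are no matches, so fall back to an empty list.
    const matches = html.match(EMAIL_REGEX) || [];
    const emails: string[] = [];
    for (const candidate of matches) {
        // Assumption: lowercase first so Bob@site.com and bob@site.com count once.
        const lowered = candidate.toLowerCase();
        const ignored = IGNORED_TERMS.some(term => lowered.includes(term));
        if (!ignored && emails.indexOf(lowered) === -1) {
            emails.push(lowered);
        }
    }
    return emails;
}
```

The caller then decides how to merge results from several pages, which keeps the deduplication logic in one place and makes the function easy to test on its own.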

async function getEverything

async function getEverything(html: any, originalDomain: string, currentUrl: string, keywords: string[], emails: string[], pagesWithKeywords: string[]) {
    console.log('checking:', currentUrl);
    checkKeywords(html, keywords, pagesWithKeywords, currentUrl);
    await getEmailAddresses(html, emails);

    if (pagesWithKeywords.length > 0) {
        return Promise.resolve();
    }
    else {
        let newLinks: any[] = [];
        const newDomain = new URL(currentUrl).origin;
        if (domainCheck(currentUrl, originalDomain, newDomain)) {
            newLinks = await getLinks(html, originalDomain, newLinks)
        }
        // Let's cap how deep we go to 100 additional checks
        for (let i = 0; i < 100; i++) {
            if (pagesWithKeywords.length > 0) {
                return Promise.resolve();
            }

            if (newLinks[i]) {
                console.log('checking new link:', newLinks[i]);
                try {
                    // TODO: Can't this be done recursively?
                    const response = await axios(newLinks[i]);
                    checkKeywords(response.data, keywords, pagesWithKeywords, currentUrl);
                    // Scan the page we just fetched, not the original html again
                    await getEmailAddresses(response.data, emails);
                }
                catch (e) {
                    console.log('could not get new link', newLinks[i] );
                }
            }
        }
    }

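    // A match found on the final pass of the loop is only visible here
    if (pagesWithKeywords.length > 0) {
        return Promise.resolve();
    }

    return Promise.reject('No potential found.');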
}

This function ties it all together. There are a few notable points in this function. The first is the check that says if I’ve already found a page with keywords, let’s be done checking this domain. I only need to see if the domain contains the keywords once and then I know they are a viable lead.

Second is that while I do get a bunch of new links from getLinks, I limit the number of those links I check to an arbitrary 100. I guess I kind of make the assumption that if I haven’t found the keywords I’m looking for in 100 pages, the site probably doesn’t have them. It’s also a time protector. Active sites can easily have thousands of pages and I don’t want to spend the time going through all of that.

The gross part

This is a script that I whipped up pretty quickly and it definitely needs more polish. The biggest part that I dislike is…why am I not calling getEverything recursively? I really need my parent function that is initiating all of this to manage how many times it gets called. Or getEverything could be the parent function but that means I need another function to hold all of the rest.
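One possible shape for that parent function, sketched with a queue instead of recursion so the page budget lives in one place. Everything here is hypothetical: crawlForKeywords, FetchPage, ExtractLinks and CrawlResult are names I made up, and the fetcher and link extractor are passed in so the sketch does not depend on axios or cheerio. Unlike getEverything, it follows links found on every page it visits, not just the first one.

```typescript
// Hypothetical types: the caller supplies how to fetch a page and how to pull links out of it.
type FetchPage = (url: string) => Promise<string>;
type ExtractLinks = (html: string, pageUrl: string) => string[];

interface CrawlResult {
    matchedUrl: string | null; // first page where a keyword was found, if any
    pagesTried: number;        // requests attempted, including failed ones
}

async function crawlForKeywords(
    startUrl: string,
    keywords: RegExp, // should not use the 'g' flag, since test() would then keep state
    fetchPage: FetchPage,
    extractLinks: ExtractLinks,
    maxPages = 100
): Promise<CrawlResult> {
    const queue: string[] = [startUrl];
    const seen = new Set<string>([startUrl]);
    let pagesTried = 0;

    while (queue.length > 0 && pagesTried < maxPages) {
        const url = queue.shift() as string;
        pagesTried++;

        let html: string;
        try {
            html = await fetchPage(url);
        } catch {
            continue; // could not get this page, move on to the next one
        }

        // Stop at the first hit: one matching page is enough for the whole domain.
        if (keywords.test(html)) {
            return { matchedUrl: url, pagesTried };
        }

        // Queue links we have not seen yet.
        for (const link of extractLinks(html, url)) {
            if (!seen.has(link)) {
                seen.add(link);
                queue.push(link);
            }
        }
    }

    return { matchedUrl: null, pagesTried };
}
```

In the real script, fetchPage would presumably wrap the axios call and return response.data, and extractLinks would wrap getLinks. Returning a result object instead of rejecting also lets the caller tell "no match" apart from a genuine error.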

Stay tuned. I may try to improve this.

The post Jordan Scrapes Websites for Keywords appeared first on JavaScript Web Scraping Guy.