RidaEn-nasry
Scraping upwork jobs using NodeJS

SEE SOURCE CODE

Table of Contents

  1. Why JavaScript?
  2. Doing it the simple way
  3. Cloudflare waiting room
  4. How to bypass Cloudflare waiting room
  5. Choosing your end-point

Why JavaScript?

None of your business!!

Doing it the simple way

I like simplicity! Who doesn't, right? So one way to tackle the problem is:

  1. use the built-in https module to make a request directly to a pre-defined URL
  2. since the data comes in chunks, collect it from the response stream and, once complete, save it to an HTML file
  3. parse the thing.
  4. send a local notification, an API request, a goddamn RPC, or whatever the heck you want to alert you about the new jobs

It would look something like this :

const fs = require('fs');
const https = require('https');

const url = 'https://www.upwork.com/ab/jobs/search/?q=javascript&sort=recency';
https.get(url, (res) => {
    let data = '';
    res.on('data', (chunk) => {
        data += chunk;
    });
    res.on('end', () => {
        // save the response as an HTML file
        fs.writeFile('upwork.html', data, (err) => {
            if (err) throw err;
            // if a match (one of your job keywords) is found, send a notification
            if (parse(data)) {
                sendNotification();
            }
        });
    });
}).on('error', (err) => {
    // handle the error
});

Hmmmm! Not so fast!!! At step 3, Cloudflare gets in the way!!

Cloudflare waiting room

Cloudflare Waiting room

Upwork is a Cloudflare client, and Cloudflare doesn't like bots, as you may have noticed!! If you're not familiar with Cloudflare, it's one of the biggest CDN providers in the world, or at least that's what they're best known for, but they also provide a shitload of other services: load balancing, firewalls, etc. The service that's blocking us now is the WAF (web application firewall), a firewall that protects web applications from bad stuff (DDoS attacks, cross-site scripting) and provides useful features like performance optimization, cache control, and more. Bot management is one of those features.

Bots, if left uncontrolled, can cause a lot of damage to web properties by consuming resources and potentially causing a denial of service. So Cloudflare manages these matrix creatures using behavioral analysis and machine learning. But wait!! What about the good bots? Search engine crawlers (Google, Bing), performance monitoring tools, and other stuff that's necessary for the web to function properly! Well, Cloudflare differentiates between them by whitelisting the good bots and blacklisting the bad ones. You and me, as you may have figured out, are bad bots!!

How to bypass Cloudflare waiting room

We should look like real humans: our requests need to look like they're coming from a real human in a real browser!! Puppeteer to the rescue!!

Puppeteer is a Node library that provides a high-level API to control Chrome or Chromium over the DevTools Protocol. It can also be configured to use full (non-headless) Chrome or Chromium. The point is that you can control a headless browser and make it do all sorts of things a real human might do, like clicking links, scrolling pages, and filling out forms. And if Cloudflare or any bot manager is giving you a hard time and thinks you're a bot, you can use Puppeteer to add some randomness to your script by injecting delays and mouse movements that make your browser look less robotic. For example, you could make your browser pause for a random amount of time before clicking a link, or move the mouse in a random pattern before filling out a form. This makes it more difficult for Cloudflare to detect that you're using a bot.
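As a rough sketch of that "add randomness" idea (`humanPause` and `randomBetween` are hypothetical helper names, not part of Puppeteer; `page` is any Puppeteer page object):

```javascript
// Pause for a random interval and wiggle the mouse, so consecutive
// actions don't fire at perfectly regular, robotic intervals.
const randomBetween = (min, max) => min + Math.random() * (max - min);

async function humanPause(page, minMs = 500, maxMs = 2000) {
    // move to a random spot inside a typical 1920x1080 viewport
    await page.mouse.move(randomBetween(0, 1920), randomBetween(0, 1080));
    // then wait a random amount of time before the next action
    await page.waitForTimeout(randomBetween(minMs, maxMs));
}

// usage: await humanPause(page); right before a click or a form fill
```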

So we're going to use Puppeteer to lie to Cloudflare and tell it that we're a real human and not a bot.

After installing Puppeteer, let's put it to work:

const puppeteer = require('puppeteer');
const fs = require('fs');
(async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();

    await page.setUserAgent('Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36');
    await page.goto('https://www.upwork.com/ab/jobs/search/?q=javascript&sort=recency');
    await page.waitForTimeout(1000); // give the page a moment to load
    const html = await page.content();
    fs.writeFile('upwork.html', html, (err) => {
        if (err) throw err;
        console.log('The file has been saved!');
    });
    await browser.close();
})();

our code actually didn't change much:

  • line 3: we define an async IIFE (immediately invoked function expression) to execute our logic, so we can use await inside it.

  • line 4: then we create a new instance of a headless browser using the puppeteer.launch() method.

  • line 5: then we create a new page in the browser using the browser.newPage() method, which will be used to navigate to the Upwork website and scrape the job listings.

Here's the important part:

  • line 7: we set the user agent for the page using the page.setUserAgent() method, to make the browser appear to be a real web browser and not a bot.

The user agent is a string of text sent by a web browser to identify itself and its capabilities. Servers use it to decide which content and features to serve, and to track traffic from different browsers. The page.setUserAgent() method takes a string argument specifying the user agent that will be sent to the server. Here it's set to a string identifying the browser as a recent version of Google Chrome on a Linux operating system. This makes the headless browser appear to be a real web browser and not a bot, which may help bypass any anti-bot measures the website has in place.

  • line 8..9: we navigate to the Upwork website using the page.goto() method, then wait briefly with the page.waitForTimeout() method to imitate real human behavior (not so human... but still).

  • rest: once the page has loaded, we use the page.content() method to get the HTML content of the page, then write it to a file with the fs.writeFile() method. Finally, we close the browser with the browser.close() method, which ends the script and stops the headless browser.

cloudflare bot detection

What you're seeing is the new HTML page we got back from the URL. We actually bypassed the Cloudflare waiting room successfully. But as you can notice, we got the job listings page and then, instantly, a 404 page. So besides the Cloudflare waiting room, Upwork seems to have some other bot detection mechanism. These guys are really serious about bots. Hmmmm! I think we should be more serious about it too.

After some tinkering here and there, I came up with the following script:

const puppeteer = require('puppeteer');
const fs = require('fs');
require('dotenv').config(); // loads EMAIL and PASSWORD from .env

// random integer between min and max
const getRndm = (min, max) => Math.floor(Math.random() * (max - min) + min);

(async () => {
    // reading keywords from keywords.txt file
    let keywords = fs.readFileSync('/Users/wa5ina/Porn/automation/upwork-bot/keywords.txt', 'utf-8');
    keywords = keywords.split('\n');
    for (let i = 0; i < keywords.length; i++) {
        keywords[i] = keywords[i].trim();
    }
    // Launch the browser
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    // Set a realistic user agent string
    await page.setUserAgent('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36');

    // set the viewport to 1920x1080 to avoid the cookie banner 
    await page.setViewport({
        width: 1920,
        height: 1080
    });
    await page.goto('https://www.upwork.com/ab/account-security/login');
    // Wait for the page to load 
    await page.waitForTimeout(1000);

    // getting the email and password from the .env file
    const email = process.env.EMAIL;
    const password = process.env.PASSWORD;

    // enter the email and password
    await page.type('#login_username', email);
    // click the "continue with email" button
    await page.click('#login_password_continue');
    // some randomness to the mouse movement
    for (let i = 0; i < 10; i++) {
        await page.mouse.move(getRndm(0, 10000), getRndm(0, 1000));
        await page.waitForTimeout(1000);
    }
    // password
    await page.type('#login_password', password);
    await page.click('#login_control_continue');
    // move the mouse randomly to be more human 🤡
    for (let i = 0; i < 10; i++) {
        await page.mouse.move(getRndm(0, 20000), getRndm(0, 10000));
        await page.waitForTimeout(1000);
    }
    let allJobs = [];

    // wait for the search input to load
    // await page.waitForSelector('input[placeholder="Search for job"]', { visible: true });
    for (let i = 0; i < keywords.length; i++) {
        // console.log('searching for ' + keywords[i]);
        for (let j = 0; j < 5; j++) {
            // scrolling through 5 pages of results
            await page.goto('https://www.upwork.com/ab/jobs/search/?q=' + keywords[i] + '&page=' + j + '&sort=recency');
            await page.waitForTimeout(3000);
            await page.waitForSelector('div[data-test="main-tabs-index"]', { visible: true });
            // get all sections with data-test="JobTile"
            const listings = await page.$$('section[data-test="JobTile"]');
            // extract the details from each listing
            let jobs = await Promise.all(listings.map(async (listing) => {
                // get the title of the job which in <h4 class="job-tile-title"> <a> </a> </h4>
                let posted = await getTime(listing);
                // if it's too old, then skip it
                if (tooOld(posted) === true)
                    return;
                // get title of the job
                let title = await getTitle(listing);
                // get the link of the job 
                let link = await getLink(listing);
                // get the description of the job 
                let description = await getDescription(listing);
                // get type of job {type, budget}
                let typeOfJob = await getTypeOfJob(listing);
                if (tooCheap(typeOfJob) === true)
                    return;
                // is the client's payment verified (true or false)
                let paymentverified = await isVerified(listing);
                return { posted, title, link, description, typeOfJob, paymentverified };
            }
            ));
            // filter out the undefined jobs
            jobs = jobs.filter((job) => job !== undefined);
            // push jobs to alljobs
            allJobs.push(...jobs);
        }

    }
    // Add some randomness to the requests
    const randomDelay = Math.random() * 2000;
    await page.waitForTimeout(randomDelay);
    // Close the browser
    await browser.close();
    // write to json file by overriding the file
    fs.writeFileSync('/Users/wa5ina/Porn/automation/upwork-bot/jobs.json', JSON.stringify(allJobs, null, 2));
})();



It starts off by reading a list of keywords from a file called keywords.txt. These are our magic words (separated by line feeds, '\n') that the code uses to search for jobs.
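Factored out as a pure function (a hypothetical name, not in the original script), that parsing step looks like this; the blank-line filter is my addition, so a trailing newline in keywords.txt doesn't produce an empty keyword:

```javascript
// Split the contents of keywords.txt into trimmed, non-empty keywords.
// Mirrors the split('\n') + trim() loop in the script above.
function parseKeywords(fileContents) {
    return fileContents
        .split('\n')
        .map((keyword) => keyword.trim())
        .filter((keyword) => keyword.length > 0); // skip blank lines
}

// parseKeywords('javascript \nnode\n\n') → ['javascript', 'node']
```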

Next, the code fires up its trusty web browser and creates a new page. It sets the user agent string to pretend to be a real web browser, because it's too cool to be a robot. It also sets the viewport to 1920x1080 to avoid the annoying cookie banner.

The code then heads over to the Upwork login page and waits patiently for it to load. It snags your email and password from the .env file and enters them into the appropriate fields on the login page. (The reason I chose to log in before scraping is that I noticed the quality of jobs while authenticated versus non-authenticated is noticeably different.) Then it clicks the "continue" button to proceed with the login process.

To make things more interesting, the code moves the mouse around randomly to mimic a human user. It's like a little dance to pass the time while the page loads. Tooo smart haaaaa 🤡!! Not really!!! OK.

Once the page is loaded, the code navigates to the Upwork job search page and starts searching for jobs using each keyword in the keywords array. It's like a treasure hunt for your dream projects! For each keyword, it loops through up to five pages of search results.

For each job listed on a search results page, the code extracts the job title, link, description, and type (e.g. hourly or fixed-price) and budget. It stores all of this juicy information in an array called allJobs.

After it has searched all the keywords and scraped all the jobs, the code closes the web browser, writes everything to the jobs.json file, and exits.

(Note that I didn't explicitly include the code for some of the data-extraction functions like getTitle(), getLink(), etc. If I did, we'd probably end up with an ugly, unreadable blog post, assuming it isn't already, but you can find the whole code in the GitHub repo.)
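To give a rough idea of what those helpers might look like, here's a sketch of two of them. The h4.job-tile-title selector is inferred from a comment in the script above; treat it as an assumption and check the repo for the real implementations. `listing` is a Puppeteer ElementHandle for one job tile:

```javascript
// Sketches only; the real versions live in the repo.
// listing.$eval runs the callback in the page against the first matching element.
const getTitle = (listing) =>
    listing.$eval('h4.job-tile-title a', (a) => a.textContent.trim());

const getLink = (listing) =>
    listing.$eval('h4.job-tile-title a', (a) => a.href);
```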

But I'll briefly explain the tooOld() and tooCheap() functions, as they're a bit important. I decided to filter out jobs that are too old or too cheap.
Too old means the job was posted more than 20 minutes ago, and too cheap means the job pays less than $500 if fixed-price or less than $15/hour if hourly. If you intend to use the script, you can edit those functions to match your own definition of too cheap or too old.
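A minimal sketch of what those two filters could look like, matching the thresholds above. The input shapes ("5 minutes ago", { type, budget }) mirror the scraped data; the parsing details are my assumptions, and jobs with an unspecified budget are kept rather than dropped:

```javascript
// Sketches of the two filters; the thresholds (20 minutes,
// $500 fixed-price, $15/hour) come from the text above.
const MAX_AGE_MINUTES = 20;

function tooOld(posted) {
    // "posted" looks like "5 minutes ago"; anything not measured
    // in minutes (hours, days, ...) is definitely too old
    const match = /^(\d+)\s+minutes?\s+ago$/.exec(posted.trim());
    if (!match) return true;
    return parseInt(match[1], 10) > MAX_AGE_MINUTES;
}

function tooCheap(typeOfJob) {
    // budgets look like "$1000 ", "$35.00-$46.00", or "not specified"
    const nums = (typeOfJob.budget.match(/\d+(\.\d+)?/g) || []).map(Number);
    if (nums.length === 0) return false; // unspecified budget: keep the job
    const min = Math.min(...nums);
    return typeOfJob.type.startsWith('Hourly') ? min < 15 : min < 500;
}
```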

After running the script, we'll get a jobs.json file that looks something like this:

[
  {
    "posted": "5 minutes ago",
    "title": "Senior Software and App Engineer",
    "link": "https://www.upwork.com/jobs/Senior-Software-and-App-Engineer_~012d16e9ca1001988c/",
    "description": "We need an absolute ninja to go through and clean up our entire platform, and our mobile apps to perform at their highest levels possible to increase our satisfaction and functionality in the field. We need a perfectionist, and someone who works efficiently because of their superior abilities. We have several integrations that need fine-tuned and more in the pipeline that will need done. So intimate knowledge of Api and SDK integrations will be necessary. He will also assist an IT support and be available for emergency service in the case of complete failure. We do not see this being the case as we are hiring you to make the system fail proof. We are young growing, high definition, video intercom system, and there is long-term potential be on this contract term.",
    "typeOfJob": {
      "type": "Hourly: ",
      "budget": "$35.00-$46.00"
    },
    "paymentverified": true
  },
  {
    "posted": "7 minutes ago",
    "title": "Build a web and mobile application",
    "link": "https://www.upwork.com/jobs/Build-span-web-span-and-mobile-application_~015ad09536ea065d5d/",
    "description": "You can read the specification document attached. This document contains all what you need.",
    "typeOfJob": {
      "type": "Fixed-price",
      "budget": "$1000 "
    },
    "paymentverified": false
  },
  {
    "posted": "7 minutes ago",
    "title": "Order platform",
    "link": "https://www.upwork.com/jobs/Order-platform_~0185808b1d24763136/",
    "description": "Looking to get a orders platform for a remittance company. Using Laravel Customer must be able to sign up, login, get live rate, book transaction and manage beneficiary and linked attributes. What would be the time line and approx cost",
    "typeOfJob": {
      "type": "Hourly",
      "budget": "not specified"
    },
    "paymentverified": true
  }
]


A simple array of objects, where each object represents a job. Now that we have our list, we can do whatever we like with it: use discord.js to send it to a Discord channel, use node-mailer to send it to your email, or, I don't know, send it to your grandma telepathically 👵. The choice is yours.
I chose to display it as notifications on my lovely Mac.

Choosing your end-point

Follow along with this section if you want to display the jobs as macOS notifications; otherwise, head to the code repo and follow the instructions for your preferred method.

The code for that is pretty simple, it just reads the jobs.json file and displays a notification for each job.

I used a not-so-fancy utility called jq to parse the JSON and extract all the juicy details about each job, like the title, posting date, type, budget, and link.

Then I simply looped through the jobs and plucked out the individual details for each one. I used another utility called alerter, which is just a wrapper around osascript. osascript is screwed, by the way, so screwed that you need some crazy magic just for it to interpret your shell variables correctly; alerter provides the convenience of saying "fuck you, quotes and backslashes".
So for each job, I call alerter with the job details, and it displays a notification with the job title, posting date, type, budget, and link. And if I'm feeling adventurous, I can click the "open" button to check out the job link in the browser.

The notification looks something like this:

upwork macos notification

and if you like the offer you can click the "open" button to open the job link in your browser.

upwork macos notification

Again, the post is ugly and unorganized enough as it is, so I won't include the code for this part, but you can find it in the GitHub repo.

Now that we have our bot and notification scripts all set up, it's time to automate the process so we can sit back and let the bot do all the work for us. To do this, we'll use something called cronjobs. These are like little robots that run scripts at scheduled times. It's like hiring a personal assistant for your computer!

First, let's add bot.js as a cronjob. Open up a terminal and type crontab -e; this will open the crontab editor. Since our script scrapes the Upwork website and fills in the jobs.json file, I'll run it every 7 minutes to make sure I get the latest jobs, so I'll add the following line to the crontab file:

*/7 * * * * node /path/to/bot.js

This tells cron to run the bot.js script every 7 minutes. You can adjust the schedule to your liking – check out this handy tool to help you with the syntax.

Next, let's also schedule our bash script notifyjobs.sh, which reads from jobs.json and displays the notifications.
Since it should run after the bot.js script has refreshed jobs.json, we'll run it on a slightly longer interval, every 10 minutes:

*/10 * * * * /path/to/notifyjobs.sh

And with that, we're done! We've successfully created a bot to extract job listings from Upwork and display them as macOS notifications. Now we can get the latest job updates from the comfort of our own desktop screens. No more endlessly scrolling through job listings – the bot does all the hard work for us! So go ahead and kick back, relax, and let the bot do the job hunting for you. Happy job hunting!
