Emmanuel Uchenna

Posted on Mar 17, 2024

Automating Data Collection with Apify: From Script to Deployment

#apify #web #scrapping

Web scraping is an important tool for businesses and organizations today because it allows them to gather valuable information about their customers, competitors, the market, and more.

Web scraping automates the process of extracting valuable data from websites, transforming it from raw, unstructured content into a usable format like spreadsheets, JSON, or databases. This eliminates the time-consuming and error-prone task of manual data collection, making web scraping a cost-effective and powerful tool.

Imagine gathering competitor pricing data in near real time, monitoring brand mentions across social media platforms, or collecting market research insights, all with minimal manual effort. Web scraping makes this possible and can give your business a significant competitive advantage.

But how do you use this powerful technique? This article will guide you through the entire process of automating data collection with Apify, a user-friendly platform specifically designed for web scraping.

I will walk you through each step: writing your initial scraping script (an Actor) in TypeScript with the Apify SDK for JavaScript, deploying it to the Apify platform, and running the deployed Actor there. With Apify, you don't need to be a programming pro to harness the power of web scraping and start gaining insights.

If you are as excited to learn about automated data collection with Apify as I am to write about it, then let's dive right in 🚀

What is Apify?

According to the Apify documentation,

Apify is a cloud platform that helps you build reliable web scrapers, fast, and automate anything you can do manually in a web browser.

Apify is designed to tackle large-scale and demanding web scraping and automation tasks. It offers a user-friendly interface and an API for accessing its features:

Compute instances (Actors): Enables you to run dedicated programs to handle your scraping or automation needs.
Storage: Allows you to conveniently store both the requests you send and the results you obtain.
Proxies: Apify proxies play a crucial role in web scraping, allowing you to anonymize your scraping activities, evade IP address tracking, access geo-location-specific content, and more.
Scheduling: This allows you to automate your scraping actors and tasks to run at specific times or pre-defined intervals without the need for manual initiations.
Webhooks: Allow you to integrate other Apify Actors or external systems with your Actor or task runs. For example, you can send an alert when an Actor run succeeds or fails.

While Apify itself is a platform, it works seamlessly with Crawlee, an open-source library for web scraping. This means you can run Crawlee scraping jobs locally or on your preferred cloud infrastructure, offering flexibility alongside Apify's platform features.

Introducing the Apify SDK for JavaScript

Apify provides a powerful toolkit called the Apify SDK for JavaScript, designed to streamline the creation of Actors. These Actors function as serverless microservices, capable of running on the Apify platform or independently.

Previously, the Apify SDK offered a blend of crawling functionalities and Actor building features. However, a recent update separated these functionalities into two distinct libraries: Crawlee and Apify SDK v3. Crawlee now houses the web scraping and crawling tools, while Apify SDK v3 focuses solely on features specific to building Actors for the Apify platform. This distinction allows for a clear separation of concerns and enhances the development experience for various use cases.


What is an Actor?

According to the Apify Docs,

Actors are serverless programs running in the cloud. They can perform anything from simple actions (such as filling out a web form or sending an email) to complex operations (such as crawling an entire website or removing duplicates from a large dataset). Actor runs can be as short or as long as necessary. They could last seconds, hours, or even infinitely.

In the next section, I will walk you through building your scraper with Apify.

Prerequisites

To follow along with this article, you should satisfy the following conditions:

An account with a Git provider like GitHub, GitLab, or Bitbucket to store your code repository.
Node.js and npm or Yarn installed locally to manage dependencies and run commands.
Basic terminal/command line knowledge to run commands for initializing projects, installing packages, deploying sites, etc.
Apify CLI installed globally by running this command: npm -g install apify-cli
An account with Apify. Create a new account on the Apify Platform.

Building Your Scraper with Apify

The first step to building your scraper is to choose a code template from the many templates provided by Apify. Head over to the Actor templates repository and choose one.

The Actor templates help you quickly set up your web scraping projects, saving you development time and giving you immediate access to all the features the Apify platform has to offer.

For this article, I will be using the TypeScript Starter template, which comes with Node.js, Cheerio, and Axios.

Click on your chosen template and you will be redirected to the page specific to that template, then click on "Use locally". This will display a popup with instructions on how to create your Actor using your chosen template.

Since I have satisfied all the prerequisites, I will go ahead and create a new Actor using the TypeScript Starter template. For this, I will run the following command in my terminal:

apify create my-actor -t getting_started_typescript

The above command uses the Apify CLI to create a new Actor called my-actor from the TypeScript Starter template and generates the project's files and folders. Below is my folder structure:

├───.actor
├───.vscode
├───.gitignore
├───.dockerignore
├───node_modules
├───package.json
├───package-lock.json
├───README.md
├───tsconfig.json
├───src
│   └───main.ts
└───storage
    ├───datasets
    │   └───default
    ├───key_value_stores
    │   └───default
    │       └───INPUT.json
    └───request_queues
        └───default
Meanings of Selected Files

The main.ts file acts as the main script for your Apify project. It's written in TypeScript and uses several libraries to achieve its goal:

Fetching and Parsing Data:
- It starts by importing libraries like axios to fetch data from the web and cheerio to parse the downloaded content (usually HTML).
Getting User Input:
- It retrieves the URL you provide as input using Actor.getInput. This URL likely points to the webpage you want to scrape data from.
Scraping Headings:
- The script then fetches the webpage content using the provided URL and parses it with Cheerio.
- It extracts all heading elements (h1, h2, h3, etc.) and stores their level (h1, h2, etc.) and text content in an array.
Saving Results:
- Finally, it saves the extracted headings (including level and text) to Apify's Dataset, which acts like a storage container for your scraped data.
Exiting Cleanly:
- The script exits gracefully using Actor.exit() to ensure proper termination of your Apify scraping process.

The script fetches a webpage based on your input URL, extracts all headings, and stores them in Apify's Dataset for further use or analysis.
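
To make the heading-extraction step more concrete, here is a minimal sketch of the kind of records the script produces. The template itself parses the page with Cheerio; this sketch swaps in a plain regular expression so that it runs without any dependencies, which is far less robust than a real HTML parser. The names Heading and extractHeadings are illustrative and not taken from the template.

```typescript
// One record per heading, mirroring the "level" and "text" fields described above.
interface Heading {
    level: string; // "h1" through "h6"
    text: string;
}

// Simplified stand-in for Cheerio-based parsing (assumption: well-formed,
// non-nested heading tags). Not suitable for arbitrary real-world HTML.
function extractHeadings(html: string): Heading[] {
    const headingPattern = /<(h[1-6])\b[^>]*>([\s\S]*?)<\/\1\s*>/gi;
    const headings: Heading[] = [];
    let match: RegExpExecArray | null;

    while ((match = headingPattern.exec(html)) !== null) {
        // Drop any inline tags inside the heading and collapse whitespace.
        const text = match[2].replace(/<[^>]+>/g, '').replace(/\s+/g, ' ').trim();
        headings.push({ level: match[1].toLowerCase(), text });
    }
    return headings;
}

// Usage: yields two records, an h1 "Apify" and an h2 "Build scrapers".
console.log(extractHeadings('<h1>Apify</h1><h2>Build <em>scrapers</em></h2>'));
```

In the real Actor, an array like this is what gets saved to the Dataset, one item per heading.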

The .actor/actor.json file is where you will set up information about the Actor such as the name, version, build tag, environment variables, and more.
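
As a rough illustration, the fields mentioned above can be pictured as the TypeScript object below. The real file is plain JSON. The ActorConfig type covers only a subset of fields, and the example values (including the LOG_LEVEL variable) are assumptions made for this sketch, so check the generated .actor/actor.json for the actual contents.

```typescript
// Hypothetical type describing a subset of the fields in .actor/actor.json.
interface ActorConfig {
    actorSpecification: number;
    name: string;
    version: string;
    buildTag?: string;
    environmentVariables?: Record<string, string>;
    input?: string; // path to the input schema file
}

// Example values are illustrative only.
const exampleConfig: ActorConfig = {
    actorSpecification: 1,
    name: 'my-actor',
    version: '0.0',
    buildTag: 'latest',
    environmentVariables: { LOG_LEVEL: 'info' }, // made-up variable for the sketch
    input: './input_schema.json',
};

// Serialize to the JSON text that would live in the file.
const actorJson: string = JSON.stringify(exampleConfig, null, 4);
console.log(actorJson);
```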

The .actor/input_schema.json file defines the input of the Actor. In this case, I am using the Apify URL (https://www.apify.com). The content of this file is as shown below:

{
    "title": "Scrape data from a web page",
    "type": "object",
    "schemaVersion": 1,
    "properties": {
        "url": {
            "title": "URL of the page",
            "type": "string",
            "description": "The URL of website you want to get the data from.",
            "editor": "textfield",
            "prefill": "https://www.apify.com"
        }
    },
    "required": ["url"]
}
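
Inside the Actor, the input described by this schema arrives as a plain object. The sketch below shows one way to type it and check it defensively before use. The Input interface and parseInput function are illustrative names, and the URL check is a deliberately simple pattern rather than full validation.

```typescript
// Shape of the input defined by the schema: a single required "url" string.
interface Input {
    url: string;
}

// Hypothetical helper: narrows an unknown value to Input or throws.
function parseInput(raw: unknown): Input {
    if (typeof raw !== 'object' || raw === null) {
        throw new Error('Input must be an object');
    }
    const url = (raw as Record<string, unknown>).url;
    // Simple assumption: accept anything that looks like an http(s) URL.
    if (typeof url !== 'string' || !/^https?:\/\/\S+$/i.test(url)) {
        throw new Error('Input field "url" must be an http(s) URL');
    }
    return { url };
}

// Usage with the schema's prefill value.
const input = parseInput({ url: 'https://www.apify.com' });
console.log(input.url);
```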

Next, change the directory into the my-actor directory by running this command:

# Go into the project directory
cd my-actor

Next, it is time to run your Actor locally. To do that, run this command:

# Run it locally
apify run

When you run the Actor, it saves the extracted headings to a storage called a Dataset. On the Apify platform, each run gets its own default Dataset, which keeps your data organized across runs. When you run locally, the results are written to the storage/datasets/default folder of your project.

Think of a Dataset like a table. Each heading you extract becomes a row in the table, with its level (h1, h2, etc.) and text content acting as separate columns. You can even view this data as a table within Apify, making it easy to browse and understand. Apify allows you to export your headings in a variety of formats, including JSON, CSV, and Excel. This lets you use the data in other applications or analyze it further with your favorite tools.
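
To illustrate the table analogy, the sketch below turns a few heading records into CSV text by hand. Apify performs the export for you, so this is only a dependency-free illustration of how stored items map to rows and columns. The toCsv and escapeCsvField helpers are hypothetical names.

```typescript
type Row = Record<string, string>;

// Quote a field only when it contains a comma, quote, or line break.
function escapeCsvField(value: string): string {
    return /[",\n\r]/.test(value) ? `"${value.replace(/"/g, '""')}"` : value;
}

// Build CSV text: the first row's keys become the column headers.
function toCsv(rows: Row[]): string {
    if (rows.length === 0) return '';
    const columns = Object.keys(rows[0]);
    const header = columns.map(escapeCsvField).join(',');
    const lines = rows.map((row) =>
        columns.map((column) => escapeCsvField(row[column] ?? '')).join(','),
    );
    return [header, ...lines].join('\n');
}

// Usage: each heading becomes one row with "level" and "text" columns.
console.log(toCsv([
    { level: 'h1', text: 'Apify' },
    { level: 'h2', text: 'Build reliable web scrapers, fast' },
]));
```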

Deploying and Running the Scraper

In this section, I will walk you through deploying your Actor to the Apify platform. To do this, we will make use of the Apify CLI.

First, you need to connect your Apify account with your Actor locally. You will need to provide your Apify API token to complete these actions.

To get your Apify API token, open the Apify Console, click on "Settings" > "Integrations", and copy your API token.

Next, in your terminal, change into the root directory of your created Actor, then run this command to sign in:

apify login -t YOUR_APIFY_TOKEN

Replace YOUR_APIFY_TOKEN with the actual token you just copied.

How to Deploy Your Actor

The Apify CLI provides a command that you can use to deploy your Actor to the Apify platform:

apify push

This command will deploy and build the Actor on the Apify Platform. You can find your newly created Actor under My Actors. Running this command also prints the build logs to your terminal.

Next, we will open the newly created Actor on the Apify platform and run it there. To do this, navigate to the My Actors page.

To run our Actor, click on it ("My Actor"), then click on "Build & Start".

The Actor tabs allow you to see the Code, Last build, Input, and Last run.

You can also export your dataset by clicking on the Export button. The supported formats include JSON, CSV, XML, Excel, HTML Table, RSS, and JSONL.
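
Exports are also available programmatically through the Apify API's dataset items endpoint. The sketch below only builds the request URL. The dataset ID is a placeholder, buildDatasetExportUrl is a hypothetical helper name, and the list of format values reflects my understanding of what the endpoint accepts, so confirm it against the API reference. A private dataset additionally requires your API token.

```typescript
// Format values assumed to be accepted by the dataset items endpoint.
type ExportFormat = 'json' | 'jsonl' | 'csv' | 'xml' | 'xlsx' | 'html' | 'rss';

// Hypothetical helper: builds the URL for downloading a dataset's items.
function buildDatasetExportUrl(datasetId: string, format: ExportFormat): string {
    const id = encodeURIComponent(datasetId);
    return `https://api.apify.com/v2/datasets/${id}/items?format=${format}`;
}

// Usage: replace YOUR_DATASET_ID with the ID shown on the run's storage tab.
console.log(buildDatasetExportUrl('YOUR_DATASET_ID', 'csv'));
```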

Wrapping Up

Through this article, you have learned how to automate data collection using Apify and JavaScript (TypeScript). You have also learned how to use Apify's Actor templates to scrape websites and efficiently store the extracted data in Apify's Dataset. With Apify handling the infrastructure and functionalities like scheduling and proxies, you can focus on crafting the core scraping logic using familiar JavaScript libraries like Cheerio and Puppeteer.

Apify offers a vast library of documentation and a supportive community to guide you on your path to becoming a web scraping expert.

Get started with Apify today 🚀
