Introducing web scraping and Apify
Let's get started by understanding what web scraping is.
Web scraping is a technique for extracting data from websites, turning raw pages into insightful information for developers, businesses, and organizations. A scraper can retrieve the underlying HTML of a page, pull out the data it contains, and even replicate an entire website elsewhere.
Next, let's look at Apify. Apify is a platform that helps developers by providing powerful tools and templates for building, deploying, and publishing web scraping, data extraction, and web automation solutions efficiently.
In this guide, we will walk through the process of creating and deploying a web scraper using the Apify SDK for JavaScript.
Understanding the template-generated code
In this project we will use a JavaScript template that relies on several libraries: Axios, Cheerio, and the Apify SDK, each of which we will discuss. Let's walk through the code to understand it.
// Axios - Promise based HTTP client for the browser and node.js
import axios from 'axios';
// Cheerio - The fast, flexible & elegant library for parsing and manipulating HTML and XML
import * as cheerio from 'cheerio';
// Apify SDK - toolkit for building Apify Actors
import { Actor } from 'apify';
// Initialize the Actor environment
await Actor.init();
// Get input data (e.g., URL to scrape)
const input = await Actor.getInput();
const { url } = input;
// Fetch the HTML content of the page.
const response = await axios.get(url);
// Parse the downloaded HTML with Cheerio to enable data extraction.
const $ = cheerio.load(response.data);
// Extract all headings from the page (tag name and text).
const headings = [];
$("h1, h2, h3, h4, h5, h6").each((i, element) => {
const headingObject = {
level: $(element).prop("tagName").toLowerCase(),
text: $(element).text(),
};
console.log("Extracted heading", headingObject);
headings.push(headingObject);
});
// Save headings to Dataset - a table-like storage.
await Actor.pushData(headings);
// Gracefully exit the Actor process.
await Actor.exit();
Used libraries
Axios - a promise-based HTTP client used to make the request to the specified URL.
Cheerio - a library for parsing and manipulating HTML, used here to extract data from the downloaded HTML content.
Apify SDK - a toolkit for building Apify Actors, used here to initialize the Actor environment, read the input data, and push the extracted data to the dataset.
Data storage
Extracted data is stored in an Apify dataset, which provides convenient and scalable storage for the results.
Enhancing development workflows
The provided template offers a structured framework and built-in functionality for web scraping tasks, which streamlines development. Instead of worrying about boilerplate setup, developers can focus on customizing the scraping logic.
Deploying and running the scraper
Follow the steps below to deploy and run the scraper on the Apify platform.
- Create an account on Apify, or sign in if you already have one.
- Log in to your account and navigate to the Actors section. Click "Create new Actor" and choose to create an Actor from your local code.
- Upload the file containing the scraper code.
- Configure any settings or required environment variables, e.g., the input URL.
- Save your changes and deploy the Actor.
Once deployed, you can execute the Actor on the Apify platform by opening its detail page and running it manually, or by scheduling it to run at specified intervals.
In conclusion, following the steps illustrated in this guide enables developers to create sophisticated web scrapers with ease. Developers and organizations can then harness the power of web scraping to gather insightful data for their projects.