<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Dávid Lukáč</title>
    <description>The latest articles on DEV Community by Dávid Lukáč (@davidlukacapify).</description>
    <link>https://dev.to/davidlukacapify</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F972899%2F9a4c69a7-691f-471a-8b5d-87b991d6f4ea.jpeg</url>
      <title>DEV Community: Dávid Lukáč</title>
      <link>https://dev.to/davidlukacapify</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/davidlukacapify"/>
    <language>en</language>
    <item>
      <title>10 web scraping challenges (+ solutions) in 2025</title>
      <dc:creator>Dávid Lukáč</dc:creator>
      <pubDate>Thu, 05 Dec 2024 15:04:45 +0000</pubDate>
      <link>https://dev.to/apify/10-web-scraping-challenges-solutions-in-2025-5bhd</link>
      <guid>https://dev.to/apify/10-web-scraping-challenges-solutions-in-2025-5bhd</guid>
      <description>&lt;p&gt;Web scraping comes with its fair share of challenges. Websites are becoming increasingly difficult to scrape due to the rise of anti-scraping measures like CAPTCHAs and browser fingerprinting. At the same time, the demand for data, especially to fuel AI, is higher than ever. &lt;/p&gt;

&lt;p&gt;As you probably know, web scraping isn’t always a stress-free process, but learning how to navigate these obstacles can be incredibly rewarding.&lt;/p&gt;

&lt;p&gt;In this guide, we’ll cover 10 common problems you’re likely to encounter when scraping the web and, just as importantly, how to solve them:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://blog.apify.com/web-scraping-challenges/#1-dynamic-content" rel="noopener noreferrer"&gt;Dynamic content
&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.apify.com/web-scraping-challenges/#2-user-agents-and-browser-fingerprinting" rel="noopener noreferrer"&gt;User agents and browser fingerprinting&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.apify.com/web-scraping-challenges/#3-rate-limiting" rel="noopener noreferrer"&gt;Rate limiting&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.apify.com/web-scraping-challenges/#4-ip-bans" rel="noopener noreferrer"&gt;IP bans
&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.apify.com/web-scraping-challenges/#5-honeypot-traps" rel="noopener noreferrer"&gt;Honeypot traps&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.apify.com/web-scraping-challenges/#6-captchas" rel="noopener noreferrer"&gt;CAPTCHAs
&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.apify.com/web-scraping-challenges/#7-data-storage-and-organization" rel="noopener noreferrer"&gt;Data storage and organization
&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.apify.com/web-scraping-challenges/#8-automation-and-monitoring" rel="noopener noreferrer"&gt;Automation and monitoring&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.apify.com/web-scraping-challenges/#9-scalability-and-reliability" rel="noopener noreferrer"&gt;Scalability and reliability
&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.apify.com/web-scraping-challenges/#10-real-time-data-scraping" rel="noopener noreferrer"&gt;Real-time data scraping&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For the solutions, we’ll use Crawlee, an open-source library for Python and Node.js, and the Apify platform. These tools make life easier, but the techniques we’ll talk about can be used with other tools as well. By the end, you’ll have a solid understanding of how to overcome some of the toughest hurdles web scraping can throw at you. &lt;/p&gt;

&lt;h2&gt;1. Dynamic content&lt;/h2&gt;

&lt;p&gt;Modern websites often use JavaScript frameworks like React, Angular, or Vue.js to create dynamic and interactive experiences. These &lt;a href="https://blog.apify.com/scraping-single-page-applications-with-playwright/" rel="noopener noreferrer"&gt;single-page applications (SPAs)&lt;/a&gt; load content on the fly without refreshing the page, which is great for users but can complicate web scraping.&lt;/p&gt;

&lt;p&gt;Traditional scrapers that pull raw HTML often miss data generated by JavaScript after the page loads. To &lt;a href="https://blog.apify.com/scrape-dynamic-websites-with-python/" rel="noopener noreferrer"&gt;capture dynamically loaded content&lt;/a&gt;, scrapers need to execute JavaScript and interact with the page, just like a browser.&lt;/p&gt;

&lt;p&gt;That’s where headless browsers like Playwright, Puppeteer, or Selenium come in. They mimic real browsers, loading JavaScript and revealing the data you need.&lt;/p&gt;

&lt;p&gt;In the example below, we’re using &lt;a href="https://github.com/apify/crawlee" rel="noopener noreferrer"&gt;Crawlee&lt;/a&gt;, an open-source web scraping library, with Playwright to scrape a dynamic page (MintMobile). While Playwright alone could handle this, Crawlee adds powerful web scraping features you’ll learn about in the next sections.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;PlaywrightCrawler&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;Dataset&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;crawlee&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;firefox&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;playwright&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;crawler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;PlaywrightCrawler&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;launchContext&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// Here you can set options that are passed to the playwright .launch() function.&lt;/span&gt;
        &lt;span class="na"&gt;launchOptions&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="na"&gt;headless&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="na"&gt;launcher&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;firefox&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="nf"&gt;requestHandler&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;log&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;goto&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

        &lt;span class="c1"&gt;// Extract data&lt;/span&gt;
        &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;productInfo&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;$eval&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;#WebPage&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;info&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;info&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;querySelector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;h1[data-qa="device-name"]&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nx"&gt;innerText&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="na"&gt;storage&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;info&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;querySelector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                    &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;a[data-qa="storage-selection"] p:nth-child(1)&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
                &lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nx"&gt;innerText&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="na"&gt;devicePrice&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;info&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;querySelector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                    &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;a[data-qa="storage-selection"] p:nth-child(2)&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
                &lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nx"&gt;innerText&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;};&lt;/span&gt;
        &lt;span class="p"&gt;});&lt;/span&gt;

        &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;productInfo&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;warning&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`No product info found on &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`Extracted product info from &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
            &lt;span class="c1"&gt;// Save the extracted data, e.g., push to Apify dataset&lt;/span&gt;
            &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;Dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;pushData&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;productInfo&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="c1"&gt;// Start the crawler with a list of product review pages&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;crawler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;https://www.mintmobile.com/devices/samsung-galaxy-z-flip6/6473480/&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;]);&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;2. User agents and browser fingerprinting&lt;/h2&gt;

&lt;p&gt;If a website blocks your scraper, you can’t access the data, which makes all your efforts pointless. To avoid this, you want your scraper to mimic real users as closely as possible. Two basic elements of anti-bot defenses to keep in mind are &lt;strong&gt;user agents&lt;/strong&gt; and &lt;strong&gt;browser fingerprinting&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;A user agent is an HTTP header sent with every request, telling the website what browser and device are making the request. It looks something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If your scraper uses something obvious like the Axios default user agent, &lt;code&gt;axios/1.7.2&lt;/code&gt;, the site will likely flag you as a bot and block your access.&lt;/p&gt;
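&lt;p&gt;One simple mitigation is to send a realistic user agent yourself. The sketch below is hypothetical (the strings in the pool and the &lt;code&gt;randomUserAgent&lt;/code&gt; helper are our own, not a library API); it picks a browser-like user agent to pass in the request headers:&lt;/p&gt;

```javascript
// Hypothetical helper: a small pool of realistic User-Agent strings.
// In practice you would keep this list current with real browser releases.
const USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.5 Safari/605.1.15',
];

// Pick one at random so repeated requests do not all look identical.
function randomUserAgent() {
    return USER_AGENTS[Math.floor(Math.random() * USER_AGENTS.length)];
}

// With axios, override the default axios/1.x user agent per request:
// await axios.get(url, { headers: { 'User-Agent': randomUserAgent() } });
```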

&lt;p&gt;Fingerprinting takes it a step further. Websites analyze details like your screen resolution, installed fonts, timezone, language, and even whether the browser is running in headless mode. All this data creates a unique “fingerprint” for your scraper. If your fingerprint looks too uniform or lacks variety, like using the same resolution or timezone across all requests, you’re more likely to get caught. Some sites can even track you across sessions, bypassing tactics like IP rotation.&lt;/p&gt;

&lt;p&gt;As you can imagine, manually managing user agents and fingerprints can be a headache: it’s time-consuming, error-prone, and hard to keep up with as websites constantly improve their defenses.&lt;/p&gt;

&lt;p&gt;Thankfully, modern open-source tools like Crawlee take care of these challenges for us. Crawlee automatically generates realistic user agents and fingerprints for each request to make your bots appear human-like. Its &lt;code&gt;PlaywrightCrawler&lt;/code&gt; and &lt;code&gt;PuppeteerCrawler&lt;/code&gt; also make headless browsers behave like real ones, lowering your chances of detection, which is why I opted for using Playwright with Crawlee in the first section 😉&lt;/p&gt;
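&lt;p&gt;Fingerprint generation is on by default in Crawlee, but you can constrain what it generates via &lt;code&gt;browserPoolOptions&lt;/code&gt;. The specific browser, device, and locale values below are just example choices:&lt;/p&gt;

```javascript
// Example constraints for Crawlee's fingerprint generator.
// The values here (firefox, desktop, en-US) are illustrative, not required.
const fingerprintOptions = {
    fingerprintGeneratorOptions: {
        browsers: ['firefox'],
        devices: ['desktop'],
        locales: ['en-US'],
    },
};

// Passed to a crawler through its browser pool, e.g.:
// new PlaywrightCrawler({
//     browserPoolOptions: { useFingerprints: true, fingerprintOptions },
//     async requestHandler({ page }) { /* ... */ },
// });
```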

&lt;h2&gt;3. Rate limiting&lt;/h2&gt;

&lt;p&gt;Rate limiting is how websites keep things under control by capping the number of requests a user or IP can make within a set time frame. This helps prevent server overload, defend against DoS attacks, and discourage automated scrapers. If your scraper goes over the limit, the server might respond with a &lt;strong&gt;429 Too Many Requests&lt;/strong&gt; error or even block your IP temporarily. This can be a major roadblock, interrupting your data collection and leaving you with incomplete results.&lt;/p&gt;
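&lt;p&gt;Under the hood, the usual answer to a &lt;strong&gt;429&lt;/strong&gt; is to retry with exponential backoff. Crawlee retries for you, but a hand-rolled sketch of the idea (the &lt;code&gt;backoffDelay&lt;/code&gt; and &lt;code&gt;fetchWithBackoff&lt;/code&gt; helpers are our own, not a library API) looks like this:&lt;/p&gt;

```javascript
// Hypothetical sketch of exponential backoff with jitter for 429 responses.
function backoffDelay(attempt, baseMs = 1000, capMs = 60000) {
    // Double the wait on each retry, capped, plus a little random jitter
    // so many clients do not retry in lockstep.
    return Math.min(capMs, baseMs * 2 ** attempt) + Math.floor(Math.random() * 250);
}

async function fetchWithBackoff(url, maxRetries = 5) {
    let attempt = 0;
    while (true) {
        const res = await fetch(url);
        if (res.status !== 429) return res;
        attempt += 1;
        if (attempt > maxRetries) throw new Error(`Rate limited: gave up on ${url}`);
        // Wait before trying again.
        await new Promise((resolve) => setTimeout(resolve, backoffDelay(attempt)));
    }
}
```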

&lt;p&gt;To solve this issue, you need to manage your request rates and stay within the website’s limits. Crawlee makes this easy by offering options to fine-tune how many requests your scraper sends at once, how many it sends per minute, and how it scales based on your system’s resources. This gives you the flexibility to adjust your scraper to avoid hitting rate limits while maintaining strong performance.&lt;/p&gt;

&lt;p&gt;Here’s an example of how to handle rate limiting using Crawlee’s &lt;strong&gt;CheerioCrawler&lt;/strong&gt; with adaptive concurrency to scrape &lt;a href="https://news.ycombinator.com/" rel="noopener noreferrer"&gt;Hacker News:&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;CheerioCrawler&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;Dataset&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;crawlee&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;crawler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;CheerioCrawler&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="c1"&gt;// Ensure there will always be at least 2 concurrent requests&lt;/span&gt;
    &lt;span class="na"&gt;minConcurrency&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="c1"&gt;// Prevent the crawler from exceeding 20 concurrent requests&lt;/span&gt;
    &lt;span class="na"&gt;maxConcurrency&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="c1"&gt;// ...but also ensure the crawler never exceeds 250 requests per minute&lt;/span&gt;
    &lt;span class="na"&gt;maxRequestsPerMinute&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;250&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="nf"&gt;requestHandler&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;$&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;enqueueLinks&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;log&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`Processing &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;...`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

        &lt;span class="c1"&gt;// Extract data using Cheerio&lt;/span&gt;
        &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;$&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;.athing&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;index&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;element&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;$element&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;$&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;element&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;$element&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;.title a&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;text&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
                    &lt;span class="na"&gt;rank&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;$element&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;.rank&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;text&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
                    &lt;span class="na"&gt;href&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;$element&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;.title a&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;attr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;href&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                &lt;span class="p"&gt;};&lt;/span&gt;
            &lt;span class="p"&gt;})&lt;/span&gt;
            &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

        &lt;span class="c1"&gt;// Store the results to the default dataset.&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;Dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;pushData&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

        &lt;span class="c1"&gt;// Find a link to the next page and enqueue it if it exists.&lt;/span&gt;
        &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;infos&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;enqueueLinks&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
            &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;.morelink&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;});&lt;/span&gt;

        &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;infos&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;processedRequests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; is the last page!`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;crawler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addRequests&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;https://news.ycombinator.com/&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt;

&lt;span class="c1"&gt;// Run the crawler and wait for it to finish.&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;crawler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Crawler finished.&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;4. IP bans&lt;/h2&gt;

&lt;p&gt;Building on the discussion about rate limiting, IP bans are another common issue you might have come across when scraping the web. Simply put, when a scraper sends too many requests too quickly or behaves in ways that don’t seem natural, the server might block the IP address, either temporarily or permanently. When that happens, your data collection comes to a complete halt, and naturally, we want to prevent this from happening.&lt;/p&gt;

&lt;p&gt;While managing your scraper’s concurrency can help avoid this, sometimes it’s not enough. If you’re still running into blocks, using proxy rotation is a great next step. By rotating IP addresses, you can spread out your requests and make it harder for websites to flag and block your crawler’s activity.&lt;/p&gt;

&lt;p&gt;With Crawlee, adding proxies is straightforward. Whether you’re using your own servers or working with a third-party provider, Crawlee handles the rotation automatically, ensuring your requests come from different IPs.&lt;/p&gt;

&lt;p&gt;If you already have a list of proxies ready, integrating them into your Crawlee scraper takes just a few lines of code. Here’s how you can do it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;ProxyConfiguration&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;CheerioCrawler&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;crawlee&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;proxyConfiguration&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ProxyConfiguration&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;proxyUrls&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;http://proxy-1.com&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;http://proxy-2.com&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;proxyUrl&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;proxyConfiguration&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;newUrl&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;crawler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;CheerioCrawler&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="nx"&gt;proxyConfiguration&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="c1"&gt;// ...rest of the code&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Alternatively, you can use a third-party tool like &lt;a href="https://apify.com/proxy" rel="noopener noreferrer"&gt;Apify Proxy&lt;/a&gt; to access a large pool of residential and datacenter proxies, making proxy management even easier. It also gives you added flexibility by letting you control proxy groups and country codes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;Actor&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;apify&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;proxyConfiguration&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;Actor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createProxyConfiguration&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;groups&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;RESIDENTIAL&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="na"&gt;countryCode&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;US&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;proxyUrl&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;proxyConfiguration&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;newUrl&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;5. Honeypot traps&lt;/h2&gt;

&lt;p&gt;Honeypot traps are hidden elements in a website’s HTML designed to detect and block automated bots and scrapers. These traps, like hidden links, forms, or buttons, are invisible to regular users but can be accidentally triggered by scrapers that process every element indiscriminately. When this happens, it signals bot activity to the website, often resulting in blocks, IP bans, and other issues. In short, you want to keep your scraper far away from these traps.&lt;/p&gt;

&lt;p&gt;One way to avoid these traps is by filtering out hidden elements. You can check for CSS properties such as &lt;code&gt;display: none&lt;/code&gt; and &lt;code&gt;visibility: hidden&lt;/code&gt; to exclude them from your scraping process.&lt;/p&gt;

&lt;p&gt;Another approach is to simulate real user behavior. Instead of scraping the entire HTML, focus on specific sections of the page where the data is located. Mimicking real interactions, like clicking on visible elements or navigating the page, helps your scraper appear more human-like and prevents it from interacting with invisible elements that a user wouldn’t be aware of.&lt;/p&gt;

&lt;p&gt;Here’s an example of how you could modify the Hacker News scraper from the earlier section to filter out honeypot traps:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;CheerioCrawler&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;Dataset&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;crawlee&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;crawler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;CheerioCrawler&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="nf"&gt;requestHandler&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;$&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;enqueueLinks&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;log&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`Processing &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;...`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

        &lt;span class="c1"&gt;// Function to check if an element is visible (filter out Honeypots)&lt;/span&gt;
        &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;isElementVisible&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;element&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;style&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;element&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
                &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;display&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;visibility&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;opacity&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;height&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;width&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;]);&lt;/span&gt;
            &lt;span class="k"&gt;return &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="nx"&gt;style&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;display&lt;/span&gt; &lt;span class="o"&gt;!==&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;none&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt;
                &lt;span class="nx"&gt;style&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;visibility&lt;/span&gt; &lt;span class="o"&gt;!==&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;hidden&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt;
                &lt;span class="nx"&gt;style&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;opacity&lt;/span&gt; &lt;span class="o"&gt;!==&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;0&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
            &lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="p"&gt;};&lt;/span&gt;

        &lt;span class="c1"&gt;// Extract data using Cheerio while avoiding Honeypot traps&lt;/span&gt;
        &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;$&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;.athing&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;index&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;element&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;isElementVisible&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;$&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;element&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
            &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;index&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;element&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;$element&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;$&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;element&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;$element&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;.title a&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;text&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
                    &lt;span class="na"&gt;rank&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;$element&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;.rank&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;text&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
                    &lt;span class="na"&gt;href&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;$element&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;.title a&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;attr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;href&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                &lt;span class="p"&gt;};&lt;/span&gt;
            &lt;span class="p"&gt;})&lt;/span&gt;
            &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

        &lt;span class="c1"&gt;// Store the results to the default dataset.&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;Dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;pushData&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

        &lt;span class="c1"&gt;// Find a link to the next page and enqueue it if it exists.&lt;/span&gt;
        &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;infos&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;enqueueLinks&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
            &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;.morelink&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;});&lt;/span&gt;

        &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;infos&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;processedRequests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; is the last page!`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;crawler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addRequests&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;https://news.ycombinator.com/&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt;

&lt;span class="c1"&gt;// Run the crawler and wait for it to finish.&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;crawler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Crawler finished.&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  6. CAPTCHAs
&lt;/h2&gt;

&lt;p&gt;CAPTCHAs, or Completely Automated Public Turing tests to tell Computers and Humans Apart, are those familiar challenges we’ve all seen: clicking on traffic lights or selecting crosswalks in image grids. While frustrating for humans, they are designed to block bots, making them one of the toughest obstacles for scrapers. Encountering one during scraping can bring your process to a halt, as bots can’t solve these puzzles on their own.&lt;/p&gt;

&lt;p&gt;The good news is that much of what we’ve already covered, like &lt;a href="https://blog.apify.com/crawl-without-getting-blocked/" rel="noopener noreferrer"&gt;avoiding honeypot traps, rotating IPs, and making your scraper mimic human behavior&lt;/a&gt;, also helps reduce the chances of triggering CAPTCHAs. Websites generally try to show CAPTCHAs only when the activity looks suspicious. By blending in with regular traffic through techniques like rotating IPs, randomizing interactions, and managing request patterns thoughtfully, your scraper can often bypass CAPTCHAs entirely.&lt;/p&gt;
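&lt;p&gt;As a rough illustration of “managing request patterns thoughtfully,” here’s a small, library-agnostic sketch (the delay values and names are illustrative, not from any particular tool) that adds jittered pauses between requests so traffic timing looks less mechanical:&lt;/p&gt;

```javascript
// Illustrative sketch: jittered delays between requests.
// Uniform, randomized pauses avoid the perfectly regular timing
// that anti-bot heuristics flag as suspicious.
function randomDelayMs(minMs, maxMs) {
  // Pick a uniform random integer delay in [minMs, maxMs] milliseconds.
  return minMs + Math.floor(Math.random() * (maxMs - minMs + 1));
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

(async () => {
  // Example: pause briefly before the next request.
  // In a real scraper you might use a 1000-3000 ms range.
  await sleep(randomDelayMs(10, 30));
  console.log('ready for the next request');
})();
```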

&lt;p&gt;However, CAPTCHAs can still appear, even when precautions are in place. In such cases, your best bet is to integrate a CAPTCHA-solving service. Tools like Apify’s &lt;a href="https://apify.com/petr_cermak/anti-captcha-recaptcha" rel="noopener noreferrer"&gt;Anti Captcha Recaptcha Actor&lt;/a&gt;, which works with &lt;a href="https://anti-captcha.com/" rel="noopener noreferrer"&gt;Anti-Captcha&lt;/a&gt;, can help you equip your crawlers with CAPTCHA-solving capabilities to handle these challenges automatically and avoid disrupting your scraping.&lt;/p&gt;

&lt;p&gt;Here is an example of how you could use the Apify API to integrate the Anti Captcha Recaptcha Actor into your code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;ApifyClient&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;apify-client&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;// Initialize the ApifyClient with API token&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ApifyClient&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;token&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;''&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="c1"&gt;// Prepare Actor input&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;input&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;cookies&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;name=value; name2=value2&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;key&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;anticaptcha-key&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;proxyAddress&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;8.8.8.8&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;proxyLogin&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;theLogin&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;proxyPassword&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;thePassword&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;proxyPort&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;8080&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;proxyType&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;http&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;siteKey&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;6LfD3PIbAAAAAJs_eEHvoOl75_83eXSqpPSRFJ_u&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;userAgent&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Opera 6.0&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;webUrl&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;https://2captcha.com/demo/recaptcha-v2&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Run the Actor and wait for it to finish&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;run&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;actor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;petr_cermak/anti-captcha-recaptcha&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;})();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  7. Data storage and organization
&lt;/h2&gt;

&lt;p&gt;Storing and organizing data effectively is often overlooked in smaller projects but is actually a core component of any successful web scraping operation.&lt;/p&gt;

&lt;p&gt;While collecting data is the first step, how you store, access, and present it has a huge impact on its usability and scalability. Web scraping generates a mix of data types, from structured information like prices and reviews to unstructured content like PDFs and images. This variety demands flexible storage solutions. For small projects, simple CSV or JSON files stored locally might work, but as your needs grow, these methods can quickly fall short.&lt;/p&gt;

&lt;p&gt;For larger datasets or ongoing scraping, cloud-based solutions like &lt;a href="https://www.mongodb.com/" rel="noopener noreferrer"&gt;MongoDB&lt;/a&gt;, &lt;a href="https://aws.amazon.com/" rel="noopener noreferrer"&gt;Amazon S3&lt;/a&gt;, or &lt;a href="https://apify.com/storage" rel="noopener noreferrer"&gt;Apify Storage&lt;/a&gt; become necessary. They’re designed to handle large volumes of data and offer quick querying capabilities.&lt;/p&gt;
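&lt;p&gt;To make the trade-off concrete, here’s a minimal, illustrative sketch (the file name and sample data are made up) of the local-file approach that works fine for small projects but falls short at scale:&lt;/p&gt;

```javascript
// Illustrative sketch: the simplest storage tier, a flat local JSON file.
// Fine for small one-off scrapes, but it offers no indexing, querying,
// deduplication, or safe concurrent access.
import { writeFileSync, readFileSync } from 'node:fs';

const items = [
  { title: 'Example post', rank: '1.', href: 'https://example.com/' },
  { title: 'Another post', rank: '2.', href: 'https://example.com/2' },
];

// Persist as pretty-printed JSON.
writeFileSync('results.json', JSON.stringify(items, null, 2));

// Reading it back means parsing the entire file. There is no partial
// query, which is exactly what stops this approach scaling.
const saved = JSON.parse(readFileSync('results.json', 'utf8'));
console.log(`stored ${saved.length} items`);
```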

&lt;p&gt;One standout advantage of &lt;a href="https://apify.com/storage" rel="noopener noreferrer"&gt;Apify Storage&lt;/a&gt; is that it’s specifically designed to meet the needs of web scraping. It offers Datasets for structured data, Key-Value Stores for storing metadata or configurations, and Request Queues to help manage and track your scraping workflows. It integrates seamlessly with tools like Crawlee, provides API access for straightforward data retrieval and management, and supports exporting data in multiple formats.&lt;/p&gt;

&lt;p&gt;Best of all, Apify Storage is just one piece of the comprehensive Apify platform, which delivers a full-stack solution for all your web scraping needs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8ycky6muzpocyjz216v8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8ycky6muzpocyjz216v8.png" alt=" " width="800" height="440"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  8. Automation and monitoring
&lt;/h2&gt;

&lt;p&gt;Manually running scrapers every time you need fresh data is not practical, especially for projects requiring regular updates like price tracking, market research, or monitoring real-time changes.&lt;/p&gt;

&lt;p&gt;Automation ensures your workflows run on schedule, minimizing errors and keeping your data current, while monitoring helps detect and address issues like failed requests, CAPTCHAs, or website structure changes before they cause disruptions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.apify.com/platform/monitoring" rel="noopener noreferrer"&gt;Apify Platform Monitoring&lt;/a&gt; simplifies this process by providing tools specifically designed for automating and monitoring web scraping workflows. With task scheduling, you can set your scrapers to run at specific intervals, ensuring consistent data updates.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqy82h2sk3qfw0rp1z04f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqy82h2sk3qfw0rp1z04f.png" alt=" " width="800" height="440"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As well as helping you automate scraping, Apify offers monitoring features to view task statuses, detailed logs, and error messages. These features keep you informed about your scraper’s performance, including notifications and alerts, which can be configured to inform you of task completions or errors via email, Slack, or other integrations.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz0nw017pc4jkziffk03g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz0nw017pc4jkziffk03g.png" alt=" " width="800" height="440"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  9. Scalability and reliability
&lt;/h2&gt;

&lt;p&gt;Building a scalable and reliable scraping operation relies on the key principles we’ve covered: avoiding blocks, maintaining data consistency, storing collected data efficiently, and automating tasks with proper monitoring. Together, these elements create a solid foundation for a system that can grow with your needs while ensuring quality and performance remain intact.&lt;/p&gt;

&lt;p&gt;One crucial yet often overlooked aspect of scalability is infrastructure management. Handling your own servers can quickly turn into a costly and time-consuming challenge, especially as your project expands. That’s why choosing a robust cloud-based solution like Apify from the very start of your project is a smart choice. Designed for scalability, it automatically adjusts to your project’s needs, so you never have to worry about provisioning servers or hitting capacity limits. You only pay for what you use, keeping costs manageable while ensuring your scrapers keep running without interruption.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;a href="https://console.apify.com/sign-up" rel="noopener noreferrer"&gt;Get a free Apify plan now!&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  10. Real-time data scraping
&lt;/h2&gt;

&lt;p&gt;The idea behind real-time data scraping is to continuously collect data as soon as it becomes available. This is often a critical requirement for projects involving time-sensitive data, such as stock market analysis, price monitoring, news aggregation, and tracking live trends.&lt;/p&gt;

&lt;p&gt;To achieve this, you need to &lt;a href="https://docs.apify.com/academy/deploying-your-code/deploying" rel="noopener noreferrer"&gt;deploy your code to a cloud platform&lt;/a&gt; and automate your scraping process with a proper schedule. For example, you can deploy your scraping script as an Apify Actor and schedule it to run at intervals that match how “fresh” you need the data to be. Apify’s scheduling and monitoring tools make it easy for you to implement this automation, ensuring a constant flow of real-time data while helping you promptly handle any errors to maintain accuracy and reliability.&lt;/p&gt;
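&lt;p&gt;As a sketch of the underlying idea (the constants and helper are illustrative; on the Apify platform you’d configure a Schedule rather than run your own timer), the run interval can be derived from how fresh the data must be:&lt;/p&gt;

```javascript
// Illustrative sketch: derive the scrape interval from the required
// data freshness, rather than picking an arbitrary schedule.
const FRESHNESS_MINUTES = 10; // data must never be older than this

// Run at least twice per freshness window, so that one failed run
// still leaves the data within tolerance.
const intervalMs = Math.floor((FRESHNESS_MINUTES * 60 * 1000) / 2);

async function scrapeOnce() {
  // ...fetch and store the latest data here...
  return new Date().toISOString();
}

// In a long-lived process you would schedule the runs, e.g.:
// setInterval(scrapeOnce, intervalMs);
console.log(`would scrape every ${intervalMs / 1000} seconds`);
```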

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;And here we are at the end of the article. I hope you’ve found it helpful and can use it as a reference when dealing with the challenges we’ve discussed. Of course, every scraping project is unique, and it’s impossible to cover every scenario in one post. That’s where the value of a strong developer community comes in.&lt;/p&gt;

&lt;p&gt;Connecting with other developers who have faced and solved similar challenges can make a big difference. It’s a chance to exchange ideas, get advice, and share your own experiences.&lt;/p&gt;

&lt;p&gt;If you haven’t already, I encourage you to &lt;a href="https://discord.com/invite/jyEM2PRvMU" rel="noopener noreferrer"&gt;join the Apify &amp;amp; Crawlee Developer Community on Discord&lt;/a&gt;. It’s a great space to learn, collaborate, and grow alongside others who share your interest in web scraping.&lt;/p&gt;

&lt;p&gt;Hope to see you there!&lt;/p&gt;

</description>
      <category>webscraping</category>
      <category>crawlee</category>
      <category>antiscraping</category>
    </item>
    <item>
      <title>11 best open-source web crawlers and scrapers in 2024</title>
      <dc:creator>Dávid Lukáč</dc:creator>
      <pubDate>Tue, 29 Oct 2024 14:33:22 +0000</pubDate>
      <link>https://dev.to/apify/11-best-open-source-web-crawlers-and-scrapers-in-2024-16pe</link>
      <guid>https://dev.to/apify/11-best-open-source-web-crawlers-and-scrapers-in-2024-16pe</guid>
      <description>&lt;p&gt;Free software libraries, packages, and SDKs for web crawling? Or is it a web scraper that you need?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hey, we're &lt;a href="https://apify.com/pricing" rel="noopener noreferrer"&gt;Apify&lt;/a&gt;. You can build, deploy, share, and monitor your scrapers and crawlers on the Apify platform. &lt;a href="https://apify.it/platform-pricing" rel="noopener noreferrer"&gt;Check us out&lt;/a&gt;.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you're tired of the limitations and costs of proprietary &lt;a href="https://blog.apify.com/best-web-scraping-tools/" rel="noopener noreferrer"&gt;web scraping tools&lt;/a&gt; or being locked into a single vendor, open-source web crawlers and scrapers offer a flexible, customizable alternative.&lt;/p&gt;

&lt;p&gt;But not all open-source tools are the same.&lt;/p&gt;

&lt;p&gt;Some are full-fledged libraries capable of handling large-scale &lt;a href="https://blog.apify.com/web-data-extraction/" rel="noopener noreferrer"&gt;data extraction&lt;/a&gt; projects, while others excel at &lt;a href="https://blog.apify.com/what-is-a-dynamic-page/" rel="noopener noreferrer"&gt;dynamic content&lt;/a&gt; or are ideal for smaller, lightweight tasks. The right tool depends on your project’s complexity, the type of data you need, and your preferred programming language.&lt;/p&gt;

&lt;p&gt;The libraries, frameworks, and SDKs we cover here take into account the diverse needs of developers, so you can choose a tool that meets your requirements.&lt;/p&gt;

&lt;h2&gt;
  
  
  What are open-source web crawlers and web scrapers?
&lt;/h2&gt;

&lt;p&gt;Open-source web crawlers and scrapers let you adapt code to your needs without the cost of licenses or restrictions. Crawlers gather broad data, while scrapers target specific information. Open-source solutions like the ones below offer community-driven improvements, flexibility, and scalability—free from vendor lock-in.&lt;/p&gt;

&lt;h2&gt;
  
  
  Top 11 open-source web crawlers and scrapers in 2024
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Crawlee
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Language:&lt;/strong&gt; Node.js, Python | GitHub: 15.4K+ stars | &lt;a href="https://github.com/apify/crawlee" rel="noopener noreferrer"&gt;link&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Crawlee is a complete web scraping and browser automation library designed for quickly and efficiently building reliable crawlers. With built-in anti-blocking features, it makes your bots look like real human users, &lt;a href="https://blog.apify.com/crawl-without-getting-blocked/" rel="noopener noreferrer"&gt;reducing the likelihood of getting blocked.&lt;br&gt;
&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn4qrz5pe5ow097063ioo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn4qrz5pe5ow097063ioo.png" alt=" " width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Available in both &lt;a href="https://crawlee.dev/?__hstc=160404322.72540665235755e5af5a21367ab1294a.1713784601604.1730191216477.1730206072795.287&amp;amp;__hssc=160404322.1.1730206072795&amp;amp;__hsfp=3227366388&amp;amp;_gl=1*bl1l4a*_gcl_aw*R0NMLjE3MjU2MTE1OTEuQ2owS0NRancwT3EyQmhDQ0FSSXNBQTVodWJVMDVaYVVXd291ZmtZMzhCY2pGMmpadFdBandUa1Q3YmNWZHY0dHZuSVQ1aUF5R2Zva2tySWFBdl81RUFMd193Y0I.*_gcl_au*MjA3NDM2MDE2Ni4xNzI5NTA0MzkyLjE4MDExMDkzMzguMTcyOTg0OTE5NC4xNzI5ODQ5NDAw*_ga*MTYzNDgwNjc5NC4xNzEzNzg4OTYz*_ga_62P18XN9NS*MTczMDIwNjA3MS4zMTIuMS4xNzMwMjA2NTU5LjMxLjAuMA.." rel="noopener noreferrer"&gt;Node.js&lt;/a&gt; and &lt;a href="https://crawlee.dev/python?__hstc=160404322.72540665235755e5af5a21367ab1294a.1713784601604.1730191216477.1730206072795.287&amp;amp;__hssc=160404322.1.1730206072795&amp;amp;__hsfp=3227366388&amp;amp;_gl=1*bl1l4a*_gcl_aw*R0NMLjE3MjU2MTE1OTEuQ2owS0NRancwT3EyQmhDQ0FSSXNBQTVodWJVMDVaYVVXd291ZmtZMzhCY2pGMmpadFdBandUa1Q3YmNWZHY0dHZuSVQ1aUF5R2Zva2tySWFBdl81RUFMd193Y0I.*_gcl_au*MjA3NDM2MDE2Ni4xNzI5NTA0MzkyLjE4MDExMDkzMzguMTcyOTg0OTE5NC4xNzI5ODQ5NDAw*_ga*MTYzNDgwNjc5NC4xNzEzNzg4OTYz*_ga_62P18XN9NS*MTczMDIwNjA3MS4zMTIuMS4xNzMwMjA2NTU5LjMxLjAuMA.." rel="noopener noreferrer"&gt;Python&lt;/a&gt;, Crawlee offers a unified interface that supports HTTP and headless browser crawling, making it versatile for various scraping tasks. It integrates with libraries like Cheerio and &lt;a href="https://blog.apify.com/how-to-parse-html-in-python/" rel="noopener noreferrer"&gt;Beautiful Soup for efficient HTML parsing&lt;/a&gt; and headless browsers like Puppeteer and &lt;a href="https://blog.apify.com/scrape-dynamic-websites-with-python/" rel="noopener noreferrer"&gt;Playwright for JavaScript rendering.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The library excels in scalability, automatically managing concurrency based on system resources, &lt;a href="https://blog.apify.com/rotating-proxies/" rel="noopener noreferrer"&gt;rotating proxies to enhance efficiency&lt;/a&gt;, and employing human-like browser fingerprints to avoid detection. Crawlee also ensures robust data handling through persistent URL queuing and pluggable storage for data and files.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://crawlee.dev/" class="ltag_cta ltag_cta--branded" rel="noopener noreferrer"&gt;Check out Crawlee&lt;/a&gt;
&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Easy switching between simple HTTP request/response handling and complex JavaScript-heavy pages by changing just a few lines of code.&lt;/li&gt;
&lt;li&gt;Built-in sophisticated anti-blocking features like proxy rotation and generation of human-like fingerprints.&lt;/li&gt;
&lt;li&gt;Integrated tools for common tasks like link extraction, infinite scrolling, and blocking unwanted assets, along with support for both Cheerio and JSDOM, provide a comprehensive scraping toolkit right out of the box.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Its comprehensive feature set and the requirement to understand HTTP and browser-based scraping can create a steep learning curve.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;🟧 &lt;a href="https://blog.apify.com/crawlee-web-scraping-tutorial/" rel="noopener noreferrer"&gt;Crawlee web scraping tutorial for Node.js&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Crawlee is ideal for developers and teams seeking to manage simple and complex web scraping and automation tasks in JavaScript/TypeScript and Python. It is particularly effective for scraping web applications that combine static and dynamic pages, as it allows easy switching between different types of crawlers to handle each scenario. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://console.apify.com/sign-up" class="ltag_cta ltag_cta--branded" rel="noopener noreferrer"&gt;Deploy your scraping code to the cloud&lt;/a&gt;
&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Scrapy
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Language:&lt;/strong&gt; Python | GitHub: 52.9k stars | &lt;a href="https://github.com/scrapy/scrapy" rel="noopener noreferrer"&gt;link&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Scrapy is one of the most complete and popular &lt;a href="https://blog.apify.com/what-is-web-scraping/" rel="noopener noreferrer"&gt;web scraping&lt;/a&gt; frameworks within the Python ecosystem. It is written using Twisted, an event-driven networking framework, giving Scrapy asynchronous capabilities.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm6kdxyqefyhlty3mvncp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm6kdxyqefyhlty3mvncp.png" alt=" " width="800" height="364"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As a comprehensive &lt;a href="https://blog.apify.com/web-crawling-vs-web-scraping/" rel="noopener noreferrer"&gt;web crawling&lt;/a&gt; framework designed specifically for data extraction, Scrapy provides built-in support for handling requests, processing responses, and exporting data in multiple formats, including CSV, JSON, and XML.&lt;/p&gt;

&lt;p&gt;Its main drawback is that it cannot natively handle dynamic websites. However, you can &lt;a href="https://blog.apify.com/scrapy-playwright/" rel="noopener noreferrer"&gt;configure Scrapy with a browser automation tool like Playwright&lt;/a&gt; or Selenium to unlock these capabilities.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;💡 &lt;a href="https://blog.apify.com/web-scraping-with-scrapy/" rel="noopener noreferrer"&gt;Learn more about using Scrapy for web scraping&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Significant performance boost due to its asynchronous nature.&lt;/li&gt;
&lt;li&gt;Specifically designed for web scraping, providing a robust foundation for such tasks.&lt;/li&gt;
&lt;li&gt;Extensible &lt;a href="https://blog.apify.com/scrapy-middleware/" rel="noopener noreferrer"&gt;middleware architecture&lt;/a&gt; makes adjusting Scrapy’s capabilities to fit various scraping scenarios easy.&lt;/li&gt;
&lt;li&gt;Supported by a well-established community with a wealth of resources available online.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Steep learning curve, which can be challenging for less experienced web scraping developers.&lt;/li&gt;
&lt;li&gt;Lacks the ability to handle content generated by JavaScript natively, requiring integration with tools like Selenium or Playwright to scrape dynamic pages.&lt;/li&gt;
&lt;li&gt;More complex than necessary for simple and small-scale scraping tasks.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Scrapy is ideally suited for developers, data scientists, and researchers embarking on large-scale web scraping projects who require a reliable and scalable solution for extracting and processing vast amounts of data.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;💡 Run multiple Scrapy spiders in the cloud&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.apify.com/cli/docs/integrating-scrapy" rel="noopener noreferrer"&gt;Read the docs&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  3. MechanicalSoup
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Language:&lt;/strong&gt; Python | GitHub: 4.7K+ stars | &lt;a href="https://github.com/MechanicalSoup/MechanicalSoup" rel="noopener noreferrer"&gt;link&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;MechanicalSoup is a Python library designed to automate website interactions. It provides a simple API to access and interact with HTML content, similar to interacting with web pages through a web browser, but programmatically. MechanicalSoup essentially combines the best features of libraries like Requests for HTTP requests and Beautiful Soup for HTML parsing.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbshzbw663nkcome2ytc7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbshzbw663nkcome2ytc7.png" alt=" " width="800" height="469"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now, you might wonder when to use MechanicalSoup over the traditional combination of BS4 + Requests. MechanicalSoup provides some distinct features &lt;a href="https://blog.apify.com/mechanicalsoup-tutorial/" rel="noopener noreferrer"&gt;particularly useful for specific web scraping tasks&lt;/a&gt;. These include submitting forms, handling login authentication, navigating through pages, and extracting data from HTML.&lt;/p&gt;

&lt;p&gt;MechanicalSoup makes this possible with a &lt;code&gt;StatefulBrowser&lt;/code&gt; object in Python that can store cookies and session data and handle other aspects of a browsing session.&lt;/p&gt;

&lt;p&gt;However, while MechanicalSoup offers some browser-like functionalities akin to what you'd expect from a browser automation tool such as Selenium, it does so without launching an actual browser. This approach has its advantages but also comes with certain limitations, which we'll explore next:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Great choice for simple automation tasks such as filling out forms and scraping data from pages that do not require JavaScript rendering.&lt;/li&gt;
&lt;li&gt;Lightweight tool that interacts with web pages through requests without a graphical browser interface. This makes it faster and less demanding on system resources.&lt;/li&gt;
&lt;li&gt;Directly integrates Beautiful Soup, offering all the benefits you would expect from BS4, plus some extra features.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Unlike real browser automation tools like Playwright and Selenium, MechanicalSoup cannot execute JavaScript. Many modern websites require JavaScript for dynamic content loading and user interactions, which MechanicalSoup cannot handle.&lt;/li&gt;
&lt;li&gt;Unlike Selenium and Playwright, MechanicalSoup does not support advanced browser interactions such as moving the mouse, dragging and dropping, or keyboard actions that might be necessary to retrieve data from more complex websites.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; MechanicalSoup is an efficient and lightweight option for basic scraping tasks, especially for static websites and those with straightforward interactions and navigation.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;🍲 &lt;a href="https://blog.apify.com/mechanicalsoup-tutorial/" rel="noopener noreferrer"&gt;Learn more about MechanicalSoup&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  4. Node Crawler
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Language:&lt;/strong&gt; Node.js | GitHub: 6.7K+ stars | &lt;a href="https://github.com/bda-research/node-crawler" rel="noopener noreferrer"&gt;link&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Node Crawler, often referred to as 'Crawler,' is a popular web crawling library for Node.js. At its core, Crawler utilizes Cheerio as the default parser, but it can be configured to use JSDOM if needed. The library offers a wide range of customization options, including robust queue management that allows you to enqueue URLs for crawling while it manages concurrency, rate limiting, and retries.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu2gpkzblwaeu2tdwe7a2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu2gpkzblwaeu2tdwe7a2.png" alt=" " width="721" height="521"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Advantages:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Built on Node.js, Node Crawler excels at efficiently handling multiple, simultaneous web requests, which makes it ideal for high-volume web scraping and crawling.&lt;/li&gt;
&lt;li&gt;Integrates directly with Cheerio (a fast, flexible, and lean implementation of core jQuery designed specifically for the server), simplifying the process of HTML parsing and data extraction.&lt;/li&gt;
&lt;li&gt;Provides extensive options for customization, from user-agent strings to request intervals, making it suitable for a wide range of web crawling scenarios.&lt;/li&gt;
&lt;li&gt;Easy to set up and use, even for those new to Node.js or web scraping.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Disadvantages:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Does not handle JavaScript rendering natively. For dynamic JavaScript-heavy sites, you need to integrate it with something like Puppeteer or a &lt;a href="https://blog.apify.com/headless-browsers-what-are-they-and-how-do-they-work/" rel="noopener noreferrer"&gt;headless browser&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;While Node Crawler simplifies many tasks, the asynchronous model and event-driven architecture of Node.js can present a learning curve for those unfamiliar with such patterns.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Node Crawler is a great choice for developers familiar with the Node.js ecosystem who need to handle large-scale or high-speed web scraping tasks. It provides a flexible solution for web crawling that leverages the strengths of Node.js's asynchronous capabilities.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;📖 &lt;a href="https://blog.apify.com/mechanicalsoup-tutorial/" rel="noopener noreferrer"&gt;Related: Web scraping with Node.js guide&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  5. Selenium
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Language:&lt;/strong&gt; Multi-language | GitHub: 30.6K stars | &lt;a href="https://github.com/SeleniumHQ/selenium" rel="noopener noreferrer"&gt;link&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Selenium is a widely-used open-source framework for automating web browsers. It allows developers to write scripts in various programming languages to control browser actions. This makes it suitable for crawling and scraping dynamic content. Selenium provides a rich API that supports multiple browsers and platforms, so you can simulate user interactions like clicking buttons, filling forms, and navigating between pages. Its ability to handle JavaScript-heavy websites makes it particularly valuable for scraping modern web applications.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5oylhhk9dpocsocegm9f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5oylhhk9dpocsocegm9f.png" alt=" " width="800" height="364"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cross-browser support:&lt;/strong&gt; Works with all major browsers (Chrome, Firefox, Safari, etc.), allowing for extensive testing and scraping.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dynamic content handling:&lt;/strong&gt; Capable of interacting with JavaScript-rendered content, making it effective for modern web applications.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rich community and resources:&lt;/strong&gt; A large ecosystem of tools and libraries that enhance its capabilities.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Resource-intensive:&lt;/strong&gt; Running a full browser can consume significant system resources compared to headless solutions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Steeper learning curve:&lt;/strong&gt; Requires understanding of browser automation concepts and may involve complex setup for advanced features.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Selenium is ideal for developers and testers needing to automate web applications or scrape data from sites that heavily rely on JavaScript. Its versatility makes it suitable for both testing and data extraction tasks.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;📖 &lt;a href="https://blog.apify.com/web-scraping-with-selenium-and-python/" rel="noopener noreferrer"&gt;Related: How to do web scraping with Selenium in Python&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  6. Heritrix
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Language:&lt;/strong&gt; Java | GitHub: 2.8K+ stars | &lt;a href="https://github.com/internetarchive/heritrix3" rel="noopener noreferrer"&gt;link&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Heritrix is open-source web crawling software developed by the Internet Archive. It is primarily used for web archiving: collecting information from the web to build a digital library and support the Internet Archive's preservation efforts.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpjqrmycqeb00sqiirxrg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpjqrmycqeb00sqiirxrg.png" alt=" " width="800" height="420"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Advantages:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Optimized for large-scale web archiving, making it ideal for institutions like libraries and archives needing to preserve digital content systematically.&lt;/li&gt;
&lt;li&gt;Detailed configuration options that allow users to customize crawl behavior deeply, including deciding which URLs to crawl, how to treat them, and how to manage the data collected.&lt;/li&gt;
&lt;li&gt;Able to handle large datasets, which is essential for archiving significant web portions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Disadvantages:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;As it is written in Java, running Heritrix might require more substantial system resources than lighter, script-based crawlers, and it might limit usability for those unfamiliar with Java.&lt;/li&gt;
&lt;li&gt;Optimized for capturing and preserving web content rather than extracting data for immediate analysis or use.&lt;/li&gt;
&lt;li&gt;Does not render JavaScript, which means it cannot capture content from websites that rely heavily on JavaScript for dynamic content generation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Heritrix is best suited for organizations and projects that aim to archive and preserve digital content on a large scale, such as libraries, archives, and other cultural heritage institutions. Its specialized nature makes it an excellent tool for its intended purpose but less adaptable for more general web scraping needs.&lt;/p&gt;

&lt;h3&gt;
  
  
  7. Apache Nutch
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Language:&lt;/strong&gt; Java | GitHub: 2.9K+ stars | &lt;a href="https://github.com/apache/nutch" rel="noopener noreferrer"&gt;link&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Apache Nutch is an extensible open-source web crawler often used in fields like data analysis. It can fetch content through protocols such as HTTPS, HTTP, or FTP and extract textual information from document formats like HTML, PDF, RSS, and ATOM.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F22mfcswqw3ffwiqf3bec.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F22mfcswqw3ffwiqf3bec.png" alt=" " width="800" height="361"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Advantages:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Highly reliable for continuous, extensive crawling operations given its maturity and focus on enterprise-level crawling.&lt;/li&gt;
&lt;li&gt;Being part of the Apache project, Nutch benefits from strong community support, continuous updates, and improvements.&lt;/li&gt;
&lt;li&gt;Seamless integration with Apache Solr and other Lucene-based search technologies, making it a robust backbone for building search engines.&lt;/li&gt;
&lt;li&gt;Leveraging Hadoop allows Nutch to efficiently process large volumes of data, which is crucial for processing the web at scale.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Disadvantages:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Setting up Nutch and integrating it with Hadoop can be complex and daunting, especially for those new to these technologies.&lt;/li&gt;
&lt;li&gt;Overly complicated for simple or small-scale crawling tasks, whereas lighter, more straightforward tools could be more effective.&lt;/li&gt;
&lt;li&gt;Since Nutch is written in Java, it requires a Java environment, which might not be ideal for environments focused on other technologies.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Apache Nutch is ideal for organizations building large-scale search engines or collecting and processing vast amounts of web data. Its capabilities are especially useful in scenarios where scalability, robustness, and integration with enterprise-level search technologies are required.&lt;/p&gt;

&lt;h3&gt;
  
  
  8. WebMagic
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Language:&lt;/strong&gt; Java | GitHub: 11.4K+ stars | &lt;a href="https://github.com/code4craft/webmagic" rel="noopener noreferrer"&gt;link&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;WebMagic is an open-source, simple, and flexible Java framework dedicated to web scraping. Unlike large-scale data crawling frameworks like Apache Nutch, WebMagic is designed for more specific, targeted scraping tasks, which makes it suitable for individual and enterprise users who need to extract data from various web sources efficiently.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3nh5arogcm1twjghch2l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3nh5arogcm1twjghch2l.png" alt=" " width="800" height="461"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Advantages:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Easier to set up and use than more complex systems like Apache Nutch, which is designed for broader web indexing and requires more setup.&lt;/li&gt;
&lt;li&gt;Designed to be efficient for small to medium-scale scraping tasks, providing enough power without the overhead of larger frameworks.&lt;/li&gt;
&lt;li&gt;For projects already within the Java ecosystem, integrating WebMagic can be more seamless than integrating a tool from a different language or platform.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Disadvantages:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Being Java-based, it might not appeal to developers working with other programming languages who prefer libraries available in their chosen languages.&lt;/li&gt;
&lt;li&gt;WebMagic does not handle JavaScript rendering natively. For dynamic content loaded by JavaScript, you might need to integrate with headless browsers, which can complicate the setup.&lt;/li&gt;
&lt;li&gt;While it has good documentation, the community around WebMagic might not be as large or active as those surrounding more popular frameworks like Scrapy, potentially affecting the future availability of third-party extensions and support.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; WebMagic is a suitable choice for developers looking for a straightforward, flexible Java-based web scraping framework that balances ease of use with sufficient power for most web scraping tasks. It's particularly beneficial for users within the Java ecosystem who need a tool that integrates smoothly into larger Java applications.&lt;/p&gt;

&lt;h3&gt;
  
  
  9. Nokogiri
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Language:&lt;/strong&gt; Ruby | GitHub: 6.1K+ stars | &lt;a href="https://github.com/sparklemotion/nokogiri" rel="noopener noreferrer"&gt;link&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Like Beautiful Soup, Nokogiri is great at parsing HTML and XML documents, but from Ruby. It relies on native parsers such as libxml2, libgumbo, and xerces. If you want to programmatically read or edit an XML document in Ruby, Nokogiri is the way to go.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9y6ou2ql0xc1zcqbo21k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9y6ou2ql0xc1zcqbo21k.png" alt=" " width="800" height="386"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Advantages:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Due to its underlying implementation in C (libxml2 and libxslt), Nokogiri is extremely fast, especially compared to pure Ruby libraries.&lt;/li&gt;
&lt;li&gt;Able to handle both HTML and XML with equal proficiency, making it suitable for a wide range of tasks, from web scraping to RSS feed parsing.&lt;/li&gt;
&lt;li&gt;Straightforward and intuitive API for performing complex parsing and querying tasks.&lt;/li&gt;
&lt;li&gt;Strong, well-maintained community ensures regular updates and good support through forums and documentation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Disadvantages:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Specific to Ruby, which might not be suitable for those working in other programming environments.&lt;/li&gt;
&lt;li&gt;Installation can sometimes be problematic due to its dependencies on native C libraries.&lt;/li&gt;
&lt;li&gt;Can be relatively heavy regarding memory usage, especially when dealing with large documents.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Nokogiri is particularly well-suited for developers already working within the Ruby ecosystem who need a robust, efficient tool for parsing and manipulating HTML and XML data. Its speed, flexibility, and Ruby-native design make it an excellent choice for a wide range of web data extraction and transformation tasks.&lt;/p&gt;

&lt;h3&gt;
  
  
  10. Playwright
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Language:&lt;/strong&gt; Multi-language | GitHub: 67K+ stars | &lt;a href="https://github.com/microsoft/playwright" rel="noopener noreferrer"&gt;link&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Playwright&lt;/strong&gt;, an open-source Node.js library introduced in 2020, is widely used for automated browser testing and web scraping. It is cross-platform, supports multiple languages like TypeScript, JavaScript, Python, and Java, and works with Chromium, Firefox, and WebKit. Playwright offers unique features for web automation, including headless mode, auto-waits, browser contexts, authentication state persistence, and custom selector engines.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8v7l56mztdkayhs6uk01.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8v7l56mztdkayhs6uk01.png" alt=" " width="800" height="421"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Advantages:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Playwright supports multiple browsers, including Chromium, Firefox, and WebKit, for consistent scraping across different platforms. It can also be used with various programming languages such as JavaScript, Python, Java, and .NET, which makes it accessible to a broader range of developers.&lt;/li&gt;
&lt;li&gt;Playwright can operate in headless mode, which reduces resource consumption and allows for faster execution of scraping tasks without a graphical interface. The framework automatically waits for elements to be ready before interacting with them, which reduces the need for manual delays and improves reliability.&lt;/li&gt;
&lt;li&gt;It effectively manages websites that rely on JavaScript and AJAX for content loading, so it's suitable for modern web applications.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Disadvantages:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Running multiple browser instances can consume significant system resources, particularly when scraping large volumes of data.&lt;/li&gt;
&lt;li&gt;While capable, Playwright is primarily designed for browser automation and testing rather than dedicated web crawling, which can complicate extensive scraping tasks.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Playwright is best suited for developers looking to automate interactions with web applications that utilize modern frameworks like React or Angular. Its ability to handle dynamic content makes it ideal for scenarios where traditional HTTP request libraries fall short. It is particularly advantageous in projects that require frequent updates or interactions with complex web interfaces.&lt;/p&gt;

&lt;h3&gt;
  
  
  11. Katana
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Language:&lt;/strong&gt; Go | GitHub: 11.1K+ stars | &lt;a href="https://github.com/projectdiscovery/katana" rel="noopener noreferrer"&gt;link&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Katana is a web scraping framework focused on speed and efficiency. Developed by Project Discovery, it is designed to facilitate data collection from websites while providing a strong set of features tailored for security professionals and developers. Katana lets you create custom scraping workflows using a simple configuration format. It supports various output formats and integrates easily with other tools in the security ecosystem, which makes it a versatile choice for web crawling and scraping tasks.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx2jdwahwfyc5ylp7s89b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx2jdwahwfyc5ylp7s89b.png" alt=" " width="800" height="482"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;High performance:&lt;/strong&gt; Built with efficiency in mind, allowing for fast data collection from multiple sources.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Extensible architecture:&lt;/strong&gt; Easily integrates with other tools and libraries, enhancing its functionality.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security-focused features:&lt;/strong&gt; Includes capabilities that cater specifically to the needs of security researchers and penetration testers.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Limited community support:&lt;/strong&gt; As a newer tool, it does not have as extensive resources or community engagement as more established frameworks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Niche use case focus:&lt;/strong&gt; Primarily designed for security professionals, which may limit its appeal for general-purpose web scraping tasks.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Katana is best suited for security professionals and developers looking for a fast, efficient framework tailored to web scraping needs within the cybersecurity domain. Its integration capabilities make it particularly useful in security testing scenarios where data extraction is required.&lt;/p&gt;

&lt;h2&gt;
  
  
  All-in-one crawling and scraping solution: Apify
&lt;/h2&gt;

&lt;p&gt;Apify is a full-stack web scraping and browser automation platform for building crawlers and scrapers in any programming language. It provides infrastructure for successful scraping at scale: storage, integrations, scheduling, proxies, and more.&lt;/p&gt;

&lt;p&gt;So, whichever library you want to use for your scraping scripts, you can deploy them to the cloud and benefit from all the features the Apify platform has to offer.&lt;/p&gt;

&lt;p&gt;Apify also hosts a library of ready-made data extraction and automation tools (Actors) created by other developers, which you can customize for your use case. That means you don't have to build everything from scratch.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8ku3d1uzldkhzqglsl2k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8ku3d1uzldkhzqglsl2k.png" alt=" " width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://console.apify.com/sign-up" class="ltag_cta ltag_cta--branded" rel="noopener noreferrer"&gt;Sign up now and start scraping&lt;/a&gt;
&lt;/p&gt;

</description>
      <category>webscraping</category>
      <category>javascript</category>
      <category>python</category>
      <category>opensource</category>
    </item>
    <item>
      <title>How to scrape data from any website to Excel</title>
      <dc:creator>Dávid Lukáč</dc:creator>
      <pubDate>Mon, 03 Jul 2023 22:00:00 +0000</pubDate>
      <link>https://dev.to/apify/how-to-scrape-data-from-any-website-to-excel-ibc</link>
      <guid>https://dev.to/apify/how-to-scrape-data-from-any-website-to-excel-ibc</guid>
      <description>&lt;p&gt;&lt;strong&gt;Hi, we're&lt;/strong&gt; &lt;a href="https://apify.it/platform-pricing" rel="noopener noreferrer"&gt;&lt;strong&gt;Apify&lt;/strong&gt;&lt;/a&gt; &lt;strong&gt;. The Apify platform gives you access to 1,500+ tools to extract data from popular websites.&lt;/strong&gt; &lt;a href="https://apify.it/platform-pricing" rel="noopener noreferrer"&gt;&lt;strong&gt;Check us out&lt;/strong&gt;&lt;/a&gt; &lt;strong&gt;.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.investintech.com/resources/blog/archives/5718-experts-predict-the-future-of-excel-in-business-intelligence.html#:~:text=With%20over%20750%20million%20users,for%20presenting%20and%20analyzing%20data." rel="noopener noreferrer"&gt;Attracting over 750 million users worldwide&lt;/a&gt;, Microsoft Excel is one of the most popular applications for organizing, formatting and calculating data. Excel files are a great example of &lt;a href="https://blog.apify.com/when-data-gets-too-big-why-you-need-structured-data/" rel="noopener noreferrer"&gt;structured data&lt;/a&gt;, allowing users to easily manipulate datasets and gain insight into gathered data thanks to tables, graphs, and other visualizations.&lt;/p&gt;

&lt;p&gt;But if you use Excel a lot, sooner or later you'll probably come across online data you'd like to feed into your spreadsheet. In this article, we'll show you four ways to import data from any website into a structured Excel file.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Manual copy and pasting&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Implementing web queries&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Excel's VBA language&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Using web scraping tools&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgnp08ejnsh7r051jw5av.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgnp08ejnsh7r051jw5av.jpg" alt="illustration of a file on one laptop turned to an Excel file on another" width="800" height="389"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;There are multiple ways of turning a website into an Excel table, some more complicated than others&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Manual copy and pasting&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Depending on the amount and format of the data you want to pull into Excel, manual copying and pasting is always an option - although often a lengthy and tedious one. It is still used in areas such as &lt;a href="https://blog.apify.com/product-matching-ai-pricing-intelligence-web-scraping/#%F0%9F%92%AA-what-about-manual-mapping" rel="noopener noreferrer"&gt;manual product mapping&lt;/a&gt;, which require further interaction and input from the user. It can also be useful when the data you want to extract is already in a table format: for instance, here we simply copy and paste the HTML table from our &lt;a href="https://apify.hashnode.dev/what-are-the-best-python-web-scraping-libraries" rel="noopener noreferrer"&gt;Python libraries&lt;/a&gt; blog article. But keep in mind that the &lt;a href="https://blog.apify.com/when-data-gets-too-big-why-you-need-structured-data/#what-is-big-data-and-why-structured-data-matters" rel="noopener noreferrer"&gt;vast majority of online data is unstructured&lt;/a&gt;, so this method will probably be irrelevant for most use cases.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9ltn2hcei7a7ixe4vht6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9ltn2hcei7a7ixe4vht6.png" alt="screenshot of a HTML table being copied into Excel" width="800" height="468"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Copy and pasting is only effective when the data is already structured&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Implementing web queries&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;If you're using Windows, you can use Microsoft's &lt;a href="https://powerquery.microsoft.com/en-us/" rel="noopener noreferrer"&gt;Power Query&lt;/a&gt; feature set to pull structured data straight into your Excel sheet automatically. Simply go to the Data tab, select From Web, and input the URL of the website you'd like to export data from. The software will then process the page and find any usable tables it can pull from it. Once you have the data, you can refresh the query at any time, updating your sheet at the click of a button. While this method is useful for keeping your dataset up to date, it only works with data that's already structured. Another disadvantage is that Power Query's ability to get web data is &lt;a href="https://answers.microsoft.com/en-us/msoffice/forum/all/copying-and-pasting-web-tables-into-excel/2f88969c-49ab-4e91-8cb3-4a125c2a9486" rel="noopener noreferrer"&gt;limited to the Windows operating system&lt;/a&gt;, so Mac and other OS users are out of luck here.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flikiq3backv0ntx0fndp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flikiq3backv0ntx0fndp.png" alt="screenshot of Excel with the cursor pointing to From Web in the Data section" width="800" height="382"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;source:&lt;/em&gt; &lt;a href="https://support.microsoft.com/en-au/office/import-data-from-the-web-b13eed81-33fe-410d-9247-1747269c28e4" rel="noopener noreferrer"&gt;&lt;em&gt;Microsoft Support&lt;/em&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Excel's VBA language&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Microsoft's VBA (&lt;a href="https://learn.microsoft.com/en-us/office/vba/library-reference/concepts/getting-started-with-vba-in-office" rel="noopener noreferrer"&gt;Visual Basic for Applications&lt;/a&gt;) is an implementation of the event-driven programming language Visual Basic in the Office ecosystem. It gives users access to functions beyond what is available in Microsoft Office, letting them automate processes, create macros and custom forms, or customize applications to fit their specific business needs. But implementing data extraction in your Excel sheet with VBA requires advanced programming and web scraping knowledge. You will also need to install libraries such as &lt;a href="https://blog.apify.com/web-scraping-with-selenium-and-python/#what-is-selenium" rel="noopener noreferrer"&gt;Selenium&lt;/a&gt; or Microsoft XML HTTP, depending on whether you need to &lt;a href="https://www.youtube.com/watch?v=E3u_1e3wg74&amp;amp;t=75s" rel="noopener noreferrer"&gt;extract data from a webpage that runs scripts&lt;/a&gt;. Since this method is fundamentally the same as &lt;a href="https://apify.com/web-scraping" rel="noopener noreferrer"&gt;web scraping&lt;/a&gt;, it might be better to use data extraction tools outside of Microsoft's ecosystem for a more flexible workflow with more in-depth documentation.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Using web scraping tools&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://blog.apify.com/what-is-web-scraping-and-web-scraping-tools/" rel="noopener noreferrer"&gt;Web scraping&lt;/a&gt; lets you automate the process of data extraction, minimizing the effort put into gathering web data in bulk. Unlike the first two methods, you can scrape any unstructured data from the website, be it product names, prices, headings, descriptions, and so on. This data is then stored in a structured format, and depending on the API youre using to scrape the web, it can also be exported into machine-readable formats such as XLS (Microsoft Excel).&lt;/p&gt;

&lt;p&gt;You can build your own web scraper with &lt;a href="https://blog.apify.com/web-scraping-javascript-nodejs/#http-clients" rel="noopener noreferrer"&gt;Node.js&lt;/a&gt; or &lt;a href="https://blog.apify.com/what-are-the-best-python-web-scraping-libraries/" rel="noopener noreferrer"&gt;Python libraries&lt;/a&gt; such as Requests, HTTPX, Selenium, or Playwright, or use pre-made tools where you simply input your desired URL or search query. Apify is a full-stack platform that lets you do both - you can build (&lt;a href="https://docs.apify.com/platform/actors/development/source-code" rel="noopener noreferrer"&gt;or import&lt;/a&gt;) your own scraper with our &lt;a href="https://apify.com/templates" rel="noopener noreferrer"&gt;web scraping templates&lt;/a&gt; or use a ready-made Actor from &lt;a href="https://apify.com/store" rel="noopener noreferrer"&gt;Apify Store&lt;/a&gt; to get the job done.&lt;/p&gt;
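&lt;p&gt;To illustrate the build-it-yourself route, here's a minimal, standard-library-only sketch that extracts an HTML table and serializes it to CSV, a format Excel opens directly. The sample markup is made up for the demo; a real scraper would first fetch the page with a library like Requests or HTTPX:&lt;/p&gt;

```python
import csv
import io
from html.parser import HTMLParser

class TableParser(HTMLParser):
    """Collect the text of every cell in an HTML table, row by row."""
    def __init__(self):
        super().__init__()
        self.rows, self._row, self._in_cell = [], [], False

    def handle_starttag(self, tag, attrs):
        if tag in ("td", "th"):
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row:
            self.rows.append(self._row)
            self._row = []

    def handle_data(self, data):
        if self._in_cell:
            self._row.append(data.strip())

# Hypothetical page fragment; in practice this comes from an HTTP response.
sample_html = """<table>
<tr><th>Library</th><th>Language</th></tr>
<tr><td>Requests</td><td>Python</td></tr>
<tr><td>Playwright</td><td>Python/JS</td></tr>
</table>"""

parser = TableParser()
parser.feed(sample_html)

# Serialize the rows to CSV, which Excel opens as a regular spreadsheet.
buf = io.StringIO()
csv.writer(buf).writerows(parser.rows)
print(buf.getvalue())
```

For real exports you'd write to a file (e.g. &lt;code&gt;open("data.csv", "w", newline="")&lt;/code&gt;) instead of an in-memory buffer.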

&lt;p&gt;So, let's take a look at how you can get data from any website in 3 simple steps using APIs found on Apify Store:&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Step 1. Find the right tool&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Apify Store works just like any other app store: just search for an Actor you'd like to use based on the website you want to scrape. Most scrapers are free, although you may come across some paid ones. No need to worry: even these offer a free trial without the need for a credit card. After you find your scraper, click the &lt;em&gt;Try for free&lt;/em&gt; button. For this example, we're going to be using &lt;a href="https://apify.com/vdrmota/twitter-scraper?ref=blog.apify.com" rel="noopener noreferrer"&gt;Twitter Scraper&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftmq447ahfmt08w26i0uq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftmq447ahfmt08w26i0uq.png" alt="screenshot of Actors on Apify Store" width="800" height="420"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Find a suitable Actor on&lt;/em&gt; &lt;a href="https://apify.com/store" rel="noopener noreferrer"&gt;&lt;em&gt;Apify Store&lt;/em&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;🤔 &lt;a href="https://apify.com/apify/web-scraper" rel="noopener noreferrer"&gt;Can't find the right Actor for your use case? Try the universal Web Scraper, which can extract data from any website after a little setup&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;You'll then be prompted to sign in or create an account if you don't already have one. You can speed up the process by signing in with your Google or GitHub profile.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn5hi41j07jk5u96ryjbq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn5hi41j07jk5u96ryjbq.png" alt="Screenshot of Apify Console's sign up page" width="800" height="428"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Sign in or make an account to access&lt;/em&gt; &lt;a href="https://console.apify.com/" rel="noopener noreferrer"&gt;&lt;em&gt;Apify Console&lt;/em&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Step 2. Choose the data you want to scrape&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Now, fill in the input schema to tell your Actor what data you want from the website of your choice. With Twitter Scraper, we decided to scrape the top 30 results from the handle &lt;a href="https://twitter.com/apify?ref=blog.apify.com" rel="noopener noreferrer"&gt;@apify&lt;/a&gt; (yup, we're also on Twitter, so don't forget to give us a follow). This feature is especially relevant if you know &lt;a href="https://mailtrap.io/blog/mail-merge-explained/" rel="noopener noreferrer"&gt;how to use mail merge&lt;/a&gt; and want to gather data that helps you tailor outreach or PR messages, for instance.&lt;/p&gt;

&lt;p&gt;If you're not sure what the individual inputs mean, hover your mouse over the question mark next to them to get an explanation. The Actor readme (you can find it on the Actor's page in Apify Store) can also be of great help if you feel lost when filling out the schema.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1e7vdzwidmci845milpz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1e7vdzwidmci845milpz.png" alt="Screenshot of the input schema for Twitter Scraper" width="800" height="470"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Fill out the Actor's input schema&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Step 3. Run the scraper and download your results in Excel&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;After you're done filling out the input, hit &lt;strong&gt;Save &amp;amp; Start&lt;/strong&gt; to kick off the Actor run. Now just wait for the scrape to finish.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd7zm2ub7favm339otf4o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd7zm2ub7favm339otf4o.png" alt="Screenshot of an Apify Actor running" width="800" height="466"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;You can watch your results load in the Output tab&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;As soon as your run finishes, you can hit the &lt;strong&gt;Export results&lt;/strong&gt; button at the bottom of the UI. Here you can choose your desired output format (in our case, Excel) and which fields from the table you want (or don't want) to export.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo1txpcjdzi6er1xo3ch0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo1txpcjdzi6er1xo3ch0.png" alt="Screenshot of the Export dataset pop-up on Apify Console" width="800" height="435"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Download your data in an Excel format&lt;/em&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Tip: Not every scraper has the &lt;em&gt;Output&lt;/em&gt; / &lt;em&gt;Overview&lt;/em&gt; tab. If that's the case, go to &lt;em&gt;Storage&lt;/em&gt; to get the full data in Excel.&lt;/p&gt;
&lt;/blockquote&gt;
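&lt;p&gt;If you'd rather post-process the dataset yourself, the JSON items an Actor produces can be flattened into an Excel-friendly CSV with a few lines of Python. The field names below are illustrative, not the actual Twitter Scraper output schema:&lt;/p&gt;

```python
import csv
import io

def items_to_csv(items):
    """Turn a list of dataset items (dicts) into CSV text Excel can open."""
    # Take the union of all keys, keeping first-seen order,
    # so no field from any item is dropped.
    fields = []
    for item in items:
        for key in item:
            if key not in fields:
                fields.append(key)
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fields)
    writer.writeheader()
    writer.writerows(items)  # missing fields become empty cells
    return buf.getvalue()

# Hypothetical dataset items, mimicking what an export might contain.
items = [
    {"handle": "@apify", "text": "Hello!", "likes": 12},
    {"handle": "@apify", "text": "Web scraping tips", "retweets": 3},
]
print(items_to_csv(items))
```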




&lt;p&gt;And that's it! Now that you have your data, it's time to explore the world of web scraping with a bigger pool of Actors: interested in social media? Try our &lt;a href="https://apify.com/jaroslavhejlek/instagram-scraper?ref=blog.apify.com" rel="noopener noreferrer"&gt;Instagram&lt;/a&gt;, &lt;a href="https://apify.com/clockworks/tiktok-scraper" rel="noopener noreferrer"&gt;TikTok&lt;/a&gt;, or &lt;a href="https://apify.com/comchat/reddit-api-scraper" rel="noopener noreferrer"&gt;Reddit&lt;/a&gt; scrapers. If you'd prefer to scrape product listings and their sellers, give the &lt;a href="https://apify.com/vaclavrut/amazon-crawler?ref=blog.apify.com" rel="noopener noreferrer"&gt;Amazon&lt;/a&gt;, &lt;a href="https://apify.com/dtrungtin/ebay-items-scraper?ref=blog.apify.com" rel="noopener noreferrer"&gt;eBay&lt;/a&gt;, or &lt;a href="https://apify.com/epcsht/aliexpress-scraper?ref=blog.apify.com" rel="noopener noreferrer"&gt;AliExpress&lt;/a&gt; scrapers a go. Looking for new hires? Scrape job listings from &lt;a href="https://apify.com/misceres/indeed-scraper" rel="noopener noreferrer"&gt;Indeed&lt;/a&gt; or &lt;a href="https://apify.com/alexey/glassdoor-jobs-scraper?ref=blog.apify.com" rel="noopener noreferrer"&gt;Glassdoor&lt;/a&gt;. You can even further &lt;a href="https://apify.com/integrations" rel="noopener noreferrer"&gt;integrate&lt;/a&gt; these tools with other platforms to bring your web scraping workflow to the next level 🚀&lt;/p&gt;

&lt;p&gt;Missing an Actor in Apify Store? You can contribute by &lt;a href="https://apify.com/partners/actor-developers" rel="noopener noreferrer"&gt;monetizing your own code&lt;/a&gt; or &lt;a href="https://apify.com/ideas" rel="noopener noreferrer"&gt;publishing an idea&lt;/a&gt; as inspiration for our developer community. And if you need any advice from web scraping and automation enthusiasts, there's always our &lt;a href="https://discord.com/invite/crawlee-apify-801163717915574323" rel="noopener noreferrer"&gt;Discord channel&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>tutorial</category>
      <category>webscraping</category>
    </item>
    <item>
      <title>What are the highest paying programming languages in 2022?</title>
      <dc:creator>Dávid Lukáč</dc:creator>
      <pubDate>Fri, 04 Nov 2022 14:06:21 +0000</pubDate>
      <link>https://dev.to/apify/what-are-the-highest-paying-programming-languages-in-2022-2imi</link>
      <guid>https://dev.to/apify/what-are-the-highest-paying-programming-languages-in-2022-2imi</guid>
      <description>&lt;p&gt;Based on data provided by employment websites &lt;a href="https://www.glassdoor.com/Salaries/web-developer-salary-SRCH_KO0,13.htm?_gl=1*tmj9i*_ga*MzA2MjkwNzgzLjE2NjQ4MTAyNzc.*_ga_RC95PMVB3H*MTY2NDgxMDI3Ny4xLjEuMTY2NDgxMDI3Ny42MC4wLjA." rel="noopener noreferrer"&gt;Glassdoor&lt;/a&gt; and &lt;a href="https://www.indeed.com/career/software-engineer/salaries?from=top_sb" rel="noopener noreferrer"&gt;Indeed&lt;/a&gt;, web developers in the US make around &lt;a href="https://www.indeed.com/career/web-developer/salaries?&amp;amp;aceid=&amp;amp;kw=adwords_c_9099621460_15516767951_&amp;amp;sid=us_googconthajpmax-_c__g_1003803_gclid%24_Cj0KCQjwkOqZBhDNARIsAACsbfJQjiEc4WUYFHh_mH7SVGoBt3PXUMJTQ4IsV0szndVBVpkt36QHwV8aAjwFEALw_wcB&amp;amp;gclid=Cj0KCQjwkOqZBhDNARIsAACsbfJQjiEc4WUYFHh_mH7SVGoBt3PXUMJTQ4IsV0szndVBVpkt36QHwV8aAjwFEALw_wcB&amp;amp;gclsrc=aw.ds" rel="noopener noreferrer"&gt;$70,000 a year&lt;/a&gt;, while the paychecks of software developers and engineers often exceed the &lt;a href="https://www.glassdoor.com/Salaries/software-developer-salary-SRCH_KO0,18.htm#:~:text=The%20estimated%20total%20pay%20for,salaries%20collected%20from%20our%20users." rel="noopener noreferrer"&gt;$90,000 mark&lt;/a&gt;. Of course, these are just average figures, and real salaries can vary based on &lt;a href="https://content.techgig.com/career-advice/5-factors-that-decide-how-much-salary-do-software-developers-get/articleshow/87461512.cms" rel="noopener noreferrer"&gt;multiple factors&lt;/a&gt;. These include the location of the job, the size and type of company, and, most importantly, the skill set and experience of the developer.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://blog.apify.com/how-to-make-money-from-coding/" rel="noopener noreferrer"&gt;💼 Learn more about the types of jobs you can do with your programming skills&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;One of the most important assets in a developer's skill set is the programming language(s) they have learned and mastered. Although these are often picked out in advance by the accredited institutions that offer IT-related study programs, you can still choose which courses to attend or self-study any language you like. But choosing the right programming language is not just a matter of its difficulty, popularity, or practicality. When developers enter the job market, it also directly affects the size of their salary.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://blog.apify.com/content/images/2022/11/programming_money-2.jpg" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwe3x4capr0h0i7dr87ig.jpg" alt="Ones and zeroes on a glowing screen to demonstrate both coding and the numbers in a developer paycheck" width="800" height="533"&gt;&lt;/a&gt;A job in programming is sure to land you an above-average pay-check&lt;/p&gt;

&lt;h2&gt;
  
  
  What are the highest-paying programming languages?
&lt;/h2&gt;

&lt;p&gt;Thanks to &lt;a href="https://survey.stackoverflow.co/2022/" rel="noopener noreferrer"&gt;Stack Overflows Developer Survey&lt;/a&gt; conducted in 2022, we can get relevant insights into the developer job market, backed by over &lt;a href="https://survey.stackoverflow.co/2022/#methodology-general" rel="noopener noreferrer"&gt;73,000 responses&lt;/a&gt; from programmers worldwide. One of the fields analyzed in the survey is the technology used in the developer workflow and how it's projected onto their paycheck. The sums in their list represent the median yearly salary (including bonuses) before taxes. So, lets take a closer look at the first 5 &lt;a href="https://survey.stackoverflow.co/2022/#technology-top-paying-technologies" rel="noopener noreferrer"&gt;top paying programming, scripting, and markup languages&lt;/a&gt; identified by the survey:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Clojure - $106,644
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://blog.apify.com/content/images/2022/11/clojure_logo.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu10g2hrrma5i1gsu4v4x.png" width="800" height="103"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Clojure leads the list as the highest-paid programming language. It's part of the Lisp family of languages, known for its code-as-data approach and powerful macro system. It runs on the Java Virtual Machine, and its code compiles to JVM bytecode. As with many other languages on the list, the higher paycheck can be attributed to the limited number of programmers who know how to use it and the seniority of those who do. But even though it sports a smaller fanbase, it's still well liked (being the &lt;a href="https://survey.stackoverflow.co/2022/#technology-most-loved-dreaded-and-wanted" rel="noopener noreferrer"&gt;third most-loved language&lt;/a&gt; in the survey) and useful - it's &lt;a href="https://clojure.org/community/success_stories" rel="noopener noreferrer"&gt;implemented at big companies&lt;/a&gt; such as Apple, Netflix, and Walmart.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Erlang - $103,000
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://blog.apify.com/content/images/2022/11/erlang_logo.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5gk2fph2csvulxg8i800.png" width="800" height="103"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Erlang is an older functional language developed originally for telephone switches and telecommunication systems. Today, it has found its use case in scalable, concurrent systems that require high availability. While it may not be as popular as modern programming languages, it is still used today to power applications such as &lt;a href="https://www.wired.com/2015/09/whatsapp-serves-900-million-users-50-engineers/" rel="noopener noreferrer"&gt;WhatsApp&lt;/a&gt;, where a fast and stable connection between millions of users is a must.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. F# - $95,526
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://blog.apify.com/content/images/2022/11/fsharp_logo.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4j1ifcrtypgh8xzyzse5.png" width="800" height="103"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This general-purpose programming language is part of the &lt;a href="https://en.wikipedia.org/wiki/ML_(programming_language)" rel="noopener noreferrer"&gt;ML family&lt;/a&gt; and allows you to write robust, functional, and performant code. It's maintained not only by the F# Software Foundation but also by Microsoft and other open-source contributors. It's mainly viewed as a back-end language, useful for data processing, web APIs, and one-off scripts. &lt;a href="https://fsharpforfunandprofit.com/posts/fvsc-sum-of-squares/" rel="noopener noreferrer"&gt;Compared to C#&lt;/a&gt;, for instance, the code is much shorter since it has less noise, and its functions can stand alone because the language doesn't force the developer into the OOP paradigm.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Lisp - $95,000
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://blog.apify.com/content/images/2022/11/lisp_logo.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Furflsgozw6unn8x6r86a.png" width="800" height="103"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Lisp, or LISP (List Processing Programming Language), is one of the oldest high-level languages and was made primarily for easy manipulation of data strings. Its inception can be dated all the way back to 1958. Since then, many different dialects of Lisp have been formed, such as Clojure, which sits at the top of this list. Nowadays, Lisp is used in diverse fields, such as &lt;a href="https://www.britannica.com/technology/LISP-computer-language" rel="noopener noreferrer"&gt;AI&lt;/a&gt;, thanks to its self-modifying programs, which give applications some learning ability.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Ruby - $93,000
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://blog.apify.com/content/images/2022/11/ruby_logo.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp6mmxffbh1achgbhf896.png" width="800" height="103"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Ruby is a high-level open-source programming language suitable for both front-end and back-end development. Its main goal is to boost developer productivity with its clear syntax. Ruby's simplicity is great for both beginners and devs looking to scale their programs. It's primarily used on the back-end with the &lt;a href="https://rubyonrails.org/" rel="noopener noreferrer"&gt;Ruby on Rails framework&lt;/a&gt; to build web applications and automation tools or to create web page templates and write server-side code. Unlike other languages in this list, Ruby is among the &lt;a href="https://survey.stackoverflow.co/2022/#technology-most-popular-technologies" rel="noopener noreferrer"&gt;top 20 most popular programming languages&lt;/a&gt;, so finding a job offer for a Ruby programmer should not be a chore.&lt;/p&gt;

&lt;h3&gt;
  
  
  6. Elixir - $92,959
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://blog.apify.com/content/images/2022/11/elixir_logo.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fimf561pa0hxai37shyhr.png" width="800" height="103"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Elixir is a dynamic, functional language primarily used for building scalable and maintainable web applications. It builds on Erlang - the second-highest-paid language on this list - and runs on the BEAM virtual machine, Erlang's own runtime. It was created in 2012 with the &lt;a href="https://blog.bitsrc.io/how-to-learn-elixir-for-hipster-javascript-developers-f0352e579ac3" rel="noopener noreferrer"&gt;purpose of reintroducing 1980s tech&lt;/a&gt; like Erlang, which is used to build distributed, fault-tolerant systems. It is, therefore, prominent in the telecommunications, finance, and e-commerce industries.&lt;/p&gt;

&lt;h3&gt;
  
  
  7. Scala - $92,780
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://blog.apify.com/content/images/2022/11/scala_logo.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft9i2bq61q9rul3k60uu2.png" width="800" height="103"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This high-level object-oriented and functional language was developed as an alternative to Java, with which it shares many similarities. It also runs on the Java Virtual Machine and can be compiled to JavaScript. But in contrast with languages like Ruby, Python, or Lisp, Scala is a &lt;a href="https://www.oreilly.com/library/view/scala-and-spark/9781785280849/ff7c8d1b-871c-4a7c-b332-05070e3359ec.xhtml" rel="noopener noreferrer"&gt;statically typed language&lt;/a&gt;. That makes it a good fit for high-volume projects and gives it the strength to power data engineering infrastructure at many companies.&lt;/p&gt;

&lt;h3&gt;
  
  
  8. Perl - $90,073
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://blog.apify.com/content/images/2022/11/perl_logo.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd5t1gzp13xkuy76v9341.png" width="800" height="103"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Perl, also known as Practical Extraction and Reporting Language, is a family of two languages: Perl 5 (descended from the original Perl, released in 1987) and its 2000s redesign known as Perl 6, which was recently renamed &lt;a href="https://www.raku.org/" rel="noopener noreferrer"&gt;Raku&lt;/a&gt;. Its syntax is quite similar to languages such as C or shell. Still, when it comes to speed, an iterative algorithm implemented in C can be up to &lt;a href="https://docstore.mik.ua/orelly/weblinux2/modperl/ch13_12.htm" rel="noopener noreferrer"&gt;56 times faster&lt;/a&gt; than the same algorithm in Perl. But this comparison mainly matters for mathematical tasks, and Perl still shines in text manipulation, system administration, and web development.&lt;/p&gt;

&lt;h3&gt;
  
  
  9. Go - $89,204
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://blog.apify.com/content/images/2022/11/go_logo_v2.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbumr0nclfd8b2xk1rfox.png" width="800" height="103"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This statically typed, open-source language was created by Google engineers in 2009. Although Google built it, its development has resembled a &lt;a href="https://go.dev/conduct" rel="noopener noreferrer"&gt;community project&lt;/a&gt; run by developers, for developers. It was designed to be simple to learn and easy to read. But that simplicity may also be its biggest drawback, since it results in a smaller feature set, especially compared to languages such as C++, with which it bears similarities. Go is widely used in cloud and server-side applications and other networked services.&lt;/p&gt;

&lt;h3&gt;
  
  
  10. Rust - $87,047
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://blog.apify.com/content/images/2022/11/rust_logo.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fom3zx87k0jim62sy4ny6.png" width="800" height="103"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Rust is a low-level, statically typed language. Its first stable release came in 2015, and it has since gained popularity for its performance and safety, which is why it's often used on the back-end for systems development and for building operating systems or microcontroller applications. It is quite a complex language to learn, though the transition is easier if you're coming from an even more demanding language such as C++. But compared to C++ (which sits in the bottom half of the list), Rust is ranked the &lt;a href="https://survey.stackoverflow.co/2022/#most-loved-dreaded-and-wanted-language-love-dread" rel="noopener noreferrer"&gt;no. 1 most-loved language&lt;/a&gt; in the Stack Overflow survey. So, in the end, it may just be worth the effort!&lt;/p&gt;

&lt;h2&gt;
  
  
  Which programming language is worth learning?
&lt;/h2&gt;

&lt;p&gt;While learning one of the languages above will certainly narrow your job search down to the offers with the best financial compensation, it might narrow it a little too much. These languages are often rare in the modern industry, demand a very specific level of expertise, or are aimed mainly at senior developers.&lt;/p&gt;

&lt;p&gt;Therefore, if you don’t have time to spare and your priority is to land a job in programming, your best bet is to focus on the &lt;a href="https://survey.stackoverflow.co/2022/#technology-most-popular-technologies" rel="noopener noreferrer"&gt;most popular technologies&lt;/a&gt; (&lt;em&gt;technologies&lt;/em&gt; referring to programming, scripting, and markup languages) rather than the best-paid ones. You’ll still get paid quite well: JavaScript, for example, is the most popular programming language out there, and its yearly median salary is $65,580, already over $10,000 higher than the &lt;a href="https://www.thebalancemoney.com/average-salary-information-for-us-workers-2060808" rel="noopener noreferrer"&gt;median income in the US&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://blog.apify.com/content/images/2022/11/most_popular_languages.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F45rebdmmmlhmp67vpy9y.png" alt="Comparison of JavaScript, HTML and CSS, SQL, Python popularity" width="800" height="205"&gt;&lt;/a&gt;JavaScript is the most popular technology, surpassing giants such as HTML, SQL, and even Python.&lt;/p&gt;

&lt;h2&gt;
  
  
  Can coding be a side gig?
&lt;/h2&gt;

&lt;p&gt;If you’re in school or just want to make some extra money on the side, there are many ways to monetize your coding skills. You can freelance and take on smaller projects, work part-time, or publish your code on platforms such as Apify. With Apify, you can &lt;a href="https://get.apify.com/monetize-your-code" rel="noopener noreferrer"&gt;monetize your code&lt;/a&gt; by creating a web scraping or automation service and offering it in &lt;a href="https://apify.com/store" rel="noopener noreferrer"&gt;Apify Store&lt;/a&gt;, charging your users a monthly subscription fee for access to it. You can also join our rapidly growing &lt;a href="https://docs.google.com/document/d/1htQtQFO97e5uG16-TU1K5sgNVVXcaD6YL1WETNOoSyk/edit#%20What%20are%20the%20highest%20paying%20programming%20languages?" rel="noopener noreferrer"&gt;Discord server&lt;/a&gt;, where you can not only meet and collaborate with web automation enthusiasts from around the world but also take on freelance projects offered by companies and individuals alike.&lt;/p&gt;

&lt;p&gt;So, choosing a programming language should be based not only on the average paycheck that comes with knowing it but also on its popularity and prominence in the job market. Only after you take the first step and land a job in coding will it become much easier to make decisions about the future of your career based on salary expectations.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://apify.com/ideas" rel="noopener noreferrer"&gt;💡 Have an idea of a project for our developers, or is there something youd like to work on at Apify? Let us know on our ideas page and upvote ideas that you may find useful!&lt;/a&gt;&lt;/p&gt;

</description>
      <category>developer</category>
      <category>jobs</category>
      <category>programming</category>
    </item>
  </channel>
</rss>
