hil for SerpApi

Posted on • Originally published at serpapi.com

Web scraping with cURL (fetching RAW HTML data)

Did you know you can scrape a website from your command line? With curl, you have a simple tool at your fingertips, ready to collect data from the web with minimal fuss. Let's explore how powerful curl is for web scraping!

Warning: In web scraping, cURL can only retrieve the raw HTML data; it cannot parse or extract specific data from it.

What is cURL?

cURL is a command line tool and library for transferring data with URLs. Think of Postman, but without the GUI (Graphical User Interface). We'll work only with the command line / terminal instead of a clickable interface.

cURL Logo from curl.se

Keep in mind that web scraping usually involves two things:

- Getting the raw HTML data

- Parsing or extracting a specific section

cURL can only do the former. So, we need to combine it with other tools to have a fully functioning web scraping tool.

Web scraping is just one of the many uses of curl; it's not its only purpose. With curl, you can do things like downloading files, automating data collection, testing APIs, monitoring server response times, and simulating user interactions with web services, all from the command line.

Basic command for cURL

The basic syntax for writing a cURL command is:

curl [options...] <url>

Make sure to Install cURL

To try the commands below, make sure to install cURL first. Some operating systems ship curl by default. You can verify whether you already have cURL by typing curl --version.

If you see an error, you can refer to this page, which shows how to install cURL based on your operating system:

- Install cURL on Linux

- Install cURL on MacOS

- Install cURL on Windows

12 Basic cURL commands

To warm up, here are some basic cURL commands with different options and their explanations:

  • Command: curl http://example.com
    • Explanation: Fetches the content of a webpage. (No options example)
  • Command: curl -o filename.html http://example.com
    • Explanation: Downloads the content of a webpage to a specified file.
  • Command: curl -I http://example.com
    • Explanation: Retrieves only the HTTP headers from a URL.
  • Command: curl -L http://example.com
    • Explanation: Follows HTTP redirects, which is useful for capturing the final destination of a URL with multiple redirects.
  • Command: curl -u username:password http://example.com
    • Explanation: Performs a request with HTTP authentication.
  • Command: curl -x http://proxyserver:port http://example.com
    • Explanation: Uses a proxy server for the request.
  • Command: curl -d "param1=value1&param2=value2" -X POST http://example.com
    • Explanation: Sends a POST request with data to the server.
  • Command: curl -H "X-Custom-Header: value" http://example.com
    • Explanation: Adds a custom header to the request.
  • Command: curl -s http://example.com
    • Explanation: Makes curl run in silent mode, suppressing the progress meter and error messages.
  • Command: curl -X PUT -T file.txt http://example.com
    • Explanation: Uploads a file to the server using PUT method.
  • Command: curl -A "User-Agent-String" http://example.com
    • Explanation: Simulates a user agent by sending a custom User-Agent header.
  • Command: curl --json '{"tool": "curl"}' https://example.com/
    • Explanation: Sends JSON data with the request; --json also sets the JSON Content-Type and Accept headers.

Use curl --help to see more options.

These commands demonstrate some of the basic functionalities of curl for interacting with web servers, APIs, and handling different HTTP methods and data types.
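
These options can also be combined in a single command. For instance, here's a small sketch (example.com is just a placeholder URL) that follows redirects, stays silent, sends a custom User-Agent, and saves the result to a file:

# Follow redirects (-L), run silently (-s), send a custom User-Agent (-A),
# and write the response body to a file (-o), all in one request
curl -s -L -A "Mozilla/5.0" -o page.html http://example.com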

How to use cURL for web scraping (retrieving raw HTML data)

Here are command examples of using cURL specifically for web scraping tasks.

Fetch the HTML Content of a Web Page

To get the HTML content of a web page, use:

curl http://example.com

Save the HTML Content to a File

Instead of just displaying the content, you can save it to a file:

curl -o filename.html http://example.com

After saving the HTML to a file, you can continue working with it in any programming language, since the raw content you want to scrape is already there. You can now load this file without sending another HTTP request.
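
For example, here is a minimal sketch (filename.html and the pattern are placeholders) that re-reads the saved file locally instead of hitting the server again:

# Extract the page title from the saved file -- no new HTTP request is sent
grep -o '<title>.*</title>' filename.html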

Scrape Specific Data from the HTML

curl itself doesn't parse HTML. You’ll need to use it in combination with other command-line tools like grep, awk, or sed to extract specific data. For instance:

curl http://example.com | grep -o '<h1.*</h1>'

We're using the grep command-line tool to search for and return matching text using a regular expression (RegEx). In this example, we're extracting the content between the h1 tags.
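
Similarly, sed can strip the tags away and keep only the text inside them. This is a rough sketch; real-world HTML usually calls for a proper parser:

# Fetch the page, grab the <h1> element, then remove the tags with sed
curl -s http://example.com | grep -o '<h1>.*</h1>' | sed -e 's/<[^>]*>//g'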

Send POST Requests

If the data you need is behind a form, you might need to send a POST request:

curl -d "param1=value1&param2=value2" -X POST http://example.com/form

This method can also be used to register or log in to an account when the form is using the POST request method.
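
For example, here is a hedged sketch of a login flow (the URL, field names, and credentials are placeholders): store the session cookie the server returns with -c, then reuse it for later requests with -b.

# Log in via POST and save the session cookie to cookies.txt
curl -c cookies.txt -d "username=myuser&password=mypass" -X POST http://example.com/login

# Reuse the stored cookie for an authenticated request
curl -b cookies.txt http://example.com/account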

Handle pagination or multiple pages

Your data will likely be spread across multiple pages, so you must loop over the page numbers and substitute them into the URL (assuming this is a server-rendered website).

For example, we want to request three different pages with this structure:

https://books.toscrape.com/catalogue/page-1.html

https://books.toscrape.com/catalogue/page-2.html

https://books.toscrape.com/catalogue/page-3.html

Notice that the page number changes for each URL.

This is what the bash code looks like:

#!/bin/bash

# Base URL for the book catalogue
base_url="https://books.toscrape.com/catalogue/page-"

# Loop through the first three pages
for i in {1..3}; do
  # Construct the full URL
  url="${base_url}${i}.html"

  # Use curl to fetch the content and save it to a file
  curl -o "page-${i}.html" "$url"

  # Wait for a second to be polite and not overload the server
  sleep 1
done


We're using sleep here to respect the website we're scraping. We don't want to send too many requests at once.

Avoid getting blocked upon web scraping using cURL

You can set the user agent in cURL with the -A option.

curl -A "Mozilla/5.0" https://www.google.com


Take a look at the valid user agent list here.
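
If you want to vary the User-Agent between requests, a minimal bash sketch might look like this (the strings below are illustrative; pick real values from the list above):

#!/bin/bash

# A small pool of User-Agent strings (illustrative placeholder values)
agents=(
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)"
  "Mozilla/5.0 (X11; Linux x86_64)"
)

# Pick a random User-Agent for this request
ua="${agents[RANDOM % ${#agents[@]}]}"
curl -A "$ua" https://www.google.com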

Rotating a proxy with cURL

You can route a request through a proxy server, so it appears to come from that server's IP address, with the -x option.

curl -x http://proxyserver:port http://example.com

You might need to subscribe to a proxy provider to get several proxies you can rotate.
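
Here's a minimal sketch of rotating through a list of proxies (the proxy addresses are placeholders you would get from your provider):

#!/bin/bash

# Placeholder proxies from your provider
proxies=(
  "http://proxy1.example.com:8080"
  "http://proxy2.example.com:8080"
  "http://proxy3.example.com:8080"
)

# Send each request through a different proxy
for i in {1..3}; do
  proxy="${proxies[(i - 1) % ${#proxies[@]}]}"
  curl -x "$proxy" -o "page-${i}.html" "https://books.toscrape.com/catalogue/page-${i}.html"
  sleep 1
done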

Using custom headers

Some websites may also require extra information to be sent in the headers alongside the request. Things like a cookie or a referrer should be attached as headers in this situation.

curl -H "Cookie: key1=value1" -H "Referer: https://example.com" https://example.com

Alternatively, you can use -b for the cookie

curl -b "cookie_name=cookie_value" https://example.com

and -e for the referer header.

curl -e 'http://example.com' 'http://targetwebsite.com'

The long form --referer is also valid. In the sample above, example.com is the referer URL and targetwebsite.com is the target URL.
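
Putting these together, a typical scraping request might attach a User-Agent, a cookie, and a referer all at once. A sketch with placeholder values (the cookie, URLs, and output file are made up for illustration):

# One request with a custom User-Agent, a cookie, and a referer header
curl -A "Mozilla/5.0" \
     -b "session_id=abc123" \
     -e "https://example.com" \
     -o page.html \
     "http://targetwebsite.com/some-page"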

Why not use cURL for web scraping?

After seeing the power of cURL, we still need to consider its limitations for web scraping.

  1. No JavaScript Execution: curl cannot execute JavaScript. If a website relies on JavaScript to load content dynamically, curl will not be able to access that content. Solution: Use a headless browser like Puppeteer, Selenium, or Playwright. Learn how to scrape a dynamic website using Puppeteer and NodeJS.
  2. Not a Parsing Tool: curl is not designed to parse HTML or extract specific data from a response; it simply retrieves the raw data. You need to use it in conjunction with other tools or languages that can parse HTML. Solution: You can use AI by OpenAI to parse the HTML data after receiving the raw HTML from cURL.
  3. Limited Debugging Features: Unlike tools with graphical interfaces, curl has limited debugging capabilities. Understanding errors may require a good grasp of HTTP and the command line. Solution: Use a library from a programming language, for example, requests for Python or Cheerio for JavaScript.
  4. No Interaction with Web Pages: curl cannot interact with web pages, fill out forms, or simulate clicks, which limits its scraping capabilities for more dynamic sites. Solution: Use a headless browser, as in the solution to the first problem.

cURL vs Postman

cURL is still a powerful tool if we want to debug, test, or quickly retrieve information from a URL. You might wonder why you should use cURL instead of a tool like Postman.

Using cURL for web scraping instead of Postman

Using curl instead of a graphical user interface (GUI) tool like Postman has several benefits, especially for those comfortable with the command line:

  1. Simplicity: cURL's interface is simple and easy to use.
  2. Speed and Efficiency: curl is a command-line utility, which means it can be much faster to execute because you can run it with a simple line of code instead of navigating through a GUI.
  3. Scriptability: curl can easily be scripted and integrated into larger shell scripts or automation workflows. This allows for repetitive tasks to be automated, saving time and reducing the potential for human error.
  4. Resource-Friendly: curl typically uses fewer system resources than a GUI application, which can be an important consideration when working on a system with limited resources.
  5. Versatility: curl supports a wide range of protocols beyond HTTP and HTTPS, including FTP, FTPS, SCP, SFTP, and more, which can be very useful in various scenarios.
  6. Availability: curl is often pre-installed on many Unix-like operating systems (and is also available on Windows, starting with Windows 10), making it readily available for use without the need for additional downloads.

Can cURL be used in a programming language?

Yes, curl can be used within various programming languages, typically through libraries or bindings that expose curl's functionality, or by invoking the curl binary as a subprocess (for example, with Java's ProcessBuilder). This allows programmers to make HTTP requests, interact with APIs, and perform web scraping within their code. For example:

  1. CURL in Python: Python provides several libraries to use curl, with pycurl being the most direct wrapper around the libcurl library. It offers Python bindings for libcurl and gives access to almost all curl capabilities. Additionally, Python's requests library, while not a direct binding to curl, provides similar functionality in a more Pythonic way. Read: How to use curl in Python and its alternatives for more.
  2. CURL in PHP: PHP has a built-in module for curl, typically referred to as cURL or PHP-CURL. This module allows PHP scripts to make requests to servers, download files, and process HTTP responses.
  3. CURL in Node.js: In Node.js, while there are native HTTP modules, you can also use node-libcurl, which is a binding to libcurl. This allows the use of curl functionalities in a Node.js environment. Read: How to use cURL in Javascript (Nodejs) for more.
  4. CURL in Ruby: Ruby also has a curl-like library known as Curb which provides bindings to libcurl. This allows Ruby scripts to utilize curl's capabilities for making HTTP requests.
  5. CURL in Java: While Java doesn't have a direct curl library, tools like Apache HttpClient and OkHttp offer similar functionalities.

FAQ

Can cURL handle Javascript-rendered websites?

Unfortunately, no. cURL is designed to transfer data with URLs; it doesn't execute JavaScript. To scrape content from a JavaScript-heavy website, you would typically use tools that can render JavaScript, like headless browsers (e.g., Puppeteer, Selenium, Playwright).

Where can I learn more about cURL?

cURL has a great resource at https://everything.curl.dev/

Can cURL be used for web scraping?

Yes and no. cURL can do half of the web scraping job, which is retrieving the raw HTML data. The other half, parsing the data, can't be achieved by cURL alone.
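
As a rough end-to-end sketch of that split (curl fetches, other tools parse), here is one way to pull the book titles out of the example catalogue page used earlier. The title="..." pattern is an assumption about how that page marks up its titles, and grep/sed are a crude stand-in for a real HTML parser:

# Fetch the page, keep only the title="..." attributes, then strip the attribute syntax
curl -s "https://books.toscrape.com/catalogue/page-1.html" \
  | grep -o 'title="[^"]*"' \
  | sed -e 's/^title="//' -e 's/"$//'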
