Do you know you can scrape a website from your command line? With curl, you have a simple tool at your fingertips, ready to collect data from the web with minimal fuss. Let's explore how powerful curl is for web scraping!
Warning: In web scraping, cURL only retrieves the raw HTML. It does not parse the HTML or extract specific data from it.
What is cURL?
cURL is a command-line tool and library for transferring data with URLs. Think of Postman, but without the GUI (Graphical User Interface). We'll work only in the command line / terminal instead of a clickable interface.
Keep in mind that web scraping usually involves two things:
- Getting the raw HTML data
- Parsing or extracting a specific section
cURL can only do the former. So, we need to combine it with other tools to have a fully functioning web scraping tool.
Web scraping is just one of the many uses of curl. With curl, you can also download files, automate data collection, test APIs, monitor server response times, and simulate user interactions with web services, all from the command line.
Basic command for cURL
The basic syntax for writing a cURL command is:
curl [options...] <url>
Make sure to Install cURL
To try the commands below, make sure cURL is installed first. Some operating systems ship curl by default. You can check whether you already have cURL by typing curl --version.
If you see an error, refer to the page for your operating system to install cURL:
- Install cURL on Linux
- Install cURL on MacOS
- Install cURL on Windows
12 Basic cURL commands
To warm up, here are basic cURL commands / syntax with different options and their explanations:
- Command: curl http://example.com
- Explanation: Fetches the content of a webpage. (No options example)
 
- Command: curl -o filename.html http://example.com
- Explanation: Downloads the content of a webpage to a specified file.
 
- Command: curl -I http://example.com
- Explanation: Retrieves only the HTTP headers from a URL.
 
- Command: curl -L http://example.com
- Explanation: Follows HTTP redirects, which is useful for capturing the final destination of a URL with multiple redirects.
 
- Command: curl -u username:password http://example.com
- Explanation: Performs a request with HTTP authentication.
 
- Command: curl -x http://proxyserver:port http://example.com
- Explanation: Uses a proxy server for the request.
 
- Command: curl -d "param1=value1&param2=value2" -X POST http://example.com
- Explanation: Sends a POST request with data to the server.
 
- Command: curl -H "X-Custom-Header: value" http://example.com
- Explanation: Adds a custom header to the request.
 
- Command: curl -s http://example.com
- Explanation: Makes curl run in silent mode, suppressing the progress meter and error messages.
 
- Command: curl -X PUT -T file.txt http://example.com
- Explanation: Uploads a file to the server using PUT method.
 
- Command: curl -A "User-Agent-String" http://example.com
- Explanation: Simulates a user agent by sending a custom User-Agent header.
 
- Command: curl --json '{"tool": "curl"}' https://example.com/
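- Explanation: Sends JSON data in a POST request and sets the Content-Type and Accept headers to application/json. The --json option requires curl 7.82.0 or later.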
 
Use curl --help to see the most common options, or curl --help all for the full list.
These commands demonstrate some of the basic functionalities of curl for interacting with web servers, APIs, and handling different HTTP methods and data types.
How to use cURL for web scraping (retrieving raw HTML data)
Here are command examples of using cURL specifically for web scraping tasks.
Fetch the HTML Content of a Web Page
To get the HTML content of a web page, use:
curl http://example.com
Save the HTML Content to a File
Instead of just displaying the content, you can save it to a file:
curl -o filename.html http://example.com
After saving the HTML file, you can continue to work with this HTML file with any programming language since the raw content you want to scrape is already in this file. Now, you can load this file without sending another HTTP request.
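One way to make sure you really send only one request is to wrap the download in a small helper that skips curl when a local copy already exists. This is a minimal sketch, not a curl feature: fetch_cached is a made-up function name, and treating any non-empty file as a valid copy is an assumption you may want to tighten.

```shell
#!/bin/bash
# Download a URL to a file only when no non-empty local copy exists yet.
# fetch_cached is a hypothetical helper name used for illustration.
fetch_cached() {
  local url="$1" file="$2"
  if [ -s "$file" ]; then
    echo "cached: $file"
    return 0
  fi
  # -sS    hide the progress meter but still show error messages
  # --fail exit with a non-zero status on HTTP errors (4xx/5xx)
  # -L     follow redirects
  curl -sS --fail -L -o "$file" "$url" && echo "downloaded: $file"
}

# Usage: fetch_cached "http://example.com" "filename.html"
```

Running the usage line twice should hit the network only the first time; the second call reuses filename.html.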
Scrape Specific Data from the HTML
curl itself doesn't parse HTML. You’ll need to use it in combination with other command-line tools like grep, awk, or sed to extract specific data. For instance:
curl http://example.com | grep -o '<h1.*</h1>'
We're using the grep command-line tool to search for and return matching text using a regular expression (RegEx). In this example, we're fetching the h1 elements, tags included.
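The pattern above is greedy and returns the tags along with the text. If you only want the heading text, a slightly stricter pattern plus sed can strip the tags. The sketch below runs on a hard-coded sample string (an assumption standing in for a real response) so you can try it without a network connection; extract_h1 is a hypothetical helper name.

```shell
#!/bin/bash
# Sample HTML standing in for a page fetched with curl (made-up content).
html='<html><body><h1 class="title">Example Domain</h1><p>Some text</p></body></html>'

# Print the text inside each <h1> element read from stdin.
extract_h1() {
  # [^<]* stops at the next tag, so two headings on one line are matched separately.
  # The sed step then removes the opening and closing tags.
  grep -o '<h1[^>]*>[^<]*</h1>' | sed -e 's/<[^>]*>//g'
}

heading=$(printf '%s\n' "$html" | extract_h1)
echo "$heading"
```

Against a live page the same helper can sit at the end of the pipe: curl -s http://example.com | extract_h1. Keep in mind that regular expressions only cope with simple, predictable markup; headings that contain nested tags are skipped by this pattern.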
Send POST Requests
If the data you need is behind a form, you might need to send a POST request:
curl -d "param1=value1&param2=value2" -X POST http://example.com/form
This method can also be used to register or log in to an account when the form is using the POST request method.
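Note that -d sends the string exactly as written, so values containing spaces or characters like & need to be URL-encoded first. curl can do that for you with --data-urlencode. Here is a small sketch; the form URL and the field name query are placeholders, and submit_search is a made-up function name.

```shell
#!/bin/bash
# Submit a form with one field, letting curl URL-encode the value.
# The URL and the field name "query" are placeholders for illustration.
submit_search() {
  local query="$1"
  # --data-urlencode encodes the value part and, like -d, makes the request a POST
  curl -sS --data-urlencode "query=${query}" "http://example.com/form"
}

# Usage: submit_search "books & coffee"
```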
Handle pagination or multiple pages
Your data will likely be spread across multiple pages. You must loop over the page numbers and substitute them into the URL (assuming this is a server-rendered website).
For example, we want to request three different pages with this structure:
https://books.toscrape.com/catalogue/page-1.html
https://books.toscrape.com/catalogue/page-2.html
https://books.toscrape.com/catalogue/page-3.html
Notice that only the page number changes between the URLs.
This is what the bash code looks like:
#!/bin/bash
# Base URL for the book catalogue
base_url="https://books.toscrape.com/catalogue/page-"
# Loop through the first three pages
for i in {1..3}; do
  # Construct the full URL
  url="${base_url}${i}.html"
  # Use curl to fetch the content and save it to a file
  curl -o "page-${i}.html" "$url"
  # Wait for a second to be polite and not overload the server
  sleep 1
done
We're using sleep here to respect the website we're scraping. We don't want to send too many requests at once.
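If you don't know how many pages exist, a variation is to keep requesting pages until one fails. The sketch below assumes the site answers with an HTTP error (such as 404) once you go past the last page, which you should verify for your target; page_url and scrape_all_pages are made-up helper names.

```shell
#!/bin/bash
base_url="https://books.toscrape.com/catalogue/page-"

# Build the URL for a given page number.
page_url() {
  printf '%s%s.html\n' "$base_url" "$1"
}

# Fetch page 1, 2, 3, ... until a request fails or max_pages is reached.
scrape_all_pages() {
  local max_pages="${1:-100}"
  local i=1
  while [ "$i" -le "$max_pages" ]; do
    # --fail makes curl exit non-zero on HTTP errors such as 404,
    # which this sketch treats as "no more pages"
    curl -sS --fail -o "page-${i}.html" "$(page_url "$i")" || break
    # Stay polite between requests
    sleep 1
    i=$((i + 1))
  done
  echo "fetched $((i - 1)) pages"
}
```

To run it, call scrape_all_pages at the end of the script, optionally with an upper limit such as scrape_all_pages 10. The limit is a safety net in case the site never returns an error.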
Avoid getting blocked upon web scraping using cURL
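You can set the User-Agent header with the -A (or --user-agent) option. Some websites block curl's default user agent, so sending a browser-like value can help.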
curl -A "Mozilla/5.0" https://www.google.com
Take a look at the valid user agent list here.
Rotating a proxy with cURL
You can route a request through a proxy server, so that it appears to come from the proxy's IP address, with the -x (or --proxy) option.
curl -x http://proxyserver:port http://example.com
You might need to subscribe to a proxy provider to get several proxies you can rotate.
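Once you have a few proxies, rotating them can be as simple as cycling through a list. The following is a minimal round-robin sketch. The proxy addresses are invented placeholders, and pick_proxy and fetch_via_proxy are hypothetical helper names; a real provider may hand out endpoints or credentials in a different form.

```shell
#!/bin/bash
# Placeholder proxy endpoints, one per line; replace with your provider's.
proxies='http://proxy1.example.com:8080
http://proxy2.example.com:8080
http://proxy3.example.com:8080'

# Print the proxy for request number n (1-based), cycling through the list.
pick_proxy() {
  local n="$1" count
  count=$(printf '%s\n' "$proxies" | wc -l)
  # (n - 1) % count + 1 maps 1, 2, 3, 4, ... onto lines 1, 2, 3, 1, ...
  printf '%s\n' "$proxies" | sed -n "$(( (n - 1) % count + 1 ))p"
}

# Fetch a URL through the proxy chosen for request number n.
fetch_via_proxy() {
  local n="$1" url="$2"
  curl -sS -x "$(pick_proxy "$n")" "$url"
}

# Usage: fetch_via_proxy 1 "http://example.com"
```

Inside a pagination loop you could pass the loop counter as n, so that each page goes out through the next proxy in the list.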
Using custom headers
Some websites may also require extra information in the request headers. Values such as a cookie or the referrer (spelled Referer in HTTP) can be attached with the -H option.
curl -H "Cookie: key1=value1" -H "Referer: https://example.com" https://example.com
Alternatively, use the -b option for the cookie:
curl -b "cookie_name=cookie_value" https://example.com
and the -e option for the Referer header:
curl -e 'http://example.com' 'http://targetwebsite.com'
The long form --referer is also valid. In the sample above, example.com is the referer URL and targetwebsite.com is the target URL.
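When the cookie comes from the server itself, for example after a login, you don't have to copy it by hand. curl can save cookies to a file with -c and send them back with -b. The sketch below assumes a site with a POST login form; the URLs and the field names username and password are placeholders, and login and fetch_with_session are made-up function names.

```shell
#!/bin/bash
jar="cookies.txt"   # file where curl stores cookies between requests

# Log in once; -c writes any cookies the server sets into the jar.
# The login URL and the form field names are placeholders.
login() {
  curl -sS -c "$jar" \
    --data-urlencode "username=$1" \
    --data-urlencode "password=$2" \
    "https://example.com/login"
}

# Request a page with the saved cookies (-b) and a Referer header (-e).
fetch_with_session() {
  curl -sS -b "$jar" -e "https://example.com/" "$1"
}

# Usage:
#   login "my_user" "my_password"
#   fetch_with_session "https://example.com/account"
```

When the value given to -b contains no = sign, curl treats it as the name of a file to read cookies from, which is what makes the second function work.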
Why not use cURL for web scraping?
After seeing the power of cURL, we still need to consider when it is not the right tool for the job.
- No JavaScript Execution: curl cannot execute JavaScript. If a website relies on JavaScript to load content dynamically, curl will not be able to access that content. Solution: You need to use a headless browser like Puppeteer, Selenium or Playwright. Learn how to scrape a dynamic website using Puppeteer and NodeJS.
- Not a Parsing Tool: curl is not designed to parse HTML or extract specific data from a response; it simply retrieves the raw data. You need to use it in conjunction with other tools or languages that can parse HTML. Solution: you can use AI by OpenAI to parse the HTML data after receiving the raw HTML from cURL.
- Limited Debugging Features: Unlike tools with graphical interfaces, curl has limited debugging capabilities. Understanding errors may require a good grasp of HTTP and the command line. Solution: Use a library from any programming language, for example requests for Python or Cheerio for JavaScript.
- No Interaction with Web Pages: curl cannot interact with web pages, fill out forms, or simulate clicks, which limits its scraping capabilities for more dynamic sites. Solution: Using a headless browser, like the solution to the first problem.
cURL vs Postman
cURL is still a powerful tool if we want to debug, test, or quickly retrieve information from a URL. You might wonder why you should use cURL instead of a tool like Postman.
Using cURL for web scraping instead of Postman.
Using curl instead of a graphical user interface (GUI) tool like Postman has several benefits, especially for those comfortable with the command line:
- Simplicity: The cURL command-line interface is simple and easy to use.
- Speed and Efficiency: curl is a command-line utility, which means it can be much faster to use because you can run it with a single line instead of navigating through a GUI.
- Scriptability: curl can easily be scripted and integrated into larger shell scripts or automation workflows. This allows repetitive tasks to be automated, saving time and reducing the potential for human error.
- Resource-Friendly: curl typically uses fewer system resources than a GUI application, which can be an important consideration when working on a system with limited resources.
- Versatility: curl supports a wide range of protocols beyond HTTP and HTTPS, including FTP, FTPS, SCP, SFTP, and more, which can be very useful in various scenarios.
- Availability: curl is pre-installed on many Unix-like operating systems (and on Windows since Windows 10), making it readily available without additional downloads.
Can cURL be used in a programming language?
Yes, curl can be used within various programming languages, typically through libraries or bindings that provide an interface to libcurl, or by running the curl binary as a subprocess (for example with ProcessBuilder in Java). This allows programmers to make HTTP requests, interact with APIs, and perform web scraping within their code. For example:
- CURL in Python: Python provides several libraries to use curl, with pycurl being the most direct wrapper around the libcurl library. It offers Python bindings for libcurl and gives access to almost all curl capabilities. Additionally, Python's requests library, while not a direct binding to curl, provides similar functionality in a more Pythonic way. Read: How to use curl in Python and its alternative for more.
- CURL in PHP: PHP has a built-in module for curl, typically referred to as cURL or PHP-CURL. This module allows PHP scripts to make requests to servers, download files, and process HTTP responses.
- CURL in Node.js: In Node.js, while there are native HTTP modules, you can also use node-libcurl, which is a binding to libcurl. This allows the use of curl functionalities in a Node.js environment. Read: How to use cURL in Javascript (Nodejs) for more.
- CURL in Ruby: Ruby also has a curl-like library known as Curb which provides bindings to libcurl. This allows Ruby scripts to utilize curl's capabilities for making HTTP requests.
- CURL in Java: While Java doesn't have a direct curl library, tools like Apache HttpClient and OkHttp offer similar functionalities.
FAQ
Can cURL handle Javascript-rendered websites?
Unfortunately not. cURL is designed to transfer data with URLs; it doesn't execute JavaScript. To scrape content from a JavaScript-heavy website, you would typically use tools that can render JavaScript, like headless browsers (e.g., Puppeteer, Selenium, Playwright).
Where can I learn more about cURL?
cURL has a great resource at https://everything.curl.dev/
Can cURL be used for web scraping?
Yes and no. cURL can do half of the web scraping job, which is retrieving the raw HTML data. The other half, which is parsing the data, can't be achieved by cURL alone.