Aurken B.

Posted on Mar 15, 2021

Mastering Web Scraping 101: In-Depth Guide

#datascience #serverless #cloud #webdev

What is Web Scraping?

Web Scraping is the process of automatically collecting web data with specialized software.

An enormous amount of data is created every day, making it impossible to keep track of every new data point. At the same time, more and more companies worldwide rely on varied data sources to build knowledge and gain a competitive advantage. It's not possible to keep up with the pace manually.

That's where Web Scraping comes into play.

What is Web Scraping Used For?

As communication between systems becomes critical, APIs are increasing in popularity. An API is a gateway a website exposes to communicate with other systems, opening up functionality to the public. Unfortunately, many services don't provide an API. Others only allow limited functionality.

Web Scraping overcomes this limitation. It collects information all around the internet without the restrictions of an API.

Therefore, web scraping is used in a variety of scenarios:

Price Monitoring

E-commerce: tracking competition prices and availability.
Stocks and financial services: detect price changes, volume activity, anomalies, etc.

Lead Generation

Extract contact information: names, email addresses, phones, or job titles.
Identify new opportunities, e.g., in Yelp, YellowPages, Crunchbase, etc.

Market Research

Real Estate: supply/demand analysis, market opportunities, trending areas, price variation.
Automotive/Cars: dealers distribution, most popular models, best deals, supply by city.
Travel and Accommodation: available rooms, hottest areas, best discounts, prices by season.
Job Postings: most demanded jobs. Industries on the rise. Biggest employers. Supply by sector, etc.
Social Media: brand presence and growing influencers tracking. New acquisition channels, audience targeting, etc.
City Discovery: track new restaurants, commercial streets, shops, trending areas, etc.

Aggregation

News from many sources.
Compare prices between, e.g., insurance services, traveling, lawyers.
Banking: organize all information into one place.

Inventory and Product Tracking

Collect product details and specs.
New products.

SEO (Search Engine Optimization)

Keywords' relevance and performance. Competition tracking, brand relevance, new players' rank.

ML/AI - Data Science

Collect massive amounts of data to train machine learning models: image recognition, predictive modeling, NLP.

Bulk Downloads

PDFs or massive Image extraction at scale.

Web Scraping Process

Web Scraping works mainly as a standard HTTP client-server communication.

The browser (client) connects to a website (server) and requests the content. The server then returns HTML content, a markup language both sides understand. The browser is responsible for rendering HTML to a graphical interface.
That's it. Easy, isn't it?

There are more content types, but let's focus on this one for now. Let's dig deeper into how the underlying communication works - it'll come in handy later on.

Request - made by the browser

A request is a text the browser sends to the website. It consists of four elements:

URL: the specific address on the website.
Method: there are two main types: GET to retrieve data and POST to submit data (usually forms).
Headers: User-Agent, Cookies, Browser Language, all go here. It is one of the most important and tricky parts of communication. Websites strongly focus on this data to determine whether a request comes from a human or a bot.
Body: commonly user-generated input. Used when submitting forms.
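To make the four elements concrete, here is a minimal sketch that builds (but does not send) a request with Python's standard library. The URL, form fields, and header values are made up for illustration; they are not taken from any real website.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical target and form data, used only for illustration.
url = "https://example.com/search"
form = {"query": "laptops", "page": "1"}

request = Request(
    url=url,                               # URL: the specific address
    method="POST",                         # Method: POST because we submit data
    headers={                              # Headers: what the client claims to be
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64)",
        "Accept-Language": "en-US,en;q=0.9",
        "Cookie": "session=abc123",
    },
    data=urlencode(form).encode("utf-8"),  # Body: the encoded form fields
)

# Inspect the four elements before anything goes over the network.
print(request.full_url)
print(request.get_method())
print(request.header_items())
print(request.data)
```

Nothing is transmitted here; passing the object to `urllib.request.urlopen` would actually send it.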

Response - returned by the server

When a website responds to a browser, it returns three items.

HTTP Code: a number indicating the status of the request. 200 means everything went OK. The infamous 404 means URL not found. 500 is an internal server error. You can learn more about HTTP codes.
Content: the HTML that describes the website. Auxiliary content types include CSS styles (appearance), images, XML, JSON, or PDF. They improve the user experience.
Headers: just like Request Headers, these play a crucial role in communication. Among other things, they instruct the browser to store cookies through the "Set-Cookie" header. We will get back to that later.

Up to this point, this reflects an ordinary client-server process. Web Scraping, though, adds a new concept: data extraction.
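The three response items can be seen in a small sketch that parses a hand-written HTTP response with Python's standard library. The raw response and the `FakeSocket` helper are assumptions made for this example so that it runs without a network; a real scraper would receive the bytes from the server.

```python
import io
from http.client import HTTPResponse


class FakeSocket:
    """Hypothetical stand-in for a network socket, so no connection is needed."""

    def __init__(self, data: bytes):
        self._file = io.BytesIO(data)

    def makefile(self, *args, **kwargs):
        # HTTPResponse reads the response through the file object returned here.
        return self._file


body = b"<html><body><h1>Hello</h1></body></html>"
raw = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/html; charset=utf-8\r\n"
    b"Set-Cookie: session=abc123; Path=/\r\n"
    b"Content-Length: " + str(len(body)).encode("ascii") + b"\r\n"
    b"\r\n"
) + body

response = HTTPResponse(FakeSocket(raw))
response.begin()  # parse the status line and the headers

status = response.status                     # HTTP code
cookie = response.getheader("Set-Cookie")    # one of the response headers
content = response.read()                    # the content (HTML here)

print(status, response.reason)
print(cookie)
print(content.decode("utf-8"))
```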

Data Extraction - Parsing

HTML is just a long text. Once we have the HTML, we want to obtain specific data and structure it to make it usable. Parsing is the process of extracting selected data and organizing it into a well-defined structure.

Technically, HTML is a tree structure. Upper elements (nodes) are parents, and the lower are children. Two popular technologies facilitate walking the tree to extract the most relevant pieces:

CSS Selectors: widely used to style websites. Powerful and easy to use.
XPath: more powerful than CSS selectors but harder to use, so less suited for beginners.

The extraction process begins by analyzing a website. Some elements are valuable at first sight. For example, Title, Price, or Description are all easily visible on the screen. Other information, though, is only visible in the HTML code:

Hidden inputs: they commonly contain information such as internal IDs that are pretty valuable.
XHR: websites execute requests in the background to enhance user experience. The responses often carry rich content already structured in JSON format.
JSON inside HTML: JSON is a commonly used data-interchange format. It is often embedded in the HTML code to serve other services, like Analytics or Marketing.
HTML attributes: add semantic meaning to other HTML elements.
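Here is a rough sketch of walking the tree and pulling out both visible and hidden data. It uses the limited XPath subset in Python's standard `xml.etree.ElementTree`, which only works because the sample page below is invented and well-formed; real-world HTML is usually messier and needs a lenient HTML parser. XHR responses are not covered here.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page invented for this example.
PAGE = """
<html>
  <body>
    <div class="product" data-sku="SKU-42">
      <h1>Blue Widget</h1>
      <span class="price">19.99</span>
      <input type="hidden" name="product_id" value="98765" />
    </div>
    <script type="application/ld+json">{"name": "Blue Widget", "inStock": true}</script>
  </body>
</html>
"""

root = ET.fromstring(PAGE.strip())

# Walk down the tree to the parent node we care about.
product = root.find(".//div[@class='product']")

record = {
    # Visible on screen
    "title": product.find("h1").text,
    "price": float(product.find("span[@class='price']").text),
    # HTML attribute
    "sku": product.get("data-sku"),
    # Hidden input
    "product_id": product.find("input[@type='hidden']").get("value"),
    # JSON inside HTML
    "metadata": json.loads(root.find(".//script[@type='application/ld+json']").text),
}

print(record)
```

The result is a plain dictionary, which is the "well-defined structure" that parsing aims for.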

Once data is structured, databases store it for later use. At this stage, we can export it to other formats such as Excel or PDF, or transform it to make it available to other systems.
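As a small illustration of the store-then-export step, the sketch below saves a couple of made-up records in an in-memory SQLite database and exports them as CSV, which spreadsheet tools such as Excel can open. The table name, columns, and records are assumptions for this example only.

```python
import csv
import io
import sqlite3

# Hypothetical parsed records, as they might come out of the extraction step.
records = [
    {"sku": "SKU-42", "title": "Blue Widget", "price": 19.99},
    {"sku": "SKU-43", "title": "Red Widget", "price": 24.50},
]

# Store: an in-memory database keeps the example self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (sku TEXT PRIMARY KEY, title TEXT, price REAL)")
conn.executemany(
    "INSERT INTO products (sku, title, price) VALUES (:sku, :title, :price)",
    records,
)
conn.commit()

# Export: read the rows back and write them out as CSV text.
rows = conn.execute("SELECT sku, title, price FROM products ORDER BY sku").fetchall()

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["sku", "title", "price"])  # header row
writer.writerows(rows)
csv_text = buffer.getvalue()

print(csv_text)
conn.close()
```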

Web Scraping Challenges

Such a valuable process does not come free of obstacles, though.

Websites actively avoid being tracked/scraped. It's common for them to build protective solutions. High traffic websites put advanced industry-level anti-scraping solutions into place. This protection makes the task extremely challenging.

These are some of the challenges web scrapers face when dealing with relevant websites (low traffic websites are usually low value and thus have weak anti-scraping systems):

IP Rate Limit

Every device connected to the internet has an identifying address, called an IP address. It's like an ID Card. Websites use this identifier to count the requests a device makes and block it when the number looks abnormal. Imagine an IP requesting 120 pages per minute, that is, two requests per second. Real users cannot browse at such a pace. So to scrape at scale, we need to introduce a new concept: proxies.
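To see why 120 pages per minute stands out, here is a simplified sketch of how a website might count requests per IP with a sliding window. The threshold of 60 requests per minute and the IP addresses (taken from documentation-only ranges) are assumptions; real anti-scraping systems are more elaborate.

```python
from collections import defaultdict, deque


class RateLimiter:
    """Sliding-window request counter per IP (the website's point of view)."""

    def __init__(self, max_requests: int, window_seconds: float):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # ip -> timestamps of accepted requests

    def allow(self, ip: str, now: float) -> bool:
        hits = self._hits[ip]
        # Forget requests that fell out of the time window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # too many requests in the window: block
        hits.append(now)
        return True


# Hypothetical threshold: at most 60 requests per 60 seconds per IP.
limiter = RateLimiter(max_requests=60, window_seconds=60)

# A scraper at two requests per second (120 per minute) for one minute.
bot_results = [limiter.allow("203.0.113.7", now=i * 0.5) for i in range(120)]

# A human-like visitor: one page every five seconds for one minute.
human_results = [limiter.allow("198.51.100.23", now=i * 5.0) for i in range(12)]

print("bot requests blocked:", bot_results.count(False))
print("human requests blocked:", human_results.count(False))
```

In this simulation, the fast client starts getting blocked halfway through the minute, while the slow one is never blocked.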

Rotating Proxies

A proxy, or proxy server, is a computer on the internet with its own IP address. It sits between the requester and the website. It hides the original request IP behind the proxy IP and tricks the website into thinking the request comes from another place. Proxies are typically organized in vast pools of IPs, and scrapers switch between them depending on various factors. Skilled scrapers tune this process and select proxies depending on the domain, geolocation, etc.
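A rotation like the one described can be sketched in a few lines. The proxy addresses, pool names, and the domain-to-pool rule below are all hypothetical (the IPs come from documentation-only ranges); the point is only to show round-robin selection per pool.

```python
import itertools
from urllib.parse import urlparse

# Hypothetical proxy pools grouped by geolocation.
PROXY_POOLS = {
    "us": ["http://203.0.113.10:8080", "http://203.0.113.11:8080"],
    "de": ["http://198.51.100.20:8080", "http://198.51.100.21:8080"],
}

# Hypothetical rule: which pool to use for which domain.
DOMAIN_TO_POOL = {"shop.example.de": "de"}
DEFAULT_POOL = "us"

# One endless round-robin iterator per pool.
_rotations = {name: itertools.cycle(proxies) for name, proxies in PROXY_POOLS.items()}


def next_proxy(url: str) -> str:
    """Pick the next proxy for a URL, choosing the pool by domain."""
    domain = urlparse(url).hostname
    pool = DOMAIN_TO_POOL.get(domain, DEFAULT_POOL)
    return next(_rotations[pool])


picks = [next_proxy("https://shop.example.com/p/1") for _ in range(3)]
german_pick = next_proxy("https://shop.example.de/p/1")

print(picks)
print(german_pick)
```

With the standard library, the chosen address could then be handed to `urllib.request.ProxyHandler` to route the actual request through it.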

Headers / Cookies Validation

Remember Request/Response Headers? A mismatch between the expected and resulting values tells the website something is wrong. The more headers shared between browser and server, the harder it gets for automated software to communicate smoothly without being detected. It gets increasingly challenging when websites return the "Set-Cookie" header that expects the browser to use it in the following requests.
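The "Set-Cookie" round trip can be sketched as follows: read the cookies a response sets, then send them back in the next request's "Cookie" header. The header values are invented for this example, and in practice a cookie jar or session object usually handles this automatically.

```python
from http.cookies import SimpleCookie

# Hypothetical "Set-Cookie" headers returned by a website.
set_cookie_headers = [
    "session=abc123; Path=/; HttpOnly",
    "csrf=xyz789; Path=/",
]

# Collect the cookies, as a browser would.
jar = SimpleCookie()
for header in set_cookie_headers:
    jar.load(header)

# Build the "Cookie" header the website expects in the following requests.
cookie_header = "; ".join(f"{name}={morsel.value}" for name, morsel in jar.items())

next_request_headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64)",  # made-up browser identification
    "Cookie": cookie_header,
}

print(next_request_headers)
```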

Ideally, you'd want to make requests with as few headers as possible. Unfortunately, sometimes that's not possible, leading to another challenge:

Reverse Engineering Headers / Cookies Generation

Advanced websites don't respond if Headers and Cookies are not in place, forcing us to reverse-engineer them. Reverse engineering is the process of understanding how something is built in order to simulate it. It requires tweaking IPs, User-Agent (browser identification), Cookies, etc.

JavaScript Execution

Most websites these days rely heavily on JavaScript. JavaScript is a programming language executed in the browser. It adds extra difficulty to data collection, as a lot of tools don't support JavaScript. Websites do complex calculations in JavaScript to ensure a browser is really a browser. This leads us to:

Headless Browsers

A headless browser is a web browser without a graphical user interface, controlled by software. It requires a lot of RAM and CPU, making the process way more expensive. Selenium and Puppeteer (created by Google) are two of the most used tools for the task. You guessed it: Google is the largest web scraper in the world.

Captcha / reCAPTCHA (Developed by Google)

A Captcha is a challenge test to determine whether or not the user is human. It used to be an effective way to keep bots away. Some companies like Anti-Captcha and 2Captcha offer solutions to bypass Captchas. They offer OCR (Optical Character Recognition) services and even human labor to solve the puzzles.

Pattern Recognition

When collecting data, you may feel tempted to go the easy way and follow a regular pattern. That's a huge red flag for websites. Random requests are not convincing either. How's someone supposed to land on page 8? They would almost certainly have visited page 7 first. Otherwise, it indicates that something's weird. Nailing the right path is tricky.
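One way to picture a less robotic pattern is a plan that visits pages in their natural order with irregular pauses. This is only a toy sketch: the delay range and the `browsing_plan` helper are assumptions for illustration, and real detection systems look at many more signals than timing and page order.

```python
import random


def browsing_plan(last_page, min_delay=2.0, max_delay=8.0, seed=None):
    """Return (page, delay_in_seconds) pairs: pages in order, pauses randomized."""
    rng = random.Random(seed)  # seeded only to make the example repeatable
    plan = []
    for page in range(1, last_page + 1):
        # A human reaches page 8 by going through pages 1 to 7 first,
        # and never waits exactly the same time between clicks.
        delay = round(rng.uniform(min_delay, max_delay), 2)
        plan.append((page, delay))
    return plan


plan = browsing_plan(last_page=8, seed=42)

for page, delay in plan:
    print(f"wait {delay}s, then request page {page}")
```

A scraper would sleep for each delay before requesting the corresponding page instead of just printing the plan.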

Conclusion

Hopefully, this gives you an overview of what automated data collection looks like. We could talk about it forever, but we will go deeper into the details in the coming posts.

Data collection at scale is full of secrets. Keeping up the pace is arduous and expensive. It's hard, very hard.

A preferred solution is to use batteries-included services like ZenRows that turn websites into data. We offer a hassle-free API that takes care of all the work, so you only need to care about the data. We urge you to try it for FREE. We are delighted to help and can even tailor a custom solution that works for you.

Did you find the content helpful? Spread the word and share it on Twitter, LinkedIn or Facebook.
