<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Ander Rodriguez</title>
    <description>The latest articles on DEV Community by Ander Rodriguez (@anderrv).</description>
    <link>https://dev.to/anderrv</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F645443%2F7a9ee81e-ac5b-4bea-b0c3-0d509ab60e66.jpg</url>
      <title>DEV Community: Ander Rodriguez</title>
      <link>https://dev.to/anderrv</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/anderrv"/>
    <language>en</language>
    <item>
      <title>How To Rotate Proxies in Python</title>
      <dc:creator>Ander Rodriguez</dc:creator>
      <pubDate>Wed, 08 Jun 2022 14:06:43 +0000</pubDate>
      <link>https://dev.to/anderrv/how-to-rotate-proxies-in-python-9ip</link>
      <guid>https://dev.to/anderrv/how-to-rotate-proxies-in-python-9ip</guid>
      <description>&lt;p&gt;A proxy can hide your IP, but what happens when that gets banned? You would need a new IP. Or you could maintain a list of them and rotate proxies for each request. The final option would be to use &lt;a href="https://www.zenrows.com/?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=rotate_proxies"&gt;Smart Rotating Proxies&lt;/a&gt;, more on that later.&lt;/p&gt;

&lt;p&gt;For now, we will focus on building our custom proxy rotator. We will start from a list of regular proxies, check them to mark the working ones, and provide simple monitoring to remove the failing ones from the working list. The provided examples use Python, but the idea will work in any language.&lt;/p&gt;

&lt;p&gt;Let's dive in!&lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;For the code to work, you will need &lt;a href="https://www.python.org/downloads/"&gt;python3 installed&lt;/a&gt;. Some systems have it pre-installed. After that, install all the necessary libraries by running &lt;code&gt;pip install&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;aiohttp
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Proxy List
&lt;/h2&gt;

&lt;p&gt;You might not have a proxy provider with a list of domain+ports. Do not worry, we'll see how to get one.&lt;/p&gt;

&lt;p&gt;There are several &lt;a href="https://free-proxy-list.net/"&gt;lists of free proxies&lt;/a&gt; online. For the demo, grab one of those and save its content (just the URLs) in a text file (&lt;code&gt;rotating_proxies_list.txt&lt;/code&gt;). Or use the ones below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ovyQwrC3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://cdn.zenrows.com/images/blog/export-free-proxies.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ovyQwrC3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://cdn.zenrows.com/images/blog/export-free-proxies.gif" alt="Export Free Proxies" width="500" height="281"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Free proxies aren't reliable, and the ones below probably won't work for you. They are usually short-lived.&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;167.71.230.124:8080
192.155.107.211:1080
77.238.79.111:8080
167.71.5.83:3128
195.189.123.213:3128
8.210.83.33:80
80.48.119.28:8080
152.0.209.175:8080
187.217.54.84:80
169.57.1.85:8123
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then, we will read that file and create an array with all the proxies. Read the file, strip empty spaces, and split each line. Careful when saving the file since we won't perform any sanity checks for valid &lt;code&gt;IP:port&lt;/code&gt; strings. We'll keep it simple.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;proxies_list&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"rotating_proxies_list.txt"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="s"&gt;"r"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Check Proxies
&lt;/h2&gt;

&lt;p&gt;Let's assume that we want to run the scrapers at scale. The demo is simplified, but the idea would be to store the proxies and their "health state" in a reliable medium like a database. We will use in-memory data structures that disappear after each run, but you get the idea.&lt;/p&gt;

&lt;p&gt;First, let's write a simple function to &lt;strong&gt;check that the proxy works&lt;/strong&gt;. For that, call &lt;a href="http://ident.me/"&gt;ident.me&lt;/a&gt;, which will return the IP. It is a simple page that fits our use case. We will use &lt;a href="https://docs.python.org/3/library/asyncio.html"&gt;&lt;code&gt;asyncio&lt;/code&gt;&lt;/a&gt; and &lt;a href="https://docs.aiohttp.org/en/stable/"&gt;&lt;code&gt;aiohttp&lt;/code&gt;&lt;/a&gt;, an "Asynchronous HTTP Client/Server" similar to the famous &lt;code&gt;requests&lt;/code&gt;. It suits us better since its purpose is to work asynchronously, and it will help us when checking several proxies simultaneously.&lt;/p&gt;

&lt;p&gt;For the moment, it takes an item from the proxies list and calls the provided URL. Most of the code is boilerplate that will soon prove useful. There are two possible results:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;😊 If everything goes OK, it prints the response's content and status code (i.e., 200), which will probably be the proxy's IP.&lt;/li&gt;
&lt;li&gt;😔 An error gets printed due to timeout or some other reason. It usually means that the proxy is not available o cannot process the request. Many of these will appear when using free proxies.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;aiohttp&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;asyncio&lt;/span&gt;

&lt;span class="n"&gt;proxies_list&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"rotating_proxies_list.txt"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"r"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;timeout&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;aiohttp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ClientTimeout&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;total&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;proxy&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;proxy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;"http://&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;proxy&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;check_proxies&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;proxy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;proxies_list&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pop&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;aiohttp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ClientSession&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"http://ident.me/"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;proxy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;proxy&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;check_proxies&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;We intentionally use HTTP instead of HTTPS because many free proxies don't support SSL.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Add More Checks To Validate the Results
&lt;/h3&gt;

&lt;p&gt;An exception means that the request failed, but there are other options that we should check, such as status codes. We will consider &lt;strong&gt;valid only specific codes&lt;/strong&gt; and mark the rest as errors. The list is not an exhaustive one, adjust it to your needs. You might think, for example, that 404 "Not Found" isn't valid and should be tested again.&lt;/p&gt;

&lt;p&gt;We could also add other checks, like validating that the response contains an IP address.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;VALID_STATUSES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;301&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;302&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;307&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;404&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;proxy&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;proxy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;"http://&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;proxy&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;VALID_STATUSES&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="c1"&gt;# valid proxy
&lt;/span&gt;                &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
            &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'Exception: '&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Iterate Over All the Proxies
&lt;/h3&gt;

&lt;p&gt;Great! We now need to run the checks for each proxy in the array. We will loop over the proxy list calling &lt;code&gt;get&lt;/code&gt; just as before. But instead of doing it sequentially, we will use &lt;code&gt;asyncio.gather&lt;/code&gt; to launch all the requests and wait for them to finish. Async makes the code more complicated, but it &lt;a href="https://www.zenrows.com/blog/speed-up-web-scraping-with-concurrency-in-python?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=rotate_proxies"&gt;speeds up web scraping&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The list is hardcoded to get a maximum of 10 items for security, to avoid hundreds of involuntary requests.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;check_proxies&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;proxies&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;proxies_list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="c1"&gt;# limited to 10 to avoid too many requests
&lt;/span&gt;    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;aiohttp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ClientSession&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;tasks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"http://ident.me/"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;proxy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;proxy&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;proxy&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;proxies&lt;/span&gt;
        &lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;gather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;tasks&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;return_exceptions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We should also &lt;strong&gt;limit the number of concurrent requests&lt;/strong&gt;. We'll do it using &lt;a href="https://docs.python.org/3/library/asyncio-sync.html#asyncio.Semaphore"&gt;Semaphore&lt;/a&gt;, an object that will acquire and release a lock. It will maintain an internal counter that allows only so many calls (10 in this case), thus creating a maximum concurrency.&lt;/p&gt;

&lt;p&gt;We need to change how to call &lt;code&gt;check_proxies&lt;/code&gt; too.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;sem&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Semaphore&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# ...
&lt;/span&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;proxy&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;sem&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# ...
&lt;/span&gt;
&lt;span class="n"&gt;loop&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_event_loop&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;loop&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;run_until_complete&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;check_proxies&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Separate Working Proxies From the Failed Ones
&lt;/h2&gt;

&lt;p&gt;Examining an output log is far from ideal, isn't it? We should keep an internal state for the proxy list. We will separate them into three groups:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;unchecked: unknown state, to be checked.&lt;/li&gt;
&lt;li&gt;working: the last call using this proxy was successful.&lt;/li&gt;
&lt;li&gt;not working: the last request failed.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It is easier to add or remove items from &lt;code&gt;set&lt;/code&gt;s than arrays, and they come with the advantage of avoiding duplicates. We can move proxies between lists without worrying about having the same one twice. If it's present, it just won't be added. That will simplify our code: remove an item from a set and add it to another. To achieve that, we need to modify the proxy storage slightly.&lt;/p&gt;

&lt;p&gt;Three sets will exist, one for each group seen above. The initial one, &lt;code&gt;unchecked&lt;/code&gt;, will contain the proxies from the file. A set can be initialized from an array, making it easy for us to create it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;proxies_list&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"rotating_proxies_list.txt"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"r"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;unchecked&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;proxies_list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="c1"&gt;# limited to 10 to avoid too many requests
# unchecked = set(proxies_list)
&lt;/span&gt;&lt;span class="n"&gt;working&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;set&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;not_working&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;set&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# ...
&lt;/span&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;check_proxies&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;aiohttp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ClientSession&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;tasks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"http://ident.me/"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;proxy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;proxy&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;proxy&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;unchecked&lt;/span&gt; &lt;span class="c1"&gt;# use the new set for the loop
&lt;/span&gt;        &lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;gather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;tasks&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;return_exceptions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;#...
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, write helper functions to move proxies between states. One helper for each state. They will add the proxy to a set and remove it - if present - from the other two. Here is where sets come in handy since we don't need to worry about checking if the proxy is present or looping over the arrays. Call "discard" to remove if present or ignored, but no exception will raise.&lt;/p&gt;

&lt;p&gt;For example, we will call &lt;code&gt;set_working&lt;/code&gt; when a request is successful. And that function will remove the proxy from the unchecked or not working sets while adding it to the working set.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;reset_proxy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;proxy&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;unchecked&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;proxy&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;working&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;discard&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;proxy&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;not_working&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;discard&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;proxy&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;set_working&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;proxy&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;unchecked&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;discard&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;proxy&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;working&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;proxy&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;not_working&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;discard&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;proxy&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;set_not_working&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;proxy&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;unchecked&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;discard&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;proxy&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;working&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;discard&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;proxy&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;not_working&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;proxy&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We are missing the crucial part! We need to edit &lt;code&gt;get&lt;/code&gt; to call these functions after each request. &lt;code&gt;set_working&lt;/code&gt; for the successful ones and &lt;code&gt;set_not_working&lt;/code&gt; for the rest.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;proxy&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;sem&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;proxy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;"http://&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;proxy&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;VALID_STATUSES&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="c1"&gt;# valid proxy
&lt;/span&gt;                    &lt;span class="n"&gt;set_working&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;proxy&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="n"&gt;set_not_working&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;proxy&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;set_not_working&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;proxy&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For the moment, add some traces at the end of the script to see if it's working well. The &lt;code&gt;unchecked&lt;/code&gt; set should be empty since we run all the items. And those items will populate the other two sets. Hopefully &lt;code&gt;working&lt;/code&gt; is not empty 😅 - it might happen with free proxies.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;#...
&lt;/span&gt;&lt;span class="n"&gt;loop&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_event_loop&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;loop&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;run_until_complete&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;check_proxies&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'unchecked -&amp;gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;unchecked&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# unchecked -&amp;gt; set()
&lt;/span&gt;&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'working -&amp;gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;working&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# working -&amp;gt; {'152.0.209.175:8080', ...}
&lt;/span&gt;&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'not_working -&amp;gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;not_working&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# not_working -&amp;gt; {'167.71.5.83:3128', ...}
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Using the Working Proxies
&lt;/h2&gt;

&lt;p&gt;That was a straightforward way to check proxies but not truly useful yet. We now need a way to &lt;strong&gt;get the working proxies&lt;/strong&gt; and use them for the real reason: web scraping actual content. We will create a function that will select a random proxy.&lt;/p&gt;

&lt;p&gt;We included both working and unchecked proxies in our example, feel free to use only the working ones if it fits your needs. We will see later why the unchecked ones are present too.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;random&lt;/code&gt; doesn't work with sets, so we'll convert them using a &lt;code&gt;tuple&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;random&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_random_proxy&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="c1"&gt;# create a tuple from unchecked and working sets
&lt;/span&gt;    &lt;span class="n"&gt;available_proxies&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;tuple&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;unchecked&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;union&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;working&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;available_proxies&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"no proxies available"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;available_proxies&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, we can edit the &lt;code&gt;get&lt;/code&gt; function to use a random proxy if none is present. The &lt;code&gt;proxy&lt;/code&gt; parameter is now optional. We will use that param to check the initial proxies, as we were doing before. But after that, we can forget about the proxy list and call &lt;code&gt;get&lt;/code&gt; without it. &lt;strong&gt;A random one will be used&lt;/strong&gt; and added to the &lt;code&gt;not_working&lt;/code&gt; set in case of failure.&lt;/p&gt;

&lt;p&gt;Since we will now want to get actual content, we need to return the response or raise the exception. With &lt;code&gt;aiohttp&lt;/code&gt;, unlike &lt;code&gt;requests&lt;/code&gt;, the response's content must &lt;code&gt;await&lt;/code&gt;. Here is the final version.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;proxy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;proxy&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;proxy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;get_random_proxy&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;sem&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;proxy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;"http://&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;proxy&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;VALID_STATUSES&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="n"&gt;set_working&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;proxy&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="n"&gt;set_not_working&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;proxy&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

                &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="c1"&gt;# content needs to be "awaited"
&lt;/span&gt;                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="c1"&gt;# return response
&lt;/span&gt;        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;set_not_working&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;proxy&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt; &lt;span class="c1"&gt;# raise exception
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Include below the script the content you want to scrape. We will merely call once again the same test URL for the demo.&lt;/p&gt;

&lt;p&gt;The idea is to start from here to build a real-world scraper based on this backbone. And to scale it, store the items in persistent storage, such as a database (i.e., Redis).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;#....
&lt;/span&gt;&lt;span class="n"&gt;loop&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_event_loop&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;loop&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;run_until_complete&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;check_proxies&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;span class="c1"&gt;# real scraping part comes here
&lt;/span&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;aiohttp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ClientSession&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"http://ident.me/"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ok&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# True
&lt;/span&gt;        &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# 200
&lt;/span&gt;        &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="c1"&gt;# 152.0.209.175
&lt;/span&gt;
&lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What happens with false negatives or one-time errors? Once we send a proxy to the &lt;code&gt;not_working&lt;/code&gt; set, it will remain there forever. There's no way back.&lt;/p&gt;

&lt;h2&gt;
  
  
  Re-Checking Not Working Proxies
&lt;/h2&gt;

&lt;p&gt;We should &lt;strong&gt;re-check the failed proxies&lt;/strong&gt; from time to time. There are many reasons: the failure was due to networking issues, a bug, or the proxy provider fixed it.&lt;/p&gt;

&lt;p&gt;In any case, Python allows us to set &lt;a href="https://docs.python.org/3/library/threading.html#timer-objects"&gt;&lt;code&gt;Timers&lt;/code&gt;&lt;/a&gt;, "an action that should be run only after a certain amount of time has passed". There are different ways to achieve the same end, and this is simple enough to run it using three lines.&lt;/p&gt;

&lt;p&gt;Remember the &lt;code&gt;reset_proxy&lt;/code&gt; function? We didn't use it at all until now. We will set a &lt;code&gt;Timer&lt;/code&gt; to run that function for every proxy marked as not working. Twenty seconds is a small number for a real-world case but enough for our test. We exclude a failing proxy and move it back to unchecked after some time.&lt;/p&gt;

&lt;p&gt;And this is the reason to use both working and unchecked sets in &lt;code&gt;get_random_proxy&lt;/code&gt;. Modify that function to use only working proxies for a more robust use case. And then, you can run &lt;code&gt;check_proxies&lt;/code&gt; periodically, which will loop over the unchecked elements - in this case, failed proxies that remained some time in the sin bin.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;threading&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Timer&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;set_not_working&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;proxy&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;unchecked&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;discard&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;proxy&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;working&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;discard&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;proxy&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;not_working&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;proxy&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# move to unchecked after a certain time (20s in the example)
&lt;/span&gt;    &lt;span class="n"&gt;Timer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;20.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;reset_proxy&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;proxy&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There is a final option for even more robust systems, but we'll leave the implementation up to you. &lt;strong&gt;Store analytics and usage for each proxy&lt;/strong&gt;, for example, the number of times it failed and when was the last one. Using that info, adjust the time to re-check - longer times for proxies that failed several times. Or even set some alerts if the number of working proxies goes below a threshold.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Building a plain proxy rotator might seem doable for small scraping scripts, but it can grow painful. But, hey, you did it!! 👏&lt;/p&gt;

&lt;p&gt;These are the steps we followed:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Store proxy list as plain text&lt;/li&gt;
&lt;li&gt;Import from the file as an array&lt;/li&gt;
&lt;li&gt;Check each of them&lt;/li&gt;
&lt;li&gt;Separate the working ones&lt;/li&gt;
&lt;li&gt;Check for failures while scraping and remove them from the working list&lt;/li&gt;
&lt;li&gt;Re-check not working proxies from time to time&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;As a note of caution, &lt;strong&gt;do not rotate IP when scraping logged-in or any other kind of session/cookies&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;If you don't want to worry about rotating proxies manually, you can always use &lt;a href="https://www.zenrows.com/?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=rotate_proxies"&gt;ZenRows&lt;/a&gt;, a Web Scraping API that includes Smart Rotating Proxies. It works as a regular proxy - with a single URL - but provides different IPs for each request.&lt;/p&gt;

&lt;p&gt;Did you find the content helpful? Please, spread the word and share it. 👈&lt;/p&gt;




&lt;p&gt;Originally published at &lt;a href="https://www.zenrows.com/blog/how-to-rotate-proxies-in-python?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=rotate_proxies"&gt;https://www.zenrows.com&lt;/a&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>webdev</category>
      <category>programming</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Speed Up Web Scraping with Concurrency in Python</title>
      <dc:creator>Ander Rodriguez</dc:creator>
      <pubDate>Tue, 17 May 2022 15:09:59 +0000</pubDate>
      <link>https://dev.to/anderrv/speed-up-web-scraping-with-concurrency-in-python-l6a</link>
      <guid>https://dev.to/anderrv/speed-up-web-scraping-with-concurrency-in-python-l6a</guid>
      <description>&lt;p&gt;Scraping websites for data is a typical use case for developers. Whether it's a side project or you're building a startup, there are many reasons to scrape the web.&lt;/p&gt;

&lt;p&gt;For example, if you want to start a price comparison website, you'll need to scrape prices from various e-commerce sites. Maybe you want to build an AI that could identify products and look up their price on Amazon. The possibilities are endless.&lt;/p&gt;

&lt;p&gt;But have you ever noticed &lt;strong&gt;how slow&lt;/strong&gt; it is to get all the pages? Would you scrape all the products one after the other? There must be a better solution, right? Right?!&lt;/p&gt;

&lt;p&gt;Scraping websites can be time-consuming because you have to deal with waiting for responses from the server and rate-limiting. That's why we will show you how to &lt;strong&gt;speed up your web scraping projects by using concurrency in Python&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;p&gt;For the code to work, you will need &lt;a href="https://www.python.org/downloads/"&gt;python3 installed&lt;/a&gt;. Some systems have it pre-installed. After that, install all the necessary libraries by running &lt;code&gt;pip install&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;requests beautifulsoup4 aiohttp numpy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;If you know the basics behind concurrency, skip the theory part and jump directly into the action.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Concurrency
&lt;/h2&gt;

&lt;p&gt;Concurrency is a term that deals with the ability to &lt;strong&gt;run multiple computing tasks simultaneously&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;When you make requests to websites sequentially, you send out one request at a time, waiting for it to return, and then sending out the following one.&lt;/p&gt;

&lt;p&gt;However, you can send out many requests at once with concurrency, and work on them all as they return. The &lt;strong&gt;speed increases&lt;/strong&gt; from this method are incredible. Compared to sequential requests, concurrent ones will be much faster regardless of whether they are running in parallel (multiple CPUs) or not - more on this later.&lt;/p&gt;

&lt;p&gt;To understand the benefits of concurrency, we need to understand the difference between processing tasks sequentially and concurrently. For example, let's say we have five tasks that take 10 seconds each to complete.&lt;/p&gt;

&lt;p&gt;When processing them sequentially, the time it takes to complete all five is 50 seconds. However, it takes only 10 seconds for all five tasks to complete when processing them concurrently.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--nh5OG8ZL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/n8ocw18fz4duujhtiwtc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--nh5OG8ZL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/n8ocw18fz4duujhtiwtc.png" alt="Sequential VS Concurrent Graphic" width="673" height="638"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In addition to increasing speed, concurrency allows us to do more work in less time by distributing our web scraping workload among several processes.&lt;/p&gt;

&lt;p&gt;There are several ways to parallelize requests, such as &lt;code&gt;multiprocessing&lt;/code&gt; and &lt;code&gt;asyncio&lt;/code&gt;. From a web scraping perspective, we can use these libraries to parallelize requests to different websites or other pages on the same website. In this article, we will focus on &lt;code&gt;asyncio&lt;/code&gt;, a Python module providing infrastructure for writing single-threaded concurrent code using coroutines.&lt;/p&gt;

&lt;p&gt;Since concurrency implies more convoluted systems and code, consider if the pros outweigh the cons for your use case.&lt;/p&gt;

&lt;h3&gt;
  
  
  Benefits of Concurrency
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;More work done in less time.&lt;/li&gt;
&lt;li&gt;Idle network time invested in other requests.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Dangers of Concurrency
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Harder to develop and debug.&lt;/li&gt;
&lt;li&gt;Race conditions.&lt;/li&gt;
&lt;li&gt;The need to check and use thread-safe functions.&lt;/li&gt;
&lt;li&gt;Block probabilities grow if not handled carefully.&lt;/li&gt;
&lt;li&gt;Concurrency comes with a system overhead, set a reasonable concurrency level.&lt;/li&gt;
&lt;li&gt;Involuntary DDoS if too many requests against a small site.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s---Zq97HLr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/u45dmwnibup6cdby5alh.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---Zq97HLr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/u45dmwnibup6cdby5alh.gif" alt="Too Many Requests at the same time" width="327" height="251"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why asyncio?
&lt;/h2&gt;

&lt;p&gt;To decide what technology to use, we must understand the difference between &lt;code&gt;asyncio&lt;/code&gt; and &lt;code&gt;multiprocessing&lt;/code&gt;. And also I/O-bound and CPU-bound.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://docs.python.org/3/library/asyncio.html"&gt;asyncio&lt;/a&gt;&lt;/strong&gt; "is a library to write concurrent code using the async/await syntax". It runs on a single processor.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://docs.python.org/3/library/multiprocessing.html"&gt;multiprocessing&lt;/a&gt;&lt;/strong&gt; "is a package that supports spawning processes using an API [...] allowing the programmer to fully leverage multiple processors on a given machine". Each process will start its own Python interpreter in a different CPU.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;I/O-bound&lt;/strong&gt; means that the program will run slower due to input/output operations. In our case, mostly network requests.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CPU-bound&lt;/strong&gt; means that the program will run slower due to central processor use - for example, math calculations.&lt;/p&gt;

&lt;p&gt;Why does this affect the library we will use for concurrency? Because a big part of the cost for concurrency is creating and maintaining threads/processes. For CPU-bound problems, having many processes in different CPUs will pay off. But that might not be the case for I/O-bound scenarios.&lt;/p&gt;

&lt;p&gt;Since scraping is mostly I/O-bound, we picked &lt;code&gt;asyncio&lt;/code&gt;. But in case of doubt (or just for fun), you can replicate the idea using multiprocessing and compare the results.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Hwmjk6yg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/uezlzu46ml4rpzleij7b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Hwmjk6yg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/uezlzu46ml4rpzleij7b.png" alt="Concurrency Comparison Graphic" width="880" height="688"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Sequential Version
&lt;/h2&gt;

&lt;p&gt;We'll start by scraping &lt;code&gt;scrapeme.live&lt;/code&gt; as an example, a fake Pokémon e-commerce prepared for testing.&lt;/p&gt;

&lt;p&gt;First, we will start with the &lt;strong&gt;sequential version of the scraper&lt;/strong&gt;. Several snippets are part of all cases, so those will remain the same.&lt;/p&gt;

&lt;p&gt;By visiting the page, we see that there are 48 pages. Since it is a testing environment, that won't change anytime soon. Our first constants will be the base URL and a range for the pages.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;base_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"https://scrapeme.live/shop/page"&lt;/span&gt;
&lt;span class="n"&gt;pages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;49&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# max page (48) + 1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, extract the basics from a product. For that, use &lt;code&gt;requests.get&lt;/code&gt; to get the HTML and then &lt;code&gt;BeautifulSoup&lt;/code&gt; to parse it. We will loop over each product and get some basic info from them. All the selectors come from a manual review of the content (using DevTools), but we won't go into detail here for brevity.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;requests&lt;/span&gt; 
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;bs4&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BeautifulSoup&lt;/span&gt; 

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;extract_details&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; 
    &lt;span class="c1"&gt;# concatenate page number to base URL 
&lt;/span&gt;    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
    &lt;span class="n"&gt;soup&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;BeautifulSoup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"html.parser"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 

    &lt;span class="n"&gt;pokemon_list&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt; 
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;pokemon&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;soup&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;".product"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="c1"&gt;# loop each product 
&lt;/span&gt;        &lt;span class="n"&gt;pokemon_list&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; 
            &lt;span class="s"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;pokemon&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;class_&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"add_to_cart_button"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"data-product_id"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; 
            &lt;span class="s"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;pokemon&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"h2"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; 
            &lt;span class="s"&gt;"price"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;pokemon&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;class_&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"price"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; 
            &lt;span class="s"&gt;"url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;pokemon&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;class_&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"woocommerce-loop-product__link"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"href"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; 
        &lt;span class="p"&gt;})&lt;/span&gt; 
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;pokemon_list&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;extract_details&lt;/code&gt; function will take a page number and concatenate that to create a URL with the base seen earlier. After getting the content and creating an array of products, return them. That means the returned values will be a list of dictionaries. It is an essential detail for later.&lt;/p&gt;

&lt;p&gt;We need to run the function above for each page, get all the results, and store them.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;csv&lt;/span&gt; 

&lt;span class="c1"&gt;# modified to avoid running all the pages unintentionally 
&lt;/span&gt;&lt;span class="n"&gt;pages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;store_results&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;list_of_lists&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; 
    &lt;span class="n"&gt;pokemon_list&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;list_of_lists&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[])&lt;/span&gt; &lt;span class="c1"&gt;# flatten lists 
&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nb"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"pokemon.csv"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"w"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pokemon_file&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; 
        &lt;span class="c1"&gt;# get dictionary keys for the CSV header 
&lt;/span&gt;        &lt;span class="n"&gt;fieldnames&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pokemon_list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;keys&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; 
        &lt;span class="n"&gt;file_writer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;csv&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DictWriter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pokemon_file&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fieldnames&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;fieldnames&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
        &lt;span class="n"&gt;file_writer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;writeheader&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; 
        &lt;span class="n"&gt;file_writer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;writerows&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pokemon_list&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 

&lt;span class="n"&gt;list_of_lists&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt; 
    &lt;span class="n"&gt;extract_details&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;pages&lt;/span&gt; 
&lt;span class="p"&gt;]&lt;/span&gt; 
&lt;span class="n"&gt;store_results&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;list_of_lists&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Running the code above will get two product pages, extract products (32 total), and store them in a CSV file called &lt;code&gt;pokemon.csv&lt;/code&gt;. The &lt;code&gt;store_results&lt;/code&gt; function does not affect the scraping in sequential or concurrent mode. You can skip it.&lt;/p&gt;

&lt;p&gt;Since the results are lists, we must flatten them to allow &lt;code&gt;writerows&lt;/code&gt; to do its job. That's why we named the variable &lt;code&gt;list_of_lists&lt;/code&gt; (even if it's a bit weird), only to remind everyone that it's not flat.&lt;/p&gt;

&lt;p&gt;Example of the output CSV file:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--riQRH4BI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vcpjf4dtzejni5arlv4a.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--riQRH4BI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vcpjf4dtzejni5arlv4a.jpg" alt="Products CSV Output" width="768" height="229"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you were to run the script for every page (48) total, it would generate a CSV with 755 products and spend around 30 seconds.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;time &lt;/span&gt;python script.py 

real 0m31,806s 
user 0m1,936s 
sys 0m0,073s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Introducing asyncio
&lt;/h2&gt;

&lt;p&gt;We know we can do better. If we &lt;strong&gt;perform all the requests at the same time&lt;/strong&gt;, it should take much less, right? Maybe as long as the slowest request?&lt;/p&gt;

&lt;p&gt;Concurrency should indeed run faster, but it also involves some overhead. So it is not a linear mathematical improvement. But improve we will.&lt;/p&gt;

&lt;p&gt;For that, we will use the mentioned &lt;code&gt;asyncio&lt;/code&gt;. It allows us to run several tasks on the same thread in an event loop (like Javascript does). It will run a function and switch the context to a different one when allowed. In our case, HTTP requests allow that switch.&lt;/p&gt;

&lt;p&gt;We will start seeing an example that will sleep for a second. And the script should take a second to run. Notice that we cannot call &lt;code&gt;main&lt;/code&gt; directly. We need to let asyncio know that it's an &lt;code&gt;async&lt;/code&gt; function that needs executing.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;asyncio&lt;/span&gt; 

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt; 
    &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Hello ..."&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
    &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"... World!"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 

&lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;time &lt;/span&gt;python script.py 
Hello ... 
... World! 

real 0m1,054s 
user 0m0,045s 
sys 0m0,008s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Simple code in parallel
&lt;/h2&gt;

&lt;p&gt;Next, we will expand an example case to run a hundred functions. Each of them will sleep for a second and print a text. It would take around one hundred seconds if we were to run them sequentially. With &lt;code&gt;asyncio&lt;/code&gt;, it will take just one!&lt;/p&gt;

&lt;p&gt;That's the power behind concurrency. As said earlier, for pure I/O-bound tasks, it will perform much faster - sleeping is not, but it counts for the example.&lt;/p&gt;

&lt;p&gt;We need to create a helper function that will sleep for a second and print a message. Then, we edit &lt;code&gt;main&lt;/code&gt; to call that function a hundred times and store each call in a tasks list. The last and crucial part is to execute and wait for all the tasks to finish. That's what &lt;a href="https://docs.python.org/3/library/asyncio-task.html#asyncio.gather"&gt;&lt;code&gt;asyncio.gather&lt;/code&gt;&lt;/a&gt; does.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;asyncio&lt;/span&gt; 

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;demo_function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; 
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
    &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;"Hello &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt; 
    &lt;span class="n"&gt;tasks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt; 
        &lt;span class="n"&gt;demo_function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
    &lt;span class="p"&gt;]&lt;/span&gt; 
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;gather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;tasks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 

&lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As expected, a hundred messages and one second to execute. Perfect!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;time &lt;/span&gt;python script.py 
Hello 0 
... 
Hello 99 

real 0m1,065s 
user 0m0,063s 
sys 0m0,000s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Scraping with asyncio
&lt;/h2&gt;

&lt;p&gt;We need to apply that knowledge to scraping. The approach to follow will be to request concurrently and return product lists. Once all requests finish, store them. It might be better to save data after each request or in batches to avoid data losses for real-world cases.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Our first attempt won't have a concurrency limit, so be careful when using it. In the case of running it with thousands of URLs... well, it would perform all those requests almost at the same time. Which could cause a tremendous load on the server and probably fry your computer.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;requests&lt;/code&gt; does not support async out-of-the-box, so we will use &lt;a href="https://docs.aiohttp.org/"&gt;&lt;code&gt;aiohttp&lt;/code&gt;&lt;/a&gt; to avoid complications. &lt;code&gt;requests&lt;/code&gt; can do the job, and there is no substantial performance difference. But the code is more readable using &lt;code&gt;aiohttp&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;asyncio&lt;/span&gt; 
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;aiohttp&lt;/span&gt; 
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;bs4&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BeautifulSoup&lt;/span&gt; 

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;extract_details&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; 
    &lt;span class="c1"&gt;# similar to requests.get but with a different syntax 
&lt;/span&gt;    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; 

        &lt;span class="c1"&gt;# notice that we must await the .text() function 
&lt;/span&gt;        &lt;span class="n"&gt;soup&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;BeautifulSoup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="s"&gt;"html.parser"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 

        &lt;span class="c1"&gt;# [...] same as before 
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;pokemon_list&lt;/span&gt; 

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt; 
    &lt;span class="c1"&gt;# create an aiohttp session and pass it to each function execution 
&lt;/span&gt;    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;aiohttp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ClientSession&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; 
        &lt;span class="n"&gt;tasks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt; 
            &lt;span class="n"&gt;extract_details&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;pages&lt;/span&gt; 
        &lt;span class="p"&gt;]&lt;/span&gt; 
        &lt;span class="n"&gt;list_of_lists&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;gather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;tasks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
        &lt;span class="n"&gt;store_results&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;list_of_lists&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 

&lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The CSV file should have every product (755) just as before. Since we perform all the page calls at the same time, the results will not arrive in order. If we were to add the results to the file inside &lt;code&gt;extract_details&lt;/code&gt; they might be unordered. Since we wait for all tasks to finish and then process them, the order will not be a problem.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;time &lt;/span&gt;python script.py 

real 0m11,442s 
user 0m1,332s 
sys 0m0,060s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We did it! 3x &lt;strong&gt;faster&lt;/strong&gt; is nice, but... shouldn't it be 40x? It's not that simple. Many things can affect the performance (Network, CPU, RAM, and so on).&lt;/p&gt;

&lt;p&gt;And in this demo page, we've noticed that response time slows down when we perform several calls. It might be by design. Some servers/providers can limit the number of concurrent requests to avoid too much traffic from the same IP. It is not a block but more of a queue. You will get served, but after waiting a bit.&lt;/p&gt;

&lt;p&gt;To see real speed-up, you can test against a &lt;a href="https://httpbin.org/delay/2"&gt;delay&lt;/a&gt; page. It is another testing page that will wait for 2 seconds and then return a response.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;base_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"https://httpbin.org/delay/2"&lt;/span&gt; 
&lt;span class="c1"&gt;#... 
&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;extract_details&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; 
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; 
        &lt;span class="c1"&gt;#...
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Removed all the extracting and storing logic, just calling the delay URL 48 times. And it runs in under 3 seconds.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;time &lt;/span&gt;python script.py 

real 0m2,865s 
user 0m0,245s 
sys 0m0,031s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Limiting Concurrency with Semaphore
&lt;/h2&gt;

&lt;p&gt;As said earlier, we should limit the number of concurrent requests, especially against a single domain.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;asyncio&lt;/code&gt; comes with &lt;a href="https://docs.python.org/3/library/asyncio-sync.html#asyncio.Semaphore"&gt;Semaphore&lt;/a&gt;, an object that will acquire and release a lock. Its inner functionality will block some of the calls until the lock is acquired, thus creating a maximum concurrency.&lt;/p&gt;

&lt;p&gt;We need to create the semaphore with the maximum we want. And then await on the extracting function until it is available using &lt;code&gt;async with sem&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;max_concurrency&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt; 
&lt;span class="n"&gt;sem&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Semaphore&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_concurrency&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;extract_details&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; 
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;sem&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="c1"&gt;# semaphore limits num of simultaneous downloads 
&lt;/span&gt;        &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; 
            &lt;span class="c1"&gt;# ... 
&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt; 
        &lt;span class="c1"&gt;# ... 
&lt;/span&gt;
&lt;span class="n"&gt;loop&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_event_loop&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; 
&lt;span class="n"&gt;loop&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;run_until_complete&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It gets the job done, and it is relatively easy to implement! Here is the output with max concurrency set to 3.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;time &lt;/span&gt;python script.py 

real 0m13,062s 
user 0m1,455s 
sys 0m0,047s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It shows that the version with unlimited concurrency is not operating at its full speed 🤦. If we increment the limit to 10, the total time is similar to the unbound script.&lt;/p&gt;

&lt;h2&gt;
  
  
  Limiting concurrency with TCPConnector
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;aiohttp&lt;/code&gt; offers an alternative solution that offers further configuration. We can create the client session passing in a custom &lt;a href="https://docs.aiohttp.org/en/stable/client_reference.html#tcpconnector"&gt;TCPConnector&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;We can build it by using two parameters that suit our needs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;limit&lt;/code&gt; - "total number of simultaneous connections".&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;limit_per_host&lt;/code&gt; - "limit simultaneous connections to the same endpoint" (same host, port, and &lt;code&gt;is_ssl&lt;/code&gt;).
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;max_concurrency&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt; 
&lt;span class="n"&gt;max_concurrency_per_host&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt; 

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt; 
    &lt;span class="n"&gt;connector&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;aiohttp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TCPConnector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;limit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;max_concurrency&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;limit_per_host&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;max_concurrency_per_host&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;aiohttp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ClientSession&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;connector&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;connector&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; 
        &lt;span class="c1"&gt;# ... 
&lt;/span&gt;
&lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Also easy to implement and maintain! Here is the output with max concurrency set to 3 per host.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;time &lt;/span&gt;python script.py 

real 0m16,188s 
user 0m1,311s 
sys 0m0,065s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The advantage over &lt;code&gt;Semaphore&lt;/code&gt; is the option to limit the total amount of concurrent calls and requests per domain. We could use the same &lt;code&gt;session&lt;/code&gt; to scrape different sites, and each one of those would have its own limit.&lt;/p&gt;

&lt;p&gt;The downside is that it looks a bit slower. Run some tests with more pages and actual data for a real-case scenario.&lt;/p&gt;

&lt;h2&gt;
  
  
  multiprocessing
&lt;/h2&gt;

&lt;p&gt;Scraping is I/O-bound like we saw earlier. But what if we needed to mix it with some CPU-intensive computations? To test that case, we'll use a function that will &lt;code&gt;count_a_lot&lt;/code&gt; (to one hundred million) after each scraped page. It is a simple (and silly) way to force a CPU to be busy for some time.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;count_a_lot&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt; 
    &lt;span class="n"&gt;count_to&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;100_000_000&lt;/span&gt; 
    &lt;span class="n"&gt;counter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; 
    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="n"&gt;counter&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;count_to&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; 
        &lt;span class="n"&gt;counter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;counter&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; 

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;extract_details&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; 
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; 
        &lt;span class="c1"&gt;# ... 
&lt;/span&gt;        &lt;span class="n"&gt;count_a_lot&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; 
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;pokemon_list&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For the asyncio version, just run it as before. It might take a long time ⏳.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;time &lt;/span&gt;python script.py 

real 2m37,827s 
user 2m35,586s 
sys 0m0,244s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, &lt;strong&gt;brace for the hard part&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Adding &lt;code&gt;multiprocessing&lt;/code&gt; is a bit harder. We need to create a &lt;a href="https://docs.python.org/3/library/concurrent.futures.html#processpoolexecutor"&gt;&lt;code&gt;ProcessPoolExecutor&lt;/code&gt;&lt;/a&gt;, which "uses a pool of processes to execute calls asynchronously". It will handle the &lt;strong&gt;creation and control of each process in a different CPU&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;But it won't distribute the load. For that, we will use &lt;code&gt;NumPy&lt;/code&gt;'s &lt;code&gt;array_split&lt;/code&gt;, which will slice the &lt;code&gt;pages&lt;/code&gt; range into equal chunks according to the number of CPUs.&lt;/p&gt;

&lt;p&gt;The rest of the &lt;code&gt;main&lt;/code&gt; function is similar to the &lt;code&gt;asyncio&lt;/code&gt; version but changes some syntax to match the &lt;code&gt;multiprocessing&lt;/code&gt; style.&lt;/p&gt;

&lt;p&gt;The essential difference is that we cannot call &lt;code&gt;extract_details&lt;/code&gt; directly. We could, but we'll try to obtain the maximum power by mixing &lt;code&gt;multiprocessing&lt;/code&gt; with &lt;code&gt;asyncio&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;concurrent.futures&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ProcessPoolExecutor&lt;/span&gt; 
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;multiprocessing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;cpu_count&lt;/span&gt; 
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt; 

&lt;span class="n"&gt;num_cores&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cpu_count&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="c1"&gt;# number of CPU cores 
&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt; 
    &lt;span class="n"&gt;executor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ProcessPoolExecutor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_workers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;num_cores&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
    &lt;span class="n"&gt;tasks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt; 
        &lt;span class="n"&gt;executor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;submit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;asyncio_wrapper&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pages_for_task&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;pages_for_task&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;array_split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num_cores&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
    &lt;span class="p"&gt;]&lt;/span&gt; 
    &lt;span class="n"&gt;doneTasks&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;concurrent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;futures&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;wait&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tasks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 

    &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt; 
        &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; 
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;doneTasks&lt;/span&gt; 
    &lt;span class="p"&gt;]&lt;/span&gt; 
    &lt;span class="n"&gt;store_results&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 

&lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Long story short, each CPU process will have a few pages to scrape. There are 48 pages, and assuming your machine has 8 CPUs, each process will request six pages (6 * 8 = 48).&lt;/p&gt;

&lt;p&gt;And those six pages will run concurrently! After that, the calculations will have to wait since they are CPU-intensive. But we have many CPUs, so they should run faster than the pure &lt;code&gt;asyncio&lt;/code&gt; version.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;extract_details_task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pages_for_task&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; 
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;aiohttp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ClientSession&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; 
        &lt;span class="n"&gt;tasks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt; 
            &lt;span class="n"&gt;extract_details&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;pages_for_task&lt;/span&gt; 
        &lt;span class="p"&gt;]&lt;/span&gt; 
        &lt;span class="n"&gt;list_of_lists&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;gather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;tasks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nb"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;list_of_lists&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[])&lt;/span&gt; 


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;asyncio_wrapper&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pages_for_task&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; 
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;extract_details_task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pages_for_task&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This ☝️ is where the magic happens. Each CPU process will start an &lt;code&gt;asyncio&lt;/code&gt; with a subset of the pages (e.g., from 1 to 6 for the first one).&lt;/p&gt;

&lt;p&gt;And then, each one of those will call several URLs, using the already known &lt;code&gt;extract_details&lt;/code&gt; function.&lt;/p&gt;

&lt;p&gt;Take a moment to assimilate that. The whole process goes like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Create the executor.&lt;/li&gt;
&lt;li&gt;Split the pages.&lt;/li&gt;
&lt;li&gt;Start &lt;code&gt;asyncio&lt;/code&gt; per each process.&lt;/li&gt;
&lt;li&gt;Create an &lt;code&gt;aiohttp&lt;/code&gt; session and create the tasks of a subset of pages.&lt;/li&gt;
&lt;li&gt;Extract data for each page.&lt;/li&gt;
&lt;li&gt;Consolidate and store the results.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;And here are the execution times. We haven't mentioned it, but the &lt;code&gt;user&lt;/code&gt; time here plays a notable role. For the script running only &lt;code&gt;asyncio&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;time &lt;/span&gt;python script.py 

real 2m37,827s 
user 2m35,586s 
sys 0m0,244s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The version with &lt;code&gt;asyncio&lt;/code&gt; and multiple processes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;time &lt;/span&gt;python script.py 

real 0m38,048s 
user 3m3,147s 
sys 0m0,532s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Did you spot the difference? The first one took more than two minutes, and the second one 40 seconds. But in total CPU time (&lt;code&gt;user&lt;/code&gt; time) the second one was over three minutes! That is a bit more due to system overhead and all that.&lt;/p&gt;

&lt;p&gt;That shows that the parallel processing "wastes" more time (in total) but finishes before. Then it is up to you to decide which method to choose. Take also into account that it is more complicated to develop and debug 😅.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;We've seen that &lt;code&gt;asyncio&lt;/code&gt; might be enough for scraping since most of the running time goes to networking. Which is I/O-bound and works well with concurrent processing in a single core.&lt;/p&gt;

&lt;p&gt;That situation changes if the gathered data requires some CPU-intensive work. We've seen a silly example with counting, but you get the point.&lt;/p&gt;

&lt;p&gt;For most cases, &lt;code&gt;asyncio&lt;/code&gt; with &lt;code&gt;aiohttp&lt;/code&gt; - better suited than &lt;code&gt;requests&lt;/code&gt; for async work - gets the job done. Add a custom connector to limit the number of requests per domain and total concurrent ones. With those three pieces, you can start building a scraper that can scale.&lt;/p&gt;

&lt;p&gt;An important piece is to allow new URLs/tasks (something like a Queue), but that's for another day. Stay tuned!&lt;/p&gt;

&lt;p&gt;Did you find the content helpful? Please, spread the word and share it. 👈&lt;/p&gt;




&lt;p&gt;Originally published at &lt;a href="https://www.zenrows.com/blog/speed-up-web-scraping-with-concurrency-in-python?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=concurrency_python"&gt;https://www.zenrows.com&lt;/a&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>webdev</category>
      <category>tutorial</category>
      <category>programming</category>
    </item>
    <item>
      <title>HTTP Requests in Java with Proxies</title>
      <dc:creator>Ander Rodriguez</dc:creator>
      <pubDate>Tue, 29 Mar 2022 07:51:12 +0000</pubDate>
      <link>https://dev.to/anderrv/http-requests-in-java-with-proxies-30n8</link>
      <guid>https://dev.to/anderrv/http-requests-in-java-with-proxies-30n8</guid>
      <description>&lt;p&gt;Accessing data over HTTP is more common every day. Be it APIs or webpages, intercommunication between applications is growing. And website scraping.&lt;/p&gt;

&lt;p&gt;There is no easy built-in solution to perform HTTP calls in Java. Many packages offer some related functionalities, but it's not easy to pick one. Especially if you need some extra features like connecting via authenticated proxies.&lt;/p&gt;

&lt;p&gt;We'll go from the basic request to advanced features using &lt;code&gt;fluent.Request&lt;/code&gt;, part of the &lt;a href="https://hc.apache.org/index.html"&gt;Apache HttpComponents&lt;/a&gt; project.&lt;/p&gt;

&lt;h2&gt;
  
  
  Direct Request
&lt;/h2&gt;

&lt;p&gt;The first step is to request the desired page. We will use &lt;a href="http://httpbin.org/anything"&gt;httpbin&lt;/a&gt; for the demo. It shows headers and origin IP, allowing us to check if the request was successful.&lt;/p&gt;

&lt;p&gt;We need to import &lt;code&gt;Request&lt;/code&gt;, get the target page and extract the result as a string. The package provides methods for those cases and many more. Lastly, print the response.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;org.apache.hc.client5.http.fluent.Request&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;TestRequest&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;static&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;final&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="kd"&gt;throws&lt;/span&gt; &lt;span class="nc"&gt;Exception&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"http://httpbin.org/anything"&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;

        &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Request&lt;/span&gt;
                &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;get&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;// use GET HTTP method&lt;/span&gt;
                &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;execute&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="c1"&gt;// perform the call&lt;/span&gt;
                &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;returnContent&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="c1"&gt;// handle and return response&lt;/span&gt;
                &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;asString&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt; &lt;span class="c1"&gt;// convert response to string&lt;/span&gt;

        &lt;span class="nc"&gt;System&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;out&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;println&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We are not handling the response nor checking for errors. It is a simplified version of a real-use case.&lt;/p&gt;

&lt;p&gt;But we can see on the result that the request was successful, and our IP shows as the origin. We'll solve that in a moment.&lt;/p&gt;

&lt;h2&gt;
  
  
  Proxy Request
&lt;/h2&gt;

&lt;p&gt;There are many reasons to add proxies to an HTTP request, such as security or anonymity. In any case, Java libraries (usually) make adding proxies complicated.&lt;/p&gt;

&lt;p&gt;In our case, we can use &lt;code&gt;viaProxy&lt;/code&gt; with the proxy URL as long as we don't need authentication. More on that later.&lt;/p&gt;

&lt;p&gt;For now, we'll use a proxy from a free list. &lt;em&gt;Note that these &lt;a href="https://free-proxy-list.net/"&gt;free proxies&lt;/a&gt; might not work for you. They are short-time lived.&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;org.apache.hc.client5.http.fluent.Request&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;TestRequest&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;static&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;final&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="kd"&gt;throws&lt;/span&gt; &lt;span class="nc"&gt;Exception&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"http://httpbin.org/anything"&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
        &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;proxy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"http://169.57.1.85:8123"&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;// Free proxy&lt;/span&gt;

        &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Request&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;get&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
                &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;viaProxy&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;proxy&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;// will set the passed proxy&lt;/span&gt;
                &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;execute&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;returnContent&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;asString&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;

        &lt;span class="nc"&gt;System&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;out&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;println&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Proxy with Authentication
&lt;/h2&gt;

&lt;p&gt;Paid or private proxy providers - such as &lt;a href="https://www.zenrows.com/"&gt;ZenRows&lt;/a&gt; - frequently use authentication in each call. Sometimes it is done via IP allowed lists, but it's frequent to use other means like &lt;code&gt;Proxy-Authorization&lt;/code&gt; headers.&lt;/p&gt;

&lt;p&gt;Calling the proxy without the proper auth method will result in an error: &lt;code&gt;Exception in thread "main" org.apache.hc.client5.http.HttpResponseException: status code: 407, reason phrase: Proxy Authentication Required&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Following the example, we will need two things: auth and passing the proxy as a Host.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Proxy-Authorization"&gt;&lt;code&gt;Proxy-Authorization&lt;/code&gt;&lt;/a&gt; contains the user and password base64 encoded.&lt;/p&gt;

&lt;p&gt;Then, we need to change how &lt;code&gt;viaProxy&lt;/code&gt; gets the proxy since it does not allow URLs with user and password. For that, we will create a new &lt;code&gt;HttpHost&lt;/code&gt; passing in the whole URL. It will internally handle the problem and omit the unneeded parts.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;java.net.URI&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;java.util.Base64&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;org.apache.hc.client5.http.fluent.Request&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;org.apache.hc.core5.http.HttpHost&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;TestRequest&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;static&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;final&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="kd"&gt;throws&lt;/span&gt; &lt;span class="nc"&gt;Exception&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"http://httpbin.org/anything"&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
        &lt;span class="no"&gt;URI&lt;/span&gt; &lt;span class="n"&gt;proxyURI&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="no"&gt;URI&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"http://YOUR_API_KEY:@proxy.zenrows.com:8001"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;// Proxy URL as given by the provider&lt;/span&gt;
        &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;basicAuth&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
            &lt;span class="nc"&gt;Base64&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getEncoder&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="c1"&gt;// get the base64 encoder&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;encode&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;proxyURI&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getUserInfo&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;getBytes&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="c1"&gt;// get user and password from the proxy URL&lt;/span&gt;
            &lt;span class="o"&gt;));&lt;/span&gt;
        &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Request&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;get&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
                &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;addHeader&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Proxy-Authorization"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Basic "&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;basicAuth&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;// add auth&lt;/span&gt;
                &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;viaProxy&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;HttpHost&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;create&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;proxyURI&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt; &lt;span class="c1"&gt;// will set the passed proxy as a host&lt;/span&gt;
                &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;execute&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;returnContent&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;asString&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;

        &lt;span class="nc"&gt;System&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;out&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;println&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Ignore SSL Certificates
&lt;/h2&gt;

&lt;p&gt;When adding proxies to SSL (https) connections, libraries tend to raise a warning/error about the certificate. From a security perspective, that is awesome! We avoid being shown or redirected to sites we prefer to avoid.&lt;/p&gt;

&lt;p&gt;But what about &lt;em&gt;forcing&lt;/em&gt; our connections through our own proxies? There is no security risk in those cases, so we want to ignore those warnings. That is, again, not an easy task in Java.&lt;/p&gt;

&lt;p&gt;The error goes something like this: &lt;code&gt;Exception in thread "main" javax.net.ssl.SSLHandshakeException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;For this case, we will modify the target URL by switching it to &lt;code&gt;https&lt;/code&gt;. And also, call a helper method that we'll create next. Nothing else changes on the main function.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;TestRequest&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;static&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;final&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="kd"&gt;throws&lt;/span&gt; &lt;span class="nc"&gt;Exception&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;ignoreCertWarning&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt; &lt;span class="c1"&gt;// new method that will ignore certificate warnings&lt;/span&gt;

        &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"https://httpbin.org/anything"&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;// switch to https&lt;/span&gt;
        &lt;span class="c1"&gt;// ...&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now to the complicated and verbose part. We need to create an SSL context and fake certificates. As you can see, the certificates manager and its methods do nothing. It will just bypass the inner working and thus avoid the problems. Lastly, initialize the context with the created fake certs and set it as default. And we are good to go!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;java.security.cert.X509Certificate&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;javax.net.ssl.*&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;TestRequest&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// ...&lt;/span&gt;
    &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="kd"&gt;static&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;ignoreCertWarning&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="nc"&gt;SSLContext&lt;/span&gt; &lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
        &lt;span class="nc"&gt;TrustManager&lt;/span&gt;&lt;span class="o"&gt;[]&lt;/span&gt; &lt;span class="n"&gt;trustAllCerts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;X509TrustManager&lt;/span&gt;&lt;span class="o"&gt;[]&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;X509TrustManager&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
            &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="nc"&gt;X509Certificate&lt;/span&gt;&lt;span class="o"&gt;[]&lt;/span&gt; &lt;span class="nf"&gt;getAcceptedIssuers&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="o"&gt;;}&lt;/span&gt;
            &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;checkClientTrusted&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;X509Certificate&lt;/span&gt;&lt;span class="o"&gt;[]&lt;/span&gt; &lt;span class="n"&gt;certs&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;authType&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{}&lt;/span&gt;
            &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;checkServerTrusted&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;X509Certificate&lt;/span&gt;&lt;span class="o"&gt;[]&lt;/span&gt; &lt;span class="n"&gt;certs&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;authType&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{}&lt;/span&gt;
        &lt;span class="o"&gt;}&lt;/span&gt; &lt;span class="o"&gt;};&lt;/span&gt;

        &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SSLContext&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getInstance&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"SSL"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
            &lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;init&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;trustAllCerts&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
            &lt;span class="nc"&gt;SSLContext&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setDefault&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="o"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Exception&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{}&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Accessing data (or scraping) in Java can get complicated and verbose. But with the right tools and libraries, we got to tame its verbosity - but for the certificate.&lt;/p&gt;

&lt;p&gt;We might get back to this topic in the future. The HttpComponents library offers attractive functionalities such as async and multi-threaded execution.&lt;/p&gt;

</description>
      <category>java</category>
      <category>beginners</category>
      <category>webdev</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Web Scraping with Python 101</title>
      <dc:creator>Ander Rodriguez</dc:creator>
      <pubDate>Wed, 19 Jan 2022 12:29:11 +0000</pubDate>
      <link>https://dev.to/anderrv/web-scraping-with-python-101-2kg3</link>
      <guid>https://dev.to/anderrv/web-scraping-with-python-101-2kg3</guid>
      <description>&lt;p&gt;The Internet is a vast data source if you know where to look - and how to extract it! Going page by page and copying the data manually is not an option anymore. And yet, many people still are doing that.&lt;/p&gt;

&lt;p&gt;It would be great if our favorite data source were to expose all of it in a convenient format such as CSV. We wish it were so.&lt;/p&gt;

&lt;p&gt;What if we told you that there is a solution? We are talking about &lt;strong&gt;Website Scraping&lt;/strong&gt;. It allows you to &lt;strong&gt;extract structured information from any source available on the Internet&lt;/strong&gt;. You can choose the data you will get and how to store it. And it is repeatable, meaning that you can run the same script every day - hour, or whatever- to get new items.&lt;/p&gt;

&lt;p&gt;Continue with us to learn how to code your own web scraper taking a job board as an example target. We will build step-by-step a web scraping project with Python.&lt;/p&gt;

&lt;p&gt;Curious? Let's dive in! 🤿&lt;/p&gt;

&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;p&gt;For the code to work, you will need &lt;a href="https://www.python.org/downloads/" rel="noopener noreferrer"&gt;python3 installed&lt;/a&gt;. Some systems have it pre-installed. After that, install all the necessary libraries by running &lt;code&gt;pip install&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;pip&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt; &lt;span class="n"&gt;beautifulsoup4&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Video Tutorial
&lt;/h3&gt;

&lt;p&gt;If you prefer video content, watch this video from &lt;a href="https://www.youtube.com/channel/UC1Q1IKz_s5gewPTltRsb0-g" rel="noopener noreferrer"&gt;our Youtube channel&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/jbvVjUABfr8"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Web Scraping
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://en.wikipedia.org/wiki/Web_scraping" rel="noopener noreferrer"&gt;Web scraping&lt;/a&gt; consists of &lt;strong&gt;extracting data from websites&lt;/strong&gt;. We could do it manually, but scraping generally refers to the automated way: software - usually called bot or crawler - visits web pages and gets the content we are after.&lt;/p&gt;

&lt;p&gt;The easier way to access data is via API (Application Programming Interface). The downside is that not many sites offer it or allow limited access. For those sites without API, extracting data directly from the content - scraping - is a viable solution.&lt;/p&gt;

&lt;p&gt;As we'll see later, many websites have countermeasures in place to avoid massive scraping. Do not worry for the moment; it should work for a few requests.&lt;/p&gt;

&lt;h2&gt;
  
  
  Web Scraping Process Summarized
&lt;/h2&gt;

&lt;p&gt;The most famous crawlers are search engines, like Google, which visit and index almost the whole Internet. We'd better start small, so we'll begin from the basics and build upon that. These are the four main parts that form website scraping:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Inspect the Target Site&lt;/strong&gt; Get a general idea of what data you can extract and how the page organizes it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Obtain the HTML&lt;/strong&gt; Access the content by downloading the page's HTML. We will focus on static content for simplicity.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Extract data&lt;/strong&gt; Obtain the information you are after, usually a piece of data or a list of repeated items (i.e., a job offer or job listings).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Store extracted data&lt;/strong&gt; Once extracted, transform and store the data for its use (i.e., save to a CSV file or insert in a database).&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Basic Structure of a Web Page
&lt;/h2&gt;

&lt;p&gt;Let's zoom out for a second and go back to the fundamentals - sorry, a bit of theory is coming. You can skip the following two sections if you are already familiar with the basics.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A web page consists mainly of HTML, CSS, Javascript (JS), and images&lt;/strong&gt;. Bear with us for a second if you don't understand parts of that sentence.&lt;/p&gt;

&lt;p&gt;A browser, like Chrome, opens a connection via the Internet to your favorite website using the HTTP protocol. The most common request type is &lt;code&gt;GET&lt;/code&gt;, which usually retrieves information without modifying it. Then the server processes it and sends back an HTML (HyperText Markup Language) response. Simplifying, &lt;strong&gt;HTML is a text file with a syntax that will tell the browser what content to paint, text to show, and what resources to download&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In those extra resources are the ones mentioned above:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CSS will format and style the content (i.e., colors, fonts, and many more).&lt;/li&gt;
&lt;li&gt;Javascript adds functionality and behavior, such as loading more job offers on infinite scroll.&lt;/li&gt;
&lt;li&gt;Images... well, you get the point. They can be used as part of the main content or as backgrounds.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The browser will handle all the requests/responses and render the final content. Everything shows the style defined by the CSS file. Thanks to the behavior in the JS files, the infinite scroll works perfectly. Images are loaded and displayed where they should. The browser is really doing a lot of work, but usually so fast that we don't even notice as users.&lt;/p&gt;

&lt;p&gt;However, the critical part for web scraping is the &lt;strong&gt;initial HTML&lt;/strong&gt;. Some pages will load content later, but we will focus - for clarity - on those that load everything initially. That difference is usually called static vs. dynamic pages.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;If your case involves dynamic pages, you can go to our article on &lt;a href="https://www.zenrows.com/blog/web-scraping-with-selenium-in-python?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=scraping_python_101" rel="noopener noreferrer"&gt;scraping with Selenium&lt;/a&gt;, a headless browser. In short, it launches a real browser to access the target webpage. But it is programmatically controlled, so you can extract the content you desire.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  What is HTML
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://en.wikipedia.org/wiki/HTML" rel="noopener noreferrer"&gt;HTML&lt;/a&gt; "is the standard markup language for &lt;strong&gt;documents designed to be displayed in a web browser&lt;/strong&gt;." It will structure the page using tags, each one meaning something different to the browser. For example, &lt;code&gt;&amp;lt;strong&amp;gt;&lt;/code&gt; will show text in &lt;strong&gt;bold&lt;/strong&gt;, and &lt;code&gt;&amp;lt;i&amp;gt;&lt;/code&gt; will do so in &lt;em&gt;italics&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Other components will control what can be done and not the display format. Examples of that are &lt;code&gt;&amp;lt;form&amp;gt;&lt;/code&gt; and &lt;code&gt;&amp;lt;input&amp;gt;&lt;/code&gt;, which allow us to fill and send forms to the server, such as logging in and registering.&lt;/p&gt;

&lt;p&gt;HTML elements might have attributes such as &lt;code&gt;class&lt;/code&gt; or &lt;code&gt;id&lt;/code&gt;, which are name-value pairs, separated by &lt;code&gt;=&lt;/code&gt;. They are optional but quite common, especially classes. CSS uses them for styling and Javascript for adding interactivity. Some of them are directly associated with a tag, like &lt;code&gt;href&lt;/code&gt; with &lt;code&gt;&amp;lt;a&amp;gt;&lt;/code&gt; - URL of the link tag.&lt;/p&gt;

&lt;p&gt;An example: &lt;code&gt;&amp;lt;a id="example" class="link-style" href="http://example.org/"&amp;gt;Go to example.org&amp;lt;/a&amp;gt;&lt;/code&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The tag &lt;code&gt;&amp;lt;a&amp;gt;&lt;/code&gt; tells us that it is a link.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;id="example"&lt;/code&gt; gives the element a unique ID. Browsers also use this for internal navigation as anchors.&lt;/li&gt;
&lt;li&gt;The CSS processor will interpret &lt;code&gt;class="link-style"&lt;/code&gt;, and it might show in a different color if so defined in the CSS file.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;href="http://example.org/"&lt;/code&gt; is the link's destination in case the user clicks the element.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Go to example.org&lt;/code&gt; is the text that the browser will show. Depending on the browser and CSS file, it might have some default style like a blue color or underline.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How to Obtain HTML using Requests
&lt;/h2&gt;

&lt;p&gt;Now that we've seen the basics, let's use Python and the &lt;a href="https://docs.python-requests.org/" rel="noopener noreferrer"&gt;&lt;code&gt;Requests&lt;/code&gt;&lt;/a&gt; library to download a page.&lt;/p&gt;

&lt;p&gt;We will start by importing the library and defining a variable with the URL we want to access. Then, use the &lt;code&gt;get&lt;/code&gt; function to &lt;strong&gt;obtain the page and print the response&lt;/strong&gt;. It will be an object with the response's data: the HTML and other essential pieces such as status code.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;

&lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://example.org/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# &amp;lt;Response [200]&amp;gt;
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# 200
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# &amp;lt;!doctype html&amp;gt;...
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We won't go into further detail on &lt;a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Status" rel="noopener noreferrer"&gt;status codes&lt;/a&gt;, which indicate whether the request was successful. The ones in the 200 - 299 range mean success (sometimes denoted &lt;code&gt;2XX&lt;/code&gt;), &lt;code&gt;3XX&lt;/code&gt; indicates a redirection, &lt;code&gt;4XX&lt;/code&gt; client error, and &lt;code&gt;5XX&lt;/code&gt; server error. Don't worry, the responses should be 200 in our tests, and you will learn later what to do otherwise.&lt;/p&gt;

&lt;h2&gt;
  
  
  Parsing Data with BeautifulSoup
&lt;/h2&gt;

&lt;p&gt;As seen in the response's text above, the data is there, but it is not easy to obtain the interesting bits. It is prepared to be consumed by a browser, not a human! To make it accessible, we will use &lt;a href="https://beautiful-soup-4.readthedocs.io/" rel="noopener noreferrer"&gt;&lt;code&gt;BeautifulSoup&lt;/code&gt;&lt;/a&gt;, "a Python library for &lt;strong&gt;pulling data out of HTML&lt;/strong&gt;."&lt;/p&gt;

&lt;p&gt;It will allow us to get the data we want using the classes and IDs mentioned above. Following the example in the previous section, we will access the title (&lt;code&gt;&amp;lt;h1&amp;gt;&lt;/code&gt;) and the link (&lt;code&gt;&amp;lt;a&amp;gt;&lt;/code&gt;). We'll see how we know what tags to look for in a moment.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3t80v8agz7w0snyuxxil.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3t80v8agz7w0snyuxxil.png" alt="Example.org Annotated"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;bs4&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BeautifulSoup&lt;/span&gt;

&lt;span class="n"&gt;soup&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BeautifulSoup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;html.parser&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;title&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;soup&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;h1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# Example Domain
&lt;/span&gt;
&lt;span class="n"&gt;link&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;soup&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;a&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;link&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# More information...
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What happens if we want to access several items? There are two paragraphs (&lt;code&gt;&amp;lt;p&amp;gt;&lt;/code&gt;) in the example, but the &lt;code&gt;find&lt;/code&gt; function would only get the first one. &lt;code&gt;find_all&lt;/code&gt; will do precisely that, returning a list (&lt;code&gt;ResultSet&lt;/code&gt;) instead of an element.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;paragraphs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;soup&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find_all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;p&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="c1"&gt;# 2
&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;paragraphs&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="c1"&gt;# ['This domain ...', 'More information...']
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But we can access more than just text. For example, the link's target URL is in the element's &lt;code&gt;href&lt;/code&gt; property, accessible with the &lt;code&gt;get&lt;/code&gt; function.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;link&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;href&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="c1"&gt;# https://www.iana.org/domains/example
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So far, so good. But this is a simple example, and we accessed all the items by the tags. What about a more complex page with tens of different tags? We will probably need classes, IDs, and a new concept: nesting.&lt;/p&gt;

&lt;p&gt;We will build a functional web scraper with an example site in the following sections. For now, a quick guide on the fundamental selectors:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;soup.find("div")&lt;/code&gt; will get the first element that matches the &lt;code&gt;div&lt;/code&gt; tag.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;soup.find(id="header")&lt;/code&gt; looks for a node which ID is &lt;code&gt;header&lt;/code&gt;. IDs are unique, so there shouldn't be more items with the same one.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;soup.find(class_="my-class")&lt;/code&gt; returns the first item that contains the &lt;code&gt;my-class&lt;/code&gt; class. Nodes can have multiple classes separated with spaces.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;soup.find("div").find(id="header").find(class_="my-class")&lt;/code&gt; As you can see, it might get complicated fast. The example gets the first &lt;code&gt;div&lt;/code&gt;, finds the &lt;code&gt;header&lt;/code&gt; by ID, and an item with &lt;code&gt;my-class&lt;/code&gt; inside that. It is called nesting and is common in HTML, where almost all items are nested inside a parent tag.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now that we've got the basics, we can move on to the fun part.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1: Explore Target with DevTools
&lt;/h2&gt;

&lt;p&gt;Hold your horses 🐎! We know that coding is the fun part, but first, you'll need some &lt;strong&gt;understanding of the page you're trying to scrape&lt;/strong&gt;. And not just the content but also the structure. We suggest browsing the target site for a few minutes with DevTools open (or any other tool).&lt;/p&gt;

&lt;p&gt;For the rest of the examples, we will use &lt;a href="https://remotive.io/" rel="noopener noreferrer"&gt;remotive.io&lt;/a&gt; to demonstrate what can be done. Its homepage contains lists of job offers. We'll go step-by-step, getting the data available and structuring it.&lt;/p&gt;

&lt;p&gt;To start exploring, go to the page on a new tab and open DevTolls. You can do it by pressing &lt;code&gt;Control+Shift+I&lt;/code&gt; (Windows/Linux), &lt;code&gt;Command+Option+I&lt;/code&gt; (Mac), or &lt;code&gt;Right-click ➡ Inspect&lt;/code&gt;. Now go to the Elements tab [#1 on the image below], which will show the page's HTML. To inspect an element on the page, click on the select icon (to the left) [#2], and it will allow you to pick an item using the mouse [#3]. It will be highlighted and expanded on the Network tab [#4].&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj3rl9xuo52of5264jguj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj3rl9xuo52of5264jguj.png" alt="DevTools open on remotive.io"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once familiarized with DevTools, take a look, click elements, inspect the HTML, and understand how classes are used. Discover as much as possible and identify the essential parts (hint: look for the job entries with the class &lt;code&gt;job-tile&lt;/code&gt;). We picked a well-structured site with classes that reflect the content, but it is not always the case.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc6ibp5uu2l1575r1qnb8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc6ibp5uu2l1575r1qnb8.png" alt="Remotive.io Annotated"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then explore also other page types, such as category or job offer. Maybe some details are essential to you but only available on the offer page. That would change the way to approach the coding. Or you realize that, as is the case, the structure for the category pages (i.e., &lt;a href="https://remotive.io/remote-jobs/software-dev" rel="noopener noreferrer"&gt;Software Development&lt;/a&gt;) is the same as the homepage. That means we could scrape all the category pages for a bigger offer list! We won't do that for now, but we could do it.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Spoiler: it is not precisely the same since category pages are dynamically rendered. As mentioned above, headless browsers are needed for those cases.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 2: Obtain HTML with Requests
&lt;/h2&gt;

&lt;p&gt;The first part of the code is the same as with example.org. &lt;strong&gt;Get the content&lt;/strong&gt; using &lt;code&gt;requests.get&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;

&lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://remotive.io/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# 200
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# &amp;lt;!DOCTYPE html&amp;gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As mentioned earlier, many sites have anti-scraping software. The most basic action is blocking an IP with too many requests. Adding proxies to &lt;code&gt;requests&lt;/code&gt; is simple, and they will hide your IP. This way, in case of being banned, your home/office address would be unaffected. But even better, some proxies add rotating power, which means that they assign a new IP to every request. That makes the banning part much more complicated, giving you room to scrape with fewer restrictions.&lt;/p&gt;

&lt;p&gt;Tagging IPs is just the most common measure, but they might implement many more such as &lt;a href="https://www.zenrows.com/blog/stealth-web-scraping-in-python-avoid-blocking-like-a-ninja?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=scraping_python_101" rel="noopener noreferrer"&gt;checking headers or geolocation&lt;/a&gt;. But for most sites, this approach should work fine.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;

&lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://remotive.io/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;proxies&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;https&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;http://api_key:@proxy.zenrows.com:8001&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;proxies&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;proxies&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;verify&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# &amp;lt;!DOCTYPE html&amp;gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We are fetching just one URL, which is fine to begin. But it gets more complicated when trying to &lt;a href="https://www.zenrows.com/blog/mastering-web-scraping-in-python-crawling-from-scratch?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=scraping_python_101" rel="noopener noreferrer"&gt;scrape at scale&lt;/a&gt;, so we will keep it sequential and straightforward at this point.&lt;/p&gt;

&lt;p&gt;This step might look like the easiest one at first since &lt;code&gt;requests&lt;/code&gt; handles it for us. But we mention the complications for you to be aware of 😅&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 3: Extract Data with BeautifulSoup
&lt;/h2&gt;

&lt;p&gt;This step is the one that changes most from case to case. Since it decides &lt;strong&gt;what data you will extract&lt;/strong&gt;, it varies for each page type you want. For example, the selectors for the offer page won't work for the homepage. And suppose anything changes on the page (new fields, design change). In that case, you will probably need to modify the scraper to reflect those changes.&lt;/p&gt;

&lt;p&gt;Following the job board example, the first thing we want to access is the job offer wrapper using ID (&lt;code&gt;id="initial_job_list"&lt;/code&gt;), a job offer, and its title using class (&lt;code&gt;class_="job-tile"&lt;/code&gt;). Note that it is "class_" and not "class" since it is a reserved word in Python.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;bs4&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BeautifulSoup&lt;/span&gt;

&lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://remotive.io/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;soup&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BeautifulSoup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;html.parser&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;jobs_wrapper&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;soup&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;initial_job_list&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;job&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;jobs_wrapper&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;class_&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;job-tile&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;job_title&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;job&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;class_&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;job-tile-title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;job_title&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# Senior Front-End Engineer
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;See how we use the variable created in the previous line instead of &lt;code&gt;soup&lt;/code&gt;? That means that the lookup will take place in the element - i.e., &lt;code&gt;jobs_wrapper&lt;/code&gt; - and not the whole page. This is an example of the previously mentioned nesting but storing them in variables instead of concatenation.&lt;/p&gt;

&lt;p&gt;There is a slight problem there, right? We only got ONE job offer! We need to change &lt;code&gt;jobs_wrapper.find&lt;/code&gt; for &lt;code&gt;jobs_wrapper.find_all&lt;/code&gt; to get all the job offers and then extract the data we want for each one. We will move that logic to a function to separate concerns.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;#...
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;extract_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;job&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;job_title&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;job&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;class_&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;job-tile-title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;extra&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;job&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;p&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;·&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;job&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;job_title&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;link&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;job_title&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;href&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;company&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;extra&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;location&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;extra&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;category&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;job&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;class_&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;remotive-tag-transparent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;jobs_wrapper&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;soup&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;initial_job_list&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;jobs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;jobs_wrapper&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find_all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;class_&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;job-tile&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;extract_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;job&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;job&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;jobs&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# [
#   {
#     "id": "971825-software-dev",
#     "title": "Senior Front-End Engineer (JS/Graphics/3D)",
#     "link": "/remote-jobs/software-dev/senior-front-end-engineer-js-graphics-3d-971825",
#     "company": "Spline",
#     "location": "Remote (Anywhere)",
#     "category": "Software Development"
#   },
#   ...
# ]
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;BeautifulSoup also offers the &lt;code&gt;select&lt;/code&gt; function, which takes a CSS selector and returns an array of matching elements.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;jobs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;jobs_wrapper&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.job-tile&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For more ideas, check out our article on &lt;a href="https://www.zenrows.com/blog/mastering-web-scraping-in-python-from-zero-to-hero?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=scraping_python_101" rel="noopener noreferrer"&gt;tricks when scraping&lt;/a&gt;. We explain that you can get relevant data from different sources such as hidden inputs, tables, or metadata. The selectors used here are very useful, but there are many alternatives.&lt;/p&gt;

&lt;p&gt;And the same idea applies to extracting data, although we already moved it to the &lt;code&gt;extract_data&lt;/code&gt; function. This way, if we were to change any selector because the paged changed, we would locate it quickly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 4: Convert and Store Data with Pandas
&lt;/h2&gt;

&lt;p&gt;We can read and operate with the extracted data. But for future analysis and usage, it is better to store it. In a real-world project, a database is usually the chosen option. We will save it in a CSV file for the demo.&lt;/p&gt;

&lt;p&gt;Since CSV files and formats are beyond this tutorial, we will use the &lt;a href="https://pandas.pydata.org/" rel="noopener noreferrer"&gt;&lt;code&gt;pandas&lt;/code&gt;&lt;/a&gt; library. The first thing is to create a &lt;code&gt;DataFrame&lt;/code&gt; from the data we got. It accepts dictionaries usually without problems. Then, we call the helpful &lt;code&gt;to_csv&lt;/code&gt; function to convert the dataframe to CSV and save it in a file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="c1"&gt;# ...
&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;offers.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgmu292em957man21s1x5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgmu292em957man21s1x5.png" alt="Scraped Offers CSV File"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As with any piece of software, we should follow good development practices. Single responsibility is one of them, and we should separate each of the steps above for maintainability. Storing CSV files is probably not a good option long-term. And, when the moment comes, it will be easier to replace if that part is separated from the extraction logic.&lt;/p&gt;

&lt;p&gt;And the same idea applies to extracting data, although we already moved it to the &lt;code&gt;extract_data&lt;/code&gt; function. This way, if we were to change any selector because the paged changed, we would locate it quickly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Next Steps
&lt;/h2&gt;

&lt;p&gt;Celebrate now! You built your first scraper. 🎉&lt;/p&gt;

&lt;h3&gt;
  
  
  Error Handling
&lt;/h3&gt;

&lt;p&gt;We said earlier that you should not worry about the error codes for the tutorial. But for a real-world project, there are a few things that your should consider adding:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Check &lt;code&gt;response.ok&lt;/code&gt;. The library comes with a boolean that tells us if it went correctly. &lt;/li&gt;
&lt;li&gt;Error handling for the critical parts (&lt;code&gt;try/except&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.zenrows.com/knowledge/retry-failed-requests?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=scraping_python_101" rel="noopener noreferrer"&gt;Retry requests in case of error&lt;/a&gt;.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;bs4&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BeautifulSoup&lt;/span&gt;

&lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://remotive.io/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ok&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# ...
&lt;/span&gt;    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# status_code is 4XX or 5XX
&lt;/span&gt;&lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Scaling Up
&lt;/h3&gt;

&lt;p&gt;The following steps would be to &lt;a href="https://www.zenrows.com/blog/mastering-web-scraping-in-python-scaling-to-distributed-crawling?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=scraping_python_101" rel="noopener noreferrer"&gt;scale the scrapers&lt;/a&gt; and automatize running them. We won't go deep into these topics, just a quick overview.&lt;/p&gt;

&lt;p&gt;Following the example, two main things come to mind: scraping category pages instead of the homepage and getting more data from the job offer detail page. Depending on the intention behind the scraping, one might have more sense than the other.&lt;/p&gt;

&lt;p&gt;We'll take the second one, scraping further details from each offer. We have already extracted the links in the previous steps to get them. We have to loop over them, get the link, and request the URL. Prepend the domain since it is a relative path. We simplified the data extraction part for brevity. The point is the same as above, extract the data you need using selectors.&lt;/p&gt;

&lt;p&gt;Be careful when running the code below. We added a "trick" to slice the array and get only two job offer details. The idea is to avoid a hundred requests to the same host in seconds and make it faster.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="c1"&gt;# trick to avoid a hundred requests
&lt;/span&gt;
&lt;span class="n"&gt;job_details&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;job&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://remotive.io&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;job&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;link&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;soup&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BeautifulSoup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;html.parser&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;job_title&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;soup&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;h1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;job_desc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;soup&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;class_&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;job-description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;job_meta&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;soup&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;job-meta-data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;job_details&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;job_title&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="c1"&gt;# ...
&lt;/span&gt;    &lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;job_details&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# [
#     {
#         'title': 'Senior Front-End Engineer (JS/Graphics/3D)'
#     }, {
#         'title': 'Engineering Manager'
#     }, ...
# ]
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Other Countermeasures
&lt;/h3&gt;

&lt;p&gt;Many other problems appear when scraping, one of the most common being blocks. For privacy or security, some sites will not allow access to users that requests too many pages. Or will show captchas to ensure that they are real users.&lt;/p&gt;

&lt;p&gt;We can take many actions to avoid blocks. Still, as mentioned earlier, the most effective one is using &lt;a href="https://www.zenrows.com/?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=scraping_python_101" rel="noopener noreferrer"&gt;smart proxies&lt;/a&gt; that will change your IP in every request. If you are interested or have any problems, check out our article on &lt;a href="https://www.zenrows.com/blog/stealth-web-scraping-in-python-avoid-blocking-like-a-ninja?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=scraping_python_101" rel="noopener noreferrer"&gt;avoiding detection&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;We'd like you to part with the four main steps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Explore the target site before coding&lt;/li&gt;
&lt;li&gt;Retrieve the content (HTML)&lt;/li&gt;
&lt;li&gt;Extract the data you need with selectors&lt;/li&gt;
&lt;li&gt;Transform and store it for its use&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you leave with those points clear, we'll be more than happy 🥳&lt;/p&gt;

&lt;p&gt;Web scraping with Python, or any other language/tool, is a long road. Try not to feel overwhelmed by the immensity of resources available. Scraping Instagram on your first day will probably not be possible since they are well-known blockers. Start with an easier target and gain some confidence.&lt;/p&gt;

&lt;p&gt;Feel free to ask any follow-up questions or contact us.&lt;/p&gt;

&lt;p&gt;Thanks for reading! Did you find the content helpful? Please, spread the word and share it. 👈&lt;/p&gt;




&lt;p&gt;Originally published at &lt;a href="https://www.zenrows.com/blog/web-scraping-with-python?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=scraping_python_101" rel="noopener noreferrer"&gt;https://www.zenrows.com&lt;/a&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>webdev</category>
      <category>tutorial</category>
      <category>beginners</category>
    </item>
    <item>
      <title>DOs and DON'Ts of Web Scraping</title>
      <dc:creator>Ander Rodriguez</dc:creator>
      <pubDate>Tue, 21 Dec 2021 11:09:49 +0000</pubDate>
      <link>https://dev.to/anderrv/dos-and-donts-of-web-scraping-4lnl</link>
      <guid>https://dev.to/anderrv/dos-and-donts-of-web-scraping-4lnl</guid>
      <description>&lt;p&gt;For those of you new to web scraping, regular users, or just curious: these tips are &lt;em&gt;golden&lt;/em&gt;. Scraping might seem an easy-entry activity, and it is. But it will take you down a rabbit hole. Before you realize it, you got blocked from a website, your code is 110% spaghetti, and there's no way you can scale that to another four sites.&lt;/p&gt;

&lt;p&gt;Ever been there? ✋ I was there 10 years ago — no shame (well, just a bit). Continue with us for a few minutes, and we'll help you navigate through the rabbit hole. 🕳️&lt;/p&gt;

&lt;h2&gt;
  
  
  DO Rotate IPs
&lt;/h2&gt;

&lt;p&gt;The simplest and most common anti-scraping technique is to ban by IP. The server will show you the first pages, but it will detect too much traffic from the same IP and block it after some time. Then your scraper will be unusable. And you won't even be able to access the webpage from a real browser. The first lesson on web scraping is &lt;strong&gt;never to use your actual IP&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Every request leaves a trace, even if you try to avoid it from your code. There are some parts of the networking that you cannot control. But you can use a proxy to change your IP. The server will see an IP, but it won't be yours. The next step, rotate the IP or use a service that will do it for you. What does this even mean?&lt;/p&gt;

&lt;p&gt;You can use a different IP every few seconds or per request. The target server can't identify your requests and won't block those IPs. You can build a massive list of proxies and take one randomly for every request. Or use a &lt;a href="https://www.zenrows.com/?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=dos_and_donts" rel="noopener noreferrer"&gt;Rotating Proxy&lt;/a&gt; which will do that for you. Either way. The chances of your scraper working correctly skyrocketed with just this change.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;

&lt;span class="n"&gt;urls&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://ident.me&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="c1"&gt;# ... more URLs
&lt;/span&gt;&lt;span class="n"&gt;proxy_list&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;54.37.160.88:1080&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;18.222.22.12:3128&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="c1"&gt;# ... more proxy IPs
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;urls&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;proxy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;choice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;proxy_list&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;proxies&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;proxy&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;proxy&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;proxies&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;proxies&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# prints 54.37.160.88 (or any other proxy IP)
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Note that these &lt;a href="https://free-proxy-list.net/" rel="noopener noreferrer"&gt;free proxies&lt;/a&gt; might not work for you. They are short-time lived.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  DO Use Custom User-Agent
&lt;/h2&gt;

&lt;p&gt;The second-most-common anti-scraping mechanism is &lt;a href="https://en.wikipedia.org/wiki/User_agent" rel="noopener noreferrer"&gt;User-Agent&lt;/a&gt;. UA is a header that browsers send in requests to identify themselves. They are usually a long string declaring the browser's name, version, platform, and many more. An example for an iPhone 13:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Mozilla/5.0 (iPhone; CPU iPhone OS 13_2_3 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.3 Mobile/15E148 Safari/604.1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There is nothing wrong with sending a User-Agent, and it is actually recommended to do so. The problem is &lt;strong&gt;which&lt;/strong&gt; one to send. Many HTTP clients send their own (cURL, requests in Python, or Axios in Javascript), which might be suspicious. Can you imagine your server getting hundreds of requests with a &lt;code&gt;"curl/7.74.0"&lt;/code&gt; UA? You'd be skeptical at the very least.&lt;/p&gt;

&lt;p&gt;The solution is usually finding valid UAs, like the one from the iPhone above, and using them. But it might turn against you also. Thousands of requests with &lt;strong&gt;exactly&lt;/strong&gt; the same version in short periods?&lt;/p&gt;

&lt;p&gt;So the next step is to have several valid and modern User-Agents and use those. And to keep the list updated. As with the IPs, rotate the UA in every request in your code.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# ... same as above 
&lt;/span&gt;&lt;span class="n"&gt;user_agents&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt; 
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Mozilla/5.0 (iPhone ...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Mozilla/5.0 (Windows ...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="c1"&gt;# ... more User-Agents 
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; 

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;urls&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; 
    &lt;span class="n"&gt;proxy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;choice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;proxy_list&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
    &lt;span class="n"&gt;proxies&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;proxy&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;proxy&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; 
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;proxies&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;proxies&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  DO Research Target Content
&lt;/h2&gt;

&lt;p&gt;Take a look at the source code before starting development. Many websites offer more manageable ways to scrape data than CSS selectors. A standard method of exposing data is through rich snippets, for example, via Schema.org JSON or &lt;code&gt;itemprop&lt;/code&gt; data attributes. Others use hidden inputs for internal purposes (i.e., IDs, categories, product code), and you can take advantage. There's more than meets the eye.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3htvki9tvrla802axizp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3htvki9tvrla802axizp.png" alt="Hidden Inputs on Amazon Products"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Some other sites rely on XHR requests after the first load to get the data. And it comes structured! For us, the easier way is to browse the site with DevTools open and check both the HTML and Network tab. You will have a clear vision and decide how to extract the data in a few minutes. These tricks are not always available, but you can save a headache by using them. Metadata, for example, tends to change less than HTML or CSS classes, making it more reliable and maintainable long-term.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff1kax9byyycrbchj0ykb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff1kax9byyycrbchj0ykb.png" alt="Auction.com XHR Requests"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We wrote about &lt;a href="https://www.zenrows.com/blog/mastering-web-scraping-in-python-from-zero-to-hero?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=dos_and_donts#explore-before-coding" rel="noopener noreferrer"&gt;exploring before coding&lt;/a&gt; with examples and code in Python; check out for more info.&lt;/p&gt;

&lt;h2&gt;
  
  
  DO Parallelize Requests
&lt;/h2&gt;

&lt;p&gt;After switching gear and scaling up, the old one-file sequential script will not be enough. You probably need to "professionalize" it. For a tiny target and a few URLs getting them one by one might be enough. But then scale it to thousands and different domains. It won't work correctly.&lt;/p&gt;

&lt;p&gt;One of the first steps of that scaling would be to get &lt;strong&gt;several URLs simultaneously&lt;/strong&gt; and not stop the whole scraping for a slow response. Going from 50-line-script to Google scale is a giant leap, but the first steps are achievable. There are the main things you'll need: concurrency and a queue.&lt;/p&gt;

&lt;h3&gt;
  
  
  Concurrency
&lt;/h3&gt;

&lt;p&gt;The main idea is to send multiple requests simultaneously but with a limit. And then, as soon as a response arrives, send a new one. Let's say the limit is ten. That would mean that ten URLs would always be running at any given time until there are no more, which brings us to the next step.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;We wrote a &lt;a href="https://www.zenrows.com/knowledge/using-concurrency?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=dos_and_donts" rel="noopener noreferrer"&gt;guide on using concurrency&lt;/a&gt; (examples in Python and Javascript).&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Queue
&lt;/h3&gt;

&lt;p&gt;A queue is a data structure that allows adding items to be processed later. You can start the crawling with a single URL, get the HTML and extract the links you want. Add those to the queue, and they will start running. Keep on doing the same, and you built a scalable crawler. Some points are missing, like deduplicating URLs (not crawling the same one twice) or infinite loops. But the easy way to solve it would be to set a maximum number of pages crawled and stop once you get there.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;We have an article with an example in Python &lt;a href="https://www.zenrows.com/knowledge/scrape-and-crawl-from-a-seed-url?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=dos_and_donts" rel="noopener noreferrer"&gt;scraping from a seed URL&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Still far from Google scale (obviously), but you can go to thousands of pages with this approach. To be more precise, you can have different settings per domain to avoid overloading a single target. We'll leave that up to you 😉&lt;/p&gt;

&lt;h2&gt;
  
  
  DON'T Use Headless Browsers for Everything
&lt;/h2&gt;

&lt;p&gt;Selenium, Puppeteer, and Playwright are great, no doubt, but not a silver bullet. They bring a resource overhead and slow down the scraping process. So why use them? 100% needed for Javascript rendered content and helpful in many circumstances. But ask yourself if that's your case.&lt;/p&gt;

&lt;p&gt;Most of the sites serve the data, one way or another, on the first HTML request. Because of that, we advocate going the other way around. Test first plain HTML by using your favorite tool and language (cURL, requests in Python, Axios in Javascript, whatever). Check for the content you need: text, IDs, prices. Be careful here since sometimes the data you see on the browser might be encoded (i.e.," shown in plain HTML as "). Copy &amp;amp; paste might not work. 😅&lt;/p&gt;

&lt;p&gt;In some cases, you won't find the info because it is not there on the first load, for example, in &lt;a href="https://angular.io/" rel="noopener noreferrer"&gt;Angular.io&lt;/a&gt;. No problem, headless browsers come in handy for those cases. Or &lt;a href="https://www.zenrows.com/blog/web-scraping-intercepting-xhr-requests?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=dos_and_donts" rel="noopener noreferrer"&gt;XHR scraping&lt;/a&gt; as shown above for Auction.&lt;/p&gt;

&lt;p&gt;If you find the info, try to write the extractors. A quick hack might be good enough for a test. Once you have identified all the content you want, the following point is to separate generic crawling code from the custom one for the target site.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Using Python's "requests": 2.41 seconds&lt;/li&gt;
&lt;li&gt;A playwright with chromium opening a new browser per request: 11.33 seconds&lt;/li&gt;
&lt;li&gt;Playwright with chromium sharing browser and context for all the URLs: 7.13 seconds&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;It is not 100% conclusive nor statistically accurate, but it shows the difference. In the best case, we are talking about 3x slower using Playwright, and sharing context is not always a good idea. And we are not even talking about CPU and memory consumption.&lt;/p&gt;

&lt;h2&gt;
  
  
  DON'T Couple Code to Target
&lt;/h2&gt;

&lt;p&gt;Some actions are independent of the website you are scraping: get HTML, parse it, queue new links to crawl, store content, and more. In an ideal scenario, we would separate those from the ones that depend on the target site: CSS selectors, URL structure, DDBB structure.&lt;/p&gt;

&lt;p&gt;The first script is usually entangled; no problem there. But as it grows and new pages are added, separating responsibilities is crucial. We know, easier said than done. But to pause and think matters to develop a maintainable and scalable scraper.&lt;/p&gt;

&lt;p&gt;We published a &lt;a href="https://github.com/ZenRows/scaling-to-distributed-crawling" rel="noopener noreferrer"&gt;repository&lt;/a&gt; and blog post about &lt;a href="https://www.zenrows.com/blog/mastering-web-scraping-in-python-scaling-to-distributed-crawling?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=dos_and_donts" rel="noopener noreferrer"&gt;distributed crawling in Python&lt;/a&gt;. It is a bit more complicated than what we've seen so far. It uses external software (Celery for asynchronous task queue and Redis as the database).&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;How to get the HTML (requests VS headless browser)&lt;/li&gt;
&lt;li&gt;Filter URLs to queue for crawling&lt;/li&gt;
&lt;li&gt;What content to extract (CSS selectors)&lt;/li&gt;
&lt;li&gt;Where to store the data (a list in Redis)
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# ... 
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;extract_content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;soup&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; 
    &lt;span class="c1"&gt;# ... 
&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;store_content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; 
    &lt;span class="c1"&gt;# ... 
&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;allow_url_filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; 
    &lt;span class="c1"&gt;# ... 
&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_html&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; 
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;headless_chromium&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_html&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;random_headers&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;proxies&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;random_proxies&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It is still far from massive scale production-ready. But code reuse is easy, as is adding new domains. And when adding updated browsers or headers, it would be easy to modify the old scrapers to use those.&lt;/p&gt;

&lt;h2&gt;
  
  
  DON'T Take Down your Target Site
&lt;/h2&gt;

&lt;p&gt;Your extra load might be a drop in the ocean for Amazon but a burden for a small independent store. Be mindful of the scale of your scraping and the size of your targets.&lt;/p&gt;

&lt;p&gt;You can probably crawl hundreds of pages at Amazon concurrently, and they won't even notice (careful nonetheless). But many websites run on a single shared machine with poor specs, and they deserve our understanding. Tune down your scripts capabilities for those sites. It might complicate the code, but stopping if the response times increase would be nice.&lt;/p&gt;

&lt;p&gt;Another point is to inspect and comply with their robots.txt. Mainly two rules: &lt;strong&gt;do not scrape disallowed pages and obey &lt;code&gt;Crawl-Delay&lt;/code&gt;&lt;/strong&gt;. That directive is not common, but when present, represents the amount of seconds crawlers should wait between requests. There is a Python module that can help us to &lt;a href="https://docs.python.org/3.8/library/urllib.robotparser.html" rel="noopener noreferrer"&gt;comply with robots.txt&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;We will not go into details but do not perform malicious activities (there should be no need to say it, just in case). We are always talking about extracting data without breaking the law or causing damage to the target site.&lt;/p&gt;

&lt;h2&gt;
  
  
  DON'T Mix Headers from Different Browsers
&lt;/h2&gt;

&lt;p&gt;This last technique is for higher-level anti-bot solutions. Browsers send several headers with a set format that varies from version to version. And advanced solutions check those and compare them to a real-world header set database. Which means you will raise red flags when sending the wrong ones. Or even more difficult to notice, by not sending the right ones! Visit &lt;a href="https://www.httpbin.org/headers" rel="noopener noreferrer"&gt;httpbin&lt;/a&gt; to see the headers your browser sends. Probably more than you imagine and some you haven't even heard of! &lt;code&gt;"Sec-Ch-Ua"&lt;/code&gt;? 😕&lt;/p&gt;

&lt;p&gt;There is no easy way out of this but to have an actual full set of headers. And to have &lt;strong&gt;plenty&lt;/strong&gt; of them, one for each User-Agent you use. Not one for Chrome and another for iPhone, nope. One. Per. User-Agent. 🤯&lt;/p&gt;

&lt;p&gt;Some people try to avoid this by using headless browsers, but we already shaw why it is better to avoid them. And anyway, you are not on the clear with them. They send the whole set of headers that work for that browser on that version. If you modify any of that, the rest might not be valid. If using Chrome with Puppeteer and overwriting the UA to use the iPhone one... you can have a surprise. A real iPhone does not send &lt;code&gt;"Sec-Ch-Ua"&lt;/code&gt;, but Puppeteer will since you overwrote UA but didn't delete that one.&lt;/p&gt;

&lt;p&gt;Some sites offer &lt;a href="https://developers.whatismybrowser.com/useragents/explore/software_name/chrome/" rel="noopener noreferrer"&gt;a list of User-Agents&lt;/a&gt;. But it is hard to get the complete sets for hundreds of them, which is the needed scale when scraping at complex sites.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# ... 
&lt;/span&gt;
&lt;span class="n"&gt;header_sets&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt; 
    &lt;span class="p"&gt;{&lt;/span&gt; 
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Accept-Encoding&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gzip, deflate, br&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Cache-Control&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;no-cache&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;User-Agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Mozilla/5.0 (iPhone ...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
        &lt;span class="c1"&gt;# ... 
&lt;/span&gt;    &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; 
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;User-Agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Mozilla/5.0 (Windows ...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
        &lt;span class="c1"&gt;# ... 
&lt;/span&gt;    &lt;span class="p"&gt;},&lt;/span&gt; 
    &lt;span class="c1"&gt;# ... more header sets 
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; 

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;urls&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; 
    &lt;span class="c1"&gt;# ... 
&lt;/span&gt;    &lt;span class="n"&gt;headers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;choice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;header_sets&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;proxies&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;proxies&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We know this last one was a bit picky. But some anti-scraping solutions can be super-picky and even more than headers. Some might check browser or even connection fingerprinting — high-level stuff.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Rotating IPs and having good headers will allow you to crawl and scrape most websites. Use headless browsers only when necessary and apply Software Engineering good practices.&lt;/p&gt;

&lt;p&gt;Build small and grow from there, add functionalities and use cases. But always try to keep scale and maintainability in mind while keeping success rates high. Don't despair if you get blocked from time to time, and learn from every case.&lt;/p&gt;

&lt;p&gt;Web scraping at scale is a challenging and long journey, but you might not need the best ever system. Nor a 100% accuracy. If it works on the domains you want, good enough! Do not freeze trying to reach perfection since you probably don't need it.&lt;/p&gt;

&lt;p&gt;In case of doubts, questions, or suggestions, do not hesitate to contact us.&lt;/p&gt;

&lt;p&gt;Thanks for reading! Did you find the content helpful? Please, spread the word and share it. 👈&lt;/p&gt;




&lt;p&gt;Originally published at &lt;a href="https://www.zenrows.com/blog/dos-and-donts-of-web-scraping?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=dos_and_donts" rel="noopener noreferrer"&gt;https://www.zenrows.com&lt;/a&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>beginners</category>
      <category>webdev</category>
      <category>codenewbie</category>
    </item>
    <item>
      <title>Intro to Web Scraping with Selenium in Python</title>
      <dc:creator>Ander Rodriguez</dc:creator>
      <pubDate>Tue, 30 Nov 2021 09:42:59 +0000</pubDate>
      <link>https://dev.to/anderrv/intro-to-web-scraping-with-selenium-in-python-4011</link>
      <guid>https://dev.to/anderrv/intro-to-web-scraping-with-selenium-in-python-4011</guid>
      <description>&lt;p&gt;Ever heard of &lt;a href="https://en.wikipedia.org/wiki/Headless_browser" rel="noopener noreferrer"&gt;headless browsers&lt;/a&gt;? Mainly used for testing purposes, they give us an excellent opportunity for scraping websites that require Javascript execution or any other feature that browsers offer.&lt;/p&gt;

&lt;p&gt;You'll learn how to use &lt;a href="https://www.selenium.dev/documentation/webdriver/" rel="noopener noreferrer"&gt;Selenium&lt;/a&gt; and its multiple features to scrape and browser any web page. From finding elements to waiting for dynamic content to load. Modify the window size and take screenshots. Or add proxies and custom headers to avoid blocks. You can achieve all of that and more with this headless browser.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;p&gt;For the code to work, you will need &lt;a href="https://www.python.org/downloads/" rel="noopener noreferrer"&gt;python3 installed&lt;/a&gt;. Some systems have it pre-installed. After that, install Selenium, &lt;a href="https://www.google.com/chrome/" rel="noopener noreferrer"&gt;Chrome&lt;/a&gt;, and the driver for &lt;a href="https://sites.google.com/chromium.org/driver/downloads" rel="noopener noreferrer"&gt;Chrome&lt;/a&gt;. Make sure to match the browser and driver versions, Chrome 96, as of this writing.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;selenium 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Other browsers are available (Edge, IE, Firefox, Opera, Safari), and the code should work with minor adjustments.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;p&gt;Once set up, we will write our first test. Go to a sample URL and print its current URL and title. The browser will follow redirects automatically and load all the resources - images, stylesheets, javascript, and more.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;selenium&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;webdriver&lt;/span&gt;

&lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://zenrows.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;webdriver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Chrome&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;current_url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# https://www.zenrows.com/
&lt;/span&gt;    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# Web Scraping API &amp;amp; Data Extraction - ZenRows
&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If your Chrome driver is not in an executable path, you need to specify it or move the driver to somewhere in the path (i.e., &lt;code&gt;/usr/bin/&lt;/code&gt;).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;chrome_driver_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;/path/to/chromedriver&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;webdriver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Chrome&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;executable_path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;chrome_driver_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# ...
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You noticed that the browser is showing, and you can see it, right? It won't run headless by default. We can pass options to the driver, which is what we want to do for scraping.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;options&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;webdriver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ChromeOptions&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;headless&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;webdriver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Chrome&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# ...
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Finding Elements and Content
&lt;/h2&gt;

&lt;p&gt;Once the page is loaded, we can start looking for the information we are after. Selenium offers several ways to access elements: ID, tag name, class, XPath, and CSS selectors.&lt;/p&gt;

&lt;p&gt;Let's say that we want to search for something on Amazon by using the text input. We could use select by tag from the previous options: &lt;code&gt;driver.find_element(By.TAG_NAME, "input")&lt;/code&gt;. But this might be a problem since there are several inputs on the page. By inspecting the page, we see that it has an ID, so we change the selector: &lt;code&gt;driver.find_element(By.ID, "twotabsearchtextbox")&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;IDs probably don't change often, and they are a more secure way of extracting info than classes. The problem usually comes from not finding them. Assuming there is no ID, we can select the search form and then the input inside.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;selenium&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;webdriver&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;selenium.webdriver.common.by&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;By&lt;/span&gt;

&lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://www.amazon.com/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;webdriver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Chrome&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="nb"&gt;input&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find_element&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;By&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CSS_SELECTOR&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;form[role=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;search&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;] input[type=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There is no silver bullet; each option is appropriate for a set of cases. You'll need to find the one that best suits your needs.&lt;/p&gt;

&lt;p&gt;If we scroll down the page, we'll see many products and categories. And a shared class that often repeats: &lt;code&gt;a-list-item&lt;/code&gt;. We need a similar function (&lt;code&gt;find_elements&lt;/code&gt; in plural) to match all the items and not just the first occurrence.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;#...
&lt;/span&gt;    &lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;items&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find_elements&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;By&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CLASS_NAME&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;a-list-item&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fog9lyku5xvtuxe3oxufw.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fog9lyku5xvtuxe3oxufw.jpg" alt="Amazon Items"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now we need to do something with the selected elements.&lt;/p&gt;

&lt;h2&gt;
  
  
  Interacting with the Elements
&lt;/h2&gt;

&lt;p&gt;We'll search using the input selected above. For that, we need the &lt;code&gt;send_keys&lt;/code&gt; function that will type and hit enter to send the form. We could also type into the input and then find the submit button and click on it (&lt;code&gt;element.click()&lt;/code&gt;). It is easier in this case since the &lt;code&gt;Enter&lt;/code&gt; works fine.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;selenium.webdriver.common.keys&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Keys&lt;/span&gt;
&lt;span class="c1"&gt;#...
&lt;/span&gt;    &lt;span class="nb"&gt;input&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find_element&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;By&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CSS_SELECTOR&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;form[role=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;search&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;] input[type=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send_keys&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Python Books&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;Keys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ENTER&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice that the script does not wait and closes as soon as the search finishes. The logical thing is to do something afterward, so we'll list the results using &lt;code&gt;find_elements&lt;/code&gt; as above. Inspecting the result, we can use the &lt;code&gt;s-result-item&lt;/code&gt; class.&lt;/p&gt;

&lt;p&gt;These items we will just select are &lt;code&gt;div&lt;/code&gt;s with several inner tags. We could take the link's &lt;code&gt;href&lt;/code&gt; values if interested and visit each item - we won't do that for the moment. But the &lt;code&gt;h2&lt;/code&gt; tags contain the book's title, so we need to select the title for each element. We can continue using &lt;code&gt;find_element&lt;/code&gt; since it will work for &lt;code&gt;driver&lt;/code&gt;, as seen before, and for any web element.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# ...
&lt;/span&gt;    &lt;span class="n"&gt;items&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find_elements&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;By&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CLASS_NAME&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s-result-item&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;h2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find_element&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;By&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TAG_NAME&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;h2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;h2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# Prints a list of around fifty items
&lt;/span&gt;
&lt;span class="c1"&gt;# Learning Python, 5th Edition ...
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Don't rely too much on this approach since some tags might be empty or have no title. We should adequately implement error control for an actual use case.&lt;/p&gt;

&lt;h3&gt;
  
  
  Infinite Scroll
&lt;/h3&gt;

&lt;p&gt;For those cases when there is an infinite scroll (Pinterest), or images are lazily loaded (Twitter), we can go down also using the keyboard. Not often used, but scroll using the space bar, "Page Down", or "End" keys is an option. And we can take advantage of that.&lt;/p&gt;

&lt;p&gt;The driver won't accept it directly. We need to find first an element like &lt;code&gt;body&lt;/code&gt; and send the keys there.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find_element&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;By&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TAG_NAME&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;send_keys&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Keys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;END&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But there is still another problem: items will not be present just after scrolling. That brings us to the next part.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wait for Content or Element
&lt;/h2&gt;

&lt;p&gt;Nowadays, many websites are Javascript intense - especially when using modern frameworks like React - and do lots of XHR calls after the first load. As with the infinite scroll, all that content won't be available to Selenium. But we can manually inspect the target website and check what the result of that processing is.&lt;/p&gt;

&lt;p&gt;It usually comes down to creating some DOM elements. If those classes are unique or they have IDs, we can wait for those. We can use the &lt;code&gt;WebDriverWait&lt;/code&gt; to put the script on hold until some criteria are met.&lt;/p&gt;

&lt;p&gt;Assume a simple case where there are no images present until some XHR finishes. This instruction will return the &lt;code&gt;img&lt;/code&gt; element as soon as it appears. The driver will wait for 3 seconds and fail otherwise.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;selenium.webdriver.support.ui&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;WebDriverWait&lt;/span&gt;
&lt;span class="c1"&gt;# ...
&lt;/span&gt;    &lt;span class="n"&gt;el&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;WebDriverWait&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;until&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find_element&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;By&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TAG_NAME&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;img&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Selenium provides several &lt;a href="https://selenium-python.readthedocs.io/waits.html#explicit-waits" rel="noopener noreferrer"&gt;expected conditions&lt;/a&gt; that might prove valuable. &lt;code&gt;element_to_be_clickable&lt;/code&gt; is an excellent example in a page full of Javascript, since many buttons are not interactive until some actions occur.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;selenium.webdriver.support&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;expected_conditions&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;EC&lt;/span&gt;
&lt;span class="c1"&gt;#...
&lt;/span&gt;    &lt;span class="n"&gt;button&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;WebDriverWait&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;until&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;EC&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;element_to_be_clickable&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;By&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CLASS_NAME&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;my-button&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Screenshots and Element Screenshots
&lt;/h2&gt;

&lt;p&gt;Be it for testing purposes or storing changes, screenshots are a practical tool. We can take a screenshot for the current browser context or a given element.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# ...
&lt;/span&gt;    &lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;save_screenshot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;page.png&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0iqw6pu5auos5n2fbphv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0iqw6pu5auos5n2fbphv.png" alt="Amazon Small Screen"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# ... 
&lt;/span&gt;    &lt;span class="n"&gt;card&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find_element&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;By&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CLASS_NAME&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;a-cardui&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
    &lt;span class="n"&gt;card&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;screenshot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;amazon_card.png&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj1k0ggjivilbknylo8ry.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj1k0ggjivilbknylo8ry.png" alt="Amazon Card Element"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Noticed the problem with the first image? Nothing wrong, but the size is probably not what you were expecting. Selenium loads by default in 800px by 600px when browsing in headless mode. But we can modify it to take bigger screenshots.&lt;/p&gt;

&lt;h2&gt;
  
  
  Window Size
&lt;/h2&gt;

&lt;p&gt;We can query the driver to check the size we're launching in: &lt;code&gt;driver.get_window_size()&lt;/code&gt;, which will print &lt;code&gt;{'width': 800, 'height': 600}&lt;/code&gt;. When using GUI, those numbers will change, so let's assume that we're testing headless mode.&lt;/p&gt;

&lt;p&gt;There is a similar function - &lt;code&gt;set_window_size&lt;/code&gt; - that will modify the window size. Or we can add an options argument to the Chrome web driver that will directly start the browser with that resolution.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--window-size=1024,768&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;webdriver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Chrome&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_window_size&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="c1"&gt;# {'width': 1024, 'height': 768}
&lt;/span&gt;
    &lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_window_size&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1920&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;1200&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_window_size&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="c1"&gt;# {'width': 1920, 'height': 1200}
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And now our screenshot will be 1920px wide.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbafmahlr7yf1e94vs9wx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbafmahlr7yf1e94vs9wx.png" alt="Amazon Big Screen"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Custom Headers
&lt;/h2&gt;

&lt;p&gt;The options mentioned above provide us with a crucial mechanism for web scraping: custom headers.&lt;/p&gt;

&lt;h3&gt;
  
  
  User Agent
&lt;/h3&gt;

&lt;p&gt;One of the essential headers to avoid blocks is User-Agent. Selenium will provide an accurate one by default, but you can change it for a custom one. Remember that there are &lt;a href="https://www.zenrows.com/blog/stealth-web-scraping-in-python-avoid-blocking-like-a-ninja?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=selenium_python" rel="noopener noreferrer"&gt;many techniques to crawl and scrape without blocks&lt;/a&gt; and we won't cover them all here.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;user_agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;user-agent=%s&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;user_agent&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;webdriver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Chrome&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find_element&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;By&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TAG_NAME&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# UA matches the one hardcoded above, v93
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Other Important Headers
&lt;/h3&gt;

&lt;p&gt;As a quick summary, changing the user-agent might be counterproductive if we forget to adjust some other headers. For example, the &lt;code&gt;sec-ch-ua&lt;/code&gt; header usually sends a version of the browser, and it must much the user-agent's one: &lt;code&gt;"Google Chrome";v="96"&lt;/code&gt;. But some older versions do not send that header at all, so sending it might also be suspicious.&lt;/p&gt;

&lt;p&gt;The problem is Selenium does not support adding headers. A third-party solution like &lt;a href="https://github.com/wkeeling/selenium-wire" rel="noopener noreferrer"&gt;Selenium Wire&lt;/a&gt; might solve it. Install it with &lt;code&gt;pip install selenium-wire&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;It will allow us to intercept requests, among other things, and modify the headers we want or add new ones. When changing, we must delete the original one first to avoid sending duplicates.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;seleniumwire&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;webdriver&lt;/span&gt;

&lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://httpbin.org/anything&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;user_agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="n"&gt;sec_ch_ua&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'"&lt;/span&gt;&lt;span class="s"&gt;Google Chrome&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;;v=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;93&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; Not;A Brand&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;;v=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;99&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Chromium&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;;v=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;93&lt;/span&gt;&lt;span class="sh"&gt;"'&lt;/span&gt;
&lt;span class="n"&gt;referer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;https://www.google.com&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;

&lt;span class="n"&gt;options&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;webdriver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ChromeOptions&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;headless&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;interceptor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;del&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;user-agent&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  &lt;span class="c1"&gt;# Delete the header first
&lt;/span&gt;    &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;user-agent&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;user_agent&lt;/span&gt;
    &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sec-ch-ua&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sec_ch_ua&lt;/span&gt;
    &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;referer&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;referer&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;webdriver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Chrome&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request_interceptor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;interceptor&lt;/span&gt;
    &lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find_element&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;By&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TAG_NAME&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Proxy to change the IP
&lt;/h2&gt;

&lt;p&gt;As with the headers, Selenium has limited support for proxies. We can add a proxy without authentication as a driver option. For testing, we'll use &lt;a href="https://free-proxy-list.net/" rel="noopener noreferrer"&gt;Free Proxies&lt;/a&gt; although they are not reliable, and the one below probably won't work for you at all. They are usually short-lived.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;selenium&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;webdriver&lt;/span&gt;
&lt;span class="c1"&gt;# ...
&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://httpbin.org/ip&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;proxy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;85.159.48.170:40014&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="c1"&gt;# free proxy
&lt;/span&gt;&lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;--proxy-server=%s&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;proxy&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;webdriver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Chrome&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find_element&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;By&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TAG_NAME&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# "origin": "85.159.48.170"
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For more complex solutions or ones in need of auth, Selenium Wire can help us again. We need a second set of options in this case, where we will add the proxy server we want to use.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="n"&gt;proxy_pass&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;seleniumwire_options&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;proxy&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;proxy_pass&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;:@proxy.zenrows.com:8001&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;verify_ssl&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;webdriver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Chrome&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;seleniumwire_options&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;seleniumwire_options&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find_element&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;By&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TAG_NAME&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For proxy servers that don't rotate IPs automatically, &lt;code&gt;driver.proxy&lt;/code&gt; can be overwritten. From that point on, all requests will use the new proxy. This action can be done as many times as necessary. For convenience and reliability, we advocate for &lt;a href="https://www.zenrows.com/?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=selenium_python" rel="noopener noreferrer"&gt;Smart Rotating Proxies&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;#...
&lt;/span&gt;    &lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Initial proxy
&lt;/span&gt;    &lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;proxy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;http&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;http://user:pass@1.2.3.4:5678&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# New proxy
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Blocking Resources
&lt;/h2&gt;

&lt;p&gt;For performance, saving bandwidth, or avoiding tracking, blocking some resources might prove crucial when scaling scraping.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;selenium&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;webdriver&lt;/span&gt;

&lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://www.amazon.com/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="n"&gt;options&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;webdriver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ChromeOptions&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;headless&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
&lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;experimental_options&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prefs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;profile.managed_default_content_settings.images&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;webdriver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Chrome&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;save_screenshot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;amazon_without_images.png&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbbgp265hoan1bxdnlyv9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbbgp265hoan1bxdnlyv9.png" alt="Amazon Without Images"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We could even go a step further and avoid loading almost any type. Careful with this since blocking Javascript would mean no AJAX calls, for example.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;experimental_options&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prefs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;profile.managed_default_content_settings.images&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;profile.managed_default_content_settings.stylesheets&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;profile.managed_default_content_settings.javascript&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;profile.managed_default_content_settings.cookies&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;profile.managed_default_content_settings.geolocation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;profile.default_content_setting_values.notifications&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Intercepting Requests
&lt;/h3&gt;

&lt;p&gt;Once again, thanks to Selenium Wire, we could decide programmatically over requests. It means that we can effectively block some images while allowing others. And we also can block domains using &lt;code&gt;exclude_hosts&lt;/code&gt; or allow only specific requests based on URLs matching against a regular expression with &lt;code&gt;driver.scopes&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;interceptor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Block PNG and GIF images, will show JPEG for example
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;endswith&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.png&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.gif&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)):&lt;/span&gt;
        &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;abort&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;webdriver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Chrome&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request_interceptor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;interceptor&lt;/span&gt;
    &lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Execute Javascript
&lt;/h2&gt;

&lt;p&gt;The last Selenium feature we want to mention is executing Javascript. Some things are easier done directly in the browser, or we want to check that it worked correctly. We can &lt;code&gt;execute_script&lt;/code&gt; passing the JS code we want to be executed. It can go without params or with elements as params.&lt;/p&gt;

&lt;p&gt;We can see both cases in the examples below. There is no need for params to get the User-Agent as the browser sees it. That might prove helpful to check that the one sent is being modified correctly in the &lt;code&gt;navigator&lt;/code&gt; object since some security checks might raise red flags otherwise. The second one will take an &lt;code&gt;h2&lt;/code&gt; as an argument and return its left position by accessing &lt;code&gt;getClientRects&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;webdriver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Chrome&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute_script&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;return navigator.userAgent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# Mozilla/5.0  ... Chrome/96 ...
&lt;/span&gt;
    &lt;span class="n"&gt;header&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find_element&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;By&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CSS_SELECTOR&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;h2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;headerText&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute_script&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;return arguments[0].getClientRects()[0].left&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;header&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;headerText&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# 242.5
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Selenium is a valuable tool with many applications, but you have to take advantage of them in your way. Apply each feature in your favor. And many times, there are several ways of arriving at the same point; look for the one that helps you the most - or the easiest one.&lt;/p&gt;

&lt;p&gt;Once you get the handle, you'll want to grow your scraping and get more pages. There is where other challenges might appear: &lt;a href="https://www.zenrows.com/blog/mastering-web-scraping-in-python-crawling-from-scratch?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=selenium_python" rel="noopener noreferrer"&gt;crawling at scale&lt;/a&gt; and &lt;a href="https://www.zenrows.com/blog/stealth-web-scraping-in-python-avoid-blocking-like-a-ninja?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=selenium_python" rel="noopener noreferrer"&gt;blocks&lt;/a&gt;. Some tips above will help you: check the headers and proxy sections. But also be aware that crawling at scale is not an easy task. Please don't say we didn't warn you 😅&lt;/p&gt;

&lt;p&gt;I hope you leave with an understanding of how Selenium works in Python (it goes the same for other languages). An important topic that we did not cover is when Selenium is necessary. Because many times you can save time, bandwidth, and server performance by scraping without a browser. Or you can contact us, and we'll be delighted to help you crawl, scrape and scale whatever you need!&lt;/p&gt;

&lt;p&gt;Thanks for reading! Did you find the content helpful? Please, spread the word and share it. 👈&lt;/p&gt;




&lt;p&gt;Originally published at &lt;a href="https://www.zenrows.com/blog/web-scraping-with-selenium-in-python?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=selenium_python" rel="noopener noreferrer"&gt;https://www.zenrows.com&lt;/a&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>beginners</category>
      <category>tutorial</category>
      <category>webdev</category>
    </item>
    <item>
      <title>Web Scraping: Intercepting XHR Requests</title>
      <dc:creator>Ander Rodriguez</dc:creator>
      <pubDate>Wed, 27 Oct 2021 13:25:39 +0000</pubDate>
      <link>https://dev.to/anderrv/web-scraping-intercepting-xhr-requests-1hke</link>
      <guid>https://dev.to/anderrv/web-scraping-intercepting-xhr-requests-1hke</guid>
      <description>&lt;p&gt;Have you ever tried scraping AJAX websites? Sites full of Javascript and XHR calls? Decipher tons of nested CSS selectors? Or worse, daily changing selector? Maybe you won't need that ever again. Keep on reading, XHR scraping might prove your ultimate solution!&lt;/p&gt;

&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;p&gt;For the code to work, you will need &lt;a href="https://www.python.org/downloads/" rel="noopener noreferrer"&gt;python3 installed&lt;/a&gt;. Some systems have it pre-installed. After that, install Playwright and the browser binaries for Chromium, Firefox, and WebKit.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;

pip &lt;span class="nb"&gt;install &lt;/span&gt;playwright
playwright &lt;span class="nb"&gt;install&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Intercept Responses
&lt;/h2&gt;

&lt;p&gt;As we saw in a previous blog post about &lt;a href="https://www.zenrows.com/blog/blocking-resources-in-playwright?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=xhr_scraping" rel="noopener noreferrer"&gt;blocking resources&lt;/a&gt;, headless browsers allow request and response inspection. We will use &lt;a href="https://playwright.dev/python/" rel="noopener noreferrer"&gt;Playwright&lt;/a&gt; in python for the demo, but it can be done in Javascript or using &lt;a href="https://github.com/puppeteer/puppeteer" rel="noopener noreferrer"&gt;Puppeteer&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;We can quickly inspect all the responses on a page. As we can see below, the &lt;code&gt;response&lt;/code&gt; parameter contains the status, URL, and content itself. And that's what we'll be using instead of directly scraping content in the HTML using CSS selectors.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Use case: auction.com
&lt;/h2&gt;

&lt;p&gt;Our first example will be &lt;a href="https://www.auction.com/residential/ca/" rel="noopener noreferrer"&gt;auction.com&lt;/a&gt;. You might need proxies or a VPN Since it blocks outside of the countries they operate in. Anyway, it might be a problem trying to scrape from your IP since they will ban it eventually. Check out &lt;a href="https://www.zenrows.com/blog/stealth-web-scraping-in-python-avoid-blocking-like-a-ninja?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=xhr_scraping" rel="noopener noreferrer"&gt;how to avoid blocking&lt;/a&gt; if you find any issues.  &lt;/p&gt;

&lt;p&gt;Here is a basic example of loading the page using Playwright while logging all the responses.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;playwright.sync_api&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;sync_playwright&lt;/span&gt;

&lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://www.auction.com/residential/ca/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;sync_playwright&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;browser&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;firefox&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;launch&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;new_page&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;goto&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;wait_until&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;networkidle&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;90000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;content&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

    &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;code&gt;auction.com&lt;/code&gt; will load an HTML skeleton without the content we are after (house prices or auction dates). They will then load several resources such as images, CSS, fonts, and Javascript. If we wanted to save some bandwidth, we could filter out some of those. For now, we're going to focus on the attractive parts.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fljdhw1v2nlp3si0qpzmb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fljdhw1v2nlp3si0qpzmb.png" alt="Auction.com screenshot with DevTools open"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As we can see in the network tab, almost all relevant content comes from an XHR call to an assets endpoint. Ignoring the rest, we can inspect that call by checking that the response URL contains this string: &lt;code&gt;if ("v1/search/assets?" in response.url)&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;There is a size and time problem: the page will load tracking and map, which will amount to more than a minute in loading (using proxies) and 130 requests :O. We could do better by blocking certain domains and resources. We were able to do it in under 20 seconds with only 7 loaded resources in our tests. We will leave that as an exercise for you ;)&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

&amp;lt;&amp;lt; 407 https://www.auction.com/residential/ca/ 
&amp;lt;&amp;lt; 200 https://www.auction.com/residential/ca/ 
&amp;lt;&amp;lt; 200 https://cdn.auction.com/residential/page-assets/styles.d5079a39f6.prod.css 
&amp;lt;&amp;lt; 200 https://cdn.auction.com/residential/page-assets/framework.b3b944740c.prod.js 
&amp;lt;&amp;lt; 200 https://cdn.cookielaw.org/scripttemplates/otSDKStub.js 
&amp;lt;&amp;lt; 200 https://static.hotjar.com/c/hotjar-45084.js?sv=5 
&amp;lt;&amp;lt; 200 https://adc-tenbox-prod.imgix.net/resi/propertyImages/no_image_available.v1.jpg 
&amp;lt;&amp;lt; 200 https://cdn.mlhdocs.com/rcp_files/auctions/E-19200/photos/thumbnails/2985798-1-G_bigThumb.jpg 
# ...


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;For a more straightforward solution, we decided to change to the &lt;code&gt;wait_for_selector&lt;/code&gt; function. It is not the ideal solution, but we noticed that sometimes the script stops altogether before loading the content. To avoid those cases, we change the waiting method.&lt;/p&gt;

&lt;p&gt;While inspecting the results, we saw that the wrapper was there from the skeleton. But each houses' content is not. So we will wait for one of those: &lt;code&gt;"h4[data-elm-id]"&lt;/code&gt;.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;sync_playwright&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;handle_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# the endpoint we are insterested in 
&lt;/span&gt;        &lt;span class="nf"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;v1/search/assets?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;result&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;assets&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;asset&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="c1"&gt;# ...
&lt;/span&gt;    &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;handle_response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# really long timeout since it gets stuck sometimes
&lt;/span&gt;    &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;goto&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;120000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;wait_for_selector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;h4[data-elm-id]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;120000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Here we have the output, with even more info than the interface offers! Everything is clean and nicely formatted 😎&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"item_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"E192003"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"global_property_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2981226&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"property_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;5444765&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"property_address"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"13841 COBBLESTONE CT"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"property_city"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"FONTANA"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"property_county"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"San Bernardino"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"property_state"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"CA"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"property_zip"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"92335"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"property_type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"SFR"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"seller_code"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"FSH"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"beds"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"baths"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"sqft"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1704&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"lot_size"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"latitude"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;34.10391&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"longitude"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;-117.50212&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="w"&gt;


&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;We could go a step further and use the pagination to get the whole list, but we'll leave that to you.&lt;/p&gt;

&lt;h2&gt;
  
  
  Use case: twitter.com
&lt;/h2&gt;

&lt;p&gt;Another typical case where there is no initial content is &lt;a href="https://twitter.com/playwrightweb/status/1396888644019884033" rel="noopener noreferrer"&gt;Twitter&lt;/a&gt;. To be able to scrape Twitter, you will undoubtedly need Javascript Rendering. As in the previous case, you could use CSS selectors once the entire content is loaded. But beware, since Twitter classes are dynamic and they will change frequently.&lt;/p&gt;

&lt;p&gt;What will most probably remain the same is the API endpoint they use internally to get the main content: &lt;code&gt;TweetDetail.&lt;/code&gt; In cases like this one, the easiest path is to check the XHR calls in the network tab in devTools and look for some content in each request. It is an excellent example because Twitter can make 20 to 30 JSON or XHR requests per page view.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl58ce6ve9nzglorc5ey2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl58ce6ve9nzglorc5ey2.png" alt="Twitter.com screenshot with DevTools open"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once we identify the calls and the responses we are interested in, the process will be similar.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;playwright.sync_api&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;sync_playwright&lt;/span&gt;

&lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://twitter.com/playwrightweb/status/1396888644019884033&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;sync_playwright&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;handle_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# the endpoint we are insterested in
&lt;/span&gt;        &lt;span class="nf"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/TweetDetail?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()))&lt;/span&gt;

    &lt;span class="n"&gt;browser&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;firefox&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;launch&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;new_page&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;handle_response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;goto&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;wait_until&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;networkidle&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The output will be a considerable JSON (80kb) with more content than we asked for. More than ten nested structures until we arrive at the tweet content. The good news is that we can now access favorite, retweet, or reply counts, images, dates, reply tweets with their content, and many more.&lt;/p&gt;

&lt;h2&gt;
  
  
  Use case: nseindia.com
&lt;/h2&gt;

&lt;p&gt;Stock markets are an ever-changing source of essential data. Some sites offering this info, such as &lt;a href="https://www.nseindia.com/market-data/live-equity-market" rel="noopener noreferrer"&gt;the National Stock Exchange of India&lt;/a&gt;, will start with an empty skeleton. After browsing for a few minutes on the site, we see that the market data loads via XHR.&lt;/p&gt;

&lt;p&gt;Another common clue is to view the page source and check for content there. If it's not there, it usually means that it will load later, which probably requires XHR requests. And we can intercept those!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxaxkhh30qti712bchfk5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxaxkhh30qti712bchfk5.png" alt="NSE screenshot with DevTools open"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Since we are parsing a list, we will loop over it a print only part of the data in a structured way: symbol and price for each entry.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;playwright.sync_api&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;sync_playwright&lt;/span&gt;

&lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://www.nseindia.com/market-data/live-equity-market&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;sync_playwright&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;handle_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# the endpoint we are insterested in
&lt;/span&gt;        &lt;span class="nf"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;equity-stockIndices?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;symbol&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;lastPrice&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

    &lt;span class="n"&gt;browser&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;firefox&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;launch&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;new_page&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;handle_response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;goto&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;wait_until&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;networkidle&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Output:
# NIFTY 50 18125.4
# ICICIBANK 846.75
# AXISBANK 845
# ...
&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;As in the previous examples, this is a simplified example. Printing is not the solution to a real-world problem. Instead, each page structure should have a &lt;a href="https://www.zenrows.com/blog/mastering-web-scraping-in-python-scaling-to-distributed-crawling?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=xhr_scraping#custom-parser" rel="noopener noreferrer"&gt;content extractor and a method to store it&lt;/a&gt;. And the system should also handle the crawling part independently.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;We'd like you to go with three main points:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Inspect the page looking for clean data&lt;/li&gt;
&lt;li&gt;API endpoints change less often than CSS selectors, and HTML structure&lt;/li&gt;
&lt;li&gt;Playwright offers more than just Javascript rendering&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Even if the extracted data is the same, fail-tolerance and effort in writing the scraper are fundamental factors. The less you have to change them manually, the better.&lt;/p&gt;

&lt;p&gt;Apart from XHR requests, there are many other ways to &lt;a href="https://www.zenrows.com/blog/mastering-web-scraping-in-python-from-zero-to-hero?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=xhr_scraping" rel="noopener noreferrer"&gt;scrape data beyond selectors&lt;/a&gt;. Not every one of them will work on a given website, but adding them to your toolbelt might help you often.&lt;/p&gt;

&lt;p&gt;Thanks for reading! Did you find the content helpful? Please, spread the word and share it. 👈&lt;/p&gt;




&lt;p&gt;Originally published at &lt;a href="https://www.zenrows.com/blog/web-scraping-intercepting-xhr-requests?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=xhr_scraping" rel="noopener noreferrer"&gt;https://www.zenrows.com&lt;/a&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>webdev</category>
      <category>tutorial</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Blocking Resources with Playwright</title>
      <dc:creator>Ander Rodriguez</dc:creator>
      <pubDate>Wed, 29 Sep 2021 13:35:11 +0000</pubDate>
      <link>https://dev.to/anderrv/blocking-resources-with-playwright-3053</link>
      <guid>https://dev.to/anderrv/blocking-resources-with-playwright-3053</guid>
      <description>&lt;p&gt;Did you know that Playwright allows you to block requests and thus speed up your scraping or testing? You can block certain resource types like images, any requests by domain, or many different ways.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;p&gt;For the code to work, you will need &lt;a href="https://www.python.org/downloads/" rel="noopener noreferrer"&gt;python3 installed&lt;/a&gt;. Some systems have it pre-installed. After that, install Playwright and the browser binaries for Chromium, Firefox, and WebKit.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;

pip &lt;span class="nb"&gt;install &lt;/span&gt;playwright
playwright &lt;span class="nb"&gt;install&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Intro to Playwright
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/microsoft/playwright-python" rel="noopener noreferrer"&gt;Playwright&lt;/a&gt; "is a Python library to automate Chromium, Firefox, and WebKit browsers with a single API." It allows us to browse the Internet with a headless browser programmatically.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Playwright is also available for Node.js, and everything shown below can be done with a similar syntax. Check the &lt;a href="https://playwright.dev/docs/intro/" rel="noopener noreferrer"&gt;docs&lt;/a&gt; for more details.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Here's how to start a browser (i.e., Chromium) in a few lines, navigate to a page, and get its title.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;playwright.sync_api&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;sync_playwright&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;sync_playwright&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;browser&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chromium&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;launch&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;new_page&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;goto&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://www.zenrows.com/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;title&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="c1"&gt;# Web Scraping API &amp;amp; Data Extraction - ZenRows
&lt;/span&gt;    &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  Loggin Network Events
&lt;/h3&gt;

&lt;p&gt;Subscribe to events such as &lt;a href="https://playwright.dev/python/docs/api/class-page/#page-event-request" rel="noopener noreferrer"&gt;request&lt;/a&gt; or &lt;a href="https://playwright.dev/python/docs/api/class-page/#page-event-response" rel="noopener noreferrer"&gt;response&lt;/a&gt; and log their content to see what is happening. Since we did not tell Playwright otherwise, it will load the entire page: HTML, CSS, execute Javascript, get images, and so on. Add these two lines before requesting the page to see what's going on.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;request&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;gt;&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;method&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;resource_type&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;goto&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://www.zenrows.com/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# &amp;gt;&amp;gt; GET https://www.zenrows.com/ document
# &amp;lt;&amp;lt; 200 https://www.zenrows.com/
# &amp;gt;&amp;gt; GET https://cdn.zenrows.com/images_dash/logo-instagram.svg image
# &amp;lt;&amp;lt; 200 https://cdn.zenrows.com/images_dash/logo-instagram.svg
&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The entire output is almost 50 lines long, with 24 resource requests. We probably don't need most of those for website scraping, so we will see how to block them and save time and bandwidth.&lt;/p&gt;

&lt;h2&gt;
  
  
  Blocking Resources
&lt;/h2&gt;

&lt;p&gt;Why load resources and content that we won't use? Learn how to avoid unnecessary data and network requests with these techniques.&lt;/p&gt;

&lt;h3&gt;
  
  
  Block by Glob Pattern
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;page&lt;/code&gt; also exposes a method &lt;a href="https://playwright.dev/python/docs/api/class-page/#page-route" rel="noopener noreferrer"&gt;&lt;code&gt;route&lt;/code&gt;&lt;/a&gt; that will execute a handler for each matching route or pattern. Let's say that we don't want SVGs to load. Using a pattern like &lt;code&gt;"**/*.svg"&lt;/code&gt; will match requests ending with that extension. As for the handler, we need no logic for the moment, only to abort the request. For that, we'll use a lambda and the &lt;code&gt;route&lt;/code&gt; param's abort method.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;route&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;**/*.jpg&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;route&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;route&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;abort&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;goto&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://www.zenrows.com/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;em&gt;Note: according to the official documentation, patterns like &lt;code&gt;"**/*.{png,jpg,jpeg}"&lt;/code&gt; should work, but we found otherwise. Anyway, it's doable with the next blocking strategy.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Block by Regex
&lt;/h3&gt;

&lt;p&gt;If (for some reason 😜) you are into Regex, feel free to use them. But compiling them first is mandatory. We will block three image extensions in this case. Regex are tricky, but they offer a ton of flexibility.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;
&lt;span class="c1"&gt;# ...
&lt;/span&gt;    &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;route&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;\.(jpg|png|svg)$&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;route&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;route&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;abort&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;goto&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://www.zenrows.com/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Now there are 23 requests and only 15 responses. We saved 8 images from being downloaded!&lt;/p&gt;

&lt;h3&gt;
  
  
  Block by Resource Type
&lt;/h3&gt;

&lt;p&gt;But what happens if they use "jpeg" extension instead of "jpg"? Or avif, gif, webp? Should we maintain an updated list?&lt;/p&gt;

&lt;p&gt;Luckily for us, the &lt;code&gt;route&lt;/code&gt; param exposed in the lambda function above includes the original request and resource type. And one of those types is &lt;code&gt;image&lt;/code&gt;, perfect! You can access the &lt;a href="https://playwright.dev/python/docs/api/class-request#request-resource-type" rel="noopener noreferrer"&gt;whole resource type list&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;We'll now match every request (&lt;code&gt;"**/*"&lt;/code&gt;) and add conditional logic to the lambda function. In case it is an image, abort the request as before. Else, continue with it as usual.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;route&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;**/*&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;route&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;route&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;abort&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;route&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;resource_type&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="n"&gt;route&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;continue_&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;goto&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://www.zenrows.com/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;em&gt;Take into consideration that some trackers use images. It is probably not a big deal when scraping or testing, but just in case.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Function handler
&lt;/h3&gt;

&lt;p&gt;We can also define functions for the handlers instead of using lambdas. That comes in handy in case we need to reuse it or it grows past a single conditional.&lt;/p&gt;

&lt;p&gt;Suppose that we want to block aggressively now. Looking at the output from the previous runs will show a list of the used resources. We'll add those to a list and then check if the type is in that list.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="n"&gt;excluded_resource_types&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stylesheet&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;script&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;font&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;block_aggressively&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;route&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nf"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;route&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;resource_type&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;excluded_resource_types&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;route&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;abort&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;route&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;continue_&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="c1"&gt;# ...
&lt;/span&gt;    &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;route&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;**/*&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;block_aggressively&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;goto&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://www.zenrows.com/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;We are entirely in control now, and the versatility is absolute. From &lt;code&gt;routes.request&lt;/code&gt;, the original URL, the headers, and several other info are available.&lt;/p&gt;

&lt;p&gt;Being even more strict: block everything that is not &lt;code&gt;document&lt;/code&gt; type. That will effectively prevent anything but the initial HTML from being loaded. &lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;block_aggressively&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;route&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nf"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;route&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;resource_type&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;document&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;route&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;abort&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;route&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;continue_&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;There is a single response now! We got the HTML without any other resource being downloaded. We sure have saved a lot of time and bandwidth, right? But... how much exactly?&lt;/p&gt;

&lt;h2&gt;
  
  
  Measuring Performance Boost
&lt;/h2&gt;

&lt;p&gt;We can only say that it got better if we could measure the differences. We will take a look at three approaches. But just running a script with several URLs will do. &lt;em&gt;Spoiler: we did that for 10 URLs, 1.3 seconds VS 8.4.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  HAR Files
&lt;/h3&gt;

&lt;p&gt;For those of you used to checking the DevTools Network tab, we have good news! Playwright allows HAR recording by providing an extra parameter in the &lt;code&gt;new_page&lt;/code&gt; method. As easy as that.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;new_page&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;record_har_path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;playwright_test.har&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;goto&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://www.zenrows.com/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;There are some &lt;a href="https://gitgrimbo.github.io/harviewer/master/" rel="noopener noreferrer"&gt;HAR visualizers&lt;/a&gt; out there, but the easiest way is to use Chrome DevTools. Open the Network tab and click on the import button or drag&amp;amp;drop the HAR file.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0s47li1ausin4o84e1hy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0s47li1ausin4o84e1hy.png" alt="Open HAR file in DevTools"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Check time! Below is the comparison between two different HAR files. The first one is without blocking (regular navigation). The second one is blocking everything except for the initial document.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwtvaxazy7m5grmhy7mu9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwtvaxazy7m5grmhy7mu9.png" alt="Full site HAR"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkntiqkjlehhitfg3reic.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkntiqkjlehhitfg3reic.png" alt="Blocked site HAR"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Almost every resource has a "-1" Status and "Pending" Time on the blocking side. That's DevTools's way of telling us that those were blocked and not downloaded. We can see clearly on the bottom left that we performed fewer requests, and the transferred data amount is a fraction of the original! From 524kB to 17.4kB, a 96% cut.&lt;/p&gt;

&lt;h3&gt;
  
  
  Browser's Performance API
&lt;/h3&gt;

&lt;p&gt;The browsers offer an interface to check the &lt;a href="https://developer.mozilla.org/en-US/docs/Web/API/Performance" rel="noopener noreferrer"&gt;performance&lt;/a&gt; that shows how it went for things like timing. Playwright can evaluate Javascript, so we'll use it to print those results.&lt;/p&gt;

&lt;p&gt;The output will be a JSON object with a lot of timestamps. The most straightforward check is to get the difference between &lt;code&gt;navigationStart&lt;/code&gt; and &lt;code&gt;loadEventEnd&lt;/code&gt;. When blocking, it should be under half a second (i.e., 346ms); for regular navigation, above a second or even two (i.e., 1363ms).&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;goto&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://www.zenrows.com/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;evaluate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;JSON.stringify(window.performance)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="c1"&gt;# {"timing":{"connectStart":1632902378272,"navigationStart":1632902378244, ...
&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;As you can see, blocking can be a second faster, even more for slower sites. The less you download, the faster you can scrape!&lt;/p&gt;

&lt;h3&gt;
  
  
  CDP Session
&lt;/h3&gt;

&lt;p&gt;Going a step further, we connect directly with &lt;a href="https://chromedevtools.github.io/devtools-protocol/" rel="noopener noreferrer"&gt;Chrome DevTools Protocol&lt;/a&gt;. Playwright creates a &lt;a href="https://playwright.dev/python/docs/api/class-cdpsession/" rel="noopener noreferrer"&gt;CDP Session&lt;/a&gt; for us to extract, for example, performance metrics.&lt;/p&gt;

&lt;p&gt;We have to create a client from the page context and start communication with CDP. In our case, we will enable "Performance" before visiting the page and getting the metrics after it.&lt;/p&gt;

&lt;p&gt;The output will be a JSON-like string with interesting values such as nodes, process time, JS Heap used, and many more.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;new_cdp_session&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Performance.enable&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;goto&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://www.zenrows.com/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Performance.getMetrics&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1o5si8k0puv89dckyucv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1o5si8k0puv89dckyucv.png" alt="Performance diff with CDP"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;We'd like you to part with three main points:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Load only needed resources.&lt;/li&gt;
&lt;li&gt;Save time and bandwidth when possible.&lt;/li&gt;
&lt;li&gt;Measure your efforts and performance before scaling up.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Let's not forget that website scraping is a process with multiple steps, one of them being &lt;a href="https://www.zenrows.com/blog/stealth-web-scraping-in-python-avoid-blocking-like-a-ninja#rotating-proxies" rel="noopener noreferrer"&gt;rotating proxies&lt;/a&gt;. Those add processing time and, sometimes, charge per bandwidth.&lt;/p&gt;

&lt;p&gt;You can achieve precisely the same results while saving time/bandwidth/money. Many times, images or CSS are only overhead. Some others, no need for JS if you only want the initial static content.&lt;/p&gt;

&lt;p&gt;Thanks for reading! Did you find the content helpful? Please, spread the word and share it. 👈&lt;/p&gt;




&lt;p&gt;Originally published at &lt;a href="https://www.zenrows.com/blog/blocking-resources-in-playwright?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=blocking_playwright" rel="noopener noreferrer"&gt;https://www.zenrows.com&lt;/a&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>programming</category>
      <category>beginners</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Web Scraping with Javascript and Node.js</title>
      <dc:creator>Ander Rodriguez</dc:creator>
      <pubDate>Wed, 01 Sep 2021 14:51:35 +0000</pubDate>
      <link>https://dev.to/anderrv/web-scraping-with-javascript-and-node-js-2d</link>
      <guid>https://dev.to/anderrv/web-scraping-with-javascript-and-node-js-2d</guid>
      <description>&lt;p&gt;Javascript and web scraping are both on the rise. We will combine them to build a simple scraper and crawler from scratch using Javascript in Node.js.&lt;/p&gt;

&lt;p&gt;Avoiding blocks is an essential part of website scraping, so we will also add some features to help in that regard. And finally, parallelize the tasks to go faster thanks to &lt;a href="https://nodejs.org/en/docs/guides/event-loop-timers-and-nexttick/" rel="noopener noreferrer"&gt;Node's event loop&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;p&gt;For the code to work, you will need &lt;a href="https://nodejs.org/en/" rel="noopener noreferrer"&gt;Node&lt;/a&gt; (or &lt;a href="https://github.com/nvm-sh/nvm" rel="noopener noreferrer"&gt;nvm&lt;/a&gt;) and npm installed. Some systems have it pre-installed. After that, install all the necessary libraries by running &lt;code&gt;npm install&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install &lt;/span&gt;axios cheerio playwright
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;We are using Node v12, but you can always &lt;a href="https://node.green/" rel="noopener noreferrer"&gt;check the compatibility of each feature&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://axios-http.com/" rel="noopener noreferrer"&gt;Axios&lt;/a&gt; is a "promise based HTTP client" that we will use to get the HTML from a URL. It allows several options such as headers and proxies, which we will cover later. If you use TypeScript, they "include TypeScript definitions and a type guard for Axios errors."&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cheerio.js.org/" rel="noopener noreferrer"&gt;Cheerio&lt;/a&gt; is a "fast, flexible &amp;amp; lean implementation of core jQuery." It lets us find nodes with selectors, get text or attributes, and many other things. We will pass the HTML to cheerio and then query it as we would in a browser environment.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/microsoft/playwright" rel="noopener noreferrer"&gt;Playwright&lt;/a&gt; "is a Node.js library to automate Chromium, Firefox and WebKit with a single API." When Axios is not enough, we will get the HTML using a headless browser to execute Javascript and wait for the async content to load.&lt;/p&gt;
&lt;h2&gt;
  
  
  Scraping the Basics
&lt;/h2&gt;

&lt;p&gt;The first thing we need is the HTML. We installed Axios for that, and its usage is straightforward. We'll use &lt;a href="https://scrapeme.live/shop/" rel="noopener noreferrer"&gt;scrapeme.live&lt;/a&gt; as an example, a fake website prepared for scraping.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;



&lt;p&gt;Nice! Then, using cheerio, we can query for the two things we want right now: paginator links and products. To know how to do that, we will look at the page with Chrome DevTools open. All modern browsers offer developer tools such as these. Pick your favorite.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F24j9eq9mjh8hiuce23yo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F24j9eq9mjh8hiuce23yo.png" alt="Paginator in DevTools"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We marked the interesting parts in red, but you can go on your own and try it yourselves. In this case, all the &lt;a href="https://www.w3schools.com/cssref/css_selectors.asp" rel="noopener noreferrer"&gt;CSS selectors&lt;/a&gt; are straightforward and do not need nesting. Check the guide if you are looking for a different outcome or cannot select it. You can also use DevTools to get the selector.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo206qbgxwxoy42js5bot.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo206qbgxwxoy42js5bot.png" alt="Copy Selector from DevTools"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;On the Elements tab, right-click on the node ➡ Copy ➡ Copy selector.&lt;br&gt;
But the outcome is usually very coupled to the HTML, as in this case: &lt;code&gt;#main &amp;gt; div:nth-child(2) &amp;gt; nav &amp;gt; ul &amp;gt; li:nth-child(2) &amp;gt; a&lt;/code&gt;. This approach might be a problem in the future because it will stop working after any minimal change. Besides, it will only capture one of the pagination links, not all of them.&lt;/p&gt;

&lt;p&gt;We could capture all the links on the page and then filter them by content. If we were to write a full-site crawler, that would be the right approach. In our case, we only want the pagination links. Using the provided class, &lt;code&gt;.page-numbers a&lt;/code&gt; will capture all and then extract the URLs (&lt;code&gt;href&lt;/code&gt;s) from those. The selector will match all the link nodes with an ancestor containing the class &lt;code&gt;page-numbers&lt;/code&gt;.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;As for the products (Pokémon in this case), we will get id, name, and price. Check the image below for details on selectors, or try again on your own. We will only log the content for now. Check the final code for adding them to an array.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffjlzv65b6mv97tbn7cwl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffjlzv65b6mv97tbn7cwl.png" alt="Product (Charmander) in DevTools"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you can see above, all the products contain the class &lt;code&gt;product&lt;/code&gt;, which makes our job easier. And for each of them, the &lt;code&gt;h2&lt;/code&gt; tag and &lt;code&gt;price&lt;/code&gt; node hold the content we want. As for the product ID, we need to match an attribute instead of a class or node type. That can be done using the syntax &lt;code&gt;node[attribute="value"]&lt;/code&gt;. We are looking only for the node with the attribute, so there is no need to match it to any particular value.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;There is no error handling, as you can see above. We will omit it for brevity in the snippets but take it into account in real life. Most of the time, returning the default value (i.e...., empty array) should do the trick.&lt;/p&gt;

&lt;h2&gt;
  
  
  Following Links
&lt;/h2&gt;

&lt;p&gt;Now that we have some pagination links, we should also visit them. If you run the whole code, you'll see that they appear twice - there are two pagination bars.&lt;/p&gt;

&lt;p&gt;We will add two &lt;a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Set" rel="noopener noreferrer"&gt;sets&lt;/a&gt; to keep track of what we already visited and the newly discovered links. We are using sets instead of arrays to avoid dealing with duplicates, but either one would work. To avoid crawling too much, we'll also include a maximum.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;For the next part, we will use &lt;a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Statements/async_function" rel="noopener noreferrer"&gt;async/await&lt;/a&gt; to avoid callbacks and nesting. An async function is an alternative to writing promise-based functions as chains. In this case, the Axios call will remain asynchronous. It might take around 1 second per page, but we write the code sequentially, with no need for callbacks.&lt;/p&gt;

&lt;p&gt;There is a small gotcha with this: &lt;code&gt;await is only valid in async function&lt;/code&gt;. That will force us to wrap the initial code inside a function, concretely in an &lt;a href="https://developer.mozilla.org/en-US/docs/Glossary/IIFE" rel="noopener noreferrer"&gt;IIFE&lt;/a&gt; (Immediately Invoked Function Expression). The syntax is a bit weird. It creates a function and then calls it immediately.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;h2&gt;
  
  
  Avoid Blocks
&lt;/h2&gt;

&lt;p&gt;As said before, we need mechanisms to avoid blocks, captchas, login walls, and several other defensive techniques. It is complicated to prevent them 100% of the time. But we can achieve a high success rate with simple efforts. We will apply two tactics: adding proxies and full-set headers.&lt;/p&gt;

&lt;p&gt;There are &lt;a href="https://free-proxy-list.net/" rel="noopener noreferrer"&gt;Free Proxies&lt;/a&gt; even though we do not recommend them. They might work for testing but are not reliable. We can use some of those for testing, as we'll see in some examples.&lt;br&gt;
&lt;em&gt;Note that these &lt;a href="https://free-proxy-list.net/" rel="noopener noreferrer"&gt;free proxies&lt;/a&gt; might not work for you. They are short-time lived.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Paid proxy services, on the other hand, offer IP Rotation. Meaning that our service will work the same, but the target website will see a different IP. In some cases, they rotate for every request or every few minutes. In any case, they are much harder to ban. And when it happens, we'll get a new IP after a short time.&lt;/p&gt;

&lt;p&gt;We will use &lt;a href="https://httpbin.org/" rel="noopener noreferrer"&gt;httpbin&lt;/a&gt; for testing. It offers several endpoints that will respond with headers, IP addresses, and many more.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;The next step would be to check our request headers. The most known one is &lt;a href="https://en.wikipedia.org/wiki/User_agent" rel="noopener noreferrer"&gt;User-Agent&lt;/a&gt; (UA for short), but there are many more. Many software tools have their own, for example, Axios (&lt;code&gt;axios/0.21.1&lt;/code&gt;). In general, it is a good practice to send actual headers along with the UA. That means we need a real-world set of headers because not all browsers and versions use the same ones. We include two in the snippet: Chrome 92 and Firefox 90 in a Linux machine.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;h2&gt;
  
  
  Headless Browsers
&lt;/h2&gt;

&lt;p&gt;Until now, every page visited was done using &lt;code&gt;axios.get&lt;/code&gt;, which can be inadequate in some cases. Say we need Javascript to load and execute or interact in any way with the browser (via mouse or keyboard). While avoiding them would be preferable - for performance reasons -, sometimes there is no other choice. &lt;a href="https://www.selenium.dev/" rel="noopener noreferrer"&gt;Selenium&lt;/a&gt;, &lt;a href="https://github.com/puppeteer/puppeteer" rel="noopener noreferrer"&gt;Puppeteer&lt;/a&gt;, and &lt;a href="https://github.com/microsoft/playwright" rel="noopener noreferrer"&gt;Playwright&lt;/a&gt; are the most used and known libraries. The snippet below shows only the User-Agent, but since it is a real browser, the headers will include the entire set (Accept, Accept-Encoding, etcetera).&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;This approach comes with its own problem: take a look a the User-Agents. The Chromium one includes "HeadlessChrome," which will tell the target website, well, that it is a headless browser. They might act upon that.&lt;/p&gt;

&lt;p&gt;As with Axios, we can provide extra headers, proxies, and many other options to customize every request. An excellent choice to hide our "HeadlessChrome" User-Agent. And since this is a real browser, we can intercept requests, block others (like CSS files or images), take screenshots or videos, and more.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;Now we can separate getting the HTML in a couple of functions, one using Playwright and the other Axios. We would then need a way to select which one is appropriate for the case at hand. For now, it is hardcoded. The output, by the way, is the same but quite faster when using Axios.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;h2&gt;
  
  
  Using Javascript's Async
&lt;/h2&gt;

&lt;p&gt;We already introduced async/await when crawling several links sequentially. If we were to crawl them in parallel, just by removing the &lt;code&gt;await&lt;/code&gt; would be enough, right? Well... not so fast.&lt;/p&gt;

&lt;p&gt;The function would call the first &lt;code&gt;crawl&lt;/code&gt; and immediately take the following item from the &lt;code&gt;toVisit&lt;/code&gt; set. The problem is that the set is empty since the crawling of the first page didn't occur yet. So we added no new links to the list. The function keeps running in the background, but we already exited from the main one.&lt;/p&gt;

&lt;p&gt;To do this properly, we need to create a queue that will execute tasks when available. To avoid many requests at the same time, we will limit its concurrency.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;If you run the code above, it will print numbers from 0 to 3 almost immediately (with a timestamp) and from 4 to 7 after 2 seconds. It might be the hardest snippet to understand - review it without hurries.&lt;/p&gt;

&lt;p&gt;We define &lt;code&gt;queue&lt;/code&gt; in lines 1-20. It will return an object with the function &lt;code&gt;enqueue&lt;/code&gt; to add a task to the list. Then it checks if we are above the concurrency limit. If we are not, it will sum one to &lt;code&gt;running&lt;/code&gt; and enter a loop that gets a task and runs it with the provided params. Until the task list is empty, then subtract one from &lt;code&gt;running&lt;/code&gt;. This variable is the one that marks when we can or cannot execute any more tasks, only allowing it below the concurrency limit. In lines 23-28, there are helper functions &lt;code&gt;sleep&lt;/code&gt; and &lt;code&gt;printer&lt;/code&gt;. Instantiate the queue in line 30 and enqueue items in 32-34 (which will start running 4).&lt;/p&gt;

&lt;p&gt;We have to use the queue now instead of a for loop to run several pages concurrently. The code below is partial with the parts that change.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;Remember that Node runs in a single thread, so we can take advantage of its event loop but cannot use multiple CPUs/threads. What we've seen works fine because the thread is idle most of the time - network requests do not consume CPU time.&lt;/p&gt;

&lt;p&gt;To build this further, we need to use some storage (database) or distributed queue system. Right now, we rely on variables, which are not shared between threads in Node. It is not overly complicated, but we covered enough ground in this blog post.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Code
&lt;/h2&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;We'd like you to part with four main points:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Understand the basics of website parsing and crawling.&lt;/li&gt;
&lt;li&gt;Separate responsibilities and use abstractions when necessary.&lt;/li&gt;
&lt;li&gt;Apply the required techniques to avoid blocks.&lt;/li&gt;
&lt;li&gt;Be able to figure out the following steps to scale up.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We can build a custom web scraper using Javascript and Node.js using the pieces we've seen. It might not scale to thousands of websites, but it will run perfectly for a few ones. And moving to distributed crawling is not that far from here.&lt;/p&gt;

&lt;p&gt;If you liked it, you might be interested in the &lt;a href="https://www.zenrows.com/blog/mastering-web-scraping-in-python-from-zero-to-hero?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=scraping_javascript" rel="noopener noreferrer"&gt;Python Web Scraping guide&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Thanks for reading! Did you find the content helpful? Please, spread the word and share it. 👈&lt;/p&gt;




&lt;p&gt;Originally published at &lt;a href="https://www.zenrows.com/blog/web-scraping-with-javascript-and-nodejs?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=scraping_javascript" rel="noopener noreferrer"&gt;https://www.zenrows.com&lt;/a&gt;&lt;/p&gt;

</description>
      <category>javascript</category>
      <category>node</category>
      <category>tutorial</category>
      <category>programming</category>
    </item>
    <item>
      <title>Mastering Web Scraping in Python: Scaling to Distributed Crawling</title>
      <dc:creator>Ander Rodriguez</dc:creator>
      <pubDate>Wed, 25 Aug 2021 13:35:12 +0000</pubDate>
      <link>https://dev.to/anderrv/mastering-web-scraping-in-python-scaling-to-distributed-crawling-51ig</link>
      <guid>https://dev.to/anderrv/mastering-web-scraping-in-python-scaling-to-distributed-crawling-51ig</guid>
      <description>&lt;p&gt;Wondering how to build a website crawler and parser at scale? Implement a project to crawl, scrape, extract content, and store it at scale in a distributed and fault-tolerant manner. We will take all the knowledge from previous posts and combine it.&lt;/p&gt;

&lt;p&gt;First, we learned about &lt;a href="https://dev.to/anderrv/mastering-web-scraping-in-python-from-zero-to-hero-4fj4"&gt;pro techniques to scrape content&lt;/a&gt;, although we'll only use CSS selectors today. Then &lt;a href="https://dev.to/anderrv/stealth-web-scraping-in-python-avoid-blocking-like-a-ninja-16ok"&gt;tricks to avoid blocks&lt;/a&gt;, from which we will add proxies, headers, and headless browsers. And lastly, we &lt;a href="https://dev.to/anderrv/mastering-web-scraping-in-python-crawling-from-scratch-1dgd"&gt;built a parallel crawler&lt;/a&gt;, and this blog post begins with that code.&lt;/p&gt;

&lt;p&gt;If you do not understand some part or snippet, it might be in an earlier post. Brace yourselves; lengthy snippets are coming.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;p&gt;For the code to work, you will need &lt;a href="https://redis.io/" rel="noopener noreferrer"&gt;Redis&lt;/a&gt; and &lt;a href="https://www.python.org/downloads/" rel="noopener noreferrer"&gt;python3&lt;/a&gt; installed. Some systems have it pre-installed. After that, install all the necessary libraries by running pip install.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install install &lt;/span&gt;requests beautifulsoup4 playwright &lt;span class="s2"&gt;"celery[redis]"&lt;/span&gt;
npx playwright &lt;span class="nb"&gt;install&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;
  
  
  Intro to Celery and Redis
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://docs.celeryproject.org/en/stable/getting-started/introduction.html" rel="noopener noreferrer"&gt;Celery&lt;/a&gt; "is an open source asynchronous task queue." We created a simple parallel version in the last blog post. Celery takes it a step further by providing an actual distributed queue implementation. We will use it to distribute our load among workers and servers.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://redis.io/" rel="noopener noreferrer"&gt;Redis&lt;/a&gt; "is an open source, in-memory data structure store, used as a database, cache, and message broker." Instead of using arrays and sets to store all the content (in memory), we will use Redis as a database. Moreover, Celery can use Redis as a broker, so we won't need other software to run it.&lt;/p&gt;
&lt;h2&gt;
  
  
  Simple Celery Task
&lt;/h2&gt;

&lt;p&gt;Our first step will be to create a task in Celery that prints the value received by parameter. Save the snippet in a file called &lt;code&gt;tasks.py&lt;/code&gt; and run it. If you run it as a regular python file, only one string will be printed. The console will print two different lines if you run it with &lt;code&gt;celery -A tasks worker&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The difference is in the &lt;code&gt;demo&lt;/code&gt; function call. Direct call implies "execute that task," while &lt;code&gt;delay&lt;/code&gt; means "enqueue it for a worker to process." Check the docs for more info on &lt;a href="https://docs.celeryproject.org/en/stable/userguide/calling.html#basics" rel="noopener noreferrer"&gt;calling tasks&lt;/a&gt;.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;celery&lt;/code&gt; command will not end; we need to kill it by exiting the console (i.e., &lt;code&gt;ctrl + C&lt;/code&gt;). We'll need it several times because Celery does not reload after code changes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Crawling from Task
&lt;/h2&gt;

&lt;p&gt;The next step is to connect a Celery task with the crawling process. This time we will be using a slightly altered version of the &lt;a href="https://www.zenrows.com/blog/mastering-web-scraping-in-python-crawling-from-scratch#final-code" rel="noopener noreferrer"&gt;helper functions&lt;/a&gt; seen in the last post. &lt;code&gt;extract_links&lt;/code&gt; will get all the links on the page except the &lt;code&gt;nofollow&lt;/code&gt; ones. We will add filtering options later.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;We could loop over the retrieved links and enqueue them, but that would end up crawling the same pages repeatedly. We saw the basics to execute tasks, and now we will start splitting into files and keeping track of the pages on Redis.&lt;/p&gt;

&lt;h2&gt;
  
  
  Redis for Tracking URLs
&lt;/h2&gt;

&lt;p&gt;We already said that relying on memory variables is not an option anymore. We will need to persist all that data: visited pages, the ones being currently crawled, keep a "to visit" list, and store some content later on. For all that, instead of enqueuing directly to Celery, we will use Redis to avoid re-crawling and duplicates. And enqueue URLs only once.&lt;/p&gt;

&lt;p&gt;We won't go into further details on Redis, but we will use &lt;a href="https://redis.io/commands#list" rel="noopener noreferrer"&gt;lists&lt;/a&gt;, &lt;a href="https://redis.io/commands#set" rel="noopener noreferrer"&gt;sets&lt;/a&gt;, and &lt;a href="https://redis.io/commands#hash" rel="noopener noreferrer"&gt;hashes&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Take the last snippet and remove the last two lines, the ones calling the task. Create a new file &lt;code&gt;main.py&lt;/code&gt; with the following content. We will create a list named &lt;code&gt;crawling:to_visit&lt;/code&gt; and push the starting URL. Then we will go into a loop that will query that list for items and block for a minute until an item is ready. When an item is retrieved, we call the &lt;code&gt;crawl&lt;/code&gt; function, enqueuing its execution.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;It does almost the same as before but allows us to add items to the list, and they will be automatically processed. We could do that easily by looping over &lt;code&gt;links&lt;/code&gt; and pushing them all, but it is not a good idea without deduplication and a maximum number of pages. We will keep track of all the &lt;code&gt;queued&lt;/code&gt; and &lt;code&gt;visited&lt;/code&gt; using sets and exit once their sum exceeds the maximum allowed.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;



&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;After executing, everything will be in Redis, so running again won't work as expected. We need to clean manually. We can do that by using &lt;code&gt;redis-cli&lt;/code&gt; or a GUI like &lt;a href="https://github.com/joeferner/redis-commander#readme" rel="noopener noreferrer"&gt;redis-commander&lt;/a&gt;. There are commands for deleting keys (i.e., &lt;code&gt;DEL crawling:to_visit&lt;/code&gt;) or &lt;a href="https://redis.io/commands/flushdb" rel="noopener noreferrer"&gt;flushing the database&lt;/a&gt; (careful with this one).&lt;/p&gt;

&lt;h2&gt;
  
  
  Separate Responsabilities
&lt;/h2&gt;

&lt;p&gt;We will start to separate concepts before the project grows. We already have two files: &lt;code&gt;tasks.py&lt;/code&gt; and &lt;code&gt;main.py&lt;/code&gt;. We will create another two to host crawler-related functions (&lt;code&gt;crawler.py&lt;/code&gt;) and database access (&lt;code&gt;repo.py&lt;/code&gt;). Please look at the snippet below for the repo file, it is not complete, but you get the idea. There is a &lt;a href="https://github.com/ZenRows/scaling-to-distributed-crawling" rel="noopener noreferrer"&gt;GitHub repository&lt;/a&gt; with the final content in case you want to check it.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;And the &lt;code&gt;crawler&lt;/code&gt; file will have the functions for crawling, extracting links, and so on.&lt;/p&gt;

&lt;h2&gt;
  
  
  Allow Parser Customization
&lt;/h2&gt;

&lt;p&gt;As mentioned above, we need some way to extract and store content and add only a particular subset of links to the queue. We need a new concept for that: default parser (&lt;code&gt;parsers/defaults.py&lt;/code&gt;).&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;And in the repo.py file:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;There is nothing new here, but it will allow us to abstract the link and content extraction. Instead of hardcoding it in the crawler, it will be a set of functions passed as parameters. Now we can substitute the calls to these functions by an import (for the moment).&lt;/p&gt;

&lt;p&gt;For it to be completely abstracted, we need a generator or factory. We'll create a new file to host it - &lt;code&gt;parserlist.py&lt;/code&gt;. To simplify a bit, we allow one custom parser per domain. The demo includes two domains for testing: &lt;a href="https://scrapeme.live/shop/page/1/" rel="noopener noreferrer"&gt;scrapeme.live&lt;/a&gt; and &lt;a href="http://quotes.toscrape.com/page/1/" rel="noopener noreferrer"&gt;quotes.toscrape.com&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;There is nothing done for each domain yet so that we will use the default parser for them.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;We can now modify the task with the new per-domain-parsers.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;h2&gt;
  
  
  Custom Parser
&lt;/h2&gt;

&lt;p&gt;We will use &lt;code&gt;scrapeme&lt;/code&gt; first as an example. Check the &lt;a href="https://github.com/ZenRows/scaling-to-distributed-crawling/blob/main/parsers/scrapemelive.py" rel="noopener noreferrer"&gt;repo&lt;/a&gt; for the final version and the other custom parser.&lt;/p&gt;

&lt;p&gt;Knowledge of the page and its HTML is required for this part. Take a look at it if you want to get the feeling. To summarize, we will get the product id, name, and price for each item in the product list. Then store that in a set using the id as the key. As for the links allowed, only the ones for pagination will go through the filtering.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpox98dq3rgci5mhdvwhh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpox98dq3rgci5mhdvwhh.png" alt="Crawler Products, Pokémon"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the &lt;code&gt;quotes&lt;/code&gt; site, we need to handle it differently since there is no ID per quote. We will extract the author and quote for each entry in the list. Then, in the &lt;code&gt;store_content&lt;/code&gt; function, we'll create a list for each author and add that quote. Redis handles the creation of the lists when necessary.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxjtfl1u8lt5h4s83lvu9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxjtfl1u8lt5h4s83lvu9.png" alt="Crawler Quotes in Redis"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With the last couple of changes, we have introduced custom parsers that will be easy to extend. When adding a new site, we must create one file per new domain and one line in &lt;code&gt;parserlist.py&lt;/code&gt; referencing it. We could go a step further and "auto-discover" them, but no need to complicate it even more.&lt;/p&gt;

&lt;h2&gt;
  
  
  Get HTML: Headless Browsers
&lt;/h2&gt;

&lt;p&gt;Until now, every page visited was done using &lt;code&gt;requests.get&lt;/code&gt;, which can be inadequate in some cases. Say we want to use a different library or headless browser, but just for some cases or domains. Loading a browser is memory-consuming and slow, so we should avoid it when it is not mandatory. The solution? Even more customization. New concept: collector.&lt;/p&gt;

&lt;p&gt;We will create a file named &lt;code&gt;collectors/basic.py&lt;/code&gt; and paste the already known &lt;code&gt;get_html&lt;/code&gt; function. Then change the defaults to use it by importing it. Next, create a new file, &lt;code&gt;collectors/headless_firefox.py&lt;/code&gt;, for the new and shiny method of getting the target HTML. As in the previous post, we will be using &lt;a href="https://playwright.dev/python/docs/intro/" rel="noopener noreferrer"&gt;playwright&lt;/a&gt;. And we will also parametrize headers and proxies in case we want to use them. &lt;em&gt;Spoiler: we will&lt;/em&gt;.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;If we want to use a headless Firefox for some domain, merely modify the &lt;code&gt;get_html&lt;/code&gt; for that parser (i.e., &lt;code&gt;parsers/scrapemelive.py&lt;/code&gt;).&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;As you can see in the &lt;a href="https://github.com/ZenRows/scaling-to-distributed-crawling/blob/main/collectors/fake.py" rel="noopener noreferrer"&gt;final repo&lt;/a&gt;, we also have a &lt;code&gt;fake.py&lt;/code&gt; collector used in &lt;code&gt;scrapemelive.py&lt;/code&gt;. Since we used that website for intense testing, we downloaded all the product pages the first time and stored them in a &lt;code&gt;data&lt;/code&gt; folder. We can customize with a headless browser, but we can do the same with a file reader, hence the "fake" name.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;h2&gt;
  
  
  Avoid Detection with Headers and Proxies
&lt;/h2&gt;

&lt;p&gt;You guessed it: we want to add custom headers and use proxies. We will start with the headers creating a file &lt;code&gt;headers.py&lt;/code&gt;. We won't paste the entire content here, there are three different sets of headers for a Linux machine, and it gets pretty long. Check the &lt;a href="https://github.com/ZenRows/scaling-to-distributed-crawling/blob/main/headers.py" rel="noopener noreferrer"&gt;repo&lt;/a&gt; for the details.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;We can import a concrete set of headers or call the &lt;code&gt;random_headers&lt;/code&gt; to get one of the available options. We will see a usage example in a moment.&lt;/p&gt;

&lt;p&gt;The same applies to the proxies: create a new file, &lt;code&gt;proxies.py&lt;/code&gt;. It will contain a list of them grouped by the provider. In our example, we will include only &lt;a href="https://free-proxy-list.net/" rel="noopener noreferrer"&gt;free proxies&lt;/a&gt;. Add your paid ones in the &lt;code&gt;proxies&lt;/code&gt; dictionary and change the default type to the one you prefer. If we were to complicate things, we could add a retry with a different provider in case of failure.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note that these &lt;a href="https://free-proxy-list.net/" rel="noopener noreferrer"&gt;free proxies&lt;/a&gt; might not work for you. They are short-time lived.&lt;/em&gt;&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;And the usage in a parser:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;h2&gt;
  
  
  Bringing it All Together
&lt;/h2&gt;

&lt;p&gt;It's been a long and eventful trip. It is time to put an end to it by completing the puzzle. We hope you understood the whole process and all the challenges that scraping and crawling at scale have.&lt;/p&gt;

&lt;p&gt;We cannot show here the final code, so take a look at the &lt;a href="https://github.com/ZenRows/scaling-to-distributed-crawling" rel="noopener noreferrer"&gt;repository&lt;/a&gt; and do not hesitate to comment or contact us with any doubt.&lt;/p&gt;

&lt;p&gt;The two entry points are &lt;code&gt;tasks.py&lt;/code&gt; for Celery and &lt;code&gt;main.py&lt;/code&gt; to start queueing URLs. From there, we begin storing URLs in Redis to keep track and start crawling the first URL. A custom or the default parser will get the HTML, extract and filter links, and generate and store the appropriate content. We add those links to a list and start the process again. Thanks to Celery, once there is more than one link in the queue, the parallel/distributed process starts.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Focks2uaaybgf8g2mgbrv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Focks2uaaybgf8g2mgbrv.png" alt="Crawler File Tree"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Points Still Missing
&lt;/h2&gt;

&lt;p&gt;We already covered a lot of ground, but there is always a step more. Here are a few functionalities that we did not include. Also, note that most of the code does not contain error handling or retries for brevity's sake.&lt;/p&gt;

&lt;h3&gt;
  
  
  Distributed
&lt;/h3&gt;

&lt;p&gt;We didn't include it, but Celery offers it out-of-the-box. For local testing, we can start two different workers &lt;code&gt;celery -A tasks worker --concurrency=20 -n worker1&lt;/code&gt; and &lt;code&gt;... -n worker2&lt;/code&gt;. The way to go is to do the same in other machines as long as they can connect to the broker (Redis in our case). We could even add or remove workers and servers on the fly, no need to restart the rest. Celery handles the workers and distributes the load.&lt;/p&gt;

&lt;p&gt;It is important to note that the worker's name is essential, especially when starting several in the same machine. If we execute the above command twice without changing the worker's name, Celery won't recognize them correctly. Thus launch the second one as &lt;code&gt;-n worker2&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Rate Limit
&lt;/h3&gt;

&lt;p&gt;Celery does not allow a rate limit per task and parameter (in our case, domain). Meaning that we can throttle workers or queues, but not to a fine-grained detail as we would like to. There are several &lt;a href="https://github.com/celery/celery/issues/5732" rel="noopener noreferrer"&gt;issues&lt;/a&gt; open and &lt;a href="https://stackoverflow.com/questions/29854102/celery-rate-limit-on-tasks-with-the-same-parameters" rel="noopener noreferrer"&gt;workarounds&lt;/a&gt;. From reading several of those, the take-away is that we cannot do it without keeping track of the requests ourselves.&lt;/p&gt;

&lt;p&gt;We could easily rate-limit to 30 requests per minute for each task with the provided param &lt;code&gt;@app.task(rate_limit="30/m")&lt;/code&gt;. But remember that it would affect the task, not the crawled domain.&lt;/p&gt;

&lt;h3&gt;
  
  
  Robots.txt
&lt;/h3&gt;

&lt;p&gt;Along with the &lt;code&gt;allow_url_filter&lt;/code&gt; part, we should also add a robots.txt checker. For that, the &lt;a href="https://docs.python.org/3/library/urllib.robotparser.html" rel="noopener noreferrer"&gt;robotparser library&lt;/a&gt; can take a URL and tell us if it is allowed to crawl it. We can add it to the default or as a standalone function, and then each scraper decides whether to use it. We thought it was complex enough and did not implement this functionality.&lt;/p&gt;

&lt;p&gt;If you were to do it, consider the last time the file was accessed with &lt;code&gt;mtime()&lt;/code&gt; and reread it from time to time. And also, cache it to avoid requesting it for every single URL.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Building a custom crawler/parser at scale is not an easy nor straightforward task. We provided some guidance and tips, hopefully helping you all with your day-to-day tasks.&lt;/p&gt;

&lt;p&gt;Before developing something as big as this project at scale, think about some important take-aways:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Separate responsabilities.&lt;/li&gt;
&lt;li&gt;Use abstractions when necessary, but do not over-engineer.&lt;/li&gt;
&lt;li&gt;Don't be afraid of using specialized software instead of building everything.&lt;/li&gt;
&lt;li&gt;Think about scaling even if you don't need it now; just keep it in mind.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Thanks for joining us until the end. It's been a fun series to write, and we hope it's also been attractive from your side. If you liked it, you might be interested in &lt;a href="https://www.zenrows.com/blog/web-scraping-with-javascript-and-nodejs?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=distributed_crawling" rel="noopener noreferrer"&gt;the Javascript Web Scraping guide&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Do not forget to take a look at the rest of the posts in this series.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.zenrows.com/blog/mastering-web-scraping-in-python-crawling-from-scratch?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=distributed_crawling" rel="noopener noreferrer"&gt;Crawling from Scratch&lt;/a&gt; (3/4)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.zenrows.com/blog/stealth-web-scraping-in-python-avoid-blocking-like-a-ninja?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=distributed_crawling" rel="noopener noreferrer"&gt;Avoid Blocking Like a Ninja&lt;/a&gt; (2/4)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.zenrows.com/blog/mastering-web-scraping-in-python-from-zero-to-hero?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=distributed_crawling" rel="noopener noreferrer"&gt;Mastering Extraction&lt;/a&gt; (1/4)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Did you find the content helpful? Please, spread the word and share it. 👈&lt;/p&gt;




&lt;p&gt;Originally published at &lt;a href="https://www.zenrows.com/blog/mastering-web-scraping-in-python-scaling-to-distributed-crawling?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=distributed_crawling" rel="noopener noreferrer"&gt;https://www.zenrows.com&lt;/a&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>beginners</category>
      <category>tutorial</category>
      <category>webdev</category>
    </item>
    <item>
      <title>Mastering Web Scraping in Python: Crawling from Scratch</title>
      <dc:creator>Ander Rodriguez</dc:creator>
      <pubDate>Wed, 11 Aug 2021 11:55:02 +0000</pubDate>
      <link>https://dev.to/anderrv/mastering-web-scraping-in-python-crawling-from-scratch-1dgd</link>
      <guid>https://dev.to/anderrv/mastering-web-scraping-in-python-crawling-from-scratch-1dgd</guid>
      <description>&lt;p&gt;Have you ever tried to crawl thousands of pages? Scale that even further? Handle and recover from system failures?&lt;/p&gt;

&lt;p&gt;After seeing how to &lt;a href="https://dev.to/anderrv/mastering-web-scraping-in-python-from-zero-to-hero-4fj4"&gt;extract content from a website&lt;/a&gt; and &lt;a href="https://dev.to/anderrv/stealth-web-scraping-in-python-avoid-blocking-like-a-ninja-16ok"&gt;how to avoid being blocked&lt;/a&gt;, we'll take a look at the crawling process. To get data at scale, getting a few URLs by hand is not an option. We need to use an automated system that will discover new pages and visit them.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Disclaimer: for real-world usage, find suitable software. Below is more info on that. This guide pretends to be an introduction to how the crawling process works and doing the basics. But there are tons of details that need addressing.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;p&gt;For the code to work, you will need &lt;a href="https://www.python.org/downloads/"&gt;python3 installed&lt;/a&gt;. Some systems have it pre-installed. After that, install all the necessary libraries by running &lt;code&gt;pip install&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;requests beautifulsoup4 pandas
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  How to Get all the Links on the Page
&lt;/h2&gt;

&lt;p&gt;From the first article in the series, we know that getting data from a webpage is easy with &lt;code&gt;requests.get&lt;/code&gt; and &lt;code&gt;BeautifulSoup&lt;/code&gt;. We will start by finding the links in a &lt;a href="https://scrapeme.live/shop/page/1/"&gt;fake shop prepared for testing scraping&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The basics to get the content are the same. Then we get all the links on the paginator and add the links to a &lt;code&gt;set&lt;/code&gt;. We chose set to avoid duplicates. As you can see, we hardcoded the selector for the links, meaning that it is not a universal solution. For the moment, we'll focus on the page at hand.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;requests&lt;/span&gt; 
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;bs4&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BeautifulSoup&lt;/span&gt; 

&lt;span class="n"&gt;to_visit&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;set&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; 
&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'https://scrapeme.live/shop/page/1/'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
&lt;span class="n"&gt;soup&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;BeautifulSoup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'html.parser'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;soup&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'a.page-numbers'&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; 
    &lt;span class="n"&gt;to_visit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'href'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; 

&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;to_visit&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
&lt;span class="c1"&gt;# {'https://scrapeme.live/shop/page/2/', '.../3/', '.../46/', '.../48/', '.../4/', '.../47/'}
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  One URL at a Time, Sequential
&lt;/h2&gt;

&lt;p&gt;Now we have several links but no way to visit them all. We need some kind of loop that will execute the extracting part for every URL available to fix that. Maybe the most straightforward way, although not the scalable one, is to use the same loop. But before that, there is a missing piece: avoid crawling the same page twice.&lt;/p&gt;

&lt;p&gt;We'll keep track of already visited links in another &lt;code&gt;set&lt;/code&gt; and avoid duplicates by checking them before every request. In this case, &lt;code&gt;to_visit&lt;/code&gt; is not being used, just maintained for demo purposes. To prevent visiting every page, we'll also add a &lt;code&gt;max_visits&lt;/code&gt; variable. For now, we ignore the &lt;code&gt;robots.txt&lt;/code&gt; file, but we have to be civil and nice.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;visited&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;set&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; 
&lt;span class="n"&gt;to_visit&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;set&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; 
&lt;span class="n"&gt;max_visits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt; 

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;crawl&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; 
    &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'Crawl: '&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
    &lt;span class="n"&gt;soup&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;BeautifulSoup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'html.parser'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
    &lt;span class="n"&gt;visited&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;soup&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'a.page-numbers'&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; 
        &lt;span class="n"&gt;link&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'href'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
        &lt;span class="n"&gt;to_visit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;link&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;link&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;visited&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;visited&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;max_visits&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; 
            &lt;span class="n"&gt;crawl&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;link&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 

&lt;span class="n"&gt;crawl&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'https://scrapeme.live/shop/page/1/'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 

&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;visited&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# {'.../3/', '.../1/', '.../2/'} 
&lt;/span&gt;&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;to_visit&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# { ... new ones added, such as pages 5 and 6 ... }
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It is a recursive function with two exit conditions: there are no more links to visit, or we reached the maximum visits. In either case, it will exit and print the visited links and the ones pending.&lt;/p&gt;

&lt;p&gt;It is important to note that the same link can be added many times, but it will only get crawled once. In a big project, the idea would be to set a timer and only request each URL after a few days.&lt;/p&gt;

&lt;h2&gt;
  
  
  Separation of Concerns
&lt;/h2&gt;

&lt;p&gt;We said this is not about extracting or parsing content, but we need to separate concerns before it becomes entangled. For that, we'll create three helper functions: get HTML, extract links, and extract content. As their names imply, each of them will perform one of the main tasks of web scraping.&lt;/p&gt;

&lt;p&gt;The first one will get the HTML from a URL using the same library as earlier but wrapping it in a &lt;code&gt;try&lt;/code&gt; block for security.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_html&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; 
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; 
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt; 
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; 
        &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="s"&gt;''&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The second one, extracting the links, will work just as before.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;extract_links&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;soup&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; 
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'href'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;soup&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'a.page-numbers'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'href'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;visited&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The last one will be the placeholder for extracting the content we want. Since we are simplifying this part, it will get basic info from the same page, no need to enter on the detail page.&lt;/p&gt;

&lt;p&gt;To show that we can extract some content, we will print each product's title (Pokémon name).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;extract_content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;soup&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; 
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;product&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;soup&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'.product'&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; 
        &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;product&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'h2'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
 &lt;span class="c1"&gt;# Bulbasaur, Ivysaur, ...
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Assembling it all together.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;crawl&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; 
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;visited&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; 
        &lt;span class="k"&gt;return&lt;/span&gt; 
    &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'Crawl: '&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
    &lt;span class="n"&gt;visited&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
    &lt;span class="n"&gt;html&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;get_html&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
    &lt;span class="n"&gt;soup&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;BeautifulSoup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;html&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'html.parser'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
    &lt;span class="n"&gt;extract_content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;soup&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
    &lt;span class="n"&gt;links&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;extract_links&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;soup&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
    &lt;span class="n"&gt;to_visit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;links&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Noticed something different? The crawling logic is not attached to the link extracting part. Each of the helpers handles a single piece. And the &lt;code&gt;crawl&lt;/code&gt; function acts as an orchestrator by calling them and applying the results.&lt;/p&gt;

&lt;p&gt;As the project evolves, all these parts could be moved to files or passed as parameters/callbacks. We can generalize the use cases if the core is independent of the selected page and content.&lt;/p&gt;

&lt;p&gt;Are we missing something? 🤔&lt;br&gt;
We need to add the first URL and call the crawling function. Since &lt;code&gt;crawl&lt;/code&gt; is not recursive anymore, we'll handle that in a separate loop.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;to_visit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'https://scrapeme.live/shop/page/1/'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 

&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;to_visit&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;visited&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;max_visits&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; 
    &lt;span class="n"&gt;crawl&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;to_visit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pop&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Parallel Requests
&lt;/h2&gt;

&lt;p&gt;There is a significant part missing: parallelism. HTTP request handlers are idle most of the time, waiting for the response to come back. It means that we can send several of them at the same time without overloading the machine. And then process them as they came back.&lt;/p&gt;

&lt;p&gt;It is relevant to note that this approach only works if the order is not imperative. But we are already using sets, which according to &lt;a href="https://docs.python.org/3/tutorial/datastructures.html#sets"&gt;Python's definition&lt;/a&gt;, "a set is an &lt;strong&gt;unordered collection&lt;/strong&gt; with no duplicate elements." Meaning that our process was unordered from the start.&lt;/p&gt;

&lt;p&gt;Before diving deep into the parallel requests, we have to understand a couple of concepts: synchronization and queues.&lt;/p&gt;

&lt;h3&gt;
  
  
  Synchronized Queues
&lt;/h3&gt;

&lt;p&gt;There is a huge risk in threaded or &lt;a href="https://en.wikipedia.org/wiki/Parallel_computing"&gt;parallel computing&lt;/a&gt;: modifying the same variables or data structures from different threads. It means two of our requests would be adding new links to a set (i.e., &lt;code&gt;to_visit&lt;/code&gt;). Since the data structure is not protected, both could read and write it like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Both read its content, i.e. &lt;code&gt;(1, 2, 3)&lt;/code&gt; (simplified)&lt;/li&gt;
&lt;li&gt;Thread one adds links to pages &lt;code&gt;4, 5&lt;/code&gt;: &lt;code&gt;(1, 2, 3, 4, 5)&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Thread two adds links to pages &lt;code&gt;6, 7&lt;/code&gt;: &lt;code&gt;(1, 2, 3, 6, 7)&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;How did this happen? When thread two wrote the new links, it added them to a set with only three elements.&lt;br&gt;
&lt;em&gt;This is a very simplified version; check the links for more info.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;What can we do to avoid these conflicts? Synchronization or locking. From the &lt;a href="https://docs.python.org/3/library/queue.html"&gt;docs&lt;/a&gt;: "queues use locks to temporarily block competing threads." It means that thread one would acquire a lock on the set, read and write without any problem, and then release the lock automatically. Meanwhile, thread two would have to wait until the lock becomes available. Only then read and write.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;queue&lt;/span&gt; 

&lt;span class="n"&gt;q&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;queue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Queue&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; 
&lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;put&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'https://scrapeme.live/shop/page/1/'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;crawl&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; 
    &lt;span class="p"&gt;...&lt;/span&gt; 
    &lt;span class="n"&gt;links&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;extract_links&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;soup&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;link&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;links&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; 
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;link&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;visited&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; 
            &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;put&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;link&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For the moment, it does not work. Do not worry. The changes in the existing code are minimum: we replaced &lt;code&gt;to_visit&lt;/code&gt; with a queue. But queues need handlers or workers to process their content. With the above, we have created a Queue and added an item (the original one). We also modified the &lt;code&gt;crawl&lt;/code&gt; function to put links in the queue instead of updating the previous set.&lt;/p&gt;

&lt;p&gt;We'll create a worker using the &lt;a href="https://docs.python.org/3/library/threading.html"&gt;threading module&lt;/a&gt; to process that queue.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;threading&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Thread&lt;/span&gt; 

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;queue_worker&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; 
    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; 
        &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="c1"&gt;# Get an item from the queue, blocks until one is available 
&lt;/span&gt;        &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'to process:'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
        &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;task_done&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="c1"&gt;# Notifies the queue that the item has been processed 
&lt;/span&gt;
&lt;span class="n"&gt;q&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;queue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Queue&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; 
&lt;span class="n"&gt;Thread&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;queue_worker&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;daemon&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; 

&lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;put&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'https://scrapeme.live/shop/page/1/'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
&lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="c1"&gt;# Blocks until all items in the queue are processed and marked as done 
&lt;/span&gt;&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'Done'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 

&lt;span class="c1"&gt;# to process: https://scrapeme.live/shop/page/1/ 
# Done
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We defined a new function that will handle the queued items. For that, we enter into an infinite loop that will stop when all the processing finishes.&lt;/p&gt;

&lt;p&gt;Then &lt;code&gt;get&lt;/code&gt; an item, which will block until an item is available. We process that item; for the moment, just print it to show how it works. It will call &lt;code&gt;crawl&lt;/code&gt; later.&lt;/p&gt;

&lt;p&gt;Finally, we notify the queue that the item has been processed by calling &lt;code&gt;task_done&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Once the queue gets notified for all the items and empty, it will stop its execution and end the infinite loop. That's what the &lt;code&gt;join&lt;/code&gt; function does, "blocks until all items in the queue have been gotten and processed."&lt;/p&gt;

&lt;p&gt;Now we need two more things: process items and create more threads (it would not be parallel with just one, would it?).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;queue_worker&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; 
    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; 
        &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; 
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;visited&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;max_visits&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;visited&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; 
            &lt;span class="n"&gt;crawl&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
        &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;task_done&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; 

&lt;span class="n"&gt;q&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;queue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Queue&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; 
&lt;span class="n"&gt;num_workers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt; 
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;num_workers&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; 
    &lt;span class="n"&gt;Thread&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;queue_worker&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;daemon&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Be careful when running it since big numbers in &lt;code&gt;num_workers&lt;/code&gt; and &lt;code&gt;max_visits&lt;/code&gt; would start lots of requests. If the script had some minor bug for any reason, you could perform hundreds of requests in a few seconds.&lt;/p&gt;

&lt;h3&gt;
  
  
  Performance
&lt;/h3&gt;

&lt;p&gt;We run benchmarks with different settings only as a reference.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sequential requests: 29,32s&lt;/li&gt;
&lt;li&gt;Queue with one worker (&lt;code&gt;num_workers = 1&lt;/code&gt;): 29,41s&lt;/li&gt;
&lt;li&gt;Queue with two workers (&lt;code&gt;num_workers = 2&lt;/code&gt;): 20,05s&lt;/li&gt;
&lt;li&gt;Queue with five workers (&lt;code&gt;num_workers = 5&lt;/code&gt;): 11,97s&lt;/li&gt;
&lt;li&gt;Queue with ten workers (&lt;code&gt;num_workers = 10&lt;/code&gt;): 12,02s&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There is almost no difference between sequential requests and having one worker. Threads carry some overhead, but it is barely noticeable here. It would require a more severe load test. Once we start adding workers, that overhead pays off. We could add even more, but it won't affect the outcome since they will be idle most of the time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Distributed Processing
&lt;/h2&gt;

&lt;p&gt;We won't cover the following scale-up step: distributing the crawling process among several servers. &lt;a href="https://docs.python.org/3/library/multiprocessing.html#using-a-remote-manager"&gt;Python allows it&lt;/a&gt;, and some libraries can help you with it (&lt;a href="https://docs.celeryproject.org/en/stable/"&gt;Celery&lt;/a&gt; or &lt;a href="https://python-rq.org/"&gt;Redis Queue&lt;/a&gt;). It is a huge step, and we have already covered enough for the day.&lt;/p&gt;

&lt;p&gt;As a quick preview, the idea behind it is the same as the one with the threads. Each item will be processed as we've seen until now but in different threads or even machines running the same code. With this approach, we can scale even further; theoretically, with no limit. But in reality, there is always a limit or bottleneck, usually the central node that handles the distribution.&lt;/p&gt;

&lt;h2&gt;
  
  
  Take into Account when Scaling Up
&lt;/h2&gt;

&lt;p&gt;We've shown a simplified version of a crawling process for educational purposes. To apply all this at scale, you should consider several things first.&lt;/p&gt;

&lt;h3&gt;
  
  
  Build vs Buy vs Open Source
&lt;/h3&gt;

&lt;p&gt;Before you write your own library for crawling, try some of the options out there. Many great Open Source libraries can achieve it: &lt;a href="https://docs.scrapy.org/en/latest/"&gt;Scrapy&lt;/a&gt;, &lt;a href="http://docs.pyspider.org/en/latest/"&gt;pyspider&lt;/a&gt;, &lt;a href="https://github.com/bda-research/node-crawler"&gt;node-crawler&lt;/a&gt; (Node.js), or &lt;a href="https://github.com/gocolly/colly"&gt;Colly&lt;/a&gt; (Go). And many companies and services that provide you with scraping and crawling solutions.&lt;/p&gt;

&lt;h3&gt;
  
  
  Avoid being blocked
&lt;/h3&gt;

&lt;p&gt;As we saw in a previous post, &lt;a href="https://dev.to/anderrv/stealth-web-scraping-in-python-avoid-blocking-like-a-ninja-16ok"&gt;there are several actions we can take to avoid blocking&lt;/a&gt;. A couple of them are proxies and headers. Here is a simple snippet adding those to our current code.&lt;br&gt;
&lt;em&gt;Note that these &lt;a href="https://free-proxy-list.net/"&gt;free proxies&lt;/a&gt; might not work for you. They are short-time lived.&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;proxies&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; 
    &lt;span class="s"&gt;'http'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;'http://190.64.18.177:80'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="s"&gt;'https'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;'http://49.12.2.178:3128'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
&lt;span class="p"&gt;}&lt;/span&gt; 

&lt;span class="n"&gt;headers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; 
    &lt;span class="s"&gt;'authority'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;'httpbin.org'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="s"&gt;'cache-control'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;'max-age=0'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="s"&gt;'sec-ch-ua'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;'"Chromium";v="92", " Not A;Brand";v="99", "Google Chrome";v="92"'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="s"&gt;'sec-ch-ua-mobile'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;'?0'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="s"&gt;'upgrade-insecure-requests'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;'1'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="s"&gt;'user-agent'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="s"&gt;'accept'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="s"&gt;'sec-fetch-site'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;'none'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="s"&gt;'sec-fetch-mode'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;'navigate'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="s"&gt;'sec-fetch-user'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;'?1'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="s"&gt;'sec-fetch-dest'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;'document'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="s"&gt;'accept-language'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;'en-US,en;q=0.9'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
&lt;span class="p"&gt;}&lt;/span&gt; 

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_html&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; 
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; 
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;proxies&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;proxies&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt; 
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; 
        &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="s"&gt;''&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Extracting Content
&lt;/h3&gt;

&lt;p&gt;We won't go into details here, only a simple snippet for extracting id, name, and price per item. We store everything in a &lt;code&gt;data&lt;/code&gt; array, which is not a great idea. But it is enough for demo purposes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt; 

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;extract_content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;soup&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; 
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;product&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;soup&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'.product'&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; 
        &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; 
            &lt;span class="s"&gt;'id'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;product&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'a'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;attrs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;'data-product_id'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;})[&lt;/span&gt;&lt;span class="s"&gt;'data-product_id'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; 
            &lt;span class="s"&gt;'name'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;product&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'h2'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
            &lt;span class="s"&gt;'price'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;product&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;class_&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'amount'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt; 
        &lt;span class="p"&gt;})&lt;/span&gt; 

&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
&lt;span class="c1"&gt;# [{'id': '759', 'name': 'Bulbasaur', 'price': '£63.00'}, {'id': '729', 'name': 'Ivysaur', 'price': '£87.00'}, ...]
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Persistency
&lt;/h3&gt;

&lt;p&gt;We haven't persisted anything, and that does not scale. In a real-world case, we should store the content and even the HTML itself for later processing. And all the discovered URLs with the timestamp time. It all starts to sound like a database is needed. Depending on the necessities, we could store just the actual content or the whole URLs, dates, HTML, etcetera generically.&lt;/p&gt;

&lt;h3&gt;
  
  
  Canonicals
&lt;/h3&gt;

&lt;p&gt;The link extraction part does not take into consideration &lt;a href="https://en.wikipedia.org/wiki/Canonical_link_element"&gt;canonical links&lt;/a&gt;. A page can have more than one URL: query strings or hashes might modify it. In our case, we would crawl it twice. It's not a problem now, but something to consider.&lt;/p&gt;

&lt;p&gt;The right approach would be to add the canonical URL (if present) to the visited list. Then we could arrive at that same page from a different origin URL, but we would detect it as duplicate. We could also remove some query string parameters using &lt;a href="https://w3lib.readthedocs.io/en/latest/w3lib.html#w3lib.url.url_query_cleaner"&gt;url_query_cleaner&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Robots.txt
&lt;/h3&gt;

&lt;p&gt;We have not checked it because we are using a test website prepared for scraping. But please check the robots file and comply with it when crawling an actual target. And above it, do not cause more traffic than they can handle. Once again, be civil and nice ;)&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Code
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;requests&lt;/span&gt; 
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;bs4&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BeautifulSoup&lt;/span&gt; 
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;queue&lt;/span&gt; 
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;threading&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Thread&lt;/span&gt; 

&lt;span class="n"&gt;starting_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;'https://scrapeme.live/shop/page/1/'&lt;/span&gt; 
&lt;span class="n"&gt;visited&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;set&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; 
&lt;span class="n"&gt;max_visits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt; &lt;span class="c1"&gt;# careful, it will crawl all the pages 
&lt;/span&gt;&lt;span class="n"&gt;num_workers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt; 
&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt; 

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_html&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; 
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; 
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
        &lt;span class="c1"&gt;# response = requests.get(url, headers=headers, proxies=proxies) 
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt; 
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; 
        &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="s"&gt;''&lt;/span&gt; 

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;extract_links&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;soup&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; 
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'href'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;soup&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'a.page-numbers'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'href'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;visited&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; 

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;extract_content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;soup&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; 
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;product&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;soup&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'.product'&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; 
        &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; 
            &lt;span class="s"&gt;'id'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;product&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'a'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;attrs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;'data-product_id'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;})[&lt;/span&gt;&lt;span class="s"&gt;'data-product_id'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; 
            &lt;span class="s"&gt;'name'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;product&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'h2'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
            &lt;span class="s"&gt;'price'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;product&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;class_&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'amount'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt; 
        &lt;span class="p"&gt;})&lt;/span&gt; 

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;crawl&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; 
    &lt;span class="n"&gt;visited&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
    &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'Crawl: '&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
    &lt;span class="n"&gt;html&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;get_html&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
    &lt;span class="n"&gt;soup&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;BeautifulSoup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;html&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'html.parser'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
    &lt;span class="n"&gt;extract_content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;soup&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
    &lt;span class="n"&gt;links&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;extract_links&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;soup&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;link&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;links&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; 
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;link&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;visited&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; 
            &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;put&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;link&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;queue_worker&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; 
    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; 
        &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="c1"&gt;# Get an item from the queue, blocks until one is available 
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;visited&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;max_visits&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;visited&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; 
            &lt;span class="n"&gt;crawl&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
        &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;task_done&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="c1"&gt;# Notifies the queue that the item has been processed 
&lt;/span&gt;
&lt;span class="n"&gt;q&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;queue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Queue&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; 
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;num_workers&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; 
    &lt;span class="n"&gt;Thread&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;queue_worker&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;daemon&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; 

&lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;put&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;starting_url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
&lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="c1"&gt;# Blocks until all items in the queue are processed and marked as done 
&lt;/span&gt;
&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'Done'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'Visited:'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;visited&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'Data:'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;We'd like you to part with three main points:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Separate getting the HTML and extracting the links from the crawling itself.&lt;/li&gt;
&lt;li&gt;Choose the appropriate system for your use case: simple sequential, parallel, or distributed.&lt;/li&gt;
&lt;li&gt;Building from scratch to a vast scale will probably hurt. Take a look at free or paid libraries/solutions.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We are close to finishing this series on Web Scraping. Stay tuned for the next one on scaling this crawling process even further.&lt;/p&gt;

&lt;p&gt;Do not forget to take a look at the rest of the posts in this series.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.zenrows.com/blog/mastering-web-scraping-in-python-scaling-to-distributed-crawling?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=crawling_scratch"&gt;Scaling to Distributed Crawling&lt;/a&gt; (4/4)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.zenrows.com/blog/stealth-web-scraping-in-python-avoid-blocking-like-a-ninja?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=crawling_scratch"&gt;Avoid Blocking Like a Ninja&lt;/a&gt; (2/4)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.zenrows.com/blog/mastering-web-scraping-in-python-from-zero-to-hero?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=crawling_scratch"&gt;Mastering Extraction&lt;/a&gt; (1/4)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Did you find the content helpful? Please, spread the word and share it. 👈&lt;/p&gt;




&lt;p&gt;Originally published at &lt;a href="https://zenrows.com/blog/mastering-web-scraping-in-python-crawling-from-scratch?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=crawling_scratch"&gt;https://www.zenrows.com&lt;/a&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>beginners</category>
      <category>tutorial</category>
      <category>webdev</category>
    </item>
    <item>
      <title>Stealth Web Scraping in Python: Avoid Blocking Like a Ninja</title>
      <dc:creator>Ander Rodriguez</dc:creator>
      <pubDate>Thu, 29 Jul 2021 14:06:34 +0000</pubDate>
      <link>https://dev.to/anderrv/stealth-web-scraping-in-python-avoid-blocking-like-a-ninja-16ok</link>
      <guid>https://dev.to/anderrv/stealth-web-scraping-in-python-avoid-blocking-like-a-ninja-16ok</guid>
      <description>&lt;p&gt;Scraping should be about &lt;a href="https://dev.to/anderrv/mastering-web-scraping-in-python-from-zero-to-hero-4fj4"&gt;extracting content from HTML&lt;/a&gt;. Sounds simple. Sometimes it is not. It has many obstacles. The first one is to obtain the said HTML.&lt;/p&gt;

&lt;p&gt;You can open a browser, go to a URL, and it's there. Dead simple. We're done. 👩‍💻&lt;/p&gt;

&lt;p&gt;If you don't need a bigger scale, that's it; you're done. But bear with us if that's not the case and you want to learn how to scrape thousand of URLs without blocking.&lt;/p&gt;

&lt;p&gt;Websites tend to protect their data and access. There are many possible actions a defensive system could take. We'll start a journey through some of them and learn how to avoid or mitigate their impact.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note: when testing at scale, never use your home IP directly. A small mistake or slip and you will get banned.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;p&gt;For the code to work, you will need  &lt;a href="https://www.python.org/downloads/" rel="noopener noreferrer"&gt;python3 installed&lt;/a&gt; . Some systems have it pre-installed. After that, install all the necessary libraries by running &lt;code&gt;pip install&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;requests beautifulsoup4 pandas
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  IP Rate Limit
&lt;/h2&gt;

&lt;p&gt;The most basic security system is to ban or throttle requests from the same IP. It means that a regular user would not request a hundred pages in a few seconds, so they proceed to tag that connection as dangerous.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;http://httpbin.org/ip&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;origin&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="c1"&gt;# xyz.84.7.83
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;IP rate limits work similar to API rate limits, but there is usually no public information about them. We cannot know for sure how many requests we can do safely.&lt;/p&gt;

&lt;p&gt;Our Internet Service Provider assigns us our IP, and we cannot affect nor mask it. The solution is to change it. We cannot modify a machine's IP, but we can use different machines. Datacenters might have different IPs, although that is not a real solution.&lt;/p&gt;

&lt;p&gt;Proxies are. They take an incoming request and relay it to the final destination. It does no processing there. But that is enough to mask our IP since the target website will see the proxies IP.&lt;/p&gt;

&lt;h2&gt;
  
  
  Rotating Proxies
&lt;/h2&gt;

&lt;p&gt;There are &lt;a href="https://free-proxy-list.net/" rel="noopener noreferrer"&gt;Free Proxies&lt;/a&gt; even though we do not recommend them. They might work for testing but are not reliable. We can use some of those for testing, as we'll see in some examples.&lt;/p&gt;

&lt;p&gt;Now we have a different IP, and our home connection is safe and sound. Good. But what if they block the proxy's IP? We are back to the initial position.&lt;/p&gt;

&lt;p&gt;We won't go into detail about free proxies. Just use the next one on the list. Change them frequently since their lifespan is usually short.&lt;/p&gt;

&lt;p&gt;Paid proxy services, on the other hand, offer IP Rotation. Meaning that our service will work the same, but the website will see a different IP. In some cases, they rotate for every request or every few minutes. In any case, they are much harder to ban. And when it happens, we'll get a new IP after a short time.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;

&lt;span class="n"&gt;proxies&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;http&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;http://190.64.18.177:80&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;http://httpbin.org/ip&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;proxies&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;proxies&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;origin&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="c1"&gt;# 190.64.18.162
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We know about these; it means they know about it too. Some big companies won't allow traffic from known proxy IPs or datacenters. For those cases, there is a higher proxy level: Residential.&lt;/p&gt;

&lt;p&gt;More expensive and sometimes bandwidth-limited, residential proxies offer us IPs in use by "normal" people. Implying that our mobile provider will assign us that IP tomorrow. Or a friend had it yesterday. They are indistinguishable from actual final users.&lt;/p&gt;

&lt;p&gt;We can scrape whatever we want, right? The cheaper ones by default, the expensive ones when necessary. No, not there yet. We only passed the first hurdle, some more to come. We have to look like a legitimate user to avoid being tagged as a bot or scraper.&lt;/p&gt;

&lt;h2&gt;
  
  
  User-Agent Header
&lt;/h2&gt;

&lt;p&gt;The next step would be to check our request headers. The most known one is &lt;a href="https://en.wikipedia.org/wiki/User_agent" rel="noopener noreferrer"&gt;User-Agent&lt;/a&gt; (UA for short), but there are many more. UA follows a format that we'll see later, and many software tools have their own, for example, GoogleBot. Here is what the target website will receive if we directly use "python requests" or curl.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;http://httpbin.org/headers&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;headers&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;User-Agent&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="c1"&gt;# python-requests/2.25.1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl http://httpbin.org/headers
&lt;span class="c"&gt;# { ... "User-Agent": "curl/7.74.0" ... }&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Many sites won't check UA, but this is a huge red flag for the ones that do this. We'll have to fake it. Luckily, most libraries allow custom headers. Following the example using requests:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;

&lt;span class="n"&gt;headers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;User-Agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;http://httpbin.org/headers&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;headers&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;User-Agent&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="c1"&gt;# Mozilla/5.0 ...
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To get your current UA, visit &lt;a href="http://httpbin.org/headers" rel="noopener noreferrer"&gt;httpbin&lt;/a&gt; - just as the code snippet is doing - and copy it. Requesting all the URLs with the same UA might also trigger some alerts, so the solution is again a bit more complicated.&lt;/p&gt;

&lt;p&gt;Ideally, we would have all the current possible User-Agents and rotate them as we did with the IPs. Since that is close to impossible, we can at least have a few. There are &lt;a href="https://developers.whatismybrowser.com/useragents/explore/" rel="noopener noreferrer"&gt;lists of User Agents&lt;/a&gt; available for us to choose from.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;

&lt;span class="n"&gt;user_agents&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Mozilla/5.0 (iPhone; CPU iPhone OS 12_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Mozilla/5.0 (Linux; Android 11; SM-G960U) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.72 Mobile Safari/537.36&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;user_agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;choice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_agents&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;headers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;User-Agent&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user_agent&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;https://httpbin.org/headers&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;headers&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;User-Agent&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="c1"&gt;# Mozilla/5.0 (iPhone; CPU iPhone OS 12_2 like Mac OS X) ...
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Keep in mind that browsers change versions quite often, and this list can be obsolete in a few months. If we are to use User-Agent rotation, a reliable source is essential. We can do it by hand or use a service provider.&lt;/p&gt;

&lt;p&gt;We are a step closer, but there is still one flaw in the headers: they also know this trick and check other headers along with the UA.&lt;/p&gt;

&lt;h2&gt;
  
  
  Full Set of Headers
&lt;/h2&gt;

&lt;p&gt;Each browser, or even version, sends different headers. Check Chrome and Firefox in action: (ignore X-Amzn-Trace-Id)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"headers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"Accept"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; 
    &lt;/span&gt;&lt;span class="nl"&gt;"Accept-Encoding"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"gzip, deflate, br"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; 
    &lt;/span&gt;&lt;span class="nl"&gt;"Accept-Language"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"en-US,en;q=0.9"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; 
    &lt;/span&gt;&lt;span class="nl"&gt;"Host"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"httpbin.org"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; 
    &lt;/span&gt;&lt;span class="nl"&gt;"Sec-Ch-Ua"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;Chromium&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;;v=&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;92&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;, &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt; Not A;Brand&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;;v=&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;99&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;, &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;Google Chrome&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;;v=&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;92&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; 
    &lt;/span&gt;&lt;span class="nl"&gt;"Sec-Ch-Ua-Mobile"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"?0"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; 
    &lt;/span&gt;&lt;span class="nl"&gt;"Sec-Fetch-Dest"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"document"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; 
    &lt;/span&gt;&lt;span class="nl"&gt;"Sec-Fetch-Mode"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"navigate"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; 
    &lt;/span&gt;&lt;span class="nl"&gt;"Sec-Fetch-Site"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"none"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; 
    &lt;/span&gt;&lt;span class="nl"&gt;"Sec-Fetch-User"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"?1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; 
    &lt;/span&gt;&lt;span class="nl"&gt;"Upgrade-Insecure-Requests"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; 
    &lt;/span&gt;&lt;span class="nl"&gt;"User-Agent"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; 
    &lt;/span&gt;&lt;span class="nl"&gt;"X-Amzn-Trace-Id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Root=1-60ff12bb-55defac340ac48081d670f9d"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"headers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"Accept"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; 
    &lt;/span&gt;&lt;span class="nl"&gt;"Accept-Encoding"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"gzip, deflate, br"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; 
    &lt;/span&gt;&lt;span class="nl"&gt;"Accept-Language"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"en-US,en;q=0.5"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; 
    &lt;/span&gt;&lt;span class="nl"&gt;"Host"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"httpbin.org"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; 
    &lt;/span&gt;&lt;span class="nl"&gt;"Sec-Fetch-Dest"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"document"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; 
    &lt;/span&gt;&lt;span class="nl"&gt;"Sec-Fetch-Mode"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"navigate"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; 
    &lt;/span&gt;&lt;span class="nl"&gt;"Sec-Fetch-Site"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"none"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; 
    &lt;/span&gt;&lt;span class="nl"&gt;"Sec-Fetch-User"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"?1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; 
    &lt;/span&gt;&lt;span class="nl"&gt;"Upgrade-Insecure-Requests"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; 
    &lt;/span&gt;&lt;span class="nl"&gt;"User-Agent"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:90.0) Gecko/20100101 Firefox/90.0"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; 
    &lt;/span&gt;&lt;span class="nl"&gt;"X-Amzn-Trace-Id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Root=1-60ff12e8-229efca73430280304023fb9"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It means what you think it means. The previous array with 5 User Agents is incomplete. We need an array with a complete set of headers per User-Agent. For brevity, we will show a list with one item. It is already long enough.&lt;/p&gt;

&lt;p&gt;In this case, copying the result from httpbin is not enough. The ideal would be to copy it directly from the source. The easiest way to do it is from the Firefox or Chrome DevTools - or equivalent in your browser. Go to the Network tab, visit the target website, right-click on the request and copy as cURL. Then &lt;a href="https://curl.trillworks.com/#python" rel="noopener noreferrer"&gt;convert curl syntax to Python&lt;/a&gt; and paste the headers in the list.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;

&lt;span class="n"&gt;headers_list&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;authority&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;httpbin.org&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;cache-control&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;max-age=0&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sec-ch-ua&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'"&lt;/span&gt;&lt;span class="s"&gt;Chromium&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;;v=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;92&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; Not A;Brand&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;;v=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;99&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Google Chrome&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;;v=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;92&lt;/span&gt;&lt;span class="sh"&gt;"'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sec-ch-ua-mobile&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;?0&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;upgrade-insecure-requests&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;user-agent&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;accept&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sec-fetch-site&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;none&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sec-fetch-mode&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;navigate&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sec-fetch-user&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;?1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sec-fetch-dest&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;document&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;accept-language&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;en-US,en;q=0.9&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="c1"&gt;# , {...}
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;headers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;choice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;headers_list&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;https://httpbin.org/headers&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;headers&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We could add a Referer header for extra security - such as Google or an internal page from the same website. It would mask the fact that we are always requesting URLs directly without any interaction. But be careful since they could also detect fake referrers.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cookies
&lt;/h2&gt;

&lt;p&gt;We ignored the &lt;a href="https://en.wikipedia.org/wiki/HTTP_cookie" rel="noopener noreferrer"&gt;cookies&lt;/a&gt; in the headers section because the best option is not to use them. Unless we want to log in, we can ignore them most of the time. Some websites will block the content or redirect to login after a few visits. They use cookies for that. Thus we can avoid a possible block or login wall in not sending them.&lt;/p&gt;

&lt;p&gt;Other websites will be more tolerant if we perform several actions from the same IP with those session cookies. But it is difficult to say when it will work unless we test it. There is no silver bullet.&lt;/p&gt;

&lt;p&gt;Anyhow, it might look suspicious if all our requests go without cookies, but allowing them will be worse if we are not extremely careful. And there are no upsides to using session cookies for the initial request. But what happens if we want content generated in the browser after XHR calls?&lt;/p&gt;

&lt;p&gt;First, we will need to use a headless browser. Second, not sending cookies will look more than suspicious in these cases. Right after the initial load, the Javascript will try to get some content using an XHR call. There is no way we can do that call without cookies. A legitimate user cannot do that.&lt;/p&gt;

&lt;h2&gt;
  
  
  Headless Browsers
&lt;/h2&gt;

&lt;p&gt;While avoiding them - for performance reasons - would be preferable, sometimes there is no other choice. &lt;a href="https://www.selenium.dev/" rel="noopener noreferrer"&gt;Selenium&lt;/a&gt;, &lt;a href="https://github.com/puppeteer/puppeteer" rel="noopener noreferrer"&gt;Puppeteer&lt;/a&gt;, and &lt;a href="https://github.com/microsoft/playwright" rel="noopener noreferrer"&gt;Playwright&lt;/a&gt; are the most used and known libraries. The snippet below shows only the User-Agent, but since it is a real browser, the headers will include the entire set (Accept, Accept-Encoding, etcetera)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;playwright.sync_api&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;sync_playwright&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;sync_playwright&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# p.webkit is also supported, but there is a problem on Linux
&lt;/span&gt;    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;browser_type&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chromium&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;firefox&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="n"&gt;browser&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;browser_type&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;launch&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;new_page&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;goto&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;https://httpbin.org/headers&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;jsonContent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;inner_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;pre&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;jsonContent&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;headers&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;User-Agent&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="n"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/93.0.4576.0 Safari/537.36
# Mozilla/5.0 (X11; Linux x86_64; rv:90.0) Gecko/20100101 Firefox/90.0
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This approach comes with its own problem: take a look a the User-Agents. The Chromium one includes "HeadlessChrome," which will tell the target website, well, that it is a headless browser. They might act upon that.&lt;/p&gt;

&lt;p&gt;Back to the headers section: we can add custom headers that will overwrite the default ones. Replace the line in the previous snippet with this one and paste a valid User-Agent:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;new_page&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;extra_http_headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;User-Agent&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;...&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is just an entry-level with headless browsers. Headless detection is a field in itself, and many people are working on it. Some to detect it, some to avoid being detected. As an example, you can visit &lt;a href="https://pixelscan.net/" rel="noopener noreferrer"&gt;pixelscan&lt;/a&gt; with an actual browser and a headless one. To be deemed "consistent," you'll need to work hard.&lt;/p&gt;

&lt;p&gt;Take a look at the screenshot below, taken when visiting pixelscan with playwright. See the UA? The one we fake is all right, but they can detect that we are lying by checking the navigator Javascript API. We could simulate that too (a bit more complicated), but then another check would identify the headless browser. In summary, having 100% coverage is complex, but you won't need it most of the time.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foeftquoozwbtf9s6ohyv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foeftquoozwbtf9s6ohyv.png" alt="Pixelscan Inconsistent"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Analyzing the Firefox output from the previous example, it looks like a legit one, almost the same we see using a real browser in a Linux machine. Our advice is not to lie if possible. Faking an iPhone UA is easy; the problems come afterward. You can also fake &lt;code&gt;navigator.userAgent&lt;/code&gt; and some others. But probably not WebGL, touch events, or battery status.&lt;/p&gt;

&lt;p&gt;If you are running a crawler on the cloud, probably a Linux machine, it might look a bit unusual. But it is entirely legit to run Firefox on Linux - we are doing it right now.&lt;/p&gt;

&lt;h2&gt;
  
  
  Geographic Limits or Geoblocking
&lt;/h2&gt;

&lt;p&gt;Have you ever tried to watch CNN from outside the US?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6p4is4dvpfat7zbnz37l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6p4is4dvpfat7zbnz37l.png" alt="CNN Geoblocked"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That's called &lt;a href="https://en.wikipedia.org/wiki/Geo-blocking" rel="noopener noreferrer"&gt;geoblocking&lt;/a&gt;. Only connections from inside the US can watch CNN live. To bypass that, we could use a &lt;a href="https://en.wikipedia.org/wiki/Virtual_private_network" rel="noopener noreferrer"&gt;Virtual Private Network (VPN)&lt;/a&gt;. We can then browse as usual, but the website will see a local IP thanks to the VPN.&lt;/p&gt;

&lt;p&gt;The same can happen when scraping websites with geoblocking. There is an equivalent for proxies: geolocated proxies. Some proxy providers allow us to choose from a list of countries. With that activated, we will only get local IPs from the US, for example.&lt;/p&gt;

&lt;h2&gt;
  
  
  Behavioral patterns
&lt;/h2&gt;

&lt;p&gt;Blocking IPs and User-Agents is not enough these days. They become unmanageable and stale in hours, if not minutes. As long as we perform requests with "clean" IPs and real-world User-Agents, we are mainly safe. There are more factors involved, but most of the requests should be valid.&lt;/p&gt;

&lt;p&gt;However, modern anti-bot software use machine learning and behavioral patterns, not just static markers (IP, UA, geolocation). That means that we would be detected if we always performed the same actions in the same order.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Go to the homepage&lt;/li&gt;
&lt;li&gt;Click on the "Shop" button&lt;/li&gt;
&lt;li&gt;Scroll down&lt;/li&gt;
&lt;li&gt;Go to page 2&lt;/li&gt;
&lt;li&gt;...&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;After a few days, launching the same script could result in every request being blocked. Many people can perform those same actions, but bots have something that makes them obvious: speed. With software, we would execute every step sequentially, while an actual user would take a second, then click, scroll down slowly using the mouse wheel, move the mouse to the link and click.&lt;/p&gt;

&lt;p&gt;Maybe there is no need to fake all that, but be aware of the possible problems and know how to face them.&lt;/p&gt;

&lt;p&gt;We have to think what is what we want. Maybe we don't need that first request since we only require the second page. We could use that as an entry point, not the homepage. And save one request. It can scale to hundreds of URLs per domain. No need to visit every page in order, scroll down, click on the next page and start again.&lt;/p&gt;

&lt;p&gt;To scrape search results, once we recognize the URL pattern for pagination, we only need two data points: the number of items and items per page. And most of the time, that info is present on the first page or request.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;bs4&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BeautifulSoup&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://scrapeme.live/shop/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;soup&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BeautifulSoup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;html.parser&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;pages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;soup&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.woocommerce-pagination a.page-numbers:not(.next)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pages&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;href&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="c1"&gt;# https://scrapeme.live/shop/page/2/
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pages&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;href&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="c1"&gt;# https://scrapeme.live/shop/page/48/
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One request shows us that there are 48 pages. We can now queue them. Mixing with the other techniques, we would scrape the content from this page and add the remaining 47. To scrape them, we could:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Shuffle the page order to avoid pattern detection&lt;/li&gt;
&lt;li&gt;Use different IPs and User-Agent, so each request looks like a new one&lt;/li&gt;
&lt;li&gt;Add delays between some of the calls&lt;/li&gt;
&lt;li&gt;Use Google as a referrer randomly&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We could write some snippet mixing all of these, but the best option in real life is to use a tool with it all like &lt;a href="https://docs.scrapy.org/en/latest/" rel="noopener noreferrer"&gt;Scrapy&lt;/a&gt;, &lt;a href="http://docs.pyspider.org/en/latest/" rel="noopener noreferrer"&gt;pyspider&lt;/a&gt;, &lt;a href="https://github.com/bda-research/node-crawler" rel="noopener noreferrer"&gt;node-crawler&lt;/a&gt; (Node.js), or &lt;a href="https://github.com/gocolly/colly" rel="noopener noreferrer"&gt;Colly&lt;/a&gt; (Go). The idea being the snippets is to understand each problem on its own. But for large-scale, real-life projects, handling everything on our own would be too complicated.&lt;/p&gt;

&lt;h2&gt;
  
  
  Captchas
&lt;/h2&gt;

&lt;p&gt;Even the best-prepared request can get caught and shown a &lt;a href="https://en.wikipedia.org/wiki/CAPTCHA" rel="noopener noreferrer"&gt;captcha&lt;/a&gt;. Nowadays, solving captchas is achievable - Anti-Captcha and 2Captcha - but a waste of time and money. The best solution is to avoid them. The second best is to forget about that request and retry.&lt;/p&gt;

&lt;p&gt;It might sound counterintuitive, but waiting for a second and retrying the same request with a different IP and set of headers will be faster than solving a captcha. Try it yourself and tell us about the experience 😉.&lt;/p&gt;

&lt;p&gt;Login Wall or Paywall&lt;br&gt;
Some websites prefer to show or redirect users to a login page instead of a captcha. Instagram will redirect anonymous users after a few visits, and Medium will show a paywall.&lt;/p&gt;

&lt;p&gt;As with the captchas, a solution can be to tag the IP as "dirty," forget the request and try again. Libraries usually follow redirects by default but offer an option not to allow them. Ideally, we would only disallow redirects to log in, sign up, or specific pages, not all of them. In this example, we will follow redirects until its location contains "accounts/login." In an actual project, we would queue that page with a delay and try again.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;

&lt;span class="n"&gt;session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;session&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;http://instagram.com&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;allow_redirects&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;location&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;redirect&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;resolve_redirects&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;location&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;redirect&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;location&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;redirect&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;location&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;location&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;accounts/login&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;location&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exit&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="c1"&gt;# no need to exit, return would be enough
# 301 https://instagram.com/
# 301 https://www.instagram.com/
# 302 https://www.instagram.com/accounts/login/
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Be a good internet citizen
&lt;/h2&gt;

&lt;p&gt;We can use several websites for testing, but be careful when doing the same at scale. Try to be a good internet citizen a do not cause -small- DDoS. Limit your interactions per domain. Amazon can handle thousands of requests per second. But not all target sites will.&lt;/p&gt;

&lt;p&gt;We are always talking about "read-only" browsing mode. Access a page and read its contents. Never submit a form or perform active actions with malicious intent.&lt;/p&gt;

&lt;p&gt;If we were to take a more active approach, several other factors would matter: writing speed, mouse movement, navigation without clicking, browsing many pages simultaneously, etcetera. Bot prevention software is specifically aggressive with active actions. As it should for security reasons.&lt;/p&gt;

&lt;p&gt;We won't go into details on this part, but these actions will give them new reasons to block requests. Again, good citizens don't try massive logins. We are talking about scraping, not malicious activities.&lt;/p&gt;

&lt;p&gt;Sometimes websites make data collections harder, maybe not on purpose. But with modern frontend tools, CSS classes could change every day, ruining thoroughly prepared scripts. For more details, read our previous entry on &lt;a href="https://dev.to/anderrv/mastering-web-scraping-in-python-from-zero-to-hero-4fj4"&gt;how to scrape data in python&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;We'd like you to remember the low-hanging fruits:&lt;/p&gt;

&lt;p&gt;IP rotating proxies&lt;br&gt;
Full set headers, including User-Agent&lt;br&gt;
Avoid patterns that might tag you as a bot&lt;/p&gt;

&lt;p&gt;There are many more, and probably more we didn't cover. But with these techniques, you should be able to crawl and scrape at scale.&lt;/p&gt;

&lt;p&gt;Contact us if you know any more website scraping tricks or have doubts about applying them.&lt;/p&gt;

&lt;p&gt;Remember, we covered &lt;a href="https://dev.to/anderrv/mastering-web-scraping-in-python-from-zero-to-hero-4fj4"&gt;scraping&lt;/a&gt; and avoiding being blocked, but there is much more: crawling, converting and storing the content, scaling the infrastructure, and more. Stay tuned!&lt;/p&gt;

&lt;p&gt;Do not forget to take a look at the rest of the posts in this series.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.zenrows.com/blog/mastering-web-scraping-in-python-scaling-to-distributed-crawling?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=avoid_blocking" rel="noopener noreferrer"&gt;Scaling to Distributed Crawling&lt;/a&gt; (4/4)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.zenrows.com/blog/mastering-web-scraping-in-python-crawling-from-scratch?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=avoid_blocking" rel="noopener noreferrer"&gt;Crawling from Scratch&lt;/a&gt; (3/4)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.zenrows.com/blog/mastering-web-scraping-in-python-from-zero-to-hero?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=avoid_blocking" rel="noopener noreferrer"&gt;Mastering Extraction&lt;/a&gt; (1/4)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Did you find the content helpful? Please, spread the word and share it. 👈&lt;/p&gt;




&lt;p&gt;Originally published at  &lt;a href="https://www.zenrows.com/blog/stealth-web-scraping-in-python-avoid-blocking-like-a-ninja?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=avoid_blocking" rel="noopener noreferrer"&gt;https://www.zenrows.com&lt;/a&gt; &lt;/p&gt;

</description>
      <category>python</category>
      <category>beginners</category>
      <category>tutorial</category>
      <category>webdev</category>
    </item>
  </channel>
</rss>
