<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Danlof</title>
    <description>The latest articles on DEV Community by Danlof (@danlof).</description>
    <link>https://dev.to/danlof</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1200960%2F8be33d28-328e-47f6-88e9-c91e405e6efb.jpeg</url>
      <title>DEV Community: Danlof</title>
      <link>https://dev.to/danlof</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/danlof"/>
    <language>en</language>
    <item>
      <title>Web scraping in 10 mins</title>
      <dc:creator>Danlof</dc:creator>
      <pubDate>Sun, 26 Nov 2023 03:33:31 +0000</pubDate>
      <link>https://dev.to/danlof/web-scraping-in-10-mins-5h3a</link>
      <guid>https://dev.to/danlof/web-scraping-in-10-mins-5h3a</guid>
      <description>&lt;p&gt;Well, there is this website that I have been trying to scrape for a few days now.&lt;br&gt;
I had tried everything from scrapy splash on docker to almost giving up because I read somewhere that it was JavaScript rendered. Since the source code from the inspect part of the developer tools  was different from the source code from the view-source:https//... on the same developer tools.How could this be possible?&lt;br&gt;
Then I kept searching on internet and found this concept; where you can mimic web-browsers requests from a server using an API program,and  it worked magically.&lt;br&gt;
Some of the API programs are &lt;a href="https://www.postman.com/" rel="noopener noreferrer"&gt;&lt;em&gt;postman&lt;/em&gt; &lt;/a&gt;and &lt;a href="https://insomnia.rest/" rel="noopener noreferrer"&gt;&lt;em&gt;insomnia&lt;/em&gt;&lt;/a&gt;.&lt;br&gt;
I prefer using insomnia for this particular case , feel free to use any other API program of your choice.&lt;/p&gt;



&lt;p&gt;Now, every website has its own rules on how it may be scraped. These can be viewed by appending robots.txt to the root URL, e.g. "&lt;a href="https://pythonscraping.com/robots.txt" rel="noopener noreferrer"&gt;https://pythonscraping.com/robots.txt&lt;/a&gt;"&lt;/p&gt;

&lt;p&gt;This returns something like:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0fiogv1eow0yelaww58u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0fiogv1eow0yelaww58u.png" alt="roboots.txt"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Different websites have different robots.txt files; follow their directives and you will have an easy time scraping.&lt;/p&gt;
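&lt;p&gt;As a quick sanity check before scraping, Python's standard-library robotparser can evaluate a robots.txt against the URLs you plan to request. A minimal sketch (the rules and URLs below are made up for illustration; in practice you would point it at the real site's robots.txt):&lt;/p&gt;

```python
from urllib import robotparser

# Made-up robots.txt rules for illustration. Against a real site you would
# instead call rp.set_url("https://pythonscraping.com/robots.txt"); rp.read()
rules = """
User-agent: *
Disallow: /private/
Allow: /
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("*", "https://example.com/listings"))   # allowed
print(rp.can_fetch("*", "https://example.com/private/x"))  # disallowed
```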



&lt;p&gt;So the first thing is to open the developer tools: right-click on the page and choose Inspect (Q).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fri9nekltozzwl0z7ee0r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fri9nekltozzwl0z7ee0r.png" alt="DevTools image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From the above image, go to the Network panel and click XHR. XHR (XMLHttpRequest) is a JavaScript object that lets a page retrieve data from a URL without a full page refresh.&lt;br&gt;
Then look in the Status column for code 200, which means the request was successful.&lt;br&gt;
Click one of the successful XHRs; a table will be shown. Look for the Response tab and click it.&lt;br&gt;
At this point you need to have understood the structure of the page as the browser originally rendered it, because in the Response view you will look for the data that matches what the rendered page displays. For example:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fck1x7qwaw8n7qkp7hxtx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fck1x7qwaw8n7qkp7hxtx.png" alt="Page structure"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;The above is the page structure that suits my needs. Sometimes you may find it only after the next page loads.&lt;/p&gt;



&lt;p&gt;I am assuming you have an account on one of the API clients.&lt;br&gt;
Next, right-click on the GET request and copy it as cURL:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0sfhb7v89ma66o702eeg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0sfhb7v89ma66o702eeg.png" alt="How to copy"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then open Insomnia, paste the cURL command, and run it by pressing Send.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0h0tiyzdzsmcfzdisf4t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0h0tiyzdzsmcfzdisf4t.png" alt="Image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As seen above, we have the structure of the web page nicely rendered on the far right side.&lt;br&gt;
If your page has pagination, you have to locate it so you can loop over the items you want.&lt;br&gt;
In this case we won't need a loop, since we can fetch everything at once by setting the limit to 197 (the maximum number of real-estate listings available).&lt;/p&gt;
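&lt;p&gt;If the site did require pagination, a loop over batches would look roughly like the sketch below. The offset parameter here is an assumption for illustration only; inspect the real XHR query string to find the parameter the API actually uses (it might be page or skip).&lt;/p&gt;

```python
# Sketch: splitting 197 results into requests of at most 100 each.
# "offset" is a hypothetical pagination parameter; check the real
# XHR query string for the one the API actually uses.
total, batch = 197, 100

pages = []
for offset in range(0, total, batch):
    params = {"limit": str(min(batch, total - offset)),
              "offset": str(offset)}
    pages.append(params)  # each dict would be passed as params= to requests.get

print(pages)
```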

&lt;p&gt;Look for the keys in the API client's response preview. Some of the keys include: Features, Countries, Properties, etc.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw5asc47w11zdub3olo2m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw5asc47w11zdub3olo2m.png" alt="Keys to the project"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The above keys can be used to extract information from the webpage with a simple Python script.&lt;br&gt;
But first, let us generate the code for our data output.&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl3k4ddoibx344gqu2p6u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl3k4ddoibx344gqu2p6u.png" alt="code generator"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;On the left-hand side, a dropdown menu appears when you press New Requests; choose Generate Code. This will generate code in any language of your choice. For our case we will use Python requests.&lt;/p&gt;



&lt;p&gt;The fun part: coding.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import requests
import pandas as pd

url = "https://api.knightfrank.com/2/properties/search"

x = 197  # you could also loop twice in batches of 100
querystring = {"cachetoken":"www.knightfrank.co.ke|en",
               "slug":"kenya",
               "distance":"0",
               "propertyTypes":"all-types",
               "filterCurrency":"KES",
               "order":"desdate",
               "resultType":"Full",
               "floorDistributionMax":"1000",
               "landAreaDistributionMax":"250",
               "select":"NoDescription",
               "currency":"KES",
               "availability":"available",
               "division":"Residential",
               "type":"Sales",
               "floorAreaMeasurementUnit":"sqm",
               "landAreaMeasurementUnit":"acres",
               "limit":f"{x}"}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The above code sets up the request URL and the query parameters that select the categories of data we want.&lt;br&gt;
Most of it is auto-generated by Insomnia; you just need to modify it to your needs.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;headers = { # i have excluded them just use the default auto generated ones 
...
}

# start of data extraction
res = []
r = requests.request("GET", url, headers=headers, params=querystring)

data = r.json()
for p in data["Properties"]:# the "Properties" is a key  
    res.append(p) # we want to append its content to the empty string above 

estate = pd.json_normalize(res)

# converting our file to a csv file 
estate.to_csv('estate.csv') 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
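&lt;p&gt;For intuition about what pd.json_normalize does above: it flattens nested JSON objects into columns with dotted names. A toy example (the field names here are made up, not the real API's):&lt;/p&gt;

```python
import pandas as pd

# Made-up records shaped like a nested API response
sample = [
    {"Id": 1, "Price": {"Amount": 5000000, "Currency": "KES"}},
    {"Id": 2, "Price": {"Amount": 7500000, "Currency": "KES"}},
]

df = pd.json_normalize(sample)
print(sorted(df.columns))  # ['Id', 'Price.Amount', 'Price.Currency']
```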



&lt;p&gt;After this, just go through the data, see which columns are unnecessary, and remove them.&lt;br&gt;
A simple Python script like this one can help:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;estate = pd.read(file/path/estate.csv)

# dropping multiple columns 
estate.drop(['column1','column2'..., axis=1])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
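&lt;p&gt;To make the clean-up concrete, here is a runnable toy version using in-memory data (the column names are placeholders, not the real dataset's):&lt;/p&gt;

```python
import pandas as pd

# Toy data standing in for estate.csv; column names are placeholders
estate = pd.DataFrame({
    "Bedrooms": [3, 4],
    "column1": ["junk", "junk"],
    "column2": [0, 0],
})

# drop() returns a new DataFrame; assign it back (or pass inplace=True)
estate = estate.drop(["column1", "column2"], axis=1)
print(list(estate.columns))  # ['Bedrooms']
```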



&lt;p&gt;That's it! You have scraped the data from a website.&lt;/p&gt;

</description>
      <category>webscraping</category>
      <category>python</category>
      <category>api</category>
    </item>
  </channel>
</rss>
