Well, there is this website that I have been trying to scrape for a few days now.
I had tried everything, from Scrapy-Splash on Docker to almost giving up, because I read somewhere that the site was JavaScript rendered. The source code shown in the Inspect panel of the developer tools was different from the source code at view-source:https://... for the very same page. How could this be possible?
Then I kept searching the internet and found this concept: you can mimic a web browser's requests from a server using an API program, and it worked magically.
Some of the API programs available are Postman and Insomnia.
I prefer Insomnia for this particular case, but feel free to use any other API program of your choice.
Now, every website has its own rules on how it may be scraped. These can be accessed by adding /robots.txt to the end of a URL, e.g. "https://pythonscraping.com/robots.txt".
This can generate something like :
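The exact directives differ from site to site; as a purely illustrative example (not the actual file for the site above), a robots.txt might look roughly like this:

User-agent: *
Disallow: /admin/
Disallow: /search/
Crawl-delay: 10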
Different websites have different robots.txt files; follow their directives and you will have an easy time scraping.
So the first thing is to right-click on the page and choose Inspect (Q) to open the developer tools.
From the above image: go to the Network panel and click on XHR. XHR (XMLHttpRequest) is a JavaScript object that provides an easy way to retrieve data from a URL without having to do a full page refresh.
Then look at the Status column for code 200, which means the request was successful.
Click on one of the successful XHRs; a table will be shown. Look for the Response tab and click on it.
From here, you need to have already understood the structure of the webpage as the browser originally rendered it, because in the Response pane you will be looking for the structure that matches the data displayed on your page. For example:
The above is the response structure that suits my needs. Sometimes you may only find it after the next page load.
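If you want to double-check from Python that the endpoint you found really returns the data feeding the page, a minimal sketch like the one below can help. The URL is only a placeholder for the request URL shown in the Network panel, and many endpoints will also require the original headers or cookies, which is exactly why we copy the request as cURL in the next step.

import json
import requests

# Placeholder: paste the request URL of the successful XHR from the Network panel here
xhr_url = "https://example.com/api/endpoint"

resp = requests.get(xhr_url)
print(resp.status_code)                          # 200 means the request went through
print(json.dumps(resp.json(), indent=2)[:500])   # skim the start of the JSON structure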
I am assuming you already have an account on one of the API programs.
Then you need to right click on the GET request and copy it as cURL:
Then you open your Insomnia API program and paste the cURL. Run it by pressing Send.
As seen above, we have the structure of the web page nicely rendered on the far right side.
If your page has pagination, you need to locate it so you can easily loop over the items you want.
In this case we won't be needing a loop, since we can fetch the whole set of information by setting the limit to 197 (the maximum number of real estate listings available).
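For sites where you do need to loop over pages, a sketch like the one below can work. It is only a sketch under assumptions: the endpoint, the headers, the "page" and "limit" parameter names, and the "items" key are all placeholders; check the copied cURL and the JSON response for the real names your site uses.

import requests

url = "https://example.com/api/items"   # placeholder endpoint
headers = {}                            # use the headers copied from the cURL request
items = []
page = 1
while True:
    # "page" and "limit" are assumed parameter names; your API may use "offset" or similar
    params = {"page": str(page), "limit": "100"}
    data = requests.get(url, headers=headers, params=params).json()
    batch = data.get("items", [])       # "items" is a hypothetical key
    if not batch:                       # stop once a page comes back empty
        break
    items.extend(batch)
    page += 1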
Look for the keys in the API response preview. Some of the keys include: Features, Countries, Properties, etc.
These keys can be used to extract information from the webpage with a simple Python script.
But first, let us generate the code for our data output.
On the left-hand side, there is a dropdown menu when you press New Request; choose Generate Code. This will generate code in any language of your choice. For our case we will use Python Requests.
The fun part: coding.
import requests
import pandas as pd
url = "https://api.knightfrank.com/2/properties/search"
x=197 # you could loop twice in batches of 100
querystring = {"cachetoken":"www.knightfrank.co.ke|en",
"slug":"kenya",
"distance":"0",
"propertyTypes":"all-types",
"filterCurrency":"KES",
"order":"desdate",
"resultType":"Full",
"floorDistributionMax":"1000",
"landAreaDistributionMax":"250",
"select":"NoDescription",
"currency":"KES",
"availability":"available",
"division":"Residential",
"type":"Sales",
"floorAreaMeasurementUnit":"sqm",
"landAreaMeasurementUnit":"acres",
"limit":f"{x}"}
The query string above contains the filters the API accepts (division, type, currency, limit and so on); each key-value pair narrows down the results that will be returned.
Most of this code is auto-generated by Insomnia; you just need to modify it to fit your needs.
headers = {  # headers omitted here; just use the default auto-generated ones from Insomnia
...
}
# start of data extraction
res = []
r = requests.request("GET", url, headers=headers, params=querystring)
data = r.json()
for p in data["Properties"]:# the "Properties" is a key
res.append(p) # we want to append its content to the empty string above
estate = pd.json_normalize(res)
# converting our data to a CSV file
estate.to_csv('estate.csv')
After this, just go through the data, see which columns are unnecessary, and remove them.
A simple Python script like this one can be of use:
import pandas as pd

estate = pd.read_csv('file/path/estate.csv')
# dropping multiple columns; list as many column names as you need to remove
estate = estate.drop(['column1', 'column2'], axis=1)
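If you are not sure which columns exist in the first place, it helps to list them before dropping anything. A quick check, assuming estate is the DataFrame loaded above (the output filename below is just an example):

print(estate.columns.tolist())   # every column produced by json_normalize
print(estate.head())             # peek at the first few rows
estate.to_csv('estate_clean.csv', index=False)   # save the cleaned version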
That is it: you have scraped the data from a website.