Web scraping is the process of extracting data from websites, where it is contained in the HTML tags of their pages.
We'll be scraping Buy Rent Kenya (https://www.buyrentkenya.com/), a website that lists projects and houses available for sale or rent in Kenya.
I used BeautifulSoup, a Python library, to scrape this site, along with requests and pandas. The first step is to install these libraries.
To keep the installation simple, it's wise to create a Python virtual environment that will contain the required libraries.
We'll set up the environment with a virtual environment tool. First, install virtualenv (the venv module used below is part of the Python 3 standard library, so this step is optional):
pip install virtualenv
To use venv in your project, open your terminal, create a new project folder, cd into it, and create the environment with the following commands:
mkdir web_scraping # create a new folder
cd web_scraping
python -m venv venv # create an environment named venv
To activate the environment, use
source venv/bin/activate
- for Linux and Mac users
venv\Scripts\activate
- for Windows users.
After activating the environment, install the required libraries:
pip install requests beautifulsoup4 pandas
Now that the environment is ready, we can begin the web scraping process.
Getting Started:
Before we dive into the code, let's understand the goal. We want to collect data on houses for rent, including details such as title, location, number of bedrooms and bathrooms, description, and price. We'll be scraping data from multiple pages to create a comprehensive dataset.
Create a Python script; I named mine buy_rent_kenya.py.
The script follows these main steps:
- Send a GET request to the initial URL.
- Use BeautifulSoup to parse the HTML content.
- Extract information from each listing on the page.
- Iterate through multiple pages, repeating the process.
- Store the collected data in a Pandas DataFrame.
- Save the DataFrame to a CSV file.
Import the necessary libraries for the task:
import pandas as pd
from bs4 import BeautifulSoup
import requests
Next, find your browser's user agent: search for "what is my browser agent" in your browser and the result will show it.
Or get it from this link.
The user agent is a string your browser sends with every request to identify itself. By default, the requests library identifies itself as python-requests, which some websites block or treat differently from a browser. Sending your browser's user agent makes the request look like ordinary browser traffic, so you are more likely to receive the same content the browser does.
url = "https://www.buyrentkenya.com/houses-for-rent"
agent = "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
These lines set the target URL and the user agent, which simulates a web browser. It helps in avoiding any potential blocking or restrictions imposed by the website.
Note: Replace the agent value with your own, obtained from the search above.
# Set headers for the HTTP request
HEADERS = {"User-Agent": agent, "Accept-Language": "en-US, en;q=0.5"}
# Send a GET request to the URL
response = requests.get(url, headers=HEADERS)
Here, headers are defined for the HTTP request, and a GET request is made to the specified URL using the requests library.
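To check which headers would go out without contacting the site, the request can be built and inspected before it is sent. The following is a minimal sketch: the agent string is a made-up placeholder, and the variable names default_agent and prepared are introduced only for this illustration.

```python
import requests

url = "https://www.buyrentkenya.com/houses-for-rent"
# Placeholder value for illustration only; use your own browser's user agent.
agent = "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0"
headers = {"User-Agent": agent, "Accept-Language": "en-US, en;q=0.5"}

# The user agent requests would send if none were supplied.
default_agent = requests.utils.default_user_agent()

# Build the request without sending it, so nothing touches the network.
prepared = requests.Request("GET", url, headers=headers).prepare()

print("Default user agent:", default_agent)
print("User agent to be sent:", prepared.headers["User-Agent"])
```

The first line printed starts with python-requests, which is the value a website would see if the User-Agent header were left out.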
# Parse the HTML content with BeautifulSoup
soup = BeautifulSoup(response.content,'html.parser')
The HTML content of the page is parsed using BeautifulSoup, making it easier to navigate and extract information.
# Initialize lists to store data
titles = []
locations = []
no_of_bathrooms = []
no_of_bedrooms = []
descriptions = []
prices = []
links = []
Empty lists are initialized to store the extracted data.
# Find all listing cards on the page
houses = soup.find_all("div",class_="listing-card")
This line locates all HTML elements with the class "listing-card," which corresponds to individual housing listings on the page.
# Extract information from each listing
for house in houses:
    # Extract title
    title = house.find("span", class_="relative top-[2px] hidden md:inline").text.strip()
    # Extract location
    location = house.find("p", class_="ml-1 truncate text-sm font-normal capitalize text-grey-650").text.strip()
    # Extract number of bedrooms and bathrooms
    no_of_bedroom = house.find_all("span", class_="whitespace-nowrap font-normal")[0].text.strip()
    no_of_bathroom = house.find_all("span", class_="whitespace-nowrap font-normal")[1].text.strip()
    # Extract description
    description = house.find("a", class_="block truncate text-grey-500 no-underline").text.strip()
    # Extract price
    price = house.find("p", class_="text-xl font-bold leading-7 text-grey-900").text.strip()
    # Extract link
    link = house.find("a", class_="text-black no-underline").get("href")
    # Append extracted data to respective lists
    titles.append(title)
    locations.append(location)
    no_of_bathrooms.append(no_of_bathroom)
    no_of_bedrooms.append(no_of_bedroom)
    descriptions.append(description)
    prices.append(price)
    links.append(link)
In this loop, data such as title, location, number of bedrooms and bathrooms, description, price, and link are extracted from each listing and appended to their respective lists.
- The title is found within a span tag with the class "relative top-[2px] hidden md:inline".
- The location is found within a p tag with the class "ml-1 truncate text-sm font-normal capitalize text-grey-650".
- Both the number of bedrooms and the number of bathrooms are within span tags with the same class, "whitespace-nowrap font-normal". We therefore use BeautifulSoup's find_all(), which returns the matches as a list, and index into it to get each value. The index [0] corresponds to bedrooms, and [1] corresponds to bathrooms.
- The description is found within an a tag with the class "block truncate text-grey-500 no-underline".
- The price is located within a p tag with the class "text-xl font-bold leading-7 text-grey-900".
- The link is obtained from an a tag with the class "text-black no-underline" using the get("href") method.
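One detail worth knowing about the long class strings above: when class_ is given several space-separated classes, BeautifulSoup compares it against the whole class attribute, so the classes must appear in the same order as in the page. The sketch below illustrates this on made-up markup (the HTML and the "featured" class are assumptions, not taken from the site) and shows a CSS selector via select() as one alternative that ignores class order.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, loosely modelled on the class names used above.
html = """
<div class="listing-card featured">
  <p class="text-xl font-bold leading-7 text-grey-900">Price A</p>
</div>
<div class="listing-card">
  <p class="font-bold text-xl leading-7 text-grey-900">Price B</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# A single class name matches any element that carries that class.
cards = soup.find_all("div", class_="listing-card")

# A multi-class string is compared against the whole class attribute,
# so the second <p>, whose classes are in a different order, is missed.
exact = soup.find_all("p", class_="text-xl font-bold leading-7 text-grey-900")

# A CSS selector matches on the individual classes, in any order.
by_selector = soup.select("p.text-xl.font-bold")

print(len(cards), len(exact), len(by_selector))
```

If a selector that used to work suddenly returns nothing, a reordered or extended class attribute is one possible cause to check.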
Note: Make sure to inspect the HTML structure of the website you are scraping and adapt these identifiers to it. If the website structure changes, you may need to update the selectors.
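When a selector no longer matches, find() returns None, and calling .text on it raises an AttributeError that stops the whole script. One way to soften this, sketched below, is a small helper that returns None for a missing element. The helper name text_or_none and the sample markup are hypothetical and not part of the original script.

```python
from bs4 import BeautifulSoup


def text_or_none(parent, tag, class_name):
    """Return the stripped text of the first matching element, or None."""
    element = parent.find(tag, class_=class_name)
    return element.text.strip() if element is not None else None


# Hypothetical listing that has a price element but no location element.
html = """
<div class="listing-card">
  <p class="text-xl font-bold leading-7 text-grey-900"> Sample price </p>
</div>
"""
card = BeautifulSoup(html, "html.parser").find("div", class_="listing-card")

price = text_or_none(card, "p", "text-xl font-bold leading-7 text-grey-900")
location = text_or_none(
    card, "p", "ml-1 truncate text-sm font-normal capitalize text-grey-650"
)

print(price, location)
```

With this approach a listing with a missing field still produces a row, and the gap shows up as an empty value in the final dataset instead of a crash.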
# Display the number of houses extracted from the first page
print(f"The First Page No Of Titles is {len(titles)}")
This prints out the number of titles on the first page.
The website is paginated after the first page: the page number appears in the URL and increases by one for each page, in the format below:
url = f"https://www.buyrentkenya.com/houses-for-rent?page={page}"
The code to extract information from the paginated URLs looks like this:
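# Iterate through multiple pages
for page in range(2, 56):
    url = f"https://www.buyrentkenya.com/houses-for-rent?page={page}"
    print(url)
    # Make a GET request for each page
    response = requests.get(url, headers=HEADERS)
    # Parse the new page; without this step, soup would still hold page 1
    soup = BeautifulSoup(response.content, "html.parser")
    houses = soup.find_all("div", class_="listing-card")
    for house in houses:
        # Repeat the process of extracting data from each listing
        # (the same find and append statements as in the first-page loop)
        ...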
These lines iterate through multiple pages, updating the URL for each page, making a GET request, and extracting data from each listing on the page.
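Since the same extraction runs on every page, one option is to put it in a function and call it once per page. The sketch below is an assumption-laden illustration of that idea, not the original script: parse_listings, scrape_pages, and fake_fetch are hypothetical names, the markup uses a simplified h2 title, and the network call is replaced by a stand-in so the example runs offline.

```python
from bs4 import BeautifulSoup


def parse_listings(html):
    """Return one dict per listing card found in an HTML page."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for card in soup.find_all("div", class_="listing-card"):
        link = card.find("a")
        rows.append({
            # Simplified, hypothetical tag; the real page uses other selectors.
            "title": card.find("h2").text.strip(),
            "link": link.get("href") if link is not None else None,
        })
    return rows


def scrape_pages(fetch, first_page, last_page):
    """Fetch and parse each page in turn, collecting all rows."""
    rows = []
    for page in range(first_page, last_page + 1):
        url = f"https://www.buyrentkenya.com/houses-for-rent?page={page}"
        # Each page is parsed from its own freshly fetched HTML.
        rows.extend(parse_listings(fetch(url)))
    return rows


def fake_fetch(url):
    """Stand-in for the network call, returning made-up HTML for a URL."""
    page = url.split("page=")[1]
    return (
        f'<div class="listing-card"><h2>House on page {page}</h2>'
        f'<a href="/listing/{page}">view</a></div>'
    )


rows = scrape_pages(fake_fetch, 2, 4)
print(len(rows), rows[0])
```

In a real run, fetch could be a small function that calls requests.get(url, headers=HEADERS) and returns response.content.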
# Display the total number of titles scraped
print(f"The Total no of Titles we have scraped is {len(titles)}")
This prints out the total number of titles scraped from all pages.
The last part is to save the extracted data to a CSV file:
# Organize data into a DataFrame
data = {
    "Titles": titles,
    "Locations": locations,
    "No Of Bathrooms": no_of_bathrooms,
    "No Of Bedrooms": no_of_bedrooms,
    "Prices": prices,
    "Description": descriptions
}
df = pd.DataFrame(data)
print(df.shape)
#print(df.head(10))
The extracted data is organized into a Pandas DataFrame for better structure and analysis.
# Save DataFrame to a CSV file
df.to_csv("buy_rent_kenya.csv",index=False)
Finally, the DataFrame is saved as a CSV file named "buy_rent_kenya.csv". The index=False parameter ensures that the DataFrame index is not included in the CSV file.
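To see what index=False changes, the sketch below writes a small DataFrame with and without the index and compares the header lines. The rows are made-up sample values, and in-memory buffers stand in for files so the example leaves nothing on disk.

```python
import io

import pandas as pd

# Hypothetical rows standing in for the scraped lists.
df = pd.DataFrame({
    "Titles": ["Sample house A", "Sample house B"],
    "Locations": ["Sample area 1", "Sample area 2"],
})

# Write to in-memory buffers instead of files so the output is easy to compare.
with_index = io.StringIO()
without_index = io.StringIO()
df.to_csv(with_index)                  # default: index written as an unnamed first column
df.to_csv(without_index, index=False)  # index left out

header_with_index = with_index.getvalue().splitlines()[0]
header_without_index = without_index.getvalue().splitlines()[0]
print(repr(header_with_index))
print(repr(header_without_index))

# Reading the index-free CSV back gives the same columns and rows.
restored = pd.read_csv(io.StringIO(without_index.getvalue()))
print(restored.shape)
```

With the default setting, the header starts with an empty column name for the index, which then shows up as an extra unnamed column when the file is read back.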
Conclusion
Web scraping is a powerful tool for extracting valuable information from websites. This Python script provides a glimpse into the process of scraping rental property listings from Buy Rent Kenya. Keep in mind that web scraping should be done responsibly and in compliance with the terms of service of the website being scraped.
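As one possible way to act on that, a script can consult the site's robots.txt rules and pause between requests. The sketch below uses Python's standard urllib.robotparser with made-up rules; the rules, the example.com URLs, and the helper names allowed and polite_pause are assumptions for illustration and say nothing about the actual policy of any website.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules for illustration; these are NOT the real
# rules of any site. In practice, load the site's own file with
# parser.set_url(...) followed by parser.read().
sample_rules = """
User-agent: *
Disallow: /admin/
""".splitlines()

parser = RobotFileParser()
parser.parse(sample_rules)


def allowed(url, user_agent="*"):
    """Return True if the parsed rules permit fetching the URL."""
    return parser.can_fetch(user_agent, url)


def polite_pause(seconds=1.0):
    """Wait between requests to avoid overloading the server."""
    time.sleep(seconds)


print(allowed("https://example.com/houses-for-rent"))
print(allowed("https://example.com/admin/settings"))
```

A call to polite_pause() inside the page loop would space out the requests; the right delay depends on the site, and robots.txt does not replace reading the terms of service.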
Feel free to explore, modify, and adapt the code for your specific needs. Happy coding and may your data exploration endeavors be fruitful.
Here is the link to the full code on GitHub.
Be on the lookout for our next article on automating the scraping process above using Airflow.