What will be scraped
Full Code
If you don't need an explanation, have a look at the full code example in the online IDE.
import time, json
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from parsel import Selector


def scroll_page(url):
    service = Service(ChromeDriverManager().install())

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    options.add_argument('--lang=en')
    options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36")

    driver = webdriver.Chrome(service=service, options=options)
    driver.get(url)

    old_height = driver.execute_script("""
        function getHeight() {
            return document.querySelector('.zxU94d').scrollHeight;
        }
        return getHeight();
    """)

    while True:
        driver.execute_script("document.querySelector('.zxU94d').scrollTo(0, document.querySelector('.zxU94d').scrollHeight)")

        time.sleep(2)

        new_height = driver.execute_script("""
            function getHeight() {
                return document.querySelector('.zxU94d').scrollHeight;
            }
            return getHeight();
        """)

        if new_height == old_height:
            break

        old_height = new_height

    selector = Selector(driver.page_source)
    driver.quit()

    return selector


def scrape_google_jobs(selector):
    google_jobs_results = []

    for result in selector.css('.iFjolb'):
        title = result.css('.BjJfJf::text').get()
        company = result.css('.vNEEBe::text').get()

        container = result.css('.Qk80Jf::text').getall()
        location = container[0]
        via = container[1]

        thumbnail = result.css('.pJ3Uqf img::attr(src)').get()
        extensions = result.css('.KKh3md span::text').getall()

        google_jobs_results.append({
            'title': title,
            'company': company,
            'location': location,
            'via': via,
            'thumbnail': thumbnail,
            'extensions': extensions
        })

    print(json.dumps(google_jobs_results, indent=2, ensure_ascii=False))


def main():
    params = {
        'q': 'python backend',                      # search string
        'ibp': 'htl;jobs',                          # google jobs
        'uule': 'w+CAIQICINVW5pdGVkIFN0YXRlcw',     # encoded location (USA)
        'hl': 'en',                                 # language
        'gl': 'us',                                 # country of the search
    }

    URL = f"https://www.google.com/search?q={params['q']}&ibp={params['ibp']}&uule={params['uule']}&hl={params['hl']}&gl={params['gl']}"

    result = scroll_page(URL)
    scrape_google_jobs(result)


if __name__ == "__main__":
    main()
Preparation
Install libraries:
pip install parsel selenium webdriver-manager
Reduce the chance of being blocked
Make sure you're using a user-agent in your request headers to act as a "real" user visit. The default requests user-agent is python-requests, and websites understand that such a request is most likely sent by an automated script. Check what's your user-agent.
There's a how to reduce the chance of being blocked while web scraping blog post that can get you familiar with basic and more advanced approaches.
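As a quick, hedged illustration (not part of the scraper above), here is a minimal sketch of sending a browser-like user-agent with the requests library; httpbin.org is used only because it echoes back the headers it receives:

import requests

# Pass a browser-like user-agent so the request doesn't announce itself
# as "python-requests".
headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36"
}

# https://httpbin.org/headers echoes the received headers, which is handy
# for checking what user-agent your script actually sends.
response = requests.get("https://httpbin.org/headers", headers=headers)
print(response.json()["headers"]["User-Agent"])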
Code Explanation
Import libraries:
import time, json
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from parsel import Selector
Library | Purpose |
---|---|
json | to convert extracted data to a JSON object. |
time | to work with time in Python. |
webdriver | to drive a browser natively, as a user would, either locally or on a remote machine using the Selenium server. |
Service | to manage the starting and stopping of the ChromeDriver. |
Selector | XML/HTML parser that has full XPath and CSS selectors support. |
Top-level code environment
At the beginning of the function, parameters are defined for generating the URL. If you want to pass other parameters to the URL, you can do so using the params dictionary.
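One parameter worth a closer look is uule, the encoded location. As a hedged sketch (based on the commonly documented uule format rather than anything Google publishes officially; build_uule is a hypothetical helper), such a value can be built like this:

import base64

# Commonly documented uule format: "w+CAIQICI", then a key character that
# encodes the length of the canonical location name, then the base64-encoded
# name itself with padding stripped. Verify against an up-to-date reference.
UULE_KEY = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_"

def build_uule(canonical_name: str) -> str:
    encoded = base64.b64encode(canonical_name.encode()).decode().rstrip("=")
    return "w+CAIQICI" + UULE_KEY[len(canonical_name)] + encoded

print(build_uule("United States"))  # w+CAIQICINVW5pdGVkIFN0YXRlcw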
Next, the URL is passed to the scroll_page(URL) function to scroll the page and get all the data. The result that this function returns is passed to the scrape_google_jobs(result) function to extract the necessary data. These functions are explained under the corresponding headings below.

This code uses the generally accepted rule of using the __name__ == "__main__" construct:
def main():
    params = {
        'q': 'python backend',                      # search string
        'ibp': 'htl;jobs',                          # google jobs
        'uule': 'w+CAIQICINVW5pdGVkIFN0YXRlcw',     # encoded location (USA)
        'hl': 'en',                                 # language
        'gl': 'us',                                 # country of the search
    }

    URL = f"https://www.google.com/search?q={params['q']}&ibp={params['ibp']}&uule={params['uule']}&hl={params['hl']}&gl={params['gl']}"

    result = scroll_page(URL)
    scrape_google_jobs(result)


if __name__ == "__main__":
    main()
This check only passes when the file is run directly; if the file is imported into another module, the code under the check is not executed.

You can watch the video Python Tutorial: if __name__ == '__main__' for more details.
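As a quick illustration (the module name my_scraper.py is hypothetical), the same file behaves differently when run directly versus imported:

# my_scraper.py
def main():
    print("scraping...")

if __name__ == "__main__":
    main()  # runs only when executed directly: python my_scraper.py

# another_file.py
# import my_scraper    # importing does NOT call main(); nothing is printed
# my_scraper.main()    # the function can still be called explicitly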
Scroll page
The function takes the URL and returns a full HTML structure.

First, let's understand how pagination works on the Google Jobs page. Data does not load all at once. When the user needs more results, they simply scroll the page and the site loads another small batch of data.

Accordingly, to get all the data, you need to scroll to the end of the listing.

📌Note: We will scroll the left-hand pane of the site, where the job listings are located.
In this case, the selenium library is used, which allows you to simulate user actions in the browser. For selenium to work, you need ChromeDriver, which can be downloaded manually or with code. In our case, the second method is used. To control the starting and stopping of ChromeDriver, you need to use Service; ChromeDriverManager().install() downloads the driver binary under the hood:
service = Service(ChromeDriverManager().install())
You should also add options for everything to work correctly:
options = webdriver.ChromeOptions()
options.add_argument("--headless")
options.add_argument('--lang=en')
options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36")
Chrome options | Explanation |
---|---|
--headless | to run Chrome in headless mode. |
--lang=en | to set the browser language to English. |
user-agent | to act as a "real" user request from the browser by passing it to request headers. Check what's your user-agent. |
Now we can start webdriver and pass the URL to the get() method.
driver = webdriver.Chrome(service=service, options=options)
driver.get(url)
The page scrolling algorithm looks like this:
1. Find out the initial page height and write the result to the old_height variable.
2. Scroll the page using the script and wait 2 seconds for the data to load.
3. Find out the new page height and write the result to the new_height variable.
4. If the new_height and old_height variables are equal, the algorithm is complete. Otherwise, write the value of new_height to old_height and return to step 2.
Getting the page height and scrolling are done by passing JavaScript code to the execute_script() method.
old_height = driver.execute_script("""
    function getHeight() {
        return document.querySelector('.zxU94d').scrollHeight;
    }
    return getHeight();
""")

while True:
    driver.execute_script("document.querySelector('.zxU94d').scrollTo(0, document.querySelector('.zxU94d').scrollHeight)")

    time.sleep(2)

    new_height = driver.execute_script("""
        function getHeight() {
            return document.querySelector('.zxU94d').scrollHeight;
        }
        return getHeight();
    """)

    if new_height == old_height:
        break

    old_height = new_height
Now we need to process the HTML using the Selector from the parsel package, passing it the HTML structure with all the data that was received after scrolling the page. This is necessary to successfully retrieve data in the next function. After all the operations are done, stop the driver:
selector = Selector(driver.page_source)
# extracting code from HTML
driver.quit()
The function looks like this:
def scroll_page(url):
    service = Service(ChromeDriverManager().install())

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    options.add_argument("--lang=en")
    options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36")

    driver = webdriver.Chrome(service=service, options=options)
    driver.get(url)

    old_height = driver.execute_script("""
        function getHeight() {
            return document.querySelector('.zxU94d').scrollHeight;
        }
        return getHeight();
    """)

    while True:
        driver.execute_script("document.querySelector('.zxU94d').scrollTo(0, document.querySelector('.zxU94d').scrollHeight)")

        time.sleep(2)

        new_height = driver.execute_script("""
            function getHeight() {
                return document.querySelector('.zxU94d').scrollHeight;
            }
            return getHeight();
        """)

        if new_height == old_height:
            break

        old_height = new_height

    selector = Selector(driver.page_source)
    driver.quit()

    return selector
In the gif below, I demonstrate how this function works:
Scrape Google Jobs
This function takes a full HTML structure and prints all results in JSON format.
To extract the necessary data, you need to find the selector where the data is located. In our case, this is the .iFjolb selector, which contains all jobs. You need to iterate over each job in the loop:
for result in selector.css('.iFjolb'):
    # data extraction will be here
Data like title, company and thumbnail are pretty easy to retrieve. You need to find the selector and get the value:
title = result.css('.BjJfJf::text').get()
company = result.css('.vNEEBe::text').get()
thumbnail = result.css('.pJ3Uqf img::attr(src)').get()
I want to pay attention to how data such as location and via are retrieved. To extract them, an additional container list is created, which holds both values. This speeds things up a bit, since you only need to find the element once and can then get its values by index in constant time. Otherwise, the search-and-extract operation would have to be carried out twice:
container = result.css('.Qk80Jf::text').getall()
location = container[0]
via = container[1]
Each job has its own number of extensions, and we need to extract them all:
extensions = result.css('.KKh3md span::text').getall()
The complete function to scrape all data would look like this:
def scrape_google_jobs(selector):
    google_jobs_results = []

    for result in selector.css('.iFjolb'):
        title = result.css('.BjJfJf::text').get()
        company = result.css('.vNEEBe::text').get()

        container = result.css('.Qk80Jf::text').getall()
        location = container[0]
        via = container[1]

        thumbnail = result.css('.pJ3Uqf img::attr(src)').get()
        extensions = result.css('.KKh3md span::text').getall()

        google_jobs_results.append({
            'title': title,
            'company': company,
            'location': location,
            'via': via,
            'thumbnail': thumbnail,
            'extensions': extensions
        })

    print(json.dumps(google_jobs_results, indent=2, ensure_ascii=False))
Code | Explanation |
---|---|
google_jobs_results | a temporary list where extracted data will be appended at the end of the function. |
css() | to access elements by the passed selector. |
::text or ::attr(<attribute>) | to extract textual or attribute data from the node. |
get() | to actually extract the textual data. |
getall() | to actually extract text data from all matching objects. |
google_jobs_results.append({}) | to append extracted data to a list as a dictionary. |
json.dumps() | to serialize obj to a JSON formatted str using this conversion table. |
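To make the difference between get() and getall() concrete, here is a small self-contained sketch on a toy HTML snippet (the markup and class names are made up for illustration and are not the real Google Jobs markup):

from parsel import Selector

# Toy HTML, purely illustrative.
html = '<div class="job"><span class="tag">Full-time</span><span class="tag">Remote</span></div>'
sel = Selector(text=html)

print(sel.css('.tag::text').get())     # 'Full-time' -> only the first match
print(sel.css('.tag::text').getall())  # ['Full-time', 'Remote'] -> all matches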
Output:
[
  {
    "title": "Python Backend Engineer",
    "company": "RapidApi",
    "location": "Bucksport, ME",
    "via": "via Ladders",
    "thumbnail": null,
    "extensions": [
      "5 days ago",
      "Full-time",
      "No degree mentioned"
    ]
  },
  ... other results
  {
    "title": "Sr. Backend Engineer w/ Python API Developer- REMOTE",
    "company": "Jobot",
    "location": "Denver, CO",
    "via": "via Your Central Valley Jobs",
    "thumbnail": null,
    "extensions": [
      "11 days ago",
      "Full-time",
      "No degree mentioned",
      "Health insurance",
      "Dental insurance",
      "Paid time off"
    ]
  }
]
Using Google Jobs API from SerpApi
This section shows the comparison between the DIY solution and our solution.

The main difference is that it's a quicker approach: the Google Jobs API bypasses blocks from search engines, and you don't have to create the parser from scratch and maintain it.
First, we need to install google-search-results:
pip install google-search-results
Import the necessary libraries for work:
from serpapi import GoogleSearch
import os, json
Next, we write a search query and the necessary parameters for making a request:
params = {
    # https://docs.python.org/3/library/os.html#os.getenv
    'api_key': os.getenv('API_KEY'),             # your serpapi api key
    # https://site-analyzer.pro/services-seo/uule/
    'uule': 'w+CAIQICINVW5pdGVkIFN0YXRlcw',      # encoded location (USA)
    'q': 'python backend',                       # search query
    'hl': 'en',                                  # language of the search
    'gl': 'us',                                  # country of the search
    'engine': 'google_jobs',                     # SerpApi search engine
    'start': 0                                   # pagination
}
Since we want to extract all the data, we need to use the 'start' parameter, which is responsible for pagination.

Let's implement an infinite loop that increases the value of the 'start' parameter by 10 on each iteration. This will continue as long as there is something to extract:
while True:
    search = GoogleSearch(params)    # where data extraction happens on the SerpApi backend
    result_dict = search.get_dict()  # JSON -> Python dict

    if 'error' in result_dict:
        break

    # data extraction will be here

    params['start'] += 10
The data is retrieved quite simply: we just need to access the 'jobs_results' key.
for result in result_dict['jobs_results']:
    google_jobs_results.append(result)
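If you're not sure what else the response contains, you can inspect its structure first. Key names other than jobs_results reflect SerpApi's typical response shape and are worth verifying against the documentation for the google_jobs engine:

# Hedged sketch: inspect the response before extracting from it.
# Typical top-level keys include search_metadata, search_parameters and
# jobs_results, but check the SerpApi docs for the exact structure.
print(list(result_dict.keys()))
print(json.dumps(result_dict['jobs_results'][0], indent=2, ensure_ascii=False))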
Example code to integrate:
from serpapi import GoogleSearch
import os, json

params = {
    # https://docs.python.org/3/library/os.html#os.getenv
    'api_key': os.getenv('API_KEY'),             # your serpapi api key
    # https://site-analyzer.pro/services-seo/uule/
    'uule': 'w+CAIQICINVW5pdGVkIFN0YXRlcw',      # encoded location (USA)
    'q': 'python backend',                       # search query
    'hl': 'en',                                  # language of the search
    'gl': 'us',                                  # country of the search
    'engine': 'google_jobs',                     # SerpApi search engine
    'start': 0                                   # pagination
}

google_jobs_results = []

while True:
    search = GoogleSearch(params)    # where data extraction happens on the SerpApi backend
    result_dict = search.get_dict()  # JSON -> Python dict

    if 'error' in result_dict:
        break

    for result in result_dict['jobs_results']:
        google_jobs_results.append(result)

    params['start'] += 10

print(json.dumps(google_jobs_results, indent=2, ensure_ascii=False))
Output:
[
{
"title": "Python Backend Engineer",
"company_name": "RapidApi",
"location": "Bucksport, ME",
"via": "via Ladders",
"description": "RapidAPI is a team of creators building for developers. We are the world's largest API Hub where over 4 million developers find, test, and connect to 40,000 APIs (and growing!) — all with a single account, single API key, and single SDK.\n\nOur users range from independent developers to the largest companies in the world. We work hard to ensure it's easy for developers to build, discover, and... connect to APIs faster while providing enterprise-wide visibility and governance. As a result, entrepreneurs and enterprises can concentrate on creating value and business outcomes.\n\nWe operate at a significant scale, but the opportunity is even bigger. You have an unprecedented opportunity to make a massive difference and empower developers to build modern software through API innovation while doing the most critical work of your career.\n\nWe have a high-performance Python backend that is dedicated to storing user documents (API projects) in a way that supports versioning, branches, and update diffing. This backend is built on top of libgit2 (the C library that powers Git) and uses Cython bindings. You’ll own parts of this complex infrastructure and will work closely with the backend engineers and architects.\n\nYou Will\n• Work and own parts of our high-performance Git-based document storage backend\n• Work on improving the performance and reliability\n• Integrate the backend with new micro-services to leverage it for various projects in the company (the Git-based backend will be a platform for other applications to run on)\n\nYou Have\n• Have a strong technical background in building Python backend applications with more than 5 years of experience.\n• Knowledge of various Python backend frameworks (e.g. Django, Tornado, Flask)\n• Experience with low-level programming experience (C, C++)Desire to write clean, maintainable, testable code\n• A deep understanding and experience with parallel processing/concurrent programming\n• Autonomous, learner, builder, creator, achiever... You know how to get things done!\n• Collaborative and how to work in a team to achieve our goals as one team.\n\nThis is an opportunity to play a key role in a fast-growing and high-scale startup company distributed across the US, Europe, and Israel. You'll be taking our product to the next level within a high talent density team and out-of-the-box thinking. Having raised $150 million in a Series D investment round in 2022; you’ll be working with a team that is scaling globally, fast.\n\nPandoLogic. Keywords: Backend Engineer, Location: Bucksport, ME - 04416",
"extensions": [
"5 days ago",
"Full-time",
"No degree mentioned"
],
"detected_extensions": {
"posted_at": "5 days ago",
"schedule_type": "Full-time"
},
"job_id": "eyJqb2JfdGl0bGUiOiJQeXRob24gQmFja2VuZCBFbmdpbmVlciIsImh0aWRvY2lkIjoiRGlHc25BN0hUandBQUFBQUFBQUFBQT09IiwidXVsZSI6IncrQ0FJUUlDSU5WVzVwZEdWa0lGTjBZWFJsY3ciLCJnbCI6InVzIiwiaGwiOiJlbiIsImZjIjoiRW93Q0Nzd0JRVUYwVm14aVFrVnVWekJaZHpGRGJVTTFORUk0U1U1VlVHWnNaVFpJU25sMVNsSlRkbk5NWlRVNGVIcFZkRkZQWjJSR00yUkNSRXBEWkRRelMwOUJiWGhyYmpCMWRUWkhTbGt4YW14MGExbEtUR3RNUW1nNVNUSnJSVkZMZURGT1JFVkNja3RKT0hSYVYweElVVE4xV25vNVdHOVlVRkV0V201UFFVbHpVMjg1VGpCT2VGbFZjVlJ1Y1daTGVFSmxjMXBoVkZNek9FcEZXVk4zV0RoeVRtWmxiMko0U2tGVlVHcGZZemRMYVRSaVJXY3RSV1UzWjFKTGJYRTBZa2xuTTNWdlRuazJja1J2TUhaVkVoZHZZMGxtV1RseU5VSTBVMmx3ZEZGUU5VOTFZWEZCTUJvaVFVUlZlVVZIWm01UVJWSTFXbUYyV0ZwNFZ6SkRSRkpJY25oWFpGaEZSMWc1ZHciLCJmY3YiOiIzIiwiZmNfaWQiOiJmY18xIiwiYXBwbHlfbGluayI6eyJ0aXRsZSI6Ii5uRmcyZWJ7Zm9udC13ZWlnaHQ6NTAwfS5CaTZEZGN7Zm9udC13ZWlnaHQ6NTAwfUFwcGx5IG9uIExhZGRlcnMiLCJsaW5rIjoiaHR0cHM6Ly93d3cudGhlbGFkZGVycy5jb20vam9iLWxpc3RpbmcvcHl0aG9uLWJhY2tlbmQtZW5naW5lZXItcmFwaWRhcGktYnVja3Nwb3J0LW1lLTQtNTQxNDE2OTkzLmh0bWw/dXRtX2NhbXBhaWduPWdvb2dsZV9qb2JzX2FwcGx5XHUwMDI2dXRtX3NvdXJjZT1nb29nbGVfam9ic19hcHBseVx1MDAyNnV0bV9tZWRpdW09b3JnYW5pYyJ9fQ=="
},
... other results
{
"title": "Sr. Backend Engineer w/ Python API Developer- REMOTE",
"company_name": "Jobot",
"location": "Denver, CO",
"via": "via Your Central Valley Jobs",
"description": "Remote, API Development with Python, Excellent Culture, Rapidly growing company!\n\nThis Jobot Job is hosted by Nicole Crosby...\n\nAre you a fit? Easy Apply now by clicking the \"Apply\" button and sending us your resume.\n\nSalary $150,000 - $210,000 per year\n\nA Bit About Us\n\nSustainable, rapidly growing logistics company for shipping products by utilizing modern API's. Their products allow their customers to connect and compare rates through a single integration, save money and improve on-time delivery metrics, create their own labels, provide real-time shipment updates, confirm accuracy of domestic and international addresses, protect against damage, loss or theft, as well as calculate carbon emissions for every shipment. Solve complex shipping logistics problems with a single integration. By leveraging our technology, businesses can streamline, automate, and gain end-to-end control of their shipping process with our suite of flexible RESTful API solutions. Join thousands of customers shipping millions of packages.\n\n1 Billion + Shipments purchased\n\n93% of US household have received our shipments\n\n99.99% API uptimes- the most reliable in the industry\n\n88% discount on shipping for all customer\n\nWhy join us?\n\nWe're a fun group of passionate entrepreneurs who built our own revolutionary software designed to make shipping simple. We started as an Engineering first company and we are proud to have a pragmatic approach to software development. Our team has a wealth of diverse experience and different backgrounds ranging from startups to large technology companies.\n\nBe part of a leading technology company\n• CI/CD inspired workflows - we deploy dozens of times a day\n• Small services over monoliths - we've deployed hundreds of services\n• Strong engineering tooling and developer support\n• Transparency and participation around architecture and technology decisions\n• Culture of blamelessness and improving today from yesterday's shortcomings\nWhat We Offer\n• Comprehensive medical, dental, vision, and life insurance\n• Competitive compensation package and equity\n• 401(k) match\n• Monthly work from home stipend of $100 net\n• Flexible work schedule and paid time off\n• Collaborative culture with a supportive team\n• A great place to work with unlimited growth opportunities\n• The opportunity to make massive contributions at a hyper-growth company\n• Make an impact on a product helping ship millions of packages per day\nJob Details\n\nOur engineering team is looking for a bright Software Engineer who is eager to collaborate with a distributed team and comfortable working in a polyglot environment. You will be a key member of our growing engineering team making important technical decisions that will shape the company's future. 
If you love solving tough problems in a fast-paced and collaborative environment, then we'd love to meet you.\n\nWhat you will do\n• Build new features and products for our API\n• Optimize and build back-end services for performance and scale\n• Integrate our software with carriers around the world\n• Participate in code designs, planning meetings, and code reviews\n• Write high quality software using Python/Go and some Ruby\n• Analyze/debug performance issues\n• Take ownership of major API features\nAbout You\n• 5+ years of professional software development experience\n• Strong experience with Python\n• Experience with REST, HTTP/HTTPS protocols\n• Ownership of integrating 3rd party APIs a plus\n• Strong desire to work in a fast-paced, start-up environment with multiple releases a day\n• A passion for working as part of a team - you love connecting and collaborating with others\nInterested in hearing more? Easy Apply now by clicking the \"Apply\" button",
"extensions": [
"11 days ago",
"Full-time",
"No degree mentioned",
"Health insurance",
"Dental insurance",
"Paid time off"
],
"detected_extensions": {
"posted_at": "11 days ago",
"schedule_type": "Full-time"
},
"job_id": "eyJqb2JfdGl0bGUiOiJTci4gQmFja2VuZCBFbmdpbmVlciB3LyBQeXRob24gQVBJIERldmVsb3Blci0gUkVNT1RFIiwiaHRpZG9jaWQiOiJDQWNDaTBIWTJRNEFBQUFBQUFBQUFBPT0iLCJ1dWxlIjoidytDQUlRSUNJTlZXNXBkR1ZrSUZOMFlYUmxjdyIsImdsIjoidXMiLCJobCI6ImVuIiwiZmMiOiJFb3dDQ3N3QlFVRjBWbXhpUkMwNFVqQmFjamhmU1Rack1YRTFWbGRvTVhkRGJtVXdNa2RKUkhWdVRWZzJaRVJPVmxOcWNrcEpkakZNWlRCVVRGZ3lUSGRRTjFOaFNFa3dVbTFvZGpaRWVsZEtjMWwzZUZOU2FXVlpPVjl5Y3pGTlRtOUtSa0l4TUdSM2ExRmZUalJxTUUxSFdERm9Ta2xETUZaVVMwOVhkalZwWlhOWVRraFlNazVmUW1GRmVUWlVUWFUyVFRsMFFsaFNlblJOWlZkSlNXNU5OM1pUVDFkMVNXTkpjR2xOWkhaa1dUaDRRaTFQUVdzM1RFc3hUR2RZYWpReWRUZEllakJEVEVaQmNuVkNjVTVCRWhkUVRXOW1XVFJtVFU1d1pVRnhkSE5RT0ZCVE5TMUJieG9pUVVSVmVVVkhaalZSVFd3MU0yOUNhMWhCTm01SlowWjNiMHhxYm5SVGNURmxkdyIsImZjdiI6IjMiLCJmY19pZCI6ImZjXzYiLCJhcHBseV9saW5rIjp7InRpdGxlIjoiQXBwbHkgb24gWW91ciBDZW50cmFsIFZhbGxleSBKb2JzIiwibGluayI6Imh0dHBzOi8vam9icy55b3VyY2VudHJhbHZhbGxleS5jb20vam9icy9zci4tYmFja2VuZC1lbmdpbmVlci13LXB5dGhvbi1hcGktZGV2ZWxvcGVyLXJlbW90ZS1kZW52ZXItY29sb3JhZG8vNzAyMTM0OTIzLTIvP3V0bV9jYW1wYWlnbj1nb29nbGVfam9ic19hcHBseVx1MDAyNnV0bV9zb3VyY2U9Z29vZ2xlX2pvYnNfYXBwbHlcdTAwMjZ1dG1fbWVkaXVtPW9yZ2FuaWMifX0="
}
]
Links
Add a Feature Request💫 or a Bug🐞