TABLE OF CONTENTS
- Introduction
- Procedure
- Environment setup
- Scraping Samples
- The Web Scraping Process
- Conclusion
Introduction
LinkedIn is the largest professional network on the internet, accessible through mobile or web, where you can look for jobs and internships and grow your network. On LinkedIn, you can find people with similar skills, interests, and experience. To access the platform, you need to sign up and create a profile.
Searching LinkedIn for jobs, internships, and people is usually done by hand. What about automating this search process? This leads us to web scraping.
Web scraping is often used on data-heavy sites such as Google, Amazon, or Twitter. In general, web scraping refers to the extraction of data from a website. The data is collected and then exported into a format that is more useful to the user, most often a CSV file, though other formats such as JSON are also possible.
What follows is a tutorial on how to build a scraper in Python that can be used to extract data from LinkedIn.
Procedure
Although scraping is mostly an automated process, it is a broad one that can be broken down into several steps.
Environment setup
Python is the language chosen for this tutorial, so a few precautions are needed to make sure the environment is set up correctly. The main OS I'll use is Ubuntu (a Linux distro). A virtual environment is a tool that isolates your project's packages from the rest of the system, so it is a good idea to create one for your project.
- Make a new directory on your desktop and cd into it.
# create the directory
mkdir linkedin_scraper
# cd into the directory
cd linkedin_scraper
- To create a virtual environment, run the following command:
python -m venv venv
# or
python3 -m venv venv
- To activate the virtual environment, run the following command:
# For linux and mac users
source venv/bin/activate
# For windows users
.\venv\Scripts\activate
- Install the following packages:
pip install ipython selenium webdriver-manager parsel pip-chill
Check the installed packages using the following command:
# To list only the main packages installed in the virtual environment, run the following command:
pip-chill
- For this automation process we will use IPython, an interactive Python shell. In your terminal, type the following command:
ipython
Note: Alternatively, a Jupyter notebook or a Python file can be used for this process. IPython was chosen because it is easy to use, needs little prior setup, and shows results interactively.
Our environment is set up and we're ready to go.
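Before moving on, here is a quick sanity check you can paste into IPython to confirm that the shell is really running inside the virtual environment created earlier. It relies on the fact that `sys.prefix` and `sys.base_prefix` differ inside an environment made with `venv`. The helper name `in_virtualenv` is only illustrative.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter is running inside a venv."""
    # Inside a venv, sys.prefix points at the environment directory,
    # while sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


print("Virtual environment active:", in_virtualenv())
```

If this prints False, activate the environment again before installing or importing anything.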
Scraping Samples
To access LinkedIn data we need to log in, so we will automate that step too. To automate the login process we will use the selenium package together with the Chrome driver. Run these commands in your terminal.
ipython
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get("https://www.linkedin.com")
This opens a Chrome window showing the LinkedIn login page. At the top of the window you should see a banner with the message below.
Chrome is being controlled by automated test software.
- Next, we need to log in to LinkedIn through automation, which means telling our bot to fill in the login information. To do so, we'll use Chrome DevTools to get the IDs of the login fields. To open DevTools, use the keyboard shortcut Ctrl+Shift+I or press F12.
Now click on the Inspect Element icon. When you hover over any element on the web page, its code is highlighted in DevTools. You want to look for the class and id attributes.
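Once you have found the IDs of the two login fields, the login step could look like the sketch below. The IDs `session_key` and `session_password` are assumptions based on what the LinkedIn home page has used in the past, so replace them with whatever DevTools shows you. The helper name `log_in` is hypothetical; it takes the `driver` created earlier as an argument.

```python
def log_in(driver, username, password,
           username_id="session_key", password_id="session_password"):
    """Fill in the LinkedIn sign-in form and submit it.

    The default field IDs are assumptions; check them in DevTools.
    """
    # "id" is the locator strategy string that selenium's By.ID stands for.
    username_field = driver.find_element("id", username_id)
    username_field.send_keys(username)

    password_field = driver.find_element("id", password_id)
    password_field.send_keys(password)

    # Submitting the password field submits the enclosing form.
    password_field.submit()
```

In the IPython session this would be called as `log_in(driver, 'your email address', 'your password')`.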
Once the bot has filled in and submitted these fields, you'll be directed to your LinkedIn profile. Guess what: you have successfully automated your login process.
Next we want to make a search query on Google that targets all the LinkedIn profiles matching the terms "Web" AND "JavaScript".
Let's go to Google, still using our terminal so that the automated Chrome browser is used.
# To open the google search page
driver.get('https://www.google.com')
Let's make our query and click on the search button (this is done in the terminal).
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
search_query = driver.find_element(By.NAME, 'q')
search_query.send_keys('site:linkedin.com/in/ AND "Web" AND "Javascript"')
search_query.send_keys(Keys.ENTER)
The search can be customised; feel free to modify it to suit your needs.
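One way to make the query easier to customise is to build it from a list of keywords. The sketch below is only a convenience; the function name `build_query` and its `site` parameter are illustrative and not part of selenium.

```python
def build_query(keywords, site="linkedin.com/in/"):
    """Build a Google query restricted to `site` that requires every keyword."""
    # Quote each keyword so Google treats it as an exact phrase.
    quoted = [f'"{keyword}"' for keyword in keywords]
    return " AND ".join([f"site:{site}", *quoted])


print(build_query(["Web", "Javascript"]))
# site:linkedin.com/in/ AND "Web" AND "Javascript"
```

The returned string can be passed to `send_keys` in place of the hard-coded query.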
linkedin_users_urls = driver.find_elements(By.CLASS_NAME, 'iUh30')
Note: iUh30 is the class name of the link that contains the LinkedIn profile URL.
Note: The method used here is find_elements, which returns all the elements matching a given class, not find_element, which returns a single element.
Let's verify that we have some results. We will use the len function to get the length of the list.
len(linkedin_users_urls)
I guess you noticed that the result is not what we wanted: we have a list of elements, whereas we want the URLs of the LinkedIn profiles. To get the URLs we need the get_attribute method and a slightly different selector. Let's use a new variable to store the results.
linkedin_users_urls_list = driver.find_elements(By.XPATH, '//div[@class="yuRUbf"]/a[@href]')
# To check the list content, read the href attribute of each link
[user.get_attribute('href') for user in linkedin_users_urls_list]
We got the URLs of the LinkedIn profiles, which means we can now start scraping the data (name, title, company, location, the profile URL, and more).
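The collected list may contain duplicates, tracking parameters, or links that are not profiles. A minimal clean-up sketch using only the standard library is shown below. The function name `keep_profile_urls` is hypothetical, and the `/in/` path check mirrors the `site:linkedin.com/in/` filter used in the query.

```python
from urllib.parse import urlsplit


def keep_profile_urls(urls):
    """Return unique LinkedIn profile URLs, in their original order."""
    seen = set()
    profiles = []
    for url in urls:
        if not url:
            continue  # get_attribute() can return None
        parts = urlsplit(url)
        host = parts.netloc.lower()
        is_linkedin = host == "linkedin.com" or host.endswith(".linkedin.com")
        if not (is_linkedin and parts.path.startswith("/in/")):
            continue
        # Drop the query string and fragment so duplicates compare equal.
        clean = f"https://{host}{parts.path.rstrip('/')}"
        if clean not in seen:
            seen.add(clean)
            profiles.append(clean)
    return profiles
```

It can be applied directly to the list of `href` values gathered from the search results.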
The following steps are more complex, but we'll use the same methods and variables as before.
The Web Scraping Process
We now need some files to create our scraper. In your project directory, create the files as follows.
touch variables.py main.py
# Creates two files with the names above
variables.py
my_username = 'your email address'
my_password = 'your password'
file_name = 'results.csv' # file where the results will be saved
query = 'site:linkedin.com/in/ AND "Web" AND "Javascript"'
> The variables file contains the variables that we'll use to log in to LinkedIn, together with the query.
We'll use the main.py file to run the main code.
> main.py
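```python
import variables
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from webdriver_manager.chrome import ChromeDriverManager

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))

# Log in to LinkedIn. The field IDs below are assumptions; confirm them in DevTools.
driver.get("https://www.linkedin.com")
username = driver.find_element(By.ID, "session_key")
username.send_keys(variables.my_username)  # username field
password = driver.find_element(By.ID, "session_password")
password.send_keys(variables.my_password)  # password field
password.send_keys(Keys.ENTER)

# Run the Google search
driver.get("https://www.google.com")
search_query = driver.find_element(By.NAME, "q")
search_query.send_keys(variables.query)
search_query.send_keys(Keys.ENTER)

# Collect the profile URLs from the search results
linkedin_users_urls_list = driver.find_elements(By.XPATH, '//div[@class="yuRUbf"]/a[@href]')
profile_urls = [user.get_attribute("href") for user in linkedin_users_urls_list]
```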
This process can run too fast at times. To slow it down, use the sleep function from the time module; it can be used anywhere in the code.
from time import sleep
sleep(2)
The process is summarised as seen below:
1. Log in to LinkedIn
2. Make the Google search query and submit it
3. Collect the displayed profiles in a list, profile_urls
4. Iterate over the list, one URL at a time
5. Open the profile page for each URL
6. Get the name, title, company, location and more from the profile page
7. Save the results to a csv file
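The last step can be sketched with Python's built-in csv module. The column names below follow the fields listed in step 6. The helper name `save_profiles` and the shape of each profile (a plain dict) are assumptions made for illustration, not something the scraper above already produces.

```python
import csv

FIELDS = ["name", "title", "company", "location", "url"]


def save_profiles(profiles, file_name):
    """Write a list of profile dicts to a CSV file with a header row."""
    # newline="" lets the csv module control line endings on every OS.
    with open(file_name, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        writer.writeheader()
        for profile in profiles:
            # Missing fields are written as empty cells; extra keys are dropped.
            writer.writerow({field: profile.get(field, "") for field in FIELDS})
```

With the variables file from earlier, this would be called as `save_profiles(profiles, variables.file_name)` once the list of profile dicts has been built.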
We're done.
Conclusion
As you can see, we've built a scraper that can extract data from LinkedIn. We used the following technologies:
* Selenium
* Python
* IPython
> Note: From time to time, LinkedIn changes its class names and attributes, so for future releases I'll try to update the scraper to work with the new LinkedIn changes. Alternatively, you can try to use the LinkedIn API to get the data.