David Hernandez Torres
Scraping Twitter comments with Selenium (Python): a step-by-step guide

In today's world full of data, everyone uses social media to express themselves and contribute to the public voice. That makes it very valuable information, and it is publicly available to anyone: you can gather a lot of insights, feedback, and very good advice from this public opinion.
That is why I bring you this step-by-step guide to start scraping comments on Twitter without much work.

What you will need:

  • A text editor
  • A programming language that Selenium supports (I will be using Python)
  • A Twitter account (preferably not your main one)

**Warning:**

Using web scraping in the wrong manner can be unethical and violate a site's terms of service, which could lead to permanent IP address bans and more. Do not use these web scraping tools with bad intentions.

Step 1: The setup

To start, we will create a new directory with a virtual environment and activate it.



 > C:\Users\Blogs\Webscraping>  python -m venv .
 > C:\Users\Blogs\Webscraping>  Scripts\activate




These commands can vary depending on your operating system; if you are not familiar with Python and virtual environments, refer here for more guidance.
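For reference, on macOS or Linux the equivalent commands would look roughly like this (assuming a similarly named project directory):

 > ~/Blogs/Webscraping$  python3 -m venv .
 > ~/Blogs/Webscraping$  source bin/activate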

Okay, now that we have our environment running, I will install Selenium, our main dependency.

> pip install selenium

Now that we have all of our tools ready, we can start coding.
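If you want to verify that Selenium can actually drive Chrome before writing the scraper, a quick sanity check like the one below should open and close a browser window. This assumes Selenium 4.6 or newer, where Selenium Manager downloads a matching chromedriver for you automatically.

##check_setup.py - quick sanity check that Selenium can drive Chrome
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://example.com")
print(driver.title)  ##should print "Example Domain"
driver.quit()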

Step 2: Our code process

Selenium is a free tool for automating processes on a web application. In this case, we will be using the Selenium WebDriver, a tool that lets you drive different browsers from a script. Here, we will be using Chrome.

Our main process will look like this:
(main.py)



from twitter import Twitter
from selenium import webdriver

##Desired post to scrape comments from
URL_POST = "**"


##Account credentials
username = "**"
email = "**"
password = "**"

driver = webdriver.Chrome()

twitter = Twitter(driver)
twitter.login(email, username, password)
twitter.get_post(URL_POST)
driver.quit()



Selenium WebDriver lets us do a lot in a browser, but let's leave that for the next step. Right now, I would recommend creating a new Twitter account and finding a post whose comments you would like to scrape. Yes, I know, we haven't defined the Twitter class yet, but for now it is enough to know that we pass our driver to it as an argument.
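As a side note, the driver can be configured before we hand it to the Twitter class. This is only a sketch of optional settings, not something the rest of the guide depends on; note that headless mode can make Twitter's bot detection more aggressive.

from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("--window-size=1280,900")  ##consistent layout between runs
##options.add_argument("--headless=new")        ##optional: run without a visible window

driver = webdriver.Chrome(options=options)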

Step 3: The Twitter class

This will be the largest and most complex part of our program. It covers three methods: login, get_post, and scrape.

We will first define a constructor with one input variable, plus a helper attribute:

  • driver: our Selenium driver that we started in main.py
  • wait: a WebDriverWait helper, useful for locating HTML elements that have not loaded yet

(twitter.py)

import sys
from csv_exports import twitter_post_to_csv
from time import sleep
from selenium.webdriver.common.by import By
from useful_functions import validate_span
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


class Twitter:

    def __init__(self, driver):
        self.driver = driver
        self.wait = WebDriverWait(self.driver, 10)

Login Method
To access Twitter comments, we need to be logged in, and unfortunately, a web driver does not remember credentials. So let's start by automating our login process...

The code

def login(self, email, username, password):
    drive = self.driver
    wait = self.wait


    ##Going to the login page URL
    drive.get("https://x.com/i/flow/login")



    ##Sends email credential to the first form input
    input_email = wait.until(EC.presence_of_element_located((By.NAME, "text")))
    input_email.clear()
    input_email.send_keys(email)
    sleep(3)
    ##Submits form
    button_1 = drive.find_elements(By.CSS_SELECTOR, "button div span span")
    button_1[1].click()

    ##Sends username credential to the second form input
    input_verification = wait.until(EC.presence_of_element_located((By.NAME, "text")))
    input_verification.clear()
    sleep(3)
    input_verification.send_keys(username)
    ##Submits form
    button_2 = drive.find_element(By.CSS_SELECTOR, "button div span span")
    sleep(3)
    button_2.click()

    ##Sends password credential to the last form input
    input_password = wait.until(EC.presence_of_element_located((By.NAME, "password")))
    input_password.clear()
    sleep(3)
    input_password.send_keys(password)
    sleep(3)

    #Submits last form
    button_3 = drive.find_element(By.CSS_SELECTOR, "button div span span")
    button_3.click()
    sleep(5)

Here are the forms your program will be filling in:

  1. The first form (email)
  2. The second form (username verification)
  3. The third form (password)

BREAKDOWN

Our method waits for an element, such as input_email, because on our first request to the URL the page needs to finish loading its HTML before we can interact with it.

We use the find_element() and find_elements() methods from the web driver to locate the inputs and buttons in the HTML.

Our method goes through each of the forms one by one, typing and submitting with the .send_keys() and .click() methods. We also use .clear() to make sure an input box does not already contain text when the page loads.
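If the fixed sleep() calls ever feel brittle on a slow connection, an alternative worth knowing is to wait until the element is actually clickable. A minimal sketch, reusing the same selector as above, the imports already at the top of twitter.py, and the wait we created in the constructor:

##Wait (up to the 10 seconds set in the constructor) until the submit
##button is present and clickable, instead of sleeping for a fixed time.
button = self.wait.until(
    EC.element_to_be_clickable((By.CSS_SELECTOR, "button div span span"))
)
button.click()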

We have successfully logged in.

NOTE*
The second form only appears after you have used a Selenium web driver to interact with the Twitter login page a few times. Twitter detects when a bot comes in and types too fast, so this extra verification page appears only when a bot is suspected. After a few runs of this program, it will show up every time you log in to your scraping account.

Scrape
This will be the method that retrieves a post's comments. Here we face big limitations and some problems to solve. The first one is that there is no way to target only Twitter comments: they live inside span elements, which Twitter uses for lots of other things as well.

The best way I could find to get Twitter comments is this call: drive.find_elements(By.XPATH, "//span[@class='css-1jxf684 r-bcqeeo r-1ttztb7 r-qvutc0 r-poiln3']"). In other words, we locate spans by XPath and class. This returns a lot of unnecessary data, so we have to do a lot of cleaning afterwards.

The second problem, a little less severe, is Twitter's dynamic loading. When you scroll down or up, Twitter adds or removes HTML from the current document, so in order to capture every comment, we have to scroll slowly and extract elements before each new scroll.

Now that we understand these problems, let's get to work.

def scrape(self):
    drive = self.driver
    containers = drive.find_elements(
        By.XPATH,
        "//span[@class='css-1jxf684 r-bcqeeo r-1ttztb7 r-qvutc0 r-poiln3']")

    ##Scrape data and store it in a list
    scraped_data = []
    temporary_1 = ""
    temporary_2 = ""
    index = 0
    index_dict = 0
    while index < len(containers):
        text = containers[index].text
        if text:
            ##A span starting with "@" is a username
            if text[0] == "@":
                temporary_1 = text
                index_dict = index_dict + 1
            ##The next valid span after a username is the comment itself
            if validate_span(text) is True and index_dict == 1:
                temporary_2 = text
                arr_push = {
                    "username": temporary_1,
                    "post": temporary_2
                }
                scraped_data.append(arr_push)
                temporary_2, temporary_1 = "", ""
                index_dict = 0
        index = index + 1
    return scraped_data

This code retrieves all comments from the currently loaded document.
By looping through the spans with a few conditions and a helper like validate_span(), we are able to keep the data clean. If you encounter a problem in the algorithm, feel free to let me know.
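To give an idea of the result, the list returned by scrape() contains one dictionary per comment, roughly like this (the values below are made up for illustration):

[
    {"username": "@some_user", "post": "Great thread, thanks for sharing!"},
    {"username": "@another_user", "post": "I disagree with the second point."}
]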

The validate_span() function:
(useful_functions.py)

def validate_span(span):
    ##Filter out separator dots and usernames
    if span[0] == "." or span[0] == "·":
        return False
    if span[0] == "@":
        return False
    ##Keep the span only if it is not a pure number (a like/reply count)
    if validate_number(span):
        return True
    return False


def validate_number(string):
    ##Counts like "1.2k" end with "k"; strip it, then drop the dots
    if string[len(string) - 1] == "k":
        string = string[0: len(string) - 1]
    string = string.replace(".", "")
    index = 0
    for i in string:
        if (i == "1" or i == "2" or i == "3" or i == "4" or i == "5"
                or i == "6" or i == "7" or i == "8" or i == "9" or i == "0"):
            index = index + 1

    ##If every character is a digit, this is a count, not a comment
    if len(string) <= index:
        return False
    else:
        return True

All of our unwanted elements are usually follow counts, like counts, or random dots and whitespace. By checking a few conditions, cleaning them up is an easy task.
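A few quick checks of how these helpers behave, assuming you run them from a Python shell with useful_functions.py importable:

from useful_functions import validate_span

print(validate_span("@someone"))        ##False - usernames are captured separately
print(validate_span("1.2k"))            ##False - looks like a like/reply count
print(validate_span("·"))               ##False - separator dot
print(validate_span("Nice write-up!"))  ##True  - actual comment text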

The get_post method

This is the method where we loop until we reach the bottom of the page,
calling the scrape method on every iteration to make sure all of the data is captured.

def get_post(self, url):

    drive = self.driver
    wait = self.wait

    drive.get(url)
    sleep(3)
    data = []

    ##Measure the height of one loaded tweet container so we know how far to scroll each time
    javascript = (
        "let inner_divs = document.querySelectorAll('[data-testid=\"cellInnerDiv\"]');"
        "window.scrollTo(0, inner_divs[0].scrollHeight);"
        "return inner_divs[2].scrollHeight;"
    )

    previous_height = drive.execute_script("return document.body.scrollHeight")
    avg_scroll_height = int(drive.execute_script(javascript)) * 13

    while True:
        ##Scrape what is currently loaded, then scroll further down
        data = data + self.scrape()
        drive.execute_script(
            "window.scrollTo(0, (document.body.scrollHeight + " + str(avg_scroll_height) + "));")
        sleep(3)
        new_height = drive.execute_script("return document.body.scrollHeight")
        ##If the page height stopped growing, we have reached the bottom
        if new_height == previous_height:
            break
        previous_height = new_height

By injecting JavaScript into the driver and looping while the document's scroll height keeps changing, we are able to scrape data from every part of the page.

Finally, we can do something useful with the data. In my case, I am just going to print it at the end of get_post, after the while loop:

    for comment in data:
        print(comment)

Now all you have to do is set URL_POST to your desired post, run the main file, and wait for your data to come back.
And you've done it! You have successfully created a web scraper for Twitter. Needless to say, use web scraping technologies in legal and ethical ways if you don't want to get in trouble...
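If you would rather save the comments than print them, twitter.py already imports twitter_post_to_csv from csv_exports, a module that is not shown in this post. A minimal sketch of what such a helper could look like, if you want to write your own version:

##csv_exports.py - a simple CSV export helper (sketch)
import csv

def twitter_post_to_csv(scraped_data, filename="comments.csv"):
    ##scraped_data is the list of {"username": ..., "post": ...} dicts from scrape()
    with open(filename, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["username", "post"])
        writer.writeheader()
        writer.writerows(scraped_data)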

In conclusion, Twitter comments can be scraped quite efficiently, but scraping should always be done legally and ethically. Beyond that, Twitter data is very valuable and can help you understand public opinion on a topic.
