Alan Stocco

Posted on Feb 4, 2021

Scraper payslips with Python | Selenium

#scraper #python #selenium #automate

Scenario:

I work in a company and my paylips are downloadable in an aspx portal. One by one, not in block.
I needed them all, for burocaracy reasons and in order to archive them.

How: py, selenium. I tried with beutifulsoup but it didn't work.

Explenation and Code

Web Driver

I used webdriver Chrome with some options in order to save the pdf-files when browser opens it. Look pref in code below.

Creating a class

class PaylipsScaper:
    # Init
    def __init__(self, username, password):
        self.username = username
        self.password = password
        # Options
        chrome_options = webdriver.ChromeOptions()
        prefs = {
            "plugins.always_open_pdf_externally": True,
            "download.default_directory": "C:\\tmp", # folder save files
            "download.prompt_for_download": False,
            "download.directory_upgrade": True,
            "safebrowsing.enabled": True
            }
        chrome_options.add_experimental_option("prefs",prefs)
        chrome_options.headless = True # If True hide browser
        self.driver = webdriver.Chrome(executable_path='chromedriver.exe', options=chrome_options)

Login

The login phase is quite easy, just select by id the area and insert the value. I just kept attention to iframe, because in that case you have to use switch_to.frame before.

    # Manage login page
    def login(self, url):
        driver = self.driver
        driver.get(url)
        driver.switch_to.frame("FunArea") 
        username = driver.find_element_by_id("login")
        password = driver.find_element_by_id("pwd")
        username.send_keys(self.username)
        time.sleep(1)   
        password.send_keys(self.password)  
        driver.find_element_by_id("CmdInvia").click()

Loop table using XPATH

I created a class that wrap the selenium driver in order to keep all cleans.
I just reproduced the clicks done by myself.
At the beginning I tried with CSS selector but for the structure of the pages was a better solution to use XPATH
(to get the XPATH with Chrome see here)
By the way I don't like the time.sleep but it was useful to avoid navigations problems during the process.

# Inside PaylipsScaper class
def get_num_rows(self, num_rows = 1):
        driver = self.driver
        self.click_to_payslips_area()            
        num_rows = len(driver.find_elements_by_xpath("//table[@id='ContTab']/tbody/tr/td/div/table/tbody/tr"))             
        return num_rows


[...other stuff...]
try:
    bot = PaylipsScaper(username, password) 
    bot.login(url_website)
    wait = WebDriverWait(bot.driver, 10)
    num_rows = bot.get_num_rows()       
    for row in range(1,num_rows+1):   
        paylip_year  = bot.get_val_in_cedolino_row(row, 4)
        paylip_month = bot.get_val_in_cedolino_row(row, 5)            
        paylip_type  = bot.get_val_in_cedolino_row(row, 7)
        bot.driver.execute_script("arguments[0].click();", WebDriverWait(bot.driver, 20).until(EC.element_to_be_clickable((By.XPATH, "/html/body/form/div/table/tbody/tr/td/div/table/tbody/tr["+str(row)+"]/td[10]/img"))))    
        time.sleep(2)  
        filepdf= dirpath + "\\*.pdf"
        list_of_files = glob.glob(filepdf)    
        file_name = max(list_of_files, key=os.path.getctime)
        current_paylip = Paylip(paylip_year, paylip_month, paylip_type, file_name)
        bot.rename_and_move (current_paylip)
        print("Downloaded:")
        print(current_paylip)

Save pdf file in folder and rename it

It's quite a brute solution anyway I got the last pdf saved in a folder and renamed it with the informations from the website.
Then I moved the files in sub-folders by year.

def rename_and_move(self, urrent_paylip):
        if current_paylip.paylip_month == "" :
            new_file_name='Cud_'+str(current_paylip.paylip_year)+'_'+str(current_paylip.paylip_type).replace(" ", "_").replace("Completo", "").replace("NORMALE", "")+'.pdf'
        elif "TREDICESIMA" in current_paylip.paylip_type:
            new_file_name = 'Cedolino_'+str(current_paylip.paylip_year)+'_'+str(current_paylip.paylip_month)+'_Tredicesima.pdf'
        else:
            new_file_name = 'Cedolino_'+str(current_paylip.paylip_year)+'_'+str(current_paylip.paylip_month)+'.pdf'
        print(new_file_name)
        new_file_name = os.path.join(dirpath, new_file_name)
        # Rename file and move it in the year-directory
        os.rename(current_paylip.file_name, new_file_name)
        current_paylip.file_name = new_file_name
        # Check if path with year directory exist otherwise create it
        dirin=os.path.split(new_file_name)
        newdir=dirin[0]+'\\'+current_paylip.paylip_year
        if os.path.exists(newdir)==False:
                # Create directory
                os.mkdir(newdir)
        # Move file in the year-directory 
        if os.path.exists(newdir+"\\"+dirin[1]):
            # If file already exist, delete it 
            os.remove(newdir+"\\"+dirin[1])
        shutil.move (current_paylip.file_name,newdir+"\\"+dirin[1])
        return

Final situation

Got it. I have a folder with subfolders by year and in each one all the paylips with a standard name format.

What I learned:

Use of Selenium in py.
Simple automation can save a lot of time and avoid manual boring tasks.
How to write my first article here.(it's a personal task so not so useful for you but better than nothing after all)

Future improvements:

input parameters
(re)try to use css selector instead of xpath selector
(re)try to use BeautifulSoup
save last paylips saved in order, next run, to save only the not already saved paylips
read pdf and report data in file(eg google sheets)

Of course the code is useful just for me and my colleagues. Anyway I hope that the idea and process can be a good idea to someone else.

DEV Community