DEV Community

Alan Stocco

Scraping payslips with Python | Selenium

Scenario:

I work at a company where my payslips can be downloaded from an ASPX portal, but only one at a time, not in bulk.
I needed all of them, both for bureaucratic reasons and to archive them.

How: Python and Selenium. I tried BeautifulSoup first, but it didn't work.

Explanation and Code

Web Driver

I used the Chrome WebDriver with a few options so that PDF files are saved to disk instead of being opened in the browser. See the prefs dictionary in the code below.

Creating a class

from selenium import webdriver

class PaylipsScaper:
    # Init
    def __init__(self, username, password):
        self.username = username
        self.password = password
        # Chrome preferences: download PDFs instead of opening them in the viewer
        chrome_options = webdriver.ChromeOptions()
        prefs = {
            "plugins.always_open_pdf_externally": True,
            "download.default_directory": "C:\\tmp",  # folder where files are saved
            "download.prompt_for_download": False,
            "download.directory_upgrade": True,
            "safebrowsing.enabled": True
            }
        chrome_options.add_experimental_option("prefs", prefs)
        chrome_options.headless = True  # if True, the browser window is hidden
        # Selenium 3 style call (executable_path was removed in later Selenium 4 releases);
        # chromedriver.exe is expected in the working directory
        self.driver = webdriver.Chrome(executable_path='chromedriver.exe', options=chrome_options)

Login

The login phase is quite easy: select the fields by id and fill in the values. The only thing to watch out for is the iframe: because the form lives inside one, you have to call switch_to.frame first.

    # Manage login page (needs: import time)
    def login(self, url):
        driver = self.driver
        driver.get(url)
        # The login form lives inside an iframe, so switch to it first
        driver.switch_to.frame("FunArea")
        username = driver.find_element_by_id("login")
        password = driver.find_element_by_id("pwd")
        username.send_keys(self.username)
        time.sleep(1)  # short pause between the two fields
        password.send_keys(self.password)
        # Submit the form
        driver.find_element_by_id("CmdInvia").click()

Loop table using XPATH

I created a class that wraps the Selenium driver to keep everything tidy.
I simply reproduced the clicks I would do by hand.
At first I tried CSS selectors, but given the structure of the pages, XPath turned out to be the better solution
(to get the XPath with Chrome see here).
By the way, I don't like time.sleep, but it was useful to avoid navigation problems during the process.
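
Since the table rows are addressed by position, the XPath of any cell can be built from its row and column index. The snippet below is only a sketch: cell_xpath is a hypothetical helper that is not part of my script, and it assumes that the rows matched by the XPath in get_num_rows (further down) are the same rows that hold the download icon.

```python
# Row XPath as used by get_num_rows in this article.
ROWS_XPATH = "//table[@id='ContTab']/tbody/tr/td/div/table/tbody/tr"


def cell_xpath(row: int, col: int, child: str = "") -> str:
    """Return the XPath of a table cell (1-based row and column).

    If child is given (for example "img"), point at that tag inside the cell.
    Hypothetical helper, shown only to illustrate positional XPaths.
    """
    if row < 1 or col < 1:
        raise ValueError("XPath positions are 1-based")
    xpath = f"{ROWS_XPATH}[{row}]/td[{col}]"
    return f"{xpath}/{child}" if child else xpath


# XPath of the image in column 10 of the third row
print(cell_xpath(3, 10, "img"))
```

Keeping the XPath in one place means that a change in the page layout needs to be fixed only once.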

    # Inside PaylipsScaper class
    def get_num_rows(self, num_rows = 1):
        driver = self.driver
        self.click_to_payslips_area()
        # Count the rows of the payslip table
        num_rows = len(driver.find_elements_by_xpath("//table[@id='ContTab']/tbody/tr/td/div/table/tbody/tr"))
        return num_rows


[...other stuff...]
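try:
    bot = PaylipsScaper(username, password)
    bot.login(url_website)
    wait = WebDriverWait(bot.driver, 20)
    num_rows = bot.get_num_rows()
    for row in range(1, num_rows + 1):  # XPath row indexes start at 1
        # Read year, month and type from the current table row
        paylip_year  = bot.get_val_in_cedolino_row(row, 4)
        paylip_month = bot.get_val_in_cedolino_row(row, 5)
        paylip_type  = bot.get_val_in_cedolino_row(row, 7)
        # Wait until the download icon of this row is clickable, then click it via JavaScript
        icon_xpath = "/html/body/form/div/table/tbody/tr/td/div/table/tbody/tr[" + str(row) + "]/td[10]/img"
        icon = wait.until(EC.element_to_be_clickable((By.XPATH, icon_xpath)))
        bot.driver.execute_script("arguments[0].click();", icon)
        time.sleep(2)  # give the download time to finish
        # Pick the most recently created PDF in the download folder
        filepdf = dirpath + "\\*.pdf"
        list_of_files = glob.glob(filepdf)
        file_name = max(list_of_files, key=os.path.getctime)
        current_paylip = Paylip(paylip_year, paylip_month, paylip_type, file_name)
        bot.rename_and_move(current_paylip)
        print("Downloaded:")
        print(current_paylip)
except Exception as exc:  # the original handler is not shown here; this one is an assumption
    print("Download stopped:", exc)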
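
The fixed time.sleep(2) after the click assumes that every download finishes within two seconds. A possible alternative, sketched below with the standard library only, is to poll the download folder until a new PDF shows up. wait_for_new_pdf is a hypothetical helper that is not part of my script; it relies on Chrome keeping unfinished downloads under a temporary .crdownload name.

```python
import time
from pathlib import Path


def wait_for_new_pdf(download_dir, known_files, timeout=30.0, poll=0.5):
    """Return the newest PDF in download_dir that is not in known_files.

    Polls until at least one new .pdf exists and no .crdownload file
    (Chrome's in-progress download suffix) is left, or raises TimeoutError.
    """
    folder = Path(download_dir)
    deadline = time.monotonic() + timeout
    while True:
        in_progress = any(folder.glob("*.crdownload"))
        new_pdfs = [p for p in folder.glob("*.pdf") if p not in known_files]
        if new_pdfs and not in_progress:
            # If several files appeared, take the most recently modified one.
            return max(new_pdfs, key=lambda p: p.stat().st_mtime)
        if time.monotonic() >= deadline:
            raise TimeoutError(f"no new PDF in {folder} after {timeout}s")
        time.sleep(poll)


# Possible use inside the loop (dirpath is the download folder):
#   before = set(Path(dirpath).glob("*.pdf"))
#   ...click the download icon...
#   file_name = wait_for_new_pdf(dirpath, before)
```

Comparing against a snapshot taken before the click also avoids picking up an older PDF by mistake, which can happen when only the creation time is used.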

Save pdf file in folder and rename it

It's a rather crude solution: I take the most recent PDF saved in the download folder and rename it using the information from the website.
Then I move the files into sub-folders by year.

    # Inside PaylipsScaper class (needs: import os, shutil)
    def rename_and_move(self, current_paylip):
        # Build the new file name from the values read on the website
        if current_paylip.paylip_month == "":
            new_file_name = 'Cud_'+str(current_paylip.paylip_year)+'_'+str(current_paylip.paylip_type).replace(" ", "_").replace("Completo", "").replace("NORMALE", "")+'.pdf'
        elif "TREDICESIMA" in current_paylip.paylip_type:
            new_file_name = 'Cedolino_'+str(current_paylip.paylip_year)+'_'+str(current_paylip.paylip_month)+'_Tredicesima.pdf'
        else:
            new_file_name = 'Cedolino_'+str(current_paylip.paylip_year)+'_'+str(current_paylip.paylip_month)+'.pdf'
        print(new_file_name)
        new_file_name = os.path.join(dirpath, new_file_name)
        # Rename the downloaded file
        os.rename(current_paylip.file_name, new_file_name)
        current_paylip.file_name = new_file_name
        # Create the year directory if it does not exist yet
        dirin = os.path.split(new_file_name)
        newdir = os.path.join(dirin[0], current_paylip.paylip_year)
        if not os.path.exists(newdir):
            os.mkdir(newdir)
        # Move the file into the year directory, replacing any existing copy
        target = os.path.join(newdir, dirin[1])
        if os.path.exists(target):
            os.remove(target)
        shutil.move(current_paylip.file_name, target)
        return
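
The same naming rules can also be written as a small function that only computes the destination path, which makes them easy to check without touching the browser. This is a sketch with pathlib, not the code I actually ran; target_path and move_into_place are hypothetical names.

```python
from pathlib import Path


def target_path(base_dir, year, month, kind):
    """Build <base_dir>/<year>/<file name> with the naming rules of rename_and_move."""
    if month == "":
        # No month on the row: use the 'Cud_' naming branch.
        label = kind.replace(" ", "_").replace("Completo", "").replace("NORMALE", "")
        name = f"Cud_{year}_{label}.pdf"
    elif "TREDICESIMA" in kind:
        name = f"Cedolino_{year}_{month}_Tredicesima.pdf"
    else:
        name = f"Cedolino_{year}_{month}.pdf"
    return Path(base_dir) / str(year) / name


def move_into_place(source, destination):
    """Move source to destination, creating the year folder if needed."""
    destination = Path(destination)
    destination.parent.mkdir(parents=True, exist_ok=True)
    # Path.replace overwrites an existing file; like os.rename, it expects
    # source and destination to be on the same filesystem.
    Path(source).replace(destination)


print(target_path("C:/tmp", "2020", "03", "NORMALE").name)  # Cedolino_2020_03.pdf
```

Separating "where should the file go" from "move the file" keeps the rename and the move in a single step, so there is no intermediate file left behind if something fails halfway.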

Final situation

Got it. I now have a folder with one subfolder per year, each containing all the payslips with a standard name format.

What I learned:

  • Use of Selenium in Python.
  • Simple automation can save a lot of time and avoid boring manual tasks.
  • How to write my first article here (it's a personal task, so not so useful for you, but better than nothing after all).

Future improvements:

  • input parameters
  • (re)try CSS selectors instead of XPath selectors
  • (re)try to use BeautifulSoup
  • remember which payslips were already saved, so that the next run downloads only the new ones
  • read the PDFs and report the data in a file (e.g. Google Sheets)
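
For the "download only the new ones" idea, one possible approach is a small JSON file that records which payslips were already saved. The sketch below is untested against the real portal; load_seen, save_seen, payslip_key and the state file name are all hypothetical.

```python
import json
from pathlib import Path


def load_seen(state_file):
    """Return the set of payslip keys saved by a previous run (empty if none)."""
    path = Path(state_file)
    if not path.exists():
        return set()
    return set(json.loads(path.read_text(encoding="utf-8")))


def save_seen(state_file, seen):
    """Write the set of payslip keys to disk as a sorted JSON list."""
    Path(state_file).write_text(json.dumps(sorted(seen)), encoding="utf-8")


def payslip_key(year, month, kind):
    """Hypothetical key that identifies one row of the payslip table."""
    return f"{year}|{month}|{kind}"


# Possible use inside the download loop:
#   seen = load_seen("payslips_seen.json")
#   key = payslip_key(paylip_year, paylip_month, paylip_type)
#   if key in seen:
#       continue              # already downloaded in a previous run
#   ...download and rename...
#   seen.add(key)
#   save_seen("payslips_seen.json", seen)
```

Saving the file after every download means that an interrupted run can resume where it stopped.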

Of course, the code itself is useful only to me and my colleagues. Still, I hope the idea and the process can be useful to someone else.
