Julian Agius

Web Scraping for Scientific Papers

Introduction

ACL is the annual meeting of the Association for Computational Linguistics, covering research areas related to Natural Language Processing (NLP).
As an M.Sc. student in AI specializing in NLP, I am currently on the lookout for cutting-edge research in the field of computational linguistics.

Motivation for using Web Scraping

Multiple state-of-the-art scientific papers were published at this year's event, ACL2020.

I simply wanted a list of all the papers published on the ACL2020 website, together with their abstracts. By saving these details in a CSV file, I could use Excel to filter and colour-code the papers relevant to my dissertation.

Solution

To scrape the titles and abstracts of the papers published for ACL2020, I wrote a short script in Python.

First, I imported the libraries I needed:

import requests
import pandas as pd
from bs4 import BeautifulSoup

Then I used the requests library to get the response for the ACL2020 Anthology web page. The HTML of the web page (page.content) was then parsed using BeautifulSoup.

# URL of the ACL2020 event page on the ACL Anthology (the exact address used here is an assumption)
URL = 'https://aclanthology.org/events/acl-2020/'
# Get Response object for webpage
page = requests.get(URL)
# Parse webpage HTML and save as BeautifulSoup object
soup = BeautifulSoup(page.content, 'html.parser')
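
Before parsing, it can also be worth checking that the request actually succeeded. This check is an addition of mine rather than part of the original script, but requests makes it a one-liner:

# Optional: raise an HTTPError if the server returned a 4xx/5xx status
page.raise_for_status()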

Initially, I extracted the titles of all the papers found on the web page. I used the find_all() method to look for all the paragraph tags with the CSS classes d-sm-flex align-items-stretch, i.e. all the paragraphs that contained paper titles.

title_paras = soup.find_all('p', class_='d-sm-flex align-items-stretch')


However, the items in the title_paras variable are not the titles themselves, which is what I actually want. Therefore, I had to drill down through the child tags of each paragraph until I reached the title text, which is stored in the a tag with the CSS class align-middle.

# Collect the title text from the a tag inside the second span (class d-block) of each paragraph
titles = []
for para in title_paras:
    titles.append(para.find_all('span', class_='d-block')[1].find('a', class_='align-middle').text)


I went through a similar process to extract the abstract of each paper (a sketch of what that loop could look like is shown below). The titles and abstracts were stored in two lists called titles and abstracts (shocker, I know). I created a pandas DataFrame from these two lists and saved it to a CSV file.
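
As a minimal sketch, the abstract loop could look something like the snippet below. It reuses the soup object from earlier, and the abstract-collapse class name is an assumption about the ACL Anthology markup, so the selector may need adjusting if the page structure differs.

# Sketch of the abstract extraction; the div class name is an assumption about the page markup
abstracts = []
for div in soup.find_all('div', class_='abstract-collapse'):  # assumed CSS class
    abstracts.append(div.get_text(strip=True))

If a paper has no abstract on the page, the two lists can end up with different lengths, which pandas will reject when building the DataFrame, so missing abstracts may need to be padded with empty strings.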

# Build a DataFrame from the two lists and write it out as a CSV file
df = pd.DataFrame({'Title': titles, 'Abstract': abstracts})
df.to_csv('ACL 2020 Papers.csv', index=False)

The GitHub repo (including code and required libraries) for this short project can be found here.

Conclusion

In this post, we went over how to scrape scientific paper titles and abstracts from the ACL2020 website in Python using BeautifulSoup, and how to save the data in CSV format using pandas.
