<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Julian Agius</title>
    <description>The latest articles on DEV Community by Julian Agius (@julianagius).</description>
    <link>https://dev.to/julianagius</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F168915%2Fff2a1fd1-6981-4a01-9cfc-1ec33572ffbb.jpg</url>
      <title>DEV Community: Julian Agius</title>
      <link>https://dev.to/julianagius</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/julianagius"/>
    <language>en</language>
    <item>
      <title>Web Scraping for Scientific Papers</title>
      <dc:creator>Julian Agius</dc:creator>
      <pubDate>Tue, 01 Sep 2020 13:38:49 +0000</pubDate>
      <link>https://dev.to/julianagius/web-scraping-for-scientific-paper-details-6o5</link>
      <guid>https://dev.to/julianagius/web-scraping-for-scientific-paper-details-6o5</guid>
      <description>&lt;h1&gt;
  
  
  Introduction
&lt;/h1&gt;

&lt;p&gt;ACL is the annual meeting of the Association for Computational Linguistics, covering research areas related to Natural Language Processing (NLP). &lt;br&gt;
As an M.Sc. student in AI specializing in NLP, I am currently on the lookout for cutting-edge research in computational linguistics.&lt;/p&gt;
&lt;h1&gt;
  
  
  Motivation for using Web Scraping
&lt;/h1&gt;

&lt;p&gt;Many state-of-the-art scientific papers were published at this year's meeting, &lt;a href="https://www.aclweb.org/anthology/events/acl-2020/"&gt;ACL2020&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I simply wanted a list of all the papers published on the ACL2020 website, together with their abstracts. With these details saved in a CSV file, I could use Excel to filter and colour-code the papers that were relevant to my dissertation.&lt;/p&gt;
&lt;h1&gt;
  
  
  Solution
&lt;/h1&gt;

&lt;p&gt;To scrape the titles and abstracts of the papers published at ACL2020, I wrote a short script in Python.&lt;/p&gt;

&lt;p&gt;First, I imported the libraries I needed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;requests&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;bs4&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BeautifulSoup&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Then I used the &lt;em&gt;requests&lt;/em&gt; library to request the ACL2020 Anthology web page. The HTML of the response (&lt;code&gt;page.content&lt;/code&gt;) was then parsed using &lt;em&gt;BeautifulSoup&lt;/em&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# ACL2020 Anthology page linked above (assumed to be the page that was scraped)
&lt;/span&gt;&lt;span class="n"&gt;URL&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;'https://www.aclweb.org/anthology/events/acl-2020/'&lt;/span&gt;
&lt;span class="c1"&gt;# Get Response object for webpage
&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;URL&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Parse webpage HTML and save as BeautifulSoup object
&lt;/span&gt;&lt;span class="n"&gt;soup&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;BeautifulSoup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'html.parser'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
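
&lt;p&gt;The request above assumes that the download succeeds. As a minimal sketch, and not necessarily what the original script does, the same two steps can be wrapped in a helper that sets a timeout and raises on HTTP error codes, so that an error page is never parsed by mistake. The name &lt;code&gt;fetch_soup&lt;/code&gt; and the 30-second timeout are assumptions made for illustration.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;import requests
from bs4 import BeautifulSoup


def fetch_soup(url, timeout=30):
    """Download a page and return it as a BeautifulSoup object."""
    # The timeout stops the script from hanging on an unresponsive server
    page = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses
    page.raise_for_status()
    return BeautifulSoup(page.content, 'html.parser')
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;With this helper, &lt;code&gt;soup = fetch_soup(URL)&lt;/code&gt; would stand in for the two lines above, where &lt;code&gt;URL&lt;/code&gt; is the address of the ACL2020 Anthology page.&lt;/p&gt;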



&lt;p&gt;Initially, I extracted the titles of all the papers found on the particular web page. I used the &lt;code&gt;find_all()&lt;/code&gt; method to look for all the paragraph tags with the following CSS classes &lt;code&gt;d-sm-flex align-items-stretch&lt;/code&gt;, i.e. all the paragraphs that contained paper titles.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;title_paras&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;soup&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;find_all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'p'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;class_&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'d-sm-flex align-items-stretch'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--QkxCLiRi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/1q9czr1srqg6oluizvsw.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--QkxCLiRi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/1q9czr1srqg6oluizvsw.PNG" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;However, the items in &lt;code&gt;title_paras&lt;/code&gt; are not the titles themselves, which is what I wanted. I therefore had to work down through the child tags of each paragraph until I reached the title text, stored in the &lt;code&gt;a&lt;/code&gt; tag with the CSS class &lt;code&gt;align-middle&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;titles&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;para&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;title_paras&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# The second d-block span holds the link whose text is the title
&lt;/span&gt;    &lt;span class="n"&gt;titles&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;para&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;find_all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'span'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;class_&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'d-block'&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'a'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;class_&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'align-middle'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--jz4NUt9D--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/mocqlm9k6fpkx9mzpvdo.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--jz4NUt9D--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/mocqlm9k6fpkx9mzpvdo.PNG" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;
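
&lt;p&gt;The chained lookup above raises an error as soon as one paragraph does not have the expected structure. A more defensive variant, sketched here under the assumption that some paragraphs might lack a second &lt;code&gt;d-block&lt;/code&gt; span or a title link, returns &lt;code&gt;None&lt;/code&gt; for those instead. The helper name &lt;code&gt;extract_title&lt;/code&gt; is made up for this example.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;def extract_title(para):
    """Return the title text of one title paragraph, or None if it is missing."""
    spans = para.find_all('span', class_='d-block')
    try:
        # Same lookup as above: the second d-block span holds the title link
        link = spans[1].find('a', class_='align-middle')
    except IndexError:
        # Fewer than two d-block spans in this paragraph
        return None
    if link is None:
        return None
    # strip=True removes whitespace and newlines around the title
    return link.get_text(strip=True)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The list could then be built with &lt;code&gt;titles = [extract_title(para) for para in title_paras]&lt;/code&gt;, which leaves a &lt;code&gt;None&lt;/code&gt; entry wherever a title could not be found.&lt;/p&gt;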

&lt;p&gt;I went through a similar process to extract the abstract of each paper. The titles and abstracts were stored in two lists called &lt;code&gt;titles&lt;/code&gt; and &lt;code&gt;abstracts&lt;/code&gt; (&lt;em&gt;shocker&lt;/em&gt;, I know). I then created a &lt;em&gt;pandas&lt;/em&gt; DataFrame from these two lists and saved it as a CSV file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="s"&gt;'Title'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;titles&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'Abstract'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;abstracts&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;to_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'ACL 2020 Papers.csv'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
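
&lt;p&gt;Because the titles and abstracts are collected in two separate lists, they only line up if every paper contributed one entry to each. As a rough sketch, the same file could also be written with Python's built-in &lt;em&gt;csv&lt;/em&gt; module, swapped in here for &lt;em&gt;pandas&lt;/em&gt;, with an explicit length check first. The function name &lt;code&gt;write_papers_csv&lt;/code&gt; and the sample data are made up for illustration.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;import csv
import io


def write_papers_csv(titles, abstracts, out_file):
    """Write one row per paper, refusing to write misaligned lists."""
    if len(titles) != len(abstracts):
        raise ValueError(f'{len(titles)} titles but {len(abstracts)} abstracts')
    writer = csv.writer(out_file)
    writer.writerow(['Title', 'Abstract'])
    # zip() pairs each title with the abstract at the same position
    writer.writerows(zip(titles, abstracts))


# Made-up sample data, written to an in-memory buffer instead of a file
buffer = io.StringIO()
write_papers_csv(['Paper A', 'Paper B'], ['Abstract A', 'Abstract B'], buffer)
print(buffer.getvalue())
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;To write a real file, the buffer can be replaced with &lt;code&gt;open('ACL 2020 Papers.csv', 'w', newline='', encoding='utf-8')&lt;/code&gt;. The &lt;code&gt;newline=''&lt;/code&gt; argument is what the &lt;em&gt;csv&lt;/em&gt; documentation recommends for files passed to a writer.&lt;/p&gt;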



&lt;p&gt;The GitHub repo (including code and required libraries) for this short project can be found &lt;a href="https://github.com/julianagius/ACL2020-Scraper"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;In this post, we went over how to scrape the titles and abstracts of scientific papers from the ACL2020 Anthology page in Python using &lt;em&gt;BeautifulSoup&lt;/em&gt;, and how to save the data in CSV format using &lt;em&gt;pandas&lt;/em&gt;.&lt;/p&gt;

&lt;h1&gt;
  
  
  Useful Links
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/"&gt;Beautiful Soup Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://requests.readthedocs.io/en/master/user/quickstart/"&gt;Requests Quickstart&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://pandas.pydata.org/"&gt;pandas Library Homepage&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>python</category>
    </item>
  </channel>
</rss>
