
Fahmi Nurfikri

Originally published at towardsdatascience.com

Web Scraping News with 4 lines using Python

In this article, I will show you how to collect and scrape news from various sources. Instead of spending a long time writing scraping code for each website, we will use newspaper3k to extract structured information automatically.

Let’s get started. The first step is to install the package we will use, newspaper3k, with pip. Open your terminal (Linux / macOS) or command prompt (Windows) and type:

pip install newspaper3k

After the installation is complete, open your code editor and import the package with the following code:

from newspaper import Article

In this post, I will scrape a New York Times article titled "A Lifetime of Running with Ahmaud Arbery Ends With a Text: ‘We Lost Maud’".

Next, enter the link to be scraped:

article = Article('https://www.nytimes.com/2020/05/10/us/ahmaud-arbery-georgia.html')

You can also specify the language of the article. newspaper3k handles language detection quite well on its own: if no language is given, it will try to detect the language automatically.

If you want to set a specific language, change the code as follows:

article = Article('https://www.nytimes.com/2020/05/10/us/ahmaud-arbery-georgia.html', language='en') # English; language must be passed as a keyword argument

Then download and parse the article with the following code:

article.download()
article.parse()
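
Network requests can fail, and in newspaper3k a failed download typically surfaces as an exception once parse() is called. Below is a minimal sketch of a guard around these two calls. The name safe_parse is hypothetical and not part of newspaper3k, and the broad except clause is a simplification for illustration.

```python
def safe_parse(article):
    """Download and parse an article; return True on success, False otherwise.

    `safe_parse` is a hypothetical helper, not part of newspaper3k.
    It works with any object that has download() and parse() methods.
    """
    try:
        article.download()
        article.parse()
    except Exception as exc:
        # newspaper3k raises its own ArticleException here; this sketch
        # catches Exception broadly to stay independent of the library.
        print(f"Skipping article: {exc}")
        return False
    return True
```

You could then call safe_parse(article) and only read the article's attributes when it returns True.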

Now everything is set. We can start reading several attributes to extract information, starting with the article authors.

print(article.authors)
# it will show ['Richard Fausset']

Next, we will get the date that the article was published.

print(article.publish_date)
# it will show 2020-05-10 00:00:00 (the value is datetime.datetime(2020, 5, 10, 0, 0))
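
Not every site exposes a publication date, so publish_date can come back empty. Here is a small sketch for formatting the date defensively. The helper name format_publish_date and the "unknown" fallback are assumptions for illustration, not part of newspaper3k.

```python
from datetime import datetime


def format_publish_date(publish_date):
    """Return an ISO date string, or 'unknown' when no date was extracted.

    Hypothetical helper; expects a datetime or an empty value (None or '').
    """
    if not publish_date:
        return "unknown"
    return publish_date.date().isoformat()


print(format_publish_date(datetime(2020, 5, 10, 0, 0)))  # 2020-05-10
print(format_publish_date(None))                         # unknown
```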

Get the full text of the article.

print(article.text)
# it will show ‘Mr. Baker was with Mr. Arbery’s family at his grave site . . .’

You can also get image links from articles.

print(article.top_image)
# it will show 'https://static01.nyt.com/images/2020/05/10/us/10georgia-arbery-1/10georgia-arbery-1-facebookJumbo-v2.jpg'
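
If you want to store the result rather than print it, the attributes shown so far can be gathered into one JSON document. This is a rough sketch: article_to_json is a hypothetical helper name, and the field selection is just the four attributes used in this post.

```python
import json


def article_to_json(article):
    """Serialize the basic fields of a parsed article to a JSON string.

    Hypothetical helper; `article` is any object exposing authors,
    publish_date, text and top_image, such as a parsed Article.
    """
    publish_date = article.publish_date
    record = {
        "authors": list(article.authors),
        # datetime objects are not JSON-serializable, so convert to text
        "publish_date": publish_date.isoformat() if publish_date else None,
        "text": article.text,
        "top_image": article.top_image,
    }
    return json.dumps(record, ensure_ascii=False, indent=2)
```

Calling article_to_json(article) after article.parse() would give a string you can write to a file or send to another service.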

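In addition, newspaper3k provides methods for simple text processing, such as keyword extraction and summarization. First, run the .nlp() method. This step relies on NLTK tokenizer data, so you may need to download that data first if NLTK reports it as missing.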

article.nlp()

Get the keywords.

print(article.keywords)
# it will show
['running',
 'world',
 'needed',
 'arbery',
 'lifetime',
 'site',
 'baker',
 'text',
 'wished',
 'maud',
 'arberys',
 'lost',
 'pandemic',
 'upended',
 'ends',
 'ahmaud',
 'unsure',
 'mr']
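
The keyword list can contain noise such as 'mr'. One optional way to tidy it up is a small post-filter. The helper clean_keywords, its default minimum length, and its stop list are all assumptions for illustration and are not part of newspaper3k.

```python
def clean_keywords(keywords, min_length=3, extra_stopwords=("mr",)):
    """Drop very short or unwanted keywords and sort the rest alphabetically.

    Hypothetical helper; sorting is only there to make the output stable.
    """
    stop = set(extra_stopwords)
    return sorted(k for k in keywords if len(k) >= min_length and k not in stop)


print(clean_keywords(["running", "mr", "maud", "world"]))
# ['maud', 'running', 'world']
```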

Next, summarize the news. Note that this feature is limited to only a few languages, as explained in the documentation.

print(article.summary)
# it will show 'Mr. Baker was with Mr. Arbery’s family at his grave site.\nIt was Mr. Arbery’s birthday.\nA pandemic had upended the world, and Mr. Baker felt adrift.\nSometimes he wished to be called a musician, and sometimes he did not.\nHe was unsure what he would become or how he would get there.'
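
As the output above suggests, the summary is a single string with sentences separated by newlines. If you would rather work with a list of sentences, a sketch like the following could help. The helper summary_sentences is a hypothetical name, and the example string is shortened sample text.

```python
def summary_sentences(summary):
    """Split a newline-separated summary into a list of non-empty sentences."""
    return [line.strip() for line in summary.splitlines() if line.strip()]


example = "It was his birthday.\nA pandemic had upended the world."
for number, sentence in enumerate(summary_sentences(example), start=1):
    print(f"{number}. {sentence}")
```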

If you want to see the full code, you can find it on my GitHub.


In conclusion, the newspaper3k package makes it quite easy to scrape news from various sources in Python, compared with spending a long time writing scraping code for each website. With this package, we can retrieve the basic information we need from a news article, such as the authors, publish date, text, and images. It also provides methods for getting keywords and summaries. Most importantly, newspaper3k can be used to scrape news in various languages.
