In this article, I will show you how to collect and scrape news from various sources. Instead of spending a long time writing scraping code for each website, we will use newspaper3k to extract structured information automatically.
Let’s get started. The first step is to install the newspaper3k package with pip. Open your terminal (Linux/macOS) or command prompt (Windows) and type:
pip install newspaper3k
After the installation completes, open your code editor and import the package with the following code:
from newspaper import Article
In this post, I will scrape the news from The New York Times entitled "A Lifetime of Running with Ahmaud Arbery Ends With a Text: ‘We Lost Maud’".
Next, create an Article object with the link to be scraped:
article = Article('https://www.nytimes.com/2020/05/10/us/ahmaud-arbery-georgia.html')
You can optionally specify the article’s language. If no language is given, newspaper3k detects it automatically, and it does this quite well.
But if you want to force a specific language, pass it as the language keyword argument (note that Article’s second positional argument is the title, not the language):
article = Article('https://www.nytimes.com/2020/05/10/us/ahmaud-arbery-georgia.html', language='en') # English
Then download and parse the article with the following code:
article.download()
article.parse()
Now everything is set. We can start using several methods to extract information, starting with the article authors.
print(article.authors)
# it will show ['Richard Fausset']
Next, we will get the date that the article was published.
print(article.publish_date)
# it will show datetime.datetime(2020, 5, 10, 0, 0)
Get the full text of the article.
print(article.text)
# it will show 'Mr. Baker was with Mr. Arbery’s family at his grave site . . .'
You can also get the URL of the article’s main image.
print(article.top_image)
# it will show 'https://static01.nyt.com/images/2020/05/10/us/10georgia-arbery-1/10georgia-arbery-1-facebookJumbo-v2.jpg'
In addition, newspaper3k provides methods for simple text processing, such as extracting keywords and generating a summary. First, initialize them with the .nlp() method.
article.nlp()
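A note from experience: .nlp() relies on NLTK’s sentence tokenizer, so if it raises a LookupError, the tokenizer data may be missing. This one-time setup step usually fixes it (assuming nltk was installed alongside newspaper3k):

```python
import nltk

# One-time download of the sentence tokenizer data used by .nlp()
nltk.download('punkt')
```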
Get the keywords.
print(article.keywords)
# it will show
['running', 'world', 'needed', 'arbery', 'lifetime', 'site', 'baker',
 'text', 'wished', 'maud', 'arberys', 'lost', 'pandemic', 'upended',
 'ends', 'ahmaud', 'unsure', 'mr']
And next is to summarize the news. Note that this feature is limited to a few languages, as explained in the documentation.
print(article.summary)
# it will show 'Mr. Baker was with Mr. Arbery’s family at his grave site.\nIt was Mr. Arbery’s birthday.\nA pandemic had upended the world, and Mr. Baker felt adrift.\nSometimes he wished to be called a musician, and sometimes he did not.\nHe was unsure what he would become or how he would get there.'
If you want to see the full code, you can find it on my GitHub.
In conclusion, the newspaper3k package makes it quite easy to scrape news from various sources in Python, compared to spending a long time writing scraping code for each website. With it, we can retrieve the basic information we need, such as the authors, publish date, text, and images of a news article, and there are also methods for extracting keywords and summaries. Most importantly, newspaper3k supports news scraping in many languages.