DEV Community 👩‍💻👨‍💻

Rajat Thakur

Top 20 news datasets available on the web for free

Digital news sources have grown at an extraordinary rate, from a handful of online outlets to a vast number of digital publications. News now covers a wide range of issues and events, which has extended its reach. These publications not only represent the world but also shape how we perceive it.

Storing news data is now common because of the high demand for instant access to historical news, which people commonly retrieve through a news API. News datasets are useful for research and for personal and professional artificial intelligence (AI) and machine learning (ML) projects.

If you are looking for historical news data to power your AI and ML algorithms, you can use the free news datasets listed here or the Newsdata.io tool, which I mention below. News datasets can help you find historical stories related to almost any topic, organization, or person.

In this article, we will discuss a simple and reliable way to access historical news datasets. Let's get right into it.

Here are the top 20 news datasets that you can download for free for your personal and professional AI, machine learning, and data analytics projects.

  1. Newsdata.io

Name- Covid-19 news dataset

Link- https://newsdata.io/files/datasets/covid19-news

This Covid-19 dataset contains the latest world news related to the coronavirus.

  2. Kaggle.com

Name- BBC News Classification (News article categorization)

Link- https://www.kaggle.com/c/learn-ai-bbc

The dataset is split into 1,490 records for training and 735 for testing. The goal is to build a system that can accurately classify previously unseen news articles into the right category.
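To make the classification task concrete, here is a minimal sketch of a naive Bayes text classifier written with only the Python standard library. It is not the competition's reference solution: the training sentences, the category names, and the helper functions `tokenize`, `train`, and `predict` are all hypothetical and only illustrate the train-then-predict workflow you might apply to the real training and test records.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def train(examples):
    """Fit a multinomial naive Bayes model on (text, category) pairs."""
    word_counts = defaultdict(Counter)  # category -> word frequencies
    doc_counts = Counter()              # category -> number of articles
    vocab = set()
    for text, category in examples:
        tokens = tokenize(text)
        word_counts[category].update(tokens)
        doc_counts[category] += 1
        vocab.update(tokens)
    return word_counts, doc_counts, vocab


def predict(model, text):
    """Return the category with the highest log-probability for the text."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_category, best_score = None, float("-inf")
    for category, n_docs in doc_counts.items():
        score = math.log(n_docs / total_docs)
        total_words = sum(word_counts[category].values())
        for token in tokenize(text):
            # Laplace (add-one) smoothing so unseen words do not zero out the score.
            count = word_counts[category][token] + 1
            score += math.log(count / (total_words + len(vocab)))
        if score > best_score:
            best_category, best_score = category, score
    return best_category


# Hypothetical toy training data, not taken from the BBC dataset.
TRAIN = [
    ("The striker scored twice in the cup final", "sport"),
    ("The team won the league title after a late goal", "sport"),
    ("Shares fell as the bank reported lower profits", "business"),
    ("The company announced a merger and higher profits", "business"),
]

model = train(TRAIN)
prediction = predict(model, "Profits at the bank rose this quarter")
print(prediction)
```

With the real data, you would call `train` on the labeled training records and `predict` on each test article; a library such as scikit-learn would likely be a better fit for anything beyond a quick experiment.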

  3. BBC

Name- BBC datasets

Link- http://mlg.ucd.ie/datasets/bbc.html

Two news article datasets, originating from BBC News, provided for use as benchmarks for machine learning research.

  4. Harvard Dataverse

Name- A Million News Headlines

Link- https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/SYBGZL

This dataset contains news headlines published over a period of eighteen years, sourced from the reputable Australian news outlet ABC (Australian Broadcasting Corporation).
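A common first step with a headline archive like this is to count how many headlines were published each year. The sketch below assumes the file is a CSV with a `publish_date` column in `yyyymmdd` form and a `headline_text` column; check the downloaded file's header before relying on those names. The sample rows are invented for illustration.

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows; the real file is assumed to use the same two columns.
SAMPLE_CSV = """publish_date,headline_text
20030219,council chief executive fails to secure position
20030220,drought relief package announced for farmers
20040105,new hospital wing opens in regional centre
"""


def headlines_per_year(csv_file):
    """Count headlines per year, reading the year from a yyyymmdd date string."""
    counts = Counter()
    for row in csv.DictReader(csv_file):
        year = int(row["publish_date"][:4])
        counts[year] += 1
    return counts


counts = headlines_per_year(io.StringIO(SAMPLE_CSV))
for year in sorted(counts):
    print(year, counts[year])
```

To run it on the downloaded file, pass an open file handle (for example `open(path, newline="", encoding="utf-8")`) instead of the `io.StringIO` sample.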

  5. Newsdata.io

Name- Covid-19 and vaccine news dataset

Link- https://newsdata.io/files/datasets/covid-vaccine-news

This dataset contains the latest published news headlines from across the web, with all the metadata and the full description for each headline.

  6. Webz.io

Name- Political news articles

Link- https://webz.io/free-datasets/political-news-articles/

This dataset contains news articles related to world politics, fetched with the Webz.io news API.

  7. Paperswithcode

Name- COVID-19 Fake News Dataset

Link- https://paperswithcode.com/dataset/covid-19-fake-news-dataset

Along with the COVID-19 pandemic, we are also fighting an 'infodemic'. Fake news and rumors are rampant on social media. Believing in rumors can cause significant harm.

  8. Kaggle

Name- India News Headlines Dataset

Link- https://www.kaggle.com/therohk/india-headlines-news-dataset

This news dataset is a persistent historical archive of notable events in the Indian subcontinent from the start of 2001 to the end of 2020, recorded in real time by the journalists of India. It contains approximately 3.4 million events published by the Times of India.

  9. Data.world

Name- Economic News Article Tone

Link- https://data.world/crowdflower/economic-news-article-tone

Contributors read snippets of news articles. They then noted whether the article was relevant to the US economy and, if so, what the tone of the article was.

  10. Archive.org

Name- World Politics news dataset

Link- https://archive.org/details/world-politics-news-dataset

This dataset contains the latest news related to politics around the world, along with each article's available metadata.

  11. IEEE.org

Name- Covid-19 and vaccine

Link- https://ieee-dataport.org/documents/covid-19-and-vaccine-news-dataset

This dataset contains world news related to Covid-19 and vaccines, along with each article's available metadata.

  12. IEEE.org

Name- World politics news

Link- https://ieee-dataport.org/documents/world-politics-news-dataset

This dataset contains world news related to politics, along with each article's available metadata.

  13. IEEE.org

Name- Covid-19 news

Link- https://ieee-dataport.org/documents/covid-19-news

This dataset contains all the latest news data related to Covid-19 from around the world.

  14. IEEE.org

Name- COVIFN : FAKE NEWS ON COVID19

Link- https://ieee-dataport.org/documents/covifn-fake-news-covid19

COVIFN is a COVID-19-specific dataset that consists of fact-checked fake news scraped from Poynter and true news from news publishers' verified portals. The dataset was pre-processed to remove special characters and non-vital information.
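If you want to apply similar cleaning to your own news text, the sketch below shows one way to do it with regular expressions. It is not the COVIFN authors' actual pipeline: the function name `clean_text` is hypothetical, and treating URLs as "non-vital information" and anything other than letters, digits, and whitespace as "special characters" are assumptions made for illustration.

```python
import re


def clean_text(text):
    """Strip URLs and special characters, then collapse repeated whitespace."""
    text = re.sub(r"https?://\S+", " ", text)      # drop links
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)    # drop punctuation and symbols
    return re.sub(r"\s+", " ", text).strip()


cleaned = clean_text("BREAKING!!! Miracle cure found?? Read more: https://example.com/a #covid")
print(cleaned)  # BREAKING Miracle cure found Read more covid
```

Note that this pattern also removes accented and non-Latin characters, so adjust the character class if your data is multilingual.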

  15. IEEE.org

Name- FAKE NEWS ON HEALTHCARE

Link- https://ieee-dataport.org/documents/fake-news-healthcare

The Internet is a vast repository of useful knowledge, but it has been contaminated by the spread of false information. Relying on misinformation can be disastrous. According to a World Health Organization survey, about 6,000 individuals were hospitalized throughout the world as a result of fake news on COVID-19 in the first three months of 2020.

  16. IEEE.org

Name- NEWS CREDIBILITY DATASET

Link- https://ieee-dataport.org/documents/news-credibility-dataset

This dataset lists the features of each news item according to seven credibility categories.

  17. IEEE.org

Name- AI-Based automated extraction of entities, entity categories, and sentiment on Covid-19 situation.

Link- https://ieee-dataport.org/documents/ai-based-automated-extraction-entities-entity-categories-and-sentiments-covid-19-situation

Artificial Intelligence (AI) based in-depth analysis of social media content would allow a strategic decision-maker to obtain evidence-based responses to complex queries.

  18. Kaggle

Name- Reddit Omicron Panic

Link- https://www.kaggle.com/yamqwe/reddit-omicron-panic

As we all know, a new variant of COVID-19 is spreading worldwide causing massive panic. This dataset captures mentions of the new variant on Reddit.

  19. Kaggle

Name- Omicron daily cases by country (COVID-19 variant)

Link- https://www.kaggle.com/yamqwe/omicron-covid19-variant-daily-cases

This dataset tracks the progression of the new Omicron COVID-19 variant.

  20. IEEE.org

Name- Daily report of Covid-19 confirmed cases in Thailand.

Link- https://ieee-dataport.org/documents/daily-report-covid-19-confirmed-cases-thailand

The dataset contains a total of 578,375 confirmed COVID-19 cases reported in Thailand, recorded between 22 January 2021 and 30 July 2021.
