This tutorial shows how to collect Google News data using HasData’s Google News API and visualize the most common topics or keywords in news headlines. We’ll process the data, remove stop words, and create a simple frequency chart with Python.
Table of Contents
- Introduction
- Setup
- Fetching Google News Data
- Processing Headlines
- Creating a Frequency Chart
- Full Code
- Next Steps
- Further Reading
Introduction
News data is rich, but raw headlines can be messy. Common words like “the”, “of”, and “in” dominate the text, making it hard to extract meaningful insights. In this guide, we’ll:
- Fetch news headlines via HasData’s Google News API.
- Extract the highlight.title field from each article.
- Count the frequency of meaningful words.
- Visualize the top keywords using matplotlib.
This approach is useful for tracking trending topics, analyzing industry coverage, or quickly summarizing news from a specific domain.
Setup
You will need:
- Python 3
- The requests, matplotlib, and nltk packages
- Standard library modules: json, collections.Counter, re
pip install requests matplotlib nltk
You also need to download the stopwords data from NLTK. You can do this by running the following in a Python session:
import nltk
nltk.download('stopwords')
Make sure you have a HasData API key. You can get one for free from your HasData dashboard.
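Rather than hard-coding the key in the script, you can load it from an environment variable. A minimal sketch; the variable name HASDATA_API_KEY is a convention chosen for this tutorial, not something the API requires:

```python
import os

# Read the API key from an environment variable so it never lands
# in version control; falls back to an empty string if unset.
API_KEY = os.environ.get("HASDATA_API_KEY", "")
if not API_KEY:
    print("Warning: HASDATA_API_KEY is not set.")
```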
Fetching Google News Data
We’ll use the API to fetch headlines from a specific topic. You can change the topicToken to fetch different sections like Technology, Business, or Sports.
import requests
import json
API_KEY = "HASDATA-API-KEY"  # Replace with your HasData API key
params_raw = {
"q": "",
"gl": "us",
"hl": "en",
"topicToken": "CAAqJggKIiBDQkFTRWdvSUwyMHZNREpxYW5RU0FtVnVHZ0pWVXlnQVAB", # Example: Entertainment
}
params = {k: v for k, v in params_raw.items() if v}
news_url = "https://api.hasdata.com/scrape/google/news"
news_headers = {"Content-Type": "application/json", "x-api-key": API_KEY}
resp = requests.get(news_url, params=params, headers=news_headers)
resp.raise_for_status()
data = resp.json()
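The response is JSON. Based on the fields this tutorial reads later, each entry under newsResults carries a nested highlight.title. Here is a small sketch of pulling titles out of a response-shaped dict; the sample data below is invented for illustration, and real responses will contain more fields per article:

```python
# Sample dict shaped like the fields this tutorial uses.
data = {
    "newsResults": [
        {"highlight": {"title": "Example headline one"}},
        {"highlight": {"title": "Example headline two"}},
    ]
}

# .get() with defaults keeps the code safe if a field is missing.
titles = [item.get("highlight", {}).get("title", "")
          for item in data.get("newsResults", [])]
print(titles)  # ['Example headline one', 'Example headline two']
```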
Processing Headlines
We’ll now extract the titles and filter out common stop words using NLTK's built-in list of stopwords.
from collections import Counter
import re
from nltk.corpus import stopwords
# Load stopwords from NLTK
stop_words = set(stopwords.words('english'))
titles = [item.get("highlight", {}).get("title", "") for item in data.get("newsResults", [])]
words = []
for title in titles:
for word in re.findall(r'\w+', title.lower()):
if word not in stop_words and len(word) > 2:
words.append(word)
counter = Counter(words)
most_common = counter.most_common(20)
Now we have a list of words that appear most frequently in the headlines, excluding common stop words.
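To see how the counting step behaves, here is a toy run of Counter.most_common on a pre-filtered word list (the words are made up for demonstration):

```python
from collections import Counter

# Toy input standing in for the filtered headline words.
words = ["ai", "market", "ai", "launch", "market", "ai"]
counter = Counter(words)
print(counter.most_common(2))  # [('ai', 3), ('market', 2)]
```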
Creating a Frequency Chart
Finally, we visualize the results using matplotlib.
import matplotlib.pyplot as plt
if not most_common:
print("No meaningful words.")
else:
labels, counts = zip(*most_common)
plt.figure(figsize=(12,6))
plt.bar(labels, counts, color='skyblue')
plt.xticks(rotation=45, ha='right')
plt.title("Top 20 meaningful words in news headlines")
plt.ylabel("Frequency")
plt.tight_layout()
plt.show()
You should see a clear bar chart showing the most common topics from the headlines.
Full Code
import requests
import json
from collections import Counter
import matplotlib.pyplot as plt
import re
import nltk
from nltk.corpus import stopwords
# Download stopwords if not already downloaded
nltk.download('stopwords')
API_KEY = "HASDATA-API-KEY"  # Replace with your HasData API key
# Parameters for Google News API request
params_raw = {
"q": "",
"gl": "us",
"hl": "en",
"topicToken": "CAAqJggKIiBDQkFTRWdvSUwyMHZNREpxYW5RU0FtVnVHZ0pWVXlnQVAB"
}
params = {k: v for k, v in params_raw.items() if v}
news_url = "https://api.hasdata.com/scrape/google/news"
news_headers = {"Content-Type": "application/json", "x-api-key": API_KEY}
# Fetch news data
resp = requests.get(news_url, params=params, headers=news_headers)
resp.raise_for_status()
data = resp.json()
# Extract titles
titles = [item.get("highlight", {}).get("title", "") for item in data.get("newsResults", [])]
# Load stopwords from NLTK
stop_words = set(stopwords.words('english'))
# Process words from titles
words = []
for title in titles:
for word in re.findall(r'\w+', title.lower()):
if word not in stop_words and len(word) > 2:
words.append(word)
# Count words
counter = Counter(words)
most_common = counter.most_common(20)
# Plot results
if not most_common:
print("No meaningful words.")
else:
labels, counts = zip(*most_common)
plt.figure(figsize=(12,6))
plt.bar(labels, counts, color='skyblue')
plt.xticks(rotation=45, ha='right')
plt.title("Top 20 meaningful words in news headlines")
plt.ylabel("Frequency")
plt.tight_layout()
plt.show()
Next Steps
- Expand the stop words list to filter more common words.
- Analyze key topics using bigrams or trigrams for richer insights.
- Combine multiple topic sections to see trends across industries.
- Automate periodic fetching to track trends over time.
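As a starting point for the bigram idea, adjacent word pairs can be counted with Counter and zip. This sketch uses a toy word list; in practice you would generate bigrams per headline rather than across the whole concatenated list, so pairs don't span headline boundaries:

```python
from collections import Counter

# Toy word list standing in for the filtered words of one headline stream.
words = ["stock", "market", "rally", "stock", "market", "dip"]

# zip(words, words[1:]) yields each adjacent pair of words.
bigrams = Counter(zip(words, words[1:]))
print(bigrams.most_common(1))  # [(('stock', 'market'), 2)]
```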
Further Reading
If you want to explore more advanced Google News scraping techniques, including RSS feeds, Google Search (tbm=nws), and topic-based scraping, check out our full blog post on HasData: Google News Scraping: RSS, SERP, and Topic Pages.
This article focuses on building a tool for visualizing topic frequencies, but you can combine it with the other methods to build robust pipelines and dashboards for news analysis.