This tutorial shows how to collect Google News data using HasData’s Google News API and visualize the most common topics or keywords in news headlines. We’ll process the data, remove stop words, and create a simple frequency chart with Python.
Table of Contents
- Introduction
- Setup
- Fetching Google News Data
- Processing Headlines
- Creating a Frequency Chart
- Full Code
- Next Steps
- Further Reading
Introduction
News data is rich, but raw headlines can be messy. Common words like “the”, “of”, and “in” dominate the text, making it hard to extract meaningful insights. In this guide, we’ll:
- Fetch news headlines via HasData’s Google News API.
- Extract the highlight.title field from each article.
- Count the frequency of meaningful words.
- Visualize the top keywords using matplotlib.
This approach is useful for tracking trending topics, analyzing industry coverage, or quickly summarizing news from a specific domain.
Setup
You will need:
- Python 3
- The requests, matplotlib, and nltk packages
- Standard library modules: json, collections.Counter, re
pip install requests matplotlib nltk
You also need to download the stopwords data from NLTK. You can do this by running the following in a Python session:
import nltk
nltk.download('stopwords')
Make sure you have a HasData API key. You can get one for free from your HasData dashboard.
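Rather than hard-coding the key in the script, you can load it from an environment variable. A minimal sketch; the variable name HASDATA_API_KEY is a convention chosen for this tutorial, not something the API requires:

```python
import os

# Read the API key from an environment variable so it never lands
# in version control; falls back to an empty string if unset.
API_KEY = os.environ.get("HASDATA_API_KEY", "")
if not API_KEY:
    print("Warning: HASDATA_API_KEY is not set.")
```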
Fetching Google News Data
We’ll use the API to fetch headlines from a specific topic. You can change the topicToken to fetch different sections like Technology, Business, or Sports.
import requests
import json
API_KEY = "HASDATA-API-KEY"  # Replace with your HasData API key
params_raw = {
"q": "",
"gl": "us",
"hl": "en",
"topicToken": "CAAqJggKIiBDQkFTRWdvSUwyMHZNREpxYW5RU0FtVnVHZ0pWVXlnQVAB", # Example: Entertainment
}
params = {k: v for k, v in params_raw.items() if v}
news_url = "https://api.hasdata.com/scrape/google/news"
news_headers = {"Content-Type": "application/json", "x-api-key": API_KEY}
resp = requests.get(news_url, params=params, headers=news_headers)
resp.raise_for_status()
data = resp.json()
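The response is JSON. Based on the fields this tutorial reads later, each entry under newsResults carries a nested highlight.title. Here is a small sketch of pulling titles out of a response-shaped dict; the sample data below is invented for illustration, and real responses will contain more fields per article:

```python
# Sample dict shaped like the fields this tutorial uses.
data = {
    "newsResults": [
        {"highlight": {"title": "Example headline one"}},
        {"highlight": {"title": "Example headline two"}},
    ]
}

# .get() with defaults keeps the code safe if a field is missing.
titles = [item.get("highlight", {}).get("title", "")
          for item in data.get("newsResults", [])]
print(titles)  # ['Example headline one', 'Example headline two']
```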
Processing Headlines
We’ll now extract the titles and filter out common stop words using NLTK's built-in list of stopwords.
from collections import Counter
import re
from nltk.corpus import stopwords
# Load stopwords from NLTK
stop_words = set(stopwords.words('english'))
titles = [item.get("highlight", {}).get("title", "") for item in data.get("newsResults", [])]
words = []
for title in titles:
for word in re.findall(r'\w+', title.lower()):
if word not in stop_words and len(word) > 2:
words.append(word)
counter = Counter(words)
most_common = counter.most_common(20)
Now we have a list of words that appear most frequently in the headlines, excluding common stop words.
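To see how the counting step behaves, here is a toy run of Counter.most_common on a pre-filtered word list (the words are made up for demonstration):

```python
from collections import Counter

# Toy input standing in for the filtered headline words.
words = ["ai", "market", "ai", "launch", "market", "ai"]
counter = Counter(words)
print(counter.most_common(2))  # [('ai', 3), ('market', 2)]
```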
Creating a Frequency Chart
Finally, we visualize the results using matplotlib.
import matplotlib.pyplot as plt
if not most_common:
print("No meaningful words.")
else:
labels, counts = zip(*most_common)
plt.figure(figsize=(12,6))
plt.bar(labels, counts, color='skyblue')
plt.xticks(rotation=45, ha='right')
plt.title("Top 20 meaningful words in news headlines")
plt.ylabel("Frequency")
plt.tight_layout()
plt.show()
You should see a clear bar chart showing the most common topics from the headlines.
Full Code
import requests
import json
from collections import Counter
import matplotlib.pyplot as plt
import re
import nltk
from nltk.corpus import stopwords
# Download stopwords if not already downloaded
nltk.download('stopwords')
API_KEY = "HASDATA-API-KEY"  # Replace with your HasData API key
# Parameters for Google News API request
params_raw = {
"q": "",
"gl": "us",
"hl": "en",
"topicToken": "CAAqJggKIiBDQkFTRWdvSUwyMHZNREpxYW5RU0FtVnVHZ0pWVXlnQVAB"
}
params = {k: v for k, v in params_raw.items() if v}
news_url = "https://api.hasdata.com/scrape/google/news"
news_headers = {"Content-Type": "application/json", "x-api-key": API_KEY}
# Fetch news data
resp = requests.get(news_url, params=params, headers=news_headers)
resp.raise_for_status()
data = resp.json()
# Extract titles
titles = [item.get("highlight", {}).get("title", "") for item in data.get("newsResults", [])]
# Load stopwords from NLTK
stop_words = set(stopwords.words('english'))
# Process words from titles
words = []
for title in titles:
for word in re.findall(r'\w+', title.lower()):
if word not in stop_words and len(word) > 2:
words.append(word)
# Count words
counter = Counter(words)
most_common = counter.most_common(20)
# Plot results
if not most_common:
print("No meaningful words.")
else:
labels, counts = zip(*most_common)
plt.figure(figsize=(12,6))
plt.bar(labels, counts, color='skyblue')
plt.xticks(rotation=45, ha='right')
plt.title("Top 20 meaningful words in news headlines")
plt.ylabel("Frequency")
plt.tight_layout()
plt.show()
Next Steps
- Expand the stop words list to filter more common words.
- Analyze key topics using bigrams or trigrams for richer insights.
- Combine multiple topic sections to see trends across industries.
- Automate periodic fetching to track trends over time.
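As a starting point for the bigram idea, adjacent word pairs can be counted with Counter and zip. This sketch uses a toy word list; in practice you would generate bigrams per headline rather than across the whole concatenated list, so pairs don't span headline boundaries:

```python
from collections import Counter

# Toy word list standing in for the filtered words of one headline stream.
words = ["stock", "market", "rally", "stock", "market", "dip"]

# zip(words, words[1:]) yields each adjacent pair of words.
bigrams = Counter(zip(words, words[1:]))
print(bigrams.most_common(1))  # [(('stock', 'market'), 2)]
```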
Further Reading
If you want to explore more advanced Google News scraping techniques, including RSS feeds, Google Search (tbm=nws), and topic-based scraping, check out our full blog post on HasData: Google News Scraping: RSS, SERP, and Topic Pages.
This article focuses on building a tool for visualizing topic frequencies, but you can combine it with the other methods to build robust pipelines and dashboards for news analysis.