agenthustler

Google News Scraping: Extract Latest News, Headlines and Media Sources

Web scraping Google News opens up a world of possibilities for media monitoring, trend analysis, competitive intelligence, and content aggregation. Whether you're building a news dashboard, tracking brand mentions, or analyzing media coverage patterns, extracting data from Google News programmatically gives you a significant advantage.

In this comprehensive guide, we'll explore Google News's structure, demonstrate practical scraping techniques with Python and Node.js, and show you how to scale your news extraction using Apify's cloud platform.

Understanding Google News Structure

Google News is not a single monolithic page — it's a complex ecosystem of interconnected sections, each serving different user intents. Understanding this structure is crucial before writing any scraping code.

The Main Feed

The Google News homepage (news.google.com) presents a personalized feed of top stories. Each story card typically contains:

  • Headline text — the article title as displayed by Google
  • Source name — the publication (e.g., Reuters, BBC, TechCrunch)
  • Publication timestamp — relative ("2 hours ago") or absolute dates
  • Thumbnail image — when available
  • Story cluster — related articles grouped under the main headline
  • Category label — Business, Technology, Science, etc.
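Relative timestamps like "2 hours ago" need normalizing before you can sort or filter by date. Here is a minimal sketch of such a helper (the function name and the set of supported units are my own assumptions; Google's display strings vary by locale, so treat this as a starting point):

```python
import re
from datetime import datetime, timedelta

# Map the unit words Google News commonly displays to timedelta keyword args.
_UNITS = {'minute': 'minutes', 'hour': 'hours', 'day': 'days', 'week': 'weeks'}

def parse_relative_time(text, now=None):
    """Convert strings like '2 hours ago' to an absolute datetime.

    Returns None when the text doesn't match the expected pattern.
    """
    now = now or datetime.now()
    match = re.match(r'(\d+)\s+(minute|hour|day|week)s?\s+ago', text.strip().lower())
    if not match:
        return None
    value, unit = int(match.group(1)), match.group(2)
    return now - timedelta(**{_UNITS[unit]: value})
```

Absolute dates can be passed through unchanged; only the relative forms need this conversion.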

Topic Pages

Google News organizes content into topic verticals:

  • /topics/BUSINESS — Business & Finance
  • /topics/TECHNOLOGY — Technology
  • /topics/SCIENCE — Science
  • /topics/HEALTH — Health
  • /topics/SPORTS — Sports
  • /topics/ENTERTAINMENT — Entertainment

Each topic page has its own sub-sections with featured stories, local coverage, and deep-dive articles.

Search Results

The search endpoint (news.google.com/search?q=...) returns news articles matching specific queries. This is often the most valuable entry point for targeted scraping because you can:

  • Search for specific company names or products
  • Filter by date range
  • Target specific languages and regions
  • Combine multiple search terms
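These filters translate into operators inside the `q` parameter. A hedged sketch of a URL builder follows; the `site:` and `when:` operators reflect commonly observed Google News search syntax rather than a documented contract, so verify them against current behavior:

```python
from urllib.parse import urlencode

def build_search_url(query, exact=None, site=None, when=None,
                     language='en', country='US'):
    """Assemble a Google News search URL from common query operators.

    `when` takes values like '1h', '1d', '7d'; `site` restricts to a domain;
    `exact` wraps a phrase in quotes for exact matching.
    """
    parts = [query]
    if exact:
        parts.append(f'"{exact}"')
    if site:
        parts.append(f'site:{site}')
    if when:
        parts.append(f'when:{when}')
    qs = urlencode({
        'q': ' '.join(parts),
        'hl': language,
        'gl': country,
        'ceid': f'{country}:{language}',
    })
    return f'https://news.google.com/search?{qs}'
```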

Source Pages

Individual publication pages (news.google.com/publications/...) list recent articles from a specific news source, allowing you to monitor particular outlets.

Setting Up Your Scraping Environment

Python Setup

First, install the required libraries:

pip install requests beautifulsoup4 feedparser apify-client

Node.js Setup

npm install axios cheerio rss-parser apify-client

Method 1: Google News RSS Feeds

Google News provides RSS feeds — one of the easiest and most reliable ways to extract news data. RSS feeds are publicly available and don't require rendering JavaScript.

Python RSS Scraping

import feedparser
import json
from datetime import datetime
from urllib.parse import quote_plus

def scrape_google_news_rss(query, language='en', country='US'):
    """
    Scrape Google News via RSS feed for a given search query.
    """
    # Build the RSS URL (URL-encode the query so spaces and operators survive)
    base_url = "https://news.google.com/rss/search"
    params = f"?q={quote_plus(query)}&hl={language}&gl={country}&ceid={country}:{language}"
    feed_url = base_url + params

    # Parse the feed
    feed = feedparser.parse(feed_url)

    articles = []
    for entry in feed.entries:
        article = {
            'title': entry.title,
            'link': entry.link,
            'source': entry.source.title if hasattr(entry, 'source') else 'Unknown',
            'published': entry.get('published', ''),
            # published_parsed can be missing or None, so guard before unpacking
            'published_parsed': (
                datetime(*entry.published_parsed[:6]).isoformat()
                if entry.get('published_parsed') else None
            ),
            'description': entry.get('summary', ''),
        }
        articles.append(article)

    return articles

# Example: Search for AI news
results = scrape_google_news_rss("artificial intelligence")
print(f"Found {len(results)} articles")

for article in results[:5]:
    print(f"\n--- {article['source']} ---")
    print(f"Title: {article['title']}")
    print(f"Date: {article['published']}")
    print(f"Link: {article['link']}")

Node.js RSS Scraping

const RSSParser = require('rss-parser');

async function scrapeGoogleNewsRSS(query, language = 'en', country = 'US') {
    const parser = new RSSParser({
        customFields: {
            item: [['source', 'source', { keepArray: false }]],
        },
    });

    const feedUrl = `https://news.google.com/rss/search?q=${encodeURIComponent(query)}&hl=${language}&gl=${country}&ceid=${country}:${language}`;

    const feed = await parser.parseURL(feedUrl);

    return feed.items.map(item => ({
        title: item.title,
        link: item.link,
        source: item.source?._ || item.source || 'Unknown',
        published: item.pubDate,
        description: item.contentSnippet || '',
    }));
}

// Example usage
(async () => {
    const articles = await scrapeGoogleNewsRSS('web scraping');
    console.log(`Found ${articles.length} articles`);

    articles.slice(0, 5).forEach(article => {
        console.log(`\n[${article.source}] ${article.title}`);
        console.log(`Published: ${article.published}`);
    });
})();

Method 2: Topic-Based Extraction

You can also access topic-specific RSS feeds without a search query:

TOPIC_FEEDS = {
    'world': 'https://news.google.com/rss/topics/CAAqJggKIiBDQkFTRWdvSUwyMHZNRGx1YlY4U0FtVnVHZ0pWVXlnQVAB',
    'business': 'https://news.google.com/rss/topics/CAAqJggKIiBDQkFTRWdvSUwyMHZNRGx6TVdZU0FtVnVHZ0pWVXlnQVAB',
    'technology': 'https://news.google.com/rss/topics/CAAqJggKIiBDQkFTRWdvSUwyMHZNRGRqTVhZU0FtVnVHZ0pWVXlnQVAB',
    'science': 'https://news.google.com/rss/topics/CAAqJggKIiBDQkFTRWdvSUwyMHZNRFp0Y1RjU0FtVnVHZ0pWVXlnQVAB',
    'health': 'https://news.google.com/rss/topics/CAAqIQgKIhtDQkFTRGdvSUwyMHZNR3QwTlRFU0FtVnVLQUFQAQ',
    'sports': 'https://news.google.com/rss/topics/CAAqJggKIiBDQkFTRWdvSUwyMHZNRFp1ZEdvU0FtVnVHZ0pWVXlnQVAB',
}

def scrape_topic_feed(topic):
    """Extract articles from a Google News topic feed."""
    if topic not in TOPIC_FEEDS:
        raise ValueError(f"Unknown topic: {topic}. Available: {list(TOPIC_FEEDS.keys())}")

    feed = feedparser.parse(TOPIC_FEEDS[topic])

    return [{
        'topic': topic,
        'title': entry.title,
        'source': entry.source.title if hasattr(entry, 'source') else 'Unknown',
        'link': entry.link,
        'published': entry.published,
    } for entry in feed.entries]

# Scrape multiple topics
all_articles = []
for topic in ['technology', 'business', 'science']:
    articles = scrape_topic_feed(topic)
    all_articles.extend(articles)
    print(f"{topic}: {len(articles)} articles")

Method 3: Full Page Scraping with Browser Automation

For extracting richer data — images, story clusters, full article counts — you need browser-based scraping. Google News renders most of its content via JavaScript, so a headless browser is essential.

from apify_client import ApifyClient

def scrape_google_news_full(query, max_articles=100):
    """
    Use Apify to scrape Google News with full browser rendering.
    """
    client = ApifyClient("YOUR_APIFY_TOKEN")

    run_input = {
        "query": query,
        "maxArticles": max_articles,
        "language": "en",
        "country": "US",
        "extractImages": True,
        "extractFullText": False,
    }

    run = client.actor("apify/google-news-scraper").call(run_input=run_input)

    articles = []
    for item in client.dataset(run["defaultDatasetId"]).iterate_items():
        articles.append(item)

    return articles

Filtering by Source and Date

One of the most powerful features of Google News scraping is the ability to filter results precisely:

def advanced_news_search(query, source=None, date_range=None, exact_phrase=None):
    """
    Build an advanced Google News search query with filters.
    """
    search_parts = []

    if exact_phrase:
        search_parts.append(f'"{exact_phrase}"')
    else:
        search_parts.append(query)

    if source:
        search_parts.append(f'source:"{source}"')

    if date_range:
        # date_range format: 'after:2025-01-01 before:2025-12-31'
        search_parts.append(date_range)

    full_query = ' '.join(search_parts)
    return scrape_google_news_rss(full_query)

# Examples
# Only from Reuters
reuters_articles = advanced_news_search("climate change", source="Reuters")

# Exact phrase with date filter
recent_ai = advanced_news_search(
    "artificial intelligence",
    exact_phrase="large language model",
    date_range="after:2026-01-01"
)

# Technology news from specific source
tech_news = advanced_news_search("AI startup funding", source="TechCrunch")

Extracting Trending Topics

Google News highlights trending stories that can be valuable for content strategy and market analysis:

const cheerio = require('cheerio');
const RSSParser = require('rss-parser');

async function extractTrendingTopics(country = 'US') {
    // Google Trends RSS as a proxy for trending news topics
    const trendUrl = `https://trends.google.com/trends/trendingsearches/daily/rss?geo=${country}`;

    const parser = new RSSParser({
        // Expose the namespaced traffic field, which rss-parser skips by default
        customFields: { item: ['ht:approx_traffic'] },
    });
    const feed = await parser.parseURL(trendUrl);

    return feed.items.map(item => ({
        title: item.title,
        traffic: item['ht:approx_traffic'] || 'N/A',
        description: item.contentSnippet,
        newsItems: item.content ? extractNewsFromContent(item.content) : [],
        pubDate: item.pubDate,
    }));
}

function extractNewsFromContent(htmlContent) {
    const $ = cheerio.load(htmlContent);
    const newsItems = [];

    $('a').each((i, el) => {
        newsItems.push({
            title: $(el).text(),
            url: $(el).attr('href'),
        });
    });

    return newsItems;
}

Building a News Monitoring Pipeline

Here's a complete pipeline that monitors news for multiple topics and stores results:

import feedparser
import json
import hashlib
from datetime import datetime
from pathlib import Path
from urllib.parse import quote_plus

class NewsMonitor:
    def __init__(self, output_dir='./news_data'):
        self.output_dir = Path(output_dir)
        self.output_dir.mkdir(exist_ok=True)
        self.seen_articles = self._load_seen()

    def _load_seen(self):
        seen_file = self.output_dir / 'seen_articles.json'
        if seen_file.exists():
            return set(json.loads(seen_file.read_text()))
        return set()

    def _save_seen(self):
        seen_file = self.output_dir / 'seen_articles.json'
        seen_file.write_text(json.dumps(list(self.seen_articles)))

    def _article_hash(self, title, source):
        return hashlib.md5(f"{title}:{source}".encode()).hexdigest()

    def monitor(self, queries):
        """Monitor multiple queries and return only new articles."""
        new_articles = []

        for query in queries:
            feed_url = f"https://news.google.com/rss/search?q={quote_plus(query)}&hl=en&gl=US&ceid=US:en"
            feed = feedparser.parse(feed_url)

            for entry in feed.entries:
                source = entry.source.title if hasattr(entry, 'source') else 'Unknown'
                article_id = self._article_hash(entry.title, source)

                if article_id not in self.seen_articles:
                    self.seen_articles.add(article_id)
                    article = {
                        'id': article_id,
                        'query': query,
                        'title': entry.title,
                        'source': source,
                        'link': entry.link,
                        'published': entry.published,
                        'scraped_at': datetime.now().isoformat(),
                    }
                    new_articles.append(article)

        self._save_seen()

        # Save new articles
        if new_articles:
            timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
            output_file = self.output_dir / f'news_{timestamp}.json'
            output_file.write_text(json.dumps(new_articles, indent=2))

        return new_articles

# Usage
monitor = NewsMonitor()
queries = [
    "web scraping tools 2026",
    "data extraction automation",
    "competitive intelligence software",
]
new_articles = monitor.monitor(queries)
print(f"Found {len(new_articles)} new articles across {len(queries)} queries")

Scaling with Apify

When you need to scrape Google News at scale — thousands of queries, multiple regions, continuous monitoring — Apify provides the infrastructure to handle it reliably:

from apify_client import ApifyClient

client = ApifyClient("YOUR_APIFY_TOKEN")

# Run a Google News scraping task
run_input = {
    "queries": [
        "artificial intelligence",
        "machine learning startups",
        "data science tools",
    ],
    "maxArticlesPerQuery": 50,
    "language": "en",
    "country": "US",
    "outputFormat": "json",
    "proxyConfiguration": {
        "useApifyProxy": True,
        "apifyProxyGroups": ["RESIDENTIAL"],
    },
}

# Start the run
run = client.actor("apify/google-news-scraper").call(run_input=run_input)

# Fetch results
dataset = client.dataset(run["defaultDatasetId"])
articles = list(dataset.iterate_items())

print(f"Scraped {len(articles)} articles total")

Apify handles proxy rotation, browser management, and retry logic automatically, which is critical when scraping Google properties at scale.

Data Processing and Analysis

Once you have the raw news data, here are common processing patterns:

from collections import Counter
from datetime import datetime, timedelta

def analyze_news_data(articles):
    """Analyze scraped news articles for insights."""

    # Source frequency
    sources = Counter(a['source'] for a in articles)
    print("\nTop 10 Sources:")
    for source, count in sources.most_common(10):
        print(f"  {source}: {count} articles")

    # Publication timeline
    dates = Counter()
    for a in articles:
        try:
            dt = datetime.fromisoformat(a['published_parsed'])
            dates[dt.date().isoformat()] += 1
        except (KeyError, TypeError, ValueError):
            pass

    print("\nArticles per day:")
    for date in sorted(dates.keys()):
        bar = '#' * dates[date]
        print(f"  {date}: {bar} ({dates[date]})")

    # Keyword extraction (simple frequency)
    words = Counter()
    stop_words = {'the', 'a', 'an', 'in', 'on', 'at', 'to', 'for', 'of', 'and', 'is', 'are'}
    for a in articles:
        for word in a['title'].lower().split():
            if word not in stop_words and len(word) > 3:
                words[word] += 1

    print("\nTop keywords:")
    for word, count in words.most_common(15):
        print(f"  {word}: {count}")

analyze_news_data(results)

Best Practices and Legal Considerations

When scraping Google News, keep these guidelines in mind:

  1. Respect rate limits — Don't hammer the endpoints. Use delays between requests or leverage Apify's built-in rate limiting.

  2. Use RSS feeds when possible — They're the most stable and least likely to break. Google provides them intentionally.

  3. Cache results — Don't re-scrape the same queries repeatedly within short time periods.

  4. Handle redirects — Google News links often redirect through Google's tracking URLs. Follow redirects to get the actual article URL.

  5. Check robots.txt — Always review the site's robots.txt for guidance on automated access policies.

  6. Store responsibly — News data can be large. Implement data retention policies and only keep what you need.

  7. Monitor for changes — Google frequently updates its News interface. Build in error handling and alerts for when your scraper breaks.
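Point 4 can be sketched with a small helper that follows redirects via `requests`. Note that Google sometimes serves an interstitial page instead of an HTTP redirect, in which case more involved decoding is needed; this sketch simply falls back to the original URL on failure:

```python
import requests

def resolve_article_url(url, timeout=10):
    """Follow HTTP redirects from a Google News link toward the publisher's page.

    Returns the final URL after redirects, or the original URL when the
    request fails.
    """
    try:
        response = requests.get(url, timeout=timeout, allow_redirects=True)
        return response.url
    except requests.RequestException:
        return url
```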

Conclusion

Google News scraping is a powerful technique for media monitoring, competitive intelligence, and content research. The RSS feed approach provides a reliable, low-maintenance entry point, while browser-based scraping with tools like Apify unlocks richer data extraction at scale.

Start with RSS feeds for simple monitoring use cases, then graduate to full page scraping when you need images, story clusters, and trending topic analysis. Whatever your approach, the combination of Python or Node.js with Apify's infrastructure gives you a robust pipeline for staying on top of the news cycle.

The key is starting simple, validating your data quality, and scaling incrementally as your needs grow.
