You want the latest news, fast and structured. No problem. Scraping Google News is one of the quickest ways to gather up-to-the-minute headlines, monitor emerging trends, and dive deep into sentiment analysis. In this post, I’ll walk you through how to scrape Google News using Python—no fluff, just actionable insights.
By the end of this tutorial, you'll know how to efficiently pull headlines and links from Google News, cleanly store them in JSON format, and even avoid blocks using proxies and headers.
Step 1: Python Environment Setup
First, make sure you have Python installed on your system. Then install the two key libraries: requests and lxml.
Run this in your terminal:
pip install requests
pip install lxml
These tools will handle HTTP requests and parse the HTML content of Google News, giving you the power to extract exactly what you need.
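If you want to sanity-check the installs before moving on, a quick import test confirms both libraries are available:
import requests
from lxml import etree

# If both versions print without an ImportError, the setup is complete
print(requests.__version__)
print(etree.__version__)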
Step 2: Get to Know Your Target URL and XPath
Now, we need to understand the structure of Google News’ webpage. Here's the URL of the page we’ll scrape:
url = "https://news.google.com/topics/CAAqKggKIiRDQkFTRlFvSUwyMHZNRGRqTVhZU0JXVnVMVWRDR2dKSlRpZ0FQAQ?hl=en-US&gl=US&ceid=US%3Aen"
This page displays multiple news articles with titles and links to related stories. To grab this information, we need to understand how the HTML is organized. Here's a simplified breakdown of the XPath structure:
Main News Container: //c-wiz[@jsrenderer="ARwRbe"]
Main News Title: //c-wiz[@jsrenderer="ARwRbe"]/c-wiz/div/article/a/text()
Main News Link: //c-wiz[@jsrenderer="ARwRbe"]/c-wiz/div/article/a/@href
Related News Container: //c-wiz[@jsrenderer="ARwRbe"]/c-wiz/div/div/article
Related News Title: //c-wiz[@jsrenderer="ARwRbe"]/c-wiz/div/div/article/a/text()
Related News Link: //c-wiz[@jsrenderer="ARwRbe"]/c-wiz/div/div/article/a/@href
Now we know where to look for the data we need.
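Before touching the live page, it can help to see these expressions in action on a toy snippet. Here's a minimal, self-contained demo; the markup below is a made-up stand-in that only mirrors the nesting described above, not Google's actual HTML:
from lxml import etree

# A made-up fragment that mirrors the container/article/anchor nesting above
snippet = '''
<c-wiz jsrenderer="ARwRbe">
  <c-wiz><div><article><a href="./read/abc123">Example headline</a></article></div></c-wiz>
</c-wiz>
'''

tree = etree.fromstring(snippet)
print(tree.xpath('//c-wiz[@jsrenderer="ARwRbe"]/c-wiz/div/article/a/text()'))
# ['Example headline']
print(tree.xpath('//c-wiz[@jsrenderer="ARwRbe"]/c-wiz/div/article/a/@href'))
# ['./read/abc123']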
Step 3: Fetch Google News Content
We’ll fetch the page content using requests. Here’s the code to do that:
import requests

url = "https://news.google.com/topics/CAAqKggKIiRDQkFTRlFvSUwyMHZNRGRqTVhZU0JXVnVMVWRDR2dKSlRpZ0FQAQ?hl=en-US&gl=US&ceid=US%3Aen"

response = requests.get(url)

if response.status_code == 200:
    page_content = response.content
else:
    print(f"Failed to retrieve the page. Status code: {response.status_code}")
This sends a GET request to the Google News URL and stores the HTML content. If something goes wrong (like a 404 or 500 error), we’ll know.
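For a slightly more defensive version of this fetch, the sketch below adds a timeout and uses raise_for_status() so any HTTP error surfaces as a catchable exception:
import requests

try:
    # timeout prevents the request from hanging forever
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
    page_content = response.content
except requests.RequestException as exc:
    print(f"Request failed: {exc}")
    raise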
Step 4: Analyze the HTML with lxml
Once we have the raw HTML, we need to parse it to make sense of the structure. That’s where lxml comes in. Here’s how to parse the page:
from lxml import html
# Parse the HTML content
parser = html.fromstring(page_content)
This command turns the raw HTML into an object we can query using XPath.
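A quick smoke test at this point can save debugging time later. If the count below comes back as 0, the page markup (or the jsrenderer attribute) has likely changed and the XPath expressions need updating:
# Verify the parser can actually find the containers we expect
containers = parser.xpath('//c-wiz[@jsrenderer="ARwRbe"]')
print(f"Found {len(containers)} main news containers")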
Step 5: Extract News Data
Now, we get to the fun part: extracting headlines and links. We’ll first extract the main news headlines, then dive into the related articles. Here’s how:
# Extract main news articles
main_news_elements = parser.xpath('//c-wiz[@jsrenderer="ARwRbe"]')

news_data = []
for element in main_news_elements[:10]:  # Grab the first 10 main headlines
    titles = element.xpath('.//c-wiz/div/article/a/text()')
    links = element.xpath('.//c-wiz/div/article/a/@href')

    # Ensure data exists before indexing and appending to the list
    if titles and links:
        news_data.append({
            "main_title": titles[0],
            "main_link": "https://news.google.com" + links[0][1:],
        })
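A quick way to verify the extraction worked is to print a few entries:
# Peek at the first few headlines we captured
for item in news_data[:3]:
    print(item["main_title"])
    print(item["main_link"])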
With this, we’ve pulled the main headlines and links from Google News. But we’re not done yet—let's go deeper.
Step 6: Extract Related Articles
For each main headline, there are often related articles. Let’s pull those too.
# Extract related articles within each main article
for i, element in enumerate(main_news_elements[:10]):
    related_articles = []
    related_news_elements = element.xpath('.//c-wiz/div/div/article')

    for related_element in related_news_elements:
        related_titles = related_element.xpath('.//a/text()')
        related_links = related_element.xpath('.//a/@href')
        if related_titles and related_links:
            related_articles.append({
                "title": related_titles[0],
                "link": "https://news.google.com" + related_links[0][1:],
            })

    # Attach the related articles to the matching main article
    # (using news_data[-1] here would pin everything to the last story)
    if i < len(news_data):
        news_data[i]["related_articles"] = related_articles
Now, each main news article in our news_data list includes related articles, giving us a more comprehensive set of data.
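To see the nested structure, print the first story and its cluster:
# Inspect the first main story and its related coverage
if news_data:
    first_story = news_data[0]
    print(first_story["main_title"])
    for related in first_story.get("related_articles", []):
        print("  ->", related["title"])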
Step 7: Export Your Data as JSON
All this data needs to be stored somewhere. We’ll save it to a JSON file so you can come back to it later for trend tracking or sentiment analysis.
import json

# Save the extracted data to a JSON file
with open('google_news_data.json', 'w') as f:
    json.dump(news_data, f, indent=4)
Now you have a file named google_news_data.json filled with news headlines and links, ready for further analysis.
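When you’re ready to analyze, loading the file back is just as simple:
import json

# Reload the saved file to confirm it round-trips cleanly
with open('google_news_data.json') as f:
    saved_data = json.load(f)

print(f"Loaded {len(saved_data)} main stories from disk")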
Additional Tips: Working with Proxies and Custom Headers
Working with Proxies
If you're scraping a lot of data, sites like Google News might block you. To avoid that, use proxies. Here’s how:
proxies = {
    "http": "http://your_proxy_ip:port",
    "https": "https://your_proxy_ip:port",
}

response = requests.get(url, proxies=proxies)
By routing your requests through different IPs, you can scrape more efficiently without being blocked.
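If you have several proxies available, rotating through them spreads requests across IPs. Here’s a minimal rotation sketch; the addresses below are placeholders, so swap in the endpoints from your own proxy provider:
import random
import requests

# Placeholder endpoints: replace with your proxy provider's addresses
proxy_pool = [
    "http://proxy1_ip:port",
    "http://proxy2_ip:port",
    "http://proxy3_ip:port",
]

# Pick a different proxy for each request
proxy = random.choice(proxy_pool)
response = requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=30)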
Customizing Headers
Websites often block requests that look like they're from bots. To avoid detection, you can add custom headers:
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.82 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
}

response = requests.get(url, headers=headers)
These headers make your requests look like they’re coming from an actual browser.
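If you’re making multiple requests, a requests.Session lets you set the headers once and reuses the underlying connection:
import requests

session = requests.Session()
session.headers.update(headers)  # applied to every request made through the session

# Subsequent calls reuse the connection and carry the browser-like headers
response = session.get(url, timeout=30)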
Complete Code Sample
Here’s everything wrapped up in one script:
import requests
from lxml import html
import json

url = "https://news.google.com/topics/CAAqKggKIiRDQkFTRlFvSUwyMHZNRGRqTVhZU0JXVnVMVWRDR2dKSlRpZ0FQAQ?hl=en-US&gl=US&ceid=US%3Aen"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.82 Safari/537.36'}

# Replace with real endpoints from your proxy provider,
# or drop the proxies argument below to connect directly
proxies = {"http": "http://your_proxy_ip:port", "https": "https://your_proxy_ip:port"}

response = requests.get(url, headers=headers, proxies=proxies)
if response.status_code == 200:
    page_content = response.content
else:
    print(f"Failed to retrieve the page. Status code: {response.status_code}")
    exit()

# Parse HTML
parser = html.fromstring(page_content)

# Extract news
main_news_elements = parser.xpath('//c-wiz[@jsrenderer="ARwRbe"]')

news_data = []
for element in main_news_elements[:10]:
    titles = element.xpath('.//c-wiz/div/article/a/text()')
    links = element.xpath('.//c-wiz/div/article/a/@href')
    if not (titles and links):
        continue  # skip containers missing a headline or link

    # Extract related articles
    related_articles = []
    related_news_elements = element.xpath('.//c-wiz/div/div/article')
    for related_element in related_news_elements:
        related_titles = related_element.xpath('.//a/text()')
        related_links = related_element.xpath('.//a/@href')
        if related_titles and related_links:
            related_articles.append({
                "title": related_titles[0],
                "link": "https://news.google.com" + related_links[0][1:],
            })

    news_data.append({
        "main_title": titles[0],
        "main_link": "https://news.google.com" + links[0][1:],
        "related_articles": related_articles,
    })

# Save to JSON
with open("google_news_data.json", "w") as json_file:
    json.dump(news_data, json_file, indent=4)

print("Data extraction complete. Saved to google_news_data.json")
Conclusion
Scraping Google News with Python is an efficient way to gather real-time news data. Whether you’re tracking trends, analyzing sentiment, or just curious about the latest headlines, this method provides a solid foundation. Use proxies and custom headers to avoid being blocked, and save your data in JSON format for easy access.