You want the latest news, fast and structured. No problem. Scraping Google News is one of the quickest ways to gather up-to-the-minute headlines, monitor emerging trends, and dive deep into sentiment analysis. In this post, I’ll walk you through how to scrape Google News using Python—no fluff, just actionable insights.
By the end of this tutorial, you'll know how to efficiently pull headlines and links from Google News, cleanly store them in JSON format, and even avoid blocks using proxies and headers.
Step 1: Python Environment Setup
First, make sure you have Python installed on your system. Then install the two key libraries: requests and lxml.
Run this in your terminal:
pip install requests
pip install lxml
These tools will handle HTTP requests and parse the HTML content of Google News, giving you the power to extract exactly what you need.
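If you want to sanity-check the installs before moving on, a quick import test confirms both libraries are available:
import requests
from lxml import etree

# If both versions print without an ImportError, the setup is complete
print(requests.__version__)
print(etree.__version__)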
Step 2: Get to Know Your Target URL and XPath
Now, we need to understand the structure of Google News’ webpage. Here's the URL of the page we’ll scrape:
url = "https://news.google.com/topics/CAAqKggKIiRDQkFTRlFvSUwyMHZNRGRqTVhZU0JXVnVMVWRDR2dKSlRpZ0FQAQ?hl=en-US&gl=US&ceid=US%3Aen"
This page displays multiple news articles with titles and links to related stories. To grab this information, we need to understand how the HTML is organized. Here's a simplified breakdown of the XPath structure:
Main News Container: //c-wiz[@jsrenderer="ARwRbe"]
Main News Title: //c-wiz[@jsrenderer="ARwRbe"]/c-wiz/div/article/a/text()
Main News Link: //c-wiz[@jsrenderer="ARwRbe"]/c-wiz/div/article/a/@href
Related News Container: //c-wiz[@jsrenderer="ARwRbe"]/c-wiz/div/div/article
Related News Title: //c-wiz[@jsrenderer="ARwRbe"]/c-wiz/div/div/article/a/text()
Related News Link: //c-wiz[@jsrenderer="ARwRbe"]/c-wiz/div/div/article/a/@href
Now we know where to look for the data we need.
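Before touching the live page, it can help to see these expressions in action on a toy snippet. Here's a minimal, self-contained demo; the markup below is a made-up stand-in that only mirrors the nesting described above, not Google's actual HTML:
from lxml import etree

# A made-up fragment that mirrors the container/article/anchor nesting above
snippet = '''
<c-wiz jsrenderer="ARwRbe">
  <c-wiz><div><article><a href="./read/abc123">Example headline</a></article></div></c-wiz>
</c-wiz>
'''

tree = etree.fromstring(snippet)
print(tree.xpath('//c-wiz[@jsrenderer="ARwRbe"]/c-wiz/div/article/a/text()'))
# ['Example headline']
print(tree.xpath('//c-wiz[@jsrenderer="ARwRbe"]/c-wiz/div/article/a/@href'))
# ['./read/abc123']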
Step 3: Fetch Google News Content
We’ll fetch the page content using requests. Here’s the code to do that:
import requests

url = "https://news.google.com/topics/CAAqKggKIiRDQkFTRlFvSUwyMHZNRGRqTVhZU0JXVnVMVWRDR2dKSlRpZ0FQAQ?hl=en-US&gl=US&ceid=US%3Aen"

response = requests.get(url)

if response.status_code == 200:
    page_content = response.content
else:
    print(f"Failed to retrieve the page. Status code: {response.status_code}")
This sends a GET request to the Google News URL and stores the HTML content. If something goes wrong (like a 404 or 500 error), we’ll know.
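For a slightly more defensive version of this fetch, the sketch below adds a timeout and uses raise_for_status() so any HTTP error surfaces as a catchable exception:
import requests

try:
    # timeout prevents the request from hanging forever
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
    page_content = response.content
except requests.RequestException as exc:
    print(f"Request failed: {exc}")
    raise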
Step 4: Analyze the HTML with lxml
Once we have the raw HTML, we need to parse it to make sense of the structure. That’s where lxml comes in. Here’s how to parse the page:
from lxml import html
# Parse the HTML content
parser = html.fromstring(page_content)
This command turns the raw HTML into an object we can query using XPath.
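A quick smoke test at this point can save debugging time later. If the count below comes back as 0, the page markup (or the jsrenderer attribute) has likely changed and the XPath expressions need updating:
# Verify the parser can actually find the containers we expect
containers = parser.xpath('//c-wiz[@jsrenderer="ARwRbe"]')
print(f"Found {len(containers)} main news containers")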
Step 5: Extract News Data
Now, we get to the fun part: extracting headlines and links. We’ll first extract the main news headlines, then dive into the related articles. Here’s how:
# Extract main news articles
main_news_elements = parser.xpath('//c-wiz[@jsrenderer="ARwRbe"]')

news_data = []
for element in main_news_elements[:10]:  # Grab the first 10 main headlines
    titles = element.xpath('.//c-wiz/div/article/a/text()')
    links = element.xpath('.//c-wiz/div/article/a/@href')

    # Ensure data exists before indexing and appending to the list
    if titles and links:
        news_data.append({
            "main_title": titles[0],
            "main_link": "https://news.google.com" + links[0][1:],
        })
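A quick way to verify the extraction worked is to print a few entries:
# Peek at the first few headlines we captured
for item in news_data[:3]:
    print(item["main_title"])
    print(item["main_link"])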
With this, we’ve pulled the main headlines and links from Google News. But we’re not done yet—let's go deeper.
Step 6: Extract Related Articles
For each main headline, there are often related articles. Let’s pull those too.
# Extract related articles within each main article
for i, element in enumerate(main_news_elements[:10]):
    related_articles = []
    related_news_elements = element.xpath('.//c-wiz/div/div/article')

    for related_element in related_news_elements:
        related_titles = related_element.xpath('.//a/text()')
        related_links = related_element.xpath('.//a/@href')
        if related_titles and related_links:
            related_articles.append({
                "title": related_titles[0],
                "link": "https://news.google.com" + related_links[0][1:],
            })

    # Attach the related articles to the matching main article
    # (using news_data[-1] here would pin everything to the last story)
    if i < len(news_data):
        news_data[i]["related_articles"] = related_articles
Now, each main news article in our news_data list includes related articles, giving us a more comprehensive set of data.
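To see the nested structure, print the first story and its cluster:
# Inspect the first main story and its related coverage
if news_data:
    first_story = news_data[0]
    print(first_story["main_title"])
    for related in first_story.get("related_articles", []):
        print("  ->", related["title"])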
Step 7: Export Your Data as JSON
All this data needs to be stored somewhere. We’ll save it to a JSON file so you can come back to it later for trend tracking or sentiment analysis.
import json

# Save the extracted data to a JSON file
with open('google_news_data.json', 'w') as f:
    json.dump(news_data, f, indent=4)
Now you have a file named google_news_data.json filled with news headlines and links, ready for further analysis.
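When you’re ready to analyze, loading the file back is just as simple:
import json

# Reload the saved file to confirm it round-trips cleanly
with open('google_news_data.json') as f:
    saved_data = json.load(f)

print(f"Loaded {len(saved_data)} main stories from disk")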
Additional Tips: Working with Proxies and Custom Headers
Working with Proxies
If you're scraping a lot of data, sites like Google News might block you. To avoid that, use proxies. Here’s how:
proxies = {
    "http": "http://your_proxy_ip:port",
    "https": "https://your_proxy_ip:port",
}

response = requests.get(url, proxies=proxies)
By routing your requests through different IPs, you can scrape more efficiently without being blocked.
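If you have several proxies available, rotating through them spreads requests across IPs. Here’s a minimal rotation sketch; the addresses below are placeholders, so swap in the endpoints from your own proxy provider:
import random
import requests

# Placeholder endpoints: replace with your proxy provider's addresses
proxy_pool = [
    "http://proxy1_ip:port",
    "http://proxy2_ip:port",
    "http://proxy3_ip:port",
]

# Pick a different proxy for each request
proxy = random.choice(proxy_pool)
response = requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=30)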
Customizing Headers
Websites often block requests that look like they're from bots. To avoid detection, you can add custom headers:
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.82 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
}

response = requests.get(url, headers=headers)
These headers make your requests look like they’re coming from an actual browser.
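If you’re making multiple requests, a requests.Session lets you set the headers once and reuses the underlying connection:
import requests

session = requests.Session()
session.headers.update(headers)  # applied to every request made through the session

# Subsequent calls reuse the connection and carry the browser-like headers
response = session.get(url, timeout=30)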
Complete Code Sample
Here’s everything wrapped up in one script:
import requests
from lxml import html
import json

url = "https://news.google.com/topics/CAAqKggKIiRDQkFTRlFvSUwyMHZNRGRqTVhZU0JXVnVMVWRDR2dKSlRpZ0FQAQ?hl=en-US&gl=US&ceid=US%3Aen"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.82 Safari/537.36'}

# Replace with real endpoints from your proxy provider,
# or drop the proxies argument below to connect directly
proxies = {"http": "http://your_proxy_ip:port", "https": "https://your_proxy_ip:port"}

response = requests.get(url, headers=headers, proxies=proxies)
if response.status_code == 200:
    page_content = response.content
else:
    print(f"Failed to retrieve the page. Status code: {response.status_code}")
    exit()

# Parse HTML
parser = html.fromstring(page_content)

# Extract news
main_news_elements = parser.xpath('//c-wiz[@jsrenderer="ARwRbe"]')

news_data = []
for element in main_news_elements[:10]:
    titles = element.xpath('.//c-wiz/div/article/a/text()')
    links = element.xpath('.//c-wiz/div/article/a/@href')
    if not (titles and links):
        continue  # skip containers missing a headline or link

    # Extract related articles
    related_articles = []
    related_news_elements = element.xpath('.//c-wiz/div/div/article')
    for related_element in related_news_elements:
        related_titles = related_element.xpath('.//a/text()')
        related_links = related_element.xpath('.//a/@href')
        if related_titles and related_links:
            related_articles.append({
                "title": related_titles[0],
                "link": "https://news.google.com" + related_links[0][1:],
            })

    news_data.append({
        "main_title": titles[0],
        "main_link": "https://news.google.com" + links[0][1:],
        "related_articles": related_articles,
    })

# Save to JSON
with open("google_news_data.json", "w") as json_file:
    json.dump(news_data, json_file, indent=4)

print("Data extraction complete. Saved to google_news_data.json")
Conclusion
Scraping Google News with Python is an efficient way to gather real-time news data. Whether you’re tracking trends, analyzing sentiment, or just curious about the latest headlines, this method provides a solid foundation. Use proxies and custom headers to avoid being blocked, and save your data in JSON format for easy access.