Scraping NHK News Web Easy with Python
If you are learning Japanese, NHK News Web Easy is a great resource. Want to extract real-time news from it using Python? In this tutorial, we'll use Selenium + BeautifulSoup to scrape the latest articles, save them in a structured format, and export them as a Word document.
By the end of this tutorial, you will learn how to:
- Fetch and parse dynamic web pages using Selenium
- Extract news titles, links, and content with BeautifulSoup
- Export data into a structured Word document
- Avoid getting blocked and optimize your scraper
Demo: What We Are Building
Before we dive into the code, here's what our scraper will do:
- Visit NHK News Web Easy
- Extract the latest news headlines & links
- Scrape the full article content
- Save the data into a structured Word file
Step 1: Install Required Packages
We will use the following Python libraries:
- requests: To fetch webpage content
- selenium: To handle dynamic JavaScript-loaded content
- bs4 (BeautifulSoup): To parse HTML
- python-docx (imported as docx): To save news articles into a Word document
Install them with pip:
pip install requests selenium bs4 python-docx webdriver-manager
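Before moving on, a quick sanity check confirms everything installed correctly; if any package is missing, the imports below raise an ImportError:
# Minimal sanity check: each import fails loudly if its package is missing
import requests
import selenium
import bs4
import docx
from webdriver_manager.chrome import ChromeDriverManager
print("All packages imported successfully")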
Step 2: Fetching the NHK News Web Easy Homepage
Since NHK News Web Easy loads content dynamically using JavaScript, we need Selenium to handle the page rendering.
Here's how to launch a headless Chrome browser and fetch the homepage:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager
import time
# Configure Selenium WebDriver
chrome_options = Options()
chrome_options.add_argument("--headless") # Run in background
chrome_options.add_argument("--disable-gpu")
chrome_options.add_argument("--no-sandbox")
# Launch the WebDriver
service = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=service, options=chrome_options)
# Visit NHK News Web Easy
NHK_URL = "https://www3.nhk.or.jp/news/easy/"
driver.get(NHK_URL)
# Wait for JavaScript to load content
time.sleep(5)
# Get the HTML source code
html = driver.page_source
driver.quit()
print(html[:500]) # Display first 500 characters of the HTML
What this does:
- Starts a headless Chrome browser to load JavaScript content
- Fetches the entire rendered webpage (including dynamically loaded articles)
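A fixed time.sleep(5) works, but it is either slower than necessary or too short on a slow connection. If you prefer an explicit wait, here is a minimal sketch using Selenium's WebDriverWait; it assumes the same article.news-list__item selector we rely on in the next step:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
# Wait up to 10 seconds for at least one news item to appear, instead of sleeping blindly
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "article.news-list__item"))
)
html = driver.page_source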
Step 3: Extracting News Links
Once we have the full page source, we use BeautifulSoup to extract all news articles.
from bs4 import BeautifulSoup
# Parse HTML with BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
# Extract news article links
articles = soup.select("article.news-list__item a")
news_links = ["https://www3.nhk.or.jp" + a["href"] for a in articles]
print(f"Found {len(news_links)} news articles.")
print(news_links[:5]) # Preview first 5 links
This code:
- Finds all article links on the homepage
- Extracts absolute URLs for further scraping
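Prepending "https://www3.nhk.or.jp" assumes every href is a site-relative path. If some links turn out to be absolute already, a safer sketch uses urllib.parse.urljoin, which leaves absolute URLs untouched:
from urllib.parse import urljoin
# Resolve each href against the page URL; absolute hrefs pass through unchanged
news_links = [urljoin(NHK_URL, a["href"]) for a in articles]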
Step 4: Scraping Full Article Content
Now, let's fetch each article and extract its title and content.
import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
}

for news_url in news_links[:5]:  # Limit to 5 articles
    response = requests.get(news_url, headers=headers)
    response.encoding = "utf-8"
    news_soup = BeautifulSoup(response.text, "html.parser")

    # Guard against missing elements so one odd page doesn't crash the loop
    title_tag = news_soup.find("h1", class_="article-title")
    title = title_tag.text.strip() if title_tag else "No title"

    content_blocks = news_soup.find("div", class_="article-body")
    content = "\n".join(p.text.strip() for p in content_blocks.find_all("p")) if content_blocks else "No content available."

    print(f"{title}\n{content[:200]}...\n")
This script:
- Fetches each news page
- Extracts the title and article body
- Prints a preview of the first 200 characters
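Requesting many pages back to back is the quickest way to get blocked. A minimal sketch of a more polite loop adds a short pause and basic error handling (the 1-second delay and 10-second timeout are arbitrary choices, not NHK requirements):
import time

for news_url in news_links[:5]:
    try:
        response = requests.get(news_url, headers=headers, timeout=10)
        response.raise_for_status()
    except requests.RequestException as exc:
        print(f"Skipping {news_url}: {exc}")
        continue
    # ... parse the article exactly as above ...
    time.sleep(1)  # Pause between requests to reduce the risk of being blocked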
Step 5: Saving News to Word
Finally, let's store the scraped articles in a structured Word document.
from docx import Document

doc = Document()
doc.add_heading("NHK News Web Easy Articles", level=1)

for news_url in news_links[:5]:  # Limit to 5 articles
    response = requests.get(news_url, headers=headers)
    response.encoding = "utf-8"
    news_soup = BeautifulSoup(response.text, "html.parser")

    title_tag = news_soup.find("h1", class_="article-title")
    title = title_tag.text.strip() if title_tag else "No title"

    content_blocks = news_soup.find("div", class_="article-body")
    content_jp = "\n".join(p.text.strip() for p in content_blocks.find_all("p")) if content_blocks else "No content."

    doc.add_heading(title, level=2)
    doc.add_paragraph(f"Japanese Text:\n{content_jp}")
    doc.add_paragraph("-" * 50)

doc.save("NHK_News.docx")
print("NHK_News.docx saved successfully!")
Now you have a fully automated news scraper!
Conclusion
- Selenium fetches dynamically loaded content
- BeautifulSoup extracts articles
- python-docx saves content in Word format
This is the final code. You can change the slice limit in for news_url in news_links[:20]: to any other number to adjust how many news articles you want to scrape.
import requests
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager
import time
from bs4 import BeautifulSoup
from docx import Document
# Fake browser request to prevent NHK from blocking
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
}
# Launch Selenium
chrome_options = Options()
chrome_options.add_argument("--headless") # Run in the background
chrome_options.add_argument("--disable-gpu")
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument("start-maximized")
chrome_options.add_argument("disable-infobars")
chrome_options.add_argument("--disable-extensions")
service = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=service, options=chrome_options)
# Access NHK Easy News
NHK_URL = "https://www3.nhk.or.jp/news/easy/"
driver.get(NHK_URL)
# Wait 5 seconds for JavaScript to load
time.sleep(5)
# Get the page HTML
html = driver.page_source
driver.quit() # Close the browser
# Parse the HTML
soup = BeautifulSoup(html, "html.parser")
# Extract news links
articles = soup.select("article.news-list__item a")
news_links = ["https://www3.nhk.or.jp" + a["href"] for a in articles]
print(f"Retrieved {len(news_links)} news articles")
if not news_links:
    print("No news found, please check the HTML structure!")
    exit()
# Create a Word document
doc = Document()
doc.add_heading("NHK News Web Easy Article Collection", level=1)
for news_url in news_links[:20]:  # Scrape only the first 20 news articles
    print(f"Fetching: {news_url}")

    # Retrieve the news page
    response = requests.get(news_url, headers=headers)
    response.encoding = "utf-8"  # Ensure UTF-8 decoding
    news_soup = BeautifulSoup(response.text, "html.parser")

    # Get news title
    title_tag = news_soup.find("h1", class_="article-title")
    title = title_tag.text.strip() if title_tag else "No Title"

    # Get news content
    content_blocks = news_soup.find("div", class_="article-body")
    if content_blocks:
        content_jp = "\n".join(p.text.strip() for p in content_blocks.find_all("p"))
    else:
        content_jp = "No Content"

    print(f"Successfully retrieved: {title}")

    # Write to Word document
    doc.add_heading(title, level=2)
    doc.add_paragraph(f"Original Japanese Text:\n{content_jp}")
    doc.add_paragraph("-" * 50)
# Save Word file
doc.save("NHK_News.docx")
print("โ
Article collection completed, saved as NHK_News.docx")