Scraping NHK News Web Easy with Python
If you are learning Japanese, NHK News Web Easy is a great resource. Want to extract real-time news from it using Python? In this tutorial, we'll use Selenium + BeautifulSoup to scrape the latest articles, save them in a structured format, and export them as a Word document.
By the end of this tutorial, you will learn how to:
- Fetch and parse dynamic web pages using Selenium
- Extract news titles, links, and content with BeautifulSoup
- Export data into a structured Word document
- Avoid getting blocked and optimize your scraper
Demo: What We Are Building
Before we dive into the code, here's what our scraper will do:
- Visit NHK News Web Easy
- Extract the latest news headlines & links
- Scrape the full article content
- Save the data into a structured Word file
Step 1: Install Required Packages
We will use the following Python libraries:
- requests: To fetch webpage content
- selenium: To handle dynamic JavaScript-loaded content
- bs4 (BeautifulSoup): To parse HTML
- python-docx (imported as docx): To save news articles into a Word document
Install them with pip:
pip install requests selenium bs4 python-docx webdriver-manager
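Before moving on, a quick sanity check confirms everything installed correctly; if any package is missing, the imports below raise an ImportError:
# Minimal sanity check: each import fails loudly if its package is missing
import requests
import selenium
import bs4
import docx
from webdriver_manager.chrome import ChromeDriverManager
print("All packages imported successfully")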
Step 2: Fetching the NHK News Web Easy Homepage
Since NHK News Web Easy loads content dynamically using JavaScript, we need Selenium to handle the page rendering.
Here's how to launch a headless Chrome browser and fetch the homepage:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager
import time
# Configure Selenium WebDriver
chrome_options = Options()
chrome_options.add_argument("--headless") # Run in background
chrome_options.add_argument("--disable-gpu")
chrome_options.add_argument("--no-sandbox")
# Launch the WebDriver
service = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=service, options=chrome_options)
# Visit NHK News Web Easy
NHK_URL = "https://www3.nhk.or.jp/news/easy/"
driver.get(NHK_URL)
# Wait for JavaScript to load content
time.sleep(5)
# Get the HTML source code
html = driver.page_source
driver.quit()
print(html[:500]) # Display first 500 characters of the HTML
What this does:
- Starts a headless Chrome browser to load JavaScript content
- Fetches the entire rendered webpage (including dynamically loaded articles)
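A fixed time.sleep(5) works, but it is either slower than necessary or too short on a slow connection. If you prefer an explicit wait, here is a minimal sketch using Selenium's WebDriverWait; it assumes the same article.news-list__item selector we rely on in the next step:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
# Wait up to 10 seconds for at least one news item to appear, instead of sleeping blindly
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "article.news-list__item"))
)
html = driver.page_source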
Step 3: Extracting News Links
Once we have the full page source, we use BeautifulSoup to extract all news articles.
from bs4 import BeautifulSoup
# Parse HTML with BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
# Extract news article links
articles = soup.select("article.news-list__item a")
news_links = ["https://www3.nhk.or.jp" + a["href"] for a in articles]
print(f"Found {len(news_links)} news articles.")
print(news_links[:5]) # Preview first 5 links
This code:
- Finds all article links on the homepage
- Extracts absolute URLs for further scraping
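Prepending "https://www3.nhk.or.jp" assumes every href is a site-relative path. If some links turn out to be absolute already, a safer sketch uses urllib.parse.urljoin, which leaves absolute URLs untouched:
from urllib.parse import urljoin
# Resolve each href against the page URL; absolute hrefs pass through unchanged
news_links = [urljoin(NHK_URL, a["href"]) for a in articles]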
Step 4: Scraping Full Article Content
Now, let's fetch each article and extract its title and content.
import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
}

for news_url in news_links[:5]:  # Limit to 5 articles
    response = requests.get(news_url, headers=headers)
    response.encoding = "utf-8"
    news_soup = BeautifulSoup(response.text, "html.parser")

    # Guard against missing elements so one odd page doesn't crash the loop
    title_tag = news_soup.find("h1", class_="article-title")
    title = title_tag.text.strip() if title_tag else "No title"

    content_blocks = news_soup.find("div", class_="article-body")
    content = "\n".join(p.text.strip() for p in content_blocks.find_all("p")) if content_blocks else "No content available."

    print(f"{title}\n{content[:200]}...\n")
This script:
- Fetches each news page
- Extracts the title and article body
- Prints a preview of the first 200 characters
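Requesting many pages back to back is the quickest way to get blocked. A minimal sketch of a more polite loop adds a short pause and basic error handling (the 1-second delay and 10-second timeout are arbitrary choices, not NHK requirements):
import time

for news_url in news_links[:5]:
    try:
        response = requests.get(news_url, headers=headers, timeout=10)
        response.raise_for_status()
    except requests.RequestException as exc:
        print(f"Skipping {news_url}: {exc}")
        continue
    # ... parse the article exactly as above ...
    time.sleep(1)  # Pause between requests to reduce the risk of being blocked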
Step 5: Saving News to Word
Finally, let's store the scraped articles in a structured Word document.
from docx import Document

doc = Document()
doc.add_heading("NHK News Web Easy Articles", level=1)

for news_url in news_links[:5]:  # Limit to 5 articles
    response = requests.get(news_url, headers=headers)
    response.encoding = "utf-8"
    news_soup = BeautifulSoup(response.text, "html.parser")

    title_tag = news_soup.find("h1", class_="article-title")
    title = title_tag.text.strip() if title_tag else "No title"

    content_blocks = news_soup.find("div", class_="article-body")
    content_jp = "\n".join(p.text.strip() for p in content_blocks.find_all("p")) if content_blocks else "No content."

    doc.add_heading(title, level=2)
    doc.add_paragraph(f"Japanese Text:\n{content_jp}")
    doc.add_paragraph("-" * 50)

doc.save("NHK_News.docx")
print("NHK_News.docx saved successfully!")
Now you have a fully automated news scraper!
Conclusion
- Selenium fetches dynamically loaded content
- BeautifulSoup extracts articles
- python-docx saves content in Word format
This is the final code. You can change the slice limit in for news_url in news_links[:20]: to any other number to adjust how many news articles you want to scrape.
import requests
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager
import time
from bs4 import BeautifulSoup
from docx import Document
# Fake browser request to prevent NHK from blocking
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
}
# Launch Selenium
chrome_options = Options()
chrome_options.add_argument("--headless") # Run in the background
chrome_options.add_argument("--disable-gpu")
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument("start-maximized")
chrome_options.add_argument("disable-infobars")
chrome_options.add_argument("--disable-extensions")
service = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=service, options=chrome_options)
# Access NHK Easy News
NHK_URL = "https://www3.nhk.or.jp/news/easy/"
driver.get(NHK_URL)
# Wait 5 seconds for JavaScript to load
time.sleep(5)
# Get the page HTML
html = driver.page_source
driver.quit() # Close the browser
# Parse the HTML
soup = BeautifulSoup(html, "html.parser")
# Extract news links
articles = soup.select("article.news-list__item a")
news_links = ["https://www3.nhk.or.jp" + a["href"] for a in articles]
print(f"Retrieved {len(news_links)} news articles")
if not news_links:
    print("No news found, please check the HTML structure!")
    exit()
# Create a Word document
doc = Document()
doc.add_heading("NHK News Web Easy Article Collection", level=1)
for news_url in news_links[:20]:  # Scrape only the first 20 news articles
    print(f"Fetching: {news_url}")

    # Retrieve the news page
    response = requests.get(news_url, headers=headers)
    response.encoding = "utf-8"  # Ensure UTF-8 decoding
    news_soup = BeautifulSoup(response.text, "html.parser")

    # Get news title
    title_tag = news_soup.find("h1", class_="article-title")
    title = title_tag.text.strip() if title_tag else "No Title"

    # Get news content
    content_blocks = news_soup.find("div", class_="article-body")
    if content_blocks:
        content_jp = "\n".join(p.text.strip() for p in content_blocks.find_all("p"))
    else:
        content_jp = "No Content"

    print(f"Successfully retrieved: {title}")

    # Write to Word document
    doc.add_heading(title, level=2)
    doc.add_paragraph(f"Original Japanese Text:\n{content_jp}")
    doc.add_paragraph("-" * 50)
# Save Word file
doc.save("NHK_News.docx")
print("โ
Article collection completed, saved as NHK_News.docx")