Scraping NHK News Web Easy with Python: A Step-by-Step Guide

๐Ÿš€ Scraping NHK News Web Easy with Python

If you're learning Japanese, NHK News Web Easy is a fantastic resource: real news written in simplified Japanese. Want to extract the latest news from NHK News Web Easy using Python? In this tutorial, we'll use Selenium + BeautifulSoup to scrape the latest articles, save them in a structured format, and even export them as a Word document. 📝

๐Ÿ“Œ By the end of this tutorial, you will learn how to:

  • Fetch and parse dynamic web pages using Selenium
  • Extract news titles, links, and content with BeautifulSoup
  • Export data into a structured Word document
  • Avoid getting blocked and optimize your scraper ๐Ÿš€

๐ŸŒ Demo: What We Are Building

Before we dive into the code, hereโ€™s what our scraper will do:

  1. Visit NHK News Web Easy ๐Ÿ”
  2. Extract the latest news headlines & links ๐Ÿ“„
  3. Scrape the full article content ๐Ÿ“ฐ
  4. Save the data into a structured Word file ๐Ÿ“

๐Ÿ’ก Hereโ€™s an example of the output file:

๐Ÿ› ๏ธ Step 1: Install Required Packages

We will use the following Python libraries:

  • requests: To fetch webpage content
  • selenium: To handle dynamic JavaScript-loaded content
  • bs4 (BeautifulSoup): To parse HTML
  • python-docx: To save news articles into a Word document (imported as docx)

๐Ÿ‘‰ Install them using (Mac):

pip install requests selenium bs4 python-docx webdriver-manager
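💡 A quick sanity check before moving on: each package should import cleanly from a Python shell. Note that python-docx is imported as docx.

# Quick sanity check: every import should succeed without errors.
import requests
import selenium
import bs4
import docx  # installed as python-docx
import webdriver_manager

print("All packages imported successfully!")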

**๐Ÿ› ๏ธ Step 2:Fetching the NHK News Web Easy Homepage

Since NHK News Web Easy loads content dynamically using JavaScript, we need Selenium to handle the page rendering.

Hereโ€™s how to launch a headless Chrome browser and fetch the homepage:


from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager
import time

# Configure Selenium WebDriver
chrome_options = Options()
chrome_options.add_argument("--headless")  # Run in background
chrome_options.add_argument("--disable-gpu")
chrome_options.add_argument("--no-sandbox")

# Launch the WebDriver
service = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=service, options=chrome_options)

# Visit NHK News Web Easy
NHK_URL = "https://www3.nhk.or.jp/news/easy/"
driver.get(NHK_URL)

# Wait for JavaScript to load content
time.sleep(5)

# Get the HTML source code
html = driver.page_source
driver.quit()

print(html[:500])  # Display first 500 characters of the HTML

โœ… What this does:
โ€ข Starts a headless Chrome browser to load JavaScript content
โ€ข Fetches the entire rendered webpage (including dynamically loaded articles)

๐Ÿ” Step 3: Extracting News Links

Once we have the full page source, we use BeautifulSoup to extract all news articles.

from bs4 import BeautifulSoup

# Parse HTML with BeautifulSoup
soup = BeautifulSoup(html, "html.parser")

# Extract news article links
articles = soup.select("article.news-list__item a")
news_links = ["https://www3.nhk.or.jp" + a["href"] for a in articles]

print(f"Found {len(news_links)} news articles.")
print(news_links[:5])  # Preview first 5 links

โœ… This code:
โ€ข Finds all article links on the homepage
โ€ข Extracts absolute URLs for further scraping

๐Ÿ“ฐ Step 4: Scraping Full Article Content

Now, letโ€™s fetch each article and extract title + content.

import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
}

for news_url in news_links[:5]:  # Limit to 5 articles
    response = requests.get(news_url, headers=headers)
    response.encoding = "utf-8"

    news_soup = BeautifulSoup(response.text, "html.parser")
    title_tag = news_soup.find("h1", class_="article-title")
    title = title_tag.text.strip() if title_tag else "No title"
    content_blocks = news_soup.find("div", class_="article-body")

    content = "\n".join([p.text.strip() for p in content_blocks.find_all("p")]) if content_blocks else "No content available."

    print(f"๐Ÿ“Œ {title}\n{content[:200]}...\n")

๐Ÿš€ Conclusion

โœ… This script:
โ€ข Fetches each news page
โ€ข Extracts the title and article body
โ€ข Prints a preview of the first 200 characters

๐Ÿ“‚ Step 5: Saving News to Word

Finally, letโ€™s store the scraped articles in a structured Word document.

from docx import Document

doc = Document()
doc.add_heading("NHK News Web Easy Articles", level=1)

for news_url in news_links[:5]:  # Limit to 5 articles
    response = requests.get(news_url, headers=headers)
    response.encoding = "utf-8"
    news_soup = BeautifulSoup(response.text, "html.parser")

    title_tag = news_soup.find("h1", class_="article-title")
    title = title_tag.text.strip() if title_tag else "No title"
    content_blocks = news_soup.find("div", class_="article-body")
    content_jp = "\n".join([p.text.strip() for p in content_blocks.find_all("p")]) if content_blocks else "No content."

    doc.add_heading(title, level=2)
    doc.add_paragraph(f"Japanese Text:\n{content_jp}")
    doc.add_paragraph("-" * 50)

doc.save("NHK_News.docx")
print("โœ… NHK_News.docx saved successfully!")
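💡 Optional: if the Japanese text renders with an awkward fallback font in Word, you can set an East Asian font on the Normal style. python-docx doesn't expose this directly, so the sketch below touches the underlying XML ("MS Mincho" is just an example font name):

from docx.oxml.ns import qn

style = doc.styles["Normal"]
style.font.name = "MS Mincho"
# python-docx only sets the Latin font name; the East Asian font
# has to be set on the underlying XML element directly.
style.element.rPr.rFonts.set(qn("w:eastAsia"), "MS Mincho")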

๐ŸŽฏ Now you have a fully automated news scraper!

๐Ÿš€ Conclusion

โ€ข ๐Ÿ“ก Selenium fetches dynamically loaded content
โ€ข ๐Ÿ” BeautifulSoup extracts articles
โ€ข ๐Ÿ“ python-docx saves content in Word format

โ€œThis is the final code. You can change the number 5 in for news_url in news_links[:5]: to any other number to adjust the number of news articles you want to generate.โ€

import requests
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager
import time
from bs4 import BeautifulSoup
from docx import Document

# Fake browser request to prevent NHK from blocking
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
}

# Launch Selenium
chrome_options = Options()
chrome_options.add_argument("--headless")  # Run in the background
chrome_options.add_argument("--disable-gpu")
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument("start-maximized")
chrome_options.add_argument("disable-infobars")
chrome_options.add_argument("--disable-extensions")
service = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=service, options=chrome_options)

# Access NHK Easy News
NHK_URL = "https://www3.nhk.or.jp/news/easy/"
driver.get(NHK_URL)

# Wait 5 seconds for JavaScript to load
time.sleep(5)

# Get the page HTML
html = driver.page_source
driver.quit()  # Close the browser

# Parse the HTML
soup = BeautifulSoup(html, "html.parser")

# Extract news links
articles = soup.select("article.news-list__item a")
news_links = ["https://www3.nhk.or.jp" + a["href"] for a in articles]

print(f"Retrieved {len(news_links)} news articles")
if not news_links:
    print("โŒ No news found, please check the HTML structure!")
    exit()

# Create a Word document
doc = Document()
doc.add_heading("NHK News Web Easy Article Collection", level=1)

for news_url in news_links[:20]:  # Scrape only the first 20 news articles
    print(f"Fetching: {news_url}")

    # Retrieve the news page
    response = requests.get(news_url, headers=headers)
    response.encoding = "utf-8"  # Ensure UTF-8 decoding
    news_soup = BeautifulSoup(response.text, "html.parser")

    # Get news title (updated class)
    title_tag = news_soup.find("h1", class_="article-title")
    title = title_tag.text.strip() if title_tag else "No Title"

    # Get news content (updated class)
    content_blocks = news_soup.find("div", class_="article-body")
    if content_blocks:
        content_jp = "\n".join([p.text.strip() for p in content_blocks.find_all("p")])
    else:
        content_jp = "No Content"

    print(f"โœ… Successfully retrieved: {title}")

    # Write to Word document
    doc.add_heading(title, level=2)
    doc.add_paragraph(f"Original Japanese Text:\n{content_jp}")
    doc.add_paragraph("-" * 50)

# Save Word file
doc.save("NHK_News.docx")
print("โœ… Article collection completed, saved as NHK_News.docx")
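💡 One last courtesy: the script fires 20 requests back-to-back. If you raise the limit, consider a short pause between requests so you don't hammer NHK's servers. A small tweak to the loop, sketched here with a 1-second delay and a request timeout:

for news_url in news_links[:20]:
    print(f"Fetching: {news_url}")
    # timeout prevents the script from hanging on a slow response
    response = requests.get(news_url, headers=headers, timeout=10)
    response.encoding = "utf-8"
    # ... parse and write to the document as above ...
    time.sleep(1)  # brief pause between requests to stay polite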
