Day 77 Of 100DaysOfCode: Scraping News Of Gorkha Patra Online

Durga Pokharel ・ 2 min read

Today is my 77th day of my #100daysofcode and #python learning journey. As usual, I spent some hours learning about pandas data visualization on DataCamp.

For the rest of the time, I kept working on my first project (news scraping). Today I scraped news from Gorkha Patra Online. I was able to scrape news from a few different pages. I need to write different code for each news field, such as national, economics, business, and province, so scraping a single news portal takes a lot of time. Below is the code I used to scrape news from the national field.
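Since each field lives under its own URL, the per-field pages can be listed up front before writing a dedicated parser for each. A minimal sketch — the slugs other than national are my guesses and may not match the site's actual paths:

```python
# Hypothetical category slugs -- only "national" is confirmed by this post;
# the others are assumptions and may differ on the real site.
BASE = "https://gorkhapatraonline.com"
categories = ["national", "economy", "business", "province"]

# Build one listing-page URL per category.
urls = [f"{BASE}/{c}" for c in categories]
print(urls[0])  # https://gorkhapatraonline.com/national
```

Each of these URLs would then get its own parsing loop, since the HTML structure differs between fields.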

Python code with BeautifulSoup

First, I import the required dependencies:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re
from bs4 import BeautifulSoup as BS
import requests
import urllib3

urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
# Connection pool used for all requests below
http = urllib3.PoolManager()

The URL of the required field is given below. I fetch it and parse the page into a soup object:

url = "https://gorkhapatraonline.com/national"
web_page = http.request('GET', url)
soup = BS(web_page.data, 'html5lib')

Parse News: Title, Author, Date and Content

ndict = {'Title': [], 'URL': [], 'Date': [], 'Author': [],
         'Author URL': [], 'Content': [], 'Category': [], 'Description': []}


for content in soup.select(".business"):
  newsurl=content.find('a').get('href')
  trend2 = content.select_one(".trending2")
  title = trend2.find("p").text 
  title = title.strip()

  author = trend2.find('small').text
  author = author.strip()
  author = author.split('\xa0\xa0\xa0\xa0\n')[0]
  # author
  date = trend2.find('small').text
  date = date.strip()
  date = date.split('\xa0\xa0\xa0\xa0\n')[1]
  date=date.strip()
  description = trend2.select_one(".description").text.strip()

  # now go to this news URL
  web_page = http.request('GET', newsurl, headers={'User-Agent': 'Mozilla/61.0'})
  news_soup = BS(web_page.data, 'html5lib')
  author_url = news_soup.select_one(".post-author-name").find("a").get("href")
  news_content=""
  for p in news_soup.select_one(".newstext").findAll("p"):
    news_content+="\n"+p.text
  category = url.split("/")[-1]
  ndict["Title"].append(title)
  ndict["URL"].append(newsurl)
  ndict["Date"].append(date)
  ndict["Author"].append(author)
  ndict["Author URL"].append(author_url)
  ndict["Content"].append(news_content)
  ndict["Category"].append(category)
  ndict["Description"].append(description)
  print(f"""
          Title: {title}, URL: {newsurl}
          Date: {date}, Author: {author},
          Category: {category},
          Author URL: {author_url},
          Description: {description},
          Content: {news_content}
            """)
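The trickiest part above is splitting the `<small>` tag's text into author and date, and the end goal is a table of articles. Both steps can be checked in isolation with pandas (already imported). A minimal sketch — the sample strings below are made up for illustration and only mirror the scraper's logic:

```python
import pandas as pd

# The <small> tag holds author and date separated by non-breaking spaces
# and a newline; this mirrors the split used in the scraping loop.
raw = "Durga Sharma\xa0\xa0\xa0\xa0\n 2021-03-01"
author = raw.split('\xa0\xa0\xa0\xa0\n')[0].strip()
date = raw.split('\xa0\xa0\xa0\xa0\n')[1].strip()

# One sample row shaped like ndict (values here are placeholders),
# turned into a DataFrame and saved for later analysis.
ndict = {'Title': ['Sample headline'], 'URL': ['https://gorkhapatraonline.com/national/1'],
         'Date': [date], 'Author': [author],
         'Author URL': ['https://gorkhapatraonline.com/author/1'],
         'Content': ['Body text'], 'Category': ['national'],
         'Description': ['Summary']}
df = pd.DataFrame(ndict)
df.to_csv('gorkhapatra_national.csv', index=False)
```

Saving to CSV after each field keeps partial progress even if a later page fails to scrape.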
