
Durga Pokharel

Day 77 Of 100DaysOfCode: Scraping News Of Gorkha Patra Online

Today is day 77 of my #100daysofcode and #python learning journey. As usual, I spent a few hours learning about pandas data visualization on DataCamp.

For the rest of the time, I kept working on my first project (news scraping). Today I scraped news from Gorkha Patra Online. I could scrape news from a few different pages. I need to write different code for different news sections such as national, economics, business, and province, so scraping even a single news portal takes a lot of time. Below is the code I used to scrape news from the national section.
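Since every section lives at its own URL, the category name can be derived from the URL instead of being typed by hand. Here is a minimal sketch of that idea. Only the `national` path appears in this post; the other section slugs and the helper name `category_from_url` are assumptions for illustration and may not match the real site.

```python
from urllib.parse import urlparse

BASE_URL = "https://gorkhapatraonline.com"

# Only "national" is confirmed by this post; the other slugs are assumed
# and may not match the real paths on the site.
SECTIONS = ["national", "economics", "business", "province"]


def category_from_url(url):
    """Return the last path segment of a section URL, e.g. 'national'."""
    path = urlparse(url).path.strip("/")
    return path.split("/")[-1]


section_urls = [f"{BASE_URL}/{slug}" for slug in SECTIONS]
for section_url in section_urls:
    print(section_url, "->", category_from_url(section_url))
```

Using `urlparse` instead of a plain `split("/")` keeps the result correct even when the URL ends with a trailing slash.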

Python code with BeautifulSoup

Here I import the dependencies.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re
from bs4 import BeautifulSoup as BS
import requests
import urllib3

urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

The URL of the required section is given below.

url = "https://gorkhapatraonline.com/national"

Parse Title, Author, Date and Content of the News

# One list per field; every news item appends one value to each list
ndict = {"Title": [], "URL": [], "Date": [], "Author": [],
         "Author URL": [], "Content": [], "Category": [], "Description": []}

# The post does not show how `http` and `soup` were created, so this
# setup is an assumption: a urllib3 PoolManager plus the html5lib parser.
http = urllib3.PoolManager(cert_reqs="CERT_NONE",
                           headers={"User-Agent": "Mozilla/61.0"})
web_page = http.request("GET", url)
soup = BS(web_page.data, "html5lib")
category = url.split("/")[-1]

for content in soup.select(".business"):
    newsurl = content.find("a").get("href")
    trend2 = content.select_one(".trending2")
    title = trend2.find("p").text.strip()

    # <small> holds the author and the date, separated by four
    # non-breaking spaces and a newline
    meta = trend2.find("small").text.strip().split("\xa0\xa0\xa0\xa0\n")
    author = meta[0].strip()
    date = meta[1].strip()
    description = trend2.select_one(".description").text.strip()

    # now go to this news URL and read the full article
    news_page = http.request("GET", newsurl)
    news_soup = BS(news_page.data, "html5lib")
    author_url = news_soup.select_one(".post-author-name").find("a").get("href")
    news_content = ""
    for p in news_soup.select_one(".newstext").find_all("p"):
        news_content += "\n" + p.text

    # store every field so that all lists stay the same length
    ndict["Title"].append(title)
    ndict["URL"].append(newsurl)
    ndict["Date"].append(date)
    ndict["Author"].append(author)
    ndict["Author URL"].append(author_url)
    ndict["Content"].append(news_content)
    ndict["Category"].append(category)
    ndict["Description"].append(description)

    print(f"""
          Title: {title}, URL: {newsurl}
          Date: {date}, Author: {author},
          Category: {category},
          Author URL: {author_url},
          Description: {description},
          Content: {news_content}
          """)
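The author and the date come from the same `<small>` tag, so the code above splits its text on four non-breaking spaces followed by a newline. If the site ever changes that spacing, indexing `[1]` raises an `IndexError`. A slightly more forgiving version might look like the sketch below; the helper name `split_author_date` and the sample string are made up for illustration and are not taken from the site.

```python
import re


def split_author_date(small_text):
    """Split the text of the <small> tag into (author, date)."""
    # Split once on any run of non-breaking spaces followed by a newline.
    parts = re.split(r"\xa0+\s*\n", small_text.strip(), maxsplit=1)
    author = parts[0].strip()
    # Fall back to an empty date when the separator is missing.
    date = parts[1].strip() if len(parts) > 1 else ""
    return author, date


# Made-up sample text in the shape the scraper expects
sample = "  By A Staff Reporter\xa0\xa0\xa0\xa0\n  16 March 2021  "
print(split_author_date(sample))
```

Returning an empty date instead of crashing lets the loop keep going when one news item has an unexpected layout.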

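If every field is appended to `ndict`, so that all of its lists have the same length, the dictionary converts directly into a pandas DataFrame, which is easier to inspect or save. Below is a minimal sketch with placeholder rows; the values are invented to show the shape and are not scraped news.

```python
import pandas as pd

# Placeholder data in the same shape as ndict; not real scraped news.
sample_dict = {
    "Title": ["Sample title 1", "Sample title 2"],
    "URL": ["https://example.com/news/1", "https://example.com/news/2"],
    "Date": ["16 March 2021", "16 March 2021"],
    "Author": ["Author A", "Author B"],
    "Author URL": ["https://example.com/author/a", "https://example.com/author/b"],
    "Content": ["First article text", "Second article text"],
    "Category": ["national", "national"],
    "Description": ["Short summary 1", "Short summary 2"],
}

# Each key becomes a column and each list position becomes a row.
df = pd.DataFrame(sample_dict)
print(df.shape)

# With no path given, to_csv returns the CSV text instead of writing a file.
csv_text = df.to_csv(index=False)
print(csv_text.splitlines()[0])
```

Passing a file name to `to_csv` would write the same data to disk instead of returning it as a string.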
Day 77 Of #100DaysOfCode and #Python
Worked On My First Project (Scrapping news of gorkhapatraonline using beautifulSoup)#WomenWhoCode #CodeNewbie #100DaysOfCode #DEVCommunity pic.twitter.com/T2JZyl2XqF

— Durga Pokharel (@mathdurga) March 16, 2021
