DEV Community


Day 75 of 100DaysOfCode: Scraping Different Categories of News from a News Portal Using BeautifulSoup

Durga Pokharel ・2 min read

Today is my 75th day of #100Daysofcode and #python learning. I also spent some time on DataCamp learning about pandas and the basics of the R programming language.

I spent most of my time on my first project. Yesterday I scraped news from only the news category. Today I was able to scrape news from several categories, such as sports, business, and world, and I used pandas to visualize the data obtained from them. I am doing my project using Google Drive. Coming from a non-technical field, saving CSV files to Drive was new to me, and I learned how to do it. To do so, we first need to mount our Google Drive.
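As a rough illustration of the Drive step, the sketch below mounts Google Drive and writes a DataFrame to a CSV file there. It assumes the notebook runs in Google Colab, where `google.colab.drive.mount` is available. The helper `save_news_csv`, the folder `news_data`, and the file name `news.csv` are hypothetical names chosen for this example, not taken from the project.

```python
import os

import pandas as pd


def save_news_csv(df, folder, filename="news.csv"):
    """Write df to folder/filename as CSV and return the full path."""
    os.makedirs(folder, exist_ok=True)  # create the folder if it is missing
    path = os.path.join(folder, filename)
    df.to_csv(path, index=False)  # index=False leaves out the row numbers
    return path


try:
    # google.colab exists only inside a Colab notebook.
    from google.colab import drive

    drive.mount("/content/drive")  # asks for permission the first time
    output_dir = "/content/drive/MyDrive/news_data"  # hypothetical folder
except ImportError:
    # Assumption: outside Colab, fall back to a local folder.
    output_dir = "news_data"

# Usage, once a DataFrame such as df exists:
# save_news_csv(df, output_dir)
```

Keeping the save step in one small function means the same call works whether the target folder is on Drive or on the local disk.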
Below is my updated code for today.

Scraping Different Categories of News

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re
from bs4 import BeautifulSoup as BS
import requests
import urllib3

urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
categories={"news":"https://ekantipur.com/news/",
            "business":"https://ekantipur.com/business/",
            "world":"https://ekantipur.com/world/",
            "sports":"https://ekantipur.com/sports"}
# Send a browser-like User-Agent with every request made through this pool.
http = urllib3.PoolManager(headers={'User-Agent': 'Mozilla/61.0'})

news_values=[]
ndict = {'Title': [], "URL": [], "Date":[],
      "Author":[], "Author URL":[], "Content":[],"Category": []}
show=False
for category, url in categories.items():
  web_page = http.request('GET', url)
  soup = BS(web_page.data, 'html5lib')

  # Each headline on a category page is an <h2> that wraps a link.
  for title in soup.find_all("h2"):
    if title.a:
      title_link = title.a.get("href")
      # Turn a relative link into an absolute one using the site root.
      if title_link.split(":")[0] != "https":
        title_link = url.split(f"/{category}")[0] + title_link
      title_text = title.text

      # Open the article page to collect its details.
      news_page = http.request('GET', title_link)
      news_soup = BS(news_page.data, 'html5lib')

      date = news_soup.find("time").text
      author_url = news_soup.select_one(".author").a.get("href")
      author_name = news_soup.select_one(".author").text

      # Look only at the first ".row" block: take the first paragraph of
      # its first child tag that contains an element with class "normal".
      content = ""
      for row in news_soup.select(".row"):
        for child in row.find_all(recursive=False):
          if child.select(".normal"):
            content = child.p.text
            break
        break

      ndict["Title"].append(title_text)
      ndict["URL"].append(title_link)
      ndict["Date"].append(date)
      ndict["Author"].append(author_name)
      ndict["Author URL"].append(author_url)
      ndict["Content"].append(content)
      ndict["Category"].append(category)
      if show:
        print(f"""
                Title: {title_text}, URL: {title_link}
                Date: {date}, Author: {author_name}, Category: {category},
                Author URL: {author_url},
                Content: {content}
                        """)
    # news_values.append()
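
The article-page lookups in the loop assume that every page has a `time` tag and an `.author` element. If one is missing, `find` or `select_one` returns `None` and the attribute access fails. A possible guard is sketched here with a hypothetical helper `text_or_default` and a made-up HTML snippet; it uses the built-in `html.parser` only to keep the example small.

```python
from bs4 import BeautifulSoup


def text_or_default(soup, selector, default=""):
    """Return the stripped text of the first match, or default if none."""
    node = soup.select_one(selector)
    return node.get_text(strip=True) if node is not None else default


# Made-up HTML standing in for an article page.
html = (
    '<div><time>2021-03-10</time>'
    '<span class="author"><a href="/author/x">A. Writer</a></span></div>'
)
sample = BeautifulSoup(html, "html.parser")

date = text_or_default(sample, "time")
author = text_or_default(sample, ".author")
# No element has class "normal" here, so the default is returned.
summary = text_or_default(sample, ".normal", default="(no content found)")
```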

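The scraping loop rebuilds relative links by splitting strings. One alternative, shown here as a small sketch rather than the code used in the project, is `urljoin` from Python's standard `urllib.parse` module, which handles both relative and absolute links. The helper name `absolute_link` and the example article paths are made up for illustration.

```python
from urllib.parse import urljoin


def absolute_link(page_url, href):
    """Return href as an absolute URL, resolved against page_url."""
    # urljoin leaves absolute links unchanged and resolves relative ones.
    return urljoin(page_url, href)


# Hypothetical article paths, used only to show the behaviour.
relative = absolute_link("https://ekantipur.com/sports", "/sports/example.html")
unchanged = absolute_link(
    "https://ekantipur.com/news/", "https://ekantipur.com/news/example.html"
)
print(relative)
print(unchanged)
```
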
The DataFrame built from the scraped data in ndict is:

df = pd.DataFrame(ndict, columns=list(ndict.keys()))
df
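
To get a quick overview of what was collected per category, one option is to count rows by the `Category` column. This is a sketch on a tiny made-up frame with the same column name, not the scraped data; `sample_df` and `articles_per_category` are hypothetical names.

```python
import pandas as pd


def articles_per_category(df):
    """Count how many rows each value of the Category column has."""
    return df["Category"].value_counts()


# Made-up rows standing in for scraped articles.
sample_df = pd.DataFrame({
    "Title": ["t1", "t2", "t3"],
    "Category": ["news", "sports", "news"],
})

counts = articles_per_category(sample_df)
print(counts)
# counts.plot(kind="bar") would draw the same numbers as a bar chart.
```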
