DEV Community

Areahints

Posted on

Need help with python

How would you achieve the following logic using Python?

  • Take a search query, for example, "why do I like dogs?"
  • Open a browser, navigate to DuckDuckGo (or something else), and search for my query.
  • Save the HTML of the search results page.
  • Open each URL on the results page (first page only) in a new tab.
  • Save the HTML of each opened URL.

Top comments (2)

rhymes

Why do you need to open the pages in the browser? Wouldn't it be easier to just download the HTML?

  • open the url https://duckduckgo.com/?q=dogs with requests
  • save the HTML
  • parse it with html.parser from the standard library
  • download all the links

This is the simplest version I can think of. There are other ways to scrape pages and links.

If you truly need to "drive" the browser instead, you probably want to look into something like pyppeteer, which drives a headless Chrome/Chromium.
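The steps above can be sketched with just the standard library (urllib in place of the requests package, and html.parser for the link extraction). The function names and the `duckduckgo.com/html/` endpoint in the usage note are my own illustrative choices, not something from this thread:

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class LinkExtractor(HTMLParser):
    """Collect the href attribute of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value can be None
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Return every href found in an HTML string."""
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links


def fetch(url):
    """Download a page; a browser-like User-Agent helps avoid blocked responses."""
    req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(req) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

For example: save `fetch("https://duckduckgo.com/html/?q=dogs")` to disk, run `extract_links` on it, then `fetch` each result link in turn. (The `/html/` endpoint tends to serve static, JavaScript-free results, which is easier to scrape than DuckDuckGo's default page, though that behaviour is not guaranteed.)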

Areahints

@rhymes

This is what I've tried to do:

import re
import logging

from urllib.request import Request, urlopen

# Global variables
user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7'
headers = {'User-Agent': user_agent}
url_google = 'https://www.google.com/search?&q='
url_duck = 'https://duckduckgo.com/?q='

# Get the user's search query and replace spaces/punctuation with '+'
query = input('What are you searching for?: ')
query = re.sub(r'[ ?.!/;:]', '+', query)

# Use the user's choice to build the request URL
choice = int(input('Select Search Engine, Google = 1, Duckduckgo = 2: '))
search_url = (url_google if choice == 1 else url_duck) + query

def set_custom_log_info(filename):
    logging.basicConfig(filename=filename, level=logging.INFO)

def report(e: Exception):
    logging.exception(str(e))

class Search:
    def __init__(self, url):
        self._url = url
        self._data = b''

    def retrieve_webpage(self):
        try:
            # urlopen's third positional argument is a timeout, not headers;
            # wrap the URL in a Request object to send custom headers.
            req = Request(self._url, headers=headers)
            html = urlopen(req)
        except Exception as e:
            print(e)
            report(e)
        else:
            self._data = html.read()
            if len(self._data) > 0:
                print("Retrieved successfully")

    def write_webpage_as_html(self, filename='results.html'):
        try:
            with open(filename, 'wb') as fobj:
                fobj.write(self._data)
        except Exception as e:
            print(e)
            report(e)
            return False
        else:
            return True

if __name__ == '__main__':
    set_custom_log_info('search.log')
    search_scrap = Search(search_url)
    search_scrap.retrieve_webpage()
    search_scrap.write_webpage_as_html()


I am still getting errors; any advice is welcome.