Recently, I was assigned a task to check whether the news items listed on a page were too old.
That meant going through each page's source to get the date information.
I had heard about web scraping before, so for this task I applied that technique.
Here is what I learnt.
I. What is web scraping?
If you’ve ever copied and pasted information from a website, you’ve performed the same function as any web scraper, only on a microscopic, manual scale.
Web scraping, also known as web data extraction, is the process of retrieving or “scraping” data from a website. Unlike the mundane, mind-numbing process of manually extracting data, web scraping uses intelligent automation to retrieve hundreds, millions, or even billions of data points from the internet’s seemingly endless frontier.
More than a modern convenience, the true power of web scraping lies in its ability to build and power some of the world’s most revolutionary business applications. ‘Transformative’ doesn’t even begin to describe the way some companies use web scraped data to enhance their operations, informing executive decisions all the way down to individual customer service experiences.
II. What we'll need
We will do the web scraping in Python, so we need to get familiar with the two libraries below.
beautifulsoup4
You can check out the Beautiful Soup documentation here.
In short, Beautiful Soup helps us parse web data in HTML or XML format, so that we can cook it down to the exact data we want.
requests
Requests helps us send HTTP requests and get the response from the web.
You can refer to the Requests documentation here.
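To see how the two libraries fit together, here is a minimal sketch (the URL is just a placeholder for illustration):

import requests
from bs4 import BeautifulSoup

# Placeholder URL, just for illustration
response = requests.get('https://example.com')
soup = BeautifulSoup(response.text, features='html.parser')
print(soup.title.text)  # prints the page title, e.g. 'Example Domain'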
III. Real-world task
As I mentioned earlier, the task is to check whether the news items on the page are too old.
Basically, what we need to do is:
Open a page where all the sites are listed
Get all the links to the news sites
Get the published date from each news site
Assert that the published date of each news item is less than one month before the current date
Here is the link for all the news we need to check: coccoc-newtab
--> Please open it in the CocCoc browser or Google Chrome; this might not work in other browsers like Firefox.
Below are the steps we need to do, in detail.
1. Scroll the page
Selenium natively supports moving to an element, but that's not what we're looking for here.
To scroll the page itself, we need to execute a bit of JavaScript.
Below is an illustration of how to do that.
def scroll_to_with_scroll_height(self, driver):
    # Scroll to the bottom of the page (document.body.scrollHeight)
    driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
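Since the feed loads lazily, scrollHeight grows as new items are appended, so in practice we can call this helper in a loop until the height stops changing. Here is a sketch; the method name scroll_until_page_end, the max_scrolls cap, and the fixed sleep are my own assumptions:

import time

def scroll_until_page_end(self, driver, max_scrolls=10):
    # Keep scrolling until the page height stops growing (no more items load)
    last_height = driver.execute_script('return document.body.scrollHeight;')
    for _ in range(max_scrolls):
        self.scroll_to_with_scroll_height(driver)
        time.sleep(1)  # crude wait for new items; an explicit wait would be more robust
        new_height = driver.execute_script('return document.body.scrollHeight;')
        if new_height == last_height:
            break
        last_height = new_height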
2. Get all the URLs of the news sites
In the Zen feed, there are news items from regular sites, and there are ads.
For this task, we only need the news from the regular sites (not the ads).
To differentiate them, we will use CSS selectors.
Below is the CSS selector for that:
ZEN_NEWS_NOT_CONTAINS_ADS_ITEM_CSS_SELECTOR = 'div[class] > a:not(.qc-link)[href]:not(.context-content)' \
                                              ':not(.zen-ads__context):not([href*="utm"])' \
                                              ':not([data-click-url*="click"])'
ZEN_NEWS_NOT_CONTAINS_ADS_ITEM = (By.CSS_SELECTOR, ZEN_NEWS_NOT_CONTAINS_ADS_ITEM_CSS_SELECTOR)
Method to find all the Zen news elements:
def find_all_current_zen_except_ads_elements(self, driver):
    # Wait until at least one non-ad news item is present on the page
    self.wait_for_element(driver).until(
        ec.presence_of_element_located(NewTabZenLocators.ZEN_NEWS_NOT_CONTAINS_ADS_ITEM))
    return driver.find_elements_by_css_selector(
        NewTabZenLocators.ZEN_NEWS_NOT_CONTAINS_ADS_ITEM_CSS_SELECTOR)
Method to get an attribute from all the Zen news elements:
def get_attribute_all_zen_except_ads_elements(self, driver, attribute_name):
    # Collect the given attribute from every non-ad news element
    attribute_values = []
    for element in self.new_tab_zen_elem.find_all_current_zen_except_ads_elements(driver):
        attribute_values.append(element.get_attribute(attribute_name))
    return attribute_values
Remember: since we need the URLs, the attribute_name to pass is 'href'.
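For example, building the list of URLs could then look like this (a sketch; it assumes we are inside a test method that has driver and the new_tab_zen_elem page object from the snippets above):

# Collect all non-ad news URLs from the page
url_list = self.new_tab_zen_elem.get_attribute_all_zen_except_ads_elements(driver, 'href')
print(f'Found {len(url_list)} news links')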
3. Check the response
We need to call requests.get() on each URL to fetch the response, then assert on it.
Here is how to implement it:
import requests
from delayed_assert import expect

for url in url_list:
    response = None
    try:
        response = requests.get(url)
    except requests.exceptions.ConnectionError as e:
        print(e)
    expect(response is not None, f'Assert response not None for site {url}')
    if response is not None:
        expect(response.status_code == 200, f'Assert response status code for site {url}')
4. Get the date
Here is how to get the datetime string using Beautiful Soup with the html.parser:
import json
from json import JSONDecodeError

from bs4 import BeautifulSoup, SoupStrainer

def get_published_time_of_web_page(self, response_text):
    published_time = None
    # First, look for the Open Graph style meta tag in <head>
    soup_instance = BeautifulSoup(response_text, features='html.parser',
                                  parse_only=SoupStrainer('head'))
    for item in soup_instance.find_all(name='meta'):
        if item.get('property') == 'article:published_time':
            published_time = item.get('content')
    if published_time is None:
        # Fall back to JSON-LD blocks: <script type="application/ld+json">
        soup_instance = BeautifulSoup(
            response_text, features='html.parser',
            parse_only=SoupStrainer('script', attrs={'type': 'application/ld+json'}))
        for each_json in soup_instance.find_all('script'):
            if 'datePublished' in each_json.text:
                try:
                    json_parse = json.loads(each_json.text.strip(), strict=False)
                    published_time = json_parse['datePublished']
                except (JSONDecodeError, KeyError, TypeError) as e:
                    # The JSON-LD may be malformed, or datePublished may not be a top-level key
                    print(e)
    return published_time
The current implementation covers the two most popular HTML templates of the news pages: the article:published_time meta tag and the JSON-LD script block.
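To illustrate what the method looks for, here are two tiny hand-made HTML snippets (hypothetical content), one per supported template; page_helper stands in for whatever object the method above lives on:

# Hypothetical minimal pages, one per supported template
meta_template = ('<html><head><meta property="article:published_time" '
                 'content="2021-05-20T08:30:00+07:00"></head></html>')
ldjson_template = ('<html><head><script type="application/ld+json">'
                   '{"datePublished": "2021-05-20T08:30:00+07:00"}'
                   '</script></head></html>')

print(page_helper.get_published_time_of_web_page(meta_template))    # 2021-05-20T08:30:00+07:00
print(page_helper.get_published_time_of_web_page(ldjson_template))  # 2021-05-20T08:30:00+07:00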
Then we need to parse the datetime string into a date using the dateutil library:
import datetime

import dateutil.parser

def parse_string_to_date(string_datetime):
    # Parse an ISO 8601 (or similar) datetime string and keep only the date part
    return dateutil.parser.parse(string_datetime).date()

def how_many_days_til_now(string_datetime):
    # Number of days between the published date and today
    delta = datetime.date.today() - parse_string_to_date(string_datetime)
    return delta.days
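For example (the date string here is hypothetical, and the day count depends on when you run it):

published = '2021-05-01T09:00:00+07:00'  # hypothetical ISO 8601 published time
print(parse_string_to_date(published))   # 2021-05-01
print(how_many_days_til_now(published))  # e.g. 19, depending on today's date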
5. Soft assert
Sometimes we need soft assertions so that the test doesn't stop right after the first failure,
since we want to collect the failed results across all the sites.
There is the delayed_assert library for this in Python.
Below is how to implement it:
from delayed_assert import expect, assert_expectations

expect(response is not None, f'Assert response not None for site {url}')
expect(response.status_code == 200, f'Assert response status code for site {url}')
if published_time is not None:
    expect(how_many_days_til_now(published_time) <= 30, f'Verify date of page {url}')
else:
    print(f'Url of the site which cannot get published date is : {url}')

# Nothing is raised until we explicitly check the collected expectations
assert_expectations()
Remember: if we do not call assert_expectations() at the end of the test, the test will always pass.
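Putting the five steps together, the whole test body could look roughly like this (a sketch: the test name and the NEW_TAB_URL constant are hypothetical; scroll_until_page_end is the loop sketched in step 1, and the other helpers are the ones defined above):

import requests
from delayed_assert import expect, assert_expectations

def test_news_pages_are_fresh(self, driver):
    driver.get(NEW_TAB_URL)  # hypothetical constant for the coccoc-newtab page
    self.scroll_until_page_end(driver)  # step 1: load the lazy feed
    url_list = self.new_tab_zen_elem.get_attribute_all_zen_except_ads_elements(driver, 'href')  # step 2
    for url in url_list:
        response = None
        try:
            response = requests.get(url)  # step 3
        except requests.exceptions.ConnectionError as e:
            print(e)
        expect(response is not None, f'Assert response not None for site {url}')
        if response is None:
            continue
        expect(response.status_code == 200, f'Assert response status code for site {url}')
        published_time = self.get_published_time_of_web_page(response.text)  # step 4
        if published_time is not None:
            expect(how_many_days_til_now(published_time) <= 30, f'Verify date of page {url}')  # step 5
        else:
            print(f'Url of the site which cannot get published date is : {url}')
    assert_expectations()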
That's it.
Stay tuned for more.
Note: if you feel this blog helped you and want to show your appreciation, feel free to drop by:
This will help me contribute more valuable content.