Generally speaking, parsing is the sequential comparison of a sequence of words against the rules of a language. "Language" is meant here in the widest sense: it may be a human language (English, German, Japanese, and so on) used for communication between people, or a formal language, in particular any programming language.
Parsing of websites is the sequential syntactic analysis of information posted on web pages. Note that the information on web pages is a hierarchical data set, structured using both human and computer languages. When creating a website, the developer inevitably faces the task of determining the optimal page structure. But where can you find an example of an optimal page? There is no need to reinvent the wheel in the early stages of automating the optimization process. It is enough to analyse your direct competitors, especially in a niche as saturated and competitive as gambling. There is a lot of such data, so extracting it involves a number of non-trivial challenges, such as:
- collection of search engine results;
- the large amount of information available online, which can hardly be processed by one person or even a team of analysts;
- the need for frequent updates: one person, or even a well-coordinated team of operators, cannot keep up with a large stream of dynamically changing information. Some information changes every minute, so updating it manually is impractical.
Automating this process saves time on monitoring changes, for instance in casino promotions, and lets you update them on your site automatically. Compared to a human, a parser program can:
- quickly go through thousands of web pages;
- neatly separate technical information from "human" information;
- reliably select what is needed and discard the superfluous;
- effectively pack the final data in the required form.
In most cases the result of parsing is a database or spreadsheet that is then subjected to further processing. Parsers are currently written in a wide range of programming languages, such as Python, R, C++, Delphi, Perl, Ruby, and PHP. I choose Python as the most universal language with a simple syntax. That syntax is also Python's distinguishing feature: it lets a large number of programmers read someone else's Python code without trouble.
If you want to improve your script in the future or write a smarter parser, you may find useful material on the Selenium download page, https://www.seleniumhq.org/download/, and in the Beautiful Soup documentation, https://www.crummy.com/software/BeautifulSoup/bs4/doc/.
The result (whether it's a database or a spreadsheet) will certainly need further processing. However, the subsequent manipulation of the collected information is outside the topic of parsing.
First, let's discuss the algorithm. What do we want to get? The optimal structure of a document for a given keyword. The most successful structure is likely to be found on the pages that rank at the top of the search results for that keyword.
Thus, the work ahead can be divided into several steps:
1) Choose an example of the page to check the structure: https://casino-now.co.uk/mobile-casino/
2) Identify a keyword: mobile casino
3) Get a list of the most optimized competitors
4) Get the structure of their pages and check the optimization parameters for the keyword
To make sure the request has been fully processed, the script has to wait until the page has finished loading. The function that checks this is:
def readystate_complete(d):
    return d.execute_script("return document.readyState") == "complete"
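To make the waiting logic more concrete, here is a minimal sketch of what an explicit wait does with such a function: it calls it repeatedly until it returns a truthy value or a timeout expires. This is a simplified stand-in written with the standard library only, not Selenium's own code; `wait_until`, `WaitTimeout` and `FakeDriver` are hypothetical names introduced for illustration.

```python
import time


class WaitTimeout(Exception):
    """Raised when the condition does not become true in time (hypothetical name)."""


def wait_until(driver, condition, timeout=30.0, poll=0.5):
    """Call condition(driver) repeatedly until it returns a truthy value."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition(driver)
        if result:
            return result
        if time.monotonic() >= deadline:
            raise WaitTimeout(f"condition not met within {timeout} s")
        time.sleep(poll)


class FakeDriver:
    """Stand-in for a browser: reports 'loading' twice, then 'complete'."""

    def __init__(self):
        self.states = ["loading", "loading", "complete"]
        self.calls = 0

    def execute_script(self, script):
        state = self.states[min(self.calls, len(self.states) - 1)]
        self.calls += 1
        return state


def readystate_complete(d):
    return d.execute_script("return document.readyState") == "complete"


fake = FakeDriver()
loaded = wait_until(fake, readystate_complete, timeout=2.0, poll=0.01)
```

With a real browser, the driver object takes the place of `FakeDriver`, and the polling loop is what Selenium's `WebDriverWait(...).until(...)` provides.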
Then we need to set the keyword whose search results we want to view and use the Selenium driver to simulate keyboard input:
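import time

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait

mainKey = "mobile casino"
driver = webdriver.Firefox()
driver.get("http://www.google.com")
# find_element_by_name was removed in Selenium 4; find_element(By.NAME, ...) is the equivalent
elem = driver.find_element(By.NAME, "q")
elem.send_keys(mainKey)
elem.submit()
# wait (up to 30 s) until the results page reports that it has finished loading
WebDriverWait(driver, 30).until(readystate_complete)
time.sleep(1)
htmltext = driver.page_source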
As a result, the source code of the page shown below is stored in the ‘htmltext’ variable:
Note that the robot icon in the screenshot means that the browser is currently under remote control, in our case by Python.
Once the raw HTML is obtained, it's time to extract the pages to parse. The easiest way is to inspect the code of the element you are interested in and then use regular expressions to isolate the information you need, forming a list of objects. For example, to collect the competitors' URLs:
Then let's check where the pattern of interest occurs:
And write a regular expression that looks like this:
pages = re.compile('(.*?)' , re.DOTALL | re.IGNORECASE).findall(str(htmltext))
As a result, we get a list of the top 10 competitors' pages for the keyword of interest.
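As an illustration of this regex step, here is a minimal sketch that runs on a small sample string. The markup and the `result-link` class are assumptions made up for the example; real search-result pages use different HTML that changes often, so the pattern has to be adapted after inspecting the actual page source. `extract_result_urls` is a hypothetical helper name.

```python
import re

# Hypothetical markup: real search-result pages look different and change often.
sample_html = '''
<div class="result"><a class="result-link" href="https://example.com/mobile-casino/">Example one</a></div>
<div class="result"><A CLASS="result-link" HREF="https://example.org/casino">Example two</A></div>
<div class="ad"><a href="https://ads.example.net/">Ad</a></div>
'''

# Capture the href of every link that carries the (assumed) result-link class.
link_pattern = re.compile(
    r'<a\s+class="result-link"\s+href="(.*?)"',
    re.DOTALL | re.IGNORECASE,
)


def extract_result_urls(htmltext, limit=10):
    """Return up to `limit` unique URLs in the order they appear."""
    urls = []
    for url in link_pattern.findall(str(htmltext)):
        if url not in urls:
            urls.append(url)
    return urls[:limit]


pages = extract_result_urls(sample_html)
```

The non-greedy group `(.*?)` stops at the first closing quote, so each match contains only the URL itself, and the advertisement link is skipped because it lacks the assumed class.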
The next stage is another round of parsing, similar to the above, but for each URL found for the chosen keyword. The results are collected in a dataframe; the full code can be viewed on GitHub: https://github.com/TinaWard/FirstStepForParsing/.
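What counts as "structure" is not spelled out here, so the sketch below assumes it means the page title and the outline of headings (h1 to h6). It uses Python's built-in `html.parser` rather than the tools from the repository, and `StructureParser` and `page_structure` are hypothetical names.

```python
from html.parser import HTMLParser


class StructureParser(HTMLParser):
    """Collects the <title> text and the h1-h6 outline of a page."""

    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.title = ""
        self.headings = []    # list of (tag, text) in document order
        self._current = None  # tag whose text is being collected
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" or tag in self.HEADINGS:
            self._current = tag
            self._buffer = []

    def handle_data(self, data):
        if self._current is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._current:
            # Join the collected pieces and normalise whitespace.
            text = " ".join("".join(self._buffer).split())
            if tag == "title":
                self.title = text
            else:
                self.headings.append((tag, text))
            self._current = None


def page_structure(html):
    parser = StructureParser()
    parser.feed(html)
    parser.close()
    return {"title": parser.title, "headings": parser.headings}


sample = """<html><head><title>Mobile Casino Guide</title></head>
<body><h1>Best <em>Mobile</em> Casino</h1><p>Intro</p><h2>Bonuses</h2></body></html>"""
structure = page_structure(sample)
```

Running the same function over each competitor URL gives comparable outlines, which can then be checked for where the keyword appears.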
The following is only the snippet of code responsible for stripping tags and scripts from the raw HTML document. The result is the cleaned text of the site, which is used to calculate the keyword density.
from bs4 import BeautifulSoup

html = driver.page_source
# name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(html, "html.parser")
# kill all script and style elements
for script in soup(["script", "style"]):
    script.extract()  # rip it out
# get text
text = soup.get_text()
# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each (split on double spaces)
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)
The output of the parsing script is a text document from which the numerical characteristics needed to evaluate competitors are computed, for example the keyword density.
The result is a file containing the competitor's page address, the HTML page, the document structure, and the calculated keyword density:
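Definitions of keyword density vary. The sketch below assumes one common variant: the number of times the keyword phrase occurs, divided by the total number of words, expressed as a percentage. The `keyword_density` function and its tokenisation rule are assumptions for illustration and may differ from the calculation in the repository.

```python
import re

# Lowercase words made of letters, digits and apostrophes (a simplifying assumption).
WORD_PATTERN = re.compile(r"[a-z0-9']+")


def keyword_density(text, keyword):
    """Phrase occurrences per 100 words of text."""
    words = WORD_PATTERN.findall(text.lower())
    key_words = WORD_PATTERN.findall(keyword.lower())
    if not words or not key_words:
        return 0.0
    n = len(key_words)
    # Slide a window of n words over the text and count exact phrase matches.
    hits = sum(
        1 for i in range(len(words) - n + 1) if words[i:i + n] == key_words
    )
    return 100.0 * hits / len(words)


cleaned_text = "Mobile casino games. Play at our mobile casino today!"
density = keyword_density(cleaned_text, "mobile casino")
```

Here the sample text has 9 words and the phrase occurs twice, so the density is 2 / 9 of 100 percent.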
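As a rough idea of how such a file could be written, here is a minimal sketch using Python's built-in `csv` module. The column names and the sample rows are made up for the example and cover only part of the fields listed above; the actual output format in the repository may differ.

```python
import csv
import io

FIELDS = ["url", "title", "keyword_density"]  # hypothetical column names


def write_report(rows, handle):
    """Write one line per competitor page to an open text file or buffer."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    for row in rows:
        writer.writerow(row)


# Made-up sample rows, not real measurements.
rows = [
    {"url": "https://example.com/mobile-casino/", "title": "Mobile Casino Guide", "keyword_density": 2.5},
    {"url": "https://example.org/casino", "title": "Casino, Mobile Edition", "keyword_density": 1.0},
]

buffer = io.StringIO()
write_report(rows, buffer)
report = buffer.getvalue()
```

To write to disk instead of an in-memory buffer, pass a handle opened with `open(path, "w", newline="", encoding="utf-8")`.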
Thus, this material has covered two key points of parsing: automatic browser control using Selenium and processing of raw HTML pages with Beautiful Soup.
Build your websites on best practices! Good luck!
Leave comments and propose topics you would like to know more about: firstname.lastname@example.org
Wordcloud of the article. Have fun!