Tina

Parse in Gambling: How to Write Your Parser in 15 Minutes?

Step 1 - Parsing: What? Why? How?

Generally speaking, parsing is the sequential comparison of a sequence of words against the rules of a language. "Language" is meant here in the widest sense: it may be a human language used for communication between people (English, German, Japanese, and so on) or a formal language, in particular any programming language.
Website parsing is the sequential syntactic analysis of the information posted on web pages. Note that this information is a hierarchical data set, structured using both human and computer languages. When creating a website, the developer inevitably faces the task of determining the optimal page structure. But where can you find an example of an optimal page? Don't reinvent the wheel in the early stages of automating the optimization process! It's enough to analyse your direct competitors, especially in a niche as saturated and competitive as gambling. There is a lot of such data, so extracting it involves a number of non-trivial challenges, such as:

  • collecting search engine results;
  • handling volumes of online information too large for one person, or even a team of analysts, to process;
  • keeping up with frequent updates: the information sometimes changes every minute, so neither one person nor a well-coordinated team of operators can maintain such a stream by hand. Automating this saves time on monitoring changes (for instance, in casino promotions) and lets you update your own site automatically.

Compared to a human, a parser program can:

  • quickly crawl thousands of web pages;
  • neatly separate technical information from "human" text;
  • reliably select what is needed and discard the rest;
  • efficiently pack the final data into the required form.

In most cases, the result of parsing is a database or spreadsheet that is then processed further. Parsers are currently written in a wide range of programming languages, such as Python, R, C++, Delphi, Perl, Ruby, and PHP. I choose Python as a versatile language with a simple syntax. That syntax is also what sets Python apart: it lets many programmers read someone else's Python code without trouble.

Step 1 P.S. - Ways to Improve

If you want to improve your script in the future or write a smarter parser, you may find useful material on the Selenium download page (https://www.seleniumhq.org/download/) and in the Beautiful Soup documentation (https://www.crummy.com/software/BeautifulSoup/bs4/doc/).
The result, whether it is a database or a spreadsheet, will certainly need further processing. However, what you do with the collected information afterwards is outside the scope of parsing.

Step 2 - Software implementation

First, let's discuss the algorithm. What do we want to get? We want the optimal structure of a document for a given keyword. The most successful structure is likely the one used by the pages that rank at the top for that keyword.
Thus, the work ahead can be divided into several parts:
1) Choose an example page whose structure we want to check: https://casino-now.co.uk/mobile-casino/
2) Identify a keyword: mobile casino
3) Get a list of the best-optimized competitors
4) Get the structure of their pages and check the optimization parameters for the keyword
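
Step 4 can be tried on a small scale before any browser automation is involved. The following is a minimal sketch, assuming that "structure" means the page title and the h1-h3 headings, and that an "optimization parameter" is simply whether the keyword appears in each of them. The names `StructureParser` and `check_structure`, the tag set, and the sample HTML are all made up for this illustration; it uses only the standard library rather than the tools introduced later in the article.

```python
from html.parser import HTMLParser

# Assumption: these tags are what we treat as the "structure" of a page.
STRUCTURE_TAGS = {"title", "h1", "h2", "h3"}


class StructureParser(HTMLParser):
    """Collect the text of the title and heading tags of a page."""

    def __init__(self):
        super().__init__()
        self.current = None   # structural tag we are currently inside, if any
        self.found = []       # (tag, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag in STRUCTURE_TAGS:
            self.current = tag
            self.found.append((tag, ""))

    def handle_data(self, data):
        # Text may arrive in several pieces, so append to the last entry.
        if self.current is not None:
            tag, text = self.found[-1]
            self.found[-1] = (tag, text + data)

    def handle_endtag(self, tag):
        if tag == self.current:
            self.current = None


def check_structure(html, keyword):
    """Return (tag, text, keyword_present) for each structural tag."""
    parser = StructureParser()
    parser.feed(html)
    parser.close()
    keyword = keyword.lower()
    return [(tag, text.strip(), keyword in text.lower())
            for tag, text in parser.found]


# Hypothetical page used only to demonstrate the function.
sample_html = """
<html><head><title>Best Mobile Casino Sites</title></head>
<body><h1>Mobile Casino Guide</h1><h2>Welcome bonuses</h2></body></html>
"""
report = check_structure(sample_html, "mobile casino")
for row in report:
    print(row)
```

Each tuple pairs a tag with its text and a flag telling whether the keyword occurs there, which is enough to compare competitor pages side by side.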

To make sure the request is processed correctly, the program has to wait until the page has finished loading. The function body will be:

    def readystate_complete(d):
        # True once the browser reports that the document has fully loaded
        return d.execute_script("return document.readyState") == "complete"
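
A helper like this is meant to be polled: a waiting routine calls it again and again until it returns True or a timeout expires, which is how Selenium's `WebDriverWait(...).until(...)` uses it. The sketch below imitates that loop in plain Python so it can be run without a browser. `wait_until` and `FakeDriver` are hypothetical names invented for this illustration; they are not part of Selenium, and the real implementation differs in its details.

```python
import time


def wait_until(driver, condition, timeout=30.0, poll=0.5):
    """Call condition(driver) repeatedly until it returns a truthy value."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition(driver)
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(poll)


class FakeDriver:
    """Hypothetical stand-in for a browser that 'loads' after a few polls."""

    def __init__(self, polls_until_ready):
        self.polls_left = polls_until_ready

    def execute_script(self, script):
        # Ignores the script and just simulates document.readyState.
        self.polls_left -= 1
        return "complete" if self.polls_left <= 0 else "loading"


def readystate_complete(d):
    return d.execute_script("return document.readyState") == "complete"


fake = FakeDriver(polls_until_ready=3)
loaded = wait_until(fake, readystate_complete, timeout=5, poll=0.01)
print(loaded)
```

Polling with a deadline is usually preferable to a fixed `time.sleep`, because the script continues as soon as the page is ready and still fails loudly if it never becomes ready.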

Then we need to define the keyword whose search results we want to view and use the Selenium driver to simulate keyboard input:

    import time

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait

    mainKey = "mobile casino"
    driver = webdriver.Firefox()
    driver.get("http://www.google.com")
    # locate the search box by its name attribute
    elem = driver.find_element(By.NAME, "q")
    elem.send_keys(mainKey)
    elem.submit()
    # wait up to 30 seconds for the results page to finish loading
    WebDriverWait(driver, 30).until(readystate_complete)
    time.sleep(1)
    htmltext = driver.page_source

As a result, the source code of the page shown below is stored in the 'htmltext' variable:
screen
Note that the robot icon in the screenshot means that the browser is currently under remote control, in our case by Python.
screen
Once the raw HTML is obtained, it's time to extract the pages to parse. The easiest way is to inspect the code of the element you are interested in and then use regular expressions to isolate the information you need, forming a list of objects. For example, to collect the competitors' URLs:
screen
Then let's check where the pattern of interest occurs:
screen
And write a regular expression that looks like this:

    import re

    pages = re.compile('(.*?)', re.DOTALL | re.IGNORECASE).findall(str(htmltext))

As a result, we get a list of the top 10 competitors' pages for the keyword of interest.
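
To see how a non-greedy capture group pulls URLs out of raw HTML, here is a small sketch that runs on a hard-coded string. The `<cite>` tag around the group and the sample markup are assumptions made for the illustration; the markup of a real results page is different and changes often, so the pattern has to be adapted after inspecting the page.

```python
import re

# Hypothetical fragment of a results page, not real search engine markup.
htmltext = """
<div class="result"><cite>https://example.com/mobile-casino/</cite></div>
<div class="result"><CITE>https://example.org/casino/mobile</CITE></div>
"""

# Non-greedy capture of everything between an opening and a closing <cite> tag.
# re.IGNORECASE lets the same pattern match <CITE> as well.
pattern = re.compile(r'<cite>(.*?)</cite>', re.DOTALL | re.IGNORECASE)
pages = pattern.findall(htmltext)
print(pages)
```

The `?` after `.*` matters: without it the match would be greedy and run from the first opening tag to the last closing tag, returning one long string instead of one entry per result.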
The next stage is re-parsing, similar to the above but for each URL found for the chosen keyword. The results are collected in a dataframe; the full version is available on GitHub: https://github.com/TinaWard/FirstStepForParsing/.
The following is only the snippet responsible for stripping tags and scripts from the raw HTML document. The result is the clean text of the site, which will be used to calculate the keyword density.

    from bs4 import BeautifulSoup

    html = driver.page_source
    soup = BeautifulSoup(html, "html.parser")
    # kill all script and style elements
    for script in soup(["script", "style"]):
        # rip it out
        script.extract()
    # get text
    text = soup.get_text()
    # break into lines and remove leading and trailing space on each
    lines = (line.strip() for line in text.splitlines())
    # break multi-headlines (separated by double spaces) into a line each
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    # drop blank lines
    text = '\n'.join(chunk for chunk in chunks if chunk)

The parsing script produces a text document, from which the numerical characteristics needed to evaluate competitors are calculated, for example the keyword density.
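
Keyword density can be defined in more than one way. The sketch below assumes one common definition: the share of all words on the page that belong to an occurrence of the keyword phrase, expressed as a percentage. The function name `keyword_density` and the sample sentence are made up for the illustration, and the calculation in the author's repository may differ.

```python
import re


def keyword_density(text, keyword):
    """Percentage of the words in text that belong to occurrences of keyword."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    key_words = re.findall(r"[a-z0-9']+", keyword.lower())
    if not words or not key_words:
        return 0.0
    n = len(key_words)
    # Slide a window of n words over the text and count exact phrase matches.
    hits = sum(1 for i in range(len(words) - n + 1)
               if words[i:i + n] == key_words)
    return 100.0 * hits * n / len(words)


sample_text = "Mobile casino bonuses: play at the best mobile casino today"
density = keyword_density(sample_text, "mobile casino")
print(round(density, 1))
```

Here the phrase occurs twice in a ten-word sentence, so four of the ten words belong to the keyword. Real pages give much lower figures, and it is the comparison between competitors that is informative rather than the absolute number.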
The result is saved as a file containing each competitor's page address, the HTML page, the document structure, and the calculated keyword density:
screen
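
A file of this kind can be produced with the standard `csv` module. This is only a sketch: the column names, the `write_report` helper, and the two rows of values are hypothetical placeholders, not results from the article, and an in-memory buffer stands in for a real file so the example runs anywhere.

```python
import csv
import io

# Hypothetical rows: (page address, number of headings, keyword density in %).
rows = [
    ("https://example.com/mobile-casino/", 7, 2.4),
    ("https://example.org/casino/mobile", 5, 1.1),
]


def write_report(rows, fileobj):
    """Write one header line and one line per competitor page."""
    writer = csv.writer(fileobj, lineterminator="\n")
    writer.writerow(["url", "headings", "keyword_density_pct"])
    writer.writerows(rows)


# io.StringIO stands in for open("report.csv", "w", newline="").
buffer = io.StringIO()
write_report(rows, buffer)
report_text = buffer.getvalue()
print(report_text)
```

Because `write_report` accepts any file-like object, the same function works with a real file on disk once the rows come from the parser.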

Wrapping Up

Thus, this article has covered two key points of parsing: automatic browser control using Selenium and processing of raw HTML pages with Beautiful Soup.
Build your websites on best practices! Good luck!
Leave comments and suggest topics you would like to know more about: tina.ward@mail.uk
Wordcloud of the article. Have fun!
screen
