<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Albert Ulysses</title>
    <description>The latest articles on DEV Community by Albert Ulysses (@albertulysses).</description>
    <link>https://dev.to/albertulysses</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F303444%2F12425628-02cd-4a50-b875-172247367487.jpeg</url>
      <title>DEV Community: Albert Ulysses</title>
      <link>https://dev.to/albertulysses</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/albertulysses"/>
    <language>en</language>
    <item>
      <title>Data Pipelines with Apache Airflow - Book Review</title>
      <dc:creator>Albert Ulysses</dc:creator>
      <pubDate>Mon, 13 Jun 2022 22:12:14 +0000</pubDate>
      <link>https://dev.to/albertulysses/data-pipelines-with-apache-airflow-book-review-3b0h</link>
      <guid>https://dev.to/albertulysses/data-pipelines-with-apache-airflow-book-review-3b0h</guid>
      <description>&lt;h3&gt;
  
  
  The Book:
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.manning.com/books/data-pipelines-with-apache-airflow"&gt;Data Pipelines with Apache Airflow&lt;/a&gt; By Bas HarenSlak &amp;amp; Julian de Ruiter&lt;/p&gt;

&lt;h3&gt;
  
  
  Short Summary:
&lt;/h3&gt;

&lt;p&gt;"Data Pipelines with Apache Airflow" is an introductory and intermediate book about Apache Airflow. The book covers everything from introducing Airflow to giving some excellent ideas for generic use cases. The book has four parts: "Getting Started," "Beyond the Basics," "Airflow in Practice," and "In the Clouds." "Getting Started" and "Beyond the Basics," detail Airflow- such as how to use the framework and interacting with DAGs. "Airflow in Practice" also has some Airflow details but focuses more on the practical parts of Airflow, such as security. "In the Clouds" give examples of deploying a project using AWS, Azure, and GCP.&lt;/p&gt;

&lt;h3&gt;
  
  
  What I liked:
&lt;/h3&gt;

&lt;p&gt;The first two parts are excellent - the authors do a great job explaining Airflow. I learned so much that I was able to use immediately. Airflow is such a massive framework, but somehow, they were able to condense all the essential concepts into these two parts without going into unnecessary topics. &lt;/p&gt;

&lt;h3&gt;
  
  
  What I disliked:
&lt;/h3&gt;

&lt;p&gt;The last two parts went into some detail but didn't work as well for me because they assume a lot about your project. I understand the thinking behind these parts - the authors wanted to show examples and starting points - but for me, a lot of it was irrelevant. The only chapter I admit to skipping is chapter 17, because it discusses deployment in Azure, and I don't see myself using Azure anytime soon. &lt;/p&gt;

&lt;h3&gt;
  
  
  Review round-up:
&lt;/h3&gt;

&lt;p&gt;This book is excellent for anyone working with Apache Airflow. Although I can't say the last two parts worked for me, the authors say in their "About this Book" that after chapter five, the reader can read whatever they feel is necessary. So perhaps I read more than I needed to - more than the authors themselves suggest - and that doesn't take away from everything you can learn from this book. Again, the first two parts are more than worth it; if you decide to read the book, skim the later chapters and see whether you need to read the entire thing. &lt;/p&gt;

&lt;h3&gt;
  
  
  Rating:
&lt;/h3&gt;

&lt;p&gt;8/10 Python Snakes&lt;/p&gt;

</description>
      <category>python</category>
      <category>dataengineering</category>
      <category>books</category>
      <category>bigdata</category>
    </item>
    <item>
      <title>Practices of the Python Pro - Book Review</title>
      <dc:creator>Albert Ulysses</dc:creator>
      <pubDate>Tue, 01 Mar 2022 18:04:12 +0000</pubDate>
      <link>https://dev.to/albertulysses/practices-of-the-python-pro-book-review-2dn7</link>
      <guid>https://dev.to/albertulysses/practices-of-the-python-pro-book-review-2dn7</guid>
      <description>&lt;h2&gt;
  
  
  The Book:
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.manning.com/books/practices-of-the-python-pro"&gt;Practices of the Python Pro&lt;/a&gt; By Dane Hillard&lt;/p&gt;

&lt;h2&gt;
  
  
  Short Summary:
&lt;/h2&gt;

&lt;p&gt;"Practices of the Python Pro" is a book to teach developers about better software practices. The title is a bit misleading because it suggests that the book is about writing better Python code; it's actually about good Software practices and uses Python as the medium. The book has three main parts and a fourth closing part. The first part is about why designing software matters - basically an argument for why someone needs to learn the material in the book. The second talks broadly about the concepts, and the third goes into detail. The book ends with suggestions to the reader about where to go next.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I liked:
&lt;/h2&gt;

&lt;p&gt;I thought the book was great. As I said, the title is a bit misleading, but the material is excellent. I liked the concepts the second part covers, including separation of concerns, abstraction and encapsulation, designing for high performance (speed testing), and different types of testing (such as unit versus integration). The third part was also noteworthy - it talks about how to improve your code and gets into more specific problems.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I disliked:
&lt;/h2&gt;

&lt;p&gt;The one problem with the book is the title. It should be named "Best Software Development Practices with examples in Python."&lt;br&gt;&lt;br&gt;
The title should emphasize that it's more about software development and uses Python as its learning tool. &lt;/p&gt;

&lt;h2&gt;
  
  
  Review round-up:
&lt;/h2&gt;

&lt;p&gt;The material is vital for Python developers, and even as an intermediate developer myself, it helped reinforce many concepts I've learned elsewhere. I appreciate that it's not at all complex to understand. It's a book that was missing in the Python community. Many intermediate Python books give you particular use cases, but this book gives you a view above that and helps you organize and think about your software as a whole. I can see myself re-reading this book in the future.&lt;/p&gt;

&lt;h2&gt;
  
  
  Rating:
&lt;/h2&gt;

&lt;p&gt;9/10 Python Snakes&lt;/p&gt;

</description>
      <category>python</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Thinking in Pandas - Book Review</title>
      <dc:creator>Albert Ulysses</dc:creator>
      <pubDate>Mon, 01 Nov 2021 19:58:06 +0000</pubDate>
      <link>https://dev.to/albertulysses/thinking-in-pandas-book-review-5c7</link>
      <guid>https://dev.to/albertulysses/thinking-in-pandas-book-review-5c7</guid>
      <description>&lt;h2&gt;
  
  
  The Book:
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://link.springer.com/book/10.1007/978-1-4842-5839-2"&gt;Thinking in Pandas&lt;/a&gt; By Hannah Stepanek&lt;/p&gt;

&lt;h2&gt;
  
  
  Short Summary:
&lt;/h2&gt;

&lt;p&gt;"Thinking in Pandas" is a book written on how to optimize your Pandas code. It starts with an introduction to the Pandas library, the basics of loading/merging, and how Pandas works under the hood. The middle chapters detail the typical things you would use Pandas for and how to optimize those operations. It ends with ways that you can use tools outside of the library for speed and the library's future.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I liked:
&lt;/h2&gt;

&lt;p&gt;I like the detail that Hannah went into because it helped me understand the library as a whole. It isn't a cookbook-style book, but more like a course on Pandas and why one approach is better than another. I also enjoyed the diverse set of optimizations.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I disliked:
&lt;/h2&gt;

&lt;p&gt;Although the overall book is excellent, there were two things that I wish I could change. First, the code examples: aesthetically, they were tough on the eyes, and a little more work could have gone into making them look nice. Second, Hannah doesn't dive into distributed solutions for large datasets. I like to think that most people will use Spark for their distributed solutions, but those interested in sticking with a Pandas-style API can use a library like Dask. &lt;/p&gt;

&lt;h2&gt;
  
  
  Review round-up:
&lt;/h2&gt;

&lt;p&gt;This book is necessary for any Data Engineer or a Pandas user interested in developing better habits. The book has some shortcomings, but they aren't enough to warrant skipping this book. I used some of the suggestions immediately, and I'm grateful for all the work Hannah put into it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Rating:
&lt;/h2&gt;

&lt;p&gt;8/10 Python Snakes&lt;/p&gt;

</description>
      <category>python</category>
      <category>datascience</category>
      <category>books</category>
    </item>
    <item>
      <title>Functional Programming in Python - Book Review</title>
      <dc:creator>Albert Ulysses</dc:creator>
      <pubDate>Wed, 14 Jul 2021 00:41:00 +0000</pubDate>
      <link>https://dev.to/albertulysses/functional-programming-in-python-book-review-59o9</link>
      <guid>https://dev.to/albertulysses/functional-programming-in-python-book-review-59o9</guid>
      <description>&lt;h1&gt;
  
  
  The Book:
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://leanpub.com/functionalprogramminginpython"&gt;Functional Programming In Python&lt;/a&gt; By Martin McBride&lt;/p&gt;

&lt;h2&gt;
  
  
  Short Summary:
&lt;/h2&gt;

&lt;p&gt;"Functional Programming in Python" is a short book on writing code and solving problems using Python in a Functional Programming(FP) idiom. The first chapter goes over the three main paradigms - procedure programming, object-oriented programming, and FP. After the first chapter, we get into FP Fundamentals and how to express them in Python. The longer middle chapters go deep into iterators and iterables. The book ends with several chapters that dive into FP ideas that don't necessarily apply to Python.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I liked:
&lt;/h2&gt;

&lt;p&gt;There were several things I liked about how the author wrote the book. First, I like how the chapters are formatted. It is nice to know that the last chapters aren't necessary for Python but interesting nonetheless. I also enjoyed how the author explained the concepts. I feel as if I can implement these ideas reasonably soon. It was a good weekend book that didn't need a lot of focus to understand.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I disliked:
&lt;/h2&gt;

&lt;p&gt;I didn't dislike much about the book. If there was one thing I would comment on, it's the chapters about iterator functions. I'm not sure why so much space was given to explaining the different functions, and they didn't make a clear connection to FP. Perhaps my comprehension is lacking, but I had a hard time connecting the two.&lt;/p&gt;

&lt;h2&gt;
  
  
  Review round-up:
&lt;/h2&gt;

&lt;p&gt;I would suggest any intermediate Python developer interested in exploring FP concepts in Python read this book. It's not a beginner book, nor do I think anyone with solid FP knowledge would gain much out of reading it. &lt;/p&gt;

&lt;h2&gt;
  
  
  Rating:
&lt;/h2&gt;

&lt;p&gt;7/10 Python Snakes&lt;/p&gt;

</description>
      <category>python</category>
      <category>functional</category>
      <category>books</category>
    </item>
    <item>
      <title>Unit Testing Your Web Scraper</title>
      <dc:creator>Albert Ulysses</dc:creator>
      <pubDate>Fri, 13 Nov 2020 04:48:37 +0000</pubDate>
      <link>https://dev.to/albertulysses/unit-testing-your-web-scraper-1aha</link>
      <guid>https://dev.to/albertulysses/unit-testing-your-web-scraper-1aha</guid>
      <description>&lt;h2&gt;
  
  
  Goals:
&lt;/h2&gt;

&lt;p&gt;By the end of this tutorial, you will have a starting point for writing unit tests for a web scraper. I also hope that this motivates the reader to learn more about test-driven development. This tutorial is less about teaching you how to do something; instead, it suggests how to set up and think about your testing for web scraping scripts. &lt;/p&gt;

&lt;h2&gt;
  
  
  Tools and prereqs:
&lt;/h2&gt;

&lt;p&gt;The reader should know how to run pytest; if you don't, I suggest you read the first part of "Clean Architectures in Python" (listed in the resources section) for a primer.&lt;br&gt;
It will benefit the reader to have done at least one web scrape using Beautiful Soup before going over this tutorial. &lt;/p&gt;
&lt;h3&gt;
  
  
  Step 1
&lt;/h3&gt;

&lt;p&gt;The example website we will be using is &lt;a href="https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html" rel="noopener noreferrer"&gt;https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html&lt;/a&gt;. This website is a service to practice web scraping.&lt;br&gt;
The first step is to decide what data we will want to collect. &lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fsv0yk6rnfkvj3d4w5vmu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fsv0yk6rnfkvj3d4w5vmu.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
Luckily, the "Product information" section makes it an easy task. We will collect these fields: UPC, Product Type, Price (excl. tax), Tax, and Availability.&lt;/p&gt;
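&lt;p&gt;As a sketch of what Step 1 yields, here is how Beautiful Soup can pull those fields out of the "Product information" table. The HTML fragment below is a trimmed, illustrative stand-in for the real page, not a verbatim copy of it.&lt;/p&gt;

```python
from bs4 import BeautifulSoup

# Trimmed, illustrative stand-in for the "Product information" table;
# the real page uses the same th/td row layout.
html = """
<table class="table table-striped">
  <tr><th>UPC</th><td>a897fe39b1053632</td></tr>
  <tr><th>Product Type</th><td>Books</td></tr>
  <tr><th>Price (excl. tax)</th><td>£51.77</td></tr>
  <tr><th>Tax</th><td>£0.00</td></tr>
  <tr><th>Availability</th><td>In stock (22 available)</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
# Map each row's header cell to its value cell.
product_info = {
    row.th.get_text(strip=True): row.td.get_text(strip=True)
    for row in soup.find_all("tr")
}
print(product_info["Price (excl. tax)"])  # £51.77
```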
&lt;h3&gt;
  
  
  Step 2
&lt;/h3&gt;

&lt;p&gt;My first suggestion begins before we write any code: all of your cleaning code should be organized together and tested together. Therefore, you should separate your cleaning code into a "clean.py" file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;touch clean.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 3
&lt;/h3&gt;

&lt;p&gt;With Test-Driven Development, you write your tests before writing your program code. I like using a format similar to Ken Youens-Clark's in "Tiny Python Projects" (link below). He starts by testing that the file exists in the right location. He then separates each of the tests with a commented outline.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import os

prg = './clean.py'


# -------------------------------------------------
def test_exists():
    """checks that the file exists"""

    assert os.path.isfile(prg)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can run the test by typing this:&lt;br&gt;
&lt;code&gt;pytest -xv test.py&lt;/code&gt;&lt;br&gt;
This test should pass; if it doesn't, make sure &lt;code&gt;clean.py&lt;/code&gt; is in the correct directory.&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 4
&lt;/h3&gt;

&lt;p&gt;For the next step, we will be going back to the Product Information section of the webpage. When storing our price and tax data, we will want to keep them as floats. However, if we tried to grab the text as-is, the pound symbol would prevent Python from converting it into a float. This problem lends itself to our first test.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# -------------------------------------------------
def test_price():
    """£51.77 -&amp;gt; 51.77 type float"""

    res = monetary('£51.77')
    assert res == float(51.77)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now when we run the test using &lt;code&gt;pytest -xv test.py&lt;/code&gt;, we will see an error - which is a great thing, because the &lt;code&gt;monetary&lt;/code&gt; function doesn't exist yet. &lt;/p&gt;

&lt;p&gt;I want to take the time to discuss this step a little more because it is the main focus of the tutorial. There are much better TDD sources and excellent web scraping tutorials, but I don't often see guidance on where to start with TDD for your scrapes. Starting with how you want your data to look is a great way to get started with TDD and a great way to ensure your data is clean. As a data analyst, data engineer, or data scientist, you will most likely have several data-cleaning steps. When getting your data from the web, this can be your first one. &lt;br&gt;
I know I got a bit wordy, but I would like to summarize these thoughts and the tutorial: write a test script to reflect what your data should look like, and then write the code.&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 5
&lt;/h3&gt;

&lt;p&gt;Now we can write the code for this test. This is how I chose to write it; there is more than one way. When I write code for small projects, I like to keep in mind two pieces of advice that I have read from really smart developers. First, get the code to pass the test and make sure it does what you are trying to get done. Second, don't overcomplicate it with features you think you'll need in the future. Here is my code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import re


def monetary(value_field):
    """Returns a float for items that are values"""
    amount = re.sub('[^0-9.]', '', value_field)
    return float(amount)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Sure, I could have written it this way:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def monetary(value_field):
    """Returns a float for items that are values"""
    amount = value_field[1:]
    return float(amount)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and for this project, it would be fine. I used regex because any time I see stray characters, I think to myself, "just use regex". However, it doesn't matter as long as you are getting the desired result.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 6
&lt;/h3&gt;

&lt;p&gt;Now it is time for you to do the same thing for writing a test and code that returns an integer for the availability column! You can see my complete code and "answer" at the link below under resources.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;We went over a starting point for creating unit tests for a web scraper. Thank you for reading, and please let me know if you have any questions or suggestions!&lt;/p&gt;

&lt;h4&gt;
  
  
  Quick side note
&lt;/h4&gt;

&lt;p&gt;In my project, I save the data as a NamedTuple in a model file. There will be a link at the end of this article with more information about NamedTuples if you have not used one before.&lt;/p&gt;
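&lt;p&gt;For anyone who hasn't used one, a NamedTuple model for this tutorial's fields might look like the sketch below. The class and field names are my own illustration, not the exact ones from my project.&lt;/p&gt;

```python
from typing import NamedTuple


class Product(NamedTuple):
    """One cleaned record from a product page."""
    upc: str
    product_type: str
    price_excl_tax: float
    tax: float
    availability: int


# Cleaned values (e.g. the float returned by monetary()) slot straight in.
book = Product("a897fe39b1053632", "Books", 51.77, 0.00, 22)
print(book.price_excl_tax)  # 51.77
```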

&lt;h2&gt;
  
  
  Resources:
&lt;/h2&gt;

&lt;p&gt;Clean Architectures: &lt;br&gt;
&lt;a href="https://leanpub.com/clean-architectures-in-python" rel="noopener noreferrer"&gt;https://leanpub.com/clean-architectures-in-python&lt;/a&gt;&lt;br&gt;
Tiny Python Projects:&lt;br&gt;
 &lt;a href="https://github.com/kyclark/tiny_python_projects" rel="noopener noreferrer"&gt;https://github.com/kyclark/tiny_python_projects&lt;/a&gt;&lt;br&gt;
NamedTuple:&lt;br&gt;
&lt;a href="https://dbader.org/blog/writing-clean-python-with-namedtuples" rel="noopener noreferrer"&gt;https://dbader.org/blog/writing-clean-python-with-namedtuples&lt;/a&gt;&lt;br&gt;
My code:&lt;br&gt;
&lt;a href="https://github.com/AlbertUlysses/beautiful-test" rel="noopener noreferrer"&gt;https://github.com/AlbertUlysses/beautiful-test&lt;/a&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>webscraping</category>
      <category>tdd</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Web Scraping with Python - Reproducing Requests </title>
      <dc:creator>Albert Ulysses</dc:creator>
      <pubDate>Thu, 15 Oct 2020 05:56:46 +0000</pubDate>
      <link>https://dev.to/albertulysses/web-scraping-with-python-reproducing-requests-2395</link>
      <guid>https://dev.to/albertulysses/web-scraping-with-python-reproducing-requests-2395</guid>
      <description>&lt;p&gt;Have you noticed a gap in knowledge between learning how to scrape and practicing with your first project?&lt;br&gt;
An introductory tutorial teaches us how to scrape simple HTML data using Beautiful Soup or Scrapy. After completing it, we think to ourselves, "Finally, all the data on the World Wide Web is mine." &lt;br&gt;
Then BAM - the first page you want to scrape is a dynamic webpage. &lt;br&gt;
From there, you will probably start reading about how you need Selenium, but first you need to learn Docker. The problem gets messier because you're on a Windows machine - is the community edition even supported? The result is an avalanche of more unanswered questions, and we abandon the project altogether. &lt;br&gt;
I'm here to tell you that there is a technique between using HTML scraping tools and learning Selenium that you can try before jumping ship. Best of all, it's probably a technology you already know - Requests. &lt;/p&gt;
&lt;h2&gt;
  
  
  Goals:
&lt;/h2&gt;

&lt;p&gt;By the end of this tutorial, you'll know how to mimic a website's requests to get more HTML pages or JSON data. &lt;br&gt;
Disclaimer: in some cases, you will still need Selenium or other technologies to scrape the website.&lt;/p&gt;
&lt;h2&gt;
  
  
  Overview:
&lt;/h2&gt;

&lt;p&gt;The method is a three-step procedure.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;First, we will check if the website is the right candidate for this technique.&lt;/li&gt;
&lt;li&gt;Second, we will examine the request and return data using Postman.&lt;/li&gt;
&lt;li&gt;Third, we will write the Python code.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Tools and prereqs:
&lt;/h2&gt;

&lt;p&gt;Postman&lt;br&gt;
A basic understanding of Web Scraping &lt;br&gt;
A basic understanding of Python's Requests library&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 1
&lt;/h3&gt;

&lt;p&gt;My browser is Firefox, but I'm sure Chrome has a similar feature. Under the Web Developer menu, select the Network option. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--togO6NvS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/e68b8nk9iiegmnz32i68.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--togO6NvS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/e68b8nk9iiegmnz32i68.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Generate dynamic data and examine the requests. For my example webpage, you can invoke this by selecting an option.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--dGf9xTE1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/fjp01npq9g25j0kiunrp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--dGf9xTE1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/fjp01npq9g25j0kiunrp.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now, you might have noticed that many requests are happening. So how do you find the one you need? The trick is to look at the type and the response itself. In my example, there are many js (JavaScript) types and two HTML types. Since we are trying to avoid dealing with JavaScript, it is natural for us to first inspect the two HTML options. &lt;br&gt;
Luckily, the first HTML type is the request we need. We can tell that it's the right choice because we can see the data we want in the response. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--yEAe7jHV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/eoyl9ffphli5u8t455x3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--yEAe7jHV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/eoyl9ffphli5u8t455x3.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 2
&lt;/h3&gt;

&lt;p&gt;Next, we will use Postman to mimic the request. We do this to isolate, inspect, and confirm that this is the data we need.&lt;br&gt;
In Firefox, right-click the request, choose copy, and select the URL option. Then paste it into Postman with the right request type. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--1RHpOOfD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/iesjtiq6su8dyczjkzis.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--1RHpOOfD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/iesjtiq6su8dyczjkzis.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Right-click to copy, and select the form data option. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--SVbrCUBd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/kji47c7900nblwwdfqng.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--SVbrCUBd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/kji47c7900nblwwdfqng.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In some cases, you might need to add something to the header, but Postman autocompletes a lot, so in this example it's not necessary.&lt;br&gt;
Then run it. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--c6zSttT3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/fzpr4gu8k3u8d4bdxxyc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--c6zSttT3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/fzpr4gu8k3u8d4bdxxyc.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The data looks good. It's a dedicated HTML page that we can scrape.&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 3
&lt;/h3&gt;

&lt;p&gt;The last step is writing the Python code. Because this isn't a Requests or Python tutorial, I'm not going to get detailed in this step, but I will tell you what I did.&lt;br&gt;
I messed with the requests a bit and realized that I could get the same data using only the numeric county ID, which means looping over a range is just fine. &lt;br&gt;
Here is my code for creating the requests that get the HTML for all those pages.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
import requests

# Michigan has 83 counties, so CountyID runs from 1 to 83.
request_list = []
for i in range(1, 84):
    response = requests.post(
        f"https://mvic.sos.state.mi.us/Voter/SearchByCounty?CountyID={i}",
        verify=False,
    )
    request_list.append(response.text)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With this, we can use a combination of an HTML parser and requests to retrieve dynamic data.&lt;/p&gt;
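&lt;p&gt;To close the loop, here is one way the pages collected in &lt;code&gt;request_list&lt;/code&gt; could be handed to an HTML parser. The fragment below is a made-up stand-in for one response, since the real markup isn't shown here.&lt;/p&gt;

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for one entry of request_list; the real responses
# are full HTML pages returned by the SearchByCounty endpoint.
page = """
<select id="Jurisdictions">
  <option value="1">Alcona Township</option>
  <option value="2">Caledonia Township</option>
</select>
"""

soup = BeautifulSoup(page, "html.parser")
# Collect the visible text of every option element.
names = [option.get_text(strip=True) for option in soup.find_all("option")]
print(names)  # ['Alcona Township', 'Caledonia Township']
```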

&lt;h4&gt;
  
  
  BONUS tip:
&lt;/h4&gt;

&lt;p&gt;Here is a website that can turn cURL requests into Python code. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://curl.trillworks.com/"&gt;https://curl.trillworks.com/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Cool, right? You can use this if your request is simple, or you're just feeling lazy. Be warned: it doesn't work all the time, and you may still need to adjust some headers.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;We went over how to use a web browser, Postman, and the Requests library to continue your web scraping journey. It's an intermediate technique for anyone learning web scraping. Good luck!&lt;/p&gt;

</description>
      <category>python</category>
      <category>webscraping</category>
      <category>datascience</category>
    </item>
  </channel>
</rss>
