<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: CincyBC</title>
    <description>The latest articles on DEV Community by CincyBC (@cincybc).</description>
    <link>https://dev.to/cincybc</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F590623%2F0a84562e-0e13-407f-be90-240f069f0d6a.png</url>
      <title>DEV Community: CincyBC</title>
      <link>https://dev.to/cincybc</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/cincybc"/>
    <language>en</language>
    <item>
      <title>Final: How to Structure the DAG</title>
      <dc:creator>CincyBC</dc:creator>
      <pubDate>Sun, 04 Jun 2023 14:30:00 +0000</pubDate>
      <link>https://dev.to/cincybc/final-how-to-structure-the-dag-1638</link>
      <guid>https://dev.to/cincybc/final-how-to-structure-the-dag-1638</guid>
<description>&lt;p&gt;In the previous posts, we've talked about how easy it is to create custom hooks and operators in an object-oriented style in Airflow. Airflow itself is moving in the direction of functional programming, with TaskFlow decorators turning any Python function into a task. However, it's still valuable to learn and use Operator/Hook/Sensor classes to build extendable tools to plug into your DAGs as needed. As I said in a previous post, the Airflow community is very active, and there are Hooks and Operators already out there to work with pretty much any tool you want. So, in all likelihood, you won't need to build everything from scratch. &lt;/p&gt;

&lt;p&gt;If we were to develop our FundScraper to its ultimate conclusion using Data Engineering best practices, this is how I would structure it:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;EmptyOperator Start Task (not actually executed, but good to have if you want to add tasks later)&lt;/li&gt;
&lt;li&gt;Custom Extract Operator - This would be similar to the one we created a couple of articles ago, but it would write the entire contents of the HTML to a file to create a "snapshot" of the website for the day we scraped it. We would write the HTML to an S3 bucket using the S3 hook that the community has developed.&lt;/li&gt;
&lt;li&gt;Custom Transform Operator - This would be the rest of the Custom Operator we created, the part that extracted the values we wanted, but this time it would pull from the HTML file the Extract Operator created. The flow would be: S3 hook to read the HTML file, "transformations" to extract the values we want, and finally the S3 hook again to write the data as JSON back to a new bucket or prefix in S3.&lt;/li&gt;
&lt;li&gt;PostgresOperator - Use the off-the-shelf PostgresOperator to run a COPY query that reads our S3 file and loads it into a Postgres RDS instance.&lt;/li&gt;
&lt;li&gt;EmptyOperator End Task (again, not actually run)&lt;/li&gt;
&lt;/ol&gt;
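&lt;p&gt;Sketched as Airflow code, the five-task structure above could look like this. This is a hypothetical skeleton: the task IDs, connection ID, and SQL path are placeholders (with EmptyOperators standing in for the custom extract/transform operators described above), not code from the series' repo.&lt;/p&gt;

```python
# Hypothetical skeleton of the five-task DAG described above; IDs,
# connection names, and the SQL path are placeholders.
import pendulum
from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.providers.postgres.operators.postgres import PostgresOperator

with DAG(
    dag_id="fund_scraper_etl",
    schedule_interval="@daily",
    start_date=pendulum.datetime(2023, 1, 1, tz="UTC"),
    catchup=False,
) as dag:
    start = EmptyOperator(task_id="start")
    # stand-ins for the custom extract/transform operators described above
    extract = EmptyOperator(task_id="extract_html_to_s3")
    transform = EmptyOperator(task_id="transform_html_to_json")
    load = PostgresOperator(
        task_id="load_to_rds",
        postgres_conn_id="rds_default",   # hypothetical connection ID
        sql="sql/load_fund_values.sql",   # hypothetical COPY script
    )
    end = EmptyOperator(task_id="end")

    # linear dependency chain: each task runs after the previous succeeds
    start >> extract >> transform >> load >> end
```

&lt;p&gt;Because each step writes its output somewhere durable before the next begins, any task in the chain can be rerun on its own.&lt;/p&gt;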

&lt;p&gt;While the Custom Extract Operator cannot be idempotent since the website will change, we can still make the other tasks idempotent by separating them out. What benefit does this have?&lt;/p&gt;

&lt;p&gt;Let's say the website changes and the values we pull aren't in the same spot as they used to be. If we had the "extract/transform/load" all in one task and didn't write until the end, then everything would fail and we'd lose the state of the website that day forever (unless a service like the Wayback Machine just happened to be storing it). Writing it to an HTML file in cheap S3 storage solves that. We would then be able to change the transformation to find the values we want and rerun all of the failed tasks to catch up.&lt;/p&gt;

&lt;p&gt;Why not just include a couple of "checkpoints" in the code writing them in a single task? Why do we need 3 tasks?&lt;/p&gt;

&lt;p&gt;Think of the tasks as checkpoints. We can then make our modifications and pick up again at whichever checkpoint failed.&lt;/p&gt;

&lt;p&gt;Can this be applied to more traditional ETL jobs moving data from a transactional database to an analytical database? Absolutely. Just swap the web scraper in the extract task for an Operator written to interact with your flavor of transactional database.&lt;/p&gt;

&lt;p&gt;There is more we can do with all of this....More ways for us to abstract things out from the base DAG and allow us to create as many scrapers as we'd like via YAML (as I did with the abstract config), but I'm going to close this series here. It's been fun showing how to take a small scraping script, run from a cron job with no extendability and zero observability, and turn it into a robust ETL process with fail-safes, observability, and extendability. I hope you enjoyed it. Let me know if you'd like me to extend this series!&lt;/p&gt;

</description>
      <category>beginners</category>
      <category>programming</category>
      <category>python</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>Custom Airflow Hooks</title>
      <dc:creator>CincyBC</dc:creator>
      <pubDate>Thu, 04 May 2023 14:25:00 +0000</pubDate>
      <link>https://dev.to/cincybc/custom-airflow-hooks-2hj</link>
      <guid>https://dev.to/cincybc/custom-airflow-hooks-2hj</guid>
<description>&lt;p&gt;In the last post, we took our monolith scraper class and converted it to a custom Airflow Operator. In Airflow, Operators execute tasks, and if you want an interface with an outside service, you create it via Airflow Hooks. Just as there are many off-the-shelf Provider Operators in Airflow developed by the community, there are several off-the-shelf Hooks to interface with services. &lt;/p&gt;

&lt;p&gt;For example, there is a PostgresOperator to run Postgres queries. If you look under the hood of the Operator, it uses a PostgresHook to run the queries. If you are creating your own Python functions or Operators in Airflow, you can import just the Hook to use that interface.&lt;/p&gt;

&lt;p&gt;Why is this important for us? &lt;/p&gt;

&lt;p&gt;Back in the beginning of this series, I broke down web scraping on a continuum from API to manually scraping websites. Interacting with an API would be an interface, and there are &lt;a href="https://docs.astronomer.io/learn/airflow-importing-custom-hooks-operators"&gt;examples out there of how you could build a custom Hook to interact with an external API&lt;/a&gt;. Interacting with the "hidden APIs" I described would also be an interface, albeit a less traditional one.&lt;/p&gt;

&lt;p&gt;Hooks generally take a conn_id argument that points to an Airflow Connection, where you can store variables (including sensitive ones such as credentials) and use them in your Hooks. Here is the example from Astronomer's website:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# import the hook to inherit from
from airflow.hooks.base import BaseHook


# define the class inheriting from an existing hook class
class MyHook(BaseHook):
    """
    Interact with &amp;lt;external tool&amp;gt;.
    :param my_conn_id: ID of the connection to &amp;lt;external tool&amp;gt;
    """

    # provide the name of the parameter which receives the connection id
    conn_name_attr = "my_conn_id"
    # provide a default connection id
    default_conn_name = "my_conn_default"
    # provide the connection type
    conn_type = "general"
    # provide the name of the hook
    hook_name = "MyHook"

    # define the .__init__() method that runs when the DAG is parsed
    def __init__(
        self, my_conn_id: str = default_conn_name, *args, **kwargs
    ) -&amp;gt; None:
        # initialize the parent hook
        super().__init__(*args, **kwargs)
        # assign class variables
        self.my_conn_id = my_conn_id
        # (optional) call the '.get_conn()' method upon initialization
        self.get_conn()

    def get_conn(self):
        """Function that initiates a new connection to your external tool."""
        # retrieve the passed connection id
        conn_id = getattr(self, self.conn_name_attr)
        # get the connection object from the Airflow connection
        conn = self.get_connection(conn_id)

        return conn

    # add additional methods to define interactions with your external tool
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You'll notice this code will run the &lt;code&gt;get_conn()&lt;/code&gt; method upon instantiation if you leave the "optional" code in there. &lt;/p&gt;

&lt;p&gt;If you were interacting with a website that required authentication, you could use &lt;code&gt;s = requests.Session()&lt;/code&gt; and return that session where the code above has &lt;code&gt;return conn&lt;/code&gt;. Everything you need to make a custom hook is there for you to adapt.&lt;/p&gt;
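&lt;p&gt;As a minimal sketch of that idea (the helper name and header are hypothetical, and the Airflow connection lookup is omitted), the session your &lt;code&gt;get_conn()&lt;/code&gt; returns could be built like this:&lt;/p&gt;

```python
import requests

def build_session(login, password):
    # Hypothetical helper: a custom hook's get_conn() could read login and
    # password from the Airflow connection and return this Session instead
    # of the raw connection object.
    s = requests.Session()
    s.auth = (login, password)  # basic auth; swap in tokens/headers as needed
    s.headers.update({"User-Agent": "fund-scraper"})  # placeholder header
    return s

session = build_session("user", "secret")
```

&lt;p&gt;Every request made through that session then carries the credentials automatically.&lt;/p&gt;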

&lt;p&gt;For our purposes though, we don't need anything fancier than a hook that will make a web call for us and return the contents of the webpage as a BeautifulSoup object. So here is our code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# import the hook to inherit from
from airflow.hooks.base import BaseHook
from bs4 import BeautifulSoup
import requests


# define the class inheriting from an existing hook class
class ScraperHook(BaseHook):
    """
    Interact with external websites.
    :param url: URL to call
    """

    # provide the name of the hook
    hook_name = "ScraperHook"

    # define the .__init__() method that runs when the DAG is parsed
    def __init__(self, url: str, parser: str = "html.parser", *args, **kwargs) -&amp;gt; None:
        # initialize the parent hook
        super().__init__(*args, **kwargs)
        # assign class variables
        self.url = url
        self.parser = parser

    def get_website(self):
        """Function that returns content of website."""

        r = requests.get(self.url)
        if r.status_code == 200:
            soup: BeautifulSoup = BeautifulSoup(r.content, self.parser)
            return soup
        else:
            return r.status_code
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you recall the web_call function from our functional DAG, it's the same code, but now it lives in the Airflow Hook. This aligns with Airflow best practices of putting your interfaces in hooks!&lt;/p&gt;

&lt;p&gt;When we instantiate this (be it in a taskflow or in a custom operator), we can just call it like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from hooks.scraper_hook import ScraperHook

soup = ScraperHook(url=&amp;lt;the url&amp;gt;).get_website()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As always, the code is online &lt;a href="https://github.com/CincyBC/bootstrap-to-airflow/blob/main/airflow/plugins/hooks/scaper_hook.py"&gt;here&lt;/a&gt;.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Class to Airflow Custom Operator</title>
      <dc:creator>CincyBC</dc:creator>
      <pubDate>Mon, 01 May 2023 14:24:00 +0000</pubDate>
      <link>https://dev.to/cincybc/class-to-airflow-custom-operator-4a79</link>
      <guid>https://dev.to/cincybc/class-to-airflow-custom-operator-4a79</guid>
<description>&lt;p&gt;In the last post, we created a very simple, single-task Airflow DAG to run our Sprott Scraper. However, just like our functional scraper, that DAG was a one-trick pony that couldn't easily be used for anything else. For that, we would need to break it up like we did with the abstract class methods: an extract object, a transform object, and a load object. This time, each object becomes a task, which is executed in Airflow by an Operator. Let's first look at how to create a custom operator from our Python class.&lt;/p&gt;

&lt;p&gt;If you noticed the code block after I made the functional Sprott Scraper DAG, you'd see I replaced the @task decorator with the classic way of doing it, with the PythonOperator explicitly called out. You could conceivably write all your tasks as Python functions, but what if you wanted to create reusable classes like we did in our Python pipeline? You can create your own Airflow Operator really easily!&lt;/p&gt;

&lt;p&gt;Let's start with the class scraper that we had from a few weeks ago. I'll modify it in only 2 spots and create a custom operator.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# fund_scraper_operator.py
import requests
from bs4 import BeautifulSoup
import json
# To turn into an operator, you just need to inherit the BaseOperator
from airflow.models.baseoperator import BaseOperator


class FundScraperOperator(BaseOperator):
    def __init__(self, url, **kwargs):
        # initialize the BaseOperator so Airflow can handle task_id and context
        super().__init__(**kwargs)
        self.url = url

    def web_call(self):  # Extract
        r = requests.get(self.url)
        if r.status_code == 200:
            soup = BeautifulSoup(r.content, "html.parser")
            return soup
        else:
            return r.status_code

    def get_fund_values(self, soup, index, class_name, replace_list):  # Transform
        fund_values = soup.find_all('div', class_=class_name)
        # contents is a list; take the first element as a string before cleaning
        value = str(fund_values[index].contents[0]).strip()
        for x in replace_list:
            value = value.replace(x, '')
        return value

    def write_json(self, data, filename='data.json'):  # Load
        with open(filename, 'w') as f:
            json.dump(data, f, indent=4)
    # You will override "execute" in the BaseOperator with this.
    def execute(self, context):
        soup = self.web_call()
        data = {}
        data['shareprice'] = self.get_fund_values(
            soup, 4, 'fundHeader_value', ['$US', ','])
        data['u3o8_stock'] = self.get_fund_values(
            soup, 6, 'fundHeader_value', ['$US', ','])
        self.write_json(data)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it! You'll notice the only changes from the original script are importing the &lt;code&gt;BaseOperator&lt;/code&gt; class from Airflow and inheriting it. We then "extend" it by adding to the parameters it takes (it'll get all of the DAG/task context from the BaseOperator, so we can't reuse parameter names the BaseOperator already defines) and override the &lt;code&gt;execute&lt;/code&gt; method with our own. It's incredibly easy to make an Operator.&lt;/p&gt;

&lt;p&gt;In your Airflow setup, in your &lt;code&gt;Airflow Home&lt;/code&gt; next to your &lt;code&gt;dags&lt;/code&gt; directory, put this custom operator in a directory called &lt;code&gt;plugins&lt;/code&gt;. Just like &lt;code&gt;dags&lt;/code&gt;, &lt;code&gt;plugins&lt;/code&gt; will be in your PYTHONPATH, so you can import it into a DAG like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from bs4 import BeautifulSoup
import pendulum
from datetime import timedelta

from airflow import DAG
from airflow.decorators import task
from fund_scraper_operator import FundScraperOperator


# Default args applied to each task in the DAG
args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

with DAG(
    dag_id='Functional_Sprott_Scraper',
    schedule_interval='5 20 * * 1-6',
    start_date=pendulum.datetime(2021, 1, 1, tz="UTC"),
    default_args=args,
    render_template_as_native_obj=True,
    tags=['price', 'scraper']
) as dag:

    scrape_task = FundScraperOperator(
                    task_id='scrape_task',
                    url='https://sprott.com/investment-strategies/physical-commodity-funds/uranium/')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Airflow community is very vibrant, so you'll find Operators to interact with a lot of services. Want to write to S3? There are hooks and operators supported by the community for that. Want to query a Postgres database? There are hooks/operators already written for that, too. That's one of the reasons you don't want to create monolith Operators like the one in this article. Before we separate/abstract out our tasks like we did with abstract classes, it's important to talk about Airflow hooks.&lt;/p&gt;

&lt;p&gt;As always, the code is up in Github &lt;a href="https://github.com/CincyBC/bootstrap-to-airflow/blob/main/airflow/plugins/operators/fund_scraper_operator.py"&gt;here&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>python</category>
      <category>programming</category>
      <category>tutorial</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>Scraper Function to Airflow DAG</title>
      <dc:creator>CincyBC</dc:creator>
      <pubDate>Thu, 27 Apr 2023 14:22:00 +0000</pubDate>
      <link>https://dev.to/cincybc/scraper-function-to-airflow-dag-2160</link>
      <guid>https://dev.to/cincybc/scraper-function-to-airflow-dag-2160</guid>
<description>&lt;p&gt;We have a self-created Python pipeline, and we can set up a cron job to run the script daily, at whatever time we want. Back when I started engineering pipelines at work, I was using Django/Celery/Redis, but at home I was using cron. I had some, but limited, visibility into the jobs (mostly in logs) and decided I needed a better tool; that's where &lt;a href="https://airflow.apache.org/docs/" rel="noopener noreferrer"&gt;Airflow&lt;/a&gt; came in.&lt;/p&gt;

&lt;p&gt;Each workflow is captured in a Directed Acyclic Graph (DAG) of tasks that run in the order you lay out (the graph cannot loop back on itself). Some people have called Airflow "cron on steroids," but it's really much more than that. You'll see why in a moment, but it's true that at its core, Airflow is a tool to schedule workflows on a cron-based schedule (there are some aliases, like @daily for every day at midnight instead of &lt;code&gt;0 0 * * *&lt;/code&gt; and @hourly for &lt;code&gt;0 * * * *&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;The classic, verbose way to set up a DAG is like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# From the Airflow tutorial
# The DAG object; we'll need this to instantiate a DAG
from airflow import DAG

# Operators; we need this to operate!
from airflow.operators.python import PythonOperator
with DAG(
    "tutorial_dag",
    # These args will get passed on to each operator
    # You can override them on a per-task basis during operator initialization
    default_args={"retries": 2},
    description="DAG tutorial",
    schedule=None,
    start_date=pendulum.datetime(2021, 1, 1, tz="UTC"),
    catchup=False,
    tags=["example"],
) as dag:
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will create a new workflow with ID "tutorial_dag" in the UI. The tooltip and description at the top of the screen on the DAG view will say "DAG tutorial." If a task fails, it'll retry twice before showing up as failed (something cron doesn't do!). The &lt;code&gt;schedule&lt;/code&gt; (or, in older versions, &lt;code&gt;schedule_interval&lt;/code&gt;) is set to None, so this workflow won't run on any schedule and can only be triggered manually. You could change that parameter to @hourly if you wanted it to run every hour, or &lt;code&gt;15 8 * * *&lt;/code&gt; if you wanted it to run every day at 8:15 UTC. The &lt;code&gt;start_date&lt;/code&gt; is important because your workflow won't run before that date (set to 2021-01-01 here), and if you set &lt;code&gt;catchup=True&lt;/code&gt; it'll backfill every scheduled run between your &lt;code&gt;start_date&lt;/code&gt; and the current date; so be careful! &lt;code&gt;tags&lt;/code&gt; are something you can filter on in the UI to find your DAGs quicker.&lt;/p&gt;
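&lt;p&gt;To see why &lt;code&gt;catchup=True&lt;/code&gt; deserves that warning, you can count the backfill runs a @daily schedule would queue from that start_date. The "today" below is just this post's date, chosen for illustration:&lt;/p&gt;

```python
from datetime import date

# Illustrative only: with catchup=True, a @daily DAG queues one run for
# every day between start_date and now the first time it is enabled.
start_date = date(2021, 1, 1)
today = date(2023, 4, 27)  # hypothetical "now"
missed_runs = (today - start_date).days
print(missed_runs)  # 846 runs would be scheduled
```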

&lt;p&gt;That's it! Of course, there are lots more knobs you can turn, but when you're just getting started you really just need &lt;code&gt;dag_id&lt;/code&gt;, &lt;code&gt;start_date&lt;/code&gt;, and &lt;code&gt;schedule_interval&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;How do you pass in a task? That's also super easy. Using the simple scraping script I wrote first, you would just do the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from airflow import DAG
from airflow.decorators import task
import pendulum
with DAG(
    dag_id='Functional_Sprott_Scraper',
    schedule_interval='5 20 * * 1-6',
    start_date=pendulum.datetime(2021, 1, 1, tz="UTC"),
    default_args=args,
    render_template_as_native_obj=True,
    tags=['price', 'scraper']
) as dag:

    def web_call(url):  # Extract
        import requests
        r = requests.get(url)
        if r.status_code == 200:
            soup: BeautifulSoup = BeautifulSoup(r.content, "html.parser")
            return soup
        else:
            return r.status_code

    def get_fund_values(soup, index, class_name):  # Transform
        fund_values = soup.find_all('div', class_=class_name)
        value = fund_values[index].contents
        return str(value[0]).strip().replace('$US', '').replace(',', '')

    def write_json(data, filename='data.json'):  # Load
        import json
        with open(filename, 'w') as f:
            json.dump(data, f, indent=4)

    @task()
    def execute_scraper():
        soup = web_call(
            url='https://sprott.com/investment-strategies/physical-commodity-funds/uranium/')
        data = {}
        data['shareprice'] = get_fund_values(soup, 4, 'fundHeader_value')
        data['u3o8_stock'] = get_fund_values(soup, 6, 'fundHeader_value')
        write_json(data)

    # calling the decorated function registers the task with the DAG
    execute_scraper()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here is our DAG!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpdzwc4twmjvcwag4r91y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpdzwc4twmjvcwag4r91y.png" alt="Functional_Sprott_Scraper DAG in Airflow"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The @task decorator is short for this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    def execute_scraper():
        soup = web_call(
            url='https://sprott.com/investment-strategies/physical-commodity-funds/uranium/')
        data = {}
        data['shareprice'] = get_fund_values(soup, 4, 'fundHeader_value')
        data['u3o8_stock'] = get_fund_values(soup, 6, 'fundHeader_value')
        write_json(data)

    scrape_task = PythonOperator(task_id='scrape_task', python_callable=execute_scraper)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Most people nowadays just write Python functions to execute in Airflow, so the community created shorthand for calling the most basic operator, the PythonOperator. &lt;/p&gt;

&lt;p&gt;I'm going to detour and talk about Operators next time since they define your tasks. There are also things called hooks and sensors I'll briefly go into.&lt;/p&gt;

&lt;p&gt;As always, the code is on Github &lt;a href="https://github.com/CincyBC/bootstrap-to-airflow/blob/main/airflow/dags/functional_sprott_scraper.py" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>python</category>
      <category>programming</category>
      <category>tutorial</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>Abstract Configurations</title>
      <dc:creator>CincyBC</dc:creator>
      <pubDate>Mon, 24 Apr 2023 14:04:00 +0000</pubDate>
      <link>https://dev.to/cincybc/abstract-configurations-2h11</link>
      <guid>https://dev.to/cincybc/abstract-configurations-2h11</guid>
<description>&lt;p&gt;Quickly reviewing what we've done in the previous post: we took an object that was responsible for the full ETL process, created separate objects for extracting, transforming, and loading, and created a pipeline object called FundScraper to run through the process. However, FundScraper isn't very abstracted. You can put any URL you want in there, but the transformations are quite limited. The pipeline will only work for a single URL, which is a bit of a pain, and while you could swap out the JSONWriter for a different file format, we would need to modify the object to write to a database. Let's refactor it a bit so that it can handle configurations and be slightly more extendable.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# How it currently is
class FundScraper:
    def __init__(self, requester: WebRequester, parser: HTMLParser, transformer: DataTransformer, writer: DataWriter):
        self.requester = requester
        self.parser = parser
        self.transformer = transformer
        self.writer = writer

    def run(self, url, indexes, class_name, filename='data.json'):
        response = self.requester.get(url)
        if response.status_code != 200:
            raise ValueError(f"Error retrieving {url}: {response.status_code}")
        soup = self.parser.parse(response.content)
        data = self.transformer.transform(
            soup.find_all('div', class_=class_name), indexes)
        self.writer.write(data, filename)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In all likelihood, we would probably want to run this across multiple webpages, potentially multiple divs, for multiple values. We could add a for-loop to collect data from multiple places on a single webpage like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;        data = {}
        for index, index_number in enumerate(indexes):
            data = self.transformer.transform(
                soup.find_all('div', class_=class_name), data, index_number, value_names[index])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Modifying the transform method to this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;class FundValueTransformer(DataTransformer):
    def transform(self, values, dictionary: dict, index: int, value_name: str):
        dictionary[value_name] = str(values[index].contents[0]).strip().replace(
            '$US', '').replace(',', '')
        return dictionary
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So now it's able to request a website, pull the contents with beautifulsoup, extract the values (i.e. transform the contents to something useful), and write to JSON.&lt;/p&gt;

&lt;p&gt;How do we give it instructions? We could parse a JSON file, use simple Python parameters, or use YAML (originally "Yet Another Markup Language," now "YAML Ain't Markup Language"), a popular configuration format found in many other applications.&lt;/p&gt;

&lt;p&gt;A config we could use for our example would be like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sprott:
  url: https://sprott.com/investment-strategies/physical-commodity-funds/uranium/
  class_name:
    - fundHeader_value: {4: shareprice, 6: u3o8_stock}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Really simply, keys followed by colons show up as dictionary keys when parsed in Python, and dashes show up as list items. You can also give it an inline dictionary, as I do for the index and value_name pairs. You can see how we could easily add more values, more HTML tags, and more URLs to this config.&lt;/p&gt;
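&lt;p&gt;To make the parsed shape concrete, here is the same config expressed as the Python structure &lt;code&gt;safe_load&lt;/code&gt; returns, walked with nested loops:&lt;/p&gt;

```python
# The dictionary yaml.safe_load produces for the config above.
config = {
    "sprott": {
        "url": "https://sprott.com/investment-strategies/physical-commodity-funds/uranium/",
        "class_name": [
            {"fundHeader_value": {4: "shareprice", 6: "u3o8_stock"}},
        ],
    },
}

# one scraper run per (site, div class) combination
for key, value in config.items():
    for class_name in value["class_name"]:
        for tag, indexes in class_name.items():
            print(key, tag, list(indexes.keys()), list(indexes.values()))
            # prints: sprott fundHeader_value [4, 6] ['shareprice', 'u3o8_stock']
```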

&lt;p&gt;Last, but not least, we have to reconfigure the &lt;code&gt;if __name__ == '__main__':&lt;/code&gt; block at the bottom to parse the YAML. That could be done with the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;if __name__ == '__main__':
    config = safe_load(open('config.yml', 'r'))

    scraper = FundScraper(RequestsWebRequester(), BeautifulSoupHTMLParser(
    ), FundValueTransformer(), JSONDataWriter())

    for key, value in config.items():
        for class_name in value['class_name']:
            for tag, indexes in class_name.items():
                scraper.run(url=value['url'],
                            class_name=tag,
                            indexes=[i for i in indexes.keys()],
                            value_names=[v for v in indexes.values()],
                            filename=f"{key}.json")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It's a bit more complex than the last round, where we just instantiated the scraper and ran it, because now we're running it multiple times over different URLs with a separate configuration for each URL.&lt;/p&gt;

&lt;p&gt;All of this is being done this way for a reason that will be clearer in a couple of weeks after I create an Airflow DAG with it all, but before then, we need to dive into Airflow first at a high level as a workflow orchestrator and, second, its components.&lt;/p&gt;

&lt;p&gt;As always, the code can be found &lt;a href="https://github.com/CincyBC/bootstrap-to-airflow/blob/main/src/configured_abstract_classes.py"&gt;here&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>programming</category>
      <category>python</category>
      <category>dataengineering</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>From Class to Abstract Classes</title>
      <dc:creator>CincyBC</dc:creator>
      <pubDate>Wed, 19 Apr 2023 14:31:00 +0000</pubDate>
      <link>https://dev.to/cincybc/from-class-to-abstract-classes-1ocl</link>
      <guid>https://dev.to/cincybc/from-class-to-abstract-classes-1ocl</guid>
      <description>&lt;p&gt;In the last installment, we created a Python class called &lt;code&gt;FundScraper&lt;/code&gt; that we found wasn't really all that helpful or robust. It was more code with no real added benefits. Reading through the SOLID principles, we decided it's time to create what are called abstract classes.&lt;/p&gt;

&lt;p&gt;If we were to think about what it is our FundScraper is doing in the ETL process, we have something that reaches out to a website, something that parses the website into HTML, something that transforms the HTML into the values we want, and then something that writes to JSON.&lt;/p&gt;

&lt;p&gt;Here that is in abstract classes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from abc import ABC, abstractmethod

class WebRequester(ABC):
    @abstractmethod
    def get(self, url):
        pass


class HTMLParser(ABC):
    @abstractmethod
    def parse(self, content):
        pass


class DataTransformer(ABC):
    @abstractmethod
    def transform(self, data):
        pass


class DataWriter(ABC):
    @abstractmethod
    def write(self, data, filename):
        pass
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Think of these as individual tasks in our pipeline, now abstracted out. In Data Engineering, a concept called idempotence is really important. Essentially, it means that whether a task runs once or 1,000 times, you get the same result. It would be bad if you ran the task to write the JSON 1,000 times and it wrote 1,000 new JSON files instead of writing over the same file. It's the same way in a database: if I reran an INSERT task not knowing it had been run before, depending on the database, I would either get a "duplicate key error" or, in the case of Snowflake, a duplicate record inserted. Imagine a computer somehow looping on that task, unable to get past it until someone stopped it. You'd have to truncate your table and reload from scratch.&lt;/p&gt;
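&lt;p&gt;A tiny illustration of the file-writing case: overwriting is idempotent, appending is not.&lt;/p&gt;

```python
import json
import os
import tempfile

data = {"shareprice": "10.50"}
tmp = tempfile.mkdtemp()
overwrite_path = os.path.join(tmp, "data.json")
append_path = os.path.join(tmp, "data.log")

# Idempotent: mode "w" truncates first, so one run or 1,000 runs
# leave exactly the same file behind.
for _ in range(3):
    with open(overwrite_path, "w") as f:
        json.dump(data, f)

# Not idempotent: mode "a" appends, so every rerun grows the file.
for _ in range(3):
    with open(append_path, "a") as f:
        f.write(json.dumps(data))

overwrite_size = os.path.getsize(overwrite_path)
append_size = os.path.getsize(append_path)
print(overwrite_size, append_size)  # the appended file is three times larger
```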

&lt;p&gt;How does that work with a web scraper where the underlying website is always changing? Absent something like the Wayback Machine, which stores snapshots of websites as they looked on certain days (but not every day), there is no way for our task to capture the same page again. The same goes for databases; unless we take a snapshot, there is no real way to run an extraction query and get the same result 5 minutes later. Thankfully, databases offer Change Data Capture (CDC) and Change Tracking (CT), which let us use the database's logs to recreate what it looked like at a point in time; but that's a much deeper/longer conversation.&lt;/p&gt;

&lt;p&gt;Now that we're thinking of these as tasks, we can go one step further and think of another website we want to extract information from. We could have the same WebRequester/HTMLParser (with a different URL) and same DataWriter, but with a different DataTransformer more suited to the values we're working with. We could call it WebSite2Scraper and pull in what we need.&lt;/p&gt;

&lt;p&gt;But here is our FundScraper:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;class FundScraper:
    def __init__(self, requester: WebRequester, parser: HTMLParser, transformer: DataTransformer, writer: DataWriter):
        self.requester = requester
        self.parser = parser
        self.transformer = transformer
        self.writer = writer

    def run(self, url, indexes, class_name, filename='data.json'):
        response = self.requester.get(url)
        if response.status_code != 200:
            raise ValueError(f"Error retrieving {url}: {response.status_code}")
        soup = self.parser.parse(response.content)
        data = self.transformer.transform(
            soup.find_all('div', class_=class_name), indexes)
        self.writer.write(data, filename)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You'll see it's coupled only to the abstract classes we created above. You'll also notice the Python packages &lt;code&gt;requests, BeautifulSoup,&lt;/code&gt; and &lt;code&gt;json&lt;/code&gt; are all absent. In keeping with SOLID principles, we've fully decoupled them into new classes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;class RequestsWebRequester(WebRequester):
    def get(self, url):
        return requests.get(url)


class BeautifulSoupHTMLParser(HTMLParser):
    def parse(self, content):
        return BeautifulSoup(content, "html.parser")


class FundValueTransformer(DataTransformer):
    def transform(self, values, indexes):
        return {"values": [str(values[i].contents[0]).strip().replace('$US', '').replace(',', '')
                           for i in indexes]}


class JSONDataWriter(DataWriter):
    def write(self, data, filename):
        with open(filename, 'w') as f:
            json.dump(data, f, indent=4)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;How do we call this now? It's actually very similar to what we were doing before, but we instantiate our FundScraper class with the exact components we want it to use in our pipeline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;scraper = FundScraper(RequestsWebRequester(), BeautifulSoupHTMLParser(
), FundValueTransformer(), JSONDataWriter())

scraper.run(url='https://sprott.com/investment-strategies/physical-commodity-funds/uranium/',
            indexes=[4, 6],
            class_name='fundHeader_value',
            filename='data.json')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Since we've decoupled everything and only call what we need, we instantiate our FundScraper with the exact tasks we want and then run it on our values. Note that this code produces a slightly different JSON than our last version, since it hard-codes the key as "values." This could have been changed by using a dictionary rather than a list.&lt;/p&gt;
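&lt;p&gt;As a hedged sketch of that dictionary idea (the &lt;code&gt;NamedFundValueTransformer&lt;/code&gt; name and the fake element are inventions for illustration, not code from the repo):&lt;br&gt;&lt;/p&gt;

```python
from abc import ABC, abstractmethod

class DataTransformer(ABC):
    @abstractmethod
    def transform(self, values, indexes):
        pass

class NamedFundValueTransformer(DataTransformer):
    # Hypothetical variant: maps each index to a named key up front,
    # so the JSON keys become 'shareprice'/'u3o8_stock' instead of 'values'.
    def __init__(self, fields):
        self.fields = fields  # e.g. {'shareprice': 4, 'u3o8_stock': 6}

    def transform(self, values, indexes=None):
        return {name: str(values[i].contents[0]).strip().replace('$US', '').replace(',', '')
                for name, i in self.fields.items()}

# Tiny offline stand-in for bs4 elements, for demonstration only.
class FakeElement:
    def __init__(self, text):
        self.contents = [text]

values = [FakeElement(f'$US{i},000 ') for i in range(8)]
result = NamedFundValueTransformer({'shareprice': 4, 'u3o8_stock': 6}).transform(values)
```

&lt;p&gt;Swapping this transformer into FundScraper would require no changes to the other classes; that's the payoff of depending on abstractions.&lt;/p&gt;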

&lt;p&gt;As always, you can find this code on Github &lt;a href="https://github.com/CincyBC/bootstrap-to-airflow/blob/main/src/abstract_classes.py"&gt;here&lt;/a&gt;. Next, we'll continue with a little more abstraction and then we'll jump into the two ways we could create our Airflow DAGs given everything we've discussed.&lt;/p&gt;

</description>
      <category>python</category>
      <category>beginners</category>
      <category>programming</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>From Functional to Class: a look at SOLID coding</title>
      <dc:creator>CincyBC</dc:creator>
      <pubDate>Sun, 16 Apr 2023 14:29:00 +0000</pubDate>
      <link>https://dev.to/cincybc/from-functional-to-class-a-look-at-solid-coding-5477</link>
      <guid>https://dev.to/cincybc/from-functional-to-class-a-look-at-solid-coding-5477</guid>
      <description>&lt;p&gt;So now we have a set of unit tested functions that make up our extract from the website, transform into usable numbers, and load to a JSON file, but the functions aren't very scalable. What if we established a FundScraper class to use not just at the one website, but other similar fund website?&lt;/p&gt;

&lt;p&gt;A class in Python isn't hard to create. We could create one just by declaring &lt;code&gt;class FundScraper&lt;/code&gt; and passing in something like the URL when we instantiate it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;class FundScraper:
    def __init__(self, url):
        self.url = url
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice we don't put the args up in the class statement like this: &lt;code&gt;class FundScraper(url)&lt;/code&gt;. That's because this would imply it inherits from another class called &lt;code&gt;url&lt;/code&gt;, which is not our intention.&lt;/p&gt;
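&lt;p&gt;A toy sketch of the difference (the generic &lt;code&gt;Scraper&lt;/code&gt; base class here is hypothetical):&lt;br&gt;&lt;/p&gt;

```python
# Parentheses in a class statement name a base class to inherit from;
# per-instance arguments belong in __init__ instead.
class Scraper:  # hypothetical generic base class
    def __init__(self, url):
        self.url = url

class FundScraper(Scraper):  # this DOES mean "FundScraper inherits Scraper"
    def __init__(self, url, class_name):
        super().__init__(url)  # let the base class store the url
        self.class_name = class_name

scraper = FundScraper('https://example.com', 'fundHeader_value')
```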

&lt;p&gt;We could bring in the rest of our functions from the functional scraper article and be pretty much finished. What we would need is to replace the runner that we put after &lt;code&gt;if __name__ == '__main__':&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Again, this is very easy to do. We would just create another method like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    def run(self):
        soup = self.web_call()
        data = {}
        data['shareprice'] = self.get_fund_values(
            soup, 4, 'fundHeader_value', ['$US', ','])
        data['u3o8_stock'] = self.get_fund_values(
            soup, 6, 'fundHeader_value', ['$US', ','])
        self.write_json(data)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can find the full scraper class &lt;a href="https://github.com/CincyBC/bootstrap-to-airflow/blob/main/src/scraper_class.py"&gt;here&lt;/a&gt;. You would run this by importing the FundScraper with the URL and then using the run method like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from scrapper_class import FundScraper

scraper = FundScraper(url='https://sprott.com/investment-strategies/physical-commodity-funds/uranium/')
scraper.run()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;However, if you wanted to run this on another website, it wouldn't really work unless it were set up exactly the same way and you were extracting the same exact information.&lt;/p&gt;

&lt;p&gt;That's because our class isn't aligned with SOLID principles. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;S – Single Responsibility Principle (SRP)&lt;/strong&gt;&lt;br&gt;
This doesn't mean a class can only have a single method or function, but rather that it has only one responsibility. FundScraper is responsible for Extracting, Transforming, and Loading: three responsibilities.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;O – Open-Closed Principle (OCP)&lt;/strong&gt;&lt;br&gt;
This means that classes should be "open for extension, but closed for modification." Remember how I said that putting an arg in the class statement like &lt;code&gt;class FundScraper(url)&lt;/code&gt; meant inheriting from another class? That is how you extend a class. If we had a simple, generic Scraper class that only scraped/extracted the webpage, we could inherit from it; that would be an example of extending a class. Even simple modifications to a base class can have a large blast radius, so we try to extend instead of modify.&lt;/p&gt;
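&lt;p&gt;Here's a minimal sketch of that extension idea (the generic &lt;code&gt;Scraper&lt;/code&gt; base class and its canned &lt;code&gt;fetch&lt;/code&gt; are stand-ins for illustration, not the repo's code):&lt;br&gt;&lt;/p&gt;

```python
class Scraper:
    # Hypothetical generic base class: it only fetches and hands off the page.
    def fetch(self, url):
        return f'page at {url}'  # stand-in for a real HTTP call

    def run(self, url):
        return self.process(self.fetch(url))

    def process(self, page):
        return page  # base behavior: no transformation

class FundScraper(Scraper):
    # Extension, not modification: the base class stays untouched;
    # we only override the one step we need to specialize.
    def process(self, page):
        return page.upper()  # stand-in for real value extraction

result = FundScraper().run('https://example.com')
```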

&lt;p&gt;&lt;strong&gt;L – Liskov Substitution Principle (LSP)&lt;/strong&gt;&lt;br&gt;
This principle says that any subclass should be usable wherever its base class is expected, without breaking the program. Our FundScraper is hard-wired to bs4, requests, and json and simply assumes they will all work properly. If it depended on abstract classes or interfaces instead, any conforming implementation could be substituted in, making FundScraper more robust.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;I – Interface Segregation Principle (ISP)&lt;/strong&gt;&lt;br&gt;
This says that no class should be forced to implement interface methods it does not use. FundScraper is fine here, since all of its methods are used.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;D – Dependency Inversion Principle (DIP)&lt;/strong&gt;&lt;br&gt;
This last principle says that high-level classes should not depend on low-level classes; rather, both should depend on abstractions. FundScraper is tightly coupled to BeautifulSoup, requests, and json, which makes it difficult to test in isolation.&lt;/p&gt;

&lt;p&gt;Next time, we'll build out the scraper with abstract classes and discuss some principles of Data Engineering before I do a high-level intro to Airflow and finally convert the abstract classes into custom operators and a Directed Acyclic Graph (DAG).&lt;/p&gt;

</description>
      <category>python</category>
      <category>beginners</category>
      <category>programming</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>Quick Detour on Unit Testing with PyTest</title>
      <dc:creator>CincyBC</dc:creator>
      <pubDate>Thu, 13 Apr 2023 15:14:00 +0000</pubDate>
      <link>https://dev.to/cincybc/quick-detour-on-unit-testing-with-pytest-16i9</link>
      <guid>https://dev.to/cincybc/quick-detour-on-unit-testing-with-pytest-16i9</guid>
      <description>&lt;p&gt;We strive to make functions of a single responsibility in our code. You'll notice in the previous post, we had 3 functions that had one responsibility: Grab the webpage, clean up the values, and write a json. We didn't create a mammoth function to do all 3 at once, though our simple script from a couple of posts ago essentially did just that.&lt;/p&gt;

&lt;p&gt;As we create our functions, it's good to create tests to make sure each unit in the system is functioning as expected. When a system gets large, it becomes harder and harder to debug where problems are occurring. Edge cases will pop up that our functions can't handle, and the script will fail. Unit testing allows us to look at the inputs/outputs of a function, dream up possibilities, and ensure the function can handle them. If we need to modify the function in the future, we can check that it still handles the scenarios we first presented it with.&lt;/p&gt;

&lt;p&gt;What does this process look like?&lt;/p&gt;

&lt;p&gt;In the order of operations, ideally you:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Create unit test(s) with different types of inputs and their expected outputs&lt;/li&gt;
&lt;li&gt;Create your function like this:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def my_function(your_inputs):
   pass
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="3"&gt;
&lt;li&gt;Run your unit test(s) and watch them fail.&lt;/li&gt;
&lt;li&gt;Replace &lt;code&gt;pass&lt;/code&gt; in the function with what it is supposed to achieve (e.g. grab a website and return it).&lt;/li&gt;
&lt;li&gt;Run your unit test(s)&lt;/li&gt;
&lt;li&gt;Alter your function and run your unit test(s) until all of the unit tests are passing&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Congratulations! You've just gone through the process of Test Driven Development (TDD).&lt;/p&gt;

&lt;p&gt;Let's take a look at the web_call function we created last time and make some unit tests for it. Before I begin though, I have to add an important point about unit tests: They are to be self-contained; no internet connection required.&lt;/p&gt;

&lt;p&gt;How can we test a function that makes a web_call without the web?&lt;/p&gt;

&lt;p&gt;Through mocking. First, I'm going to create a PyTest "fixture" (a reusable piece of test setup) that intercepts the get request: instead of actually making the request, it returns whatever we tell it to.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import requests
from unittest.mock import patch
from bs4 import BeautifulSoup
import json
import pytest
from functional_scraper import web_call, get_fund_values, write_json


@pytest.fixture
def mock_requests_get():
    with patch.object(requests, 'get') as mock_get:
        yield mock_get


@pytest.fixture
def mock_requests_failed_get():
    with patch.object(requests, 'get') as mock_get:
        mock_get.return_value.status_code = 400
        yield mock_get
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You'll notice I have 2 mocks: one for a successful call and one for a call that returns a 400 status code (a client-side failure). We could easily make a dozen more of these with different status codes, but that wouldn't provide any additional value; these two cover the two branches of the conditional, which should suffice.&lt;/p&gt;
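&lt;p&gt;If you ever did want to sweep several failure codes, one hedged sketch (re-stating &lt;code&gt;web_call&lt;/code&gt; in simplified form and looping over mocks rather than writing a fixture per code):&lt;br&gt;&lt;/p&gt;

```python
from unittest.mock import patch
import requests

def web_call(url):
    # Simplified restatement of the article's function, for illustration.
    r = requests.get(url)
    if r.status_code == 200:
        return r.content
    return r.status_code

def check_failure_codes(codes):
    # Sweep several mocked failure codes; no network calls are made.
    results = []
    for code in codes:
        with patch.object(requests, 'get') as mock_get:
            mock_get.return_value.status_code = code
            results.append(web_call('http://www.example.com'))
    return results

result = check_failure_codes([400, 404, 500])
```

&lt;p&gt;In real test code, &lt;code&gt;pytest.mark.parametrize&lt;/code&gt; would express the same sweep more idiomatically, one test case per code.&lt;/p&gt;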

&lt;p&gt;What are our unit tests?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def test_web_call(mock_requests_get):
    mock_requests_get.return_value.status_code = 200
    mock_requests_get.return_value.content = '&amp;lt;html&amp;gt;&amp;lt;head&amp;gt;&amp;lt;/head&amp;gt;&amp;lt;body&amp;gt;&amp;lt;div&amp;gt;&amp;lt;/div&amp;gt;&amp;lt;/body&amp;gt;&amp;lt;/html&amp;gt;'
    result = web_call('http://www.example.com')
    assert isinstance(result, BeautifulSoup)


def test_failed_web_call(mock_requests_failed_get):
    result = web_call('http://www.example.com')
    assert result == 400
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You'll see in the first one, we return a status_code and some basic html content. We then make sure the function returned the BeautifulSoup content as it should.&lt;/p&gt;

&lt;p&gt;In the second unit test, we test the other side, where we don't get a successful call to the webpage and get a 400 status_code.&lt;/p&gt;

&lt;p&gt;Both of these unit tests pass. I have also set up a pipeline with GitHub Actions to run my unit tests every time I push new code to the repo!&lt;/p&gt;

&lt;p&gt;Join me again next time when we turn our function into a class.&lt;/p&gt;

&lt;p&gt;As always, you can find all of my code here in the &lt;a href="https://github.com/CincyBC/bootstrap-to-airflow/blob/main/tests/test_functional_scraper.py"&gt;Github repo&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>beginners</category>
      <category>programming</category>
      <category>python</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>Bootstrapped to Functional</title>
      <dc:creator>CincyBC</dc:creator>
      <pubDate>Mon, 10 Apr 2023 14:34:00 +0000</pubDate>
      <link>https://dev.to/cincybc/bootstrapped-to-functional-1kfm</link>
      <guid>https://dev.to/cincybc/bootstrapped-to-functional-1kfm</guid>
      <description>&lt;p&gt;In the last installment, we had a simple 10 line script (excluding imports) that pulled or extracted the share value and u3o8 stock from a website. It then cleaned up or transformed those values into numbers and printed them to the screen. You'll notice though that even in those 10 lines of code, there was some repetition.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;fund_values = soup.find_all('div', class_='fundHeader_value')
shareprice = fund_values[4].contents
shareprice_value = str(shareprice[0]).strip().replace('$US', '')

u3o8 = fund_values[6].contents
u3o8_stock = str(u3o8[0]).strip().replace(',', '')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both share price and u3o8 stock are essentially the same operation: reach into the 'div' called &lt;code&gt;'fundHeader_value'&lt;/code&gt; at an index and then clean up the string. This violates a general rule of good coding practice called "DRY," or "Don't Repeat Yourself." Early in my coding days, my mentor used to tell me: if you have to do it twice, create a function. Creating these functions is the first step toward abstraction and more robust, scalable code!&lt;/p&gt;

&lt;p&gt;How can we break this code up into functions? In any data processing, I like to think through the common lens of extraction, transformation, and loading or ETL. It's also common nowadays to find ELT, especially in streaming pipelines, so don't get caught up in what order those letters are in as long as you're extracting first.&lt;/p&gt;

&lt;p&gt;What's our extraction? It's pulling the webpage down into BeautifulSoup.&lt;/p&gt;

&lt;p&gt;What's our Transformation? It's processing the string and producing a clean number.&lt;/p&gt;

&lt;p&gt;What's our Load? It's just loading it to the screen in a print statement.&lt;/p&gt;

&lt;p&gt;Let's turn these into functions:&lt;br&gt;
Extract&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def web_call(url):  # Extract
    r = requests.get(url)
    if r.status_code == 200:
        soup: BeautifulSoup = BeautifulSoup(r.content, "html.parser")
        return soup
    else:
        return r.status_code
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now we have a generic function that accepts any URL, not just the Sprott URL, and does in 1 line of code what previously took the 6 lines above.&lt;/p&gt;

&lt;p&gt;Here is how you'd do it for Sprott:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;soup = web_call(
        url='https://sprott.com/investment-strategies/physical-commodity-funds/uranium/')

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it!&lt;/p&gt;

&lt;p&gt;Transform&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def get_fund_values(soup, index, class_name):  # Transform
    fund_values = soup.find_all('div', class_=class_name)
    value = fund_values[index].contents
    return str(value[0]).strip().replace('$US', '').replace(',', '')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is only slightly abstracted (not as generic), but instead of pulling the soup content, getting the contents, and cleaning them n times, you can now do it n times with one line of code each.&lt;/p&gt;

&lt;p&gt;Here is how you'd do it now for share price:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;shareprice_value = get_fund_values(soup, 4, 'fundHeader_value')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We have 2 values we're grabbing but there are more on the website. We could grab each value with a single line of code.&lt;/p&gt;
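&lt;p&gt;For example, a small wrapper could grab any number of values into a dictionary (a sketch; the field names, indexes, and fake soup object are stand-ins so the snippet runs offline):&lt;br&gt;&lt;/p&gt;

```python
def get_fund_values(soup, index, class_name):  # as defined earlier in the post
    fund_values = soup.find_all('div', class_=class_name)
    value = fund_values[index].contents
    return str(value[0]).strip().replace('$US', '').replace(',', '')

def get_all_fund_values(soup, fields, class_name='fundHeader_value'):
    # fields maps an output name to its index, e.g. {'shareprice': 4}.
    return {name: get_fund_values(soup, i, class_name)
            for name, i in fields.items()}

# Tiny offline stand-in for a BeautifulSoup object, for demonstration only.
class _FakeSoup:
    class _El:
        def __init__(self, text):
            self.contents = [text]

    def find_all(self, tag, class_=None):
        return [self._El(f'$US{i} ') for i in range(10)]

result = get_all_fund_values(_FakeSoup(), {'shareprice': 4, 'u3o8_stock': 6})
```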

&lt;p&gt;Load is just a Python function called &lt;code&gt;print()&lt;/code&gt; at the moment, but let's make this useful. Why don't we create a JSON file with the two values we grab? Here is a simple function that will write a JSON file:&lt;/p&gt;

&lt;p&gt;Load&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def write_json(data, filename='data.json'):  # Load
    with open(filename, 'w') as f:
        json.dump(data, f, indent=4)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Because everything is now in functions, the script won't do anything just by running top to bottom; so we tell Python that when the script is run directly, here is its course of action:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;if __name__ == '__main__':
    soup = web_call(
        url='https://sprott.com/investment-strategies/physical-commodity-funds/uranium/')
    data = {}
    data['shareprice'] = get_fund_values(soup, 4, 'fundHeader_value')
    data['u3o8_stock'] = get_fund_values(soup, 6, 'fundHeader_value')
    write_json(data=data)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will call the website, create a dictionary called data, fill the dictionary with the 2 values, and write the dictionary to a JSON file.&lt;/p&gt;

&lt;p&gt;What's really cool about these functions is that they're no longer just usable in this script. We can import these functions into other Python scripts and create similar scrapers!&lt;/p&gt;

&lt;p&gt;Code is below and in Github &lt;a href="https://github.com/CincyBC/bootstrap-to-airflow/blob/main/src/functional_scraper.py"&gt;here&lt;/a&gt;. Join me next time for a quick chat about Unit Tests and then off to the land of Object Oriented Programming (OOP) to see how we can turn this into a Scraper Class that will form the baseline for our Airflow DAG. &lt;/p&gt;

</description>
      <category>python</category>
      <category>dataengineering</category>
      <category>beginners</category>
    </item>
    <item>
      <title>The Web Scraping Continuum</title>
      <dc:creator>CincyBC</dc:creator>
      <pubDate>Tue, 04 Apr 2023 02:00:04 +0000</pubDate>
      <link>https://dev.to/cincybc/the-web-scraping-continuum-2o74</link>
      <guid>https://dev.to/cincybc/the-web-scraping-continuum-2o74</guid>
      <description>&lt;p&gt;Over the next few weeks, I'm going to be evolving simple code that scrapes a website and turn it into a more robust set of tasks performing an ETL (Extract Transform Load) process orchestrated by the widely used open source software Airflow. All of the code can be found (here)[&lt;a href="https://github.com/CincyBC/bootstrap-to-airflow"&gt;https://github.com/CincyBC/bootstrap-to-airflow&lt;/a&gt;] in Github.&lt;/p&gt;

&lt;p&gt;First, a couple bits on Web Scraping.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Web Scraping?
&lt;/h2&gt;

&lt;p&gt;Web scraping refers to pulling information off the web whether it be a bot or someone paid to copy and paste the daily stock price off a website into an Excel spreadsheet. In an ideal world, all data you need would be available from an API (Application Programming Interface) where you post your request and get a response with the data you need, but setting up and maintaining APIs for general consumption costs money.&lt;/p&gt;

&lt;p&gt;There are a couple things to consider when thinking about web scraping:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Check your local jurisdiction for any legal restrictions. In the United States, you're not, for example, allowed to scrape copyrighted Getty Images pictures from the web and sell t-shirts with the image on them without the consent of Getty Images.&lt;/li&gt;
&lt;li&gt;Check the website for any restrictions or rate limitations (sometimes in a robots.txt file). This isn't a problem when you are manually pulling data from a website, but if you have a script hitting a website several times a second, you could be impacting the performance of the website, which could lead your IP Address to get blocked from accessing the website.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  The Web Scraping Continuum
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--yHSufR8u--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/hexyeky6g9oifub1baun.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--yHSufR8u--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/hexyeky6g9oifub1baun.png" alt="Image showing the continuum from hidden apis to people manually scraping" width="880" height="232"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now that we have that out of the way, it's time to talk about the web scraping continuum. On one extreme, you have people paid to pull information from websites manually and regularly. I worked at a company where someone was paid to write a report each day on what happened in the market, and that person would manually go to a list of websites and pull information every day. One step away from this is web scraping with a package like Selenium, which does the same thing as this person, but with the button pushing done in an automated browser.&lt;/p&gt;

&lt;p&gt;On the other end, especially with SPAs (Single Page Applications), you can skip loading web pages and use the API the pages themselves use, as if it were a fully accessible API. How? &lt;a href="https://blog.devgenius.io/scrape-data-without-selenium-by-exposing-hidden-apis-946b23850d47"&gt;Here is a good writeup.&lt;/a&gt; Sometimes this "hidden API" doesn't return a nice JSON but rather a block of &lt;code&gt;html&lt;/code&gt; you need to parse. This is inconvenient, but the blocks of html usually stay fairly consistent over time, so it's still fairly stable.&lt;/p&gt;

&lt;p&gt;Right in the middle of the continuum is good ol' BeautifulSoup where you make a request out to an entire webpage (not just a block returned from a hidden api) and parse it all for the bits that you want. This is actually where our series will start; with building a web scraper in 10 lines of code with BeautifulSoup. Join me as we turn this simple 10 lines of code into a robust Airflow DAG with custom operators.&lt;/p&gt;

</description>
      <category>python</category>
      <category>programming</category>
      <category>tutorial</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>How to get Started with Data And Help Your Community</title>
      <dc:creator>CincyBC</dc:creator>
      <pubDate>Tue, 24 Jan 2023 03:43:21 +0000</pubDate>
      <link>https://dev.to/cincybc/how-to-get-started-with-data-and-help-your-community-d67</link>
      <guid>https://dev.to/cincybc/how-to-get-started-with-data-and-help-your-community-d67</guid>
      <description>&lt;p&gt;(Previously published at &lt;a href="https://www.adventofdata.com" rel="noopener noreferrer"&gt;www.adventofdata.com&lt;/a&gt;)&lt;br&gt;
Turning the page on 2022, it's time to think of your New Year's Resolutions. For those of you who want to up your skills in Data Engineering, Machine Learning, and Data Analytics, why not build your skills on public sector datasets rather than on random Kaggle datasets? There is no shortage in this world of problems or data, but there is a shortage of professionals with an interest to evaluate them, especially in the public and non-profit sectors!&lt;/p&gt;

&lt;p&gt;Let's pick an issue, like housing, and see what we can learn in the process.&lt;/p&gt;

&lt;h2&gt;
  
  
  Problem Statement
&lt;/h2&gt;

&lt;p&gt;Ever since the Great Recession, corporations have been snapping up single-family homes across the US and are causing a housing crisis; especially in "post-industrial" cities like Cincinnati, Ohio. Laura Brunner, President of the Port of Cincinnati, said in testimony in front of Congress that the once privately held homes make &lt;a href="https://www.wcpo.com/news/local-news/i-team/vinebrook-homes-ranks-first-in-hamilton-county-eviction-filings-heres-how-tenants-are-fighting-back" rel="noopener noreferrer"&gt;"a cash cow for investors but a money pit for renters."&lt;/a&gt; Some have even leveraged technology to manage properties in dispersed geographies from afar with minimal staff, becoming what this author calls, &lt;a href="https://www.vice.com/en/article/dy7eaw/robot-landlords-are-buying-up-houses" rel="noopener noreferrer"&gt;the "automated landlord."&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Since 2008, which neighborhoods have seen more single-family properties go from individual ownership to corporate ownership adjusting for size of neighborhoods?&lt;/p&gt;

&lt;h2&gt;
  
  
  Solution
&lt;/h2&gt;

&lt;p&gt;There were roughly 60,000 transfers of single-family properties between 2008 and October, 2022, which I classified as going from a corporate entity to another corporate entity, corporate entity to person, person to corporate entity, or person to person. I took the difference of person -&amp;gt; corporate minus corporate -&amp;gt; person to calculate the net of how many of the properties went from individual ownership to corporate ownership. After geocoding and aggregating by neighborhood, I adjusted the raw counts for relative size of the neighborhood and found neighborhoods like Corryville, CUF, Avondale, and North Fairmount were the biggest targets for corporate ownership while neighborhoods like OTR and Pendleton went in the reverse direction.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsrztohtctw8mfoe1zetp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsrztohtctw8mfoe1zetp.png" alt="Image description"&gt;&lt;/a&gt;&lt;br&gt;
(Description: The neighborhoods of Corryville, CUF, Avondale, and North Fairmount have the most single-family properties moving to corporate ownership adjusted for size.)&lt;/p&gt;

&lt;p&gt;What skills were used in the process and what could you learn if you were to replicate this?&lt;/p&gt;

&lt;h2&gt;
  
  
  Data Engineering
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Data Sources:&lt;/strong&gt;&lt;br&gt;
&lt;a href="http://cagis.org/Opendata/" rel="noopener noreferrer"&gt;Cincinnati Area Geographic Information System (CAGIS) Geodatabase&lt;/a&gt;&lt;br&gt;
&lt;a href="https://www.hamiltoncountyauditor.org/TaxInformationPriorYear.asp" rel="noopener noreferrer"&gt;Property Ownership Snapshots&lt;/a&gt;&lt;br&gt;
&lt;a href="https://www.hamiltoncountyauditor.org/transfer_download_menu.asp" rel="noopener noreferrer"&gt;Property Transfer Information&lt;/a&gt;&lt;br&gt;
&lt;a href="https://codes.ohio.gov/ohio-administrative-code/rule-5703-25-10" rel="noopener noreferrer"&gt;Property Use Code Mapping&lt;/a&gt; (Single Family are 510)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Skills to learn&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Batch Data Processing — Many organizations still process data in batches, whether snapshots or event processing (transfers). You could just process the snapshots at the beginning and end and see an increase/decrease in ownership, but the story could be in the transfers if there is a buying/selling frenzy. I used transfer counts in the visualizations above.&lt;/li&gt;
&lt;li&gt;Data Modeling (Normalization) — All of the files are flat files, which means you'll find a lot of redundancy. Where there is redundancy, there will be errors. Notice how "AUX FUNDING LLC" (the automated landlord from the Vice story) has the same owner address, but spelled differently. This can cause problems in data analyses depending on the problem you are trying to solve. Normalizing the owners by breaking them out into their own table will reduce those types of headaches.&lt;/li&gt;
&lt;li&gt;Data Modeling (Slowly Changing Dimensions) — Try creating a master ownership table utilizing type II SCDs with an effective_date field (when they acquired the property), an end_date field (when they gave up ownership), and a current_owner flag (boolean).&lt;/li&gt;
&lt;li&gt;Batch Data Pipeline — Transfer files are updated a couple times per month. Build a pipeline using &lt;a href="https://airflow.apache.org/docs/apache-airflow/stable/index.html" rel="noopener noreferrer"&gt;Airflow&lt;/a&gt; that extracts from the website and upserts new data into your tables!&lt;/li&gt;
&lt;/ul&gt;
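&lt;p&gt;To make the type II SCD idea concrete, here's a minimal Python sketch (the parcel ID, owners, and dates are made up for illustration; a real pipeline would do this with a SQL merge/upsert):&lt;/p&gt;

```python
from datetime import date

def apply_transfer(dim_rows, parcel_id, new_owner, transfer_date):
    """Close out the current owner row for a parcel and open a new one (type II SCD)."""
    for row in dim_rows:
        if row["parcel_id"] == parcel_id and row["current_owner"]:
            row["end_date"] = transfer_date      # when they gave up ownership
            row["current_owner"] = False
    dim_rows.append({
        "parcel_id": parcel_id,
        "owner": new_owner,
        "effective_date": transfer_date,         # when they acquired the property
        "end_date": None,
        "current_owner": True,
    })

# One existing owner, then a transfer to the LLC from the Vice story.
dim = [{"parcel_id": "001-0001", "owner": "SMITH JOHN",
        "effective_date": date(2015, 3, 1), "end_date": None, "current_owner": True}]
apply_transfer(dim, "001-0001", "AUX FUNDING LLC", date(2023, 6, 1))
```

&lt;p&gt;The closed-out row keeps its &lt;code&gt;effective_date&lt;/code&gt;/&lt;code&gt;end_date&lt;/code&gt; window, so you can always answer "who owned this parcel on a given date?"&lt;/p&gt;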

&lt;h2&gt;Machine Learning&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Tools:&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://huggingface.co/spacy/en_core_web_sm" rel="noopener noreferrer"&gt;Hugging Face&lt;/a&gt;&lt;br&gt;
&lt;a href="https://spacy.io/" rel="noopener noreferrer"&gt;SpaCy&lt;/a&gt;&lt;br&gt;
&lt;a href="https://scikit-learn.org/stable/install.html" rel="noopener noreferrer"&gt;Scikit-Learn&lt;/a&gt;&lt;br&gt;
&lt;a href="https://www.mlflow.org/docs/latest/index.html" rel="noopener noreferrer"&gt;MLFlow&lt;/a&gt;&lt;br&gt;
There is no flag to discern a human owner from a corporate entity, so you have to figure it out on your own. ML can assist, given there are tens of thousands of records to go through.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Skills to Learn&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tagging — This is a traditional classification problem, so one way to go about it is to go through your data and tag it. I would recommend starting with a brute force SQL query to flag owners with words typically associated with US companies like '% LLC%', '% INC%', '% TRUST%', etc. You have to be careful with this approach or "SMITH V INC E" and "DI CORP IO LISA" will be listed as corporate entities.&lt;/li&gt;
&lt;li&gt;Modeling — Once you have a decent number tagged, pre-process the data. Models understand numbers, not strings, so you have to vectorize the data. The simplest, lowest-compute approach is to create a dimension for every word that appears using a TF-IDF vectorizer and then run your vectorized data through Scikit-Learn models.&lt;/li&gt;
&lt;li&gt;NER — Named Entity Recognition is a well-developed field in Natural Language Processing, and Hugging Face hosts plenty of pre-trained models you can try on this problem.&lt;/li&gt;
&lt;/ul&gt;
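&lt;p&gt;As a sketch of the brute-force tagging pitfall above: one way to cut down the false positives is to only flag a keyword when it is the final token of the owner name (the names and the suffix list here are illustrative, not exhaustive):&lt;/p&gt;

```python
import re

# Hypothetical keyword list; extend it for your data (CO, LTD, LP, ...).
ENTITY_SUFFIXES = {"LLC", "INC", "CORP", "TRUST", "LTD", "CO"}

def looks_corporate(owner_name):
    """Flag an owner as a likely corporate entity.

    Only treats a keyword as a hit when it is the final token of the name,
    which avoids false positives like middle initials ("SMITH V INC E").
    """
    tokens = [t for t in re.split(r"[^A-Z0-9]+", owner_name.upper()) if t]
    return bool(tokens) and tokens[-1] in ENTITY_SUFFIXES
```

&lt;p&gt;A heuristic like this still misses edge cases, which is exactly why the tagged output is worth feeding into a proper classifier.&lt;/p&gt;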

&lt;h2&gt;Data Analytics&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Sources/Tools:&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://cagisonline.hamilton-co.org/arcgis/rest/services/CINC_PLANNING/Munitwps_Neighborhoods/MapServer/1/query?where=1%3D1&amp;amp;outFields=AREA_,PERIMETER,NEIGH_BND_,NEIGH,NEIGH_BOUN&amp;amp;outSR=4326&amp;amp;f=json" rel="noopener noreferrer"&gt;Neighborhood Geojson&lt;/a&gt; (API Endpoint)&lt;br&gt;
&lt;a href="https://geopandas.org/en/stable/docs/user_guide.html" rel="noopener noreferrer"&gt;Geopandas&lt;/a&gt;&lt;br&gt;
&lt;a href="https://plotly.com/python/choropleth-maps/" rel="noopener noreferrer"&gt;Plotly&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Skills to Learn&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GIS — The CAGIS dataset at the top of the article has polygon coordinates for every single-family parcel in Cincinnati (filter by tax district: 1, class: 510). To get counts by neighborhood, join on the CAGIS index with geopandas.&lt;/li&gt;
&lt;li&gt;Statistics — After you have counts for the transfers by neighborhood, you'll find that a couple of neighborhoods have a lot more than others. Is that an insight? Not necessarily. The City of Cincinnati paid a consultant tens of thousands of dollars to study the residential tax abatement program in Cincinnati, and it also found that certain neighborhoods had high counts. What's wrong with just taking counts? The denominator. A neighborhood like Westwood (upwards of 6000 single family properties) can't be compared with Winton Hills (just under 100 single family properties) on a 1:1 basis, i.e. 400 transfers among Westwood's 6000 properties isn't as big of a deal as 400 among Winton Hills' 100 properties. If Westwood and Winton Hills were otherwise equal, you'd expect just based on size that out of 100 transfers, 98 of them would be in Westwood and only 2 in Winton Hills. The visualization below depicts counts only, which paints a different (and potentially misleading) picture.&lt;/li&gt;
&lt;/ul&gt;
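&lt;p&gt;The denominator point is easy to sketch in Python (the counts below are made up, roughly echoing the Westwood vs. Winton Hills comparison):&lt;/p&gt;

```python
# Hypothetical counts: raw transfers vs. single-family property totals.
transfers = {"Westwood": 400, "Winton Hills": 90}
single_family = {"Westwood": 6000, "Winton Hills": 100}

def transfer_rate(neighborhood):
    """Transfers per single-family property: the denominator-adjusted view."""
    return transfers[neighborhood] / single_family[neighborhood]

# Westwood "wins" on raw counts (400 vs. 90), but Winton Hills turns over
# far more of its housing stock (0.9 vs. roughly 0.07 transfers per property).
```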

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwo25wclnu1zevygdugpq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwo25wclnu1zevygdugpq.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Visualization — With Geopandas you can plot with matplotlib.pyplot, but give a different library like Plotly a try! Try different types of visualizations, like heat maps, in addition to the choropleth.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;What's Next?&lt;/h2&gt;

&lt;p&gt;Give this a try or try it with similar data sources where you live. If the dataset isn't available, in the US, you can submit a Freedom Of Information Act (FOIA) request to obtain different datasets from the appropriate government organization. People often know there is a problem (e.g. with housing, with "food deserts", with lead exposure in children, etc), but oftentimes the analysis of the problem is either never really done or done expediently for some deadline. Communities need data experts like you to explore and define issues. You don't have to finish before you can start sharing your gained insights with your community. Only once leaders understand exactly where problems exist can they start to create a solution.&lt;/p&gt;

&lt;p&gt;Plus, you can use the opportunity to improve your own data skills!&lt;/p&gt;

</description>
      <category>database</category>
      <category>datascience</category>
      <category>machinelearning</category>
      <category>gis</category>
    </item>
    <item>
      <title>Create Your Own Geocoder with Postgis/TIGER</title>
      <dc:creator>CincyBC</dc:creator>
      <pubDate>Mon, 30 May 2022 16:48:11 +0000</pubDate>
      <link>https://dev.to/cincybc/create-your-own-geocoder-with-postgistiger-3anc</link>
      <guid>https://dev.to/cincybc/create-your-own-geocoder-with-postgistiger-3anc</guid>
      <description>&lt;p&gt;&lt;a href="https://tigerweb.geo.census.gov/tigerwebmain/TIGERweb_apps.html"&gt;Cover Image from Census Website&lt;/a&gt;&lt;br&gt;
If you caught &lt;a href="https://dev.to/cincybc/getting-started-with-postgis-on-postgres-app-o70"&gt;my post last week&lt;/a&gt; on how to easily get started with Postgres/Postgis, this is a followup on how to create your own local geocoder with Census/TIGER data. It's so easy, the folks behind Postgis have even created scripts to automate scraping and loading Census data for you. If you followed the tutorial, you now have a database called &lt;code&gt;geocoder&lt;/code&gt;. Inside your &lt;code&gt;geocoder&lt;/code&gt; database with the Postgis extension, you could just follow the tutorial in the &lt;a href="https://postgis.net/docs/postgis_installation.html#install_tiger_geocoder_extension"&gt;Postgis docs&lt;/a&gt;, but I hope this tutorial provides a bit more context and a few extra steps that make it easier for beginners.&lt;/p&gt;

&lt;p&gt;The first step is to run these queries in your query tool.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE EXTENSION fuzzystrmatch;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE EXTENSION postgis_tiger_geocoder;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;--this one is optional if you want to use the rules based standardizer (pagc_normalize_address)
CREATE EXTENSION address_standardizer;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The first query installs &lt;code&gt;fuzzystrmatch&lt;/code&gt;, a tool that tries to match strings with a degree of certainty. If you've ever used the Python tool FuzzyWuzzy, you'll know what fuzzystrmatch does.&lt;/p&gt;

&lt;p&gt;Installing the geocoder extension also creates schemas in your database. Since new schemas are created, it's a good idea to grant your other users access to them:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;GRANT USAGE ON SCHEMA tiger to PUBLIC;
GRANT USAGE ON SCHEMA tiger_data TO PUBLIC;
GRANT SELECT, REFERENCES, TRIGGER ON ALL TABLES IN SCHEMA tiger TO PUBLIC;
GRANT SELECT, REFERENCES, TRIGGER ON ALL TABLES IN SCHEMA tiger_data TO PUBLIC;
GRANT EXECUTE ON ALL FUNCTIONS IN SCHEMA tiger TO PUBLIC;
ALTER DEFAULT PRIVILEGES IN SCHEMA tiger_data
GRANT SELECT, REFERENCES ON TABLES TO PUBLIC;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The last statement sets default privileges to future-proof access to these tables. (&lt;a href="https://livebook.manning.com/book/postgis-in-action-second-edition/chapter-8/21"&gt;These queries came from the Manning book PostGIS in Action&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;If you check the tables in the schema, you'll find they're mostly empty. The next step then is to fill them with the data you need to geocode!&lt;/p&gt;

&lt;p&gt;This next query will set up variables for a bash/shell script, so the &lt;code&gt;WHERE&lt;/code&gt; clause at the end changes depending on your &lt;code&gt;os&lt;/code&gt;. The docs call this profile &lt;code&gt;'debbie'&lt;/code&gt;; you can call it &lt;code&gt;geocoder&lt;/code&gt; or whatever you want. Just remember it for the following query.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;INSERT INTO tiger.loader_platform(os, declare_sect, pgbin, wget, unzip_command, psql, path_sep,
           loader, environ_set_command, county_process_command)
SELECT 'debbie', declare_sect, pgbin, wget, unzip_command, psql, path_sep,
       loader, environ_set_command, county_process_command
  FROM tiger.loader_platform
  WHERE os = 'sh';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now that you've created this record, go into the table &lt;code&gt;tiger.loader_platform&lt;/code&gt; and set the paths to everything. The bash scripts use &lt;code&gt;wget&lt;/code&gt; to pull data from the Census, so if you don't have it, you'll need to install it and make sure it can be called (either via your Path or by setting the full path to the binary). If you have a Mac, you can easily get set up with wget via &lt;a href="https://formulae.brew.sh/formula/wget"&gt;Homebrew: &lt;code&gt;brew install wget&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;After you've set those variables (don't worry about setting pgbin if you have &lt;code&gt;psql&lt;/code&gt; in your Path as we did if you followed the tutorial in my last post), you need to create a folder called &lt;code&gt;gisdata/&lt;/code&gt;. The tutorial recommends you put it in your root, but I put it on my &lt;code&gt;Desktop/&lt;/code&gt; so I could easily get rid of it after finishing grabbing the data. If you set it to anything except for your root, make sure you update the &lt;code&gt;staging_fold&lt;/code&gt; in &lt;code&gt;tiger.loader_platform&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This next command will generate a script to scrape data from the Census (it beats downloading, unzipping, running a loading command in shell, and then deleting each of the dozens of files you'll need for the geocoder).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;psql -c "SELECT Loader_Generate_Nation_Script('debbie')" -d geocoder -tA &amp;gt; /gisdata/nation_script_load.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Again, change 'debbie' to the profile we created above. Also, the last bit is the path to your &lt;code&gt;/gisdata/&lt;/code&gt; directory, so if you put it on your desktop like I did, you would need to update that to something like &lt;code&gt;~/Desktop/gisdata/nation_script_load.sh&lt;/code&gt;. If you're on a PC, you'll need to create a &lt;code&gt;.bat&lt;/code&gt; file instead of &lt;code&gt;.sh&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Boom, just like that, you've got a script to automate pulling data from the census and loading it into your database!&lt;/p&gt;

&lt;p&gt;Go ahead and look in the script to make sure it came out OK. Does it have your database name and password? Will it be able to run &lt;code&gt;psql&lt;/code&gt; with how it is set? If &lt;code&gt;psql&lt;/code&gt; is in your Path, then you don't need to specify where &lt;code&gt;pgbin&lt;/code&gt; is. I completely removed that variable and references to it from the script when I went through this tutorial.&lt;/p&gt;

&lt;p&gt;The next task is to run the script. It shouldn't take too long depending on your internet speed.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sh nation_script_load.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You've just loaded basic data for all states and counties in the US. To confirm, you can run the following query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT count(*) FROM tiger_data.county_all;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It should return &lt;code&gt;3233&lt;/code&gt;. Doing the same &lt;code&gt;FROM tiger_data.state_all;&lt;/code&gt; will return all 50 states + DC + territories for a count of 56.&lt;/p&gt;

&lt;p&gt;Now, it's time to make a couple of decisions. Will you be doing any census data work with the geocoder? Will you want to know which census tract the address is in? If not, skip the next query. If you do, run this query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;UPDATE tiger.loader_lookuptables SET load = true WHERE load = false AND lookup_name IN('tract', 'bg', 'tabblock');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will allow you to download the necessary files to find the census tract, block, and block group for addresses you geocode.&lt;/p&gt;

&lt;p&gt;The next decision is which addresses you'll be running the geocoder for. If you live in New York, there is a good chance you'll care about geocoding addresses in NY, but not really care about geocoding addresses in WY, so why waste the time/disk space? &lt;/p&gt;

&lt;p&gt;In the same fashion we created the automated script to load the basic nation/counties data above (required no matter what, even if you only want NY), it's time to pull the data for the state(s) you want with the following query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;psql -c "SELECT Loader_Generate_Script(ARRAY['NY'], 'debbie')" -d geocoder -tA &amp;gt; /gisdata/ny_load.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Again, fill in which states you want in the ARRAY by their abbreviation; change 'debbie' to whatever you called your profile to store variables (e.g. &lt;code&gt;'geocoder'&lt;/code&gt;); and make sure you have the correct destination for your file (e.g. &lt;code&gt;~/Desktop/gisdata/ny_load.sh&lt;/code&gt;). Run your script (this can take a lot of time depending on how fast your internet/computer are): &lt;code&gt;sh ny_load.sh&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;That's it, you now have a geocoder on your local machine!&lt;/p&gt;

&lt;p&gt;The docs recommend you clean up your tables. If you created these tables fresh and didn't have issues during setup, these queries shouldn't actually do much, if anything, but it's always good to clean up dead rows for performance.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT install_missing_indexes();
vacuum (analyze, verbose) tiger.addr;
vacuum (analyze, verbose) tiger.edges;
vacuum (analyze, verbose) tiger.faces;
vacuum (analyze, verbose) tiger.featnames;
vacuum (analyze, verbose) tiger.place;
vacuum (analyze, verbose) tiger.cousub;
vacuum (analyze, verbose) tiger.county;
vacuum (analyze, verbose) tiger.state;
vacuum (analyze, verbose) tiger.zip_lookup_base;
vacuum (analyze, verbose) tiger.zip_state;
vacuum (analyze, verbose) tiger.zip_state_loc;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Give it a test run! If you chose NY like I did in the tutorial, run this query to get the geo-coordinates for Rockefeller Center Plaza:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT g.rating, ST_X(g.geomout) As lon, ST_Y(g.geomout) As lat,
    (addy).address As stno, (addy).streetname As street,
    (addy).streettypeabbrev As styp, (addy).location As city, (addy).stateabbrev As state,(addy).zip
    FROM geocode('45 Rockefeller Plaza, New York, NY 10111', 1) As g;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You'll notice something called a rating when you run this query. If you get a rating of 0, the geocoder is certain it got the address correct. The higher the rating, the less confident fuzzystrmatch is that it got it right. You can play around with that by spelling &lt;code&gt;Rockefeller&lt;/code&gt; something else like &lt;code&gt;Rockfella&lt;/code&gt; and seeing what it comes up with. You can also run this without the zip code and it'll come up with the correct coordinates (albeit with a slightly slower query).&lt;/p&gt;

&lt;p&gt;You can also run batch queries if you want to get multiple geo-coordinates in a single query.&lt;/p&gt;

&lt;p&gt;Happy geocoding!&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>postgis</category>
      <category>tutorial</category>
      <category>gis</category>
    </item>
  </channel>
</rss>
