DEV Community: Anthony Agnone

Advent of Code -- Your New Holiday Season Routine!

Anthony Agnone — Tue, 26 Nov 2019 22:47:00 +0000

Harden up your problem solving and coding proficiency while celebrating the holidays

The Gist

Man, I really need to brush up on <language/skill>.

I should check out <language/skill>. I'm seeing it appear more and more, and it will become important for me to be proficient in it.

How many times have you recently had conversations like these with yourself? Chances are you have at least a few times, and most of us will readily admit that we haven't done much of anything about it. Well, here's a fun holiday season routine that can help you out: Advent of Code!

What it is

Advent of Code is an annual series of daily coding challenges that happens during the year-end holiday season.
While its name alludes to Christian traditions, it is a simple pun choice meant to convey the 25-day nature of the series (come one, come all!).
Each day, between the 1st and 25th of December, a new puzzle is published. The puzzles usually string along a wonderfully-constructed storyline, keeping you entertained as you are challenged.
Here's an example from last year:

Example Puzzle (Day 1 2018)

After feeling like you've been falling for a few minutes, you look at the device's tiny screen. "Error: Device must be calibrated before first use. Frequency drift detected. Cannot maintain destination lock." Below the message, the device shows a sequence of changes in frequency (your puzzle input). A value like +6 means the current frequency increases by 6; a value like -3 means the current frequency decreases by 3.

For example, if the device displays frequency changes of +1, -2, +3, +1, then starting from a frequency of zero, the following changes would occur:

Current frequency 0, change of +1; resulting frequency 1.
Current frequency 1, change of -2; resulting frequency -1.
Current frequency -1, change of +3; resulting frequency 2.
Current frequency 2, change of +1; resulting frequency 3.
In this example, the resulting frequency is 3.

Here are other example situations:

+1, +1, +1 results in 3
+1, +1, -2 results in 0
-1, -2, -3 results in -6

Starting with a frequency of zero, what is the resulting frequency after all of the changes in frequency have been applied?

At this point, a link is available on the page for you to obtain the full input sequence (about a thousand inputs) as a text file.
This is where the rubber meets the road between your problem solving skills and software implementation know-how.

Be careful! Most people will happily nose dive into regurgitating code that reads the input, and then face-plant when they realize it's time to implement the core algorithm.
Take the time first to make sure you fully understand the problem at hand, and (at least mentally) form the algorithmic solution before rushing to the I/O code.
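
To make that advice concrete, here is a minimal Python sketch of the core algorithm for the example puzzle: start at zero and add up each signed change. The function name resulting_frequency and the choice to pass the changes as a list of strings are my own assumptions, not part of the puzzle, and reading the input file is deliberately left out.

```python
def resulting_frequency(changes):
    """Apply each signed change (e.g. "+6" or "-3") to a starting frequency of zero."""
    frequency = 0
    for change in changes:
        # int() understands a leading "+" or "-", so no manual sign handling is needed
        frequency += int(change)
    return frequency


# the worked example from the puzzle text
print(resulting_frequency(["+1", "-2", "+3", "+1"]))  # prints 3
```

Once the core is this small and clearly correct, the I/O code becomes the easy part rather than the distraction.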

Why you should do it

You need not use any sort of significant computing hardware to participate in these challenges -- feel free to leave your cloud-computing mindset to the side, and get yourself back into the single machine, single problem mindset of your first data structures and algorithms endeavor 😉.
The puzzles, created personally by Eric Wastl 👏, are designed so that a brute-force solution will take entirely too long, while a more intelligent approach will solve the puzzle in at most 15 seconds on commodity hardware.

You need not use any sort of super duper optimized programming language, either. Note that, since the correct approach to the puzzle will give you the answer in seconds, the true differentiator in solutions is development time.
Thus, an appropriate approach in an interpreted language (Python, R, etc) will prevail over a naive approach written in a compiled language (C, Rust, Go, etc).
I love this aspect. Some folks love to gripe about how <language> is the best because you can always optimize it to run faster and closer to the bare metal, etc.
However, how fast your algorithm runs on hardware is usually not the bottleneck in your end system:

The more expensive factors of an algorithmic solution tend to be things like developer time-to-solution, lack of parallelism, or I/O congestion, as opposed to compiler-enabled program optimization.

When was the last time you were in an interview for an established software company, and they constrained you to a language of their choice? ¹
Usually, you were instead constrained to (either explicitly, or implicitly via a failed interview!) providing a solution that was modular, efficient, and scalable.
All of these things are achievable in any Turing-complete language. So forget about language wars for this series, and focus on the real intellectual meat of the problem.

Indeed, many dispose of competition altogether here, and use Advent of Code as an opportunity to learn a new language.
Personally, I plan to do my first pass in Python, and then follow up in either Go or JavaScript (gotta stay future-proof 🧐).
Whether you are looking for intense competition, a learning opportunity, or a chance to establish a good habit this holiday season, consider finding it in Advent of Code!

Ready to Go?

Start by reviewing some puzzles from past years.
Feeling your more competitive side? Join the private leaderboard here with code 498713-51f0c909 to see how you match up with other participants.

Share your opinion

Why are you planning or not planning to participate in this? If you are not, what would change your mind?

Do you have tried-and-tested ways to ensure the habit of completing daily tasks like this for extended periods of time?

What else would you like to share?

¹ I hope there aren't too many of you that have been a part of such an interview 😬.

Another One in the Books 2019/11/23

Anthony Agnone — Sat, 23 Nov 2019 19:12:04 +0000

WAYR (What Are You Reading)

In my latest read, I've finally started a formal plunge into the world of micro-service software architectures by diving into Building Microservices: Designing Fine-Grained Systems by Sam Newman.
Since seeing this architecture appear at my place of work, I've become very curious and interested in what this means for effective integration of machine learning algorithms into standard product offerings. When you take away all the details of service communications and data storage, this really just smells like more concrete ways of modularizing and encapsulating distinct functionalities in a product.

That being said, there is still slightly more to it than this -- boundaries between these fine-grained services can often be defined more based on business-level logic boundaries, rather than more traditional software-level or data-level boundaries.
I'm about halfway through this one, and am still fully enthralled; Sam has done much more than "fill in the gaps" of architecture details. He has consistently connected theoretical choices to practical implications, so that you can feel much more confident when approaching the development of your first micro-service application.

I definitely recommend this one. He's already working on the next edition, but it doesn't appear to be coming for at least another year. Go ahead with the current one!

Learning

As mentioned above, I've become interested in analyzing the interplay of micro-service architectures with the current state of machine learning deployment technologies.
It's certainly not expected that every research scientist in industry understand all of the DevOps, service design, testing considerations, etc. that go into proper micro-service systems.

However, a lot of companies are currently struggling to marry long-standing notions of a proper software life-cycle with more recent developments of what a proper machine learning model life-cycle is.

I'll give you a hint: they are not the same 😬.

I can already tell you that I will be making future dedicated posts both about theory and applications of micro-services with machine learning solutions, so I don't want to delve into it too much here. However, please reach out if you have shared interest in this area!

Tools

I hate paper.

You won't find me hugging any trees, but it's just so wasteful amongst all the technology we have now. What do we use paper for? For the most part, it's just a historical form of sharing information, right?
So...now that we have things like quasi-ubiquitous internet and data storage, why do we still have so much paper?

We still have so much paper mainly due to a transition period, in which companies and individuals change their habits of communicating information.

In the meantime, however, the Doxie Go SE scanner I just got is a fantastic way to move my life more towards fully paper-less. I've never had so much fun using a paper-related device (read: printers suck).

Ok, so what's so good about it?

Portability

This thing is compact. I won't throw any decimal points at you -- it's the length of a standard sheet of paper, and about the width and height of your pinky finger.
You can charge it with a standard micro USB cable, and then take it with you across the room or the world.

Ease Of Use

It just works. And that's saying a lot, since office appliances are usually miserable to deal with.
This is my first Doxie product, but they definitely have my attention now moving forward.

Two buttons: power and WiFi. Can you guess what each does?

It's easy to connect to a new network: you can first connect to the device itself from your computer, and then use its web interface to connect it to the desired network.

Once it's connected to your network, you can view your recent scans, group them to your liking, and then send them to your folder/cloud/destination of choice. I bet they have yours covered!

OCR (Optical Character Recognition)

This mouthful of an acronym stands for the auto-magical process of converting a picture/scan of something with text into a digital document that "knows" what text is in the image.
Besides being impressed and saying "man, that's cool", there is a massive benefit that this provides over the more traditional paper filing system of previous centuries:

With OCR scanners, we now have immediate access to a searchable repository of every piece of paper we've processed with them.

Let me bring this point home with my current favorite workflow with this: my growing dinner recipe book!

My wife and I recently got settled into our new home. Amongst all the various projects and errands we are both doing, routinely making a nice meal has become more difficult.
We have recently started using a service that sends us (weekly) a few meal-ready sets of ingredients, with recipe sheets for how to prepare them.

At this point in the process, our parents and grandparents now air-drop into the conversation with a three-hole punch and binder to help us make our very own recipe book.
While this is thoughtful, how about a digital recipe book that is basically impossible to become damaged, lost, or otherwise unwieldy? I'm picking the latter 😁.

Every time we finish a meal, I pop the recipe card through the scanner and send it to my Google Drive. Now, whenever one of us is craving a particular dish or ingredient, all we need to do is go into Google Drive and search for it, and voila! All recipes matching the search auto-magically appear.
This isn't your hot-shot machine learning solution, but I had this process up and running within 30 minutes of receiving the scanner.

My two cents

In a recent NFL game, defensive end Myles Garrett smashed his helmet onto the head of QB Mason Rudolph during a scuffle, which has led to discussion of related incidents in the NFL and how much of a media ruckus was made over them.
The angle I want to take here is to step back and claim how ridiculous all of this is.

Consider how much human effort went into the media coverage, NFL penalty determination, and fan conversation about this. Is it all worth that much, compared to the rest of our lives?

Two grown boys, getting paid millions of dollars, got in a schoolyard brawl in front of a physical-and-digital audience of millions of people.
Then thousands, if not millions, of people spent the next few days getting paid to do jobs that were, in some part, related to this event.

If you pool together the collective attention and brain power of all of these people, is this really the best use of us?

Share your opinion

What are your thoughts on this incident with Myles Garrett, and how the media played a part in the aftermath?

Do you have any experience with micro-services and/or machine learning deployment?

Do you have additional ideas about the implications of OCR and traditional paper processes?

What else would you like to share?

OpenAI's Hide-and-Seek Findings, the Systems Perspective

Anthony Agnone — Sat, 21 Sep 2019 16:53:48 +0000

Yes, the agents cheated, but what does that mean for the system?

OpenAI released a fantastic piece on some results obtained in a multi-agent hide-and-seek simulation, in which multiple hiders and multiple seekers play the popular children's game.

The simulation had some interesting aspects to it, such as tools (boxes, ramps, walls) that the agents could use to aid them in achieving their objective of effective hiding/seeking.
However, the more notable result is that extended simulation of the environment led to emergent behavior; that is, behavior that is fundamentally unplanned or unexpected.

For example, some of the expected behavior is that the hiders would eventually learn to build an enclosure with the walls and/or boxes that hides the ramps from the seekers.
This way, the ramps cannot be used to go over the walls and into the built enclosure from above.
Now, what the environment designers did not expect (the emergent behavior) is that the seekers would learn that they could use the ramp to get on top of a box, and then use a running motion to essentially "surf" the box anywhere they pleased!

Using this method, the seekers found a way to access the hider-built enclosures from above that was not intended by the designers of the system!

The seekers had gamed the system.

Now, what do you think the hiders did in response to this behavior? Some of you may think that, since the seekers had learned to exploit what was, to some extent, undefined behavior of the system, the hiders might respond with some ridiculous action, since the system was now in a state of disarray.

But think about it. The system was not in any sort of unknown state.

While it may have been in a state that the designers did not explicitly intend to create, the agents were continuing to operate in the manner they saw as optimal for their desired outcomes.

Thus, the hiders learned to paralyze the seekers' ability to surf boxes!

They did this by using the pre-allocated initial time in which the seekers are frozen to lock all of the boxes and ramps.
Then, they use any time left to construct a quick enclosure with the movable walls and then lock the walls.
This way, the seekers now, once again, have no way to get inside the enclosure (at least, that's the thought...).
Well played, guys.

I think that is fascinating, but on a different level than most of OpenAI's analysis focuses on.
They do mention that the agents found ways to game the system:

"[...] agents build [...] strategies and counterstrategies, some of which we did not know our environment supported"

However, they then dive into detail only about the scenarios that the agents learned, and completely ignore the environmental design flaws themselves. I think the latter is the more interesting phenomenon!
I'd like to turn the analysis on its head -- let's now hold the agent designs constant, and vary the environment's state structure and reward system.
Analyze how different incentive/response systems induce different agent strategies.
The field of reinforcement learning is progressing wonderfully, especially in recent years. We've gone from checkers solvers to a Go champion in just a few decades -- our agent modeling is getting pretty dang good.
Now, how about our multi-agent environment modeling?

Multi-Agent Environment Design

OpenAI has certainly thought about it. Per their final paragraph,

Building environments is not easy and it is quite often the case that agents find a way to exploit the environment you build or the physics engine in an unintended way.

A great article on reward function design written by @BonsaiAI on Medium mentions that "you get what you incentivize, not [necessarily] what you intend."
That beautifully summarizes the inherent dilemma in designing a reward system for a certain outcome.
You certainly have your mental picture of how your system of incentives will lead to the system as a whole reaching the desired state(s), but have you considered all of the minute ways in which your system may have some "cracks" in it?
Obviously, this is easier said than done. This divergence of "intent vs outcome" is readily seen in our daily lives, whether professionally or not:

software engineers intend to turn documented specifications into functional software that is a faithful rendition of the documented change.
company executives intend to compensate employees appropriately, based on how much value they provide to the company as a whole.
sports team managers intend to apply game plans and player lineups that will bring victory over each successive opposing team.
etc...

The unwavering and succinct truth for all of these situations is that the system behaves exactly as it is designed; there are no undesigned consequences, only unintended ones.

To make this idea clear, let's take the compensation scenario a little further.
Say there are employees near the middle of the corporate hierarchy who are unhappy with their compensation, and are taking issue with the overall design of the compensation structure (assume the structure is readily known across the organization).
Statements these employees may make will go along the lines of "this system is broken" or "what's happening here is wrong".
However, what cannot be said in these circumstances (assuming a compassionate and fair designer) is "this system is not doing what it is designed to."

Of course it is! It is doing exactly what it is instructed to do!
If it should be doing something different than what it is now, then it should be changed as such.
Now, we may have intended for the system to be doing one thing, but that may or may not actually be the final design.
However, regardless of intent, what is happening is a perfect rendition of the system that was chosen.
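
A toy sketch can make the gap between intent and incentive tangible. Everything here is hypothetical: the action names, the reward numbers, and the best_action helper are invented for illustration and are not taken from OpenAI's environment. The only point is that a reward-maximizing agent follows the reward table as written, not the designer's mental picture of it.

```python
# hypothetical reward table: what the designer's rules actually pay out per action
designed_reward = {
    "search_on_foot": 1.0,  # the behavior the designer had in mind
    "climb_ramp": 2.0,
    "surf_box": 5.0,        # never intended, but the rules allow it and it pays best
}

# what the designer meant to encourage (an assumption for this example)
intended_action = "search_on_foot"


def best_action(reward_table):
    """Return the action with the highest reward, as a purely greedy agent would."""
    return max(reward_table, key=reward_table.get)


chosen = best_action(designed_reward)
print(chosen)                     # prints surf_box
print(chosen == intended_action)  # prints False
```

Changing the agent's behavior here means changing the table, which is exactly the "then it should be changed as such" argument above.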

A New Frontier

I'm excited to see more theory develop around effective design of environmental incentive systems, especially in multi-agent scenarios.
The applications for theory like this are littered in our daily lives, and are even among the most important questions we seek to answer with regard to living amongst each other.
Here are some examples:

what's the best way for us to govern ourselves and others? ¹
what's the best way to organize how we define and exchange value between each other?
what's the best way to collaborate with each other towards a common end product or creation?

It should only take one or two of those examples to get you sufficiently motivated for this. And that's great...because this area of research is just getting started in some respects.
For example, I imagine there is a plethora of historical publications on system-level analysis of things like governments, economic systems, and managerial hierarchies.
However, all of this precedent is soon going to be married with the recent advances in multi-agent RL.
The important similarities and differences between these theory families have the potential to lead to an explosion of knowledge and application in topics of human systems and computer-agent systems alike.

Conclusion

Systems will always be gamed, whether the agents are human or digital.

What are your thoughts on effective ways to prevent/detect/fight the exploitation of incentive systems?

What are some interesting "timeless" academic works you know of, which analyze human/agent systems at large?

How about the same for reward design in multi-agent RL?

What other applications do you see here that I didn't touch on?

¹ I'm looking forward to the day when an electoral candidate's proposed policies can be evaluated by simulation, rendering the circus of televised debates useless.

Reproducible Data Processing with Make + Docker

Anthony Agnone — Tue, 06 Aug 2019 02:07:39 +0000

Avoiding reproducibility hell with dependency management and containerization

Motivation

When performing experiments in data science and machine learning, two main blockers of initial progress are delays building/using “base code” and lack of reproducibility.
Thanks to some great open source tools, you don’t have to be a software guru to circumvent these obstacles and get meaning from your data in a much smoother process.

“Hey there, I got this error when I ran your code…can you help me?”

oh yeah, that file…

…and it’s something facepalm-worthy. Here you are, trying to hit the ground running with a friend or colleague on an interesting idea, and you’re now side-tracked debugging a file-not-found error. Welcome back to your intro programming course!

I’m sure the owner of the code also loves nothing more than to spend a bunch of time helping someone step through these issues at a snail’s pace. The sheer euphoria you two have just shared over the promise of recent experimental results has now morphed into unspoken embarrassment and frustration that the demonstration has failed before showing any worth, whatsoever.

But it’s fine. It’s fine! Your buddy knows just where to find that missing file. You’re told that you will have it within minutes, and then you will be on your way!

“Alright, download that file — I just emailed it to you. Then run train.py, you should get 98% accuracy in 20 epochs.”

Aha! This is it! The time has come to join the ranks of esteemed data magicians, casting one keyboard spell after another, watching your data baby’s brain get progressively more advanced as it beckons for a role in a new Terminator movie! Let’s see what we get!

but I did what you said 🙁

…yeah, we’ve all been there.

What could it be? Well, maybe it’s something obvious. I know python, and I know what your code should be doing. I’ll just pop open your train.py to poke around and…NOPE.

Don’t worry, this isn’t going to be a pinky-waving article about how to always write a software masterpiece and scoff at anything you deem inferior. That’s a sticky subject in general, as it’s fraught with subjectivity and competing standards. These examples just aim to emphasize that there are a myriad of ways in which we would prefer new experiments not to start.

We’re interested in re-producing and improving on results in a convenient fashion, not stumbling to re-create past achievements. With that in mind, let’s have a look at some popular tools that can be used to streamline the start of any new ML software project: Docker and Make.

Docker

The python ecosystem has some great features for dealing with dependencies, such as pip and virtualenv. These tools allow for one to easily get up and running according to some specification of what needs to be installed to proceed with running some code.

For example, say you have just come across the scikit-learn library (and it’s love at first sight, of course). You are particularly drawn to one of its demo examples, but would like to re-produce it with the data housed in a pandas DataFrame. Furthermore, another project you are working on requires an ancient version of pandas, but you would like to use features available only in a newer version. With pip and virtualenv, you have nothing to fear (…but fear itself).

# create and activate environment
virtualenv pandas_like_ml
source pandas_like_ml/bin/activate

# install your desired libraries
pip install --upgrade pip
pip install scikit-learn==0.21.1
pip install pandas==0.19.1

# the main event
python eigenfaces.py -n 20000

# we're done here, so exit the environment
deactivate

When you learn this flow for the first time, you feel freed from the hellish existence that is dependency management. You triumphantly declare that you shall never, ever be conquered again by the wrath of a missing package or a bloated monolithic system environment. However, this unfortunately isn’t always enough…

Python environment tools fall short when the dependency is not at the language level, but at the system level.

For example, say you would like to set up your machine learning project with a MongoDB database backend. No problem! pip install pymongo and then we’re home free! Not so fast…

Well…that didn’t go as expected. Now, in addition to setting up our library dependencies, we also need to manage a dependency outside of python? Gah! Further delays! Time to google the package name for MongoDB…

What if I don’t even know what operating system my colleague is using? I can’t give him some sudo apt-get install snippet if he’s on CentOS. Even more to the point, there’s no easy way to automate this step for future projects. Make me do something once, I’ll do it. Make me do it again…zzzz.

So, we’re faced with the desire to standardize and automate setting up software libraries and other system dependencies for new data-related endeavors, and sadly our usual python tools have fallen short. Enter Docker: an engine for running services on an OS as lightweight virtualization packages called containers. Docker containers are the realization of the definition of a Docker image, which is specified by a file called a Dockerfile.

# you can specify a base image as a foundation to build on
FROM ubuntu:16.04

# declare /opt as a volume mount point, and set it as the working directory
VOLUME /opt
WORKDIR /opt

# install some base system packages
RUN apt-get update && apt-get install -y \
    python3 \
    python3-dev \
    python3-pip \
    python3-setuptools

# install some python packages
RUN pip3 install --upgrade pip
RUN pip3 install \
    scikit-learn==0.21.1 \
    pandas==0.19.1

# set the container's entry point, just a bash shell for now.
# this can also be a single program to run, e.g. a python script.
ENTRYPOINT ["/bin/bash"]

Think of a Dockerfile as a (detailed) recipe of the setup steps needed to get the system into the state we would like for the experiment. Examples include setting up a database, installing libraries, and initializing a directory structure. If you’ve ever made a nice shell script to do some setup like this for you, you were not far from the typical Docker workflow. Docker has many benefits over a shell script here, the most notable being containerization: with Docker containers, we are abstracted away from the host system that the container is running on. Each container runs as its own isolated set of processes, with its own filesystem defined by the image. Because of this, we can have multiple containers running completely different setups on the same host machine. How’s that for some insulation against system dependency hell?

Additionally, we are further insulated from issues like missing files and differences of system state. We know exactly what the system state will be when it is run. We know this because we have made it so via the explicit instructions in the Dockerfile.

To actually build the image, we use a command like the following:

docker build \
    -t my_first_container \
    -f Dockerfile \
    .

At this point, we have built the image. We can now instantiate it as a container as many times as desired, e.g. to perform multiple experiments.

docker run \
    --rm \
    -it \
    my_first_container

Voila!

If we left at this point and ran in N directions to do various different experiments, these commands may get rather cumbersome to type…

docker run \
    --mount type=bind,source="$(pwd)",target=/opt \
    --mount type=bind,source=${CORPORA_DIR},target=/corpora \
    -p ${JUPYTER_PORT}:${JUPYTER_PORT} \
    -ti \
    --rm \
    my_advanced_container \
    jupyter-lab \
        --allow-root \
        --ip=0.0.0.0 \
        --port=${JUPYTER_PORT} \
        --no-browser \
        2>&1 | tee log.txt

Don’t worry if your eyes glaze over at this. The point is that it’s a lot to keep typing. That’s fine though; we have shell scripts for a reason. With shell scripts, we can encapsulate the minute details of a very specific sequence of commands into something as mindless as bash doit.sh. However, consider also a scenario in which your Dockerfile definition depends on other files (e.g. a requirements.txt file or a file of environment variables to use). In this case, we would also like to know automatically when the Docker image needs to be re-created, based on upstream dependencies.
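
As an alternative to a shell script, the same encapsulation can be sketched in Python. This is only an illustration: build_run_command and its parameters are hypothetical names, the example values (image name aside) are placeholders, and the flags are simply the ones from the command above. The function only assembles the argument list; handing that list to subprocess.run would actually launch the container, and the log redirection from the original command is left out.

```python
import os


def build_run_command(image, corpora_dir, jupyter_port, workdir=None):
    """Assemble the long `docker run` invocation as a list of arguments."""
    # default to the current directory, mirroring "$(pwd)" in the shell version
    workdir = workdir or os.getcwd()
    return [
        "docker", "run",
        "--mount", f"type=bind,source={workdir},target=/opt",
        "--mount", f"type=bind,source={corpora_dir},target=/corpora",
        "-p", f"{jupyter_port}:{jupyter_port}",
        "-ti",
        "--rm",
        image,
        "jupyter-lab",
        "--allow-root",
        "--ip=0.0.0.0",
        f"--port={jupyter_port}",
        "--no-browser",
    ]


# placeholder values, purely for demonstration
command = build_run_command(
    "my_advanced_container", "/data/corpora", 8888, workdir="/home/me/project"
)
print(" ".join(command))
```

This keeps the typing down, but just like doit.sh, it still knows nothing about when the image itself is out of date.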

So what has four letters, saves you from typing long, arduous commands, and automates dependency management?

Make

GNU Make is a wondrous tool, gifted to us by the same software movement that has made the digital world what it is today. I’ll save you a more sparkly introduction and jump into the core abstraction of what it is: a DAG-based approach to intelligently managing dependencies of actions in a process, in order to efficiently achieve a desired outcome.

Ok, it’s also a convenient way to compile C code. But focus on the first definition, and think bigger! Re-using the general DAG-based dependency management idea has led to some great tools over the years, like Drake (not the rapper), Luigi (not Mario’s brother), and perhaps most notably Airflow (Airbnb’s baby, but now part of the Apache Software Foundation).

Consider the contrived example below. We’d like to make predictions on audio-visual data with a trained model. As a new raw image appears, do we need to re-train the model in order to create a prediction? Setting aside applications such as online learning, we do not. Similarly, say we just updated some parameters of our trained model. Do we need to re-cull the raw images, in order to re-create the same data sample? Nope.

This is where Make comes into play. A Makefile specifies “targets” that correspond to (one or more) desired outputs in the DAG. Invoking a target automatically produces that output for you, while re-running only the dependency steps that are necessary.
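
The “only if necessary” decision can be approximated in a few lines of Python. This is a simplified sketch of the timestamp rule, not GNU Make’s actual implementation, and needs_rebuild is a hypothetical helper name: a target is considered stale when it does not exist or when any of its dependencies was modified more recently than it was.

```python
import os


def needs_rebuild(target, dependencies):
    """Return True if `target` is missing or older than any of its dependencies."""
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    # rebuild when at least one dependency was modified after the target
    return any(os.path.getmtime(dep) > target_mtime for dep in dependencies)
```

For instance, with hypothetical file names, needs_rebuild("predictions.csv", ["model.pkl"]) would answer whether the predictions have to be regenerated after the model file changed. Make applies the same idea recursively across the whole graph.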

Make can be used for pretty much anything that involves actions and their dependencies. It’s not always the right tool for the job (see Airflow for this process on distributed applications), but it can get you pretty far. I even used it to generate the image above! Here’s what the Makefile looks like.

# the "graph.png" target specifies "graph.dot" as a dependency
# when "graph.png" is requested, "graph.dot" is re-made first only if necessary
# (recipe lines must be indented with a tab character, not spaces)
graph.png: graph.dot
	dot graph.dot -Tpng > graph.png

# the "graph.dot" target specifies "make_graph.py" as a dependency
# so, this command is only re-run when...
#   1) make_graph.py changes
#   2) graph.dot is not present
graph.dot: make_graph.py
	python make_graph.py
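
For completeness, here is a guess at what a script like make_graph.py might contain. The original script is not shown in this post, so the edge list and the to_dot helper are assumptions made for illustration; the only fixed point is that the script has to write a Graphviz DOT file named graph.dot for the rules above to work.

```python
# hypothetical edges for a data-processing DAG: (upstream step, downstream step)
EDGES = [
    ("raw images", "culled images"),
    ("culled images", "trained model"),
    ("trained model", "predictions"),
]


def to_dot(edges):
    """Render a list of (source, destination) pairs as Graphviz DOT text."""
    lines = ["digraph pipeline {"]
    for source, destination in edges:
        lines.append(f'    "{source}" -> "{destination}";')
    lines.append("}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # write the file that the graph.dot target is expected to produce
    with open("graph.dot", "w") as handle:
        handle.write(to_dot(EDGES))
```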

Marrying the Two

So we’ve wailed over the struggles of reproducible work and introduced great tools to manage environment encapsulation (Docker) and dependency management (Make). These are two pretty cool cats; we should introduce them to each other!

Photo by Product School on Unsplash

P.S. Which one is Docker, and which is Make?

Let’s say we’ve just found the Magenta project, and would like to set up an environment to consistently run demos and experiments in, without further regard to what version of this_or_that.py is running on someone’s computer. After all, on some level, we don’t care what version of this_or_that.py is running on your machine. What we care about is that you are able to experience the same demo/result that the sender has experienced, with minimal effort.

So, let’s set up a basic Dockerfile definition that can accomplish this. Thankfully, the Magenta folks have done the due diligence of creating a base Docker image themselves, to make it trivial to build from:

# base image
FROM tensorflow/magenta

# set partition and working directory
VOLUME /opt
WORKDIR /opt

# install base system packages
RUN apt-get update && apt-get install -y \
    vim \
    portaudio19-dev

# install python libraries
COPY requirements.txt /tmp/requirements.txt
RUN pip install --upgrade pip
RUN pip install -r /tmp/requirements.txt

# container entry point
ENTRYPOINT ["/bin/bash"]

After specifying the base image as Magenta’s, we set a working directory on an /opt volume, install some system-level and python-level dependencies, and make a simple bash entry point until we have a working application. A typical requirements.txt file might look like this:

jupyterlab
seaborn
scikit-learn
matplotlib
pyaudio

Awesome. So now we have a specification of our desired environment. We can now make a Makefile which handles some of the dependencies at play:

# use the name of the current directory as the docker image tag
DOCKERFILE ?= Dockerfile
DOCKER_TAG ?= $(shell echo ${PWD} | rev | cut -d/ -f1 | rev)
DOCKER_IMAGE = ${DOCKER_USERNAME}/${DOCKER_REPO}:${DOCKER_TAG}

$(DOCKERFILE): requirements.txt
    docker build \
        -t ${DOCKER_IMAGE} \
        -f ${DOCKERFILE} \
        .

.PHONY: image
image: $(DOCKERFILE)

.PHONY: run
run:
     nvidia-docker run \
         --mount type=bind,source="$(shell pwd)",target=/opt \
         -i \
         --rm \
         -t $(DOCKER_IMAGE)

This Makefile specifies targets for run, image, and $(DOCKERFILE). The $(DOCKERFILE) target lists requirements.txt as a dependency, and thus will trigger a re-build of the Docker image when that file changes. The image target is a simple alias for the $(DOCKERFILE) target. Finally, the run target allows a concise call to execute the desired program in the Docker container, as opposed to typing out the laborious command each time.

One Docker to Rule Them All?

At this point, you may be motivated to go off and define every possible dependency in a Dockerfile, in order to never again be plagued with the troubles of ensuring an appropriate environment for your next project. For example, FloydHub has an all-in-one Docker image for deep learning projects. This image specification includes numerous deep learning frameworks and supporting Python libraries.

Don’t do that!

For the sake of argument, let’s take that to the limit. After the next 100 projects that you work on, what will your Docker image look like? And what about after the next 1000 projects? Over time, it will just become as bloated as if you had incrementally changed your main OS in each project. This goes against the containerization philosophy of Docker — your containers should be lightweight while remaining sufficient.

Furthermore, with all of that bloat you lose the ability to maintain projects that require different versions of the same dependency. What if one of your projects requires the latest version of TensorFlow to run, but you don’t want to update the 99 previous projects (and deal with all of the failures the updates bring)?

Conclusion

In this part of the Towards Efficient and Reproducible (TEAR) ML Workflows series, we’ve established the basis for making experiments and applications a relatively painless process. We used containerization via Docker to ensure experiments and applications are reproducible and easy to execute. We then used some automatic dependency management via Make for keeping experiment pipelines efficient and simple to run.

Photo by Susan Holt Simpson on Unsplash

It’s worth noting that there are numerous alternative solutions to these two; however, they follow the same general patterns: containerization gives you reproducibility and automatic dependency management gives you efficiency. From there, the value added in other solutions usually comes down to bells and whistles like cloud integration, scalability, or general ease of use. To each their own choice of tools.

Visualizing House Price Distributions

Anthony Agnone — Fri, 19 Jul 2019 18:29:33 +0000

With Zillow and python's Folium, it's easier than ever

Wait, but Why?

I’m in the process of closing on my first home in Atlanta, GA, and have been heavily using various real estate websites like Zillow, Redfin, and Trulia. I’ve also been toying with Zillow’s API, although it is somewhat spotty in functionality and documentation. Despite its shortcomings, I was fully inspired once I read the post by Lukas Frei on using the folium library to seamlessly create geography-based visualizations. A few days and some quick fun later, I’ve combined Zillow and Folium to make some cool visualizations of housing prices both within Atlanta and across the U.S.

Topics

API integration
Graph traversal
Visualization

A Small Working Example

Let’s start simple by using some pre-aggregated data I downloaded from the Zillow website. This data set shows the median price per square foot for every state in the U.S. for each month from April 1996 to May 2019. Naturally, one could build a rich visualization on the progression of these prices over time; however, let’s stick with the most recent prices for now, which are in the last column of the file.

Looking at the top 10 states, there aren’t many surprises overall. That said, I was initially caught off guard by the ordering of some of these, notably D.C. and Hawaii topping the chart. However, recall that the metric is normalized “per square foot”. By that token, I’m maybe more surprised now that California still hits #3, given its size.

Top 10 price/sqft in thousands of $$$ (May 2019)

Anyways, on with the show! Since this is a visualization article, I’ll avoid throwing too many lines of code in your face, and link you to all of it at the end of the article. In short, I downloaded a GeoJSON file of the U.S. states from the folium repo. This was a great find, because it immediately gave me the schema of the data that I needed to give to folium for a seamless process; the only information I needed to add was the pricing data (to generate coloring in the final map). After providing that, a mere 5 lines of code got me the following plot:

Heatmap of price/sqft of homes in the U.S. for May 2019
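For a rough idea of what those few lines can look like, here is a minimal sketch, not the project's actual code. It assumes the Zillow CSV has a `RegionName` column holding full state names (a guess at the schema) with the most recent month in the last column, and that each GeoJSON feature carries the state name under `properties.name`. The helper names `latest_prices` and `build_map` are made up for illustration.

```python
import csv
import io


def latest_prices(csv_text, name_column="RegionName"):
    """Map each region to the value in the last (most recent) column.

    `name_column` is an assumed header name; rows whose last column is
    empty are skipped.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    name_idx = header.index(name_column)
    return {row[name_idx]: float(row[-1]) for row in reader if row and row[-1]}


def build_map(geojson, prices, out_path="us_prices.html"):
    """Draw a state-level choropleth and save it as an HTML page."""
    # Imported lazily so the CSV step above runs without these installed.
    import folium
    import pandas as pd

    df = pd.DataFrame(sorted(prices.items()), columns=["state", "price"])
    m = folium.Map(location=[39.8, -98.6], zoom_start=4)
    folium.Choropleth(
        geo_data=geojson,
        data=df,
        columns=["state", "price"],
        key_on="feature.properties.name",  # assumes names match the CSV
        fill_color="YlOrRd",
        legend_name="Median price per square foot ($)",
    ).add_to(m)
    m.save(out_path)
    return m
```

The top-10 table is then just `sorted(prices.items(), key=lambda kv: kv[1], reverse=True)[:10]`.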

One Step Further

Now that I’d dipped my toes into the waters of Zillow and Folium, I was ready to be immersed. I decided to create a heat map of Metro Atlanta housing prices. One of the drawbacks of the Zillow API is that it’s rather limited in search functionality — I couldn’t find any way to perform a search based on lat/long coordinates, which would have been quite convenient for creating a granular heat map. Nevertheless, I took it as an opportunity to brush up on some crawler-style code; I used the results of an initial search by a city’s name as seeds for future calls to get the comps (via the GetComps endpoint) of those homes.

It’s worth noting that Zillow does have plenty of URL-based search filters (example) that one could use, e.g., to search by lat/long (see below). Obtaining the homes from the web page then becomes a scraping job, though, and you are subject to any sudden changes in Zillow’s web page structure. That being said, scraping projects can be a lot of fun; if you’d like to build this into what I made, let me know!

Returning to the chosen path, I mentioned that I used initial results as entry points into the web of homes in a given city. With those entry points, I kept recursing into calls for each home’s comps. An important assumption here is that Zillow’s definition of similarity between houses includes location proximity in addition to other factors. Without location proximity, the comp-based traversal of homes will be very non-smooth with respect to location.

So, what algorithms are at our disposal for traversing through a network of nodes in different ways? Of course, breadth-first search (BFS) and depth-first search (DFS) quickly come to mind. For the curious, have a look at the basic logic flow of it below. Besides a set membership guard, new homes are only added to the collection when they satisfy the constraints asserted in the meets_criteria function. For now, I do a simple L2 distance check between a pre-defined root lat/long location and the current home’s location. This criterion encouraged the search to stay local to the root, for the purposes of a well-connected and granular heat map. The implementation below uses DFS by popping off the end of the list (line 5) and adding to the end of the list (14), but BFS can be quickly achieved by changing either line (but not both) to instead use the front of the list.
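The original listing isn't reproduced in this text, so the line numbers mentioned above refer to it rather than to anything here. As a stand-in, the sketch below shows the same idea under stated assumptions: homes are plain dicts with `id`, `lat`, and `lon` keys, and `get_comps` is a hypothetical callable wrapping whatever returns a home's comps (for instance, a call to the GetComps endpoint). Only `meets_criteria` is a name taken from the article.

```python
import math


def meets_criteria(home, root, max_dist):
    """Simple L2 distance check between the root location and a home."""
    return math.hypot(home["lat"] - root[0], home["lon"] - root[1]) <= max_dist


def crawl(seeds, get_comps, root, max_dist, max_iters=10_000):
    """Traverse homes through their comps, staying close to `root`.

    `get_comps(home)` is an assumed callable returning a list of homes.
    """
    frontier = list(seeds)
    seen = {home["id"] for home in frontier}
    collected = []
    for _ in range(max_iters):
        if not frontier:
            break
        home = frontier.pop()  # end of the list: DFS; pop(0) would give BFS
        collected.append(home)
        for comp in get_comps(home):
            if comp["id"] in seen:  # set membership guard
                continue
            if not meets_criteria(comp, root, max_dist):
                continue
            seen.add(comp["id"])
            frontier.append(comp)  # added to the end of the list
    return collected
```

Swapping `frontier.pop()` for `frontier.pop(0)` turns the traversal into BFS; for long runs, `collections.deque` would make the front-of-list operation cheaper.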

Letting this algorithm run for 10,000 iterations on Atlanta homes produces the following map in just a few minutes! What’s more, the web page generated by folium is interactive, allowing common map navigation tools like zooming and panning. To prove out its modularity, I generated some smaller-scale maps of prices for Boston, MA and Seattle, WA as well.

Heat map of Atlanta housing prices. See the interactive version here.

The Code

As promised, here’s the project. It has a Make+Docker setup for ease of use and reproducibility. If you’d like to get an intro to how these two tools come together nicely for reproducible data science, keep reading here. Either way, the README will get you up and running in no time, either via script or Jupyter notebook. Happy viz!

What Now?

There are numerous different directions in which we could take this logic next. I’ve detailed a few below for stimulation, but I’d prefer to move in the direction that has the most support, impact, and collaboration. What do you think?