Timothy Renner

# Functional Python: Fabulous Filter

This post is the second in a series on the functional side of Python. I've been told in code reviews that my Python looks like Clojure, which I naturally took as a compliment even if it wasn't. So I decided to write a series of posts here detailing how I write functional Python (where appropriate), bit by bit.

In the last post I wrote on this topic, I discussed `map`. `map` is one of two built-in higher-order functions in Python. There used to be a third, `reduce`, but it has since been moved out of the built-ins and into the standard library's `functools` module, and I think that's super weird. I'll explain why when I do a post on `reduce`. For today, I'll focus on `filter`. This'll be short, because `filter`'s pretty simple.

## The Basics

`filter` is a higher-order function that takes two arguments: a function that returns a boolean and a sequence to apply it to. It produces a lazy iterator that keeps only the elements for which the function returns a truthy value, dropping those where it returns `False` (or something false-ish, like `[]` or `None`).

The classic example is the even/odd filter.

```python
def even(n):
    return n % 2 == 0

x = [1, 2, 3, 4, 5]
y = filter(even, x)
print(list(y))
# [2, 4]
```

Pretty simple.
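Two details are easy to miss in that example. The sketch below (the variable names are mine, for illustration) shows that the object `filter` returns is lazy and can only be consumed once, and that passing `None` instead of a function keeps the elements that are themselves truthy.

```python
def even(n):
    return n % 2 == 0

numbers = [1, 2, 3, 4, 5]

# filter returns a lazy iterator, not a list; nothing is evaluated yet.
evens = filter(even, numbers)

# Consuming the iterator applies the function to each element.
even_list = list(evens)
print(even_list)  # [2, 4]

# The iterator is now exhausted, so a second pass yields nothing.
second_pass = list(evens)
print(second_pass)  # []

# Passing None as the function keeps the elements that are themselves truthy.
mixed = [0, 1, [], [2], None, "", "a"]
truthy = list(filter(None, mixed))
print(truthy)  # [1, [2], 'a']
```

If you need to walk the results more than once, wrap the call in `list(...)` up front, at the cost of holding everything in memory.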

## When `filter` is not the Right Choice

Similar to `map`, we can mimic `filter`'s functionality with a comprehension.

```python
z = [n for n in x if even(n)]

print(z)
# [2, 4]
```

In general, the same rules that applied to `map` apply here too. If you're operating on a finite, already-in-memory sequence, then a comprehension is more readable. If you've got an infinite sequence and need lazy evaluation, `filter`'s a good choice. You can also write a generator expression over an infinite sequence, but it's not as common. If you need to compose the filter with others, the `filter` function is definitely the way to go. I'll cover composition in great detail in another post.
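Here is a small sketch of the infinite-sequence case. I'm using `itertools.count` as a stand-in for any endless source, and `itertools.islice` to take a handful of results; both come from the standard library, and the variable names are just for illustration.

```python
from itertools import count, islice

def even(n):
    return n % 2 == 0

# count(1) yields 1, 2, 3, ... forever, so it can never be turned into a list.
lazy_evens = filter(even, count(1))

# islice takes the first five results without exhausting the source.
first_five = list(islice(lazy_evens, 5))
print(first_five)  # [2, 4, 6, 8, 10]

# The generator-expression equivalent: parentheses instead of brackets.
gen_evens = (n for n in count(1) if even(n))
first_five_gen = list(islice(gen_evens, 5))
print(first_five_gen)  # [2, 4, 6, 8, 10]
```

A list comprehension with square brackets over `count(1)` would never finish, which is why the lazy forms matter here.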

## Stream Processing

One of my goals in writing these posts is to show examples of these patterns in real-world projects. This example is adapted from a script in my Profanity Power Index project, which streams data from Twitter's Streaming API for tweets containing profanity associated with some number of targets. It sends the filtered tweets to Elasticsearch for storage and visualization.

In my last post I showed how we used `map` to convert tweets into documents that Elasticsearch can load. Now I'll show how I used `filter` to remove the tweets that only contained clean language.

This is the function we're going to filter with.

```python
def contains_profanity(tweet):
    # _extract_text is just a helper function that pulls the text
    # out of the tweet, including any quoted retweets.
    tweet_text = _extract_text(tweet).lower()

    # PROFANITY is a list of profane words.
    # It was nice to put the swear words in the code itself and not
    # just the commit messages.
    for profanity in PROFANITY:
        if profanity in tweet_text:
            return True

    # If we made it this far, it's a clean tweet and we don't want
    # those.
    return False
```
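For what it's worth, the same check can be written with the built-in `any`, which also stops at the first match. This is a sketch, not the project's code: the `PROFANITY` list and `_extract_text` below are stand-ins I made up so the example runs, and the tweet is assumed to be a dict with a `text` key.

```python
# Stand-in word list; the real project uses actual profanity.
PROFANITY = ["darn", "heck"]

def _extract_text(tweet):
    # Hypothetical helper: assumes the tweet is a dict with a "text" key.
    return tweet.get("text", "")

def contains_profanity(tweet):
    tweet_text = _extract_text(tweet).lower()
    # any() short-circuits on the first profane word it finds.
    return any(profanity in tweet_text for profanity in PROFANITY)

tweets = [{"text": "What the HECK"}, {"text": "Lovely weather"}]
flagged = [contains_profanity(t) for t in tweets]
print(flagged)  # [True, False]
```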

The script (abbreviated) looks something like this:

```python
from elasticsearch.helpers import streaming_bulk

# track is the list of targets.
# api is an authenticated Twitter API client.
# client is an Elasticsearch client.
tweet_stream = api.GetStreamFilter(track=track)

# Filter to the tweets we want.
profane_tweet_stream = filter(contains_profanity, tweet_stream)

# Apply the map function to the filtered stream to create
# Elasticsearch documents.
bulk_action_stream = map(tweet_to_bulk, profane_tweet_stream)

for ok, resp in streaming_bulk(client, bulk_action_stream):
    if not ok:
        print(resp)
```

See how we were able to daisy-chain `filter` and `map` together? Neither stage materializes more than one record in memory at a time, because each one pulls a record from the stage before it only when asked. Chaining lazy iterators like this is a simple way to build memory-efficient pipelines.
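To see the laziness for yourself, here is a toy pipeline with no Twitter or Elasticsearch involved. Everything in it (`fake_stream`, `is_loud`, `to_doc`, the `log` list) is invented for illustration; the log records the order in which each stage touches each record.

```python
log = []

def fake_stream():
    # Stand-in for the tweet stream: yields one record at a time.
    for text in ["hello", "LOUD", "quiet", "ALSO LOUD"]:
        log.append(f"read {text}")
        yield text

def is_loud(text):
    log.append(f"filter {text}")
    return text.isupper()

def to_doc(text):
    log.append(f"map {text}")
    return {"text": text}

# Building the pipeline does no work; nothing has been read yet.
pipeline = map(to_doc, filter(is_loud, fake_stream()))
work_done_before_iterating = len(log)

docs = []
for doc in pipeline:
    log.append(f"sink {doc['text']}")
    docs.append(doc)

print(work_done_before_iterating)  # 0
print(log[:6])
# ['read hello', 'filter hello', 'read LOUD', 'filter LOUD', 'map LOUD', 'sink LOUD']
```

Each record travels all the way through the pipeline before the next one is read, which is what keeps the memory footprint flat.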

## What's Next

I showed that `filter` is a lot like `map` - it can be replicated with comprehensions or even with loops. But the more complex the pipeline gets, the more complicated that loop gets:

```python
for tweet in stream:
    if contains_profanity(tweet):
        tweet_doc = tweet_to_bulk(tweet)
        # ... send it to Elasticsearch.
```

If I want to add to this pipeline using the loop, I have to choose between adding another level of indentation and using `if`/`continue` to short-circuit the processing. Using `map` and `filter`, I just add another expression.
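As a rough sketch of that difference, suppose I wanted to drop retweets as well. The `is_original` check and the toy data below are hypothetical, and `contains_profanity` and `tweet_to_bulk` are replaced by trivial stand-ins so the example runs on its own.

```python
tweets = [
    {"text": "darn it", "retweet": False},
    {"text": "RT darn it", "retweet": True},
    {"text": "nice day", "retweet": False},
]

def contains_profanity(tweet):
    # Simplified stand-in for the real check.
    return "darn" in tweet["text"].lower()

def is_original(tweet):
    # Hypothetical extra condition: skip retweets.
    return not tweet["retweet"]

def tweet_to_bulk(tweet):
    # Trivial stand-in for the real conversion.
    return {"_source": tweet["text"]}

# Loop version: each new condition means more nesting or another continue.
loop_docs = []
for tweet in tweets:
    if not contains_profanity(tweet):
        continue
    if not is_original(tweet):
        continue
    loop_docs.append(tweet_to_bulk(tweet))

# Pipeline version: the new condition is one more expression.
profane = filter(contains_profanity, tweets)
originals = filter(is_original, profane)
pipeline_docs = list(map(tweet_to_bulk, originals))

print(pipeline_docs)  # [{'_source': 'darn it'}]
```

Both versions produce the same documents; the pipeline version just grows sideways instead of inward.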