This post is the second in a series on the functional side of Python. I've been told in code reviews that my Python looks like Clojure, which I naturally took as a compliment even if it wasn't meant as one. So I decided to write a series of posts here detailing how I write functional Python (where appropriate), bit by bit.
In the last post I wrote on this topic, I discussed `map`. `map` is one of two built-in higher-order functions in Python. There used to be a third, `reduce`, but it has since been moved into the standard library (as `functools.reduce`), and I think that's super weird. I'll explain why when I do a post on `reduce`. For today, I'll focus on `filter`. This'll be short, because `filter` is pretty simple.

`filter` is a higher-order function that takes two arguments: a function that returns a boolean and a sequence to apply it to. It returns a lazy iterator that drops the elements of the sequence for which the function returns `False` (or anything false-ish, like `0`, `None`, or an empty string).
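There's also a handy idiom here: if you pass `None` as the function, `filter` just drops the falsy elements themselves. A quick sketch of both forms:

```python
# With a predicate, filter keeps elements where the function is truthy.
evens = filter(lambda n: n % 2 == 0, range(6))
print(list(evens))  # → [0, 2, 4]

# With None as the predicate, filter drops falsy values outright.
mixed = [0, 1, "", "a", None, [], [2]]
print(list(filter(None, mixed)))  # → [1, 'a', [2]]
```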
The classic example is the even/odd filter.
```python
def even(n):
    return n % 2 == 0

x = [1, 2, 3, 4, 5]
y = filter(even, x)
print(list(y))
# >>> [2, 4]
```
Like `map`, we can mimic `filter`'s functionality with a comprehension.
```python
z = [n for n in x if even(n)]
print(z)
# >>> [2, 4]
```
In general, the same rules that applied to `map` apply here too. If you're operating on a finite, already-in-memory sequence, a comprehension is more readable. If you've got an infinite sequence and need lazy evaluation, `filter` is a good choice; you can also use a generator expression for infinite sequences, but it's not as common. And if you need to compose the filter with others, the `filter` function is definitely the way to go. I'll cover composition in great detail in another post.
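To sketch both of those points at once, here's `filter` composed over an infinite sequence, something a list comprehension can't do at all (the predicate names here are just illustrative):

```python
import itertools

def is_even(n):
    return n % 2 == 0

def is_positive(n):
    return n > 0

# An infinite stream of integers: 0, 1, 2, ...
numbers = itertools.count()

# Compose two filters lazily; nothing is evaluated yet.
evens = filter(is_even, numbers)
positive_evens = filter(is_positive, evens)

# Pull just the first five results off the infinite stream.
print(list(itertools.islice(positive_evens, 5)))  # → [2, 4, 6, 8, 10]
```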
One of my goals in writing these posts is to show examples of these patterns in real-world projects. This example is adapted from a script in my Profanity Power Index project, which streams data from Twitter's Streaming API for tweets containing profanity associated with some number of targets. It sends the filtered tweets to Elasticsearch for storage and visualization.
In my last post I showed how I used `map` to convert tweets into documents that Elasticsearch can load. Now I'll show how I used `filter` to remove the tweets that contained only clean language.
This is the predicate function we're going to filter with.
```python
def contains_profanity(tweet):
    # _extract_text is just a helper function that pulls the text
    # out of the tweet, including any quoted retweets.
    tweet_text = _extract_text(tweet).lower()
    # PROFANITY is a list of profane words.
    # It was nice to put the swear words in the code itself and not
    # just the commit messages.
    for profanity in PROFANITY:
        if profanity in tweet_text:
            return True
    # If we made it this far, it's a clean tweet and we don't want
    # those.
    return False
```
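As an aside, the loop inside `contains_profanity` can be collapsed into a single `any()` call over a generator expression, which short-circuits on the first match just like the loop does. A sketch, with hypothetical stand-ins for the real `_extract_text` helper and `PROFANITY` list:

```python
# Hypothetical stand-ins for the real helper and word list.
PROFANITY = ["darn", "heck"]

def _extract_text(tweet):
    return tweet.get("text", "")

def contains_profanity(tweet):
    tweet_text = _extract_text(tweet).lower()
    # any() stops at the first profane word it finds.
    return any(word in tweet_text for word in PROFANITY)

print(contains_profanity({"text": "Well, HECK."}))    # → True
print(contains_profanity({"text": "How pleasant."}))  # → False
```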
The script (abbreviated) looks something like this:
```python
# track is the list of targets.
# api is an authenticated Twitter API client.
tweet_stream = api.GetStreamFilter(track=track)

# Filter to the tweets we want.
profane_tweet_stream = filter(contains_profanity, tweet_stream)

# Apply the map function to create Elasticsearch documents.
bulk_action_stream = map(tweet_to_bulk, profane_tweet_stream)

# Load Elasticsearch.
for ok, resp in streaming_bulk(client, bulk_action_stream):
    if not ok:
        print(resp)
```
See how we were able to daisy-chain `filter` and `map` together without materializing more than one record in memory at a time? We can build incredibly robust, memory-efficient pipelines by chaining generators together.
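Here's a toy version of the same pattern, with a fake tweet stream and simplified helpers standing in for the real Twitter client and document builder:

```python
def fake_stream():
    # Stand-in for api.GetStreamFilter(): yields one tweet at a time.
    for text in ["hello world", "darn it", "fine day", "what the heck"]:
        yield {"text": text}

def contains_profanity(tweet):
    return any(word in tweet["text"] for word in ["darn", "heck"])

def tweet_to_bulk(tweet):
    # Stand-in for the real document builder from the last post.
    return {"_source": {"message": tweet["text"]}}

# Each stage is lazy; only one tweet is ever in flight at a time.
pipeline = map(tweet_to_bulk, filter(contains_profanity, fake_stream()))
for doc in pipeline:
    print(doc)
```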
I showed that `filter` is a lot like `map`: it can be replicated with comprehensions or even with loops. But the more complex the pipeline gets, the more complicated that loop gets:
```python
for tweet in stream:
    if contains_profanity(tweet):
        tweet_doc = tweet_to_bulk(tweet)
        # ... send it to Elasticsearch.
```
If I want to add to this pipeline using the loop, I have to choose between adding another level of indentation and using `continue` to short-circuit the processing. With `filter`, I just add another expression.
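For instance, adding a length check to the loop would mean another level of nesting or a `continue`, while the generator pipeline grows by one line. A sketch with hypothetical predicates:

```python
tweets = [{"text": "darn"}, {"text": "a much longer darn tweet"}, {"text": "ok"}]

def contains_profanity(tweet):
    return "darn" in tweet["text"]

def long_enough(tweet):
    return len(tweet["text"]) > 10

# Adding a stage is one more expression, no extra indentation.
stream = iter(tweets)
stream = filter(contains_profanity, stream)
stream = filter(long_enough, stream)
print([t["text"] for t in stream])  # → ['a much longer darn tweet']
```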