This post is the second in a series on the functional side of Python. I've been told in code reviews that my Python looks like Clojure, which I naturally took as a compliment even if it wasn't meant as one. So I decided to write a series of posts here detailing how I write functional Python (where appropriate), bit by bit.
In the last post I wrote on this topic, I discussed map, one of two built-in higher-order functions in Python. There used to be a third, reduce, but it has since been moved into the standard library (it lives in functools now), and I think that's super weird. I'll explain why when I do a post on reduce. For today, I'll focus on filter. This'll be short, because filter's pretty simple.
The Basics
filter is a higher-order function that takes two arguments: a function that returns a boolean (a predicate) and a sequence to apply it to. It produces a lazy iterator that yields only the elements for which the function returns True, dropping everything that comes back False (or false-ish, like [] and None).
The classic example is the even/odd filter.
def even(n):
    return n % 2 == 0

x = [1, 2, 3, 4, 5]
y = filter(even, x)
print(list(y))
>>> [2, 4]
Pretty simple.
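One handy special case, which isn't in the example above but is built into filter itself: pass None in place of the function, and filter keeps the elements that are themselves truthy.

x = [0, 1, "", "hello", [], [2], None]
# With None as the predicate, truthiness of each element decides.
print(list(filter(None, x)))
>>> [1, 'hello', [2]]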
When filter is not the Right Choice
Similar to map, we can mimic filter's functionality with a comprehension.
z = [n for n in x if even(n)]
print(z)
>>> [2, 4]
In general, the same rules that applied to map apply here too. If you're operating on a finite, already-in-memory sequence, a comprehension is more readable. If you've got an infinite sequence and need lazy evaluation, filter is a good choice; you can also write a generator expression over an infinite sequence, but it's less common (a sketch of both follows below). If you need to compose the filter with others, the filter function is definitely the way to go. I'll cover composition in greater detail in another post.
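To make the infinite-sequence case concrete, here's a minimal sketch using the even function from above and itertools.count from the standard library. Both forms are lazy; islice just takes the first five results.

from itertools import count, islice

# count() yields 0, 1, 2, ... forever; neither form materializes it.
evens_via_filter = filter(even, count())
evens_via_genexp = (n for n in count() if even(n))

print(list(islice(evens_via_filter, 5)))
>>> [0, 2, 4, 6, 8]
print(list(islice(evens_via_genexp, 5)))
>>> [0, 2, 4, 6, 8]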
Stream Processing
One of my goals in writing these posts is to show examples of these patterns in real-world projects. This example is adapted from a script in my Profanity Power Index project, which uses Twitter's Streaming API to stream tweets containing profanity associated with some number of targets. It sends the filtered tweets to Elasticsearch for storage and visualization.
In my last post I showed how we used map to convert tweets into documents that Elasticsearch can load. Now I'll show how I used filter to remove the tweets that only contained clean language. This is the predicate we're going to filter with:
def contains_profanity(tweet):
    # _extract_text is just a helper function that pulls the text
    # out of the tweet, including any quoted retweets.
    tweet_text = _extract_text(tweet).lower()
    # PROFANITY is a list of profane words.
    # It was nice to put the swear words in the code itself and not
    # just the commit messages.
    for profanity in PROFANITY:
        if profanity in tweet_text:
            return True
    # If we made it this far, it's a clean tweet, and we don't want
    # those.
    return False
The script (abbreviated) looks something like this:
# track is the list of targets.
# api is an authenticated Twitter API client.
tweet_stream = api.GetStreamFilter(track=track)

# Filter to the tweets we want.
profane_tweet_stream = filter(contains_profanity, tweet_stream)

# Apply the map function to create Elasticsearch documents.
bulk_action_stream = map(tweet_to_bulk, profane_tweet_stream)

# Load Elasticsearch.
for ok, resp in streaming_bulk(client, bulk_action_stream):
    if not ok:
        print(resp)
See how we were able to daisy-chain filter and map together without materializing more than one record in memory at a time? We can build incredibly robust, memory-efficient pipelines by chaining generators together.
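To see that one-record-at-a-time behavior for yourself, here's a small self-contained sketch. The loud_even and loud_square functions are stand-ins I made up just to show the order of evaluation:

def loud_even(n):
    print(f"testing {n}")
    return n % 2 == 0

def loud_square(n):
    print(f"squaring {n}")
    return n * n

# Nothing runs until the for loop starts pulling records through.
pipeline = map(loud_square, filter(loud_even, range(4)))
for result in pipeline:
    print(result)

Each element travels through the whole pipeline before the next one is pulled from the source, so the "testing" and "squaring" messages interleave instead of printing in two separate batches.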
What's Next
I showed that filter is a lot like map: it can be replicated with comprehensions or even with loops. But the more complex the pipeline gets, the more complicated that loop gets:
for tweet in stream:
    if contains_profanity(tweet):
        tweet_doc = tweet_to_bulk(tweet)
        # ... send it to Elasticsearch.
If I wanted to add a stage to this pipeline using the loop, I'd have to choose between adding another level of indentation and using if/continue to short-circuit the processing. Using map and filter, I just add another expression.
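For example, suppose I wanted to keep only English-language tweets. With a hypothetical is_english predicate (not part of the original script), the new stage is one line:

# New stage: drop non-English tweets. No new indentation required.
english_tweet_stream = filter(is_english, profane_tweet_stream)
bulk_action_stream = map(tweet_to_bulk, english_tweet_stream)

The loading loop at the end doesn't change at all.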