Quickly find common phrases in a large list of strings

Matthew Schwartz
Managing development teams and writing software for 25+ years. Owner of SocialSentiment.io.
・3 min read

Python is very good at efficiently iterating over sets of data and gathering useful information. This is often accomplished with a surprisingly short amount of code.

I recently came across a use case within a Python application where I wanted to find repeated phrases in sets of social media posts. This is an easily managed problem because these posts are relatively short, typically under 300 characters, and therefore we can process thousands directly in memory.

NLTK includes collocation utilities that can solve this problem, but in my case they didn't give the results I needed, so I wrote a simple solution that requires no libraries.

Ins and Outs

First let's define the input and output. Our function will take an iterable of strings, a maximum phrase length (default 3), and a minimum repeat count (default 2). As we'll soon see, the choice of phrase length will have a huge impact on performance. Minimum repeat count will have a smaller but still significant impact.

The output will be a dictionary. Each key will be a tuple of words which make up a found phrase. Words returned will be all lower case. Each value in the dictionary will be the number of times that phrase is found.

In many cases you'll want to ignore stop words. Useful lists can be found in a gist and the comments under it on GitHub.

stopwords = ["a", "i", "some", "which", "where", ... ]

The Algorithm

Let's start our function:

import re

def get_common_phrases(texts, maximum_length=3, minimum_repeat=2) -> dict:
    phrases = {}

First let's break down the texts into phrases. These will be tuples between 1 and maximum_length in size. For each phrase we'll count how often we find it.

for text in texts:
    # Replace separators and punctuation with spaces
    text = re.sub(r'[.!?,:;/\-\s]', ' ', text)
    # Remove extraneous chars
    text = re.sub(r'[\\|@#$&~%\(\)*\"]', '', text)

    words = text.split(' ')
    # Remove stop words and empty strings
    words = [w for w in words if len(w) and w.lower() not in stopwords]
    length = len(words)
    # Look at phrases no longer than maximum_length words long
    size = min(length, maximum_length)
    while size > 0:
        pos = 0
        # Walk over all sets of words
        while pos + size <= length:
            phrase = words[pos:pos+size]
            phrase = tuple(w.lower() for w in phrase)
            if phrase in phrases:
                phrases[phrase] += 1
            else:
                phrases[phrase] = 1
            pos += 1
        size -= 1

You'll notice that increasing maximum_length triggers more loop iterations per text, and the number of distinct phrases to store grows as well, so processing time and memory use climb quickly. Set the maximum to as small a number as reasonably possible. In my case I found 3 to be a good value.
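To get a feel for the cost, here's a quick back-of-envelope sketch (my own, not part of the original function) counting how many phrase windows the inner loops examine for a single text:

```python
# For a text of `length` words, each phrase size s in 1..maximum_length
# contributes (length - s + 1) sliding windows.
def window_count(length, maximum_length):
    return sum(length - s + 1 for s in range(1, min(length, maximum_length) + 1))

print(window_count(20, 3))  # 57 windows for a 20-word text
print(window_count(20, 6))  # 105 windows
```

The window count itself grows roughly linearly with maximum_length; the bigger cost in practice is the dictionary of distinct phrase tuples, which gains a whole new family of keys with every extra size.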

Next we'll remove phrases found less than the minimum required number of times.

    phrases = {k: v for k, v in phrases.items() if v >= minimum_repeat}

And last we remove sub-phrases unless they are found substantially more often than their longer counterparts. I found this to be the most interesting and useful problem to solve to get quality results. I set a threshold of 25% deviation in count: if the shorter sub-phrase is found often enough outside the longer phrase, we'll include both in the output.

longest_phrases = {}
keys = list(phrases.keys())
keys.sort(key=len, reverse=True)
for phrase in keys:
    found = False
    for l_phrase in longest_phrases:
    # If every word of this phrase also appears in a longer phrase...
        intersection = set(l_phrase).intersection(phrase)
        if len(intersection) == len(phrase):
        # ... and it isn't found at least 25% more often on its own, drop it
            difference = (phrases[phrase] - longest_phrases[l_phrase]) / longest_phrases[l_phrase]
            if difference < 0.25:
                found = True
                break
    if not found:
        longest_phrases[phrase] = phrases[phrase]

return longest_phrases
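To make the threshold concrete, here's a small worked example with hypothetical counts (the numbers are illustrative, not taken from the sample data below):

```python
# Suppose the longer phrase ('catch', 'common', 'phrases') was counted 3 times
# and its sub-phrase ('common', 'phrases') was also counted 3 times.
longer_count = 3
shorter_count = 3
difference = (shorter_count - longer_count) / longer_count
print(difference)  # 0.0 -> below the 0.25 threshold, so the sub-phrase is dropped

# If the sub-phrase also appeared once outside the longer phrase (4 total)...
shorter_count = 4
difference = (shorter_count - longer_count) / longer_count
print(round(difference, 2))  # 0.33 -> above 0.25, so both phrases are kept
```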

Let's test the output. Here's sample input:

texts = (
    "This is the first text where I want to catch some common phrases",
    "This is a second text where I hope to catch some common phrases",
    "This is a third text which should catch some common phrases",
    "I'm a unique string",
    "A post with text"
)

The output of this function will be:

{
    ('catch', 'common', 'phrases'): 3,
    ('text',): 4,
    ('text', 'catch', 'common'): 2
}
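For convenience, here is the whole function assembled from the snippets above into one runnable piece. The stopword list is a trimmed version that's just enough for this sample input; a real deployment would use a much longer list:

```python
import re

# Trimmed stopword list, sufficient only for the sample input below
stopwords = ["this", "is", "the", "a", "i", "to", "some",
             "where", "which", "want", "hope", "with"]

def get_common_phrases(texts, maximum_length=3, minimum_repeat=2) -> dict:
    phrases = {}
    for text in texts:
        # Replace separators and punctuation with spaces
        text = re.sub(r'[.!?,:;/\-\s]', ' ', text)
        # Remove extraneous chars
        text = re.sub(r'[\\|@#$&~%\(\)*\"]', '', text)
        # Remove stop words and empty strings
        words = [w for w in text.split(' ')
                 if len(w) and w.lower() not in stopwords]
        length = len(words)
        # Look at phrases no longer than maximum_length words long
        size = min(length, maximum_length)
        while size > 0:
            pos = 0
            # Walk over all sets of words
            while pos + size <= length:
                phrase = tuple(w.lower() for w in words[pos:pos + size])
                phrases[phrase] = phrases.get(phrase, 0) + 1
                pos += 1
            size -= 1
    # Drop phrases found fewer than minimum_repeat times
    phrases = {k: v for k, v in phrases.items() if v >= minimum_repeat}
    # Drop sub-phrases unless they occur notably often on their own
    longest_phrases = {}
    for phrase in sorted(phrases, key=len, reverse=True):
        found = False
        for l_phrase in longest_phrases:
            intersection = set(l_phrase).intersection(phrase)
            if len(intersection) == len(phrase):
                difference = (phrases[phrase] - longest_phrases[l_phrase]) / longest_phrases[l_phrase]
                if difference < 0.25:
                    found = True
                    break
        if not found:
            longest_phrases[phrase] = phrases[phrase]
    return longest_phrases

texts = (
    "This is the first text where I want to catch some common phrases",
    "This is a second text where I hope to catch some common phrases",
    "This is a third text which should catch some common phrases",
    "I'm a unique string",
    "A post with text",
)
print(get_common_phrases(texts))
```

With this stopword list, the function reproduces the output shown above.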

We've been running this algorithm on SocialSentiment.io for a few weeks with very positive results. We track the sentiment of social media posts which reference publicly traded companies. This function helps us find and display frequently found phrases in those posts.
