Timothy Cummins

Processing Text in Python: split, get and strip

This past week I have been helping a friend of mine understand some Python concepts while he was trying to pull data out of a .txt file. I thought I would share what we covered, in case it helps someone else struggling with the same sort of material. The methods we used are split, get and strip.

.split()

The split method is one of the most useful string tools in Python. It breaks a string apart at a separator you choose and returns the pieces as a list. With no arguments, it splits on whitespace, meaning spaces, tabs and newlines. For example, let's create a simple sentence and split it:

sentence = "I fought the law but the law won"
words = sentence.split()
print(words)
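
Here is a small sketch of what the default split returns for our sentence. The `messy` string is a made-up input, not part of the original example, that I added to show how the default differs from passing a single space yourself.

```python
sentence = "I fought the law but the law won"

# No arguments: split on runs of whitespace (spaces, tabs, newlines).
words = sentence.split()
print(words)
# ['I', 'fought', 'the', 'law', 'but', 'the', 'law', 'won']

# Hypothetical messy input with repeated spaces and a trailing newline.
messy = "I  fought   the law\n"
default_parts = messy.split()     # runs of whitespace count as one separator
space_parts = messy.split(" ")    # every single space is a separator
print(default_parts)  # ['I', 'fought', 'the', 'law']
print(space_parts)    # ['I', '', 'fought', '', '', 'the', 'law\n']
```

When reading lines from a .txt file, the no-argument form is usually the one you want, since it also drops the newline at the end of each line.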

You can also pass split a separator to split on, and a second argument, maxsplit, that limits how many splits it makes.

Splitting on "the":

words2 = sentence.split("the")
print(words2)
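
As a quick sketch of what splitting on "the" produces: the separator itself is removed, but the spaces around it stay in the pieces. The second string is a hypothetical example I added to show that split matches substrings, not whole words.

```python
sentence = "I fought the law but the law won"

# The separator "the" is dropped; the surrounding spaces are kept.
words2 = sentence.split("the")
print(words2)  # ['I fought ', ' law but ', ' law won']

# Hypothetical example: "the" also matches inside the word "other".
other = "the other one".split("the")
print(other)  # ['', ' o', 'r one']
```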

Limiting the number of splits to 4:

words3 = sentence.split(" ",4)
print(words3)
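
A short sketch of the result: four splits give five pieces, and everything after the fourth split stays together in the last piece. The keyword form `maxsplit=4` is the same argument passed by name; the variable name `words3_kw` is just mine for illustration.

```python
sentence = "I fought the law but the law won"

# At most 4 splits, so the list has at most 5 items.
words3 = sentence.split(" ", 4)
print(words3)  # ['I', 'fought', 'the', 'law', 'but the law won']

# Same limit passed by keyword, splitting on any whitespace.
words3_kw = sentence.split(maxsplit=4)
print(words3_kw)  # ['I', 'fought', 'the', 'law', 'but the law won']
```

This is handy when a line starts with a few fixed fields followed by free text that may itself contain spaces.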

.get()

The next method I would like to talk about is get(). Unlike split and strip, get is a dictionary method rather than a string method. It returns the value stored under a given key, and where it becomes very useful for working with text is that, instead of raising a KeyError when the key doesn't exist, it returns a default value you specify (or None if you don't give one). For example, get makes it easy to build a dictionary that counts words.

word_counts = {}
for word in words:
    word_counts[word] = word_counts.get(word,0)+1

In the above example we start with an empty dictionary and loop through our list of words. We take advantage of the get method's ability to return a specified default value, here 0, when it has not seen that key before. Lastly, we add 1 to that result and store it back under the key, so each word's count goes up by one every time the word appears.
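
To make the behaviour of get concrete, here is a minimal sketch that first looks up a missing key three ways and then runs the word counter on our sentence. The name `counts` is just a fresh dictionary I use for illustration.

```python
sentence = "I fought the law but the law won"
words = sentence.split()

counts = {}

# Looking up a key that is not in the dictionary yet:
missing_no_default = counts.get("law")       # None, no error
missing_with_default = counts.get("law", 0)  # 0, the default we asked for
try:
    counts["law"]                            # square brackets raise instead
    raised = False
except KeyError:
    raised = True
print(missing_no_default, missing_with_default, raised)  # None 0 True

# The word counter: read the current count (0 if unseen), add 1, store it.
for word in words:
    counts[word] = counts.get(word, 0) + 1
print(counts)
# {'I': 1, 'fought': 1, 'the': 2, 'law': 2, 'but': 1, 'won': 1}
```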

.strip()

The last method I find necessary for working with text is strip. By default it removes whitespace from the beginning and end of a string. If you pass it a string of characters, it instead removes any of those characters from the beginning and end of the string. To see why this is useful, suppose we are building a word count but our string contains some commas, so after using split our list looks like this:

words2 = ['I', 'fought', 'the', 'law,', 'but', 'the,', 'law', 'won']

Let's use our get method to create a word count of this new list, starting from a fresh dictionary:

word_counts = {}
for word in words2:
    word_counts[word] = word_counts.get(word, 0) + 1
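
Here is a sketch of that count from start to finish. The `comma_sentence` string is my assumption of a sentence that would produce the list above; only the list itself comes from the example.

```python
# Assumed input text, chosen so that split() gives the list from the example.
comma_sentence = "I fought the law, but the, law won"
comma_words = comma_sentence.split()
print(comma_words)
# ['I', 'fought', 'the', 'law,', 'but', 'the,', 'law', 'won']

word_counts = {}
for word in comma_words:
    word_counts[word] = word_counts.get(word, 0) + 1
print(word_counts)
# {'I': 1, 'fought': 1, 'the': 1, 'law,': 1, 'but': 1, 'the,': 1, 'law': 1, 'won': 1}
```

Every count is 1, because the commas make 'law,' and 'the,' look like different words from 'law' and 'the'.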

The count now treats 'law' and 'law,' as two separate words. To fix this, we can use a for loop and our strip method to remove the unwanted commas from our words.

new_words = []
for w in words2:  # loop over the list that still contains commas
    new_words.append(w.strip(','))
print(new_words)
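
A few small cases may help show what strip does and does not touch. This is only a sketch, the sample strings are made up, and I use a list comprehension as a shorter way of writing the same loop.

```python
# Default: remove whitespace from both ends.
padded = "  law  ".strip()
print(repr(padded))    # 'law'

# One character to remove from both ends.
trailing = "law,".strip(",")
print(repr(trailing))  # 'law'

# Several characters: each one is removed wherever it sits at either end.
mixed = "...law!,".strip(".,!")
print(repr(mixed))     # 'law'

# Characters in the middle of the string are left alone.
inner = "a,b,".strip(",")
print(repr(inner))     # 'a,b'

# The comma clean-up from above as a list comprehension.
words2 = ['I', 'fought', 'the', 'law,', 'but', 'the,', 'law', 'won']
new_words = [w.strip(",") for w in words2]
print(new_words)
# ['I', 'fought', 'the', 'law', 'but', 'the', 'law', 'won']
```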

Now once again we have our words without extra punctuation and can get a correct count. Stripping like this is a common cleanup step in Natural Language Processing, not only for removing punctuation but also as part of reducing variations in words so that you can compare similar words such as you and you're.
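
Putting the three methods together, a word counter could look something like the sketch below. The function name `count_words` and the sample text are hypothetical, and using `string.punctuation` from the standard library instead of just a comma is my own choice so that other marks such as "!" are stripped too.

```python
import string


def count_words(text):
    """Split on whitespace, strip punctuation from both ends, then count."""
    counts = {}
    for raw in text.split():
        word = raw.strip(string.punctuation)
        if word:  # skip pieces that were nothing but punctuation
            counts[word] = counts.get(word, 0) + 1
    return counts


result = count_words("I fought the law, but the, law won!")
print(result)
# {'I': 1, 'fought': 1, 'the': 2, 'law': 2, 'but': 1, 'won': 1}
```

Note that strip only works on the ends of each word, so the apostrophe inside you're is left in place. Treating you and you're as the same word would need an extra step beyond strip.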

I hope this helps anyone learning how to process text with Python and that you will have fun continuing your journey.
