Nazanin Ashrafi

Posted on Oct 3

Text Cleaning in Python

#python

Hey to my fellow Python enthusiasts.

This article is about very simple text cleaning. Nothing fancy or complicated here.
This might not be exactly a practical script, but it'll help you get the idea of how some things work, like accessing a file and breaking text into words.

In this article I'll be talking about how to clean up the symbols and put each word on its own line.

So for example, we have this text file and want to get rid of the symbols:

!@#<h2>**The Qu!ck Br0wn F0x Jumps Over The L@zy D0g.**</h2>
 (Did you see that?) Th!s is a sample $tring with **m!xed** cAsE, num3r@ls (like 123),

Let's work out the logic and break down the problem first, and then we'll go through it step by step.

1. First we need an empty list to add the cleaned words to.
2. Then we have to access the text file, which means opening the file.
3. Python reads a file line by line, not word by word. So first we have to go through each line, then each word, and then each character.
(line → word → character)
4. When we go through each character, we need to check whether it's a symbol or not.
5. If it is, it gets dropped, and if it isn't, we keep it.

Now let's work through this logic step by step with code snippets:

Let's start off with creating an empty list:

words = []

We also need to create a list of the characters that we don't want in our text.

symbols = ['!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~']
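As a side note, the 32 characters in this list are the same ones the standard library exposes as string.punctuation. Here's a minimal sketch, assuming you'd rather not type them out by hand (the name symbols_from_stdlib is made up for this example):

```python
import string

# The hand-written list from the article.
symbols = ['!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~']

# string.punctuation is one string holding the ASCII punctuation characters;
# list() turns it into a list of single characters.
symbols_from_stdlib = list(string.punctuation)

# Compare the two lists, element by element and in order.
print(symbols_from_stdlib == symbols)
```

Either version works for the rest of the article, since both are just a collection of single characters to check against.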

It's time to open the file:

with open("text.txt") as file:
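If you're wondering why we use with here: it closes the file for us as soon as the indented block ends. Here's a small sketch of that, assuming a throwaway file created with tempfile so it runs anywhere (the encoding="utf-8" argument and the variable names are my additions, not part of the script in this article):

```python
import os
import tempfile

# Create a temporary folder and file that stand in for text.txt.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "text.txt")
    with open(path, "w", encoding="utf-8") as file:
        file.write("Hello, World!\n")

    # Open the file for reading (the default mode).
    with open(path, encoding="utf-8") as file:
        first_line = file.readline()

    # We are outside the with block now, so the file is already closed.
    closed_after_block = file.closed

print(repr(first_line))
print(closed_after_block)
```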

In step 3 we talked about how Python reads a file line by line. To do that and then get down to each character, we need nested loops.

Looping through lines:

with open("text.txt") as file:
    for line in file:
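One thing worth knowing about this loop: each line you get still has its newline character at the end. A quick sketch, assuming io.StringIO as a stand-in for a real file so there's nothing to create on disk (fake_file and lines are just example names):

```python
import io

# io.StringIO behaves like an open text file, so no real file is needed here.
fake_file = io.StringIO("first line\nsecond line\n")

lines = []
for line in fake_file:
    lines.append(line)

# Each item still ends with "\n".
print(lines)
```

This won't cause trouble in our script, because the split() call in the next step throws the newline away along with the other whitespace.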

Looping through the words of each line:
This part is a bit tricky because we can't just say:

for word in line:

This is where the split() method comes into play.
What split() does:
It takes a single string and breaks it up into a list of smaller strings. Called with no arguments, it splits on whitespace.

with open("text.txt") as file:
    for line in file:
        for word in line.split():

So let me explain this part:

for line in file will get each line.
for word in line.split() will get each word from the line.
Let's say this is our text: "Hello, World!"

What a plain for word in line does is go through the string one character at a time, not one whole word at a time.

So it's not going to print out "Hello", but instead it prints out something like this:

H
e
l
l
o

And we don't want that. That's why we need the split() method, to tell Python: "Okay, cut the whole word out of the string and then move on to the next word, instead of going through it character by character."

Here's an example:

text = "Hello, World!"
for word in text:
    print(word)

text = "Hello, World!"
for word in text.split():
    print(word)

The first one would print out:

H
e
l
l
o
,

W
o
r
l
d
!

While the second one would print out:

Hello,
World!
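One more detail about split(): with no arguments it treats any run of spaces, tabs, or newlines as a single separator. Passing a separator like " " behaves differently. A small sketch to compare the two (the sample string and variable names are made up for this example):

```python
line = "  The   quick\tbrown fox\n"

# No arguments: split on any run of whitespace and drop the empty pieces.
by_whitespace = line.split()
print(by_whitespace)

# Explicit separator: every single space counts, so empty strings show up,
# and the tab and the trailing newline stay inside the pieces.
by_space = line.split(" ")
print(by_space)
```

That's why the script in this article uses plain split(): extra spaces and the newline at the end of each line simply disappear.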

Now that we have each word as a whole, we can move on to the next part, which is to check each word for symbols.
But before we loop through each character like this:

for char in word:

We need to create an empty string outside of this for loop, so that each time a character is not a symbol, it gets added to that variable.
Let's call it cleaned_word:

cleaned_word = ""

It goes at the start of the word loop, so it resets to empty for every new word:

with open("text.txt") as file:
    for line in file:
        for word in line.split():
            cleaned_word = ""

Now we can loop through each character and see if there's a symbol or not:

            cleaned_word = ""
            for char in word:
                if char not in symbols:
                    cleaned_word += char
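Once the loop version makes sense, the same filtering can be written in one line. This is only an alternative sketch, not a replacement for the code above. It assumes symbols comes from string.punctuation and uses one word from the sample text (via_join and via_translate are example names):

```python
import string

symbols = list(string.punctuation)
word = "**m!xed**"

# Same logic as the loop: keep every character that is not a symbol,
# then glue the kept characters back together.
via_join = "".join(char for char in word if char not in symbols)

# Another option: str.maketrans with a third argument builds a table
# that deletes those characters, and translate() applies it.
via_translate = word.translate(str.maketrans("", "", string.punctuation))

print(via_join)
print(via_translate)
```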

Now all we have to do is lowercase the cleaned_word and add it to the words list:

            final_word = cleaned_word.lower()
            words.append(final_word)

So now let's put together the whole code block:

with open("text.txt") as file:
    for line in file:
        for word in line.split():
            cleaned_word = ""
            for char in word:
                if char not in symbols:
                    cleaned_word += char
            final_word = cleaned_word.lower()
            words.append(final_word)

The last step is to print each word from the words list:

for word in words:
    print(word)
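If you want to try the logic without creating text.txt, here's a sketch that wraps the same loops in a function and feeds it the two sample lines directly. The function name clean_words, its parameters, and the check that skips empty results are my assumptions, not part of the original script:

```python
import string


def clean_words(lines, symbols=string.punctuation):
    """Return lowercase words from an iterable of lines, with symbols removed."""
    words = []
    for line in lines:
        for word in line.split():
            cleaned_word = ""
            for char in word:
                if char not in symbols:
                    cleaned_word += char
            # Assumption: skip tokens that were made of symbols only,
            # so no empty strings end up in the list.
            if cleaned_word:
                words.append(cleaned_word.lower())
    return words


# The two sample lines from the top of the article.
sample = [
    "!@#<h2>**The Qu!ck Br0wn F0x Jumps Over The L@zy D0g.**</h2>\n",
    " (Did you see that?) Th!s is a sample $tring with **m!xed** cAsE, num3r@ls (like 123),\n",
]

words = clean_words(sample)
print(words[:4])
```

Notice that the first word comes out as h2the: only the symbols are removed, so the letters inside a tag like <h2> stay glued to the word next to them.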

You can also filter the cleaned words and only print out the words that are longer than 5 characters:

            final_word = cleaned_word.lower()
            if len(final_word) > 5:
                words.append(final_word)

And here's the whole script with the filter added:

with open("text.txt") as file:
    for line in file:
        for word in line.split():
            cleaned_word = ""
            for char in word:
                if char not in symbols:
                    cleaned_word += char
            final_word = cleaned_word.lower()
            if len(final_word) > 5:
                words.append(final_word)


for word in words:
    print(word)
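If you'd rather keep the full words list and filter it afterwards, a list comprehension can do that in one line. A small sketch, assuming a few hard-coded cleaned words so it runs on its own (long_words is an example name):

```python
# A few cleaned words, hard-coded so the example runs without a file.
words = ["h2the", "quck", "jumps", "sample", "tring", "num3rls", "123"]

# Keep only the words with more than 5 characters.
long_words = [word for word in words if len(word) > 5]

for word in long_words:
    print(word)
```

The difference from the version above is only where the filtering happens: here the original list stays untouched, so you can reuse it with a different length later.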