Nazanin Ashrafi

Text Cleaning in Python

Hey to my fellow Python enthusiasts.

This article is about very simple text cleaning. Nothing fancy or complicated here.
This might not be exactly a practical script, but it'll help you get the idea of how some stuff works, like accessing a file and breaking text into words.

In this article I'll be talking about how to clean up the symbols and keep each word on a separate line.

So for example, we have this text file and want to get rid of the symbols:

!@#<h2>**The Qu!ck Br0wn F0x Jumps Over The L@zy D0g.**</h2>
 (Did you see that?) Th!s is a sample $tring with **m!xed** cAsE, num3r@ls (like 123), 

Let's work out the logic by breaking the problem down, and then we'll go through it step by step.

  1. First we need an empty list to add the cleaned words to.

  2. Then we have to access the text file, which means opening and reading the file.

  3. Python reads a file line by line, not word by word. So first we have to go through each line, then through each word, and then through each character.
    (line → word → character)

  4. When we go through each character, we need to check whether it's a symbol or not.
    If it is, we drop it, and if it isn't, we keep it.
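
Before touching a real file, here's a rough sketch of the line → word → character order from step 3. It runs on a small made-up list of lines instead of a file, and the walk function is just a name picked for this demo, not part of the script we're building.

```python
# A tiny in-memory stand-in for a file: a list of lines (made-up sample data).
lines = ["Hi there!\n", "Bye.\n"]


def walk(lines):
    """Yield (line_number, word, char) in the order the nested loops visit them."""
    for line_number, line in enumerate(lines, start=1):  # line
        for word in line.split():                        # word
            for char in word:                            # character
                yield line_number, word, char


steps = list(walk(lines))

for step in steps[:3]:
    print(step)
# (1, 'Hi', 'H')
# (1, 'Hi', 'i')
# (1, 'there!', 't')
```

Every character gets visited exactly once, and we always know which word and which line it came from.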

Now let's work through this logic step by step with code snippets:

  • Let's start off with creating an empty list:
words = []
  • We also need to create a list of the characters that we don't want in our text.
symbols = ['!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~']
  • Now it's time to open the file:
with open("text.txt") as file:
  • In step 3 we talked about how Python reads a file line by line. So to do that and go through each character, we need nested loops.

Looping through lines:

with open("text.txt") as file:
    for line in file:
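
If you want to see what each line actually looks like, here's a small sketch. It assumes a made-up two-line sample and writes it to a temporary text.txt, so it won't touch any file you already have.

```python
import os
import tempfile

# Made-up two-line sample, written to a temporary text.txt just for this demo.
sample = "first line\nsecond line\n"

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "text.txt")
    with open(path, "w", encoding="utf-8") as file:
        file.write(sample)

    lines = []
    with open(path, encoding="utf-8") as file:
        for line in file:
            lines.append(line)  # each item is one full line of the file

print(lines)  # ['first line\n', 'second line\n']
```

Notice that every line still carries its newline character at the end.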

Looping through the words of each line:
This part is a bit tricky because we can't just say:

for word in line:

This is where the split() method comes into play.
What split() does:
It takes a single string and breaks it up into a list of smaller strings. With no arguments, it splits on whitespace.

with open("text.txt") as file:
    for line in file:
        for word in line.split():

So let me explain this part:

for line in file gets each line.
for word in line looks like it should get each word from the line, but it doesn't.
Let's say this is our text: "Hello, World!"

What for word in line actually does is go through the line one character at a time, not one whole word at a time.

So basically it's not gonna print out "Hello" but instead it prints out something like this:

H
e
l
l
o

And we don't want that. That's why we need the split() method to tell Python: "okay, cut the whole word out of the string and then move on to the next word, instead of going through it character by character."

Here's an example:

text = "Hello, World!"
for word in text:
    print(word)

vs

text = "Hello, World!"
for word in text.split():
    print(word)

The first one would print out:

H
e
l
l
o
,

W
o
r
l
d
!

While the second one would print out:

Hello,
World!
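
One more thing worth seeing is how split() deals with extra spaces and the newline at the end of a line. This is a small sketch on a shortened piece of the sample text, comparing split() with split(" ").

```python
# A shortened piece of the sample text, with a leading space and a trailing newline.
line = " (Did you see that?) Th!s is a sample\n"

words = line.split()      # no arguments: split on any run of whitespace
print(words)
# ['(Did', 'you', 'see', 'that?)', 'Th!s', 'is', 'a', 'sample']

pieces = line.split(" ")  # explicit separator: split on every single space
print(pieces)
# ['', '(Did', 'you', 'see', 'that?)', 'Th!s', 'is', 'a', 'sample\n']
```

With no arguments, the leading space and the newline simply disappear. With " " as the separator we get an empty string at the start and the newline stays glued to the last word, so plain split() is the easier choice here.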

Now that we have each word as a whole, we can move on to the next part, which is to check each word for symbols.
But before we loop through each character like this:

for char in word:

We need to create an empty string outside of this for loop, so that each time a character is not a symbol, it gets added to that variable.
Let's call it cleaned_word:

cleaned_word = ""

It goes inside the loop over the words, so it starts out empty for every new word:

with open("text.txt") as file:
    for line in file:
        for word in line.split():
            cleaned_word = ""

Now we can loop through each character and see if there's a symbol or not:

            cleaned_word = ""
            for char in word:
                if char not in symbols:
                    cleaned_word += char
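
To try the character check on its own, here's a minimal sketch that wraps it in a function. The name clean_word is made up for this demo, and the symbols list is shortened to keep things small.

```python
# Shortened symbols list, just for this demo (the full list is longer).
symbols = ['!', '*', '.', '<', '/', '>']


def clean_word(word):
    """Return word without the characters listed in symbols (made-up helper)."""
    cleaned_word = ""
    for char in word:
        if char not in symbols:
            cleaned_word += char
    return cleaned_word


print(clean_word("**m!xed**"))    # mxed
print(clean_word("D0g.**</h2>"))  # D0gh2
```

The second result shows a limit of this approach: only the brackets and the slash of the tag are removed, so the h2 inside it stays attached to the word.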

Now all we have to do is lowercase the cleaned_word and add it to the words list:

            final_word = cleaned_word.lower()
            words.append(final_word)
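
One thing to watch out for: if a word is made of nothing but symbols, cleaned_word ends up as an empty string, and that empty string still gets appended. Here's a small sketch with an extra if check that skips those. The check is my own addition, not part of the script in this article, and the symbols list is shortened again.

```python
# Shortened symbols list and a made-up sample string, just for this demo.
symbols = ['!', '*', '(', ')', ',']

words = []
for word in "Hello! *** (World),".split():
    cleaned_word = ""
    for char in word:
        if char not in symbols:
            cleaned_word += char
    final_word = cleaned_word.lower()
    if final_word:  # extra guard: skip words that were nothing but symbols
        words.append(final_word)

print(words)  # ['hello', 'world']
```

Without the if final_word: line, the list would have an empty string sitting between the two words.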

So now let's put together the whole code block:

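# words and symbols are the two lists we created earlier
with open("text.txt") as file:
    for line in file:                     # one line at a time
        for word in line.split():         # one word at a time
            cleaned_word = ""
            for char in word:             # one character at a time
                if char not in symbols:   # keep everything that isn't a symbol
                    cleaned_word += char
            final_word = cleaned_word.lower()
            words.append(final_word)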
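
If you'd like to run the whole thing without creating a file first, here's a self-contained sketch. It makes two assumptions: the sample text is passed in as a list of lines instead of being read from text.txt, and the symbols come from string.punctuation, which holds the same 32 characters as the symbols list above. The function name clean_lines is made up for this demo.

```python
import string

# string.punctuation contains the same 32 characters as the hand-written symbols list.
symbols = list(string.punctuation)

# The sample text from the top of the article, as a list of lines (stand-in for the file).
sample_lines = [
    "!@#<h2>**The Qu!ck Br0wn F0x Jumps Over The L@zy D0g.**</h2>\n",
    " (Did you see that?) Th!s is a sample $tring with **m!xed** cAsE, num3r@ls (like 123), \n",
]


def clean_lines(lines):
    """Return the cleaned, lowercased words from an iterable of lines (made-up helper)."""
    words = []
    for line in lines:
        for word in line.split():
            cleaned_word = ""
            for char in word:
                if char not in symbols:
                    cleaned_word += char
            words.append(cleaned_word.lower())
    return words


words = clean_lines(sample_lines)
print(words[:9])
# ['h2the', 'quck', 'br0wn', 'f0x', 'jumps', 'over', 'the', 'lzy', 'd0gh2']
```

Since clean_lines only needs something it can loop over line by line, an open file object should work in place of sample_lines too.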

The last step is to print each word from the words list:

for word in words:
    print(word)

You can also filter the cleaned words and only keep the words that are longer than 5 characters:

            final_word = cleaned_word.lower()
            if len(final_word) > 5:
                words.append(final_word)

And the full script with the filter:

with open("text.txt") as file:
    for line in file:
        for word in line.split():
            cleaned_word = ""
            for char in word:
                if char not in symbols:
                    cleaned_word += char
            final_word = cleaned_word.lower()
            if len(final_word) > 5:
                words.append(final_word)


for word in words:
    print(word)
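
If you want to play with the length limit, one option is to turn it into a parameter. This is only a sketch: the function name long_words and the longer_than parameter are made up, the sample text is passed in as a list of lines, and string.punctuation stands in for the symbols list.

```python
import string


def long_words(lines, longer_than=5, symbols=string.punctuation):
    """Clean each word and keep the ones longer than longer_than characters (made-up helper)."""
    words = []
    for line in lines:
        for word in line.split():
            # Same character check as before, written as a join over the kept characters.
            cleaned_word = "".join(char for char in word if char not in symbols)
            final_word = cleaned_word.lower()
            if len(final_word) > longer_than:
                words.append(final_word)
    return words


# The sample text from the top of the article, as a list of lines.
sample_lines = [
    "!@#<h2>**The Qu!ck Br0wn F0x Jumps Over The L@zy D0g.**</h2>\n",
    " (Did you see that?) Th!s is a sample $tring with **m!xed** cAsE, num3r@ls (like 123), \n",
]

print(long_words(sample_lines))                 # ['sample', 'num3rls']
print(long_words(sample_lines, longer_than=6))  # ['num3rls']
```

Only two words from the sample make it past the default limit, because several of the others (like "br0wn" and "jumps") are exactly 5 characters long, not longer.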

So that's basically it for this article.
See y'all in the next one.


You can also reach out to me on Twitter: @nazanin_ashrafi
