Introduction
A word cloud (also called a tag cloud or weighted list) is a visual representation of text data. Tags are usually single words, and the importance of each is shown by font size or color. This article discusses how to generate a word cloud using Python.
📚 Python Libraries
- wordcloud
- Counter (from collections)
- re
- os
Input File
For the input file, you need a file that contains text. You can use a site like Project Gutenberg to find books that are available online. You can generate word clouds from famous books such as Alice in Wonderland by Lewis Carroll or Dracula by Bram Stoker. The possibilities are endless. This project used The Raven by Edgar Allan Poe. Simply copy the contents into a text file and save it in the same directory as the Python script.
Implementation
We need to write a function that opens this text file, iterates through the words, removes punctuation, and counts the frequency of each word. We must also ignore letter case, skip words that are not purely alphabetic, and drop common words like "and" or "the". We will follow these five basic steps to implement the word cloud:
- Import the necessary libraries
- Open the text file
- Clean the text file
- Generate the word cloud
- Save the word cloud
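The cleaning rules described above (ignore case, strip punctuation, skip tokens that are not purely alphabetic, drop common words) can be sketched in a few lines. This is a minimal illustration rather than the code used in the steps that follow: the function name `count_words` and the short `STOPWORDS` set are assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical, deliberately short stopword set for illustration only.
STOPWORDS = {'a', 'an', 'and', 'of', 'the', 'to'}


def count_words(text):
    """Count words after lowercasing, stripping punctuation and filtering."""
    # Lowercase first so 'Raven' and 'raven' are counted together.
    text = text.lower()
    # Drop every character that is neither a word character nor whitespace.
    text = re.sub(r'[^\w\s]', '', text)
    # Keep purely alphabetic tokens that are not stopwords.
    return Counter(
        word for word in text.split()
        if word.isalpha() and word not in STOPWORDS
    )


sample = 'The Raven, the raven and 2 doors!'
print(count_words(sample).most_common(2))
```

The `isalpha()` check is what drops the token `2` here; without it, digits would pass through because `\w` also matches them.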
Step 1: Import the necessary libraries
from wordcloud import WordCloud
from collections import Counter
import re
import os
- The wordcloud library is what generates the word cloud.
- The Counter class (from the collections module) is used to create a dictionary that counts the frequency of each word.
- The re module is used to remove punctuation from the text.
- The os module is used for file handling.
Step 2: Open the text file
def get_file(filename):
    with open(filename, encoding='utf-8') as file_object:
        # Iterating over a file yields lines, not individual words
        content = [line.lower().strip() for line in file_object]
    return ' '.join(content)
The get_file() function opens the text file using UTF-8 encoding, lowercases and strips each line, and returns the contents of the file as a single string.
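To check what get_file() returns, it can be run against a small temporary file. The sketch below repeats the function so it runs on its own; the file name `sample.txt` and the two sample lines are assumptions chosen for the demonstration.

```python
import os
import tempfile


def get_file(filename):
    # Same logic as the function above: one lowercase, stripped string.
    with open(filename, encoding='utf-8') as file_object:
        content = [line.lower().strip() for line in file_object]
    return ' '.join(content)


# Write a two-line sample to a temporary file, then read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, 'sample.txt')
    with open(sample_path, 'w', encoding='utf-8') as sample_file:
        sample_file.write('Once upon a midnight dreary,\n  While I pondered\n')
    result = get_file(sample_path)

print(result)
```

Line breaks and leading spaces are gone, but punctuation is still present at this stage; removing it is the job of the next step.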
Step 3: Clean Text file
def clean_file(data):
    data = re.sub(r'[^\w\s]', '', data)
    stopwords = ('a', 'an', 'and', 'as', 'at', 'but', 'by', 'from', 'he', 'him', 'i', 'is', 'my', 'of', 'or',
                 'on', 'said', 'that', 'the', 'there', 'this', 'to', 'with')
    return Counter([word for word in data.split() if word not in stopwords])
The clean_file() function takes one parameter, data, which is the text returned by get_file(). The re module is used to remove punctuation marks from the text. Any stopwords (i.e. commonly used words) are removed as well. The result is returned as a Counter object, which holds the frequency of each remaining word.
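Before drawing anything, it can be useful to inspect the counts directly. The following sketch applies the same regular expression to a short string and uses `Counter.most_common()` to list the top entry; the sample sentence and the shortened stopword tuple are assumptions for illustration.

```python
import re
from collections import Counter

sample = 'quoth the raven, "nevermore." the raven said so in 1845.'

# Remove everything that is not a word character or whitespace.
cleaned = re.sub(r'[^\w\s]', '', sample)

# Shortened stopword tuple, for illustration only.
stopwords = ('the', 'said', 'in')

counts = Counter(word for word in cleaned.split() if word not in stopwords)

# most_common(n) lists the n highest counts first.
print(counts.most_common(1))
# Digits match \w, so the token '1845' is still present.
print('1845' in counts)
```

Because `\w` matches digits and underscores as well as letters, numeric tokens survive this regular expression; an extra check such as `str.isalpha()` would be needed to drop them.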
Step 4: Generate Wordcloud
def generate_wordcloud(data):
    return WordCloud(height=800, width=1200).generate_from_frequencies(data)
The generate_wordcloud() function takes one parameter, data, which is the Counter object. It uses these word frequencies to generate a word cloud image 1200 pixels wide and 800 pixels high. The result is the following word cloud image:
Step 5: Save Wordcloud
def save_wordcloud(data, filename):
    data.to_file(os.path.join(filename))
    print(f'{filename} has been successfully saved.')
Finally, we want to save this word cloud image. The save_wordcloud() function takes two parameters, data and filename. data is the WordCloud object that will be saved, and filename is the name the file will be saved as.
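If the image should go somewhere other than the script's directory, `os.path.join` can build the destination path from a folder and a file name. The helper below is a hypothetical addition: the name `build_output_path` and the folder name `output` are assumptions, not part of the original script. Its return value could be passed as filename to save_wordcloud().

```python
import os


def build_output_path(directory, filename):
    """Return directory/filename, creating the directory if needed."""
    # exist_ok=True avoids an error when the folder already exists.
    os.makedirs(directory, exist_ok=True)
    return os.path.join(directory, filename)


# Hypothetical usage: save into an 'output' folder next to the script.
# save_wordcloud(raven_cloud, build_output_path('output', 'raven_cloud.jpg'))
```

Joining the parts with `os.path.join` rather than string concatenation keeps the path separator correct on both Windows and Unix-like systems.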
Implementing the Code
def main():
    # Get the path of the text file
    raven_path: str = os.path.join('the_raven.txt')
    # Open and read the text file
    raven_file: str = get_file(raven_path)
    # Clean the text and count word frequencies
    process_file: Counter = clean_file(raven_file)
    # Generate the word cloud
    raven_cloud: WordCloud = generate_wordcloud(process_file)
    # Save the word cloud image as 'raven_cloud.jpg'
    save_wordcloud(raven_cloud, 'raven_cloud.jpg')


if __name__ == '__main__':
    main()
The Full Code
import os
import re
from collections import Counter
from wordcloud import WordCloud


def get_file(filename):
    with open(filename, encoding='utf-8') as fo:
        content = [i.lower().strip() for i in fo]
    return ' '.join(content)


def clean_file(data):
    data = re.sub(r'[^\w\s]', '', data)
    stopwords = ('a', 'an', 'and', 'as', 'at', 'but',
                 'by', 'from', 'he', 'him', 'i', 'is',
                 'my', 'of', 'or', 'on', 'said', 'that',
                 'the', 'there', 'this', 'to', 'with')
    return Counter([word for word in data.split()
                    if word not in stopwords])


def generate_wordcloud(data):
    return WordCloud(height=800, width=1200).generate_from_frequencies(data)


def save_wordcloud(data, filename):
    data.to_file(os.path.join(filename))
    print(f'{filename} has been successfully saved.')


def main():
    raven_path = os.path.join('the_raven.txt')
    raven_file = get_file(raven_path)
    process_file = clean_file(raven_file)
    raven_cloud = generate_wordcloud(process_file)
    save_wordcloud(raven_cloud, 'raven_cloud.jpg')


if __name__ == '__main__':
    main()
Conclusion
After reading this tutorial, you should be able to generate your own word cloud using Python. Use your imagination and have fun! Check out the WordCloud for Python documentation to learn more about what you can do with this library. Please leave a like or a comment if you found this article interesting!
- Code available at GitHub
