[Python] Creating a wordcloud using the dev.to API

I'm looking to set up an automated process using the dev.to API. As a first step, I created a word cloud from articles fetched via the API.

Background

As a Japanese speaker, I'm not particularly proficient in English, but I'm interested in programming topics from English-speaking communities. In programming there is famously no silver bullet, so diversity of perspective matters in design and coding, much as it does in D&I.

Japanese technical writing has some diversity of its own, but it carries biases shaped by country-specific issues and interests. Drawing on technical information from the much larger English-speaking community therefore seemed like a good way to get a broader view of programming.

To gather information from English-language sources efficiently, I want to automate data acquisition and processing through APIs. As a first sample of that data, today I'll fetch the latest articles and turn them into a word cloud.

Methodology

dev.to provides a documented public API, which makes fetching articles easy. To simplify the API handling, I wrote a small wrapper library called devtopy.

Using this library, the latest articles can be fetched efficiently at any time. The retrieved article text is split into words, filtered down to verbs, adjectives, and nouns, and the word frequencies are counted. From those words and counts, building a word cloud is straightforward. Since this is about dev.to, I prepared a grayscale mask image so the output visually signals that it is a dev.to word cloud.

Code

The repository is available at the following URL. I'll briefly explain the main contents of the code:

Execution Method

I intend to execute it from the CLI using the fire library. By adding a number after the execution command, you can control the number of articles to fetch. If this number is omitted, the default value is 25, which fetches and creates a word cloud from the 25 most recent articles.

%./src > python main.py 100
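The fire library maps a function's parameters onto CLI arguments automatically. As a rough illustration of what that wiring does, the same argument handling could be written with only the standard library (the helper below is hypothetical and not part of the repository):

```python
import sys


def parse_article_count(argv: list, default: int = 25) -> int:
    """Return the article count from CLI args, falling back to a default."""
    if len(argv) > 1:
        return int(argv[1])
    return default


# "python main.py 100" would arrive as ["main.py", "100"]
print(parse_article_count(["main.py", "100"]))  # 100
print(parse_article_count(["main.py"]))         # 25 (default)
```

With fire, none of this boilerplate is needed: `fire.Fire(func)` exposes the function's signature directly on the command line.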

Fetching Articles

Articles can be fetched easily with the devtopy library mentioned above; get_latest_articles retrieves between 1 and 1,000 recent articles. collections.Counter is more convenient than a plain dict for counting: its update method doesn't replace values, it adds the incoming counts to the existing ones.
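That accumulation behavior is what makes counting words across many articles a one-liner per article. A minimal demonstration:

```python
from collections import Counter

word_counter = Counter()
word_counter.update("python api python".split())   # words from a first article
word_counter.update("python wordcloud".split())    # words from a second article

# update() added to the existing counts instead of overwriting them
print(word_counter["python"])  # 3
```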

from collections import Counter
from typing import Dict, List

from devtopy import DevTo


def fetch_articles(dt: DevTo, article_count: int) -> List[Dict]:
    """Fetch the latest articles from dev.to."""
    articles = dt.articles.get_latest_articles(page=1, per_page=article_count).articles
    return articles


def process_article(dt: DevTo, article: Dict) -> str:
    """Fetch one article's full body and preprocess its text."""
    article_data = dt.articles.get_by_id(article.id)
    return process_text(article_data.body_markdown)


def get_word(article_count: int = 10) -> Counter:
    """Accumulate word counts over the latest articles."""
    dt = DevTo(api_key=API_KEY)
    articles = fetch_articles(dt, article_count)

    word_counter = Counter()
    for article in articles:
        processed_text = process_article(dt, article)
        word_counter.update(processed_text.split())

    return word_counter

Text Preprocessing

The text is tokenized, stop words and other unnecessary tokens are removed, and each word is converted to its base form. The result is rejoined into a single space-separated string so the words can be counted afterwards.

from typing import List

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize


def preprocess_text(text: str) -> List[str]:
    """Lowercase, tokenize, and drop stop words and non-alphanumeric tokens."""
    stop_words = set(stopwords.words("english"))
    tokens = word_tokenize(text.lower())
    return [word for word in tokens if word.isalnum() and word not in stop_words]


def lemmatize_words(words: List[str]) -> List[str]:
    """Reduce each word to its base (dictionary) form."""
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(word) for word in words]


def process_text(text: str) -> str:
    """Full pipeline: tokenize, filter, lemmatize, rejoin as one string."""
    preprocessed_words = preprocess_text(text)
    lemmatized_words = lemmatize_words(preprocessed_words)
    return " ".join(lemmatized_words)
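WordNetLemmatizer maps inflected forms to their dictionary base form (e.g. "libraries" → "library"). To illustrate the idea without NLTK, here is a deliberately crude plural-stripping toy (this is a simplification for illustration only; the real lemmatizer uses WordNet's morphological rules and handles far more cases):

```python
def naive_lemmatize(word: str) -> str:
    """Crude plural stripping -- illustrates the idea, not a real lemmatizer."""
    if word.endswith("ies") and len(word) > 3:
        return word[:-3] + "y"   # libraries -> library
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]         # articles -> article
    return word                  # class stays class

print([naive_lemmatize(w) for w in ["libraries", "articles", "class"]])
# → ['library', 'article', 'class']
```

Lemmatizing before counting means "article" and "articles" contribute to a single word-cloud entry instead of two.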

Generating the Word Cloud

The word cloud is generated from the counter's frequency data. A grayscale mask image controls the layout: words are drawn only in the black areas of the image, while the white areas are left empty.
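Internally the mask is just a 2-D array in which pixels with value 255 (white) are excluded from drawing. A small synthetic example of such an array, built with NumPy instead of loading an image file (hypothetical, for illustration):

```python
import numpy as np

# 100x100 all-white canvas: nothing would be drawn anywhere yet
mask = np.full((100, 100), 255, dtype=np.uint8)

# Paint a black square: WordCloud(mask=mask, ...) would place words only here
mask[25:75, 25:75] = 0

print(int((mask == 0).sum()))  # 2500 drawable pixels (50 x 50)
```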

import os
from datetime import datetime
from typing import Dict

import matplotlib.pyplot as plt
import numpy as np
from PIL import Image
from wordcloud import WordCloud

# OUTPUT_DIR, MASK_IMG_PATH, and logger are module-level names
# defined elsewhere in the repository.


def save_wordcloud_image(wordcloud: WordCloud) -> None:
    """Save the word cloud image to a file"""
    os.makedirs(OUTPUT_DIR, exist_ok=True)
    output_path = os.path.join(OUTPUT_DIR, f"wordcloud_{datetime.now():%Y%m%d%H%M}.png")
    plt.savefig(output_path, bbox_inches="tight", pad_inches=0, dpi=300)
    logger.info(f"WordCloud image saved to: {output_path}")


def create_wordcloud(word_counter: Dict[str, int]) -> None:
    """Render a masked word cloud from word frequencies and save it."""
    mask = np.array(Image.open(MASK_IMG_PATH))

    wordcloud = WordCloud(
        width=1920,
        height=1920,
        background_color="white",
        mask=mask,
        contour_width=1,
        contour_color="steelblue",
    ).generate_from_frequencies(word_counter)

    plt.figure(figsize=(10, 10))
    plt.imshow(wordcloud, interpolation="bilinear")
    plt.axis("off")

    save_wordcloud_image(wordcloud)

Generated Word Cloud


Thoughts and Impressions

Fetching a small number of the latest articles and turning them into a word cloud gives a quick read on recent trends. Fetching by tag (e.g. #python), or filtering for articles with a certain number of positive reactions, could produce word clouds of more advanced, expert-oriented content and reveal those trends. With the right preprocessing of API-acquired data, even more interesting trends might become readable.
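The public dev.to API does support that kind of filtering: GET /api/articles accepts query parameters such as tag, top, page, and per_page. A small stdlib helper to build such a request URL (the helper itself is hypothetical; the endpoint and parameters come from the dev.to API documentation):

```python
from urllib.parse import urlencode

API_BASE = "https://dev.to/api/articles"


def build_articles_url(tag: str, per_page: int = 25, page: int = 1) -> str:
    """Build a dev.to articles URL filtered by tag."""
    query = urlencode({"tag": tag, "per_page": per_page, "page": page})
    return f"{API_BASE}?{query}"


print(build_articles_url("python", per_page=100))
# → https://dev.to/api/articles?tag=python&per_page=100&page=1
```

The reaction-count filter would then be applied client-side, since each returned article includes a positive_reactions_count field.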

Future Plans

First, I want to iterate on the preprocessing and word cloud generation until the extracted articles satisfy me. Then I'd like to have the recently released GPT-4o mini produce Japanese translations and summaries, and post that content to Discord via webhook, so that I, a non-native English speaker, can consume English technical writing efficiently.

Conclusion

Using APIs and word clouds, we can easily grasp technology trends. I think this approach can be applied both inside and outside dev.to, so I hope the data acquisition, string processing, and word cloud generation shown here serve as useful references.
