DEV Community

Riches

Most frequently used words in a text "Codewars"

Write a function that, given a string of text (possibly with punctuation and line-breaks), returns an array of the top-3 most occurring words, in descending order of the number of occurrences.

Assumptions:
A word is a string of letters (A to Z) optionally containing one or more apostrophes (') in ASCII.
Apostrophes can appear at the start, middle or end of a word ('abc, abc', 'abc', ab'c are all valid)
Any other characters (e.g. #, \, / , . ...) are not part of a word and should be treated as whitespace.
Matches should be case-insensitive, and the words in the result should be lowercased.
Ties may be broken arbitrarily.
If a text contains fewer than three unique words, then either the top-2 or top-1 words should be returned, or an empty array if a text contains no words.
Examples:
top_3_words("In a village of La Mancha, the name of which I have no desire to call to
mind, there lived not long since one of those gentlemen that keep a lance
in the lance-rack, an old buckler, a lean hack, and a greyhound for
coursing. An olla of rather more beef than mutton, a salad on most
nights, scraps on Saturdays, lentils on Fridays, and a pigeon or so extra
on Sundays, made away with three-quarters of his income.")

=> ["a", "of", "on"]

top_3_words("e e e e DDD ddd DdD: ddd ddd aa aA Aa, bb cc cC e e e")

=> ["e", "ddd", "aa"]

top_3_words(" //wont won't won't")

=> ["won't", "wont"]
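
To see how the word rules from the assumptions play out on inputs like these, here is a minimal sketch of the tokenizing step on its own. The name `extractWords` is a hypothetical helper and not part of the kata. It also assumes that a run made only of apostrophes is not a word, which is my reading of "a string of letters".

```javascript
// Hypothetical helper (not part of the kata): splits text into lowercase words
// following the assumptions listed above.
function extractWords(text) {
    // Runs of letters and apostrophes; every other character acts as a separator.
    const candidates = text.toLowerCase().match(/[a-z']+/g) || [];
    // Assumption: a word needs at least one letter, so apostrophe-only runs are dropped.
    return candidates.filter((candidate) => /[a-z]/.test(candidate));
}

console.log(extractWords("  //wont won't won't"));    // ["wont", "won't", "won't"]
console.log(extractWords("'abc, abc', 'abc', ab'c")); // ["'abc", "abc'", "'abc'", "ab'c"]
console.log(extractWords("  '  ''' ... "));           // []
```

Lowercasing the whole text first means the regular expression only has to deal with `a-z`.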

Bonus points (not really, but just for fun):
Avoid creating an array whose memory footprint is roughly as big as the input text.
Avoid sorting the entire array of unique words.
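
The solution further down does not attempt these bonus points, so here is a rough sketch of one way they could be approached. It is an alternative and not the approach used in this post. `countWords` and `topKByCount` are hypothetical names. The sketch assumes `String.prototype.matchAll` is available (ES2020), which yields matches one at a time instead of building an array of every match, and it keeps a small ordered list of the best three entries instead of sorting all unique words.

```javascript
// Hypothetical helper: counts words by walking the matches one at a time.
function countWords(text) {
    const wordCount = new Map();
    // matchAll returns an iterator, so no array of all matches is created.
    for (const match of text.matchAll(/[a-z']+/gi)) {
        const word = match[0].toLowerCase();
        // Assumption: a word needs at least one letter.
        if (!/[a-z]/.test(word)) continue;
        wordCount.set(word, (wordCount.get(word) || 0) + 1);
    }
    return wordCount;
}

// Hypothetical helper: picks the k most frequent words without a full sort.
function topKByCount(wordCount, k = 3) {
    const best = []; // at most k [word, count] pairs, highest count first
    for (const entry of wordCount) {
        // Find the first position whose count is lower than this entry's count.
        let i = 0;
        while (i < best.length && best[i][1] >= entry[1]) {
            i++;
        }
        if (i < k) {
            best.splice(i, 0, entry); // insert in order
            if (best.length > k) {
                best.pop(); // drop the entry that fell out of the top k
            }
        }
    }
    return best.map(([word]) => word);
}

const counts = countWords("e e e e DDD ddd DdD: ddd ddd aa aA Aa, bb cc cC e e e");
console.log(topKByCount(counts)); // ["e", "ddd", "aa"]
```

The map of counts still grows with the number of unique words, so this only avoids the input-sized array of matches and the full sort.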

Steps:

  1. We define a regular expression that matches the words we want: runs of letters, including contractions with apostrophes.
  2. Using the match method, we match the given string against the regular expression. If there are matches, we get an array of all matched strings; otherwise match returns null, so we fall back to an empty array.
  3. Now let's convert the words to lowercase, since matching is case-insensitive and the result must be lowercased.
  4. Let's create a map that holds a key-value pair for each word and the number of times it appears in the text.
  5. Let's use a loop to iterate through the words, skipping anything that contains no letters, and store the number of times each word appears in the map.
  6. Now let's sort the map's entries in descending order of count so that we can easily select the top 3.
  7. Using slice, let's retrieve the top 3 words and return them.

Algorithm

function topThreeWords(text) {
    // Match runs of letters and apostrophes; digits and all other characters act as separators
    const wordRegex = /[a-z']+/gi;
    // Extract words from the text (match returns null when nothing matches) and lowercase them
    const words = text.match(wordRegex) || [];
    const lowercaseWords = words.map((word) => word.toLowerCase());
    // Create a map to count the occurrences of each word
    const wordCount = new Map();
    lowercaseWords.forEach((word) => {
        // Skip runs made only of apostrophes: a word needs at least one letter
        if (/[a-z]/.test(word)) {
            wordCount.set(word, (wordCount.get(word) || 0) + 1);
        }
    });
    // Sort the map's entries by frequency in descending order
    const sortedWords = [...wordCount.entries()].sort((a, b) => b[1] - a[1]);
    // Extract the top three words
    return sortedWords.slice(0, 3).map((entry) => entry[0]);
}
console.log(topThreeWords("e e e e DDD ddd DdD: ddd ddd aa aA Aa, bb cc cC e e e")); // [ 'e', 'ddd', 'aa' ]