seanpgallivan

Posted on Mar 6, 2021 • Edited on Mar 7, 2021

Solution: Short Encoding of Words (ver. 1)

#algorithms #javascript #java #python

This is part of a series of Leetcode solution explanations (index). If you liked this solution or found it useful, please like this post and/or upvote my solution post on Leetcode's forums.

Note: This is my first version of a solution for this problem. Due to the constraints listed for this problem, this version is the more performant solution, but the nature of this problem really calls for a trie solution, so I've included a breakdown of the trie approach here, as well.

Leetcode Problem #820 (Medium): Short Encoding of Words

Description:

(Jump to: Solution Idea || Code: JavaScript | Python | Java | C++)

A valid encoding of an array of words is any reference string s and array of indices indices such that:

words.length == indices.length

The reference string s ends with the '#' character.

For each index indices[i], the substring of s starting from indices[i] and up to (but not including) the next '#' character is equal to words[i].

Given an array of words, return the length of the shortest reference string s possible of any valid encoding of words.

Examples:

Example 1:

Input: words = ["time", "me", "bell"]

Output: 10

Explanation: A valid encoding would be s = "time#bell#" and indices = [0, 2, 5].
words[0] = "time", the substring of s starting from indices[0] = 0 to the next '#' is underlined in "time#bell#"
words[1] = "me", the substring of s starting from indices[1] = 2 to the next '#' is underlined in "time#bell#"
words[2] = "bell", the substring of s starting from indices[2] = 5 to the next '#' is underlined in "time#bell#"

Example 2:

Input: words = ["t"]

Output: 2

Explanation: A valid encoding would be s = "t#" and indices = [0].

Example 1:
Input:	words = ["time", "me", "bell"]
Output:	10
Explanation:	A valid encoding would be s = "time#bell#" and indices = [0, 2, 5]. words[0] = "time", the substring of s starting from indices[0] = 0 to the next '#' is underlined in "time#bell#" words[1] = "me", the substring of s starting from indices[1] = 2 to the next '#' is underlined in "time#bell#" words[2] = "bell", the substring of s starting from indices[2] = 5 to the next '#' is underlined in "time#bell#"

Example 2:
Input:	words = ["t"]
Output:	2
Explanation:	A valid encoding would be s = "t#" and indices = [0].

Constraints:

1 <= words.length <= 2000

1 <= words[i].length <= 7

words[i] consists of only lowercase letters.

Idea:

(Jump to: Problem Description || Code: JavaScript | Python | Java | C++)

So a simple encoding of the input would be to add the '#' marker to the end of each word and then join them in a string. Per the instructions, this encoding can be made shorter if you can combine two or more words into one encoded word. In order to do this, the smaller word would have to be not just a substring of the larger word, but the rightmost substring, or its suffix.

A naive solution here would be to compare each word to each other word and examine if the larger word has the smaller word as its suffix, but with a range of up to 2000 words, that would mean almost 4 million potential combinations.

But if we're asked to check for matching suffixes, we might also be thinking of a trie solution. This seems like a good time for a trie, but tries tend to have a lot of processing and memory overhead involved, and in this instance there's an easier method.

Going back to our naive method, what if, instead of comparing each word to the up to 2000 other words, we instead just identified which possible words could share a suffix with the current word and check for them? Since each word is at most 7 characters long, that means only up to 6 checks per word, rather than 2000.

In order to make this work more efficiently, of course, we'll have to first make a map of the words in W so that we don't have to iterate through it repeatedly. In this case, we don't need to store a value, so we can use a set as our wordmap.

Then for each word, we should check for each of its suffixes and remove any matches from our set as unnecessary if found.

Implementation:

For Javascript and Python, it's faster/easier to just join() the remaining words before counting the length, while for Java and C++ it's faster to just iterate through the set directly.

Javascript Code:

(Jump to: Problem Description || Solution Idea)

var minimumLengthEncoding = function(W) {
    let set = new Set(W)
    for (let word of W)
        if (set.has(word))
            for (let i = 1; i < word.length; i++) 
                set.delete(word.slice(i))
    return Array.from(set).join().length + 1
};

Python Code:

(Jump to: Problem Description || Solution Idea)

class Solution:
    def minimumLengthEncoding(self, W: List[str]) -> int:
        wset = set(W)
        for word in W:
            if word in wset:
                for i in range(1,len(word)):
                    wset.discard(word[i:])
        return len("#".join(list(wset))) + 1

Java Code:

(Jump to: Problem Description || Solution Idea)

class Solution {
    public int minimumLengthEncoding(String[] W) {
        Set<String> set = new HashSet<>(Arrays.asList(W));
        for (String word : W)
            if (set.contains(word))
                for (int i = 1; i < word.length(); i++) 
                    set.remove(word.substring(i));
        int ans = set.size();
        for (String word : set) ans += word.length();
        return ans;
    }
}

C++ Code:

(Jump to: Problem Description || Solution Idea)

class Solution {
public:
    int minimumLengthEncoding(vector<string>& W) {
        unordered_set<string> wset(W.begin(), W.end());
        for (string &word : W)
            if (wset.find(word) != wset.end())
                for (int i = 1; i < word.length(); i++) 
                    wset.erase(word.substr(i));
        int ans = wset.size();
        for (string word : wset) ans += word.size();
        return ans;
    }
};

How I fixed 20 seconds of lag for every user in just 20 minutes.

Our AI agent was running 10-20 seconds slower than it should, impacting both our own developers and our early adopters. See how I used Sentry Profiling to fix it in record time.

DEV Community

Solution: Short Encoding of Words (ver. 1)

Leetcode Problem #820 (Medium): Short Encoding of Words

Description:

Examples:

Constraints:

Idea:

Implementation:

Javascript Code:

Python Code:

Java Code:

C++ Code:

How I fixed 20 seconds of lag for every user in just 20 minutes.

Top comments (0)

Read next

Copier vs Cookiecutter

7 Best Java IDEs for Developers

Staging vs Production: Understanding the Key Differences

Building a Local AI Code Reviewer with ClientAI and Ollama - Part 2