DEV Community

loading...
Cover image for Solution: Find Duplicate File in System

Solution: Find Duplicate File in System

seanpgallivan profile image seanpgallivan ・4 min read

This is part of a series of Leetcode solution explanations (index). If you liked this solution or found it useful, please like this post and/or upvote my solution post on Leetcode's forums.


Leetcode Problem #609 (Medium): Find Duplicate File in System


Description:


(Jump to: Solution Idea || Code: JavaScript | Python | Java | C++)

Given a list paths of directory info, including the directory path, and all the files with contents in this directory, return all the duplicate files in the file system in terms of their paths. You may return the answer in any order.

A group of duplicate files consists of at least two files that have the same content.

A single directory info string in the input list has the following format:

  • "root/d1/d2/.../dm f1.txt(f1_content) f2.txt(f2_content) ... fn.txt(fn_content)"

It means there are n files (f1.txt, f2.txt ... fn.txt) with content (f1_content, f2_content ... fn_content) respectively in the directory "root/d1/d2/.../dm". Note that n >= 1 and m >= 0. If m = 0, it means the directory is just the root directory.

The output is a list of groups of duplicate file paths. For each group, it contains all the file paths of the files that have the same content. A file path is a string that has the following format:

  • "directory_path/file_name.txt"

Examples:

Example 1:
Input: paths = ["root/a 1.txt(abcd) 2.txt(efgh)","root/c 3.txt(abcd)","root/c/d 4.txt(efgh)","root 4.txt(efgh)"]
Output: [["root/a/2.txt","root/c/d/4.txt","root/4.txt"],["root/a/1.txt","root/c/3.txt"]]
Example 2:
Input: paths = ["root/a 1.txt(abcd) 2.txt(efgh)","root/c 3.txt(abcd)","root/c/d 4.txt(efgh)"]
Output: [["root/a/2.txt","root/c/d/4.txt"],["root/a/1.txt","root/c/3.txt"]]

Constraints:

  • 1 <= paths.length <= 2 * 10^4
  • 1 <= paths[i].length <= 3000
  • 1 <= sum(paths[i].length) <= 5 * 10^5
  • paths[i] consist of English letters, digits, '/', '.', '(', ')', and ' '.
  • You may assume no files or directories share the same name in the same directory.
  • You may assume each given directory info represents a unique directory. A single blank space separates the directory path and file info.

Idea:


(Jump to: Problem Description || Code: JavaScript | Python | Java | C++)

The order to group duplicate files, we should use a map to store the file paths by content value. For each string (pStr) in paths, we can iterate through the string up to the first space to find the path. Then we can iterate through the remainder of pStr and use two more pointers (j, k) to mark the indexes around the filename (file) and contents (cont).

When we find a ')', we've found the end of a complete entry, so we should add it to our content map (contMap) by merging path and file (with a '/' between) and storing the result in contMap under cont.

Once we've added all files to contMap, we can iterate through its values and add any groups that are larger than 1 (indicating duplicates) to our answer array (ans) before we return ans.

  • Time Complexity: O(N + C) where N is the total number of files and C is the number of different keys in contMap
  • Space Complexity: O(N) for N files in contMap

Implementation:

Python is much faster when using split() as opposed to direct iteration through the strings.

Java is faster when using a StringBuilder to compile the path + file before entry into contMap.


Javascript Code:


(Jump to: Problem Description || Solution Idea)

var findDuplicate = function(paths) {
    let contMap = new Map(), ans = []
    for (let pStr of paths) {
        let i = 0, j, k
        while (pStr.charAt(i) !== ' ') i++
        let path = pStr.slice(0,i)
        for (j = ++i; i < pStr.length; i++)
            if (pStr.charAt(i) === '(') k = i
            else if (pStr.charAt(i) === ')') {
                let pathfile = path + '/' + pStr.slice(j, k),
                    cont = pStr.slice(k+1, i)
                if (!contMap.has(cont))
                    contMap.set(cont, [pathfile])
                else contMap.get(cont).push(pathfile)
                j = i + 2
            }
    }
    for (let v of contMap.values())
        if (v.length > 1) ans.push(v)
    return ans
};
Enter fullscreen mode Exit fullscreen mode

Python Code:


(Jump to: Problem Description || Solution Idea)

class Solution:
    def findDuplicate(self, paths: List[str]) -> List[List[str]]:
        contMap, ans = defaultdict(list), []
        for pStr in paths:
            sep = pStr.split(" ")
            for i in range(1, len(sep)):
                parts = sep[i].split('(')
                cont = parts[1][:-1]
                contMap[cont].append(sep[0] + '/' + parts[0])
        for v in contMap.values():
            if len(v) > 1: ans.append(v)
        return ans
Enter fullscreen mode Exit fullscreen mode

Java Code:


(Jump to: Problem Description || Solution Idea)

class Solution {
    public List<List<String>> findDuplicate(String[] paths) {
        Map<String, List<String>> contMap = new HashMap<>();
        StringBuilder pathfile = new StringBuilder();
        for (String pStr : paths) {
            int i = 0;
            pathfile.setLength(0);
            while (pStr.charAt(i) != ' ') i++;
            pathfile.append(pStr.substring(0,i)).append('/');
            int pLen = ++i;
            for (int j = i, k = 0; i < pStr.length(); i++)
                if (pStr.charAt(i) == '(') {
                    pathfile.append(pStr.substring(j,i));
                    k = i + 1;
                } else if (pStr.charAt(i) == ')') {
                    String cont = pStr.substring(k, i);
                    if (!contMap.containsKey(cont))
                        contMap.put(cont, new ArrayList<>());
                    contMap.get(cont).add(pathfile.toString());
                    j = i + 2;
                    pathfile.setLength(pLen);
                }
        }
        List<List<String>> ans = new ArrayList<>();
        for (List<String> v : contMap.values())
            if (v.size() > 1) ans.add(v);
        return ans;
    }
}
Enter fullscreen mode Exit fullscreen mode

C++ Code:


(Jump to: Problem Description || Solution Idea)

class Solution {
public:
    vector<vector<string>> findDuplicate(vector<string>& paths) {
        unordered_map<string, vector<string>> contMap;
        for (auto &pStr : paths) {
            int i = 0;
            while (pStr[i] != ' ') i++;
            string path = pStr.substr(0,i);
            for (int j = i + 1, k = 0; i < pStr.size(); i++)
                if (pStr[i] == '(') k = i+1;
                else if (pStr[i] == ')') {
                    string pathfile = path + '/' + pStr.substr(j, k-j-1),
                        cont = pStr.substr(k, i-k);
                    if (contMap.find(cont) == contMap.end())
                        contMap[cont] = vector<string>();
                    contMap[cont].push_back(pathfile);
                    j = i + 2;
                }
        }
        vector<vector<string>> ans;
        for (auto &kv : contMap)
            if (kv.second.size() > 1) ans.push_back(kv.second);
        return ans;
    }
};
Enter fullscreen mode Exit fullscreen mode

Discussion (1)

Collapse
cipharius profile image
Valts Liepiņš

Considering that files could be large, perhaps one could use same approach but store file checksums instead?

If the rare case of collision is of concern, checksums can still be used as first pass filter and then match the file contents, just in case.

Forem Open with the Forem app