Find Duplicated File in System

This problem requires more attention to the data structures and parsing than to the algorithm itself. Here it is: https://leetcode.com/problems/find-duplicate-file-in-system/description/

Given a list of directory info including directory path, and all the files with contents in this directory, you need to find out all the groups of duplicate files in the file system in terms of their paths.
A group of duplicate files consists of at least two files that have exactly the same content.
A single directory info string in the input list has the following format:
"root/d1/d2/.../dm f1.txt(f1_content) f2.txt(f2_content) ... fn.txt(fn_content)"
It means there are n files (f1.txt, f2.txt ... fn.txt with content f1_content, f2_content ... fn_content, respectively) in directory root/d1/d2/.../dm. Note that n >= 1 and m >= 0. If m = 0, it means the directory is just the root directory.
The output is a list of groups of duplicate file paths. For each group, it contains all the file paths of the files that have the same content. A file path is a string that has the following format:
"directory_path/file_name.txt"
Example 1:
Input:
["root/a 1.txt(abcd) 2.txt(efgh)", "root/c 3.txt(abcd)", "root/c/d 4.txt(efgh)", "root 4.txt(efgh)"]
Output:  
[["root/a/2.txt","root/c/d/4.txt","root/4.txt"],["root/a/1.txt","root/c/3.txt"]]

The algorithm that comes to mind for an O(n) solution is a single linear pass through the paths, indexing a hash table keyed by the file content and storing the file paths as the value, and finally building the output by going through the hash table and emitting every content entry that has more than one file associated with it (so technically it is an O(2n) solution).
The parsing, casts and data structures are the key to solving this problem; it requires more focus on those details than deep thinking about the algorithm itself.
I was a little surprised, though, to see that my solution was that fast; I was expecting someone to come up with an O(1n) solution instead of O(2n). Thanks, Marcelo.
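To make that concrete with Example 1, the content table ends up looking roughly like this (a sketch of the intermediate state, not program output):

"abcd" -> { "root/a/1.txt", "root/c/3.txt" }
"efgh" -> { "root/a/2.txt", "root/c/d/4.txt", "root/4.txt" }

Both entries hold more than one path, so both groups make it into the result.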


using System.Collections;
using System.Collections.Generic;

public class Solution
{
    public IList<IList<string>> FindDuplicate(string[] paths)
    {
        // Maps file content -> inner Hashtable whose keys are the full file paths.
        Hashtable files = new Hashtable();

        for (int i = 0; i < paths.Length; i++)
        {
            string[] parts = paths[i].Split(' ');
            string path = parts[0];

            // Each remaining part has the form "name.txt(content)".
            for (int j = 1; j < parts.Length; j++)
            {
                string file = parts[j];
                int begin = file.IndexOf('(');
                int end = file.IndexOf(')');
                string content = file.Substring(begin + 1, end - begin - 1);

                string key = path + '/' + file.Substring(0, begin);
                if (!files.ContainsKey(content))
                {
                    Hashtable htFile = new Hashtable();
                    htFile.Add(key, true);
                    files.Add(content, htFile);
                }
                else
                {
                    Hashtable htFile = (Hashtable)files[content];
                    if (!htFile.ContainsKey(key))
                    {
                        htFile.Add(key, true);
                    }
                    files[content] = htFile;
                }
            }
        }

        // Second pass: keep only the contents that map to more than one path.
        List<IList<string>> retVal = new List<IList<string>>();

        foreach (string fileContent in files.Keys)
        {
            Hashtable htInnerFiles = (Hashtable)files[fileContent];
            if (htInnerFiles.Count > 1)
            {
                List<string> list = new List<string>();
                foreach (string ss in htInnerFiles.Keys)
                {
                    list.Add(ss);
                }
                retVal.Add(list);
            }
        }

        return retVal;
    }
}
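
For a quick sanity check outside of LeetCode, a small driver along these lines exercises the solution on Example 1 (a sketch; the Program/Main wrapper is mine and assumes the Solution class above is in the same file):

using System;

public class Program
{
    public static void Main()
    {
        string[] paths =
        {
            "root/a 1.txt(abcd) 2.txt(efgh)",
            "root/c 3.txt(abcd)",
            "root/c/d 4.txt(efgh)",
            "root 4.txt(efgh)"
        };

        // Each printed line is one group of duplicate files. Group and path
        // order may differ from the sample output; LeetCode accepts any order.
        foreach (var group in new Solution().FindDuplicate(paths))
        {
            Console.WriteLine(string.Join(", ", group));
        }
    }
}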

Comments

  1. I'm not sure why this problem was marked as "Medium" difficulty, but for easy problems like this I like to play with the language, either to make it very concise, like in:

    import collections

    class Solution:
        def findDuplicate(self, paths):
            """
            :type paths: List[str]
            :rtype: List[List[str]]
            """
            index = collections.defaultdict(list)
            for path in paths:
                directory_name, *files = path.split(" ")
                for file in files:
                    # rpartition on the last "(" splits the name from the content;
                    # content[:-1] drops the trailing ")".
                    file_name, _, content = file.rpartition("(")
                    content = content[:-1]
                    index[content].append(directory_name + "/" + file_name)
            return [file_paths for _, file_paths in index.items() if len(file_paths) > 1]

    or well-structured:

    import collections

    Directory = collections.namedtuple("Directory", ["name", "files"])
    File = collections.namedtuple("File", ["name", "content"])

    def parse_file(file):
        """
        :type file: str
        :rtype File
        """
        name, _, content = file.rpartition("(")
        return File(name=name, content=content[:-1])

    def parse_directory(path):
        """
        :type path: str
        :rtype Directory
        """
        name, *files = path.split(" ")
        return Directory(name=name, files=map(parse_file, files))

    class Solution:
        def findDuplicate(self, paths):
            """
            :type paths: List[str]
            :rtype: List[List[str]]
            """
            index = collections.defaultdict(list)
            for directory in map(parse_directory, paths):
                for file in directory.files:
                    index[file.content].append(directory.name + "/" + file.name)
            return [file_paths for _, file_paths in index.items() if len(file_paths) > 1]

    Thanks for sharing, Marcelo!

  2. Replies
    1. By the way, not that it makes a big difference in terms of performance, but the short solution I pasted above is very easy to modify to remove the need for the second pass:

      import collections

      class Solution:
          def findDuplicate(self, paths):
              """
              :type paths: List[str]
              :rtype: List[List[str]]
              """
              index = collections.defaultdict(list)
              result = []
              for path in paths:
                  directory_name, *files = path.split(" ")
                  for file in files:
                      file_name, _, content = file.rpartition("(")
                      content = content[:-1]
                      duplicates = index[content]
                      duplicates.append(directory_name + "/" + file_name)
                      # Append the group to the result exactly once, the moment
                      # it gains its second path.
                      if len(duplicates) == 2:
                          result.append(duplicates)
              return result

      or a better-formatted version here: https://ideone.com/ZdXrCq :)

      Cheers!
