Find Duplicated File in System
This problem requires more attention to the data structures and parsing than to the algorithm itself. Here it is: https://leetcode.com/problems/find-duplicate-file-in-system/description/
Given a list of directory info including directory path, and all the files with contents in this directory, you need to find out all the groups of duplicate files in the file system in terms of their paths.
A group of duplicate files consists of at least two files that have exactly the same content.
A single directory info string in the input list has the following format:
"root/d1/d2/.../dm f1.txt(f1_content) f2.txt(f2_content) ... fn.txt(fn_content)"
It means there are n files (f1.txt, f2.txt, ..., fn.txt with content f1_content, f2_content, ..., fn_content, respectively) in the directory root/d1/d2/.../dm. Note that n >= 1 and m >= 0. If m = 0, it means the directory is just the root directory.
The output is a list of groups of duplicate file paths. Each group contains all the file paths of the files that have the same content. A file path is a string that has the following format:
"directory_path/file_name.txt"
Example 1:
Input: ["root/a 1.txt(abcd) 2.txt(efgh)", "root/c 3.txt(abcd)", "root/c/d 4.txt(efgh)", "root 4.txt(efgh)"] Output: [["root/a/2.txt","root/c/d/4.txt","root/4.txt"],["root/a/1.txt","root/c/3.txt"]]
The algorithm that comes to mind for an O(n) solution is a linear pass through the paths, indexing a hash table with the file content as the key and the file paths as the value, and then building the output by going through the hash table and adding any content that has more than one file associated with it (so technically it makes two passes over the data, which is still O(n)).
The parsing, casts, and data structures are the key to this problem; it requires more focus on those details than on the algorithm itself.
I was a little surprised, though, to see that my solution was that fast; I was expecting someone to come up with a single-pass solution instead of a two-pass one. Thanks, Marcelo.
using System.Collections;
using System.Collections.Generic;

public class Solution
{
    public IList<IList<string>> FindDuplicate(string[] paths)
    {
        // Maps file content -> set of full file paths with that content
        // (an inner Hashtable is used as a set to avoid duplicate paths).
        Hashtable files = new Hashtable();

        for (int i = 0; i < paths.Length; i++)
        {
            // Each entry is "<directory> <file1(content1)> <file2(content2)> ...".
            string[] parts = paths[i].Split(' ');
            string path = parts[0];

            for (int j = 1; j < parts.Length; j++)
            {
                string file = parts[j];

                // The content sits between the parentheses; the file name precedes them.
                int begin = file.IndexOf('(');
                int end = file.IndexOf(')');
                string content = file.Substring(begin + 1, end - begin - 1);
                string key = path + '/' + file.Substring(0, begin);

                if (!files.ContainsKey(content))
                {
                    Hashtable htFile = new Hashtable();
                    htFile.Add(key, true);
                    files.Add(content, htFile);
                }
                else
                {
                    Hashtable htFile = (Hashtable)files[content];
                    if (!htFile.ContainsKey(key))
                    {
                        htFile.Add(key, true);
                    }
                    files[content] = htFile;
                }
            }
        }

        // Second pass: any content associated with more than one path is a duplicate group.
        List<IList<string>> retVal = new List<IList<string>>();
        foreach (string fileContent in files.Keys)
        {
            Hashtable htInnerFiles = (Hashtable)files[fileContent];
            if (htInnerFiles.Count > 1)
            {
                List<string> list = new List<string>();
                foreach (string ss in htInnerFiles.Keys)
                {
                    list.Add(ss);
                }
                retVal.Add(list);
            }
        }
        return retVal;
    }
}
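For a quick sanity check outside of LeetCode, the solution can be exercised against Example 1 with a small harness like the one below; the Demo class is only for illustration and is not part of the submission:

using System;

public static class Demo
{
    public static void Main()
    {
        string[] paths =
        {
            "root/a 1.txt(abcd) 2.txt(efgh)",
            "root/c 3.txt(abcd)",
            "root/c/d 4.txt(efgh)",
            "root 4.txt(efgh)"
        };

        // Expected groups (order may vary):
        //   root/a/2.txt, root/c/d/4.txt, root/4.txt
        //   root/a/1.txt, root/c/3.txt
        foreach (var group in new Solution().FindDuplicate(paths))
        {
            Console.WriteLine(string.Join(", ", group));
        }
    }
}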
I'm not sure why this problem was marked as "Medium" difficulty, but for easy problems like this I like to play with the language, either to make it very concise, as in:
import collections
class Solution:
    def findDuplicate(self, paths):
        """
        :type paths: List[str]
        :rtype: List[List[str]]
        """
        index = collections.defaultdict(list)
        for path in paths:
            directory_name, *files = path.split(" ")
            for file in files:
                file_name, _, content = file.rpartition("(")
                content = content[:-1]
                index[content].append(directory_name + "/" + file_name)
        return [file_paths for _, file_paths in index.items() if len(file_paths) > 1]
or to make it well structured:
import collections

Directory = collections.namedtuple("Directory", ["name", "files"])
File = collections.namedtuple("File", ["name", "content"])


def parse_file(file):
    """
    :type file: str
    :rtype File
    """
    name, _, content = file.rpartition("(")
    return File(name=name, content=content[:-1])


def parse_directory(path):
    """
    :type path: str
    :rtype Directory
    """
    name, *files = path.split(" ")
    return Directory(name=name, files=map(parse_file, files))


class Solution:
    def findDuplicate(self, paths):
        """
        :type paths: List[str]
        :rtype: List[List[str]]
        """
        index = collections.defaultdict(list)
        for directory in map(parse_directory, paths):
            for file in directory.files:
                index[file.content].append(directory.name + "/" + file.name)
        return [file_paths for _, file_paths in index.items() if len(file_paths) > 1]
Thanks for sharing, Marcelo!
Neat!!! :)
By the way, not that it makes a big difference in terms of performance, but the short solution I pasted above is actually very easy to modify to remove the need for the second pass:
import collections
class Solution:
    def findDuplicate(self, paths):
        """
        :type paths: List[str]
        :rtype: List[List[str]]
        """
        index = collections.defaultdict(list)
        result = []
        for path in paths:
            directory_name, *files = path.split(" ")
            for file in files:
                file_name, _, content = file.rpartition("(")
                content = content[:-1]
                duplicates = index[content]
                duplicates.append(directory_name + "/" + file_name)
                # A group is appended to the result exactly once, the moment it
                # gets its second member, so no second pass over the index is needed.
                if len(duplicates) == 2:
                    result.append(duplicates)
        return result
or see a better-formatted version here: https://ideone.com/ZdXrCq :)
Cheers!
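For what it's worth, the same single-pass trick ports back to C# as well. The sketch below uses a Dictionary<string, List<string>> instead of nested Hashtables and is only meant to illustrate the idea, not to replace the solution above:

using System.Collections.Generic;

public class SolutionOnePass
{
    public IList<IList<string>> FindDuplicate(string[] paths)
    {
        var index = new Dictionary<string, List<string>>();
        var result = new List<IList<string>>();

        foreach (string entry in paths)
        {
            string[] parts = entry.Split(' ');
            string dir = parts[0];
            for (int j = 1; j < parts.Length; j++)
            {
                int open = parts[j].IndexOf('(');
                int close = parts[j].IndexOf(')');
                string content = parts[j].Substring(open + 1, close - open - 1);
                string filePath = dir + "/" + parts[j].Substring(0, open);

                if (!index.TryGetValue(content, out List<string> group))
                {
                    group = new List<string>();
                    index[content] = group;
                }
                group.Add(filePath);

                // As soon as a group reaches two members it is a duplicate group;
                // it is added to the result once and keeps growing in place afterwards.
                if (group.Count == 2)
                {
                    result.Add(group);
                }
            }
        }
        return result;
    }
}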