I am in the process of consolidating media files from various external drives onto one new large drive. Many of the files are duplicated across the source drives, and I want to ensure that I end up with only one copy of any given file on the destination drive.
I have created an application that compares two drives, source and destination, and produces a list of files that are unique to the source drive. The app then lets me copy those files to the destination drive.
While my application works, it unfortunately took some 36 minutes to complete the list comparison when one of the lists described a 4TB drive, so I needed to refactor.
I now have three versions of the application, and they all run at different speeds. In all versions the data starts life as a text list of typically 90,000 lines of file paths, e.g.
Code:
/Volumes/MediaDisc_20120512/_Zack/onelight_V2/04 EXPOSURE 2.mp4
Each path is then expanded into a tab-delimited record whose fourth item is the file name, e.g.
Code:
NA NewMediaBU Cinema_TV Apollo 13.mp4 /Volumes/NewMediaBU/Cinema_TV/Apollo 13.mp4
The first version compares the two text lists line by line:
Code:
private function ListUniqueFiles @pMoviesSource, @pMoviesDestination
   set the itemdelimiter to tab
   repeat for each line tFile in pMoviesDestination
      put item 4 of tFile & cr after tDestinationFileNames -- just the file names
   end repeat
   put 0 into tCopyCount
   put 0 into tUniqueCount
   repeat for each line tFile in pMoviesSource
      put item 4 of tFile into tFileName
      if tFileName is in tDestinationFileNames then
         -- do nothing as the file is on both drives
         add 1 to tCopyCount
      else
         -- list it as an uncopied file
         add 1 to tUniqueCount
         put tFile & cr after tUniqueFiles
      end if
   end repeat
   delete the last char of tUniqueFiles
   return tUniqueFiles
end ListUniqueFiles
This first version takes 68-70 seconds to produce a list of movie files that are unique to the source drive, so 68 seconds became the time to beat. Remember that listing the audio files on the 4TB drive, with some 380,000 files, took the original app 36 minutes.
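One thing worth noting about the first version is that "is in" has to scan the whole destination string for every source line, whereas an array key lookup is close to constant time. As a comparison point, here is a sketch of the same routine using the file names as array keys (function name is mine; the tab-delimited record format is assumed to be as above):

```livecode
private function ListUniqueFilesByKey @pMoviesSource, @pMoviesDestination
   local tDestA -- destination file names stored as array keys
   set the itemdelimiter to tab
   repeat for each line tFile in pMoviesDestination
      put true into tDestA[item 4 of tFile]
   end repeat
   repeat for each line tFile in pMoviesSource
      if item 4 of tFile is among the keys of tDestA then
         next repeat -- the file is on both drives
      end if
      put tFile & cr after tUniqueFiles -- unique to the source drive
   end repeat
   delete the last char of tUniqueFiles
   return tUniqueFiles
end ListUniqueFilesByKey
```

This also avoids the false matches that "is in" can produce when one file name is a substring of another (e.g. "13.mp4" is in "Apollo 13.mp4"), since array keys are matched whole.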
Next I decided to rewrite the app so that it stored the data in arrays and searched the destination using a binary chop (binary search).
Code:
private function UniqueFilesA @pSourceA, @pDestinationA
   ## passed two arrays -
   ## returns an array of records that are unique to pSourceA
   put Recordcount(pDestinationA) into tDestinationCount
   put the keys of pSourceA into tSourceKeys
   put 0 into tCount
   put 0 into tFinds
   repeat for each line tKey in tSourceKeys
      if BinarySearch(pDestinationA, pSourceA[tKey]["FileName"], 1, tDestinationCount) then
         -- do nothing as the file is on both drives
         add 1 to tFinds
      else
         add 1 to tCount
         put pSourceA[tKey] into tUniqueToSourceA[tCount]
      end if
   end repeat
   put tFinds & " records found on both." & cr after field "debug"
   put tCount & " records found only on source." & cr after field "debug"
   return tUniqueToSourceA
end UniqueFilesA

private function BinarySearch @pArray, pItem, pLeft, pRight
   -- pArray must be keyed 1..N and sorted ascending by "FileName";
   -- keys are 1-based, so the search starts with pLeft = 1
   put pLeft into tLeft
   put pRight into tRight
   repeat while tLeft <= tRight
      put floor((tLeft + tRight) / 2) into tMidpoint
      if pArray[tMidpoint]["FileName"] < pItem then
         put tMidpoint + 1 into tLeft
      else if pArray[tMidpoint]["FileName"] > pItem then
         put tMidpoint - 1 into tRight
      else
         return true -- found
      end if
   end repeat
   return false -- not found
end BinarySearch
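BinarySearch assumes that the destination array is keyed 1…N and already sorted ascending by "FileName". For completeness, a sketch of how such an array might be built from one of the tab-delimited lists (the helper name is mine):

```livecode
private function BuildSortedArray pFileList
   set the itemdelimiter to tab
   sort lines of pFileList ascending by item 4 of each -- order by file name
   put 0 into tCount
   repeat for each line tLine in pFileList
      add 1 to tCount
      put item 4 of tLine into tArrayA[tCount]["FileName"]
      put tLine into tArrayA[tCount]["Record"] -- keep the whole record for later copying
   end repeat
   return tArrayA
end BuildSortedArray
```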
This made me wonder whether a binary chop on the plain text lists would also give a speed increase, so I wrote the following code:
Code:
private function BuildListOfFilesUniqueToSource @pFilesOnSource, pFilesOnDestination
   set the itemdelimiter to "/"
   sort lines of pFilesOnDestination ascending by item -1 of each -- sort by file name
   put the number of lines of pFilesOnDestination into tRecCount
   repeat for each line tLine in pFilesOnSource
      if OnlyOnSource(pFilesOnDestination, tRecCount, tLine) then
         put tLine & cr after tNewList
      end if
   end repeat
   delete the last char of tNewList
   put cr & cr & "NewList is built ok with " & the number of lines of tNewList & " records" & cr after field "debug"
   return tNewList
end BuildListOfFilesUniqueToSource

private function OnlyOnSource @pDestinationList, pRecCount, pTarget
   ## binary chop on the sorted destination list
   set the itemdelimiter to "/"
   put 1 into tLeft -- lines are 1-based
   put pRecCount into tRight
   put item -1 of pTarget into tTargetFileName
   repeat while tLeft <= tRight
      put floor((tLeft + tRight) / 2) into tMidpoint
      put item -1 of (line tMidpoint of pDestinationList) into tTestLine
      if tTestLine < tTargetFileName then
         put tMidpoint + 1 into tLeft
      else if tTestLine > tTargetFileName then
         put tMidpoint - 1 into tRight
      else
         return false -- found on destination
      end if
   end repeat
   return true -- not found, so unique to source
end OnlyOnSource
I have just found a post from Richard describing a new array comparison command that is in version 9, so I will see whether it can be of any use.
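In the meantime, the same difference can be expressed with plain array operations, assuming both arrays are keyed by file name (the function name is mine, and pSourceA is deliberately passed by value so the caller's array is untouched):

```livecode
private function UniqueKeys pSourceA, pDestinationA
   -- remove from the source array every name that also appears on the destination
   repeat for each key tKey in pDestinationA
      delete variable pSourceA[tKey]
   end repeat
   return the keys of pSourceA -- file names unique to the source drive
end UniqueKeys
```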
I welcome your comments.
Simon