I am in the process of consolidating media files from various external drives onto one new large drive. Many of the files are duplicated across the source drives, and I want to ensure that I end up with only one copy of any given file on the destination drive.
I have created an application that compares two drives, source and destination, and produces a list of files that are unique to the source drive. The app then lets me copy those files to the destination drive.
While my application works, it unfortunately took some 36 minutes to complete the list comparison when one of the lists described a 4TB drive, so I needed to refactor.
I now have three versions of the application, and they all run at different speeds. In all versions the data starts life as a text list of typically 90,000 lines of file paths, e.g.
Code:
/Volumes/MediaDisc_20120512/_Zack/onelight_V2/04 EXPOSURE 2.mp4
Each path is then expanded into a tab-delimited record whose fourth item is the file name, e.g.
Code:
NA NewMediaBU Cinema_TV Apollo 13.mp4 /Volumes/NewMediaBU/Cinema_TV/Apollo 13.mp4
The first version compares the two text lists line by line:
Code:
private function ListUniqueFiles @pMoviesSource, @pMoviesDestination
   set the itemdelimiter to tab
   repeat for each line tFile in pMoviesDestination
      put item 4 of tFile & cr after tDestinationFileNames -- just the file names
   end repeat
   put 0 into tCopyCount
   put 0 into tUniqueCount
   repeat for each line tFile in pMoviesSource
      put item 4 of tFile into tFileName
      if tFileName is in tDestinationFileNames then
         -- do nothing as the file is on both drives
         add 1 to tCopyCount
      else
         -- list it as an uncopied file
         add 1 to tUniqueCount
         put tFile & cr after tUniqueFiles
      end if
   end repeat
   delete the last char of tUniqueFiles
   return tUniqueFiles
end ListUniqueFiles
This first version takes 68-70 seconds to produce a list of movie files that are unique to the source drive, so 68 seconds became the time to beat. Remember that listing the audio files on the 4TB drive, with some 380,000 files, took the original app 36 minutes.
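One thing worth noting about the first version is that "is in" has to scan the whole destination string for every source line, whereas an array key lookup is close to constant time. As a comparison point, here is a sketch of the same routine using the file names as array keys (function name is mine; the tab-delimited record format is assumed to be as above):

```livecode
private function ListUniqueFilesByKey @pMoviesSource, @pMoviesDestination
   local tDestA -- destination file names stored as array keys
   set the itemdelimiter to tab
   repeat for each line tFile in pMoviesDestination
      put true into tDestA[item 4 of tFile]
   end repeat
   repeat for each line tFile in pMoviesSource
      if item 4 of tFile is among the keys of tDestA then
         next repeat -- the file is on both drives
      end if
      put tFile & cr after tUniqueFiles -- unique to the source drive
   end repeat
   delete the last char of tUniqueFiles
   return tUniqueFiles
end ListUniqueFilesByKey
```

This also avoids the false matches that "is in" can produce when one file name is a substring of another (e.g. "13.mp4" is in "Apollo 13.mp4"), since array keys are matched whole.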
Next I decided to rewrite the app so that it stored the data in arrays and searched the destination using a binary chop (binary search).
Code:
private function UniqueFilesA @pSourceA, @pDestinationA
   ## passed two arrays -
   ## returns an array of records that are unique to pSourceA
   put Recordcount(pDestinationA) into tDestinationCount
   put the keys of pSourceA into tSourceKeys
   put 0 into tCount
   put 0 into tFinds
   repeat for each line tKey in tSourceKeys
      if BinarySearch(pDestinationA, pSourceA[tKey]["FileName"], 1, tDestinationCount) then
         -- do nothing as the file is on both drives
         add 1 to tFinds
      else
         add 1 to tCount
         put pSourceA[tKey] into tUniqueToSourceA[tCount]
      end if
   end repeat
   put tFinds & " records found on both." & cr after field "debug"
   put tCount & " records found only on source." & cr after field "debug"
   return tUniqueToSourceA
end UniqueFilesA

private function BinarySearch @pArray, pItem, pLeft, pRight
   -- pArray must be keyed 1..N and sorted ascending by "FileName";
   -- keys are 1-based, so the search starts with pLeft = 1
   put pLeft into tLeft
   put pRight into tRight
   repeat while tLeft <= tRight
      put floor((tLeft + tRight) / 2) into tMidpoint
      if pArray[tMidpoint]["FileName"] < pItem then
         put tMidpoint + 1 into tLeft
      else if pArray[tMidpoint]["FileName"] > pItem then
         put tMidpoint - 1 into tRight
      else
         return true -- found
      end if
   end repeat
   return false -- not found
end BinarySearch
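BinarySearch assumes that the destination array is keyed 1…N and already sorted ascending by "FileName". For completeness, a sketch of how such an array might be built from one of the tab-delimited lists (the helper name is mine):

```livecode
private function BuildSortedArray pFileList
   set the itemdelimiter to tab
   sort lines of pFileList ascending by item 4 of each -- order by file name
   put 0 into tCount
   repeat for each line tLine in pFileList
      add 1 to tCount
      put item 4 of tLine into tArrayA[tCount]["FileName"]
      put tLine into tArrayA[tCount]["Record"] -- keep the whole record for later copying
   end repeat
   return tArrayA
end BuildSortedArray
```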
This made me wonder whether a binary chop on the plain text lists would also give a speed increase, so I wrote the following code:
Code:
private function BuildListOfFilesUniqueToSource @pFilesOnSource, pFilesOnDestination
   set the itemdelimiter to "/"
   sort lines of pFilesOnDestination ascending by item -1 of each -- sort by file name
   put the number of lines of pFilesOnDestination into tRecCount
   repeat for each line tLine in pFilesOnSource
      if OnlyOnSource(pFilesOnDestination, tRecCount, tLine) then
         put tLine & cr after tNewList
      end if
   end repeat
   delete the last char of tNewList
   put cr & cr & "NewList is built ok with " & the number of lines of tNewList & " records" & cr after field "debug"
   return tNewList
end BuildListOfFilesUniqueToSource

private function OnlyOnSource @pDestinationList, pRecCount, pTarget
   ## binary chop on the sorted destination list
   set the itemdelimiter to "/"
   put 1 into tLeft -- lines are 1-based
   put pRecCount into tRight
   put item -1 of pTarget into tTargetFileName
   repeat while tLeft <= tRight
      put floor((tLeft + tRight) / 2) into tMidpoint
      put item -1 of (line tMidpoint of pDestinationList) into tTestLine
      if tTestLine < tTargetFileName then
         put tMidpoint + 1 into tLeft
      else if tTestLine > tTargetFileName then
         put tMidpoint - 1 into tRight
      else
         return false -- found on destination
      end if
   end repeat
   return true -- not found, so unique to source
end OnlyOnSource
I have just found a post from Richard describing a new array comparison command that is in version 9, so I will see whether it can be of any use.
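In the meantime, the same difference can be expressed with plain array operations, assuming both arrays are keyed by file name (the function name is mine, and pSourceA is deliberately passed by value so the caller's array is untouched):

```livecode
private function UniqueKeys pSourceA, pDestinationA
   -- remove from the source array every name that also appears on the destination
   repeat for each key tKey in pDestinationA
      delete variable pSourceA[tKey]
   end repeat
   return the keys of pSourceA -- file names unique to the source drive
end UniqueKeys
```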
I welcome your comments.
Simon