I had exported my Safari bookmarks to a file so I could manipulate them. My file is 11,500 lines long and contains 294 bookmark folders. The file format is XML (fairly easy to understand and parse), but it has folders nested within folders, which makes processing the file linearly (top to bottom) difficult to do accurately. But as a first pass that is what I did. I wrote a linear routine that loops through the file using:
Code:
repeat for each line tLine in field "myData"
   -- process tLine
end repeat
The execution time: around 2 seconds.
But I knew that, if I was going to release this, I would have to bite the bullet and code it properly using recursion (i.e. the handler that processes a folder calls itself whenever it encounters an embedded folder, and so on). This required a bit of mental gymnastics to figure out. I knew I would have to give up the “repeat for…” loop I had used before, because each time I called “process new folder” I needed to start in the field wherever I was, continue processing from that point on, and then, when that call unwound, continue from where the last folder ended. In short, I needed explicit control over the line-number pointer into the field. So I wrote a function called getNextLine that increments the pointer, fetches a line, and tests for end of file. That way I could have any number of “repeat while not EOF” loops and have them start and stop at different sequential places.
Code:
function getNextLine
   -- lLineCount and lEOF are script-local variables
   add 1 to lLineCount
   put line lLineCount of fld "myData" into tLine
   if tLine = "</HTML>" then put true into lEOF
   return tLine
end getNextLine
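For reference, the recursive handler has roughly this shape (a simplified sketch only: the real tag tests and the categorising code are omitted, and the tag names shown are illustrative):
Code:
on processFolder
   repeat while not lEOF
      put getNextLine() into tLine
      if tLine contains "<DL>" then
         processFolder -- embedded folder: recurse into it
      else if tLine contains "</DL>" then
         exit processFolder -- folder done: unwind to the caller
      else
         -- categorise the bookmark on tLine here
      end if
   end repeat
end processFolder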
And yes, that worked remarkably well: it read and categorised the bookmarks accurately and reduced the number of orphans from 59 to just 3, which happened to be 3 URLs that were not categorised inside any folder anyway, and so were legitimate orphans from a folder perspective.
Great, except it took 40 minutes to run. Yes, you read that correctly. Same input file, same computer: 2 seconds vs 40 minutes. How could processing the same file go from 2 seconds to 40 minutes? I was curious, and set about finding out.
Since the internal data processing was essentially the same between the two versions, I decided to focus on the loop. There was no point in drilling into the “repeat for each line” version, since that was already running at virtually instant speed.
But in the other program I decided to just read through the file, top to bottom, using the getNextLine() function and doing NO other processing. That alone took 36 minutes (or 90% of the time).
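The bare test loop was essentially just this (a sketch, assuming lLineCount and lEOF are reset beforehand):
Code:
put 0 into lLineCount
put false into lEOF
repeat while not lEOF
   get getNextLine() -- fetch and discard each line
end repeat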
So I added a field to output the line number as it was being fetched and watched the progress. It actually started out pretty quickly (30 lines per second or thereabouts), but as things progressed it got slower and slower. By the time it reached line 5,000 I could easily count along as fast as the computer was processing, i.e. about 5 lines a second. And it continued to deteriorate.
So I drew two conclusions. First, “repeat for each line tLine…” is incredibly optimised: even with the folder and bookmark identification and counting, it processed about 5,000 lines per second. Second, the manual version using getNextLine is incredibly non-optimised, processing at most (even with NO folder or bookmark processing) about 5 lines per second. 5,000 vs 5. What could account for a 1,000-fold difference?
Watching the display as it continued to deteriorate, I came to the conclusion that my “fetch next line” routine was actually starting at the TOP of the field every time and counting all the way down to the point where the next line was located. So to fetch the 10,000th line it had to start at the beginning and count past 9,999 lines, then repeat the whole process, counting past 10,000 lines, to fetch the 10,001st line, and so on. If that is right, the whole pass is quadratic: reading all 11,500 lines this way means counting roughly 66 million lines in total (about 11,500² / 2). At least that is what it looks like. Can anyone confirm this?
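One way to confirm the suspicion would be to time a fixed number of fetches starting at different depths in the file: if each batch takes longer the deeper it starts, the fetch cost must grow with the line number. A quick, untested sketch (the “results” field name is made up):
Code:
on testFetchSpeed
   put false into lEOF
   repeat for each item tStart in "1,4000,8000"
      put tStart into lLineCount
      put the milliseconds into tT
      repeat 1000 times
         get getNextLine() -- fetch and discard
      end repeat
      -- report the starting line and the elapsed milliseconds
      put tStart && the milliseconds - tT & return after fld "results"
   end repeat
end testFetchSpeed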
So then I thought: well, this really sinks any recursive solution if it can’t be done with an optimised read through the data. Unless… (and I hope) someone knows a better approach?
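One idea I have not yet tried: read the field into a variable once and split it into a numerically keyed array, so each fetch is a cheap array lookup instead of a count from the top. A sketch (untested; the loadData handler name is just for illustration):
Code:
local lLines, lLineCount, lEOF

on loadData
   put fld "myData" into lLines
   split lLines by return -- lLines[1], lLines[2], ... one line per key
   put 0 into lLineCount
   put false into lEOF
end loadData

function getNextLine
   add 1 to lLineCount
   put lLines[lLineCount] into tLine
   if tLine = "</HTML>" then put true into lEOF
   return tLine
end getNextLine
Since array access is a hash lookup rather than a scan, it should not degrade as the line number grows. Does that sound right, or is there a more idiomatic way?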
Mark