The scenario is that I've downloaded the Wikipedia database dump, which contains the article text only, no images and such. It's a 5 GB compressed download that expands to a 23 GB XML file.
What I'm doing with it is generating an index file that contains only the title of each article and the starting character position of the article within the XML file.
The intent is to use the index file for searching, using the starting char position to grab the article from the XML file.
So far, using the starting char to grab the data from the XML file is not a problem, as Rev is more than quick enough to do this, even with my likely poor programming habits and lack of optimization.
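A minimal sketch of that lookup step, for illustration only. The handler name `fetchArticle` and its parameters are hypothetical, and it assumes each index line has the form `title|position` and that an article's text ends at the next `</page>` tag in the dump.

```livecode
-- Hypothetical helper: find pTitle in the index and return the article's XML.
-- Assumes index lines look like "title|position" and that the article ends
-- at the next "</page>" tag.
function fetchArticle pTitle, pIndexFile, pXmlFile
   put URL ("file:" & pIndexFile) into tIndex
   set the itemDelimiter to "|"
   put empty into tStart
   -- linear scan of the index; the comparison is case-insensitive
   -- unless the caseSensitive property is set to true
   repeat for each line tEntry in tIndex
      if item 1 of tEntry is pTitle then
         put item 2 of tEntry into tStart
         exit repeat
      end if
   end repeat
   if tStart is empty then return empty -- title not in the index
   -- jump straight to the recorded position and read to the end of the page
   open file pXmlFile for read
   read from file pXmlFile at tStart until "</page>"
   put it into tArticle
   close file pXmlFile
   return tArticle
end fetchArticle
```

It could be called as `put fetchArticle("Anarchism", "WikipediaIndex.dat", "enwiki-20090902-pages-articles.xml") into field "Article"`, where the field name and the article title are placeholders.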
The problem so far is generating the index file. At the current rate, it takes about 3 days for my program to generate the index file. This just isn't acceptable if I release this program for others to use. There are over 3 million articles in the 23 gig XML file.
I'd like to know if anyone here is willing to look at my code below and see if there's any way to make it quicker, say from 3 days to maybe 3 hours. Mind you, I did put in delays of a few milliseconds to keep from pegging my CPU at 100% for however long the run takes, whether that's 3 hours or 3 days. It's just not cool to peg the CPU for that long. With the current delays built in, my CPU runs at about 50% instead of 100% while my code is running.
Code:
on mouseUp
   put empty into field "ListIndex"
   put "WikipediaIndex.dat" into varIndexFile
   put "enwiki-20090902-pages-articles.xml" into varFileToRead
   put empty into varIndexData
   open file varFileToRead for read
   put 1 into varCounter -- file position of the next line to read
   put 1 into varDelayLoop
   repeat
      read from file varFileToRead at varCounter for 1 line
      if the result is "eof" then exit repeat
      put it into varLineData
      -- record the raw line length now; varLineData is edited below
      put the number of chars of varLineData into varLineLength
      if varLineData contains "<title>" then
         -- position of the title within the XML file
         put (varCounter + offset("<title>",varLineData)) into varCharLoc
         -- strip everything up to and including the opening tag
         put offset("<title>",varLineData) + 6 into varStartOffset
         delete char 1 to varStartOffset of varLineData
         -- strip the closing tag and the trailing line break
         put offset("</title>",varLineData) into varStartOffset
         delete char varStartOffset to (varStartOffset + 7) of varLineData
         delete the last char of varLineData
         -- one index entry per article: title|position
         put varLineData & "|" & varCharLoc & cr after varIndexData
         put field "LabelStatusCount" + 1 into field "LabelStatusCount"
      end if
      -- advance by the length of the line as read, not the edited title
      put varCounter + varLineLength into varCounter
      -- short pause every few lines to keep CPU usage down
      put varDelayLoop + 1 into varDelayLoop
      if varDelayLoop is 5 then
         wait for 2 milliseconds
         put 1 into varDelayLoop
      end if
      -- flush the buffered entries to disk every 1000 titles
      if the number of lines of varIndexData is 1000 then
         put varIndexData after URL ("file:" & varIndexFile)
         put empty into varIndexData
         wait for 5 milliseconds
      end if
   end repeat
   close file varFileToRead
   put field "LabelStatusCount" + 1 into field "LabelStatusCount"
   -- write whatever is left in the buffer
   put varIndexData after URL ("file:" & varIndexFile)
   answer "Wikipedia data indexed and ready for use."
end mouseUp
Thanks,
~Garrett