Faster than what I have going here?
Posted: Fri Sep 11, 2009 6:44 pm
by Garrett
Ok, still working on my offline Wikipedia reader here.
The scenario is that I've downloaded the Wikipedia database dump, which contains the article text only, no images and such. It's a 5 gig compressed download that uncompresses to a 23 gig XML file.
What I'm doing with it is generating an index file which contains only the title of each article and the starting character position of the article within the XML file.
The intent is to use the index file for searching, then use the starting char position to grab the article from the XML file.
So far, using the starting char to grab the data from the XML file is not a problem, as Rev is more than quick enough for this, even with my likely poor programming habits and lack of optimization.
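For reference, the lookup side can be as simple as a positioned read. This is only a sketch: the variable names and the use of "</page>" as an end marker are my illustration, not Garrett's actual code.
Code:
-- an index line looks like "SomeTitle|174999492"
set the itemDelimiter to "|"
put item 2 of tIndexLine into varCharLoc
open file varFileToRead for read
-- jump straight to the stored character position and read to the article's end tag
read from file varFileToRead at varCharLoc until "</page>"
put it into tArticleXML
close file varFileToRead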
The problem is generating the index file. At the current rate it takes about 3 days for my program to generate the index file, which just isn't acceptable if I release this program for others to use. There are over 3 million articles in the 23 gig XML file.
I'd like to know if anyone here is willing to look at my code below and see if there's any way to make it quicker, say from 3 days down to maybe 3 hours. Mind you, I did put in a few milliseconds of delay to keep from pegging my CPU at 100% for the whole run, whether that ends up being 3 hours or 3 days; it's just not cool to peg the CPU for that long. With the current delays built in, my CPU runs at about 50% instead of maxed out at 100% while my code is running.
Code:
on mouseUp
   put empty into field "ListIndex"
   put "WikipediaIndex.dat" into varIndexFile
   put "enwiki-20090902-pages-articles.xml" into varFileToRead
   open file varFileToRead for read
   put 1 into varCounter
   put 1 into varDelayLoop
   repeat
      read from file varFileToRead at varCounter for 1 line
      if the result is "eof" then exit repeat
      put it into varLineData
      -- capture the length now, before varLineData gets modified below,
      -- so the file position stays accurate on title lines
      put the number of chars of varLineData into varLineLength
      if "<title>" is among the chars of varLineData then
         put offset("<title>",varLineData) + 6 into varStartOffset
         put (varCounter + offset("<title>",varLineData)) into varCharLoc
         delete char 1 to varStartOffset of varLineData
         put offset("</title>",varLineData) into varStartOffset
         delete char varStartOffset to (varStartOffset + 7) of varLineData
         delete the last char of varLineData
         put varLineData & "|" & varCharLoc & cr into varLineData
         put varLineData after varIndexData
         put field "LabelStatusCount" + 1 into field "LabelStatusCount"
      end if
      put varLineLength + varCounter into varCounter
      put varDelayLoop + 1 into varDelayLoop
      if varDelayLoop is 5 then
         wait for 2 milliseconds
         put 1 into varDelayLoop
      end if
      if the number of lines of varIndexData is 1000 then
         put quote & "file:" & varIndexFile & quote into varSaveIndexFile
         put varIndexData after URL "file:WikipediaIndex.dat"
         put empty into varIndexData
         wait for 5 milliseconds
      end if
   end repeat
   close file varFileToRead
   put field "LabelStatusCount" + 1 into field "LabelStatusCount"
   put quote & "file:" & varIndexFile & quote into varSaveIndexFile
   put varIndexData after URL "file:WikipediaIndex.dat"
   answer "Wikipedia data indexed and ready for use."
end mouseUp
Any advice, code etc greatly appreciated.
Thanks,
~Garrett
Posted: Fri Sep 11, 2009 8:22 pm
by bn
Garret,
what comes to mind is
put field "LabelStatusCount" + 1 into field "LabelStatusCount"
Here you access a field and then access it again, forcing a screen update. Accessing a field, and especially the screen update, is time-consuming. I'd suggest a variable that counts, say, your "number of lines of varIndexData", and only updating the field every 1000 times around.
The same goes for "number of lines of varIndexData": you make Rev count the number of lines 999 times before you save.
If you set up a counter variable, let's say here
Code:
put varLineData & "|" & varCharLoc & cr into varLineData
where you generate the lines, then you can use one counter for both cases.
With that counter you could say
Code:
if mySnazzyCounter = 1000 then
   -- pseudocode
   update my feedback field
   save my index
   put 0 into mySnazzyCounter
end if
This should save you a day or so.
And here there seems to be a little redundancy:
Code:
put offset("<title>",varLineData) + 6 into varStartOffset
put (varCounter + offset("<title>",varLineData)) into varCharLoc
delete char 1 to varStartOffset of varLineData
put offset("</title>",varLineData) into varStartOffset
delete char varStartOffset to (varStartOffset + 7) of varLineData
delete the last char of varLineData
You delete part of the variable, which means memory shuffling for Rev. But you know where you are, so you could start your offset searches for "<title>" at the known offsets (the charsToSkip).
from the dictionary:
offset(charsToFind,stringToSearch[,charsToSkip])
That might only help a little per call, but since you do it many times it could save some too.
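A short sketch of that idea (tStart, tSkip, tTitle are illustrative names): extract the title using offsets alone instead of deleting chars, keeping in mind that with charsToSkip the returned position is relative to the skipped portion.
Code:
put offset("<title>", varLineData) into tStart -- position of the "<"
-- skip the first tStart chars; the result is relative, so add tStart back
put offset("</title>", varLineData, tStart) into tSkip
put char (tStart + 7) to (tStart + tSkip - 1) of varLineData into tTitle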
Maybe do a test run (on a small part of the 23 gigs) and tell us your mileage.
regards
Bernd
Posted: Fri Sep 11, 2009 8:29 pm
by Garrett
I sure will, Bernd! And thanks for the suggestions. I'll let ya know in a few days... I still have one day left on my current indexing run with the current code.
Thanks,
~Garrett
Posted: Sat Sep 12, 2009 1:10 am
by Mark Smith
Another possible optimisation would be around this part:
Code:
if the number of lines of varIndexData is 1000 then
   put quote & "file:" & varIndexFile & quote into varSaveIndexFile
   put varIndexData after URL "file:WikipediaIndex.dat"
   put empty into varIndexData
   wait for 5 milliseconds
Two things. First, each time through the loop the engine has to count the number of lines in varIndexData; it might be better to keep a running count that you increment each time you add to varIndexData, and reset to 0 each time you save it out to varSaveIndexFile.
The other thing would be to "open file varSaveIndexFile for write" at the beginning, and simply "write varIndexData to file varSaveIndexFile" instead of using the url syntax. Remember to close the file when you're done.
These might make a noticeable difference, particularly later on in the process when the index file gets big.
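Put together, the two suggestions might look something like this sketch (tLineCount is an illustrative name, and this is untested against the real dump):
Code:
open file varIndexFile for write -- open once, before the loop
repeat
   -- ... read a line and locate the title as before ...
   put varLineData & "|" & varCharLoc & cr after varIndexData
   add 1 to tLineCount -- running count; nothing to re-count each pass
   if tLineCount = 1000 then
      write varIndexData to file varIndexFile at eof
      put empty into varIndexData
      put 0 into tLineCount
   end if
end repeat
close file varIndexFile -- when done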
Posted: Sat Sep 12, 2009 2:40 am
by Garrett
Thanks

Posted: Mon Sep 14, 2009 5:53 pm
by Garrett
Ok, either I just suck at this, or with the current concept this just isn't going to be a quick, or at least acceptably fast, task for Rev.
The adjustments didn't seem to make enough of a difference.
Here's the last code used:
Code:
put empty into field "ListIndex"
put "...\WikipediaIndex3.dat" into varIndexFile
put "...\enwiki-20090902-pages-articles.xml" into varFileToRead
open file varFileToRead for read
open file varIndexFile for write
put 1 into varCounter
put 0 into varTally
repeat
   read from file varFileToRead at varCounter for 1 line
   if the result is "eof" then exit repeat
   put it into varLineData
   if "<title>" is among the chars of varLineData then
      put (varCounter + offset("<title>",varLineData)) into varCharLoc
      put char (offset("<title>",varLineData) + 7) to (offset("</title>",varLineData) -1) of varLineData & "|" & varCharLoc & cr after varIndexData
      put varTally + 1 into varTally
   end if
   put the number of chars of varLineData + varCounter into varCounter
   if varTally is "1000" then
      put field "LabelStatusCount" + 1000 into field "LabelStatusCount"
      put 0 into varTally
      write varIndexData to file varIndexFile at eof
      put empty into varIndexData
   end if
end repeat
put field "LabelStatusCount" + varTally into field "LabelStatusCount"
write varIndexData to file varIndexFile at eof
close file varFileToRead
close file varIndexFile
answer "Wikipedia data indexed and ready for use."
I'm wondering if going line by line is the right approach to this. What if I did this in larger chunks, say 1 meg, 10 meg, or maybe even 100 meg chunks? Would that make any difference?
Hmmm... well, only one way to find out, I guess.

Posted: Mon Sep 14, 2009 7:44 pm
by bn
Hi Garrett,
How did it go? Did the indexing finish all right? How long did it take? How much of an improvement did you see from your changes?
What do you call an acceptable time for indexing 23 GB?
How big a chunk is a line? Is it more like 100 chars, or more like 10000 chars? How often do you hit "<title>" in relation to no hit, i.e. on average how many lines do you read before you hit one?
All this influences your design decision. Disk access is "slow", so bigger chunks may well make a difference.
One thing I noticed is if I replace
Code:
if "<title>" is among the chars of varLineData then
   put (varCounter + offset("<title>",varLineData)) into varCharLoc
   put char (offset("<title>",varLineData) + 7) to (offset("</title>",varLineData) -1) of varLineData & "|" & varCharLoc & cr after varIndexData
   put varTally + 1 into varTally
end if
with
Code:
put offset("<title>",varLineData) into tStartTitle
if tStartTitle > 0 then
   put (varCounter + tStartTitle) into varCharLoc
   put char (tStartTitle + 7) to (offset("</title>",varLineData) -1) of varLineData & "|" & varCharLoc & cr after varIndexData
   put varTally + 1 into varTally
end if
it does speed things up, especially when "<title>" is a little further along and not right at the beginning of the line you test.
I would do some benchmarking on a small subset of the data to test your variations, maybe 50 MB or so, or even just 5 MB. It should give you a good idea of the time it takes. Just don't expect any miracles; 23 GB is a tall order.
regards
Bernd
Posted: Wed Sep 16, 2009 4:04 am
by Garrett
Well, with a different approach in coding and using some of the suggestions, I was able to cut the 3 days of indexing down to 10 hours. Still not good enough; I want around 3 hours total. So I'm still toying with ideas and code here.
Back in a few days or so with results.
Posted: Wed Sep 16, 2009 4:39 pm
by bn
Garrett,
I tried your script on a 24 MB file with 13200 occurrences of <title>xxx</title>. The version you posted took me about 7600 milliseconds. I changed your script to read the big file in sequential chunks, and it took about 630 milliseconds. So reading in bigger chunks does indeed considerably improve the speed of your script. Here it is:
Code:
on mouseUp
   put empty into field "ListIndex"
   put empty into field "LabelStatusCount"
   put "...\WikipediaIndex3.dat" into varIndexFile
   put "...\enwiki-20090902-pages-articles.xml" into varFileToRead
   open file varFileToRead for read
   open file varIndexFile for write
   put 1 into varCounter
   put 0 into varTally
   put false into tReachedTheEnd
   --put the milliseconds into tStart
   repeat until tReachedTheEnd
      read from file varFileToRead for 1000000 -- change for bigger chunks, 1 million seems to be pretty good
      if the result is "eof" then put true into tReachedTheEnd
      put it into tPartText
      repeat for each line varLineData in tPartText
         if "<title>" is among the chars of varLineData then
            put (varCounter + offset("<title>",varLineData)) into varCharLoc
            put offset("</title>",varLineData) into tClosingTitle
            if tClosingTitle = 0 then
               -- the chunk ended mid-line: read the rest of the line so the
               -- closing tag ends up on the same line as the opening tag
               read from file varFileToRead for 1 line
               put it into myRestOfLine
               put the number of chars of myRestOfLine + varCounter into varCounter
               put varLineData & myRestOfLine into tLastLineOfImport
               put char (offset("<title>",tLastLineOfImport) + 7) to (offset("</title>",tLastLineOfImport) - 1) of tLastLineOfImport & "|" & varCharLoc & cr after varIndexData
               put varTally + 1 into varTally
            else
               put char (offset("<title>",varLineData) + 7) to tClosingTitle - 1 of varLineData & "|" & varCharLoc & cr after varIndexData
               put varTally + 1 into varTally
            end if
         end if
         put the number of chars of varLineData + varCounter into varCounter
         if varTally is 1000 then
            put field "LabelStatusCount" + 1000 into field "LabelStatusCount"
            wait 0 milliseconds
            put 0 into varTally
            write varIndexData to file varIndexFile at eof
            put empty into varIndexData
         end if
      end repeat
   end repeat
   put field "LabelStatusCount" + varTally into field "LabelStatusCount"
   write varIndexData to file varIndexFile at eof
   close file varFileToRead
   close file varIndexFile
   --put the milliseconds - tStart into field "myTime"
   answer "Wikipedia data indexed and ready for use."
end mouseUp
It should work the same as your script. As an aside, when I take the quotes off of the "1000" in your comparison (if varTally is "1000"), it saves about 50 milliseconds on the 24 MB.
I kept your line-by-line approach, since Rev is very fast if you use repeat for each line. So I didn't even try to parse repeatedly with offset("<title>",...) and charsToSkip.
The version I proposed with
put offset("<title>",varLineData) into tStartTitle
if tStartTitle > 0 then
is actually slower than your version with "is among".
regards
Bernd
Posted: Wed Sep 16, 2009 8:02 pm
by Garrett
Bernd, thank you very much.
I tried almost exactly the same thing as you did here a few days prior, but ran into one problem which yours also runs into: the varCounter numbers come out all wrong. That frustrated me to no end, and I couldn't figure out why the counter was off, so I gave up on this way of doing it.
Well, after playing with it today I found out why varCounter was off: for each line whose characters we added, the count came up 1 character short of the real amount. I'm not sure exactly why, unless the cr at the end of each line is not being counted.
I resolved it by doing this:
Code:
put the number of chars of varLineData + varCounter + 1 into varCounter
Now I don't know if this is a Windows issue, or whether it will be the same on Linux and OS X. I'll be generating some test executables for each platform later to test and see.
But anyway, I believe that with all the suggestions and code provided by everyone here, this may finally work as intended and within the amount of time I was looking for. I haven't done a full run of the entire 23 gig file yet, but will soon.
Again, thanks everyone for your help.
Posted: Wed Sep 16, 2009 8:35 pm
by bn
Garrett,
The line delimiter is not counted if you ask for the number of chars of a line, so there is your 1 char. This is a Rev thing, and it is what one would expect, since you count the content of a line, and the line delimiter is not part of that content.
Since you are only off by one, the line delimiter in the original file is probably a linefeed (Unix).
So there should not be a problem with your code on different platforms; Rev should handle that (or should it?). The main thing is that the count comes out right. To my knowledge only Windows uses carriage return/linefeed, i.e. two characters, and since you develop on Windows and are only off by 1, the original file must use a one-character line delimiter. So you should be fine with Linux (linefeed) and Mac (return) as the line delimiters anyway. Unless of course I am wrong...
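The behavior bn describes is easy to see with a quick test in the message box:
Code:
put "abc" & linefeed & "def" into tText
put the number of chars of line 1 of tText -- 3: the trailing delimiter is not counted
put the number of chars of tText -- 7: the delimiter does count in the whole string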
regards
Bernd
Posted: Wed Sep 16, 2009 8:49 pm
by Garrett
I spoke too soon...
All is well for around a hundred lines, then the count begins to stray.

It seems the count begins to increase too much.
Here's the code I was playing with
Code:
on mouseUp
   put empty into field "ListLog"
   put "Started: " & the short system date & " - " & the short system time after field "ListLog"
   set the enabled of button "Generate Indexing" to false
   put "...\WikipediaIndex3.dat" into varIndexFile
   put "...\enwiki-20090902-pages-articles.xml" into varFileToRead
   open file varFileToRead for read
   open file varIndexFile for write
   put 1 into varCounter
   put 0 into varTally
   put false into tReachedTheEnd
   repeat 100 times --until tReachedTheEnd
      read from file varFileToRead for 1000000
      if the result is "eof" then
         put true into tReachedTheEnd
      end if
      put it into tPartText
      repeat for each line varLineData in tPartText
         if "<title>" is among the chars of varLineData then
            put (varCounter + offset("<title>",varLineData)) into varCharLoc
            --put varCounter into varCharLoc
            put offset("</title>",varLineData) into tClosingTitle
            if tClosingTitle = 0 then
               read from file varFileToRead at varCharLoc for 1 line
               put it into myRestOfLine
               put the number of chars of myRestOfLine + varCounter + 1 into varCounter
               put varLineData & myRestOfLine into tLastLIneOfImport
               put char (offset("<title>",tLastLIneOfImport) + 7) to (offset("<title>",tLastLIneOfImport) -1 ) of tLastLIneOfImport & "|" & varCharLoc & cr after varIndexData
               put varTally + 1 into varTally
            else
               put char (offset("<title>",varLineData) + 7) to tClosingTitle-1 of varLineData & "|" & varCharLoc & cr after varIndexData
               put varTally + 1 into varTally
            end if
         end if
         put the number of chars of varLineData + varCounter + 1 into varCounter
         if varTally is 1000 then
            wait 0 milliseconds
            put 0 into varTally
            write varIndexData to file varIndexFile at eof
            put empty into varIndexData
         end if
      end repeat
   end repeat
   set the enabled of button "Generate Indexing" to true
   set the enabled of field "ListLog" to true
   put cr & "Complete: " & the short system date & " - " & the short system time after field "ListLog"
   write varIndexData to file varIndexFile at eof
   close file varFileToRead
   close file varIndexFile
   answer "Wikipedia data indexed and ready for use."
end mouseUp
As of this moment I am unable to determine why this is or how to check for the increase in the number.
Posted: Wed Sep 16, 2009 9:13 pm
by Garrett
Ok, this almost resolved the noted issue in my last reply...
Code:
      end repeat
      put varCounter - 1 into varCounter
   end repeat
   set the enabled of button "Generate Indexing" to true
   set the enabled of field "ListLog" to true
   put cr & "Complete: " & the short system date & " - " & the short system time after field "ListLog"
   write varIndexData to file varIndexFile at eof
   close file varFileToRead
   close file varIndexFile
   answer "Wikipedia data indexed and ready for use."
end mouseUp
By reducing the counter by 1 each time we load a chunk for checking..
Code:
put varCounter - 1 into varCounter
..we cut back on the error, but only slightly. A few thousand entries later we start encountering the increase in varCounter again.
So now I'm a bit lost again. This routine seems to be the fastest possible way of doing this, but the trouble getting a proper char position for the entries may kill the approach.
Ok, back to trying things out with this one to see if I can figure it out.
Posted: Wed Sep 16, 2009 9:41 pm
by Garrett
Ok, I think I may have found the problem.
At lines 10225 through 10227 I get the following in the index file:
Ismailis|174999492
|174999990
Islamism|175002036
You'll notice that the second line in the index there is missing its title data. All is well until after that entry.
Here is what the data returns for each of the three using the character positions:
Ismailis|174999492 returns " <title>Ismailis</title>"
|174999990 returns " <title>Indus (disambiguation)</title>"
Islamism|175002036 returns "mism</title>"
For the last line we end up with a 12-character addition to varCounter that shouldn't be there.
The second line, for some odd reason, doesn't get its title included in the index.
Anyone have any ideas what might be going on here??
Thanks,
~Garrett
Posted: Wed Sep 16, 2009 10:07 pm
by bn
What is in the raw data for the respective lines? Could there be an empty title? Can you post a subset of your data somewhere on the net to play around with?
regards
Bernd