Faster than what I have going here?

Garrett · Post by **Garrett** » Fri Sep 11, 2009 6:44 pm

Ok, still on my offline wikipedia reader here.

The scenario is that I've downloaded the wikipedia database which contains the information only, no images and such. It's a 5 gig compressed download that uncompresses to a 23 gig XML file.

What I'm doing with it is generating an index file which only contains the title of each article and the starting character position of the article within the XML file.

The intent is to use the index file for searching, using the starting char position to grab the article from the XML file.

So far, using the starting char to grab the data from the XML file is not a problem, as Rev is more than quick enough to do this, even with my likely poor programming habits and lack of optimization.

The problem so far is generating the index file. At the current rate, it takes about 3 days for my program to generate the index file. This just isn't acceptable if I release this program for others to use. There are over 3 million articles in the 23 gig XML file.

I'd like to know if anyone here is willing to look at my code below and see if there's any way to make it quicker, say from 3 days to maybe 3 hours. Mind you, I did put in a delay of a few milliseconds to keep from pegging my CPU at 100% for however long the run takes, whether that be 3 hours or 3 days; it's just not cool to peg the CPU for that long. With the current delays built in, my CPU runs at about 50% instead of maxing out at 100% while my code is running.

Code: Select all

on mouseUp
  put empty into field "ListIndex"
  put "WikipediaIndex.dat" into varIndexFile
  put "enwiki-20090902-pages-articles.xml" into varFileToRead
  open file varFileToRead for read
  put 1 into varCounter
  put 1 into varDelayLoop
  repeat
    read from file varFileToRead at varCounter for 1 line
    if the result is "eof" then exit repeat
    put it into varLineData
    if "<title>" is among the chars of varLineData then
      put offset("<title>",varLineData) + 6 into varStartOffset
      put (varCounter + offset("<title>",varLineData)) into varCharLoc
      delete char 1 to varStartOffset of varLineData
      put offset("</title>",varLineData) into varStartOffset
      delete char varStartOffset to (varStartOffset + 7) of varLineData
      delete the last char of varLineData
      put varLineData & "|" & varCharLoc & cr into varLineData
      put varLineData after varIndexData
      put field "LabelStatusCount" + 1 into field "LabelStatusCount"
    end if
    put the number of chars of varLineData + varCounter into varCounter
    put varDelayLoop + 1 into varDelayLoop
    if varDelayLoop is 5 then
      wait for 2 milliseconds
      put 1 into varDelayLoop
    end if
    if the number of lines of varIndexData is 1000 then
      put quote & "file:" & varIndexFile & quote into varSaveIndexFile
      put varIndexData after URL "file:WikipediaIndex.dat"
      put empty into varIndexData
      wait for 5 milliseconds
    end if
  end repeat
  close file varFileToRead
  put field "LabelStatusCount" + 1 into field "LabelStatusCount"
  put quote & "file:" & varIndexFile & quote into varSaveIndexFile
  put varIndexData after URL "file:WikipediaIndex.dat"
  answer "Wikipedia data indexed and ready for use."
end mouseUp

Any advice, code etc greatly appreciated.

Thanks,
~Garrett

bn · Post by bn » Fri Sep 11, 2009 8:22 pm

Garrett,
what comes to mind is

put field "LabelStatusCount" + 1 into field "LabelStatusCount"

Here you access a field and then access it again and force a screen update. Accessing a field, and especially updating the screen, is time consuming. I'd suggest a variable that counts, say, your "number of lines of varIndexData" and only updating the field every 1000 times around.
The same goes for "number of lines of varIndexData": you make Rev count the number of lines 999 times before you save.
If you set up a variable, let's say here

Code: Select all

put varLineData & "|" & varCharLoc & cr into varLineData

where you generate the lines then you have a different approach for both cases.
With that counter you could say

Code: Select all

if mySnazzyCounter = 1000 then
-- pseudocode
update my feedback field
save my index
put 0 into mySnazzyCounter 
end if

this should save you a day or so.

and here seems to be a little redundancy

Code: Select all

put offset("<title>",varLineData) + 6 into varStartOffset 
      put (varCounter + offset("<title>",varLineData)) into varCharLoc 
      delete char 1 to varStartOffset of varLineData 
      put offset("</title>",varLineData) into varStartOffset 
      delete char varStartOffset to (varStartOffset + 7) of varLineData 
      delete the last char of varLineData

You delete part of the variable, which means memory shuffling for Rev. But you know where you are, so you could start your offset searches for <title> with the known offsets (the charsToSkip)
from the dictionary:

offset(charsToFind,stringToSearch[,charsToSkip])

that might help a little. But since you do it many times it might save some too.

maybe you do a test run (on a small part of the 23 gigs) and tell us your mileage
regards
Bernd

Garrett · Post by **Garrett** » Fri Sep 11, 2009 8:29 pm

I sure will Bernd! And thanks for the suggestions. I'll let ya know in a few days.. I still have one day left on my current run of indexing the file with my current code.

Thanks,
~Garrett

Mark Smith · Post by **Mark Smith** » Sat Sep 12, 2009 1:10 am

Another possible optimisation would be around this part:

Code: Select all

if the number of lines of varIndexData is 1000 then 
put quote & "file:" & varIndexFile & quote into varSaveIndexFile 
put varIndexData after URL "file:WikipediaIndex.dat" 
put empty into varIndexData 
wait for 5 milliseconds

Two things - for each time through the loop, the engine has to count the number of lines in varIndexData - it might be better to keep a running count that you increment each time you add to varIndexData, and reset to 0 each time you save it out to varSaveIndexFile.

The other thing would be to "open file varSaveIndexFile for write", at the beginning, and simply "write varIndexData to file varSaveIndexFile" instead of the url syntax. Remember to close the file when you're done.

These might make a noticeable difference, particularly later on in the process when the index file gets big.
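
A minimal sketch of the two suggestions above (keep a running count instead of recounting the buffered lines, and keep one file handle open for writing), written in Python rather than Rev purely for illustration. The names `IndexWriter`, `batch_size`, `add` and `flush` are hypothetical and do not come from the thread, and an in-memory buffer stands in for the real index file.

```python
import io


class IndexWriter:
    """Buffer index lines and write them to an already-open file in batches."""

    def __init__(self, out, batch_size=1000):
        self.out = out                # file-like object, opened once by the caller
        self.batch_size = batch_size
        self.pending = []             # index lines not yet written
        self.flushes = 0              # how many times we actually wrote

    def add(self, title, offset):
        self.pending.append(f"{title}|{offset}\n")
        # len() of a list is a stored count, so nothing is rescanned here
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.pending:
            self.out.write("".join(self.pending))
            self.pending.clear()
            self.flushes += 1


out = io.StringIO()                   # stand-in for the index file
writer = IndexWriter(out, batch_size=1000)
for i in range(2500):
    writer.add(f"Article {i}", i * 10)
writer.flush()                        # write the final partial batch
```

The final `flush()` plays the same role as the last write after the repeat loop in the script: whatever is left in the buffer when the input runs out still has to reach the file.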

Garrett · Post by **Garrett** » Sat Sep 12, 2009 2:40 am

Thanks

Garrett · Post by **Garrett** » Mon Sep 14, 2009 5:53 pm

Ok, either I just suck or with the current concept being used, this just isn't going to be a quick, or at least acceptable task for Rev.

The adjustments didn't seem to make enough of a difference.

Here's the last code used:

Code: Select all

  put empty into field "ListIndex"
  put "...\WikipediaIndex3.dat" into varIndexFile
  put "...\enwiki-20090902-pages-articles.xml" into varFileToRead
  open file varFileToRead for read
  open file varIndexFile for write
  put 1 into varCounter
  put 0 into varTally
  repeat
    read from file varFileToRead at varCounter for 1 line
    if the result is "eof" then exit repeat
    put it into varLineData
    if "<title>" is among the chars of varLineData then
      put (varCounter + offset("<title>",varLineData)) into varCharLoc
      put char (offset("<title>",varLineData) + 7) to (offset("</title>",varLineData) -1) of varLineData & "|" & varCharLoc & cr after varIndexData
      put varTally + 1 into varTally
    end if
    put the number of chars of varLineData + varCounter into varCounter
    if varTally is "1000" then
      put field "LabelStatusCount" + 1000 into field "LabelStatusCount"
      put 0 into varTally
      write varIndexData to file varIndexFile at eof
      put empty into varIndexData
    end if
  end repeat
  put field "LabelStatusCount" + varTally into field "LabelStatusCount"
  write varIndexData to file varIndexFile at eof
  close file varFileToRead
  close file varIndexFile
  answer "Wikipedia data indexed and ready for use."

I'm wondering if going for line for line is not the right approach to this. What if I did this in larger chunks? Say something like 1 meg chunks or 10 meg chunks or maybe even 100 meg chunks? Would that make any difference?

Hmmm... Well, only one way to find out I guess.

bn · Post by bn » Mon Sep 14, 2009 7:44 pm

Hi Garrett,

How did it go, did it finish the indexing all right? How long did it take? How much of an improvement did you find with your changes?
What do you call an acceptable time for indexing 23 GB?
How big a chunk is a line? Is it closer to 100 chars or more like 10000 chars? How often do you hit <Title> in relation to no hit, i.e. on average how many lines do you read before you hit <Title>?
All this influences your design decision. Disk access is "slow", so bigger chunks may well make a difference.
One thing I noticed is if I replace

Code: Select all

if "<title>" is among the chars of varLineData then
         put (varCounter + offset("<title>",varLineData)) into varCharLoc
         put char (offset("<title>",varLineData) + 7) to (offset("</title>",varLineData) -1) of varLineData & "|" & varCharLoc & cr after varIndexData
         put varTally + 1 into varTally
      end if

with

Code: Select all

 put offset("<title>",varLineData) into tStartTitle
      if tStartTitle > 0 then
         put (varCounter + tStartTitle) into varCharLoc
         put char (tStartTitle + 7) to (offset("</title>",varLineData) -1) of varLineData & "|" & varCharLoc & cr after varIndexData
         put varTally + 1 into varTally
      end if

it does speed things up, especially when <Title> is a little further down the road and not right at the beginning of the line you test.
I would do some benchmarking on a small subset of the data to test your variations. Maybe 50 MB or so, or even less, say 5 MB. It should give you a good idea of the time it takes. Just don't expect any miracles, 23 GB is a tall order.
regards
Bernd

Garrett · Post by **Garrett** » Wed Sep 16, 2009 4:04 am

Well, with a different approach in coding and using some of the suggestions, I was able to cut 3 days indexing down to 10 hours indexing. Still not good enough, I want around 3 hours total time. So I'm still toying with ideas and code here.

Back in a few days or so with results.

bn · Post by bn » Wed Sep 16, 2009 4:39 pm

Garrett,
I tried your script on a 24 MB file with 13200 occurrences of <Title>xxx</Title>. The version you posted took me about 7600 milliseconds. I changed your script to take in sequential chunks of the big file and it took about 630 milliseconds. So reading in bigger chunks does indeed improve the speed of your script considerably. Here it is:

Code: Select all

on mouseUp
   put empty into field "ListIndex"
   put empty into field "LabelStatusCount" 
   put "...\WikipediaIndex3.dat" into varIndexFile 
   put "...\enwiki-20090902-pages-articles.xml" into varFileToRead 
   open file varFileToRead for read 
   open file varIndexFile for write 
   put 1 into varCounter 
   put 0 into varTally
   put false into tReachedTheEnd
   --put the milliseconds into tStart
   repeat until tReachedTheEnd
      read from file varFileToRead for 1000000 -- change for bigger chunks, 1 million seems to be pretty good
      if the result is "eof" then put true into tReachedTheEnd
      put it into tPartText
      repeat for each line varLineData in tPartText
         if "<title>" is among the chars of varLineData then 
            put (varCounter + offset("<title>",varLineData)) into varCharLoc
            put offset("</title>",varLineData) into tClosingTitle
            -- here we dont find a closing title
            if tClosingTitle = 0 then
               -- read until the end of line, closing title is on the same line as beginning title
               read from file varFileToRead for 1 line
               put it into myRestOfLine
               put the number of chars of myRestOfLine + varCounter into varCounter
               put varLineData & myRestOfLine into tLastLIneOfImport
               put char (offset("<title>",tLastLIneOfImport) + 7) to (offset("<title>",tLastLIneOfImport) -1 ) of tLastLIneOfImport & "|" & varCharLoc & cr after varIndexData 
               put varTally + 1 into varTally 
            else
               put char (offset("<title>",varLineData) + 7) to tClosingTitle-1 of varLineData & "|" & varCharLoc & cr after varIndexData 
               put varTally + 1 into varTally 
            end if
         end if
         put the number of chars of varLineData + varCounter into varCounter 
         if varTally is 1000 then 
            put field "LabelStatusCount" + 1000 into field "LabelStatusCount"
            wait 0 milliseconds
            put 0 into varTally 
            write varIndexData to file varIndexFile at eof 
            put empty into varIndexData 
         end if 
      end repeat
   end repeat
   put field "LabelStatusCount" + varTally into field "LabelStatusCount" 
   write varIndexData to file varIndexFile at eof 
   close file varFileToRead 
   close file varIndexFile 
   --put the milliseconds - tStart into field "myTime"
   answer "Wikipedia data indexed and ready for use."
end mouseUp

It should work the same as your script. As an aside, when I take the quotes off of

Code: Select all

if varTally is "1000"

it saves about 50 milliseconds on the 24 MB.
I left your line approach in place since Rev is very fast if you do it for each line. So I didn't even try to parse repeatedly for offset(<Title>) with chars to skip.
The one version I proposed with

put offset("<title>",varLineData) into tStartTitle
if tStartTitle > 0 then

is actually slower than your version with "is among".

regards
Bernd

Garrett · Post by **Garrett** » Wed Sep 16, 2009 8:02 pm

Bernd, thank you very much.

I tried almost exactly the same thing as you did here a few days prior, but ran into one problem which yours also runs into... The varCounter numbers come out all wrong, which frustrated me to no end. I gave up on this way of doing it because I couldn't figure out why the counter was off on characters.

Well, after some playing with it today I found out why the varCounter was off.... For each line we added the characters of, it was coming up 1 character short of the real amount. I'm not sure exactly why at the moment unless the cr at the end of each line is not being counted.

I resolved it by doing this:

Code: Select all

put the number of chars of varLineData + varCounter + 1 into varCounter

Now I don't know if this is a Windows issue or if it will be the same on Linux and OS X. I'll be generating some test executables for each platform later to test and see.

But anyway, I believe with all the suggestions and code provided by all of you here that this may finally work as intended and within the amount of time I was looking for. I haven't done a full run of the entire 23 gig file yet though, but will soon.

Again, thanks everyone for your help.

bn · Post by bn » Wed Sep 16, 2009 8:35 pm

Garrett,
the line delimiter is not counted if you ask for the number of chars of a line, so there is your 1 char. This is a Rev thing, and it is what one would expect, since you count the content of a line and a line delimiter is not content of a line.

Since you are only off by one, the line delimiter in the original file is probably a linefeed (Unix).

So there should not be a problem with your code on different platforms. Rev should handle that (or should it?); the main thing is that you have the right amount. To my knowledge only Windows uses carriage return/linefeed, i.e. two characters, as its line delimiter. You develop on Windows and are only off by 1, which means the original file has a one-character line delimiter, so you should be fine on Linux (linefeed) and Mac (return) anyway. Unless of course I am wrong...

regards

Bernd
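
To make the off-by-one concrete, here is a small sketch in Python (not Rev) that walks a buffer line by line and keeps a running position. It assumes a single-byte linefeed delimiter, as guessed above; the function name `line_offsets` and the sample data are made up for illustration. Note that Python positions are 0-based, whereas Rev counts chars from 1.

```python
def line_offsets(data, delimiter=b"\n"):
    """Return (offset, line) pairs; offset is the 0-based byte position of each line start."""
    offsets = []
    position = 0
    for line in data.split(delimiter):
        offsets.append((position, line))
        # split() drops the delimiter, so its length has to be added back,
        # otherwise every later position comes out one byte short per line
        position += len(line) + len(delimiter)
    return offsets


data = b"<page>\n  <title>Ismailis</title>\n  <id>1</id>\n"
offsets = line_offsets(data)
title_start, title_line = offsets[1]
```

With a two-byte delimiter (carriage return plus linefeed) the same loop would need `delimiter=b"\r\n"`, which is why the correction depends on the file's line endings rather than on the platform the script runs on.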

Garrett · Post by **Garrett** » Wed Sep 16, 2009 8:49 pm

I spoke too soon

All is well for around a hundred lines, then the count begins to go astray.

It seems the count begins to increase too much.

Here's the code I was playing with

Code: Select all

on mouseUp
  put empty into field "ListLog"
  put "Started:  " & the short system date & " - " & the short system time after field "ListLog"
  set the enabled of button "Generate Indexing" to false
  put "...\WikipediaIndex3.dat" into varIndexFile
  put "...\enwiki-20090902-pages-articles.xml" into varFileToRead
  open file varFileToRead for read
  open file varIndexFile for write
  put 1 into varCounter
  put 0 into varTally
  put false into tReachedTheEnd
  repeat 100 times --until tReachedTheEnd
    read from file varFileToRead for 1000000
    if the result is "eof" then
      put true into tReachedTheEnd
    end if
    put it into tPartText
    repeat for each line varLineData in tPartText
      if "<title>" is among the chars of varLineData then
        put (varCounter + offset("<title>",varLineData)) into varCharLoc
        --put varCounter into varCharLoc
        put offset("</title>",varLineData) into tClosingTitle
        if tClosingTitle = 0 then
          read from file varFileToRead at varCharLoc for 1 line
          put it into myRestOfLine
          put the number of chars of myRestOfLine + varCounter + 1 into varCounter
          put varLineData & myRestOfLine into tLastLIneOfImport
          put char (offset("<title>",tLastLIneOfImport) + 7) to (offset("<title>",tLastLIneOfImport) -1 ) of tLastLIneOfImport & "|" & varCharLoc & cr after varIndexData
          put varTally + 1 into varTally
        else
          put char (offset("<title>",varLineData) + 7) to tClosingTitle-1 of varLineData & "|" & varCharLoc & cr after varIndexData
          put varTally + 1 into varTally
        end if
      end if        
      put the number of chars of varLineData + varCounter + 1 into varCounter
      if varTally is 1000 then
        wait 0 milliseconds
        put 0 into varTally
        write varIndexData to file varIndexFile at eof
        put empty into varIndexData
      end if
    end repeat
  end repeat 
  set the enabled of button "Generate Indexing" to true
  set the enabled of field "ListLog" to true
  put cr & "Complete:  " & the short system date & " - " & the short system time after field "ListLog"
  write varIndexData to file varIndexFile at eof
  close file varFileToRead
  close file varIndexFile
  answer "Wikipedia data indexed and ready for use."
end mouseUp

As of this moment I am unable to determine why this is or how to check for the increase in the number.

Garrett · Post by **Garrett** » Wed Sep 16, 2009 9:13 pm

Ok, this almost resolved the noted issue in my last reply...

Code: Select all

    end repeat
    put varCounter - 1 into varCounter
  end repeat 
  set the enabled of button "Generate Indexing" to true
  set the enabled of field "ListLog" to true
  put cr & "Complete:  " & the short system date & " - " & the short system time after field "ListLog"
  write varIndexData to file varIndexFile at eof
  close file varFileToRead
  close file varIndexFile
  answer "Wikipedia data indexed and ready for use."
end mouseUp

By reducing the counter by 1 each time we load a chunk for checking..

Code: Select all

put varCounter - 1 into varCounter

..we cut back on the error only slightly. A few thousand entries later, we start encountering the increase in varCounter again.

So now I'm a bit lost again. This routine seems to be the fastest possible way of doing this, but the issues with getting a proper char position of the entries may kill this approach.

Ok, back to trying things out with this one to see if I can figure it out.

Garrett · Post by **Garrett** » Wed Sep 16, 2009 9:41 pm

Ok, I think I may have found the problem

At lines 10225 through 10227 I get the following in the index file:

Ismailis|174999492

|174999990

Islamism|175002036

You'll notice that the second line in the index there is missing its title data. All is well until after that entry.

Here is what the data returns for each of the three using the character positions:

Ismailis|174999492 returns " <title>Ismailis</title>"

|174999990 returns " <title>Indus (disambiguation)</title>"

Islamism|175002036 returns "mism</title>"

For the last line we end up with a 12 character addition to varCounter that shouldn't be there.

For the second line, for some odd reason, the title doesn't get included in the index.

Anyone have any ideas what might be going on here??

Thanks,
~Garrett

bn · Post by bn » Wed Sep 16, 2009 10:07 pm

What is in the raw data for the respective lines? Could there be an empty title? Can you post a subset of your data somewhere on the net to play around with?
regards
Bernd
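
One possible explanation for the drift, offered as a guess from the posted script rather than a diagnosis: a fixed-size chunk almost never ends exactly on a line delimiter, so the last "line" of each chunk is only a fragment. Adding 1 for a delimiter that is not there, or re-reading part of the file to complete the fragment, shifts every later position. The sketch below, in Python rather than Rev, holds the fragment back and prepends it to the next chunk instead. It assumes linefeed delimiters and titles that sit on a single line; `index_titles` and `chunk_size` are hypothetical names, and the offsets are 0-based byte positions.

```python
import io


def index_titles(stream, chunk_size=1_000_000):
    """Return (title, offset) pairs; offset is the 0-based byte position of '<title>'."""
    entries = []
    position = 0      # byte offset in the stream where `carry` starts
    carry = b""       # incomplete last line held over from the previous chunk
    while True:
        chunk = stream.read(chunk_size)
        at_eof = not chunk
        buffer = carry + chunk
        if at_eof:
            # no more data: whatever is left is the final line
            lines, carry = ([buffer] if buffer else []), b""
        else:
            # everything after the last newline is a fragment; keep it for later
            complete, sep, carry = buffer.rpartition(b"\n")
            lines = complete.split(b"\n") if sep else []
        for line in lines:
            start = line.find(b"<title>")
            end = line.find(b"</title>")
            if start != -1 and end > start:
                title = line[start + 7:end].decode("utf-8")
                entries.append((title, position + start))
            position += len(line) + 1   # +1 for the newline that split() removed
        if at_eof:
            return entries


xml = (b"<page>\n  <title>Ismailis</title>\n</page>\n"
       b"<page>\n  <title>Indus (disambiguation)</title>\n</page>\n"
       b"<page>\n  <title>Islamism</title>\n</page>\n")
entries = index_titles(io.BytesIO(xml), chunk_size=16)
```

The tiny `chunk_size` in the usage lines forces titles to straddle chunk boundaries, which is the case that matters. Separately, the empty title in the index would be consistent with the unclosed-title branch of the posted script slicing up to `offset("<title>", ...) - 1` instead of up to the closing tag, but that is only a reading of the code as posted.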