Faster than what I have going here?

Anything beyond the basics in using the LiveCode language. Share your handlers, functions and magic here.


bn
VIP Livecode Opensource Backer
Posts: 4174
Joined: Sun Jan 07, 2007 9:12 pm

Post by bn » Thu Sep 17, 2009 12:04 am

Garrett,
there is a flaw in the script: when reading data in chunks one never knows where a line ending falls. I figured one reads in the chunk, then reads on until the next line ending, and continues from there with repeat for each line.
the relevant part of the code is:

Code:

 repeat until tReachedTheEnd
      read from file varFileToRead for 1000000 
      if the result is "eof" then 
         put true into tReachedTheEnd 
      end if 
      put it into tPartText
      
      -- the read could have ended before a line end, let's go on and read until the line is complete
      -- this way we should always have complete lines
      if not tReachedTheEnd then
         --breakpoint
         read from file varFileToRead for 1 line
         put it after tPartText
      end if
I still have not figured out where the counting goes wrong, but with this modification the situation should become easier to understand. No more missing </title>.

regards
Bernd

bn
VIP Livecode Opensource Backer
Posts: 4174
Joined: Sun Jan 07, 2007 9:12 pm

Post by bn » Thu Sep 17, 2009 12:53 am

Garrett,

there seems to be a problem with the line delimiter. When I save the text file I am testing with just a return as the line delimiter, the script works fine. If I save it with carriage return/linefeed as the delimiter, I get into problems.
So maybe try to determine what line endings your big file has.
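For example, something like this quick check could tell you (just a sketch with a placeholder path; it reads the start of the file in binary mode, so nothing gets converted on the way in):

Code:

on mouseUp
   put "/path/to/your/bigfile.xml" into tFile -- placeholder, put the path to your file here
   open file tFile for binary read
   read from file tFile for 4000
   put it into tSample
   close file tFile
   if tSample contains (numToChar(13) & numToChar(10)) then
      answer "Windows line endings (CRLF)"
   else if tSample contains numToChar(10) then
      answer "Unix line endings (LF)"
   else if tSample contains numToChar(13) then
      answer "Classic Mac line endings (CR)"
   else
      answer "no line ending found in the first 4000 chars"
   end if
end mouseUp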
regards
Bernd

Garrett
Posts: 386
Joined: Sat Apr 08, 2006 8:15 am

Post by Garrett » Thu Sep 17, 2009 2:14 am

I'll see if I can extract part of the file without changing any of its characters, such as the line endings, and upload it.
'What you do not want done to yourself, do not do to others.' - Confucius (550 b.c. to 479 b.c.)

bn
VIP Livecode Opensource Backer
Posts: 4174
Joined: Sun Jan 07, 2007 9:12 pm

Post by bn » Thu Sep 17, 2009 10:59 am

Garrett,
I downloaded 3 articles from Wikipedia (German) in XML format. I tried with this XML file and it did work. Maybe for debugging you could work on small files that you can easily look at in a text editor.
The version that works for me is this one:

Code:

on mouseUp 
   put empty into field "ListLog" 
   put "Started:  " & the short system date & " - " & the short system time after field "ListLog" 
   set the enabled of button "Generate Indexing" to false 
   put "...\WikipediaIndex3.dat" into varIndexFile 
   put "...\enwiki-20090902-pages-articles.xml" into varFileToRead 
   open file varFileToRead for read 
   open file varIndexFile for write 
   put 0 into varCounter 
   put 0 into varTally 
   put false into tReachedTheEnd 
   -- repeat 100 times --until tReachedTheEnd 
   repeat until tReachedTheEnd
      read from file varFileToRead for 1000000 
      if the result is "eof" then 
         put true into tReachedTheEnd 
      end if 
      put it into tPartText
      
      -- the read could have ended before a line end, let's go on and read until the line is complete
      -- this way we should always have complete lines
      if not tReachedTheEnd then
         --breakpoint
         read from file varFileToRead for 1 line
         put it after tPartText
      end if
      
      repeat for each line varLineData in tPartText
         
         if "<title>" is among the chars of varLineData then 
            put (varCounter + offset("<title>",varLineData)) into varCharLoc 
            put char (offset("<title>",varLineData) + 7) to (offset("</title>",varLineData) -1 ) of varLineData & "|" & varCharLoc & cr after varIndexData 
            put varTally + 1 into varTally 
         end if        
         put the number of chars of varLineData + varCounter + 1 into varCounter 
         if varTally is 1000 then 
            put field "LabelStatusCount" + 1000 into field "LabelStatusCount"
            wait 0 milliseconds 
            put 0 into varTally 
            write varIndexData to file varIndexFile at eof 
            put empty into varIndexData 
         end if 
      end repeat 
      --put varCounter - 1 into varCounter 
   end repeat 
   set the enabled of button "Generate Indexing" to true 
   set the enabled of field "ListLog" to true 
   put cr & "Complete:  " & the short system date & " - " & the short system time after field "ListLog" 
   put field "LabelStatusCount" + varTally into field "LabelStatusCount"
   write varIndexData to file varIndexFile at eof 
   close file varFileToRead 
   close file varIndexFile 
   answer "Wikipedia data indexed and ready for use." 
end mouseUp
When I access the data at the indexed char number it always starts with <title>theTitleOfTheArticle...
There might still be problems that I don't see, but this can probably be worked out for the indexing. The XML file does indeed have the linefeed character (ASCII 10) as its line delimiter.
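A minimal sketch of how one of those indexed char positions might later be used to jump straight to an article (the position here is just a made-up example value; the file is opened the same way it was opened for indexing):

Code:

on mouseUp
   put "...\enwiki-20090902-pages-articles.xml" into varFileToRead
   put 106366 into varCharLoc -- example value, taken from one line of the index file
   open file varFileToRead for read -- same kind of open as in the indexing script
   read from file varFileToRead at varCharLoc for 500
   put it into varExtract
   close file varFileToRead
   answer char 1 to 80 of varExtract -- should begin with "<title>..."
end mouseUp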
regards
Bernd

FourthWorld
VIP Livecode Opensource Backer
Posts: 10055
Joined: Sat Apr 08, 2006 7:05 am

Post by FourthWorld » Thu Sep 17, 2009 6:20 pm

Reading until a specific character like a line ending will be slower than reading for a specified number of bytes, because the engine has to compare each incoming byte as it reads. So in addition to bypassing the issue with determining the correct line endings, you could speed things up by turning your algo inside out:

Rather than reading until a line ending, grab as much data as your buffer can reasonably hold each time through and process all of its lines but the last one, repeating this until you hit EOF, and then process the last line.
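A rough sketch of that inside-out loop (this is not Richard's code, just an illustration with placeholder names, assuming linefeed line endings as in the XML dump):

Code:

on mouseUp
   put "/path/to/your/bigfile.xml" into tFile -- placeholder path
   put numToChar(10) into tLF
   put empty into tCarry
   open file tFile for binary read
   repeat forever
      read from file tFile for 1000000
      put it into tChunk
      if tChunk is empty then exit repeat
      -- prepend whatever was left over from the previous chunk
      put tCarry & tChunk into tBuffer
      if char -1 of tBuffer is tLF then
         -- this chunk happened to end exactly on a line boundary
         put tBuffer into tComplete
         put empty into tCarry
      else
         -- hold back the trailing, possibly incomplete line for the next pass
         put line -1 of tBuffer into tCarry
         put the number of chars of tBuffer - the number of chars of tCarry into tSplit
         if tSplit > 0 then
            put char 1 to tSplit of tBuffer into tComplete
         else
            put empty into tComplete
         end if
      end if
      repeat for each line tLine in tComplete
         -- process each complete line here
      end repeat
   end repeat
   close file tFile
   if tCarry is not empty then
      -- whatever is left in the carry is the last line of the file
      -- process tCarry here
   end if
end mouseUp

The carry variable is what makes sure a line that gets split across two reads is processed exactly once.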
Richard Gaskin
LiveCode development, training, and consulting services: Fourth World Systems
LiveCode Group on Facebook
LiveCode Group on LinkedIn

Garrett
Posts: 386
Joined: Sat Apr 08, 2006 8:15 am

Post by Garrett » Thu Sep 17, 2009 8:04 pm

Here's an extract of the 23 gig file. It's nearly 100 megs uncompressed, and compressed into a zip it's a 35 meg download.

http://www.paraboliclogic.com/misc/enwi ... -chunk.zip

I'll read both replies above this in a bit...

Thanks a bunch :-)
~Garrett
'What you do not want done to yourself, do not do to others.' - Confucius (550 b.c. to 479 b.c.)

Philhold
Posts: 64
Joined: Thu Mar 05, 2009 3:39 pm

Post by Philhold » Thu Sep 17, 2009 9:14 pm

I don't know if this helps. I unzipped the file on Mac OSX and dropped it on BBedit. The line endings are \r.

Cheers

Phil

bn
VIP Livecode Opensource Backer
Posts: 4174
Joined: Sun Jan 07, 2007 9:12 pm

Post by bn » Fri Sep 18, 2009 12:16 am

Garrett,

I think I know what happens:
When I download from
http://en.wikipedia.org/wiki/Special:Export
AccessibleComputing
Anarchism
AfghanistanHistory
AfghanistanGeography
AfghanistanPeople|106366
AfghanistanCommunications
AfghanistanTransportations
AfghanistanMilitary
AfghanistanTransnationalIssues
I get an XML file that has ASCII 10 (linefeed) as the line delimiter, as in Unix.
In the chunk file you posted the line delimiter is ASCII 13 + ASCII 10, as in Windows. That is why the position number is off. (On top of that, the first 3 chars are high-ASCII chars that don't show up in the direct download from Wikipedia, but they don't bother you.)
If, in the last working script I posted, you change the line to:

Code:

 put the number of chars of varLineData + varCounter + 2 into varCounter 
i.e. you add _2_ instead of _1_, then the script works and for every entry in your chunk file you find at the indexed position exactly the same thing: <title>xxxx
if it is a Unix file add 1
if it is a Windows file add 2
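One way to keep a single script covering both cases would be to put that adjustment into a variable once, near the top of the handler, and use it inside the loop (a sketch, not code from this thread; varLineEndLength is a made-up name):

Code:

-- set once, after you have determined what kind of file it is (see the checking script below)
put 2 into varLineEndLength -- 1 for a Unix (LF) file, 2 for a Windows (CRLF) file

-- then, inside the line loop, instead of the fixed "+ 1":
put the number of chars of varLineData + varCounter + varLineEndLength into varCounter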

BUT if you do a binary read on your Windows file (as I just found out) you only have to add 1 and the script works. With a binary read Rev leaves the data untouched, so the return (ASCII 13) stays at the end of each line and is already included in that line's char count; adding 1 for the linefeed then gives the right position.

You can look at the ASCII values if you do a _binary_ read. In a simple (text) read Rev converts the line delimiter to ASCII 10 (linefeed) and gets rid of the ASCII 13 (return).
Make a stack and use the following script to import 2000 chars and look for ASCII 10; if an ASCII 13 comes right before it, it is a Windows file.

Code:

on mouseUp
   put "/Users/berndnig/Desktop/enwiki-20090902-pages-chunk.xml" into varFileToRead
   put empty into field 2
   open file varFileToRead for binary read 
   read from file varFileToRead for 2000
   put it into temp
   close file varFileToRead 
   lock screen
   repeat with i = 1 to 2000
      if chartoNum(char i of temp) = 10 then 
         put i && ":" && chartoNum(char i of temp) && char i of temp  after field 2
      else
         put i && ":" && chartoNum(char i of temp) && char i of temp & return after field 2
      end if
   end repeat
end mouseUp
So the script works; it just has to take the line delimiter into account, as you suspected earlier. If you go for a binary read you do not have to worry about what the delimiter is, and it is even a little faster, always adding only 1 for each line.
like in:

Code:

open file varFileToRead for binary read 
I hope I didn't confuse you as much as this confused me while I was looking into it.

You should hopefully be able to index your file now.

regards
Bernd

Garrett
Posts: 386
Joined: Sat Apr 08, 2006 8:15 am

Post by Garrett » Sat Sep 19, 2009 7:08 pm

Bernd.... I can't thank you enough! Your code further above seems to have hit the nail right on the head. Not only are character positions coming up exactly as they should, the code is still blazing fast.

I did a few tests looping 100, 200, 500, 1000 and 10000 times, checking various entries in the resulting index, and everything came up as it should. The last test of 10000 loops took only 11 minutes on my PC and read and processed nearly half of the 23 gig file!

File size in bytes:
24,560,823,780

Chars read in:
10,002,099,113

(Characters would represent bytes here, wouldn't they? My mind fails me sometimes, so maybe I'm spacing out on this.)

Time Started:
10:45:03 AM

Complete:
10:55:56 AM

So if I'm not spacing out here, the end result should be that the entire file will be indexed in under 30 minutes... That's a far cry from what I started with, and from what I expected. I was aiming to at least match the other programs I had tried, all of which took about 3 hours to build their own index files.
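(Roughly: about 10.0 GB read in just under 11 minutes is about 0.9 GB per minute, so the full 24.6 GB file should take around 27 minutes.)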

Well... now for the final test then..... Letting it run its course through the entire file.

Back in a little while.

~Garrett
'What you do not want done to yourself, do not do to others.' - Confucius (550 b.c. to 479 b.c.)

Garrett
Posts: 386
Joined: Sat Apr 08, 2006 8:15 am

Post by Garrett » Sat Sep 19, 2009 7:59 pm

Final run results!!!!

* drum roll please *

It took a mere 32 minutes total to completely index the entire 23 gig XML file. The end result is an index listing 9,013,937 entries, of which only around 3 million are actual articles; the rest are extraneous entries that serve as redirects to the articles, or are image or category references. In my final version I'll implement a filter to weed out as much of the extraneous stuff as possible. The index file weighs in at only 312 MB.

This result is all thanks to every one of you who helped me out on this project. Without your help, I'd still be sitting here for days indexing this monster.

Not only did you help me beat my goal of indexing in 3 hours, you beat my goal by 2 and a half hours!

I will eventually be making this entire project available in source for anyone else interested.

I still have many things to do, such as automatically downloading the 5 gig zipped file of the database and uncompressing it. I also want to see if I can add support for pausing the download.

I may also attempt to write code to convert the XML file into something more compact than 23 gigs.

I had already started on the search/viewer part of this, so I've already made a lot of headway on that side. Parsing the data from the articles and presenting it in a minimally formatted view is nearly done.

Again.... Thank you all so very much for the help you have given me on this. You guys seriously went out of your way to help me and I can't thank you enough for that. :-)
'What you do not want done to yourself, do not do to others.' - Confucius (550 b.c. to 479 b.c.)

bn
VIP Livecode Opensource Backer
Posts: 4174
Joined: Sun Jan 07, 2007 9:12 pm

Post by bn » Sat Sep 19, 2009 11:28 pm

Garrett,
glad it worked out for you.

I never thought you would get it down to 32 minutes on a 23 GB file.

Congratulations!

regards

Bernd
