Garret,
I think I know what happens:
When I download these pages from
http://en.wikipedia.org/wiki/Special:Export
AccessibleComputing
Anarchism
AfghanistanHistory
AfghanistanGeography
AfghanistanPeople|106366
AfghanistanCommunications
AfghanistanTransportations
AfghanistanMilitary
AfghanistanTransnationalIssues
I get an XML file whose line delimiter is ASCII 10 (linefeed), as on Unix.
In the chunk file you posted, the line delimiter is ASCII 13 followed by ASCII 10 (return + linefeed), as on Windows. That is why the position number is off. (On top of that, the first 3 chars are high-ASCII chars that don't show up in the direct download from Wikipedia, but they don't bother you.)
If, in the last working script I posted, you change the put line to:
Code: Select all
put the number of chars of varLineData + varCounter + 2 into varCounter
i.e. you add _2_ instead of _1_, then the script works and, for your whole chunk file, finds the exact position of each <title>xxxx
if it is a Unix file add 1
if it is a Windows file add 2
BUT if you do a _binary_ read on your Windows file (as I just found out), you only have to add 1 and the script works. Rev then apparently treats return and linefeed each as a line delimiter, which shows up in the script as an empty line, and the counter goes up by one char.
You can look at the ASCII values if you do a _binary_ read. In a plain (non-binary) read, Rev changes the line delimiter to ASCII 10 (linefeed) and gets rid of the ASCII 13 (return).
To check, make a stack and use the following script to import 2000 chars and look for ASCII 10; if an ASCII 13 comes right before it, it is a Windows file.
Code: Select all
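on mouseUp
   -- adjust this path to your own copy of the chunk file
   put "/Users/berndnig/Desktop/enwiki-20090902-pages-chunk.xml" into varFileToRead
   put empty into field 2
   -- binary read: no line-ending translation, so ASCII 13 stays visible
   open file varFileToRead for binary read
   read from file varFileToRead for 2000
   put it into temp
   close file varFileToRead
   lock screen
   -- list position, ASCII value and character for each char that was read
   repeat with i = 1 to the number of chars of temp
      if charToNum(char i of temp) = 10 then
         -- the linefeed itself ends the line in the field
         put i && ":" && charToNum(char i of temp) && char i of temp after field 2
      else
         put i && ":" && charToNum(char i of temp) && char i of temp & return after field 2
      end if
   end repeat
end mouseUp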
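If you would rather not eyeball the field, the same check could be wrapped in a small function. This is only a rough sketch: lineEndingIncrement is a name I made up (it is not in any of the scripts posted so far), and it assumes the first 2000 chars contain at least one line break. It returns the number to add per line for a plain (non-binary) read: 2 for a Windows file, 1 otherwise.

```livecode
-- hypothetical helper, not part of the scripts posted earlier
function lineEndingIncrement pFilePath
   -- binary read so that ASCII 13 is not stripped
   open file pFilePath for binary read
   read from file pFilePath for 2000
   put it into tSample
   close file pFilePath
   -- position of the first linefeed (ASCII 10) in the sample, 0 if none
   put offset(numToChar(10), tSample) into tPos
   if tPos > 1 then
      -- a return (ASCII 13) right before the linefeed means a Windows file
      if charToNum(char (tPos - 1) of tSample) = 13 then
         return 2
      end if
   end if
   -- Unix file, or no linefeed found in the sample
   return 1
end lineEndingIncrement
```

You would call it once before the indexing loop, keep the result in a variable, and add that variable instead of the fixed 1 or 2.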
So the script works; it just has to take the line delimiter into account, as you suspected earlier. If you go for a binary read, you do not have to worry about what the delimiter is, and it is even a little faster: you always add just 1 for each line,
as in:
Code: Select all
open file varFileToRead for binary read
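To show the "always add 1" idea in one piece, here is a minimal sketch of an indexing loop on binary data. It is not the script from the earlier posts: indexTitles and the t-variables are made-up names, and it assumes the whole file fits into memory, which may not hold for a large chunk file. It uses the "binfile:" URL form, which reads the file without any line-ending translation.

```livecode
-- hypothetical sketch, assuming the file is small enough to load at once
function indexTitles pFilePath
   -- binfile keeps every byte, so positions match the file on disk
   put URL ("binfile:" & pFilePath) into tData
   put 0 into tCounter -- number of chars before the current line
   put empty into tIndex
   repeat for each line tLine in tData
      put offset("<title>", tLine) into tPos
      if tPos > 0 then
         -- position of this <title> counted from the start of the file
         put (tCounter + tPos) & return after tIndex
      end if
      -- add 1 for the linefeed; in a Windows file the return (ASCII 13)
      -- is still part of tLine and is therefore already counted
      add (the number of chars of tLine) + 1 to tCounter
   end repeat
   return tIndex
end indexTitles
```

The result is one position per line, for every line that contains <title>, and the same code should apply to Unix and Windows files alike.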
I hope I didn't confuse you as much as this confused me while looking into it.
You should hopefully be able to index your file now.
regards
Bernd